![film](https://raw.githubusercontent.com/Mantvydas-data/Programming-for-DA-2021-Proj/main/data/denise-jans-tV80374iytg-unsplash.PNG)

Photo by <a href="https://unsplash.com/@dmjdenise?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Denise Jans</a> on <a href="https://unsplash.com/s/photos/film?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a>


# Intro
***
The aim of this project is to simulate a dataset that represents a measurable real-life phenomenon. The simulated dataset must contain at least one hundred data points across at least four different variables. The synthesised data should have properties, and relationships among its variables, that are close to those observed in real life.

# Project description plan
***
During the recent worldwide lockdowns, subscribing to one of the many movie/TV streaming services became increasingly popular. These platforms experienced unprecedented month-on-month growth, and streaming became a daily habit for many people. This project simulates a usage log for a fictional movie streaming platform, 'MFLIX', containing dates, movie stream counts per hour, movie category, user ID, user gender, and user name and surname. Different methods learned during this semester will be used in this project.

# Generating data

In [1]:
# Importing required packages under their conventional abbreviations
import matplotlib.pyplot as plt # Plotting
import numpy as np # Mathematical arrays
import pandas as pd # Work with data frames

# Importing random generator and assigning it to rng variable
from numpy.random import default_rng

rng = default_rng()

# Dates and times by hour
A list of dates with hourly timestamps for December 2021.

In [2]:
# Generating a range of dates by the hour

# Count of hourly observations for month of December
periods = 31*24

# Hourly date and time for December
dfdate = pd.date_range('2021-12-01', freq='H', periods=periods)

dfdate


DatetimeIndex(['2021-12-01 00:00:00', '2021-12-01 01:00:00',
               '2021-12-01 02:00:00', '2021-12-01 03:00:00',
               '2021-12-01 04:00:00', '2021-12-01 05:00:00',
               '2021-12-01 06:00:00', '2021-12-01 07:00:00',
               '2021-12-01 08:00:00', '2021-12-01 09:00:00',
               ...
               '2021-12-31 14:00:00', '2021-12-31 15:00:00',
               '2021-12-31 16:00:00', '2021-12-31 17:00:00',
               '2021-12-31 18:00:00', '2021-12-31 19:00:00',
               '2021-12-31 20:00:00', '2021-12-31 21:00:00',
               '2021-12-31 22:00:00', '2021-12-31 23:00:00'],
              dtype='datetime64[ns]', length=744, freq='H')

# Streaming counts - depend on time of day
***
To be realistic, the number of streams per hour needs to follow the time of day, so that peak and off-peak periods are simulated. The time series will be split by time of day and a random stream count will be generated for each part: few streams during late night and early morning hours, medium usage during the day, and a peak in the evening, which the code below defines as the hourly timestamps from 18:00 to 02:00. Streams will be 30% higher on Fridays and 40% higher on Saturdays and Sundays, compared with weekday usage.

In [3]:
# Creating an H column to number each hourly sample in the date range
df = pd.DataFrame({'H': np.arange(len(dfdate))}, index=dfdate)

# Splitting the time series by time of day into separate dataframes
night = pd.DataFrame(df.between_time("02:01","06:00"))
morning = pd.DataFrame(df.between_time("06:01","12:00"))
afternoon = pd.DataFrame(df.between_time("12:01","17:00"))
evening = pd.DataFrame(df.between_time("17:01","02:00"))

night

Unnamed: 0,H
2021-12-01 03:00:00,3
2021-12-01 04:00:00,4
2021-12-01 05:00:00,5
2021-12-01 06:00:00,6
2021-12-02 03:00:00,27
...,...
2021-12-30 06:00:00,702
2021-12-31 03:00:00,723
2021-12-31 04:00:00,724
2021-12-31 05:00:00,725
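
The way `between_time` treats its end points explains why the night frame above starts at 03:00 and ends at 06:00. The following is a minimal sketch on toy data (the `toy` frame and the two-day range are assumptions made for illustration, not part of the project data). It shows that both end points are included and that a start time later than the end time wraps around midnight.

```python
import numpy as np
import pandas as pd

# Two days of hourly timestamps (toy data, not the notebook's df)
idx = pd.date_range("2021-12-01", periods=48, freq="h")
toy = pd.DataFrame({"H": np.arange(len(idx))}, index=idx)

# Both end points are included; starting at "02:01" leaves out the 02:00 timestamp
night = toy.between_time("02:01", "06:00")
# A start time later than the end time wraps around midnight
evening = toy.between_time("17:01", "02:00")

night_hours = sorted(night.index.hour.unique().tolist())
evening_hours = sorted(evening.index.hour.unique().tolist())
print(night_hours)    # [3, 4, 5, 6]
print(evening_hours)  # [0, 1, 2, 18, 19, 20, 21, 22, 23]
```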


In [4]:
# Setting the time of day classification for each dataframe
night['periods'] = 'night'
morning['periods'] = 'morning'
afternoon['periods'] = 'afternoon'
evening['periods'] = 'evening'

In [5]:
# Adding date-time as value instead of index
df['udates'] = df.index
# Populating weekday numbers (Monday=0 ... Sunday=6) so weekend usage can be increased
df['weekday'] = df.index.weekday

In [6]:
# Setting the df index to H so the data can be joined on it
df.set_index('H', inplace=True)
df

H,udates,weekday
0,2021-12-01 00:00:00,2
1,2021-12-01 01:00:00,2
2,2021-12-01 02:00:00,2
3,2021-12-01 03:00:00,2
4,2021-12-01 04:00:00,2
...,...,...
739,2021-12-31 19:00:00,4
740,2021-12-31 20:00:00,4
741,2021-12-31 21:00:00,4
742,2021-12-31 22:00:00,4



In [8]:
# Random stream counts for different times of the day
# (endpoint=True makes the upper bound inclusive, matching the ranges below)
# Night streams 0 to 3 per hour
night['streams'] = rng.integers(0, 3, len(night), endpoint=True)
# Morning streams 2 to 10 per hour
morning['streams'] = rng.integers(2, 10, len(morning), endpoint=True)
# Afternoon streams 8 to 20 per hour
afternoon['streams'] = rng.integers(8, 20, len(afternoon), endpoint=True)
# Evening streams 20 to 35 per hour
evening['streams'] = rng.integers(20, 35, len(evening), endpoint=True)

In [9]:
evening

Unnamed: 0,H,periods,streams
2021-12-01 00:00:00,0,evening,33
2021-12-01 01:00:00,1,evening,31
2021-12-01 02:00:00,2,evening,29
2021-12-01 18:00:00,18,evening,31
2021-12-01 19:00:00,19,evening,28
...,...,...,...
2021-12-31 19:00:00,739,evening,20
2021-12-31 20:00:00,740,evening,20
2021-12-31 21:00:00,741,evening,31
2021-12-31 22:00:00,742,evening,26


In [10]:
# Joining data
# Setting index to H for all
night.set_index('H', inplace=True)
morning.set_index('H', inplace=True)
afternoon.set_index('H', inplace=True)
evening.set_index('H', inplace=True)

In [11]:
# Concatenating the time of day dataframes so they can be joined to df
# (pd.concat replaces DataFrame.append, which was removed in pandas 2.0)
night = pd.concat([night, morning, afternoon, evening])
night

H,periods,streams
3,night,2
4,night,0
5,night,1
6,night,2
27,night,2
...,...,...
739,evening,20
740,evening,20
741,evening,31
742,evening,26


In [12]:
# Joining values all together on preset index H
df = df.join(night, how='outer')

In [13]:
# Increasing views by 30% on Fridays and by 40% on Saturdays and Sundays
# Also rounding to the nearest whole number
df['streams'] = np.rint(np.where(df['weekday'] == 4, df['streams'] * 1.3, df['streams']))
df['streams'] = np.rint(np.where((df['weekday'] == 5) | (df['weekday'] == 6), df['streams'] * 1.4, df['streams']))

In [14]:
df

H,udates,weekday,periods,streams
0,2021-12-01 00:00:00,2,evening,33.0
1,2021-12-01 01:00:00,2,evening,31.0
2,2021-12-01 02:00:00,2,evening,29.0
3,2021-12-01 03:00:00,2,night,2.0
4,2021-12-01 04:00:00,2,night,0.0
...,...,...,...,...
739,2021-12-31 19:00:00,4,evening,26.0
740,2021-12-31 20:00:00,4,evening,26.0
741,2021-12-31 21:00:00,4,evening,40.0
742,2021-12-31 22:00:00,4,evening,34.0
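
The steps in this section can also be condensed into a single vectorised pass. The sketch below is one possible alternative, not the method used in the notebook: the `sim` frame, the hour-to-period mapping, the fixed seed and the inclusive treatment of the stream ranges are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)  # seed is an assumption, used only for repeatability

idx = pd.date_range("2021-12-01", periods=31 * 24, freq="h")
sim = pd.DataFrame({"udates": idx, "weekday": idx.weekday, "hour": idx.hour})

# Same hour boundaries as the between_time() calls: every other hour is "evening"
period_of_hour = {h: "night" for h in range(3, 7)}
period_of_hour.update({h: "morning" for h in range(7, 13)})
period_of_hour.update({h: "afternoon" for h in range(13, 18)})
sim["periods"] = sim["hour"].map(period_of_hour).fillna("evening")

# Stream ranges per period, treated as inclusive here (an assumption)
low = {"night": 0, "morning": 2, "afternoon": 8, "evening": 20}
high = {"night": 3, "morning": 10, "afternoon": 20, "evening": 35}
sim["streams"] = rng.integers(
    sim["periods"].map(low).to_numpy(),
    sim["periods"].map(high).to_numpy(),
    endpoint=True,
)

# 30% uplift on Fridays (4), 40% on Saturdays (5) and Sundays (6)
multiplier = sim["weekday"].map({4: 1.3, 5: 1.4, 6: 1.4}).fillna(1.0)
sim["streams"] = np.rint(sim["streams"] * multiplier).astype(int)

print(sim.groupby("periods")["streams"].agg(["min", "max", "mean"]))
```

Keeping the bounds in dictionaries means a range can be changed in one place without touching the sampling code.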


# Movie/TV Series Category - Random selection from available list
***
A movie/TV series category will be randomly assigned to each view from a list variable containing the categories available on the platform.

In [15]:
# Setting available movie/TV series categories
categories = ['Action', 'Comedy', 'Drama', 'Fantasy', 'Horror', 'Mystery', 'Romance', 'Thriller', 'Western']

#df['categories'] = rng.choice(categories,size=(len(df)))
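
A minimal sketch of the random category assignment described above, using `rng.choice`. The number of views, the seed and the popularity weights are hypothetical values chosen for illustration; only the category list comes from the notebook.

```python
import numpy as np

rng = np.random.default_rng(seed=7)  # seed chosen only for repeatability

categories = ['Action', 'Comedy', 'Drama', 'Fantasy', 'Horror',
              'Mystery', 'Romance', 'Thriller', 'Western']

n_views = 20  # hypothetical number of views to label

# Every category equally likely
uniform_pick = rng.choice(categories, size=n_views)

# Hypothetical popularity weights, normalised so they sum to 1
weights = np.array([3, 3, 2, 1, 1, 1, 2, 2, 1], dtype=float)
weighted_pick = rng.choice(categories, size=n_views, p=weights / weights.sum())

print(uniform_pick[:5])
print(weighted_pick[:5])
```

Passing `p` is optional; it is only worth adding if some categories should appear more often than others in the simulated log.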


# User ID

Each user ID will be generated by choosing two random uppercase letters followed by three random digits. User IDs will be assigned to 50 active users from Ireland.

In [16]:
# Importing the Python standard library modules string and random
import string
import random

# Building 50 user IDs: two random uppercase letters followed by three random digits
uid = []
for _ in range(50):
    letters = ''.join(random.choices(string.ascii_uppercase, k=2))
    digits = ''.join(random.choices(string.digits, k=3))
    uid.append(letters + digits)
print(len(uid))

df2 = pd.DataFrame({'N': np.arange(50)})

df2['uid'] = uid
df2
# df['categories'] = rng.choice(categories,size=(len(df)))

50


Unnamed: 0,N,uid
0,0,TT908
1,1,JZ732
2,2,PB973
3,3,AN519
4,4,QL830
5,5,UR861
6,6,IE957
7,7,EB001
8,8,RJ138
9,9,YU976
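
Random IDs of this shape can occasionally repeat. If the IDs need to be distinct, one option is to collect them in a set until enough have been produced. This is a sketch under that assumption; the helper name `make_user_ids` and the `seed` parameter are hypothetical and not part of the notebook.

```python
import random
import string


def make_user_ids(n, seed=None):
    """Return n distinct IDs made of two uppercase letters and three digits."""
    gen = random.Random(seed)  # private generator, so the global random state is untouched
    ids = set()
    while len(ids) < n:
        letters = ''.join(gen.choices(string.ascii_uppercase, k=2))
        digits = ''.join(gen.choices(string.digits, k=3))
        ids.add(letters + digits)  # a set silently ignores repeats
    return sorted(ids)


unique_uid = make_user_ids(50, seed=0)
print(len(unique_uid), len(set(unique_uid)))
```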


# User gender

A gender will be assigned to each user ID; it will determine which list the user's first name is drawn from.

In [17]:
gender = ['female', 'male']
df2['ugender'] = rng.choice(gender ,size=(len(df2)))
df2

Unnamed: 0,N,uid,ugender
0,0,TT908,female
1,1,JZ732,male
2,2,PB973,male
3,3,AN519,female
4,4,QL830,male
5,5,UR861,female
6,6,IE957,male
7,7,EB001,male
8,8,RJ138,female
9,9,YU976,female


In [18]:
df2['ugender'].value_counts()

female    26
male      24
Name: ugender, dtype: int64

# User name and surname - chosen from a list depending on gender
***
User names will be selected from a list of popular Irish names, depending on the gender variable assigned previously.
Names will be retrieved from the [Irishcentral](https://www.irishcentral.com/roots/100-irish-language-first-names-meanings) website and extracted using regular expressions. If this is not successful, a manual list in CSV/Excel format will be prepared and read for random selection.

To reproduce the names file in text format:
1. Open the website link.
2. Press Control+P (opens a printable version with fewer ads).
3. Press Control+A followed by Control+C.
4. Open a new text document in an editor of your choice.
5. Press Control+V inside the file.
6. Save the text file in the project's data folder as names.txt.



In [19]:
# Regular expressions.
import re

In [20]:
with open('data/names.txt', 'r', encoding="utf8" ) as f2:
    data = f2.read()
    print(data)

12/30/21, 9:25 PM 100 Irish first names and their beautiful meanings
https://www.irishcentral.com/roots/100-irish-language-first-names-meanings 1/12
100 Irish first names and their beautiful meanings
Need to find an Irish name quick? Here are the top 100.
Kayla Hertz @IrishCentral May 02, 2021
Have a wee one on the way? Check out these Irish names. GETTY
Looking for an Irish name for a little bundle of joy on the way or justinspired by the beauty of
Irish names and their meanings? Here are 100 ideas for you!
Here are today's 100 most popular Irish language baby names, with their meanings and
pronunciations - 50 girl names and 50 boy names. See if yours made the cut, or peruse the list for
some inspiration!
Irish Girls Names:
1. Aoife (ee-fa)
This name means beautiful, radiant or joyful, and likely derives from the Gaelic word ‘aoibh’
meaning ‘beauty’ or ‘pleasure.’ In Irish mythology, Aoife is known as the greatest woman warrior in
the world. She gave birth to the mythological hero Cuc

In [21]:
# Compile the regular expression for the matching pattern.
re_names = re.compile(r'[0-9]+\.\s[A-Za-z]+')

In [22]:
result = re.findall(re_names, data)
result


['100.\nKayla',
 '1. Aoife',
 '2. Caoimhe',
 '3. Saoirse',
 '4. Ciara',
 '5. Niamh',
 '6. Roisin',
 '7. Cara',
 '8. Clodagh',
 '9. Aisling',
 '10. Eabha',
 '11. Aoibhinn',
 '12. Aine',
 '13. Sadhbh',
 '14. Aoibheann',
 '15. Fiadh',
 '16. Aoibhe',
 '17. Laoise',
 '18. Eimear',
 '19. Orla',
 '20. Meabh',
 '21. Shauna',
 '22. Shannon',
 '23. Sinead',
 '24. Grainne',
 '25. Kayleigh',
 '26. Fiona',
 '27. Emer',
 '28. Siobhan',
 '29. Ailbhe',
 '30. Mairead',
 '31. Cliodhna',
 '32. Imogen',
 '33. Orlaith',
 '34. Caragh',
 '35. Aoibh',
 '36. Blathnaid',
 '37. Cadhla',
 '38. Dearbhla',
 '39. Bronagh',
 '40. Riona',
 '41. Sorcha',
 '42. Nuala',
 '43. Eireann',
 '44. Oonagh',
 '45. Sile',
 '46. Muireann',
 '47. Nessa',
 '48. Fionnuala',
 '49. Deirdre',
 '50. Eithne',
 '1. Conor',
 '2. Sean',
 '3. Oisin',
 '4. Patrick',
 '5. Cian',
 '6. Liam',
 '7. Darragh',
 '8. Cillian',
 '9. Fionn',
 '10. Finn',
 '11. Rian',
 '12. Eoin',
 '13. Oscar',
 '14. Callum',
 '15. Aidan',
 '16. Tadhg',
 '17. Cathal',
 '

In [23]:
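# Storing the matches in their own dataframe so the streams dataframe df is not overwritten
names_df = pd.DataFrame({'names': result})
print(names_df)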

                   names
0            100.\nKayla
1               1. Aoife
2             2. Caoimhe
3             3. Saoirse
4               4. Ciara
..                   ...
97            47. Aodhan
98           48. Tiernan
99            49. Daithi
100           50. Fergal
101  2014.\nIrishCentral

[102 rows x 1 columns]
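
The first and last rows above (`100.\nKayla` and `2014.\nIrishCentral`) are false matches: `\s` also matches a newline, so a sentence ending in a number followed by a new line looks like a list entry. A stricter pattern is sketched below on a short toy text. The toy text only imitates the shape of names.txt, the "Irish Boys Names:" heading in it is assumed, and splitting on the numbering restart is one possible approach rather than the notebook's method.

```python
import re

# Toy text in the shape of names.txt (the boys heading is an assumption)
sample = (
    "Need to find an Irish name quick? Here are the top 100.\n"
    "Kayla Hertz @IrishCentral May 02, 2021\n"
    "Irish Girls Names:\n"
    "1. Aoife (ee-fa)\n"
    "2. Caoimhe\n"
    "Irish Boys Names:\n"
    "1. Conor\n"
    "2. Sean\n"
)

# ^ with re.MULTILINE anchors the match at the start of a line;
# a literal space instead of \s cannot span a newline
re_strict = re.compile(r'^([0-9]+)\. ([A-Za-z]+)', re.MULTILINE)

girls, boys = [], []
current = None
for number, name in re_strict.findall(sample):
    if number == '1':
        # Numbering restarts at 1 for each list; the girls list comes first
        current = girls if current is None else boys
    if current is not None:
        current.append(name)

print(girls)  # ['Aoife', 'Caoimhe']
print(boys)   # ['Conor', 'Sean']
```

Capturing the number and the name as separate groups also removes the need to strip the "1. " prefix afterwards.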


# Joining all the data

The data will be joined together using one of the variables as a shared index.
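
One way the pieces could be combined into a usage log is sketched below. It assumes one row per individual stream, with a user drawn at random for each row and the user attributes merged in on `uid`. The tiny `hourly` and `users` frames are toy stand-ins for the notebook's `df` and `df2`, and the whole layout is an assumption about the intended result.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=1)  # seed used only for repeatability

# Toy stand-ins for the hourly streams frame and the users frame
hourly = pd.DataFrame({
    "udates": pd.date_range("2021-12-01 18:00", periods=3, freq="h"),
    "streams": [2, 0, 3],
})
users = pd.DataFrame({
    "uid": ["TT908", "JZ732", "PB973"],
    "ugender": ["female", "male", "male"],
})

# Repeat each hourly row once per stream, giving one row per stream
repeated = hourly.index.repeat(hourly["streams"].to_numpy())
log = hourly.loc[repeated, ["udates"]].reset_index(drop=True)

# Draw a random user for every stream, then attach that user's attributes
log["uid"] = rng.choice(users["uid"].to_numpy(), size=len(log))
log = log.merge(users, on="uid", how="left")

print(log)
```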

# Save as CSV/xlsx

In [24]:
df2.to_excel('data/check.xlsx')
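
For the CSV option named in the heading, a minimal sketch follows. It writes to an in-memory buffer so it runs without touching the disk; in the project a file path such as `'data/check.csv'` (a hypothetical name) would take the buffer's place. The `users` frame is a toy stand-in for `df2`.

```python
import io

import pandas as pd

# Toy stand-in for df2
users = pd.DataFrame({
    "N": [0, 1, 2],
    "uid": ["TT908", "JZ732", "PB973"],
    "ugender": ["female", "male", "male"],
})

buffer = io.StringIO()
users.to_csv(buffer, index=False)  # index=False avoids an extra unnamed column on reload

buffer.seek(0)
restored = pd.read_csv(buffer)
print(restored)
```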

# Data overview and analysis/plots

In [25]:
# Descriptive statistics
df.describe()

In [None]:
# TODO: KDE line plot or time series line plot of views by time of the week

In [None]:
# TODO: pie plot of the most streamed movie categories by gender

In [None]:
# TODO: average weekday views vs average weekend views
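
A minimal sketch of how the weekday and weekend averages could be compared with a `groupby`. The four-day `toy` frame and its stream counts are made-up values for illustration, and Friday is counted as a weekday here, which is an assumption.

```python
import pandas as pd

# Toy daily totals: 2021-12-03 is a Friday, followed by Saturday, Sunday and Monday
toy = pd.DataFrame({
    "udates": pd.date_range("2021-12-03", periods=4, freq="D"),
    "streams": [26, 28, 42, 20],
})

# weekday numbers 5 and 6 are Saturday and Sunday
toy["day_type"] = toy["udates"].dt.weekday.map(
    lambda d: "weekend" if d >= 5 else "weekday"
)

avg = toy.groupby("day_type")["streams"].mean()
print(avg)
```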

In [None]:
# TODO: bar plot of one week's streams
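
A sketch of the one-week bar plot on toy data. The random counts, the seed and the chosen week are assumptions, and the `Agg` backend line is only there so the sketch runs without a display; it should be dropped inside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (assumption); remove this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=3)  # seed used only for repeatability

# One week of toy hourly stream counts, Monday 2021-12-06 to Sunday 2021-12-12
idx = pd.date_range("2021-12-06", periods=7 * 24, freq="h")
toy = pd.DataFrame({"streams": rng.integers(0, 35, len(idx))}, index=idx)

# Total streams per calendar day
daily = toy["streams"].resample("D").sum()

fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(daily.index.strftime("%a %d").tolist(), daily.to_numpy())
ax.set_xlabel("Day")
ax.set_ylabel("Streams per day")
ax.set_title("Streams for one week (toy data)")
fig.tight_layout()
```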

# References

https://stackoverflow.com/questions/41598916/resize-the-image-in-jupyter-notebook-using-markdown


https://stackoverflow.com/questions/56310849/generate-random-timeseries-data-with-dates

https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html


https://numpy.org/doc/stable/reference/generated/numpy.rint.html


https://stackoverflow.com/questions/28009370/get-weekday-day-of-week-for-datetime-column-of-dataframe

https://pandas.pydata.org/pandas-docs/version/0.17.1/merging.html

https://www.stackvidhya.com/pandas-iterate-over-rows/

https://datagy.io/pandas-conditional-column/

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.itertuples.html

https://www.geeksforgeeks.org/how-to-generate-a-random-letter-in-python/

https://stackoverflow.com/questions/2823316/generate-a-random-letter-in-python

https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html

https://www.tutorialspoint.com/python/string_decode.html