### Prepping Data Challenge: The Price of Streaming (Week 17)
 This challenge involves bringing together two datasets that have different levels of aggregation.

### Requirements
 - Input the data
 - Check the location field for spelling errors
   - Data roles may help you identify these
 - Fix the date fields so they are recognised as date data types
 - Aggregate the data to find the total duration of each streaming session (as identified by the timestamp)
 - We need to update the content_type field:
   - For London, Cardiff and Edinburgh, the content_type is defined as "Primary"
   - For other locations, maintain the "Preserved" content_type and update all others to have a "Secondary" content_type
 - To join to the Avg Pricing Table, we need to work out when each user's first streaming session was. However, it's a little more complex than that.
   - For "Primary" content, we take the overall minimum streaming month, ignoring location
   - For all other content, we work out the minimum active month for each user, in each location and for each content_type
 - We're now ready to join to the Avg Pricing Table
 - For "Preserved" content, we manually input the Avg Price as £14.98
 - Output the data

In [1]:
import pandas as pd
import numpy as np

In [2]:
# Input the data.
# Fix the date fields so they are recognised as date data types
with pd.ExcelFile('2022W17 Input.xlsx') as xlsx:
    stream = pd.read_excel(xlsx, 'Streaming',parse_dates=['t'])\
               .rename(columns={'t':'Timestamp'})
    price = pd.read_excel(xlsx, 'Avg Pricing', parse_dates = ['Month'])\
               .rename(columns={'Content_Type':'content_type'})
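
If the timestamps arrive as text rather than being parsed on read, they can be converted afterwards. The sketch below is only illustrative: the `raw` frame and its two ISO-formatted strings are made-up sample values, not rows from the workbook.

```python
import pandas as pd

# Made-up sample: timestamps stored as text in a column named 't'
raw = pd.DataFrame({'t': ['2021-01-05T21:44:55Z', '2021-02-06T14:27:36Z']})

# Parse the strings into timezone-aware (UTC) datetimes
raw['Timestamp'] = pd.to_datetime(raw['t'], utc=True)
raw = raw.drop(columns='t')

print(raw.dtypes)
```

Parsing with `utc=True` gives a timezone-aware column, which matches the `+00:00` suffix visible in the notebook's output.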

In [3]:
stream.head()

| | userID | Timestamp | location | content_type | duration |
|---|---|---|---|---|---|
| 0 | 3 | 2021-01-05 21:44:55+00:00 | Essex | Preserved | 47 |
| 1 | 3 | 2021-01-05 21:44:55+00:00 | Essex | Preserved | 29 |
| 2 | 3 | 2021-01-05 21:44:55+00:00 | Essex | Preserved | 31 |
| 3 | 3 | 2021-01-05 21:44:55+00:00 | Essex | Preserved | 4 |
| 4 | 3 | 2021-01-05 21:44:55+00:00 | Essex | Preserved | 8 |


In [4]:
#Check the location field for spelling errors
stream['location'].unique()

array(['Essex', 'Plymouth', 'Edinurgh', 'Newcastle', 'Cardiff', 'London',
       'Manchester', 'Cornwall', 'Nottingham', 'Perth', 'Glasgow',
       'Norfolk', 'Bristol', 'Kent'], dtype=object)

In [5]:
# Map each misspelling to its correction
spellcheck = {'Edinurgh': 'Edinburgh'}
stream['location'] = stream['location'].replace(spellcheck)

In [6]:
stream['location'].unique()

array(['Essex', 'Plymouth', 'Edinburgh', 'Newcastle', 'Cardiff', 'London',
       'Manchester', 'Cornwall', 'Nottingham', 'Perth', 'Glasgow',
       'Norfolk', 'Bristol', 'Kent'], dtype=object)
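
Spotting the misspelling by eye works for fourteen values. As a rough stand-in for Tableau Prep's data roles, a fuzzy match against a reference list can flag candidates automatically. This is a minimal sketch: `known_locations`, the `sample` series and the `suggest_fixes` helper are all hypothetical names introduced here, and the 0.8 cutoff is an arbitrary choice.

```python
import difflib
import pandas as pd

# Hypothetical reference list of valid locations (an assumption for this sketch)
known_locations = ['Essex', 'Plymouth', 'Edinburgh', 'Newcastle', 'Cardiff', 'London']

# Made-up sample containing the misspelling seen in the data
sample = pd.Series(['Essex', 'Edinurgh', 'London', 'Cardiff'], name='location')

def suggest_fixes(values, valid, cutoff=0.8):
    """Map each value not in `valid` to its closest valid match, if one is close enough."""
    fixes = {}
    for value in pd.unique(values):
        if value in valid:
            continue
        matches = difflib.get_close_matches(value, valid, n=1, cutoff=cutoff)
        if matches:
            fixes[value] = matches[0]
    return fixes

fixes = suggest_fixes(sample, known_locations)
cleaned = sample.replace(fixes)
print(fixes)
```

The suggested mapping is worth reviewing before applying it, since a fuzzy match can pair two genuinely different place names.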

In [7]:
#Aggregate the data to find the total duration of each streaming session (as identified by the timestamp)
stream['total_duration'] = stream.groupby(['Timestamp','userID','location'])['duration'].transform('sum')
stream = stream[['userID','Timestamp','location','content_type','total_duration']].rename(columns={'total_duration':'duration'})
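
An alternative to `transform('sum')` followed by a later de-duplication is to aggregate directly, which returns one row per session in a single step. The sketch below uses a made-up `sample` frame and assumes content_type is constant within a session, so it can be included in the grouping keys.

```python
import pandas as pd

# Made-up sample: two records share a session timestamp, two do not
sample = pd.DataFrame({
    'userID': [3, 3, 3, 2],
    'Timestamp': pd.to_datetime(['2021-01-05 21:44:55', '2021-01-05 21:44:55',
                                 '2021-01-05 22:28:58', '2021-02-06 14:27:36'], utc=True),
    'location': ['Essex', 'Essex', 'Essex', 'Plymouth'],
    'content_type': ['Preserved', 'Preserved', 'Preserved', 'Secondary'],
    'duration': [47, 29, 190, 198],
})

# One row per session, with the durations summed
sessions = (sample
            .groupby(['userID', 'Timestamp', 'location', 'content_type'], as_index=False)['duration']
            .sum())
print(sessions)
```

Note that `groupby` sorts by the grouping keys by default, so the row order differs from the notebook's approach, which keeps the original order.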

In [8]:
#We need to update the content_type field:
#For London, Cardiff and Edinburgh, the content_type is defined as "Primary"
#For other locations, maintain the "Preserved" content_type and update all others to have a "Secondary" content_type
p1city = ['London','Cardiff','Edinburgh']
# Outside the Primary cities, keep rows already labelled 'Preserved' and relabel everything else
stream['content_type'] = np.where(stream['location'].isin(p1city), 'Primary',
                                  np.where(stream['content_type'] == 'Preserved', 'Preserved', 'Secondary'))
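
With more than two outcomes, `np.select` can be easier to read than nested `np.where` calls, because conditions are listed in priority order. The sketch below follows the wording of the requirements, keeping "Preserved" based on the existing content_type value. The `sample` frame and the 'Comedy' and 'Drama' labels are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up sample; 'Comedy' and 'Drama' are placeholder labels, not values from the data
sample = pd.DataFrame({
    'location': ['London', 'Essex', 'Plymouth', 'Cardiff'],
    'content_type': ['Preserved', 'Preserved', 'Comedy', 'Drama'],
})

primary_cities = ['London', 'Cardiff', 'Edinburgh']

# Conditions are checked in order; the first match wins
conditions = [
    sample['location'].isin(primary_cities),
    sample['content_type'] == 'Preserved',
]
choices = ['Primary', 'Preserved']

sample['content_type'] = np.select(conditions, choices, default='Secondary')
print(sample)
```

Because the location test comes first, a "Preserved" row in London still becomes "Primary".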

In [9]:
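# transform('sum') kept one row per original record, so collapse the repeats into one row per session
stream.drop_duplicates(inplace=True)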

In [10]:
#To join to the Avg Pricing Table, we need to work out when each user's first streaming session was.
#For "Primary" content, we take the overall minimum streaming month, ignoring location
#For all other content, we work out the minimum active month for each user, in each location and for each content_type
def to_month(ts):
    # Drop the timezone, then truncate to the first day of the month
    return ts.dt.tz_localize(None).dt.to_period('M').dt.to_timestamp()

first_overall = to_month(stream.groupby('userID')['Timestamp'].transform('min'))
first_by_group = to_month(stream.groupby(['userID','location','content_type'])['Timestamp'].transform('min'))
stream['Month'] = first_overall.where(stream['content_type'] == 'Primary', first_by_group)

In [11]:
stream.head()

| | userID | Timestamp | location | content_type | duration | Month |
|---|---|---|---|---|---|---|
| 0 | 3 | 2021-01-05 21:44:55+00:00 | Essex | Preserved | 137 | 2020-12-01 |
| 6 | 3 | 2021-01-05 22:28:58+00:00 | Essex | Preserved | 190 | 2020-12-01 |
| 12 | 3 | 2021-01-05 21:22:46+00:00 | Essex | Preserved | 269 | 2020-12-01 |
| 18 | 3 | 2021-01-05 22:31:00+00:00 | Essex | Preserved | 133 | 2020-12-01 |
| 24 | 3 | 2021-01-05 22:36:07+00:00 | Essex | Preserved | 196 | 2020-12-01 |
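
The two levels of "first month" are easier to see on a tiny example. The sketch below uses a made-up `sample` frame and a hypothetical `month_start` helper. Like the notebook, it takes the user-level minimum across all of that user's sessions for "Primary" rows, and the per-user, per-location, per-content_type minimum otherwise.

```python
import pandas as pd

# Made-up sample: one user streaming in three locations
sample = pd.DataFrame({
    'userID': [1, 1, 1, 1],
    'location': ['London', 'Essex', 'Essex', 'Kent'],
    'content_type': ['Primary', 'Secondary', 'Secondary', 'Secondary'],
    'Timestamp': pd.to_datetime(['2021-03-10 10:00', '2021-01-20 09:00',
                                 '2021-02-02 12:00', '2021-04-15 18:30'], utc=True),
})

def month_start(ts):
    """Drop the timezone and truncate each timestamp to the first day of its month."""
    return ts.dt.tz_localize(None).dt.to_period('M').dt.to_timestamp()

# Earliest session per user, ignoring location
per_user = month_start(sample.groupby('userID')['Timestamp'].transform('min'))

# Earliest session per user, location and content_type
per_group = month_start(
    sample.groupby(['userID', 'location', 'content_type'])['Timestamp'].transform('min'))

# Primary rows take the user-level month; all other rows take the group-level month
sample['Month'] = per_user.where(sample['content_type'] == 'Primary', per_group)
print(sample[['location', 'content_type', 'Month']])
```

Here the London "Primary" row is assigned January, the user's earliest month anywhere, while the Kent row keeps its own first month of April.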


In [12]:
price.head()

| | Month | Avg_Price | content_type |
|---|---|---|---|
| 0 | 2020-08-01 | 20.92 | Primary |
| 1 | 2020-09-01 | 22.9 | Primary |
| 2 | 2020-10-01 | 23.41 | Primary |
| 3 | 2020-11-01 | 20.66 | Primary |
| 4 | 2020-12-01 | 19.61 | Primary |


In [13]:
#We're now ready to join to the Avg Pricing Table
output = pd.merge(stream, price, on = ['Month', 'content_type'], how = 'left')
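
A left join silently leaves `Avg_Price` empty wherever no pricing row matches, so it can be worth checking which rows went unmatched. The sketch below uses made-up `stream_sample` and `price_sample` frames with illustrative values. It relies on the `indicator` and `validate` options of `DataFrame.merge`.

```python
import pandas as pd

# Made-up session-level sample (values are illustrative only)
stream_sample = pd.DataFrame({
    'userID': [5, 2, 3],
    'Month': pd.to_datetime(['2021-11-01', '2021-02-01', '2020-12-01']),
    'content_type': ['Primary', 'Secondary', 'Preserved'],
})

# Made-up pricing sample with no 'Preserved' rows
price_sample = pd.DataFrame({
    'Month': pd.to_datetime(['2021-11-01', '2021-02-01']),
    'content_type': ['Primary', 'Secondary'],
    'Avg_Price': [19.49, 18.58],
})

joined = stream_sample.merge(
    price_sample,
    on=['Month', 'content_type'],
    how='left',
    validate='many_to_one',  # raise if the pricing table has duplicate keys
    indicator=True,          # add a '_merge' column showing where each row came from
)

# Rows with no matching price
unmatched = joined[joined['_merge'] == 'left_only']
print(unmatched)
```

In this sample only the "Preserved" row is unmatched, which is the case the manually entered price is meant to cover.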

In [14]:
#For "Preserved" content, we manually input the Avg Price as £14.98
output['Avg_Price'] = np.where(output['content_type']== 'Preserved', 14.98, output['Avg_Price']) 

In [15]:
output = output[['userID','Timestamp','location','content_type','duration','Avg_Price']]

In [16]:
output.head(10)

| | userID | Timestamp | location | content_type | duration | Avg_Price |
|---|---|---|---|---|---|---|
| 0 | 3 | 2021-01-05 21:44:55+00:00 | Essex | Preserved | 137 | 14.98 |
| 1 | 3 | 2021-01-05 22:28:58+00:00 | Essex | Preserved | 190 | 14.98 |
| 2 | 3 | 2021-01-05 21:22:46+00:00 | Essex | Preserved | 269 | 14.98 |
| 3 | 3 | 2021-01-05 22:31:00+00:00 | Essex | Preserved | 133 | 14.98 |
| 4 | 3 | 2021-01-05 22:36:07+00:00 | Essex | Preserved | 196 | 14.98 |
| 5 | 2 | 2021-02-06 14:27:36+00:00 | Plymouth | Secondary | 198 | 18.58 |
| 6 | 5 | 2021-11-08 07:17:36+00:00 | Edinburgh | Primary | 41 | 19.49 |
| 7 | 5 | 2021-11-08 07:31:37+00:00 | Edinburgh | Primary | 51 | 19.49 |
| 8 | 2 | 2021-01-22 14:29:24+00:00 | Essex | Preserved | 195 | 14.98 |
| 9 | 2 | 2021-01-22 14:52:25+00:00 | Essex | Preserved | 196 | 14.98 |


In [17]:
#output the data
output.to_csv('wk17-output.csv', index=False)