### Prepping Data Challenge: The Price of Streaming (Week 17)
 This challenge involves bringing together two datasets that have different levels of aggregation.

### Requirements
 - Input the data
 - Check the location field for spelling errors
   - Data roles may help you identify these
 - Fix the date fields so they are recognised as date data types
 - Aggregate the data to find the total duration of each streaming session (as identified by the timestamp)
 - We need to update the content_type field:
   - For London, Cardiff and Edinburgh, the content_type is defined as "Primary"
   - For other locations, maintain the "Preserved" content_type and update all others to have a "Secondary" content_type
 - To join to the Avg Pricing Table, we need to work out when each user's first streaming session was. However, it's a little more complex than that.
   - For "Primary" content, we take the overall minimum streaming month, ignoring location
   - For all other content, we work out the minimum active month for each user, in each location and for each content_type
 - We're now ready to join to the Avg Pricing Table
 - For "Preserved" content, we manually input the Avg Price as £14.98
 - Output the data

In [1]:
import pandas as pd
import numpy as np

In [65]:
# Input the data.
with pd.ExcelFile('2022W17 Input.xlsx') as xlsx:
    stream = pd.read_excel(xlsx, 'Streaming')
    price = pd.read_excel(xlsx, 'Avg Pricing')

In [66]:
stream.head()

Unnamed: 0,userID,t,location,content_type,duration
0,3,2021-01-05T21:44:55Z,Essex,Preserved,47
1,3,2021-01-05T21:44:55Z,Essex,Preserved,29
2,3,2021-01-05T21:44:55Z,Essex,Preserved,31
3,3,2021-01-05T21:44:55Z,Essex,Preserved,4
4,3,2021-01-05T21:44:55Z,Essex,Preserved,8


In [68]:
#Check the location field for spelling errors
stream['location'].unique()

array(['Essex', 'Plymouth', 'Edinurgh', 'Newcastle', 'Cardiff', 'London',
       'Manchester', 'Cornwall', 'Nottingham', 'Perth', 'Glasgow',
       'Norfolk', 'Bristol', 'Kent'], dtype=object)

In [69]:
# Map each misspelling to its correct spelling
spellcheck = {'Edinurgh': 'Edinburgh'}
stream['location'] = stream['location'].replace(spellcheck)

In [70]:
stream['location'].unique()

array(['Essex', 'Plymouth', 'Edinburgh', 'Newcastle', 'Cardiff', 'London',
       'Manchester', 'Cornwall', 'Nottingham', 'Perth', 'Glasgow',
       'Norfolk', 'Bristol', 'Kent'], dtype=object)
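
Spotting a misspelling by eye works for 14 locations, but it does not scale. A minimal sketch of an automated check follows, assuming a reference list of valid spellings is available. The list `valid` and the toy `locations` series are hypothetical and not part of the challenge input.

```python
import difflib

import pandas as pd

# Toy stand-in for stream['location']; values are illustrative only.
locations = pd.Series(['Essex', 'Edinurgh', 'Cardiff', 'London'])

# Hypothetical reference list of valid spellings (an assumption, not challenge data).
valid = ['Essex', 'Edinburgh', 'Cardiff', 'London']

# For every value missing from the reference list, look up the closest valid name.
suspects = {}
for name in locations.unique():
    if name not in valid:
        match = difflib.get_close_matches(name, valid, n=1, cutoff=0.8)
        suspects[name] = match[0] if match else None

# Apply only the corrections for which a close match was found.
corrections = {bad: good for bad, good in suspects.items() if good is not None}
cleaned = locations.replace(corrections)
```

The `cutoff` of 0.8 is an arbitrary choice for this sketch. A lower value catches more typos but risks merging distinct place names.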

In [71]:
# Replace the ISO 8601 'T' and 'Z' markers with spaces, then trim the trailing space
stream['t'] = stream['t'].str.replace('[TZ]', ' ', regex=True).str.strip()


In [72]:
stream.head()

Unnamed: 0,userID,t,location,content_type,duration
0,3,2021-01-05 21:44:55,Essex,Preserved,47
1,3,2021-01-05 21:44:55,Essex,Preserved,29
2,3,2021-01-05 21:44:55,Essex,Preserved,31
3,3,2021-01-05 21:44:55,Essex,Preserved,4
4,3,2021-01-05 21:44:55,Essex,Preserved,8


In [73]:
stream.columns

Index(['userID', 't', 'location', 'content_type', 'duration'], dtype='object')

In [74]:
#Fix the date fields so they are recognised as date data types
stream['t'] = pd.to_datetime(stream['t'], format= '%Y-%m-%d %H:%M:%S')
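
As an alternative to replacing the `T` and `Z` characters first, `pd.to_datetime` can usually parse ISO 8601 strings such as `2021-01-05T21:44:55Z` directly. The sketch below uses a toy series `raw` (hypothetical) and assumes every value ends in `Z`, meaning UTC.

```python
import pandas as pd

# Toy stand-in for the raw 't' column, in the same format as the input file.
raw = pd.Series(['2021-01-05T21:44:55Z', '2020-11-21T22:54:25Z'])

# Parse as UTC, then drop the timezone so the values are plain (naive) datetimes.
parsed = pd.to_datetime(raw, utc=True).dt.tz_localize(None)
```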

In [75]:
stream = stream.rename(columns={'t':'timestamp'})

In [85]:
#Aggregate the data to find the total duration of each streaming session (as identified by the timestamp)
stream['total_duration'] = stream.groupby(['timestamp','userID','location'])['duration'].transform('sum')
stream = stream[['userID','timestamp','location','content_type','total_duration']].rename(columns={'total_duration':'duration'})
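
Note that `transform('sum')` keeps one row per original record, so each session still appears several times with the same total. If one row per session is wanted, a plain `groupby(...).sum()` is one option. This is a sketch on toy data (the frame `rows` is illustrative only), and it assumes `content_type` is constant within a session so it can be used as a grouping key.

```python
import pandas as pd

# Toy records: three rows for one session and one row for another.
rows = pd.DataFrame({
    'userID': [3, 3, 3, 3],
    'timestamp': pd.to_datetime(
        ['2021-01-05 21:44:55'] * 3 + ['2021-04-23 22:22:03']
    ),
    'location': ['Essex', 'Essex', 'Essex', 'Plymouth'],
    'content_type': ['Preserved', 'Preserved', 'Preserved', 'Secondary'],
    'duration': [47, 29, 31, 129],
})

# Collapse to one row per session, summing the durations.
sessions = rows.groupby(
    ['userID', 'timestamp', 'location', 'content_type'], as_index=False
)['duration'].sum()
```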

In [96]:
#We need to update the content_type field:
#For London, Cardiff and Edinburgh, the content_type is defined as "Primary"
#For other locations, maintain the "Preserved" content_type and update all others to have a "Secondary" content_type
p1city = ['London','Cardiff','Edinburgh']
p2city = ['Essex','Manchester','Perth','Glasgow','Nottingham']
stream['content_type'] = np.where((stream['location'].isin(p1city)), 'Primary',
                                  np.where((stream['location'].isin(p2city)), 'Preserved','Secondary'))
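
The `p2city` list hard-codes which locations currently hold "Preserved" content. A possible alternative, closer to the wording of the requirement, is to test the existing `content_type` value instead. The sketch below uses a toy frame `df`. The raw value `'Other'` is a placeholder, since the full set of original content types is not shown here.

```python
import numpy as np
import pandas as pd

# Toy data; 'Other' is a placeholder for any non-Preserved raw content type.
df = pd.DataFrame({
    'location': ['London', 'Essex', 'Plymouth', 'Cardiff'],
    'content_type': ['Preserved', 'Preserved', 'Other', 'Other'],
})

primary_cities = ['London', 'Cardiff', 'Edinburgh']

# Conditions are checked in order, so the location rule wins over 'Preserved'.
conditions = [
    df['location'].isin(primary_cities),
    df['content_type'] == 'Preserved',
]
df['content_type'] = np.select(conditions, ['Primary', 'Preserved'], default='Secondary')
```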

In [99]:
stream.tail(25)

Unnamed: 0,userID,timestamp,location,content_type,duration
2018,3,2021-04-23 22:22:03,Plymouth,Secondary,129
2019,3,2021-04-23 22:11:27,Plymouth,Secondary,187
2020,3,2021-04-23 22:11:27,Plymouth,Secondary,187
2021,3,2021-04-23 22:11:27,Plymouth,Secondary,187
2022,3,2021-04-23 22:11:27,Plymouth,Secondary,187
2023,3,2021-04-23 22:11:27,Plymouth,Secondary,187
2024,3,2021-04-23 22:11:27,Plymouth,Secondary,187
2025,3,2020-11-21 22:54:25,Nottingham,Preserved,241
2026,3,2020-11-21 22:54:25,Nottingham,Preserved,241
2027,3,2020-11-21 22:54:25,Nottingham,Preserved,241


In [None]:
#To join to the Avg Pricing Table, we need to work out when each user's first streaming session was.
#For "Primary" content, we take the overall minimum streaming month, ignoring location
#For all other content, we work out the minimum active month for each user, in each location and for each content_type
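
This cell was left empty. One way to approach it is sketched below on toy data. It reads "overall minimum streaming month, ignoring location" as the earliest month per user across all Primary locations, which is an interpretation rather than something the requirements state outright. The frame `sessions` and the column names `month` and `min_month` are hypothetical.

```python
import pandas as pd

# Toy stand-in for the aggregated stream data; values are illustrative only.
sessions = pd.DataFrame({
    'userID': [1, 1, 1, 2, 2],
    'timestamp': pd.to_datetime([
        '2021-03-10 10:00:00', '2021-01-05 09:00:00', '2021-02-01 12:00:00',
        '2020-12-24 20:00:00', '2021-01-02 08:00:00',
    ]),
    'location': ['London', 'Cardiff', 'Essex', 'Kent', 'Kent'],
    'content_type': ['Primary', 'Primary', 'Preserved', 'Secondary', 'Secondary'],
})

# Truncate each timestamp to the first day of its month.
sessions['month'] = sessions['timestamp'].dt.to_period('M').dt.to_timestamp()

is_primary = sessions['content_type'] == 'Primary'

# Primary: earliest month per user and content type, ignoring location.
primary_min = sessions.groupby(['userID', 'content_type'])['month'].transform('min')

# Everything else: earliest month per user, location and content type.
other_min = sessions.groupby(
    ['userID', 'location', 'content_type']
)['month'].transform('min')

# Keep the Primary rule on Primary rows and the per-location rule elsewhere.
sessions['min_month'] = primary_min.where(is_primary, other_min)
```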

In [67]:
price.head()

Unnamed: 0,Month,Avg_Price,Content_Type
0,08 2020,20.92,Primary
1,09 2020,22.9,Primary
2,10 2020,23.41,Primary
3,11 2020,20.66,Primary
4,12 2020,19.61,Primary


In [None]:
#We're now ready to join to the Avg Pricing Table
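
A sketch of the join, assuming the pricing table's `Month` column is text in `MM YYYY` form (as `price.head()` suggests) and that the streaming data carries a month-start column, here given the hypothetical name `min_month`. Both frames below are toy data.

```python
import pandas as pd

# Toy streaming data with a hypothetical 'min_month' column.
sessions = pd.DataFrame({
    'userID': [1, 2],
    'content_type': ['Primary', 'Preserved'],
    'min_month': pd.to_datetime(['2020-08-01', '2020-09-01']),
})

# Toy pricing table in the same layout as the 'Avg Pricing' sheet.
price = pd.DataFrame({
    'Month': ['08 2020', '09 2020'],
    'Avg_Price': [20.92, 22.90],
    'Content_Type': ['Primary', 'Primary'],
})

# Parse 'MM YYYY' into a month-start datetime so the keys are comparable.
price['Month'] = pd.to_datetime(price['Month'], format='%m %Y')

# A left join keeps sessions that have no matching price (e.g. Preserved).
joined = sessions.merge(
    price,
    how='left',
    left_on=['min_month', 'content_type'],
    right_on=['Month', 'Content_Type'],
).drop(columns=['Month', 'Content_Type'])
```

A left join is used so that rows without a price in the table survive with a missing `Avg_Price`, ready to be filled in manually.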

In [None]:
#For "Preserved" content, we manually input the Avg Price as £14.98
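
A minimal sketch of the manual price, using a boolean mask with `.loc`. The frame `joined` and the Secondary price in it are toy values, not results from the challenge data.

```python
import numpy as np
import pandas as pd

# Toy joined data; the Preserved row has no price after the join.
joined = pd.DataFrame({
    'content_type': ['Primary', 'Preserved', 'Secondary'],
    'Avg_Price': [20.92, np.nan, 12.50],
})

# Set the fixed price for every Preserved row.
joined.loc[joined['content_type'] == 'Preserved', 'Avg_Price'] = 14.98
```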

In [10]:
# Output the data (`output` must hold the joined result; it is not defined in the cells above)
output.to_csv('wk17-output.csv', index=False)