### Prepping Data Challenge: Karaoke Data (week 8)
For this dataset, we assume the following:
 
 - Customers often don't sing the entire song
 - Sessions last 60 minutes
 - Customers arrive a maximum of 10 minutes before their sessions begin

#### Requirements:

 1. Input the data
 2. Calculate the time between songs 
 3. If the time between songs is greater than (or equal to) 59 minutes, flag this as being a new session 
 4. Create a session number field 
 5. Number the songs in order for each session 
 6. Match the customers to the correct session, based on their entry time 
    - The Customer ID field should be null if there were no customers who arrived 10 minutes (or less) before the start of the session
 7. Output the data

### 1. Input the data 

In [1]:
#import libraries
import pandas as pd

In [2]:
with pd.ExcelFile('WK8-Karaoke Dataset.xlsx') as xlsx:
    karaoke_choices = pd.read_excel(xlsx, 'Karaoke Choices')
    customers = pd.read_excel(xlsx, 'Customers', converters={'Customer ID':str}).sort_values(by='Entry Time')

In [3]:
karaoke_choices.head()

Unnamed: 0,Date,Artist,Song
0,2020-12-22 13:59:59.971,Wham!,Last Christmas
1,2020-12-22 15:00:00.000,Dolly Parton,9 To 5
2,2020-12-22 15:02:00.010,Camilla Cabello Ft. Young Thug,Havana
3,2020-12-22 15:04:00.019,Moana,How Far I’ll Go
4,2020-12-22 18:00:00.000,Backstreet Boys,I Want It That Way


In [4]:
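#fix customer IDs
#IDs longer than 6 characters that consist only of digits are rewritten in
#scientific notation, e.g. '120000000000' -> '1.20E+11'; all other IDs pass through unchanged
def fix_id(customer_id):
    if len(customer_id) > 6 and customer_id.isnumeric():
        customer_id = "{:.2E}".format(float(customer_id))
    return customer_id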

customers['Customer ID'] = [fix_id(x) for x in customers['Customer ID']]

In [5]:
customers.head()

Unnamed: 0,Customer ID,Entry Time
203,cd2834,2020-12-22 13:54:00
67,2de3d7,2020-12-22 14:55:00
92,6.99e+67,2020-12-22 17:51:00
206,316313,2020-12-22 19:00:00
228,aa0846,2020-12-22 22:55:00


### 2 - 5. Identify sessions and number the songs
   - Calculate the time between songs
   - If the time between songs is greater than (or equal to) 59 minutes, flag this as being a new session 
   - Create a session number field
   - Number the songs in order for each session

In [6]:
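#a gap of 59 minutes or more starts a new session; the running count of those flags, plus 1, is the session number
karaoke_choices['Session number'] = karaoke_choices['Date'].diff(1).dt.total_seconds().ge(59*60).cumsum() + 1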

In [7]:
karaoke_choices['Song Order'] = karaoke_choices.groupby('Session number')['Date'].rank('dense', ascending=True).astype(int)
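
The two cells above can be checked on a small example. The sketch below uses made-up timestamps (not taken from the workbook) and splits the one-liner into named steps. It numbers songs with `cumcount` rather than a dense rank, which is an alternative that assumes the rows are already sorted by `Date` and contain no duplicate timestamps within a session.

```python
import pandas as pd

# Hypothetical timestamps for illustration only
songs = pd.DataFrame({
    "Date": pd.to_datetime([
        "2020-12-22 14:00:00",
        "2020-12-22 15:00:00",
        "2020-12-22 15:02:00",
        "2020-12-22 15:04:00",
        "2020-12-22 18:00:00",
    ])
})

# Step 2: time since the previous song, in seconds (NaN for the first row)
gap_seconds = songs["Date"].diff().dt.total_seconds()

# Step 3: a gap of at least 59 minutes flags a new session (NaN compares as False)
new_session = gap_seconds.ge(59 * 60)

# Step 4: the running count of flags, plus 1, gives the session number
songs["Session number"] = new_session.cumsum() + 1

# Step 5: position of each song within its session, starting at 1
songs["Song Order"] = songs.groupby("Session number").cumcount() + 1

print(songs)
```

The first row never starts a "new" session by the flag, because its gap is missing; adding 1 after the cumulative sum is what makes it session 1.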

### 6. Match the customers to the correct session, based on their entry time 
   - The Customer ID field should be null if there were no customers who arrived 10 minutes (or less) before the start of the session

In [11]:
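#the first song of each session marks the session start
start = karaoke_choices[karaoke_choices['Song Order']==1][['Date','Session number']]
#pair each customer with the next session start if it falls within 10 minutes of their entry;
#unmatched customers are dropped (merge_asof needs both frames sorted on their time keys)
customer_sessions = pd.merge_asof(customers, start, left_on='Entry Time', right_on='Date',
                                  tolerance=pd.Timedelta(minutes=10), direction='forward').dropna()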
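
To see how the forward-looking match with a tolerance behaves, here is a minimal sketch on invented data. The customer IDs, times, and the names `starts`, `arrivals`, and `matched` are all hypothetical. Each arrival is paired with the next session start only if that start is no more than 10 minutes away.

```python
import pandas as pd

# Hypothetical session starts, sorted by time
starts = pd.DataFrame({
    "Date": pd.to_datetime(["2020-12-22 14:00:00", "2020-12-22 15:00:00"]),
    "Session number": [1, 2],
})

# Hypothetical arrivals, sorted by time (merge_asof requires sorted keys)
arrivals = pd.DataFrame({
    "Customer ID": ["aaa111", "bbb222", "ccc333"],
    "Entry Time": pd.to_datetime([
        "2020-12-22 13:54:00",  # 6 minutes before session 1
        "2020-12-22 14:30:00",  # 30 minutes before session 2: too early
        "2020-12-22 14:51:00",  # 9 minutes before session 2
    ]),
})

# direction='forward' looks for the first start at or after the entry time;
# tolerance discards matches that are more than 10 minutes away
matched = pd.merge_asof(
    arrivals, starts,
    left_on="Entry Time", right_on="Date",
    tolerance=pd.Timedelta(minutes=10), direction="forward",
)

# Unmatched arrivals carry NaN/NaT in the right-hand columns; drop them
customer_sessions_demo = matched.dropna()
print(customer_sessions_demo)
```

Because unmatched rows introduce missing values, `Session number` comes back as a float column even after `dropna()`; the later merge on that column still lines up with the integer session numbers.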

In [12]:
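#join customer sessions to song list; sessions with no matched customer keep a null Customer ID
output = karaoke_choices.merge(customer_sessions, how='left', on='Session number', suffixes=('', '_y'))
output = output.drop(columns=['Date_y'])
#round timestamps to the nearest second (e.g. 13:59:59.971 becomes 14:00:00)
output['Date'] = output['Date'].dt.round('1s')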

### 7. Output the data

In [13]:
output.to_csv('WK8-karaoke dataset Output.csv', index=False, date_format='%d/%m/%Y %H:%M:%S',
              columns=['Session number','Customer ID','Song Order','Date','Artist','Song'])

In [14]:
output.head()

Unnamed: 0,Date,Artist,Song,Session number,Song Order,Customer ID,Entry Time
0,2020-12-22 14:00:00,Wham!,Last Christmas,1,1,cd2834,2020-12-22 13:54:00
1,2020-12-22 15:00:00,Dolly Parton,9 To 5,2,1,2de3d7,2020-12-22 14:55:00
2,2020-12-22 15:02:00,Camilla Cabello Ft. Young Thug,Havana,2,2,2de3d7,2020-12-22 14:55:00
3,2020-12-22 15:04:00,Moana,How Far I’ll Go,2,3,2de3d7,2020-12-22 14:55:00
4,2020-12-22 18:00:00,Backstreet Boys,I Want It That Way,3,1,6.99e+67,2020-12-22 17:51:00
