### Prepping Data Challenge: Dungeons & Dragons: Critical Role (Week 22)

For this challenge, we'll be looking at a Dungeons and Dragons dataset from the podcast "Critical Role". If we wanted to know how long each character speaks during each episode, we'd need to do a little bit of preppin' to prepare the dataset for such an analysis.
 
### Requirements
- Input the data
- To create our Gantt chart, we'll need to work out how long each character is talking. We can do this by taking the difference between one timestamp and the next. However, for the last line of dialogue in each episode we'll need to know when the episode ends, so we'll need to union the dialogue with the episode details to find the last timestamp
- Create a rank of the timestamp for each episode, ordered by earliest timestamp
  - Think carefully about the type of rank you want to use
- Create a new column that is the rank minus 1, so we can look up the next line
- Create a duplicate dataset and remove all columns except
  - episode
  - next_line
  - time_in_secs
- Inner join these two datasets
- Calculate the dialogue durations
- Some character names are comma separated; split these names out and trim any trailing whitespace
  - It's ok to leave "ALL" as "ALL"
- Reshape the data so we have a row per character
- Filter the data for just Gameplay sections
- Ensure no duplication of rows has occurred
- Output the data
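
The rank and lookup steps listed above can be sketched on a toy dataset. This is only an illustration of the approach the requirements describe, not the notebook's own solution: the two small frames and their values are invented, and the helper names `ends`, `lines`, `lookup` and `next_time` are assumptions made for the example.

```python
import pandas as pd

# Toy stand-ins for the two input sheets (values are invented for illustration).
dialogue = pd.DataFrame({
    "Episode": ["E1", "E1", "E1", "E2", "E2"],
    "name": ["MATT", "LAURA", "MATT", "SAM", "MATT"],
    "time_in_secs": [0, 10, 25, 0, 40],
})
episode = pd.DataFrame({"Episode": ["E1", "E2"], "runtime_in_secs": [60, 100]})

# Union: append one extra row per episode marking when the episode ends.
ends = episode.rename(columns={"runtime_in_secs": "time_in_secs"})
lines = pd.concat([dialogue, ends], ignore_index=True)

# Rank the timestamps within each episode, earliest first.
lines["rank"] = (
    lines.groupby("Episode")["time_in_secs"].rank(method="dense").astype(int)
)
# rank - 1 lets each row be found as the "next line" of the row before it.
lines["next_line"] = lines["rank"] - 1

# Duplicate dataset with only the lookup key and the timestamp of the next line.
lookup = lines[["Episode", "next_line", "time_in_secs"]].rename(
    columns={"next_line": "rank", "time_in_secs": "next_time"}
)

# Inner join: the row with rank r picks up the timestamp of the row with rank r + 1.
# The appended end-of-episode rows have no successor, so the join drops them.
joined = lines.merge(lookup, on=["Episode", "rank"])
joined["Duration"] = joined["next_time"] - joined["time_in_secs"]
result = joined[["Episode", "name", "time_in_secs", "Duration"]]
print(result)
```

With a dense rank, lines that share a timestamp also share a rank, so the join would multiply those rows. That is one reason the requirements ask you to think about the rank type and to check for duplicates at the end.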

In [1]:
import pandas as pd
import numpy as np

In [2]:
#input the data
with pd.ExcelFile("WK22-Input.xlsx") as xlsx:
    episode = pd.read_excel(xlsx, 'episode_details')
    dialogue = pd.read_excel(xlsx, 'dialogue')

In [3]:
episode.head()

Unnamed: 0,Episode,Title,Original airdate,Runtime,runtime_in_secs,youtube_link,youtube_episode_id
0,C1E001,"""Arrival at Kraghammer"" (1x01)",2015-03-12,03:03:34,11014,https://www.youtube.com/watch?v=i-p9lWIhcLQ,i-p9lWIhcLQ
1,C1E002,"""Into the Greyspine Mines"" (1x02)",2015-03-19,03:00:11,10811,https://www.youtube.com/watch?v=JTie0S_5gjE,JTie0S_5gjE
2,C1E003,"""Strange Bedfellows"" (1x03)",2015-03-26,02:36:17,9377,https://www.youtube.com/watch?v=kpkCcb--r90,kpkCcb--r90
3,C1E004,"""Attack on the Duergar Warcamp"" (1x04)",2015-04-02,04:42:42,16962,https://www.youtube.com/watch?v=kGxiZNbjwGI,kGxiZNbjwGI
4,C1E005,"""The Trick about Falling"" (1x05)",2015-04-09,03:09:42,11382,https://www.youtube.com/watch?v=u6QpXDL7E8Y,u6QpXDL7E8Y


In [4]:
dialogue.head()

Unnamed: 0,Episode,name,time_in_secs,youtube_timestamp,dialogue,section
0,C1E001,MATT,0,https://youtu.be/i-p9lWIhcLQ?t=0,"Hello everyone. My name is Matthew Mercer, voi...","Intros, breaks, outtros"
1,C1E001,TRAVIS,69,https://youtu.be/i-p9lWIhcLQ?t=69,"Right, listen up! If you have ale, then you ha...","Intros, breaks, outtros"
2,C1E001,MARISHA,177,https://youtu.be/i-p9lWIhcLQ?t=177,A first impression of Keyleth would leave you ...,"Intros, breaks, outtros"
3,C1E001,TALIESIN,301,https://youtu.be/i-p9lWIhcLQ?t=301,"Percy was the third child of seven children, b...","Intros, breaks, outtros"
4,C1E001,SAM,375,https://youtu.be/i-p9lWIhcLQ?t=375,"Oh, you haven't heard of Scanlan Shorthalt? We...","Intros, breaks, outtros"


In [5]:
#Inner join the dialogue with each episode's runtime (the shared key is 'Episode')
df = dialogue.merge(episode[['Episode','runtime_in_secs']], on='Episode', how='inner')

In [6]:
#Find when each line ends: merge_asof looks forward to the next strictly later timestamp in the same episode
#The last line of an episode has no later timestamp, so its end_time is NaN; the episode runtime covers that case
#Filter the data for just Gameplay sections
df2 = (pd.merge_asof(df.sort_values(by=['time_in_secs']).rename(columns={'time_in_secs':'start_time'}),
                    df[['Episode','time_in_secs']].sort_values(by=['time_in_secs'])
                    .rename(columns={'time_in_secs':'end_time'}),
                   left_on='start_time', right_on='end_time', by=['Episode'],
                   direction='forward', allow_exact_matches=False)
        .query("section == 'Gameplay'")
     )
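
To see what the forward `merge_asof` does, here is a small sketch on invented timestamps. The frame `toy` and its values are assumptions for illustration only; the call itself uses the same arguments as the cell above.

```python
import pandas as pd

# Invented timestamps for two episodes.
toy = pd.DataFrame({
    "Episode": ["E1", "E1", "E1", "E2", "E2"],
    "time_in_secs": [0, 10, 25, 5, 40],
})

# merge_asof requires both sides to be sorted on their time key.
left = toy.rename(columns={"time_in_secs": "start_time"}).sort_values("start_time")
right = toy.rename(columns={"time_in_secs": "end_time"}).sort_values("end_time")

matched = pd.merge_asof(
    left, right,
    left_on="start_time", right_on="end_time",
    by="Episode",                # only match rows within the same episode
    direction="forward",         # take the nearest later timestamp
    allow_exact_matches=False,   # a row must not match its own timestamp
)
print(matched.sort_values(["Episode", "start_time"]))
```

The last row of each episode has nothing later to match, so its `end_time` comes back as `NaN`, which also turns the column into floats.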

In [7]:
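#Rename the episode runtime column; from here on time_in_secs holds the episode end time, not the line's own timestamp
df2.rename(columns={'runtime_in_secs' : 'time_in_secs'}, inplace=True)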

In [8]:
df2.head(10)

Unnamed: 0,Episode,name,start_time,youtube_timestamp,dialogue,section,time_in_secs,end_time
4720,C1E003,LAURA,302,https://youtu.be/kpkCcb--r90?t=302,He's got nobody to blame but himself.,Gameplay,9377,303.0
4741,C1E003,MATT,303,https://youtu.be/kpkCcb--r90?t=303,"Kind of, that's okay. Grog owns it-- as you ca...",Gameplay,9377,309.0
4830,C1E003,MARISHA,309,https://youtu.be/kpkCcb--r90?t=309,I love those.,Gameplay,9377,310.0
4852,C1E003,MATT,310,https://youtu.be/kpkCcb--r90?t=310,"So. We're beginning now as Grog has collapsed,...",Gameplay,9377,313.0
4916,C1E003,LIAM,313,https://youtu.be/kpkCcb--r90?t=313,The future's so bright.,Gameplay,9377,315.0
4947,C1E003,MATT,315,https://youtu.be/kpkCcb--r90?t=315,The last of the duergar have been blown off th...,Gameplay,9377,323.0
5073,C1E003,TRAVIS,323,https://youtu.be/kpkCcb--r90?t=323,Somebody pretty give me CPR.,Gameplay,9377,325.0
5099,C1E003,LAURA,325,https://youtu.be/kpkCcb--r90?t=325,Don't talk.,Gameplay,9377,326.0
5115,C1E003,TRAVIS,326,https://youtu.be/kpkCcb--r90?t=326,Okay.,Gameplay,9377,327.0
5129,C1E003,SAM,327,https://youtu.be/kpkCcb--r90?t=327,"Grog! Grog, can you hear us?",Gameplay,9377,328.0


In [9]:
#Create a rank of the timestamp for each episode, ordered by earliest timestamp
#Rank on start_time: time_in_secs now holds the episode runtime, which is constant within an episode
df2['rank'] = df2.groupby(['Episode'])['start_time'].rank(method='dense').astype(int)
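
The choice of rank method only matters when timestamps tie. A minimal sketch on an invented series with one tie shows how three of the methods pandas offers differ:

```python
import pandas as pd

# Invented timestamps; the two lines at 10 seconds are tied.
times = pd.Series([0, 10, 10, 25], name="time_in_secs")

ranks = pd.DataFrame({
    "time_in_secs": times,
    "dense": times.rank(method="dense").astype(int),  # ties share a rank, no gaps
    "min": times.rank(method="min").astype(int),      # ties share a rank, gap after
    "first": times.rank(method="first").astype(int),  # ties broken by row order
})
print(ranks)
```

For a "rank minus 1" lookup, `dense` keeps the ranks consecutive, so every line can find a next rank, while `first` keeps the ranks unique, so the join never multiplies rows. Which trade-off is right depends on how tied timestamps should be treated.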

In [10]:
#Create a new column that is the rank minus 1, so we can look up the next line
df2['next_line'] = df2['rank']-1

In [11]:
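#Calculate the dialogue durations
#The last line of an episode has no end_time, so fall back to the episode runtime (held in time_in_secs)
df2['Duration'] = df2['end_time'].fillna(df2['time_in_secs']) - df2['start_time']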

In [12]:
#Some character names are comma separated; split on commas and trim the whitespace around each name
#The multi-character pattern is treated as a regex, so spaces inside a name are kept
df2 = df2.assign(name=df2['name'].str.strip().str.split(r'\s*,\s*'),
                  dialogue=df2['dialogue'].astype(str).str.strip())

In [13]:
#Reshape the data so we have a row per character
df2 = df2.explode('name')
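
As a small illustration of the split-and-explode reshape, the sketch below uses an invented three-row frame; the names and durations are made up for the example.

```python
import pandas as pd

# Invented rows: one line is shared by two characters, one is spoken by "ALL".
toy = pd.DataFrame({
    "name": ["MATT", "LAURA, TRAVIS", "ALL"],
    "Duration": [6.0, 2.0, 1.0],
})

# Split the comma-separated names into lists, then give each name its own row.
toy["name"] = toy["name"].str.split(",")
exploded = toy.explode("name")
# Trim the whitespace left behind by the ", " separator.
exploded["name"] = exploded["name"].str.strip()
print(exploded)
```

`explode` repeats the original index label for every row it creates, so the index is no longer unique afterwards; `reset_index(drop=True)` gives a fresh one if needed.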

In [14]:
#Ensure no duplication of rows has occurred
df2.drop_duplicates(inplace=True)
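
Before dropping duplicates it can help to count them, so that an unexpectedly large number points back to a join or reshape step that multiplied rows. A minimal sketch on an invented frame:

```python
import pandas as pd

# Invented rows; the first two are exact duplicates.
toy = pd.DataFrame({
    "Episode": ["E1", "E1", "E1"],
    "name": ["MATT", "MATT", "SAM"],
    "start_time": [0, 0, 5],
})

# duplicated() flags every repeat of an earlier identical row.
n_dupes = int(toy.duplicated().sum())
deduped = toy.drop_duplicates()
print(f"{n_dupes} duplicate row(s) removed, {len(deduped)} remaining")
```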

In [15]:
output = df2[['Episode', 'name', 'start_time', 'Duration', 'youtube_timestamp','dialogue', 'section']]

In [16]:
output.head(10)

Unnamed: 0,Episode,name,start_time,Duration,youtube_timestamp,dialogue,section
4720,C1E003,LAURA,302,1.0,https://youtu.be/kpkCcb--r90?t=302,He's got nobody to blame but himself.,Gameplay
4741,C1E003,MATT,303,6.0,https://youtu.be/kpkCcb--r90?t=303,"Kind of, that's okay. Grog owns it-- as you ca...",Gameplay
4830,C1E003,MARISHA,309,1.0,https://youtu.be/kpkCcb--r90?t=309,I love those.,Gameplay
4852,C1E003,MATT,310,3.0,https://youtu.be/kpkCcb--r90?t=310,"So. We're beginning now as Grog has collapsed,...",Gameplay
4916,C1E003,LIAM,313,2.0,https://youtu.be/kpkCcb--r90?t=313,The future's so bright.,Gameplay
4947,C1E003,MATT,315,8.0,https://youtu.be/kpkCcb--r90?t=315,The last of the duergar have been blown off th...,Gameplay
5073,C1E003,TRAVIS,323,2.0,https://youtu.be/kpkCcb--r90?t=323,Somebody pretty give me CPR.,Gameplay
5099,C1E003,LAURA,325,1.0,https://youtu.be/kpkCcb--r90?t=325,Don't talk.,Gameplay
5115,C1E003,TRAVIS,326,1.0,https://youtu.be/kpkCcb--r90?t=326,Okay.,Gameplay
5129,C1E003,SAM,327,1.0,https://youtu.be/kpkCcb--r90?t=327,"Grog! Grog, can you hear us?",Gameplay


In [17]:
#output the data 
output.to_csv('wk22-output.csv', index=False)