# US election speech dataset
This dataset comes from Kaggle. For details, see [Kaggle: US Election 2020 - Presidential Debates](https://www.kaggle.com/headsortails/us-election-2020-presidential-debates).

I will try to complete the following tasks:
   - [x] find the speaking speed of each speaker
   - [ ] find the most frequent word
   - [ ] find the attitude toward each debate topic
   - [ ] train an RNN to translate the debate into another language

In [1]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity='all'

In [2]:
# import related packages 
import pandas as pd
import numpy as np

In [3]:
# import the data
df = pd.read_csv('../Data/US election speech/us_election_2020_1st_presidential_debate.csv')

In [4]:
# perform a check on data shape and quality
df.info()
df.shape
df.head(10)
df.tail(10)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 789 entries, 0 to 788
Data columns (total 3 columns):
speaker    789 non-null object
minute     788 non-null object
text       789 non-null object
dtypes: object(3)
memory usage: 18.6+ KB


(789, 3)

Unnamed: 0,speaker,minute,text
0,Chris Wallace,01:20,Good evening from the Health Education Campus ...
1,Chris Wallace,02:10,This debate is being conducted under health an...
2,Vice President Joe Biden,02:49,"How you doing, man?"
3,President Donald J. Trump,02:51,How are you doing?
4,Vice President Joe Biden,02:51,I’m well.
5,Chris Wallace,03:11,"Gentlemen, a lot of people been waiting for th..."
6,President Donald J. Trump,04:01,"Thank you very much, Chris. I will tell you ve..."
7,President Donald J. Trump,04:53,And we won the election and therefore we have ...
8,Chris Wallace,05:22,"President Trump, thank you. Same question to y..."
9,Vice President Joe Biden,05:29,"Well, first of all, thank you for doing this a..."


Unnamed: 0,speaker,minute,text
779,Vice President Joe Biden,01:09:30,Yes. And here’s the deal. We count the ballots...
780,President Donald J. Trump,01:10:07,It’s already been established. Take a look at ...
781,Chris Wallace,01:10:10,I asked you. You had an opportunity to respond...
782,Vice President Joe Biden,01:10:15,He has no idea what he’s talking about. Here’s...
783,President Donald J. Trump,01:10:41,I want to see an honest ballot cut-
784,Chris Wallace,01:10:43,"Gentlemen, just say that’s the end of it [cros..."
785,President Donald J. Trump,01:10:47,I want to see an honest ballot count.
786,Chris Wallace,01:10:48,We’re going to leave it there-
787,President Donald J. Trump,01:10:49,And I think he does too-
788,Chris Wallace,01:10:50,… to be continued in more debates as we go on....


In [5]:
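# check for missing values
df.isnull().sum()

# locate the row with the missing timestamp
df.loc[df['minute'].isnull(), 'minute']

# look at the rows around it
df.iloc[177:182, :]

# full text of row 179
df.iloc[179, 2]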

speaker    0
minute     1
text       0
dtype: int64

179    NaN
Name: minute, dtype: object

Unnamed: 0,speaker,minute,text
177,Vice President Joe Biden,24:03,His own CDC Director says we could lose as man...
178,President Donald J. Trump,24:25,"You don’t trust Johnson & Johnson, Pfizer?"
179,Chris Wallace:,,"Okay, gentlemen, gentlemen. Let me move on to ..."
180,President Donald J. Trump,00:15,"Well, I’ve spoken to the companies and we can ..."
181,Vice President Joe Biden,00:22,God.


'Okay, gentlemen, gentlemen. Let me move on to questions about the future because you both have touched on two of the questions I’m going to ask. Focusing on the future first, President Trump, you have repeatedly either contradicted or been at odds with some of your governments own top scientists. The week before last, the Head of the Centers for Disease Control, Dr. Redfield said it would be summer before the vaccine would become generally available to the public. You said that he was confused and mistaken. Those were your two words. But Dr. Slaoui, the head of your Operation Warp Speed, has said exactly the same thing. Are they both wrong?'

Next we create a column called `duration` that holds the time, in seconds, associated with each row's text.
However, the timestamp in row 179 is missing, and the minute count restarts from zero right after it.
For now we fill that timestamp with the placeholder `00:00`; later we estimate the duration of this text from Chris Wallace's average speaking speed.
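
One caveat with this definition: each `minute` value appears to mark when a line starts, so the gap to the previous row measures how long the previous line took. A minimal sketch of the alternative, assuming that reading of the timestamps, takes the gap to the next row instead. The helper names `to_seconds` and `durations_until_next` are hypothetical and not part of this notebook.

```python
def to_seconds(stamp):
    """Convert 'MM:SS' or 'HH:MM:SS' to a number of seconds."""
    seconds = 0
    for part in stamp.split(':'):
        seconds = seconds * 60 + int(part)
    return seconds


def durations_until_next(stamps):
    """Seconds from each timestamp to the next; None where the gap is unknown."""
    starts = [to_seconds(s) for s in stamps]
    durations = []
    for current, following in zip(starts, starts[1:]):
        gap = following - current
        # a negative gap means the clock restarted (as it does after row 179)
        durations.append(gap if gap >= 0 else None)
    durations.append(None)  # the last line has no following timestamp
    return durations


# timestamps of rows 0-4 from the table above
print(durations_until_next(['01:20', '02:10', '02:49', '02:51', '02:51']))
# [50, 39, 2, 0, None]
```

Rows whose gap comes back as `None`, and row 179, whose own start time is missing, would still need an estimate.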

In [6]:
# replace the missing timestamp with the placeholder '00:00'
df['minute'] = df['minute'].fillna('00:00')

In [7]:
# convert each timestamp to the number of seconds elapsed since the previous row
def to_timestamp(value):
    # timestamps are either 'MM:SS' (5 characters) or 'HH:MM:SS' (8 characters)
    fmt = '%M:%S' if len(value) == 5 else '%H:%M:%S'
    return pd.to_datetime(value, format=fmt)

df['duration'] = 0
for i in range(len(df)):
    previous = '00:00' if i == 0 else df.iloc[i-1, 1]
    delta = to_timestamp(df.iloc[i, 1]) - to_timestamp(previous)
    # use .loc instead of chained indexing to avoid SettingWithCopyWarning
    df.loc[i, 'duration'] = delta.seconds
df

Unnamed: 0,speaker,minute,text,duration
0,Chris Wallace,01:20,Good evening from the Health Education Campus ...,80
1,Chris Wallace,02:10,This debate is being conducted under health an...,50
2,Vice President Joe Biden,02:49,"How you doing, man?",39
3,President Donald J. Trump,02:51,How are you doing?,2
4,Vice President Joe Biden,02:51,I’m well.,0
5,Chris Wallace,03:11,"Gentlemen, a lot of people been waiting for th...",20
6,President Donald J. Trump,04:01,"Thank you very much, Chris. I will tell you ve...",50
7,President Donald J. Trump,04:53,And we won the election and therefore we have ...,52
8,Chris Wallace,05:22,"President Trump, thank you. Same question to y...",29
9,Vice President Joe Biden,05:29,"Well, first of all, thank you for doing this a...",7


In [8]:
# inspect the rows around the clock reset; the duration of row 179 is not meaningful yet
df.iloc[177:182,:]

Unnamed: 0,speaker,minute,text,duration
177,Vice President Joe Biden,24:03,His own CDC Director says we could lose as man...,6
178,President Donald J. Trump,24:25,"You don’t trust Johnson & Johnson, Pfizer?",22
179,Chris Wallace:,00:00,"Okay, gentlemen, gentlemen. Let me move on to ...",84935
180,President Donald J. Trump,00:15,"Well, I’ve spoken to the companies and we can ...",15
181,Vice President Joe Biden,00:22,God.,7
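
The 84935 in row 179 is an artifact of how time differences are stored, not a real duration. Subtracting 24:25 from 00:00 gives a negative difference, and the `.seconds` attribute reports only the seconds component of the normalized value (minus one day plus 84935 seconds). The sketch below reproduces this with the standard library's `datetime`, whose `timedelta` uses the same days/seconds normalization as pandas' `Timedelta`. `total_seconds()` returns the signed gap instead.

```python
from datetime import datetime

previous = datetime.strptime('24:25', '%M:%S')  # row 178
current = datetime.strptime('00:00', '%M:%S')   # row 179 after filling the NaN

delta = current - previous
print(delta.days, delta.seconds)   # -1 84935
print(delta.total_seconds())       # -1465.0
```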


In [9]:
# to estimate the speaking speed we first need the number of words in each row
def count(text):
    return len(text.split())

df['count'] = df['text'].apply(count)

df

Unnamed: 0,speaker,minute,text,duration,count
0,Chris Wallace,01:20,Good evening from the Health Education Campus ...,80,124
1,Chris Wallace,02:10,This debate is being conducted under health an...,50,102
2,Vice President Joe Biden,02:49,"How you doing, man?",39,4
3,President Donald J. Trump,02:51,How are you doing?,2,4
4,Vice President Joe Biden,02:51,I’m well.,0,2
5,Chris Wallace,03:11,"Gentlemen, a lot of people been waiting for th...",20,133
6,President Donald J. Trump,04:01,"Thank you very much, Chris. I will tell you ve...",50,156
7,President Donald J. Trump,04:53,And we won the election and therefore we have ...,52,98
8,Chris Wallace,05:22,"President Trump, thank you. Same question to y...",29,15
9,Vice President Joe Biden,05:29,"Well, first of all, thank you for doing this a...",7,16
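
`text.split()` counts whitespace-separated tokens, so a stand-alone punctuation mark such as the leading "…" in row 788 is counted as a word. If that matters, a regular expression that matches only word-like tokens is one option. This is a sketch of an alternative, not what this notebook uses; the name `count_words` and the pattern are assumptions.

```python
import re

# letters/digits, optionally joined by straight or curly apostrophes (e.g. "I’m")
WORD = re.compile(r"[A-Za-z0-9]+(?:['’][A-Za-z0-9]+)*")

def count_words(text):
    return len(WORD.findall(text))

print(count_words('How you doing, man?'))   # 4
print(count_words('I’m well.'))             # 2
print(count_words('… to be continued'))     # 3, whereas len(text.split()) gives 4
```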


In [10]:
# before estimating Chris Wallace's usual speed, check the speaker labels
speaker_list = sorted(df['speaker'].unique())
speaker_list

# row 179 has a stray trailing colon in the speaker name; fix it
df[df['speaker'] == 'Chris Wallace:']
df.loc[179, 'speaker'] = 'Chris Wallace'

speaker_list = sorted(df['speaker'].unique())
speaker_list

['Chris Wallace',
 'Chris Wallace:',
 'President Donald J. Trump',
 'Vice President Joe Biden']

Unnamed: 0,speaker,minute,text,duration,count
179,Chris Wallace:,00:00,"Okay, gentlemen, gentlemen. Let me move on to ...",84935,112


['Chris Wallace', 'President Donald J. Trump', 'Vice President Joe Biden']

In [11]:
# words per second for each speaker, excluding row 179 whose duration is not known yet
speaker_attr = {'speaker': [], 'speed': []}
known = df.drop(index=179)

for speaker in speaker_list:
    rows = known[known['speaker'] == speaker]
    speed = rows['count'].sum() / rows['duration'].sum()
    speaker_attr['speaker'].append(speaker)
    speaker_attr['speed'].append(speed)

speaker_attr

{'speaker': ['Chris Wallace',
  'President Donald J. Trump',
  'Vice President Joe Biden'],
 'speed': [2.460625674217907, 3.3333333333333335, 3.9129662522202486]}
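
The same aggregate can also be computed with `groupby`. The sketch below uses a small made-up frame that borrows this notebook's column names; the rows are illustrative and not taken from the debate data.

```python
import pandas as pd

toy = pd.DataFrame({
    'speaker': ['A', 'B', 'A', 'B'],
    'duration': [10, 20, 30, 20],   # seconds
    'count': [20, 60, 80, 100],     # words
})

# total words and total seconds per speaker, then their ratio
totals = toy.groupby('speaker')[['count', 'duration']].sum()
speed = totals['count'] / totals['duration']   # words per second
print(speed.to_dict())   # {'A': 2.5, 'B': 4.0}
```

Applied to the debate frame, row 179 would be dropped first (for example with `df.drop(index=179)`) so that its placeholder duration does not distort the result.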

In [12]:
# now we can put the estimated duration back into row 179:
# duration (s) = word count / speed (words per second)
chris_speed = speaker_attr['speed'][0]  # 'Chris Wallace' is the first entry
df.loc[179, 'duration'] = round(df.loc[179, 'count'] / chris_speed)
df.iloc[179]

speaker                                         Chris Wallace
minute                                                  00:00
text        Okay, gentlemen, gentlemen. Let me move on to ...
duration                                                   46
count                                                     112
Name: 179, dtype: object

In [13]:
# now we can show the speed of each speaker
for speaker, speed in zip(speaker_attr['speaker'], speaker_attr['speed']):
    print('{} speaks {:.2f} words/s'.format(speaker, speed))

Chris Wallace speaks 2.46 words/s
President Donald J. Trump speaks 3.33 words/s
Vice President Joe Biden speaks 3.91 words/s
