# Predicting Democratic Polling Results based on Debate Speeches

The goal of this project is to showcase some basic skills in text preprocessing, my thought process as I work through the data, and a quick use of some ensemble machine learning methods. The data was downloaded recently from Kaggle and includes the Democratic debates from June 2019 through the New Hampshire debate in early 2020. The transcripts also contain noise: some rows record crowd noise, announcers, or people who prompt the debaters. Some of this noise is removed, but we keep some of it as a baseline of corpuses that should obtain 0% polling.

In [1]:
import pandas as pd
import numpy as np
import nltk
from nltk.corpus import stopwords
import string
import re
import os

Our first goal is to set our working directory path and load our data.

In [2]:
#os.chdir("path of code")
debates = pd.read_csv("debate_transcripts.csv", encoding="unicode_escape") # Load the downloaded Kaggle dataset
debates

Unnamed: 0,debate_name,debate_section,speaker,speech,speaking_time_seconds
0,New Hampshire Democratic Debate Transcript,Part 1,George S.,"Candidates, welcome. Vice President Biden, the...",18.0
1,New Hampshire Democratic Debate Transcript,Part 1,Joe Biden,"Oh, they didn't miss anything. It's a long rac...",36.0
2,New Hampshire Democratic Debate Transcript,Part 1,George S.,Why are Senator Sanders and Mayor Buttigieg to...,4.0
3,New Hampshire Democratic Debate Transcript,Part 1,Joe Biden,"Well, you know that with regard to Senator San...",41.0
4,New Hampshire Democratic Debate Transcript,Part 1,George S.,"Senator Sanders, let me give you the chance to...",21.0
5,New Hampshire Democratic Debate Transcript,Part 1,Bernie Sanders,Because Donald Trump lies all the time. It doe...,41.0
6,New Hampshire Democratic Debate Transcript,Part 1,Bernie Sanders,I believe that the way we beat Trump is by hav...,39.0
7,New Hampshire Democratic Debate Transcript,Part 1,George S.,"But Senator, let me follow up there and then w...",12.0
8,New Hampshire Democratic Debate Transcript,Part 1,Bernie Sanders,That's true. And that's the disappointment and...,23.0
9,New Hampshire Democratic Debate Transcript,Part 1,George S.,"Before I move on to Mayor Buttigieg, let me ju...",11.0


Next we use a basic regular expression to remove prompters, moderators, and unnamed speakers.

In [3]:
speakers = debates['speaker'].apply(lambda x: x.lower())
debates = debates[~speakers.str.contains("speaker.*|moderator.*|george s.|chuck todd")]
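One detail of the pattern above is that the `.` in `george s.` is unescaped, so it matches any character rather than a literal period. The following is a minimal sketch of the difference using plain `re` (pandas' `str.contains` searches with regular expressions by default). The speaker labels here are hypothetical and only for illustration.

```python
import re

# Hypothetical lower-cased speaker labels (not taken from the dataset).
speakers = ["joe biden", "george s.", "george soros",
            "moderator 1", "speaker 2", "chuck todd"]

# Same pattern as the notebook: "." matches ANY character,
# so "george s." also matches the start of "george soros".
loose = re.compile(r"speaker.*|moderator.*|george s.|chuck todd")

# Escaping each literal keeps the period literal.
names = ["speaker", "moderator", "george s.", "chuck todd"]
strict = re.compile("|".join(re.escape(n) for n in names))

kept_loose = [s for s in speakers if not loose.search(s)]
kept_strict = [s for s in speakers if not strict.search(s)]

print(kept_loose)   # ['joe biden']
print(kept_strict)  # ['joe biden', 'george soros']
```

With the labels actually present in this dataset the two patterns may well behave the same; the escaped version just avoids surprises if a new name is added.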

Next we take a look at every unique debate name.

In [4]:
debates['debate_name'].unique()

array(['New Hampshire Democratic Debate Transcript',
       'January Iowa Democratic Debate Transcript',
       'December Democratic Debate Transcript: Sixth Debate from Los Angeles',
       'November Democratic Debate Transcript \x96 5th Debate Transcript from Atlanta',
       'October Democratic Debate Transcript: 4th Debate in Ohio',
       'September Houston Democratic Debate Transcript \x96 Third Debate',
       'Transcript of July Democratic Debate 2nd Round, Night 2: Full Transcript July 31, 2019',
       'Transcript of July Democratic Debate 2nd Round Night 1: Full Transcript July 30, 2019',
       'Transcript from Night 2 of the First 2019 June Democratic Debates',
       'Transcript from Night 1 of the 2019 June Democratic Debates'],
      dtype=object)

The output lists 10 transcripts covering 8 debates, since the June and July debates were each split across two nights. We map a keyword in each title to a short, consistent debate label.

In [5]:
debate_name = debates['debate_name'].apply(lambda x: x.lower())

debate_map = {'hampshire' : 'New Hampshire Debate',
'january': 'Iowa Debate',
'december': 'Los Angeles Debate',
'november' : 'Atlanta Debate',
'october' : 'Ohio Debate',
'september' : 'Houston Debate',
'july' : '2nd Debate',
'june' : '1st Debate'}

Next we parse the debate_name column of our dataframe. We define a function that, given a lower-cased title x, cycles through the keys of our map and returns the label for the first key found in x, or "unknown" if no key matches. We apply this to the column to get a consistent format.

In [6]:
def debate_name_parser(x):
    group = "unknown"
    for key in debate_map:
        if key in x:
            group = debate_map[key]
            break
    return group
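To make the behaviour of the parser concrete, here is a small self-contained sketch that runs the same keyword lookup on a few of the raw titles shown earlier. The function name `parse_title` and the last sample title are made up for illustration; the mapping is the one defined above.

```python
debate_map = {'hampshire': 'New Hampshire Debate',
              'january': 'Iowa Debate',
              'december': 'Los Angeles Debate',
              'november': 'Atlanta Debate',
              'october': 'Ohio Debate',
              'september': 'Houston Debate',
              'july': '2nd Debate',
              'june': '1st Debate'}

def parse_title(title):
    """Return the label of the first keyword found in the title."""
    lowered = title.lower()
    for key, label in debate_map.items():
        if key in lowered:
            return label
    return "unknown"

titles = [
    "January Iowa Democratic Debate Transcript",
    "Transcript of July Democratic Debate 2nd Round, Night 2: Full Transcript July 31, 2019",
    "Transcript of July Democratic Debate 2nd Round Night 1: Full Transcript July 30, 2019",
    "Some Other Transcript",  # hypothetical title with no keyword
]
labels = [parse_title(t) for t in titles]
print(labels)
```

Note that both July nights collapse into the single label `'2nd Debate'`, which is why 10 transcripts become 8 debates.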

In [7]:
debates['debate_name'] = debate_name.apply(lambda x: debate_name_parser(x))

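SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

This warning appears because debates is itself a filtered slice of the original dataframe (from In [3]). The assignment still takes effect here; calling .copy() on the filtered dataframe would silence the warning.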


Next we look at the unique values of our speaker and debate_name columns. We want to combine all the text that a single speaker said at a single event. For example, we combine all text spoken by Joe Biden at the New Hampshire Debate into a single corpus. That corpus can later be given a target value: his polling result around the New Hampshire debate. It is kept separate from the combined text of Joe Biden at the Iowa debate, whose polling result reflects that debate specifically.

In [8]:
uqdebate = debates['debate_name'].unique()
uqdebate

array(['New Hampshire Debate', 'Iowa Debate', 'Los Angeles Debate',
       'Atlanta Debate', 'Ohio Debate', 'Houston Debate', '2nd Debate',
       '1st Debate'], dtype=object)

In [9]:
uqspeaker = debates['speaker'].unique()
uqspeaker

array(['Joe Biden', 'Bernie Sanders', 'Amy Klobuchar', 'Tom Steyer',
       'Andrew Yang', 'Elizabeth Warren', 'Pete Buttigieg',
       'Linsey Davis', 'David Muir', 'Monica Hernandez', 'Adam Sexton',
       'Devin Dwyer', 'Rachel Scott', 'Announcer', 'Wolf Blitzer',
       'Abby Phillips', 'B. Pfannenstiel', 'Brianne P.', 'Judy Woodruff',
       'Amy Walter', 'Stephanie Sy', 'Tim Alberta', 'Amna Nawaz',
       'Yamiche A.', 'Rachel Maddow', 'Andrea Mitchell', 'Kamala Harris',
       'Cory Booker', 'Kristen Welker', 'Ashley Parker', 'Tulsi Gabbard',
       'Anderson Cooper', 'Erin Burnett', 'Marc Lacey', 'Julian Castro',
       "Beto O'Rourke", 'A. Cooper', 'Jake Tapper', 'Voiceover',
       'Jorge Ramos', 'Sec. Castro', 'Dana Bash', 'Bill de Blasio',
       'Michael Bennet', 'Jay Inslee', 'Kirsten Gillibrand', 'Don Lemon',
       'Crowd', 'Kirseten Gillibrand', 'Diana', 'Steve Bullock',
       'Marianne Williamson', 'John Delaney', 'Tim Ryan', 'John H.',
       'Female', 'Male', 'John

In [10]:
NHJoe = debates[(debates['speaker']=='Joe Biden') & (debates['debate_name']=='New Hampshire Debate')]
" ".join(NHJoe['speech'])

"Oh, they didn't miss anything. It's a long race. I took a hit in Iowa and I'll probably take a hit here. Traditionally Bernie won by 20 points last time. And usually it's the neighboring senators that do well. But no matter what, I'm still in this for the same reason, we have to restore the soul of this country, bring back the middle class and make sure we bring people together. And so it's a simple proposition. It doesn't matter whether it's this one or the next. I've always viewed the first four encounters, two primaries, and two caucuses as the starting point. And so that's how I view it. Well, you know that with regard to Senator Sanders, the President wants very much to sic a label on every candidate. We're going to not only have to win this time, we have to bring along the United States Senate. And Bernie's labeled himself, not me, a democratic socialist. I think that's the label that the President's going to lay on everyone running with Bernie if he's a nominee. And a Mayor But

In [11]:
def debate_corpus_joiner(s,n):
    SpeakerDebate=debates[(debates['speaker']==s) & (debates['debate_name']==n)]
    SpeakerDebateCorpus=" ".join(SpeakerDebate['speech'])
    return pd.DataFrame(data=[n,s,SpeakerDebateCorpus])

In [12]:
targetdf = pd.DataFrame()
for n in uqdebate:
    for s in uqspeaker:
        targetdf = pd.concat([targetdf,debate_corpus_joiner(s,n)], axis=1)
targetdf = targetdf.transpose()

We now have the dataframe we were looking for, with one row per {speaker, debate_name} pairing and the corpus corresponding to that pairing. Some corpuses are empty, however, because the dataframe was built by iterating over every combination of unique speaker and debate. For example, some of the earlier candidates in the 1st and 2nd debates do not appear in the Iowa debate, and therefore have no speech associated with that event. To remove these extra rows, we keep only the {speaker, debate_name} pairings whose corpus length is greater than 0.

I also have a piece of code that has been commented out. This code wrote part of targetdf to a csv file so I could manually enter the polling result for each pairing. The polling results were either taken directly from the debates or based on the nearest poll in the debate's state after the debate had occurred.

In [13]:
targetdf = targetdf[targetdf[2].apply(len)>0]
##targetdf.iloc[:,0:2].to_csv(r'C:\\Users\\Putts\\Documents\\Python Training\\debatespeaker.csv')
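As an aside, the same {speaker, debate_name} corpus table can probably be built in one step with a groupby, which never creates the empty pairings in the first place. This is only a sketch on a toy dataframe (the rows below are invented), not the code used for the results in this notebook.

```python
import pandas as pd

# Toy stand-in for the debates dataframe; the speeches are made up.
toy = pd.DataFrame({
    "debate_name": ["Iowa Debate", "Iowa Debate", "Iowa Debate", "Ohio Debate"],
    "speaker": ["Joe Biden", "Bernie Sanders", "Joe Biden", "Joe Biden"],
    "speech": ["First answer.", "Opening remarks.", "Second answer.", "Closing."],
})

# One row per (debate, speaker) pair that actually occurs,
# with all of that speaker's speeches joined by spaces.
corpus = (
    toy.groupby(["debate_name", "speaker"], sort=False)["speech"]
       .agg(" ".join)
       .reset_index()
)
print(corpus)
```

Because only observed pairings form groups, no length filter is needed afterwards.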


Next on our agenda is to read this file back in and join it with the corpus column of targetdf. We call our completed dataframe 'dfcomplete'.

In [14]:
df_with_polls=pd.read_csv('debatespeaker.csv')

In [15]:
targetdf = targetdf.reset_index().drop('index', axis = 1)

In [16]:
dfcomplete=df_with_polls.join(targetdf[2])

In [17]:
dfcomplete = dfcomplete.rename(columns={2:'speech'})
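The join above lines rows up by position, which works as long as the csv file keeps the same row order as targetdf. A more defensive option, sketched below on toy data, is to merge on the {debate_name, speaker} keys instead. The column names and poll numbers here are assumptions for illustration; in the notebook, targetdf's columns 0, 1, 2 would first need to be renamed.

```python
import pandas as pd

# Hypothetical polling table (numbers are made up).
polls = pd.DataFrame({
    "debate_name": ["Iowa Debate", "Ohio Debate"],
    "speaker": ["Joe Biden", "Joe Biden"],
    "poll_results": [20.0, 25.0],
})

# Hypothetical corpus table, deliberately in a different row order.
corpus = pd.DataFrame({
    "speaker": ["Joe Biden", "Joe Biden"],
    "debate_name": ["Ohio Debate", "Iowa Debate"],
    "speech": ["Ohio text", "Iowa text"],
})

# validate="one_to_one" raises if either side has duplicate keys.
merged = polls.merge(corpus, on=["debate_name", "speaker"],
                     how="left", validate="one_to_one")
print(merged)
```

The merge pairs each poll with the right speech regardless of row order.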

## Train/Test Split and Text Preprocessing

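We now split the data into features (X, the speech corpus) and target (Y, the polling result), and into training and test sets. Our goal is to predict the New Hampshire polling results from the speeches and polling results of all prior debates.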

In [18]:
X_train = dfcomplete[dfcomplete['debate_name']!='New Hampshire Debate']['speech']
X_test = dfcomplete[dfcomplete['debate_name']=='New Hampshire Debate']['speech']
Y_train = dfcomplete[dfcomplete['debate_name']!='New Hampshire Debate']['poll_results']
Y_test = dfcomplete[dfcomplete['debate_name']=='New Hampshire Debate']['poll_results']

Next we create a function that preprocesses the data, removing punctuation and stop words.

In [19]:
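import string
def corpus_preprocessing(words):
    # Build the stop word set once per call instead of once per word
    stop_words = set(stopwords.words('english'))
    # Drop punctuation characters, then split on whitespace and drop stop words
    step1 = [char for char in words if char not in string.punctuation]
    step2 = ''.join(step1)
    step3 = [word for word in step2.split() if word not in stop_words]
    return step3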
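One thing worth checking with this kind of function is case sensitivity: the NLTK English stop word list is lower-case, and a callable analyzer passed to CountVectorizer receives the raw text, so capitalised stop words such as "We" survive. The sketch below shows the effect with a tiny hard-coded stop word set standing in for the NLTK list (an assumption, so it runs without downloading anything); the `lowercase` flag is a hypothetical addition, not part of the function above.

```python
import string

# Tiny stand-in for stopwords.words('english'); the real list is much longer.
STOP_WORDS = {"the", "we", "have", "to", "of", "this"}

def preprocess(text, lowercase=False):
    """Strip punctuation, split on whitespace, drop stop words."""
    if lowercase:
        text = text.lower()
    no_punct = "".join(ch for ch in text if ch not in string.punctuation)
    return [w for w in no_punct.split() if w not in STOP_WORDS]

sample = "We have to restore the soul of this country."
as_is = preprocess(sample)
lowered = preprocess(sample, lowercase=True)

print(as_is)    # ['We', 'restore', 'soul', 'country']
print(lowered)  # ['restore', 'soul', 'country']
```

Lower-casing would also merge tokens like "Trump" and "trump" into one feature, which may or may not be desirable for this task.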

Finally we vectorize the text with a simple bag-of-words approach, using CountVectorizer followed by TfidfTransformer (Term Frequency-Inverse Document Frequency).

In [20]:
from sklearn.feature_extraction.text import CountVectorizer

In [21]:
bag_of_words = CountVectorizer(analyzer=corpus_preprocessing).fit(X_train)

In [22]:
X_train_bow = bag_of_words.transform(X_train)

In [23]:
from sklearn.feature_extraction.text import TfidfTransformer

In [24]:
tfidf_transformer = TfidfTransformer().fit(X_train_bow)
X_train_tfidf = tfidf_transformer.transform(X_train_bow)
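For intuition about what the two steps above compute, here is a pure-Python sketch of TF-IDF as I understand scikit-learn's defaults (smoothed idf, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalisation of each row). The three toy documents are made up, and this is an illustration rather than a replacement for the library.

```python
import math
from collections import Counter

# Toy tokenised documents (already preprocessed).
docs = [["jobs", "tax", "tax"],
        ["jobs", "health"],
        ["jobs", "care", "care"]]

vocab = sorted({w for d in docs for w in d})
n_docs = len(docs)

# Document frequency: in how many documents does each word appear?
df = {w: sum(w in d for d in docs) for w in vocab}

# Smoothed inverse document frequency.
idf = {w: math.log((1 + n_docs) / (1 + df[w])) + 1 for w in vocab}

def tfidf_row(doc):
    """Raw count * idf for every vocabulary word, scaled to unit L2 norm."""
    counts = Counter(doc)
    raw = [counts[w] * idf[w] for w in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

matrix = [tfidf_row(d) for d in docs]
for w in vocab:
    print(w, round(idf[w], 3))
```

A word that appears in every document ("jobs" here) gets the minimum idf of 1, while rarer words are weighted up.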

## Quick and Dirty AdaBoostRegressor Model

To quickly test the effectiveness of our vectorized word data, we fit an untuned AdaBoostRegressor model on the above data and look at the explained_variance_score on the training data. For reference, the closer the explained variance is to 1, the better.

In [25]:
from sklearn.ensemble import AdaBoostRegressor
ab_model = AdaBoostRegressor(random_state=0, n_estimators=300)
ab_model.fit(X_train_tfidf, Y_train)

AdaBoostRegressor(base_estimator=None, learning_rate=1.0, loss='linear',
                  n_estimators=300, random_state=0)

In [26]:
from sklearn.metrics import explained_variance_score
polling_predictions = ab_model.predict(X_train_tfidf)
explained_variance_score(Y_train, polling_predictions)

0.9707626421393457

This is a great score, but the model was trained on this same data, so a high explained variance is expected and says little about how well the model generalizes.
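As a reminder of what the metric measures, explained variance is `1 - Var(y - y_hat) / Var(y)`. The sketch below implements that formula in plain Python. The shifted predictions are a made-up example showing one quirk of the metric: a constant offset in the predictions is not penalised.

```python
def variance(values):
    """Population variance."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def explained_variance(y_true, y_pred):
    """1 - Var(residuals) / Var(y_true)."""
    residuals = [t - p for t, p in zip(y_true, y_pred)]
    return 1 - variance(residuals) / variance(y_true)

y_true = [8.4, 25.7, 19.8, 3.6]

# Perfect predictions.
perfect = explained_variance(y_true, y_true)

# Every prediction is 5 points too high (hypothetical).
shifted = explained_variance(y_true, [v + 5 for v in y_true])

print(perfect, shifted)
```

Both scores come out at (essentially) 1, so a model that is consistently too high or too low can still score well on this metric.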

## Pipeline 1 (AdaBoost on Test Set)

In [27]:
from sklearn.pipeline import Pipeline
pipeline = Pipeline([
    ('bag_of_words', CountVectorizer(analyzer=corpus_preprocessing)),  
    ('tfidf', TfidfTransformer()),  
    ('regressor', AdaBoostRegressor(random_state=0, n_estimators=300)),
])


In [28]:
pipeline.fit(X_train, Y_train)

Pipeline(memory=None,
         steps=[('bag_of_words',
                 CountVectorizer(analyzer=<function corpus_preprocessing at 0x000002857D19FE18>,
                                 binary=False, decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, vocabulary=None)),
                ('tfidf',
                 TfidfTransformer(norm='l2', smooth_idf=True,
                                  sublinear_tf=False, use_idf=True)),
                ('regressor',
                 AdaBoostRegressor(base_estimator=None, learning_rate=1.0,
                 

In [29]:
nh_predictions = pipeline.predict(X_test)
explained_variance_score(Y_test, nh_predictions)

0.32042210790918135

In [30]:
pd.DataFrame([nh_predictions,Y_test,dfcomplete['speaker'].iloc[0:14]])

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13
0,15.6524,17.7846,2.85672,3.47719,2.29322,8.73269,3.76377,2.07174,2.09655,2.11143,1.9902,1.89687,1.90952,2.00571
1,8.4,25.7,19.8,3.6,2.8,9.2,24.4,0,0,0,0,0,0,0
2,Joe Biden,Bernie Sanders,Amy Klobuchar,Tom Steyer,Andrew Yang,Elizabeth Warren,Pete Buttigieg,Linsey Davis,David Muir,Monica Hernandez,Adam Sexton,Devin Dwyer,Rachel Scott,Announcer


The above explained variance is not very good. In particular, the model predicts roughly 2% for each of the non-candidates, who should be at 0, and badly under-estimates Pete (3.8 predicted vs. 24.4 actual) and Amy (2.9 vs. 19.8).

## Pipeline 2 (Gradient Boosting for Interval Estimates)

Since the predictions are off for a lot of candidates, I wanted to estimate lower and upper bounds around our polling predictions. We create three gradient boosting models to do this: one for the 10th percentile, one for the median, and one for the 90th percentile.

In [31]:
# Let's create prediction intervals: lower-bound, median, and upper-bound models
from sklearn.ensemble import GradientBoostingRegressor
LBpipeline = Pipeline([
    ('bag_of_words', CountVectorizer(analyzer=corpus_preprocessing)),  
    ('tfidf', TfidfTransformer()),  
    ('regressor', GradientBoostingRegressor(loss='quantile',alpha=0.1, n_estimators=1000, random_state=0)),
])
Medpipeline = Pipeline([
    ('bag_of_words', CountVectorizer(analyzer=corpus_preprocessing)),  
    ('tfidf', TfidfTransformer()),  
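    # loss='lad' (least absolute deviation) fits the median; scikit-learn 1.0 renamed it 'absolute_error'
    ('regressor', GradientBoostingRegressor(loss='lad', n_estimators=1000, random_state=0)),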
])
UBpipeline = Pipeline([
    ('bag_of_words', CountVectorizer(analyzer=corpus_preprocessing)),  
    ('tfidf', TfidfTransformer()),  
    ('regressor', GradientBoostingRegressor(loss='quantile',alpha=0.9, n_estimators=1000, random_state=0)),
])
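The quantile loss used by the lower- and upper-bound models is the so-called pinball loss. Here is a minimal plain-Python sketch of it, with made-up numbers, to show why alpha=0.9 pushes predictions upward and alpha=0.1 pushes them downward.

```python
def pinball_loss(y_true, y_pred, alpha):
    """Mean quantile (pinball) loss for quantile level alpha."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        diff = t - p
        # Under-predictions cost alpha per unit, over-predictions cost (1 - alpha).
        total += alpha * diff if diff >= 0 else (alpha - 1) * diff
    return total / len(y_true)

y_true = [10.0, 10.0]
under = [8.0, 8.0]    # predictions 2 points too low
over = [12.0, 12.0]   # predictions 2 points too high

for alpha in (0.1, 0.5, 0.9):
    print(alpha, pinball_loss(y_true, under, alpha), pinball_loss(y_true, over, alpha))
```

At alpha=0.9 an under-prediction costs nine times as much as an over-prediction of the same size, so the fitted model settles near the 90th percentile. At alpha=0.5 the two are penalised equally, which is half the absolute error and therefore targets the median, like the 'lad' model.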


In [32]:
LBpipeline.fit(X_train, Y_train)
Medpipeline.fit(X_train, Y_train)
UBpipeline.fit(X_train, Y_train)


Pipeline(memory=None,
         steps=[('bag_of_words',
                 CountVectorizer(analyzer=<function corpus_preprocessing at 0x000002857D19FE18>,
                                 binary=False, decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=None, min_df=1,
                                 ngram_range=(1, 1), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_patte...
                                           learning_rate=0.1, loss='quantile',
                                           max_depth=3, max_features=None,
                                           max_leaf_nodes=None,
                                           min_impurity_decrease=0.0,
                                           min_impurity_split=None,
            

In [33]:
LBpredictions=LBpipeline.predict(X_test)
Medpredictions=Medpipeline.predict(X_test)
UBpredictions=UBpipeline.predict(X_test)


In [34]:
pd.DataFrame([LBpredictions,Medpredictions,UBpredictions,Y_test,dfcomplete['speaker'].iloc[0:14]])

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13
0,0.488535,0.411983,0.475511,0.787765,1.29721,0.489155,0.399317,0.413072,0.0205239,0.0681894,-1.12729e-06,0.000383388,8.37362e-05,-3.29463e-47
1,16.3454,17.9387,5.28919,5.75073,2.26232,4.56206,3.48202,-0.0787657,0.0803889,-0.178447,4.72191e-05,-0.295439,0.506816,0.0218599
2,20.9097,21.2422,11.2331,12.3585,6.99945,8.67148,6.56725,6.72459,6.81305,7.71983,7.50158,6.82588,6.85551,6.80128
3,8.4,25.7,19.8,3.6,2.8,9.2,24.4,0,0,0,0,0,0,0
4,Joe Biden,Bernie Sanders,Amy Klobuchar,Tom Steyer,Andrew Yang,Elizabeth Warren,Pete Buttigieg,Linsey Davis,David Muir,Monica Hernandez,Adam Sexton,Devin Dwyer,Rachel Scott,Announcer
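One quick way to judge the interval is to count how many actual results fall between the lower and upper bound. The sketch below copies the numbers from the table above (rows 0, 2, 3 and 4) and does that count in plain Python.

```python
# Values copied from the output table above.
lower = [0.488535, 0.411983, 0.475511, 0.787765, 1.29721, 0.489155, 0.399317,
         0.413072, 0.0205239, 0.0681894, -1.12729e-06, 0.000383388,
         8.37362e-05, -3.29463e-47]
upper = [20.9097, 21.2422, 11.2331, 12.3585, 6.99945, 8.67148, 6.56725,
         6.72459, 6.81305, 7.71983, 7.50158, 6.82588, 6.85551, 6.80128]
actual = [8.4, 25.7, 19.8, 3.6, 2.8, 9.2, 24.4, 0, 0, 0, 0, 0, 0, 0]
names = ["Joe Biden", "Bernie Sanders", "Amy Klobuchar", "Tom Steyer",
         "Andrew Yang", "Elizabeth Warren", "Pete Buttigieg", "Linsey Davis",
         "David Muir", "Monica Hernandez", "Adam Sexton", "Devin Dwyer",
         "Rachel Scott", "Announcer"]

# True where the actual result lies inside [lower, upper].
inside = [lo <= y <= hi for lo, y, hi in zip(lower, actual, upper)]
missed = [n for n, ok in zip(names, inside) if not ok]

print(sum(inside), "of", len(actual), "inside the interval")  # 5 of 14
print(missed)
```

Only 5 of the 14 rows are covered. Four candidates (Sanders, Klobuchar, Warren, Buttigieg) land above the upper bound, and most of the non-candidates miss only because the lower bound sits a hair above 0.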


So the bounds don't really give us anything more helpful. The interval does not capture the true values for Pete or Amy, and Bernie and Warren also land above the upper bound. One guess for why the method over-estimates Biden and under-estimates Pete and Amy is that Biden's speech patterns have remained consistent throughout the debates and that Biden was polling at a very high number at the start of the debate season. One thing to keep in mind is that corpuses spoken at more recent debates should carry more weight than speeches from June or July. To account for this, we first look at a results dataframe and then weight the polling results of the training data according to recency.

## Weighting Based on Recency

In [35]:
results_df = dfcomplete.iloc[:,0:3].groupby(['debate_name','speaker'])['poll_results'].sum().unstack().fillna(0)
results_df

debate_name,A. Cooper,Abby Phillips,Adam Sexton,Amna Nawaz,Amy Klobuchar,Amy Walter,Anderson Cooper,Andrea Mitchell,Andrew Yang,Announcer,...,Stephanie Sy,Steve Bullock,Steve Kornacki,Tim Alberta,Tim Ryan,Tom Steyer,Tulsi Gabbard,Voiceover,Wolf Blitzer,Yamiche A.
1st Debate,0.0,0.0,0.0,0.0,3.7,0.0,0.0,0.0,1.7,0.0,...,0.0,0.0,0.0,0.0,1.3,0.0,1.3,0.0,0.0,0.0
2nd Debate,0.0,0.0,0.0,0.0,4.0,0.0,0.0,0.0,2.0,0.0,...,0.0,1.0,0.0,0.0,1.3,0.0,1.3,0.0,0.0,0.0
Atlanta Debate,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Houston Debate,0.0,0.0,0.0,0.0,3.3,0.0,0.0,0.0,3.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
Iowa Debate,0.0,0.0,0.0,0.0,12.3,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.3,0.0,0.0,0.0,0.0
Los Angeles Debate,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
New Hampshire Debate,0.0,0.0,0.0,0.0,19.8,0.0,0.0,0.0,2.8,0.0,...,0.0,0.0,0.0,0.0,0.0,3.6,0.0,0.0,0.0,0.0
Ohio Debate,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,3.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0


In [36]:
weight_debate=np.linspace(0.2,1.8,8)
weights_map = {uqdebate[i]:np.flip(weight_debate)[i] for i in range(len(uqdebate))}

In [37]:
def debate_weighter(x):
    # Look up the recency weight for a debate label; returns None if the label is not in weights_map
    return weights_map.get(x)


In [38]:
dfcomplete['recency_weights'] = dfcomplete['debate_name'].apply(debate_weighter)
dfcomplete


Unnamed: 0,poll_results,debate_name,speaker,speech,recency_weights
0,8.4,New Hampshire Debate,Joe Biden,"Oh, they didn't miss anything. It's a long rac...",1.800000
1,25.7,New Hampshire Debate,Bernie Sanders,Because Donald Trump lies all the time. It doe...,1.800000
2,19.8,New Hampshire Debate,Amy Klobuchar,Bernie and I work together all the time. But I...,1.800000
3,3.6,New Hampshire Debate,Tom Steyer,"I don't think there's any question, George, th...",1.800000
4,2.8,New Hampshire Debate,Andrew Yang,"First, let me say America, it's great to be ba...",1.800000
5,9.2,New Hampshire Debate,Elizabeth Warren,"Oh, Bernie and I have been friends for a long ...",1.800000
6,24.4,New Hampshire Debate,Pete Buttigieg,I'm not interested in the labels. I'm not inte...,1.800000
7,0.0,New Hampshire Debate,Linsey Davis,"Thank you, Senator. David. I want to turn now ...",1.800000
8,0.0,New Hampshire Debate,David Muir,"Lindsey, thank you. Good evening, all. I want ...",1.800000
9,0.0,New Hampshire Debate,Monica Hernandez,Thank you George. It's an honor to be here in ...,1.800000


In [39]:
# Scale each training target by its debate's recency weight (rows selected by label, not by position)
Y_train_weighted = Y_train * dfcomplete.loc[Y_train.index, 'recency_weights']
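Multiplying the targets by the weights changes the scale of what the model is asked to predict. An alternative I could try is to leave the targets alone and pass the recency weights as sample weights, which a scikit-learn Pipeline forwards to a step through the `stepname__sample_weight` fit parameter. The sketch below is a toy illustration under stated assumptions: the texts, polls and weights are invented, TfidfVectorizer stands in for the CountVectorizer plus TfidfTransformer pair, and the regressor uses its default loss with only 20 trees so it runs quickly.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline

# Invented toy data: six short "corpuses" with made-up polling numbers.
texts = ["health care for all", "tax the wealthy", "restore the middle class",
         "thank you senator next question", "climate change plan", "welcome to the debate"]
polls = np.array([20.0, 15.0, 25.0, 0.0, 5.0, 0.0])

# Larger weight = more recent debate (hypothetical values).
weights = np.array([0.2, 0.2, 1.0, 1.0, 1.8, 1.8])

sketch = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("regressor", GradientBoostingRegressor(n_estimators=20, random_state=0)),
])

# "regressor__sample_weight" is routed to GradientBoostingRegressor.fit.
sketch.fit(texts, polls, regressor__sample_weight=weights)
preds = sketch.predict(texts)
print(preds.round(2))
```

With sample weights, recent debates count more in the loss while predictions stay on the original polling scale.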


In [40]:
Medpipeline.fit(X_train, Y_train_weighted)
nh_predictions3=Medpipeline.predict(X_test)
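# Note: this displays nh_predictions (the AdaBoost output from In [29]), not nh_predictions3;
# pass nh_predictions3 instead to compare the recency-weighted predictions with the actual results.
pd.DataFrame([nh_predictions,Y_test,dfcomplete['speaker'].iloc[0:14]])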

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13
0,15.6524,17.7846,2.85672,3.47719,2.29322,8.73269,3.76377,2.07174,2.09655,2.11143,1.9902,1.89687,1.90952,2.00571
1,8.4,25.7,19.8,3.6,2.8,9.2,24.4,0,0,0,0,0,0,0
2,Joe Biden,Bernie Sanders,Amy Klobuchar,Tom Steyer,Andrew Yang,Elizabeth Warren,Pete Buttigieg,Linsey Davis,David Muir,Monica Hernandez,Adam Sexton,Devin Dwyer,Rachel Scott,Announcer


In [41]:
explained_variance_score(Y_test,nh_predictions3)

0.4847245510807202

Scoring the recency-weighted median model on the New Hampshire data gives an explained variance of about 0.48, up from 0.32 for the AdaBoost pipeline. Finally, I run a grid search over a set of hyper-parameters to choose the best model. This grid search is computationally heavy, so we use parallel computing through GridSearchCV's built-in n_jobs parameter and then "pickle" the fitted model afterwards so we don't have to pay for retraining it every time.

## Hyper-Parameter Tuning, Parallel Computing, and Saving the Fitted Model

In [42]:
from sklearn.model_selection import GridSearchCV

In [43]:
param_grid = {"regressor__max_features" : [None, "sqrt"],
             "regressor__max_depth" : [None, 6, 8, 10],
             "regressor__max_leaf_nodes": [None, 5, 10, 20], 
             "regressor__min_impurity_decrease": [0, 0.2, 0.3]}

In [44]:
grid = GridSearchCV(Medpipeline, param_grid=param_grid, cv=3, n_jobs = -1)
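To get a feel for why this search is heavy, we can count the work involved. The sketch below enumerates the grid defined above with itertools and multiplies by the 3 cross-validation folds; it does not fit anything.

```python
from itertools import product

# Same grid as above.
param_grid = {"regressor__max_features": [None, "sqrt"],
              "regressor__max_depth": [None, 6, 8, 10],
              "regressor__max_leaf_nodes": [None, 5, 10, 20],
              "regressor__min_impurity_decrease": [0, 0.2, 0.3]}

# Every combination of one value per parameter.
combos = list(product(*param_grid.values()))
cv_folds = 3
n_fits = len(combos) * cv_folds

print(len(combos), "combinations,", n_fits, "cross-validation fits")  # 96, 288
```

Each of those fits trains a pipeline with 1000 boosting stages, and by default GridSearchCV refits the best combination once more on the full training set, so spreading the work over all cores with n_jobs=-1 helps a lot.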

In [45]:
#grid.fit(X_train, Y_train_weighted)

In [46]:
import dill as pickle
filename = 'debate_model_v1.pk'

#with open('File Path'+filename, 'wb') as file:
      #pickle.dump(grid, file)

with open('Insert File Path'+filename ,'rb') as file:
    debate_model = pickle.load(file)
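For reference, the save-then-load round trip looks roughly like the sketch below. It uses the standard library's pickle instead of dill, a plain dictionary as a stand-in for the fitted grid search object, and a temporary directory instead of a real file path, all of which are assumptions made so the example is self-contained.

```python
import os
import pickle
import tempfile

# Placeholder object standing in for a fitted GridSearchCV (values are made up).
model = {"name": "stand-in for fitted model", "params": {"max_depth": 6}}

with tempfile.TemporaryDirectory() as tmp_dir:
    # os.path.join avoids having to remember the trailing slash on the folder.
    path = os.path.join(tmp_dir, "debate_model_v1.pk")

    # Save once after the expensive fit...
    with open(path, "wb") as file:
        pickle.dump(model, file)

    # ...and load it back in later sessions instead of refitting.
    with open(path, "rb") as file:
        restored = pickle.load(file)

print(restored == model)  # True
```

One caveat: a pickled scikit-learn model should be loaded with the same library version that saved it, and pickles from untrusted sources should never be loaded.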

In [47]:
debate_model.predict(X_test)

array([ 1.59839215e+01,  1.12476772e+01,  5.27597180e+00,  4.50754656e+00,
        2.55450579e-01,  1.20031507e+01,  6.15283749e+00,  2.12468947e-01,
        1.37895209e+00, -1.91958041e-01, -3.78835596e-01,  1.12836145e+00,
        1.00827210e-01,  7.75666380e-03])

In [48]:
explained_variance_score(Y_test,debate_model.predict(X_test))

0.38432296759927875

In [49]:
pd.DataFrame([debate_model.predict(X_test),Y_test,dfcomplete['speaker'].iloc[0:14]])

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13
0,15.9839,11.2477,5.27597,4.50755,0.255451,12.0032,6.15284,0.212469,1.37895,-0.191958,-0.378836,1.12836,0.100827,0.00775666
1,8.4,25.7,19.8,3.6,2.8,9.2,24.4,0,0,0,0,0,0,0
2,Joe Biden,Bernie Sanders,Amy Klobuchar,Tom Steyer,Andrew Yang,Elizabeth Warren,Pete Buttigieg,Linsey Davis,David Muir,Monica Hernandez,Adam Sexton,Devin Dwyer,Rachel Scott,Announcer


## Final Thoughts and Analysis

The tuned model's explained variance (0.38) is actually lower than the untuned recency-weighted model's (0.48), but it does a great job of recognizing who is not a candidate for president and predicts very low polling for them as a result. One humorous note is that this model predicts Andrew Yang nearly as low as the non-candidates, and he withdrew from the race shortly after the New Hampshire debate. One thing to note is that the debates are technically not independent, even though the model treats them as individual events. A human voter will remember past debates and what a candidate stands for, even if the candidate does not put on a strong showing in an individual debate.

For example, maybe the words that Buttigieg spoke in the New Hampshire debate were only strong enough to net him 6.15% of the voters, but because of his past success in Iowa, some voters will back him on that basis rather than on the words spoken at the New Hampshire debate alone.

Some improvements I could make would be spending more time on the text preprocessing to produce more informative features. Other improvements could incorporate each candidate's past results, but at the moment I like the idea that I can input any string of speech and get a polling result out. One thing to note is that the polling results will generally not add up to 100, since some people poll as undecided, so we did not put restrictions on the model's outputs. Another route would be to treat each number as a percent chance to win, i.e. a probability, and use a classifier that outputs probabilities. That would be a 0-1 binary classification where 1 = "You won the poll" and 0 = "You lost the poll". Many classifiers output the probability that a prediction belongs to class 1, and we could use that as a weaker notion of polling results.

Overall this project was a lot of fun. I got to brush up on some of my basic programming and machine learning skills, as most of my time is dedicated to studying for my Master's degree. This specific dataset called to me a bit, as it covers a current events topic that I keep track of and I wanted to practice some text processing. I also got to implement "pickling", which I hadn't done before. Some next steps that I could take would be:

1. Restart the project and take a large amount of time on finding the best method of text preprocessing for this use-case
2. Keep the current text to vector method and spend time on the model selection process
3. Transform my currently saved model into a usable API that takes in any string of text data