### Language modeling is the task of predicting the next word, given the preceding history.

### Sentiment detection is just a special case of classification

**Data sets with fields:**

*Combined_Comments* := comment_id, author, author_flair, score, comment_name, comment_fullname, comment_is_root, comment_parent, comment_created, comment_created_utc, comment_created_utc_datetime, comment_created_utc_date, comment_created_utc_time, comment_depth, comment_body, submission_id, submission_title, submission_created_utc**

*Clean_Game_Data* := index, unnamed: 0, playnum, playid, 'Game Title Date', text, homeWinPercentage, matched_play_by_play_text, matched_play_by_play_index, matched_play_by_play_utc, matched_play_by_play_tweetid, home_team, away_team, awayWinPercentage

*Pickle files in Clean_Game_Data* := author, author_flair, score, comment_id, comment_name, comment_fullname, comment_is_root, comment_parent, comment_approved_at_utc, comment_approved_by, matched_play_by_play_utc, matched_play_by_play_tweetid, home_team, away_team, awayWinPercentage, vader_ss, vader_neg, vader_neu, vader_pos, vader_compound

*Comments_FanOfGame* := comment_body (from Reddit), fan_of_team_playing


**Ideas for data to model**

*-------------1-------------*

*Dependent var* := game state

*Independent vars* := comment_body, fan_of_team_playing

*-------------2-------------*

*Dependent var* := fan_of_team_playing

*Independent vars* := comment_body, game_state

*-------------3-------------*

*Dependent var* := author_flair

*Independent vars* := comment_body, game_fan_state (fan_team_prob_win, fan_team_prob_lose, fan_no_team)

*-------------4-------------*

*Dependent var* := author_game_state or game_fan_state (fan_team_prob_win, fan_team_prob_lose, fan_no_team)

*Independent vars* := comment_body

*-------------5-------------*

*Dependent var* := next word

*Independent vars* := previous word


*--------------------------*

*Next step* := apply language model to each game and examine by game_state, fan_of_team_playing

**Possible comment label combinations (author_game_state)**

*fan/close*

*fan/blowout*

*notfan/close*

*notfan/blowout*

*fan/lose*

*fan/win*

In [427]:
import numpy as np
import pandas as pd
import re
import pickle
import itertools
from __future__ import print_function
from __future__ import division

# SK-learn libraries for learning.
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import BernoulliNB
from sklearn.grid_search import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet

# SK-learn libraries for evaluation.
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# SK-learn libraries for feature extraction from text.
from sklearn.feature_extraction.text import *

# NLTK libs
from nltk.tokenize import TweetTokenizer

## Data Processing

In [3]:
## Load comments by game
files = [
    'Bears_vs_Packers__2017-09-28_comment_sentiment.pickle',
    'Broncos_vs_Chiefs__2017-10-30_comment_sentiment.pickle',
    'Chargers_vs_Cowboys__2017-11-23_comment_sentiment.pickle',
    'Chiefs_vs_Patriots__2017-09-07_comment_sentiment.pickle',
    'Chiefs_vs_Raiders__2017-10-19_comment_sentiment.pickle',
    'Cowboys_vs_Cardinals__2017-09-25_comment_sentiment.pickle',
    'Cowboys_vs_Raiders__2017-12-17_comment_sentiment.pickle',
    'Eagles_vs_Panthers__2017-10-12_comment_sentiment.pickle',
    'Falcons_vs_Buccaneers__2017-12-18_comment_sentiment.pickle',
    'Falcons_vs_Patriots__2017-10-22_comment_sentiment.pickle',
    'Falcons_vs_Seahawks__2017-11-20_comment_sentiment.pickle',
    'Giants_vs_Cowboys__2017-09-10_comment_sentiment.pickle',
    'Jaguars_vs_Patriots__2018-01-21_comment_sentiment.pickle',
    'Lions_vs_Giants__2017-09-18_comment_sentiment.pickle',
    'Lions_vs_Packers__2017-11-06_comment_sentiment.pickle',
    'Packers_vs_Panthers__2017-12-17_comment_sentiment.pickle',
    'Packers_vs_Vikings__2017-10-15_comment_sentiment.pickle',
    'Patriots_vs_Dolphins__2017-12-11_comment_sentiment.pickle',
    'Raiders_vs_Eagles__2017-12-25_comment_sentiment.pickle',
    'Raiders_vs_Redskins__2017-09-24_comment_sentiment.pickle',
    'Rams_vs_49ers__2017-09-21_comment_sentiment.pickle',
    'Redskins_vs_Chiefs__2017-10-02_comment_sentiment.pickle',
    'Redskins_vs_Cowboys__2017-11-30_comment_sentiment.pickle',
    'Redskins_vs_Eagles__2017-10-23_comment_sentiment.pickle',
    'Saints_vs_Falcons__2017-12-07_comment_sentiment.pickle',
    'Saints_vs_Vikings__2017-09-11_comment_sentiment.pickle',
    'Seahawks_vs_Cardinals__2017-11-09_comment_sentiment.pickle',
    'Steelers_vs_Bengals__2017-12-04_comment_sentiment.pickle',
    'Steelers_vs_Lions__2017-10-29_comment_sentiment.pickle',
    'Texans_vs_Bengals__2017-09-14_comment_sentiment.pickle',
    'Vikings_vs_Packers__2017-12-23_comment_sentiment.pickle',
    'Vikings_vs_Panthers__2017-12-10_comment_sentiment.pickle']

In [5]:
path = "/Users/chadharness/mids/w266/w266_final_project/Clean_Game_Data/"

for index, filename in enumerate(files):
    print(path+filename)
    if index == 0:
        data = pd.read_pickle(path+filename)
        print(data.head())
    else:
        temp_data = pd.read_pickle(path+filename)
        data = data.append(temp_data)
        print(data.head())

/Users/chadharness/mids/w266/w266_final_project/Clean_Game_Data/Bears_vs_Packers__2017-09-28_comment_sentiment.pickle
                author author_flair score comment_id comment_name  \
0       Street_Spirit_      Raiders     2    dnnfsxx   t1_dnnfsxx   
1           irishkid46        Bears     1    dnnft5a   t1_dnnft5a   
2  SportsMasterGeneral        Bears    12    dnnft8q   t1_dnnft8q   
3       Street_Spirit_      Raiders    24    dnnftgx   t1_dnnftgx   
4      Fight_For_Tacos         None     0    dnnftkp   t1_dnnftkp   

  comment_fullname comment_is_root comment_parent comment_approved_at_utc  \
0       t1_dnnfsxx            True         7344it                    None   
1       t1_dnnft5a            True         7344it                    None   
2       t1_dnnft8q            True         7344it                    None   
3       t1_dnnftgx            True         7344it                    None   
4       t1_dnnftkp            True         7344it                    None   

  co

                author author_flair score comment_id comment_name  \
0       Street_Spirit_      Raiders     2    dnnfsxx   t1_dnnfsxx   
1           irishkid46        Bears     1    dnnft5a   t1_dnnft5a   
2  SportsMasterGeneral        Bears    12    dnnft8q   t1_dnnft8q   
3       Street_Spirit_      Raiders    24    dnnftgx   t1_dnnftgx   
4      Fight_For_Tacos         None     0    dnnftkp   t1_dnnftkp   

  comment_fullname comment_is_root comment_parent comment_approved_at_utc  \
0       t1_dnnfsxx            True         7344it                    None   
1       t1_dnnft5a            True         7344it                    None   
2       t1_dnnft8q            True         7344it                    None   
3       t1_dnnftgx            True         7344it                    None   
4       t1_dnnftkp            True         7344it                    None   

  comment_approved_by      ...        matched_play_by_play_utc  \
0                None      ...             2017-09-29 00

                author author_flair score comment_id comment_name  \
0       Street_Spirit_      Raiders     2    dnnfsxx   t1_dnnfsxx   
1           irishkid46        Bears     1    dnnft5a   t1_dnnft5a   
2  SportsMasterGeneral        Bears    12    dnnft8q   t1_dnnft8q   
3       Street_Spirit_      Raiders    24    dnnftgx   t1_dnnftgx   
4      Fight_For_Tacos         None     0    dnnftkp   t1_dnnftkp   

  comment_fullname comment_is_root comment_parent comment_approved_at_utc  \
0       t1_dnnfsxx            True         7344it                    None   
1       t1_dnnft5a            True         7344it                    None   
2       t1_dnnft8q            True         7344it                    None   
3       t1_dnnftgx            True         7344it                    None   
4       t1_dnnftkp            True         7344it                    None   

  comment_approved_by      ...        matched_play_by_play_utc  \
0                None      ...             2017-09-29 00

                author author_flair score comment_id comment_name  \
0       Street_Spirit_      Raiders     2    dnnfsxx   t1_dnnfsxx   
1           irishkid46        Bears     1    dnnft5a   t1_dnnft5a   
2  SportsMasterGeneral        Bears    12    dnnft8q   t1_dnnft8q   
3       Street_Spirit_      Raiders    24    dnnftgx   t1_dnnftgx   
4      Fight_For_Tacos         None     0    dnnftkp   t1_dnnftkp   

  comment_fullname comment_is_root comment_parent comment_approved_at_utc  \
0       t1_dnnfsxx            True         7344it                    None   
1       t1_dnnft5a            True         7344it                    None   
2       t1_dnnft8q            True         7344it                    None   
3       t1_dnnftgx            True         7344it                    None   
4       t1_dnnftkp            True         7344it                    None   

  comment_approved_by      ...        matched_play_by_play_utc  \
0                None      ...             2017-09-29 00

                author author_flair score comment_id comment_name  \
0       Street_Spirit_      Raiders     2    dnnfsxx   t1_dnnfsxx   
1           irishkid46        Bears     1    dnnft5a   t1_dnnft5a   
2  SportsMasterGeneral        Bears    12    dnnft8q   t1_dnnft8q   
3       Street_Spirit_      Raiders    24    dnnftgx   t1_dnnftgx   
4      Fight_For_Tacos         None     0    dnnftkp   t1_dnnftkp   

  comment_fullname comment_is_root comment_parent comment_approved_at_utc  \
0       t1_dnnfsxx            True         7344it                    None   
1       t1_dnnft5a            True         7344it                    None   
2       t1_dnnft8q            True         7344it                    None   
3       t1_dnnftgx            True         7344it                    None   
4       t1_dnnftkp            True         7344it                    None   

  comment_approved_by      ...        matched_play_by_play_utc  \
0                None      ...             2017-09-29 00

                author author_flair score comment_id comment_name  \
0       Street_Spirit_      Raiders     2    dnnfsxx   t1_dnnfsxx   
1           irishkid46        Bears     1    dnnft5a   t1_dnnft5a   
2  SportsMasterGeneral        Bears    12    dnnft8q   t1_dnnft8q   
3       Street_Spirit_      Raiders    24    dnnftgx   t1_dnnftgx   
4      Fight_For_Tacos         None     0    dnnftkp   t1_dnnftkp   

  comment_fullname comment_is_root comment_parent comment_approved_at_utc  \
0       t1_dnnfsxx            True         7344it                    None   
1       t1_dnnft5a            True         7344it                    None   
2       t1_dnnft8q            True         7344it                    None   
3       t1_dnnftgx            True         7344it                    None   
4       t1_dnnftkp            True         7344it                    None   

  comment_approved_by      ...        matched_play_by_play_utc  \
0                None      ...             2017-09-29 00

                author author_flair score comment_id comment_name  \
0       Street_Spirit_      Raiders     2    dnnfsxx   t1_dnnfsxx   
1           irishkid46        Bears     1    dnnft5a   t1_dnnft5a   
2  SportsMasterGeneral        Bears    12    dnnft8q   t1_dnnft8q   
3       Street_Spirit_      Raiders    24    dnnftgx   t1_dnnftgx   
4      Fight_For_Tacos         None     0    dnnftkp   t1_dnnftkp   

  comment_fullname comment_is_root comment_parent comment_approved_at_utc  \
0       t1_dnnfsxx            True         7344it                    None   
1       t1_dnnft5a            True         7344it                    None   
2       t1_dnnft8q            True         7344it                    None   
3       t1_dnnftgx            True         7344it                    None   
4       t1_dnnftkp            True         7344it                    None   

  comment_approved_by      ...        matched_play_by_play_utc  \
0                None      ...             2017-09-29 00

                author author_flair score comment_id comment_name  \
0       Street_Spirit_      Raiders     2    dnnfsxx   t1_dnnfsxx   
1           irishkid46        Bears     1    dnnft5a   t1_dnnft5a   
2  SportsMasterGeneral        Bears    12    dnnft8q   t1_dnnft8q   
3       Street_Spirit_      Raiders    24    dnnftgx   t1_dnnftgx   
4      Fight_For_Tacos         None     0    dnnftkp   t1_dnnftkp   

  comment_fullname comment_is_root comment_parent comment_approved_at_utc  \
0       t1_dnnfsxx            True         7344it                    None   
1       t1_dnnft5a            True         7344it                    None   
2       t1_dnnft8q            True         7344it                    None   
3       t1_dnnftgx            True         7344it                    None   
4       t1_dnnftkp            True         7344it                    None   

  comment_approved_by      ...        matched_play_by_play_utc  \
0                None      ...             2017-09-29 00

In [5]:
data.head()

Unnamed: 0,author,author_flair,score,comment_id,comment_name,comment_fullname,comment_is_root,comment_parent,comment_approved_at_utc,comment_approved_by,...,matched_play_by_play_utc,matched_play_by_play_tweetid,home_team,away_team,awayWinPercentage,vader_ss,vader_neg,vader_neu,vader_pos,vader_compound
0,Street_Spirit_,Raiders,2,dnnfsxx,t1_dnnfsxx,t1_dnnfsxx,True,7344it,,,...,2017-09-29 00:30:01,9.135615e+17,Packers,Bears,0.161,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...",0.0,1.0,0.0,0.0
1,irishkid46,Bears,1,dnnft5a,t1_dnnft5a,t1_dnnft5a,True,7344it,,,...,2017-09-29 00:30:01,9.135615e+17,Packers,Bears,0.161,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...",0.0,1.0,0.0,0.0
2,SportsMasterGeneral,Bears,12,dnnft8q,t1_dnnft8q,t1_dnnft8q,True,7344it,,,...,2017-09-29 00:30:01,9.135615e+17,Packers,Bears,0.161,"{'neg': 0.278, 'neu': 0.722, 'pos': 0.0, 'comp...",0.278,0.722,0.0,-0.5927
3,Street_Spirit_,Raiders,24,dnnftgx,t1_dnnftgx,t1_dnnftgx,True,7344it,,,...,2017-09-29 00:30:01,9.135615e+17,Packers,Bears,0.161,"{'neg': 0.636, 'neu': 0.364, 'pos': 0.0, 'comp...",0.636,0.364,0.0,-0.5423
4,Fight_For_Tacos,,0,dnnftkp,t1_dnnftkp,t1_dnnftkp,True,7344it,,,...,2017-09-29 00:30:01,9.135615e+17,Packers,Bears,0.161,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...",0.0,1.0,0.0,0.0


In [6]:
data[data.author == 'Scaryclouds'].comment_body.head()

8784                                Fisher you fat fuck! 
8814    Hmm, looks like a bad spot, I think Smith got ...
8825                           Yea, first down for sure. 
8843    God damn, our o-line is just terrible at run b...
8860      Part of it is injuries to our interior o-line. 
Name: comment_body, dtype: object

In [8]:
list(data.columns.values)

['author',
 'author_flair',
 'score',
 'comment_id',
 'comment_name',
 'comment_fullname',
 'comment_is_root',
 'comment_parent',
 'comment_approved_at_utc',
 'comment_approved_by',
 'comment_created',
 'comment_created_utc',
 'comment_created_utc_datetime',
 'comment_created_utc_date',
 'comment_created_utc_time',
 'comment_banned_at_utc',
 'comment_banned_by',
 'comment_depth',
 'comment_num_reports',
 'comment_body',
 'comment_body_parsed',
 'submission_id',
 'submission_title',
 'submission_created_utc',
 'playId',
 'index',
 'Unnamed: 0',
 'playnum',
 'Game Title Date',
 'text',
 'homeWinPercentage',
 'matched_play_by_play_text',
 'matched_play_by_play_index',
 'matched_play_by_play_utc',
 'matched_play_by_play_tweetid',
 'home_team',
 'away_team',
 'awayWinPercentage',
 'vader_ss',
 'vader_neg',
 'vader_neu',
 'vader_pos',
 'vader_compound']

**Model variables**

*Dependent var* := author_game_state or game_fan_state

*Independent vars* := comment_body


**game_fan_state values**

*fan_team_prob_win* 

*fan_team_prob_lose* 

*fan_no_team*


**author_game_state values**

*nofan_notclose*

*nofan_close*

*fan_lose_close*

*fan_lose_notclose*

*fan_win_close*

*fan_win_notclose*

### Create features for model

In [6]:
# Identify game state
data['win_differential'] = abs(data.homeWinPercentage - data.awayWinPercentage)

# Call it a win for away if away has same or higher win percentage
data['win_team'] = np.where(data.awayWinPercentage >= data.homeWinPercentage, 'away', 'home')

data['game_state'] = np.where(data.win_differential < 0.6, 'close', 'notclose')


In [97]:
data.win_differential.max()

1.0

In [98]:
data.win_differential.min()

0.0

In [7]:
# Identify author affiliation to game
data['fan_type'] = np.where(data.away_team == data.author_flair, 'away', 
                            np.where(data.home_team == data.author_flair, 'home', 'nofan'))


data['author_game_state'] = np.where(data.fan_type == 'nofan', 
                                     np.where(data.game_state == 'notclose', 'nofan_notclose', 'nofan_close'),
                                     np.where(data.game_state == 'notclose', 
                                             np.where(data.win_team == data.fan_type, 'fan_win_notclose','fan_lose_notclose'),
                                              np.where(data.win_team == data.fan_type, 'fan_win_close', 'fan_lose_close')))


In [18]:
data['author_game_state'].isnull().sum()

0

In [20]:
data[data.author_game_state == 'nofan_notclose'].head()

Unnamed: 0,author,author_flair,score,comment_id,comment_name,comment_fullname,comment_is_root,comment_parent,comment_approved_at_utc,comment_approved_by,...,vader_ss,vader_neg,vader_neu,vader_pos,vader_compound,fan_type,win_differential,win_team,game_state,author_game_state
0,Street_Spirit_,Raiders,2,dnnfsxx,t1_dnnfsxx,t1_dnnfsxx,True,7344it,,,...,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...",0.0,1.0,0.0,0.0,nofan,0.678,home,notclose,nofan_notclose
3,Street_Spirit_,Raiders,24,dnnftgx,t1_dnnftgx,t1_dnnftgx,True,7344it,,,...,"{'neg': 0.636, 'neu': 0.364, 'pos': 0.0, 'comp...",0.636,0.364,0.0,-0.5423,nofan,0.678,home,notclose,nofan_notclose
4,Fight_For_Tacos,,0,dnnftkp,t1_dnnftkp,t1_dnnftkp,True,7344it,,,...,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...",0.0,1.0,0.0,0.0,nofan,0.678,home,notclose,nofan_notclose
5,SirTopKekington,Lions,2,dnnfu1k,t1_dnnfu1k,t1_dnnfu1k,True,7344it,,,...,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...",0.0,1.0,0.0,0.0,nofan,0.678,home,notclose,nofan_notclose
7,Rufio330,Patriots,1,dnnfu4i,t1_dnnfu4i,t1_dnnfu4i,True,7344it,,,...,"{'neg': 0.0, 'neu': 0.714, 'pos': 0.286, 'comp...",0.0,0.714,0.286,0.8478,nofan,0.678,home,notclose,nofan_notclose


In [21]:
data[data.author_game_state == 'nofan_close'].head()

Unnamed: 0,author,author_flair,score,comment_id,comment_name,comment_fullname,comment_is_root,comment_parent,comment_approved_at_utc,comment_approved_by,...,vader_ss,vader_neg,vader_neu,vader_pos,vader_compound,fan_type,win_differential,win_team,game_state,author_game_state
0,Always_Sunnyvale,Buccaneers,35,dq90l39,t1_dq90l39,t1_dq90l39,True,7f2ii3,,,...,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...",0.0,1.0,0.0,0.0,nofan,0.14,away,close,nofan_close
1,MisterrAlex,Eagles,16,dq90ll8,t1_dq90ll8,t1_dq90ll8,False,dq90l39,,,...,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...",0.0,1.0,0.0,0.0,nofan,0.14,away,close,nofan_close
2,shaolin_1993,Patriots,6,dq90llk,t1_dq90llk,t1_dq90llk,True,7f2ii3,,,...,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...",0.0,1.0,0.0,0.0,nofan,0.14,away,close,nofan_close
4,That_One_Cool_Guy,Packers,3,dq90m45,t1_dq90m45,t1_dq90m45,True,7f2ii3,,,...,"{'neg': 0.0, 'neu': 0.53, 'pos': 0.47, 'compou...",0.0,0.53,0.47,0.7351,nofan,0.14,away,close,nofan_close
5,321polo,Redskins,9,dq90mg7,t1_dq90mg7,t1_dq90mg7,True,7f2ii3,,,...,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...",0.0,1.0,0.0,0.0,nofan,0.14,away,close,nofan_close


In [22]:
data[data.author_game_state == 'fan_win_notclose'].head()

Unnamed: 0,author,author_flair,score,comment_id,comment_name,comment_fullname,comment_is_root,comment_parent,comment_approved_at_utc,comment_approved_by,...,vader_ss,vader_neg,vader_neu,vader_pos,vader_compound,fan_type,win_differential,win_team,game_state,author_game_state
8,holiday_bandit,Packers,4,dnnfug4,t1_dnnfug4,t1_dnnfug4,True,7344it,,,...,"{'neg': 0.0, 'neu': 0.723, 'pos': 0.277, 'comp...",0.0,0.723,0.277,0.3182,home,0.678,home,notclose,fan_win_notclose
18,ComptonNWA,Packers,2,dnnfvid,t1_dnnfvid,t1_dnnfvid,True,7344it,,,...,"{'neg': 0.0, 'neu': 0.777, 'pos': 0.223, 'comp...",0.0,0.777,0.223,0.5106,home,0.678,home,notclose,fan_win_notclose
23,StarksofWinterfell89,Packers,99,dnnfwj2,t1_dnnfwj2,t1_dnnfwj2,True,7344it,,,...,"{'neg': 0.0, 'neu': 0.4, 'pos': 0.6, 'compound...",0.0,0.4,0.6,0.4588,home,0.678,home,notclose,fan_win_notclose
25,ComptonNWA,Packers,7,dnnfwvc,t1_dnnfwvc,t1_dnnfwvc,True,7344it,,,...,"{'neg': 0.145, 'neu': 0.532, 'pos': 0.323, 'co...",0.145,0.532,0.323,0.8106,home,0.678,home,notclose,fan_win_notclose
44,magic_is_might,Packers,-1,dnnfzuf,t1_dnnfzuf,t1_dnnfzuf,False,dnnfy0k,,,...,"{'neg': 0.123, 'neu': 0.877, 'pos': 0.0, 'comp...",0.123,0.877,0.0,-0.4215,home,0.678,home,notclose,fan_win_notclose


In [23]:
data[data.author_game_state == 'fan_lose_notclose'].head()

Unnamed: 0,author,author_flair,score,comment_id,comment_name,comment_fullname,comment_is_root,comment_parent,comment_approved_at_utc,comment_approved_by,...,vader_ss,vader_neg,vader_neu,vader_pos,vader_compound,fan_type,win_differential,win_team,game_state,author_game_state
1,irishkid46,Bears,1,dnnft5a,t1_dnnft5a,t1_dnnft5a,True,7344it,,,...,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...",0.0,1.0,0.0,0.0,away,0.678,home,notclose,fan_lose_notclose
2,SportsMasterGeneral,Bears,12,dnnft8q,t1_dnnft8q,t1_dnnft8q,True,7344it,,,...,"{'neg': 0.278, 'neu': 0.722, 'pos': 0.0, 'comp...",0.278,0.722,0.0,-0.5927,away,0.678,home,notclose,fan_lose_notclose
6,Chibears85,Bears,116,dnnfu4o,t1_dnnfu4o,t1_dnnfu4o,True,7344it,,,...,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...",0.0,1.0,0.0,0.0,away,0.678,home,notclose,fan_lose_notclose
15,hjs24gl2814,Bears,12,dnnfv2k,t1_dnnfv2k,t1_dnnfv2k,True,7344it,,,...,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...",0.0,1.0,0.0,0.0,away,0.678,home,notclose,fan_lose_notclose
21,EmoArbiter,Bears,213,dnnfvvu,t1_dnnfvvu,t1_dnnfvvu,True,7344it,,,...,"{'neg': 0.182, 'neu': 0.818, 'pos': 0.0, 'comp...",0.182,0.818,0.0,-0.5848,away,0.678,home,notclose,fan_lose_notclose


In [24]:
data[data.author_game_state == 'fan_win_close'].head()

Unnamed: 0,author,author_flair,score,comment_id,comment_name,comment_fullname,comment_is_root,comment_parent,comment_approved_at_utc,comment_approved_by,...,vader_ss,vader_neg,vader_neu,vader_pos,vader_compound,fan_type,win_differential,win_team,game_state,author_game_state
12,3SP,Chargers,5,dq90o97,t1_dq90o97,t1_dq90o97,True,7f2ii3,,,...,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...",0.0,1.0,0.0,0.0,away,0.14,away,close,fan_win_close
101,scoot87,Chargers,5,dq90zj5,t1_dq90zj5,t1_dq90zj5,True,7f2ii3,,,...,"{'neg': 0.529, 'neu': 0.471, 'pos': 0.0, 'comp...",0.529,0.471,0.0,-0.5423,away,0.14,away,close,fan_win_close
108,Banditjack,Chargers,64,dq911hl,t1_dq911hl,t1_dq911hl,True,7f2ii3,,,...,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...",0.0,1.0,0.0,0.0,away,0.14,away,close,fan_win_close
139,Clipdodgecharge,Chargers,229,dq916dc,t1_dq916dc,t1_dq916dc,True,7f2ii3,,,...,"{'neg': 0.341, 'neu': 0.659, 'pos': 0.0, 'comp...",0.341,0.659,0.0,-0.4767,away,0.14,away,close,fan_win_close
155,scoot87,Chargers,9,dq918u8,t1_dq918u8,t1_dq918u8,True,7f2ii3,,,...,"{'neg': 0.0, 'neu': 0.786, 'pos': 0.214, 'comp...",0.0,0.786,0.214,0.4588,away,0.14,away,close,fan_win_close


In [25]:
data[data.author_game_state == 'fan_lose_close'].head()

Unnamed: 0,author,author_flair,score,comment_id,comment_name,comment_fullname,comment_is_root,comment_parent,comment_approved_at_utc,comment_approved_by,...,vader_ss,vader_neg,vader_neu,vader_pos,vader_compound,fan_type,win_differential,win_team,game_state,author_game_state
3,TedBear72,Cowboys,24,dq90m0r,t1_dq90m0r,t1_dq90m0r,False,dq90l39,,,...,"{'neg': 1.0, 'neu': 0.0, 'pos': 0.0, 'compound...",1.0,0.0,0.0,-0.4404,home,0.14,away,close,fan_lose_close
26,TedBear72,Cowboys,3,dq90rkb,t1_dq90rkb,t1_dq90rkb,False,dq90quu,,,...,"{'neg': 1.0, 'neu': 0.0, 'pos': 0.0, 'compound...",1.0,0.0,0.0,-0.4404,home,0.14,away,close,fan_lose_close
32,Jabapy,Cowboys,29,dq90ryg,t1_dq90ryg,t1_dq90ryg,False,dq90rm3,,,...,"{'neg': 0.0, 'neu': 0.0, 'pos': 0.0, 'compound...",0.0,0.0,0.0,0.0,home,0.14,away,close,fan_lose_close
45,TedBear72,Cowboys,3,dq90ttc,t1_dq90ttc,t1_dq90ttc,True,7f2ii3,,,...,"{'neg': 0.0, 'neu': 0.392, 'pos': 0.608, 'comp...",0.0,0.392,0.608,0.7351,home,0.14,away,close,fan_lose_close
61,convfefe,Cowboys,-5,dq90vd6,t1_dq90vd6,t1_dq90vd6,False,dq90sh5,,,...,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...",0.0,1.0,0.0,0.0,home,0.14,away,close,fan_lose_close


In [32]:
def data_prep(indata):
    # isolate comment body
    data = indata.loc[:, 'comment_body']

    # Convert links to word 'postedhyperlinkvalue'
    outdata = data.map(lambda x: re.sub(r"(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w\.-]*)*\/?\S", \
                                     "postedhyperlinkvalue", x))

    # Lowercase words that aren't all upper case
    outdata = outdata.map(lambda x: x.lower() if not x.isupper() else x)

    # Convert digits to 'DG'
    outdata = outdata.map(lambda x: re.sub('\d', 'DG', x))
    
    return data, outdata

### Prepare the data for modeling

In [197]:
print(data[data.comment_body == ''].head())

             author author_flair score comment_id comment_name  \
21657          None         None    12    dmppybo   t1_dmppybo   
29499          None         None     1    dmprzxg   t1_dmprzxg   
31087  mctoasterson       Chiefs     1    dmpsctn   t1_dmpsctn   
40665          None         None    16    dmpub2f   t1_dmpub2f   
6474           None         None     3    dom82k3   t1_dom82k3   

      comment_fullname comment_is_root comment_parent comment_approved_at_utc  \
21657       t1_dmppybo            True         6yr619                    None   
29499       t1_dmprzxg            True         6yr619                    None   
31087       t1_dmpsctn           False        dmpsb5g                    None   
40665       t1_dmpub2f            True         6yr619                    None   
6474        t1_dom82k3            True         77iet5                    None   

      comment_approved_by        ...         \
21657                None        ...          
29499                N

In [223]:
# Get rid of non-fans
use_data = data[data['fan_type']!='nofan']
#use_data = data

# Eliminate data with empty comments
use_data = use_data[pd.notnull(use_data['comment_body'])]

# Eliminate data for games in which the outcome is neither very close nor very clear
use_data = use_data[(use_data['win_differential'] <= 0.2) | (use_data['win_differential'] >= 0.9)]

print(use_data.shape)

# Separate comments and apply normalizations
comments, X_data = data_prep(use_data)

(83030, 48)


In [104]:
print(type(comments[0]))

<class 'pandas.core.series.Series'>


In [34]:
print(comments.head)

<bound method NDFrame.head of 3503                                         #wewantmitch 
4055                    "Im Mike Glennon and I have cable"
4057     It's our color rush "color".  It's the same as...
4062                     oooh.... that double team was hot
4063     It was supposed to be one of the alternate aud...
4064     Stupid UK time difference making me a book Fri...
4066     Skyrim npc after you put a bucket on their head. 
4070                Clay Matthews just broke sack record. 
4071     Looks like Redskins vs Giants, 1966. 72-41 Red...
4073     game thread is stickied, don't really need to ...
4077             I'm getting ready to pop open that nyquil
4081     someone tell mccarthy he can ALWAYS be this cr...
4082                                                NOPE! 
4085                         He’s too busy kissing titties
4090                                I love kissing titties
4091                  I honestly just never think of it.  
4093                      

In [35]:
print(X_data.head)

<bound method NDFrame.head of 3503                                         #wewantmitch 
4055                    "im mike glennon and i have cable"
4057     it's our color rush "color".  it's the same as...
4062         postedhyperlinkvalue that double team was hot
4063     it was supposed to be one of the alternate aud...
4064     stupid uk time difference making me a book fri...
4066     skyrim npc after you put a bucket on their head. 
4070                clay matthews just broke sack record. 
4071     looks like redskins vs giants, DGDGDGDG. DGDG-...
4073     game thread is stickied, don't really need to ...
4077             i'm getting ready to pop open that nyquil
4081     someone tell mccarthy he can always be this cr...
4082                                                NOPE! 
4085                         he’s too busy kissing titties
4090                                i love kissing titties
4091                  i honestly just never think of it.  
4093                      

### Set dependent variable

In [151]:
# Isolate the labels
# target_var = 'author_game_state'
target_var = 'game_state'
# target_var = 'fan_type'

labels = use_data.loc[:, target_var]

# Create some lookup dictionaries
use_data['label_id'] = use_data[target_var].factorize()[0]
label_id_df = use_data[[target_var, 'label_id']].drop_duplicates().sort_values('label_id')
label_to_id = dict(label_id_df.values)
id_to_label = dict(label_id_df[['label_id', target_var]].values)
#print(use_data.head())
#print(labels.head())
print(labels.shape)


counts = {}
for label in np.unique(labels):
    counts[label] = sum(labels == label)

print("Class counts:\n{}".format(counts))

(83030,)
Class counts:
{'close': 41287, 'notclose': 41743}


In [481]:
print(labels.unique())
print(label_id_df)
print(label_to_id)

['notclose' 'close']
     game_state  label_id
3503   notclose         0
3         close         1
{'notclose': 0, 'close': 1}


### Multinomial Naive Bayes

In [152]:
print(len(comments_canon))

83030


In [336]:
# Count or TF-IDF vectorize, removing stop words, tokenizing the strings, converting to lowercase and smoothing the doc freqs
vectorizer = TfidfVectorizer(analyzer='word', stop_words='english', lowercase=False, 
                             tokenizer=lambda text: text)
                             #tokenizer=lambda text: text, min_df=0.00002, max_df=0.005)
spmat = vectorizer.fit_transform(comments_canon)
#vectorizer = CountVectorizer(analyzer='word', stop_words='english', lowercase=False, binary=False)
#spmat = vectorizer.fit_transform(X_data)

In [167]:
print(vectorizer.get_stop_words())

frozenset({'be', 'amoungst', 'wherein', 'can', 'nothing', 'three', 'thus', 'indeed', 'less', 'at', 'nevertheless', 'besides', 'due', 'interest', 'most', 'rather', 'whether', 'eight', 'hereby', 'any', 'she', 'something', 'sometimes', 'will', 'five', 'about', 'must', 'thru', 'whence', 'becoming', 'off', 'front', 'once', 'they', 'only', 'each', 'hers', 'sixty', 'done', 'here', 'after', 'hereafter', 'moreover', 'now', 'while', 'move', 'couldnt', 'as', 'everyone', 'might', 'whenever', 'wherever', 'this', 'an', 'had', 'further', 'beside', 'their', 'how', 'or', 'first', 'that', 'its', 'become', 'several', 'somewhere', 'whoever', 'least', 'was', 'fire', 'could', 'keep', 'more', 'then', 'but', 'full', 'every', 'de', 'give', 'my', 'anyway', 'am', 'go', 'me', 'among', 'should', 'under', 'inc', 'ever', 'call', 'hundred', 'never', 'somehow', 'enough', 'four', 'many', 'per', 'third', 'them', 'hereupon', 'it', 'seeming', 'thereafter', 'between', 'two', 'yet', 'all', 'formerly', 'anywhere', 'show', 's

In [279]:
# features with default tokenization
print(vectorizer.get_feature_names())



In [242]:
# vocab size with default tokenization
print(spmat.shape)

(83030, 26493)


In [338]:
# features with tweet tokenization
print(vectorizer.get_feature_names())



In [337]:
# vocab size with tweet tokenization
print(spmat.shape)

(83030, 25816)


In [27]:
print(type(spmat))

<class 'scipy.sparse.csr.csr_matrix'>


In [129]:
print(spmat[:10])

  (0, 18082)	0.3044539261298268
  (0, 47741)	0.46107508330928915
  (0, 41283)	0.519207796609924
  (0, 18088)	0.3044595336825324
  (0, 1685)	0.3231137023452454
  (0, 61342)	0.35195969679446265
  (0, 1691)	0.32274934220513446
  (1, 8215)	1.0
  (2, 9775)	0.14681688364958037
  (2, 49152)	0.24654849205539028
  (2, 59181)	0.441442429118175
  (2, 1700)	0.1362819478760579
  (2, 50259)	0.653331772709188
  (2, 73052)	0.38528312186542923
  (2, 80587)	0.3590668001532445
  (3, 39670)	0.6622087397100487
  (3, 65308)	0.7493194145700679
  (4, 75546)	0.642530775964364
  (4, 40238)	0.38573181572953064
  (4, 76510)	0.5839467357646175
  (4, 2960)	0.31204387201558653
  (5, 1700)	0.11091601766355025
  (5, 76139)	0.4793364099593813
  (5, 50230)	0.26034118235210385
  (5, 24674)	0.2989608736164224
  :	:
  (7, 29062)	0.18550642578277338
  (7, 1712)	0.1594572811081614
  (7, 51128)	0.15206835378723735
  (7, 50464)	0.10686685798297757
  (7, 59468)	0.34354793906667513
  (7, 81718)	0.19231761960421456
  (7, 48694)	0

In [339]:
# Split into test and train
train_data, test_data, train_labels, test_labels = train_test_split(spmat, labels, test_size=0.10, random_state=42)  

In [340]:
# Train model
#a_values = [x * 0.01 for x in range(1,20)]
#gs_mnb = GridSearchCV(MultinomialNB(), {'alpha': a_values}, cv=5,
#                       scoring='f1_weighted')
clf = MultinomialNB()
clf.fit(train_data, train_labels)
#print(gs_mnb.best_estimator_)
#print(gs_mnb.best_score_)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [341]:
print(clf.class_count_)

[37124. 37603.]


In [342]:
# Get feature names and class labels
feature_names = vectorizer.get_feature_names()
class_labels = clf.classes_
print(class_labels)

['close' 'notclose']


In [343]:
#Create predictions and evaluate
pred_labels = clf.predict(test_data)
acc = metrics.accuracy_score(test_labels, pred_labels)
print("Accuracy on test set: {:.02%}".format(acc))
print('Test Data:')

print(classification_report(test_labels, pred_labels, target_names = class_labels, digits=3))
#print(classification_report(test_labels, pred_labels, target_names = ['fan_lose_close', 'fan_lose_notclose', 'fan_win_close', 'fan_win_notclose'], digits=3))

print("Confusion Matrix...")
confusionMatrix = metrics.confusion_matrix(test_labels, pred_labels)
print(confusionMatrix)

Accuracy on test set: 62.45%
Test Data:
             precision    recall  f1-score   support

      close      0.633     0.597     0.615      4163
   notclose      0.617     0.652     0.634      4140

avg / total      0.625     0.624     0.624      8303

Confusion Matrix...
[[2487 1676]
 [1442 2698]]


In [344]:
# Get top features
n = 20
for i, class_label in enumerate(class_labels):
    top_feats = np.argsort(clf.coef_[i])[-n:]
    print("%s: %s" % (class_label, " ".join(feature_names[j] for j in top_feats)))
    if i < 1:
        break

close: play don't DGDG shit it's good ’ " fucking lol DG like just ... fuck game ! ? , .


In [None]:
def most_informative_feature_for_binary_classification(vectorizer, classifier, n=10):
    class_labels = classifier.classes_
    feature_names = vectorizer.get_feature_names()
    topn_class1 = sorted(zip(classifier.coef_[0], feature_names))[:n]
    topn_class2 = sorted(zip(classifier.coef_[0], feature_names))[-n:]

    for coef, feat in topn_class1:
        print (class_labels[0], coef, feat)

    print()

    for coef, feat in reversed(topn_class2):
        print (class_labels[1], coef, feat)


most_informative_feature_for_binary_classification(vectorizer, nb, n = 15)

### Bernoulli Naive Bayes

### Vectorize data and split into train and test

In [255]:
# Count or TF-IDF vectorize, removing stop words, tokenizing the strings, converting to lowercase and smoothing the doc freqs
vectorizer = CountVectorizer(analyzer='word', stop_words='english', lowercase=False, binary=True)
#vectorizer = CountVectorizer(analyzer='word', stop_words='english')
spmat = vectorizer.fit_transform(X_data)

In [256]:
# Split into test and train
train_data, test_data, train_labels, test_labels = train_test_split(spmat, labels, test_size=0.10, random_state=42)  

In [257]:
# Train model
#a_values = [x * 0.01 for x in range(1,20)]
#gs_mnb = GridSearchCV(MultinomialNB(), {'alpha': a_values}, cv=5,
#                       scoring='f1_weighted')
clf = BernoulliNB()
clf.fit(train_data, train_labels)
#print(gs_mnb.best_estimator_)
#print(gs_mnb.best_score_)

BernoulliNB(alpha=1.0, binarize=0.0, class_prior=None, fit_prior=True)

In [258]:
#Create predictions and evaluate
pred_labels = clf.predict(test_data)
acc = metrics.accuracy_score(test_labels, pred_labels)
print("Accuracy on test set: {:.02%}".format(acc))
print('Test Data:')

print(classification_report(test_labels, pred_labels, target_names = class_labels, digits=3))
#print(classification_report(test_labels, pred_labels, target_names = ['fan_lose_close', 'fan_lose_notclose', 'fan_win_close', 'fan_win_notclose'], digits=3))

print("Confusion Matrix...")
confusionMatrix = metrics.confusion_matrix(test_labels, pred_labels)
print(confusionMatrix)

Accuracy on test set: 60.73%
Test Data:
             precision    recall  f1-score   support

      close      0.592     0.696     0.640      4163
   notclose      0.629     0.518     0.568      4140

avg / total      0.610     0.607     0.604      8303

Confusion Matrix...
[[2896 1267]
 [1994 2146]]


In [259]:
# Get top features
n = 10
for i, class_label in enumerate(class_labels):
    top_feats = np.argsort(clf.coef_[i])[-n:]
    print("%s: %s" % (class_label, " ".join(feature_names[j] for j in top_feats)))
    if i < 1:
        break

close: good don He That It DGDG just like DG game


IndexError: index 1 is out of bounds for axis 0 with size 1

### Formulate the problem as a regression problem

In [345]:
# Borrowed some functions from the w266 utils.py file
# Miscellaneous helpers
def flatten(list_of_lists):
    """Flatten a list-of-lists into a single list."""
    return list(itertools.chain.from_iterable(list_of_lists))


# Word processing functions
def canonicalize_digits(word):
    if any([c.isalpha() for c in word]): return word
    word = re.sub("\d", "DG", word)
    if word.startswith("DG"):
        word = word.replace(",", "") # remove thousands separator
    return word

def canonicalize_word(word, wordset=None, digits=True):
    word = re.sub(r"(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w\.-]*)*\/?\S", \
                                     "postedhyperlinkvalue", word)
    #if not word.isupper():
    word = word.lower()
    if digits:
        if (wordset != None) and (word in wordset): return word
        word = canonicalize_digits(word) # try to canonicalize numbers
    if (wordset == None) or (word in wordset):
        return word
    else:
        return constants.UNK_TOKEN

def canonicalize_words(words, **kw):
    return [canonicalize_word(word, **kw) for word in words]

In [346]:
# Get rid of non-fans
#use_data = data[data['fan_type']!='nofan']
use_data = data

# Eliminate data with empty comments
use_data = use_data[pd.notnull(use_data['comment_body'])]

# Eliminate data for games in which the outcome is neither very close nor very clear
# use_data = use_data[(use_data['win_differential'] <= 0.2) | (use_data['win_differential'] >= 0.9)]

print(use_data.shape)

# Separate comments and apply normalizations
# comments, X_data = data_prep(use_data)

comments = use_data.loc[:, 'comment_body']

# Convert to list
comment_list = comments.values.tolist()

print(len(comment_list))

print(comment_list[:10])
# Separate comments and apply normalizations
# comments, X_data = data_prep(use_data)

(588378, 48)
588378
['[jersey goals](https://i.ebayimg.com/images/g/9ncAAOxyGqZSWoeL/s-l300.jpg)', 'FTP', 'I know its already passed, but that leukemia story had me weak', 'fuck the refs', "There's a game tonight?", "Time's up, let's do this boys\n\nAAAAAAAAAAAAARON RODGERSSSS", 'Let me be the first of many tonight to say START TRUBISKY', 'Anyone see Deion’s hair coming in. Looks like a peach. I wish he kept the shaved look instead of getting hair plugs. But whatever makes you happy man!', 'For a skeleton guy, skeletor sure looks jacked', 'I agree']


In [347]:
print(len(use_data))

588378


In [348]:
tokenizer = TweetTokenizer()
x_tokens = [tokenizer.tokenize(sentence) for sentence in comment_list]
x_tokens[0:3]

[['[',
  'jersey',
  'goals',
  ']',
  '(',
  'https://i.ebayimg.com/images/g/9ncAAOxyGqZSWoeL/s-l300.jpg',
  ')'],
 ['FTP'],
 ['I',
  'know',
  'its',
  'already',
  'passed',
  ',',
  'but',
  'that',
  'leukemia',
  'story',
  'had',
  'me',
  'weak']]

In [482]:
comments_canon = []
for tweet in x_tokens:
    x_tokens_canon = canonicalize_words(tweet)
    #x_tokens_canon = [canonicalize_word(w) for w in tweet]
    comments_canon.append(x_tokens_canon)

print(comments_canon[140:150])

[['its', 'for', 'the', 'best'], ['what', 'kind', '?'], ['this', 'new', 'guy', 'has', 'great', 'sanders', 'impression', '.', '^', '^', '^', '/', 's'], ['pls', 'don', '’', 't', 'judge', 'me', 'but', 'can', 'someone', 'tell', 'me', 'who', 'sang', 'the', 'intro', '?', 'it', 'was', 'catchy'], ['that', 'tnf', 'intro', 'song', 'needs', 'to', 'be', 'burned', 'with', 'fire', '.'], ['i', 'hate', 'one', 'of', 'you', 'more', 'than', 'the', 'other', 'but', 'i', 'wish', 'nothing', 'but', 'hell', 'for', 'both', 'of', 'you', '.', 'may', 'both', 'of', 'you', 'play', 'to', 'an', 'injury', 'free', 'tie', 'and', 'get', 'sucked', 'into', 'hell', 'after', 'the', 'game'], ['my', 'feels', 'were', 'going', 'all', 'over', 'the', 'place', '.', "i'm", 'having', 'a', 'game', 'date', 'with', 'a', 'packers', 'fan', ',', 'and', 'trying', 'to', 'stay', 'macho', ',', 'and', 'that', 'commercial', "didn't", 'help', '.'], ['e'], ['#nfl', 'preseason', 'hall', 'of', 'famer', 'mitch', 'trubisky', 'needs', 'to', 'play', '#get

In [483]:
comments_canon = []
for tweet in x_tokens:
    #x_tokens_canon = canonicalize_words(tweet)
    x_tokens_canon = [canonicalize_word(w) for w in tweet]
    comments_canon.append(x_tokens_canon)

print(comments_canon[140:150])

[['its', 'for', 'the', 'best'], ['what', 'kind', '?'], ['this', 'new', 'guy', 'has', 'great', 'sanders', 'impression', '.', '^', '^', '^', '/', 's'], ['pls', 'don', '’', 't', 'judge', 'me', 'but', 'can', 'someone', 'tell', 'me', 'who', 'sang', 'the', 'intro', '?', 'it', 'was', 'catchy'], ['that', 'tnf', 'intro', 'song', 'needs', 'to', 'be', 'burned', 'with', 'fire', '.'], ['i', 'hate', 'one', 'of', 'you', 'more', 'than', 'the', 'other', 'but', 'i', 'wish', 'nothing', 'but', 'hell', 'for', 'both', 'of', 'you', '.', 'may', 'both', 'of', 'you', 'play', 'to', 'an', 'injury', 'free', 'tie', 'and', 'get', 'sucked', 'into', 'hell', 'after', 'the', 'game'], ['my', 'feels', 'were', 'going', 'all', 'over', 'the', 'place', '.', "i'm", 'having', 'a', 'game', 'date', 'with', 'a', 'packers', 'fan', ',', 'and', 'trying', 'to', 'stay', 'macho', ',', 'and', 'that', 'commercial', "didn't", 'help', '.'], ['e'], ['#nfl', 'preseason', 'hall', 'of', 'famer', 'mitch', 'trubisky', 'needs', 'to', 'play', '#get

In [350]:
print(comments_canon[0:10])

[['[', 'jersey', 'goals', ']', '(', 'postedhyperlinkvalue', ')'], ['ftp'], ['i', 'know', 'its', 'already', 'passed', ',', 'but', 'that', 'leukemia', 'story', 'had', 'me', 'weak'], ['fuck', 'the', 'refs'], ["there's", 'a', 'game', 'tonight', '?'], ["time's", 'up', ',', "let's", 'do', 'this', 'boys', 'aaaaaaaaaaaaaron', 'rodgerssss'], ['let', 'me', 'be', 'the', 'first', 'of', 'many', 'tonight', 'to', 'say', 'start', 'trubisky'], ['anyone', 'see', 'deion', '’', 's', 'hair', 'coming', 'in', '.', 'looks', 'like', 'a', 'peach', '.', 'i', 'wish', 'he', 'kept', 'the', 'shaved', 'look', 'instead', 'of', 'getting', 'hair', 'plugs', '.', 'but', 'whatever', 'makes', 'you', 'happy', 'man', '!'], ['for', 'a', 'skeleton', 'guy', ',', 'skeletor', 'sure', 'looks', 'jacked'], ['i', 'agree']]


### Set dependent variable

In [351]:
target_var = 'win_differential'
y_data = use_data.loc[:, target_var]

### Linear Regression

### Vectorize data and split into train and test

In [449]:
# Count or TF-IDF vectorize, removing stop words, tokenizing the strings, converting to lowercase and smoothing the doc freqs
#vectorizer = TfidfVectorizer(analyzer='word', stop_words='english', lowercase=False, 
                             #tokenizer=lambda text: text)
#                             tokenizer=lambda text: text, min_df=20)
#spmat = vectorizer.fit_transform(comments_canon)
vectorizer = CountVectorizer(analyzer='word', stop_words='english', tokenizer=lambda text: text, 
                             lowercase=False, binary=True, min_df=10)
spmat = vectorizer.fit_transform(comments_canon)

In [450]:
# vocab size with tweet tokenization
print(spmat.shape)

(588378, 13751)


In [420]:
# features with tweet tokenization
print(vectorizer.get_feature_names())



In [457]:
# Split into test and train
train_data, test_data, train_labels, test_labels = train_test_split(spmat, y_data, test_size=0.10, random_state=42)  

In [396]:
# Train model
#a_values = [x * 0.01 for x in range(1,20)]
#clf_gs = GridSearchCV(MultinomialNB(), {'alpha': a_values}, cv=5,
#                       scoring='f1_weighted')
lr = LinearRegression()
lr.fit(train_data, train_labels)
#print(clf.best_estimator_)
#print(clf.best_score_)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [397]:
#Create predictions and evaluate
pred_labels = lr.predict(test_data)
print("Training set score: {:.2f}".format(lr.score(train_data, train_labels)))
print("Test set score: {:.2f}".format(lr.score(test_data, test_labels)))

Training set score: 0.09
Test set score: 0.06


In [398]:
# Look at top scoring words
n = 20
print(lr.coef_.shape)
feature_names = vectorizer.get_feature_names()
top_feats = np.argsort(lr.coef_)[-n:]
print(" ".join(feature_names[j] for j in top_feats))

(9282,)
onside postgame bears siemian shutouts fog pouncey trevathon paxton rehost つ gamblers glennon vance glennon's trevathan ﾉ ¯ postedhyperlinkvalueto=jasie3k&subject=tweetsincommentsbot postedhyperlinkvalueto=%2fr%2fnfl


### Lasso Regression

In [476]:
lasso = Lasso(alpha=0.0002)
lasso.fit(train_data, train_labels)

Lasso(alpha=0.0002, copy_X=True, fit_intercept=True, max_iter=1000,
   normalize=False, positive=False, precompute=False, random_state=None,
   selection='cyclic', tol=0.0001, warm_start=False)

In [477]:
#Create predictions and evaluate
pred_labels = lasso.predict(test_data)
print("Training set score: {:.2f}".format(lasso.score(train_data, train_labels)))
print("Test set score: {:.2f}".format(lasso.score(test_data, test_labels)))
print("Number of features used: {}".format(np.sum(lasso.coef_ != 0)))

Training set score: 0.02
Test set score: 0.02
Number of features used: 83


In [478]:
# Look at top scoring words
m = 20
n = np.sum(lasso.coef_ != 0)
if m < n:
    n = m
print(lasso.coef_.shape)
feature_names = vectorizer.get_feature_names()
top_feats = np.argsort(lasso.coef_)[-n:]
print(" ".join(feature_names[j] for j in top_feats))

(13751,)
won romo season hundley siemian game angle team rodgers broncos mcadoo giants trevathan fumble adams packers gg bears fog glennon


### Ridge Regression

In [461]:
rdg = Ridge()
rdg.fit(train_data, train_labels)

Ridge(alpha=1.0, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=False, random_state=None, solver='auto', tol=0.001)

In [462]:
#Create predictions and evaluate
pred_labels = rdg.predict(test_data)
print("Training set score: {:.2f}".format(rdg.score(train_data, train_labels)))
print("Test set score: {:.2f}".format(rdg.score(test_data, test_labels)))
#print("Number of features used: {}".format(np.sum(clf.coef_ != 0)))

Training set score: 0.10
Test set score: 0.05


In [463]:
# Look at top scoring words
n = 20
print(rdg.coef_.shape)
feature_names = vectorizer.get_feature_names()
top_feats = np.argsort(rdg.coef_)[-n:]
print(" ".join(feature_names[j] for j in top_feats))
#print(top_feats.shape)
#print("%s: %s" % (class_label, " ".join(feature_names[j] for j in top_feats)))

(13751,)
bethea endings sitton semen bioshock simultaneous goty atlantas trevathan chloe fog gamblers paxton joystick laterals trevethan pouncey shutouts travathan postedhyperlinkvaluev=fr9uj_ayayq&feature=postedhyperlinkvaluet=10s


### ElasticNet Regression

In [467]:
elnet = ElasticNet(alpha=0.0001, l1_ratio=0.2)
elnet.fit(train_data, train_labels)

ElasticNet(alpha=0.0001, copy_X=True, fit_intercept=True, l1_ratio=0.2,
      max_iter=1000, normalize=False, positive=False, precompute=False,
      random_state=None, selection='cyclic', tol=0.0001, warm_start=False)

In [468]:
#Create predictions and evaluate
pred_labels = elnet.predict(test_data)
print("Training set score: {:.2f}".format(elnet.score(train_data, train_labels)))
print("Test set score: {:.2f}".format(elnet.score(test_data, test_labels)))
print("Number of features used: {}".format(np.sum(elnet.coef_ != 0)))

Training set score: 0.06
Test set score: 0.05
Number of features used: 1215


In [469]:
# Look at top scoring words
m = 20
n = np.sum(lasso.coef_ != 0)
if m < n:
    n = m
print(lasso.coef_.shape)
feature_names = vectorizer.get_feature_names()
top_feats = np.argsort(lasso.coef_)[-n:]
print(" ".join(feature_names[j] for j in top_feats))

(13751,)
giants mcadoo weather lightning broncos angle fumble turkey achilles fox onside packers trubisky adams gg siemian trevathan bears glennon fog
