### Language modeling is the task of predicting the next word, given the preceding history.

### Sentiment detection is just a special case of classification

**Data sets with fields:**

*Combined_Comments* := comment_id, author, author_flair, score, comment_name, comment_fullname, comment_is_root, comment_parent, comment_created, comment_created_utc, comment_created_utc_datetime, comment_created_utc_date, comment_created_utc_time, comment_depth, comment_body, submission_id, submission_title, submission_created_utc

*Clean_Game_Data* := index, unnamed: 0, playnum, playid, 'Game Title Date', text, homeWinPercentage, matched_play_by_play_text, matched_play_by_play_index, matched_play_by_play_utc, matched_play_by_play_tweetid, home_team, away_team, awayWinPercentage

*Pickle files in Clean_Game_Data* := author, author_flair, score, comment_id, comment_name, comment_fullname, comment_is_root, comment_parent, comment_approved_at_utc, comment_approved_by, matched_play_by_play_utc, matched_play_by_play_tweetid, home_team, away_team, awayWinPercentage, vader_ss, vader_neg, vader_neu, vader_pos, vader_compound

*Comments_FanOfGame* := comment_body (from Reddit), fan_of_team_playing


**Ideas for data to model**

*-------------1-------------*

*Dependent var* := game state

*Independent vars* := comment_body, fan_of_team_playing

*-------------2-------------*

*Dependent var* := fan_of_team_playing

*Independent vars* := comment_body, game_state

*-------------3-------------*

*Dependent var* := author_flair

*Independent vars* := comment_body, game_fan_state (fan_team_prob_win, fan_team_prob_lose, fan_no_team)

*-------------4-------------*

*Dependent var* := author_game_state or game_fan_state (fan_team_prob_win, fan_team_prob_lose, fan_no_team)

*Independent vars* := comment_body

*-------------5-------------*

*Dependent var* := next word

*Independent vars* := previous word


*--------------------------*

*Next step* := apply language model to each game and examine by game_state, fan_of_team_playing
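
Idea 5 and the next step could be prototyped with a simple bigram model before anything heavier is tried. The sketch below is only an illustration under stated assumptions, not the project's model: it uses whitespace tokenization, add-one smoothing, and ignores out-of-vocabulary handling. The helper names (`tokenize`, `train_bigram`, `next_word_prob`, `perplexity`) and the toy comments are hypothetical.

```python
import math
from collections import Counter

def tokenize(comment):
    # Naive whitespace tokenizer with boundary markers (an assumption).
    return ["<s>"] + comment.lower().split() + ["</s>"]

def train_bigram(comments):
    """Count contexts and bigrams over a list of comment strings."""
    context_counts, bigram_counts, vocab = Counter(), Counter(), set()
    for comment in comments:
        tokens = tokenize(comment)
        vocab.update(tokens)
        for prev, nxt in zip(tokens, tokens[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, nxt)] += 1
    return context_counts, bigram_counts, vocab

def next_word_prob(model, prev, nxt):
    """P(nxt | prev) with add-one (Laplace) smoothing."""
    context_counts, bigram_counts, vocab = model
    return (bigram_counts[(prev, nxt)] + 1) / (context_counts[prev] + len(vocab))

def perplexity(model, comments):
    """Per-token perplexity of the comments under the bigram model."""
    log_prob, n_tokens = 0.0, 0
    for comment in comments:
        tokens = tokenize(comment)
        for prev, nxt in zip(tokens, tokens[1:]):
            log_prob += math.log(next_word_prob(model, prev, nxt))
            n_tokens += 1
    return math.exp(-log_prob / n_tokens)

# Hypothetical comments standing in for one (game_state, fan_of_team_playing) group.
group_comments = ["what a catch", "what a game"]
model = train_bigram(group_comments)
print(next_word_prob(model, "what", "a"))
print(perplexity(model, ["what a catch"]))
```

Training one such model per group and scoring held-out comments from the other groups would give a rough sense of how much the language differs by game_state and fan_of_team_playing.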

**Possible comment label combinations (author_game_state)**

*fan/close*

*fan/blowout*

*notfan/close*

*notfan/blowout*

*fan/lose*

*fan/win*

In [32]:
# __future__ imports must come before any other statement
from __future__ import print_function
from __future__ import division
import numpy as np
import pandas as pd
import re
import pickle

# SK-learn libraries for learning.
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
# GridSearchCV lives in model_selection; sklearn.grid_search is deprecated
from sklearn.model_selection import GridSearchCV

# SK-learn libraries for evaluation.
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# SK-learn libraries for feature extraction from text.
from sklearn.feature_extraction.text import *

## Data Processing

In [2]:
## Load comments by game
files = [
    'Bears_vs_Packers__2017-09-28_comment_sentiment.pickle',
    'Broncos_vs_Chiefs__2017-10-30_comment_sentiment.pickle',
    'Chargers_vs_Cowboys__2017-11-23_comment_sentiment.pickle',
    'Chiefs_vs_Patriots__2017-09-07_comment_sentiment.pickle',
    'Chiefs_vs_Raiders__2017-10-19_comment_sentiment.pickle',
    'Cowboys_vs_Cardinals__2017-09-25_comment_sentiment.pickle',
    'Cowboys_vs_Raiders__2017-12-17_comment_sentiment.pickle',
    'Eagles_vs_Panthers__2017-10-12_comment_sentiment.pickle',
    'Falcons_vs_Buccaneers__2017-12-18_comment_sentiment.pickle',
    'Falcons_vs_Patriots__2017-10-22_comment_sentiment.pickle',
    'Falcons_vs_Seahawks__2017-11-20_comment_sentiment.pickle',
    'Giants_vs_Cowboys__2017-09-10_comment_sentiment.pickle',
    'Jaguars_vs_Patriots__2018-01-21_comment_sentiment.pickle',
    'Lions_vs_Giants__2017-09-18_comment_sentiment.pickle',
    'Lions_vs_Packers__2017-11-06_comment_sentiment.pickle',
    'Packers_vs_Panthers__2017-12-17_comment_sentiment.pickle',
    'Packers_vs_Vikings__2017-10-15_comment_sentiment.pickle',
    'Patriots_vs_Dolphins__2017-12-11_comment_sentiment.pickle',
    'Raiders_vs_Eagles__2017-12-25_comment_sentiment.pickle',
    'Raiders_vs_Redskins__2017-09-24_comment_sentiment.pickle',
    'Rams_vs_49ers__2017-09-21_comment_sentiment.pickle',
    'Redskins_vs_Chiefs__2017-10-02_comment_sentiment.pickle',
    'Redskins_vs_Cowboys__2017-11-30_comment_sentiment.pickle',
    'Redskins_vs_Eagles__2017-10-23_comment_sentiment.pickle',
    'Saints_vs_Falcons__2017-12-07_comment_sentiment.pickle',
    'Saints_vs_Vikings__2017-09-11_comment_sentiment.pickle',
    'Seahawks_vs_Cardinals__2017-11-09_comment_sentiment.pickle',
    'Steelers_vs_Bengals__2017-12-04_comment_sentiment.pickle',
    'Steelers_vs_Lions__2017-10-29_comment_sentiment.pickle',
    'Texans_vs_Bengals__2017-09-14_comment_sentiment.pickle',
    'Vikings_vs_Packers__2017-12-23_comment_sentiment.pickle',
    'Vikings_vs_Panthers__2017-12-10_comment_sentiment.pickle']

In [3]:
path = "/Users/chadharness/mids/w266/w266_final_project/Clean_Game_Data/"

# Read every game file, then concatenate once; calling DataFrame.append
# inside the loop copies the accumulated frame on every iteration.
frames = []
for filename in files:
    print(path + filename)
    frames.append(pd.read_pickle(path + filename))

# Each game keeps its own row index (no ignore_index), so index values repeat across games.
data = pd.concat(frames)
print(data.head())

/Users/chadharness/mids/w266/w266_final_project/Clean_Game_Data/Bears_vs_Packers__2017-09-28_comment_sentiment.pickle
                author author_flair score comment_id comment_name  \
0       Street_Spirit_      Raiders     2    dnnfsxx   t1_dnnfsxx   
1           irishkid46        Bears     1    dnnft5a   t1_dnnft5a   
2  SportsMasterGeneral        Bears    12    dnnft8q   t1_dnnft8q   
3       Street_Spirit_      Raiders    24    dnnftgx   t1_dnnftgx   
4      Fight_For_Tacos         None     0    dnnftkp   t1_dnnftkp   

  comment_fullname comment_is_root comment_parent comment_approved_at_utc  \
0       t1_dnnfsxx            True         7344it                    None   
1       t1_dnnft5a            True         7344it                    None   
2       t1_dnnft8q            True         7344it                    None   
3       t1_dnnftgx            True         7344it                    None   
4       t1_dnnftkp            True         7344it                    None   


In [5]:
data.head()

Unnamed: 0,author,author_flair,score,comment_id,comment_name,comment_fullname,comment_is_root,comment_parent,comment_approved_at_utc,comment_approved_by,...,matched_play_by_play_utc,matched_play_by_play_tweetid,home_team,away_team,awayWinPercentage,vader_ss,vader_neg,vader_neu,vader_pos,vader_compound
0,Street_Spirit_,Raiders,2,dnnfsxx,t1_dnnfsxx,t1_dnnfsxx,True,7344it,,,...,2017-09-29 00:30:01,9.135615e+17,Packers,Bears,0.161,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...",0.0,1.0,0.0,0.0
1,irishkid46,Bears,1,dnnft5a,t1_dnnft5a,t1_dnnft5a,True,7344it,,,...,2017-09-29 00:30:01,9.135615e+17,Packers,Bears,0.161,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...",0.0,1.0,0.0,0.0
2,SportsMasterGeneral,Bears,12,dnnft8q,t1_dnnft8q,t1_dnnft8q,True,7344it,,,...,2017-09-29 00:30:01,9.135615e+17,Packers,Bears,0.161,"{'neg': 0.278, 'neu': 0.722, 'pos': 0.0, 'comp...",0.278,0.722,0.0,-0.5927
3,Street_Spirit_,Raiders,24,dnnftgx,t1_dnnftgx,t1_dnnftgx,True,7344it,,,...,2017-09-29 00:30:01,9.135615e+17,Packers,Bears,0.161,"{'neg': 0.636, 'neu': 0.364, 'pos': 0.0, 'comp...",0.636,0.364,0.0,-0.5423
4,Fight_For_Tacos,,0,dnnftkp,t1_dnnftkp,t1_dnnftkp,True,7344it,,,...,2017-09-29 00:30:01,9.135615e+17,Packers,Bears,0.161,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...",0.0,1.0,0.0,0.0


In [6]:
data[data.author == 'Scaryclouds'].comment_body.head()

8784                                Fisher you fat fuck! 
8814    Hmm, looks like a bad spot, I think Smith got ...
8825                           Yea, first down for sure. 
8843    God damn, our o-line is just terrible at run b...
8860      Part of it is injuries to our interior o-line. 
Name: comment_body, dtype: object

In [8]:
list(data.columns.values)

['author',
 'author_flair',
 'score',
 'comment_id',
 'comment_name',
 'comment_fullname',
 'comment_is_root',
 'comment_parent',
 'comment_approved_at_utc',
 'comment_approved_by',
 'comment_created',
 'comment_created_utc',
 'comment_created_utc_datetime',
 'comment_created_utc_date',
 'comment_created_utc_time',
 'comment_banned_at_utc',
 'comment_banned_by',
 'comment_depth',
 'comment_num_reports',
 'comment_body',
 'comment_body_parsed',
 'submission_id',
 'submission_title',
 'submission_created_utc',
 'playId',
 'index',
 'Unnamed: 0',
 'playnum',
 'Game Title Date',
 'text',
 'homeWinPercentage',
 'matched_play_by_play_text',
 'matched_play_by_play_index',
 'matched_play_by_play_utc',
 'matched_play_by_play_tweetid',
 'home_team',
 'away_team',
 'awayWinPercentage',
 'vader_ss',
 'vader_neg',
 'vader_neu',
 'vader_pos',
 'vader_compound']

**Model variables**

*Dependent var* := author_game_state or game_fan_state

*Independent vars* := comment_body


**game_fan_state values**

*fan_team_prob_win* 

*fan_team_prob_lose* 

*fan_no_team*


**author_game_state values**

*nofan_notclose*

*nofan_close*

*fan_lose_close*

*fan_lose_notclose*

*fan_win_close*

*fan_win_notclose*

### Create features for model

In [96]:
# Identify game state
data['win_differential'] = abs(data.homeWinPercentage - data.awayWinPercentage)

# Call it a win for away if away has same or higher win percentage
data['win_team'] = np.where(data.homeWinPercentage > data.awayWinPercentage, 'home', 'away')

data['game_state'] = np.where(data.win_differential < 0.45, 'close', 'notclose')


In [97]:
data.win_differential.max()

1.0

In [98]:
data.win_differential.min()

0.0

In [99]:
# Identify author affiliation to game
data['fan_type'] = np.where(data.away_team == data.author_flair, 'away', 
                            np.where(data.home_team == data.author_flair, 'home', 'nofan'))


data['author_game_state'] = np.where(data.fan_type == 'nofan', 
                                     np.where(data.game_state == 'notclose', 'nofan_notclose', 'nofan_close'),
                                     np.where(data.game_state == 'notclose', 
                                             np.where(data.win_team == data.fan_type, 'fan_win_notclose','fan_lose_notclose'),
                                              np.where(data.win_team == data.fan_type, 'fan_win_close', 'fan_lose_close')))
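
The nested `np.where` calls above can be hard to read, so here is a minimal row-level sketch of the same labeling rules in plain Python. It is meant as a readable reference for spot-checking labels, not as a replacement for the vectorized code. The function name `author_game_state` and the `close_threshold` parameter are hypothetical; the home win percentage 0.839 is inferred from the displayed `awayWinPercentage` of 0.161 and `win_differential` of 0.678, and the 0.43/0.57 pair is a made-up example with a differential of 0.14.

```python
def author_game_state(home_wp, away_wp, home_team, away_team, flair,
                      close_threshold=0.45):
    """Label one comment row with the same rules as the vectorized code."""
    game_state = "close" if abs(home_wp - away_wp) < close_threshold else "notclose"
    # Ties go to the away team, matching the win_team definition.
    win_team = "home" if home_wp > away_wp else "away"
    # Away is checked first, matching the order of the nested np.where.
    if flair == away_team:
        fan_type = "away"
    elif flair == home_team:
        fan_type = "home"
    else:
        return "nofan_" + game_state
    outcome = "win" if fan_type == win_team else "lose"
    return "fan_{}_{}".format(outcome, game_state)

# Bears at Packers: flairs taken from the rows displayed in this notebook.
for flair in ["Raiders", "Bears", "Packers", None]:
    print(flair, author_game_state(0.839, 0.161, "Packers", "Bears", flair))

# Chargers at Cowboys with hypothetical win percentages (differential 0.14).
for flair in ["Chargers", "Cowboys"]:
    print(flair, author_game_state(0.43, 0.57, "Cowboys", "Chargers", flair))
```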



In [18]:
data['author_game_state'].isnull().sum()

0

In [20]:
data[data.author_game_state == 'nofan_notclose'].head()

Unnamed: 0,author,author_flair,score,comment_id,comment_name,comment_fullname,comment_is_root,comment_parent,comment_approved_at_utc,comment_approved_by,...,vader_ss,vader_neg,vader_neu,vader_pos,vader_compound,fan_type,win_differential,win_team,game_state,author_game_state
0,Street_Spirit_,Raiders,2,dnnfsxx,t1_dnnfsxx,t1_dnnfsxx,True,7344it,,,...,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...",0.0,1.0,0.0,0.0,nofan,0.678,home,notclose,nofan_notclose
3,Street_Spirit_,Raiders,24,dnnftgx,t1_dnnftgx,t1_dnnftgx,True,7344it,,,...,"{'neg': 0.636, 'neu': 0.364, 'pos': 0.0, 'comp...",0.636,0.364,0.0,-0.5423,nofan,0.678,home,notclose,nofan_notclose
4,Fight_For_Tacos,,0,dnnftkp,t1_dnnftkp,t1_dnnftkp,True,7344it,,,...,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...",0.0,1.0,0.0,0.0,nofan,0.678,home,notclose,nofan_notclose
5,SirTopKekington,Lions,2,dnnfu1k,t1_dnnfu1k,t1_dnnfu1k,True,7344it,,,...,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...",0.0,1.0,0.0,0.0,nofan,0.678,home,notclose,nofan_notclose
7,Rufio330,Patriots,1,dnnfu4i,t1_dnnfu4i,t1_dnnfu4i,True,7344it,,,...,"{'neg': 0.0, 'neu': 0.714, 'pos': 0.286, 'comp...",0.0,0.714,0.286,0.8478,nofan,0.678,home,notclose,nofan_notclose


In [21]:
data[data.author_game_state == 'nofan_close'].head()

Unnamed: 0,author,author_flair,score,comment_id,comment_name,comment_fullname,comment_is_root,comment_parent,comment_approved_at_utc,comment_approved_by,...,vader_ss,vader_neg,vader_neu,vader_pos,vader_compound,fan_type,win_differential,win_team,game_state,author_game_state
0,Always_Sunnyvale,Buccaneers,35,dq90l39,t1_dq90l39,t1_dq90l39,True,7f2ii3,,,...,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...",0.0,1.0,0.0,0.0,nofan,0.14,away,close,nofan_close
1,MisterrAlex,Eagles,16,dq90ll8,t1_dq90ll8,t1_dq90ll8,False,dq90l39,,,...,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...",0.0,1.0,0.0,0.0,nofan,0.14,away,close,nofan_close
2,shaolin_1993,Patriots,6,dq90llk,t1_dq90llk,t1_dq90llk,True,7f2ii3,,,...,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...",0.0,1.0,0.0,0.0,nofan,0.14,away,close,nofan_close
4,That_One_Cool_Guy,Packers,3,dq90m45,t1_dq90m45,t1_dq90m45,True,7f2ii3,,,...,"{'neg': 0.0, 'neu': 0.53, 'pos': 0.47, 'compou...",0.0,0.53,0.47,0.7351,nofan,0.14,away,close,nofan_close
5,321polo,Redskins,9,dq90mg7,t1_dq90mg7,t1_dq90mg7,True,7f2ii3,,,...,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...",0.0,1.0,0.0,0.0,nofan,0.14,away,close,nofan_close


In [22]:
data[data.author_game_state == 'fan_win_notclose'].head()

Unnamed: 0,author,author_flair,score,comment_id,comment_name,comment_fullname,comment_is_root,comment_parent,comment_approved_at_utc,comment_approved_by,...,vader_ss,vader_neg,vader_neu,vader_pos,vader_compound,fan_type,win_differential,win_team,game_state,author_game_state
8,holiday_bandit,Packers,4,dnnfug4,t1_dnnfug4,t1_dnnfug4,True,7344it,,,...,"{'neg': 0.0, 'neu': 0.723, 'pos': 0.277, 'comp...",0.0,0.723,0.277,0.3182,home,0.678,home,notclose,fan_win_notclose
18,ComptonNWA,Packers,2,dnnfvid,t1_dnnfvid,t1_dnnfvid,True,7344it,,,...,"{'neg': 0.0, 'neu': 0.777, 'pos': 0.223, 'comp...",0.0,0.777,0.223,0.5106,home,0.678,home,notclose,fan_win_notclose
23,StarksofWinterfell89,Packers,99,dnnfwj2,t1_dnnfwj2,t1_dnnfwj2,True,7344it,,,...,"{'neg': 0.0, 'neu': 0.4, 'pos': 0.6, 'compound...",0.0,0.4,0.6,0.4588,home,0.678,home,notclose,fan_win_notclose
25,ComptonNWA,Packers,7,dnnfwvc,t1_dnnfwvc,t1_dnnfwvc,True,7344it,,,...,"{'neg': 0.145, 'neu': 0.532, 'pos': 0.323, 'co...",0.145,0.532,0.323,0.8106,home,0.678,home,notclose,fan_win_notclose
44,magic_is_might,Packers,-1,dnnfzuf,t1_dnnfzuf,t1_dnnfzuf,False,dnnfy0k,,,...,"{'neg': 0.123, 'neu': 0.877, 'pos': 0.0, 'comp...",0.123,0.877,0.0,-0.4215,home,0.678,home,notclose,fan_win_notclose


In [23]:
data[data.author_game_state == 'fan_lose_notclose'].head()

Unnamed: 0,author,author_flair,score,comment_id,comment_name,comment_fullname,comment_is_root,comment_parent,comment_approved_at_utc,comment_approved_by,...,vader_ss,vader_neg,vader_neu,vader_pos,vader_compound,fan_type,win_differential,win_team,game_state,author_game_state
1,irishkid46,Bears,1,dnnft5a,t1_dnnft5a,t1_dnnft5a,True,7344it,,,...,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...",0.0,1.0,0.0,0.0,away,0.678,home,notclose,fan_lose_notclose
2,SportsMasterGeneral,Bears,12,dnnft8q,t1_dnnft8q,t1_dnnft8q,True,7344it,,,...,"{'neg': 0.278, 'neu': 0.722, 'pos': 0.0, 'comp...",0.278,0.722,0.0,-0.5927,away,0.678,home,notclose,fan_lose_notclose
6,Chibears85,Bears,116,dnnfu4o,t1_dnnfu4o,t1_dnnfu4o,True,7344it,,,...,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...",0.0,1.0,0.0,0.0,away,0.678,home,notclose,fan_lose_notclose
15,hjs24gl2814,Bears,12,dnnfv2k,t1_dnnfv2k,t1_dnnfv2k,True,7344it,,,...,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...",0.0,1.0,0.0,0.0,away,0.678,home,notclose,fan_lose_notclose
21,EmoArbiter,Bears,213,dnnfvvu,t1_dnnfvvu,t1_dnnfvvu,True,7344it,,,...,"{'neg': 0.182, 'neu': 0.818, 'pos': 0.0, 'comp...",0.182,0.818,0.0,-0.5848,away,0.678,home,notclose,fan_lose_notclose


In [24]:
data[data.author_game_state == 'fan_win_close'].head()

Unnamed: 0,author,author_flair,score,comment_id,comment_name,comment_fullname,comment_is_root,comment_parent,comment_approved_at_utc,comment_approved_by,...,vader_ss,vader_neg,vader_neu,vader_pos,vader_compound,fan_type,win_differential,win_team,game_state,author_game_state
12,3SP,Chargers,5,dq90o97,t1_dq90o97,t1_dq90o97,True,7f2ii3,,,...,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...",0.0,1.0,0.0,0.0,away,0.14,away,close,fan_win_close
101,scoot87,Chargers,5,dq90zj5,t1_dq90zj5,t1_dq90zj5,True,7f2ii3,,,...,"{'neg': 0.529, 'neu': 0.471, 'pos': 0.0, 'comp...",0.529,0.471,0.0,-0.5423,away,0.14,away,close,fan_win_close
108,Banditjack,Chargers,64,dq911hl,t1_dq911hl,t1_dq911hl,True,7f2ii3,,,...,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...",0.0,1.0,0.0,0.0,away,0.14,away,close,fan_win_close
139,Clipdodgecharge,Chargers,229,dq916dc,t1_dq916dc,t1_dq916dc,True,7f2ii3,,,...,"{'neg': 0.341, 'neu': 0.659, 'pos': 0.0, 'comp...",0.341,0.659,0.0,-0.4767,away,0.14,away,close,fan_win_close
155,scoot87,Chargers,9,dq918u8,t1_dq918u8,t1_dq918u8,True,7f2ii3,,,...,"{'neg': 0.0, 'neu': 0.786, 'pos': 0.214, 'comp...",0.0,0.786,0.214,0.4588,away,0.14,away,close,fan_win_close


In [25]:
data[data.author_game_state == 'fan_lose_close'].head()

Unnamed: 0,author,author_flair,score,comment_id,comment_name,comment_fullname,comment_is_root,comment_parent,comment_approved_at_utc,comment_approved_by,...,vader_ss,vader_neg,vader_neu,vader_pos,vader_compound,fan_type,win_differential,win_team,game_state,author_game_state
3,TedBear72,Cowboys,24,dq90m0r,t1_dq90m0r,t1_dq90m0r,False,dq90l39,,,...,"{'neg': 1.0, 'neu': 0.0, 'pos': 0.0, 'compound...",1.0,0.0,0.0,-0.4404,home,0.14,away,close,fan_lose_close
26,TedBear72,Cowboys,3,dq90rkb,t1_dq90rkb,t1_dq90rkb,False,dq90quu,,,...,"{'neg': 1.0, 'neu': 0.0, 'pos': 0.0, 'compound...",1.0,0.0,0.0,-0.4404,home,0.14,away,close,fan_lose_close
32,Jabapy,Cowboys,29,dq90ryg,t1_dq90ryg,t1_dq90ryg,False,dq90rm3,,,...,"{'neg': 0.0, 'neu': 0.0, 'pos': 0.0, 'compound...",0.0,0.0,0.0,0.0,home,0.14,away,close,fan_lose_close
45,TedBear72,Cowboys,3,dq90ttc,t1_dq90ttc,t1_dq90ttc,True,7f2ii3,,,...,"{'neg': 0.0, 'neu': 0.392, 'pos': 0.608, 'comp...",0.0,0.392,0.608,0.7351,home,0.14,away,close,fan_lose_close
61,convfefe,Cowboys,-5,dq90vd6,t1_dq90vd6,t1_dq90vd6,False,dq90sh5,,,...,"{'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound...",0.0,1.0,0.0,0.0,home,0.14,away,close,fan_lose_close


### Prepare the data for modeling

In [112]:
# Alternative filters tried earlier (disabled): drop non-fans, or use all rows
#use_data = data[data['fan_type']!='nofan']
#use_data = data
# Keep only rows with a non-null comment_body; copy so that adding
# label_id below does not write to a view of data
use_data = data[pd.notnull(data['comment_body'])].copy()

# Isolate comment_body
comments = use_data.loc[:,'comment_body']

# Isolate the labels
labels = use_data.loc[:, 'author_game_state']

# Create some lookup dictionaries
use_data['label_id'] = use_data['author_game_state'].factorize()[0]
label_id_df = use_data[['author_game_state', 'label_id']].drop_duplicates().sort_values('label_id')
label_to_id = dict(label_id_df.values)
id_to_label = dict(label_id_df[['label_id', 'author_game_state']].values)
print(use_data.head())

# Incorporate two of Keri's data normalizations
# Convert links to word 'postedhyperlinkvalue'
X_data = comments.map(lambda x: re.sub(r"(https?:\/\/)?([\da-z\.-]+)\.([a-z\.]{2,6})([\/\w\.-]*)*\/?\S",
                                     "postedhyperlinkvalue", x))

# Convert each digit to 'DG' (so '10' becomes 'DGDG')
X_data = X_data.map(lambda x: re.sub(r'\d', 'DG', x))

                author author_flair score comment_id comment_name  \
0       Street_Spirit_      Raiders     2    dnnfsxx   t1_dnnfsxx   
1           irishkid46        Bears     1    dnnft5a   t1_dnnft5a   
2  SportsMasterGeneral        Bears    12    dnnft8q   t1_dnnft8q   
3       Street_Spirit_      Raiders    24    dnnftgx   t1_dnnftgx   
4      Fight_For_Tacos         None     0    dnnftkp   t1_dnnftkp   

  comment_fullname comment_is_root comment_parent comment_approved_at_utc  \
0       t1_dnnfsxx            True         7344it                    None   
1       t1_dnnft5a            True         7344it                    None   
2       t1_dnnft8q            True         7344it                    None   
3       t1_dnnftgx            True         7344it                    None   
4       t1_dnnftkp            True         7344it                    None   

  comment_approved_by   ...     vader_neg  vader_neu vader_pos vader_compound  \
0                None   ...         0.000
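
To make the digit normalization concrete, here is a small standalone sketch of what the `\d` substitution does to a comment. The helper name `normalize_digits` and the sample string are hypothetical. Because every single digit is replaced, multi-digit numbers turn into repeated `DG` fragments, which is why tokens such as `DG` and `DGDG` appear in the vocabulary later on.

```python
import re

def normalize_digits(text):
    # Each digit becomes the fragment 'DG', so a two-digit number becomes 'DGDG'.
    return re.sub(r"\d", "DG", text)

print(normalize_digits("3rd and 10"))  # prints "DGrd and DGDG"
```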

In [113]:
print(labels.unique())
print(label_id_df)

['nofan_notclose' 'fan_lose_notclose' 'fan_win_notclose' 'nofan_close'
 'fan_win_close' 'fan_lose_close']
     author_game_state  label_id
0       nofan_notclose         0
1    fan_lose_notclose         1
8     fan_win_notclose         2
517        nofan_close         3
518      fan_win_close         4
525     fan_lose_close         5


In [120]:
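# TF-IDF vectorize (CountVectorizer alternative below): tokenize into words, remove English stop words, and smooth the doc freqs.
# lowercase=False keeps the original case, so capitalized stop words such as 'That' are not filtered out.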
vectorizer = TfidfVectorizer(analyzer='word', stop_words='english', lowercase=False, smooth_idf=True)
#vectorizer = CountVectorizer(analyzer='word', stop_words='english')
spmat = vectorizer.fit_transform(X_data)
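
For intuition about what `smooth_idf=True` does, the sketch below recomputes TF-IDF weights by hand on a toy corpus. It assumes scikit-learn's documented defaults, namely idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalization of each row; the function name `tfidf` and the toy documents are hypothetical, and tokenization and stop-word removal are left out.

```python
import math
from collections import Counter

def tfidf(docs):
    """TF-IDF with smoothed idf and L2 row normalization over tokenized docs."""
    n_docs = len(docs)
    # Document frequency: number of docs that contain each term at least once.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    rows = []
    for doc in docs:
        weights = {t: tf * idf[t] for t, tf in Counter(doc).items()}
        # Guard against empty documents, whose norm would be zero.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return idf, rows

docs = [["great", "game"], ["terrible", "game"], ["game"]]
idf, rows = tfidf(docs)
print(idf)
print(rows)
```

A term that appears in every document ("game" here) gets the minimum idf of 1.0 instead of 0, so it is down-weighted but never dropped.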

In [121]:
# Split into test and train
train_data, test_data, train_labels, test_labels = train_test_split(spmat, labels, test_size=0.10, random_state=42)  

In [122]:
# Train model
#a_values = [x * 0.01 for x in range(1,20)]
#gs_mnb = GridSearchCV(MultinomialNB(), {'alpha': a_values}, cv=5,
#                       scoring='f1_weighted')
gs_mnb = MultinomialNB()
gs_mnb.fit(train_data, train_labels)
#print(gs_mnb.best_estimator_)
#print(gs_mnb.best_score_)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

In [123]:
print(gs_mnb.class_count_)

[ 35167.  41684.  41789.  71404. 136322. 203174.]


In [124]:
#Create predictions and evaluate
pred_labels = gs_mnb.predict(test_data)
acc = metrics.accuracy_score(test_labels, pred_labels)
print("Accuracy on test set: {:.02%}".format(acc))
print('Test Data:')

# Labels are strings, so classification_report names each row from the sorted class labels.
# A hand-ordered target_names list that does not follow that order mislabels the rows.
print(classification_report(test_labels, pred_labels, digits=3))
#print(classification_report(test_labels, pred_labels, target_names = ['fan_lose_close', 'fan_lose_notclose', 'fan_win_close', 'fan_win_notclose'], digits=3))

print("Confusion Matrix...")
confusionMatrix = metrics.confusion_matrix(test_labels, pred_labels)
print(confusionMatrix)

Accuracy on test set: 39.05%
Test Data:
                   precision    recall  f1-score   support

   fan_lose_close      0.222     0.002     0.004      3904
fan_lose_notclose      0.283     0.003     0.006      4692
    fan_win_close      0.345     0.006     0.013      4617
 fan_win_notclose      0.289     0.020     0.038      7911
      nofan_close      0.388     0.125     0.189     15214
   nofan_notclose      0.392     0.927     0.551     22500

      avg / total      0.354     0.391     0.266     58838

Confusion Matrix...
[[    8     5     3    47   364  3477]
 [    4    15     2    28   229  4414]
 [    5     6    30    66   498  4012]
 [    2     7    14   162   467  7259]
 [    9     6    19    92  1898 13190]
 [    8    14    19   165  1430 20864]]


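The classifier essentially performs majority-class classification: it assigns almost every comment to the largest class (the last row of the report, recall 0.927), while recall for every other class is 0.125 or lower.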
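
To put a number on that observation, the sketch below recomputes accuracy from the confusion matrix printed above and compares it with a baseline that always predicts the largest class. Only the matrix values come from the notebook output; the variable names are hypothetical.

```python
# Confusion matrix copied from the output above (rows = true class, columns = predicted).
confusion = [
    [8, 5, 3, 47, 364, 3477],
    [4, 15, 2, 28, 229, 4414],
    [5, 6, 30, 66, 498, 4012],
    [2, 7, 14, 162, 467, 7259],
    [9, 6, 19, 92, 1898, 13190],
    [8, 14, 19, 165, 1430, 20864],
]

support = [sum(row) for row in confusion]          # true examples per class
total = sum(support)
correct = sum(confusion[i][i] for i in range(len(confusion)))

accuracy = correct / total
majority_baseline = max(support) / total           # always predict the largest class
recall = [confusion[i][i] / support[i] for i in range(len(confusion))]

print("Model accuracy:    {:.2%}".format(accuracy))
print("Majority baseline: {:.2%}".format(majority_baseline))
print("Per-class recall:  " + ", ".join("{:.3f}".format(r) for r in recall))
```

The model's 39.05% accuracy is less than one percentage point above the 38.24% that the always-predict-the-largest-class baseline would score on the same test set.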

In [110]:
print(gs_mnb.coef_.shape)
print(gs_mnb.coef_)
print(len(vectorizer.vocabulary_))
print(gs_mnb.feature_log_prob_[1,:])

(6, 63876)
[[-12.40865016 -12.40865016 -11.71550298 ... -12.40865016 -12.40865016
  -12.40865016]
 [-11.22979635 -12.61609071 -11.92294353 ... -12.61609071 -12.61609071
  -11.92294353]
 [-12.52367874 -12.52367874 -11.83053156 ... -12.52367874 -12.52367874
  -11.83053156]
 [-12.97620402 -11.87759173 -12.97620402 ... -12.97620402 -12.97620402
  -11.58990965]
 [-12.49340091 -12.49340091 -12.49340091 ... -13.59201319  -9.98109528
  -12.49340091]
 [-12.60237171 -12.89005378 -11.90922453 ... -13.29551889  -9.45606657
  -12.60237171]]
63876
[-11.22979635 -12.61609071 -11.92294353 ... -12.61609071 -12.61609071
 -11.92294353]
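
The values printed above are smoothed log-probabilities, log P(feature | class). As a minimal sketch of how such values arise, assuming the usual additive smoothing with `alpha=1.0` (the value shown for the fitted model), the function below turns per-class feature totals into log-probabilities. The name `feature_log_prob` and the toy totals are hypothetical.

```python
import math

def feature_log_prob(feature_totals, alpha=1.0):
    """Smoothed log P(feature | class) from summed feature weights for one class."""
    denom = sum(feature_totals) + alpha * len(feature_totals)
    return [math.log((total + alpha) / denom) for total in feature_totals]

# Toy class with four features; two of them never occur in this class.
log_probs = feature_log_prob([0.0, 1.0, 0.0, 6.0])
print(log_probs)
print(log_probs[1] - log_probs[0])  # difference between totals of 1.0 and 0.0
```

With `alpha=1.0`, a feature whose total is 1.0 sits exactly ln 2 (about 0.693) above a feature that never occurs in the class. The first row of the printed matrix shows the same gap between -11.71550298 and the repeated floor value -12.40865016, which is consistent with the floor being features that have zero weight in that class.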


In [119]:
# Grab indices corresponding to the top 5 words by weight for each class
top_feat_args = np.argsort(gs_mnb.coef_, axis=1)[:,-5:]
print(top_feat_args)

# Look up the vocabulary once instead of on every loop iteration
# (on scikit-learn >= 1.0 use get_feature_names_out() instead)
feature_names = vectorizer.get_feature_names()

# Print unigram results: one row per top word, one column per class
print("Unigram Feature Weights by Label")
for class_args in top_feat_args:
    for arg in class_args:
        weights = gs_mnb.coef_[:, arg]
        print("{:>19}|".format(feature_names[arg]) + "|".join("{:>18.5f}".format(w) for w in weights))

[[67147 40381 68448 61658  9507]
 [ 9569 68448 67147 61658  9507]
 [67147 40381 68448 61658  9507]
 [40381  9569 68448 61658  9507]
 [40381 68448 61658  9507 55694]
 [ 9569 68448  9507 61658 55694]]
Unigram Feature Weights by Label
               just|          -6.16973|          -5.93625|          -6.08745|          -5.85747|          -5.62908|          -5.53268
               That|          -6.16016|          -6.07622|          -6.06538|          -5.82606|          -5.51561|          -5.49071
               like|          -6.06832|          -5.94631|          -5.97939|          -5.73743|          -5.44452|          -5.37847
               game|          -5.94401|          -5.64441|          -5.88981|          -5.63118|          -5.38933|          -5.23963
                 DG|          -5.81572|          -5.63217|          -5.68843|          -5.40389|          -5.36642|          -5.27561
               DGDG|          -6.30443|          -5.98022|          -6.12678|          -5.77294|  