# 2012 Presidential Election Twitter Sentiment Classification

### Fangda Fan, Xiaohan Liu

### March 2016

## Data

Tweets about the 2012 US presidential candidates Obama and Romney (about 7,200 tweets each), collected during October 12-16, 2012.

Each tweet is labelled with a sentiment: -1 (negative), 0 (neutral), 1 (positive), or 2 (mixed).

The task is to classify tweets as -1, 0, or 1; tweets labelled 2 are omitted.

## Requirements

- Platform: [Anaconda 4.3](https://www.continuum.io/downloads) (Python 3.6)

- Package for neural network: [Keras](https://github.com/fchollet/keras) with [TensorFlow](https://www.tensorflow.org/) or Theano backend
```sh
pip install keras
```
- Package for dataframe-based machine learning: [DFlearn](https://github.com/founderfan/DFlearn)
```sh
pip install dflearn
```
- Packages for advanced gradient boosting models: [LightGBM](https://github.com/Microsoft/LightGBM) and [XGBoost](https://github.com/dmlc/xgboost)
- NLTK package data, downloaded from Python:
```python
import nltk
nltk.download()
```
- [GloVe](http://nlp.stanford.edu/projects/glove/) Twitter 27B word vectors (the notebook loads the 200-dimensional file)

## Data Mining Methods

### 1. Bag-of-Words Analysis

Includes the following features:

1. TF-IDF vectorization of sentences
2. [Liu](https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html#lexicon)'s opinion lexicon counting
3. [VADER](http://www.nltk.org/_modules/nltk/sentiment/vader.html) sentiment analyzer
4. Date-time variables
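
As a rough illustration of the first feature, the sketch below computes TF-IDF weights in plain Python. It uses the smoothed IDF formula documented for scikit-learn's default `TfidfVectorizer` settings, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization. The function name `tfidf` and the toy tweets are hypothetical; the notebook itself relies on `TfidfVectorizer` with a custom analyzer.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    rows = []
    for doc in docs:
        tf = Counter(doc)
        raw = {t: c * idf[t] for t, c in tf.items()}
        # L2-normalize so every tweet vector has unit length.
        norm = math.sqrt(sum(w * w for w in raw.values()))
        rows.append({t: w / norm for t, w in raw.items()} if norm else {})
    return rows


# Hypothetical toy corpus, already tokenized.
toy_tweets = [
    ["romney", "lost", "debate"],
    ["obama", "won", "debate"],
    ["romney", "tax", "plan"],
]
weights = tfidf(toy_tweets)
```

Terms that appear in fewer tweets (here `lost`) receive a larger weight than terms shared across tweets (`romney`, `debate`).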

Models:

- Bernoulli Naive Bayes
- Random Forest
- AdaBoost
- Gradient Boosting Tree
- XGBoost
- LightGBM
- SVM

### 2. Sequential Word Analysis

Includes the following features:

1. Word vectors from the [GloVe](http://nlp.stanford.edu/projects/glove/) Twitter 27B embeddings
2. [Part-of-speech tagging](http://www.nltk.org/book/ch05.html)
3. [SentiWordNet](http://sentiwordnet.isti.cnr.it/)

Models:

- Deep Neural Network: CNN-LSTM

In [1]:
import re
import string

import numpy as np
import pandas as pd
import scipy.stats as st
import sklearn.preprocessing as prep
import sklearn.feature_extraction.text as txt
import sklearn.naive_bayes as nbayes
import sklearn.svm as svm
import sklearn.ensemble as ensm
import sklearn.metrics as met
import lightgbm as lgb
import xgboost as xgb
import keras as kr
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
import nltk
import nltk.sentiment as sent
import dflearn.MLtools as mt
import dflearn.NLtools as nt

%matplotlib inline

Using TensorFlow backend.


## Load and Clean Data

- Reduce each date to the day of the month (all data comes from 2012/10/12 to 2012/10/16)
- Convert each time to fractional hours on a 24-hour clock
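
The time conversion can be sketched in plain Python as follows. The raw time format in `tweet.xlsx` is not shown, so the `H:M:S AM/PM` layout and the helper name `to_hours` are assumptions; the logic mirrors the cells below (fractional hours, modulo 12 when an AM/PM marker is present, plus 12 for PM).

```python
import math
import re


def to_hours(text):
    """Convert a time string such as '10:30:15 PM' to fractional hours."""
    nums = [int(n) for n in re.findall(r"\d+", text)]
    if len(nums) < 3:
        return float("nan")  # incomplete time, mirrors the NaN fallback
    hours = nums[0] + nums[1] / 60 + nums[2] / 3600
    upper = text.upper()
    if "AM" in upper or "PM" in upper:
        hours %= 12  # 12:xx becomes 0.xx before the PM shift
        if "PM" in upper:
            hours += 12
    return hours


print(to_hours("10:30:00 PM"))  # evening time on a 24-hour clock
print(to_hours("12:15:00 AM"))  # shortly after midnight
```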

In [2]:
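def clean_time(x):
    # x holds the numbers parsed from a time string: [hours, minutes, seconds]
    if len(x) < 3:
        return np.nan  # incomplete time
    # convert to fractional hours
    return x[0] + x[1]/60 + x[2]/3600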
word_normalize = nt.word_Normalizer()
word_tokenize = nt.word_Tokenizer()

In [3]:
da = pd.read_excel("tweet.xlsx", sheetname=1).apply(lambda x: x.astype("str").str.strip("\t "))
da["date"] = da["date"].apply(mt.strnum, f_reduce = lambda x: np.sort(x)[-2])
da["time"] = da["time"].str.split("-", expand = True)[0].apply(mt.strnum, f_reduce = clean_time).where(~da["time"].str.contains("M"), lambda x: x%12) + 12*da["time"].str.contains("PM")
da["Class"] = pd.to_numeric(da["Class"], errors = "coerce")
da["Anootated tweet"] = da["Anootated tweet"].str.replace("</e>", "<e>").str.replace("<e>", "<em>").str.replace("</a>", "<a>").str.lower()
da.shape

(7200, 4)

In [4]:
S = da["Anootated tweet"].apply(lambda x: " ".join(word_tokenize.transform(x)))
S.head()

0    insidious ! <em> mitt romney <em> ' s bain hel...
1    senior <em> romney <em> advisor claims <em> ob...
2    . @wardbrenda @shortwave8669 @allanbourdius yo...
3    <em> mitt romney <em> still doesn't <a> believ...
4    <em> romney <em> ' s <a> tax plan <a> deserves...
Name: Anootated tweet, dtype: object

## 1. Bag-of-Words Analysis

- Use TF-IDF with lemmatization for the words in each tweet
- Use the polarity scores from the VADER sentiment analyzer
- Load the SentiWordNet dictionary for the positive, negative, and objective scores of words (used in the sequential word model)
- Get a word vector for each tweet
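
The opinion-lexicon features (`pos_sum` and `neg_sum` in cell `In [7]`) are sums of the TF-IDF weights of the words that appear in each lexicon. A minimal sketch of that idea, with tiny hypothetical word sets standing in for Liu's lexicon from `nltk.corpus.opinion_lexicon`:

```python
def lexicon_sums(weights, positive, negative):
    """Sum the weights of the terms found in each opinion lexicon."""
    pos = sum(w for t, w in weights.items() if t in positive)
    neg = sum(w for t, w in weights.items() if t in negative)
    return {"pos_sum": pos, "neg_sum": neg}


# Hypothetical stand-ins for the real positive/negative word lists.
toy_positive = {"win", "great"}
toy_negative = {"lost", "wrong", "liar"}

# Hypothetical TF-IDF weights for a single tweet.
tweet_weights = {"romney": 0.4, "lost": 0.6, "debate": 0.5, "wrong": 0.3}
scores = lexicon_sums(tweet_weights, toy_positive, toy_negative)
```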

In [5]:
vectorizer = txt.TfidfVectorizer(analyzer=lambda x: word_tokenize.transform(x, word_normalize.transform))
X_word = pd.DataFrame(vectorizer.fit_transform(S).toarray(), columns = vectorizer.get_feature_names(), dtype = "float16")
X_word.head()

Unnamed: 0,Unnamed: 1,!,"""",#,#012,#17electionistas,#18,#1u,#2012,#2012debate,...,💺,🗽,😂,😉,😒,😓,😝,😥,😭,😱
0,0.0,0.112244,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [6]:
daa_sent = pd.DataFrame([(i.synset.name()[:-5], i.synset.name()[-4], i.pos_score(), i.neg_score(), i.obj_score()) for i in nltk.corpus.sentiwordnet.all_senti_synsets()], columns=["word", "attr", "pos_swd", "neg_swd", "obj_swd"], dtype="float16")
daa_sent_c = daa_sent.groupby("word").mean()
daa_sent_word = daa_sent_c.reindex(X_word.columns).fillna({"pos_swd": 0, "neg_swd": 0, "obj_swd": 1})
daa_sent.shape, daa_sent_c.shape, daa_sent_word.shape

((117659, 5), (86571, 3), (9968, 3))

In [7]:
X_sent = pd.concat([X_word.loc[:, X_word.columns.intersection(getattr(nltk.corpus.opinion_lexicon, i)())].sum(axis=1) for i in ["positive", "negative"]], axis=1, keys = ["pos_sum", 'neg_sum'])

In [8]:
X_sent = X_sent.join(pd.DataFrame.from_records(S.apply(sent.vader.SentimentIntensityAnalyzer().polarity_scores)))
X_sent.head()

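| | pos_sum | neg_sum | compound | neg | neu | pos |
|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.355469 | 0.0 | 0.0 | 1.0 | 0.0 |
| 1 | 0.0 | 0.295654 | -0.4019 | 0.119 | 0.881 | 0.0 |
| 2 | 0.185303 | 0.356689 | -0.2023 | 0.185 | 0.674 | 0.14 |
| 3 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 |
| 4 | 0.0 | 0.18457 | -0.5267 | 0.116 | 0.884 | 0.0 |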


## Data Cleaning and Splitting

In [9]:
X = pd.concat([da[["date", "time"]], X_sent, X_word.loc[:,~X_word.columns.isin(nltk.corpus.stopwords.words("english"))].rename(columns = lambda x: "w_{}".format(x))], axis = 1, copy = False)
X["word"] = (X_word > 0).sum(axis = 1)
X.shape

(7200, 9853)

In [10]:
ir = da["Class"].loc[da["Class"].isin([-1,0,1])].index
mdset = mt.CVdata(df = X.join(da["Class"]), ic_x = X.columns, ic_y = ["Class"], ir = ir, k = 10, sp = 0.002, 
                  f_norm = lambda x: x.fillna(x.mean()).astype("float16").rename(columns=lambda x: x.replace("<", "_a(").replace("[", "_s(").replace("]", "_s)")))
mdset["X"].shape, mdset["Y"].shape

((5648, 754), (5648, 1))
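
How `mt.CVdata` assigns rows to its `k = 10` folds is not shown here. The sketch below is one common scheme, under that caveat: give every row a fold id in `0..k-1` as evenly as possible, shuffle the ids, and hold out one fold for validation. The helper names `assign_folds` and `split` are hypothetical.

```python
import random


def assign_folds(n_rows, k, seed=0):
    """Give each row a fold id in 0..k-1, balanced and shuffled."""
    fold_ids = [i % k for i in range(n_rows)]
    random.Random(seed).shuffle(fold_ids)
    return fold_ids


def split(fold_ids, held_out):
    """Return (train, validation) row positions for one held-out fold."""
    train = [i for i, f in enumerate(fold_ids) if f != held_out]
    valid = [i for i, f in enumerate(fold_ids) if f == held_out]
    return train, valid


fold_ids = assign_folds(5648, k=10)
train_idx, valid_idx = split(fold_ids, held_out=0)
print(len(train_idx), len(valid_idx))
```

With 5648 rows and 10 folds, a balanced assignment yields folds of 564 or 565 rows.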

## Machine Learning Models

In [11]:
mdpar_df = pd.DataFrame.from_dict({
    "bnb": {"namemd": "BNB", "f_model": nbayes.BernoulliNB, "par_model": {}},
    "rf": {"namemd": "RF", "f_model": ensm.RandomForestClassifier, "par_model": {"n_estimators": 500}},
    "adb": {"namemd": "ADB", "f_model": ensm.AdaBoostClassifier, "par_model": {"n_estimators": 500, "learning_rate": 0.1}},
    "gb": {"namemd": "GB", "f_model": ensm.GradientBoostingClassifier, "par_model": {"n_estimators": 100, "learning_rate": 0.1, "max_depth": 3, "max_features": 0.2}},
    "xgb": {"namemd": "XGB", "f_model": xgb.XGBClassifier, "par_model": {"n_estimators": 100, "colsample_bylevel": 0.2, "max_depth": 5, 'learning_rate': 0.1, 'eval_metric': 'merror'}},
    "lgb": {"namemd": "LGB", "f_model": lgb.LGBMClassifier, "par_model": {"n_estimators": 100, "colsample_bytree": 0.3, "num_leaves": 32, "learning_rate": 0.1, "subsample_for_bin": 10, 'eval_metric': 'multiclass'}},
    "lgbdart":  {"namemd": "LGB DART", "f_model": lgb.LGBMClassifier, "par_model": {"boosting_type": "dart", "n_estimators": 100, "colsample_bytree": 0.3, "num_leaves": 32, "learning_rate": 0.1, "subsample_for_bin": 10, 'eval_metric': 'multiclass'}},
    "svm": {"namemd": "SVM", "f_model": svm.LinearSVC, "par_model": {"dual": False}}}, orient = "index")
mdpar_df["f_loss"] = met.zero_one_loss
mdpar_df

Unnamed: 0,namemd,f_model,par_model,f_loss
adb,ADB,<class 'sklearn.ensemble.weight_boosting.AdaBo...,"{'n_estimators': 500, 'learning_rate': 0.1}",<function zero_one_loss at 0x7f1d2ea35ae8>
bnb,BNB,<class 'sklearn.naive_bayes.BernoulliNB'>,{},<function zero_one_loss at 0x7f1d2ea35ae8>
gb,GB,<class 'sklearn.ensemble.gradient_boosting.Gra...,"{'n_estimators': 100, 'learning_rate': 0.1, 'm...",<function zero_one_loss at 0x7f1d2ea35ae8>
lgb,LGB,<class 'lightgbm.sklearn.LGBMClassifier'>,"{'n_estimators': 100, 'colsample_bytree': 0.3,...",<function zero_one_loss at 0x7f1d2ea35ae8>
lgbdart,LGB DART,<class 'lightgbm.sklearn.LGBMClassifier'>,"{'boosting_type': 'dart', 'n_estimators': 100,...",<function zero_one_loss at 0x7f1d2ea35ae8>
rf,RF,<class 'sklearn.ensemble.forest.RandomForestCl...,{'n_estimators': 500},<function zero_one_loss at 0x7f1d2ea35ae8>
svm,SVM,<class 'sklearn.svm.classes.LinearSVC'>,{'dual': False},<function zero_one_loss at 0x7f1d2ea35ae8>
xgb,XGB,<class 'xgboost.sklearn.XGBClassifier'>,"{'n_estimators': 100, 'colsample_bylevel': 0.2...",<function zero_one_loss at 0x7f1d2ea35ae8>


### Single Training and Validation

In [12]:
xytv = mt.CVset(**mdset)
model = mt.MDinit(**mdpar_df.loc["rf"].to_dict())
mt.MDfit(model, **xytv)
mt.Loss(xytv["yv"].values, model.predict(xytv['xv']), met.zero_one_loss)

0.36460176991150439

In [13]:
xytv_df = mt.CVset_df(**mdset, ig=0)
model_df = mt.cross_join(xytv_df, mdpar_df.apply(lambda x: mt.MDinit(**x.to_dict()), axis=1).to_frame("model").join(mdpar_df[["namemd", "f_loss"]]))
model_df.apply(lambda x: mt.MDfit(**x.to_dict()), axis=1)
model_df["loss"] = model_df.apply(lambda x: mt.Loss(x["yv"], mt.MDpred(**x.to_dict()), x["f_loss"]), axis=1)
model_df

Unnamed: 0,xt,xv,yt,yv,model,namemd,f_loss,loss
0,date time pos_sum neg_sum com...,date time pos_sum neg_sum com...,Class 2 -1.0 3 -1.0 4 -1....,Class 0 -1.0 5 1.0 7 -1....,"(DecisionTreeClassifier(class_weight=None, cri...",ADB,<function zero_one_loss at 0x7f1d2ea35ae8>,0.375221
1,date time pos_sum neg_sum com...,date time pos_sum neg_sum com...,Class 2 -1.0 3 -1.0 4 -1....,Class 0 -1.0 5 1.0 7 -1....,"BernoulliNB(alpha=1.0, binarize=0.0, class_pri...",BNB,<function zero_one_loss at 0x7f1d2ea35ae8>,0.40354
2,date time pos_sum neg_sum com...,date time pos_sum neg_sum com...,Class 2 -1.0 3 -1.0 4 -1....,Class 0 -1.0 5 1.0 7 -1....,([DecisionTreeRegressor(criterion='friedman_ms...,GB,<function zero_one_loss at 0x7f1d2ea35ae8>,0.373451
3,date time pos_sum neg_sum com...,date time pos_sum neg_sum com...,Class 2 -1.0 3 -1.0 4 -1....,Class 0 -1.0 5 1.0 7 -1....,"LGBMClassifier(boosting_type='gbdt', colsample...",LGB,<function zero_one_loss at 0x7f1d2ea35ae8>,0.364602
4,date time pos_sum neg_sum com...,date time pos_sum neg_sum com...,Class 2 -1.0 3 -1.0 4 -1....,Class 0 -1.0 5 1.0 7 -1....,"LGBMClassifier(boosting_type='dart', colsample...",LGB DART,<function zero_one_loss at 0x7f1d2ea35ae8>,0.373451
5,date time pos_sum neg_sum com...,date time pos_sum neg_sum com...,Class 2 -1.0 3 -1.0 4 -1....,Class 0 -1.0 5 1.0 7 -1....,"(DecisionTreeClassifier(class_weight=None, cri...",RF,<function zero_one_loss at 0x7f1d2ea35ae8>,0.364602
6,date time pos_sum neg_sum com...,date time pos_sum neg_sum com...,Class 2 -1.0 3 -1.0 4 -1....,Class 0 -1.0 5 1.0 7 -1....,"LinearSVC(C=1.0, class_weight=None, dual=False...",SVM,<function zero_one_loss at 0x7f1d2ea35ae8>,0.4
7,date time pos_sum neg_sum com...,date time pos_sum neg_sum com...,Class 2 -1.0 3 -1.0 4 -1....,Class 0 -1.0 5 1.0 7 -1....,"XGBClassifier(base_score=0.5, colsample_byleve...",XGB,<function zero_one_loss at 0x7f1d2ea35ae8>,0.385841


### Variable Analysis

In [14]:
w = mt.MDweight_analysis(model, xytv["xt"])
w.sort_values("freq", ascending = False).head(50)

| | freq | std | Z-score | p-value |
|---|---|---|---|---|
| compound | 0.035465 | 0.000204 | 849.400636 | 0.0 |
| w__a(em> | 0.032842 | 0.000197 | 784.134798 | 0.0 |
| neg | 0.030734 | 0.000191 | 731.69752 | 0.0 |
| w_romney | 0.03057 | 0.00019 | 727.608793 | 0.0 |
| neg_sum | 0.029642 | 0.000187 | 704.536254 | 0.0 |
| neu | 0.027016 | 0.000179 | 639.198047 | 0.0 |
| word | 0.026045 | 0.000176 | 615.02052 | 0.0 |
| w__a(a> | 0.024854 | 0.000172 | 585.384351 | 0.0 |
| pos | 0.023923 | 0.000169 | 562.232707 | 0.0 |
| pos_sum | 0.023251 | 0.000166 | 545.516767 | 0.0 |


### Residual Analysis

In [15]:
# Classification result and cross-table
print(met.classification_report(xytv["yv"].iloc[:,0], model.predict(xytv['xv'])))
xytv["yv"].assign(Predict=model.predict(xytv['xv'])).groupby(["Class", "Predict"])["Predict"].count().unstack(["Predict"])

             precision    recall  f1-score   support

       -1.0       0.66      0.85      0.74       306
        0.0       0.52      0.38      0.44       159
        1.0       0.71      0.39      0.50       100

avg / total       0.63      0.64      0.61       565



| Class \ Predict | -1.0 | 0.0 | 1.0 |
|---|---|---|---|
| -1.0 | 260 | 40 | 6 |
| 0.0 | 89 | 60 | 10 |
| 1.0 | 45 | 16 | 39 |
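
The figures in the classification report can be recomputed directly from the cross-table above. The sketch below does this in plain Python: precision divides the diagonal count by the column total, recall divides it by the row total, and the zero-one loss is one minus the overall accuracy.

```python
labels = [-1.0, 0.0, 1.0]

# Rows are true classes, columns are predicted classes (cross-table above).
table = {
    -1.0: {-1.0: 260, 0.0: 40, 1.0: 6},
    0.0: {-1.0: 89, 0.0: 60, 1.0: 10},
    1.0: {-1.0: 45, 0.0: 16, 1.0: 39},
}


def precision(label):
    """Correct predictions of `label` over all predictions of `label`."""
    predicted = sum(table[t][label] for t in labels)
    return table[label][label] / predicted


def recall(label):
    """Correct predictions of `label` over all true members of `label`."""
    return table[label][label] / sum(table[label].values())


total = sum(sum(row.values()) for row in table.values())
accuracy = sum(table[l][l] for l in labels) / total
zero_one_loss = 1 - accuracy

for l in labels:
    print(l, round(precision(l), 2), round(recall(l), 2))
print(round(zero_one_loss, 6))
```

The recomputed loss agrees with the random forest's validation loss reported in the single training run.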


In [16]:
# Get most misclassified tweets
ev = pd.get_dummies(xytv["yv"].astype("O")) - model.predict_proba(xytv['xv'])
ie = ev.abs().sort_values(["Class_-1.0"], ascending=False).head(10).index
ev.loc[ie]

| | Class_-1.0 | Class_0.0 | Class_1.0 |
|---|---|---|---|
| 4970 | -0.88 | 0.908 | -0.028 |
| 3991 | 0.866 | -0.546 | -0.32 |
| 2702 | -0.864 | 0.946 | -0.082 |
| 4942 | -0.862 | 0.892 | -0.03 |
| 2435 | -0.858 | 0.886 | -0.028 |
| 5579 | -0.856 | 0.88 | -0.024 |
| 4010 | 0.848 | -0.7 | -0.148 |
| 229 | 0.846 | -0.802 | -0.044 |
| 3185 | -0.846 | 0.886 | -0.04 |
| 7044 | 0.818 | -0.11 | -0.708 |
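
Each row above is the one-hot encoding of the true class minus the predicted class probabilities, so a large negative value under `Class_-1.0` means the model was confident in "negative" for a tweet that was labelled otherwise. A minimal sketch of that residual follows; the probabilities are back-derived from the first row (index 4970), whose positive entry under `Class_0.0` implies a true class of 0.

```python
def residual(true_label, probabilities, labels=(-1.0, 0.0, 1.0)):
    """One-hot(true class) minus predicted probabilities, per class."""
    return [
        (1.0 if label == true_label else 0.0) - p
        for label, p in zip(labels, probabilities)
    ]


# Probabilities back-derived from the residuals of row 4970 (an illustration).
res = residual(0.0, [0.88, 0.092, 0.028])
print([round(r, 3) for r in res])
```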


In [17]:
da.loc[ie, "Anootated tweet"].tolist()

['<em>romney<em> lost #debate by #47percent',
 'cu professors double-down on prediction of <em>romney<em> win due to economic factors http://t.co/zxk3ronu',
 "@edshow wow they are idiots. i have been a ceo of nonprofits for 25 yrs & tell you they will <a>lose funding<a> over this. but <em>romney <em>don't care.",
 '@maddow <em>romney<em> said alll ak47s are illegal .wrong #presidentialdebate',
 'so <a>searching "completely wrong" on google images <a> yields pics of <em>romney<em> .. #mylifeismade',
 "he helped in 'getting the olympics on track'?! #bitchplease all <em>romney<em> done was <a>criticize<a> #liar",
 '<em>romney<em> rakes in staggering amount of money from lobbyists http://t.co/q7ejeurv',
 "<em>romney<em>'s <a>tax plan doesn't add up<a>, but does it deserve a second look? http://t.co/ygfyftd7",
 'in 1965, mitt <em>romney<em> was arrested for <a>using large blocks of ice to slide down the slopes of a golf course<a>.',
 '#sadfact <em>romney<em>is going to win"']

### Cross Validation

In [18]:
mdset_df = pd.DataFrame.from_records([mdset])
md_df = mt.DMinit(mdset_df, mdpar_df)
md_df.apply(lambda x: mt.CVply(f=mt.MDfit, parcv={"model": "modelL"}, **x.to_dict()), axis=1)
md_df

Unnamed: 0,X,Y,irts,namemd,f_loss,modelL
0,date time pos_sum neg_sum com...,Class 0 -1.0 2 -1.0 3 -1....,"[0, 3, 2, 2, 0, 1, 0, 1, 1, 6, 2, 5, 7, 9, 6, ...",ADB,<function zero_one_loss at 0x7f1d2ea35ae8>,"[(DecisionTreeClassifier(class_weight=None, cr..."
1,date time pos_sum neg_sum com...,Class 0 -1.0 2 -1.0 3 -1....,"[0, 3, 2, 2, 0, 1, 0, 1, 1, 6, 2, 5, 7, 9, 6, ...",BNB,<function zero_one_loss at 0x7f1d2ea35ae8>,"[BernoulliNB(alpha=1.0, binarize=0.0, class_pr..."
2,date time pos_sum neg_sum com...,Class 0 -1.0 2 -1.0 3 -1....,"[0, 3, 2, 2, 0, 1, 0, 1, 1, 6, 2, 5, 7, 9, 6, ...",GB,<function zero_one_loss at 0x7f1d2ea35ae8>,[([DecisionTreeRegressor(criterion='friedman_m...
3,date time pos_sum neg_sum com...,Class 0 -1.0 2 -1.0 3 -1....,"[0, 3, 2, 2, 0, 1, 0, 1, 1, 6, 2, 5, 7, 9, 6, ...",LGB,<function zero_one_loss at 0x7f1d2ea35ae8>,"[LGBMClassifier(boosting_type='gbdt', colsampl..."
4,date time pos_sum neg_sum com...,Class 0 -1.0 2 -1.0 3 -1....,"[0, 3, 2, 2, 0, 1, 0, 1, 1, 6, 2, 5, 7, 9, 6, ...",LGB DART,<function zero_one_loss at 0x7f1d2ea35ae8>,"[LGBMClassifier(boosting_type='dart', colsampl..."
5,date time pos_sum neg_sum com...,Class 0 -1.0 2 -1.0 3 -1....,"[0, 3, 2, 2, 0, 1, 0, 1, 1, 6, 2, 5, 7, 9, 6, ...",RF,<function zero_one_loss at 0x7f1d2ea35ae8>,"[(DecisionTreeClassifier(class_weight=None, cr..."
6,date time pos_sum neg_sum com...,Class 0 -1.0 2 -1.0 3 -1....,"[0, 3, 2, 2, 0, 1, 0, 1, 1, 6, 2, 5, 7, 9, 6, ...",SVM,<function zero_one_loss at 0x7f1d2ea35ae8>,"[LinearSVC(C=1.0, class_weight=None, dual=Fals..."
7,date time pos_sum neg_sum com...,Class 0 -1.0 2 -1.0 3 -1....,"[0, 3, 2, 2, 0, 1, 0, 1, 1, 6, 2, 5, 7, 9, 6, ...",XGB,<function zero_one_loss at 0x7f1d2ea35ae8>,"[XGBClassifier(base_score=0.5, colsample_bylev..."


In [30]:
md_df["yvpL"] = md_df.apply(lambda x: mt.CVply(f=mt.MDpred, parcv={"model": "modelL"}, **x.to_dict()), axis=1)
md_df["lossL"] = md_df.apply(lambda x: mt.CVply(f=mt.Loss, parcv={"yp": "yvpL"}, **x.to_dict()), axis=1)
md_df.set_index(["namemd"])["lossL"].apply(np.mean)

namemd
ADB         0.397846
BNB         0.403499
GB          0.386512
LGB         0.372166
LGB DART    0.383499
RF          0.386860
SVM         0.391996
XGB         0.383676
Name: lossL, dtype: float64

In [20]:
Yvp_combine = pd.concat([pd.concat(i) for i in md_df.loc[md_df["lossL"].apply(np.mean).rank()==1, "yvpL"]], axis=1).apply(lambda x: st.mode(x)[0][0], axis=1)[mdset["Y"].index]
mt.Loss(mdset["Y"].values, Yvp_combine.values, met.zero_one_loss)

0.37216713881019825
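
The cell above combines predictions by taking the per-row mode across the selected models. Because the selection keeps only the model ranked first by mean loss, the combined result here is that single model's predictions, which is consistent with the loss being close to LGB's. A sketch of the same voting rule applied to several models, with hypothetical predictions; ties are resolved toward the smallest label, matching the documented behaviour of `scipy.stats.mode`:

```python
from collections import Counter


def vote(predictions):
    """Return the most frequent label; ties go to the smallest label."""
    counts = Counter(predictions)
    top = max(counts.values())
    return min(label for label, c in counts.items() if c == top)


# Hypothetical per-model predictions for four tweets.
model_predictions = {
    "LGB": [-1, 0, 1, -1],
    "RF": [-1, 0, 0, 1],
    "GB": [0, 0, 1, 0],
}
combined = [vote(column) for column in zip(*model_predictions.values())]
print(combined)
```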

In [21]:
# Classification result and cross-table
print(met.classification_report(mdset["Y"].loc[ir].values, Yvp_combine.values))
mdset["Y"].loc[ir].join(Yvp_combine.rename("Predict")).groupby(["Class", "Predict"])["Class"].count().unstack(["Class"])

             precision    recall  f1-score   support

       -1.0       0.67      0.82      0.74      2893
        0.0       0.54      0.44      0.48      1680
        1.0       0.60      0.40      0.48      1075

avg / total       0.62      0.63      0.61      5648



| Predict \ Class | -1.0 | 0.0 | 1.0 |
|---|---|---|---|
| -1.0 | 2380 | 780 | 394 |
| 0.0 | 388 | 737 | 252 |
| 1.0 | 125 | 163 | 429 |


## 2. Sequential Word Analysis

In [32]:
intcoder_seq = nt.Integer_Coder()
seq = S.apply(lambda x: intcoder_seq.fit_transform(word_tokenize.transform(x, word_normalize.transform))).tolist()
X_seq = pd.DataFrame(pad_sequences(seq, maxlen=int(np.percentile([len(i) for i in seq], 95))), S.index, dtype = "int16")
s_word = pd.Series({**{val:key for key, val in intcoder_seq.code_dict.items()}, **{0:np.inf}})
X_seq.shape, s_word.shape

((7200, 36), (9969,))
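
`pad_sequences` with its default settings pads with zeros at the front and, for sequences longer than `maxlen`, drops tokens from the front. The plain-Python sketch below illustrates that behaviour; the helper name `pad_pre` is hypothetical and it is not a replacement for the Keras function.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad with `value`; keep only the last `maxlen` items."""
    padded = []
    for seq in sequences:
        kept = list(seq)[-maxlen:]  # truncate from the front
        padded.append([value] * (maxlen - len(kept)) + kept)
    return padded


print(pad_pre([[1, 2], [3, 4, 5, 6, 7]], maxlen=4))
```

Code 0 is therefore reserved for padding, which is why the word lookup series gets an extra entry for key 0.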

In [33]:
# daa = pd.read_csv("glove.twitter.27B.200d.zip", delim_whitespace=True, header=None, index_col=0, dtype={0: "str"}, quotechar=None, quoting=3).sort_index().astype("float32")
# daa.iloc[:-1].to_hdf("glove.twitter.27B.200d.h5", "da", format = "t", data_columns = [0], mode = "w", complevel = 5, complib = "zlib")
store = pd.HDFStore("glove.twitter.27B.200d.h5")
ic_tmp = store.select_column("da", 'index')
daa = store.select('da', ic_tmp[ic_tmp.isin(s_word.values)].index)
store.close()
daa.shape


(6511, 200)

In [34]:
da_embed = daa.reindex(s_word).join(daa_sent_c).fillna({"pos_swd": 0, "neg_swd": 0, "obj_swd": 1})
print("Dictionary Coverage: {}".format(da_embed.notnull().mean().mean()))
da_embed = mt.CLscale(da_embed)
da_embed.head()

Dictionary Coverage: 0.6582509226879215


Unnamed: 0,1,2,3,4,5,6,7,8,9,10,...,194,195,196,197,198,199,200,pos_swd,neg_swd,obj_swd
inf,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,-0.306466,-0.291534,0.384747
insidious,1.782293,-0.151287,-0.635173,0.086506,-0.666781,2.177102,-0.153495,2.319552,0.295783,-0.682121,...,0.779379,-0.340853,0.218779,-1.769679,-1.371769,0.723667,0.730113,0.611867,4.280963,-3.270761
!,1.483337,-0.482933,0.065889,-0.907354,-0.973925,0.74878,0.590777,0.699852,0.68277,0.734239,...,-1.558152,-0.855162,0.66976,-3.42912,-1.111927,-0.128121,3.216916,-0.306466,-0.291534,0.384747
<em>,0.243892,-0.15908,0.259676,1.28335,0.444095,-2.213586,-1.976314,1.079848,-2.385506,-0.88449,...,1.954455,-0.657083,0.548508,1.185473,1.808744,1.750447,-0.205635,-0.306466,-0.291534,0.384747
mitt,1.88994,0.134618,-1.26679,-0.935139,-0.814862,0.481039,-0.003477,-0.869186,0.785594,0.285991,...,-0.037611,-2.629066,0.260376,1.19217,-0.058297,-0.190996,0.025101,-0.306466,-0.291534,0.384747
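
The embedding table is aligned with the integer word codes, so row `i` holds the vector of the word coded `i` and row 0 is the padding row. The sketch below builds such a table from a plain dictionary, assuming that words missing from the pretrained vectors get a zero vector. The names `build_embedding`, `toy_codes`, and `toy_vectors` are hypothetical, and the coverage it reports is a simple word-level ratio, not the cell-level average printed above.

```python
def build_embedding(code_to_word, vectors, dim):
    """Return (matrix, coverage) with one row per integer word code."""
    n_rows = max(code_to_word) + 1  # row 0 is reserved for padding
    matrix = [[0.0] * dim for _ in range(n_rows)]
    found = 0
    for code, word in code_to_word.items():
        if word in vectors:
            matrix[code] = list(vectors[word])
            found += 1
    coverage = found / len(code_to_word)
    return matrix, coverage


# Hypothetical codes and 2-dimensional vectors for illustration.
toy_codes = {1: "romney", 2: "debate", 3: "#mylifeismade"}
toy_vectors = {"romney": [0.1, -0.2], "debate": [0.3, 0.4]}
matrix, coverage = build_embedding(toy_codes, toy_vectors, dim=2)
```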


In [35]:
intcoder_tag = nt.Integer_Coder()
seq_tag = [intcoder_tag.fit_transform([re.sub(r'[^\w]', '', j[1][:2]) for j in i]) for i in nltk.pos_tag_sents(S.apply(word_tokenize.transform).tolist())]
X_seq_tag = pd.DataFrame(pad_sequences(seq_tag, maxlen=int(np.percentile([len(i) for i in seq_tag], 95))), S.index, dtype = "int16")
s_tag = pd.Series({**{val:key for key, val in intcoder_tag.code_dict.items()}, **{0:0}})
X_seq_tag.shape, s_tag.shape

((7200, 36), (24,))

In [36]:
da_tag_embed = mt.CLscale(pd.get_dummies(s_tag).drop([0])).T
da_tag_embed.head()

Unnamed: 0,1,2,3,4,5,6,7,8,9,10,...,14,15,16,17,18,19,20,21,22,23
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
,-0.208514,4.587317,-0.208514,-0.208514,-0.208514,-0.208514,-0.208514,-0.208514,-0.208514,-0.208514,...,-0.208514,-0.208514,-0.208514,-0.208514,-0.208514,-0.208514,-0.208514,-0.208514,-0.208514,-0.208514
CC,-0.208514,-0.208514,-0.208514,-0.208514,-0.208514,-0.208514,-0.208514,4.587317,-0.208514,-0.208514,...,-0.208514,-0.208514,-0.208514,-0.208514,-0.208514,-0.208514,-0.208514,-0.208514,-0.208514,-0.208514
CD,-0.208514,-0.208514,-0.208514,-0.208514,-0.208514,-0.208514,-0.208514,-0.208514,-0.208514,-0.208514,...,-0.208514,-0.208514,-0.208514,-0.208514,-0.208514,-0.208514,-0.208514,-0.208514,-0.208514,-0.208514
DT,-0.208514,-0.208514,-0.208514,-0.208514,-0.208514,-0.208514,-0.208514,-0.208514,-0.208514,4.587317,...,-0.208514,-0.208514,-0.208514,-0.208514,-0.208514,-0.208514,-0.208514,-0.208514,-0.208514,-0.208514
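
The two distinct values in this table, -0.208514 and 4.587317, are what z-scoring a one-hot column over 23 tags produces when the sample standard deviation (n - 1 in the denominator) is used. That `mt.CLscale` does exactly this is an inference from the numbers, not something shown here. A short check in plain Python:

```python
import statistics


def zscore(column):
    """Standardize a column with the sample standard deviation."""
    mean = statistics.mean(column)
    std = statistics.stdev(column)  # divides by n - 1
    return [(v - mean) / std for v in column]


n_tags = 23
one_hot_column = [1.0] + [0.0] * (n_tags - 1)
scaled = zscore(one_hot_column)
print(round(scaled[0], 6), round(scaled[1], 6))
```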


In [37]:
mdset_seq = mt.CVdata(df = X_seq.join(da["Class"].astype("O")), ic_x = X_seq.columns, ic_y = ["Class"], ir=ir, k=10, f_norm_y=pd.get_dummies, f_norm=None)
mdset_tag = mt.CVdata(df = X_seq_tag.join(da["Class"].astype("O")), ic_x = X_seq_tag.columns, ic_y = ["Class"], ir=ir, k=10, f_norm_y=pd.get_dummies, f_norm=None)
mdset = mt.CVdata(df = X.join(da["Class"].astype("O")), ic_x = list(X_sent.columns)+["date", "time"], ic_y = ["Class"], ir=ir, k=10, f_norm_y=pd.get_dummies, f_norm=mt.CLscale)

### Sequential Neural Network

In [47]:
xytv = mt.CVset(**mdset, ig=2)
xytv_seq = mt.CVset(**mdset_seq, ig=2)
xytv_tag = mt.CVset(**mdset_tag, ig=2)

In [64]:
input_seq = kr.layers.Input(shape=(X_seq.shape[1],), dtype='int32')
x = kr.layers.Embedding(*da_embed.shape, weights=[da_embed.values], trainable=False)(input_seq)
input_seq_tag = kr.layers.Input(shape=(X_seq_tag.shape[1],), dtype='int32')
embed_tag = kr.layers.Embedding(*da_tag_embed.shape, weights=[da_tag_embed.values], trainable=False)(input_seq_tag)
x = kr.layers.concatenate([x, embed_tag], axis=2)
x = kr.layers.Dropout(0.5)(x)
x = kr.layers.Convolution1D(256, 1, activation = 'relu')(x)
x_br1 = kr.layers.Dropout(0.5)(x)
x_br1 = kr.layers.MaxPooling1D(2)(x_br1)
x_br1 = kr.layers.Convolution1D(16, 2, activation = 'relu')(x_br1)
x_br1 = kr.layers.MaxPooling1D(3)(x_br1)
x_br1 = kr.layers.Flatten()(x_br1)
x = kr.layers.Dropout(0.5)(x)
x = kr.layers.LSTM(48, activation = 'relu')(x)
input_base = kr.layers.Input(shape=(mdset["X"].shape[1],))
x = kr.layers.concatenate([x, x_br1, input_base], axis=1)
x = kr.layers.Dropout(0.5)(x)
x = kr.layers.Dense(256, activation = 'relu')(x)
x = kr.layers.Dropout(0.5)(x)
preds = kr.layers.Dense(mdset["Y"].shape[1], activation='softmax')(x)
model = kr.models.Model([input_seq, input_seq_tag, input_base], preds)
model.compile(loss='categorical_crossentropy', optimizer=kr.optimizers.Nadam(5e-4), metrics=['acc'])
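
To see what the convolutional branch (`x_br1`) contributes to the concatenation, the sequence length can be traced through its layers. The sketch assumes the Keras defaults used above: `valid` padding and stride 1 for the convolutions, and a pooling stride equal to the pool size. The helper names are hypothetical.

```python
def conv1d_len(length, kernel):
    """Output length of a 1D convolution with valid padding, stride 1."""
    return length - kernel + 1


def pool1d_len(length, pool):
    """Output length of 1D max pooling with stride equal to pool size."""
    return length // pool


steps = 36                      # padded sequence length
steps = conv1d_len(steps, 1)    # Convolution1D(256, 1)
steps = pool1d_len(steps, 2)    # MaxPooling1D(2)
steps = conv1d_len(steps, 2)    # Convolution1D(16, 2)
steps = pool1d_len(steps, 3)    # MaxPooling1D(3)
flat = steps * 16               # Flatten over 16 filters
print(steps, flat)
```

The final dense layers therefore receive the 48 LSTM units, this flattened branch, and the base features side by side.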

In [49]:
model.summary()

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
input_1 (InputLayer)             (None, 36)            0                                            
____________________________________________________________________________________________________
input_2 (InputLayer)             (None, 36)            0                                            
____________________________________________________________________________________________________
embedding_1 (Embedding)          (None, 36, 203)       2023707                                      
____________________________________________________________________________________________________
embedding_2 (Embedding)          (None, 36, 23)        552                                          
___________________________________________________________________________________________

In [50]:
model.fit([xytv_seq["xt"].values, xytv_tag["xt"].values, xytv["xt"].values], xytv["yt"].values, 
          validation_data=([xytv_seq["xv"].values, xytv_tag["xv"].values, xytv["xv"].values], xytv["yv"].values), 
          callbacks=[kr.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=2, min_lr=1e-6, verbose=1)], 
          epochs=50, batch_size=32, verbose=2)

Train on 4518 samples, validate on 1130 samples
Epoch 1/50
19s - loss: 1.0038 - acc: 0.5062 - val_loss: 0.9411 - val_acc: 0.5690
Epoch 2/50
15s - loss: 0.9575 - acc: 0.5458 - val_loss: 0.9254 - val_acc: 0.5788
Epoch 3/50
15s - loss: 0.9446 - acc: 0.5507 - val_loss: 0.8914 - val_acc: 0.6088
Epoch 4/50
16s - loss: 0.9289 - acc: 0.5602 - val_loss: 0.8847 - val_acc: 0.6159
Epoch 5/50
17s - loss: 0.9229 - acc: 0.5637 - val_loss: 0.8732 - val_acc: 0.6283
Epoch 6/50
20s - loss: 0.9095 - acc: 0.5810 - val_loss: 0.8726 - val_acc: 0.6150
Epoch 7/50
18s - loss: 0.8945 - acc: 0.5839 - val_loss: 0.8868 - val_acc: 0.6000
Epoch 8/50
22s - loss: 0.9073 - acc: 0.5799 - val_loss: 0.8536 - val_acc: 0.6283
Epoch 9/50
22s - loss: 0.8838 - acc: 0.5965 - val_loss: 0.8555 - val_acc: 0.6283
Epoch 10/50
20s - loss: 0.8714 - acc: 0.5994 - val_loss: 0.8548 - val_acc: 0.6186
Epoch 11/50
16s - loss: 0.8720 - acc: 0.6036 - val_loss: 0.8374 - val_acc: 0.6425
Epoch 12/50
21s - loss: 0.8674 - acc: 0.6014 - val_loss: 0.

<keras.callbacks.History at 0x7fe02c5f4e10>
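
The `ReduceLROnPlateau` callback lowers the learning rate when the monitored validation loss stops improving. The sketch below is a simplified version of that rule, not Keras's implementation: it ignores the improvement threshold and cooldown, and the exact epoch at which a reduction triggers may differ between Keras versions. The validation losses are hypothetical.

```python
def reduce_on_plateau(val_losses, lr, factor=0.5, patience=2, min_lr=1e-6):
    """Return the learning rate in effect after each epoch."""
    best = float("inf")
    wait = 0
    history = []
    for loss in val_losses:
        if loss < best:
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                # No improvement for `patience` epochs: shrink the rate.
                lr = max(lr * factor, min_lr)
                wait = 0
        history.append(lr)
    return history


# Hypothetical validation losses, starting from the notebook's 5e-4.
toy_val_losses = [0.97, 0.93, 0.94, 0.94, 0.92]
schedule = reduce_on_plateau(toy_val_losses, lr=5e-4)
print(schedule)
```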

In [79]:
model.fit([xytv_seq["xt"].values, xytv_tag["xt"].values, xytv["xt"].values], xytv["yt"].values, 
          validation_data=([xytv_seq["xv"].values, xytv_tag["xv"].values, xytv["xv"].values], xytv["yv"].values), 
          callbacks=[kr.callbacks.ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=2, min_lr=1e-6, verbose=1)], 
          epochs=50, batch_size=32, verbose=2)

Train on 4500 samples, validate on 1125 samples
Epoch 1/50
15s - loss: 1.0629 - acc: 0.4373 - val_loss: 0.9691 - val_acc: 0.5440
Epoch 2/50
16s - loss: 1.0013 - acc: 0.5067 - val_loss: 0.9299 - val_acc: 0.5769
Epoch 3/50
15s - loss: 1.3854 - acc: 0.4644 - val_loss: 0.9415 - val_acc: 0.5662
Epoch 4/50
13s - loss: 1.0038 - acc: 0.5164 - val_loss: 0.9364 - val_acc: 0.5600
Epoch 5/50

Epoch 00004: reducing learning rate to 0.0002500000118743628.
14s - loss: 0.9849 - acc: 0.5229 - val_loss: 0.9360 - val_acc: 0.5698
Epoch 6/50
13s - loss: 0.9720 - acc: 0.5324 - val_loss: 0.9332 - val_acc: 0.5662
Epoch 7/50
14s - loss: 0.9656 - acc: 0.5349 - val_loss: 0.9262 - val_acc: 0.5760
Epoch 8/50
14s - loss: 0.9618 - acc: 0.5413 - val_loss: 0.9210 - val_acc: 0.5867
Epoch 9/50
14s - loss: 0.9640 - acc: 0.5400 - val_loss: 0.9261 - val_acc: 0.5831
Epoch 10/50
14s - loss: 0.9653 - acc: 0.5424 - val_loss: 0.9218 - val_acc: 0.5920
Epoch 11/50
14s - loss: 0.9579 - acc: 0.5436 - val_loss: 0.9197 - val_acc: 0.5

<keras.callbacks.History at 0x7fcb63adcac8>