# Early Model (MVP): Using Article Text to Predict Sentiment

- This was my MVP: train a model to predict the sentiment of each article
- The target was the tone (sentiment) score provided by GDELT
- Each row represents a single article
- I achieved reasonable results with very little hyperparameter tuning, which gave me the confidence to move on to exploring external targets such as sentiment and Google Trends data

# Imports

In [24]:
import csv
import re

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import spacy
from sklearn.decomposition import PCA, SparsePCA, TruncatedSVD
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from xgboost import XGBClassifier



# Reading and cleaning data

In [32]:
#reading in article data
text_tokens = pd.read_csv('/floyd/home/Capstone/cap_notebooks/data/master_data_set/text_with_tokens_52k.csv')

In [7]:
#pre-spacy tokenizer
#text cleaning helper function: turns a stringified list of tokens back into a list
def cleaning_tokens(list_of_tokens):
    #removes the surrounding '[' and ']', then splits on commas
    #note: a token that itself contains a comma will be split into two
    initial_list = list_of_tokens.strip('[').strip(']').split(',')
    clean_list = []
    #loops through the new list and removes whitespace and the surrounding "'"
    for token in initial_list:
        clean_list.append(token.strip().strip("'"))

    return clean_list
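
The helper above splits the stored string by hand. As an alternative sketch, and assuming every `text_token` cell is a valid Python list literal (which I have not verified for the full file), the standard library's `ast.literal_eval` can parse the cell directly. The function name `parse_token_list` is hypothetical and only used for illustration.

```python
import ast

def parse_token_list(cell):
    """Parse a stringified list such as "['a', 'b']" into a real Python list."""
    return ast.literal_eval(cell)

# Made-up example cell in the same shape as the text_token column
raw = "['answer', 'resounding', 'myriad', 'claim']"
tokens = parse_token_list(raw)
print(tokens)

# Unlike a plain split(','), a token containing a comma stays in one piece
with_comma = parse_token_list("['a, b', 'c']")
print(with_comma)
```

This raises an error on malformed cells instead of silently returning odd tokens, which may or may not be what you want for a quick MVP.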

In [38]:
#reviewing data
text_tokens.head()

Unnamed: 0.1,Unnamed: 0,gkgcode,date,link,tone,title,authors,pub_date,text,date_str,Tone_only,polarity,text_token
0,0,20150302100000-674,2015-03-02,http://www.nationalreview.com/article/414611/a...,"0.350631136044881,2.73492286115007,2.384291725...",Is America a ‘Clean Energy’ Laggard?,"['Robert Bryce', 'Victor Davis Hanson', 'Isaac...",2015-03-02 04:00:00+00:00,"The answer is not only “No,” but a resounding ...",20150302,0.350631,5.119215,"['answer', 'resounding', 'myriad', 'claim', 'e..."
1,3,20150302153000-229,2015-03-02,http://www.latimes.com/business/hiltzik/la-fi-...,"-0.952380952380953,3.49206349206349,4.44444444...",Watch ‘Meet the Press’ treat climate change as...,"['Business Columnist', 'Los Angeles Times Colu...",2015-03-01 00:00:00,"As you may have heard, Sen. James Inhofe (R-Ok...",20150302,-0.952381,7.936508,"['hear', 'sen.', 'james', 'inhofe', 'r', 'okla..."
2,6,20150302163000-237,2015-03-02,http://www.usatoday.com/story/news/nation-now/...,"0,1.8140589569161,1.8140589569161,3.6281179138...",,[],2015-03-02 00:00:00,Mary Bowerman USA TODAY Network Visitors sho...,20150302,0.0,3.628118,"['mary', 'bowerman', 'usa', 'today', 'network'..."
3,4,20150302180000-1352,2015-03-02,http://www.nytimes.com/2015/03/03/business/int...,"-1.14754098360656,1.80327868852459,2.950819672...",Russian Energy Deal Comes at Contentious Time,['Stanley Reed'],2015-03-03 00:00:00,But Mr. Fridman has a business track record th...,20150302,-1.147541,4.754098,"['mr.', 'fridman', 'business', 'track', 'recor..."
4,2,20150302203000-163,2015-03-02,http://www.cbsnews.com/news/did-climate-change...,"-8.0545229244114,0.371747211895911,8.426270136...",Did climate change cause the Syrian civil war?,"['Michael Casey', 'Michael Casey Covers The En...",,Climate change sparked a historic drought in S...,20150302,-8.054523,8.798017,"['climate', 'change', 'spark', 'historic', 'dr..."


In [39]:
#calculating mean tone
text_tokens['Tone_only'].mean()

-1.774077419672286

In [33]:
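#binarizing tone: 1 if tone >= -1.6, else 0
#the -1.6 cutoff splits the articles roughly in half (see the balance check in the next cells)
text_tokens['binary_tone'] = np.where(text_tokens['Tone_only']>=-1.6, 1, 0)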

In [41]:
#checking balance of dataset target
text_tokens['binary_tone'].sum()

26431

In [42]:
#comparing above cell to length of dataset
len(text_tokens)

52758
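
From the two numbers above, 26431 of 52758 articles (about 50.1 percent) fall in the positive class, so the target is close to balanced. The sketch below shows the same check on a toy frame, using the median as the cutoff. The tone values are made up for illustration and are not GDELT data, and picking the median is my assumption about one way to get a balanced split, not necessarily how the -1.6 cutoff was chosen.

```python
import numpy as np
import pandas as pd

# Toy tone scores (made up for illustration)
toy = pd.DataFrame({'Tone_only': [-8.05, -1.15, -0.95, 0.0, 0.35, -3.2]})

# Using the median as the cutoff puts about half the rows in each class
threshold = toy['Tone_only'].median()
toy['binary_tone'] = np.where(toy['Tone_only'] >= threshold, 1, 0)

# The mean of a 0/1 column is the share of rows in the positive class
positive_share = toy['binary_tone'].mean()
print(threshold, positive_share)
```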

In [34]:
#setting X and y
X = text_tokens['text_token']
y = text_tokens['binary_tone']

In [35]:
#train test split
X_train, X_test, y_train,y_test = train_test_split(X,y, test_size=.3, stratify=y)
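
Note that the split above has no `random_state`, so every re-run (and every experiment below that re-splits) sees a different train/test partition. A minimal sketch of a reproducible split, using toy arrays and an arbitrary seed of 42 (both are my assumptions, not part of the original notebook):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data standing in for the text and the binary target
X_toy = np.arange(20)
y_toy = np.array([0, 1] * 10)

# Same seed, same stratification: the two splits are identical
split_a = train_test_split(X_toy, y_toy, test_size=.3, stratify=y_toy, random_state=42)
split_b = train_test_split(X_toy, y_toy, test_size=.3, stratify=y_toy, random_state=42)

same_test_rows = np.array_equal(split_a[1], split_b[1])
print(same_test_rows)
```

With a fixed seed, differences in accuracy between experiments come from the vectorizer and model settings rather than from the luck of the split.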

### Bag of Words, Log Reg, Tone Target

- Applies CountVectorizer and Logistic Regression
- Target is article tone score from GDELT
- train acc: 81 percent
- test acc: 80 percent

In [None]:
#creating Count Vectorizer
bagofwords = CountVectorizer(min_df=5)

In [37]:
#fitting bagofwords
bagofwords.fit(X_train)

CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                lowercase=True, max_df=1.0, max_features=None, min_df=5,
                ngram_range=(1, 1), preprocessor=None, stop_words=None,
                strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                tokenizer=None, vocabulary=None)

In [58]:
#transforming X_train, X_test
X_train_transformed = bagofwords.transform(X_train)
X_test_transformed = bagofwords.transform(X_test)

In [59]:
#fitting log reg model
model = LogisticRegression(C=.01, solver='saga')
model.fit(X_train_transformed, y_train)



LogisticRegression(C=0.01, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='saga', tol=0.0001, verbose=0,
                   warm_start=False)

In [60]:
#scoring model
display(model.score(X_test_transformed, y_test))
display(model.score(X_train_transformed, y_train))

0.7988375031589589

0.8139994584348768
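
The fit, transform, and score steps above can also be chained with the `Pipeline` class imported at the top of the notebook. This is a sketch on a tiny made-up corpus, not the notebook's data. The step names, `min_df=1`, and `max_iter=1000` are my assumptions; the run above used `min_df=5` and the default `max_iter=100`, and `saga` can need more iterations to converge on unscaled count features.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Tiny made-up corpus: 1 = more positive tone, 0 = more negative tone
docs = [
    'clean energy growth success', 'solar project wins award',
    'drought crisis war damage', 'storm destroys homes fear',
    'record growth clean success', 'crisis fear damage war',
]
labels = [1, 1, 0, 0, 1, 0]

# The vectorizer is fit only on whatever data is passed to pipe.fit
pipe = Pipeline([
    ('vectorizer', CountVectorizer(min_df=1)),
    ('model', LogisticRegression(C=.01, solver='saga', max_iter=1000)),
])
pipe.fit(docs, labels)

preds = pipe.predict(docs)
print(preds)
```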

### TFIDF, Log Reg, Tone Target

- Applies TfidfVectorizer and Logistic Regression
- Target is article tone score from GDELT
- train acc: 89 percent
- test acc: 85 percent

In [6]:
X_train, X_test, y_train,y_test = train_test_split(X, y, test_size=.3, stratify=y)
print(f'Split done - X_train shape: {X_train.shape}, X_test shape: {X_test.shape}, y_train shape: {y_train.shape}, y_test shape: {y_test.shape}')

#create vectorizer
bagofwords = TfidfVectorizer(min_df=5)
print('vectorizer done')

#fit vectorizer
print('beginng vectorizer fitting')
bagofwords.fit(X_train)
print('vectorizer fitting complete')


#transform X_train
print('beginning transformation')
X_train_transformed = bagofwords.transform(X_train)
print('X_train transformed')

#transform X_test
X_test_transformed = bagofwords.transform(X_test)
print('X_test_transformed')

#create model
print('creating model')
model = LogisticRegression(C=1, solver='liblinear')
print('model completed')


#fit model
print('fitting model')
model.fit(X_train_transformed, y_train)
print('model fitted')

#score training set 
print('scoring training data')
train_score = model.score(X_train_transformed, y_train)

#score test set
print('scoring test data')
test_score = model.score(X_test_transformed, y_test)

print(f'Training score: {train_score}')
print(f'Test score: {test_score}')
#return (bagofwords, model, X_train_transformed, X_test_transformed, y_train, y_test)


Split done - X_train shape: (36930,), X_test shape: (15828,), y_train shape: (36930,), y_test shape: (15828,)
vectorizer done
beginng vectorizer fitting
vectorizer fitting complete
beginning transformation
X_train transformed
X_test_transformed
creating model
model completed
fitting model
model fitted
scoring training data
scoring test data
Training score: 0.8919848361765502
Test score: 0.846095526914329


### TFIDF with N-Grams, Log Reg, Tone Target with C = .01

- Applies TfidfVectorizer with unigrams and bi-grams (ngram_range=(1,2)) and Logistic Regression
- Target is article tone score from GDELT
- train acc: 74 percent
- test acc: 73 percent

In [6]:
X_train, X_test, y_train,y_test = train_test_split(X, y, test_size=.3, stratify=y)
print(f'Split done - X_train shape: {X_train.shape}, X_test shape: {X_test.shape}, y_train shape: {y_train.shape}, y_test shape: {y_test.shape}')

#create vectorizer
bagofwords = TfidfVectorizer(min_df=5, ngram_range=(1,2))
print('vectorizer done')

#fit vectorizer
print('beginng vectorizer fitting')
bagofwords.fit(X_train)
print('vectorizer fitting complete')


#transform X_train
print('beginning transformation')
X_train_transformed = bagofwords.transform(X_train)
print('X_train transformed')

#transform X_test
X_test_transformed = bagofwords.transform(X_test)
print('X_test_transformed')

#create model
print('creating model')
model = LogisticRegression(C=.01, solver='saga')
print('model completed')


#fit model
print('fitting model')
model.fit(X_train_transformed, y_train)
print('model fitted')

#score training set 
print('scoring training data')
train_score = model.score(X_train_transformed, y_train)

#score test set
print('scoring test data')
test_score = model.score(X_test_transformed, y_test)

print(f'Training score: {train_score}')
print(f'Test score: {test_score}')
#return (bagofwords, model, X_train_transformed, X_test_transformed, y_train, y_test)

Split done - X_train shape: (36930,), X_test shape: (15828,), y_train shape: (36930,), y_test shape: (15828,)
vectorizer done
beginng vectorizer fitting
vectorizer fitting complete
beginning transformation
X_train transformed
X_test_transformed
creating model
model completed
fitting model
model fitted
scoring training data
scoring test data
Training score: 0.7423503926347144
Test score: 0.7323730098559514


### TFIDF with N-Grams, Log Reg, Tone Target with C = .1

- Applies TfidfVectorizer with unigrams and bi-grams (ngram_range=(1,2)) and Logistic Regression
- Target is article tone score from GDELT
- Weaker regularization than the previous run (C raised from .01 to .1)
- train acc: 81 percent
- test acc: 78 percent

In [7]:
X_train, X_test, y_train,y_test = train_test_split(X, y, test_size=.3, stratify=y)
print(f'Split done - X_train shape: {X_train.shape}, X_test shape: {X_test.shape}, y_train shape: {y_train.shape}, y_test shape: {y_test.shape}')

#create vectorizer
bagofwords = TfidfVectorizer(min_df=5, ngram_range=(1,2))
print('vectorizer done')

#fit vectorizer
print('beginng vectorizer fitting')
bagofwords.fit(X_train)
print('vectorizer fitting complete')


#transform X_train
print('beginning transformation')
X_train_transformed = bagofwords.transform(X_train)
print('X_train transformed')

#transform X_test
X_test_transformed = bagofwords.transform(X_test)
print('X_test_transformed')

#create model
print('creating model')
model = LogisticRegression(C=.1, solver='saga')
print('model completed')


#fit model
print('fitting model')
model.fit(X_train_transformed, y_train)
print('model fitted')

#score training set 
print('scoring training data')
train_score = model.score(X_train_transformed, y_train)

#score test set
print('scoring test data')
test_score = model.score(X_test_transformed, y_test)

print(f'Training score: {train_score}')
print(f'Test score: {test_score}')

Split done - X_train shape: (36930,), X_test shape: (15828,), y_train shape: (36930,), y_test shape: (15828,)
vectorizer done
beginng vectorizer fitting
vectorizer fitting complete
beginning transformation
X_train transformed
X_test_transformed
creating model
model completed
fitting model
model fitted
scoring training data
scoring test data
Training score: 0.8124830760898998
Test score: 0.7849380844073793


#next step, build a grid search
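
A possible shape for that grid search, using the `Pipeline` and `GridSearchCV` classes imported at the top of the notebook. This is a sketch on a tiny made-up corpus. The step names, the grid values (taken from the `C` and `ngram_range` settings tried by hand above), and `cv=3` are my assumptions, not results from the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Tiny made-up corpus: 1 = more positive tone, 0 = more negative tone
docs = [
    'clean energy growth', 'solar success award', 'growth record success',
    'award clean solar', 'energy success growth', 'record solar award',
    'drought crisis war', 'storm damage fear', 'war fear crisis',
    'damage drought storm', 'crisis storm war', 'fear drought damage',
]
labels = [1] * 6 + [0] * 6

pipe = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('logreg', LogisticRegression(solver='liblinear')),
])

# Keys follow the <step name>__<parameter> convention
param_grid = {
    'tfidf__ngram_range': [(1, 1), (1, 2)],
    'logreg__C': [.01, .1, 1],
}

search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(docs, labels)
print(search.best_params_)
```

Because the vectorizer sits inside the pipeline, it is refit on the training folds of each cross-validation split, so the held-out fold never influences the vocabulary.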

# Dimensionality Reduction

- I tried dimensionality reduction with TruncatedSVD
- I seem to have failed to save the results, but I did note that it took a long time and did not improve on the previous models, so I abandoned this approach


In [15]:
#this takes a long time and doesn't yield great results
tsvd = TruncatedSVD(n_components=1000)
X_train_transformed_sparse_tsvd = tsvd.fit(X_train_transformed).transform(X_train_transformed)
X_test_transformed_sparse_tsvd = tsvd.transform(X_test_transformed)
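
Before committing to a long run with 1000 components, one cheap check is how much variance a given number of components retains. The sketch below runs `TruncatedSVD` on a tiny made-up TF-IDF matrix. The corpus, `n_components=2`, and `random_state=0` are assumptions for illustration only and say nothing about the real data.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny made-up corpus
docs = [
    'clean energy growth', 'solar success award', 'growth record success',
    'drought crisis war', 'storm damage fear', 'war fear crisis',
]

tfidf = TfidfVectorizer().fit_transform(docs)

# TruncatedSVD works directly on sparse matrices, unlike PCA
tsvd = TruncatedSVD(n_components=2, random_state=0)
reduced = tsvd.fit_transform(tfidf)

# Share of the original variance kept by the retained components
explained = tsvd.explained_variance_ratio_.sum()
print(reduced.shape, explained)
```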

In [38]:
X_train, X_test, y_train,y_test = train_test_split(X, y, test_size=.3, stratify=y)
print(f'Split done - X_train shape: {X_train.shape}, X_test shape: {X_test.shape}, y_train shape: {y_train.shape}, y_test shape: {y_test.shape}')

#create vectorizer
bagofwords = TfidfVectorizer(min_df=5)
print('vectorizer done')

#fit vectorizer
print('beginng vectorizer fitting')
bagofwords.fit(X_train)
print('vectorizer fitting complete')


#transform X_train
print('beginning transformation')
X_train_transformed = bagofwords.transform(X_train)
print('X_train transformed')

Split done - X_train shape: (36930,), X_test shape: (15828,), y_train shape: (36930,), y_test shape: (15828,)
vectorizer done
beginng vectorizer fitting
vectorizer fitting complete
beginning transformation
X_train transformed
