## TF-IDF Feature Extraction and Random Forest Classifier

The following sources were used to construct this Jupyter Notebook:

* [Numpy: Dot Multiplication, Vstack, Hstack, Flatten](https://www.youtube.com/watch?v=nkO6bmp511M)
* [Scikit Learn TF-IDF Feature Extraction and Latent Semantic Analysis](https://www.youtube.com/watch?v=BJ0MnawUpaU)
* [Fake News Challenge TF-IDF Baseline](https://github.com/gmyrianthous/fakenewschallenge/blob/master/baseline.py)
* [Python TF-IDF Algorithm Built From Scratch](https://www.youtube.com/watch?v=hXNbFNCgPfY)
* [Theory Behind TF-IDF](https://www.youtube.com/watch?v=4vT4fzjkGCQ)

In [53]:
#Import all required modules
from pandas import DataFrame, read_csv
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import hstack

In [54]:
#Import data from CSV file and create a dataframe
def create_dataframe(filename):
    #Read file into a pandas dataframe
    df = pd.read_csv(filename)
    #Replace spaces in column names with underscores (e.g. 'Body ID' -> 'Body_ID')
    df.columns = [c.replace(' ', '_') for c in df.columns]
    return df
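
To check what `create_dataframe` does to column names, the sketch below runs the same logic on an in-memory CSV. The sample header (`Body ID` with a space) and the single data row are assumptions made up for illustration; `io.StringIO` stands in for a file on disk.

```python
import io
import pandas as pd

def create_dataframe(filename):
    # Same logic as the notebook cell: read the CSV, then replace spaces in column names
    df = pd.read_csv(filename)
    df.columns = [c.replace(' ', '_') for c in df.columns]
    return df

# Hypothetical CSV content; pd.read_csv also accepts file-like objects
sample_csv = io.StringIO("Headline,Body ID,Stance\nSome headline,712,unrelated\n")
df = create_dataframe(sample_csv)
print(list(df.columns))
```

With underscores in place of spaces, a column such as `Body_ID` can be used directly as a merge key or accessed as an attribute.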

In [56]:
#Create dataframes for both training and testing sets
train_df_tmp = create_dataframe('train_stances.csv')
test_df_tmp = create_dataframe('competition_test_stances.csv')
train_bodies_df = create_dataframe('train_bodies.csv')
test_bodies_df = create_dataframe('test_bodies.csv')

train_df_tmp.head(5)

Unnamed: 0,Headline,Body_ID,Stance
0,Police find mass graves with at least '15 bodi...,712,unrelated
1,Hundreds of Palestinians flee floods in Gaza a...,158,agree
2,"Christian Bale passes on role of Steve Jobs, a...",137,unrelated
3,HBO and Apple in Talks for $15/Month Apple TV ...,1034,unrelated
4,Spider burrowed through tourist's stomach and ...,1923,disagree


In [57]:
train_df = pd.merge(train_df_tmp,
                 train_bodies_df[['Body_ID', 'articleBody']],
                 on='Body_ID')

test_df = pd.merge(test_df_tmp,
                 test_bodies_df[['Body_ID', 'articleBody']],
                 on='Body_ID')

train_df = train_df.rename(columns={'articleBody': 'Body_Text'})
test_df = test_df.rename(columns={'articleBody': 'Body_Text'})
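
The merge above attaches the article body to every headline row that shares its `Body_ID`. A minimal sketch of the same join on toy frames follows; all values are made up, and the `validate='many_to_one'` argument is an addition not used in the notebook, included here only as a guard that raises if a `Body_ID` appears more than once in the bodies table.

```python
import pandas as pd

# Toy stand-ins for the stance and body tables (values are invented for illustration)
stances = pd.DataFrame({
    'Headline': ['h1', 'h2', 'h3'],
    'Body_ID': [0, 0, 1],
    'Stance': ['unrelated', 'agree', 'discuss'],
})
bodies = pd.DataFrame({
    'Body_ID': [0, 1],
    'articleBody': ['body zero', 'body one'],
})

# Inner join on Body_ID: each headline row picks up the text of its body
merged = pd.merge(stances,
                  bodies[['Body_ID', 'articleBody']],
                  on='Body_ID',
                  validate='many_to_one')
merged = merged.rename(columns={'articleBody': 'Body_Text'})
print(merged)
```

Because several headlines can point to one body, the same `Body_Text` is repeated across rows, which is what the sorted previews of `test_df` and `train_df` show.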

In [70]:
test_df.sort_values(by=['Body_ID']).head(5)

Unnamed: 0,Headline,Body_ID,Stance,Body_Text
7305,Apple to keep gold Watch Editions in special i...,1,unrelated,Al-Sisi has denied Israeli reports stating tha...
7303,Apple installing safes in-store to protect gol...,1,unrelated,Al-Sisi has denied Israeli reports stating tha...
7304,El-Sisi denies claims he'll give Sinai land to...,1,agree,Al-Sisi has denied Israeli reports stating tha...
7306,Apple Stores to Keep Gold “Edition” Apple Watc...,1,unrelated,Al-Sisi has denied Israeli reports stating tha...
7307,South Korean woman's hair 'eaten' by robot vac...,1,unrelated,Al-Sisi has denied Israeli reports stating tha...


In [71]:
train_df.sort_values(by=['Body_ID']).head(5)

Unnamed: 0,Headline,Body_ID,Stance,Body_Text
41651,"Soldier shot, Parliament locked down after gun...",0,unrelated,A small meteorite crashed into a wooded area i...
41657,Italian catches huge wels catfish; is it a rec...,0,unrelated,A small meteorite crashed into a wooded area i...
41658,Not coming to a store near you: The pumpkin sp...,0,unrelated,A small meteorite crashed into a wooded area i...
41659,One gunman killed in shooting on Parliament Hi...,0,unrelated,A small meteorite crashed into a wooded area i...
41660,Soldier shot at war memorial in Canada,0,unrelated,A small meteorite crashed into a wooded area i...


In [72]:
#Apply Scikit Learn TFIDF Feature Extraction Algorithm
body_text_vectorizer = TfidfVectorizer(ngram_range=(1, 3), lowercase=True, stop_words='english')
headline_vectorizer = TfidfVectorizer(ngram_range=(1, 3), lowercase=True, stop_words='english')

#Create vocabulary based on training data
train_body_tfidf = body_text_vectorizer.fit_transform(train_df['Body_Text'])
train_headline_tfidf = headline_vectorizer.fit_transform(train_df['Headline'])

#Use vocabulary for testing data
test_body_tfidf = body_text_vectorizer.transform(test_df['Body_Text'])
test_headline_tfidf = headline_vectorizer.transform(test_df['Headline'])
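
The cell above fits each vectorizer on the training text only and reuses the learned vocabulary for the test text. The sketch below shows the consequence on a toy corpus (the documents are invented): the test matrix has the same number of columns as the training matrix, and terms that never occurred in training are simply ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents, invented for illustration
train_docs = ['police discover mass graves', 'apple watch gold edition']
test_docs = ['police discover gold robot']  # 'robot' never appears in train_docs

vectorizer = TfidfVectorizer(ngram_range=(1, 3), lowercase=True, stop_words='english')

# fit_transform learns the vocabulary (unigrams to trigrams) and IDF weights from training data
train_tfidf = vectorizer.fit_transform(train_docs)

# transform reuses that vocabulary; unseen terms get no column
test_tfidf = vectorizer.transform(test_docs)

print(train_tfidf.shape, test_tfidf.shape)
print('robot' in vectorizer.vocabulary_)
```

Fitting on the training set alone keeps the test set from influencing the vocabulary or the IDF weights.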

In [74]:
train_body_tfidf[0]

<1x482271 sparse matrix of type '<type 'numpy.float64'>'
	with 281 stored elements in Compressed Sparse Row format>

In [75]:
train_headline_tfidf[0]

<1x19830 sparse matrix of type '<type 'numpy.float64'>'
	with 35 stored elements in Compressed Sparse Row format>

In [77]:
#Each tuple is (row index, feature index); the value to its right
#is that feature's tf-idf score for that row
print(train_body_tfidf)

  (0, 111622)	0.07332486904279714
  (0, 61926)	0.11062358813848425
  (0, 125574)	0.09436328724448594
  (0, 448715)	0.09620281606126824
  (0, 158554)	0.06121278707581918
  (0, 381761)	0.039478725956742046
  (0, 359171)	0.08017103385429222
  (0, 149291)	0.051972696403530645
  (0, 315341)	0.09230802764085554
  (0, 33684)	0.07507013612897592
  (0, 166231)	0.03303528239637393
  (0, 407023)	0.06514224536738578
  (0, 476203)	0.08019141018236453
  (0, 395340)	0.038881204072746076
  (0, 221486)	0.1692701173979176
  (0, 56006)	0.03695559013641697
  (0, 49547)	0.02305411365934867
  (0, 460278)	0.03851537680773665
  (0, 215190)	0.03917419126394295
  (0, 59820)	0.038118581066324334
  (0, 17420)	0.05709566097575283
  (0, 11454)	0.03592709280337572
  (0, 395698)	0.036095354657982366
  (0, 31411)	0.039964295283783646
  (0, 56077)	0.05709566097575283
  :	:
  (49971, 45214)	0.054193507145058835
  (49971, 34836)	0.054193507145058835
  (49971, 408595)	0.054193507145058835
  (49971, 73715)	0.05419350714505

In [78]:
#Merge headline and body_text tf-idf vectors together - Stack arrays in sequence horizontally
train_features = hstack([train_body_tfidf, train_headline_tfidf])
test_features = hstack([test_body_tfidf, test_headline_tfidf])
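
`scipy.sparse.hstack` concatenates matrices column-wise, so the body features keep their column indices and the headline features are shifted to the right by the width of the body matrix (482271 columns, according to the shape printed for `train_body_tfidf[0]`). A small sketch with made-up matrices illustrates the layout; passing `format='csr'` is an assumption added here so the result supports row slicing, and is not part of the notebook cell.

```python
from scipy.sparse import csr_matrix, hstack

# Made-up feature matrices: 2 rows each
body = csr_matrix([[0.5, 0.0, 0.5],
                   [0.0, 1.0, 0.0]])   # 3 "body" features
headline = csr_matrix([[1.0, 0.0],
                       [0.0, 1.0]])    # 2 "headline" features

# Columns 0-2 come from body, columns 3-4 from headline
combined = hstack([body, headline], format='csr')
print(combined.shape)
print(combined.toarray())
```

Both inputs must have the same number of rows, which holds in the notebook because the body and headline matrices are built from the same dataframe.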

In [80]:
#Feature vector (the headline and body text features merged by hstack)
#Open question: should SVD (dimensionality reduction) be applied here?
print(train_features)

  (0, 111622)	0.07332486904279714
  (0, 61926)	0.11062358813848425
  (0, 125574)	0.09436328724448594
  (0, 448715)	0.09620281606126824
  (0, 158554)	0.06121278707581918
  (0, 381761)	0.039478725956742046
  (0, 359171)	0.08017103385429222
  (0, 149291)	0.051972696403530645
  (0, 315341)	0.09230802764085554
  (0, 33684)	0.07507013612897592
  (0, 166231)	0.03303528239637393
  (0, 407023)	0.06514224536738578
  (0, 476203)	0.08019141018236453
  (0, 395340)	0.038881204072746076
  (0, 221486)	0.1692701173979176
  (0, 56006)	0.03695559013641697
  (0, 49547)	0.02305411365934867
  (0, 460278)	0.03851537680773665
  (0, 215190)	0.03917419126394295
  (0, 59820)	0.038118581066324334
  (0, 17420)	0.05709566097575283
  (0, 11454)	0.03592709280337572
  (0, 395698)	0.036095354657982366
  (0, 31411)	0.039964295283783646
  (0, 56077)	0.05709566097575283
  :	:
  (49970, 488938)	0.24283265576733026
  (49970, 494271)	0.24283265576733026
  (49970, 491697)	0.24283265576733026
  (49970, 489901)	0.24283265576733
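
The comment in the cell above asks whether SVD should be applied to the merged features. The notebook does not do this; the following is only a sketch of how it could be tried with scikit-learn's `TruncatedSVD`, which accepts sparse input. The random matrix stands in for `train_features`, and `n_components=20` is an arbitrary value chosen for illustration, not a recommendation.

```python
from scipy.sparse import random as sparse_random
from sklearn.decomposition import TruncatedSVD

# Random sparse matrix standing in for the TF-IDF features (100 rows, 500 columns)
X = sparse_random(100, 500, density=0.05, format='csr', random_state=0)

# Project onto 20 components; the number of components is an assumption
svd = TruncatedSVD(n_components=20, random_state=0)
X_reduced = svd.fit_transform(X)

print(X_reduced.shape)
# Share of the variance retained by the kept components
print(svd.explained_variance_ratio_.sum())
```

If this were applied to the notebook's data, the fitted `svd` object would be reused with `svd.transform` on the test features so that both sets share the same projection.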

In [81]:
#Initialize random forest classifier (Scikit Learn)
rf_classifier = RandomForestClassifier(n_estimators=10)

#Extract training and test labels
train_labels = list(train_df['Stance'])
test_labels = list(test_df['Stance'])

In [82]:
#Train the classifier on the training data, then predict labels for the test features
y_pred = rf_classifier.fit(train_features, train_labels).predict(test_features)
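
The chained `fit(...).predict(...)` call works because `fit` returns the classifier itself. A minimal sketch on made-up sparse features shows the same two steps written separately; `random_state=0` is an addition not present in the notebook, used here so repeated runs build the same trees.

```python
from scipy.sparse import csr_matrix
from sklearn.ensemble import RandomForestClassifier

# Made-up sparse features and labels, for illustration only
X_train = csr_matrix([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
y_train = ['agree', 'agree', 'unrelated', 'unrelated']
X_test = csr_matrix([[1.0, 0.0], [0.0, 1.0]])

# Random forests in scikit-learn accept scipy sparse matrices directly
clf = RandomForestClassifier(n_estimators=10, random_state=0)
clf.fit(X_train, y_train)
pred = clf.predict(X_test)
print(list(pred))
```

Without a fixed `random_state`, the accuracy reported further down can vary slightly from run to run.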

In [83]:
#Check the accuracy of the classifier
accuracy_score(test_labels, y_pred)

0.6906701294612994
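
Accuracy is a single number and does not show which stances are confused with one another. One way to inspect this is a confusion matrix, sketched below on invented labels; the label lists are not the notebook's predictions, and `y_hat` is a hypothetical name used to avoid clashing with `y_pred`.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

labels = ['agree', 'disagree', 'discuss', 'unrelated']

# Invented true and predicted labels, for illustration only
y_true = ['unrelated', 'unrelated', 'unrelated', 'agree', 'discuss', 'disagree']
y_hat = ['unrelated', 'unrelated', 'unrelated', 'unrelated', 'discuss', 'unrelated']

acc = accuracy_score(y_true, y_hat)
print(acc)

# Rows are true classes, columns are predicted classes, both in the order of `labels`
cm = confusion_matrix(y_true, y_hat, labels=labels)
print(cm)
```

In this toy case most predictions are correct overall, yet the `agree` and `disagree` rows contain no correct predictions, which the accuracy figure alone would not reveal.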

In [84]:
#Add predicted labels to test dataframe
test_df['RF_Predicted_Stance'] = list(y_pred)

In [87]:
test_df2 = test_df[['Headline','Body_Text','RF_Predicted_Stance','Stance']]

In [91]:
test_df2[test_df2['RF_Predicted_Stance'] == 'disagree']

Unnamed: 0,Headline,Body_Text,RF_Predicted_Stance,Stance
5074,Is Kim Jong ill? North Korean dictator in poor...,The despot is putting on the pounds with the S...,disagree,discuss
5075,Fat dictator Kim Jong-un dying from cheese add...,The despot is putting on the pounds with the S...,disagree,discuss
5076,North Korean dictator Kim Jong-un may be serio...,The despot is putting on the pounds with the S...,disagree,discuss
5077,Cheese blamed for North Korean leader Kim Jong...,The despot is putting on the pounds with the S...,disagree,discuss
5078,Missing Kim Jong-un suffering with 'condition'...,The despot is putting on the pounds with the S...,disagree,discuss
5079,Kim Jong-un's discomfort 'down to cheese addic...,The despot is putting on the pounds with the S...,disagree,discuss
5081,"Switzerland’s Assassination, Kim Jong Un Might...",The despot is putting on the pounds with the S...,disagree,discuss
5082,Kim Jong-un loves CHEESE so much he's balloone...,The despot is putting on the pounds with the S...,disagree,discuss
25371,Climate Change is a Hoax,Climate change is the biggest scam in the hist...,disagree,agree
25372,Climate change is not a hoax — ask any millenn...,Climate change is the biggest scam in the hist...,disagree,disagree
