## TF-IDF Feature Extraction and Random Forest Classifier

The following sources were used to construct this Jupyter Notebook:

* [Numpy: Dot Multiplication, Vstack, Hstack, Flatten](https://www.youtube.com/watch?v=nkO6bmp511M)
* [Scikit Learn TF-IDF Feature Extraction and Latent Semantic Analysis](https://www.youtube.com/watch?v=BJ0MnawUpaU)
* [Fake News Challenge TF-IDF Baseline](https://github.com/gmyrianthous/fakenewschallenge/blob/master/baseline.py)
* [Python TF-IDF Algorithm Built From Scratch](https://www.youtube.com/watch?v=hXNbFNCgPfY)
* [Theory Behind TF-IDF](https://www.youtube.com/watch?v=4vT4fzjkGCQ)

In [20]:
#Import all required modules
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import hstack
#Local scoring module (provides report_score, used at the end of the notebook)
import score

In [2]:
#Import data from a CSV file and return it as a dataframe
def create_dataframe(filename):
    #Read file into a pandas dataframe
    df = pd.read_csv(filename)
    #Replace spaces in column names with underscores (e.g. 'Body ID' -> 'Body_ID')
    df.columns = [c.replace(' ', '_') for c in df.columns]
    return df

In [3]:
#Create dataframes for both training and testing sets
train_df_tmp = create_dataframe('train_stances.csv')
test_df_tmp = create_dataframe('competition_test_stances.csv')
train_bodies_df = create_dataframe('train_bodies.csv')
test_bodies_df = create_dataframe('test_bodies.csv')

train_df_tmp.head(5)

| | Headline | Body_ID | Stance |
|---|---|---|---|
| 0 | Police find mass graves with at least '15 bodi... | 712 | unrelated |
| 1 | Hundreds of Palestinians flee floods in Gaza a... | 158 | agree |
| 2 | Christian Bale passes on role of Steve Jobs, a... | 137 | unrelated |
| 3 | HBO and Apple in Talks for $15/Month Apple TV ... | 1034 | unrelated |
| 4 | Spider burrowed through tourist's stomach and ... | 1923 | disagree |


In [4]:
#Attach each article body to its headline/stance rows via the shared Body_ID key
train_df = pd.merge(train_df_tmp,
                    train_bodies_df[['Body_ID', 'articleBody']],
                    on='Body_ID')

test_df = pd.merge(test_df_tmp,
                   test_bodies_df[['Body_ID', 'articleBody']],
                   on='Body_ID')

#Use a consistent column name for the article text
train_df = train_df.rename(columns={'articleBody': 'Body_Text'})
test_df = test_df.rename(columns={'articleBody': 'Body_Text'})
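
Many headlines point at the same article body, so the merge is many-to-one on `Body_ID`. A minimal sketch with made-up rows (the frames `stances` and `bodies` are hypothetical stand-ins, not the real CSV contents) illustrates what the merged frame looks like:

```python
import pandas as pd

# Hypothetical toy data standing in for the stance and body tables
stances = pd.DataFrame({
    'Headline': ['headline a', 'headline b', 'headline c'],
    'Body_ID': [0, 0, 1],
    'Stance': ['unrelated', 'agree', 'discuss'],
})
bodies = pd.DataFrame({
    'Body_ID': [0, 1],
    'articleBody': ['body zero', 'body one'],
})

# pd.merge performs an inner join on the shared key by default,
# so each stance row receives a copy of its article body
merged = pd.merge(stances, bodies[['Body_ID', 'articleBody']], on='Body_ID')
merged = merged.rename(columns={'articleBody': 'Body_Text'})
print(merged)
```

Because the join is inner, a stance row whose `Body_ID` has no matching body would be dropped silently; comparing `len(merged)` with `len(stances)` is a quick way to check for that.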

In [5]:
test_df.sort_values(by=['Body_ID']).head(5)

| | Headline | Body_ID | Stance | Body_Text |
|---|---|---|---|---|
| 7305 | Apple to keep gold Watch Editions in special i... | 1 | unrelated | Al-Sisi has denied Israeli reports stating tha... |
| 7303 | Apple installing safes in-store to protect gol... | 1 | unrelated | Al-Sisi has denied Israeli reports stating tha... |
| 7304 | El-Sisi denies claims he'll give Sinai land to... | 1 | agree | Al-Sisi has denied Israeli reports stating tha... |
| 7306 | Apple Stores to Keep Gold “Edition” Apple Watc... | 1 | unrelated | Al-Sisi has denied Israeli reports stating tha... |
| 7307 | South Korean woman's hair 'eaten' by robot vac... | 1 | unrelated | Al-Sisi has denied Israeli reports stating tha... |


In [6]:
train_df.sort_values(by=['Body_ID']).head(5)

| | Headline | Body_ID | Stance | Body_Text |
|---|---|---|---|---|
| 41651 | Soldier shot, Parliament locked down after gun... | 0 | unrelated | A small meteorite crashed into a wooded area i... |
| 41657 | Italian catches huge wels catfish; is it a rec... | 0 | unrelated | A small meteorite crashed into a wooded area i... |
| 41658 | Not coming to a store near you: The pumpkin sp... | 0 | unrelated | A small meteorite crashed into a wooded area i... |
| 41659 | One gunman killed in shooting on Parliament Hi... | 0 | unrelated | A small meteorite crashed into a wooded area i... |
| 41660 | Soldier shot at war memorial in Canada | 0 | unrelated | A small meteorite crashed into a wooded area i... |


In [7]:
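#Apply Scikit Learn TFIDF Feature Extraction Algorithm
#Note: max_features=1024 caps each vocabulary at 1024 terms. The matrix widths
#printed in the cells below (755550 and 27736 columns) exceed that cap, so those
#outputs come from a run without max_features.
body_text_vectorizer = TfidfVectorizer(ngram_range=(1, 2), lowercase=True, stop_words='english', max_features=1024)
headline_vectorizer = TfidfVectorizer(ngram_range=(1, 2), lowercase=True, stop_words='english', max_features=1024)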

#Create vocabulary based on training data
train_body_tfidf = body_text_vectorizer.fit_transform(train_df['Body_Text'])
train_headline_tfidf = headline_vectorizer.fit_transform(train_df['Headline'])

#Use vocabulary for testing data
test_body_tfidf = body_text_vectorizer.transform(test_df['Body_Text'])
test_headline_tfidf = headline_vectorizer.transform(test_df['Headline'])
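
The fit/transform split matters: the vocabulary is learned from the training text only, and test text is projected onto it. The sketch below uses a made-up three-document corpus (`train_docs` and `test_docs` are illustrative, and `max_features=8` is chosen only to keep the example small) to show that the test matrix has the same width as the training matrix and that unseen terms are ignored:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, not taken from the dataset
train_docs = [
    'police find mass graves',
    'apple watch gold edition',
    'police deny apple report',
]
test_docs = ['police find gold meteorite']

vectorizer = TfidfVectorizer(ngram_range=(1, 2), lowercase=True,
                             stop_words='english', max_features=8)

# fit_transform learns the vocabulary (unigrams and bigrams) from training text
train_tfidf = vectorizer.fit_transform(train_docs)
# transform reuses that vocabulary; terms never seen in training are dropped
test_tfidf = vectorizer.transform(test_docs)

print(train_tfidf.shape)                      # (3, 8)
print(test_tfidf.shape)                       # (1, 8)
print('meteorite' in vectorizer.vocabulary_)  # False
```

Fitting a vectorizer on the test set as well would leak test vocabulary into the features, which is why only `transform` is called on the test data.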

In [8]:
train_body_tfidf[0]

<1x755550 sparse matrix of type '<type 'numpy.float64'>'
	with 380 stored elements in Compressed Sparse Row format>

In [9]:
train_headline_tfidf[0]

<1x27736 sparse matrix of type '<type 'numpy.float64'>'
	with 45 stored elements in Compressed Sparse Row format>

In [10]:
#Each entry prints as (row index, feature index), i.e. (instance, vocabulary term);
#the value to the right of the tuple is that term's tf-idf score in that instance
print(train_body_tfidf)

  (0, 174354)	0.06200681380656636
  (0, 97557)	0.09354829162140987
  (0, 196066)	0.07979784837978038
  (0, 702552)	0.08135343684960032
  (0, 247007)	0.05176429143809242
  (0, 597868)	0.03338498986328063
  (0, 561949)	0.06779623930840628
  (0, 232528)	0.04395045434091362
  (0, 494070)	0.07805982823422668
  (0, 53070)	0.063482690308879
  (0, 258872)	0.027936123600898782
  (0, 637481)	0.05508721845898659
  (0, 745821)	0.06781347044972756
  (0, 618881)	0.032879698429556264
  (0, 345545)	0.1431424397960018
  (0, 88205)	0.031251312503035235
  (0, 78167)	0.01949559749389113
  (0, 720639)	0.03257033840746655
  (0, 335059)	0.03312746160253445
  (0, 94229)	0.03223479004606912
  (0, 27501)	0.04828266406067473
  (0, 18115)	0.03038156880675647
  (0, 619425)	0.030523858614096625
  (0, 49573)	0.03379560916946003
  (0, 88317)	0.04828266406067473
  :	:
  (49971, 54969)	0.04538400470399851
  (49971, 640031)	0.04538400470399851
  (49971, 115859)	0.04538400470399851
  (49971, 199903)	0.04538400470399851
 

In [11]:
#Merge headline and body_text tf-idf vectors together - Stack arrays in sequence horizontally
train_features = hstack([train_body_tfidf, train_headline_tfidf])
test_features = hstack([test_body_tfidf, test_headline_tfidf])
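
As a small illustration of what horizontal stacking does to the column indices, the sketch below stacks two tiny hand-made sparse matrices (`body` and `headline` are hypothetical values, not real tf-idf output). The headline columns are appended after the body columns, so headline feature `j` ends up at column `body.shape[1] + j`:

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Two instances, three "body" features and two "headline" features (toy values)
body = csr_matrix(np.array([[0.5, 0.0, 0.2],
                            [0.0, 0.1, 0.0]]))
headline = csr_matrix(np.array([[0.0, 0.9],
                                [0.4, 0.0]]))

# Row counts must match; columns are concatenated left to right
combined = hstack([body, headline]).tocsr()

print(combined.shape)      # (2, 5)
print(combined.toarray())
```

This matches the printed feature vectors in the notebook, where the last rows show column indices beyond the width of the body matrix.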

In [12]:
#Feature vector (the headline and body text features merged by hstack)
#Open question: should dimensionality reduction (e.g. truncated SVD) be applied
#here before training the classifiers?
print(train_features)

  (0, 174354)	0.06200681380656636
  (0, 97557)	0.09354829162140987
  (0, 196066)	0.07979784837978038
  (0, 702552)	0.08135343684960032
  (0, 247007)	0.05176429143809242
  (0, 597868)	0.03338498986328063
  (0, 561949)	0.06779623930840628
  (0, 232528)	0.04395045434091362
  (0, 494070)	0.07805982823422668
  (0, 53070)	0.063482690308879
  (0, 258872)	0.027936123600898782
  (0, 637481)	0.05508721845898659
  (0, 745821)	0.06781347044972756
  (0, 618881)	0.032879698429556264
  (0, 345545)	0.1431424397960018
  (0, 88205)	0.031251312503035235
  (0, 78167)	0.01949559749389113
  (0, 720639)	0.03257033840746655
  (0, 335059)	0.03312746160253445
  (0, 94229)	0.03223479004606912
  (0, 27501)	0.04828266406067473
  (0, 18115)	0.03038156880675647
  (0, 619425)	0.030523858614096625
  (0, 49573)	0.03379560916946003
  (0, 88317)	0.04828266406067473
  :	:
  (49971, 757303)	0.17594146420094306
  (49971, 781978)	0.09006071587243318
  (49971, 757403)	0.09697627881678512
  (49971, 774465)	0.14103932478419584
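
On the SVD question raised in the cell above: one option, not used in this notebook, is scikit-learn's `TruncatedSVD`, which accepts sparse input directly. The sketch below runs it on a random sparse matrix standing in for the stacked tf-idf features; the matrix size and `n_components=20` are arbitrary assumptions chosen for illustration:

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

# Random sparse matrix as a stand-in for the tf-idf features (hypothetical data)
rng = np.random.RandomState(0)
dense = rng.rand(100, 500)
dense[dense < 0.95] = 0.0          # keep roughly 5% of the entries
features = csr_matrix(dense)

# Project the 500 sparse columns onto 20 dense components
svd = TruncatedSVD(n_components=20, random_state=0)
reduced = svd.fit_transform(features)

print(reduced.shape)                        # (100, 20)
print(svd.explained_variance_ratio_.sum())  # share of variance retained
```

The reduced features are dense and can be negative, so they could feed the random forest but not `MultinomialNB`, which requires non-negative input. As with the vectorizers, the SVD would be fitted on the training features and only applied (`transform`) to the test features.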


In [15]:
#Extract training and test labels
train_labels = list(train_df['Stance'])
test_labels = list(test_df['Stance'])

In [18]:
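#Train a random forest classifier (Scikit Learn, 10 trees) and predict the test stances
#No random_state is set, so the accuracy below varies slightly between runs
rf_classifier = RandomForestClassifier(n_estimators=10)

y_pred = rf_classifier.fit(train_features, train_labels).predict(test_features)

#Fraction of test stances predicted correctly
accuracy_score(test_labels, y_pred)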

0.692834376106717

In [19]:
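#Train a multinomial naive Bayes classifier (the arguments shown are the scikit-learn defaults)
nb_classifier = MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

#y_pred now holds the naive Bayes predictions, replacing the random forest ones
y_pred = nb_classifier.fit(train_features,train_labels).predict(test_features)

accuracy_score(test_labels, y_pred)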

0.717782237437532

In [22]:
#Confusion matrix and score for the MultinomialNB predictions (the latest y_pred)
score.report_score(test_labels,y_pred)

-------------------------------------------------------------
|           |   agree   | disagree  |  discuss  | unrelated |
-------------------------------------------------------------
|   agree   |     0     |     0     |     9     |   1894    |
-------------------------------------------------------------
| disagree  |     0     |     0     |     2     |    695    |
-------------------------------------------------------------
|  discuss  |     0     |     0     |    48     |   4416    |
-------------------------------------------------------------
| unrelated |    17     |     0     |    139    |   18193   |
-------------------------------------------------------------
Score: 4599.0 out of 11651.25	(39.4721596395%)
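
The totals in the report can be re-derived from its confusion matrix. The sketch below rests on two assumptions that this notebook does not state: that rows are true stances and columns are predictions, and that `score.report_score` uses the Fake News Challenge weighting (0.25 for a correct unrelated prediction, 0.25 for predicting any related stance on a related pair, and a further 0.75 when the related stance is exactly right). Both assumptions reproduce the printed numbers. The function names `fnc_score` and `max_score` are illustrative and not part of the `score` module.

```python
LABELS = ['agree', 'disagree', 'discuss', 'unrelated']
RELATED = LABELS[:3]

# Confusion matrix copied from the report above
# (assumed orientation: rows = true stance, columns = predicted stance)
CONFUSION = [
    [0, 0, 9, 1894],
    [0, 0, 2, 695],
    [0, 0, 48, 4416],
    [17, 0, 139, 18193],
]

def fnc_score(cm):
    """Weighted score under the assumed Fake News Challenge rule."""
    total = 0.0
    for i, gold in enumerate(LABELS):
        for j, pred in enumerate(LABELS):
            count = cm[i][j]
            if gold == pred:
                total += 0.25 * count
                if gold != 'unrelated':
                    total += 0.50 * count
            if gold in RELATED and pred in RELATED:
                total += 0.25 * count
    return total

def max_score(cm):
    """Best achievable score: 0.25 per unrelated pair, 1.0 per related pair."""
    total = 0.0
    for i, gold in enumerate(LABELS):
        weight = 0.25 if gold == 'unrelated' else 1.0
        total += weight * sum(cm[i])
    return total

n_total = sum(sum(row) for row in CONFUSION)
accuracy = sum(CONFUSION[i][i] for i in range(len(LABELS))) / float(n_total)
majority_baseline = sum(CONFUSION[3]) / float(n_total)

print(fnc_score(CONFUSION))   # 4599.0
print(max_score(CONFUSION))   # 11651.25
print(accuracy)
print(majority_baseline)
```

Under these assumptions the matrix accuracy comes out at about 0.718, which matches the `MultinomialNB` accuracy printed earlier, so the report describes the naive Bayes predictions. Labelling every pair `unrelated` would score about 0.722 on accuracy, which suggests both classifiers are mostly predicting the majority class.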


In [None]:
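#Add predicted labels to test dataframe
#Note: y_pred holds the most recent predictions (MultinomialNB at this point);
#re-run the random forest cell first if the column should contain RF predictions
#test_df['RF_Predicted_Stance'] = list(y_pred)
#test_df2 = test_df[['Headline','Body_Text','RF_Predicted_Stance','Stance']]
#test_df2[test_df2['RF_Predicted_Stance'] == 'unrelated']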