## TF-IDF Feature Extraction and Random Forest Classifier

The following sources were used to construct this Jupyter Notebook:

* [Numpy: Dot Multiplication, Vstack, Hstack, Flatten](https://www.youtube.com/watch?v=nkO6bmp511M)
* [Scikit Learn TF-IDF Feature Extraction and Latent Semantic Analysis](https://www.youtube.com/watch?v=BJ0MnawUpaU)
* [Fake News Challenge TF-IDF Baseline](https://github.com/gmyrianthous/fakenewschallenge/blob/master/baseline.py)
* [Python TF-IDF Algorithm Built From Scratch](https://www.youtube.com/watch?v=hXNbFNCgPfY)
* [Theory Behind TF-IDF](https://www.youtube.com/watch?v=4vT4fzjkGCQ)

In [110]:
#Import all required modules
from pandas import DataFrame, read_csv
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import hstack

In [111]:
#Import data from CSV file and create a dataframe
def create_dataframe(filename):
    #Read file into a pandas dataframe
    df = pd.read_csv(filename)
    #Replace spaces in column names with underscores
    df.columns = [c.replace(' ', '_') for c in df.columns]
    return df

In [112]:
#Add a body text column to the main dataframe (which contains headline, body id, and stance)
def add_body_text(main_df, body_df):
    #Build a Body_ID -> articleBody lookup once instead of rescanning body_df for every row
    body_lookup = dict(zip(body_df['Body_ID'], body_df['articleBody']))
    #Using body id, add the appropriate body text; ids without a matching body keep an empty string
    main_df['Body_Text'] = main_df['Body_ID'].map(body_lookup).fillna("")

In [113]:
#Create dataframes for both training and testing sets
train_df = create_dataframe('train_stances.csv')
test_df = create_dataframe('competition_test_stances.csv')
train_bodies_df = create_dataframe('train_bodies.csv')
test_bodies_df = create_dataframe('test_bodies.csv')

#Add body text to the training and testing dataframes
add_body_text(train_df,train_bodies_df)
add_body_text(test_df,test_bodies_df)

In [114]:
train_df

Unnamed: 0,Headline,Body_ID,Stance,Body_Text
0,Police find mass graves with at least '15 bodi...,712,unrelated,Danny Boyle is directing the untitled film\n\n...
1,Hundreds of Palestinians flee floods in Gaza a...,158,agree,Hundreds of Palestinians were evacuated from t...
2,"Christian Bale passes on role of Steve Jobs, a...",137,unrelated,30-year-old Moscow resident was hospitalized w...
3,HBO and Apple in Talks for $15/Month Apple TV ...,1034,unrelated,(Reuters) - A Canadian soldier was shot at the...
4,Spider burrowed through tourist's stomach and ...,1923,disagree,"Fear not arachnophobes, the story of Bunbury's..."
5,'Nasa Confirms Earth Will Experience 6 Days of...,154,agree,Thousands of people have been duped by a fake ...
6,Accused Boston Marathon Bomber Severely Injure...,962,unrelated,A British fighter who travelled to Iraq to sto...
7,Identity of ISIS terrorist known as 'Jihadi Jo...,2033,unrelated,"Adding to Apple's iOS 8 launch troubles, a rep..."
8,Banksy 'Arrested & Real Identity Revealed' Is ...,1739,agree,If you’ve seen a story floating around on your...
9,British Aid Worker Confirmed Murdered By ISIS,882,unrelated,The British Islamic State militant who has fea...


In [115]:
test_df

Unnamed: 0,Headline,Body_ID,Stance,Body_Text
0,Ferguson riots: Pregnant woman loses eye after...,2008,unrelated,A RESPECTED senior French police officer inves...
1,Crazy Conservatives Are Sure a Gitmo Detainee ...,1550,unrelated,Dave Morin's social networking company Path is...
2,A Russian Guy Says His Justin Bieber Ringtone ...,2,unrelated,A bereaved Afghan mother took revenge on the T...
3,"Zombie Cat: Buried Kitty Believed Dead, Meows ...",1793,unrelated,Hewlett-Packard is officially splitting in two...
4,Argentina's President Adopts Boy to End Werewo...,37,unrelated,An airline passenger headed to Dallas was remo...
5,Next-generation Apple iPhones' features leaked,2353,unrelated,When faced with the choice of feasting on a fi...
6,Saudi national airline may introduce gender se...,192,unrelated,The US declared the video of Sotloff to be aut...
7,'Zombie Cat' Claws Way Out Of Grave And Into O...,2482,unrelated,19-year-old Iga Jasica of Poland began making ...
8,"ISIS might be harvesting organs, Iraq tells UN",250,unrelated,Michael Foley says the administration threaten...
9,Woman has surgery to get third breast: The thr...,85,unrelated,Brian Stelter from CNN just reported that hack...


In [116]:
#Apply Scikit Learn TFIDF Feature Extraction Algorithm
body_text_vectorizer = TfidfVectorizer(ngram_range=(1, 3), lowercase=True, stop_words='english')
headline_vectorizer = TfidfVectorizer(ngram_range=(1, 3), lowercase=True, stop_words='english')

#Create vocabulary based on training data
train_body_tfidf = body_text_vectorizer.fit_transform(train_df['Body_Text'])
train_headline_tfidf = headline_vectorizer.fit_transform(train_df['Headline'])

#Use vocabulary for testing data
test_body_tfidf = body_text_vectorizer.transform(test_df['Body_Text'])
test_headline_tfidf = headline_vectorizer.transform(test_df['Headline'])
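
To make the weighting concrete, here is a minimal from-scratch sketch of the scheme that `TfidfVectorizer` applies with its default settings: raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, then L2 normalization. The helper names `fit_idf` and `transform` and the toy documents are illustrative assumptions, and the sketch uses pre-tokenized unigrams only (no stop-word removal or n-grams), so it will not reproduce the notebook's numbers.

```python
import math
from collections import Counter

def fit_idf(tokenized_docs):
    """Learn smoothed IDF weights from the training documents only."""
    n_docs = len(tokenized_docs)
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    return {term: math.log((1 + n_docs) / (1 + df)) + 1.0
            for term, df in doc_freq.items()}

def transform(tokens, idf):
    """Return an L2-normalized TF-IDF vector (term -> weight) for one document."""
    # Terms that never appeared during fitting are dropped
    counts = Counter(t for t in tokens if t in idf)
    weights = {t: count * idf[t] for t, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}

# Hypothetical toy data, not taken from the dataset
train_docs = [["police", "find", "graves"], ["police", "arrest", "suspect"]]
idf = fit_idf(train_docs)
test_vec = transform(["police", "find", "zombie"], idf)
print(test_vec)
```

The unseen word "zombie" receives no weight, which mirrors why the vocabulary is fitted on the training data and only applied, via `transform`, to the test data.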

In [117]:
train_body_tfidf[0]

<1x482271 sparse matrix of type '<type 'numpy.float64'>'
	with 281 stored elements in Compressed Sparse Row format>

In [118]:
train_headline_tfidf[0]

<1x19830 sparse matrix of type '<type 'numpy.float64'>'
	with 35 stored elements in Compressed Sparse Row format>

In [119]:
#Each tuple is (row index, feature index); the value to its right
#is that feature's tf-idf score
print(train_body_tfidf)

  (0, 111622)	0.07332486904279714
  (0, 61926)	0.11062358813848425
  (0, 125574)	0.09436328724448594
  (0, 448715)	0.09620281606126824
  (0, 158554)	0.06121278707581918
  (0, 381761)	0.039478725956742046
  (0, 359171)	0.08017103385429222
  (0, 149291)	0.051972696403530645
  (0, 315341)	0.09230802764085554
  (0, 33684)	0.07507013612897592
  (0, 166231)	0.03303528239637393
  (0, 407023)	0.06514224536738578
  (0, 476203)	0.08019141018236453
  (0, 395340)	0.038881204072746076
  (0, 221486)	0.1692701173979176
  (0, 56006)	0.03695559013641697
  (0, 49547)	0.02305411365934867
  (0, 460278)	0.03851537680773665
  (0, 215190)	0.03917419126394295
  (0, 59820)	0.038118581066324334
  (0, 17420)	0.05709566097575283
  (0, 11454)	0.03592709280337572
  (0, 395698)	0.036095354657982366
  (0, 31411)	0.039964295283783646
  (0, 56077)	0.05709566097575283
  :	:
  (49971, 297508)	0.06264392079607924
  (49971, 262343)	0.06264392079607924
  (49971, 328442)	0.06264392079607924
  (49971, 317155)	0.06264392079607
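
Feature indices on their own are hard to read. One way to translate them back into terms is to invert the fitted vectorizer's `vocabulary_` attribute, which maps each term to its column index. The sketch below does this on two hypothetical toy sentences; `toy_docs`, `toy_vectorizer`, and `index_to_term` are illustrative names, not part of the notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, not taken from the dataset
toy_docs = [
    "police find mass graves",
    "police arrest suspect",
]
toy_vectorizer = TfidfVectorizer()
toy_tfidf = toy_vectorizer.fit_transform(toy_docs)

# vocabulary_ maps term -> column index; invert it to look terms up by index
index_to_term = {index: term for term, index in toy_vectorizer.vocabulary_.items()}

first_row = toy_tfidf[0]  # 1 x n_features sparse row
weighted_terms = sorted(
    ((index_to_term[j], weight) for j, weight in zip(first_row.indices, first_row.data)),
    key=lambda pair: pair[1],
    reverse=True,
)
for term, weight in weighted_terms:
    print(f"{term}: {weight:.3f}")
```

In this toy case "police" should rank last because it occurs in both documents and therefore has the lowest IDF.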

In [120]:
#Merge headline and body_text tf-idf vectors together - Stack arrays in sequence horizontally
train_features = hstack([train_body_tfidf, train_headline_tfidf])
test_features = hstack([test_body_tfidf, test_headline_tfidf])
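
As a small illustration of what the horizontal stack does, the sketch below joins two tiny sparse matrices that share the same rows. The matrices and names are made up for the example. Columns from the second block are shifted by the width of the first, so in the notebook any feature index at or above 482271 (the body vocabulary size shown earlier) would refer to a headline feature.

```python
from scipy.sparse import csr_matrix, hstack

# Hypothetical stand-ins: 2 documents, 3 body features and 2 headline features
toy_body = csr_matrix([[0.5, 0.0, 0.5],
                       [0.0, 1.0, 0.0]])
toy_headline = csr_matrix([[1.0, 0.0],
                           [0.0, 1.0]])

# hstack returns COO format by default; format="csr" requests CSR, which supports row indexing
toy_combined = hstack([toy_body, toy_headline], format="csr")

print(toy_combined.shape)  # rows are unchanged, column counts add up
print(toy_combined[0, 3])  # headline column 0 now sits at column 3
```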

In [121]:
print(train_features)

  (0, 111622)	0.07332486904279714
  (0, 61926)	0.11062358813848425
  (0, 125574)	0.09436328724448594
  (0, 448715)	0.09620281606126824
  (0, 158554)	0.06121278707581918
  (0, 381761)	0.039478725956742046
  (0, 359171)	0.08017103385429222
  (0, 149291)	0.051972696403530645
  (0, 315341)	0.09230802764085554
  (0, 33684)	0.07507013612897592
  (0, 166231)	0.03303528239637393
  (0, 407023)	0.06514224536738578
  (0, 476203)	0.08019141018236453
  (0, 395340)	0.038881204072746076
  (0, 221486)	0.1692701173979176
  (0, 56006)	0.03695559013641697
  (0, 49547)	0.02305411365934867
  (0, 460278)	0.03851537680773665
  (0, 215190)	0.03917419126394295
  (0, 59820)	0.038118581066324334
  (0, 17420)	0.05709566097575283
  (0, 11454)	0.03592709280337572
  (0, 395698)	0.036095354657982366
  (0, 31411)	0.039964295283783646
  (0, 56077)	0.05709566097575283
  :	:
  (49970, 484717)	0.25226273309179803
  (49970, 492169)	0.25226273309179803
  (49970, 490933)	0.23233767736407243
  (49970, 488605)	0.25226273309179

In [122]:
#Initialize random forest classifier (Scikit Learn)
rf_classifier = RandomForestClassifier(n_estimators=10)

#Extract training and test labels
train_labels = list(train_df['Stance'])
test_labels = list(test_df['Stance'])

In [123]:
#Train the classifier on the training data; use it to predict test feature labels
y_pred = rf_classifier.fit(train_features, train_labels).predict(test_features)
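
The classifier above is created without a fixed seed, so a rerun may produce slightly different predictions. If reproducibility matters, scikit-learn's `random_state` parameter can pin the randomness. The sketch below checks this on a tiny made-up dataset; the data and the helper `fit_and_predict` are assumptions for illustration only.

```python
from scipy.sparse import csr_matrix
from sklearn.ensemble import RandomForestClassifier

# Hypothetical toy features and labels, not taken from the dataset
toy_features = csr_matrix([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
toy_labels = ["agree", "agree", "unrelated", "unrelated"]
toy_test = csr_matrix([[0.8, 0.2], [0.2, 0.8]])

def fit_and_predict(seed):
    """Train a small forest with a fixed seed and return its test predictions."""
    model = RandomForestClassifier(n_estimators=10, random_state=seed)
    return list(model.fit(toy_features, toy_labels).predict(toy_test))

first_run = fit_and_predict(seed=0)
second_run = fit_and_predict(seed=0)
print(first_run == second_run)  # identical seeds give identical predictions
```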

In [124]:
#Check the accuracy of the classifier
accuracy_score(test_labels, y_pred)

0.7106205485381498
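
A single accuracy figure is easier to judge next to a majority-class baseline and per-class recall, especially if one stance dominates (the ten test rows displayed earlier are all labelled `unrelated`, although that sample may not be representative). The following is a minimal pure-Python sketch on made-up labels; the function names are hypothetical, and in the notebook they could be called as `majority_baseline_accuracy(test_labels)` and `per_class_recall(test_labels, y_pred)`.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return most_common_count / len(labels)

def per_class_recall(true_labels, predicted_labels):
    """Fraction of each true class that was predicted correctly."""
    totals = Counter(true_labels)
    hits = Counter(t for t, p in zip(true_labels, predicted_labels) if t == p)
    return {label: hits[label] / totals[label] for label in totals}

# Hypothetical toy labels, not results from the notebook
toy_true = ["unrelated", "unrelated", "unrelated", "agree", "disagree"]
toy_pred = ["unrelated", "unrelated", "unrelated", "unrelated", "disagree"]

baseline = majority_baseline_accuracy(toy_true)
recall = per_class_recall(toy_true, toy_pred)
print(baseline)
print(recall)
```

If the classifier's accuracy is close to the baseline, most of its score may come from predicting the dominant class.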

In [125]:
#Add predicted labels to test dataframe
test_df['RF_Predicted_Stance'] = list(y_pred)

In [126]:
test_df[['Headline','Body_Text','RF_Predicted_Stance','Stance']]

Unnamed: 0,Headline,Body_Text,RF_Predicted_Stance,Stance
0,Ferguson riots: Pregnant woman loses eye after...,A RESPECTED senior French police officer inves...,unrelated,unrelated
1,Crazy Conservatives Are Sure a Gitmo Detainee ...,Dave Morin's social networking company Path is...,unrelated,unrelated
2,A Russian Guy Says His Justin Bieber Ringtone ...,A bereaved Afghan mother took revenge on the T...,unrelated,unrelated
3,"Zombie Cat: Buried Kitty Believed Dead, Meows ...",Hewlett-Packard is officially splitting in two...,unrelated,unrelated
4,Argentina's President Adopts Boy to End Werewo...,An airline passenger headed to Dallas was remo...,unrelated,unrelated
5,Next-generation Apple iPhones' features leaked,When faced with the choice of feasting on a fi...,unrelated,unrelated
6,Saudi national airline may introduce gender se...,The US declared the video of Sotloff to be aut...,unrelated,unrelated
7,'Zombie Cat' Claws Way Out Of Grave And Into O...,19-year-old Iga Jasica of Poland began making ...,unrelated,unrelated
8,"ISIS might be harvesting organs, Iraq tells UN",Michael Foley says the administration threaten...,unrelated,unrelated
9,Woman has surgery to get third breast: The thr...,Brian Stelter from CNN just reported that hack...,unrelated,unrelated
