## Baseline Features Implementation + TF-IDF

The following sources were used to construct this Jupyter Notebook:

* [Numpy: Dot Multiplication, Vstack, Hstack, Flatten](https://www.youtube.com/watch?v=nkO6bmp511M)
* [Scikit Learn TF-IDF Feature Extraction and Latent Semantic Analysis](https://www.youtube.com/watch?v=BJ0MnawUpaU)
* [Fake News Challenge TF-IDF Baseline](https://github.com/gmyrianthous/fakenewschallenge/blob/master/baseline.py)
* [Python TF-IDF Algorithm Built From Scratch](https://www.youtube.com/watch?v=hXNbFNCgPfY)
* [Theory Behind TF-IDF](https://www.youtube.com/watch?v=4vT4fzjkGCQ)

In [1]:
#Import all required modules
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.sparse import hstack
from scipy.spatial.distance import cosine
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.naive_bayes import MultinomialNB

#Local helper modules (must be importable from the notebook's directory)
import baseline_features
import score

In [2]:
#Import data from CSV file and create a dataframe
def create_dataframe(filename):
    #Read file into a pandas dataframe
    df = pd.read_csv(filename)
    #Replace spaces in column names with underscores (e.g. 'Body ID' -> 'Body_ID')
    df.columns = [c.replace(' ', '_') for c in df.columns]
    return df
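
As a quick check of the column renaming, here is a minimal sketch that runs the same logic on an in-memory CSV. The two sample rows and the `Body ID` raw header are assumptions made for illustration; they are not read from the actual data files.

```python
import io
import pandas as pd

def create_dataframe(source):
    # Same logic as the cell above; `source` may be a path or a file-like object
    df = pd.read_csv(source)
    df.columns = [c.replace(' ', '_') for c in df.columns]
    return df

# Hypothetical two-row sample that mimics the layout of a stances file
sample_csv = io.StringIO(
    "Headline,Body ID,Stance\n"
    "Some headline,712,unrelated\n"
    "Another headline,158,agree\n"
)
sample_df = create_dataframe(sample_csv)
print(list(sample_df.columns))
```

Column names without spaces can be used as attributes (`sample_df.Body_ID`) and are easier to reference in later merges.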

In [3]:
#Create dataframes for both training and testing sets
train_df_tmp = create_dataframe('train_stances.csv')
test_df_tmp = create_dataframe('competition_test_stances.csv')
train_bodies_df = create_dataframe('train_bodies.csv')
test_bodies_df = create_dataframe('test_bodies.csv')

train_df_tmp.head(5)

Unnamed: 0,Headline,Body_ID,Stance
0,Police find mass graves with at least '15 bodi...,712,unrelated
1,Hundreds of Palestinians flee floods in Gaza a...,158,agree
2,"Christian Bale passes on role of Steve Jobs, a...",137,unrelated
3,HBO and Apple in Talks for $15/Month Apple TV ...,1034,unrelated
4,Spider burrowed through tourist's stomach and ...,1923,disagree


In [4]:
#Attach each article body to its stance rows by joining on Body_ID
train_df = pd.merge(train_df_tmp,
                    train_bodies_df[['Body_ID', 'articleBody']],
                    on='Body_ID')

test_df = pd.merge(test_df_tmp,
                   test_bodies_df[['Body_ID', 'articleBody']],
                   on='Body_ID')

#Rename the body column for consistency with the other column names
train_df = train_df.rename(columns={'articleBody': 'Body_Text'})
test_df = test_df.rename(columns={'articleBody': 'Body_Text'})
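
The merge above is a many-to-one join: several headlines can point at the same body. The sketch below illustrates that behaviour on two tiny, made-up frames (the values are hypothetical; only the column names follow the notebook).

```python
import pandas as pd

# Hypothetical stances: two headlines share Body_ID 0
stances = pd.DataFrame({
    'Headline': ['h1', 'h2', 'h3'],
    'Body_ID': [0, 0, 1],
    'Stance': ['unrelated', 'agree', 'discuss'],
})
# Hypothetical bodies: one row per Body_ID
bodies = pd.DataFrame({
    'Body_ID': [0, 1],
    'articleBody': ['body zero', 'body one'],
})

# Default inner join on Body_ID; the body text is repeated for every matching headline
merged = pd.merge(stances, bodies[['Body_ID', 'articleBody']], on='Body_ID')
merged = merged.rename(columns={'articleBody': 'Body_Text'})
print(merged)
```

Because `pd.merge` defaults to an inner join, any stance row whose `Body_ID` is missing from the bodies frame would be dropped silently, so comparing row counts before and after the merge is a cheap sanity check.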

In [5]:
test_df.sort_values(by=['Body_ID']).head(5)

Unnamed: 0,Headline,Body_ID,Stance,Body_Text
7305,Apple to keep gold Watch Editions in special i...,1,unrelated,Al-Sisi has denied Israeli reports stating tha...
7303,Apple installing safes in-store to protect gol...,1,unrelated,Al-Sisi has denied Israeli reports stating tha...
7304,El-Sisi denies claims he'll give Sinai land to...,1,agree,Al-Sisi has denied Israeli reports stating tha...
7306,Apple Stores to Keep Gold “Edition” Apple Watc...,1,unrelated,Al-Sisi has denied Israeli reports stating tha...
7307,South Korean woman's hair 'eaten' by robot vac...,1,unrelated,Al-Sisi has denied Israeli reports stating tha...


In [6]:
train_df.sort_values(by=['Body_ID']).head(5)

Unnamed: 0,Headline,Body_ID,Stance,Body_Text
41651,"Soldier shot, Parliament locked down after gun...",0,unrelated,A small meteorite crashed into a wooded area i...
41657,Italian catches huge wels catfish; is it a rec...,0,unrelated,A small meteorite crashed into a wooded area i...
41658,Not coming to a store near you: The pumpkin sp...,0,unrelated,A small meteorite crashed into a wooded area i...
41659,One gunman killed in shooting on Parliament Hi...,0,unrelated,A small meteorite crashed into a wooded area i...
41660,Soldier shot at war memorial in Canada,0,unrelated,A small meteorite crashed into a wooded area i...


In [7]:
#Apply scikit-learn TF-IDF feature extraction (unigrams and bigrams, top 1024 terms)
body_text_vectorizer = TfidfVectorizer(ngram_range=(1, 2), lowercase=True,
                                       stop_words='english', max_features=1024)
headline_vectorizer = TfidfVectorizer(ngram_range=(1, 2), lowercase=True,
                                      stop_words='english', max_features=1024)

#Create vocabulary based on training data
train_body_tfidf = body_text_vectorizer.fit_transform(train_df['Body_Text'])
train_headline_tfidf = headline_vectorizer.fit_transform(train_df['Headline'])

#Use vocabulary for testing data
test_body_tfidf = body_text_vectorizer.transform(test_df['Body_Text'])
test_headline_tfidf = headline_vectorizer.transform(test_df['Headline'])
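
To illustrate why the vectorizers are fitted on the training data and only applied to the test data, here is a small sketch on made-up sentences (the texts are hypothetical). Terms that never occurred in training have no column, so a test document made up entirely of unseen terms becomes an all-zero row.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpora standing in for the real headlines and bodies
train_texts = ["police find mass graves",
               "apple watch gold edition",
               "police shoot gunman"]
test_texts = ["gold apple safes", "meteorite crashed"]

vec = TfidfVectorizer(ngram_range=(1, 2), lowercase=True,
                      stop_words='english', max_features=1024)
train_matrix = vec.fit_transform(train_texts)  # learns vocabulary and IDF weights
test_matrix = vec.transform(test_texts)        # reuses them, learns nothing new

# Both matrices share the same columns, one per training-vocabulary term
print(train_matrix.shape, test_matrix.shape)
# Number of non-zero entries per test document
print(test_matrix[0].nnz, test_matrix[1].nnz)
```

Calling `fit_transform` on the test set instead would build a different vocabulary, and the resulting columns would no longer line up with the features the classifier was trained on.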

In [8]:
train_hand_features = baseline_features.hand_features(train_df['Headline'],train_df['Body_Text'])

49972it [04:20, 192.16it/s]


In [9]:
test_hand_features = baseline_features.hand_features(test_df['Headline'],test_df['Body_Text'])

25413it [02:07, 199.70it/s]


In [10]:
#Convert the hand-crafted features to NumPy arrays so they can be stacked with the TF-IDF matrices
train_hand_features = np.array(train_hand_features)
test_hand_features = np.array(test_hand_features)

In [11]:
#Concatenate body TF-IDF, headline TF-IDF and hand-crafted features column-wise
train_features = hstack([train_body_tfidf, train_headline_tfidf, train_hand_features])
test_features = hstack([test_body_tfidf, test_headline_tfidf, test_hand_features])
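
The following sketch shows what the column-wise stacking does on tiny, made-up matrices. The values are hypothetical, and the dense block is converted to a sparse matrix explicitly here for clarity, which the notebook cell does not do.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Hypothetical stand-ins: two sparse TF-IDF blocks and one dense hand-feature block
tfidf_a = csr_matrix(np.array([[0.0, 0.5], [0.7, 0.0]]))
tfidf_b = csr_matrix(np.array([[1.0], [0.0]]))
hand = np.array([[3, 1, 0], [0, 2, 5]])

# Every block must have the same number of rows; columns are appended left to right
combined = hstack([tfidf_a, tfidf_b, csr_matrix(hand)]).tocsr()
print(combined.shape)
print(combined.toarray())
```

`hstack` returns a COO matrix by default; calling `.tocsr()` is optional but makes row slicing possible afterwards.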

In [12]:
#Extract training and test labels
train_labels = list(train_df['Stance'])
test_labels = list(test_df['Stance'])

In [13]:
#Initialize random forest classifier (Scikit Learn)
rf_classifier = RandomForestClassifier(n_estimators=10)

y_pred = rf_classifier.fit(train_features, train_labels).predict(test_features)

score.report_score(test_labels, y_pred)

-------------------------------------------------------------
|           |   agree   | disagree  |  discuss  | unrelated |
-------------------------------------------------------------
|   agree   |    746    |     7     |    787    |    363    |
-------------------------------------------------------------
| disagree  |    192    |    15     |    206    |    284    |
-------------------------------------------------------------
|  discuss  |    811    |    12     |   2783    |    858    |
-------------------------------------------------------------
| unrelated |    74     |     0     |    400    |   17875   |
-------------------------------------------------------------
Score: 8516.5 out of 11651.25	(73.0951614633623%)


73.0951614633623
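
The reported score can be reproduced from the confusion matrix above. The sketch below assumes the Fake News Challenge weighting (0.25 for getting the related/unrelated decision right, plus 0.75 for the exact stance on related pairs) and that rows are actual stances and columns are predicted stances. Both assumptions are consistent with the printed totals, but the `score` module itself is not shown in this notebook.

```python
import numpy as np

# Confusion matrix from the random forest run above
# Assumed orientation: rows = actual, columns = predicted
# Order: agree, disagree, discuss, unrelated
cm = np.array([
    [746,   7,  787,   363],
    [192,  15,  206,   284],
    [811,  12, 2783,   858],
    [ 74,   0,  400, 17875],
])

related = cm[:3, :3]  # actual and predicted are both related stances

# 0.25 for every correct related/unrelated decision,
# plus 0.75 for every exactly matching related stance
achieved = 0.25 * cm[3, 3] + 0.25 * related.sum() + 0.75 * np.trace(related)

# Best possible: 0.25 per actual unrelated pair, 1.0 per actual related pair
best = 0.25 * cm[3].sum() + 1.0 * cm[:3].sum()

percent = 100 * achieved / best
print(achieved, best, percent)
```

Under this weighting a classifier that labels every pair `unrelated` would collect only the 0.25 credit on the unrelated row, which is why the metric is more informative than plain accuracy on such an imbalanced test set.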

In [14]:
#Initialize multinomial naive Bayes classifier (requires non-negative feature values)
nb_classifier = MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

y_pred = nb_classifier.fit(train_features,train_labels).predict(test_features)

score.report_score(test_labels, y_pred)

-------------------------------------------------------------
|           |   agree   | disagree  |  discuss  | unrelated |
-------------------------------------------------------------
|   agree   |   1029    |    11     |    607    |    256    |
-------------------------------------------------------------
| disagree  |    304    |     3     |    190    |    200    |
-------------------------------------------------------------
|  discuss  |   1088    |    22     |   2740    |    614    |
-------------------------------------------------------------
| unrelated |    271    |     2     |    686    |   17390   |
-------------------------------------------------------------
Score: 8675.0 out of 11651.25	(74.45553052247612%)


74.45553052247612

In [15]:
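#Add predicted labels to test dataframe
#Note: at this point y_pred holds the MultinomialNB predictions from In [14],
#even though the column name refers to the random forest
test_df['RF_Predicted_Stance'] = list(y_pred)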

In [16]:
test_df2 = test_df[['Headline','Body_Text','RF_Predicted_Stance','Stance']]

In [17]:
test_df2[test_df2['RF_Predicted_Stance'] == 'unrelated']

Unnamed: 0,Headline,Body_Text,RF_Predicted_Stance,Stance
0,Ferguson riots: Pregnant woman loses eye after...,A RESPECTED senior French police officer inves...,unrelated,unrelated
1,Apple Stores to install safes to secure gold A...,A RESPECTED senior French police officer inves...,unrelated,unrelated
2,Pregnant woman loses eye after police shoot be...,A RESPECTED senior French police officer inves...,unrelated,unrelated
3,We just found out the #Ferguson Protester who ...,A RESPECTED senior French police officer inves...,unrelated,unrelated
4,Police Chief In Charge of Paris Attacks Commit...,A RESPECTED senior French police officer inves...,unrelated,discuss
6,Pregnant Ferguson woman loses her EYE after po...,A RESPECTED senior French police officer inves...,unrelated,unrelated
7,Pregnant woman loses eye after Ferguson cops f...,A RESPECTED senior French police officer inves...,unrelated,unrelated
9,Pregnant Woman Loses Eye During Ferguson Riots...,A RESPECTED senior French police officer inves...,unrelated,unrelated
10,‘I will have justice for what they did to me’:...,A RESPECTED senior French police officer inves...,unrelated,unrelated
12,AP report: Police say gunman in FSU library sh...,A RESPECTED senior French police officer inves...,unrelated,unrelated


In [18]:
#Re-run the random forest classifier (Scikit Learn)
#No random_state is set, so the score differs from the run in In [13]
rf_classifier = RandomForestClassifier(n_estimators=10)

y_pred = rf_classifier.fit(train_features, train_labels).predict(test_features)

score.report_score(test_labels, y_pred)

-------------------------------------------------------------
|           |   agree   | disagree  |  discuss  | unrelated |
-------------------------------------------------------------
|   agree   |    769    |     3     |    755    |    376    |
-------------------------------------------------------------
| disagree  |    198    |     2     |    217    |    280    |
-------------------------------------------------------------
|  discuss  |    949    |     4     |   2576    |    935    |
-------------------------------------------------------------
| unrelated |    75     |     0     |    406    |   17868   |
-------------------------------------------------------------
Score: 8345.5 out of 11651.25	(71.6275077781354%)


71.6275077781354
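
The two random forest runs score 73.10% and 71.63% on identical features, most likely because the forest is randomly initialised on each fit. A minimal sketch of making the run repeatable by fixing `random_state` is shown below; the synthetic data from `make_classification` and the seed value 42 are placeholders, not part of the original experiment.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic placeholder data standing in for train_features / train_labels
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

preds = []
for _ in range(2):
    # Fixing random_state makes bootstrap sampling and split selection repeatable
    clf = RandomForestClassifier(n_estimators=10, random_state=42)
    preds.append(clf.fit(X, y).predict(X))

# With the same seed, both fits produce identical predictions
print(np.array_equal(preds[0], preds[1]))
```

With only 10 trees the variance between unseeded runs is fairly large, so increasing `n_estimators` is another way to make the reported score more stable.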