## Cosine Feature Extraction and Random Forest Classifier

The following sources were used to construct this Jupyter Notebook:

* [Numpy: Dot Multiplication, Vstack, Hstack, Flatten](https://www.youtube.com/watch?v=nkO6bmp511M)
* [Scikit Learn TF-IDF Feature Extraction and Latent Semantic Analysis](https://www.youtube.com/watch?v=BJ0MnawUpaU)
* [Fake News Challenge TF-IDF Baseline](https://github.com/gmyrianthous/fakenewschallenge/blob/master/baseline.py)
* [Python TF-IDF Algorithm Built From Scratch](https://www.youtube.com/watch?v=hXNbFNCgPfY)
* [Theory Behind TF-IDF](https://www.youtube.com/watch?v=4vT4fzjkGCQ)

In [2]:
#Import all required modules
from pandas import DataFrame, read_csv
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from scipy.spatial.distance import cosine
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import hstack
import score
from tqdm import tqdm

In [3]:
#Import data from CSV file and create a dataframe
def create_dataframe(filename):
    #Read file into a pandas dataframe
    df = pd.read_csv(filename)
    #Remove white space in column names
    df.columns = [c.replace(' ', '_') for c in df.columns]
    return df

In [4]:
#Create dataframes for both training and testing sets
train_df_tmp = create_dataframe('train_stances.csv')
test_df_tmp = create_dataframe('competition_test_stances.csv')
train_bodies_df = create_dataframe('train_bodies.csv')
test_bodies_df = create_dataframe('test_bodies.csv')

train_df_tmp.head(5)

Unnamed: 0,Headline,Body_ID,Stance
0,Police find mass graves with at least '15 bodi...,712,unrelated
1,Hundreds of Palestinians flee floods in Gaza a...,158,agree
2,"Christian Bale passes on role of Steve Jobs, a...",137,unrelated
3,HBO and Apple in Talks for $15/Month Apple TV ...,1034,unrelated
4,Spider burrowed through tourist's stomach and ...,1923,disagree


In [5]:
train_df = pd.merge(train_df_tmp,
                 train_bodies_df[['Body_ID', 'articleBody']],
                 on='Body_ID')

test_df = pd.merge(test_df_tmp,
                 test_bodies_df[['Body_ID', 'articleBody']],
                 on='Body_ID')

train_df = train_df.rename(columns={'articleBody': 'Body_Text'})
test_df = test_df.rename(columns={'articleBody': 'Body_Text'})

In [6]:
test_df.sort_values(by=['Body_ID']).head(5)

Unnamed: 0,Headline,Body_ID,Stance,Body_Text
7305,Apple to keep gold Watch Editions in special i...,1,unrelated,Al-Sisi has denied Israeli reports stating tha...
7303,Apple installing safes in-store to protect gol...,1,unrelated,Al-Sisi has denied Israeli reports stating tha...
7304,El-Sisi denies claims he'll give Sinai land to...,1,agree,Al-Sisi has denied Israeli reports stating tha...
7306,Apple Stores to Keep Gold “Edition” Apple Watc...,1,unrelated,Al-Sisi has denied Israeli reports stating tha...
7307,South Korean woman's hair 'eaten' by robot vac...,1,unrelated,Al-Sisi has denied Israeli reports stating tha...


In [7]:
train_df.sort_values(by=['Body_ID']).head(5)

Unnamed: 0,Headline,Body_ID,Stance,Body_Text
41651,"Soldier shot, Parliament locked down after gun...",0,unrelated,A small meteorite crashed into a wooded area i...
41657,Italian catches huge wels catfish; is it a rec...,0,unrelated,A small meteorite crashed into a wooded area i...
41658,Not coming to a store near you: The pumpkin sp...,0,unrelated,A small meteorite crashed into a wooded area i...
41659,One gunman killed in shooting on Parliament Hi...,0,unrelated,A small meteorite crashed into a wooded area i...
41660,Soldier shot at war memorial in Canada,0,unrelated,A small meteorite crashed into a wooded area i...


In [8]:
#Apply Scikit Learn TFIDF Feature Extraction Algorithm
body_text_vectorizer = TfidfVectorizer(ngram_range=(1, 2), lowercase=True, stop_words='english',max_features=1024)
headline_vectorizer = TfidfVectorizer(ngram_range=(1, 2), lowercase=True, stop_words='english',max_features=1024)

#Create vocabulary based on training data
train_body_tfidf = body_text_vectorizer.fit_transform(train_df['Body_Text'])
train_headline_tfidf = headline_vectorizer.fit_transform(train_df['Headline'])

#Use vocabulary for testing data
test_body_tfidf = body_text_vectorizer.transform(test_df['Body_Text'])
test_headline_tfidf = headline_vectorizer.transform(test_df['Headline']) 

In [9]:
#Cosine Similarity
def get_cosine_similarity(body_tfidf,headline_tfidf):
    cosine_features = []
    #len body_tfidf = len headline_tfidf
    for i in tqdm(range(body_tfidf.shape[0])):
        cosine_features.append(cosine_similarity((body_tfidf.A[0].reshape(1,-1)),(headline_tfidf.A[0].reshape(1,-1)))[0][0])
    return np.array(cosine_features).reshape(body_tfidf.shape[0],1)

In [10]:
train_cosine_features = get_cosine_similarity(train_body_tfidf,train_headline_tfidf)
test_cosine_features = get_cosine_similarity(test_body_tfidf,test_headline_tfidf)

100%|██████████| 49972/49972 [15:00:27<00:00,  1.08s/it]  
100%|██████████| 25413/25413 [4:55:51<00:00,  1.43it/s]  


In [11]:
train_features = hstack([train_body_tfidf,train_headline_tfidf,train_cosine_features])
test_features = hstack([test_body_tfidf,test_headline_tfidf,test_cosine_features])

In [12]:
#Extract training and test labels
train_labels = list(train_df['Stance'])
test_labels = list(test_df['Stance'])

In [13]:
#Initialize random forest classifier (Scikit Learn)
rf_classifier = RandomForestClassifier(n_estimators=10)

y_pred = rf_classifier.fit(train_features, train_labels).predict(test_features)

accuracy_score(test_labels, y_pred)

0.670129461299335

In [14]:
#Initialize multinomialnb classifier
nb_classifier = MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

y_pred = nb_classifier.fit(train_features,train_labels).predict(test_features)

accuracy_score(test_labels, y_pred)

0.7025931609806004

In [15]:
score.report_score(test_labels, y_pred)

-------------------------------------------------------------
|           |   agree   | disagree  |  discuss  | unrelated |
-------------------------------------------------------------
|   agree   |    28     |     0     |    99     |   1776    |
-------------------------------------------------------------
| disagree  |     0     |     0     |    10     |    687    |
-------------------------------------------------------------
|  discuss  |    18     |     0     |    831    |   3615    |
-------------------------------------------------------------
| unrelated |    18     |     0     |   1335    |   16996   |
-------------------------------------------------------------
Score: 5139.75 out of 11651.25	(44.1132925652%)


In [None]:
#####

In [16]:
import baseline_features

In [17]:
train_hand_features = baseline_features.hand_features(train_df['Headline'],train_df['Body_Text'])

49972it [05:42, 145.77it/s]


In [18]:
test_hand_features = baseline_features.hand_features(test_df['Headline'],test_df['Body_Text'])

25413it [02:33, 165.91it/s]


In [19]:
train_hand_features = np.array(train_hand_features)
test_hand_features = np.array(test_hand_features)

In [20]:
train_features1 = hstack([train_body_tfidf,train_headline_tfidf,train_hand_features,train_cosine_features])
test_features1 = hstack([test_body_tfidf,test_headline_tfidf,test_hand_features,test_cosine_features])

In [30]:
#Initialize multinomialnb classifier
nb_classifier1 = MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

y_pred1 = nb_classifier1.fit(train_features1,train_labels).predict(test_features1)

score.report_score(test_labels, y_pred1)

-------------------------------------------------------------
|           |   agree   | disagree  |  discuss  | unrelated |
-------------------------------------------------------------
|   agree   |   1039    |    10     |    598    |    256    |
-------------------------------------------------------------
| disagree  |    297    |     4     |    183    |    213    |
-------------------------------------------------------------
|  discuss  |   1085    |    26     |   2712    |    641    |
-------------------------------------------------------------
| unrelated |    289    |     2     |    680    |   17378   |
-------------------------------------------------------------
Score: 8649.25 out of 11651.25	(74.2345241927%)
