## Cosine Feature Extraction and Random Forest Classifier

The following sources were used to construct this Jupyter Notebook:

* [Numpy: Dot Multiplication, Vstack, Hstack, Flatten](https://www.youtube.com/watch?v=nkO6bmp511M)
* [Scikit Learn TF-IDF Feature Extraction and Latent Semantic Analysis](https://www.youtube.com/watch?v=BJ0MnawUpaU)
* [Fake News Challenge TF-IDF Baseline](https://github.com/gmyrianthous/fakenewschallenge/blob/master/baseline.py)
* [Python TF-IDF Algorithm Built From Scratch](https://www.youtube.com/watch?v=hXNbFNCgPfY)
* [Theory Behind TF-IDF](https://www.youtube.com/watch?v=4vT4fzjkGCQ)
* [Plotting Classifier Boundaries](http://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html)

In [1]:
import sys
print(sys.version)

#Import all required modules

#For parsing and visualizing data
from pandas import DataFrame, read_csv
import pandas as pd

#For visualizing data
import matplotlib.pyplot as plt

#For processing data
import numpy as np
import pickle
from sklearn.model_selection import train_test_split

#Feature Engineering
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from scipy.sparse import hstack
import baseline_features
from sklearn.decomposition import TruncatedSVD

#Classifiers
from sklearn.naive_bayes import MultinomialNB
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

#For scoring
from sklearn.metrics import accuracy_score
import score #Score used in competition

#Progress Bar
from tqdm import tqdm

#Reloading modules that have been updated
#import importlib
#importlib.reload(baseline_features)

3.6.4 (default, Jan  6 2018, 11:51:59) 
[GCC 4.2.1 Compatible Apple LLVM 9.0.0 (clang-900.0.39.2)]


# Data Preparation

## Create Dataframes

In [2]:
#Import data from CSV file and create a dataframe
def create_dataframe(filename):
    #Read file into a pandas dataframe
    df = pd.read_csv(filename)
    #Remove white space in column names
    df.columns = [c.replace(' ', '_') for c in df.columns]
    return df

In [3]:
#Create dataframes for both training and testing sets
train_df_tmp = create_dataframe('train_stances.csv')
train_bodies_df = create_dataframe('train_bodies.csv')

test_df_tmp = create_dataframe('competition_test_stances.csv')
test_bodies_df = create_dataframe('test_bodies.csv')

train_df_tmp.head(5)

Unnamed: 0,Headline,Body_ID,Stance
0,Police find mass graves with at least '15 bodi...,712,unrelated
1,Hundreds of Palestinians flee floods in Gaza a...,158,agree
2,"Christian Bale passes on role of Steve Jobs, a...",137,unrelated
3,HBO and Apple in Talks for $15/Month Apple TV ...,1034,unrelated
4,Spider burrowed through tourist's stomach and ...,1923,disagree


## Join Dataframes on Body_ID

In [4]:
train_df = pd.merge(train_df_tmp,
                 train_bodies_df[['Body_ID', 'articleBody']],
                 on='Body_ID')

test_df = pd.merge(test_df_tmp,
                 test_bodies_df[['Body_ID', 'articleBody']],
                 on='Body_ID')

train_df = train_df.rename(columns={'articleBody': 'Body_Text'})
test_df = test_df.rename(columns={'articleBody': 'Body_Text'})

In [5]:
test_df.sort_values(by=['Body_ID']).head(5)

Unnamed: 0,Headline,Body_ID,Stance,Body_Text
7305,Apple to keep gold Watch Editions in special i...,1,unrelated,Al-Sisi has denied Israeli reports stating tha...
7303,Apple installing safes in-store to protect gol...,1,unrelated,Al-Sisi has denied Israeli reports stating tha...
7304,El-Sisi denies claims he'll give Sinai land to...,1,agree,Al-Sisi has denied Israeli reports stating tha...
7306,Apple Stores to Keep Gold “Edition” Apple Watc...,1,unrelated,Al-Sisi has denied Israeli reports stating tha...
7307,South Korean woman's hair 'eaten' by robot vac...,1,unrelated,Al-Sisi has denied Israeli reports stating tha...


In [6]:
train_df.sort_values(by=['Body_ID']).head(5)

Unnamed: 0,Headline,Body_ID,Stance,Body_Text
41651,"Soldier shot, Parliament locked down after gun...",0,unrelated,A small meteorite crashed into a wooded area i...
41657,Italian catches huge wels catfish; is it a rec...,0,unrelated,A small meteorite crashed into a wooded area i...
41658,Not coming to a store near you: The pumpkin sp...,0,unrelated,A small meteorite crashed into a wooded area i...
41659,One gunman killed in shooting on Parliament Hi...,0,unrelated,A small meteorite crashed into a wooded area i...
41660,Soldier shot at war memorial in Canada,0,unrelated,A small meteorite crashed into a wooded area i...


In [9]:
#Split training data into training and validation set
train_df, validate_df, train_labels, validate_labels = train_test_split(train_df[['Body_Text','Headline']], train_df['Stance'], test_size=.4, random_state=42)
print(X_train)

                                               Body_Text  \
14020  A former PGA Tour player claims Tiger Woods ha...   
12430  KANSAS CITY, Mo. - Kansas City health official...   
36868  Ottawa shooting video shown by police\n\nThe R...   
30213  British man suspected of appearing in videos o...   
2948   While Apple announced that the base model of i...   
18034  First lady Michelle Obama’s face was reportedl...   
28735  A touching tribute to the victims of the Charl...   
25686  A baseball cap and a portrait of Michael Brown...   
38318  A rumor on Tuesday claims Apple's upcoming App...   
33269  The British Islamic State militant known as "J...   
31956  Two Aussie mates had to be talked out of the g...   
47785  The reported ceasefire between the Nigerian go...   
47489  KANSAS CITY, MO (KCTV) -\nA man rushed to a Ka...   
45092  Days before Christmas, Internet prankster Josh...   
8462   Judd Nelson rebuffs Internet rumors that he di...   
44652  Gill Rosenberg, 31,said last week

# Feature Engineering

## TF-IDF Features

In [11]:
#Apply Scikit Learn TFIDF Feature Extraction Algorithm
body_text_vectorizer = TfidfVectorizer(ngram_range=(1, 2), lowercase=True, stop_words='english',max_features=1024)
headline_vectorizer = TfidfVectorizer(ngram_range=(1, 2), lowercase=True, stop_words='english',max_features=1024)

#Create vocabulary based on training data
train_body_tfidf = body_text_vectorizer.fit_transform(train_df['Body_Text'])
train_headline_tfidf = headline_vectorizer.fit_transform(train_df['Headline'])

#Create vocabulary based on validation data
validate_body_tfidf = body_text_vectorizer.transform(validate_df['Body_Text'])
validate_headline_tfidf = headline_vectorizer.transform(validate_df['Headline'])

#Use vocabulary for testing data
test_body_tfidf = body_text_vectorizer.transform(test_df['Body_Text'])
test_headline_tfidf = headline_vectorizer.transform(test_df['Headline']) 

## Cosine Similarity Features

In [12]:
#Cosine Similarity
def get_cosine_similarity(body_tfidf,headline_tfidf):
    cosine_features = []
    #len body_tfidf = len headline_tfidf
    for i in tqdm(range(body_tfidf.shape[0])):
        cosine_features.append(cosine_similarity((body_tfidf.A[0].reshape(1,-1)),(headline_tfidf.A[0].reshape(1,-1)))[0][0])
    return np.array(cosine_features).reshape(body_tfidf.shape[0],1)

In [10]:
#Leave this commented out unless you are re-calculating the cosine similarity
#which can be found in the pickle files labeled: 
#train_cosine_features.p and test_cosine_features.p

#Train data
#train_cosine_features = get_cosine_similarity(train_body_tfidf,train_headline_tfidf)

#Validate data
#validate_cosine_features = get_cosine_similarity(validate_body_tfidf,validate_headline_tfidf)

#Test data
#test_cosine_features = get_cosine_similarity(test_body_tfidf,test_headline_tfidf)

#pickle.dump(train_cosine_features,open('train_cosine_features.p','wb'))
#pickle.dump(train_cosine_features,open('validate_cosine_features.p','wb'))
#pickle.dump(test_cosine_features,open('test_cosine_features.p','wb'))

In [11]:
train_cosine_features = pickle.load(open('train_cosine_features.p','rb'))
validate_cosine_features = pickle.load(open('validate_cosine_features.p','rb'))
test_cosine_features = pickle.load(open('test_cosine_features.p','rb'))

## Hand Selected Features (Baseline Features)

In [12]:
train_hand_features = baseline_features.hand_features(train_df['Headline'],train_df['Body_Text'])

49972it [04:05, 203.63it/s]


In [13]:
validate_hand_features = baseline_features.hand_features(validate_df['Headline'],validate_df['Body_Text'])

19989it [02:40, 124.45it/s]


In [13]:
test_hand_features = baseline_features.hand_features(test_df['Headline'],test_df['Body_Text'])

25413it [02:01, 209.77it/s]


In [14]:
train_hand_features = np.array(train_hand_features)
validate_hand_features = np.array(validate_hand_features)
test_hand_features = np.array(test_hand_features)

## Word Overlap Features (Baseline Feature)

In [16]:
train_overlap_features = baseline_features.word_overlap_features(train_df['Headline'],train_df['Body_Text'])

49972it [03:25, 243.04it/s]


In [14]:
validate_overlap_features = baseline_features.word_overlap_features(validate_df['Headline'],validate_df['Body_Text'])

19989it [02:07, 156.18it/s]


In [17]:
test_overlap_features = baseline_features.word_overlap_features(test_df['Headline'],test_df['Body_Text'])

25413it [01:44, 243.16it/s]


In [18]:
train_overlap_features = np.array(train_overlap_features)
validate_overlap_features = np.array(validate_overlap_features)
test_overlap_features = np.array(test_overlap_features)

## Polarity Features (Baseline Feature)

In [19]:
train_polarity_features = baseline_features.polarity_features(train_df['Headline'],train_df['Body_Text'])

49972it [03:37, 229.53it/s]


In [15]:
validate_polarity_features = baseline_features.polarity_features(validate_df['Headline'],validate_df['Body_Text'])

19989it [02:16, 146.16it/s]


In [20]:
test_polarity_features = baseline_features.polarity_features(test_df['Headline'],test_df['Body_Text'])

25413it [01:41, 249.67it/s]


In [21]:
train_polarity_features = np.array(train_polarity_features)
validate_polarity_features = np.array(validate_polarity_features)
test_polarity_features = np.array(test_polarity_features)

## Refuting Features (Baseline)

In [22]:
train_refuting_features = baseline_features.refuting_features(train_df['Headline'],train_df['Body_Text'])

49972it [00:12, 3917.19it/s]


In [16]:
validate_refuting_features = baseline_features.refuting_features(validate_df['Headline'],validate_df['Body_Text'])

19989it [00:07, 2605.04it/s]


In [23]:
test_refuting_features = baseline_features.refuting_features(test_df['Headline'],test_df['Body_Text'])

25413it [00:06, 3646.16it/s]


In [24]:
train_refuting_features = np.array(train_refuting_features)
validate_refuting_features = np.array(validate_refuting_features)
test_refuting_features = np.array(test_refuting_features)

## Concatenate feature vectors

In [26]:
train_features = hstack([
                            train_body_tfidf,
                            train_headline_tfidf,
                            train_hand_features,
                            train_cosine_features,
                            train_overlap_features,
                            train_polarity_features,
                            train_refuting_features
    
                        ])
validate_features = hstack([
                            validate_body_tfidf,
                            validate_headline_tfidf,
                            validate_hand_features,
                            validate_cosine_features,
                            validate_overlap_features,
                            validate_polarity_features,
                            validate_refuting_features
                            
                        ])
test_features = hstack([
                            test_body_tfidf,
                            test_headline_tfidf,
                            test_hand_features,
                            test_cosine_features,
                            test_overlap_features,
                            test_polarity_features,
                            test_refuting_features
                        ])

# Classification

## Extract labels

In [27]:
#We already have train_labels and validate_labels from before
test_labels = list(test_df['Stance'])

## Run Classifiers and Score Validation Output

In [30]:
names = ["Random Forest", "Multinomial Naive Bayes", "Gradient Boosting","K Nearest Neighbors","Linear SVM", "Decision Tree", "Logistic Regression"]

classifiers = [
    RandomForestClassifier(n_estimators=10),
    MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True),
    GradientBoostingClassifier(n_estimators=200, random_state=14128, verbose=True),
    KNeighborsClassifier(4),
    SVC(kernel="linear", C=0.025),
    DecisionTreeClassifier(max_depth=5),
    LogisticRegression(C=1e5)
]

for n, clf in zip(names, classifiers):
    print(n)
    y_pred = clf.fit(train_features,train_labels).predict(validate_features)
    print(score.report_score(test_labels, y_pred))
    print('\n')

Random Forest
-------------------------------------------------------------
|           |   agree   | disagree  |  discuss  | unrelated |
-------------------------------------------------------------
|   agree   |    724    |     2     |    823    |    354    |
-------------------------------------------------------------
| disagree  |    198    |     2     |    234    |    263    |
-------------------------------------------------------------
|  discuss  |    797    |     4     |   2874    |    789    |
-------------------------------------------------------------
| unrelated |    61     |     3     |    460    |   17825   |
-------------------------------------------------------------
Score: 8570.75 out of 11651.25	(73.56077674069306%)
73.56077674069306
Multinomial Naive Bayes
-------------------------------------------------------------
|           |   agree   | disagree  |  discuss  | unrelated |
-------------------------------------------------------------
|   agree   |   1000    

MemoryError: 

## Run Classifiers and Score Test Output

In [None]:
#This is how well we would have scored in the actual competition
names = ["Random Forest", "Multinomial Naive Bayes", "Gradient Boosting","K Nearest Neighbors","Linear SVM", "Decision Tree", "Logistic Regression"]

classifiers = [
    RandomForestClassifier(n_estimators=10),
    MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True),
    GradientBoostingClassifier(n_estimators=200, random_state=14128, verbose=True),
    KNeighborsClassifier(4),
    SVC(kernel="linear", C=0.025),
    DecisionTreeClassifier(max_depth=5),
    LogisticRegression(C=1e5)
]

for n, clf in zip(names, classifiers):
    print(n)
    y_pred = clf.fit(train_features,train_labels).predict(validate_features)
    print(score.report_score(test_labels, y_pred))
    print('\n')