# Bayes and SVM Models

This notebook builds the naive Bayes and SVM models on bag-of-words and tf-idf features. There are four main models:
- Bayes: on text
- Bayes: on text and keywords
- SVM: on text
- SVM: on text and keywords

In [1]:
import pandas as pd
import numpy as np
import re
import string
import spacy

import clean_data

In [2]:
import nltk
from nltk.sentiment import vader
from nltk.sentiment.vader import SentimentIntensityAnalyzer
nltk.download('vader_lexicon') ## this only needs to be run once

vader_model = SentimentIntensityAnalyzer()

[nltk_data] Downloading package vader_lexicon to
[nltk_data]     /home/sybolt/nltk_data...
[nltk_data]   Package vader_lexicon is already up-to-date!


In [3]:
df = pd.read_csv('train.csv')

# drop the rows where the keyword or the text is missing
df = df[df.keyword.notnull()]
df = df[df.text.notnull()]

In [4]:
# Should we also use TextBlob and compare its performance with VADER!?
df['compound'] = df['text'].apply(lambda x:vader_model.polarity_scores(x)['compound'])

In [5]:
df.head()

Unnamed: 0,id,keyword,location,text,target,compound
31,48,ablaze,Birmingham,@bbcmtd Wholesale Markets ablaze http://t.co/l...,1,0.0
32,49,ablaze,Est. September 2012 - Bristol,We always try to bring the heavy. #metal #RT h...,0,0.0
33,50,ablaze,AFRICA,#AFRICANBAZE: Breaking news:Nigeria flag set a...,1,0.0
34,52,ablaze,"Philadelphia, PA",Crying out for more! Set me ablaze,0,-0.5255
35,53,ablaze,"London, UK",On plus side LOOK AT THE SKY LAST NIGHT IT WAS...,0,0.0


In [6]:
clean_data.clean(df, 'train', False)

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6685 entries, 0 to 7551
Data columns (total 5 columns):
keyword     6685 non-null object
location    4532 non-null object
text        6685 non-null object
target      6685 non-null int64
compound    6685 non-null float64
dtypes: float64(1), int64(1), object(3)
memory usage: 313.4+ KB
None


In [7]:
train_df = pd.read_csv('train_clean.csv')
train_df = train_df[train_df['text'].notnull()]

In [8]:
train_df.head(20)

Unnamed: 0,keyword,location,text,target,compound
0,ablaze,Birmingham,wholesale market ablaze,1,0.0
1,ablaze,Est. September 2012 - Bristol,try bring heavy metal,0,0.0
2,ablaze,AFRICA,africanbaze break news nigeria flag set ablaze...,1,0.0
3,ablaze,"Philadelphia, PA",cry set ablaze,0,-0.5255
4,ablaze,"London, UK",plus look sky night ablaze,0,0.0
5,ablaze,Pretoria,mufc build hype new acquisition doubt set epl ...,0,-0.5023
6,ablaze,World Wide!!,inec office abia set ablaze,1,0.0
7,ablaze,,barbado bridgetown jamaica car set ablaze sant...,1,0.0
8,ablaze,Paranaque City,ablaze lord d,0,0.6166
9,ablaze,Live On Webcam,check nsfw,0,0.0


# Models

In [9]:
# import for the models

import matplotlib.pyplot as plt # are we using this?
import scipy.sparse as sp
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn import metrics
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

Split the data into a training set and a validation set. We call it validation rather than test because the test data lives in a separate file and is therefore not part of the current dataframe.

The splits defined below are reused by all four models:

In [10]:
# This cell holds the overarching train/validation split.

x_keyword = train_df['keyword']
x_text = train_df['text']
y = train_df['target']
# Hashtag?
# Any other feature?

x_train, x_test, y_train, y_test = train_test_split(x_text, y, test_size=0.2, random_state=40) # a fixed random_state makes the shuffle, and thus the split, reproducible
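`train_test_split` shuffles the rows before splitting, and `random_state` seeds that shuffle. A minimal sketch on made-up toy arrays (nothing here comes from `train.csv`) shows that the same seed reproduces the split, and that two inputs with the same number of rows are split on the same row positions:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data, purely illustrative
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

# Same seed -> identical validation rows on every call
_, a_val, _, _ = train_test_split(X, y, test_size=0.2, random_state=40)
_, b_val, _, _ = train_test_split(X, y, test_size=0.2, random_state=40)
same_split = np.array_equal(a_val, b_val)

# Same seed and same number of rows -> the same row positions are selected,
# even when a different array is passed in
idx = np.arange(10)
_, idx_val = train_test_split(idx, test_size=0.2, random_state=40)
_, X_val = train_test_split(X, test_size=0.2, random_state=40)
aligned = np.array_equal(X[idx_val], X_val)

print(same_split, aligned)
```

This is why splits that share `random_state=40` and the same `y` can be compared row for row, while a split without a seed cannot.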

In [11]:
# Needed for stacking

keyword_vectorizer = CountVectorizer()
keyword_vectors = keyword_vectorizer.fit_transform(train_df['keyword'])

text_vectorizer = CountVectorizer()
text_vectors = text_vectorizer.fit_transform(train_df['text'])

# x_features_train is a combined representation containing both the keywords and the text vectors
x_features_train = sp.hstack([keyword_vectors, text_vectors], format='csr')

# y is decided above
x_train_stack, x_test_stack, y_train_stack, y_test_stack = train_test_split(x_features_train, y, test_size=0.2)
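Two things are worth noting about the cell above: both vectorizers are fitted on the full dataframe before the split, so vocabulary from the validation rows reaches the features, and the split has no `random_state`, so its validation rows differ from those of the text-only split. One possible alternative, sketched here on a made-up dataframe (`toy_df`, its rows, and the helper `stack_features` are all illustrative), is to split first and fit the vectorizers on the training rows only:

```python
import pandas as pd
import scipy.sparse as sp
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Made-up rows in the same column layout as train_df
toy_df = pd.DataFrame({
    'keyword': ['ablaze', 'ablaze', 'flood', 'flood', 'crash', 'crash'],
    'text': ['market ablaze', 'set me ablaze', 'river flood town',
             'flood of email', 'car crash highway', 'crash course math'],
    'target': [1, 0, 1, 0, 1, 0],
})

# Split first, with a seed, so the validation rows stay unseen
train_part, val_part = train_test_split(toy_df, test_size=2, random_state=40)

# Fit both vocabularies on the training rows only
kw_vect = CountVectorizer().fit(train_part['keyword'])
txt_vect = CountVectorizer().fit(train_part['text'])

def stack_features(frame):
    """Stack keyword counts and text counts side by side (hypothetical helper)."""
    return sp.hstack([kw_vect.transform(frame['keyword']),
                      txt_vect.transform(frame['text'])], format='csr')

x_train_toy = stack_features(train_part)
x_val_toy = stack_features(val_part)
print(x_train_toy.shape, x_val_toy.shape)
```

Tokens that occur only in the validation rows are simply ignored by `transform`, which mirrors what happens with genuinely unseen test data.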

In [12]:
# Needed for concatenation

combine_texts = lambda x: x.keyword + " " + x.keyword + " " + x.keyword + " " + x.text
train_df['kt_combined'] = train_df.apply(combine_texts, axis=1)
train_df.head()

x_con = train_df['kt_combined']

vect_con = CountVectorizer()

# y is given above
x_train_con, x_test_con, y_train_con, y_test_con = train_test_split(x_con, y, test_size=0.2, random_state=40)

vect_con.fit(x_train_con)

x_train_vect_con = vect_con.transform(x_train_con)
x_test_vect_con = vect_con.transform(x_test_con)
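Repeating the keyword three times in `combine_texts` raises its count in the bag-of-words vector, which is a simple way to weight it above ordinary text tokens. A small sketch using one cleaned row from the table above (keyword `ablaze`, text `wholesale market ablaze`):

```python
from sklearn.feature_extraction.text import CountVectorizer

keyword, text = 'ablaze', 'wholesale market ablaze'

# Same construction as combine_texts: keyword three times, then the text
combined = ' '.join([keyword] * 3 + [text])

vect = CountVectorizer()
counts = vect.fit_transform([combined]).toarray()[0]

# vocabulary_ maps each token to its column index in the count matrix
count_by_token = {tok: int(counts[col]) for tok, col in vect.vocabulary_.items()}
print(count_by_token)
```

Because the keyword also occurs in the text of this row, its count ends up at four rather than three.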


## SVM

In [13]:
# Support Vector Machines
from sklearn.svm import SVC
from sklearn.svm import LinearSVC

### Pipeline usage for TFidf with SVM

Here we use tf-idf features in combination with a linear SVM.

Note that this model uses only one feature, the text.
It therefore does not include the keywords in its prediction.

In [14]:
text_clf_svm = Pipeline([('tfidf', TfidfVectorizer()),
                     ('clf', LinearSVC()),
])

# Feed the training data through the pipeline
text_clf_svm.fit(x_train, y_train)

Pipeline(steps=[('tfidf', TfidfVectorizer()), ('clf', LinearSVC())])

In [15]:
pred_pipe_svm_test = text_clf_svm.predict(x_test)

In [16]:
print(metrics.classification_report(y_test, pred_pipe_svm_test))

              precision    recall  f1-score   support

           0       0.79      0.82      0.80       806
           1       0.71      0.67      0.69       531

    accuracy                           0.76      1337
   macro avg       0.75      0.74      0.75      1337
weighted avg       0.76      0.76      0.76      1337
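The summary rows of the report follow from the per-class rows. As a quick check, the sketch below recomputes the class-1 F1 score and the weighted-average F1 from the rounded values printed above, so the results only agree to about two decimals:

```python
# Rounded values copied from the classification report above
precision_1, recall_1 = 0.71, 0.67
f1_0, f1_1 = 0.80, 0.69
support_0, support_1 = 806, 531

# F1 is the harmonic mean of precision and recall
f1_check = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

# Weighted average: per-class F1 weighted by the class support
weighted_f1 = (f1_0 * support_0 + f1_1 * support_1) / (support_0 + support_1)

print(round(f1_check, 2), round(weighted_f1, 2))  # 0.69 0.76
```

The macro average, by contrast, is the plain mean of the two class scores and ignores the class imbalance (806 versus 531 rows).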



### SVM model: with both keywords and text as features

Here we train a linear SVM on two features, the keyword and the text.

Note that, unlike the pipeline above, both variants below use raw token counts from `CountVectorizer` rather than tf-idf weights.
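The stacked and concatenated features used in this notebook come from `CountVectorizer`. If tf-idf weighting over both columns is wanted, one possible sketch (not what the cells below do) is a `ColumnTransformer` that applies a separate `TfidfVectorizer` to each column inside a pipeline. The dataframe `toy_df` and the step names are made up for illustration:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Made-up rows in the same column layout as train_df
toy_df = pd.DataFrame({
    'keyword': ['ablaze', 'ablaze', 'flood', 'flood', 'crash', 'crash'],
    'text': ['market ablaze', 'set me ablaze', 'river flood town',
             'flood of email', 'car crash highway', 'crash course math'],
    'target': [1, 0, 1, 0, 1, 0],
})

# A column given as a plain string is passed to the vectorizer as 1-D text
features = ColumnTransformer([
    ('keyword_tfidf', TfidfVectorizer(), 'keyword'),
    ('text_tfidf', TfidfVectorizer(), 'text'),
])

kt_clf_svm = Pipeline([('features', features), ('clf', LinearSVC())])
kt_clf_svm.fit(toy_df[['keyword', 'text']], toy_df['target'])

toy_pred = kt_clf_svm.predict(toy_df[['keyword', 'text']])
print(toy_pred)
```

Keeping the vectorizers inside the pipeline means they are fitted only on whatever rows are passed to `fit`, so the same object can be used with `train_test_split` or cross-validation without refitting by hand.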

#### Stacking

In [17]:
clf_svm_stack = LinearSVC() 

clf_svm_stack.fit(x_train_stack, y_train_stack)

LinearSVC()

In [18]:
pred_stack_svm_test = clf_svm_stack.predict(x_test_stack)

In [19]:
print(metrics.accuracy_score(y_test_stack, pred_stack_svm_test))
print(metrics.classification_report(y_test_stack, pred_stack_svm_test))

0.7636499626028422
              precision    recall  f1-score   support

           0       0.78      0.83      0.80       787
           1       0.73      0.67      0.70       550

    accuracy                           0.76      1337
   macro avg       0.76      0.75      0.75      1337
weighted avg       0.76      0.76      0.76      1337



#### Concatenation

In [20]:
clf_svm_con = LinearSVC() 

clf_svm_con.fit(x_train_vect_con, y_train_con)



LinearSVC()

In [23]:
pred_svm_con_test = clf_svm_con.predict(x_test_vect_con)

print(metrics.accuracy_score(y_test_con, pred_svm_con_test))
print(metrics.classification_report(y_test_con, pred_svm_con_test))

0.7352281226626777
              precision    recall  f1-score   support

           0       0.77      0.79      0.78       806
           1       0.67      0.65      0.66       531

    accuracy                           0.74      1337
   macro avg       0.72      0.72      0.72      1337
weighted avg       0.73      0.74      0.73      1337



## Bayesian classifier

In [24]:
# Bayes
from sklearn.naive_bayes import MultinomialNB

### Pipeline usage for TFidf with Bayes

Here we use tf-idf features in combination with a multinomial naive Bayes classifier.

Note that this model uses only one feature, the text.
It therefore does not include the keywords in its prediction.

In [25]:
text_clf_nb = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', MultinomialNB()),
])
text_clf_nb.fit(x_train, y_train)

pred_pipe_nb_test = text_clf_nb.predict(x_test)

print(metrics.accuracy_score(y_test, pred_pipe_nb_test))
print(metrics.classification_report(y_test,pred_pipe_nb_test))

0.7808526551982049
              precision    recall  f1-score   support

           0       0.77      0.90      0.83       806
           1       0.79      0.60      0.69       531

    accuracy                           0.78      1337
   macro avg       0.78      0.75      0.76      1337
weighted avg       0.78      0.78      0.77      1337



### Bayes Model: with both keywords and text as features

Here we train a multinomial naive Bayes classifier on two features, the keyword and the text, using raw token counts from `CountVectorizer`.

The model therefore includes both keywords and text.

We take two approaches:
- Stacking
- Concatenation (the keyword is repeated three times before the text)

#### Stacking

In [26]:
clf_nb_stack = MultinomialNB() 

clf_nb_stack.fit(x_train_stack, y_train_stack)

MultinomialNB()

In [27]:
pred_stack_nb_test = clf_nb_stack.predict(x_test_stack)

In [28]:
print(metrics.accuracy_score(y_test_stack, pred_stack_nb_test))
print(metrics.classification_report(y_test_stack, pred_stack_nb_test))

0.8010471204188482
              precision    recall  f1-score   support

           0       0.83      0.83      0.83       787
           1       0.76      0.75      0.76       550

    accuracy                           0.80      1337
   macro avg       0.79      0.79      0.79      1337
weighted avg       0.80      0.80      0.80      1337



#### Concatenation

In [29]:
clf_nb_con = MultinomialNB() 

clf_nb_con.fit(x_train_vect_con, y_train_con)

MultinomialNB()

In [31]:
pred_conc_nb_test = clf_nb_con.predict(x_test_vect_con)

print(metrics.accuracy_score(y_test_con, pred_conc_nb_test))
print(metrics.classification_report(y_test_con, pred_conc_nb_test))

0.768885564697083
              precision    recall  f1-score   support

           0       0.82      0.79      0.81       806
           1       0.70      0.73      0.72       531

    accuracy                           0.77      1337
   macro avg       0.76      0.76      0.76      1337
weighted avg       0.77      0.77      0.77      1337

