# NLP with Disaster Tweets

The aim of this task is to develope a machine learning model is able classify a tweet as either a tweet about a real disaster and not.

* id - a unique identifier for each tweet
* text - the text of the tweet
* location - the location the tweet was sent from (may be blank)
* keyword - a particular keyword from the tweet (may be blank)
* target - in train.csv only, this denotes whether a tweet is about a real disaster (1) or not (0)

# 1. Loading Data

In [28]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly
import plotly.graph_objects as go


from sklearn import feature_extraction, linear_model, model_selection, preprocessing
from textblob import TextBlob
from nltk.tokenize import word_tokenize
from nltk import pos_tag
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.preprocessing import LabelEncoder
from collections import defaultdict
from nltk.corpus import wordnet as wn
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import model_selection, naive_bayes, svm
from sklearn.metrics import accuracy_score
import random

from sklearn.feature_extraction.text import CountVectorizer


from IPython.display import display, HTML, Markdown

CSS = """
.output {
    flex-direction: column;
}
"""

HTML('<style>{}</style>'.format(CSS))
display(HTML("<style>.container { width:100% !important; }</style>"))

In [29]:
train = pd.read_csv('data/qeno.csv',sep=',', engine='python')
# test=pd.read_csv('data/test.csv')
# test['target']='*test'

# Merge Dataframes
dataset=train

### Initial Overview

In [30]:
# display(Markdown("There are {} observations in training set, and {} observations in the test set: {} observations in total.".format(len(train), len(test), len(dataset))))
# display(dataset.head())
# display(dataset.info())

In [31]:
# pd.options.display.max_colwidth = 200
# display(train['text'].sample(10))

**Target Variable**

In [32]:
# pd.DataFrame(train["target"].value_counts())

**Missing Values**

In [33]:
NAs=pd.DataFrame(dataset.isna().sum()[dataset.isna().sum().ne(0)],columns = ["total"])
NAs['%']=(NAs['total'])/len(dataset)*100
NAs

Unnamed: 0,total,%


# 2. Data Overview

#### Tweet Length

We will compare how long the tweets are for those classified as real disaster and not from the train dataset. To get this we will check: 
* Total number of words in the tweets
* Total tweet length
* Average word lenght in the tweets

In [34]:
# dataset['word_count']=dataset['text'].str.split().map(lambda x: len(x))

# display(round(dataset['word_count'].groupby(dataset['target']).mean())) 

# fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20,4))
# ax1.hist(dataset[dataset['target']==1]['word_count'], color='Crimson')
# ax1.set_title('Disaster Tweets')
# ax1.set_xlabel('Number of Words in a Tweet')
# ax1.set_ylabel('Count of Tweets')

# ax2.hist(dataset[dataset['target']==0]['word_count'], color='LimeGreen')
# ax2.set_title('Non-Disaster Tweets')
# ax2.set_xlabel('Number of Words in a Tweet')

# fig.suptitle('Tweet Word Count Distribution')
# plt.show()

In [35]:
# dataset['tweetlen'] = dataset['text'].astype(str).apply(len)
# display(round(dataset['tweetlen'].groupby(dataset['target']).max())) 

# #140 characters is the max on twitter, but because of links we see that we have higher character counts.

In [36]:
# dataset['avgword_len'] = dataset['text'].str.split().apply(lambda x: round(np.mean([len(i) for i in x])))
# display(round(dataset['avgword_len'].groupby(dataset['target']).mean()))

# fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20,4))
# ax1.hist(dataset[dataset['target']==1]['avgword_len'], color='Crimson')
# ax1.set_title('Disaster Tweets')
# ax1.set_xlabel('Average Length of Words in Tweet')
# ax1.set_ylabel('Count of Tweets')

# ax2.hist(dataset[dataset['target']==0]['avgword_len'], color='LimeGreen')
# ax2.set_title('Non-Disaster Tweets')
# ax2.set_xlabel('Average Length of Words in Tweet')

# fig.suptitle('Average Word Length Distribution')
# plt.show()

#### Tweet Contents

We can take a glimps into what the tweets actually contain by extracting some comment components we can expect. We will check how many hashtags are used per tweet, wether there are any links/urls and if the tweet is mentioning a username or not.

In [37]:
# #we check here per each tweet how many urls are shared, users are mentioned and hashtags are used. 
# import re
# twitter_urlpattern= '(https?:\/\/)(\s)?(www\.)?(\s?)(\w+\.)*([\w\-\s]+\/)*([\w-]+)\/?'
# # twitter_urlpattern2= 'http:\/\/t.co\/[a-zA-Z0-9\-\.]{9}'
# username_pattern = "(@[A-Za-z0-9_]+)"
# hashtag_patten ="(^|\\s)#(\\w*[a-zA-Z_]+\\w*)"

# dataset['urlcount'] = dataset.text.apply(lambda x: re.findall(twitter_urlpattern, x)).str.len()
# dataset['user_count']=dataset.text.apply(lambda x: re.findall(username_pattern, x)).str.len()
# dataset['hashtag_count']=dataset.text.apply(lambda x: re.findall(hashtag_patten, x)).str.len()

# display(round(dataset['urlcount'].groupby(dataset['target']).max())) 
# display(round(dataset['user_count'].groupby(dataset['target']).max())) 
# display(round(dataset['hashtag_count'].groupby(dataset['target']).max()))

#### Location and Keywords

In [38]:
# display(dataset['location'].groupby(dataset['target']).value_counts()) 
# display(dataset['keyword'].groupby(dataset['target']).value_counts()) 

# 3. Data Preprocessing

## 3.1 Data Cleaning

Here we will remove all the content that such as the usernames mentioned, and the urls. Since we have emoticons and below we will convert them to text we first need to primarely remove links and mentions as some of the symbols there might be read as emoticons. One this primary noise removal is done then we will convert the emoticons, followed by removed the remaining punctuations. 
Numbers will also be removed

In [39]:
dataset.head()

Unnamed: 0,challenge,question
0,Marketing/Pitching,A list of where to advertise in social media
1,Fund/Capital,Accounting
2,Business Idea,Am a law student and am interested in legal an...
3,Skills/Experiance,Any beginner ideas to share
4,Skills/Experiance,Any course recommendations for starting a SME.?


In [40]:
dataset['question']=dataset['question'].astype(str)
# dataset['keyword']=dataset['keyword'].astype(str)

def preprocess1(ReviewText):
    ReviewText = ReviewText.str.replace("(<br/>)", "")
    ReviewText = ReviewText.str.replace("nan", "")
    ReviewText = ReviewText.str.replace('(<a).*(>).*(</a>)', '')
    ReviewText = ReviewText.str.replace('(&amp)', 'and')
    ReviewText = ReviewText.str.replace('(&gt)', '')
    ReviewText = ReviewText.str.replace('(&lt)', '') 
    ReviewText = ReviewText.str.replace('(\xa0)', '')
    ReviewText = ReviewText.str.replace('\n' ,' ') #new line
    ReviewText = ReviewText.str.replace('\d+', "") #numbers
    ReviewText = ReviewText.str.replace(' +', " ") #extra space
    ReviewText = ReviewText.str.replace('RT[ ]?@','') # Retweets  
    ReviewText = ReviewText.str.replace('(RT|rt)[ ]*@[ ]*[\S]+' ,'') # Retweets
    return ReviewText
dataset['processed_question'] = preprocess1(dataset['question'])

In [41]:
dataset.head()

Unnamed: 0,challenge,question,processed_question
0,Marketing/Pitching,A list of where to advertise in social media,A list of where to advertise in social media
1,Fund/Capital,Accounting,Accounting
2,Business Idea,Am a law student and am interested in legal an...,Am a law student and am interested in legal an...
3,Skills/Experiance,Any beginner ideas to share,Any beginner ideas to share
4,Skills/Experiance,Any course recommendations for starting a SME.?,Any course recommendations for starting a SME.?


#### Emoticons to Text Coversion

Before removing any punctuation we will first convert emoticons to text. Because the emoticons in the text are in keyboard form (e.g. :) :( ) we need to convert these into text, as they are significat in determining the sentiment of the text and provided added valueable information.
If we go into the EMOTICONS the each emoticon is described by more than one word and we can only use the first one. 

eg: "London is cool ;)"  will be converted to  ->  "London is cool WinkÇ_orÇ_smirk"  and we will extract the emoticon text using the  "ç_" and out final emoticon feature will contain the substring "Wink".
After this extraction we will remove all punctuation from original tweet so we are only goin to be left with words.

NOTE: This can be commented out as the results show no difference in emoji usage between the classes and majority of tweets have none. 

#### More Noise Removal

We will now remove the remaining punctuations. We will keep hashtags as they are since they are unique and not necessarly work as words. It will also be more important to see if the presence of cetain hashtag is more important in classifying than other words. eg: having a word 'fire' and '#fire'. 

In [42]:
def preprocess3(ReviewText):
    ReviewText = ReviewText.str.replace("[^\w\s\#]", "") #remove anything that isn't word space or hashtag.
    ReviewText = ReviewText.str.replace('[^\x00-\x7f]', "")    
    return ReviewText
dataset['processed_question'] = preprocess3(dataset['processed_question'])
display(pd.DataFrame(dataset['processed_question'].sample(5)))

Unnamed: 0,processed_question
550,With a family and a demanding job how does som...
132,How do you get it all set up Like on the ficia...
177,How to be more motivated about the side hustle...
557,Would need some ides how to start our ideas
426,What are the hot side hustles that involves le...


## 3.2 Normalization

#### Tokenization, Lemmetization, and Removing Stopwords
We will use twitter tokenizer to define observation points and lowercase all the tokens.
Part of speach tagging is not necessary here as we are dealing with tweeter data where grammer isn't focused or and there are not proper sentences constracted. 
Lemmetization will be done however although it won't change many words since it is much more effective through POS tagging. Stemming on the other hand might transfom words into unpredictable form if it starts removing prefix and suffix especially in such unstructured text and the meaning might change.

In [43]:
import nltk
from nltk.tokenize import TweetTokenizer
from nltk.corpus import stopwords

tt = TweetTokenizer(preserve_case=False)
lemmatizer = nltk.stem.WordNetLemmatizer()

stop_words = list(stopwords.words('english'))

# dataset['processed_question']=word_tokenize(text)
dataset['tokinized']=dataset.processed_question.apply(lambda x: word_tokenize(x))


# def lemmatize_text(text):
#     return [lemmatizer.lemmatize(w) for w in tt.tokenize(text)]

def remove_stopwords(text):
    text = [x for x in text if not x in stop_words]
    tokens=str(text).replace('[','').replace(']','')
    return tokens

# dataset['lemmet_tokens'] = dataset.processed_tweet.apply(lambda x: lemmatize_text(x))
dataset['tokinized'] = dataset.tokinized.apply(lambda x: remove_stopwords(x))

display(pd.DataFrame(dataset.sample(2)))

Unnamed: 0,challenge,question,processed_question,tokinized
423,Fund/Capital,What are the best ways to find remote opportun...,What are the best ways to find remote opportun...,"'What', 'best', 'ways', 'find', 'remote', 'opp..."
428,Fund/Capital,What are the most profitable side hustles with...,What are the most profitable side hustles with...,"'What', 'profitable', 'side', 'hustles', 'leas..."


# 4. Some EDA

### Tweet Sentiment
It will also be interesting to check what general sentiment the tweets have and how the two classes of tweets compare in this regard. We will check the distribution of the polarity.

In [44]:
# import plotly.graph_objs as go
# dataset['polarity'] = dataset['processed_tweet'].map(lambda text: TextBlob(text).sentiment.polarity)

# train0 = dataset[dataset["target"]==0]
# train1 = dataset[dataset["target"]==1]

# fig = go.Figure()
# fig.add_trace(go.Histogram(x=train0['polarity'],name = 'Non-Disaster Tweets Sentiment Polarity'))
# fig.add_trace(go.Histogram(x=train1['polarity'],name = 'Disaster Tweets Sentiment Polarity'))

# # Overlay both histograms
# fig.update_layout(barmode='stack')
# # Reduce opacity to see both histograms
# fig.update_traces(opacity=0.75)
# fig.show()

In [45]:
dataset['challenge'].value_counts()

Business Idea                     129
Getting Started                   125
Fund/Capital                       88
Time Management                    38
COVID 19                           28
Securing/Managing Clients          22
Marketing/Pitching                 17
None                               16
Skills/Experiance                  16
Uncertainty                        14
Other                              13
Revenue/Sales                      11
Unemployment/Job Security          11
Business Plan                       6
Working from Home                   5
Scaling/Improving                   4
Focus                               3
Mentorship                          3
Human Resourse                      3
Content Generation/Interaction      3
other                               2
Planning/Business Management        2
business Idea                       1
Name: challenge, dtype: int64

### N-Gram Analysis

In [50]:
import cufflinks as cf
cf.go_offline()
cf.set_config_file(offline=False, world_readable=True)

def top_word(corpus, n=None):
    vec = CountVectorizer(ngram_range=(1, 1)).fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0) 
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq =sorted(words_freq, key = lambda x: x[1], reverse=True)
    return words_freq[:n]
Business_Idea = top_word(dataset['tokinized'], 20)
# unig_nondisaster = top_word(dataset[dataset['target']==0]['tokinized'], 20)

Business_Idea = pd.DataFrame(Business_Idea, columns = ['tweet_word' , 'count'])
# unig_nondisaster= pd.DataFrame(unig_nondisaster, columns = ['tweet_word' , 'count'])

Business_Idea.groupby('tweet_word').sum()['count'].sort_values(ascending=False).iplot(
    kind='bar', yTitle='Count', linecolor='black', color='darkred',title='Unigram Distribution - Top 20 words in the Disaster Tweet')
# unig_nondisaster.groupby('tweet_word').sum()['count'].sort_values(ascending=False).iplot(
#     kind='bar', yTitle='Count', linecolor='black',color='darkgreen',title='Unigram Distribution - Top 20 words in the Disaster Tweet')

In [64]:
def top_bigram(corpus, n=None):
    vec = CountVectorizer(ngram_range=(2, 2)).fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0) 
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq =sorted(words_freq, key = lambda x: x[1], reverse=True)
    return words_freq[:n]

bg_disaster = top_bigram(dataset[dataset['target']==1]['final_text'], 20)
bg_nondisaster = top_bigram(dataset[dataset['target']==0]['final_text'], 20)
    
bg_disaster = pd.DataFrame(bg_disaster, columns = ['tweet_word' , 'count'])
bg_nondisaster= pd.DataFrame(bg_nondisaster, columns = ['tweet_word' , 'count'])

bg_disaster.groupby('tweet_word').sum()['count'].sort_values(ascending=False).iplot(
    kind='bar', yTitle='Count', linecolor='black', color='darkred',title='Bigram Distribution - Top 20 words in the Disaster Tweet')
bg_nondisaster.groupby('tweet_word').sum()['count'].sort_values(ascending=False).iplot(
    kind='bar', yTitle='Count', linecolor='black',color='darkgreen',title='Bigram Distribution - Top 20 words in the Non-Disaster Tweet')

In [65]:
def top_trigram(corpus, n=None):
    vec = CountVectorizer(ngram_range=(3, 3)).fit(corpus)
    bag_of_words = vec.transform(corpus)
    sum_words = bag_of_words.sum(axis=0) 
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    words_freq =sorted(words_freq, key = lambda x: x[1], reverse=True)
    return words_freq[:n]

tg_disaster = top_trigram(dataset[dataset['target']==1]['final_text'], 20)
tg_nondisaster = top_trigram(dataset[dataset['target']==0]['final_text'], 20)
    
tg_disaster = pd.DataFrame(tg_disaster, columns = ['tweet_word' , 'count'])
tg_nondisaster= pd.DataFrame(tg_nondisaster, columns = ['tweet_word' , 'count'])

tg_disaster.groupby('tweet_word').sum()['count'].sort_values(ascending=False).iplot(
    kind='bar', yTitle='Count', linecolor='black', color='darkred',title='Trigram Distribution - Top 20 words in the Disaster Tweet')
tg_nondisaster.groupby('tweet_word').sum()['count'].sort_values(ascending=False).iplot(
    kind='bar', yTitle='Count', linecolor='black',color='darkgreen',title='Trigram Distribution - Top 20 words in the Non Disaster Tweet')

# 5. Vectorization

### Count Vectorization 
This method of creating the word tweet matrix will give different weight to different words based on how many times they occure, instead binary method that only chooses whether a word is present or not. 
However in our task we saw that indiviual words by themselves overlapp between the classes and bigrams and trigrams are much better so our ngram_range will be (1,3). 
At the same time, we don't expect these pharases to be repeated within a tweet since the characters on a tweet are limited. So we will model with binary CountVectorizer, in other words dummifying to 0 and 1. 

For the sake of comparing how the models perform among the weighted on frequency and binary countvectorization, we will also vectorize our tokens through TFIDF.

In [66]:
dataset_final = dataset[['id','processed_tweet','final_text','target']]
train=dataset_final[dataset_final['target']!= '*test']
test=dataset_final[dataset_final['target']== '*test']

count_vec = CountVectorizer(ngram_range=(1,3),analyzer='word',binary=True) 
cv_mat = count_vec.fit_transform(train['final_text']) 
cv_mat_test = count_vec.transform(test['final_text']) 

tf_vect = TfidfVectorizer(ngram_range=(1,3),binary=False,use_idf=False)
tf_mat = tf_vect.fit_transform(train['final_text'].values) # fit_transform vectorizer to dtrain['text']
tf_mat_test = tf_vect.transform(test['final_text'].values)


display(cv_mat.shape)
display(cv_mat_test.shape)



(7613, 127394)

(3263, 127394)

In [67]:
pd.DataFrame(cv_mat.A, columns=count_vec.get_feature_names()).sample(100)

Unnamed: 0,__,__ derailed,__ derailed eve,__ month,__ month fr,__ want,__ want sleep,__ yazidi_shingal_genocide,__ yazidi_shingal_genocide ezidigenocide,___,...,zrnf,zuidholland,zuidholland armageddon,zuidholland armageddon good,zumiez,zumiez location,zurich,zurich swiss,zurich swiss premiere,zzzz
2940,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3309,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
5550,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
590,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3010,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1935,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1527,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
191,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6826,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


# 6. Modeling

In [68]:
import sklearn
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_score
from sklearn.metrics import f1_score

def printreport(exp, pred):
    print(pd.crosstab(exp, pred, rownames=['Actual'], colnames=['Predicted']))
    print('\n \n')
    print(classification_report(exp, pred))

results_df = pd.DataFrame(columns=['Model', 'F1'])

##### Logistic Regression

In [69]:
## Logistic Regression CountVectorization VS TS
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression

penalty = ['l1', 'l2']
C = np.logspace(0, 4, 10)
hyperparameters = dict(C=C, penalty=penalty)
logisticr = LogisticRegression()
logisticr_grid = GridSearchCV(logisticr, hyperparameters, cv=10, verbose=0)

#Count Vectorization
target = train['target'].astype('int')
x, x_test, y, y_test = train_test_split(cv_mat,target,test_size=0.2,train_size=0.8, random_state = 0)

best_logR_cv= logisticr_grid.fit(x, y)
y_pred1 = best_logR_cv.predict(x_test)
f1score = f1_score(y_test, y_pred1)

results_df.loc[len(results_df)] = ['Best LR Count Vector Baseline', f1score]
results_df

#Term Frequency Vectorization
x, x_test, y, y_test = train_test_split(tf_mat,target,test_size=0.2,train_size=0.8, random_state = 0)

best_logR_tf= logisticr_grid.fit(x, y) #we change the matrix we fit

y_pred2 = best_logR_tf.predict(x_test)

f1score = f1_score(y_test, y_pred2)

results_df.loc[len(results_df)] = ['Best LR Term Frequency', f1score]
results_df







Liblinear failed to converge, increase the number of iterations.



Unnamed: 0,Model,F1
0,Best LR Count Vector Baseline,0.732143
1,Best LR Term Frequency,0.729849


##### Naive Bayes

In [71]:
## NaiveBayes CountVectorization VS TS
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import MultinomialNB


#Count Vectorization
target = train['target'].astype('int')
x, x_test, y, y_test = train_test_split(cv_mat,target,test_size=0.2,train_size=0.8, random_state = 0)

NB_mcv = MultinomialNB(alpha=1).fit(x, y)
y_pred3 = NB_mcv.predict(x_test)
f1score = f1_score(y_test, y_pred3)

results_df.loc[len(results_df)] = ['Naive Bayes Multinomial CountVector', f1score]


#Term Frequency Vectorization
x, x_test, y, y_test = train_test_split(tf_mat,target,test_size=0.2,train_size=0.8, random_state = 0)

NB_mtf = MultinomialNB(alpha=1).fit(x, y)
y_pred4 = NB_mcv.predict(x_test)
f1score = f1_score(y_test, y_pred4)

results_df.loc[len(results_df)] = ['Naive Bayes Multinomial Term Frequency', f1score]

results_df

Unnamed: 0,Model,F1
0,Best LR Count Vector Baseline,0.732143
1,Best LR Term Frequency,0.729849
2,Naive Bayes Multinomial CountVector,0.74057
3,Naive Bayes Multinomial Term Frequency,0.733002


##### Support Vector Machine

In [72]:
param_grid = {'C': [0.1,1, 10],'kernel': ['linear']}

SVM = svm.SVC()
gridSVM = GridSearchCV(SVM,param_grid,refit=True,verbose=2)

# Count Vectorization
x, x_test, y, y_test = train_test_split(cv_mat,target,test_size=0.2,train_size=0.8, random_state = 0)

gridSVM_cv=gridSVM.fit(x,y)

y_pred5 = gridSVM_cv.predict(x_test)
f1score = f1_score(y_test, y_pred5)

results_df.loc[len(results_df)] = ['SVM Count Vectorization', f1score]

# Term Frequency
x, x_test, y, y_test = train_test_split(tf_mat,target,test_size=0.2,train_size=0.8, random_state = 0)
gridSVM_tf=gridSVM.fit(x,y)

y_pred6 = gridSVM_tf.predict(x_test)
f1score = f1_score(y_test, y_pred6)

results_df.loc[len(results_df)] = ['SVM Term Frequency', f1score]

results_df



[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Fitting 3 folds for each of 3 candidates, totalling 9 fits
[CV] C=0.1, kernel=linear ............................................
[CV] ............................. C=0.1, kernel=linear, total=   9.6s
[CV] C=0.1, kernel=linear ............................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    9.5s remaining:    0.0s


[CV] ............................. C=0.1, kernel=linear, total=   9.5s
[CV] C=0.1, kernel=linear ............................................
[CV] ............................. C=0.1, kernel=linear, total=   9.5s
[CV] C=1, kernel=linear ..............................................
[CV] ............................... C=1, kernel=linear, total=   9.4s
[CV] C=1, kernel=linear ..............................................
[CV] ............................... C=1, kernel=linear, total=   9.5s
[CV] C=1, kernel=linear ..............................................
[CV] ............................... C=1, kernel=linear, total=   9.5s
[CV] C=10, kernel=linear .............................................
[CV] .............................. C=10, kernel=linear, total=   9.5s
[CV] C=10, kernel=linear .............................................
[CV] .............................. C=10, kernel=linear, total=   9.5s
[CV] C=10, kernel=linear .............................................
[CV] .

[Parallel(n_jobs=1)]: Done   9 out of   9 | elapsed:  1.4min finished


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Fitting 3 folds for each of 3 candidates, totalling 9 fits
[CV] C=0.1, kernel=linear ............................................
[CV] ............................. C=0.1, kernel=linear, total=   9.9s
[CV] C=0.1, kernel=linear ............................................


[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    9.8s remaining:    0.0s


[CV] ............................. C=0.1, kernel=linear, total=   9.4s
[CV] C=0.1, kernel=linear ............................................
[CV] ............................. C=0.1, kernel=linear, total=   9.4s
[CV] C=1, kernel=linear ..............................................
[CV] ............................... C=1, kernel=linear, total=   9.1s
[CV] C=1, kernel=linear ..............................................
[CV] ............................... C=1, kernel=linear, total=   9.1s
[CV] C=1, kernel=linear ..............................................
[CV] ............................... C=1, kernel=linear, total=   9.2s
[CV] C=10, kernel=linear .............................................
[CV] .............................. C=10, kernel=linear, total=   9.9s
[CV] C=10, kernel=linear .............................................
[CV] .............................. C=10, kernel=linear, total=   9.5s
[CV] C=10, kernel=linear .............................................
[CV] .

[Parallel(n_jobs=1)]: Done   9 out of   9 | elapsed:  1.4min finished


Unnamed: 0,Model,F1
0,Best LR Count Vector Baseline,0.732143
1,Best LR Term Frequency,0.729849
2,Naive Bayes Multinomial CountVector,0.74057
3,Naive Bayes Multinomial Term Frequency,0.733002
4,SVM Count Vectorization,0.733274
5,SVM Term Frequency,0.741514


In [None]:
results_df

In [None]:
display(printreport(y_test, y_pred3))
display(printreport(y_test, y_pred6))

## Final Model

In [None]:
testset_id=pd.DataFrame(test['id'])

preds = NB_mcv.predict(cv_mat_test)
submission= pd.DataFrame(preds)
submission['target']=submission[0]
del submission[0]
finalcsv = pd.concat([testset_id.reset_index(drop=True), submission], axis=1)
finalcsv
finalcsv.to_csv('submission.csv',index=False)