## Overview

Many companies are built around lessening one’s environmental impact or carbon footprint. They offer products and services that are environmentally friendly and sustainable, in line with their values and ideals. They would like to determine how people perceive climate change and whether or not they believe it is a real threat. This would add to their market research efforts in gauging how their product/service may be received.

With this context, EDSA is challenging you during the Classification Sprint with the task of creating a Machine Learning model that is able to classify whether or not a person believes in climate change, based on their novel tweet data.

Providing an accurate and robust solution to this task gives companies access to a broad base of consumer sentiment, spanning multiple demographic and geographic categories - thus increasing their insights and informing future marketing strategies.



<br></br>

<div align="center" style="width: 600px; font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://raw.githubusercontent.com/Explore-AI/Pictures/master/wine.jpg"
     alt="Some fine wine for your fine model"
     style="float: center; padding-bottom=0.5em"
     width=600px/>
Some fine wine for your fine modeling process. 
Photo by <a href="https://unsplash.com/@hermez777?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText"> Hermes Rivera</a> on Unsplash
</div>

The structure of this notebook is as follows:

 - First, we load the training and test tweets to get a view of the predictor (`message`) and response (`sentiment`) variables we will be modeling.
 - We then clean the tweets (URLs, retweet markers, mentions, hashtags, punctuation and emojis) and tokenise, lemmatise and stem the text.
 - Next, we balance the four sentiment classes by resampling and split the data into train and test sets.
 - We vectorise the text with TF-IDF and compare several classifiers: a decision tree, a linear Support Vector Classifier, k-nearest neighbours, Naive Bayes and logistic regression.
 - Finally, we retrain the linear SVC on all the resampled data and generate predictions for the unlabelled test set.

In [19]:
# import relevant libraries
import nltk
import warnings
warnings.filterwarnings("ignore")
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import re
pd.set_option('display.max_rows', 100)
from sklearn.utils import resample

from nltk.corpus import stopwords
from sklearn.metrics import classification_report

# set plot style
sns.set()

In [2]:
# Loading Data
df = pd.read_csv('C:/Users/Mpilenhle/Documents/EDSA/Classification/Advanced_Classification_Predict-student_data-2780/train.csv')
df_test = pd.read_csv('C:/Users/Mpilenhle/Documents/EDSA/Classification/Advanced_Classification_Predict-student_data-2780/test_with_no_labels.csv')

## The Dataset 

For this challenge we use a collection of tweets about climate change, supplied as `train.csv` (labelled) and `test_with_no_labels.csv` (unlabelled).

This dataset consists of the following variables: 

 - sentiment: the class label of the tweet (-1, 0, 1 or 2); it is absent from the test file
 - message: the raw text of the tweet
 - tweetid: the identifier of the tweet
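
As a quick orientation, the sketch below tallies the labels of the ten training rows visible in the preview further down and maps each code to a readable name. The names `anti`, `neutral`, `pro` and `news` are an assumption borrowed from the variable names in the resampling step; the data itself only carries the numeric codes.

```python
from collections import Counter

# Assumed label names, inferred from the resampling variable names
LABEL_NAMES = {-1: "anti", 0: "neutral", 1: "pro", 2: "news"}

# sentiment values of the ten training rows visible in the preview
preview_labels = [1, 1, 2, 1, 1, 1, 1, 1, 1, 1]

counts = Counter(LABEL_NAMES[label] for label in preview_labels)
print(counts.most_common())
```

Even this small preview hints at the class imbalance that the resampling step later addresses.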



In [3]:
# looking at the data
df.head(15)

Unnamed: 0,sentiment,message,tweetid
0,1,PolySciMajor EPA chief doesn't think carbon di...,625221
1,1,It's not like we lack evidence of anthropogeni...,126103
2,2,RT @RawStory: Researchers say we have three ye...,698562
3,1,#TodayinMaker# WIRED : 2016 was a pivotal year...,573736
4,1,"RT @SoyNovioDeTodas: It's 2016, and a racist, ...",466954
5,1,Worth a read whether you do or don't believe i...,425577
6,1,RT @thenation: Mike Pence doesn’t believe in g...,294933
7,1,RT @makeandmendlife: Six big things we can ALL...,992717
8,1,@AceofSpadesHQ My 8yo nephew is inconsolable. ...,664510
9,1,RT @paigetweedy: no offense… but like… how do ...,260471


In [4]:
df_test.head()

Unnamed: 0,message,tweetid
0,Europe will now be looking to China to make su...,169760
1,Combine this with the polling of staffers re c...,35326
2,"The scary, unimpeachable evidence that climate...",224985
3,@Karoli @morgfair @OsborneInk @dailykos \nPuti...,476263
4,RT @FakeWillMoore: 'Female orgasms cause globa...,872928


In [11]:
import string
import re

"""
Helper functions for cleaning the tweets.

remove_emoji() strips emojis from a tweet and remove_punctuation()
strips punctuation marks.

data_cleaner() combines both with a series of regular expressions
to remove noise and unwanted characters, and returns the cleaned
data frame.
"""
# function for removing emojis
def remove_emoji(string):
    emoji_pattern = re.compile("[" 
                u"\U0001F600-\U0001F64F"  # emoticons
                u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                u"\U0001F680-\U0001F6FF"  # transport & map symbols
                u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                u"\U00002702-\U000027B0"
                u"\U000024C2-\U0001F251"
                u"\U0001f926-\U0001f937"
                u'\U00010000-\U0010ffff'
                u"\u200d"
                u"\u2640-\u2642"
                u"\u2600-\u2B55"
                u"\u23cf"
                u"\u23e9"
                u"\u231a"
                u"\u3030"
                u"\ufe0f"
                           "]+", flags=re.UNICODE)
    return emoji_pattern.sub(r'', string)


# punctuation remover function
def remove_punctuation(tweets):
    return ''.join([l for l in tweets if l not in string.punctuation])



def data_cleaner(df, column):
    
    # replacing the t.co urls with a placeholder token
    pattern_url = r'http[s]?://t.co/[A-Za-z0-9]+'
    subs_url = r'url-web'
    df[column] = df[column].replace(to_replace = pattern_url, value = subs_url, regex = True)
    
    # removing the retweet prefixes (e.g. "RT @user:")
    pattern_url = r'RT\s\@[A-Za-z0-9_]+:'
    subs_url = r''
    df[column] = df[column].replace(to_replace = pattern_url, value = subs_url, regex = True)
    

    # removing the mentions
    pattern_url = r'@[A-Za-z0-9_]+'
    subs_url = r''
    df[column] = df[column].replace(to_replace = pattern_url, value = subs_url, regex = True)


    # removing the hashtags
    pattern_url = r'\#[A-Za-z0-9#?_]+'
    subs_url = r''
    df[column] = df[column].replace(to_replace = pattern_url, value = subs_url, regex = True)


    # removing the remaining https fragments
    # note: inside [...] the characters . * ? are literals, so this only
    # matches "https:" followed by one or more of those three characters
    pattern_url = r'https:[.*?]+'
    subs_url = r''
    df[column] = df[column].replace(to_replace = pattern_url, value = subs_url, regex = True)

    # turning all tweets to lower case
    df[column] = df[column].str.lower()
    
    # using apply method to remove the punctuation marks
    df[column] = df[column].apply(remove_punctuation)
    
    # Removing the emojis using the apply method
    df[column] = df[column].apply(remove_emoji)
    
    # removing the unknown (mis-encoded) characters from words
    pattern_url = r'\ã¢â‚¬â¦ | \ã¢â‚¬â„¢[a-z] | \… | \ã¢â‚¬â€œ | \ã¢å¾â¡ã¯â¸ï†\x8f'
    subs_url = r''
    df[column] = df[column].replace(to_replace = pattern_url, value = subs_url, regex = True)
    
    return df
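
To see what the main steps of `data_cleaner()` do to a single tweet, here is a minimal sketch that applies the same URL, retweet, mention and hashtag patterns, then lower-casing and punctuation removal, to one string. The helper name `clean_tweet` is hypothetical, and the tail of the sample tweet (hashtag and link) is made up because the original text is truncated in the preview.

```python
import re
import string

def clean_tweet(text):
    """Apply the notebook's main cleaning steps to a single string."""
    text = re.sub(r'http[s]?://t.co/[A-Za-z0-9]+', 'url-web', text)  # links -> placeholder
    text = re.sub(r'RT\s\@[A-Za-z0-9_]+:', '', text)                 # retweet prefix
    text = re.sub(r'@[A-Za-z0-9_]+', '', text)                       # mentions
    text = re.sub(r'\#[A-Za-z0-9#?_]+', '', text)                    # hashtags
    text = text.lower()
    # punctuation removal runs last, so the placeholder "url-web" becomes "urlweb"
    return ''.join(ch for ch in text if ch not in string.punctuation)

# partly hypothetical sample: the hashtag and link are invented for illustration
sample = ("RT @RawStory: Researchers say we have three years to act "
          "on climate change #climate https://t.co/abc123")
cleaned = clean_tweet(sample)
print(cleaned.split())
```

The order of the steps matters: because punctuation is stripped after the URL substitution, the token that reaches the models is `urlweb`, which is what appears in the cleaned output later in the notebook.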

In [12]:
df = data_cleaner(df, 'message')

In [14]:
df

Unnamed: 0,sentiment,message,tweetid
0,1,polyscimajor epa chief doesnt think carbon dio...,625221
1,1,its not like we lack evidence of anthropogenic...,126103
2,2,researchers say we have three years to act on...,698562
3,1,wired 2016 was a pivotal year in the war on ...,573736
4,1,its 2016 and a racist sexist climate change d...,466954
...,...,...,...
15814,1,they took down the material on global warming...,22001
15815,2,how climate change could be breaking up a 200...,17856
15816,0,notiven rt nytimesworld what does trump actual...,384248
15817,-1,hey liberals the climate change crap is a hoa...,819732


In [15]:
# importing tokenizing library
from nltk.tokenize import word_tokenize, TreebankWordTokenizer
from nltk.stem import WordNetLemmatizer
#importing stemmer library
from nltk import SnowballStemmer

"""

Preprocess the data by creating new columns, each with a different
text-normalisation technique applied (tokenising, lemmatising,
stemming). Helper functions also remove the stop words.



"""



# lemmatizing function
lemmatizer = WordNetLemmatizer()
def tweet_lemma(words, lemmatizer):
    return [lemmatizer.lemmatize(word) for word in words if word.isalpha()]    

#Stemmer function
stemmer = SnowballStemmer('english')
def token_stemmer(words, stemmer):
    return [stemmer.stem(word) for word in words]

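# build the stop-word set once; re-reading the list for every token is very slow
STOP_WORDS = set(stopwords.words('english'))
def remove_stop_words(tokens):    
    return [t for t in tokens if t not in STOP_WORDS]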

def token_lemmatizer_stemmer(df):
    
    #tokenise the tweets and create a column
    tokeniser = TreebankWordTokenizer()
    df['tokens'] = df['message'].apply(tokeniser.tokenize)
     
    #creating a lemma column   
    df['lemma'] = df['tokens'].apply(tweet_lemma, args=(lemmatizer, ))
    
    # find the stem of each word in the original tokens
    df['original_stem'] = df['tokens'].apply(token_stemmer, args=(stemmer, ))
    
    # find the stem of each word in the Lemma tokens
    df['lemma_stem'] = df['lemma'].apply(token_stemmer, args=(stemmer, ))
    
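    # removing the stop words from the stemmed lemmas
    df['lemma_no_stop_words'] = df['lemma_stem'].apply(remove_stop_words)
    
    # NOTE: this also reads 'lemma_stem', so the column duplicates
    # 'lemma_no_stop_words'; apply remove_stop_words to df['original_stem']
    # instead to get the stop-word-free original stems
    df['original_no_stop_words'] = df['lemma_stem'].apply(remove_stop_words)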
    
    return df
    

In [17]:
df = token_lemmatizer_stemmer(df)
df.head()

Unnamed: 0,sentiment,message,tweetid,tokens,lemma,original_stem,lemma_stem,lemma_no_stop_words,original_no_stop_words
0,1,polyscimajor epa chief doesnt think carbon dio...,625221,"[polyscimajor, epa, chief, doesnt, think, carb...","[polyscimajor, epa, chief, doesnt, think, carb...","[polyscimajor, epa, chief, doesnt, think, carb...","[polyscimajor, epa, chief, doesnt, think, carb...","[polyscimajor, epa, chief, doesnt, think, carb...","[polyscimajor, epa, chief, doesnt, think, carb..."
1,1,its not like we lack evidence of anthropogenic...,126103,"[its, not, like, we, lack, evidence, of, anthr...","[it, not, like, we, lack, evidence, of, anthro...","[it, not, like, we, lack, evid, of, anthropoge...","[it, not, like, we, lack, evid, of, anthropoge...","[like, lack, evid, anthropogen, global, warm]","[like, lack, evid, anthropogen, global, warm]"
2,2,researchers say we have three years to act on...,698562,"[researchers, say, we, have, three, years, to,...","[researcher, say, we, have, three, year, to, a...","[research, say, we, have, three, year, to, act...","[research, say, we, have, three, year, to, act...","[research, say, three, year, act, climat, chan...","[research, say, three, year, act, climat, chan..."
3,1,wired 2016 was a pivotal year in the war on ...,573736,"[wired, 2016, was, a, pivotal, year, in, the, ...","[wired, wa, a, pivotal, year, in, the, war, on...","[wire, 2016, was, a, pivot, year, in, the, war...","[wire, wa, a, pivot, year, in, the, war, on, c...","[wire, wa, pivot, year, war, climat, chang, ur...","[wire, wa, pivot, year, war, climat, chang, ur..."
4,1,its 2016 and a racist sexist climate change d...,466954,"[its, 2016, and, a, racist, sexist, climate, c...","[it, and, a, racist, sexist, climate, change, ...","[it, 2016, and, a, racist, sexist, climat, cha...","[it, and, a, racist, sexist, climat, chang, de...","[racist, sexist, climat, chang, deni, bigot, l...","[racist, sexist, climat, chang, deni, bigot, l..."



In [21]:
# Pick a class size of roughly half the size of the largest class
class_size = 5000

# Downsample classes with more than 5000 observations
pro_downsampled = resample(df[df['sentiment']==1],
                          replace=False, # sample without replacement (no need to duplicate observations)
                          n_samples=class_size, # match number in class_size
                          random_state=27) # reproducible results

# Upsample classes with less than 5000 observations
neutral_upsampled = resample(df[df['sentiment']==0],
                          replace=True, # sample with replacement (we need to duplicate observations)
                          n_samples=class_size, # match number in class_size
                          random_state=27) # reproducible results

# Upsample classes with less than 5000 observations
anti_upsampled = resample(df[df['sentiment']==-1],
                          replace=True, # sample with replacement (we need to duplicate observations)
                          n_samples=class_size, # match number in class_size
                          random_state=27) # reproducible results

# Upsample classes with less than 5000 observations
news_upsampled = resample(df[df['sentiment']==2],
                          replace=True, # sample with replacement (we need to duplicate observations)
                          n_samples=class_size, # match number in class_size
                          random_state=27) # reproducible results
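
The difference between down-sampling (without replacement) and up-sampling (with replacement) can be sketched with the standard library alone. This is only an illustration of the idea behind `resample`, not its implementation; the helper `resample_to_size` and the toy class sizes are hypothetical.

```python
import random
from collections import Counter

def resample_to_size(rows, n_samples, replace, seed):
    """Draw n_samples rows, with or without replacement."""
    rng = random.Random(seed)
    if replace:
        return rng.choices(rows, k=n_samples)   # rows may repeat (up-sampling)
    return rng.sample(rows, k=n_samples)        # every row at most once (down-sampling)

# hypothetical imbalanced toy data: (label, tweet_id) pairs
data = ([(1, i) for i in range(80)]
        + [(0, i) for i in range(80, 100)]
        + [(-1, i) for i in range(100, 110)])

class_size = 40
balanced = []
for label in (1, 0, -1):
    rows = [row for row in data if row[0] == label]
    # up-sample small classes, down-sample large ones
    balanced += resample_to_size(rows, class_size,
                                 replace=len(rows) < class_size, seed=27)

counts = Counter(label for label, _ in balanced)
print(counts)
```

After resampling every class has the same size, but the up-sampled classes necessarily contain repeated tweets.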





In [22]:
# Combine the four resampled classes
sampled = pd.concat([pro_downsampled, neutral_upsampled, anti_upsampled, news_upsampled])

# Check new class counts
sampled['sentiment'].value_counts()


 1    5000
 0    5000
-1    5000
 2    5000
Name: sentiment, dtype: int64

In [28]:
sampled.head()

Unnamed: 0,sentiment,message,tweetid,tokens,lemma,original_stem,lemma_stem,lemma_no_stop_words,original_no_stop_words,token_no_stop_word,stem_no_stop_word
11729,1,funding from will support s team as they add...,977844,"[funding, from, will, support, s, team, as, th...","[funding, from, will, support, s, team, a, the...","[fund, from, will, support, s, team, as, they,...","[fund, from, will, support, s, team, a, they, ...","[fund, support, team, address, impact, climat,...","[fund, support, team, address, impact, climat,...","[funding, support, team, address, impact, clim...","[fund, support, team, address, impact, climat,..."
8308,1,gag orders sure hes definitely green and does...,441956,"[gag, orders, sure, hes, definitely, green, an...","[gag, order, sure, he, definitely, green, and,...","[gag, order, sure, hes, definit, green, and, d...","[gag, order, sure, he, definit, green, and, do...","[gag, order, sure, definit, green, doesnt, thi...","[gag, order, sure, definit, green, doesnt, thi...","[gag, orders, sure, hes, definitely, green, do...","[gag, order, sure, hes, definit, green, doesnt..."
7159,1,not ominous at all he also wants the names of...,978938,"[not, ominous, at, all, he, also, wants, the, ...","[not, ominous, at, all, he, also, want, the, n...","[not, omin, at, all, he, also, want, the, name...","[not, omin, at, all, he, also, want, the, name...","[omin, also, want, name, anyon, work, climat, ...","[omin, also, want, name, anyon, work, climat, ...","[ominous, also, wants, names, anyone, working,...","[omin, also, want, name, anyon, work, climat, ..."
5644,1,in case you forgot about that chinese hoax gl...,587737,"[in, case, you, forgot, about, that, chinese, ...","[in, case, you, forgot, about, that, chinese, ...","[in, case, you, forgot, about, that, chines, h...","[in, case, you, forgot, about, that, chines, h...","[case, forgot, chines, hoax, global, warm, url...","[case, forgot, chines, hoax, global, warm, url...","[case, forgot, chinese, hoax, global, warming,...","[case, forgot, chines, hoax, global, warm, url..."
6732,1,hrc proposes installing half a billion solar ...,804767,"[hrc, proposes, installing, half, a, billion, ...","[hrc, proposes, installing, half, a, billion, ...","[hrc, propos, instal, half, a, billion, solar,...","[hrc, propos, instal, half, a, billion, solar,...","[hrc, propos, instal, half, billion, solar, pa...","[hrc, propos, instal, half, billion, solar, pa...","[hrc, proposes, installing, half, billion, sol...","[hrc, propos, instal, half, billion, solar, pa..."


In [26]:
# Remove stop words first
sampled['token_no_stop_word'] = sampled['tokens'].apply(remove_stop_words)

In [27]:
# Then stem the remaining tokens
sampled['stem_no_stop_word'] = sampled['token_no_stop_word'].apply(token_stemmer, args=(stemmer, ))

In [31]:
clean_sentences = [" ".join(i) for i in sampled['stem_no_stop_word']]
sampled['clean_sentences'] = clean_sentences

In [32]:
sampled.head()

Unnamed: 0,sentiment,message,tweetid,tokens,lemma,original_stem,lemma_stem,lemma_no_stop_words,original_no_stop_words,token_no_stop_word,stem_no_stop_word,clean_sentences
11729,1,funding from will support s team as they add...,977844,"[funding, from, will, support, s, team, as, th...","[funding, from, will, support, s, team, a, the...","[fund, from, will, support, s, team, as, they,...","[fund, from, will, support, s, team, a, they, ...","[fund, support, team, address, impact, climat,...","[fund, support, team, address, impact, climat,...","[funding, support, team, address, impact, clim...","[fund, support, team, address, impact, climat,...",fund support team address impact climat chang ...
8308,1,gag orders sure hes definitely green and does...,441956,"[gag, orders, sure, hes, definitely, green, an...","[gag, order, sure, he, definitely, green, and,...","[gag, order, sure, hes, definit, green, and, d...","[gag, order, sure, he, definit, green, and, do...","[gag, order, sure, definit, green, doesnt, thi...","[gag, order, sure, definit, green, doesnt, thi...","[gag, orders, sure, hes, definitely, green, do...","[gag, order, sure, hes, definit, green, doesnt...",gag order sure hes definit green doesnt think ...
7159,1,not ominous at all he also wants the names of...,978938,"[not, ominous, at, all, he, also, wants, the, ...","[not, ominous, at, all, he, also, want, the, n...","[not, omin, at, all, he, also, want, the, name...","[not, omin, at, all, he, also, want, the, name...","[omin, also, want, name, anyon, work, climat, ...","[omin, also, want, name, anyon, work, climat, ...","[ominous, also, wants, names, anyone, working,...","[omin, also, want, name, anyon, work, climat, ...",omin also want name anyon work climat chang re...
5644,1,in case you forgot about that chinese hoax gl...,587737,"[in, case, you, forgot, about, that, chinese, ...","[in, case, you, forgot, about, that, chinese, ...","[in, case, you, forgot, about, that, chines, h...","[in, case, you, forgot, about, that, chines, h...","[case, forgot, chines, hoax, global, warm, url...","[case, forgot, chines, hoax, global, warm, url...","[case, forgot, chinese, hoax, global, warming,...","[case, forgot, chines, hoax, global, warm, url...",case forgot chines hoax global warm urlweb
6732,1,hrc proposes installing half a billion solar ...,804767,"[hrc, proposes, installing, half, a, billion, ...","[hrc, proposes, installing, half, a, billion, ...","[hrc, propos, instal, half, a, billion, solar,...","[hrc, propos, instal, half, a, billion, solar,...","[hrc, propos, instal, half, billion, solar, pa...","[hrc, propos, instal, half, billion, solar, pa...","[hrc, proposes, installing, half, billion, sol...","[hrc, propos, instal, half, billion, solar, pa...",hrc propos instal half billion solar panel end...


In [35]:
# importing necessary libraries
from sklearn import datasets
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
 
# X -> features, y -> label
y =  sampled['sentiment']
X =  sampled['clean_sentences']

# dividing X, y into train and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                random_state = 42)

# extracting features
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer()
V_train_X = vectorizer.fit_transform(X_train)
V_test_X = vectorizer.transform(X_test)
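
One thing worth checking with this setup: the classes were up-sampled with replacement before `train_test_split`, so copies of the same tweet can land on both sides of the split, which may make the test scores look better than they would on unseen tweets. The toy sketch below illustrates the effect under simplifying assumptions: the ids are hypothetical, and each row is simply repeated five times rather than drawn at random.

```python
import random

def split(rows, test_size, seed):
    """Shuffle a copy of rows and cut off the first test_size items as the test set."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    return rows[test_size:], rows[:test_size]

unique_ids = list(range(6))   # hypothetical minority-class tweet ids

# (a) up-sample first, then split: duplicates can straddle the split
upsampled = [i for i in unique_ids for _ in range(5)]
train_a, test_a = split(upsampled, test_size=8, seed=42)
leaked_a = set(train_a) & set(test_a)

# (b) split first, then up-sample only the training part
train_b, test_b = split(unique_ids, test_size=2, seed=42)
train_b = [i for i in train_b for _ in range(5)]
leaked_b = set(train_b) & set(test_b)

print(len(leaked_a), len(leaked_b))
```

In variant (b) no test tweet has a copy in the training set, so its scores are a more cautious estimate of performance on new data.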



In [37]:

# training a DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier
dtree_model = DecisionTreeClassifier(max_depth = 2).fit(V_train_X, y_train)
dtree_predictions = dtree_model.predict(V_test_X)
 
# creating a confusion matrix
cm1 = confusion_matrix(y_test, dtree_predictions)
print('Classification Report Decision Tree')
print(classification_report(y_test, dtree_predictions))


Classification Report Decision Tree
              precision    recall  f1-score   support

          -1       0.39      0.17      0.24      1256
           0       0.47      0.36      0.41      1280
           1       0.36      0.52      0.43      1244
           2       0.52      0.70      0.60      1220

    accuracy                           0.44      5000
   macro avg       0.43      0.44      0.42      5000
weighted avg       0.43      0.44      0.41      5000
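
The f1-score column is the harmonic mean of precision and recall, and the macro average is the plain mean over the four classes. A small check, using the rounded precision and recall values from the report above (so tiny rounding differences are possible in general):

```python
# (precision, recall) per class, copied from the decision tree report
report = {-1: (0.39, 0.17), 0: (0.47, 0.36), 1: (0.36, 0.52), 2: (0.52, 0.70)}

def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

f1 = {label: f1_from(p, r) for label, (p, r) in report.items()}
macro_f1 = sum(f1.values()) / len(f1)   # unweighted mean over classes

for label, value in f1.items():
    print(label, round(value, 2))
print("macro avg", round(macro_f1, 2))
```

The low f1 for class -1 comes mostly from its recall of 0.17: a tree of depth 2 has at most four leaves, which leaves little room to separate four classes.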



In [39]:

# training a linear SVM classifier
from sklearn.svm import SVC
svm_model_linear = SVC(kernel = 'linear', C = 1).fit(V_train_X, y_train)
svm_predictions = svm_model_linear.predict(V_test_X)
 
# model accuracy for V_test_X 
accuracy = svm_model_linear.score(V_test_X, y_test)
 
# creating a confusion matrix
cm2 = confusion_matrix(y_test, svm_predictions)

print('Classification Report SVM')
print(classification_report(y_test, svm_predictions))

Classification Report SVM
              precision    recall  f1-score   support

          -1       0.86      0.91      0.89      1256
           0       0.80      0.81      0.80      1280
           1       0.73      0.65      0.69      1244
           2       0.83      0.87      0.85      1220

    accuracy                           0.81      5000
   macro avg       0.81      0.81      0.81      5000
weighted avg       0.81      0.81      0.81      5000



In [44]:
cm2

array([[1149,   55,   42,   10],
       [  66, 1035,  135,   44],
       [ 102,  180,  803,  159],
       [  14,   26,  116, 1064]], dtype=int64)
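
The classification report can be recovered from this matrix. In scikit-learn's convention rows are the true classes and columns the predicted ones, so recall divides the diagonal by the row sum and precision divides it by the column sum. A sketch with the values copied from `cm2`:

```python
labels = [-1, 0, 1, 2]
cm = [
    [1149,   55,   42,   10],
    [  66, 1035,  135,   44],
    [ 102,  180,  803,  159],
    [  14,   26,  116, 1064],
]

total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(len(labels))) / total

precision, recall = {}, {}
for i, label in enumerate(labels):
    row_sum = sum(cm[i])                  # tweets whose true class is `label`
    col_sum = sum(row[i] for row in cm)   # tweets predicted as `label`
    recall[label] = cm[i][i] / row_sum
    precision[label] = cm[i][i] / col_sum

for label in labels:
    print(label, round(precision[label], 2), round(recall[label], 2))
print("accuracy", round(accuracy, 2))
```

The third row shows where the SVM struggles: 180 tweets of class 1 are predicted as class 0 and 159 as class 2, which explains the lower recall for that class.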

In [40]:

# training a KNN classifier
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 7).fit(V_train_X, y_train)
 
# accuracy on V_test_X
accuracy = knn.score(V_test_X, y_test)
print(accuracy)
 
# creating a confusion matrix
knn_predictions = knn.predict(V_test_X)
cm3 = confusion_matrix(y_test, knn_predictions)

print('Classification Report KNN')
print(classification_report(y_test, knn_predictions))


0.6552
Classification Report KNN
              precision    recall  f1-score   support

          -1       0.67      0.80      0.73      1256
           0       0.59      0.66      0.62      1280
           1       0.65      0.39      0.48      1244
           2       0.72      0.77      0.74      1220

    accuracy                           0.66      5000
   macro avg       0.66      0.66      0.64      5000
weighted avg       0.65      0.66      0.64      5000
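
For intuition, k-nearest neighbours predicts the majority label among the closest training vectors. The sketch below uses Euclidean distance on hypothetical two-dimensional unit vectors and `k=3` (the notebook uses 7 neighbours on the TF-IDF matrix); `knn_predict` is an illustrative helper, not scikit-learn's implementation.

```python
import math
from collections import Counter

def knn_predict(train_vecs, train_labels, query, k):
    """Return the majority label among the k training vectors closest to query."""
    nearest = sorted(
        (math.dist(vec, query), label)
        for vec, label in zip(train_vecs, train_labels)
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# hypothetical unit-length vectors standing in for TF-IDF rows
train_vecs = [(1.0, 0.0), (0.8, 0.6), (0.0, 1.0), (0.6, 0.8)]
train_labels = [1, 1, -1, -1]
query = (0.96, 0.28)

prediction = knn_predict(train_vecs, train_labels, query, k=3)
print(prediction)

# for unit vectors, squared Euclidean distance equals 2 - 2 * cosine similarity
cosine = sum(a * b for a, b in zip(train_vecs[1], query))
squared_distance = math.dist(train_vecs[1], query) ** 2
print(round(squared_distance, 3), round(2 - 2 * cosine, 3))
```

Since `TfidfVectorizer` L2-normalises each row by default, ranking neighbours by Euclidean distance gives the same order as ranking them by cosine similarity.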



In [45]:

# training a Gaussian Naive Bayes classifier (needs dense arrays, hence .toarray())
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB().fit(V_train_X.toarray(), y_train)
gnb_predictions = gnb.predict(V_test_X.toarray())
 
# accuracy on V_test_X
accuracy = gnb.score(V_test_X.toarray(), y_test)
print(accuracy)
 
# creating a confusion matrix
cm4 = confusion_matrix(y_test, gnb_predictions)


print('Classification Report Naive Bayes')
print(classification_report(y_test, gnb_predictions))


0.6662
Classification Report Naive Bayes
              precision    recall  f1-score   support

          -1       0.57      0.98      0.72      1256
           0       0.80      0.64      0.71      1280
           1       0.75      0.29      0.42      1244
           2       0.70      0.75      0.72      1220

    accuracy                           0.67      5000
   macro avg       0.70      0.67      0.64      5000
weighted avg       0.70      0.67      0.64      5000



In [47]:

# training a logistic regression classifier (one-vs-rest)
from sklearn.linear_model import LogisticRegression
lm_vt = LogisticRegression(multi_class='ovr').fit(V_train_X, y_train)
pred_lr_vt = lm_vt.predict(V_test_X)



In [None]:
# accuracy on V_test_X
accuracy = lm_vt.score(V_test_X, y_test)
print(accuracy)

In [49]:
print('Classification Report Logistic Regression')
print(classification_report(y_test, pred_lr_vt))

Classification Report Logistic Regression
              precision    recall  f1-score   support

          -1       0.84      0.88      0.86      1256
           0       0.77      0.77      0.77      1280
           1       0.72      0.61      0.66      1244
           2       0.79      0.88      0.83      1220

    accuracy                           0.78      5000
   macro avg       0.78      0.78      0.78      5000
weighted avg       0.78      0.78      0.78      5000
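
With a one-vs-rest scheme, one binary model is fitted per class and the class with the highest decision score wins. A minimal sketch of that prediction step, assuming made-up weights over three hypothetical TF-IDF features (`hoax`, `scienc`, `report`); none of these numbers come from the fitted model.

```python
def ovr_predict(weights, biases, x):
    """Score x with one linear model per class and return the best-scoring label."""
    scores = {
        label: sum(w_i * x_i for w_i, x_i in zip(w, x)) + biases[label]
        for label, w in weights.items()
    }
    return max(scores, key=scores.get), scores

# hypothetical per-class weights for the features ["hoax", "scienc", "report"]
weights = {
    -1: [2.0, -1.0, -0.5],
     0: [0.0,  0.0,  0.0],
     1: [-1.5, 2.0,  0.0],
     2: [-0.5, 0.5,  2.5],
}
biases = {-1: -0.5, 0: -0.2, 1: -0.3, 2: -0.6}

x = [0.8, 0.0, 0.6]   # a tweet that mentions "hoax" and "report"
label, scores = ovr_predict(weights, biases, x)
print(label, scores)
```

Here the "hoax" weight pushes the tweet towards class -1 even though "report" also raises the score for class 2.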



## Making a Submission

In [55]:
# training the model on the entire resampled data
yf = sampled['sentiment']
Xf = sampled['clean_sentences']

In [58]:
clean_test = data_cleaner(df_test, 'message')


In [61]:
clean_test

Unnamed: 0,message,tweetid
0,europe will now be looking to china to make su...,169760
1,combine this with the polling of staffers re c...,35326
2,the scary unimpeachable evidence that climate ...,224985
3,\nputin got to you too jill \ntrump doesn...,476263
4,female orgasms cause global warming\nsarcasti...,872928
...,...,...
10541,brb writing a poem about climate change ...,895714
10542,2016 the year climate change came home during ...,875167
10543,pacific countries positive about fiji leading...,78329
10544,you’re so hot you must be the cause for globa...,867455


In [62]:
#tokenise the tweets and create a column
tokeniser = TreebankWordTokenizer()
clean_test['tokens'] = clean_test['message'].apply(tokeniser.tokenize)

# Remove stop words first
clean_test['token_no_stop_word'] = clean_test['tokens'].apply(remove_stop_words)

# Then stem the remaining tokens
clean_test['stem_no_stop_word'] = clean_test['token_no_stop_word'].apply(token_stemmer, args=(stemmer, ))

clean_sentences = [" ".join(i) for i in clean_test['stem_no_stop_word']]
clean_test['clean_sentences'] = clean_sentences


In [65]:
clean_test

Unnamed: 0,message,tweetid,tokens,token_no_stop_word,stem_no_stop_word,clean_sentences
0,europe will now be looking to china to make su...,169760,"[europe, will, now, be, looking, to, china, to...","[europe, looking, china, make, sure, alone, fi...","[europ, look, china, make, sure, alon, fight, ...",europ look china make sure alon fight climat c...
1,combine this with the polling of staffers re c...,35326,"[combine, this, with, the, polling, of, staffe...","[combine, polling, staffers, climate, change, ...","[combin, poll, staffer, climat, chang, women, ...",combin poll staffer climat chang women right f...
2,the scary unimpeachable evidence that climate ...,224985,"[the, scary, unimpeachable, evidence, that, cl...","[scary, unimpeachable, evidence, climate, chan...","[scari, unimpeach, evid, climat, chang, alread...",scari unimpeach evid climat chang alreadi urlweb
3,\nputin got to you too jill \ntrump doesn...,476263,"[putin, got, to, you, too, jill, trump, doesnt...","[putin, got, jill, trump, doesnt, believe, cli...","[putin, got, jill, trump, doesnt, believ, clim...",putin got jill trump doesnt believ climat chan...
4,female orgasms cause global warming\nsarcasti...,872928,"[female, orgasms, cause, global, warming, sarc...","[female, orgasms, cause, global, warming, sarc...","[femal, orgasm, caus, global, warm, sarcast, r...",femal orgasm caus global warm sarcast republican
...,...,...,...,...,...,...
10541,brb writing a poem about climate change ...,895714,"[brb, writing, a, poem, about, climate, change...","[brb, writing, poem, climate, change, urlweb…]","[brb, write, poem, climat, chang, urlweb…]",brb write poem climat chang urlweb…
10542,2016 the year climate change came home during ...,875167,"[2016, the, year, climate, change, came, home,...","[2016, year, climate, change, came, home, hott...","[2016, year, climat, chang, came, home, hottes...",2016 year climat chang came home hottest year ...
10543,pacific countries positive about fiji leading...,78329,"[pacific, countries, positive, about, fiji, le...","[pacific, countries, positive, fiji, leading, ...","[pacif, countri, posit, fiji, lead, global, cl...",pacif countri posit fiji lead global climat ch...
10544,you’re so hot you must be the cause for globa...,867455,"[you’re, so, hot, you, must, be, the, cause, f...","[you’re, hot, must, cause, global, warming]","[you'r, hot, must, caus, global, warm]",you'r hot must caus global warm
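
The last rows still contain `…` and `’`. That is expected: `string.punctuation` only lists ASCII punctuation, so typographic characters pass through `remove_punctuation()`. The sketch below contrasts the ASCII approach with one possible alternative based on Unicode categories; `strip_unicode_punctuation` is a hypothetical helper, not part of the notebook.

```python
import string
import unicodedata

ASCII_TABLE = str.maketrans('', '', string.punctuation)

def strip_ascii_punctuation(text):
    """Remove only the characters listed in string.punctuation."""
    return text.translate(ASCII_TABLE)

def strip_unicode_punctuation(text):
    """Remove every character whose Unicode category starts with 'P' (punctuation).

    Symbols such as $ or + belong to the 'S' categories and are kept.
    """
    return ''.join(ch for ch in text
                   if not unicodedata.category(ch).startswith('P'))

sample = "you’re urlweb…"
print(strip_ascii_punctuation(sample))
print(strip_unicode_punctuation(sample))
```

Whether the extra cleaning helps the classifiers is an empirical question; it mainly reduces near-duplicate tokens such as `urlweb` and `urlweb…`.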


In [66]:
X_testf = clean_test['clean_sentences']

In [71]:
# extracting features
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer1 = TfidfVectorizer()
V_train_Xf = vectorizer1.fit_transform(Xf)
V_test_Xf = vectorizer1.transform(X_testf)

In [74]:
V_test_Xf.shape

(10546, 10417)
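
The shape reads as 10546 test tweets by 10417 vocabulary terms. The vocabulary is learned from the training text only, so words that appear only in the test set get no column. The sketch below re-implements the idea with the standard library, following scikit-learn's documented defaults (smoothed idf and L2 normalisation) but with plain whitespace tokenisation as a simplification; the function names and toy sentences are hypothetical.

```python
import math
from collections import Counter

def fit_tfidf(docs):
    """Learn the vocabulary and smoothed idf weights from the training docs."""
    vocab = sorted({word for doc in docs for word in doc.split()})
    n_docs = len(docs)
    doc_freq = Counter(word for doc in docs for word in set(doc.split()))
    idf = {word: math.log((1 + n_docs) / (1 + doc_freq[word])) + 1 for word in vocab}
    return vocab, idf

def transform_tfidf(docs, vocab, idf):
    """Turn docs into L2-normalised tf-idf rows; unseen words are ignored."""
    rows = []
    for doc in docs:
        term_freq = Counter(word for word in doc.split() if word in idf)
        row = [term_freq[word] * idf[word] for word in vocab]
        norm = math.sqrt(sum(value * value for value in row))
        rows.append([value / norm if norm else 0.0 for value in row])
    return rows

train_docs = ["climat chang real", "climat chang hoax", "global warm urlweb"]
test_docs = ["climat hoax unseenword"]

vocab, idf = fit_tfidf(train_docs)
X_test_toy = transform_tfidf(test_docs, vocab, idf)
print(len(X_test_toy), len(vocab))
```

The rarer word `hoax` receives a larger weight than `climat`, which appears in two of the three training sentences, and `unseenword` contributes nothing.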

In [75]:
# training a linear SVM classifier
from sklearn.svm import SVC
svm_model_linear_f = SVC(kernel = 'linear', C = 1).fit(V_train_Xf,yf)


In [76]:
final_preds = svm_model_linear_f.predict(V_test_Xf)


In [77]:
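# put the predictions in a data frame
daf = pd.DataFrame(final_preds, columns=['sentiment'])
daf.head()

# pair each prediction with its tweetid (both use the default 0..n-1 index) and save
output = pd.DataFrame({"tweetid":df_test['tweetid']})
final = output.join(daf)
final.to_csv("final.csv", index=False)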
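
For reference, the submission file is simply one `tweetid,sentiment` pair per line. The sketch below writes that layout with the standard library into an in-memory buffer; the three ids are taken from the test preview, while the predicted labels are invented for illustration.

```python
import csv
import io

tweet_ids = [169760, 35326, 224985]   # first three ids from the test preview
predictions = [1, 1, 2]               # hypothetical labels, for illustration only

# one prediction per tweet, in the same order
assert len(tweet_ids) == len(predictions)

buffer = io.StringIO()   # stands in for open("final.csv", "w", newline="")
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(["tweetid", "sentiment"])
writer.writerows(zip(tweet_ids, predictions))

csv_text = buffer.getvalue()
print(csv_text)
```

Checking that the number of predictions matches the number of test tweets before writing is a cheap guard against misaligned rows.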