# Introduction

## Task at hand

### Many people try to reduce their environmental impact or carbon footprint. Companies, in turn, offer environmentally friendly and sustainable products and services and make sustainability part of their ideals and values. These companies are therefore highly interested in knowing whether people at large believe climate change is a real threat. They can use this information in their market research to gauge how their products or services are likely to be received.

### With that in mind, EDSA set a challenge in which I am required to create a machine learning model that classifies whether an individual believes climate change is real, based on their previous tweet(s).

### The results will also help companies shape future marketing strategy by giving them more insight into how their services are perceived.

# 1. Data Preprocessing

### Importing libraries

In [1]:
import numpy as np
import pandas as pd
import re
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import TweetTokenizer
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score, f1_score

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Admin\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


#### Train data set

In [2]:
df = pd.read_csv('train.csv')

#### Test data set

In [3]:
df_test = pd.read_csv('test.csv')

In [4]:
df.head()

Unnamed: 0,sentiment,message,tweetid
0,1,PolySciMajor EPA chief doesn't think carbon di...,625221
1,1,It's not like we lack evidence of anthropogeni...,126103
2,2,RT @RawStory: Researchers say we have three ye...,698562
3,1,#TodayinMaker# WIRED : 2016 was a pivotal year...,573736
4,1,"RT @SoyNovioDeTodas: It's 2016, and a racist, ...",466954


In [5]:
df_test.head()

Unnamed: 0,message,tweetid
0,Europe will now be looking to China to make su...,169760
1,Combine this with the polling of staffers re c...,35326
2,"The scary, unimpeachable evidence that climate...",224985
3,@Karoli @morgfair @OsborneInk @dailykos \nPuti...,476263
4,RT @FakeWillMoore: 'Female orgasms cause globa...,872928


In [6]:
df['message'].iloc[0]

"PolySciMajor EPA chief doesn't think carbon dioxide is main cause of global warming and.. wait, what!? https://t.co/yeLvcEFXkC via @mashable"

In [7]:
df['message'].iloc[1]

"It's not like we lack evidence of anthropogenic global warming"

### Taking care of null values - Train set

In [8]:
len(df)

15819

In [9]:
df['sentiment'].value_counts()

 1    8530
 2    3640
 0    2353
-1    1296
Name: sentiment, dtype: int64

##### Checking if an entry is null.

In [10]:
df.isnull().sum()

sentiment    0
message      0
tweetid      0
dtype: int64

##### Checking if a string in an entry is empty.

In [11]:
blank = []
for i, sentiment, message, tweetid in df.itertuples():
    # record the index of any message that is a string made up only of whitespace
    if isinstance(message, str):
        if message.isspace():
            blank.append(i)

In [12]:
blank

[]
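
The same check can also be expressed as a vectorised pandas operation. The sketch below runs on a small made-up frame (`toy` and its rows are assumptions for illustration, not the competition data) and flags messages that are empty or contain only whitespace.

```python
import pandas as pd

# Toy frame standing in for df (hypothetical data, for illustration only)
toy = pd.DataFrame({
    'sentiment': [1, 0, 2],
    'message': ['climate change is real', '   ', 'RT @user: some news'],
    'tweetid': [1, 2, 3],
})

# True where the message is empty or only whitespace once stripped
is_blank = toy['message'].str.strip().eq('')

# Index labels of the blank rows, comparable to the `blank` list built by the loop
blank_idx = toy.index[is_blank].tolist()
print(blank_idx)
```

On the real data the same two lines would be applied to `df['message']`, assuming every entry is a string.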

### Taking care of null values - Test set

##### Checking if an entry is null.

In [13]:
df_test.isnull().sum()

message    0
tweetid    0
dtype: int64

##### Checking if a string in an entry is empty.

In [14]:
blank_t = []
for i, message, tweetid in df_test.itertuples():
    # record the index of any message that is a string made up only of whitespace
    if isinstance(message, str):
        if message.isspace():
            blank_t.append(i)

In [15]:
blank_t

[]

### Preprocessing the text

#### To clean up the text, I will use a combination of regular expressions to remove unwanted features, TweetTokenizer to tokenize the text, and lemmatization, all in a single function.

#### Creating relevant preprocessing instances.

In [17]:
lem = WordNetLemmatizer()
token = TweetTokenizer()

In [18]:
def cleaning_text(Data):
    tweet_list = []

    for tweet in Data:
    
        # converting text to lowercase
        doc = tweet.lower()
    
        # remove all punctuation and special characters from a tweet
        doc = re.sub(r'\W', ' ', doc)
    
        # remove all numbers
    
        doc = re.sub(r'\d', ' ', doc)

        # remove all single characters left over after special characters have been removed
        doc = re.sub(r'\s+[a-zA-Z]\s+', ' ', doc)
    
        # remove a single character at the start of the tweet
        # (^ is left unescaped so that it anchors to the start of the string)
        doc = re.sub(r'^[a-zA-Z]\s+', ' ', doc)
    
        # substituting multiple spaces with a single space
        doc = re.sub(r'\s+', ' ', doc)
    
        # Tokenizing and lemmatization
    
        doc = [lem.lemmatize(word) for word in token.tokenize(doc)]
    
        # joining to get the tokens back into a string
        doc = ' '.join(doc)
    
        # appending to list
    
        tweet_list.append(doc)
    
    
    return tweet_list
    
    

In [19]:
train_data = cleaning_text(df['message'])

In [20]:
train_data[:5]

['polyscimajor epa chief doesn think carbon dioxide is main cause of global warming and wait what http co yelvcefxkc via mashable',
 'it not like we lack evidence of anthropogenic global warming',
 'rt rawstory researcher say we have three year to act on climate change before it too late http co wdt kdur http co anpt',
 'todayinmaker wired wa pivotal year in the war on climate change http co wotxtlcd',
 'rt soynoviodetodas it and racist sexist climate change denying bigot is leading in the poll electionnight']
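
To make the effect of the regular expressions easier to follow, the sketch below applies the main regex steps of `cleaning_text` to single tweets. The helper `clean_one` is a hypothetical name introduced here; it leaves out the start-of-string step, the tokenizer and the lemmatizer, and adds a final `strip()`, so it only approximates the function above.

```python
import re

def clean_one(tweet):
    """Apply the main regex steps of cleaning_text to one tweet (no lemmatization)."""
    doc = tweet.lower()
    doc = re.sub(r'\W', ' ', doc)               # punctuation -> spaces ("it's" -> "it s")
    doc = re.sub(r'\d', ' ', doc)               # digits -> spaces
    doc = re.sub(r'\s+[a-zA-Z]\s+', ' ', doc)   # drop isolated single letters
    doc = re.sub(r'\s+', ' ', doc)              # collapse runs of whitespace
    return doc.strip()

cleaned = clean_one("It's not like we lack evidence of anthropogenic global warming")
print(cleaned)

# URLs are not removed; they are split into word fragments
cleaned_url = clean_one("wait, what!? https://t.co/yeLvcEFXkC")
print(cleaned_url)
```

The second call shows that links survive as fragments such as `https co yelvcefxkc`, which is consistent with the `http co ...` tokens visible in the cleaned output above.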

In [21]:
df['message_clean'] = train_data

In [22]:
df

Unnamed: 0,sentiment,message,tweetid,message_clean
0,1,PolySciMajor EPA chief doesn't think carbon di...,625221,polyscimajor epa chief doesn think carbon diox...
1,1,It's not like we lack evidence of anthropogeni...,126103,it not like we lack evidence of anthropogenic ...
2,2,RT @RawStory: Researchers say we have three ye...,698562,rt rawstory researcher say we have three year ...
3,1,#TodayinMaker# WIRED : 2016 was a pivotal year...,573736,todayinmaker wired wa pivotal year in the war ...
4,1,"RT @SoyNovioDeTodas: It's 2016, and a racist, ...",466954,rt soynoviodetodas it and racist sexist climat...
...,...,...,...,...
15814,1,RT @ezlusztig: They took down the material on ...,22001,rt ezlusztig they took down the material on gl...
15815,2,RT @washingtonpost: How climate change could b...,17856,rt washingtonpost how climate change could be ...
15816,0,notiven: RT: nytimesworld :What does Trump act...,384248,notiven rt nytimesworld what doe trump actuall...
15817,-1,RT @sara8smiles: Hey liberals the climate chan...,819732,rt sara smile hey liberal the climate change c...


# 2. Splitting train data into features and labels

In [23]:
x = df['message_clean']
y = df['sentiment']

# 3.1 -  Cross validation of our train data set

### Using the train data set, I perform validation on a single hold-out split (called the cross validation train test split below) while building a classification model via a pipeline object and training it with the data. The built model will then be used to predict the results of the full test data set in section 4.

## Splitting the train data into train test split

In [24]:
X_train, X_test, y_train, y_test = train_test_split(x,y, test_size = 0.3, random_state = 42)
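
The split above holds out a single 30% validation set. If a k-fold estimate is wanted instead, a minimal sketch using scikit-learn's `StratifiedKFold` and `cross_val_score` could look like the following. The toy texts, the toy labels and the default `LinearSVC()` settings are assumptions for illustration, not the notebook's data or tuned model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data: six tweets per class
toy_texts = [
    'climate change is real', 'global warming is happening now',
    'we must act on climate change', 'scientists agree on warming',
    'the planet is heating up', 'carbon emissions drive warming',
    'climate change is a hoax', 'global warming is fake news',
    'there is no warming at all', 'the climate scam continues',
    'warming is a liberal myth', 'nobody believes the climate lie',
]
toy_labels = [1] * 6 + [-1] * 6

# Vectorizer and classifier are refit inside every fold, so no text leaks across folds
pipe = Pipeline([('tfidf', TfidfVectorizer()), ('clf', LinearSVC())])

# Stratification keeps the class proportions similar in each fold
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)
scores = cross_val_score(pipe, toy_texts, toy_labels, cv=cv, scoring='f1_macro')
print(scores, scores.mean())
```

Averaging the per-fold macro F1 scores gives an estimate that depends less on one particular split, at the cost of fitting the model several times.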

## Building pipeline object to vectorize the cross validation train test split and train the model.

### I have built 7 classification models with relevant hyper-parameter tuning. 

### To speed up the process of obtaining the macro f1_score of all the models, I built a function that returns a pipeline object, which vectorizes the text and trains the model in one step. Furthermore, I looped through all the models to train them, predict the labels of the test set of the train test split and obtain their f1_score.

##### Importing classification models: 

In [25]:
from sklearn.svm import LinearSVC, SVC
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

#### The Pipeline object is built inside a function called `pipeline_object`

In [26]:
def pipeline_object(model):

    text_clf = Pipeline([('tfid_vectorizer', TfidfVectorizer(ngram_range = (1,2), stop_words = stopwords.words('english'))), 
                     ('L_SVC', model)])
    
    return text_clf
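
As a small illustration of what `ngram_range = (1,2)` does in the vectorizer step, the sketch below fits a `TfidfVectorizer` on one made-up sentence and lists the learned features. The sentence is an assumption for illustration, and no stop word list is passed here, unlike in the pipeline above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Unigrams and bigrams, as in the pipeline, but without a stop word list
vec = TfidfVectorizer(ngram_range=(1, 2))
matrix = vec.fit_transform(['climate change is real'])

# vocabulary_ maps each learned n-gram to its column index
features = sorted(vec.vocabulary_)
print(features)
```

Each single word and each pair of adjacent words becomes its own column, so phrases such as `climate change` are available to the classifier as one feature.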

#### Building/creating models 

In [27]:
LinearSVC_model = pipeline_object(LinearSVC(multi_class = 'crammer_singer', C = 2.1, loss = 'hinge'))
SVC_model = pipeline_object(SVC(kernel = 'linear', C = 0.1))
Random_forest_model = pipeline_object(RandomForestClassifier(max_depth = 5, n_estimators = 10, max_features = 1))
Ada_Boost_model = pipeline_object(AdaBoostClassifier())
Multinomial_model = pipeline_object(MultinomialNB())
Nearest_neighbors_model = pipeline_object(KNeighborsClassifier(3))
Decision_tree_model = pipeline_object(DecisionTreeClassifier(max_depth = 5))

##### Putting models into a list

In [28]:
classifiers = [LinearSVC_model, 
               SVC_model, 
               Random_forest_model,
               Ada_Boost_model,
              Multinomial_model,
              Nearest_neighbors_model,
              Decision_tree_model]

#### Defining the names of the classification models in a list.

In [29]:
names = ['LinearSVC', 'SVC', 'Random forest', 'AdaBoost', 'MultinomialNB', 'Nearest Neighbors', 'Decision Tree']

## Training each pipeline model on the train set of the cross validation train test split and predicting the results of the respective test set.

In [30]:
results = []

for name, model in zip(names, classifiers):
    print(f'Fitting {name} model to the CV train set')
    trained_model = model.fit(X_train, y_train)
    
    print('Predicting the CV test set')
    y_pred = model.predict(X_test)
    
    print('Obtaining f1_score\n\n')
    f1_score_ = f1_score(y_test, y_pred, average = 'macro') 
    results.append([name, f1_score_])    

Fitting LinearSVC model to the CV train set




Predicting the CV test set
Obtaining f1_score


Fitting SVC model to the CV train set
Predicting the CV test set
Obtaining f1_score


Fitting Random forest model to the CV train set
Predicting the CV test set
Obtaining f1_score


Fitting AdaBoost model to the CV train set




Predicting the CV test set
Obtaining f1_score


Fitting MultinomialNB model to the CV train set
Predicting the CV test set
Obtaining f1_score


Fitting Nearest Neighbors model to the CV train set
Predicting the CV test set
Obtaining f1_score


Fitting Decision Tree model to the CV train set
Predicting the CV test set
Obtaining f1_score




In [31]:
results

[['LinearSVC', 0.6691529740819062],
 ['SVC', 0.31358461233553997],
 ['Random forest', 0.17687908496732027],
 ['AdaBoost', 0.5008575354153117],
 ['MultinomialNB', 0.34833604160863396],
 ['Nearest Neighbors', 0.5641407897459046],
 ['Decision Tree', 0.3717057333298502]]

#### Converting results into a Dataframe to make it more readable: 

In [57]:
results_dict = dict(results)
results_series = pd.Series(results_dict)    
results_df = pd.DataFrame(results_series, columns = ['f1_score'])

In [58]:
results_df.sort_values('f1_score', ascending=False)

Unnamed: 0,f1_score
LinearSVC,0.669153
Nearest Neighbors,0.564141
AdaBoost,0.500858
Decision Tree,0.371706
MultinomialNB,0.348336
SVC,0.313585
Random forest,0.176879


### We can see that the "LinearSVC" model performed best with the highest f1_score out of all the trained models. With that being said, I will now continue this challenge with the LinearSVC model as my chosen model.

## 3.2 - Evaluation of the chosen Linear SVC model. 

## I will now retrain the LinearSVC model (built with the pipeline object) on the train set of the cross validation train test split, and later use it to predict the results of the full test data set in section 4. This step is merely to evaluate the LinearSVC model and show its performance in terms of macro f1_score, classification report, confusion matrix and accuracy.

#### Training the pipeline LinearSVC model

In [38]:
LinearSVC_clf = pipeline_object(LinearSVC(multi_class = 'crammer_singer', C = 2.1, loss = 'hinge'))

In [39]:
LinearSVC_clf.fit(X_train, y_train)



Pipeline(memory=None,
         steps=[('tfid_vectorizer',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=None,
                                 min_df=1, ngram_range=(1, 2), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words=['i', 'me', 'my', 'myself', 'we',
                                             'our', 'ours', 'ourse...
                                             'it', "it's", 'its', 'itself', ...],
                                 strip_accents=None, sublinear_tf=False,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, use_idf=True,
                                 

#### Predicting the test set of the train test split (bear in mind we already know the f1_score)

In [40]:
y_pred_LinearSVC = LinearSVC_clf.predict(X_test)

## Evaluation

#### (bear in mind that this macro f1_score is different from the final score on Kaggle)

In [54]:
print('\nf1_score: We know the score, however, I will print it again:\n\n', f1_score(y_test, y_pred_LinearSVC, average='macro'))
print('\n\nClassification report:\n\n', classification_report(y_test, y_pred_LinearSVC))
print('\n\nConfusion matrix:\n\n',pd.DataFrame(confusion_matrix(y_test, y_pred_LinearSVC), index = [-1, 0, 1, 2], columns = [-1, 0, 1, 2]))
print('\n\nAccuracy:\n\n',accuracy_score(y_test, y_pred_LinearSVC))


f1_score: We know the score, however, I will print it again:

 0.6691529740819062


Classification report:

               precision    recall  f1-score   support

          -1       0.74      0.54      0.63       401
           0       0.63      0.40      0.49       666
           1       0.79      0.83      0.81      2598
           2       0.68      0.83      0.75      1081

    accuracy                           0.74      4746
   macro avg       0.71      0.65      0.67      4746
weighted avg       0.74      0.74      0.74      4746



Confusion matrix:

      -1    0     1    2
-1  218   48   111   24
 0   33  268   277   88
 1   38  102  2148  310
 2    4    9   170  898


Accuracy:

 0.7442056468605142
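
As a cross-check on how the reported numbers relate to each other, the sketch below recomputes the macro f1_score and the accuracy directly from the confusion matrix printed above. It assumes scikit-learn's convention that rows are true labels and columns are predicted labels, which is consistent with the row sums matching the support column of the classification report.

```python
# Confusion matrix copied from the output above (rows = true, columns = predicted)
labels = [-1, 0, 1, 2]
cm = [
    [218, 48, 111, 24],
    [33, 268, 277, 88],
    [38, 102, 2148, 310],
    [4, 9, 170, 898],
]

f1_per_class = []
for k in range(len(labels)):
    tp = cm[k][k]
    actual = sum(cm[k])                      # row sum: tweets that truly belong to class k
    predicted = sum(row[k] for row in cm)    # column sum: tweets predicted as class k
    # F1 = 2*TP / (2*TP + FP + FN), and 2*TP + FP + FN = actual + predicted
    f1_per_class.append(2 * tp / (actual + predicted))

# Macro averaging gives every class equal weight, regardless of its size
macro_f1 = sum(f1_per_class) / len(f1_per_class)
accuracy = sum(cm[k][k] for k in range(len(labels))) / sum(map(sum, cm))
print(round(macro_f1, 4), round(accuracy, 4))
```

Because the macro average weights the small classes -1 and 0 as heavily as the large class 1, it comes out lower than the accuracy.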


# 4. Evaluation of the test set

### I will now clean the full test data set and then predict its sentiment labels using the trained model.

### Preprocessing our test set using the `cleaning_text` function.

In [45]:
test_data = cleaning_text(df_test['message'])

In [46]:
test_data[:5]

['europe will now be looking to china to make sure that it is not alone in fighting climate change http co t rcgwdq',
 'combine this with the polling of staffer re climate change and woman right and you have fascist state http co ifrm eexpj',
 'the scary unimpeachable evidence that climate change is already here http co yaedqcv ki itstimetochange climatechange zeroco _',
 'karoli morgfair osborneink dailykos putin got to you too jill trump doesn believe in climate change at all think it s hoax',
 'rt fakewillmoore female orgasm cause global warming sarcastic republican']

### Prediction of the test set with our trained model

In [47]:
test_set_pred = LinearSVC_clf.predict(test_data)

In [48]:
df_test['sentiment'] = test_set_pred

In [49]:
df_test.head()

Unnamed: 0,message,tweetid,sentiment
0,Europe will now be looking to China to make su...,169760,1
1,Combine this with the polling of staffers re c...,35326,1
2,"The scary, unimpeachable evidence that climate...",224985,1
3,@Karoli @morgfair @OsborneInk @dailykos \nPuti...,476263,1
4,RT @FakeWillMoore: 'Female orgasms cause globa...,872928,0


# Submission to csv file that is to be uploaded/submitted on Kaggle.

In [None]:
df_test[['tweetid', 'sentiment']].to_csv('submission.csv', index = False)