# Team ZM3 EDSA - Climate Change Belief Analysis 2021

Predict an individual’s belief in climate change based on historical tweet data

###### Members
1. Ndamulelo Innocent Nelwamondo
2. Thobekani Masondo
3. Nomvuselelo Simelane
4. John Sekgobela
5. Namhla Sokapase
6. Sandra Malope

In [1]:
%%capture
!pip install ipython-autotime
%load_ext autotime

time: 0 ns (started: 2021-06-24 14:23:57 +02:00)


In [2]:
# Import comet_ml at the top of your file
from comet_ml import Experiment

time: 14.5 s (started: 2021-06-24 14:29:35 +02:00)


In [3]:
# Create an experiment with your api key
experiment = Experiment(
    api_key="KQ1UTh7hBvPLWlz3034oIgusG",
    project_name="global-warming-climate-change-sentiment-analysis-zm3",
    workspace="thobekanimasondo84-gmail-com",)

COMET INFO: Experiment is live on comet.ml https://www.comet.ml/thobekanimasondo84-gmail-com/global-warming-climate-change-sentiment-analysis-zm3/59cc3a06cdc14f6682b965379a4771f8



time: 16.9 s (started: 2021-06-24 14:29:58 +02:00)


### Introduction

###### Using Twitter to measure the impact of climate change:
As the climate crisis intensifies and natural disasters become more frequent and powerful, scientists are increasingly turning to social media as a way to assess the damage and impact on a more localized scale. In our case, Twitter was useful given the geographical reach of Twitter as well as the volume and location-specific nature of tweets. The platform can be used to track how individuals feel about climate change and how they view climate change.

Social media encourages greater knowledge of climate change, mobilization of climate change activists, space for discussing the issue with others, and online discussions that frame climate change as a negative for society. Social media, however, does provide space for framing climate change skeptically and activating those with a skeptical perspective of climate change.
<div align="center" style="width: 500px; font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://media.tenor.com/images/47d160eabb0927ed23827ab099ee83c3/tenor.gif"
     alt="Dummy image 1"
     style="float: center; padding-bottom=0.5em"
     width=500px/>

## Problem statement

We aim to explore machine learning as a method to assist us in identifying whether or not a person believes in climate change and could possibly be converted to a new customer based on their tweets. To do so, we will develop a ML model that is able to classify textual passage as relevant to climate change adaptation. To produce such a model, we first select an appropriate corpus of documents for training that has been annotated. Then, we pre-process and clean the documents, transform them to extract appropriate features, select a ML model, train it, and evaluate its performance. Model evaluation is done both by comparing model predictions against a human panel at block level and comparing model performance against data that have been annotated but not used for training using cross-fold validation. Once a satisfactory performance of the model has been achieved, we interpret the patterns learned and apply them for further decision-making in a climate change adaptation context.

<div align="center" style="width: 500px; font-size: 80%; text-align: center; margin: 0 auto">
<img src="https://en.reset.org/files/imagecache/sc_832x468/2018/02/27/planet_earth.jpg"
     alt="Dummy image 1"
     style="float: center; padding-bottom=0.5em"
     width=500px/>

### Import libraries

In [4]:
!pip install imblearn --user

time: 6.7 s (started: 2021-06-24 14:30:24 +02:00)




In [5]:
import numpy as np 
import pandas as pd


import matplotlib.pyplot as plt
import seaborn as sns


import re
from string import punctuation
import nltk
nltk.download(['stopwords','punkt'])
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from wordcloud import WordCloud, STOPWORDS


from sklearn.model_selection import train_test_split

from sklearn.feature_extraction.text import TfidfVectorizer,CountVectorizer
from sklearn.utils import resample

from imblearn.pipeline import Pipeline

from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import RidgeClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier 

from sklearn.ensemble import StackingClassifier

from sklearn.metrics import classification_report,confusion_matrix

from sklearn.model_selection import cross_val_score
from imblearn.over_sampling import SMOTE
from sklearn import metrics

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Webster\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Webster\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


time: 38.2 s (started: 2021-06-24 14:37:35 +02:00)


## LOADING THE DATA

In [6]:
train = pd.read_csv('train.csv')
test = pd.read_csv('test_with_no_labels.csv')

time: 344 ms (started: 2021-06-24 14:40:05 +02:00)


<a id="EDA"></a>
# Exploratory Data analysis

<a id="data"></a>
# DATA PREPROCESSING

Combine the train and test data in order to do preprocessing on both datasets. This is essential to test the models being built. This step will also be useful for implementing the API.

Creating a function which will preprocess all of our data.

In [7]:
def tweet_preprocessing(tweet):
    
    '''
    This functions cleans tweets from line breaks, URLs, numbers, etc.
    '''
    
    tweet = tweet.lower() #to lower case
    tweet = tweet.replace('\n', ' ') # remove line breaks
    tweet = tweet.replace('\@(\w*)', '') # remove mentions
    tweet = re.sub(r"\bhttps://t.co/\w+", '', tweet) # remove URLs
    tweet = re.sub('\w*\d\w*', '', tweet) # remove numbers
    tweet = re.sub(r'\#', '', tweet) # remove hashtags. To remove full hashtag: '\#(\w*)'
    tweet = re.sub('\w*\d\w*', '', tweet) # removes numbers?
    tweet = re.sub(' +', ' ', tweet) # remove 1+ spaces

    return tweet

time: 63 ms (started: 2021-06-24 14:40:09 +02:00)


After we create a function for preprocessing we must split the data into labels and features (X and y)

In [8]:
# Splitting the labels and features
train['processed'] = train['message'].apply(tweet_preprocessing)
X = train['processed'].values
y = train['sentiment'].values

time: 1.28 s (started: 2021-06-24 14:40:12 +02:00)


In [9]:
# preprocess testing data by applying our function
test['processed'] = test['message'].apply(tweet_preprocessing)

time: 531 ms (started: 2021-06-24 14:40:15 +02:00)


<a id="feature"></a>
# Feature Selection

### Naive Bayes Classifier 

In [10]:
# Splitting the labels and fetures into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10,random_state=42,stratify=y)

time: 219 ms (started: 2021-06-24 14:40:18 +02:00)


In [11]:
mnb = Pipeline([('Count',CountVectorizer()),('classify',MultinomialNB())])
#fitting the model
mnb.fit(X_train, y_train)

#apply model on test data
y_pred_mnb = mnb.predict(X_test)

time: 500 ms (started: 2021-06-24 14:40:21 +02:00)


In [12]:
# Classification report
print(classification_report(y_test, y_pred_mnb))

              precision    recall  f1-score   support

          -1       0.86      0.28      0.42       130
           0       0.68      0.22      0.33       235
           1       0.69      0.92      0.79       853
           2       0.78      0.69      0.73       364

    accuracy                           0.71      1582
   macro avg       0.75      0.53      0.57      1582
weighted avg       0.72      0.71      0.68      1582

time: 16 ms (started: 2021-06-24 14:40:22 +02:00)


As we can see the '-1' and '0' class are poorly predicted when using unbalanced data. Once we implement resampling their f1-score increases for these model but only slightly. While at the same time the overall accuracy is slightly reduced.

<a id="modelling"></a>
# MODELLING

#### SVC and LinearSVC

SVC Provides a best fit to catergorize our data this fit can be nonlinear, while a linearSVC provides a linear interpolation.

In [13]:
#SVC
svc = Pipeline([('Count',CountVectorizer()),('classify',SVC(max_iter=300,C=1))])

time: 0 ns (started: 2021-06-24 14:40:27 +02:00)


In [14]:
#linearSVC
linsvc = Pipeline([('Count',CountVectorizer()),('classify',LinearSVC(max_iter=300,C=1))])

time: 0 ns (started: 2021-06-24 14:40:27 +02:00)


#### Logistic Regression

Models the discrete probability distribution between classes and classifies based on the inflection point of the curve.

In [15]:
#Logistic Regression
lr = Pipeline([('Count',CountVectorizer()),('classify',LogisticRegression(max_iter=300))])

time: 0 ns (started: 2021-06-24 14:40:29 +02:00)


#### KNN
The KNN classifier assumes that all data points that are close together fall into the same class.K is the number of neighbours. So K=3 implies we will make our predictions based off f the 3 closest points.

In [16]:
#KNN
knn = Pipeline([('Count',CountVectorizer()),('classify',KNeighborsClassifier(n_neighbors=3))])

time: 0 ns (started: 2021-06-24 14:40:32 +02:00)


#### Decision Tree

The decision tree uses a tree-like model of decisions and their possible consequences including chance event outcomes, resource costs and utility.Starting from the decision itself (called a "node"), each branch of the decision tree represents a possible decision, outcome, or reaction.

In [17]:
#Decision Tree
dt = Pipeline([('Count',CountVectorizer()),('classify',DecisionTreeClassifier())])

time: 0 ns (started: 2021-06-24 14:40:34 +02:00)


#### Random Forest
Using the decision tree as a base estimator,each estimator is trained on a different bootstrap sample having the same size as the training set. At each node of the forest, features are sampled without replacement to increase randomization. Nodes are split to maximise information gain.

In [18]:
#Random Forest
rf = Pipeline([('Count',CountVectorizer()),('classify',RandomForestClassifier())])

time: 0 ns (started: 2021-06-24 14:40:36 +02:00)


### MODEL PERFORMANCE

In [19]:
num=3
# SVC
scores = cross_val_score(
        svc, X, y, cv=num, scoring='f1_weighted')
print('The average weighted F1 score over '+str(num)+' SVC models is ' + str(sum(scores)/len(scores)))



The average weighted F1 score over 3 SVC models is 0.5244944329747319
time: 16 s (started: 2021-06-24 14:40:38 +02:00)


In [20]:
#linearSVC
scores = cross_val_score(
        linsvc, X, y, cv=num, scoring='f1_weighted')
print('The average weighted F1 score over '+str(num)+ ' LinearSVC models is ' + str(sum(scores)/len(scores)))



The average weighted F1 score over 3 LinearSVC models is 0.718741536152168
time: 3.81 s (started: 2021-06-24 14:40:58 +02:00)




In [21]:
#Logistic Regression
scores = cross_val_score(
        lr, X, y, cv=num, scoring='f1_weighted')
print('The average weighted F1 score over '+str(num)+' Logistic Regression models is ' + str(sum(scores)/len(scores)))

The average weighted F1 score over 3 Logistic Regression models is 0.7314523570671643
time: 13.4 s (started: 2021-06-24 14:41:02 +02:00)


In [22]:
#KNN
scores = cross_val_score(
        knn, X, y, cv=num, scoring='f1_weighted')
print('The average weighted F1 score over '+str(num)+' KNN models is ' + str(sum(scores)/len(scores)))

The average weighted F1 score over 3 KNN models is 0.52540597989338
time: 13.6 s (started: 2021-06-24 14:41:21 +02:00)


In [23]:
#Decision Tree
scores = cross_val_score(
        dt, X, y, cv=num, scoring='f1_weighted')
print('The average weighted F1 score over '+str(num)+' Decision Tree models is ' + str(sum(scores)/len(scores)))

The average weighted F1 score over 3 Decision Tree models is 0.6193632567179077
time: 7.75 s (started: 2021-06-24 14:41:38 +02:00)


In [24]:
#Random Forest
scores = cross_val_score(
        rf, X, y, cv=num, scoring='f1_weighted')
print('The average weighted F1 score over '+str(num)+' KNN models is ' + str(sum(scores)/len(scores)))

The average weighted F1 score over 3 KNN models is 0.6740678448443699
time: 47.2 s (started: 2021-06-24 14:41:47 +02:00)


The Logistic Regression Model and the LinearSVC model perform the best. The best performance for every model is found when resampling is not done. This could be because because upsampling the minority classes to the level of the majority class results in too much overfitting.

#### Tuning parameters

We take a look and see if we can improve our best 2 models: linearSVC and Logistic Regression

In [25]:
from sklearn.model_selection import GridSearchCV
Cs = [0.001, 0.01, 0.1, 1, 10]
param_grid = {
    'C'     : Cs
    }
grid_SVM = GridSearchCV(LogisticRegression(), param_grid, scoring='f1_weighted', cv=3)
grid_SVM.fit(CountVectorizer().fit_transform(X), y)
grid_SVM.best_params_

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt

{'C': 1}

time: 36.1 s (started: 2021-06-24 14:42:39 +02:00)


In [46]:
param_grid = {'C'     : Cs }
grid_SVM = GridSearchCV(LinearSVC(), param_grid, scoring='f1_weighted', cv=3)
grid_SVM.fit(CountVectorizer().fit_transform(X), y)
grid_SVM.best_params_



{'C': 0.1}

<a id="conclusion"></a>
# Conclusion

#### Model performance
Several strategies we attempted to improve model performance, ranging from data processing techniques to clean the tweets, data balancing strategies, cross validation and grid search for the best values for model hyperparameters.

On the whole, the models performed better on the uncleaned data. Data balancing strategies yielded little to no improvement in model performance. A few models that were tried resulted in overfitting.


#### What else we can try
Language models and the use of neural networks were two other strategies that we wanted to implement, to see how the performance of model improves with the use of a language model, and how a neural network performance.

#### Business case value

From the above analysis, the story that is emerging is fairly clear; the sentiment from the negative class of tweets is that of individuals who consider the science of climate change as being a hoax. Seeing that the debate has also become ideological, it would probably be best to tailor a message to this group that does not emphasize the environmental friendliness and sustainability aspects of the products and services, but rather a message that speaks to product features and price etc, would be the best approach when targeting this group.

On the other hand, individuals from the positive class of tweets certainly believe in climate change, it is however not clear whether these individuals in their daily lives necessarily make decisions based on the environmental friendliness and sustainability of the products and services they purchase. Emphasizing a message of environmental friendliness and sustainability within this group, will not negatively impact how the products and services are received.


Some organisations are mentioned in the tweets, many which share the same values and ideals when it comes to protecting the environment, who have a substantial membership and following on social media of individuals who share the same values and ideals. The formation of potential partnerships with these organisations could lead to brand exposure with individuals who in their daily lives make conscious decisions with regards to the products and services they purchase.

We recommend that the latter strategy of pursuing partnerships with like minded organisations will yield the best results, in terms of finding a group of potential customers who share the same values and ideals, and would be likely to purchase your products and services.

<a id="save"></a>
# SUBMISSION

For our final model, we build a stacking classifier to combine Logistic Regression, LinearSVC and Random Forest

In [47]:
estimators = [
       ('rf', Pipeline([('Count',CountVectorizer(ngram_range=(1,2))),('classify',RandomForestClassifier())])),
         
        ('lnsvc', Pipeline([('Count',CountVectorizer(ngram_range=(1,2))),('classify',LinearSVC(C=0.1))])),
         
        ('MNB',Pipeline([('Count',CountVectorizer()),('classify',MultinomialNB())])),
    
        ('lr', Pipeline([('Count',CountVectorizer(ngram_range=(1,2))),('classify',LogisticRegression(C=1))]))]

In [48]:
clf = StackingClassifier(
        estimators=estimators
    )

#fitting the model
clf.fit(X, y)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver opt

StackingClassifier(estimators=[('rf',
                                Pipeline(steps=[('Count',
                                                 CountVectorizer(ngram_range=(1,
                                                                              2))),
                                                ('classify',
                                                 RandomForestClassifier())])),
                               ('lnsvc',
                                Pipeline(steps=[('Count',
                                                 CountVectorizer(ngram_range=(1,
                                                                              2))),
                                                ('classify',
                                                 LinearSVC(C=0.1))])),
                               ('MNB',
                                Pipeline(steps=[('Count', CountVectorizer()),
                                                ('classify',
                         

In [28]:
# End experiment
experiment.end()

COMET INFO: ---------------------------
COMET INFO: Comet.ml Experiment Summary
COMET INFO: ---------------------------
COMET INFO:   Data:
COMET INFO:     display_summary_level : 1
COMET INFO:     url                   : https://www.comet.ml/thobekanimasondo84-gmail-com/global-warming-climate-change-sentiment-analysis-zm3/59cc3a06cdc14f6682b965379a4771f8
COMET INFO:   Parameters:
COMET INFO:     C                            : 1
COMET INFO:     algorithm                    : auto
COMET INFO:     alpha                        : 1.0
COMET INFO:     bootstrap                    : True
COMET INFO:     break_ties                   : 1
COMET INFO:     cache_size                   : 200
COMET INFO:     ccp_alpha                    : 1
COMET INFO:     class_prior                  : 1
COMET INFO:     class_weight                 : 1
COMET INFO:     coef0                        : 1
COMET INFO:     criterion                    : gini
COMET INFO:     cv                           : 3
COMET INFO:    

time: 3.23 s (started: 2021-06-24 14:57:36 +02:00)


In [29]:
# Display results on comet page
experiment.display()

time: 16 ms (started: 2021-06-24 14:57:44 +02:00)


In [30]:
# Creating the unseen set, so that we can post to Kaggle and recieve a score based on the performance
x_unseen = test['processed']

submission = pd.DataFrame(
    {'tweetid': test['tweetid'],
     'sentiment': clf.predict(x_unseen)
    })

# save DataFrame to csv file for submission
submission.to_csv("Submission_final.csv", index=False)