# PREDICT OVERVIEW: CLIMATE CHANGE BELIEF ANALYSIS 2022 

Many companies are built around lessening one’s environmental impact or carbon footprint. They offer products and services that are environmentally friendly and sustainable, in line with their values and ideals. They would like to determine how people perceive climate change and whether or not they believe it is a real threat. This would add to their market research efforts in gauging how their product/service may be received. Our company has been awarded the contract to:




1.   Analyse the supplied data;
2.   Identify potential errors in the data and clean the existing data set;
3.   Determine if additional features can be added to enrich the data set;
4.   Build a model that is capable of predicting the sentiment of tweets;
5.   Evaluate the accuracy of the best machine learning model;
6.   Determine what features were most important in the model’s prediction decision; and
7.   Explain the inner workings of the model to a non-technical audience.

Table of Contents
1. Importing Packages

2. Loading Data

3. Exploratory Data Analysis (EDA)

4. Data Engineering

5. Modeling

6. Model Performance

7. Model Explanations

# IMPORTING PACKAGES

In [29]:
# library for natural language processing
import nltk

# libraries for importing and loading data
import numpy as np
import pandas as pd

# libraries for plotting and data visualization
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set()

# Libraries for data preparation 
import re
import string
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from nltk.tokenize import word_tokenize, TreebankWordTokenizer
from nltk import SnowballStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer

# Libraries for model building
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Libraries for assessing model accuracy 
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import recall_score
from sklearn.metrics import precision_score
from sklearn.metrics import f1_score
from sklearn.metrics import classification_report

# Libraries for resampling and word clouds
from sklearn.utils import resample
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator

# Setting global constants to ensure notebook results are reproducible

RANDOM_STATE = 42


import warnings
warnings.filterwarnings('ignore')


# BRIEF DESCRIPTION OF THE DATA



The data chosen for this project aggregates tweets pertaining to climate change collected between Apr 27, 2015 and Feb 21, 2018. In total, 43943 tweets were collected. 

It is published on the Kaggle website and is accessible using the following link: 
https://www.kaggle.com/competitions/edsa-climate-change-belief-analysis-2022/data

There are three files available for download:
* train.csv - You will use this data to train your model.
* test.csv - You will use this data to test your model.
* SampleSubmission.csv - is an example of what your submission file should look like. The order of the rows does not matter, but the names of the tweetid's must be correct.








# LOADING THE DATA

This section loads the three files described above.

In [30]:
# load the train dataset
train = pd.read_csv('/content/train.csv')

# load the test dataset
test = pd.read_csv('/content/test.csv')

# load the sample submission dataset
sample_submission = pd.read_csv('/content/sample_submission.csv')


# EXPLORATORY DATA ANALYSIS

1. VIEW THE TRAIN DATAFRAME



In [31]:
# view the first five rows of the train dataframe
train.head()

|   | sentiment | message | tweetid |
|---|-----------|---------|---------|
| 0 | 1 | PolySciMajor EPA chief doesn't think carbon di... | 625221 |
| 1 | 1 | It's not like we lack evidence of anthropogeni... | 126103 |
| 2 | 2 | RT @RawStory: Researchers say we have three ye... | 698562 |
| 3 | 1 | #TodayinMaker# WIRED : 2016 was a pivotal year... | 573736 |
| 4 | 1 | RT @SoyNovioDeTodas: It's 2016, and a racist, ... | 466954 |


From the results above, the train dataframe has three variables: 


*   sentiment: Sentiment of tweet
*   message: Tweet body
*   tweetid: Twitter unique id


2. VIEW THE TEST DATAFRAME



In [9]:
# view the first five rows of the test dataframe
test.head()

|   | message | tweetid |
|---|---------|---------|
| 0 | Europe will now be looking to China to make su... | 169760 |
| 1 | Combine this with the polling of staffers re c... | 35326 |
| 2 | The scary, unimpeachable evidence that climate... | 224985 |
| 3 | @Karoli @morgfair @OsborneInk @dailykos \r\nPu... | 476263 |
| 4 | RT @FakeWillMoore: 'Female orgasms cause globa... | 872928 |


From the results above, the test dataframe has only two variables: 



*   message: Tweet body
*   tweetid: Twitter unique id








3. VIEW THE SAMPLE_SUBMISSION DATAFRAME

In [10]:
# view the first five rows of the sample_submission dataframe
sample_submission.head()

|   | tweetid | sentiment |
|---|---------|-----------|
| 0 | 169760 | 1 |
| 1 | 35326 | 1 |
| 2 | 224985 | 1 |
| 3 | 476263 | 1 |
| 4 | 872928 | 1 |


From the results above, the sample submission has two variables as well: 



*   tweetid: the id representing each tweet in the final submission
*   sentiment: the predicted sentiment of each tweet






4. NUMBER OF ROWS AND COLUMNS IN THE TRAIN AND TEST DATAFRAMES

In [19]:
#number of rows and columns
print('The train dataframe has',train.shape[0],'rows and',train.shape[1],'columns')
print('The test dataframe has',test.shape[0],'rows and',test.shape[1],'columns')

The train dataframe has 15819 rows and 3 columns
The test dataframe has 10546 rows and 2 columns


5. CLASSIFICATION OF THE SENTIMENTS VARIABLE IN TRAIN DATAFRAME

















In [28]:
# checking the unique values/ categories in the sentiments column  
train['sentiment'].value_counts()

 1    8530
 2    3640
 0    2353
-1    1296
Name: sentiment, dtype: int64

From the results above, there are four unique sentiment categories: 1, 2, 0, and -1


**Class Description**


*   1 Pro: the tweet supports the belief of man-made climate change
*   2 News: the tweet links to factual news about climate change
*   0 Neutral: the tweet neither supports nor refutes the belief of man-made climate change
*   -1 Anti: the tweet does not believe in man-made climate change
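
The counts above are uneven across the four classes. As a minimal sketch, the snippet below rebuilds the reported counts by hand and turns them into class shares. The names `counts`, `label_names` and `shares` are illustrative and not part of the original notebook.

```python
import pandas as pd

# Counts copied from the value_counts() output above
counts = pd.Series({1: 8530, 2: 3640, 0: 2353, -1: 1296}, name="sentiment")

# Hypothetical mapping from class code to the label used in the class description
label_names = {1: "Pro", 2: "News", 0: "Neutral", -1: "Anti"}

# Share of each class as a fraction of all training tweets
shares = (counts / counts.sum()).rename(index=label_names)
print(shares.round(3))
```

On the real dataframe, the same shares could be obtained with `train['sentiment'].value_counts(normalize=True)`.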

6. MISSING VALUES IN THE DATAFRAMES

In [29]:
# check for missing values in the train data
train.isnull().sum()

sentiment    0
message      0
tweetid      0
dtype: int64

From the results above, there are no missing values in the train data

In [30]:
# check for missing values in the test data
test.isnull().sum()

message    0
tweetid    0
dtype: int64

From the results above, there are no missing values in the test data

# TEXT CLEANING SECTION

This section cleans the text data in the 'message' column of the train and test data.

It is important to clean the data and rid it of noise, so as to improve the performance of the classification models.

This data cleaning phase includes the following steps:


*   Removing digits from the text,
*   Converting the text to lower case,
*   Removing punctuation and other noise,
*   Removing URL patterns, such as 'http', 'https', etc.
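
The four steps listed above could be combined into a single helper, sketched below with the standard library only. The function name `clean_text`, the URL regular expression and the sample tweet are assumptions made for illustration; the notebook itself relies on `remove_digits` and `CountVectorizer()` instead.

```python
import re
import string

def clean_text(text):
    """Apply the cleaning steps listed above to a single tweet."""
    # Remove URLs first, while the '://' punctuation is still intact
    text = re.sub(r"https?://\S+|www\.\S+", "", text)
    # Convert to lower case
    text = text.lower()
    # Remove digits
    text = re.sub(r"\d+", "", text)
    # Remove punctuation characters
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Collapse the extra whitespace left behind by the removals
    return re.sub(r"\s+", " ", text).strip()

# Made-up tweet, for illustration only
print(clean_text("RT @User: 2016 was HOT! https://t.co/abc123 #climate"))
```

URLs are removed before punctuation because stripping punctuation first would break `https://` into plain letters that the URL pattern no longer matches.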




**PREPROCESS AND TRANSFORM THE TEXT INTO VECTORS USING COUNTVECTORIZER()**

CountVectorizer() is an object that converts a collection of text documents to a matrix of token counts.

Not only does it tokenize the text, it also helps to clean the data, because it:



*   Drops punctuation (the token pattern only matches word characters)
*   Removes stopwords (when the stop_words parameter is set)
*   Converts text to lower case
*   Performs feature extraction 





We use it here to clean the data, tokenize the words, and perform feature extraction.
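
To make the behaviour described above concrete, here is a small sketch on two made-up sentences. The variable names `docs`, `toy_vectorizer` and `toy_counts` are illustrative only.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up documents, for illustration only
docs = ["Climate change is REAL!", "Is the climate changing? The climate is."]

toy_vectorizer = CountVectorizer(stop_words="english")
toy_counts = toy_vectorizer.fit_transform(docs)

# Learned vocabulary: lower-cased tokens, without punctuation or stop words
print(sorted(toy_vectorizer.vocabulary_))

# Dense view of the document-term count matrix (rows = documents)
print(toy_counts.toarray())
```

Each row of the matrix corresponds to one document and each column to one vocabulary term, so the column for `climate` holds the number of times that word occurs in each sentence.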





**1. REMOVING DIGITS FROM THE TEXTS** 

The messages in the data contain some digits, which need to be removed in order to reduce noise.

In [32]:
# define function to remove digits
def remove_digits(text):
    return ''.join([i for i in text if not i.isdigit()])

# call the function 
train['no_digits'] = train['message'].apply(remove_digits)
test['no_digits'] = test['message'].apply(remove_digits)

# view the results
train.head()

|   | sentiment | message | tweetid | no_digits |
|---|-----------|---------|---------|-----------|
| 0 | 1 | PolySciMajor EPA chief doesn't think carbon di... | 625221 | PolySciMajor EPA chief doesn't think carbon di... |
| 1 | 1 | It's not like we lack evidence of anthropogeni... | 126103 | It's not like we lack evidence of anthropogeni... |
| 2 | 2 | RT @RawStory: Researchers say we have three ye... | 698562 | RT @RawStory: Researchers say we have three ye... |
| 3 | 1 | #TodayinMaker# WIRED : 2016 was a pivotal year... | 573736 | #TodayinMaker# WIRED : was a pivotal year in ... |
| 4 | 1 | RT @SoyNovioDeTodas: It's 2016, and a racist, ... | 466954 | RT @SoyNovioDeTodas: It's , and a racist, sexi... |


From the results above, the text no longer contains digits

**2. DEFINE THE RESPONSE AND PREDICTOR VARIABLES**

Split the data into response (y) and predictor (X) variables:


*   The predictor variable (X) in this case is the column with the digit-free messages,
*   The response variable (y), on the other hand, is the column with the sentiments


In [55]:
# the predictor variables
X = train['no_digits']

# the response variable
y = train['sentiment']

**3. PERFORM A TRAIN-TEST SPLIT**

Split the data into train and test sections in order to fit the classification models



*   The train data is used to fit the model onto the data, while the test data is used to evaluate the performance of the model on unseen observations

In [56]:
# import the train_test_split module from sklearn
from sklearn.model_selection import train_test_split

# split into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
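
The split above is purely random. Because the sentiment classes are imbalanced, one option (not used in the original notebook) is to pass `stratify` so that each partition keeps the same class proportions. The sketch below uses made-up toy data, and all variable names in it are illustrative.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Made-up, imbalanced toy labels: 80 tweets of class 1 and 20 of class -1
toy_X = [f"tweet {i}" for i in range(100)]
toy_y = [1] * 80 + [-1] * 20

# stratify=toy_y keeps the 80/20 class ratio in both partitions
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

print(Counter(y_tr), Counter(y_te))
```

Whether stratification changes the scores reported later would need to be checked by rerunning the notebook.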


**4. IMPORT THE COUNTVECTORIZER() MODULE AND TUNE THE PARAMETERS**

In [47]:
# import the CountVectorizer module from sklearn
from sklearn.feature_extraction.text import CountVectorizer

# create an instance of the CountVectorizer() object and add parameters to it to handle stopwords and remove punctuations from text
vectorizer = CountVectorizer(stop_words='english', token_pattern=r'(?u)\b\w\w+\b')


**5. FIT THE COUNTVECTORIZER() INSTANCE ON THE TRAIN DATA AND TRANSFORM BOTH DATASETS**

In [49]:
# fit the CountVectorizer on X_train and transform it into a count matrix
x_train_vectorized = vectorizer.fit_transform(X_train)

# transform X_test using the vocabulary learned from X_train (no refitting)
x_test_vectorized = vectorizer.transform(X_test)
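
The vocabulary is learned from the training text only, so words that appear only in the test text are ignored when it is transformed. A minimal sketch with made-up sentences (all names here are illustrative) shows the effect:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up documents, for illustration only
toy_train = ["climate change is real", "global warming hoax"]
toy_test = ["climate emergency declared"]

toy_vec = CountVectorizer()
toy_vec.fit(toy_train)  # the vocabulary comes from the training text only
toy_test_matrix = toy_vec.transform(toy_test)

# 'emergency' and 'declared' were never seen during fit, so only 'climate' is counted
print(toy_test_matrix.sum())

# The test matrix has one column per training-vocabulary term
print(toy_test_matrix.shape[1] == len(toy_vec.vocabulary_))
```

This keeps the train and test matrices column-compatible, which the classifiers in the next section rely on.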


# MODELLING

This section builds and evaluates the performance of various classification models, such as:

*   Logistic Regression
*   Support Vector Machine
*   Naive Bayes Classifier
*   K-Nearest Neighbours (KNN) Classifier





**1. CLASSIFICATION MODEL 1: LOGISTIC REGRESSION**


In [71]:
# import the LogisticRegression module from sklearn
from sklearn.linear_model import LogisticRegression

# create an instance of the LogisticRegression() object
lr = LogisticRegression()

# fit the model onto the train data
lr.fit(x_train_vectorized, y_train)

# generate predictions of the sentiment classifications
pred_lr = lr.predict(x_test_vectorized)

# calculate and print the performance of the LogisticRegression model
print("The accuracy score of the Logistic Regression Classifier is:", lr.score(x_test_vectorized, y_test))

The accuracy score of the Logistic Regression Classifier is: 0.7531605562579013
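
`Pipeline` is imported at the top of the notebook but not used. As a hedged sketch, the vectorizer and the classifier could be chained so that raw text goes in and predictions come out. The toy corpus, the step names and `max_iter=1000` are assumptions for illustration, not settings taken from the notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Made-up mini corpus, for illustration only
toy_texts = [
    "climate change is real",
    "global warming is a hoax",
    "we must act on climate change",
    "warming is a hoax and a scam",
]
toy_labels = [1, -1, 1, -1]

text_clf = Pipeline([
    ("vectorizer", CountVectorizer(stop_words="english")),
    ("classifier", LogisticRegression(max_iter=1000)),  # max_iter is an assumed setting
])

# fit() vectorizes the raw text and then trains the classifier in one call
text_clf.fit(toy_texts, toy_labels)
print(text_clf.predict(["climate change is a hoax"]))
```

A pipeline also ensures the vectorizer is only ever fitted on the data passed to `fit()`, which avoids accidentally refitting it on test text.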


**2. CLASSIFICATION MODEL 2: SUPPORT VECTOR MACHINE**

In [72]:
# import the support vector machine classification module
from sklearn.svm import SVC

# create an instance of the SVC() object
svc = SVC(kernel='rbf')

# fit the svc model onto the train data
svc.fit(x_train_vectorized, y_train)

# generate predictions of the sentiments
svc_pred = svc.predict(x_test_vectorized) 

# evaluate performance of the svc model
from sklearn.metrics import classification_report, accuracy_score, log_loss
print("The accuracy score of the Support Vector Machine Classifier is:", accuracy_score(y_test, svc_pred))
print("\n\nClassification Report:\n\n", classification_report(y_test, svc_pred))

The accuracy score of the Support Vector Machine Classifier is: 0.7319848293299621


Classification Report:

               precision    recall  f1-score   support

          -1       0.89      0.23      0.37       278
           0       0.66      0.37      0.47       425
           1       0.73      0.89      0.80      1755
           2       0.74      0.76      0.75       706

    accuracy                           0.73      3164
   macro avg       0.76      0.56      0.60      3164
weighted avg       0.74      0.73      0.71      3164
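
The two average rows in the report differ in how they weigh the classes. The sketch below recomputes both averages for recall from the rounded per-class figures above. The variable names are illustrative.

```python
# Per-class recall and support copied from the classification report above
recall = {-1: 0.23, 0: 0.37, 1: 0.89, 2: 0.76}
support = {-1: 278, 0: 425, 1: 1755, 2: 706}

total = sum(support.values())

# Macro average: every class counts equally
macro_recall = sum(recall.values()) / len(recall)

# Weighted average: each class counts in proportion to its support
weighted_recall = sum(recall[c] * support[c] for c in recall) / total

print(round(macro_recall, 2), round(weighted_recall, 2))
```

The macro average is pulled down by the low recall on the smaller -1 and 0 classes, while the weighted average is dominated by class 1, which makes up more than half of the test rows.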



**3. CLASSIFICATION MODEL 3: NAIVE BAYES CLASSIFIER**

In [73]:
# import the Gaussian Naive Bayes classification module
from sklearn.naive_bayes import GaussianNB

# Define the model 
naive_bayes = GaussianNB()

# Fit the model 
naive_bayes.fit(x_train_vectorized.toarray(), y_train)

# generate predicted class probabilities
naive_pred = naive_bayes.predict_proba(x_test_vectorized.toarray())

# calculate and print the log loss error
print("The log loss error for the Naive Bayes Classifier is:", log_loss(y_test, naive_pred))

The log loss error for the Naive Bayes Classifier is: 15.446387041339639


**4. CLASSIFICATION MODEL 4: K-NEAREST-NEIGHBOURS (KNN) CLASSIFIER**

In [74]:
# import the KNN module from sklearn
from sklearn.neighbors import KNeighborsClassifier

# define the number of neighbours
n_neighbors = 3 # <--- change this number to play around with how many nearest neighbours to look for.

# define the model
knn = KNeighborsClassifier(n_neighbors)

# fit the model 
knn.fit(x_train_vectorized, y_train)

# get predicted class probabilities on the test set
knn_pred = knn.predict_proba(x_test_vectorized)

# calculate and print the log loss error
print("The log loss error for the KNN model is: ", log_loss(y_test, knn_pred))

The log loss error for the KNN model is:  12.097323967437175


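For the log loss metric, lower is better: a perfect model would have a log loss of 0. Because log loss and accuracy are measured on different scales, the log loss values above cannot be compared directly with the accuracy scores of the first two models.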
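
Log loss is the mean negative logarithm of the probability that the model assigns to the true class. The sketch below is a hand-rolled version on made-up data; the function name `multiclass_log_loss` and the clipping constant `eps` are assumptions, and the exact clipping used by `sklearn.metrics.log_loss` may differ between versions. It also assumes class labels are the column indices 0, 1, ...

```python
import math

def multiclass_log_loss(true_labels, probabilities, eps=1e-15):
    """Mean negative log of the probability assigned to the true class."""
    losses = []
    for label, row in zip(true_labels, probabilities):
        # Clip so that a predicted probability of exactly 0 does not give log(0)
        p = min(max(row[label], eps), 1 - eps)
        losses.append(-math.log(p))
    return sum(losses) / len(losses)

# Made-up example: three samples, two classes (column index = class label)
y_true = [0, 1, 1]
confident = [[0.9, 0.1], [0.2, 0.8], [0.3, 0.7]]
hard_zero = [[0.9, 0.1], [0.2, 0.8], [1.0, 0.0]]  # third sample gets probability 0

print(multiclass_log_loss(y_true, confident))
print(multiclass_log_loss(y_true, hard_zero))
```

A single probability of 0 on the true class is enough to inflate the average sharply. With `n_neighbors = 3` and uniform weights, a KNN model can only output probabilities of 0, 1/3, 2/3 or 1, which is one plausible contributor to the large values reported above.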