# Classification Predict Student Solution

© Explore Data Science Academy

---
### Honour Code

I {**YOUR NAME, YOUR SURNAME**}, confirm - by submitting this document - that the solutions in this notebook are a result of my own work and that I abide by the [EDSA honour code](https://drive.google.com/file/d/1QDCjGZJ8-FmJE3bZdIQNwnJyQKPhHZBn/view?usp=sharing).

Non-compliance with the honour code constitutes a material breach of contract.

### Predict Overview: Twitter Sentiment Classification Challenge

The task involves developing a Machine Learning model to classify tweets based on people's beliefs regarding climate change. The dataset provided contains 43,943 tweets collected between April 27, 2015, and February 21, 2018. Each tweet is labeled with one of four classes: News, Pro, Neutral, or Anti, representing different perspectives on climate change.
Your company has been awarded the contract to:

 1. Analyse the supplied data;
 2. Clean and transform data, including removing noise, handling missing values, and applying text preprocessing techniques like tokenization, stop-word removal, and stemming or lemmatization;
 3. Extract relevant features from the tweet data that can contribute to the classification task;
 4. Choose an appropriate Machine Learning algorithm for the classification task;
 5. Train the model and evaluate using appropriate evaluation metrics such as accuracy, precision, recall, and F1 score;
 6. Deploy the model to classify new, unseen tweets into the belief classes; and
 7. Explain the inner workings of the model to a non-technical audience.

Formally, the problem statement given to you, the senior data scientist, by your manager via email reads as follows:

> In this project you are tasked to model this classification problem by exploring and preprocessing the data, performing feature engineering, and training a suitable Machine Learning model. The model will learn from the labeled tweets to classify new, unseen tweets into one of the belief classes accurately. The goal is to create a robust and accurate model that can provide valuable insights into people's perceptions of climate change.
 
On top of this, she has provided you with a starter notebook containing vague explanations of what the main outcomes are. 

<a id="cont"></a>

## Table of Contents
<a href=#one>1. Introduction</a>

<a href=#two>2. Problem Statement</a>

<a href=#three>3. Importing Packages</a>

<a href=#four>4. Loading Data</a>

<a href=#five>5. Exploratory Data Analysis (EDA)</a>

<a href=#six>6. Data Engineering</a>

<a href=#seven>7. Modelling</a>

<a href=#eight>8. Model Performance</a>

<a href=#nine>9. Model Explanations</a>

<a href=#ten>10. Conclusion</a>

<a href=#eleven>11. References</a>

 <a id="one"></a>
## 1. Introduction
<a href=#cont>Back to Table of Contents</a>

In today's world, climate change is a pressing global issue that has gained significant attention. Many companies are dedicated to reducing environmental impact and carbon footprints by offering sustainable and environmentally friendly products and services. To gauge public sentiment and understand how their offerings may be received, these companies require insights into people's beliefs regarding climate change.


 <a id="two"></a>
## 2. Problem Statement
<a href=#cont>Back to Table of Contents</a>

Develop a Machine Learning model to classify tweets based on people's beliefs regarding climate change. The dataset includes 43,943 tweets collected between April 27, 2015, and February 21, 2018, labeled with four classes:

- News: The tweet links to factual news about climate change.
- Pro: The tweet supports the belief of man-made climate change.
- Neutral: The tweet neither supports nor refutes the belief of man-made climate change.
- Anti: The tweet does not believe in man-made climate change.

The objective is to create an accurate and robust model that can provide insights into public sentiment on climate change, aiding companies in understanding the reception of their environmentally friendly products and services for informed marketing strategies.

 <a id="three"></a>
## 3. Importing Packages
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Importing Packages ⚡ |
| :--------------------------- |
| In this section you are required to import, and briefly discuss, the libraries that will be used throughout your analysis and modelling. |

---

In [12]:
# Libraries for data loading, data manipulation and data visualisation
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from collections import Counter
from wordcloud import WordCloud
import re
import time

#Libraries for models
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

#Libraries for model selection
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV


#Libraries for measuring metrics
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix


#Library for feature selection
from sklearn.feature_selection import VarianceThreshold
from sklearn.utils import resample


#Libraries for Natural Language processing
import nltk
from nltk import TreebankWordTokenizer, SnowballStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import string
import urllib
# Setting global constants to ensure notebook results are reproducible
#PARAMETER_CONSTANT = ###
# set plot style
sns.set()

<a id="four"></a>
## 4. Loading the Data
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Loading the data ⚡ |
| :--------------------------- |
| In this section you are required to load the data from the `df_train` file into a DataFrame. |

---

In [3]:
# Training data
train = pd.read_csv('../resources/data/train.csv')
#Testing data
test = pd.read_csv('../resources/data/test_with_no_labels.csv')

### Data Cleaning

**Viewing the whole tweet for the training data**

In [4]:
#displaying the message of the training data
with pd.option_context('display.max_colwidth', 400):
    display(train.head(10))

Unnamed: 0,sentiment,message,tweetid
0,1,"PolySciMajor EPA chief doesn't think carbon dioxide is main cause of global warming and.. wait, what!? https://t.co/yeLvcEFXkC via @mashable",625221
1,1,It's not like we lack evidence of anthropogenic global warming,126103
2,2,RT @RawStory: Researchers say we have three years to act on climate change before it’s too late https://t.co/WdT0KdUr2f https://t.co/Z0ANPT…,698562
3,1,#TodayinMaker# WIRED : 2016 was a pivotal year in the war on climate change https://t.co/44wOTxTLcD,573736
4,1,"RT @SoyNovioDeTodas: It's 2016, and a racist, sexist, climate change denying bigot is leading in the polls. #ElectionNight",466954
5,1,Worth a read whether you do or don't believe in climate change https://t.co/ggLZVNYjun https://t.co/7AFE2mAH8j,425577
6,1,RT @thenation: Mike Pence doesn’t believe in global warming or that smoking causes lung cancer. https://t.co/gvWYaauU8R,294933
7,1,"RT @makeandmendlife: Six big things we can ALL do today to fight climate change, or how to be a climate activistÃ¢â‚¬Â¦ https://t.co/TYMLu6DbNM hÃ¢â‚¬Â¦",992717
8,1,"@AceofSpadesHQ My 8yo nephew is inconsolable. He wants to die of old age like me, but will perish in the fiery hellscape of climate change.",664510
9,1,RT @paigetweedy: no offense… but like… how do you just not believe… in global warming………,260471


**Viewing the Testing data message**

In [5]:
with pd.option_context('display.max_colwidth', 400):
    display(test.head(10))

Unnamed: 0,message,tweetid
0,Europe will now be looking to China to make sure that it is not alone in fighting climate change… https://t.co/O7T8rCgwDq,169760
1,Combine this with the polling of staffers re climate change and womens' rights and you have a fascist state. https://t.co/ifrm7eexpj,35326
2,"The scary, unimpeachable evidence that climate change is already here: https://t.co/yAedqcV9Ki #itstimetochange #climatechange @ZEROCO2_;..",224985
3,@Karoli @morgfair @OsborneInk @dailykos \r\nPutin got to you too Jill ! \r\nTrump doesn't believe in climate change at all \r\nThinks it's s hoax,476263
4,RT @FakeWillMoore: 'Female orgasms cause global warming!'\r\n-Sarcastic Republican,872928
5,RT @nycjim: Trump muzzles employees of several gov’t agencies in effort to suppress info on #climate change &amp; the environment. https://t.co…,75639
6,@bmastenbrook yes wrote that in 3rd yr Comp Sci ethics part. Was told by climate change denying Lecturer that I was wrong &amp; marked down.,211536
7,RT @climatehawk1: Indonesian farmers weather #climate change w/ conservation agriculture | @IPSNews https://t.co/1NZUCCMlYr…,569434
8,RT @guardian: British scientists face a ‘huge hit’ if the US cuts climate change research https://t.co/KlKQnYDXzh,315368
9,Aid For Agriculture | Sustainable agriculture and climate change adaptation for small-scale farmers https://t.co/q7IPCP59x9 via @aid4ag,591733


**Observations**

- The messages contain elements that are not needed for EDA or modelling, namely punctuation, hashtags, mentions, numbers, extra white space, web URLs, the `https` prefix and Twitter handles.
- These elements do not carry significant meaning and can introduce unnecessary noise into the text. They can also hinder tokenization. Thus, they will be removed from the messages.

**A function for cleaning the data**

In [7]:
def clean_tweet(tweet):
    """
    This function removes punctuation, hashtags, numbers, extra white space,
    web URLs, https and Twitter handles from tweets. It converts everything to lowercase letters.

    Args:
    - tweet (str): The tweet to be cleaned.

    Returns:
    - str: The cleaned tweet.
    """
    # Converting everything to lowercase
    tweet = tweet.lower()

    # Removal of punctuation
    tweet = re.sub(r"[,.;':@#?!\&/$]+\ *", ' ', tweet)

    # Removal of hashtags
    tweet = re.sub(r'#\w*', '', tweet)

    # Removal of numbers
    tweet = re.sub(r'\d+', '', tweet)

    # Removal of whitespace in front of tweet
    tweet = tweet.lstrip(' ')

    # Removal of extra whitespace
    tweet = re.sub(r'\s\s+', ' ', tweet)

    # Removal of web URLs
    pattern_url = r'http[s]?://(?:[A-Za-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9A-Fa-f][0-9A-Fa-f]))+'
    tweet = re.sub(pattern_url, '', tweet)

    # Removal of any remaining 'https' strings
    pattern_url = r'https'
    tweet = re.sub(pattern_url, '', tweet)

    # Removal of Twitter handles (user mentions)
    pattern_handles = r'@(\w+)'
    tweet = re.sub(pattern_handles, '', tweet)

    return tweet
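
One thing worth noting about `clean_tweet`: the punctuation step runs first and strips `:`, `/`, `.`, `@` and `#`, so by the time the URL, hashtag and handle patterns are applied there is little left for them to match. This is consistent with fragments such as `t co yelvcefxkc` and handle names such as `mashable` surviving in this notebook's cleaned output. A minimal sketch of one possible reordering follows. The name `clean_tweet_reordered` and the simplified patterns are assumptions for illustration, not the notebook's method.

```python
import re

def clean_tweet_reordered(tweet):
    """Hypothetical variant of clean_tweet: structured tokens are removed before punctuation."""
    tweet = tweet.lower()
    # Remove URLs first, while '://' and '.' are still intact
    tweet = re.sub(r'https?://\S+', ' ', tweet)
    # Remove Twitter handles while '@' is still present
    tweet = re.sub(r'@\w+', ' ', tweet)
    # Keep hashtag words but drop the '#' symbol (a design choice assumed here)
    tweet = tweet.replace('#', ' ')
    # Remove digits
    tweet = re.sub(r'\d+', ' ', tweet)
    # Replace anything that is not an ASCII letter or whitespace
    # (this also covers curly quotes, ellipses and other non-ASCII characters)
    tweet = re.sub(r'[^a-z\s]', ' ', tweet)
    # Collapse repeated whitespace and trim both ends
    return re.sub(r'\s+', ' ', tweet).strip()

sample = ("PolySciMajor EPA chief doesn't think carbon dioxide is main cause of "
          "global warming and.. wait, what!? https://t.co/yeLvcEFXkC via @mashable")
print(clean_tweet_reordered(sample))
```

Whether hashtag words and retweet markers such as `rt` should be kept is a modelling decision; the sketch keeps both.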

**Cleaned training dataframe**

In [8]:
#Applying the function to the training data
train['message'] = train['message'].apply(clean_tweet)
#displaying the message of the training data first 10 rows
with pd.option_context('display.max_colwidth', 400):
    display(train.head(10))

Unnamed: 0,sentiment,message,tweetid
0,1,polyscimajor epa chief doesn t think carbon dioxide is main cause of global warming and wait what t co yelvcefxkc via mashable,625221
1,1,it s not like we lack evidence of anthropogenic global warming,126103
2,2,rt rawstory researchers say we have three years to act on climate change before it’s too late t co wdtkdurf t co zanpt…,698562
3,1,todayinmaker wired was a pivotal year in the war on climate change t co wotxtlcd,573736
4,1,rt soynoviodetodas it s and a racist sexist climate change denying bigot is leading in the polls electionnight,466954
5,1,worth a read whether you do or don t believe in climate change t co gglzvnyjun t co afemahj,425577
6,1,rt thenation mike pence doesn’t believe in global warming or that smoking causes lung cancer t co gvwyaauur,294933
7,1,rt makeandmendlife six big things we can all do today to fight climate change or how to be a climate activistã¢â‚¬â¦ t co tymludbnm hã¢â‚¬â¦,992717
8,1,aceofspadeshq my yo nephew is inconsolable he wants to die of old age like me but will perish in the fiery hellscape of climate change,664510
9,1,rt paigetweedy no offense… but like… how do you just not believe… in global warming………,260471


**Cleaned testing dataframe**

In [9]:
#Applying the function to the testing data
test['message'] = test['message'].apply(clean_tweet)
#displaying the message of the testing data first 10 rows
with pd.option_context('display.max_colwidth', 400):
    display(test.head(10))


**Observation**

- The above dataframes are now readable, but they still contain stop words that carry little meaning and add little to the understanding of a message.
- Thus, the stop words will be removed to reduce noise, which should later help improve our model.

In [13]:
def remove_stopwords(text):
    """
    This function removes stop words from the given text.

    Args:
    - text (str): The input text to be processed.

    Returns:
    - str: The text after removing stop words.
    """

    # Get the list of English stopwords
    stop_words = set(stopwords.words('english'))

    # Tokenize the text
    tokens = word_tokenize(text)

    # Remove stop words
    filtered_tokens = [token for token in tokens if token.lower() not in stop_words]

    # Join the filtered tokens back into a string
    filtered_text = ' '.join(filtered_tokens)

    return filtered_text

In [14]:
#Applying the function to the training data
train['message']= train['message'].apply(lambda x: remove_stopwords(x))
#displaying the message of the training data first 10 rows
with pd.option_context('display.max_colwidth', 400):
    display(train.head(10))

Unnamed: 0,sentiment,message,tweetid
0,1,polyscimajor epa chief think carbon dioxide main cause global warming wait co yelvcefxkc via mashable,625221
1,1,like lack evidence anthropogenic global warming,126103
2,2,rt rawstory researchers say three years act climate change ’ late co wdtkdurf co zanpt…,698562
3,1,todayinmaker wired pivotal year war climate change co wotxtlcd,573736
4,1,rt soynoviodetodas racist sexist climate change denying bigot leading polls electionnight,466954
5,1,worth read whether believe climate change co gglzvnyjun co afemahj,425577
6,1,rt thenation mike pence ’ believe global warming smoking causes lung cancer co gvwyaauur,294933
7,1,rt makeandmendlife six big things today fight climate change climate activistã¢â‚¬â¦ co tymludbnm hã¢â‚¬â¦,992717
8,1,aceofspadeshq yo nephew inconsolable wants die old age like perish fiery hellscape climate change,664510
9,1,rt paigetweedy offense… like… believe… global warming………,260471


In [15]:
#Applying the remove stop words function to the testing data
test['message']= test['message'].apply(lambda x: remove_stopwords(x))
#displaying the message of the testing data first 10 rows
with pd.option_context('display.max_colwidth', 400):
    display(test.head(10))

Unnamed: 0,message,tweetid
0,europe looking china make sure alone fighting climate change… co otrcgwdq,169760
1,combine polling staffers climate change womens rights fascist state co ifrmeexpj,35326
2,scary unimpeachable evidence climate change already co yaedqcvki itstimetochange climatechange zeroco_,224985
3,karoli morgfair osborneink dailykos putin got jill trump believe climate change thinks hoax,476263
4,rt fakewillmoore female orgasms cause global warming -sarcastic republican,872928
5,rt nycjim trump muzzles employees several gov ’ agencies effort suppress info climate change amp environment co…,75639
6,bmastenbrook yes wrote rd yr comp sci ethics part told climate change denying lecturer wrong amp marked,211536
7,rt climatehawk indonesian farmers weather climate change w conservation agriculture | ipsnews co nzuccmlyr…,569434
8,rt guardian british scientists face ‘ huge hit ’ us cuts climate change research co klkqnydxzh,315368
9,aid agriculture | sustainable agriculture climate change adaptation small-scale farmers co qipcpx via aidag,591733


<a id="five"></a>
## 5. Exploratory Data Analysis (EDA)
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Exploratory data analysis ⚡ |
| :--------------------------- |
| In this section, you are required to perform an in-depth analysis of all the variables in the DataFrame. |

---


**Checking the shape of the training and testing data:** The training set has three features and 15819 observations; the testing set has two features and 10546 observations.

In [None]:
#Training data
print(train.shape)
#Testing data
print(test.shape)

**Reading the first five rows of the training data:** As indicated by the shape, the training dataframe has three features, namely the message, the sentiment and the tweetid.

In [None]:
# Reading the first 5 rows of the training data
train.head()


**Reading the first five rows of the testing data:** As indicated by the shape, the testing dataframe has two features, namely the message and the tweetid.

In [None]:
# Reading the first 5 rows of test data
test.head()

**DataFrame Information**

In [None]:
print('Checking for the datatypes, null values')
print('')
# Checking for the datatypes, null values
print(train.info())

print('')
print("Checking the sum of null values")
print('')
#Checking the sum of null values
print(train.isnull().sum())

**Observations:**

- Checking the information of the train set confirms that there are 15819 rows and 3 columns.
- There are no null values in any of the columns.
- The sentiment and tweetid columns contain numerical values with dtype int64. The message column contains non-numerical values and therefore has dtype object.
- Moreover, the dataframe takes up 721.6 KB of memory.

In [None]:
def seperate_sent(sent_list):
    """Count how many times each sentiment label occurs in sent_list."""
    sentiments_dict = {}
    for num in sent_list:
        if num in sentiments_dict:
            sentiments_dict[num] += 1
        else:
            sentiments_dict[num] = 1
    return sentiments_dict

seperate_sent(train['sentiment'])

**Distribution of the Tweets over Four Sentiments**

In [None]:
# Plotting the number of tweets per sentiment
counts = train['sentiment'].value_counts()
counts.plot(kind='bar')

# Annotate each bar with its count, in the same order as the plotted bars
for i, value in enumerate(counts):
    plt.text(i, value, str(value), ha='center', va='bottom')
plt.xlabel('Sentiments')
plt.ylabel('Number of Tweets')
plt.title('Distribution of tweets over the four sentiments')
plt.show()

**Observations:**

1. There are 8530 tweets for sentiment 1, 3640 tweets for sentiment 2, 2353 tweets for sentiment 0 and 1296 tweets for sentiment -1.
2. The data is imbalanced: tweets that support the belief in man-made climate change make up roughly half (about 54%) of all tweets, while tweets against it make up only about 8%.
3. One possible explanation is that those who support man-made climate change are more engaged with the topic and therefore share their views more readily.
4. Conversely, those who are against the notion may be reluctant to engage in discussions about man-made climate change, or may feel they lack the information to support their stance. The data alone cannot confirm either explanation.
5. The imbalance can be addressed by resampling in order to improve model performance, address data skewness, preserve information and mitigate model bias.
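
The class shares quoted in the observations above can be checked directly from the reported counts, and the resampling idea can be sketched in a few lines. The block below is a minimal, standard-library illustration under stated assumptions: the function name `oversample` and the toy labels are hypothetical, and random oversampling with replacement is only one of several resampling strategies. The notebook already imports `sklearn.utils.resample`, which serves the same purpose.

```python
import random
from collections import Counter

# Class counts as reported in the observations above
counts = {1: 8530, 2: 3640, 0: 2353, -1: 1296}
total = sum(counts.values())
shares = {label: round(100 * n / total, 1) for label, n in counts.items()}
print(total, shares)

def oversample(labels, seed=0):
    """Return row indices in which every class is padded, with replacement, up to the majority count."""
    rng = random.Random(seed)
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)
    target = max(len(rows) for rows in by_class.values())
    balanced = []
    for rows in by_class.values():
        balanced.extend(rows)                                      # keep every original row
        balanced.extend(rng.choices(rows, k=target - len(rows)))   # pad with random duplicates
    return balanced

# Invented toy labels, not taken from the dataset
toy_labels = [1, 1, 1, 1, 2, 2, 0, -1]
balanced_idx = oversample(toy_labels)
print(Counter(toy_labels[i] for i in balanced_idx))
```

Oversampling keeps every original tweet, which is why it preserves information where undersampling would discard it. It should be applied to the training split only, so that duplicated tweets do not leak into the hold-out set.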

**Distribution of the number of sentences in each tweet for the four sentiments**

In [None]:
#Create a number-of-sentences feature
train['num_sentences'] = train['message'].apply(lambda x:len(nltk.sent_tokenize(x)))

# Create a 2x2 grid of subplots
fig, axes = plt.subplots(2, 2, figsize=(10, 8))

# Plot histogram for sentiment category 0
sns.histplot(train[train['sentiment'] == 0]['num_sentences'], ax=axes[0, 0])

# Plot histogram for sentiment category 1 (with a different color)
sns.histplot(train[train['sentiment'] == 1]['num_sentences'], color='red', ax=axes[0, 1])

# Plot histogram for sentiment category -1 (with a different color)
sns.histplot(train[train['sentiment'] == -1]['num_sentences'], color='green', ax=axes[1, 0])

# Plot histogram for sentiment category 2 (with a different color)
sns.histplot(train[train['sentiment'] == 2]['num_sentences'], color='purple', ax=axes[1, 1])

# Set the title for each subplot
axes[0, 0].set_title('Sentiment Category 0')
axes[0, 1].set_title('Sentiment Category 1')
axes[1, 0].set_title('Sentiment Category -1')
axes[1, 1].set_title('Sentiment Category 2')

# Adjust the spacing between subplots
plt.tight_layout()

# Show the subplots
plt.show()

**Observation:** 

1. News tweets range from 1 to 4 sentences. Neutral tweets and tweets against the matter range from 1 to 6 sentences. However, tweets that support man-made climate change range from 1 to 11 sentences.
2. The supporting tweets include outliers, possibly from authors with a deeper understanding of the topic due to personal experience, research or education. Consequently, they may feel more confident and knowledgeable, resulting in more extensive and well-articulated responses.

**Distribution of the number of words in each tweet**

In [None]:
#Creating a number-of-words feature
train['num_words'] = train['message'].apply(lambda x:len(nltk.word_tokenize(x)))

# A 2x2 grid of subplots
fig, axes = plt.subplots(2, 2, figsize=(10, 8))

# Histogram for sentiment category 0
sns.histplot(train[train['sentiment'] == 0]['num_words'], ax=axes[0, 0])

# Histogram for sentiment category 1 (with a different color)
sns.histplot(train[train['sentiment'] == 1]['num_words'], color='red', ax=axes[0, 1])

#Histogram for sentiment category -1 (with a different color)
sns.histplot(train[train['sentiment'] == -1]['num_words'], color='green', ax=axes[1, 0])

#Histogram for sentiment category 2 (with a different color)
sns.histplot(train[train['sentiment'] == 2]['num_words'], color='purple', ax=axes[1, 1])

#Titles for each subplot
axes[0, 0].set_title('Sentiment Category 0')
axes[0, 1].set_title('Sentiment Category 1')
axes[1, 0].set_title('Sentiment Category -1')
axes[1, 1].set_title('Sentiment Category 2')

# Adjusting the spacing between subplots
plt.tight_layout()

# Show the subplots
plt.show()

**Observation:** 

1. The word count of News tweets ranges from 8 to 34 words, with a most frequent count of 22 words per tweet. The word count of Neutral tweets ranges from 2 to 39 words, with a most frequent count of 26 words per tweet. The word count of tweets against man-made climate change ranges from 6 to 49 words, with a most frequent count of 27 words per tweet. The word count of tweets that support man-made climate change ranges from 4 to 45 words, with a most frequent count of 26 words per tweet.

2. Tweets that support man-made climate change and tweets against it have a wider range of word counts than the Neutral and News tweets. A possible explanation is that these authors hold strong emotions and beliefs about the topic, so they feel compelled to express their opinions, arguments and concerns in more detail. They may also feel a greater need to persuade others or defend their position, leading them to provide more extensive justification and evidence. People on either side often face resistance and criticism, and in response they might use more words to address potential counterarguments, clarify their stance or counter opposing viewpoints.

*Need attention for revision*


**Distribution of the number of characters in each tweet**

In [None]:
#Creating a number of characters feature
train['num_characters'] = train['message'].apply(len)

#A 2x2 grid of subplots
fig, axes = plt.subplots(2, 2, figsize=(10, 8))

#Histogram for sentiment category 0
sns.histplot(train[train['sentiment'] == 0]['num_characters'], ax=axes[0, 0])

#Histogram for sentiment category 1 (with a different color)
sns.histplot(train[train['sentiment'] == 1]['num_characters'], color='red', ax=axes[0, 1])

#Histogram for sentiment category -1 (with a different color)
sns.histplot(train[train['sentiment'] == -1]['num_characters'], color='green', ax=axes[1, 0])

#Histogram for sentiment category 2 (with a different color)
sns.histplot(train[train['sentiment'] == 2]['num_characters'], color='purple', ax=axes[1, 1])

#The title for each subplot
axes[0, 0].set_title('Sentiment Category 0')
axes[0, 1].set_title('Sentiment Category 1')
axes[1, 0].set_title('Sentiment Category -1')
axes[1, 1].set_title('Sentiment Category 2')

# Adjusting the spacing between subplots
plt.tight_layout()

# Show the subplots
plt.show()

**Observation:** 

1. All four sentiments have a mode of 140 characters. Twitter initially had a strict limit of 140 characters per tweet. This constraint may have led many individuals to write tweets of exactly 140 characters even though they wanted to write more.
2. However, the limit was later increased to 280 characters. This enabled individuals to fit their thoughts into a single tweet, and the graphs above show no tweet with more than 280 characters.

**Removing Stopwords**

In [None]:
nltk.download(['punkt','stopwords'])

#A function for removing stopwords
def remove_stopwords(text):
    stop_words = set(stopwords.words('english'))
    tokens = text.split()
    filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
    return ' '.join(filtered_tokens)
#Removing stop words from training data
train['updated message'] = train['message'].apply(remove_stopwords)

**Removing punctuation Marks**

In [None]:
#A function for removing punctuation marks
def remove_punctuation(post):
    return ''.join([l for l in post if l not in string.punctuation])

# Create and Check if a new column contains messages with no punctuations
train['updated message'] = train['updated message'].apply(remove_punctuation).str.lower() #Words converted to lower case
train['updated message'].iloc[0]


**Masking the four sentiments**

In [None]:
#Tweets of those who believe in climate change
pro = train.loc[train['sentiment'] == 1, 'updated message']
#Tweets of those who don't believe in climate change
anti = train.loc[train['sentiment'] == -1, 'updated message']
#Tweets of those who are neutral about climate change
neutral = train.loc[train['sentiment'] == 0, 'updated message']
#Tweets of the news
news = train.loc[train['sentiment'] == 2, 'updated message']

**Word Clouds of the top fifty Words that appear the most from each sentiment**

In [None]:
# Define the categories and their respective texts
categories = ['Pro', 'Anti', 'Neutral', 'News']
texts = [pro, anti, neutral, news]

# Create subplots
fig, axs = plt.subplots(2, 2, figsize=(12, 9))
fig.subplots_adjust(hspace=0)  # Adjust the hspace parameter to remove vertical spacing

# Generate word clouds for each category and plot them
for i, ax in enumerate(axs.flat):
    # Calculate word distribution
    text = ' '.join(texts[i])
    words = text.split()
    word_counts = Counter(words)
    
    # Generate word cloud
    wordcloud = WordCloud(max_words=50)
    wordcloud.generate_from_frequencies(word_counts)
    
    # Plot the word cloud
    ax.imshow(wordcloud, interpolation='bilinear')
    ax.axis('off')
    ax.set_title(categories[i])

# Show the subplots
plt.show()

**Observations:**

1. From the word clouds above, it is observed that climate, change, global and warming are the most popular words, which is expected since the topic revolves around these words.
2. Those who support the matter use the word believe more than those who are against it. They also use the words going and die, which do not appear in the word cloud of those against climate change.
3. People who are against man-made climate change use words like fake and scam to show their resistance to the idea. Moreover, they use hoax more often than those who support man-made climate change.
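
Comparisons like the ones above can also be made numerically rather than by reading the clouds. The following is a minimal sketch using only `collections.Counter`. The helper names `top_words` and `distinctive_words` and the toy messages are hypothetical; in the notebook the inputs would be the `pro`, `anti`, `neutral` and `news` Series.

```python
from collections import Counter

def top_words(messages, n=50):
    """Return the n most frequent whitespace-separated words across messages."""
    return Counter(' '.join(messages).split()).most_common(n)

def distinctive_words(messages, other_messages, n=50):
    """Words in the top n of messages that are absent from the top n of other_messages."""
    other_top = {word for word, _ in top_words(other_messages, n)}
    return [word for word, _ in top_words(messages, n) if word not in other_top]

# Toy stand-ins for the `pro` and `anti` Series (invented text, not taken from the dataset)
toy_pro = ['climate change real believe science', 'believe climate change going']
toy_anti = ['climate change hoax', 'global warming scam hoax climate']

print(top_words(toy_pro, n=3))
print(distinctive_words(toy_anti, toy_pro, n=10))
```

Filtering out the words shared by both classes removes topic words such as climate and change, leaving the vocabulary that actually separates the sentiments.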

**Ayanda**

<a id="six"></a>
## 6. Data Engineering
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Data engineering ⚡ |
| :--------------------------- |
| In this section you are required to: clean the dataset, and possibly create new features - as identified in the EDA phase. |

---

**Palesa**

**Shibu**

**Lebo**

In [None]:
# View the data after EDA
train.head()

In [None]:
# remove missing values/ features
# Remove the column message
train = train.drop('message', axis = 1)
train

In [None]:
# create new features
Pro = train[train['sentiment'] == 1]
Anti = train[train['sentiment'] == -1]
Neutral = train[train['sentiment'] == 0]
News = train[train['sentiment'] == 2]

In [None]:
# engineer existing features
for word in 

<a id="seven"></a>
## 7. Modelling
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Modelling ⚡ |
| :--------------------------- |
| In this section, you are required to create one or more classification models that are able to accurately classify tweets into the four sentiment classes. |

---

**Tshifhu**

In [None]:
names = ['Logistic Regression', 'Nearest Neighbors',
         'RBF SVM',
         'Decision Tree', 'Random Forest']

In [None]:
classifiers = [
    LogisticRegression(max_iter=1000),
    KNeighborsClassifier(10),
    SVC(kernel="rbf", gamma=2, C=0.025),
    DecisionTreeClassifier(max_depth=3, min_samples_split=2),
    RandomForestClassifier(n_estimators=100, max_depth=3)
]

In [None]:
# X_train, X_test, y_train and y_test are expected to come from the split cell
results_train = []
results_test = []

models = {}
confusion = {}
class_report = {}

for name, clf in zip(names, classifiers):

    #Fitting the models and recording the time
    run_time = %timeit -q -o clf.fit(X_train, y_train)

    #Start time
    start_time = time.time()
    #Predicting on the training features X_train
    y_pred = clf.predict(X_train)
    #End time
    end_time = time.time()
    #Execution Time
    train_execution_time = end_time - start_time

    #Start time
    start_time = time.time()
    #Predicting on the testing features X_test
    y_pred_test = clf.predict(X_test)
    #End time
    end_time = time.time()
    #Execution Time
    test_execution_time = end_time - start_time

    #Calculating the accuracy score of the training data
    accuracy_train = accuracy_score(y_train, y_pred)
    #Calculating the precision score of the training data
    precision_train = precision_score(y_train, y_pred, average='micro')
    #Calculating the recall score of the training data
    recall_train = recall_score(y_train, y_pred, average='micro')
    #Calculating the f1 score of the training data
    f1_train = f1_score(y_train, y_pred, average='micro')

    #Calculating the accuracy score of the testing data
    accuracy_test = accuracy_score(y_test, y_pred_test)
    #Calculating the precision score of the testing data
    precision_test = precision_score(y_test, y_pred_test, average='micro')
    #Calculating the recall score of the testing data
    recall_test = recall_score(y_test, y_pred_test, average='micro')
    #Calculating the F1 score of the testing data
    f1_test = f1_score(y_test, y_pred_test, average='micro')

    # Save the results to dictionaries
    models[name] = clf
    #Confusion matrices, kept under separate keys so the test result does not overwrite the train result
    confusion[name] = {
        'train': confusion_matrix(y_train, y_pred),
        'test': confusion_matrix(y_test, y_pred_test),
    }
    #Classification reports of the training and testing data
    class_report[name] = {
        'train': classification_report(y_train, y_pred),
        'test': classification_report(y_test, y_pred_test),
    }

    # Appending the name of the model, training data results, fitting time and training prediction time
    results_train.append([name, accuracy_train, precision_train, recall_train, f1_train, run_time.best, train_execution_time])
    # Appending the name of the model, testing data results and testing prediction time
    results_test.append([name, accuracy_test, precision_test, recall_test, f1_test, test_execution_time])

#Converting the training results to a dataframe
results_train = pd.DataFrame(results_train, columns=['Classifier', 'Accuracy Train', 'Precision Train', 'Recall Train', 'F1 Train', 'Train Time', 'Predicting Time'])
results_train.set_index('Classifier', inplace=True)

#Converting the testing results to a dataframe
results_test = pd.DataFrame(results_test, columns=['Classifier', 'Accuracy Test', 'Precision Test', 'Recall Test', 'F1 Test', 'Predicting Time'])
results_test.set_index('Classifier', inplace=True)
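
The loop above uses `average='micro'` for precision, recall and F1. Because each tweet has exactly one label, these micro-averaged scores all equal the accuracy, so the four metric columns carry the same number. Given the class imbalance seen in the EDA, a macro average (the unweighted mean of the per-class scores) may be more informative. The standard-library sketch below illustrates the difference on invented toy labels; the function name `f1_scores` is hypothetical. In scikit-learn the corresponding option is `average='macro'`.

```python
def f1_scores(y_true, y_pred):
    """Return (accuracy, micro_f1, macro_f1) for single-label multiclass predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp = {c: 0 for c in labels}
    fp = {c: 0 for c in labels}
    fn = {c: 0 for c in labels}
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1   # predicted class gains a false positive
            fn[t] += 1   # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    accuracy = sum(tp.values()) / len(y_true)
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return accuracy, micro, macro

# Invented toy labels: a classifier that always predicts the majority class 1
y_true = [1, 1, 1, 1, 1, 1, 2, 2, 0, -1]
y_pred = [1] * 10
accuracy, micro, macro = f1_scores(y_true, y_pred)
print(accuracy, micro, macro)
```

A model that ignores the minority classes still scores well on accuracy and micro F1 here, while the macro F1 drops sharply, which is the behaviour one would want a metric to expose on this dataset.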

In [None]:
# split data

In [None]:
# create targets and features dataset

In [None]:
# create one or more ML models

In [None]:
# evaluate one or more ML models

<a id="eight"></a>
## 8. Model Performance
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Model performance ⚡ |
| :--------------------------- |
| In this section you are required to compare the relative performance of the various trained ML models on a holdout dataset and comment on what model is the best and why. |

---

In [None]:
# Compare model performance

In [None]:
# Choose best model and motivate why it is the best choice

<a id="nine"></a>
## 9. Model Explanations
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Model explanation ⚡ |
| :--------------------------- |
| In this section, you are required to discuss how the best performing model works in a simple way so that both technical and non-technical stakeholders can grasp the intuition behind the model's inner workings. |

---

In [None]:
# discuss chosen methods logic

<a id="ten"></a>
## 10. Conclusion
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

<a id="eleven"></a>
## 11. References
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>