# TEAM CW-4 ADVANCED CLASSIFICATION NOTEBOOK
This notebook contains the entire workflow of team CW-4's advanced classification predict. The project is a sentiment analysis of tweets that gauges people's perception of climate change. The results are intended to help marketing departments develop strategies based on people's views.

<a id="cont"></a>

## Table of contents

<a href=#one>1. Introduction</a>

<a href=#two>2. Importing Packages</a>

<a href=#three>3. Loading Data</a>

<a href=#four>4. Exploratory Data Analysis</a>

<a href=#five>5. Feature Engineering</a>

<a href=#six>6. Model Building</a>

<a href=#seven>7. Model Evaluation</a>

<a href=#eight>8. Model Selection</a>


 <a id="one"></a>
 ## 1. Introduction
 <a href=#cont>Back to Table of Contents</a>

In this notebook, we are going to go through the entire data science workflow to build models, analyze models and select the best model to solve our problem.

### 1.1 Problem Statement

We are challenged to determine people's perception of climate change: whether they believe that climate change is real and whether it is a threat. We shall create a machine learning model that uses natural language processing to determine a person's view on climate change based on their tweet data. We aim to come up with a viable model that can accurately classify people into those who believe and those who do not. With this, we will be able to offer insights to marketing departments on how well or badly their product will be received on the market based on its effects on the climate. This will help marketing teams plan how to run their campaigns in the future.

### 1.2 The dataset
We are provided with a dataset of tweets collected from 27/04/2015 to 21/02/2018. The dataset contains three features:

* sentiment - the class to which a tweet belongs, ranging from -1 to 2 (-1 Anti, 0 Neutral, 1 Pro, 2 News)

* message - The body of the tweet provided

* tweetid - a unique identifier for each tweet

The dataset is split into training data and test data, with the training data containing 15,819 of the 26,365 tweets (about 60%).



 <a id="two"></a>
 ## 2. Importing Packages
 <a href=#cont>Back to Table of Contents</a>

 ---
    
| ⚡ Description: Importing Packages ⚡ |
| :--------------------------- |
| In this section, necessary packages to be used throughout the notebook are imported, and briefly discussed. |
| The imported libraries are used in the following stages of the data science process : data cleaning, exploratory data analysis and data modelling. |

---

In [1]:
#!pip install wordcloud

In [2]:
# import libraries for use in loading data, EDA and data manipulation
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import nltk
import contractions
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
import advertools as adv
from wordcloud import WordCloud
import string
import urllib
from sklearn.utils import resample

nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('omw-1.4')

#import libraries for use in model development
import sklearn
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
# import libraries for use in model evaluation
from sklearn import metrics
from sklearn.metrics import classification_report

 <a id="three"></a>
 ## 3. Loading Data
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Loading the data ⚡ |
| :--------------------------- |
| In this section, the datasets to be used in the modelling process are loaded into DataFrames using the pandas library. |

---

In [None]:
#Load the csv file containing the training data using pandas
train_df = pd.read_csv("train.csv")

#Load the csv file containing the test data using pandas
test_df = pd.read_csv("test_with_no_labels.csv")


 <a id="four"></a>
 ## 4. Exploratory Data Analysis
 
 <a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Exploratory data analysis ⚡ |
| :--------------------------- |
| In this section, an in-depth analysis of all the variables in the DataFrame is performed. |
| This phase of the project cycle is very important; it offers insight into the data, and any underlying patterns within it, |
| as well as any errors, duplicates or outliers present. |
| It is essential in understanding the data objectively and guides the data pre-processing and modelling processes. |
| The investigations conducted include: the dimensionality of the data, the descriptive statistics, data completeness, | 
| data distribution, existence of outliers and duplicates, as well as tweet entity extraction, analysis and visualisation.|

---



### 4.1 Basic Analysis

In [None]:
# print out a section of the dataset
train_df.head()

In [None]:
# Check dataset shape
train_df.shape

The train dataset has 15,819 rows and 3 columns

In [None]:
# Summarize data
train_df.info()

Two of the columns are numeric: `sentiment` and `tweetid`, with `sentiment` being the encoded categorical target variable. The other column, `message`, is of object type. The output also shows that there are no null values in this dataset.

In [None]:
# look at data statistics - numeric type columns
train_df.describe()

The descriptive statistics of the numeric variables do not offer much insight. This is because the `tweetid` column is a unique column while the `sentiment` column is an encoded categorical column.

In [None]:
# look at data statistics - object type columns
train_df.describe(include=['O'])

The descriptive statistics of the `message` column suggest that there are duplicate tweets in the data. These may be retweets or copied tweets. The most common tweet in the dataset is a retweet (as indicated by the `RT` tag) and it appears 307 times throughout the dataset.

In [None]:
# Count of unique values in each column
train_df.nunique()

The `tweetid` column contains only unique values, while the `message` column has some duplicate values. The target variable, `sentiment`, contains 4 different class labels.

In [None]:
# Showcase the duplicate tweets in the message column
train_df.loc[train_df['message'].duplicated(keep=False)]

There are 1,908 duplicate tweets in the `message` column. These tweets might be duplicates but are all associated with different unique tweet IDs.

In [None]:
train_df.loc[train_df['message'].duplicated(keep=False) & train_df['message'].str.contains('RT')]

Out of all 1,908 duplicate tweets, the majority (i.e. 1,899) are retweets, as evidenced by the `RT` tag.

In [None]:
train_df.loc[train_df['message'].duplicated(keep=False) & ~train_df['message'].str.contains('RT')]

Out of all 1,908 duplicate tweets, 308 of them appear to be copied tweets.
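Note that `str.contains('RT')` matches the letters `RT` anywhere in a tweet, including inside words such as `SMART`. The following is a minimal sketch of a stricter check. It assumes that retweets start with `RT @user`, which is a common convention rather than something this dataset guarantees, and the toy DataFrame is invented purely for illustration.

```python
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "message": [
        "RT @user: climate change is real",
        "RT @user: climate change is real",
        "SMART cities cut emissions",
        "SMART cities cut emissions",
        "a unique tweet",
    ]
})

# Flag every row whose message occurs more than once
is_duplicate = toy["message"].duplicated(keep=False)
# Anchor the pattern at the start of the tweet and require a mention after RT
is_retweet = toy["message"].str.contains(r"^RT @\w+", regex=True)

duplicate_retweets = int((is_duplicate & is_retweet).sum())
duplicate_copies = int((is_duplicate & ~is_retweet).sum())
print(duplicate_retweets, duplicate_copies)
```

Because the two masks are complements of each other, the two counts always add up to the total number of duplicated rows.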

In [None]:
# Investigate feature symmetry
train_df.skew(numeric_only=True)

The `sentiment` column has a moderate negative skew as a result of the labels chosen for encoding the variable (-1, 0, 1 and 2), while `tweetid` has a fairly symmetrical distribution.

In [None]:
# Evaluate existence of outliers in the data using kurtosis
train_df.kurt(numeric_only=True)

Both numeric columns show no evidence of the existence of outliers.

In [None]:
# checking the distribution of tweets in the test and train datasets

length_train = train_df['message'].str.len().plot.hist(color = 'grey', figsize = (6, 4))
length_test = test_df['message'].str.len().plot.hist(color = 'orange', figsize = (6, 4))

### 4.2 Target Variable Distribution

The Pro label is the most frequent category in this dataset, with 8,530 tweets labeled 1 for supporting the belief in man-made climate change. The least frequent is the Anti label, with 1,296 tweets labeled -1 for not believing in man-made climate change.
This class imbalance will need to be remedied, for example with resampling techniques, so that the model does not simply favor the majority class.

In [None]:
# plot distribution plots of the target variable
f = sns.countplot(x='sentiment', data=train_df)
f.set_xticklabels(['Anti', 'Neutral', 'Pro', 'News'])
f.bar_label(f.containers[0])
plt.show() 

### 4.3 Copy Creation and Data Segmentation by Label

In [None]:
# Create a copy of the train data to use for extracting twitter-related information
df = train_df.copy()

In [None]:
# Get subsets of the data as per the label.
news_data = df[df['sentiment'] == 2]
pro_data = pd.DataFrame(df[df['sentiment'] == 1])
neutral_data = df[df['sentiment'] == 0]
anti_data = df[df['sentiment'] == -1]

### 4.4 Hashtag Extraction

In [None]:
# Extract all hashtags from the dataframe using advertools
hashtag_summary = adv.extract_hashtags(df['message'])
hashtag_summary['overview']

The data has a significant amount of hashtags. Hashtags are popular on Twitter and they are used to index and group tweets around a particular topic. It would be beneficial to explore the popular hashtags in this data and further determine the popular hashtags per tweet category. 
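For readers who do not have `advertools` installed, the idea can be approximated with a regular expression. This is a minimal sketch, not the method used in this notebook: the pattern `#(\w+)` is a simplification of Twitter's real hashtag rules, the function name `extract_hashtags` and the sample tweets are made up for illustration, and lowercasing is a choice made here so that `#Climate` and `#climate` are counted together.

```python
import re
from collections import Counter

# Simplified pattern (an assumption): '#' followed by word characters
HASHTAG_PATTERN = re.compile(r"#(\w+)")

def extract_hashtags(tweets):
    """Return a flat list of lowercased hashtags (without '#') from an iterable of tweets."""
    tags = []
    for tweet in tweets:
        tags.extend(tag.lower() for tag in HASHTAG_PATTERN.findall(tweet))
    return tags

# Sample tweets invented for illustration
sample_tweets = [
    "Watch #BeforeTheFlood tonight #climate",
    "RT @someone: #Climate action now",
    "No tags here",
]
top_tags = Counter(extract_hashtags(sample_tweets)).most_common(2)
print(top_tags)
```

Joining the returned list with spaces gives the same kind of string that the word cloud helper in this section expects.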

In [None]:
#Function to extract hashtags
def hashtag_extractor(data):
    
    """
    This function extracts all the hashtags from a collection of tweets using advertools.
        Input: a tweet column from a dataframe
        Output: a sequence of strings(hashtags) separated by space
    """ 
    hashtag_summary = adv.extract_hashtags(data)
    hashtags = hashtag_summary['hashtags_flat'] #Create a list of all the available hashtags.
    tags = (" ").join(hashtags) #Create a sequence of strings from the hashtags list
    
    return tags

In [None]:
# Function to create a wordcloud using the extracted entities
def wordcloud_visualizer(extracted_entity, color):
    
     
    """
    This function creates a wordcloud visual from a collection of extracted entities.
    It uses the WordCloud function from wordcloud to create a wordcloud of the extracted entities with a white background.
        Input: a sequence of strings of the extracted entity, preferred color scheme for the wordcloud
        Output: a wordcloud visual
    """  
    wordcloud = WordCloud(collocations = False, colormap = color, background_color = 'white').generate(extracted_entity)
    
    return wordcloud

In [None]:
# Use the hashtag extractor function to extract hashtags from the different subsets of the data
all_tags = hashtag_extractor(df['message'])
news_tags = hashtag_extractor(news_data['message'])
pro_tags = hashtag_extractor(pro_data['message'])
neutral_tags = hashtag_extractor(neutral_data['message'])
anti_tags = hashtag_extractor(anti_data['message'])

In [None]:
# Generate a wordcloud for all the hashtags available in the full data
full_data_cloud = wordcloud_visualizer(all_tags, 'brg')
# Display the generated Word Cloud
plt.imshow(full_data_cloud, interpolation='bilinear')
plt.axis("off")
plt.title('Popular Climate Change Hashtags')
plt.show()

* As would be expected, words such as: `climate`, `climatechange`, `environment`, `actonclimate` and `globalwarming` constitute the most popular hashtags in this data. As previously established, hashtags are used to index and group tweets around a particular topic and the aforementioned hashtags would be the most appropriate tags for the climate change topic.


* `BeforetheFlood` is among the most popular climate change hashtags. It refers to a film produced by Leonardo DiCaprio that was released on 21 October 2016. The film highlights the dangers of climate change and the possible solutions.


* `Imvotingbecause` and `Ivotedbecause` hashtags are prominent in the data as well. Climate change has become a political issue over the decades and is very central to American politics. These hashtags are used throughout the dataset to show support in the now-politicized climate change issue by voters.


* `COP22` is a popular hashtag and aligns with the timeframe of the collected data. It stands for the 22nd Session of the Conference of the Parties, the 2016 United Nations Climate Change Conference. It was an international meeting of political leaders and activists to discuss environmental issues, and was held in Marrakech, Morocco, on 7–18 November 2016. Naturally, this conference sparked a lot of debate and conversation on the topic of climate change. 


* `TheParisAgreement` hashtag aligns with the timeframe of the collected data as well. It is a legally binding international treaty on climate change that was adopted by 196 Parties at COP 21 in Paris, on 12 December 2015. The agreement covers climate change mitigation, adaptation, and finance.


In [None]:
# Display the generated wordclouds for the different labels
f, axarr = plt.subplots(2,2, figsize=(35,25))
axarr[0,0].imshow(wordcloud_visualizer(news_tags, 'summer'), interpolation="bilinear")
axarr[0,1].imshow(wordcloud_visualizer(pro_tags, 'Blues'), interpolation="bilinear")
axarr[1,0].imshow(wordcloud_visualizer(neutral_tags, 'Wistia'), interpolation="bilinear")
axarr[1,1].imshow(wordcloud_visualizer(anti_tags, 'gist_gray'), interpolation="bilinear")

# Remove the ticks on the x and y axes
for ax in f.axes:
    plt.sca(ax)
    plt.axis('off')

axarr[0,0].set_title('News label hashtags\n', fontsize=50)
axarr[0,1].set_title('Pro climate change hashtags\n', fontsize=50)
axarr[1,0].set_title('Neutral label hashtags\n', fontsize=50)
axarr[1,1].set_title('Anti climate change hashtags\n', fontsize=50)
plt.suptitle("Climate Change Hashtags by Label", fontsize = 100)
plt.tight_layout()
plt.show()



* `Trump` is a popular hashtag across the labels in the climate change tweets. Trump's administration saw a lot of controversial moves and statements around climate change. Most of his efforts were geared towards dismissing climate change and slowing down efforts to mitigate it, including withdrawal from the 2015 Paris Climate Change agreement, where 196 nations pledged to reduce greenhouse gas emissions and assist poor nations struggling with the consequences of global warming. Tweets including this hashtag are most likely centered around people's opinions, criticism and/or support of Trump's views on climate change.


* Anti climate change hashtags are laden with former president Trump's slogans and declarations, for example `DraintheSwamp` and `maga`, which was his slogan during the 2016 campaign and stands for Make America Great Again. The majority of the hashtags in the anti climate change subset of the data seem to be skeptical and to question the reality of the climate issue, e.g. `fakenews`, `myth`, `hoax`, `climatescam` and `greenscam`. The inclusion of `tcot`, meaning top conservatives on Twitter, suggests that anti climate change tweets are favored by Republican-leaning users.


* Pro, News and Neutral tweets seem to share more or less the same hashtags. Pro tweets include vote-related hashtags, e.g. `Imvotingbecause`, while News tweets include broadcast entities, e.g. `CNN` and `WorldNews`. The appearance of the `P2` (Progressives 2.0) hashtag on these labels suggests that these types of tweets are mostly favored by Democrat-leaning users and are used to show progressive political standpoints on Twitter.


### 4.5 Username Extraction

In [None]:
# Extract all mentions from the dataframe using advertools
mention_summary = adv.extract_mentions(df['message'])
mention_summary['overview']

The data has a significant amount of mentions. These are most-likely politicians and celebrities who have made public their opinions on climate change and are sparking a lot of conversation on the issue.

In [None]:
#Function to extract mentions
def mentions_extractor(data):
    
    """
    This function extracts all the mentions from a collection of tweets using advertools.
        Input: a tweet column from a dataframe
        Output: a sequence of strings(mentions) separated by space
    """ 
    mentions_summary = adv.extract_mentions(data)
    mentions = mentions_summary['mentions_flat'] #Create a list of all the available mentions.
    usernames = (" ").join(mentions) #Create a sequence of strings from the mentions list
    
    return usernames

In [None]:
# Use the mentions extractor function to extract mentions from the different subsets of the data
all_mentions = mentions_extractor(df['message'])
news_mentions = mentions_extractor(news_data['message'])
pro_mentions = mentions_extractor(pro_data['message'])
neutral_mentions = mentions_extractor(neutral_data['message'])
anti_mentions = mentions_extractor(anti_data['message'])

In [None]:
# Generate a wordcloud for all the mentions available in the full data
full_data_cloud = wordcloud_visualizer(all_mentions, 'brg')
# Display the generated Word Cloud
plt.imshow(full_data_cloud, interpolation='bilinear')
plt.axis("off")
plt.title('Popular Usernames Mentioned')
plt.show()

* `Stephen Schlegel` is among the most frequently mentioned usernames in this data. His tweet, a quip aimed at former first lady Melania Trump about her husband's beliefs on climate change, generated a lot of excitement and received a lot of retweets.


* The most frequently mentioned users are either politicians or celebrities who have made remarks on climate change that have been met by criticism, support or both by the general public. Celebrities include: `Leo Dicaprio`, `Seth Macfarlane` and `D D Lovato`; politicians include: `Donald Trump`, `Bernie Sanders` and `Kamala Harris`; and Journalists such as `Kurt Eichenwald`. 


* The other most mentioned entities are broadcast and news channels eg `CNN`, newspaper publications eg `NYTimes` and magazines such as `Mother Jones`

In [None]:
# Display the generated wordclouds for the different labels
f, axarr = plt.subplots(2,2, figsize=(35,25))
axarr[0,0].imshow(wordcloud_visualizer(news_mentions, 'summer'), interpolation="bilinear")
axarr[0,1].imshow(wordcloud_visualizer(pro_mentions, 'Blues'), interpolation="bilinear")
axarr[1,0].imshow(wordcloud_visualizer(neutral_mentions, 'Wistia'), interpolation="bilinear")
axarr[1,1].imshow(wordcloud_visualizer(anti_mentions, 'gist_gray'), interpolation="bilinear")

# Remove the ticks on the x and y axes
for ax in f.axes:
    plt.sca(ax)
    plt.axis('off')

axarr[0,0].set_title('News label mentions\n', fontsize=50)
axarr[0,1].set_title('Pro climate change mentions\n', fontsize=50)
axarr[1,0].set_title('Neutral label mentions\n', fontsize=50)
axarr[1,1].set_title('Anti climate change mentions\n', fontsize=50)
plt.suptitle("Climate Change mentions by Label", fontsize = 100)
plt.tight_layout()
plt.show()



* `Donald Trump` is the most mentioned person throughout the labels. This could be because of his strong opinions on climate change that are met by equally strong opposition or support by Twitter users.


* The news label is characterized by mentions targeting news/information outlets; most prominent being `The Hill`, `NY Times`, `Reuters`, `Washington Post` and `Independent`, all which have exemplary climate news coverage and  Twitter users are likely tagging them or following developing climate change news from them.


* The Pro label is characterized by mentions of people who are actively pro climate change, while the Anti label is characterized by mentions of climate change denialists, e.g. `Steve Goddard` and `Dinesh D'souza`, who have often remarked that climate change is a hoax. Users are likely trying to interact with those who share their views on climate change.

In [None]:
# This code extracts all the words in the message column of the dataframe
# Create a regular-expressions tokenizing instance (raw string avoids an invalid escape warning)
regexp = RegexpTokenizer(r'\w+')
# Create a new column of tweet tokens
df['text_token']=df['message'].apply(regexp.tokenize)
# Make a list of english stopwords (named stop_words so the imported stopwords module is not overwritten)
stop_words = stopwords.words("english")
# Remove stopwords
df['text_token'] = df['text_token'].apply(lambda x: [item for item in x if item not in stop_words])
# Convert to a sequence of strings and keep words longer than one character
df['text_token'] = df['text_token'].apply(lambda x: ' '.join([item for item in x if len(item) > 1]))
# Create a single string of all words
all_words = ' '.join([word for word in df['text_token']])

In [None]:
word_cloud = WordCloud(collocations = False, colormap = 'brg', background_color = 'white').generate(all_words)
# Display the generated Word Cloud
plt.imshow(word_cloud, interpolation='bilinear')
plt.axis("off")
plt.show()

* `https` occurs frequently, implying that many links are being shared around the topic of climate change, most probably by the news label. 


* The tag `RT`, which implies a retweet, appears frequently, meaning that there are a lot of shared opinions within the data. This is expected when groups of people share the same sentiments.


* Other common words include climate-specific vocabulary for example: `climate`, `change`, `warming` and `global`

 <a id="five"></a>
 ## 5. Feature Engineering
<a href=#cont>Back to Table of Contents</a>

### 5.1 Text Cleaning

Removing noise (i.e. unnecessary information) is a key part of getting the data into a usable format. For this dataset, we will be carrying out the following cleaning steps:

* replacing web URLs with a placeholder

* removing duplicates

* removing extra whitespace

* removing retweet tags

* removing numbers

* converting all text into lowercase

* removing punctuation marks

* expanding contractions

### 5.1.1 Remove web urls

At this point, it is important that we clean our text and remove the noise in order to make it usable. The first thing we will do is replace the URL links with the string `url-web`.

In [None]:
#let us check for rows with URL links
pattern_url = r'http[s]?://(?:[A-Za-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9A-Fa-f][0-9A-Fa-f]))+'
train_df.loc[train_df['message'].str.contains(pattern_url, regex=True )] 

In [None]:
#replace the url links with the text 'url-web'
subs_url = r'url-web'
train_df['message'] = train_df['message'].replace(to_replace = pattern_url, value = subs_url, regex = True)
train_df.loc[train_df['message'].str.contains(subs_url, regex=True )]

In [None]:
train_df.loc[train_df['message'].str.contains('https')]

### 5.1.2 Drop duplicate tweets

In [None]:
train_df.drop_duplicates(subset = 'message', keep = 'first',inplace = True) # drop all duplicate tweets and keep only the first occurrence

### 5.1.3 Remove extra space from each Tweet

In [None]:
train_df['message'] = train_df['message'].str.replace(r'\s\s+', ' ', regex=True) # collapse extra whitespace into a single space

### 5.1.4 Remove the retweet tags from tweets

In [None]:
train_df['message'] = train_df['message'].str.replace('RT', '') 


### 5.1.5 Remove numbers from tweets

In [None]:
train_df['message'] = train_df['message'].str.replace(r'\d+', '', regex=True) # remove numbers

### 5.1.6 Convert the text into lower case

In [None]:
train_df['message'] = train_df['message'].str.lower() 


### 5.1.7 Remove punctuation marks from the dataset

In [None]:
def punc_remover(message):
    return ''.join([l for l in message if l not in string.punctuation])

train_df['message'] = train_df['message'].apply(punc_remover)

### 5.1.8 Expand contractions in tweets

In [None]:
train_df['message']=train_df['message'].apply(lambda x: [contractions.fix(word) for word in x.split()])
train_df['message'] = train_df['message'].apply(lambda x: ' '.join([item for item in x]))

In [None]:
# Preview clean dataset
train_df

### Cleaning the Test Data

Let us carry out the same cleaning we did on the training dataset. Unlike in the training data, we will not drop duplicate rows in the test dataset, because for the Kaggle competition our submission must have `10546` row entries.

In [None]:
test_df.shape

In [None]:
pattern_url = r'http[s]?://(?:[A-Za-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9A-Fa-f][0-9A-Fa-f]))+'
test_df.loc[test_df['message'].str.contains(pattern_url, regex=True )]

subs_url = r'url-web'
test_df['message'] = test_df['message'].replace(to_replace = pattern_url, value = subs_url, regex = True)
test_df.loc[test_df['message'].str.contains(subs_url, regex=True )]

In [None]:
test_df['message'] = test_df['message'].str.replace('RT', '') # remove the retweet tag from tweets
test_df['message'] = test_df['message'].str.lower() # convert the text into lower case
test_df['message'] = test_df['message'].str.replace(r'\s\s+', ' ', regex=True) # collapse extra whitespace into a single space
test_df['message'] = test_df['message'].str.replace(r'\d+', '', regex=True) # remove numbers
test_df['message'] = test_df['message'].apply(lambda x: [contractions.fix(word) for word in x.split()])
test_df['message'] = test_df['message'].apply(lambda x: ' '.join([item for item in x]))

In [None]:
# remove punctuation marks from the dataset
def punc_remover(message):
    return ''.join([l for l in message if l not in string.punctuation])

test_df['message'] = test_df['message'].apply(punc_remover)
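The training and test sets are cleaned in separate cells, and the steps do not run in the same order in both. One way to keep the two in sync is to put the steps in a single function and apply it to both DataFrames. The sketch below is an illustration, not the notebook's actual pipeline: the name `clean_tweet` is hypothetical, the URL pattern is simpler than the one used above, `RT` is removed only as a whole word, and the contraction step is left out so the example runs with the standard library alone.

```python
import re
import string

# Simpler than the notebook's URL pattern (an assumption made for this sketch)
URL_PATTERN = re.compile(r"https?://\S+")
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def clean_tweet(text):
    """Apply the cleaning steps in one fixed order."""
    text = URL_PATTERN.sub("url-web", text)   # replace links with a placeholder
    text = re.sub(r"\bRT\b", "", text)        # drop the retweet tag as a whole word only
    text = re.sub(r"\d+", "", text)           # drop digits
    text = text.lower()                       # lowercase
    text = text.translate(PUNCT_TABLE)        # strip punctuation ('url-web' becomes 'urlweb')
    text = re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace to one space
    return text

print(clean_tweet("RT @User: SMART move!  See https://t.co/abc123 in 2016"))
```

With a function like this, both datasets could be cleaned the same way, for example `train_df['message'].apply(clean_tweet)` and `test_df['message'].apply(clean_tweet)`.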

### 5.2 Dealing with Class Imbalance
Class imbalance occurs when the observations are unevenly distributed across the class labels. To understand if and why we should correct any imbalance in our dataset, let's quickly take a look at our label.

In [None]:
# Separate minority and majority classes
news = train_df[train_df['sentiment']==2]
pro = train_df[train_df['sentiment']==1]
neutral = train_df[train_df['sentiment']==0]
anti = train_df[train_df['sentiment']==-1]

In [None]:
# List the labels in the same order as the bar heights
labels = [2, 1, 0, -1]
heights = [len(news),len(pro),len(neutral),len(anti)]
plt.bar(labels,heights,color='orange')
plt.xticks(labels,['news','pro','neutral','anti'])
plt.ylabel("# of observations")
plt.show()

As we can see, there is a clear imbalance in our label. This is a problem as it can affect the accuracy of our final model. There are three possible approaches we can use to correct this. 

1. Upsampling the minority class(es)
2. Downsampling the majority class(es)
3. Upsample minority class + downsample majority class(es)

For this dataset, we are going to use approach 3, which avoids both discarding most of the majority class and heavily duplicating the minority classes. This technique involves:

1. Establishing a **class size** (i.e. the number of observations we want in each class). For this approach to work, the **class size** has to be a value between the size of the majority class and the size of the minority class. A good heuristic to use here, is to **set the class size to be half the size of the majority class**.

2. Downsampling the majority class to be as small as the **class size**.

3. Upsampling the minority class to be as big as the **class size**.

We apply these steps to every class in a loop:

In [None]:
# set the class size to half the size of the majority class
class_size=len(pro)/2

In [None]:
resampled_classes=[]

for label in list(train_df['sentiment'].unique()):
    label_data = train_df[train_df['sentiment'] == label]
    
    # compare the number of rows in the class (not the label value) with the class size
    if len(label_data) < class_size:
        # upsample: sample with replacement until the class reaches class_size
        label_resampled = resample(label_data,
                                   replace=True,
                                   n_samples=int(class_size),
                                   random_state=27) # reproducible results
    else:      
        # downsample: sample without replacement (no need to duplicate observations)
        label_resampled = resample(label_data,
                                   replace=False,
                                   n_samples=int(class_size),
                                   random_state=27) # reproducible results
    resampled_classes.append(label_resampled)


resampled_data = pd.concat(resampled_classes, axis=0)  
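The resampling steps can also be wrapped in a reusable function. This is a minimal sketch under stated assumptions: the name `balance_classes` and the toy DataFrame are made up for illustration, and the function decides between upsampling and downsampling by comparing each class's row count with `class_size`.

```python
import pandas as pd
from sklearn.utils import resample

def balance_classes(data, label_col, class_size, random_state=27):
    """Resample every class in `data` to exactly `class_size` rows."""
    parts = []
    for _, group in data.groupby(label_col):
        parts.append(
            resample(
                group,
                # sample with replacement only when the class is too small
                replace=len(group) < class_size,
                n_samples=class_size,
                random_state=random_state,
            )
        )
    return pd.concat(parts, axis=0)

# Toy data, invented for illustration: 8 pro, 3 neutral and 1 anti tweet
toy = pd.DataFrame({
    "sentiment": [1] * 8 + [0] * 3 + [-1] * 1,
    "message": list("abcdefghijkl"),
})
balanced = balance_classes(toy, "sentiment", class_size=4)
print(balanced["sentiment"].value_counts().to_dict())
```

Every class ends up with four rows. The majority class is sampled without replacement, so none of its rows is repeated.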

In [None]:
#let's take a look at our new labels
# List the labels in the same order as the bar heights
labels = [2, 1, 0, -1]
heights = [len(news),len(pro),len(neutral),len(anti)]
resampled_heights= [len(resampled_data[resampled_data['sentiment']==2]),
                    len(resampled_data[resampled_data['sentiment']==1]),
                    len(resampled_data[resampled_data['sentiment']==0]),
                    len(resampled_data[resampled_data['sentiment']==-1])]
plt.bar(labels,heights,color='orange')
plt.bar(labels,resampled_heights,color='grey')
plt.xticks(labels,['news','pro','neutral','anti'])
plt.ylabel("# of observations")
plt.legend(['original','resampled'])
plt.show()

At this point, we have successfully balanced our classes. Let's go ahead with our modelling.

### 5.3 Variable Creation

At this point, let us extract our features and labels for our modelling

In [None]:
corpus=resampled_data['message']
y=resampled_data['sentiment']
X_test=test_df['message']

In [None]:
vect = TfidfVectorizer(max_df=0.9, min_df=1, ngram_range=(1, 3))
# fit the TF-IDF vectorizer to the training corpus and store the result in X;
# the test messages are transformed with the same fitted vocabulary
X = vect.fit_transform(corpus)
X_test_df = vect.transform(X_test)

print(X.shape)
print(X_test_df.shape)
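To make the difference between `fit_transform` and `transform` concrete, here is a small sketch on an invented two-sentence corpus. The variable names (`toy_vect`, `train_texts`, `new_texts`) are illustrative, and `ngram_range=(1, 2)` is used instead of the notebook's `(1, 3)` to keep the vocabulary short.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny corpora, invented for illustration
train_texts = ["climate change is real", "climate change is a hoax"]
new_texts = ["global warming is real"]

toy_vect = TfidfVectorizer(ngram_range=(1, 2))
X_toy = toy_vect.fit_transform(train_texts)  # learns the vocabulary and IDF weights
X_new = toy_vect.transform(new_texts)        # reuses them; unseen terms are ignored

print(sorted(toy_vect.vocabulary_))
print(X_toy.shape, X_new.shape)
```

Both matrices have the same number of columns because the vocabulary is learned from the training texts only. Words that appear only in the new text, such as `global`, do not get a column.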

 <a id="six"></a>
 ## 6. Model Building
 <a href=#cont>Back to Table of Contents</a>

Finally! The sweet stuff!

In this section, we shall;

* Build machine learning models

* Fit the machine learning models with training data

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.1, random_state = 42)
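One point worth keeping in mind: because upsampling duplicates rows, splitting after resampling can place copies of the same tweet in both the training and the validation split, which tends to make validation scores look better than they are. A possible alternative, sketched here on invented toy data and not the approach taken in this notebook, is to split first and resample only the training part.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Toy data, invented for illustration: 14 majority and 6 minority tweets
toy = pd.DataFrame({
    "message": [f"tweet {i}" for i in range(20)],
    "sentiment": [1] * 14 + [0] * 6,
})

# 1. Hold out validation rows before any resampling
train_part, valid_part = train_test_split(
    toy, test_size=0.2, stratify=toy["sentiment"], random_state=42
)

# 2. Upsample the minority class inside the training part only
minority = train_part[train_part["sentiment"] == 0]
majority = train_part[train_part["sentiment"] == 1]
minority_up = resample(minority, replace=True, n_samples=len(majority), random_state=27)
train_balanced = pd.concat([majority, minority_up])

# No validation tweet appears in the balanced training data
overlap = set(train_balanced["message"]) & set(valid_part["message"])
print(len(overlap))
```

The validation split keeps the original class proportions, so scores measured on it reflect the imbalance the model will face on unseen data.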

### 6.1 Tree Classification Model

In [None]:
# initialize machine learning models
tree = DecisionTreeClassifier(random_state=42)

#fit the model
tree.fit(X_train, y_train)

#lets predict the label for our test set

pred_tree= tree.predict(X_test) 

Let's build more models!

### 6.2 Logistic regression model

In [None]:
from sklearn.metrics import accuracy_score

# Training the logistic regression model on our rebalanced data
logreg = LogisticRegression(multi_class='ovr')
logreg.fit(X_train, y_train)

# Generate predictions
pred_lr = logreg.predict(X_test)

# calculating accuracy score (stored under a new name so the imported function is not overwritten)
lr_accuracy = accuracy_score(y_test, pred_lr)
print('accuracy score : ', lr_accuracy)

### 6.3 Random forest classification model

In [None]:
rf= RandomForestClassifier()
rf.fit(X_train, y_train)

pred_rf = rf.predict(X_test)

### 6.4 SVC model

In [None]:
svc = SVC()
svc.fit(X_train, y_train)

pred_svc = svc.predict(X_test)

### 6.5 K Nearest Neighbors

In [None]:
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)

In [None]:
pred_knn = knn.predict(X_test)

### 6.6 Linear SVC

In [None]:
svm = LinearSVC()
svm.fit(X_train, y_train)  
pred = svm.predict(X_test)

In [None]:
print(classification_report(y_test, pred))

In [None]:
lsvc= LinearSVC(C=100)
lsvc.fit(X_train, y_train)

In [None]:

# Generate predictions from full model
pred_lsvc = lsvc.predict(X_test)

 <a id="seven"></a>
 ## 7. Model Evaluation
 <a href=#cont>Back to Table of Contents</a>

In this section, we shall evaluate the previously built models on various performance metrics such as:

* F1 score

* Accuracy

* Precision 

* Recall
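To show how these metrics behave per class, here is a small sketch with invented labels. The values in `y_true` and `y_pred` are made up for illustration. `zero_division=0` is set so that a class that is never predicted gets a precision of 0 instead of triggering a warning.

```python
from sklearn.metrics import precision_recall_fscore_support

# Invented labels: three pro (1), two neutral (0) and one anti (-1) tweet
y_true = [1, 1, 1, 0, 0, -1]
y_pred = [1, 1, 0, 0, 0, 1]

class_labels = [-1, 0, 1]
precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred, labels=class_labels, zero_division=0
)
for label, p, r, f, n in zip(class_labels, precision, recall, f1, support):
    print(f"label {label:>2}: precision={p:.2f} recall={r:.2f} f1={f:.2f} support={n}")
```

In this toy case the anti class is never predicted, so its precision, recall and F1 are all 0 even though most predictions overall are correct. This is the kind of per-class detail that a single overall score hides.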

In [None]:
print('Tree classification model')
print(classification_report(y_test, pred_tree))
print('\n')

print('logistic regression model (no selection)')
print(classification_report(y_test, pred_lr))
print('\n')

print('Random Forest Classification')
print(classification_report(y_test, pred_rf))
print('\n')

print('SVC model')
print(classification_report(y_test, pred_svc))
print('\n')


print('KNN Classification model')
print(classification_report(y_test, pred_knn))
print('\n')

print('Linear SVC Classification model')
print(classification_report(y_test, pred_lsvc))
print('\n')

In [None]:
from sklearn.metrics import accuracy_score
# calculating accuracy score (stored under a new name so the imported function is not overwritten)
lr_accuracy = accuracy_score(y_test, pred_lr)
print('accuracy score : ', lr_accuracy)


 <a id="eight"></a>
 ## 8. Model Selection
 <a href=#cont>Back to Table of Contents</a>
 
We shall select the best-performing model based on the micro-averaged F1 scores of the models (the `average='micro'` setting used in the next cell).
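For single-label multiclass problems, the micro-averaged F1 score is numerically equal to accuracy, while the macro average weights every class equally and is more sensitive to poorly predicted minority classes. A small sketch with invented labels illustrates the difference (`y_true` and `y_pred` are made up for this example):

```python
from sklearn.metrics import accuracy_score, f1_score

# Invented labels: three pro (1), two neutral (0) and one anti (-1) tweet
y_true = [1, 1, 1, 0, 0, -1]
y_pred = [1, 1, 0, 0, 0, 1]

acc = accuracy_score(y_true, y_pred)
micro = f1_score(y_true, y_pred, average="micro")
macro = f1_score(y_true, y_pred, average="macro", zero_division=0)
print(acc, micro, macro)
```

Here the micro F1 matches the accuracy, while the macro F1 is lower because the anti class is never predicted correctly.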

In [None]:
from sklearn.metrics import f1_score

# Names end in _f1 so that the imported SVC and LinearSVC classes are not overwritten.
# pos_label is dropped because it only applies when average='binary'.
tree_f1 = f1_score(y_test, pred_tree, average='micro')
log_regression_f1 = f1_score(y_test, pred_lr, average='micro')
random_forest_f1 = f1_score(y_test, pred_rf, average='micro')
svc_f1 = f1_score(y_test, pred_svc, average='micro')
knn_f1 = f1_score(y_test, pred_knn, average='micro')
linear_svc_f1 = f1_score(y_test, pred_lsvc, average='micro')

In [None]:
results_dict={'F1_Score':
              {
           'tree_classifier':tree_f1,
           'log_regression':log_regression_f1,
           'random_forest':random_forest_f1,
           'SVC':svc_f1,
           'KNN' : knn_f1,
           'LinearSVC' : linear_svc_f1,
              }
             }

results_df = pd.DataFrame(data=results_dict)
# View the results
results_df

In [None]:
px.bar(results_df, y =results_df['F1_Score'],
       color = results_df.index, width =700, height=400)

## Conclusion
From our model evaluation, it looks like our logistic regression model performed the best (with an F1 score of 87%).

It is possible for this model to perform even better if we carry out further feature engineering and tuning.

In [None]:
# Generate predictions
pred_test = logreg.predict(X_test_df)

In [None]:
#pred_test=tree.predict(X_df)

In [None]:
# Create csv file

tweet_id=test_df['tweetid']
model_test_df = pd.DataFrame({'sentiment':pred_test,'tweetid':tweet_id})


model_test_df.to_csv('submission.csv', index=False)

model_test_df.head(10)

In [None]:
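from sklearn.model_selection import GridSearchCV

# Tune the regularisation strength C of LinearSVC with a cross-validated grid search
param_grid = {'C': [0.001, 0.01, 0.1, 1]}
grid = GridSearchCV(LinearSVC(), param_grid)
grid.fit(X_train, y_train)

# Keep the best fitted model so that the lsvc pickled below is a trained estimator
lsvc = grid.best_estimator_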
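A grid search can also be run over a pipeline, so that the vectorizer is refitted inside each cross-validation fold. The sketch below is an illustration on invented data, not the configuration used in this notebook: the texts, labels, step names (`tfidf`, `clf`), `cv=2` and the `f1_micro` scoring are all assumptions chosen to keep the example small.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Tiny labelled corpus, invented for illustration
texts = ["climate change is real", "global warming is a hoax",
         "we must act on climate", "climate scam and fake news",
         "science says warming is real", "hoax pushed by the media",
         "act now to save the planet", "no warming just fake data"]
labels = [1, -1, 1, -1, 1, -1, 1, -1]

pipe = Pipeline([("tfidf", TfidfVectorizer()), ("clf", LinearSVC())])
# Parameters of a pipeline step are addressed as <step name>__<parameter>
param_grid = {"clf__C": [0.01, 0.1, 1]}

search = GridSearchCV(pipe, param_grid, cv=2, scoring="f1_micro")
search.fit(texts, labels)

print(search.best_params_)
best_model = search.best_estimator_  # refitted on all the data with the best C
print(best_model.predict(["climate hoax"]))
```

Since the fitted pipeline accepts raw text, the vectorizer and the classifier can then be saved and reused as a single object.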

In [None]:
import pickle

In [None]:
with open('lsvc_pkl','wb') as files:
    pickle.dump(lsvc,files)

In [None]:
with open('svc_pkl','wb') as files:
    pickle.dump(svc,files)   

In [None]:
with open('rf_pkl','wb') as files:
    pickle.dump(rf,files)

In [None]:
with open('vect_pkl','wb') as files:
    pickle.dump(vect,files)
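The saved files can later be loaded with `pickle.load` to make predictions on new tweets. Below is a minimal round-trip sketch on invented data. The toy texts, the `bundle_pkl` file name and the choice to store the vectorizer and model together in one dictionary are assumptions made for illustration. Pickled estimators should only be loaded from trusted sources and with a compatible scikit-learn version.

```python
import os
import pickle
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny labelled corpus, invented for illustration
texts = ["climate change is real", "climate change is a hoax",
         "act on climate now", "global warming hoax"]
labels = [1, -1, 1, -1]

toy_vect = TfidfVectorizer().fit(texts)
toy_model = LogisticRegression().fit(toy_vect.transform(texts), labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "bundle_pkl")
    # Save the vectorizer and the model together so they cannot get out of sync
    with open(path, "wb") as f:
        pickle.dump({"vect": toy_vect, "model": toy_model}, f)
    # Load them back as a deployed application would
    with open(path, "rb") as f:
        bundle = pickle.load(f)

new_tweets = ["climate hoax exposed"]
restored_pred = bundle["model"].predict(bundle["vect"].transform(new_tweets))
original_pred = toy_model.predict(toy_vect.transform(new_tweets))
print(restored_pred, original_pred)
```

The restored objects give the same prediction as the originals, which is a quick sanity check to run after saving a model.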