# Advanced Classification Predict Student Solution

© Explore Data Science Academy

---
### Honour Code

We {**TEAM 18**}, confirm - by submitting this document - that the solutions in this notebook are a result of our own work and that we abide by the [EDSA honour code](https://drive.google.com/file/d/1QDCjGZJ8-FmJE3bZdIQNwnJyQKPhHZBn/view?usp=sharing).

Non-compliance with the honour code constitutes a material breach of contract.

### Predict Overview: Twitter Sentiment Classification

Many companies are built around lessening one’s environmental impact or carbon footprint. They offer products and services that are environmentally friendly and sustainable, in line with their values and ideals. They would like to determine how people perceive climate change and whether or not they believe it is a real threat. This would add to their market research efforts in gauging how their product/service may be received.

With this context, EDSA is challenging you during the Classification Sprint with the task of creating a Machine Learning model that is able to classify whether or not a person believes in climate change, based on their novel tweet data.

Providing an accurate and robust solution to this task gives companies access to a broad base of consumer sentiment, spanning multiple demographic and geographic categories - thus increasing their insights and informing future marketing strategies.


### The evaluation metric
Mean F1-Score. The F1 score, commonly used in information retrieval, measures performance using the statistics precision and recall.

Precision is the ratio of true positives to all predicted positives. Recall is the ratio of true positives to all actual positives.
The F1 metric weights recall and precision equally, and a good retrieval algorithm will maximize both precision and recall simultaneously. Thus, moderately good performance on both will be favored over extremely good performance on one and poor performance on the other.
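
The definitions above can be sketched in plain Python. This is only an illustration: it assumes that "Mean F1-Score" means the unweighted (macro) average of the per-class F1 scores, which the description does not state explicitly, and the helper names `f1_per_class` and `mean_f1` are hypothetical.

```python
def f1_per_class(y_true, y_pred, label):
    """F1 score for one class, treating `label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    # Harmonic mean: weights precision and recall equally
    return 2 * precision * recall / (precision + recall)


def mean_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores (assumed macro averaging)."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_per_class(y_true, y_pred, l) for l in labels) / len(labels)


# Toy labels, not taken from the competition data
y_true = [1, 1, 0, -1]
y_pred = [1, 0, 0, -1]
print(round(mean_f1(y_true, y_pred), 4))  # 0.7778
```

In this toy case, classes 1 and 0 each score about 0.667 (one has perfect precision but half recall, the other the reverse), while class -1 scores 1.0.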

### Submission Format
For every tweet in the dataset, submission files should contain two columns: tweetid and sentiment. sentiment should be a space-delimited list. Every tweetid will have a sentiment, as per your prediction. Refer to the Description page for more information about the valid classes in the sentiment column.

The file should contain a header and have the following format:

tweetid,sentiment
35326,1
15327,-1
54232,0 

<a id="cont"></a>

## Table of Contents

<a href=#one>1. Importing Packages</a>

<a href=#two>2. Loading Data</a>

<a href=#three>3. Exploratory Data Analysis (EDA)</a>

<a href=#four>4. Data Engineering</a>

<a href=#five>5. Modelling</a>

<a href=#six>6. Model Performance</a>

<a href=#seven>7. Model Explanations</a>

 <a id="one"></a>
## 1. Importing Packages
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Importing Packages ⚡ |
| :--------------------------- |
| In this section you are required to import, and briefly discuss, the libraries that will be used throughout your analysis and modelling. |

---

In [1]:
# Libraries for data loading, data manipulation and data visualisation
import re
import string
import urllib
from string import printable

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns

# set plot style
sns.set()

# Libraries for text preprocessing and feature extraction
import nltk
from nltk import SnowballStemmer, PorterStemmer, LancasterStemmer
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize, TreebankWordTokenizer
from nltk.util import ngrams
from sklearn.feature_extraction.text import CountVectorizer


nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('omw-1.4')



[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

<a id="two"></a>
## 2. Loading the Data
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Loading the data ⚡ |
| :--------------------------- |
| In this section you are required to load the data from the `df_train` file into a DataFrame. |

---

In [2]:
# Loading the train and test data
train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test_with_no_labels.csv') 

<a id="three"></a>
## 3. Exploratory Data Analysis (EDA)
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Exploratory data analysis ⚡ |
| :--------------------------- |
| In this section, you are required to perform an in-depth analysis of all the variables in the DataFrame. |

---


In [3]:
# Look at a sample from the top of the dataset
train_data.head()

In [4]:
# We'll print off a list of all the tweetid which are present in this dataset.
## tweetid = list(train_data.tweetid.unique())

## Text Cleaning

### Removing Noise

In text analytics, removing noise (i.e. unnecessary information) is a key part of getting the data into a usable format. Some techniques are standard, but your own data will require some creative thinking on your part.

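For the train dataset we will apply the following steps:

- removing the web URLs
- removing the text before and including the colon
- removing @mentions and #hashtags
- making everything lower case
- removing punctuation
- removing non-printable (non-ASCII) characters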

In [5]:
#removing web-urls and replacing with empty ''
pattern_url = r'http[s]?://(?:[A-Za-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9A-Fa-f][0-9A-Fa-f]))+'
subs_url = ''
train_data['message'] = train_data['message'].replace(to_replace = pattern_url, 
                                            value = subs_url, regex = True)

In [6]:
# removing the text before and including the last colon (`.+` is greedy)
train_data['message'] = train_data['message'].replace(r'^.+:', '', regex=True)

In [7]:
# removing words that start with '@' (mentions)
pattern = r'@\w+'
subs = ''
train_data['message'] = train_data['message'].replace(to_replace = pattern, 
                                            value = subs, regex = True)



In [8]:
# removing words that start with '#' (hashtags)
patt = r'#\w+'
sub = ''
train_data['message'] = train_data['message'].replace(to_replace = patt, 
                                            value = sub, regex = True)

In [9]:
# transforming to lower case
train_data['message'] = train_data['message'].str.lower()
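
To see what the steps above do to a single tweet, here is a minimal sketch that applies the same regular expressions with `re.sub` instead of the pandas `replace` calls. The function name `clean_tweet` and the sample tweet are hypothetical and are not taken from the dataset.

```python
import re

# Same URL pattern as in the notebook cell above
URL_PATTERN = r'http[s]?://(?:[A-Za-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9A-Fa-f][0-9A-Fa-f]))+'


def clean_tweet(text):
    """Apply the notebook's noise-removal steps, in the same order, to one string."""
    text = re.sub(URL_PATTERN, '', text)  # drop web URLs
    text = re.sub(r'^.+:', '', text)      # drop everything up to the last colon
    text = re.sub(r'@\w+', '', text)      # drop @mentions
    text = re.sub(r'#\w+', '', text)      # drop #hashtags
    return text.lower()                   # lower-case what is left


# Made-up example tweet
sample = "RT @user: Climate change is REAL! https://t.co/abc123 #ActOnClimate"
print(clean_tweet(sample).split())  # ['climate', 'change', 'is', 'real!']
```

The order matters: the URL is removed first, so the colon in `https:` can no longer be matched by the greedy `^.+:` pattern. A tweet whose only remaining text sits before its last colon will come out empty.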

#### Remove punctuation

- The text was converted to lower case in the previous step to remove noise from capitalisation; next we strip all punctuation characters.

In [10]:
#creating a function that removes punctuation from the data frame
def remove_punctuation(post):
    return ''.join([l for l in post if l not in string.punctuation])

In [11]:
train_data['message'] = train_data['message'].apply(remove_punctuation)

In [12]:
## train_data['message'].str.encode('ascii', 'ignore').str.decode('ascii')

In [13]:
# removing all non-printable (non-ASCII) characters, such as emojis, keeping only those in string.printable
st = set(printable)
train_data['message']= train_data['message'].apply(lambda x: ''.join(
    ["" if  i not in  st else i for i in x]))
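
The two filters above can be checked on a short string. The sketch below is illustrative only: `keep_printable` is a hypothetical name for the lambda used in the cell, and the sample text is made up.

```python
import string


def remove_punctuation(post):
    # Drops ASCII punctuation only; curly quotes and ellipses are not in string.punctuation
    return ''.join(ch for ch in post if ch not in string.punctuation)


PRINTABLE = set(string.printable)


def keep_printable(post):
    # Keeps only printable ASCII characters, which removes emojis and typographic symbols
    return ''.join(ch for ch in post if ch in PRINTABLE)


sample = "it’s real… isn't it? 🌍"
step1 = remove_punctuation(sample)
step2 = keep_printable(step1)
print(repr(step1))  # 'it’s real… isnt it 🌍'
print(repr(step2))  # 'its real isnt it '
```

The punctuation filter removes the ASCII apostrophe and question mark, while the curly apostrophe, the ellipsis and the emoji are only removed by the printable filter. This is also why contractions such as "doesnt" appear without an apostrophe in the cleaned tweets.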

### Tokenisation

A tokeniser divides text into a sequence of tokens, which roughly correspond to "words" (see the Stanford Tokeniser). We will use tokenisers to clean up the data, making it ready for analysis.

In [14]:
# transforming the message column into tokens 
tokeniser = TreebankWordTokenizer()
train_data['message'] = train_data['message'].apply(tokeniser.tokenize)

In [16]:
train_data

| | sentiment | message | tweetid |
| ---: | ---: | :--- | ---: |
| 0 | 1 | [polyscimajor, epa, chief, doesnt, think, carb... | 625221 |
| 1 | 1 | [its, not, like, we, lack, evidence, of, anthr... | 126103 |
| 2 | 2 | [researchers, say, we, have, three, years, to,... | 698562 |
| 3 | 1 | [2016, was, a, pivotal, year, in, the, war, on... | 573736 |
| 4 | 1 | [its, 2016, and, a, racist, sexist, climate, c... | 466954 |
| ... | ... | ... | ... |
| 15814 | 1 | [] | 22001 |
| 15815 | 2 | [how, climate, change, could, be, breaking, up... | 17856 |
| 15816 | 0 | [what, does, trump, actually, believe, about, ... | 384248 |
| 15817 | -1 | [hey, liberals, the, climate, change, crap, is... | 819732 |


### Stemming

Stemming is the process of reducing a word to its root form. It uses an algorithm that removes common word endings from English words, such as “ly,” “es,” “ed,” and “s.”

For example, in an analysis you may want to treat “carefully,” “cared,” “cares,” and “caringly” as “care” rather than as separate words. There are three widely used stemming algorithms, namely:

- Porter
- Lancaster
- Snowball

Out of these three, we will be using the SnowballStemmer.

In [17]:
# creating a function to stem a list of tokens
stemmer = SnowballStemmer('english')
def tweet_stemmer(words, stemmer):
    return [stemmer.stem(word) for word in words]

In [18]:
# stemming the data
train_data['message']= train_data['message'].apply(tweet_stemmer, args=(stemmer, ))

### Lemmatization

A very similar operation to stemming is lemmatization. Lemmatizing reduces the inflected forms of a word to a common base form, the lemma. The key difference is that a stem is not necessarily a real word you can look up in a dictionary (for example “climat”), whereas a lemma always is.

### Stop Words

Stop words are very common words (such as “the,” “is,” and “and”) that carry little meaning on their own. They are usually filtered out of search queries and text features because they add noise without adding information. nltk has a corpus of stop words; we use its English list to filter our tokens.

In [19]:
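# creating a function to remove stop words
# build the stop-word set once, so the corpus is not reloaded for every token
stop_words = set(stopwords.words('english'))
def remove_stop_words(tokens):
    return [t for t in tokens if t not in stop_words]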

In [20]:
train_data['message'] = train_data['message'].apply(remove_stop_words)

train_data['message']

0        [polyscimajor, epa, chief, doesnt, think, carb...
1            [like, lack, evid, anthropogen, global, warm]
2        [research, say, three, year, act, climat, chan...
3                  [2016, pivot, year, war, climat, chang]
4        [2016, racist, sexist, climat, chang, deni, bi...
                               ...                        
15814                                                   []
15815    [climat, chang, could, break, 200millionyearol...
15816    [doe, trump, actual, believ, climat, chang, ri...
15817    [hey, liber, climat, chang, crap, hoax, tie, c...
15818                [climat, chang, equat, 4, screenshot]
Name: message, Length: 15819, dtype: object
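
Note that the token "doe" survives in row 15816 even though "does" would normally be treated as a stop word. The following sketch suggests why: the stop-word filter runs after stemming, so it sees stems rather than the original words. It uses a tiny hand-written stop-word set as a stand-in for the nltk list (an assumption made so the example runs without downloading the corpus), and the token lists are copied from the outputs shown above.

```python
# Hypothetical subset standing in for stopwords.words('english')
STOP_WORDS = {"what", "does", "about", "the", "is"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop-word set."""
    return [t for t in tokens if t not in stop_words]


raw_tokens = ["what", "does", "trump", "actually", "believe"]
stemmed_tokens = ["what", "doe", "trump", "actual", "believ"]  # stems as shown in the notebook output

print(remove_stop_words(raw_tokens))      # ['trump', 'actually', 'believe']
print(remove_stop_words(stemmed_tokens))  # ['doe', 'trump', 'actual', 'believ']
```

Filtering before stemming would remove "does"; filtering afterwards leaves the stem "doe" in place because it no longer matches any entry in the list. Whether this matters for the final model is an open question that the notebook does not test.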

In [39]:
train_data

| | sentiment | message | tweetid |
| ---: | ---: | :--- | ---: |
| 0 | 1 | [polyscimajor, epa, chief, doesnt, think, carb... | 625221 |
| 1 | 1 | [like, lack, evid, anthropogen, global, warm] | 126103 |
| 2 | 2 | [research, say, three, year, act, climat, chan... | 698562 |
| 3 | 1 | [2016, pivot, year, war, climat, chang] | 573736 |
| 4 | 1 | [2016, racist, sexist, climat, chang, deni, bi... | 466954 |
| ... | ... | ... | ... |
| 15814 | 1 | [] | 22001 |
| 15815 | 2 | [climat, chang, could, break, 200millionyearol... | 17856 |
| 15816 | 0 | [doe, trump, actual, believ, climat, chang, ri... | 384248 |
| 15817 | -1 | [hey, liber, climat, chang, crap, hoax, tie, c... | 819732 |


## Text feature extraction

### Bag of words

Text feature extraction is the process of transforming what is essentially a list of words into a feature set that is usable by a classifier. The NLTK classifiers expect dict-style feature sets, so we must transform our text into a dict. The Bag of Words model is the simplest method: it builds a feature set from all the words in the text, recording the number of times each word appears.
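
As a minimal sketch of the idea, a bag of words for one tokenised tweet can be built with `collections.Counter` from the standard library. The function name `bag_of_words` and the token list are hypothetical; the notebook itself imports `CountVectorizer`, which may be what is used later for the full dataset.

```python
from collections import Counter


def bag_of_words(tokens):
    """Map each distinct token to the number of times it appears."""
    return dict(Counter(tokens))


# Made-up list of stemmed tokens
tokens = ["climat", "chang", "hoax", "climat"]
features = bag_of_words(tokens)
print(features)  # {'climat': 2, 'chang': 1, 'hoax': 1}
```

Word order is discarded: only the counts remain, which is what makes the representation a "bag".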

<a id="four"></a>
## 4. Data Engineering
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Data engineering ⚡ |
| :--------------------------- |
| In this section you are required to: clean the dataset, and possibly create new features - as identified in the EDA phase. |

---

<a id="five"></a>
## 5. Modelling
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Modelling ⚡ |
| :--------------------------- |
| In this section, you are required to create one or more classification models that are able to accurately predict the sentiment of a tweet. |

---

<a id="six"></a>
## 6. Model Performance
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Model performance ⚡ |
| :--------------------------- |
| In this section you are required to compare the relative performance of the various trained ML models on a holdout dataset and comment on what model is the best and why. |

---