# Classification Predict Student Solution

© Explore Data Science Academy

---
### Honour Code

We, **Team 6**, confirm - by submitting this document - that the solutions in this notebook are a result of our own work and that we abide by the [EDSA honour code](https://drive.google.com/file/d/1QDCjGZJ8-FmJE3bZdIQNwnJyQKPhHZBn/view?usp=sharing).

Non-compliance with the honour code constitutes a material breach of contract.

### Predict Overview: EDSA - Climate Change Belief Analysis 2022

Many companies are built around lessening one’s environmental impact or carbon footprint. They offer products and services that are environmentally friendly and sustainable, in line with their values and ideals. They would like to determine how people perceive climate change and whether or not they believe it is a real threat. This would add to their market research efforts in gauging how their product/service may be received.

With this context, you are challenged to create a Machine Learning model that is able to classify whether or not a person believes in climate change, based on their novel tweet data.

Providing an accurate and robust solution to this task gives companies access to a broad base of consumer sentiment, spanning multiple demographic and geographic categories - thus increasing their insights and informing future marketing strategies.


<a id="cont"></a>

## Table of Contents

<a href=#one>1. Importing Packages</a>

<a href=#two>2. Loading the Data</a>

<a href=#three>3. Exploratory Data Analysis (EDA)</a>

<a href=#four>4. Data Engineering</a>

<a href=#five>5. Modelling</a>

<a href=#six>6. Model Performance</a>

<a href=#seven>7. Model Explanations</a>

<a id="one"></a>
## 1. Importing Packages
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Importing Packages ⚡ |
| :--------------------------- |
| In this section you are required to import, and briefly discuss, the libraries that will be used throughout your analysis and modelling. |

---

In [1]:
!pip install textatistic



In [2]:
import numpy as np
import numpy.linalg as LA
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import StackingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import f1_score, precision_score, recall_score, confusion_matrix

import matplotlib.pyplot as plt
import string
from nltk.tokenize import word_tokenize, TreebankWordTokenizer
from nltk import SnowballStemmer, PorterStemmer, LancasterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.util import ngrams
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer


from textatistic import Textatistic

---
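## Discussion of libraries that will be used

- `pandas` loads the CSV files and holds the tweets and engineered features as DataFrames; `numpy` is imported for array support.
- `nltk` supplies the tokeniser (`TreebankWordTokenizer`), the lemmatiser (`WordNetLemmatizer`) and the English stop-word list.
- `string` provides the punctuation set used during cleaning.
- `scikit-learn` provides the text vectorisers (`CountVectorizer`, `TfidfVectorizer`), the train/validation split, the classifiers and the evaluation metrics.
- `matplotlib` and `textatistic` are imported for plotting and readability statistics, but the cells shown here do not call them.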

---

<a id="two"></a>
## 2. Loading the Data
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Loading the data ⚡ |
| :--------------------------- |
| In this section you are required to load the data from the `train.csv` file into a DataFrame. |

---

In [3]:
df = pd.read_csv("train.csv")
df_test = pd.read_csv("test.csv")
df_test.head()

| | message | tweetid |
| :--- | :--- | :--- |
| 0 | Europe will now be looking to China to make su... | 169760 |
| 1 | Combine this with the polling of staffers re c... | 35326 |
| 2 | The scary, unimpeachable evidence that climate... | 224985 |
| 3 | @Karoli @morgfair @OsborneInk @dailykos \nPuti... | 476263 |
| 4 | RT @FakeWillMoore: 'Female orgasms cause globa... | 872928 |


In [4]:
df['message'][2]

'RT @RawStory: Researchers say we have three years to act on climate change before it’s too late https://t.co/WdT0KdUr2f https://t.co/Z0ANPT…'

<a id="four"></a>
## 4. Data Engineering
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Data engineering ⚡ |
| :--------------------------- |
| In this section you are required to: clean the dataset, and possibly create new features - as identified in the EDA phase. |

---

In [5]:
all_msg = df.copy()  # work on a copy so the raw training frame is left untouched

In [6]:
# Replace web URLs with a placeholder token
# TODO: consider removing Twitter handles as well

pattern_url = r'http[s]?://(?:[A-Za-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9A-Fa-f][0-9A-Fa-f]))+'
subs_url = r'url-web'
all_msg['message'] = all_msg['message'].replace(to_replace = pattern_url, value = subs_url, regex = True)
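To see what this substitution does to a single tweet, here is a minimal standalone sketch using Python's `re` module with the same URL pattern. The handle pattern `PATTERN_HANDLE`, the `user-handle` token and the function name `replace_urls_and_handles` are illustrative assumptions, not part of the notebook; they show one way the handle removal mentioned in the cell comment might be approached.

```python
import re

# Same URL pattern as in the cell above
PATTERN_URL = r'http[s]?://(?:[A-Za-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9A-Fa-f][0-9A-Fa-f]))+'
# Hypothetical pattern for Twitter handles: '@' followed by word characters
PATTERN_HANDLE = r'@\w+'


def replace_urls_and_handles(text, url_token='url-web', handle_token='user-handle'):
    """Replace every URL and every @handle with a fixed placeholder token."""
    text = re.sub(PATTERN_URL, url_token, text)
    return re.sub(PATTERN_HANDLE, handle_token, text)


sample = "RT @RawStory: act on climate change https://t.co/WdT0KdUr2f"
cleaned = replace_urls_and_handles(sample)
print(cleaned)
```

Whether handles should be replaced or kept is a modelling choice: a handle can carry signal about the author's stance, so it may be worth comparing validation scores both ways.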

In [7]:
all_msg.head()

| | sentiment | message | tweetid |
| :--- | :--- | :--- | :--- |
| 0 | 1 | PolySciMajor EPA chief doesn't think carbon di... | 625221 |
| 1 | 1 | It's not like we lack evidence of anthropogeni... | 126103 |
| 2 | 2 | RT @RawStory: Researchers say we have three ye... | 698562 |
| 3 | 1 | #TodayinMaker# WIRED : 2016 was a pivotal year... | 573736 |
| 4 | 1 | RT @SoyNovioDeTodas: It's 2016, and a racist, ... | 466954 |


In [8]:
#converting message column to lower case

all_msg['message'] = all_msg['message'].str.lower()

In [9]:
# Remove ASCII punctuation; this also turns the 'url-web' placeholder into 'urlweb'
def remove_punctuation(post):
    return ''.join([char for char in post if char not in string.punctuation])

all_msg['message'] = all_msg['message'].apply(remove_punctuation)
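The effect of the lower-casing and punctuation steps on the URL placeholder is worth checking, because later cells look for the token `urlweb`. The sketch below uses `str.translate`, a common alternative to a character-by-character join; `strip_punctuation` is a hypothetical name used only for this illustration.

```python
import string

# Translation table that deletes every ASCII punctuation character
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)


def strip_punctuation(text):
    """Lower-case the text, then drop ASCII punctuation."""
    return text.lower().translate(_PUNCT_TABLE)


ascii_example = strip_punctuation("RT @RawStory: It's url-web!")
unicode_example = strip_punctuation("it’s too late…")
print(ascii_example)
print(unicode_example)
```

`string.punctuation` only covers ASCII characters, so typographic characters such as `’` and `…` (both present in the sample tweet shown earlier) pass through unchanged.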

In [10]:
tokeniser = TreebankWordTokenizer()
all_msg['tokens'] = all_msg['message'].apply(tokeniser.tokenize)

In [11]:
lemmatizer = WordNetLemmatizer()
def mbti_lemma(words, lemmatizer):
    # Lemmatise each token of a tokenised tweet
    return [lemmatizer.lemmatize(word) for word in words]

all_msg['lemma'] = all_msg['tokens'].apply(mbti_lemma, args=(lemmatizer, ))

In [12]:
def remove_stop_words(tokens):
    stop_words = set(stopwords.words('english'))  # load the list once per call, not once per token
    return [t for t in tokens if t not in stop_words]

In [13]:
betterVect = CountVectorizer(stop_words='english', 
                             min_df=2, 
                             max_df=0.5, 
                             ngram_range=(1, 2))
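The vectoriser settings above interact in a way that is easier to see on a toy corpus. In this sketch the four sentences are made up for illustration; the `CountVectorizer` arguments are the same as in the cell above. `min_df=2` drops terms that occur in fewer than two documents, `max_df=0.5` drops terms that occur in more than half of them, and `ngram_range=(1, 2)` adds bigrams, which scikit-learn forms after stop words have been removed.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up corpus, only for illustration
toy_corpus = [
    "climate change is real",
    "climate change is a hoax",
    "global warming is real",
    "global warming threatens coastal cities",
]

vect = CountVectorizer(stop_words='english',
                       min_df=2,
                       max_df=0.5,
                       ngram_range=(1, 2))
counts = vect.fit_transform(toy_corpus)

# Sorted list of the unigrams and bigrams that survived the filters
vocab = sorted(vect.vocabulary_)
print(vocab)
print(counts.shape)
```

Terms such as "hoax" appear in only one document and are filtered out, while "climate change" appears in exactly two of the four documents and is kept.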

In [14]:
added_info = all_msg.copy()
def count_words(message):
    # Number of space-separated tokens in the cleaned message
    return len(message.split(" "))

added_info['word_count'] = added_info['message'].apply(count_words)

def avg_word_length(message):
    # Characters (spaces included) per space-separated token
    return len(message) / len(message.split(" "))

added_info['avg_word_length'] = added_info['message'].apply(avg_word_length)

def count_citations(message):
    # 'url-web' placeholders become 'urlweb' once punctuation is stripped
    return message.split(" ").count("urlweb")

added_info['citations'] = added_info['message'].apply(count_citations)

def count_retweets(message):
    # Occurrences of the standalone retweet marker 'rt'
    return message.split(" ").count('rt')

added_info['rt_count'] = added_info['message'].apply(count_retweets)

def list_to_string(post):
    return ' '.join(post)

added_info['lemma'] = added_info['lemma'].apply(list_to_string)

tf_vect = TfidfVectorizer()
X_train_vect = tf_vect.fit_transform(added_info.lemma)

added_info['lemma']

0        polyscimajor epa chief doesnt think carbon dio...
1        it not like we lack evidence of anthropogenic ...
2        rt rawstory researcher say we have three year ...
3        todayinmaker wired 2016 wa a pivotal year in t...
4        rt soynoviodetodas it 2016 and a racist sexist...
                               ...                        
15814    rt ezlusztig they took down the material on gl...
15815    rt washingtonpost how climate change could be ...
15816    notiven rt nytimesworld what doe trump actuall...
15817    rt sara8smiles hey liberal the climate change ...
15818    rt chetcannon kurteichenwalds climate change e...
Name: lemma, Length: 15819, dtype: object
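The four hand-crafted features computed in the cell above can be checked on a single cleaned message. The sketch below mirrors their definitions in one function; `handcrafted_features`, `mean_token_length` and the sample message are hypothetical and only serve the illustration. Note that `avg_word_length` divides the full string length, spaces included, by the token count, so it is slightly larger than the mean length of the tokens themselves.

```python
def handcrafted_features(message):
    """Mirror of the notebook's four features for one cleaned message."""
    tokens = message.split(" ")
    return {
        "word_count": len(tokens),
        "avg_word_length": len(message) / len(tokens),  # spaces are counted
        "citations": tokens.count("urlweb"),
        "rt_count": tokens.count("rt"),
    }


def mean_token_length(message):
    """Hypothetical alternative: mean length of the tokens, spaces excluded."""
    tokens = message.split()
    return sum(len(t) for t in tokens) / len(tokens) if tokens else 0.0


sample = "rt rawstory climate change is real urlweb"
features = handcrafted_features(sample)
print(features)
print(mean_token_length(sample))
```

If the stricter definition is preferred, the same function would need to be applied to both the training and the test messages so the feature keeps the same meaning in both.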

In [15]:
#X_train_df = pd.DataFrame(X_train_vect.toarray())

In [16]:
#X_train_df['word_count'] = added_info['word_count']
#X_train_df['avg_word_length'] = added_info['avg_word_length']
#X_train_df['citations'] = added_info['citations']
#X_train_df['rt_count'] = added_info['rt_count']


In [15]:
y_train_df = added_info.sentiment
#X_train2 = X_train_df.to_numpy()
#y_train2 = y_train_df.to_numpy()

In [None]:
# one = added_info[added_info.sentiment == 1]
# one = tf_vect.transform(one.lemma)
# one = one.toarray()
# one = pd.DataFrame(one)
# one_vect = pd.DataFrame(one.mean()).T.to_numpy()[0]
# one_vect

In [None]:
# one_ = added_info[added_info.sentiment == -1]
# one_ = tf_vect.transform(one_.lemma)
# one_ = one_.toarray()
# one_ = pd.DataFrame(one_)
# one__vect = pd.DataFrame(one_.mean()).T.to_numpy()[0]
# one__vect

In [None]:
# zero = added_info[added_info.sentiment == 0]
# zero = tf_vect.transform(zero.lemma)
# zero = zero.toarray()
# zero = pd.DataFrame(zero)
# zero_vect = pd.DataFrame(zero.mean()).T.to_numpy()[0]
# zero_vect

In [None]:
# two = added_info[added_info.sentiment == 2]
# two = tf_vect.transform(two.lemma)
# two = two.toarray()
# two = pd.DataFrame(two)
# two_vect = pd.DataFrame(two.mean()).T.to_numpy()[0]
# two_vect

In [16]:
newdf = all_msg.copy()

In [17]:
df_test['message'] = df_test['message'].replace(to_replace = pattern_url, value = subs_url, regex = True)
df_test['message'] = df_test['message'].str.lower()
df_test['message'] = df_test['message'].apply(remove_punctuation)
df_test['tokens'] = df_test['message'].apply(tokeniser.tokenize)
df_test['lemma'] = df_test['tokens'].apply(mbti_lemma, args=(lemmatizer, ))
newtestdf = df_test.copy()
df_test.head(2)

| | message | tweetid | tokens | lemma |
| :--- | :--- | :--- | :--- | :--- |
| 0 | europe will now be looking to china to make su... | 169760 | [europe, will, now, be, looking, to, china, to... | [europe, will, now, be, looking, to, china, to... |
| 1 | combine this with the polling of staffers re c... | 35326 | [combine, this, with, the, polling, of, staffe... | [combine, this, with, the, polling, of, staffe... |


In [18]:
def list_to_string(post):
    return ' '.join(post)

In [19]:
all_msg = newdf.copy()
df_test = newtestdf.copy()

all_msg['lemma'] = all_msg['lemma'].apply(list_to_string)
df_test['lemma'] = df_test['lemma'].apply(list_to_string)

In [26]:
X_train = betterVect.fit_transform(all_msg['lemma'])
y_train = df.sentiment
X_test = betterVect.transform(df_test['lemma'])
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.20, random_state=1)
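If the sentiment classes turn out to be imbalanced, a plain random split can leave a rare class under-represented in the validation set. A minimal sketch of a stratified split follows; the label counts are made up for illustration and are not taken from `train.csv`.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical, deliberately imbalanced labels (100 rows)
labels = [1] * 50 + [2] * 30 + [0] * 15 + [-1] * 5
features = [[i] for i in range(len(labels))]

# stratify=labels keeps the class proportions equal in both parts
X_tr, X_va, y_tr, y_va = train_test_split(
    features, labels, test_size=0.20, random_state=1, stratify=labels
)

val_counts = Counter(y_va)
print(val_counts)
```

With `stratify`, the 20-row validation part keeps the 50/30/15/5 proportions of the full label list.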

In [95]:
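X_train2 = X_train_vect   # TF-IDF features for ALL training rows (no hold-out removed)
y_train2 = y_train_df
X_test2 = tf_vect.transform(df_test['lemma'])
# Assumption: X_val2/y_val2 are not defined in any cell shown; this split only makes the cells below runnable.
# Its rows are also part of X_train2, so the *2 validation scores below are optimistic.
_, X_val2, _, y_val2 = train_test_split(X_train2, y_train2, test_size=0.20, random_state=1)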
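The fit-on-train, transform-on-test pattern used for the TF-IDF features can be illustrated on a few made-up sentences. The documents below are assumptions for the example; the calls are the standard `TfidfVectorizer` methods used in the notebook.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents, only for illustration
train_docs = [
    "climate change is real",
    "climate change is a hoax",
    "global warming is real",
]
test_docs = ["climate policy is changing"]  # 'policy' and 'changing' are unseen

vect = TfidfVectorizer()
X_tr = vect.fit_transform(train_docs)  # learns the vocabulary and IDF weights
X_te = vect.transform(test_docs)       # reuses them; unseen words are ignored

# With the default norm='l2', each non-empty row has unit length
row_norm = float(np.sqrt(X_te.multiply(X_te).sum()))
print(X_tr.shape, X_te.shape, X_te.nnz, row_norm)
```

Because the test matrix is built from the training vocabulary, both matrices have the same number of columns, which is what lets a model fitted on one predict on the other.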

In [93]:
addition = added_info[['word_count', 'avg_word_length', 'citations', 'rt_count']]

<a id="five"></a>
## 5. Modelling
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Modelling ⚡ |
| :--------------------------- |
| In this section, you are required to create one or more classification models that is/are able to classify whether or not a person believes in climate change, based on their novel tweet data. |

---

In [29]:
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
logreg.score(X_val, y_val)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


0.7493678887484198
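The warning above suggests raising `max_iter` (or scaling the data). A minimal sketch of the first option on synthetic data follows; the dataset from `make_classification` and the value `max_iter=1000` are assumptions for illustration, and the right value for the tweet features may differ.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression

# Synthetic four-class data standing in for the real features
X_toy, y_toy = make_classification(
    n_samples=200, n_features=20, n_informative=5,
    n_classes=4, n_clusters_per_class=1, random_state=1,
)

# Record warnings so we can check whether the solver converged
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    clf = LogisticRegression(max_iter=1000)  # default is 100
    clf.fit(X_toy, y_toy)

convergence_warnings = [
    w for w in caught if issubclass(w.category, ConvergenceWarning)
]
print("iterations used:", clf.n_iter_.max())
print("convergence warnings:", len(convergence_warnings))
```

Comparing `clf.n_iter_` with `max_iter` is a quick way to confirm that the solver stopped because it converged rather than because it ran out of iterations.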

In [96]:
logreg2 = LogisticRegression()
logreg2.fit(X_train2, y_train2)
logreg2.score(X_val2, y_val2)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


0.8644121365360303

In [None]:
svc_poly = SVC(kernel="poly")
svc_poly.fit(X_train, y_train)
svc_poly.score(X_val, y_val)

In [None]:
svc_poly2 = SVC(kernel="poly")
svc_poly2.fit(X_train2, y_train2)
svc_poly2.score(X_val2, y_val2)

In [None]:
svc_lm = SVC(kernel="linear")
svc_lm.fit(X_train, y_train)
svc_lm.score(X_val, y_val)

In [119]:
svc_lm2 = SVC(kernel="linear", probability=True)
svc_lm2.fit(X_train2, y_train2)
svc_lm2.score(X_val2, y_val2)

0.9171934260429836

In [None]:
AdBoost = AdaBoostClassifier()
AdBoost.fit(X_train, y_train)
AdBoost.score(X_val, y_val)

In [None]:
AdBoost2 = AdaBoostClassifier()
AdBoost2.fit(X_train2, y_train2)
AdBoost2.score(X_val2, y_val2)

In [None]:
RF = RandomForestClassifier()
RF.fit(X_train, y_train)
RF.score(X_val, y_val)

In [98]:
RF2 = RandomForestClassifier()
RF2.fit(X_train2, y_train2)
RF2.score(X_val2, y_val2)

0.9993678887484198

In [None]:
MNB = MultinomialNB()
MNB.fit(X_train, y_train)
MNB.score(X_val, y_val)

In [33]:
MNB2 = MultinomialNB()
MNB2.fit(X_train2, y_train2)
MNB2.score(X_val2, y_val2)

0.6536030341340076

In [105]:
logistic = logreg2.predict_proba(X_train2)

In [106]:
support_vector = svc_lm2.predict_proba(X_train2)

In [107]:
random_forest = RF2.predict_proba(X_train2)

In [108]:
classes = ['classA', 'classB', 'classC', 'classD']  # predict_proba columns follow classes_: sentiment -1, 0, 1, 2

In [109]:
logistic_df = pd.DataFrame(logistic)
logistic_df.columns = classes

support_vector_df = pd.DataFrame(support_vector)
support_vector_df.columns = classes

random_forest_df = pd.DataFrame(random_forest)
random_forest_df.columns = classes

In [113]:
interim_data = pd.DataFrame()

interim_data['classA'] = logistic_df['classA'] + support_vector_df['classA'] + random_forest_df['classA']
interim_data['classB'] = logistic_df['classB'] + support_vector_df['classB'] + random_forest_df['classB']
interim_data['classC'] = logistic_df['classC'] + support_vector_df['classC'] + random_forest_df['classC']
interim_data['classD'] = logistic_df['classD'] + support_vector_df['classD'] + random_forest_df['classD']

interim = pd.concat([interim_data, addition], axis=1)
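Summing the per-class probabilities of several models, as done above, is close in spirit to soft voting. The sketch below shows the arithmetic with NumPy on made-up probability rows; the three arrays are hypothetical stand-ins for the `predict_proba` outputs, and the label order assumes the sorted sentiment values -1, 0, 1, 2.

```python
import numpy as np

# Sorted sentiment labels, the order in which predict_proba reports columns
class_labels = np.array([-1, 0, 1, 2])

# Hypothetical predict_proba outputs of three models for two tweets
proba_a = np.array([[0.1, 0.2, 0.6, 0.1], [0.5, 0.2, 0.2, 0.1]])
proba_b = np.array([[0.2, 0.1, 0.5, 0.2], [0.3, 0.4, 0.2, 0.1]])
proba_c = np.array([[0.0, 0.3, 0.4, 0.3], [0.6, 0.1, 0.2, 0.1]])

# Element-wise sum: each row now sums to 3 (one unit per model)
summed = proba_a + proba_b + proba_c

# Taking the largest summed score per row gives a soft-vote prediction
voted = class_labels[summed.argmax(axis=1)]
print(summed)
print(voted)
```

The notebook goes one step further and feeds the summed scores, together with the hand-crafted features, into a second model instead of taking the arg-max directly.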

In [114]:
m_train, m_val, m_y_train, m_y_val = train_test_split(interim, y_train2, test_size=0.20, random_state=1)

In [115]:
m_log_clf = LogisticRegression()
m_log_clf.fit(m_train, m_y_train)
m_log_clf.score(m_val, m_y_val)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


0.9756637168141593

In [116]:
m_svc_lm2 = SVC(kernel="linear")
m_svc_lm2.fit(m_train, m_y_train)
m_svc_lm2.score(m_val, m_y_val)

0.9759797724399494

In [118]:
m_RF2 = RandomForestClassifier()
m_RF2.fit(m_train, m_y_train)
m_RF2.score(m_val, m_y_val)

0.97724399494311
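The meta-models above are trained on probabilities that the base models produced for rows they had already been fitted on, which tends to inflate validation scores. One common remedy is out-of-fold stacking, which is also what scikit-learn's `StackingClassifier` (imported in section 1 but not used) does internally. The following is a minimal sketch on synthetic data, not the notebook's pipeline; the dataset, the two base models and all variable names are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, train_test_split

# Synthetic four-class data standing in for the real features
X_toy, y_toy = make_classification(
    n_samples=300, n_features=20, n_informative=6,
    n_classes=4, n_clusters_per_class=1, random_state=1,
)
X_fit, X_hold, y_fit, y_hold = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=1, stratify=y_toy
)

base_models = [
    LogisticRegression(max_iter=1000),
    RandomForestClassifier(n_estimators=50, random_state=1),
]

# Out-of-fold probabilities: every row is scored by a model that never saw it
oof = sum(
    cross_val_predict(m, X_fit, y_fit, cv=5, method="predict_proba")
    for m in base_models
)

# Meta-model learns from the summed out-of-fold probabilities
meta = LogisticRegression(max_iter=1000).fit(oof, y_fit)

# Refit the base models on all fitting rows, then score on the untouched hold-out
hold_meta = sum(m.fit(X_fit, y_fit).predict_proba(X_hold) for m in base_models)
holdout_score = meta.score(hold_meta, y_hold)
print(holdout_score)
```

Scores obtained this way are usually lower than scores computed on rows the base models were trained on, but they are a fairer estimate of how the ensemble would behave on unseen tweets.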

### Prepare for submission

In [120]:
# Reuse the feature helpers defined in the Data Engineering section on the test messages
test_addition = df_test.copy()
test_addition['word_count'] = test_addition['message'].apply(count_words)
test_addition['avg_word_length'] = test_addition['message'].apply(avg_word_length)
test_addition['citations'] = test_addition['message'].apply(count_citations)
test_addition['rt_count'] = test_addition['message'].apply(count_retweets)

In [125]:
addition.head(1)

| | word_count | avg_word_length | citations | rt_count |
| :--- | :--- | :--- | :--- | :--- |
| 0 | 19 | 6.105263 | 1 | 0 |


In [127]:
test_addition = test_addition[['word_count', 'avg_word_length', 'citations', 'rt_count']]

In [130]:

logistic = logreg2.predict_proba(X_test2)
support_vector = svc_lm2.predict_proba(X_test2)
random_forest = RF2.predict_proba(X_test2)

classes = ['classA', 'classB', 'classC', 'classD']  # predict_proba columns follow classes_: sentiment -1, 0, 1, 2

logistic_df = pd.DataFrame(logistic)
logistic_df.columns = classes

support_vector_df = pd.DataFrame(support_vector)
support_vector_df.columns = classes

random_forest_df = pd.DataFrame(random_forest)
random_forest_df.columns = classes

interim_data = pd.DataFrame()

interim_data['classA'] = logistic_df['classA'] + support_vector_df['classA'] + random_forest_df['classA']
interim_data['classB'] = logistic_df['classB'] + support_vector_df['classB'] + random_forest_df['classB']
interim_data['classC'] = logistic_df['classC'] + support_vector_df['classC'] + random_forest_df['classC']
interim_data['classD'] = logistic_df['classD'] + support_vector_df['classD'] + random_forest_df['classD']

interim = pd.concat([interim_data, test_addition], axis=1)
interim

| | classA | classB | classC | classD | word_count | avg_word_length | citations | rt_count |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 0 | 0.090894 | 0.132705 | 2.336278 | 0.440123 | 20 | 5.200000 | 1 | 0 |
| 1 | 0.118136 | 0.458887 | 2.305033 | 0.117944 | 20 | 5.650000 | 1 | 0 |
| 2 | 0.027738 | 0.165350 | 2.724315 | 0.082597 | 14 | 8.071429 | 1 | 0 |
| 3 | 0.083101 | 0.364161 | 2.437053 | 0.115686 | 23 | 5.608696 | 0 | 0 |
| 4 | 0.240293 | 1.877761 | 0.703504 | 0.178442 | 8 | 9.125000 | 0 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 10541 | 0.321340 | 0.304039 | 1.726491 | 0.648130 | 15 | 8.000000 | 0 | 1 |
| 10542 | 0.109459 | 0.111651 | 2.430984 | 0.347906 | 19 | 6.157895 | 1 | 0 |
| 10543 | 0.073130 | 0.634892 | 0.998445 | 1.293533 | 16 | 7.250000 | 1 | 1 |
| 10544 | 0.183798 | 2.536068 | 0.268651 | 0.011483 | 16 | 6.375000 | 0 | 1 |


In [136]:
predictions = m_log_clf.predict(interim)


df_CSV = pd.DataFrame({"tweetid": df_test['tweetid'].values,
                   "sentiment": predictions,
                  })

df_CSV.to_csv("Team6_sample_MNB_ProT.csv", index=False)

<a id="six"></a>
## 6. Model Performance
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Model performance ⚡ |
| :--------------------------- |
| In this section you are required to compare the relative performance of the various trained ML models on a holdout dataset and comment on what model is the best and why. |

---
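The metric functions imported in section 1 (`f1_score`, `precision_score`, `recall_score`, `confusion_matrix`) can support the comparison this section asks for. The sketch below runs them on made-up labels, not on results from this notebook; the label order assumes the sentiment values -1, 0, 1, 2, and macro averaging is one possible choice that weights every class equally.

```python
from sklearn.metrics import (confusion_matrix, f1_score,
                             precision_score, recall_score)

sentiment_labels = [-1, 0, 1, 2]

# Made-up hold-out labels and predictions, only for illustration
y_true = [1, 1, 0, 2, -1, 1]
y_pred = [1, 1, 0, 1, -1, 2]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred, labels=sentiment_labels)

# Macro averaging: compute the metric per class, then take the plain mean
macro_f1 = f1_score(y_true, y_pred, labels=sentiment_labels, average="macro")
macro_precision = precision_score(y_true, y_pred, labels=sentiment_labels, average="macro")
macro_recall = recall_score(y_true, y_pred, labels=sentiment_labels, average="macro")

print(cm)
print(macro_f1, macro_precision, macro_recall)
```

Unlike plain accuracy (what `.score` reports for these classifiers), the macro F1 drops sharply when a model ignores a small class, which makes it a useful second view when comparing the candidates.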

<a id="seven"></a>
## 7. Model Explanations
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Model explanation ⚡ |
| :--------------------------- |
| In this section, you are required to discuss how the best performing model works in a simple way so that both technical and non-technical stakeholders can grasp the intuition behind the model's inner workings. |

---