# Classification Predict Student Solution

© Explore Data Science Academy

---
### Honour Code

We, **Team 6**, confirm - by submitting this document - that the solutions in this notebook are a result of our own work and that we abide by the [EDSA honour code](https://drive.google.com/file/d/1QDCjGZJ8-FmJE3bZdIQNwnJyQKPhHZBn/view?usp=sharing).

Non-compliance with the honour code constitutes a material breach of contract.

### Predict Overview: EDSA - Climate Change Belief Analysis 2022

Many companies are built around lessening one’s environmental impact or carbon footprint. They offer products and services that are environmentally friendly and sustainable, in line with their values and ideals. They would like to determine how people perceive climate change and whether or not they believe it is a real threat. This would add to their market research efforts in gauging how their product/service may be received.

With this context, you are challenged to create a Machine Learning model that is able to classify whether or not a person believes in climate change, based on their novel tweet data.

Providing an accurate and robust solution to this task gives companies access to a broad base of consumer sentiment, spanning multiple demographic and geographic categories - thus increasing their insights and informing future marketing strategies.


<a id="cont"></a>

## Table of Contents

<a href=#one>1. Importing Packages</a>

<a href=#two>2. Loading the Data</a>

<a href=#three>3. Exploratory Data Analysis (EDA)</a>

<a href=#four>4. Data Engineering</a>

<a href=#five>5. Modelling</a>

<a href=#six>6. Model Performance</a>

<a href=#seven>7. Model Explanations</a>

<a id="one"></a>
## 1. Importing Packages
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Importing Packages ⚡ |
| :--------------------------- |
| In this section you are required to import, and briefly discuss, the libraries that will be used throughout your analysis and modelling. |

---

In [None]:
!pip install textatistic

In [None]:
import numpy as np
import numpy.linalg as LA
import pandas as pd
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import StackingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import f1_score, precision_score, recall_score, confusion_matrix

import matplotlib.pyplot as plt
import string
from nltk.tokenize import word_tokenize, TreebankWordTokenizer
from nltk import SnowballStemmer, PorterStemmer, LancasterStemmer
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.util import ngrams
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer


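from textatistic import Textatistic

import nltk

# WordNetLemmatizer and stopwords need NLTK corpora that are not bundled with the package;
# download them if missing (some NLTK versions also require 'omw-1.4' for the lemmatizer).
nltk.download('wordnet', quiet=True)
nltk.download('stopwords', quiet=True)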

---
## Discussion of libraries that will be used

---

<a id="two"></a>
## 2. Loading the Data
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Loading the data ⚡ |
| :--------------------------- |
| In this section you are required to load the data from the `train.csv` file into a DataFrame. |

---

In [None]:
df = pd.read_csv("train.csv")
df_test = pd.read_csv("test.csv")
df_test.head()

In [None]:
df['message'][2]

<a id="four"></a>
## 4. Data Engineering
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Data engineering ⚡ |
| :--------------------------- |
| In this section you are required to: clean the dataset, and possibly create new features - as identified in the EDA phase. |

---
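
The cleaning steps in this section (URL replacement, lower-casing, punctuation removal) can be sketched as a single function, which makes it easier to apply identical cleaning to the training and test messages. This is a minimal sketch rather than the notebook's exact code: `clean_tweet`, `URL_PATTERN`, `HANDLE_PATTERN` and the `handle_token` option are hypothetical names, and the URL regex is a simplified stand-in for the pattern used in the cells of this section.

```python
import re
import string

# Hypothetical helper names; the placeholder token mirrors the notebook's 'url-web'.
URL_PATTERN = re.compile(r"https?://\S+")   # simplified URL matcher (assumption)
HANDLE_PATTERN = re.compile(r"@\w+")        # Twitter handles such as @user
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def clean_tweet(text, url_token="url-web", handle_token=None):
    """Replace URLs, optionally mask @handles, lowercase, and strip punctuation."""
    text = URL_PATTERN.sub(url_token, text)
    if handle_token is not None:
        text = HANDLE_PATTERN.sub(handle_token, text)
    text = text.lower()
    text = text.translate(PUNCT_TABLE)
    # Collapse whitespace runs so later splits do not produce empty tokens.
    return " ".join(text.split())


example = "RT @user: Climate change is REAL!  https://t.co/abc123"
print(clean_tweet(example))
print(clean_tweet(example, handle_token="handle"))
```

With pandas, the same function could be applied to a column with `df['message'].apply(clean_tweet)`. Note that stripping punctuation turns the placeholder `url-web` into `urlweb`, which is the token the citation count looks for.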

In [None]:
# Work on a copy so the raw training frame `df` stays untouched
all_msg = df.copy()

In [None]:
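# Replace web URLs with a placeholder token
# (it becomes 'urlweb' once punctuation is stripped later on)
# TODO: consider removing Twitter handles as well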

pattern_url = r'http[s]?://(?:[A-Za-z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9A-Fa-f][0-9A-Fa-f]))+'
subs_url = r'url-web'
all_msg['message'] = all_msg['message'].replace(to_replace = pattern_url, value = subs_url, regex = True)

In [None]:
all_msg.head()

In [None]:
#converting message column to lower case

all_msg['message'] = all_msg['message'].str.lower()

In [None]:
#remove punctuation
def remove_punctuation(post):
    return ''.join([l for l in post if l not in string.punctuation])

all_msg['message'] = all_msg['message'].apply(remove_punctuation)

In [None]:
tokeniser = TreebankWordTokenizer()
all_msg['tokens'] = all_msg['message'].apply(tokeniser.tokenize)

In [None]:
lemmatizer = WordNetLemmatizer()
def mbti_lemma(words, lemmatizer):
    return [lemmatizer.lemmatize(word) for word in words]    

all_msg['lemma'] = all_msg['tokens'].apply(mbti_lemma, args=(lemmatizer, ))

In [None]:
stop_words = set(stopwords.words('english'))  # build once; set lookups are fast

def remove_stop_words(tokens):
    return [t for t in tokens if t not in stop_words]

In [None]:
betterVect = CountVectorizer(stop_words='english', 
                             min_df=2, 
                             max_df=0.5, 
                             ngram_range=(1, 2))

In [None]:
added_info = all_msg.copy()
def count_words(text):
    # split() without arguments ignores repeated spaces, so empty strings are not counted
    return len(text.split())

added_info['word_count'] = added_info['message'].apply(count_words)

def avg_word_length(text):
    # Mean characters per word; whitespace is excluded from the character count
    words = text.split()
    if not words:
        return 0.0
    return sum(len(w) for w in words) / len(words)

added_info['avg_word_length'] = added_info['message'].apply(avg_word_length)

def count_citations(text):
    # Each URL was replaced by 'url-web', which is 'urlweb' after punctuation removal
    return text.split().count("urlweb")

added_info['citations'] = added_info['message'].apply(count_citations)

def count_retweets(text):
    # Count standalone 'rt' tokens (retweet markers) in the lower-cased message
    return text.split().count('rt')

added_info['rt_count'] = added_info['message'].apply(count_retweets)

def list_to_string(post):
    return ' '.join(post)

added_info['lemma'] = added_info['lemma'].apply(list_to_string)

tf_vect = TfidfVectorizer()
X_train_vect = tf_vect.fit_transform(added_info.lemma)

added_info['lemma']

In [None]:
X_train_df = pd.DataFrame(X_train_vect.toarray())

X_train_df['word_count'] = added_info['word_count']
X_train_df['avg_word_length'] = added_info['avg_word_length']
X_train_df['citations'] = added_info['citations']
X_train_df['rt_count'] = added_info['rt_count']


y_train_df = added_info.sentiment
X_train2, X_val2, y_train2, y_val2 = train_test_split(X_train_df.to_numpy(), y_train_df.to_numpy(), test_size=0.20, random_state=1)

In [None]:
one = added_info[added_info.sentiment == 1]
one = tf_vect.transform(one.lemma)
one = one.toarray()
one = pd.DataFrame(one)
one_vect = pd.DataFrame(one.mean()).T.to_numpy()[0]
one_vect

In [None]:
one_ = added_info[added_info.sentiment == -1]
one_ = tf_vect.transform(one_.lemma)
one_ = one_.toarray()
one_ = pd.DataFrame(one_)
one__vect = pd.DataFrame(one_.mean()).T.to_numpy()[0]
one__vect

In [None]:
zero = added_info[added_info.sentiment == 0]
zero = tf_vect.transform(zero.lemma)
zero = zero.toarray()
zero = pd.DataFrame(zero)
zero_vect = pd.DataFrame(zero.mean()).T.to_numpy()[0]
zero_vect

In [None]:
two = added_info[added_info.sentiment == 2]
two = tf_vect.transform(two.lemma)
two = two.toarray()
two = pd.DataFrame(two)
two_vect = pd.DataFrame(two.mean()).T.to_numpy()[0]
two_vect

In [None]:
newdf = all_msg.copy()

In [None]:
df_test['message'] = df_test['message'].replace(to_replace = pattern_url, value = subs_url, regex = True)
df_test['message'] = df_test['message'].str.lower()
df_test['message'] = df_test['message'].apply(remove_punctuation)
df_test['tokens'] = df_test['message'].apply(tokeniser.tokenize)
df_test['lemma'] = df_test['tokens'].apply(mbti_lemma, args=(lemmatizer, ))
newtestdf = df_test.copy()
df_test.head(2)

In [None]:
def list_to_string(post):
    return ' '.join(post)

In [None]:
all_msg = newdf.copy()
df_test = newtestdf.copy()

all_msg['lemma'] = all_msg['lemma'].apply(list_to_string)
df_test['lemma'] = df_test['lemma'].apply(list_to_string)

In [None]:
all_msg.head(1)

In [None]:
df_test[['lemma']].head(1)

In [None]:
all_msg['lemma']

In [None]:
X_train = betterVect.fit_transform(all_msg['lemma'])


X_test = betterVect.transform(df_test['lemma'])

In [None]:
y_train = df.sentiment

In [None]:
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.20, random_state=1)

In [None]:
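# NOTE: X_test2 holds TF-IDF columns only. The '2' models were fitted on TF-IDF plus
# four hand-crafted columns, so those columns must be appended before predicting with them.
X_test2 = tf_vect.transform(df_test['lemma'])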
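
A model can only predict on a matrix with the same columns it was fitted on, so any hand-crafted columns added to the TF-IDF training matrix have to be added to the test matrix in the same order. One possible way to guarantee this is to build both matrices with a single helper. The sketch below is illustrative only: `handcrafted_features` and `build_features` are hypothetical names, the toy strings stand in for the cleaned text columns, and only three of the notebook's hand-crafted features are reproduced.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer


def handcrafted_features(texts):
    """Return word count, 'urlweb' count and 'rt' count per text (illustrative subset)."""
    rows = []
    for text in texts:
        words = text.split()
        rows.append([len(words), words.count("urlweb"), words.count("rt")])
    return np.array(rows, dtype=float)


def build_features(texts, vectorizer, fit=False):
    """Stack TF-IDF columns and hand-crafted columns into one sparse matrix."""
    tfidf = vectorizer.fit_transform(texts) if fit else vectorizer.transform(texts)
    extra = csr_matrix(handcrafted_features(texts))
    return hstack([tfidf, extra]).tocsr()


# Toy stand-ins for the cleaned text of the train and test frames.
train_texts = ["rt climate change is real urlweb", "global warming is a hoax"]
test_texts = ["rt new climate report urlweb urlweb"]

vectorizer = TfidfVectorizer()
X_train_demo = build_features(train_texts, vectorizer, fit=True)   # fit on train only
X_test_demo = build_features(test_texts, vectorizer)               # reuse the vocabulary

print(X_train_demo.shape, X_test_demo.shape)
```

Keeping the matrices sparse with `hstack` also avoids the memory cost of calling `.toarray()` on a large vocabulary.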

<a id="five"></a>
## 5. Modelling
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Modelling ⚡ |
| :--------------------------- |
| In this section, you are required to create one or more classification models that is/are able to classify whether or not a person believes in climate change, based on their novel tweet data. |

---
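
One way to keep vectorisation and classification in step is to wrap both in a scikit-learn pipeline, so that the vocabulary learned from the training text is reused automatically at prediction time. The following is a minimal sketch under stated assumptions: the toy corpus and its labels are invented for illustration (the label values merely reuse sentiment codes that appear in this notebook), and `text_clf` is a hypothetical name.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy corpus standing in for the cleaned tweets; labels are arbitrary illustrations.
texts = [
    "climate change is real and urgent",
    "we must act on climate change now",
    "global warming is a hoax",
    "climate change is a scam",
    "new report on climate published today",
    "scientists release climate report",
]
labels = [1, 1, -1, -1, 2, 2]

# The vectorizer is fitted inside the pipeline, so train and test text
# always pass through the same vocabulary.
text_clf = make_pipeline(
    CountVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
text_clf.fit(texts, labels)

print(text_clf.predict(["climate change is a hoax"]))
```

Because the pipeline accepts raw strings, the same object can be passed to `GridSearchCV` or used directly on the test messages without a separate `transform` step.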

In [None]:
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
logreg.score(X_val, y_val)

In [None]:
logreg2 = LogisticRegression()
logreg2.fit(X_train2, y_train2)
logreg2.score(X_val2, y_val2)

In [None]:
svc_poly = SVC(kernel="poly")
svc_poly.fit(X_train, y_train)
svc_poly.score(X_val, y_val)

In [None]:
svc_poly2 = SVC(kernel="poly")
svc_poly2.fit(X_train2, y_train2)
svc_poly2.score(X_val2, y_val2)

In [None]:
svc_lm = SVC(kernel="linear")
svc_lm.fit(X_train, y_train)
svc_lm.score(X_val, y_val)

In [None]:
svc_lm2 = SVC(kernel="linear")
svc_lm2.fit(X_train2, y_train2)
svc_lm2.score(X_val2, y_val2)

In [None]:
AdBoost = AdaBoostClassifier()
AdBoost.fit(X_train, y_train)
AdBoost.score(X_val, y_val)

In [None]:
AdBoost2 = AdaBoostClassifier()
AdBoost2.fit(X_train2, y_train2)
AdBoost2.score(X_val2, y_val2)

In [None]:
RF = RandomForestClassifier()
RF.fit(X_train, y_train)
RF.score(X_val, y_val)

In [None]:
RF2 = RandomForestClassifier()
RF2.fit(X_train2, y_train2)
RF2.score(X_val2, y_val2)

In [None]:
MNB = MultinomialNB()
MNB.fit(X_train, y_train)
MNB.score(X_val, y_val)

In [None]:
MNB2 = MultinomialNB()
MNB2.fit(X_train2, y_train2)
MNB2.score(X_val2, y_val2)

In [None]:
#models = [("LM", LogisticRegression()), ("RF", RandomForestClassifier()), ("SVR", SVC(kernel="linear"))]

#meta_learner = LogisticRegression()


#stack_clf = StackingClassifier(estimators=models, final_estimator=meta_learner)
#stack_clf.fit(X_train2, y_train2)
#stack_clf.score(X_val2, y_val2)

### Ensemble methods

In [None]:
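# X_test was built with betterVect, so predict with the model fitted on those features
# (svc_lm); svc_lm2 was fitted on a different feature matrix.
predictions = svc_lm.predict(X_test)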


df_CSV = pd.DataFrame({"tweetid": df_test['tweetid'].values,
                   "sentiment": predictions,
                  })

df_CSV.to_csv("Team6_sample_MNB_ProT.csv", index=False)

<a id="six"></a>
## 6. Model Performance
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Model performance ⚡ |
| :--------------------------- |
| In this section you are required to compare the relative performance of the various trained ML models on a holdout dataset and comment on what model is the best and why. |

---
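
The `.score()` method of a scikit-learn classifier reports accuracy, which can look flattering when some sentiment classes are much more common than others. A comparison on the holdout split could instead rank the fitted models by macro-averaged F1, using the `f1_score` function that is already imported. The sketch below is an illustration only: `compare_models` is a hypothetical helper, and synthetic four-class data stands in for the vectorised tweets.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC


def compare_models(models, X_val, y_val):
    """Return (name, macro F1) pairs for already-fitted models, best first."""
    scores = [
        (name, f1_score(y_val, model.predict(X_val), average="macro"))
        for name, model in models.items()
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Synthetic four-class data standing in for the vectorised tweets (assumption).
X, y = make_classification(
    n_samples=400, n_features=20, n_informative=8, n_classes=4, random_state=1
)
X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.20, random_state=1)

fitted = {
    "logreg": LogisticRegression(max_iter=1000).fit(X_tr, y_tr),
    "svc_linear": SVC(kernel="linear").fit(X_tr, y_tr),
}

for name, score in compare_models(fitted, X_va, y_va):
    print(f"{name}: macro F1 = {score:.3f}")
```

In the notebook itself, the dictionary could hold the fitted models from the modelling section together with the matching validation split, for example `{"logreg": logreg, "svc_lm": svc_lm}` with `X_val` and `y_val`.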

<a id="seven"></a>
## 7. Model Explanations
<a class="anchor" id="1.1"></a>
<a href=#cont>Back to Table of Contents</a>

---
    
| ⚡ Description: Model explanation ⚡ |
| :--------------------------- |
| In this section, you are required to discuss how the best performing model works in a simple way so that both technical and non-technical stakeholders can grasp the intuition behind the model's inner workings. |

---