## <div style="color:white;display:fill;border-radius:8px;background-color:#323232;font-size:150%; letter-spacing:1.0px"><p style="padding: 12px;color:white;"><b><b><span style='color:white'><span style='color:#F1A424'>4 |</span></span></b> Modelling</b></p></div>

## <b>4.1 <span style='color:#F1A424'>|</span> Train Test Split</b>

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df['lemma_preprocessed_tweet'],
                                                    df['sentiment'],
                                                    test_size=0.2,
                                                    random_state = 0)

X_train.shape, X_test.shape

((17353,), (4339,))

## <b>4.2 <span style='color:#F1A424'>|</span> Feature Extraction</b>

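We convert the tweet text to numeric vectors using a **TF-IDF** vectorizer. **TF-IDF** stands for **Term Frequency-Inverse Document Frequency**. It assigns each word a weight that reflects how important the word is to a document relative to the whole corpus: the weight rises with the word's frequency in the document and falls with the number of documents that contain it. The technique is widely used in information retrieval and text mining. The vectorizer is wrapped in a `Pipeline` together with a logistic regression classifier, so it is fitted on the training data only.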
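To make the weighting concrete, here is a minimal sketch on a hypothetical three-sentence corpus (the sentences are invented for illustration and are not from the tweet dataset). It uses `TfidfVectorizer` with its default settings and compares the inverse-document-frequency weight of a word that appears in every document with that of a word that appears in only one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, not taken from the tweet dataset
toy_corpus = [
    "the economy is bad",
    "the economy is good",
    "the weather is good",
]

toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_corpus)  # sparse matrix: documents x vocabulary

vocab = toy_vectorizer.vocabulary_  # maps each term to its column index
idf = toy_vectorizer.idf_           # one inverse-document-frequency weight per term

print(toy_matrix.shape)
print(f"idf('the')     = {idf[vocab['the']]:.3f}")      # appears in all three documents
print(f"idf('economy') = {idf[vocab['economy']]:.3f}")  # appears in two documents
print(f"idf('weather') = {idf[vocab['weather']]:.3f}")  # appears in one document
```

Words shared by every document, such as "the", receive the lowest weight, while rarer words such as "weather" are weighted up, which is what lets the classifier focus on the more distinctive terms.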

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

clf = Pipeline([('tfidf', TfidfVectorizer()),
                ('log_clf', LogisticRegression())])

## <b>4.3 <span style='color:#F1A424'>|</span> Baseline Model</b>

- **To judge whether our models actually learn something from the tweets, we first build a dummy classifier as a baseline. A dummy classifier ignores the input features and predicts from the label distribution alone, so any useful model should beat it comfortably.**

### <b>4.3.1 <span style='color:#F1A424'>|</span> Vanilla Dummy Model</b>

In [None]:
from sklearn import metrics
from sklearn.dummy import DummyClassifier

# Baseline: DummyClassifier ignores the features and predicts from the training labels alone
dummy_clf_pipe = Pipeline([('vectorizer', TfidfVectorizer()),
                           ('dummy_clf_pipe', DummyClassifier(random_state=42))])

dummy_clf_pipe.fit(X_train, y_train)
y_pred = dummy_clf_pipe.predict(X_test)
print(f"Accuracy score: {metrics.accuracy_score(y_test, y_pred)}")

Accuracy score: 0.4150725973726665
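The baseline accuracy of about 0.415 is consistent with a classifier that always predicts the most common class. The sketch below checks this by hand, using the test-set class counts shown in the `support` column of the classification reports in section 6. It assumes the majority class is the same in the training and test sets, which the notebook does not show directly.

```python
from collections import Counter

# Test-set class counts, taken from the `support` column of the classification reports
test_support = {"Negative": 1236, "Neutral": 1801, "Positive": 1302}

# Rebuild a label list with the same class distribution as y_test
y_true = [label for label, n in test_support.items() for _ in range(n)]

# Always predict the most common class (assumed to be the training majority class too)
majority_label, majority_count = Counter(y_true).most_common(1)[0]
y_majority = [majority_label] * len(y_true)

# Accuracy = share of test tweets that belong to the majority class
majority_accuracy = sum(t == p for t, p in zip(y_true, y_majority)) / len(y_true)
print(majority_label, majority_accuracy)
```

Any model worth keeping therefore needs to score well above the share of `Neutral` tweets in the test set.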


### <b>4.3.2 <span style='color:#F1A424'>|</span> Vanilla Logistic Regression Model</b>

In [None]:
# Fit the logistic regression pipeline (TF-IDF + LogisticRegression) defined in section 4.2
clf.fit(X_train, y_train)

In [None]:
# Accuracy on the training data
y_pred_train = clf.predict(X_train)
print(f"Accuracy score: {metrics.accuracy_score(y_train, y_pred_train)}")

Accuracy score: 0.926986688180718


In [None]:
y_pred = clf.predict(X_test)
print(f"Accuracy score: {metrics.accuracy_score(y_test, y_pred)}")

Accuracy score: 0.81793039870938


## <div style="color:white;display:fill;border-radius:8px;background-color:#323232;font-size:150%; letter-spacing:1.0px"><p style="padding: 12px;color:white;"><b><b><span style='color:white'><span style='color:#F1A424'>5 |</span></span></b> Iterate</b></p></div>

## <b>5.1 <span style='color:#F1A424'>|</span> RandomForestClassifier</b>

In [None]:
from sklearn.ensemble import RandomForestClassifier

rf_clf = Pipeline([('tfidf', TfidfVectorizer()),
                   ('rf_clf', RandomForestClassifier(n_jobs=-1))])

rf_clf.fit(X_train, y_train)

In [None]:
# Accuracy on the training data
y_pred_train = rf_clf.predict(X_train)
print(f"Accuracy score: {metrics.accuracy_score(y_train, y_pred_train)}")

Accuracy score: 1.0


In [None]:
rf_y_pred = rf_clf.predict(X_test)
print(f"Accuracy score: {metrics.accuracy_score(y_test, rf_y_pred)}")

Accuracy score: 0.8186218022585849


## <b>5.2 <span style='color:#F1A424'>|</span> MultinomialNB</b>

In [None]:
from sklearn.naive_bayes import MultinomialNB

multiNB_clf = Pipeline([('tfidf', TfidfVectorizer()),
                        ('multiNB_clf', MultinomialNB())])

multiNB_clf.fit(X_train, y_train)

In [None]:
# Accuracy on the training data
y_pred_train = multiNB_clf.predict(X_train)
print(f"Accuracy score: {metrics.accuracy_score(y_train, y_pred_train)}")

Accuracy score: 0.8849190341727655


In [None]:
multiNB_pred = multiNB_clf.predict(X_test)
print(f"Accuracy score: {metrics.accuracy_score(y_test, multiNB_pred)}")

Accuracy score: 0.7416455404471076


## <b>5.3 <span style='color:#F1A424'>|</span> ComplementNB</b>

In [None]:
from sklearn.naive_bayes import BernoulliNB, GaussianNB, MultinomialNB, ComplementNB

ComplementNB_clf = Pipeline([('tfidf', TfidfVectorizer()),
                ('ComplementNB_clf', ComplementNB())])

ComplementNB_clf.fit(X_train, y_train)

In [None]:
# Accuracy on the training data
y_pred_train = ComplementNB_clf.predict(X_train)
print(f"Accuracy score: {metrics.accuracy_score(y_train, y_pred_train)}")

Accuracy score: 0.8878580072609923


In [None]:
# Accuracy on the test data
ComplementNB_pred = ComplementNB_clf.predict(X_test)
print(f"Accuracy score: {metrics.accuracy_score(y_test, ComplementNB_pred)}")

Accuracy score: 0.715141737727587


## <div style="color:white;display:fill;border-radius:8px;background-color:#323232;font-size:150%; letter-spacing:1.0px"><p style="padding: 12px;color:white;"><b><b><span style='color:white'><span style='color:#F1A424'>6 |</span></span></b> Evaluate</b></p></div>

In [None]:
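from sklearn.metrics import classification_report, confusion_matrix

# Classification report for the logistic regression model (y_pred was last set in section 4.3.2)
print(classification_report(y_test, y_pred))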

              precision    recall  f1-score   support

    Negative       0.86      0.74      0.80      1236
     Neutral       0.78      0.93      0.84      1801
    Positive       0.86      0.74      0.79      1302

    accuracy                           0.82      4339
   macro avg       0.83      0.80      0.81      4339
weighted avg       0.82      0.82      0.82      4339



In [None]:
# Classification report for the random forest model
print(classification_report(y_test, rf_y_pred))

              precision    recall  f1-score   support

    Negative       0.85      0.73      0.78      1236
     Neutral       0.78      0.96      0.86      1801
    Positive       0.86      0.72      0.78      1302

    accuracy                           0.82      4339
   macro avg       0.83      0.80      0.81      4339
weighted avg       0.83      0.82      0.82      4339



In [None]:
# Classification report for the multinomial naive Bayes model
print(classification_report(y_test, multiNB_pred))

              precision    recall  f1-score   support

    Negative       0.75      0.69      0.72      1236
     Neutral       0.72      0.85      0.78      1801
    Positive       0.78      0.64      0.70      1302

    accuracy                           0.74      4339
   macro avg       0.75      0.73      0.73      4339
weighted avg       0.75      0.74      0.74      4339



In [None]:
# Confusion matrix for the logistic regression model
cm = confusion_matrix(y_test, y_pred)
cm

array([[ 919,  229,   88],
       [  57, 1671,   73],
       [  88,  255,  959]], dtype=int64)

In [None]:
# Confusion matrix for the random forest model
cm = confusion_matrix(y_test, rf_y_pred)
cm

array([[ 897,  235,  104],
       [  37, 1720,   44],
       [ 117,  250,  935]], dtype=int64)

In [None]:
# Confusion matrix for the multinomial naive Bayes model
cm = confusion_matrix(y_test, multiNB_pred)
cm

array([[ 855,  278,  103],
       [ 131, 1535,  135],
       [ 160,  314,  828]], dtype=int64)
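The accuracy scores and the per-class figures in the classification reports can be recovered directly from the three confusion matrices above. The sketch below copies the matrices from the outputs and assumes the row and column order is `Negative`, `Neutral`, `Positive` (the sorted label order, which matches the `support` counts). The `summary` dictionary is only a helper for this illustration.

```python
import numpy as np

# Confusion matrices copied from the outputs above (rows: true class, columns: predicted class)
labels = ["Negative", "Neutral", "Positive"]
confusion_matrices = {
    "logistic regression": np.array([[919, 229, 88], [57, 1671, 73], [88, 255, 959]]),
    "random forest": np.array([[897, 235, 104], [37, 1720, 44], [117, 250, 935]]),
    "multinomial NB": np.array([[855, 278, 103], [131, 1535, 135], [160, 314, 828]]),
}

summary = {}
for name, cm in confusion_matrices.items():
    accuracy = np.trace(cm) / cm.sum()        # correct predictions / all predictions
    recall = np.diag(cm) / cm.sum(axis=1)     # per class: correct / true instances
    precision = np.diag(cm) / cm.sum(axis=0)  # per class: correct / predicted instances
    summary[name] = {"accuracy": accuracy, "recall": recall, "precision": precision}

    print(f"{name}: accuracy = {accuracy:.4f}")
    for label, r, p in zip(labels, recall, precision):
        print(f"  {label:<8} recall = {r:.2f}  precision = {p:.2f}")
```

In all three matrices the largest off-diagonal counts sit in the `Neutral` column, so most errors are `Negative` or `Positive` tweets predicted as `Neutral`. This explains the high recall but lower precision for `Neutral` in the reports.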

In [None]:
rf_clf.predict(['Wow! This is an amazing initiative.'])

array(['Positive'], dtype=object)

In [None]:
rf_clf.predict(['this really sucks!'])

array(['Neutral'], dtype=object)

In [None]:
rf_clf.predict(['The judicial system in Kenya is really corrupt'])

array(['Negative'], dtype=object)

In [None]:
rf_clf.predict(['Kenyans are livid and angry at the current state of the economy'])

array(['Negative'], dtype=object)