# Classification

Two methods to construct features for Machine Learning -
> - Bag of Words (BOW)
> - TFIDF

The four classification methods that will be used in this lab are -
> - **Naive Bayes**
> - **Decision Tree**
> - **Random Forest**
> - **Neural Networks**

## Data Preprocessing

In [1]:
#import all the packages that will be used in this lab
import matplotlib.pyplot as plt
import re
import os
import pandas as pd
import numpy as np

#processing libraries
import nltk
from nltk.tokenize import word_tokenize
from nltk import PorterStemmer
from nltk import WordNetLemmatizer
from nltk.corpus import stopwords

#ML libraries
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import metrics

import gensim

### Load the data

In [2]:
#use pandas to read the csv file
data = pd.read_csv("Mental-Health-Twitter.csv", usecols=["label", "post_text"])

In [3]:
data.head()

|   | post_text | label |
|---|-----------|-------|
| 0 | It's just over 2 years since I was diagnosed w... | 1 |
| 1 | It's Sunday, I need a break, so I'm planning t... | 1 |
| 2 | Awake but tired. I need to sleep but my brain ... | 1 |
| 3 | RT @SewHQ: #Retro bears make perfect gifts and... | 1 |
| 4 | It’s hard to say whether packing lists are mak... | 1 |


In [4]:
display(data.isna().sum())

post_text    0
label        0
dtype: int64

In [6]:
data[['label']].value_counts()

label
0        10000
1        10000
dtype: int64

In [7]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20000 entries, 0 to 19999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   post_text  20000 non-null  object
 1   label      20000 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 312.6+ KB


### Preprocessing of the data

In [8]:
# remove non alphabets
remove_non_alphabets = lambda x: re.sub(r'[^a-zA-Z]',' ',x)

In [9]:
# tokenization
tokenize = lambda x: word_tokenize(x)

In [10]:
# stemming
ps = PorterStemmer()
stem = lambda w: [ ps.stem(x) for x in w ]

In [11]:
# lemmatization
lemmatizer = WordNetLemmatizer()
lemmatize = lambda x: [ lemmatizer.lemmatize(word) for word in x ]

In [12]:
# apply all the methods above to the column post_text
print('Processing : [=', end='')
data['post_text'] = data['post_text'].apply(remove_non_alphabets)
print('=', end='')
data['post_text'] = data['post_text'].apply(tokenize)
print('=', end='')
data['post_text'] = data['post_text'].apply(stem)
print('=', end='')
data['post_text'] = data['post_text'].apply(lemmatize)
print('=', end='')
data['post_text'] = data['post_text'].apply(lambda x: ' '.join(x))
print('] : Completed', end='')

Processing : [=====] : Completed
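
To see what the cleaning step does to a single tweet, the sketch below applies the same regular expression and then splits on whitespace. The whitespace split (`simple_tokenize`) is a hypothetical stand-in for `word_tokenize`, used only so the example runs without NLTK data; stemming and lemmatization are left out.

```python
import re

def remove_non_alphabets(text):
    # Replace every character that is not an ASCII letter with a space.
    return re.sub(r'[^a-zA-Z]', ' ', text)

def simple_tokenize(text):
    # Hypothetical stand-in for nltk's word_tokenize: split on whitespace.
    return text.split()

sample = "It's just over 2 years since I was diagnosed"
cleaned = remove_non_alphabets(sample)
tokens = simple_tokenize(cleaned)
print(tokens)
```

Digits and apostrophes become spaces, which is why `It's` ends up as two tokens in the processed column. The regular expression does not change case; the lowercasing seen in the processed text happens in a later step.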

In [13]:
data.head()

|   | post_text | label |
|---|-----------|-------|
| 0 | it s just over year sinc i wa diagnos with anx... | 1 |
| 1 | it s sunday i need a break so i m plan to spen... | 1 |
| 2 | awak but tire i need to sleep but my brain ha ... | 1 |
| 3 | rt sewhq retro bear make perfect gift and are ... | 1 |
| 4 | it s hard to say whether pack list are make li... | 1 |


### Splitting the data set

In [14]:
# split to 30 percent test data and 70 percent train data

train_corpus, test_corpus, train_labels, test_labels = train_test_split(data["post_text"],
                                                                        data["label"],
                                                                        test_size=0.3)
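
The call above passes no `random_state`, so each run draws a different split and the scores reported below can shift slightly between runs. Fixing a seed makes a split repeatable. The sketch below illustrates the idea with the standard library only; `split_70_30` and the seed value are hypothetical and not part of the lab code.

```python
import random

def split_70_30(items, labels, test_size=0.3, seed=42):
    # Shuffle the indices with a seeded generator so the split is repeatable.
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_test = int(round(len(items) * test_size))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train = ([items[i] for i in train_idx], [labels[i] for i in train_idx])
    test = ([items[i] for i in test_idx], [labels[i] for i in test_idx])
    return train, test

# Made-up data, for illustration only.
texts = ["post %d" % i for i in range(10)]
labels = [i % 2 for i in range(10)]
(train_x, train_y), (test_x, test_y) = split_70_30(texts, labels)
print(len(train_x), len(test_x))
```

With scikit-learn, the same effect is usually obtained by passing an integer `random_state` to `train_test_split`.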

## Feature Engineering

### Bag of Words (BOW)

In [15]:
# build bag of words features' vectorizer and get features
bow_vectorizer = CountVectorizer(min_df=1, ngram_range=(1,1))
bow_train_features = bow_vectorizer.fit_transform(train_corpus)
bow_test_features = bow_vectorizer.transform(test_corpus)

In [16]:
print(bow_train_features[0]) 

  (0, 21167)	1
  (0, 14881)	1
  (0, 6403)	1
  (0, 22148)	1
  (0, 16609)	1
  (0, 664)	1
  (0, 18244)	1
  (0, 1541)	1
  (0, 2691)	1
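
Each printed line such as `(0, 21167)  1` reads as (row index, vocabulary column index) followed by the term count. A minimal standard-library sketch of the same idea follows. It is a simplification and not scikit-learn's implementation: it tokenizes by whitespace, whereas `CountVectorizer` by default keeps only tokens of two or more characters. The function names and the toy documents are hypothetical.

```python
from collections import Counter

def fit_vocabulary(docs):
    # Map each distinct token in the training documents to a column index.
    tokens = sorted({tok for doc in docs for tok in doc.split()})
    return {tok: idx for idx, tok in enumerate(tokens)}

def transform(docs, vocabulary):
    # Count known tokens per document; tokens unseen during fit are dropped.
    rows = []
    for doc in docs:
        counts = Counter(tok for tok in doc.split() if tok in vocabulary)
        rows.append({vocabulary[tok]: n for tok, n in counts.items()})
    return rows

# Made-up documents, for illustration only.
train_docs = ["need sleep need break", "brain awak"]
test_docs = ["need holiday"]

vocab = fit_vocabulary(train_docs)
print(vocab)
print(transform(train_docs, vocab))
print(transform(test_docs, vocab))
```

The vocabulary is learned from the training documents only, which is why the notebook calls `fit_transform` on `train_corpus` but only `transform` on `test_corpus`.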


### TFIDF

In [17]:
# build tfidf features' vectorizer and get features
tfidf_vectorizer=TfidfVectorizer(min_df=1, 
                                 norm='l2',
                                 smooth_idf=True,
                                 use_idf=True,
                                 ngram_range=(1,1))
tfidf_train_features = tfidf_vectorizer.fit_transform(train_corpus)  
tfidf_test_features = tfidf_vectorizer.transform(test_corpus)

In [18]:
print(tfidf_train_features[0])

  (0, 2691)	0.3565280432315922
  (0, 1541)	0.2188968126795628
  (0, 18244)	0.4437949106730895
  (0, 664)	0.17562444220499382
  (0, 16609)	0.4661846958087107
  (0, 22148)	0.23982814859233698
  (0, 6403)	0.19205308325699855
  (0, 14881)	0.4512598404171227
  (0, 21167)	0.28602190567111235
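
The same nine column indices appear here as in the bag-of-words output; only the values differ. With `smooth_idf=True` and `norm='l2'`, scikit-learn documents the weighting as idf(t) = ln((1 + n) / (1 + df(t))) + 1, multiplied by the raw term count, with each row then scaled to unit L2 length. The sketch below reproduces that formula with the standard library; `tfidf_rows` and the toy documents are hypothetical, and whitespace tokenization is a simplifying assumption.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    n_docs = len(docs)
    tokenized = [doc.split() for doc in docs]
    # Document frequency: number of documents containing each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    # Smoothed idf, as with smooth_idf=True.
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        weights = {t: tf[t] * idf[t] for t in tf}
        # L2 normalization, as with norm='l2'; guard against empty documents.
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        rows.append({t: w / norm for t, w in weights.items()})
    return rows, idf

# Made-up documents, for illustration only.
docs = ["need sleep", "need break", "need sleep sleep"]
rows, idf = tfidf_rows(docs)
print(idf)
print(rows[0])
```

A term that occurs in every document (`need` here) gets the minimum idf of 1, so rarer terms in the same row receive larger weights.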


## Modeling and Evaluation 

### Confusion Matrix

In [29]:
# define a function to evaluate our classification models based on four metrics:
# accuracy, precision, recall and F1 score
# This function is also useful in other cases. It compares test_y and pred_y.
# Both contain 1s and 0s.
def get_metrics(true_labels, predicted_labels):
    print('Accuracy:',
          np.round(metrics.accuracy_score(true_labels, predicted_labels), 2))
    print('Precision:',
          np.round(metrics.precision_score(true_labels, predicted_labels), 2))
    print('Recall:',
          np.round(metrics.recall_score(true_labels, predicted_labels), 2))
    print('F1 Score:',
          np.round(metrics.f1_score(true_labels, predicted_labels), 2))
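
The four metrics printed by `get_metrics` can all be derived from the four cells of a binary confusion matrix. The sketch below does this by hand, using the counts from the Naive Bayes on BOW run reported later in this lab (TN=2510, FP=483, FN=342, TP=2665), with label 1 treated as the positive class. `metrics_from_counts` is a hypothetical helper, not part of the lab code.

```python
def metrics_from_counts(tn, fp, fn, tp):
    # Accuracy: share of all predictions that are correct.
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    # Precision: share of predicted positives that are truly positive.
    precision = tp / (tp + fp)
    # Recall: share of true positives that were found.
    recall = tp / (tp + fn)
    # F1: harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

acc, prec, rec, f1 = metrics_from_counts(tn=2510, fp=483, fn=342, tp=2665)
print(round(acc, 4), round(prec, 2), round(rec, 2), round(f1, 2))
```

The rounded precision, recall and F1 values agree with the class-1 row of the classification report for that run (0.85, 0.89, 0.87).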

### Assume the cost of misclassifying a depressed person as healthy (a false negative) is 50, and the cost of misclassifying a healthy person as depressed (a false positive) is 10.

In [32]:
# define a function that trains the model, performs predictions and evaluates the predictions
def train_predict_evaluate_model(classifier, 
                                 train_features, train_labels, 
                                 test_features, test_labels):
    # build model
    classifier.fit(train_features, train_labels)
    # predict using model
    predictions = classifier.predict(test_features)
    # evaluate model prediction performance
    # get_metrics(true_labels=test_labels, predicted_labels=predictions)
    print(metrics.classification_report(test_labels, predictions))
    print("Confusion matrix: ")
    # rows are true labels, columns are predicted labels
    cm = metrics.confusion_matrix(test_labels, predictions)
    df = pd.DataFrame(cm, index=(0, 1), columns=(0, 1))
    print(df)
    print()
    print('Total Cost:')
    # true 0 predicted as 1 costs 10 each; true 1 predicted as 0 costs 50 each
    print(df.iloc[0, 1] * 10 + df.iloc[1, 0] * 50)
    return predictions, metrics.accuracy_score(test_labels, predictions)
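
The total cost in the function above hard-codes the two weights. One way to make the assumption explicit is a small helper that takes the weights as parameters. This is only a sketch: `total_cost`, its parameter names and the example matrix are hypothetical, and the matrix layout (rows are true labels, columns are predicted labels) follows the convention of `metrics.confusion_matrix`.

```python
def total_cost(cm, fp_cost=10, fn_cost=50):
    # cm[i][j] = number of samples with true label i predicted as label j.
    false_positives = cm[0][1]  # healthy predicted as depression
    false_negatives = cm[1][0]  # depression predicted as healthy
    return false_positives * fp_cost + false_negatives * fn_cost

# Made-up counts, for illustration only.
example_cm = [[90, 10],
              [5, 95]]
print(total_cost(example_cm))
```

Because a false negative is weighted five times as heavily as a false positive, a model with slightly lower accuracy can still have a lower total cost if it misses fewer depression cases.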

## Machine Learning

### Import Classifiers from Libraries

In [33]:
from sklearn.naive_bayes import MultinomialNB # import naive bayes
from sklearn.tree import DecisionTreeClassifier # import Decision Tree
from sklearn.ensemble import RandomForestClassifier # import random forest

### Machine Learning on BOW Approach

#### NB on BOW

In [34]:
# assign naive bayes function to a variable
mnb = MultinomialNB()

# predict and evaluate naive bayes
mnb_bow_predictions, mnb_bow_accuracy = train_predict_evaluate_model(classifier=mnb,
                                           train_features=bow_train_features,
                                           train_labels=train_labels,
                                           test_features=bow_test_features,
                                           test_labels=test_labels)

              precision    recall  f1-score   support

           0       0.88      0.84      0.86      2993
           1       0.85      0.89      0.87      3007

    accuracy                           0.86      6000
   macro avg       0.86      0.86      0.86      6000
weighted avg       0.86      0.86      0.86      6000

Confusion matrix: 
      0     1
0  2510   483
1   342  2665

Total Cost:
21930
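
For intuition about what the Naive Bayes classifier does with the count features, the sketch below implements a tiny multinomial Naive Bayes with additive smoothing (`alpha=1.0`, which is also the default of `MultinomialNB`). It is a simplified illustration, not scikit-learn's implementation; `train_mnb`, `predict_mnb` and the toy documents are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_mnb(docs, labels, alpha=1.0):
    vocab = {tok for doc in docs for tok in doc.split()}
    class_docs = Counter(labels)
    token_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        token_counts[label].update(doc.split())
    # Log prior: share of training documents in each class.
    log_prior = {c: math.log(n / len(docs)) for c, n in class_docs.items()}
    # Log likelihood of each token per class, with additive smoothing.
    log_likelihood = {}
    for c in class_docs:
        total = sum(token_counts[c].values()) + alpha * len(vocab)
        log_likelihood[c] = {t: math.log((token_counts[c][t] + alpha) / total)
                             for t in vocab}
    return log_prior, log_likelihood

def predict_mnb(doc, log_prior, log_likelihood):
    scores = {}
    for c in log_prior:
        # Tokens never seen in training are skipped.
        scores[c] = log_prior[c] + sum(log_likelihood[c][t]
                                       for t in doc.split()
                                       if t in log_likelihood[c])
    return max(scores, key=scores.get)

# Made-up documents and labels, for illustration only.
docs = ["tire sad sleep", "sad anxieti tire", "happi gift bear", "happi holiday"]
labels = [1, 1, 0, 0]
prior, likelihood = train_mnb(docs, labels)
print(predict_mnb("sad tire", prior, likelihood))
print(predict_mnb("happi bear", prior, likelihood))
```

Smoothing keeps a token that never appeared in one class from forcing that class's probability to zero.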


#### DT on BOW

In [35]:
# assign decision tree function to an object
dt = DecisionTreeClassifier()

# predict and evaluate decision tree
dt_bow_predictions, dt_bow_accuracy = train_predict_evaluate_model(classifier=dt,
                                                               train_features=bow_train_features,
                                                               train_labels=train_labels,
                                                               test_features=bow_test_features,
                                                               test_labels=test_labels)

              precision    recall  f1-score   support

           0       0.75      0.73      0.74      2993
           1       0.74      0.76      0.75      3007

    accuracy                           0.75      6000
   macro avg       0.75      0.75      0.75      6000
weighted avg       0.75      0.75      0.75      6000

Confusion matrix: 
      0     1
0  2198   795
1   721  2286

Total Cost:
44000


#### RF on BOW

In [36]:
# assign random forest function to an object
rf = RandomForestClassifier(criterion="entropy")

# predict and evaluate random forest
rf_bow_predictions, rf_bow_accuracy = train_predict_evaluate_model(classifier=rf,
                                           train_features=bow_train_features,
                                           train_labels=train_labels,
                                           test_features=bow_test_features,
                                           test_labels=test_labels)

              precision    recall  f1-score   support

           0       0.85      0.79      0.82      2993
           1       0.80      0.86      0.83      3007

    accuracy                           0.83      6000
   macro avg       0.83      0.83      0.83      6000
weighted avg       0.83      0.83      0.83      6000

Confusion matrix: 
      0     1
0  2355   638
1   406  2601

Total Cost:
26680
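
The random forest above is built with `criterion="entropy"` rather than the default Gini impurity, so candidate splits are scored by how much they reduce the entropy of the class labels. A minimal sketch of the entropy calculation for a node follows; `entropy` is a hypothetical helper and the counts are made up.

```python
import math

def entropy(class_counts):
    # Shannon entropy (in bits) of the class distribution at a node.
    total = sum(class_counts)
    result = 0.0
    for count in class_counts:
        if count:
            p = count / total
            result -= p * math.log2(p)
    return result

print(entropy([5, 5]))   # perfectly mixed node
print(entropy([10, 0]))  # pure node
```

A perfectly mixed two-class node has the maximum entropy of 1 bit and a pure node has entropy 0, so splits that produce purer children are preferred.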


### Machine Learning on TFIDF Approach

#### NB on TFIDF

In [37]:
# predict and evaluate naive bayes
mnb_tfidf_predictions, mnb_tfidf_accuracy = train_predict_evaluate_model(classifier=mnb,
                                           train_features=tfidf_train_features,
                                           train_labels=train_labels,
                                           test_features=tfidf_test_features,
                                           test_labels=test_labels)

              precision    recall  f1-score   support

           0       0.90      0.81      0.85      2993
           1       0.83      0.91      0.86      3007

    accuracy                           0.86      6000
   macro avg       0.86      0.86      0.86      6000
weighted avg       0.86      0.86      0.86      6000

Confusion matrix: 
      0     1
0  2419   574
1   282  2725

Total Cost:
19840


#### DT on TFIDF

In [38]:
# predict and evaluate decision tree
dt_tfidf_predictions, dt_tfidf_accuracy = train_predict_evaluate_model(classifier=dt,
                                           train_features=tfidf_train_features,
                                           train_labels=train_labels,
                                           test_features=tfidf_test_features,
                                           test_labels=test_labels)

              precision    recall  f1-score   support

           0       0.75      0.71      0.73      2993
           1       0.73      0.76      0.74      3007

    accuracy                           0.74      6000
   macro avg       0.74      0.74      0.73      6000
weighted avg       0.74      0.74      0.74      6000

Confusion matrix: 
      0     1
0  2127   866
1   723  2284

Total Cost:
44810


#### RF on TFIDF

In [39]:
# predict and evaluate random forest
rf_tfidf_predictions, rf_tfidf_accuracy = train_predict_evaluate_model(classifier=rf,
                                           train_features=tfidf_train_features,
                                           train_labels=train_labels,
                                           test_features=tfidf_test_features,
                                           test_labels=test_labels)

              precision    recall  f1-score   support

           0       0.84      0.75      0.79      2993
           1       0.78      0.86      0.82      3007

    accuracy                           0.81      6000
   macro avg       0.81      0.81      0.80      6000
weighted avg       0.81      0.81      0.80      6000

Confusion matrix: 
      0     1
0  2259   734
1   433  2574

Total Cost:
28990


### Accuracy Comparison of Different models on different features

In [42]:
mnb = MultinomialNB()

In [45]:
# create a dictionary that stores all the accuracy information
accuracy_dict = {
    "mnb": {"bow": mnb_bow_accuracy, "tfidf": mnb_tfidf_accuracy},
    "dt": {"bow": dt_bow_accuracy, "tfidf": dt_tfidf_accuracy},
    "rf": {"bow": rf_bow_accuracy, "tfidf": rf_tfidf_accuracy},
}
        
#Accuracy Matrix
pd.DataFrame(accuracy_dict).rename(columns={"mnb":"Naive Bayes", 
                                            "dt":"Decision Tree", 
                                            "rf":"Random Forest"}, 
                                   index={"bow":"Bag-of-words", 
                                          "tfidf":"TFIDF"})

|              | Naive Bayes | Decision Tree | Random Forest |
|--------------|-------------|---------------|---------------|
| Bag-of-words | 0.8625      | 0.747333      | 0.826         |
| TFIDF        | 0.857333    | 0.735167      | 0.8055        |


## Bag-of-words & Naive Bayes has the highest accuracy (86.25%)
## TFIDF & Naive Bayes has the lowest cost (574 * 10 + 282 * 50 = 19840)
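
The two conclusions above can be checked by collecting the accuracy and total cost reported for each run and selecting the maximum and minimum. The sketch below hard-codes the numbers printed earlier in this lab; the `results` dictionary and its layout are hypothetical.

```python
# Accuracy and total cost as reported in the outputs above.
results = {
    ("Naive Bayes", "BOW"): {"accuracy": 0.8625, "cost": 21930},
    ("Naive Bayes", "TFIDF"): {"accuracy": 0.857333, "cost": 19840},
    ("Decision Tree", "BOW"): {"accuracy": 0.747333, "cost": 44000},
    ("Decision Tree", "TFIDF"): {"accuracy": 0.735167, "cost": 44810},
    ("Random Forest", "BOW"): {"accuracy": 0.826, "cost": 26680},
    ("Random Forest", "TFIDF"): {"accuracy": 0.8055, "cost": 28990},
}

best_accuracy = max(results, key=lambda k: results[k]["accuracy"])
lowest_cost = min(results, key=lambda k: results[k]["cost"])
print("Highest accuracy:", best_accuracy)
print("Lowest cost:", lowest_cost)
```

The two criteria pick different feature sets for Naive Bayes because the cost weights a missed depression case five times as heavily as a false alarm.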