<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Project 3: NLP on Intermittent Fasting and Keto Diet

---

# Part 3: Modelling

In [1]:
# Import libraries
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
import numpy as np

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from nltk.corpus import stopwords
from sklearn.pipeline import Pipeline
from sklearn.ensemble import GradientBoostingClassifier, AdaBoostClassifier, VotingClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay covers the same use
from sklearn.metrics import confusion_matrix, classification_report, ConfusionMatrixDisplay, f1_score

In [2]:
# The data set is large, so remove the display limits on columns and rows
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

In [3]:
# Reading in the cleaned data
df = pd.read_csv('data/subreddit_cleaned.csv')

In [4]:
df.head()

Unnamed: 0,title,selftext,subreddit,created_utc,post_word_count,post_length,clean_text
0,Plateau sruggles,I (27F) have been intermittent fasting for abo...,0,1625589667,123,660,27f intermittent fasting 2 months starting wei...
1,Can I still do IF/OMAD now that I started exer...,"I started lifting 4x/week (about 40 minutes), ...",0,1625586042,127,679,started lifting 4xweek 40 minutes well taking ...
2,A new mindset,"Hello everyone,\n\n I am a mostly lurker here...",0,1625584307,182,929,hello everyone mostly lurker reddit first id l...
3,Weekend habits are making it difficult to loos...,"Hi everyone,\n\nI have been doing IF (16:8) fo...",0,1625582039,110,569,hi everyone 168 almost 3 years remained consis...
4,Are these times acceptable for IF?,"So, due to loss of employment, family has take...",0,1625582007,167,806,due loss employment family taken bother trying...


In [5]:
# Function for lemmatizing
def lemmatize_text(text):

    # split into words
    split_text = text.split()

    # instantiate lemmatizer
    lemmatizer = WordNetLemmatizer()

    # lemmatize and rejoin
    return ' '.join([lemmatizer.lemmatize(word) for word in split_text])

In [6]:
# Function for stemming
def stem_text(text):

    # split into words
    split_text = text.split()
    
    # instantiate stemmer
    stemmer = PorterStemmer()

    # stem and rejoin
    return ' '.join([stemmer.stem(word) for word in split_text])

## Baseline Model

In [7]:
df['subreddit'].value_counts(normalize = True)

1    0.544412
0    0.455588
Name: subreddit, dtype: float64

The two classes are fairly balanced, so our baseline model simply predicts the majority class every time. The majority class is our target subreddit, the Keto diet (class 1), which gives a baseline accuracy of 54.4%. Our models should score better than this.
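The baseline figure is just the share of the most frequent class. A minimal sketch of that calculation follows, using a small made-up label list (`labels_demo` is hypothetical and stands in for `df['subreddit']`), so the numbers here are illustrative and not the project's results.

```python
from collections import Counter

# Hypothetical labels standing in for df['subreddit']
# (1 = keto, 0 = intermittent fasting); not the real data.
labels_demo = [1] * 6 + [0] * 5

# Count each class and pick the most frequent one
counts = Counter(labels_demo)
majority_class, majority_count = counts.most_common(1)[0]

# Accuracy obtained by always predicting the majority class
baseline_accuracy = majority_count / len(labels_demo)
print(f"Majority class: {majority_class}, baseline accuracy: {baseline_accuracy:.3f}")
```

On the real data the same ratio is what `df['subreddit'].value_counts(normalize = True)` reports for the top class.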

## Modelling

In [8]:
X = df['clean_text']
y = df['subreddit']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 42, stratify = y)

In [9]:
print(y_train.shape)
print(y_test.shape)

(7717,)
(2573,)


In [10]:
# Let's instantiate a pipeline class with the following 2 steps as its list items:
# 1. CountVectorizer (transformer)
# 2. Multinomial Naive Bayes (estimator)

pipe = Pipeline([
    ('cvec', CountVectorizer()), # tuple for transformer object, class
    ('nb', MultinomialNB()) # tuple for estimator object, class
])

In [11]:
# Search over the following values of hyperparameters:
pipe_params = {
    'cvec__preprocessor' : [None, lemmatize_text, stem_text],
    'cvec__max_features': [None, 5_000], # vocabulary size: no limit vs. the 5,000 most frequent terms
    'cvec__max_df': [.9, .95],
    'cvec__ngram_range': [(1,1), (1,2)] # test unigram only (1,1) and unigram+bigram (1,2)
} # standard param dict definition for GridSearch CV

In [12]:
# Instantiate GridSearchCV.

gs = GridSearchCV(pipe, # the object that we are optimizing
                  param_grid=pipe_params, # which parameter values are we searching?
                  cv=5) # 5-fold cross-validation.
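
Before fitting, it helps to know how much work the search involves. The sketch below counts the parameter combinations in the grid above and multiplies by the number of folds. `grid_demo` and `n_folds` are illustrative names, and the preprocessors are written as placeholder strings because only the number of options matters for the count.

```python
from itertools import product

# Mirror of the search grid; preprocessor entries are placeholder strings
# (assumption: only the number of options per parameter affects the count).
grid_demo = {
    'cvec__preprocessor': ['None', 'lemmatize_text', 'stem_text'],
    'cvec__max_features': [None, 5_000],
    'cvec__max_df': [.9, .95],
    'cvec__ngram_range': [(1, 1), (1, 2)],
}
n_folds = 5

# Every combination of one value per parameter is one candidate model
n_combinations = len(list(product(*grid_demo.values())))

# Each candidate is fitted once per cross-validation fold
n_fits = n_combinations * n_folds
print(f"{n_combinations} candidates x {n_folds} folds = {n_fits} fits")
```

With its default `refit=True`, `GridSearchCV` also refits the best candidate once on the full training set after the search.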

In [None]:
gs.fit(X_train, y_train)

In [None]:
print(gs.best_score_)

In [None]:
# best parameters
gs.best_params_ 

In [None]:
# Score model on training set.
gs.score(X_train, y_train)

In [None]:
# Score model on testing set.
gs.score(X_test, y_test)

In [None]:
# Second pipeline: swap the CountVectorizer for a TfidfVectorizer
pipe_2 = Pipeline([
    ('tvec', TfidfVectorizer()), # tuple for transformer object, class
    ('nb', MultinomialNB()) # tuple for estimator object, class
])

In [None]:
# Search over the following values of hyperparameters:
pipe_params_2 = {
    'tvec__preprocessor' : [None, lemmatize_text, stem_text],
    'tvec__max_features': [None, 5_000],
    'tvec__max_df': [0.5, 0.9],
    'tvec__ngram_range':[(1,1), (1,2)]
}

In [None]:
# Instantiate GridSearchCV.

gs_2 = GridSearchCV(pipe_2, # the object that we are optimizing
                    param_grid=pipe_params_2, # which parameter values are we searching?
                    cv=5) # 5-fold cross-validation.

In [None]:
gs_2.fit(X_train, y_train)

In [None]:
print(gs_2.best_score_)

In [None]:
gs_2.best_params_ 

In [None]:
gs_2.score(X_train, y_train)

In [None]:
gs_2.score(X_test, y_test)