# ![](https://ga-dash.s3.amazonaws.com/production/assets/logo-9f88ae6c9c3871690e33280fcf557f33.png) Project 3: Web APIs & Classification


## Problem statement

The objective of this project is to collect posts from two subreddits using Reddit's API and then use NLP to train a classifier that predicts which subreddit a given post came from. Data are to be gathered and prepared using the `requests` library. Two models are to be created and compared, and one of the two must be a Bayes classifier.

## Importing libraries

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import requests
import time
import random
import nltk
import re

from bs4 import BeautifulSoup
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
%matplotlib inline

## Data collection

### Keto subreddit

In [2]:
# Loading datasets and setting header
urlketo = 'https://www.reddit.com/r/keto.json'
urldrink = 'https://www.reddit.com/r/stopdrinking.json'
headers = {'User-agent': 'Oscar Goh 69.69'}

In [3]:
# Creating a loop to scrape posts from a subreddit
ketoposts = []
after = None
for i in range(40):
    # The first request has no 'after' token; later requests page forward from it
    params = {} if after is None else {'after': after}
    res = requests.get(urlketo, params=params, headers=headers)
    if res.status_code == 200:
        the_json = res.json()
        ketoposts.extend(the_json['data']['children'])
        after = the_json['data']['after']
    else:
        # Report the failing HTTP status and stop scraping
        print(res.status_code)
        break
    sleep_duration = random.randint(2, 6)
    time.sleep(sleep_duration)

# Code from primer video on how to use Reddit's API

In [4]:
# Check to see how many posts we successfully scraped
len(ketoposts)

994

In [5]:
# Check to see how many unique posts out of the posts we scraped
len(set([p['data']['name'] for p in ketoposts]))

742
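
The gap between 994 scraped posts and 742 unique ones means some listing pages returned posts that had already been collected. One possible cause (an assumption, not something the output above confirms) is that the listing ran out, `after` came back as `None`, and the loop started again from the first page. A minimal sketch of filtering duplicates as each page arrives, keyed on `name`, is shown below; `add_unique_posts` and the two sample pages are hypothetical and only mimic the shape of `the_json['data']['children']`.

```python
def add_unique_posts(collected, seen_names, page_children):
    """Append only posts whose 'name' has not been seen before.

    Returns the number of new posts added from this page.
    """
    added = 0
    for child in page_children:
        name = child['data']['name']
        if name not in seen_names:
            seen_names.add(name)
            collected.append(child)
            added += 1
    return added


# Two made-up pages that overlap on one post (t3_b).
page_1 = [{'kind': 't3', 'data': {'name': 't3_a'}},
          {'kind': 't3', 'data': {'name': 't3_b'}}]
page_2 = [{'kind': 't3', 'data': {'name': 't3_b'}},
          {'kind': 't3', 'data': {'name': 't3_c'}}]

posts, seen = [], set()
new_counts = [add_unique_posts(posts, seen, page) for page in (page_1, page_2)]
print(new_counts)  # [2, 1]
print(len(posts))  # 3
```

In the scraping loop, a page that adds zero new posts could serve as a signal to stop early instead of sleeping and requesting again.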

### Quit drinking subreddit

In [6]:
# Same loop to scrape posts from another subreddit
drinkposts = []
after = None
for i in range(40):
    # The first request has no 'after' token; later requests page forward from it
    params = {} if after is None else {'after': after}
    res = requests.get(urldrink, params=params, headers=headers)
    if res.status_code == 200:
        the_json = res.json()
        drinkposts.extend(the_json['data']['children'])
        after = the_json['data']['after']
    else:
        # Report the failing HTTP status and stop scraping
        print(res.status_code)
        break
    sleep_duration = random.randint(2, 6)
    time.sleep(sleep_duration)

In [7]:
# Check to see how many posts we successfully scraped
len(drinkposts)

995

In [8]:
# Check to see how many unique posts out of the posts we scraped
len(set([p['data']['name'] for p in drinkposts]))

995

In [9]:
# Making the lists of posts into dataframes
ketodf = pd.DataFrame(ketoposts)
drinkdf = pd.DataFrame(drinkposts)

In [10]:
# Saving the dataframes into csv files
ketodf.to_csv('keto.csv')
drinkdf.to_csv('drink.csv')

In [11]:
# Checking the dataframe from the first subreddit
ketodf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 994 entries, 0 to 993
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   kind    994 non-null    object
 1   data    994 non-null    object
dtypes: object(2)
memory usage: 15.7+ KB


In [12]:
# Checking the dataframe from the second subreddit
drinkdf.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 995 entries, 0 to 994
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   kind    995 non-null    object
 1   data    995 non-null    object
dtypes: object(2)
memory usage: 15.7+ KB


In [13]:
ketodf.shape

(994, 2)

### Extracting useful data: title, post and name

In [14]:
# Creating a dictionary from the dataframe with only name, title and post
ketotext = {}
for i in range(ketodf.shape[0]):
    ketotext[i] = {
        'title': ketodf['data'][i]['title'],
        'post': ketodf['data'][i]['selftext'],
        'name': ketodf['data'][i]['name']
    }

In [15]:
# Making the dictionary into a dataframe
ketodf1 = pd.DataFrame(ketotext)

In [16]:
# Transposing the dataframe
ketodf1 = ketodf1.T

In [17]:
# Checking the dataframe
ketodf1.head()

Unnamed: 0,title,post,name
0,[2021-06-16] - /r/keto Beginners &amp; Communi...,Hello /r/keto Community!\n\nPlease use this su...,t3_o11ass
1,[2021-06-16] - [Workout Wednesday] – What’s yo...,Hey /r/keto!\n\nRunning? Lifting? Yoga? Swimmi...,t3_o11as5
2,"Keto transformation.. Over 77LB / 35kg gone, a...","https://imgur.com/a/Fce07sf\n\nHI all, hopeful...",t3_o1fb8a
3,"36 days, 15.2 pounds gone!",Good morning Keto peeps. This is my third atte...,t3_o157o7
4,"Year 2 Update - Slow Progress, but getting the...",Hey everyone! What a crazy year! \n\nThis is a...,t3_o1hkuz


In [18]:
# Repeat the same steps for the dataframe from the second subreddit
drinktext = {}
for i in range(drinkdf.shape[0]):
    drinktext[i] = {
        'title': drinkdf['data'][i]['title'],
        'post': drinkdf['data'][i]['selftext'],
        'name': drinkdf['data'][i]['name']
    }

In [19]:
drinkdf1 = pd.DataFrame(drinktext)

In [20]:
drinkdf1 = drinkdf1.T

In [21]:
drinkdf1.head()

Unnamed: 0,title,post,name
0,What’s Up Wednesday,It’s that day again. Guess what day it is? Hap...,t3_o0z9ja
1,"The Daily Check-In for Wednesday, June 16th: J...",*We may be anonymous strangers on the internet...,t3_o0z9nf
2,Marriage is Over,"My wife, my best friend, my everything, saved ...",t3_o16z5e
3,Sobriety did this.,I just got offered my dream position with an i...,t3_o1h9vc
4,30 days but forgot two bottles...,"I'm a month sober, after I had a total psychot...",t3_o1kqgt


## Data cleaning and EDA

In [22]:
# Checking for null
ketodf1.isnull().sum()

title    0
post     0
name     0
dtype: int64

In [23]:
drinkdf1.isnull().sum()

title    0
post     0
name     0
dtype: int64

In [24]:
# Checking for blank
(ketodf1 == '').sum()

title    0
post     0
name     0
dtype: int64

In [25]:
(drinkdf1 == '').sum()

title    0
post     0
name     0
dtype: int64

Null and blank values can reduce the accuracy of the predictions. On inspection, neither dataset contains any null or blank values.
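
The checks above only catch missing values and exact empty strings. A slightly broader check, sketched below, also counts whitespace-only text and placeholder text. The helper `count_empty_like` is hypothetical, and treating `[removed]` and `[deleted]` as placeholders is an assumption about how removed posts appear in `selftext`; the toy frame is made up for illustration.

```python
import pandas as pd

# Assumed placeholder strings for removed or deleted posts.
PLACEHOLDERS = {'[removed]', '[deleted]'}


def count_empty_like(df):
    """Count values per column that are null, whitespace-only or placeholders."""
    def is_empty_like(value):
        if pd.isna(value):
            return True
        text = str(value).strip()
        return text == '' or text in PLACEHOLDERS

    return df.apply(lambda col: col.map(is_empty_like)).sum()


toy = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'post': ['some text', '   ', '[removed]', None],
})
counts = count_empty_like(toy)
print(counts)
```

Run against `ketodf1` and `drinkdf1`, a non-zero count would point to rows worth inspecting before modeling.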

In [26]:
# Using 'name' column as reference, duplicated rows are dropped
ketodf1.drop_duplicates(subset=['name'], keep='first', inplace=True)

In [27]:
ketodf1.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 742 entries, 0 to 741
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   title   742 non-null    object
 1   post    742 non-null    object
 2   name    742 non-null    object
dtypes: object(3)
memory usage: 23.2+ KB


There are no duplicated rows in `drinkdf1` (all 995 scraped posts are unique), so no action is required for it in this step.

In [28]:
# Saving the dataframes to csv
ketodf1.to_csv('./data/ketotextdf.csv')
drinkdf1.to_csv('./data/drinktextdf.csv')

In [2]:
# Reading the csv files
ketodf = pd.read_csv('./data/ketotextdf.csv')
drinkdf = pd.read_csv('./data/drinktextdf.csv')

In [3]:
ketodf.head()

Unnamed: 0.1,Unnamed: 0,title,post,name
0,0,[2021-06-16] - /r/keto Beginners &amp; Communi...,Hello /r/keto Community!\n\nPlease use this su...,t3_o11ass
1,1,[2021-06-16] - [Workout Wednesday] – What’s yo...,Hey /r/keto!\n\nRunning? Lifting? Yoga? Swimmi...,t3_o11as5
2,2,"Keto transformation.. Over 77LB / 35kg gone, a...","https://imgur.com/a/Fce07sf\n\nHI all, hopeful...",t3_o1fb8a
3,3,"36 days, 15.2 pounds gone!",Good morning Keto peeps. This is my third atte...,t3_o157o7
4,4,"Year 2 Update - Slow Progress, but getting the...",Hey everyone! What a crazy year! \n\nThis is a...,t3_o1hkuz


In [3]:
# Dropping the Unnamed column
ketodf.drop('Unnamed: 0', axis=1, inplace=True)
drinkdf.drop('Unnamed: 0', axis=1, inplace=True)

In [4]:
# Reordering the columns
ketodf = ketodf[['name', 'title', 'post']]
drinkdf = drinkdf[['name', 'title', 'post']]

In [6]:
ketodf.head()

Unnamed: 0,name,title,post
0,t3_o11ass,[2021-06-16] - /r/keto Beginners &amp; Communi...,Hello /r/keto Community!\n\nPlease use this su...
1,t3_o11as5,[2021-06-16] - [Workout Wednesday] – What’s yo...,Hey /r/keto!\n\nRunning? Lifting? Yoga? Swimmi...
2,t3_o1fb8a,"Keto transformation.. Over 77LB / 35kg gone, a...","https://imgur.com/a/Fce07sf\n\nHI all, hopeful..."
3,t3_o157o7,"36 days, 15.2 pounds gone!",Good morning Keto peeps. This is my third atte...
4,t3_o1hkuz,"Year 2 Update - Slow Progress, but getting the...",Hey everyone! What a crazy year! \n\nThis is a...


In [5]:
# Creating a new column 'message' by merging 'title' and 'post' columns
ketodf['message'] = ketodf['title'] + ketodf['post']
drinkdf['message'] = drinkdf['title'] + drinkdf['post']

In [6]:
# Dropping the 'title' and 'post' columns
ketodf.drop(['title', 'post'], axis=1, inplace=True)
drinkdf.drop(['title', 'post'], axis=1, inplace=True)

In [7]:
# Creating a new column 'target' to label each dataset (0 = keto, 1 = stopdrinking)
ketodf['target'] = 0
drinkdf['target'] = 1

In [8]:
# Merging both datasets into one combined dataset
combine = pd.concat([ketodf, drinkdf], ignore_index=True)

In [11]:
# Checking the number of rows from each dataset
combine.target.value_counts()

1    995
0    742
Name: target, dtype: int64

In [12]:
combine.head()

Unnamed: 0,name,message,target
0,t3_o11ass,[2021-06-16] - /r/keto Beginners &amp; Communi...,0
1,t3_o11as5,[2021-06-16] - [Workout Wednesday] – What’s yo...,0
2,t3_o1fb8a,"Keto transformation.. Over 77LB / 35kg gone, a...",0
3,t3_o157o7,"36 days, 15.2 pounds gone!Good morning Keto pe...",0
4,t3_o1hkuz,"Year 2 Update - Slow Progress, but getting the...",0


## Pre-processing

In [9]:
# Defining a function to pre-process the data for modeling
def message_to_words(raw_message):
    # Function to convert a raw message to a string of words
    # The input is a single string (a raw reddit message), and
    # the output is a single string (a preprocessed reddit message)

    # 1. Remove URLs (raw string so that \S is not treated as an invalid escape).
    message_text = re.sub(r'https?://\S+', '', raw_message)

    # 2. Remove non-letters.
    letters_only = re.sub("[^a-zA-Z]", " ", message_text)

    # 3. Convert to lower case, split into individual words.
    words = letters_only.lower().split()

    # 4. Converting the list of stop words to a set for faster searching
    stops = set(stopwords.words('english'))

    # 5. Remove stopwords.
    meaningful_words = [w for w in words if w not in stops]

    # 6. Remove subreddit and words closely related to subreddit.
    subred = [
        'keto', 'ketogenic', 'diet', 'dieting', 'drink', 'drank', 'drunk',
        'drinks', 'drinking', 'alcohol', 'alcohols', 'alcoholic', 'alcoholics',
        'alcoholism', 'stopdrinking'
    ]
    clean_words = [x for x in meaningful_words if x not in subred]

    # 7. Join the words back into one string separated by space,
    # and return the result.
    return (" ".join(clean_words))

# code from lesson 05.05 nlp-i
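
To make the effect of these steps concrete, here is a small stand-alone sketch that applies the same sequence (strip URLs, keep letters only, lowercase, drop stop words and subreddit words) to one made-up message. `toy_message_to_words`, `TOY_STOPS` and `TOY_SUBRED` are hypothetical stand-ins: the real function uses the full NLTK English stop word list and the longer `subred` list above.

```python
import re

# Tiny stand-in for stopwords.words('english'); the NLTK list is much longer.
TOY_STOPS = {'i', 'my', 'the', 'a', 'and', 'on', 'is', 'at', 'for'}
# Subset of the subreddit-revealing words removed in the function above.
TOY_SUBRED = {'keto', 'diet', 'drinking', 'alcohol'}


def toy_message_to_words(raw_message):
    # Remove URLs, then replace every non-letter with a space.
    text = re.sub(r'https?://\S+', '', raw_message)
    letters_only = re.sub('[^a-zA-Z]', ' ', text)
    # Lowercase, split on whitespace, and filter unwanted words.
    words = letters_only.lower().split()
    kept = [w for w in words if w not in TOY_STOPS and w not in TOY_SUBRED]
    return ' '.join(kept)


sample = 'Day 36 on keto: 15.2 pounds gone! Progress pics at https://example.com/pics'
cleaned = toy_message_to_words(sample)
print(cleaned)  # day pounds gone progress pics
```

Numbers, punctuation and the link disappear, and the word `keto` is dropped so that the classifier cannot rely on the subreddit name itself.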

In [10]:
# Get the number of messages based on the dataframe size.
total_messages = combine.shape[0]
print(f'There are {total_messages} messages.')

There are 1737 messages.


In [11]:
# Initialize an empty list to hold the clean messages.
clean_messages = []

print("Cleaning and parsing the combine set reddit messages...")

# Instantiate counter.
j = 0

# For every message in our list...
for message in combine['message']:

    # Convert message to words, then append to clean_messages.
    clean_messages.append(message_to_words(message))

    # If the index is divisible by 100, print a message.
    if (j + 1) % 100 == 0:
        print(f'Message {j + 1} of {total_messages}.')

    j += 1

Cleaning and parsing the combine set reddit messages...
Message 100 of 1737.
Message 200 of 1737.
Message 300 of 1737.
Message 400 of 1737.
Message 500 of 1737.
Message 600 of 1737.
Message 700 of 1737.
Message 800 of 1737.
Message 900 of 1737.
Message 1000 of 1737.
Message 1100 of 1737.
Message 1200 of 1737.
Message 1300 of 1737.
Message 1400 of 1737.
Message 1500 of 1737.
Message 1600 of 1737.
Message 1700 of 1737.


In [12]:
# Replacing the message column with clean_messages
combine['message'] = clean_messages

In [13]:
# Saving combine dataframe to csv
combine.to_csv('./data/combine.csv')

In [2]:
# Read in combine.csv file
combine = pd.read_csv('./data/combine.csv')

In [3]:
combine.drop('Unnamed: 0', axis=1, inplace=True)

## Modeling

In [4]:
# Setting X and y
X = combine['message']
y = combine['target']

In [6]:
# Splitting X and y into train and test for modeling
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.25,
                                                    stratify=y,
                                                    random_state=42)

In [7]:
# Baseline accuracy
y_test.value_counts(normalize=True)

1    0.572414
0    0.427586
Name: target, dtype: float64

Baseline accuracy: 57.24% (the accuracy obtained by always predicting the majority class, target 1)
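
The baseline is the score a model would get by always predicting the most common class. As a sketch of the same idea, scikit-learn's `DummyClassifier` with `strategy='most_frequent'` reproduces it; the toy labels below are made up (4 of 7 are class 1) and are not the project data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Made-up labels: class 1 is the majority (4 of 7).
y_toy = np.array([1, 1, 1, 1, 0, 0, 0])
# The dummy model ignores the features, so zeros are enough here.
X_toy = np.zeros((len(y_toy), 1))

baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X_toy, y_toy)
baseline_accuracy = baseline.score(X_toy, y_toy)
print(round(baseline_accuracy, 4))  # 0.5714
```

Any real model should beat this number by a clear margin to be considered useful.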

### CountVectorizer

In [8]:
# Instantiating countvectorizer
cvec = CountVectorizer()

In [9]:
# fit_transform() does two things: First, it fits the model and
# learns the vocabulary; second, it transforms our training data
# into feature vectors. The input to fit_transform should be a
# list of strings.

X_train_cvec = cvec.fit_transform(X_train)

X_test_cvec = cvec.transform(X_test)
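
The difference between `fit_transform` and `transform` is easier to see on a toy corpus. In the sketch below (made-up documents, not project data), the vocabulary is learned from the training documents only, so a word that appears only in the test document is silently dropped.

```python
from sklearn.feature_extraction.text import CountVectorizer

train_docs = ['eggs bacon eggs', 'sober days count']
test_docs = ['bacon avocado sober']

toy_cvec = CountVectorizer()
# Learn the vocabulary from the training documents and count tokens.
train_counts = toy_cvec.fit_transform(train_docs)
# Reuse the learned vocabulary; 'avocado' is unknown and therefore ignored.
test_counts = toy_cvec.transform(test_docs)

vocab_toy = sorted(toy_cvec.vocabulary_)
print(vocab_toy)               # ['bacon', 'count', 'days', 'eggs', 'sober']
print(train_counts.toarray())  # [[1 0 0 2 0] [0 1 1 0 1]]
print(test_counts.toarray())   # [[1 0 0 0 1]]
```

This is why the train and test matrices above have the same number of columns (10116) even though the test posts may contain words never seen in training.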

In [10]:
print(X_train_cvec.shape)

(1302, 10116)


In [11]:
print(X_test_cvec.shape)

(435, 10116)


In [12]:
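# Check features
# get_feature_names() was removed in scikit-learn 1.2;
# get_feature_names_out() is available from version 1.0
vocab = cvec.get_feature_names_out()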
print(vocab)



In [13]:
len(vocab)

10116

#### Logistic regression with count vectorizer

In [14]:
# Instantiate logistic regression model.
lr = LogisticRegression()

# Fit model to training data.
lr.fit(X_train_cvec, y_train)

# Evaluate model on training data.
lr.score(X_train_cvec, y_train)

# Evaluate model on testing data.
lr.score(X_test_cvec, y_test)

# Print scores
print(
    f'Logistic regression with count vectorizer training Score: {lr.score(X_train_cvec, y_train)}'
)
print(
    f'Logistic regression with count vectorizer testing Score: {lr.score(X_test_cvec, y_test)}'
)

Logistic regression with count vectorizer training Score: 1.0
Logistic regression with count vectorizer testing Score: 0.960919540229885


#### Naive bayes with count vectorizer

In [15]:
# Instantiate naive bayes model.
nb = MultinomialNB()

# Fit model to training data.
nb.fit(X_train_cvec, y_train)

# Evaluate model on training data.
nb.score(X_train_cvec, y_train)

# Evaluate model on testing data.
nb.score(X_test_cvec, y_test)

# Print scores
print(
    f'Naive bayes with count vectorizer training Score: {nb.score(X_train_cvec, y_train)}'
)
print(
    f'Naive bayes with count vectorizer testing Score: {nb.score(X_test_cvec, y_test)}'
)

Naive bayes with count vectorizer training Score: 0.9869431643625192
Naive bayes with count vectorizer testing Score: 0.9724137931034482


### Term Frequency-Inverse Document Frequency (TF-IDF) Vectorizer

In [16]:
# Instantiate the transformer.
tvec = TfidfVectorizer()

In [17]:
X_train_tvec = tvec.fit_transform(X_train)

X_test_tvec = tvec.transform(X_test)
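
Unlike raw counts, TF-IDF weights a token down when it appears in many documents. The toy sketch below (made-up documents, default `TfidfVectorizer` settings) shows that a word present in every document receives a smaller weight than a word unique to one document, even though both occur once in that document.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# 'today' appears in all three documents; 'weight' appears in only one.
docs = ['weight loss today', 'sober today', 'today progress']

toy_tvec = TfidfVectorizer()
weights = toy_tvec.fit_transform(docs).toarray()
vocab_index = toy_tvec.vocabulary_

first_doc = weights[0]
today_weight = first_doc[vocab_index['today']]
weight_weight = first_doc[vocab_index['weight']]
print(today_weight < weight_weight)  # True
```

Both vectorizers learn the same vocabulary, which is why the TF-IDF matrices here have the same shape as the count matrices; only the values in the cells differ.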

In [18]:
print(X_train_tvec.shape)

(1302, 10116)


In [19]:
print(X_test_tvec.shape)

(435, 10116)


In [20]:
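# Check features
# get_feature_names() was removed in scikit-learn 1.2;
# get_feature_names_out() is available from version 1.0
vocab_t = tvec.get_feature_names_out()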
print(vocab_t)



In [21]:
len(vocab_t)

10116

#### Logistic regression with tfidf vectorizer

In [22]:
# Instantiate logistic regression model.
lr = LogisticRegression()

# Fit model to training data.
lr.fit(X_train_tvec, y_train)

# Evaluate model on training data.
lr.score(X_train_tvec, y_train)

# Evaluate model on testing data.
lr.score(X_test_tvec, y_test)

# Print scores
print(
    f'Logistic regression with tfidf vectorizer training Score: {lr.score(X_train_tvec, y_train)}'
)
print(
    f'Logistic regression with tfidf vectorizer testing Score: {lr.score(X_test_tvec, y_test)}'
)

Logistic regression with tfidf vectorizer training Score: 0.9854070660522274
Logistic regression with tfidf vectorizer testing Score: 0.9540229885057471


#### Naive bayes with tfidf vectorizer

In [23]:
# Instantiate naive bayes model.
nb = MultinomialNB()

# Fit model to training data.
nb.fit(X_train_tvec, y_train)

# Evaluate model on training data.
nb.score(X_train_tvec, y_train)

# Evaluate model on testing data.
nb.score(X_test_tvec, y_test)

# Print scores
print(
    f'Naive bayes with tfidf vectorizer training Score: {nb.score(X_train_tvec, y_train)}'
)
print(
    f'Naive bayes with tfidf vectorizer testing Score: {nb.score(X_test_tvec, y_test)}'
)

Naive bayes with tfidf vectorizer training Score: 0.9685099846390169
Naive bayes with tfidf vectorizer testing Score: 0.9356321839080459


### GridSearchCV

#### Transformer: CountVectorizer, Model: Logistic Regression

In [24]:
# Set a pipeline up with two stages:
# 1. CountVectorizer (transformer)
# 2. LogisticRegression (model)

cvlrpipe = Pipeline([('cvec', CountVectorizer()),
                     ('lr', LogisticRegression())])

In [25]:
# Search over the following values of hyperparameters:
# max_features: keep only the 2000, 3000 or 4000 most frequent tokens
# max_df: ignore tokens that appear in more than 80%, 90% or 100% of documents
# ngram_range: unigrams only, unigrams plus bigrams, or bigrams only
# C (inverse of regularization strength): 0.1, 1, 10

cvlrpipe_params = {
    'cvec__max_features': [2000, 3000, 4000],
    'cvec__max_df': [0.8, 0.9, 1.0],
    'cvec__ngram_range': [(1, 1), (1, 2), (2, 2)],
    'lr__C': [0.1, 1, 10]
}
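
Before fitting, it helps to know how much work a grid implies. As a rough sketch, `ParameterGrid` from scikit-learn enumerates the combinations of a parameter dictionary like the one above, and multiplying by the number of folds gives the number of cross-validation fits. The variable names here are illustrative only.

```python
from sklearn.model_selection import ParameterGrid

grid = {
    'cvec__max_features': [2000, 3000, 4000],
    'cvec__max_df': [0.8, 0.9, 1.0],
    'cvec__ngram_range': [(1, 1), (1, 2), (2, 2)],
    'lr__C': [0.1, 1, 10],
}

# 3 * 3 * 3 * 3 combinations of hyperparameters.
n_candidates = len(ParameterGrid(grid))
n_folds = 5
print(n_candidates)            # 81
print(n_candidates * n_folds)  # 405
```

With the default `refit=True`, `GridSearchCV` fits the best combination one more time on the full training set after the search.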

In [26]:
# Instantiate GridSearchCV.
cvlrgs = GridSearchCV(
    cvlrpipe,  # object we are optimizing
    param_grid=cvlrpipe_params,  # parameters values we are searching
    cv=5)  # 5-fold cross-validation

In [27]:
# Fit GridSearch to training data.
cvlrgs.fit(X_train, y_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('cvec', CountVectorizer()),
                                       ('lr', LogisticRegression())]),
             param_grid={'cvec__max_df': [0.8, 0.9, 1.0],
                         'cvec__max_features': [2000, 3000, 4000],
                         'cvec__ngram_range': [(1, 1), (1, 2), (2, 2)],
                         'lr__C': [0.1, 1, 10]})

In [28]:
# Best score
cvlrgs.best_score_

0.9477571470674919

In [29]:
# Save best model as cvlr_model.
cvlr_model = cvlrgs.best_estimator_
cvlr_model

Pipeline(steps=[('cvec', CountVectorizer(max_df=0.8, max_features=3000)),
                ('lr', LogisticRegression(C=0.1))])

In [51]:
# Check best parameters
cvlrgs.best_params_

{'cvec__max_df': 0.8,
 'cvec__max_features': 3000,
 'cvec__ngram_range': (1, 1),
 'lr__C': 0.1}

In [30]:
# Score model on training set.
cvlr_model.score(X_train, y_train)

# Score model on testing set.
cvlr_model.score(X_test, y_test)

# Print scores
print(
    f'Logistic regression with count vectorizer training Score: {cvlr_model.score(X_train, y_train)}'
)
print(
    f'Logistic regression with count vectorizer testing Score: {cvlr_model.score(X_test, y_test)}'
)

Logistic regression with count vectorizer training Score: 0.9946236559139785
Logistic regression with count vectorizer testing Score: 0.960919540229885


#### Transformer: CountVectorizer, Model: Naive bayes

In [31]:
# Set a pipeline up with two stages:
# 1. CountVectorizer (transformer)
# 2. Naive Bayes (model)

cvnbpipe = Pipeline([('cvec', CountVectorizer()), ('nb', MultinomialNB())])

In [32]:
# Search over the following values of hyperparameters:
# max_features: keep only the 2000, 3000 or 4000 most frequent tokens
# max_df: ignore tokens that appear in more than 80%, 90% or 100% of documents
# ngram_range: unigrams only, unigrams plus bigrams, or bigrams only
# alpha (additive smoothing): 1, 2

cvnbpipe_params = {
    'cvec__max_features': [2000, 3000, 4000],
    'cvec__max_df': [0.8, 0.9, 1.0],
    'cvec__ngram_range': [(1, 1), (1, 2), (2, 2)],
    'nb__alpha': [1, 2]
}

In [33]:
# Instantiate GridSearchCV.
cvnbgs = GridSearchCV(
    cvnbpipe,  # object we are optimizing
    param_grid=cvnbpipe_params,  # parameters values we are searching
    cv=5)  # 5-fold cross-validation

In [34]:
# Fit GridSearch to training data.
cvnbgs.fit(X_train, y_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('cvec', CountVectorizer()),
                                       ('nb', MultinomialNB())]),
             param_grid={'cvec__max_df': [0.8, 0.9, 1.0],
                         'cvec__max_features': [2000, 3000, 4000],
                         'cvec__ngram_range': [(1, 1), (1, 2), (2, 2)],
                         'nb__alpha': [1, 2]})

In [35]:
# Best score
cvnbgs.best_score_

0.9646625405246094

In [36]:
# Save best model as cvnb_model.
cvnb_model = cvnbgs.best_estimator_
cvnb_model

Pipeline(steps=[('cvec', CountVectorizer(max_df=0.8, max_features=2000)),
                ('nb', MultinomialNB(alpha=1))])

In [52]:
# Check best parameters
cvnbgs.best_params_

{'cvec__max_df': 0.8,
 'cvec__max_features': 2000,
 'cvec__ngram_range': (1, 1),
 'nb__alpha': 1}

In [37]:
# Score model on training set.
cvnb_model.score(X_train, y_train)

# Score model on testing set.
cvnb_model.score(X_test, y_test)

# Print scores
print(
    f'Naive bayes with count vectorizer training Score: {cvnb_model.score(X_train, y_train)}'
)
print(
    f'Naive bayes with count vectorizer testing Score: {cvnb_model.score(X_test, y_test)}'
)

Naive bayes with count vectorizer training Score: 0.9800307219662059
Naive bayes with count vectorizer testing Score: 0.9770114942528736


#### Transformer: Tfidf Vectorizer, Model: Logistic Regression

In [38]:
# Set a pipeline up with two stages:
# 1. TfidfVectorizer (transformer)
# 2. LogisticRegression (model)

tvlrpipe = Pipeline([('tvec', TfidfVectorizer()),
                     ('lr', LogisticRegression())])

In [39]:
# Search over the following values of hyperparameters:
# max_features: keep only the 2000, 3000 or 4000 most frequent tokens
# max_df: ignore tokens that appear in more than 80%, 90% or 100% of documents
# ngram_range: unigrams only, unigrams plus bigrams, or bigrams only
# C (inverse of regularization strength): 0.1, 1, 10

tvlrpipe_params = {
    'tvec__max_features': [2000, 3000, 4000],
    'tvec__max_df': [0.8, 0.9, 1.0],
    'tvec__ngram_range': [(1, 1), (1, 2), (2, 2)],
    'lr__C': [0.1, 1, 10]
}

In [40]:
# Instantiate GridSearchCV.
tvlrgs = GridSearchCV(
    tvlrpipe,  # object we are optimizing
    param_grid=tvlrpipe_params,  # parameters values we are searching
    cv=5)  # 5-fold cross-validation

In [41]:
# Fit GridSearch to training data.
tvlrgs.fit(X_train, y_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('tvec', TfidfVectorizer()),
                                       ('lr', LogisticRegression())]),
             param_grid={'lr__C': [0.1, 1, 10], 'tvec__max_df': [0.8, 0.9, 1.0],
                         'tvec__max_features': [2000, 3000, 4000],
                         'tvec__ngram_range': [(1, 1), (1, 2), (2, 2)]})

In [42]:
# Best score
tvlrgs.best_score_

0.953910993221338

In [43]:
# Save best model as tvlr_model.
tvlr_model = tvlrgs.best_estimator_
tvlr_model

Pipeline(steps=[('tvec', TfidfVectorizer(max_df=0.8, max_features=2000)),
                ('lr', LogisticRegression(C=10))])

In [53]:
# Check best parameters
tvlrgs.best_params_

{'lr__C': 10,
 'tvec__max_df': 0.8,
 'tvec__max_features': 2000,
 'tvec__ngram_range': (1, 1)}

In [44]:
# Score model on training set.
tvlr_model.score(X_train, y_train)

# Score model on testing set.
tvlr_model.score(X_test, y_test)

# Print scores
print(
    f'Logistic regression with tfidf vectorizer training Score: {tvlr_model.score(X_train, y_train)}'
)
print(
    f'Logistic regression with tfidf vectorizer testing Score: {tvlr_model.score(X_test, y_test)}'
)

Logistic regression with tfidf vectorizer training Score: 1.0
Logistic regression with tfidf vectorizer testing Score: 0.9586206896551724


#### Transformer: Tfidf Vectorizer, Model: Naive bayes

In [45]:
# Set a pipeline up with two stages:
# 1. TfidfVectorizer (transformer)
# 2. Naive bayes (model)

tvnbpipe = Pipeline([('tvec', TfidfVectorizer()), ('nb', MultinomialNB())])

In [46]:
# Search over the following values of hyperparameters:
# max_features: keep only the 2000, 3000 or 4000 most frequent tokens
# max_df: ignore tokens that appear in more than 80%, 90% or 100% of documents
# ngram_range: unigrams only, unigrams plus bigrams, or bigrams only
# alpha (additive smoothing): 1, 2

tvnbpipe_params = {
    'tvec__max_features': [2000, 3000, 4000],
    'tvec__max_df': [0.8, 0.9, 1.0],
    'tvec__ngram_range': [(1, 1), (1, 2), (2, 2)],
    'nb__alpha': [1, 2]
}

In [47]:
# Instantiate GridSearchCV.
tvnbgs = GridSearchCV(
    tvnbpipe,  # object we are optimizing
    param_grid=tvnbpipe_params,  # parameters values we are searching
    cv=5)  # 5-fold cross-validation

In [48]:
# Fit GridSearch to training data.
tvnbgs.fit(X_train, y_train)

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('tvec', TfidfVectorizer()),
                                       ('nb', MultinomialNB())]),
             param_grid={'nb__alpha': [1, 2], 'tvec__max_df': [0.8, 0.9, 1.0],
                         'tvec__max_features': [2000, 3000, 4000],
                         'tvec__ngram_range': [(1, 1), (1, 2), (2, 2)]})

In [57]:
# Best score
tvnbgs.best_score_

0.9423902151488358

In [58]:
# Save best model as tvnb_model.
tvnb_model = tvnbgs.best_estimator_
tvnb_model

Pipeline(steps=[('tvec', TfidfVectorizer(max_df=0.8, max_features=2000)),
                ('nb', MultinomialNB(alpha=1))])

In [59]:
# Check best parameters
tvnbgs.best_params_

{'nb__alpha': 1,
 'tvec__max_df': 0.8,
 'tvec__max_features': 2000,
 'tvec__ngram_range': (1, 1)}

In [60]:
# Score model on training set.
tvnb_model.score(X_train, y_train)

# Score model on testing set.
tvnb_model.score(X_test, y_test)

# Print scores
print(
    f'Naive bayes with tfidf vectorizer training Score: {tvnb_model.score(X_train, y_train)}'
)
print(
    f'Naive bayes with tfidf vectorizer testing Score: {tvnb_model.score(X_test, y_test)}'
)

Naive bayes with tfidf vectorizer training Score: 0.977726574500768
Naive bayes with tfidf vectorizer testing Score: 0.9494252873563218


## Score comparison

### Logistic regression

In [55]:
print(
    f'Logistic regression with count vectorizer training Score: {cvlr_model.score(X_train, y_train)}'
)
print(
    f'Logistic regression with count vectorizer testing Score: {cvlr_model.score(X_test, y_test)}'
)
print(
    f'Logistic regression with tfidf vectorizer training Score: {tvlr_model.score(X_train, y_train)}'
)
print(
    f'Logistic regression with tfidf vectorizer testing Score: {tvlr_model.score(X_test, y_test)}'
)

Logistic regression with count vectorizer training Score: 0.9946236559139785
Logistic regression with count vectorizer testing Score: 0.960919540229885
Logistic regression with tfidf vectorizer training Score: 1.0
Logistic regression with tfidf vectorizer testing Score: 0.9586206896551724


### Naive bayes

In [61]:
print(
    f'Naive bayes with count vectorizer training Score: {cvnb_model.score(X_train, y_train)}'
)
print(
    f'Naive bayes with count vectorizer testing Score: {cvnb_model.score(X_test, y_test)}'
)
print(
    f'Naive bayes with tfidf vectorizer training Score: {tvnb_model.score(X_train, y_train)}'
)
print(
    f'Naive bayes with tfidf vectorizer testing Score: {tvnb_model.score(X_test, y_test)}'
)

Naive bayes with count vectorizer training Score: 0.9800307219662059
Naive bayes with count vectorizer testing Score: 0.9770114942528736
Naive bayes with tfidf vectorizer training Score: 0.977726574500768
Naive bayes with tfidf vectorizer testing Score: 0.9494252873563218


## Conclusion and recommendations
First, the accuracy of every model is high, well above the 57.24% baseline. A likely explanation is that the similarities between the two subreddits are subtler than I expected, so the posts are different enough for the models to distinguish easily. Tuning with GridSearchCV improved test accuracy only slightly, or not at all, probably because the untuned scores were already high and left little room for improvement. Naive Bayes assumes that all features are conditionally independent given the class. Despite this naive assumption, it performed very well in this project, especially after tuning `max_df`, `max_features` and `ngram_range`, and it emerged as the best model with a test score of 0.977.
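
One way to check how easily the classes separate is to look at which tokens a fitted Naive Bayes model associates most strongly with each class. The sketch below does this on a made-up four-document corpus by comparing the per-class log probabilities in `feature_log_prob_`; the documents, labels and variable names are all illustrative and are not results from this project.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Made-up corpus: label 0 mimics diet posts, label 1 mimics sobriety posts.
toy_docs = ['carbs macros carbs fasting', 'macros fasting weight',
            'sober days sober craving', 'craving days weight']
toy_labels = [0, 0, 1, 1]

vec = CountVectorizer()
X_counts = vec.fit_transform(toy_docs)
model = MultinomialNB().fit(X_counts, toy_labels)

# Tokens ordered by their column index in the count matrix.
tokens = np.array(sorted(vec.vocabulary_, key=vec.vocabulary_.get))
# Positive values favour class 1, negative values favour class 0.
log_ratio = model.feature_log_prob_[1] - model.feature_log_prob_[0]

order = np.argsort(log_ratio)
top_class_0 = sorted(tokens[order[:3]])
top_class_1 = sorted(tokens[order[-3:]])
print(top_class_0)  # ['carbs', 'fasting', 'macros']
print(top_class_1)  # ['craving', 'days', 'sober']
```

For a tuned pipeline such as `cvnb_model`, the fitted steps should be reachable through `named_steps` (for example `cvnb_model.named_steps['nb']`), so the same comparison could be run on the real vocabulary.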

<p>Overall, the objectives of this project were met: 1989 posts were scraped out of an intended 2000, yielding 1737 unique posts for analysis. Two models were fitted, tuned and tested, and the resulting production model is Naive Bayes with count vectorizer, with a test score of 0.977.</p>