# Project 3: Reddit API NLP Classification

## Problem Statement

As part of the Nintendo Switch marketing team at Nintendo, I am interested in finding out which popular topics and keywords are associated with the Mario and Legend of Zelda franchises. Analyzing Reddit posts will allow me to craft online content and advertisements that better target people interested in casual, adventure, and platformer video games.

The main objective of this project is to scrape two subreddits, `r/Mario` and `r/Zelda`, using Reddit's API. The scraped data will then be passed through several classification models, combining `CountVectorizer`/`TfidfVectorizer` with a `Naive Bayes Classifier` (`MultinomialNB`) and `LogisticRegression`, that assign each observation to the most likely subreddit. The models should help my company's data science marketing team identify what makes the posts in each subreddit distinct from one another.
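As a rough illustration of the approach described above, the sketch below fits a `CountVectorizer` + `MultinomialNB` pipeline on four made-up titles. The titles, the `toy_` names, and the label encoding (0 = r/Mario, 1 = r/Zelda) are assumptions for illustration only; the real data comes from the scraped CSV files.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy titles, invented for illustration only.
toy_titles = [
    "mario kart race",
    "luigi mansion ghost",
    "zelda breath wild",
    "link master sword",
]
toy_labels = [0, 0, 1, 1]  # assumed encoding: 0 = r/Mario, 1 = r/Zelda

# Stage 1 turns each title into word counts; stage 2 classifies those counts.
toy_pipe = Pipeline([
    ("cvec", CountVectorizer()),
    ("nb", MultinomialNB()),
])
toy_pipe.fit(toy_titles, toy_labels)

# Words never seen in training (e.g. "tonight") are simply ignored.
toy_preds = toy_pipe.predict(["kart race tonight", "master sword shrine"])
print(toy_preds)
```

With so little data the predictions only reflect which class each known word appeared in, which is also the basic intuition behind the full models later in this notebook.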

In this process, the subreddit posts will undergo preprocessing and EDA. The models will be compared on their accuracy scores, and the one with the highest accuracy will be considered the most successful.
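Since accuracy is the success metric, it helps to compare it against a naive baseline that always predicts the majority class. The sketch below uses hypothetical labels and predictions (not results from this project) to show both numbers.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Hypothetical labels and predictions, for illustration only.
y_true_demo = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred_demo = [0, 0, 1, 1, 1, 1, 0, 1]

model_acc = accuracy(y_true_demo, y_pred_demo)

# Baseline: always predict the most frequent class in the true labels.
majority = max(set(y_true_demo), key=y_true_demo.count)
baseline_acc = accuracy(y_true_demo, [majority] * len(y_true_demo))

print(f"model accuracy:    {model_acc:.3f}")
print(f"baseline accuracy: {baseline_acc:.3f}")
```

A model is only useful if its accuracy clearly exceeds this baseline, which matters most when the two classes are unevenly represented.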

## Executive Summary

Natural Language Processing, or NLP for short, involves using specialized machine learning techniques to make predictions about the text in a body of documents, including things like authorship attribution, sentiment analysis, text generation, and in some cases the appearance of something resembling semantic understanding.

The Nintendo Switch, a hybrid home console and handheld device, is one of Nintendo's highest revenue generators and has outsold the lifetime sales of the Wii U, its home console predecessor. Mario Kart 8 Deluxe is the best-selling game on the platform at over 46 million copies sold. The Mario franchise alone has sold 167.11 million copies on the Nintendo Switch, the most the franchise has ever sold on a single platform. The Legend of Zelda has also reached its highest single-platform sales on the Nintendo Switch, with 41.13 million copies.

In this project, I would like to further explore the key similarities and differences between the Mario and Zelda communities in terms of the topics that people are currently discussing on Reddit. Reddit has become a popular avenue for people all over the world to discuss the games they play. As such, scraping subreddit posts gives us an interesting source of data that we can analyze to understand the popular topics in each community.

The web scraping portion of this project is covered in another notebook. In this notebook, I will be covering the steps taken to clean and analyze the data collected, as well as further steps taken to pre-process the text data, visualize the data, use different models to find the optimal model and analyze misclassified posts.

## Contents:
- [Data Collection](#Data-Collection) (In notebook "Data_Collection_Reddit.ipynb")
- [Data Cleaning](#Data-Cleaning)
- [Exploratory Data Analysis](#Explanatory-Data-Analysis)
- [Pre-Processing](#Pre-Processing)
- [Data Visualization](#Data-Visualization)
- [Modelling](#Modelling)
- [Evaluation](#Evaluation)
- [Conclusion](#Conclusion)
- [Recommendations](#Recommendations)

### Importing Packages

In [None]:
# Import libraries
import requests
import time
import nltk
import pandas as pd
import regex as re
import numpy as np
import matplotlib.pyplot as plt

from nltk.corpus import stopwords
from nltk.tokenize import RegexpTokenizer
from nltk.stem import WordNetLemmatizer

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB # NLP classification
# plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

from wordcloud import WordCloud

I have outlined my process for scraping the data in the other Jupyter Notebook in this project folder. In this notebook, I will read in the CSV files that contain the scraped posts from both the Mario and Zelda subreddits.

### Read in CSV

In [None]:
mario = pd.read_csv("mario_reddit_posts.csv")
mario.sample(5)

In [None]:
zelda = pd.read_csv("zelda_reddit_posts.csv")
zelda.sample(5)

In [None]:
print(mario.shape)
print(zelda.shape)

In [None]:
# Import combined df
df = pd.read_csv("./master_df.csv")
df.sample(5)

In [None]:
print(df.shape)

In [None]:
# Overview of Numerical Columns
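# numeric_only=True skips text columns such as the title, which cannot be averaged
df.groupby("subreddit").mean(numeric_only=True)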

There is a clear difference in score between the two subreddits. r/Mario is the less active community, with 113k members compared to r/Zelda's 2.1m members.

### For simplicity, the model focuses only on the title column of the dataframe

In [None]:
df = df[["subreddit","target","title"]]

In [None]:
df.sample(5)

### WordCloud Visualisation of Common Words in Mario and Zelda

In [None]:
# Join all titles into one string; str(Series) would only give the truncated printout
text = " ".join(mario['title'].astype(str))

# Create the wordcloud object
wordcloud = WordCloud(width=480, height=480, margin=0, 
                      background_color='white').generate(text)

# Display the generated image:
plt.imshow(wordcloud,interpolation='bilinear')
plt.axis("off")
plt.margins(x=0, y=0)
plt.show()

There is significant interest in the upcoming "Mario Movie". This is out of scope for this analysis, so we will remove words relating to the movie.

In [None]:
# Join all titles into one string; str(Series) would only give the truncated printout
text = " ".join(zelda['title'].astype(str))

# Create the wordcloud object
wordcloud = WordCloud(width=480, height=480, margin=0, 
                      background_color='white').generate(text)

# Display the generated image:
plt.imshow(wordcloud,interpolation='bilinear')
plt.axis("off")
plt.margins(x=0, y=0)
plt.show()

The characters' names and game titles are mentioned frequently. These can be considered "cheat words", as they identify the subreddit directly and say little about the characteristics of the players.
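To make the "cheat word" idea concrete, the sketch below counts how often a few franchise names appear in titles from each subreddit. The posts and the short `name_words` set are invented for illustration; the full list used in this notebook appears in the cleaning function.

```python
import re
from collections import Counter

# Invented (subreddit, title) pairs, for illustration only.
demo_posts = [
    ("Mario", "Mario Kart 8 is still my favourite"),
    ("Mario", "Drew Luigi for my brother"),
    ("zelda", "Finished Zelda Breath of the Wild last night"),
    ("zelda", "My Link cosplay for the convention"),
]
name_words = {"mario", "luigi", "zelda", "link"}  # tiny stand-in for the full list

# Count in how many titles of each subreddit every name word occurs.
name_counts = Counter()
for subreddit, title in demo_posts:
    for token in set(re.findall(r"[a-z]+", title.lower())):
        if token in name_words:
            name_counts[(token, subreddit)] += 1

for (token, subreddit), count in sorted(name_counts.items()):
    print(f"{token!r} appears in {count} title(s) from {subreddit}")
```

In this toy data every name word occurs in only one subreddit, so a classifier could separate the classes from these words alone without learning anything about what the players actually discuss.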

**Set Tokenizer**

In [None]:
tokenizer = RegexpTokenizer(r'\w+|\$[\d\.]+|\S+')

In [None]:
lemmatizer = WordNetLemmatizer()

**Create a function that takes a column containing text and writes the lemmatized, cleaned version to a new `<column>_clean` column**

In [None]:
def column_cleaner(column, df=df):   
    #for loop through each row in the column:
    for i in range(len(df[column])):
        
        #Tokenize, or separate, each word in column's string into its own string (prep for lemmatization):
        col_tok = tokenizer.tokenize(df[column][i].lower())
        # Drop repeated tokens within the title while preserving their order:
        col_token = list(dict.fromkeys(col_tok))
        
        #Lemmatize the words (cut the word to its base/root, for improved model results):
        col_lem = []
        for x in range(len(col_token)):
            col_lem.append(lemmatizer.lemmatize(col_token[x]))
        
        #Remove characters and numbers (for improved model results, hopefully):
        letters_only_col = []
        for c in range(len(col_lem)):
            letters_only_col.append(re.sub("[^a-zA-Z]", "", col_lem[c]))
        
        #Remove stopwords (for improved model results):
        col_words = [w for w in letters_only_col if not w in stopwords.words('english')]
        
        #Remove 'cheat' words (words that are in the subreddit's name and also in the column)
        # Remove characters' name and game title
                       # Mario cheatwords
        cheat_words = ['mario', 'party', 'marioparty', 'smash', 'bros', 'browser',
                       'ultimate', 'smashbrosultimate', 'super', 'superstar',
                       'luigi', 'peach', 'toad', 'bowser', 'game', 'ss', 'waluigi',
                       'daisy', 'yoshi', 'donkey', 'king boo', 'kong','diddy', 
                       'rosalina', 'toadette','toadsworth', 'captain', 'poochy',
                       'birdo', 'pauline', 'kamek', 'kammy', 'koopa', 'jr',
                       'wario', 'fawful', 'kart', 'paper', 'kart', 'sonic', 'dr.',
                       
                       # Zelda cheatwords
                       "zelda", "princess", "link","botw", "totk", "oc", "oot",
                       'majora', 'mask', 'ocarina', 'time', "mm", "hyrule", "tp",
                       'ganon', 'darklink','nightmares','twinrova','vaati',
                       'zant', 'demise', 'yuga', 'kohga', 'master', 'twilight',
                       'tears', 'kingdom', 'wind', 'waker', 'phantom', 'hourglass',
                       'legend', 'awakening', 'adventure', 'skyward', 'sword',
                       'tri',  'force', 'heroes', 'ww', 'breath', 'wild',
                       
                       # Mario Movie
                       "movie", "trailer", "chris", "pratt", "voice", "poster",
                       "sequel", "remade", "charles", "martinet", "charlie",
                       "anya", "jack", "black", "actor", "live", "action",
                       "netflix", "live", "action", "cast"] 
        # Filter col_words (not letters_only_col) so the stopword removal above is kept:
        col_words = [w for w in col_words if w not in cheat_words]
        
        #Drop empty strings left behind when a token contained no letters:
        col_words = list(filter(None, col_words))

        #Join the lemmatized words - stopwords back to one long string (prep for
        #vectorization, done outside/after this function):
        col_words = " ".join(col_words)

        #Fill new column with 'cleaned' string from column:
        df.loc[i,(column+'_clean')] = col_words
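The order of the cleaning steps can be easier to follow on a single title. The sketch below is a simplified, standard-library-only version: it skips lemmatization (which needs NLTK's WordNet data) and uses two tiny hypothetical word sets, `DEMO_STOPWORDS` and `DEMO_CHEAT_WORDS`, in place of the full lists above.

```python
import re

DEMO_STOPWORDS = {"the", "a", "of", "is", "my", "in"}   # stand-in for NLTK's stopword list
DEMO_CHEAT_WORDS = {"mario", "zelda", "link", "luigi"}  # stand-in for the full cheat-word list

def clean_title(title):
    """Lowercase, tokenize, de-duplicate, keep letters only, drop stop and cheat words."""
    tokens = re.findall(r"\w+", title.lower())
    unique_tokens = list(dict.fromkeys(tokens))  # de-duplicate, keeping first occurrences
    letters_only = [re.sub(r"[^a-z]", "", tok) for tok in unique_tokens]
    kept = [
        word for word in letters_only
        if word and word not in DEMO_STOPWORDS and word not in DEMO_CHEAT_WORDS
    ]
    return " ".join(kept)

demo_clean = clean_title("Is the new Zelda dungeon harder than the old Mario castle 64")
print(demo_clean)
```

Writing the cleaning as a function of one string also makes it possible to apply it with `Series.apply`, which avoids relying on the dataframe having a default integer index.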

In [None]:
df

In [None]:
column_cleaner(column='title', df=df)

**Save version of DataFrame**

In [None]:
df.to_csv('master_df_cleaned.csv', index=False, sep=",")

In [None]:
df.head(3)

In [None]:
df.sample(3)

### Top 25 Words for Mario

In [None]:
def nplot(subreddit,column):
    fig,ax = plt.subplots(1,3, figsize=(15,6))
    fig.suptitle("Top 25 Most Common Words of " + subreddit, fontsize=20)
    top_words = {} # Collect the top 25 terms for each n-gram size

    for i in range(1,4): # unigrams, bigrams, trigrams
        cvec = CountVectorizer(ngram_range=(i,i), stop_words="english")
        text_cvec = cvec.fit_transform(df[df['subreddit']== subreddit ][column])
    
        vec = pd.DataFrame(text_cvec.toarray(),columns= cvec.get_feature_names_out())
        top = vec.sum().sort_values(ascending=False)[:25] # 25 most frequent terms
        top_words[i] = list(top.index)
        # Reverse the order so the most frequent term is drawn at the top of the chart
        ax[i-1].barh(top[::-1].index, top[::-1])
        ax[i-1].set_xlabel("Count")
        ax[i-1].set_ylabel("Term")
        ax[i-1].set_title(f" {i}-grams")
    plt.tight_layout();

    
    print("Top terms for each n-gram size:")
    for x, y in top_words.items():
        print(f"{x}-gram terms:", y)
        print("------------------------------------------------------------------------------------")
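For intuition about what the n-gram counts in `nplot` represent, the sketch below counts contiguous word sequences by hand with `collections.Counter`. It is a simplification: it splits on whitespace and does not remove stop words, so its counts will not match `CountVectorizer` exactly. The titles are invented.

```python
from collections import Counter

def ngram_counts(titles, n):
    """Count contiguous n-word sequences across a list of titles."""
    counts = Counter()
    for title in titles:
        words = title.lower().split()
        for start in range(len(words) - n + 1):
            counts[" ".join(words[start:start + n])] += 1
    return counts

# Invented titles, for illustration only.
demo_titles = [
    "fan art of my favourite boss",
    "my favourite boss fight",
    "fan art friday",
]

demo_unigrams = ngram_counts(demo_titles, 1)
demo_bigrams = ngram_counts(demo_titles, 2)
print(demo_bigrams.most_common(3))
```

Longer n-grams are sparser: each title of `k` words contributes only `k - n + 1` sequences, which is why the 3-gram panel usually shows much lower counts than the 1-gram panel.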

In [None]:
nplot("Mario","title")

In [None]:
# Remove character names, game titles and movie related.
nplot("Mario","title_clean")

Posts on the Mario subreddit focus more on the characters' looks and on memes. There is also a considerable number of design- and fan-related posts about the franchise.

### Top 25 Words for Zelda

In [None]:
nplot("zelda","title")

In [None]:
# Remove character names and game titles.
nplot("zelda","title_clean")

Posts on the Zelda subreddit focus more on art, driven by the community-led event [linktober](https://www.linktober.com/). There is also a considerable amount of discussion on gameplay-specific topics such as boss events, seasonal events, and story.

### Messy Title Model

In [None]:
X = df["title"]
y = df["target"]

In [None]:
# Redefine training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.20,
                                                    stratify=y,
                                                    random_state=42)

In [None]:
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

In [None]:
# Let's set a pipeline up with two stages:
# 1. CountVectorizer (transformer)
# 2. Multinomial Naive Bayes (estimator)

pipe = Pipeline([
    ("cvec", CountVectorizer()), # Transformer (fit, transform)
    ("nb", MultinomialNB()) # Estimator or model (fit, predict)    
])

# .score() of MultinomialNB (accuracy) gives GridSearchCV a score to judge
# our hyperparameter combinations

In [None]:
# Search over the following values of hyperparameters:
# Maximum number of features to keep: 500, 1000, 1500, 2000
# Minimum number of documents a token must appear in: 2, 3
# Maximum share of documents a token may appear in: 90%, 95%
# N-gram range: individual tokens only, or individual tokens and 2-grams
# Stop words: None or English

pipe_params = {
    "cvec__max_features":[500,1000,1500,2000],
    "cvec__min_df": [2,3],
    "cvec__max_df": [0.9,0.95],
    "cvec__ngram_range": [(1,1),(1,2)],
    "cvec__stop_words": [None,"english"]
}

# ngram_range of (1,1) returns individual tokens only
# ngram_range of (1,2) returns individual tokens AND bi-grams
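Before fitting, it can be useful to estimate how much work the grid search will do. The sketch below counts the hyperparameter combinations in a copy of the grid (named `demo_grid` here to avoid overwriting `pipe_params`) and multiplies by the 5 cross-validation folds used in this notebook.

```python
from itertools import product

# Copy of the grid above, kept under a separate name for this illustration.
demo_grid = {
    "cvec__max_features": [500, 1000, 1500, 2000],
    "cvec__min_df": [2, 3],
    "cvec__max_df": [0.9, 0.95],
    "cvec__ngram_range": [(1, 1), (1, 2)],
    "cvec__stop_words": [None, "english"],
}

# Every combination of one value per hyperparameter is one candidate model.
n_combinations = len(list(product(*demo_grid.values())))
n_folds = 5
n_fits = n_combinations * n_folds

print(f"{n_combinations} combinations x {n_folds} folds = {n_fits} fits")
```

With the default `refit=True`, `GridSearchCV` also performs one additional fit of the best combination on the full training set, so the runtime printed after fitting covers slightly more than these fits.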

In [None]:
# Instantiate GridSearchCV.

gs = GridSearchCV(pipe, # what object are we optimizing?
                  param_grid=pipe_params, # what parameters values are we searching?
                  cv=5) # 5-fold cross-validation.

In [None]:
# Fit GridSearch to training data.
start_time = time.time()
gs.fit(X_train,y_train)
print(f"Runtime:{time.time()-start_time}")

In [None]:
# What's the best score?
gs.best_score_

In [None]:
# What's the best params?
gs.best_params_

In [None]:
# Score model on training set.
# What is the score on a classifier? Accuracy
gs.score(X_train,y_train)

In [None]:
# Score model on testing set.
gs.score(X_test,y_test)

In [None]:
# Instantiate the transformer.
tvec = TfidfVectorizer(stop_words="english")

# convert training data to dataframe
X_train_df = pd.DataFrame(tvec.fit_transform(X_train).todense(), 
                          columns=tvec.get_feature_names_out())
# Plot top 20 occurring words
X_train_df.sum().sort_values(ascending=True).tail(20).plot(kind='barh')
plt.title("Top 20 Common Words from both Reddits");

Mario is mentioned more often than Zelda.
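The bar chart above sums TF-IDF weights rather than raw counts. As a sketch of how those weights arise, the code below reproduces by hand the formula that `TfidfVectorizer` uses with its default settings (smoothed IDF, then L2 normalization of each document), on three invented, pre-tokenized documents.

```python
import math

# Invented, already-tokenized documents, for illustration only.
demo_docs = [["mario", "kart", "mario"], ["zelda", "art"], ["mario", "art"]]
n_docs = len(demo_docs)
vocab = sorted({word for doc in demo_docs for word in doc})

# Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
doc_freq = {word: sum(word in doc for doc in demo_docs) for word in vocab}
idf = {word: math.log((1 + n_docs) / (1 + doc_freq[word])) + 1 for word in vocab}

def tfidf_row(doc):
    """Raw count times IDF for each vocabulary word, scaled to unit L2 length."""
    raw = {word: doc.count(word) * idf[word] for word in vocab}
    norm = math.sqrt(sum(value * value for value in raw.values()))
    return {word: value / norm for word, value in raw.items()}

rows = [tfidf_row(doc) for doc in demo_docs]

# Summing each word's weight over all documents mirrors X_train_df.sum() above.
totals = {word: sum(row[word] for row in rows) for word in vocab}
top_word = max(totals, key=totals.get)
print(top_word, round(totals[top_word], 3))
```

Words that occur in fewer documents receive a higher IDF, so a large summed weight reflects both how often a word appears and how much of each title it accounts for.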

### Cleaned Title Model (Remove "Cheat words")

In [None]:
X = df["title_clean"]
y = df["target"]

In [None]:
# Redefine training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    test_size=0.20,
                                                    stratify=y,
                                                    random_state=42)

In [None]:
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

In [None]:
# Let's set a pipeline up with two stages:
# 1. CountVectorizer (transformer)
# 2. Multinomial Naive Bayes (estimator)

pipe = Pipeline([
    ("cvec", CountVectorizer()), # Transformer (fit, transform)
    ("nb", MultinomialNB()) # Estimator or model (fit, predict)    
])

# .score() of MultinomialNB (accuracy) gives GridSearchCV a score to judge
# our hyperparameter combinations

In [None]:
# Search over the following values of hyperparameters:
# Maximum number of features to keep: 500, 1000, 1500, 2000
# Minimum number of documents a token must appear in: 2, 3
# Maximum share of documents a token may appear in: 90%, 95%
# N-gram range: individual tokens only, or individual tokens and 2-grams
# Stop words: None or English

pipe_params = {
    "cvec__max_features":[500,1000,1500,2000],
    "cvec__min_df": [2,3],
    "cvec__max_df": [0.9,0.95],
    "cvec__ngram_range": [(1,1),(1,2)],
    "cvec__stop_words": [None,"english"]
}

# ngram_range of (1,1) returns individual tokens only
# ngram_range of (1,2) returns individual tokens AND bi-grams

In [None]:
# Instantiate GridSearchCV.

gs = GridSearchCV(pipe, # what object are we optimizing?
                  param_grid=pipe_params, # what parameters values are we searching?
                  cv=5) # 5-fold cross-validation.

In [None]:
# Fit GridSearch to training data.
start_time = time.time()
gs.fit(X_train,y_train)
print(f"Runtime:{time.time()-start_time}")

In [None]:
# What's the best score?
gs.best_score_

In [None]:
# What's the best params?
gs.best_params_

In [None]:
# Score model on training set.
# What is the score on a classifier? Accuracy
gs.score(X_train,y_train)

In [None]:
# Score model on testing set.
gs.score(X_test,y_test)

In [None]:
# Define score model dataframe for comparison
list_of_rows = []
score_df = pd.DataFrame()

def modelscore(gsmodel,transformer,model,cheatword):
    s = pd.Series({"Transformer": transformer, # "CountVectorizer", "TfidfVectorizer"
               "Model" : model, #"MultinomialNB","Logistic"
               "Cheat Word" : cheatword , #"Included" , "Excluded"
               "Train Score" : gsmodel.score(X_train,y_train),
               "Test Score" : gsmodel.score(X_test,y_test)
               })
    list_of_rows.append(s) 
    global score_df
    score_df = pd.DataFrame(list_of_rows)
    print(score_df)

In [None]:
modelscore(gs, "CountVectorizer", "MultinomialNB", "Excluded")

In [None]:
score_df.head()

In [None]:
# Let's set a pipeline up with two stages:
# 1. TfidfVectorizer (transformer)
# 2. Multinomial Naive Bayes (estimator)

pipe_tvec = Pipeline([
    ("tvec", TfidfVectorizer()), # Transformer (fit, transform)
    ("nb", MultinomialNB()) # Estimator or model (fit, predict)    
])

# .score() of MultinomialNB (accuracy) gives GridSearchCV a score to judge
# our hyperparameter combinations

In [None]:
# Search over the following values of hyperparameters:
# Maximum number of features to keep: 500, 1000, 1500, 2000
# Minimum number of documents a token must appear in: 2, 3
# Maximum share of documents a token may appear in: 90%, 95%
# Stop words: None or English
# N-gram range: individual tokens only, or individual tokens and 2-grams

pipe_tvec_params = {"tvec__max_features" : [500,1000,1500,2000],
                    "tvec__min_df": [2,3],
                    "tvec__max_df": [0.9,0.95],
                    "tvec__stop_words" : [None, "english"],
                    "tvec__ngram_range" : [(1,1),(1,2)]
                  }
    

In [None]:
# Instantiate GridSearchCV.
gs_tvec = GridSearchCV(estimator=pipe_tvec,
                      param_grid=pipe_tvec_params,
                      cv=5)

In [None]:
# Fit GridSearch to training data.
gs_tvec.fit(X_train,y_train)

In [None]:
gs_tvec.best_params_

In [None]:
# Score model on training set.
gs_tvec.score(X_train,y_train)

In [None]:
# Score model on testing set.
gs_tvec.score(X_test,y_test)

In [None]:
modelscore(gs_tvec, "TfidfVectorizer", "MultinomialNB", "Excluded")

### Find top occurring words after removing game names

In [None]:
# Instantiate the transformer.
tvec = TfidfVectorizer(stop_words="english")

# convert training data to dataframe
X_train_df = pd.DataFrame(tvec.fit_transform(X_train).todense(), 
                          columns=tvec.get_feature_names_out())
# Plot top 20 occurring words
X_train_df.sum().sort_values(ascending=True).tail(20).plot(kind='barh')
plt.title("Top 20 Common Words after removing game name");

In [None]:
X_train_df.head()

In [None]:
# Let's set a pipeline up with two stages:
# 1. CountVectorizer (transformer)
# 2. LogisticRegression (estimator)

pipe_cvec_log = Pipeline([
    ("cvec", CountVectorizer()), # Transformer (fit, transform)
    ("logreg", LogisticRegression()) # Estimator or model (fit, predict)    
])

# .score() of LogisticRegression (accuracy) gives GridSearchCV a score to judge
# our hyperparameter combinations

# Search over the following values of hyperparameters:
# Maximum number of features to keep: 500, 1000, 1500, 2000
# Minimum number of documents a token must appear in: 2, 3
# Maximum share of documents a token may appear in: 90%, 95%
# N-gram range: individual tokens only, or individual tokens and 2-grams
# Stop words: None or English

pipe_cvec_log_params = {"cvec__max_features":[500, 1000, 1500, 2000],
                        "cvec__min_df": [2,3],
                        "cvec__max_df": [0.9,0.95],
                        "cvec__ngram_range": [(1,1),(1,2)],
                        "cvec__stop_words": [None,"english"],
                       }
    

# Instantiate GridSearchCV.
gs_cvec_log = GridSearchCV(estimator=pipe_cvec_log,
                      param_grid=pipe_cvec_log_params,
                      cv=5)

# Fit GridSearch to training data.
gs_cvec_log.fit(X_train,y_train)

# Check best parameter
print(gs_cvec_log.best_params_)

# Score model on training set (printed, since only a cell's last expression is displayed).
print(gs_cvec_log.score(X_train,y_train))

# Score model on testing set.
print(gs_cvec_log.score(X_test,y_test))

# Prediction
y_preds_cvec_log = gs_cvec_log.predict(X_test)

# Save model score to score dataframe
modelscore(gs_cvec_log, "CountVectorizer", "Logistic", "Excluded")

In [None]:
X_test.index

In [None]:
# Let's set a pipeline up with two stages:
# 1. TfidfVectorizer (transformer)
# 2. LogisticRegression (estimator)

pipe_tvec_log = Pipeline([
    ("tvec", TfidfVectorizer()), # Transformer (fit, transform)
    ("logreg", LogisticRegression()) # Estimator or model (fit, predict)    
])

# .score() of LogisticRegression (accuracy) gives GridSearchCV a score to judge
# our hyperparameter combinations

# Search over the following values of hyperparameters:
# Maximum number of features to keep: 2000, 3000, 4000, 5000
# Minimum number of documents a token must appear in: 2, 3
# Maximum share of documents a token may appear in: 90%, 95%
# Stop words: None or English
# N-gram range: individual tokens only, or individual tokens and 2-grams

pipe_tvec_log_params = {"tvec__max_features" : [2000,3000,4000,5000],
                        "tvec__min_df": [2,3],
                        "tvec__max_df": [0.9,0.95],
                        "tvec__stop_words" : [None, "english"],
                        "tvec__ngram_range" : [(1,1),(1,2)]
                       }
    

# Instantiate GridSearchCV.
gs_tvec_log = GridSearchCV(estimator=pipe_tvec_log,
                      param_grid=pipe_tvec_log_params,
                      cv=5)

# Fit GridSearch to training data.
gs_tvec_log.fit(X_train,y_train)

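# Check best parameters (printed, since only a cell's last expression is displayed).
print(gs_tvec_log.best_params_)

# Score model on training set.
print(gs_tvec_log.score(X_train,y_train))

# Score model on testing set.
print(gs_tvec_log.score(X_test,y_test))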

# Prediction
y_preds_tvec_log = gs_tvec_log.predict(X_test)

# Save model score to score dataframe
modelscore(gs_tvec_log, "TfidfVectorizer", "Logistic", "Excluded")

In [None]:
score_df

**Error Analysis**

In [None]:
df

In [None]:
# Instantiate Model
cvec = CountVectorizer(stop_words = "english")

'cvec__max_df': 0.9, 'cvec__max_features': 5000, 'cvec__min_df': 2, 'cvec__ngram_range': (1, 2), 'cvec__stop_words': None

In [None]:
# Fit
cvec.fit(X_train)

In [None]:
# Transform the corpus
X_train = cvec.transform(X_train)

In [None]:
train_df = pd.DataFrame(X_train.todense(), 
                        columns=cvec.get_feature_names_out())

In [None]:
train_df["target"] = y_train.values

In [None]:
train_df.groupby('target').sum()\
.T.sort_values(1, ascending=True).tail(20).plot.barh();

In [None]:
cm = confusion_matrix(y_test, y_preds_tvec_log)

In [None]:
tn, fp, fn, tp = confusion_matrix(y_test, y_preds_tvec_log).ravel()

In [None]:
ConfusionMatrixDisplay(cm, display_labels=gs_tvec_log.classes_).plot();

In [None]:
fp
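The four confusion-matrix cells can be turned into a few standard metrics that say more than accuracy alone. The counts below are hypothetical (they are not results from this notebook) and use `_demo` names so they do not overwrite `tn`, `fp`, `fn`, and `tp`.

```python
# Hypothetical confusion-matrix counts, for illustration only.
tn_demo, fp_demo, fn_demo, tp_demo = 40, 10, 5, 45
total_demo = tn_demo + fp_demo + fn_demo + tp_demo

accuracy_demo = (tp_demo + tn_demo) / total_demo   # share of all posts classified correctly
precision_demo = tp_demo / (tp_demo + fp_demo)     # of posts predicted as class 1, share truly class 1
recall_demo = tp_demo / (tp_demo + fn_demo)        # of true class 1 posts, share that were found
specificity_demo = tn_demo / (tn_demo + fp_demo)   # of true class 0 posts, share that were found

print(f"accuracy:    {accuracy_demo:.3f}")
print(f"precision:   {precision_demo:.3f}")
print(f"recall:      {recall_demo:.3f}")
print(f"specificity: {specificity_demo:.3f}")
```

Comparing recall and specificity shows whether the model's errors are concentrated in one subreddit, which is a useful starting point when reading through the misclassified titles.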

In [None]:
test_df = pd.DataFrame(X_test)
test_df['actual_class'] = y_test
test_df['predict_class'] = y_preds_tvec_log
test_df

In [None]:
false_df = test_df[test_df['actual_class'] != test_df['predict_class']]['title_clean']
false_df

In [None]:
df.shape

In [None]:
df.sample(30)