# Project:  Subreddit Post Analysis: r/iosgaming, r/android gaming

### Naive Bayes - Gaussian Regression Modeling
We will test our data imported from `subs.csv`. The data will be cleaned and prepared for GridSearching across various hyperparameters selected for our features. <br><br> Subreddit submission posts alongside accompanying post contents have been scraped. Target variable `y` is whether or not the document/row belongs to r/ios_gaming or not. `1` - indicates yes, `0` - indicates no (belongs to r/androidgaming instead)
**Contents:**
- [Library Imports](#Library-Imports)
- [Data Import and Cleaning](#Data-Import-and-Cleaning)
- [Modeling Hyperparameters](#Modeling-Hyperparameters)
 - [1. Count Vectorizer](#1.-Count-Vectorizer)
 - [2. TfidfVectorizer](#2.-TfidfVectorizer)
- [Scoring](#Scoring)

### Library Imports

In [1]:
import pandas as pd
import numpy as np
import requests
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from nltk.corpus import stopwords
import random
import string

In [2]:
# this will handle sparse-matrix created from countvectorizer transformation
class DenseTransformer():

    def fit(self, X, y=None, **fit_params):
        return self

    def transform(self, X, y=None, **fit_params):
        return X.todense()

# https://stackoverflow.com/a/28384887
# modified to remove (xxx), James Dargan-SEA 11 found this

### Data Import and Cleaning

In [3]:
subsdf = pd.read_csv('../datasets/subs.csv', )
subsdf.drop(columns='Unnamed: 0', inplace = True)
subsdf['corpus'] = subsdf['title'] + subsdf['selftext']

In [4]:
subsdf['is_iosgaming'].value_counts()

1    10700
0    10700
Name: is_iosgaming, dtype: int64

In [5]:
punct = [i for i in string.punctuation]
stops = stopwords.words('english')
stops.extend(punct)

### Modeling Hyperparameters

#### 1. Count Vectorizer
- `CountVectorizer()`
- `GaussianNB()`
- `Pipeline()`
- `GridSearchCV()`

In [6]:
# establish our target and features
X = subsdf['corpus']
y = subsdf['is_iosgaming']

In [7]:
# Train-Test-Split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify = y, random_state = 14) 

In [8]:
#KNN Regression, CountVectorizer HyperParameters
#pipline
pipe = Pipeline([
    ('cvec', CountVectorizer(lowercase = True)),
    ('to_dense', DenseTransformer()),
    ('gn', GaussianNB())
])

#params
pipe_params = {
    'cvec__max_features' : [100, 500, 1000],
    'cvec__ngram_range' : [(1,1), (1,2), (2,2)],
    'cvec__stop_words' : [None, 'english', stops]
#     'cvec__max_df' : (.99, .95, .9) # test different high used words
}

#gridsearch
gs = GridSearchCV(pipe,
                 param_grid=pipe_params,
                 cv = 5)

In [9]:
gs.fit(X_train, y_train);

In [10]:
# Our model's scoring and best params
print(f'Model Training Score: {gs.best_score_}')
print(f'Model Test Score: {gs.score(X_test, y_test)}')
print(gs.best_params_)
print(gs.best_estimator_)
best_nbcv = gs.best_estimator_

Model Training Score: 0.8175700934579438
Model Test Score: 0.8250467289719626
{'cvec__max_features': 1000, 'cvec__ngram_range': (2, 2), 'cvec__stop_words': None}
Pipeline(memory=None,
         steps=[('cvec',
                 CountVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.int64'>, encoding='utf-8',
                                 input='content', lowercase=True, max_df=1.0,
                                 max_features=1000, min_df=1,
                                 ngram_range=(2, 2), preprocessor=None,
                                 stop_words=None, strip_accents=None,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, vocabulary=None)),
                ('to_dense',
                 <__main__.DenseTransformer object at 0x1a1cb64110>),
                ('gn', GaussianNB(priors=None, var_smoothing=1e-09))],
   

#### 2. TfidfVectorizer
- `TfidfVectorizer()`
- `GaussianNB()`
- `Pipeline()`
- `GridSearchCV()`

In [None]:
# establish our target and features
X = subsdf['corpus']
y = subsdf['is_iosgaming']

In [None]:
# Train-Test-Split
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify = y, random_state = 14) 

In [11]:
#NB Gaussian Regression, TfidfVectorizer HyperParameters
#pipline
pipe = Pipeline([
    ('tvec', TfidfVectorizer(lowercase = True)),
    ('to_dense', DenseTransformer()),
    ('gn', GaussianNB())
])

#params
pipe_params = {
    'tvec__max_features' : [100, 500, 1000],
    'tvec__ngram_range' : [(1,1), (1,2), (2,2)],
    'tvec__stop_words' : [None, 'english', stops]
#     'tvec__max_df' : (.99, .95, .9) # test different high used words
}

#gridsearch
gs = GridSearchCV(pipe,
                 param_grid=pipe_params,
                 cv = 5)

In [12]:
gs.fit(X_train, y_train);

In [13]:
# Our model's scoring and best params
print(f'Model Training Score: {gs.best_score_}')
print(f'Model Test Score: {gs.score(X_test, y_test)}')
print(gs.best_params_)
print(gs.best_estimator_)
best_nbtv = gs.best_estimator_

Model Training Score: 0.9682242990654206
Model Test Score: 0.9706542056074766
{'tvec__max_features': 1000, 'tvec__ngram_range': (1, 2), 'tvec__stop_words': 'english'}
Pipeline(memory=None,
         steps=[('tvec',
                 TfidfVectorizer(analyzer='word', binary=False,
                                 decode_error='strict',
                                 dtype=<class 'numpy.float64'>,
                                 encoding='utf-8', input='content',
                                 lowercase=True, max_df=1.0, max_features=1000,
                                 min_df=1, ngram_range=(1, 2), norm='l2',
                                 preprocessor=None, smooth_idf=True,
                                 stop_words='english', strip_accents=None,
                                 sublinear_tf=False,
                                 token_pattern='(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, use_idf=True,
                                 vocabulary=None)),


### Scoring
- `Pipe`: `GaussianNB()`, `CountVectorizer()`, `DenseTransformer()`
- `Best Score:` 0.8175700934579438
- `Test Score:` 0.8250467289719626

<br><br>
- `Pipe`: `GaussianNB()`, `TfidfVectorizer()`, `DenseTransformer()`
- `Best Score:` 0.9682242990654206
- `Test Score:` 0.9706542056074766