# Project Description

This mini-project gives you experience working with text data and machine learning in scikit-learn. You will develop models to analyze text from the website Reddit, a popular social media forum where users post and comment on content in themed communities called subreddits. Given a post or comment from Reddit, the goal is to identify the subreddit it originally came from. The dataset consists of posts and comments from 8 subreddits:

rpg : Subreddit for tabletop role-playing games
anime : Subreddit for discussing Japanese animation
datascience : Subreddit for discussing matters related to data science, including machine learning
hardware : Computer hardware news and discussions
cars : Discussions and news about cars
gamernews : Discussions of video game news
gamedev : Subreddit for video game developers
computers : Subreddit for discussing anything about computers

Download the training and test data from the Kaggle competition page (details to follow below). Each line in train.csv includes two fields: the text of a post/comment and the subreddit it was posted on. Use train.csv for training your models.

Your tasks

Part 1: Using the data provided in train.csv
Implement Bernoulli Naïve Bayes from scratch. Since your implementation may not be optimized for speed, you can set max_features=5000 in CountVectorizer(). This considers only the top 5000 words, i.e., m=5000.
Run additional experiments using at least one other classifier from the scikit-learn package (excluding Bernoulli Naïve Bayes), for example Logistic Regression or Decision Tree.
Implement model validation (e.g., k-fold cross-validation) and use it to report, in your write-up, the performance of the classification models above (i.e., your Naïve Bayes model and the scikit-learn model(s)). You can use KFold from scikit-learn.
Part 2: Run your best model (obtained in Part 1) on the data provided in test.csv and submit the result to the Kaggle competition (see below for more details).
Part 3: Prepare and submit your report (max 5 pages) along with your code (details below).
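
The validation step in Part 1 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the assignment's reference solution: the toy texts and labels are made up, `LogisticRegression` is just a stand-in classifier, and `evaluate_kfold` is a hypothetical helper name.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold


def evaluate_kfold(texts, labels, n_splits=5, seed=0):
    """Return the validation accuracy of each fold (hypothetical helper)."""
    texts = np.asarray(texts, dtype=object)
    labels = np.asarray(labels)
    scores = []
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for train_idx, val_idx in kf.split(texts):
        # Fit the vectorizer on the training fold only, so the
        # validation fold does not leak into the vocabulary.
        vec = CountVectorizer(binary=True, max_features=5000)
        X_train = vec.fit_transform(texts[train_idx])
        X_val = vec.transform(texts[val_idx])
        clf = LogisticRegression(max_iter=1000)
        clf.fit(X_train, labels[train_idx])
        scores.append(clf.score(X_val, labels[val_idx]))
    return np.array(scores)


# Made-up two-class corpus, five documents per class.
toy_texts = [
    "the engine and brakes on this car are great",
    "new tires improved my car handling",
    "which sedan has the best engine",
    "the car dealership replaced my brakes",
    "turbo engine cars are fast",
    "this anime season has great animation",
    "the manga adaptation anime was faithful",
    "best anime opening songs this season",
    "the studio animation quality dropped",
    "new anime episode airs tonight",
]
toy_labels = ["cars"] * 5 + ["anime"] * 5

fold_scores = evaluate_kfold(toy_texts, toy_labels, n_splits=5)
print("per-fold accuracy:", fold_scores)
print("mean accuracy:", fold_scores.mean())
```

Reporting both the per-fold scores and their mean makes it easier to see how much the estimate varies between folds.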

Hint 1: You may want to use Laplace smoothing with your Bernoulli Naïve Bayes model.
Hint 2: You may use CountVectorizer() with binary=True to obtain a binary vector representation for a text.
Hint 3: When debugging your code, you might start off by using only a quarter of the training samples and a quarter of the test samples. When you are done debugging, apply your code to the complete data.
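
Hints 1 and 2 can be illustrated on a tiny corpus. The sketch below assumes three made-up documents, treats the first two as belonging to a single class, and uses a smoothing constant of 1. With `binary=True` every entry of the matrix is 0 or 1 regardless of how often a word repeats, and the smoothed estimate adds 1 to the count and 2 to the class size because each feature has two possible outcomes.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = ["gpu gpu benchmark", "gpu driver", "anime episode"]  # made-up documents

vec = CountVectorizer(binary=True)
X = vec.fit_transform(docs).toarray()

# Recover the column order from the fitted vocabulary (term -> column index).
vocab = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
print(vocab)
print(X)  # "gpu" appears twice in the first document but is encoded as 1

# Assume the first two documents form one class (e.g. "hardware").
hardware_rows = X[:2]
alpha = 1.0  # Laplace smoothing constant
theta = (hardware_rows.sum(axis=0) + alpha) / (hardware_rows.shape[0] + 2 * alpha)

theta_gpu = theta[vocab.index("gpu")]      # (2 + 1) / (2 + 2)
theta_anime = theta[vocab.index("anime")]  # (0 + 1) / (2 + 2), never exactly zero
print(theta_gpu, theta_anime)
```

Without smoothing, the estimate for "anime" in this class would be exactly 0 and its logarithm would be undefined.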

# Setup

In [None]:
import math
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os

DEBUG = True
KFOLDS_ENABLE = False
SHUFFLE = True

try:
    from google.colab import drive
    drive.mount('/content/drive', force_remount=True)
    projectDIR='/content/drive/My Drive/School/COMP4900A/COMP4900A-MP2/'
    !pip install PipelineHelper xgboost catboost

except ImportError:  ## Not running on Colab, so fall back to the local directory
    projectDIR='./'

dataDIR = projectDIR + 'data'

## Training  Data Import
trainCSV = dataDIR + '/dataset/train.csv'
train_df = pd.read_csv(trainCSV)

## Testing Data Import
testCSV = dataDIR + '/dataset/test.csv'
test_df = pd.read_csv(testCSV)


from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import LabelEncoder
from sklearn.utils import shuffle

sub_reddit = ['rpg', 'anime', 'datascience', 'hardware', 'cars', 'gamernews', 'gamedev', 'computers']
le = LabelEncoder()
le.fit(sub_reddit)  ## Define the categories le will encode explicitly. They could also be extracted from the data, but this is safer

X = train_df['body']       ## Main corpus is generated from the training set
y = le.transform(train_df['subreddit'])

if SHUFFLE:
    X, y = shuffle(X,y, random_state=27)
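
One detail worth keeping in mind when reading the encoded labels: `LabelEncoder` sorts its classes, so the integer codes follow alphabetical order rather than the order of the `sub_reddit` list. A quick standalone check, repeating the list from the cell above:

```python
from sklearn.preprocessing import LabelEncoder

sub_reddit = ['rpg', 'anime', 'datascience', 'hardware', 'cars',
              'gamernews', 'gamedev', 'computers']

le = LabelEncoder()
le.fit(sub_reddit)

# classes_ is sorted, so index 0 is 'anime' and index 7 is 'rpg'.
print(list(le.classes_))

codes = le.transform(['rpg', 'anime'])
print(codes)

# inverse_transform maps integer codes back to subreddit names.
decoded = le.inverse_transform([3])
print(decoded)
```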

# Custom classes for transforming or cleaning data

In [None]:
from sklearn.base import TransformerMixin,BaseEstimator

class CleanTextTransformer(TransformerMixin,BaseEstimator):
  def cleanText(self, text):
    text = text.strip().replace("\n", " ").replace("\r", " ")
    text = text.lower()
    return text

  def transform(self, X, **transform_params):
    return [self.cleanText(text) for text in X]

  def fit(self, X, y=None, **fit_params):
    return self

  def get_params(self, deep=True):
    return {}

class BaseBernoulliNB(BaseEstimator):
  def _update_theta(self, x, y, smooth=0.1):
    '''
    x: dense binary feature matrix of the examples (one row per example, one column per feature)
    y: vector of class labels of the examples
    self.theta_k is a length-k vector of class priors: (# of examples where y=k) / (total # of examples)
    self.theta_k_j is a k x j matrix of smoothed feature probabilities:
      (# of examples where xj=1 and y=k, plus smooth) / (# of examples where y=k, plus 2 * smooth)
    '''
    x = np.asarray(x)
    y = np.asarray(y)
    class_name = np.unique(y)
    class_num = len(class_name)
    self.theta_k = np.zeros(class_num)
    self.theta_k_j = np.zeros((class_num, x.shape[1]))
    for k in range(class_num):
      in_class = (y == class_name[k])   ## compare against the label itself, not the loop index
      k_num = np.count_nonzero(in_class)
      self.theta_k[k] = k_num / y.shape[0]
      ## Smoothing: add smooth to every feature count and 2 * smooth to the class size
      ## (each feature has two outcomes), with the denominator computed once per class
      self.theta_k_j[k] = (x[in_class].sum(axis=0) + smooth) / (k_num + 2 * smooth)

  def fit(self, x, y, smooth=0.1):
    '''
    self.class_log_prior is a length-k vector of log(self.theta_k), where k is the number of classes
    self.feature_likelihood_log_one is a k x j matrix of log(self.theta_k_j), where j is the number of features
    self.feature_likelihood_log_zero is a k x j matrix of log(1 - self.theta_k_j)
    '''
    self._update_theta(x, y, smooth)
    self.class_log_prior = np.log(self.theta_k)
    self.feature_likelihood_log_one = np.log(self.theta_k_j)
    self.feature_likelihood_log_zero = np.log(1 - self.theta_k_j)
    return self   ## scikit-learn convention: fit returns the estimator

  def predict(self, x):
    if len(x.shape) ==1:
      y = np.zeros(1)
      class_pro = np.zeros(len(self.class_log_prior))
      for k in range (len(class_pro)):
        class_pro[k] += self.class_log_prior[k] + np.sum(np.multiply(x, self.feature_likelihood_log_one[k].T)) + np.sum(np.multiply(1 - x, self.feature_likelihood_log_zero[k].T))
      y[0] = np.argmax(class_pro)
      return y
    else:
      y = np.zeros(x.shape[0])
      for i in range (x.shape[0]):
        class_pro = np.zeros(len(self.class_log_prior))
        for k in range (len(class_pro)):
          class_pro[k] += self.class_log_prior[k] + np.sum(np.multiply(x[i], self.feature_likelihood_log_one[k].T)) + np.sum(np.multiply(1 - x[i], self.feature_likelihood_log_zero[k].T))
        y[i] = np.argmax(class_pro)
      return y

import tensorflow_hub as hub
import logging, os

logging.disable(logging.WARNING)
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"                                                ## Suppress tf related warnings, as USE seems to slightly clash with tf version
os.environ["TFHUB_CACHE_DIR"] = projectDIR +'cache'                                     ## This is the ~1GB sentence encoder storage 
class UniversalEncoderVectorizer(TransformerMixin,BaseEstimator):

  def transform(self, X, **transform_params):
    ### Loading the encoder here on every call is a clumsy workaround for parallelization
    universal_encoder = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
    universal_X_tensor = universal_encoder(X)
    generated_X = np.asarray(universal_X_tensor)   ### The data encoded as semantic sentence embeddings rather than tf-idf weights
    return generated_X

  def fit(self, X, y=None, **fit_params):
    return self

  def get_params(self, deep=True):
    return {}
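
The nested loops in `BaseBernoulliNB` can also be expressed with matrix products. The following is a sketch of the same Bernoulli Naïve Bayes decision rule with additive smoothing, assuming a dense 0/1 NumPy feature matrix. The function names `fit_bernoulli_nb` and `predict_bernoulli_nb` and the toy data are illustrative and not part of the notebook.

```python
import numpy as np


def fit_bernoulli_nb(X, y, smooth=1.0):
    """Estimate log priors and smoothed log likelihoods (illustrative helper)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    log_prior = np.zeros(len(classes))
    theta = np.zeros((len(classes), X.shape[1]))
    for k, c in enumerate(classes):
        rows = X[y == c]
        log_prior[k] = np.log(rows.shape[0] / X.shape[0])
        # P(x_j = 1 | y = c) with additive smoothing over the two outcomes.
        theta[k] = (rows.sum(axis=0) + smooth) / (rows.shape[0] + 2 * smooth)
    return classes, log_prior, np.log(theta), np.log(1 - theta)


def predict_bernoulli_nb(model, X):
    """Return the most probable class label for each row of X."""
    classes, log_prior, log_one, log_zero = model
    X = np.atleast_2d(np.asarray(X, dtype=float))
    # Row i, column k holds: log P(y=k) + sum_j [x_ij log theta_kj + (1 - x_ij) log(1 - theta_kj)]
    log_joint = log_prior + X @ log_one.T + (1 - X) @ log_zero.T
    return classes[np.argmax(log_joint, axis=1)]


# Made-up data: feature 0 is typical of class 0, feature 2 of class 1.
X_toy = np.array([[1, 1, 0],
                  [1, 0, 0],
                  [0, 0, 1],
                  [0, 1, 1]])
y_toy = np.array([0, 0, 1, 1])

model = fit_bernoulli_nb(X_toy, y_toy)
preds = predict_bernoulli_nb(model, X_toy)
single = predict_bernoulli_nb(model, [1, 0, 0])  # a single 1-D example also works
print(preds, single)
```

Working in log space avoids numerical underflow when thousands of per-feature probabilities are multiplied together, and the two matrix products replace the loops over examples and classes.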

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.pipeline import FeatureUnion
from pipelinehelper import PipelineHelper

## Vectorizers
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
# Plus our UniversalEncoderVectorizer


## Classifiers
from sklearn.svm            import LinearSVC
from sklearn.naive_bayes    import CategoricalNB
from sklearn.linear_model   import LogisticRegression
from sklearn.linear_model   import PassiveAggressiveClassifier
from sklearn.linear_model   import Perceptron
from sklearn.linear_model   import RidgeClassifier

## Ensemble Methods
from sklearn.ensemble       import RandomForestClassifier
from sklearn.ensemble       import GradientBoostingClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier


## Selectors
from sklearn.feature_selection import RFE

## Stop Words
from spacy.lang.en.stop_words import STOP_WORDS

tfidf_vect = TfidfVectorizer(stop_words='english',sublinear_tf=True,strip_accents='unicode',ngram_range=(1,1))
unviversal_sentence_encoder = UniversalEncoderVectorizer()

## Note: the 'input' step is designed to turn text into a form compatible with the classifiers (vectors)
# If using the universal sentence encoder, we want to disable the other input options and combine the vectors
pipe = Pipeline([
    ('input',  PipelineHelper([
        # ('tfidf', tfidf_vect),

        # ('count_vect', CountVectorizer()),

        # ('universal_sentence', unviversal_sentence_encoder),

        ('universal_sentence_tfid',  FeatureUnion([
            ('universal_sentence', unviversal_sentence_encoder),
            ('tfidf', tfidf_vect),
        ])),

    ])),
    ('clf', PipelineHelper([
    #   ('svm',           LinearSVC()),
    #   ('naive_bayes_classfier', BaseBernoulliNB()),      
    #   ('rfc',           RandomForestClassifier()),
    #   ('pac',           PassiveAggressiveClassifier()),
    #   ('percep',        Perceptron()),
    #   ('ridge',         RidgeClassifier()),
      # ('gradientboost', GradientBoostingClassifier()),
      # ('x_gradientboost', XGBClassifier()),
      ('catboost', CatBoostClassifier(task_type="GPU",)),

    ])),
])


parameters = {
    'input__selected_model': pipe.named_steps['input'].generate({}),

    ## pipe.named_steps is how to utilize PipelineHelper, which can be used to swap in different
    ## models/transformers/classes that implement fit/transform for piping
    'clf__selected_model': pipe.named_steps['clf'].generate({
        # 'svm__C': [1.0,],
        # 'ridge__alpha': [1,0.5,0.3],
        # 'ridge__normalize': [False,True],
    }),
}

# pipe = Pipeline([
#     ('pre-process', CleanTextTransformer()),
#     ('combo_vectorizer', FeatureUnion([
#         ('universal_sentence', unviversal_sentence_encoder),
#         ('tfidf', tfidf_vect),
#     ])),     
#     ('clf', PipelineHelper([
#       ('svm',           LinearSVC()),
#       ('rfc',           RandomForestClassifier()),
#       ('pac',           PassiveAggressiveClassifier()),
#       ('percep',        Perceptron()),
#       ('ridge',         RidgeClassifier()),
#       # ('gradientboost', GradientBoostingClassifier()),
#     ])),
# ])


# parameters = {
#     'combo_vectorizer__tfidf__ngram_range': ((1,1),(1,2)),
#     'combo_vectorizer__tfidf__stop_words': (STOP_WORDS,'english',None),

#     ## pipe.named_steps is how to utilize PipelineHelper, which can be used to swap in different
#     ## models/transformers/classes that implement fit/transform for piping
#     'clf__selected_model': pipe.named_steps['clf'].generate({
#         'svm__C': [1.0,],
#         'ridge__alpha': [1,0.5,0.3],
#         'ridge__normalize': [False,True],
#     }),
# }

print('################################################################\n###Buckle your seatbelts cause this is gonna be a long one!!! ###\n')
grid_search = GridSearchCV(pipe, parameters,cv=5, n_jobs=1, verbose=3)
grid_search.fit(X,y)

################################################################
###Buckle your seatbelts cause this is gonna be a long one!!! ###

Fitting 5 folds for each of 1 candidates, totalling 5 fits
[CV] clf__selected_model=('catboost', {}), input__selected_model=('universal_sentence_tfid', {}) 


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.


Learning rate set to 0.102216




0:	learn: 1.7698976	total: 374ms	remaining: 6m 13s
1:	learn: 1.5640084	total: 572ms	remaining: 4m 45s
2:	learn: 1.4156539	total: 753ms	remaining: 4m 10s
3:	learn: 1.3046499	total: 938ms	remaining: 3m 53s
4:	learn: 1.2093502	total: 1.14s	remaining: 3m 47s
5:	learn: 1.1288260	total: 1.33s	remaining: 3m 41s
6:	learn: 1.0540714	total: 1.53s	remaining: 3m 37s
7:	learn: 0.9968887	total: 1.74s	remaining: 3m 36s
8:	learn: 0.9523273	total: 1.92s	remaining: 3m 31s
9:	learn: 0.9021853	total: 2.11s	remaining: 3m 29s
10:	learn: 0.8584896	total: 2.31s	remaining: 3m 27s
11:	learn: 0.8193213	total: 2.5s	remaining: 3m 26s
12:	learn: 0.7854707	total: 2.7s	remaining: 3m 25s
13:	learn: 0.7542997	total: 2.88s	remaining: 3m 23s
14:	learn: 0.7268135	total: 3.08s	remaining: 3m 22s
15:	learn: 0.7018701	total: 3.29s	remaining: 3m 22s
16:	learn: 0.6796909	total: 3.45s	remaining: 3m 19s
17:	learn: 0.6579273	total: 3.64s	remaining: 3m 18s
18:	learn: 0.6385789	total: 3.85s	remaining: 3m 18s
19:	learn: 0.6195394	tot

[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:  3.3min remaining:    0.0s


Learning rate set to 0.102216




0:	learn: 1.7756215	total: 349ms	remaining: 5m 49s
1:	learn: 1.5691708	total: 567ms	remaining: 4m 42s
2:	learn: 1.4179851	total: 779ms	remaining: 4m 18s
3:	learn: 1.3010858	total: 976ms	remaining: 4m 3s
4:	learn: 1.2081023	total: 1.19s	remaining: 3m 57s
5:	learn: 1.1217994	total: 1.41s	remaining: 3m 52s
6:	learn: 1.0518711	total: 1.62s	remaining: 3m 49s
7:	learn: 0.9890237	total: 1.84s	remaining: 3m 48s
8:	learn: 0.9416343	total: 2.02s	remaining: 3m 42s
9:	learn: 0.8914499	total: 2.22s	remaining: 3m 40s
10:	learn: 0.8489630	total: 2.44s	remaining: 3m 39s
11:	learn: 0.8083637	total: 2.66s	remaining: 3m 38s
12:	learn: 0.7740037	total: 2.87s	remaining: 3m 38s
13:	learn: 0.7434407	total: 3.08s	remaining: 3m 36s
14:	learn: 0.7138120	total: 3.3s	remaining: 3m 36s
15:	learn: 0.6899342	total: 3.51s	remaining: 3m 36s
16:	learn: 0.6679711	total: 3.7s	remaining: 3m 33s
17:	learn: 0.6484047	total: 3.89s	remaining: 3m 32s
18:	learn: 0.6278118	total: 4.08s	remaining: 3m 30s
19:	learn: 0.6086660	tota

[Parallel(n_jobs=1)]: Done   2 out of   2 | elapsed:  6.7min remaining:    0.0s


Learning rate set to 0.102219




0:	learn: 1.7646779	total: 398ms	remaining: 6m 37s
1:	learn: 1.5596189	total: 616ms	remaining: 5m 7s
2:	learn: 1.4149374	total: 792ms	remaining: 4m 23s
3:	learn: 1.2991231	total: 988ms	remaining: 4m 5s
4:	learn: 1.2034943	total: 1.2s	remaining: 3m 59s
5:	learn: 1.1217758	total: 1.41s	remaining: 3m 53s
6:	learn: 1.0488417	total: 1.62s	remaining: 3m 50s
7:	learn: 0.9873918	total: 1.84s	remaining: 3m 48s
8:	learn: 0.9428375	total: 2.02s	remaining: 3m 42s
9:	learn: 0.8985387	total: 2.23s	remaining: 3m 40s
10:	learn: 0.8602519	total: 2.44s	remaining: 3m 39s
11:	learn: 0.8244404	total: 2.65s	remaining: 3m 37s
12:	learn: 0.7889315	total: 2.86s	remaining: 3m 37s
13:	learn: 0.7560756	total: 3.06s	remaining: 3m 35s
14:	learn: 0.7252225	total: 3.27s	remaining: 3m 34s
15:	learn: 0.6998337	total: 3.45s	remaining: 3m 32s
16:	learn: 0.6760170	total: 3.67s	remaining: 3m 32s
17:	learn: 0.6575019	total: 3.85s	remaining: 3m 29s
18:	learn: 0.6386514	total: 4.04s	remaining: 3m 28s
19:	learn: 0.6194063	tota



0:	learn: 1.7607613	total: 387ms	remaining: 6m 27s
1:	learn: 1.5438201	total: 579ms	remaining: 4m 49s
2:	learn: 1.3979370	total: 762ms	remaining: 4m 13s
3:	learn: 1.2847936	total: 952ms	remaining: 3m 57s
4:	learn: 1.1934603	total: 1.16s	remaining: 3m 51s
5:	learn: 1.1119309	total: 1.38s	remaining: 3m 48s
6:	learn: 1.0424154	total: 1.59s	remaining: 3m 45s
7:	learn: 0.9817988	total: 1.8s	remaining: 3m 43s
8:	learn: 0.9338603	total: 2.02s	remaining: 3m 42s
9:	learn: 0.8894598	total: 2.21s	remaining: 3m 38s
10:	learn: 0.8512110	total: 2.41s	remaining: 3m 36s
11:	learn: 0.8152036	total: 2.61s	remaining: 3m 35s
12:	learn: 0.7798011	total: 2.83s	remaining: 3m 35s
13:	learn: 0.7497513	total: 3.04s	remaining: 3m 34s
14:	learn: 0.7205478	total: 3.25s	remaining: 3m 33s
15:	learn: 0.6924563	total: 3.47s	remaining: 3m 33s
16:	learn: 0.6709426	total: 3.68s	remaining: 3m 32s
17:	learn: 0.6505788	total: 3.87s	remaining: 3m 31s
18:	learn: 0.6315863	total: 4.06s	remaining: 3m 29s
19:	learn: 0.6123530	to



0:	learn: 1.7640950	total: 372ms	remaining: 6m 11s
1:	learn: 1.5581959	total: 597ms	remaining: 4m 58s
2:	learn: 1.4103609	total: 783ms	remaining: 4m 20s
3:	learn: 1.2967485	total: 983ms	remaining: 4m 4s
4:	learn: 1.2077918	total: 1.2s	remaining: 3m 58s
5:	learn: 1.1190852	total: 1.41s	remaining: 3m 54s
6:	learn: 1.0459364	total: 1.63s	remaining: 3m 50s
7:	learn: 0.9851300	total: 1.84s	remaining: 3m 48s
8:	learn: 0.9398086	total: 2.02s	remaining: 3m 42s
9:	learn: 0.8903847	total: 2.24s	remaining: 3m 41s
10:	learn: 0.8496425	total: 2.45s	remaining: 3m 40s
11:	learn: 0.8155100	total: 2.65s	remaining: 3m 37s
12:	learn: 0.7803312	total: 2.86s	remaining: 3m 37s
13:	learn: 0.7501204	total: 3.05s	remaining: 3m 34s
14:	learn: 0.7253617	total: 3.23s	remaining: 3m 31s
15:	learn: 0.7013395	total: 3.45s	remaining: 3m 31s
16:	learn: 0.6770202	total: 3.64s	remaining: 3m 30s
17:	learn: 0.6545267	total: 3.85s	remaining: 3m 30s
18:	learn: 0.6349374	total: 4.05s	remaining: 3m 29s
19:	learn: 0.6154075	tot

[Parallel(n_jobs=1)]: Done   5 out of   5 | elapsed: 16.6min finished


Learning rate set to 0.106978




0:	learn: 1.7369369	total: 415ms	remaining: 6m 55s
1:	learn: 1.5172339	total: 655ms	remaining: 5m 27s
2:	learn: 1.3700407	total: 877ms	remaining: 4m 51s
3:	learn: 1.2573913	total: 1.13s	remaining: 4m 41s
4:	learn: 1.1672279	total: 1.39s	remaining: 4m 37s
5:	learn: 1.0935787	total: 1.59s	remaining: 4m 22s
6:	learn: 1.0299352	total: 1.81s	remaining: 4m 17s
7:	learn: 0.9684060	total: 2.07s	remaining: 4m 16s
8:	learn: 0.9170068	total: 2.32s	remaining: 4m 15s
9:	learn: 0.8703067	total: 2.56s	remaining: 4m 13s
10:	learn: 0.8294543	total: 2.81s	remaining: 4m 12s
11:	learn: 0.7956729	total: 3.03s	remaining: 4m 9s
12:	learn: 0.7607005	total: 3.28s	remaining: 4m 8s
13:	learn: 0.7277613	total: 3.53s	remaining: 4m 8s
14:	learn: 0.6974225	total: 3.78s	remaining: 4m 8s
15:	learn: 0.6716656	total: 4.03s	remaining: 4m 7s
16:	learn: 0.6510329	total: 4.22s	remaining: 4m 4s
17:	learn: 0.6292044	total: 4.47s	remaining: 4m 3s
18:	learn: 0.6087644	total: 4.72s	remaining: 4m 3s
19:	learn: 0.5929147	total: 4.

GridSearchCV(cv=5, error_score=nan,
             estimator=Pipeline(memory=None,
                                steps=[('input',
                                        PipelineHelper(available_models={'universal_sentence_tfid': FeatureUnion(n_jobs=None,
                                                                                                                 transformer_list=[('universal_sentence',
                                                                                                                                    UniversalEncoderVectorizer()),
                                                                                                                                   ('tfidf',
                                                                                                                                    TfidfVectorizer(analyzer='word',
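
The `FeatureUnion` in the pipeline above places the outputs of its transformers side by side, column-wise. The sketch below illustrates this with a small dense transformer standing in for the sentence encoder, so that it runs without downloading a model. `TextLengthFeatures` and the example texts are hypothetical and not part of the notebook.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion


class TextLengthFeatures(BaseEstimator, TransformerMixin):
    """Hypothetical dense transformer: character count and word count per text."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return np.array([[len(t), len(t.split())] for t in X], dtype=float)


texts = [
    "new gpu benchmark results",
    "favourite anime of the season",
    "tabletop rpg campaign ideas",
]

union = FeatureUnion([
    ('dense', TextLengthFeatures()),
    ('tfidf', TfidfVectorizer()),
])
features = union.fit_transform(texts)

# The combined width is the dense width (2) plus the tf-idf vocabulary size.
vocab_size = len(TfidfVectorizer().fit(texts).vocabulary_)
print(features.shape, 2 + vocab_size)
```

Because the tf-idf block is sparse, the combined matrix is returned as a sparse matrix, which matters for classifiers that expect dense input.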
                                                                                                                         

In [None]:
overviewCSV = dataDIR + '/results/overview_xboost.csv'

from IPython.display import display

results_df = pd.DataFrame(grid_search.cv_results_).sort_values(by='rank_test_score')
display(results_df)
results_df.to_csv(overviewCSV,index=False)

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_clf__selected_model,param_input__selected_model,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,154.096808,0.914158,45.536737,0.984431,"(catboost, {})","(universal_sentence_tfid, {})","{'clf__selected_model': ('catboost', {}), 'inp...",0.901597,0.898576,0.898532,0.899827,0.896805,0.899067,0.001589,1
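
The `mean_test_score` and `std_test_score` columns are the mean and the population standard deviation of the five split scores. A quick arithmetic check, using the rounded values printed above (so agreement is only expected to about six decimal places):

```python
import numpy as np

# Split scores copied from the results row above.
split_scores = np.array([0.901597, 0.898576, 0.898532, 0.899827, 0.896805])

mean_score = split_scores.mean()
std_score = split_scores.std()  # population standard deviation (ddof=0)

print(round(mean_score, 6), round(std_score, 6))
```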


In [None]:

bestCSV = dataDIR + '/results/best_perf.csv'

best_est = grid_search.best_estimator_
predictions = best_est.predict(test_df['body'])

best_df = pd.DataFrame(data = zip(test_df['id'],le.inverse_transform(predictions)) ,columns = ['id','subreddit'])

display(best_df)
best_df.to_csv(bestCSV,index=False)

  y = column_or_1d(y, warn=True)


Unnamed: 0,id,subreddit
0,0,datascience
1,1,anime
2,2,rpg
3,3,computers
4,4,computers
...,...,...
2893,2893,datascience
2894,2894,rpg
2895,2895,gamedev
2896,2896,cars


In [None]:
predictions

array([[3],
       [0],
       [7],
       ...,
       [4],
       [1],
       [3]])
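
The array above is a column vector of shape `(n, 1)`, which is most likely what triggers the `column_or_1d` warning when it is passed to `le.inverse_transform`. A small sketch of flattening it before decoding follows. The three-row array is a stand-in for the real predictions, and the class list assumes the alphabetical order that `LabelEncoder` produces for the eight subreddits.

```python
import numpy as np

# Stand-in with the same (n, 1) layout as the predictions shown above.
column_preds = np.array([[3], [0], [7]])

# Flatten to shape (n,) so each prediction is a plain integer code.
flat_preds = np.ravel(column_preds)
print(column_preds.shape, flat_preds.shape)

# Subreddit names in sorted order, matching the integer codes.
classes = np.array(['anime', 'cars', 'computers', 'datascience',
                    'gamedev', 'gamernews', 'hardware', 'rpg'])
names = classes[flat_preds]
print(names)
```

The decoded names agree with the first three rows of the submission table above.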