# Scott Breitbach
## 10-May-2021
## DSC550, Week 9

# 9.3 Exercise: Neural Network Classifiers

## Step 1. Neural Network Classifier with Scikit

Using the multi-label classifier dataset from earlier exercises (`categorized-comments.jsonl` in the reddit folder), fit a neural network classifier with scikit-learn. Use the code in Chapter 12 of *Applied Text Analysis with Python* as a guideline. Report the accuracy, precision, recall, F1-score, and confusion matrix.

## Load Data Set

In [1]:
# Load libraries
import numpy as np
import jsonlines
import pandas as pd

# Set random seed
np.random.seed(42)


In [2]:
# Load JSON data into a list of dictionaries
data = []
with jsonlines.open('categorized-comments.jsonl') as reader:
    for obj in reader.iter(type=dict, skip_invalid=True):
        data.append(obj)

In [3]:
# Convert data to DataFrame
cat_comments_df = pd.DataFrame(data)
cat_comments_df.head()

|   | cat | txt |
|---|-----|-----|
| 0 | sports | Barely better than Gabbert? He was significant... |
| 1 | sports | Fuck the ducks and the Angels! But welcome to ... |
| 2 | sports | Should have drafted more WRs.\n\n- Matt Millen... |
| 3 | sports | [Done](https://i.imgur.com/2YZ90pm.jpg) |
| 4 | sports | No!! NOO!!!!! |


## Preprocess Text

In [4]:
# Load libraries
import sys
import unicodedata
import re

from nltk.corpus import stopwords
from collections import Counter
from nltk.stem.porter import PorterStemmer

In [5]:
# Create a copy of the data set to manipulate
df = cat_comments_df.copy()

In [6]:
# Map every Unicode punctuation code point to None so str.translate() deletes it
punctuation_dict = dict.fromkeys(i for i in range(sys.maxunicode + 1)
                            if unicodedata.category(chr(i)).startswith('P'))
# Store the NLTK English stopwords in a Counter for fast membership tests
stop_words = stopwords.words('english')
stopwords_dict = Counter(stop_words)

def cleanText(string):
    '''Lowercase the text, strip URLs and punctuation, and return the remaining non-stopword tokens as a list'''
    
    # Convert to lowercase
    string = string.lower()
    
    # Remove URLs
    string = re.sub(r'http\S+', '', string)
    
    # Remove punctuation
    string = string.translate(punctuation_dict)
    
    # Remove newlines
    string = string.replace("\n", " ")
    
    # Remove stopwords
    string = [word for word in string.split() if word not in stopwords_dict]
    
    return string
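
To make the cleaning steps concrete, here is a minimal, self-contained sketch that applies the same sequence (lowercase, strip URLs, strip punctuation, drop stopwords) to one sample comment. The name `clean_text_demo`, the tiny `demo_stopwords` set, and the example URL are illustrative assumptions; the notebook itself uses NLTK's English stopword list.

```python
import re
import sys
import unicodedata

# Map every Unicode punctuation code point to None so str.translate drops it
punctuation_table = dict.fromkeys(
    i for i in range(sys.maxunicode + 1)
    if unicodedata.category(chr(i)).startswith('P')
)

# Hypothetical stand-in for NLTK's English stopword list
demo_stopwords = {'the', 'to', 'and', 'but', 'have', 'should', 'more'}

def clean_text_demo(text):
    """Lowercase, strip URLs and punctuation, then drop stopwords."""
    text = text.lower()
    text = re.sub(r'http\S+', '', text)       # remove URLs
    text = text.translate(punctuation_table)  # remove punctuation
    # split() also discards newlines and repeated whitespace
    return [word for word in text.split() if word not in demo_stopwords]

sample = "Should have drafted more WRs.\n\n- Matt Millen https://example.com/x"
tokens = clean_text_demo(sample)
print(tokens)  # ['drafted', 'wrs', 'matt', 'millen']
```

One thing this makes visible: because punctuation is removed before the stopword filter runs, contractions such as "don't" become "dont" and no longer match an apostrophe-bearing entry in a stopword list.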

In [7]:
# Clean up the text in the 'txt' column
df.txt = df.txt.apply(cleanText)

In [None]:
%%time
# Apply PorterStemmer
porter = PorterStemmer()
df['txt_stems'] = df.txt.apply(lambda words: [porter.stem(word) for word in words])

In [None]:
%%time
# Join tokenized stem words into a string
df['txt_str'] = df.txt_stems.apply(lambda s: ' '.join(map(str, s)))

In [None]:
# Take a look at data set
df.head()

## Sample Data Set Into Equal-Sized Groups

In [None]:
# Group data by category
cat_group = df.groupby('cat', as_index=False, group_keys=False)

In [None]:
# Sample 25000 rows from each category
balancedDF = cat_group.apply(lambda s: s.sample(25000, replace=False))

In [None]:
# Verify counts of categories
balancedDF.cat.value_counts()
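
As a possible alternative to `groupby(...).apply(lambda ...)`, pandas 1.1 and later provide `DataFrameGroupBy.sample`, which draws a fixed number of rows per group directly. The sketch below runs on a small made-up frame (`toy`, with hypothetical categories `a`, `b`, `c`) rather than the Reddit data, and passes `random_state` explicitly as an assumption about how one might pin the draw.

```python
import pandas as pd

# Toy frame with unequal group sizes (5, 3, and 4 rows)
toy = pd.DataFrame({
    'cat': ['a'] * 5 + ['b'] * 3 + ['c'] * 4,
    'txt': [f'comment {i}' for i in range(12)],
})

# Draw 3 rows per category without replacement; random_state fixes the draw
balanced = toy.groupby('cat').sample(n=3, replace=False, random_state=42)

print(balanced['cat'].value_counts().to_dict())
```

Every category ends up with exactly three rows, and the original index labels are preserved, so the sampled rows can still be traced back to the source frame.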

# Prepare Text for Model-Building

In [None]:
# Load libraries
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

## Convert Feature Data to a Word-Count Vector

In [None]:
# Join each tokenized list of stems into a single space-separated string
text_data = [' '.join(text) for text in balancedDF.txt_stems]

In [None]:
# Word-count vector as a sparse matrix
count = CountVectorizer(max_features=5000)
bal_sparseWCV = count.fit_transform(text_data)
bal_sparseWCV
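
For intuition about what `max_features` does, the following sketch fits a `CountVectorizer` on three tiny made-up documents. The documents and the limit of 3 are illustrative assumptions; the notebook uses `max_features=5000` on the stemmed comments. The vectorizer keeps the terms with the highest total count across the corpus and orders the columns alphabetically.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Corpus-wide counts: game=4, team=3, chip=2, play=1, data=1
docs = [
    "game game team chip",
    "game team chip play",
    "game team data",
]

vectorizer = CountVectorizer(max_features=3)
counts = vectorizer.fit_transform(docs)  # sparse matrix, one row per document

vocab = sorted(vectorizer.vocabulary_)   # column order is alphabetical
print(vocab)
print(counts.toarray())
```

Here `play` and `data` are dropped because they are the least frequent terms, leaving the columns `chip`, `game`, and `team`.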

## Split Training and Testing Data

In [None]:
# Set up data and labels
X = bal_sparseWCV
y = balancedDF.cat

In [None]:
# Split into training and test sets (default test_size=0.25)
X_train, X_test, y_train, y_test = train_test_split(X, y)
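
The split above relies on the defaults, so the class counts in the test set vary slightly from run to run. If exactly balanced test classes and a repeatable split are wanted, one option (not what this notebook did) is to pass `stratify` and `random_state`. The sketch below uses made-up features and labels purely for illustration.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy data: 8 samples for each of the three categories
labels = (['sports'] * 8 + ['video_games'] * 8
          + ['science_and_technology'] * 8)
features = [[i] for i in range(len(labels))]

# stratify keeps class proportions equal in both splits;
# random_state makes the split repeatable
X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.25, stratify=labels, random_state=42
)

print(Counter(y_te))
```

With 24 samples and `test_size=0.25`, the test set holds 6 samples, 2 from each class.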

# Modeling

In [None]:
# Load libraries
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import GridSearchCV

In [None]:
%%time
# Base model. Also tried: early_stopping=True; max_iter=200, 300
mlp_gs = MLPClassifier(max_iter=100, verbose=True)
parameter_space = {
    # Also tried: (10, 10, 10), (20, 20, 20), (40,), (50, 30), (500, 150)
    'hidden_layer_sizes': [(30,), (100,)],
    'activation': ['relu', 'tanh'],
    'solver': ['adam', 'sgd'],
    # NOTE: the saved outputs below report alpha=0.0005, a value not in this list
    'alpha': [0.0001, 0.05],
    'learning_rate': ['constant', 'adaptive'],  # only used when solver is 'sgd'
}
# 5-fold cross-validated grid search using all available cores
grid = GridSearchCV(mlp_gs, parameter_space, verbose=2, n_jobs=-1, cv=5)
grid_result = grid.fit(X_train, y_train)
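
The grid above has 2 × 2 × 2 × 2 × 2 = 32 parameter combinations, and with `cv=5` that means 160 model fits plus one final refit on the full training set. The sketch below shows the same mechanics on a much smaller scale, along with two attributes that are useful to inspect afterwards: `best_score_` (the mean cross-validated accuracy of the winning combination) and `cv_results_`. The synthetic data from `make_classification` and the reduced two-parameter grid are assumptions for illustration only.

```python
import warnings

from sklearn.datasets import make_classification
from sklearn.exceptions import ConvergenceWarning
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Small synthetic three-class problem (stand-in for the word-count matrix)
X_demo, y_demo = make_classification(
    n_samples=120, n_features=10, n_informative=5,
    n_classes=3, random_state=42,
)

demo_grid = GridSearchCV(
    MLPClassifier(max_iter=50, random_state=42),
    {'hidden_layer_sizes': [(5,), (10,)], 'alpha': [0.0001, 0.05]},
    cv=3,
)

# A low max_iter usually triggers convergence warnings; silence them here
with warnings.catch_warnings():
    warnings.simplefilter('ignore', category=ConvergenceWarning)
    demo_grid.fit(X_demo, y_demo)

print(demo_grid.best_params_)
print(round(demo_grid.best_score_, 3))         # mean CV accuracy of the best combo
print(len(demo_grid.cv_results_['params']))    # number of combinations tried
```

Because `best_score_` comes from cross-validation on the training data, it is a separate number from the test-set accuracy reported by `classification_report`.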

In [None]:
print('Best parameters found:\n', grid_result.best_params_)

## Evaluation

In [None]:
from sklearn.metrics import classification_report

# predict() uses the best estimator, refit on the full training set
y_true, y_pred = y_test, grid_result.predict(X_test)
print('Results on the test set:')
print(classification_report(y_true, y_pred))
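
The exercise also asks for a confusion matrix, which `classification_report` does not include. A minimal sketch of how it could be computed with `sklearn.metrics.confusion_matrix` follows. The short label lists here are made up so the example runs on its own; in the notebook, `y_true` and `y_pred` from the cell above would take their place.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

class_names = ['science_and_technology', 'sports', 'video_games']

# Made-up true and predicted labels for illustration
y_true_demo = ['sports', 'sports', 'video_games', 'video_games',
               'science_and_technology', 'science_and_technology']
y_pred_demo = ['sports', 'video_games', 'video_games', 'video_games',
               'science_and_technology', 'sports']

acc = accuracy_score(y_true_demo, y_pred_demo)

# Rows are true classes, columns are predicted classes, in class_names order
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=class_names)

print(f'Accuracy: {acc:.2f}')
print(cm)
```

Passing `labels` explicitly fixes the row and column order, which makes the matrix easier to read alongside the per-class rows of the classification report.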

In [48]:
print('Best parameters found:\n', grid_result.best_params_)

Best parameters found:
 {'activation': 'relu', 'alpha': 0.0005, 'hidden_layer_sizes': (100,), 'learning_rate': 'constant', 'solver': 'adam'}


## Evaluation

In [49]:
from sklearn.metrics import classification_report

y_true, y_pred = y_test, grid_result.predict(X_test)
print('Results on the test set:')
print(classification_report(y_true, y_pred))

Results on the test set:
                        precision    recall  f1-score   support

science_and_technology       0.74      0.78      0.76      6275
                sports       0.70      0.72      0.71      6223
           video_games       0.69      0.63      0.66      6252

              accuracy                           0.71     18750
             macro avg       0.71      0.71      0.71     18750
          weighted avg       0.71      0.71      0.71     18750



## Run No. 1, using `early_stopping=True`

In [42]:
print('Best parameters found:\n', grid_result.best_params_)

Best parameters found:
 {'activation': 'relu', 'alpha': 0.0005, 'hidden_layer_sizes': (30,), 'learning_rate': 'constant', 'solver': 'adam'}


## Evaluation

In [45]:
from sklearn.metrics import classification_report

y_true, y_pred = y_test, grid_result.predict(X_test)
print('Results on the test set:')
print(classification_report(y_true, y_pred))

Results on the test set:
                        precision    recall  f1-score   support

science_and_technology       0.84      0.81      0.82      6275
                sports       0.74      0.81      0.77      6223
           video_games       0.76      0.71      0.74      6252

              accuracy                           0.78     18750
             macro avg       0.78      0.78      0.78     18750
          weighted avg       0.78      0.78      0.78     18750



In [None]:
# Optional, Windows only (requires pywin32): spoken notification when a long cell finishes
from win32com.client import Dispatch
speak = Dispatch("SAPI.SpVoice").Speak

In [None]:
speak("modeling complete")

In [None]:
grid_result