## Project - Natural Language Processing and Text Classification

This project uses NLP techniques to train models that predict different classes of toxic comments: toxic, severe toxic, obscene, threat, insult, and identity hate. A set of text comments is then fed to the models, which classify each comment into the categories it belongs to. Membership in one output category is <b>independent</b> of membership in any of the other classes, meaning a comment may belong to any number of categories, including none.
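To make the independence of the categories concrete, here is a minimal sketch using a few made-up rows (hypothetical data, not taken from the training set). Each label is its own 0/1 column, so a row may carry zero, one, or several labels.

```python
import pandas as pd

# Label names match the columns used later in this notebook.
labels = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# Hypothetical toy rows, invented for illustration only.
toy = pd.DataFrame(
    [
        [0, 0, 0, 0, 0, 0],  # clean comment: no labels
        [1, 0, 0, 0, 0, 0],  # exactly one label
        [1, 0, 1, 0, 1, 0],  # three labels at once
    ],
    columns=labels,
)

# Number of labels attached to each comment; any value from 0 to 6 is valid.
labels_per_comment = toy[labels].sum(axis=1)
print(labels_per_comment.tolist())
```

Because the labels do not exclude each other, the notebook trains one binary classifier per label rather than a single multi-class model.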


## 1. Import Libraries

In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from nltk.tokenize import word_tokenize
from joblib import dump, load
from sklearn.decomposition import TruncatedSVD
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.metrics import roc_curve, auc, roc_auc_score
from sklearn.ensemble import RandomForestClassifier
from gensim.models import Word2Vec

## 2. Load and preview the train data

In [3]:
train_df = pd.read_csv("train.csv.zip")
train_df = train_df.drop(columns='id')
train_df.insert(0, 'ID', range(1, 1 + len(train_df)))
train_df.head()

| | ID | comment_text | toxic | severe_toxic | obscene | threat | insult | identity_hate |
|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Explanation\nWhy the edits made under my usern... | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 2 | D'aww! He matches this background colour I'm s... | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 3 | Hey man, I'm really not trying to edit war. It... | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 4 | "\nMore\nI can't make any real suggestions on ... | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 5 | You, sir, are my hero. Any chance you remember... | 0 | 0 | 0 | 0 | 0 | 0 |
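The preview shows only clean comments, and the `support` columns in the classification reports later in this notebook show that positive examples are rare for every label. A quick way to see this before training is to take the mean of each 0/1 label column. The sketch below uses a small made-up frame (`toy_df`, a hypothetical stand-in for `train_df`); the same two lines should work on the real frame.

```python
import pandas as pd

label_cols = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# Hypothetical stand-in for train_df, invented for illustration only.
toy_df = pd.DataFrame({
    "comment_text": ["first", "second", "third", "fourth"],
    "toxic":         [1, 1, 0, 0],
    "severe_toxic":  [0, 0, 0, 0],
    "obscene":       [1, 0, 0, 0],
    "threat":        [0, 0, 0, 0],
    "insult":        [0, 1, 0, 0],
    "identity_hate": [0, 0, 0, 0],
})

# The mean of a 0/1 column is the fraction of rows carrying that label.
prevalence = toy_df[label_cols].mean().sort_values(ascending=False)
print(prevalence)
```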


## 3. Training the models

### Toxic Model

In [3]:
# TF-IDF features restricted to the 500 most frequent terms
vec_tf = TfidfVectorizer(max_features=500)

# Target label and raw comment text
y = train_df["toxic"]
x = train_df["comment_text"]

# Default split: 75% train, 25% test
x_train, x_test, y_train, y_test = train_test_split(x, y)

# Vectorize the text, then fit a support vector classifier
pipe = Pipeline([
    ("vect", vec_tf),
    ("model", SVC())
])

pipe.fit(x_train, y_train)
preds = pipe.predict(x_test)

print(classification_report(y_test, preds))

              precision    recall  f1-score   support

           0       0.94      0.99      0.97     36065
           1       0.88      0.41      0.56      3828

    accuracy                           0.94     39893
   macro avg       0.91      0.70      0.76     39893
weighted avg       0.93      0.94      0.93     39893
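`GridSearchCV` is imported at the top of the notebook but never called, and the recall of 0.41 on the toxic class suggests there may be room for tuning. The following is a minimal sketch of how a grid search over the pipeline could look. The toy corpus, the name `tuned_pipe`, the grid values, and the choice of F1 as the scoring metric are all assumptions made for illustration; none of them come from the notebook, and the effect on the real data is untested.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical toy corpus standing in for train_df["comment_text"] and train_df["toxic"].
texts = [
    "you are a wonderful editor", "thanks for the helpful edit",
    "great article, well sourced", "I appreciate the quick reply",
    "you are a stupid idiot", "shut up you moron",
    "what an idiot, go away", "stupid moron, nobody likes you",
]
targets = [0, 0, 0, 0, 1, 1, 1, 1]

tuned_pipe = Pipeline([("vect", TfidfVectorizer()), ("model", SVC())])

# Keys follow the "<step name>__<parameter>" convention of Pipeline.
# The candidate values are illustrative guesses, not recommendations.
param_grid = {
    "vect__max_features": [50, 500],
    "model__C": [0.1, 1.0],
}

# cv=2 only because the toy corpus is tiny; a real run would use more folds.
search = GridSearchCV(tuned_pipe, param_grid, scoring="f1", cv=2)
search.fit(texts, targets)
print(search.best_params_)
```

After fitting, `search` behaves like the best pipeline found, so `search.predict(...)` can be used in place of `pipe.predict(...)`.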



### Severe Toxic Model

In [10]:
# TF-IDF features restricted to the 1000 most frequent terms
vec_tf = TfidfVectorizer(max_features=1000)

# Target label and raw comment text
y1 = train_df["severe_toxic"]
x1 = train_df["comment_text"]

# Default split: 75% train, 25% test
x1_train, x1_test, y1_train, y1_test = train_test_split(x1, y1)

# Vectorize the text, then fit a support vector classifier
pipe1 = Pipeline([
    ("vect", vec_tf),
    ("model", SVC())
])

pipe1.fit(x1_train, y1_train)
preds1 = pipe1.predict(x1_test)

print(classification_report(y1_test, preds1))


              precision    recall  f1-score   support

           0       0.99      1.00      0.99     39477
           1       0.47      0.08      0.14       416

    accuracy                           0.99     39893
   macro avg       0.73      0.54      0.57     39893
weighted avg       0.99      0.99      0.99     39893
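The report above shows a recall of only 0.08 on the positive class, with 416 positive examples out of 39,893 test rows. Two common mitigations for this kind of imbalance are a stratified split and class weighting. The sketch below shows both on made-up data; `weighted_pipe` and the toy texts are hypothetical, and whether this improves recall on the real data has not been checked here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical imbalanced toy data: 16 negatives and 4 positives.
texts = [f"polite comment number {i}" for i in range(16)] + [
    f"vile hateful abuse {i}" for i in range(4)
]
targets = [0] * 16 + [1] * 4

# stratify keeps the positive rate the same in both splits;
# random_state makes the split reproducible.
x_tr, x_te, y_tr, y_te = train_test_split(
    texts, targets, test_size=0.5, stratify=targets, random_state=0
)

weighted_pipe = Pipeline([
    ("vect", TfidfVectorizer(max_features=1000)),
    # class_weight="balanced" weights errors inversely to class frequency.
    ("model", SVC(class_weight="balanced")),
])
weighted_pipe.fit(x_tr, y_tr)

# Positives in the train and test halves.
print(sum(y_tr), sum(y_te))
```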



### Obscene Model

In [12]:
# TF-IDF features restricted to the 1000 most frequent terms
vec_tf = TfidfVectorizer(max_features=1000)

# Target label and raw comment text
y2 = train_df["obscene"]
x2 = train_df["comment_text"]

# Default split: 75% train, 25% test
x2_train, x2_test, y2_train, y2_test = train_test_split(x2, y2)

# Vectorize the text, then fit a support vector classifier
pipe2 = Pipeline([
    ("vect", vec_tf),
    ("model", SVC())
])

pipe2.fit(x2_train, y2_train)
preds2 = pipe2.predict(x2_test)

print(classification_report(y2_test, preds2))


              precision    recall  f1-score   support

           0       0.98      1.00      0.99     37756
           1       0.90      0.63      0.74      2137

    accuracy                           0.98     39893
   macro avg       0.94      0.81      0.86     39893
weighted avg       0.98      0.98      0.97     39893



### Threat Model

In [8]:
# TF-IDF features restricted to the 1000 most frequent terms
vec_tf = TfidfVectorizer(max_features=1000)

# Target label and raw comment text
y3 = train_df["threat"]
x3 = train_df["comment_text"]

# Default split: 75% train, 25% test
x3_train, x3_test, y3_train, y3_test = train_test_split(x3, y3)

# Vectorize the text, then fit a support vector classifier
pipe3 = Pipeline([
    ("vect", vec_tf),
    ("model", SVC())
])

pipe3.fit(x3_train, y3_train)
preds3 = pipe3.predict(x3_test)

print(classification_report(y3_test, preds3))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     39772
           1       0.94      0.14      0.24       121

    accuracy                           1.00     39893
   macro avg       0.97      0.57      0.62     39893
weighted avg       1.00      1.00      1.00     39893



### Insult Model

In [4]:
# Unlike the other models, this one uses raw term counts
# restricted to the 150 most frequent terms
vec_cv = CountVectorizer(max_features=150)

# Target label and raw comment text
y4 = train_df["insult"]
x4 = train_df["comment_text"]

# Default split: 75% train, 25% test
x4_train, x4_test, y4_train, y4_test = train_test_split(x4, y4)

# Vectorize the text, then fit a support vector classifier
pipe4 = Pipeline([
    ("vect", vec_cv),
    ("model", SVC())
])

pipe4.fit(x4_train, y4_train)
preds4 = pipe4.predict(x4_test)

print(classification_report(y4_test, preds4))

              precision    recall  f1-score   support

           0       0.96      1.00      0.98     37961
           1       0.75      0.23      0.35      1932

    accuracy                           0.96     39893
   macro avg       0.85      0.61      0.67     39893
weighted avg       0.95      0.96      0.95     39893



### Identity Hate Model

In [6]:
# TF-IDF features restricted to the 1000 most frequent terms
vec_tf = TfidfVectorizer(max_features=1000)

# Target label and raw comment text
y5 = train_df["identity_hate"]
x5 = train_df["comment_text"]

# Default split: 75% train, 25% test
x5_train, x5_test, y5_train, y5_test = train_test_split(x5, y5)

# Vectorize the text, then fit a support vector classifier
pipe5 = Pipeline([
    ("vect", vec_tf),
    ("model", SVC())
])

pipe5.fit(x5_train, y5_train)
preds5 = pipe5.predict(x5_test)

print(classification_report(y5_test, preds5))

              precision    recall  f1-score   support

           0       0.99      1.00      1.00     39529
           1       0.80      0.15      0.25       364

    accuracy                           0.99     39893
   macro avg       0.90      0.57      0.62     39893
weighted avg       0.99      0.99      0.99     39893
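The six training cells above repeat the same steps with a different target column each time. One possible way to express this more compactly is a loop that stores one fitted pipeline per label in a dictionary. This is a sketch under assumptions: `toy_df` is a made-up stand-in for `train_df`, the `models` dictionary is a hypothetical name, and a single TF-IDF setting is used for every label, so it would not reproduce the toxic model (500 features) or the insult model (`CountVectorizer`) exactly. The train/test split and report are omitted for brevity and would go inside the loop.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

label_cols = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# Hypothetical toy frame standing in for train_df.
# Every label column contains both classes so that SVC can be fitted.
toy_df = pd.DataFrame({
    "comment_text": [
        "thanks for the edit", "nice work on the article", "you idiot",
        "i will hurt you", "filthy disgusting fool", "i hate your kind",
    ],
    "toxic":         [0, 0, 1, 1, 1, 1],
    "severe_toxic":  [0, 0, 0, 1, 0, 0],
    "obscene":       [0, 0, 0, 0, 1, 0],
    "threat":        [0, 0, 0, 1, 0, 0],
    "insult":        [0, 0, 1, 0, 1, 0],
    "identity_hate": [0, 0, 0, 0, 0, 1],
})

models = {}
for label in label_cols:
    # A fresh pipeline per label so the fitted models stay independent.
    model = Pipeline([
        ("vect", TfidfVectorizer(max_features=1000)),
        ("model", SVC()),
    ])
    model.fit(toy_df["comment_text"], toy_df[label])
    models[label] = model

print(list(models))
```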



## 4. Load and preview the test data

In [8]:
test_df = pd.read_csv("test.csv")
test_df.head()

| | id | comment_text |
|---|---|---|
| 0 | 1 | Yo bitch Ja Rule is more succesful then you'll... |
| 1 | 2 | == From RfC == \n\n The title is fine as it is... |
| 2 | 3 | " \n\n == Sources == \n\n * Zawe Ashton on Lap... |
| 3 | 4 | :If you have a look back at the source, the in... |
| 4 | 5 | I don't anonymously edit articles at all. |


## Save the trained models, reload, and predict

### Toxic

In [10]:
dump(pipe, "toxic.joblib")
premade_model = load("toxic.joblib")
toxic_preds = premade_model.predict(test_df["comment_text"])
toxic_preds

array([1, 0, 0, ..., 0, 0, 0], dtype=int64)
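Because the whole `Pipeline` is dumped, the saved file contains both the fitted vectorizer and the classifier, which is why the reloaded model can be given raw text directly. The sketch below checks that a round trip through `dump` and `load` leaves predictions unchanged. It uses a made-up toy pipeline (`toy_pipe`) and a temporary directory, both assumptions for illustration. Note that joblib files are generally only safe to reload with the same scikit-learn version that wrote them.

```python
import os
import tempfile

import numpy as np
from joblib import dump, load
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical toy data and pipeline standing in for `pipe`.
texts = ["thank you for the help", "lovely article", "you stupid idiot", "idiot go away"]
targets = [0, 0, 1, 1]
toy_pipe = Pipeline([("vect", TfidfVectorizer()), ("model", SVC())])
toy_pipe.fit(texts, targets)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.joblib")
    dump(toy_pipe, path)    # serializes the vectorizer and classifier together
    reloaded = load(path)   # restores the fitted pipeline from disk

# The reloaded pipeline should give the same predictions as the original.
new_texts = ["you idiot", "thank you"]
same = np.array_equal(toy_pipe.predict(new_texts), reloaded.predict(new_texts))
print(same)
```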

### Severe Toxic

In [11]:
dump(pipe1, "severe_toxic.joblib")
premade_model1 = load("severe_toxic.joblib")
severetoxic_preds = premade_model1.predict(test_df["comment_text"])
severetoxic_preds

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

### Obscene

In [12]:
dump(pipe2, "obscene.joblib")
premade_model2 = load("obscene.joblib")
obscene_preds = premade_model2.predict(test_df["comment_text"])
obscene_preds

array([1, 0, 0, ..., 0, 0, 0], dtype=int64)

### Threat

In [13]:
dump(pipe3, "threat.joblib")
premade_model3 = load("threat.joblib")
threat_preds = premade_model3.predict(test_df["comment_text"])
threat_preds

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

### Insult

In [6]:
# Save the insult model
dump(pipe4, "insult.joblib")

['insult.joblib']

In [9]:
# Load the saved model and predict
premade_model4 = load("insult.joblib")
insult_preds = premade_model4.predict(test_df["comment_text"])
insult_preds

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

### Identity Hate

In [14]:
dump(pipe5, "identity_hate.joblib")
premade_model5 = load("identity_hate.joblib")
identityhate_preds = premade_model5.predict(test_df["comment_text"])
identityhate_preds

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

## Output Details
Predictions are written to a CSV file with one row per test comment and one column per class. A 1 in a column indicates that the comment in that row belongs to that class.


In [15]:
# Convert the index of the data frame to a NumPy array to use as row ids
# (named row_ids to avoid shadowing the built-in id())
row_ids = test_df.index.to_numpy()

In [16]:
# Assemble the per-class predictions into one data frame,
# with one row per test comment and one column per class
columns = ["id", "toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
sample_out = pd.DataFrame(
    list(zip(row_ids, toxic_preds, severetoxic_preds, obscene_preds, threat_preds, insult_preds, identityhate_preds)),
    columns=columns)
sample_out.head()

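| | id | toxic | severe_toxic | obscene | threat | insult | identity_hate |
|---|---|---|---|---|---|---|---|
| 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 |
| 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 2 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 3 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 4 | 0 | 0 | 0 | 0 | 0 | 0 |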


In [17]:
#Write DF to CSV. Please keep the "out.csv" filename. Moodle will auto-preface it with an identifier when I download it. 
#This command should work with your dataframe of predictions. 
sample_out.to_csv('out.csv', index=False)  
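Before submitting, it can be worth reading the CSV back and confirming that it has the expected columns and only 0/1 values. The sketch below does this against an in-memory buffer with a small made-up frame (`toy_out`, hypothetical); for the real file, `pd.read_csv('out.csv')` would take the place of the buffer.

```python
import io

import pandas as pd

expected_cols = ["id", "toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# Hypothetical two-row output frame standing in for sample_out.
toy_out = pd.DataFrame(
    [[0, 1, 0, 1, 0, 0, 0],
     [1, 0, 0, 0, 0, 0, 0]],
    columns=expected_cols,
)

# Write to an in-memory buffer instead of disk, then read it back.
buffer = io.StringIO()
toy_out.to_csv(buffer, index=False)
buffer.seek(0)
check = pd.read_csv(buffer)

# Column names and order must survive the round trip.
columns_ok = list(check.columns) == expected_cols
# Every prediction column should contain only 0 or 1.
is_binary = bool(check[expected_cols[1:]].isin([0, 1]).all().all())
print(columns_ok, is_binary)
```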