In [91]:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier

## Project 1 - NLP and Text Classification

For this project you will need to classify some angry comments into their respective categories of anger. The process that you'll need to follow is (roughly):
<ol>
<li> Use NLP techniques to process the training data. 
<li> Train model(s) to predict which class(es) each comment is in.
    <ul>
    <li> A comment can belong to any number of classes, including none. 
    </ul>
<li> Generate predictions for each of the comments in the test data. 
<li> Write your test data predictions to a CSV file, which will be scored. 
</ol>

You can use any models and NLP libraries you'd like. 

## What I Got

My approach ended up being fairly simple; I didn't see much of an increase in accuracy from making the model more complex. Trials on the test data generally gave me around 94% to 96% accuracy overall. 

<ul>
<li> I used the TF-IDF vectorizer. I didn't get much of an increase (maybe ~1%) using Word2Vec on the toxic label, so I dropped it. 
<li> I used Naive Bayes (MultinomialNB) as my classifier. This was more of a concession to practicality, though the accuracy is good. NB is fast, and I ran this many, many, many times, so it made a lot of sense in this case. 
<li> I tried oversampling since the dataset is so imbalanced, but scores generally went down fractionally when I did so. 
<li> Training the models on the full dataset after model selection improved accuracy a point or two. 
</ul>

## Training Data

Use the training data to train your prediction model(s). Each of the classification output columns (toxic to the end) is a human label for the comment_text, assessing if it falls into that category of "rude". A comment may fall into any number of categories, or none at all. Membership in one output category is <b>independent</b> of membership in any of the other classes (think about this when you plan on how to make these predictions - it may also make it easier to split work amongst a team...). 
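
Because each label is treated independently, the task reduces to six separate binary problems. A minimal sketch of that idea, fitting one TF-IDF + Naive Bayes pipeline per label in a loop. The `toy_df` rows and the `models` dictionary are made up for illustration and are not part of the project data:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

# Hypothetical stand-in for train_df with the same column layout.
toy_df = pd.DataFrame({
    "comment_text": [
        "you are an idiot", "thanks for the helpful edit",
        "i will hurt you", "please see the talk page",
        "stupid idiot go away", "nice work on the article",
    ],
    "toxic":         [1, 0, 1, 0, 1, 0],
    "severe_toxic":  [0, 0, 1, 0, 0, 0],
    "obscene":       [1, 0, 0, 0, 1, 0],
    "threat":        [0, 0, 1, 0, 0, 0],
    "insult":        [1, 0, 0, 0, 1, 0],
    "identity_hate": [0, 0, 0, 0, 1, 0],
})

# One independent binary classifier per label.
models = {}
for label in LABELS:
    pipe = Pipeline([("vect", TfidfVectorizer()), ("model", MultinomialNB())])
    pipe.fit(toy_df["comment_text"], toy_df[label])
    models[label] = pipe

# Predict every label for every comment and collect the columns side by side.
preds = pd.DataFrame(
    {label: models[label].predict(toy_df["comment_text"]) for label in LABELS}
)
print(preds)
```

Since the fits do not depend on each other, each label could also be handed to a different team member and the resulting columns joined at the end.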

In [92]:
train_df = pd.read_csv("train.csv.zip")
train_df.head()

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,0000997932d777bf,Explanation\nWhy the edits made under my usern...,0,0,0,0,0,0
1,000103f0d9cfb60f,D'aww! He matches this background colour I'm s...,0,0,0,0,0,0
2,000113f07ec002fd,"Hey man, I'm really not trying to edit war. It...",0,0,0,0,0,0
3,0001b41b1c6bb37e,"""\nMore\nI can't make any real suggestions on ...",0,0,0,0,0,0
4,0001d958c54c6e35,"You, sir, are my hero. Any chance you remember...",0,0,0,0,0,0


#### I didn't end up using the none column. 

In [93]:
#Construct a "None" column for comments that are not in any category
train_df["none"] = ((train_df.toxic | train_df.severe_toxic | train_df.obscene | train_df.threat | train_df.insult | train_df.identity_hate) == 0)
train_df["none"] = train_df["none"].astype(int)
train_df.sample(10)

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate,none
59178,9e865aaf9b98dac0,""":::::::Sarbajit, there is another book by the...",0,0,0,0,0,0,1
45024,785deba849f6021c,| class=B | importance=Top,0,0,0,0,0,0,1
120210,82ebd3e08cc7a2ae,Trusilver....you have gone too far and I inten...,0,0,0,0,0,0,1
24648,412d5838e658948a,"""\n\n Good reference from German side of Derfl...",0,0,0,0,0,0,1
148325,4a8d6552b00df960,"Yeah, it is true: it is correct to write Movim...",0,0,0,0,0,0,1
151405,7c856efda7a634d9,The added content is speculative and your expl...,0,0,0,0,0,0,1
157120,d8e18502e65a8d30,"""Hi again MjolnirPants. Sorry again, but keep...",0,0,0,0,0,0,1
6220,109ba83f2f3edb6b,"""\n\nKosovo is de jure a province of Serbia. ...",0,0,0,0,0,0,1
155268,bb59e78acbca5e66,"In which case, rather than assuming anything, ...",0,0,0,0,0,0,1
4276,0b69af59048dde56,Merge \n\nIt has been suggested that the artic...,0,0,0,0,0,0,1


In [94]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

In [95]:
#Text preprocess
nltk.download("stopwords")
#Build the stopword set and the stemmer once, outside the loop
stop_words = set(stopwords.words("english"))
ps = PorterStemmer()
corpus = []
#Clean the first 1,000 comments as a sample
for i in range(1000):
    #Replace everything except letters with spaces (drops punctuation and digits)
    review = re.sub("[^a-zA-Z]", " ", train_df["comment_text"][i])
    #Normalize to lower case
    review = review.lower()
    #Split into words
    review = review.split()
    #Remove stopwords and stem the remaining words
    review = [ps.stem(word) for word in review if word not in stop_words]
    review = " ".join(review)
    corpus.append(review)
#At this point the corpus is clean text - free of basic "junk". It is its own list at the moment

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/akeems/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
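
The models later in this notebook vectorize the raw comment_text rather than this cleaned corpus. One way the two could be connected, sketched here as an assumption rather than what was actually run, is to pass the same cleaning steps to TfidfVectorizer as a custom analyzer. The `clean_tokens` helper is hypothetical, and scikit-learn's built-in English stop list is swapped in for the NLTK one so the sketch needs no download:

```python
import re

from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer

stemmer = PorterStemmer()

def clean_tokens(text):
    """Keep letters only, lower-case, drop stop words, stem what is left."""
    words = re.sub("[^a-zA-Z]", " ", text).lower().split()
    return [stemmer.stem(word) for word in words if word not in ENGLISH_STOP_WORDS]

# A callable analyzer receives each raw document and returns its tokens.
vectorizer = TfidfVectorizer(analyzer=clean_tokens)
X_demo = vectorizer.fit_transform(["The cats are running!", "A cat ran home."])
print(sorted(vectorizer.vocabulary_))
```

A vectorizer built this way could replace the plain `TfidfVectorizer()` in a pipeline, at the cost of slower fitting because every word is stemmed.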


In [96]:
#Make X dataset for modelling
#cv = CountVectorizer(max_features = 1500)
#X = cv.fit_transform(corpus).toarray()

#Make y datasets for modelling
y_t_1 = np.array(train_df["toxic"]).reshape(-1,1)
y_st_2 = np.array(train_df["severe_toxic"]).reshape(-1,1)
y_o_3 = np.array(train_df["obscene"]).reshape(-1,1)
y_th_4 = np.array(train_df["threat"]).reshape(-1,1)
y_i_5 = np.array(train_df["insult"]).reshape(-1,1)
y_ih_6 = np.array(train_df["identity_hate"]).reshape(-1,1)
y_n_7 = np.array(train_df["none"]).reshape(-1,1)
list_targets = [y_t_1, y_st_2, y_o_3, y_th_4, y_i_5, y_ih_6, y_n_7]


In [97]:
from joblib import dump, load

## Toxic

In [98]:
# Check Balance
np.unique(y_t_1, return_counts=True)

(array([0, 1]), array([144277,  15294]))
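
With counts this lopsided, accuracy alone says little. A small sketch of the arithmetic: a model that always predicts 0 for toxic would already be right roughly 90% of the time. The helper name is made up for illustration:

```python
def majority_baseline_accuracy(n_negative, n_positive):
    """Accuracy of a model that always predicts the more common class."""
    return max(n_negative, n_positive) / (n_negative + n_positive)

# Counts for the toxic label, copied from the output above.
toxic_baseline = majority_baseline_accuracy(144277, 15294)
print(f"always-0 baseline accuracy for toxic: {toxic_baseline:.3f}")
```

Any accuracy figure for this label is best read against that baseline, alongside the recall for class 1 in the classification report.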

In [99]:
# Try Oversample
from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import make_pipeline
oversample = RandomOverSampler(sampling_strategy='minority')


In [100]:
#NB Model
vector_nb_1 = TfidfVectorizer()
model_nb_1 = MultinomialNB()
#model_nb_1 = RandomForestClassifier()
stemmer_1 = PorterStemmer()
#lemmatizer = WordNetLemmatizer()

X_1 = train_df["comment_text"]
X_train_1, X_test_1, y_train_1, y_test_1 = train_test_split(X_1, y_t_1)

# Use original data
pipe_nb_1 = Pipeline([("vect", vector_nb_1),("model", model_nb_1)])
# Use oversample
#pipe_nb_1 = make_pipeline(vector_nb_1, oversample, model_nb_1)

pipe_nb_1.fit(X_train_1, y_train_1.ravel())
preds_1 = pipe_nb_1.predict(X_test_1)
#pipe_nb.score(X_test,y_test)
print(classification_report(y_test_1, preds_1))
confusion_matrix(y_test_1, preds_1)

#Retrain with full dataset
pipe_nb_1.fit(X_1, y_t_1.ravel())
dump(pipe_nb_1, "toxic.joblib")

              precision    recall  f1-score   support

           0       0.92      1.00      0.96     36051
           1       0.99      0.15      0.25      3842

    accuracy                           0.92     39893
   macro avg       0.95      0.57      0.61     39893
weighted avg       0.92      0.92      0.89     39893



['toxic.joblib']

#### Oversample

I tried oversampling, but it didn't give better results for this label, so I dropped it here. Below I include it for the labels where it looks to give an improvement, which tend to be the most heavily skewed ones. 
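
For reference, a hand-rolled sketch of what random oversampling does to a binary training set. This is an illustration in plain NumPy, not the imblearn `RandomOverSampler` implementation used in this notebook, and the `random_oversample` name and toy arrays are made up:

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen minority rows until both classes are the same size."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X)
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    n_extra = counts.max() - counts.min()
    # Draw, with replacement, the indices of the minority rows to repeat.
    minority_idx = np.flatnonzero(y == minority)
    extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)
    keep = np.concatenate([np.arange(len(y)), extra_idx])
    return X[keep], y[keep]

# Toy data: five negatives and one positive.
X_toy = np.array(["a", "b", "c", "d", "e", "f"])
y_toy = np.array([0, 0, 0, 0, 0, 1])
X_bal, y_bal = random_oversample(X_toy, y_toy)
print(np.unique(y_bal, return_counts=True))
```

Only the training data should be resampled. The imblearn pipeline takes care of that because its samplers run during `fit` and are skipped during `predict`, so the held-out comments keep their original class balance.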

## Severe

In [101]:
print(np.unique(y_st_2, return_counts=True))

(array([0, 1]), array([157976,   1595]))


In [102]:
#NB Model
vector_nb_2 = TfidfVectorizer()
model_nb_2 = MultinomialNB()
#model_nb_2 = RandomForestClassifier()
stemmer_2 = PorterStemmer()

X_2 = train_df["comment_text"]
X_train_2, X_test_2, y_train_2, y_test_2 = train_test_split(X_2, y_st_2)

pipe_nb_2 = Pipeline([ ("vect", vector_nb_2),("model", model_nb_2)])
#pipe_nb_2 = make_pipeline(vector_nb_2, oversample, model_nb_2)

pipe_nb_2.fit(X_train_2, y_train_2.ravel())
preds_2 = pipe_nb_2.predict(X_test_2)
#pipe_nb.score(X_test,y_test)
print(classification_report(y_test_2, preds_2))
confusion_matrix(y_test_2, preds_2)

#Retrain with full dataset
pipe_nb_2.fit(X_2, y_st_2.ravel())


              precision    recall  f1-score   support

           0       0.99      1.00      0.99     39492
           1       0.00      0.00      0.00       401

    accuracy                           0.99     39893
   macro avg       0.49      0.50      0.50     39893
weighted avg       0.98      0.99      0.98     39893



Pipeline(steps=[('vect', TfidfVectorizer()), ('model', MultinomialNB())])

## Obscene

In [103]:
print(np.unique(y_o_3, return_counts=True))

(array([0, 1]), array([151122,   8449]))


In [104]:
#NB Model
vector_nb_3 = TfidfVectorizer()
model_nb_3 = MultinomialNB()
#model_nb_3 = RandomForestClassifier()
stemmer_3 = PorterStemmer()

X_3 = train_df["comment_text"]
X_train_3, X_test_3, y_train_3, y_test_3 = train_test_split(X_3, y_o_3)

pipe_nb_3 = Pipeline([ ("vect", vector_nb_3),("model", model_nb_3)])
#pipe_nb_3 = make_pipeline(vector_nb_3, oversample, model_nb_3)

pipe_nb_3.fit(X_train_3, y_train_3.ravel())
preds_3 = pipe_nb_3.predict(X_test_3)
#pipe_nb.score(X_test,y_test)
print(classification_report(y_test_3, preds_3))
confusion_matrix(y_test_3, preds_3)

#Retrain with full dataset
pipe_nb_3.fit(X_3, y_o_3.ravel())

              precision    recall  f1-score   support

           0       0.95      1.00      0.97     37766
           1       0.97      0.09      0.16      2127

    accuracy                           0.95     39893
   macro avg       0.96      0.54      0.57     39893
weighted avg       0.95      0.95      0.93     39893



Pipeline(steps=[('vect', TfidfVectorizer()), ('model', MultinomialNB())])

## Threat

In [105]:
print(np.unique(y_th_4, return_counts=True))

(array([0, 1]), array([159093,    478]))


In [126]:
#NB Model
vector_nb_4 = TfidfVectorizer()
model_nb_4 = MultinomialNB()
#model_nb_4 = RandomForestClassifier()
stemmer_4 = PorterStemmer()

X_4 = train_df["comment_text"]
X_train_4, X_test_4, y_train_4, y_test_4 = train_test_split(X_4, y_th_4)

#pipe_nb_4 = Pipeline([ ("vect", vector_nb_4),("model", model_nb_4)])
pipe_nb_4 = make_pipeline(vector_nb_4, oversample, model_nb_4)

pipe_nb_4.fit(X_train_4, y_train_4.ravel())
preds_4 = pipe_nb_4.predict(X_test_4)
#pipe_nb.score(X_test,y_test)
print(classification_report(y_test_4, preds_4))
confusion_matrix(y_test_4, preds_4)

#Retrain with full dataset
pipe_nb_4.fit(X_4, y_th_4.ravel())

              precision    recall  f1-score   support

           0       1.00      0.96      0.98     39788
           1       0.05      0.72      0.09       105

    accuracy                           0.96     39893
   macro avg       0.52      0.84      0.53     39893
weighted avg       1.00      0.96      0.98     39893



Pipeline(steps=[('tfidfvectorizer', TfidfVectorizer()),
                ('randomoversampler',
                 RandomOverSampler(sampling_strategy='minority')),
                ('multinomialnb', MultinomialNB())])

## Insult

In [107]:
print(np.unique(y_i_5, return_counts=True))

(array([0, 1]), array([151694,   7877]))


In [108]:
#NB Model
vector_nb_5 = TfidfVectorizer()
model_nb_5 = MultinomialNB()
#model_nb_5 = RandomForestClassifier()
stemmer_5 = PorterStemmer()

X_5 = train_df["comment_text"]
X_train_5, X_test_5, y_train_5, y_test_5 = train_test_split(X_5, y_i_5)

pipe_nb_5 = Pipeline([ ("vect", vector_nb_5),("model", model_nb_5)])
#pipe_nb_5 = make_pipeline(vector_nb_5, oversample, model_nb_5)

pipe_nb_5.fit(X_train_5, y_train_5.ravel())
preds_5 = pipe_nb_5.predict(X_test_5)
#pipe_nb.score(X_test,y_test)
print(classification_report(y_test_5, preds_5))
confusion_matrix(y_test_5, preds_5)

#Retrain with full dataset
pipe_nb_5.fit(X_5, y_i_5.ravel())

              precision    recall  f1-score   support

           0       0.95      1.00      0.98     37919
           1       0.96      0.03      0.06      1974

    accuracy                           0.95     39893
   macro avg       0.95      0.52      0.52     39893
weighted avg       0.95      0.95      0.93     39893



Pipeline(steps=[('vect', TfidfVectorizer()), ('model', MultinomialNB())])

## Identity Hate

In [109]:
print(np.unique(y_ih_6, return_counts=True))

(array([0, 1]), array([158166,   1405]))


In [122]:
#NB Model
vector_nb_6 = TfidfVectorizer()
model_nb_6 = MultinomialNB()
#model_nb_6 = RandomForestClassifier()
stemmer_6 = PorterStemmer()

X_6 = train_df["comment_text"]
X_train_6, X_test_6, y_train_6, y_test_6 = train_test_split(X_6, y_ih_6)

#pipe_nb_6 = Pipeline([ ("vect", vector_nb_6),("model", model_nb_6)])
pipe_nb_6 = make_pipeline(vector_nb_6, oversample, model_nb_6)

pipe_nb_6.fit(X_train_6, y_train_6.ravel())
preds_6 = pipe_nb_6.predict(X_test_6)
#pipe_nb.score(X_test,y_test)
print(classification_report(y_test_6, preds_6))
confusion_matrix(y_test_6, preds_6)

#Retrain with full dataset
pipe_nb_6.fit(X_6, y_ih_6.ravel())

              precision    recall  f1-score   support

           0       1.00      0.94      0.97     39547
           1       0.10      0.78      0.18       346

    accuracy                           0.94     39893
   macro avg       0.55      0.86      0.58     39893
weighted avg       0.99      0.94      0.96     39893



Pipeline(steps=[('tfidfvectorizer', TfidfVectorizer()),
                ('randomoversampler',
                 RandomOverSampler(sampling_strategy='minority')),
                ('multinomialnb', MultinomialNB())])

I think this result was affected by the class balance in my test split. 
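
By default `train_test_split` shuffles without regard to the labels, so for a rare label the number of positives that land in the test split can vary from run to run. A minimal sketch, on made-up toy data, of passing `stratify` so that both splits keep the same positive rate:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels: 12 positives out of 1,000 rows (hypothetical numbers).
y_toy = np.array([1] * 12 + [0] * 988)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# stratify=y_toy keeps the class proportions the same in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=0
)
print("train positives:", y_tr.sum(), "test positives:", y_te.sum())
```

Fixing `random_state` as well makes the split repeatable, which helps when comparing runs with and without oversampling.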

## Test Data

In [111]:
test_df = pd.read_csv("test.csv")
test_df.head()

Unnamed: 0,id,comment_text
0,1,Yo bitch Ja Rule is more succesful then you'll...
1,2,== From RfC == \n\n The title is fine as it is...
2,3,""" \n\n == Sources == \n\n * Zawe Ashton on Lap..."
3,4,":If you have a look back at the source, the in..."
4,5,I don't anonymously edit articles at all.


In [112]:
#Toxic Predictions
pipe_nb_1 = load("toxic.joblib")
nb_preds_1 = pipe_nb_1.predict(test_df["comment_text"])
np.unique(nb_preds_1, return_counts=True)


(array([0, 1]), array([147594,   5570]))

In [113]:
# Severe Toxic Predictions
nb_preds_2 = pipe_nb_2.predict(test_df["comment_text"])
np.unique(nb_preds_2, return_counts=True)

(array([0, 1]), array([153159,      5]))

In [114]:
# Obscene Predictions
nb_preds_3 = pipe_nb_3.predict(test_df["comment_text"])
np.unique(nb_preds_3, return_counts=True)

(array([0, 1]), array([151335,   1829]))

In [127]:
# Threat Predictions
nb_preds_4 = pipe_nb_4.predict(test_df["comment_text"])
np.unique(nb_preds_4, return_counts=True)

(array([0, 1]), array([139881,  13283]))

In [116]:
# Insult Predictions
nb_preds_5 = pipe_nb_5.predict(test_df["comment_text"])
np.unique(nb_preds_5, return_counts=True)

(array([0, 1]), array([152443,    721]))

In [123]:
# Identity Hate Predictions
nb_preds_6 = pipe_nb_6.predict(test_df["comment_text"])
np.unique(nb_preds_6, return_counts=True)

(array([0, 1]), array([130935,  22229]))

In [128]:
test_df["toxic"] = nb_preds_1
test_df["severe_toxic"] = nb_preds_2
test_df["obscene"] = nb_preds_3
test_df["threat"] = nb_preds_4
test_df["insult"] = nb_preds_5
test_df["identity_hate"] = nb_preds_6
test_df.head()

Unnamed: 0,id,comment_text,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,1,Yo bitch Ja Rule is more succesful then you'll...,0,0,0,1,0,1
1,2,== From RfC == \n\n The title is fine as it is...,0,0,0,0,0,0
2,3,""" \n\n == Sources == \n\n * Zawe Ashton on Lap...",0,0,0,0,0,0
3,4,":If you have a look back at the source, the in...",0,0,0,0,0,0
4,5,I don't anonymously edit articles at all.,0,0,0,0,0,0


In [129]:
#test_df.info()
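#Write only the id and label columns so the file matches the example submission layout
test_df[["id", "toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]].to_csv('out.csv', index=False)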

## Output Details, Submission Info, and Example Submission

For this project, please output your predictions in a CSV file. The structure of the CSV file should match the structure of the example below. 

The output should contain one row for each row of test data, complete with the columns for ID and each classification.

Into Moodle please submit:
<ul>
<li> Your notebook file(s). I'm not going to run them, just look. 
<li> Your sample submission CSV. This will be evaluated for accuracy against the real labels; only a subset of the predictions will be scored. 
</ul>

It is REALLY, REALLY, REALLY important that the structure of your output matches the specifications. The accuracies will be calculated by a script, and it is expecting a specific format. 
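
A quick self-check before submitting can catch format slips. The sketch below assumes the expected layout is exactly the columns of the example submission in this notebook, in that order, with 0/1 values; the `check_submission` helper is hypothetical and is not the grading script:

```python
import pandas as pd

EXPECTED_COLUMNS = ["id", "toxic", "severe_toxic", "obscene",
                    "threat", "insult", "identity_hate"]

def check_submission(df, n_expected_rows=None):
    """Return a list of format problems; an empty list means nothing was found."""
    problems = []
    if list(df.columns) != EXPECTED_COLUMNS:
        problems.append(f"columns are {list(df.columns)}, expected {EXPECTED_COLUMNS}")
    for col in EXPECTED_COLUMNS[1:]:
        if col in df.columns and not df[col].isin([0, 1]).all():
            problems.append(f"column {col!r} has values other than 0 and 1")
    if "id" in df.columns and df["id"].duplicated().any():
        problems.append("duplicate ids")
    if n_expected_rows is not None and len(df) != n_expected_rows:
        problems.append(f"{len(df)} rows, expected {n_expected_rows}")
    return problems

# Toy frames: one in the expected layout, one with an extra column.
good = pd.DataFrame({"id": ["a", "b"], **{c: [0, 1] for c in EXPECTED_COLUMNS[1:]}})
bad = good.assign(comment_text="some text")
print(check_submission(good))
print(check_submission(bad))
```

Running it on the frame read back from out.csv, with `n_expected_rows` set to the number of test rows, checks the file as it will actually be submitted.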

### Sample Evaluator

The file prediction_evaluator.ipynb contains an example scoring function, scoreChecker. This function takes a submission and an answer key, loops through, and evaluates the accuracy. You can use this to verify the format of your submission. I'm going to use the same function to evaluate the accuracy of your submission against the answer key (unless I made some mistake in this counting function).
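
To make the idea concrete, here is a rough sketch of what a scorer of this kind might look like. It is not the actual scoreChecker: the `score_sketch` name, the matching on id, and the averaging of per-label accuracy are all assumptions made for illustration:

```python
import pandas as pd

LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

def score_sketch(submission, answer_key):
    """Per-label and mean accuracy over the ids that appear in the answer key."""
    # Inner merge keeps only the scored subset of ids.
    merged = answer_key.merge(submission, on="id", suffixes=("_true", "_pred"))
    per_label = {
        label: float((merged[f"{label}_true"] == merged[f"{label}_pred"]).mean())
        for label in LABELS
    }
    overall = sum(per_label.values()) / len(LABELS)
    return per_label, overall

# Toy data: the key covers two of the four submitted ids.
key = pd.DataFrame({"id": ["a", "b"], "toxic": [0, 1],
                    **{label: [0, 0] for label in LABELS[1:]}})
sub = pd.DataFrame({"id": ["a", "b", "c", "d"],
                    **{label: [0, 0, 0, 0] for label in LABELS}})
per_label, overall = score_sketch(sub, key)
print(per_label, overall)
```

In the toy data the only miss is the toxic label for id b, so toxic scores 0.5 and the other five labels score 1.0.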

In [120]:
#Construct dummy data for a sample output. 
#You won't do this part, you have real data
#Your data should have the same structure, so the CSV output is the same
dummy_ids = ["dfasdf234", "asdfgw43r52", "asdgtawe4", "wqtr215432"]
dummy_toxic = [0,0,0,0]
dummy_severe = [0,0,0,0]
dummy_obscene = [0,1,1,0]
dummy_threat = [0,1,0,1]
dummy_insult = [0,0,1,0]
dummy_ident = [0,1,1,0]
columns = ["id", "toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]
sample_out = pd.DataFrame( list(zip(dummy_ids, dummy_toxic, dummy_severe, dummy_obscene, dummy_threat, dummy_insult, dummy_ident)),
                    columns=columns)
sample_out.head()

Unnamed: 0,id,toxic,severe_toxic,obscene,threat,insult,identity_hate
0,dfasdf234,0,0,0,0,0,0
1,asdfgw43r52,0,0,1,1,0,1
2,asdgtawe4,0,0,1,0,1,1
3,wqtr215432,0,0,0,1,0,0


In [121]:
#Write DF to CSV. Please keep the "out.csv" filename. Moodle will auto-preface it with an identifier when I download it. 
#This command should work with your dataframe of predictions. 
#sample_out.to_csv('out.csv', index=False)  

## Grading

The grading for this is split between accuracy and well written code:
<ul>
<li> 75% - Accuracy. The most accurate will get 100% on this, the others will be scaled down from there. 
<li> 25% - Code quality. Can the code be followed and made sense of - i.e. comments, sections, titles. 
</ul>