
In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# Insult Classification

In this exercise, we would like to filter out insulting comments on a web forum.

To train our models, we have a list of historic comments, each with a judgement on whether it is insulting or not.

In [2]:
import pandas as pd
path_to_insults = 'drive/MyDrive/ML_HW_2_data/'
data = pd.read_csv(path_to_insults + 'train-utf8.csv')
data.head(6)

Unnamed: 0,Insult,Date,Comment
0,1,20120618192155Z,You fuck your dad.
1,0,20120528192215Z,i really don't understand your point. It seem...
2,0,,A majority of Canadians can and has been wrong...
3,0,,listen if you dont wanna get married to a man ...
4,0,20120619094753Z,Các bạn xuống đường biểu tình 2011 có ôn hoà k...
5,0,20120620171226Z,"@SDL OK, but I would hope they'd sign him to a..."


In [3]:
print ("%d comments, of which %d insults (%d%%)" % \
    (len(data), data.Insult.sum(), 100 * data.Insult.mean()))

3947 comments, of which 1049 insults (26%)


### Looking for known bad words

One way to do this is to load Google's bad word list and flag comments that contain one or more of its words.

- Load `google_badlist.txt` from `data/insults/`
- Add a column to `data` with a flag (0 or 1) if the comment contains a bad word
- Compute the accuracy of this method - does this look good?
- What would a naive classifier's score be (i.e., always predicting 0 or 1)?
- Also compute the precision, recall, F1 score and AUC score
- What is your verdict?
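
The checklist above can be sketched on a tiny hand-made sample. The comments, labels, and word list below are invented for illustration and are not taken from `train-utf8.csv` or `google_badlist.txt`. The regex tokenizer and the helper name `flag_bad_words` are also assumptions, chosen so that punctuation attached to a word does not hide a match.

```python
import re

import pandas as pd
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

# Hypothetical toy data, not taken from the real dataset.
toy = pd.DataFrame({
    "Comment": ["You are an idiot.", "Nice point, thanks!",
                "What a moron", "I disagree with you", "you clown"],
    "Insult": [1, 0, 1, 0, 1],
})
toy_bad_words = {"idiot", "moron"}  # stand-in for the real word list


def flag_bad_words(comment, bad_words):
    """Return 1 if any alphabetic token of the comment is in bad_words."""
    tokens = re.findall(r"[a-z']+", comment.lower())
    return int(any(token in bad_words for token in tokens))


toy["flag"] = toy["Comment"].apply(lambda c: flag_bad_words(c, toy_bad_words))

# Score the flag against the true labels, not against itself.
metrics = {
    "accuracy": accuracy_score(toy["Insult"], toy["flag"]),
    "precision": precision_score(toy["Insult"], toy["flag"], zero_division=0),
    "recall": recall_score(toy["Insult"], toy["flag"]),
    "f1": f1_score(toy["Insult"], toy["flag"]),
    "auc": roc_auc_score(toy["Insult"], toy["flag"]),
}
print(metrics)
```

In this toy sample the word list misses "clown", so precision is perfect while recall is not. A fixed word list tends to fail in this direction.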

In [4]:
filename = path_to_insults + 'google_badlist.txt'
filename

'drive/MyDrive/ML_HW_2_data/google_badlist.txt'

In [5]:
# Your code here

In [6]:
# Load the word list into a set for fast membership tests; skip blank lines.
with open(filename, 'r', encoding='utf-8') as file:
    bad_words = {line.strip().lower() for line in file if line.strip()}

In [7]:
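def contains_bad_words(comment):
    """Return 1 if any whitespace-separated token of the comment is a bad word, else 0."""
    # Note: punctuation stays attached to tokens, so "word." does not match "word".
    for word in comment.lower().split():
        if word in bad_words:
            return 1
    return 0

data['containsBadWord'] = data['Comment'].apply(contains_bad_words)

# This ratio is the number of flagged comments divided by the number of insults.
# It is not a classification accuracy, which would be
# (data.containsBadWord == data.Insult).mean() * 100.
flagged_to_insult_ratio = (data.containsBadWord.sum() / data.Insult.sum()) * 100
flagged_to_insult_ratio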

75.50047664442326

In [8]:
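# Note: these shares are computed on the bad-word flag, so they show how often the
# flag is 0 or 1. A naive classifier's accuracy would instead use
# data['Insult'].value_counts().
InsultValues = data['containsBadWord'].value_counts()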

accuracy0 = (InsultValues[0] / len(data['containsBadWord'])) * 100
print("The accuracy of 0 is:", accuracy0)

accuracy1 = (InsultValues[1] / len(data['containsBadWord'])) * 100
print("The accuracy of 1 is:", accuracy1)

The accuracy of 0 is: 79.93412718520395
The accuracy of 1 is: 20.065872814796048


In [9]:
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

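# Naive classifier: always predict 1.
data['naiveFlag'] = 1

# Note: the reference here is the bad-word flag, not the true labels;
# use data['Insult'] as testValues to score the baseline against the labels.
testValues = data['containsBadWord']
predValues = data['naiveFlag']

accuracy = (testValues == predValues).mean() * 100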
precision = precision_score(testValues, predValues, zero_division = 0)
recall = recall_score(testValues, predValues)
f1 = f1_score(testValues, predValues)
auc = roc_auc_score(testValues, predValues)

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)
print("AUC Score:", auc)

Accuracy: 20.065872814796048
Precision: 0.20065872814796049
Recall: 1.0
F1 Score: 0.33424773158894283
AUC Score: 0.5
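
For comparison, a naive classifier can also be scored against the true `Insult` labels rather than against the bad-word flag. The sketch below rebuilds a label vector from the counts printed earlier (3947 comments, 1049 insults) and uses scikit-learn's `DummyClassifier`. The variable names are illustrative, and the feature matrix is a placeholder because the dummy model ignores it.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

n_comments, n_insults = 3947, 1049  # counts reported earlier in this notebook

# Synthetic label vector with the same class balance as the dataset.
y = np.array([1] * n_insults + [0] * (n_comments - n_insults))
X_dummy = np.zeros((n_comments, 1))  # features are ignored by DummyClassifier

baselines = {}
for constant in (0, 1):
    clf = DummyClassifier(strategy="constant", constant=constant).fit(X_dummy, y)
    pred = clf.predict(X_dummy)
    baselines[constant] = {
        "accuracy": accuracy_score(y, pred),
        "f1": f1_score(y, pred, zero_division=0),
    }
    print(constant, baselines[constant])
```

Always predicting 0 is correct for every non-insult, so its accuracy equals the share of non-insults (2898 of 3947, about 73%). That is the bar any real classifier has to clear, even though its F1 for the insult class is 0.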


### Learning bad words on the fly

Another way of doing this is to learn the insulting words on the fly using `CountVectorizer`.

Please refer to the scikit learn tutorial at 'http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html' if you need some help.

Here is what you need to do:

- Import `CountVectorizer` from `sklearn.feature_extraction.text`
- Train the `CountVectorizer` on the insults and create a feature set $X$ representing words in the comments
- Train `MultinomialNB` and `BernoulliNB` from `sklearn.naive_bayes` on the new feature set $X$
- Using cross-validation, compute the accuracy, precision, recall, F1 and AUC of your model
- What is your verdict?
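
The steps above can be sketched as a single cross-validated pipeline. This is a minimal sketch on an invented toy corpus, not the notebook's data: in practice `comments` and `labels` would be `data['Comment']` and `data['Insult']`. The shuffled `StratifiedKFold` and the `results` dictionary are assumptions made for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus, for illustration only.
comments = [
    "you are an idiot", "thanks for the helpful answer", "what a stupid moron",
    "i agree with this point", "shut up you idiot", "great article, well written",
    "you stupid fool", "interesting read, thank you", "nobody likes you moron",
    "could you share the source", "idiot idiot idiot", "have a nice day everyone",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0]

scoring = ["accuracy", "precision", "recall", "f1", "roc_auc"]
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

results = {}
for name, model in [("MultinomialNB", MultinomialNB()), ("BernoulliNB", BernoulliNB())]:
    # The vectorizer lives inside the pipeline, so it is refit on each training fold.
    pipe = make_pipeline(CountVectorizer(), model)
    scores = cross_validate(pipe, comments, labels, cv=cv, scoring=scoring)
    results[name] = {m: scores[f"test_{m}"].mean() for m in scoring}
    print(name, results[name])
```

Keeping `CountVectorizer` inside the pipeline means the vocabulary is learned only from each fold's training part, so words from the validation part do not leak into the features.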

NOTE: The F1 score is another useful score to compute when one of the two classes is very rare. We didn't go over it in class, but it is the harmonic mean of precision and recall and ranges from 0 (worst) to 1 (best). You can read more here: 'https://en.wikipedia.org/wiki/F1_score'
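
As a small worked example of the harmonic mean, the sketch below computes F1 by hand and checks it against scikit-learn. The labels are made up for illustration, and `f1_from_precision_recall` is a hypothetical helper, not part of any library.

```python
from sklearn.metrics import f1_score, precision_score, recall_score


def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall; defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels: 3 true insults; the classifier finds 1 and raises no false alarm.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0]

p = precision_score(y_true, y_pred)  # 1 / 1
r = recall_score(y_true, y_pred)     # 1 / 3
manual_f1 = f1_from_precision_recall(p, r)
print(p, r, manual_f1, f1_score(y_true, y_pred))
```

Perfect precision paired with low recall still gives a mediocre F1, because the harmonic mean is pulled toward the smaller of the two values.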

In [10]:
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split

In [11]:
# Your code here

In [12]:
count_vect = CountVectorizer()
X_train, X_test, y_train, y_test = train_test_split(data['Comment'], data['Insult'], test_size=0.2, random_state=66)
X_train_counts = count_vect.fit_transform(X_train)
X_train_counts.shape
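# Vocabulary lookup carried over from the scikit-learn tutorial;
# returns None when the word is not in the vocabulary.
count_vect.vocabulary_.get('algorithm')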

In [13]:
from sklearn.feature_extraction.text import TfidfTransformer
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
X_train_tfidf.shape

(3157, 13841)

In [14]:
from sklearn.naive_bayes import (MultinomialNB, BernoulliNB)
clf_Multi = MultinomialNB().fit(X_train_tfidf, y_train)
clf_Bernoulli = BernoulliNB().fit(X_train_tfidf, y_train)

In [15]:
X_new_counts = count_vect.transform(X_test)
X_new_tfidf = tfidf_transformer.transform(X_new_counts)

predicted_Multi = clf_Multi.predict(X_new_tfidf)
predicted_Bernoulli = clf_Bernoulli.predict(X_new_tfidf)

# for doc, category in zip(X_test, predicted):
#     print('%r => %s' % (doc, twenty_train.target_names[category]))

In [16]:
from sklearn.pipeline import Pipeline
text_clf_Multi = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])

text_clf_Bernoulli = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', BernoulliNB()),
])

text_clf_Multi.fit(X_train, y_train)
text_clf_Bernoulli.fit(X_train, y_train)

In [17]:
# y_test
import numpy as np
predicted_Multi = text_clf_Multi.predict(X_test)
predicted_Bernoulli = text_clf_Bernoulli.predict(X_test)


In [18]:
accuracy = np.mean(predicted_Multi == y_test) * 100
precision = precision_score(y_test, predicted_Multi, zero_division = 0)
recall = recall_score(y_test, predicted_Multi)
f1 = f1_score(y_test, predicted_Multi)
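# Note: this AUC is computed from hard 0/1 predictions; passing
# text_clf_Multi.predict_proba(X_test)[:, 1] would score the full ranking.
auc = roc_auc_score(y_test, predicted_Multi)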

print("Accuracy MultinomialNB:", accuracy)
print("Precision MultinomialNB:", precision)
print("Recall MultinomialNB:", recall)
print("F1 Score MultinomialNB:", f1)
print("AUC Score MultinomialNB:", auc)

Accuracy MultinomialNB: 74.55696202531645
Precision MultinomialNB: 1.0
Recall MultinomialNB: 0.0779816513761468
F1 Score MultinomialNB: 0.14468085106382977
AUC Score MultinomialNB: 0.5389908256880734


In [19]:
accuracy = np.mean(predicted_Bernoulli == y_test) * 100
precision = precision_score(y_test, predicted_Bernoulli, zero_division = 0)
recall = recall_score(y_test, predicted_Bernoulli)
f1 = f1_score(y_test, predicted_Bernoulli)
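# Note: this AUC is computed from hard 0/1 predictions; passing
# text_clf_Bernoulli.predict_proba(X_test)[:, 1] would score the full ranking.
auc = roc_auc_score(y_test, predicted_Bernoulli)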

print("Accuracy BernoulliNB:", accuracy)
print("Precision BernoulliNB:", precision)
print("Recall BernoulliNB:", recall)
print("F1 Score BernoulliNB:", f1)
print("AUC Score BernoulliNB:", auc)

Accuracy BernoulliNB: 76.20253164556962
Precision BernoulliNB: 0.8947368421052632
Recall BernoulliNB: 0.1559633027522936
F1 Score BernoulliNB: 0.265625
AUC Score BernoulliNB: 0.5744851478796433
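
The AUC values above come from hard 0/1 predictions, which expose only a single operating point of each classifier. The sketch below, on invented labels and probabilities, shows how the score can change when `roc_auc_score` receives probabilities instead. In this notebook the equivalent input would presumably be `text_clf_Multi.predict_proba(X_test)[:, 1]`.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical labels and predicted insult probabilities, for illustration only.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
proba = np.array([0.05, 0.10, 0.20, 0.45, 0.30, 0.40, 0.48, 0.90])

hard = (proba >= 0.5).astype(int)  # thresholded labels, similar to .predict()

auc_hard = roc_auc_score(y_true, hard)   # only one threshold is visible
auc_soft = roc_auc_score(y_true, proba)  # the full ranking is visible
print(auc_hard, auc_soft)
```

A model that ranks insults above most non-insults but rarely crosses the 0.5 threshold looks close to random when scored on hard predictions, which fits the low recall reported for both Naive Bayes models.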


In [20]:
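# Note: the vectorizer is refit on the held-out split, and the cross-validation
# below runs on that split only (20% of the data), not on the full dataset.
count_vect = CountVectorizer()
X_test_counts = count_vect.fit_transform(X_test)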
LR = LogisticRegression()
print(cross_val_score(LR, X_test_counts, y_test, cv=3))

[0.78409091 0.74524715 0.8365019 ]


In [21]:
from sklearn.metrics import make_scorer, accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

# Note: this dict is not used below; the later cells pass scorer names as strings.
scoring = {
    'accuracy': make_scorer(accuracy_score),
    'precision': make_scorer(precision_score),
    'recall': make_scorer(recall_score),
    'f1': make_scorer(f1_score),
    'roc_auc': make_scorer(roc_auc_score)
}


In [22]:
accuracy_scores = cross_val_score(LR, X_test_counts, y_test, cv=3, scoring='accuracy')
precision_scores = cross_val_score(LR, X_test_counts, y_test, cv=3, scoring='precision')
recall_scores = cross_val_score(LR, X_test_counts, y_test, cv=3, scoring='recall')
f1_scores = cross_val_score(LR, X_test_counts, y_test, cv=3, scoring='f1')
auc_scores = cross_val_score(LR, X_test_counts, y_test, cv=3, scoring='roc_auc')

print("Accuracy:", accuracy_scores.mean())
print("Precision:", precision_scores.mean())
print("Recall:", recall_scores.mean())
print("F1 Score:", f1_scores.mean())
print("AUC Score:", auc_scores.mean())

Accuracy: 0.7886133195068558
Precision: 0.6820450885668278
Recall: 0.4635337392186707
F1 Score: 0.5499092124479121
AUC Score: 0.811537649318256


In [23]:
accuracy_scores = cross_val_score(clf_Multi, X_test_counts, y_test, cv=3, scoring='accuracy')
precision_scores = cross_val_score(clf_Multi, X_test_counts, y_test, cv=3, scoring='precision')
recall_scores = cross_val_score(clf_Multi, X_test_counts, y_test, cv=3, scoring='recall')
f1_scores = cross_val_score(clf_Multi, X_test_counts, y_test, cv=3, scoring='f1')
auc_scores = cross_val_score(clf_Multi, X_test_counts, y_test, cv=3, scoring='roc_auc')

print("Accuracy:", accuracy_scores.mean())
print("Precision:", precision_scores.mean())
print("Recall:", recall_scores.mean())
print("F1 Score:", f1_scores.mean())
print("AUC Score:", auc_scores.mean())

Accuracy: 0.7404702922763761
Precision: 0.5357220859300796
Recall: 0.47222222222222227
F1 Score: 0.5009457648727055
AUC Score: 0.7169346087293028


In [24]:
accuracy_scores = cross_val_score(clf_Bernoulli, X_test_counts, y_test, cv=3, scoring='accuracy')
precision_scores = cross_val_score(clf_Bernoulli, X_test_counts, y_test, cv=3, scoring='precision')
recall_scores = cross_val_score(clf_Bernoulli, X_test_counts, y_test, cv=3, scoring='recall')
f1_scores = cross_val_score(clf_Bernoulli, X_test_counts, y_test, cv=3, scoring='f1')
auc_scores = cross_val_score(clf_Bernoulli, X_test_counts, y_test, cv=3, scoring='roc_auc')

print("Accuracy:", accuracy_scores.mean())
print("Precision:", precision_scores.mean())
print("Recall:", recall_scores.mean())
print("F1 Score:", f1_scores.mean())
print("AUC Score:", auc_scores.mean())

Accuracy: 0.6898404194031572
Precision: 0.1865079365079365
Recall: 0.0365296803652968
F1 Score: 0.06049228201919065
AUC Score: 0.6843202252445525


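In 3-fold cross-validation on the held-out split, Logistic Regression had the best accuracy of the three models (0.79, versus 0.74 for MultinomialNB and 0.69 for BernoulliNB), and it also led on F1 (0.55) and AUC (0.81).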