# INTRODUCTION

## Overview

Natural Language Processing is a complex field which is hypothesised to be part of AI-complete set of problems, implying that the difficulty of these computational problems is equivalent to that of solving the central artificial intelligence problem of making computers as intelligent as people. With over 90% of data ever generated being produced in the last 2 years and with a great proportion being human generated unstructured text there is an ever increasing need to advance the field of Natural Language Processing.

Recent UK Government proposal to have measures to regulate social media companies over harmful content, including "substantial" fines and the ability to block services that do not stick to the rules is an example of the regulamentary need to better manage the content that is being generated by users.

Other initiatives like ​Riot Games​' work aimed to predict and reform toxic player behaviour during games is another example of this effort to understand the content being generated by users and moderate toxic content.
However, as highlighted by the Kaggle competition ​Jigsaw unintended bias in toxicity classification​, existing models suffer from unintended bias where models might predict high likelihood of toxicity for content containing certain words (e.g. "gay") even when those comments were not actually toxic (such as "I am a gay woman"), leaving machine only classification models still sub-standard.

The outcome of our analysis is the type of algorithm that companies will use to define what is free speech and what shouldn't be tolerated in a discussion. This challenge actually starts with how the training dataset was produced: Multiple people (annotators) read thousands of comments and defined if those comments were offensive or not. Where is the trick? They disagreed in many of them. Having tools that are able to flag up toxic content without suffering from unintended bias is of paramount importance to preserve Internet's fairness and freedom of speech

## Dataset

At the end of 2017 the Civil Comments platform shut down and chose make their ~2m public comments from their platform available in a lasting open archive so that researchers could understand and improve civility in online conversations for years to come. Jigsaw sponsored this effort and extended annotation of this data by human raters for various toxic conversational attributes.

In the data supplied for this competition, the text of the individual comment is found in the comment_text column. Each comment in Train has a toxicity label (target), and models should predict the target toxicity for the Test data. This attribute (and all others) are fractional values which represent the fraction of human raters who believed the attribute applied to the given comment.

For evaluation, test set examples with target >= 0.5 will be considered to be in the positive class (toxic).


In [1]:
import warnings
warnings.filterwarnings('ignore')



In [2]:
from __future__ import print_function

import os
import sys

import seaborn as sns
import matplotlib.pyplot as plt
import plotly.graph_objs as go
import plotly.plotly as py
import missingno as msno


import numpy as np
import pandas as pd
from scipy import stats
import spacy
from sklearn.decomposition import PCA

from wordcloud import WordCloud ,STOPWORDS

import watermark

from tqdm import tqdm_notebook

from wordcloud import WordCloud, STOPWORDS
import gensim
from gensim.utils import simple_preprocess
from gensim.parsing.preprocessing import STOPWORDS
from nltk.stem import WordNetLemmatizer, SnowballStemmer
from nltk.stem.porter import *
import nltk
from gensim import corpora, models
from sklearn.model_selection import train_test_split
import operator
from keras.preprocessing.text import Tokenizer
from nltk.stem import WordNetLemmatizer 
lemmatizer = WordNetLemmatizer()
#lemmatized_output = ' '.join([lemmatizer.lemmatize(w) for w in word_list])
nltk.download('wordnet')

%load_ext watermark

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.dummy import DummyClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.preprocessing import RobustScaler,robust_scale,MinMaxScaler
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier

from sklearn.metrics import precision_score, recall_score,f1_score

PROJ_ROOT = os.path.join(os.pardir)

print(os.path.abspath(PROJ_ROOT))


paramiko missing, opening SSH/SCP/SFTP paths will be disabled.  `pip install paramiko` to suppress


detected Windows; aliasing chunkize to chunkize_serial

Using TensorFlow backend.
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Parth\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


C:\Users\Parth\Documents\Parth\Data science\Springboard\github\Springboard\capstone projects\2. jigsaw-unintended-bias-in-toxicity-classification


### Important Library Version Information

In [3]:
%watermark -a "Parth Patel" -d -t -v -p  numpy,pandas  --iversions

matplotlib 3.0.3
scipy      1.2.1
numpy      1.16.4
gensim     3.4.0
plotly     3.9.0
watermark  1.8.1
spacy      2.1.4
seaborn    0.9.0
nltk       3.4
missingno  0.4.1
re         2.2.1
pandas     0.24.2
Parth Patel 2019-07-21 14:20:10 

CPython 3.6.8
IPython 7.4.0

numpy 1.16.4
pandas 0.24.2


In [4]:
train_df = pd.read_csv('../Data/train.csv')

#### Feature Engineering
Since our main independat column is commets lets try to create features out of it

In [5]:
train_df['total_length'] = train_df['comment_text'].apply(len)
train_df['capitals'] = train_df['comment_text'].apply(lambda comment: sum(1 for c in comment if c.isupper()))
train_df['caps_vs_length'] = train_df.apply(lambda row: float(row['capitals'])/float(row['total_length']),axis=1)
train_df['num_exclamation_marks'] = train_df['comment_text'].apply(lambda comment: comment.count('!'))
train_df['num_question_marks'] = train_df['comment_text'].apply(lambda comment: comment.count('?'))
train_df['num_punctuation'] = train_df['comment_text'].apply(lambda comment: sum(comment.count(w) for w in '.,;:'))
train_df['num_symbols'] = train_df['comment_text'].apply(lambda comment: sum(comment.count(w) for w in '*&$%'))
train_df['num_words'] = train_df['comment_text'].apply(lambda comment: len(comment.split()))
train_df['num_unique_words'] = train_df['comment_text'].apply(lambda comment: len(set(w for w in comment.split())))
train_df['words_vs_unique'] = train_df['num_unique_words'] / train_df['num_words']
train_df['num_smilies'] = train_df['comment_text'].apply(lambda comment: sum(comment.count(w) for w in (':-)', ':)', ';-)', ';)')))


In [6]:
train_df['Is_toxic'] =  train_df['target'].apply(lambda x: "Toxic" if x>=0.5 else "NonToxic")

In [7]:
features = ('total_length', 'capitals', 'caps_vs_length', 'num_exclamation_marks','num_question_marks', 'num_punctuation', 'num_words', 'num_unique_words','words_vs_unique', 'num_smilies', 'num_symbols','target')

In [8]:
Nontoxic_df = train_df.loc[train_df['Is_toxic'] == 'NonToxic']
Nontoxic_df = Nontoxic_df.head(481113)
#Nontoxic_df = Nontoxic_df.head(30000)

toxic_df = train_df.loc[train_df['Is_toxic'] == 'Toxic']
#toxic_df = toxic_df.head(10000)


In [9]:
#del final_df
final_df = pd.concat([Nontoxic_df,toxic_df])
#Nontoxic_df.append(toxic_df) 
len(final_df)

625447

In [10]:
#1804874

In [11]:
final_df['Is_toxic'].value_counts()

NonToxic    481113
Toxic       144334
Name: Is_toxic, dtype: int64

In [12]:
#dir()
del Nontoxic_df
del toxic_df
del train_df

In [13]:
final_df.id.nunique()

625447

In [14]:
final_df = final_df[['id','comment_text','target']]

In [15]:
final_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 625447 entries, 0 to 1804872
Data columns (total 3 columns):
id              625447 non-null int64
comment_text    625447 non-null object
target          625447 non-null float64
dtypes: float64(1), int64(1), object(1)
memory usage: 19.1+ MB


### Text Data Clening

Text cleaning will be performed mainly in 5 steps, which will be 
* Lower caseing
* Expanding Contractions
* Removing Special Characters
* Removing Stopwords

#### Lower Case
Here all alphabet will be converted to lower case as all avaible mapping are in lower case alphabet as well as it will remove inconsistant typing errors and make standard text words and sentence.

#### Removing Special Characters
Special characters and symbols are usually non-alphanumeric characters or even occasionally numeric characters (depending on the problem), which add to the extra noise in unstructured text. Usually, simple regular expressions (regexes) can be used to remove them.

#### Removing Stopwords
Words which have little or no significance, especially when constructing meaningful features from text, are known as stopwords or stop words. These are usually words that end up having the maximum frequency if you do a simple term or word frequency in a corpus. Typically, these can be articles, conjunctions, prepositions and so on. Some examples of stopwords are a, an, the, and the like.

#### Lemmatization
Lemmatization is very similar to stemming, where we remove word affixes to get to the base form of a word. However, the base form in this case is known as the root word, but not the root stem. The difference being that the root word is always a lexicographically correct word (present in the dictionary), but the root stem may not be so. Thus, root word, also known as the lemma, will always be present in the dictionary. Both nltk and spacy have excellent lemmatizers. We will be using spacy here.

#### Expanding Contractions
Contractions are shortened version of words or syllables. They often exist in either written or spoken forms in the English language. These shortened versions or contractions of words are created by removing specific letters and sounds. In case of English contractions, they are often created by removing one of the vowels from the word. Examples would be, do not to don’t and I would to I’d. Converting each contraction to its expanded, original form helps with text standardization.


In [16]:
def clean_contractions(text, mapping):
    
    specials = ["’", "‘", "´", "`"]
    for s in specials:
        text = text.replace(s, "'")
    text = ' '.join([mapping[t] if t in mapping else t for t in text.split(" ")])
    return text

contraction_mapping = {"ain't": "is not", "aren't": "are not","can't": "cannot", "'cause": "because", "could've": "could have", "couldn't": "could not", "didn't": "did not",  "doesn't": "does not", "don't": "do not", "hadn't": "had not", "hasn't": "has not", "haven't": "have not", "he'd": "he would","he'll": "he will", "he's": "he is", "how'd": "how did", "how'd'y": "how do you", "how'll": "how will", "how's": "how is",  "I'd": "I would", "I'd've": "I would have", "I'll": "I will", "I'll've": "I will have","I'm": "I am", "I've": "I have", "i'd": "i would", "i'd've": "i would have", "i'll": "i will",  "i'll've": "i will have","i'm": "i am", "i've": "i have", "isn't": "is not", "it'd": "it would", "it'd've": "it would have", "it'll": "it will", "it'll've": "it will have","it's": "it is", "let's": "let us", "ma'am": "madam", "mayn't": "may not", "might've": "might have","mightn't": "might not","mightn't've": "might not have", "must've": "must have", "mustn't": "must not", "mustn't've": "must not have", "needn't": "need not", "needn't've": "need not have","o'clock": "of the clock", "oughtn't": "ought not", "oughtn't've": "ought not have", "shan't": "shall not", "sha'n't": "shall not", "shan't've": "shall not have", "she'd": "she would", "she'd've": "she would have", "she'll": "she will", "she'll've": "she will have", "she's": "she is", "should've": "should have", "shouldn't": "should not", "shouldn't've": "should not have", "so've": "so have","so's": "so as", "this's": "this is","that'd": "that would", "that'd've": "that would have", "that's": "that is", "there'd": "there would", "there'd've": "there would have", "there's": "there is", "here's": "here is","they'd": "they would", "they'd've": "they would have", "they'll": "they will", "they'll've": "they will have", "they're": "they are", "they've": "they have", "to've": "to have", "wasn't": "was not", "we'd": "we would", "we'd've": "we would have", "we'll": "we will", "we'll've": "we will have", "we're": "we are", "we've": "we have", "weren't": "were not", "what'll": "what will", "what'll've": "what will have", "what're": "what are",  "what's": "what is", "what've": "what have", "when's": "when is", "when've": "when have", "where'd": "where did", "where's": "where is", "where've": "where have", "who'll": "who will", "who'll've": "who will have", "who's": "who is", "who've": "who have", "why's": "why is", "why've": "why have", "will've": "will have", "won't": "will not", "won't've": "will not have", "would've": "would have", "wouldn't": "would not", "wouldn't've": "would not have", "y'all": "you all", "y'all'd": "you all would","y'all'd've": "you all would have","y'all're": "you all are","y'all've": "you all have","you'd": "you would", "you'd've": "you would have", "you'll": "you will", "you'll've": "you will have", "you're": "you are", "you've": "you have", 'colour': 'color', 'centre': 'center', 'favourite': 'favorite', 'travelling': 'traveling', 'counselling': 'counseling', 'theatre': 'theater', 'cancelled': 'canceled', 'labour': 'labor', 'organisation': 'organization', 'wwii': 'world war 2', 'citicise': 'criticize', 'youtu ': 'youtube ', 'Qoura': 'Quora', 'sallary': 'salary', 'Whta': 'What', 'narcisist': 'narcissist', 'howdo': 'how do', 'whatare': 'what are', 'howcan': 'how can', 'howmuch': 'how much', 'howmany': 'how many', 'whydo': 'why do', 'doI': 'do I', 'theBest': 'the best', 'howdoes': 'how does', 'mastrubation': 'masturbation', 'mastrubate': 'masturbate', "mastrubating": 'masturbating', 'pennis': 'penis', 'Etherium': 'Ethereum', 'narcissit': 'narcissist', 'bigdata': 'big data', '2k17': '2017', '2k18': '2018', 'qouta': 'quota', 'exboyfriend': 'ex boyfriend', 'airhostess': 'air hostess', "whst": 'what', 'watsapp': 'whatsapp', 'demonitisation': 'demonetization', 'demonitization': 'demonetization', 'demonetisation': 'demonetization'}


In [17]:
wpt = nltk.WordPunctTokenizer()
stop_words = nltk.corpus.stopwords.words('english')

def normalize_document(doc):
    # lower case and remove special characters\whitespaces
    doc = re.sub(r'[^a-zA-Z\s]', '', doc, re.I|re.A)
    doc = doc.lower()
    doc = doc.strip()
    # tokenize document
    tokens = wpt.tokenize(doc)
    # filter stopwords out of document
    filtered_tokens = [token for token in tokens if token not in stop_words]
    # lemmatizing words in dcoument
    filtered_tokens = [lemmatizer.lemmatize(w) for w in filtered_tokens]
    # re-create document from filtered tokens
    doc = ' '.join(filtered_tokens)
    return doc

normalize_corpus = np.vectorize(normalize_document)


In [18]:
final_df['comment_text'] = normalize_corpus(final_df['comment_text'])

In [19]:
final_df['comment_text'] = final_df['comment_text'].apply(lambda x: clean_contractions(x, contraction_mapping))

In [20]:
final_df.head()

Unnamed: 0,id,comment_text,target
0,59848,cool like would want mother read really great ...,0.0
1,59849,thank would make life lot le anxietyinducing k...,0.0
2,59852,urgent design problem kudos taking impressive,0.0
3,59855,something ill able install site releasing,0.0
6,59861,hahahahahahahahhha suck,0.457627


# Train test split

In [21]:
X=final_df['comment_text']
y=final_df['target']
#final_df = final_df[['id','comment_text','target']]

In [22]:
y=y.apply(lambda x: 1 if x>=0.5 else 0)

In [23]:
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.2 ,random_state=42)

In [24]:
#train = train_df[['comment_text','target']]

In [25]:
print(y_train.shape)
print(y_test.shape)
print(X_train.shape)
print(X_test.shape)

(500357,)
(125090,)
(500357,)
(125090,)


#### Count Vector


In [26]:
cvect = CountVectorizer(min_df = 0.01, ngram_range=(1, 2), analyzer="word").fit(X_train)

In [27]:
X_trcv = cvect.transform(X_train)
X_tscv = cvect.transform(X_test)

In [28]:
print(X_tscv.shape)
print(X_trcv.shape)
print(X_train.shape)

(125090, 417)
(500357, 417)
(500357,)


#### TfidfVectorizer

In [29]:
tfvect = TfidfVectorizer().fit(X_train)

In [30]:
X_trtf = tfvect.transform(X_train)
X_tstf = tfvect.transform(X_test)

In [31]:
print(X_trtf.shape)
print(X_tstf.shape)
print(X_train.shape)
print(X_test.shape)

(500357, 236688)
(125090, 236688)
(500357,)
(125090,)


### Logistic Regression with Count Vectoriser


In [32]:
#alphas_alt = {'C': [0.001,0.01,0.1,1,10,100]#,
           #  'max_iter' :[100000,300000,500000]
#             }
alphas_alt = {'C': [0.7,0.8,0.9,1,1.1,1.2,1.3,1.4,1.5],
            'max_iter' :[10,300,500]
             }

Logis = LogisticRegression()

In [33]:
Logis = RandomizedSearchCV(estimator = Logis, param_distributions =alphas_alt, n_iter=5, cv = 5, verbose=2,n_jobs=-1, random_state=42)
Logis.fit(X_trcv, y_train)

print('Best parameters for Logistic Regression Model : {}'.format(Logis.best_params_))

Fitting 5 folds for each of 5 candidates, totalling 25 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  15 out of  25 | elapsed:   37.4s remaining:   24.9s
[Parallel(n_jobs=-1)]: Done  25 out of  25 | elapsed:   40.9s finished


Best parameters for Logistic Regression Model : {'max_iter': 10, 'C': 1.4}


In [34]:
Logis_final = LogisticRegression(C=1.4,max_iter=10,n_jobs=-1)
Logis_final.fit(X_trcv, y_train)
#prediction = losso_final.predict(X_test)
print("Training Accuracy: {}".format(Logis_final.score(X_trcv, y_train)))
print("Testing Accuracy: {}".format(Logis_final.score(X_tscv, y_test)))


'n_jobs' > 1 does not have any effect when 'solver' is set to 'liblinear'. Got 'n_jobs' = 12.



Training Accuracy: 0.807149695117686
Testing Accuracy: 0.8065872571748341


In [35]:
predicted = Logis_final.predict(X_tscv)

print("Test Precision: {}".format(precision_score(y_test, predicted)))
print("Test Recall: {}".format(recall_score(y_test, predicted)))
print("F1 Score: {}".format(f1_score(y_test, predicted)))

Test Precision: 0.7970996374546818
Test Recall: 0.2202646215497288
F1 Score: 0.3451523845612516


### Observation :
* Good initial model as Test and Train Accuracy are not very different

### Dummy Classifier
We can see that the accuracy scores are not that bad. However, the precision and recall scores are just unacceptable. This is because of the imbalanced targets, as most of the targets have label 0 and few have label 1. So even a dumb classifier which always predicts the most common class would give a respectable accuracy score. So we need to compare our classifier's performance with once such most-common-class-classifier

In [36]:
#alphas_alt = {'C': [0.001,0.01,0.1,1,10,100]#,
           #  'max_iter' :[100000,300000,500000]
#             }
alphas_alt = {'strategy': ["most_frequent","stratified","prior","uniform"]#,
           #  'max_iter' :[100000,300000,500000]
             }

Dummy = DummyClassifier()

In [37]:
Dummy = RandomizedSearchCV(estimator = Dummy, param_distributions =alphas_alt, n_iter=4, cv = 5, verbose=2,n_jobs=-1, random_state=42)
Dummy.fit(X_trcv, y_train)

print('Best parameters for Logistic Regression Model : {}'.format(Dummy.best_params_))

Fitting 5 folds for each of 4 candidates, totalling 20 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done   8 out of  20 | elapsed:    0.8s remaining:    1.3s
[Parallel(n_jobs=-1)]: Done  20 out of  20 | elapsed:    1.3s finished


Best parameters for Logistic Regression Model : {'strategy': 'most_frequent'}


In [38]:
#Logis_final = LogisticRegression(C=0.1,n_jobs=-1)
Dummy_final = DummyClassifier(strategy="most_frequent")

Dummy_final.fit(X_trcv, y_train)
#prediction = losso_final.predict(X_test)
print("Training Accuracy: {}".format(Dummy_final.score(X_trcv, y_train)))
print("Testing Accuracy: {}".format(Dummy_final.score(X_tscv, y_test)))

Training Accuracy: 0.7693906550722784
Testing Accuracy: 0.7685906147573747


In [39]:
predicted = Dummy_final.predict(X_tscv)


print("Test Precision: {}".format(precision_score(y_test, predicted)))
print("Test Recall: {}".format(recall_score(y_test, predicted)))
print("F1 Score: {}".format(f1_score(y_test, predicted)))


Precision is ill-defined and being set to 0.0 due to no predicted samples.



Test Precision: 0.0
Test Recall: 0.0



F-score is ill-defined and being set to 0.0 due to no predicted samples.



F1 Score: 0.0


###  Observations:
* not a good model as basically it is predicting same thing for every input

#### Performance:
Therefore, we see that our classifier is not much better than a simple baseline model which just predicts all outputs to be the most frequent class. Therefore, we need better models. Hence we will explore other models better suited for text classification purposes viz Naive Bayes Classifier and Support Vector Machines

### Bernoulli Naive Bayes using Count Vectors as features


In [40]:
#alphas_alt = {'C': [0.001,0.01,0.1,1,10,100]#,
           #  'max_iter' :[100000,300000,500000]
#             }
alphas_alt = {'alpha': [0.001,0.01,0.1,1,10,100]
            
             }

Berrnoulli = BernoulliNB()

In [41]:
Berrnoulli = RandomizedSearchCV(estimator = Berrnoulli, param_distributions =alphas_alt, n_iter=5, cv = 5, verbose=2,n_jobs=-1, random_state=42)
Berrnoulli.fit(X_trcv, y_train)

print('Best parameters for Logistic Regression Model : {}'.format(Berrnoulli.best_params_))

Fitting 5 folds for each of 5 candidates, totalling 25 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  15 out of  25 | elapsed:    3.0s remaining:    1.9s
[Parallel(n_jobs=-1)]: Done  25 out of  25 | elapsed:    3.6s finished


Best parameters for Logistic Regression Model : {'alpha': 100}


In [42]:
#Logis_final = LogisticRegression(C=0.1,n_jobs=-1)
Berrnoulli_final = BernoulliNB(alpha=100)

Berrnoulli_final.fit(X_trcv, y_train)
#prediction = losso_final.predict(X_test)
print("Training Accuracy: {}".format(Berrnoulli_final.score(X_trcv, y_train)))
print("Testing Accuracy: {}".format(Berrnoulli_final.score(X_tscv, y_test)))

Training Accuracy: 0.8031385590688248
Testing Accuracy: 0.8024222559756975


In [43]:
predicted = Berrnoulli_final.predict(X_tscv)


print("Test Precision: {}".format(precision_score(y_test, predicted)))
print("Test Recall: {}".format(recall_score(y_test, predicted)))
print("F1 Score: {}".format(f1_score(y_test, predicted)))

Test Precision: 0.6489092188599578
Test Recall: 0.31854769060697136
F1 Score: 0.4273234932919341


#### Observation
We can see that the precision and recall scores have gone down, but the total accuracy score is almost same. Next we will run:

### Multinomial NB with TFIDF

In [44]:
#alphas_alt = {'C': [0.001,0.01,0.1,1,10,100]#,
           #  'max_iter' :[100000,300000,500000]
#             }
#alphas_alt = {'alpha': [0.001,0.01,0.1,1,10,100]#,
           #  'max_iter' :[100000,300000,500000]
#             }
#alphas_alt = {'alpha': [0.05,0.08,0.1]#,
           #  'max_iter' :[100000,300000,500000]
#             }

alphas_alt = {'alpha': [0.01,0.02,0.03,0.04,0.05,0.07,0.08,0.09]#,
           #  'max_iter' :[100000,300000,500000]
             }
Multinomial = MultinomialNB()

In [45]:
Multinomial = RandomizedSearchCV(estimator = Multinomial, param_distributions =alphas_alt, n_iter=5, cv = 5, verbose=2,n_jobs=-1, random_state=42)
Multinomial.fit(X_trtf, y_train)

print('Best parameters for Logistic Regression Model : {}'.format(Multinomial.best_params_))

Fitting 5 folds for each of 5 candidates, totalling 25 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done  15 out of  25 | elapsed:    5.6s remaining:    3.7s
[Parallel(n_jobs=-1)]: Done  25 out of  25 | elapsed:    6.8s finished


Best parameters for Logistic Regression Model : {'alpha': 0.03}


In [46]:
Multinomial_final = MultinomialNB(alpha=0.03)

Multinomial_final.fit(X_trtf, y_train)
#prediction = losso_final.predict(X_test)
print("Training Accuracy: {}".format(Multinomial_final.score(X_trtf, y_train)))
print("Testing Accuracy: {}".format(Multinomial_final.score(X_tstf, y_test)))

Training Accuracy: 0.8926946160441445
Testing Accuracy: 0.8516747941482132


In [47]:
predicted = Multinomial_final.predict(X_tstf)

print("Test Precision: {}".format(precision_score(y_test, predicted)))
print("Test Recall: {}".format(recall_score(y_test, predicted)))
print("F1 Score: {}".format(f1_score(y_test, predicted)))

Test Precision: 0.8544921208813698
Test Recall: 0.432721871005631
F1 Score: 0.5745080952162547


#### Improvement:
In this model the precision and accuracy have improved, however, the recall is very low. Lets now play around with SVM Models, which are also used very often in text classification



### SVM Classifier with a linear kernel and TFIDF vectors


In [48]:
#alphas_alt = {'C': [0.001,0.01,0.1,1,10,100],
#             'kernel' :['rbf','linear'],
#              'max_iter' :[1,5,10]
#             }
alphas_alt = {'C': [0.08,0.9,1,1.1,1.2,2],
             'kernel' :['rbf','linear'],
             'max_iter' :[80,100,300]
             }
#alphas_alt = {'alpha': [0.05,0.08,0.1]#,
           #  'max_iter' :[100000,300000,500000]
#             }


SVCmodel = SVC()

In [49]:
SVCmodel = RandomizedSearchCV(estimator = SVCmodel, param_distributions =alphas_alt, n_iter=5, cv = 5, verbose=2,n_jobs=10, random_state=42)
SVCmodel.fit(X_trtf, y_train)

print('Best parameters for Logistic Regression Model : {}'.format(SVCmodel.best_params_))

Fitting 5 folds for each of 7 candidates, totalling 35 fits


[Parallel(n_jobs=10)]: Using backend LokyBackend with 10 concurrent workers.
[Parallel(n_jobs=10)]: Done  35 out of  35 | elapsed: 15.6min finished

Solver terminated early (max_iter=100).  Consider pre-processing your data with StandardScaler or MinMaxScaler.



Best parameters for Logistic Regression Model : {'max_iter': 100, 'kernel': 'rbf', 'C': 1.1}


In [58]:
#SVC_final = SVC(max_iter=80,kernel='linear',C=1)
SVC_final = SVC(max_iter=80,kernel='linear',C=1)

SVC_final.fit(X_trtf, y_train)
#prediction = losso_final.predict(X_test)
print("Training Accuracy: {}".format(SVC_final.score(X_trtf, y_train)))
print("Testing Accuracy: {}".format(SVC_final.score(X_tstf, y_test)))


Solver terminated early (max_iter=80).  Consider pre-processing your data with StandardScaler or MinMaxScaler.



Training Accuracy: 0.7847596815873467
Testing Accuracy: 0.7829242945079543


In [59]:
predicted = SVC_final.predict(X_tstf)


print("Test Precision: {}".format(precision_score(y_test, predicted)))
print("Test Recall: {}".format(recall_score(y_test, predicted)))
print("F1 Score: {}".format(f1_score(y_test, predicted)))

Test Precision: 0.584186308573575
Test Recall: 0.21491000794555568
F1 Score: 0.31422365895545007


### Random Forest with Count Vectors

In [60]:


#alphas_alt = {'n_estimators': [2500,3000,3500,10000,50000,100000],
#             'max_depth':[5,6],
#             'max_features': ['log2',None],
#              'min_samples_leaf':[3,5,7],
#              'min_samples_split':[8,10,12]
#          'min_weight_fraction_leaf':[]
            #  'loss':['huber','quantile'],
            #  'random_state':[42]
#             }
alphas_alt = {'n_estimators': [100,300,500],
             'max_depth':[2,3],
             'max_features': ['log2',None],
              'min_samples_leaf':[3,5,7],
              'min_samples_split':[8,10,12]
#          'min_weight_fraction_leaf':[]
            #  'loss':['huber','quantile'],
            #  'random_state':[42]
             }
RandomForest = RandomForestClassifier()

In [61]:
RandomForest = RandomizedSearchCV(estimator = RandomForest, param_distributions =alphas_alt, n_iter=3, cv = 3, verbose=2,n_jobs=-1, random_state=42)
RandomForest.fit(X_trtf, y_train)

print('Best parameters for RandomForest classifier Model : {}'.format(RandomForest.best_params_))

Fitting 3 folds for each of 3 candidates, totalling 9 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 12 concurrent workers.
[Parallel(n_jobs=-1)]: Done   6 out of   9 | elapsed:  2.6min remaining:  1.3min
[Parallel(n_jobs=-1)]: Done   9 out of   9 | elapsed:  3.5min finished


Best parameters for RandomForest classifier Model : {'n_estimators': 500, 'min_samples_split': 10, 'min_samples_leaf': 7, 'max_features': 'log2', 'max_depth': 3}


In [62]:
RandomForest_final = RandomForestClassifier(n_estimators=500,max_depth=3,min_samples_split=10,min_samples_leaf=7,max_features='log2',n_jobs=-1)

RandomForest_final.fit(X_trtf, y_train)
print("Training Accuracy: {}".format(RandomForest_final.score(X_trtf, y_train)))
print("Testing Accuracy: {}".format(RandomForest_final.score(X_tstf, y_test)))

predicted = RandomForest_final.predict(X_tstf)
print("Test Precision: {}".format(precision_score(y_test, predicted)))
print("Test Recall: {}".format(recall_score(y_test, predicted)))
print("F1 Score: {}".format(f1_score(y_test, predicted)))

Training Accuracy: 0.7693906550722784
Testing Accuracy: 0.7685906147573747



Precision is ill-defined and being set to 0.0 due to no predicted samples.



Test Precision: 0.0
Test Recall: 0.0



F-score is ill-defined and being set to 0.0 due to no predicted samples.



F1 Score: 0.0


#### Refrence:

contraction_mapping:
https://www.kaggle.com/nevermoi/jigsaw-toxic-prediction-by-simple-linearsvr-tfidf    