# ☣️ Jigsaw - Super simple Naive Bayes [LB=0.768]

## Very simple Naive Bayes with `LB=0.768`.

Using data from [Toxic Comment Classification Challenge](https://www.kaggle.com/c/jigsaw-toxic-comment-classification-challenge)

I have uploaded this data as a public dataset:
* [jigsaw-toxic-comment-classification-challenge](https://www.kaggle.com/julian3833/jigsaw-toxic-comment-classification-challenge)


# Please, _DO_ upvote!

# Imports

In [1]:
import pandas as pd
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer

# Create train data

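The original competition was multi-label: each comment carries six binary labels (`toxic`, `severe_toxic`, `obscene`, `threat`, `insult`, `identity_hate`).

We collapse them into a single binary toxic / non-toxic target: `y = 1` if at least one of the six labels is set.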

In [2]:
df = pd.read_csv("../input/jigsaw-toxic-comment-classification-challenge/train.csv")
df['y'] = (df[['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']].sum(axis=1) > 0 ).astype(int)
df = df[['comment_text', 'y']].rename(columns={'comment_text': 'text'})
df.sample(5)

Unnamed: 0,text,y
41463,Suitecivil your a pussy ass bitch i'd fucken k...,1
29553,"""\n\n- You are right concerning the WP:SYNTH, ...",0
47889,"""Cab, your not Korean or Korea specialist. Hav...",0
89855,Actually this map was taken from the CIA fact ...,0
113479,But I do have one point to make regarding assu...,0
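To make the label collapse concrete, here is a minimal sketch on a made-up three-row frame (the rows are invented for illustration and are not taken from `train.csv`). It also shows that `any(axis=1)` gives the same result as summing the labels and comparing with zero.

```python
import pandas as pd

# Toy stand-in for the six label columns of train.csv (values are made up).
label_cols = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
toy = pd.DataFrame(
    [[0, 0, 0, 0, 0, 0],   # no label set      -> y = 0
     [1, 0, 1, 0, 1, 0],   # several labels    -> y = 1
     [0, 0, 0, 1, 0, 0]],  # exactly one label -> y = 1
    columns=label_cols,
)

# Same rule as in the notebook: positive if the label sum is above zero.
y_sum = (toy[label_cols].sum(axis=1) > 0).astype(int)

# Equivalent formulation: positive if any label is set.
y_any = toy[label_cols].any(axis=1).astype(int)

print(y_sum.tolist())  # [0, 1, 1]
print(y_any.tolist())  # [0, 1, 1]
```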


# Undersample

The dataset is heavily imbalanced: only about 10% of the comments are toxic. Here we undersample the majority (non-toxic) class so that both classes have the same size. Other strategies might work better.

In [3]:
df['y'].value_counts(normalize=True)

0    0.898321
1    0.101679
Name: y, dtype: float64

In [4]:
min_len = (df['y'] == 1).sum()

In [5]:
df_y0_undersample = df[df['y'] == 0].sample(n=min_len, random_state=201)

In [6]:
df = pd.concat([df[df['y'] == 1], df_y0_undersample])

In [7]:
df['y'].value_counts()

1    16225
0    16225
Name: y, dtype: int64
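As one possible alternative to undersampling, scikit-learn's `ComplementNB` is a Naive Bayes variant intended for imbalanced data, so it can be fit on all rows. This is only a sketch on an invented toy corpus, not something this notebook tried, and it makes no claim about the leaderboard score.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import ComplementNB

# Made-up, deliberately imbalanced toy corpus (4 clean comments, 1 toxic).
toy_texts = [
    "thanks for fixing the article",
    "please add a source for this claim",
    "good point, I agree with the edit",
    "the map was taken from a public site",
    "you are a stupid idiot",
]
toy_labels = [0, 0, 0, 0, 1]

toy_vec = TfidfVectorizer()
toy_X = toy_vec.fit_transform(toy_texts)

# ComplementNB computes each class's weights from the rows of the other
# classes, so no rows have to be dropped to balance the data.
cnb = ComplementNB()
cnb.fit(toy_X, toy_labels)

# Probability of the positive (toxic) class for a new comment.
toy_scores = cnb.predict_proba(toy_vec.transform(["what an idiot"]))[:, 1]
print(toy_scores)
```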

# TF-IDF

In [8]:
vec = TfidfVectorizer()

In [9]:
X = vec.fit_transform(df['text'])
X

<32450x65740 sparse matrix of type '<class 'numpy.float64'>'
	with 1221879 stored elements in Compressed Sparse Row format>
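For readers new to `TfidfVectorizer`, the following sketch on three made-up documents shows what the matrix above contains: one row per document and one column per vocabulary term. It also shows that words never seen during `fit` are silently ignored by `transform`, which matters here because the validation and test comments come from a different dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three invented documents, only for illustration.
docs = ["you are great", "you are awful", "awful awful edit"]

toy_vec = TfidfVectorizer()
toy_X = toy_vec.fit_transform(docs)

# One row per document, one column per distinct term.
print(toy_X.shape)                   # (3, 5)
print(sorted(toy_vec.vocabulary_))   # ['are', 'awful', 'edit', 'great', 'you']

# Terms outside the fitted vocabulary contribute nothing at transform time.
unseen = toy_vec.transform(["brand new words"])
print(unseen.nnz)                    # 0
```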

# Fit Naive Bayes

In [10]:
model = MultinomialNB()
model.fit(X, df['y'])

MultinomialNB()
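The vectorizer and the classifier can also be bundled into a single scikit-learn pipeline, so that raw text goes in and scores come out. The sketch below uses an invented four-comment corpus; `pipe` and the toy texts are illustrative names and data, not part of the original notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up balanced toy corpus: two toxic and two clean comments.
toy_texts = [
    "you are an idiot",
    "shut up idiot",
    "thanks for the edit",
    "nice work on the article",
]
toy_labels = [1, 1, 0, 0]

# TF-IDF followed by Multinomial Naive Bayes, as in the cells above.
pipe = make_pipeline(TfidfVectorizer(), MultinomialNB())
pipe.fit(toy_texts, toy_labels)

# Column 1 of predict_proba is the probability of the toxic class.
toy_scores = pipe.predict_proba(["what an idiot", "thanks for the article"])[:, 1]
print(toy_scores)
```

With a pipeline, the same fitted object handles both `vec.transform` and `model.predict_proba`, which removes the risk of scoring text with a vectorizer that was fit on different data.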

# Validate

In [11]:
df_val = pd.read_csv("../input/jigsaw-toxic-severity-rating/validation_data.csv")

In [12]:
X_less_toxic = vec.transform(df_val['less_toxic'])
X_more_toxic = vec.transform(df_val['more_toxic'])

In [13]:
p1 = model.predict_proba(X_less_toxic)
p2 = model.predict_proba(X_more_toxic)

In [14]:
# Validation accuracy: fraction of pairs where the "more toxic" comment gets the strictly higher score
(p1[:, 1] < p2[:, 1]).mean()

0.6643749169655906
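The metric above is a pairwise ranking accuracy. A small sketch with made-up scores shows how it is computed and that, because the comparison is strict, a tied pair counts as a miss.

```python
import numpy as np

# Invented scores for four (less_toxic, more_toxic) pairs.
less = np.array([0.10, 0.40, 0.70, 0.55])
more = np.array([0.30, 0.35, 0.70, 0.90])

# A pair is correct only if the "more toxic" comment scores strictly higher.
strict_acc = (less < more).mean()

# Share of pairs with identical scores; these are counted as wrong above.
tie_rate = (less == more).mean()

print(strict_acc)  # 0.5
print(tie_rate)    # 0.25
```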

# Submission

In [15]:
df_sub = pd.read_csv("../input/jigsaw-toxic-severity-rating/comments_to_score.csv")
X_test = vec.transform(df_sub['text'])
p3 = model.predict_proba(X_test)

In [16]:
df_sub

Unnamed: 0,comment_id,text
0,114890,"""\n \n\nGjalexei, you asked about whether ther..."
1,732895,"Looks like be have an abuser , can you please ..."
2,1139051,I confess to having complete (and apparently b...
3,1434512,"""\n\nFreud's ideas are certainly much discusse..."
4,2084821,It is not just you. This is a laundry list of ...
...,...,...
7532,504235362,"Go away, you annoying vandal."
7533,504235566,This user is a vandal.
7534,504308177,""" \n\nSorry to sound like a pain, but one by f..."
7535,504570375,Well it's pretty fucking irrelevant now I'm un...


In [17]:
df_sub['score'] = p3[:, 1]

In [18]:
df_sub['score'].count()

7537

In [19]:
# Some scores are tied (7537 comments but only 7507 unique scores); tied comments cannot be ranked against each other
df_sub['score'].nunique()

7507
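If tied scores are a concern, one option is to submit ranks instead of raw probabilities. The sketch below uses pandas' `rank(method='first')` on a made-up frame; note that ties are then broken arbitrarily by row order, so this only guarantees unique values, not a better ordering.

```python
import pandas as pd

# Toy submission with one tied pair of scores (values are invented).
toy_sub = pd.DataFrame({
    'comment_id': [11, 12, 13, 14],
    'score': [0.2, 0.9, 0.2, 0.5],
})

# method='first' assigns distinct ranks, breaking ties by order of appearance.
toy_sub['rank_score'] = toy_sub['score'].rank(method='first')

print(toy_sub['rank_score'].tolist())  # [1.0, 4.0, 2.0, 3.0]
print(toy_sub['rank_score'].nunique()) # 4
```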

In [20]:
df_sub[['comment_id', 'score']].to_csv("submission.csv", index=False)
