## Checking for insults in comments
### An old 2013 Kaggle competition. 



Details [here](https://www.kaggle.com/c/detecting-insults-in-social-commentary)

The challenge is to detect when a comment from a conversation would be considered insulting to another participant in the conversation. Samples could be drawn from conversation streams like news commenting sites, magazine comments, message boards, blogs, text messages, etc.  

The idea is to create a generalizable single-class classifier which could operate in a near real-time mode, scrubbing the filth of the internet away in one pass.

### Explore the data

In [1]:
import pandas as pd # to help us explore the data

In [2]:
df = pd.read_csv('data/train.csv')
df.head(5)

| Unnamed: 0 | Insult | Date | Comment |
|---|---|---|---|
| 0 | 1 | 20120618192155Z | `"You fuck your dad."` |
| 1 | 0 | 20120528192215Z | `"i really don't understand your point.\xa0 It ...` |
| 2 | 0 |  | `"A\\xc2\\xa0majority of Canadians can and has ...` |
| 3 | 0 |  | `"listen if you dont wanna get married to a man...` |
| 4 | 0 | 20120619094753Z | `"C\xe1c b\u1ea1n xu\u1ed1ng \u0111\u01b0\u1edd...` |


It doesn't look like we need the dates; we only need the `Insult` and `Comment` columns.

In [3]:
y = list(df['Insult'])
X = list(df['Comment'])
#X[:4] # uncomment to preview the data; left commented out because it displays some really offensive words

**We need to get rid of the extra quotation marks:**

In [4]:
#Using list comprehension
X = [x[1:-1] for x in X]
#X[:4]
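Note that the slice `x[1:-1]` drops the first and last character unconditionally, so a comment that is not wrapped in quotes would lose real text. A minimal sketch of a more defensive version follows; the helper name `strip_wrapping_quotes` and the sample strings are hypothetical, not part of the original notebook.

```python
def strip_wrapping_quotes(text):
    """Remove one pair of surrounding double quotes, but only if both are present."""
    if len(text) >= 2 and text.startswith('"') and text.endswith('"'):
        return text[1:-1]
    return text

# Invented sample strings for illustration
samples = ['"hello there"', 'no quotes here', '"']
cleaned = [strip_wrapping_quotes(s) for s in samples]
print(cleaned)
```

In the notebook this would be applied as `X = [strip_wrapping_quotes(x) for x in X]`.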

In [5]:
print(len(X), len(y)) #looking good

3947 3947


**We want to use SKLearn to create a model to learn from the data**

In [6]:
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline, FeatureUnion
from sklearn.feature_selection import SelectPercentile, chi2
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score #sklearn.cross_validation was deprecated and then removed in scikit-learn 0.20

In [7]:
# Define pipeline (text vectorizer, selection, logistic regression)
select = SelectPercentile(score_func=chi2, percentile=16)
lr = LogisticRegression(tol=1e-8, penalty='l2', C=10, intercept_scaling=1e3, solver='lbfgs')
char_vect = TfidfVectorizer(ngram_range=(1,5), analyzer='char')
word_vect = TfidfVectorizer(ngram_range=(1,3), analyzer='word', min_df=3)
ft = FeatureUnion([('chars', char_vect), ('words', word_vect)])
clf = make_pipeline(ft, select, lr) # classifier

In [8]:
#run classification
import numpy as np # to easily display mean scores
scores = cross_val_score(clf, X, y, cv=2)
print(np.mean(scores))

0.8441838425635646


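### This score is numerically higher than the winning score on the leaderboard!

The two numbers are not directly comparable, though: the leaderboard ranks AUC on a held-out test set, while the score above is the mean accuracy from 2-fold cross-validation on the training data.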

[Leaderboard from 2013](https://www.kaggle.com/c/detecting-insults-in-social-commentary/leaderboard)
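To put a cross-validated score on the same metric as the leaderboard (AUC, per the competition's evaluation rules), `cross_val_score` accepts a `scoring` argument; without it, a classifier is scored on accuracy. The sketch below shows both metrics side by side on a small invented dataset and a simplified pipeline, so the names `toy_X`, `toy_y` and `toy_clf` are assumptions for illustration only.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Invented toy data standing in for the competition comments
toy_X = ["you are an idiot", "you are a fool", "what an idiot you are", "such a stupid fool",
         "thanks for the help", "i agree with your point", "have a nice day", "great point, thanks"]
toy_y = [1, 1, 1, 1, 0, 0, 0, 0]

# Simplified stand-in for the notebook's pipeline
toy_clf = make_pipeline(TfidfVectorizer(), LogisticRegression())

# Same model, same folds, two different metrics
acc = cross_val_score(toy_clf, toy_X, toy_y, cv=2, scoring='accuracy')
auc = cross_val_score(toy_clf, toy_X, toy_y, cv=2, scoring='roc_auc')
print('accuracy: %.3f, AUC: %.3f' % (np.mean(acc), np.mean(auc)))
```

For the notebook's own model, the equivalent call would be `cross_val_score(clf, X, y, cv=2, scoring='roc_auc')`.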

Let's evaluate a few things

In [9]:
XX = ft.fit_transform(X)
print('n_samples: %s, n_features: %s' % XX.shape)

n_samples: 3947, n_features: 228820


The two vectorizers extracted 228,820 features (one column each) before any feature selection.
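Inside the pipeline, the `SelectPercentile(score_func=chi2, percentile=16)` step then keeps only the highest-scoring share of those columns before they reach the classifier. A minimal sketch of that reduction on invented toy data (the strings, labels and the shorter n-gram range are assumptions chosen to keep the example small):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectPercentile, chi2

# Invented toy data for illustration
toy_X = ["you are an idiot", "what a stupid fool", "you idiot fool",
         "thanks for the help", "i agree with your point", "have a nice day"]
toy_y = [1, 1, 1, 0, 0, 0]

# Character n-grams, as in the notebook, but with a shorter range
vect = TfidfVectorizer(ngram_range=(1, 3), analyzer='char')
features = vect.fit_transform(toy_X)

# Keep roughly the top 16% of features ranked by the chi-squared statistic
selector = SelectPercentile(score_func=chi2, percentile=16)
selected = selector.fit_transform(features, toy_y)

print('before: %d features, after: %d features' % (features.shape[1], selected.shape[1]))
```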

In [10]:
%timeit lr.fit(XX, y) # time taken

6.32 s ± 1.22 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


### Scaling Up!

You can easily run out of memory on your local machine when the dataset is too large. To get around this, process the data in chunks with online (out-of-core) learning, where the model is updated incrementally instead of being fit on the whole dataset at once.  
In a generic setting, this typically looks something like this:

In [None]:
from sklearn.linear_model import SGDClassifier

In [None]:
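# we would now use something like this ('data.csv' and the 'target' column are placeholders)
clf = SGDClassifier(alpha=0.1, learning_rate='optimal')

# chunksize is the number of rows read per iteration, so only one chunk is in memory at a time
for df in pd.read_csv('data.csv', chunksize=20):
    y = df['target'].values
    X = df.drop('target', axis=1).values
    # classes lists every label up front, since a single chunk may not contain them all
    clf.partial_fit(X, y, classes=[-1, 1])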
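For text such as the insult comments, the generic loop above needs one more ingredient: a vectorizer that works chunk by chunk. `TfidfVectorizer` has to see the whole corpus to build its vocabulary, whereas `HashingVectorizer` is stateless and maps every chunk into the same fixed-size feature space. The following is a minimal sketch under stated assumptions, not the notebook's method: the in-memory `toy_stream` stands in for chunks read from disk, and the hyperparameters are illustrative.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# Invented toy chunks of (comments, labels); a real run would stream these from disk
toy_stream = [
    (["you are an idiot", "thanks for the help"], [1, 0]),
    (["what a stupid fool", "i agree with your point"], [1, 0]),
    (["you idiot", "have a nice day"], [1, 0]),
]

# Stateless vectorizer: no fit step, so every chunk lands in the same feature space
hasher = HashingVectorizer(n_features=2**18, analyzer='char',
                           ngram_range=(1, 5), alternate_sign=False)
sgd = SGDClassifier(random_state=0)

for texts, labels in toy_stream:
    X_chunk = hasher.transform(texts)
    # Labels here are 0/1 to match the Insult column
    sgd.partial_fit(X_chunk, labels, classes=[0, 1])

pred = sgd.predict(hasher.transform(["you are a fool"]))
print(pred)
```

The trade-off is that hashed features cannot be mapped back to the n-grams that produced them, so this approach gives up some interpretability in exchange for constant memory use.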