### Problem Statement

#### Hate Speech Classification

Hate speech is an unfortunately common occurrence on the Internet. Often social media sites like Facebook and Twitter face the problem of identifying and censoring problematic posts while weighing the right to freedom of speech. The importance of detecting and moderating hate speech is evident from the strong connection between hate speech and actual hate crimes. Early identification of users promoting hate speech could enable outreach programs that attempt to prevent an escalation from speech to action.

The objective of this task is to detect hate speech in tweets. For simplicity, we say a tweet contains hate speech if it expresses racist or sexist sentiment. The task is therefore to distinguish racist or sexist tweets from all other tweets.

Formally, given a training sample of tweets and labels, where label '1' denotes that the tweet is racist/sexist and label '0' denotes that it is not, your objective is to predict the labels on the test dataset.

#### Data Source
https://datahack.analyticsvidhya.com/contest/hate-speech-classification/

In [1]:
# import libraries
from nltk.stem.wordnet import WordNetLemmatizer
from nltk.corpus import stopwords
import pandas as pd
import string
import nltk

# fetch the NLTK stopword list (a no-op if it is already present)
nltk.download('stopwords')

## load dataset
df_train = pd.read_csv("train.csv", encoding='latin-1')
df_test = pd.read_csv("test_tweets.csv", encoding='latin-1')


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Sourav\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [2]:
#Combine into data:
df_train['source']= 'train'
df_test['source'] = 'test'

data=pd.concat([df_train, df_test],ignore_index=True, sort=False)
data.head(5)

Unnamed: 0,id,label,tweet,source
0,1,0.0,@user when a father is dysfunctional and is s...,train
1,2,0.0,@user @user thanks for #lyft credit i can't us...,train
2,3,0.0,bihday your majesty,train
3,4,0.0,#model i love u take with u all the time in ...,train
4,5,0.0,factsguide: society now #motivation,train


In [3]:
import re
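def clean_text(df, text_field):
    # lowercase, then strip @mentions, non-alphanumeric characters, URLs and a leading "rt"
    df[text_field] = df[text_field].str.lower()
    # "http\S+" removes the rest of a URL-like token; the lazy "http.+?" only consumed one extra character
    df[text_field] = df[text_field].apply(lambda elem: re.sub(r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)|^rt|http\S+", "", elem))
    return df

data = clean_text(data, "tweet")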
data.head()


Unnamed: 0,id,label,tweet,source
0,1,0.0,when a father is dysfunctional and is so sel...,train
1,2,0.0,thanks for lyft credit i cant use cause they...,train
2,3,0.0,bihday your majesty,train
3,4,0.0,model i love u take with u all the time in u...,train
4,5,0.0,factsguide society now motivation,train
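
To make the effect of the regular expression easier to see, here is a minimal sketch that applies a pattern modeled on the one in the cell above to a single made-up tweet. The helper name `strip_noise` and the sample text are illustrative assumptions, not part of the dataset.

```python
import re

# Pattern modeled on the cleaning cell: mentions, non-alphanumerics, URLs, leading "rt".
PATTERN = r"(@[A-Za-z0-9]+)|([^0-9A-Za-z \t])|(\w+:\/\/\S+)|^rt|http\S+"

def strip_noise(text):
    """Lowercase the text, then delete every substring the pattern matches."""
    return re.sub(PATTERN, "", text.lower())

sample = "@user Thanks for the #lyft credit! https://t.co/abc123"  # made-up tweet
cleaned = strip_noise(sample)
print(repr(cleaned))  # surrounding spaces are left in place, not collapsed
```

Because matches are deleted rather than replaced by a space, the whitespace around a removed mention or URL survives, which is why the cleaned tweets in the table can start with blanks.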


In [5]:
# text preprocessing
lem = WordNetLemmatizer()
# build the lookup structures once instead of on every call
stopword_set = set(stopwords.words("english"))
punctuations = set(string.punctuation)

def clean_text(text):
    ## lower case
    cleaned = text.lower()

    ## remove punctuation
    cleaned = "".join(character for character in cleaned if character not in punctuations)

    ## remove stopwords
    words = [word for word in cleaned.split() if word not in stopword_set]

    ## normalization - lemmatization (verb form first, then noun form)
    words = [lem.lemmatize(word, "v") for word in words]
    words = [lem.lemmatize(word, "n") for word in words]

    ## join
    return " ".join(words)

# note: this redefines the DataFrame-level clean_text from the earlier cell
data["cleaned"] = data["tweet"].apply(clean_text)
data.head()

Unnamed: 0,id,label,tweet,source,cleaned
0,1,0.0,when a father is dysfunctional and is so sel...,train,father dysfunctional selfish drag kid dysfunct...
1,2,0.0,thanks for lyft credit i cant use cause they...,train,thank lyft credit cant use cause dont offer wh...
2,3,0.0,bihday your majesty,train,bihday majesty
3,4,0.0,model i love u take with u all the time in u...,train,model love u take u time ur
4,5,0.0,factsguide society now motivation,train,factsguide society motivation
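
The punctuation and stopword steps can be tried without any NLTK downloads. The sketch below is a simplified stand-in: `TINY_STOPWORDS` is a hypothetical, deliberately small set (the notebook uses NLTK's English list), the sample sentence is invented, and lemmatization is left out.

```python
import string

# Hypothetical, tiny stopword set used only for this illustration.
TINY_STOPWORDS = {"a", "and", "is", "so", "the", "for", "they", "when"}

def basic_clean(text, stopword_set=TINY_STOPWORDS):
    """Lowercase, drop punctuation characters, then drop stopwords."""
    lowered = text.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    kept = [word for word in no_punct.split() if word not in stopword_set]
    return " ".join(kept)

result = basic_clean("When a father is dysfunctional, and is so selfish!")
print(result)
```

Splitting on whitespace and re-joining also collapses repeated spaces, which is one reason the `cleaned` column is shorter than `tweet` even before stopwords are removed.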


In [6]:
## feature engineering 

## meta features 

data["word_count"] = data["tweet"].apply(lambda x: len(x.split()))
data["word_count_cleand"] = data["cleaned"].apply(lambda x: len(x.split()))

data["char_count"] = data["tweet"].apply(len)
data["char_count_without_spaces"] = data["tweet"].apply(lambda x: len(x.replace(" ", "")))

# number of whitespace-separated tokens made up entirely of digits
data["num_dig"] = data["tweet"].apply(lambda x: sum(w.isdigit() for w in x.split()))

In [7]:
data.head()

Unnamed: 0,id,label,tweet,source,cleaned,word_count,word_count_cleand,char_count,char_count_without_spaces,num_dig
0,1,0.0,when a father is dysfunctional and is so sel...,train,father dysfunctional selfish drag kid dysfunct...,17,7,95,75,0
1,2,0.0,thanks for lyft credit i cant use cause they...,train,thank lyft credit cant use cause dont offer wh...,17,13,106,85,0
2,3,0.0,bihday your majesty,train,bihday majesty,3,2,21,17,0
3,4,0.0,model i love u take with u all the time in u...,train,model love u take u time ur,12,7,50,34,0
4,5,0.0,factsguide society now motivation,train,factsguide society motivation,4,3,37,30,0


In [8]:
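import nltk

# coarse part-of-speech families keyed by Penn Treebank tags; "VBP" is not in the verb list here,
# while the fuller mapping in the later "nlp based features" cell includes it
pos_dic = {"noun" : ["NNP", "NN", "NNS", "NNPS"], "verb" : ["VBZ", "VB", "VBD","VBG", "VBN"]}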
def pos_check(txt, family):
    tags = nltk.pos_tag(nltk.word_tokenize(txt))
    count = 0
    for tag in tags:
        tag = tag[1]
        if tag in pos_dic[family]:
            count += 1 
    return count

# pos_check("They are playing in the ground", "verb")

data["noun_count"] = data["tweet"].apply(lambda x : pos_check(x, "noun"))
data["verb_count"] = data["tweet"].apply(lambda x : pos_check(x, "verb"))

In [9]:
data.head()

Unnamed: 0,id,label,tweet,source,cleaned,word_count,word_count_cleand,char_count,char_count_without_spaces,num_dig,noun_count,verb_count
0,1,0.0,when a father is dysfunctional and is so sel...,train,father dysfunctional selfish drag kid dysfunct...,17,7,95,75,0,3,4
1,2,0.0,thanks for lyft credit i cant use cause they...,train,thank lyft credit cant use cause dont offer wh...,17,13,106,85,0,7,2
2,3,0.0,bihday your majesty,train,bihday majesty,3,2,21,17,0,1,0
3,4,0.0,model i love u take with u all the time in u...,train,model love u take u time ur,12,7,50,34,0,4,0
4,5,0.0,factsguide society now motivation,train,factsguide society motivation,4,3,37,30,0,2,0


In [10]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

cvz = CountVectorizer()
cvz.fit(data["cleaned"].values)
count_vectors = cvz.transform(data["cleaned"].values)

In [11]:
count_vectors

<49159x48885 sparse matrix of type '<class 'numpy.int64'>'
	with 366047 stored elements in Compressed Sparse Row format>
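
The numbers in the matrix summary above are enough to see why a sparse format is used. This small calculation uses only the shape and stored-element count printed there; the variable names are illustrative.

```python
# Shape and stored-element count copied from the sparse-matrix summary above.
n_rows, n_cols, n_stored = 49159, 48885, 366047

density = n_stored / (n_rows * n_cols)  # fraction of cells that hold a non-zero count
avg_per_row = n_stored / n_rows         # distinct vocabulary terms per tweet, on average

print(f"density: {density:.6%}")
print(f"average stored entries per row: {avg_per_row:.2f}")
```

Each tweet touches only a handful of the 48,885 vocabulary columns, so a dense array of the same shape would consist almost entirely of zeros.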

In [12]:
word_tfidf = TfidfVectorizer(max_features=500)
word_tfidf.fit(data["cleaned"].values)
word_vectors_tfidf = word_tfidf.transform(data["cleaned"].values)

In [13]:
ngram_tfidf = TfidfVectorizer(max_features=500, ngram_range=(1,2))
ngram_tfidf.fit(data["cleaned"].values)
ngram_tfidf_tfidf = ngram_tfidf.transform(data["cleaned"].values)

In [14]:
char_tfidf = TfidfVectorizer(max_features=500, analyzer="char")
char_tfidf.fit(data["cleaned"].values)
char_tfidf_tfidf = char_tfidf.transform(data["cleaned"].values)

In [15]:
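# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0 on
tfidf = dict(zip(word_tfidf.get_feature_names_out(), word_tfidf.idf_))
# from_dict is a classmethod, so the column name is passed to it directly; the values are IDF weights
tfidf_idf = pd.DataFrame.from_dict(tfidf, orient="index", columns=["word_tfidf"])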
tfidf_idf

Unnamed: 0,word_tfidf
10,6.598829
1st,6.920034
2016,5.936368
act,6.740241
adapt,6.696890
affirmation,5.335137
allahsoil,6.847009
almost,6.812403
alone,6.833022
already,6.470117


In [16]:
## feature engineering

## meta features (word_count and char_count are recomputed here on the cleaned text)
data["cleaned"] = data["cleaned"].fillna("")

data["digit_count"] = data["tweet"].apply(lambda x: sum(w.isdigit() for w in x.split()))
# "tweet" was lowercased in the cleaning step, so upper_count is 0 for every row
data["upper_count"] = data["tweet"].apply(lambda x: sum(w.isupper() for w in x.split()))
data["word_count"] = data["cleaned"].apply(lambda x: len(x.split()))
data["char_count"] = data["cleaned"].apply(len)
data["char_nospace_count"] = data["cleaned"].apply(lambda x: len(x.replace(" ", "")))

data

Unnamed: 0,id,label,tweet,source,cleaned,word_count,word_count_cleand,char_count,char_count_without_spaces,num_dig,noun_count,verb_count,digit_count,upper_count,char_nospace_count
0,1,0.0,when a father is dysfunctional and is so sel...,train,father dysfunctional selfish drag kid dysfunct...,7,7,53,75,0,3,4,0,0,47
1,2,0.0,thanks for lyft credit i cant use cause they...,train,thank lyft credit cant use cause dont offer wh...,13,13,85,85,0,7,2,0,0,73
2,3,0.0,bihday your majesty,train,bihday majesty,2,2,14,17,0,1,0,0,0,13
3,4,0.0,model i love u take with u all the time in u...,train,model love u take u time ur,7,7,27,34,0,4,0,0,0,21
4,5,0.0,factsguide society now motivation,train,factsguide society motivation,3,3,29,30,0,2,0,0,0,27
5,6,0.0,22 huge fan fare and big talking before they l...,train,22 huge fan fare big talk leave chaos pay disp...,12,12,68,90,1,4,2,1,0,57
6,7,0.0,camping tomorrow danny,train,camp tomorrow danny,3,3,19,20,0,2,1,0,0,17
7,8,0.0,the next school year is the year for exams can...,train,next school year year exam cant think school e...,14,14,95,104,0,8,1,0,0,82
8,9,0.0,we won love the land allin cavs champions clev...,train,love land allin cavs champion cleveland clevel...,7,7,58,61,0,5,2,0,0,52
9,10,0.0,welcome here im its so gr8,train,welcome im gr8,3,3,14,21,0,1,1,0,0,12
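
Two details of the token-level counts in the table above are easy to miss: a token is counted as a digit only if every character in it is a digit, and no token can count as uppercase once the text has been lowercased. The sketch below checks both on short strings; the helper names are illustrative.

```python
def count_digit_tokens(text):
    """Count whitespace-separated tokens made up entirely of digits."""
    return sum(token.isdigit() for token in text.split())

def count_upper_tokens(text):
    """Count whitespace-separated tokens that are fully uppercase."""
    return sum(token.isupper() for token in text.split())

digits_row5 = count_digit_tokens("22 huge fan fare")            # "22" is a pure-digit token
digits_row9 = count_digit_tokens("welcome here im its so gr8")  # "gr8" mixes letters and digits
upper_raw = count_upper_tokens("SO happy today")                # made-up text before lowercasing
upper_lowered = count_upper_tokens("SO happy today".lower())

print(digits_row5, digits_row9, upper_raw, upper_lowered)
```

This matches the table, where the row starting with "22" has a digit count of 1, the "gr8" row has 0, and `upper_count` is 0 throughout.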


In [17]:
## nlp based features 

pos_dic = {
    'noun' : ['NN','NNS','NNP','NNPS'],
    'pron' : ['PRP','PRP$','WP','WP$'],
    'verb' : ['VB','VBD','VBG','VBN','VBP','VBZ'],
    'adj' :  ['JJ','JJR','JJS'],
    'adv' : ['RB','RBR','RBS','WRB']
}

def pos_check(x, flag):
    tags = nltk.pos_tag(nltk.word_tokenize(x))
    count = 0
    for tag in tags:
        tag = tag[1]
        if tag in pos_dic[flag]:
            count += 1
    return count

data['noun_count'] = data['cleaned'].apply(lambda x: pos_check(x, 'noun'))
data['verb_count'] = data['cleaned'].apply(lambda x: pos_check(x, 'verb'))
data.head()

Unnamed: 0,id,label,tweet,source,cleaned,word_count,word_count_cleand,char_count,char_count_without_spaces,num_dig,noun_count,verb_count,digit_count,upper_count,char_nospace_count
0,1,0.0,when a father is dysfunctional and is so sel...,train,father dysfunctional selfish drag kid dysfunct...,7,7,53,75,0,3,1,0,0,47
1,2,0.0,thanks for lyft credit i cant use cause they...,train,thank lyft credit cant use cause dont offer wh...,13,13,85,85,0,10,3,0,0,73
2,3,0.0,bihday your majesty,train,bihday majesty,2,2,14,17,0,2,0,0,0,13
3,4,0.0,model i love u take with u all the time in u...,train,model love u take u time ur,7,7,27,34,0,3,1,0,0,21
4,5,0.0,factsguide society now motivation,train,factsguide society motivation,3,3,29,30,0,2,0,0,0,27


In [18]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer

cvz = CountVectorizer(analyzer='word') 
cvz.fit(data["cleaned"].values)
count_vectors = cvz.transform(data["cleaned"].values)

In [19]:
count_vectors

<49159x48885 sparse matrix of type '<class 'numpy.int64'>'
	with 366047 stored elements in Compressed Sparse Row format>

In [20]:
word_tfidf = TfidfVectorizer(analyzer='word') 
word_tfidf.fit(data["cleaned"].values)
word_vectors_tfidf = word_tfidf.transform(data["cleaned"].values)

ngram_tfidf = TfidfVectorizer(analyzer='word', ngram_range=(1,3)) 
ngram_tfidf.fit(data["cleaned"].values)
ngram_vectors_tfidf = ngram_tfidf.transform(data["cleaned"].values)

char_tfidf = TfidfVectorizer(analyzer='char', ngram_range=(1,3)) 
char_tfidf.fit(data["cleaned"].values)
char_vectors_tfidf = char_tfidf.transform(data["cleaned"].values)

In [21]:
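# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() is available from 1.0 on
tfidf = dict(zip(word_tfidf.get_feature_names_out(), word_tfidf.idf_))
# from_dict is a classmethod, so the column name is passed to it directly; the values are IDF weights
tfidf = pd.DataFrame.from_dict(tfidf, orient='index', columns=['title_word_tfidf'])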
tfidf.sort_values(by=['title_word_tfidf'], ascending=False).head()

Unnamed: 0,title_word_tfidf
000,11.109688
morelebronexcuses,11.109688
morellisices,11.109688
morelovelesshate,11.109688
moremulaaaa,11.109688


In [22]:
tfidf.sort_values(by=['title_word_tfidf'], ascending=False).tail()

Unnamed: 0,title_word_tfidf
amp,4.104353
happy,3.984808
get,3.942265
day,3.555092
love,3.433447


In [23]:
data.columns

Index(['id', 'label', 'tweet', 'source', 'cleaned', 'word_count',
       'word_count_cleand', 'char_count', 'char_count_without_spaces',
       'num_dig', 'noun_count', 'verb_count', 'digit_count', 'upper_count',
       'char_nospace_count'],
      dtype='object')

### Separate train & test:

In [25]:
train = data.loc[data['source']=='train']
test = data.loc[data['source']=='test']

In [26]:
# drop() returns a new DataFrame; reassigning avoids the SettingWithCopyWarning
# that in-place modification of a slice would raise
train = train.drop('source', axis=1)
test = test.drop(['source', 'label'], axis=1)


In [28]:
train.to_csv('train_modified.csv',index=False)
test.to_csv('test_modified.csv',index=False)

In [29]:
train.head()

Unnamed: 0,id,label,tweet,cleaned,word_count,word_count_cleand,char_count,char_count_without_spaces,num_dig,noun_count,verb_count,digit_count,upper_count,char_nospace_count
0,1,0.0,when a father is dysfunctional and is so sel...,father dysfunctional selfish drag kid dysfunct...,7,7,53,75,0,3,1,0,0,47
1,2,0.0,thanks for lyft credit i cant use cause they...,thank lyft credit cant use cause dont offer wh...,13,13,85,85,0,10,3,0,0,73
2,3,0.0,bihday your majesty,bihday majesty,2,2,14,17,0,2,0,0,0,13
3,4,0.0,model i love u take with u all the time in u...,model love u take u time ur,7,7,27,34,0,3,1,0,0,21
4,5,0.0,factsguide society now motivation,factsguide society motivation,3,3,29,30,0,2,0,0,0,27


### Train and Fit Model:

In [64]:
train_modified = pd.read_csv('train_modified.csv')
test_modified = pd.read_csv('test_modified.csv')

In [65]:
# predictors
predictors = ['digit_count', 'upper_count', 'word_count', 'char_count', 'char_nospace_count', 'noun_count', 'verb_count']

Xtrain = train_modified[predictors]
Ytrain = train_modified['label']

from scipy.sparse import hstack, csr_matrix
meta_features = predictors

# data was built as train rows followed by test rows, so the first n_train rows
# of word_vectors_tfidf belong to the training set and the rest to the test set
n_train = len(train_modified)

feature_set1 = train_modified[meta_features]
train = hstack([word_vectors_tfidf[:n_train], csr_matrix(feature_set1.values)], format='csr')
target = train_modified['label'].values
train

feature_set1 = test_modified[meta_features]
test_meta = hstack([word_vectors_tfidf[n_train:], csr_matrix(feature_set1.values)], format='csr')
test_meta
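
Stacking text features next to meta features only works when both blocks have the same number of rows in the same order. The toy sketch below illustrates the row alignment with made-up matrices: `text_matrix` stands in for a TF-IDF matrix fitted on train and test together, and all values are invented.

```python
import numpy as np
from scipy.sparse import csr_matrix, hstack

# Toy stand-in for a text matrix fitted on 5 rows: 3 train rows followed by 2 test rows.
text_matrix = csr_matrix(np.arange(20, dtype=float).reshape(5, 4))

# Made-up meta features for the same rows, already split into train and test.
meta_train = np.array([[1, 7], [0, 13], [0, 2]], dtype=float)
meta_test = np.array([[1, 12], [0, 3]], dtype=float)

n_train = meta_train.shape[0]

# Slice the shared text matrix so each block lines up with its meta features.
train_stacked = hstack([text_matrix[:n_train], csr_matrix(meta_train)], format="csr")
test_stacked = hstack([text_matrix[n_train:], csr_matrix(meta_test)], format="csr")

print(train_stacked.shape, test_stacked.shape)
```

Passing the unsliced five-row matrix together with a three-row meta block would fail with a dimension mismatch, so the slice is what makes the stack valid.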

In [66]:
target='label'
IDcol = 'id'

Ytrain.head()

0    0.0
1    0.0
2    0.0
3    0.0
4    0.0
Name: label, dtype: float64

In [67]:
from sklearn import naive_bayes
from sklearn.linear_model import LogisticRegression
from sklearn import svm
from sklearn import ensemble
import xgboost
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score

In [50]:
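from sklearn.preprocessing import LabelEncoder

# LabelEncoder expects a 1-D array of labels rather than the feature matrix,
# which is what triggered "ValueError: bad input shape ()" in the original call
target = LabelEncoder().fit_transform(train_modified['label'])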

In [None]:
# predictors
#predictors = [x for x in train.columns if x not in [target, IDcol]]

In [68]:
from sklearn.model_selection import train_test_split
trainx, valx, trainy, valy = train_test_split(Xtrain, Ytrain)
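
With a rare positive class, a plain random split can leave the validation set with noticeably more or fewer positives than the training set. One option is the `stratify` argument of `train_test_split`, sketched below on synthetic data; the 7% positive rate, the array sizes and the seed are assumptions chosen for illustration, not values taken from the dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))       # synthetic features
y = np.array([1] * 70 + [0] * 930)   # synthetic labels, 7% positive

# stratify=y keeps the class proportions the same in both parts;
# random_state makes the split reproducible across runs.
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

print(len(y_tr), len(y_val))
print(round(y_tr.mean(), 3), round(y_val.mean(), 3))
```

Fixing `random_state` also means the different models are compared on exactly the same validation rows.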

#### 1. Model: NaiveBayes

In [73]:
import numpy as np

## NaiveBayes
model = naive_bayes.MultinomialNB()
model.fit(trainx, trainy)
preds = model.predict(valx)
accur = accuracy_score(valy, preds)
# caution: labels=np.unique(preds) scores only the classes the model actually predicted
f1 = f1_score(valy, preds, average='weighted', labels=np.unique(preds))

# both scores are computed on the held-out validation split
print('Accuracy Score (Validation): %f' % accur)
print('F1 Score (Validation): %f' % f1)

Accuracy Score (Validation): 0.929045
F1 Score (Validation): 0.963218
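
An F1 score higher than the accuracy is a hint that something is off. Restricting `labels` to the predicted classes drops any class the model never predicts from the weighted average. The dependency-free sketch below reproduces the effect on made-up labels with 92.9% negatives and a model that always predicts 0; the helper functions are a simplified re-implementation for illustration, not scikit-learn's code.

```python
def f1_for_class(y_true, y_pred, cls):
    """Per-class F1 computed from true/false positive and false negative counts."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def weighted_f1(y_true, y_pred, labels):
    """Support-weighted mean of per-class F1 over the given labels only."""
    supports = {c: sum(t == c for t in y_true) for c in labels}
    total = sum(supports.values())
    return sum(supports[c] * f1_for_class(y_true, y_pred, c) for c in labels) / total

y_true = [0] * 929 + [1] * 71  # made-up validation labels
y_pred = [0] * 1000            # a model that never predicts the positive class

restricted = weighted_f1(y_true, y_pred, labels=[0])   # only the predicted class
full = weighted_f1(y_true, y_pred, labels=[0, 1])      # both classes

print(round(restricted, 4), round(full, 4))
```

The restricted score comes out near 0.963 while the score over both classes is near 0.895, so the first figure says nothing about how well hateful tweets are detected. Reporting F1 for the positive class alone would be a more direct measure for this task.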


#### 2. Model: LogisticRegression

In [78]:
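model = LogisticRegression()
model.fit(trainx, trainy)
preds = model.predict(valx)
accur = accuracy_score(valy, preds)
# caution: labels=np.unique(preds) scores only the classes the model actually predicted
f1 = f1_score(valy, preds, average='weighted', labels=np.unique(preds))

# both scores are computed on the held-out validation split
print('Accuracy Score (Validation): %f' % accur)
print('F1 Score (Validation): %f' % f1)

Accuracy Score (Validation): 0.929045
F1 Score (Validation): 0.963218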




#### 3. Model: SVC

In [75]:
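model = svm.SVC()
model.fit(trainx, trainy)
preds = model.predict(valx)
accur = accuracy_score(valy, preds)
# caution: labels=np.unique(preds) scores only the classes the model actually predicted
f1 = f1_score(valy, preds, average='weighted', labels=np.unique(preds))

# both scores are computed on the held-out validation split
print('Accuracy Score (Validation): %f' % accur)
print('F1 Score (Validation): %f' % f1)

Accuracy Score (Validation): 0.929796
F1 Score (Validation): 0.896707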


#### 4. Model: XGBoost

In [76]:
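model = xgboost.XGBClassifier()
model.fit(trainx, trainy)
preds = model.predict(valx)
accur = accuracy_score(valy, preds)
# caution: labels=np.unique(preds) scores only the classes the model actually predicted
f1 = f1_score(valy, preds, average='weighted', labels=np.unique(preds))

# both scores are computed on the held-out validation split
print('Accuracy Score (Validation): %f' % accur)
print('F1 Score (Validation): %f' % f1)

Accuracy Score (Validation): 0.929796
F1 Score (Validation): 0.896707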


#### 5. Model: Ensemble (ExtraTrees)

In [84]:
model = ensemble.ExtraTreesClassifier()
model.fit(trainx, trainy)
preds = model.predict(valx)
accur = accuracy_score(valy, preds)
# caution: labels=np.unique(preds) scores only the classes the model actually predicted
f1 = f1_score(valy, preds, average='weighted', labels=np.unique(preds))

# both scores are computed on the held-out validation split
print('Accuracy Score (Validation): %f' % accur)
print('F1 Score (Validation): %f' % f1)

Accuracy Score (Validation): 0.925541
F1 Score (Validation): 0.908625


