## Combating Hate Speech Using NLP and Machine Learning

* Objective:
Using NLP and ML, build a model that identifies hate speech in tweets.

* Analysis to be done:
Clean the tweet text with NLP techniques, build a classification model, and apply regularization and hyperparameter tuning with stratified k-fold cross-validation to select the best model.

* Content:
  * `id`: identifier number of the tweet
  * `label`: 0 (non-hate) / 1 (hate)
  * `tweet`: the text of the tweet



## Importing the libraries

In [35]:
import pandas as pd
import numpy as np
import os
import re  # regular expression library

## Read in the csv using pandas

In [36]:
inp_tweets0=pd.read_csv("Tweets_USA.csv")

In [37]:
inp_tweets0.head(10)

Unnamed: 0,id,label,tweet
0,1,0,@user when a father is dysfunctional and is s...
1,2,0,@user @user thanks for #lyft credit i can't us...
2,3,0,bihday your majesty
3,4,0,#model i love u take with u all the time in ...
4,5,0,factsguide: society now #motivation
5,6,0,[2/2] huge fan fare and big talking before the...
6,7,0,@user camping tomorrow @user @user @user @use...
7,8,0,the next school year is the year for exams.ð...
8,9,0,we won!!! love the land!!! #allin #cavs #champ...
9,10,0,@user @user welcome here ! i'm it's so #gr...


In [38]:
inp_tweets0.label.value_counts(normalize=True)

0    0.929854
1    0.070146
Name: label, dtype: float64

#### About 7 percent of the tweets are labeled hateful, so the classes are heavily imbalanced
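With this much imbalance, a classifier that always predicts the majority class already scores high accuracy, which is worth keeping in mind when reading the accuracy figures later on. A minimal sketch, using a hypothetical 100-label toy sample (not the notebook's data) with roughly the same 93/7 split:

```python
from collections import Counter

# Hypothetical toy labels with roughly the notebook's class ratio (93% vs 7%).
toy_labels = [0] * 93 + [1] * 7

# The majority-class baseline predicts the most frequent label for every tweet.
majority_label, majority_count = Counter(toy_labels).most_common(1)[0]
baseline_accuracy = majority_count / len(toy_labels)

print(majority_label, baseline_accuracy)
```

Any model trained on this data should be judged against that baseline rather than against 50%.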

In [39]:
inp_tweets0.tweet.sample().values[0]

"  father's day.god will continue to bless my daddy, those that act like one &amp;me also a potential daddy. god bless our unborn daddy too#"

### Get the tweets into an array for easy text cleanup and manipulation

In [40]:
tweets0=inp_tweets0.tweet.values

In [41]:
len(tweets0)

31962

In [42]:
tweets0[:5]

array([' @user when a father is dysfunctional and is so selfish he drags his kids into his dysfunction.   #run',
       "@user @user thanks for #lyft credit i can't use cause they don't offer wheelchair vans in pdx.    #disapointed #getthanked",
       '  bihday your majesty',
       '#model   i love u take with u all the time in urð\x9f\x93±!!! ð\x9f\x98\x99ð\x9f\x98\x8eð\x9f\x91\x84ð\x9f\x91\x85ð\x9f\x92¦ð\x9f\x92¦ð\x9f\x92¦  ',
       ' factsguide: society now    #motivation'], dtype=object)

# Cleanup
* Normalize case
* Remove user handles beginning with "@"
* Remove URLs

#### Converting all text to lowercase means that variants such as "Love" and "love" are treated as the same term in preprocessing and in later stages of the NLP pipeline

In [43]:
tweets_lower=[twt.lower() for twt in tweets0]

In [44]:
tweets_lower[:5]

[' @user when a father is dysfunctional and is so selfish he drags his kids into his dysfunction.   #run',
 "@user @user thanks for #lyft credit i can't use cause they don't offer wheelchair vans in pdx.    #disapointed #getthanked",
 '  bihday your majesty',
 '#model   i love u take with u all the time in urð\x9f\x93±!!! ð\x9f\x98\x99ð\x9f\x98\x8eð\x9f\x91\x84ð\x9f\x91\x85ð\x9f\x92¦ð\x9f\x92¦ð\x9f\x92¦  ',
 ' factsguide: society now    #motivation']

### Remove user handles beginning with @

`re` is the regular expression module included with Python, used primarily for string searching and manipulation.

In [45]:
import re

`sub` is short for substitute: `re.sub(pattern, repl, string, count=0)` searches `string` (3rd parameter) for every match of the regular expression `pattern` (1st parameter) and replaces each match with `repl` (2nd parameter). The optional `count` argument caps the number of replacements; the default of 0 replaces all matches.

In [46]:
# raw string, so that \w reaches the regex engine instead of being read as a Python string escape
re.sub(r"@\w+","", "@Rahim this course rocks! http://rahimbaig.com/ai")

' this course rocks! http://rahimbaig.com/ai'
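To make the role of `count` concrete, here is a small sketch on a string with two handles (the example text is modeled on the dataset's `@user` placeholders; the variable names are illustrative):

```python
import re

text = "@user @user thanks for #lyft credit"

# Default count=0 replaces every match of the pattern.
all_removed = re.sub(r"@\w+", "", text)

# count=1 replaces only the first match and leaves the rest untouched.
first_removed = re.sub(r"@\w+", "", text, count=1)

print(repr(all_removed))
print(repr(first_removed))
```

For tweet cleanup the default is what we want, since a tweet can mention several users.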

Replace every match of the pattern in each tweet with an empty string

In [47]:
tweets_nouser=[re.sub(r"@\w+","",twt) for twt in tweets_lower]

In [48]:
tweets_nouser[:5]

['  when a father is dysfunctional and is so selfish he drags his kids into his dysfunction.   #run',
 "  thanks for #lyft credit i can't use cause they don't offer wheelchair vans in pdx.    #disapointed #getthanked",
 '  bihday your majesty',
 '#model   i love u take with u all the time in urð\x9f\x93±!!! ð\x9f\x98\x99ð\x9f\x98\x8eð\x9f\x91\x84ð\x9f\x91\x85ð\x9f\x92¦ð\x9f\x92¦ð\x9f\x92¦  ',
 ' factsguide: society now    #motivation']

## Remove URLs

In [49]:
re.sub(r"\w+://\S+","", "@Rahim this course rocks! http://rahimbaig.com/ai")

'@Rahim this course rocks! '

In [50]:
tweets_nourl=[re.sub(r"\w+://\S+","",twt) for twt in tweets_nouser]

In [51]:
tweets_nourl[:5]

['  when a father is dysfunctional and is so selfish he drags his kids into his dysfunction.   #run',
 "  thanks for #lyft credit i can't use cause they don't offer wheelchair vans in pdx.    #disapointed #getthanked",
 '  bihday your majesty',
 '#model   i love u take with u all the time in urð\x9f\x93±!!! ð\x9f\x98\x99ð\x9f\x98\x8eð\x9f\x91\x84ð\x9f\x91\x85ð\x9f\x92¦ð\x9f\x92¦ð\x9f\x92¦  ',
 ' factsguide: society now    #motivation']
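The three cleanup steps so far (lowercasing, handle removal, URL removal) can also be bundled into one helper, which makes it easier to apply the same cleaning to new tweets at prediction time. This is only a sketch; `clean_tweet`, `HANDLE_RE`, and `URL_RE` are names introduced here, not part of the notebook:

```python
import re

HANDLE_RE = re.compile(r"@\w+")     # user handles such as @user
URL_RE = re.compile(r"\w+://\S+")   # scheme://... up to the next whitespace

def clean_tweet(text):
    """Lowercase a tweet, then strip user handles and URLs."""
    text = text.lower()
    text = HANDLE_RE.sub("", text)
    return URL_RE.sub("", text)

cleaned = clean_tweet("@Rahim this course rocks! http://rahimbaig.com/ai")
print(repr(cleaned))
```

Precompiling the patterns avoids re-parsing them for each of the roughly 32,000 tweets, although `re` also caches recently used patterns internally.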

## Tokenize using Tweet Tokenizer from NLTK

Tokenization is the process of breaking down a piece of text into small units called tokens. A token may be a word, part of a word or just characters like punctuation.

In [52]:
from nltk.tokenize  import TweetTokenizer

In [53]:
tkn=TweetTokenizer() #creating an instance

In [54]:
print(tkn.tokenize(tweets_nourl[0]))

['when', 'a', 'father', 'is', 'dysfunctional', 'and', 'is', 'so', 'selfish', 'he', 'drags', 'his', 'kids', 'into', 'his', 'dysfunction', '.', '#run']
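To show what a tokenizer does under the hood, here is a deliberately simplified regex-based stand-in. It is not NLTK's `TweetTokenizer`, whose rules also cover emoticons and other cases; `simple_tokenize` and its pattern are assumptions made for illustration:

```python
import re

# Simplified stand-in for a tweet tokenizer (not NLTK's actual rules):
# a word, optionally prefixed by # or @ and optionally followed by an
# apostrophe part (as in "can't"), or else a single punctuation character.
TOKEN_RE = re.compile(r"[#@]?\w+(?:'\w+)?|[^\w\s]")

def simple_tokenize(text):
    """Return the list of tokens found in text, in order."""
    return TOKEN_RE.findall(text)

tokens = simple_tokenize("i can't use #lyft credit.")
print(tokens)
```

Keeping `#lyft` as a single token is what lets the later step strip the `#` while keeping the hashtag word itself.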


In [55]:
tweet_token=[tkn.tokenize(sent) for sent in tweets_nourl]
print(tweet_token[31000])

['the', 'excitement', 'of', 'ordering', 'a', 'new', 'pole', 'outfit', '#poledance', '#poler', '#poledancer', '#poleoutfit', '#polefitness', '#healthy']


### Remove punctuation, stop words, and other redundant terms such as 'rt' and 'amp'

* Also removing hashtags

* Stop words: A stop word is a commonly used word (such as “the”, “a”, “an”, “in”) that search engines are typically programmed to ignore, both when indexing entries for searching and when retrieving them as the result of a search query.
Such words carry little signal for classification and would only take up feature space and processing time, so we remove them by checking each token against a list of words we consider stop words. NLTK (Natural Language Toolkit) in Python ships stop word lists for a number of languages; the exact set depends on the installed version of the NLTK data.

In [56]:
from nltk.corpus import stopwords
from string import punctuation

In [57]:
stop_nltk= stopwords.words("english")
stop_punct=list(punctuation)

In [58]:
stop_punct

['!',
 '"',
 '#',
 '$',
 '%',
 '&',
 "'",
 '(',
 ')',
 '*',
 '+',
 ',',
 '-',
 '.',
 '/',
 ':',
 ';',
 '<',
 '=',
 '>',
 '?',
 '@',
 '[',
 '\\',
 ']',
 '^',
 '_',
 '`',
 '{',
 '|',
 '}',
 '~']

In [59]:
stop_punct.extend(['...','``',"''",".."]) # add multi-character punctuation tokens to the list

In [60]:
stop_context=['rt','amp']

In [61]:
stop_final=stop_nltk+stop_punct+stop_context

### Function to
* remove stop words from a single tokenized sentence
* remove # tags
* remove terms with length=1

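Define a function `del_stop` that takes one tokenized tweet and returns its terms with the `#` character stripped. A term is kept only if it is not in `stop_final` and its length is greater than 1. Both checks run on the token before the `#` is stripped, so a hashtag such as `#the` passes the filter and is kept as `the`.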

In [62]:
def del_stop(sent):
    # keep terms that are not stop terms and are longer than one character, then strip '#'
    return [re.sub("#","",term) for term in sent if term not in stop_final and len(term)>1]

In [63]:
del_stop(tweet_token[4])

['factsguide', 'society', 'motivation']
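A self-contained variant of the same filter is sketched below with a tiny, hypothetical stop list (`toy_stop`) so it runs without NLTK. It stores the stop terms in a `set`, since membership tests on a set take constant time on average, whereas a list is scanned term by term:

```python
import re

# Hypothetical, tiny stop list for illustration; the notebook builds the real
# one from NLTK's English stop words, string.punctuation, and ['rt', 'amp'].
toy_stop = {"now", "the", "a", ":", "#", "rt", "amp"}

def del_stop_sketch(tokens, stop=toy_stop):
    """Drop stop terms and 1-character terms, then strip '#' from what remains."""
    return [re.sub("#", "", term) for term in tokens
            if term not in stop and len(term) > 1]

result = del_stop_sketch(["factsguide", ":", "society", "now", "#motivation"])
print(result)
```

With the notebook's own list, the equivalent change would be a one-off `set(stop_final)` before the filter is applied to all tweets.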

Obtaining the clean tweets

In [64]:
tweets_clean=[del_stop(tweet) for tweet in tweet_token]

### Check out the top terms in the tweets



In [65]:
from collections import Counter

In [66]:
term_list=[]
for tweet in tweets_clean:
    term_list.extend(tweet)

### Finding the 10 most common words by frequency of appearance

In [67]:
res=Counter(term_list)
res.most_common(10)

[('love', 2748),
 ('day', 2276),
 ('happy', 1684),
 ('time', 1131),
 ('life', 1118),
 ('like', 1047),
 ("i'm", 1018),
 ('today', 1013),
 ('new', 994),
 ('thankful', 946)]

## Data Formatting for predictive Modelling

#### Join the tokens back into strings

In [68]:
tweets_clean[30000]

['never', 'msg', 'first', 'dun', 'msg', 'first', 'disappointed']

 * `str.join()` is a built-in string method that concatenates the elements of a sequence, separated by the string it is called on.

In [69]:
tweets_clean=[" ".join(tweet) for tweet in tweets_clean]

In [70]:
tweets_clean[30000]

'never msg first dun msg first disappointed'

### Separate X and y and perform a 70-30 train-test split

In [71]:
len(tweets_clean)

31962

In [72]:
len(inp_tweets0.label)

31962

In [73]:
X=tweets_clean
y=inp_tweets0.label.values

### Train test split

In [74]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.30,random_state=42)

### Create a document-term matrix using a TF-IDF vectorizer

scikit-learn's `CountVectorizer` transforms text into a vector of raw term counts, with one column per vocabulary term. This notebook uses `TfidfVectorizer` instead, which starts from the same counts and reweights them with TF-IDF.

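* TF-IDF

Term frequency: importance of the term within a document.
$\mathrm{TF}(d,t)$ = number of occurrences of term $t$ in document $d$

Inverse document frequency: rarity of the term across the corpus.
$\mathrm{IDF}(t) = \log\left(\frac{D}{\mathrm{df}(t)}\right)$, where $D$ is the total number of documents and $\mathrm{df}(t)$ is the number of documents that contain term $t$.


$W(d,t) = \mathrm{TF}(d,t) \cdot \log\left(\frac{D}{\mathrm{df}(t)}\right)$

TF-IDF does not operate on raw strings directly. The vectorizer first builds a vocabulary from the corpus, so that each term gets its own column, and then represents each document as a vector of TF-IDF weights over that vocabulary.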
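The weighting formula can be worked through by hand on a toy corpus. The sketch below implements the textbook form, TF times log(D / document frequency), on hypothetical documents. Note that scikit-learn's `TfidfVectorizer` by default uses a smoothed IDF and L2-normalizes each row, so its numbers will differ from these:

```python
import math
from collections import Counter

# Hypothetical toy corpus of already-tokenized documents.
docs = [["love", "day"], ["love", "life"], ["hate", "day", "day"]]
D = len(docs)

# Document frequency: number of documents that contain each term
# (set(doc) makes each document count at most once per term).
df = Counter(term for doc in docs for term in set(doc))

def tfidf(doc):
    """Return {term: TF * log(D / df)} for one tokenized document."""
    tf = Counter(doc)
    return {term: count * math.log(D / df[term]) for term, count in tf.items()}

weights = [tfidf(doc) for doc in docs]
print(weights[2])
```

In the third document, "hate" appears once but in only one of three documents, while "day" appears twice but is shared with another document, so the rarer term ends up with the higher weight.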


To extract these features from the tweets, we import `TfidfVectorizer`:



In [75]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [76]:
vectorizer=TfidfVectorizer(max_features=5000) 
# max_features=None is the default value
# If not None, build a vocabulary that only considers the top max_features terms ordered by term frequency across the corpus.

In [77]:
len(X_train), len(X_test)

(22373, 9589)

.fit_transform : Learn vocabulary and idf, return document-term matrix. (reference: sklearn documentation)

In [78]:
X_train_bow=vectorizer.fit_transform(X_train) #fitting the training data

In [79]:
X_test_bow=vectorizer.transform(X_test) #Transform documents to document-term matrix.

In [80]:
X_train_bow.shape,X_test_bow.shape

((22373, 5000), (9589, 5000))

### Model Building

#### Using a simple Logistic regression

In [81]:
from sklearn.linear_model import LogisticRegression

In [82]:
logreg=LogisticRegression() #creating an instance

In [83]:
logreg.fit(X_train_bow,y_train) #fitting the training data

LogisticRegression()

In [84]:
y_train_pred=logreg.predict(X_train_bow)
y_test_pred=logreg.predict(X_test_bow)

In [85]:
from sklearn.metrics import accuracy_score, classification_report

In [86]:
accuracy_score(y_train,y_train_pred)

0.9561078085191973

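The accuracy score on the training data is about 96%. Since roughly 93% of the tweets are non-hate, accuracy alone says little about how well the hate class is detected.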

In [87]:
print(classification_report(y_train,y_train_pred))

              precision    recall  f1-score   support

           0       0.96      1.00      0.98     20815
           1       0.96      0.39      0.55      1558

    accuracy                           0.96     22373
   macro avg       0.96      0.69      0.76     22373
weighted avg       0.96      0.96      0.95     22373



#### When using classification models in machine learning, there are three common metrics that we use to assess the quality of the model:

1. Precision: Percentage of correct positive predictions relative to total positive predictions.

2. Recall: Percentage of correct positive predictions relative to total actual positives.

3. F1 Score: The harmonic mean of precision and recall. The closer to 1, the better the model.

In an imbalanced dataset, the F1 score, unlike accuracy, exposes a poor balance between recall and precision. Our dataset is imbalanced.

The weighted average takes into account how many samples each class has, so a class with fewer samples has less impact on the weighted average of precision, recall, and F1 score.

 *  Here the weighted averages of precision and recall are both 0.96 (0.95 for F1). They stay high because recall and F1 are poor only for label = 1 (0.39 and 0.55), and that class is underrepresented in this dataset, so it counts for less in the weighted average.
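The F1 and weighted-average figures in the training report can be reproduced from its rounded per-class values. A small sketch (the helper names `f1` and `weighted_average` are introduced here for illustration):

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

def weighted_average(values, supports):
    """Average per-class values, weighting each class by its support."""
    return sum(v * s for v, s in zip(values, supports)) / sum(supports)

# Rounded figures from the training report: class 1 has precision 0.96 and
# recall 0.39; supports are 20815 (class 0) and 1558 (class 1).
f1_hate = f1(0.96, 0.39)
weighted_recall = weighted_average([1.00, 0.39], [20815, 1558])

print(round(f1_hate, 2), round(weighted_recall, 2))  # 0.55 0.96
```

The harmonic mean pulls F1 toward the weaker of the two inputs, which is why a recall of 0.39 drags F1 down to 0.55 despite a precision of 0.96.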

### Adjusting for class imbalance

When `class_weight='balanced'`, the model automatically assigns class weights inversely proportional to the class frequencies.
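scikit-learn documents the "balanced" heuristic as `n_samples / (n_classes * np.bincount(y))`. The sketch below applies that formula with the standard library to the class counts of the training split (20815 non-hate, 1558 hate); `balanced_class_weights` is an illustrative helper, not a scikit-learn function:

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight per class: n_samples / (n_classes * count_of_that_class)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}

# Class counts taken from the training report: 20815 non-hate, 1558 hate.
train_labels = [0] * 20815 + [1] * 1558
weights = balanced_class_weights(train_labels)

print({cls: round(w, 2) for cls, w in weights.items()})
```

Under this formula, each misclassified hate tweet costs roughly 13 times as much as a misclassified non-hate tweet, which pushes the decision boundary toward higher recall on the hate class.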

In [88]:
logreg=LogisticRegression(class_weight="balanced")

In [89]:
logreg.fit(X_train_bow,y_train) #training the model

LogisticRegression(class_weight='balanced')

In [90]:
y_train_pred=logreg.predict(X_train_bow)
y_test_pred=logreg.predict(X_test_bow)

In [91]:
accuracy_score(y_train,y_train_pred)

0.9521744960443391

In [92]:
print(classification_report(y_train,y_train_pred))

              precision    recall  f1-score   support

           0       1.00      0.95      0.97     20815
           1       0.60      0.97      0.74      1558

    accuracy                           0.95     22373
   macro avg       0.80      0.96      0.86     22373
weighted avg       0.97      0.95      0.96     22373



#### Recall for the hate class improves sharply (0.39 to 0.97) after assigning class weights, at the cost of precision (0.96 to 0.60)

#### GridSearchCV performs hyperparameter tuning by evaluating every combination in a parameter grid with cross-validation in order to determine the best-scoring values for a given model. The performance of a model can depend significantly on the values of its hyperparameters.

#### Stratified k-fold cross-validation is k-fold cross-validation in which the folds are built by stratified sampling instead of plain random sampling, so every fold preserves the class proportions of the full dataset
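The idea behind stratification can be sketched without scikit-learn by dealing the indices of each class out to the folds in turn. This is a simplified stand-in, not the exact assignment scheme used by `StratifiedKFold`, and the labels are a hypothetical toy sample:

```python
from collections import defaultdict

def stratified_folds(labels, k):
    """Split sample indices into k folds, dealing each class out round-robin."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds

# Hypothetical toy labels: 12 negatives and 4 positives.
toy_labels = [0] * 12 + [1] * 4
folds = stratified_folds(toy_labels, k=4)
positives_per_fold = [sum(toy_labels[i] for i in fold) for fold in folds]

print(positives_per_fold)
```

With only about 7% positives, plain random folds could end up with noticeably different numbers of hate tweets, which would make per-fold recall scores noisy; stratification keeps them comparable.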

In [93]:
from sklearn.model_selection import GridSearchCV,StratifiedKFold

In [94]:
# Create the parameter grid: candidate values for C, the inverse of the regularization strength
param_grid={'C':[0.01,0.1,0.5,1,5,10]}

In [95]:
classifier_lr=LogisticRegression(class_weight="balanced")

A grid  search consists of:

* an estimator (regressor or classifier );

* a parameter space;

* a method for searching or sampling candidates;

* a cross-validation scheme; and

* a score function.

In [96]:
#Instantiate the grid search model
grid_search=GridSearchCV(estimator=classifier_lr,
                        param_grid=param_grid,
                        cv=StratifiedKFold(4),
                        n_jobs=-1, verbose=1,scoring="recall")

In [97]:
grid_search.fit(X_train_bow,y_train)

Fitting 4 folds for each of 6 candidates, totalling 24 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  24 out of  24 | elapsed:   24.2s finished


GridSearchCV(cv=StratifiedKFold(n_splits=4, random_state=None, shuffle=False),
             estimator=LogisticRegression(class_weight='balanced'), n_jobs=-1,
             param_grid={'C': [0.01, 0.1, 0.5, 1, 5, 10]}, scoring='recall',
             verbose=1)
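The "24 fits" in the log follows directly from the grid and the fold count: every candidate is trained once per fold. A short sketch that enumerates the candidates the same way a grid would be expanded (the variable names are illustrative):

```python
from itertools import product

param_grid = {"C": [0.01, 0.1, 0.5, 1, 5, 10]}  # same grid as the notebook
n_folds = 4                                      # StratifiedKFold(4)

# Every combination of parameter values is one candidate.
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

total_fits = len(candidates) * n_folds
print(len(candidates), total_fits)
```

With the default `refit=True`, `GridSearchCV` afterwards fits the best candidate once more on the whole training set, which is why `best_estimator_` can be used for prediction directly. Adding a second hyperparameter to the grid multiplies the candidate count, so the cost grows quickly.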

In [98]:
grid_search.best_estimator_

LogisticRegression(C=0.5, class_weight='balanced')

The best value found is C=0.5, selected by cross-validated recall.

## Using the best estimator to make predictions on the test set

In [99]:
y_test_pred=grid_search.best_estimator_.predict(X_test_bow)

In [100]:
y_train_pred=grid_search.best_estimator_.predict(X_train_bow)

In [101]:
print(classification_report(y_test,y_test_pred))

              precision    recall  f1-score   support

           0       0.98      0.93      0.96      8905
           1       0.47      0.77      0.59       684

    accuracy                           0.92      9589
   macro avg       0.73      0.85      0.77      9589
weighted avg       0.95      0.92      0.93      9589



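### Here the recall for hateful tweets is 0.77 and the weighted-average recall is 0.92, so the model catches most hateful tweets. Precision for hateful tweets is only 0.47, however, so about half of the tweets flagged as hateful are false positives: the model trades precision for recall.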