# Help Twitter Combat Hate Speech Using NLP and Machine Learning

### DESCRIPTION

Using NLP and ML, build a model to identify hate speech (racist or sexist tweets) on Twitter.

### Problem Statement:  

Twitter is the biggest platform where anybody and everybody can have their views heard. Some of these voices spread hate and negativity. Twitter is wary of its platform being used as a medium to spread hate.

You are a data scientist at Twitter, and you will help Twitter in identifying the tweets with hate speech and removing them from the platform. You will use NLP techniques, perform specific cleanup for tweets data, and make a robust model.

### Domain: 
Social Media

Analysis to be done: Clean up tweets and build a classification model by using NLP techniques, cleanup specific for tweets data, regularization and hyperparameter tuning using stratified k-fold and cross-validation to get the best model.

### Content: 

id: identifier number of the tweet

Label: 0 (non-hate) /1 (hate)

Tweet: the text in the tweet

### Tasks: 

1. Load the tweets file using read_csv function from Pandas package. 

2. Get the tweets into a list for easy text cleanup and manipulation.

3. To cleanup: 

    1. Normalize the casing.

    2. Using regular expressions, remove user handles. These begin with '@'.

    3. Using regular expressions, remove URLs.

    4. Using TweetTokenizer from NLTK, tokenize the tweets into individual terms.

    5. Remove stop words.

    6. Remove redundant terms like ‘amp’, ‘rt’, etc.

    7. Remove ‘#’ symbols from the tweet while retaining the term.

4. Extra cleanup by removing terms with a length of 1.

5. Check out the top terms in the tweets:

    1. First, get all the tokenized terms into one large list.

    2. Use the counter and find the 10 most common terms.

6. Data formatting for predictive modeling:

    1. Join the tokens back to form strings. This will be required for the vectorizers.

    2. Assign x and y.

    3. Perform train_test_split using sklearn.

7. We’ll use TF-IDF values for the terms as a feature to get into a vector space model.

    1. Import the TF-IDF vectorizer from sklearn.

    2. Instantiate with a maximum of 5000 terms in your vocabulary.

    3. Fit and apply on the train set.

    4. Apply on the test set.

8. Model building: Ordinary Logistic Regression

    1. Instantiate Logistic Regression from sklearn with default parameters.

    2. Fit on the train data.

    3. Make predictions for the train and the test set.

9. Model evaluation: Accuracy, recall, and f_1 score.

    1. Report the accuracy on the train set.

    2. Report the recall on the train set: decent, high, or low.

    3. Get the f1 score on the train set.

10. Looks like you need to adjust the class imbalance, as the model seems to focus on the 0s.

    1. Adjust the appropriate class in the LogisticRegression model.

11. Train again with the adjustment and evaluate.

    1. Train the model on the train set.

    2. Evaluate the predictions on the train set: accuracy, recall, and f_1 score.

12. Regularization and Hyperparameter tuning:

    1. Import GridSearch and StratifiedKFold because of class imbalance.

    2. Provide the parameter grid to choose for ‘C’ and ‘penalty’ parameters.

    3. Use a balanced class weight while instantiating the logistic regression.

13. Find the parameters with the best recall in cross validation.

    1. Choose ‘recall’ as the metric for scoring.

    2. Choose stratified 4 fold cross validation scheme.

    3. Fit on the train set.

14. What are the best parameters?

15. Predict and evaluate using the best estimator.

    1. Use the best estimator from the grid search to make predictions on the test set.

    2. What is the recall on the test set for the hate tweets (label 1)?

    3. What is the f_1 score?

## Import the required libraries

In [2]:
import pandas as pd
import numpy as np
import os
import re
from nltk.tokenize import TweetTokenizer
from nltk.corpus import stopwords
from string import punctuation
from collections import Counter
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import GridSearchCV, StratifiedKFold

`re`: Python's regular expression module. A regular expression (RegEx) is a sequence of characters that forms a search pattern.

A RegEx can be used to check whether a string contains the specified pattern, and to replace or remove the parts that match.
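
As a minimal sketch of that idea, the snippet below uses `re.search` to test whether a string contains a user handle, a URL, or a digit. The sample text and the helper name `contains` are made up for illustration; the two tweet-related patterns mirror the ones used later in this notebook.

```python
import re

# Hypothetical sample text, not taken from the dataset.
sample = "@user check this out https://example.com #nlp"

def contains(pattern, text):
    """Return True if the regex pattern matches anywhere in text."""
    return re.search(pattern, text) is not None

has_handle = contains(r"@\w+", sample)      # user handle such as @user
has_url = contains(r"\w+://\S+", sample)    # URL such as https://...
has_digit = contains(r"\d", sample)         # any digit

print(has_handle, has_url, has_digit)
```

The patterns are written as raw strings (`r"..."`) so that Python passes backslash sequences such as `\w` to the regex engine unchanged.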

### 1. Load the tweets file using read_csv function from Pandas package.

In [3]:
tweets_data=pd.read_csv("TwitterHate.csv")

#### print loaded data

In [4]:
tweets_data

Unnamed: 0,id,label,tweet
0,1,0,@user when a father is dysfunctional and is s...
1,2,0,@user @user thanks for #lyft credit i can't us...
2,3,0,bihday your majesty
3,4,0,#model i love u take with u all the time in ...
4,5,0,factsguide: society now #motivation
...,...,...,...
31957,31958,0,ate @user isz that youuu?ðððððð...
31958,31959,0,to see nina turner on the airwaves trying to...
31959,31960,0,listening to sad songs on a monday morning otw...
31960,31961,1,"@user #sikh #temple vandalised in in #calgary,..."


#### Print Head of the data

In [5]:
# first 5 rows print
tweets_data.head()

Unnamed: 0,id,label,tweet
0,1,0,@user when a father is dysfunctional and is s...
1,2,0,@user @user thanks for #lyft credit i can't us...
2,3,0,bihday your majesty
3,4,0,#model i love u take with u all the time in ...
4,5,0,factsguide: society now #motivation


In [6]:
# first 25 rows print
tweets_data.head(25)

Unnamed: 0,id,label,tweet
0,1,0,@user when a father is dysfunctional and is s...
1,2,0,@user @user thanks for #lyft credit i can't us...
2,3,0,bihday your majesty
3,4,0,#model i love u take with u all the time in ...
4,5,0,factsguide: society now #motivation
5,6,0,[2/2] huge fan fare and big talking before the...
6,7,0,@user camping tomorrow @user @user @user @use...
7,8,0,the next school year is the year for exams.ð...
8,9,0,we won!!! love the land!!! #allin #cavs #champ...
9,10,0,@user @user welcome here ! i'm it's so #gr...


In [7]:
# show data information
tweets_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31962 entries, 0 to 31961
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      31962 non-null  int64 
 1   label   31962 non-null  int64 
 2   tweet   31962 non-null  object
dtypes: int64(2), object(1)
memory usage: 749.2+ KB


In [8]:
# show data describe
tweets_data.describe()

Unnamed: 0,id,label
count,31962.0,31962.0
mean,15981.5,0.070146
std,9226.778988,0.255397
min,1.0,0.0
25%,7991.25,0.0
50%,15981.5,0.0
75%,23971.75,0.0
max,31962.0,1.0


In [9]:
tweets_data.label

0        0
1        0
2        0
3        0
4        0
        ..
31957    0
31958    0
31959    0
31960    1
31961    0
Name: label, Length: 31962, dtype: int64

In [10]:
# value_counts() returns a Series with the count of each unique value in the column; with normalize=True it returns proportions instead of counts.
tweets_data.label.value_counts(normalize=True)

0    0.929854
1    0.070146
Name: label, dtype: float64

In [11]:
# Show only tweets 
tweets_data.tweet

0         @user when a father is dysfunctional and is s...
1        @user @user thanks for #lyft credit i can't us...
2                                      bihday your majesty
3        #model   i love u take with u all the time in ...
4                   factsguide: society now    #motivation
                               ...                        
31957    ate @user isz that youuu?ðððððð...
31958      to see nina turner on the airwaves trying to...
31959    listening to sad songs on a monday morning otw...
31960    @user #sikh #temple vandalised in in #calgary,...
31961                     thank you @user for you follow  
Name: tweet, Length: 31962, dtype: object

In [12]:
# Series.sample() returns a random sample of items from the Series; by default it returns a single item, sampled without replacement.
tweets_data.tweet.sample()

21159    @user @user @user @user @user haha damn those ...
Name: tweet, dtype: object

In [13]:
tweets_data.tweet.sample().values[0]

'if you want creative workers, give them enough time to play.   #success #quote  '

### Get the tweets into a list for easy text clean up and manipulation 

In [14]:
tweets=tweets_data.tweet.values

In [15]:
tweets

array([' @user when a father is dysfunctional and is so selfish he drags his kids into his dysfunction.   #run',
       "@user @user thanks for #lyft credit i can't use cause they don't offer wheelchair vans in pdx.    #disapointed #getthanked",
       '  bihday your majesty', ...,
       'listening to sad songs on a monday morning otw to work is sad  ',
       '@user #sikh #temple vandalised in in #calgary, #wso condemns  act  ',
       'thank you @user for you follow  '], dtype=object)

In [16]:
# check tweets length
len(tweets)

31962

In [17]:
# show the first five tweets
tweets[:5]

array([' @user when a father is dysfunctional and is so selfish he drags his kids into his dysfunction.   #run',
       "@user @user thanks for #lyft credit i can't use cause they don't offer wheelchair vans in pdx.    #disapointed #getthanked",
       '  bihday your majesty',
       '#model   i love u take with u all the time in urð\x9f\x93±!!! ð\x9f\x98\x99ð\x9f\x98\x8eð\x9f\x91\x84ð\x9f\x91\x85ð\x9f\x92¦ð\x9f\x92¦ð\x9f\x92¦  ',
       ' factsguide: society now    #motivation'], dtype=object)

#### The tweets contain:
      1. URLs
      2. Hashtags
      3. User handles
      4. "RT"

### Cleanup
#### Normalizing case

In [18]:
# convert every tweet to lower case using a list comprehension
tweets_lower = [twt.lower() for twt in tweets]

In [19]:
tweets_lower[:5]

[' @user when a father is dysfunctional and is so selfish he drags his kids into his dysfunction.   #run',
 "@user @user thanks for #lyft credit i can't use cause they don't offer wheelchair vans in pdx.    #disapointed #getthanked",
 '  bihday your majesty',
 '#model   i love u take with u all the time in urð\x9f\x93±!!! ð\x9f\x98\x99ð\x9f\x98\x8eð\x9f\x91\x84ð\x9f\x91\x85ð\x9f\x92¦ð\x9f\x92¦ð\x9f\x92¦  ',
 ' factsguide: society now    #motivation']

#### Using regular expressions, remove user handles. These begin with '@’.

In [20]:
re.sub(r"@\w+", "", " @Hi This is random link ! https://sell.sawbrokers.com/domain/ai.com")


'  This is random link ! https://sell.sawbrokers.com/domain/ai.com'

In [21]:
# use a list comprehension to remove user handles from every tweet
tweets_no_user = [re.sub(r"@\w+", "", twt) for twt in tweets_lower]

In [22]:
tweets_no_user[:5]

['  when a father is dysfunctional and is so selfish he drags his kids into his dysfunction.   #run',
 "  thanks for #lyft credit i can't use cause they don't offer wheelchair vans in pdx.    #disapointed #getthanked",
 '  bihday your majesty',
 '#model   i love u take with u all the time in urð\x9f\x93±!!! ð\x9f\x98\x99ð\x9f\x98\x8eð\x9f\x91\x84ð\x9f\x91\x85ð\x9f\x92¦ð\x9f\x92¦ð\x9f\x92¦  ',
 ' factsguide: society now    #motivation']

#### Using regular expressions, remove URLs.

In [23]:
re.sub(r"\w+://\S+", "", "@Hi This is random link ! https://sell.sawbrokers.com/domain/ai.com")

'@Hi This is random link ! '

In [24]:
without_url_tweets = [re.sub(r"\w+://\S+", "", twt) for twt in tweets_no_user]

In [25]:
without_url_tweets[:5]

['  when a father is dysfunctional and is so selfish he drags his kids into his dysfunction.   #run',
 "  thanks for #lyft credit i can't use cause they don't offer wheelchair vans in pdx.    #disapointed #getthanked",
 '  bihday your majesty',
 '#model   i love u take with u all the time in urð\x9f\x93±!!! ð\x9f\x98\x99ð\x9f\x98\x8eð\x9f\x91\x84ð\x9f\x91\x85ð\x9f\x92¦ð\x9f\x92¦ð\x9f\x92¦  ',
 ' factsguide: society now    #motivation']

#### Using TweetTokenizer from NLTK, tokenize the tweets into individual terms.

In [26]:
?TweetTokenizer


In [27]:
tk=TweetTokenizer()

NLTK's `TweetTokenizer` splits the text of a tweet into individual tokens. Unlike a plain whitespace split, it is designed for social media text: it keeps hashtags, user handles, and emoticons together as single tokens and separates punctuation from words, so each tweet can then be analysed term by term.
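
The sketch below contrasts `TweetTokenizer` with a plain `str.split()`. The two sentences are the examples used in the NLTK documentation for `TweetTokenizer`; the variable names are illustrative. The `strip_handles` and `reduce_len` flags are optional and are not used elsewhere in this notebook.

```python
from nltk.tokenize import TweetTokenizer

# Example sentence from the NLTK documentation for TweetTokenizer.
text = "This is a cooool #dummysmiley: :-) :-P <3 and some arrows < > -> <--"

# Plain whitespace splitting glues punctuation to words ("#dummysmiley:").
naive_tokens = text.split()

# TweetTokenizer keeps hashtags and emoticons as single tokens.
tweet_tokens = TweetTokenizer().tokenize(text)

# Optional flags: drop @handles and shorten long runs of repeated characters.
strict = TweetTokenizer(strip_handles=True, reduce_len=True)
strict_tokens = strict.tokenize("@remy: This is waaaaayyyy too much for you!!!!!!")

print(naive_tokens)
print(tweet_tokens)
print(strict_tokens)
```

With `strip_handles=True` the handle-removal regex would not be needed, but this notebook removes handles explicitly to follow the task list.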

In [28]:
print(tk.tokenize(without_url_tweets[0]))

['when', 'a', 'father', 'is', 'dysfunctional', 'and', 'is', 'so', 'selfish', 'he', 'drags', 'his', 'kids', 'into', 'his', 'dysfunction', '.', '#run']


In [29]:
# tokenize all tweets
tweets_token =[tk.tokenize(sent) for sent in without_url_tweets]

In [30]:
print(tweets_token[0])

['when', 'a', 'father', 'is', 'dysfunctional', 'and', 'is', 'so', 'selfish', 'he', 'drags', 'his', 'kids', 'into', 'his', 'dysfunction', '.', '#run']


#### Remove punctuation, stop words, and other redundant terms like 'rt' and 'amp'

In [31]:
stop_nltk = stopwords.words("english")

##### Removing stop words with NLTK

The process of converting data to something a computer can understand is referred to as pre-processing.
One of the major forms of pre-processing is to filter out uninformative data.
In natural language processing, very common words that carry little meaning on their own are referred to as stop words.

##### What are Stop words?
Stop Words: A stop word is a commonly used word (such as “the”, “a”, “an”, “in”) that a search engine has been programmed to ignore, both when indexing entries for searching and when retrieving them as the result of a search query.

In [32]:
print(stop_nltk)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

##### punctuation
Punctuation marks organize written text, but as standalone tokens they carry little meaning, so they are removed here.
Standard English punctuation includes the period, comma, apostrophe, quotation mark, question mark, exclamation mark, brackets, braces, parentheses, dash, hyphen, ellipsis, colon, and semicolon. `string.punctuation` provides the ASCII punctuation characters as a single string.

In [33]:
stop_punct=list(punctuation)

In [34]:
print(stop_punct)

['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~']


In [35]:
stop_punct.extend(['...','``',"''",".."])
print(stop_punct)

['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~', '...', '``', "''", '..']


In [36]:
stop_context =['rt','amp']

In [37]:
stop_context

['rt', 'amp']

In [38]:
stop_final =stop_nltk + stop_punct + stop_context

In [39]:
print(stop_final)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

#### Function to:
##### Remove stop words from a single tokenized sentence
##### Remove ‘#’ symbols from the tweet while retaining the term.
##### Extra cleanup by removing terms with a length of 1.

In [40]:
def del_stop(sent):
    # keep terms that are not stop words and are longer than 1 character, then strip '#'
    return [re.sub("#", "", term) for term in sent if term not in stop_final and len(term) > 1]

In [41]:
del_stop(tweets_token[4])

['factsguide', 'society', 'motivation']
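
`stop_final` is a list, so every `term not in stop_final` test scans the whole list. One possible variant, sketched below, converts the stop list to a set once and strips `#` with `str.replace`. The names `stop_demo`, `stop_set`, and `del_stop_fast` are hypothetical, and the tiny stop list is a stand-in so that the snippet runs on its own.

```python
# Tiny stand-in for stop_final so that the sketch runs on its own.
stop_demo = ["is", "a", "the", "and", "#", ".", "rt", "amp"]
stop_set = set(stop_demo)  # set membership tests take constant time on average

def del_stop_fast(tokens):
    """Drop stop words and 1-character tokens, then strip '#' from the rest."""
    return [
        term.replace("#", "")
        for term in tokens
        if term not in stop_set and len(term) > 1
    ]

tokens = ["rt", "society", "is", "a", "#motivation", ".", "u"]
cleaned = del_stop_fast(tokens)
print(cleaned)
```

In the notebook itself, the same idea would amount to building `set(stop_final)` once before the cleanup loop.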

In [42]:
tweets_clean = [del_stop(tweet) for tweet in tweets_token]

In [43]:
# print only the first few cleaned tweets; printing all 31,962 would flood the output
print(tweets_clean[:5])



#### Check out the top terms in the tweets:

##### First, get all the tokenized terms into one large list.

##### Use the counter and find the 10 most common terms.

In [44]:
term_list =[]
for tweet in tweets_clean:
    term_list.extend(tweet)

In [45]:
# the 10 most common terms
res=Counter(term_list)
res.most_common(10)

[('love', 2748),
 ('day', 2276),
 ('happy', 1684),
 ('time', 1131),
 ('life', 1118),
 ('like', 1047),
 ("i'm", 1018),
 ('today', 1013),
 ('new', 994),
 ('thankful', 946)]
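
The flatten-then-count steps above can also be written without the intermediate `term_list`. The following is a small sketch on made-up data; `demo_clean` stands in for `tweets_clean` (a list of token lists) and is not part of the dataset.

```python
from collections import Counter
from itertools import chain

# Hypothetical cleaned tweets (lists of tokens), standing in for tweets_clean.
demo_clean = [
    ["love", "day", "happy"],
    ["love", "life"],
    ["happy", "love"],
]

# chain.from_iterable flattens the nested lists lazily, so Counter can
# consume the terms without building an intermediate list.
term_counts = Counter(chain.from_iterable(demo_clean))
top_two = term_counts.most_common(2)
print(top_two)
```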

### Data formatting for predictive modeling
###### Join the tokens back into strings

In [46]:
tweets_clean[0]

['father', 'dysfunctional', 'selfish', 'drags', 'kids', 'dysfunction', 'run']

In [47]:
tweets_clean =[" ".join(tweet) for tweet in tweets_clean]

In [48]:
tweets_clean[0]

'father dysfunctional selfish drags kids dysfunction run'

#### Separate X and Y and perform train test split, 70-30 ratio.

In [49]:
len(tweets_clean)

31962

In [50]:
len(tweets_data.label)

31962

In [51]:
x= tweets_clean
y= tweets_data.label.values

#### train test split

In [52]:
x_train, x_test, y_train, y_test = train_test_split(x,y, test_size=0.30, random_state=42)

#### Create a document-term matrix using the TF-IDF vectorizer

In [53]:
vectorizer =TfidfVectorizer(max_features=5000)

In [54]:
len(x_train), len(x_test)

(22373, 9589)

In [55]:
x_train_bow=vectorizer.fit_transform(x_train)
x_test_bow = vectorizer.transform(x_test)

In [56]:
x_train_bow.shape, x_test_bow.shape

((22373, 5000), (9589, 5000))
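
To make the TF-IDF values less of a black box, here is a hand-rolled sketch of the weighting that scikit-learn documents for its default settings (`smooth_idf=True`, `norm="l2"`): the raw term count multiplied by `ln((1 + n) / (1 + df)) + 1`, with each row then scaled to unit length. The three documents and all names are made up, and the sketch skips details such as the vectorizer's own tokenization and the `max_features` cut-off.

```python
import math
from collections import Counter

# Hypothetical cleaned tweets; each is a space-joined string like tweets_clean.
docs = ["love happy day", "love life", "hate hate day"]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term appear?
df = Counter(term for tokens in tokenized for term in set(tokens))

def idf(term):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tfidf_row(tokens):
    """TF-IDF weights for one document, scaled to unit Euclidean length."""
    counts = Counter(tokens)
    raw = {term: count * idf(term) for term, count in counts.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {term: w / norm for term, w in raw.items()}

rows = [tfidf_row(tokens) for tokens in tokenized]
for doc, row in zip(docs, rows):
    print(doc, {term: round(w, 3) for term, w in row.items()})
```

Terms that occur in fewer documents (here `happy` or `life`) receive a larger weight than terms shared by several documents (here `love` or `day`).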

### Model building
#### Using a simple Logistic Regression

In [57]:
from sklearn.linear_model import LogisticRegression

In [58]:
logreg = LogisticRegression()

In [59]:
logreg.fit(x_train_bow, y_train)

LogisticRegression()

In [60]:
y_train_pred = logreg.predict(x_train_bow)
y_test_pred = logreg.predict(x_test_bow)

In [61]:
accuracy_score(y_train,y_train_pred)

0.9560184150538595

In [62]:
print(classification_report(y_train, y_train_pred))

              precision    recall  f1-score   support

           0       0.96      1.00      0.98     20815
           1       0.96      0.39      0.55      1558

    accuracy                           0.96     22373
   macro avg       0.96      0.69      0.76     22373
weighted avg       0.96      0.96      0.95     22373



### Adjusting for class imbalance

In [63]:
logreg = LogisticRegression(class_weight="balanced")

In [None]:
logreg.fit(x_train_bow, y_train)

LogisticRegression(class_weight='balanced')
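
For intuition about what `class_weight="balanced"` does, the sketch below reproduces the weights by hand. scikit-learn documents the balanced weight of a class as `n_samples / (n_classes * count_of_class)`; the class counts are the training-set supports from the classification report above, and the variable names are illustrative.

```python
# Class counts in the training set, read off the classification report above.
n_neg = 20815  # label 0 (non-hate)
n_pos = 1558   # label 1 (hate)

n_samples = n_neg + n_pos
n_classes = 2

# Documented formula for class_weight="balanced":
# n_samples / (n_classes * count_of_class)
weight_neg = n_samples / (n_classes * n_neg)
weight_pos = n_samples / (n_classes * n_pos)

print(round(weight_neg, 3), round(weight_pos, 3))
```

Each hate tweet therefore counts roughly 13 times as much as a non-hate tweet in the loss, which pushes the model to stop ignoring the minority class.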

In [None]:
y_train_pred = logreg.predict(x_train_bow)
y_test_pred = logreg.predict(x_test_bow)

In [None]:
accuracy_score(y_train,y_train_pred)

In [None]:
print(classification_report(y_train, y_train_pred))

In [None]:
# Create the parameter grid for 'C' (inverse of regularization strength) and 'penalty'
param_grid = {
    'C': [0.01,0.1,1,10,100],
    'penalty': ["l1","l2"]
}

In [None]:
?LogisticRegression

In [None]:
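# liblinear supports both the l1 and l2 penalties in the grid; the default lbfgs solver does not support l1
classifier_lr = LogisticRegression(class_weight="balanced", solver="liblinear")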

In [None]:
classifier_lr

In [None]:
# Instantiate the grid search model
grid_search = GridSearchCV(estimator = classifier_lr, param_grid = param_grid, 
                          cv = StratifiedKFold(4), n_jobs = -1, verbose = 1, scoring = "recall" )
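
The reason for using `StratifiedKFold` rather than a plain k-fold split is that every fold keeps roughly the same share of hate tweets. The following sketch shows this on made-up labels; `y_demo` and `X_demo` are hypothetical and unrelated to the tweet data.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced labels: 40 negatives and 8 positives.
y_demo = np.array([0] * 40 + [1] * 8)
X_demo = np.zeros((len(y_demo), 1))  # feature values are irrelevant for splitting

skf = StratifiedKFold(n_splits=4)
positives_per_fold = [
    int(y_demo[val_idx].sum()) for _, val_idx in skf.split(X_demo, y_demo)
]
print(positives_per_fold)
```

Each of the four validation folds receives the same number of positives, so recall is computed on a comparable number of hate examples in every fold.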

In [None]:
grid_search.fit(x_train_bow, y_train)

In [None]:
grid_search.best_estimator_

### Using the best estimator to make predictions on the test set

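## What is the F1 score?
The F1 score combines the precision and recall of a classifier into a single metric by taking their harmonic mean: F1 = 2 * precision * recall / (precision + recall). It is mainly used to compare the performance of classifiers. Suppose that classifier A has higher recall and classifier B has higher precision; comparing their F1 scores shows which of the two strikes the better balance between precision and recall.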
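
As a worked example of the harmonic mean, the sketch below recomputes the F1 score of class 1 from the precision (0.96) and recall (0.39) in the first training-set report, and then compares two hypothetical classifiers A and B. The function name and the A/B numbers are made up for illustration.

```python
def f1_score_from(precision, recall):
    """Harmonic mean of precision and recall; 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall of class 1 from the first training-set report above.
f1_hate = f1_score_from(0.96, 0.39)

# Hypothetical classifiers: A favours recall, B favours precision.
f1_a = f1_score_from(0.60, 0.90)
f1_b = f1_score_from(0.90, 0.50)

print(round(f1_hate, 2), round(f1_a, 2), round(f1_b, 2))
```

Because the harmonic mean is pulled towards the smaller of the two values, a high precision cannot make up for a low recall, which is why the unweighted model scores poorly on class 1.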

In [None]:
y_test_pred = grid_search.best_estimator_.predict(x_test_bow)

In [None]:
print(y_test_pred)

In [None]:
y_train_pred = grid_search.best_estimator_.predict(x_train_bow)

In [None]:
print(y_train_pred)

In [None]:
print(classification_report(y_test, y_test_pred))