# **DESCRIPTION**

Using NLP and ML, build a model to identify hate speech (racist or sexist tweets) on Twitter.

**Problem Statement:**

Twitter is one of the largest platforms where anyone can have their views heard. Some of these voices spread hate and negativity, and Twitter is wary of its platform being used as a medium to spread hate.

You are a data scientist at Twitter, and you will help Twitter identify tweets that contain hate speech so they can be removed from the platform. You will use NLP techniques, perform cleanup specific to tweet data, and build a robust model.

Domain: Social Media

Analysis to be done: Clean up the tweets and build a classification model using NLP techniques and tweet-specific cleanup. Then apply regularization and hyperparameter tuning with stratified k-fold cross-validation to select the best model.

Content: 

id: identifier number of the tweet

label: 0 (non-hate) / 1 (hate)

tweet: the text of the tweet



In [None]:
#Libraries
import pandas as pd

Load the tweets file using the read_csv function from the pandas package.

In [None]:
ds = pd.read_csv("TwitterHate.csv")
ds.head()

,id,label,tweet
0,1,0,@user when a father is dysfunctional and is s...
1,2,0,@user @user thanks for #lyft credit i can't us...
2,3,0,bihday your majesty
3,4,0,#model i love u take with u all the time in ...
4,5,0,factsguide: society now #motivation


In [None]:
ds.isna().sum()

id       0
label    0
tweet    0
dtype: int64

Get the tweets into a list for easy text cleanup and manipulation.

In [None]:
tweets = ds['tweet'].tolist()
tweets[:5]

[' @user when a father is dysfunctional and is so selfish he drags his kids into his dysfunction.   #run',
 "@user @user thanks for #lyft credit i can't use cause they don't offer wheelchair vans in pdx.    #disapointed #getthanked",
 '  bihday your majesty',
 '#model   i love u take with u all the time in urð\x9f\x93±!!! ð\x9f\x98\x99ð\x9f\x98\x8eð\x9f\x91\x84ð\x9f\x91\x85ð\x9f\x92¦ð\x9f\x92¦ð\x9f\x92¦  ',
 ' factsguide: society now    #motivation']

# **Cleanup steps:**


- Normalize the casing.

In [None]:
tweets_lower = [twts.lower() for twts in tweets]
tweets_lower[:5]

[' @user when a father is dysfunctional and is so selfish he drags his kids into his dysfunction.   #run',
 "@user @user thanks for #lyft credit i can't use cause they don't offer wheelchair vans in pdx.    #disapointed #getthanked",
 '  bihday your majesty',
 '#model   i love u take with u all the time in urð\x9f\x93±!!! ð\x9f\x98\x99ð\x9f\x98\x8eð\x9f\x91\x84ð\x9f\x91\x85ð\x9f\x92¦ð\x9f\x92¦ð\x9f\x92¦  ',
 ' factsguide: society now    #motivation']

- Using regular expressions, remove user handles. These begin with '@'.

In [None]:
import re
# Raw string so that \w reaches the regex engine unchanged
tweets_re = [re.sub(r"@\w+", "", twts) for twts in tweets_lower]
tweets_re[:5]

['  when a father is dysfunctional and is so selfish he drags his kids into his dysfunction.   #run',
 "  thanks for #lyft credit i can't use cause they don't offer wheelchair vans in pdx.    #disapointed #getthanked",
 '  bihday your majesty',
 '#model   i love u take with u all the time in urð\x9f\x93±!!! ð\x9f\x98\x99ð\x9f\x98\x8eð\x9f\x91\x84ð\x9f\x91\x85ð\x9f\x92¦ð\x9f\x92¦ð\x9f\x92¦  ',
 ' factsguide: society now    #motivation']

- Using regular expressions, remove URLs.

In [None]:
# Match scheme://... up to the next whitespace character
tweets_url = [re.sub(r"\w+://\S+", "", twts) for twts in tweets_re]
tweets_url[:5]

['  when a father is dysfunctional and is so selfish he drags his kids into his dysfunction.   #run',
 "  thanks for #lyft credit i can't use cause they don't offer wheelchair vans in pdx.    #disapointed #getthanked",
 '  bihday your majesty',
 '#model   i love u take with u all the time in urð\x9f\x93±!!! ð\x9f\x98\x99ð\x9f\x98\x8eð\x9f\x91\x84ð\x9f\x91\x85ð\x9f\x92¦ð\x9f\x92¦ð\x9f\x92¦  ',
 ' factsguide: society now    #motivation']
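The three steps so far (lowercasing, handle removal, URL removal) can also be folded into one helper. This is a minimal sketch using the same two patterns as the cells above; the name `normalize_tweet` and the sample text are made up for illustration.

```python
import re

HANDLE_RE = re.compile(r"@\w+")      # user handles such as @user
URL_RE = re.compile(r"\w+://\S+")    # scheme://... up to the next whitespace

def normalize_tweet(text):
    """Lowercase a tweet, then strip user handles and URLs."""
    text = text.lower()
    text = HANDLE_RE.sub("", text)
    text = URL_RE.sub("", text)
    return text

# Hypothetical sample tweet, not taken from the dataset
sample = "@User Check THIS out https://example.com/page #Wow"
cleaned = normalize_tweet(sample)
print(repr(cleaned))
```

Leftover whitespace is not a concern at this stage, since the tokenizer in the next step discards it.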

- Using TweetTokenizer from NLTK, tokenize the tweets into individual terms.

In [None]:
from nltk.tokenize import TweetTokenizer
tkn = TweetTokenizer()

In [None]:
tweet_token = [tkn.tokenize(twts) for twts in tweets_url]
tweet_token[0]

['when',
 'a',
 'father',
 'is',
 'dysfunctional',
 'and',
 'is',
 'so',
 'selfish',
 'he',
 'drags',
 'his',
 'kids',
 'into',
 'his',
 'dysfunction',
 '.',
 '#run']

- Remove stop words.
- Remove redundant terms like ‘amp’, ‘rt’, etc.
- Remove ‘#’ symbols from the tweet while retaining the term.
- Extra cleanup by removing terms with a length of 1.

In [None]:
import nltk
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [None]:
from nltk.corpus import stopwords
from string import punctuation

In [None]:
stop_nltk = stopwords.words("english")
stop_punc = list(punctuation)

In [None]:
stop_punc.extend(['...','``',"''",".."])
stop_context = ['amp','rt']

In [None]:
stop_final = stop_nltk + stop_punc + stop_context

In [None]:
def del_stop(sent):
  # Keep terms that are not stop terms and are longer than one character, then strip '#'
  return [re.sub("#", "", term) for term in sent if term not in stop_final and len(term) > 1]

In [None]:
tweets_clean = [del_stop(twts) for twts in tweet_token]
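One detail worth checking: del_stop tests membership before stripping '#', so a hashtag such as '#rt' would pass the filter and end up as 'rt'. The sketch below is one possible variant that strips '#' first and uses a set for faster lookups. The tiny stop list is a stand-in for NLTK's English list, and `clean_tokens` is a hypothetical name.

```python
from string import punctuation

# Hypothetical mini stop list; the notebook uses NLTK's full English list.
STOP_WORDS = {"a", "is", "the", "and", "so"}
STOP_TERMS = STOP_WORDS | set(punctuation) | {"...", "..", "amp", "rt"}

def clean_tokens(tokens):
    """Strip '#' first, then drop stop terms and single-character tokens."""
    stripped = (tok.replace("#", "") for tok in tokens)
    return [tok for tok in stripped if len(tok) > 1 and tok not in STOP_TERMS]

tokens = ["rt", "a", "father", "is", "selfish", ".", "#run", "#rt", "&", "amp"]
result = clean_tokens(tokens)
print(result)
```

Whether hashtagged stop terms should be dropped is a judgment call; the original order keeps them.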

# **Check out the top terms in the tweets:**

- First, get all the tokenized terms into one large list.

- Use Counter to find the 10 most common terms.

In [None]:
from collections import Counter

In [None]:
tweets_list = []
for twt in tweets_clean:
  tweets_list.extend(twt)

In [None]:
res = Counter(tweets_list)
res.most_common(10)

[('love', 2748),
 ('day', 2276),
 ('happy', 1684),
 ('time', 1131),
 ('life', 1118),
 ('like', 1047),
 ("i'm", 1018),
 ('today', 1013),
 ('new', 994),
 ('thankful', 946)]
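The flatten-then-count step can also be written without the intermediate list. A small sketch on made-up token lists, assuming the same list-of-lists shape as tweets_clean:

```python
from collections import Counter
from itertools import chain

# Made-up tokenized documents standing in for tweets_clean
docs = [["love", "day"], ["love", "happy"], ["day", "love"]]

# chain.from_iterable walks every inner list lazily, so no flat copy is built
counts = Counter(chain.from_iterable(docs))
top = counts.most_common(2)
print(top)
```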

# **Data formatting for predictive modeling:**

- Join the tokens back to form strings. This will be required for the vectorizers.

In [None]:
tweets_join = [" ".join(twts) for twts in tweets_clean]

In [None]:
tweets_join[0]

'father dysfunctional selfish drags kids dysfunction run'

- Assign X and y.

In [None]:
X = tweets_join
y = ds.label.values

- Perform train_test_split using sklearn.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
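Because hate tweets are a small minority, a plain random split can leave the two sets with slightly different class ratios. One option, shown here as a sketch on toy data rather than the notebook's actual split, is the `stratify` argument of train_test_split.

```python
from sklearn.model_selection import train_test_split

# Toy data: 100 texts with 10% positives (hypothetical, for illustration only)
X_toy = [f"tweet {i}" for i in range(100)]
y_toy = [1] * 10 + [0] * 90

# stratify=y_toy keeps the share of each class the same in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy
)
train_rate = sum(y_tr) / len(y_tr)
test_rate = sum(y_te) / len(y_te)
print(train_rate, test_rate)
```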

# **Use TF-IDF values of the terms as features in a vector space model.**

- Import the TF-IDF vectorizer from sklearn.


In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

- Instantiate with a maximum of 5000 terms in your vocabulary.


In [None]:
vectorizer = TfidfVectorizer(max_features=5000)

- Fit the vectorizer on the train set and transform it.
- Transform the test set with the same fitted vectorizer.

In [None]:
len(X_train), len(X_test)

(22373, 9589)

In [None]:
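# Learn the vocabulary and IDF weights from the train set only
X_train_bow = vectorizer.fit_transform(X_train)
# Reuse the fitted vocabulary on the test set; refitting would change what each column means
X_test_bow = vectorizer.transform(X_test)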

In [None]:
X_train_bow.shape, X_test_bow

((22373, 5000), <9589x5000 sparse matrix of type '<class 'numpy.float64'>'
 	with 57598 stored elements in Compressed Sparse Row format>)
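A model trained on the train matrix assumes that column i stands for the same term at prediction time. That holds only when the test set is encoded with the vocabulary learned from the train set. The sketch below illustrates the point with a simplified count-based stand-in written in plain Python; it is not sklearn's TfidfVectorizer, and `fit_vocab` and `to_counts` are hypothetical helpers.

```python
def fit_vocab(docs):
    """Map each distinct term to a column index (alphabetical order)."""
    terms = sorted({term for doc in docs for term in doc.split()})
    return {term: idx for idx, term in enumerate(terms)}

def to_counts(doc, vocab):
    """Count vector for one document; terms outside the vocabulary are ignored."""
    row = [0] * len(vocab)
    for term in doc.split():
        if term in vocab:
            row[vocab[term]] += 1
    return row

train_docs = ["love happy day", "hate bad day"]
test_docs = ["bad hate week"]

train_vocab = fit_vocab(train_docs)   # fitted once, on train only
refit_vocab = fit_vocab(test_docs)    # what refitting on test would produce

col_train = train_vocab["hate"]       # column of 'hate' seen by the model
col_refit = refit_vocab["hate"]       # a different column after refitting
test_row = to_counts(test_docs[0], train_vocab)
print(col_train, col_refit, test_row)
```

Refitting on the test documents moves 'hate' to another column, so a model trained on the first layout would read the wrong feature.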

# **Model building: Ordinary Logistic Regression**

- Instantiate Logistic Regression from sklearn with default parameters.
- Fit on the train data.
- Make predictions for the train and the test set.



In [None]:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()

In [None]:
lr.fit(X_train_bow, y_train)

In [None]:
y_pred_train = lr.predict(X_train_bow)
y_pred_test = lr.predict(X_test_bow)

# **Model evaluation: Accuracy, recall, and F1 score.**

- Report the accuracy on the train set.
- Report the recall on the train set: is it high, decent, or low?
- Get the F1 score on the train set.

In [None]:
from sklearn.metrics import accuracy_score, classification_report

In [None]:
accuracy_score(y_train, y_pred_train)

0.9560184150538595

In [None]:
print(classification_report(y_train, y_pred_train))

              precision    recall  f1-score   support

           0       0.96      1.00      0.98     20815
           1       0.96      0.39      0.55      1558

    accuracy                           0.96     22373
   macro avg       0.96      0.69      0.76     22373
weighted avg       0.96      0.96      0.95     22373
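The report shows why accuracy alone is misleading here: accuracy is 0.96 while recall for label 1 is only 0.39. As a rough sketch of how these numbers relate, the snippet below computes them by hand on made-up labels. `precision_recall_f1` is a hypothetical helper, not part of sklearn.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall and F1 for one class, from raw label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Made-up labels: 4 positives, the model finds only one of them
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(accuracy, precision, recall, f1)
```

Accuracy stays fairly high even though three of the four positives are missed, which is the same pattern as in the report.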



# **Adjust for class imbalance: the model favors the majority class (label 0).**

Recall for label 1 is only 0.39 on the train set. Set the class_weight parameter of the LogisticRegression model to compensate.

In [None]:
lr = LogisticRegression(class_weight="balanced")
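With class_weight="balanced", sklearn weights each class by n_samples / (n_classes * class_count), so errors on the rare class cost more. The sketch below reproduces that formula in plain Python using the class counts from the support column of the report above. `balanced_class_weights` is a hypothetical helper written for illustration.

```python
from collections import Counter

def balanced_class_weights(labels):
    """n_samples / (n_classes * count_of_class) for each class."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}

# Class counts taken from the train-set support column: 20815 zeros, 1558 ones
labels = [0] * 20815 + [1] * 1558
weights = balanced_class_weights(labels)
print(weights)
```

On these counts a misclassified hate tweet weighs roughly 13 times as much as a misclassified non-hate tweet.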

**Train again with the adjustment and evaluate.**

- Train the model on the train set.
- Evaluate the predictions on the train set: accuracy, recall, and F1 score.

In [None]:
lr.fit(X_train_bow, y_train)

In [None]:
y_train_pred = lr.predict(X_train_bow)
y_test_pred = lr.predict(X_test_bow)

In [None]:
accuracy_score(y_train, y_train_pred)

0.9527108568363652

In [None]:
print(classification_report(y_train, y_train_pred))

              precision    recall  f1-score   support

           0       1.00      0.95      0.97     20815
           1       0.60      0.97      0.74      1558

    accuracy                           0.95     22373
   macro avg       0.80      0.96      0.86     22373
weighted avg       0.97      0.95      0.96     22373



# **Regularization and Hyperparameter tuning:**

- Import GridSearchCV and StratifiedKFold. Stratified folds keep the class ratio the same in every fold, which matters because of the class imbalance.
- Provide the parameter grid to choose for ‘C’ and ‘penalty’ parameters.
- Use a balanced class weight while instantiating the logistic regression.

In [None]:
from sklearn.model_selection import GridSearchCV, StratifiedKFold

In [None]:
# Parameter grid: regularization strength C and penalty type
param_grid = {
    'C': [0.01,0.1,1,10,100],
    'penalty': ["l1","l2"]
}

In [None]:
classifier_lr = LogisticRegression(class_weight="balanced")

In [None]:
# Instantiate the grid search model
grid_search = GridSearchCV(estimator=classifier_lr, param_grid=param_grid, cv=StratifiedKFold(4), n_jobs=-1, verbose=1, scoring="recall")

# **Find the parameters with the best recall in cross validation.**

- Choose ‘recall’ as the metric for scoring.
- Choose a stratified 4-fold cross-validation scheme.
- Fit on the train set.
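To make the idea of stratified folds concrete, here is a simplified round-robin sketch in plain Python. It is not sklearn's StratifiedKFold algorithm, only an illustration of the property that every fold gets the same share of each class. `stratified_folds` is a hypothetical helper.

```python
from collections import defaultdict

def stratified_folds(labels, n_splits):
    """Assign each sample index to a fold, dealing each class out round-robin."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        for pos, idx in enumerate(indices):
            folds[pos % n_splits].append(idx)
    return folds

# Made-up labels: 8 positives among 40 samples
labels = [1] * 8 + [0] * 32
folds = stratified_folds(labels, n_splits=4)
positives_per_fold = [sum(labels[i] for i in fold) for fold in folds]
print(positives_per_fold)
```

With unstratified folds, a fold could end up with very few positives, which would make the recall estimate for that fold unstable.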

In [None]:
grid_search.fit(X_train_bow, y_train)

Fitting 4 folds for each of 10 candidates, totalling 40 fits


20 fits failed out of a total of 40.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
20 fits failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/dist-packages/sklearn/model_selection/_validation.py", line 686, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.9/dist-packages/sklearn/linear_model/_logistic.py", line 1162, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "/usr/local/lib/python3.9/dist-packages/sklearn/linear_model/_logistic.py", line 54, in _check_solver
    raise ValueError(
ValueError: Solver lbfgs supports only 'l2' or 'none' penalties, got l1 penalty.

        nan 0.73170852        nan 0.69641
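The traceback shows that the default lbfgs solver rejects the l1 penalty, so half of the candidates score nan. One possible fix is a solver that supports both penalties, such as liblinear. The sketch below runs the same grid on synthetic data from make_classification, which stands in for the TF-IDF matrix. The `demo_` names are made up, and recent sklearn releases may emit a deprecation warning for the penalty parameter.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic imbalanced data standing in for the TF-IDF matrix (illustration only)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=20, weights=[0.9, 0.1], random_state=0
)

demo_grid = {"C": [0.01, 0.1, 1, 10, 100], "penalty": ["l1", "l2"]}

# liblinear handles both l1 and l2, so none of the candidates error out
demo_clf = LogisticRegression(class_weight="balanced", solver="liblinear")
demo_search = GridSearchCV(
    estimator=demo_clf,
    param_grid=demo_grid,
    cv=StratifiedKFold(4),
    scoring="recall",
)
demo_search.fit(X_demo, y_demo)
mean_recalls = demo_search.cv_results_["mean_test_score"]
print(demo_search.best_params_)
```

An alternative would be to drop "l1" from the grid and keep the default solver; which option is better depends on whether sparse coefficients are wanted.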

In [None]:
grid_search.best_estimator_

What are the best parameters?

# **Predict and evaluate using the best estimator**

- Use the best estimator from the grid search to make predictions on the test set.
- What is the recall on the test set for the hate tweets (label 1)?
- What is the F1 score?

In [None]:
y_test_pred = grid_search.best_estimator_.predict(X_test_bow)
y_train_pred = grid_search.best_estimator_.predict(X_train_bow)

In [None]:
accuracy_score(y_train, y_train_pred)

0.9527108568363652

In [None]:
print(classification_report(y_test, y_test_pred))

              precision    recall  f1-score   support

           0       0.93      0.87      0.90      8905
           1       0.07      0.13      0.09       684

    accuracy                           0.82      9589
   macro avg       0.50      0.50      0.49      9589
weighted avg       0.87      0.82      0.84      9589

