**Steps**

1. Load the tweets file using the read_csv function from the Pandas package.

2. Get the tweets into a list for easy text cleanup and manipulation.

3. Cleanup:
    1. Normalize the casing.
    2. Using regular expressions, remove user handles. These begin with ‘@’.
    3. Using regular expressions, remove URLs.
    4. Using TweetTokenizer from NLTK, tokenize the tweets into individual terms.
    5. Remove stop words.
    6. Remove redundant terms like ‘amp’, ‘rt’, etc.
    7. Remove ‘#’ symbols from the tweet while retaining the term.

4. Extra cleanup: remove terms with a length of 1.

5. Check out the top terms in the tweets:
    1. First, get all the tokenized terms into one large list.
    2. Use Counter to find the 10 most common terms.

6. Data formatting for predictive modeling:
    1. Join the tokens back to form strings. This will be required for the vectorizers.
    2. Assign x and y.
    3. Perform train_test_split using sklearn.

7. We’ll use TF-IDF values for the terms as a feature to get into a vector space model.
    1. Import the TF-IDF vectorizer from sklearn.
    2. Instantiate with a maximum of 5000 terms in your vocabulary.
    3. Fit and apply on the train set.
    4. Apply on the test set.

8. Model building: Ordinary Logistic Regression
    1. Instantiate Logistic Regression from sklearn with default parameters.
    2. Fit on the train data.
    3. Make predictions for the train and the test set.

9. Model evaluation: accuracy, recall, and F1 score.
    1. Report the accuracy on the train set.
    2. Report the recall on the train set: decent, high, or low.
    3. Get the F1 score on the train set.

10. Looks like you need to adjust the class imbalance, as the model seems to focus on the 0s.
    1. Adjust the appropriate class in the LogisticRegression model.

11. Train again with the adjustment and evaluate.
    1. Train the model on the train set.
    2. Evaluate the predictions on the train set: accuracy, recall, and F1 score.

12. Regularization and Hyperparameter tuning:
    1. Import GridSearchCV and StratifiedKFold because of class imbalance.
    2. Provide the parameter grid to choose for ‘C’ and ‘penalty’ parameters.
    3. Use a balanced class weight while instantiating the logistic regression.

13. Find the parameters with the best recall in cross-validation.
    1. Choose ‘recall’ as the metric for scoring.
    2. Choose a stratified 4-fold cross-validation scheme.
    3. Fit on the train set.

14. What are the best parameters?

15. Predict and evaluate using the best estimator.
    1. Use the best estimator from the grid search to make predictions on the test set.
    2. What is the recall on the test set for the toxic comments?
    3. What is the F1 score?

In [65]:
from google.colab import drive
drive.mount('/content/drive/')

Drive already mounted at /content/drive/; to attempt to forcibly remount, call drive.mount("/content/drive/", force_remount=True).


In [66]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib
import nltk
import re
import os

Using matplotlib backend: agg


**1. Load the tweets file using the read_csv function from the Pandas package.**

In [67]:
twitter_data = pd.read_csv('/content/drive/MyDrive/nltk/TwitterHate.csv')
twitter_data.head()

   id  label                                              tweet
0   1      0   @user when a father is dysfunctional and is s...
1   2      0  @user @user thanks for #lyft credit i can't us...
2   3      0                                bihday your majesty
3   4      0  #model   i love u take with u all the time in ...
4   5      0             factsguide: society now    #motivation


In [68]:
twitter_data.isnull().any()

id       False
label    False
tweet    False
dtype: bool

In [69]:
twitter_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31962 entries, 0 to 31961
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      31962 non-null  int64 
 1   label   31962 non-null  int64 
 2   tweet   31962 non-null  object
dtypes: int64(2), object(1)
memory usage: 749.2+ KB


In [70]:
# value_counts is a method and has to be called; on the label column it counts the tweets per class
twitter_data['label'].value_counts()

In [71]:
twitter_data['tweet'].head()

0     @user when a father is dysfunctional and is s...
1    @user @user thanks for #lyft credit i can't us...
2                                  bihday your majesty
3    #model   i love u take with u all the time in ...
4               factsguide: society now    #motivation
Name: tweet, dtype: object

In [72]:
list(twitter_data['tweet'])

[' @user when a father is dysfunctional and is so selfish he drags his kids into his dysfunction.   #run',
 "@user @user thanks for #lyft credit i can't use cause they don't offer wheelchair vans in pdx.    #disapointed #getthanked",
 '  bihday your majesty',
 '#model   i love u take with u all the time in urð\x9f\x93±!!! ð\x9f\x98\x99ð\x9f\x98\x8eð\x9f\x91\x84ð\x9f\x91\x85ð\x9f\x92¦ð\x9f\x92¦ð\x9f\x92¦  ',
 ' factsguide: society now    #motivation',
 '[2/2] huge fan fare and big talking before they leave. chaos and pay disputes when they get there. #allshowandnogo  ',
 ' @user camping tomorrow @user @user @user @user @user @user @user dannyâ\x80¦',
 "the next school year is the year for exams.ð\x9f\x98¯ can't think about that ð\x9f\x98\xad #school #exams   #hate #imagine #actorslife #revolutionschool #girl",
 'we won!!! love the land!!! #allin #cavs #champions #cleveland #clevelandcavaliers  â\x80¦ ',
 " @user @user welcome here !  i'm   it's so #gr8 ! ",
 ' â\x86\x9d #ireland consume

**2. Get the tweets into a list for easy text cleanup and manipulation.**

In [73]:
tweets = twitter_data.tweet.values

In [74]:
len(tweets)

31962

In [75]:
tweets[:5]

array([' @user when a father is dysfunctional and is so selfish he drags his kids into his dysfunction.   #run',
       "@user @user thanks for #lyft credit i can't use cause they don't offer wheelchair vans in pdx.    #disapointed #getthanked",
       '  bihday your majesty',
       '#model   i love u take with u all the time in urð\x9f\x93±!!! ð\x9f\x98\x99ð\x9f\x98\x8eð\x9f\x91\x84ð\x9f\x91\x85ð\x9f\x92¦ð\x9f\x92¦ð\x9f\x92¦  ',
       ' factsguide: society now    #motivation'], dtype=object)

Items to strip from the tweets during cleanup:

1. Hashtag symbols
2. User handles (IDs)
3. URLs
4. 'RT' markers

## **3. Clean Up**
### a. Normalize the casing

In [76]:
tweets_lower = [twt.lower() for twt in tweets]

In [77]:
tweets_lower[:5]

[' @user when a father is dysfunctional and is so selfish he drags his kids into his dysfunction.   #run',
 "@user @user thanks for #lyft credit i can't use cause they don't offer wheelchair vans in pdx.    #disapointed #getthanked",
 '  bihday your majesty',
 '#model   i love u take with u all the time in urð\x9f\x93±!!! ð\x9f\x98\x99ð\x9f\x98\x8eð\x9f\x91\x84ð\x9f\x91\x85ð\x9f\x92¦ð\x9f\x92¦ð\x9f\x92¦  ',
 ' factsguide: society now    #motivation']

In [78]:
# Raw strings keep \w intact for the regex engine and avoid an invalid-escape warning
re.sub(r"@\w+","","@Amresh das attained the program!:/http://amreshbaig.com/ai")

' das attained the program!:/http://amreshbaig.com/ai'

**Using regular expressions, remove user handles. These begin with ‘@’.**

In [79]:
tweets_nouse = [re.sub(r"@\w+","",twt) for twt in tweets_lower]
tweets_nouse[:5]

['  when a father is dysfunctional and is so selfish he drags his kids into his dysfunction.   #run',
 "  thanks for #lyft credit i can't use cause they don't offer wheelchair vans in pdx.    #disapointed #getthanked",
 '  bihday your majesty',
 '#model   i love u take with u all the time in urð\x9f\x93±!!! ð\x9f\x98\x99ð\x9f\x98\x8eð\x9f\x91\x84ð\x9f\x91\x85ð\x9f\x92¦ð\x9f\x92¦ð\x9f\x92¦  ',
 ' factsguide: society now    #motivation']
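
The handle removal above and the URL removal from step 3.3 can be combined into one helper. The following is a minimal sketch using the same two patterns as the notebook; the function name `strip_handles_and_urls`, the sample tweet, and the final whitespace collapse are assumptions added for illustration and are not part of the original cleanup.

```python
import re

HANDLE_RE = re.compile(r"@\w+")    # user handles such as @user
URL_RE = re.compile(r"\w+://\S+")  # scheme://rest, the same pattern the notebook uses

def strip_handles_and_urls(text):
    """Lowercase a tweet, then drop handles and URLs (hypothetical helper)."""
    text = text.lower()
    text = HANDLE_RE.sub("", text)
    text = URL_RE.sub("", text)
    # Assumption: collapse leftover runs of spaces; the notebook leaves them in place
    return " ".join(text.split())

sample = "@User thanks for the ride! details at https://example.com/ride #lyft"
cleaned = strip_handles_and_urls(sample)
print(cleaned)
```

Compiling the patterns once avoids re-parsing them for each of the roughly 32,000 tweets, although `re` also caches recently used patterns internally.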

**Using regular expressions, remove URLs.**

In [80]:
re.sub(r"\w+://\S+","","@Amresh das attained the program!:/http://amreshbaig.com/ai")

'@Amresh das attained the program!:/'

In [81]:
tweets_nourl = [re.sub(r"\w+://\S+","",twt) for twt in tweets_nouse]

In [82]:
tweets_nourl[:5]

['  when a father is dysfunctional and is so selfish he drags his kids into his dysfunction.   #run',
 "  thanks for #lyft credit i can't use cause they don't offer wheelchair vans in pdx.    #disapointed #getthanked",
 '  bihday your majesty',
 '#model   i love u take with u all the time in urð\x9f\x93±!!! ð\x9f\x98\x99ð\x9f\x98\x8eð\x9f\x91\x84ð\x9f\x91\x85ð\x9f\x92¦ð\x9f\x92¦ð\x9f\x92¦  ',
 ' factsguide: society now    #motivation']

**Using TweetTokenizer from NLTK, tokenize the tweets into individual terms.**

In [83]:
from nltk.tokenize import TweetTokenizer

In [84]:
# IPython help takes the bare name, without call parentheses
?TweetTokenizer

In [85]:
tkn = TweetTokenizer()

In [86]:
print(tkn.tokenize(tweets_nourl[0]))

['when', 'a', 'father', 'is', 'dysfunctional', 'and', 'is', 'so', 'selfish', 'he', 'drags', 'his', 'kids', 'into', 'his', 'dysfunction', '.', '#run']


In [87]:
tweet_token = [tkn.tokenize(sent) for sent in tweets_nourl]
print(tweet_token[2])

['bihday', 'your', 'majesty']

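
To make the tokenization step concrete without NLTK installed, here is a rough regex stand-in. It is not how `TweetTokenizer` is implemented and it ignores emoticons and other cases that NLTK handles; the pattern and the name `simple_tweet_tokens` are assumptions chosen only to show that hashtags stay attached to their term, contractions stay whole, and punctuation becomes separate tokens.

```python
import re

# Hypothetical pattern: an optional '#', a word, an optional apostrophe suffix,
# or else a single non-space punctuation character
TOKEN_RE = re.compile(r"#?\w+(?:'\w+)?|[^\w\s]")

def simple_tweet_tokens(text):
    """Split a lowercased tweet into word, hashtag, and punctuation tokens."""
    return TOKEN_RE.findall(text.lower())

tokens = simple_tweet_tokens("Thanks for #lyft credit, I can't use it.")
print(tokens)
```

Keeping `#lyft` as one token is what later allows the `#` to be stripped while the term itself is retained.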

**Remove stop words.**

In [88]:
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [90]:
# punctuation comes from Python's string module, so it needs no NLTK download
from nltk.corpus import stopwords
from string import punctuation

In [91]:
stop_nltk = stopwords.words('english')
print(stop_nltk)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [92]:
stop_punc = list(punctuation)

In [93]:
# Add multi-character punctuation tokens and a stray encoding artifact to the punctuation list
stop_punc.extend(['...','``',"''","..","¦"])

**Remove redundant terms like ‘amp’, ‘rt’, etc.**

In [94]:
stop_context = ['rt', 'amp']

In [95]:
stop_final = stop_nltk + stop_punc + stop_context

**Remove ‘#’ symbols from the tweet while retaining the term.**

In [96]:
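def del_stop(sent):
  # Drop stop words, punctuation, and single-character tokens, then strip '#' from the terms that remain
  return [re.sub("#","",term) for term in sent if term not in stop_final and len(term) > 1]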

In [97]:
tweet_token[5]

['[',
 '2/2',
 ']',
 'huge',
 'fan',
 'fare',
 'and',
 'big',
 'talking',
 'before',
 'they',
 'leave',
 '.',
 'chaos',
 'and',
 'pay',
 'disputes',
 'when',
 'they',
 'get',
 'there',
 '.',
 '#allshowandnogo']

In [98]:
del_stop(tweet_token[5])

['2/2',
 'huge',
 'fan',
 'fare',
 'big',
 'talking',
 'leave',
 'chaos',
 'pay',
 'disputes',
 'get',
 'allshowandnogo']
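
The filtering done by `del_stop` can be sketched with a set instead of a list, which makes each membership test constant time on average. The stop list below is a hypothetical miniature one standing in for the NLTK English list, and `clean_tokens` is an assumed name; the logic otherwise follows the notebook: skip stop words, punctuation, and single-character tokens, then strip `#`.

```python
from string import punctuation

# Hypothetical miniature stop list; the notebook uses NLTK's English list plus 'rt' and 'amp'
STOP_WORDS = {"and", "they", "when", "there", "before", "rt", "amp"}
STOP_SET = STOP_WORDS | set(punctuation)

def clean_tokens(tokens):
    """Keep informative tokens and remove the '#' from hashtags."""
    kept = []
    for term in tokens:
        if term in STOP_SET or len(term) <= 1:
            continue  # stop word, punctuation mark, or single character
        kept.append(term.replace("#", ""))
    return kept

tokens = ["[", "2/2", "]", "huge", "fan", "fare", "and", "big",
          "talking", ".", "#allshowandnogo", "rt"]
result = clean_tokens(tokens)
print(result)
```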

In [99]:
tweet_clean = [del_stop(tweets) for tweets in tweet_token]

In [100]:
tweet_clean[6]

['camping', 'tomorrow', 'dannyâ']

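**4. Extra cleanup by removing terms with a length of 1.**

Single-character terms are already dropped by the length check inside `del_stop` above.

**5. Check out the top terms in the tweets.**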

In [101]:
from collections import Counter

In [102]:
term_list = []
# Loop variable renamed so it does not overwrite the original `tweets` array
for tokens in tweet_clean:
  term_list.extend(tokens)

In [103]:
res = Counter(term_list)
res.most_common(15)

[('love', 2748),
 ('day', 2276),
 ('happy', 1684),
 ('time', 1131),
 ('life', 1118),
 ('like', 1047),
 ("i'm", 1018),
 ('today', 1013),
 ('new', 994),
 ('thankful', 946),
 ('positive', 931),
 ('get', 917),
 ('good', 862),
 ('people', 859),
 ('bihday', 844)]
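
The flatten-and-count step above can also be written without an explicit loop. This is a small sketch on made-up token lists (the `docs` variable is hypothetical), using `itertools.chain.from_iterable` to feed all tokens to `Counter` in one pass.

```python
from collections import Counter
from itertools import chain

# Hypothetical cleaned tweets, each already a list of tokens
docs = [["love", "day"], ["happy", "day"], ["love", "love"]]

# Flatten lazily and count every term across all tweets
counts = Counter(chain.from_iterable(docs))
top = counts.most_common(2)
print(top)
```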

In [104]:
tweet_clean[0]

['father', 'dysfunctional', 'selfish', 'drags', 'kids', 'dysfunction', 'run']

## **6. Data formatting for predictive modeling:**

1. **Join the tokens back to form strings. This will be required for the vectorizers.**

In [105]:
tweet_clean = [" ".join(tweets) for tweets in tweet_clean]

In [106]:
tweet_clean[0]

'father dysfunctional selfish drags kids dysfunction run'

2. **Assign x and y.**


In [107]:
len(tweet_clean)

31962

In [108]:
len(twitter_data.label)

31962

In [109]:
x = tweet_clean
y = twitter_data.label.values

**Perform train_test_split using sklearn.**

In [110]:
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y, test_size=.30, random_state=42)

In [111]:
x_train

['summer timeð summeriscoming swimming picoftheday tattoos issho pulsera fluyendo',
 'dese niggas show dese otha bitches fb snap nd twitter attention dey girl true',
 'boost immune system allow bodies use energy forâ',
 'reading manuscript wanting stop .  . good evening good night',
 'baby says hates today',
 "i'm dj lol womanofmanyhats instamood love silentdisco networkâ",
 'christmas eve christmas adam towards men feminism',
 'lover stop angry visit us lover friend astrologer love',
 'best wishes outside gym fitness macboys blue white grey',
 'stress pretty ditch laugh exercise headisease',
 'creative sent thebigscreen weekend cheers',
 'thankful coffee thankful positive',
 'whisky connoisseur kits fathersday whisky',
 'makes innovative_nous',
 'hapoyfathersday kimkardashian wishes kanyewest fathers day cheekâ',
 'allahsoil like religions islam strong unifying force teambts teamsuperjunior',
 "weekend near weekend tvshows see what's coming",
 'father gave greatest gift anyone could g

## **7. We’ll use TF-IDF values for the terms as features to get into a vector space model.**

1. **Import TF-IDF vectorizer from sklearn.**

In [112]:
from sklearn.feature_extraction.text import TfidfVectorizer

2. **Instantiate with a maximum of 5000 terms in your vocabulary.**

In [113]:
vectorizer = TfidfVectorizer(max_features=5000)

In [114]:
len(x_train), len(x_test)

(22373, 9589)
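
The split sizes above can be reproduced by hand. The sketch below assumes, as I understand scikit-learn's behaviour, that a fractional `test_size` is rounded up and the remainder goes to the train set; the per-class counts are the class 1 supports printed in the classification reports later in this notebook.

```python
import math

n_samples = 31962
test_size = 0.30

# Assumption: the test share is rounded up, the train set takes the rest
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test
print(n_train, n_test)

# Class 1 supports taken from the classification reports in this notebook
train_pos, test_pos = 1558, 684
train_rate = train_pos / n_train
test_rate = test_pos / n_test
print(round(train_rate, 4), round(test_rate, 4))
```

Both splits end up with roughly 7% positive tweets here, but that is down to chance. Passing `stratify=y` to `train_test_split` would keep the class ratio the same in both splits by construction.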

3. **Fit and apply on the train set.**
4. **Apply on the test set.**

In [115]:
x_train_bow = vectorizer.fit_transform(x_train)

x_test_bow = vectorizer.transform(x_test)

In [116]:
x_train_bow.shape, x_test_bow.shape

((22373, 5000), (9589, 5000))
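
To show why the vectorizer is fitted on the train set and only applied to the test set, here is a pure-Python sketch of TF-IDF. It follows what I understand to be scikit-learn's default weighting (smoothed idf, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization) but is a simplification: tokens come from whitespace splitting, there is no `max_features` cap, and the names `fit_idf` and `transform` are hypothetical.

```python
import math
from collections import Counter

def fit_idf(docs):
    """Learn smoothed inverse document frequencies from the training documents."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc.split()))
    return {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}

def transform(doc, idf):
    """Weight the known terms of one document and L2-normalize the result."""
    tf = Counter(term for term in doc.split() if term in idf)  # unseen terms are ignored
    weights = {term: count * idf[term] for term, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()} if norm else {}

train_docs = ["love day", "happy day"]
idf = fit_idf(train_docs)               # vocabulary and idf come from train only
vec = transform("love unseen day", idf)  # a test document reuses them
print(vec)
```

The term `unseen` never appears in the result because the vocabulary is fixed at fit time, which is also why the test matrix has the same 5000 columns as the train matrix.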

## **8. Model building: Ordinary Logistic Regression**

1. **Instantiate Logistic Regression from sklearn with default parameters.**
2. **Fit on the train data.**
3. **Make predictions for the train and the test set.**

In [117]:
from sklearn.linear_model import LogisticRegression

In [118]:
logreg = LogisticRegression()

In [119]:
logreg.fit(x_train_bow, y_train)

In [120]:
y_train_pred = logreg.predict(x_train_bow)
y_test_pred = logreg.predict(x_test_bow)

## **9. Model evaluation: accuracy, recall, and F1 score.**

**1. Report the accuracy on the train set.**

**2. Report the recall on the train set: decent, high, or low.**

**3. Get the f1 score on the train set.**

In [121]:
from sklearn.metrics import accuracy_score, classification_report

In [122]:
print(classification_report(y_train, y_train_pred))

              precision    recall  f1-score   support

           0       0.96      1.00      0.98     20815
           1       0.96      0.39      0.55      1558

    accuracy                           0.96     22373
   macro avg       0.96      0.69      0.76     22373
weighted avg       0.96      0.96      0.95     22373
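
The report above shows high accuracy next to low recall for class 1. A small sketch on made-up labels illustrates how that combination arises on imbalanced data; `binary_metrics` is a hypothetical helper that computes the metrics from raw counts and is not the scikit-learn implementation.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Made-up data: 3 positives out of 10, and the model finds only one of them
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
scores = binary_metrics(y_true, y_pred)
print(scores)
```

In this toy case the model is right 80% of the time while finding only a third of the positives, which is the same pattern as the 0.96 accuracy and 0.39 recall above.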



In [125]:
accuracy_score(y_train, y_train_pred)

0.9560184150538595

## **10. Looks like you need to adjust the class imbalance, as the model seems to focus on the 0s.**

**Adjust the appropriate class in the LogisticRegression model.**

Train accuracy is about 0.96, yet recall for class 1 is only 0.39, so the default model misses most of the minority class. The adjustment used here is the `class_weight` argument of LogisticRegression, set to "balanced" when the classifier is instantiated in step 12. Retraining with that adjustment (step 11) happens as part of the grid search in step 13.

In [124]:
# IPython help takes the bare name, without call parentheses
?LogisticRegression

## **12. Regularization and Hyperparameter tuning:**

**1. Import GridSearchCV and StratifiedKFold because of class imbalance.**

In [127]:
from sklearn.model_selection import GridSearchCV, StratifiedKFold

**2. Provide the parameter grid to choose for ‘C’ and ‘penalty’ parameters.**


In [128]:
# Parameter grid: inverse regularization strength C and the penalty type.
# Only "l2" is listed because the default lbfgs solver does not support "l1".
param_grid = {
    'C': [0.01,0.1,1,10,100],
    'penalty':["l2"]
}


**3. Use a balanced class weight while instantiating the logistic regression.**

In [130]:
classifier_lr = LogisticRegression(class_weight="balanced")
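
What `class_weight="balanced"` does can be worked out by hand. The sketch below uses the formula given in the scikit-learn documentation, `n_samples / (n_classes * count_of_class)`, with the train-set class counts from the classification report; `balanced_class_weights` is a hypothetical helper, not a library function.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class inversely to its frequency."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Train-set supports from the classification report: 20815 zeros, 1558 ones
train_labels = [0] * 20815 + [1] * 1558
weights = balanced_class_weights(train_labels)
print({cls: round(w, 2) for cls, w in weights.items()})
```

Each class 1 tweet ends up counting about 13 times as much as a class 0 tweet in the loss (roughly 7.18 against 0.54), which pushes the model to stop ignoring the minority class.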

## **13. Find the parameters with the best recall in cross-validation.**

1. **Choose ‘recall’ as the metric for scoring.**
2. **Choose a stratified 4-fold cross-validation scheme.**
3. **Fit on the train set.**

In [131]:
#Instantiate Grid search model
grid_search = GridSearchCV(estimator=classifier_lr, param_grid=param_grid,
                           cv = StratifiedKFold(4), n_jobs=-1, verbose=1, scoring="recall")

In [132]:
grid_search.fit(x_train_bow, y_train)

Fitting 4 folds for each of 5 candidates, totalling 20 fits
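
The "5 candidates, 20 fits" message follows from the grid itself: every combination of parameter values is one candidate, and each candidate is fitted once per fold. A minimal sketch of that enumeration, using the same grid as above (the variable names are illustrative):

```python
from itertools import product

param_grid = {"C": [0.01, 0.1, 1, 10, 100], "penalty": ["l2"]}
n_folds = 4

# Cartesian product of all parameter values, one dict per candidate
candidates = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
total_fits = len(candidates) * n_folds

print(len(candidates), total_fits)
print(candidates[0])
```

By default, GridSearchCV also refits the best candidate once on the whole train set after the search, which is what makes `best_estimator_` usable for prediction.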


## **14. What are the best parameters?**

In [133]:
# best_params_ holds the winning C and penalty; best_estimator_ is the refitted model used below
grid_search.best_params_

## **15. Predict and evaluate using the best estimator.**

1. **Use the best estimator from the grid search to make predictions on the test set.**
2. **What is the recall on the test set for the toxic comments?**
3. **What is the F1 score?**

In [134]:
y_test_pred = grid_search.best_estimator_.predict(x_test_bow)

In [135]:
y_train_pred = grid_search.best_estimator_.predict(x_train_bow)

In [136]:
print(classification_report(y_test, y_test_pred))

              precision    recall  f1-score   support

           0       0.98      0.94      0.96      8905
           1       0.49      0.77      0.60       684

    accuracy                           0.93      9589
   macro avg       0.73      0.85      0.78      9589
weighted avg       0.95      0.93      0.93      9589
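
The F1 values in the reports can be checked against their precision and recall, since F1 is the harmonic mean of the two. The sketch below plugs in the rounded class 1 figures printed in this notebook, so the results are only approximate; `f1_from` is a hypothetical helper.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Class 1, default model on the train set: precision 0.96, recall 0.39
f1_default_train = f1_from(0.96, 0.39)
# Class 1, tuned balanced model on the test set: precision 0.49, recall 0.77
f1_tuned_test = f1_from(0.49, 0.77)

print(round(f1_default_train, 2), round(f1_tuned_test, 2))
```

The balanced model trades precision for recall on class 1. The two rows come from different data (train against test), so they are not a like-for-like comparison.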

