# **Import Necessary Libraries**

**Pandas** is used to analyze data.

**Numpy** is used for working with arrays.

**re** is used to work with regular expressions.

The **os** module is used to interact with the underlying operating system.

In [87]:
import pandas as pd
import numpy as np
import os, re

# **Importing the dataset using pandas**

In [88]:
inp_tweets0 = pd.read_csv("Tweets_USA.csv")
inp_tweets0.head(10)

| Unnamed: 0 | id | label | tweet |
|---|---|---|---|
| 0 | 1 | 0 | @user when a father is dysfunctional and is s... |
| 1 | 2 | 0 | @user @user thanks for #lyft credit i can't us... |
| 2 | 3 | 0 | bihday your majesty |
| 3 | 4 | 0 | #model i love u take with u all the time in ... |
| 4 | 5 | 0 | factsguide: society now #motivation |
| 5 | 6 | 0 | [2/2] huge fan fare and big talking before the... |
| 6 | 7 | 0 | @user camping tomorrow @user @user @user @use... |
| 7 | 8 | 0 | the next school year is the year for exams.ð... |
| 8 | 9 | 0 | we won!!! love the land!!! #allin #cavs #champ... |
| 9 | 10 | 0 | @user @user welcome here ! i'm it's so #gr... |


Counting the number of tweets under labels 0 and 1

In [89]:
inp_tweets0.label.value_counts()

0    29720
1     2242
Name: label, dtype: int64

Thus we have **29720** tweets under **label 0** and **2242** tweets under **label 1**. From this, we can observe that our data is highly imbalanced.

Let's count after normalizing it!

In [90]:
inp_tweets0.label.value_counts(normalize= True)

0    0.929854
1    0.070146
Name: label, dtype: float64

Thus about **93%** of the tweets are under **label 0** and only about **7%** are under **label 1**.
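To see why this imbalance matters, consider a hypothetical baseline that labels every tweet 0. The sketch below uses only the counts printed above (the variable names are illustrative) and shows that such a baseline would already look very accurate.

```python
# Class counts taken from the value_counts() output above.
counts = {0: 29720, 1: 2242}
total = sum(counts.values())

# Share of each label, equivalent to value_counts(normalize=True).
shares = {label: n / total for label, n in counts.items()}

# A baseline that always predicts the majority label is right
# exactly as often as that label occurs.
majority_label = max(counts, key=counts.get)
baseline_accuracy = shares[majority_label]

print(total)                        # 31962
print(round(baseline_accuracy, 4))  # 0.9299
```

So accuracy alone tells us little here; a model has to be judged on how well it finds the label 1 tweets.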

Printing a sample tweet from the dataset

In [91]:
inp_tweets0.tweet.sample().values[0]

' @user @user @user @user beauties with broken heas .      ð\x9f\x9a£ð\x9f\x9a£ð\x9f\x9a£'

We can observe that the tweets contain many meaningless words, special characters, punctuation marks, and so on.

We have to clean these tweets before we can build the model.

# **Converting the tweets into a list, for easy text clean up and manipulation**

In [92]:
tweets0 = inp_tweets0.tweet.values

Printing the length of the list

In [93]:
len(tweets0)

31962

Printing the first five tweets

In [94]:
tweets0[:5]

array([' @user when a father is dysfunctional and is so selfish he drags his kids into his dysfunction.   #run',
       "@user @user thanks for #lyft credit i can't use cause they don't offer wheelchair vans in pdx.    #disapointed #getthanked",
       '  bihday your majesty',
       '#model   i love u take with u all the time in urð\x9f\x93±!!! ð\x9f\x98\x99ð\x9f\x98\x8eð\x9f\x91\x84ð\x9f\x91\x85ð\x9f\x92¦ð\x9f\x92¦ð\x9f\x92¦  ',
       ' factsguide: society now    #motivation'], dtype=object)

# **Cleaning up the Data**
# **Normalizing case**

First, let's normalize the case by converting all the words in the list to lowercase

In [95]:
tweets_lower = [twt.lower() for twt in tweets0]

Let's check whether all the texts are converted to lowercase by printing the first five tweets in the list

In [96]:
tweets_lower[:5]

[' @user when a father is dysfunctional and is so selfish he drags his kids into his dysfunction.   #run',
 "@user @user thanks for #lyft credit i can't use cause they don't offer wheelchair vans in pdx.    #disapointed #getthanked",
 '  bihday your majesty',
 '#model   i love u take with u all the time in urð\x9f\x93±!!! ð\x9f\x98\x99ð\x9f\x98\x8eð\x9f\x91\x84ð\x9f\x91\x85ð\x9f\x92¦ð\x9f\x92¦ð\x9f\x92¦  ',
 ' factsguide: society now    #motivation']

We can see that all the words in our dataset are converted to lowercase

# **Remove user handles, which begin with '@'**

Now let's remove words beginning with **"@"** using a regular expression.

Here, the **sub()** function replaces every match of a pattern with another string.

**\w+** matches one or more word characters (letters, digits, and underscores).

The code below removes all the words starting with '@'. The pattern is written as a raw string (r"...") so that the backslash reaches the regex engine unchanged.

In [97]:
re.sub(r"@\w+", "", "@Rahim this course rocks! http://rahimbaig.com/ai")

' this course rocks! http://rahimbaig.com/ai'
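Before applying this pattern to the whole dataset, it helps to know what else it matches. Below is a minimal sketch; the helper name `strip_handles` and the second example string are illustrative assumptions, not part of the original notebook.

```python
import re

# Compiled once and reused; the raw string keeps the backslash literal.
HANDLE = re.compile(r"@\w+")

def strip_handles(text):
    """Remove every '@' followed by one or more word characters."""
    return HANDLE.sub("", text)

print(strip_handles("@Rahim this course rocks!"))   # ' this course rocks!'

# The pattern does not know about e-mail addresses: it removes
# '@example' and leaves the rest behind.
print(strip_handles("mail bob@example.com today"))  # 'mail bob.com today'
```

For this dataset the behaviour is probably acceptable, but it is worth keeping in mind if the same clean-up is reused on other text.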

Now let's remove all the words starting with '@' from the list using a **list comprehension**

In [98]:
tweets_nouser = [re.sub(r"@\w+", "", twt) for twt in tweets_lower]

Let's check whether all the words starting with '@' are removed by printing the first five tweets in the list

In [99]:
tweets_nouser[:5]

['  when a father is dysfunctional and is so selfish he drags his kids into his dysfunction.   #run',
 "  thanks for #lyft credit i can't use cause they don't offer wheelchair vans in pdx.    #disapointed #getthanked",
 '  bihday your majesty',
 '#model   i love u take with u all the time in urð\x9f\x93±!!! ð\x9f\x98\x99ð\x9f\x98\x8eð\x9f\x91\x84ð\x9f\x91\x85ð\x9f\x92¦ð\x9f\x92¦ð\x9f\x92¦  ',
 ' factsguide: society now    #motivation']

# **Remove URLs**

Now let's remove all the URLs in the list, since they carry no useful information for our model.

The regex **'\w+://\S+'** matches a run of word characters (**'\w+'**), then the literal **'://'**, then a run of non-whitespace characters (**'\S+'**), so it removes every word that contains **'://'** in the middle.

In [100]:
re.sub(r"\w+://\S+", "", "@Rahim this course rocks! http://rahimbaig.com/ai")

'@Rahim this course rocks! '
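The sketch below shows what this pattern does and does not catch. The helper name `strip_urls` and the example strings are assumptions made for illustration.

```python
import re

URL = re.compile(r"\w+://\S+")

def strip_urls(text):
    """Remove every token of the form scheme://something."""
    return URL.sub("", text)

# The scheme, the '://' and everything up to the next space are removed.
print(strip_urls("docs at https://example.com/a?b=1 now"))  # 'docs at  now'

# A link written without a scheme contains no '://', so it is left alone.
print(strip_urls("visit www.example.com"))                  # 'visit www.example.com'
```

Note that the removal leaves a double space behind; this does no harm here because the tweets are tokenized in a later step.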

In [101]:
tweets_nour1 = [re.sub(r"\w+://\S+", "", twt) for twt in tweets_nouser]

Now let's check whether it removed all the URLs

In [102]:
tweets_nour1[:5]

['  when a father is dysfunctional and is so selfish he drags his kids into his dysfunction.   #run',
 "  thanks for #lyft credit i can't use cause they don't offer wheelchair vans in pdx.    #disapointed #getthanked",
 '  bihday your majesty',
 '#model   i love u take with u all the time in urð\x9f\x93±!!! ð\x9f\x98\x99ð\x9f\x98\x8eð\x9f\x91\x84ð\x9f\x91\x85ð\x9f\x92¦ð\x9f\x92¦ð\x9f\x92¦  ',
 ' factsguide: society now    #motivation']

# **Tokenize using tweet tokenizer from NLTK**

NLTK provides a class called **TweetTokenizer** that splits a tweet corpus into relevant tokens. Its advantage over the regular word_tokenize is that tweets often contain emojis and hashtags, which need to be handled differently from ordinary words.

In [103]:
from nltk.tokenize import TweetTokenizer

In [104]:
tkn = TweetTokenizer()

In [105]:
print (tkn.tokenize(tweets_nour1[0]))

['when', 'a', 'father', 'is', 'dysfunctional', 'and', 'is', 'so', 'selfish', 'he', 'drags', 'his', 'kids', 'into', 'his', 'dysfunction', '.', '#run']


Let's tokenize all the tweets in our cleaned data using a list comprehension

In [106]:
tweet_token = [tkn.tokenize(sent) for sent in tweets_nour1]
print(tweet_token[:5])

[['when', 'a', 'father', 'is', 'dysfunctional', 'and', 'is', 'so', 'selfish', 'he', 'drags', 'his', 'kids', 'into', 'his', 'dysfunction', '.', '#run'], ['thanks', 'for', '#lyft', 'credit', 'i', "can't", 'use', 'cause', 'they', "don't", 'offer', 'wheelchair', 'vans', 'in', 'pdx', '.', '#disapointed', '#getthanked'], ['bihday', 'your', 'majesty'], ['#model', 'i', 'love', 'u', 'take', 'with', 'u', 'all', 'the', 'time', 'in', 'urð', '\x9f', '\x93', '±', '!', '!', '!', 'ð', '\x9f', '\x98', '\x99', 'ð', '\x9f', '\x98', '\x8e', 'ð', '\x9f', '\x91', '\x84', 'ð', '\x9f', '\x91', 'ð', '\x9f', '\x92', '¦', 'ð', '\x9f', '\x92', '¦', 'ð', '\x9f', '\x92', '¦'], ['factsguide', ':', 'society', 'now', '#motivation']]


Now we can observe that all the sentences in our list are converted into words, but we still need a few more steps to get a clean dataset for building the model. Let's do them now!


# **Remove punctuation, stop words, and other redundant terms like 'rt' and 'amp'**
# **Also remove the '#' from hashtags**

**Stop words** are words that add little information to a sentence. To remove them with NLTK, we first download the stop word list with nltk.download('stopwords'), then load the English list with stopwords.words('english') and save it to a variable.

**Punctuation** can be removed by importing punctuation from the string module and saving it as a list in a variable. We can then define a function that removes all of these unnecessary stop words and punctuation marks from our data.

In [107]:
import nltk

In [108]:
from nltk.corpus import stopwords
from string import punctuation
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [109]:
stop_nltk = stopwords.words("english")
stop_punct = list(punctuation)

Now let's extend the punctuation list with the following tokens, since they also appear in our data and add no meaning to a sentence.

In [110]:
stop_punct.extend(['...','``',"''",".."])

Now let's list redundant terms like 'rt' (the retweet marker) and 'amp' (left over from the HTML entity for '&'), since they carry no meaning.

In [111]:
stop_context = ['rt', 'amp']

Now let's finalize the terms we are going to remove by combining the stop words, the punctuation, and the redundant terms.

In [112]:
stop_final = stop_nltk + stop_punct + stop_context

# **Function to**
- remove stop words from a single tokenized sentence
- remove the '#' from hashtags
- remove terms with length = 1

Now let's define a function that removes stop words and punctuation, and also drops terms that are only one character long.

It also strips the '#' character from each term, so hashtags are kept as plain words.

Finally, we will have a list of meaningful words.

In [113]:
def del_stop(sent):
    # Keep terms that are not in stop_final and are longer than one character; strip '#'
    return [re.sub("#", "", term) for term in sent if term not in stop_final and len(term) > 1]

In [114]:
del_stop (tweet_token[4])

['factsguide', 'society', 'motivation']
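Since stop_final is a list, every membership check scans it from the start. A possible variation, sketched below, keeps the same filtering logic but stores the terms in a set. The small `example_stop_words` set is a hypothetical stand-in for the NLTK list, and `del_stop_set` is an illustrative name.

```python
from string import punctuation

# Hypothetical, much smaller stand-in for the NLTK English stop word list.
example_stop_words = {"a", "is", "now", "the", "rt", "amp"}

# Same three ingredients as stop_final, combined into a set.
stop_set = example_stop_words | set(punctuation) | {"...", "``", "''", ".."}

def del_stop_set(tokens):
    """Drop stop words and 1-character terms, then strip '#' from the rest."""
    return [term.replace("#", "") for term in tokens
            if term not in stop_set and len(term) > 1]

print(del_stop_set(["factsguide", ":", "society", "now", "#motivation"]))
# ['factsguide', 'society', 'motivation']
```

The result is the same as with the list; only the lookup is faster, which may matter when cleaning tens of thousands of tweets.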

In [115]:
tweets_clean = [del_stop(tweet) for tweet in tweet_token]

In [116]:
print(tweets_clean[:5])

[['father', 'dysfunctional', 'selfish', 'drags', 'kids', 'dysfunction', 'run'], ['thanks', 'lyft', 'credit', "can't", 'use', 'cause', 'offer', 'wheelchair', 'vans', 'pdx', 'disapointed', 'getthanked'], ['bihday', 'majesty'], ['model', 'love', 'take', 'time', 'urð'], ['factsguide', 'society', 'motivation']]
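Notice that a mis-encoded fragment such as 'urð' survives the clean-up in the fourth tweet. One optional extra step, which the original notebook does not perform, is to drop tokens that contain non-ASCII characters. The helper below is a hypothetical sketch; it would also discard legitimate non-English words and emoji, so it may not suit every dataset.

```python
def drop_non_ascii(tokens):
    """Keep only tokens made entirely of ASCII characters (Python 3.7+)."""
    return [term for term in tokens if term.isascii()]

print(drop_non_ascii(["model", "love", "take", "time", "urð"]))
# ['model', 'love', 'take', 'time']
```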


# **Check out the top terms in the tweets**

In [117]:
from collections import Counter

A **Counter** is a dict subclass for counting hashable objects. It is a collection where elements are stored as dictionary keys and their counts are stored as dictionary values.

Now let's find the top 10 terms in the tweets. We create an empty list, fill it with every term using a for loop and the extend method, and then count the occurrences of each term by passing the list to Counter.

In [118]:
term_list = []
for tweet in tweets_clean:
    term_list.extend(tweet)

In [119]:
res = Counter(term_list)
res.most_common(10)

[('love', 2748),
 ('day', 2276),
 ('happy', 1684),
 ('time', 1131),
 ('life', 1118),
 ('like', 1047),
 ("i'm", 1018),
 ('today', 1013),
 ('new', 994),
 ('thankful', 946)]
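The same counts can also be collected without building one long intermediate list, by updating a Counter tweet by tweet. This is a small sketch; `example_tweets` is hypothetical data standing in for tweets_clean.

```python
from collections import Counter

# Hypothetical cleaned tweets, standing in for tweets_clean.
example_tweets = [["love", "day"], ["love", "happy"], ["love", "day", "time"]]

term_counts = Counter()
for tokens in example_tweets:
    term_counts.update(tokens)  # adds 1 for each occurrence of each term

print(term_counts.most_common(2))  # [('love', 3), ('day', 2)]
```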

# **Data formatting for predictive modelling**
Join the tokens back into strings

In [120]:
tweets_clean[0]

['father', 'dysfunctional', 'selfish', 'drags', 'kids', 'dysfunction', 'run']

Now let's convert each list of tokens back into a single string with the join function, using a space as the separator.

In [121]:
tweets_clean = [" ".join(tweet) for tweet in tweets_clean]

Let's print a sample

In [122]:
tweets_clean[30]

'never chance vote presidential candidate excited cycle looks different'

# **Separate X and Y and perform train test split, 70-30**

Let's print the number of tweets and the number of labels to check that they match, so we know no rows were lost during cleaning.

In [123]:
len(tweets_clean)

31962

In [124]:
len(inp_tweets0.label)

31962

We will use the **train_test_split()** function to split our data into train and test sets.

First, we need to divide our data into features (X) and labels (y).

Then we can divide our dataset into X_train, X_test, y_train, and y_test.

Here, the X_train and y_train sets are used for training and fitting the model, while X_test and y_test are held back to evaluate it.

Now let's separate the independent variable (the cleaned tweets) from the dependent variable (the labels).

In [125]:
X = tweets_clean
y = inp_tweets0.label.values

# Train Test split

Now let's split the whole dataset into a training set (70%) and a testing set (30%)

In [126]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split (X, y, test_size = 0.30, random_state=42 )
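Because the split above is random, the share of label 1 tweets is not guaranteed to be identical in the two sets. The sketch below checks this using the support counts that appear in the classification reports later in this notebook; the helper name `positive_rate` is illustrative.

```python
# Support counts taken from the classification reports later in this notebook.
train_counts = {0: 20815, 1: 1558}
test_counts = {0: 8905, 1: 684}

def positive_rate(counts):
    """Fraction of samples that carry label 1."""
    return counts[1] / (counts[0] + counts[1])

print(round(positive_rate(train_counts), 4))  # 0.0696
print(round(positive_rate(test_counts), 4))   # 0.0713
```

The two rates are close but not equal. If an exact match is wanted, train_test_split accepts a stratify argument (for example stratify=y) that preserves the class proportions in both sets; the notebook does not use it.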

# **Create a document term matrix using vectorizer**

**TF-IDF** is a popular approach for weighting terms in NLP tasks. It scores a term by how often it appears in a document, scaled down by how many documents in the corpus contain it. This down-weights words that occur in almost every document and favors words that are more descriptive of a particular text.
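To make the weighting concrete, here is a plain-Python sketch of TF-IDF on a tiny hypothetical corpus. It assumes the smoothed formula that scikit-learn documents as its default, idf = ln((1 + n) / (1 + df)) + 1, with raw term counts and L2 normalization; it is an illustration of the idea, not the library's actual implementation.

```python
import math
from collections import Counter

# Hypothetical three-document corpus, already cleaned and space-joined.
docs = ["love happy day", "love time", "hate"]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term occur?
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def idf(term):
    """Smoothed inverse document frequency (assumed scikit-learn default)."""
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf(tokens):
    """TF-IDF weights of one document, scaled to unit L2 length."""
    term_freq = Counter(tokens)
    weights = {term: count * idf(term) for term, count in term_freq.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

first_doc = tfidf(tokenized[0])
print(round(idf("love"), 4), round(idf("hate"), 4))  # 1.2877 1.6931
print(first_doc["love"] < first_doc["happy"])        # True
```

'love' occurs in two of the three documents, so it receives a lower weight than 'happy', which occurs in only one.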

In [127]:
from sklearn.feature_extraction.text import TfidfVectorizer

Here, max_features limits the vocabulary to the most frequent terms in the corpus, and TF-IDF scores are calculated only for those terms. We keep a maximum of 5000 features when creating the vectorizer.

In [128]:
vectorizer = TfidfVectorizer(max_features = 5000)

In [129]:
len(X_train), len(X_test)

(22373, 9589)

In [130]:
X_train_bow = vectorizer.fit_transform(X_train)

X_test_bow = vectorizer.transform(X_test)

In [131]:
X_train_bow.shape, X_test_bow.shape

((22373, 5000), (9589, 5000))

# **Model building**

# **Using a simple Logistic Regression**

Now let's use a simple logistic regression, which predicts the probability of a binary (yes/no) event occurring


In [132]:
from sklearn.linear_model import LogisticRegression

In [133]:
logreg = LogisticRegression()

In [134]:
logreg.fit(X_train_bow, y_train)

LogisticRegression()

In [135]:
y_train_pred = logreg.predict(X_train_bow)
y_test_pred = logreg.predict(X_test_bow)

Now let's measure the performance of our model using evaluation metrics like accuracy_score and classification_report.



In [136]:
from sklearn.metrics import accuracy_score, classification_report

**Accuracy** is one metric for evaluating classification models. Informally, accuracy is the fraction of predictions our model got right. Formally, accuracy has the following definition: Accuracy = Number of correct predictions divided by Total number of predictions.

A **classification report** is a performance evaluation metric in machine learning. It is used to show the precision, recall, F1 Score, and support of your trained classification model.
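The definitions above can be worked through by hand on a tiny example. The labels below are hypothetical and chosen to mimic an imbalanced problem; the sketch computes the metrics for the positive class (label 1) in plain Python.

```python
# Hypothetical labels: 8 negatives, 2 positives, one positive missed.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives

accuracy = sum(t == p for t, p in pairs) / len(pairs)
precision = tp / (tp + fp)  # of the predicted positives, how many are right
recall = tp / (tp + fn)     # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, round(f1, 4))  # 0.9 1.0 0.5 0.6667
```

Accuracy is 0.9 even though half of the positives were missed, which is why recall and F1 for label 1 deserve attention on imbalanced data.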



In [137]:
accuracy_score(y_train, y_train_pred)

0.9560184150538595

Thus we get an accuracy of about **95.6%** on the training set

In [138]:
print(classification_report(y_train, y_train_pred))

              precision    recall  f1-score   support

           0       0.96      1.00      0.98     20815
           1       0.96      0.39      0.55      1558

    accuracy                           0.96     22373
   macro avg       0.96      0.69      0.76     22373
weighted avg       0.96      0.96      0.95     22373



Here, the support column shows that the positive class (label 1, 1558 tweets) is far smaller than the negative class (label 0, 20815 tweets), which means our dataset has skewed class proportions. As a result, the recall for label 1 is only 0.39 even though the overall accuracy is high.

# **Adjusting for class imbalance**

To adjust for the class imbalance, we add the class_weight parameter to our logistic regression with the value 'balanced', and then fit the model to our data again.
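To get a feel for what 'balanced' does, the sketch below applies the formula given in the scikit-learn documentation, n_samples / (n_classes * count of the class), to the training counts from the report above. It is plain Python for illustration and does not call the library.

```python
# Training-set class counts, taken from the support column above.
train_counts = {0: 20815, 1: 1558}

n_samples = sum(train_counts.values())
n_classes = len(train_counts)

# 'balanced' weight of a class: n_samples / (n_classes * class count).
weights = {label: n_samples / (n_classes * count)
           for label, count in train_counts.items()}

print({label: round(w, 4) for label, w in weights.items()})
# {0: 0.5374, 1: 7.18}
```

Each label 1 tweet therefore counts roughly 13 times as much as a label 0 tweet during training, which pushes the model to miss fewer positives.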

In [139]:
logreg = LogisticRegression(class_weight="balanced")

In [140]:
logreg.fit(X_train_bow, y_train)

LogisticRegression(class_weight='balanced')

In [141]:
y_train_pred = logreg.predict(X_train_bow)
y_test_pred = logreg.predict(X_test_bow)

In [142]:
accuracy_score(y_train, y_train_pred)

0.9527108568363652

In [143]:
print(classification_report(y_train, y_train_pred))

              precision    recall  f1-score   support

           0       1.00      0.95      0.97     20815
           1       0.60      0.97      0.74      1558

    accuracy                           0.95     22373
   macro avg       0.80      0.96      0.86     22373
weighted avg       0.97      0.95      0.96     22373



# **Hyperparameter Tuning**

**GridSearchCV** searches for the best parameter values from a given grid of candidates. It takes the model and the parameter grid as input, evaluates every combination with cross-validation, and keeps the best-scoring one. The best parameter values are then used to make predictions.

**StratifiedKFold** is a cross-validator that divides the dataset into k folds and ensures that each fold has approximately the same proportion of observations for each label. It is useful for classification tasks with imbalanced class distributions.
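The idea of stratification can be sketched in a few lines of plain Python. This is a simplified illustration that deals the indices of each class round-robin into the folds; it is an assumption-laden toy, not scikit-learn's actual algorithm, and the function name `stratified_folds` is hypothetical.

```python
from collections import defaultdict

def stratified_folds(labels, k):
    """Assign sample indices to k folds, class by class, in round-robin order."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds

# Hypothetical imbalanced labels: 12 negatives and 4 positives.
labels = [0] * 12 + [1] * 4
folds = stratified_folds(labels, 4)

for fold in folds:
    positives = sum(labels[i] for i in fold)
    print(len(fold), positives)  # each fold: 4 samples, 1 positive
```

Every fold keeps the original 3-to-1 ratio, so each validation round sees some positive examples. With 4 folds and 4 candidate values, a grid search runs 4 x 4 = 16 fits.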



In [144]:
from sklearn.model_selection import GridSearchCV, StratifiedKFold

Creating the parameter grid with candidate values for C, the inverse of the regularization strength

In [145]:
param_grid = {
    'C':[0.01,0.1,1,10]
}

In [146]:
classifier_lr = LogisticRegression(class_weight="balanced")

Now let's instantiate the grid search model specifying all the parameter values.

In [147]:
grid_search = GridSearchCV(estimator = classifier_lr, 
                           param_grid = param_grid,
                          cv=StratifiedKFold(4),
                          n_jobs = -1, verbose = 1,
                          scoring = "recall")

In [148]:
grid_search.fit(X_train_bow, y_train)

Fitting 4 folds for each of 4 candidates, totalling 16 fits


GridSearchCV(cv=StratifiedKFold(n_splits=4, random_state=None, shuffle=False),
             estimator=LogisticRegression(class_weight='balanced'), n_jobs=-1,
             param_grid={'C': [0.01, 0.1, 1, 10]}, scoring='recall', verbose=1)

Finding the best estimator.

In [149]:
grid_search.best_estimator_

LogisticRegression(C=1, class_weight='balanced')

# **Using the best estimator to make predictions on the test set**

In [150]:
y_test_pred = grid_search.best_estimator_.predict(X_test_bow)

In [151]:
y_train_pred = grid_search.best_estimator_.predict(X_train_bow)

In [152]:
print(classification_report(y_test, y_test_pred))

              precision    recall  f1-score   support

           0       0.98      0.94      0.96      8905
           1       0.49      0.77      0.60       684

    accuracy                           0.93      9589
   macro avg       0.73      0.85      0.78      9589
weighted avg       0.95      0.93      0.93      9589

