# **Sentiment analysis**

**Sentiment analysis**, also known as opinion mining, is a natural language processing (NLP) technique that identifies the emotional tone embedded within text. This methodology is widely used by organizations to evaluate and categorize public opinions regarding products, services, or concepts. It leverages data mining, machine learning, and artificial intelligence to analyze textual data for sentiment and subjective information.

Sentiment analysis systems enable organizations to derive meaningful insights from unstructured and unorganized text originating from various online sources, such as emails, blog posts, customer support tickets, web chats, social media platforms, forums, and user comments. These systems automate data processing using rule-based, automated, or hybrid methodologies. Rule-based approaches rely on predefined lexicon-based rules to perform sentiment classification, while automated systems utilize machine learning algorithms to infer sentiment from training data. Hybrid systems combine both rule-based and automated methods to enhance performance. Beyond determining sentiment, opinion mining can extract additional information, including polarity (the degree of positivity or negativity), subjects, and opinion holders. Moreover, sentiment analysis can be conducted at various granularities, including document, paragraph, sentence, and sub-sentence levels.



## **Types of Sentiment Analysis**

Sentiment analysis can be categorized into several types based on the level of granularity and the focus of the analysis:

* **Fine-Grained Sentiment Analysis**: This approach provides a detailed breakdown of sentiment polarity, typically on a scale from very positive to very negative. It mirrors the granularity of rating systems, such as a 5-star scale, offering precise insights into opinion strength.

* **Emotion Detection**: Unlike traditional polarity-focused methods, emotion detection identifies specific emotions within the text. Examples include happiness, frustration, shock, anger, and sadness, enabling a deeper understanding of emotional context.

* **Intent-Based Analysis**: This type of analysis extends beyond opinion to recognize the underlying actions or intentions within a text. For instance, a comment expressing frustration about replacing a battery may indicate a need for assistance, prompting customer service intervention to address the issue.

* **Aspect-Based Analysis**: Aspect-based sentiment analysis focuses on identifying sentiment tied to specific components or attributes of a subject. For example, a product review might mention dissatisfaction with battery life rather than the product as a whole. In such cases, the system attributes the negative sentiment specifically to the battery life aspect.

These specialized types of sentiment analysis allow for nuanced and actionable insights, tailored to different analytical needs and use cases.

**Import required libraries** for this Python project for Twitter sentiment analysis of hatred speech recognition.

In [60]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn import model_selection, preprocessing, linear_model, metrics
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn import ensemble
from lightgbm import LGBMClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from xgboost import XGBClassifier

import warnings
warnings.filterwarnings('ignore')

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from textblob import Word
nltk.download('wordnet')

from termcolor import colored
from warnings import filterwarnings
filterwarnings('ignore')

from sklearn import set_config
set_config(print_changed_only = False)

print(colored("\nLIBRARIES WERE SUCCESFULLY IMPORTED...", color = "green", attrs = ["dark", "bold"]))


LIBRARIES WERE SUCCESFULLY IMPORTED...


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!



LIBRARIES WERE SUCCESFULLY IMPORTED...


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!


Dataset of Twitter hatred speech is avilable in the colab folder as train.csv and test.csv

### **Read the dataset files**

In [61]:
# Load the train and test datasets
try:
    train_set = pd.read_csv('/content/train.csv')
    test_set  = pd.read_csv('/content/test.csv')
    print("Datasets loaded successfully.")
    # Now you can work with train_df and test_df
    # Example: print the first 5 rows of the training data
    print(train_set.head())
except FileNotFoundError:
    print("Error: One or both of the CSV files were not found in /content/. Please ensure they are uploaded correctly.")
except pd.errors.ParserError:
    print("Error: Could not parse the CSV files. Please check their format.")
except Exception as e:
    print(f"An unexpected error occurred: {e}")

Datasets loaded successfully.
   id  label                                              tweet
0   1      0   @user when a father is dysfunctional and is s...
1   2      0  @user @user thanks for #lyft credit i can't us...
2   3      0                                bihday your majesty
3   4      0  #model   i love u take with u all the time in ...
4   5      0             factsguide: society now    #motivation
Datasets loaded successfully.
   id  label                                              tweet
0   1      0   @user when a father is dysfunctional and is s...
1   2      0  @user @user thanks for #lyft credit i can't us...
2   3      0                                bihday your majesty
3   4      0  #model   i love u take with u all the time in ...
4   5      0             factsguide: society now    #motivation


In [62]:
train_set.head(n = 5)


Unnamed: 0,id,label,tweet
0,1,0,@user when a father is dysfunctional and is s...
1,2,0,@user @user thanks for #lyft credit i can't us...
2,3,0,bihday your majesty
3,4,0,#model i love u take with u all the time in ...
4,5,0,factsguide: society now #motivation


Unnamed: 0,id,label,tweet
0,1,0,@user when a father is dysfunctional and is s...
1,2,0,@user @user thanks for #lyft credit i can't us...
2,3,0,bihday your majesty
3,4,0,#model i love u take with u all the time in ...
4,5,0,factsguide: society now #motivation


Read the first five rows of test set

In [63]:
test_set.head(n = 5).style.background_gradient(cmap = "summer")

Unnamed: 0,id,tweet
0,31963,#studiolife #aislife #requires #passion #dedication #willpower to find #newmaterialsâ¦
1,31964,@user #white #supremacists want everyone to see the new â #birdsâ #movie â and hereâs why
2,31965,safe ways to heal your #acne!! #altwaystoheal #healthy #healing!!
3,31966,"is the hp and the cursed child book up for reservations already? if yes, where? if no, when? ððð #harrypotter #pottermore #favorite"
4,31967,"3rd #bihday to my amazing, hilarious #nephew eli ahmir! uncle dave loves you and missesâ¦"


Unnamed: 0,id,tweet
0,31963,#studiolife #aislife #requires #passion #dedication #willpower to find #newmaterialsâ¦
1,31964,@user #white #supremacists want everyone to see the new â #birdsâ #movie â and hereâs why
2,31965,safe ways to heal your #acne!! #altwaystoheal #healthy #healing!!
3,31966,"is the hp and the cursed child book up for reservations already? if yes, where? if no, when? ððð #harrypotter #pottermore #favorite"
4,31967,"3rd #bihday to my amazing, hilarious #nephew eli ahmir! uncle dave loves you and missesâ¦"


Shapes of the train and test sets

In [64]:
print("Train set shape: {} and test set shape: {}".format(train_set.shape, test_set.shape))

Train set shape: (31962, 3) and test set shape: (17197, 2)
Train set shape: (31962, 3) and test set shape: (17197, 2)


Get general information about train set

In [65]:
print("Train set information" )
train_set.info()

print("\n\nTest set information")
test_set.info()

Train set information
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31962 entries, 0 to 31961
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      31962 non-null  int64 
 1   label   31962 non-null  int64 
 2   tweet   31962 non-null  object
dtypes: int64(2), object(1)
memory usage: 749.2+ KB


Test set information
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17197 entries, 0 to 17196
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      17197 non-null  int64 
 1   tweet   17197 non-null  object
dtypes: int64(1), object(1)
memory usage: 268.8+ KB
Train set information
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 31962 entries, 0 to 31961
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      31962 non-null  int64 
 1   label   31962 non-null  int64 
 2   tweet   31962 non-null  object
d

Check whether there are duplicated values

In [66]:
print("Totally there are {} duplicated values in train_set".format(train_set.duplicated().sum()))

Totally there are 0 duplicated values in train_set
Totally there are 0 duplicated values in train_set


Get the number of classes of the "label" variable of train set

In [67]:
train_set.groupby("label").count().style.background_gradient(cmap = "summer")

Unnamed: 0_level_0,id,tweet
label,Unnamed: 1_level_1,Unnamed: 2_level_1
0,29720,29720
1,2242,2242


Unnamed: 0_level_0,id,tweet
label,Unnamed: 1_level_1,Unnamed: 2_level_1
0,29720,29720
1,2242,2242


In [68]:
# Assuming 'train_set' DataFrame from the previous code is available.
# Downsample the majority class to match the minority class (label 1)

# Separate majority and minority classes
majority_class = train_set[train_set.label==0]
minority_class = train_set[train_set.label==1]

# Downsample majority class
majority_downsampled = majority_class.sample(n=len(minority_class), random_state=42)

# Combine minority class with downsampled majority class
downsampled_train = pd.concat([majority_downsampled, minority_class])

# Display the class distribution in the downsampled dataset
print(downsampled_train.label.value_counts())
train_set_copy = train_set.copy()
train_set = downsampled_train

label
0    2242
1    2242
Name: count, dtype: int64
label
0    2242
1    2242
Name: count, dtype: int64


### **Clean And Process Dataset**
1. Convert uppercase letters to lowercase letters in "tweet" column

In [69]:
train_set["tweet"] = train_set["tweet"].apply(lambda x: " ".join(x.lower() for x in x.split()))
test_set["tweet"] = test_set["tweet"].apply(lambda x: " ".join(x.lower() for x in x.split()))

print(colored("\nDELETED SUCCESFULLY...", color = "green", attrs = ["dark", "bold"]))


DELETED SUCCESFULLY...

DELETED SUCCESFULLY...


2. Delete punctuation marks from "tweet" columns

In [70]:
train_set["tweet"] = train_set["tweet"].str.replace('[^\w\s]','')
test_set["tweet"] = test_set["tweet"].str.replace('[^\w\s]','')
print(colored("\nDELETED SUCCESFULLY...", color = "green", attrs = ["dark", "bold"]))


DELETED SUCCESFULLY...

DELETED SUCCESFULLY...


3. Delete numbers from "tweet" columns

In [71]:
train_set['tweet'] = train_set['tweet'].str.replace('\d','')
test_set['tweet'] = test_set['tweet'].str.replace('\d','')

print(colored("\n NUMBERS DELETED SUCCESFULLY...", color = "green", attrs = ["dark", "bold"]))


 NUMBERS DELETED SUCCESFULLY...

 NUMBERS DELETED SUCCESFULLY...


4. Delete stopwords from "tweet" columns

In [72]:
sw = stopwords.words("english")
train_set['tweet'] = train_set['tweet'].apply(lambda x: " ".join(x for x in x.split() if x not in sw))
test_set['tweet'] = test_set['tweet'].apply(lambda x: " ".join(x for x in x.split() if x not in sw))

print(colored("\nSTOPWORDS DELETED SUCCESFULLY...", color = "green", attrs = ["dark", "bold"]))


STOPWORDS DELETED SUCCESFULLY...

STOPWORDS DELETED SUCCESFULLY...


5. Lemmatization. That is, we get the roots of the words in the "tweet" columns

In [73]:
train_set['tweet'] = train_set['tweet'].apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()]))
test_set['tweet'] = test_set['tweet'].apply(lambda x: " ".join([Word(word).lemmatize() for word in x.split()]))

print(colored("\nDONE SUCCESFULLY...", color = "green", attrs = ["dark", "bold"]))


DONE SUCCESFULLY...

DONE SUCCESFULLY...


6. Drop "id" column from datasets

In [74]:
train_set = train_set.drop("id", axis = 1)
test_set = test_set.drop("id", axis = 1)

print(colored("\n'ID' COLUMNS DROPPED SUCCESFULLY...", color = "green", attrs = ["dark", "bold"]))


'ID' COLUMNS DROPPED SUCCESFULLY...

'ID' COLUMNS DROPPED SUCCESFULLY...


Look at the latest condition of train set

In [75]:
train_set.head(n = 10)

Unnamed: 0,label,tweet
8824,0,"#body body massage ending oil #massage ,body h..."
31854,0,@user @ call back! #casting #castingcall #mode...
28079,0,help creates #environment #togetherness &amp; ...
29214,0,summer friendâ¨ð¥ #summer #friend #life #vl...
20025,0,follow snapchat awesomecutenes7 #snapchat #sel...
21437,0,@user robbie told @user #thesmb lead asked bro...
24167,0,cupcakes! #beachpay #payplanning #friends #eno...
12833,0,happy boyð love guy #littleman #loveofmylif...
15840,0,keshi's news really hard accept. #sosudden
20134,0,day porn movie skinny amateur sex


Unnamed: 0,label,tweet
8824,0,"#body body massage ending oil #massage ,body h..."
31854,0,@user @ call back! #casting #castingcall #mode...
28079,0,help creates #environment #togetherness &amp; ...
29214,0,summer friendâ¨ð¥ #summer #friend #life #vl...
20025,0,follow snapchat awesomecutenes7 #snapchat #sel...
21437,0,@user robbie told @user #thesmb lead asked bro...
24167,0,cupcakes! #beachpay #payplanning #friends #eno...
12833,0,happy boyð love guy #littleman #loveofmylif...
15840,0,keshi's news really hard accept. #sosudden
20134,0,day porn movie skinny amateur sex


In [76]:
test_set.head(n = 10)

Unnamed: 0,tweet
0,#studiolife #aislife #requires #passion #dedic...
1,@user #white #supremacists want everyone see n...
2,safe way heal #acne!! #altwaystoheal #healthy ...
3,"hp cursed child book reservation already? yes,..."
4,"3rd #bihday amazing, hilarious #nephew eli ahm..."
5,choose :) #momtips
6,something inside dy ð¦ð¿â¨ eye ness #smok...
7,#finished#tattoo#inked#ink#loveitâ¤ï¸ #â¤ï¸...
8,@user @user @user never understand dad left yo...
9,#delicious #food #lovelife #capetown mannaepic...


Unnamed: 0,tweet
0,#studiolife #aislife #requires #passion #dedic...
1,@user #white #supremacists want everyone see n...
2,safe way heal #acne!! #altwaystoheal #healthy ...
3,"hp cursed child book reservation already? yes,..."
4,"3rd #bihday amazing, hilarious #nephew eli ahm..."
5,choose :) #momtips
6,something inside dy ð¦ð¿â¨ eye ness #smok...
7,#finished#tattoo#inked#ink#loveitâ¤ï¸ #â¤ï¸...
8,@user @user @user never understand dad left yo...
9,#delicious #food #lovelife #capetown mannaepic...


In [77]:
# Using dataframe train_set: remove @user # 

import re

# Define a function to remove @user and # symbols
def remove_mentions_hashtags(text):
    # Remove mentions (@user)
    text = re.sub(r'@\w+', '', text)
    # Remove hashtags (#)
    text = re.sub(r'#', '', text)
    return text

# Apply the function to the 'tweet' column
train_set['cleaned_tweet'] = train_set['tweet'].apply(remove_mentions_hashtags)


In [78]:
train_set.head(n = 10)

Unnamed: 0,label,tweet,cleaned_tweet
8824,0,"#body body massage ending oil #massage ,body h...","body body massage ending oil massage ,body hap..."
31854,0,@user @ call back! #casting #castingcall #mode...,@ call back! casting castingcall model cute t...
28079,0,help creates #environment #togetherness &amp; ...,help creates environment togetherness &amp; mu...
29214,0,summer friendâ¨ð¥ #summer #friend #life #vl...,summer friendâ¨ð¥ summer friend life vlog w...
20025,0,follow snapchat awesomecutenes7 #snapchat #sel...,follow snapchat awesomecutenes7 snapchat selfi...
21437,0,@user robbie told @user #thesmb lead asked bro...,robbie told thesmb lead asked broadcast ble...
24167,0,cupcakes! #beachpay #payplanning #friends #eno...,cupcakes! beachpay payplanning friends enough ...
12833,0,happy boyð love guy #littleman #loveofmylif...,happy boyð love guy littleman loveofmylife ...
15840,0,keshi's news really hard accept. #sosudden,keshi's news really hard accept. sosudden
20134,0,day porn movie skinny amateur sex,day porn movie skinny amateur sex


Unnamed: 0,label,tweet,cleaned_tweet
8824,0,"#body body massage ending oil #massage ,body h...","body body massage ending oil massage ,body hap..."
31854,0,@user @ call back! #casting #castingcall #mode...,@ call back! casting castingcall model cute t...
28079,0,help creates #environment #togetherness &amp; ...,help creates environment togetherness &amp; mu...
29214,0,summer friendâ¨ð¥ #summer #friend #life #vl...,summer friendâ¨ð¥ summer friend life vlog w...
20025,0,follow snapchat awesomecutenes7 #snapchat #sel...,follow snapchat awesomecutenes7 snapchat selfi...
21437,0,@user robbie told @user #thesmb lead asked bro...,robbie told thesmb lead asked broadcast ble...
24167,0,cupcakes! #beachpay #payplanning #friends #eno...,cupcakes! beachpay payplanning friends enough ...
12833,0,happy boyð love guy #littleman #loveofmylif...,happy boyð love guy littleman loveofmylife ...
15840,0,keshi's news really hard accept. #sosudden,keshi's news really hard accept. sosudden
20134,0,day porn movie skinny amateur sex,day porn movie skinny amateur sex


In [79]:
test_set['cleaned_tweet'] = test_set['tweet'].apply(remove_mentions_hashtags)


In [80]:
test_set.head(n = 10)

Unnamed: 0,tweet,cleaned_tweet
0,#studiolife #aislife #requires #passion #dedic...,studiolife aislife requires passion dedication...
1,@user #white #supremacists want everyone see n...,white supremacists want everyone see new â ...
2,safe way heal #acne!! #altwaystoheal #healthy ...,safe way heal acne!! altwaystoheal healthy hea...
3,"hp cursed child book reservation already? yes,...","hp cursed child book reservation already? yes,..."
4,"3rd #bihday amazing, hilarious #nephew eli ahm...","3rd bihday amazing, hilarious nephew eli ahmir..."
5,choose :) #momtips,choose :) momtips
6,something inside dy ð¦ð¿â¨ eye ness #smok...,something inside dy ð¦ð¿â¨ eye ness smoke...
7,#finished#tattoo#inked#ink#loveitâ¤ï¸ #â¤ï¸...,finishedtattooinkedinkloveitâ¤ï¸ â¤ï¸â¤ï¸...
8,@user @user @user never understand dad left yo...,never understand dad left young.... :/ deep...
9,#delicious #food #lovelife #capetown mannaepic...,delicious food lovelife capetown mannaepicure ...


Unnamed: 0,tweet,cleaned_tweet
0,#studiolife #aislife #requires #passion #dedic...,studiolife aislife requires passion dedication...
1,@user #white #supremacists want everyone see n...,white supremacists want everyone see new â ...
2,safe way heal #acne!! #altwaystoheal #healthy ...,safe way heal acne!! altwaystoheal healthy hea...
3,"hp cursed child book reservation already? yes,...","hp cursed child book reservation already? yes,..."
4,"3rd #bihday amazing, hilarious #nephew eli ahm...","3rd bihday amazing, hilarious nephew eli ahmir..."
5,choose :) #momtips,choose :) momtips
6,something inside dy ð¦ð¿â¨ eye ness #smok...,something inside dy ð¦ð¿â¨ eye ness smoke...
7,#finished#tattoo#inked#ink#loveitâ¤ï¸ #â¤ï¸...,finishedtattooinkedinkloveitâ¤ï¸ â¤ï¸â¤ï¸...
8,@user @user @user never understand dad left yo...,never understand dad left young.... :/ deep...
9,#delicious #food #lovelife #capetown mannaepic...,delicious food lovelife capetown mannaepicure ...


In [81]:
# prompt: Using dataframe test_set: remove tweet colum and rename cleaned_tweet colum to tweet

# Drop the 'tweet' column
train_set = train_set.drop('tweet', axis=1)
test_set = test_set.drop('tweet', axis=1)

# Rename the 'cleaned_tweet' column to 'tweet'
train_set = train_set.rename(columns={'cleaned_tweet': 'tweet'})
test_set = test_set.rename(columns={'cleaned_tweet': 'tweet'})


In [82]:
train_set.head(n = 10)

Unnamed: 0,label,tweet
8824,0,"body body massage ending oil massage ,body hap..."
31854,0,@ call back! casting castingcall model cute t...
28079,0,help creates environment togetherness &amp; mu...
29214,0,summer friendâ¨ð¥ summer friend life vlog w...
20025,0,follow snapchat awesomecutenes7 snapchat selfi...
21437,0,robbie told thesmb lead asked broadcast ble...
24167,0,cupcakes! beachpay payplanning friends enough ...
12833,0,happy boyð love guy littleman loveofmylife ...
15840,0,keshi's news really hard accept. sosudden
20134,0,day porn movie skinny amateur sex


Unnamed: 0,label,tweet
8824,0,"body body massage ending oil massage ,body hap..."
31854,0,@ call back! casting castingcall model cute t...
28079,0,help creates environment togetherness &amp; mu...
29214,0,summer friendâ¨ð¥ summer friend life vlog w...
20025,0,follow snapchat awesomecutenes7 snapchat selfi...
21437,0,robbie told thesmb lead asked broadcast ble...
24167,0,cupcakes! beachpay payplanning friends enough ...
12833,0,happy boyð love guy littleman loveofmylife ...
15840,0,keshi's news really hard accept. sosudden
20134,0,day porn movie skinny amateur sex


In [83]:
train_set_copy = train_set.copy()
test_set_copy = test_set.copy()

test_set.head(n = 10)

Unnamed: 0,tweet
0,studiolife aislife requires passion dedication...
1,white supremacists want everyone see new â ...
2,safe way heal acne!! altwaystoheal healthy hea...
3,"hp cursed child book reservation already? yes,..."
4,"3rd bihday amazing, hilarious nephew eli ahmir..."
5,choose :) momtips
6,something inside dy ð¦ð¿â¨ eye ness smoke...
7,finishedtattooinkedinkloveitâ¤ï¸ â¤ï¸â¤ï¸...
8,never understand dad left young.... :/ deep...
9,delicious food lovelife capetown mannaepicure ...


Unnamed: 0,tweet
0,studiolife aislife requires passion dedication...
1,white supremacists want everyone see new â ...
2,safe way heal acne!! altwaystoheal healthy hea...
3,"hp cursed child book reservation already? yes,..."
4,"3rd bihday amazing, hilarious nephew eli ahmir..."
5,choose :) momtips
6,something inside dy ð¦ð¿â¨ eye ness smoke...
7,finishedtattooinkedinkloveitâ¤ï¸ â¤ï¸â¤ï¸...
8,never understand dad left young.... :/ deep...
9,delicious food lovelife capetown mannaepicure ...


Divide datasets

In [84]:
x = train_set["tweet"]
y = train_set["label"]

train_x, test_x, train_y, test_y = model_selection.train_test_split(x, y, test_size = 0.20, shuffle = True, random_state = 11)

print(colored("\nDIVIDED SUCCESFULLY...", color = "green", attrs = ["dark", "bold"]))


DIVIDED SUCCESFULLY...

DIVIDED SUCCESFULLY...


In [85]:
train_x_copy, test_x_copy, train_y_copy, test_y_copy = train_x, test_x, train_y, test_y

In [86]:
train_x_copy

Unnamed: 0,tweet
22249,"apparently taragon ""christmassy herb"""
671,sex 40 sex nake woman
13836,what else new
21925,great one great one. mrhockey
7827,graduated usa &amp; friend wanted 2 forget rec...
...,...
16239,"morning! hope everyone wonderful, fun-filled d..."
10920,officially purchased ticket see sia
25435,"please repo guy, openly racist reprehensible..."
22094,obviously tcot get &amp; inaspanof maybe allof...


Unnamed: 0,tweet
22249,"apparently taragon ""christmassy herb"""
671,sex 40 sex nake woman
13836,what else new
21925,great one great one. mrhockey
7827,graduated usa &amp; friend wanted 2 forget rec...
...,...
16239,"morning! hope everyone wonderful, fun-filled d..."
10920,officially purchased ticket see sia
25435,"please repo guy, openly racist reprehensible..."
22094,obviously tcot get &amp; inaspanof maybe allof...


In [87]:
!pip install transformers



In [88]:
train_x, test_x, train_y, test_y = train_x_copy, test_x_copy, train_y_copy, test_y_copy

In [89]:
import tensorflow as tf
from transformers import BertTokenizer, TFBertModel, BertConfig
from tensorflow.keras.layers import Input, Bidirectional, LSTM, Dense, Dropout
from tensorflow.keras.models import Model
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

# Custom Metrics for Precision, Recall, and F1 Score
def precision_m(y_true, y_pred):
    y_true = tf.cast(y_true, tf.float32)  # Cast y_true to float32
    true_positives = tf.reduce_sum(tf.round(tf.clip_by_value(y_true * y_pred, 0, 1)))
    predicted_positives = tf.reduce_sum(tf.round(tf.clip_by_value(y_pred, 0, 1)))
    precision = true_positives / (predicted_positives + tf.keras.backend.epsilon())
    return precision

def recall_m(y_true, y_pred):
    y_true = tf.cast(y_true, tf.float32)  # Cast y_true to float32
    true_positives = tf.reduce_sum(tf.round(tf.clip_by_value(y_true * y_pred, 0, 1)))
    possible_positives = tf.reduce_sum(tf.round(tf.clip_by_value(y_true, 0, 1)))
    recall = true_positives / (possible_positives + tf.keras.backend.epsilon())
    return recall

def f1_m(y_true, y_pred):
    y_true = tf.cast(y_true, tf.float32)  # Cast y_true to float32
    precision = precision_m(y_true, y_pred)
    recall = recall_m(y_true, y_pred)
    return 2 * ((precision * recall) / (precision + recall + tf.keras.backend.epsilon()))


# 1. Instantiate the BERT tokenizer and model outside the model definition
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
config = BertConfig.from_pretrained('bert-base-uncased')
bert_model = TFBertModel.from_pretrained('bert-base-uncased', config=config)

# 2. Define a custom layer to wrap the BERT model
class BertLayer(tf.keras.layers.Layer):
    def __init__(self, **kwargs):
        super(BertLayer, self).__init__(**kwargs)
        self.bert = bert_model

    def call(self, inputs):
        return self.bert(inputs)[0]  # Get the pooled output

# 3. Define input layers
input_ids = Input(shape=(None,), dtype=tf.int32, name="input_ids")
attention_mask = Input(shape=(None,), dtype=tf.int32, name="attention_mask")
inputs = {'input_ids': input_ids, 'attention_mask': attention_mask}

# 4. Use the custom BertLayer
bert_layer_output = BertLayer()(inputs)

# 5. Build the RNN model using the output from BertLayer
x = Bidirectional(LSTM(64, return_sequences=True))(bert_layer_output)
x = Dropout(0.2)(x)
x = Bidirectional(LSTM(64, return_sequences=True))(x)
x = Dropout(0.2)(x)
x = Bidirectional(LSTM(32, return_sequences=True))(x)
x = Dropout(0.2)(x)
x = Bidirectional(LSTM(32))(x)
outputs = Dense(1, activation='sigmoid')(x)

# 6. Create the Keras model
model = Model(inputs=inputs, outputs=outputs)

# 7. Compile the model with custom metrics
model.compile(optimizer='adam',
              loss='binary_crossentropy',
              metrics=['accuracy', precision_m, recall_m, f1_m])

# 8. Tokenize the text data
train_encodings = tokenizer(train_x.tolist(), truncation=True, padding=True)
test_encodings = tokenizer(test_x.tolist(), truncation=True, padding=True)

# 9. Convert encodings to NumPy arrays
train_input_ids = np.array(train_encodings['input_ids'])
train_attention_mask = np.array(train_encodings['attention_mask'])
test_input_ids = np.array(test_encodings['input_ids'])
test_attention_mask = np.array(test_encodings['attention_mask'])

# 10. Train the model
model.fit(
    x={'input_ids': train_input_ids, 'attention_mask': train_attention_mask},
    y=train_y,
    epochs=10,
    batch_size=32,
    validation_data=(
        {'input_ids': test_input_ids, 'attention_mask': test_attention_mask},
        test_y
    )
)

# 11. Evaluate the model
loss, accuracy, precision, recall, f1 = model.evaluate(
    x={'input_ids': test_input_ids, 'attention_mask': test_attention_mask},
    y=test_y
)

print(f'Test Loss: {loss:.4f}')
print(f'Test Accuracy: {accuracy:.4f}')
print(f'Test Precision: {precision:.4f}')
print(f'Test Recall: {recall:.4f}')
print(f'Test F1 Score: {f1:.4f}')

# 12. Manual Calculation for Verification
y_pred_probs = model.predict({'input_ids': test_input_ids, 'attention_mask': test_attention_mask})
y_pred = (y_pred_probs > 0.5).astype(int)

precision_sklearn = precision_score(test_y, y_pred)
recall_sklearn = recall_score(test_y, y_pred)
f1_sklearn = f1_score(test_y, y_pred)

print(f'[Sklearn] Precision: {precision_sklearn:.4f}')
print(f'[Sklearn] Recall: {recall_sklearn:.4f}')
print(f'[Sklearn] F1 Score: {f1_sklearn:.4f}')


Some weights of the PyTorch model were not used when initializing the TF 2.0 model TFBertModel: ['cls.predictions.transform.dense.weight', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.transform.dense.bias']
- This IS expected if you are initializing TFBertModel from a PyTorch model trained on another task or with another architecture (e.g. initializing a TFBertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertModel from a PyTorch model that you expect to be exactly identical (e.g. initializing a TFBertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of TFBertModel were initialized from the PyTorch model.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertModel for predictions w

Epoch 1/10
[1m113/113[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m57s[0m 296ms/step - accuracy: 0.7647 - f1_m: 0.7295 - loss: 0.4758 - precision_m: 0.7760 - recall_m: 0.7385 - val_accuracy: 0.8205 - val_f1_m: 0.8160 - val_loss: 0.3949 - val_precision_m: 0.8260 - val_recall_m: 0.8161
Epoch 2/10
[1m113/113[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m30s[0m 251ms/step - accuracy: 0.8711 - f1_m: 0.8717 - loss: 0.3156 - precision_m: 0.8793 - recall_m: 0.8733 - val_accuracy: 0.8439 - val_f1_m: 0.8519 - val_loss: 0.3838 - val_precision_m: 0.8120 - val_recall_m: 0.9077
Epoch 3/10
[1m113/113[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m42s[0m 260ms/step - accuracy: 0.9127 - f1_m: 0.9089 - loss: 0.2489 - precision_m: 0.9260 - recall_m: 0.9004 - val_accuracy: 0.8395 - val_f1_m: 0.8294 - val_loss: 0.3896 - val_precision_m: 0.8738 - val_recall_m: 0.7977
Epoch 4/10
[1m113/113[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m41s[0m 261ms/step - accuracy: 0.9239 - f1_m: 0.9198 - loss: 0.

In [94]:
import tensorflow as tf
from tensorflow.keras.layers import Input, Embedding, Bidirectional, LSTM, Dense, Dropout
from tensorflow.keras.models import Model
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

# 1. Download GloVe pre-trained embeddings
import os
import requests
import zipfile

# GloVe URL
GLOVE_URL = "https://nlp.stanford.edu/data/glove.6B.zip"
GLOVE_DIR = "glove/"
GLOVE_FILE = os.path.join(GLOVE_DIR, "glove.6B.300d.txt")

# Download and extract GloVe
if not os.path.exists(GLOVE_DIR):
    os.makedirs(GLOVE_DIR)

if not os.path.exists(GLOVE_FILE):
    print("Downloading GloVe embeddings...")
    response = requests.get(GLOVE_URL, stream=True)
    with open(os.path.join(GLOVE_DIR, "glove.6B.zip"), "wb") as file:
        for chunk in response.iter_content(chunk_size=1024):
            file.write(chunk)
    print("Extracting GloVe embeddings...")
    with zipfile.ZipFile(os.path.join(GLOVE_DIR, "glove.6B.zip"), "r") as zip_ref:
        zip_ref.extractall(GLOVE_DIR)

print("GloVe embeddings are ready!")

# 2. Load GloVe embeddings into a dictionary
def load_glove_embeddings(file_path):
    embeddings_index = {}
    with open(file_path, encoding="utf-8") as f:
        for line in f:
            values = line.split()
            word = values[0]
            coefs = np.asarray(values[1:], dtype="float32")
            embeddings_index[word] = coefs
    return embeddings_index

print("Loading GloVe embeddings...")
embeddings_index = load_glove_embeddings(GLOVE_FILE)
print(f"Loaded {len(embeddings_index)} word vectors.")

# 3. Example Data
MAX_VOCAB_SIZE = 10000
MAX_SEQUENCE_LENGTH = 100
EMBEDDING_DIM = 300

train_x, test_x, train_y, test_y = train_x_copy, test_x_copy, train_y_copy, test_y_copy

# Tokenize text data
tokenizer = Tokenizer(num_words=MAX_VOCAB_SIZE, oov_token="<OOV>")
tokenizer.fit_on_texts(train_x)
word_index = tokenizer.word_index

train_sequences = tokenizer.texts_to_sequences(train_x)
test_sequences = tokenizer.texts_to_sequences(test_x)

train_padded = pad_sequences(train_sequences, maxlen=MAX_SEQUENCE_LENGTH, padding="post")
test_padded = pad_sequences(test_sequences, maxlen=MAX_SEQUENCE_LENGTH, padding="post")

# 4. Create embedding matrix for GloVe
embedding_matrix = np.zeros((MAX_VOCAB_SIZE, EMBEDDING_DIM))
for word, i in word_index.items():
    if i < MAX_VOCAB_SIZE:
        embedding_vector = embeddings_index.get(word)
        if embedding_vector is not None:
            embedding_matrix[i] = embedding_vector

# 5. Define the Model
input_layer = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype=tf.int32, name="input_layer")

# GloVe Embedding Layer
embedding_layer = Embedding(input_dim=MAX_VOCAB_SIZE,
                            output_dim=EMBEDDING_DIM,
                            weights=[embedding_matrix],
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=False)(input_layer)

# BiLSTM Layers
x = Bidirectional(LSTM(64, return_sequences=True))(embedding_layer)
x = Dropout(0.2)(x)
x = Bidirectional(LSTM(64, return_sequences=True))(x)
x = Dropout(0.2)(x)
x = Bidirectional(LSTM(32, return_sequences=True))(x)
x = Dropout(0.2)(x)
x = Bidirectional(LSTM(32))(x)
output_layer = Dense(1, activation='sigmoid')(x)


model = Model(inputs=input_layer, outputs=output_layer)

# 6. Compile Model with Custom Metrics
# Custom Metrics for Precision, Recall, and F1 Score
def precision_m(y_true, y_pred):
    y_true = tf.cast(y_true, tf.float32)  # Cast y_true to float32
    true_positives = tf.reduce_sum(tf.round(tf.clip_by_value(y_true * y_pred, 0, 1)))
    predicted_positives = tf.reduce_sum(tf.round(tf.clip_by_value(y_pred, 0, 1)))
    precision = true_positives / (predicted_positives + tf.keras.backend.epsilon())
    return precision

def recall_m(y_true, y_pred):
    y_true = tf.cast(y_true, tf.float32)  # Cast y_true to float32
    true_positives = tf.reduce_sum(tf.round(tf.clip_by_value(y_true * y_pred, 0, 1)))
    possible_positives = tf.reduce_sum(tf.round(tf.clip_by_value(y_true, 0, 1)))
    recall = true_positives / (possible_positives + tf.keras.backend.epsilon())
    return recall

def f1_m(y_true, y_pred):
    y_true = tf.cast(y_true, tf.float32)  # Cast y_true to float32
    precision = precision_m(y_true, y_pred)
    recall = recall_m(y_true, y_pred)
    return 2 * ((precision * recall) / (precision + recall + tf.keras.backend.epsilon()))

model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy", precision_m, recall_m, f1_m])

# 7. Train Model
model.fit(
    x=train_padded,
    y=train_y,
    epochs=10,
    batch_size=32,
    validation_data=(test_padded, test_y)
)

# 8. Evaluate Model
loss, accuracy, precision, recall, f1 = model.evaluate(test_padded, test_y)

print(f"Test Loss: {loss:.4f}")
print(f"Test Accuracy: {accuracy:.4f}")
print(f"Test Precision: {precision:.4f}")
print(f"Test Recall: {recall:.4f}")
print(f"Test F1 Score: {f1:.4f}")

# Manual Scikit-learn Verification
y_pred_probs = model.predict(test_padded)
y_pred = (y_pred_probs > 0.5).astype(int)

precision_sklearn = precision_score(test_y, y_pred)
recall_sklearn = recall_score(test_y, y_pred)
f1_sklearn = f1_score(test_y, y_pred)

print(f"[Sklearn] Precision: {precision_sklearn:.4f}")
print(f"[Sklearn] Recall: {recall_sklearn:.4f}")
print(f"[Sklearn] F1 Score: {f1_sklearn:.4f}")


GloVe embeddings are ready!
Loading GloVe embeddings...
Loaded 400000 word vectors.
Epoch 1/10
[1m113/113[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m19s[0m 103ms/step - accuracy: 0.7455 - f1_m: 0.7598 - loss: 0.5061 - precision_m: 0.7384 - recall_m: 0.8200 - val_accuracy: 0.8339 - val_f1_m: 0.8279 - val_loss: 0.4060 - val_precision_m: 0.8398 - val_recall_m: 0.8237
Epoch 2/10
[1m113/113[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m6s[0m 55ms/step - accuracy: 0.8812 - f1_m: 0.8811 - loss: 0.3222 - precision_m: 0.8900 - recall_m: 0.8812 - val_accuracy: 0.8395 - val_f1_m: 0.8285 - val_loss: 0.3714 - val_precision_m: 0.8732 - val_recall_m: 0.7966
Epoch 3/10
[1m113/113[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m9s[0m 48ms/step - accuracy: 0.8973 - f1_m: 0.8941 - loss: 0.2660 - precision_m: 0.8997 - recall_m: 0.8968 - val_accuracy: 0.8339 - val_f1_m: 0.8346 - val_loss: 0.3675 - val_precision_m: 0.8180 - val_recall_m: 0.8590
Epoch 4/10
[1m113/113[0m [32m━━━━━━━━━━━━━━━━━━━━