
#Cyber Bullying Classification using RNN

#Dataset Description:

This dataset comes from sources related to the automatic detection of cyber-bullying. The data is drawn from social media platforms such as Twitter and Wikipedia. Each record contains a text and a label indicating whether it is bullying or not. The data covers different types of cyber-bullying, such as hate speech, aggression, insults and toxicity.

#Contents

1.Import necessary Packages

2.Load the Dataset

3.Pre-Processing the Data

4.Split the Dataset

5.Initializing the Model

6.Evaluating the Model


#Import necessary Packages

In [1]:
import pandas as pd
import re
import nltk
from sklearn.preprocessing import LabelEncoder
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

In [2]:
import numpy as np
from keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.layers import Dense, Embedding, LSTM
from keras.models import Sequential
from sklearn.model_selection import train_test_split

In [3]:
# download nltk resources
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...


True

In [4]:
# define stop words
stop_words = set(stopwords.words('english'))

#Loading the Dataset

In [5]:
df = pd.read_csv("twitter_parsed_dataset.csv")
df

Unnamed: 0,index,id,Text,Annotation,oh_label
0,5.74948705591165E+017,5.74948705591165E+017,@halalflaws @biebervalue @greenlinerzjm I read...,none,0.0
1,5.71917888690393E+017,5.71917888690393E+017,@ShreyaBafna3 Now you idiots claim that people...,none,0.0
2,3.90255841338601E+017,3.90255841338601E+017,"RT @Mooseoftorment Call me sexist, but when I ...",sexism,1.0
3,5.68208850655916E+017,5.68208850655916E+017,"@g0ssipsquirrelx Wrong, ISIS follows the examp...",racism,1.0
4,5.75596338802373E+017,5.75596338802373E+017,#mkr No No No No No No,none,0.0
...,...,...,...,...,...
16846,5.75606766236475E+017,5.75606766236475E+017,"Feeling so sorry for the girls, they should be...",none,0.0
16847,5.72333822886326E+017,5.72333822886326E+017,#MKR 'pretty good dishes we're happy with' - O...,none,0.0
16848,5.72326950057845E+017,5.72326950057845E+017,RT @colonelkickhead: Deconstructed lemon tart!...,none,0.0
16849,5.74799612642357E+017,5.74799612642357E+017,@versacezaynx @nyazpolitics @greenlinerzjm You...,none,0.0


In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 16851 entries, 0 to 16850
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   index       16851 non-null  object 
 1   id          16850 non-null  object 
 2   Text        16850 non-null  object 
 3   Annotation  16848 non-null  object 
 4   oh_label    16848 non-null  float64
dtypes: float64(1), object(4)
memory usage: 658.4+ KB


#Checking for Null values

In [7]:
df.isnull().sum()

index         0
id            1
Text          1
Annotation    3
oh_label      3
dtype: int64

In [8]:
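# Drop rows with null values; .copy() makes df an independent DataFrame so
# that adding columns later does not raise SettingWithCopyWarning
df = df.dropna().copy()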

In [9]:
df.isnull().sum()

index         0
id            0
Text          0
Annotation    0
oh_label      0
dtype: int64

In [10]:
df.describe()

Unnamed: 0,oh_label
count,16848.0
mean,0.317367
std,0.465465
min,0.0
25%,0.0
50%,0.0
75%,1.0
max,1.0
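
The mean of `oh_label` (0.317) indicates that roughly 32% of the tweets are labeled as bullying, so the classes are imbalanced. A minimal sketch of how per-class counts could be turned into balanced class weights; the `labels` list is made-up data and `balanced_class_weights` is a hypothetical helper, not part of this notebook:

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# Made-up labels for illustration only (not the real dataset)
labels = ['none'] * 6 + ['sexism'] * 3 + ['racism'] * 1
weights = balanced_class_weights(labels)
print(weights)
```

Weights of this kind could be passed to Keras through the `class_weight` argument of `model.fit`, keyed by the integer class index.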


#Pre-Processing the Data

In [11]:
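# class names in the order LabelEncoder assigns codes (sorted alphabetically):
# 0 = none, 1 = racism, 2 = sexism
label_names = ['none', 'racism', 'sexism']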


In [12]:
# initialize lemmatizer
lemmatizer = WordNetLemmatizer()

In [15]:
def preprocess_text(text):
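    # convert text to lowercase and remove punctuation
    text = text.lower()
    text = re.sub(r'[^\w\s]', '', text)

    # intended to remove @mentions; note that '@' was already stripped by the
    # punctuation step above, so this pattern never matches and usernames
    # stay in the text (run it before the punctuation step to drop them)
    text = re.sub(r'@\w+', '', text)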

    # tokenize the text
    tokens = word_tokenize(text)

    # remove stop words
    tokens = [token for token in tokens if token not in stop_words]

    # apply lemmatization
    tokens_lemmatized = [lemmatizer.lemmatize(token) for token in tokens]

    # strip any remaining non-alphanumeric characters (e.g. underscores) from each token
    tokens_lemmatized = [re.sub(r'[^a-zA-Z0-9]', '', token) for token in tokens_lemmatized]

    return ' '.join(tokens_lemmatized)

df['Text_preprocessed'] = df['Text'].apply(preprocess_text)



In [16]:
df['Text_preprocessed']

0        halalflaws biebervalue greenlinerzjm read cont...
1        shreyabafna3 idiot claim people tried stop bec...
2        rt mooseoftorment call sexist go auto place id...
3        g0ssipsquirrelx wrong isi follows example moha...
4                                                      mkr
                               ...                        
16846     feeling sorry girl safe kat andre going home mkr
16847    mkr pretty good dish happy ok well im never ea...
16848    rt colonelkickhead deconstructed lemon tartcan...
16849    versacezaynx nyazpolitics greenlinerzjm stupid...
16850    protest youre mad there much reason youd tweet...
Name: Text_preprocessed, Length: 16848, dtype: object
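
In the output above the usernames (for example `halalflaws`) are still present: `preprocess_text` strips punctuation, including `@`, before it applies the `@\w+` pattern, so that pattern no longer finds anything to remove. A minimal sketch of the reordered steps, using only `re`; `strip_mentions_and_punctuation` is a hypothetical helper for illustration and leaves out the tokenization, stop-word and lemmatization steps:

```python
import re

def strip_mentions_and_punctuation(text):
    """Lowercase, drop @mentions, then drop punctuation (order matters)."""
    text = text.lower()
    text = re.sub(r'@\w+', '', text)      # remove mentions while '@' is still present
    text = re.sub(r'[^\w\s]', '', text)   # now remove the remaining punctuation
    return ' '.join(text.split())         # collapse repeated whitespace

print(strip_mentions_and_punctuation('@ShreyaBafna3 Now you idiots claim that'))
```

Retraining with this order would change the vocabulary, and therefore the results reported in this notebook.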

In [17]:
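# Find the maximum length, in characters, of a preprocessed tweet.
# This is an upper bound on the number of tokens, so it is a safe
# (if generous) value for maxlen when padding the sequences.
max_len = max(len(text) for text in df['Text_preprocessed'])
print(f'Maximum tweet length: {max_len}')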


Maximum tweet length: 132
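
The value 132 is a character count, while `pad_sequences` measures length in tokens. A minimal sketch of the difference, assuming whitespace-separated tokens as returned by `preprocess_text`; the `texts` list holds a few short examples in the style of the output above:

```python
# A few short preprocessed tweets for illustration
texts = [
    'mkr',
    'feeling sorry girl safe kat andre going home mkr',
    'rt call sexist go auto place',
]

char_len = max(len(t) for t in texts)            # length in characters
token_len = max(len(t.split()) for t in texts)   # length in tokens

print(f'Max characters: {char_len}, max tokens: {token_len}')
```

Using the token-based value as `maxlen` would give shorter padded sequences with the same information.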


#Encoding the Target Variable

In [18]:
# Convert target variable to numerical values
le = LabelEncoder()
df['Annotation_num'] = le.fit_transform(df['Annotation'])



In [19]:
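# vectorize the preprocessed text using TF-IDF; the result is only displayed
# here and is not used by the model below
vectorizer_tfidf = TfidfVectorizer()
vectorizer_tfidf.fit_transform(df['Text_preprocessed'])

# converting the text to pre-trained word embeddings is not shown, as it
# requires pre-trained embedding files; the model below learns its own
# embeddings in an Embedding layer instead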

<16848x24054 sparse matrix of type '<class 'numpy.float64'>'
	with 152470 stored elements in Compressed Sparse Row format>

In [20]:
tokenizer = Tokenizer(num_words=5000, split=' ')
tokenizer.fit_on_texts(df['Text_preprocessed'].values)
X = tokenizer.texts_to_sequences(df['Text_preprocessed'].values)
X = pad_sequences(X, maxlen=max_len)
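
By default `pad_sequences` pads with zeros at the front and, for sequences longer than `maxlen`, drops tokens from the front. A pure-Python sketch of that default behaviour, for intuition only; `pad_pre` is a hypothetical stand-in, not the Keras implementation:

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad (and left-truncate) integer sequences to a fixed length."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]                           # keep the last maxlen tokens
        padded.append([value] * (maxlen - len(seq)) + seq)  # fill the front with zeros
    return padded

print(pad_pre([[5, 8], [1, 2, 3, 4, 5]], maxlen=4))
```

Index 0 is reserved for padding, which is why the `Tokenizer` word indices start at 1.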

In [21]:
df

Unnamed: 0,index,id,Text,Annotation,oh_label,Text_preprocessed,Annotation_num
0,5.74948705591165E+017,5.74948705591165E+017,@halalflaws @biebervalue @greenlinerzjm I read...,none,0.0,halalflaws biebervalue greenlinerzjm read cont...,0
1,5.71917888690393E+017,5.71917888690393E+017,@ShreyaBafna3 Now you idiots claim that people...,none,0.0,shreyabafna3 idiot claim people tried stop bec...,0
2,3.90255841338601E+017,3.90255841338601E+017,"RT @Mooseoftorment Call me sexist, but when I ...",sexism,1.0,rt mooseoftorment call sexist go auto place id...,2
3,5.68208850655916E+017,5.68208850655916E+017,"@g0ssipsquirrelx Wrong, ISIS follows the examp...",racism,1.0,g0ssipsquirrelx wrong isi follows example moha...,1
4,5.75596338802373E+017,5.75596338802373E+017,#mkr No No No No No No,none,0.0,mkr,0
...,...,...,...,...,...,...,...
16846,5.75606766236475E+017,5.75606766236475E+017,"Feeling so sorry for the girls, they should be...",none,0.0,feeling sorry girl safe kat andre going home mkr,0
16847,5.72333822886326E+017,5.72333822886326E+017,#MKR 'pretty good dishes we're happy with' - O...,none,0.0,mkr pretty good dish happy ok well im never ea...,0
16848,5.72326950057845E+017,5.72326950057845E+017,RT @colonelkickhead: Deconstructed lemon tart!...,none,0.0,rt colonelkickhead deconstructed lemon tartcan...,0
16849,5.74799612642357E+017,5.74799612642357E+017,@versacezaynx @nyazpolitics @greenlinerzjm You...,none,0.0,versacezaynx nyazpolitics greenlinerzjm stupid...,0


In [22]:
# Convert target variable to categorical
y = to_categorical(df['Annotation_num'].values)
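
`LabelEncoder` assigns integer codes in sorted order of the class names, and `to_categorical` turns each code into a one-hot row. A small sketch of both steps on made-up labels, using NumPy's `np.eye` as a stand-in for `to_categorical`:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

annotations = ['none', 'sexism', 'racism', 'none']  # made-up sample

le = LabelEncoder()
codes = le.fit_transform(annotations)        # integer code per label
one_hot = np.eye(len(le.classes_))[codes]    # one-hot rows, like to_categorical

print(list(le.classes_))   # class names in code order
print(codes)
print(one_hot)
```

Reading the names from `le.classes_` avoids having to keep a separately hard-coded list such as `label_names` in sync with the encoder.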

#Splitting the Dataset

In [23]:
# Split data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
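
The split above is different on every run and does not explicitly preserve class proportions. A sketch of a reproducible, stratified split on toy arrays, assuming integer class labels are available for the `stratify` argument; the seed value 42 is arbitrary:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 20 samples with imbalanced classes (illustrative only)
X_toy = np.arange(40).reshape(20, 2)
labels_toy = np.array([0] * 12 + [1] * 4 + [2] * 4)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, labels_toy,
    test_size=0.25,
    random_state=42,       # fixed seed so the split is repeatable
    stratify=labels_toy,   # keep class proportions in both parts
)

print(np.bincount(y_tr), np.bincount(y_te))
```

In this notebook the integer labels in `df['Annotation_num']` could serve as the `stratify` argument.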

#Initializing the Model

In [24]:
# Build RNN model
model = Sequential()
model.add(Embedding(5000, 128, input_length=X.shape[1]))
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(3, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

In [25]:
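# Train model (note: the test set is also used as validation data here,
# so the validation metrics are not an independent check)
model.fit(X_train, y_train, epochs=10, batch_size=32, validation_data=(X_test, y_test))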


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f134e13b250>

#Evaluating the Model

In [26]:
# Evaluate model
loss, accuracy = model.evaluate(X_test, y_test, verbose=0)
print(f'Test loss: {loss:.3f}')
print(f'Test accuracy: {accuracy:.3f}')


Test loss: 0.931
Test accuracy: 0.816
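
Accuracy alone can hide weak performance on the smaller classes: since about 68% of the tweets have `oh_label` 0 (not bullying), always predicting the majority class would already reach roughly that accuracy. A sketch of a confusion matrix and per-class report with scikit-learn; `y_true` and `y_pred` are made-up arrays, not this model's predictions:

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

label_names = ['none', 'racism', 'sexism']

# Made-up labels for illustration (0 = none, 1 = racism, 2 = sexism)
y_true = np.array([0, 0, 0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 0, 2, 1, 0, 2, 0])

cm = confusion_matrix(y_true, y_pred)   # rows: true class, columns: predicted class
print(cm)
print(classification_report(y_true, y_pred, target_names=label_names))
```

With this notebook's one-hot targets, the integer labels would come from `np.argmax(y_test, axis=1)` and `np.argmax(model.predict(X_test), axis=1)`.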


#Predictions

In [31]:
# Make predictions on new data
new_data = ['RT @Millhouse66 @Maureen_JS nooo not sexist but most women are bad drivers','A good Muslim is good despite his bad religion, not because of it.', 'Another tweet to classify']

# preprocess each text separately using list comprehension
new_data_processed = [preprocess_text(text) for text in new_data]

# convert preprocessed text to padded sequences (same maxlen as in training)
new_data_seq = tokenizer.texts_to_sequences(new_data_processed)
new_data_padded = pad_sequences(new_data_seq, maxlen=max_len)

# Make predictions using the trained model
predictions = model.predict(new_data_padded)




In [32]:
for i, pred in enumerate(predictions):
    label = np.argmax(pred)
    print(f'Tweet {i+1} prediction: {label_names[label]}')

Tweet 1 prediction: sexism
Tweet 2 prediction: racism
Tweet 3 prediction: none
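
`argmax` keeps only the top class and discards how confident the model was. A sketch that reports the probability next to each label; the `probs` array is made up to stand in for the output of `model.predict`:

```python
import numpy as np

label_names = ['none', 'racism', 'sexism']

# Made-up softmax outputs, one row per tweet (each row sums to 1)
probs = np.array([
    [0.10, 0.05, 0.85],
    [0.20, 0.70, 0.10],
    [0.55, 0.25, 0.20],
])

results = []
for i, row in enumerate(probs):
    idx = int(np.argmax(row))                         # index of the most likely class
    results.append((label_names[idx], float(row[idx])))
    print(f'Tweet {i+1} prediction: {label_names[idx]} (p={row[idx]:.2f})')
```

A low top probability, as in the third made-up row, is a hint that the prediction should be treated with caution.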


First, I imported the necessary packages and loaded the dataset, then I inspected it (column types, null values and summary statistics) to better understand the cyber-bullying data.

Next, I pre-processed the data and trained an RNN (LSTM) model. The test accuracy of this model is about 82% (0.816). Then I made some predictions on new tweets.

For the three example tweets, the predicted labels (sexism, racism, none) are the ones I expected, although three examples are too few to judge the model in general.


In summary, I explored the dataset, pre-processed the data, then trained one RNN model and evaluated it.

Thank you.