# Module: Capstone Project

## Section: Natural Language Processing

## <font color='#4073FF'> Project Solution: Cyberbullying Classification </font>

###  <font color='#14AAF5'>Create a multiclass classification model to predict the cyberbullying type</font>


### Project Brief:

With the rise of social media, cyberbullying has reached all-time highs. We can combat this by building models that automatically flag potentially harmful tweets and break down the patterns of hatred.
As social media usage becomes increasingly prevalent in every age group, a vast majority of citizens rely on this essential medium for day-to-day communication. Social media’s ubiquity means that cyberbullying can effectively impact anyone at any time or anywhere, and the relative anonymity of the internet makes such personal attacks more difficult to stop than traditional bullying.
On April 15th, 2020, UNICEF issued a warning in response to the increased risk of cyberbullying during the COVID-19 pandemic due to widespread school closures, increased screen time, and decreased face-to-face social interaction. The statistics of cyberbullying are outright alarming: 36.5% of middle and high school students have felt cyberbullied and 87% have observed cyberbullying, with effects ranging from decreased academic performance to depression to suicidal thoughts.
In this project, you are required to create a multiclass classification model to predict the type of cyberbullying using tweets as input.


### 1. Dataset

To help address this problem, the dataset provided contains more than 47,000 tweets labelled according to the class of cyberbullying:

- Age
- Ethnicity
- Gender
- Religion
- Other type of cyberbullying
- Not cyberbullying

The data has been balanced so that each class contains around 8,000 tweets.

In [None]:
# Importing Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import re
from sklearn.feature_extraction.text import CountVectorizer

import plotly.graph_objs as go
import plotly.express as px

import warnings
warnings.filterwarnings('ignore')

### 2. Data collection and exploration

In [None]:
# Downloading dataset
!wget https://raw.githubusercontent.com/bluedataconsulting/AIMasteryProgram/main/Projects/Module-15/cyberbullying_tweets.csv

--2022-03-01 05:27:23--  https://raw.githubusercontent.com/bluedataconsulting/AIMasteryProgram/main/Projects/Module-15/cyberbullying_tweets.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7174545 (6.8M) [text/plain]
Saving to: ‘cyberbullying_tweets.csv’


2022-03-01 05:27:23 (109 MB/s) - ‘cyberbullying_tweets.csv’ saved [7174545/7174545]



In [None]:
# Viewing data
df = pd.read_csv("cyberbullying_tweets.csv")
df.head()

Unnamed: 0,tweet_text,cyberbullying_type
0,"In other words #katandandre, your food was cra...",not_cyberbullying
1,Why is #aussietv so white? #MKR #theblock #ImA...,not_cyberbullying
2,@XochitlSuckkks a classy whore? Or more red ve...,not_cyberbullying
3,"@Jason_Gio meh. :P thanks for the heads up, b...",not_cyberbullying
4,@RudhoeEnglish This is an ISIS account pretend...,not_cyberbullying


In [None]:
#Checking info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 47692 entries, 0 to 47691
Data columns (total 2 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   tweet_text          47692 non-null  object
 1   cyberbullying_type  47692 non-null  object
dtypes: object(2)
memory usage: 745.3+ KB


In [None]:
# Describing data
df.describe()

Unnamed: 0,tweet_text,cyberbullying_type
count,47692,47692
unique,46017,6
top,RT @sailorhg: the intro for my hardware hackin...,religion
freq,2,7998


In [None]:
# Viewing types
df['cyberbullying_type'].value_counts()

religion               7998
age                    7992
gender                 7973
ethnicity              7961
not_cyberbullying      7945
other_cyberbullying    7823
Name: cyberbullying_type, dtype: int64

### 3. Data cleaning

In [None]:
# Looking for missing values
df.isnull().sum()

tweet_text            0
cyberbullying_type    0
dtype: int64

In [None]:
# Duplicate entries
print(df.duplicated().sum())

# Dropping duplicates
df.drop_duplicates(inplace = True)

36


In [None]:
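# Twitter text cleaning

# Matches @mentions, URLs and runs of non-alphanumeric characters.
# Caveat: a mention that follows a space or punctuation keeps its user name,
# because the last alternative consumes that character together with the "@".
TEXT_CLEANING_RE = r"@\S+|https?:\S+|[^A-Za-z0-9]+"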

from wordcloud import STOPWORDS
STOPWORDS.update(['rt', 'mkr', 'didn', 'bc', 'n', 'm', 'im', 'll', 'y', 've', 'u', 'ur', 'don', 't', 's'])

def lower(text):
    return text.lower()

def remove_twitter(text):
    return re.sub(TEXT_CLEANING_RE, ' ', text)

def remove_stopwords(text):
    return " ".join([word for word in str(text).split() if word not in STOPWORDS])

def clean_text(text):
    text = lower(text)
    text = remove_twitter(text)
    text = remove_stopwords(text)
    return text
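
To make the effect of the three cleaning steps concrete, here is a minimal, dependency-free sketch. It assumes a tiny hypothetical stop-word set (`DEMO_STOPWORDS`) in place of the `wordcloud` list, and the sample tweets and the helper name `clean_text_demo` are invented for illustration. It also shows that a mention in the middle of a sentence loses its `@` but keeps the user name.

```python
import re

# Same idea as TEXT_CLEANING_RE above: mentions, URLs, non-alphanumeric runs
TEXT_CLEANING_RE = r"@\S+|https?:\S+|[^A-Za-z0-9]+"

# Hypothetical, very small stand-in for the wordcloud STOPWORDS set
DEMO_STOPWORDS = {"this", "is", "so", "rt", "mkr"}

def clean_text_demo(text):
    text = text.lower()                          # 1. lowercase
    text = re.sub(TEXT_CLEANING_RE, " ", text)   # 2. strip Twitter noise
    # 3. drop stop words and collapse repeated spaces
    return " ".join(w for w in text.split() if w not in DEMO_STOPWORDS)

# Leading mention, URL, hashtag sign and punctuation are all removed
cleaned_1 = clean_text_demo("@user1 Check THIS out: https://t.co/xyz #MKR is so bad!!")
print(cleaned_1)  # check out bad

# A mention in the middle of the text keeps the user name
cleaned_2 = clean_text_demo("thanks @user2 for nothing")
print(cleaned_2)  # thanks user2 for nothing
```

If user names should be dropped everywhere, one option would be to remove mentions in a separate `re.sub` pass before stripping the remaining non-alphanumeric characters.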

In [None]:
def get_top_n_gram(corpus, ngram_range, n=None):
    # Recent scikit-learn versions expect stop_words as a list, not a set
    vec = CountVectorizer(ngram_range=ngram_range, stop_words=list(STOPWORDS))
    bag_of_words = vec.fit_transform(corpus)
    # Sum each vocabulary column to get corpus-wide frequencies
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    # Sort by frequency, highest first, and keep the top n
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:n]
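
As a rough illustration of what `get_top_n_gram` computes, the sketch below counts n-grams with only the standard library. The helper `top_ngrams` and the toy corpus are hypothetical. It is not a drop-in replacement: `CountVectorizer` also applies its own tokenizer (which by default ignores single-character tokens) and the stop-word list.

```python
from collections import Counter

def top_ngrams(corpus, n, k):
    """Return the k most frequent n-grams (joined by spaces) in the corpus."""
    counts = Counter()
    for doc in corpus:
        tokens = doc.split()
        # Slide a window of n tokens over the document
        counts.update(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return counts.most_common(k)

# Hypothetical toy corpus of already-cleaned tweets
toy_corpus = ["dumb joke", "dumb joke again", "dumb bully"]

top_unigrams = top_ngrams(toy_corpus, 1, 2)
top_bigrams = top_ngrams(toy_corpus, 2, 1)
print(top_unigrams)  # [('dumb', 3), ('joke', 2)]
print(top_bigrams)   # [('dumb joke', 2)]
```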

In [None]:
# Cleaning text
df['tweet_text']=df['tweet_text'].apply(clean_text)

In [None]:
# Performing lemmatization
import nltk
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')

lemmatizer = WordNetLemmatizer()

def lemmatizer_words(text):
    # Lemmatize each whitespace-separated token (default part of speech: noun)
    return " ".join([lemmatizer.lemmatize(word) for word in text.split()])

df['tweet_text'] = df['tweet_text'].apply(lemmatizer_words)

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


In [None]:
# Checking for duplicates again
print(df.duplicated().sum())

# Dropping duplicates
df.drop_duplicates(inplace = True)

1122
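
The second duplicate check finds many more rows than the first because tweets that differ only in case, mentions, links or punctuation collapse to the same string once cleaned. A minimal sketch of that effect, using two invented tweets and a hypothetical `normalise` helper that applies only the lowercasing and regex steps:

```python
import re

TEXT_CLEANING_RE = r"@\S+|https?:\S+|[^A-Za-z0-9]+"

def normalise(text):
    # Lowercase, strip Twitter noise, collapse whitespace
    return " ".join(re.sub(TEXT_CLEANING_RE, " ", text.lower()).split())

# Two hypothetical raw tweets that are different strings ...
raw_tweets = ["@alice You are SO dumb!!!", "@bob you are so dumb https://t.co/abc"]

# ... but identical after cleaning
cleaned_tweets = [normalise(t) for t in raw_tweets]

print(len(set(raw_tweets)))      # 2
print(len(set(cleaned_tweets)))  # 1
```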


### 4. Data visualization

#### Hate on Twitter

In [None]:
df['tweet_text']

0                       word katandandre food crapilicious
1        aussietv white theblock imacelebrityau today s...
2                          classy whore red velvet cupcake
3        meh p thanks head concerned another angry dude...
4         isi account pretending kurdish account islam lie
                               ...                        
47687    black ppl aren expected anything depended anyt...
47688    turner withhold disappointment turner called c...
47689    swear god dumb nigger bitch got bleach hair re...
47690    yea fuck therealexel youre nigger fucking unfo...
47691    bro gotta chill chillshrammy dog fuck kp dumb ...
Name: tweet_text, Length: 46534, dtype: object

In [None]:
text_corpus = df["tweet_text"].values

# Getting unigrams
unigrams = get_top_n_gram(text_corpus,(1,1),10)

# Getting bigrams
bigrams = get_top_n_gram(text_corpus,(2,2),10)

data1 = pd.DataFrame(unigrams, columns = ['Text' , 'count']).groupby('Text')['count'].sum().sort_values()
fig = px.bar(data1,y = data1.index,x = data1.values,orientation='h', opacity=0.75)
fig.update_layout(
    margin=dict(t=50),
    title=dict(
        text='Hate (Unigram Frequency)',
        font_size=20
    ),
    xaxis_title = "Count",
    yaxis_title = "Unigrams",

)
fig.update_traces(marker_color='#9d7043')
fig.show()

In [None]:
data2 = pd.DataFrame(bigrams, columns = ['Text' , 'count']).groupby('Text')['count'].sum().sort_values()
fig1 = px.bar(data2,y = data2.index,x = data2.values,orientation='h', opacity=0.75)
fig1.update_layout(
    margin=dict(t=50),
    title=dict(
        text='Hate (Bigram Frequency)',
        font_size=20
    ),
    xaxis_title = "Count",
    yaxis_title = "Bigrams"
)
fig1.update_traces(marker_color='#6b6161')
fig1.show()

### 5. Modelling

In [None]:
ClassIDMap = {'not_cyberbullying': 1, 'gender':2, 
              'religion':3, 'other_cyberbullying': 4, 
              'age': 5, 'ethnicity': 6 }
ClassIDMap

{'age': 5,
 'ethnicity': 6,
 'gender': 2,
 'not_cyberbullying': 1,
 'other_cyberbullying': 4,
 'religion': 3}
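
If a model were trained on the numeric IDs rather than on the class names, its predictions would need to be mapped back to readable labels. A small sketch of the inverse lookup, where `IDClassMap` and `predicted_ids` are hypothetical names and values introduced only for illustration:

```python
ClassIDMap = {'not_cyberbullying': 1, 'gender': 2,
              'religion': 3, 'other_cyberbullying': 4,
              'age': 5, 'ethnicity': 6}

# Invert the mapping: numeric ID -> class name
IDClassMap = {class_id: name for name, class_id in ClassIDMap.items()}

# Hypothetical numeric predictions
predicted_ids = [3, 1, 6]
predicted_names = [IDClassMap[i] for i in predicted_ids]
print(predicted_names)  # ['religion', 'not_cyberbullying', 'ethnicity']
```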

In [None]:
corpus, target_labels, target_names = (df['tweet_text'], 
                                       [ClassIDMap[label] for 
                                        label in df['cyberbullying_type']], 
                                       df['cyberbullying_type'])

df = pd.DataFrame({'tweet text': corpus, 'cyberbullying Label': 
                        target_labels, 'cyberbulying Name': target_names})

In [None]:
# Viewing arranged data
df.head()

Unnamed: 0,tweet text,cyberbullying Label,cyberbulying Name
0,word katandandre food crapilicious,1,not_cyberbullying
1,aussietv white theblock imacelebrityau today s...,1,not_cyberbullying
2,classy whore red velvet cupcake,1,not_cyberbullying
3,meh p thanks head concerned another angry dude...,1,not_cyberbullying
4,isi account pretending kurdish account islam lie,1,not_cyberbullying


In [None]:
# Checking info
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 46534 entries, 0 to 47691
Data columns (total 3 columns):
 #   Column               Non-Null Count  Dtype 
---  ------               --------------  ----- 
 0   tweet text           46534 non-null  object
 1   cyberbullying Label  46534 non-null  int64 
 2   cyberbulying Name    46534 non-null  object
dtypes: int64(1), object(2)
memory usage: 1.4+ MB


#### Train test split

In [None]:
from sklearn.model_selection import train_test_split

train_corpus, test_corpus, train_label_nums, test_label_nums, train_label_names, test_label_names =\
                                 train_test_split(np.array(df['tweet text']), np.array(df['cyberbullying Label']),
                                                       np.array(df['cyberbulying Name']), test_size=0.3, random_state=42)

train_corpus.shape, test_corpus.shape

((32573,), (13961,))

#### Feature Extraction

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Build TF-IDF features on the training tweets
tv = TfidfVectorizer(use_idf=True, min_df=0.00002, max_df=0.6)
tv_train_features = tv.fit_transform(train_corpus.astype('U'))

# Transform the test tweets with the vocabulary learned from the training set
tv_test_features = tv.transform(test_corpus.astype('U'))

print('TFIDF model:> Train features shape:', tv_train_features.shape, ' Test features shape:', tv_test_features.shape)

TFIDF model:> Train features shape: (32573, 37568)  Test features shape: (13961, 37568)
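
For intuition about what each cell of these feature matrices holds, the sketch below works through the TF-IDF weighting by hand on a hypothetical three-document corpus. It assumes the scikit-learn defaults used above (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, L2-normalised rows) and ignores tokenisation details and the `min_df`/`max_df` filters.

```python
import math
from collections import Counter

# Hypothetical toy corpus of cleaned tweets
docs = ["dumb joke", "dumb bully", "school bully bully"]
tokenised = [d.split() for d in docs]
n_docs = len(tokenised)

vocab = sorted({t for doc in tokenised for t in doc})
# Document frequency: in how many documents each term appears
doc_freq = Counter(t for doc in tokenised for t in set(doc))
# Smoothed inverse document frequency
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

def tfidf_row(tokens):
    term_counts = Counter(tokens)
    raw = [term_counts[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]  # L2-normalise the row

rows = [tfidf_row(doc) for doc in tokenised]
for doc, row in zip(docs, rows):
    print(doc, [round(v, 3) for v in row])
```

Terms that appear in fewer documents (here `joke` and `school`) receive a larger IDF than terms shared across documents (`dumb`, `bully`), which is what lets rare, class-specific words stand out.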


#### Creating a Classifier

In [None]:
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

# instantiating pipeline
svm_pipeline = Pipeline([('tfidf', TfidfVectorizer()),
                        ('svm', LinearSVC(random_state=42))])

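# Parameter grid
# Note: both ngram_range entries are (1, 1), so every C value is evaluated twice;
# the second entry was probably meant to be (1, 2) to include bigrams.
param_grid = {'tfidf__ngram_range': [(1, 1), (1, 1)],
              'svm__C': [1e-5, 1e-4, 1e-2, 1e-1, 1]
}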

# Grid search CV
gs_svm = GridSearchCV(svm_pipeline, param_grid, cv=5, verbose=2)


# Model Training on train_corpus
gs_svm = gs_svm.fit(train_corpus.astype('U'), train_label_names)



Fitting 5 folds for each of 10 candidates, totalling 50 fits
[CV] END ............svm__C=1e-05, tfidf__ngram_range=(1, 1); total time=   1.7s
[CV] END ............svm__C=1e-05, tfidf__ngram_range=(1, 1); total time=   1.5s
[CV] END ............svm__C=1e-05, tfidf__ngram_range=(1, 1); total time=   1.5s
[CV] END ............svm__C=1e-05, tfidf__ngram_range=(1, 1); total time=   1.5s
[CV] END ............svm__C=1e-05, tfidf__ngram_range=(1, 1); total time=   1.5s
[CV] END ............svm__C=1e-05, tfidf__ngram_range=(1, 1); total time=   1.5s
[CV] END ............svm__C=1e-05, tfidf__ngram_range=(1, 1); total time=   1.5s
[CV] END ............svm__C=1e-05, tfidf__ngram_range=(1, 1); total time=   1.5s
[CV] END ............svm__C=1e-05, tfidf__ngram_range=(1, 1); total time=   1.5s
[CV] END ............svm__C=1e-05, tfidf__ngram_range=(1, 1); total time=   1.6s
[CV] END ...........svm__C=0.0001, tfidf__ngram_range=(1, 1); total time=   1.6s
[CV] END ...........svm__C=0.0001, tfidf__ngram_
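
The "10 candidates, 50 fits" figure in the log follows directly from the grid: the search evaluates every combination of the listed values once per fold. A short sketch of that arithmetic, using the grid values from the cell above (the variable names are illustrative):

```python
from itertools import product

ngram_ranges = [(1, 1), (1, 1)]
c_values = [1e-5, 1e-4, 1e-2, 1e-1, 1]
cv_folds = 5

# Every combination of the two parameter lists is one candidate
candidates = list(product(ngram_ranges, c_values))
total_fits = len(candidates) * cv_folds
# Identical ngram_range entries mean only half the candidates are distinct
unique_candidates = set(candidates)

print(len(candidates), total_fits, len(unique_candidates))  # 10 50 5
```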

In [None]:
# Getting the best parameters
gs_svm.best_estimator_.get_params()

{'memory': None,
 'steps': [('tfidf', TfidfVectorizer()),
  ('svm', LinearSVC(C=0.1, random_state=42))],
 'svm': LinearSVC(C=0.1, random_state=42),
 'svm__C': 0.1,
 'svm__class_weight': None,
 'svm__dual': True,
 'svm__fit_intercept': True,
 'svm__intercept_scaling': 1,
 'svm__loss': 'squared_hinge',
 'svm__max_iter': 1000,
 'svm__multi_class': 'ovr',
 'svm__penalty': 'l2',
 'svm__random_state': 42,
 'svm__tol': 0.0001,
 'svm__verbose': 0,
 'tfidf': TfidfVectorizer(),
 'tfidf__analyzer': 'word',
 'tfidf__binary': False,
 'tfidf__decode_error': 'strict',
 'tfidf__dtype': numpy.float64,
 'tfidf__encoding': 'utf-8',
 'tfidf__input': 'content',
 'tfidf__lowercase': True,
 'tfidf__max_df': 1.0,
 'tfidf__max_features': None,
 'tfidf__min_df': 1,
 'tfidf__ngram_range': (1, 1),
 'tfidf__norm': 'l2',
 'tfidf__preprocessor': None,
 'tfidf__smooth_idf': True,
 'tfidf__stop_words': None,
 'tfidf__strip_accents': None,
 'tfidf__sublinear_tf': False,
 'tfidf__token_pattern': '(?u)\\b\\w\\w+\\b',
 't

In [None]:
# Checking accuracy on test set
best_svm_test_score = gs_svm.score(test_corpus.astype('U'), test_label_names)
print('Test Accuracy :', best_svm_test_score)

Test Accuracy : 0.8237948571019268
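
A single accuracy number can hide large differences between classes, so it may be worth looking at per-class recall as well. The sketch below shows the calculation on hypothetical toy labels (not results from the model above). In practice `sklearn.metrics.classification_report` gives the same breakdown.

```python
from collections import Counter

# Hypothetical true and predicted labels, for illustration only
y_true = ["age", "age", "gender", "gender", "religion", "religion"]
y_pred = ["age", "age", "gender", "religion", "religion", "religion"]

# Count correct predictions and total examples per true class
correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
support = Counter(y_true)

accuracy = sum(correct.values()) / len(y_true)
recall = {label: correct[label] / support[label] for label in support}

print(round(accuracy, 3))  # 0.833
print(recall)              # {'age': 1.0, 'gender': 0.5, 'religion': 1.0}
```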


In [None]:
# Making a prediction on an arbitrary example sentence
gs_svm.predict(np.array(['Roses are red, violets are blue, if i had a brick i would throw it at you']))

array(['other_cyberbullying'], dtype=object)

In [None]:
gs_svm

GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('tfidf', TfidfVectorizer()),
                                       ('svm', LinearSVC(random_state=42))]),
             param_grid={'svm__C': [1e-05, 0.0001, 0.01, 0.1, 1],
                         'tfidf__ngram_range': [(1, 1), (1, 1)]},
             verbose=2)

### 6. Exporting the model for deployment

In [None]:
import joblib
joblib.dump(gs_svm,'cyber_bullying_model.pkl')

['cyber_bullying_model.pkl']
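
At deployment time the saved file would be read back with `joblib.load` and used for prediction. The sketch below shows the save/load round trip on a tiny stand-in pipeline. The toy texts, labels and file name are hypothetical, and a temporary directory is used so that the real `cyber_bullying_model.pkl` is not touched.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data standing in for the cleaned tweets
toy_texts = ["you are old and slow", "old people are slow",
             "nice game today", "great game well played"]
toy_labels = ["age", "age", "not_cyberbullying", "not_cyberbullying"]

toy_model = Pipeline([("tfidf", TfidfVectorizer()),
                      ("svm", LinearSVC(random_state=42))])
toy_model.fit(toy_texts, toy_labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_model.pkl")
    joblib.dump(toy_model, path)   # save the fitted pipeline
    restored = joblib.load(path)   # load it back, as a serving process would

# The restored pipeline should reproduce the original predictions
same_predictions = list(restored.predict(toy_texts)) == list(toy_model.predict(toy_texts))
print(same_predictions)
```

Because the pickle stores the whole pipeline, the loading environment needs compatible scikit-learn and joblib versions, and pickle files should only be loaded from trusted sources.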