<a href="https://colab.research.google.com/github/Shriteen/cyberbullying-classifier/blob/shri3/training.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
!git clone https://github.com/Shriteen/cyberbullying-classifier.git

Cloning into 'cyberbullying-classifier'...
remote: Enumerating objects: 12, done.
remote: Counting objects: 100% (12/12), done.
remote: Compressing objects: 100% (11/11), done.
remote: Total 12 (delta 3), reused 2 (delta 0), pack-reused 0
Unpacking objects: 100% (12/12), 2.78 MiB | 3.86 MiB/s, done.


In [None]:
!cp cyberbullying-classifier/cyberbullying_tweets.csv .

Import Dependencies

In [None]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

Load data from dataset

In [275]:
df= pd.read_csv("cyberbullying_tweets.csv")
df.columns = ['tweet_text', 'cyberbullying_type']

# Oversample by duplication: add two extra copies of not_cyberbullying
# and one extra copy of other_cyberbullying
df=pd.concat([df,
              df[df['cyberbullying_type']=='not_cyberbullying'],
              df[df['cyberbullying_type']=='not_cyberbullying'],
              df[df['cyberbullying_type']=='other_cyberbullying'] ])

df['category_id']= df['cyberbullying_type'].factorize()[0]
df.head()

,tweet_text,cyberbullying_type,category_id
0,"In other words #katandandre, your food was cra...",not_cyberbullying,0
1,Why is #aussietv so white? #MKR #theblock #ImA...,not_cyberbullying,0
2,@XochitlSuckkks a classy whore? Or more red ve...,not_cyberbullying,0
3,"@Jason_Gio meh. :P thanks for the heads up, b...",not_cyberbullying,0
4,@RudhoeEnglish This is an ISIS account pretend...,not_cyberbullying,0
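
The oversampling step above can be checked on a small scale. The following is a minimal sketch on a made-up five-row frame (the rows and the names `toy`, `is_not`, `is_other` are illustrative assumptions, not part of the dataset); it uses the same `pd.concat` pattern and compares the class counts before and after.

```python
import pandas as pd

# Toy stand-in for the tweet dataset (hypothetical rows, not real data).
toy = pd.DataFrame({
    "tweet_text": ["a", "b", "c", "d", "e"],
    "cyberbullying_type": ["age", "age", "not_cyberbullying",
                           "other_cyberbullying", "religion"],
})

# Same pattern as the cell above: two extra copies of one class,
# one extra copy of another.
is_not = toy["cyberbullying_type"] == "not_cyberbullying"
is_other = toy["cyberbullying_type"] == "other_cyberbullying"
oversampled = pd.concat([toy, toy[is_not], toy[is_not], toy[is_other]])

counts_before = toy["cyberbullying_type"].value_counts()
counts_after = oversampled["cyberbullying_type"].value_counts()
print(counts_before)
print(counts_after)
```

The concatenated frame keeps the original index labels, so duplicated rows share an index; passing `ignore_index=True` to `pd.concat` would renumber them. Note also that if the data is later split at random into training and test sets, copies of the same tweet can end up on both sides of the split.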


Create dictionaries to simplify future use

In [276]:
category_id_df = df[['cyberbullying_type', 'category_id']].drop_duplicates().sort_values('category_id')
category_to_id = dict(category_id_df.values)
id_to_category = dict(category_id_df[['category_id', 'cyberbullying_type']].values)
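
As a rough illustration of what these two dictionaries contain, here is the same construction on a four-row toy frame (the rows are invented for the example). `factorize` numbers the categories in order of first appearance, and the dictionaries map in both directions.

```python
import pandas as pd

# Hypothetical labels, only to show the shape of the mappings.
toy = pd.DataFrame({
    "cyberbullying_type": ["not_cyberbullying", "gender",
                           "not_cyberbullying", "age"],
})
toy["category_id"] = toy["cyberbullying_type"].factorize()[0]

# One row per distinct (label, id) pair, ordered by id.
pairs = (toy[["cyberbullying_type", "category_id"]]
         .drop_duplicates()
         .sort_values("category_id"))

category_to_id = dict(pairs.values)
id_to_category = dict(pairs[["category_id", "cyberbullying_type"]].values)

print(category_to_id)
print(id_to_category)
```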

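Create the tf-idf vectorizer
- sublinear_tf=True uses a logarithmic term frequency, 1 + log(tf), instead of the raw count
- min_df=8 ignores terms that appear in fewer than 8 documents
- norm='l2' scales each document vector to unit length
- ngram_range=(1,3) includes unigrams, bigrams, and trigrams
- encoding='latin-1' sets the encoding used to decode input given as bytes or files
- stop_words="english" drops common English stop words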

In [277]:
tfidf= TfidfVectorizer(sublinear_tf=True,
                           min_df=8,
                           norm='l2',
                           ngram_range=(1,3),
                           encoding='latin-1',
                           stop_words="english")


Convert text to vector form

In [278]:
features= tfidf.fit_transform(df.tweet_text).toarray()
labels= df.category_id
features.shape


(71405, 17452)
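
`fit_transform` returns a sparse matrix, and `.toarray()` expands it to a dense one. A back-of-the-envelope sketch, assuming the default float64 dtype, of how much memory the dense form needs for the shape reported above, next to a sparse-versus-dense comparison on an invented three-document corpus:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical mini-corpus, used only to compare storage sizes.
docs = ["high school bully", "school bully girl", "girl bullied in high school"]
X_sparse = TfidfVectorizer().fit_transform(docs)

# CSR storage: non-zero values plus two index arrays.
sparse_bytes = (X_sparse.data.nbytes
                + X_sparse.indices.nbytes
                + X_sparse.indptr.nbytes)
# Dense storage: one value per cell, zero or not.
dense_bytes = X_sparse.shape[0] * X_sparse.shape[1] * X_sparse.dtype.itemsize
print(sparse_bytes, dense_bytes)

# Dense size for the shape reported above, at 8 bytes per float64 value.
rows, cols = 71405, 17452
full_dense_gb = rows * cols * 8 / 1e9
print(round(full_dense_gb, 2))
```

Both `chi2` and `MultinomialNB` accept sparse input, so keeping the matrix sparse is an option when memory is tight.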

Extract and Display the top features of each category

In [279]:
from sklearn.feature_selection import chi2
import numpy as np

for category,category_id in sorted(category_to_id.items()):
  # One-vs-rest chi-squared test: this category against all the others
  features_chi2= chi2(features, labels==category_id )
  # Sort feature indices by ascending score, so the strongest terms come last
  indices = np.argsort(features_chi2[0])
  feature_names= np.array(tfidf.get_feature_names_out())[indices]

  # Split the ranked terms by n-gram length
  unigrams = [v for v in feature_names if len(v.split(' ')) == 1]
  bigrams = [v for v in feature_names if len(v.split(' ')) == 2]
  trigrams = [v for v in feature_names if len(v.split(' ')) == 3]

  print('#'+category)
  # Show the five highest-scoring terms of each length
  print('Unigrams:',','.join(unigrams[-5:]))
  print('Bigrams:',','.join(bigrams[-5:]))
  print('Trigrams:',','.join(trigrams[-5:]))
  

#age
Unigrams: girl,bullies,bullied,high,school
Bigrams: girl bullied,girls bullied,school bully,bullied high,high school
Trigrams: high school bullies,high school bully,girl bullied high,girls bullied high,bullied high school
#ethnicity
Unigrams: ass,niggers,fuck,nigger,dumb
Bigrams: fuck obama,dumb fuck,ass nigger,dumb nigger,dumb ass
Trigrams: fuck dumb nigger,tayyoung_ fuck obama,obama dumb ass,fuck obama dumb,dumb ass nigger
#gender
Unigrams: sexist,joke,jokes,gay,rape
Bigrams: jokes rape,gay rape,gay jokes,rape joke,rape jokes
Trigrams: gay rape joke,jokes rape jokes,jokes gay jokes,rape jokes gay,gay rape jokes
#not_cyberbullying
Unigrams: fuck,nigger,dumb,bullying,mkr
Bigrams: rape jokes,bullied high,dumb ass,kat andre,high school
Trigrams: tayyoung_ fuck obama,obama dumb ass,fuck obama dumb,dumb ass nigger,bullied high school
#other_cyberbullying
Unigrams: nigger,dumb,blameonenotall,school,https
Bigrams: rape jokes,bullied high,idiot https,dumb ass,high school
Trigrams: tayyou

Train using Naive Bayes

In [280]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

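# Hold out a test split (scikit-learn's default test size is 25%)
X_train, X_test, y_train, y_test = train_test_split(df['tweet_text'],df['cyberbullying_type'], random_state=0)
# Turn tweets into token counts, then reweight the counts with TF-IDF
count_vect= CountVectorizer()
X_train_counts= count_vect.fit_transform(X_train)
tfidf_transformer= TfidfTransformer()
X_train_tfidf= tfidf_transformer.fit_transform(X_train_counts)

# class_prior follows the sorted order of the class labels (classifier.classes_);
# the values given here sum to 0.4, not 1
classifier= MultinomialNB(class_prior=[0.05,0.05,0.05,0.1,0.1,0.05]).fit(X_train_tfidf, y_train)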
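
The entries of `class_prior` are matched to classes by position in `classes_`, which holds the labels in sorted order. The sketch below, on invented count data with made-up priors, shows that alignment and how the prior alone decides the prediction for an input with no known tokens.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical count features and labels.
X = np.array([[1, 0], [0, 1], [1, 1]])
y = np.array(["gender", "age", "religion"])

# Priors are given in the order of the sorted labels:
# age=0.5, gender=0.25, religion=0.25 (illustrative values).
clf = MultinomialNB(class_prior=[0.5, 0.25, 0.25]).fit(X, y)

print(clf.classes_)
print(np.exp(clf.class_log_prior_))

# An all-zero row carries no evidence, so the largest prior wins.
empty_pred = clf.predict(np.zeros((1, 2)))
print(empty_pred)
```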

In [287]:
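# Apply the same count and TF-IDF transforms used for training before predicting
classifier.predict( tfidf_transformer.transform(count_vect.transform([""])) )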

array(['other_cyberbullying'], dtype='<U19')
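
Because the classifier is fitted on TF-IDF weights, new text needs the same count and TF-IDF transforms at prediction time. One way to keep the steps together is a scikit-learn pipeline. This is a sketch on an invented four-document corpus, not the notebook's trained model; the name `model` and the sample texts are assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical training data.
texts = ["high school bully", "school bully girl", "rape joke", "gay rape joke"]
labels = ["age", "age", "gender", "gender"]

# The pipeline applies counts -> TF-IDF -> Naive Bayes in order,
# both when fitting and when predicting.
model = make_pipeline(CountVectorizer(), TfidfTransformer(), MultinomialNB())
model.fit(texts, labels)

# Raw text goes in; the fitted transforms run automatically.
pred = model.predict(["bully at high school"])
print(pred)
```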