## Table of Contents:

### Introduction
   
### Section 1: Data Preparation

-   Statements, Libraries, functions and tools
-   Collecting data
-   Converting to pandas dataframe
-   General Preprocessing


### Section 2: Preprocessing: Gensim - Model: Random Forest


-   Cleaning and tokenizing the dataset with the Gensim preprocessing package
-   Vectorization: word2vec embedding
-   Fit Random Forest Classifier
-   Evaluating the model
 


### Section 3: Building a Basic RNN

-   Tokenizing the dataset with Keras
-   Pad the sequences to the same length
-   Fit the Model


## Introduction

This notebook trains a couple of models to classify tweets as hate speech or not, based on the text of the tweets.

The tweets are primarily in English.
We used the [tweets_hate_speech_detection](https://huggingface.co/datasets/tweets_hate_speech_detection#source-data) dataset from Hugging Face.

Labels:
- 1: the tweet is hate speech,
- 0: the tweet is not hate speech.

## 1. Data Preparation 

### 1.1.  Statements, Libraries, functions and tools

In [103]:
import nltk
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import numpy as np


# text cleaning helpers (standard library):
import re  # used for tokenizing
import string

#sklearn
import sklearn
from sklearn.ensemble import RandomForestClassifier
# Model Evaluation:
from sklearn.metrics import precision_score,recall_score
from sklearn.model_selection import train_test_split

#gensim
import gensim

# tensorflow:
import tensorflow as tf
from tensorflow.keras.models import Model
from tensorflow.keras.layers import *   # layers, losses, preprocessing
from tensorflow.keras.losses import sparse_categorical_crossentropy
from tensorflow.keras.optimizers import Adam

import keras.backend as k
from keras.layers import Dense,Embedding, LSTM
from keras.models import Sequential

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences




In [3]:
print(tf.__version__)

2.8.0-rc1


### 1.2. Collecting data from Hugging Face:

In [19]:
from datasets import list_datasets,load_dataset
from pprint import pprint

dataset = load_dataset("tweets_hate_speech_detection" , split = "train"  )

Using custom data configuration default
Reusing dataset tweets_hate_speech_detection (C:\Users\smora\.cache\huggingface\datasets\tweets_hate_speech_detection\default\0.0.0\c6b6f41e91ac9113e1c032c5ecf7a49b4e1e9dc8699ded3c2d8425c9217568b2)


The [pprint](https://docs.python.org/3/library/pprint.html) module provides the capability to “pretty-print” Python data structures.
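As a small illustration of what pretty-printing does, the sketch below formats a dictionary shaped like one dataset row. The `sample` dictionary is typed in by hand here rather than loaded from the dataset.

```python
from pprint import pformat, pprint

# Hand-written stand-in for one dataset row (not loaded from Hugging Face).
sample = {
    "label": 0,
    "tweet": "@user when a father is dysfunctional and is so selfish "
             "he drags his kids into his dysfunction.   #run",
}

# pprint wraps long values over several lines instead of one long line.
pprint(sample)

# pformat returns the same layout as a string, here with a narrower width.
formatted = pformat(sample, width=60)
print(formatted)
```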

In [20]:
print(dataset)

Dataset({
    features: ['label', 'tweet'],
    num_rows: 31962
})


In [7]:
# Useful calls for inspecting larger datasets.
#print("Column names ", dataset.column_names)
#print("Number of columns :", dataset.num_columns)
#print("Number of rows : ", dataset.num_rows)

In [21]:
print("First example: \n")
pprint(dataset[0]) # print the first sample as a dictionary

First example: 

{'label': 0,
 'tweet': '@user when a father is dysfunctional and is so selfish he drags his '
          'kids into his dysfunction.   #run'}


### 1.3. Converting to pandas dataframe:

In [24]:
pd.set_option('display.max_colwidth',100)
dataset.set_format("pandas")
tweet_df = dataset[:]
tweet_df.head()




|   | label | tweet |
|---|-------|-------|
| 0 | 0 | @user when a father is dysfunctional and is so selfish he drags his kids into his dysfunction. ... |
| 1 | 0 | @user @user thanks for #lyft credit i can't use cause they don't offer wheelchair vans in pdx. ... |
| 2 | 0 | bihday your majesty |
| 3 | 0 | #model i love u take with u all the time in urð±!!! ððððð¦ð¦ð¦ |
| 4 | 0 | factsguide: society now #motivation |


Labels: 0 means negative (not hate speech) and 1 means positive (hate speech).

In [25]:
tweet = tweet_df.copy()

### 1.4. General Preprocessing

In [76]:
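# Requires the NLTK resources 'stopwords' and 'wordnet' (nltk.download(...) once).
stopWords = set(nltk.corpus.stopwords.words('english'))  # set for fast membership tests
wn = nltk.WordNetLemmatizer()

# text_noNum = "".join([char for char in text_noPunc if char not in '123456789'])

def cleaned_text(text):
    # lowercase the text and drop punctuation characters
    text_noPunc = "".join([char.lower() for char in text if char not in string.punctuation])
    # split on runs of non-word characters (raw string avoids an invalid-escape warning)
    text_tokenized = re.split(r'\W+', text_noPunc)
    # remove English stop words, then lemmatize the remaining tokens
    text_noStopWords = [word for word in text_tokenized if word not in stopWords]
    text_lemmatized = [wn.lemmatize(word) for word in text_noStopWords]
    return text_lemmatized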
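The cleaning steps can be tried without NLTK using the sketch below. It is a simplified stand-in, not the notebook's function: `STOP_WORDS` is a tiny hypothetical stop-word list, lemmatization is left out, and empty tokens are dropped.

```python
import re
import string

# Hypothetical, tiny stop-word list; the notebook uses NLTK's English list.
STOP_WORDS = {"a", "is", "and", "so", "he", "his", "into", "when"}

def clean_text_sketch(text):
    # Lowercase and remove punctuation characters such as '@', '#' and '.'.
    no_punct = "".join(ch.lower() for ch in text if ch not in string.punctuation)
    # Split on runs of non-word characters (mostly whitespace at this point).
    tokens = re.split(r"\W+", no_punct)
    # Drop empty strings and stop words; no lemmatization in this sketch.
    return [tok for tok in tokens if tok and tok not in STOP_WORDS]

tokens = clean_text_sketch(
    "@user when a father is dysfunctional and is so selfish "
    "he drags his kids into his dysfunction.   #run"
)
print(tokens)
```

Because lemmatization is skipped, plural and inflected forms such as "drags" and "kids" stay unchanged here, whereas the notebook's function reduces them.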

In [77]:
tweet['clean_tweet'] = tweet['tweet'].apply(cleaned_text)
tweet.head()

|   | label | tweet | clean_tweet |
|---|-------|-------|-------------|
| 0 | 0 | @user when a father is dysfunctional and is so selfish he drags his kids into his dysfunction. ... | [user, father, dysfunctional, selfish, drag, kid, dysfunction, run] |
| 1 | 0 | @user @user thanks for #lyft credit i can't use cause they don't offer wheelchair vans in pdx. ... | [user, user, thanks, lyft, credit, cant, use, cause, dont, offer, wheelchair, van, pdx, disapoin... |
| 2 | 0 | bihday your majesty | [bihday, majesty] |
| 3 | 0 | #model i love u take with u all the time in urð±!!! ððððð¦ð¦ð¦ | [model, love, u, take, u, time, urð, ð, ð, ð, ð, ð, ð, ð, ] |
| 4 | 0 | factsguide: society now #motivation | [factsguide, society, motivation] |


In [79]:
train,test = train_test_split(tweet, test_size  = 0.2 , random_state=142)
X_train_tweet = train['clean_tweet']
y_train_tweet = train['label']
X_test_tweet = test['clean_tweet']
y_test_tweet = test['label']

## 2. Preprocessing: Gensim - Model: Random Forest

Gensim (“Generate Similar”) is an open-source Python library for natural language processing and unsupervised modelling.
One of its significant advantages is that it can process large text files without loading them fully into memory.
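One way to picture this streaming style is a restartable iterable that reads a file one line at a time. The sketch below uses only the standard library. The class name `LineCorpus` and the whitespace tokenization are assumptions made for illustration, and an object like this could in principle be passed as the `sentences` argument of `gensim.models.Word2Vec`.

```python
import os
import tempfile

class LineCorpus:
    """Restartable iterable: yields one tokenized line at a time."""

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # The file is reopened on every pass, so only one line is held in memory.
        with open(self.path, encoding="utf-8") as handle:
            for line in handle:
                yield line.lower().split()

# Demo on a small temporary file standing in for a large corpus.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write("first tweet text\nsecond tweet\n")
    path = tmp.name

corpus = LineCorpus(path)
first_pass = list(corpus)
second_pass = list(corpus)   # a plain generator would be exhausted by now
os.remove(path)
print(first_pass)
```

A class with `__iter__` is used instead of a generator function because training typically makes several passes over the data.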

### 2.1. Cleaning and tokenizing the dataset with the Gensim preprocessing package

In [31]:
tweet_gensim = tweet_df.copy()

In [33]:
import gensim
# clean data by using gensim built in function:
tweet_gensim['clean_tweet'] = tweet_gensim ['tweet'].apply(lambda x : gensim.utils.simple_preprocess(x))
tweet_gensim.head()


|   | label | tweet | clean_tweet |
|---|-------|-------|-------------|
| 0 | 0 | @user when a father is dysfunctional and is so selfish he drags his kids into his dysfunction. ... | [user, when, father, is, dysfunctional, and, is, so, selfish, he, drags, his, kids, into, his, d... |
| 1 | 0 | @user @user thanks for #lyft credit i can't use cause they don't offer wheelchair vans in pdx. ... | [user, user, thanks, for, lyft, credit, can, use, cause, they, don, offer, wheelchair, vans, in,... |
| 2 | 0 | bihday your majesty | [bihday, your, majesty] |
| 3 | 0 | #model i love u take with u all the time in urð±!!! ððððð¦ð¦ð¦ | [model, love, take, with, all, the, time, in, urð] |
| 4 | 0 | factsguide: society now #motivation | [factsguide, society, now, motivation] |


In [41]:
train,test = train_test_split(tweet_gensim, test_size  = 0.2 , random_state=142)
X_train = train['clean_tweet']
y_train = train['label']
X_test = test['clean_tweet']
y_test = test['label']

### 2.2. Vectorization: word2vec embedding

In [60]:
w2v_model = gensim.models.Word2Vec(X_train,          # important: train the w2v model only on the training set
                                   vector_size=10,   # size of the word vectors
                                   window=4,         # number of words before and after the target word used as context
                                   min_count=2)      # minimum number of times a word must appear to get a word vector
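To make `min_count` and `window` more concrete, the sketch below imitates them on two toy sentences in plain Python. It is a simplification under stated assumptions: the helper names are hypothetical, and Gensim's actual training differs in details such as subsampling and how the window is applied.

```python
from collections import Counter

def build_vocab(sentences, min_count):
    # Keep only words that occur at least min_count times in the corpus.
    counts = Counter(tok for sent in sentences for tok in sent)
    return {tok for tok, n in counts.items() if n >= min_count}

def context_pairs(sentence, window):
    # Pair each word with the words up to `window` positions before and after it.
    pairs = []
    for i, center in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        pairs.extend((center, sentence[j]) for j in range(lo, hi) if j != i)
    return pairs

# Toy corpus, made up for this illustration.
toy = [["love", "this", "sunny", "day"], ["love", "sunny", "weather"]]

vocab = build_vocab(toy, min_count=2)
pairs = context_pairs(toy[0], window=1)
print(sorted(vocab))
print(pairs)
```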

In [61]:
w2v_model.wv['motivation'] # get numpy vector of a word

array([ 0.07217278, -0.00500214,  1.6541885 , -0.49206886,  0.6383111 ,
        0.01688413,  0.60983694,  1.2963771 , -2.1193588 , -2.0665634 ],
      dtype=float32)

In [62]:
w2v_model.wv.most_similar('motivation')

[('inspiration', 0.9965179562568665),
 ('energy', 0.9937769770622253),
 ('kevin', 0.9869865775108337),
 ('bogotadc', 0.9838134050369263),
 ('joy', 0.9835760593414307),
 ('meditation', 0.9819909930229187),
 ('danger', 0.9807791709899902),
 ('peaceful', 0.9804904460906982),
 ('gratitude', 0.9795219302177429),
 ('cretin', 0.9786925911903381)]
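The scores returned by `most_similar` are cosine similarities between word vectors. A minimal NumPy sketch of that measure, using made-up two-dimensional vectors rather than the trained embeddings:

```python
import numpy as np

def cosine_similarity(u, v):
    # Dot product of the two vectors divided by the product of their lengths.
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Made-up vectors: same direction, and at a right angle.
same_direction = cosine_similarity([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 3.0])
print(same_direction, orthogonal)
```

The measure depends only on the angle between the vectors, so scaling a vector does not change its similarity to another one.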

Replace all words in each text with the learnt word vectors.

In [63]:
vect_words = set(w2v_model.wv.index_to_key)  # index_to_key lists every word the trained model knows about
# Build one array of word vectors per text; `if word in vect_words` skips words the model did not learn.
# Plain lists are used because the texts differ in length (recent NumPy rejects a ragged np.array).
train_vect = [np.array([w2v_model.wv[word] for word in txt if word in vect_words])
              for txt in X_train]
test_vect = [np.array([w2v_model.wv[word] for word in txt if word in vect_words])
             for txt in X_test]


In [64]:
train_vect[0]  # An array of arrays (one array for every word in the tweet)

array([[ 1.91048276e+00, -1.24070573e+00,  1.99192584e+00,
         2.72942662e+00,  1.04834354e+00, -5.76628149e-01,
         2.67861247e+00, -8.40476871e-01, -4.30134201e+00,
         4.77375239e-01],
       [ 8.08413625e-01, -6.77194774e-01,  1.18487060e+00,
         6.83297098e-01,  1.04146302e+00,  4.59551573e-01,
         2.65634346e+00,  1.09223664e+00, -2.27686501e+00,
        -4.73695278e-01],
       [ 1.80971876e-01, -2.40165472e-01,  7.52116799e-01,
        -2.05449149e-01,  5.39051592e-01, -7.89066702e-02,
         1.52979743e+00,  4.46467608e-01, -1.61578238e+00,
        -5.84985137e-01],
       [-1.39852151e-01,  1.19559467e-01,  7.70649076e-01,
         5.70386946e-02,  7.90282607e-01, -2.69560307e-01,
         2.76214623e+00,  6.18983209e-01, -2.38378453e+00,
        -2.36807212e-01],
       [-1.09718740e+00,  1.02908635e+00,  4.83179867e-01,
         1.01757377e-01,  1.72091818e+00,  5.58452487e-01,
         3.97436357e+00,  2.23353362e+00, -3.55532742e+00,
        -3.

In [65]:
# Average the word vectors of each tweet to get a single fixed-length vector representation.
# (A zero vector is used if the model did not learn any of the words in the tweet.)
avg_train_vect = []
for v in train_vect:  # v is the array of word vectors created in the previous cell
    if v.size:
        avg_train_vect.append(v.mean(axis=0))
    else:
        avg_train_vect.append(np.zeros(w2v_model.vector_size, dtype=float))

avg_test_vect = []
for v in test_vect:
    if v.size:
        avg_test_vect.append(v.mean(axis=0))
    else:
        avg_test_vect.append(np.zeros(w2v_model.vector_size, dtype=float))
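The same averaging idea can be written as one reusable function. The sketch below is a minimal version that uses a plain dictionary of made-up two-dimensional embeddings in place of the trained word2vec model. The names `average_vectors` and `toy_embeddings` are hypothetical.

```python
import numpy as np

def average_vectors(tokens, embeddings, dim):
    # Collect the vectors of the words that have an embedding.
    known = [embeddings[tok] for tok in tokens if tok in embeddings]
    if not known:
        # No known word: fall back to a zero vector of the same length.
        return np.zeros(dim, dtype=np.float32)
    return np.mean(known, axis=0)

# Made-up embeddings standing in for w2v_model.wv.
toy_embeddings = {
    "good": np.array([1.0, 3.0], dtype=np.float32),
    "day": np.array([3.0, 1.0], dtype=np.float32),
}

avg = average_vectors(["good", "day", "unseen"], toy_embeddings, dim=2)
empty = average_vectors(["unseen"], toy_embeddings, dim=2)
features = np.vstack([avg, empty])   # one fixed-length row per text
print(features)
```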

In [66]:
avg_train_vect[0]

array([-2.9593989e-02, -6.7191947e-01,  1.1231573e+00,  1.8920657e-01,
        5.6300396e-01,  1.4719795e-01,  2.2952085e+00,  9.3545902e-01,
       -3.3669040e+00, -2.5483607e-03], dtype=float32)

### 2.3. Fit Random Forest Classifier

In [67]:
rf = RandomForestClassifier()
rf_model =rf.fit(avg_train_vect,y_train.values.ravel())

In [68]:
y_pred = rf_model.predict(avg_test_vect)

### 2.4. Evaluating the model

In [74]:
precision = precision_score(y_test,y_pred)

recall =recall_score(y_test,y_pred)

print ('precision : {}, recall : {}, Accuracy : {}'.format (
                                                            round(precision, 3),
                                                            round(recall , 3),
                                                            round(((y_pred == y_test).sum() / len(y_pred)),3)))

precision : 0.986, recall : 0.166, Accuracy : 0.943
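The result above combines high precision and accuracy with low recall. The sketch below computes the three metrics from confusion counts on made-up labels (not the notebook's predictions), to show how a rare positive class can produce that kind of pattern.

```python
def binary_metrics(y_true, y_pred):
    # Count the four cells of the confusion matrix.
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0   # how many flagged items are right
    recall = tp / (tp + fn) if tp + fn else 0.0      # how many positives were found
    accuracy = (tp + tn) / len(pairs)
    return precision, recall, accuracy

# Made-up example: 2 positives among 20 samples, only one of them detected.
y_true_toy = [1, 1] + [0] * 18
y_pred_toy = [1, 0] + [0] * 18

precision_toy, recall_toy, accuracy_toy = binary_metrics(y_true_toy, y_pred_toy)
print(precision_toy, recall_toy, accuracy_toy)
```

In this toy case half of the positives are missed, yet accuracy stays high because most samples are negative.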


## 3. Building a Basic RNN

### 3.1. Tokenizing the dataset with Keras

In [81]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts(X_train_tweet)
# integer encode documents
X_train_seq = tokenizer.texts_to_sequences(X_train_tweet )
X_test_seq = tokenizer.texts_to_sequences(X_test_tweet )
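As a rough picture of what the tokenizer does here, the plain-Python sketch below assigns integer ids by word frequency, starting at 1, and skips words that were not seen during fitting. It is an approximation for illustration with hypothetical helper names, not the Keras implementation.

```python
from collections import Counter

def fit_word_index(token_lists):
    # More frequent words get smaller ids; id 0 is left free for padding.
    counts = Counter(tok for toks in token_lists for tok in toks)
    return {tok: rank for rank, (tok, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(token_lists, word_index):
    # Words that were not seen during fitting are dropped.
    return [[word_index[tok] for tok in toks if tok in word_index]
            for toks in token_lists]

# Toy token lists, made up for this illustration.
train_tokens = [["love", "sunny", "day"], ["love", "rain"]]

word_index = fit_word_index(train_tokens)
test_seq = to_sequences([["love", "snow", "day"]], word_index)
print(word_index)
print(test_seq)
```

Fitting only on the training tokens mirrors the notebook, where test tweets are encoded with the index learned from the training set.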

In [82]:
# summarize what was learned
#print(tokenizer.word_counts)
#print(tokenizer.document_count)
#print(tokenizer.word_index)
#print(tokenizer.word_docs)

### 3.2. Pad the sequences to the same length

In [89]:
X_train_seq_padd = pad_sequences(X_train_seq, maxlen=20)
X_test_seq_padd = pad_sequences(X_test_seq, maxlen=20)

In [90]:
X_train_seq[0]
# each integer represents a word in the first tweet

[470, 52, 141, 12, 8, 412, 5096, 3729, 12, 110]

In [91]:
X_train_seq_padd[0]

array([   0,    0,    0,    0,    0,    0,    0,    0,    0,    0,  470,
         52,  141,   12,    8,  412, 5096, 3729,   12,  110])
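With its default settings, `pad_sequences` adds zeros at the front of short sequences and cuts long ones from the front. A plain-Python sketch of that behaviour, where the helper name `pad_pre` is hypothetical:

```python
def pad_pre(seq, maxlen, value=0):
    # Keep the last `maxlen` items, then left-pad with `value` up to maxlen.
    trimmed = seq[-maxlen:]
    return [value] * (maxlen - len(trimmed)) + trimmed

# The first training sequence shown above.
short = [470, 52, 141, 12, 8, 412, 5096, 3729, 12, 110]
padded = pad_pre(short, 20)

# A made-up sequence longer than maxlen loses its first items.
truncated = pad_pre(list(range(1, 26)), 20)

print(padded)
print(truncated)
```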

### 3.3. Fit the Model

In [106]:
model = Sequential()
model.add(Embedding(len(tokenizer.index_word)+1,32))
model.add (LSTM (32,dropout = 0, recurrent_dropout = 0 ))
model.add (Dense(32, activation ='relu'))
model.add (Dense(1, activation ='sigmoid'))
model.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, None, 32)          1141120   
                                                                 
 lstm_1 (LSTM)               (None, 32)                8320      
                                                                 
 dense_2 (Dense)             (None, 32)                1056      
                                                                 
 dense_3 (Dense)             (None, 1)                 33        
                                                                 
Total params: 1,150,529
Trainable params: 1,150,529
Non-trainable params: 0
_________________________________________________________________
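The parameter counts in the summary can be reproduced by hand, assuming the usual layer layouts: one row per vocabulary entry for the embedding, four gates per LSTM unit, and one bias per unit. The vocabulary size below is inferred from the summary itself, not read from the tokenizer.

```python
def embedding_params(vocab_size, dim):
    # One dim-sized vector per vocabulary entry.
    return vocab_size * dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias.
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit.
    return input_dim * units + units

vocab_size = 1141120 // 32   # inferred from the embedding row of the summary

counts = [
    embedding_params(vocab_size, 32),
    lstm_params(32, 32),
    dense_params(32, 32),
    dense_params(32, 1),
]
total = sum(counts)
print(counts, total)
```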


In [115]:
model.compile(loss='binary_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])
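For intuition about the loss chosen here, binary cross-entropy penalises a prediction by the negative log of the probability assigned to the true label. The sketch below is a plain-Python version on made-up probabilities. The clipping constant `eps` is an assumption of this sketch, used to avoid `log(0)`.

```python
import math

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    total = 0.0
    for t, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)   # keep p strictly inside (0, 1)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)

# Made-up predictions for two tweets with labels 1 and 0.
confident = binary_crossentropy([1, 0], [0.9, 0.1])
unsure = binary_crossentropy([1, 0], [0.5, 0.5])
print(confident, unsure)
```

Confident, correct probabilities give a lower loss than predictions of 0.5, for which the loss equals ln 2.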

In [116]:
history = model.fit (X_train_seq_padd,y_train_tweet,
                    batch_size = 32,
                    epochs = 10,
                    validation_data = (X_test_seq_padd,y_test_tweet) )

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
