<a href="https://colab.research.google.com/github/Shehab-7/NLP/blob/main/Text%20Classification/Text_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Dataset**
A labeled dataset of tweets collected from Twitter.

**Objective**
Classify each tweet as containing hate speech or not.
0 -> no hate speech
1 -> contains hate speech



### Import Libraries

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from collections import Counter
import random
from termcolor import colored
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity, cosine_distances
from sklearn.model_selection import train_test_split
import re
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn import metrics


### Load Dataset

In [2]:
data = pd.read_csv('/content/dataset.csv')

In [3]:
data.head()

Unnamed: 0,id,label,tweet
0,1,0,@user when a father is dysfunctional and is s...
1,2,0,@user @user thanks for #lyft credit i can't us...
2,3,0,bihday your majesty
3,4,0,#model i love u take with u all the time in ...
4,5,0,factsguide: society now #motivation


### EDA

- Check for NaNs

In [4]:
data.isnull().any()


id       False
label    False
tweet    False
dtype: bool

- Check for duplicates (the `id` column is dropped first, because a unique id would make every row look distinct)

In [5]:
data = data.drop(columns="id")  # remove the identifier column

In [6]:
data.duplicated().sum()

2432
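
The count above shows repeated rows, and the test-set size reported later (9589, about 30% of all 31962 rows) suggests they are never removed, although "Drop Duplicates" is one of the planned steps. A minimal sketch of that step on a toy frame follows; `toy`, `n_dupes`, and `deduped` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Toy stand-in for the notebook's `data` frame after the "id" column is dropped.
toy = pd.DataFrame({
    "label": [0, 0, 1, 0],
    "tweet": ["good morning", "good morning", "some hateful text", "good morning"],
})

# duplicated() flags every row that repeats an earlier row.
n_dupes = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each distinct row.
deduped = toy.drop_duplicates().reset_index(drop=True)

print(n_dupes)
print(len(deduped))
```

Dropping duplicates before the train/test split also avoids the same tweet landing on both sides of the split.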

- Show sample tweets to identify the preprocessing steps required

In [7]:
data['tweet'].head(50)

0      @user when a father is dysfunctional and is s...
1     @user @user thanks for #lyft credit i can't us...
2                                   bihday your majesty
3     #model   i love u take with u all the time in ...
4                factsguide: society now    #motivation
5     [2/2] huge fan fare and big talking before the...
6      @user camping tomorrow @user @user @user @use...
7     the next school year is the year for exams.ð...
8     we won!!! love the land!!! #allin #cavs #champ...
9      @user @user welcome here !  i'm   it's so #gr...
10     â #ireland consumer price index (mom) climb...
11    we are so selfish. #orlando #standwithorlando ...
12    i get to see my daddy today!!   #80days #getti...
13    @user #cnn calls #michigan middle school 'buil...
14    no comment!  in #australia   #opkillingbay #se...
15    ouch...junior is angryð#got7 #junior #yugyo...
16    i am thankful for having a paner. #thankful #p...
17                               retweet if you 

In [8]:
data['tweet'].sample(50)

11661     â #usd/cad pushes higher to 1.2840, us data...
19165    @user no joke! i'm tired of #celebrities actin...
5148     if u truly want #happiness u must go where   p...
25118    euphoria-hold_the_rush__eternal_heights-(ng039...
17139    tag someone below #quotes #beautiful #beach #l...
19281    when you catch the light just right. #rocknrol...
30997    waking up in the morning has been so easy late...
22018    i love kids but i choose to be childfree  #yes...
20660     @user coffee, bacon butty, writing and catchi...
23561                   unending   #hours in #montreal :  
16438     â #nab may business survey: unwavering non-...
23409    now playing  :  nils frahm - " ambre" on    #m...
30632    @user @user mr. paris dinnard...is in denial.....
24090      big things on the way!! stay tuned!   #bigplans
21540    @user the feeling's not mutual, you , #ableism...
30055    nice morning doing airsoftððð«   #exci...
9153                 exactly but people don't get itð

- Check class balance

In [9]:
data['label'].value_counts()

0    29720
1     2242
Name: label, dtype: int64
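
To make the imbalance concrete, the counts above can be turned into class shares. The sketch below only reuses the two numbers printed by `value_counts()`; the variable names are illustrative.

```python
# Counts copied from the value_counts() output above.
counts = {0: 29720, 1: 2242}
total = sum(counts.values())

# Share of each class in the full dataset.
shares = {label: n / total for label, n in counts.items()}

# Accuracy of a trivial classifier that always predicts the majority class.
majority_baseline = max(shares.values())

print({label: round(s, 3) for label, s in shares.items()})
print(round(majority_baseline, 3))
```

Because roughly 93% of the tweets are labeled 0, accuracy alone says little here; precision and recall on class 1 are the more informative numbers.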

- The cleaning and preprocessing steps are:
    1. Drop emojis
    2. Drop @user mentions
    3. Drop hashtags
    4. Drop duplicates
    5. Drop punctuation

### Cleaning and Preprocessing

In [10]:
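def remove_emoji(string):
    """Strip emoji and related symbol code points from a string.

    Mojibake such as 'ð' (U+00F0) lies outside these ranges and is left in place.
    """
    emoji_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"\U00002500-\U00002BEF"  # box drawing through miscellaneous symbols and arrows
        u"\U00002702-\U000027B0"  # dingbats
        u"\U000024C2-\U0001F251"  # enclosed characters (this wide range also spans CJK)
        u"\U0001f926-\U0001f937"  # gestures (face palm to shrug)
        u"\U00010000-\U0010ffff"  # all supplementary planes
        u"\u2640-\u2642"          # female and male signs
        u"\u2600-\u2B55"          # miscellaneous symbols
        u"\u200d"                 # zero-width joiner
        u"\u23cf"                 # eject symbol
        u"\u23e9"                 # fast-forward symbol
        u"\u231a"                 # watch
        u"\ufe0f"                 # variation selector-16
        u"\u3030"                 # wavy dash
                      "]+", re.UNICODE)
    return emoji_pattern.sub(r'', string)

# The `emoji` library is an alternative to a hand-written pattern.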

In [11]:
data['tweet'] = data['tweet'].apply(remove_emoji)

In [12]:
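def remove_hashtag(string):
    # NOTE: this pattern only matches a '#' that is followed by a space,
    # so hashtags such as '#lyft' are left untouched (see the output below).
    return re.sub(r"# ","",string)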

In [13]:
data['tweet'] = data['tweet'].apply(remove_hashtag)

In [14]:
data['tweet'].head(50)

0      @user when a father is dysfunctional and is s...
1     @user @user thanks for #lyft credit i can't us...
2                                   bihday your majesty
3     #model   i love u take with u all the time in ...
4                factsguide: society now    #motivation
5     [2/2] huge fan fare and big talking before the...
6      @user camping tomorrow @user @user @user @use...
7     the next school year is the year for exams.ð...
8     we won!!! love the land!!! #allin #cavs #champ...
9      @user @user welcome here !  i'm   it's so #gr...
10     â #ireland consumer price index (mom) climb...
11    we are so selfish. #orlando #standwithorlando ...
12    i get to see my daddy today!!   #80days #getti...
13    @user #cnn calls #michigan middle school 'buil...
14    no comment!  in #australia   #opkillingbay #se...
15    ouch...junior is angryð#got7 #junior #yugyo...
16    i am thankful for having a paner. #thankful #p...
17                               retweet if you 
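
The hashtags are still present in the output above because the pattern `r"# "` only matches a `#` followed by a space. Two possible alternatives are sketched below, depending on whether the tag word should be kept; `strip_hash_symbol` and `drop_hashtags` are hypothetical helper names, not functions from the notebook.

```python
import re

def strip_hash_symbol(text):
    """Remove the '#' marker but keep the tag word ('#lyft' -> 'lyft')."""
    return re.sub(r"#(\w+)", r"\1", text)

def drop_hashtags(text):
    """Remove the whole hashtag, marker and word together."""
    return re.sub(r"#\w+", "", text)

sample = "thanks for #lyft credit"
print(strip_hash_symbol(sample))  # thanks for lyft credit
print(repr(drop_hashtags(sample)))  # 'thanks for  credit' (note the double space)
```

Keeping the tag word may be preferable for classification, since words inside hashtags can carry signal; that is a judgment call, not something the notebook tests.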

In [15]:
def drop_users(string):
    return re.sub(r"@\S+","",string)

In [16]:
data['tweet'] = data['tweet'].apply(drop_users)

In [17]:
data['tweet'].head(50)

0       when a father is dysfunctional and is so sel...
1       thanks for #lyft credit i can't use cause th...
2                                   bihday your majesty
3     #model   i love u take with u all the time in ...
4                factsguide: society now    #motivation
5     [2/2] huge fan fare and big talking before the...
6                      camping tomorrow        dannyâ¦
7     the next school year is the year for exams.ð...
8     we won!!! love the land!!! #allin #cavs #champ...
9                 welcome here !  i'm   it's so #gr8 ! 
10     â #ireland consumer price index (mom) climb...
11    we are so selfish. #orlando #standwithorlando ...
12    i get to see my daddy today!!   #80days #getti...
13     #cnn calls #michigan middle school 'build the...
14    no comment!  in #australia   #opkillingbay #se...
15    ouch...junior is angryð#got7 #junior #yugyo...
16    i am thankful for having a paner. #thankful #p...
17                               retweet if you 
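
The planned "drop punctuation" step does not appear in the cells above, and the output still contains runs of `!` and repeated spaces. A minimal sketch of one way to do it with the standard library follows; `drop_punctuation` is a hypothetical helper, and it only handles ASCII punctuation.

```python
import re
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def drop_punctuation(text):
    """Delete ASCII punctuation, then collapse runs of whitespace."""
    no_punct = text.translate(_PUNCT_TABLE)
    return re.sub(r"\s+", " ", no_punct).strip()

print(drop_punctuation("we won!!! love the land!!!"))  # we won love the land
```

Note that `string.punctuation` includes `#`, `@`, and the apostrophe, so this step would also turn `can't` into `cant` and strip any remaining hashtag markers.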

**If you reached this point within 60 minutes, you are doing great.** <br>
**If not, you are still doing great.**

### Modelling

In [18]:
train,test = train_test_split(data,test_size=0.3)
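
The split above is random and unseeded, so the exact figures reported below will vary between runs. With labels this imbalanced, a seeded and stratified split is a common alternative. The sketch below uses a toy frame; `toy`, `train_s`, `test_s`, and the seed value 42 are illustrative assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with a 10% positive class, standing in for the notebook's `data`.
toy = pd.DataFrame({
    "tweet": [f"tweet {i}" for i in range(100)],
    "label": [1] * 10 + [0] * 90,
})

# stratify keeps the class ratio the same in both parts;
# random_state makes the split reproducible.
train_s, test_s = train_test_split(
    toy, test_size=0.3, stratify=toy["label"], random_state=42
)

print(len(train_s), len(test_s))
print(int(train_s["label"].sum()), int(test_s["label"].sum()))
```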

In [19]:
vec = CountVectorizer()
clf = LogisticRegression()
pipe = make_pipeline(vec, clf)
pipe.fit(train.tweet, train.label);

#### Evaluation

In [20]:
def print_report(pipe, x_test, y_test):
    y_pred = pipe.predict(x_test)
    report = metrics.classification_report(y_test, y_pred)
    print(report)
    print("accuracy: {:0.3f}".format(metrics.accuracy_score(y_test, y_pred)))

print_report(pipe, test.tweet, test.label)

              precision    recall  f1-score   support

           0       0.97      0.99      0.98      8962
           1       0.86      0.53      0.66       627

    accuracy                           0.96      9589
   macro avg       0.92      0.76      0.82      9589
weighted avg       0.96      0.96      0.96      9589

accuracy: 0.964
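
As a check on how to read the report, the class 1 F1 score and the macro-average recall can be recomputed from the rounded precision and recall values printed above. Because the inputs are already rounded, the last digit could in principle differ from the report.

```python
# Values copied from the classification report above.
precision_1, recall_1 = 0.86, 0.53
recall_0 = 0.99

# F1 is the harmonic mean of precision and recall.
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

# The macro average weights both classes equally, regardless of support.
macro_recall = (recall_0 + recall_1) / 2

print(round(f1_1, 2))          # 0.66
print(round(macro_recall, 2))  # 0.76
```

The high overall accuracy is driven by class 0; the class 1 recall of 0.53 means that nearly half of the hate-speech tweets in the test set are missed.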


### Enhancement

- Use different n-gram ranges
- Use a different text representation (TF-IDF over character n-grams)

In [21]:
vec_2 = CountVectorizer(ngram_range=(1, 3))
clf_2 = LogisticRegression()
pipe_2 = make_pipeline(vec_2, clf_2)
pipe_2.fit(train.tweet, train.label);

In [22]:
print_report(pipe_2, test.tweet, test.label)

              precision    recall  f1-score   support

           0       0.96      1.00      0.98      8962
           1       0.91      0.48      0.62       627

    accuracy                           0.96      9589
   macro avg       0.94      0.74      0.80      9589
weighted avg       0.96      0.96      0.96      9589

accuracy: 0.963


In [23]:
vec_3 = CountVectorizer(ngram_range=(1, 5))
clf_3 = LogisticRegression()
pipe_3 = make_pipeline(vec_3, clf_3)
pipe_3.fit(train.tweet, train.label);

In [24]:
print_report(pipe_3, test.tweet, test.label)

              precision    recall  f1-score   support

           0       0.96      1.00      0.98      8962
           1       0.92      0.45      0.60       627

    accuracy                           0.96      9589
   macro avg       0.94      0.72      0.79      9589
weighted avg       0.96      0.96      0.95      9589

accuracy: 0.961


In [25]:
vec_4 = TfidfVectorizer(analyzer='char_wb', ngram_range=(3, 5), min_df=.01, max_df=.3)
clf_4 = LogisticRegression()
pipe_4 = make_pipeline(vec_4, clf_4)
pipe_4.fit(train.tweet, train.label);

In [26]:
print_report(pipe_4, test.tweet, test.label)

              precision    recall  f1-score   support

           0       0.96      1.00      0.98      8962
           1       0.86      0.40      0.54       627

    accuracy                           0.96      9589
   macro avg       0.91      0.70      0.76      9589
weighted avg       0.95      0.96      0.95      9589

accuracy: 0.956
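
None of the variants above raises class 1 recall beyond the 0.53 of the first model. One option the notebook does not try is reweighting the classes with scikit-learn's `class_weight="balanced"`. The sketch below shows the wiring on a tiny made-up corpus; the texts, `pipe_bal`, and `max_iter=1000` are assumptions for illustration, and whether this helps on the real data would need to be measured.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny made-up corpus with an imbalanced label distribution.
texts = [
    "have a nice day", "love this sunny day", "great game tonight",
    "happy birthday friend", "coffee and books", "lovely weather today",
    "enjoy the weekend", "you are awful scum", "awful scum go away",
]
labels = [0, 0, 0, 0, 0, 0, 0, 1, 1]

# class_weight="balanced" scales each class inversely to its frequency,
# so mistakes on the rare class cost more during training.
pipe_bal = make_pipeline(
    CountVectorizer(),
    LogisticRegression(class_weight="balanced", max_iter=1000),
)
pipe_bal.fit(texts, labels)

pred = pipe_bal.predict(["awful scum", "nice sunny day"])
print(pred)
```

Reweighting typically trades some precision for recall, so it should be compared with the reports above using the same `print_report` helper.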


In [27]:
print(data.tweet[10:15])
print(pipe.predict(data.tweet[10:15]))

10     â #ireland consumer price index (mom) climb...
11    we are so selfish. #orlando #standwithorlando ...
12    i get to see my daddy today!!   #80days #getti...
13     #cnn calls #michigan middle school 'build the...
14    no comment!  in #australia   #opkillingbay #se...
Name: tweet, dtype: object
[0 0 0 1 1]


#### Done!