<a href="https://colab.research.google.com/github/Shehab-7/NLP/blob/main/Text%20Classification/Text_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**Dataset**
Labeled dataset of tweets collected from Twitter.

**Objective**
Classify each tweet as containing hate speech or not.

- 0 -> no hate speech
- 1 -> contains hate speech



### Import Libraries

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from collections import Counter
import random
from termcolor import colored
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity, cosine_distances
from sklearn.model_selection import train_test_split
import re
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn import metrics


### Load Dataset

In [2]:
data = pd.read_csv('/content/dataset.csv')

In [3]:
data.head()

|   | id | label | tweet |
|---|----|-------|-------|
| 0 | 1 | 0 | @user when a father is dysfunctional and is s... |
| 1 | 2 | 0 | @user @user thanks for #lyft credit i can't us... |
| 2 | 3 | 0 | bihday your majesty |
| 3 | 4 | 0 | #model i love u take with u all the time in ... |
| 4 | 5 | 0 | factsguide: society now #motivation |


### EDA

- check NaNs

In [4]:
data.isnull().any()


id       False
label    False
tweet    False
dtype: bool

- check duplicates

In [5]:
data.drop("id",axis=1,inplace=True)

In [6]:
data.duplicated().sum()

2432

- show sample tweets to identify the required preprocessing steps

In [7]:
data['tweet'].head(50)

0      @user when a father is dysfunctional and is s...
1     @user @user thanks for #lyft credit i can't us...
2                                   bihday your majesty
3     #model   i love u take with u all the time in ...
4                factsguide: society now    #motivation
5     [2/2] huge fan fare and big talking before the...
6      @user camping tomorrow @user @user @user @use...
7     the next school year is the year for exams.ð...
8     we won!!! love the land!!! #allin #cavs #champ...
9      @user @user welcome here !  i'm   it's so #gr...
10     â #ireland consumer price index (mom) climb...
11    we are so selfish. #orlando #standwithorlando ...
12    i get to see my daddy today!!   #80days #getti...
13    @user #cnn calls #michigan middle school 'buil...
14    no comment!  in #australia   #opkillingbay #se...
15    ouch...junior is angryð#got7 #junior #yugyo...
16    i am thankful for having a paner. #thankful #p...
17                               retweet if you 

In [8]:
data['tweet'].sample(50)

1607     happy weekend everyone ðð   #alamode fri...
25065    it's thursday! see you soon! @user    #fit #mo...
22290    @user these are just stunning statistics!!! wh...
5896      @user sent this from a friend at @user when d...
20588    //i dont fucking understand why people just de...
11460     @user #needlefelted   #taylor by cervivintage...
8241     uk 2016: where it fell to pop star @user to do...
3349     the power to retweet myself = the power to use...
22892    i'm feel lost, wrong way and mistake my attitu...
16193    on my way to @user to talk #nbafinals with nea...
14552    @user they're all amazing! can't wait to see w...
4152      @user most brilliant &amp; balanced speech by...
14640    so many beautiful strikes on goal and nothing ...
21201        happy father's day !!! #fathersday   #family 
9196     #fun #work#love#scotland @ landmark forest adv...
2174     @user so many memories ð¢   #guns #memoies #...
9062               they only love you when you win. #nba

- check class balance

In [9]:
data['label'].value_counts()
# The classes are imbalanced: consider SMOTE, upsampling, or downsampling.

0    29720
1     2242
Name: label, dtype: int64
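
The counts above show roughly one hate-speech tweet for every thirteen others (about 7% positives). As a rough sketch of how that imbalance translates into per-class weights, the snippet below applies the `n_samples / (n_classes * class_count)` heuristic, which is the formula scikit-learn documents for `class_weight='balanced'`. The variable names are illustrative, and the counts are the pre-deduplication values printed above.

```python
# Class counts copied from the value_counts() output above (before deduplication).
counts = {0: 29720, 1: 2242}

n_samples = sum(counts.values())
n_classes = len(counts)

# Share of tweets labeled as hate speech.
positive_share = counts[1] / n_samples

# "Balanced" heuristic: rarer classes get proportionally larger weights.
weights = {label: n_samples / (n_classes * count) for label, count in counts.items()}

print(f"positive share: {positive_share:.3f}")
for label, weight in weights.items():
    print(f"class {label}: weight {weight:.3f}")
```

Passing `class_weight='balanced'` to `LogisticRegression` is one way to apply such weights without resampling; whether it helps on this dataset would need to be checked on the test split.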

- Required cleaning and preprocessing steps:
    1. Drop emojis
    2. Drop @user
    3. Drop hashtags
    4. Drop duplicates


### Cleaning and Preprocessing

In [10]:
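def remove_emoji(string):
    # Character class covering common emoji and symbol code-point ranges.
    emoji_pattern = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (regional indicators)
        u"\U00002500-\U00002BEF"  # box drawing through misc symbols and arrows
        u"\U00002702-\U000027B0"  # dingbats
        u"\U000024C2-\U0001F251"  # very broad range; also matches CJK and other scripts
        u"\U0001f926-\U0001f937"  # gesture and people emoji
        u"\U00010000-\U0010ffff"  # all supplementary planes
        u"\u2640-\u2642"          # female and male signs
        u"\u2600-\u2B55"          # miscellaneous symbols
        u"\u200d"                 # zero-width joiner
        u"\u23cf"                 # eject symbol
        u"\u23e9"                 # fast-forward symbol
        u"\u231a"                 # watch
        u"\ufe0f"                 # variation selector-16
        u"\u3030"                 # wavy dash
                      "]+", re.UNICODE)
    return emoji_pattern.sub(r'', string)

# The third-party `emoji` library is an alternative to a hand-written pattern.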

In [11]:
data['tweet'] = data['tweet'].apply(remove_emoji)

In [12]:
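def remove_hashtag(string):
    # Note: this pattern only removes a "#" that is followed by a space,
    # so hashtags such as "#lyft" are left untouched (see the output below).
    return re.sub(r"# ", "", string)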

In [13]:
data['tweet'] = data['tweet'].apply(remove_hashtag)

In [14]:
data['tweet'].head(50)

0      @user when a father is dysfunctional and is s...
1     @user @user thanks for #lyft credit i can't us...
2                                   bihday your majesty
3     #model   i love u take with u all the time in ...
4                factsguide: society now    #motivation
5     [2/2] huge fan fare and big talking before the...
6      @user camping tomorrow @user @user @user @use...
7     the next school year is the year for exams.ð...
8     we won!!! love the land!!! #allin #cavs #champ...
9      @user @user welcome here !  i'm   it's so #gr...
10     â #ireland consumer price index (mom) climb...
11    we are so selfish. #orlando #standwithorlando ...
12    i get to see my daddy today!!   #80days #getti...
13    @user #cnn calls #michigan middle school 'buil...
14    no comment!  in #australia   #opkillingbay #se...
15    ouch...junior is angryð#got7 #junior #yugyo...
16    i am thankful for having a paner. #thankful #p...
17                               retweet if you 
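
The output above still contains hashtags such as `#lyft` and `#motivation`, because the pattern `"# "` matches only a hash sign followed by a space. The sketch below shows two possible alternatives: keeping the word and dropping only the `#`, or dropping the whole hashtag. The helper names `strip_hash_symbol`, `drop_hashtags`, and `squeeze_spaces` are hypothetical and not part of the original notebook, and neither variant was used to produce the results reported later.

```python
import re

def strip_hash_symbol(text: str) -> str:
    """Keep the hashtag word but remove the leading '#'."""
    return text.replace("#", "")

def drop_hashtags(text: str) -> str:
    """Remove each hashtag entirely ('#' plus the following non-space characters)."""
    return re.sub(r"#\S+", "", text)

def squeeze_spaces(text: str) -> str:
    """Collapse runs of whitespace left behind by the removals."""
    return re.sub(r"\s+", " ", text).strip()

sample = "thanks for #lyft credit"
print(squeeze_spaces(strip_hash_symbol(sample)))
print(squeeze_spaces(drop_hashtags(sample)))
```

Keeping the word may preserve useful signal, since hashtags often carry the topic of a tweet; which variant works better here is an empirical question.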

In [15]:
def drop_users(string):
    return re.sub(r"@\S+","",string)

In [16]:
data['tweet'] = data['tweet'].apply(drop_users)

In [17]:
data['tweet'].head(50)

0       when a father is dysfunctional and is so sel...
1       thanks for #lyft credit i can't use cause th...
2                                   bihday your majesty
3     #model   i love u take with u all the time in ...
4                factsguide: society now    #motivation
5     [2/2] huge fan fare and big talking before the...
6                      camping tomorrow        dannyâ¦
7     the next school year is the year for exams.ð...
8     we won!!! love the land!!! #allin #cavs #champ...
9                 welcome here !  i'm   it's so #gr8 ! 
10     â #ireland consumer price index (mom) climb...
11    we are so selfish. #orlando #standwithorlando ...
12    i get to see my daddy today!!   #80days #getti...
13     #cnn calls #michigan middle school 'build the...
14    no comment!  in #australia   #opkillingbay #se...
15    ouch...junior is angryð#got7 #junior #yugyo...
16    i am thankful for having a paner. #thankful #p...
17                               retweet if you 
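
The samples above still contain characters such as `ð` and `â`, which look like emoji bytes decoded with the wrong encoding and which fall outside the ranges in `remove_emoji`. One blunt option, assuming the tweets are English-only so that dropping every non-ASCII character is acceptable, is sketched below. `drop_non_ascii` is a hypothetical helper, not part of the original notebook.

```python
def drop_non_ascii(text: str) -> str:
    """Discard every character that cannot be encoded as ASCII."""
    return text.encode("ascii", errors="ignore").decode("ascii")

# Strings modeled on the residue visible in the output above.
print(drop_non_ascii("the year for exams.ð"))
print(drop_non_ascii("camping tomorrow dannyâ¦"))
```

This also removes legitimate accented letters, so it is only reasonable when that loss is acceptable.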

In [18]:
data = data.drop_duplicates()

In [19]:
data.duplicated().sum()

0

**If you reached this point within 60 minutes, you are doing great.** <br>
**If not, you are still doing great!**

### Modelling

In [20]:
# No random_state is set, so the split (and the scores below) will vary between runs.
train, test = train_test_split(data, test_size=0.3)

In [21]:
# Baseline: bag-of-words counts (unigrams) fed into logistic regression.
vec = CountVectorizer()
clf = LogisticRegression()
pipe = make_pipeline(vec, clf)
pipe.fit(train.tweet, train.label);

#### Evaluation

In [22]:
def print_report(pipe, x_test, y_test):
    """Print per-class precision, recall, and F1, followed by overall accuracy."""
    y_pred = pipe.predict(x_test)
    report = metrics.classification_report(y_test, y_pred)
    print(report)
    print("accuracy: {:0.3f}".format(metrics.accuracy_score(y_test, y_pred)))

print_report(pipe, test.tweet, test.label)

              precision    recall  f1-score   support

           0       0.96      0.99      0.97      8227
           1       0.80      0.41      0.54       632

    accuracy                           0.95      8859
   macro avg       0.88      0.70      0.76      8859
weighted avg       0.94      0.95      0.94      8859

accuracy: 0.950
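
Accuracy is 0.95, but the report shows that recall for class 1 is only 0.41, so the accuracy figure mostly reflects the majority class. The short calculation below reproduces the class-1 F1 score from the printed precision and recall and estimates how many hate-speech tweets were missed. The inputs are the rounded values from the report, so the counts are approximate.

```python
def f1_from(precision: float, recall: float) -> float:
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Class-1 values copied from the classification report above.
precision_1, recall_1, support_1 = 0.80, 0.41, 632

f1_1 = f1_from(precision_1, recall_1)

# Recall = caught / support, so caught is roughly recall * support.
caught = round(recall_1 * support_1)
missed = support_1 - caught

print(f"class-1 F1: {f1_1:.2f}")
print(f"hate-speech tweets caught: ~{caught}, missed: ~{missed}")
```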


### Enhancement

- Use different n-gram ranges
- Use a different text representation technique (TF-IDF over character n-grams)
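
An n-gram range such as `(1, 3)` asks the vectorizer to count unigrams, bigrams, and trigrams together. The sketch below mimics that on an already tokenized tweet in plain Python. It is an illustration only: `word_ngrams` is a hypothetical helper and it skips `CountVectorizer`'s own tokenization and lowercasing.

```python
def word_ngrams(tokens, min_n, max_n):
    """Return all word n-grams with min_n <= n <= max_n, shortest first."""
    grams = []
    for n in range(min_n, max_n + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

tokens = "we love the land".split()
for gram in word_ngrams(tokens, 1, 3):
    print(gram)
```

Each wider range adds many rarely seen features, which is one possible reason the longer ranges below raise precision while recall drops.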

In [23]:
vec_2 = CountVectorizer(ngram_range=(1, 3))
clf_2 = LogisticRegression()
pipe_2 = make_pipeline(vec_2, clf_2)
pipe_2.fit(train.tweet, train.label);

In [24]:
print_report(pipe_2, test.tweet, test.label)

              precision    recall  f1-score   support

           0       0.95      1.00      0.97      8227
           1       0.88      0.36      0.51       632

    accuracy                           0.95      8859
   macro avg       0.92      0.68      0.74      8859
weighted avg       0.95      0.95      0.94      8859

accuracy: 0.951


In [25]:
vec_3 = CountVectorizer(ngram_range=(1, 5))
clf_3 = LogisticRegression()
pipe_3 = make_pipeline(vec_3, clf_3)
pipe_3.fit(train.tweet, train.label);

In [26]:
print_report(pipe_3, test.tweet, test.label)

              precision    recall  f1-score   support

           0       0.95      1.00      0.97      8227
           1       0.91      0.33      0.48       632

    accuracy                           0.95      8859
   macro avg       0.93      0.66      0.73      8859
weighted avg       0.95      0.95      0.94      8859

accuracy: 0.950


In [27]:
vec_4 = TfidfVectorizer(analyzer='char_wb', ngram_range=(3, 5), min_df=.01, max_df=.3)
clf_4 = LogisticRegression()
pipe_4 = make_pipeline(vec_4, clf_4)
pipe_4.fit(train.tweet, train.label);

In [28]:
print_report(pipe_4, test.tweet, test.label)

              precision    recall  f1-score   support

           0       0.95      0.99      0.97      8227
           1       0.82      0.32      0.46       632

    accuracy                           0.95      8859
   macro avg       0.89      0.66      0.71      8859
weighted avg       0.94      0.95      0.94      8859

accuracy: 0.946
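
The `char_wb` analyzer used in the last pipeline builds character n-grams inside word boundaries, padding each word with a space on both sides. A rough pure-Python illustration follows; `char_wb_ngrams` is a hypothetical helper, not scikit-learn's implementation, and it simply emits nothing for words shorter than `n`.

```python
def char_wb_ngrams(text, min_n, max_n):
    """Character n-grams taken within each space-padded word."""
    grams = []
    for word in text.split():
        padded = f" {word} "  # the padding marks word start and end
        for n in range(min_n, max_n + 1):
            for start in range(len(padded) - n + 1):
                grams.append(padded[start:start + n])
    return grams

print(char_wb_ngrams("hate", 3, 3))
print(len(char_wb_ngrams("hate", 3, 5)))
```

Character n-grams can be more tolerant of misspellings than word features, although in the run above this representation did not improve class-1 recall.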


In [29]:
# Spot-check the baseline pipeline on a few tweets (some may have been seen during training).
print(data.tweet[10:15])
print(pipe.predict(data.tweet[10:15]))

10     â #ireland consumer price index (mom) climb...
11    we are so selfish. #orlando #standwithorlando ...
12    i get to see my daddy today!!   #80days #getti...
13     #cnn calls #michigan middle school 'build the...
14    no comment!  in #australia   #opkillingbay #se...
Name: tweet, dtype: object
[0 0 0 1 1]


#### Done!