### Analysing Twitter data to detect tweets that contain hateful or harsh messages

Hate speech detection can be framed as a sentiment classification task. A model that classifies a piece of text as hate speech can therefore be trained on the kind of labelled data commonly used for sentiment classification. For this hate speech detection model, I will use Twitter data.
The data set consists of a train set and a test set. The train set contains 31,962 tweets, each with an ID and a label of 0 or 1. The sentiment to detect in this data set is whether or not a tweet is hate speech.

Here, 0 represents non-hate speech and 1 represents hate speech.

In [149]:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, accuracy_score
from sklearn.linear_model import LogisticRegression
import re

In [132]:
train_data = pd.read_csv("data/train.csv")
test_data = pd.read_csv("data/test.csv")

In [133]:
train_data.head()

Unnamed: 0,id,label,tweet
0,1,0,@user when a father is dysfunctional and is s...
1,2,0,@user @user thanks for #lyft credit i can't us...
2,3,0,bihday your majesty
3,4,0,#model i love u take with u all the time in ...
4,5,0,factsguide: society now #motivation


In [134]:
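#clean the tweet text with regular expressions

def clean_text(text):
    #convert to lowercase
    text = text.lower()
    #remove urls
    text = re.sub(r"http\S+|www\S+|https\S+", "", text)
    #remove every character that is not a letter or whitespace (digits, punctuation, emojis)
    text = re.sub(r"[^A-Za-z\s]", "", text)
    #remove usernames (note: the "@" was already stripped by the previous step,
    #so this pattern no longer matches and the word "user" stays in the text)
    text = re.sub(r"@\w+", "", text)
    return text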
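
In `clean_text`, the username pattern runs after the special-character step has removed the `@`, so handles survive as the plain word `user`. The sketch below is a hypothetical variant, `clean_text_reordered`, that removes handles first and also collapses leftover whitespace. It is not the function used to produce the outputs in this notebook, and the sample string is made up for illustration.

```python
import re

def clean_text_reordered(text):
    """Hypothetical variant of clean_text: handles are removed before punctuation."""
    text = text.lower()
    # remove URLs first, while ':' and '/' are still present
    text = re.sub(r"http\S+|www\S+", "", text)
    # remove @handles while the '@' marker still exists
    text = re.sub(r"@\w+", "", text)
    # drop everything that is not a letter or whitespace (digits, punctuation, emojis)
    text = re.sub(r"[^a-z\s]", "", text)
    # collapse the runs of whitespace left behind by the removals
    return re.sub(r"\s+", " ", text).strip()

sample = "@user Thanks for #Lyft credit!! http://t.co/abc 123"
print(clean_text_reordered(sample))  # thanks for lyft credit
```

Whether dropping the handle is preferable depends on the task: with the original order, every tweet that mentions someone carries the token `user`, which the vectorizer then treats as an ordinary word.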

In [135]:
#overwrite tweet column with clean tweet
train_data["tweet"] = train_data["tweet"].apply(clean_text)
train_data["tweet"]

0         user when a father is dysfunctional and is so...
1        user user thanks for lyft credit i cant use ca...
2                                      bihday your majesty
3        model   i love u take with u all the time in u...
4                     factsguide society now    motivation
                               ...                        
31957                             ate user isz that youuu 
31958      to see nina turner on the airwaves trying to...
31959    listening to sad songs on a monday morning otw...
31960    user sikh temple vandalised in in calgary wso ...
31961                      thank you user for you follow  
Name: tweet, Length: 31962, dtype: object

In [136]:
train_data.head()

Unnamed: 0,id,label,tweet
0,1,0,user when a father is dysfunctional and is so...
1,2,0,user user thanks for lyft credit i cant use ca...
2,3,0,bihday your majesty
3,4,0,model i love u take with u all the time in u...
4,5,0,factsguide society now motivation


In [137]:
test_data["tweet"] = test_data["tweet"].apply(clean_text)
test_data.head()

Unnamed: 0,id,tweet
0,31963,studiolife aislife requires passion dedication...
1,31964,user white supremacists want everyone to see ...
2,31965,safe ways to heal your acne altwaystoheal h...
3,31966,is the hp and the cursed child book up for res...
4,31967,rd bihday to my amazing hilarious nephew eli...


In [138]:
X_train = train_data["tweet"]
y_train = train_data["label"]

In [139]:
# Initialize TfidfVectorizer
vectorizer = TfidfVectorizer()

# Fit and Transform the 'tweet' column of train data
X_train_tfidf = vectorizer.fit_transform(X_train)
print(X_train_tfidf)

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 358639 stored elements and shape (31962, 39744)>
  Coords	Values
  (0, 36905)	0.07384154882116929
  (0, 38274)	0.1529115658867543
  (0, 11791)	0.22089089303118026
  (0, 17763)	0.22354755854034217
  (0, 10200)	0.3831058148100799
  (0, 1265)	0.10653356684771656
  (0, 32132)	0.13809337382160308
  (0, 30812)	0.32536874006080596
  (0, 15429)	0.1884908239175851
  (0, 9877)	0.35023474468704946
  (0, 15876)	0.3754184460646087
  (0, 18936)	0.21917774762306041
  (0, 17606)	0.21772073268784015
  (0, 10199)	0.3685601345609686
  (0, 29961)	0.24276566068528868
  (1, 36905)	0.1310764387285603
  (1, 34675)	0.1805935527183264
  (1, 12740)	0.09609757583617878
  (1, 20992)	0.3050464358003797
  (1, 7747)	0.27381673672632856
  (1, 5369)	0.14921776743480208
  (1, 36896)	0.21074145675482372
  (1, 5669)	0.22548758500162694
  (1, 35024)	0.15721964334141927
  (1, 9699)	0.15328257439862658
  :	:
  (31958, 6212)	0.3005799673880711
  (31959, 17763)	0.14
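
Each printed entry above is a `(row, column)` pair followed by the TF-IDF weight of one token in one tweet. To make that concrete, here is a minimal sketch on a made-up three-sentence corpus (`toy_tweets` and the other `toy_*` names are assumptions for illustration, not part of the data set). With the default settings, `TfidfVectorizer` ignores single-character tokens such as "i" and scales every row to unit length.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# made-up corpus, only for illustration
toy_tweets = [
    "i love this sunny day",
    "i hate this rainy day",
    "love love love",
]

toy_vectorizer = TfidfVectorizer()
toy_matrix = toy_vectorizer.fit_transform(toy_tweets)

# vocabulary_ maps each token to its column index in the matrix
print(sorted(toy_vectorizer.vocabulary_))
# one row per tweet, one column per distinct token
print(toy_matrix.shape)
# a dense view is fine for a toy example, but not for the full 31962 x 39744 matrix
print(toy_matrix.toarray().round(2))
```

The third row contains a single non-zero weight because "love" is its only token, and row normalisation sets that weight to 1.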

In [140]:
#initialize model
model = MultinomialNB()

In [141]:
#training the model
model.fit(X_train_tfidf,y_train)

In [142]:
X_test = test_data["tweet"]
X_test.head()

0    studiolife aislife requires passion dedication...
1     user white supremacists want everyone to see ...
2    safe ways to heal your acne    altwaystoheal h...
3    is the hp and the cursed child book up for res...
4      rd bihday to my amazing hilarious nephew eli...
Name: tweet, dtype: object

In [143]:
# Transform the 'tweet' column of test data
X_test_tfidf = vectorizer.transform(X_test)
print(X_test_tfidf)

<Compressed Sparse Row sparse matrix of dtype 'float64'
	with 178971 stored elements and shape (17197, 39744)>
  Coords	Values
  (0, 8668)	0.49636534791504755
  (0, 12222)	0.2814303354893764
  (0, 25971)	0.3914752493697996
  (0, 29194)	0.49636534791504755
  (0, 33358)	0.510664851360582
  (0, 35450)	0.11835703511161651
  (1, 1265)	0.13912844364886093
  (1, 3722)	0.3386883203053539
  (1, 11183)	0.27138377589631396
  (1, 15671)	0.34420646077249256
  (1, 22977)	0.3170419384419177
  (1, 23878)	0.21327629631030792
  (1, 30719)	0.22295702156862315
  (1, 33770)	0.5003203905070188
  (1, 34736)	0.11082752570093991
  (1, 35450)	0.11155676903607284
  (1, 36905)	0.09643401669631478
  (1, 37657)	0.2311854152330957
  (1, 38347)	0.28364020818555874
  (1, 38448)	0.24463835311256482
  (2, 218)	0.460496229654238
  (2, 1009)	0.3042564415370488
  (2, 15470)	0.3906192823208058
  (2, 15476)	0.32578754896114964
  (2, 15496)	0.2667794419964481
  :	:
  (17194, 20057)	0.21361663221460842
  (17194, 23847)	0.25744
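
The test matrix has the same 39744 columns as the train matrix because the test tweets go through `vectorizer.transform`, not `fit_transform`, so they reuse the vocabulary learned from the train set. One way to make that hard to get wrong is to chain the vectorizer and the classifier in a scikit-learn `Pipeline`. The sketch below uses made-up sentences and hypothetical `toy_*` names; it is an alternative arrangement, not the code used in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# made-up data, only for illustration
toy_train = [
    "have a lovely day",
    "what a lovely surprise",
    "i hate you",
    "hate hate hate",
]
toy_labels = [0, 0, 1, 1]
toy_test = ["lovely day today", "so much hate"]

# fit() learns the vocabulary and the classifier from the train texts only;
# predict() applies the stored vocabulary to new texts before classifying them
toy_pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())
toy_pipeline.fit(toy_train, toy_labels)

toy_pred = toy_pipeline.predict(toy_test)
print(toy_pred)
```

Words that never appeared in the train texts ("today", "so", "much") are simply ignored at prediction time.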

In [144]:
y_pred = model.predict(X_test_tfidf)
y_pred

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

In [145]:
#convert y_pred to a one-column dataframe named "label"
test_label = pd.DataFrame(y_pred, columns=["label"])
test_label.head()

Unnamed: 0,label
0,0
1,0
2,0
3,0
4,0


In [146]:
#add the predicted labels to the test data as a new column
test_data["label"] = test_label
test_data.head()

Unnamed: 0,id,tweet,label
0,31963,studiolife aislife requires passion dedication...,0
1,31964,user white supremacists want everyone to see ...,0
2,31965,safe ways to heal your acne altwaystoheal h...,0
3,31966,is the hp and the cursed child book up for res...,0
4,31967,rd bihday to my amazing hilarious nephew eli...,0
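
Assigning a DataFrame or Series as a new column aligns the values on the index. That works here, assuming the test data and the predictions both carry the default 0, 1, 2, ... index. The sketch below, on made-up data with hypothetical `toy_*` names, shows what happens when the indexes differ and why assigning the NumPy array directly is a safe alternative.

```python
import numpy as np
import pandas as pd

# made-up frame whose index does not start at 0
toy_test = pd.DataFrame({"tweet": ["a", "b", "c"]}, index=[10, 11, 12])
toy_pred = np.array([0, 1, 0])

# a Series is aligned on the index: labels 0..2 do not match 10..12, so every value is NaN
toy_test["label_aligned"] = pd.Series(toy_pred)

# a plain array is assigned by position, regardless of the index
toy_test["label"] = toy_pred

print(toy_test)
# a quick summary of how many tweets received each label
print(toy_test["label"].value_counts())
```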


In [150]:
#try a second model: logistic regression
model2 = LogisticRegression(max_iter=1000)


In [None]:
#training the model
model2.fit(X_train_tfidf,y_train)

In [None]:
#predict the test labels with the second model
y_pred2 = model2.predict(X_test_tfidf)
y_pred2

array([0, 0, 0, ..., 0, 0, 0], dtype=int64)

### Evaluate the model

#### Since the test data comes without labels (no y_test), I evaluated each model by comparing its predictions on the train data with y_train. Scores measured on the data a model was trained on tend to be optimistic.

In [154]:
y_train_pred = model.predict(X_train_tfidf)
y_train_pred2 = model2.predict(X_train_tfidf)

#accuracy_score(y_test, y_pred) cannot be computed because the test data has no labels

#evaluate on the train data
print(f"The MultinomialNB accuracy is {accuracy_score(y_train, y_train_pred)*100:.2f}%")
print(f"The LogisticRegression accuracy is {accuracy_score(y_train, y_train_pred2)*100:.2f}%")
print("------------------------------------------------------")
print("MultinomialNB Model")
print(classification_report(y_train, y_train_pred))
print("-------------------------------------------------------")
print("LogisticRegression Model")
print(classification_report(y_train, y_train_pred2))

The MultinomialNB accuracy is 94.17%
The LogisticRegression accuracy is 95.43%
------------------------------------------------------
MultinomialNB Model
              precision    recall  f1-score   support

           0       0.94      1.00      0.97     29720
           1       1.00      0.17      0.29      2242

    accuracy                           0.94     31962
   macro avg       0.97      0.58      0.63     31962
weighted avg       0.95      0.94      0.92     31962

-------------------------------------------------------
LogisticRegression Model
              precision    recall  f1-score   support

           0       0.95      1.00      0.98     29720
           1       0.96      0.36      0.53      2242

    accuracy                           0.95     31962
   macro avg       0.96      0.68      0.75     31962
weighted avg       0.95      0.95      0.94     31962
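
The reports show high accuracy but low recall for class 1 (0.17 and 0.36): only 2,242 of the 31,962 train tweets are labelled as hate speech, so a model can score well while missing most of them. Two things that might give a more informative picture are a held-out validation split and class weighting. The sketch below tries both on a made-up, deliberately repetitive corpus; the `toy_*` names, the 75/25 split and `class_weight="balanced"` are assumptions for illustration, and no result from it should be read as a result for the Twitter data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# made-up, imbalanced stand-in for the cleaned tweets: 40 non-hate, 8 hate
toy_texts = (
    ["have a lovely day friend"] * 20
    + ["thanks for the great surprise"] * 20
    + ["i hate you people"] * 4
    + ["you are awful and i hate it"] * 4
)
toy_labels = [0] * 40 + [1] * 8

# stratify keeps the 0/1 ratio the same in both parts of the split
X_fit, X_val, y_fit, y_val = train_test_split(
    toy_texts, toy_labels, test_size=0.25, stratify=toy_labels, random_state=0
)

# learn the vocabulary from the fitting part only, then reuse it for validation
val_vectorizer = TfidfVectorizer()
X_fit_tfidf = val_vectorizer.fit_transform(X_fit)
X_val_tfidf = val_vectorizer.transform(X_val)

# compare the default weighting with weights inversely proportional to class frequency
for weights in (None, "balanced"):
    clf = LogisticRegression(max_iter=1000, class_weight=weights)
    clf.fit(X_fit_tfidf, y_fit)
    print(f"class_weight={weights}")
    print(classification_report(y_val, clf.predict(X_val_tfidf), zero_division=0))
```

On the real data, the same pattern would split `train_data` instead of the toy lists, and the class 1 recall on the validation part would be the number to watch.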



In [148]:
#judging by eye from the predictions on the test data, most tweets appear to be labelled correctly
test_data.tail(20)

Unnamed: 0,id,tweet,label
17177,49140,i am thankful for children thankful positive,0
17178,49141,liverpool walk liverpool starbucks avidaeboa ...,0
17179,49142,bakersfield rooster simulation i want to cli...,0
17180,49143,por do sol instagood beautiful instadaily in...,0
17181,49144,user hell yeah what a great surprise for your ...,0
17182,49145,when ur the joke ur defensive towards everything,0
17183,49146,enjoying the evening sun in my bedroom cozy...,0
17184,49147,tonight on user from pm gmt you can here a sp...,0
17185,49148,today is a good day for excercise imready sofu...,0
17186,49149,good night with a tea and music billy music t...,0
