In [1]:
import pandas as pd
import numpy as np
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

### We have two datasets of comments collected from Twitter, and we want to identify which of them can be categorized as "hate speech". In the first dataset, which we use to train the model, each comment is labelled 0 (non-hate speech) or 1 (hate speech). The second dataset has no labels; we apply the trained model to it to estimate how many of its comments are hate speech.

In [2]:
df = pd.read_csv(r"C:\Users\Nauel\Desktop\Python\Sentiment analysis\train.csv", on_bad_lines="skip")  # skip malformed rows (pandas >= 1.3; replaces the deprecated error_bad_lines=False)
df

Unnamed: 0,id,label,tweet
0,1,0,@user when a father is dysfunctional and is s...
1,2,0,@user @user thanks for #lyft credit i can't us...
2,3,0,bihday your majesty
3,4,0,#model i love u take with u all the time in ...
4,5,0,factsguide: society now #motivation
...,...,...,...
31957,31958,0,ate @user isz that youuu?ðððððð...
31958,31959,0,to see nina turner on the airwaves trying to...
31959,31960,0,listening to sad songs on a monday morning otw...
31960,31961,1,"@user #sikh #temple vandalised in in #calgary,..."


# Training: TF-IDF prediction model

In [3]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split 
from sklearn.svm import LinearSVC 
from sklearn.metrics import classification_report 

In [4]:
df.head()

Unnamed: 0,id,label,tweet
0,1,0,@user when a father is dysfunctional and is s...
1,2,0,@user @user thanks for #lyft credit i can't us...
2,3,0,bihday your majesty
3,4,0,#model i love u take with u all the time in ...
4,5,0,factsguide: society now #motivation


In [5]:
!pip install git+https://github.com/laxmimerit/preprocess_kgptalkie.git
    

Defaulting to user installation because normal site-packages is not writeable
Collecting git+https://github.com/laxmimerit/preprocess_kgptalkie.git
  Cloning https://github.com/laxmimerit/preprocess_kgptalkie.git to c:\users\nauel\appdata\local\temp\pip-req-build-ose57t_z
  Resolved https://github.com/laxmimerit/preprocess_kgptalkie.git to commit 9ca68d37027af9f6a30d54640347ce3b2e2694b3


  Running command git clone -q https://github.com/laxmimerit/preprocess_kgptalkie.git 'C:\Users\Nauel\AppData\Local\Temp\pip-req-build-ose57t_z'


### Cleaning the data: removing unneeded characters

In [6]:
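import preprocess_kgptalkie as ps
import re

def get_clean(x):
    # Lowercase, drop backslashes, and turn underscores into spaces
    x = str(x).lower().replace('\\', '').replace('_', ' ')
    x = ps.cont_exp(x)               # expand contractions, e.g. "can't" -> "cannot"
    x = ps.remove_emails(x)          # strip e-mail addresses
    x = ps.remove_accented_chars(x)  # strip accents from characters
    x = ps.remove_special_chars(x)   # drop punctuation and symbols such as @ and #
    # Collapse any run of 3+ identical characters into a single one
    x = re.sub("(.)\\1{2,}", "\\1", x)
    return x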
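
The last step of `get_clean` is the least obvious one, so here is a minimal standalone sketch of that regular expression only. The helper name `collapse_repeats` is introduced for illustration and is not part of the notebook. Note that the substitution is lossy: a stretched word collapses to a single letter, which is not always the original spelling.

```python
import re

def collapse_repeats(text: str) -> str:
    """Replace any run of three or more identical characters with one copy."""
    return re.sub(r"(.)\1{2,}", r"\1", text)

print(collapse_repeats("is that youuu"))  # is that you
print(collapse_repeats("sooo goood!!!"))  # so god!
print(collapse_repeats("hello"))          # hello (a run of two is left alone)
```

Because `.` matches any character except a newline, runs of punctuation or spaces are collapsed as well.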

In [7]:
df["tweet"] = df["tweet"].apply(get_clean)  # takes around one minute
df.head()

Unnamed: 0,id,label,tweet
0,1,0,user when a father is dysfunctional and is so ...
1,2,0,user user thanks for lyft credit i cannot use ...
2,3,0,bihday your majesty
3,4,0,model i love you take with you all the time in ur
4,5,0,factsguide society now motivation


In [8]:
tf_idf = TfidfVectorizer(max_features=5000)  # keep only the 5000 most frequent terms as features
X = df["tweet"]
Y = df["label"]

X=tf_idf.fit_transform(X)
X

<31962x5000 sparse matrix of type '<class 'numpy.float64'>'
	with 311357 stored elements in Compressed Sparse Row format>
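
To make the numbers stored in that sparse matrix less opaque, the following is a hand-rolled sketch of the weighting that `TfidfVectorizer` applies with its default settings: raw term counts multiplied by a smoothed inverse document frequency, followed by L2 normalisation of each row. It is an illustration, not the library's implementation. The function `tfidf_rows` and the three toy sentences are assumptions made up for this example, and the sketch tokenises on whitespace, whereas the library's default tokeniser also drops one-character tokens.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return one {term: weight} dict per document, L2-normalised per row."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Smoothed idf: ln((1 + n) / (1 + df)) + 1, multiplied by the raw count.
        raw = {
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, count in counts.items()
        }
        # L2 norm of the row; fall back to 1.0 for an empty document.
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        rows.append({term: w / norm for term, w in raw.items()})
    return rows

# Hypothetical toy corpus, not taken from the dataset.
docs = ["i hate pizza", "i love pizza", "i love monday"]
for doc, row in zip(docs, tfidf_rows(docs)):
    print(doc, {term: round(w, 3) for term, w in row.items()})
```

A term that appears in every document (here "i") receives the lowest weight in its row, while a term unique to one document receives the highest.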

In [9]:
X_train, X_test, y_train, y_test = train_test_split(X,Y, test_size=0.2, random_state= 0)

In [10]:
clf= LinearSVC()
clf.fit(X_train, y_train)

LinearSVC()
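
In the cells above the vectorizer is fitted on all rows before the split, so the document frequencies of the held-out tweets influence the features used for training. One possible variation, shown here as a sketch rather than as the notebook's method, is to split the raw text first and wrap the vectorizer and classifier in a scikit-learn pipeline, so that TF-IDF is fitted on the training part only. The toy `texts` and `labels` are hypothetical placeholders, and `stratify=labels` is an optional addition that keeps the class ratio equal in both parts.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data standing in for the cleaned tweets and their labels.
texts = [
    "i love this sunny day", "great coffee this morning",
    "thanks for the lovely gift", "happy monday everyone",
    "enjoying a quiet evening", "offensive placeholder text one",
    "offensive placeholder text two", "another offensive placeholder",
    "yet another offensive placeholder", "offensive placeholder again",
]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

# Split the raw text first, keeping the class ratio in both parts.
texts_train, texts_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=0, stratify=labels
)

# The pipeline fits TF-IDF on the training texts only, then trains the SVM.
model = make_pipeline(TfidfVectorizer(max_features=5000), LinearSVC())
model.fit(texts_train, y_train)

predictions = model.predict(texts_test)
print(list(zip(texts_test, predictions)))
```

A fitted pipeline also accepts raw strings directly, so new text does not need a separate `transform` call before `predict`.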

### Accuracy of the trained model

In [11]:
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.97      0.99      0.98      5985
           1       0.81      0.54      0.65       408

    accuracy                           0.96      6393
   macro avg       0.89      0.77      0.82      6393
weighted avg       0.96      0.96      0.96      6393
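
To read the report, it helps to see how the per-class columns are derived. The sketch below computes precision, recall and F1 from true and predicted labels in plain Python; the function name and the toy label lists are assumptions made up for illustration. The last lines use only the support counts printed in the report to show what accuracy a model would reach by always predicting class 0.

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall and F1 for the class given by `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical toy labels, not taken from the dataset.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]
print(precision_recall_f1(y_true, y_pred))

# Support counts from the report: 5985 tweets labelled 0, 408 labelled 1.
baseline_accuracy = 5985 / (5985 + 408)
print(round(baseline_accuracy, 3))  # 0.936
```

Since always answering 0 would already score about 0.936 on this split, the columns for class 1 say more about the model than the overall accuracy does.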



### Overall accuracy is 96%, but for the hate speech class (label 1) precision is 0.81 and recall only 0.54, so the model misses almost half of the racist or sexist comments in the test split. Below we apply the model to a few example phrases.

In [45]:
z= "I hate black people" # 1 represents a racist/sexist comment
z=get_clean(z)
vec= tf_idf.transform([z])
vec.shape
clf.predict(vec)

array([1], dtype=int64)

In [44]:
z= "I hate pizza" # 0 represents a non-racist/sexist comment
z=get_clean(z)
vec= tf_idf.transform([z])
vec.shape
clf.predict(vec)

array([0], dtype=int64)

# Testing

In [14]:
test = pd.read_csv(r"C:\Users\Nauel\Desktop\Python\Sentiment analysis\test.csv", on_bad_lines="skip")  # skip malformed rows (pandas >= 1.3; replaces the deprecated error_bad_lines=False)
test

Unnamed: 0,id,tweet
0,31963,#studiolife #aislife #requires #passion #dedic...
1,31964,@user #white #supremacists want everyone to s...
2,31965,safe ways to heal your #acne!! #altwaystohe...
3,31966,is the hp and the cursed child book up for res...
4,31967,"3rd #bihday to my amazing, hilarious #nephew..."
...,...,...
17192,49155,thought factory: left-right polarisation! #tru...
17193,49156,feeling like a mermaid ð #hairflip #neverre...
17194,49157,#hillary #campaigned today in #ohio((omg)) &am...
17195,49158,"happy, at work conference: right mindset leads..."


In [16]:
test["tweet"] = test["tweet"].apply(get_clean)  # takes around one minute
test.head()

Unnamed: 0,id,tweet
0,31963,studiolife aislife requires passion dedication...
1,31964,user white supremacists want everyone to see t...
2,31965,safe ways to heal your acne altwaystoheal heal...
3,31966,is the horsepower and the cursed child book up...
4,31967,3rd bihday to my amazing hilarious nephew eli ...


In [19]:
vec= tf_idf.transform(test["tweet"])
vec.shape
predictions=clf.predict(vec)

In [43]:
total = len(predictions)                  # number of tweets in the test set
pos = np.count_nonzero(predictions == 1)  # tweets predicted as hate speech
perc = round(pos / total, 3)
print(f"Hate speech comments represent {perc:.1%} of the total")

Hate speech comments represent 4.9% of the total
