# Hate Speech Detection With Python
https://copyassignment.com/hate-speech-detection/

## Understanding the data
The dataset for building our hate speech detection model is available on www.kaggle.com. The dataset consists of Twitter hate speech detection data, used to research hate-speech detection. The text in the data is classified as hate speech, offensive language, and neither. Due to the nature of the study, it’s important to note that this dataset contains text that can be considered racist, sexist, homophobic, or generally offensive.

You can find the dataset for hate speech detection here https://www.kaggle.com/datasets/mrmorj/hate-speech-and-offensive-language-dataset

There are 7 columns in the hate speech detection dataset. They are index, count, hate_speech, offensive_language, neither, class and tweet. The description of the column is as follows.

- index – This column has the index value
- count– It has the number of users who coded each tweet
- hate_speech – This column has the number of users who judged the tweet to be hate speech
- offensive_language – It has the number of users who judged the tweet to be offensive
- neither – This has the number of users who judged the tweet to be neither offensive nor non-offensive
- class – it has a class label for the majority of the users, in which 0 denotes hate speech, 1 means offensive language and 2 denotes neither of them.
- tweet – This column has the text tweet.

In [51]:
import pandas as pd
import numpy as np
from sklearn. feature_extraction. text import CountVectorizer
from sklearn. feature_extraction. text import TfidfVectorizer
from sklearn. model_selection import train_test_split
from sklearn. tree import DecisionTreeClassifier

In [52]:
import nltk
import re
nltk. download('stopwords')
from nltk. corpus import stopwords
stopword=set(stopwords.words('english'))
stemmer = nltk. SnowballStemmer("english")

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\范宏瑞\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [53]:
data = pd. read_csv("labeled_data.csv")
#To preview the data
print(data. head())

   Unnamed: 0  count  hate_speech  offensive_language  neither  class  \
0           0      3            0                   0        3      2   
1           1      3            0                   3        0      1   
2           2      3            0                   3        0      1   
3           3      3            0                   2        1      1   
4           4      6            0                   6        0      1   

                                               tweet  
0  !!! RT @mayasolovely: As a woman you shouldn't...  
1  !!!!! RT @mleew17: boy dats cold...tyga dwn ba...  
2  !!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby...  
3  !!!!!!!!! RT @C_G_Anderson: @viva_based she lo...  
4  !!!!!!!!!!!!! RT @ShenikaRoberts: The shit you...  


## Preprocessing the data
In Data preprocessing, we prepare the raw data and make it suitable for a machine learning model. It is the first and crucial step while creating a machine learning model. When creating a machine learning project, it is not always a case that we come across clean and formatted data. And while doing any operation with data, it is mandatory to clean it and put it in a formatted way. So for this, we use the data preprocessing task.

In [54]:
data["labels"] = data["class"]. map({0: "Hate Speech", 1: "Offensive Speech", 2: "No Hate and Offensive Speech"})
data = data[["tweet", "labels"]]
print(data. head())

                                               tweet  \
0  !!! RT @mayasolovely: As a woman you shouldn't...   
1  !!!!! RT @mleew17: boy dats cold...tyga dwn ba...   
2  !!!!!!! RT @UrKindOfBrand Dawg!!!! RT @80sbaby...   
3  !!!!!!!!! RT @C_G_Anderson: @viva_based she lo...   
4  !!!!!!!!!!!!! RT @ShenikaRoberts: The shit you...   

                         labels  
0  No Hate and Offensive Speech  
1              Offensive Speech  
2              Offensive Speech  
3              Offensive Speech  
4              Offensive Speech  


In [55]:
import string
def clean (text):
    text = str (text). lower()
    text = re. sub('[.?]', '', text) 
    text = re. sub('https?://\S+|www.\S+', '', text)
    text = re. sub('<.?>+', '', text)
    text = re. sub('[%s]' % re. escape(string. punctuation), '', text)
    text = re. sub('\n', '', text)
    text = re. sub('\w\d\w', '', text)
    text = [word for word in text.split(' ') if word not in stopword]
    text=" ". join(text)
    text = [stemmer. stem(word) for word in text. split(' ')]
    text=" ". join(text)
    return text

data["tweet"] = data["tweet"]. apply(clean)

In [56]:
data.labels.value_counts()

Offensive Speech                19190
No Hate and Offensive Speech     4163
Hate Speech                      1430
Name: labels, dtype: int64

这是一个不平衡分类问题

## Splitting the data
The next important step is to explore the dataset and divide the dataset into training and testing data.

NLP三种词袋模型CountVectorizer/TFIDF/HashVectorizer
https://zhuanlan.zhihu.com/p/268886634

In [57]:
x = np. array(data["tweet"])
y = np. array(data["labels"])

In [58]:
cv = CountVectorizer()
X = cv. fit_transform(x)
# Splitting the Data
#X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)

#StratifiedKFold 分层k折
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=10)
skf.get_n_splits(X, y)


10

In [59]:
for train_index, test_index in skf.split(X, y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

TRAIN: [ 2411  2412  2413 ... 24780 24781 24782] TEST: [   0    1    2 ... 2680 2684 2687]
TRAIN: [    0     1     2 ... 24780 24781 24782] TEST: [2411 2412 2413 ... 5098 5100 5101]
TRAIN: [    0     1     2 ... 24780 24781 24782] TEST: [4015 4030 4036 ... 7734 7735 7736]
TRAIN: [    0     1     2 ... 24780 24781 24782] TEST: [ 5407  5417  5442 ... 10228 10229 10230]
TRAIN: [    0     1     2 ... 24780 24781 24782] TEST: [ 6712  6713  6732 ... 12574 12575 12576]
TRAIN: [    0     1     2 ... 24780 24781 24782] TEST: [ 8991  9007  9011 ... 15044 15045 15047]
TRAIN: [    0     1     2 ... 24780 24781 24782] TEST: [12026 12038 12052 ... 17500 17502 17504]
TRAIN: [    0     1     2 ... 24780 24781 24782] TEST: [14984 14985 15009 ... 19942 19946 19948]
TRAIN: [    0     1     2 ... 24780 24781 24782] TEST: [18477 18483 18499 ... 22391 22392 22393]
TRAIN: [    0     1     2 ... 22391 22392 22393] TEST: [21784 21837 21843 ... 24780 24781 24782]


In [60]:
tfv = TfidfVectorizer()
X2 = tfv. fit_transform(x)
#X2_train, X2_test, y2_train, y2_test = train_test_split(X2, y, test_size=0.33, random_state=42)

#StratifiedKFold 分层k折
skf = StratifiedKFold(n_splits=10)
skf.get_n_splits(X2, y)
for train_index, test_index in skf.split(X2, y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X2_train, X2_test = X2[train_index], X2[test_index]
    y2_train, y2_test = y[train_index], y[test_index]

TRAIN: [ 2411  2412  2413 ... 24780 24781 24782] TEST: [   0    1    2 ... 2680 2684 2687]
TRAIN: [    0     1     2 ... 24780 24781 24782] TEST: [2411 2412 2413 ... 5098 5100 5101]
TRAIN: [    0     1     2 ... 24780 24781 24782] TEST: [4015 4030 4036 ... 7734 7735 7736]
TRAIN: [    0     1     2 ... 24780 24781 24782] TEST: [ 5407  5417  5442 ... 10228 10229 10230]
TRAIN: [    0     1     2 ... 24780 24781 24782] TEST: [ 6712  6713  6732 ... 12574 12575 12576]
TRAIN: [    0     1     2 ... 24780 24781 24782] TEST: [ 8991  9007  9011 ... 15044 15045 15047]
TRAIN: [    0     1     2 ... 24780 24781 24782] TEST: [12026 12038 12052 ... 17500 17502 17504]
TRAIN: [    0     1     2 ... 24780 24781 24782] TEST: [14984 14985 15009 ... 19942 19946 19948]
TRAIN: [    0     1     2 ... 24780 24781 24782] TEST: [18477 18483 18499 ... 22391 22392 22393]
TRAIN: [    0     1     2 ... 22391 22392 22393] TEST: [21784 21837 21843 ... 24780 24781 24782]


## Building the model
After segregating the data, our next work is to find a good algorithm suited for our model. We can use a Decision tree classifier for building the Hate Speech detection project. Decision Trees are a type of Supervised Machine Learning used mainly for classification problems.

In [61]:
#Model building
model = DecisionTreeClassifier()
#Training the model
model. fit(X_train,y_train)

In [62]:
model2 = DecisionTreeClassifier()
#Training the model
model2. fit(X2_train,y2_train)

## Evaluating the results
The final step in machine learning model building is prediction. In this step, we can measure how well our model performs for the test input.

In [63]:
#Testing the model
y_pred = model. predict (X_test)
y_pred

array(['Offensive Speech', 'Offensive Speech', 'Offensive Speech', ...,
       'Offensive Speech', 'Offensive Speech',
       'No Hate and Offensive Speech'], dtype=object)

In [64]:
#Testing the model
y2_pred = model2. predict (X2_test)
y2_pred

array(['Offensive Speech', 'Offensive Speech', 'Offensive Speech', ...,
       'Offensive Speech', 'Offensive Speech',
       'No Hate and Offensive Speech'], dtype=object)

In [65]:
#Accuracy Score of our model
from sklearn. metrics import accuracy_score, f1_score, recall_score
print (accuracy_score (y_test,y_pred))
print (f1_score (y_test,y_pred, average='micro'))
print (recall_score (y_test,y_pred, average='micro'))

0.8829701372074253
0.8829701372074253
0.8829701372074253


### skmetrics输出acc、precision、recall、f1值相同的问题
average='micro'的原理是：
把每个类别的TP、FP、FN先相加，再把这个问题当成二分类来进行计算

在某一类中被判断成FP的样本，在其他类中一定是FN的样本

解决方法的话就是换一种平均的方法average = 'macro’

这种方法是对于不同的类分别计算评估指标，然后加起来求平均
https://blog.csdn.net/fujikoo/article/details/119926390

In [66]:
#Accuracy Score of our model
from sklearn. metrics import accuracy_score, f1_score, recall_score
print (accuracy_score (y_test,y_pred))
print (f1_score (y_test,y_pred, average='macro'))
print (recall_score (y_test,y_pred, average='macro'))

0.8829701372074253
0.6993934215587227
0.6893382531572995


In [67]:
#Accuracy Score of our model2
from sklearn. metrics import accuracy_score
print (accuracy_score (y2_test,y2_pred))
print (f1_score (y2_test,y2_pred, average='macro'))
print (recall_score (y2_test,y2_pred, average='macro'))

0.8829701372074253
0.6835371701368548
0.6733350080959513


In [68]:
#Predicting the outcome
inp = "You are too bad and I dont like your attitude"
inp = cv.transform([inp]).toarray()
print(model.predict(inp))

['No Hate and Offensive Speech']


## Conclusion
In this article, we have built a project for Hate Speech detection using Machine Learning. Hate speech is one of the serious issues we see on social media platforms like Facebook and Twitter. Hope you enjoyed this article by building a project to detect hate speech with Python.

# Use Russia-Ukraine Dataset

In [17]:
#"H:\课程\毕业论文\cleaned\tweet_ids_day_2022-2-22_clean.csv"
df = pd.read_csv("H:/课程/毕业论文/cleaned/tweet_ids_day_2022-2-22_clean.csv")

In [19]:
#model1: cv+ decision tree
def hate_detection_1(text):
    text = clean(text)
    inp = cv.transform([text]).toarray()
    result = model.predict(inp)
    return result[0]
    

In [23]:
#model1: tf-idf+ decision tree
def hate_detection_2(text):
    text = clean(text)
    inp = tfv.transform([text]).toarray()
    result = model2.predict(inp)
    return result[0]
    

In [20]:
df['is_Hate_m1'] = df['Tweet_content'].apply(hate_detection_1) 

In [24]:
df['is_Hate_m2'] = df['Tweet_content'].apply(hate_detection_2) 

In [22]:
df.is_Hate_m1.value_counts()

No Hate and Offensive Speech    638
Hate Speech                      60
Offensive Speech                 44
Name: is_Hate_m1, dtype: int64

In [25]:
df.is_Hate_m2.value_counts()

No Hate and Offensive Speech    656
Offensive Speech                 75
Hate Speech                      11
Name: is_Hate_m2, dtype: int64

In [32]:
df[df['is_Hate_m1']=="Hate Speech"].Tweet_content.iloc[0]

'russian troops enter eastern ukraine russia ukraine ukraineconflict ukraina war russiaukraine russiainvadedukraine ukrainecrisis biden putinswar russiaukrainecrisis russiaucraina ukrainerussiacrisis'

In [78]:
df[df['is_Hate_m2']=="Hate Speech"].Tweet_content.iloc[10]

'the whole world: mr. putin what you are doing is wrong. putin: ? ukraina russiaukraineconflict'

# 尝试换一下模型

In [34]:
from sklearn.naive_bayes import MultinomialNB
#让我们从朴素的贝叶斯分类器开始，它为该任务提供了一个很好的基准。 scikit-learn包含此分类器的多种变体； 多项式最适合单词计数：
clf = MultinomialNB().fit(X2_train,y2_train)

In [36]:
predicted = clf.predict(X2_test)

In [38]:
print (accuracy_score (y2_test,predicted))
print (f1_score (y2_test,predicted, average='macro'))
print (recall_score (y2_test,predicted, average='macro'))

0.7901944002934344
0.3516003992102368
0.3650300415199313


In [45]:
x = np. array(data["tweet"])
y = np. array(data["labels"])
# Splitting the Data
x3_train, x3_test, y3_train, y3_test = train_test_split(x, y, test_size=0.33, random_state=42)

In [46]:
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.pipeline import Pipeline
text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', MultinomialNB()),
])

In [47]:
text_clf.fit(x3_train,y3_train)

In [50]:
#线性支持向量机（SVM）
from sklearn.linear_model import SGDClassifier
text_clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier(loss='hinge', penalty='l2',
                          alpha=1e-3, random_state=42,
                          max_iter=5, tol=None)),
])

text_clf.fit(x3_train, y3_train)
y3_pred = text_clf.predict(x3_test)
print (accuracy_score (y3_test,y3_pred))
print (f1_score (y3_test,y3_pred, average='macro'))
print (recall_score (y3_test,y3_pred, average='macro'))

0.8141582100501283
0.43542201045292966
0.4214843275280216


换了以后还不如之前的呢……