### 1. Multilingual and Multi-Aspect (MLMA) Hate Speech

Dataset Classes (class, aspect(category)):

    class - contains (offensive, hateful and normal)
    aspect(category) - (Origin/Nationality, Gender/Sexual, Disability, Religion) 

Source: https://github.com/HKUST-KnowComp/MLMA_hate_speech/blob/master/hate_speech_mlma.zip

In [7]:
import pandas as pd

# Raw string so the Windows backslashes are not read as escape sequences
df1 = pd.read_csv(r"G:\My Drive\Hate Speech_Multilingual\datasets\Final Datasets\#1 MLMA_hate_speech\hate_speech_mlma\en_dataset_with_stop_words.csv")
df1.head()

Unnamed: 0,HITId,tweet,sentiment,directness,annotator_sentiment,target,group
0,0,If America had another 8 years of Obama's ideo...,fearful_abusive_hateful_disrespectful_normal,indirect,anger_fear_shock_sadness_disgust,origin,other
1,1,Most Canadians have never met seen or associat...,offensive,indirect,sadness_indifference,disability,special_needs
2,2,Hahaha grow up faggot @URL,offensive,indirect,shock_disgust,sexual_orientation,women
3,3,@user queue is fucking retarded it makes every...,offensive_hateful,direct,shock_disgust,disability,special_needs
4,4,@user Que ce ne soit pas des Burundais refugie...,hateful_normal,indirect,shock_disgust,origin,other


In [8]:
df1.columns

Index(['HITId', 'tweet', 'sentiment', 'directness', 'annotator_sentiment',
       'target', 'group'],
      dtype='object')

In [9]:
df1 = df1[['tweet', 'sentiment', 'target']]
df1.columns = ['text', 'class', 'aspect(category)']
df1.head()

Unnamed: 0,text,class,aspect(category)
0,If America had another 8 years of Obama's ideo...,fearful_abusive_hateful_disrespectful_normal,origin
1,Most Canadians have never met seen or associat...,offensive,disability
2,Hahaha grow up faggot @URL,offensive,sexual_orientation
3,@user queue is fucking retarded it makes every...,offensive_hateful,disability
4,@user Que ce ne soit pas des Burundais refugie...,hateful_normal,origin


In [10]:
class_counts = df1['class'].value_counts()
# Composite labels such as 'offensive_hateful' count as hateful when they contain 'hateful'
hateful_labels = [class_name for class_name in class_counts.index if 'hateful' in class_name]

# Keep only 'normal' rows and rows whose label contains 'hateful';
# .copy() makes the subset independent, which avoids SettingWithCopyWarning
df1 = df1[(df1['class'] == 'normal') | (df1['class'].isin(hateful_labels))].copy()
df1['class'] = df1['class'].apply(lambda x: 'hateful' if x != 'normal' else x)

df1['class'].value_counts()


class
hateful    1278
normal      661
Name: count, dtype: int64
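
The relabeling rule in the cell above can be restated as a small standalone function, which makes its edge cases easier to see. This is only a sketch: `collapse_sentiment` is a hypothetical helper name, not part of the notebook, and the example labels are illustrative. Note that a composite label such as `hateful_normal` ends up as `hateful`, while a label without the substring `hateful` (for example `offensive`) is dropped.

```python
def collapse_sentiment(label):
    """Map a raw sentiment label to 'hateful', 'normal', or None (row is dropped).

    Hypothetical helper that mirrors the substring rule used in the notebook.
    """
    if label == 'normal':
        return 'normal'
    if 'hateful' in label:
        return 'hateful'
    return None  # e.g. 'offensive' alone is neither kept as normal nor as hateful


# Illustrative labels, not an exhaustive list of what the dataset contains
examples = ['normal', 'offensive_hateful', 'hateful_normal', 'offensive']
collapsed = {label: collapse_sentiment(label) for label in examples}
print(collapsed)
```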

In [11]:
# Normal rows carry no aspect, so blank it out
df1.loc[df1['class'] == 'normal', 'aspect(category)'] = ' '
df1.head()

Unnamed: 0,text,class,aspect(category)
0,If America had another 8 years of Obama's ideo...,hateful,origin
3,@user queue is fucking retarded it makes every...,hateful,disability
4,@user Que ce ne soit pas des Burundais refugie...,hateful,origin
7,@user @user Btw. Are we now allowed to say \sh...,normal,
9,@user Still a bitter cunt. Why so much interes...,hateful,gender


In [12]:
df1['aspect(category)'].value_counts()

aspect(category)
                      661
origin                616
disability            214
other                 156
sexual_orientation    142
gender                131
religion               19
Name: count, dtype: int64

In [13]:
# Values absent from the mapping ('other' and the blank ' ' used for normal rows) become NaN
df1['aspect(category)'] = df1['aspect(category)'].map({
    'origin': 'Origin/Nationality',
    'religion': 'Religion',
    'disability': 'Disability',
    'sexual_orientation': 'Gender/Sexual',
    'gender': 'Gender/Sexual'
})
df1.head()

Unnamed: 0,text,class,aspect(category)
0,If America had another 8 years of Obama's ideo...,hateful,Origin/Nationality
3,@user queue is fucking retarded it makes every...,hateful,Disability
4,@user Que ce ne soit pas des Burundais refugie...,hateful,Origin/Nationality
7,@user @user Btw. Are we now allowed to say \sh...,normal,
9,@user Still a bitter cunt. Why so much interes...,hateful,Gender/Sexual


In [14]:
df1['aspect(category)'].value_counts()

aspect(category)
Origin/Nationality    616
Gender/Sexual         273
Disability            214
Religion               19
Name: count, dtype: int64
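
`Series.map` with a dictionary returns NaN for every value that is not a key, which is why `other` and the blank aspect no longer appear in the counts above (`value_counts` skips NaN by default). A minimal sketch on toy data shows how to list the values a mapping leaves unmapped; the series below is illustrative and `ASPECT_MAP` is just a local name for the same dictionary.

```python
import pandas as pd

# Same mapping as in the notebook, bound to a name for reuse
ASPECT_MAP = {
    'origin': 'Origin/Nationality',
    'religion': 'Religion',
    'disability': 'Disability',
    'sexual_orientation': 'Gender/Sexual',
    'gender': 'Gender/Sexual',
}

# Toy input, not taken from the dataset
aspects = pd.Series(['origin', 'other', 'gender', ' ', 'religion'])
mapped = aspects.map(ASPECT_MAP)

# Values that were not dictionary keys come back as NaN
unmapped = sorted(aspects[mapped.isna()].unique())
print(unmapped)
```

Running a check like this before overwriting the column makes it explicit which categories are about to be discarded.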

In [15]:
df1['class'].value_counts()

class
hateful    1278
normal      661
Name: count, dtype: int64

In [16]:
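from langdetect import detect, DetectorFactory
from langdetect.lang_detect_exception import LangDetectException

# langdetect is non-deterministic by default; fixing the seed makes results reproducible
DetectorFactory.seed = 0

def detect_language(text):
    """Return the detected language code, or None if detection fails."""
    try:
        return detect(text)
    except LangDetectException:
        return None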

In [17]:
df1['detected_language'] = df1['text'].apply(detect_language)

In [18]:
df1['detected_language'].value_counts()

detected_language
en    1703
da      40
no      38
fr      29
ca      21
af      18
cy      13
tl      12
sv      11
it      10
vi       9
id       6
so       6
de       6
pt       4
et       3
hr       2
pl       2
ro       1
sq       1
lt       1
sw       1
nl       1
es       1
Name: count, dtype: int64

In [20]:
# Total number of rows whose detected language is not 'en'
df1[df1['detected_language'] != 'en']['detected_language'].value_counts().sum()

236

In [21]:
df1 = df1[df1['detected_language'] == 'en']

### 2. Cyberbully Detection Dataset

Dataset Classes (class, aspect(category)):

    class - derived from label (hateful, normal)
    aspect(category) - (Race/Ethnicity, Gender/Sexual, Religion)

Source: https://www.kaggle.com/datasets/momo12341234/cyberbully-detection-dataset

In [22]:
import pandas as pd

# Raw string so the Windows backslashes are not read as escape sequences
df2 = pd.read_csv(r"G:\My Drive\Hate Speech_Multilingual\datasets\Final Datasets\#2 Cyberbully Detection Dataset\cb_multi_labeled_balanced.csv")
df2.head()

Unnamed: 0,text,label
0,@ZubearSays Any real nigga isn't letting this ...,ethnicity/race
1,@MoradoSkittle @prolifejewess @DAConsult @Kell...,not_cyberbullying
2,"the only thing i wish, i wish a nigga would",ethnicity/race
3,You saudias are not friends of Muslim idiots c...,religion
4,@JaydenT2399 @TractorLaw @holmes_gael @erconge...,religion


In [23]:
df2['label'].value_counts()

label
not_cyberbullying    50000
ethnicity/race       17000
gender/sexual        17000
religion             15990
Name: count, dtype: int64

In [24]:
df2.columns = ['text', 'aspect(category)']

df2['aspect(category)'] = df2['aspect(category)'].map({
    'ethnicity/race': 'Race/Ethnicity',
    'religion': 'Religion',
    'gender/sexual': 'Gender/Sexual',
    'not_cyberbullying': 'normal'
})

df2['aspect(category)'].value_counts()

aspect(category)
normal            50000
Race/Ethnicity    17000
Gender/Sexual     17000
Religion          15990
Name: count, dtype: int64

In [25]:
df2['class'] = 'hateful'

df2.loc[df2['aspect(category)'] == 'normal', 'class'] = 'normal'
df2.loc[df2['aspect(category)'] == 'normal', 'aspect(category)'] = ' '

df2 = df2[['text', 'class', 'aspect(category)']]
df2.head()

Unnamed: 0,text,class,aspect(category)
0,@ZubearSays Any real nigga isn't letting this ...,hateful,Race/Ethnicity
1,@MoradoSkittle @prolifejewess @DAConsult @Kell...,normal,
2,"the only thing i wish, i wish a nigga would",hateful,Race/Ethnicity
3,You saudias are not friends of Muslim idiots c...,hateful,Religion
4,@JaydenT2399 @TractorLaw @holmes_gael @erconge...,hateful,Religion


In [26]:
df2['class'].value_counts()

class
normal     50000
hateful    49990
Name: count, dtype: int64

In [27]:
df2['aspect(category)'].value_counts()

aspect(category)
                  50000
Race/Ethnicity    17000
Gender/Sexual     17000
Religion          15990
Name: count, dtype: int64

-------------

## Merging of datasets

In [28]:
df = pd.concat([df1, df2])
# Rows without a mapped aspect (NaN) get the same blank placeholder as normal rows
df['aspect(category)'] = df['aspect(category)'].fillna(' ')
df.head()

Unnamed: 0,text,class,aspect(category),detected_language
0,If America had another 8 years of Obama's ideo...,hateful,Origin/Nationality,en
3,@user queue is fucking retarded it makes every...,hateful,Disability,en
7,@user @user Btw. Are we now allowed to say \sh...,normal,,en
9,@user Still a bitter cunt. Why so much interes...,hateful,Gender/Sexual,en
11,children = 52% of refugees,hateful,Gender/Sexual,en
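
`pd.concat` keeps each frame's original index by default, so the merged frame contains repeated index labels (label 0 exists in both sources). A minimal sketch on toy frames shows the difference `ignore_index=True` makes; the frames and their contents are illustrative only, and whether to renumber is a design choice the notebook does not make.

```python
import pandas as pd

# Toy frames standing in for the two datasets
a = pd.DataFrame({'text': ['x', 'y']}, index=[0, 3])
b = pd.DataFrame({'text': ['z']}, index=[0])

kept = pd.concat([a, b])                      # original labels preserved
reset = pd.concat([a, b], ignore_index=True)  # fresh 0..n-1 labels

print(list(kept.index), kept.index.is_unique)
print(list(reset.index), reset.index.is_unique)
```

Duplicate labels are harmless for `value_counts` and `to_csv(index=False)`, but label-based selection such as `df.loc[0]` then returns more than one row.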


In [29]:
df['class'].value_counts()

class
hateful    51108
normal     50585
Name: count, dtype: int64

In [30]:
len(df[(df['class'] == 'hateful') & (df['aspect(category)'] == ' ')])

136

In [31]:
df[(df['class'] == 'hateful') & (df['aspect(category)'] == ' ')]

Unnamed: 0,text,class,aspect(category),detected_language
23,man i leave for one week and the first twat to...,hateful,,en
57,@user .... What no Roofe pen either?\n\nFuck o...,hateful,,en
77,It's always the Liberals that threaten that \I...,hateful,,en
132,Keep in mind I’m a negro/so my open mind got a...,hateful,,en
170,I'm sick of seeing your face on my news feed y...,hateful,,en
...,...,...,...,...
5555,Kevin Gates for President all the way retarded,hateful,,en
5573,People who insult Asians with \Ching Chong chi...,hateful,,en
5601,Lmfaooo fuckin ivan whata retard @URL,hateful,,en
5624,@user Where is Aryan Nations? In 2002 they los...,hateful,,en


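Note: these rows are labeled hateful but have a blank aspect. They appear to be the MLMA rows whose target was `other`, which the aspect mapping above leaves unmapped.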

In [32]:
df['aspect(category)'].value_counts()

aspect(category)
                      50721
Gender/Sexual         17228
Race/Ethnicity        17000
Religion              16007
Origin/Nationality      554
Disability              183
Name: count, dtype: int64

In [33]:
df.drop_duplicates(subset=['class', 'aspect(category)'])[['text', 'class', 'aspect(category)']]

Unnamed: 0,text,class,aspect(category)
0,If America had another 8 years of Obama's ideo...,hateful,Origin/Nationality
3,@user queue is fucking retarded it makes every...,hateful,Disability
7,@user @user Btw. Are we now allowed to say \sh...,normal,
9,@user Still a bitter cunt. Why so much interes...,hateful,Gender/Sexual
23,man i leave for one week and the first twat to...,hateful,
566,@user True. Christians may be our retarded cou...,hateful,Religion
0,@ZubearSays Any real nigga isn't letting this ...,hateful,Race/Ethnicity


In [34]:
df['class'] = df['class'].map({'normal': 0, 'hateful': 1})

In [35]:
# One-hot encode the aspect column as 0/1 integers
df_one_hot = pd.get_dummies(df, columns=['aspect(category)'], prefix='', prefix_sep='', dtype=int)

# The blank placeholder means "no aspect", so it does not get its own column
df_one_hot = df_one_hot.drop(columns=' ')
df = df_one_hot.copy()
df_one_hot.head()

Unnamed: 0,text,class,detected_language,Disability,Gender/Sexual,Origin/Nationality,Race/Ethnicity,Religion
0,If America had another 8 years of Obama's ideo...,1,en,0,0,1,0,0
3,@user queue is fucking retarded it makes every...,1,en,1,0,0,0,0
7,@user @user Btw. Are we now allowed to say \sh...,0,en,0,0,0,0,0
9,@user Still a bitter cunt. Why so much interes...,1,en,0,1,0,0,0
11,children = 52% of refugees,1,en,0,1,0,0,0
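
Because the blank placeholder column is dropped, a row with no aspect ends up with all aspect flags set to 0, and every other row should have exactly one flag set. A minimal sketch of that sanity check on toy data follows; the frame is illustrative and `flags_per_row` is a hypothetical variable name.

```python
import pandas as pd

# Toy frame using the same column names as the merged dataset
toy = pd.DataFrame({
    'class': [1, 0, 1],
    'aspect(category)': ['Religion', ' ', 'Disability'],
})

encoded = pd.get_dummies(
    toy, columns=['aspect(category)'], prefix='', prefix_sep='', dtype=int
).drop(columns=' ')

# Every column except the label is an aspect flag
aspect_cols = [c for c in encoded.columns if c != 'class']
flags_per_row = encoded[aspect_cols].sum(axis=1)

print(aspect_cols)
print(list(flags_per_row))
```

On the real data, rows where `class` is 1 but `flags_per_row` is 0 correspond to hateful rows without an assigned aspect.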


In [36]:
# Raw string so the Windows backslashes are not read as escape sequences
df.to_csv(r"G:\My Drive\Hate Speech_Multilingual\Code\Dataset Statistics\dataset\english_curated(multi).csv", index=False)

In [43]:
df.to_csv("english_curated(multi).csv", index=False)