
## **Hate Speech Sentiment Analysis**
Created By Putri Apriyanti Windya 
 (DS0124 - Data Scientist 01)

# **Dataset**

---
The dataset for this classification was obtained from https://raw.githubusercontent.com/ialfina/id-hatespeech-detection/master/IDHSD_RIO_unbalanced_713_2017.txt

# **Description**

---

The Dataset for Hate Speech Detection in Indonesian
(Dataset untuk Deteksi Ujaran Kebencian dalam Bahasa Indonesia)

Dataset
The dataset has two columns, label and tweet, and consists of 713 tweets in Indonesian.
The label is either Non_HS or HS: Non_HS marks a non-hate-speech tweet and HS marks a hate-speech tweet.

Number of Non_HS tweets: 453
Number of HS tweets: 260
Since this dataset is unbalanced, you might have to do over-sampling/down-sampling in order to create a balanced dataset.
The dataset may be used freely, but if you publish a paper using the dataset, please cite this publication:

Ika Alfina, Rio Mulia, Mohamad Ivan Fanany, and Yudo Ekanata, "Hate Speech Detection in Indonesian Language: A Dataset and Preliminary Study", in Proceedings of the 9th International Conference on Advanced Computer Science and Information Systems (ICACSIS 2017), 2017.

# **Problem to Solve**

---

Perform sentiment analysis to determine whether a tweet is hate speech or non-hate speech.

# **Data Preparation**

## **Data Exploration**

**Import All Libraries Needed for Data Preparation**

In [1]:
# install library for indonesian language stemming
!pip install Sastrawi



In [2]:
# Import libraries
# Text preprocessing
import io
import re  # regular expressions
import string  # string.punctuation, used during stop word removal

import numpy as np
import pandas as pd
import requests
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory  # Indonesian stemming
from Sastrawi.StopWordRemover.StopWordRemoverFactory import StopWordRemoverFactory  # Indonesian stop words

# Visualization
import seaborn as sns
import matplotlib.pyplot as plt

# Vectorization, splitting, and evaluation
from sklearn.feature_extraction.text import CountVectorizer  # bag of words
from sklearn.feature_extraction.text import TfidfVectorizer  # TF-IDF vectors
from sklearn.metrics import confusion_matrix, classification_report, accuracy_score  # evaluation metrics
from sklearn.preprocessing import LabelEncoder  # convert class names to integers
from sklearn.model_selection import train_test_split  # train/test split

In [3]:
# Get Data from Github
result = requests.get('https://raw.githubusercontent.com/ialfina/id-hatespeech-detection/master/IDHSD_RIO_unbalanced_713_2017.txt')
data = io.StringIO(result.text)

In [4]:
# Convert result into data frame
df_hs = pd.read_csv(data, sep='\t')
df_hs.head()

Unnamed: 0,Label,Tweet
0,Non_HS,RT @spardaxyz: Fadli Zon Minta Mendagri Segera...
1,Non_HS,RT @baguscondromowo: Mereka terus melukai aksi...
2,Non_HS,Sylvi: bagaimana gurbernur melakukan kekerasan...
3,Non_HS,"Ahmad Dhani Tak Puas Debat Pilkada, Masalah Ja..."
4,Non_HS,RT @lisdaulay28: Waspada KTP palsu.....kawal P...


In [5]:
# Count HS and Non_HS label
df_hs.Label.value_counts().to_frame()

Unnamed: 0,Label
Non_HS,453
HS,260
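
The counts above confirm the 453:260 imbalance mentioned in the dataset description. This notebook trains on the data as-is; the block below is only a minimal sketch of random over-sampling, one of the balancing options the description suggests. The function name `random_oversample` and the toy rows are assumptions made for illustration, not part of the original notebook.

```python
import random
from collections import Counter

def random_oversample(rows, label_of, seed=0):
    """Duplicate random minority-class rows until every class matches the largest one."""
    rng = random.Random(seed)
    by_label = {}
    for row in rows:
        by_label.setdefault(label_of(row), []).append(row)
    target = max(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        # sample with replacement to fill the gap up to the majority size
        balanced.extend(rng.choices(group, k=target - len(group)))
    rng.shuffle(balanced)
    return balanced

# Toy rows standing in for (Label, Tweet) pairs with the same 453:260 imbalance
rows = [("Non_HS", f"tweet {i}") for i in range(453)] + [("HS", f"tweet {i}") for i in range(260)]
balanced = random_oversample(rows, label_of=lambda r: r[0])
counts = Counter(label for label, _ in balanced)
print(counts)
```

If over-sampling is used, it is usually applied to the training split only, so that duplicated tweets do not leak into the test data.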


## **Text Cleaning**

**Case folding**

In [6]:
temp_tweet = []

for tw in df_hs['Tweet']:
  # remove @mentions and http(s) links
  tw = re.sub(r"(?:\@|https?\://)\S+", "", tw)

  # remove any remaining links (e.g. https://example.com)
  tw = re.sub(r"http\S+", "", tw)

  # remove newlines
  tw = re.sub('\n', '', tw)

  # remove the retweet marker "RT" (this also strips "RT" inside uppercase words)
  tw = re.sub('RT', '', tw)

  # replace punctuation and digits with spaces (letters, ^ and ' are kept)
  tw = re.sub("[^a-zA-Z^']", " ", tw)
  tw = re.sub(" {2,}", " ", tw)

  # remove leading and trailing whitespace
  tw = tw.strip()

  # collapse runs of whitespace into a single space
  tw = re.sub(r'\s+', ' ', tw)

  # convert text to lowercase
  tw = tw.lower()
  temp_tweet.append(tw)

df_hs['Clean_Tweet'] = temp_tweet
df_hs.head()

Unnamed: 0,Label,Tweet,Clean_Tweet
0,Non_HS,RT @spardaxyz: Fadli Zon Minta Mendagri Segera...,fadli zon minta mendagri segera menonaktifkan ...
1,Non_HS,RT @baguscondromowo: Mereka terus melukai aksi...,mereka terus melukai aksi dalam rangka memenja...
2,Non_HS,Sylvi: bagaimana gurbernur melakukan kekerasan...,sylvi bagaimana gurbernur melakukan kekerasan ...
3,Non_HS,"Ahmad Dhani Tak Puas Debat Pilkada, Masalah Ja...",ahmad dhani tak puas debat pilkada masalah jal...
4,Non_HS,RT @lisdaulay28: Waspada KTP palsu.....kawal P...,waspada ktp palsu kawal pilkada
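
The cleaning steps above can also be wrapped in a single function so the same logic can be reused on new text. The sketch below is a variation, not the exact code of the cell: it removes `RT` only as a whole word and does not keep the `^` character. The function name `clean_tweet` and the sample tweet (including its link) are made up for illustration.

```python
import re

def clean_tweet(text):
    """Apply the cleaning steps to a single tweet and return the result."""
    text = re.sub(r"(?:@|https?://)\S+", "", text)   # drop @mentions and links
    text = re.sub(r"\bRT\b", "", text)               # drop the retweet marker only as a whole word
    text = re.sub(r"[^a-zA-Z']", " ", text)          # punctuation and digits become spaces
    text = re.sub(r"\s+", " ", text).strip()         # collapse whitespace
    return text.lower()

sample = "RT @lisdaulay28: Waspada KTP palsu.....kawal Pilkada https://t.co/abc123"
print(clean_tweet(sample))    # waspada ktp palsu kawal pilkada
print(clean_tweet("PARTAI"))  # partai
```

With the word boundary `\b`, an uppercase word such as "PARTAI" keeps its letters, whereas a plain `'RT'` pattern would turn it into "PAAI" before lowercasing.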


**Stemming**

In [7]:
factory = StemmerFactory()
stemmer = factory.create_stemmer()

def stem(tweet) :
    hasil = stemmer.stem(tweet)
    return hasil

df_hs['Clean_Tweet'] = df_hs.apply(lambda row : stem(row['Clean_Tweet']), axis = 1)

**Stop Word Removal**

In [8]:
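import string  # needed for string.punctuation below

R_factory = StopWordRemoverFactory()
R_stopword = R_factory.create_stop_word_remover()

def R_stopwords(tweet) :
    # strip punctuation, lowercase, then remove Indonesian stop words
    tweet = tweet.translate(str.maketrans('','',string.punctuation)).lower()
    return R_stopword.remove(tweet)

# Note: this line applies stem() again rather than R_stopwords(), so words
# such as "mereka" and "dalam" are still present in Clean_Tweet in the output that follows.
# Swap in R_stopwords here to actually remove stop words.
df_hs['Clean_Tweet'] = df_hs.apply(lambda row : stem(row['Clean_Tweet']), axis = 1)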
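
To make the idea of stop word removal concrete without depending on Sastrawi, here is a minimal sketch that filters tokens against a small set. The `STOP_WORDS` set is a hypothetical, hand-picked list for illustration only; Sastrawi ships its own, longer list, and `remove_stop_words` is not part of its API.

```python
# Hypothetical, hand-picked stop words for illustration; not Sastrawi's list.
STOP_WORDS = {"yang", "dan", "di", "dalam", "mereka", "dengan", "untuk"}

def remove_stop_words(text, stop_words=STOP_WORDS):
    """Drop every whitespace-separated token that appears in stop_words."""
    return " ".join(tok for tok in text.split() if tok not in stop_words)

cleaned = remove_stop_words("mereka terus luka aksi dalam rangka")
print(cleaned)  # terus luka aksi rangka
```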

In [9]:
df_hs.head()

Unnamed: 0,Label,Tweet,Clean_Tweet
0,Non_HS,RT @spardaxyz: Fadli Zon Minta Mendagri Segera...,fadli zon minta mendagri segera nonaktif ahok ...
1,Non_HS,RT @baguscondromowo: Mereka terus melukai aksi...,mereka terus luka aksi dalam rangka penjara ah...
2,Non_HS,Sylvi: bagaimana gurbernur melakukan kekerasan...,sylvi bagaimana gurbernur laku keras perempuan...
3,Non_HS,"Ahmad Dhani Tak Puas Debat Pilkada, Masalah Ja...",ahmad dhani tak puas debat pilkada masalah jal...
4,Non_HS,RT @lisdaulay28: Waspada KTP palsu.....kawal P...,waspada ktp palsu kawal pilkada


## **Vectorization**

**Count Vectorizer**

In [10]:
X = df_hs['Clean_Tweet']
# Count Vectorizer
count_vectorizer = CountVectorizer()
count_vector = count_vectorizer.fit_transform(X)
count_vector.shape

(713, 2291)

In [11]:
# Show Vocabulary
# count_vectorizer.vocabulary_

**TF-IDF Vectorizer**

In [12]:
tfidf_vectorizer = TfidfVectorizer()
tfid_vector = tfidf_vectorizer.fit_transform(X)
tfid_vector.shape 

(713, 2291)

**Vectorizer Conclusion**

I chose TF-IDF over the count vectorizer. Both produce a matrix of the same shape (713 × 2291), but TF-IDF down-weights terms that appear in many tweets, so frequent but uninformative words contribute less than they would with raw counts.
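
To show what the TF-IDF weighting does, here is a small pure-Python sketch. It assumes scikit-learn's default settings (raw term counts, smoothed IDF `ln((1 + n) / (1 + df)) + 1`, L2 normalisation) and simplifies tokenisation to whitespace splitting, so it is an illustration rather than a drop-in replacement for `TfidfVectorizer`. The function name `tfidf` and the three toy documents are assumptions.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document, L2-normalised."""
    tokenised = [doc.split() for doc in docs]
    n_docs = len(tokenised)
    # document frequency: in how many documents each term occurs
    df = Counter(term for tokens in tokenised for term in set(tokens))
    # smoothed inverse document frequency
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    vectors = []
    for tokens in tokenised:
        weights = {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({term: w / norm for term, w in weights.items()})
    return vectors

# Toy documents: "pilkada" occurs in all three, "kawal" in two, "waspada" in one
docs = ["waspada ktp palsu kawal pilkada", "kawal pilkada damai", "pilkada jakarta"]
first = tfidf(docs)[0]
print(first["pilkada"] < first["kawal"] < first["waspada"])  # True
```

Each term occurs once in the first document, so a count vectorizer would weight them equally; TF-IDF ranks them by how rare they are across documents.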

**Label Encoder**

In [13]:
# Encode Target
encoder = LabelEncoder()
tweet_label = encoder.fit_transform(df_hs['Label'])
tweet_label

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
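
`LabelEncoder` assigns integer codes by sorting the class names, which is why the `Non_HS` rows at the top of the dataset appear as `1` above and why class `0` in the later reports corresponds to `HS`. The sketch below reproduces that mapping in plain Python on a toy sample; in the notebook itself, `encoder.classes_` shows the same order.

```python
labels = ["Non_HS", "Non_HS", "HS", "Non_HS", "HS"]  # toy sample of the Label column

classes = sorted(set(labels))                         # sorted class names
to_code = {name: code for code, name in enumerate(classes)}
encoded = [to_code[name] for name in labels]

print(to_code)  # {'HS': 0, 'Non_HS': 1}
print(encoded)  # [1, 1, 0, 1, 0]
```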

# **Data Splitting**

In [14]:
# Set Training and Testing Data (70:30)
X_train, X_test, y_train, y_test = train_test_split(tfid_vector, tweet_label , shuffle = True, test_size=0.3, random_state=11)

# Show the Training and Testing Data
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(499, 2291)
(214, 2291)
(499,)
(214,)
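
The shapes above follow from simple arithmetic: with `test_size=0.3`, scikit-learn rounds the test size up, and the rest goes to training. The sketch below also compares the share of HS tweets in the whole dataset with the share in the test split, using the test support of 81 from the classification reports later in the notebook and assuming class `0` is `HS`.

```python
import math

n_samples, test_size = 713, 0.3
n_test = math.ceil(test_size * n_samples)  # test size is rounded up
n_train = n_samples - n_test

hs_total, hs_test = 260, 81                # HS counts: whole dataset, test split
share_total = round(hs_total / n_samples, 3)
share_test = round(hs_test / n_test, 3)

print(n_train, n_test)          # 499 214
print(share_total, share_test)  # 0.365 0.379
```

The split here is not stratified, so the class shares differ slightly. Passing `stratify=tweet_label` to `train_test_split` is one way to keep them aligned.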


# **Modelling**

## **Support Vector Machine**

In [15]:
from sklearn.svm import SVC
svmc = SVC(kernel='rbf',probability=True)

# Training SVM
svmc.fit(X_train, y_train)

# Predict on the test data with SVM
y_pred_svm = svmc.predict(X_test)

In [16]:
# Show the Confusion Matrix
cm_svm = confusion_matrix(y_test, y_pred_svm)
cm_svm

array([[ 37,  44],
       [  4, 129]])

In [17]:
# Classification report
print("Report : \n", classification_report(y_test, y_pred_svm))
print("Accuracy : ",accuracy_score(y_test,y_pred_svm))
accSVMC = accuracy_score(y_test,y_pred_svm)

Report : 
               precision    recall  f1-score   support

           0       0.90      0.46      0.61        81
           1       0.75      0.97      0.84       133

    accuracy                           0.78       214
   macro avg       0.82      0.71      0.72       214
weighted avg       0.81      0.78      0.75       214

Accuracy :  0.7757009345794392
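
Every number in the report above can be derived from the confusion matrix. The sketch below recomputes precision, recall, F1, and accuracy for the SVM matrix `[[37, 44], [4, 129]]`, where rows are true classes and columns are predicted classes. The function name `report_from_confusion` is an assumption for illustration.

```python
def report_from_confusion(cm):
    """Compute per-class (precision, recall, F1) and overall accuracy.

    Rows are true classes and columns are predicted classes.
    """
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    per_class = {}
    for k in range(len(cm)):
        predicted_k = sum(row[k] for row in cm)  # column sum: predicted as k
        actual_k = sum(cm[k])                    # row sum: truly k
        precision = cm[k][k] / predicted_k
        recall = cm[k][k] / actual_k
        f1 = 2 * precision * recall / (precision + recall)
        per_class[k] = (round(precision, 2), round(recall, 2), round(f1, 2))
    return per_class, correct / total

per_class, accuracy = report_from_confusion([[37, 44], [4, 129]])
print(per_class)  # {0: (0.9, 0.46, 0.61), 1: (0.75, 0.97, 0.84)}
print(accuracy)
```

For class `0`, for example, precision is 37 / (37 + 4) and recall is 37 / (37 + 44), which explains the combination of high precision and low recall in the report.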


## **Logistic Regression**

In [18]:
from sklearn.linear_model import LogisticRegression
lrc = LogisticRegression()

# Training Logistic regression
lrc.fit(X_train, y_train)

# Predict on the test data with Logistic Regression
y_pred_lrc = lrc.predict(X_test)

In [19]:
# Show the Confusion Matrix
cm_lrc = confusion_matrix(y_test, y_pred_lrc)
cm_lrc

array([[ 31,  50],
       [  2, 131]])

In [20]:
# Classification report
print("Report : \n", classification_report(y_test, y_pred_lrc))
print("Accuracy : ",accuracy_score(y_test,y_pred_lrc))
accLRC = accuracy_score(y_test,y_pred_lrc)

Report : 
               precision    recall  f1-score   support

           0       0.94      0.38      0.54        81
           1       0.72      0.98      0.83       133

    accuracy                           0.76       214
   macro avg       0.83      0.68      0.69       214
weighted avg       0.81      0.76      0.72       214

Accuracy :  0.7570093457943925


## **Gradient Boosting**

In [21]:
from sklearn.ensemble import GradientBoostingClassifier
gbc = GradientBoostingClassifier(n_estimators=20, learning_rate = 0.5, max_features=2, max_depth = 2, random_state = 0)

# Training gbc
gbc.fit(X_train, y_train)

# Predict on the test data with Gradient Boosting
y_pred_gbc = gbc.predict(X_test)

In [22]:
# Show the Confusion Matrix
cm_gbc = confusion_matrix(y_test, y_pred_gbc)
cm_gbc

array([[  5,  76],
       [  7, 126]])

In [23]:
# Classification report
print("Report : \n", classification_report(y_test, y_pred_gbc))
print("Accuracy : ",accuracy_score(y_test,y_pred_gbc))
accGBC = accuracy_score(y_test,y_pred_gbc)

Report : 
               precision    recall  f1-score   support

           0       0.42      0.06      0.11        81
           1       0.62      0.95      0.75       133

    accuracy                           0.61       214
   macro avg       0.52      0.50      0.43       214
weighted avg       0.55      0.61      0.51       214

Accuracy :  0.6121495327102804


## **KNN**

In [24]:
from sklearn.neighbors import KNeighborsClassifier
knnc = KNeighborsClassifier(n_neighbors=5)

# Training knn
knnc.fit(X_train, y_train)

# Predict on the test data with KNN
y_pred_knnc = knnc.predict(X_test)

In [25]:
# Show the Confusion Matrix
cm_knnc = confusion_matrix(y_test, y_pred_knnc)
cm_knnc

array([[ 54,  27],
       [ 10, 123]])

In [26]:
# Classification report
print("Report : \n", classification_report(y_test, y_pred_knnc))
print("Accuracy : ",accuracy_score(y_test,y_pred_knnc))
accKNNC = accuracy_score(y_test,y_pred_knnc)

Report : 
               precision    recall  f1-score   support

           0       0.84      0.67      0.74        81
           1       0.82      0.92      0.87       133

    accuracy                           0.83       214
   macro avg       0.83      0.80      0.81       214
weighted avg       0.83      0.83      0.82       214

Accuracy :  0.8271028037383178


## **MLP**

In [27]:
from sklearn.neural_network import MLPClassifier

mlpc = MLPClassifier(hidden_layer_sizes=(20, 3), max_iter=200, alpha=1e-4,
                    solver='sgd', verbose=10, tol=1e-4, random_state=1,
                    learning_rate_init=.1)

# Training MLP
mlpc.fit(X_train, y_train)

# Predict on the test data with MLP
y_pred_mlpc = mlpc.predict(X_test)


Iteration 1, loss = 0.66160988
Iteration 2, loss = 0.65714249
Iteration 3, loss = 0.65274222
Iteration 4, loss = 0.65132911
Iteration 5, loss = 0.65144208
Iteration 6, loss = 0.65189407
Iteration 7, loss = 0.65184266
Iteration 8, loss = 0.65029331
Iteration 9, loss = 0.64848332
Iteration 10, loss = 0.64740101
Iteration 11, loss = 0.64683368
Iteration 12, loss = 0.64519160
Iteration 13, loss = 0.64287492
Iteration 14, loss = 0.64009323
Iteration 15, loss = 0.63688240
Iteration 16, loss = 0.63287599
Iteration 17, loss = 0.62720633
Iteration 18, loss = 0.62010172
Iteration 19, loss = 0.61047877
Iteration 20, loss = 0.59801214
Iteration 21, loss = 0.58077943
Iteration 22, loss = 0.55817764
Iteration 23, loss = 0.52975953
Iteration 24, loss = 0.49396314
Iteration 25, loss = 0.44590564
Iteration 26, loss = 0.39189323
Iteration 27, loss = 0.34015647
Iteration 28, loss = 0.28070260
Iteration 29, loss = 0.22830096
Iteration 30, loss = 0.18330401
Iteration 31, loss = 0.15478074
Iteration 32, los

In [28]:
# Show the Confusion Matrix
cm_mlpc = confusion_matrix(y_test, y_pred_mlpc)
cm_mlpc

array([[ 57,  24],
       [  8, 125]])

In [29]:
# Classification report
print("Report : \n", classification_report(y_test, y_pred_mlpc))
print("Accuracy : ",accuracy_score(y_test,y_pred_mlpc))
accMLPC = accuracy_score(y_test,y_pred_mlpc)

Report : 
               precision    recall  f1-score   support

           0       0.88      0.70      0.78        81
           1       0.84      0.94      0.89       133

    accuracy                           0.85       214
   macro avg       0.86      0.82      0.83       214
weighted avg       0.85      0.85      0.85       214

Accuracy :  0.8504672897196262


## **Model Comparison**

In [30]:
# Accuracy Comparison
model = ['SVM', 'Logistic Regression', 'Gradient Boosting', 'KNN', 'MLP']
accuracies = [accSVMC, accLRC, accGBC, accKNNC, accMLPC]
comp = pd.DataFrame(list(zip(model, accuracies)), columns=['Model', 'Accuracy'])
comp

Unnamed: 0,Model,Accuracy
0,SVM,0.775701
1,Logistic Regression,0.757009
2,Gradient Boosting,0.61215
3,KNN,0.827103
4,MLP,0.850467


# **Conclusion**


---

Based on the accuracies above, MLP is the best model: it has the highest test accuracy (about 0.85), followed by KNN (about 0.83).
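
Because the classes are unbalanced, accuracy alone can be misleading, so it helps to also compare how many hate-speech tweets each model actually catches. The sketch below recomputes HS recall from the confusion matrices reported above and adds a majority-class baseline. It assumes class `0` is `HS` (the order `LabelEncoder` produces by sorting class names); the variable names are illustrative.

```python
# Confusion matrices copied from the cells above (rows: true class, columns: predicted class)
matrices = {
    "SVM": [[37, 44], [4, 129]],
    "Logistic Regression": [[31, 50], [2, 131]],
    "Gradient Boosting": [[5, 76], [7, 126]],
    "KNN": [[54, 27], [10, 123]],
    "MLP": [[57, 24], [8, 125]],
}

# Recall for class 0: correctly detected HS tweets / all true HS tweets
hs_recall = {name: round(cm[0][0] / sum(cm[0]), 2) for name, cm in matrices.items()}
best = max(hs_recall, key=hs_recall.get)

# Accuracy of always predicting the majority class (133 of 214 test tweets)
baseline = round(133 / 214, 3)

print(hs_recall)
print(best)      # MLP
print(baseline)  # 0.621
```

MLP also comes out ahead on HS recall (0.70), which supports the conclusion. Gradient Boosting, at 0.612 accuracy, falls just below the majority-class baseline.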

# **Predict Data**

In [33]:
df_hs.to_csv('HS dataset Clean.csv', index=False)

In [47]:
# Input New Statement
new_statement = ['terima kasih pak sudah abdi dijakarta dengan gudang minta warga','mungkin ada kata kata yang sakit ya pak tapi moga itu jadi semangat untuk bapak depan',
                 'tahu anda apa sama antara malin kundang dengan ahok mereka dua sama sam durhaka asbak iklanahokjahat kampanyeahokjahat',
                 'ngomong aja ente sama kaca pecah belah rukun lagi hok ente bisa ngaco iklanahokjahat',
                 'pala otak kau pecah tapi ada orang baik yang tolong','sungguh jahat anda terhadap dia', 
                 'ngomong aja ga jelas seperti anjing yang menggonggong', 'bilamana saya marah saya akan minum es', 
                 'apa perbedaan anda dengan malin kundang apabila anda mengutuk ibu anda', 'lela bukan anak yang baik'] 

# Extract Features
new_statement_features = tfidf_vectorizer.transform(new_statement).toarray()

# Decode the predicted class back to its label
predict_sentiment = encoder.inverse_transform(mlpc.predict(new_statement_features))
print('sentiment: ',predict_sentiment)

sentiment:  ['Non_HS' 'Non_HS' 'HS' 'HS' 'Non_HS' 'Non_HS' 'HS' 'Non_HS' 'HS' 'Non_HS']
