<a href="https://colab.research.google.com/github/PutriAW/DTI-ASSIGNMENT-TEXT-MINING/blob/main/Hate_Speech_Sentiment_Analysis.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## **Hate Speech Sentiment Analysis**
Created By Putri Apriyanti Windya 
 (DS0124 - Data Scientist 01)

# **Dataset**

---
 Dataset for this classification obtained from https://raw.githubusercontent.com/ialfina/id-hatespeech-detection/master/IDHSD_RIO_unbalanced_713_2017.txt

# **Description**

---

The Dataset for Hate Speech Detection in Indonesian
(Dataset untuk Deteksi Ujaran Kebencian dalam Bahasa Indonesia)

Dataset
The dataset is a two columns data of: label - tweet, consist of 713 tweets in Indonesian.
The label is Non_HS or HS. Non_HS for "non-hate-speech" tweet and HS for "hate-speech" tweet.

Number of Non_HS tweets: 453
Number of HS tweets: 260
Since this dataset is unbalanced, you might have to do over-sampling/down-sampling in order to create a balanced dataset.
The dataset may be used freely, but if you want to publish paper/publication using the dataset, please cite this publication:

Ika Alfina, Rio Mulia, Mohamad Ivan Fanany, and Yudo Ekanata, "Hate Speech Detection in Indonesian Language: A Dataset and Preliminary Study ", in Proceeding of 9th International Conference on Advanced Computer Science and Information Systems 2017(ICACSIS 2017).

# **Problem to Solve**

---

Do sentiment Analysis to know whether a twitter tweet is hate speech or non hate speech

# **Data Preparation**

## **Data Exploration**

**Import All Libraries that Needed for Data Preparation**

In [1]:
# Install library for text preprocessing
!pip install nltk



In [2]:
# install library for indonesian language stemming
!pip install Sastrawi



In [3]:
# Import Library
import numpy as np
import pandas as pd 
import requests
import io
import re # regular expression
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory
from Sastrawi.StopWordRemover.StopWordRemoverFactory import StopWordRemoverFactory
import seaborn as sns
import matplotlib.pyplot as plt

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import *
import string
import nltk
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import confusion_matrix

In [4]:
# Get Data from Github
result = requests.get('https://raw.githubusercontent.com/ialfina/id-hatespeech-detection/master/IDHSD_RIO_unbalanced_713_2017.txt')
data = io.StringIO(result.text)

In [5]:
# Convert result into data frame
df_hs = pd.read_csv(data, sep='\t')
df_hs.head()

Unnamed: 0,Label,Tweet
0,Non_HS,RT @spardaxyz: Fadli Zon Minta Mendagri Segera...
1,Non_HS,RT @baguscondromowo: Mereka terus melukai aksi...
2,Non_HS,Sylvi: bagaimana gurbernur melakukan kekerasan...
3,Non_HS,"Ahmad Dhani Tak Puas Debat Pilkada, Masalah Ja..."
4,Non_HS,RT @lisdaulay28: Waspada KTP palsu.....kawal P...


In [6]:
# Count HS and Non_HS label
df_hs.Label.value_counts().to_frame()

Unnamed: 0,Label
Non_HS,453
HS,260


## **Text Cleaning**

**Case folding**

In [7]:
temp_tweet = []

for tw in df_hs['Tweet']:
  # removal of @name[mention]
  tw = re.sub(r"(?:\@|https?\://)\S+", "", tw)

  # removal of links[https://blabala.com]
  # tw = re.sub(r"http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+", "", tw)
  tw = re.sub(r"http\S+", "", tw)

  # removal of new line
  tw = re.sub('\n', '', tw)

  # removal of RT
  tw = re.sub('RT', '', tw)

  # Tokenization
  # removal of punctuations and numbers
  tw = re.sub("[^a-zA-Z^']", " ", tw)
  tw = re.sub(" {2,}", " ", tw)

  # remove leading and trailing whitespace
  tw = tw.strip()

  # remove whitespace with a single space
  tw = re.sub(r'\s+', ' ', tw)

  # convert text to Lowercase
  tw = tw.lower();
  temp_tweet.append(tw)

df_hs['Clean_Tweet'] = temp_tweet
df_hs.head()

Unnamed: 0,Label,Tweet,Clean_Tweet
0,Non_HS,RT @spardaxyz: Fadli Zon Minta Mendagri Segera...,fadli zon minta mendagri segera menonaktifkan ...
1,Non_HS,RT @baguscondromowo: Mereka terus melukai aksi...,mereka terus melukai aksi dalam rangka memenja...
2,Non_HS,Sylvi: bagaimana gurbernur melakukan kekerasan...,sylvi bagaimana gurbernur melakukan kekerasan ...
3,Non_HS,"Ahmad Dhani Tak Puas Debat Pilkada, Masalah Ja...",ahmad dhani tak puas debat pilkada masalah jal...
4,Non_HS,RT @lisdaulay28: Waspada KTP palsu.....kawal P...,waspada ktp palsu kawal pilkada


**Stemming**

In [8]:
factory = StemmerFactory()
stemmer = factory.create_stemmer()

def stem(tweet) :
    hasil = stemmer.stem(tweet)
    return hasil

df_hs['Clean_Tweet'] = df_hs.apply(lambda row : stem(row['Clean_Tweet']), axis = 1)

**Stop Word Removal**

In [9]:
R_factory = StopWordRemoverFactory()
R_stopword = R_factory.create_stop_word_remover()

def R_stopwords(tweet) :
    tweet = tweet.translate(str.maketrans('','',string.punctuation)).lower()
    return R_stopword.remove(tweet)

df_hs['Clean_Tweet'] = df_hs.apply(lambda row : stem(row['Clean_Tweet']), axis = 1)

In [10]:
df_hs.head()

Unnamed: 0,Label,Tweet,Clean_Tweet
0,Non_HS,RT @spardaxyz: Fadli Zon Minta Mendagri Segera...,fadli zon minta mendagri segera nonaktif ahok ...
1,Non_HS,RT @baguscondromowo: Mereka terus melukai aksi...,mereka terus luka aksi dalam rangka penjara ah...
2,Non_HS,Sylvi: bagaimana gurbernur melakukan kekerasan...,sylvi bagaimana gurbernur laku keras perempuan...
3,Non_HS,"Ahmad Dhani Tak Puas Debat Pilkada, Masalah Ja...",ahmad dhani tak puas debat pilkada masalah jal...
4,Non_HS,RT @lisdaulay28: Waspada KTP palsu.....kawal P...,waspada ktp palsu kawal pilkada
