<a href="https://colab.research.google.com/github/DilshanBotheju/EmotionDetectionUsingTexts-NLP-/blob/DataPreprocessing/DataPreprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Loading the dataset

In [70]:
import numpy as np
import pandas as pd

In [71]:
# Mount Google Drive to access the dataset
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [72]:
Emotion_data = pd.read_csv("/content/drive/MyDrive/Emotion, Hate Speech and Violence Detection using NLP/datasets/Emotions.csv", encoding="latin-1")
Emotion_data.head()

,Unnamed: 0,text,label
0,0,i just feel really helpless and heavy hearted,4
1,1,ive enjoyed being able to slouch about relax a...,0
2,2,i gave up my internship with the dmrg and am f...,4
3,3,i dont know i feel so lost,0
4,4,i am a kindergarten teacher and i am thoroughl...,4


Each text is labeled with one of six emotion categories: sadness (0), joy (1), love (2), anger (3), fear (4), and surprise (5).
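When inspecting samples, it can help to map the integer labels back to their names. The following is a minimal sketch that assumes the encoding listed above. `LABEL_NAMES` and the toy frame are illustrative names, not part of the notebook.

```python
import pandas as pd

# Label encoding as listed above; LABEL_NAMES is an illustrative name
LABEL_NAMES = {0: "sadness", 1: "joy", 2: "love", 3: "anger", 4: "fear", 5: "surprise"}

# Toy frame standing in for Emotion_data
toy = pd.DataFrame({
    "text": [
        "i dont know i feel so lost",
        "i just feel really helpless and heavy hearted",
    ],
    "label": [0, 4],
})

# Add a human-readable column next to the numeric label
toy["label_name"] = toy["label"].map(LABEL_NAMES)
print(toy[["label", "label_name"]])
```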

Data Preprocessing

In [73]:
# Dropping the redundant index column
Emotion_data = Emotion_data.drop(["Unnamed: 0"], axis=1)
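An `Unnamed: 0` column usually appears when a DataFrame was saved together with its index and then read back without `index_col`. As a sketch of an alternative to dropping the column afterwards, it can be consumed as the index at load time. The in-memory CSV below is a made-up stand-in for `Emotions.csv`, and whether this applies to the real file is an assumption.

```python
import io
import pandas as pd

# Made-up stand-in for Emotions.csv, saved with its index as an unnamed first column
csv_text = (
    ",text,label\n"
    "0,i dont know i feel so lost,0\n"
    "1,i always seem to feel inadequate,0\n"
)

# Default read: the unnamed first column is loaded as "Unnamed: 0"
raw = pd.read_csv(io.StringIO(csv_text))
print(raw.columns.tolist())

# index_col=0 uses that column as the index instead of a data column
clean = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(clean.columns.tolist())
```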

In [74]:
# Viewing new data
Emotion_data.head()

,text,label
0,i just feel really helpless and heavy hearted,4
1,ive enjoyed being able to slouch about relax a...,0
2,i gave up my internship with the dmrg and am f...,4
3,i dont know i feel so lost,0
4,i am a kindergarten teacher and i am thoroughl...,4


Inspecting the dataset

In [75]:
Emotion_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 416809 entries, 0 to 416808
Data columns (total 2 columns):
 #   Column  Non-Null Count   Dtype 
---  ------  --------------   ----- 
 0   text    416809 non-null  object
 1   label   416809 non-null  int64 
dtypes: int64(1), object(1)
memory usage: 6.4+ MB


In [76]:
Emotion_data.columns

Index(['text', 'label'], dtype='object')

In [77]:
# Checking for null values
Emotion_data.isnull().sum()

text,0
label,0


In [78]:
Emotion_data["label"].value_counts()

label,count
1,141067
0,121187
3,57317
4,47712
2,34554
5,14972


In [79]:
# Balance the classes by sampling 12,000 rows from each of the six labels
E_data = pd.concat(
    [
        Emotion_data[Emotion_data["label"] == i].sample(n=12000, random_state=42)
        for i in range(6)
    ]
)


In [80]:
E_data.shape

(72000, 2)
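The per-label sampling above can also be written with pandas' grouped sampling. This is a sketch on a toy frame, not the notebook's code: `n_per_class` and the toy data are illustrative, and `groupby(...).sample` is not guaranteed to draw the same rows as sampling each subset separately.

```python
import pandas as pd

# Toy imbalanced frame standing in for Emotion_data (labels 0, 1, 2 only)
toy = pd.DataFrame({
    "text": [f"sample {i}" for i in range(12)],
    "label": [0] * 6 + [1] * 4 + [2] * 2,
})

n_per_class = 2  # illustrative; the notebook uses 12000

# Draw the same number of rows from every label in a single call
balanced = toy.groupby("label").sample(n=n_per_class, random_state=42)

print(balanced["label"].value_counts().sort_index())
```

As with the original sampling, `n` must not exceed the size of the smallest class unless `replace=True` is passed.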

In [81]:
# Replacing the full dataset with the balanced sample
Emotion_data = E_data.copy()

In [82]:
# Checking the class counts in the balanced data
Emotion_data["label"].value_counts()

label,count
0,12000
1,12000
2,12000
3,12000
4,12000
5,12000


In [83]:
Emotion_data.head()

,text,label
133243,ive learned to surround myself with women who ...,0
88501,i already feel crappy because of this and you ...,0
131379,i feel like i have lost mourned and moved past...,0
148369,i could write a whole lot more about why im fe...,0
134438,i always seem to feel inadequate,0


In [84]:
# Resetting the indexes
Emotion_data.reset_index(drop=True, inplace=True)

In [85]:
Emotion_data.head()

,text,label
0,ive learned to surround myself with women who ...,0
1,i already feel crappy because of this and you ...,0
2,i feel like i have lost mourned and moved past...,0
3,i could write a whole lot more about why im fe...,0
4,i always seem to feel inadequate,0


Text Preprocessing

In [86]:
# Import the libraries needed for text preprocessing
import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

In [89]:
# Download the NLTK resources needed for stopword removal and tokenization
nltk.download("stopwords")
nltk.download("punkt")
nltk.download("punkt_tab")

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [90]:
# Build the stopword set once; a set gives fast membership checks
STOP_WORDS = set(stopwords.words("english"))

# Text preprocessing by removing unnecessary content
def preprocess_text(text):
    # Convert to lowercase
    text = text.lower()
    # Remove URLs ("http\S+" also covers https links)
    text = re.sub(r"http\S+|www\S+", "", text)
    # Replace non-word characters (punctuation, symbols) with spaces
    text = re.sub(r"\W", " ", text)
    # Remove numbers
    text = re.sub(r"\d+", "", text)
    # Collapse repeated whitespace
    text = re.sub(r"\s+", " ", text).strip()

    # Tokenize the text
    tokens = word_tokenize(text)
    # Remove stopwords
    tokens = [word for word in tokens if word not in STOP_WORDS]
    return " ".join(tokens)

# Applying the function to every row
Emotion_data["cleaned_text"] = Emotion_data["text"].apply(preprocess_text)
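To make the effect of each cleaning step concrete, the sketch below applies the same regular-expression substitutions to a made-up string. To stay self-contained it splits on whitespace and uses a tiny hand-picked stopword set (`DEMO_STOPWORDS`); both are simplifications of the NLTK tokenizer and stopword list used above, and `clean_demo` is an illustrative name.

```python
import re

# Tiny illustrative subset; the notebook uses NLTK's full English stopword list
DEMO_STOPWORDS = {"i", "am", "so", "at", "the", "and", "to"}

def clean_demo(text):
    text = text.lower()                          # lowercase
    text = re.sub(r"http\S+|www\S+", "", text)   # drop URLs
    text = re.sub(r"\W", " ", text)              # punctuation and symbols -> spaces
    text = re.sub(r"\d+", "", text)              # drop digits
    text = re.sub(r"\s+", " ", text).strip()     # collapse whitespace
    # Whitespace split stands in for word_tokenize in this sketch
    tokens = [w for w in text.split() if w not in DEMO_STOPWORDS]
    return " ".join(tokens)

sample = "I am SO happy!!! Visit https://example.com at 10am :)"
print(clean_demo(sample))
```

Note that digit removal turns `10am` into `am`, which is then filtered out as a stopword, so tokens that mix digits and letters can change meaning or disappear.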

In [91]:
# Viewing cleaned data
Emotion_data.head(10)

Unnamed: 0,text,label,cleaned_text
0,ive learned to surround myself with women who ...,0,ive learned surround women lift leave feeling ...
1,i already feel crappy because of this and you ...,0,already feel crappy upset situation doesnt help
2,i feel like i have lost mourned and moved past...,0,feel like lost mourned moved past tears relati...
3,i could write a whole lot more about why im fe...,0,could write whole lot im feeling crappy dont t...
4,i always seem to feel inadequate,0,always seem feel inadequate
5,i feel really inadequate and i just wish i had...,0,feel really inadequate wish enough brains atle...
6,i want to so badly because im lonely and feeli...,0,want badly im lonely feeling isolated much time
7,i feel troubled so troubled that i cannot seem...,0,feel troubled troubled seem feel comfortable s...
8,i am really surprised and frankly i feel prett...,0,really surprised frankly feel pretty beaten
9,i recommend using them when feeling emotionall...,0,recommend using feeling emotionally drained
