
In [5]:
import pandas as pd
import numpy as np

In [6]:
data = pd.read_csv('/content/emotion_dataset_raw.csv')

In [8]:
data.head()

Unnamed: 0,Emotion,Text
0,neutral,Why ?
1,joy,Sage Act upgrade on my to do list for tommorow.
2,sadness,ON THE WAY TO MY HOMEGIRL BABY FUNERAL!!! MAN ...
3,joy,Such an eye ! The true hazel eye-and so brill...
4,joy,@Iluvmiasantos ugh babe.. hugggzzz for u .! b...


In [9]:
data.shape

(34792, 2)

In [10]:
data.groupby('Emotion').count()

Unnamed: 0_level_0,Text
Emotion,Unnamed: 1_level_1
anger,4297
disgust,856
fear,5410
joy,11045
neutral,2254
sadness,6722
shame,146
surprise,4062


# Data Preprocessing

In [13]:
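# Keep only the two emotions used for binary classification
classes = ['joy', 'sadness']

# .copy() avoids SettingWithCopyWarning when 'Text' is modified in later cells
data = data[data['Emotion'].isin(classes)].copy()
data.head()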

Unnamed: 0,Emotion,Text
1,joy,Sage Act upgrade on my to do list for tommorow.
2,sadness,ON THE WAY TO MY HOMEGIRL BABY FUNERAL!!! MAN ...
3,joy,Such an eye ! The true hazel eye-and so brill...
4,joy,@Iluvmiasantos ugh babe.. hugggzzz for u .! b...
6,sadness,.Couldnt wait to see them live. If missing th...


## Converting all text to lowercase

In [14]:
# Lowercase every word; split/join also collapses repeated whitespace
data['Text'] = data['Text'].apply(lambda text: " ".join(word.lower() for word in text.split()))

## Removing Punctuation, Symbols

In [16]:

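# Replace every character that is neither a word character nor whitespace with a space.
# regex=True is required: pandas 2.0+ treats the pattern as a literal string by default,
# in which case nothing is replaced and the punctuation stays in the text.
data['Text'] = data['Text'].str.replace(r'[^\w\s]', ' ', regex=True)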
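The effect of the `regex` argument can be checked on a couple of strings. This is a minimal sketch; the `sample` series is made up for illustration and is not part of the dataset.

```python
import pandas as pd

# Hypothetical sample strings, for illustration only
sample = pd.Series(["way homegirl baby funeral!!!", "eye-and so brilliant!"])

# regex=False: the pattern is searched for as a literal string, so nothing matches
literal = sample.str.replace(r"[^\w\s]", " ", regex=False)

# regex=True: each character that is neither a word character nor whitespace
# is replaced by a single space
cleaned = sample.str.replace(r"[^\w\s]", " ", regex=True)

print(literal.tolist())
print(cleaned.tolist())
```

With `regex=False` the strings come back unchanged, which matches what happens when the argument is omitted on pandas 2.0 or later.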

## Removing Stop Words using NLTK

In [18]:
import nltk
nltk.download('stopwords')  # one-time download of the stop-word list
from nltk.corpus import stopwords
stop = set(stopwords.words('english'))  # a set gives fast membership tests

data['Text'] = data['Text'].apply(lambda text: " ".join(word for word in text.split() if word not in stop))

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


In [19]:
data.head()

Unnamed: 0,Emotion,Text
1,joy,sage act upgrade list tommorow.
2,sadness,way homegirl baby funeral!!! man hate funerals...
3,joy,eye ! true hazel eye-and brilliant ! regular f...
4,joy,@iluvmiasantos ugh babe.. hugggzzz u .! babe n...
6,sadness,.couldnt wait see live. missing nh7 wasnt pain...


## Lemmatisation

In [21]:
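import nltk
from textblob import Word
nltk.download('wordnet')  # WordNet data is needed by Word.lemmatize()
# Reduce each word to its lemma (nouns by default, e.g. plural to singular)
data['Text'] = data['Text'].apply(lambda text: " ".join(Word(word).lemmatize() for word in text.split()))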

[nltk_data] Downloading package wordnet to /root/nltk_data...


In [22]:
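# Correcting letter repetitions: any character repeated three or more times
# in a row is collapsed to exactly two occurrences
import re

REPEAT_PATTERN = re.compile(r"(.)\1{2,}")  # compiled once, reused for every row

def de_repeat(text):
    return REPEAT_PATTERN.sub(r"\1\1", text)

data['Text'] = data['Text'].apply(lambda text: " ".join(de_repeat(word) for word in text.split()))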
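As a quick check of what the repetition pattern does, here is a small standalone sketch that applies the same regular expression to a few made-up tokens. The `examples` list is illustrative only.

```python
import re

# Same pattern as in the cell above: one character followed by
# two or more copies of itself
REPEAT = re.compile(r"(.)\1{2,}")

def de_repeat(text):
    # Replace the whole run with two copies of the repeated character
    return REPEAT.sub(r"\1\1", text)

examples = ["hugggzzz", "soooo", "good", "funeral!!!"]
collapsed = [de_repeat(word) for word in examples]
print(collapsed)  # ['huggzz', 'soo', 'good', 'funeral!!']
```

Runs of exactly two characters ("good") are left alone, so ordinary double letters survive. The pattern also applies to punctuation, not only to letters.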

In [26]:
# Remove rarely appearing words: value_counts() sorts by descending frequency,
# so the last 10,000 entries are the 10,000 least frequent words in the corpus
freq_series = pd.Series(' '.join(data['Text']).split()).value_counts()[-10000:]
freq = set(freq_series.index)  # rare words, stored as a set for fast membership tests
data['Text'] = data['Text'].apply(lambda text: " ".join(word for word in text.split() if word not in freq))
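The rare-word filter can be followed on a toy corpus. This is a sketch under stated assumptions: the `texts` series and the cutoff `n_rare = 2` are hypothetical stand-ins for the real data and the 10,000-word cutoff used above.

```python
import pandas as pd

# Hypothetical toy corpus; word counts: happy=3, day=2, sad=2, rainy=1, sunny=1
texts = pd.Series(["happy happy day", "happy sad day", "sad rainy", "sunny"])

n_rare = 2  # illustrative cutoff, stands in for 10000

# value_counts() sorts by descending frequency, so the tail holds the rarest words
word_counts = pd.Series(" ".join(texts).split()).value_counts()
rare = set(word_counts.tail(n_rare).index)

filtered = texts.apply(lambda text: " ".join(w for w in text.split() if w not in rare))
print(sorted(rare))
print(filtered.tolist())
```

Note that a text made up only of rare words becomes an empty string, so it may be worth checking for empty rows after this step.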

# Feature Extraction

In [27]:
# Encode output labels. LabelEncoder assigns codes in alphabetical order,
# so 'joy' becomes 0 and 'sadness' becomes 1
from sklearn import preprocessing
lbl_enc = preprocessing.LabelEncoder()
y = lbl_enc.fit_transform(data.Emotion.values)
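The label-to-code mapping can be read back from the encoder's `classes_` attribute. A minimal sketch on a made-up label list:

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical labels, for illustration only
labels = ["joy", "sadness", "sadness", "joy"]

enc = LabelEncoder()
codes = enc.fit_transform(labels)

# classes_ holds the sorted unique labels; a label's position is its integer code
mapping = {label: code for code, label in enumerate(enc.classes_)}
print(mapping)          # {'joy': 0, 'sadness': 1}
print(codes.tolist())   # [0, 1, 1, 0]

# inverse_transform turns predicted codes back into label strings
decoded = enc.inverse_transform([1, 0]).tolist()
```

Keeping the fitted encoder around is useful later, since model predictions come back as integer codes.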

In [28]:
# Splitting into training and validation data in a 90:10 ratio;
# stratify=y keeps the joy/sadness proportions the same in both splits
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(data.Text.values, y, stratify=y, random_state=42, test_size=0.1, shuffle=True)
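What `stratify` does can be seen on synthetic labels. This is a sketch with made-up data (100 placeholder texts, 60 of class 0 and 40 of class 1), not the notebook's dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 100 placeholder texts with a 60/40 class balance
texts = np.array([f"text {i}" for i in range(100)])
labels = np.array([0] * 60 + [1] * 40)

X_tr, X_va, y_tr, y_va = train_test_split(
    texts, labels, stratify=labels, test_size=0.1, random_state=42, shuffle=True
)

# Both splits keep the 60/40 balance of the full label array
print(len(y_tr), int(y_tr.sum()))  # 90 36
print(len(y_va), int(y_va.sum()))  # 10 4
```

Without `stratify`, a 10% split of an imbalanced dataset can end up with a noticeably different class ratio than the training set.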

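Term Frequency-Inverse Document Frequency (TF-IDF): This feature measures the relative importance of a term. A term's weight grows with how often it appears in a given text and shrinks with the number of texts in the corpus that contain it, so words that occur almost everywhere are down-weighted.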

In [29]:
# Extracting TF-IDF features
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(max_features=1000, analyzer='word', ngram_range=(1,3))
X_train_tfidf = tfidf.fit_transform(X_train)  # learn vocabulary and IDF weights from the training set only
X_val_tfidf = tfidf.transform(X_val)          # reuse them so validation columns match the training columns


Count Vectors: This is another feature we consider. As the name suggests, each tweet is transformed into an array holding the number of times each vocabulary word appears in it. The intuition is that texts conveying similar emotions tend to repeat the same words.

In [31]:
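from sklearn.feature_extraction.text import CountVectorizer
count_vect = CountVectorizer(analyzer='word')
# Build the vocabulary from the training split only, so no information
# from the validation texts leaks into the features
count_vect.fit(X_train)
X_train_count = count_vect.transform(X_train)
X_val_count = count_vect.transform(X_val)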