
In [371]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [372]:
df = pd.read_csv('/content/train.txt', sep=';', header = None, names = ['text','emotion'])

In [373]:
df.head()

,text,emotion
0,i didnt feel humiliated,sadness
1,i can go from feeling so hopeless to so damned...,sadness
2,im grabbing a minute to post i feel greedy wrong,anger
3,i am ever feeling nostalgic about the fireplac...,love
4,i am feeling grouchy,anger


In [374]:
df.isnull().sum()

text,0
emotion,0


In [375]:
df['emotion'].unique()

array(['sadness', 'anger', 'love', 'surprise', 'fear', 'joy'],
      dtype=object)

In [376]:
# Map each emotion label to an integer id, in order of first appearance
unique_emotions = df['emotion'].unique()
emotion_numbers = {emo: i for i, emo in enumerate(unique_emotions)}
df['emotion'] = df['emotion'].map(emotion_numbers)
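
The integer ids above depend on the order in which labels first appear in the file, so it may help to keep an inverse mapping for turning model predictions back into label names. A minimal sketch, assuming the label order reported by `df['emotion'].unique()`; `number_to_emotion` and `predicted_ids` are hypothetical names introduced only for illustration:

```python
# Labels in order of first appearance, as reported by df['emotion'].unique()
unique_emotions = ['sadness', 'anger', 'love', 'surprise', 'fear', 'joy']

# Forward mapping (label -> id), same result as the loop in the notebook
emotion_numbers = {emo: i for i, emo in enumerate(unique_emotions)}

# Inverse mapping (id -> label), useful for decoding predictions
number_to_emotion = {i: emo for emo, i in emotion_numbers.items()}

# Hypothetical model output, decoded back to label names
predicted_ids = [0, 5, 1]
decoded = [number_to_emotion[i] for i in predicted_ids]
print(decoded)  # ['sadness', 'joy', 'anger']
```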

In [377]:
df

,text,emotion
0,i didnt feel humiliated,0
1,i can go from feeling so hopeless to so damned...,0
2,im grabbing a minute to post i feel greedy wrong,1
3,i am ever feeling nostalgic about the fireplac...,2
4,i am feeling grouchy,1
...,...,...
15995,i just had a very brief time in the beanbag an...,0
15996,i am now turning and i feel pathetic that i am...,0
15997,i feel strong and good overall,5
15998,i feel like this was such a rude comment and i...,1


# **TEXT CLEANING**

CONVERTING TEXT INTO LOWER CASE

In [378]:
df['text'] = df['text'].str.lower()

In [379]:
df

,text,emotion
0,i didnt feel humiliated,0
1,i can go from feeling so hopeless to so damned...,0
2,im grabbing a minute to post i feel greedy wrong,1
3,i am ever feeling nostalgic about the fireplac...,2
4,i am feeling grouchy,1
...,...,...
15995,i just had a very brief time in the beanbag an...,0
15996,i am now turning and i feel pathetic that i am...,0
15997,i feel strong and good overall,5
15998,i feel like this was such a rude comment and i...,1


REMOVING PUNCTUATION

In [380]:
import string

def remove_punc(txt):
  # Delete every character listed in string.punctuation
  return txt.translate(str.maketrans('', '', string.punctuation))


In [381]:
df['text']=df['text'].apply(remove_punc)

REMOVING NUMBERS

In [382]:
def remove_nums(txt):
  # Keep every character that is not a digit
  return ''.join(ch for ch in txt if not ch.isdigit())

In [383]:
df['text']=df['text'].apply(remove_nums)

REMOVING EMOJIS AND SPECIAL CHARACTERS

In [384]:
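def remove_emojis(txt):
  # Keep only ASCII characters; this drops emojis, but also any
  # other non-ASCII character such as accented letters
  return ''.join(ch for ch in txt if ch.isascii())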

df['text']=df['text'].apply(remove_emojis)
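
The four cleaning steps above (lower-casing, punctuation removal, digit removal, non-ASCII removal) can also be combined into a single helper that is applied once. The sketch below is one possible arrangement, not the notebook's own code; `clean_text` and the sample sentence are hypothetical, and `str.isascii` assumes Python 3.7 or later:

```python
import string

# Translation table that deletes every ASCII punctuation character
_PUNCT_TABLE = str.maketrans('', '', string.punctuation)

def clean_text(txt):
    """Lower-case, then strip punctuation, digits and non-ASCII characters."""
    txt = txt.lower()
    txt = txt.translate(_PUNCT_TABLE)
    return ''.join(ch for ch in txt if ch.isascii() and not ch.isdigit())

# Hypothetical sample containing punctuation, digits, an emoji and an accent
sample = "I can't WAIT for 2024!! 😀 café"
cleaned = clean_text(sample)
print(cleaned.split())
```

Note that the accented letter in "café" is removed along with the emoji, leaving "caf". With a DataFrame, the helper would be applied as `df['text'].apply(clean_text)`.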

REMOVE STOPWORDS

In [385]:
import nltk

In [386]:
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

In [387]:
nltk.download('punkt')  #for tokenization
nltk.download('stopwords')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [388]:
stop_words= set(stopwords.words('english'))

In [389]:
len(stop_words)

198

In [390]:
df.loc[1, 'text']

'i can go from feeling so hopeless to so damned hopeful just from being around someone who cares and is awake'

In [391]:
def remove(txt):
  # Tokenize the text, then keep only tokens that are not English stop words
  words = word_tokenize(txt)
  cleaned = [w for w in words if w not in stop_words]
  return ' '.join(cleaned)


In [392]:
nltk.download('punkt_tab')

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


True

In [393]:
df['text']=df['text'].apply(remove)

In [394]:
df.loc[1, 'text']

'go feeling hopeless damned hopeful around someone cares awake'
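
One side effect worth checking: NLTK's English stop-word list includes negations such as "not" and "no", so removing stop words can flip the apparent sentiment of a sentence. The sketch below illustrates the idea with a tiny hypothetical stop-word set standing in for `stopwords.words('english')` and plain `str.split` standing in for `word_tokenize`, so that it runs without any downloads:

```python
# Hypothetical miniature stop-word set (stand-in for the NLTK list)
base_stop_words = {'i', 'am', 'not', 'no', 'so', 'to', 'the', 'and', 'is'}

# Negation words we may prefer to keep for emotion classification
negations = {'not', 'no', 'nor'}
stop_words_keep_neg = base_stop_words - negations

def remove_stopwords(txt, stop_set):
    # Whitespace tokenization is a simplification of word_tokenize
    return ' '.join(w for w in txt.split() if w not in stop_set)

sentence = 'i am not happy'
without_negation = remove_stopwords(sentence, base_stop_words)
with_negation = remove_stopwords(sentence, stop_words_keep_neg)
print(without_negation)  # happy
print(with_negation)     # not happy
```

With the real list, the same idea would be `stop_words = set(stopwords.words('english')) - {'not', 'no', 'nor'}`. Whether this helps accuracy on this dataset is untested here.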

In [395]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df['text'], df['emotion'], test_size=0.20, random_state=42 )

In [396]:
X_train

,text
676,refers course though cant help feeling somehow...
12113,im starting feel im suffering fatigue
7077,feel like probably would liked book little bit...
13005,really feel awkward
12123,im feeling little grumpy today lame weather te...
...,...
13418,love leave reader feeling confused slightly de...
5390,feel delicate
860,starting feel little stressed
15795,feel stressed tired worn shape neglected


In [397]:
df.shape

(16000, 2)

In [398]:
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

**BOW IMPLEMENTATION**

In [399]:
bow_vectorizer = CountVectorizer()

In [400]:
X_train_bow = bow_vectorizer.fit_transform(X_train)
X_test_bow = bow_vectorizer.transform(X_test)

In [401]:
X_train_bow

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 116049 stored elements and shape (12800, 13359)>

In [402]:
X_test_bow

<Compressed Sparse Row sparse matrix of dtype 'int64'
	with 26934 stored elements and shape (3200, 13359)>
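
The summaries above explain why the matrices are stored in sparse form: only a tiny fraction of the cells are non-zero. A short arithmetic check using the numbers reported for `X_train_bow` (the variable names below are introduced only for this calculation):

```python
# Figures copied from the X_train_bow summary above
stored_elements = 116049
n_rows, n_cols = 12800, 13359

# Fraction of cells that hold a non-zero count
density = stored_elements / (n_rows * n_cols)

# Average number of distinct vocabulary terms per training text
avg_terms_per_row = stored_elements / n_rows

print(f'density: {density:.4%}')
print(f'average non-zero terms per text: {avg_terms_per_row:.2f}')
```

Each cleaned text contains roughly nine distinct vocabulary terms out of 13,359, so a dense array would consist almost entirely of zeros.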

In [403]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

Using Naive Bayes Model

In [404]:
nb_model = MultinomialNB()

In [405]:
nb_model.fit(X_train_bow, y_train)

In [406]:
pred_bow = nb_model.predict(X_test_bow)

In [407]:
print(accuracy_score(y_test,pred_bow))

0.7678125


**TF-IDF IMPLEMENTATION**

In [408]:
tfidf_vectorizer = TfidfVectorizer()

In [409]:
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)

Using Naive Bayes Model

In [410]:
nb2_model = MultinomialNB()

In [411]:
nb2_model.fit(X_train_tfidf, y_train)

In [412]:
pred_tfidf = nb2_model.predict(X_test_tfidf)

In [413]:
print(accuracy_score(y_test,pred_tfidf))

0.6609375


Using Logistic Regression Model

In [414]:
from sklearn.linear_model import LogisticRegression

In [415]:
log_model = LogisticRegression(max_iter=1000)

In [416]:
log_model.fit(X_train_tfidf, y_train)

In [417]:
pred_log = log_model.predict(X_test_tfidf)

In [418]:
print(accuracy_score(y_test,pred_log))

0.8615625


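**ACCURACY IMPROVED WHEN LOGISTIC REGRESSION MODEL IS USED: 0.8616 WITH TF-IDF FEATURES, COMPARED WITH 0.6609 FOR NAIVE BAYES ON TF-IDF AND 0.7678 FOR NAIVE BAYES ON BOW**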
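
The three vectorizer and model combinations tried above can be compared in one loop by wrapping each pair in a scikit-learn pipeline. This is a sketch under stated assumptions rather than the notebook's own code: the toy texts and labels are hypothetical stand-ins for `X_train`, `X_test`, `y_train` and `y_test`, so the printed scores say nothing about the real dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data; replace with X_train, y_train, X_test, y_test
train_texts = ['feel sad lonely', 'feel hopeless sad', 'feel happy joyful',
               'feel joyful excited', 'feel angry annoyed', 'feel annoyed furious']
train_labels = [0, 0, 5, 5, 1, 1]
test_texts = ['feel sad', 'feel joyful', 'feel furious']
test_labels = [0, 5, 1]

# The same three combinations evaluated in the cells above
candidates = {
    'bow + naive bayes': make_pipeline(CountVectorizer(), MultinomialNB()),
    'tfidf + naive bayes': make_pipeline(TfidfVectorizer(), MultinomialNB()),
    'tfidf + logistic regression': make_pipeline(
        TfidfVectorizer(), LogisticRegression(max_iter=1000)),
}

scores = {}
for name, pipe in candidates.items():
    # Each pipeline fits its vectorizer on the training texts only
    pipe.fit(train_texts, train_labels)
    scores[name] = accuracy_score(test_labels, pipe.predict(test_texts))
    print(f'{name}: {scores[name]:.4f}')
```

A pipeline also guarantees that the vectorizer is fitted only on the training split, which matches the `fit_transform` / `transform` pattern used earlier.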

# **FEATURE EXTRACTION/ VECTORIZATION**

ML models only understand numbers: not feelings, not words, not emojis.
So we need to convert text into numbers in a way that keeps the useful
information. Choosing and computing those numeric representations is called
feature extraction, and turning each document into a numeric vector is called vectorization.
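
To make vectorization concrete, here is a minimal pure-Python sketch of a bag-of-words matrix on four toy sentences. It assumes simple lower-casing and whitespace tokenization, which differs from `CountVectorizer`: the scikit-learn default token pattern drops single-character tokens such as "i".

```python
from collections import Counter

documents = [
    "I love pizza",
    "Pizza is the best",
    "I love pasta",
    "Pasta is great",
]

# Simplified tokenization: lower-case and split on whitespace
tokenized = [doc.lower().split() for doc in documents]

# Vocabulary: every distinct token, sorted so column order is stable
vocabulary = sorted({tok for doc in tokenized for tok in doc})

# One row per document, one column per vocabulary term, cell = term count
bow_matrix = [[Counter(doc)[term] for term in vocabulary] for doc in tokenized]

print(vocabulary)
for row in bow_matrix:
    print(row)
```

Each sentence becomes a fixed-length vector of counts, which is the numeric form a model such as `MultinomialNB` can consume.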

**BAG OF WORDS IMPLEMENTATION**

In [419]:
# from sklearn.feature_extraction.text import CountVectorizer

# documents = [
#     "I love pizza",
#     "Pizza is the best",
#     "I love pasta",
#     "Pasta is great"
# ]

# vectorizer = CountVectorizer(ngram_range=(2,2))

# X = vectorizer.fit_transform(documents)


In [420]:
# print("Vocabulary:", vectorizer.get_feature_names_out())

In [421]:
# print("Bow Matrix:\n", X.toarray())

**TF-IDF IMPLEMENTATION**

In [422]:
# from sklearn.feature_extraction.text import TfidfVectorizer

# documents = [
#     "I love pizza",
#     "Pizza is the best",
#     "I love pasta",
#     "Pasta is great"
# ]

# vectorizer = TfidfVectorizer()

# X = vectorizer.fit_transform(documents)

# print("Vocabulary:", vectorizer.get_feature_names_out())
# print("\nTf-idf matrix:\n",X.toarray())
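
For intuition about what TF-IDF weights are, the sketch below computes them by hand on the same four toy sentences. It assumes the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1` followed by L2 normalization of each row, which is how `TfidfVectorizer` behaves with its default settings; the whitespace tokenization is a simplification and keeps the token "i", which scikit-learn would drop.

```python
import math
from collections import Counter

documents = [
    "I love pizza",
    "Pizza is the best",
    "I love pasta",
    "Pasta is great",
]

# Simplified tokenization: lower-case and split on whitespace
tokenized = [doc.lower().split() for doc in documents]
vocabulary = sorted({tok for doc in tokenized for tok in doc})
n_docs = len(tokenized)

# Document frequency: in how many documents each term occurs
doc_freq = Counter(tok for doc in tokenized for tok in set(doc))

# Smoothed inverse document frequency; rarer terms get larger weights
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}

def tfidf_row(tokens):
    # Term count times idf, then scale the row to unit L2 length
    counts = Counter(tokens)
    raw = [counts[t] * idf[t] for t in vocabulary]
    norm = math.sqrt(sum(v * v for v in raw))
    return [v / norm for v in raw]

tfidf_matrix = [tfidf_row(doc) for doc in tokenized]

for term in vocabulary:
    print(f'{term}: idf={idf[term]:.3f}')
```

Terms that occur in a single sentence, such as "best", receive a higher idf than terms shared by two sentences, such as "pizza".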
