# NLP Project Tutorial

In this project we reuse the spam dataset and repeat the cleaning process. However, instead of building a spam detector with an SVM algorithm, we take a brief, simple look at sentiment analysis: using a new tool, TextBlob, we add two columns to the dataset that score the polarity and subjectivity of each message's text.

In [1]:
# Importing dependencies

import pandas as pd
import re  # the standard-library module covers every pattern used below

In [2]:
#Loading dataset

df = pd.read_csv("https://raw.githubusercontent.com/4GeeksAcademy/NLP-project-tutorial/main/spam.csv")

In [3]:
#Encoding target variable

df['Category'] = df['Category'].apply(lambda x: 1 if x == 'spam' else 0)

In [4]:
# EDA: Establish some baseline counts

print(f"spam count: {len(df.loc[df.Category == 1])}")
print(f"not spam count: {len(df.loc[df.Category == 0])}")
print(df.shape)
df['Category'] = df['Category'].astype(int)  # make sure the target is stored as integers

spam count: 747
not spam count: 4825
(5572, 2)
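
The counts above mean that roughly 13% of the messages are spam (747 of 5572). As a minimal sketch, the same baseline can be read off with `value_counts`; the `toy` frame below is a made-up stand-in for the real dataset, not part of it.

```python
import pandas as pd

# Hypothetical stand-in for the encoded target column (1 = spam, 0 = not spam).
toy = pd.DataFrame({"Category": [0, 0, 0, 1]})

# Absolute counts per class, indexed by the class label.
counts = toy["Category"].value_counts()

# Relative frequencies per class (they sum to 1).
shares = toy["Category"].value_counts(normalize=True)

print("spam count:", counts.loc[1])
print("not spam count:", counts.loc[0])
print("spam share:", shares.loc[1])
```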


In [5]:
# Eliminate duplicate rows, then rebuild a clean 0..n-1 index.

df = df.drop_duplicates()
df = df.reset_index(drop=True)[['Message', 'Category']]
df.shape

(5157, 2)
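
Going from 5572 to 5157 rows means 415 exact duplicates were dropped. A small sketch of the same step on made-up data (the `toy` frame and its messages are illustrative only) shows why the index is rebuilt afterwards:

```python
import pandas as pd

# Hypothetical data: the third row is an exact duplicate of the first.
toy = pd.DataFrame({
    "Category": [0, 1, 0, 1],
    "Message": ["hi there", "win a prize", "hi there", "win a prize now"],
})

# drop_duplicates keeps the first occurrence of each fully identical row;
# reset_index(drop=True) renumbers the rows without keeping the old index.
deduped = toy.drop_duplicates().reset_index(drop=True)
removed = len(toy) - len(deduped)

print(deduped)
print("rows removed:", removed)
```

Without the reset, the surviving rows would keep their original labels (0, 1, 3 here), which matters for any later code that looks rows up by position.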

In [6]:
# NLP Cleaning process

clean_desc = []

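for message in df['Message']:
    desc = message.lower()

    # replace every non-letter character (punctuation, digits, symbols) with a space
    desc = re.sub('[^a-zA-Z]', ' ', desc)

    # remove tags (no effect at this point: '&' and ';' were already replaced above)
    desc = re.sub("&lt;/?.*?&gt;", " &lt;&gt; ", desc)

    # collapse each run of whitespace into a single space
    desc = re.sub("(\\d|\\W)+", " ", desc)

    clean_desc.append(desc)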

#assign the cleaned descriptions to the data frame
df['Message'] = clean_desc
        
df.head()

   Message                                            Category
0  go until jurong point crazy available only in ...  0
1  ok lar joking wif u oni                            0
2  free entry in a wkly comp to win fa cup final ...  1
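
The cleaning cell can also be packaged as a small reusable function. This is a sketch rather than the tutorial's own code: `clean_message` is a hypothetical name, and unlike the loop above it also strips leading and trailing spaces.

```python
import re

def clean_message(text):
    """Lowercase the text, keep ASCII letters only, and normalize spacing."""
    text = text.lower()
    # Replace anything that is not a lowercase ASCII letter with a space.
    text = re.sub(r"[^a-z]", " ", text)
    # Collapse each run of whitespace into a single space.
    text = re.sub(r"\s+", " ", text)
    # Assumption: trimming the ends is wanted (the loop above does not do this).
    return text.strip()

print(clean_message("Ok lar... Joking wif u oni..."))

# On the tutorial's dataframe this could be applied as:
#   df['Message'] = df['Message'].apply(clean_message)
```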


In [23]:
# After installing TextBlob from the terminal (e.g. pip install textblob), import it and test it with a phrase

from textblob import TextBlob
test = TextBlob("Scott really loves Alex, so he decided to take them to Disney Land!")
print(test.sentiment)

Sentiment(polarity=0.25, subjectivity=0.2)
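
TextBlob reports polarity on a scale from -1.0 (negative) to 1.0 (positive) and subjectivity from 0.0 (objective) to 1.0 (subjective). As a rough aid for reading such scores, the sketch below turns them into labels; `describe_sentiment` and its cutoffs (0.1 and 0.5) are assumptions made for illustration, not values defined by TextBlob.

```python
def describe_sentiment(polarity, subjectivity, threshold=0.1):
    """Map TextBlob-style scores to rough labels (cutoffs are illustrative)."""
    if polarity > threshold:
        tone = "positive"
    elif polarity < -threshold:
        tone = "negative"
    else:
        tone = "neutral"

    # Assumed midpoint split of the 0.0 to 1.0 subjectivity scale.
    stance = "mostly subjective" if subjectivity >= 0.5 else "mostly objective"
    return tone, stance

# Scores printed for the test phrase above.
print(describe_sentiment(0.25, 0.2))
```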


In [17]:
# Wrap each cleaned message in a TextBlob object
email_blob = [TextBlob(text) for text in df['Message']]

In [18]:
# Add two columns to the dataframe, one per sentiment metric (polarity and subjectivity)
df['tb_Pol'] = [b.sentiment.polarity for b in email_blob]
df['tb_Subj'] = [b.sentiment.subjectivity for b in email_blob]
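
The two list comprehensions above evaluate `sentiment` twice for every message. One possible variant scores each message once and unpacks the pair. In this sketch `add_sentiment_columns` and its `analyze` parameter are hypothetical names; `analyze` stands for any callable that returns a (polarity, subjectivity) pair, which is the shape of the `Sentiment` tuple printed earlier.

```python
import pandas as pd

def add_sentiment_columns(df, analyze, text_col="Message"):
    """Return a copy of df with tb_Pol and tb_Subj columns added.

    `analyze` maps a string to a (polarity, subjectivity) pair.
    """
    out = df.copy()
    # Score every message exactly once.
    scores = [analyze(text) for text in out[text_col]]
    out["tb_Pol"] = [polarity for polarity, _ in scores]
    out["tb_Subj"] = [subjectivity for _, subjectivity in scores]
    return out

# With TextBlob installed, the call could look like this:
#   from textblob import TextBlob
#   df = add_sentiment_columns(df, lambda text: TextBlob(text).sentiment)
```

Passing the analyzer in as an argument keeps the helper testable without TextBlob and leaves the original dataframe untouched.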

In [24]:
#Look at the first five rows of the dataframe
df.head()

   Message                                            Category  tb_Pol  tb_Subj
0  go until jurong point crazy available only in ...  0         0.15    0.7625
1  ok lar joking wif u oni                            0         0.5     0.5
2  free entry in a wkly comp to win fa cup final ...  1         0.4     0.733333
3  u dun say so early hor u c already then say        0         0.1     0.3
4  nah i don t think he goes to usf he lives arou...  0         0.0     0.0
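
With both columns in place, a natural next look is the average score per class. The sketch below rebuilds only the five rows shown above (the `sample` frame), so its numbers illustrate the call and say nothing reliable about the full dataset.

```python
import pandas as pd

# The five rows displayed by df.head(), minus the message text.
sample = pd.DataFrame({
    "Category": [0, 0, 1, 0, 0],
    "tb_Pol": [0.15, 0.5, 0.4, 0.1, 0.0],
    "tb_Subj": [0.7625, 0.5, 0.733333, 0.3, 0.0],
})

# Mean polarity and subjectivity for not spam (0) and spam (1).
summary = sample.groupby("Category")[["tb_Pol", "tb_Subj"]].mean()
print(summary)
```

On the tutorial's dataframe the same call would be `df.groupby('Category')[['tb_Pol', 'tb_Subj']].mean()`.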
