# Introduction

Natural Language Processing (NLP) is the branch of Artificial Intelligence that gives machines the ability to read, understand, and derive meaning from human languages.
This project aims to train a logistic regression model for sentiment analysis.

# About the dataset

This dataset was created for the paper 'From Group to Individual Labels using Deep Features' (Kotzias et al., KDD 2015). It contains sentences labelled with positive or negative sentiment, extracted from reviews of products, movies, and restaurants. The score is either 1 (positive) or 0 (negative).
The sentences come from three different websites/fields: imdb.com, amazon.com, and yelp.com.

In [1]:
!pip install nltk



# Import necessary libraries

In [2]:
import pandas as pd
import nltk
import re  #regular expression

# Read the dataset

In [None]:
path= r"C:\Users\HP\Desktop\FLEXISAF\MODULE_3\sentiment labelled sentences"

In [4]:
pd.read_csv(f"{path}/yelp_labelled.txt", delimiter='\t', header=None)

Unnamed: 0,0,1
0,Wow... Loved this place.,1
1,Crust is not good.,0
2,Not tasty and the texture was just nasty.,0
3,Stopped by during the late May bank holiday of...,1
4,The selection on the menu was great and so wer...,1
...,...,...
995,I think food should have flavor and texture an...,0
996,Appetite instantly gone.,0
997,Overall I was not impressed and would not go b...,0
998,"The whole experience was underwhelming, and I ...",0


In [5]:
websites= ['amazon_cells_labelled.txt', 'yelp_labelled.txt', 'imdb_labelled.txt']

In [6]:
# read the data from each website into a dataframe and concatenate the data frames

frames = [pd.read_csv(f"{path}/{website}", delimiter='\t', header=None) for website in websites]
df = pd.concat(frames, axis=0)

In [7]:
# check the size of the data (number of rows and columns)
df.shape

(2748, 2)

In [8]:
# assign column names to the dataframe for clarity and understanding
df.columns = ['Review', 'Sentiment']

In [9]:
# check the first five rows in the df to get an insight of the data
df.head()

Unnamed: 0,Review,Sentiment
0,So there is no way for me to plug it in here i...,0
1,"Good case, Excellent value.",1
2,Great for the jawbone.,1
3,Tied to charger for conversations lasting more...,0
4,The mic is great.,1


# Text Preprocessing

Stopwords are common words that are often removed from text data during the preprocessing step in natural language processing tasks. These words are considered to have little to no semantic meaning and are often ignored to focus on the more important words in the text. Examples of stopwords include "the", "is", "and", "in", "of", etc.

Tokenization is the process of breaking down a text into smaller units, such as words or phrases, known as tokens. These tokens serve as the basic building blocks for further analysis in natural language processing tasks.

NLTK provides resources for performing these preprocessing tasks on our data.
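
As a minimal sketch of the two steps described above, the snippet below splits a sentence into word tokens and then filters out stopwords. The names `tokenize`, `remove_stopwords`, and `SAMPLE_STOPWORDS` are illustrative assumptions; the small hand-picked set stands in for NLTK's English list so that the example runs without any downloads.

```python
import re

# Hypothetical, hand-picked stand-in for NLTK's English stopword list (assumption for illustration)
SAMPLE_STOPWORDS = {"the", "is", "and", "in", "of", "was", "not"}

def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stopwords(tokens, stop_words=SAMPLE_STOPWORDS):
    """Keep only the tokens that are not in the stopword set."""
    return [token for token in tokens if token not in stop_words]

tokens = tokenize("The texture of the crust was not good.")
print(tokens)                    # ['the', 'texture', 'of', 'the', 'crust', 'was', 'not', 'good']
print(remove_stopwords(tokens))  # ['texture', 'crust', 'good']
```

Note that negation words such as "not" are part of NLTK's English stopword list, so removing stopwords can drop cues that matter for sentiment. Whether to keep them is a design choice worth testing.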

In [10]:
# download stopwords and punkt from nltk
nltk.download('stopwords')
nltk.download('punkt')   #(for tokenization)

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\stopwords.zip.
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\HP\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.


True

In [1]:
# Import the stopwords corpus from NLTK
from nltk.corpus import stopwords

Define a function (preprocessor) to perform the cleaning tasks on the data.

In [13]:

# create a set of English stopwords (named stop_words so it does not shadow the imported nltk corpus)
stop_words = set(stopwords.words('english'))

# define a function that takes a document (string) as input
def preprocessor(doc):
    doc = doc.lower().strip()              # convert to lowercase and strip surrounding whitespace
    # add a space after '.', ',' and '!' so that words joined by punctuation stay separate
    for mark in ('.', ',', '!'):
        doc = doc.replace(mark, f'{mark} ')
    doc = re.sub(r"[^a-z\s]", "", doc)     # remove non-alphabetic characters
    tokens = [word for word in doc.split() if word not in stop_words]   # split into words and drop stopwords
    return " ".join(tokens)                # join the remaining words back into a single string


In [14]:
# apply the function to the Review column and store the result in a new column
df['cleaned_review']=df['Review'].apply(preprocessor)

In [15]:
df

Unnamed: 0,Review,Sentiment,cleaned_review
0,So there is no way for me to plug it in here i...,0,way plug us unless go converter
1,"Good case, Excellent value.",1,good case excellent value
2,Great for the jawbone.,1,great jawbone
3,Tied to charger for conversations lasting more...,0,tied charger conversations lasting minutes maj...
4,The mic is great.,1,mic great
...,...,...,...
743,I just got bored watching Jessice Lange take h...,0,got bored watching jessice lange take clothes
744,"Unfortunately, any virtue in this film's produ...",0,unfortunately virtue films production work los...
745,"In a word, it is embarrassing.",0,word embarrassing
746,Exceptionally bad!,0,exceptionally bad


In [16]:
# assign cleaned_review to X(feature) and sentiment to y(target)
X = df['cleaned_review']
y= df['Sentiment']

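A vectorizer converts text data into numerical vectors that machine learning algorithms can operate on. It is fitted on the text to learn the vocabulary and then transforms each document into a numerical vector. We will use the TF-IDF vectorizer, which weights each word by how often it appears in a document and down-weights words that appear in many documents.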
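
To make the weighting concrete, here is a rough hand-rolled sketch of TF-IDF on a toy corpus. It assumes the defaults documented for scikit-learn's `TfidfVectorizer` (raw term counts, smoothed inverse document frequency, then L2 normalisation). The corpus and the helper name `tfidf_vector` are hypothetical, and the library's own tokenisation rules are not reproduced here.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, already cleaned) used only for illustration
docs = ["great food", "bad food", "great great service"]

# Learn the vocabulary: every distinct word, sorted alphabetically
vocabulary = sorted({word for doc in docs for word in doc.split()})

# Document frequency: the number of documents that contain each word
doc_freq = {word: sum(word in doc.split() for doc in docs) for word in vocabulary}

def tfidf_vector(doc):
    """Return the L2-normalised TF-IDF vector of one document."""
    counts = Counter(doc.split())
    n_docs = len(docs)
    # Term count times smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    weights = [
        counts[word] * (math.log((1 + n_docs) / (1 + doc_freq[word])) + 1)
        for word in vocabulary
    ]
    norm = math.sqrt(sum(w * w for w in weights))
    return [w / norm for w in weights]

print(vocabulary)   # ['bad', 'food', 'great', 'service']
for doc in docs:
    print(doc, [round(value, 3) for value in tfidf_vector(doc)])
```

In the second document, "bad" receives a larger weight than "food" because "food" appears in two of the three documents and is therefore less distinctive.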

In [17]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [18]:
vectorizer= TfidfVectorizer()

In [19]:
X= vectorizer.fit_transform(X)

In [21]:
# Get the feature names (words) learned by the vectorizer
vectorizer.get_feature_names_out()

array(['aailiyah', 'abandoned', 'abhor', ..., 'zombie', 'zombiestudents',
       'zombiez'], dtype=object)

In [23]:
# view the TF-IDF matrix as a dataframe with one column per word
pd.DataFrame(X.toarray(), columns= vectorizer.get_feature_names_out())

Unnamed: 0,aailiyah,abandoned,abhor,ability,able,abound,abroad,absolute,absolutel,absolutely,...,yukon,yum,yummy,yun,za,zero,zillion,zombie,zombiestudents,zombiez
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2743,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2744,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2745,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2746,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


# Model Building

In [31]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

In [25]:
# split the data into train and test set
X_train, X_test, y_train, y_test= train_test_split(X,y, test_size= 0.2, random_state = 1)
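
The cells above fit the vectorizer on every review before splitting, so vocabulary and IDF statistics from the test reviews are already part of the features. One common alternative, sketched here on hypothetical toy data rather than on the notebook's dataframe, is to split the raw text first and let a scikit-learn pipeline fit the vectorizer on the training portion only. The names `texts`, `labels`, `text_train`, `text_test`, and `pipeline` are assumptions made for this example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

# Hypothetical cleaned reviews and labels, standing in for df['cleaned_review'] and df['Sentiment']
texts = ["great phone", "bad battery", "excellent value", "terrible service",
         "loved place", "awful food", "great mic", "poor quality"]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

# Split the raw text first, so the test reviews stay unseen during fitting
text_train, text_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=1
)

# The pipeline fits the vectorizer and the classifier on the training text only
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())
pipeline.fit(text_train, y_train)

predictions = pipeline.predict(text_test)
print(len(text_train), len(text_test))   # 6 2
```

With this arrangement, `pipeline.predict` applies the vocabulary learned from the training text to new reviews, which is closer to how the model would be used on unseen data.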

In [29]:
lr_model = LogisticRegression()
lr_model.fit(X_train, y_train)
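
As a rough illustration of what a fitted logistic regression computes, the sketch below takes a weighted sum of the features and passes it through the sigmoid function to obtain the probability of the positive class. The weights, bias, and feature values are hypothetical and are not taken from `lr_model`.

```python
import math

def sigmoid(z):
    """Map any real number to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """Probability of the positive class for one feature vector."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)

# Hypothetical weights for the words ['bad', 'food', 'great'] (not taken from the fitted model)
weights = [-2.0, 0.1, 2.5]
bias = 0.0

p_positive = predict_proba([0.0, 0.6, 0.8], weights, bias)   # a review mentioning 'food' and 'great'
p_negative = predict_proba([0.8, 0.6, 0.0], weights, bias)   # a review mentioning 'bad' and 'food'
print(p_positive > 0.5, p_negative > 0.5)   # True False
```

A review is labelled positive (1) when this probability exceeds 0.5 and negative (0) otherwise.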

In [30]:
predictions= lr_model.predict(X_test)

In [33]:
model_accuracy = accuracy_score(y_test, predictions)   # reuse the predictions computed in the previous cell
print(f"The accuracy of the model is {model_accuracy}")

The accuracy of the model is 0.8109090909090909
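
Accuracy is simply the fraction of test reviews whose predicted label matches the true label. The sketch below computes it by hand on hypothetical labels; the helper name `accuracy` is an assumption for this example and is not part of scikit-learn.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    matches = sum(1 for true, pred in zip(y_true, y_pred) if true == pred)
    return matches / len(y_true)

# Hypothetical labels and predictions, used only for illustration
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]
print(accuracy(y_true, y_pred))   # 0.8

# The score reported above is consistent with 446 correct predictions out of 550 test reviews
print(446 / 550)
```

With 2748 rows and `test_size=0.2`, the test set holds 550 reviews, so the reported accuracy corresponds to roughly 81 correct predictions per 100 reviews.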
