Sentiment analysis using Naive Bayes.

Download link: https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews?select=IMDB+Dataset.csv

About Dataset
IMDB dataset of 50K movie reviews for natural language processing or text analytics.
This is a dataset for binary sentiment classification containing substantially more data than previous benchmark datasets. We provide a set of 25,000 highly polar movie reviews for training and 25,000 for testing. So, predict the number of positive and negative reviews using either classification or deep learning algorithms.
For more information on the dataset, see the following link:
http://ai.stanford.edu/~amaas/data/sentiment/

In [20]:
import pandas as pd
import re
from nltk.corpus import stopwords   # needs the corpus once: nltk.download('stopwords')
from nltk.stem.porter import PorterStemmer

In [2]:
df = pd.read_csv("../../Downloads/Data for ML/IMDB Dataset.csv")
print(df.head(3))

                                              review sentiment
0  One of the other reviewers has mentioned that ...  positive
1  A wonderful little production. <br /><br />The...  positive
2  I thought this was a wonderful way to spend ti...  positive


Data info.

--> Check for missing values.

In [3]:
df.info()                       # There are no missing values present in the data. 

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   review     50000 non-null  object
 1   sentiment  50000 non-null  object
dtypes: object(2)
memory usage: 781.4+ KB
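
`df.info()` shows non-null counts, which is where the "no missing values" conclusion comes from. A more direct check is to count nulls per column and, while at it, look at the class balance. The sketch below uses a small made-up frame (`toy` is hypothetical, not the IMDB data) so that it runs on its own.

```python
import pandas as pd

# Hypothetical stand-in for the real DataFrame; one review is deliberately missing.
toy = pd.DataFrame({
    "review": ["great film", None, "dull plot", "loved it"],
    "sentiment": ["positive", "negative", "negative", "positive"],
})

# Number of missing values in each column.
missing = toy.isnull().sum()
print(missing)

# Number of rows per class; a balanced dataset has equal counts.
balance = toy["sentiment"].value_counts()
print(balance)
```

On the real data the same two calls would be `df.isnull().sum()` and `df['sentiment'].value_counts()`.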


Text preprocessing. 

--> Removal of HTML tags.
--> Removal of special characters.
--> Converting everything to lower case.
--> Removal of stop words. 
--> Stemming. 

In [75]:
# Removal of html tags. 
def rem_html(text: str) -> str:
    eg_text = re.compile("<.*?>")   
    return re.sub(eg_text, " ", text)

def rem_specialChar(text: str) -> str:
    # Specific contractions. Matching is case-sensitive, so a capitalised
    # "Won't" falls through to the general "n't" rule and becomes "Wo not".
    phrase = re.sub(r"won't", r"will not", text)
    phrase = re.sub(r"don't", r"do not", phrase)
    phrase = re.sub(r"can't", r"can not", phrase)

    # General contractions. Note that "'s" also matches possessives ("film's").
    phrase = re.sub(r"n't", r" not", phrase)
    phrase = re.sub(r"'s", r" is", phrase)
    phrase = re.sub(r"'m", r" am", phrase)
    phrase = re.sub(r"'re", r" are", phrase)
    phrase = re.sub(r"'ll", r" will", phrase)
    phrase = re.sub(r"'t", r" not", phrase)
    phrase = re.sub(r"'ve", r" have", phrase)

    # Remove punctuation. Inside [...] the "|" is a literal character, not alternation.
    phrase = re.sub(r'[?|!|\'|"|#|@|:]', r'', phrase)
    phrase = re.sub(r'[(|)|.|,|\|/]', r'', phrase)
    phrase = re.sub(r'-', r' ', phrase)

    # Replace every remaining run of non-alphanumeric characters with a space.
    phrase = re.sub(r'[^A-Za-z0-9]+', r' ', phrase)

    # Remove every token that contains a digit.
    phrase = re.sub(r'\S*\d\S*', r'', phrase)
    return phrase

def lowercase(text: str) -> str:
    return text.lower()

def tokens(text: str) -> list[str]:
    return text.split()

def rem_stopwords(text: list[str]) -> list[str]:
    # A set gives constant-time membership tests; the NLTK list is scanned linearly.
    stop = set(stopwords.words('english'))
    return [word for word in text if word not in stop]

def rem_stemmer(text: list[str]) -> list[str]:
    stemmer = PorterStemmer()
    return [stemmer.stem(word) for word in text]

def join_words(text: list[str]) -> str:
    return " ".join(text)
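
To see what the cleaning steps do to a single review, here is a minimal standard-library sketch. It is not the notebook's exact pipeline: `clean_review` and `TOY_STOPWORDS` are hypothetical names, the stop-word list is a three-word stand-in for NLTK's, and stemming is left out so that no NLTK download is needed.

```python
import re

# Hypothetical, tiny stop-word list used only for this illustration.
# The notebook itself uses NLTK's English list, which is much longer.
TOY_STOPWORDS = {"a", "the", "is"}

def clean_review(text: str) -> list[str]:
    text = re.sub(r"<.*?>", " ", text)     # drop HTML tags such as <br />
    text = text.lower()                    # lowercase everything
    text = re.sub(r"n't", " not", text)    # isn't -> is not
    text = re.sub(r"[^a-z]+", " ", text)   # any run of non-letters becomes one space
    return [w for w in text.split() if w not in TOY_STOPWORDS]

sample = "A wonderful little production. <br /><br />The filming isn't bad!"
result = clean_review(sample)
print(result)  # ['wonderful', 'little', 'production', 'filming', 'not', 'bad']
```

One thing worth knowing: NLTK's English stop-word list contains "not", so in the notebook's pipeline the negation produced by expanding "isn't" is removed again at the stop-word step.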

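Here the text preprocessing is applied to the whole dataset before any train/test split. Doing this with a step that is fitted on the data would cause data leakage, but every step here is a fixed, per-review transformation that learns nothing from other reviews, so it is fine for this text data.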
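
A small sketch of why this kind of cleaning is safe before a split, using made-up reviews and a hypothetical helper `strip_and_lower`. Cleaning first and splitting first give identical results, whereas a vocabulary built from all reviews picks up words that only occur in the test part.

```python
import re

def strip_and_lower(text: str) -> str:
    # A fixed rule: the result for one review never depends on any other review.
    return re.sub(r"<.*?>", " ", text).lower()

reviews = ["Great <br />film", "Dull PLOT", "Loved it", "Not <b>good</b>"]  # made-up data

# Option 1: clean everything, then split.
cleaned_all = [strip_and_lower(r) for r in reviews]
train_a, test_a = cleaned_all[:3], cleaned_all[3:]

# Option 2: split first, then clean each part separately.
train_b = [strip_and_lower(r) for r in reviews[:3]]
test_b = [strip_and_lower(r) for r in reviews[3:]]

print(train_a == train_b and test_a == test_b)  # True

# A fitted step behaves differently: the vocabulary depends on which rows it sees.
vocab_all = {w for r in cleaned_all for w in r.split()}
vocab_train = {w for r in train_b for w in r.split()}
print(sorted(vocab_all - vocab_train))  # ['good', 'not']
```

The second print shows the words that would reach the model only through the test rows if the vocabulary were built before splitting.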

In [74]:
df['review'] = df['review'].apply(rem_html)
df['review'] = df['review'].apply(rem_specialChar)
df['review'] = df['review'].apply(lowercase)
df['review'] = df['review'].apply(tokens)
df['review'] = df['review'].apply(rem_stopwords)
df['review'] = df['review'].apply(rem_stemmer)
print(df.head(3))

                                              review sentiment
0  [one, review, mention, watch, oz, episod, hook...  positive
1  [wonder, littl, product, film, techniqu, unass...  positive
2  [thought, wonder, way, spend, time, hot, summe...  positive


In [77]:
df['review'] = df['review'].apply(join_words)

Now convert the sentiments into 0 (negative) and 1 (positive).

In [79]:
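# Assign the result back instead of using inplace=True on a column selection,
# which newer pandas versions warn about and may not apply to df.
df['sentiment'] = df['sentiment'].map({'positive': 1, 'negative': 0})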
print(df.head(3))

                                              review  sentiment
0  one review mention watch oz episod hook right ...          1
1  wonder littl product film techniqu unassum old...          1
2  thought wonder way spend time hot summer weeke...          1
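
When encoding labels it helps to confirm that every value was recognised. The sketch below, on a made-up column (`toy` is hypothetical), relies on the fact that `Series.map` with a dictionary returns NaN for values that are not keys, while `Series.replace` would leave them unchanged.

```python
import pandas as pd

toy = pd.DataFrame({"sentiment": ["positive", "negative", "Positive"]})  # made-up labels

# map() returns NaN for any value that is not a key in the dictionary.
encoded = toy["sentiment"].map({"positive": 1, "negative": 0})

# Count labels that did not match; here the capitalised "Positive" is the culprit.
unmatched = int(encoded.isna().sum())
print(unmatched)  # 1
```

A count of zero on the real column would confirm that only the two expected labels are present.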


Now convert each unique word into its own feature. This builds our own vocabulary (dictionary) from the words that occur in the data.

In [81]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer()
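# Keep the sparse matrix: a dense int64 array of this shape would need about 39 GB.
# Call .toarray() only if a later step requires a dense array.
X = cv.fit_transform(df['review'])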
print(X.shape)
print(f"Number of data points: {X.shape[0]}")
print(f"Number of unique words/features: {X.shape[1]}")

(50000, 97590)
Number of data points: 50000
Number of unique words/features: 97590
