Imports pandas for data manipulation and analysis

Imports the nltk library and specific modules for natural language processing tasks:
    stopwords: Provides a list of common stopwords in various languages.
    word_tokenize: Splits text into individual words (tokens)

Imports the re module, which provides support for regular expressions. 
Regular expressions are used for string searching and manipulation.

Imports the PorterStemmer class from nltk.stem. 
The Porter Stemmer is used for stemming, which is the process of reducing words to their base or root form.

Imports the TfidfVectorizer from sklearn.feature_extraction.text. This vectorizer converts a collection of raw documents to a matrix of TF-IDF features, which is useful for text analysis and classification tasks.

Imports the train_test_split function from sklearn.model_selection. This function is used to split a dataset into training and testing sets.

Imports modules from tensorflow.keras for building and training neural network models:
    Sequential: A linear stack of layers.
    Dense: A fully connected neural network layer.
    BatchNormalization: Layer that normalizes inputs across the batch.
    Dropout: Layer that randomly drops a fraction of the input units during training to prevent overfitting.
    Adam: An optimizer that implements the Adam algorithm.

In [2]:
import pandas as pd

from collections import Counter
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
nltk.download('punkt')
nltk.download('stopwords')
import re

from nltk.stem import PorterStemmer

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, BatchNormalization, Dropout
from tensorflow.keras.optimizers import Adam



[nltk_data] Downloading package punkt to C:\Users\BHARAT
[nltk_data]     JHAWAR\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.
[nltk_data] Downloading package stopwords to C:\Users\BHARAT
[nltk_data]     JHAWAR\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


Reading the data

In [3]:
df=pd.read_csv("D:\sentiment_analysis_50000 imdb_Reviews\IMDB Dataset of 50K Movie Reviews\IMDB Dataset.csv")
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


Dropping the duplicates rows

In [4]:
df = df.drop_duplicates(subset=['review'], keep='first')
df.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


Replacing 1 for positive and 0 for negative

In [5]:
df['sentiment'] = df['sentiment'].replace({'positive': 1, 'negative': 0})
df.head(1)

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,1


All the letters are converted to lower case and most of the unneccesary words,i.e. stopwords, special character, links, html tags, extra spaces are removed for better model training

In [6]:
def clean_review(text):
    text = text.lower() # Convert text to lowercase
    text = re.sub(r'http\S+|www\S+|https\S+', '', text, flags=re.MULTILINE) # Remove URLs
    text = re.sub(r'<.*?>', '', text) # Remove HTML tags
    text = re.sub(r'[^a-z\s]', '', text) # Remove special characters and numbers
    text = re.sub('<br />', '',text) #removing break tags
    text = re.sub(r'\s+', ' ', text).strip() # Remove extra spaces
    #removing stopwords
    tokens=word_tokenize(text)
    tokens = [word for word in tokens if word.isalnum() and word not in stopwords.words('english')]
    return " ".join(tokens)

df['review'] = df['review'].apply(clean_review)

Here converting the words to there root form

In [7]:
stemmer=PorterStemmer()
def stem(text):
    y=[]
    for i in text.split():
        y.append(stemmer.stem(i))
    return " ".join(y)

df.review = df['review'].apply(lambda x: stem(x))

Downloading the clean data for not having future harrasments

In [12]:
df.to_csv('cleaned_data.csv')

Defining the neural network model

In [13]:
from tensorflow.keras.callbacks import EarlyStopping

X=df['review']
Y=df['sentiment']

vect = TfidfVectorizer()
X=vect.fit_transform(df['review'])

X_train, X_test, Y_train, Y_test = train_test_split(X,Y,test_size=0.2,random_state=27)

X_train= X_train[:2000]
Y_train= Y_train[:2000]
X_test= X_test[:500]
Y_test= Y_test[:500]

#converting X_train and X_test into array format
X_train = X_train.toarray()
X_test = X_test.toarray()

model = Sequential()
model.add(Dense(units=16,activation='relu',input_dim=X_train.shape[1]))
model.add(BatchNormalization())
model.add(Dense(units=8, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(units=1,activation='sigmoid'))

model.compile(optimizer='adam', loss = 'binary_crossentropy',metrics=['accuracy'])

early_stopping = EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)

history = model.fit(X_train, Y_train, batch_size=10,epochs=15, validation_split=0.2, callbacks=[early_stopping])

model.summary()

test_loss,test_acc = model.evaluate(X_test, Y_test)
print(f"Test_loss: {test_loss}, Test_Accuracy: {test_acc}")

Epoch 1/15
Epoch 2/15
Epoch 3/15
Epoch 4/15
Epoch 5/15
Epoch 6/15
Epoch 7/15
Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 dense_3 (Dense)             (None, 16)                2793680   
                                                                 
 batch_normalization (BatchN  (None, 16)               64        
 ormalization)                                                   
                                                                 
 dense_4 (Dense)             (None, 8)                 136       
                                                                 
 dropout (Dropout)           (None, 8)                 0         
                                                                 
 dense_5 (Dense)             (None, 1)                 9         
                                                                 
Total params: 2,793,889
Trainable params: 2

Using the deep learning model Sequential 