1. Importing dependencies:

In [47]:
import os
import json
from zipfile import ZipFile
import pandas as pd
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential, load_model
from tensorflow.keras.layers import Dense, Embedding, LSTM
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

2. Data Collection (Kaggle API):

In [2]:
# Use a context manager so the credentials file is closed after reading
with open('kaggle.json') as f:
    kaggle_dic = json.load(f)

In [3]:
kaggle_dic.keys()

dict_keys(['username', 'key'])

A- Set up the Kaggle API credentials as environment variables:

In [4]:
os.environ["KAGGLE_USERNAME"] = kaggle_dic["username"]
os.environ["KAGGLE_KEY"] = kaggle_dic["key"]
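The two assignments above assume that both keys are present in the file. A minimal sketch of a more defensive variant follows; the helper name `export_kaggle_credentials` and its `environ` parameter are hypothetical (the parameter only makes the helper easy to try out without touching the real environment).

```python
import json
import os


def export_kaggle_credentials(path="kaggle.json", environ=None):
    """Copy the Kaggle username and key from a JSON file into environment variables.

    Hypothetical helper for illustration; it is not part of the Kaggle client.
    """
    environ = os.environ if environ is None else environ
    # The context manager closes the file even if parsing fails.
    with open(path, encoding="utf-8") as handle:
        credentials = json.load(handle)

    # Fail early with a clear message if the file is incomplete.
    missing = {"username", "key"} - credentials.keys()
    if missing:
        raise KeyError(f"{path} is missing: {sorted(missing)}")

    environ["KAGGLE_USERNAME"] = credentials["username"]
    environ["KAGGLE_KEY"] = credentials["key"]
    return environ
```

Calling `export_kaggle_credentials()` with no arguments would have the same effect as the cell above, with an explicit error if a key is absent.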

B- Loading the dataset:

In [5]:
!kaggle datasets download -d lakshmi25npathi/imdb-dataset-of-50k-movie-reviews

Dataset URL: https://www.kaggle.com/datasets/lakshmi25npathi/imdb-dataset-of-50k-movie-reviews
License(s): other
imdb-dataset-of-50k-movie-reviews.zip: Skipping, found more recently modified local copy (use --force to force download)


C- Unzip the dataset file:

In [6]:
with ZipFile("imdb-dataset-of-50k-movie-reviews.zip", 'r') as zip_ref:
        zip_ref.extractall()
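The cell above extracts into the current working directory. As a hedged alternative, the sketch below extracts into a dedicated folder and returns the member names so the CSV file name can be checked before loading it; `extract_archive` and the default folder `data` are assumptions, not part of the notebook.

```python
from pathlib import Path
from zipfile import ZipFile


def extract_archive(zip_path, target_dir="data"):
    """Extract zip_path into target_dir and return the names of the extracted members."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with ZipFile(zip_path, "r") as archive:
        names = archive.namelist()
        archive.extractall(target)
    return names
```

With this variant the CSV would be read from inside the target folder (for example `data/IMDB Dataset.csv`) rather than from the working directory.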

3. Analyze the data:

In [7]:
data = pd.read_csv('IMDB Dataset.csv')

A- Dimensions of the data:

In [8]:
data.shape

(50000, 2)

B- Head of the data:

In [9]:
data.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,positive
1,A wonderful little production. <br /><br />The...,positive
2,I thought this was a wonderful way to spend ti...,positive
3,Basically there's a family where a little boy ...,negative
4,"Petter Mattei's ""Love in the Time of Money"" is...",positive


C- Distribution of the data:

In [10]:
data["sentiment"].value_counts()

sentiment
positive    25000
negative    25000
Name: count, dtype: int64

D- Encoding the labels:

In [11]:
# Map the string labels to integers (positive -> 1, negative -> 0)
data["sentiment"] = data["sentiment"].map({"positive": 1, "negative": 0})


In [12]:
data.head()

Unnamed: 0,review,sentiment
0,One of the other reviewers has mentioned that ...,1
1,A wonderful little production. <br /><br />The...,1
2,I thought this was a wonderful way to spend ti...,1
3,Basically there's a family where a little boy ...,0
4,"Petter Mattei's ""Love in the Time of Money"" is...",1


In [13]:
data["sentiment"].value_counts()

sentiment
1    25000
0    25000
Name: count, dtype: int64

E- Train/test split:

In [14]:
train_data, test_data = train_test_split(data, test_size=0.2, random_state=42)

In [15]:
print(train_data.shape)
print(test_data.shape)

(40000, 2)
(10000, 2)


4. Preprocessing the data:

A- Tokenize text data:

In [16]:
tokenizer = Tokenizer(num_words=5000)
tokenizer.fit_on_texts(train_data["review"])
X_train = pad_sequences(tokenizer.texts_to_sequences(train_data["review"]), maxlen=200)
X_test = pad_sequences(tokenizer.texts_to_sequences(test_data["review"]), maxlen=200)

In [17]:
print(X_train)

[[1935    1 1200 ...  205  351 3856]
 [   3 1651  595 ...   89  103    9]
 [   0    0    0 ...    2  710   62]
 ...
 [   0    0    0 ... 1641    2  603]
 [   0    0    0 ...  245  103  125]
 [   0    0    0 ...   70   73 2062]]


In [18]:
X_test

array([[   0,    0,    0, ...,  995,  719,  155],
       [  12,  162,   59, ...,  380,    7,    7],
       [   0,    0,    0, ...,   50, 1088,   96],
       ...,
       [   0,    0,    0, ...,  125,  200, 3241],
       [   0,    0,    0, ..., 1066,    1, 2305],
       [   0,    0,    0, ...,    1,  332,   27]])
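The leading zeros in the arrays above come from the default behaviour of `pad_sequences`, which pads and truncates at the start of each sequence. The pure-Python sketch below mimics that default for a single sequence; `pad_pre` and the token ids are made up for illustration and are not the Keras implementation.

```python
def pad_pre(sequence, maxlen, value=0):
    """Mimic the pad_sequences defaults for one sequence (maxlen must be positive)."""
    truncated = sequence[-maxlen:]                 # pre-truncation: keep the last maxlen tokens
    padding = [value] * (maxlen - len(truncated))  # pre-padding: fill the front with zeros
    return padding + truncated


short_review = [12, 7, 305]        # shorter than maxlen: zeros are added in front
long_review = list(range(1, 9))    # 8 token ids: the first 3 are dropped

print(pad_pre(short_review, maxlen=5))  # [0, 0, 12, 7, 305]
print(pad_pre(long_review, maxlen=5))   # [4, 5, 6, 7, 8]
```

With `maxlen=200`, reviews longer than 200 tokens therefore keep only their last 200 tokens, and shorter reviews start with zeros.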

B- Assigning the labels:

In [19]:
Y_train = train_data["sentiment"]
Y_test = test_data["sentiment"]

In [20]:
Y_train

39087    0
30893    0
45278    1
16398    0
13653    0
        ..
11284    1
44732    1
38158    0
860      1
15795    1
Name: sentiment, Length: 40000, dtype: int64

In [21]:
Y_test

33553    1
9427     1
199      0
12447    1
39489    0
        ..
28567    0
25079    1
18707    1
15200    0
5857     1
Name: sentiment, Length: 10000, dtype: int64

5. Neural Network LSTM (Long Short-Term Memory):

A- Build the model:

In [22]:
# Initialize the Sequential model
model = Sequential()
# Embedding layer: converts word indices into dense vectors of fixed dimension.
# The vocabulary is capped at 5000 words, each word is represented by a 120-dimensional vector,
# and every input has a fixed length of 200 tokens.
model.add(Embedding(input_dim=5000, output_dim=120, input_length=200))
# LSTM layer: processes the sequence of vectors produced by the Embedding layer.
# 64: dimensionality of the output space
# dropout=0.2: randomly drops 20% of the units in the input transformation during training to reduce overfitting
# recurrent_dropout=0.2: randomly drops 20% of the units in the recurrent connections
model.add(LSTM(64, dropout=0.2, recurrent_dropout=0.2))
# Fully connected layer that follows the LSTM layer
# 1: a single output unit
model.add(Dense(1, activation="sigmoid"))




B- Information about the model:

In [23]:
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 200, 120)          600000    
                                                                 
 lstm (LSTM)                 (None, 64)                47360     
                                                                 
 dense (Dense)               (None, 1)                 65        
                                                                 
Total params: 647425 (2.47 MB)
Trainable params: 647425 (2.47 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
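The parameter counts in the summary can be checked by hand. The sketch below recomputes them from the layer sizes; the helper functions are written only for this check.

```python
def embedding_params(vocab_size, embedding_dim):
    # One vector of size embedding_dim per vocabulary index
    return vocab_size * embedding_dim


def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights and a bias vector
    return 4 * ((input_dim + units) * units + units)


def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit
    return input_dim * units + units


counts = {
    "embedding": embedding_params(5000, 120),
    "lstm": lstm_params(120, 64),
    "dense": dense_params(64, 1),
}
total = sum(counts.values())

print(counts)  # {'embedding': 600000, 'lstm': 47360, 'dense': 65}
print(total)   # 647425
```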


C- Compile the model:

In [24]:
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
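For reference, binary cross-entropy averages `-(y*log(p) + (1-y)*log(1-p))` over the samples, where `y` is the 0/1 label and `p` the predicted probability of the positive class. The plain-Python sketch below illustrates the formula; it is not the Keras implementation, and the clipping constant `eps` is an assumption added to avoid `log(0)`.

```python
import math


def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy for 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # keep p strictly inside (0, 1)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


# Confident, correct predictions give a lower loss than uncertain ones.
confident = binary_crossentropy([1, 0], [0.9, 0.1])
uncertain = binary_crossentropy([1, 0], [0.5, 0.5])
print(confident < uncertain)  # True
```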




D- Training the model:

In [25]:
model.fit(X_train, Y_train, epochs=5, batch_size=32, validation_split=0.2)

Epoch 1/5
Epoch 2/5
Epoch 3/5
Epoch 4/5
Epoch 5/5

<keras.src.callbacks.History at 0x236a41697c0>

6. Model Evaluation:

A- Accuracy and loss:

In [26]:
loss, accuracy = model.evaluate(X_test, Y_test)
print(loss)
print(accuracy)

0.3312399983406067
0.8802000284194946
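Accuracy alone does not show whether the errors fall mostly on positive or on negative reviews. A minimal sketch of a fuller report from predicted probabilities follows; `binary_report` is a hypothetical helper, and the sample values in the usage note are invented, not results from this model.

```python
def binary_report(y_true, y_prob, threshold=0.5):
    """Return confusion counts, accuracy, precision and recall for 0/1 labels."""
    tp = fp = tn = fn = 0
    for y, p in zip(y_true, y_prob):
        predicted = 1 if p > threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return {
        "tp": tp, "fp": fp, "tn": tn, "fn": fn,
        "accuracy": (tp + tn) / len(y_true),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }
```

For example, `binary_report([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1])` counts one sample in each confusion cell. On the notebook's data it could presumably be called as `binary_report(Y_test.tolist(), model.predict(X_test).ravel().tolist())`.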


B- Saving the model:

In [46]:
model.save('sentiment_analysis_model.h5')
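The saved `.h5` file contains the network only; the fitted `tokenizer` is a separate Python object, so a fresh session also needs it to reproduce the same word indices. A minimal sketch using `pickle` follows, assuming the fitted tokenizer can be pickled (a common practice, but not something this notebook verifies); the helper names and the file name `tokenizer.pkl` are hypothetical.

```python
import pickle


def save_object(obj, path):
    """Serialize obj to path with pickle."""
    with open(path, "wb") as handle:
        pickle.dump(obj, handle, protocol=pickle.HIGHEST_PROTOCOL)


def load_object(path):
    """Load a pickled object; only use this on files you created yourself."""
    with open(path, "rb") as handle:
        return pickle.load(handle)
```

Under that assumption, `save_object(tokenizer, "tokenizer.pkl")` after training and `tokenizer = load_object("tokenizer.pkl")` before prediction would keep the word indices consistent between sessions.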


7. Building a predictive system:

A- Loading the model:

In [48]:
model = load_model('sentiment_analysis_model.h5')

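B- Sentiment prediction function:

In [49]:
def predict_sentiment(review):
    """Return "positive" or "negative" for a raw review string."""
    # Tokenize and pad the review the same way the training data was prepared
    sequence = tokenizer.texts_to_sequences([review])
    padded_sequence = pad_sequences(sequence, maxlen=200)
    # The sigmoid output is the predicted probability of the positive class
    prediction = model.predict(padded_sequence)
    sentiment = "positive" if prediction[0][0] > 0.5 else "negative"
    return sentiment


C- Examples: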

In [50]:
review = "Bad movie"
sentiment = predict_sentiment(review)
print(sentiment)

negative


In [51]:
new_review = "This movie was ok but not that good."
sentiment = predict_sentiment(new_review)
print(f"The sentiment of the review is: {sentiment}")

The sentiment of the review is: negative
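A mixed review like the one above is a case where the hard label hides how close the model's output is to the 0.5 threshold. The sketch below shows one way to report the label together with a confidence value; `describe_prediction` is a hypothetical helper, and the probabilities are made-up inputs, not outputs of this model.

```python
def describe_prediction(probability, threshold=0.5):
    """Return the label and the probability assigned to that label."""
    label = "positive" if probability > threshold else "negative"
    confidence = probability if label == "positive" else 1.0 - probability
    return label, round(confidence, 3)


print(describe_prediction(0.93))  # ('positive', 0.93)
print(describe_prediction(0.48))  # ('negative', 0.52)
```

In the notebook, the probability would be the `prediction[0][0]` value computed inside `predict_sentiment`.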


In [52]:
review = "I like this movie wowww"
sentiment = predict_sentiment(review)
print(sentiment)

positive
