### Problem Statement: Text Generation Using RNN on Netflix Reviews

### Objective:
- To build a text generation model using Recurrent Neural Networks (RNN) that can generate text mimicking the style of Netflix user reviews.

- Using Netflix reviews, the model is trained to predict the next word in a sequence from the preceding words. The project explores how well RNNs handle natural language generation, specifically predicting the flow of sentences in a review context.

### 1. Importing Necessary Libraries

In [67]:
import pandas as pd
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import InputLayer, Embedding, SimpleRNN, Dense
import numpy as np
import pickle

- pandas: For data manipulation and reading the CSV file.
- keras: For building and training the neural network.
- numpy: For numerical operations.
- pickle: For saving and loading the model and tokenizer.

### 2. Loading and Preparing the Data

In [69]:
data = pd.read_csv(r"D:\netflix_reviews.csv", usecols=['content'])

In [139]:
data.shape

(108664, 1)

In [141]:
data.head()

Unnamed: 0,content
0,There is problem playing this video. Please tr...
1,Netflix is the awesomest app there is when it ...
2,Aside from the cost being ridiculous.. audio k...
3,Dear Netflix... Please it's about time you put...
4,Terrible


In [83]:
data['content'][1]

"Netflix is the awesomest app there is when it comes to entertainment. And I mean in every genre there is, second to none. If its emersive entertainment you seek then youve arrived at your see all destination. Its worth every penny you spend for it. You can uninstall all those other generic apps that claim to entertain. Ive taken the time to read other reviews, by other users. The things that're bothering people are technical issues regarding their phones, (for inst) Video freeze, MT the cache."

### 3. Text Tokenization

In [85]:
tk = Tokenizer( filters='!"$%&()*+,#,-./:;<=>?@[\\]^_`{|}~\t\n')

In [87]:
tk.fit_on_texts(data['content'][:1000])

- Tokenizer: Converts words to numerical indices.
- filters: Specifies characters to filter out from the text.
- fit_on_texts: Updates internal vocabulary based on the first 1000 reviews.
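
As a rough illustration of what `fit_on_texts` does, the pure-Python sketch below lowercases each text, replaces the filtered characters with spaces, splits on whitespace, and ranks words by frequency so that the most frequent word gets index 1. The helper name `build_word_index` and the two sample reviews are made up for this example; the real `Tokenizer` has more options and should be treated as the reference.

```python
from collections import Counter

# Same filter characters as the Tokenizer call above (note: no apostrophe).
FILTERS = '!"$%&()*+,#,-./:;<=>?@[\\]^_`{|}~\t\n'

def build_word_index(texts, filters=FILTERS):
    """Hypothetical helper: map each word to a 1-based rank by frequency."""
    table = str.maketrans({ch: " " for ch in filters})
    counts = Counter()
    for text in texts:
        counts.update(text.lower().translate(table).split())
    # Most frequent word gets index 1; index 0 stays free for padding.
    ranked = sorted(counts, key=lambda w: -counts[w])
    return {word: i for i, word in enumerate(ranked, start=1)}

sample_reviews = ["Great app, great shows!", "The app keeps crashing."]
word_index = build_word_index(sample_reviews)
print(word_index)
```

Because indexing starts at 1, a vocabulary of 2760 words uses the indices 1 to 2760, and index 0 is never assigned to a word.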

In [89]:
len(tk.index_word)

2760

- The vocabulary contains 2760 unique words.

### 4. Converting Texts to Sequences

In [91]:
data1 = tk.texts_to_sequences(data['content'][:1000])

- Converts each review into a sequence of integers representing word indices.
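
A minimal sketch of this lookup step, assuming a small hand-written `word_index` (hypothetical values, not the notebook's vocabulary). Words that are missing from the vocabulary are dropped, which is also how the Keras `Tokenizer` behaves when no `oov_token` is set.

```python
# Hypothetical vocabulary for illustration only.
word_index = {"netflix": 1, "is": 2, "great": 3}

def to_sequence(text, word_index):
    """Replace each known word with its index; skip unknown words."""
    return [word_index[w] for w in text.lower().split() if w in word_index]

print(to_sequence("Netflix is great", word_index))      # [1, 2, 3]
print(to_sequence("Netflix is brilliant", word_index))  # [1, 2]
```

This matters at generation time: a seed word that never appeared in the first 1000 reviews contributes nothing to the input sequence.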

### 5. Preparing Input and Output Sequences

In [93]:
X = []
y = []
for li in data1:
    for ind in range(1,len(li)):
        X.append(li[:ind])
        y.append(li[ind])       

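- Purpose: create input-output pairs for training.
- X: prefixes of each review (its first 1, 2, 3, ... words).
- y: the word that follows each prefix, i.e. the next word to predict.
- The loop walks through each tokenized review and, for every position after the first, stores the words before that position as the input and the word at that position as the label.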
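
The same loop can be wrapped in a small function and checked on a short sequence. This is a sketch: the name `make_pairs` is introduced here for illustration, and the demo input is the first four word indices of the first review as they appear in the notebook output.

```python
def make_pairs(sequences):
    """Build (prefix, next word) training pairs from tokenized reviews."""
    X, y = [], []
    for seq in sequences:
        for i in range(1, len(seq)):
            X.append(seq[:i])   # everything before position i
            y.append(seq[i])    # the word at position i is the target
    return X, y

X_demo, y_demo = make_pairs([[100, 6, 118, 305]])
print(X_demo)  # [[100], [100, 6], [100, 6, 118]]
print(y_demo)  # [6, 118, 305]
```

A review of n words yields n - 1 pairs, so a one-word review such as "Terrible" contributes no training examples.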

In [95]:
len(X)

14499

In [97]:
len(y)

14499

In [99]:
pd.DataFrame({'X':X,'y':y})

Unnamed: 0,X,y
0,[100],6
1,"[100, 6]",118
2,"[100, 6, 118]",305
3,"[100, 6, 118, 305]",12
4,"[100, 6, 118, 305, 12]",119
...,...,...
14494,"[214, 381, 117, 2758, 55, 2, 226, 12, 186, 245...",1091
14495,"[214, 381, 117, 2758, 55, 2, 226, 12, 186, 245...",2
14496,"[214, 381, 117, 2758, 55, 2, 226, 12, 186, 245...",2760
14497,"[214, 381, 117, 2758, 55, 2, 226, 12, 186, 245...",470


In [101]:
X[0:3]

[[100], [100, 6], [100, 6, 118]]

In [103]:
y[:3]

[6, 118, 305]

### 6. Padding the Sequences

In [105]:
fv = pad_sequences(X,padding="pre")

- pad_sequences: Pads sequences to the same length.
- padding='pre': Pads zeros at the beginning.
- fv.shape: (14499, 114)
- All sequences are padded to a length of 114 (the length of the longest sequence).
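
To make the effect of `padding="pre"` concrete, here is a pure-Python sketch of the same idea. The function `pad_pre` is a stand-in written for this example, not the Keras implementation; like the Keras default, it also drops words from the front when a sequence is longer than `maxlen`.

```python
def pad_pre(sequences, maxlen=None, value=0):
    """Left-pad (and left-truncate) integer sequences to a common length."""
    if maxlen is None:
        maxlen = max(len(s) for s in sequences)
    padded = []
    for s in sequences:
        s = s[-maxlen:]  # keep the most recent words if too long
        padded.append([value] * (maxlen - len(s)) + list(s))
    return padded

print(pad_pre([[100], [100, 6], [100, 6, 118]]))
# [[0, 0, 100], [0, 100, 6], [100, 6, 118]]
print(pad_pre([[1, 2, 3, 4]], maxlen=2))
# [[3, 4]]
```

Padding at the front keeps the most recent words at the end of the sequence, which is the part the RNN sees last before it makes its prediction.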

In [107]:
fv.shape

(14499, 114)

In [109]:
len(fv[1])

114

### 7. One-Hot Encoding the Output

In [111]:
cv = to_categorical(y)

- to_categorical: Converts integer labels to one-hot encoded vectors.
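
A small sketch of one-hot encoding in plain Python, assuming (as `to_categorical` does by default) that the number of classes is the largest label plus one. The helper name `one_hot` is introduced only for this example.

```python
def one_hot(labels, num_classes=None):
    """Turn integer labels into one-hot rows (lists of 0.0 / 1.0)."""
    if num_classes is None:
        num_classes = max(labels) + 1  # column 0 exists even if unused
    rows = []
    for label in labels:
        row = [0.0] * num_classes
        row[label] = 1.0
        rows.append(row)
    return rows

encoded = one_hot([2, 0, 3])
print(encoded)
```

With word indices running from 1 to 2760, the largest label is 2760, which is why the encoded matrix has 2761 columns.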

In [113]:
cv.shape

(14499, 2761)

### 8. Building the Model

In [115]:
model1 = Sequential()
model1.add(InputLayer(shape=(114,)))
model1.add(Embedding(2761,5))
model1.add(SimpleRNN(100,return_state=False))
model1.add(Dense(2761,activation='softmax'))

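- InputLayer: Specifies the input shape (114 word indices per sample).
- Embedding Layer:
    - input_dim: Vocabulary size plus one (2761), because index 0 is reserved for padding.
    - output_dim: Embedding size (5).
- SimpleRNN: RNN layer with 100 units.
    - return_state=False is the default and can be omitted. return_sequences is also left at its default of False, so the layer outputs only the last step of the output sequence.
- Dense Layer: Output layer with softmax activation, giving a probability for each of the 2761 word indices.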
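
The size of each layer can be worked out by hand. The arithmetic below is a sketch that assumes the Keras defaults (bias terms enabled in the RNN and Dense layers); the variable names are chosen for this example.

```python
vocab_size = 2761   # 2760 words + index 0 reserved for padding
embed_dim = 5
rnn_units = 100

# Embedding: one embed_dim-sized vector per vocabulary index.
embedding_params = vocab_size * embed_dim
# SimpleRNN: input weights + recurrent weights + bias.
rnn_params = rnn_units * (embed_dim + rnn_units + 1)
# Dense: one weight per (unit, class) pair plus one bias per class.
dense_params = rnn_units * vocab_size + vocab_size

total_params = embedding_params + rnn_params + dense_params
print(embedding_params, rnn_params, dense_params, total_params)
# 13805 10600 278861 303266
```

Under these assumptions the output layer holds more than 90% of the weights, a direct consequence of predicting over the whole vocabulary.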

In [117]:
model1.summary()

### 9. Compiling the Model

In [119]:
model1.compile( optimizer='rmsprop',
    loss='categorical_crossentropy',
    metrics=['accuracy'])

- optimizer: 'rmsprop' optimizer.
- loss: 'categorical_crossentropy' suitable for multi-class classification.
- metrics: Track accuracy during training.
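
As a worked example of the loss, the sketch below computes categorical cross-entropy for one sample when the prediction is uniform over all 2761 classes. The function is written for illustration and the target index 6 is arbitrary.

```python
import math

def categorical_crossentropy(one_hot_target, predicted_probs):
    """Cross-entropy for a single sample: -sum(t * log(p))."""
    return -sum(
        t * math.log(p)
        for t, p in zip(one_hot_target, predicted_probs)
        if t > 0
    )

num_classes = 2761
uniform = [1.0 / num_classes] * num_classes
target = [0.0] * num_classes
target[6] = 1.0  # arbitrary target word index

baseline_loss = categorical_crossentropy(target, uniform)
print(round(baseline_loss, 3))  # 7.923
```

An untrained model that guesses uniformly therefore starts near ln(2761) ≈ 7.92, which is close to the first-epoch loss reported in the training log.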

In [121]:
fv[:1001]

array([[   0,    0,    0, ...,    0,    0,  100],
       [   0,    0,    0, ...,    0,  100,    6],
       [   0,    0,    0, ...,  100,    6,  118],
       ...,
       [   0,    0,    0, ...,    3, 1226,   10],
       [   0,    0,    0, ..., 1226,   10,  632],
       [   0,    0,    0, ...,   10,  632,   19]])

### 10. Training the Model

In [123]:
model1.fit(fv[:1001],cv[:1001],epochs=200)

Epoch 1/200
32/32 ━━━━━━━━━━━━━━━━━━━━ 3s 16ms/step - accuracy: 0.0012 - loss: 7.9138
Epoch 2/200
32/32 ━━━━━━━━━━━━━━━━━━━━ 1s 19ms/step - accuracy: 0.0169 - loss: 6.9660
Epoch 3/200
32/32 ━━━━━━━━━━━━━━━━━━━━ 1s 17ms/step - accuracy: 0.0280 - loss: 6.0491
Epoch 4/200
32/32 ━━━━━━━━━━━━━━━━━━━━ 1s 16ms/step - accuracy: 0.0286 - loss: 5.8389
Epoch 5/200
32/32 ━━━━━━━━━━━━━━━━━━━━ 1s 16ms/step - accuracy: 0.0409 - loss: 5.8135
Epoch 6/200
32/32 ━━━━━━━━━━━━━━━━━━━━ 1s 16ms/step - accuracy: 0.0290 - loss: 5.7187
Epoch 7/200
32/32 ━━━━━━━━━━━━━━━━━━━━ 1s 16ms/step - accuracy: 0.0252 - loss: 5.7263
Epoch 8/200
32/32 ━━━━━━━━━━━━━━━━━━━━ 1s 16ms/step - accuracy: 0.0253 - loss: 5.7618
Epoch 9/200
32/32 ━━━━━

<keras.src.callbacks.history.History at 0x12bf12cc320>

#### Training Output:

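- Training uses only the first 1001 input-output pairs (fv[:1001]), not all 14499.
- The log above is cut off after epoch 9. In the epochs shown, loss falls from 7.91 to between 5.7 and 5.8 while accuracy stays below 5%.
- Accuracy is low in the early epochs because the model has not yet learned the word patterns.
- By the end of the 200 epochs, accuracy improves significantly.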

In [125]:
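import time

text = "Netflix "
for _ in range(20):
    # Pad to the input length the model was built with (fv.shape[1] == 114)
    seq = pad_sequences(tk.texts_to_sequences([text]), maxlen=fv.shape[1])
    # Greedy choice: take the word index with the highest predicted probability
    word = tk.index_word[np.argmax(model1.predict(seq))]
    text = text + " " + word
    print(text)
    time.sleep(0.9)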

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 275ms/step
Netflix  watch
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 19ms/step
Netflix  watch the
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 19ms/step
Netflix  watch the awesomest
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 20ms/step
Netflix  watch the awesomest app
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 21ms/step
Netflix  watch the awesomest app there
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 21ms/step
Netflix  watch the awesomest app there is
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 21ms/step
Netflix  watch the awesomest app there is when
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 17ms/step
Netflix  watch the awesomest app there is when it
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 19ms/step
Netflix  watch the awesomest app there is when it comes
1/1 ━━━━

#### Process:
- Start with an initial text: "Netflix ".
- Loop 20 times to generate 20 words.
- Convert current text to sequences and pad it.
- Use the model to predict the next word.
- Append the predicted word to the text.
- Print the updated text.
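
The process above is greedy decoding: at every step the single most likely word is appended. The sketch below shows the same loop with a toy predictor in place of the trained model. Everything in it (`index_word`, `next_scores`, `predict_scores`, `generate`) is hypothetical and exists only to make the control flow runnable without Keras.

```python
# Hypothetical stand-in for model1.predict: maps the last word to scores
# over a tiny vocabulary. A real model would return softmax probabilities.
index_word = {1: "netflix", 2: "is", 3: "great"}
next_scores = {
    "netflix": [0.0, 0.1, 0.8, 0.1],
    "is":      [0.0, 0.1, 0.1, 0.8],
    "great":   [0.0, 0.7, 0.2, 0.1],
}

def predict_scores(text):
    """Toy predictor: look at the last word only."""
    last = text.lower().split()[-1]
    return next_scores[last]

def generate(seed, n_words):
    """Greedy decoding: always append the highest-scoring next word."""
    text = seed
    for _ in range(n_words):
        scores = predict_scores(text)
        best = max(range(len(scores)), key=scores.__getitem__)  # argmax
        text = text + " " + index_word[best]
    return text

print(generate("Netflix", 4))  # Netflix is great netflix is
```

Greedy decoding is deterministic, so the same seed always produces the same continuation, and it can fall into repeating loops as the toy output does.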

In [127]:
data['content'][1]

"Netflix is the awesomest app there is when it comes to entertainment. And I mean in every genre there is, second to none. If its emersive entertainment you seek then youve arrived at your see all destination. Its worth every penny you spend for it. You can uninstall all those other generic apps that claim to entertain. Ive taken the time to read other reviews, by other users. The things that're bothering people are technical issues regarding their phones, (for inst) Video freeze, MT the cache."

In [129]:
import pickle

In [131]:
with open('model1.pkl', 'wb') as f:
    pickle.dump(model1, f)   # save the trained model
with open('model1.pkl', 'rb') as f:
    Mdl1 = pickle.load(f)    # reload it for inference

In [133]:
with open('tk.pkl', 'wb') as f:
    pickle.dump(tk, f)       # save the fitted tokenizer
with open('tk.pkl', 'rb') as f:
    Mdl2 = pickle.load(f)    # reload it for inference

### 11. Generating Text

In [135]:
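import time

text = input("Enter your First word:")
for _ in range(20):
    # Use the reloaded model and tokenizer; pad to the trained input length (114)
    seq = pad_sequences(Mdl2.texts_to_sequences([text]), maxlen=114)
    word = Mdl2.index_word[np.argmax(Mdl1.predict(seq))]
    text = text + " " + word
    print(text)
    time.sleep(0.9)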

Enter your First word: Netflix


1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 170ms/step
Netflix watch
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 20ms/step
Netflix watch the
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 18ms/step
Netflix watch the awesomest
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 18ms/step
Netflix watch the awesomest app
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 18ms/step
Netflix watch the awesomest app there
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 18ms/step
Netflix watch the awesomest app there is
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 21ms/step
Netflix watch the awesomest app there is when
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 19ms/step
Netflix watch the awesomest app there is when it
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 18ms/step
Netflix watch the awesomest app there is when it comes
1/1 ━━━━━━━━━━━━━

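- The generated text follows the review shown earlier (data['content'][1]) almost word for word. The model was fit for 200 epochs on only the first 1001 input-output pairs (fv[:1001]), so it overfits and memorizes that small sample instead of learning general review language.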