# :Next Word Prediction LSTM:

# Introduction

This notebook builds a Next Word Predictor: given the words typed so far, it suggests the word most likely to follow. The model is an LSTM trained on song titles from a Spotify dataset, so its suggestions reflect the vocabulary of song names rather than general English. The cells below load the data, tokenize the titles, build the training pairs, train the network, and generate predictions.


# Objective

:Import libraries:

In [None]:
import pandas as pd
import tensorflow as tf
import numpy as np

:Import CSV file:

In [None]:
df = pd.read_csv('/content/Spotify_final_dataset.csv')
df.head(4)

Unnamed: 0,Position,Artist Name,Song Name,Days,Top 10 (xTimes),Peak Position,Peak Position (xTimes),Peak Streams,Total Streams
0,1.0,Post Malone,Sunflower SpiderMan: Into the SpiderVerse,1506.0,302.0,1.0,(x29),2118242.0,883369738.0
1,2.0,Juice WRLD,Lucid Dreams,1673.0,178.0,1.0,(x20),2127668.0,864832399.0
2,3.0,Lil Uzi Vert,XO TOUR Llif3,1853.0,212.0,1.0,(x4),1660502.0,781153024.0
3,4.0,J. Cole,No Role Modelz,2547.0,6.0,7.0,0,659366.0,734857487.0


:Extracting the required data from the CSV file:

In [None]:
df = df['Song Name']

In [None]:
df

0       Sunflower  SpiderMan: Into the SpiderVerse
1                                     Lucid Dreams
2                                    XO TOUR Llif3
3                                   No Role Modelz
4                                         rockstar
                           ...                    
9357                                           NaN
9358                                           NaN
9359                                           NaN
9360                                           NaN
9361                                           NaN
Name: Song Name, Length: 9362, dtype: object
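
The trailing `NaN` entries above matter later: the `str(item)` conversion used during tokenization turns each of them into the string `'nan'`, which is why `'nan'` shows up as the most frequent token. A minimal sketch of filtering them out first, using a plain Python list as a stand-in for the column. The helper name `drop_missing` is hypothetical; on the pandas Series itself, `df.dropna()` should have the same effect.

```python
import math

def drop_missing(titles):
    """Return only real titles, skipping None and float NaN placeholders."""
    cleaned = []
    for title in titles:
        if title is None:
            continue
        # pandas represents missing strings as float NaN
        if isinstance(title, float) and math.isnan(title):
            continue
        cleaned.append(title)
    return cleaned

# Hypothetical sample standing in for the 'Song Name' column
sample = ["Lucid Dreams", float("nan"), "rockstar", float("nan")]
print(drop_missing(sample))
```

Applying such a filter would change the counts and shapes shown in the rest of this notebook, since those were produced with the `NaN` rows included.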

:Converting data into a list for easy handling:

In [None]:
song_name = df.to_list()

In [None]:
song_name

['Sunflower  SpiderMan: Into the SpiderVerse',
 'Lucid Dreams',
 'XO TOUR Llif3',
 'No Role Modelz',
 'rockstar',
 'goosebumps',
 'Blinding Lights',
 'Jocelyn Flores',
 'SAD!',
 'All Girls Are The Same',
 'HUMBLE.',
 'Circles',
 'SICKO MODE',
 'Drip Too Hard (Lil Baby & Gunna)',
 'Congratulations',
 'I Fall Apart',
 'Heat Waves',
 "God's Plan",
 'The Box',
 'MIDDLE CHILD',
 'Fuck Love',
 'Better Now',
 'One Dance',
 'good 4 u',
 'Robbery',
 'lovely',
 'SLOW DANCING IN THE DARK',
 'Moonlight',
 'Ric Flair Drip (& Metro Boomin)',
 'Stay',
 'Someone You Loved',
 'Watermelon Sugar',
 'Wow.',
 'Location',
 'Sweater Weather',
 'Yes Indeed',
 'Closer',
 'Psycho',
 'The Hills',
 'Levitating',
 'bad guy',
 'Whiskey Glasses',
 'drivers license',
 'Kiss Me More',
 'ROCKSTAR',
 'INDUSTRY BABY',
 'Starboy',
 'Redbone',
 'Blueberry Faygo',
 'Going Bad',
 'Young Dumb & Broke',
 'As It Was',
 'Shape of You',
 '7 rings',
 "when the party's over",
 'LOVE. FEAT. ZACARI.',
 'I Like It',
 'Truth Hurts',
 '

:Tokenizing the data:

In [None]:
tokenizer = tf.keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts([str(item) for item in song_name])
seq = tokenizer.texts_to_sequences([str(item) for item in song_name])

In [None]:
seq[:10]

[[954, 955, 235, 2, 1579],
 [1580, 157],
 [956, 957, 1581],
 [18, 958, 1582],
 [378],
 [694],
 [1583, 317],
 [1584, 1585],
 [318],
 [21, 86, 92, 2, 236]]
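
To make the mapping from titles to integer sequences less opaque, here is a simplified approximation of what the tokenizer does with its default settings: lowercase the text, replace punctuation with spaces, split on whitespace, and number the words from 1 in order of decreasing frequency. This is a sketch, not the Keras implementation, and the function names (`simple_tokenize`, `build_word_index`, `to_sequences`) are hypothetical.

```python
from collections import Counter

# Punctuation assumed to be filtered by default; the apostrophe is kept,
# which is why words such as "don't" survive as single tokens.
FILTERS = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~\t\n'

def simple_tokenize(text):
    """Lowercase, replace filtered characters with spaces, split on whitespace."""
    table = str.maketrans({ch: " " for ch in FILTERS})
    return text.lower().translate(table).split()

def build_word_index(texts):
    """Map each word to an index, with 1 assigned to the most frequent word."""
    counts = Counter()
    for text in texts:
        counts.update(simple_tokenize(text))
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

def to_sequences(texts, word_index):
    """Encode each text as a list of word indices, skipping unknown words."""
    return [[word_index[w] for w in simple_tokenize(t) if w in word_index]
            for t in texts]

titles = ["The Box", "The Hills", "bad guy"]
word_index = build_word_index(titles)
print(word_index)
print(to_sequences(titles, word_index))
```

Index 0 is never assigned to a word, which leaves it free to act as the padding value later on.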

In [None]:
tokenizer.word_index

{'nan': 1,
 'the': 2,
 'you': 3,
 'me': 4,
 'i': 5,
 'it': 6,
 'a': 7,
 'my': 8,
 'love': 9,
 'in': 10,
 'to': 11,
 'on': 12,
 'of': 13,
 'up': 14,
 'remix': 15,
 'for': 16,
 'version': 17,
 'no': 18,
 'christmas': 19,
 'like': 20,
 'all': 21,
 "don't": 22,
 'be': 23,
 'what': 24,
 'your': 25,
 'is': 26,
 'out': 27,
 'go': 28,
 'time': 29,
 'with': 30,
 'one': 31,
 'down': 32,
 'do': 33,
 'baby': 34,
 'this': 35,
 'from': 36,
 'and': 37,
 'back': 38,
 'know': 39,
 'bad': 40,
 'heart': 41,
 'that': 42,
 'way': 43,
 'good': 44,
 'night': 45,
 'feat': 46,
 '2': 47,
 'we': 48,
 'life': 49,
 "i'm": 50,
 'get': 51,
 'la': 52,
 'u': 53,
 'girl': 54,
 'let': 55,
 'want': 56,
 'talk': 57,
 'not': 58,
 'home': 59,
 'new': 60,
 'too': 61,
 'now': 62,
 'off': 63,
 'remastered': 64,
 'better': 65,
 'over': 66,
 'come': 67,
 "taylor's": 68,
 'at': 69,
 'eazy': 70,
 'n': 71,
 'big': 72,
 'how': 73,
 'make': 74,
 'right': 75,
 'day': 76,
 'man': 77,
 'interlude': 78,
 'money': 79,
 'high': 80,
 'song'

:Removing songs with single-word titles, since they offer no next word to predict:

In [None]:
X = []
y = []
total_words_dropped = 0

for i in seq:
    if len(i) > 1:
        for index in range(1, len(i)):
            X.append(i[:index])
            y.append(i[index])
    else:
        total_words_dropped += 1

print("Total Single Words Dropped are:", total_words_dropped)

Total Single Words Dropped are: 5255


In [None]:
X[:10]

[[954],
 [954, 955],
 [954, 955, 235],
 [954, 955, 235, 2],
 [1580],
 [956],
 [956, 957],
 [18],
 [18, 958],
 [1583]]

In [None]:
y[:10]

[955, 235, 2, 1579, 157, 957, 1581, 958, 1582, 317]
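
The loop above turns every title of length n into n - 1 training pairs: each prefix is an input and the word that follows it is the target. The same logic as a small reusable function, shown as a sketch (the name `make_prefix_pairs` is hypothetical):

```python
def make_prefix_pairs(sequences):
    """Build (prefix, next_token) pairs; count sequences too short to use."""
    inputs, targets, dropped = [], [], 0
    for seq in sequences:
        if len(seq) < 2:
            # A single-token sequence has no next word to predict.
            dropped += 1
            continue
        for end in range(1, len(seq)):
            inputs.append(seq[:end])   # tokens seen so far
            targets.append(seq[end])   # token that follows them
    return inputs, targets, dropped

# First and fifth encoded titles from seq[:10] above
inputs, targets, dropped = make_prefix_pairs([[954, 955, 235, 2, 1579], [378]])
print(inputs)
print(targets)
print(dropped)
```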

:Padding the input sequences with zeros so that they all have the same length:

In [None]:
X = tf.keras.preprocessing.sequence.pad_sequences(X)

In [None]:
X

array([[   0,    0,    0, ...,    0,    0,  954],
       [   0,    0,    0, ...,    0,  954,  955],
       [   0,    0,    0, ...,  954,  955,  235],
       ...,
       [   0,    0,    0, ...,    0,    0,   31],
       [   0,    0,    0, ...,    0,    0, 4213],
       [   0,    0,    0, ...,    0,    0, 1367]], dtype=int32)

In [None]:
X.shape

(8807, 19)
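
The array above shows zeros on the left of each row, which matches the default "pre" padding of `pad_sequences`. A minimal pure-Python sketch of that behaviour, assuming a non-empty list of sequences; the helper name `pad_pre` is hypothetical and this is not the Keras implementation:

```python
def pad_pre(sequences, maxlen=None, value=0):
    """Left-pad (and, if needed, left-truncate) sequences to a common length."""
    if maxlen is None:
        maxlen = max(len(s) for s in sequences)
    padded = []
    for s in sequences:
        kept = list(s)[-maxlen:]  # keep the most recent maxlen tokens
        padded.append([value] * (maxlen - len(kept)) + kept)
    return padded

rows = pad_pre([[954], [954, 955], [954, 955, 235]])
print(rows)
```

Padding on the left keeps the most recent words at the end of each row, next to the position where the LSTM produces its final state.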

In [None]:
y = tf.keras.utils.to_categorical(y)

In [None]:
y

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 1., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 1.]], dtype=float32)

In [None]:
y.shape

(8807, 4215)
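
Each target index becomes a row with a single 1 at that index, which is why the third row of `y` above has its 1 in position 2 (the third target is token 2, 'the'). A small sketch of the encoding, with a hypothetical helper name `one_hot`:

```python
def one_hot(labels, num_classes):
    """Encode integer labels as rows with a single 1.0 at the label index."""
    rows = []
    for label in labels:
        row = [0.0] * num_classes
        row[label] = 1.0
        rows.append(row)
    return rows

encoded = one_hot([2, 0], num_classes=4)
print(encoded)
```

When `to_categorical` is called without `num_classes`, the width is inferred from the largest label plus one. Here that happens to equal the vocabulary size of 4215; passing `num_classes` explicitly would make that match independent of the data.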

:Calculating the total number of words in our vocabulary:

In [None]:
vocab_size = len(tokenizer.word_index) + 1

In [None]:
vocab_size

4215

:Creating the model:

In [None]:
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 14),
    tf.keras.layers.LSTM(100, return_sequences=True),
    tf.keras.layers.LSTM(100),
    tf.keras.layers.Dense(100, activation='relu'),
    tf.keras.layers.Dense(vocab_size, activation='softmax'),
])

:Printing the summary of the model:

In [None]:
model.summary()

:Compiling the model with the Adam optimiser to minimise the categorical cross-entropy loss:

In [None]:
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.004),
    loss='categorical_crossentropy',
    metrics=['accuracy'])
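
For a one-hot target, categorical cross-entropy reduces to the negative log of the probability the model assigns to the correct word. A minimal sketch of the per-sample loss, for illustration only; the clipping constant `eps` is an assumption and Keras may handle numerical stability differently:

```python
import math

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    """Per-sample loss: -sum(t * log(p)) over classes, with p clipped at eps."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))

# Hypothetical 4-word vocabulary: the true next word is class 2
y_true = [0.0, 0.0, 1.0, 0.0]
confident = categorical_crossentropy(y_true, [0.05, 0.05, 0.85, 0.05])
unsure = categorical_crossentropy(y_true, [0.25, 0.25, 0.25, 0.25])
print(confident, unsure)
```

The loss shrinks as more probability mass lands on the correct word, which is what training pushes the softmax output towards.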

In [None]:
model.fit(X, y, epochs=50,batch_size=64)

Epoch 1/50
Epoch 2/50
Epoch 3/50
Epoch 4/50
Epoch 5/50
Epoch 6/50
Epoch 7/50
Epoch 8/50
Epoch 9/50
Epoch 10/50
Epoch 11/50
Epoch 12/50
Epoch 13/50
Epoch 14/50
Epoch 15/50
Epoch 16/50
Epoch 17/50
Epoch 18/50
Epoch 19/50
Epoch 20/50
Epoch 21/50
Epoch 22/50
Epoch 23/50
Epoch 24/50
Epoch 25/50
Epoch 26/50
Epoch 27/50
Epoch 28/50
Epoch 29/50
Epoch 30/50
Epoch 31/50
Epoch 32/50
Epoch 33/50
Epoch 34/50
Epoch 35/50
Epoch 36/50
Epoch 37/50
Epoch 38/50
Epoch 39/50
Epoch 40/50
Epoch 41/50
Epoch 42/50
Epoch 43/50
Epoch 44/50
Epoch 45/50
Epoch 46/50
Epoch 47/50
Epoch 48/50
Epoch 49/50
Epoch 50/50


<keras.src.callbacks.History at 0x7f4c09721660>

In [None]:
model.save('nwp.h5')

In [None]:
vocab_array = np.array(list(tokenizer.word_index.keys()))

In [None]:
vocab_array

array(['nan', 'the', 'you', ..., 'daechwita', 'somethin’', 'skies'],
      dtype='<U22')

In [None]:
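def make_prediction(text, n_words):
    """Append n_words greedily predicted words to text."""
    for _ in range(n_words):
        # Encode the running text with the fitted tokenizer.
        text_tokenize = tokenizer.texts_to_sequences([text])
        # Pad to the training input length (X.shape[1], 19 here), not a hard-coded 14.
        text_padded = tf.keras.preprocessing.sequence.pad_sequences(text_tokenize, maxlen=X.shape[1])
        # Index of the most probable next word.
        prediction = np.squeeze(np.argmax(model.predict(text_padded), axis=-1))
        # word_index starts at 1, so token k sits at vocab_array[k - 1].
        prediction = str(vocab_array[prediction - 1])
        text += " " + prediction
    return text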

In [None]:
make_prediction("cloudy", 5)



'cloudy acoustic wit camila tyler the'
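
The `vocab_array[prediction - 1]` lookup depends on the insertion order of `word_index`, and if the model ever predicts index 0 (the padding value), the resulting index of -1 silently selects the last word in the vocabulary. A sketch of a more explicit lookup through an index-to-word dictionary follows. The helper names are hypothetical; the Keras tokenizer also exposes an `index_word` attribute that could serve the same purpose.

```python
def build_index_word(word_index):
    """Invert a word -> index mapping into index -> word."""
    return {index: word for word, index in word_index.items()}

def argmax(probabilities):
    """Return the position of the largest value in a list."""
    return max(range(len(probabilities)), key=probabilities.__getitem__)

def lookup_word(index_word, predicted_index, unknown="<pad>"):
    """Map a predicted index to its word, flagging indices with no word."""
    return index_word.get(int(predicted_index), unknown)

# First three entries of tokenizer.word_index shown earlier
index_word = build_index_word({"nan": 1, "the": 2, "you": 3})
# Hypothetical softmax output over indices 0..3
best = argmax([0.1, 0.2, 0.6, 0.1])
print(lookup_word(index_word, best))
print(lookup_word(index_word, 0))
```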

In [None]:
model.summary()

Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding_1 (Embedding)     (None, None, 14)          59010     
                                                                 
 lstm_2 (LSTM)               (None, None, 100)         46000     
                                                                 
 lstm_3 (LSTM)               (None, 100)               80400     
                                                                 
 dense_2 (Dense)             (None, 100)               10100     
                                                                 
 dense_3 (Dense)             (None, 4215)              425715    
                                                                 
Total params: 621225 (2.37 MB)
Trainable params: 621225 (2.37 MB)
Non-trainable params: 0 (0.00 Byte)
_________________________________________________________________
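
The parameter counts in the summary can be checked by hand from the layer sizes. The sketch below uses the standard formulas for embedding, LSTM (four gates, each with input weights, recurrent weights and a bias) and dense layers; the function names are hypothetical.

```python
def embedding_params(vocab_size, dim):
    """One dim-sized vector per vocabulary entry."""
    return vocab_size * dim

def lstm_params(input_dim, units):
    """Four gates, each with input weights, recurrent weights and a bias."""
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    """Weight matrix plus one bias per unit."""
    return input_dim * units + units

vocab_size, embed_dim, hidden = 4215, 14, 100
counts = [
    embedding_params(vocab_size, embed_dim),  # embedding_1
    lstm_params(embed_dim, hidden),           # lstm_2
    lstm_params(hidden, hidden),              # lstm_3
    dense_params(hidden, hidden),             # dense_2
    dense_params(hidden, vocab_size),         # dense_3
]
print(counts, sum(counts))
```

More than two thirds of the parameters sit in the final softmax layer, so the vocabulary size dominates the model size.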
