In [23]:
import numpy as np
import pandas as pd
import tensorflow as tf
import keras

In [2]:
docs=  """India's 2011 Cricket World Cup journey:
India's journey to winning the 2011 Cricket World Cup was filled with both triumphs and challenges, making their victory all the more significant.  
The tournament, co-hosted by India, Sri Lanka, and Bangladesh, took place from February 19 to April 2, 2011.  
India began their campaign with a convincing win against Bangladesh, setting the tone with a massive 87-run victory.  
Virender Sehwag's explosive 175 and Virat Kohli's century (100*) ensured India posted a daunting total of 370/4.  
Despite this strong start, India faced tough situations as the tournament progressed.  
In their group stage match against England, India set a challenging target of 339 runs, with Sachin Tendulkar scoring 120.  
However, England fought back valiantly, led by Andrew Strauss’s 158, and the match ended in a thrilling tie, with both teams scoring 338 runs.  
This result highlighted India's vulnerabilities in bowling under pressure, as they failed to defend a large total.  
India's only loss in the group stage came against South Africa in a closely contested match.  
India batted first and looked set for a massive score, with Tendulkar scoring another century (111) and India reaching 267/1.  
However, a dramatic collapse saw India lose 9 wickets for just 29 runs, being bowled out for 296.  
South Africa chased down the target with 2 balls to spare, winning by 3 wickets in a nail-biting finish.  
This defeat exposed India's middle-order fragility and raised questions about their ability to handle pressure in crucial moments.  
Despite these setbacks, India finished second in Group B, with 4 wins, 1 tie, and 1 loss, advancing to the knockout stages with a net run rate of +0.900.  
In the quarterfinals, India faced a formidable challenge against Australia, the three-time defending champions.  
Australia batted first and posted a competitive total of 260/6, with Ricky Ponting scoring a resilient century (104).  
India’s chase was not without challenges, as they lost key wickets at crucial junctures, but Yuvraj Singh's unbeaten 57 and Suresh Raina's 34* guided them to a 5-wicket victory with 14 balls remaining.  
This win was significant as it ended Australia’s 12-year dominance in World Cups and boosted India's confidence going into the semifinals.  
The semifinal against Pakistan was arguably one of the most intense matches of the tournament.  
India batted first and scored 260/9, with Sachin Tendulkar top-scoring with 85, albeit benefiting from several dropped catches.  
Pakistan, known for their unpredictable nature, kept the chase alive, but India’s bowlers held their nerve, with Zaheer Khan and Harbhajan Singh taking crucial wickets.  
India won by 29 runs, securing a place in the final, but not without surviving some anxious moments, particularly when the Pakistani batsmen looked set to take the game away.  
The final against Sri Lanka presented another tough situation for India, as they were set a challenging target of 275 runs after Sri Lanka scored 274/6, thanks to Mahela Jayawardene's unbeaten 103.  
India's chase began poorly, losing Virender Sehwag for a duck and Sachin Tendulkar for just 18, leaving them at a precarious 31/2.  
At this critical juncture, Gautam Gambhir played a vital innings, scoring 97 runs, and formed crucial partnerships with Virat Kohli (35) and MS Dhoni.  
Dhoni’s decision to promote himself ahead of the in-form Yuvraj Singh was a bold move, especially given the pressure of the situation.  
Dhoni justified his decision with a match-winning unbeaten 91 off 79 balls, guiding India to victory with 10 balls to spare.  
He sealed the win with an iconic six, a moment that has since become etched in cricketing history.  
India won the final by 6 wickets, clinching their second World Cup title after 28 years.  
This victory was a testament to India’s resilience, overcoming tough situations and bouncing back from setbacks throughout the tournament.  
Yuvraj Singh was named the Player of the Tournament for his exceptional all-round performance, scoring 362 runs and taking 15 wickets, often delivering under pressure.  
Sachin Tendulkar, who finished the tournament as India’s highest run-scorer with 482 runs at an average of 53.55, finally realized his dream of winning a World Cup in what was his sixth appearance in the tournament.  
Zaheer Khan was India’s leading wicket-taker, with 21 wickets at an average of 18.76, playing a crucial role in many of India’s victories.  
The 2011 World Cup victory is remembered not just for the triumph but for the tough situations India navigated along the way, showcasing their grit, determination, and ability to rise to the occasion when it mattered the most.  
This win united a nation, with millions celebrating across the country, and it solidified the legacy of a team that overcame challenges to achieve glory on the world stage."""

In [3]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [4]:
tokenizer=Tokenizer()
# Wrap docs in a list so it is treated as one document and tokenized by words;
# a bare string would be iterated character by character.
tokenizer.fit_on_texts([docs])

In [77]:
#tokenizer.word_index

In [6]:
len(tokenizer.word_index)

386

In [None]:
# We want a list of lists: one token sequence per line of the text.

for sentence in docs.split('\n'):
    print(tokenizer.texts_to_sequences([sentence])) # Pass a one-item list so the line is tokenized by words rather than by letters.

In [None]:
for sentence in docs.split('\n'):
    tokenized_sequence=tokenizer.texts_to_sequences([sentence])[0]
    for i in range(1,len(tokenized_sequence)):
        print(tokenized_sequence[:i+1])

In [9]:
input_sequences=[]
for sentence in docs.split('\n'):
    tokenized_sequence=tokenizer.texts_to_sequences([sentence])[0]
    for i in range(1,len(tokenized_sequence)):
        input_sequences.append(tokenized_sequence[:i+1])
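
The loop above turns each tokenized line into every prefix of length two or more, so a line with n tokens yields n-1 training examples. A minimal sketch of that expansion, using a hypothetical helper `prefix_sequences` and made-up token IDs rather than the fitted tokenizer:

```python
def prefix_sequences(token_ids):
    """Return every prefix of token_ids that has at least two tokens."""
    return [token_ids[: i + 1] for i in range(1, len(token_ids))]

# Hypothetical token IDs for one four-word line.
example = [13, 28, 65, 4]
for seq in prefix_sequences(example):
    print(seq)
# [13, 28]
# [13, 28, 65]
# [13, 28, 65, 4]
```

In each prefix the last token later becomes the target and the tokens before it become the input.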

In [75]:
input_sequences

array([[  0,   0,   0, ...,   0,   0,  13],
       [  0,   0,   0, ...,   0,  13,  28],
       [  0,   0,   0, ...,  13,  28,  65],
       ...,
       [  0,   0,   0, ..., 384, 385, 386],
       [  0,   0,   0, ..., 385, 386,   1],
       [  0,   0,   0, ..., 386,   1,  14]])

In [11]:
len(tokenizer.word_index)

386

In [12]:
max_len = max([len(x) for x in input_sequences]) # Length (in tokens) of the longest prefix sequence

In [18]:
max_len

38

In [13]:
# Pad every sequence on the left with zeros so that all inputs have the same length.

padded_sequences=pad_sequences(input_sequences,maxlen=max_len,padding='pre')
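
To make the effect of `padding='pre'` concrete, here is a rough pure-Python sketch of left-padding. The helper name `pad_pre` is hypothetical and the token IDs are made up; it is meant to mirror the behaviour used above (zeros added at the front, and only the last `maxlen` tokens kept if a sequence is too long), not to reproduce the Keras implementation.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad each sequence with `value`; keep only the last `maxlen` tokens if longer."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

print(pad_pre([[13, 28], [13, 28, 65]], maxlen=5))
# [[0, 0, 0, 13, 28], [0, 0, 13, 28, 65]]
```

Padding on the left keeps the most recent words directly next to the position being predicted.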

In [73]:
padded_sequences[0]

array([ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0, 13, 28])

In [15]:
# Drop the last token of each row to get the inputs; that last token is the target the model will try to predict.
input_sequences=padded_sequences[:,:-1]

In [72]:
input_sequences[0]

array([ 0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
        0,  0, 13])

In [71]:
output_sequences=padded_sequences[:,-1]
output_sequences[0]

28

In [19]:
# We now have both the inputs and the targets for the model.

# The output layer needs one unit per class so that softmax can assign a probability to every token,
# so the targets are one-hot encoded with to_categorical.

from keras.utils import to_categorical
num_classes=len(tokenizer.word_index)+1 # 386 words plus 1, because word indices start at 1 and index 0 is reserved for padding.
output_sequences=to_categorical(output_sequences,num_classes=num_classes)
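
The extra class for index 0 is easier to see on a tiny example. The sketch below builds the same one-hot layout with plain NumPy; the vocabulary size and target indices are made up for illustration.

```python
import numpy as np

vocab_size = 3                 # hypothetical: three distinct words, indexed 1..3
num_classes = vocab_size + 1   # index 0 is reserved for padding

targets = np.array([2, 3, 1])  # hypothetical next-word indices
one_hot = np.eye(num_classes)[targets]
print(one_hot)
# [[0. 0. 1. 0.]
#  [0. 0. 0. 1.]
#  [0. 1. 0. 0.]]
```

Column 0 stays empty because no real word has index 0, yet it must exist so that word index 3 still fits in the last column.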

In [21]:
output_sequences[0]

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0.

## Model Architecture

In [27]:
input_sequences.shape

(760, 37)

In [28]:
len(tokenizer.word_index)

386

In [26]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding,Dense,LSTM

In [38]:
model=Sequential()
model.add(Embedding(387,100,input_length=37)) # Each token is mapped to a 100-dimensional vector; 387 = vocabulary size + 1, 37 = max_len - 1.
model.add(LSTM(150,dropout=0.2,return_sequences=True))
model.add(LSTM(150,dropout=0.2))
model.add(Dense(387,activation='softmax'))

In [39]:
model.compile(loss='categorical_crossentropy',optimizer='adam',metrics=['accuracy'])
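
For a one-hot target, categorical cross-entropy for a single example, -Σ y_k log p_k, reduces to -log p of the true class. The NumPy sketch below checks this on made-up probabilities; it only illustrates the formula and is not the Keras implementation.

```python
import numpy as np

probs = np.array([0.1, 0.2, 0.6, 0.1])  # hypothetical softmax output over 4 classes
true_index = 2
one_hot = np.eye(len(probs))[true_index]

loss_one_hot = -np.sum(one_hot * np.log(probs))  # full sum over all classes
loss_index = -np.log(probs[true_index])          # only the true class contributes
print(np.isclose(loss_one_hot, loss_index))      # True
```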

In [41]:
model.fit(input_sequences,output_sequences,epochs=100)

Epoch 1/100
24/24 ━━━━━━━━━━━━━━━━━━━━ 3s 107ms/step - accuracy: 0.0467 - loss: 5.2041
Epoch 2/100
24/24 ━━━━━━━━━━━━━━━━━━━━ 3s 110ms/step - accuracy: 0.0486 - loss: 5.1637
Epoch 3/100
24/24 ━━━━━━━━━━━━━━━━━━━━ 5s 112ms/step - accuracy: 0.0482 - loss: 5.0215
Epoch 4/100
24/24 ━━━━━━━━━━━━━━━━━━━━ 5s 113ms/step - accuracy: 0.0498 - loss: 4.9524
Epoch 5/100
24/24 ━━━━━━━━━━━━━━━━━━━━ 6s 152ms/step - accuracy: 0.0478 - loss: 4.8966
Epoch 6/100
24/24 ━━━━━━━━━━━━━━━━━━━━ 5s 117ms/step - accuracy: 0.0658 - loss: 4.7581
Epoch 7/100
24/24 ━━━━━━━━━━━━━━━━━━━━ 5s 115ms/step - accuracy: 0.0500 - loss: 4.7360
Epoch 8/100
24/24 ━━━━━━━━━━━━━━━━━━━━ 5s 116ms/step - accuracy: 0.0424 - loss: 4.6662
Epoch 9/100
24/24 ━

<keras.src.callbacks.history.History at 0x2edc1474a50>

In [59]:
# Function that extends the given text with the next 10 predicted words.
def next_word(text):
    for i in range(10):
        # Tokenize the current text
        sequence=tokenizer.texts_to_sequences([text])[0]

        # Pad to the input length used in training (max_len-1, because the last token was split off as the target)
        padded_sequence=pad_sequences([sequence],maxlen=max_len-1,padding='pre')

        # Predict the next word
        pos=int(np.argmax(model.predict(padded_sequence)))
        print(pos) # Token index of the predicted next word

        # Map the index back to its word; index 0 (padding) has no word, so nothing is appended
        label=tokenizer.index_word.get(pos)
        if label is not None:
            text=text+' '+label
        print(text)
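
The generation loop can be tried without a trained model. The sketch below is a simplified stand-in under stated assumptions: `fake_predict` replaces `model.predict` and simply favours the token after the last one, and the four-word vocabulary is made up. It shows greedy decoding (always taking the `argmax`) together with a reverse `index_word` dictionary for the lookup.

```python
import numpy as np

word_index = {"india": 1, "won": 2, "the": 3, "final": 4}  # hypothetical vocabulary
index_word = {idx: word for word, idx in word_index.items()}

def fake_predict(token_ids):
    """Stand-in for model.predict: puts all probability on the token after the last one."""
    probs = np.zeros(len(word_index) + 1)
    probs[min(token_ids[-1] + 1, len(word_index))] = 1.0
    return probs

def greedy_generate(text, steps):
    """Append `steps` words, each time taking the most probable next token."""
    for _ in range(steps):
        token_ids = [word_index[w] for w in text.lower().split() if w in word_index]
        next_id = int(np.argmax(fake_predict(token_ids)))
        text = text + " " + index_word[next_id]
    return text

print(greedy_generate("India won", steps=2))
# India won the final
```

Because the choice is always the single most probable token, the same prompt always produces the same continuation.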
    
    

In [60]:
next_word("Dhoni’s decision to promote himself ahead ")

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 35ms/step
8
Dhoni’s decision to promote himself ahead  of
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 31ms/step
1
Dhoni’s decision to promote himself ahead  of the
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 38ms/step
7
Dhoni’s decision to promote himself ahead  of the in
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 33ms/step
302
Dhoni’s decision to promote himself ahead  of the in form
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 20ms/step
59
Dhoni’s decision to promote himself ahead  of the in form yuvraj
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 50ms/step
62
Dhoni’s decision to promote himself ahead  of the in form yuvraj singh
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 56ms/step
10
Dhoni’s decision to promote himself ahead  of the in form yuvraj singh was
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s

In [61]:
next_word('India began their ')

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 48ms/step
130
India began their  campaign
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 21ms/step
3
India began their  campaign with
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 33ms/step
2
India began their  campaign with a
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 30ms/step
131
India began their  campaign with a convincing
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 18ms/step
30
India began their  campaign with a convincing win
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 21ms/step
20
India began their  campaign with a convincing win against
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 120ms/step
70
India began their  campaign with a convincing win against bangladesh
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 46ms/step
132
India began their  campaign with a convincing win against bangladesh setting

In [62]:
next_word('Pakistan, known for')

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 30ms/step
11
Pakistan, known for their
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 31ms/step
252
Pakistan, known for their unpredictable
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 39ms/step
253
Pakistan, known for their unpredictable nature
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 24ms/step
254
Pakistan, known for their unpredictable nature kept
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 30ms/step
1
Pakistan, known for their unpredictable nature kept the
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 18ms/step
57
Pakistan, known for their unpredictable nature kept the chase
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 65ms/step
255
Pakistan, known for their unpredictable nature kept the chase alive
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 35ms/step
37
Pakistan, known for their unpredictable nature