Next word predictor

In [32]:
faqs = """
What is the Data Science Mentorship Program?
The Data Science Mentorship Program is a 7-month structured program with live classes and assignments.

What is the course fee?
The program follows a monthly subscription of Rs 799 per month.

What is the total duration of the course?
The total duration is 7 months.

What topics are covered in the program?
The program covers Python, Data Science Libraries, SQL, Data Analysis, Machine Learning, MLOps, Case Studies, and Maths for ML.

Will Deep Learning be covered?
No, Deep Learning is not included in this program.

Will NLP be covered?
No, NLP is also not included.

What if I miss a live session?
All sessions are recorded. You will get access to the recording if you miss a class.

Where can I find the class schedule?
The class schedule is updated monthly on the dashboard.

What is the duration of each session?
Most live sessions are around 2 hours long.

Which language will be used during the sessions?
The instructor teaches in Hinglish.

How will I be informed about upcoming sessions?
You will receive email notifications for every paid session.

Can a non-technical student join this program?
Yes, absolutely. The course is designed for beginners.

Can I join the program late?
Yes, you can join anytime during the year.

Will I get access to past lectures if I join late?
Yes, once you make the payment, all previous recordings will be unlocked.

Do we need to submit tasks?
You do not need to submit tasks. Solutions will be provided for self-evaluation.

Will there be case studies?
Yes, multiple real-world case studies are included.

How can I contact support?
You can email us at support@campusx.com.

Where should I make the payment?
Payments must be made on our official website.

Can I pay the full amount at once?
No, the program follows only a monthly subscription model.

What is the validity of the monthly subscription?
Your subscription is valid for 30 days from the date of payment.

What is the refund policy?
You get a 7-day refund period from the date of payment.

What if I live outside India and my payment fails?
You can email the support team for international payment assistance.

Till when can I watch the videos?
You can watch videos as long as your subscription is active. After completing all payments, full content becomes unlocked.

Why isn’t lifetime access provided?
Lifetime access is not provided due to the low course fees.

How can I get doubt-clearing support?
Fill out the doubt form and the team will schedule a 1-on-1 doubt session.

If I join late, can I ask doubts from previous weeks?
Yes, you can select "Past Week Doubt" in the doubt form.

What is the certificate criteria?
You must complete all assignments and pay the full 7-month subscription.

Is placement assistance included?
Placement assistance includes portfolio review, resume building, interview guidance, and job-search strategies. It does not guarantee a job.
"""


In [33]:
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer

In [34]:
tokenizer = Tokenizer()

In [35]:
tokenizer.fit_on_texts([faqs])

In [36]:
len(tokenizer.word_index)

208

In [37]:
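input_sequences = []
for sentence in faqs.split('\n'):
    tokenize_sentence = tokenizer.texts_to_sequences([sentence])[0]

# Bug: this loop sits outside the sentence loop, so it only sees the last
# line of faqs, which is empty; input_sequences therefore stays empty.
for i in range(1, len(tokenize_sentence)):
    input_sequences.append(tokenize_sentence[: i+1])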

In [38]:
input_sequences

[]

In [40]:
# rebuild input_sequences correctly and compute max_len with a safe fallback
input_sequences = []
for sentence in faqs.split('\n'):
    tokenize_sentence = tokenizer.texts_to_sequences([sentence])[0]
    if not tokenize_sentence:
        continue
    for i in range(1, len(tokenize_sentence) + 1):
        input_sequences.append(tokenize_sentence[:i])

# compute max_len (fallback to padded_input_sequences if input_sequences is empty)
if input_sequences:
    max_len = max(len(x) for x in input_sequences)
else:
    max_len = padded_input_sequences.shape[1] if 'padded_input_sequences' in globals() else 0

max_len

19
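
The prefix expansion used above can be seen in isolation with a small pure-Python sketch. The helper name `prefix_sequences` and the token ids are hypothetical, chosen only for illustration. With `min_len=1` it produces the same prefixes as the cell above; the default `min_len=2` is an assumption that skips single-token prefixes, which would otherwise become training examples with an all-padding context.

```python
def prefix_sequences(token_ids, min_len=2):
    """Return every prefix of token_ids that has at least min_len tokens."""
    return [token_ids[:i] for i in range(min_len, len(token_ids) + 1)]


# Hypothetical token ids for a four-word sentence
example = [7, 2, 1, 38]

for seq in prefix_sequences(example):
    print(seq)
```

Each prefix later becomes one training example: all tokens but the last are the context, and the last token is the word to predict.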

In [41]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
padded_input_sequences = pad_sequences(input_sequences, maxlen=max_len, padding='pre')

In [42]:
padded_input_sequences

array([[  0,   0,   0, ...,   0,   0,   7],
       [  0,   0,   0, ...,   0,   7,   2],
       [  0,   0,   0, ...,   7,   2,   1],
       ...,
       [  0,   0,  88, ..., 207,  16, 208],
       [  0,  88,  53, ...,  16, 208,   9],
       [ 88,  53, 197, ..., 208,   9,  89]], shape=(501, 19), dtype=int32)

In [43]:
X = padded_input_sequences[:, :-1]

In [44]:
Y = padded_input_sequences[:, -1]

In [45]:
X.shape

(501, 18)

In [46]:
Y.shape

(501,)
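
To make the padding and the X/Y split concrete, here is a minimal pure-Python sketch on toy data. The helpers `pad_pre` and `split_xy` are hypothetical stand-ins written for this example; they imitate left-padding and the column slicing used above rather than calling Keras.

```python
def pad_pre(seqs, maxlen, value=0):
    """Left-pad each sequence with `value` up to maxlen.

    Sequences longer than maxlen keep only their last maxlen tokens.
    """
    return [[value] * (maxlen - len(s)) + list(s[-maxlen:]) for s in seqs]


def split_xy(padded):
    """Split each padded row into context (all but last) and target (last)."""
    X = [row[:-1] for row in padded]
    Y = [row[-1] for row in padded]
    return X, Y


# Hypothetical prefix sequences built from one sentence
seqs = [[7, 2], [7, 2, 1], [7, 2, 1, 38]]

padded = pad_pre(seqs, maxlen=4)
X, Y = split_xy(padded)

for context, target in zip(X, Y):
    print(context, "->", target)
```

Padding on the left keeps the most recent words next to the target position, so the last column of every row is always the word to predict.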

In [47]:
from tensorflow.keras.utils import to_categorical
num_classes = int(Y.max()) + 1
y = to_categorical(Y, num_classes=num_classes)

In [48]:
y.shape

(501, 209)

In [49]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

In [53]:
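model = Sequential()
# vocabulary size and sequence length come from the data, not hard-coded values
model.add(Embedding(num_classes, 100, input_length=X.shape[1]))
model.add(LSTM(150))
model.add(Dense(num_classes, activation='softmax'))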

In [54]:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

In [55]:
model.summary()

In [57]:
# rebuild model to match your data shapes and use one-hot targets `y`
vocab_size = num_classes
seq_len = X.shape[1]

model = Sequential()
model.add(Embedding(vocab_size, 100, input_length=seq_len))
model.add(LSTM(150))
model.add(Dense(vocab_size, activation='softmax'))

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()

# fit with the one-hot encoded targets
model.fit(X, y, epochs=30)

Epoch 1/30
16/16 ━━━━━━━━━━━━━━━━━━━━ 3s 19ms/step - accuracy: 0.0479 - loss: 5.3051
Epoch 2/30
16/16 ━━━━━━━━━━━━━━━━━━━━ 0s 19ms/step - accuracy: 0.0719 - loss: 5.0234
Epoch 3/30
16/16 ━━━━━━━━━━━━━━━━━━━━ 0s 17ms/step - accuracy: 0.0739 - loss: 4.8916
Epoch 4/30
16/16 ━━━━━━━━━━━━━━━━━━━━ 0s 17ms/step - accuracy: 0.0739 - loss: 4.8052
Epoch 5/30
16/16 ━━━━━━━━━━━━━━━━━━━━ 0s 17ms/step - accuracy: 0.0758 - loss: 4.7349
Epoch 6/30
16/16 ━━━━━━━━━━━━━━━━━━━━ 0s 17ms/step - accuracy: 0.0778 - loss: 4.6503
Epoch 7/30
16/16 ━━━━━━━━━━━━━━━━━━━━ 0s 18ms/step - accuracy: 0.0878 - loss: 4.5613
Epoch 8/30
16/16 ━━━━━━━━━━━━━━━━━━━━ 0s 19ms/step - accuracy: 0.1058 - loss: 4.4680
Epoch 9/30
16/16 ━━━━━━━━━━━━━━━━━━

<keras.src.callbacks.history.History at 0x1b2e1516520>

In [80]:
import numpy as np

text = "what"

for i in range(10):
    # tokenize the text generated so far
    token_text = tokenizer.texts_to_sequences([text])[0]
    # left-pad to a fixed length
    padded_token_text = pad_sequences([token_text], maxlen=56, padding='pre')
    print(padded_token_text)

    # predict the id of the most likely next word
    pos = np.argmax(model.predict(padded_token_text))

    # map the predicted id back to its word and append it
    for word, index in tokenizer.word_index.items():
        if index == pos:
            text = text + " " + word
            print(text)
            break


[[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 7]]
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 81ms/step
what is
[[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 7 2]]
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 72ms/step
what is the
[[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 7 2 1]]
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 78ms/step
what is the duration
[[ 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0  0  0  0  7  2  1 38]]
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 80ms/step
what is the duration of
[[ 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
   0  0  0  0  0  0  0  0  0  0  0  0  0  0
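
The generation loop can also be written with a reverse lookup table instead of scanning `tokenizer.word_index` on every step. The sketch below is one possible shape for that, using hypothetical stand-ins so it runs without a trained model: `generate`, `toy_predict`, `chain`, and the small `index_word` dictionary are all assumptions made for illustration. In the notebook, `index_word` could be the Keras tokenizer's `index_word` attribute, and `predict_next` could wrap the padding and `np.argmax(model.predict(...))` steps.

```python
def generate(seed_ids, predict_next, index_word, steps=5):
    """Greedily extend seed_ids by up to `steps` predicted token ids.

    predict_next: callable taking the current id list and returning the next id.
    index_word:   dict mapping token id -> word.
    """
    ids = list(seed_ids)
    for _ in range(steps):
        next_id = predict_next(ids)
        if next_id not in index_word:  # e.g. the padding id 0 has no word
            break
        ids.append(next_id)
    return " ".join(index_word[i] for i in ids)


# Hypothetical vocabulary and a toy "model" that follows a fixed chain
index_word = {7: "what", 2: "is", 1: "the", 38: "duration", 13: "of"}
chain = {7: 2, 2: 1, 1: 38, 38: 13}


def toy_predict(ids):
    """Return the id that follows the last token, or 0 if none is known."""
    return chain.get(ids[-1], 0)


print(generate([7], toy_predict, index_word))
```

Keeping the prediction step behind a callable makes the loop easy to test and makes the dictionary lookup a single operation per generated word.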

In [73]:
import numpy as np

In [71]:
tokenizer.word_index

{'the': 1,
 'is': 2,
 'i': 3,
 'can': 4,
 'you': 5,
 'will': 6,
 'what': 7,
 'program': 8,
 'a': 9,
 'be': 10,
 'and': 11,
 'subscription': 12,
 'of': 13,
 'for': 14,
 'payment': 15,
 'not': 16,
 'if': 17,
 'to': 18,
 'join': 19,
 'yes': 20,
 'doubt': 21,
 'data': 22,
 '7': 23,
 'live': 24,
 'course': 25,
 'monthly': 26,
 'are': 27,
 'in': 28,
 'included': 29,
 'session': 30,
 'all': 31,
 'sessions': 32,
 'get': 33,
 'access': 34,
 'support': 35,
 'science': 36,
 'month': 37,
 'duration': 38,
 'covered': 39,
 'learning': 40,
 'case': 41,
 'studies': 42,
 'no': 43,
 'class': 44,
 'schedule': 45,
 'on': 46,
 'how': 47,
 'email': 48,
 'late': 49,
 'provided': 50,
 'full': 51,
 'from': 52,
 'assistance': 53,
 'mentorship': 54,
 'assignments': 55,
 'follows': 56,
 'total': 57,
 'deep': 58,
 'this': 59,
 'nlp': 60,
 'miss': 61,
 'where': 62,
 'long': 63,
 'during': 64,
 'past': 65,
 'once': 66,
 'make': 67,
 'previous': 68,
 'unlocked': 69,
 'do': 70,
 'need': 71,
 'submit': 72,
 'tasks': 73