In [1]:
docs = """
Exploring the Wonders of Space
What are black holes, and why are they significant?
Black holes are regions in space where gravity is so strong that nothing, not even light, can escape.
They are significant because they help scientists understand the limits of physics, particularly how gravity interacts with spacetime.
How do stars influence the universe?
Stars are the building blocks of galaxies.
They produce light and heat through nuclear fusion, creating elements essential for life.
Stars also play a role in forming planets and distributing energy throughout the universe.
What is the importance of space exploration?
Space exploration expands our knowledge of the universe and helps develop new technologies.
It inspires innovation, provides a better understanding of Earth’s place in the cosmos, and may someday enable humanity to inhabit other planets.
How are scientists searching for life beyond Earth?
Scientists use telescopes to analyze exoplanets' atmospheres and look for conditions that could support life.
Missions like the James Webb Space Telescope and Mars rovers aim to uncover signs of water, organic molecules, or microbial life.
What role do humans have in shaping the future of space?
Human efforts in space exploration, like building space stations and planning missions to Mars, will pave the way for interplanetary travel.
Innovations inspired by space research may lead to sustainable solutions on Earth and beyond.
Can ordinary people contribute to space science?
Yes, through initiatives like citizen science projects, people can help analyze astronomical data, identify celestial phenomena, and support space missions. These collaborations make space exploration more inclusive and impactful.
Where can we learn more about the universe?
Many organizations, like NASA, ESA, and local astronomy clubs, provide free resources and events.
Online platforms like YouTube and educational websites are also excellent places to deepen your knowledge.
Hello k xa khaber.
K gardai xaau timi.
kasto bhai ra xa pada.
Khana khanu bha ko.
"""

In [2]:
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer

In [3]:
# Initialize the tokenizer to convert text into numerical sequences
tokenizer = Tokenizer()

In [4]:
# Fit the tokenizer on the given text to build the vocabulary and word index
tokenizer.fit_on_texts([docs])

In [5]:
tokenizer.word_index

{'and': 1,
 'the': 2,
 'space': 3,
 'of': 4,
 'are': 5,
 'to': 6,
 'in': 7,
 'like': 8,
 'they': 9,
 'can': 10,
 'universe': 11,
 'for': 12,
 'life': 13,
 'exploration': 14,
 'what': 15,
 'scientists': 16,
 'how': 17,
 'stars': 18,
 'missions': 19,
 'black': 20,
 'holes': 21,
 'significant': 22,
 'where': 23,
 'gravity': 24,
 'is': 25,
 'that': 26,
 'light': 27,
 'help': 28,
 'do': 29,
 'building': 30,
 'through': 31,
 'also': 32,
 'a': 33,
 'role': 34,
 'planets': 35,
 'knowledge': 36,
 'may': 37,
 'beyond': 38,
 'earth': 39,
 'analyze': 40,
 'support': 41,
 'mars': 42,
 'people': 43,
 'science': 44,
 'more': 45,
 'k': 46,
 'xa': 47,
 'exploring': 48,
 'wonders': 49,
 'why': 50,
 'regions': 51,
 'so': 52,
 'strong': 53,
 'nothing': 54,
 'not': 55,
 'even': 56,
 'escape': 57,
 'because': 58,
 'understand': 59,
 'limits': 60,
 'physics': 61,
 'particularly': 62,
 'interacts': 63,
 'with': 64,
 'spacetime': 65,
 'influence': 66,
 'blocks': 67,
 'galaxies': 68,
 'produce': 69,
 'heat': 70

In [6]:
len(tokenizer.word_index)

194
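
The word index above is ordered by frequency: the most common word gets index 1, and index 0 is never assigned. A minimal sketch of that idea follows. It is a simplified stand-in written for illustration, not the Keras implementation; the helper name `build_word_index` and the regular expression used for splitting are assumptions.

```python
from collections import Counter
import re

def build_word_index(text):
    """Simplified stand-in for fitting a tokenizer (hypothetical helper)."""
    # Lowercase, then treat runs of letters, digits and apostrophes as words
    words = re.findall(r"[a-z0-9']+", text.lower())
    counts = Counter(words)
    # Sort by descending frequency; sorted() is stable, so ties keep first-seen order
    ranked = sorted(counts, key=lambda w: -counts[w])
    # Indices start at 1, leaving 0 free for padding
    return {word: i for i, word in enumerate(ranked, start=1)}

sample = "Stars are the building blocks of galaxies. Stars also play a role."
word_index = build_word_index(sample)
print(word_index)
```

Because 0 is reserved, a vocabulary of 194 words needs 195 output classes later on.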

In [9]:
# Initialize a list to store the generated input sequences
input_sequences = []
# Split the text into individual sentences by newline characters
for sentence in docs.split('\n'):
    # Convert the current sentence into a sequence of integers based on the tokenizer
    tokenized_sentence = tokenizer.texts_to_sequences([sentence])[0]
    # print(tokenized_sentence)

    # Create multiple subsequences from the tokenized sentence
    # For each word in the sentence (after the first one), append the sequence from the start to the current word
    for i in range(1, len(tokenized_sentence)):
        input_sequences.append(tokenized_sentence[:i + 1])  # Add the current subsequence to the list

In [13]:
input_sequences

[[48, 2],
 [48, 2, 49],
 [48, 2, 49, 4],
 [48, 2, 49, 4, 3],
 [15, 5],
 [15, 5, 20],
 [15, 5, 20, 21],
 [15, 5, 20, 21, 1],
 [15, 5, 20, 21, 1, 50],
 [15, 5, 20, 21, 1, 50, 5],
 [15, 5, 20, 21, 1, 50, 5, 9],
 [15, 5, 20, 21, 1, 50, 5, 9, 22],
 [20, 21],
 [20, 21, 5],
 [20, 21, 5, 51],
 [20, 21, 5, 51, 7],
 [20, 21, 5, 51, 7, 3],
 [20, 21, 5, 51, 7, 3, 23],
 [20, 21, 5, 51, 7, 3, 23, 24],
 [20, 21, 5, 51, 7, 3, 23, 24, 25],
 [20, 21, 5, 51, 7, 3, 23, 24, 25, 52],
 [20, 21, 5, 51, 7, 3, 23, 24, 25, 52, 53],
 [20, 21, 5, 51, 7, 3, 23, 24, 25, 52, 53, 26],
 [20, 21, 5, 51, 7, 3, 23, 24, 25, 52, 53, 26, 54],
 [20, 21, 5, 51, 7, 3, 23, 24, 25, 52, 53, 26, 54, 55],
 [20, 21, 5, 51, 7, 3, 23, 24, 25, 52, 53, 26, 54, 55, 56],
 [20, 21, 5, 51, 7, 3, 23, 24, 25, 52, 53, 26, 54, 55, 56, 27],
 [20, 21, 5, 51, 7, 3, 23, 24, 25, 52, 53, 26, 54, 55, 56, 27, 10],
 [20, 21, 5, 51, 7, 3, 23, 24, 25, 52, 53, 26, 54, 55, 56, 27, 10, 57],
 [9, 5],
 [9, 5, 22],
 [9, 5, 22, 58],
 [9, 5, 22, 58, 9],
 [9, 5, 22
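
Each sentence of n tokens yields n - 1 training sequences, one for every prefix of length two or more. The sketch below repeats the inner loop on the token ids of the first corpus line (taken from the output above) and prints each prefix as a context and its next word. The helper name `prefix_sequences` is an assumption used only here.

```python
def prefix_sequences(token_ids):
    """Return every prefix of length >= 2, mirroring the loop above."""
    return [token_ids[:i + 1] for i in range(1, len(token_ids))]

# Token ids for "Exploring the Wonders of Space"
title = [48, 2, 49, 4, 3]

for seq in prefix_sequences(title):
    # Everything but the last token is the context; the last token is the target
    print(seq[:-1], "->", seq[-1])
```

A single-word line produces no sequences at all, since there is no next word to predict.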

In [14]:
max_len = max([len(x) for x in input_sequences])

In [15]:
max_len

29

In [16]:
from tensorflow.keras.preprocessing.sequence import pad_sequences
# Pad the input sequences to ensure all sequences have the same length
# maxlen specifies the fixed length for all sequences; padding='pre' adds zeros at the beginning
padded_input_sequences = pad_sequences(input_sequences, maxlen=max_len, padding='pre')

In [17]:
padded_input_sequences

array([[  0,   0,   0, ...,   0,  48,   2],
       [  0,   0,   0, ...,  48,   2,  49],
       [  0,   0,   0, ...,   2,  49,   4],
       ...,
       [  0,   0,   0, ...,   0, 191, 192],
       [  0,   0,   0, ..., 191, 192, 193],
       [  0,   0,   0, ..., 192, 193, 194]], dtype=int32)
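
Pre-padding puts the zeros on the left so the most recent words always sit at the end of the row. A rough pure-Python sketch of the same behaviour, assuming the default settings used above; `pad_pre` is a hypothetical helper, not a Keras function.

```python
def pad_pre(sequences, maxlen, value=0):
    """Left-pad each sequence to `maxlen`; longer ones keep only their last `maxlen` tokens."""
    padded = []
    for seq in sequences:
        seq = seq[-maxlen:]  # drop the oldest tokens if the sequence is too long
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

demo = pad_pre([[48, 2], [48, 2, 49]], maxlen=5)
for row in demo:
    print(row)
```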

In [18]:
# Separate the input (X) and output (y) from the padded sequences
# X contains all tokens except the last one in each sequence
X = padded_input_sequences[:,:-1]

In [19]:
# y contains only the last token in each sequence (the target output)
y = padded_input_sequences[:,-1]

In [20]:
X.shape

(286, 28)

In [21]:
# Example: If X.shape is (num_samples, sequence_length)
# X = X.reshape((X.shape[0], X.shape[1], 1))  # Reshape to 3D with 1 feature per timestep

In [22]:
X.shape

(286, 28)

In [23]:
y.shape

(286,)
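
The shapes follow from the split: each padded row has 29 tokens, so dropping the last one leaves 28 input columns and one target per row. A small sketch on three shortened rows, using plain lists instead of the NumPy array; the names `rows`, `X_demo` and `y_demo` are illustrative only.

```python
# Three padded rows, shortened to length 5 for readability
rows = [
    [0, 0, 0, 48, 2],
    [0, 0, 48, 2, 49],
    [0, 48, 2, 49, 4],
]

X_demo = [row[:-1] for row in rows]  # context: all tokens except the last
y_demo = [row[-1] for row in rows]   # target: the last token of each row

for context, target in zip(X_demo, y_demo):
    print(context, "->", target)
```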

In [24]:
from tensorflow.keras.utils import to_categorical
# Convert the target output (y) into one-hot encoded format
# This is needed for multi-class classification with a categorical output
# num_classes=195: 194 words in the vocabulary + 1 for the padding index 0
y = to_categorical(y, num_classes=195)

In [25]:
y.shape

(286, 195)
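
One-hot encoding turns each target index into a vector of 195 entries with a single 1. The sketch below shows the idea in plain Python; `one_hot` is a hypothetical helper, and the class count assumes the 194-word vocabulary reported earlier plus the padding index.

```python
def one_hot(label, num_classes):
    """Return a list of zeros with a 1.0 at position `label`."""
    vec = [0.0] * num_classes
    vec[label] = 1.0
    return vec

vocab_size = 194              # len(tokenizer.word_index) from the cell above
num_classes = vocab_size + 1  # index 0 is reserved for padding

encoded = one_hot(49, num_classes)  # 49 is the id of 'wonders'
print(len(encoded), encoded.index(1.0))
```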

In [26]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

In [27]:
# Initialize a sequential model
model = Sequential()
# Add an embedding layer to convert input tokens into dense vector representations
# - 195: Vocabulary size (194 unique words + 1 for the padding index 0)
# - 100: Dimension of the dense vector for each word (embedding size)
# - input_length=28: Length of each input sequence
model.add(Embedding(195, 100, input_length=28))
# Add an LSTM layer with 150 units to process sequential data
model.add(LSTM(150))

# Add a Dense (fully connected) output layer with 195 units
# - Softmax activation is used to predict probabilities for 195 possible outputs
model.add(Dense(195, activation='softmax'))



In [28]:
# Compile the model
# - loss='categorical_crossentropy': Suitable for multi-class classification
# - optimizer='adam': Adaptive optimizer for efficient learning
# - metrics=['accuracy']: Track accuracy during training
model.compile(loss='categorical_crossentropy', optimizer='adam',metrics=['accuracy'])

In [None]:
# model.summary()
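
With the summary commented out, the parameter counts can be estimated by hand. This sketch assumes default layer settings (biases enabled, a standard LSTM with four gates); the helper functions are written for illustration and are not part of Keras.

```python
def embedding_params(vocab_size, embed_dim):
    # One vector of size embed_dim per token id
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias
    return 4 * (units * (input_dim + units) + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit
    return input_dim * units + units

emb = embedding_params(195, 100)
lstm = lstm_params(100, 150)
dense = dense_params(150, 195)
total = emb + lstm + dense

print("Embedding:", emb)
print("LSTM:", lstm)
print("Dense:", dense)
print("Total:", total)
```

Most of the weights sit in the LSTM layer, which is large relative to the 286 training samples.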

In [None]:
model.fit(X,y,epochs=100)

Epoch 1/100
9/9 ━━━━━━━━━━━━━━━━━━━━ 4s 63ms/step - accuracy: 0.0170 - loss: 5.2723
Epoch 2/100
9/9 ━━━━━━━━━━━━━━━━━━━━ 1s 60ms/step - accuracy: 0.0381 - loss: 5.2182
Epoch 3/100
9/9 ━━━━━━━━━━━━━━━━━━━━ 1s 62ms/step - accuracy: 0.0522 - loss: 5.0135
Epoch 4/100
9/9 ━━━━━━━━━━━━━━━━━━━━ 1s 56ms/step - accuracy: 0.0486 - loss: 4.9666
Epoch 5/100
9/9 ━━━━━━━━━━━━━━━━━━━━ 1s 59ms/step - accuracy: 0.0476 - loss: 4.9228
Epoch 6/100
9/9 ━━━━━━━━━━━━━━━━━━━━ 1s 61ms/step - accuracy: 0.0673 - loss: 4.9031
Epoch 7/100
9/9 ━━━━━━━━━━━━━━━━━━━━ 1s 60ms/step - accuracy: 0.0612 - loss: 4.7233
Epoch 8/100
9/9 ━━━━━━━━━━━━━━━━━━━━ 0s 49ms/step - accuracy: 0.0736 - loss: 4.8212
Epoch 9/100
9/9 ━━━━━━━━━━━━━━━━━━━━ ...

<keras.src.callbacks.history.History at 0x79600038d5d0>
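
The first-epoch loss of about 5.27 is what an untrained classifier should give. If the softmax spreads probability evenly over all 195 classes, categorical cross-entropy equals the natural log of the class count. A quick check, assuming a perfectly uniform prediction:

```python
import math

num_classes = 195
uniform_prob = 1.0 / num_classes

# Cross-entropy for a one-hot target is -log(probability assigned to the true class)
loss = -math.log(uniform_prob)
print(f"Expected starting loss: {loss:.3f}")
```

The value is close to the 5.2723 reported for epoch 1, so the early epochs start from near-random guessing.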

In [None]:
model.save('/content/drive/MyDrive/workshop_program/code_notebook/trained_models/next_word_predictor.h5')



In [None]:
from tensorflow.keras.models import load_model
model = load_model('/content/drive/MyDrive/workshop_program/code_notebook/trained_models/next_word_predictor.h5')



In [None]:
import numpy as np

In [None]:
import time
text = "Kasto"

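for i in range(5):
  # Tokenize the text generated so far
  token_text = tokenizer.texts_to_sequences([text])[0]
  # Pad to the model's input length (max_len - 1 = 28), matching the training data
  padded_token_text = pad_sequences([token_text], maxlen=max_len - 1, padding='pre')
  # Predict the index of the most probable next word
  pos = np.argmax(model.predict(padded_token_text))

  # Look up the word for the predicted index and append it to the text
  for word,index in tokenizer.word_index.items():
    if index == pos:
      text = text + " " + word
      print(text)
      time.sleep(2)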

[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 32ms/step
Kasto bhai
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 30ms/step
Kasto bhai ra
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 37ms/step
Kasto bhai ra xa
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 70ms/step
Kasto bhai ra xa pada
[1m1/1[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 60ms/step
Kasto bhai ra xa pada pada
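
The loop above is greedy decoding: at every step it appends the single most probable word. The sketch below isolates that logic with a toy stand-in for the model, so the vocabulary, the probability table and the names `greedy_generate` and `toy_probs` are all assumptions made for illustration. It also uses a reverse dictionary for the index-to-word lookup instead of scanning `word_index`; fitted Keras tokenizers expose a similar mapping as `index_word`.

```python
def greedy_generate(seed_ids, next_token_probs, steps):
    """Append the most probable next token id `steps` times."""
    ids = list(seed_ids)
    for _ in range(steps):
        probs = next_token_probs(ids)
        # argmax over the probability list
        best = max(range(len(probs)), key=probs.__getitem__)
        ids.append(best)
    return ids

# Hypothetical three-word vocabulary; index 0 is the padding slot
index_word = {1: "kasto", 2: "bhai", 3: "ra"}

def toy_probs(ids):
    """Stand-in for model.predict: probabilities depend only on the last token."""
    table = {
        1: [0.0, 0.1, 0.8, 0.1],
        2: [0.0, 0.1, 0.1, 0.8],
        3: [0.0, 0.2, 0.1, 0.7],
    }
    return table[ids[-1]]

generated = greedy_generate([1], toy_probs, steps=3)
print(" ".join(index_word[i] for i in generated))
```

Because greedy decoding always takes the top choice, it can get stuck repeating a word, as with the final "pada pada" above.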
