<a href="https://colab.research.google.com/github/RajMV05102004/DeepLeanring/blob/main/WikiWordPredictor.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import pandas as pd

# Read the file one line per row into a single "Text" column.
# sep="\t" keeps each line whole, assuming the text itself contains no tabs.
df = pd.read_csv(r"/content/wiki.train.tokens", sep="\t", header=None, names=["Text"])

# Display the first few lines
df.head()

In [None]:
valid_df=pd.read_csv(r"/content/wiki.valid.tokens", sep="\t", header=None, names=["Text"])

In [None]:
valid_df.sample(5)

In [None]:
def preprocess(df,col):
  # Cleans df[col] in place
  # Remove the <unk> placeholder token (literal match, not a regex)
  df[col] = df[col].str.replace("<unk>", "", regex=False)
  # Remove every character that is not a letter or whitespace (digits are dropped too)
  df[col] = df[col].str.replace(r"[^a-zA-Z\s]", "", regex=True)
  # Convert everything to lowercase
  df[col]=df[col].str.lower()
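
To see what these three steps do to a single line, here is a minimal sketch that applies the same operations to one string with the standard `re` module instead of a pandas column. The helper name `clean_line` and the sample sentence are made up for illustration.

```python
import re

def clean_line(text):
    # Same three steps as preprocess, applied to one string instead of a column
    text = text.replace("<unk>", "")          # drop the <unk> placeholder
    text = re.sub(r"[^a-zA-Z\s]", "", text)   # keep only letters and whitespace
    return text.lower()

sample = "The <unk> game , released in 2011 @-@ sold well ."  # made-up line
cleaned = clean_line(sample)
print(cleaned.split())  # split() hides the leftover runs of spaces
```

Note that the year disappears along with the punctuation, and that removed tokens leave extra spaces behind. The extra spaces are harmless here because the tokenizer splits on whitespace.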


In [None]:
preprocess(df,'Text')

In [None]:
preprocess(valid_df,'Text')

In [None]:
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer

In [None]:
tokenizer=Tokenizer()

In [None]:
def fit_tokenizer(obj,tokenizer,col):
  tokenizer.fit_on_texts(obj[col])
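
As a rough picture of what fitting produces, the sketch below builds a word-to-index mapping in plain Python: words are counted and then numbered from 1 in order of decreasing frequency. This is a simplified stand-in, not the Keras implementation (the real `Tokenizer` also lowercases and filters punctuation by default); `build_word_index` and the toy texts are hypothetical.

```python
from collections import Counter

def build_word_index(texts):
    counts = Counter()
    for text in texts:
        counts.update(text.split())
    # most_common() sorts by count, highest first.
    # Numbering starts at 1, so index 0 stays free for padding.
    return {word: i for i, (word, _) in enumerate(counts.most_common(), start=1)}

toy_texts = ["the cat sat", "the cat", "the"]  # made-up corpus
word_index = build_word_index(toy_texts)
print(word_index)  # {'the': 1, 'cat': 2, 'sat': 3}
```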


In [None]:
fit_tokenizer(df,tokenizer,'Text')

In [None]:
fit_tokenizer(valid_df,tokenizer,'Text')

In [None]:
tokenizer.word_index

In [None]:
tokenizer.word_counts

In [None]:
len(tokenizer.word_index)

In [None]:
Train_data=df.copy()# Keeping a copy of the preprocessed training DataFrame
Valid_data=valid_df.copy()# Keeping a copy of the preprocessed validation DataFrame

In [None]:
#Each sentence is expanded into growing prefixes: its first 2 tokens, its first 3, and so on up to the full sentence
def createDataset(df,tokenizer):
  input_sequence=[]

  for sentence in df["Text"]:
    tokenized_sentence=tokenizer.texts_to_sequences([sentence])[0]

    #Every prefix of length >= 2 is one sample; its last token later becomes the label
    for i in range(1,len(tokenized_sentence)):
      n_gram=tokenized_sentence[:i+1]
      input_sequence.append(n_gram)
  return input_sequence
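
The inner loop of createDataset can be illustrated on a single tokenized sentence. The sketch below uses made-up token ids and a hypothetical helper name, `prefix_sequences`.

```python
def prefix_sequences(token_ids):
    # One sample per prefix of length 2, 3, ..., len(token_ids)
    return [token_ids[:i + 1] for i in range(1, len(token_ids))]

toy = [4, 7, 2, 9]  # made-up token ids for a four-word sentence
prefixes = prefix_sequences(toy)
print(prefixes)  # [[4, 7], [4, 7, 2], [4, 7, 2, 9]]
```

A sentence of n tokens therefore contributes n-1 samples, and a sentence with fewer than two tokens contributes none.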

In [None]:
input_sequence=createDataset(df,tokenizer)

In [None]:
valid_input_sequence=createDataset(valid_df,tokenizer)

In [None]:
valid_input_sequence[:5] #Showing only the first few sequences; the full list is very long

In [None]:
len(valid_input_sequence)

In [None]:
#We need the maximum length among the validation input sequences
valid_maxlen=max(len(x) for x in valid_input_sequence)

In [None]:
#We need the maximum length in the input sequence
maxlen=max(len(x) for x in input_sequence)

In [None]:
maxlen #This is the length of the longest training sequence

In [None]:
valid_maxlen

In [None]:
from tensorflow.keras.preprocessing.sequence import pad_sequences

In [None]:
def padding(sequence,maxlen):
  #Left-pad ('pre') every sequence with zeros up to maxlen, so the most recent word stays in the last position
  return pad_sequences(sequence,maxlen=maxlen,padding='pre')
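
The effect of `padding='pre'` can be sketched in plain Python. This is an illustrative re-implementation under the assumption of the default `truncating='pre'` behaviour, not the Keras code itself; `pre_pad` and the toy sequences are hypothetical.

```python
def pre_pad(sequences, maxlen, value=0):
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # if too long, keep the last maxlen tokens
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

toy_sequences = [[4, 7], [4, 7, 2], [4, 7, 2, 9]]  # made-up prefixes
padded = pre_pad(toy_sequences, maxlen=4)
print(padded)  # [[0, 0, 4, 7], [0, 4, 7, 2], [4, 7, 2, 9]]

# Because the zeros go on the left, the last column always holds the newest token
last_tokens = [row[-1] for row in padded]
print(last_tokens)  # [7, 2, 9]
```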

In [None]:
input_padded_sequence=padding(input_sequence,maxlen)
#Padding the validation set to the training maxlen as well, so X_train and X_val have the same width
valid_padded_sequence=padding(valid_input_sequence,maxlen)

In [None]:
import pickle

#Saving the processed padded sequence in a pickle file

In [None]:
with open('input_padded_sequence.pkl', 'wb') as f:
    pickle.dump((input_padded_sequence, tokenizer), f)

In [None]:
with open('valid_padded_sequence.pkl', 'wb') as f:
    pickle.dump((valid_padded_sequence, tokenizer), f)

# Preprocessing is done up to here
Now loading the training and validation sequences

In [1]:
import pickle

In [2]:
with open('/content/input_padded_sequence.pkl', 'rb') as f:
    input_padded_sequence, tokenizer = pickle.load(f)


In [3]:
with open('/content/valid_padded_sequence.pkl', 'rb') as f:
    valid_padded_sequence, tokenizer = pickle.load(f)


In [4]:
#All tokens except the last form the input; the last token is the label (the next word)
X_train=input_padded_sequence[:,:-1]
y_train=input_padded_sequence[:,-1]
X_val=valid_padded_sequence[:,:-1]
y_val=valid_padded_sequence[:,-1]

In [5]:
from tensorflow.keras.utils import to_categorical

# Input:

1. y: The labels (target values) for the dataset. These are integer-encoded class labels (e.g., [0, 1, 2, ...]); here each label is the index of the next word.

2. tokenizer: A tokenizer object (from Keras' Tokenizer class) that has been fitted on the text data. It contains the vocabulary and the word-to-index mapping.

# Purpose:

The function converts the integer-encoded labels (y) into a one-hot encoded format, which is the label format the categorical_crossentropy loss expects when the output layer uses a softmax activation.

# to_categorical:

1. This is a utility function from Keras (keras.utils.to_categorical) that converts a class vector (integers) into a binary class matrix (one-hot encoding).

2. For example, if y = [0, 1, 2] and num_classes=3, the output will be:
  [[1., 0., 0.],
  [0., 1., 0.],
  [0., 0., 1.]]

3. num_classes=len(tokenizer.word_index)+1: len(tokenizer.word_index) gives the size of the vocabulary (number of unique words).

4. +1 is added because the Tokenizer numbers words starting from 1, and index 0 is reserved for padding.

5. This ensures that the one-hot encoded vectors have the correct dimensionality, matching the number of classes (words in the vocabulary plus the padding index).
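
The one-hot step described above can be sketched without Keras. The helper `one_hot` and the three-word vocabulary are hypothetical and only meant to show why the width is the vocabulary size plus one.

```python
def one_hot(labels, num_classes):
    rows = []
    for label in labels:
        row = [0.0] * num_classes
        row[label] = 1.0  # a single 1 at the position of the word index
        rows.append(row)
    return rows

toy_word_index = {"the": 1, "cat": 2, "sat": 3}  # hypothetical fitted vocabulary
num_classes = len(toy_word_index) + 1            # indices 0..3, with 0 for padding
encoded = one_hot([1, 3], num_classes)
print(encoded)  # [[0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
```

Without the +1, the largest word index (3 here) would fall outside the row. The encoded matrix has one row per sample and one column per class, so its size grows with both the number of samples and the vocabulary size.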

In [6]:
def preprocess_labels(y,tokenizer):
  #One-hot encoding the labels: every vocabulary index (plus the padding index 0) is one class
  return to_categorical(y,num_classes=len(tokenizer.word_index)+1)

In [7]:
y_train=preprocess_labels(y_train,tokenizer)
y_val=preprocess_labels(y_val,tokenizer)

Model1 Structure:
1. Embedding Layer
2. Bidirectional LSTM layer
3. Bidirectional LSTM layer
4. Dense Layer

In [8]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding,Bidirectional,LSTM,Dense

In [9]:
model1=Sequential()
# input_dim: the vocabulary size, +1 for the padding index 0.
# output_dim: the size of the word vectors (embeddings). It is set to 200, so each word
# index from the tokenizer is converted into a dense 200-dimensional embedding vector.
model1.add(Embedding(input_dim=len(tokenizer.word_index)+1,output_dim=200))
# return_sequences=True passes the output at every time step to the next LSTM layer
model1.add(Bidirectional(LSTM(256,return_sequences=True)))
model1.add(Bidirectional(LSTM(256)))
# One output unit per class; softmax turns the scores into a probability for each word
model1.add(Dense(len(tokenizer.word_index)+1,activation='softmax'))

Summary of the Model
Input: Integer-encoded sequences of words (from the tokenizer).

Embedding Layer: Converts words into dense 200-dimensional vectors.

Bidirectional LSTMs: Two layers of bidirectional LSTMs process the sequence to capture contextual information.

Output Layer: A dense layer with softmax activation predicts the next word (or class) based on the processed sequence.
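
To get a feel for the size of this architecture, the sketch below estimates the number of trainable parameters per layer. It assumes the Keras default LSTM layout (four gates, one bias vector each) and takes the vocabulary size as an argument, since the real value depends on the fitted tokenizer. The function names and the value 10000 are hypothetical.

```python
def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights, and a bias
    return 4 * (units * (input_dim + units) + units)

def model1_params(vocab_size, embed_dim=200, units=256):
    # vocab_size stands for len(tokenizer.word_index) + 1
    embedding = vocab_size * embed_dim
    bilstm_1 = 2 * lstm_params(embed_dim, units)   # forward + backward direction
    bilstm_2 = 2 * lstm_params(2 * units, units)   # input is the 512-wide BiLSTM output
    dense = (2 * units + 1) * vocab_size           # weights plus one bias per class
    return embedding, bilstm_1, bilstm_2, dense

counts = model1_params(vocab_size=10000)  # hypothetical vocabulary size
print(counts, sum(counts))
```

The two recurrent layers have a fixed cost (935,936 and 1,574,912 parameters), while the embedding and output layers scale linearly with the vocabulary size.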

In [10]:
model1.compile(optimizer='adam',loss='categorical_crossentropy',metrics=['accuracy'])

In [None]:
model1.fit(X_train,y_train,epochs=10,batch_size=128,validation_data=(X_val,y_val))
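
Once training has finished, the softmax output can be turned back into a word by taking the most probable index and looking it up in the tokenizer's index-to-word mapping. The sketch below shows only that lookup step, with a made-up probability vector and a hypothetical `decode_prediction` helper. In the notebook, the probabilities would presumably come from `model1.predict` on an input padded the same way as X_train, and the mapping from `tokenizer.index_word`.

```python
def decode_prediction(probabilities, index_word):
    # Position of the largest probability = predicted word index
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    # Index 0 is the padding slot and has no word, so .get() returns None for it
    return index_word.get(best)

toy_index_word = {1: "the", 2: "cat", 3: "sat"}  # hypothetical vocabulary
toy_probs = [0.05, 0.10, 0.70, 0.15]             # made-up softmax output
next_word = decode_prediction(toy_probs, toy_index_word)
print(next_word)  # cat
```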