##### Project Overview: This project predicts the next word of a hadith using a model trained on the Hadith Dataset, to help us when we write a hadith.
##### Dataset: Hadith Dataset
##### Link of dataset: https://www.kaggle.com/datasets/fahd09/hadith-dataset

#### Goals: 
1. Help in writing hadith from the Sunna with an accuracy of more than 65%.
2. Apply an LSTM in this project.

#### Challenges:
1. It is quite challenging to reach a high accuracy on the Hadith Dataset because the text is ambiguous: many different words follow the same pattern, such as:

"قال أبو بكر رضي الله عنه",

"قال ابو سلمة رضي الله عنه"

..etc
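
To make this ambiguity concrete, the sketch below counts which words follow a shared prefix in a tiny hand-made corpus. The three sentences and the helper name `next_word_counts` are illustrative assumptions and are not taken from the dataset.

```python
from collections import Counter

def next_word_counts(sentences, prefix):
    """Count the words that directly follow `prefix` in each sentence."""
    prefix_tokens = prefix.split()
    n = len(prefix_tokens)
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        for i in range(len(tokens) - n):
            if tokens[i:i + n] == prefix_tokens:
                counts[tokens[i + n]] += 1
    return counts

# Hypothetical toy corpus: same pattern, different name after the prefix
toy_corpus = [
    "قال أبو بكر رضي الله عنه",
    "قال أبو سلمة رضي الله عنه",
    "قال أبو هريرة رضي الله عنه",
]

counts = next_word_counts(toy_corpus, "قال أبو")
print(counts)
```

Each continuation appears once here, so a model that sees only the prefix cannot prefer one name over another; that is the kind of ambiguity that limits next-word accuracy.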

In [1]:
# Import the important libraries

# for data manipulation
import numpy as np
import pandas as pd

# Tensorflow
import tensorflow as tf
from tensorflow import keras

# Preprocessing data
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Model
from tensorflow.keras.models import Sequential # model
from tensorflow.keras.layers import LSTM, Dense, Embedding # Layers
from tensorflow.keras.utils import to_categorical

In [3]:
# read csv file through pandas
df = pd.read_csv("hadith.csv")
df.head()

Unnamed: 0,id,hadith_id,source,chapter_no,hadith_no,chapter,chain_indx,text_ar,text_en
0,0,1,Sahih Bukhari,1,1,Revelation - كتاب بدء الوحى,"30418, 20005, 11062, 11213, 11042, 3",حدثنا الحميدي عبد الله بن الزبير، قال حدثنا سف...,Narrated 'Umar bin Al-Khattab: ...
1,1,2,Sahih Bukhari,1,2,Revelation - كتاب بدء الوحى,"30355, 20001, 11065, 10511, 53",حدثنا عبد الله بن يوسف، قال أخبرنا مالك، عن هش...,Narrated 'Aisha: ...
2,2,3,Sahih Bukhari,1,3,Revelation - كتاب بدء الوحى,"30399, 20023, 11207, 11013, 10511, 53",حدثنا يحيى بن بكير، قال حدثنا الليث، عن عقيل، ...,Narrated 'Aisha: (the m...
3,3,4,Sahih Bukhari,1,4,Revelation - كتاب بدء الوحى,"11013, 10567, 34",قال ابن شهاب وأخبرني أبو سلمة بن عبد الرحمن، أ...,Narrated Jabir bin 'Abdullah Al-Ansari while ...
4,4,5,Sahih Bukhari,1,5,Revelation - كتاب بدء الوحى,"20040, 20469, 11399, 11050, 17",حدثنا موسى بن إسماعيل، قال حدثنا أبو عوانة، قا...,Narrated Said bin Jubair: ...


In [4]:
# Information of the dataset
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 34441 entries, 0 to 34440
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   id          34441 non-null  int64 
 1   hadith_id   34441 non-null  int64 
 2   source      34441 non-null  object
 3   chapter_no  34441 non-null  int64 
 4   hadith_no   34441 non-null  object
 5   chapter     34441 non-null  object
 6   chain_indx  34318 non-null  object
 7   text_ar     34433 non-null  object
 8   text_en     33588 non-null  object
dtypes: int64(3), object(6)
memory usage: 2.4+ MB


In [5]:
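# Remove right-to-left marks (U+200F) and collapse double spaces.
# Assigning the result back avoids relying on an in-place change to a column selection.
df["text_ar"] = df["text_ar"].replace(['\u200f', '  '], ["", " "], regex=True)
df["text_ar"][0]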

'حدثنا الحميدي عبد الله بن الزبير، قال حدثنا سفيان، قال حدثنا يحيى بن سعيد الأنصاري، قال أخبرني محمد بن إبراهيم التيمي، أنه سمع علقمة بن وقاص الليثي، يقول سمعت عمر بن الخطاب رضى الله عنه على المنبر قال سمعت رسول الله صلى الله عليه وسلم يقول " إنما الأعمال بالنيات، وإنما لكل امرئ ما نوى، فمن كانت هجرته إلى دنيا يصيبها أو إلى امرأة ينكحها فهجرته إلى ما هاجر إليه ".'

In [6]:
# Assign text_ar (Arabic hadith text) to corpus
# Note: astype(str) turns the few missing values into the string "nan"
corpus = df["text_ar"].astype(str).tolist()

In [7]:
# We have 34441 Hadith
len(corpus)

34441

In [8]:
# Tokenizer: create an index from each word in corpus
# Create Instance of Tokenizer
tokenizer = Tokenizer(oov_token='<oov>') # For those words which are not found in word_index
# Fit the tokenizer variable with our text
tokenizer.fit_on_texts(corpus) 
# build sequences of word using tokenizer
sequences = tokenizer.texts_to_sequences(corpus)
# Vocabulary size (+1 because index 0 is reserved for padding)
num_classes = len(tokenizer.word_index) + 1

print("Total number of words: ", num_classes)

Total number of words:  72158


In [9]:
# Now we take each sequence and split it into growing input prefixes and their labels
# Example for the sequence [65, 45, 22]:
# Input: [65]      Label: 45
# Input: [65, 45]  Label: 22

input_sequences = []
labels = []
for sequence in sequences:
    for i in range(1, len(sequence)):
        n_gram_sequence = sequence[:i+1]
        input_sequences.append(n_gram_sequence[:-1])
        labels.append(n_gram_sequence[-1])
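
As a minimal sketch of what the loop above produces, the function below applies the same prefix/label split to one toy sequence. The token ids and the helper name `make_prefix_pairs` are hypothetical.

```python
def make_prefix_pairs(sequence):
    """Return (input_prefix, next_token) pairs for one token sequence."""
    pairs = []
    for i in range(1, len(sequence)):
        # sequence[:i] is the context, sequence[i] is the word to predict
        pairs.append((sequence[:i], sequence[i]))
    return pairs

toy_sequence = [65, 45, 22, 7]  # hypothetical token ids
pairs = make_prefix_pairs(toy_sequence)
for prefix, label in pairs:
    print(prefix, "->", label)
```

A sequence of n tokens yields n - 1 training pairs, which is why the number of training samples is much larger than the number of hadiths.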

In [10]:
# Find max_sequence_length by taking the maximum length over input_sequences
# Then pad the sequences so that all of them have the same length
max_sequence_length = max([len(seq) for seq in input_sequences])
input_sequences = pad_sequences(input_sequences, maxlen=max_sequence_length)
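
By default `pad_sequences` pads (and truncates) at the start of each sequence with zeros. The pure-Python helper below is a sketch of that default behaviour for illustration only; `pre_pad` is a hypothetical name and not part of Keras.

```python
def pre_pad(sequences, maxlen, value=0):
    """Left-pad each sequence with `value` up to `maxlen` (assumes maxlen > 0)."""
    padded = []
    for seq in sequences:
        seq = list(seq)[-maxlen:]  # keep only the last `maxlen` tokens
        padded.append([value] * (maxlen - len(seq)) + seq)
    return padded

toy_inputs = [[65], [65, 45], [65, 45, 22]]  # hypothetical token ids
toy_maxlen = max(len(seq) for seq in toy_inputs)
padded = pre_pad(toy_inputs, toy_maxlen)
for row in padded:
    print(row)
```

Padding on the left keeps the most recent words at the end of every row, right before the position whose next word is predicted.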

In [10]:
# Split our data to train and test

split_ratio = 0.8 # 80% for the train
split_index = int(split_ratio * len(input_sequences))
x_train, y_train = input_sequences[:split_index], labels[:split_index]
x_test, y_test = input_sequences[split_index:], labels[split_index:] # remaining 20% for the test

In [11]:
# A DataGenerator is important when the data is too large to fit in RAM or GPU memory
# (here the one-hot labels would be huge if built all at once).
# It works like an iterator: it builds one batch at a time and discards it after use.

class DataGenerator(tf.keras.utils.Sequence):
    def __init__(self, tokenizer, sequences, labels, batch_size, max_sequence_length, num_classes):
        self.tokenizer = tokenizer
        self.sequences = sequences
        self.labels = labels
        self.batch_size = batch_size
        self.max_sequence_length = max_sequence_length
        self.num_classes = num_classes

    def __len__(self):
        return len(self.sequences) // self.batch_size

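    def __getitem__(self, index):
        # Note: `index` is ignored; every call draws a fresh random batch,
        # so an epoch is not guaranteed to visit each sample exactly once.
        batch_indices = np.random.choice(len(self.sequences), size=self.batch_size, replace=False)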
        batch_sequences = [self.sequences[i] for i in batch_indices]
        batch_labels = [self.labels[i] for i in batch_indices]
        x = pad_sequences(batch_sequences, maxlen=self.max_sequence_length)
        y = self.one_hot_encode(batch_labels)

        return x, y

    def one_hot_encode(self, labels):
        encoded_labels = np.zeros((len(labels), self.num_classes), dtype=np.float32)
        for i, label in enumerate(labels):
            encoded_labels[i, label] = 1.0
        return encoded_labels
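
The generator above samples each batch at random. As a hedged alternative, the sketch below slices batches by `index` and reshuffles once per epoch, so each sample is visited at most once per epoch. `IndexedBatcher` is a hypothetical plain-Python stand-in; to use the idea with Keras it would need to subclass `tf.keras.utils.Sequence` as the class above does.

```python
import numpy as np

class IndexedBatcher:
    """Stand-in for a Keras Sequence that honours `index` (illustrative only)."""

    def __init__(self, sequences, labels, batch_size, num_classes, seed=0):
        self.sequences = np.asarray(sequences)
        self.labels = np.asarray(labels)
        self.batch_size = batch_size
        self.num_classes = num_classes
        self.rng = np.random.default_rng(seed)
        self.order = np.arange(len(self.sequences))
        self.on_epoch_end()

    def __len__(self):
        # Number of full batches per epoch
        return len(self.sequences) // self.batch_size

    def __getitem__(self, index):
        start = index * self.batch_size
        batch_ids = self.order[start:start + self.batch_size]
        x = self.sequences[batch_ids]
        # One-hot encode only this batch to keep memory use small
        y = np.zeros((len(batch_ids), self.num_classes), dtype=np.float32)
        y[np.arange(len(batch_ids)), self.labels[batch_ids]] = 1.0
        return x, y

    def on_epoch_end(self):
        # Reshuffle the sample order between epochs
        self.rng.shuffle(self.order)

toy_x = np.arange(12).reshape(6, 2)       # hypothetical padded inputs
toy_y = np.array([0, 1, 2, 0, 1, 2])      # hypothetical label ids
batcher = IndexedBatcher(toy_x, toy_y, batch_size=2, num_classes=3)
x0, y0 = batcher[0]
print(x0.shape, y0.shape)
```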

In [12]:
epoch = 20 # epochs
batch_size = 64 # set the batch size

# The data generators build only one batch at a time instead of holding all one-hot labels in memory
# This works better when you have a large amount of data
train_data_generator = DataGenerator(tokenizer, x_train, y_train, batch_size, max_sequence_length, num_classes)
test_data_generator = DataGenerator(tokenizer, x_test, y_test, batch_size, max_sequence_length, num_classes)

In [18]:
# Create the layers of the model
model = Sequential()
model.add(Embedding(input_dim=num_classes, output_dim=100, input_length=max_sequence_length))
model.add(LSTM(units=128))
model.add(Dense(units=num_classes, activation='softmax')) # Last layer with softmax

2023-06-01 17:40:52.606697: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1635] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 22078 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 4090, pci bus id: 0000:81:00.0, compute capability: 8.9

In [19]:
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
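
The notebook one-hot encodes the labels to match `categorical_crossentropy`. Keras also offers a `sparse_categorical_crossentropy` loss that takes integer labels directly; the sketch below, with made-up probabilities, shows that both give the same value and estimates the size of one one-hot batch. It is an illustration of the equivalence, not a change to the notebook's training setup.

```python
import math

def categorical_crossentropy(one_hot_row, prob_row):
    """Cross-entropy for a one-hot target vector."""
    return -sum(t * math.log(p) for t, p in zip(one_hot_row, prob_row))

def sparse_categorical_crossentropy(label_id, prob_row):
    """Cross-entropy for an integer class id."""
    return -math.log(prob_row[label_id])

probs = [0.1, 0.3, 0.6]        # hypothetical softmax output over 3 words
label_id = 2                   # index of the true next word
one_hot = [0.0, 0.0, 1.0]      # the same label, one-hot encoded

dense_loss = categorical_crossentropy(one_hot, probs)
sparse_loss = sparse_categorical_crossentropy(label_id, probs)
print(math.isclose(dense_loss, sparse_loss))

# One batch of one-hot labels in this notebook: 64 rows x 72158 classes x 4 bytes (float32)
one_hot_batch_bytes = 64 * 72158 * 4
print(one_hot_batch_bytes)
```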

In [22]:
# Train the model on the GPU
# The batch size is already defined by the generator, so it is not passed to fit()
with tf.device("/gpu:0"):
    model.fit(train_data_generator, epochs=epoch)

Epoch 1/20




Epoch 2/20
Epoch 3/20
Epoch 4/20
Epoch 5/20
Epoch 6/20
Epoch 7/20
Epoch 8/20
Epoch 9/20
Epoch 10/20
Epoch 11/20
Epoch 12/20
Epoch 13/20
Epoch 14/20
Epoch 15/20
Epoch 16/20
Epoch 17/20
Epoch 18/20
Epoch 19/20
Epoch 20/20


In [30]:
model.summary()

Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 embedding (Embedding)       (None, 1822, 100)         7215800   
                                                                 
 lstm (LSTM)                 (None, 128)               117248    
                                                                 
 dense (Dense)               (None, 72158)             9308382   
                                                                 
Total params: 16,641,430
Trainable params: 16,641,430
Non-trainable params: 0
_________________________________________________________________
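
The parameter counts in the summary can be reproduced by hand. The short calculation below uses the standard formulas for an embedding layer, an LSTM layer (four gates, each with input weights, recurrent weights and a bias) and a dense layer; the variable names are chosen here for illustration.

```python
vocab_size = 72158      # num_classes from the tokenizer
embedding_dim = 100     # output_dim of the Embedding layer
lstm_units = 128        # units of the LSTM layer

# One embedding vector per vocabulary entry
embedding_params = vocab_size * embedding_dim

# Four gates, each with input weights, recurrent weights and a bias vector
lstm_params = 4 * ((embedding_dim + lstm_units) * lstm_units + lstm_units)

# Weights plus one bias per output class
dense_params = (lstm_units + 1) * vocab_size

total_params = embedding_params + lstm_params + dense_params
print(embedding_params, lstm_params, dense_params, total_params)
```

More than half of the parameters sit in the final dense layer, because its size grows with the vocabulary.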


In [34]:
# save the model
keras.models.save_model(model, "model.h5")

In [24]:
# Evaluate the model on batches from the test generator
loss, accuracy = model.evaluate(test_data_generator)




In [25]:
print("Loss:", loss)
print("Accuracy:", accuracy)

Loss: 2.27672815322876
Accuracy: 0.7362821888923645


**I believe that training for more epochs (for example, 50) would improve the result, possibly to more than 80% accuracy.**

In [16]:
# Load the model
model = tf.keras.models.load_model('model.h5')

In [19]:
# Greedily append `num_of_words` predicted words to `seed_text`
def predict_next_word(seed_text, num_of_words):
    for _ in range(num_of_words):
        # Encode the current text and pad it to the length the model expects
        input_sequence = tokenizer.texts_to_sequences([seed_text])
        input_sequence = pad_sequences(input_sequence, maxlen=max_sequence_length)
        predictions = model.predict(input_sequence, verbose=0)

        # Pick the most probable token id and map it back to a word
        predicted_word_index = int(predictions.argmax(axis=1)[0])
        # Index 0 is the padding id and has no entry in index_word
        predicted_word = tokenizer.index_word.get(predicted_word_index, "")
        seed_text += ' ' + predicted_word
    return seed_text
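
The function above always takes the most probable word (greedy decoding). The sketch below replays that loop with a hand-written probability table in place of the trained model, so it runs without TensorFlow. The vocabulary, the table `toy_table` and the helper names are all hypothetical.

```python
index_word = {1: "قال", 2: "حدثنا", 3: "سفيان"}   # hypothetical vocabulary
word_index = {word: idx for idx, word in index_word.items()}

# Hypothetical next-word probabilities, keyed by the last token id.
# Position 0 is the padding index and is never the most probable here.
toy_table = {
    1: [0.0, 0.1, 0.8, 0.1],
    2: [0.0, 0.1, 0.1, 0.8],
    3: [0.0, 0.7, 0.2, 0.1],
}

def toy_predict(token_ids):
    """Stand-in for model.predict: looks only at the last token."""
    return toy_table[token_ids[-1]]

def greedy_generate(seed_text, num_words):
    token_ids = [word_index[w] for w in seed_text.split()]
    for _ in range(num_words):
        probs = toy_predict(token_ids)
        # argmax: index of the highest probability
        best_id = max(range(len(probs)), key=probs.__getitem__)
        token_ids.append(best_id)
    return " ".join(index_word[i] for i in token_ids)

result = greedy_generate("قال", 2)
print(result)
```

Because the choice is deterministic, the same seed always produces the same continuation, which also explains why frequent chains of narrators dominate the predictions.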

In [32]:
# my choice of words
seed_words = ["رسول", "حذيفة", "محمد", "قال", "حديث", "عن"]

In [33]:
# Prediction of seed_words

import random
samples = dict()

for sen in seed_words:    
    samples.update({sen: predict_next_word(sen, random.randint(2, 8))})



In [41]:
# DataFrame that contains each start word and its predicted sentence
pd.DataFrame(samples.items(), columns=["start", "predicted"])

Unnamed: 0,start,predicted
0,رسول,رسول الله بن عبد الله بن
1,حذيفة,حذيفة بن نصر أبو توبة، عن ابن أبي
2,محمد,محمد بن عمرو بن
3,قال,قال أبو عيسى هذا حديث حسن صحيح وقد
4,حديث,حديث حسن وقد روي
5,عن,عن عمرو بن علي، قال حدثنا يحيى، عن عبيد


#### <center>Thank You for Reading</center>