This notebook documents the experimentation and development of a model for the sacred.ai project, created by Vasu Jain, Amee Madhani, Anand Chauhan, and Ananya Chauhan as part of our university course. The project aims to create one or more models that answer questions about life, the universe, and everything, drawing on several religious texts from Hinduism, Christianity, and Islam.

In [29]:
# import libraries
import numpy as np
import pandas as pd
import tensorflow as tf
print(tf.config.list_physical_devices())

[PhysicalDevice(name='/physical_device:CPU:0', device_type='CPU'), PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU')]


In [30]:
import spacy
# verifying if cuda setup works properly
spacy.prefer_gpu()

True

Current working approaches include religious semantic search and question answering. The latter requires an extensive dataset covering a wide variety of questions with the associated answers and outputs we'd be looking for, so it is hard to implement: we do not know if anything like that even exists (hopefully not lol).

So we'd have to play around a lot with feature extraction, semantics, and transformers and their encodings.

We can't do regression-based linear transformations either, since that, too, requires a dataset.
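
As a rough illustration of the semantic-search idea, here is a minimal sketch that ranks passages against a query by cosine similarity. The random vectors only stand in for real embeddings, and `cosine_top_k` is a hypothetical helper for this example, not something defined elsewhere in this notebook.

```python
import numpy as np

def cosine_top_k(query_vec, passage_vecs, k=3):
    """Return indices and scores of the k passages most similar to the query."""
    # Normalise so that a dot product equals cosine similarity.
    # The small epsilon guards against all-zero (out-of-vocabulary) vectors.
    q = query_vec / (np.linalg.norm(query_vec) + 1e-12)
    p = passage_vecs / (np.linalg.norm(passage_vecs, axis=1, keepdims=True) + 1e-12)
    scores = p @ q
    top = np.argsort(-scores)[:k]
    return top, scores[top]

# Toy data: 5 "passages" with 300-dimensional vectors (placeholder values).
rng = np.random.default_rng(0)
passages = rng.normal(size=(5, 300))
# A query that is a slightly noisy copy of passage 2.
query = passages[2] + 0.01 * rng.normal(size=300)

indices, scores = cosine_top_k(query, passages, k=2)
print(indices, scores)
```

No labelled question-answer dataset is needed for this kind of ranking, which is why it looks like the more practical starting point.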

Let's just import all our datasets for now.

In [31]:
bhagvadgita = pd.read_csv('../dataset/gitaDataset.csv')
quran = pd.read_csv('../dataset/quranDataset.csv')
bible = pd.read_csv('../dataset/bibleDataset.csv')
print(bhagvadgita.shape)
bhagvadgita.head()

(700, 4)


Unnamed: 0,Title,Chapter,Verse,English Translation
0,Arjuna's Vishada Yoga,Chapter 1,Verse 1.1,"Dhrtarashtra asked of Sanjaya: O SANJAYA, what..."
1,Arjuna's Vishada Yoga,Chapter 1,Verse 1.2,Sanjaya explained: Now seeing that the army of...
2,Arjuna's Vishada Yoga,Chapter 1,Verse 1.3,"Behold O, Master, the mighty army of the sons ..."
3,Arjuna's Vishada Yoga,Chapter 1,Verse 1.4,"Present here are the mighty archers, peers or ..."
4,Arjuna's Vishada Yoga,Chapter 1,Verse 1.5,"Dhrishtaketu, Chekitana, and the valiant king ..."


In [32]:
bhagvadgita['Title'].unique()

array(["Arjuna's Vishada Yoga", 'Sankhya Yoga', 'Karma Yoga',
       'Jnana-Karma-Sanyasa Yoga', 'Atma-Samyama Yoga',
       'Jnana-Vijnana Yoga', 'Aksara-ParaBrahma Yoga',
       'Raja-Vidya-Raja-Guhya Yoga', 'Vibhuti Yoga',
       'Viswarupa-Darsana Yoga', 'Bhakti Yoga',
       'Ksetra-Ksetrajna-Vibhaga Yoga', 'Gunatraya-Vibhaga Yoga',
       'Purushottama Yoga', 'Daivasura-Sampad-Vibhaga Yoga',
       'Shraddhatraya-Vibhaga Yoga', 'Moksha-Sanyasa Yoga'], dtype=object)

In [33]:
print(quran.shape)
quran.head()

(6236, 5)


Unnamed: 0,Name,Surah,Ayat,Verse,Tafseer
0,The Opening,1,1,"In the name of Allah, the Beneficent, the Merc...",In the Name of God the Compassionate the Merciful
1,The Opening,1,2,"Praise be to Allah, Lord of the Worlds,",In the Name of God the name of a thing is that...
2,The Opening,1,3,"The Beneficent, the Merciful.",The Compassionate the Merciful that is to say ...
3,The Opening,1,4,"Owner of the Day of Judgment,",Master of the Day of Judgement that is the day...
4,The Opening,1,5,Thee (alone) we worship; Thee (alone) we ask f...,You alone we worship and You alone we ask for ...


In [34]:
print(bible.shape)
bible.head()

(31103, 5)


Unnamed: 0,field,book,chapter,verse,text
0,1001001,1,1,1,At the first God made the heaven and the earth.
1,1001002,1,1,2,And the earth was waste and without form; and ...
2,1001003,1,1,3,"And God said, Let there be light: and there wa..."
3,1001004,1,1,4,"And God, looking on the light, saw that it was..."
4,1001005,1,1,5,"Naming the light, Day, and the dark, Night. An..."


In [35]:
# whatever approach we implement now we will implement first on the bhagvadgita, since it is the smallest.
# everything else will be taken care of later lmao
try:
    # strip the 'Verse '/'Chapter ' text and keep only the numbers, as integers.
    bhagvadgita['Verse'] = bhagvadgita['Verse'].apply(lambda x: x.split('.')[-1]).astype('int')
    bhagvadgita['Chapter'] = bhagvadgita['Chapter'].apply(lambda x: x.split()[-1]).astype('int')
except AttributeError:
    # the columns are already integers (the cell was re-run), so there is nothing to convert
    pass
# check dtypes
bhagvadgita.dtypes

Title                  object
Chapter                 int64
Verse                   int64
English Translation    object
dtype: object
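
The same clean-up can be written so that re-running the cell is harmless without a `try`/`except`. This is a minimal sketch on a toy frame that mimics the `Chapter` and `Verse` columns; `trailing_int` is a hypothetical helper name.

```python
import pandas as pd

# Toy frame mimicking the 'Chapter' and 'Verse' columns shown above.
toy = pd.DataFrame({
    "Chapter": ["Chapter 1", "Chapter 2", "Chapter 18"],
    "Verse": ["Verse 1.1", "Verse 2.47", "Verse 18.66"],
})

def trailing_int(series):
    """Extract the last run of digits from each value and return integers."""
    # Casting to str first means an already-converted integer column
    # passes through unchanged, so the cell can be re-run safely.
    return series.astype(str).str.extract(r"(\d+)$", expand=False).astype(int)

toy["Chapter"] = trailing_int(toy["Chapter"])
toy["Verse"] = trailing_int(toy["Verse"])
# Running it a second time is a no-op rather than an error.
toy["Verse"] = trailing_int(toy["Verse"])
print(toy)
```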

In [36]:
# finding the maximum number of words in any single shlok/verse/passage. we will model our max query length around it. maybe.
# every book will therefore contain three tensors: max pooled, mean pooled, and all vectors with a lot of zeroes.
max_length = 0
for i in bhagvadgita['English Translation']:
    if len(i.split()) > max_length:
        max_length = len(i.split())
print(max_length)
for j in bible['text']:
    if len(j.split()) > max_length:
        max_length = len(j.split())
print(max_length)
for k in quran['Verse']:
    if len(k.split()) > max_length:
        max_length = len(k.split())
print(max_length)
for l in quran['Tafseer'].astype('str'):
    if len(l.split()) > max_length:
        max_length = len(l.split())
print(max_length)

118
118
273
584
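
The four loops above can be collapsed into one vectorised expression per column. The sketch below uses placeholder passages rather than the real datasets. It also prints a percentile, which might help when picking a `MAX_LENGTH` smaller than the true maximum. Note that these are whitespace-separated word counts; spaCy's tokenizer usually splits punctuation into separate tokens, so token counts can come out higher.

```python
import pandas as pd

# Placeholder passages; in the notebook these would be the verse columns.
texts = pd.Series([
    "one two three",
    "a slightly longer passage with seven words",
    "short",
])

# Whitespace-split word count per passage, computed without a Python loop.
word_counts = texts.astype(str).str.split().str.len()

print(word_counts.max())            # longest passage, in words
print(word_counts.quantile(0.95))   # length that covers about 95% of passages
```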


In [37]:
# MAX_LENGTH = 584  # would cover every passage without truncation
# let's try smaller sizes first
MAX_LENGTH = 256

In [38]:
# loading the spaCy pipeline whose static word vectors we'll be using for now.
nlp = spacy.load('en_core_web_lg')

In [39]:
# investigating whether spaCy's Doc.vector does max or mean pooling.
# .get() copies each vector from the GPU (cupy, since prefer_gpu() returned True) into numpy.
doc = nlp(bhagvadgita['English Translation'][0])
token_vectors = np.array([token.vector.get() for token in doc])
mean_vector = token_vectors.mean(axis=(0))
token_norms = np.array([float(token.vector_norm) for token in doc])
# vector of the single token with the largest norm
maxnorm_vector = token_vectors[token_norms.argmax()]
doc_vector = np.array(doc.vector.get())
# element-wise maximum over all token vectors
elementwise_max = token_vectors.max(axis=(0))
print(f"Max v/s overall vector: {np.array_equal(maxnorm_vector, doc_vector)} \nMean v/s overall vector: {np.array_equal(mean_vector, doc_vector)} \nMax v/s Argmax vector: {np.array_equal(maxnorm_vector, elementwise_max)}")

Max v/s overall vector: False 
Mean v/s overall vector: True 
Max v/s Argmax vector: False
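
The check above mixes three different ways of collapsing token vectors into one, so here is a toy sketch (placeholder numbers, plain numpy) that keeps them apart: mean pooling, element-wise max pooling, and picking the single token with the largest norm.

```python
import numpy as np

# Three toy "token vectors" with four dimensions each (placeholder values).
tokens = np.array([
    [1.0, 0.0, 2.0, 0.0],
    [0.0, 3.0, 0.0, 0.0],
    [2.0, 1.0, 0.0, 1.0],
])

# Mean pooling: average every dimension across tokens.
mean_pooled = tokens.mean(axis=0)

# Element-wise max pooling: largest value seen in each dimension.
max_pooled = tokens.max(axis=0)

# "Max-norm" selection: the single token vector with the largest L2 norm.
norms = np.linalg.norm(tokens, axis=1)
max_norm_token = tokens[norms.argmax()]

print(mean_pooled)
print(max_pooled)
print(max_norm_token)
```

The last two generally differ, which matches the `False` in the third line of the output above.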


In [40]:
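# experimenting with how to pad a passage with all-zero vectors.
# take the vector of the first word (probably out-of-vocabulary, so expected to be all zeros)
# and compare it with a hand-built zero vector. the result is not displayed here because
# it is not the last expression in the cell.
zerosample = np.array(nlp(bhagvadgita['English Translation'][0].split()[0]).vector.get())
constructedzeros = np.zeros((300,))
np.array_equal(constructedzeros, zerosample)
# now pad the token vectors of the first verse with zero rows up to MAX_LENGTH
zerosample = np.array([token.vector.get() for token in nlp(bhagvadgita['English Translation'][0])])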
constructarr = np.zeros((MAX_LENGTH - zerosample.shape[0], 300))
hope = np.append(zerosample, constructarr, axis=(0))
hope.shape

(256, 300)

# IT WORKS!!
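
The padding step can also be wrapped in a small helper. This is a sketch using `np.pad`, with `pad_or_truncate` as a hypothetical name. One difference from the cell above: `np.zeros` defaults to float64, so `np.append` upcasts the result to float64, whereas `np.pad` keeps the dtype of the input.

```python
import numpy as np

def pad_or_truncate(vectors, max_length):
    """Return an array of shape (max_length, dim), zero-padded or cut at the end."""
    vectors = np.asarray(vectors)
    n_tokens = vectors.shape[0]
    if n_tokens >= max_length:
        return vectors[:max_length]
    # Pad only along the token axis; np.pad keeps the input dtype.
    return np.pad(vectors, ((0, max_length - n_tokens), (0, 0)))

short = np.ones((3, 300), dtype=np.float32)
long = np.ones((300, 300), dtype=np.float32)

print(pad_or_truncate(short, 256).shape, pad_or_truncate(short, 256).dtype)
print(pad_or_truncate(long, 256).shape)
```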

The pooling check above shows that `Doc.vector` in spaCy's `en_core_web_lg` is the mean of the token vectors, so we can build our data as follows:

In [41]:
# test with one field, trying both max and mean pooling.
# note: "max" here is the vector of the token with the largest norm, not an element-wise maximum.
maxtokens = []
meantokens = []
alltokens = []

for i in bhagvadgita['English Translation']:
    doc = nlp(i)
    tokenlist = np.array([token.vector.get() for token in doc])
    maxnormlist = np.array([float(token.vector_norm) for token in doc])
    if MAX_LENGTH - tokenlist.shape[0] > 0:
        constructarr = np.zeros((MAX_LENGTH - tokenlist.shape[0], 300))
        alltokens.append(np.append(tokenlist, constructarr, axis=(0)))
    else:
        alltokens.append(tokenlist[:MAX_LENGTH])

    maxtokens.append(np.array(tokenlist[maxnormlist.argmax()]))
    meantokens.append(np.array(tokenlist.mean(axis=(0))))

tmaxtokens = tf.convert_to_tensor(maxtokens)
tmeantokens = tf.convert_to_tensor(meantokens)
talltokens = tf.convert_to_tensor(alltokens)

tmaxtokens.shape, tmeantokens.shape, talltokens.shape

(TensorShape([700, 300]),
 TensorShape([700, 300]),
 TensorShape([700, 256, 300]))
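
One thing to watch if a later step pools over the padded tensor: a plain mean over the token axis is diluted by the zero rows. Below is a minimal numpy sketch of a masked mean, assuming the true token count of each passage is kept alongside the tensor (the notebook does not store this yet). `masked_mean` is a hypothetical helper.

```python
import numpy as np

def masked_mean(padded, lengths):
    """Mean over the real tokens of each row in a (batch, max_len, dim) array."""
    max_len = padded.shape[1]
    # mask[b, t] is True where position t holds a real token for row b.
    mask = np.arange(max_len)[None, :] < np.asarray(lengths)[:, None]
    summed = (padded * mask[:, :, None]).sum(axis=1)
    # Guard against empty passages to avoid division by zero.
    counts = np.maximum(np.asarray(lengths), 1)[:, None]
    return summed / counts

# Two toy passages padded to length 4, with 2 and 3 real tokens respectively.
padded = np.zeros((2, 4, 3))
padded[0, :2] = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]]
padded[1, :3] = [[1.0, 1.0, 1.0], [2.0, 2.0, 2.0], [3.0, 3.0, 3.0]]

print(masked_mean(padded, [2, 3]))
print(padded.mean(axis=1))  # naive mean is diluted by the zero padding
```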

In [None]:
# converting this to a function for later use:
def tensorgoBrr(df: pd.DataFrame, column: str):
    """Return three tensors for df[column]: the highest-norm token vector of each passage ("max"),
    the mean-pooled vector of each passage, and all token vectors padded or truncated to MAX_LENGTH.
    The return is a tuple in the order (max, mean, all tokens)."""
    maxtokens = []
    meantokens = []
    alltokens = []

    for i in df[column]:
        # str() guards against non-string values, like the astype('str') needed for quran['Tafseer'] earlier
        doc = nlp(str(i))
        tokenlist = np.array([token.vector.get() for token in doc])
        maxnormlist = np.array([float(token.vector_norm) for token in doc])
        if MAX_LENGTH - tokenlist.shape[0] > 0:
            constructarr = np.zeros((MAX_LENGTH - tokenlist.shape[0], 300))
            alltokens.append(np.append(tokenlist, constructarr, axis=(0)))
        else:
            alltokens.append(tokenlist[:MAX_LENGTH])

        maxtokens.append(np.array(tokenlist[maxnormlist.argmax()]))
        meantokens.append(np.array(tokenlist.mean(axis=(0))))

    tmaxtokens = tf.convert_to_tensor(maxtokens)
    tmeantokens = tf.convert_to_tensor(meantokens)
    talltokens = tf.convert_to_tensor(alltokens)

    return (tmaxtokens, tmeantokens, talltokens)


We have now created the tensors for all verses of the Bhagvadgita. For the demo, this is all that we will be doing.

Bear in mind that the tensor with all padded vectors for the Bhagvadgita alone is over 400 MB, which, considering my system VRAM of 8GB, might become a memory problem later down the line, since the Quran and Bible datasets have far more rows.
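
To put rough numbers on this, here is a back-of-the-envelope sketch. It assumes the padded tensor is stored as float64, which is what `np.zeros` produces by default and what `np.append` promotes to, and it compares that with float32. `tensor_megabytes` is a hypothetical helper.

```python
import numpy as np

def tensor_megabytes(n_passages, max_length, dim, dtype):
    """Size of a dense (n_passages, max_length, dim) tensor in megabytes (10**6 bytes)."""
    n_bytes = n_passages * max_length * dim * np.dtype(dtype).itemsize
    return n_bytes / 1e6

# Row counts are the DataFrame shapes printed earlier in this notebook.
for name, rows in [("gita", 700), ("quran", 6236), ("bible", 31103)]:
    mb64 = tensor_megabytes(rows, 256, 300, np.float64)
    mb32 = tensor_megabytes(rows, 256, 300, np.float32)
    print(f"{name}: {mb64:.0f} MB as float64, {mb32:.0f} MB as float32")
```

Under these assumptions the Bhagvadgita tensor comes to about 430 MB, consistent with the figure above, and the Bible tensor would not fit in 8GB even as float32, so batching or a smaller `MAX_LENGTH` would probably be needed.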