# Building a Quranic search engine

## Setting up our data frame

In [1]:

import re
import pandas as pd

from farasa.stemmer import FarasaStemmer

In [2]:
# setting the max width of columns.
pd.set_option('max_colwidth',150)

# reading the quran 
quran= pd.read_pickle("../pickle/quran_A_stemmed.pkl")

# defining stemmer
stemmer = FarasaStemmer()

In [3]:
# length column is used in the EDA, but it is not important here.
quran.drop(columns="length", inplace=True)

In [4]:
# getting chapters' names.

n = """1-الفاتحة	2- البقرة	3- آل عمران
4- النساء	5- المائدة	6- الأنعام
7- الأعراف	8- الأنفال	9- التوبة
10- يونس	11- هود	12- يوسف
13- الرعد	14- إبراهيم	15- الحجر
16- النحل	17- الإسراء	18- الكهف
19- مريم	20- طه	21- الأنبياء
22- الحج	23- المؤمنون	24- النور
25- الفرقان	26- الشعراء	27- النمل
28- القصص	29- العنكبوت	30- الروم
31- لقمان	32- السجدة	33- الأحزاب
34- سبأ	35- فاطر	36- يس
37- الصافات	38- ص	39- الزمر
40- غافر	41- فصلت	42- الشورى
43- الزخرف	44- الدخان	45- الجاثية
46- الأحقاف	47- محمد	48- الفتح
49- الحجرات	50- ق	51- الذاريات
52- الطور	53- النجم	54- القمر
55- الرحمن	56- الواقعة	57- الحديد
58- المجادلة	59- الحشر	60- الممتحنة
61- الصف	62- الجمعة	63- المنافقون
64- التغابن	65- الطلاق	66- التحريم
67- الملك	68- القلم	69- الحاقة
70- المعارج	71- نوح	72- الجن
73- المزّمِّل	74- المدّثر	75- القيامة
76- الإنسان	77- المرسلات	78- النبأ
79- النازعات	80- عبس	81- التكوير
82- الإنفطار	83- المطففين	84- الانشقاق
85- البروج	86- الطارق	87- الأعلى
88- الغاشية	89- الفجر	90- البلد
91- الشمس	92- الليل	93- الضحى
94- الشرح	95- التين	96- العلق
97- القدر	98- البينة	99- الزلزلة
100- العاديات	101- القارعة	102- التكاثر
103- العصر	104- الهُمَزَة	105- الفيل
106- قريش	107- الماعون	108- الكوثر
109- الكافرون	110- النصر	111- المسد
112- الإخلاص	113- الفلق	114- الناس""" 

# extracting chapters' names
n = re.sub("[0-9]","",n)
n = re.sub("[-]","",n)
n = re.sub("[\t\n]",",",n)
n = re.sub(", ",",",n)
chap_names=n.split(sep=",")

# zipping the number of a chapter with the name of a chapter.
chap_dict = dict(zip(quran.chapter_num.unique(),chap_names))

# adding a column for chapters' names
quran["chapter"]=quran.chapter_num.map(chap_dict)
quran= quran[["chapter","chapter_num","verse_num", "verse","stemmed"]]
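The mapping above relies on `quran.chapter_num.unique()` returning the chapters in the same order as the hand-typed list. A minimal sketch, assuming the same `N- name` layout, reads each number together with its name so the pairing is explicit. `parse_chapter_names` and the four-chapter `sample` are illustrative names, not part of the notebook.

```python
import re

def parse_chapter_names(text):
    """Return {chapter_number: chapter_name} from 'N- name' entries
    separated by tabs or newlines."""
    pairs = re.findall(r"(\d+)\s*-\s*([^\t\n]+)", text)
    return {int(num): name.strip() for num, name in pairs}

# a short sample in the same layout as the list above
sample = "1-الفاتحة\t2- البقرة\t3- آل عمران\n4- النساء"
chapters = parse_chapter_names(sample)
print(chapters[3])
```

With this shape, `quran.chapter_num.map(chapters)` would not depend on the order of the rows in the data frame.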

## Tagging

Sometimes a single concept or person is referred to in several ways.\
For example, Jesus is mentioned in the Quran by three different names: "عيسى", "ابن مريم" and "المسيح",
so I want to tag all verses that use any of these names with a single tag that represents Jesus.

We need to create a new column called **tags**.\
Each entry in this column is a set, because sets are unordered and do not allow repetition.
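A small sketch of why a set suits this column: adding the same tag twice leaves one entry. It also shows a pitfall worth avoiding when creating one set per verse. The variable names and the three-verse size are illustrative assumptions.

```python
# Adding the same tag twice leaves a single entry.
tags = set()
tags.add("عيسى")
tags.add("عيسى")
tags.add("مريم")
print(len(tags))

# One independent set per verse (three hypothetical verses here).
per_verse = [set() for _ in range(3)]
per_verse[0].add("موسى")

# Pitfall: list multiplication repeats ONE shared set object,
# so a tag added to the first verse shows up in all of them.
shared = [set()] * 3
shared[0].add("موسى")
print(per_verse[1], shared[1])
```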


In [5]:
# Tagging1:
# connect a key word that represent a concept, to a list of words that represent the same concept.
# to be used to tag the verses.

tags_dict={
    "ترهيب": ["نار", "جهنم", "عذاب", "جحيم", "سعير","ترهيب", "عقاب"],
    
    "ترغيب" :["جنة", "أنهار", "حور", "فاكهة","فواكه", "نعيم", "ترغيب", "ثواب", "سرور", "جنات"],
    
    "استنكار": ["أفلا", "ألم", "محاججة"],

    "العقل" : ["يتفكرون", "عقل", "يتدبرون"],

    "عيسى" : ["المسيح", "ابن مريم", "بن مريم", "عيسى"],

    "محمد" : ["محمد", "أحمد", "النبي ", "مزمل", "مدثر", " طه "],
    
    "مريم" : ["مريم"],
    
    "إبراهيم" : ["ابراهيم", "إبراهيم"],

    "موسى" : ["موسى","كلم الله"],
    
    "أهل البيت" : ["القربى", "أهل البيت", "نسائنا","نساء النبي", "أزواجه "],

    "الصحابة": ["المهاجرين", "الأنصار","صاحبه" , "الذين معه أشداء"],

    "القيامة" : ["القيامة", "الساعة", "التغابن", "يوم الدين"],

    "المنافقين": ["منافق", "نفاق" ],

    "إبليس" : ["إبليس", "الشيطان"],
    
    "سليمان" : ["سليمان"]

}

stem_tags= {
    
    "الجهاد" : ["جهاد", "قتال", "يجاهدون" ]  # this need stemming
}

tags_dict["ترغيب"]

['جنة',
 'أنهار',
 'حور',
 'فاكهة',
 'فواكه',
 'نعيم',
 'ترغيب',
 'ثواب',
 'سرور',
 'جنات']

In [6]:
# tag verses by looking for the words listed under each tag.
def tag_by_words_appearance(quran,tags_dictionary, stem= False):
    """tag each verse according to the appearance of equivalent words"""

    
    if stem:

        for tag, lst in tags_dictionary.items():
            g = []
            for i in lst:
                g.append(stemmer.stem(i))
                if (i not in g):
                    g.append(i)
            tags_dictionary[tag] = g 
    
    for i in range(quran.shape[0]):
        for tag, lst in tags_dictionary.items():
            for word in lst:
                if bool(re.search(word, quran.verse.iloc[i])):
                    quran.tags.iloc[i].add(tag)
                    break



The above technique is useful, but it has two weaknesses:
- It is not good enough for concepts such as "الترغيب" and "الترهيب" ("enticement" and "menacing").\
There are many false positives (verses tagged with "enticement" that are not about it).


- In a story about Moses his name does not appear in every verse, so this technique will not catch all the verses of the story.

This is why the technique below was added. I grabbed my Quran, looked for the verses that tell the stories of Moses, noted them, and then coded them below.

In fact, I also used some help from the internet.
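One source of false positives is that `re.search(word, verse)` matches inside longer words. The sketch below contrasts substring matching with a whole-word match; the helper names and the sample strings are assumptions for illustration. Note the trade-off: because Arabic attaches prefixes such as "ال" to the word, a strict whole-word match also misses genuine occurrences.

```python
import re

def contains_substring(word, text):
    """True if `word` occurs anywhere in `text`, even inside another word."""
    return re.search(re.escape(word), text) is not None

def contains_whole_word(word, text):
    """True only if `word` is not glued to other letters on either side."""
    pattern = r"(?<!\w)" + re.escape(word) + r"(?!\w)"
    return re.search(pattern, text) is not None

# "منارة" (lighthouse) contains the letters of "نار" (fire) but is unrelated.
print(contains_substring("نار", "منارة"), contains_whole_word("نار", "منارة"))

# "النار" (the fire) is a genuine hit that the strict version misses.
print(contains_substring("نار", "النار"), contains_whole_word("نار", "النار"))
```

Neither variant is right on its own, which is one reason the stemmed column and the hand-made story ranges are useful complements.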

In [7]:
# link each concept to a dictionary where keys are chapters' numbers and values are verses numbers
# where this concept was used.

stories_dict= {

    "موسى" : {
        7 : range(103,169),
        10 : range(75,94),
        11 : range(96,100),
        20 : range(11,99),
        23 : range(45,50),
        26 : range(10, 67),
        27 : range(7,15),
        28 : range(3,44),
        43 : range(46,57),
        44 : range(17,34),
        2 : range(49,74),
        17 : range(101,105),
        4 : range(153,156),
        5 : range(20,27),
        14 : range(5,10),
        18 : range(60,83),
        37 : range(114,123),
        40 : range(23,46),
        79 : range(15,27)
        
    },

    "نوح" : {
        10 : range(71,75),
        11 : range(25,49),
        21 : range(76,78),
        23 : range(23,30),
        26 : range(105,121)
    },

    "مريم" : {
        3 : range(33, 48),
        19 : range(16,37)
    },

    "عيسى" : {
        3 : range(44,49),
        5 : range(110,119)
    },

    "إبراهيم" : {
        11 : range(69,77),
        21 : range(51,74),
        26 : range(69,88),
        43 : range(26,28),
        2 : range(124,142),
        6 : range(74,85),
        14 : range(35,42),
        15 : range(51,61),
        19 : range(41,51),
        29 : range(16,33),
        37 : range(83,114),
        51 : range(24,38)
    },

    "صالح" : {
        7 : range(73,80),
        11 : range(61,69),
        26 : range(141, 160),
        27 : range(45,54)
    },

    "لوط" : {
        7 : range(80,85),
        11 : range(77,84),
        21 : range(74,76),
        26 : range(160,174),
        27 : range(54,59),
        29 : range(32,35)
    },

    "شعيب" : {
        7 : range(85,94),
        11 : range(84,96),
        26 : range(176,190),
    },

    "هود" : {
        7 : range(65,73),
        11 : range(50,61),
        26 : range(123,140),
    },

    "آدم" : {
        7 : range(11,28),
        20 : range(115,124),
        17 : range(61,66),
        2 : range(30, 39)
    },

    "إبليس" : {
        20 : range(115,124),
        17 : range(61,66)
    },

    "سليمان" : {
        27 : range(15,45)
    },

    "قارون" : {
        28 : range(76,83)
    },

    "طالوت" : {
        2 : range(246,252)
    },
    
    "يوسف" : {
        12 : range(4, 102)
    },

    "القيامة" : {
        20 : range(100,113),
        26 : range(87,103),
        77 : range(7, 51),
        75 : range(1,16),
        69 : range(13, 29),
        55 :range(37,42),
        56 : range(1,12),
        50 : range(20,36),
        52 : range(7,17),
        80 : range(33,43),
        81 : range(1,15),
        83 : range(4,22),
        84 : range(1,16),
        101 : range(1,11)
    }

}

In [8]:
# tag the verses listed in a stories dictionary.
def stories_tag(quran,dicti):
    """tag the verses that are included in 'dicti'."""
    for tag, chapters_dict in dicti.items():
        for chapter, verses in chapters_dict.items():
            # select the verses of this chapter that belong to the story
            mask = (quran.chapter_num == chapter) & (quran.verse_num.isin(verses))
            for sett in quran[mask].tags:
                # set.add() mutates the set in place and returns None,
                # so nothing has to be assigned back to the data frame.
                sett.add(tag)
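To see range-based tagging on something small, here is a sketch on a four-row toy frame. `tag_ranges`, `toy` and the single story range are illustrative; the approach relies on the sets being mutated in place, so no assignment to the frame is needed.

```python
import pandas as pd

def tag_ranges(df, ranges_by_tag):
    """Add each tag to the verses whose (chapter, verse) falls in its ranges."""
    for tag, chapters in ranges_by_tag.items():
        for chapter, verses in chapters.items():
            mask = (df["chapter_num"] == chapter) & df["verse_num"].isin(verses)
            for tag_set in df.loc[mask, "tags"]:
                tag_set.add(tag)  # in-place update of the verse's own set

toy = pd.DataFrame({
    "chapter_num": [12, 12, 12, 13],
    "verse_num": [3, 4, 5, 4],
})
toy["tags"] = [set() for _ in range(len(toy))]

tag_ranges(toy, {"يوسف": {12: range(4, 102)}})
print(toy["tags"].tolist())
```

Only the two rows of chapter 12 inside the range receive the tag; verse 4 of chapter 13 is left alone because the chapter number is part of the mask.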
    
                


In [9]:
# Adding tags by applying our tagging functions.
quran["tags"]= [set({}) for _ in range(quran.shape[0])]

tag_by_words_appearance(quran,tags_dict)

tag_by_words_appearance(quran,stem_tags, True)

stories_tag(quran,stories_dict)


In [10]:
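# share of verses that received at least one tag
# (assumes the empty set is the most frequent value, so value_counts()[0] counts the untagged verses)
(quran.shape[0] - quran.tags.value_counts()[0])/quran.shape[0]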

0.35806553245116574
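The ratio above is the share of verses with at least one tag. A sketch of a more direct way to compute the same quantity, without assuming which value is most frequent, is to measure the size of each set; the four-row `tags` series is a made-up stand-in for `quran.tags`.

```python
import pandas as pd

# stand-in for quran.tags: one tagged verse out of four
tags = pd.Series([set(), {"موسى"}, set(), set()])

# len() of each set, then the fraction of verses whose set is non-empty
tagged_share = tags.map(len).gt(0).mean()
print(tagged_share)
```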

## Basic search function

In [11]:
# first and basic search
def search0(quran,text):

    # columns wanted in the result
    resulted_columns = ["chapter","chapter_num", "verse_num", "verse", "tags"]
    
    # dataframe of the result.
    df = pd.DataFrame(columns=resulted_columns)
    
    # looping over the quran
    for i in range(quran.shape[0]):
        # check if the text matches a tag of the ith verse
        if (text in quran.tags.iloc[i]):
            df.loc[i] =  quran[resulted_columns].iloc[i]

        # check if the text appear in the ith verse
        if bool(re.search(text, quran.verse.iloc[i])):
            df.loc[i]= quran[resulted_columns].iloc[i]
    
    return df
    

In [12]:
search0(quran,"المسيح")

# searching for "المسيح" did not give me all verses about Jesus.
# to get all verses about Jesus you need to search for the tag "عيسى"

Unnamed: 0,chapter,chapter_num,verse_num,verse,tags
339,آل عمران,3,45,إذ قالت الملائكة يا مريم إن الله يبشرك بكلمة منه اسمه المسيح عيسى ابن مريم وجيها في الدنيا والاخرة ومن المقربين,"{عيسى, مريم}"
652,النساء,4,157,وقولهم إنا قتلنا المسيح عيسى ابن مريم رسول الله وما قتلوه وما صلبوه ولكن شبه لهم وإن الذين اختلفوا فيه لفي شك منه ما لهم به من علم إلا اتباع الظن ...,"{عيسى, مريم}"
666,النساء,4,171,يا أهل الكتاب لا تغلوا في دينكم ولا تقولوا على الله إلا الحق إنما المسيح عيسى ابن مريم رسول الله وكلمته ألقاها إلى مريم وروح منه فامنوا بالله ورسل...,"{عيسى, مريم}"
667,النساء,4,172,لن يستنكف المسيح أن يكون عبدا لله ولا الملائكة المقربون ومن يستنكف عن عبادته ويستكبر فسيحشرهم إليه جميعا,{عيسى}
689,المائدة,5,17,لقد كفر الذين قالوا إن الله هو المسيح ابن مريم قل فمن يملك من الله شيئا إن أراد أن يهلك المسيح ابن مريم وأمه ومن في الأرض جميعا ولله ملك السماوات ...,"{عيسى, مريم}"
744,المائدة,5,72,لقد كفر الذين قالوا إن الله هو المسيح ابن مريم وقال المسيح يا بني إسرائيل اعبدوا الله ربي وربكم إنه من يشرك بالله فقد حرم الله عليه الجنة ومأواه ا...,"{عيسى, ترهيب, ترغيب, مريم}"
747,المائدة,5,75,ما المسيح ابن مريم إلا رسول قد خلت من قبله الرسل وأمه صديقة كانا يأكلان الطعام انظر كيف نبين لهم الايات ثم انظر أنى يؤفكون,"{عيسى, مريم}"
1271,التوبة,9,30,وقالت اليهود عزير ابن الله وقالت النصارى المسيح ابن الله ذلك قولهم بأفواههم يضاهئون قول الذين كفروا من قبل قاتلهم الله أنى يؤفكون,{عيسى}
1272,التوبة,9,31,اتخذوا أحبارهم ورهبانهم أربابا من دون الله والمسيح ابن مريم وما أمروا إلا ليعبدوا إلها واحدا لا إله إلا هو سبحانه عما يشركون,"{عيسى, مريم}"
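The row-by-row loop in `search0` can also be expressed with boolean masks. The sketch below is an assumed alternative, not the notebook's implementation: it treats the query as literal text (`regex=False`) rather than as a regular expression, and `search_mask` and `toy` are illustrative names. The verses and tags are copied from the outputs shown in this notebook.

```python
import pandas as pd

def search_mask(df, text):
    """Rows where `text` is one of the tags or occurs in the verse."""
    in_tags = df["tags"].map(lambda tag_set: text in tag_set)
    in_verse = df["verse"].str.contains(text, regex=False)
    return df[in_tags | in_verse]

toy = pd.DataFrame({
    "chapter_num": [1, 1, 4],
    "verse_num": [3, 4, 172],
    "verse": [
        "الرحمن الرحيم",
        "مالك يوم الدين",
        "لن يستنكف المسيح أن يكون عبدا لله ولا الملائكة المقربون ومن يستنكف عن عبادته ويستكبر فسيحشرهم إليه جميعا",
    ],
    "tags": [set(), {"القيامة"}, {"عيسى"}],
})

# found through the tag, although the word itself is not in the verse
print(search_mask(toy, "القيامة")["verse_num"].tolist())
# found through the verse text
print(search_mask(toy, "المسيح")["verse_num"].tolist())
```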


## Adding some features to our search function

### Searching for individual words and adding a stemmed-search option

In [13]:
# consider searching for individual words and add a stemmed-search option.

def search1(quran,text, stem= False):
    

    # setting up the resultant data frame.
    resulted_columns = ["chapter","chapter_num", "verse_num", "verse", "tags"]
    df = pd.DataFrame(columns=resulted_columns)

    
    # splitting the search text into search words
    words_lst = text.split()
    
    # do the stemming only once because it takes time
    if stem:
        # stemming the text
        stemed_text=stemmer.stem(text)
        
        # stemming the search words
        stemed_lst=[stemmer.stem(i) for i in words_lst]
    
    
    
    # searching for the complete text
    for i in range(quran.shape[0]):
        # check if the text matches a tag of the ith verse
        if (text in quran.tags.iloc[i]):
            df.loc[i] =  quran[resulted_columns].iloc[i]

        # check if the text appear in the ith verse
        if bool(re.search(text, quran.verse.iloc[i])):
            df.loc[i]= quran[resulted_columns].iloc[i]
            #c = c + 1 
        
        # are we searching for a stemmed version of the text    
        if stem:
            # check if stemmed text matches a tag of the ith verse
            if (stemed_text in quran.tags.iloc[i]):
                df.loc[i] =  quran[resulted_columns].iloc[i]

            # check if stemmed text appear in the ith verse
            if bool(re.search(stemed_text, quran.stemmed.iloc[i])):
                df.loc[i]= quran[resulted_columns].iloc[i]


    # by searching in two separate loops,
    # the verses with the complete text will appear first in the resultant data frame.

    # looping over the verses of the Quran
    for i in range(quran.shape[0]):
        
        # looping over the search words
        for word in words_lst:

            # check if a search word matches a tag of the ith verse
            if (word in quran.tags.iloc[i]):
                df.loc[i] =  quran[resulted_columns].iloc[i]

            # check if a search word appear in the ith verse
            if bool(re.search(word, quran.verse.iloc[i])):
                df.loc[i]= quran[resulted_columns].iloc[i]
 
        
        # searching for a stemmed version of the search words.    
        if stem:
            
            # looping over the stemmed search words
            for word in stemed_lst:       
                
                # check if a stemmed search word matches a tag of the ith verse
                if (word in quran.tags.iloc[i]):
                    df.loc[i] =  quran[resulted_columns].iloc[i]

                # check if a stemmed search word appear in the ith verse
                if bool(re.search(word, quran.stemmed.iloc[i])):
                    df.loc[i]= quran[resulted_columns].iloc[i]



    return df




In [14]:
search1(quran,"يجعلون رزقهم", stem= True)
# here appears the strength of stemming,
# we got this first verse up here because its stemmed version contains the stemmed_text. 

Unnamed: 0,chapter,chapter_num,verse_num,verse,tags
5114,الواقعة,56,82,وتجعلون رزقكم أنكم تكذبون,{}
10,البقرة,2,3,الذين يؤمنون بالغيب ويقيمون الصلاة ومما رزقناهم ينفقون,{}
26,البقرة,2,19,أو كصيب من السماء فيه ظلمات ورعد وبرق يجعلون أصابعهم في اذانهم من الصواعق حذر الموت والله محيط بالكافرين,{}
29,البقرة,2,22,الذي جعل لكم الأرض فراشا والسماء بناء وأنزل من السماء ماء فأخرج به من الثمرات رزقا لكم فلا تجعلوا لله أندادا وأنتم تعلمون,{}
32,البقرة,2,25,وبشر الذين امنوا وعملوا الصالحات أن لهم جنات تجري من تحتها الأنهار كلما رزقوا منها من ثمرة رزقا قالوا هذا الذي رزقنا من قبل وأتوا به متشابها ولهم ...,{ترغيب}
...,...,...,...,...,...
6037,الأعلى,87,5,فجعله غثاء أحوى,{}
6095,الفجر,89,16,وأما إذا ما ابتلاه فقدر عليه رزقه فيقول ربي أهانن,{}
6118,البلد,90,8,ألم نجعل له عينين,{استنكار}
6292,الفيل,105,2,ألم يجعل كيدهم في تضليل,{استنكار}
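`search1` puts full-phrase matches first by running two loops. A sketch of the same ordering idea with an explicit rank and a stable sort is shown below; it ignores tags and stemming for brevity, and `rank_matches` and `toy` are illustrative names. The verses are copied from the earlier output.

```python
import pandas as pd

def rank_matches(df, text):
    """Return matching rows, full-phrase matches first, original order otherwise."""
    words = text.split()
    phrase_hit = df["verse"].str.contains(text, regex=False)
    word_hit = df["verse"].map(lambda verse: any(w in verse for w in words))
    hits = df[phrase_hit | word_hit].copy()
    # 0 for a full-phrase match, 1 for a match on single words only
    hits["rank"] = (~phrase_hit.loc[hits.index]).astype(int)
    # a stable sort keeps Quranic order inside each rank
    return hits.sort_values("rank", kind="stable").drop(columns="rank")

toy = pd.DataFrame({
    "chapter_num": [1, 4, 5],
    "verse_num": [3, 172, 75],
    "verse": [
        "الرحمن الرحيم",
        "لن يستنكف المسيح أن يكون عبدا لله ولا الملائكة المقربون ومن يستنكف عن عبادته ويستكبر فسيحشرهم إليه جميعا",
        "ما المسيح ابن مريم إلا رسول قد خلت من قبله الرسل وأمه صديقة كانا يأكلان الطعام انظر كيف نبين لهم الايات ثم انظر أنى يؤفكون",
    ],
})

# 5:75 contains the whole phrase, so it is listed before 4:172
print(rank_matches(toy, "المسيح ابن مريم")["verse_num"].tolist())
```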


### What if the user wants to get the verse with diacritics (tashkil)?
Tashkil is important because it changes the meaning of words.\
For example, "**جَنة**" is heaven but "**جِنة**" is a 'supernatural' creature.
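The two words above become identical once the diacritics are removed, which is why the search runs on the plain `verse` column while results can be shown with tashkil. A minimal sketch of stripping the marks, assuming only the common harakat (U+064B to U+0652) and the dagger alif (U+0670) need to go; `strip_tashkil` is an illustrative helper, not something the notebook defines.

```python
import re

# tanween, fatha, damma, kasra, shadda, sukun, plus the dagger alif
TASHKIL = re.compile(r"[\u064B-\u0652\u0670]")

def strip_tashkil(text):
    """Remove Arabic diacritic marks, leaving the base letters."""
    return TASHKIL.sub("", text)

heaven = "جَنة"
jinn = "جِنة"
print(heaven == jinn)
print(strip_tashkil(heaven) == strip_tashkil(jinn))
```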

#### Adding a tashkil column
To be able to return results with tashkil, we first need to add the verses with tashkil.

In [15]:
# reading the Arabic Quran version.
tashkil = pd.read_csv("../data/quran-simple.txt", names=["chapter_num", "verse_num", "verse"], sep="|" )
tashkil.head()

Unnamed: 0,chapter_num,verse_num,verse
0,1,1.0,بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ
1,1,2.0,الْحَمْدُ لِلَّهِ رَبِّ الْعَالَمِينَ
2,1,3.0,الرَّحْمَٰنِ الرَّحِيمِ
3,1,4.0,مَالِكِ يَوْمِ الدِّينِ
4,1,5.0,إِيَّاكَ نَعْبُدُ وَإِيَّاكَ نَسْتَعِينُ


In [16]:
# dropping the rows with missing values; they have no equivalent verse.
tashkil.dropna(inplace=True)

In [17]:
# adjusting the types.
tashkil["chapter_num"]=tashkil.chapter_num.astype(int)
tashkil["verse_num"]=tashkil.verse_num.astype(int)


In [18]:
# creating a function to add a basmalah at the beginning of each chapter with verse_num = 0
def add_basmalah1(qur, language):
    
    """Insert a 'basmalah' (with tashkil and punctuation marks)
    at the start of each chapter as verse_num = 0."""
    
    if language == "A":
        b = "بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ"
    elif language == "E":
        b = "In the name of Allāh, the Entirely Merciful, the Especially Merciful."
    else:
        # fail early: without a valid language `b` would be undefined below
        raise ValueError("invalid language; use 'A' for Arabic or 'E' for English")
    
    # positions of the first verse of every chapter except the first one
    a = list(qur[qur.verse_num == 1].index)
    a.remove(0)
    verse_col = qur.columns.get_loc("verse")

    for chapter, i in enumerate(a):
        if language == "A":
            # the Arabic text embeds the basmalah in verse 1; strip it
            qur.iloc[i, verse_col] = re.sub("بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ ", "", qur.iloc[i, verse_col])
        # a fractional index places the new row just before verse 1 after sorting
        qur.loc[i-0.5] = {"chapter_num":chapter + 2,"verse_num":0,"verse":b}
    
    # single positional assignment instead of chained indexing
    qur.iloc[0, verse_col] = b

    # chapter 9 does not start with a basmalah
    qur = qur.drop(index= qur[(qur.chapter_num == 9) & (qur.verse_num == 0)].index)
    qur = qur.sort_index().reset_index(drop= True)
    
    return qur

In [19]:
tashkil =add_basmalah1(tashkil,language="A")
tashkil

Unnamed: 0,chapter_num,verse_num,verse
0,1,1,بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ
1,1,2,الْحَمْدُ لِلَّهِ رَبِّ الْعَالَمِينَ
2,1,3,الرَّحْمَٰنِ الرَّحِيمِ
3,1,4,مَالِكِ يَوْمِ الدِّينِ
4,1,5,إِيَّاكَ نَعْبُدُ وَإِيَّاكَ نَسْتَعِينُ
...,...,...,...
6343,114,2,مَلِكِ النَّاسِ
6344,114,3,إِلَٰهِ النَّاسِ
6345,114,4,مِن شَرِّ الْوَسْوَاسِ الْخَنَّاسِ
6346,114,5,الَّذِي يُوَسْوِسُ فِي صُدُورِ النَّاسِ


In [20]:
# Adding the tashkil column to quran dataFrame.
quran["tashkil"]=tashkil.verse
quran =quran[["chapter","chapter_num", "verse_num", "verse","tashkil","stemmed","tags"]]
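The assignment above lines the two frames up by row index, which works only if both have exactly the same verses in the same order. A hedged alternative is to join on the chapter and verse numbers; the sketch uses two tiny made-up frames (`plain`, `marked`) with the second deliberately out of order, and `validate="one_to_one"` makes pandas raise if a key is duplicated on either side.

```python
import pandas as pd

plain = pd.DataFrame({
    "chapter_num": [1, 1],
    "verse_num": [1, 2],
    "verse": ["بسم الله الرحمن الرحيم", "الحمد لله رب العالمين"],
})
# same verses with tashkil, but stored in a different row order
marked = pd.DataFrame({
    "chapter_num": [1, 1],
    "verse_num": [2, 1],
    "verse": ["الْحَمْدُ لِلَّهِ رَبِّ الْعَالَمِينَ", "بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ"],
})

merged = plain.merge(
    marked.rename(columns={"verse": "tashkil"}),
    on=["chapter_num", "verse_num"],
    how="left",
    validate="one_to_one",
)
print(merged[["verse_num", "tashkil"]])
```

This only helps if both sources number the verses the same way, including the basmalah rows with `verse_num = 0`.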


#### The search function considering tashkil

In [21]:
# considering tashkil

def search2(quran,text,tashkil=False, stem= False):
    
    # whether to return a verse with tashkil or without.
    if tashkil==True:
        j = "tashkil"
    else:
        j= "verse"

    # setting up the resultant data frame.
    resulted_columns = ["chapter","chapter_num", "verse_num", j, "tags"]
    df = pd.DataFrame(columns=resulted_columns)
    #c = 0
    
    # split the search text to search words.
    words_lst = text.split()
    
    # do the stemming only once because it takes time
    if stem:
        # stemming the text
        stemed_text=stemmer.stem(text)
        
        # stemming the search words
        stemed_lst=[stemmer.stem(i) for i in words_lst]
    
    
    
    # searching for the complete text
    for i in range(quran.shape[0]):
        # check if the text matches a tag of the ith verse
        if (text in quran.tags.iloc[i]):
            df.loc[i] =  quran[resulted_columns].iloc[i]

        # check if the text appear in the ith verse
        if bool(re.search(text, quran.verse.iloc[i])):
            df.loc[i]= quran[resulted_columns].iloc[i]

        
        # searching for a stemmed version of the text    
        if stem:
            # check if stemmed text matches a tag of the ith verse
            if (stemed_text in quran.tags.iloc[i]):
                df.loc[i] =  quran[resulted_columns].iloc[i]

            # check if stemmed text appear in the ith verse
            if bool(re.search(stemed_text, quran.stemmed.iloc[i])):
                df.loc[i]= quran[resulted_columns].iloc[i]




    # looping over the verses of the Quran
    for i in range(quran.shape[0]):
        
        # looping over the search words
        for word in words_lst:

            # check if a search word matches a tag of the ith verse
            if (word in quran.tags.iloc[i]):
                df.loc[i] =  quran[resulted_columns].iloc[i]

            # check if a search word appear in the ith verse
            if bool(re.search(word, quran.verse.iloc[i])):
                df.loc[i]= quran[resulted_columns].iloc[i]

        
        # are we searching for a stemmed version of the search words.    
        if stem:
            
            # looping over the stemmed search words
            for word in stemed_lst:       
                
                # check if a stemmed search word matches a tag of the ith verse
                if (word in quran.tags.iloc[i]):
                    df.loc[i] =  quran[resulted_columns].iloc[i]

                # check if a stemmed search word appear in the ith verse
                if bool(re.search(word, quran.stemmed.iloc[i])):
                    df.loc[i]= quran[resulted_columns].iloc[i]



    return df




In [22]:
# test our function.
search2(quran,"تجعلون", tashkil=True)

Unnamed: 0,chapter,chapter_num,verse_num,tashkil,tags
884,الأنعام,6,91,وَمَا قَدَرُوا اللَّهَ حَقَّ قَدْرِهِ إِذْ قَالُوا مَا أَنزَلَ اللَّهُ عَلَىٰ بَشَرٍ مِّن شَيْءٍ قُلْ مَنْ أَنزَلَ الْكِتَابَ الَّذِي جَاءَ بِهِ م...,{موسى}
4265,فصلت,41,9,قُلْ أَئِنَّكُمْ لَتَكْفُرُونَ بِالَّذِي خَلَقَ الْأَرْضَ فِي يَوْمَيْنِ وَتَجْعَلُونَ لَهُ أَندَادًا ذَٰلِكَ رَبُّ الْعَالَمِينَ,{}
5114,الواقعة,56,82,وَتَجْعَلُونَ رِزْقَكُمْ أَنَّكُمْ تُكَذِّبُونَ,{}


### Adding an English translation

#### Inserting a column of English translated verses

In [23]:
# reading english quran.
english = pd.read_csv("../data/english_saheeh_v1.1.0-csv.1.csv")
english.columns=["i","chapter_num", "verse_num", "verse","f"]

In [24]:
# adjusting our cleaning function.
def cleaning1(english_text):
    
    """Removes brackets, parentheses and numbers"""
    
    # raw string, so that \[ is not treated as an invalid string escape
    english_text = re.sub(r"\[.*?\]","",english_text)
    english_text = re.sub("[(0-9)]","",english_text)
    english_text = re.sub("  "," ",english_text)
    english_text = english_text.strip()
    
    return english_text

english["verse"] = english.verse.apply(cleaning1)
english = add_basmalah1(english,"E")
english


Unnamed: 0,i,chapter_num,verse_num,verse,f
0,1.0,1,1,"In the name of Allāh, the Entirely Merciful, the Especially Merciful.","[2]- Allāh is a proper name belonging only to the one Almighty God, Creator and Sustainer of the heavens and the earth and all that is within them..."
1,2.0,1,2,"praise is to Allāh, Lord of the worlds -","[4]- When referring to Allāh (subḥānahu wa taʿālā) , the Arabic term ""rabb"" (translated as ""Lord"") includes all of the following meanings: ""owner,..."
2,3.0,1,3,"The Entirely Merciful, the Especially Merciful,",
3,4.0,1,4,Sovereign of the Day of Recompense.,"[5]- i.e., repayment and compensation for whatever was earned of good or evil during life on this earth."
4,5.0,1,5,It is You we worship and You we ask for help.,
...,...,...,...,...,...
6343,6232.0,114,2,"The Sovereign of mankind,",
6344,6233.0,114,3,"The God of mankind,",
6345,6234.0,114,4,From the evil of the retreating whisperer -,"[2016]- i.e., a devil who makes evil suggestions to man but disappears when one remembers Allāh."
6346,6235.0,114,5,Who whispers into the breasts of mankind -,
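To make the cleaning steps concrete, here is a sketch of a variant that uses raw-string patterns and collapses any run of whitespace, not only double spaces. `clean_translation` is an illustrative name, and the `raw` string is a made-up example of a bracketed, numbered translation line, not a row quoted from the CSV file.

```python
import re

def clean_translation(text):
    """Drop [bracketed] insertions, parentheses and digits, then tidy spaces."""
    text = re.sub(r"\[.*?\]", "", text)   # remove [ ... ] including contents
    text = re.sub(r"[()0-9]", "", text)   # remove parentheses and digits
    text = re.sub(r"\s+", " ", text)      # collapse runs of whitespace
    return text.strip()

raw = "[All] praise is [due] to Allāh, Lord of the worlds - (1)"
print(clean_translation(raw))
```

Removing the bracketed words can leave stray punctuation behind (for example "But, , Allāh"), which neither version repairs.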


In [25]:
# inserting English column in our data frame.
quran["English"] = english.verse
quran = quran[["chapter","chapter_num", "verse_num", "verse","tashkil","stemmed","English","tags"]]
quran.head()


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  quran["English"] = english.verse


Unnamed: 0,chapter,chapter_num,verse_num,verse,tashkil,stemmed,English,tags
0,الفاتحة,1,1,بسم الله الرحمن الرحيم,بِسْمِ اللَّهِ الرَّحْمَٰنِ الرَّحِيمِ,بسم الله رحمن رحيم,"In the name of Allāh, the Entirely Merciful, the Especially Merciful.",{}
1,الفاتحة,1,2,الحمد لله رب العالمين,الْحَمْدُ لِلَّهِ رَبِّ الْعَالَمِينَ,حمد الله رب عالم,"praise is to Allāh, Lord of the worlds -",{}
2,الفاتحة,1,3,الرحمن الرحيم,الرَّحْمَٰنِ الرَّحِيمِ,رحمن رحيم,"The Entirely Merciful, the Especially Merciful,",{}
3,الفاتحة,1,4,مالك يوم الدين,مَالِكِ يَوْمِ الدِّينِ,مالك يوم دين,Sovereign of the Day of Recompense.,{القيامة}
4,الفاتحة,1,5,إياك نعبد وإياك نستعين,إِيَّاكَ نَعْبُدُ وَإِيَّاكَ نَسْتَعِينُ,إياك عبد إياك استعان,It is You we worship and You we ask for help.,{}


#### Search function with translation option

In [26]:
# adding a translation option.

def search(quran,text,tashkil=False, stem= False, tran=False):
    
    #====================SETTING UP=======================#

    # returning result with tashkil or not.
    if tashkil==True:
        j = "tashkil"
    else:
        j= "verse"
    
    resulted_columns = ["chapter","chapter_num", "verse_num", j, "tags"]

    # whether to return a translation or not.
    if tran == True:
        resulted_columns= ["chapter_num", "verse_num", j, "English","tags"]
    
    df = pd.DataFrame(columns=resulted_columns)
    
    
    words_lst = text.split()
    
    # do the stemming only once because it takes time
    if stem:
        # stemming the text
        stemed_text=stemmer.stem(text)
        
        # stemming the search words
        stemed_lst=[stemmer.stem(i) for i in words_lst]
    
    #================SEARCHING: LOOPING================#
    
    # searching for the complete text
    for i in range(quran.shape[0]):
        # check if the text matches a tag of the ith verse
        if (text in quran.tags.iloc[i]):
            df.loc[i] =  quran[resulted_columns].iloc[i]

        # check if the text appear in the ith verse
        if bool(re.search(text, quran.verse.iloc[i])):
            df.loc[i]= quran[resulted_columns].iloc[i]

        
        # searching for a stemmed version of the text    
        if stem:
            # check if stemmed text matches a tag of the ith verse
            if (stemed_text in quran.tags.iloc[i]):
                df.loc[i] =  quran[resulted_columns].iloc[i]

            # check if stemmed text appear in the ith verse
            if bool(re.search(stemed_text, quran.stemmed.iloc[i])):
                df.loc[i]= quran[resulted_columns].iloc[i]




    # looping over the verses of the Quran
    for i in range(quran.shape[0]):
        
        # looping over the search words
        for word in words_lst:

            # check if a search word matches a tag of the ith verse
            if (word in quran.tags.iloc[i]):
                df.loc[i] =  quran[resulted_columns].iloc[i]

            # check if a search word appear in the ith verse
            if bool(re.search(word, quran.verse.iloc[i])):
                df.loc[i]= quran[resulted_columns].iloc[i]

        
        # are we searching for a stemmed version of the search words.    
        if stem:
            
            # looping over the stemmed search words
            for word in stemed_lst:       
                
                # check if a stemmed search word matches a tag of the ith verse
                if (word in quran.tags.iloc[i]):
                    df.loc[i] =  quran[resulted_columns].iloc[i]

                # check if a stemmed search word appear in the ith verse
                if bool(re.search(word, quran.stemmed.iloc[i])):
                    df.loc[i]= quran[resulted_columns].iloc[i]

    
    return df


In [27]:
search(quran,"قلوب",tran=True)

Unnamed: 0,chapter_num,verse_num,verse,English,tags
14,2,7,ختم الله على قلوبهم وعلى سمعهم وعلى أبصارهم غشاوة ولهم عذاب عظيم,"Allāh has set a seal upon their hearts and upon their hearing, and over their vision is a veil. And for them is a great punishment.",{ترهيب}
17,2,10,في قلوبهم مرض فزادهم الله مرضا ولهم عذاب أليم بما كانوا يكذبون,"In their hearts is disease, so Allāh has increased their disease; and for them is a painful punishment because they used to lie.",{ترهيب}
81,2,74,ثم قست قلوبكم من بعد ذلك فهي كالحجارة أو أشد قسوة وإن من الحجارة لما يتفجر منه الأنهار وإن منها لما يشقق فيخرج منه الماء وإن منها لما يهبط من خشية...,"Then your hearts became hardened after that, being like stones or even harder. For indeed, there are stones from which rivers burst forth, and the...",{ترغيب}
95,2,88,وقالوا قلوبنا غلف بل لعنهم الله بكفرهم فقليلا ما يؤمنون,"And they said, ""Our hearts are wrapped."" But, , Allāh has cursed them for their disbelief, so little is it that they believe.",{}
100,2,93,وإذ أخذنا ميثاقكم ورفعنا فوقكم الطور خذوا ما اتيناكم بقوة واسمعوا قالوا سمعنا وعصينا وأشربوا في قلوبهم العجل بكفرهم قل بئسما يأمركم به إيمانكم إن ...,"And when We took your covenant and raised over you the mount, , ""Take what We have given you with determination and listen."" They said , ""We hear ...",{}
...,...,...,...,...,...
5251,63,3,ذلك بأنهم امنوا ثم كفروا فطبع على قلوبهم فهم لا يفقهون,"That is because they believed, and then they disbelieved; so their hearts were sealed over, and they do not understand.",{}
5296,66,4,إن تتوبا إلى الله فقد صغت قلوبكما وإن تظاهرا عليه فإن الله هو مولاه وجبريل وصالح المؤمنين والملائكة بعد ذلك ظهير,"If you two repent to Allāh, , for your hearts have deviated. But if you cooperate against him - then indeed Allāh is his protector, and Gabriel an...",{}
5597,74,31,وما جعلنا أصحاب النار إلا ملائكة وما جعلنا عدتهم إلا فتنة للذين كفروا ليستيقن الذين أوتوا الكتاب ويزداد الذين امنوا إيمانا ولا يرتاب الذين أوتوا ا...,And We have not made the keepers of the Fire except angels. And We have not made their number except as a trial for those who disbelieve - that th...,{ترهيب}
5796,79,8,قلوب يومئذ واجفة,"Hearts, that Day, will tremble,",{}


This is not very useful; searching by an English word would be more convenient in this case.

### Search by English

In [28]:
# Adding English tags.

def trans_tags(tags):
    E_tags=set({})
    
    if "عيسى" in tags:
        E_tags.add("Jesus")
    if "موسى" in tags:
        E_tags.add("Moses")
    if "محمد" in tags:
        E_tags=E_tags.union({"Muḥammad","Mohammed"})
    if "ترهيب" in tags:
        E_tags.add("menacing")
    if "إبراهيم" in tags:
        E_tags.add("Abraham")
    if "يوسف" in tags:
        E_tags.add("Joseph")
    if "نوح" in tags:
        E_tags.add("Noah")
    if "صالح" in tags:
        E_tags=E_tags.union({"Ṣāliḥ", "Saleh","Salih"})
    if "لوط" in tags:
        E_tags.add("Lot")
    if "شعيب" in tags:
        E_tags.add("Shuʿayb")
    if "مريم" in tags:
        E_tags.add("Mary")
    if "هود" in tags:
        E_tags.add("Hūd")
    if "الجهاد" in tags:
        E_tags.add("jehad")
    if "سليمان" in tags:
        E_tags.add("Solomon")
    if "المنافقين" in tags:
        E_tags.add("hypocrites")
    if "العقل" in tags:
        E_tags.add("reasoning")
    if "آدم" in tags:
        E_tags.add("Adam")
    if "القيامة" in tags:
        E_tags.add("day of judgment")
    if "ترغيب" in tags:
        E_tags.add("enticement")
    
    return(E_tags)

quran["E_tags"]=quran.tags.apply(trans_tags)
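The chain of `if` statements can also be written as a lookup table, which keeps the Arabic tag and its English equivalents side by side. This is a sketch with only a few of the entries from `trans_tags`; `TAG_TRANSLATIONS` and `translate_tags` are illustrative names.

```python
# Arabic tag -> set of English tags (a subset of the pairs used above)
TAG_TRANSLATIONS = {
    "عيسى": {"Jesus"},
    "موسى": {"Moses"},
    "محمد": {"Muḥammad", "Mohammed"},
    "مريم": {"Mary"},
    "القيامة": {"day of judgment"},
}

def translate_tags(tags, table=TAG_TRANSLATIONS):
    """Union of the English tags for every Arabic tag that has a translation."""
    english = set()
    for tag in tags:
        english |= table.get(tag, set())  # tags without a translation are skipped
    return english

print(translate_tags({"عيسى", "مريم"}))
```

It would be applied the same way, e.g. `quran.tags.apply(translate_tags)`, and adding a new tag then means adding one line to the table.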


## Final search function

In [29]:
# Adjusting our search function so that it can also search in English.

def search(quran, text, tashkil=False, stem=False, both_lang=False):
    
    #===========setting up the result columns and the language we search in=================#
    
    if re.search("[A-Za-z]",text):
        #search in E_tags and English columns
        language="English"
        tags="E_tags"

        if both_lang:
            # return both languages
            if tashkil:
                # return English and Arabic with tashkil
                resulted_columns=["chapter","chapter_num", "verse_num", "tashkil", "English","E_tags"]
            else:
                # return English and Arabic without tashkil
                resulted_columns=["chapter","chapter_num", "verse_num", "verse", "English","E_tags"]
        else:
            # return only English
            resulted_columns= ["chapter_num", "verse_num", "English","E_tags"]
    
    else:
        # text is not English
        # search in Arabic tags and verses
        tags="tags"
        language="verse"
        
        if both_lang:
            # return both languages
            if tashkil:
                # return English and Arabic with tashkil
                resulted_columns=["chapter","chapter_num", "verse_num", "tashkil", "English","tags"]
            else:
                # return English and Arabic without tashkil
                resulted_columns= ["chapter","chapter_num", "verse_num", "verse", "English","tags"]
        else: 
            # only Arabic

            if tashkil:
                # with tashkil
                resulted_columns= ["chapter","chapter_num", "verse_num", "tashkil","tags"]
            else:
                # without tashkil
                resulted_columns = ["chapter","chapter_num", "verse_num", "verse","tags"]
    
    df = pd.DataFrame(columns=resulted_columns)


    words_lst = text.split()

    # do the stemming only once because it takes time
    if stem and (language=="verse"):
        # stemming the text
        stemed_text=stemmer.stem(text)
        
        # stemming the search words
        stemed_lst=[stemmer.stem(i) for i in words_lst]
    elif stem and (language=="English"):
        # stemming is only available for Arabic text
        return("there is no English stem search")
        


    #================SEARCHING: for the whole text===================#
    
    
    
    # searching for the complete text
    for i in range(quran.shape[0]):
        # check if the text matches a tag of the ith verse
        if (text in quran[tags].iloc[i]):
            df.loc[i] =  quran[resulted_columns].iloc[i]

        # check if the text appears in the ith verse
        if bool(re.search(text, quran[language].iloc[i])):
            df.loc[i]= quran[resulted_columns].iloc[i]

        
        # are we searching for a stemmed version of the text    
        if stem:
            # check if stemmed text matches a tag of the ith verse
            if (stemed_text in quran[tags].iloc[i]):
                df.loc[i] =  quran[resulted_columns].iloc[i]

            # check if the stemmed text appears in the ith verse
            if bool(re.search(stemed_text, quran.stemmed.iloc[i])):
                df.loc[i]= quran[resulted_columns].iloc[i]



    #================SEARCHING: for individual search words===================#

    # looping over the verses of the Quran
    for i in range(quran.shape[0]):
        
        # looping over the search words
        for word in words_lst:

            # check if a search word matches a tag of the ith verse
            if (word in quran[tags].iloc[i]):
                df.loc[i] =  quran[resulted_columns].iloc[i]

            # check if a search word appears in the ith verse
            if bool(re.search(word, quran[language].iloc[i])):
                df.loc[i]= quran[resulted_columns].iloc[i]
 
        
        # are we searching for a stemmed version of the search words.    
        if stem:
            
            # looping over the stemmed search words
            for word in stemed_lst:       
                
                # check if a stemmed search word matches a tag of the ith verse
                if (word in quran[tags].iloc[i]):
                    df.loc[i] =  quran[resulted_columns].iloc[i]

                # check if a stemmed search word appears in the ith verse
                if bool(re.search(word, quran.stemmed.iloc[i])):
                    df.loc[i]= quran[resulted_columns].iloc[i]

    
    return df
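
One thing to be aware of: `search` passes the query straight to `re.search`, so the query is interpreted as a regular expression. A query containing characters such as `(` or `.` may therefore raise an error or match more than intended. The sketch below shows one possible way to match the query as plain text instead. The helper name `contains_literal` and its `ignore_case` parameter are hypothetical and are not part of the function above.

```python
import re


def contains_literal(query, verse, ignore_case=False):
    """Return True if `query` occurs in `verse` as plain text.

    re.escape neutralises regex metacharacters, so the query is matched
    literally rather than interpreted as a pattern.
    """
    flags = re.IGNORECASE if ignore_case else 0
    return re.search(re.escape(query), verse, flags) is not None


# Example calls (hypothetical inputs):
found = contains_literal("heaven", "And when the heaven is opened")
safe = contains_literal("(heaven", "And when the heaven is opened")
```

If regular-expression queries are wanted on purpose, the original `re.search(text, ...)` call is the right choice; the escaped form only matters when the query should be treated as literal text.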


In [33]:
# checking our result.
search(quran,"heaven",both_lang=True,tashkil=False,stem=False)

Unnamed: 0,chapter,chapter_num,verse_num,verse,English,E_tags
36,البقرة,2,29,هو الذي خلق لكم ما في الأرض جميعا ثم استوى إلى السماء فسواهن سبع سماوات وهو بكل شيء عليم,"It is He who created for you all of that which is on the earth. Then He directed Himself to the heaven, , and made them seven heavens, and He is K...",{}
40,البقرة,2,33,قال يا ادم أنبئهم بأسمائهم فلما أنبأهم بأسمائهم قال ألم أقل لكم إني أعلم غيب السماوات والأرض وأعلم ما تبدون وما كنتم تكتمون,"He said, ""O Adam, inform them of their names."" And when he had informed them of their names, He said, ""Did I not tell you that I know the unseen o...",{Adam}
114,البقرة,2,107,ألم تعلم أن الله له ملك السماوات والأرض وما لكم من دون الله من ولي ولا نصير,Do you not know that to Allāh belongs the dominion of the heavens and the earth and you have not besides Allāh any protector or any helper?,{}
123,البقرة,2,116,وقالوا اتخذ الله ولدا سبحانه بل له ما في السماوات والأرض كل له قانتون,"They say, ""Allāh has taken a son."" Exalted is He! Rather, to Him belongs whatever is in the heavens and the earth. All are devoutly obedient to Him,",{}
124,البقرة,2,117,بديع السماوات والأرض وإذا قضى أمرا فإنما يقول له كن فيكون,"Originator of the heavens and the earth. When He decrees a matter, He only says to it, ""Be,"" and it is.",{}
...,...,...,...,...,...,...
5705,المرسلات,77,9,وإذا السماء فرجت,And when the heaven is opened,{day of judgment}
5766,النبأ,78,19,وفتحت السماء فكانت أبوابا,And the heaven is opened and will become gateways.,{}
5784,النبأ,78,37,رب السماوات والأرض وما بينهما الرحمن لا يملكون منه خطابا,"the Lord of the heavens and the earth and whatever is between them, the Most Merciful. They possess not from Him speech.",{}
5815,النازعات,79,27,أأنتم أشد خلقا أم السماء بناها,Are you a more difficult creation or is the heaven? He constructed it.,{}


In [31]:
# Let's pickle our final DataFrame.
quran.to_pickle("../pickle/tagged_quran.pkl")


## Conclusion

- Now we can easily collect the story of Moses, which is scattered across the Quran.
- Moreover, you can search for a word without worrying about its attached pronouns or whether it is singular or plural.

I am looking forward to implementing this search engine. It will help anyone who wants to study a subject in the Quran.