# Consolidate Allah Occurrences in Quran Corpus

This notebook consolidates the per-occurrence CSV so that every verse of the Quran gets exactly one row, including verses in which 'Allah' does not occur (frequency 0). Each row contains:
- Surah number
- Verse number
- Frequency (count of occurrences in that verse)
- Word locations (list of all positions in order)

In [1]:
import pandas as pd
import ast

## Load the Original CSV

In [2]:
# Load the CSV file
df = pd.read_csv('/kaggle/input/corpus-quran-com-scrapped-datasetoriginal/allah_occurrences_NO_DUPLICATES (1).csv')

# Display first few rows to understand the structure
print("Original CSV structure:")
print(df.head(10))
print(f"\nTotal rows in original CSV: {len(df)}")

Original CSV structure:
  location  chapter  verse  word_position transliteration    translation  \
0    1:1:2        1      1              2          l-lahi     (of) Allah   
1    1:2:2        1      2              2         lillahi  (be) to Allah   
2    2:7:2        2      7              2          l-lahu          Allah   
3    2:8:6        2      8              6        bil-lahi       in Allah   
4    2:9:2        2      9              2          l-laha          Allah   
5   2:10:5        2     10              5          l-lahu          Allah   
6   2:15:1        2     15              1         al-lahu          Allah   
7  2:17:11        2     17             11          l-lahu          Allah   
8  2:19:17        2     19             17        wal-lahu      And Allah   
9  2:20:16        2     20             16          l-lahu          Allah   

                                        arabic_verse  
0               بِسْمِاللَّهِالرَّحْمَٰنِ الرَّحِيمِ  
1                الْحَمْدُلِل

## Define Quran Structure

First, we need to know the number of verses in each surah to create a complete list of all verses.

In [3]:
# Number of verses in each surah (1-114)
verses_per_surah = [
    7, 286, 200, 176, 120, 165, 206, 75, 129, 109,
    123, 111, 43, 52, 99, 128, 111, 110, 98, 135,
    112, 78, 118, 64, 77, 227, 93, 88, 69, 60,
    34, 30, 73, 54, 45, 83, 182, 88, 75, 85,
    54, 53, 89, 59, 37, 35, 38, 29, 18, 45,
    60, 49, 62, 55, 78, 96, 29, 22, 24, 13,
    14, 11, 11, 18, 12, 12, 30, 52, 52, 44,
    28, 28, 20, 56, 40, 31, 50, 40, 46, 42,
    29, 19, 36, 25, 22, 17, 19, 26, 30, 20,
    15, 21, 11, 8, 8, 19, 5, 8, 8, 11,
    11, 8, 3, 9, 5, 4, 7, 3, 6, 3,
    5, 4, 5, 6
]

print(f"Total number of surahs: {len(verses_per_surah)}")
print(f"Total verses in Quran: {sum(verses_per_surah)}")

Total number of surahs: 114
Total verses in Quran: 6236
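Because the verse counts are typed in by hand, it can be worth checking that every (chapter, verse) pair in the occurrence data falls inside them. The sketch below is a minimal illustration on made-up rows. The names `verses_per_surah_demo` and `occurrences` are hypothetical stand-ins for `verses_per_surah` and `df`, and the lookup assumes every chapter number is a valid index into the list.

```python
import pandas as pd

# Toy stand-ins for the notebook's objects (hypothetical values, not real data).
verses_per_surah_demo = [7, 286]  # first two surahs only
occurrences = pd.DataFrame({
    "chapter": [1, 2, 2, 2],
    "verse": [1, 7, 286, 287],  # 2:287 is deliberately out of range
})

# Look up the verse limit for each row's chapter (chapters are 1-indexed).
limits = occurrences["chapter"].map(lambda c: verses_per_surah_demo[c - 1])

# A row is valid when its verse number lies between 1 and the chapter's limit.
in_range = (occurrences["verse"] >= 1) & (occurrences["verse"] <= limits)
out_of_range = occurrences[~in_range]

print(f"Rows outside the known verse range: {len(out_of_range)}")
print(out_of_range)
```

An empty `out_of_range` on the real data would suggest that the occurrence file and the verse counts use the same verse numbering.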


## Create Complete Verse List

Generate all surah-verse combinations from the Quran.

In [4]:
# Create a complete list of all verses in the Quran
all_verses = []
for surah_num, num_verses in enumerate(verses_per_surah, start=1):
    for verse_num in range(1, num_verses + 1):
        all_verses.append({'surah_number': surah_num, 'verse_number': verse_num})

# Convert to DataFrame
all_verses_df = pd.DataFrame(all_verses)

print(f"Total verses in complete list: {len(all_verses_df)}")
print("\nFirst 10 verses:")
print(all_verses_df.head(10))

Total verses in complete list: 6236

First 10 verses:
   surah_number  verse_number
0             1             1
1             1             2
2             1             3
3             1             4
4             1             5
5             1             6
6             1             7
7             2             1
8             2             2
9             2             3
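The nested loop above is easy to read. As an optional alternative, the same table can be built without Python-level loops using NumPy. This is only a sketch on the first three surah lengths; `counts`, `starts`, and `all_verses_demo` are illustrative names that do not appear elsewhere in the notebook.

```python
import numpy as np
import pandas as pd

# First three surah lengths, taken from the list above.
counts = np.array([7, 286, 200])

# Repeat each surah number once per verse: 1,1,...,2,2,...,3,3,...
surah = np.repeat(np.arange(1, len(counts) + 1), counts)

# Offset of each surah's first verse in the flat list: [0, 7, 293].
starts = np.cumsum(counts) - counts

# Verse number = flat index minus the surah's starting offset, plus 1.
verse = np.arange(counts.sum()) - np.repeat(starts, counts) + 1

all_verses_demo = pd.DataFrame({"surah_number": surah, "verse_number": verse})
print(all_verses_demo.head(10))
```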


## Transform the Data

Group by surah (chapter) and verse, then:
1. Count the frequency of occurrences
2. Collect all word positions as a list
3. Merge with complete verse list to include verses with frequency 0

In [5]:
# Group by chapter (surah) and verse for verses that contain Allah
allah_verses = df.groupby(['chapter', 'verse']).agg({
    'word_position': lambda x: sorted(x)  # Collect word positions as a list, in ascending order
}).reset_index()

# Rename columns for clarity
allah_verses.columns = ['surah_number', 'verse_number', 'word_locations']

# Add frequency column (count of occurrences)
allah_verses['frequency'] = allah_verses['word_locations'].apply(len)

print(f"Verses containing 'Allah': {len(allah_verses)}")
print(allah_verses.head(10))

Verses containing 'Allah': 1821
   surah_number  verse_number word_locations  frequency
0             1             1            [2]          1
1             1             2            [2]          1
2             2             7            [2]          1
3             2             8            [6]          1
4             2             9            [2]          1
5             2            10            [5]          1
6             2            15            [1]          1
7             2            17           [11]          1
8             2            19           [17]          1
9             2            20       [16, 21]          2
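The grouping step can also be written with pandas named aggregation, which produces the frequency and the position list in a single pass and names the output columns directly. The sketch below runs on a small made-up frame (`toy`) that mimics the input columns. It is an alternative formulation, not the code that produced the output above.

```python
import pandas as pd

# Made-up rows in the same shape as the input CSV (positions deliberately unsorted).
toy = pd.DataFrame({
    "chapter": [2, 2, 2, 1],
    "verse": [20, 20, 7, 1],
    "word_position": [21, 16, 2, 2],
})

grouped = (
    toy.groupby(["chapter", "verse"])
    .agg(
        # Number of occurrence rows in the verse.
        frequency=("word_position", "size"),
        # Word positions as a plain Python list, in ascending order.
        word_locations=("word_position", lambda s: sorted(s.tolist())),
    )
    .reset_index()
    .rename(columns={"chapter": "surah_number", "verse": "verse_number"})
)

print(grouped)
```

Sorting inside the aggregation keeps the positions in order even if the input rows are not.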


In [6]:
# Merge with complete verse list
# Left join ensures all verses are included
consolidated = all_verses_df.merge(
    allah_verses[['surah_number', 'verse_number', 'frequency', 'word_locations']], 
    on=['surah_number', 'verse_number'], 
    how='left'
)

# Fill NaN values for verses without Allah
consolidated['frequency'] = consolidated['frequency'].fillna(0).astype(int)
consolidated['word_locations'] = consolidated['word_locations'].apply(
    lambda x: x if isinstance(x, list) else []
)

# Reorder columns: surah_number, verse_number, frequency, word_locations
consolidated = consolidated[['surah_number', 'verse_number', 'frequency', 'word_locations']]

# Display results
print("Consolidated CSV structure (with all verses):")
print(consolidated.head(20))
print(f"\nTotal rows in consolidated CSV: {len(consolidated)}")
print(f"Verses with Allah (frequency > 0): {len(consolidated[consolidated['frequency'] > 0])}")
print(f"Verses without Allah (frequency = 0): {len(consolidated[consolidated['frequency'] == 0])}")

Consolidated CSV structure (with all verses):
    surah_number  verse_number  frequency word_locations
0              1             1          1            [2]
1              1             2          1            [2]
2              1             3          0             []
3              1             4          0             []
4              1             5          0             []
5              1             6          0             []
6              1             7          0             []
7              2             1          0             []
8              2             2          0             []
9              2             3          0             []
10             2             4          0             []
11             2             5          0             []
12             2             6          0             []
13             2             7          1            [2]
14             2             8          1            [6]
15             2             9          1 
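A left join silently duplicates rows if the right-hand table contains repeated keys, so a few cheap checks after the merge can catch problems early. The sketch below uses the `validate` and `indicator` parameters of `DataFrame.merge` on toy frames. `all_demo` and `hits_demo` are hypothetical stand-ins for `all_verses_df` and `allah_verses`.

```python
import pandas as pd

# Toy stand-ins (hypothetical): three verses, two of which have an occurrence.
all_demo = pd.DataFrame({"surah_number": [1, 1, 1], "verse_number": [1, 2, 3]})
hits_demo = pd.DataFrame({
    "surah_number": [1, 1],
    "verse_number": [1, 2],
    "frequency": [1, 1],
    "word_locations": [[2], [2]],
})

merged = all_demo.merge(
    hits_demo,
    on=["surah_number", "verse_number"],
    how="left",
    validate="one_to_one",  # raises MergeError if keys repeat on either side
    indicator=True,         # adds a _merge column: 'both' or 'left_only'
)

# Same NaN handling as in the cell above.
merged["frequency"] = merged["frequency"].fillna(0).astype(int)
merged["word_locations"] = merged["word_locations"].apply(
    lambda x: x if isinstance(x, list) else []
)

# The join must not add or drop verses.
assert len(merged) == len(all_demo)
# Frequency must agree with the number of stored positions.
assert (merged["frequency"] == merged["word_locations"].apply(len)).all()

print(merged)
```

On the real data, the analogous row-count check would compare `len(consolidated)` with `len(all_verses_df)`.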

## Show Examples of Verses with Multiple Occurrences

In [7]:
# Show verses where Allah appears more than once
multiple_occurrences = consolidated[consolidated['frequency'] > 1]
print(f"Verses with multiple occurrences of 'Allah': {len(multiple_occurrences)}")
print("\nFirst 10 examples:")
print(multiple_occurrences.head(10))

Verses with multiple occurrences of 'Allah': 644

First 10 examples:
    surah_number  verse_number  frequency word_locations
26             2            20          2       [16, 21]
32             2            26          2        [2, 26]
33             2            27          2        [4, 11]
67             2            61          2       [44, 50]
73             2            67          2        [6, 16]
80             2            74          2       [32, 34]
86             2            80          3   [11, 15, 20]
95             2            89          2        [6, 24]
96             2            90          2        [9, 13]
97             2            91          2        [7, 25]


## Statistics

In [8]:
# Display frequency distribution
print("Frequency Distribution:")
print(consolidated['frequency'].value_counts().sort_index())

print(f"\nTotal verses in Quran: {len(consolidated)}")
print(f"Verses containing 'Allah': {len(consolidated[consolidated['frequency'] > 0])}")
print(f"Verses without 'Allah': {len(consolidated[consolidated['frequency'] == 0])}")
print(f"Total occurrences of 'Allah': {consolidated['frequency'].sum()}")
print(f"Maximum occurrences in a single verse: {consolidated['frequency'].max()}")

# Show the verse(s) with maximum occurrences
max_freq = consolidated['frequency'].max()
print(f"\nVerse(s) with {max_freq} occurrences:")
print(consolidated[consolidated['frequency'] == max_freq])

Frequency Distribution:
frequency
0    4415
1    1177
2     463
3     139
4      34
5       6
6       1
7       1
Name: count, dtype: int64

Total verses in Quran: 6236
Verses containing 'Allah': 1821
Verses without 'Allah': 4415
Total occurrences of 'Allah': 2699
Maximum occurrences in a single verse: 7

Verse(s) with 7 occurrences:
      surah_number  verse_number  frequency                word_locations
5494            73            20          7  [16, 43, 48, 58, 68, 74, 76]


## Save the Consolidated CSV

In [9]:
# Save to new CSV file
output_filename = 'quran_allah_reordered.csv'
consolidated.to_csv(output_filename, index=False)

print(f"Consolidated CSV saved as: {output_filename}")
print("\nColumns in output file:")
print(consolidated.columns.tolist())

Consolidated CSV saved as: quran_allah_reordered.csv

Columns in output file:
['surah_number', 'verse_number', 'frequency', 'word_locations']


## Preview of Output File

In [10]:
# Read back the saved file to verify
verification = pd.read_csv(output_filename)
print("Verification - First 10 rows of saved file:")
print(verification.head(10))

# Note: word_locations is stored in the CSV as the string representation of a list.
# To get a list back, parse it with ast.literal_eval(row['word_locations']).

Verification - First 10 rows of saved file:
   surah_number  verse_number  frequency word_locations
0             1             1          1            [2]
1             1             2          1            [2]
2             1             3          0             []
3             1             4          0             []
4             1             5          0             []
5             1             6          0             []
6             1             7          0             []
7             2             1          0             []
8             2             2          0             []
9             2             3          0             []


## Example: How to Read and Use the Word Locations

In [11]:
# Example of how to convert string back to list when reading the CSV
print("Example: Converting word_locations from string to list\n")

# Take first row with multiple occurrences
example_row = verification[verification['frequency'] > 1].iloc[0]

print(f"Surah: {example_row['surah_number']}, Verse: {example_row['verse_number']}")
print(f"Frequency: {example_row['frequency']}")
print(f"Word locations (as string): {example_row['word_locations']}")
print(f"Type: {type(example_row['word_locations'])}")

# Convert string to actual list
locations_list = ast.literal_eval(example_row['word_locations'])
print(f"\nWord locations (as list): {locations_list}")
print(f"Type: {type(locations_list)}")
print(f"\nPositions where 'Allah' appears: {', '.join(map(str, locations_list))}")

Example: Converting word_locations from string to list

Surah: 2, Verse: 20
Frequency: 2
Word locations (as string): [16, 21]
Type: <class 'str'>

Word locations (as list): [16, 21]
Type: <class 'list'>

Positions where 'Allah' appears: 16, 21
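Instead of converting one row at a time, the whole column can be parsed while the file is read by passing `ast.literal_eval` through the `converters` argument of `pd.read_csv`. The sketch below reads from an in-memory string so that it is self-contained. The rows in `csv_text` are made up in the output file's format and are not taken from the real file.

```python
import ast
import io

import pandas as pd

# Made-up rows in the same format as the saved CSV (illustrative only).
csv_text = (
    "surah_number,verse_number,frequency,word_locations\n"
    "2,19,1,[17]\n"
    '2,20,2,"[16, 21]"\n'
    "2,21,0,[]\n"
)

# converters applies the function to every raw string in the named column.
parsed = pd.read_csv(
    io.StringIO(csv_text),
    converters={"word_locations": ast.literal_eval},
)

print(parsed)
print(type(parsed.loc[1, "word_locations"]))
```

With the real file, `io.StringIO(csv_text)` would be replaced by `output_filename`. `ast.literal_eval` only evaluates Python literals, so it is a safer choice than `eval` for this kind of round trip.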
