# Scraping Poets' Poems from the Ganjoor Website

This Python script scrapes poets' poems from the Ganjoor website. It provides functions that extract poems, poem excerpts, and related details from the site's pages and save them to CSV files according to specified criteria.

## Functionality

The script comprises several functions:

1. **`get_poem(url)`**: Extracts a poem from the URL of a page that contains it. It parses the HTML, retrieves the verses, counts the verses and words, and returns the poem text together with those counts.

2. **`get_book_urls(url, prefix="https://ganjoor.net")`**: Retrieves the URLs of the books listed on a given page. It extracts the links from the page and prepends the given prefix, typically the Ganjoor site URL, to form complete URLs.

3. **`get_part_urls(url, prefix="https://ganjoor.net")`**: Extracts the URL of each poem's full version from a page that lists poem excerpts, and prepends the given prefix to form complete URLs.

4. **`get_poet(poet_links, directory="./result", number_of_doc_words=700, print_details=False)`**: The main function. It takes a list of poet entries, each naming a poet and the books to scrape, extracts the poems, and saves them to CSV files in `directory`. It iterates over the poets, retrieves the URLs of each poet's books, then the URLs of the poem excerpts in each book, and calls `get_poem()` to extract the poems, taking the maximum number of words per document (`number_of_doc_words`) into account. It returns a list of dictionaries describing the processed documents.
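
The URL-joining and counting steps described in the list above can be sketched as follows. This is a minimal sketch and not the script's actual code: `absolute_urls` and `poem_stats` are hypothetical helper names, and it assumes that the scraped links are site-relative hrefs and that the verses are already available as a list of strings. It uses `urllib.parse.urljoin` from the standard library, which avoids a doubled slash when an href already starts with `/`.

```python
from urllib.parse import urljoin


def absolute_urls(hrefs, prefix="https://ganjoor.net"):
    """Turn site-relative hrefs into absolute URLs (hypothetical helper)."""
    # urljoin resolves each href against the prefix, so "/a/b" and "a/b"
    # both end up directly under the host with a single slash.
    return [urljoin(prefix + "/", href) for href in hrefs]


def poem_stats(verses):
    """Return (text, verse_count, word_count) for a list of verse strings.

    Assumption: words are separated by whitespace and empty verses are skipped.
    """
    cleaned = [verse.strip() for verse in verses if verse.strip()]
    text = "\n".join(cleaned)
    word_count = sum(len(verse.split()) for verse in cleaned)
    return text, len(cleaned), word_count
```

For example, `absolute_urls(["/attar/asrarname"])` yields a URL with a single slash after the host, whether or not the href starts with one.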

## Usage

To use this script:
- Provide a list of poet entries, each with the poet's name and the books (name and URL) to scrape.
- Specify the directory where the CSV files will be saved.
- Optionally set the maximum number of words per document and whether to print details during processing.
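
To illustrate what the word limit controls, here is a minimal sketch of grouping poems into documents. How `get_poet()` actually applies `number_of_doc_words` is not shown in this README; the sketch assumes the value is a cap, that words are counted by whitespace splitting, and that a single poem longer than the cap is kept whole. `group_into_documents` is a hypothetical helper, not part of the script.

```python
def group_into_documents(poems, number_of_doc_words=700):
    """Group poem texts into documents of at most `number_of_doc_words` words.

    Assumption: a poem is never split, so a poem longer than the cap
    becomes a document of its own.
    """
    documents = []
    parts, words = [], 0
    for text in poems:
        n = len(text.split())
        # Start a new document if adding this poem would exceed the cap.
        if parts and words + n > number_of_doc_words:
            documents.append({"text": "\n".join(parts), "words": words})
            parts, words = [], 0
        parts.append(text)
        words += n
    # Flush whatever is left after the last poem.
    if parts:
        documents.append({"text": "\n".join(parts), "words": words})
    return documents
```

Each returned dictionary holds the joined text and its word count, which is one possible shape for the rows written to a CSV file.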

Example usage:

```python
poet_links = [
    {
        "name": "attar",
        "books": [
            {
                "name": "asrarname",
                "url": "https://ganjoor.net/attar/asrarname"
            },
            {
                "name": "",
                "url": "https://ganjoor.net/attar/manteghotteyr/naat"
            }
        ]
    },
    {
        "name": "saadi",
        "books": [
            {
                "name": "golestan",
                "url": "https://ganjoor.net/saadi/golestan"
            }
        ]
    }
]
details = get_poet(poet_links, directory="./poems", number_of_doc_words=800, print_details=True)
print("Details:", details)
```

The notebook session below applies the script to ten poets and merges the resulting CSV files. It starts by importing the scraping functions and loading the `persian_authors.csv` dataset:

```python
from src.scrape import *
import pandas as pd

df = pd.read_csv("/home/amir/Documents/university/projects/NLP/IR-Project/persian_authors.csv")
```

The authors in the dataset:

```python
df['author'].unique()
```

```
array(['moulavi', 'attar', 'rahi', 'ferdousi', 'saeb', 'iqbal', 'saadi',
       'nezami', 'jami', 'eraghi'], dtype=object)
```

The book names in the dataset:

```python
df['b_name'].unique()
```

```
array(['moulavi', 'manteghotteyr', 'rahi', 'ferdousi', 'saeb',
       'asrar-khodi', 'romooz-bikhodi', 'payam-mashregh', 'saadi',
       'nezami', 'jami', 'eraghi'], dtype=object)
```

The poets and books to scrape:

```python
poets = [
    {
        "name": "attar",
        "books": [
            {"name": "asrarname", "url": "https://ganjoor.net/attar/asrarname"},
        ],
    },
    {
        "name": "moulavi",
        "books": [
            {"name": "shams", "url": "https://ganjoor.net/moulavi/shams"},
        ],
    },
    {
        "name": "rahi",
        "books": [
            {"name": "all-books", "url": "https://ganjoor.net/rahi"},
        ],
    },
    {
        "name": "ferdousi",
        "books": [
            {"name": "shahname", "url": "https://ganjoor.net/ferdousi/shahname"},
        ],
    },
    {
        "name": "saeb",
        "books": [
            {"name": "divan-saeb", "url": "https://ganjoor.net/saeb/divan-saeb"},
        ],
    },
    {
        "name": "iqbal",
        "books": [
            {"name": "all-books", "url": "https://ganjoor.net/iqbal"},
        ],
    },
    {
        "name": "saadi",
        "books": [
            {"name": "golestan", "url": "https://ganjoor.net/saadi/golestan"},
            {"name": "boostan", "url": "https://ganjoor.net/saadi/boostan"},
        ],
    },
    {
        "name": "nezami",
        "books": [
            {"name": "5ganj", "url": "https://ganjoor.net/nezami/5ganj"},
        ],
    },
    {
        "name": "jami",
        "books": [
            {"name": "divanj", "url": "https://ganjoor.net/jami/divanj"},
            {"name": "7ourang", "url": "https://ganjoor.net/jami/7ourang"},
        ],
    },
    {
        "name": "eraghi",
        "books": [
            {"name": "divane", "url": "https://ganjoor.net/eraghi/divane"},
            {"name": "oshaghname", "url": "https://ganjoor.net/eraghi/oshaghname"},
        ],
    },
]
```

Scraping each poet into `datasets/raw`:

```python
for poet in poets:
    get_poet(poet, directory="datasets/raw", print_details=True)
```

The call prints lines of the form `URL : count`:

```
https://ganjoor.net//attar/asrarname/abkhsh1/sh1 : 1890
https://ganjoor.net//attar/asrarname/abkhsh2/sh1 : 1287
https://ganjoor.net//attar/asrarname/abkhsh2/sh2 : 1344
https://ganjoor.net//attar/asrarname/abkhsh3/sh4 : 842
https://ganjoor.net//attar/asrarname/abkhsh4/sh1 : 1199
https://ganjoor.net//attar/asrarname/abkhsh5/sh1 : 1339
https://ganjoor.net//attar/asrarname/abkhsh6/sh5 : 966
https://ganjoor.net//attar/asrarname/abkhsh6/sh7 : 701
https://ganjoor.net//attar/asrarname/abkhsh7/sh5 : 839
https://ganjoor.net//attar/asrarname/abkhsh7/sh7 : 764
https://ganjoor.net//attar/asrarname/abkhsh7/sh11 : 731
https://ganjoor.net//attar/asrarname/abkhsh8/sh2 : 1107
https://ganjoor.net//attar/asrarname/abkhsh9/sh2 : 811
https://ganjoor.net//attar/asrarname/abkhsh9/sh5 : 795
https://ganjoor.net//attar/asrarname/abkhsh10/sh3 : 719
https://ganjoor.net//attar/asrarname/abkhsh10/sh6 : 859
https://ganjoor.net//attar/asrarname/abkhsh11/sh2 : 776
https://ganjoor.net//attar/asrarname/abkhsh11/sh5 : 873
```

Finally, the per-poet CSV files are merged into one:

```python
import os

import pandas as pd

path = '/home/amir/Documents/university/projects/NLP/IR-Project/datasets/raw'
all_files = [os.path.join(path, file) for file in os.listdir(path) if file.endswith('.csv')]

# Read every CSV file in the directory into its own DataFrame
frames = [pd.read_csv(filename, index_col=None, header=0) for filename in all_files]

# Concatenate all dataframes into one
frame = pd.concat(frames, axis=0, ignore_index=True)

# Save the merged DataFrame to a new CSV file in the desired location.
# Note: if the working directory is the project root, this file lands in
# `path` and will be picked up if this cell is run again.
frame.to_csv('datasets/raw/persian_authors.csv', index=False)
```