# Poet Poems Scraping from Ganjoor Website

This Python script is designed to scrape poets' poems from the Ganjoor website. It provides a set of functions to extract poems, their excerpts, and related details from the website's pages, and save them into CSV files based on specified criteria.

## Functionality

The script comprises several functions:

1. **`get_poem(url)`**: This function extracts a poem from a provided URL of a webpage containing the poem. It parses the HTML content, retrieves the verses, calculates the number of verses and words, and returns the poem text along with these details.

2. **`get_book_urls(url, prefix="https://ganjoor.net")`**: This function retrieves URLs of books from a given URL. It extracts the URLs from the webpage and appends them with a provided prefix, typically the Ganjoor website URL.

3. **`get_part_urls(url, prefix="https://ganjoor.net")`**: This function extracts URLs of each poem's full version from a webpage containing a collection of poem excerpts. It appends them with a given prefix to form complete URLs.

4. **`get_poet(poet_links, directory="./result", number_of_doc_words=700, print_details=False)`**: This main function takes a list of URLs of poet pages, extracts their poems, and saves them in CSV files according to specified criteria. It iterates over the poet links, retrieves the URLs of books written by each poet, then extracts the URLs of poem excerpts from each book. It further utilizes the `get_poem()` function to extract the poems, considering a maximum number of words per document. Finally, it saves the poems into CSV files and returns a list of dictionaries containing details about the processed documents.

## Usage

To use this script:
- Provide a list of URLs of poet pages.
- Specify the directory where the CSV files will be saved.
- Optionally set the maximum number of words per document and whether to print details during processing.

Example usage:

```python
poet_links = ["https://example.com/poet1", "https://example.com/poet2"]
details = get_poet(poet_links, directory="./poems", number_of_doc_words=800, print_details=True)
print("Details:", details)


In [25]:
import requests
from bs4 import BeautifulSoup
import csv

## Function Description

This function, `get_poem(url)`, retrieves a poem from a specified URL. It extracts the verses from the webpage and calculates the number of verses and words in the poem.

### Parameters

- `url` (str): The URL of the webpage containing the poem.

### Returns

A tuple with the following elements:

1. **Poem Text** (str): The text of the poem.
2. **Number of Verses** (int): The total number of verses in the poem.
3. **Number of Words** (int): The total number of words in the poem.

### Libraries Used

- `requests`: Used for sending HTTP requests.
- `BeautifulSoup` (from `bs4`): Used for parsing HTML content.

### Example Usage

```python
poem_text, num_verses, num_words = get_poem("https://example.com/poem")
print("Poem Text:", poem_text)
print("Number of Verses:", num_verses)
print("Number of Words:", num_words)


In [26]:
def get_poem(url):
    """
    This function takes a URL of a webpage containing a poem, extracts the verses, and calculates the number of verses 
    and words in the poem.
    
    Parameters:
        url (str): The URL of the webpage containing the poem.
    
    Returns:
        tuple: A tuple containing the poem text, number of verses, and number of words in the poem.
    """
    # Import necessary libraries
    import requests
    from bs4 import BeautifulSoup
    
    # Send a GET request to the provided URL
    r = requests.get(url)
    
    # Create a BeautifulSoup object to parse the HTML content of the webpage
    soup = BeautifulSoup(r.content)
    
    # Find all <div> elements with class 'b' which typically contain the verses of the poem
    verses = soup.findAll('div', attrs={'class': 'b'})
    
    # Extract the text of each verse and join them with newline characters to form the poem text
    poem = "\n".join([b.text for b in verses])
    
    # Calculate the number of verses in the poem
    verse_num = len(verses)
    
    # Calculate the total number of words in the poem
    word_num = sum([len(verse.text.split()) or 0 for verse in verses])
    
    # Return a tuple containing the poem text, number of verses, and number of words
    return (poem, verse_num, word_num)


## Function Description

This function, `get_book_urls(url, prefix="https://ganjoor.net")`, retrieves URLs of books from a specified URL. It extracts the URLs from the webpage and appends them with the provided prefix.

### Parameters

- `url` (str): The URL of the webpage containing the book URLs.
- `prefix` (str, optional): The prefix to be appended to the extracted URLs. Defaults to `"https://ganjoor.net"`.

### Returns

A list of URLs of books.

### Libraries Used

- `requests`: Used for sending HTTP requests.
- `BeautifulSoup` (from `bs4`): Used for parsing HTML content.

### Example Usage

```python
book_urls = get_book_urls("https://example.com/books")
print("Book URLs:", book_urls)


In [27]:
def get_book_urls(url,prefix = "https://ganjoor.net"):
	r = requests.get(url)
	soup = BeautifulSoup(r.content)
	parts = soup.findAll('div',attrs={'class':'part-title-block'})
	parts=[f'{prefix}/{b.find("a")["href"]}' for b in parts]
	return parts

## Function Description

This function, `get_part_urls(url, prefix="https://ganjoor.net")`, extracts URLs of each poem's full version from a webpage containing a collection of poem excerpts. It then appends them with a given prefix to form complete URLs.

### Parameters

- `url` (str): The URL of the webpage containing the poem excerpts.
- `prefix` (str, optional): The prefix to be added to the extracted URLs. Default is `"https://ganjoor.net"`.

### Returns

A list containing the complete URLs of each poem's full version.

### Libraries Used

- `requests`: Used for sending HTTP requests.
- `BeautifulSoup` (from `bs4`): Used for parsing HTML content.

### Example Usage

```python
part_urls = get_part_urls("https://example.com/poem-excerpts")
print("Part URLs:", part_urls)


In [28]:
def get_part_urls(url, prefix="https://ganjoor.net"):
    """
    This function takes a URL of a webpage containing a collection of poem excerpts and extracts the URLs of each 
    poem's full version. It then appends them with a given prefix to form complete URLs.
    
    Parameters:
        url (str): The URL of the webpage containing the poem excerpts.
        prefix (str, optional): The prefix to be added to the extracted URLs. Default is "https://ganjoor.net".
    
    Returns:
        list: A list containing the complete URLs of each poem's full version.
    """
    # Import necessary libraries
    import requests
    from bs4 import BeautifulSoup
    
    # Send a GET request to the provided URL
    r = requests.get(url)
    
    # Create a BeautifulSoup object to parse the HTML content of the webpage
    soup = BeautifulSoup(r.content)
    
    # Find all <p> elements with class 'poem-excerpt' which typically contain links to poem excerpts
    parts = soup.findAll('p', attrs={'class': 'poem-excerpt'})
    
    # Extract the URLs from each <a> tag found within <p> elements and append them with the prefix
    parts = [f'{prefix}/{b.find("a")["href"]}' for b in parts]
    
    # Return the list of complete URLs
    return parts


## Function Description

This function, `get_poet(poet_links, directory="./result", number_of_doc_words=700, print_details=False)`, extracts poems from a list of URLs of poet pages and saves them in CSV files based on specified criteria.

### Parameters

- `poet_links` (list): A list of URLs of poet pages.
- `directory` (str, optional): The directory where the CSV files will be saved. Default is `"./result"`.
- `number_of_doc_words` (int, optional): The maximum number of words per document. Default is `700`.
- `print_details` (bool, optional): Whether to print details during processing. Default is `False`.

### Returns

A list of dictionaries containing details about the processed documents.

### Libraries Used

- `os`: Used for operating system-related functions.
- `csv`: Used for reading and writing CSV files.
- `get_book_urls`: A function to retrieve URLs of books from a webpage.
- `get_part_urls`: A function to retrieve URLs of poem excerpts from a webpage.
- `get_poem`: A function to retrieve the text, number of verses, and number of words of a poem from a webpage.

### Example Usage

```python
poet_links = ["https://example.com/poet1", "https://example.com/poet2"]
details = get_poet(poet_links, directory="./poems", number_of_doc_words=800, print_details=True)
print("Details:", details)


In [29]:
import os
import csv

def get_poet(poet_links, directory="./result", number_of_doc_words=700, print_details=False):
    """
    This function takes a list of URLs of poet pages, extracts their poems, and saves them in CSV files according to 
    specified criteria.
    
    Parameters:
        poet_links (list): A list of URLs of poet pages.
        directory (str, optional): The directory where the CSV files will be saved. Default is "./result".
        number_of_doc_words (int, optional): The maximum number of words per document. Default is 700.
        print_details (bool, optional): Whether to print details during processing. Default is False.
    
    Returns:
        list: A list of dictionaries containing details about the processed documents.
    """
    # Import necessary libraries
    import os
    import csv
    
    # Define the author name based on the first poet link
    a_name = poet_links[0][20:].split("/")[0]
    
    # Define the file name for the CSV file
    f_name = f"{directory}/{a_name}.csv"
    
    # Create the result directory if it doesn't exist
    try:
        os.mkdir(directory)
    except:
        pass
    
    # Open the CSV file for writing
    with open(f_name, 'w', newline='') as f_all:
        # Create a CSV writer object
        w = csv.DictWriter(f_all, ['author', 'b_name', 'p_name', 'text'])
        # Write the header row
        w.writeheader()

        # Initialize variables for document word count and details
        n_doc = 0
        n_doc_words = 0
        doc = ""
        details = []
        
        # Iterate over each poet link
        for poet_link in poet_links:
            # Get the URLs of books written by the poet
            book_urls = get_book_urls(poet_link)

            # Iterate over each book URL
            for b_url in book_urls:
                n_doc_words = 0
                doc = ""
                
                # Get the URLs of parts (poem excerpts) in the book
                part_urls = get_part_urls(b_url)
                
                # Iterate over each part URL
                for url in part_urls:
                    # Get the poem text, number of verses, and number of words
                    poem, verse_num, word_num = get_poem(url)
                    
                    # Append poem text to the document and update word count
                    doc += f"\n{poem}"
                    n_doc_words += word_num
                    
                    # Check if document word count exceeds the threshold
                    if n_doc_words > number_of_doc_words:
                        # Optionally print details
                        print_details and print(f"{url} : {n_doc_words}")
                        
                        # Write the document details to the CSV file
                        w.writerow({
                            "author": url[20:].split("/")[0],
                            "b_name": url[20:].split("/")[1],
                            "p_name": url[20:].split("/")[2],
                            "text": doc
                        })
                        
                        # Append details to the list
                        details.append({
                            "author": url[20:].split("/")[0],
                            "b_name": url[20:].split("/")[1],
                            "n_doc_words": n_doc_words,
                            "n_doc": n_doc
                        })
                        
                        # Reset document word count and content
                        n_doc_words = 0
                        doc = ""
                        n_doc += 1
                        
                        # Check if the maximum number of documents has been reached
                        if n_doc > 30:
                            return details
        
        return details


In [30]:
# get_poet(["https://ganjoor.net/shahriar"] , print_details=True)

In [31]:
poets = [
    # ["https://ganjoor.net/bahar"],
    # ["https://ganjoor.net/iqbal"],
    ["https://ganjoor.net/attar/manteghotteyr","https://ganjoor.net/attar/elahiname","https://ganjoor.net/attar/asrarname"]

]

In [32]:
for poet in poets :
    get_poet(poet , print_details=True)

https://ganjoor.net//attar/manteghotteyr/touhid/sh1 : 817
https://ganjoor.net//attar/manteghotteyr/touhid/sh2 : 2569
https://ganjoor.net//attar/manteghotteyr/naat/sh1 : 1731
https://ganjoor.net//attar/manteghotteyr/taassob/sh3 : 818
https://ganjoor.net//attar/manteghotteyr/taassob/sh8 : 895
https://ganjoor.net//attar/manteghotteyr/aghazm/sh1 : 1467
https://ganjoor.net//attar/manteghotteyr/porsesh/sh2 : 840
https://ganjoor.net//attar/manteghotteyr/hodhod/sh2 : 5383
https://ganjoor.net//attar/manteghotteyr/azm-rah/sh2 : 901
https://ganjoor.net//attar/manteghotteyr/ozr-morghan/sh3 : 1265
https://ganjoor.net//attar/manteghotteyr/ozr-morghan/sh8 : 848
https://ganjoor.net//attar/manteghotteyr/ozr-morghan/sh12 : 725
https://ganjoor.net//attar/manteghotteyr/ozr-morghan/sh16 : 971
https://ganjoor.net//attar/manteghotteyr/ozr-morghan/sh21 : 775
https://ganjoor.net//attar/manteghotteyr/ozr-morghan/sh24 : 793
https://ganjoor.net//attar/manteghotteyr/ozr-morghan/sh28 : 815
https://ganjoor.net//atta

In [33]:
# get_poet(["https://ganjoor.net/jami/divanj/fateha-shabab","https://ganjoor.net/jami/divanj/khatema-hayat"])