**Credits: ChatGPT, Stack Overflow, bs4 documentation**

**How did I make this?**

I first asked ChatGPT for an answer, then googled the parts I was interested in, and finally prompted ChatGPT again for a better response.


#### Installing BeautifulSoup
```bash
    apt-get install python3-bs4    # system-wide install (Debian/Ubuntu)
```
```bash
    poetry add beautifulsoup4    # install into a Poetry project (the import name is still bs4)
```

#### Parsers for Beautiful Soup

1. **`html.parser`**:
   - **Built-in Python parser** (no extra installation required).
   - **Best for lightweight projects** with simple, well-structured HTML.
   - **Pros**: Quick and easy to set up, as it comes with Python.
   - **Cons**: Less tolerant of broken or messy HTML compared to other parsers.

2. **`lxml`**:
   - **Fastest parser**; the underlying `lxml` library also supports **XPath** and **XSLT**, although Beautiful Soup's own API does not expose them.
   - **Requires installation** of the `lxml` library (may need additional dependencies on some systems).
   - **Pros**: High speed and memory efficiency; handles large HTML documents well.
   - **Cons**: Not as tolerant of malformed HTML as `html5lib`.

3. **`html5lib`**:
   - **HTML5-compliant parser** that strictly follows HTML5 specifications.
   - **Best for messy or broken HTML**; it’s very lenient and similar to how browsers interpret HTML.
   - **Pros**: Highly tolerant of errors and imperfections, ensuring well-formed output.
   - **Cons**: Slower than `lxml` and requires additional dependencies.

#### Summary of Parser Choices

| Parser        | Pros                                        | Cons                              | Best For                      |
|---------------|---------------------------------------------|-----------------------------------|-------------------------------|
| `html.parser` | No dependencies, good for simple HTML       | Less error-tolerant               | Lightweight projects          |
| `lxml`        | Fast, advanced features (XPath, XSLT)       | Requires installation             | Large, well-formed HTML       |
| `html5lib`    | Highly tolerant, follows HTML5 spec         | Slower, requires dependencies     | Messy or broken HTML          |


For more on parsers, see [Differences between parsers](https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.html#differences-between-parsers).
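
Because `lxml` and `html5lib` are optional installs, it can help to check what is available before choosing a parser. Below is a minimal sketch using only the standard library; the helper name `pick_parser` and the preference order (speed first, then leniency) are assumptions for illustration, not part of bs4.

```python
from importlib.util import find_spec


def pick_parser():
    """Return the name of the preferred parser that can be imported here."""
    # Assumed preference order: lxml for speed, then html5lib for leniency.
    for parser_name in ("lxml", "html5lib"):
        if find_spec(parser_name) is not None:
            return parser_name
    # html.parser ships with Python, so it is always available.
    return "html.parser"


parser = pick_parser()
print(f"Using parser: {parser}")
```

The returned string can be passed as the second argument to `BeautifulSoup(...)`. Asking for a parser that is not installed makes bs4 raise `FeatureNotFound`, so a check like this avoids that error on a fresh environment.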


In [72]:
from bs4 import BeautifulSoup, Tag
from typing import List, Dict
import requests
import re
import pandas as pd
from rich.pretty import pprint
import os


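url = 'https://myanimelist.net/topanime.php'
response : requests.Response = requests.get(url, timeout=10)  # do not wait forever on a slow server
response.raise_for_status()  # stop early on HTTP errors (4xx/5xx)
anime_soup : BeautifulSoup = BeautifulSoup(response.text, 'lxml')  # anime_soup is the parsed tree of tags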


**Objects in BeautifulSoup**
- Tag
- NavigableString
- BeautifulSoup
- Comment

**BeautifulSoup Object**

A BeautifulSoup object is created when you pass an HTML or XML document to its constructor. This object represents the entire structure of the web page or document you’re working with.
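
To make these object types concrete, here is a small sketch that parses a made-up one-line document and inspects what comes back. The HTML string is invented for illustration, and `html.parser` is used only so the snippet runs without extra installs.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# Made-up snippet containing text, a nested tag and a comment
html = "<p class='intro'>Hello <b>world</b><!-- a note --></p>"
soup = BeautifulSoup(html, "html.parser")

p = soup.p  # first <p> tag in the document
print(type(soup).__name__, type(p).__name__, p.name, p["class"])

# The children of <p> are a text node, a tag and a comment, in that order
kinds = [type(child).__name__ for child in p.children]
print(kinds)
```

Text nodes are `NavigableString` objects in bs4, and `Comment` is a subclass of `NavigableString`, so a comment behaves like text that happens to be marked as a comment.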

In [None]:
# Making a directory and writing the HTML to a file
try:
    os.makedirs('templates', exist_ok=True)  # no error if the directory already exists
    with open('templates/top_anime.html', 'w', encoding='utf-8') as file:
        file.write(anime_soup.prettify())
except OSError as e:
    print(f"Error writing to file: {e}")

# Reading the HTML file and making soup
try:
    # The with block closes the file once the soup has been built
    with open('templates/top_anime.html', encoding='utf-8') as file:
        soup : BeautifulSoup = BeautifulSoup(file, 'lxml')
except OSError as e:
    print(f"Error reading the HTML file: {e}")

# Parsing the HTML
rank_animes_dict : List[Dict[str,str|None]] = []  # defined before the try so the print below always works
try:
    table : Tag = soup.find('table', class_='top-ranking-table')
    rank_animes : List[Tag] = table.find_all('tr', class_='ranking-list')

    for anime in rank_animes:
        rank_tag : Tag = anime.find('td', class_='rank ac')
        title_tag : Tag = anime.find('td', class_='title')
        # The nested tags may be missing, so look them up before reading their text
        name_tag = title_tag.find('h3', class_='anime_ranking_h3') if title_tag else None
        info_tag = title_tag.find('div', class_='information di-ib mt4') if title_tag else None

        rank : str|None = rank_tag.text.strip() if rank_tag else None
        title : str|None = name_tag.text.strip() if name_tag else None
        ep_text : str|None = info_tag.contents[0].text.strip() if info_tag and info_tag.contents else None
        ep : str|None = re.sub(r'\D', '', ep_text) if ep_text else None  # keep only the digits

        rank_animes_dict.append(
            {
                'Rank': rank,
                'Title': title,
                'Episodes': ep
            }
        )
except Exception as e:
    print(f"Error parsing the HTML: {e}")

print(rank_animes_dict)
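
The same extraction can be tried offline on a small fragment. The sketch below uses CSS selectors instead of `find`. `SAMPLE_HTML` is a made-up, simplified imitation of the class names used above, not the real page markup, and `text_or_none` and `parse_rows` are hypothetical helper names. One reason to consider selectors: `class_='rank ac'` matches only that exact attribute string, while `td.rank.ac` matches the two classes in any order.

```python
import re

from bs4 import BeautifulSoup

# Made-up markup that imitates the class names used above (assumption)
SAMPLE_HTML = """
<table class="top-ranking-table">
  <tr class="ranking-list">
    <td class="rank ac"><span>1</span></td>
    <td class="title">
      <h3 class="anime_ranking_h3"><a href="#">Example Show</a></h3>
      <div class="information di-ib mt4">TV (28 eps)<br>Sep 2023 - Mar 2024</div>
    </td>
  </tr>
  <tr class="ranking-list">
    <td class="ac rank"><span>2</span></td>
    <td class="title">
      <h3 class="anime_ranking_h3"><a href="#">Another Show</a></h3>
    </td>
  </tr>
</table>
"""


def text_or_none(tag):
    """Return the stripped text of a tag, or None when the tag is missing."""
    return tag.get_text(strip=True) if tag is not None else None


def parse_rows(html):
    """Turn each ranking row into a dict with Rank, Title and Episodes."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for row in soup.select("table.top-ranking-table tr.ranking-list"):
        info = row.select_one("td.title div.information")
        # The first text node of the info div carries the episode count here
        first_text = info.find(string=True) if info is not None else None
        digits = re.sub(r"\D", "", first_text) if first_text else ""
        rows.append({
            "Rank": text_or_none(row.select_one("td.rank.ac")),
            "Title": text_or_none(row.select_one("td.title h3.anime_ranking_h3")),
            "Episodes": digits or None,  # None when no digits were found
        })
    return rows


rows = parse_rows(SAMPLE_HTML)
print(rows)
```

The second sample row deliberately swaps the class order and omits the information div, so it exercises both the order-independent selector and the missing-tag path.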


In [74]:
try:
    df = pd.DataFrame(rank_animes_dict)
except Exception as e:
    print(f"Error creating DataFrame: {e}")
else:
    # Export only when the DataFrame was created, so df is always defined here
    try:
        os.makedirs('data', exist_ok=True)
        df.to_json('data/top_anime.json', orient='records', indent=4)
        df.to_excel('data/top_anime.xlsx', index=False)  # needs an Excel engine such as openpyxl
    except Exception as e:
        print(f"Error writing to file: {e}")
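
If pandas is not available, the JSON export can also be done with the standard library alone. This is a rough alternative sketch rather than what the notebook does: `save_records` is a hypothetical helper, `sample_rows` is made-up data in the same shape as `rank_animes_dict`, and a temporary directory is used so nothing is left on disk.

```python
import json
import tempfile
from pathlib import Path


def save_records(records, path):
    """Write a list of dicts to a JSON file, creating parent folders as needed."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        # ensure_ascii=False keeps non-English titles readable in the file
        json.dump(records, f, indent=4, ensure_ascii=False)
    return path


# Made-up rows in the same shape as rank_animes_dict
sample_rows = [
    {"Rank": "1", "Title": "Example Show", "Episodes": "28"},
    {"Rank": "2", "Title": "Another Show", "Episodes": None},
]

with tempfile.TemporaryDirectory() as tmp:
    out_path = save_records(sample_rows, Path(tmp) / "data" / "top_anime.json")
    # Read the file back to confirm the round trip keeps the data intact
    loaded = json.loads(out_path.read_text(encoding="utf-8"))

print(loaded == sample_rows)
```

The output is a JSON array of objects, which is the same overall shape that `orient='records'` produces.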