## Sending an HTTP request with *requests*

In [3]:
import requests
import os

In [4]:
# Define URL
URL = "https://en.wikipedia.org/wiki/Hate_crime"

In [5]:
# Send an HTTP GET request to the URL
response = requests.get(URL)
response

<Response [200]>

In [None]:
# Check the status of the response
...
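
One possible way to read a status code such as the `200` shown above is sketched below. The helper name `describe_status` is hypothetical (it is not part of *requests*); it only uses Python's standard `http.HTTPStatus` table and treats the 2xx range as success.

```python
from http import HTTPStatus


def describe_status(status_code):
    """Return a short description of an HTTP status code (hypothetical helper)."""
    try:
        phrase = HTTPStatus(status_code).phrase
    except ValueError:
        # The code is not in the standard library's table of known statuses
        phrase = "Unknown status"
    outcome = "success" if 200 <= status_code < 300 else "not successful"
    return f"{status_code} {phrase} ({outcome})"


print(describe_status(200))  # 200 OK (success)
print(describe_status(404))  # 404 Not Found (not successful)
```

In the notebook, `describe_status(response.status_code)` would describe the response received above. *requests* also offers `response.ok` and `response.raise_for_status()` for the same kind of check.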

In [6]:
# Get the content
content = response.content
content



## 🍜 Parsing HTML with BeautifulSoup 🍜

**Read the code and discuss the questions below.**

```python
from bs4 import BeautifulSoup

html = """<html><head></head><body>
<h1>Hamlet</h1>
<ul class="cast"> 
  <li>Hamlet</li>
  <li>Polonius</li>
  <li>Ophelia</li>
  <li>Claudius</li>
</ul>
<ul class="authors">
  <li>William Shakespeare</li>
</ul>
</body></html>"""

soup = BeautifulSoup(html, "html.parser")

for ul in soup.find_all('ul'):
    if "cast" in ul.get('class'):
        for item in ul.find_all('li'):
            print(item.get_text(), end=", ")
        print()
```

+ What is the data type of the HTML document?
+ What does the find_all() function return?
+ What does the argument of the find_all() function refer to?
+ What does the argument of the get() function refer to?
+ What does the get_text() function extract?
+ How would you extract the title of the play?

In [2]:
from bs4 import BeautifulSoup

In [None]:
# Let's create a soup from content
soup_content = BeautifulSoup(content, "html.parser")
soup_content

In [None]:
headings = soup_content.find_all(['h1', 'h2', 'h3'])
for heading in headings:
    print(heading.get_text())

## Doing it with the Wikipedia API

In [9]:
import wikipediaapi

In [8]:


# Wikipedia API endpoint
endpoint = "https://en.wikipedia.org/w/api.php"

# Parameters for the API request
params = {
    "action": "query",
    "format": "json",
    "prop": "extracts",
    "titles": "Hate crime",
    "explaintext": True,  # Get plain text extract
    "redirects": 1,       # Follow redirects if the page exists under another title
}

# Send the request to the Wikipedia API
response = requests.get(endpoint, params=params)
response.raise_for_status()  # Stop here if the API returned an HTTP error

# Parse the response JSON
data = response.json()

# Extract the page content
pages = data.get("query", {}).get("pages", {})
for page_id, page_content in pages.items():
    if page_id != "-1":  # Check if page exists
        print("Title:", page_content.get("title", ""))
        print("\nContent:\n", page_content.get("extract", ""))
    else:
        print("Page not found.")

Title: Hate crime

Content:
 A hate crime (also known a bias crime) is crime where a perpetrator targets a victim because of their physical appearance or perceived membership of a certain social group.
Examples of such groups can include, and are almost exclusively limited to race, ethnicity, disability, language, nationality, physical appearance, political views and/or affiliation, age, religion, sex, gender identity, and/or sexual orientation. Non-criminal actions that are motivated by these reasons are often called "bias incidents".
Incidents may involve physical assault, homicide, damage to property, bullying, harassment, verbal abuse (which includes slurs) or insults, mate crime, or offensive graffiti or letters (hate mail).
In the criminal law of the United States, the Federal Bureau of Investigation (FBI) defines a hate crime as a traditional offense like murder, arson, or vandalism with an added element of bias. Hate itself is not a hate crime but committing a crime motivated b
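
The nested lookup in the cell above can be tried without a network connection. The following is a minimal sketch: `extract_pages` is a hypothetical helper, and the two dictionaries are hand-written sample data (not real API output) shaped like the JSON that the code above reads. Checking for a `missing` key in addition to the `"-1"` page id is an assumption about how absent pages are flagged.

```python
def extract_pages(data):
    """Map each page title to its extract, or to None if the page is missing."""
    pages = data.get("query", {}).get("pages", {})
    results = {}
    for page_id, page in pages.items():
        title = page.get("title", "")
        if page_id == "-1" or "missing" in page:
            results[title] = None  # Page not found
        else:
            results[title] = page.get("extract", "")
    return results


# Hand-written sample data, only for illustration
found = {"query": {"pages": {"101": {"pageid": 101, "ns": 0,
                                     "title": "Example page",
                                     "extract": "Example text."}}}}
missing = {"query": {"pages": {"-1": {"ns": 0, "title": "No such page",
                                      "missing": ""}}}}

print(extract_pages(found))    # {'Example page': 'Example text.'}
print(extract_pages(missing))  # {'No such page': None}
```

Returning a dictionary instead of printing inside the loop keeps the extracted text available for later cells without relying on the loop variable.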

In [11]:
import re

# Extract the content (page_content still refers to the last page from the loop above)
content_text = page_content['extract']

# Define the pattern to match the sections and everything after them
pattern = r"(== See also ==|== References ==|== External links ==).*"

# Use re.sub to remove the matched sections
cleaned_content = re.sub(pattern, '', content_text, flags=re.DOTALL)

print(cleaned_content)

A hate crime (also known a bias crime) is crime where a perpetrator targets a victim because of their physical appearance or perceived membership of a certain social group.
Examples of such groups can include, and are almost exclusively limited to race, ethnicity, disability, language, nationality, physical appearance, political views and/or affiliation, age, religion, sex, gender identity, and/or sexual orientation. Non-criminal actions that are motivated by these reasons are often called "bias incidents".
Incidents may involve physical assault, homicide, damage to property, bullying, harassment, verbal abuse (which includes slurs) or insults, mate crime, or offensive graffiti or letters (hate mail).
In the criminal law of the United States, the Federal Bureau of Investigation (FBI) defines a hate crime as a traditional offense like murder, arson, or vandalism with an added element of bias. Hate itself is not a hate crime but committing a crime motivated by bias against one or more of
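
To see why the `re.DOTALL` flag matters in the cell above, here is a small sketch on a made-up sample string (not the real article text). With the flag, `.` also matches newlines, so everything from the first matched heading to the end of the text is removed. Without it, only the rest of each heading line is removed.

```python
import re

# Made-up sample text that imitates the plain-text section headings
sample = (
    "Intro paragraph.\n\n"
    "== History ==\nSome history.\n\n"
    "== See also ==\nRelated topic\n\n"
    "== References ==\nA reference\n"
)

pattern = r"(== See also ==|== References ==|== External links ==).*"

# With DOTALL: cut from the first matching heading to the end of the text
cleaned = re.sub(pattern, "", sample, flags=re.DOTALL).rstrip()

# Without DOTALL: only the heading lines themselves are removed
without_dotall = re.sub(pattern, "", sample)

print(cleaned)
print("Related topic" in without_dotall)  # True
```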

In [12]:
# Define the file path
file_path = 'hate_crime.txt'

# Write the cleaned content to the file (UTF-8 so non-ASCII characters are saved reliably)
with open(file_path, 'w', encoding='utf-8') as file:
    file.write(cleaned_content)

print(f"Content has been written to {file_path}")

Content has been written to hate_crime.txt
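
A quick way to confirm that text survives the round trip to disk is to read it back. The sketch below uses a temporary directory and a made-up sample string, so it does not touch `hate_crime.txt`. Passing `encoding="utf-8"` explicitly is a choice made here so that the result does not depend on the platform's default encoding.

```python
import os
import tempfile

# Made-up sample text with non-ASCII characters
text = "Café, naïve, résumé"

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.txt")

    # Write the text, then read it back with the same encoding
    with open(sample_path, "w", encoding="utf-8") as file:
        file.write(text)
    with open(sample_path, "r", encoding="utf-8") as file:
        read_back = file.read()

print(read_back == text)  # True
```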
