# Web Scraping


**urllib3** is a powerful HTTP client library for Python that makes it easy to perform HTTP requests programmatically. It handles HTTP headers, retries, redirects, and other low-level details, and it supports SSL verification, connection pooling, and proxying. The examples in this notebook use the standard library's `urllib.request` module rather than urllib3 itself.

**BeautifulSoup** allows you to parse HTML and XML documents. Using its API, you can easily navigate the HTML document tree and extract tags, meta titles, attributes, text, and other content. BeautifulSoup is also known for its robust error handling.

**MechanicalSoup** automates the interaction between a web browser and a website. It provides a high-level API for web scraping that simulates human behavior. With MechanicalSoup, you can fill in HTML forms, click buttons, and work with page elements like a real user.

**Requests** is a simple yet powerful Python library for making HTTP requests. It is designed to be easy to use and intuitive, with a clean and consistent API. With Requests, you can easily send GET and POST requests, and handle cookies, authentication, and other HTTP features. It is also widely used in web scraping due to its simplicity and ease of use.

**Selenium** allows you to automate web browsers such as Chrome, Firefox, and Safari and simulate human interaction with websites. You can click buttons, fill out forms, scroll pages, and perform other actions. It is also used for testing web applications and automating repetitive tasks.

**Pandas** allows you to store and manipulate data in various formats, including CSV, Excel, JSON, and SQL databases. Using Pandas, you can easily clean, transform, and analyze data extracted from websites.


In [None]:
!pip install selenium



In [None]:
import re
from urllib.request import urlopen

import pandas as pd
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.keys import Keys

# Common Web Scraping Methods

*  Parse website data using string methods and regular expressions
*  Parse website data using an HTML parser
*  Interact with forms and other website components

# urllib

In [36]:
from urllib.request import urlopen
import urllib.request

To extract the HTML from the page, first use the HTTPResponse object’s .read() method, which returns a sequence of bytes. Then use .decode() to decode the bytes to a string using UTF-8:

In [37]:
# Specify the URL of the webpage
url = "https://www.kaggle.com/"

# Send a request to the webpage and get the response
response = urllib.request.urlopen(url)

# Read the HTML content as bytes
html_bytes = response.read()

# Decode the bytes to a string using UTF-8 encoding
html_content = html_bytes.decode('utf-8')

# Now you have the HTML content as a string
print(html_content)



<!DOCTYPE html>
<html lang="en">

<head>
  <title>Kaggle: Your Machine Learning and Data Science Community</title>
  <meta charset="utf-8" />
    <meta name="robots" content="index, follow" />
  <meta name="description" content="Kaggle is the world&#x2019;s largest data science community with powerful tools and resources to help you achieve your data science goals." />
  <meta name="turbolinks-cache-control" content="no-cache" />
    <meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=5.0, minimum-scale=1.0">
  <meta name="theme-color" content="#008ABC" />
  <script nonce="FDBOrchT596&#x2B;PBjKY5dn8g==" type="text/javascript">
    window["pageRequestStartTime"] = 1712488695964;
    window["pageRequestEndTime"] = 1712488695970;
    window["initialPageLoadStartTime"] = new Date().getTime();
  </script>
  <script nonce="FDBOrchT596&#x2B;PBjKY5dn8g==" id="gsi-client" src="https://accounts.google.com/gsi/client" async defer></script>
  <s
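
Hard-coding UTF-8 works for many pages, but a server can declare a different encoding in its `Content-Type` header. The following is a minimal sketch, not a definitive implementation: it reads the declared charset with `response.headers.get_content_charset()` and falls back to UTF-8. The helper names `decode_html` and `fetch_html` are hypothetical and not part of `urllib`.

```python
from typing import Optional
from urllib.request import urlopen


def decode_html(raw: bytes, charset: Optional[str] = None) -> str:
    """Decode raw bytes, falling back to UTF-8 when no charset is given."""
    # errors="replace" swaps undecodable bytes for U+FFFD instead of raising
    return raw.decode(charset or "utf-8", errors="replace")


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Download a page and decode it with the charset the server declares."""
    # The context manager closes the connection even if reading fails
    with urlopen(url, timeout=timeout) as response:
        charset = response.headers.get_content_charset()
        return decode_html(response.read(), charset)


# Offline demonstration with the same text in two encodings
print(decode_html("café".encode("utf-8")))
print(decode_html("café".encode("latin-1"), "latin-1"))
```

Calling `fetch_html("https://www.kaggle.com/")` would then replace the separate `read()` and `decode()` steps shown above.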

Once you have obtained the HTML content of a webpage as text, you can extract information from it using various techniques. Here are a couple of common methods:

1. **Using String Methods and Regular Expressions**:
   - You can use string manipulation methods such as `find()`, `split()`, and `replace()` to locate specific elements or patterns within the HTML content.
   - Regular expressions (`re` module) can be employed to search for complex patterns in the HTML content and extract desired information.

2. **Using an HTML Parser**:
   - Libraries like BeautifulSoup provide powerful tools for parsing HTML content in a structured way.
   - BeautifulSoup allows you to navigate and search through the HTML document using methods like `find()`, `find_all()`, and CSS selectors, making it easier to extract specific elements or data.

String methods and regular expressions are suitable for simple extraction tasks or when dealing with well-structured HTML content. On the other hand, using an HTML parser like BeautifulSoup is more robust and recommended for complex HTML structures or when you need to handle malformed HTML.

# HTML


1. **find()**:  
   This method returns the index of the first occurrence of a substring within a string, or `-1` if the substring is not found. In the example, `html_content.find('<body>')` returns the index of the start of the `<body>` tag in the HTML content.

2. **len()**:  
   This built-in Python function returns the length of an object. In this context, `len('<body>')` returns the length of the `<body>` tag, which is the number of characters in the string `<body>`.

3. **+**:  
   This operator adds two numbers (it also concatenates strings, but here both operands are integers). In the example, `html_content.find('<body>') + len('<body>')` gives the index just past the opening `<body>` tag, which is where the body content starts.

4. **replace()**:  
   This method is used to replace occurrences of a specified substring within a string with another substring. In the example, `body_content.replace('<h1>', '')` removes all occurrences of the `<h1>` tags from the body content.

5. **strip()**:  
   This method is used to remove leading and trailing whitespace characters from a string. In the example, `text_content.strip()` removes any leading or trailing whitespace from the extracted text content.

These methods are used together to extract text content from HTML using string manipulation techniques. However, this approach is not robust for handling all cases of HTML content, especially those with complex structures. Using a dedicated HTML parsing library like BeautifulSoup is recommended for more reliable and flexible HTML parsing.
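
One concrete pitfall: `find()` returns `-1` when the substring is absent, for example when a real page writes `<body class="...">` instead of a bare `<body>`. Adding `len('<body>')` to `-1` gives `5`, so the slice silently starts near the top of the document. A minimal sketch of a guard, using a hypothetical helper named `extract_between`:

```python
from typing import Optional


def extract_between(text: str, start_marker: str, end_marker: str) -> Optional[str]:
    """Return the text between two markers, or None if either is missing."""
    start = text.find(start_marker)
    if start == -1:
        return None  # opening marker not found
    start += len(start_marker)
    # Search for the end marker only after the opening marker
    end = text.find(end_marker, start)
    if end == -1:
        return None  # closing marker not found
    return text[start:end]


# Hypothetical sample: the <body> tag carries an attribute
sample = "<html><body class='home'><h1>Hello</h1></body></html>"
print(extract_between(sample, "<body>", "</body>"))  # None
print(extract_between(sample, "<h1>", "</h1>"))      # Hello
```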

In [26]:
html_content = """
<html>
<head>
<title>Example Page</title>
</head>
<body>
<h1>Welcome to Example Page</h1>
<p>This is a paragraph.</p>
<p>This is another paragraph.</p>
</body>
</html>
"""

In [40]:
# Index just past the opening <body> tag (find() returns the first occurrence)
start_index = html_content.find('<body>') + len('<body>')
start_index

57

In [41]:
# Index of the first occurrence of the closing </body> tag
end_index = html_content.find('</body>')
end_index

153

In [44]:
# Extract the content between <body> and </body>
body_content = html_content[start_index:end_index]
body_content

'\n<h1>Welcome to Example Page</h1>\n<p>This is a paragraph.</p>\n<p>This is another paragraph.</p>\n'

In [45]:
# Remove the remaining HTML tags using string replace
text_content = body_content.replace('<h1>', '').replace('</h1>', '').replace('<p>', '').replace('</p>', '')
# Print the extracted text content
print(text_content.strip())

Welcome to Example Page
This is a paragraph.
This is another paragraph.


# Extract Text From HTML With String Methods

In [48]:
# HTML content
html_content = """
<html>
<head>
<title>Example Page</title>
</head>
<body>
<h1>Welcome to Example Page</h1>
<p>This is a paragraph.</p>
<p>This is another paragraph.</p>
</body>
</html>
"""

# Find the start and end index of the body content
start_index = html_content.find('<body>') + len('<body>')
end_index = html_content.find('</body>')

# Extract the content between <body> and </body>
body_content = html_content[start_index:end_index]

# Remove HTML tags using string manipulation
text_content = ""
inside_tag = False
for char in body_content:
    if char == '<':
        inside_tag = True
    elif char == '>':
        inside_tag = False
    elif not inside_tag:
        text_content += char

# Print the extracted text content
print(text_content.strip())


Welcome to Example Page
This is a paragraph.
This is another paragraph.


In [49]:
title_index = html.find("<title>")
title_index

51

In [50]:
url = "https://www.kaggle.com/"

# Send a request to the webpage and get the response
response = urlopen(url)

# Read the HTML content as bytes and decode it to a string using UTF-8 encoding
html = response.read().decode("utf-8")

# Find the start and end index of the title tag
start_index = html.find("<title>") + len("<title>")
end_index = html.find("</title>")

# Extract the content between the title tags
title = html[start_index:end_index]

# Print the extracted title
print(title)

Kaggle: Your Machine Learning and Data Science Community


# Regular Expressions

Regular expressions, or regexes for short, are sequences of characters that define a search pattern. They are widely used in programming for pattern matching and string manipulation, and Python supports them through the standard library's `re` module. Here's a brief overview of working with regular expressions:

1. **Pattern Creation**: Regular expressions are used to define patterns that you want to search for within a string. These patterns can include literals, metacharacters, and quantifiers.

2. **Metacharacters**: Metacharacters are special characters in regex that have a specific meaning. Some common metacharacters include:
   - `.` : Matches any single character except newline.
   - `^` : Anchors the match to the start of the string.
   - `$` : Anchors the match to the end of the string.
   - `\d` : Matches any digit character (equivalent to `[0-9]` for ASCII text).
   - `\w` : Matches any word character: a letter, a digit, or an underscore (equivalent to `[a-zA-Z0-9_]` for ASCII text).
   - `\s` : Matches any whitespace character.
   - `[...]` : Matches any single character within the brackets.
   - `|` : Acts as an OR operator.

3. **Quantifiers**: Quantifiers specify how many times a character or group of characters can occur in the pattern. Some common quantifiers include:
   - `*` : Matches zero or more occurrences of the preceding character.
   - `+` : Matches one or more occurrences of the preceding character.
   - `?` : Matches zero or one occurrence of the preceding character.
   - `{n}` : Matches exactly n occurrences of the preceding character.
   - `{n,}` : Matches at least n occurrences of the preceding character.
   - `{n,m}` : Matches between n and m occurrences of the preceding character.

4. **Matching**: Once you have defined a regex pattern, you can use it to search for matches within a string. This is typically done using functions like `re.search()` or `re.findall()` in Python's `re` module.

5. **Replacement**: Regular expressions can also be used for string replacement. You can use functions like `re.sub()` to replace matches of a pattern with a specified replacement string.

Here's a simple example of using regular expressions in Python:

```python
import re

# Define a regex pattern
pattern = r'\b\d{3}-\d{2}-\d{4}\b'  # Matches US social security numbers

# Define a string to search
text = "John's SSN is 123-45-6789 and Mary's is 987-65-4321."

# Search for matches of the pattern in the string
matches = re.findall(pattern, text)

# Print the matches
print(matches)
```
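
The matching and replacement steps from the overview can be sketched in the same way. This is an illustrative example that reuses the sample text above; the capture group and the masking format are assumptions chosen for the demonstration.

```python
import re

text = "John's SSN is 123-45-6789 and Mary's is 987-65-4321."

# Same pattern as above, with a capture group around the last four digits
pattern = r"\b\d{3}-\d{2}-(\d{4})\b"

# re.search() returns a match object for the first match, or None
first = re.search(pattern, text)
if first:
    print(first.group(0))  # 123-45-6789

# re.sub() replaces every match; \1 refers to the captured last four digits
masked = re.sub(pattern, r"XXX-XX-\1", text)
print(masked)
```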

The first argument of `re.findall()` is the regular expression that you want to match, and the second argument is the string to test.

In [53]:
re.findall("ab*c", "abcd")

['abc']

In [55]:
re.findall("ab*c", "acc")

['ac']

In [56]:
re.findall("ab*c", "abcac")

['abc', 'ac']

In [57]:
re.findall("ab*c", "abdc")

[]
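
Pattern matching is case-sensitive by default, which matters for HTML because tag names can appear in either case. A small sketch of the optional third argument to `re.findall()`, using the same `ab*c` pattern; the input strings are illustrative.

```python
import re

# Without a flag, uppercase text does not match a lowercase pattern
case_sensitive = re.findall("ab*c", "ABC")
print(case_sensitive)  # []

# re.IGNORECASE makes the same pattern match regardless of case
case_insensitive = re.findall("ab*c", "ABC", re.IGNORECASE)
print(case_insensitive)  # ['ABC']
```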

# Extract Text From HTML With Regular Expressions

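The regex pattern `r'<.*?>(.*?)</.*?>'` matches an opening tag, captures the text that follows, and stops at the next closing tag. The `.*?` is a non-greedy quantifier that matches any character as few times as possible; because the code passes `re.DOTALL`, the dot also matches newlines.
`re.findall()` finds all matches of the pattern in the HTML content and returns the captured groups.
The captured text is joined with spaces to form the extracted content. Because the pattern does not pair each opening tag with its own closing tag, nested tags such as `<head>` and `<h1>` leak into the output below.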

In [58]:
# HTML content
html_content = """
<html>
<head>
<title>Example Page</title>
</head>
<body>
<h1>Welcome to Example Page</h1>
<p>This is a paragraph.</p>
<p>This is another paragraph.</p>
</body>
</html>
"""

# Define the regex pattern to match HTML tags and their content
pattern = r'<.*?>(.*?)</.*?>'

# Find all matches of the pattern in the HTML content
matches = re.findall(pattern, html_content, re.DOTALL)

# Concatenate the matched text to form the extracted content
text_content = ' '.join(matches)

# Print the extracted text content
print(text_content.strip())


<head>
<title>Example Page 
<body>
<h1>Welcome to Example Page This is a paragraph. This is another paragraph.
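
The output above still contains tag names because the pattern pairs tags loosely. One alternative, sketched here under the assumption that the markup is as simple as this sample, is to isolate the body first and then delete every tag with `re.sub()`. It is still not a substitute for a real HTML parser.

```python
import re

html_content = """
<html>
<head>
<title>Example Page</title>
</head>
<body>
<h1>Welcome to Example Page</h1>
<p>This is a paragraph.</p>
<p>This is another paragraph.</p>
</body>
</html>
"""

# Capture everything between <body ...> and </body>; [^>]* tolerates attributes
body_match = re.search(r"<body[^>]*>(.*?)</body>", html_content,
                       re.DOTALL | re.IGNORECASE)
body = body_match.group(1) if body_match else html_content

# Delete every tag: a '<', then one or more non-'>' characters, then '>'
text = re.sub(r"<[^>]+>", "", body)

# Drop blank lines and surrounding whitespace
lines = [line.strip() for line in text.splitlines() if line.strip()]
print("\n".join(lines))
```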


# Beautiful Soup


Beautiful Soup is a Python library for parsing HTML and XML documents. It provides tools for navigating the parse tree and searching for elements, making it easier to extract and manipulate data from web pages. Here's an example of how you can use Beautiful Soup to extract information from a webpage:


In [68]:
# URL of the webpage to scrape
url = "https://www.example.com"  # Replace with the URL of the webpage

# Send a GET request to the webpage
response = requests.get(url)

# Parse the HTML content of the webpage
soup = BeautifulSoup(response.content, 'html.parser')

# Find all <img> tags on the page
img_tags = soup.find_all("img")

# Print the src attribute of each <img> tag
for img_tag in img_tags:
    src = img_tag.get("src")
    print(src)
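
The `src` values on a page are often relative paths rather than full URLs. A minimal sketch that resolves them against the page URL with `urllib.parse.urljoin`; the inline markup is a hypothetical sample so the example runs without a network connection, and `src=True` is used to skip `<img>` tags that have no `src` attribute.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical page URL and markup, used so the example runs offline
page_url = "http://olympus.realpython.org/profiles/dionysus"
html = """
<html><body>
<img src="/static/dionysus.jpg"/>
<img src="/static/grapes.png"/>
<img alt="no source"/>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# src=True keeps only <img> tags that actually have a src attribute
image_urls = [
    urljoin(page_url, img["src"])
    for img in soup.find_all("img", src=True)
]
print(image_urls)
```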

In [69]:
url = "http://olympus.realpython.org/profiles/dionysus"
page = urlopen(url)
html = page.read().decode("utf-8")
soup = BeautifulSoup(html, "html.parser")
print(soup.get_text())



Profile: Dionysus





Name: Dionysus

Hometown: Mount Olympus

Favorite animal: Leopard 

Favorite Color: Wine






In [67]:
soup.find_all("img")

[<img src="/static/dionysus.jpg"/>, <img src="/static/grapes.png"/>]

The next cell queries a Medium endpoint (https://medium.com/_/api/home-feed) to fetch recent articles from the Medium homepage. This endpoint does not appear to be part of Medium's documented public API, so the request can fail, as it does in the run shown below.

We specify parameters for the API request, including the limit of articles to retrieve and filtering by the latest articles.

We send a GET request to the endpoint using `requests.get()` and pass the parameters.

If the request is successful (status code 200), we parse the JSON response and extract the article information, including titles and URLs.

Finally, we print the titles and links of the recent articles.

In [63]:
# URL of the Medium API endpoint for fetching recent articles
api_url = "https://medium.com/_/api/home-feed"

# Parameters for the API request
params = {
    "limit": 10,  # Number of articles to retrieve
    "filter": "latest",  # Filter by latest articles
}

# Send a GET request to the Medium API endpoint
response = requests.get(api_url, params=params)

# Check if the request was successful (status code 200)
if response.status_code == 200:
    # Parse the JSON response
    data = response.json()

    # Extract article information from the response
    articles = data["payload"]["references"]["Post"]

    # Print the titles and links of the recent articles
    for article_id, article_info in articles.items():
        title = article_info["title"]
        url = f"https://medium.com/p/{article_info['uniqueSlug']}"
        print(f"Title: {title}")
        print(f"URL: {url}")
        print()
else:
    print("Failed to fetch articles. Please try again later.")

Failed to fetch articles. Please try again later.
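
Even when such a request succeeds, indexing straight into nested keys raises `KeyError` if the response shape changes. A defensive sketch, assuming the payload → references → Post layout that the cell above expects (the real response format is not confirmed here); `extract_posts` and the sample data are hypothetical.

```python
from typing import Any, Dict, List, Tuple


def extract_posts(data: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Return (title, url) pairs, tolerating missing keys at any level."""
    # Each .get() falls back to an empty dict so a missing level yields no posts
    posts = data.get("payload", {}).get("references", {}).get("Post", {})
    results = []
    for post in posts.values():
        title = post.get("title")
        slug = post.get("uniqueSlug")
        # Skip entries that lack either field instead of raising KeyError
        if title and slug:
            results.append((title, f"https://medium.com/p/{slug}"))
    return results


# Hypothetical payload in the assumed shape
sample = {"payload": {"references": {"Post": {
    "abc123": {"title": "Intro to Scraping", "uniqueSlug": "intro-to-scraping-abc123"},
    "def456": {"title": "Missing slug"},
}}}}

print(extract_posts(sample))
print(extract_posts({}))
```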


# Interact With Websites in Real Time
