<a href="https://colab.research.google.com/github/EitanBakirov/Economics-Data-Science/blob/main/Web_Scraping_and_APIs.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to Web Scraping and APIs for Data Scientists

### Motivation

In data science, the ability to collect and use data from the web is essential. Two key methods for doing this are web scraping and APIs (Application Programming Interfaces). This notebook is a beginner's guide to gathering useful data from websites and online services. Whether you need data for research, a dataset for machine learning, or a way to automate repetitive tasks, web scraping and APIs will help. With practical examples, pictures, and clear steps, this guide teaches the basics and gets you started in the data-rich world of the web.

### Overview of Web Scraping and APIs

- **Web Scraping**: Web scraping is a technique used to extract data from websites. It involves fetching the HTML of a webpage and then parsing it to find the necessary information. This is useful when there is no API available, or when the data you need is only displayed on a webpage.
  - **Use Cases**: Gathering product information from e-commerce sites, collecting social media posts, extracting news articles, etc.
  - **Common Tools**: BeautifulSoup, Scrapy, Selenium

- **APIs**: An API is a set of rules and protocols that allows different software applications to communicate with each other. Many websites and services provide APIs that allow you to programmatically request and retrieve data.
  - **Use Cases**: Accessing structured data such as Spotify track metadata, Twitter (X) posts, weather information, stock prices, etc.
  - **Common Tools**: Requests library, Postman, various language-specific libraries
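
To make the contrast concrete, here is a minimal sketch that reads the same value from an HTML snippet (scraping) and from a JSON string (the kind of payload an API typically returns). Both strings are made-up sample data, not output from a real site or service, and the class name `temp` is an assumption chosen for illustration.

```python
import json
from bs4 import BeautifulSoup

# Hypothetical sample data: the same fact as a web page and as an API payload
html_page = '<html><body><span class="temp">21</span></body></html>'
api_payload = '{"city": "Sample City", "temperature": 21}'

# Scraping: parse the markup, then locate the element that holds the value
overview_soup = BeautifulSoup(html_page, 'html.parser')
scraped_temp = int(overview_soup.find('span', class_='temp').text)

# API: the response is already structured, so we read the field directly
api_temp = json.loads(api_payload)['temperature']

print(scraped_temp, api_temp)
```

With scraping, the code depends on the page layout (here, a `span` with a particular class), while the API payload exposes the value under a named field.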




### Importance in Data Science

Web scraping and APIs are essential skills for data scientists for several reasons:

1. **Data Availability**: Much of the data needed for analysis, model building, or business intelligence is available on the web.
2. **Automation**: These techniques allow for the automated collection of large datasets, saving time and effort compared to manual data collection.
3. **Customization**: By using web scraping and APIs, you can tailor the data collection process to your specific needs, gathering only the information relevant to your project.
4. **Integration**: Combining data from multiple sources (both scraped data and API data) can provide a more comprehensive dataset for deeper insights and more robust models.

With this understanding, let's dive into setting up our environment and getting started with web scraping and APIs.

# Getting Started



## Setting Up Your Environment




### Installing Necessary Libraries

Before we dive into web scraping and APIs, we need to set up our environment by installing some essential libraries. These libraries will help us fetch and process data from the web.

The main libraries we will use are:
- **BeautifulSoup**: For parsing HTML and extracting data from web pages.
- **Requests**: For making HTTP requests to web servers.
- **Selenium**: For automating web browsers (useful for scraping dynamic content).
- **Pandas**: For data manipulation and analysis.
- **json**: For handling JSON data (commonly returned by APIs). This module is part of Python's standard library, so it does not need to be installed.

Let's install these libraries. If you are using Google Colab, these can be installed using pip:

In [5]:
# Installing necessary libraries
!pip install beautifulsoup4 requests selenium pandas



Let's verify the installation by importing the libraries.

In [6]:
# Importing necessary libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd

# If the imports above ran without an ImportError, the libraries are installed
print("Libraries installed and ready to use!")


Libraries installed and ready to use!


In the next sections, we will explore web scraping in detail, starting with understanding HTML structure and how to navigate it using BeautifulSoup.

# Web Scraping



## What is Web Scraping?

As introduced in the overview, web scraping extracts data from websites in two steps: fetch the HTML of a webpage, then parse it to find the information you need. It is most useful when no API is available, or when the data you need appears only on a webpage.



### Ethical Considerations and Legal Aspects

Before you start web scraping, it’s important to consider the ethical and legal implications:
- **Respect Website Terms of Service**: Always check a website’s terms of service to ensure you are not violating any rules.
- **Be Polite**: Avoid making too many requests in a short period. Use delays between requests to prevent overloading the server.
- **Use an API if Available**: If a website provides an API, use it instead of scraping. APIs are designed for data access and are often more efficient and reliable.
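
The "Be Polite" advice can be sketched as a small helper that pauses between consecutive requests. The function name `fetch_politely`, its parameters, and the stand-in `fake_fetch` are hypothetical names used only for this illustration; the one-second default delay is an assumption, not a rule.

```python
import time

def fetch_politely(urls, fetch, delay_seconds=1.0):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay_seconds)  # wait before every request after the first
        results.append(fetch(url))
    return results

# Demonstration with a stand-in fetch function, so no real server is contacted.
# In practice you might pass something like: lambda u: requests.get(u, timeout=10)
def fake_fetch(url):
    return f'fetched {url}'

pages = fetch_politely(
    ['https://example.com/a', 'https://example.com/b'],
    fake_fetch,
    delay_seconds=0.1,
)
print(pages)
```

Passing the fetch function as an argument keeps the delay logic separate from the HTTP library, which also makes the helper easy to try out without network access.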




## Basic Concepts



### HTML Structure

Web pages are structured using HTML (HyperText Markup Language). Understanding the basic structure of HTML is crucial for web scraping.

Here's a simple example of an HTML structure:

```html
<!DOCTYPE html>
<html>
<head>
    <title>Sample Page</title>
</head>
<body>
    <h1>Welcome to the Sample Page</h1>
    <p>This is a sample paragraph.</p>
    <div class="content">
        <p>More content here.</p>
    </div>
</body>
</html>
```
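
As a quick illustration of why this structure matters, the sketch below loads the sample page above into BeautifulSoup and walks its tree. The variable names are chosen for this example only.

```python
from bs4 import BeautifulSoup

# The sample page from above, stored as a string
html_doc = """<!DOCTYPE html>
<html>
<head>
    <title>Sample Page</title>
</head>
<body>
    <h1>Welcome to the Sample Page</h1>
    <p>This is a sample paragraph.</p>
    <div class="content">
        <p>More content here.</p>
    </div>
</body>
</html>"""

sample_soup = BeautifulSoup(html_doc, 'html.parser')

page_title = sample_soup.title.string       # text inside <title>
heading = sample_soup.h1.text               # first <h1> in the document
first_paragraph = sample_soup.body.p.text   # first <p> under <body>
nested = sample_soup.find('div', class_='content').p.text  # <p> inside the div

print(page_title)
print(heading)
print(first_paragraph)
print(nested)
```

Each attribute access follows the nesting of the tags, which is why knowing where an element sits in the HTML tells you how to reach it.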

### CSS Selectors



CSS (Cascading Style Sheets) selectors are used to select and style HTML elements. They are also useful for selecting elements when scraping.

- **Element Selector**: `p` selects all `<p>` elements.
- **Class Selector**: `.content` selects all elements with class `content`.
- **ID Selector**: `#main` selects the element with id `main`.
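
The three selector types can be tried out with BeautifulSoup's `select()` and `select_one()` methods. The HTML below is a hypothetical snippet; the id `main` is added only so that the `#main` selector has something to match.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet used only to demonstrate the selectors
selector_html = """
<div id="main">
    <p>Intro paragraph.</p>
    <div class="content">
        <p>More content here.</p>
    </div>
</div>
"""
selector_soup = BeautifulSoup(selector_html, 'html.parser')

# Element selector: every <p> element
all_paragraphs = [p.text for p in selector_soup.select('p')]

# Class selector: every element with class "content"
content_blocks = selector_soup.select('.content')

# ID selector: the single element with id "main"
main_block = selector_soup.select_one('#main')

# Selectors can be combined: <p> elements nested inside class "content"
content_paragraphs = [p.text for p in selector_soup.select('.content p')]

print(all_paragraphs)
print(len(content_blocks), main_block['id'])
print(content_paragraphs)
```

`select()` always returns a list, while `select_one()` returns the first match or `None`.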

### Tools and Libraries for Web Scraping




#### BeautifulSoup

BeautifulSoup is a Python library used for parsing HTML and XML documents. It creates a parse tree that can be used to extract data from HTML.



#### Scrapy
Scrapy is an open-source and collaborative web crawling framework for Python. It is used to extract data from websites and process them as required.



#### Selenium
Selenium is a browser automation tool: it controls a real web browser from code. It is useful for scraping dynamic content that only appears after JavaScript runs.

## Practical Examples


### Simple HTML Parsing with BeautifulSoup
Let's start with a basic example of using BeautifulSoup to parse HTML and extract data.

In [12]:
import requests
from bs4 import BeautifulSoup

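# Fetch the webpage (the timeout prevents the request from hanging indefinitely)
url = 'https://ollivere.co/'
response = requests.get(url, timeout=10)
response.raise_for_status()  # stop early on HTTP errors such as 404 or 500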

# Parse the HTML content
soup = BeautifulSoup(response.content, 'html.parser')

# Extract the title of the page
title = soup.title.string
print('Page Title:', title)

# Extract all paragraphs
paragraphs = soup.find_all('p')
for i, p in enumerate(paragraphs, 1):
    print(f'Paragraph {i}:', p.text)


Page Title: Ollivere | Digital Product, UX and Branding Design Consultancy
Paragraph 1: Creative Direction - how to make things different & memorable
Paragraph 2: End-to-end UX and UI design of websites & apps.
Paragraph 3: Workshops to clarify brand positioning and messaging.
Paragraph 4: Branding and design kits/systems
Paragraph 5: Bespoke illustration, animation & data visualisation.
Paragraph 6: Design and build creative Webflow websites (like this one). 
Paragraph 7: Properly understand the project goals, who the audience are, what they want and what already exists.
Paragraph 8: Test early & often using sketches, mock-ups & prototypes to quickly learn what works.
Paragraph 9: Create organised systems of design & brand components that are easy to use & maintain.
Paragraph 10: Continuously refine & develop over time based on methodically gathered feedback.
Paragraph 11: “Martin makes my life easy. He’s one of those rare people that I can give a brief to and then relax, because I kn

### Navigating and Scraping Complex Web Pages

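For more complex pages, we can use CSS selectors with `soup.select()` to find elements. The example below reuses the `soup` object from the previous cell. The class `content` is a sample name, so substitute a class that exists on the page you are scraping; if nothing matches, the loop prints nothing.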

In [None]:
# Extract elements with class 'content'
content_divs = soup.select('.content')
for div in content_divs:
    print('Content:', div.text)
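
On pages with repeated, nested blocks, a common pattern is to select each block and then pull several fields out of it. The sketch below assumes a made-up product listing; the class names `products`, `product`, and `price`, and the data itself, are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical product listing used only for illustration
listing_html = """
<ul class="products">
  <li class="product"><a href="/items/1">Notebook</a><span class="price">$4.50</span></li>
  <li class="product"><a href="/items/2">Pencil</a><span class="price">$0.99</span></li>
</ul>
"""
listing_soup = BeautifulSoup(listing_html, 'html.parser')

records = []
# "ul.products > li.product" matches <li class="product"> directly inside the list
for item in listing_soup.select('ul.products > li.product'):
    link = item.select_one('a')
    price = item.select_one('span.price')
    records.append({
        'name': link.get_text(strip=True),   # visible link text
        'url': link['href'],                 # attribute value
        'price': price.get_text(strip=True),
    })

print(records)
```

Collecting one dictionary per block gives a list of records that can be passed straight to `pd.DataFrame(records)`.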

### Handling Dynamic Content with Selenium

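Some pages build their content with JavaScript after the initial HTML loads, so a plain HTTP request may not return the data you see in the browser. Selenium drives a real browser, which runs the JavaScript before we read the page.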

In [None]:
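from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Set up the Selenium WebDriver (using Chrome in this example).
# Headless mode is typically needed where there is no display, such as Google Colab.
options = webdriver.ChromeOptions()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)

try:
    # Navigate to the webpage (placeholder URL; replace with the page you want to scrape)
    driver.get('http://example.com')

    # Wait up to 10 seconds for JavaScript to render the element.
    # 'dynamic' is a placeholder class name; replace it with one that exists on your page.
    dynamic_content = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CLASS_NAME, 'dynamic'))
    )
    print('Dynamic Content:', dynamic_content.text)
finally:
    # Close the browser even if an error occurs
    driver.quit()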

## Data Cleaning and Storage

### Parsing and Cleaning Scraped Data


After extracting data, you may need to clean it for analysis.



In [None]:
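# Load the scraped paragraphs into a pandas DataFrame and clean them
import pandas as pd

# get_text(strip=True) removes leading and trailing whitespace
data = {'Paragraphs': [p.get_text(strip=True) for p in paragraphs]}
df = pd.DataFrame(data)

# Drop empty paragraphs and exact duplicates
df = df[df['Paragraphs'] != ''].drop_duplicates().reset_index(drop=True)

# Display the DataFrame
print(df)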
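
Scraped values often need type conversion as well as whitespace cleanup. The following sketch works on a small made-up table (the column names `name` and `price` and all values are hypothetical) and shows a few typical steps: trimming text, removing empty and duplicate rows, and converting a price string to a number.

```python
import pandas as pd

# Hypothetical raw values as they might come out of a scraper
raw = pd.DataFrame({
    'name': ['  Notebook ', 'Pencil\n', 'Pencil\n', ''],
    'price': ['$4.50', '$0.99', '$0.99', None],
})

clean = raw.copy()
clean['name'] = clean['name'].str.strip()               # trim surrounding whitespace
clean = clean[clean['name'] != '']                      # drop rows with no name
clean = clean.drop_duplicates().reset_index(drop=True)  # remove repeated rows

# Remove the currency symbol and convert the text to a float
clean['price'] = clean['price'].str.replace('$', '', regex=False).astype(float)

print(clean)
```

Working on a copy keeps the raw data available, which helps when a cleaning step turns out to remove more than intended.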


### Storing Data in CSV, JSON, or Databases


You can store the cleaned data in various formats for later use.



In [None]:
# Save to CSV
df.to_csv('scraped_data.csv', index=False)

# Save to JSON
df.to_json('scraped_data.json', orient='records')
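
For the database option, one lightweight possibility is SQLite, which ships with Python. This is a minimal sketch rather than a recommended setup: the table name `scraped_paragraphs` and the small stand-in DataFrame are assumptions made for illustration, and an in-memory database is used so that nothing is written to disk.

```python
import sqlite3
import pandas as pd

# Hypothetical stand-in for the scraped DataFrame
sample_df = pd.DataFrame({'Paragraphs': ['First paragraph.', 'Second paragraph.']})

# ':memory:' creates a temporary database; pass a file path such as
# 'scraped_data.db' instead to keep the data after the program ends
conn = sqlite3.connect(':memory:')
try:
    # Write the DataFrame to a table, replacing it if it already exists
    sample_df.to_sql('scraped_paragraphs', conn, if_exists='replace', index=False)

    # Read the rows back with a SQL query to confirm they were stored
    loaded = pd.read_sql_query('SELECT Paragraphs FROM scraped_paragraphs', conn)
finally:
    conn.close()

print(loaded)
```

A database becomes more useful than CSV or JSON files once you scrape repeatedly and want to append to, or query, earlier results.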

With these examples and tools, you should be able to start scraping data from web pages. In the next section, we will explore APIs and how to use them to gather data.