### LSE Data Analytics Online Career Accelerator

# DA201: Data Analytics using Python

## Week 5: Data scraping with Python

The focus this week is on web scraping and SQL databases with Python. You will use this notebook to follow along with the demonstrations throughout the week.

This is your notebook. Use it to follow along with the demonstrations, test ideas and explore what is possible. The hands-on experience of writing your own code will accelerate your learning!

Learn about using your Jupyter Notebook here: https://jupyter-notebook.readthedocs.io/en/latest/ui_components.html.

### 5.1 Web scraping

In [None]:
# Install the Beautiful Soup library.
!pip install beautifulsoup4

In [1]:
# Import the requests and Beautiful Soup libraries.
import requests
import bs4
from bs4 import BeautifulSoup

# Specify the URL.
URL = 'https://en.wikipedia.org/wiki/Main_Page'

# Send a GET request to the URL and store the response.
page = requests.get(URL)

# Parse the HTML and print the text content of the page.
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.text)





Wikipedia, the free encyclopedia

Main Page

From Wikipedia, the free encyclopedia



Jump to navigation
Jump to search



Welcome to Wikipedia,
the free encyclopedia that anyone can edit.
6,562,922 articles in English




From today's featured article


Izmail under construction at the Baltic Works

The Borodino-class battlecruisers were a group of four battlecruisers ordered by the Imperial Russian Navy before World War I for service with the Baltic Fleet. Construction of the ships was delayed by a lack of capacity among domestic factories and the need to order some components from abroad. The start of the war in 1914 slowed their construction still further. All of the ships were launched in 1915–1916, but it became evident that Russian industry would not be able to complete them during the war. The Russian Revolution of 1917 halted all work on the ships. Although some consideration was given to finishing the hulls that were nearest to 

In [None]:
# Determine the title of the page.
soup.title

In [None]:
# Find all the main headings (h1 tags) on the page.
soup.find_all('h1')

In [None]:
# Find all paragraphs.
soup.find_all('p')

In [None]:
# Create a for loop to find all the 'a' tags.
for link in soup.find_all('a'):
    print(link.get('href'))
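
Many of the `href` values printed by the loop above are relative (for example, they start with `/` or `#`), and `link.get('href')` returns `None` for an `a` tag that has no `href` attribute. The following is a minimal sketch of one way to turn such values into full URLs with `urljoin` from the standard library. The `absolute_links` helper and the sample `hrefs` list are hypothetical and used only for illustration; they are not taken from the page itself.

```python
from urllib.parse import urljoin

# The page that the links were scraped from.
BASE_URL = 'https://en.wikipedia.org/wiki/Main_Page'

# Hypothetical href values of the kinds a page can contain (illustration only).
hrefs = ['/wiki/Python', '#top', 'https://www.wikidata.org/', None]


def absolute_links(base_url, hrefs):
    """Resolve each href against base_url, skipping missing values."""
    return [urljoin(base_url, href) for href in hrefs if href]


# View the resolved links.
links = absolute_links(BASE_URL, hrefs)
for link in links:
    print(link)
```

With a real page, the list could be built from the loop above, for example `[link.get('href') for link in soup.find_all('a')]`.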

In [None]:
# Extract all the text from the page.
print(soup.get_text())

In [None]:
# Return the alt text of every image on the page.
[img.get('alt') for img in soup.find_all('img')]


### 5.2 What is an API?

As your first experiment with HTTP requests in Python, you are going to call an API. There are a huge number of APIs to choose from, each offering different data. For this exercise, you will use SWAPI: The Star Wars API, a completely free API that doesn’t require any registration and contains 'all the Star Wars data you’ve ever wanted' (SWAPI 2021a; 2021b).

In [None]:
# Import the necessary libraries.
import requests
import json

# Send a GET request to the API and store the response.
response = requests.get('https://swapi.dev/api/people/1/')

# Print the status_code.
print(response.status_code)

# Print the JSON response.
print(response.json())
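
The status code printed above is a three-digit number; codes from 200 to 299 indicate success. As a minimal sketch, the standard library's `http.HTTPStatus` can translate a code into its standard phrase. The helper names `describe_status` and `is_success` are hypothetical and exist only for this illustration.

```python
from http import HTTPStatus


def describe_status(code):
    """Return the code with its standard phrase, e.g. '200 OK'."""
    # HTTPStatus raises ValueError for a code it does not recognise.
    return f'{code} {HTTPStatus(code).phrase}'


def is_success(code):
    """Return True for codes in the 2xx (success) range."""
    return 200 <= code < 300


# View the description of two common codes.
for code in (200, 404):
    print(describe_status(code), '| success:', is_success(code))
```

In the notebook, `describe_status(response.status_code)` would describe the response received above. The `requests` library also provides `response.raise_for_status()`, which raises an exception for 4xx and 5xx codes.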

In [None]:
# Create a function.
def jprint(obj):
    # Create a formatted string of the Python JSON object.
    text = json.dumps(obj, sort_keys=True, indent=4)
    print(text)

# View the output.
jprint(response.json())

In [None]:
# Request only the HTTP headers (no response body) and view them.
info = requests.head('https://swapi.dev/api/people/1/')

print(info.headers)

In [None]:
# Send a GET request for the second person and store the response.
response_2 = requests.get('https://swapi.dev/api/people/2/')

# Print the status_code.
print(response_2.status_code)

# Print the JSON response.
print(response_2.json())


In [None]:
# Send a GET request for the first planet and store the response.
response_3 = requests.get('https://swapi.dev/api/planets/1/')

# Print the status_code.
print(response_3.status_code)

# Print the JSON response.
print(response_3.json())
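
The requests in this section differ only in the resource name (`people`, `planets`) and the ID at the end of the URL. A minimal sketch of a helper that builds these URLs is shown below; `BASE_URL` and `swapi_url` are hypothetical names introduced for illustration and are not part of SWAPI or the `requests` library.

```python
# Base address shared by the SWAPI endpoints used in this section.
BASE_URL = 'https://swapi.dev/api'


def swapi_url(resource, item_id):
    """Build the URL for one item, e.g. people/1 or planets/1."""
    return f'{BASE_URL}/{resource}/{item_id}/'


# View the URLs requested in this section.
for resource, item_id in [('people', 1), ('people', 2), ('planets', 1)]:
    print(swapi_url(resource, item_id))
```

Each URL can then be passed to `requests.get()`, for example `requests.get(swapi_url('planets', 1))`, which keeps the variable names and URLs from drifting apart when a cell is copied and edited.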


### 5.3 Working with databases