# Web Scraping Lab

This notebook contains web scraping exercises to help you practise your scraping skills.

**Tips:**

- Check the response status code for each request to ensure you have obtained the intended content.
- Print the response text in each request to understand the kind of info you are getting and its format.
- Check for patterns in the response text to extract the data/info requested in each question.
- Visit the URLs below and inspect their source code with Chrome DevTools. You'll need to identify the HTML tags, special class names, etc. used in the HTML content you are expected to extract.
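
The first two tips can be sketched as a small helper. This is only a minimal sketch: `describe_response` is a hypothetical name, and it accepts any object exposing `.status_code` and `.text` (as a `requests` response does), so a stand-in object is used here to keep the example runnable offline.

```python
from http import HTTPStatus
from types import SimpleNamespace


def describe_response(response, preview_chars=200):
    """Summarise an object exposing .status_code and .text (hypothetical helper)."""
    code = response.status_code
    try:
        # Translate the numeric code into its standard reason phrase
        phrase = HTTPStatus(code).phrase
    except ValueError:
        phrase = "Unknown status"
    return {
        "status": code,
        "phrase": phrase,
        "ok": 200 <= code < 300,          # 2xx means the request succeeded
        "preview": response.text[:preview_chars],
    }


# Stand-in for a real response so the sketch runs without network access
fake_response = SimpleNamespace(status_code=404, text="<html><body>Not here</body></html>")
summary = describe_response(fake_response)
print(summary["status"], summary["phrase"], summary["ok"])
```

With a live request you would pass the object returned by `requests.get(url)` instead of `fake_response`.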

**Resources**:
- [Requests library](http://docs.python-requests.org/en/master/#the-user-guide)
- [Beautiful Soup Doc](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [Urllib](https://docs.python.org/3/library/urllib.html#module-urllib)
- [re lib](https://docs.python.org/3/library/re.html)
- [lxml lib](https://lxml.de/)
- [Scrapy](https://scrapy.org/)
- [List of HTTP status codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes)
- [HTML basics](http://www.simplehtmlguide.com/cheatsheet.php)
- [CSS basics](https://www.cssbasics.com/#page_start)

#### Below are the libraries and modules you may need. `requests`, `BeautifulSoup` and `pandas` are already imported for you. If you prefer to use additional libraries, feel free to do so.

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import bs4

#### Download, parse (using BeautifulSoup), and print the content from the Trending Developers page from GitHub:

In [None]:
# This is the url you will scrape in this exercise
url = 'https://github.com/trending/developers'

In [None]:
response = requests.get(url)

In [None]:
html = response.content
html[:500]

In [None]:
# Parse the full document; slicing the bytes first would hand BeautifulSoup truncated HTML
parsed_html = bs4.BeautifulSoup(html, "html.parser")
print(parsed_html.prettify())

#### Display the names of the trending developers retrieved in the previous step.

Your output should be a Python list of developer names. Each name should not contain any html tag.

**Instructions:**

1. Find out the html tag and class names used for the developer names. You can achieve this using Chrome DevTools.

1. Use BeautifulSoup to extract all the html elements that contain the developer names.

1. Use string manipulation techniques to remove whitespace and line breaks (i.e. `\n`) from the *text* of each HTML element. Use a list to store the clean names.

1. Print the list of names.

Your output should look like below:

```
['trimstray (@trimstray)',
 'joewalnes (JoeWalnes)',
 'charlax (Charles-AxelDein)',
 'ForrestKnight (ForrestKnight)',
 'revery-ui (revery-ui)',
 'alibaba (Alibaba)',
 'Microsoft (Microsoft)',
 'github (GitHub)',
 'facebook (Facebook)',
 'boazsegev (Bo)',
 'google (Google)',
 'cloudfetch',
 'sindresorhus (SindreSorhus)',
 'tensorflow',
 'apache (TheApacheSoftwareFoundation)',
 'DevonCrawford (DevonCrawford)',
 'ARMmbed (ArmMbed)',
 'vuejs (vuejs)',
 'fastai (fast.ai)',
 'QiShaoXuan (Qi)',
 'joelparkerhenderson (JoelParkerHenderson)',
 'torvalds (LinusTorvalds)',
 'CyC2018',
 'komeiji-satori (神楽坂覚々)',
 'script-8']
 ```
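
The string clean-up in step 3 could be sketched as follows. This is one possible approach, not the lab's reference solution: `clean_name` is a hypothetical helper and the raw strings are made up to imitate the text of scraped elements. Note that it collapses whitespace to single spaces, so multi-word names keep their internal spaces, unlike the sample output above.

```python
def clean_name(raw_text):
    """Collapse every run of whitespace (spaces, tabs, line breaks) into one space."""
    # str.split() with no arguments splits on any whitespace run and drops
    # leading/trailing whitespace, so joining the pieces normalises the text
    return " ".join(raw_text.split())


# Made-up raw strings imitating the text of scraped elements
raw_texts = [
    "\n      trimstray\n      (@trimstray)\n",
    "\n  charlax \n (Charles-Axel Dein)\n",
]
names = [clean_name(text) for text in raw_texts]
print(names)
```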

In [None]:
soup = bs4.BeautifulSoup(html, 'html.parser')
type(soup)

In [None]:
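# Grab the <a> inside every <h1> whose class list includes "h3" or "lh-condensed"
tags = [tag.a for tag in soup.find_all('h1', {'class': ['h3', 'lh-condensed']})]
# Skip headings without a link or without a single text node
names = [tag.string.strip() for tag in tags if tag is not None and tag.string is not None]
names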

#### Display the trending Python repositories in GitHub.

The steps to solve this problem are similar to the previous one, except that you need to find the repository names instead of developer names.

In [None]:
# This is the url you will scrape in this exercise
url = 'https://github.com/trending/python?since=daily'

In [None]:
import requests

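# Note: this cell queries the GitHub Search API for the most-starred Python
# repositories; it does not scrape the trending page defined above
url = 'https://api.github.com/search/repositories'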
params = {
    'q': 'language:python',
    'sort': 'stars',
    'order': 'desc',
    'per_page': 10
}

response = requests.get(url, params=params)
data = response.json()

# Extract repository names from the response
repo_names = [item['full_name'] for item in data['items']]

# Display the repository names
for name in repo_names:
    print(name)

#### Display all the image links from the Walt Disney Wikipedia page.

In [None]:
# This is the url you will scrape in this exercise
url = 'https://en.wikipedia.org/wiki/Walt_Disney'

In [None]:
import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/Walt_Disney'

# Send a GET request to the URL
response = requests.get(url)

# Create a BeautifulSoup object with the response text
soup = BeautifulSoup(response.text, 'html.parser')

# Find all the image elements
image_elements = soup.find_all('img')

# Extract the source (src) attribute, skipping any <img> that lacks one
image_links = [image['src'] for image in image_elements if image.has_attr('src')]

# Display the image links
for link in image_links:
    print(link)
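
The `src` values collected this way may be protocol-relative (`//host/...`) or site-relative (`/path/...`) rather than full URLs. As a minimal sketch, assuming you want absolute links, `urllib.parse.urljoin` can resolve each one against the page URL. The `raw_links` values below are made up for illustration.

```python
from urllib.parse import urljoin

page_url = 'https://en.wikipedia.org/wiki/Walt_Disney'

# Made-up src values illustrating three common shapes
raw_links = [
    '//upload.wikimedia.org/example/photo.jpg',   # protocol-relative
    '/static/images/icon.png',                    # site-relative
    'https://example.org/already/absolute.png',   # already absolute
]

# urljoin fills in whatever the link is missing (scheme, host) from page_url
absolute_links = [urljoin(page_url, link) for link in raw_links]
for link in absolute_links:
    print(link)
```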

#### Retrieve an arbitrary Wikipedia page of "Python" and create a list of links on that page.

In [None]:
# This is the url you will scrape in this exercise
url = 'https://en.wikipedia.org/wiki/Python'

In [None]:
import requests
from bs4 import BeautifulSoup

url = 'https://en.wikipedia.org/wiki/Python'

# Send a GET request to the URL
response = requests.get(url)

# Create a BeautifulSoup object with the response text
soup = BeautifulSoup(response.text, 'html.parser')

# Find all the anchor elements
anchor_elements = soup.find_all('a')

# Extract the href attribute from each anchor element
links = [anchor.get('href') for anchor in anchor_elements]

# Filter out None values and links that don't start with '/wiki/'
links = [link for link in links if link and link.startswith('/wiki/')]

# Display the links
for link in links:
    print(link)

#### Find the number of titles that have changed in the United States Code since its last release point.

In [None]:
# This is the url you will scrape in this exercise
url = 'http://uscode.house.gov/download/download.shtml'

In [None]:
import requests
from bs4 import BeautifulSoup

url = 'http://uscode.house.gov/download/download.shtml'

# Send a GET request to the URL
response = requests.get(url)

# Create a BeautifulSoup object with the response text
soup = BeautifulSoup(response.text, 'html.parser')

# Find the titles flagged as changed since the last release point
# (assumption: the page marks them with the class "usctitlechanged"; confirm in DevTools)
changed_titles = soup.find_all('div', class_='usctitlechanged')

# Count the changed titles
changed_titles_count = len(changed_titles)

# Display the count of changed titles
print(f"The number of titles changed in the United States Code since its last release point: {changed_titles_count}")


#### Create a Python list with the names of the FBI's top ten Most Wanted.

In [None]:
# This is the url you will scrape in this exercise
url = 'https://www.fbi.gov/wanted/topten'

In [None]:
import requests
from bs4 import BeautifulSoup

url = 'https://www.fbi.gov/wanted/topten'

# Send a GET request to the URL
response = requests.get(url)

# Create a BeautifulSoup object with the response text
soup = BeautifulSoup(response.text, 'html.parser')

# Find all the list items (li) containing the wanted names
wanted_names = soup.find_all('li', class_='portal-type-person castle-grid-block-item')

# Extract the names from the list items
top_ten_wanted_names = [name.text.strip() for name in wanted_names]

# Display the top ten wanted names
print("Top Ten FBI's Most Wanted Names:")
for name in top_ten_wanted_names:
    print(name)

#### Display info on the 20 latest earthquakes (date, time, latitude, longitude and region name) reported by the EMSC as a pandas dataframe.

In [None]:
# This is the url you will scrape in this exercise
url = 'https://www.emsc-csem.org/Earthquake/'

In [None]:
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://www.emsc-csem.org/Earthquake/'

# Send a GET request to the URL
response = requests.get(url)

# Create a BeautifulSoup object with the response text
soup = BeautifulSoup(response.text, 'html.parser')

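# Find the element that holds the earthquake rows (matched by id alone,
# since the element with id="tbody" may not be a <div>)
container = soup.find(id='tbody')

# Find all the rows in the container; fall back to an empty list if it is missing
rows = container.find_all('tr') if container is not None else []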

# Initialize an empty list to store the data
data = []

# Extract the information from each row
for row in rows:
    cols = row.find_all('td')
    if len(cols) == 13:  # Check for valid row
        date = cols[3].text.strip()
        time = cols[4].text.strip()
        latitude = cols[5].text.strip()
        longitude = cols[6].text.strip()
        region = cols[11].text.strip()
        data.append([date, time, latitude, longitude, region])

# Create a Pandas DataFrame with the extracted data
columns = ['Date', 'Time', 'Latitude', 'Longitude', 'Region']
df = pd.DataFrame(data, columns=columns)

# Display the 20 latest earthquakes
print(df.head(20))

#### Count the number of tweets by a given Twitter account.
Ask the user for the handle (@handle) of a Twitter account. You will need to include a ***try/except block*** for account names not found.
<br>***Hint:*** the program should count the number of tweets for any provided account.

In [None]:
# This is the url you will scrape in this exercise 
# You will need to add the account credentials to this url
url = 'https://twitter.com/'

In [None]:
import tweepy

# Set up your Twitter API credentials
consumer_key = 'YOUR_CONSUMER_KEY'
consumer_secret = 'YOUR_CONSUMER_SECRET'
access_token = 'YOUR_ACCESS_TOKEN'
access_token_secret = 'YOUR_ACCESS_TOKEN_SECRET'

# Authenticate with the Twitter API
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

try:
    # Create the API object
    api = tweepy.API(auth)

    # Ask the user for the Twitter handle
    handle = input("Enter the Twitter handle (without '@') of the account: ")

    # Get the user object for the given handle
    user = api.get_user(screen_name=handle)

    # Get the tweet count from the user object
    tweet_count = user.statuses_count

    # Display the tweet count
    print(f"The number of tweets by @{handle}: {tweet_count}")

except tweepy.error.TweepError as e:
    if e.api_code == 50:
        print(f"Account '@{handle}' not found.")
    else:
        print(f"An error occurred while accessing the Twitter API: {e}")

#### Count the number of followers of a given Twitter account.
Ask the user for the handle (@handle) of a Twitter account. You will need to include a ***try/except block*** for account names not found.
<br>***Hint:*** the program should count the followers for any provided account.

In [None]:
# This is the url you will scrape in this exercise 
# You will need to add the account credentials to this url
url = 'https://twitter.com/'

In [None]:
import tweepy

consumer_key = 'YOUR_CONSUMER_KEY'
consumer_secret = 'YOUR_CONSUMER_SECRET'
access_token = 'YOUR_ACCESS_TOKEN'
access_token_secret = 'YOUR_ACCESS_TOKEN_SECRET'

handle = input("Enter the Twitter handle (without '@') of the account: ")

auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)

api = tweepy.API(auth)

try:
    user = api.get_user(screen_name=handle)
    follower_count = user.followers_count
    print(f"The follower count for @{handle} is: {follower_count}")
except tweepy.error.TweepError as e:
    print(f"Error: {e}")

#### List all language names and number of related articles in the order they appear in wikipedia.org.

In [None]:
# This is the url you will scrape in this exercise
url = 'https://www.wikipedia.org/'

In [None]:
import requests
from bs4 import BeautifulSoup

# Send a GET request to the Wikipedia homepage
url = 'https://www.wikipedia.org/'
response = requests.get(url)

# Create a BeautifulSoup object to parse the HTML content
soup = BeautifulSoup(response.content, 'html.parser')

# Find every featured-language block (find() alone would return only the first)
language_blocks = soup.find_all('div', class_='central-featured-lang')

# Extract the language name and article count from each block
# (assumption: the name sits in <strong> and the count in <bdi>; confirm in DevTools)
for block in language_blocks:
    language_name = block.find('strong').text.strip()
    article_count = block.find('bdi').text.strip()
    print(f"Language: {language_name}\nNumber of articles: {article_count}\n")

#### Create a list with the different kinds of datasets available on data.gov.uk.

In [None]:
# This is the url you will scrape in this exercise
url = 'https://data.gov.uk/'

In [8]:
import requests
from bs4 import BeautifulSoup

# Send a GET request to the data.gov.uk homepage
url = 'https://data.gov.uk/'
response = requests.get(url)

# Create a BeautifulSoup object to parse the HTML content
soup = BeautifulSoup(response.content, 'html.parser')

# Find the links to the dataset topics
# (assumption: each topic is a link inside an <h3> heading; confirm in DevTools)
topic_links = soup.select('h3 a')

# Extract the topic names into a list
dataset_categories = [link.get_text(strip=True) for link in topic_links]

# Print the list of dataset categories
for category in dataset_categories:
    print(category)

#### Display the top 10 languages by number of native speakers stored in a pandas dataframe.

In [None]:
# This is the url you will scrape in this exercise
url = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'

In [11]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

# Send a GET request to the URL
url = 'https://en.wikipedia.org/wiki/List_of_languages_by_number_of_native_speakers'
response = requests.get(url)

# Create a BeautifulSoup object to parse the HTML content
soup = BeautifulSoup(response.content, 'html.parser')

# Find the table that contains the language data
table = soup.find('table', class_='wikitable')

# Initialize empty lists to store the language data
languages = []
native_speakers = []

# Iterate over the rows in the table (excluding the header row)
for row in table.find_all('tr')[1:]:
    # Get the data cells for each row; skip rows without enough cells
    columns = row.find_all('td')
    if len(columns) < 2:
        continue
    # Assumption: the first cell holds the language name and the second the
    # native speaker count (in millions on this page); confirm in DevTools
    language = columns[0].text.strip()
    speakers = columns[1].text.strip().replace(',', '')
    # Append the language and speaker count to the respective lists
    languages.append(language)
    native_speakers.append(speakers)

# Create a pandas dataframe with the language data
data = {'Language': languages, 'Native Speakers': native_speakers}
df = pd.DataFrame(data)

# Convert the 'Native Speakers' column to numeric values
df['Native Speakers'] = pd.to_numeric(df['Native Speakers'], errors='coerce')

# Sort the dataframe by the number of native speakers in descending order
df = df.sort_values(by='Native Speakers', ascending=False)

# Select the top 10 languages
top_10_languages = df.head(10)

# Display the top 10 languages dataframe
print(top_10_languages)


## Bonus
#### Scrape a certain number of tweets of a given Twitter account.

In [None]:
# This is the url you will scrape in this exercise 
# You will need to add the account credentials to this url
url = 'https://twitter.com/'

In [None]:
# your code here

#### Display IMDB's top 250 data (movie name, initial release, director name and stars) as a pandas dataframe.

In [1]:
# This is the url you will scrape in this exercise 
url = 'https://www.imdb.com/chart/top'

In [3]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Send a GET request to the IMDB top 250 page
url = 'https://www.imdb.com/chart/top'
response = requests.get(url)

# Create a BeautifulSoup object to parse the HTML content
soup = BeautifulSoup(response.content, 'html.parser')

# Find the table that holds the movie data
table = soup.find('table', class_='chart')

# Find all the rows in the table body
rows = table.find_all('tr')

# Initialize empty lists to store the data
movie_names = []
initial_releases = []
directors = []
stars = []

# Iterate over the rows, skipping the header row
for row in rows[1:]:
    # Find the columns in the row
    columns = row.find_all('td')

    # The title link sits in the second column; its text is the movie name
    title_link = columns[1].find('a')
    movie_name = title_link.text.strip()
    initial_release = columns[1].find('span', class_='secondaryInfo').text.strip('()')

    # Assumption: the link's title attribute lists the crew as
    # "Director (dir.), Star One, Star Two"; confirm in DevTools
    crew = title_link.get('title', '').split(', ')
    director = crew[0].replace(' (dir.)', '') if crew[0] else 'N/A'
    star_names = crew[1:]

    # Append the data to the respective lists
    movie_names.append(movie_name)
    initial_releases.append(initial_release)
    directors.append(director)
    stars.append(star_names)

# Create a pandas dataframe with the scraped data
data = {
    'Movie Name': movie_names,
    'Initial Release': initial_releases,
    'Director': directors,
    'Stars': stars
}
df = pd.DataFrame(data)

# Display the dataframe
print(df)


#### Display the movie name, year and a brief summary of 10 random movies from IMDB's top chart as a pandas dataframe.

In [None]:
#This is the url you will scrape in this exercise
url = 'http://www.imdb.com/chart/top'

In [None]:
# your code here
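
One possible way to handle the "random" part, sketched under assumptions: once the chart has been scraped into a list, `random.sample` picks 10 distinct entries. The `all_titles` list below holds placeholder names rather than scraped data, and the fixed seed is only there to make the sketch repeatable.

```python
import random

# Placeholder titles standing in for the 250 scraped chart entries
all_titles = [f"Movie {rank}" for rank in range(1, 251)]

# A seeded generator makes the selection repeatable; drop the seed for a fresh pick
rng = random.Random(42)

# sample() draws without replacement, so no movie appears twice
random_titles = rng.sample(all_titles, 10)
print(random_titles)
```

If the chart is already in a pandas dataframe, `df.sample(10)` serves the same purpose.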

#### Find the live weather report (temperature, wind speed, description and weather) of a given city.

In [None]:
#https://openweathermap.org/current
city = input('Enter the city: ')
url = 'http://api.openweathermap.org/data/2.5/weather?'+'q='+city+'&APPID=b35975e18dc93725acb092f7272cc6b8&units=metric'

In [None]:
# your code here
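
The four requested fields can be read from the JSON that the API returns. The sketch below assumes the usual shape of a current-weather response (`main.temp`, `wind.speed`, and a `weather` list whose first item has `main` and `description`). `summarize_weather` is a hypothetical helper, and the sample payload contains made-up values so the example runs without a network call.

```python
def summarize_weather(payload):
    """Pick the requested fields out of a current-weather JSON payload."""
    # The API returns a list of conditions; the first one is the primary condition
    first_condition = payload["weather"][0]
    return {
        "temperature": payload["main"]["temp"],
        "wind_speed": payload["wind"]["speed"],
        "description": first_condition["description"],
        "weather": first_condition["main"],
    }


# Made-up payload imitating the shape of a real response
sample_payload = {
    "weather": [{"main": "Clouds", "description": "overcast clouds"}],
    "main": {"temp": 14.2},
    "wind": {"speed": 3.6},
    "name": "London",
}
report = summarize_weather(sample_payload)
print(report)
```

With a live request, the payload would come from `requests.get(url).json()` using the `url` built above.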

#### Find the book name, price and stock availability as a pandas dataframe.

In [None]:
# This is the url you will scrape in this exercise. 
# It is a fictional bookstore created to be scraped. 
url = 'http://books.toscrape.com/'

In [None]:
# your code here