## Webscraping

Web scraping is the automated extraction of data from websites. In Python, this typically means writing code that sends HTTP requests to a site and then parses the HTML (or other structured data, such as JSON) that the server returns.

Python is a popular language for web scraping because libraries such as BeautifulSoup and Scrapy handle most of the work of requesting and parsing web pages, so you can extract data with relatively little code.

Web scraping has many uses, such as gathering data for research or analysis, monitoring websites for changes, and building datasets for machine learning. However, scraping can violate a website's terms of service and may even be illegal in some cases, so make sure you understand the legal and ethical implications before you proceed.
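One practical first step is to check a site's `robots.txt` file, which states which paths automated clients are asked to avoid. The sketch below uses the standard library's `urllib.robotparser`. The rules and the `example.com` URLs are hypothetical and inlined so that the code runs offline; passing this check does not replace reading the site's terms of service.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs without network access.
# For a real site you could call parser.set_url(".../robots.txt") and parser.read().
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(ROBOTS_TXT, "https://example.com/data/table.html"))
print(is_allowed(ROBOTS_TXT, "https://example.com/private/report.html"))
```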

Popular Python packages for web scraping include `BeautifulSoup`, `Selenium`, and `Scrapy`. This session uses `BeautifulSoup`, with `requests` to download the pages.

### Example 1: Webscraping a Table into a DataFrame

In [1]:
# Import libraries
import re
import string

import pandas as pd
import requests
from bs4 import BeautifulSoup

# The 'lxml' parser used below must be installed, but it does not need to be imported

In [2]:
# Getting the url
url = 'https://www.worlddata.info/africa/nigeria/inflation-rates.php'

response = requests.get(url)

# Getting the status code
response.status_code

200
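Inspecting `status_code` by hand works in a notebook, but a script usually wants the request to fail loudly. The following is a minimal sketch, not the method used in this session: the helper name `fetch_html`, the `User-Agent` string, and the 10-second timeout are all assumptions chosen for illustration. It relies on `raise_for_status()`, which raises `requests.HTTPError` for 4xx and 5xx responses, and on `timeout`, without which `requests` can wait indefinitely.

```python
import requests


def fetch_html(url, session=None, timeout=10):
    """Download a page and return its HTML text, raising on HTTP errors."""
    session = session or requests.Session()
    # Placeholder User-Agent (an assumption); identify your own script here
    headers = {"User-Agent": "example-scraper/0.1"}
    response = session.get(url, headers=headers, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx instead of silently parsing an error page
    response.raise_for_status()
    return response.text
```

With network access, `fetch_html(url)` would return the same text as `response.text` above, ready to pass to `BeautifulSoup`.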

In [3]:
# Instantiate the soup object
soup = BeautifulSoup(response.text,  'lxml')

# Get the table object
table = soup.find('table', class_ = 'std100 hover')

# Get the column header
col_head = [col.text.strip() for col in table.find_all('th')]
col_head

['Year', 'Nigeria', 'Ø EU', 'Ø USA', 'Ø World']
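`soup.find` returns `None` when nothing matches, so a changed page layout shows up later as a confusing `AttributeError` on `table.find_all`. A small guard makes the failure explicit. This is a sketch under stated assumptions: `find_table` is a hypothetical helper, the HTML is a tiny inlined stand-in for the real page, and `html.parser` is used only so the example does not depend on `lxml`.

```python
from bs4 import BeautifulSoup

# Tiny hypothetical page, inlined so the sketch runs offline
HTML = """
<table class="std100 hover">
  <tr><th>Year</th><th>Nigeria</th></tr>
  <tr><td>2021</td><td>16.95 %</td></tr>
</table>
"""


def find_table(soup, class_name):
    """Return the first <table> with the given class, or raise a clear error."""
    table = soup.find("table", class_=class_name)
    if table is None:
        raise ValueError(
            f"No <table> with class {class_name!r} found; the page layout may have changed"
        )
    return table


soup = BeautifulSoup(HTML, "html.parser")
table = find_table(soup, "std100 hover")
print([th.get_text(strip=True) for th in table.find_all("th")])
```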

In [4]:
# Get the column header into a dataframe
df2 = pd.DataFrame(columns=col_head)
df2

|     | Year | Nigeria | Ø EU | Ø USA | Ø World |
| --- | --- | --- | --- | --- | --- |


In [5]:
# Get the row data
for row in table.find_all('tr')[1:]:
    datapoint = [td.text.strip() for td in row.find_all('td')]

    # Assign the table rows to the dataframe
    df2.loc[len(df2)] = datapoint

# Investigate the dataframe
df2

|     | Year | Nigeria | Ø EU | Ø USA | Ø World |
| --- | --- | --- | --- | --- | --- |
| 0 | 2021 | 16.95 % | 2.55 % | 4.70 % | 3.50 % |
| 1 | 2020 | 13.25 % | 0.50 % | 1.23 % | 1.92 % |
| 2 | 2019 | 11.40 % | 1.63 % | 1.81 % | 2.19 % |
| 3 | 2018 | 12.09 % | 1.74 % | 2.44 % | 2.44 % |
| 4 | 2017 | 16.52 % | 1.43 % | 2.13 % | 2.19 % |
| ... | ... | ... | ... | ... | ... |
| 57 | 1964 | 0.86 % | 3.42 % | 1.28 % |  |
| 58 | 1963 | -2.69 % | 2.92 % | 1.24 % |  |
| 59 | 1962 | 5.27 % | 3.55 % | 1.20 % |  |
| 60 | 1961 | 6.28 % | 2.08 % | 1.07 % |  |
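The loop above appends one row at a time with `df2.loc[len(df2)] = datapoint`, which is fine for a table of this size but can become slow on larger ones. One common alternative, sketched here on a small inlined table (the HTML is a hypothetical stand-in and `html.parser` is used to avoid depending on `lxml`), is to collect all rows in a list and build the DataFrame once.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Small stand-in for the scraped page, inlined so the sketch runs offline
HTML = """
<table>
  <tr><th>Year</th><th>Nigeria</th></tr>
  <tr><td>2021</td><td>16.95 %</td></tr>
  <tr><td>2020</td><td>13.25 %</td></tr>
</table>
"""

soup = BeautifulSoup(HTML, "html.parser")
table = soup.find("table")

# Header cells become the column names
header = [th.get_text(strip=True) for th in table.find_all("th")]

# Every row after the header becomes one list of cell strings
rows = [
    [td.get_text(strip=True) for td in tr.find_all("td")]
    for tr in table.find_all("tr")[1:]
]

# Build the DataFrame in a single step instead of growing it row by row
df = pd.DataFrame(rows, columns=header)
print(df)
```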


In [6]:
# Save the dataset

df2.to_csv('worlddata.csv', index=False)

### Example 2: Webscraping a CSV File on GitHub

In [7]:
# Webscraping data from GitHub
url = 'https://github.com/Codecademy/datasets/blob/master/streeteasy/streeteasy.csv'

response = requests.get(url)
response.status_code

200

In [8]:
# Instantiate the soup object
soup = BeautifulSoup(response.text, 'lxml')

# Get the table object
table = soup.find('table', class_ = 'js-csv-data csv-data js-file-line-container')

# Get the column header
col_head = [col.text.strip() for col in table.find_all('th')]
col_head

['rental_id',
 'building_id',
 'rent',
 'bedrooms',
 'bathrooms',
 'size_sqft',
 'min_to_subway',
 'floor',
 'building_age_yrs',
 'no_fee',
 'has_roofdeck',
 'has_washer_dryer',
 'has_doorman',
 'has_elevator',
 'has_dishwasher',
 'has_patio',
 'has_gym',
 'neighborhood',
 'submarket',
 'borough']

In [9]:
# Put columns to dataframe
data_ = pd.DataFrame(columns=col_head)
data_

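|     | rental_id | building_id | rent | bedrooms | bathrooms | size_sqft | min_to_subway | floor | building_age_yrs | no_fee | has_roofdeck | has_washer_dryer | has_doorman | has_elevator | has_dishwasher | has_patio | has_gym | neighborhood | submarket | borough |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |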


In [10]:
# check the number of columns
len(data_.columns)

20

In [11]:
# Get the row data
for row in table.find_all('tr')[1:]:
    obser = [td.text.strip() for td in row.find_all('td')]
    # Deleting empty column from the rows
    del obser[0] 

    # Attach the row data to the dataframe
    data_.loc[len(data_)] = obser

# View the dataframe
data_.head()

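|     | rental_id | building_id | rent | bedrooms | bathrooms | size_sqft | min_to_subway | floor | building_age_yrs | no_fee | has_roofdeck | has_washer_dryer | has_doorman | has_elevator | has_dishwasher | has_patio | has_gym | neighborhood | submarket | borough |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1545 | 44518357 | 2550 | 0 | 1 | 480 | 9 | 2 | 17 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | Upper East Side | All Upper East Side | Manhattan |
| 1 | 2472 | 94441623 | 11500 | 2 | 2 | 2000 | 4 | 1 | 96 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Greenwich Village | All Downtown | Manhattan |
| 2 | 10234 | 87632265 | 3000 | 3 | 1 | 1000 | 4 | 1 | 106 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Astoria | Northwest Queens | Queens |
| 3 | 2919 | 76909719 | 4500 | 1 | 1 | 916 | 2 | 51 | 29 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | Midtown | All Midtown | Manhattan |
| 4 | 2790 | 92953520 | 4795 | 1 | 1 | 975 | 3 | 8 | 31 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | Greenwich Village | All Downtown | Manhattan |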
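Parsing GitHub's rendered file view depends on page markup that GitHub can change at any time. Because the file is already a CSV, a more direct route is to request the raw file instead. The sketch below converts a `github.com/.../blob/...` URL into its `raw.githubusercontent.com` form; `to_raw_url` is a hypothetical helper written for this illustration.

```python
def to_raw_url(blob_url):
    """Convert a github.com 'blob' URL into its raw.githubusercontent.com equivalent."""
    prefix = "https://github.com/"
    if not blob_url.startswith(prefix) or "/blob/" not in blob_url:
        raise ValueError(f"Not a GitHub blob URL: {blob_url}")
    # Split "owner/repo/blob/branch/path" into "owner/repo" and "branch/path"
    owner_repo, branch_and_path = blob_url[len(prefix):].split("/blob/", 1)
    return f"https://raw.githubusercontent.com/{owner_repo}/{branch_and_path}"


url = "https://github.com/Codecademy/datasets/blob/master/streeteasy/streeteasy.csv"
raw_url = to_raw_url(url)
print(raw_url)

# With network access, the file could then be loaded directly, for example:
# data_ = pd.read_csv(raw_url)
```

Loading the raw file this way also keeps numeric columns numeric, whereas cells scraped from HTML always arrive as strings.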


### Example 3: Webscraping Nigeria Inflation Rate

In [12]:
# Getting the url
url = 'https://www.worlddata.info/africa/nigeria/inflation-rates.php'

response = requests.get(url)

# Getting the status code
response.status_code

200

In [13]:
# Instantiate the soup object
soup = BeautifulSoup(response.text, 'lxml')

# Get the table object
table = soup.find('table', class_ = 'std100 hover')

# Get the column header
col_head = [col.text.strip() for col in table.find_all('th')]
col_head

# Put columns to dataframe
data_ = pd.DataFrame(columns=col_head)

# Get the row data
for row in table.find_all('tr')[1:]:
    obser = [td.text.strip() for td in row.find_all('td')]
    data_.loc[len(data_)] = obser

data_

|     | Year | Nigeria | Ø EU | Ø USA | Ø World |
| --- | --- | --- | --- | --- | --- |
| 0 | 2021 | 16.95 % | 2.55 % | 4.70 % | 3.50 % |
| 1 | 2020 | 13.25 % | 0.50 % | 1.23 % | 1.92 % |
| 2 | 2019 | 11.40 % | 1.63 % | 1.81 % | 2.19 % |
| 3 | 2018 | 12.09 % | 1.74 % | 2.44 % | 2.44 % |
| 4 | 2017 | 16.52 % | 1.43 % | 2.13 % | 2.19 % |
| ... | ... | ... | ... | ... | ... |
| 57 | 1964 | 0.86 % | 3.42 % | 1.28 % |  |
| 58 | 1963 | -2.69 % | 2.92 % | 1.24 % |  |
| 59 | 1962 | 5.27 % | 3.55 % | 1.20 % |  |
| 60 | 1961 | 6.28 % | 2.08 % | 1.07 % |  |
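Every scraped cell is a string, so values such as `16.95 %` cannot be plotted or averaged until they are converted. The following sketch shows one way to do that on three rows copied from the output above; `percent_to_float` is a hypothetical helper, and blank cells are assumed to mean missing data.

```python
import pandas as pd

# Three rows copied from the scraped output; every value is still a string
sample = pd.DataFrame({
    "Year": ["2021", "2020", "1964"],
    "Nigeria": ["16.95 %", "13.25 %", "0.86 %"],
    "Ø World": ["3.50 %", "1.92 %", ""],
})


def percent_to_float(column):
    """Strip the percent sign and convert to float; blank cells become NaN."""
    cleaned = column.str.replace("%", "", regex=False).str.strip()
    return pd.to_numeric(cleaned, errors="coerce")


sample["Year"] = sample["Year"].astype(int)
for name in ["Nigeria", "Ø World"]:
    sample[name] = percent_to_float(sample[name])

print(sample.dtypes)
print(sample)
```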


### Example 4: Webscraping Nigeria Exchange Rate

In [15]:
# Getting the url
url = 'https://www.indexmundi.com/facts/nigeria/official-exchange-rate'

response = requests.get(url)

# Getting the status code
response.status_code

200

In [16]:
# Instantiate the soup object
soup = BeautifulSoup(response.text, 'lxml')

# Get the table object
table = soup.find('table')

# Get the column header
col_head = [col.text.strip() for col in table.find_all('th')]
col_head

# Put columns to dataframe
data_ = pd.DataFrame(columns=col_head)

# Get the row data
for row in table.find_all('tr')[1:]:
    obser = [td.text.strip() for td in row.find_all('td')]
    data_.loc[len(data_)] = obser

print(data_.head())

# Save dataframe
data_.to_csv('Official_Exchange_Rate.csv', index=False)

   Year Value
0  1960  0.71
1  1961  0.71
2  1962  0.71
3  1963  0.71
4  1964  0.71


### Example 5: Scrape Exchange Rate dataset


In [2]:
# Webscraping data from internet
url = 'https://infomediang.com/cbn-exchange-rate/'

# Getting response variable
resp = requests.get(url)
# Checking connection status
resp.status_code


200

In [3]:
# Getting soup instance
soup = BeautifulSoup(resp.text, 'lxml')
# Getting the necessary tables
tab1 = soup.find_all('table')[2]
# Getting the table header
col_hd = [cell.text.strip() for cell in tab1.find_all('tr')[0].find_all(['th', 'td'])]
# Checking column header
col_hd

['YEAR (DATE)', 'DOLLAR TO NAIRA']

In [4]:
# Putting column header to dataframe
dfr1 = pd.DataFrame(columns=col_hd)
dfr1

|     | YEAR (DATE) | DOLLAR TO NAIRA |
| --- | --- | --- |


In [5]:
# Getting the rows of the table and joining to dataframe
for row in tab1.find_all('tr')[1:]:
    obs = [i.text.strip() for i in row.find_all('td')]
    dfr1.loc[len(dfr1)] = obs

# Checking dataframe 
dfr1.head()

|     | YEAR (DATE) | DOLLAR TO NAIRA |
| --- | --- | --- |
| 0 | 1984 | $1 = N0.765 |
| 1 | 1985 | $1 = N0.894 |
| 2 | 1986 | $1 = N2.02 |
| 3 | 1987 | $1 = N4.02 |
| 4 | 1988 | $1 = N4.54 |


In [6]:
# Renaming column
dfr1.rename(columns={'YEAR (DATE)':'Year', 'DOLLAR TO NAIRA':'Exchange(N/$)'}, inplace=True)
dfr1.head()


|     | Year | Exchange(N/$) |
| --- | --- | --- |
| 0 | 1984 | $1 = N0.765 |
| 1 | 1985 | $1 = N0.894 |
| 2 | 1986 | $1 = N2.02 |
| 3 | 1987 | $1 = N4.02 |
| 4 | 1988 | $1 = N4.54 |
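The exchange-rate column still holds text such as `$1 = N0.765`, so the number has to be pulled out before any analysis. A regular expression is one way to do it. In this sketch, `parse_naira_rate` and `RATE_PATTERN` are hypothetical names, and the pattern assumes every cell follows the `N<number>` shape seen in the output above.

```python
import re

# Matches the number that follows "N" in strings such as "$1 = N0.765"
RATE_PATTERN = re.compile(r"N\s*([\d,]*\.?\d+)")


def parse_naira_rate(text):
    """Return the naira amount as a float, or None if no number is found."""
    match = RATE_PATTERN.search(text)
    if match is None:
        return None
    # Drop thousands separators before converting
    return float(match.group(1).replace(",", ""))


for raw in ["$1 = N0.765", "$1 = N2.02", "$1 = N4.54"]:
    print(raw, "->", parse_naira_rate(raw))
```

Applied to the DataFrame, this would look like `dfr1["Exchange(N/$)"].apply(parse_naira_rate)`.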


### Example 6: Webscraping IMDb Top-Rated Movies

In [78]:
# Define a function to remove newlines and digits from a string
def transformer(s):
    # Keep every character that is neither a newline nor a digit
    return "".join(ch for ch in s if ch != "\n" and not ch.isdigit())


# Define a function to remove punctuation
def remove_punc(txt):
    # Avoid a variable named `chr`, which would shadow the built-in function
    return "".join(ch for ch in txt if ch not in string.punctuation)

In [75]:
url = 'https://www.imdb.com/chart/top'

response = requests.get(url)

# Getting the status code
response.status_code

200

In [94]:
# Instantiate the soup object
soup = BeautifulSoup(response.text, 'lxml')

# Get the table object
table = soup.find('table', class_ = 'chart full-width')

# Get the column header
col_head = [col.text.strip() for col in table.find_all('th')]
del col_head[0]
del col_head[-1]
col_head

['Rank & Title', 'IMDb Rating', 'Your Rating']

In [95]:
# Put columns to dataframe
data_ = pd.DataFrame(columns=col_head)
data_

# Get the row data
for row in table.find_all('tr')[1:]:
    obser = [td.text.strip() for td in row.find_all('td')]
    # Deleting empty column from the rows
    del obser[0] 
    del obser[-1]
    data_.loc[len(data_)] = obser

data_.head()

|     | Rank & Title | IMDb Rating | Your Rating |
| --- | --- | --- | --- |
| 0 | 1.\n The Shawshank Redemption\n(1994) | 9.2 | 12345678910 \n\n\n\nNOT YET RELEASED\n \n\nSeen |
| 1 | 2.\n The Godfather\n(1972) | 9.2 | 12345678910 \n\n\n\nNOT YET RELEASED\n \n\nSeen |
| 2 | 3.\n The Dark Knight\n(2008) | 9.0 | 12345678910 \n\n\n\nNOT YET RELEASED\n \n\nSeen |
| 3 | 4.\n The Godfather Part II\n(1974) | 9.0 | 12345678910 \n\n\n\nNOT YET RELEASED\n \n\nSeen |
| 4 | 5.\n 12 Angry Men\n(1957) | 9.0 | 12345678910 \n\n\n\nNOT YET RELEASED\n \n\nSeen |


In [96]:
# Transform messy variables
data_[['Rank','Title']] = data_['Rank & Title'].str.split('\n', n=1, expand=True)
data_.drop(columns='Rank & Title', inplace=True)

# Remove newline and digits from column
data_["Your Rating"] = data_["Your Rating"].apply(transformer)

# Remove full stop from column
data_["Rank"] = data_["Rank"].str.replace(".", "", regex=False)

# Split column into two columns
data_[["Title","Year"]] = data_["Title"].str.split("\n", expand=True)

# Remove punctuations from column
data_["Year"] = data_["Year"].apply(remove_punc)

# Change the index column
data_.set_index(keys="Rank", drop=True, inplace=True)

# Investigate dataframe
data_.head()




| Rank | IMDb Rating | Your Rating | Title | Year |
| --- | --- | --- | --- | --- |
| 1 | 9.2 | NOT YET RELEASED Seen | The Shawshank Redemption | 1994 |
| 2 | 9.2 | NOT YET RELEASED Seen | The Godfather | 1972 |
| 3 | 9.0 | NOT YET RELEASED Seen | The Dark Knight | 2008 |
| 4 | 9.0 | NOT YET RELEASED Seen | The Godfather Part II | 1974 |
| 5 | 9.0 | NOT YET RELEASED Seen | 12 Angry Men | 1957 |
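The rank, title, and year were separated above with two splits plus punctuation removal. As an alternative, a single `str.extract` call with named groups can pull all three fields out at once. This is only a sketch: the two sample strings are copied from the raw output shown earlier, and the pattern assumes every entry has the form `rank. title (year)`.

```python
import pandas as pd

# Two raw "Rank & Title" cells copied from the scraped output
raw = pd.Series([
    "1.\n The Shawshank Redemption\n(1994)",
    "5.\n 12 Angry Men\n(1957)",
])

# Named groups become the column names of the resulting DataFrame
pattern = r"^(?P<Rank>\d+)\.\s*(?P<Title>.+?)\s*\((?P<Year>\d{4})\)$"
parts = raw.str.extract(pattern)

# Rank and Year arrive as strings, so convert them to integers
parts = parts.astype({"Rank": int, "Year": int})
print(parts)
```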


In [97]:
# Save data
data_.to_csv("imdb_top_movies.csv")


```python
def find_max(nums):
    max_num = float("-inf") # smaller than all other numbers
    for num in nums:
        if num > max_num:
            max_num = num
    return max_num

x = [20,1,300,-10,7,350,400,360,9]
find_max(x)
```