<font color='#2F4F4F'>To use this notebook on Colaboratory, you will need to make a copy of it. Go to File > Save a Copy in Drive. You can then use the new copy that will appear in the new tab.</font>


# <font color='#2F4F4F'>AfterWork Data Science: Web Scraping with Python</font>

## <font color='#2F4F4F'>Prerequisites</font>

In [1]:
# We first import the required libraries
# ---
#
import pandas as pd             # library for data manipulation
import requests                 # library for fetching a web page 
from bs4 import BeautifulSoup   # library for extracting contents from a webpage 

## <font color='#2F4F4F'>Examples</font>

As we go through the following examples, we should keep in mind that although web pages differ, the process for scraping data from them is largely the same. This means that when scraping data from other web pages, we can usually reuse the given code with some modifications.

### Example 1: Performing Basic Web Scraping 

In [None]:
# Example 1
# ---
# Scrape data found within the <a class="tag"> HTML tags in the given quotes website.
# ---
# Website: http://quotes.toscrape.com/ 
# ---
# YOUR CODE GOES BELOW
#

**Before we begin, we need to first understand the following concepts:**

1. A website is made up of web pages. These web pages can be either HTML or XML web pages.   
2. HTML web pages contain content made up of predefined tags, e.g. `<head>, <body>, <div>, <header>, <section>, <p>, <span>, <a>, <li>`, etc.
3. XML web pages contain content made up of user-defined tags such as `<root>, <name>, <address>, <sector>, <location>`, etc.
4. The desired text data is usually contained within tags in an HTML or an XML web page. For example, the text "hello" can be contained in a `<p>` tag as shown: `<p>hello</p>`. In this case, the `<p>` tag, or *paragraph tag*, has a closing tag `</p>`.
5. HTML tags can have attributes such as `class`, `id`, `href`, etc. A good example would be a paragraph tag with a class attribute that has the value "home": `<p class="home">hello</p>`. These attributes help us specify the elements that we want to work with. 
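To make these concepts concrete, here is a minimal sketch that parses a tiny, made-up HTML snippet (not taken from any real website) and reads a tag's name, text and attributes. The snippet and variable names are assumptions chosen purely for illustration.

```python
from bs4 import BeautifulSoup

# A tiny, made-up HTML document used only for illustration.
html = '<div><p class="home" id="greeting">hello</p><a href="/about">About</a></div>'

soup = BeautifulSoup(html, "html.parser")

p_tag = soup.find("p")
tag_name = p_tag.name              # the name of the tag
tag_text = p_tag.get_text()        # the text between <p> and </p>
tag_classes = p_tag.get("class")   # class can hold several values, so it comes back as a list
tag_id = p_tag.get("id")           # single-valued attributes come back as strings
link_href = soup.find("a").get("href")

print(tag_name, tag_text, tag_classes, tag_id, link_href)
```

Attributes such as `class` and `id` are what we will later pass to `find()` and `find_all()` to narrow down the elements we want.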



#### Step 1: Obtaining our Data

In [5]:
# We will first download our webpage from the server that contains 
# our web page through the use of the get() method from the requests library.
# Upon doing this, we also check the status_code of our download.
# - A status_code starting with 2 indicates success.
# - A status_code starting with 4 or 5 indicates an error.
# ---
#  
page = requests.get('http://quotes.toscrape.com/')
page

<Response [200]>

In [6]:
# Run the following code to see an error when downloading a non-existent webpage.
# There is no page with the URL: http://quotes.toscrape.com/mypage
# ---
# 
page2 = requests.get('http://quotes.toscrape.com/mypage')
page2

<Response [404]>
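The status code rule described above (2xx for success, 4xx or 5xx for errors) can be sketched as a small helper. The function name `classify_status` is hypothetical and not part of `requests`; the example runs without any network access.

```python
from http import HTTPStatus

def classify_status(status_code):
    """Return a short label for an HTTP status code (hypothetical helper)."""
    if 200 <= status_code < 300:
        return "success"
    if 400 <= status_code < 500:
        return "client error"
    if 500 <= status_code < 600:
        return "server error"
    return "other"

# HTTPStatus gives the standard reason phrase for each code.
for code in (200, 404, 500):
    print(code, HTTPStatus(code).phrase, "->", classify_status(code))
```

With a real response, the number is available as `page.status_code`, and `page.raise_for_status()` raises an exception for 4xx and 5xx responses, which is a convenient way to stop early when a download fails.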

In [7]:
# Once we have successfully retrieved our page we can preview our document 
# by printing the first 1600 characters of the HTML document as shown below.
# ---
#
print(page.text[0:1600])

<!DOCTYPE html>
<html lang="en">
<head>
	<meta charset="UTF-8">
	<title>Quotes to Scrape</title>
    <link rel="stylesheet" href="/static/bootstrap.min.css">
    <link rel="stylesheet" href="/static/main.css">
</head>
<body>
    <div class="container">
        <div class="row header-box">
            <div class="col-md-8">
                <h1>
                    <a href="/" style="text-decoration: none">Quotes to Scrape</a>
                </h1>
            </div>
            <div class="col-md-4">
                <p>
                
                    <a href="/login">Login</a>
                
                </p>
            </div>
        </div>
    

<div class="row">
    <div class="col-md-8">

    <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">
        <span class="text" itemprop="text">“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”</span>
        <span>by <small class="author" itempr

From the above output, we can see some of the HTML that makes up the page; the data that we're interested in is held in the `<a class="tag" href="">` tags. We can also locate the HTML tag that contains our data by using the inspect tool within our browser. 
This is demonstrated in this [short video](https://www.youtube.com/watch?v=CwiRPmXhcLY).

#### Step 2: Parsing

In [8]:
# Once we have successfully downloaded our html document, 
# we can parse it and extract our desired text from the <a class="tag"> tags
# as shown below.
# ---
#

# We use BeautifulSoup, a popular Python library for web scraping, to parse our HTML document. 
# Parsing turns the HTML (stored in page.text) into a special object, 
# which we call soup, that the Beautiful Soup library can search and navigate. 
# In layman's terms, Beautiful Soup is reading the HTML and making sense of its structure.
# If we were working with an XML document, we would use the XML parser 'xml'.
# ---
# 
# 
soup = BeautifulSoup(page.text, "html.parser")

In [9]:
# We can then print out the HTML content of the page formatted nicely, 
# using the prettify() method as shown below:
# ---
#
print(soup.prettify())

<!DOCTYPE html>
<html lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Quotes to Scrape
  </title>
  <link href="/static/bootstrap.min.css" rel="stylesheet"/>
  <link href="/static/main.css" rel="stylesheet"/>
 </head>
 <body>
  <div class="container">
   <div class="row header-box">
    <div class="col-md-8">
     <h1>
      <a href="/" style="text-decoration: none">
       Quotes to Scrape
      </a>
     </h1>
    </div>
    <div class="col-md-4">
     <p>
      <a href="/login">
       Login
      </a>
     </p>
    </div>
   </div>
   <div class="row">
    <div class="col-md-8">
     <div class="quote" itemscope="" itemtype="http://schema.org/CreativeWork">
      <span class="text" itemprop="text">
       “The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”
      </span>
      <span>
       by
       <small class="author" itemprop="author">
        Albert Einstein
       </small>
       <a href="/author/Albert

#### Step 3: Extracting Required Elements

In [11]:
# Let's now extract data found in a specific tag i.e. data found in <a> tags.
# The result is a list of instances of <a> tags found within our document.
# NB: In this case, we will get all <a> tags using the find_all() method
# which will get all the instances of the specified tag in the web page.
# ---
# 
results = soup.find_all('a')
results

[<a href="/" style="text-decoration: none">Quotes to Scrape</a>,
 <a href="/login">Login</a>,
 <a href="/author/Albert-Einstein">(about)</a>,
 <a class="tag" href="/tag/change/page/1/">change</a>,
 <a class="tag" href="/tag/deep-thoughts/page/1/">deep-thoughts</a>,
 <a class="tag" href="/tag/thinking/page/1/">thinking</a>,
 <a class="tag" href="/tag/world/page/1/">world</a>,
 <a href="/author/J-K-Rowling">(about)</a>,
 <a class="tag" href="/tag/abilities/page/1/">abilities</a>,
 <a class="tag" href="/tag/choices/page/1/">choices</a>,
 <a href="/author/Albert-Einstein">(about)</a>,
 <a class="tag" href="/tag/inspirational/page/1/">inspirational</a>,
 <a class="tag" href="/tag/life/page/1/">life</a>,
 <a class="tag" href="/tag/live/page/1/">live</a>,
 <a class="tag" href="/tag/miracle/page/1/">miracle</a>,
 <a class="tag" href="/tag/miracles/page/1/">miracles</a>,
 <a href="/author/Jane-Austen">(about)</a>,
 <a class="tag" href="/tag/aliteracy/page/1/">aliteracy</a>,
 <a class="tag" href

In [12]:
# Determining the number of <a> tags found
# ---
#
len(results)

55

In [13]:
# To get the text within the first <a> tag, we can perform our extraction as follows: 
# ---
# 

# Method 1
# ---
#
soup.find_all('a')[0].get_text()

# or
# soup.find_all('a')[0].text

'Quotes to Scrape'

In [15]:
soup.find('a').get_text()

'Quotes to Scrape'

In [16]:
# Method 2
# ---
# Uncomment the following code
# ---
#
# soup.find('a').get_text()

# or
soup.find_all('a')[0].text

'Quotes to Scrape'

In [17]:
# Checking the last item
# ---
# 
results[-1]

<a href="https://scrapinghub.com">Scrapinghub</a>

In [18]:
# We can also specify the tag that we would like to retrieve our text data from, 
# in this case getting the text from the tag with the following attributes:
# ---
# - class="tag"
# - href="/tag/change/page/1/"
# This tag would be:
# - <a class="tag" href="/tag/change/page/1/">change</a>
# ---
# 

# We perform the following 
# ---
#
results = soup.find_all('a', attrs={'class':'tag', 'href': '/tag/change/page/1/'})[0].get_text()
results

# Method 2
# ---
# Uncomment the following lines
# ---
#
# results = soup.find_all('a', {'class':'tag', 'href': '/tag/change/page/1/'})[0].get_text()
# results

'change'
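Since the goal of this example is the data inside the `<a class="tag">` tags, it may help to see how all of them can be selected at once. The sketch below uses a made-up snippet that mimics the link markup of the quotes page; `class_` is Beautiful Soup's keyword for filtering on the `class` attribute, and `select()` accepts a CSS selector.

```python
from bs4 import BeautifulSoup

# Made-up snippet that mimics the link markup of the quotes page.
html = """
<a href="/login">Login</a>
<a class="tag" href="/tag/change/page/1/">change</a>
<a class="tag" href="/tag/world/page/1/">world</a>
"""
soup = BeautifulSoup(html, "html.parser")

# class_ (with a trailing underscore) filters on the class attribute.
tag_links = soup.find_all("a", class_="tag")
tag_names = [link.get_text() for link in tag_links]

# The same selection written as a CSS selector.
tag_names_css = [link.get_text() for link in soup.select("a.tag")]

print(tag_names)
```

Both approaches skip the Login link because it has no `tag` class.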

In [19]:
# We then create empty lists that we will use to store
# content fetched from the <a> tags
# ---
#
link_content = []
link_url = []

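# urljoin builds a full link from our domain and a scraped href
from urllib.parse import urljoin

# Getting all our <a> tags 
# ---
#
results = soup.find_all('a')

# We then loop through these tags
for result in results:
   
    # Getting our text from each tag
    text = result.get_text()

    # We join our domain with the href that we scrape in order to form a full link.
    # urljoin leaves links that are already absolute (e.g. https://scrapinghub.com) unchanged.
    link = urljoin('http://quotes.toscrape.com', result.get('href'))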

    # Then appending the text to our link_content list
    link_content.append(text)

    # Then appending the link to our link_url list
    link_url.append(link)
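One detail worth checking when building full links is that not every `href` is relative: the last link on the page points to `https://scrapinghub.com`, and a tag may have no `href` at all. The sketch below, using a hand-written list of sample `href` values, shows how `urljoin` from the standard library handles these cases.

```python
from urllib.parse import urljoin

base_url = "http://quotes.toscrape.com"

# Sample href values: two relative, one absolute, one missing (None).
hrefs = ["/", "/login", "https://scrapinghub.com", None]

full_links = []
for href in hrefs:
    # Skip tags that have no href attribute at all.
    if href is None:
        continue
    # Relative links are resolved against base_url; absolute links are kept as they are.
    full_links.append(urljoin(base_url, href))

print(full_links)
```

Plain string concatenation would instead glue the domain in front of the absolute link and produce an invalid URL.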

In [20]:
# Previewing our link_content list by checking first 10 items
# ---
#
link_content[0:10]

['Quotes to Scrape',
 'Login',
 '(about)',
 'change',
 'deep-thoughts',
 'thinking',
 'world',
 '(about)',
 'abilities',
 'choices']

In [21]:
# Previewing our link_url list by checking first 10 items
# ---
#
link_url[0:10]

['http://quotes.toscrape.com/',
 'http://quotes.toscrape.com/login',
 'http://quotes.toscrape.com/author/Albert-Einstein',
 'http://quotes.toscrape.com/tag/change/page/1/',
 'http://quotes.toscrape.com/tag/deep-thoughts/page/1/',
 'http://quotes.toscrape.com/tag/thinking/page/1/',
 'http://quotes.toscrape.com/tag/world/page/1/',
 'http://quotes.toscrape.com/author/J-K-Rowling',
 'http://quotes.toscrape.com/tag/abilities/page/1/',
 'http://quotes.toscrape.com/tag/choices/page/1/']

#### Step 4: Saving our Data

In [22]:
# Finally, we save the scraped contents in a dataframe and preview our data as shown
# ---
#
df = pd.DataFrame({"link_content": link_content, "link_url": link_url})
df.head()

| | link_content | link_url |
|---|---|---|
| 0 | Quotes to Scrape | http://quotes.toscrape.com/ |
| 1 | Login | http://quotes.toscrape.com/login |
| 2 | (about) | http://quotes.toscrape.com/author/Albert-Einstein |
| 3 | change | http://quotes.toscrape.com/tag/change/page/1/ |
| 4 | deep-thoughts | http://quotes.toscrape.com/tag/deep-thoughts/p... |
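A dataframe only lives in memory, so to keep the scraped data we would usually also write it out. The sketch below, which assumes a two-row stand-in for the dataframe above, writes CSV text to an in-memory buffer so that it runs anywhere; passing a file name (for example the hypothetical `links.csv`) to `to_csv()` would write to disk instead.

```python
import io
import pandas as pd

# A small stand-in for the dataframe built above.
df = pd.DataFrame({
    "link_content": ["Quotes to Scrape", "Login"],
    "link_url": ["http://quotes.toscrape.com/", "http://quotes.toscrape.com/login"],
})

# index=False leaves out the row numbers, so no extra unnamed column is written.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

# Reading the CSV text back gives the same columns and values.
restored = pd.read_csv(io.StringIO(csv_text))
print(csv_text)
```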


### Example 2: Scraping for Tables

In [None]:
# Example 2
# ---
# Get a list of African cities from the given URL and store the city and respective population.
# ---
# Website URL = https://en.wikipedia.org/wiki/List_of_cities_in_Africa_by_population
# ---
# YOUR CODE GOES BELOW
# 

#### Step 1: Obtaining our Data

In [23]:
# Fetching our data from Wikipedia. A status of 200 means success.
# 
# ---
#
page = requests.get('https://en.wikipedia.org/wiki/List_of_cities_in_Africa_by_population')
page 

<Response [200]>

#### Step 2: Parsing

In [24]:
# Parsing our data using BeautifulSoup
# ---
#
soup = BeautifulSoup(page.text, "html.parser")

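Using the browser's inspect feature, we identify our source table as having the following tag: 

`<table class="sortable wikitable jquery-tablesorter">`

The `jquery-tablesorter` class does not appear in the HTML that `requests` downloads (the downloaded table is `<table class="sortable wikitable">`), most likely because a script adds it in the browser, so we search for `sortable wikitable` instead.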

#### Step 3: Extracting Required Elements

In [25]:
# Using the browser inspect feature, we identify our source table to have the following tag: 
# ---
# <table class="sortable wikitable">
# ---
#
right_table = soup.find('table', {'class': 'sortable wikitable'}) 
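The difference between the `jquery-tablesorter` tag seen in the browser and the `sortable wikitable` value used in the code matters because of how class matching works. When a string containing several classes is passed, Beautiful Soup compares it with the exact value of the `class` attribute, whereas a CSS selector only requires the listed classes to be present. A minimal sketch on made-up markup:

```python
from bs4 import BeautifulSoup

# Made-up markup: the table as downloaded, and as seen in the browser
# with an extra class added.
served = '<table class="sortable wikitable"><tr><td>1</td></tr></table>'
in_browser = '<table class="sortable wikitable jquery-tablesorter"><tr><td>1</td></tr></table>'

results = {}
for label, html in (("served", served), ("in browser", in_browser)):
    soup = BeautifulSoup(html, "html.parser")
    # Matches only if the whole class attribute is exactly this string.
    exact = soup.find("table", {"class": "sortable wikitable"})
    # Matches any table that has both classes, in any order, with or without extras.
    by_css = soup.select_one("table.sortable.wikitable")
    results[label] = (exact is not None, by_css is not None)

print(results)
```

The CSS selector form is the more forgiving choice when the class list might change.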


In [29]:
right_table

<table class="sortable wikitable">
<tbody><tr>
<th style="width:4">Rank
</th>
<th style="width:60">City
</th>
<th style="width:60">Country
</th>
<th style="width:40">Population
</th>
<th style="width:6">Date of estimate
</th></tr>
<tr>
<td>1
</td>
<td><i><b><a href="/wiki/Kinshasa" title="Kinshasa">Kinshasa</a></b></i>
</td>
<td><span class="flagicon"><img alt="" class="thumbborder" data-file-height="600" data-file-width="800" decoding="async" height="15" src="//upload.wikimedia.org/wikipedia/commons/thumb/6/6f/Flag_of_the_Democratic_Republic_of_the_Congo.svg/20px-Flag_of_the_Democratic_Republic_of_the_Congo.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/6/6f/Flag_of_the_Democratic_Republic_of_the_Congo.svg/31px-Flag_of_the_Democratic_Republic_of_the_Congo.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/6/6f/Flag_of_the_Democratic_Republic_of_the_Congo.svg/40px-Flag_of_the_Democratic_Republic_of_the_Congo.svg.png 2x" width="20"/> </span><a href="/wiki/Demo

In [None]:
# We can also preview it with print(), again to confirm that we have the right table
#  
print(right_table)

<table class="sortable wikitable">
<tbody><tr>
<th style="width:4">Rank
</th>
<th style="width:60">City
</th>
<th style="width:60">Country
</th>
<th style="width:40">Population
</th>
<th style="width:6">Date of estimate
</th></tr>
<tr>
<td>1
</td>
<td><i><b><a href="/wiki/Kinshasa" title="Kinshasa">Kinshasa</a></b></i>
</td>
<td><span class="flagicon"><img alt="" class="thumbborder" data-file-height="600" data-file-width="800" decoding="async" height="15" src="//upload.wikimedia.org/wikipedia/commons/thumb/6/6f/Flag_of_the_Democratic_Republic_of_the_Congo.svg/20px-Flag_of_the_Democratic_Republic_of_the_Congo.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/6/6f/Flag_of_the_Democratic_Republic_of_the_Congo.svg/31px-Flag_of_the_Democratic_Republic_of_the_Congo.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/6/6f/Flag_of_the_Democratic_Republic_of_the_Congo.svg/40px-Flag_of_the_Democratic_Republic_of_the_Congo.svg.png 2x" width="20"/> </span><a href="/wiki/Demo

In [30]:
# Getting the table body rows
# ---
#
rows = right_table.find_all('tr')
rows

[<tr>
 <th style="width:4">Rank
 </th>
 <th style="width:60">City
 </th>
 <th style="width:60">Country
 </th>
 <th style="width:40">Population
 </th>
 <th style="width:6">Date of estimate
 </th></tr>,
 <tr>
 <td>1
 </td>
 <td><i><b><a href="/wiki/Kinshasa" title="Kinshasa">Kinshasa</a></b></i>
 </td>
 <td><span class="flagicon"><img alt="" class="thumbborder" data-file-height="600" data-file-width="800" decoding="async" height="15" src="//upload.wikimedia.org/wikipedia/commons/thumb/6/6f/Flag_of_the_Democratic_Republic_of_the_Congo.svg/20px-Flag_of_the_Democratic_Republic_of_the_Congo.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/6/6f/Flag_of_the_Democratic_Republic_of_the_Congo.svg/31px-Flag_of_the_Democratic_Republic_of_the_Congo.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/6/6f/Flag_of_the_Democratic_Republic_of_the_Congo.svg/40px-Flag_of_the_Democratic_Republic_of_the_Congo.svg.png 2x" width="20"/> </span><a href="/wiki/Democratic_Republic_of_the_C

In [31]:
# Getting the required text data from our table body rows
# ---
#
rank = []
cities = []
countries = []
population = [] 

for row in rows:
    cells = row.find_all('td') 

    # Skip rows without enough data cells, e.g. the header row, which only has <th> cells
    if len(cells) >= 4:
        rank.append(cells[0].text.strip())
        cities.append(cells[1].text.strip())
        countries.append(cells[2].text.strip())
        population.append(cells[3].text.strip()) 
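The row-by-row extraction above can also be written so that the column names come from the header row rather than from fixed positions. The sketch below runs on a small hand-written table that reuses the first two rows of the real data; the variable names are illustrative assumptions.

```python
from bs4 import BeautifulSoup

# Hand-written table that mirrors the structure of the Wikipedia table.
html = """
<table>
  <tr><th>Rank</th><th>City</th><th>Country</th><th>Population</th></tr>
  <tr><td>1</td><td><a href="/wiki/Kinshasa">Kinshasa</a></td><td>Democratic Republic of the Congo</td><td>15,628,000</td></tr>
  <tr><td>2</td><td><a href="/wiki/Lagos">Lagos</a></td><td>Nigeria</td><td>15,388,000</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
table = soup.find("table")

# Column names come from the <th> cells of the header row.
headers = [th.get_text().strip().lower() for th in table.find_all("th")]

records = []
for row in table.find_all("tr"):
    cells = row.find_all("td")
    # The header row has no <td> cells, so it is skipped here.
    if len(cells) != len(headers):
        continue
    values = [cell.get_text().strip() for cell in cells]
    records.append(dict(zip(headers, values)))

print(records)
```

A list of dictionaries like `records` can be passed straight to `pd.DataFrame()`.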

In [32]:
# Previewing our lists
# ---
# 
print(rank[0:5])
print(cities[0:5])
print(countries[0:5])
print(population[0:5]) 

['1', '2', '3', '4', '5']
['Kinshasa', 'Lagos', 'Cairo', 'Giza', 'Dar es Salaam']
['Democratic Republic of the Congo', 'Nigeria', 'Egypt', 'Egypt', 'Tanzania']
['15,628,000', '15,388,000', '10,025,657', '9,200,000', '7,100,000']


#### Step 4: Saving our Data

In [33]:
# Saving the scraped data to a dataframe
# ---
# 
countries_df = pd.DataFrame({"rank": rank, "city": cities, "country": countries, "population": population})
countries_df.sample(10)

| | rank | city | country | population |
|---|---|---|---|---|
| 58 | 59 | Nouakchott | Mauritania | 958399 |
| 67 | 68 | Ilorin | Nigeria | 777667 |
| 30 | 31 | Bamako | Mali | 2713230 |
| 1 | 2 | Lagos | Nigeria | 15388000 |
| 66 | 67 | Blantyre | Malawi | 800264 |
| 92 | 93 | El Mahalla el-Kubra | Egypt | 543271 |
| 46 | 47 | Port Elizabeth (Gqeberha) | South Africa | 1213060 |
| 20 | 21 | Kano | Nigeria | 3550000 |
| 97 | 98 | Aba | Nigeria | 534265 |
| 76 | 77 | Mwanza | Tanzania | 706453 |
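The scraped population values are strings such as `'15,628,000'`, so they need converting before any arithmetic or sorting. Below is a minimal sketch of one way to do this; `parse_population` is a hypothetical helper, and the handling of footnote markers such as `[3]` is an assumption about what a scraped cell might contain, not something shown in the output above.

```python
import re

def parse_population(text):
    """Convert text like '15,628,000' to an int, or None if it is not a number."""
    cleaned = re.sub(r"\[[^\]]*\]", "", text)    # drop footnote markers such as [3] (assumed)
    cleaned = cleaned.replace(",", "").strip()   # drop thousands separators
    return int(cleaned) if cleaned.isdigit() else None

population = ["15,628,000", "15,388,000", "10,025,657"]
population_numbers = [parse_population(value) for value in population]
print(population_numbers)
```

Returning `None` for unexpected text keeps one odd cell from stopping the whole conversion.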


### Example 3: Scraping for Articles

In [None]:
# Example 3 
# ---
# Scrape the given article from the DailyPost Nigeria
# ---
# Website URL = https://dailypost.ng/2020/09/29/danbatta-inaugurates-evaluation-committee-for-2020-research-proposals/
# ---
# YOUR CODE GOES BELOW
# 

#### Step 1: Obtaining our Data

In [34]:
page = requests.get('https://dailypost.ng/2020/09/29/danbatta-inaugurates-evaluation-committee-for-2020-research-proposals/')
page

<Response [200]>

#### Step 2: Parsing

In [35]:
soup = BeautifulSoup(page.text, "html.parser")

#### Step 3: Extracting Required Elements

In [36]:
# Getting our heading content
# ---
# This is our tag:
# <h1 class="mvp-post-title left entry-title" itemprop="headline">
# ---
#
article_heading = soup.find('h1', {'class': 'mvp-post-title left entry-title'}).get_text()
article_heading

'Danbatta inaugurates Evaluation Committee for 2020 research proposals'

In [37]:
# Getting our article content
# ---
# Target tags: 
# All <p> tags contained in <div id="mvp-content-main">
# ---
#
article = soup.find('div', {'id': 'mvp-content-main'})
article

<div class="left relative" id="mvp-content-main">
<div class="ai-viewports ai-viewport-3 ai-insert-7-58206922" data-block="7" data-code="PGRpdiBjbGFzcz0nY29kZS1ibG9jayBjb2RlLWJsb2NrLTcnIHN0eWxlPSdtYXJnaW46IDhweCAwOyBjbGVhcjogYm90aDsnPgo8ZGl2IGFsaWduPSJjZW50ZXIiPgo8ZGl2IGlkPSJzdGlja3l1bml0Ij4gCjwhLS0gLzE0MDAxNjM2L0RQX0xlYWRlcmJvYXJkXzIgLS0+CjxkaXYgaWQ9ImRpdi1ncHQtYWQtMTUwMDM4Njk1MzI4MS05Ij48c2NyaXB0Pgpnb29nbGV0YWcuY21kLnB1c2goZnVuY3Rpb24oKSB7IGdvb2dsZXRhZy5kaXNwbGF5KCdkaXYtZ3B0LWFkLTE1MDAzODY5NTMyODEtOScpOyB9KTsKPC9zY3JpcHQ+PC9kaXY+CjwvZGl2Pgo8L2Rpdj4gCgo8L2Rpdj4K" data-insertion-no-dbg="" data-insertion-position="prepend" data-selector=".ai-insert-7-58206922" style="margin: 8px 0; clear: both;"></div>
<div class="code-block code-block-4" style="margin: 8px auto; text-align: center; display: block; clear: both;">
<div align="center">
<div id="div-gpt-ad-1500386953281-9"><script defer="" src="data:text/javascript;base64,Z29vZ2xldGFnLmNtZC5wdXNoKGZ1bmN0aW9uKCl7Z29vZ2xldGFnLmRpc3BsYXkoJ2Rp

In [38]:
# Let's find all the p tags that contain the article text 
# ---
#
p_tags = article.find_all('p')

# We then strip all the surrounding whitespace.
# ---
#
p_tags_text = [tag.get_text().strip() for tag in p_tags]
p_tags_text

['The Executive Vice Chairman and Chief Executive of the Nigerian Communications Commission, Prof. Umar Garba Danbatta, has inaugurated a 15-member Evaluation Committee for the assessment of the 2020 Telecommunications based research from Academics in the Nigerian tertiary institutions.',
 'The Committee, chaired by Prof. Mu’azu Bashir, a Professor of Computer and Control Engineering and Head of Computer Engineering Department at Ahmadu Bello University, Zaria was inaugurated at the Commission’s Head Office in Abuja on Wednesday, September 24, 2020.',
 'Speaking during the inauguration, Danbatta said that the initiative speaks to the Commission’s commitment towards encouraging the development of indigenous innovative solutions that impact not only the Telecom industry/ICT sector positively but also the nation as a whole.',
 '“We want to continuously support research projects that can lead to the development of new products and services in the industry as the key enabler of the nation’s

#### Step 4: Saving our Data

In [39]:
# Combine list items into string.
article = ' '.join(p_tags_text)
article

'The Executive Vice Chairman and Chief Executive of the Nigerian Communications Commission, Prof. Umar Garba Danbatta, has inaugurated a 15-member Evaluation Committee for the assessment of the 2020 Telecommunications based research from Academics in the Nigerian tertiary institutions. The Committee, chaired by Prof. Mu’azu Bashir, a Professor of Computer and Control Engineering and Head of Computer Engineering Department at Ahmadu Bello University, Zaria was inaugurated at the Commission’s Head Office in Abuja on Wednesday, September 24, 2020. Speaking during the inauguration, Danbatta said that the initiative speaks to the Commission’s commitment towards encouraging the development of indigenous innovative solutions that impact not only the Telecom industry/ICT sector positively but also the nation as a whole. “We want to continuously support research projects that can lead to the development of new products and services in the industry as the key enabler of the nation’s digital econ
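The article steps can be condensed into one short pipeline: find the content container, keep only its `<p>` tags, drop empty paragraphs, and join the rest. The sketch below uses a made-up page fragment with an advert block; only the `mvp-content-main` id is taken from the example above.

```python
from bs4 import BeautifulSoup

# Made-up fragment: an advert block followed by article paragraphs.
html = """
<div id="mvp-content-main">
  <div class="code-block"><script>/* advert */</script></div>
  <p> First paragraph of the article. </p>
  <p></p>
  <p>Second paragraph.</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
content = soup.find("div", {"id": "mvp-content-main"})

# Only <p> tags are read, so the advert markup is ignored.
paragraphs = [p.get_text().strip() for p in content.find_all("p")]

# Drop paragraphs that are empty after stripping whitespace.
paragraphs = [text for text in paragraphs if text]

article_text = " ".join(paragraphs)
print(article_text)
```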

## <font color='#2F4F4F'>Challenges</font>

### <font color="green">Challenge 1</font>

In [40]:
# Challenge 1
# ---
# Write a Python program to extract h2 tags content from the Y Combinator website.
# --- 
# Website URL = https://www.ycombinator.com/about/
# ---
# YOUR CODE GOES BELOW
# 

link = 'https://www.ycombinator.com/about/'
site = requests.get(link)
site

<Response [200]>

In [46]:
soup = BeautifulSoup(site.text, 'html.parser')

content = soup.find_all('h2')
len(content)

1

In [50]:
for cont in content:
    print(cont.get_text())

Footer


### <font color="green">Challenge 2</font>

In [51]:
# Challenge 2
# ---
# Write a Python program that will get the top 10 ads on the following e-commerce website.
# Return product title and price.
# --- 
# Website URL = https://www.jumia.co.ke/space-heaters-accessories/
# ---
# YOUR CODE GOES BELOW
# 
url = 'https://www.jumia.co.ke/space-heaters-accessories/'
page = requests.get(url)
page


<Response [200]>

In [52]:
soup = BeautifulSoup(page.text,'html.parser')
soup.prettify()

'<!DOCTYPE html>\n<html dir="ltr" lang="en">\n <head>\n  <meta charset="utf-8"/>\n  <title>\n   Heaters - Buy Space Heaters Online | Jumia Kenya\n  </title>\n  <meta content="product" property="og:type"/>\n  <meta content="Jumia Kenya" property="og:site_name"/>\n  <meta content="Heaters - Buy Space Heaters Online | Jumia Kenya" property="og:title"/>\n  <meta content="Buy Space heaters online at Jumia Kenya. Shop room heaters for home, tents and office at best price. Order now to get free shipping and pay on delivery" property="og:description"/>\n  <meta content="/space-heaters-accessories/" property="og:url"/>\n  <meta content="https://ke.jumia.is/cms/icons/jumialogo-x-4.png" property="og:image"/>\n  <meta content="en_KE" property="og:locale"/>\n  <meta content="Heaters - Buy Space Heaters Online | Jumia Kenya" name="title"/>\n  <meta content="index,follow" name="robots"/>\n  <meta content="Buy Space heaters online at Jumia Kenya. Shop room heaters for home, tents and office at best pr

In [54]:
ad_div = soup.find_all('div', attrs={'class':'info'})
len(ad_div)

48

In [87]:
ad_title = []
ad_price = []

# Each product card keeps its name in <h3 class="name"> and its price in <div class="prc">
for ad in ad_div:
    title_header = ad.find('h3', attrs={'class': 'name'}).get_text()
    ad_title.append(title_header)
    price_div = ad.find('div', attrs={'class': 'prc'}).get_text()
    ad_price.append(price_div)

# Storing the results in a dataframe and keeping the first 10 ads
df = pd.DataFrame({'Title': ad_title, 'Price': ad_price})
df.head(10)

| | Title | Price |
|---|---|---|
| 0 | Mika MH302 - 2 Way Quartz Heater, 2000W - Grey | KSh 5,295 |
| 1 | Tronic Safest Oil Radiator Room Heater For Bab... | KSh 11,900 - KSh 14,500 |
| 2 | Estia Quartz Portable Electric Room Heater | KSh 2,499 |
| 3 | Pinex Portable Hot/ Warm/ Electric Room Heater... | KSh 3,499 |
| 4 | Nunix Hot/ Warm/ Electric Room Heater/ Warmer... | KSh 3,390 |
| 5 | Mika MH101 - Fan Heater, 1000/2000W - White. | KSh 2,545 |
| 6 | Estia Portable Room Space Heater | KSh 3,999 |
| 7 | Mika MH103 - Quartz Heater, 800-1600W - Red & ... | KSh 3,795 |
| 8 | Nunix Electric Room Heater-Perfect For Cold Se... | KSh 3,299 |
| 9 | Rashnik Electric Fan Heater | KSh 2,349 |
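The prices above are text, and some are ranges such as `KSh 11,900 - KSh 14,500`. To compare or sort products by price, they would first need turning into numbers. A minimal sketch, where `parse_price_range` is a hypothetical helper and not part of any library:

```python
import re

def parse_price_range(price_text):
    """Return (low, high) as integers from text such as 'KSh 5,295', or None."""
    # Find every run of digits and commas, then remove the commas.
    amounts = [int(match.replace(",", "")) for match in re.findall(r"\d[\d,]*", price_text)]
    if not amounts:
        return None
    return min(amounts), max(amounts)

single_price = parse_price_range("KSh 5,295")
range_price = parse_price_range("KSh 11,900 - KSh 14,500")
print(single_price, range_price)
```

A single price comes back with the same value as both low and high, so every product can be handled the same way.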


### <font color="green">Challenge 3</font>

In [157]:
# Challenge 3
# ---
# Scrape for quotes and author from the given URL then store your data in a pandas dataframe.
# --- 
# Website URL = http://quotes.toscrape.com/
# ---
# YOUR CODE GOES BELOW
# 
url = 'http://quotes.toscrape.com/'
page = requests.get(url)
page

<Response [200]>

In [164]:
soup = BeautifulSoup(page.text, 'html.parser')

quote_div = soup.find_all('span', attrs={'class':'text'})
auth_div = soup.find_all('small', attrs={'class':'author'})
auth_div

[<small class="author" itemprop="author">Albert Einstein</small>,
 <small class="author" itemprop="author">J.K. Rowling</small>,
 <small class="author" itemprop="author">Albert Einstein</small>,
 <small class="author" itemprop="author">Jane Austen</small>,
 <small class="author" itemprop="author">Marilyn Monroe</small>,
 <small class="author" itemprop="author">Albert Einstein</small>,
 <small class="author" itemprop="author">André Gide</small>,
 <small class="author" itemprop="author">Thomas A. Edison</small>,
 <small class="author" itemprop="author">Eleanor Roosevelt</small>,
 <small class="author" itemprop="author">Steve Martin</small>]

In [166]:
quote = []
author = []

for quotes, auth in zip(quote_div,auth_div):
    author.append(auth.text)
    quote.append(quotes.text)
    

df = pd.DataFrame({'Author':author, 'Quote':quote})
df.head()

| | Author | Quote |
|---|---|---|
| 0 | Albert Einstein | “The world as we have created it is a process ... |
| 1 | J.K. Rowling | “It is our choices, Harry, that show what we t... |
| 2 | Albert Einstein | “There are only two ways to live your life. On... |
| 3 | Jane Austen | “The person, be it gentleman or lady, who has ... |
| 4 | Marilyn Monroe | “Imperfection is beauty, madness is genius and... |
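Pairing two separately collected lists with `zip()` works when every quote has exactly one author element, but the pairs would silently shift if one were missing. An alternative sketch is to loop over each `<div class="quote">` container and read both fields from inside it. The markup below is a simplified stand-in with placeholder quote text.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the quotes page; the quote text is a placeholder.
html = """
<div class="quote">
  <span class="text">“Quote one.”</span>
  <span>by <small class="author">Albert Einstein</small></span>
</div>
<div class="quote">
  <span class="text">“Quote two.”</span>
  <span>by <small class="author">Jane Austen</small></span>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

records = []
for block in soup.find_all("div", class_="quote"):
    text = block.find("span", class_="text")
    author = block.find("small", class_="author")
    # Skip a block if either piece is missing, so pairs never get misaligned.
    if text is None or author is None:
        continue
    records.append({"Author": author.get_text(), "Quote": text.get_text()})

print(records)
```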


### <font color="green">Challenge 4</font>

In [88]:
# Challenge 4
# ---
# Write a Python program to download IMDB's Popular 100 movie data. 
# Return (movie name, release year and imdb rating).
# --- 
# Website URL = https://www.imdb.com/chart/moviemeter
# ---
# YOUR CODE GOES BELOW
# 
url = 'https://www.imdb.com/chart/moviemeter'
page = requests.get(url)
page

<Response [200]>

In [116]:
soup = BeautifulSoup(page.text, 'html.parser')
right_table = soup.find('table', {'class': 'chart full-width'}) 
right_table

<table class="chart full-width" data-caller-name="chart-moviemeter">
<colgroup>
<col class="chartTableColumnPoster"/>
<col class="chartTableColumnTitle"/>
<col class="chartTableColumnIMDbRating"/>
<col class="chartTableColumnYourRating"/>
<col class="chartTableColumnWatchlistRibbon"/>
</colgroup>
<thead>
<tr>
<th></th>
<th>Rank &amp; Title</th>
<th>IMDb Rating</th>
<th>Your Rating</th>
<th></th>
</tr>
</thead>
<tbody class="lister-list">
<tr>
<td class="posterColumn">
<span data-value="1" name="rk"></span>
<span data-value="7.2" name="ir"></span>
<span data-value="1.6583616E12" name="us"></span>
<span data-value="108928" name="nv"></span>
<span data-value="-3.8" name="ur"></span>
<a href="/title/tt11866324/"> <img alt="Prey" height="67" src="https://m.media-amazon.com/images/M/MV5BMDBlMDYxMDktOTUxMS00MjcxLWE2YjQtNjNhMjNmN2Y3ZDA1XkEyXkFqcGdeQXVyMTM1MTE1NDMx._V1_UY67_CR0,0,45,67_AL_.jpg" width="45"/>
</a> </td>
<td class="titleColumn">
<a href="/title/tt11866324/" title="Dan Trachtenberg

In [117]:

rows = soup.find_all('tr')
rows

[<tr>
 <th></th>
 <th>Rank &amp; Title</th>
 <th>IMDb Rating</th>
 <th>Your Rating</th>
 <th></th>
 </tr>,
 <tr>
 <td class="posterColumn">
 <span data-value="1" name="rk"></span>
 <span data-value="7.2" name="ir"></span>
 <span data-value="1.6583616E12" name="us"></span>
 <span data-value="108928" name="nv"></span>
 <span data-value="-3.8" name="ur"></span>
 <a href="/title/tt11866324/"> <img alt="Prey" height="67" src="https://m.media-amazon.com/images/M/MV5BMDBlMDYxMDktOTUxMS00MjcxLWE2YjQtNjNhMjNmN2Y3ZDA1XkEyXkFqcGdeQXVyMTM1MTE1NDMx._V1_UY67_CR0,0,45,67_AL_.jpg" width="45"/>
 </a> </td>
 <td class="titleColumn">
 <a href="/title/tt11866324/" title="Dan Trachtenberg (dir.), Amber Midthunder, Dakota Beavers">Prey</a>
 <span class="secondaryInfo">(2022)</span>
 <div class="velocity">1
 (no change)
 </div>
 </td>
 <td class="ratingColumn imdbRating">
 <strong title="7.2 based on 108,928 user ratings">7.2</strong>
 </td>
 <td class="ratingColumn">
 <div class="seen-widget seen-widget-tt1

In [156]:
movie_name = []
release_year = [] 
imdb_rating = []

for row in rows:
    cell = row.find_all('td')
    cells = row.find_all('span', attrs={'class':'secondaryInfo'}) 
    cell_rtn = row.find_all('td', attrs={'class':'ratingColumn imdbRating'}) 

    # NB: this check keeps only rows with more than one secondaryInfo span, so the
    # header row and rows with a single span (such as Prey above) are left out.
    if len(cells) > 1:
        # cell[1] is the whole title column, so its text also includes the year and rank change
        movie_name.append(cell[1].text.strip())
        release_year.append(cells[0].text.strip())
        imdb_rating.append(cell_rtn[0].text.strip())
    
df = pd.DataFrame({'Movie Name': movie_name, 'Release Year': release_year, 'imdb Rating': imdb_rating})
df.head()

| | Movie Name | Release Year | imdb Rating |
|---|---|---|---|
| 0 | Bullet Train\n(2022)\n2\n(\n\n1) | (2022) | 7.5 |
| 1 | Thirteen Lives\n(2022)\n3\n(\n\n13) | (2022) | 7.8 |
| 2 | The Gray Man\n(2022)\n6\n(\n\n4) | (2022) | 6.5 |
| 3 | Elvis\n(2022)\n7\n(\n\n10) | (2022) | 7.6 |
| 4 | Day Shift\n(2022)\n8\n(\n\n182) | (2022) | 6.1 |
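The movie names above still carry the year and rank-change text because the whole title cell is read. A possible refinement, sketched on simplified markup modelled on the rows printed earlier, is to read the name from the `<a>` tag inside the title column and the year from the first `secondaryInfo` span. The inner span in the second row's `velocity` block is an assumption about the page markup, and the live page may differ.

```python
from bs4 import BeautifulSoup

# Simplified rows modelled on the chart table; the second row's markup is assumed.
html = """
<table>
<tr><th>Rank &amp; Title</th><th>IMDb Rating</th></tr>
<tr>
  <td class="titleColumn"><a href="/title/tt11866324/">Prey</a>
    <span class="secondaryInfo">(2022)</span>
    <div class="velocity">1 (no change)</div></td>
  <td class="ratingColumn imdbRating"><strong>7.2</strong></td>
</tr>
<tr>
  <td class="titleColumn"><a>Bullet Train</a>
    <span class="secondaryInfo">(2022)</span>
    <div class="velocity">2 (<span class="secondaryInfo">1</span>)</div></td>
  <td class="ratingColumn imdbRating"><strong>7.5</strong></td>
</tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

movies = []
for row in soup.find_all("tr"):
    title_cell = row.find("td", class_="titleColumn")
    rating_cell = row.find("td", class_="imdbRating")
    # The header row has neither cell, so it is skipped.
    if title_cell is None or rating_cell is None:
        continue
    name = title_cell.find("a").get_text().strip()
    # find() returns the first secondaryInfo span, which holds the year in brackets.
    year = title_cell.find("span", class_="secondaryInfo").get_text().strip("()")
    rating = rating_cell.get_text().strip()
    movies.append((name, year, rating))

print(movies)
```

Because the check no longer depends on the number of `secondaryInfo` spans, rows with no rank change are kept as well.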
