### Python Data Science Introduction: Web Scraping
Web scraping (also called web data extraction or data scraping) is a way to collect structured data from web pages automatically.
It is useful when the public website you want data from has no API, or has one that gives only limited access to the data.
In this notebook I:
- Scraped data from the Zimbabwe Stock Exchange (ZSE) website
- Scraped emails from the same website
- Scraped images from the same website
- Scraped tables from the World population Wikipedia page and converted one of them into a DataFrame
- Used pandas (pd.read_html) to scrape a table from https://www.zse.co.zw/price-sheet
- Cleaned the data, although it had a lot of discrepancies
- Saved the new data set to a local file
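
The email step listed above is not shown later in this notebook. As a rough sketch of one way it could be done, assuming the page HTML is already available as a string, a regular expression can pull out strings that look like email addresses. The pattern, the helper name extract_emails, and the sample HTML are illustrative assumptions, and the pattern is deliberately simple rather than a full address validator.

```python
import re

# Simplified pattern for strings that look like email addresses
# (an assumption for illustration, not a full address validator)
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(html_text):
    """Return the unique email-like strings in html_text, in order of first appearance."""
    seen = []
    for match in EMAIL_PATTERN.findall(html_text):
        if match not in seen:
            seen.append(match)
    return seen


# Made-up sample HTML, not taken from the ZSE page
sample_html = '<p>Contact <a href="mailto:info@example.com">info@example.com</a> or help@example.org.</p>'
print(extract_emails(sample_html))
```

With a real page, the same helper could be called on the text returned by requests.get(url).text.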


In [210]:
#!pip install bs4
#!pip install lxml==4.6.4
#!pip install requests==2.26.0
#!pip install html5lib==1.1

In [211]:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from bs4 import BeautifulSoup # parses HTML so that we can search it (web scraping module)
import requests # downloads the web page

This project shows how to download data from a web page.
I am going to scrape data from the Zimbabwe Stock Exchange page and store it in a DataFrame.

### Data source: Zimbabwe Stock Exchange, 25/10/22

First we download the web page and store its HTML text in a variable called ZSE_data:

In [212]:
url='https://www.zse.co.zw/price-sheet'

In [213]:
ZSE_data=requests.get(url).text
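
Calling requests.get(url).text directly assumes the request succeeds. A slightly more defensive sketch is shown below. The helper name fetch_html, the 10-second timeout, and the User-Agent string are assumptions for illustration, not part of the original notebook.

```python
import requests


def fetch_html(url, timeout=10, session=None):
    """Download a page and return its HTML text, raising on HTTP errors."""
    # Use the given session if there is one, otherwise the requests module itself
    http = session if session is not None else requests
    # The User-Agent value is a placeholder; some sites reject requests without one
    headers = {"User-Agent": "zse-notebook-example/0.1"}
    response = http.get(url, headers=headers, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses
    response.raise_for_status()
    return response.text
```

It could then be used as ZSE_data = fetch_html('https://www.zse.co.zw/price-sheet'), so that a failed download stops the notebook instead of passing an error page on to the parser.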

In [214]:
# Create a BeautifulSoup object (soup) with the BeautifulSoup constructor

soup=BeautifulSoup(ZSE_data,'html.parser')

#### 1.  Scraping all links

In [215]:
# In HTML, a link (anchor) is represented by the <a> tag
for link in soup.find_all('a', href=True):
    print(link.get('href'))

https://www.zse.co.zw/
https://www.zse.co.zw/
https://www.zse.co.zw/rules-and-regulations/
https://www.zse.co.zw/faqs/
https://www.zse.co.zw/fungibility-faqs-2/
https://www.zse.co.zw/practice-and-procedures/
https://www.zse.co.zw/rules-legislation/
https://www.zse.co.zw/companies-2/
https://www.zse.co.zw/equity/
https://www.zse.co.zw/abc/
https://www.zse.co.zw/african-distillers-limited-2/
https://www.zse.co.zw/african-sun-limited-2/
https://www.zse.co.zw/amalgamated-regional-trading-art-holdings-limited-2/
https://www.zse.co.zw/ariston-holdings-limited-2/
https://www.zse.co.zw/axia-corporation-limited-2/
https://www.zse.co.zw/bindura-nickel-corporation-limited-2/
https://www.zse.co.zw/border-timbers-limited/
https://www.zse.co.zw/british-american-tobacco-zimbabwe-limited-2/
https://www.zse.co.zw/cafca-limited-2/
https://www.zse.co.zw/cbz-holdings-limited-2/
https://www.zse.co.zw/cfi-holdings-limited-2/
https://www.zse.co.zw/cottco-holdings-limited-2/
https://www.zse.co.zw/def/
https:/
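
The loop above prints every href as it appears in the page, duplicates included (the home page shows up twice). A small sketch of collecting unique absolute URLs instead is shown below. The sample HTML and the helper name collect_links are made up for illustration; urljoin from the standard library resolves relative paths against the page URL.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def collect_links(html_text, base_url):
    """Return unique absolute URLs from <a href=...> tags, in page order."""
    soup = BeautifulSoup(html_text, "html.parser")
    links = []
    for anchor in soup.find_all("a", href=True):
        absolute = urljoin(base_url, anchor["href"])  # resolves relative hrefs
        if absolute not in links:
            links.append(absolute)
    return links


# Made-up sample HTML, not the real ZSE page
sample_html = """
<a href="https://www.zse.co.zw/">Home</a>
<a href="https://www.zse.co.zw/">Logo</a>
<a href="/faqs/">FAQs</a>
<a>No href here</a>
"""
print(collect_links(sample_html, "https://www.zse.co.zw/price-sheet"))
```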

#### 2. Scraping all images

In [216]:
# In HTML, an image is represented by the <img> tag
for img in soup.find_all('img'):
    print(img)
    print(img.get('src'))

<img class="img-responsive" src="https://www.zse.co.zw/wp-content/uploads/2019/03/logo.png" style="height:75px; width:150px;"/>
https://www.zse.co.zw/wp-content/uploads/2019/03/logo.png
<img alt="Webp.net-gifmaker" class="fl-photo-img wp-image-2306" itemprop="image" src="https://www.zse.co.zw/wp-content/uploads/2019/03/Webp.net-gifmaker-1.gif" title="Webp.net-gifmaker"/>
https://www.zse.co.zw/wp-content/uploads/2019/03/Webp.net-gifmaker-1.gif
<img alt="Zimbabwe Stock Exchange" height="48" src="https://pbs.twimg.com/profile_images/1018818881023070209/jvwXgyQD_normal.jpg" width="48"/>
https://pbs.twimg.com/profile_images/1018818881023070209/jvwXgyQD_normal.jpg
<img alt="ZSE_ZW" height="48" src="https://pbs.twimg.com/profile_images/1018818881023070209/jvwXgyQD_normal.jpg" width="48"/>
https://pbs.twimg.com/profile_images/1018818881023070209/jvwXgyQD_normal.jpg


#### 3. Scraping data from HTML tables

Before scraping data from a website, it is important to examine the page contents and the way the data is organised on the site.
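
One quick way to examine how a table is organised is to count how many cells each row has before indexing into them. The sketch below does this on a small made-up table; the helper name cells_per_row and the sample HTML are assumptions for illustration.

```python
from collections import Counter

from bs4 import BeautifulSoup


def cells_per_row(html_text):
    """Count how many rows of the first <table> have each number of <td> cells."""
    soup = BeautifulSoup(html_text, "html.parser")
    table = soup.find("table")
    return Counter(len(row.find_all("td")) for row in table.find_all("tr"))


# Made-up table: a header row of <th> cells, a spacer row, and two data rows
sample_html = """
<table>
  <tr><th>Company Name</th><th>Opening Price</th><th>Closing Price</th></tr>
  <tr><td colspan="3"></td></tr>
  <tr><td>Company A</td><td>16</td><td>17.8</td></tr>
  <tr><td>Company B</td><td>3</td><td>3</td></tr>
</table>
"""
print(cells_per_row(sample_html))
```

If the counts differ from row to row, a fixed index such as cols[2] will not point at the same column in every row.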


In [217]:
# Get the contents of the webpage in text format and store in a variable called ZSE_data
ZSE_data = requests.get(url).text

soup=BeautifulSoup(ZSE_data,'html.parser')

In [218]:
# Find the first HTML table in the web page
# In HTML, a table is represented by the <table> tag

table=soup.find('table')

In [219]:
# Get all rows from the table
# In HTML, a table row is a <tr> tag and each cell in it is a <td> tag
for row in table.find_all('tr'):
    cols = row.find_all('td')
    if len(cols) < 4:   # skip rows that do not have at least four cells
        continue
    third_cell = cols[2].string    # value in the third cell; None if the cell is empty or holds nested tags
    fourth_cell = cols[3].string   # value in the fourth cell
    print('{}--->{}'.format(third_cell, fourth_cell))


None--->None
None--->None
None--->None
None--->None
None--->None
None--->None
None--->298
None--->None
None--->None
None--->None
15.05--->8200
None--->None
None--->None
None--->4.2
None--->None
None--->64.6379
None--->None
None--->7.1
None--->None
26--->26
None--->None
2995.975--->0
None--->None
None--->None
None--->None
None--->None
None--->None
None--->None
None--->None
None--->None
None--->49.7158
None--->None
None--->230.191
None--->None
45.7564--->45.6504
None--->None
84.6252--->84.8125
None--->None
None--->7.47
None--->None
None--->None
None--->None
None--->22
None--->None
None--->9.3992
None--->None
None--->25.8002
None--->None
None--->9.04
None--->None
2--->2
None--->None
17.55--->17.55
None--->None
None--->207.7526
None--->None
None--->310.6556
None--->None
120--->120
None--->None
None--->7.5
None--->None
None--->75
None--->None
None--->None
None--->None
None--->8.7
None--->None
None--->1105
None--->None
None--->12
None--->None
None--->21.7667
None--->None
None--->32.8486
None
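
Many of the None values above come from how .string works: it returns None when a cell is empty or contains more than one child node, such as nested tags mixed with text. The sketch below shows the difference on made-up cells, using get_text(strip=True) as an alternative that always returns a string.

```python
from bs4 import BeautifulSoup

# Made-up row: a plain cell, a cell with nested tags plus text, and an empty cell
sample_html = (
    "<table><tr>"
    "<td>15.05</td>"
    "<td><a href='#'>Company</a> <b>Ltd</b></td>"
    "<td></td>"
    "</tr></table>"
)
soup = BeautifulSoup(sample_html, "html.parser")
cells = soup.find_all("td")

# .string is None unless the cell has exactly one child
strings = [cell.string for cell in cells]
# get_text() joins all text inside the cell; strip=True trims whitespace
texts = [cell.get_text(" ", strip=True) for cell in cells]

print(strings)
print(texts)
```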

#### 4. Scraping data from HTML tables into a DataFrame (pandas/BeautifulSoup)


In [220]:
url = "https://en.wikipedia.org/wiki/World_population"
ppn =requests.get(url).text
soup=BeautifulSoup(ppn,'html.parser')
tables= soup.find_all('table')

In [221]:
# Check how many tables were found by taking the length of the tables list

len(tables)

25

In [222]:
for index,table in enumerate(tables):
    if ('10 most densely populated countries' in str(table)):
        table_index = index
print(table_index)

5
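
Searching str(table) matches the phrase anywhere in the table's HTML. A slightly narrower sketch, assuming the target table has a <caption> element, compares the caption text only. The helper name find_table_by_caption and the sample HTML are illustrative.

```python
from bs4 import BeautifulSoup


def find_table_by_caption(tables, phrase):
    """Return the index of the first table whose <caption> contains phrase, or None."""
    for index, table in enumerate(tables):
        caption = table.find("caption")
        if caption is not None and phrase in caption.get_text():
            return index
    return None


# Made-up page with two captioned tables
sample_html = """
<table><caption>Largest cities</caption><tr><td>x</td></tr></table>
<table><caption>10 most densely populated countries</caption><tr><td>y</td></tr></table>
"""
sample_tables = BeautifulSoup(sample_html, "html.parser").find_all("table")
print(find_table_by_caption(sample_tables, "10 most densely populated countries"))
```

Returning None when nothing matches makes a missing table explicit, whereas the loop above would leave table_index undefined.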


In [223]:
print(tables[table_index].prettify())

<table class="wikitable sortable" style="text-align:right">
 <caption>
  10 most densely populated countries
  <small>
   (with population above 5 million)
  </small>
 </caption>
 <tbody>
  <tr>
   <th>
    Rank
   </th>
   <th>
    Country
   </th>
   <th>
    Population
   </th>
   <th>
    Area
    <br/>
    <small>
     (km
     <sup>
      2
     </sup>
     )
    </small>
   </th>
   <th>
    Density
    <br/>
    <small>
     (pop/km
     <sup>
      2
     </sup>
     )
    </small>
   </th>
  </tr>
  <tr>
   <td>
    1
   </td>
   <td align="left">
    <span class="flagicon">
     <img alt="" class="thumbborder" data-file-height="600" data-file-width="900" decoding="async" height="15" src="//upload.wikimedia.org/wikipedia/commons/thumb/4/48/Flag_of_Singapore.svg/23px-Flag_of_Singapore.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/4/48/Flag_of_Singapore.svg/35px-Flag_of_Singapore.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/4/48/Flag_of_Singapo

In [224]:
# Collect one dict per table row, then build the DataFrame once
# (DataFrame.append was removed in pandas 2.0)
rows = []

for row in tables[table_index].tbody.find_all("tr"):
    col = row.find_all("td")
    if col:  # header rows contain only <th> cells, so col is empty for them
        rows.append({
            "Rank": col[0].text.strip(),
            "Country": col[1].text.strip(),  # strip() removes the stray newlines around some names
            "Population": col[2].text.strip(),
            "Area": col[3].text.strip(),
            "Density": col[4].text.strip(),
        })

population_data = pd.DataFrame(rows, columns=["Rank", "Country", "Population", "Area", "Density"])

population_data

Unnamed: 0,Rank,Country,Population,Area,Density
0,1,Singapore,5704000,710,8033
1,2,Bangladesh,173640000,143998,1206
2,3,Palestine,5266785,6020,847
3,4,Lebanon,6856000,10452,656
4,5,Taiwan,23604000,36193,652
5,6,South Korea,51781000,99538,520
6,7,Rwanda,12374000,26338,470
7,8,Israel,9600000,22072,435
8,9,Haiti,11578000,27065,428
9,10,Netherlands,17760000,41526,428
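
Values collected with .text are strings, so the Population and Area columns need converting before any arithmetic. A minimal sketch with pd.to_numeric follows, using three rows copied from the output above. The name sample_data is made up, and the thousands-separator handling is an assumption in case the scraped values contain commas.

```python
import pandas as pd

# Three rows from the table above, stored as strings the way .text returns them
sample_data = pd.DataFrame({
    "Country": ["Singapore", "Bangladesh", "Lebanon"],
    "Population": ["5704000", "173640000", "6856000"],
    "Area": ["710", "143998", "10452"],
})

for column in ["Population", "Area"]:
    # Remove any thousands separators, then convert; unparseable values become NaN
    cleaned = sample_data[column].str.replace(",", "", regex=False)
    sample_data[column] = pd.to_numeric(cleaned, errors="coerce")

print(sample_data.dtypes)
print(sample_data["Population"].sum())
```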


#### Scrape data from HTML tables into a DataFrame using BeautifulSoup and read_html

Using the same url, data, soup, and tables objects as in the last section, we can use the read_html function to create a DataFrame.
Note that read_html always returns a list of DataFrames, even when only one table is found.


In [225]:
pd.read_html(str(tables[5]), flavor='bs4')

[   Rank      Country  Population  Area(km2)  Density(pop/km2)
 0     1    Singapore     5704000        710              8033
 1     2   Bangladesh   173640000     143998              1206
 2     3    Palestine     5266785       6020               847
 3     4      Lebanon     6856000      10452               656
 4     5       Taiwan    23604000      36193               652
 5     6  South Korea    51781000      99538               520
 6     7       Rwanda    12374000      26338               470
 7     8       Israel     9600000      22072               435
 8     9        Haiti    11578000      27065               428
 9    10  Netherlands    17760000      41526               428]

In [226]:
population_data_read_html = pd.read_html(str(tables[5]), flavor='bs4')[0]

population_data_read_html

Unnamed: 0,Rank,Country,Population,Area(km2),Density(pop/km2)
0,1,Singapore,5704000,710,8033
1,2,Bangladesh,173640000,143998,1206
2,3,Palestine,5266785,6020,847
3,4,Lebanon,6856000,10452,656
4,5,Taiwan,23604000,36193,652
5,6,South Korea,51781000,99538,520
6,7,Rwanda,12374000,26338,470
7,8,Israel,9600000,22072,435
8,9,Haiti,11578000,27065,428
9,10,Netherlands,17760000,41526,428
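
Two details may help when reusing this pattern. Recent pandas versions warn when read_html is given a raw HTML string, and wrapping the string in io.StringIO avoids that. The match argument can also select tables by the text they contain instead of a hard-coded index such as 5. The sketch below uses a small made-up page and assumes lxml (or bs4 with html5lib) is installed.

```python
from io import StringIO

import pandas as pd

# Made-up page with two tables; only the second mentions population density
sample_html = """
<table>
  <caption>Some other table</caption>
  <tr><th>Name</th></tr>
  <tr><td>x</td></tr>
</table>
<table>
  <caption>10 most densely populated countries</caption>
  <tr><th>Rank</th><th>Country</th><th>Population</th></tr>
  <tr><td>1</td><td>Singapore</td><td>5704000</td></tr>
  <tr><td>2</td><td>Bangladesh</td><td>173640000</td></tr>
</table>
"""

# match keeps only tables containing the given text; the result is still a list
matching_tables = pd.read_html(StringIO(sample_html), match="densely populated")
density_table = matching_tables[0]

print(len(matching_tables))
print(density_table)
```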


#### Scrape data from HTML tables into a DataFrame using read_html
- I used pd.read_html to scrape the price-sheet table from the Zimbabwe Stock Exchange website.
- read_html returns a list of DataFrames, so I then took the table out of that list as a DataFrame.

In [227]:
urls='https://www.zse.co.zw/price-sheet'
dataframe_list = pd.read_html(urls, flavor='bs4')

In [228]:
len(dataframe_list)

1

In [229]:
dataframe_list[0]

Unnamed: 0,0,1,2,3,4,5,6,7
0,Company Name,Company Name,,,,Opening Price,Closing Price,Total Traded Volume
1,,,,,,,,
2,,,,,,,,
3,,,,,,,,
4,EQUITIES,EQUITIES,,,,,,
...,...,...,...,...,...,...,...,...
103,Zimbabwe Newspapers (1980) Limited,Zimbabwe Newspapers (1980) Limited,Zimbabwe Newspapers (1980) Limited,Zimbabwe Newspapers (1980) Limited,,3,3,49100
104,,,,,,,,
105,Zimplow Holdings Limited,Zimplow Holdings Limited,Zimplow Holdings Limited,,,16,17.8,300
106,,,,,,,,


#### Converting the list into a DataFrame

In [230]:
# dataframe_list[0] is already a DataFrame
ZSE_df = pd.DataFrame(dataframe_list[0])
print(ZSE_df)

                                      0                                   1  \
0                          Company Name                        Company Name   
1                                   NaN                                 NaN   
2                                   NaN                                 NaN   
3                                   NaN                                 NaN   
4                              EQUITIES                            EQUITIES   
..                                  ...                                 ...   
103  Zimbabwe Newspapers (1980) Limited  Zimbabwe Newspapers (1980) Limited   
104                                 NaN                                 NaN   
105            Zimplow Holdings Limited            Zimplow Holdings Limited   
106                                 NaN                                 NaN   
107              Zimre Holdings Limited              Zimre Holdings Limited   

                                      2            

In [231]:
# Drop every row that contains at least one null value
# (this reuses the name ZSE_data, which earlier held the page HTML)

ZSE_data = ZSE_df.dropna()

In [232]:
ZSE_data

Unnamed: 0,0,1,2,3,4,5,6,7
10,Amalgamated Regional Trading (Art) Holdings Li...,Amalgamated Regional Trading (Art) Holdings Li...,Amalgamated Regional Trading (Art) Holdings Li...,Amalgamated Regional Trading (Art) Holdings Li...,Amalgamated Regional Trading (Art) Holdings Li...,17.6,15.05,8200
21,British American Tobacco Zimbabwe Limited,British American Tobacco Zimbabwe Limited,British American Tobacco Zimbabwe Limited,British American Tobacco Zimbabwe Limited,British American Tobacco Zimbabwe Limited,2995.975,2995.975,0
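
Only two rows survive because dropna() with its default settings removes every row that has a NaN in any column, and most company rows have NaN in at least one of the repeated name columns. A gentler sketch is shown below on a small made-up frame that mimics the layout above (name in column 0, prices and volume in columns 5 to 7). The column names, the helper name clean_price_sheet, and the sample values are assumptions, and the real table may need further handling.

```python
import numpy as np
import pandas as pd


def clean_price_sheet(raw):
    """Keep the name, price, and volume columns and drop rows without numeric prices."""
    df = raw[[0, 5, 6, 7]].copy()
    df.columns = ["Company Name", "Opening Price", "Closing Price", "Total Traded Volume"]
    numeric_columns = ["Opening Price", "Closing Price", "Total Traded Volume"]
    for column in numeric_columns:
        # Header and section rows hold text or NaN here, which becomes NaN
        df[column] = pd.to_numeric(df[column], errors="coerce")
    # Drop only the rows that lack a company name or any of the numeric values
    df = df.dropna(subset=["Company Name"] + numeric_columns)
    return df.reset_index(drop=True)


# Made-up frame with the same kinds of problems as the scraped table
raw = pd.DataFrame([
    ["Company Name", "Company Name", np.nan, np.nan, np.nan,
     "Opening Price", "Closing Price", "Total Traded Volume"],
    [np.nan] * 8,
    ["EQUITIES", "EQUITIES", np.nan, np.nan, np.nan, np.nan, np.nan, np.nan],
    ["Company A", "Company A", "Company A", np.nan, np.nan, "16", "17.8", "300"],
    ["Company B", "Company B", np.nan, np.nan, np.nan, "3", "3", "49100"],
])
cleaned = clean_price_sheet(raw)
print(cleaned)
```

Applied to the scraped table, this would be called as clean_price_sheet(ZSE_df), keeping rows whose name is only partly repeated across columns 1 to 4.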


In [233]:
ZSE_data.describe()

Unnamed: 0,0,1,2,3,4,5,6,7
count,2,2,2,2,2,2.0,2.0,2
unique,2,2,2,2,2,2.0,2.0,2
top,Amalgamated Regional Trading (Art) Holdings Li...,Amalgamated Regional Trading (Art) Holdings Li...,Amalgamated Regional Trading (Art) Holdings Li...,Amalgamated Regional Trading (Art) Holdings Li...,Amalgamated Regional Trading (Art) Holdings Li...,17.6,15.05,8200
freq,1,1,1,1,1,1.0,1.0,1


#### Saving the dataframe into a file

In [234]:
ZSE_data.to_csv('Zimbabwe_Exchange_27_10_22.csv', index=False)
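
A quick way to confirm that a file was written as intended is to read it back. The sketch below writes a small made-up frame to a temporary directory so that it does not touch the real output file; the file name and values are placeholders.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Made-up data standing in for the cleaned price sheet
sample = pd.DataFrame({
    "Company Name": ["Company A", "Company B"],
    "Closing Price": [17.8, 3.0],
})

with tempfile.TemporaryDirectory() as folder:
    path = Path(folder) / "price_sheet_sample.csv"
    # index=False leaves the row index out of the file
    sample.to_csv(path, index=False)
    # Read the file back while the temporary directory still exists
    restored = pd.read_csv(path)

print(restored)
```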

In [235]:
ZSE_data.tail(10)

Unnamed: 0,0,1,2,3,4,5,6,7
10,Amalgamated Regional Trading (Art) Holdings Li...,Amalgamated Regional Trading (Art) Holdings Li...,Amalgamated Regional Trading (Art) Holdings Li...,Amalgamated Regional Trading (Art) Holdings Li...,Amalgamated Regional Trading (Art) Holdings Li...,17.6,15.05,8200
21,British American Tobacco Zimbabwe Limited,British American Tobacco Zimbabwe Limited,British American Tobacco Zimbabwe Limited,British American Tobacco Zimbabwe Limited,British American Tobacco Zimbabwe Limited,2995.975,2995.975,0
