# Statistics for Data Science [DS401]
## Web Scraping
#### By: Javier Orduz

[licenseBDG]: https://img.shields.io/badge/License-CC-orange?style=plastic
[license]: https://creativecommons.org/licenses/by-nc-sa/3.0/deed.en

[mywebsiteBDG]:https://img.shields.io/badge/website-jaorduz.github.io-0abeeb?style=plastic
[mywebsite]: https://jaorduz.github.io/

[mygithubBDG-jaorduz]: https://img.shields.io/badge/jaorduz-repos-blue?logo=github&label=jaorduz&style=plastic
[mygithub-jaorduz]: https://github.com/jaorduz/

[mygithubBDG-jaorduc]: https://img.shields.io/badge/jaorduc-repos-blue?logo=github&label=jaorduc&style=plastic 
[mygithub-jaorduc]: https://github.com/jaorduc/

[myXprofileBDG]: https://img.shields.io/static/v1?label=Follow&message=jaorduc&color=2ea44f&style=plastic&logo=X&logoColor=black
[myXprofile]:https://twitter.com/jaorduc


[![website - jaorduz.github.io][mywebsiteBDG]][mywebsite]
[![Github][mygithubBDG-jaorduz]][mygithub-jaorduz]
[![Github][mygithubBDG-jaorduc]][mygithub-jaorduc]
[![Follow @jaorduc][myXprofileBDG]][myXprofile]
[![CC License][licenseBDG]][license]

<h1>Contents</h1>
<div class="alert  alert-block alert-info" style="margin-top: 1px">
    <ol>
        <li><a href="#daExploration">Definition: Web Scraping</a></li>
        <li><a href="#showData">Libraries for web scraping and pandas</a></li>
            <!-- <li><a href="#unData">Import modules</a></li> -->
         <!-- <ol>
             <li><a href="#reData">Libraries for web scrapping</a></li>
         </ol> -->
        <li><a href="#PrintoutText">Print out the text</a></li>
        <li><a href="#versions">Versions</a></li>
        <li><a href="#exercises1">Exercise 1</a></li>
        <li><a href="#regex">Regex</a></li>
        <li><a href="#manipuClean">Using pandas, data manipulating and cleaning</a></li>
        <li><a href="#exercises2">Exercise 2</a></li>        
        <li><a href="#references">References</a></li>
    </ol>
</div>
<br>
<hr>

**Web scraping** involves using a program or algorithm to extract and process data from websites in large quantities. This skill is valuable for data scientists, engineers, or anyone needing to analyze extensive datasets. When you find data online that isn’t available for direct download, web scraping with Python enables you to collect and format this data for easier analysis and use in your projects. This technique can transform unstructured web data into a format that’s ready for importing and manipulation in various analytical tools.

## Libraries for web scraping and pandas

In [None]:
from urllib.request import urlopen
from bs4 import BeautifulSoup
import pandas as pd

## Exploring a website
Specify the URL of the page containing the dataset and use ```urlopen()``` to get the HTML of the page.

In [None]:
url = "https://en.wikipedia.org/wiki/Web_scraping"
html = urlopen(url)
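
Some servers may reject requests that carry Python's default client identifier. The sketch below shows one way to attach a `User-Agent` header with `urllib.request.Request` before calling ```urlopen()```. The `USER_AGENT` string and the `build_request` helper are hypothetical names chosen for illustration; the download itself is left commented out because it needs network access.

```python
from urllib.request import Request

# Hypothetical identifying string; replace it with your own details.
USER_AGENT = "DS401-course-notebook/0.1 (educational use)"


def build_request(url: str) -> Request:
    """Return a Request that identifies the client via a User-Agent header."""
    return Request(url, headers={"User-Agent": USER_AGENT})


req = build_request("https://en.wikipedia.org/wiki/Web_scraping")
print(req.full_url)
# Request stores header names capitalized, hence "User-agent".
print(req.get_header("User-agent"))

# To actually download the page (needs network access):
# from urllib.request import urlopen
# with urlopen(req, timeout=10) as response:
#     html = response.read()
```

Passing `timeout` to ```urlopen()``` keeps the notebook from hanging if the server does not answer.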

After getting the **html** of the page, create a **Beautiful Soup object** from it with the ```BeautifulSoup()``` function. It takes two arguments: the HTML itself and the name of the HTML parser, here __lxml__. We do not need more detail about parsers for now.

In [None]:
soup = BeautifulSoup(html, 'lxml')
type(soup)
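
To see the two arguments in isolation, here is a minimal sketch that parses a small, made-up HTML string instead of a live page. It assumes the parser `"html.parser"`, which ships with Python, in place of __lxml__; the names `sample_html` and `sample_soup` are illustrative only.

```python
from bs4 import BeautifulSoup

# Made-up HTML used only for illustration.
sample_html = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <p>Hello, <a href="https://example.org">world</a>!</p>
  </body>
</html>
"""

# First argument: the markup. Second argument: the name of the parser.
# "html.parser" is built into Python; "lxml" requires the lxml package.
sample_soup = BeautifulSoup(sample_html, "html.parser")

print(type(sample_soup).__name__)  # BeautifulSoup
print(sample_soup.title.string)    # Sample page
print(sample_soup.p.get_text())    # Hello, world!
```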

We can extract information from the website, e.g. getting the title as follows:

In [None]:
title = soup.title
print(title)

In [None]:
soup.find('img').get('alt')

## Print out the text
Create a variable and print the text of the webpage.

In [None]:
text = soup.get_text()

In [None]:
print(text)

We can also extract specific **html** tags from a webpage, for example:
```html
a
table
tr
```
We will use ```a``` for hyperlinks, ```table``` for tables, ```tr``` for table rows, and so on. Check ref. [[2](https://www.w3schools.com/tags/tag_html.asp)] for more tags.

We can use the ```find_all()``` method to collect every occurrence of a given HTML tag within a webpage.

In [None]:
all_links = soup.find_all("a")
all_links

In [None]:
all_tables = soup.find_all('table')
all_tables

In [None]:
rows = soup.find_all('tr')
print(rows[:8])

In [None]:
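for row in rows:
    # Collect the <td> cells of the current row
    row_td = row.find_all('td')
# After the loop, row_td holds the cells of the last row only
print(row_td)
type(row_td)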
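
The loop above keeps only the cells of the last row. As a hedged sketch of how the cells of every row could be collected instead, the example below runs the same ```find_all()``` calls on a small, made-up table; `table_html`, `table_soup`, and `cells_per_row` are names invented for this illustration, and `"html.parser"` is assumed as the parser.

```python
from bs4 import BeautifulSoup

# Made-up table used only for illustration.
table_html = """
<table>
  <tr><th>Name</th><th>Year</th></tr>
  <tr><td>Alpha</td><td>2001</td></tr>
  <tr><td>Beta</td><td>2002</td></tr>
</table>
"""

table_soup = BeautifulSoup(table_html, "html.parser")

cells_per_row = []
for tr in table_soup.find_all("tr"):
    # find_all returns a list-like ResultSet; it is empty for the header
    # row, which contains <th> cells rather than <td> cells.
    tds = tr.find_all("td")
    cells_per_row.append([td.get_text() for td in tds])

print(cells_per_row)
# [[], ['Alpha', '2001'], ['Beta', '2002']]
```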

## Exercise 1

From the last output, we can see that ```html``` tags sometimes come with attributes such as ```class```, ```src```, etc. These attributes provide additional information about ```html``` elements. You can use a for loop and the ```get("href")``` method to extract and print out only the hyperlinks.

<!-- all_links = soup.find_all("a")
for link in all_links:
    print(link.get("href")) -->

In [None]:
# Your code here

## Regular expressions (Regex)
A regular expression is defined using **tokens that match a particular pattern** [[3](https://brightdata.com/blog/web-data/web-scraping-with-regex)].

Regular expressions are easy to get wrong, so test them carefully. Python provides them in the ```re``` module, which must be imported first:
```python
import re
```
The following code builds a regular expression that matches every html tag (such as ```<td>```) in a table row and replaces each match with an empty string, so that only the cell text remains.
1. Compile a regular expression by passing a string to match to ```re.compile()```.
2. In the pattern ```<.*?>```, the angle brackets match literally and the dot, star, and question mark (```.*?```) match any characters between them, so the whole pattern matches an opening angle bracket followed by anything and followed by a closing angle bracket.
3. The question mark makes the match non-greedy, that is, it matches the shortest possible string. If you omit the question mark, it will match all the text between the first opening angle bracket and the last closing angle bracket.
4. After compiling a regular expression, you can use the ```re.sub()``` method to find all the substrings where the regular expression matches and replace them with an empty string.
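
The difference between the non-greedy and greedy patterns described in the steps above can be checked on a short string. This is a minimal sketch; `cell_markup` is a made-up stand-in for the string form of one table row.

```python
import re

# Made-up string standing in for str(cells) of one table row.
cell_markup = "[<td>Alpha</td>, <td>2001</td>]"

non_greedy = re.compile(r"<.*?>")  # shortest match: one tag at a time
greedy = re.compile(r"<.*>")       # longest match: first "<" to last ">"

print(non_greedy.findall(cell_markup))  # ['<td>', '</td>', '<td>', '</td>']
print(non_greedy.sub("", cell_markup))  # [Alpha, 2001]
print(greedy.sub("", cell_markup))      # []
```

The greedy pattern removes the cell text along with the tags, which is why the non-greedy form is used for cleaning.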

The full code below generates an empty **list**, **extracts** text in between HTML tags for each row, and **appends** it to the assigned list.

In [None]:
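import re

# Compile the tag-matching pattern once, outside the loop
clean = re.compile(r'<.*?>')

list_rows = []
for row in rows:
    cells = row.find_all('td')
    str_cells = str(cells)                 # list of cells as one string
    clean2 = re.sub(clean, '', str_cells)  # remove every HTML tag
    list_rows.append(clean2)
print(clean2)  # cleaned text of the last row only
type(clean2)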

## Using pandas, data manipulating and cleaning

The next step is to convert the list into a dataframe and get a quick view of the first 10 rows using pandas.

In [None]:
df = pd.DataFrame(list_rows)
df.head(10)

Cleaning: split column "0" into multiple columns at each comma.

In [None]:
df1 = df[0].str.split(',', expand=True)
df1.head(10)

The dataframe has unwanted square brackets surrounding each row. You can use the ```strip()``` method to remove the opening square bracket in column "0".

In [None]:
df1[0] = df1[0].str.strip('[')
df1.head(10)
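
The split and strip steps can also be tried on a small, hand-made list, which makes the effect of each call easier to follow. This sketch assumes every row has exactly two fields; `sample_rows`, `sample_df`, and `sample_split` are illustrative names, and the closing bracket is stripped as well, which goes one step beyond the cell above.

```python
import pandas as pd

# Made-up rows in the same shape as list_rows: one string per table row.
sample_rows = ["[Alpha, 2001]", "[Beta, 2002]"]

sample_df = pd.DataFrame(sample_rows)

# Split the single column 0 at each comma into separate columns.
sample_split = sample_df[0].str.split(",", expand=True)

# Remove the opening bracket from the first column and the closing
# bracket from the last one, then trim the leftover space.
sample_split[0] = sample_split[0].str.strip("[")
sample_split[1] = sample_split[1].str.strip("]").str.strip()

print(sample_split.values.tolist())
# [['Alpha', '2001'], ['Beta', '2002']]
```

On a real page, rows often have different numbers of cells, so the closing bracket may not always sit in the same column.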

The table is now close to a usable format. Before analyzing it, get an overview of the data with ```info()``` and ```shape```, as shown below.

In [None]:
df1.info()
df1.shape

## Exercise 2

Find a different website and repeat the steps we followed here.

## References
[1] https://www.datacamp.com/tutorial/web-scraping-using-python

[2] https://www.w3schools.com/tags/tag_html.asp

[3] https://brightdata.com/blog/web-data/web-scraping-with-regex