# Python: Web scraping

**Goal**: Retrieve elements from a web page!

## Introduction to web scraping

When data is not available through an API and has to be taken directly from web pages, the technique is called **web scraping**. It consists of **loading** **websites** and **extracting data** from them. Doing this requires knowing the **structure of a web page (HTML and CSS)**.

In [1]:
# Example
import requests

In [2]:
response = requests.get("https://raw.githubusercontent.com/Niangmohamed/Niangmohamed.github.io/main/index.html")

In [3]:
response_content = response.content
response_content

b'<!doctype html>\n<!--[if IE 7 ]>    <html lang="en-gb" class="isie ie7 oldie no-js"> <![endif]-->\n<!--[if IE 8 ]>    <html lang="en-gb" class="isie ie8 oldie no-js"> <![endif]-->\n<!--[if IE 9 ]>    <html lang="en-gb" class="isie ie9 no-js"> <![endif]-->\n<!--[if (gt IE 9)|!(IE)]><!-->\n<html lang="en-gb" class="no-js">\n<!--<![endif]-->\n<head>\n<meta charset="utf-8">\n<meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1">\n<!--[if lt IE 9]> \n    <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">\n    <![endif]-->\n<title>Mohamed Niang</title>\n<link rel="icon" type="image/x-icon" href="images/icons.png" />\n<meta name="description" content="">\n<meta name="author" content="WebThemez">\n<!--[if lt IE 9]>\n        <script src="http://html5shim.googlecode.com/svn/trunk/html5.js"></script>\n    <![endif]-->\n<!--[if lte IE 8]>\n\t\t<script type="text/javascript" src="http://explorercanvas.googlecode.com/svn/trunk/excanvas.js"></script>\n\t<!
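The request above is sent without a timeout and without checking whether it succeeded. As a hedged sketch, the download could be wrapped in a small helper. The name `fetch_html` and the 10-second default are assumptions made for illustration and are not part of the original notebook.

```python
import requests


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Download a page and return its HTML as text (hypothetical helper)."""
    # A timeout prevents the call from hanging forever on a slow server.
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx answers instead of parsing an error page.
    response.raise_for_status()
    # response.text is the decoded str; response.content is the raw bytes.
    return response.text
```

Calling `fetch_html` with the URL used in `In [2]` should return the page as a `str`, which BeautifulSoup accepts in the same way as the bytes stored in `response.content`.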

## Retrieve elements from a page

Once the page is loaded, suppose we want to extract a **paragraph**. For this we can use **BeautifulSoup**, a powerful library provided by the **bs4** package.

In [4]:
from bs4 import BeautifulSoup

We will use *BeautifulSoup* to parse the content of the page downloaded previously.

In [5]:
parser = BeautifulSoup(response_content, "html.parser")

In [6]:
body = parser.body

In [7]:
p = body.p
p

<p align="justify"> Data Science and Artificial Intelligence - I am passionate about developing and applying machine learning methods for building algorithms and predictive models using a variety of datasets for solving impactful real world problems. 
          I am currently interested in the field of Natural Language Processing, Computer Vision and Generative Models.
        </p>

To display only the text without the \<p> tag, we use the **text attribute** of the **p object**.

In [8]:
print(p.text)

 Data Science and Artificial Intelligence - I am passionate about developing and applying machine learning methods for building algorithms and predictive models using a variety of datasets for solving impactful real world problems. 
          I am currently interested in the field of Natural Language Processing, Computer Vision and Generative Models.
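The printed text keeps the leading spaces and line breaks of the original HTML. The sketch below, which uses a small inline HTML string invented for illustration rather than the real page, shows that dotted access such as `body.p` returns only the first matching tag, and compares a few ways of cleaning the whitespace.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, only meant to mimic the indentation seen above.
html = """
<html><body>
  <p align="justify">  First
      paragraph.
  </p>
  <p>Second paragraph.</p>
</body></html>
"""
soup = BeautifulSoup(html, "html.parser")

# Dotted access returns only the first <p> found under <body>.
first_p = soup.body.p

# .text keeps every space and line break of the source.
raw = first_p.text

# get_text(strip=True) trims the ends of each text fragment,
# but whitespace inside a fragment is preserved.
clean = first_p.get_text(strip=True)

# Splitting and re-joining collapses all runs of whitespace.
normalized = " ".join(first_p.text.split())

print(repr(raw))
print(repr(clean))
print(repr(normalized))
```

Which variant is appropriate depends on whether the internal line breaks carry meaning for the data being collected.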
        


### Training

Let's get the contents of the \<head> tag.

In [9]:
head = parser.head

In [10]:
title = head.title
title.text

'Mohamed Niang'

## Use of find_all method

The **find_all method** **finds all the elements** that correspond to a **tag** and **returns a list containing these elements**.
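As a small self-contained illustration, the sketch below runs `find_all` on an inline HTML string invented for the example, and contrasts it with `find`, which returns only the first match.

```python
from bs4 import BeautifulSoup

# Hypothetical markup used only for this illustration.
html = "<body><p>one</p><div><p>two</p></div><p>three</p></body>"
soup = BeautifulSoup(html, "html.parser")

# find_all searches all descendants and returns a list-like ResultSet.
all_p = soup.find_all("p")
texts = [p.text for p in all_p]

# find returns the first match only, or None when nothing matches.
first_p = soup.find("p")
no_table = soup.find("table")

# limit stops the search after the given number of matches.
two_p = soup.find_all("p", limit=2)

# When nothing matches, find_all returns an empty result rather than None.
missing = soup.find_all("table")

print(texts)
```

Because `find_all` returns an empty result when nothing matches, indexing it with `[0]` raises an `IndexError` in that case, so checking the length first is a reasonable precaution.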

In [11]:
body = parser.find_all("body")
body

[<body>
 <header class="header">
 <div class="bg-parlex">
 <nav class="navbar navbar-inverse" role="navigation">
 <div class="navbar-header">
 <button class="navbar-toggle" data-target="#main-nav" data-toggle="collapse" id="nav-toggle" type="button"> <span class="sr-only">Toggle navigation</span> <span class="icon-bar"></span> <span class="icon-bar"></span> <span class="icon-bar"></span> </button>
 <a class="navbar-brand scroll-top logo" href="#"><b></b></a> </div>
 <!--/.navbar-header-->
 <div class="collapse navbar-collapse" id="main-nav">
 <ul class="nav navbar-nav" id="mainNav">
 <li class="active"><a class="scroll-link" href="#aboutUs">About Me</a></li>
 <li><a class="scroll-link" href="#skills">Skills</a></li>
 <li><a class="scroll-link" href="#contactUs">Contact Me</a></li>
 </ul>
 </div>
 <!--/.navbar-collapse-->
 </nav>
 <!--/.navbar-->
 </div>
 <!--/.container-->
 </header>
 <!--/.header-->
 <!--About-->
 <section id="aboutUs">
 <div class="container">
 <div class="heading">


In [12]:
p = body[0].find_all("p")
p

[<p align="justify"> Data Science and Artificial Intelligence - I am passionate about developing and applying machine learning methods for building algorithms and predictive models using a variety of datasets for solving impactful real world problems. 
           I am currently interested in the field of Natural Language Processing, Computer Vision and Generative Models.
         </p>,
 <p align="justify" class="mrgBtm20">I am a big fan of programming and love to spend my time programming and learning new things about programming. My favorite programming language is without a doubt Python. However, I often manipulate a lot of other languages like SQL, R, JavaScript, etc. in the context of data science projects. </p>,
 <p>Thank you for visiting out my profile. If you would like to get into contact with me, please fill out the form below.</p>]

In [13]:
print(p[0].text)

 Data Science and Artificial Intelligence - I am passionate about developing and applying machine learning methods for building algorithms and predictive models using a variety of datasets for solving impactful real world problems. 
          I am currently interested in the field of Natural Language Processing, Computer Vision and Generative Models.
        


### Training

Let's get the contents of the \<title> tag using **find_all**.

In [14]:
head = parser.find_all("head")
head

[<head>
 <meta charset="utf-8"/>
 <meta content="width=device-width, initial-scale=1, maximum-scale=1" name="viewport"/>
 <!--[if lt IE 9]> 
     <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
     <![endif]-->
 <title>Mohamed Niang</title>
 <link href="images/icons.png" rel="icon" type="image/x-icon"/>
 <meta content="" name="description"/>
 <meta content="WebThemez" name="author"/>
 <!--[if lt IE 9]>
         <script src="http://html5shim.googlecode.com/svn/trunk/html5.js"></script>
     <![endif]-->
 <!--[if lte IE 8]>
 		<script type="text/javascript" src="http://explorercanvas.googlecode.com/svn/trunk/excanvas.js"></script>
 	<![endif]-->
 <link href="css/bootstrap.min.css" rel="stylesheet"/>
 <link href="css/isotope.css" media="screen" rel="stylesheet" type="text/css"/>
 <link href="js/fancybox/jquery.fancybox.css" media="screen" rel="stylesheet" type="text/css"/>
 <link href="css/da-slider.css" rel="stylesheet" type="text/css"/>
 <link href="css/styles.css" rel="

In [15]:
title = head[0].find_all("title")
title

[<title>Mohamed Niang</title>]

In [16]:
print(title[0].text)

Mohamed Niang


## Elements corresponding to ids

In HTML, an **id** is a **unique key** that **identifies** a single element of the page. Let's get the **first section (id="aboutUs")** of our **web page** using its **id**.

In [17]:
first_section_id = parser.find_all("section", id="aboutUs")[0]
first_section_id

<section id="aboutUs">
<div class="container">
<div class="heading">
<br/><br/><br/>
<h2>About Me</h2>
<div class="row">
<!-- item -->
<div class="col-md-4 tileBox"> <img alt="me" src="images/photo-1.jpg"/> </div>
<div class="col-md-8 tileBox">
<div class="txtHead">
<h2>Mohamed Niang</h2>
<h3> ML Scientist</h3>
</div>
<p align="justify"> Data Science and Artificial Intelligence - I am passionate about developing and applying machine learning methods for building algorithms and predictive models using a variety of datasets for solving impactful real world problems. 
          I am currently interested in the field of Natural Language Processing, Computer Vision and Generative Models.
        </p>
</div>
<!-- end: -->
</div>
</div>
</div>
</section>
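Since an id is expected to be unique on a page, `find` can be used instead of `find_all(...)[0]`. The sketch below illustrates this on a reduced inline HTML string; the markup and the id `doesNotExist` are made up for the example.

```python
from bs4 import BeautifulSoup

# Reduced, hypothetical version of the page's sections.
html = """
<section id="aboutUs"><h2>About Me</h2></section>
<section id="skills"><h2>My Skills</h2></section>
"""
soup = BeautifulSoup(html, "html.parser")

# Search by id alone, whatever the tag name is.
about = soup.find(id="aboutUs")

# Search by tag name and id together.
skills = soup.find("section", id="skills")

# find returns None when no element carries the requested id.
missing = soup.find(id="doesNotExist")

print(about.h2.text)
print(skills.h2.text)
print(missing)
```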

### Training

Let's get the **second section (id="skills")** of our **web page** using **id**.

In [18]:
second_section_id = parser.find_all("section", id="skills")[0]
second_section_id

<section class="secPad white" id="skills">
<div class="container">
<div class="heading">
<!-- Heading -->
<h2>My Skills</h2>
<ul>
<li>Programming: Python, Spark, R, SQL, C++, C, VBA, Stata, PHP, Java-Script, HTML, CSS</li>
<li>Numerical Tools: Tensorflow, Keras, Scikit-learn, Pytorch, Pycuda, PySpark, Numpy, Scipy, Pandas, Matplotlib, MLlib</li>
<li>Machine Learning Skills: Supervised Methods: Naive Bayes, Linear and Logistic Regression</li>
<li>Deep Learning Skills: MLP, CNN, RNN, GAN, ResNet and Semi Supervised Learning (VAT)</li>
<li>Reinforcement Learning Skills: Markov Decision Processes, Exploration/Exploitation, Value Functions, Q-Learning, Policy Gradients and Dynamic Programming</li>
<li>SVM and Kernel Methods, Decision tree</li>
<li>Ensemble Methods: XGBoost, LGBM, CatBoost, Random Forest, Bagging</li>
<li>Other Modeling Techniques: RIDGE, LASSO, Elastic Net regression</li>
<li>Unsupervised Methods: Model based and K-means clustering, Gaussian Mixture, PCA, LDA</li>
<li>Resea

## Elements corresponding to class

In HTML, **classes** are **labels used to characterize elements**. Unlike ids, the same class can be shared by several elements, so classes are not necessarily unique. As for the sections, based on the **structure of our web page**, let's retrieve the **second section (class="secPad white")** of our **web page** using its **classes**.

In [19]:
second_section_class = parser.find_all("section", class_="secPad white")[0]
second_section_class

<section class="secPad white" id="skills">
<div class="container">
<div class="heading">
<!-- Heading -->
<h2>My Skills</h2>
<ul>
<li>Programming: Python, Spark, R, SQL, C++, C, VBA, Stata, PHP, Java-Script, HTML, CSS</li>
<li>Numerical Tools: Tensorflow, Keras, Scikit-learn, Pytorch, Pycuda, PySpark, Numpy, Scipy, Pandas, Matplotlib, MLlib</li>
<li>Machine Learning Skills: Supervised Methods: Naive Bayes, Linear and Logistic Regression</li>
<li>Deep Learning Skills: MLP, CNN, RNN, GAN, ResNet and Semi Supervised Learning (VAT)</li>
<li>Reinforcement Learning Skills: Markov Decision Processes, Exploration/Exploitation, Value Functions, Q-Learning, Policy Gradients and Dynamic Programming</li>
<li>SVM and Kernel Methods, Decision tree</li>
<li>Ensemble Methods: XGBoost, LGBM, CatBoost, Random Forest, Bagging</li>
<li>Other Modeling Techniques: RIDGE, LASSO, Elastic Net regression</li>
<li>Unsupervised Methods: Model based and K-means clustering, Gaussian Mixture, PCA, LDA</li>
<li>Resea
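An element can carry several classes, and the way the class is written in the search changes what matches. The sketch below uses a reduced inline HTML string; the third section, with id `extra`, is a hypothetical addition that only serves to show the effect of class order.

```python
from bs4 import BeautifulSoup

# Reduced markup; the "extra" section is invented for this illustration.
html = """
<section class="secPad white" id="skills"></section>
<section class="page-section secPad" id="contactUs"></section>
<section class="white secPad" id="extra"></section>
"""
soup = BeautifulSoup(html, "html.parser")

# A single class name matches every element that has this class.
single = [s["id"] for s in soup.find_all("section", class_="secPad")]

# A string with several classes is compared to the exact attribute value,
# so the order of the classes matters.
exact = [s["id"] for s in soup.find_all("section", class_="secPad white")]

# A CSS selector matches elements having both classes, in any order.
both = [s["id"] for s in soup.select("section.secPad.white")]

print(single)
print(exact)
print(both)
```

When the order of the classes in the page is not guaranteed, the CSS selector form is usually the safer choice.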

### Training

Let's get the **third section (class="page-section secPad")** of our **web page** using **class**.

In [20]:
third_section_class = parser.find_all("section", class_="page-section secPad")[0]
third_section_class

<section class="page-section secPad" id="contactUs">
<div class="container">
<div class="row">
<div class="heading">
<!-- Heading -->
<h2>Let's Keep In Touch!</h2>
<p>Thank you for visiting out my profile. If you would like to get into contact with me, please fill out the form below.</p>
</div>
</div>
<div class="row mrgn30">
<form action="" id="contactfrm" method="post" role="form">
<div class="col-sm-4">
<div class="form-group">
<label for="name">Name</label>
<input class="form-control" id="name" name="name" placeholder="Enter name" title="Please enter your name (at least 2 characters)" type="text"/>
</div>
<div class="form-group">
<label for="email">Email</label>
<input class="form-control" id="email" name="email" placeholder="Enter email" title="Please enter a valid email address" type="email"/>
</div>
</div>
<div class="col-sm-4">
<div class="form-group">
<label for="comments">Comments</label>
<textarea class="form-control" cols="3" id="comments" name="comment" placeholder="Enter 

## Elements corresponding to css selectors

**CSS** lets us add **styles** to our **HTML pages**, for example a **color** or a **font size** for the **paragraphs**. To decide which **elements** a style applies to, **CSS** uses **selectors**, such as `.name` for a class and `#name` for an id. BeautifulSoup provides another method, **select**, which takes a CSS selector and returns the matching elements.

Let's **get** the **div** elements that match the **".chart-text" class selector** on our web page.

In [21]:
div_char_text = parser.select(".chart-text")
div_char_text

[<div class="chart-text">
 <h4>Data Science </h4>
 </div>,
 <div class="chart-text">
 <h4>Web Development</h4>
 </div>,
 <div class="chart-text">
 <h4>Data Analysis</h4>
 </div>]

In [22]:
print(div_char_text[0].text)


Data Science 
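The output above still contains the line breaks that surround the `<h4>` tag. As a sketch on a reduced inline HTML string (the class `no-such-class` is invented for the example), `select` can be combined with `get_text(strip=True)`, and `select_one` can be used when only the first match is needed.

```python
from bs4 import BeautifulSoup

# Reduced, hypothetical version of the chart labels.
html = """
<div class="chart-text">
<h4>Data Science </h4>
</div>
<div class="chart-text">
<h4>Web Development</h4>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# select returns every element matching the CSS selector.
divs = soup.select(".chart-text")
labels = [div.get_text(strip=True) for div in divs]

# select_one returns the first match only, or None if nothing matches.
first = soup.select_one(".chart-text")
missing = soup.select_one(".no-such-class")

print(labels)
```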



### Training

This time, we will **retrieve** the **content** of the **div** matched by an **id selector ("#main-nav")**.

In [23]:
div_main_nav = parser.select("#main-nav")
div_main_nav

[<div class="collapse navbar-collapse" id="main-nav">
 <ul class="nav navbar-nav" id="mainNav">
 <li class="active"><a class="scroll-link" href="#aboutUs">About Me</a></li>
 <li><a class="scroll-link" href="#skills">Skills</a></li>
 <li><a class="scroll-link" href="#contactUs">Contact Me</a></li>
 </ul>
 </div>]

In [24]:
print(div_main_nav[0].text)



About Me
Skills
Contact Me




## Elements corresponding to combined css selectors

**Combining selectors in CSS** is a very robust technique in web scraping. It **lets us chain several tags** to **retrieve elements** while taking the **hierarchy of the page** into account.

Let's **retrieve** only the **Skills text** from the **previous div** by selecting the **li** elements it contains.

In [25]:
skills = div_main_nav[0].select("li")[1]
skills

<li><a class="scroll-link" href="#skills">Skills</a></li>

In [26]:
print(skills.text)

Skills
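The same result can also be obtained with a single selector string passed to `select`. The sketch below is one possible way to write it, on an inline copy of the navigation markup shown earlier; the variable names are assumptions made for the example.

```python
from bs4 import BeautifulSoup

# Inline copy of the navigation block from the page.
html = """
<div class="collapse navbar-collapse" id="main-nav">
<ul class="nav navbar-nav" id="mainNav">
<li class="active"><a class="scroll-link" href="#aboutUs">About Me</a></li>
<li><a class="scroll-link" href="#skills">Skills</a></li>
<li><a class="scroll-link" href="#contactUs">Contact Me</a></li>
</ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# A space is the descendant combinator: every <a> inside an <li>
# located anywhere inside the element with id "main-nav".
link_texts = [a.text for a in soup.select("#main-nav li a")]

# ">" is the child combinator: each step must be a direct child.
hrefs = [a["href"] for a in soup.select("#main-nav > ul > li > a")]

# An attribute selector targets the Skills link without relying on its position.
skills_link = soup.select_one('#main-nav a[href="#skills"]')

# :nth-of-type(2) picks the second <li> among its siblings.
second_item = soup.select_one("#main-nav li:nth-of-type(2)")

print(link_texts)
print(hrefs)
print(skills_link.text, second_item.text)
```

Selecting by attribute is often more stable than selecting by position, since it keeps working if the order of the menu items changes.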
