
<img src="../assets/logo3.png" width="200" height="200" >
<div style="display:block"><br><br>
    <div style="display:block" align=left display=block> 
        <font size=5><b>HandsOn 5 - Crawling the web with Beautiful Soup</b></font><br>
        <hr/>
</div>


<pre>
$ ( click to jump to a task )
.
├── Introduction
│   └── Jupyter hack!!
│
├── Working with Beautiful Soup
│   └── Searching with Beautiful Soup
│ 
├── <a href="#Task1">Task1: Football Table</a> (Morning session)
│
└── <a href="#Task2">Task2: Phone Shop</a> (Afternoon session)

</pre>


The sections marked with a Thinking Emoji (💭) are those which you need to read and answer. All right, without further ado let's get started!

<b><span style="color:Red">You might encounter difficulties sending a request to Iranian websites from <span style="color:Green">Google Colab</span>. Colab uses a foreign IP address, which these websites block. Please use <span style="color:Green">Jupyter Notebook</span> for this Hands-On exercise. </span></b>

<hr />

# Introduction

In this Hands-On exercise, you will work with these concepts:
- Web Scraping & Data Collection using Requests and Beautiful Soup
- Advanced Data Cleaning using Pandas and Regex

<hr />

### Jupyter hack!! 

Run the code below. After that, pressing TAB while writing code shows a list of the available functions and objects, so you can enjoy auto-completion. I recommend going wild with this feature and using it always! You can also press SHIFT + TAB on any function or variable to see its documentation.

In [1]:
%config Completer.use_jedi = False

<hr />

# 📖 Working with Beautiful Soup

We can send a GET request to any webpage and get its frontend source code. The raw source code is usually messy and difficult to parse...

💭 Run the code below to get the source code for https://python.org.

In [2]:
import requests

url = 'https://python.org'
response = requests.get(url)

print(response.encoding)
print(response.apparent_encoding)

print(response.text[:3000])

utf-8
utf-8
<!doctype html>
<!--[if lt IE 7]>   <html class="no-js ie6 lt-ie7 lt-ie8 lt-ie9">   <![endif]-->
<!--[if IE 7]>      <html class="no-js ie7 lt-ie8 lt-ie9">          <![endif]-->
<!--[if IE 8]>      <html class="no-js ie8 lt-ie9">                 <![endif]-->
<!--[if gt IE 8]><!--><html class="no-js" lang="en" dir="ltr">  <!--<![endif]-->

<head>
    <meta charset="utf-8">
    <meta http-equiv="X-UA-Compatible" content="IE=edge">

    <link rel="prefetch" href="//ajax.googleapis.com/ajax/libs/jquery/1.8.2/jquery.min.js">
    <link rel="prefetch" href="//ajax.googleapis.com/ajax/libs/jqueryui/1.12.1/jquery-ui.min.js">

    <meta name="application-name" content="Python.org">
    <meta name="msapplication-tooltip" content="The official home of the Python Programming Language">
    <meta name="apple-mobile-web-app-title" content="Python.org">
    <meta name="apple-mobile-web-app-capable" content="yes">
    <meta name="apple-mobile-web-app-status-bar-style" content="black">

    

The response is in the form of a very long string! It's difficult to access each HTML tag and attribute like this. The string is very messy as well...

<hr />

All you need is a <b><span style="color:green">beautiful soup</span></b>! 

<b><span style="color:green">Beautiful soup</span></b> is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

💭 Please install it in your conda environment: <br>

<code> !conda install -y -c anaconda beautifulsoup4 </code>

In [3]:
'''
You can install beautiful soup here
'''
# !conda install -y -c anaconda beautifulsoup4 
# !conda update -n base -c defaults conda

'\nYou can install beautiful soup here\n'

💭 Now run the code below:

In [4]:
from bs4 import BeautifulSoup

# beautiful soup takes the source code and a parser as input
soup = BeautifulSoup(response.text, 'html.parser')

print(soup)

<!DOCTYPE html>

<!--[if lt IE 7]>   <html class="no-js ie6 lt-ie7 lt-ie8 lt-ie9">   <![endif]-->
<!--[if IE 7]>      <html class="no-js ie7 lt-ie8 lt-ie9">          <![endif]-->
<!--[if IE 8]>      <html class="no-js ie8 lt-ie9">                 <![endif]-->
<!--[if gt IE 8]><!--><html class="no-js" dir="ltr" lang="en"> <!--<![endif]-->
<head>
<meta charset="utf-8"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<link href="//ajax.googleapis.com/ajax/libs/jquery/1.8.2/jquery.min.js" rel="prefetch"/>
<link href="//ajax.googleapis.com/ajax/libs/jqueryui/1.12.1/jquery-ui.min.js" rel="prefetch"/>
<meta content="Python.org" name="application-name"/>
<meta content="The official home of the Python Programming Language" name="msapplication-tooltip"/>
<meta content="Python.org" name="apple-mobile-web-app-title"/>
<meta content="yes" name="apple-mobile-web-app-capable"/>
<meta content="black" name="apple-mobile-web-app-status-bar-style"/>
<meta content="width=device-width, initial-scal

The data looks much prettier now. <b><span style="color:green">Beautiful Soup</span></b> detects all HTML tags, so we can access these tags by using its built in functions!

### 📖 But what are these "HTML tags" anyway?
These HTML tags are exactly what you see when you press <b>F12</b> on a webpage. Everything that you see as a user inside a website is associated with one of these tags. If you don't believe me, right-click on any element in a webpage and click on the <b>Inspect Element</b> button. You will see which tag the element you clicked on belongs to! 

<img src="../assets/day5-handson-im.png" height="200" >

#### 💭 try it for yourself! 
Visit <b>www.python.org</b>, right-click on the <b>Community</b> button, then click on the <b>Inspect Element</b> button. It should look like the figure above:


We can see that this element belongs to an <b><a\></b> tag inside another <b><li\></b> tag. In HTML, <b><a\> </b> tag defines a hyperlink and <b><li\></b> tag defines an item in a list. Each tag has some attributes. For example, the <b><a\></b> tag here has a link (href) and a text value ('Community').
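
A minimal sketch of the same idea on a hand-written snippet (the HTML string below is made up for illustration, not copied from python.org). It shows how Beautiful Soup exposes a tag's name, its attributes, its text, and the tag that encloses it.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet that mimics a menu entry: an <a> tag inside an <li> tag
html = '<ul><li class="menu-item"><a href="/community/">Community</a></li></ul>'
snippet = BeautifulSoup(html, 'html.parser')

link = snippet.find('a')
print(link.name)         # the tag name
print(link['href'])      # the value of the href attribute
print(link.text)         # the text inside the tag
print(link.parent.name)  # the name of the enclosing tag
```

The variable names `snippet` and `link` are only illustrative; they are chosen so that the `soup` object built earlier is not overwritten.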
<hr />

### 📖 Searching with Beautiful Soup

<b><span style="color:green">Beautiful soup</span></b> allows you to search through the source code by tag names and their attributes. The code below finds the first tag on python.org that satisfies the given conditions: the active line looks for a <b><div\></b> with id="top", and the commented-out lines look for <b><a\></b> tags by their href.
    
💭 try it for yourself!

In [5]:
# Any attribute can be passed as a keyword argument to narrow the search

# print(soup.find('a', href="/community-landing/"))
# print(soup.find('a', href="/jobs/"))
print(soup.find('div', id="top"))

<div class="top-bar do-not-print" id="top">
<nav class="meta-navigation container" role="navigation">
<div class="skip-link screen-reader-text">
<a href="#content" title="Skip to content">Skip to content</a>
</div>
<a aria-hidden="true" class="jump-link" href="#python-network" id="close-python-network">
<span aria-hidden="true" class="icon-arrow-down"><span>▼</span></span> Close
                </a>
<ul class="menu" role="tree">
<li class="python-meta current_item selectedcurrent_branch selected">
<a class="current_item selectedcurrent_branch selected" href="/" title="The Python Programming Language">Python</a>
</li>
<li class="psf-meta">
<a href="/psf-landing/" title="The Python Software Foundation">PSF</a>
</li>
<li class="docs-meta">
<a href="https://docs.python.org" title="Python Documentation">Docs</a>
</li>
<li class="pypi-meta">
<a href="https://pypi.org/" title="Python Package Index">PyPI</a>
</li>
<li class="jobs-meta">
<a href="/jobs/" title="Python Job Board">Jobs</a>
</li>


💭 Notice that there are sometimes many ways to search for the same tag:

In [6]:
print(soup.find('a', href="#content"))

<a href="#content" title="Skip to content">Skip to content</a>


In [7]:
print(soup.find('a', string="Skip to content"))

<a href="#content" title="Skip to content">Skip to content</a>


<hr />

#### 📖 What if we need to find every element that satisfies a condition?

💭 Run the code below to find every <b> <a\> </b> tag that exists in python.org!

In [8]:
a = soup.find_all('a')
a

[<a href="#content" title="Skip to content">Skip to content</a>,
 <a aria-hidden="true" class="jump-link" href="#python-network" id="close-python-network">
 <span aria-hidden="true" class="icon-arrow-down"><span>▼</span></span> Close
                 </a>,
 <a class="current_item selectedcurrent_branch selected" href="/" title="The Python Programming Language">Python</a>,
 <a href="/psf-landing/" title="The Python Software Foundation">PSF</a>,
 <a href="https://docs.python.org" title="Python Documentation">Docs</a>,
 <a href="https://pypi.org/" title="Python Package Index">PyPI</a>,
 <a href="/jobs/" title="Python Job Board">Jobs</a>,
 <a href="/community-landing/">Community</a>,
 <a aria-hidden="true" class="jump-link" href="#top" id="python-network">
 <span aria-hidden="true" class="icon-arrow-up"><span>▲</span></span> The Python Network
                 </a>,
 <a href="/"><img alt="python™" class="python-logo" src="/static/img/python-logo.png"/></a>,
 <a class="donate-button" href="

<hr />

#### 📖 What if we want to access the attributes of all those tags? 
Just imagine if the search result was a dictionary!

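💭 Run the code below to extract an attribute from every tag. The active lines collect the title values; the commented-out lines print the links (href).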

In [9]:
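# for i in a:
#     print(i['href'])
d = 0        # number of <a> tags without a title attribute
titles = []
for i in a:
    try:
        titles.append(i['title'])
    except KeyError:
        # Indexing a tag raises KeyError when the attribute is missing
        d = d + 1
print(titles)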

['Skip to content', 'The Python Programming Language', 'The Python Software Foundation', 'Python Documentation', 'Python Package Index', 'Python Job Board', 'Make Text Smaller', 'Make Text Larger', 'Reset any font size changes I have made', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'success-stories', '', '', '', '', '', '', '', 'News from around the Python world', 'Python Insider Blog Posts', 'Python Software Foundation Newsletter', 'Planet Python', 'PSF Blog', 'PyCon Blog', '', '', '', '', '', '', 'More News', 'More Events', 'More Success Stories', 'More Applications', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', '', 'success-stories', '', '', '', '', '', '', '', 'News from around the Python world', 'Python Insider Blog Posts', 'Python Software Foundation Newsletter', 'Planet Python'
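
Because a tag behaves much like a dictionary, its `get` method can stand in for a try/except block: it returns `None` when the attribute is missing. The sketch below uses two made-up links to show this, along with `attrs`, which holds all attributes of a tag.

```python
from bs4 import BeautifulSoup

# Made-up links: only the first one has a title attribute
html = '<a href="/docs/" title="Documentation">Docs</a><a href="/jobs/">Jobs</a>'
links = BeautifulSoup(html, 'html.parser').find_all('a')

hrefs = [tag['href'] for tag in links]              # raises KeyError if href is missing
link_titles = [tag.get('title') for tag in links]   # None when title is missing
print(hrefs)
print(link_titles)
print(links[0].attrs)                               # all attributes as a dict
```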

<hr />

#### 📖 Fancier functions
There are two other functions, ```select_one``` and ```select```, that work in a similar fashion to ```find``` and ```find_all``` (```findAll``` is an older alias), but they are more powerful: they let you define complex conditions with <b><span style="color:green">CSS</span></b> selector syntax!

Below you can see some examples of <b><span style="color:green">CSS</span></b> syntax. (Check out the documentation for more awesome tricks!)

- a > b: I want a <b\> tag that is a direct child of an <a\> tag
- c#d: I want a <c\> tag with its 'id' attribute equal to 'd'
- e.f: I want an <e\> tag whose 'class' attribute contains 'f'
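
To try the three selector forms from the list above, here is a small sketch on a hand-written document (the HTML and the variable names are assumptions made for illustration).

```python
from bs4 import BeautifulSoup

# Hand-written document used only to try out the selectors
html = '''
<div id="outer">
  <div id="nojs" class="do-not-print">Notice</div>
  <p class="intro">Hello <span class="intro">world</span></p>
</div>
'''
doc = BeautifulSoup(html, 'html.parser')

child = doc.select_one('div > p')      # a <p> that is a direct child of a <div>
by_id = doc.select_one('div#nojs')     # a <div> whose id is "nojs"
by_class = doc.select('span.intro')    # every <span> whose class list contains "intro"

print(child.name, by_id.text, [tag.text for tag in by_class])
```

Note that `select_one` returns a single tag (or `None` when nothing matches), while `select` always returns a list.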

💭 Run the code below. It finds <b>the first</b> <div\> whose class attribute is exactly 'do-not-print' and which is a direct child of another <div\>. On python.org this is the <div\> with id='nojs'.

In [10]:
# show = soup.select_one('div > div#nojs')
# show = soup.select_one('body > div.python home')
show = soup.select_one("div > div[class='do-not-print']")
show

<div class="do-not-print" id="nojs">
<p><strong>Notice:</strong> While JavaScript is not essential for this website, your interaction with the content will be limited. Please turn JavaScript on for the full experience. </p>
</div>

What if we want to access the text of this tag?

In [11]:
show.text

'\nNotice: While JavaScript is not essential for this website, your interaction with the content will be limited. Please turn JavaScript on for the full experience. \n'
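
The leading and trailing "\n" in that string come from the whitespace around the inner tags. As a sketch on a made-up snippet, either `strip()` or `get_text` with a separator and `strip=True` gives a tidier result.

```python
from bs4 import BeautifulSoup

# Made-up snippet with the same kind of surrounding whitespace
html = '<div>\n<p><strong>Notice:</strong> JavaScript is off. </p>\n</div>'
box = BeautifulSoup(html, 'html.parser').div

raw = box.text                         # keeps the newlines around the paragraph
trimmed = box.text.strip()             # trims whitespace at both ends only
tidy = box.get_text(' ', strip=True)   # strips every text fragment, joins them with a space
print(repr(raw))
print(repr(trimmed))
print(repr(tidy))
```

The separator matters here: with `strip=True` and no separator, the fragments would be glued together without a space between them.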

 💭 Run the code below. It finds <b>all</b> <div\>s with class='do-not-print' that are direct children of another <div\>.

In [12]:
show = soup.select('div > div.do-not-print') 
show

[<div class="do-not-print" id="nojs">
 <p><strong>Notice:</strong> While JavaScript is not essential for this website, your interaction with the content will be limited. Please turn JavaScript on for the full experience. </p>
 </div>,
 <div class="top-bar do-not-print" id="top">
 <nav class="meta-navigation container" role="navigation">
 <div class="skip-link screen-reader-text">
 <a href="#content" title="Skip to content">Skip to content</a>
 </div>
 <a aria-hidden="true" class="jump-link" href="#python-network" id="close-python-network">
 <span aria-hidden="true" class="icon-arrow-down"><span>▼</span></span> Close
                 </a>
 <ul class="menu" role="tree">
 <li class="python-meta current_item selectedcurrent_branch selected">
 <a class="current_item selectedcurrent_branch selected" href="/" title="The Python Programming Language">Python</a>
 </li>
 <li class="psf-meta">
 <a href="/psf-landing/" title="The Python Software Foundation">PSF</a>
 </li>
 <li class="docs-meta">
 <

<hr />

## 💭💭💭Task 1: Football table💭💭💭

<a name="Task1"></a>

The goal of this exercise is to make you more familiar with inspecting HTML source code by extracting information from a static table on varzesh3.com.

💭 Please visit this <b>[link](https://www.varzesh3.com/football/league/900578/%D9%84%DB%8C%DA%AF-%D8%A8%D8%B1%D8%AA%D8%B1-%D8%A7%DB%8C%D8%B1%D8%A7%D9%86-1400-1401)</b> and look at the table. It's the data for Iran's football league (1400-1401).

💭 Run the code below to load the table.

<b><span style="color:red">When working with Persian letters, sometimes <b><span style="color:green">requests</span></b> can get the encoding wrong and show strange characters. If this happens, restart the kernel and run the code below again.</span></b>


In [13]:
import requests
from bs4 import BeautifulSoup

url = 'https://www.varzesh3.com/football/league/900578/%D9%84%DB%8C%DA%AF-%D8%A8%D8%B1%D8%AA%D8%B1-%D8%A7%DB%8C%D8%B1%D8%A7%D9%86-1400-1401'

response= requests.get(url)
soup = BeautifulSoup(response.text , 'html.parser')
table= soup.find('table')
table

<table class="league-standing football-standing">
<caption>لیگ برتر ایران 1400-1401</caption>
<thead>
<tr>
<th scope="col">رتبه</th>
<th scope="col"></th>
<th scope="col">بازی</th>
<th scope="col">برد</th>
<th scope="col">مساوی</th>
<th scope="col">باخت</th>
<th scope="col"> گل -/+</th>
<th scope="col">تفاضل </th>
<th scope="col">امتياز</th>
<th class="last-five" scope="col">5 بازی آخر  <img src="https://static.varzesh3.com/img/icons/last-five-arrow.svg"/> </th>
</tr>
</thead>
<tbody>
<tr>
<td><span class="standing-rule-color" style="background: #0f40d2"></span>1</td>
<td scope="row">
<a href="/football/team/4/استقلال">
<figure>
<img alt="استقلال" src="https://static.farakav.com/files/pictures/01150467.png?w=30" width="30"/>
</figure>
                                استقلال 
                            </a>
</td>
<td>30</td>
<td>19</td>
<td>11</td>
<td>0</td>
<td>39-10</td>
<td>29</td>
<td>68</td>
<td class="last-five">
<a href="/football/match/172336">
<span class="ls-results ls-win">
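
If the Persian text in the output above comes back as strange characters, one possible cause (not necessarily the one at play here) is that the bytes were decoded with the wrong codec. The sketch below reproduces the effect with plain Python strings; the sample text is made up, and the commented lines at the end show the corresponding setting in `requests`.

```python
# Made-up example text; the bytes are what a server would send for it in UTF-8
original = 'لیگ برتر ایران'
raw_bytes = original.encode('utf-8')

# Decoding with the wrong codec produces the "strange characters"
garbled = raw_bytes.decode('iso-8859-1')

# Decoding with the right codec recovers the text
fixed = raw_bytes.decode('utf-8')

print(garbled)
print(fixed)

# With requests, the codec can be set before reading the text:
#   response.encoding = 'utf-8'   # or: response.encoding = response.apparent_encoding
#   html = response.text
```

Setting the encoding explicitly is usually more predictable than restarting the kernel, since it does not depend on what the server declares.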

💭 By inspecting the tag names in the webpage you're trying to crawl, give a short description of what each one represents:

 - ```<thead>```: 
 - ```<tr>```:
 - ```<th>```:
 - ```<tbody>```:
 - ```<td>```:

<hr />

💭 Explain briefly what this code is doing. What should the missing value stand for?

....

In [14]:
rows = table.find_all('tr')
# for row in rows:
for head in rows[0].find_all('th'):
    print([head.text])
    

['رتبه']
['']
['بازی']
['برد']
['مساوی']
['باخت']
[' گل -/+']
['تفاضل ']
['امتياز']
['5 بازی آخر   ']


<hr />

💭 Explain briefly what this code is doing. 

...

In [15]:
for row in rows:        
    for body in row.find_all('td'):
        print([body.text])

['1']
['\n\n\n\n\r\n                                استقلال \r\n                            \n']
['30']
['19']
['11']
['0']
['39-10']
['29']
['68']
['\n\n✔\n\n\n✔\n\n\n ― \n\n\n✔\n\n\n ― \n\n']
['2']
['\n\n\n\n\r\n                                پرسپولیس \r\n                            \n']
['30']
['18']
['9']
['3']
['44-21']
['23']
['63']
['\n\n✔\n\n\n ― \n\n\n✖\n\n\n✔\n\n\n✔\n\n']
['3']
['\n\n\n\n\r\n                                سپاهان \r\n                            \n']
['30']
['16']
['8']
['6']
['43-21']
['22']
['56']
['\n\n✔\n\n\n✔\n\n\n✔\n\n\n✖\n\n\n ― \n\n']
['4']
['\n\n\n\n\r\n                                گل گهرسیرجان \r\n                            \n']
['30']
['13']
['12']
['5']
['37-28']
['9']
['51']
['\n\n✔\n\n\n✔\n\n\n✔\n\n\n✔\n\n\n✔\n\n']
['5']
['\n\n\n\n\r\n                                فولاد \r\n                            \n']
['30']
['13']
['10']
['7']
['30-22']
['8']
['49']
['\n\n✔\n\n\n✖\n\n\n✔\n\n\n✔\n\n\n ― \n\n']
['6']
['\n\n\n\n\r\n                     

<hr />

📖 As you can see, some of the values contain extra characters like "\n" and "\r" or runs of extra spaces. We can call the ```replace('a', 'b')``` method on any string to deal with these. 

💭 Use this method to remove the extra characters from the table in the code below.

In [16]:
import re

rows = table.find_all('tr')

datas = []
for row in rows:
    data = []
    for head in row.find_all('th'):
        h = head.text

        ''' Enter your code here'''
        # Drop line breaks, then trim trailing whitespace
        h = h.replace('\r', '').replace('\n', '')
        h = re.sub(r'\s+$', '', h)
        data.append(h)
    for body in row.find_all('td'):
        b = body.text

        ''' Enter your code here'''
        # Drop line breaks, then collapse runs of whitespace into one space
        b = b.replace('\r', '').replace('\n', '')
        b = re.sub(r'\s+', ' ', b)
        data.append(b)
    datas.append(data)

datas

[['رتبه',
  '',
  'بازی',
  'برد',
  'مساوی',
  'باخت',
  ' گل -/+',
  'تفاضل',
  'امتياز',
  '5 بازی آخر'],
 ['1', ' استقلال ', '30', '19', '11', '0', '39-10', '29', '68', '✔✔ ― ✔ ― '],
 ['2', ' پرسپولیس ', '30', '18', '9', '3', '44-21', '23', '63', '✔ ― ✖✔✔'],
 ['3', ' سپاهان ', '30', '16', '8', '6', '43-21', '22', '56', '✔✔✔✖ ― '],
 ['4', ' گل گهرسیرجان ', '30', '13', '12', '5', '37-28', '9', '51', '✔✔✔✔✔'],
 ['5', ' فولاد ', '30', '13', '10', '7', '30-22', '8', '49', '✔✖✔✔ ― '],
 ['6', ' مس رفسنجان ', '30', '12', '9', '9', '39-29', '10', '45', '✔✖✔✖ ― '],
 ['7', ' ذوب آهن ', '30', '10', '7', '13', '21-25', '-4', '37', ' ― ― ✔ ― ✖'],
 ['8',
  ' آلومینیوم اراک ',
  '30',
  '7',
  '16',
  '7',
  '20-23',
  '-3',
  '37',
  '✖✔ ― ― ― '],
 ['9', ' پیکان ', '30', '7', '15', '8', '26-27', '-1', '36', '✖ ― ✖ ― ― '],
 ['10',
  ' صنعت نفت آبادان ',
  '30',
  '9',
  '9',
  '12',
  '26-30',
  '-4',
  '36',
  ' ― ✖✖✖ ― '],
 ['11', ' هوادار ', '30', '8', '10', '12', '18-25', '-7', '34', '✔✖✔✖✖'],

<hr />

💭 Convert ```datas``` to a pandas DataFrame with proper column names & no empty rows or columns

In [17]:
import pandas as pd
datas[0][1]='تیم'
''' Enter your code here'''

df=pd.DataFrame(datas[1:], columns= datas[0])
df=df.set_index('رتبه')
df[datas[0][2:6]]=df[datas[0][2:6]].astype(int)
df[datas[0][8:9]]=df[datas[0][8:9]].astype(int)

<hr />

💭 Run these 2 code blocks

In [18]:
df.head()

| رتبه | تیم | بازی | برد | مساوی | باخت | گل -/+ | تفاضل | امتياز | 5 بازی آخر |
|---|---|---|---|---|---|---|---|---|---|
| 1 | استقلال | 30 | 19 | 11 | 0 | 39-10 | 29 | 68 | ✔✔ ― ✔ ― |
| 2 | پرسپولیس | 30 | 18 | 9 | 3 | 44-21 | 23 | 63 | ✔ ― ✖✔✔ |
| 3 | سپاهان | 30 | 16 | 8 | 6 | 43-21 | 22 | 56 | ✔✔✔✖ ― |
| 4 | گل گهرسیرجان | 30 | 13 | 12 | 5 | 37-28 | 9 | 51 | ✔✔✔✔✔ |
| 5 | فولاد | 30 | 13 | 10 | 7 | 30-22 | 8 | 49 | ✔✖✔✔ ― |


In [19]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 16 entries, 1 to 16
Data columns (total 9 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   تیم         16 non-null     object
 1   بازی        16 non-null     int32 
 2   برد         16 non-null     int32 
 3   مساوی       16 non-null     int32 
 4   باخت        16 non-null     int32 
 5    گل -/+     16 non-null     object
 6   تفاضل       16 non-null     object
 7   امتياز      16 non-null     int32 
 8   5 بازی آخر  16 non-null     object
dtypes: int32(5), object(4)
memory usage: 960.0+ bytes


<hr />

💭 Create two columns for GS (Goals Scored) and GA (Goals Against) 
(that is, our familiar goals scored and goals conceded)

In [20]:
''' Enter your code here'''
import numpy as np
# df['Col'].str.split(delimiter, expand=True)
df[['GS','GA']]=(df[datas[0][6]].str.split('-', expand=True)).astype(int)
df.head()

| رتبه | تیم | بازی | برد | مساوی | باخت | گل -/+ | تفاضل | امتياز | 5 بازی آخر | GS | GA |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | استقلال | 30 | 19 | 11 | 0 | 39-10 | 29 | 68 | ✔✔ ― ✔ ― | 39 | 10 |
| 2 | پرسپولیس | 30 | 18 | 9 | 3 | 44-21 | 23 | 63 | ✔ ― ✖✔✔ | 44 | 21 |
| 3 | سپاهان | 30 | 16 | 8 | 6 | 43-21 | 22 | 56 | ✔✔✔✖ ― | 43 | 21 |
| 4 | گل گهرسیرجان | 30 | 13 | 12 | 5 | 37-28 | 9 | 51 | ✔✔✔✔✔ | 37 | 28 |
| 5 | فولاد | 30 | 13 | 10 | 7 | 30-22 | 8 | 49 | ✔✖✔✔ ― | 30 | 22 |
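
The split step can be tried in isolation on a toy table. The sketch below uses made-up teams and a column named `goals` that plays the role of the goals column above; it also shows `idxmax`, which returns the index label of the largest value.

```python
import pandas as pd

# Toy table with made-up teams; 'goals' holds "scored-conceded" strings
toy = pd.DataFrame({'team': ['A', 'B', 'C'], 'goals': ['39-10', '44-21', '21-25']})

# expand=True returns one column per piece instead of a list in each cell
parts = toy['goals'].str.split('-', expand=True).astype(int)
toy['GS'] = parts[0]
toy['GA'] = parts[1]
toy['diff'] = toy['GS'] - toy['GA']

# idxmax returns the index label of the row with the largest value
best = toy.loc[toy['diff'].idxmax(), 'team']
print(toy)
print(best)
```

Looking rows up by label with `loc` and `idxmax` keeps working when the index is not a plain 0..n-1 range, which is the case once a column such as the rank is set as the index.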


<hr />

💭 Which team has the most losses?

In [21]:
''' Enter your code here'''
# idxmax returns the index label of the row with the most losses
print('team with most losses is', df.loc[df['باخت'].idxmax(), 'تیم'])

team with most losses is  شهرخودرو مشهد 


<hr />


💭 Which team has the largest goal difference? (Difference between GS and GA) 

In [22]:
''' Enter your code here'''
# Look the team up by index label rather than by position
df.loc[(df['GS'] - df['GA']).idxmax(), 'تیم']

' استقلال '

<hr />


💭 Which team was the best during the last 3 games? (Use the column with the ✔s)

In [23]:
''' Enter your code here'''
def findLastGameScores(inList):
    # Score each team's recent form: win (✔) = +1, draw (―) = 0, loss (✖) = -1.
    # Note: this sums over every game listed in the column (five per team).
    points = {'✔': 1, '―': 0, '✖': -1}
    outList = np.zeros(len(inList))
    for j, games in enumerate(inList):
        for character in games:
            # Spaces and any other characters add nothing to the score
            outList[j] += points.get(character, 0)
    return outList

df['L5GS'] = findLastGameScores(df[datas[0][9]])
print(df.loc[df['L5GS'].idxmax(), 'تیم'])

 گل گهرسیرجان 


 <hr />
The code snippet below can be used to display Persian and Arabic strings beautifully when plotting in Python. 

- Please add the package ```arabic_reshaper``` by this command: ```!conda install -c conda-forge arabic_reshaper```
- Please add the package ```bidi``` by this command: ```!conda install -c conda-forge python-bidi```

In [24]:
'''
You can install the packages here
'''
# !conda install -c conda-forge arabic_reshaper
# !conda install -c conda-forge python-bidi

'\nYou can install the packages here\n'

In [25]:
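# Helper for displaying Persian and Arabic text correctly in plots
import matplotlib.pyplot as plt
import arabic_reshaper
from bidi.algorithm import get_display

def reshaper(text_list):
    # Reshape each string in place so joined letters and right-to-left order render correctly
    for i in range(len(text_list)):
        text_list[i] = get_display(arabic_reshaper.reshape(str(text_list[i])))
    return text_list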

 💭 Plot Total Scores and The Scores of the last 3 games in one single bar plot. Please use the code snippet above for displaying the team names.
 (Your answer should look like the provided figure)


In [26]:
''' Enter your code here'''

' Enter your code here'

<hr />

## 💭💭💭Task 2: Phone Shop💭💭💭

<a name="Task2"></a>

📖 Now let's search for a nice new phone in <b>technolife</b> by crawling it. :) <br />
Below you can see the URL of the first page in the phone section <br />
https://www.technolife.ir/product/list/69_800_801/%D8%AA%D9%85%D8%A7%D9%85%DB%8C-%DA%AF%D9%88%D8%B4%DB%8C%E2%80%8C%D9%87%D8%A7?code=69_800_801&plp=%D8%AA%D9%85%D8%A7%D9%85%DB%8C-%DA%AF%D9%88%D8%B4%DB%8C%E2%80%8C%D9%87%D8%A7&page=1

<b>Notes</b>
- If we want to crawl phones from all pages, we should change the URL accordingly (for example, ```page=1``` should change to ```page=2```).
In this example, we want to crawl the first 10 pages.
- Crawling 10 pages might take a while to finish, so begin with just a few pages and increase the number when you're sure about your code. The ```tqdm``` library helps by showing a progress bar! Install it with ```pip``` or ```conda``` before running the code below.
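
The page-by-page URLs described in the notes above can be generated up front. The sketch below uses a hypothetical base address and query parameters (not the shop's real ones) and only builds the URLs, without sending any request.

```python
from urllib.parse import urlencode

# Hypothetical base address and query parameters, not the shop's real ones
BASE = 'https://example.com/product/list'

def page_urls(first, last):
    """Return the listing URLs for pages first..last (both included)."""
    return [BASE + '?' + urlencode({'code': 'phones', 'page': n})
            for n in range(first, last + 1)]

urls = page_urls(1, 10)
print(len(urls))
print(urls[0])
print(urls[-1])
```

The `last + 1` is there because `range` excludes its upper bound: `range(1, 10)` stops at page 9, while `range(1, 11)` includes page 10.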


In [27]:
'''
You can install tqdm here

'''

'\nYou can install tqdm here\n\n'


### 💭 Crawling and saving the info
💭 Open the URL above and inspect the web page and its elements, such as title and price, to familiarize yourself with them.

<hr />
💭 Complete the crawling code below...



<b>Notes:</b> Please apply no preprocessing or data cleaning! <b>Just save the raw data</b>. You can also change the pages by manipulating the URL in the ```for``` loop below. Your results should look like the table provided in the output cell below.

In [28]:
import requests
from bs4 import BeautifulSoup 
import pandas as pd
from tqdm import tqdm

url = 'https://www.technolife.ir/product/list/69_800_801/%D8%AA%D9%85%D8%A7%D9%85%DB%8C-%DA%AF%D9%88%D8%B4%DB%8C%E2%80%8C%D9%87%D8%A7?code=69_800_801&plp=%D8%AA%D9%85%D8%A7%D9%85%DB%8C-%DA%AF%D9%88%D8%B4%DB%8C%E2%80%8C%D9%87%D8%A7&page='

result = []

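for page in tqdm(range(1, 10)):  # covers pages 1 to 9; use range(1, 11) to include page 10
    
    '''' Enter your code here to Change the URL page '''
    new_url = url + str(page)
    # Keep the response in its own variable instead of overwriting the loop counter
    response = requests.get(new_url)
    soup = BeautifulSoup(response.text, 'html.parser')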
    # Get product list
    products = soup.select('div#productsList>ul>li')
    for p in products:
        # Get product title
        title = p.select('a.ProductComp_product_title__bOrf5>strong')[0].text

        # Get product price
        price = -1

        offer_section = p.select_one('div.ProductComp_product_off_box__OfLBa')
        
        if not offer_section:
            # there is no offer for this product (main price)
            main_price = p.select_one('section.ProductComp_product_price__S4_x8>div.ProductComp_main_price__XgWce')
            price = main_price.select('span')[0].text
        else:
            # there is offer for this product (offer price)
            '''' Enter your code here to extract product offer price '''
            offer_price =p.select_one('section.ProductComp_product_price__S4_x8>div.ProductComp_offer_price__HAQ6N')
            price =offer_price.select('span')[0].text 
        
        # Get product specs (hard disk/screen size/camera/battery)
        '''' Enter your code here to extract product specs '''
        specs = p.select('div.ProductComp_product_icon__OLqA5>ul.pr_icon')
        # The values sit in every second <span> of the first specs list
        spans = specs[0].select('span')
        hard_disk = spans[1].text
        size = spans[3].text
        camera = spans[5].text
        battery = spans[7].text
        result.append({
            'title': title,
            'hard disk': hard_disk,
            'size': size,
            'camera': camera,
            'battery': battery,
            'price': price
        })

# Saving dictionary in a dataframe
data = pd.DataFrame(result)
data

100%|████████████████████████████████████████████████████████████████████████████████████| 9/9 [00:13<00:00,  1.46s/it]


|  | title | hard disk | size | camera | battery | price |
|---|---|---|---|---|---|---|
| 0 | گوشی موبايل سامسونگ مدل گلکسی A32 4G دو سیم کا... | 128 | 6.4 | 64 | 5000 | 6219000 |
| 1 | گوشی موبايل نوکيا مدل 105 (2019) ظرفیت 4 مگابا... | 4 | 1.77 |  | 800 | 635000 |
| 2 | گوشی موبایل سامسونگ مدل Galaxy A13 ظرفیت 64 ... | 64 | 6.6 | 50 | 5000 | 4449000 |
| 3 | گوشی موبايل سامسونگ مدل Galaxy A52s 5G ظرفیت 2... | 256 | 6.5 | 64 | 4500 | 11799000 |
| 4 | گوشی موبايل نوکيا مدل 106 (2018) ظرفیت 4 مگابا... | 4 | 1.8 |  | 800 | 630000 |
| ... | ... | ... | ... | ... | ... | ... |
| 283 | گوشی موبایل نوکیا G20 ظرفیت 128 گیگابایت - رم ... | 128 | 6.52 | 48 | 5050 | 4355000 |
| 284 | گوشی موبایل نوکیا 150 (2020) | 4 | 2.4 | VGA | 1020 | 1135000 |
| 285 | گوشی موبايل سامسونگ مدل Galaxy A23 ظرفیت 128 گ... | 128 | 6.6 | 50 | 5000 | 5399000 |
| 286 | گوشی موبايل سامسونگ مدل Galaxy A23 ظرفیت 128 گ... | 128 | 6.6 | 50 | 5000 | 6325000 |


In [29]:
# hard_disk = specs[0].select('span')[1].text
print(camera)

13


<hr />

### 💭 Clean the pandas dataframe
<a name="Task2:clean"></a>

 - Fix data types
   - `hard disk` -> `int`
   - `size` -> `float`
   - `camera` -> `float`
   - `battery` -> `int`
   - `price` -> `int`
 - Normalize Arabic characters
 - Extract RAM information. (Use Regex)
 - Clean `title` column
   - Remove redundant words (e.g. گوشی)
   - Remove parts related to RAM/Storage information
   - Remove non-word characters (e.g. -)
 
 You can play around with your <b><span style="color:green">Regex</span></b> patterns <b>[here](https://regexr.com/)</b>
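
As one possible approach to the type fixes in the list above, here is a sketch on made-up rows shaped like the crawled data. It assumes that every value arrives as text and that prices may contain thousands separators.

```python
import pandas as pd

# Made-up rows shaped like the crawled data: everything arrives as text
raw = pd.DataFrame({
    'camera': ['64', '', 'VGA'],
    'battery': ['5000', '800', '1020'],
    'price': ['6,219,000', '635,000', '1,135,000'],
})

# errors='coerce' turns anything that is not a number ('' or 'VGA') into NaN
raw['camera'] = pd.to_numeric(raw['camera'], errors='coerce').fillna(0.0)
raw['battery'] = raw['battery'].astype(int)
# Remove thousands separators before converting the price
raw['price'] = raw['price'].str.replace(',', '', regex=False).astype(int)

print(raw.dtypes)
```

Using `pd.to_numeric` with `errors='coerce'` avoids listing every non-numeric value by hand, at the cost of silently turning unexpected text into 0 after the `fillna`.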

In [74]:
# Fix data types

# 'VGA' and empty strings are not numbers, so treat them as 0
data['camera'] = data['camera'].replace(['VGA', ''], 0)

'''Convert these datatypes'''
data['camera'] = data['camera'].astype(float)
data['battery'] = data['battery'].astype(int)
data['hard disk'] = data['hard disk'].astype(int)
data['size'] = data['size'].astype(float)

'''Convert price (It needs a little more work)'''
print(data.price.dtype)
# Only strip the thousands separators while price is still text,
# so the cell can be re-run safely
if not pd.api.types.is_integer_dtype(data['price']):
    data['price'] = data['price'].str.replace(',', '').astype(int)
data.head()

int32


|  | title | hard disk | size | camera | battery | price | ram |
|---|---|---|---|---|---|---|---|
| 0 | سامسونگ گلکسی A32 4G دو سیم کارت | 128 | 6.4 | 64.0 | 5000 | 6219000 | 6144 |
| 1 | نوکیا 105 2019 | 4 | 1.77 | 0.0 | 800 | 635000 | 4 |
| 2 | سامسونگ Galaxy A13 | 64 | 6.6 | 50.0 | 5000 | 4449000 | 4096 |
| 3 | سامسونگ Galaxy A52s 5G | 256 | 6.5 | 64.0 | 4500 | 11799000 | 8192 |
| 4 | نوکیا 106 2018 | 4 | 1.8 | 0.0 | 800 | 630000 | 4 |


In [31]:
# Normalize Arabic characters

def normalize_char(txt):
    txt = txt.replace('ك', 'ک')
    txt = txt.replace('دِ', 'د')
    txt = txt.replace('زِ', 'ز')
    txt = txt.replace('ذِ', 'ذ')
    txt = txt.replace('شِ', 'ش')
    txt = txt.replace('سِ', 'س')
    txt = txt.replace('ى', 'ی')
    txt = txt.replace('ي', 'ی')
    return txt

data['title'] = data['title'].apply(normalize_char)
data.head()

|  | title | hard disk | size | camera | battery | price |
|---|---|---|---|---|---|---|
| 0 | گوشی موبایل سامسونگ مدل گلکسی A32 4G دو سیم کا... | 128 | 6.4 | 64.0 | 5000 | 6219000 |
| 1 | گوشی موبایل نوکیا مدل 105 (2019) ظرفیت 4 مگابا... | 4 | 1.77 | 0.0 | 800 | 635000 |
| 2 | گوشی موبایل سامسونگ مدل Galaxy A13 ظرفیت 64 ... | 64 | 6.6 | 50.0 | 5000 | 4449000 |
| 3 | گوشی موبایل سامسونگ مدل Galaxy A52s 5G ظرفیت 2... | 256 | 6.5 | 64.0 | 4500 | 11799000 |
| 4 | گوشی موبایل نوکیا مدل 106 (2018) ظرفیت 4 مگابا... | 4 | 1.8 | 0.0 | 800 | 630000 |


In [32]:
# Extract RAM
import re 

def extract_ram(title):
    info = re.findall(r'رم (\d+) (مگابایت|گیگابایت)', title)
    if len(info)>0:
        if info[0][1] == 'مگابایت':
            return int(info[0][0])
        else:
            return int(info[0][0]) * 1024
    else:
        return 0

data['ram'] = data['title'].apply(extract_ram)
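
To see what the pattern captures, here is a small self-contained check on made-up titles. The helper name `ram_in_mb` and the sample titles are assumptions for illustration. The Persian words are kept in variables so that the pattern and the samples use exactly the same characters.

```python
import re

# Words used in the titles: "RAM", "megabyte", "gigabyte", "capacity"
RAM = 'رم'
MB = 'مگابایت'
GB = 'گیگابایت'
CAPACITY = 'ظرفیت'
RAM_PATTERN = re.compile(RAM + r' (\d+) (' + MB + '|' + GB + ')')

def ram_in_mb(title):
    """Return the RAM size in megabytes, or 0 when the title does not mention RAM."""
    match = RAM_PATTERN.search(title)
    if match is None:
        return 0
    amount, unit = int(match.group(1)), match.group(2)
    return amount * 1024 if unit == GB else amount

# Made-up titles for illustration
samples = [
    'Phone X ' + CAPACITY + ' 64 ' + GB + ' - ' + RAM + ' 4 ' + GB,
    'Phone Y ' + CAPACITY + ' 4 ' + MB + ' - ' + RAM + ' 4 ' + MB,
    'Phone Z',
]
ram_values = [ram_in_mb(t) for t in samples]
print(ram_values)
```

Because the pattern requires the word for RAM right before the number, the storage capacity in the same title is not picked up by mistake.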

In [33]:
# Clean title
def clean(title):
    title = title.replace('گوشی', '')
    title = title.replace('موبایل', '')
    title = title.replace('مدل', '')
    title = re.sub(r'رم \d+ (مگابایت|گیگابایت)', '', title)
    title = re.sub(r'ظرفیت \d+ (مگابایت|گیگابایت)', '', title)
    title = re.sub(r'\W+', ' ', title)
    return title.strip()

data['title'] = data['title'].apply(clean)

<hr />

💭 Run these 2 code blocks at the end of your task

In [34]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 288 entries, 0 to 287
Data columns (total 7 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   title      288 non-null    object 
 1   hard disk  288 non-null    int32  
 2   size       288 non-null    float64
 3   camera     288 non-null    float64
 4   battery    288 non-null    int32  
 5   price      288 non-null    object 
 6   ram        288 non-null    int64  
dtypes: float64(2), int32(2), int64(1), object(2)
memory usage: 13.6+ KB


In [43]:
data.sample(15)

|  | title | hard disk | size | camera | battery | price | ram |
|---|---|---|---|---|---|---|---|
| 184 | سامسونگ گلکسی A53 5G | 256 | 6.46 | 64.0 | 5000 | 11399000 | 8192 |
| 132 | نوکیا 106 2018 | 4 | 1.8 | 0.0 | 800 | 630000 | 4 |
| 157 | سامسونگ Galaxy A23 | 128 | 6.6 | 50.0 | 5000 | 5399000 | 4096 |
| 38 | شیائومی Redmi 9A | 32 | 6.53 | 13.0 | 5000 | 2769000 | 2048 |
| 278 | سامسونگ Galaxy A13 | 128 | 6.6 | 50.0 | 5000 | 4959000 | 4096 |
| 94 | سامسونگ Galaxy A23 | 128 | 6.6 | 50.0 | 5000 | 6325000 | 6144 |
| 240 | نوکیا 110 | 4 | 1.77 | 0.0 | 800 | 850000 | 4 |
| 153 | نوکیا 5310 2020 | 16 | 2.4 | 0.0 | 1200 | 1389000 | 0 |
| 105 | اپل iPhone 13 Pro Max ZA A Not Active | 256 | 6.7 | 12.0 | 4352 | 53499000 | 6144 |
| 180 | سامسونگ Galaxy A22 5G | 128 | 6.6 | 48.0 | 5000 | 5239000 | 4096 |


<hr />

### 💭 Learn more about the data

Good job so far. Now let's gain some insights from the data we crawled!

<a name="Task2:clean"></a>

💭 Show me all Samsung (سامسونگ) phones that have a 128GB hard disk.

In [45]:
def is_samsung(title):
    info = re.findall('سامسونگ', title)
    if len(info)>0:
        return 1
    else:
        return 0

In [50]:
'''Enter your code here'''
newdata = data.copy()
newdata['isSM'] = newdata['title'].apply(is_samsung)
# Keep the rows that are Samsung phones and have a 128GB hard disk
newdata[(newdata.isSM == 1) & (newdata['hard disk'] == 128)]

Unnamed: 0,title,hard disk,size,camera,battery,price,ram,isSM
0,سامسونگ گلکسی A32 4G دو سیم کارت,128,6.4,64.0,5000,6219000,6144,1
19,سامسونگ Galaxy A52,128,6.5,64.0,4500,8129000,8192,1
20,سامسونگ Galaxy A22 5G,128,6.6,48.0,5000,5239000,4096,1
22,سامسونگ Galaxy A13,128,6.6,50.0,5000,4959000,4096,1
29,سامسونگ Galaxy A23,128,6.6,50.0,5000,5399000,4096,1
30,سامسونگ Galaxy A23,128,6.6,50.0,5000,6325000,6144,1
32,سامسونگ گلکسی A32 4G دو سیم کارت,128,6.4,64.0,5000,6219000,6144,1
51,سامسونگ Galaxy A52,128,6.5,64.0,4500,8129000,8192,1
52,سامسونگ Galaxy A22 5G,128,6.6,48.0,5000,5239000,4096,1
54,سامسونگ Galaxy A13,128,6.6,50.0,5000,4959000,4096,1
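
The same filter can also be written without a helper column. This is a minimal sketch using `Series.str.contains` on a made-up three-row table (the `phones` frame is hypothetical, only the column names follow the notebook):

```python
import pandas as pd

# Hypothetical miniature of the cleaned phone table
phones = pd.DataFrame({
    'title': ['سامسونگ Galaxy A13', 'نوکیا G20', 'سامسونگ Galaxy A03 Core'],
    'hard disk': [128, 128, 32],
})

# Boolean masks: one per condition, combined with & (element-wise AND)
is_samsung_mask = phones['title'].str.contains('سامسونگ', regex=False)
has_128gb = phones['hard disk'] == 128

samsung_128 = phones[is_samsung_mask & has_128gb]
print(samsung_128)
```

Each condition produces a boolean Series, so the two can be combined with `&` and passed straight into the indexing brackets.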


<hr />

💭 Show me the phone with the largest battery capacity.

In [61]:
'''Enter your code here'''
data.loc[data.battery.idxmax()]

title        نوکیا G20
hard disk          128
size              6.52
camera            48.0
battery           5050
price        4,355,000
ram               4096
Name: 27, dtype: object
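
One thing to keep in mind here: `.loc` looks rows up by index label, while a function such as `np.argmax` returns a position. The two agree only when the index is the default 0, 1, 2, ... range. The sketch below uses a made-up table with a non-default index to show that `idxmax` returns the label directly:

```python
import pandas as pd

# Hypothetical table whose index is deliberately not 0..n-1
phones = pd.DataFrame(
    {
        'title': ['نوکیا 110', 'نوکیا G20', 'سامسونگ Galaxy A52'],
        'battery': [800, 5050, 4500],
    },
    index=[10, 20, 30],
)

# idxmax returns the index label of the largest value, which is what .loc expects
best = phones.loc[phones['battery'].idxmax()]
print(best['title'], best['battery'])
```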

<hr />

💭 Categorize the `price` column into the ranges below and tell me how many phones there are in each price category:
- (0.0, 1000000.0] : very low
- (1000000.0, 5000000.0] : low
- (5000000.0, 10000000.0] : mid
- (10000000.0, 20000000.0] : high
- (20000000.0, inf] : very high

<b>HINT</b>: use `pd.cut` method
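
As a rough illustration of the hint, `pd.cut` takes the bin edges and one label per interval, and by default treats each interval as open on the left and closed on the right, which matches the `(a, b]` notation above. The prices below are made up, and the sketch assumes the column is already numeric:

```python
import numpy as np
import pandas as pd

# Hypothetical prices; note that 5000000 sits exactly on a bin edge
prices = pd.Series([630000, 2769000, 5239000, 11399000, 53499000, 5000000])

bins = [0, 1_000_000, 5_000_000, 10_000_000, 20_000_000, np.inf]
labels = ['very low', 'low', 'mid', 'high', 'very high']

# right=True is the default, so every interval is (lower, upper]
categories = pd.cut(prices, bins=bins, labels=labels)

# Count how many prices fall into each category
counts = categories.value_counts(sort=False)
print(counts)
```

Because the right edge is included, a price of exactly 5000000 lands in `low` rather than `mid`.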

 

In [85]:
'''Enter your code here'''
data['Pcat'] = pd.cut(
    data.price,
    bins=[0, 1000000.0, 5000000.0, 10000000.0, 20000000.0, np.inf],
    labels=['very low', 'low', 'mid', 'high', 'very high'],
)
data.sample(10)

Unnamed: 0,title,hard disk,size,camera,battery,price,ram,Pcat
202,سامسونگ Galaxy A03 Core,32,6.5,8.0,5000,2749000,2048,low
92,نوکیا 150 2020,4,2.4,0.0,1020,1135000,0,low
111,شیائومی Redmi 9C,64,6.53,13.0,5000,3449000,3072,low
148,سامسونگ Galaxy A22 5G,128,6.6,48.0,5000,5239000,4096,mid
127,نوکیا G10,64,6.52,13.0,5050,3369000,4096,low
123,نوکیا G20,128,6.52,48.0,5050,4355000,4096,low
91,نوکیا G20,128,6.52,48.0,5050,4355000,4096,low
15,شیائومی Redmi 9C,64,6.53,13.0,5000,3449000,3072,low
137,اپل iPhone 13 Pro Max ZA A Not Active,256,6.7,12.0,4352,53499000,6144,very high
104,سامسونگ Galaxy M12,64,6.5,48.0,5000,4259000,4096,low


<hr />

💭 Plot a `stacked bar plot` of the number of phones in each price category for each brand. (Your table should look like the one in the output cell)
- consider these brands:
    - نوکیا
    - سامسونگ
    - شیائومی
    - اپل
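
One possible way to get the counts behind such a plot is `pd.crosstab`, which counts rows for every brand and price category pair. The sketch below runs on a made-up table; the `brand` column and its English labels are assumptions, not something the crawled data already contains:

```python
import pandas as pd

# Hypothetical table with one row per phone
phones = pd.DataFrame({
    'brand': ['Samsung', 'Samsung', 'Nokia', 'Xiaomi', 'Apple', 'Nokia'],
    'Pcat': ['low', 'mid', 'very low', 'low', 'very high', 'low'],
})

# Rows are brands, columns are price categories, cells are counts
table = pd.crosstab(phones['brand'], phones['Pcat'])
print(table)
```

With matplotlib installed, `table.plot(kind='bar', stacked=True)` would then draw one bar per brand, split by price category.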

In [None]:
def brand_title(title):
    # Map the Persian brand name found in the title to an English label
    brands = {
        'سامسونگ': 'Samsung',
        'نوکیا': 'Nokia',
        'شیائومی': 'Xiaomi',
        'اپل': 'Apple',
    }
    for persian_name, english_name in brands.items():
        if persian_name in title:
            return english_name
    # Title does not mention any of the four brands
    return np.nan

In [92]:
'''Enter your code here'''
data.Pcat.value_counts()

<hr />

💭 Plot a `Side-by-Side Boxplot` of the price of phones with a hard disk capacity of `64, 128 and 256`. (Your output should look like the one provided in the output cell)

In [40]:
'''Enter your code here'''

'Enter your code here'
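
As a starting point, the rows can be filtered with `isin` and grouped by capacity. The sketch below uses made-up prices and prints the per-group summary (quartiles, min and max) that a boxplot visualizes; it is only one way to approach the task:

```python
import pandas as pd

# Hypothetical prices for a handful of phones
phones = pd.DataFrame({
    'hard disk': [64, 64, 64, 128, 128, 128, 256, 32],
    'price': [3000000, 3400000, 4200000, 5000000, 5400000, 6300000, 11400000, 2700000],
})

# Keep only the three capacities of interest
subset = phones[phones['hard disk'].isin([64, 128, 256])]

# describe() reports count, mean, std, min, 25%, 50%, 75% and max per group
summary = subset.groupby('hard disk')['price'].describe()
print(summary)
```

With matplotlib installed, `subset.boxplot(column='price', by='hard disk')` would draw the three boxes side by side.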

<hr />
<hr />

<span style="color:green">
    
- Get crawling, but always treat the data and its owners with respect. There are many online articles about the ethics of crawling. Check them out if you are interested. :)  
    
- Learning the Selenium and Scrapy libraries is also recommended if you're interested in advanced crawling!

</span>


<hr />
<hr />

# Good Luck 😉

# More on crawling:
- An awesome free Persian <b><a  href=https://programming.tosinso.com/fa/videos/8506/%D8%AF%D9%88%D8%B1%D9%87-%D8%A2%D9%85%D9%88%D8%B2%D8%B4%DB%8C-%D8%B1%D8%A7%DB%8C%DA%AF%D8%A7%D9%86-Web-Scraping-%D8%A8%D8%A7-%D8%B2%D8%A8%D8%A7%D9%86-%D8%A8%D8%B1%D9%86%D8%A7%D9%85%D9%87-%D9%86%D9%88%DB%8C%D8%B3%DB%8C-%D9%BE%D8%A7%DB%8C%D8%AA%D9%88%D9%86 > course</a></b>. 