# Scraping ICC World Players Details using Python

Data Source: [ICC World Player Rankings](https://www.icc-cricket.com/rankings/mens/player-rankings/test/batting)
![WEB SCRAPING](https://i.imgur.com/xwRTlwz.png)

![image1](https://i.imgur.com/aUb4I5h.png)

## Web Scraping 

>### Q. What is Web Scraping?

Web scraping is the process of extracting and parsing data from websites in an automated fashion using a computer program. It's a useful technique for creating datasets for research and learning. In practice, it mostly involves parsing and processing HTML documents.

In general, web data extraction is used by people and businesses who want to make use of the vast amount of publicly available web data to make smarter decisions.

>### Q. How does Web Scraping work?

- STEP 1: Identify the website to be scraped.
- STEP 2: Find the URLs of the pages from which you want to extract data.
- STEP 3: Make a request to these URLs to get the HTML of each page.
- STEP 4: Use locators to find the data in the HTML.
- STEP 5: Save the data in a JSON or CSV file, or some other structured format, as per your requirement.
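
The five steps above can be sketched end to end on a tiny inline HTML snippet, so the example runs without network access. The markup, the class names `pos` and `name`, and the helper names are made up for illustration and are not the real structure of the ICC page; in a real run, step 3 would download the page instead of using a hard-coded string.

```python
import csv
import io

from bs4 import BeautifulSoup

# Steps 1-3 (stand-in): a hard-coded page instead of a downloaded one.
SAMPLE_HTML = """
<table>
  <tr><td class="pos">1</td><td class="name">Player A</td></tr>
  <tr><td class="pos">2</td><td class="name">Player B</td></tr>
</table>
"""


def scrape_rows(html):
    """Step 4: locate the cells by class and return one dict per table row."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr"):
        rows.append({
            "Ranking": tr.find(class_="pos").text.strip(),
            "Name": tr.find(class_="name").text.strip(),
        })
    return rows


def rows_to_csv(rows):
    """Step 5: serialise the rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["Ranking", "Name"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = scrape_rows(SAMPLE_HTML)
csv_text = rows_to_csv(rows)
print(csv_text)
```

Keeping the parsing step and the saving step in separate functions makes it easy to swap the hard-coded string for a real download later.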

![](https://i.imgur.com/b8DLXR4.png)

In [1]:
!pip install jovian --upgrade --quiet
import jovian

In [2]:
jovian.commit(project="web-scaping-project-final")

<IPython.core.display.Javascript object>

[jovian] Updating notebook "jp-amith/web-scaping-project-final" on https://jovian.ai
[jovian] Committed successfully! https://jovian.ai/jp-amith/web-scaping-project-final


'https://jovian.ai/jp-amith/web-scaping-project-final'

## About MRF Tyres ICC Player Ranking

The MRF Tyres ICC Player Rankings is a table in which international cricket players' performances are ranked using a points-based system. The points are worked out through a series of calculations that lead to a sophisticated moving average. 

Players are rated on a scale of 0 to 1000 points. If a player’s performance is improving on his past record, his points increase; if his performance is declining, his points go down. 

The value of each player’s performance within a match is calculated using an algorithm, a series of calculations (all pre-programmed) based on various circumstances in the match. There is no human intervention in this calculation process, and no subjective assessment is made. 
There are slightly different factors for each of the different formats of the game.

![](https://i.imgur.com/7qU5l4K.png)



## Project Idea

- In this project, my goal is to parse the international men's cricket Test batting rankings and the player information from the official ICC site.


- I will retrieve the information from the page [ICC Men's Test Batting Rankings](https://www.icc-cricket.com/rankings/mens/player-rankings/test/batting) using **web scraping**. This page gives the world ranking of the players, and each player's profile page gives further information about that player.

## Project Goal

The project goal is to build a web scraper that extracts all the desired information and assembles it into a single CSV file. The format of the output CSV file is shown below:

|#|Player Name|Ranking|Nation|Rating|Career Best|Date of Birth|Role|Style|
|-|-----------|-------|------|------|-----------|-------------|----|-----|
|1|Joe Root|1|ENG|923|923 v India, 05/07/2022|30 December 1990|Batter|Right Hand|
|2|Marnus Labuschagne|2|AUS|885|936 v Pakistan, 08/03/2022|22 June 1994|Batter|Right Hand|
|100|Mitchell Starc|...|...|...|...|...|...|...|

## PROJECT OUTLINE

Here is an outline of the steps we'll follow:

1. Download the webpage using the Python library `requests`

2. Parse the HTML source code using the `BeautifulSoup` library and extract the required data

3. Build the scraper components

4. Compile the extracted information into Python lists and dictionaries

5. Convert the Python dictionaries into `Pandas DataFrames`

6. Write the information into a `CSV file`

7. Future scope of the work and references

>## Packages Used:
>1. `Requests`: for downloading the HTML code from the ICC URL
>2. `BeautifulSoup4`: for parsing and extracting data from the HTML string
>3. `Pandas`: for gathering the required data into a DataFrame for data analysis and further processing

### How to run the code

This tutorial is an executable [Jupyter notebook](https://jupyter.org) hosted on [Jovian](https://www.jovian.ai). You can _run_ this tutorial and experiment with the code examples in a couple of ways: *using free online resources* (recommended) or *on your computer*.

#### Option 1: Running using free online resources (1-click, recommended)

The easiest way to start executing the code is to click the **Run** button at the top of this page and select **Run on Binder**. You can also select "Run on Colab" or "Run on Kaggle", but you'll need to create an account on [Google Colab](https://colab.research.google.com) or [Kaggle](https://kaggle.com) to use these platforms.


#### Option 2: Running on your computer locally

To run the code on your computer locally, you'll need to set up [Python](https://www.python.org), download the notebook and install the required libraries. We recommend using the [Conda](https://docs.conda.io/projects/conda/en/latest/user-guide/install/) distribution of Python. Click the **Run** button at the top of this page, select the **Run Locally** option, and follow the instructions.

>  **Jupyter Notebooks**: This tutorial is a [Jupyter notebook](https://jupyter.org) - a document made of _cells_. Each cell can contain code written in Python or explanations in plain English. You can execute code cells and view the results, e.g., numbers, messages, graphs, tables, files, etc., instantly within the notebook. Jupyter is a powerful platform for experimentation and analysis. Don't be afraid to mess around with the code & break things - you'll learn a lot by encountering and fixing errors. You can use the "Kernel > Restart & Clear Output" menu option to clear all outputs and start again from the top.



## Let us start


>Note : We will use the `Jovian` library and its `commit()` function throughout the code to save our progress as we move along.

In [3]:
!pip install jovian --upgrade --quiet  # Upgrades jovian; '--quiet' suppresses the installation messages
import jovian
# Execute this to save new versions of the notebook
jovian.commit(project="final-web-scraping-project")

<IPython.core.display.Javascript object>

[jovian] Updating notebook "jp-amith/web-scraping-project-final" on https://jovian.ai
[jovian] Committed successfully! https://jovian.ai/jp-amith/web-scraping-project-final


'https://jovian.ai/jp-amith/web-scraping-project-final'

## Using the Requests Library to download the web pages 

>### **What is `requests`**?


>Requests is a Python HTTP library that allows us to send HTTP requests to web servers from code, instead of using a browser to communicate with the web.

>We use `pip`, a package-management system, to install and manage software packages. Since the platform we selected is **Binder**, we have to run a line of code starting with `!pip install` to install `requests`. You will see more `!pip` commands when we install other packages.

>To use the prewritten functions of a library, we use the `import` statement. For example, once we run `import requests` after installation, we can use any function from the `requests` library.



![](https://i.imgur.com/V79Bwlm.png)

In [4]:
!pip install requests --quiet --upgrade
import requests

#### *requests.get()*

In order to **download a web page**, we use `requests.get()` to **send the HTTP request** to the **ICC server** and what the function returns is a **response object**, which is **the HTTP response**. 

In [5]:
topic_url='https://www.icc-cricket.com/rankings/mens/player-rankings/test/batting'
response= requests.get(topic_url)
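
As a hedged variation on the cell above, the download can be wrapped in a small helper that sets a timeout and fails loudly on HTTP errors. The helper name `fetch_html`, the 10-second timeout and the `User-Agent` string are assumptions made for this sketch, not part of the original notebook.

```python
import requests


def fetch_html(url, session=None, timeout=10):
    """Download a page and return its body as text.

    `session` may be a requests.Session (or any object with a compatible
    .get method); when omitted, the module-level requests.get is used.
    """
    http = session or requests
    response = http.get(
        url,
        timeout=timeout,  # avoid hanging forever on a slow server
        headers={"User-Agent": "icc-ranking-tutorial/0.1"},  # hypothetical value
    )
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
    return response.text


# Example call (needs network access, so it is left commented out):
# page_contents = fetch_html("https://www.icc-cricket.com/rankings/mens/player-rankings/test/batting")
```

Passing the session in as a parameter also makes the helper easy to exercise with a stand-in object when no network is available.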

## Status Code

- We need to check explicitly that the HTTP request was sent successfully and that we got an HTTP response back. Since we are not using a browser, we get no direct visual feedback when a request fails.

- If the request was successful, `response.status_code` is set to a value between 200 and 299.

- In general, the status code tells us whether the server sent back a successful HTTP response. In the requests library, `requests.get` returns a response object, which contains the page contents and a status code indicating whether the HTTP request was successful. Learn more about HTTP status codes here: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status.
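
The 200 to 299 rule can be written as a small check. This is a minimal sketch using only the standard library; the helper names `is_success` and `describe` are made up for illustration.

```python
from http import HTTPStatus


def is_success(status_code):
    """True for any 2xx status code."""
    return 200 <= status_code <= 299


def describe(status_code):
    """Return e.g. '404 Not Found'; unknown codes fall back to the bare number."""
    try:
        return f"{status_code} {HTTPStatus(status_code).phrase}"
    except ValueError:
        return str(status_code)


for code in (200, 404, 503):
    print(describe(code), "->", "ok" if is_success(code) else "failed")
```

With a real response object, `is_success(response.status_code)` gives the same answer as checking the number by eye, and `response.raise_for_status()` can be used when an exception is preferable for 4xx and 5xx responses.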

In [6]:
response.status_code # Checking the response code to know that the request was successful, should be between 200 and 299.

200

#### Here we can use `response.text` to retrieve the HTML document, and the `len()` function to check the length of the text

In [7]:
page_contents=response.text
len(response.text)

266250

#### Great! The HTML document that we downloaded in about a second contains 266250 characters!

In [8]:
response.text[:1000] 

'<!DOCTYPE html>\n<html lang="en">\n<head>\n\n    <meta name="twitter:title" content="ICC Men\'s Test Batting| Player Rankings | ICC"/>\n<meta property="og:type" content="website"/>\n<meta property="twitter:card" content="summary_large_image"/>\n<meta name="description" content="Official ICC Cricket website - live matches, scores, news, highlights, commentary, rankings, videos and fixtures from the International Cricket Council."/>\n<meta property="twitter:site" content="@icc"/>\n<meta name="twitter:description" content="Official ICC Cricket website - live matches, scores, news, highlights, commentary, rankings, videos and fixtures from the International Cricket Council."/>\n<meta name="twitter:image" content="https://www.icc-cricket.com/resources/ver/i/elements/default-thumbnail.jpg"/>\n<meta property="og:title" content="ICC Men\'s Test Batting| Player Rankings | ICC"/>\n<meta property="og:image" content="https://www.icc-cricket.com/resources/ver/i/elements/default-thumbnail.jpg"/>\n<

- What we see above is the source code of the web page. It is written in a language called HTML. 
- It defines the content and structure of the web page, which browsers like Chrome then display.

In [9]:
# Write the downloaded HTML to a local file with the given name
with open('icc_mens_batting_ranking.html', 'w', encoding='utf-8') as f:
    f.write(page_contents)

- Here, we have saved the text that we downloaded into an `HTML` file using the `with open` statement.

- The HTML file is now created with the given name `icc_mens_batting_ranking.html`.

- In the Jupyter notebook, we can open the HTML file using the following path:
File -> Open -> `icc_mens_batting_ranking.html`
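
When saving downloaded pages, it can help to state the text encoding explicitly, since the platform default is not always UTF-8. The sketch below writes and re-reads a small sample string in a temporary directory; the helper names and the sample content are assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def save_html(text, path):
    """Write text as UTF-8 and return the number of characters written."""
    return Path(path).write_text(text, encoding="utf-8")


def load_html(path):
    """Read the file back using the same encoding."""
    return Path(path).read_text(encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "sample_page.html"
    sample = "<html><body>Café ✓</body></html>"
    save_html(sample, target)
    round_trip_ok = load_html(target) == sample

print(round_trip_ok)
```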

![](https://i.imgur.com/7q6MKCG.png)



In [10]:
jovian.commit() #Saving the work to jovian cloud 

<IPython.core.display.Javascript object>

[jovian] Updating notebook "jp-amith/web-scraping-project-final" on https://jovian.ai
[jovian] Committed successfully! https://jovian.ai/jp-amith/web-scraping-project-final


'https://jovian.ai/jp-amith/web-scraping-project-final'

## Using `BeautifulSoup` to parse and extract information from the web page

![](https://i.imgur.com/Ld1s7JA.png)

>### What is Beautiful Soup?

- Beautiful Soup is **a Python package** for **parsing HTML and XML documents**. It creates a parse tree for parsed pages that can be used to extract data from HTML, which makes it a handy tool for web scraping. You can read more on the documentation site: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#getting-help



- To extract information from the HTML source code of a webpage programmatically, we can use the Beautiful Soup library. Let's install the library and import **the BeautifulSoup class** from **the bs4 module.**



In [11]:
!pip install beautifulsoup4 --quiet --upgrade

from bs4 import BeautifulSoup as bs
doc = bs(page_contents, 'html.parser') #Now 'doc' contains entire html text in parsed format

### Inspecting the HTML source code of a web page


![](https://i.imgur.com/309Pcy9.png)


#### What is `HTML` ?

- The HyperText Markup Language, or HTML is the standard markup language for documents designed to be displayed in a web browser. 

- It can be assisted by technologies such as Cascading Style Sheets and scripting languages such as JavaScript.

- When creating a BeautifulSoup object, we specify `html.parser` so that the page is parsed into a tree of components, instead of being read as one long string.

## **An HTML tag comprises three parts:**

1. **Name**: (`html`, `head`, `body`, `div`, etc.) Indicates what the tag represents and how a browser should interpret the information inside it.
2. **Attributes**: (`href`, `target`, `class`, `id`, etc.) Properties of tag used by the browser to customize how a tag is displayed and decide what happens on user interactions.
3. **Children**: A tag can contain some text or other tags or both between the opening and closing segments, e.g., `<div>Some content</div>`.
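
The three parts of a tag map directly onto attributes of a Beautiful Soup `Tag` object. The snippet below is a minimal sketch on a made-up `<div>`; the HTML string is illustrative and not taken from the ICC page.

```python
from bs4 import BeautifulSoup

snippet = '<div id="intro" class="box highlighted">Hello <a href="/about">about us</a></div>'
div = BeautifulSoup(snippet, "html.parser").div

print(div.name)    # the tag name
print(div.attrs)   # the attributes as a dict; 'class' is stored as a list

# Children can be plain text or nested tags
children = list(div.children)
for child in children:
    print(repr(child))
```

Note that `class` comes back as a list because an element can carry several classes at once.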


## Common Tags and Attributes

Following are some of the most commonly used HTML tags:

* `html`
* `head`
* `title`
* `body`
* `div`
* `span`
* `h1` to `h6`
* `p`
* `img`
* `ul`, `ol` and `li`
* `table`, `tr`, `th` and `td`
* `style`
* ...

Each tag supports several attributes. Following are some common attributes used to modify the behavior of tags:

* `id`
* `style`
* `class`
* `href` (used with `<a>`)
* `src` (used with `<img>`)


With **a BeautifulSoup object**, we can get **a specific type of tag in the HTML** by the name of the tag, as shown in the code cells below.

Here, we use the `find()` function of BeautifulSoup to find the first `<title>` tag in the HTML document and display its content.

In [12]:
type(doc)

bs4.BeautifulSoup

In [13]:
title = doc.find('title')
title.text

"ICC Men's Test Batting| Player Rankings | ICC"

In [14]:
type(title)

bs4.element.Tag
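
One detail worth keeping in mind: `find()` returns the first match or `None`, while `find_all()` returns a list of every match. The sketch below shows both on a made-up page; the helper `text_or_default` is hypothetical and simply guards against the `None` case.

```python
from bs4 import BeautifulSoup

page = BeautifulSoup(
    "<html><head><title>Demo page</title></head>"
    "<body><p>one</p><p>two</p></body></html>",
    "html.parser",
)

first_p = page.find("p")        # first <p> tag only
all_p = page.find_all("p")      # every <p> tag, in document order
missing = page.find("table")    # no <table> on this page, so this is None


def text_or_default(tag, default="NA"):
    """Return the stripped text of a tag, or a default when the tag is missing."""
    return tag.text.strip() if tag is not None else default


print(text_or_default(page.find("title")))
print(text_or_default(missing))
```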

## Inspecting HTML in the Browser

>To view the **source code** of any webpage right within **your browser**, you can **right click** anywhere on a page and **select** the **"Inspect"** option. You access the **"Developer Tools"** mode, where you can see the source code as **a tree**. You can expand and collapse various nodes and find the source code for a specific portion of the page.

![](https://i.imgur.com/jnKM8jO.png)


As shown in the screenshot above, I've inspected one of the player names to see how the content is structured. 
I found that each player's name sits inside an `<a>` tag. Since this tag does not have any specific class or other attribute, I have to pick out the desired `<a>` tags from among all the `<a>` tags present on the page to get only the required information.

Since I've downloaded a single page and turned it into a BeautifulSoup object, we can start to use functions from the Beautiful Soup library to extract the pieces of information we want.
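
One way to pick out only the player links, sketched here as an alternative rather than the approach used in the cells that follow, is a CSS selector that reaches the `<a>` tags through their parent cell. The sample markup copies the cell structure shown later in this notebook; the extra `/news` link is made up to show that unrelated `<a>` tags are skipped.

```python
from bs4 import BeautifulSoup

sample = """
<table>
<tr><td class="table-body__cell rankings-table__name name">
<a href="/rankings/mens/player-rankings/4029">Marnus Labuschagne</a>
</td></tr>
<tr><td class="table-body__cell rankings-table__name name">
<a href="/rankings/mens/player-rankings/2759">Babar Azam</a>
</td></tr>
</table>
<a href="/news">News</a>
"""

soup = BeautifulSoup(sample, "html.parser")

# Only <a> tags nested inside a name cell are selected
links = soup.select("td.rankings-table__name a")
players = [(a.text.strip(), a["href"]) for a in links]
print(players)
```

Selecting on a single class such as `rankings-table__name` is also a little less brittle than matching the full class string, which breaks if the order of the classes changes.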


### Now we will use `BeautifulSoup` to extract the `Name`, `Ranking`, `Nation`, `Rating` and `Career Best` from the HTML page

In [15]:
maindoc=doc.find_all('tr') 

In [16]:
len(maindoc)

101

In [17]:
maindoc[:1]

[<tr class="table-head">
 <th class="table-head__cell u-text-right">Pos</th>
 <th class="table-head__cell">Player</th>
 <th class="table-head__cell">Team</th>
 <th class="table-head__cell table-head__cell--rating">Rating</th>
 <th class="table-head__cell u-text-right u-hide-phablet">Career Best Rating</th>
 </tr>]

In [18]:
maindoc = doc.find_all("tr")

In [19]:
first_link=doc.tr

In [20]:
first_link['class']

['table-head']

In [21]:
first_link.attrs

{'class': ['table-head']}

### Player Name

In [22]:
ranking_rank1=doc.find_all(class_="rankings-block__pos-number")
 # The rank-1 player's position sits in a separate banner block, so it is fetched separately

In [23]:
ranking_rank1
Rank=[]
Rank.append(ranking_rank1[0].text.strip())
Rank

['1']

In [24]:
ranking=doc.find_all(class_="rankings-table__pos-number")

In [25]:
def total_players(ranking):
    # Appends ranks 2 to 100 to the global Rank list, which already holds rank 1
    for i in range(len(ranking)):
        Rank.append(ranking[i].text.strip())
    return len(Rank)

In [26]:
total_players(ranking)

100

In [27]:
player_name = doc.find_all(class_="table-body__cell rankings-table__name name")

In [28]:
len(player_name)

99

In [29]:
player_name  # Displaying the <td> cells that hold the names of the players ranked 2 to 100

[<td class="table-body__cell rankings-table__name name">
 <a href="/rankings/mens/player-rankings/4029">Marnus Labuschagne</a>
 </td>,
 <td class="table-body__cell rankings-table__name name">
 <a href="/rankings/mens/player-rankings/2759">Babar Azam</a>
 </td>,
 <td class="table-body__cell rankings-table__name name">
 <a href="/rankings/mens/player-rankings/271">Steve Smith</a>
 </td>,
 <td class="table-body__cell rankings-table__name name">
 <a href="/rankings/mens/player-rankings/2972">Rishabh Pant</a>
 </td>,
 <td class="table-body__cell rankings-table__name name">
 <a href="/rankings/mens/player-rankings/440">Kane Williamson</a>
 </td>,
 <td class="table-body__cell rankings-table__name name">
 <a href="/rankings/mens/player-rankings/948">Usman Khawaja</a>
 </td>,
 <td class="table-body__cell rankings-table__name name">
 <a href="/rankings/mens/player-rankings/954">Dimuth Karunaratne</a>
 </td>,
 <td class="table-body__cell rankings-table__name name">
 <a href="/rankings/mens/player

#### Collecting only the required information by using `.append()`, `.strip()` and `.text` after declaring a list

In [30]:
def Player_name(rank1_player_name,player_name):
    # The rank-1 name comes from the banner; the other names come from the table cells
    Name=[]
    Name.append(rank1_player_name[0].text.strip())
    for i in range(len(player_name)):
        Name.append(player_name[i].text.strip())
    return Name

In [31]:
rank1_player_name=doc.find_all(class_="rankings-block__banner--name-large")
player_name = doc.find_all(class_="table-body__cell rankings-table__name name")

Player_name(rank1_player_name,player_name)

['Joe Root',
 'Marnus Labuschagne',
 'Babar Azam',
 'Steve Smith',
 'Rishabh Pant',
 'Kane Williamson',
 'Usman Khawaja',
 'Dimuth Karunaratne',
 'Rohit Sharma',
 'Jonny Bairstow',
 'Daryl Mitchell',
 'Virat Kohli',
 'Dean Elgar',
 'Litton Das',
 'Travis Head',
 'Dinesh Chandimal',
 'Mohammad Rizwan',
 'David Warner',
 'Mushfiqur Rahim',
 'Abdullah Shafique',
 'Tom Blundell',
 'Angelo Mathews',
 'Mayank Agarwal',
 'Azhar Ali',
 'Tom Latham',
 'Cheteshwar Pujara',
 'Ben Stokes',
 'Kraigg Brathwaite',
 'Temba Bavuma',
 'Henry Nicholls',
 'Abid Ali',
 'Sean Williams',
 'Colin de Grandhomme',
 'Devon Conway',
 'Ravindra Jadeja',
 'Shreyas Iyer',
 'Jermaine Blackwood',
 'Ajinkya Rahane',
 'Cameron Green',
 'Tamim Iqbal',
 'Dhananjaya de Silva',
 'Shakib Al Hasan',
 'Lokesh Rahul',
 'Keegan Petersen',
 'Ollie Pope',
 'Aiden Markram',
 'Rassie van der Dussen',
 'Fawad Alam',
 'Niroshan Dickwella',
 'Imam-ul-Haq',
 'Sikandar Raza',
 'Nkruma Bonner',
 'Kusal Mendis',
 'Kyle Mayers',
 'Rory Burn

- As the name of each player is written directly as the text of the `<a>` tag, we can access it using the `find_all()` function of the BeautifulSoup object, i.e. `doc` here.

- But for the link to each player's profile page, we will have to access one of the attributes of the `<a>` tag, i.e. `href`, which contains the relative URL.
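
The name, ranking, nation, rating and career-best functions in this notebook all follow the same pattern: take the rank-1 value from the banner, then append the values from the table cells. As a sketch of how that pattern could be shared, the hypothetical helper `collect_text` below does it once for any pair of class names. The sample markup is a shortened stand-in for the real page.

```python
from bs4 import BeautifulSoup


def collect_text(doc, banner_class, table_class):
    """Return the banner (rank 1) text followed by the text of each table cell."""
    banner = doc.find_all(class_=banner_class)
    cells = doc.find_all(class_=table_class)
    # banner[:1] is empty when no banner is found, so nothing breaks in that case
    return [tag.text.strip() for tag in list(banner[:1]) + list(cells)]


sample = """
<div class="rankings-block__banner--name-large">Joe Root</div>
<table>
<tr><td class="table-body__cell rankings-table__name name"><a href="#">Marnus Labuschagne</a></td></tr>
<tr><td class="table-body__cell rankings-table__name name"><a href="#">Babar Azam</a></td></tr>
</table>
"""

sample_doc = BeautifulSoup(sample, "html.parser")
names = collect_text(sample_doc, "rankings-block__banner--name-large", "rankings-table__name")
print(names)
```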

### Player Ranking

In [32]:
def Player_ranking(ranking_rank1,ranking):
    Rank=[]
    Rank.append(ranking_rank1[0].text.strip())
    for i in range(len(ranking)):
        Rank.append(ranking[i].text.strip())
    return Rank

In [33]:
ranking_rank1=doc.find_all(class_="rankings-block__pos-number") #  Rank1 player ranking
ranking=doc.find_all(class_="rankings-table__pos-number") #  Rank2 to Rank100 player ranking
Player_ranking(ranking_rank1,ranking)

['1',
 '2',
 '3',
 '4',
 '5',
 '6',
 '7',
 '8',
 '9',
 '10',
 '11',
 '12',
 '13',
 '14',
 '15',
 '16',
 '17',
 '18',
 '19',
 '20',
 '21',
 '22',
 '23',
 '24',
 '25',
 '26',
 '27',
 '28',
 '29',
 '30',
 '31',
 '32',
 '33',
 '34',
 '35',
 '36',
 '37',
 '38',
 '39',
 '40',
 '41',
 '42',
 '43',
 '44',
 '45',
 '46',
 '47',
 '48',
 '49',
 '50',
 '51',
 '52',
 '53',
 '54',
 '55',
 '56',
 '57',
 '58',
 '59',
 '60',
 '61',
 '62',
 '63',
 '64',
 '65',
 '66',
 '67',
 '68',
 '69',
 '70',
 '71',
 '72',
 '73',
 '74',
 '75',
 '76',
 '77',
 '78',
 '79',
 '80',
 '81',
 '82',
 '83',
 '84',
 '85',
 '86',
 '87',
 '88',
 '89',
 '90',
 '91',
 '92',
 '93',
 '94',
 '95',
 '96',
 '97',
 '98',
 '99',
 '100']

### Player Nation

In [34]:
def Player_nation(rank1_nation,nation):
    Nation=[]
    Nation.append(rank1_nation[0].text.strip())
    for i in range(len(nation)):
        Nation.append(nation[i].text.strip())
    return Nation

In [35]:
rank1_nation = doc.find_all(class_="rankings-block__banner--nationality")
nation=doc.find_all(class_="table-body__logo-text")
Player_nation(rank1_nation,nation)

['ENG',
 'AUS',
 'PAK',
 'AUS',
 'IND',
 'NZ',
 'AUS',
 'SL',
 'IND',
 'ENG',
 'NZ',
 'IND',
 'SA',
 'BAN',
 'AUS',
 'SL',
 'PAK',
 'AUS',
 'BAN',
 'PAK',
 'NZ',
 'SL',
 'IND',
 'PAK',
 'NZ',
 'IND',
 'ENG',
 'WI',
 'SA',
 'NZ',
 'PAK',
 'ZIM',
 'NZ',
 'NZ',
 'IND',
 'IND',
 'WI',
 'IND',
 'AUS',
 'BAN',
 'SL',
 'BAN',
 'IND',
 'SA',
 'ENG',
 'SA',
 'SA',
 'PAK',
 'SL',
 'PAK',
 'ZIM',
 'WI',
 'SL',
 'WI',
 'ENG',
 'AUS',
 'IND',
 'SL',
 'AUS',
 'BAN',
 'AFG',
 'WI',
 'IND',
 'ENG',
 'SA',
 'PAK',
 'PAK',
 'SL',
 'WI',
 'ENG',
 'AUS',
 'BAN',
 'AUS',
 'WI',
 'SL',
 'WI',
 'SA',
 'IND',
 'ZIM',
 'SL',
 'ENG',
 'BAN',
 'ENG',
 'ENG',
 'ENG',
 'ENG',
 'IND',
 'AUS',
 'AFG',
 'SL',
 'BAN',
 'NZ',
 'ENG',
 'WI',
 'AFG',
 'PAK',
 'WI',
 'WI',
 'AUS',
 'WI']

### Player Ratings

In [36]:
def Player_ratings(rating,rank1_rating):
    Rating=[]
    Rating.append(rank1_rating[0].text.strip())
    for i in range(len(rating)):
        Rating.append(rating[i].text.strip())
    return Rating

In [37]:
rating=doc.find_all(class_="table-body__cell rating")
rank1_rating = doc.find_all(class_="rankings-block__banner--rating")
Player_ratings(rating,rank1_rating)

['900',
 '885',
 '879',
 '848',
 '801',
 '786',
 '766',
 '748',
 '746',
 '716',
 '715',
 '714',
 '700',
 '694',
 '678',
 '673',
 '670',
 '667',
 '662',
 '657',
 '654',
 '654',
 '644',
 '631',
 '622',
 '622',
 '621',
 '619',
 '614',
 '613',
 '610',
 '603',
 '601',
 '594',
 '590',
 '585',
 '583',
 '577',
 '576',
 '572',
 '570',
 '570',
 '560',
 '556',
 '552',
 '551',
 '543',
 '536',
 '534',
 '523',
 '523',
 '516',
 '516',
 '511',
 '510',
 '502',
 '501',
 '500',
 '497',
 '488',
 '486',
 '486',
 '482',
 '476',
 '476',
 '465',
 '462',
 '461',
 '461',
 '457',
 '453',
 '449',
 '449',
 '444',
 '439',
 '439',
 '435',
 '433',
 '427',
 '419',
 '415',
 '410',
 '410',
 '407',
 '406',
 '403',
 '398',
 '398',
 '397',
 '396',
 '395',
 '386',
 '385',
 '385',
 '377',
 '369',
 '369',
 '369',
 '362',
 '354']

### Career Best

In [38]:
def Career_best(career_best,career1_best):
    # The rank-1 career best comes from the banner; the rest come from the table cells
    Career_Best=[]
    Career_Best.append(career1_best[0].text.strip())
    for i in range(len(career_best)):
        Career_Best.append(career_best[i].text.strip())
    return Career_Best  # Each entry holds the career-best rating, the opponent and the date it was achieved

In [39]:
career_best = doc.find_all(class_="table-body__cell u-text-right u-hide-phablet")
career1_best=doc.find_all(class_="rankings-block__career-best-text")
Career_best(career_best,career1_best)

['923 v India, 05/07/2022',
 '936 v Pakistan, 08/03/2022',
 '879 v Sri Lanka, 28/07/2022',
 '947 v England, 08/01/2018',
 '801 v England, 05/07/2022',
 '919 v Pakistan, 07/01/2021',
 '779 v Sri Lanka, 03/07/2022',
 '782 v Australia, 12/07/2022',
 '813 v England, 06/09/2021',
 '772 v South Africa, 07/08/2017',
 '715 v England, 27/06/2022',
 '937 v England, 22/08/2018',
 '784 v Australia, 03/04/2018',
 '724 v Sri Lanka, 27/05/2022',
 '773 v England, 18/01/2022',
 '755 v West Indies, 18/06/2018',
 '700 v Australia, 16/03/2022',
 '880 v India, 13/12/2014',
 '675 v Sri Lanka, 27/05/2022',
 '671 v Sri Lanka, 20/07/2022',
 '654 v England, 27/06/2022',
 '877 v New Zealand, 30/12/2015',
 '727 v New Zealand, 25/02/2020',
 '787 v Australia, 30/12/2016',
 '733 v West Indies, 07/12/2020',
 '888 v Sri Lanka, 07/08/2017',
 '827 v West Indies, 20/07/2020',
 '701 v England, 29/08/2017',
 '627 v Pakistan, 07/01/2019',
 '778 v Bangladesh, 12/03/2019',
 '643 v Bangladesh, 30/11/2021',
 '621 v Afghanistan,
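
Each career-best entry packs three facts into one string: the rating, the opponent and the date. If these are needed as separate values later, they can be split with a regular expression. This is an optional sketch, not a step of the original notebook; the helper name is hypothetical, and the day/month/year order is inferred from entries such as `28/07/2022`.

```python
import re
from datetime import datetime

# rating, then " v ", then opponent, then ", ", then a dd/mm/yyyy date
CAREER_BEST = re.compile(r"^(\d+) v (.+), (\d{2}/\d{2}/\d{4})$")


def parse_career_best(text):
    """Split '923 v India, 05/07/2022' into its parts; return None if it does not match."""
    match = CAREER_BEST.match(text.strip())
    if match is None:
        return None
    rating, opponent, date_text = match.groups()
    return {
        "rating": int(rating),
        "opponent": opponent,
        "date": datetime.strptime(date_text, "%d/%m/%Y").date(),
    }


parsed = parse_career_best("923 v India, 05/07/2022")
print(parsed)
```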

## Creating a Data Frame using Pandas

> Why use **Pandas**?

>Pandas is a software library written for the Python programming language for data manipulation and analysis. 
>In particular, it offers data structures and operations for manipulating numerical tables and time series.

![](https://i.imgur.com/oDZnRhA.png)


>What is a **DataFrame**?

>A Pandas DataFrame is a 2-dimensional data structure, like a 2-dimensional array, or a table with rows and columns.
>A DataFrame makes it easier for us to work with tabular data and analyse it.
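
A DataFrame can be built directly from a dictionary of equal-length lists, which is the shape of the data collected above. The sketch below uses a two-row sample and a hypothetical helper, `build_frame`, that checks the column lengths first, since mismatched lengths are a common sign that one of the scraped lists missed an entry.

```python
import pandas as pd


def build_frame(columns):
    """Build a DataFrame from a dict of lists after checking that all lengths match."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"Column lengths differ: {lengths}")
    return pd.DataFrame(columns)


sample = {
    "Name": ["Joe Root", "Marnus Labuschagne"],
    "Ranking": ["1", "2"],
    "Rating": ["900", "885"],
}

frame = build_frame(sample)
# Scraped values are strings; convert them when numeric operations are needed
frame["Rating"] = pd.to_numeric(frame["Rating"])
print(frame)
```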



In [75]:
import pandas as pd # Importing Pandas 

In [76]:
data_dict={}
data_dict['Ranking']=Player_ranking(ranking_rank1,ranking)
data_dict['Name']=Player_name(rank1_player_name,player_name)
data_dict['Nation']=Player_nation(rank1_nation,nation)
data_dict['Rating']=Player_ratings(rating,rank1_rating)
data_dict['Career Best']= Career_best(career_best,career1_best)

In [77]:
Players_df=pd.DataFrame.from_dict(data_dict)
Players_df=Players_df.reindex(columns=['Name','Ranking','Nation','Rating','Career Best'])
Players_df # First web page result by using Pandas

| |Name|Ranking|Nation|Rating|Career Best|
|-|----|-------|------|------|-----------|
|0|Joe Root|1|ENG|900|923 v India, 05/07/2022|
|1|Marnus Labuschagne|2|AUS|885|936 v Pakistan, 08/03/2022|
|2|Babar Azam|3|PAK|879|879 v Sri Lanka, 28/07/2022|
|3|Steve Smith|4|AUS|848|947 v England, 08/01/2018|
|4|Rishabh Pant|5|IND|801|801 v England, 05/07/2022|
|...|...|...|...|...|...|
|95|Haris Sohail|96|PAK|369|563 v New Zealand, 28/11/2018|
|96|Shamarh Brooks|97|WI|369|536 v England, 20/07/2020|
|97|Roston Chase|98|WI|369|626 v Pakistan, 14/05/2017|
|98|Mitchell Starc|99|AUS|362|446 v India, 25/02/2017|


- First, we created a Python dictionary with the player names and the other information extracted so far, and then converted it into a DataFrame.

We can see that the DataFrame consists of **100 items**, which is equal to the number of players listed on the ICC Men's Test batting rankings page.

Therefore, we can be sure that we have extracted the complete information that we had intended to.

![](https://i.imgur.com/cHMGPi0.png)


- We have finally created the Data Frame which contains Player **Names, Ranking, Nation, Ratings and Career Best**

In [78]:
jovian.commit()

<IPython.core.display.Javascript object>

[jovian] Updating notebook "jp-amith/web-scraping-project-final" on https://jovian.ai
[jovian] Committed successfully! https://jovian.ai/jp-amith/web-scraping-project-final


'https://jovian.ai/jp-amith/web-scraping-project-final'

## Next Step

#### Now we will go to the individual player profile pages and extract the rest of the required information, using the same procedure as for the first web page

In [79]:
info_url='https://www.icc-cricket.com/rankings/mens/player-rankings/887'

In [80]:
player_info = info_url  #To get information of the first player in the ranking table
response = requests.get(player_info)
response.status_code

200

In [81]:
player_info=response.text
len(response.text)

136998

In [82]:
response.text[:1000]

'<!DOCTYPE html>\n<html lang="en">\n<head>\n\n    <meta name="twitter:title" content="ICC Profile - Stats, Ranking & Info"/>\n<meta property="og:type" content="website"/>\n<meta property="twitter:card" content="summary_large_image"/>\n<meta name="description" content="Official ICC Cricket website - live matches, scores, news, highlights, commentary, rankings, videos and fixtures from the International Cricket Council."/>\n<meta property="twitter:site" content="@icc"/>\n<meta name="twitter:description" content="Official ICC Cricket website - live matches, scores, news, highlights, commentary, rankings, videos and fixtures from the International Cricket Council."/>\n<meta name="twitter:image" content="https://www.icc-cricket.com/resources/ver/i/elements/default-thumbnail.jpg"/>\n<meta property="og:title" content="ICC Profile - Stats, Ranking & Info"/>\n<meta property="og:image" content="https://www.icc-cricket.com/resources/ver/i/elements/default-thumbnail.jpg"/>\n<title>ICC Profile - St

In [83]:
# Save the profile page (player_info), not the rankings page, to a local file
with open('player_info.html', 'w', encoding='utf-8') as f:
    f.write(player_info)

In [84]:
!pip install beautifulsoup4 --quiet --upgrade

from bs4 import BeautifulSoup
doc2 = BeautifulSoup(player_info, 'html.parser')

In [85]:
type(doc2)

bs4.BeautifulSoup

In [86]:
title = doc2.find('title')
title.text

'ICC Profile - Stats, Ranking & Info'

In [88]:
link=doc2.span

In [89]:
link['class']

['js-profile-completion-percentage']

In [90]:
link.attrs

{'class': ['js-profile-completion-percentage']}

In [91]:
tag=doc2.find_all('span')

In [92]:
len(tag)

86

### Player Date of Birth

In [93]:
Name=[]
innerp=[]
inner1=doc.find_all(class_="rankings-block__player-image-container rankings-block__player-image-container--large")
innerp.append([rank1_player_name[0].text.strip(),inner1[0].find('a').get('href')])
Name.append(rank1_player_name[0].text.strip())

In [94]:
for i in range(len(player_name)):
    Name.append(player_name[i].text.strip())
    innerp.append([player_name[i].text.strip(),player_name[i].find('a').get('href')])
innerp # Each entry pairs a player's name with the relative URL of that player's profile page

[['Joe Root', '/rankings/mens/player-rankings/887'],
 ['Marnus Labuschagne', '/rankings/mens/player-rankings/4029'],
 ['Babar Azam', '/rankings/mens/player-rankings/2759'],
 ['Steve Smith', '/rankings/mens/player-rankings/271'],
 ['Rishabh Pant', '/rankings/mens/player-rankings/2972'],
 ['Kane Williamson', '/rankings/mens/player-rankings/440'],
 ['Usman Khawaja', '/rankings/mens/player-rankings/948'],
 ['Dimuth Karunaratne', '/rankings/mens/player-rankings/954'],
 ['Rohit Sharma', '/rankings/mens/player-rankings/107'],
 ['Jonny Bairstow', '/rankings/mens/player-rankings/506'],
 ['Daryl Mitchell', '/rankings/mens/player-rankings/3642'],
 ['Virat Kohli', '/rankings/mens/player-rankings/164'],
 ['Dean Elgar', '/rankings/mens/player-rankings/652'],
 ['Litton Das', '/rankings/mens/player-rankings/1596'],
 ['Travis Head', '/rankings/mens/player-rankings/1020'],
 ['Dinesh Chandimal', '/rankings/mens/player-rankings/230'],
 ['Mohammad Rizwan', '/rankings/mens/player-rankings/1201'],
 ['David W
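
The `href` values collected above are relative paths, so they have to be combined with the site's base URL before they can be requested. A minimal sketch using the standard library's `urljoin` follows; the helper name `profile_url` is an assumption made for illustration.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.icc-cricket.com"


def profile_url(href, base=BASE_URL):
    """Turn a relative profile path into an absolute URL."""
    return urljoin(base, href)


joe_root_url = profile_url("/rankings/mens/player-rankings/887")
print(joe_root_url)
```

Unlike plain string concatenation, `urljoin` also leaves an `href` that is already absolute unchanged.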

In [95]:
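Dateob=[]
Role=[]
style=[]
for i in innerp:
  # Build the absolute URL of the player's page from the relative href
  url="https://www.icc-cricket.com"+i[1]
  response=requests.get(url)
  soup=bs(response.content,"html.parser")
  # Bio entries come in the order: date of birth, role, batting style
  dob=soup.find_all(class_="rankings-player-bio__entry")
  if len(dob)>=3:  # index only when all three entries are present
    Dateob.append(dob[0].text)
    Role.append(dob[1].text)
    style.append(dob[2].text)
  else:
    # Page has no (or an incomplete) bio block
    Dateob.append("NA")
    Role.append("NA")
    style.append("NA")
dic2={}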
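The loop above reads three bio entries per player and falls back to "NA" when they are missing. The same idea can be sketched as a small pure function that always returns exactly three values, whatever the page provides. The name `bio_fields` and its parameters are hypothetical, and it takes plain strings instead of BeautifulSoup tags so it can be tried without a network connection:

```python
def bio_fields(entries, expected=3, missing="NA"):
    """Return exactly `expected` cleaned strings, padding with `missing`."""
    cleaned = [text.strip() for text in entries[:expected]]
    return cleaned + [missing] * (expected - len(cleaned))


print(bio_fields(["30 December 1990", "Batter", "Right Hand"]))  # complete bio
print(bio_fields(["30 December 1990"]))                          # partial bio
print(bio_fields([]))                                            # no bio at all
```

In the scraping loop this could be called as `bio_fields([tag.text for tag in dob])`, so a partial bio keeps the values that are there.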

In [96]:
dic2['Dateofbirth']=Dateob # Date of birth of each player

In [97]:
Dateob

['30 December 1990',
 '22 June 1994',
 '15 October 1994',
 '02 June 1989',
 '04 October 1997',
 '08 August 1990',
 '18 December 1986',
 '21 April 1988',
 '30 April 1987',
 '26 September 1989',
 '25 November 1983',
 '05 November 1988',
 '11 June 1987',
 '13 October 1994',
 '29 December 1993',
 '18 November 1989',
 '01 June 1992',
 '27 October 1986',
 '09 May 1987',
 'NA',
 '01 September 1990',
 '02 June 1987',
 '18 February 1991',
 '19 February 1985',
 '02 April 1992',
 '25 January 1988',
 '04 June 1991',
 '01 December 1992',
 '17 May 1990',
 '15 November 1991',
 '16 October 1987',
 '26 September 1986',
 '22 July 1986',
 'NA',
 '06 December 1988',
 '06 December 1994',
 '20 November 1991',
 '06 June 1988',
 'NA',
 '20 March 1989',
 '06 September 1991',
 '24 March 1987',
 '18 April 1992',
 'NA',
 '02 January 1998',
 '04 October 1994',
 '07 February 1989',
 '08 October 1985',
 '23 June 1993',
 '12 December 1995',
 '24 April 1986',
 'NA',
 '02 February 1995',
 'NA',
 '26 August 1990',
 '08 
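The dates of birth are scraped as plain strings such as '30 December 1990', mixed with 'NA' placeholders. A minimal sketch of turning them into `date` objects with the standard library, assuming an English locale for the month names; `parse_dob` and `age_on` are hypothetical helper names:

```python
from datetime import date, datetime


def parse_dob(text):
    """Parse '30 December 1990' into a date; return None for 'NA' or bad input."""
    try:
        # %B matches the full month name (locale dependent, English assumed)
        return datetime.strptime(text.strip(), "%d %B %Y").date()
    except ValueError:
        return None


def age_on(dob, today):
    """Whole years elapsed between dob and today."""
    had_birthday = (today.month, today.day) >= (dob.month, dob.day)
    return today.year - dob.year - (0 if had_birthday else 1)


dob = parse_dob("30 December 1990")
print(dob, age_on(dob, date(2022, 8, 1)))
print(parse_dob("NA"))
```

Returning `None` for unparseable values keeps the list the same length as `Dateob`, so it still lines up with the other columns.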

### Player Role

In [98]:
dic2['Role']=Role

In [99]:
Role

['Batter',
 'Batter',
 'Batter',
 'Batter',
 'Wicket-keeper',
 'Batter',
 'Batter',
 'Batter',
 'Batter',
 'Batter',
 '-',
 'Batter',
 'Batter',
 'Wicket-keeper',
 'Batter',
 'Batter',
 'Wicket-keeper',
 'Batter',
 'Batter',
 'NA',
 'Wicket-keeper',
 'All-rounder',
 'Batter',
 'Batter',
 'Batter',
 'Batter',
 'All-rounder',
 'Batter',
 'Batter',
 'Batter',
 'Batter',
 'Batter',
 'All-rounder',
 'NA',
 'All-rounder',
 'Batter',
 'Batter',
 'Batter',
 'NA',
 'Batter',
 'All-rounder',
 'All-rounder',
 'Batter',
 'NA',
 'Batter',
 'Batter',
 'Batter',
 '-',
 'Wicket-keeper',
 'Batter',
 'All-rounder',
 'NA',
 'Batter',
 'NA',
 'Batter',
 'Wicket-keeper',
 'Batter',
 'Batter',
 'Wicket-keeper',
 'All-rounder',
 'Batter',
 'All-rounder',
 'NA',
 'Wicket-keeper',
 'NA',
 'Bowler',
 'Batter',
 'NA',
 'NA',
 'Batter',
 'Wicket-keeper',
 'Batter',
 'Batter',
 'Wicket-keeper',
 'Wicket-keeper',
 'Batter',
 'NA',
 '-',
 'Wicket-keeper',
 'Batter',
 'Batter',
 '-',
 'NA',
 'NA',
 'All-rounder',
 'B
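The role list contains two different markers for missing data, 'NA' (written by our loop) and '-' (returned by the site). A short sketch of counting roles while treating both as unknown, using `collections.Counter`; the helper name `role_counts` and the label "Unknown" are assumptions made for illustration:

```python
from collections import Counter

MISSING = {"NA", "-", ""}  # placeholders seen in the scraped output


def role_counts(roles):
    """Count roles, folding every missing-value marker into 'Unknown'."""
    normalised = ["Unknown" if r.strip() in MISSING else r.strip() for r in roles]
    return Counter(normalised)


sample_roles = ["Batter", "Batter", "Wicket-keeper", "-", "NA", "All-rounder"]
print(role_counts(sample_roles).most_common())
```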

### Player Batting Style

In [100]:
dic2['style']=style

In [101]:
style

['Right Hand',
 'Right Hand',
 'Right Hand',
 'Right Hand',
 'Left Hand',
 'Right Hand',
 'Left Hand',
 'Left Hand',
 'Right Hand',
 'Right Hand',
 'Right Hand',
 'Right Hand',
 'Left Hand',
 'Right Hand',
 'Left Hand',
 'Right Hand',
 'Right Hand',
 'Left Hand',
 'Right Hand',
 'NA',
 'Right Hand',
 'Right Hand',
 'Right Hand',
 'Right Hand',
 'Left Hand',
 'Right Hand',
 'Left Hand',
 'Right Hand',
 'Right Hand',
 'Left Hand',
 'Right Hand',
 'Left Hand',
 'Right Hand',
 'NA',
 'Left Hand',
 'Right Hand',
 'Right Hand',
 'Right Hand',
 'NA',
 'Left Hand',
 'Right Hand',
 'Left Hand',
 'Right Hand',
 'NA',
 'Right Hand',
 'Right Hand',
 'Right Hand',
 'Left Hand',
 'Left Hand',
 'Left Hand',
 'Right Hand',
 'NA',
 'Right Hand',
 'NA',
 'Left Hand',
 'Right Hand',
 'Right Hand',
 'Right Hand',
 'Left Hand',
 'Right Hand',
 'Right Hand',
 'Right Hand',
 'NA',
 'Right Hand',
 'NA',
 'Left Hand',
 'Left Hand',
 'NA',
 'NA',
 'Right Hand',
 'Left Hand',
 'Left Hand',
 'Right Hand',
 'Right

### Let us write functions to combine what we have done above and get all the details at once for any given player URL

In [102]:
def Player_name(rank1_player_name,player_name):
    Name=[]
    Name.append(rank1_player_name[0].text.strip())
    for i in range(len(player_name)):
      Name.append(player_name[i].text.strip())
    return Name


def Player_ranking(ranking_rank1,ranking):
    Rank=[]
    Rank.append(ranking_rank1[0].text.strip())
    for i in range(len(ranking)):
      Rank.append(ranking[i].text.strip())
    return Rank


def Player_nation(rank1_nation,nation):
    Nation=[]
    Nation.append(rank1_nation[0].text.strip())
    for i in range(len(nation)):
      Nation.append(nation[i].text.strip())
    return Nation


def Player_ratings(rating,rank1_rating):
    Rating=[]
    Rating.append(rank1_rating[0].text.strip())
    for i in range(len(rating)):
      Rating.append(rating[i].text.strip())
    return Rating


def Career_best(career_best,career1_best):
    Career_Best=[]
    Career_Best.append(career1_best[0].text.strip())
    for i in range(len(career_best)):
      Career_Best.append(career_best[i].text.strip())
    return Career_Best
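The five functions above share one pattern: take the text of the rank-1 element, then the text of every remaining element. As a sketch, that pattern could be written once and reused. `collect_text` is a hypothetical name, and `SimpleNamespace` objects stand in for BeautifulSoup tags here so the snippet runs on its own:

```python
from types import SimpleNamespace


def collect_text(first, rest):
    """Return the stripped .text of first[0], followed by that of each item in rest."""
    return [first[0].text.strip()] + [tag.text.strip() for tag in rest]


# Stand-ins for the tag lists found on the ranking page
rank1 = [SimpleNamespace(text="  Joe Root \n")]
others = [SimpleNamespace(text=" Marnus Labuschagne "), SimpleNamespace(text="Babar Azam")]
print(collect_text(rank1, others))
```

With this helper, `Player_name(rank1_player_name, player_name)` would reduce to `collect_text(rank1_player_name, player_name)`, and the other four functions likewise. It also removes the risk of passing the two arguments in the wrong order, since every call uses the same (first, rest) convention.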

In [103]:
def Player_Ranking_Info():
    data_dict={}
    data_dict['Ranking']=Player_ranking(ranking_rank1,ranking)
    data_dict['Name']=Player_name(rank1_player_name,player_name)
    data_dict['Nation']=Player_nation(rank1_nation,nation)
    data_dict['Rating']=Player_ratings(rating,rank1_rating)
    data_dict['Career Best']= Career_best(career_best,career1_best)
    return data_dict

In [104]:
def Player_Personal_Info(innerp):
    # Collect date of birth, role and batting style from each player's own page
    Dateob=[]
    Role=[]
    style=[]
    for i in innerp:
      url="https://www.icc-cricket.com"+i[1]
      response=requests.get(url)
      soup=bs(response.content,"html.parser")
      # Bio entries come in the order: date of birth, role, batting style
      dob=soup.find_all(class_="rankings-player-bio__entry")
      if len(dob)>=3:  # index only when all three entries are present
        Dateob.append(dob[0].text)
        Role.append(dob[1].text)
        style.append(dob[2].text)
      else:
        Dateob.append("NA")
        Role.append("NA")
        style.append("NA")
    dic2={'Dateofbirth':Dateob,'Role':Role, 'style':style}
    return dic2
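`Player_Personal_Info` sends one request per player with no pause and no handling of network errors. The sketch below shows one way to add a delay and a few retries. The function `fetch_all`, its parameters, and the idea of passing the download step in as a callable are assumptions for illustration, not the notebook's method; `fake_fetch` simulates the network so the example runs offline:

```python
import time


def fetch_all(paths, fetch, delay=1.0, retries=2):
    """Call fetch(path) for each path, pausing `delay` seconds between paths.

    Returns a dict mapping path -> result, with None where every attempt failed.
    """
    results = {}
    for index, path in enumerate(paths):
        if index:
            time.sleep(delay)  # be polite to the server between requests
        results[path] = None
        for _ in range(retries + 1):
            try:
                results[path] = fetch(path)
                break
            except OSError:  # ConnectionError and TimeoutError are OSError subclasses
                continue
    return results


def fake_fetch(path):
    """Simulated downloader: paths ending in '/0' always fail."""
    if path.endswith("/0"):
        raise ConnectionError("simulated failure")
    return f"<html>{path}</html>"


pages = fetch_all(
    ["/rankings/mens/player-rankings/887", "/rankings/mens/player-rankings/0"],
    fake_fetch,
    delay=0,
)
print(pages)
```

In the notebook, `fetch` could be a small function that calls `requests.get` on the full URL and returns `response.content`; a `None` result would then map to the existing "NA" branch.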

In [105]:
Player_Personal_Info(innerp)

{'Dateofbirth': ['30 December 1990',
  '22 June 1994',
  '15 October 1994',
  '02 June 1989',
  '04 October 1997',
  '08 August 1990',
  '18 December 1986',
  '21 April 1988',
  '30 April 1987',
  '26 September 1989',
  '25 November 1983',
  '05 November 1988',
  '11 June 1987',
  '13 October 1994',
  '29 December 1993',
  '18 November 1989',
  '01 June 1992',
  '27 October 1986',
  '09 May 1987',
  'NA',
  '01 September 1990',
  '02 June 1987',
  '18 February 1991',
  '19 February 1985',
  '02 April 1992',
  '25 January 1988',
  '04 June 1991',
  '01 December 1992',
  '17 May 1990',
  '15 November 1991',
  '16 October 1987',
  '26 September 1986',
  '22 July 1986',
  'NA',
  '06 December 1988',
  '06 December 1994',
  '20 November 1991',
  '06 June 1988',
  'NA',
  '20 March 1989',
  '06 September 1991',
  '24 March 1987',
  '18 April 1992',
  'NA',
  '02 January 1998',
  '04 October 1994',
  '07 February 1989',
  '08 October 1985',
  '23 June 1993',
  '12 December 1995',
  '24 April 

#### Now that we have all the information from the individual player pages and from the ICC player ranking page, let us combine both DataFrames into a single DataFrame using `concat` in pandas

- Let us see what we have in the final DataFrame:

In [106]:
def Player_info():

    data_dict= Player_Ranking_Info()
    dict2=Player_Personal_Info(innerp)
    # Use the freshly scraped dict2 here, not the global dic2 from earlier cells
    df2=pd.DataFrame.from_dict(dict2)

    # Ranking-page columns, in display order
    Players_df=pd.DataFrame.from_dict(data_dict)
    Players_df=Players_df.reindex(columns=['Name','Ranking','Nation','Rating','Career Best'])

    # Place the two DataFrames side by side; rows are aligned on the index
    df=pd.concat([Players_df,df2],axis=1)
    return df
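Because `concat` with `axis=1` aligns rows on the index, a ranking list and a personal-info list of different lengths would not raise an error; the shorter side would simply be padded with missing values. A small sketch of a check that could run before the DataFrames are built; `check_equal_lengths` is a hypothetical helper and the sample dictionaries are shortened copies of the data above:

```python
def check_equal_lengths(*column_dicts):
    """Raise ValueError unless every column list has the same length."""
    lengths = {}
    for columns in column_dicts:
        for name, values in columns.items():
            lengths[name] = len(values)
    if len(set(lengths.values())) > 1:
        raise ValueError(f"column lengths differ: {lengths}")
    # Common length, or 0 when no columns were given
    return next(iter(lengths.values()), 0)


ranking_cols = {"Name": ["Joe Root", "Marnus Labuschagne"], "Ranking": ["1", "2"]}
personal_cols = {"Dateofbirth": ["30 December 1990", "22 June 1994"]}
print(check_equal_lengths(ranking_cols, personal_cols))
```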

In [107]:
df=Player_info() # Scrape once and keep the result
df # Display it without scraping every player page a second time

| | Name | Ranking | Nation | Rating | Career Best | Dateofbirth | Role | style |
|---|---|---|---|---|---|---|---|---|
| 0 | Joe Root | 1 | ENG | 900 | 923 v India, 05/07/2022 | 30 December 1990 | Batter | Right Hand |
| 1 | Marnus Labuschagne | 2 | AUS | 885 | 936 v Pakistan, 08/03/2022 | 22 June 1994 | Batter | Right Hand |
| 2 | Babar Azam | 3 | PAK | 879 | 879 v Sri Lanka, 28/07/2022 | 15 October 1994 | Batter | Right Hand |
| 3 | Steve Smith | 4 | AUS | 848 | 947 v England, 08/01/2018 | 02 June 1989 | Batter | Right Hand |
| 4 | Rishabh Pant | 5 | IND | 801 | 801 v England, 05/07/2022 | 04 October 1997 | Wicket-keeper | Left Hand |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 95 | Haris Sohail | 96 | PAK | 369 | 563 v New Zealand, 28/11/2018 | 09 January 1989 | Batter | Left Hand |
| 96 | Shamarh Brooks | 97 | WI | 369 | 536 v England, 20/07/2020 | 01 October 1988 | Batter | Right Hand |
| 97 | Roston Chase | 98 | WI | 369 | 626 v Pakistan, 14/05/2017 | 22 March 1992 | All-rounder | Right Hand |
| 98 | Mitchell Starc | 99 | AUS | 362 | 446 v India, 25/02/2017 | 30 January 1990 | Bowler | Left Hand |


#### Our project goal is achieved successfully with the above required information!!

In [None]:
df.to_csv("Ranking_list.csv",index=False)

![](https://i.imgur.com/OwhLg4O.png)
![](https://i.imgur.com/2mJ2Q53.png)


## Summary

Finally, we have managed to `parse` the 'ICC Top 100 Ranked Test Players-Batting' page and get our hands on **interesting and insightful data** about international cricket.  
We have saved all the information we extracted from the website in a `CSV` file, which we can use to answer many further questions, 
e.g. - `Career-best rating of the 5th-ranked player?`
      `Date of birth and role of a particular player?`
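As a sketch of how the saved file can answer such a question, the snippet below reads CSV text with the standard `csv` module and looks up the career-best rating of the player at a given rank. The two sample rows are copied from the output above and fed in through `io.StringIO`, so nothing is read from disk; `career_best_rating` is a hypothetical helper:

```python
import csv
import io

# Two rows in the same layout as Ranking_list.csv
SAMPLE = '''Name,Ranking,Nation,Rating,Career Best,Dateofbirth,Role,style
Joe Root,1,ENG,900,"923 v India, 05/07/2022",30 December 1990,Batter,Right Hand
Rishabh Pant,5,IND,801,"801 v England, 05/07/2022",04 October 1997,Wicket-keeper,Left Hand
'''


def career_best_rating(rows, ranking):
    """Return the leading number of 'Career Best' for the given rank, or None."""
    for row in rows:
        if row["Ranking"] == str(ranking):
            return int(row["Career Best"].split()[0])
    return None


rows = list(csv.DictReader(io.StringIO(SAMPLE)))
print(career_best_rating(rows, 5))
```

To run this against the real file, `io.StringIO(SAMPLE)` would be replaced by `open("Ranking_list.csv", newline="")`.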

#### Let us look at the steps that we took from start to finish:

1. We downloaded the webpage https://www.icc-cricket.com/rankings/mens/player-rankings/test/batting using `requests`  


2. We `parsed` the HTML source code using the `BeautifulSoup` library and extracted the desired information, i.e.

* The names of the 'Top 100 Ranked Men's cricket Batsmen'
* The ranking of each of those players
* Player's Nation
* Player Ratings
* Career Best Ratings


3. We created a `DataFrame` using `Pandas` from the `Python Lists` that we derived in the previous step


4. We extracted detailed information for each player in the list of `Top 100 Ranked Test Players-Batting`, such as:
* Player Name
* Ranking
* Nation
* Ratings
* Career Best
* Date of Birth
* Batting Style
* Player Role


5. We then created a `Python Dictionary` to save all these details.


6. We converted the Python dictionaries into `Pandas DataFrames`.


7. We merged the `2 DataFrames` we had built into a single DataFrame.


8. With one single DataFrame in hand, we then saved it as a single `CSV` file, which contains all the information about the players that we needed to accomplish the goal of our project.



## Future Work

We can now explore this data further and fetch more information from the site.

With further analysis of the data, we can gather information such as:

- Men's Player Ranking Bowling and Batting in Test Match, One day International and T20 Cricket 
- Team Ranking and Player Rankings of both Men's and Women's in all International cricket formats such as Test, One Day International and T20.
- Player comparison and player statistics in all cricket formats
- All the tables from each of these URLs

And the list goes on.
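A first step toward the other ranking tables would be to generate the page URLs to scrape. The sketch below builds them from the one URL pattern used in this notebook. Only the `test/batting` path is confirmed by the notebook; the other format and category slugs are assumptions that would need to be checked against the site, and `ranking_urls` is a hypothetical helper:

```python
from itertools import product

BASE = "https://www.icc-cricket.com/rankings/mens/player-rankings"
FORMATS = ["test", "odi", "t20i"]                   # assumption: only "test" is confirmed
CATEGORIES = ["batting", "bowling", "all-rounder"]  # assumption: only "batting" is confirmed


def ranking_urls(formats=FORMATS, categories=CATEGORIES, base=BASE):
    """Return one URL per (format, category) combination."""
    return [f"{base}/{fmt}/{cat}" for fmt, cat in product(formats, categories)]


for url in ranking_urls():
    print(url)
```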

>In the future, I would like to make this dataset even richer with more data from other lists. I would then like to analyse the entire dataset to learn much more about cricket players than I currently know. I would also like to scrape more details for other roles, such as bowlers and all-rounders.

