# Web Scraping Tutorial: Scraping ICC Cricket Website


<img src="https://d1vd9vlqo1765y.cloudfront.net/blog/virat-kohli-to-miss-world-cup-warm-up-match-vs-netherlands-1.jpg" alt="Image" width="500" height="300">
## Introduction

In this tutorial, we will learn how to scrape data from the ICC Cricket website (www.icc-cricket.com) using Python. We will focus on extracting information about cricket teams and players.

### Tools Required

- Python (3.x)
- Beautiful Soup (for HTML parsing)
- Requests (for making HTTP requests)

## Setup

Before we start, make sure you have the required libraries installed:

```bash
pip install beautifulsoup4
pip install requests
```


# Before Scraping: Learning about Requests and Beautiful Soup (bs4)

Before we dive into web scraping, it's important to understand the two essential libraries we'll be using: `requests` and `Beautiful Soup` (bs4).



## Making Requests 
In this section, we'll explore how to make HTTP requests from Python using the `requests` library.


### Introduction

Markup languages like HTML or Markdown have no native way to make HTTP requests. Instead, the requests are made from code, and the snippets below use Python to demonstrate this.


### Making GET Requests

To make a GET request, you can use a language like Python. Here's an example using the `requests` library:

```python
import requests

response = requests.get('https://api.example.com/endpoint')
print(response.text)
```


### Making POST Requests

If you need to send data to a server, you can use a POST request. Here's an example in Python:

```python
import requests

data = {'key1': 'value1', 'key2': 'value2'}
response = requests.post('https://api.example.com/endpoint', data=data)
print(response.text)
```


### Handling Responses

After making a request, you'll receive a response from the server. This response contains information like status codes, headers, and the actual content. You can parse and use this data as needed.

For example, to access the status code in Python:

```python
print(response.status_code)
```


### Additional Notes

- Always ensure that you have the necessary permissions and credentials when making requests to protected resources.
- Consider error handling and exception management in your code for a robust application.
- Check the documentation of the language or library you're using for more advanced request options and features.
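
The error-handling advice above can be sketched as a small helper. This is a minimal sketch, not part of the original notebook: the function name `fetch_html` and the 10-second timeout are assumptions chosen for illustration. It relies on `requests.get` with a `timeout`, on `Response.raise_for_status()` to turn 4xx/5xx responses into exceptions, and on the base class `requests.exceptions.RequestException` to catch request-level failures.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page body as text, or None if the request fails.

    `fetch_html` and the default timeout are illustrative choices,
    not something the tutorial prescribes.
    """
    try:
        response = requests.get(url, timeout=timeout)
        # Raise requests.exceptions.HTTPError for 4xx/5xx status codes
        response.raise_for_status()
    except requests.exceptions.RequestException as exc:
        # Covers malformed URLs, connection errors, timeouts and HTTP errors
        print(f"Request failed for {url!r}: {exc}")
        return None
    return response.text


# A URL without a scheme is rejected before any network traffic is sent
result = fetch_html("not-a-valid-url")
print(result)
```

Returning `None` on failure keeps a long scraping loop running; a caller that needs to tell failure modes apart could re-raise the exception instead.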


In [4]:
# Import the requests library
import requests

# Define the URL to scrape
url = 'https://www.icc-cricket.com/match/101865#scorecard'

# Send a GET request to the URL
response = requests.get(url)

# Check if the request was successful (status code 200)
if response.status_code == 200:
    print('Request successful') 
    # Print the HTML content of the page
    print(response.text)
else:
    # Print an error message along with the status code if the request was not successful
    print('Error:', response.status_code)


Request successful
<!DOCTYPE html>
<html lang="en">
<head>

    <meta name="twitter:title" content="India 191/5 vs Australia 188 | India won by 5 wickets | ICC"/>
<meta name="description" content="Follow the live scores of the India won by 5 wickets India vs Australia at Wankhede Stadium, Mumbai. Read the commentary, team updates and detailed match info!"/>
<meta name="twitter:description" content="Follow the live scores of the India won by 5 wickets India vs Australia at Wankhede Stadium, Mumbai. Read the commentary, team updates and detailed match info!"/>
<meta property="og:title" content="India 191/5 vs Australia 188 | India won by 5 wickets | ICC"/>
<title>India 191/5 vs Australia 188 | India won by 5 wickets | ICC</title>
<meta property="og:description" content="Follow the live scores of the India won by 5 wickets India vs Australia at Wankhede Stadium, Mumbai. Read the commentary, team updates and detailed match info!"/>

    
<script>

    var dataLayer = [{
        'user_langu

## Web Scraping with Beautiful Soup

In this section, we'll explore how to use Beautiful Soup, a Python library, for web scraping in your notebook.


### Introduction

Beautiful Soup is a powerful Python library used for web scraping purposes. It allows you to parse HTML or XML documents, extract data, and navigate through the document's structure.


### Installing Beautiful Soup

Before using Beautiful Soup, you need to install it. You can do this via pip:

```bash
pip install beautifulsoup4
```


### Basic Usage

To start using Beautiful Soup, you first need to import it in your Python script or notebook:

```python
from bs4 import BeautifulSoup

# Assuming 'html_content' contains your HTML content
soup = BeautifulSoup(html_content, 'html.parser')
```


### Navigating the Document

Beautiful Soup provides methods to navigate through the document's structure. For example, to find all links (`<a>` tags), you can use:

```python
links = soup.find_all('a')
for link in links:
    print(link.get('href'))
```


### Extracting Data

You can extract specific data by targeting HTML elements or their attributes. For instance, to get the text inside a `<div>` with a class of 'content':

```python
content_div = soup.find('div', class_='content')
print(content_div.text)
```
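
One detail worth guarding against: `soup.find` returns `None` when nothing matches, so calling `.text` on the result raises `AttributeError`. The sketch below shows one way to handle that. The HTML fragment and the helper name `div_text` are made up for illustration.

```python
from bs4 import BeautifulSoup

# Made-up HTML used only for this illustration
html_content = """
<div class="content">  India won by 5 wickets  </div>
<div class="sidebar">Other text</div>
"""

soup = BeautifulSoup(html_content, "html.parser")


def div_text(soup, class_name):
    """Return the stripped text of the first matching <div>, or '' if absent."""
    div = soup.find("div", class_=class_name)
    if div is None:
        # find() returns None when no element matches
        return ""
    return div.text.strip()


content_text = div_text(soup, "content")
missing_text = div_text(soup, "footer")
print(repr(content_text))
print(repr(missing_text))
```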


In [2]:
# Import necessary libraries
import json  # For working with JSON files
import requests  # For making HTTP requests
from bs4 import BeautifulSoup  # For parsing HTML

# Define the file path for the JSON file
file_path = 'D:\\data\\json\\matchlinks.json'

# Read the JSON file (created earlier with Selenium) into a dictionary
with open(file_path, 'r') as json_file:
    my_dict = json.load(json_file)

# Now 'my_dict' contains the data from the JSON file
print(my_dict)


{'India': [["ODI\nAustralia won by 66 runs\nIndia 286\nAustralia 352/7\n3rd ODI, India v Australia - 2023/24 Men's ODI Series | Saurashtra Cricket Association Stadium, Rajkot\nWed 27 September\nMatch Centre", 'https://www.icc-cricket.com/match/102964'], ["ODI\nIndia won by 99 runs (DLS method)\nIndia 399/5\nAustralia 217\nIndia won by 99 runs (DLS method), India v Australia - 2023/24 Men's ODI Series | Holkar Cricket Stadium, Indore\nSun 24 September\nMatch Centre", 'https://www.icc-cricket.com/match/102963'], ["ODI\nIndia won by 5 wickets\nIndia 281/5\nAustralia 276\nIndia won by 5 wickets, India v Australia - 2023/24 Men's ODI Series | Punjab Cricket Association Stadium, Mohali\nFri 22 September\nMatch Centre", 'https://www.icc-cricket.com/match/102962'], ['ODI Final\nIndia won by 10 wickets\nIndia 51/0\nSri Lanka 50\nFinal, Asia Cup 2023 | R.Premadasa Stadium, Khettarama\nSun 17 September\nMatch Centre', 'https://www.icc-cricket.com/match/102846'], ['ODI\nBangladesh won by 6 runs\nI
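
Each entry in `my_dict` pairs a newline-separated summary string with a match link. As a rough sketch, the summary can be split into named fields. The field names and the helper `parse_summary` are assumptions for illustration, and the seven-line layout is inferred only from the entries printed above, so other entries may be laid out differently.

```python
def parse_summary(summary):
    """Split a match summary string into named fields.

    Assumes the seven-line layout of the sample entry; returns None
    for any summary that does not have exactly seven lines.
    """
    parts = summary.split("\n")
    if len(parts) != 7:
        return None
    keys = ["format", "result", "first_score", "second_score",
            "description", "date", "link_text"]
    return dict(zip(keys, parts))


# The first India entry from the printed dictionary
sample = ("ODI\nAustralia won by 66 runs\nIndia 286\nAustralia 352/7\n"
          "3rd ODI, India v Australia - 2023/24 Men's ODI Series | "
          "Saurashtra Cricket Association Stadium, Rajkot\n"
          "Wed 27 September\nMatch Centre")

fields = parse_summary(sample)
print(fields["result"])  # Australia won by 66 runs
print(fields["date"])    # Wed 27 September
```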

In [3]:
# Let's see what is in the dictionary
my_dict['India'][0][0] 

"ODI\nAustralia won by 66 runs\nIndia 286\nAustralia 352/7\n3rd ODI, India v Australia - 2023/24 Men's ODI Series | Saurashtra Cricket Association Stadium, Rajkot\nWed 27 September\nMatch Centre"

In [25]:
teams = ['India', 'Afghanistan', 'Australia', 'Bangladesh', 'England', 'New Zealand', 'Pakistan', 'South Africa', 'Netherlands', 'Sri Lanka']

In [26]:
# Create an empty dictionary 'd'
d = dict()

# Loop through key-value pairs in 'my_dict'
for k, v in my_dict.items():
    print("<----------Had----------->")
    print(k, len(v))
    
    links = []  # Initialize an empty list for storing links
    
    # Loop through matches in 'v'
    for match in v:
        # Count how many of the listed teams are named in the match summary,
        # so that we keep only matches played between two of these ODI teams
        count = sum(1 for team in teams if team in match[0])

        # Append the match link only if exactly two listed teams are involved
        if count == 2:
            links.append(match[1])
        
        # Break loop if the number of links reaches 50
        if len(links) >= 50:
            break
    
    print("<----------Taking----------->")
    print(k, len(links))
    d[k] = links  # Assign links to the key k in dictionary 'd'


<----------Had----------->
India 304
<----------Taking----------->
India 50
<----------Had----------->
Afghanistan 119
<----------Taking----------->
Afghanistan 45
<----------Had----------->
Australia 273
<----------Taking----------->
Australia 50
<----------Had----------->
Bangladesh 186
<----------Taking----------->
Bangladesh 50
<----------Had----------->
England 267
<----------Taking----------->
England 50
<----------Had----------->
New Zealand 266
<----------Taking----------->
New Zealand 50
<----------Had----------->
Pakistan 257
<----------Taking----------->
Pakistan 50
<----------Had----------->
South Africa 234
<----------Taking----------->
South Africa 50
<----------Had----------->
Netherlands 68
<----------Taking----------->
Netherlands 36
<----------Had----------->
Sri Lanka 296
<----------Taking----------->
Sri Lanka 50
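
The filtering rule used above (keep a match only when exactly two of the listed team names appear in its summary, and stop at 50 links) can be isolated into a small function and checked on toy data. This is a sketch: `select_links` and the demo entries are hypothetical and are not taken from the scraped data.

```python
def select_links(matches, teams, limit=50):
    """Return up to `limit` links for matches that name exactly two listed teams."""
    links = []
    for summary, link in matches:
        # Substring test: each team counts at most once per summary
        named = [team for team in teams if team in summary]
        if len(named) == 2:
            links.append(link)
        if len(links) >= limit:
            break
    return links


# Made-up entries in the same [summary, link] shape as `my_dict`
demo_teams = ["India", "Australia", "Sri Lanka"]
demo_matches = [
    ["ODI\nIndia 286\nAustralia 352/7", "link-1"],  # two listed teams: kept
    ["ODI\nIndia 250\nNepal 120", "link-2"],         # one listed team: dropped
    ["ODI\nSri Lanka 50\nIndia 51/0", "link-3"],     # two listed teams: kept
]
selected = select_links(demo_matches, demo_teams)
print(selected)  # ['link-1', 'link-3']
```

Because the test is a plain substring match on the whole summary, a team name that appears only in the series or venue text would also be counted.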


In [28]:
for i in d.keys():
    print(f"<------------- Doing for {i} ---------------->")
    for j in d[i]:
        try:
            url=j
            response = requests.get(url)
            soup = BeautifulSoup(response.content, 'html.parser')
            script_tag=soup.find("script",type="application/ld+json")
            result_str=soup.find('div','scorebox__outcome').text.replace("\n",'').strip()
            data = json.loads(script_tag.string)
            
            #Taking the meta data
            match={
            "date":data['startDate'],
            "home":data['homeTeam']['name'],
            'away':data['awayTeam']['name'],
            "place":data['location']['name'],
            "result":result_str}
            
            
            #Taking data out of score board.
            try:
                novert0=soup.find_all('div',"scorebox__team")[0].find('span',"match-score__overs").text.replace('\n','').strip()
            except AttributeError:
                novert0=''
            try:
                novert1=soup.find_all('div',"scorebox__team")[1].find('span',"match-score__overs").text.replace('\n','').strip()
            except AttributeError:
                novert1=''
            try:
                rrt0=soup.find_all('div',"scorebox__team")[0].find('span',"match-score__run-rate").text.replace('\n','').strip()
            except AttributeError:
                rrt0=''
            try:
                rrt1=soup.find_all('div',"scorebox__team")[1].find('span',"match-score__run-rate").text.replace('\n','').strip()
            except AttributeError:
                rrt1=''
            namet0=soup.find_all('div',"scorebox__team")[0].find('span',"scorebox__team-name").text.replace('\n','').strip()
            namet1=soup.find_all('div',"scorebox__team")[1].find('span',"scorebox__team-name").text.replace('\n','').strip()
            try:
                scoret0=soup.find_all('div',"scorebox__team")[0].find('span',"match-score__runs").text.replace('\n','').strip()
            except AttributeError:
                scoret0=''

            try:
                scoret1=soup.find_all('div',"scorebox__team")[1].find('span',"match-score__runs").text.replace('\n','').strip()
            except AttributeError:
                scoret1=''
                
                
            # Extract batting and bowling information    
            tables=soup.find_all("table",class_="table scorecard__table")
            batting=dict()
            for table in tables[0::2]: # Only the batting tables (even indices) are taken by slicing
                player_data = []
                extras=''
                for row in table.find('tbody').find_all('tr'):
                    try:
                        cols = row.find_all('td')
                        player_name = cols[0].find('span', class_='scorecard__player-name').text
                        try:
                            dismisal_name=cols[0].find('span', class_='scorecard__dismissal').text
                        except AttributeError:
                            # No dismissal text for this batter
                            dismisal_name=""
                        runs = cols[1].strong.text
                        balls_faced = cols[2].text.strip()
                        fours = cols[3].text.strip()
                        sixes = cols[4].text.strip()
                        strike_rate = cols[5].text.strip()

                        player_data.append({
                            "Player Name": player_name,
                            "Dismissed by":dismisal_name ,
                            "Runs": runs,
                            "Balls Faced": balls_faced,
                            "Fours": fours,
                            "Sixes": sixes,
                            "Strike Rate": strike_rate
                        })
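                    except (AttributeError, IndexError):
                        # Rows that do not match the batter layout (such as the extras row)
                        # land here; keep the text of the first such row as the extras
                        if extras == '' and cols:
                            extras = cols[0].text.replace("\n", '')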
                # Save data in a dictionary and append
                batting[table.tr.text.replace("\n",'').replace('Batters','').split("Batting")[0].strip()]=[player_data,extras]

            balling=dict()
            for table in tables[1:4:2]: # Only the bowling tables (indices 1 and 3) are taken by slicing
                bowling_stats = []
                for row in table.find('tbody').find_all('tr'):
                    player_name=row.find('a', class_='scorecard__cell--main').span.text
                    player_data = row.find_all('a', class_='scorecard__cell u-link-reset u-show')
                    overs = player_data[0].text.strip()
                    maidens = player_data[1].text.strip()
                    runs = player_data[2].text.strip()
                    wickets = player_data[3].text.strip()
                    economy = player_data[4].text.strip()
                    dots = player_data[5].text.strip()

                    # Save data in a dictionary and append
                    bowling_stats.append({
                        'player_name':player_name,
                        'Overs': overs,
                        'Maidens': maidens,
                        'Runs': runs,
                        'Wickets': wickets,
                        'Economy': economy,
                        'Dots': dots
                    })


                # Save data in a dictionary and append
                balling[table.tr.text.replace("\n",'').replace('Bowlers','').split("Bowling")[0].strip()]=bowling_stats
            # Look up each team's batting and bowling data, with defaults when missing
            batt0=batting.get(namet0, ["None","None"])
            batt1=batting.get(namet1, ["None","None"])
            ballingt1=balling.get(namet1, [])
            ballingt0=balling.get(namet0, [])

            dicteam1 = {"Name":namet0,
            "score":scoret0,
            "overs":novert0,
            "rr":rrt0,
            "batting":batt0[0],
            "Extras":batt0[1],
            "balling":ballingt0}

            dicteam2 = {"Name":namet1,
            "score":scoret1,
            "overs":novert1,
            "rr":rrt1,
            "batting":batt1[0],
            "Extras":batt1[1],
            "balling":ballingt1}

            match["Team1"]=dicteam1
            match["Team2"]=dicteam2

            filenum=url.split("#")[0].split("/")[-1]

            match["match_id"]=filenum
            
            #Save the files individually to computer
            file_path = fr'D:\data\json\team_{i}_{filenum}.json'
            with open(file_path, 'w') as json_file:
                json.dump(match, json_file)
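        except Exception:
            # Skip matches that fail to download or parse; unlike a bare
            # 'except:', this still lets Ctrl+C interrupt the loop
            pass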

<------------- Doing for India ---------------->
<------------- Doing for Afghanistan ---------------->
<------------- Doing for Australia ---------------->
<------------- Doing for Bangladesh ---------------->
<------------- Doing for England ---------------->
<------------- Doing for New Zealand ---------------->
<------------- Doing for Pakistan ---------------->
<------------- Doing for South Africa ---------------->
<------------- Doing for Netherlands ---------------->
<------------- Doing for Sri Lanka ---------------->
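
The loop above repeats the same `try` / `except AttributeError` pattern for every scorebox field. One possible way to shorten it is a small helper that returns an empty string when the element is missing. This is only a sketch: `text_or_empty` is a hypothetical name, and the HTML fragment below is made up, reusing the class names from the loop rather than copying the real page markup.

```python
from bs4 import BeautifulSoup


def text_or_empty(parent, tag, class_name):
    """Return the cleaned text of the first matching child, or '' if it is missing."""
    element = parent.find(tag, class_name)
    if element is None:
        return ""
    return element.text.replace("\n", "").strip()


# Made-up fragment that reuses the class names from the loop above;
# the real page markup may differ
html = """
<div class="scorebox__team">
  <span class="scorebox__team-name">India</span>
  <span class="match-score__runs">51/0</span>
</div>
"""
team = BeautifulSoup(html, "html.parser").find_all("div", "scorebox__team")[0]

name = text_or_empty(team, "span", "scorebox__team-name")
score = text_or_empty(team, "span", "match-score__runs")
overs = text_or_empty(team, "span", "match-score__overs")  # absent in this fragment
print(name, score, repr(overs))
```

With such a helper, each `novert0`, `rrt0`, or `scoret0` assignment could become a single call on the corresponding `scorebox__team` element.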


In [6]:
d={"India":["https://www.icc-cricket.com/match/102846#scorecard"]}

In [18]:
tables[0].tr.text.replace("\n",'').replace('Batters','').split("Batting")[0].strip()

'Sri Lanka'
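
The `date`, `home`, `away`, and `place` fields collected by the scraping loop come from a `<script type="application/ld+json">` tag, which is parsed with `json.loads`. The sketch below reproduces that step on a made-up page fragment. The key names mirror those read in the loop, but the real page's JSON-LD may contain more or different keys.

```python
import json

from bs4 import BeautifulSoup

# Made-up page fragment; only the keys used by the loop are included
html = """
<html><head>
<script type="application/ld+json">
{"startDate": "09/17/2023 09:30:00 AM",
 "homeTeam": {"name": "India"},
 "awayTeam": {"name": "Sri Lanka"},
 "location": {"name": "R.Premadasa Stadium"}}
</script>
</head><body></body></html>
"""

soup = BeautifulSoup(html, "html.parser")
script_tag = soup.find("script", type="application/ld+json")

# .string holds the raw JSON text inside the <script> tag
data = json.loads(script_tag.string)

meta = {
    "date": data["startDate"],
    "home": data["homeTeam"]["name"],
    "away": data["awayTeam"]["name"],
    "place": data["location"]["name"],
}
print(meta)
```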

In [24]:
match

{'date': '09/17/2023 09:30:00 AM',
 'home': 'India',
 'away': 'Sri Lanka',
 'place': 'R.Premadasa Stadium',
 'result': 'India won by 10 wickets',
 'Team1': {'Name': 'India',
  'score': '51/0',
  'overs': '6.1/50 ov',
  'rr': 'RR: 8.27',
  'batting': [{'Player Name': 'Shubman Gill',
    'Dismissed by': '',
    'Runs': '27',
    'Balls Faced': '19',
    'Fours': '6',
    'Sixes': '0',
    'Strike Rate': '142.10'},
   {'Player Name': 'Ishan Kishan',
    'Dismissed by': '',
    'Runs': '23',
    'Balls Faced': '18',
    'Fours': '3',
    'Sixes': '0',
    'Strike Rate': '127.77'},
   {'Player Name': 'Rohit Sharma',
    'Dismissed by': '',
    'Runs': '-',
    'Balls Faced': '-',
    'Fours': '-',
    'Sixes': '-',
    'Strike Rate': '-'},
   {'Player Name': 'Virat Kohli',
    'Dismissed by': '',
    'Runs': '-',
    'Balls Faced': '-',
    'Fours': '-',
    'Sixes': '-',
    'Strike Rate': '-'},
   {'Player Name': 'KL Rahul',
    'Dismissed by': '',
    'Runs': '-',
    'Balls Faced': '-',