# DATA SCRAPING
## AUTHOR: ANTE DUJIC

<hr style="border:2px solid gray">

Data scraping, also known as web scraping, is the process of automatically extracting data from websites using software tools. It is an important skill for data analysis because it allows large amounts of data to be gathered from the internet quickly and efficiently. Writing a web scraping script requires familiarity with programming as well as a fundamental understanding of HTML structure and the basics of web development. It is also crucial to approach data scraping with caution and to use it responsibly.

<img src="scrap.jpg" width="400" style="margin:auto"/>

In this Jupyter Notebook, I will explore web scraping by creating a script to extract data from [Transfermarkt](https://www.transfermarkt.com/), a popular website specializing in football player and team statistics, transfers, and market values. I will be using Python and its libraries: Beautiful Soup for parsing HTML, requests for making HTTP requests, and pandas for organizing the scraped data.

To create a web scraper I will follow the steps below:

1. Identify the target data
2. Inspect the structure of the website
3. Write code to navigate through the HTML structure
4. Create a DataFrame
5. Export the data to CSV

This project is designed for educational purposes only. The scraped data will not be used for anything beyond learning and exploration.
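
One practical way to act on the responsible-use point is to check a site's `robots.txt` before requesting a page. The sketch below uses Python's standard `urllib.robotparser` and works on robots.txt text passed in as a string, so it runs offline. The helper name `is_allowed`, the user agent `my-scraper`, and the rules in `EXAMPLE_ROBOTS_TXT` are made up for illustration and are not Transfermarkt's actual rules.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules for illustration only; NOT the real rules of any website.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(EXAMPLE_ROBOTS_TXT, "my-scraper", "https://example.com/league/table"))
print(is_allowed(EXAMPLE_ROBOTS_TXT, "my-scraper", "https://example.com/private/data"))
```

For a real site, the text of its `robots.txt` file would be downloaded first and passed to the helper in the same way.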


<hr style="border:2px solid gray">

### STEP 1. Identifying the target data
***

My objective is to extract data on all players in the Croatian football league ([Supersport HNL](https://www.transfermarkt.com/1-hnl/startseite/wettbewerb/KR1)), including their name, age, nationality, position, height, and more. To do so, I will have to visit each club's URL. I will then create a CSV file with this data, which can be used for further analysis.

Example of the target data for a single club: https://www.transfermarkt.com/gnk-dinamo-zagreb/kader/verein/419/saison_id/2022/plus/1

### STEP 2. Inspect the structure of the website
***

Before writing any code, I will inspect the website with the browser's developer tools to see how the page is built and to confirm which elements hold the target data. That understanding of the HTML structure then guides the extraction: accessing specific elements, retrieving their attributes or text content, and navigating through the structure to reach the precise information I need.
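
To make these operations concrete, here is a minimal sketch that runs Beautiful Soup on a small, made-up HTML fragment. The fragment in `SAMPLE_HTML` only loosely imitates a league table (the club names and links are invented), but the three operations are the ones used later: finding an element by tag and class, walking down to its children, and reading attributes and text.

```python
from bs4 import BeautifulSoup

# A made-up fragment for illustration; the real page is far larger.
SAMPLE_HTML = """
<div class="responsive-table">
  <table>
    <tbody>
      <tr><td><a href="/club-a/startseite/verein/1" title="Club A">Club A</a></td></tr>
      <tr><td><a href="/club-b/startseite/verein/2" title="Club B">Club B</a></td></tr>
    </tbody>
  </table>
</div>
"""

sample_soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Locate an element by tag name and class, then walk down into its children.
sample_table = sample_soup.find("div", {"class": "responsive-table"})
sample_rows = sample_table.find("tbody").find_all("tr")

for row in sample_rows:
    link = row.find("a")
    # Attributes are read like dictionary keys; text content comes from get_text().
    print(link["href"], link["title"], link.get_text(strip=True))
```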

### STEP 3. Write Code to Navigate Through the HTML Structure
***

Here I'll write the code for the scraper in two stages. First, I will construct the links to the pages that hold the data. I will then scrape the data from those links and combine it into a single dataset. This section is divided into three parts: Importing Libraries, Constructing Club URLs, and Scraping Player Data.

#### Importing Libraries

First, I import the necessary libraries for web scraping and data processing tasks. The *requests* library allows me to make HTTP requests to retrieve web page content. I use *BeautifulSoup* to parse the HTML content and extract specific elements of interest. Finally, I import *pandas* to create and manipulate data frames for further analysis.

In [1]:
# Import libraries
import requests
from bs4 import BeautifulSoup
import pandas as pd


#### Constructing Club URLs

In this section, I will construct the URL of each club in the league.

I begin with an initial link that serves as the entry point to the website's content. It leads to the page of a specific league or competition, Supersport HNL in this case. From that page I extract the links to the individual clubs participating in the league. Each extracted link is relative, so I prepend a base URL to obtain a second link, which points to the club's own page. Finally, I modify that URL to construct the final link: replacing `startseite` with `kader` and appending `/plus/1` leads to the page that holds the target data.


##### EXAMPLE: GNK Dinamo Zagreb
$\rightarrow$ INITIAL LINK: https://www.transfermarkt.com/1-hnl/startseite/wettbewerb/KR1 <br>
&nbsp; &nbsp;$\rightarrow$ SECOND LINK: https://www.transfermarkt.com/gnk-dinamo-zagreb/startseite/verein/419/saison_id/2022 <br>
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;$\rightarrow$ FINAL LINK: https://www.transfermarkt.com/gnk-dinamo-zagreb/kader/verein/419/saison_id/2022/plus/1

To do so, I will use the requests library to make a GET request to the website and fetch its content. Then, I will use BeautifulSoup to parse the HTML content and find the relevant elements containing the club URLs. I will store these URLs in a list for further processing.
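
The URL rewriting itself can be isolated into a small helper that is easy to check without touching the network. This is only a sketch: `build_squad_url` and `BASE_URL` are names chosen here, and the shape of the relative link is assumed from the Dinamo Zagreb example above.

```python
BASE_URL = "https://www.transfermarkt.com"


def build_squad_url(relative_link: str) -> str:
    """Turn a club's relative 'startseite' link into the detailed squad URL."""
    # Second link: make the relative link absolute.
    club_url = BASE_URL + relative_link
    # Final link: swap the first 'startseite' path segment for 'kader'
    # and append '/plus/1' to reach the detailed squad page.
    return club_url.replace("/startseite/", "/kader/", 1) + "/plus/1"


print(build_squad_url("/gnk-dinamo-zagreb/startseite/verein/419/saison_id/2022"))
```

Matching on `/startseite/` with the surrounding slashes and replacing only the first occurrence is slightly stricter than a plain `replace('startseite', 'kader')`, which would also rewrite a club slug that happened to contain that word.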

In [2]:
# Supersport HNL URL (the initial link)
url = 'https://www.transfermarkt.com/1-hnl/startseite/wettbewerb/KR1'

In [3]:
# Define a dictionary named headers that mimics the user agent of a browser.
# This is necessary because some websites block requests from bots or scripts that do not come from a browser.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}

# Send a GET request to the specified URL with the defined headers
response = requests.get(url, headers=headers)
# <Response [200]> indicates that the GET request to the specified URL was successful.
response

<Response [200]>

In [4]:
# Get the content of the response
content = response.content

In [5]:
# Parse the content using BeautifulSoup
soup = BeautifulSoup(response.content, 'html.parser')
#soup

In [6]:
# Find the div element with class 'responsive-table' which contains the table
table = soup.find('div', {'class': 'responsive-table'})
#table

In [7]:
# Find the tbody element within the table
tbody = table.find('tbody')
#tbody

In [8]:
# Initialize an empty list to store the club links
club_links = []

# Iterate over each tr element within the tbody
for tr in tbody.find_all('tr'):
    # Get the href attribute of the first 'a' element and append it to the club_links list
    club_links.append(tr.find_all('a')[0]['href'])

# Prepend the base URL and modify the links to include '/kader/plus/1'
for i in range(len(club_links)):
    # Second link
    club_links[i] = 'https://www.transfermarkt.com' + club_links[i]
    # Final link
    club_links[i] = club_links[i].replace('startseite', 'kader') + '/plus/1'

# Sanity check: Print the list of club links
club_links

['https://www.transfermarkt.com/gnk-dinamo-zagreb/kader/verein/419/saison_id/2022/plus/1',
 'https://www.transfermarkt.com/hnk-hajduk-split/kader/verein/447/saison_id/2022/plus/1',
 'https://www.transfermarkt.com/hnk-rijeka/kader/verein/144/saison_id/2022/plus/1',
 'https://www.transfermarkt.com/nk-lokomotiva-zagreb/kader/verein/11194/saison_id/2022/plus/1',
 'https://www.transfermarkt.com/nk-osijek/kader/verein/327/saison_id/2022/plus/1',
 'https://www.transfermarkt.com/hnk-gorica/kader/verein/24575/saison_id/2022/plus/1',
 'https://www.transfermarkt.com/nk-varazdin/kader/verein/599/saison_id/2022/plus/1',
 'https://www.transfermarkt.com/nk-istra-1961/kader/verein/999/saison_id/2022/plus/1',
 'https://www.transfermarkt.com/hnk-sibenik/kader/verein/223/saison_id/2022/plus/1',
 'https://www.transfermarkt.com/slaven-belupo-koprivnica/kader/verein/2362/saison_id/2022/plus/1']

#### Scraping Player Data

In this section, I will scrape the player data for each club. I'll iterate through the club URLs obtained in the previous section and make an individual request to each club's page. Using BeautifulSoup, I'll extract the desired player information such as name, position, nationality, and height. Each player's data is stored in a dictionary and appended to a list, which accumulates the players from all clubs.
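
Because this step sends one request per club, it may be worth spacing the requests out so the website is not hit in a rapid burst. The sketch below is one possible way to do that, not part of the original scraper: `fetch_with_pause` is a hypothetical helper, the two-second default delay is an arbitrary choice, and the demo uses `str.upper` as a stand-in for a real download function so that it runs without network access.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def fetch_with_pause(
    urls: Iterable[str],
    fetch: Callable[[str], object],
    delay_seconds: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Iterator[Tuple[str, object]]:
    """Yield (url, result) pairs, pausing between consecutive requests."""
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)  # no pause before the very first request
        yield url, fetch(url)


# Stand-in fetch function and zero delay so the sketch runs instantly and offline.
demo_pages = list(fetch_with_pause(["page-1", "page-2"], fetch=str.upper, delay_seconds=0))
print(demo_pages)
```

In the scraper, `fetch` could be a small function wrapping `requests.get(club_url, headers=headers)`, with the club links passed in as `urls`.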

In [9]:
# Create an empty list to store all players' data
all_players = []

# Iterate over each club URL, reusing the headers dictionary defined earlier
for club_url in club_links:
    # Send a GET request to the club URL with the defined headers
    response = requests.get(club_url, headers=headers)
    
    # Get the content of the response
    content = response.content
    
    # Parse the content using BeautifulSoup
    club_soup = BeautifulSoup(response.content, 'html.parser')
    
    # Find the table element with class 'items' which contains the player data
    player_table = club_soup.find('table', {'class': 'items'})
    
    # Select all 'tr' elements with class 'odd' or 'even' within the player_table
    players = player_table.select('tr.odd, tr.even')
    
    # The club name is identical for every player on the page, so look it up once
    club_name = club_soup.find('h1', {'class': 'data-header__headline-wrapper data-header__headline-wrapper--oswald'}).text.strip()

    # Iterate over the player rows
    for player in players:
        # Collect the player's nationalities from the 'alt' text of the flag images
        flags = player.find_all('img', {'class': 'flaggenrahmen'})
        nationalities = [img.get('alt') for img in flags]

        # Find the 'img' element with an empty class, representing the club the player joined from
        joined_from_element = player.find('img', {'class': ''})

        # Use its 'alt' attribute if the element exists, otherwise an empty string
        joined_from = joined_from_element.get('alt') if joined_from_element else ""

        # All centred cells of the row; several fields are read from them by position
        centered = player.find_all('td', {'class': 'zentriert'})

        # Create a dictionary to store the player data
        player_dict = {
            'number': player.find('div', {'class': 'rn_nummer'}).text.strip(),
            'image_url': player.find('img', {'class': 'bilderrahmen-fixed'}).get('data-src'),
            'name': player.find('td', {'class': 'hauptlink'}).text.strip(),
            'position': player.find_all('td')[4].text.strip(),
            'dob': centered[1].text.strip()[:-5],
            'nationality': ', '.join(nationalities),
            'height': centered[3].text.strip()[:-1],
            'foot': centered[4].text.strip(),
            'joined': centered[5].text.strip(),
            'joined_from': joined_from,
            'contract': centered[7].text.strip(),
            'value': player.find('td', {'class': 'rechts hauptlink'}).text.strip()[1:],
            'club_name': club_name
        }
        
        # Append the player data dictionary to the all_players list
        all_players.append(player_dict)

# Sanity check: Print the number of players scraped
len(all_players)

284
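
One fragile point in the loop above is that `find()` returns `None` when no element matches, so a chained `.text` raises an `AttributeError` as soon as a row lacks one of the expected cells. A small helper can return a default value instead. This is a sketch under assumptions: `cell_text` is a hypothetical helper name, and the sample row is made up to show a row with a name cell but no shirt-number cell.

```python
from bs4 import BeautifulSoup


def cell_text(row, tag, css_class, default=""):
    """Return the stripped text of the first matching element, or default if absent."""
    element = row.find(tag, {"class": css_class})
    return element.get_text(strip=True) if element is not None else default


# Made-up row for illustration: it has a name cell but no shirt-number cell.
sample_row = BeautifulSoup(
    '<table><tr class="odd"><td class="hauptlink"> Example Player </td></tr></table>',
    "html.parser",
).find("tr")

print(cell_text(sample_row, "td", "hauptlink"))
print(cell_text(sample_row, "div", "rn_nummer", "-"))
```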

### STEP 4. Creating a DataFrame
***

In this section, I create a pandas data frame to organize and structure the scraped player data. The data frame allows me to perform various operations and analysis on the data more conveniently.

In [10]:
# Create a DataFrame from the list of player dictionaries
df = pd.DataFrame(all_players)

# Sanity check: Print the first few rows of the DataFrame
df.head()

| | number | image_url | name | position | dob | nationality | height | foot | joined | joined_from | contract | value | club_name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 40 | https://img.a.transfermarkt.technology/portrai... | Dominik Livakovic | Goalkeeper | Jan 9, 1995 | Croatia | 188 | right | Aug 31, 2015 | NK Zagreb | Jun 15, 2024 | 14.00m | GNK Dinamo Zagreb |
| 1 | 33 | https://img.a.transfermarkt.technology/portrai... | Ivan Nevistic | Goalkeeper | Jul 31, 1998 | Croatia | 195 | right | Jan 28, 2021 | NK Lokomotiva Zagreb | Jun 15, 2025 | 1.50m | GNK Dinamo Zagreb |
| 2 | 1 | https://img.a.transfermarkt.technology/portrai... | Danijel Zagorac | Goalkeeper | Feb 7, 1987 | Croatia | 186 | right | Jul 11, 2016 | RNK Split | Jun 30, 2026 | 200k | GNK Dinamo Zagreb |
| 3 | 37 | https://img.a.transfermarkt.technology/portrai... | Josip Sutalo | Centre-Back | Feb 28, 2000 | Croatia | 190 | right | Jan 7, 2020 | GNK Dinamo Zagreb II | Jun 15, 2028 | 18.00m | GNK Dinamo Zagreb |
| 4 | 55 | https://img.a.transfermarkt.technology/portrai... | Dino Peric | Centre-Back | Jul 12, 1994 | Croatia | 197 | left | Jul 7, 2017 | NK Lokomotiva Zagreb | Jun 15, 2026 | 5.00m | GNK Dinamo Zagreb |
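
The scraped columns are all text, so values such as `14.00m`, `200k`, or `Jan 9, 1995` need converting before they can be used numerically. The helpers below are a minimal sketch of that cleaning step, assuming the formats seen in the rows above; the function names are chosen here, `parse_date` assumes English month abbreviations, and strings in any other shape will raise a `ValueError`.

```python
from datetime import date, datetime


def parse_market_value(text: str) -> int:
    """Convert strings such as '14.00m' or '200k' to a whole number of currency units."""
    multipliers = {"k": 1_000, "m": 1_000_000}
    text = text.strip().lower()
    if text and text[-1] in multipliers:
        return round(float(text[:-1]) * multipliers[text[-1]])
    return round(float(text))  # assumes a plain number when no suffix is present


def parse_date(text: str) -> date:
    """Parse dates formatted like 'Jan 9, 1995'."""
    return datetime.strptime(text.strip(), "%b %d, %Y").date()


def age_on(born: date, today: date) -> int:
    """Return the number of full years between born and today."""
    had_birthday = (today.month, today.day) >= (born.month, born.day)
    return today.year - born.year - (0 if had_birthday else 1)


print(parse_market_value("14.00m"))
print(parse_market_value("200k"))
print(parse_date("Jan 9, 1995"))
```

With the data frame in place, such helpers could be applied column-wise, for example with `df['value'].map(parse_market_value)`.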


### STEP 5. Exporting the Data to CSV
***

In this section, I export the scraped player data to a CSV (Comma-Separated Values) file. Using the data frame's `to_csv()` method, I convert the data frame to CSV format and save it to a file that can be easily shared, stored, or further processed in other applications. The CSV file preserves the tabular structure of the data, making it accessible for analysis or visualization tasks outside of the current Jupyter Notebook environment.

In [13]:
# Export the DataFrame to a CSV file
#df.to_csv('SuperSportHNL.csv', index=False)
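
The `index=False` argument matters when the file is read back later. The sketch below illustrates this with a tiny, made-up data frame and in-memory buffers instead of real files: without `index=False`, the row labels are written as an extra unnamed column, which pandas labels `Unnamed: 0` on reading.

```python
import io

import pandas as pd

# Two made-up rows standing in for the scraped data.
demo_df = pd.DataFrame({"name": ["Player A", "Player B"], "height": [188, 195]})

# Default behaviour: the index is written as an extra, unnamed first column.
with_index = io.StringIO()
demo_df.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# With index=False only the actual data columns are written.
without_index = io.StringIO()
demo_df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```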

***
## THE END