# Persistence and Efficency
You now might be aware of some of the problems with parsing several pages and how that can quicly get out of hand. While parsing, you might chose to persist the collected data, so that it can be analyzed and cleaned later. 
Parsing, retrieving, saving, and cleaning data are all separate actions, and you shouldn't try to work with data while collecting. In this notebook you'll practice further parsing techniques along with persistency, both for JSON and CSV formats as well as using SQL and a database.

Start by loading one of the available HTML files into the `scrapy` library

**Note**: This notebook relies on several dependencies that must be pre-installed in your environment. If you haven't done so, please run the following command in your terminal or command prompt:
```bash
python3 -m venv venv
venv/bin/pip install -r requirements.txt
```

In [1]:
import scrapy
import os
current_dir = os.path.abspath('')
url = os.path.join(current_dir, "html/1994_World_Junior_Championships_in_Athletics_–_Men's_high_jump")
with open(url) as _f:
    url_data = _f.read()

response = scrapy.http.TextResponse(url, body=url_data, encoding='utf-8')

FileNotFoundError: [Errno 2] No such file or directory: "c:\\Users\\oscar\\OneDrive\\Oscar\\Master_Data_Science_AI_Nuclio_School\\Recursos _Ejercicios_de_limpieza_y_preprocesamiento_20240318\\Clase_Web_scraping_V4\\Coursera_Webscrapping\\Files\\Files\\home\\jovyan\\work\\scrapy-xpath\\html/1992_World_Junior_Championships_in_Athletics_–_Men's_high_jump"

In [94]:
import scrapy
import os

# Directly using the new path provided
file_path = r"C:\Users\oscar\OneDrive\Oscar\Master_Data_Science_AI_Nuclio_School\Projects\Coursera_Webscrapping\1992_World_Junior_Championships_in_Athletics_–_Men's_high_jump";

try:
    with open(file_path, 'r', encoding='utf-8') as _f:
        url_data = _f.read()

    # Since this is a local file, using a placeholder URL in the response object
    response = scrapy.http.TextResponse(url='file://' + file_path, body=url_data.encode('utf-8'), encoding='utf-8')
    print("File loaded successfully.")
except FileNotFoundError:
    print(f"File not found: {file_path}")
except Exception as e:
    print(f"An error occurred: {e}")



File loaded successfully.


In [95]:
import scrapy
import os

# List of file paths
file_paths = [
    r"C:\Users\oscar\OneDrive\Oscar\Master_Data_Science_AI_Nuclio_School\Projects\Coursera_Webscrapping\1992_World_Junior_Championships_in_Athletics_–_Men's_high_jump",
    r"C:\Users\oscar\OneDrive\Oscar\Master_Data_Science_AI_Nuclio_School\Projects\Coursera_Webscrapping\1994_World_Junior_Championships_in_Athletics_–_Men's_high_jump",
    r"C:\Users\oscar\OneDrive\Oscar\Master_Data_Science_AI_Nuclio_School\Projects\Coursera_Webscrapping\1996_World_Junior_Championships_in_Athletics_–_Men's_high_jump",
    r"C:\Users\oscar\OneDrive\Oscar\Master_Data_Science_AI_Nuclio_School\Projects\Coursera_Webscrapping\1998_World_Junior_Championships_in_Athletics_–_Men's_high_jump"
]

for file_path in file_paths:
    try:
        with open(file_path, 'r', encoding='utf-8') as _f:
            url_data = _f.read()

        # Since this is a local file, using a placeholder URL in the response object
        response = scrapy.http.TextResponse(url='file://' + file_path, body=url_data.encode('utf-8'), encoding='utf-8')
        print(f"File loaded successfully: {file_path}")
    except FileNotFoundError:
        print(f"File not found: {file_path}")
    except Exception as e:
        print(f"An error occurred with {file_path}: {e}")


File loaded successfully: C:\Users\oscar\OneDrive\Oscar\Master_Data_Science_AI_Nuclio_School\Projects\Coursera_Webscrapping\1992_World_Junior_Championships_in_Athletics_–_Men's_high_jump
File loaded successfully: C:\Users\oscar\OneDrive\Oscar\Master_Data_Science_AI_Nuclio_School\Projects\Coursera_Webscrapping\1994_World_Junior_Championships_in_Athletics_–_Men's_high_jump
File loaded successfully: C:\Users\oscar\OneDrive\Oscar\Master_Data_Science_AI_Nuclio_School\Projects\Coursera_Webscrapping\1996_World_Junior_Championships_in_Athletics_–_Men's_high_jump
File loaded successfully: C:\Users\oscar\OneDrive\Oscar\Master_Data_Science_AI_Nuclio_School\Projects\Coursera_Webscrapping\1998_World_Junior_Championships_in_Athletics_–_Men's_high_jump


In [96]:
# Make sure that the interesting data is available 
table = response.xpath('//table')[1].xpath('tbody')
for tr in table.xpath('tr'):
    print(tr.xpath('td/b/text()').extract()[0],
          tr.xpath('td/a/text()').extract()[0]
    )

Gold Alfredo Deza
Silver Yin Xueli
Bronze Aleksandr Veryutin


This interaction with `scrapy` in a Jupyter Notebook is useful because you don't need to run the special shell and you also don't need to run the whole spider. Once you learn what you need to do here, you can adapat the spider to persist data.
First, start by persisting data as JSON. To do this, you will need to keep the information in a Python data structure like a dictionary, and then load it as a JSON object, and finally, save it to a file.

In [97]:
scrapped_data = {}
for tr in table.xpath('tr'):
    medal = tr.xpath('td/b/text()').extract()[0]
    athlete = tr.xpath('td/a/text()').extract()[0]
    scrapped_data[medal] = athlete

scrapped_data

{'Gold': 'Alfredo Deza', 'Silver': 'Yin Xueli', 'Bronze': 'Aleksandr Veryutin'}

In [98]:
import json

# You can convert Python into JSON first, but there is no need if you use `json.dump()`
# as shown next
json_data = json.dumps(scrapped_data)

# Persist it in a file:
with open("all_results.json", "w") as _f:
    # use dump() with the Python dictionary directly. 
    # the conversion is done on the fly
    json.dump(scrapped_data, _f)

Now that you can persist the scrapped data as JSON, you can also use CSV. This is specially useful if you want to to some data science operations. Although you can use an advanced library like Pandas for this, you can use the standard library CSV module from Python.

In [82]:
# construct the data first

column_names = ["Medal", "Athlete"]
rows = []

for tr in table.xpath('tr'):
    medal = tr.xpath('td/b/text()').extract()[0]
    athlete = tr.xpath('td/a/text()').extract()[0]
    rows.append([medal, athlete])


In [83]:
# Now persist it to disk
import csv

with open("results.csv", "w") as _f:
    writer = csv.writer(_f)

    # write the column names
    writer.writerow(column_names)

    # now write the rows
    writer.writerows(rows)

Finally, you can persist data to a database. Unlike the JSON and CSV approach, using a database is much more memory efficient. This is the principal reason why you want to use a database instead of a file on disk. Imagine capturing 10GB of data. This could potentially mean that you need 10GB of available memory to hold onto that data before saving it to disk.
By using a database, you can save the data as the data is gathered. 

For the next cells, use a SQLite database to persist the data. Create the file-based database and the table needed.

In [41]:
import sqlite3
connection = sqlite3.connect("1992_results.db")
db_table = 'CREATE TABLE results (id integer primary key, medal TEXT, athlete TEXT)'
cursor = connection.cursor()
cursor.execute(db_table)
connection.commit()

OperationalError: table results already exists

In [34]:
# Now it is time to persist the data. Open the connection again
connection = sqlite3.connect("1994_results.db")
cursor = connection.cursor()
query = 'INSERT INTO results(medal, athlete) VALUES(?, ?)'

for tr in table.xpath('tr'):
    medal = tr.xpath('td/b/text()').extract()[0]
    athlete = tr.xpath('td/a/text()').extract()[0]
    cursor.execute(query, (medal, athlete)) 
    connection.commit()

The data is now persisted in a file-based database that you can query. Verify that all works by creating a new connection and querying the database.

Update the _wikipedia_ project and spider to use some of these techniques to persist data. Next, try parsing all the files in the _html_ directory instead of just one and persist all results. Do you think you can parse other information as well? 

Try parsing the height and the results from all the other athletes, not just the top three places.

In [80]:
import scrapy
import os

# Directly using the new path provided
file_path = r"C:\Users\oscar\OneDrive\Oscar\Master_Data_Science_AI_Nuclio_School\Projects\Coursera_Webscrapping\1998_World_Junior_Championships_in_Athletics_–_Men's_high_jump"

try:
    with open(file_path, 'r', encoding='utf-8') as _f:
        url_data = _f.read()

    # Since this is a local file, using a placeholder URL in the response object
    response = scrapy.http.TextResponse(url='file://' + file_path, body=url_data.encode('utf-8'), encoding='utf-8')
    print("File loaded successfully.")
except FileNotFoundError:
    print(f"File not found: {file_path}")
except Exception as e:
    print(f"An error occurred: {e}")

File loaded successfully.


In [84]:
# Assuming 'response' is your parsed HTML document
table = response.xpath('//table')[1].xpath('tbody')
for tr in response.xpath('//table/tr'):
    medal = tr.xpath('./td[1]/text()').get()
    athlete = tr.xpath('./td[2]/text()').get()
    height = tr.xpath('./td[3]/text()').get()
    result = tr.xpath('./td[4]/text()').get()
    # Process and save this data as needed


In [85]:
import json

# You can convert Python into JSON first, but there is no need if you use `json.dump()`
# as shown next
json_data = json.dumps(scrapped_data)

# Persist it in a file:
with open("1998_results.json", "w") as _f:
    # use dump() with the Python dictionary directly. 
    # the conversion is done on the fly
    json.dump(scrapped_data, _f)

In [75]:
import csv

column_names = ["Medal", "Athlete", "Height", "Result"]
rows = []  # This will be filled with data from the HTML parsing

# Assuming you've filled `rows` with the parsed data
with open("results.csv", "w", newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(column_names)
    writer.writerows(rows)


In [43]:
import os
import glob

def parse_html(file_path):
    # Implement parsing logic here
    pass

def save_to_csv(data, year):
    # Implement CSV writing logic here
    pass

def save_to_db(data, year):
    # Implement database saving logic here
    pass

def save_to_json(data, year):
    # Implement JSON writing logic here
    pass

# Example of iterating over HTML files and processing them
base_dir = r"C:\Users\oscar\OneDrive\Oscar\Master_Data_Science_AI_Nuclio_School\Projects\Coursera_Webscrapping"
years = ['1992', '1994', '1996', '1998']

for year in years:
    html_files = glob.glob(os.path.join(base_dir, f"{year}_World_Junior_Championships_in_Athletics_–_Men's_high_jump"))
    for file_path in html_files:
        data = parse_html(file_path)
        # Example of deciding how to persist data based on some condition
        save_to_csv(data, year)
        save_to_db(data, year)
        save_to_json(data, year)


In [44]:
import os
import glob
import sqlite3
import csv
import json

# Define a function for parsing HTML files
def parse_html(file_path):
    # Dummy data - replace this with your actual parsing logic
    return [{'medal': 'Gold', 'athlete': 'Athlete Name', 'year': file_path.split('\\')[-1][:4]}]

# Function to save data to a CSV file
def save_to_csv(data, file_path):
    with open(file_path, 'w', newline='', encoding='utf-8') as csvfile:
        fieldnames = data[0].keys()
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        for row in data:
            writer.writerow(row)

# Function to save data to a SQLite database
def save_to_db(data, db_path):
    connection = sqlite3.connect(db_path)
    cursor = connection.cursor()
    for item in data:
        cursor.execute('INSERT INTO results (medal, athlete) VALUES (?, ?)', (item['medal'], item['athlete']))
    connection.commit()
    connection.close()

# Function to save data to a JSON file
def save_to_json(data, file_path):
    with open(file_path, 'w', encoding='utf-8') as jsonfile:
        json.dump(data, jsonfile, ensure_ascii=False, indent=4)

# Main processing logic
base_dir = r"C:\Users\oscar\OneDrive\Oscar\Master_Data_Science_AI_Nuclio_School\Projects\Coursera_Webscrapping"

years = ['1992', '1994', '1996', '1998']
file_types = ['csv', 'db', 'json']

for year in years:
    # Construct file paths
    html_path = os.path.join(base_dir, f"{year}_World_Junior_Championships_in_Athletics_–_Men's_high_jump")
    paths = {
        'csv': os.path.join(base_dir, f"{year}_results.csv"),
        'db': os.path.join(base_dir, f"{year}_results.db"),
        'json': os.path.join(base_dir, f"{year}_results.json")
    }

    # Parse the HTML file
    data = parse_html(html_path)

    # Save to different formats
    save_to_csv(data, paths['csv'])
    save_to_db(data, paths['db'])
    save_to_json(data, paths['json'])

    print(f"Processed {year} data")


Processed 1992 data
Processed 1994 data
Processed 1996 data
Processed 1998 data


In [47]:
import sqlite3

class SQLitePipeline:

    def open_spider(self, spider):
        self.connection = sqlite3.connect("results.db")
        self.c = self.connection.cursor()
        self.c.execute('''
            CREATE TABLE IF NOT EXISTS results (
                id INTEGER PRIMARY KEY,
                rank TEXT,
                athlete TEXT,
                height TEXT,
                result TEXT
            )
        ''')
        self.connection.commit()

    def close_spider(self, spider):
        self.connection.close()

    def process_item(self, item, spider):
        self.c.execute('''
            INSERT INTO results (rank, athlete, height, result) VALUES (?, ?, ?, ?)
        ''', (item['rank'], item['athlete'], item['height'], item['result']))
        self.connection.commit()
        return item


In [49]:
ITEM_PIPELINES = {
    'yourprojectname.pipelines.SQLitePipeline': 300,
}

FEEDS = {
    'results.json': {
        'format': 'json',
        'encoding': 'utf8',
        'store_empty': False,
        'item_export_kwargs': {
           'export_empty_fields': True,
        },
    },
    'results.csv': {
        'format': 'csv',
        'encoding': 'utf8',
        'store_empty': False,
    },
}


In [76]:
import scrapy
import sqlite3
import csv

class ResultsSpider(scrapy.Spider):
    name = "results"
    start_urls = ["file:///path/to/your/html/1992_World_Junior_Championships_in_Athletics_–_Men's_high_jump.html"]  # Use file URI scheme

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Open a CSV file for writing
        self.csvfile = open("1992_results.csv", "w", newline='', encoding='utf-8')
        self.csvwriter = csv.writer(self.csvfile)
        # Write the header row
        self.csvwriter.writerow(["Medal", "Athlete", "Height", "Result"])
        
        # Connect to the database
        self.connection = sqlite3.connect("1992_results.db")
        self.cursor = self.connection.cursor()
        # Ensure the table exists
        self.cursor.execute('''
            CREATE TABLE IF NOT EXISTS results (
                id INTEGER PRIMARY KEY,
                medal TEXT,
                athlete TEXT,
                height TEXT,
                result TEXT
            )
        ''')

    def parse(self, response):
        # Iterate through each row in the table
        for tr in response.xpath('//tr[position()>1]'):  # Skipping the header row
            medal = tr.xpath('td[1]/text()').get()
            athlete = tr.xpath('td[2]/text()').get()
            height = tr.xpath('td[3]/text()').get()  # Update these based on actual HTML structure
            result = tr.xpath('td[4]/text()').get()

            # Write to CSV
            self.csvwriter.writerow([medal, athlete, height, result])

            # Insert into the database
            self.cursor.execute('INSERT INTO results (medal, athlete, height, result) VALUES (?, ?, ?, ?)', (medal, athlete, height, result))
        
        # Commit the database transaction
        self.connection.commit()

    def closed(self, reason):
        # Close the CSV file and database connection when the spider is closed
        self.csvfile.close()
        self.connection.close()


In [99]:
import scrapy
import os

# List of file paths
file_paths = [
    r"C:\Users\oscar\OneDrive\Oscar\Master_Data_Science_AI_Nuclio_School\Projects\Coursera_Webscrapping\1992_World_Junior_Championships_in_Athletics_–_Men's_high_jump",
    r"C:\Users\oscar\OneDrive\Oscar\Master_Data_Science_AI_Nuclio_School\Projects\Coursera_Webscrapping\1994_World_Junior_Championships_in_Athletics_–_Men's_high_jump",
    r"C:\Users\oscar\OneDrive\Oscar\Master_Data_Science_AI_Nuclio_School\Projects\Coursera_Webscrapping\1996_World_Junior_Championships_in_Athletics_–_Men's_high_jump",
    r"C:\Users\oscar\OneDrive\Oscar\Master_Data_Science_AI_Nuclio_School\Projects\Coursera_Webscrapping\1998_World_Junior_Championships_in_Athletics_–_Men's_high_jump"
]

for file_path in file_paths:
    try:
        with open(file_path, 'r', encoding='utf-8') as _f:
            url_data = _f.read()

        # Since this is a local file, using a placeholder URL in the response object
        response = scrapy.http.TextResponse(url='file://' + file_path, body=url_data.encode('utf-8'), encoding='utf-8')
        print(f"File loaded successfully: {file_path}")
        
        # Ensure that the data extraction logic is within the loop
        # Adjust the index of the table if needed
        table = response.xpath('//table').xpath('tbody')  # Adjusted for simplification
        for tr in table.xpath('tr'):
            medal = tr.xpath('td/b/text()').get()
            athlete = tr.xpath('td/a/text()').get()
            print(medal, athlete)

    except FileNotFoundError:
        print(f"File not found: {file_path}")
    except Exception as e:
        print(f"An error occurred with {file_path}: {e}")


File loaded successfully: C:\Users\oscar\OneDrive\Oscar\Master_Data_Science_AI_Nuclio_School\Projects\Coursera_Webscrapping\1992_World_Junior_Championships_in_Athletics_–_Men's_high_jump
None None
None None
None None
None men
None men
None men
None men
None men
None women
None men
None men
None women
None men
None men
None men
None men
None men
None women
None men
None None
None men
None men
None men
None men
None men
None men
None men
None men
None None
None women
None men
None None
Gold Steve Smith
Silver Tim Forsyth
Bronze Takahiro Kimino
None None
2.37 Steve Smith
2.31 Tim Forsyth
2.29 Takahiro Kimino
2.23 Kim Tae Young
2.20 Sergey Klyugin
2.20 Kristofer Lamos
2.17 Tomáš Janků
2.17 Coenraad Roux
2.17 Clayton Pugh
2.14 Xu Xiaodong
2.14 Sven Ootjers
2.14 Mirko Zanotti
2.14 Clifford van Reed
2.10 Dejan Miloševic
2.10 Hugo Muñoz
2.05 Stanley Osuide
2.05 Kostas Liapis
2.05 Kim Tae-hoi
2.00 Antoine Burke
None None
2.16 Tim Forsyth
2.16 Takahiro Kimino
2.16 Xu Xiaodong
2.16 Clifford van R

In [107]:
import scrapy
import os

# List of file paths
file_paths = [
    r"C:\Users\oscar\OneDrive\Oscar\Master_Data_Science_AI_Nuclio_School\Projects\Coursera_Webscrapping\1992_World_Junior_Championships_in_Athletics_–_Men's_high_jump",
    r"C:\Users\oscar\OneDrive\Oscar\Master_Data_Science_AI_Nuclio_School\Projects\Coursera_Webscrapping\1994_World_Junior_Championships_in_Athletics_–_Men's_high_jump",
    r"C:\Users\oscar\OneDrive\Oscar\Master_Data_Science_AI_Nuclio_School\Projects\Coursera_Webscrapping\1996_World_Junior_Championships_in_Athletics_–_Men's_high_jump",
    r"C:\Users\oscar\OneDrive\Oscar\Master_Data_Science_AI_Nuclio_School\Projects\Coursera_Webscrapping\1998_World_Junior_Championships_in_Athletics_–_Men's_high_jump"
]

for file_path in file_paths:
    try:
        with open(file_path, 'r', encoding='utf-8') as _f:
            url_data = _f.read()

        # Since this is a local file, using a placeholder URL in the response object
        response = scrapy.http.TextResponse(url='file://' + file_path, body=url_data.encode('utf-8'), encoding='utf-8')
        print(f"File loaded successfully: {file_path}")
        
        # Extract data from the table
        for tr in response.xpath('//table//tr'):
            medal = tr.xpath('./td[1]/b/text()').get() or tr.xpath('./td[1]/text()').get()
            athlete = tr.xpath('./td[2]/a/text()').get() or tr.xpath('./td[2]/text()').get()
            nationality = tr.xpath('./td[3]/img/@alt').get() or tr.xpath('./td[3]/text()').get()
            result = tr.xpath('./td[4]/text()').get() or tr.xpath('./td[4]/text()').get()

            if medal and athlete:  # Ensure variables are not None
                print(medal, athlete, nationality, result)
        
    except FileNotFoundError:
        print(f"File not found: {file_path}")
    except Exception as e:
        print(f"An error occurred with {file_path}: {e}")



File loaded successfully: C:\Users\oscar\OneDrive\Oscar\Master_Data_Science_AI_Nuclio_School\Projects\Coursera_Webscrapping\1992_World_Junior_Championships_in_Athletics_–_Men's_high_jump
Gold Steve Smith None None
Silver Tim Forsyth None None
Bronze Takahiro Kimino None None
4 Kim Tae Young None None
5 Sergey Klyugin None None
5 Kristofer Lamos None None
7 Tomáš Janků None None
8 Coenraad Roux None None
9 Clayton Pugh None None
10 Xu Xiaodong None None
10 Sven Ootjers None None
12 Mirko Zanotti None None
13 Clifford van Reed None None
14 Dejan Miloševic None None
15 Hugo Muñoz None None
16 Stanley Osuide None None
16 Kostas Liapis None None
16 Kim Tae-hoi None None
19 Antoine Burke None None
1 Tim Forsyth None None
1 Takahiro Kimino None None
3 Xu Xiaodong None None
3 Clifford van Reed None None
5 Sven Ootjers None None
6 Stanley Osuide None None
6 Kristofer Lamos None None
6 Antoine Burke None None
6 Coenraad Roux None None
10 Kim Tae-hoi None None
11 Dejan Miloševic None None
12 Kost

  super().__init__(text=text, type=st, root=root, **kwargs)


In [109]:
import scrapy
import os

# List of file paths
file_paths = [
    r"C:\Users\oscar\OneDrive\Oscar\Master_Data_Science_AI_Nuclio_School\Projects\Coursera_Webscrapping\1992_World_Junior_Championships_in_Athletics_–_Men's_high_jump",
    r"C:\Users\oscar\OneDrive\Oscar\Master_Data_Science_AI_Nuclio_School\Projects\Coursera_Webscrapping\1994_World_Junior_Championships_in_Athletics_–_Men's_high_jump",
    r"C:\Users\oscar\OneDrive\Oscar\Master_Data_Science_AI_Nuclio_School\Projects\Coursera_Webscrapping\1996_World_Junior_Championships_in_Athletics_–_Men's_high_jump",
    r"C:\Users\oscar\OneDrive\Oscar\Master_Data_Science_AI_Nuclio_School\Projects\Coursera_Webscrapping\1998_World_Junior_Championships_in_Athletics_–_Men's_high_jump"
]

for file_path in file_paths:
    try:
        with open(file_path, 'r', encoding='utf-8') as _f:
            url_data = _f.read()

        # Since this is a local file, using a placeholder URL in the response object
        response = scrapy.http.TextResponse(url='file://' + file_path, body=url_data.encode('utf-8'), encoding='utf-8')
        #print(f"File loaded successfully: {file_path}")
        
        # Extract data from the table
        for tr in response.xpath('//table//tr'):
            medal = tr.xpath('.//td[1]//img/@alt').extract_first(default='').split(',')[0]
            athlete = tr.xpath('./td[2]//a/text()').extract_first(default='N/A')
            nationality = tr.xpath('.//td[3]//a/text()').extract_first(default='N/A')
            result = tr.xpath('./td[4]/b/text() | ./td[4]/text()').extract_first(default='N/A').strip()

            if medal and athlete:  # Ensure variables are not None
                print(medal, athlete, nationality, result)
        
    except FileNotFoundError:
        print(f"File not found: {file_path}")
    except Exception as e:
        print(f"An error occurred with {file_path}: {e}")


Athletics pictogram.svg N/A N/A N/A
1st place Steve Smith United Kingdom 2.37
2nd place Tim Forsyth Australia 2.31
3rd place Takahiro Kimino Japan 2.29
Athletics pictogram.svg N/A N/A N/A
1st place Jagan Hames Australia 2.23
2nd place Antoine Burke Ireland 2.20
3rd place Mika Polku Finland 2.20
Athletics pictogram.svg N/A N/A N/A
1st place Mark Boswell Canada 2.24
2nd place Svatoslav Ton Czech Republic 2.21
2nd place Ben Challenger United Kingdom 2.21
Athletics pictogram.svg N/A N/A N/A
1st place Alfredo Deza Peru 2.21
2nd place Yin Xueli China 2.21
3rd place Aleksandr Veryutin Belarus 2.21


In [110]:
import scrapy

# List of file paths for demonstration purposes. Replace these paths with your actual file paths.
file_paths = [
    r"C:\Users\oscar\OneDrive\Oscar\Master_Data_Science_AI_Nuclio_School\Projects\Coursera_Webscrapping\1992_World_Junior_Championships_in_Athletics_–_Men's_high_jump",
    r"C:\Users\oscar\OneDrive\Oscar\Master_Data_Science_AI_Nuclio_School\Projects\Coursera_Webscrapping\1994_World_Junior_Championships_in_Athletics_–_Men's_high_jump",
    r"C:\Users\oscar\OneDrive\Oscar\Master_Data_Science_AI_Nuclio_School\Projects\Coursera_Webscrapping\1996_World_Junior_Championships_in_Athletics_–_Men's_high_jump",
    r"C:\Users\oscar\OneDrive\Oscar\Master_Data_Science_AI_Nuclio_School\Projects\Coursera_Webscrapping\1998_World_Junior_Championships_in_Athletics_–_Men's_high_jump"
]

for file_path in file_paths:
    try:
        with open(file_path, 'r', encoding='utf-8') as _f:
            url_data = _f.read()

        # Since this is a local file, using a placeholder URL in the response object
        response = scrapy.http.TextResponse(url='file://' + file_path, body=url_data.encode('utf-8'), encoding='utf-8')
        print(f"File loaded successfully: {file_path}")
        
        # Extract data from the table
        for tr in response.xpath('//table//tr'):
            medal = tr.xpath('.//td[1]//img/@alt').extract_first(default='').split(',')[0]
            athlete = tr.xpath('./td[2]//a/text()').extract_first(default='N/A')
            nationality = tr.xpath('.//td[3]//a/text()').extract_first(default='N/A')
            result = tr.xpath('./td[4]/b/text() | ./td[4]/text()').extract_first(default='N/A').strip()

            # Skip rows where the athlete name is not available (assuming it's always present in a valid data row)
            if athlete == 'N/A':
                continue

            print(medal, athlete, nationality, result)
        
    except FileNotFoundError:
        print(f"File not found: {file_path}")
    except Exception as e:
        print(f"An error occurred with {file_path}: {e}")


File loaded successfully: C:\Users\oscar\OneDrive\Oscar\Master_Data_Science_AI_Nuclio_School\Projects\Coursera_Webscrapping\1992_World_Junior_Championships_in_Athletics_–_Men's_high_jump
 women N/A N/A
 women N/A N/A
 women N/A N/A
 women N/A N/A
 women N/A N/A
 women N/A N/A
 women N/A N/A
 women N/A N/A
 women N/A N/A
 women N/A N/A
 women N/A N/A
 women N/A N/A
 women N/A N/A
 women N/A N/A
 women N/A N/A
 women N/A N/A
 women N/A N/A
 women N/A N/A
 Steve Smith N/A N/A
 Tim Forsyth N/A N/A
 Takahiro Kimino N/A N/A
1st place Steve Smith United Kingdom 2.37
2nd place Tim Forsyth Australia 2.31
3rd place Takahiro Kimino Japan 2.29
 Kim Tae Young South Korea 2.23
 Sergey Klyugin Commonwealth of Independent States 2.20
 Kristofer Lamos Germany 2.20
 Tomáš Janků Czechoslovakia 2.17
 Coenraad Roux South Africa 2.17
 Clayton Pugh Australia 2.17
 Xu Xiaodong China 2.14
 Sven Ootjers Netherlands 2.14
 Mirko Zanotti Italy 2.14
 Clifford van Reed United States 2.14
 Dejan Miloševic Slovenia 2.

In [111]:
import scrapy

# List of file paths for demonstration purposes. Replace these paths with your actual file paths.
file_paths = [
    r"C:\Users\oscar\OneDrive\Oscar\Master_Data_Science_AI_Nuclio_School\Projects\Coursera_Webscrapping\1992_World_Junior_Championships_in_Athletics_–_Men's_high_jump",
    r"C:\Users\oscar\OneDrive\Oscar\Master_Data_Science_AI_Nuclio_School\Projects\Coursera_Webscrapping\1994_World_Junior_Championships_in_Athletics_–_Men's_high_jump",
    r"C:\Users\oscar\OneDrive\Oscar\Master_Data_Science_AI_Nuclio_School\Projects\Coursera_Webscrapping\1996_World_Junior_Championships_in_Athletics_–_Men's_high_jump",
    r"C:\Users\oscar\OneDrive\Oscar\Master_Data_Science_AI_Nuclio_School\Projects\Coursera_Webscrapping\1998_World_Junior_Championships_in_Athletics_–_Men's_high_jump"
]

for file_path in file_paths:
    try:
        with open(file_path, 'r', encoding='utf-8') as _f:
            url_data = _f.read()

        response = scrapy.http.TextResponse(url='file://' + file_path, body=url_data.encode('utf-8'), encoding='utf-8')
        print(f"File loaded successfully: {file_path}")
        
        for tr in response.xpath('//table//tr'):
            # Check if the row looks like it contains data based on expected patterns
            if tr.xpath('.//td').extract_first(default='').strip():
                medal = tr.xpath('.//td[1]//img/@alt').extract_first(default='').split(',')[0]
                athlete = tr.xpath('./td[2]//a/text()').extract_first(default='N/A')
                nationality = tr.xpath('.//td[3]//a/text()').extract_first(default='N/A')
                result = tr.xpath('./td[4]/b/text() | ./td[4]/text()').extract_first(default='N/A').strip()

                # Skip rows where essential data is missing, indicating it's likely not a data row
                if athlete == 'N/A' or result == 'N/A':
                    continue

                print(medal, athlete, nationality, result)
        
    except FileNotFoundError:
        print(f"File not found: {file_path}")
    except Exception as e:
        print(f"An error occurred with {file_path}: {e}")


File loaded successfully: C:\Users\oscar\OneDrive\Oscar\Master_Data_Science_AI_Nuclio_School\Projects\Coursera_Webscrapping\1992_World_Junior_Championships_in_Athletics_–_Men's_high_jump
1st place Steve Smith United Kingdom 2.37
2nd place Tim Forsyth Australia 2.31
3rd place Takahiro Kimino Japan 2.29
 Kim Tae Young South Korea 2.23
 Sergey Klyugin Commonwealth of Independent States 2.20
 Kristofer Lamos Germany 2.20
 Tomáš Janků Czechoslovakia 2.17
 Coenraad Roux South Africa 2.17
 Clayton Pugh Australia 2.17
 Xu Xiaodong China 2.14
 Sven Ootjers Netherlands 2.14
 Mirko Zanotti Italy 2.14
 Clifford van Reed United States 2.14
 Dejan Miloševic Slovenia 2.10
 Hugo Muñoz Peru 2.10
 Stanley Osuide United Kingdom 2.05
 Kostas Liapis Greece 2.05
 Kim Tae-hoi South Korea 2.05
 Antoine Burke Ireland 2.00
 Tim Forsyth Australia 2.16
 Takahiro Kimino Japan 2.16
 Xu Xiaodong China 2.16
 Clifford van Reed United States 2.16
 Sven Ootjers Netherlands 2.16
 Stanley Osuide United Kingdom 2.13
 Krist

In [114]:
import scrapy

# List of file paths for demonstration purposes. Replace these paths with your actual file paths.
file_paths = [
    r"C:\Users\oscar\OneDrive\Oscar\Master_Data_Science_AI_Nuclio_School\Projects\Coursera_Webscrapping\1992_World_Junior_Championships_in_Athletics_–_Men's_high_jump",
    r"C:\Users\oscar\OneDrive\Oscar\Master_Data_Science_AI_Nuclio_School\Projects\Coursera_Webscrapping\1994_World_Junior_Championships_in_Athletics_–_Men's_high_jump",
    r"C:\Users\oscar\OneDrive\Oscar\Master_Data_Science_AI_Nuclio_School\Projects\Coursera_Webscrapping\1996_World_Junior_Championships_in_Athletics_–_Men's_high_jump",
    r"C:\Users\oscar\OneDrive\Oscar\Master_Data_Science_AI_Nuclio_School\Projects\Coursera_Webscrapping\1998_World_Junior_Championships_in_Athletics_–_Men's_high_jump"
]

for file_path in file_paths:
    try:
        with open(file_path, 'r', encoding='utf-8') as _f:
            url_data = _f.read()

        response = scrapy.http.TextResponse(url='file://' + file_path, body=url_data.encode('utf-8'), encoding='utf-8')
        print(f"File loaded successfully: {file_path}")
        
        # Initialize counter for athlete rank
        athlete_rank = 0
        
        for tr in response.xpath('//table//tr'):
            if tr.xpath('.//td').extract_first(default='').strip():
                medal = tr.xpath('.//td[1]//img/@alt').extract_first(default='').split(',')[0]
                athlete = tr.xpath('./td[2]//a/text()').extract_first(default='N/A')
                nationality = tr.xpath('.//td[3]//a/text()').extract_first(default='N/A')
                result = tr.xpath('./td[4]/b/text() | ./td[4]/text()').extract_first(default='N/A').strip()

                if athlete == 'N/A' or result == 'N/A':
                    continue
                
                # Increment athlete rank
                athlete_rank += 1
                
                # For non-medal positions, prepend rank to athlete's name
                if athlete_rank > 3:
                    position = f"{athlete_rank}th place"
                    print(position, athlete, nationality, result)
                else:
                    print(medal, athlete, nationality, result)
        
    except FileNotFoundError:
        print(f"File not found: {file_path}")
    except Exception as e:
        print(f"An error occurred with {file_path}: {e}")


File loaded successfully: C:\Users\oscar\OneDrive\Oscar\Master_Data_Science_AI_Nuclio_School\Projects\Coursera_Webscrapping\1992_World_Junior_Championships_in_Athletics_–_Men's_high_jump
1st place Steve Smith United Kingdom 2.37
2nd place Tim Forsyth Australia 2.31
3rd place Takahiro Kimino Japan 2.29
4th place Kim Tae Young South Korea 2.23
5th place Sergey Klyugin Commonwealth of Independent States 2.20
6th place Kristofer Lamos Germany 2.20
7th place Tomáš Janků Czechoslovakia 2.17
8th place Coenraad Roux South Africa 2.17
9th place Clayton Pugh Australia 2.17
10th place Xu Xiaodong China 2.14
11th place Sven Ootjers Netherlands 2.14
12th place Mirko Zanotti Italy 2.14
13th place Clifford van Reed United States 2.14
14th place Dejan Miloševic Slovenia 2.10
15th place Hugo Muñoz Peru 2.10
16th place Stanley Osuide United Kingdom 2.05
17th place Kostas Liapis Greece 2.05
18th place Kim Tae-hoi South Korea 2.05
19th place Antoine Burke Ireland 2.00
20th place Tim Forsyth Australia 2.16

In [124]:
import scrapy
import os

# Define your file paths here
file_paths = [
    r"C:\Users\oscar\OneDrive\Oscar\Master_Data_Science_AI_Nuclio_School\Projects\Coursera_Webscrapping\1992_World_Junior_Championships_in_Athletics_–_Men's_high_jump",
    r"C:\Users\oscar\OneDrive\Oscar\Master_Data_Science_AI_Nuclio_School\Projects\Coursera_Webscrapping\1994_World_Junior_Championships_in_Athletics_–_Men's_high_jump",
    r"C:\Users\oscar\OneDrive\Oscar\Master_Data_Science_AI_Nuclio_School\Projects\Coursera_Webscrapping\1996_World_Junior_Championships_in_Athletics_–_Men's_high_jump",
    r"C:\Users\oscar\OneDrive\Oscar\Master_Data_Science_AI_Nuclio_School\Projects\Coursera_Webscrapping\1998_World_Junior_Championships_in_Athletics_–_Men's_high_jump"
]

def ordinal(n):
    return "%d%s" % (n, "th" if 4 <= n % 100 <= 20 else {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th"))

for file_path in file_paths:
    try:
        with open(file_path, 'r', encoding='utf-8') as _f:
            url_data = _f.read()

        response = scrapy.http.TextResponse(url='file://' + file_path, body=url_data.encode('utf-8'), encoding='utf-8')
        event_name = os.path.basename(file_path)
        print(f"{event_name}")
        
        athlete_rank = 0
        
        for tr in response.xpath('//table//tr[position()>1]'):
            # Extract and strip to remove leading/trailing whitespace
            athlete = tr.xpath('./td[2]//text()').get(default='').strip()
            nationality = tr.xpath('./td[3]//a/text()').get(default='').strip()
            result = tr.xpath('./td[4]/text()').get(default='').strip() or tr.xpath('./td[4]/b/text()').get(default='').strip()

            # Skip rows without meaningful athlete or result data
            if not athlete or athlete.lower() == "women" or not result:
                continue
            
            athlete_rank += 1
            
            position = ordinal(athlete_rank)
            
            # If nationality or result data is missing, print a placeholder message
            if not nationality or not result:
                print(f"{position} - {athlete} - Data not available")
            else:
                print(f"{position} - {athlete}, {nationality}, {result}")
        
    except FileNotFoundError:
        print(f"File not found for: {event_name}")
    except Exception as e:
        print(f"An error occurred with {event_name}: {e}")



1992_World_Junior_Championships_in_Athletics_–_Men's_high_jump
1st - Steve Smith, United Kingdom, 2.37
2nd - Tim Forsyth, Australia, 2.31
3rd - Takahiro Kimino, Japan, 2.29
4th - Kim Tae Young, South Korea, 2.23
5th - Sergey Klyugin, Commonwealth of Independent States, 2.20
6th - Kristofer Lamos, Germany, 2.20
7th - Tomáš Janků, Czechoslovakia, 2.17
8th - Coenraad Roux, South Africa, 2.17
9th - Clayton Pugh, Australia, 2.17
10th - Xu Xiaodong, China, 2.14
11th - Sven Ootjers, Netherlands, 2.14
12th - Mirko Zanotti, Italy, 2.14
13th - Clifford van Reed, United States, 2.14
14th - Dejan Miloševic, Slovenia, 2.10
15th - Hugo Muñoz, Peru, 2.10
16th - Stanley Osuide, United Kingdom, 2.05
17th - Kostas Liapis, Greece, 2.05
18th - Kim Tae-hoi, South Korea, 2.05
19th - Antoine Burke, Ireland, 2.00
20th - Tim Forsyth, Australia, 2.16
21st - Takahiro Kimino, Japan, 2.16
22nd - Xu Xiaodong, China, 2.16
23rd - Clifford van Reed, United States, 2.16
24th - Sven Ootjers, Netherlands, 2.16
25th - Sta

In [131]:
import scrapy
import os
import csv
import json
import sqlite3

# Define your file paths here
file_paths = [
    r"C:\Users\oscar\OneDrive\Oscar\Master_Data_Science_AI_Nuclio_School\Projects\Coursera_Webscrapping\1992_World_Junior_Championships_in_Athletics_–_Men's_high_jump",
    r"C:\Users\oscar\OneDrive\Oscar\Master_Data_Science_AI_Nuclio_School\Projects\Coursera_Webscrapping\1994_World_Junior_Championships_in_Athletics_–_Men's_high_jump",
    r"C:\Users\oscar\OneDrive\Oscar\Master_Data_Science_AI_Nuclio_School\Projects\Coursera_Webscrapping\1996_World_Junior_Championships_in_Athletics_–_Men's_high_jump",
    r"C:\Users\oscar\OneDrive\Oscar\Master_Data_Science_AI_Nuclio_School\Projects\Coursera_Webscrapping\1998_World_Junior_Championships_in_Athletics_–_Men's_high_jump"
]

# Helper function for ordinal numbers
def ordinal(n):
    return "%d%s" % (n, "th" if 4 <= n % 100 <= 20 else {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th"))

# List to store all athlete data
athletes_data = []

for file_path in file_paths:
    try:
        with open(file_path, 'r', encoding='utf-8') as _f:
            url_data = _f.read()

        response = scrapy.http.TextResponse(url='file://' + file_path, body=url_data.encode('utf-8'), encoding='utf-8')
        event_name = os.path.basename(file_path)
        print(f"{event_name}")
        
        athlete_rank = 0
        
        for tr in response.xpath('//table//tr[position()>1]'):
            athlete = tr.xpath('./td[2]//text()').get(default='').strip()
            nationality = tr.xpath('./td[3]//a/text()').get(default='').strip()
            result = tr.xpath('./td[4]/text()').get(default='').strip() or tr.xpath('./td[4]/b/text()').get(default='').strip()

            if not athlete or athlete.lower() == "women" or not result:
                continue
            
            athlete_rank += 1
            position = ordinal(athlete_rank)
            
            athletes_data.append({
                "Position": position,
                "Athlete": athlete,
                "Nationality": nationality,
                "Result": result
            })
        
    except FileNotFoundError:
        print(f"File not found for: {event_name}")
    except Exception as e:
        print(f"An error occurred with {event_name}: {e}")

# Writing to CSV
csv_file = "allyearsdata.csv"
with open(csv_file, 'w', newline='', encoding='utf-8') as file:
    writer = csv.DictWriter(file, fieldnames=["Position", "Athlete", "Nationality", "Result"])
    writer.writeheader()
    for data in athletes_data:
        writer.writerow(data)

# Writing to JSON
json_file = "allyearsdata.json"
with open(json_file, 'w', encoding='utf-8') as file:
    json.dump(athletes_data, file, ensure_ascii=False, indent=4)

# Writing to SQLite DB
db_file = "allyearsdata.db"
conn = sqlite3.connect(db_file)
c = conn.cursor()
c.execute('''CREATE TABLE IF NOT EXISTS athletes
             (Position TEXT, Athlete TEXT, Nationality TEXT, Result TEXT)''')
for data in athletes_data:
    c.execute("INSERT INTO athletes VALUES (:Position, :Athlete, :Nationality, :Result)", data)
conn.commit()
conn.close()

print("Data saved to CSV, JSON, and SQLite DB.")


1992_World_Junior_Championships_in_Athletics_–_Men's_high_jump
1994_World_Junior_Championships_in_Athletics_–_Men's_high_jump
1996_World_Junior_Championships_in_Athletics_–_Men's_high_jump
1998_World_Junior_Championships_in_Athletics_–_Men's_high_jump
Data saved to CSV, JSON, and SQLite DB.


In [132]:
import scrapy
import os
import csv
import json
import sqlite3

# Define your file paths here
file_paths = [
    r"C:\Users\oscar\OneDrive\Oscar\Master_Data_Science_AI_Nuclio_School\Projects\Coursera_Webscrapping\1992_World_Junior_Championships_in_Athletics_–_Men's_high_jump",
    r"C:\Users\oscar\OneDrive\Oscar\Master_Data_Science_AI_Nuclio_School\Projects\Coursera_Webscrapping\1994_World_Junior_Championships_in_Athletics_–_Men's_high_jump",
    r"C:\Users\oscar\OneDrive\Oscar\Master_Data_Science_AI_Nuclio_School\Projects\Coursera_Webscrapping\1996_World_Junior_Championships_in_Athletics_–_Men's_high_jump",
    r"C:\Users\oscar\OneDrive\Oscar\Master_Data_Science_AI_Nuclio_School\Projects\Coursera_Webscrapping\1998_World_Junior_Championships_in_Athletics_–_Men's_high_jump"
]

def ordinal(n):
    """Helper function for ordinal numbers."""
    return "%d%s" % (n, "th" if 4 <= n % 100 <= 20 else {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th"))

# List to store all athlete data
athletes_data = []

for file_path in file_paths:
    try:
        with open(file_path, 'r', encoding='utf-8') as _f:
            url_data = _f.read()

        response = scrapy.http.TextResponse(url='file://' + file_path, body=url_data.encode('utf-8'), encoding='utf-8')
        event_name = os.path.basename(file_path)
        print(event_name)

        athlete_rank = 0
        for tr in response.xpath('//table//tr[position()>1]'):
            if athlete_rank >= 10:  # Stop after processing 10 athletes
                break

            athlete = tr.xpath('./td[2]//text()').get(default='').strip()
            nationality = tr.xpath('./td[3]//a/text()').get(default='').strip()
            result = tr.xpath('./td[4]/text()').get(default='').strip() or tr.xpath('./td[4]/b/text()').get(default='').strip()

            if not athlete or athlete.lower() == "women" or not result:
                continue

            athlete_rank += 1
            position = ordinal(athlete_rank)

            athletes_data.append({
                "Event": event_name,
                "Position": position,
                "Athlete": athlete,
                "Nationality": nationality,
                "Result": result
            })

    except FileNotFoundError:
        print(f"File not found for: {event_name}")
    except Exception as e:
        print(f"An error occurred with {event_name}: {e}")

# Writing to CSV
csv_file = "allyearsdata.csv"
with open(csv_file, 'w', newline='', encoding='utf-8') as file:
    writer = csv.DictWriter(file, fieldnames=["Event", "Position", "Athlete", "Nationality", "Result"])
    writer.writeheader()
    for data in athletes_data:
        writer.writerow(data)

# Writing to JSON
json_file = "allyearsdata.json"
with open(json_file, 'w', encoding='utf-8') as file:
    json.dump(athletes_data, file, ensure_ascii=False, indent=4)

# Writing to SQLite DB
db_file = "allyearsdata.db"
conn = sqlite3.connect(db_file)
c = conn.cursor()
c.execute('''CREATE TABLE IF NOT EXISTS athletes
             (Event TEXT, Position TEXT, Athlete TEXT, Nationality TEXT, Result TEXT)''')
for data in athletes_data:
    c.execute("INSERT INTO athletes VALUES (:Event, :Position, :Athlete, :Nationality, :Result)", data)
conn.commit()
conn.close()

print("Data saved to CSV, JSON, and SQLite DB.")


1992_World_Junior_Championships_in_Athletics_–_Men's_high_jump
1994_World_Junior_Championships_in_Athletics_–_Men's_high_jump
1996_World_Junior_Championships_in_Athletics_–_Men's_high_jump
1998_World_Junior_Championships_in_Athletics_–_Men's_high_jump
Data saved to CSV, JSON, and SQLite DB.
