# **Data Acquisition project**

# Step 1: Kaggle dataset

Download the Kaggle dataset [usgs/earthquake-database](https://www.kaggle.com/datasets/usgs/earthquake-database).

In [2]:
import kagglehub

# link: https://www.kaggle.com/datasets/usgs/earthquake-database


path = kagglehub.dataset_download("usgs/earthquake-database")

print("Path to dataset files:", path)

Downloading from https://www.kaggle.com/api/v1/datasets/download/usgs/earthquake-database?dataset_version_number=1...


100%|██████████| 590k/590k [00:00<00:00, 91.5MB/s]

Extracting files...
Path to dataset files: /root/.cache/kagglehub/datasets/usgs/earthquake-database/versions/1





### USGS Earthquake Database Description

This dataset contains earthquake records from the USGS (United States Geological Survey) database.
Key features include:
- **Date / Time**: Date and time of the earthquake, stored as two separate columns
- **Latitude / Longitude**: Geographic coordinates of the earthquake epicenter
- **Depth**: Depth of the earthquake in kilometers
- **Magnitude**: Earthquake magnitude; the scale is given in the `Magnitude Type` column (for example `MW`, moment magnitude)
- **Type**: Event type (`Earthquake` in the rows shown below)

The dataset allows analysis of seismic patterns, magnitude distributions, and geographic clustering of earthquakes.
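The CSV stores the date and the time of each event as two separate string columns (`Date`, `Time`), so time-based analysis first needs a single timestamp. The following is a minimal sketch, assuming the dates follow the `MM/DD/YYYY` pattern seen in the first rows; the helper name `add_timestamp` and the `Timestamp` column are hypothetical and not part of the dataset. Rows that do not match the assumed format become `NaT` instead of raising an error.

```python
import pandas as pd

# Two rows copied from the header output of the dataset
sample = pd.DataFrame({
    "Date": ["01/02/1965", "01/04/1965"],
    "Time": ["13:44:18", "11:29:49"],
})


def add_timestamp(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with a combined 'Timestamp' column (hypothetical helper)."""
    out = df.copy()
    out["Timestamp"] = pd.to_datetime(
        out["Date"] + " " + out["Time"],
        format="%m/%d/%Y %H:%M:%S",  # assumption: month/day/year order
        errors="coerce",             # non-matching rows become NaT
    )
    return out


sample_ts = add_timestamp(sample)
print(sample_ts)
```

Checking `sample_ts["Timestamp"].isna().sum()` on the full dataset would show how many rows deviate from the assumed format.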

In [3]:
import pandas as pd
import numpy as np

df_earthquakes = pd.read_csv(path + '/database.csv')


# Checking content
print("\nHeader :\n")
print(df_earthquakes.head())

# Column information (info() prints directly and returns None)
print("\nInformation :\n")
df_earthquakes.info()


# Missing values
print("\nMissing values :\n")
print(df_earthquakes.isnull().sum())

# Downsample to 10,000 rows when the dataset is larger (fixed seed for reproducibility)
if len(df_earthquakes) > 10000:
    df_earthquakes = df_earthquakes.sample(n=10000, random_state=42)
    print(f"\nDataset reduced to {len(df_earthquakes)} rows")


Header :

         Date      Time  Latitude  Longitude        Type  Depth  Depth Error  \
0  01/02/1965  13:44:18    19.246    145.616  Earthquake  131.6          NaN   
1  01/04/1965  11:29:49     1.863    127.352  Earthquake   80.0          NaN   
2  01/05/1965  18:05:58   -20.579   -173.972  Earthquake   20.0          NaN   
3  01/08/1965  18:49:43   -59.076    -23.557  Earthquake   15.0          NaN   
4  01/09/1965  13:32:50    11.938    126.427  Earthquake   15.0          NaN   

   Depth Seismic Stations  Magnitude Magnitude Type  ...  \
0                     NaN        6.0             MW  ...   
1                     NaN        5.8             MW  ...   
2                     NaN        6.2             MW  ...   
3                     NaN        5.8             MW  ...   
4                     NaN        5.8             MW  ...   

   Magnitude Seismic Stations  Azimuthal Gap  Horizontal Distance  \
0                         NaN            NaN                  NaN   
1        

# Step 2: Web-scrape a complementary dataset


In [4]:
from bs4 import BeautifulSoup
import requests

# Scrape the Pleistocene volcano list (HTML table) from the Smithsonian Global Volcanism Program
url = "https://volcano.si.edu/volcanolist_pleistocene.cfm"
headers = {'User-Agent': 'Mozilla/5.0'}

# Fetch the page once; fail early on a timeout or an HTTP error status
response = requests.get(url, headers=headers, timeout=30)
response.raise_for_status()
soup = BeautifulSoup(response.content, 'html.parser')

# Extract volcano names and links
volcanoes = []
table = soup.find('table')

for row in table.find_all('tr')[1:]:
    cols = row.find_all('td')
    if cols:
        link = cols[0].find('a')
        if link:
            volcanoes.append({
                'name': link.text.strip(),
                'url': 'https://volcano.si.edu/' + link['href']
            })

df_volcanoes_scraped = pd.DataFrame(volcanoes)
print(f"\nScraped {len(df_volcanoes_scraped)} volcanoes")
print(df_volcanoes_scraped.head())






Scraped 1427 volcanoes
                     name                                           url
0                     Aak  https://volcano.si.edu/volcano.cfm?vn=300830
1  Acatlan Volcanic Field  https://volcano.si.edu/volcano.cfm?vn=341822
2                Acoculco  https://volcano.si.edu/volcano.cfm?vn=341827
3                Acotango  https://volcano.si.edu/volcano.cfm?vn=355813
4             Acquasparta  https://volcano.si.edu/volcano.cfm?vn=211804


In [8]:
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def get_volcano_data(volcano):
    # Fetch one volcano's page and extract its coordinates, number, and type
    try:
        response = requests.get(volcano['url'], headers=headers, timeout=30)
        soup = BeautifulSoup(response.content, 'html.parser')

        # Get coordinates
        table = soup.find('div', {'class': 'volcano-subinfo-table'})
        items = table.find_all('li', {'class': 'clear'})

        lat = items[0].text.strip()
        lat = float(lat.replace('°N', '').replace('°S', '').strip())
        if '°S' in items[0].text:
            lat = -lat

        lon = items[1].text.strip()
        lon = float(lon.replace('°E', '').replace('°W', '').strip())
        if '°W' in items[1].text:
            lon = -lon

        volcano_num = items[5].text.strip()

        # Get type
        info_table = soup.find('div', {'class': 'volcano-info-table'})
        shaded = info_table.find_all('li', {'class': 'shaded'})
        volcano_type = shaded[2].text.strip() if len(shaded) > 2 else None

        return {
            'name': volcano['name'],
            'latitude': lat,
            'longitude': lon,
            'type': volcano_type,
            'volcano_number': volcano_num,
            'status': 'success'
        }
    except Exception:
        # Any network or parsing error marks this volcano as failed
        return {
            'name': volcano['name'],
            'latitude': None,
            'longitude': None,
            'type': None,
            'volcano_number': None,
            'status': 'failed'
        }

print("Scraping volcano data with parallel processing...\n")

results = []
success = 0
failed = 0

# Use 10 worker threads to fetch the volcano pages in parallel
with ThreadPoolExecutor(max_workers=10) as executor:
    future_to_volcano = {executor.submit(get_volcano_data, v): v for v in volcanoes}

    for future in as_completed(future_to_volcano):
        result = future.result()
        results.append(result)

        if result['status'] == 'success':
            success += 1
            print(f"✓ {result['name']} ({success}/{len(volcanoes)})")
        else:
            failed += 1
            print(f"{result['name']} failed ({failed} total failures)")

# Keep only the successful rows and drop the status column
df = pd.DataFrame(results)
df = df[df['status'] == 'success'].drop('status', axis=1)

# Display results
print(f"\n{'='*60}")
print(f"Successfully scraped: {success}")
print(f"Failed: {failed}")
print(f"{'='*60}\n")

print(df.head(10))

# Save to CSV
df.to_csv('volcano_data.csv', index=False)
print(f"\nSaved {len(df)} volcanoes to volcano_data.csv")


Scraping volcano data with parallel processing...

✓ Adamozhets (1/1427)
✓ Acquasparta (2/1427)
✓ Acatlan Volcanic Field (3/1427)
✓ Acotango (4/1427)
✓ Addington Volcanic Field (5/1427)
✓ Aguas Delgadas (6/1427)
✓ Ahon, Tarso (7/1427)
✓ Aird Hills (8/1427)
✓ Aguajito, El (9/1427)
✓ Aitutaki (10/1427)
✓ Aizu Nunobiki (11/1427)
✓ Akaigawa (12/1427)
✓ Akhtang (13/1427)
✓ Albuquerque Volcanic Field (14/1427)
✓ Aldama Volcanic Field (15/1427)
✓ Akhuryan Valley (16/1427)
✓ Alicudi (17/1427)
✓ Aegina (18/1427)
✓ Alexander Selkirk (19/1427)
✓ Ali Mela (20/1427)
✓ Alngey (21/1427)
✓ Alto (22/1427)
✓ Altar (23/1427)
✓ Alutate, Cerro (24/1427)
✓ Alice Arm (25/1427)
✓ Amasing (26/1427)
✓ Amboy (27/1427)
✓ Ambre (28/1427)
✓ Amorong Group (29/1427)
✓ Amiata (30/1427)
✓ Ampiro (31/1427)
✓ Ananopa, Cerro (32/1427)
✓ Anallajsi, Nevado (33/1427)
✓ Anaun (34/1427)
✓ Adare Peninsula (35/1427)
✓ Aneityum (36/1427)
✓ Anilao Hill (37/1427)
✓ Antiparos (38/1427)
✓ Antipin (39/1427)
✓ Animas (40/1427)
✓ Antipo
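The latitude and longitude handling in `get_volcano_data` repeats the same steps twice: strip the degree sign and hemisphere letter, convert to `float`, and negate for `S` or `W`. As a sketch only, that logic could be factored into one helper. The name `parse_coordinate` is hypothetical, and the input format (a decimal number followed by `°` and one of `N`, `S`, `E`, `W`) is an assumption inferred from the string replacements in the scraping code, not verified against the pages themselves.

```python
def parse_coordinate(text: str) -> float:
    """Convert a string such as '19.246°N' into a signed decimal degree value.

    Hypothetical helper: assumes a decimal number followed by an optional
    degree sign and a single hemisphere letter (N, S, E or W).
    """
    cleaned = text.strip()
    if not cleaned or cleaned[-1].upper() not in ("N", "S", "E", "W"):
        raise ValueError(f"Unrecognised coordinate: {text!r}")

    hemisphere = cleaned[-1].upper()
    # Drop the hemisphere letter, then the degree sign and stray spaces
    magnitude = float(cleaned[:-1].replace("°", "").strip())

    # Southern and western hemispheres are negative by convention
    return -magnitude if hemisphere in ("S", "W") else magnitude


# Illustrative strings in the assumed format
print(parse_coordinate("19.246°N"))
print(parse_coordinate("23.557°W"))
```

Raising `ValueError` on an unexpected string fits the existing `try`/`except` in `get_volcano_data`, which would then report that volcano as failed rather than store a wrong sign.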

# Step 3: Merge
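The notebook does not specify how the two datasets are merged. Since both carry latitude and longitude, one possible approach is to attach the nearest volcano to each earthquake using the great-circle (haversine) distance. The sketch below is an assumption about the merge strategy, not the author's method; `haversine_km`, `nearest_volcano`, and the two placeholder volcano rows are hypothetical and use made-up coordinates purely for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two points in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_volcano(quake_lat, quake_lon, volcano_rows):
    """Return (name, distance_km) of the closest volcano in volcano_rows."""
    best = min(
        volcano_rows,
        key=lambda v: haversine_km(quake_lat, quake_lon, v["latitude"], v["longitude"]),
    )
    distance = haversine_km(quake_lat, quake_lon, best["latitude"], best["longitude"])
    return best["name"], distance


# Placeholder rows with made-up coordinates (not real volcano data)
placeholder_volcanoes = [
    {"name": "volcano_a", "latitude": 0.0, "longitude": 0.0},
    {"name": "volcano_b", "latitude": 10.0, "longitude": 10.0},
]

name, distance = nearest_volcano(1.0, 1.0, placeholder_volcanoes)
print(name, round(distance, 1))
```

This brute-force search compares every earthquake with every volcano. With 10,000 sampled earthquakes and roughly 1,400 volcanoes that is about 14 million distance evaluations, which is workable but slow in pure Python; a vectorised or spatial-index variant may be preferable.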

# Step 4: Explore / visualise