<!-- Centered Title -->
<h1 style="color: #1f4e79; 
           font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif; 
           text-align: center;">
Cryptocurrency Web Scraping Project
</h1>

<!-- Left-aligned bullet points -->
<ul style="text-align: left; 
           color: #333333; 
           font-family: 'Segoe UI', Tahoma, Geneva, Verdana, sans-serif; 
           line-height: 1.6; 
           margin-top: 10px;">
  <li>This notebook collects historical cryptocurrency data (top 10 coins) from CoinMarketCap.</li>
  <li>The goal is to extract, clean, and structure the data before saving it as a CSV for further processing.</li>
</ul>


## Importing Required Libraries

In this block, we import all the necessary Python libraries for our project:

- `requests`: To send HTTP requests and fetch HTML content from websites.
- `BeautifulSoup` from `bs4`: To parse and extract data from HTML pages.
- `pandas`: For data manipulation and storage in DataFrame format.
- `KNNImputer` from `sklearn.impute`: To handle missing values in numerical columns using K-Nearest Neighbors imputation.
- `MinMaxScaler` from `sklearn.preprocessing`: To normalize numerical features before applying imputation.


In [1]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import MinMaxScaler

## 1. Extraction

### 1.1 Initializing Data Lists and Creating DataFrame

Before scraping cryptocurrency data, we initialize empty lists for each feature we want to collect. These lists will store data for:

- `crypto_date_list` : Date of the data snapshot
- `crypto_name_list` : Name of the cryptocurrency
- `crypto_symbol_list` : Symbol of the cryptocurrency
- `crypto_market_cap_list` : Market capitalization
- `crypto_price_list` : Current price
- `crypto_circulating_supply_list` : Circulating supply of the coin
- `crypto_voulume_24hr_list` : Trading volume in the last 24 hours
- `crypto_pct_1hr_list` : Percentage change in 1 hour
- `crypto_pct_24hr_list` : Percentage change in 24 hours
- `crypto_pct_7day_list` : Percentage change in 7 days

We also create an empty pandas DataFrame `df` which will later store all this collected data in tabular format.


In [2]:
# Initializing lists to store cryptocurrency data
crypto_date_list = []
crypto_name_list = []
crypto_symbol_list = []
crypto_market_cap_list = []
crypto_price_list = []
crypto_circulating_supply_list = []
crypto_voulume_24hr_list = []
crypto_pct_1hr_list = []
crypto_pct_24hr_list = []
crypto_pct_7day_list = []

# Creating an empty DataFrame to hold the scraped data
df = pd.DataFrame()

### 1.2 Function: Scrape Date List from CoinMarketCap

To collect historical cryptocurrency data, we first need the list of dates for which data is available. CoinMarketCap provides historical snapshots, typically taken on Sundays. 

- We define an empty list `scrape_date_list` to store these URLs.
- The function `scrape_date()` performs the following:
  1. Sends a GET request to the CoinMarketCap historical page.
  2. Parses the HTML content using BeautifulSoup.
  3. Finds all `<a>` tags with the class `'historical-link cmc-link'`, which correspond to the available dates.
  4. Extracts the `href` attribute from each tag and appends it to `scrape_date_list`.

After running the function, we print the total number of dates (Sundays) available for scraping.


In [3]:
# List to store historical data URLs
scrape_date_list = []

# Function to scrape available historical data dates (Sundays)
def scrape_date():
    url = 'https://coinmarketcap.com/historical/'
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')

    a_tags = soup.find_all('a', class_='historical-link cmc-link')
    for tag in a_tags:
        href = tag.get('href')
        scrape_date_list.append(href)

# Execute the function to collect all available historical data URLs
scrape_date()
print('There are ' + str(len(scrape_date_list)) + ' dates(Sundays) available for scraping from CoinMarketCap historical data.')

There are 659 dates(Sundays) available for scraping from CoinMarketCap historical data.


### 1.3 Scrape Cryptocurrency Data for a Specific Date

The following function `scrape_data(date)` is designed to extract the top 10 cryptocurrencies by market capitalization for a given historical date from CoinMarketCap.  

**Function Overview:**

- **Input:** `date` (string) â€” the historical date URL path from CoinMarketCap.
- **Process:** 
  - Send a request to the specific historical page.
  - Parse the HTML using BeautifulSoup.
  - Iterate through the table rows to extract details of the top 10 cryptocurrencies.
  - Handle missing data gracefully using `try-except` blocks for each field.
- **Output:** Data is appended to the respective lists initialized earlier for further processing and creation of a DataFrame.

The extracted fields include:  
- Date  
- Cryptocurrency Name  
- Symbol  
- Market Capitalization  
- Price  
- Circulating Supply  
- 24-hour Trading Volume  
- Percentage change over 1 hour, 24 hours, and 7 days


In [4]:
def scrape_data(date):

    """
    Scrapes the top 10 cryptocurrencies for a specific historical date from CoinMarketCap.
    
    Parameters:
    date (str): The historical date URL path (e.g., '/historical/20251201/')
    
    The function updates the global lists initialized earlier.
    """
    
    url = 'https://coinmarketcap.com' + date
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')

    tr = soup.find_all('tr', attrs={'class': 'cmc-table-row'})
    count = 0
    for row in tr:
        if count == 10:
            break
        count += 1

        try:
            crypto_date = date
        except AttributeError:
            crypto_date = None

        try:
            name_column = row.find('td', attrs={'class': 'cmc-table__cell cmc-table__cell--sticky cmc-table__cell--sortable cmc-table__cell--left cmc-table__cell--sort-by__name'})
            crypto_name = name_column.find('a', attrs={'class': 'cmc-table__column-name--name cmc-link'}).text.strip()
        except AttributeError:
            crypto_name = None

        try:
            crypto_symbol = row.find('td', attrs={'class': 'cmc-table__cell cmc-table__cell--sortable cmc-table__cell--left cmc-table__cell--sort-by__symbol'}).text.strip()
        except AttributeError:
            crypto_symbol = None

        try:
            crypto_market_cap = row.find('td', attrs={'class': 'cmc-table__cell cmc-table__cell--sortable cmc-table__cell--right cmc-table__cell--sort-by__market-cap'}).text.strip()
        except AttributeError:
            crypto_market_cap = None

        try:
            crypto_price = row.find('td', attrs={'class': 'cmc-table__cell cmc-table__cell--sortable cmc-table__cell--right cmc-table__cell--sort-by__price'}).text.strip()
        except AttributeError:
            crypto_price = None

        try:
            crypto_circulating_supply = row.find('td', attrs={'class': 'cmc-table__cell cmc-table__cell--sortable cmc-table__cell--right cmc-table__cell--sort-by__circulating-supply'}).text.strip().split(' ')[0]
        except AttributeError:
            crypto_circulating_supply = None

        try:
            crypto_voulume_24hr_td = row.find('td', attrs={'class': 'cmc-table__cell cmc-table__cell--sortable cmc-table__cell--right cmc-table__cell--sort-by__volume-24-h'})
            crypto_voulume_24hr = crypto_voulume_24hr_td.find('a', attrs={'class': 'cmc-link'}).text.strip()
        except AttributeError:
            crypto_voulume_24hr = None

        try:
            crypto_pct_1hr = row.find('td', attrs={'class': 'cmc-table__cell cmc-table__cell--sortable cmc-table__cell--right cmc-table__cell--sort-by__percent-change-1-h'}).text.strip()
        except AttributeError:
            crypto_pct_1hr = None

        try:
            crypto_pct_24hr = row.find('td', attrs={'class': 'cmc-table__cell cmc-table__cell--sortable cmc-table__cell--right cmc-table__cell--sort-by__percent-change-24-h'}).text.strip()
        except AttributeError:
            crypto_pct_24hr = None

        try:
            crypto_pct_7day = row.find('td', attrs={'class': 'cmc-table__cell cmc-table__cell--sortable cmc-table__cell--right cmc-table__cell--sort-by__percent-change-7-d'}).text.strip()
        except AttributeError:
            crypto_pct_7day = None
            
        # Append extracted data to global lists
        crypto_date_list.append(crypto_date)
        crypto_name_list.append(crypto_name)
        crypto_symbol_list.append(crypto_symbol)
        crypto_market_cap_list.append(crypto_market_cap)
        crypto_price_list.append(crypto_price)
        crypto_circulating_supply_list.append(crypto_circulating_supply)
        crypto_voulume_24hr_list.append(crypto_voulume_24hr)
        crypto_pct_1hr_list.append(crypto_pct_1hr)
        crypto_pct_24hr_list.append(crypto_pct_24hr)
        crypto_pct_7day_list.append(crypto_pct_7day)

### 1.4 Scraping Cryptocurrency Data for Multiple Dates

Once we have collected all available historical date URLs from CoinMarketCap, we can iterate over these dates to scrape the cryptocurrency data for each one.  

**Steps:**

1. Convert the start and end date from the scraped URLs to a readable format (`YYYY-MM-DD`) for display.
2. Print the total number of available dates to scrape.
3. Loop through each date URL in `scrape_date_list`, calling the `scrape_data()` function to extract top 10 cryptocurrencies for each date.
4. Display progress in the console to track scraping status.

In [5]:
from datetime import datetime
# Define the date format for conversion
date_format = "%Y%m%d"

# Convert start and end dates from the scrape_date_list URLs
start_date = datetime.strptime(scrape_date_list[0].split('/')[-2], date_format).strftime('%Y-%m-%d')
end_date = datetime.strptime(scrape_date_list[-1].split('/')[-2], date_format).strftime('%Y-%m-%d')
print('There are ' + str(len(scrape_date_list)) + ' dates(Sundays) between ' + start_date + ' and ' + end_date)

# Loop through all historical dates and scrape cryptocurrency data
for i in range(len(scrape_date_list)):
    scrape_data(scrape_date_list[i])
    print("completed: " + str(i+1) + " out of " + str(len(scrape_date_list)))

There are 659 dates(Sundays) between 2013-04-28 and 2025-12-07
completed: 1 out of 659
completed: 2 out of 659
completed: 3 out of 659
completed: 4 out of 659
completed: 5 out of 659
completed: 6 out of 659
completed: 7 out of 659
completed: 8 out of 659
completed: 9 out of 659
completed: 10 out of 659
completed: 11 out of 659
completed: 12 out of 659
completed: 13 out of 659
completed: 14 out of 659
completed: 15 out of 659
completed: 16 out of 659
completed: 17 out of 659
completed: 18 out of 659
completed: 19 out of 659
completed: 20 out of 659
completed: 21 out of 659
completed: 22 out of 659
completed: 23 out of 659
completed: 24 out of 659
completed: 25 out of 659
completed: 26 out of 659
completed: 27 out of 659
completed: 28 out of 659
completed: 29 out of 659
completed: 30 out of 659
completed: 31 out of 659
completed: 32 out of 659
completed: 33 out of 659
completed: 34 out of 659
completed: 35 out of 659
completed: 36 out of 659
completed: 37 out of 659
completed: 38 out of 

## 1.5. Creating the Cryptocurrency DataFrame

After scraping the data for multiple dates, we now populate our main DataFrame with the collected lists.  
Each list corresponds to a specific attribute of the cryptocurrencies, such as name, symbol, market cap, price, and percentage changes.  

This step consolidates all the scraped data into a structured format suitable for further processing and analysis.

In [6]:
# Populate DataFrame with scraped data
df['Date'] = crypto_date_list
df['Name'] = crypto_name_list
df['Symbol'] = crypto_symbol_list
df['Market Cap'] = crypto_market_cap_list
df['Price'] = crypto_price_list
df['Circulating Supply'] = crypto_circulating_supply_list
df['Volume (24hr)'] = crypto_voulume_24hr_list
df['% 1h'] = crypto_pct_1hr_list
df['% 24h'] = crypto_pct_24hr_list
df['% 7d'] = crypto_pct_7day_list

df

Unnamed: 0,Date,Name,Symbol,Market Cap,Price,Circulating Supply,Volume (24hr),% 1h,% 24h,% 7d
0,/historical/20130428/,Bitcoin,BTC,"$1,488,566,971.96",$134.21,11091325,,0.64%,--,--
1,/historical/20130428/,Litecoin,LTC,"$74,637,021.57",$4.3484,17164230,,0.80%,--,--
2,/historical/20130428/,Peercoin,PPC,"$7,250,186.65",$0.3865,18757362,,-0.93%,--,--
3,/historical/20130428/,Namecoin,NMC,"$5,995,997.19",$1.1072,5415300,,-0.05%,--,--
4,/historical/20130428/,Terracoin,TRC,"$1,503,099.40",$0.6469,2323570,,0.61%,--,--
...,...,...,...,...,...,...,...,...,...,...
6582,/historical/20251207/,USDC,USDC,"$78,198,364,307.09",$1.0001,78192824319,"$7,817,626,806.69",-0.01%,<0.01%,0.03%
6583,/historical/20251207/,Solana,SOL,"$74,055,834,015.84",$132.09,560631621,"$3,824,892,146.92",0.63%,-0.19%,-1.10%
6584,/historical/20251207/,TRON,TRX,"$27,163,787,850.16",$0.2869,94681527841,"$523,906,170.03",0.21%,-0.18%,1.85%
6585,/historical/20251207/,Dogecoin,DOGE,"$22,395,155,616.57",$0.1386,161608712799,"$1,065,448,744.32",0.51%,-0.83%,-5.52%


## 1.6 DataFrame Information Before Transformation

Before performing any preprocessing or handling missing values, it is important to inspect the structure of the scraped dataset.  
The `df.info()` method provides details such as:

- Number of rows
- Column names
- Non-null counts
- Data types  
- Memory usage

This helps identify columns containing missing values, ensuring that appropriate preprocessing steps can be applied later.

In [7]:
# Display DataFrame information before transformation
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6587 entries, 0 to 6586
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   Date                6587 non-null   object
 1   Name                6587 non-null   object
 2   Symbol              6587 non-null   object
 3   Market Cap          6587 non-null   object
 4   Price               6587 non-null   object
 5   Circulating Supply  6587 non-null   object
 6   Volume (24hr)       6240 non-null   object
 7   % 1h                6587 non-null   object
 8   % 24h               6587 non-null   object
 9   % 7d                6587 non-null   object
dtypes: object(10)
memory usage: 514.7+ KB


## 2. Transformation  
### 2.1. Data Cleaning and Formatting

After collecting the raw cryptocurrency data, several preprocessing steps are required to convert the scraped text values into clean numerical formats suitable for analysis and machine learning.  
The following transformations are applied:

- **Date parsing:** Convert scraped URL-style date strings into Python datetime format.  
- **String cleaning:** Remove symbols such as `$`, `,`, `%`, `--`, `<`, `>` from numeric fields.  
- **Type conversion:** Convert cleaned values into numeric data types.  
- **Special handling:** Replace values like `<0.01` and `--` with meaningful numerical substitutes.  
- **Formatting:** Apply consistent display formatting for readability.

In [8]:
# Extract the date component from the 'Date' column and convert it to a datetime data type
df['Date'] = pd.to_datetime(df['Date'].str.split('/').str[-2], format='%Y%m%d')

# Replace the dollar signs ($) and commas (,) from the 'Market Cap' and 'Price' columns
df['Market Cap'] = df['Market Cap'].str.replace('[$,]', '', regex=True)
df['Price'] = df['Price'].str.replace('[$,]', '', regex=True)

# Replace the commas (,) from the 'Circulating Supply' column
df['Circulating Supply'] = df['Circulating Supply'].str.replace(',', '')

# Replace the dollar signs ($) and commas (,) from the 'Volume (24hr)' columns
df['Volume (24hr)'] = df['Volume (24hr)'].str.replace('[$,]', '', regex=True)

# Replace the unchange sign (--), the smaller sign (<), the larger sign (>) and percentage sign (%) from the '% 1h', '% 24h', and '% 7d' columns
df['% 1h'] = df['% 1h'].str.replace('--', '0').str.lstrip('>').str.lstrip('<').str.rstrip('%')
df['% 24h'] = df['% 24h'].str.replace('--', '0').str.lstrip('>').str.lstrip('<').str.rstrip('%')
df['% 7d'] = df['% 7d'].str.replace('--', '0').str.lstrip('>').str.lstrip('<').str.rstrip('%')

# Convert the numeric columns to appropriate data types, replacing invalid values with NaN
numeric_cols = ['Market Cap', 'Price', 'Circulating Supply', 'Volume (24hr)', '% 1h', '% 24h', '% 7d']
df[numeric_cols] = df[numeric_cols].apply(lambda x: pd.to_numeric(x))

# Handle specific case of "<0.01" by replacing it with a small non-zero value, e.g., 0.005
df.loc[df['% 1h'] < 0, '% 1h'] = 0.005

# Set the display format for float and integer values
pd.options.display.float_format = '{:.2f}'.format

# Display the updated DataFrame
df

Unnamed: 0,Date,Name,Symbol,Market Cap,Price,Circulating Supply,Volume (24hr),% 1h,% 24h,% 7d
0,2013-04-28,Bitcoin,BTC,1488566971.96,134.21,11091325,,0.64,0.00,0.00
1,2013-04-28,Litecoin,LTC,74637021.57,4.35,17164230,,0.80,0.00,0.00
2,2013-04-28,Peercoin,PPC,7250186.65,0.39,18757362,,0.01,0.00,0.00
3,2013-04-28,Namecoin,NMC,5995997.19,1.11,5415300,,0.01,0.00,0.00
4,2013-04-28,Terracoin,TRC,1503099.40,0.65,2323570,,0.61,0.00,0.00
...,...,...,...,...,...,...,...,...,...,...
6582,2025-12-07,USDC,USDC,78198364307.09,1.00,78192824319,7817626806.69,0.01,0.01,0.03
6583,2025-12-07,Solana,SOL,74055834015.84,132.09,560631621,3824892146.92,0.63,-0.19,-1.10
6584,2025-12-07,TRON,TRX,27163787850.16,0.29,94681527841,523906170.03,0.21,-0.18,1.85
6585,2025-12-07,Dogecoin,DOGE,22395155616.57,0.14,161608712799,1065448744.32,0.51,-0.83,-5.52


## 2.2. Missing Value Imputation and Reverse Normalization

The `Volume (24hr)` column contains missing values that need to be handled before analysis.  
Instead of removing rows or filling with simple statistics like mean or median, we can use **K-Nearest Neighbors (KNN) imputation**.  

**Why KNN?**  
- It predicts missing values based on the similarity of other numerical features.  
- More accurate than mean/median imputation.  
- Slightly more computationally intensive due to algorithmic calculations.  

In [9]:
# Check for missing data
df.isnull().sum()

Date                    0
Name                    0
Symbol                  0
Market Cap              0
Price                   0
Circulating Supply      0
Volume (24hr)         347
% 1h                    0
% 24h                   0
% 7d                    0
dtype: int64

In [10]:
# Select numerical columns for imputation
numeric_cols = ['Market Cap', 'Price', 'Circulating Supply', '% 1h', '% 24h', '% 7d', 'Volume (24hr)']

# Normalization
scaler = MinMaxScaler()
df_normalized = df.copy()
df_normalized[numeric_cols] = scaler.fit_transform(df[numeric_cols])

# KNN Imputation
imputer = KNNImputer(n_neighbors=3)
df_imputed = df_normalized.copy()
df_imputed[numeric_cols] = imputer.fit_transform(df_normalized[numeric_cols])

# Reverse normalization
df_imputed[numeric_cols] = scaler.inverse_transform(df_imputed[numeric_cols])

df = df_imputed.copy()
df

Unnamed: 0,Date,Name,Symbol,Market Cap,Price,Circulating Supply,Volume (24hr),% 1h,% 24h,% 7d
0,2013-04-28,Bitcoin,BTC,1488566971.96,134.21,11091325.00,207329643.26,0.64,0.00,0.00
1,2013-04-28,Litecoin,LTC,74637021.57,4.35,17164230.00,208620175.73,0.80,0.00,0.00
2,2013-04-28,Peercoin,PPC,7250186.65,0.39,18757362.00,5939582.36,0.01,0.00,0.00
3,2013-04-28,Namecoin,NMC,5995997.19,1.11,5415300.00,5939582.36,0.01,0.00,0.00
4,2013-04-28,Terracoin,TRC,1503099.40,0.65,2323570.00,94459172.69,0.61,0.00,0.00
...,...,...,...,...,...,...,...,...,...,...
6582,2025-12-07,USDC,USDC,78198364307.09,1.00,78192824319.00,7817626806.69,0.01,0.01,0.03
6583,2025-12-07,Solana,SOL,74055834015.84,132.09,560631621.00,3824892146.92,0.63,-0.19,-1.10
6584,2025-12-07,TRON,TRX,27163787850.16,0.29,94681527841.00,523906170.03,0.21,-0.18,1.85
6585,2025-12-07,Dogecoin,DOGE,22395155616.57,0.14,161608712799.00,1065448744.32,0.51,-0.83,-5.52


## 2.3. DataFrame Information After Transformation

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6587 entries, 0 to 6586
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   Date                6587 non-null   datetime64[ns]
 1   Name                6587 non-null   object        
 2   Symbol              6587 non-null   object        
 3   Market Cap          6587 non-null   float64       
 4   Price               6587 non-null   float64       
 5   Circulating Supply  6587 non-null   float64       
 6   Volume (24hr)       6587 non-null   float64       
 7   % 1h                6587 non-null   float64       
 8   % 24h               6587 non-null   float64       
 9   % 7d                6587 non-null   float64       
dtypes: datetime64[ns](1), float64(7), object(2)
memory usage: 514.7+ KB


After performing data cleaning, formatting, and KNN imputation, the dataset is now fully transformed.  
All columns are in their appropriate data types and missing values have been handled:

## 3. Save Processed Data to CSV

In [12]:
# Save the DataFrame as a CSV file
df.to_csv('historical_crypto_sundays.csv', index=False)

## ðŸ“Œ 01_web_scraping.ipynb --> Section Summary

In this notebook, we successfully completed the **web scraping, cleaning, transformation, and preparation** of historical cryptocurrency data (top 10 coins weekly) from CoinMarketCap spanning **2013 to 2025 (7th December)**.  

Key accomplishments:

1. **Data Collection**
   - Scraped weekly cryptocurrency data including: `Date`, `Name`, `Symbol`, `Market Cap`, `Price`, `Circulating Supply`, `Volume (24hr)`, and percentage changes (`% 1h`, `% 24h`, `% 7d`).
   - Ensured extraction of top-performing coins for each week and consistent structure for analysis.

2. **Data Cleaning & Formatting**
   - Converted string values to appropriate numeric formats.
   - Removed symbols like `$`, `,`, `%`, `<`, `>`, and handled placeholders like `--`.
   - Parsed dates from URL-style strings to `datetime` objects.
   - Applied consistent float formatting for better readability.

3. **Missing Value Imputation**
   - Identified missing values, primarily in the `Volume (24hr)` column.
   - Applied **K-Nearest Neighbors (KNN) imputation** after normalizing numerical columns.
   - Successfully filled missing values and reversed normalization to retain original scale.

4. **Data Type Transformation**
   - All numeric columns (`Market Cap`, `Price`, `Circulating Supply`, `Volume (24hr)`, `% 1h`, `% 24h`, `% 7d`) were converted to `float64`.
   - `Date` column is in `datetime64[ns]` format.
   - Data is now fully structured and ready for downstream analysis.

5. **Data Export**
   - Saved the fully cleaned and transformed dataset as `historical_crypto_sundays.csv` for further use in **Power BI, SQL databases, or other analytical tools**.

**Outcome:**  
The dataset is now complete, accurate, and formatted for **visual analytics, trend analysis, and interactive dashboards**. This forms a solid foundation for creating professional cryptocurrency market insights and investment dashboards.

---
