# Web Scraping and Introductory Data Analysis

Welcome to Homework 0, which covers web scraping and an introductory data analysis. This hands-on exercise will help you become familiar with extracting data from websites and conducting basic statistical analysis.

## Objectives

By the end of this homework, you will be able to:

1. Set up a Python environment with the necessary libraries for web scraping and data analysis.
2. Write a web scraping script using Beautiful Soup and Selenium to collect data from a website.
3. Sample from the collected dataset and compare the statistics of the sample and the population.
   
## Tasks

1. **Environment Setup**: Install the required libraries such as Beautiful Soup, Selenium, pandas, numpy, matplotlib, and seaborn.

2. **Web Scraping**: Write a script to scrape transaction data from [Etherscan.io](https://etherscan.io/txs). Use Selenium to interact with the website and Beautiful Soup to parse the HTML content.

3. **Data Sampling**: Once the data is collected, create a sample from the dataset. Compare the sample statistics (mean and standard deviation) with the population statistics.


## Deliverables

1. A Jupyter notebook with all the code and explanations.
2. A detailed report on the findings, including the comparison of sample and population statistics.

Note: You can include the report in your notebook.

## Getting Started

Begin by setting up your Python environment and installing the necessary libraries. Then, proceed with the web scraping task, ensuring that you handle any potential issues such as rate limiting. Once you have the data, move on to the data sampling and statistical analysis tasks. 

Remember to document your process and findings in the Jupyter notebook, and to include visualizations where appropriate to illustrate your results.

Good luck, and happy scraping!

## Data Collection (Etherscan)

In this section, we will use web scraping to gather transaction data from the Ethereum blockchain using the Etherscan block explorer. Our objective is to collect transactions from the **last 10 blocks** on Ethereum.

To accomplish this task, we will employ web scraping techniques to extract the transaction data from the Etherscan website. The URL we will be targeting for our data collection is:

[https://etherscan.io/txs](https://etherscan.io/txs)

### Steps

1. **Navigate to the URL**: Use Selenium to open the Etherscan transactions page in a browser.

2. **Locate the Transaction Data**: Identify the HTML elements that contain the transaction data for the specified block range.

3. **Extract the Data**: Write a script to extract the transaction details (e.g., Hash, Method, Block).

4. **Handle Pagination**: If the transactions span multiple pages, implement pagination handling to navigate through the pages and collect all relevant transaction data.

5. **Store the Data**: Save the extracted transaction data into a structured format, such as a CSV file or a pandas DataFrame, for further analysis.
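
The pagination step can be reasoned about separately from the browser. The following is a minimal sketch, assuming a hypothetical `fetch_page(block, page)` callable that returns the rows parsed from one page and an empty list once the pages run out. `fake_fetch` and its rows are made up for illustration and stand in for the real Selenium call.

```python
from typing import Callable, Iterator, List

Row = List[str]


def iter_block_rows(
    block: str,
    fetch_page: Callable[[str, int], List[Row]],
    max_pages: int = 100,
) -> Iterator[Row]:
    """Yield rows for one block, page by page, until a page comes back empty."""
    for page in range(1, max_pages + 1):
        rows = fetch_page(block, page)
        if not rows:  # no rows left: we ran past the last page
            return
        yield from rows


# Hypothetical stand-in for the real scraper: two pages of made-up rows.
FAKE_PAGES = {
    ("100", 1): [["0xaaa", "Transfer"], ["0xbbb", "Swap"]],
    ("100", 2): [["0xccc", "Transfer"]],
}


def fake_fetch(block: str, page: int) -> List[Row]:
    return FAKE_PAGES.get((block, page), [])


rows = list(iter_block_rows("100", fake_fetch))
print(len(rows))  # 3 rows collected across two pages
```

The `max_pages` cap is a safety net so that a page which never comes back empty cannot cause an endless loop.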

### Considerations

- **Rate Limiting**: Be mindful of the website's rate limits to avoid being blocked. Implement delays between requests if necessary.
- **Dynamic Content**: The Etherscan website may load content dynamically. Ensure that Selenium waits for the necessary elements to load before attempting to scrape the data.
- **Data Cleaning**: After extraction, clean the data to remove any inconsistencies or errors that may have occurred during the scraping process.
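
One common way to respect rate limits is to retry a failed request with exponentially growing pauses. The sketch below is an illustration under stated assumptions, not a required part of the solution: `call_with_backoff` and `flaky` are hypothetical names, and `flaky` simulates a page load that fails twice before succeeding. The `sleep` argument is injectable so the demonstration records the waits instead of actually pausing.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_backoff(
    action: Callable[[], T],
    retries: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `action`, retrying with exponentially growing pauses on failure."""
    for attempt in range(retries):
        try:
            return action()
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries: surface the last error
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    raise ValueError("retries must be at least 1")


# Demonstration with a made-up action that fails twice, then succeeds.
attempts = {"count": 0}
waits = []


def flaky() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("simulated rate limit")
    return "page html"


# Record the requested waits in a list instead of really sleeping.
result = call_with_backoff(flaky, sleep=waits.append)
print(result, waits)
```

In a real scraper, `action` would wrap the Selenium page load, and the default `time.sleep` would be left in place.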

### Resources

- [Beautiful Soup Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [Selenium Documentation](https://selenium-python.readthedocs.io/)
- [Pandas Documentation](https://pandas.pydata.org/docs/)
- [Ethereum](https://ethereum.org/en/)

In [None]:
%pip install numpy pandas matplotlib seaborn scipy selenium webdriver_manager beautifulsoup4

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import norm

In [None]:
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

chrome_service = Service(ChromeDriverManager().install())

In [None]:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from bs4 import BeautifulSoup

In [None]:
url = 'https://etherscan.io/blocks'
chrome_options = Options()

dr = webdriver.Chrome(options=chrome_options, service=chrome_service)
try:
    # Load the page first, then wait until the document has finished loading.
    dr.get(url)
    WebDriverWait(dr, 10).until(lambda d: d.execute_script('return document.readyState') == 'complete')
    page = dr.page_source
finally:
    dr.quit()  # always release the browser, even if loading fails

soup = BeautifulSoup(page, 'html.parser')
table = soup.find('table')

tbody = table.find('tbody')
trs = tbody.find_all('tr')

# The first cell of each row holds the block number; keep the first 10 rows.
block_numbers = []
for tr in trs[:10]:
    td = tr.find('td')
    block_number = td.text.strip()
    block_numbers.append(block_number)

block_numbers

In [None]:
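from time import sleep

df = None
for block_number in block_numbers:
    p = 1
    while True:
        url = f'https://etherscan.io/txs?block={block_number}&p={p}'
        sleep(0.5)  # pause between requests to reduce the risk of being rate limited
        chrome_options = Options()

        dr = webdriver.Chrome(options=chrome_options, service=chrome_service)
        try:
            # Load the page first, then wait until the document has finished loading.
            dr.get(url)
            WebDriverWait(dr, 10).until(lambda d: d.execute_script('return document.readyState') == 'complete')
            page = dr.page_source
        finally:
            dr.quit()  # always release the browser, even if loading fails

        soup = BeautifulSoup(page, 'html.parser')
        table = soup.find('table')

        # Build the empty DataFrame once, using the table header as column names.
        if df is None:
            thead = table.find('thead')
            thead_titles = thead.find_all('th')
            titles = [table_title.text.strip() for table_title in thead_titles]
            df = pd.DataFrame(columns=titles)

        tbody = table.find('tbody')
        trs = tbody.find_all('tr')
        # A page with a single row is treated as the end of this block's
        # transactions (presumably the placeholder row of an empty page).
        if len(trs) == 1:
            break

        for tr in trs:
            # Skip cells that are hidden with an inline "display:none" style.
            tds = [td for td in tr.find_all('td')
                   if not (td.get('style') and 'display:none' in td.get('style'))]
            data = [td.text.strip() for td in tds]

            df.loc[len(df)] = data  # append the row at the end of the DataFrame
        p += 1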

In [None]:
df.info()

In [None]:
df.to_csv('data.csv', index=False)  # omit the index so it is not read back as an extra column

## Data Analysis

Now that we have collected the transaction data from Etherscan, the next step is to conduct an initial analysis. This task will involve the following steps:

1. **Load the Data**: Import the collected transaction data into a pandas DataFrame.

2. **Data Cleaning**: Clean the data by converting data types, removing any irrelevant information, and handling **duplicate** values.

3. **Statistical Analysis**: Calculate the mean and standard deviation of the population. Evaluate these statistics to understand the distribution of transaction values. The analysis and plotting will be on **Txn Fee** and **Value**.

4. **Visualization**: This phase involves the creation of visual representations to aid in the analysis of transaction values. The visualizations include:
    - A histogram for each data column, which provides a visual representation of the data distribution. The selection of bin size is crucial and should be based on the data's characteristics to ensure accurate representation. Provide an explanation on the bin size selection!
    - A normal distribution plot fitted alongside the histogram to compare the empirical distribution of the data with the theoretical normal distribution.
    - A box plot and a violin plot to identify outliers and provide a comprehensive view of the data's distribution.
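
One way to justify a bin size is a reference rule such as Freedman-Diaconis, which sets the bin width to $2 \cdot \mathrm{IQR} / n^{1/3}$ and is fairly robust to outliers. The sketch below is only an illustration: the data is synthetic log-normal noise standing in for a skewed column, and the choice of this particular rule is an assumption, not a requirement of the assignment.

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up right-skewed data standing in for a fee or value column.
data = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

# Freedman-Diaconis bin width: 2 * IQR / n^(1/3).
q1, q3 = np.percentile(data, [25, 75])
fd_width = 2 * (q3 - q1) / len(data) ** (1 / 3)

# NumPy applies the same rule when asked for bins="fd".
edges = np.histogram_bin_edges(data, bins="fd")
n_bins = len(edges) - 1

print(f"FD bin width ~ {fd_width:.3f}, giving {n_bins} bins")
```

The same rule name can be passed to seaborn as `sns.histplot(..., bins='fd')`. For heavily skewed columns, plotting on a logarithmic axis may also be worth considering.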

### Deliverables

The project should produce the following:

- A refined pandas DataFrame containing the transaction data, which has undergone thorough cleaning and is ready for analysis.
- A simple statistical analysis evaluating the population statistics, offering insights into the distribution of transaction values and fees.
- A set of visualizations showcasing the distribution of transaction values for the population. These visualizations include histograms, normal distribution plots, box plots, and violin plots, each serving a specific purpose in the analysis.

### Getting Started

The project starts by importing the transaction data into a pandas DataFrame, which sets the stage for data manipulation and analysis. The data is then cleaned to ensure its quality and reliability, after which the population statistics are calculated. Finally, a series of visualizations is created to analyze the distribution of transaction values and fees.
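
The cleaning step can be tried on a small example before running it on the scraped file. This is a minimal sketch, assuming the `Value` column holds strings such as `0.5 ETH` and the `Txn Fee` column holds numeric strings. The rows below are made up for illustration.

```python
import pandas as pd

# Made-up rows in the shape the scraped table is assumed to have.
raw = pd.DataFrame({
    "Value": ["0.5 ETH", "1,200 ETH", "0 ETH", "0.5 ETH", "12 wei"],
    "Txn Fee": ["0.0021", "0.0003", "bad", "0.0021", "0.0001"],
})

# Keep ETH-denominated rows; .copy() avoids editing a view of `raw`.
clean = raw[raw["Value"].str.endswith("ETH")].copy()
clean["Value"] = (
    clean["Value"]
    .str.replace(" ETH", "", regex=False)
    .str.replace(",", "", regex=False)  # drop thousands separators
    .astype(float)
)
# Unparseable fees such as "bad" become NaN.
clean["Txn Fee"] = pd.to_numeric(clean["Txn Fee"], errors="coerce")
# Remove exact duplicate rows, then rows with missing values.
clean = clean.drop_duplicates().dropna(subset=["Value", "Txn Fee"])

print(clean)
```

Of the five made-up rows, two survive: the `wei` row is filtered out, the duplicate is dropped, and the row with the unparseable fee is removed.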

In [None]:
df = pd.read_csv('data.csv')

In [None]:
df.head()

In [None]:
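# Keep ETH-denominated rows; .copy() avoids SettingWithCopyWarning on the slice.
filtered_df = df[df['Value'].str.endswith('ETH', na=False)].copy()
filtered_df['Txn Fee'] = pd.to_numeric(filtered_df['Txn Fee'], errors='coerce')
value_text = filtered_df['Value'].str.replace(' ETH', '', regex=False).str.replace(',', '', regex=False)
filtered_df['Value'] = value_text.astype(float)
# Drop unnamed columns (e.g. a saved index) so duplicate rows compare equal.
filtered_df = filtered_df.loc[:, ~filtered_df.columns.str.match('Unnamed: ')]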

In [None]:
filtered_df.head()

In [None]:
# Reassign instead of using inplace=True, which is unreliable on a filtered slice.
filtered_df = filtered_df.drop_duplicates()
filtered_df = filtered_df.dropna(subset=['Txn Fee', 'Value'])
filtered_df.head()

In [None]:
# The scraped data is treated as the population, so use ddof=0
# (pandas defaults to the sample standard deviation, ddof=1).
txn_fee_mean = filtered_df['Txn Fee'].mean()
txn_fee_std = filtered_df['Txn Fee'].std(ddof=0)

value_mean = filtered_df['Value'].mean()
value_std = filtered_df['Value'].std(ddof=0)

print(f"Txn Fee Mean: {txn_fee_mean}, Standard Deviation: {txn_fee_std}")
print(f"Value Mean: {value_mean}, Standard Deviation: {value_std}")


In [None]:
def plot_histogram_and_normal_dist(column, bin_size='auto', title=''):
    sns.histplot(filtered_df[column], bins=bin_size, kde=False, stat='density', label='Histogram')
    xmin, xmax = plt.xlim()
    x = np.linspace(xmin, xmax, 100)
    p = norm.pdf(x, filtered_df[column].mean(), filtered_df[column].std())
    plt.plot(x, p, 'k', linewidth=2, label='Normal dist')
    title = title or f'Distribution of {column}'
    plt.title(title)
    plt.legend()

plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plot_histogram_and_normal_dist('Txn Fee', title='Txn Fee Distribution')

plt.subplot(1, 2, 2)
plot_histogram_and_normal_dist('Value', title='Value Distribution')

plt.show()

In [None]:
plt.figure(figsize=(14, 7))
plt.subplot(1, 2, 1)
sns.boxplot(data=filtered_df, y='Txn Fee')
plt.title('Box Plot of Txn Fee')

plt.subplot(1, 2, 2)
sns.violinplot(data=filtered_df, y='Value')
plt.title('Violin Plot of Value')

plt.show()


## Data Sampling and Analysis

In this section, we will delve into the process of data sampling and perform an initial analysis on the transaction data we have collected. Our objective is to understand the distribution of transaction values by sampling the data and comparing the sample statistics with the population statistics.

### Steps

1. **Load the Data**: Import the collected transaction data into a pandas DataFrame.

2. **Data Cleaning**: Clean the data by handling missing values, converting data types, and removing any irrelevant information.

3. **Simple Random Sampling (SRS)**: Create a sample from the dataset using a simple random sampling method. This involves randomly selecting a subset of the data without regard to any specific characteristics of the data.

4. **Stratified Sampling**: Create another sample from the dataset using a stratified sampling method. This involves dividing the data into strata based on a specific characteristic (e.g., transaction value) and then randomly selecting samples from each stratum. Explain what you have stratified the data by and why you chose this column.

5. **Statistical Analysis**: Calculate the mean and standard deviation of the samples and the population. Compare these statistics to understand the distribution of transaction values.

6. **Visualization**: Plot the distribution of transaction values and fees for both the samples and the population to visually compare their distributions.
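
The two sampling steps can be sketched on synthetic data. This is a minimal illustration under stated assumptions: the population is made-up log-normal noise, and stratifying by quartiles of `Value` is just one possible choice. Which column to stratify by in the real dataset is left for you to decide and justify.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Made-up population: 1,000 right-skewed "Value" entries.
population = pd.DataFrame({"Value": rng.lognormal(0.0, 1.5, size=1000)})

# Simple random sample: every row has the same chance of selection.
srs = population.sample(n=100, random_state=42)

# Stratified sample: split into quartile strata, then draw 10% from each.
population["stratum"] = pd.qcut(population["Value"], q=4, labels=False)
stratified = population.groupby("stratum").sample(frac=0.1, random_state=42)

# Side-by-side comparison of mean and standard deviation.
summary = pd.DataFrame({
    "population": population["Value"].agg(["mean", "std"]),
    "srs": srs["Value"].agg(["mean", "std"]),
    "stratified": stratified["Value"].agg(["mean", "std"]),
})
print(summary)
```

Both samples contain 100 rows, but the stratified one is guaranteed to include 25 rows from each quartile, including the heavy right tail that a simple random sample can under-represent by chance.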

### Considerations

- **Sample Size**: The size of the sample should be large enough to represent the population accurately but not so large that it becomes impractical to analyze.
- **Sampling Method**: Choose the appropriate sampling method based on the characteristics of the data and the research question.

Explain the above considerations in your report.
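
As a starting point for the sample-size discussion, it may help to recall the standard error of the sample mean. The formula below assumes simple random sampling without replacement from a finite population of size $N$ with population standard deviation $\sigma$, where $n$ is the sample size:

$$
\mathrm{SE}(\bar{x}) = \frac{\sigma}{\sqrt{n}} \sqrt{\frac{N - n}{N - 1}}
$$

When $N$ is much larger than $n$, the second factor is close to 1 and the standard error shrinks roughly as $1/\sqrt{n}$, so quadrupling the sample size only halves the error.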

In [None]:
# Load the Data:
transactions_df = pd.read_csv('data.csv')

In [None]:
transactions_df.head()

In [None]:
# Cleaning Data:

# Converting data types (keep ETH-denominated rows; copy to avoid editing a view):
filtered_transactions_df = transactions_df[transactions_df['Value'].str.endswith('ETH', na=False)].copy()
filtered_transactions_df['Txn Fee'] = pd.to_numeric(filtered_transactions_df['Txn Fee'], errors='coerce')
value_text = filtered_transactions_df['Value'].str.replace(' ETH', '', regex=False).str.replace(',', '', regex=False)
filtered_transactions_df['Value'] = value_text.astype(float)

# Handling missing values (after conversion, so coerced NaNs are dropped too):
filtered_transactions_df = filtered_transactions_df.dropna(subset=['Txn Fee', 'Value'])

# Removing any irrelevant information:
filtered_transactions_df = filtered_transactions_df.loc[:, ~filtered_transactions_df.columns.str.match('Unnamed: ')]
filtered_transactions_df = filtered_transactions_df.drop(['Txn Hash', 'Method', 'Block', 'Age', 'From', 'To'], axis=1)




In [None]:
filtered_transactions_df.head()

In [None]:
# Simple Random Sampling:
sample_size = 100

n_rows = len(filtered_transactions_df)
if n_rows == 0:
    print("The DataFrame is empty.")
elif n_rows < sample_size:
    print(f"Cannot sample {sample_size} rows from a DataFrame with only {n_rows} rows.")
else:
    print(f"The DataFrame has {n_rows} rows.")
    # A fixed random_state (an arbitrary choice) makes the sample reproducible.
    SRS_samples = filtered_transactions_df.sample(n=sample_size, random_state=42)
    print("Sampled successfully.")


In [None]:
SRS_samples.head()

In [None]:
# Stratified Sampling:




In [None]:
# Statistical Analysis:

# ***SRS***
SRS_txn_fee_mean = SRS_samples['Txn Fee'].mean()
SRS_txn_fee_std = SRS_samples['Txn Fee'].std()

SRS_value_mean = SRS_samples['Value'].mean()
SRS_value_std = SRS_samples['Value'].std()

# Print the sample statistics (not the population variables computed earlier).
print(f"SRS: Txn Fee Mean: {SRS_txn_fee_mean}, Standard Deviation: {SRS_txn_fee_std}")
print(f"SRS: Value Mean: {SRS_value_mean}, Standard Deviation: {SRS_value_std}")

# ***Stratified Sampling***



In [None]:
# Visualization:

