# Web Scraping and Introductory Data Analysis

In this notebook, we will become familiar with the basics of web scraping and data analysis. We will use `selenium` to load pages from [Etherscan.io](https://etherscan.io/), a block explorer for the Ethereum blockchain, `BeautifulSoup` to parse the returned HTML, and `pandas` to perform some basic data analysis.

We begin by setting up our Python environment and installing the necessary libraries. We then proceed with the web scraping task, handling potential issues such as rate limiting. Once we have the data, we move on to the data sampling and statistical analysis tasks.

## Setting up the Environment 

In [146]:
!pip install selenium
!pip install beautifulsoup4
!pip install numpy
!pip install pandas
!pip install seaborn
!pip install matplotlib

Defaulting to user installation because normal site-packages is not writeable


In [147]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import os
import sys
import time
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup

## Data Collection

In this section, we will use web scraping to gather transaction data from the Ethereum blockchain using the Etherscan block explorer. Our objective is to collect transactions from the **last 10 blocks** on Ethereum.

The URL we target for data collection is [https://etherscan.io/txs](https://etherscan.io/txs).


### Steps

1. **Navigate to the URL**: Use Selenium to open the Etherscan transactions page in a browser.

2. **Locate the Transaction Data**: Identify the HTML elements that contain the transaction data for the specified block range.

3. **Extract the Data**: Write a script to extract the transaction details (e.g. Hash, Method, Block).

4. **Handle Pagination**: If the transactions span multiple pages, implement pagination handling to navigate through the pages and collect all relevant transaction data.

5. **Store the Data**: Save the extracted transaction data into a structured format, such as a CSV file or a pandas DataFrame, for further analysis.

In [148]:
URL = 'https://etherscan.io/txs'
LOAD_TIME = 0.5
LAST_BLOCK = None
NUM_BLOCKS = 10
raw_data = []

In [149]:
driver = webdriver.Chrome()
driver.get(URL)

In [150]:
# accept cookies
time.sleep(LOAD_TIME)
driver.find_element('xpath', '//*[@id="btnCookie"]').click()

In [151]:
# scroll to the bottom of the page, so that the dropdown menu is visible
time.sleep(LOAD_TIME)
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
time.sleep(LOAD_TIME)
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

In [152]:
# make each page show 100 records instead of 50
time.sleep(LOAD_TIME)
driver.find_element('xpath', '//*[@id="ContentPlaceHolder1_ddlRecordsPerPage"]').click()
driver.find_element('xpath', '//*[@id="ContentPlaceHolder1_ddlRecordsPerPage"]/option[4]').click()

In [153]:
while True:
    try:
        time.sleep(LOAD_TIME)
        soup = BeautifulSoup(driver.page_source, 'html.parser')

        table = soup.find('table')
        if table is None:
            raise Exception('Table not found')

        # skip the header row, then read every cell of each transaction row
        for row in table.find_all('tr')[1:]:
            cols = [cell.text.strip() for cell in row.find_all('td')]
            block = int(cols[3])

            if LAST_BLOCK is None:
                # the first row seen belongs to the newest block
                LAST_BLOCK = block
            elif LAST_BLOCK - NUM_BLOCKS > block:
                # stop once a row is more than NUM_BLOCKS blocks behind the newest one
                raise StopIteration('Reached the last block')

            raw_data.append(cols)

        # go to the next page; find_element raises if the link is missing
        driver.find_element('xpath', '//a[@aria-label="Next"]').click()
    except StopIteration:
        break
    except Exception as e:
        # wait and retry; a retry re-reads the current page, so rows may be
        # appended twice and duplicates have to be removed during cleaning
        print(e)
        time.sleep(10)

driver.quit()
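
Step 5 asks for the scraped rows to be stored, and the cells above only keep them in memory. A minimal sketch of one way to persist them with the standard `csv` module follows; the helper names `save_rows` and `load_rows` and the file name in the usage note are hypothetical, not part of the original notebook.

```python
import csv
from pathlib import Path


def save_rows(rows, path, header=None):
    """Write rows (lists of strings) to a CSV file and return how many were written."""
    path = Path(path)
    with path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        if header is not None:
            writer.writerow(header)
        writer.writerows(rows)
    return len(rows)


def load_rows(path):
    """Read every row back as a list of strings (the header row included, if any)."""
    with Path(path).open(newline="", encoding="utf-8") as handle:
        return list(csv.reader(handle))
```

Calling `save_rows(raw_data, "raw_transactions.csv")` right after scraping would make it possible to rerun the analysis later without opening the browser again.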



### Considerations

- **Rate Limiting**: Be mindful of the website's rate limits to avoid being blocked. Implement delays between requests if necessary.
- **Dynamic Content**: The Etherscan website may load content dynamically. Ensure that Selenium waits for the necessary elements to load before attempting to scrape the data.
- **Data Cleaning**: After extraction, clean the data to remove any inconsistencies or errors that may have occurred during the scraping process.
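
The rate-limiting point can be sketched as a small retry helper that waits longer after each failure instead of retrying at a fixed interval. This is only an illustration under stated assumptions: `retry_with_backoff` and its parameters are hypothetical names, and the delays are arbitrary defaults rather than values Etherscan documents.

```python
import random
import time


def retry_with_backoff(action, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call action() until it succeeds, doubling the pause after each failure."""
    for attempt in range(max_attempts):
        try:
            return action()
        except Exception:
            if attempt == max_attempts - 1:
                # give up and surface the last error instead of looping forever
                raise
            # exponential backoff plus a little jitter so retries are not evenly spaced
            delay = base_delay * 2 ** attempt + random.uniform(0, base_delay / 2)
            sleep(delay)
```

A page-scraping step could then be wrapped as `retry_with_backoff(lambda: scrape_page(driver))`, where `scrape_page` stands for whatever function reads one page. For the dynamic-content point, Selenium's own `WebDriverWait` plays a similar role by polling until an element is present.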

### Resources

- [Beautiful Soup Documentation](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)
- [Selenium Documentation](https://selenium-python.readthedocs.io/)
- [Pandas Documentation](https://pandas.pydata.org/docs/)
- [Ethereum](https://ethereum.org/en/)

## Data Analysis

Now that we have collected the transaction data from Etherscan, the next step is to conduct an initial analysis. This task involves the following steps:

1. **Load the Data**: Import the collected transaction data into a pandas DataFrame.

2. **Data Cleaning**: Clean the data by converting data types, removing any irrelevant information, and handling **duplicate** values.

3. **Statistical Analysis**: Calculate the mean and standard deviation of the population. Evaluate these statistics to understand the distribution of transaction values. The analysis and plotting will be on **Txn Fee** and **Value**.

4. **Visualization**: This phase involves the creation of visual representations to aid in the analysis of transaction values. The visualizations include:
    - A histogram for each data column, which provides a visual representation of the data distribution. The bin size is crucial and should be chosen based on the data's characteristics so that the histogram represents them accurately. Explain how you selected the bin size.
    - A normal distribution plot fitted alongside the histogram to compare the empirical distribution of the data with the theoretical normal distribution.
    - A box plot and a violin plot to identify outliers and provide a comprehensive view of the data's distribution.
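
As one possible starting point for the histogram and normal-fit items above, the sketch below derives bin edges with the Freedman-Diaconis rule (bin width of 2·IQR·n^(-1/3), available in NumPy as `bins="fd"`) and evaluates a normal density using the data's own mean and standard deviation. The function name `histogram_with_normal_fit` is hypothetical, the rule is just one defensible choice among several, and the code assumes the data are not all identical (otherwise the standard deviation is zero).

```python
import numpy as np


def histogram_with_normal_fit(values, points=200):
    """Return Freedman-Diaconis bin edges and a normal curve fitted by mean and std."""
    data = np.asarray(values, dtype=float)

    # bin edges chosen from the interquartile range and the number of observations
    edges = np.histogram_bin_edges(data, bins="fd")

    mu = data.mean()
    sigma = data.std()  # population standard deviation (ddof=0)

    # normal density evaluated over the observed range of the data
    x = np.linspace(data.min(), data.max(), points)
    pdf = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return edges, x, pdf, mu, sigma
```

The results can be drawn with `plt.hist(data, bins=edges, density=True)` followed by `plt.plot(x, pdf)`; using `density=True` puts the histogram on the same scale as the fitted curve.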

### Deliverables

The project should produce the following deliverables:

- A refined pandas DataFrame containing the transaction data, which has undergone thorough cleaning and is ready for analysis.
- A simple statistical analysis evaluating the population statistics, offering insights into the distribution of transaction values and fees.
- A set of visualizations showcasing the distribution of transaction values for the population. These visualizations include histograms, normal distribution plots, box plots, and violin plots, each serving a specific purpose in the analysis.

### Getting Started

The project starts by importing the transaction data into a pandas DataFrame, setting the stage for data manipulation and analysis. The data is then cleaned to ensure its quality and reliability, and the population statistics are calculated. Finally, a series of visualizations is created to analyze the distribution of transaction values and fees.

In [154]:
dataframe = pd.DataFrame(columns=['Txn Hash', 'Method', 'Block', 'Date', 'Age', 'Local Date', 'From', 'To', 'Value', 'Txn fee'])

In [155]:
# create a dataframe from the raw data
# keep columns 1-7 and 9-11 of every scraped row and build the frame in one call
rows = [row[1:8] + row[9:12] for row in raw_data]
dataframe = pd.DataFrame(rows, columns=dataframe.columns)

In [156]:
dataframe

| | Txn Hash | Method | Block | Date | Age | Local Date | From | To | Value | Txn fee |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0x3ca0ed52edb1b65fb3bb91838a8e6dc9e4e7128b81fa... | Transfer | 19334424 | 2024-02-29 16:41:23 | 7 secs ago | 1709224883 | 0xa83114A4...9F37fCCcb | Lido: Execution Layer Rewards Vault | 1.098303845 ETH | 0.00196578 |
| 1 | 0x665142b06fb72c625386fa3ca1fa0401a0248112b710... | Transfer | 19334424 | 2024-02-29 16:41:23 | 7 secs ago | 1709224883 | Rollbit: Hot Wallet | 0xb1659e12...f803629fc | 0.0168134 ETH | 0.00186723 |
| 2 | 0x86873d12def84b29d010321303d36375869ca37ddcfc... | Create Assertion... | 19334424 | 2024-02-29 16:41:23 | 7 secs ago | 1709224883 | Mantle: Rollup Asserter | Mantle: Rollup Proxy | 0 ETH | 0.5819696 |
| 3 | 0x4545501d07befafd9903105673fb5004ccb176333639... | Transfer | 19334424 | 2024-02-29 16:41:23 | 7 secs ago | 1709224883 | 0xd7Aa9ba6...7F31664fC | Tether: USDT Stablecoin | 0 ETH | 0.00409982 |
| 4 | 0xb0f305ba36dc075d393fb35ab183e42ee6480a10d477... | Transfer | 19334424 | 2024-02-29 16:41:23 | 7 secs ago | 1709224883 | mercuryo | 0x6833831c...2E1595325 | 0.073559907 ETH | 0.00186723 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 2784 | 0xd3f786edd1f025f9ea6406a3e59cb541efa0399b1aa5... | 0xfb034fb2 | 19334414 | 2024-02-29 16:39:23 | 3 mins ago | 1709224763 | 0x46187698...2ee79934B | 0x51C72848...784502a7F | 0 ETH | 0.03079432 |
| 2785 | 0xfeb1b120e6bcd5838a01256b7cc0ff780db595c86080... | 0x771d503f | 19334414 | 2024-02-29 16:39:23 | 3 mins ago | 1709224763 | 0x8EC5Fb1d...6067C3778 | 0x51C72848...784502a7F | 0 ETH | 0.02771194 |
| 2786 | 0xc762706d3312f532a3553d87037926252d1a1c47d42e... | 0x771d503f | 19334414 | 2024-02-29 16:39:23 | 3 mins ago | 1709224763 | 0x5fA60dd1...Bd59F64d4 | 0x51C72848...784502a7F | 0 ETH | 0.02833663 |
| 2787 | 0x7efeba2596e8c2b6b8096de158595895a49a7d1545fc... | 0xceb5748e | 19334414 | 2024-02-29 16:39:23 | 3 mins ago | 1709224763 | 0x07E53341...D809D8538 | MEV Bot: 0x6f1…168 | 0 ETH | 0.05085303 |
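
Every column in the frame above still holds strings, so the cleaning step has to convert them before any statistics can be computed. The sketch below shows one way to do that, assuming every `Value` entry ends in ` ETH` as in the rows shown; the function name `clean_transactions` and the placeholder hashes in the sample are hypothetical, while the `Value` and `Txn fee` strings are copied from the output above.

```python
import pandas as pd


def clean_transactions(frame):
    """Return a copy with numeric Block, Value and Txn fee columns and unique hashes."""
    # retried pages can be scraped twice, so keep one row per transaction hash
    cleaned = frame.drop_duplicates(subset="Txn Hash").copy()

    cleaned["Block"] = pd.to_numeric(cleaned["Block"], errors="coerce")

    # assumption: values look like "1.098303845 ETH"; anything else becomes NaN
    eth = (
        cleaned["Value"]
        .str.replace(" ETH", "", regex=False)
        .str.replace(",", "", regex=False)
    )
    cleaned["Value"] = pd.to_numeric(eth, errors="coerce")
    cleaned["Txn fee"] = pd.to_numeric(cleaned["Txn fee"], errors="coerce")
    return cleaned.reset_index(drop=True)


# tiny illustration: the third row repeats the first on purpose
sample = pd.DataFrame(
    {
        "Txn Hash": ["hash-a", "hash-b", "hash-a"],  # placeholder hashes
        "Block": ["19334424", "19334424", "19334424"],
        "Value": ["1.098303845 ETH", "0.0168134 ETH", "1.098303845 ETH"],
        "Txn fee": ["0.00196578", "0.00186723", "0.00196578"],
    }
)
cleaned = clean_transactions(sample)
```

After this step, `cleaned["Value"].mean()` and `cleaned["Txn fee"].std(ddof=0)` give population statistics directly, and any row whose value failed to parse shows up as `NaN` for inspection.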


## Data Sampling and Analysis

In this section, we will delve into the process of data sampling and perform an initial analysis on the transaction data we have collected. Our objective is to understand the distribution of transaction values by sampling the data and comparing the sample statistics with the population statistics.

### Steps

1. **Load the Data**: Import the collected transaction data into a pandas DataFrame.

2. **Data Cleaning**: Clean the data by handling missing values, converting data types, and removing any irrelevant information.

3. **Simple Random Sampling (SRS)**: Create a sample from the dataset using a simple random sampling method. This involves randomly selecting a subset of the data without regard to any specific characteristics of the data.

4. **Stratified Sampling**: Create another sample from the dataset using a stratified sampling method. This involves dividing the data into strata based on a specific characteristic (e.g., transaction value) and then randomly selecting samples from each stratum. Explain what you have stratified the data by and why you chose this column.

5. **Statistical Analysis**: Calculate the mean and standard deviation of the samples and the population. Compare these statistics to understand the distribution of transaction values.

6. **Visualization**: Plot the distribution of transaction values and fees for both the samples and the population to visually compare their distributions.
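
The two sampling methods in the steps above can be sketched with pandas as follows. The helper names are hypothetical, and stratifying by a categorical column such as `Method` is only an illustrative choice; the report still has to justify whichever column is actually used.

```python
import pandas as pd


def simple_random_sample(frame, n, seed=0):
    """Draw n rows uniformly at random, without replacement."""
    return frame.sample(n=n, random_state=seed)


def stratified_sample(frame, by, frac, seed=0):
    """Draw the same fraction of rows from every group defined by column `by`."""
    return frame.groupby(by).sample(frac=frac, random_state=seed)


def compare_statistics(population, samples, column):
    """Tabulate mean and standard deviation of `column` for the population and each sample."""
    rows = {"population": population[column].agg(["mean", "std"])}
    for name, sample in samples.items():
        rows[name] = sample[column].agg(["mean", "std"])
    return pd.DataFrame(rows).T
```

With a cleaned frame `df`, something like `compare_statistics(df, {"srs": simple_random_sample(df, 300), "stratified": stratified_sample(df, "Method", 0.1)}, "Value")` would place the three sets of statistics side by side. Note that pandas' `std` uses `ddof=1` by default, so pass `ddof=0` separately if the population figure is wanted.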

### Considerations

- **Sample Size**: The size of the sample should be large enough to represent the population accurately but not so large that it becomes impractical to analyze.
- **Sampling Method**: Choose the appropriate sampling method based on the characteristics of the data and the research question.

Explain the above considerations in your report.
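
One way to make the sample-size consideration concrete is through the standard error of the sample mean. The symbols are introduced here for illustration: $N$ is the population size, $n$ the sample size, $\sigma$ the population standard deviation, and $\bar{x}$ the sample mean. For a simple random sample drawn without replacement:

$$
\mathrm{SE}(\bar{x}) = \frac{\sigma}{\sqrt{n}} \sqrt{\frac{N - n}{N - 1}}
$$

While $n$ is small relative to $N$, quadrupling the sample size roughly halves the standard error. The second factor, the finite population correction, drives the error to zero as $n$ approaches $N$.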

In [157]:
# Your code here