# Web Scraping
Scraping information about Web3 VC companies from https://crypto-fundraising.info/investors/


# Importing Necessary Libraries
- requests: sends HTTP requests to the target website and receives the webpage content.
- BeautifulSoup: parses the webpage and extracts specific elements from the HTML.
- pandas: structures the data in a DataFrame, making it easy to analyze and export.
- time: adds delays between requests to avoid overloading the server.

In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

- Setting Up the Target URL
- Sending an HTTP Request to Fetch Web Page Content
- Parsing HTML Content with BeautifulSoup

In [2]:
url = "https://crypto-fundraising.info/investors/"

# Fetch the page; page.status_code is 200 when the request succeeded
page = requests.get(url)

# Parse the returned HTML
soup = BeautifulSoup(page.text, 'html.parser')
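The cell above does not check whether the request succeeded before parsing. A minimal sketch of a more defensive fetch is shown below. The helper name `fetch_soup`, the 30-second timeout, and the `http` parameter (which only exists so the function can be exercised without network access) are assumptions for illustration, not part of the original notebook.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=30, http=requests):
    """Fetch `url` and return the parsed HTML.

    Hypothetical helper: raises requests.HTTPError on a 4xx/5xx status
    instead of silently parsing an error page.
    """
    response = http.get(url, timeout=timeout)
    response.raise_for_status()
    return BeautifulSoup(response.text, "html.parser")
```

With this helper, the cell above could be written as `soup = fetch_soup(url)`.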

- Inspecting the Web Page 

In [3]:
print(soup.prettify())

<!DOCTYPE html>
<html lang="en-US">
 <head>
  <meta charset="utf-8"/>
  <meta content="user-scalable=no, initial-scale=1, maximum-scale=1, minimum-scale=1, width=device-width, height=device-height, target-densitydpi=device-dpi" name="viewport">
   <link href="https://gmpg.org/xfn/11" rel="profile"/>
   <link href="/favicon.ico" rel="shortcut icon" type="image/x-icon">
    <link as="image" href="https://crypto-fundraising.info/wp-content/themes/ico/img/cf-logo.svg" rel="preload"/>
    <link as="image" href="https://crypto-fundraising.info/wp-content/themes/ico/img/sort-all.png" rel="preload"/>
    <link as="image" href="https://crypto-fundraising.info/wp-content/themes/ico/img/sort-bottom.png" rel="preload"/>
    <link as="image" href="https://crypto-fundraising.info/wp-content/themes/ico/img/sort-top.png" rel="preload"/>
    <link href="https://crypto-fundraising.info/wp-content/cache/breeze-minification/css/breeze_93b19b5612357300a063ffc1fa70f84f.css" media="all" rel="stylesheet" type

- Identifying Relevant HTML Elements for Scraping

In [7]:
soup.find_all('div', class_='htop')

[<div class="hpt-col1 htop"> <span>#</span></div>,
 <div class="hpt-col2 htop wsort"><div class="sorter" data-sort="default" data-sortby="pname"></div> <span>Fund name</span></div>,
 <div class="hpt-col3 htop"></div>,
 <div class="hpt-col4 htop"> <span>Website</span></div>,
 <div class="hpt-col5 htop wsort"><div class="sorter" data-sort="default" data-sortby="pcount"></div> <span>Invested projects</span></div>]

In [9]:
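# Note: extra keyword arguments such as span_ are treated as attribute filters,
# not child-tag selectors; as the output shows, the result matches the previous cell.
soup.find_all('div', class_='htop', span_='')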

[<div class="hpt-col1 htop"> <span>#</span></div>,
 <div class="hpt-col2 htop wsort"><div class="sorter" data-sort="default" data-sortby="pname"></div> <span>Fund name</span></div>,
 <div class="hpt-col3 htop"></div>,
 <div class="hpt-col4 htop"> <span>Website</span></div>,
 <div class="hpt-col5 htop wsort"><div class="sorter" data-sort="default" data-sortby="pcount"></div> <span>Invested projects</span></div>]
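If the intent of the second call was to keep only the header cells that actually contain a `<span>`, one possible approach is to filter the result in Python. The sketch below runs on the header markup printed above; the variable names are illustrative.

```python
from bs4 import BeautifulSoup

# Header markup copied from the output above
header_html = """
<div class="hpt-col1 htop"> <span>#</span></div>
<div class="hpt-col2 htop wsort"><div class="sorter" data-sort="default" data-sortby="pname"></div> <span>Fund name</span></div>
<div class="hpt-col3 htop"></div>
<div class="hpt-col4 htop"> <span>Website</span></div>
<div class="hpt-col5 htop wsort"><div class="sorter" data-sort="default" data-sortby="pcount"></div> <span>Invested projects</span></div>
"""

header_soup = BeautifulSoup(header_html, "html.parser")
header_divs = header_soup.find_all("div", class_="htop")

# Keep only the divs that have a <span> somewhere inside them
divs_with_span = [div for div in header_divs if div.find("span") is not None]

print(len(header_divs), len(divs_with_span))  # 5 4
```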

In [11]:
column = soup.find_all('div', class_='htop')

- Extracting data elements from HTML.

In [13]:
# Find all div elements with class 'htop'
column = soup.find_all('div', class_='htop')

if column:
    for div in column:
        spans = div.find_all('span')
        for span in spans:
            print(span.text)
else:
    print("No divs with class 'htop' found")

#
Fund name
Website
Invested projects


In [15]:
data = []

- Creating a DataFrame to organize extracted data.

In [17]:
# Create a DataFrame
df = pd.DataFrame(data, columns=['#', 'Fund name', 'Website', 'Invested projects'])
df

#,Fund name,Website,Invested projects
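The column names above are typed by hand. As a rough sketch, they could instead be taken from the header spans found earlier, so the DataFrame follows the page if the headers change. The markup below is abridged from the header output shown earlier, and the variable names are illustrative.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Abridged header markup from the earlier output
columns_html = """
<div class="hpt-col1 htop"> <span>#</span></div>
<div class="hpt-col2 htop wsort"> <span>Fund name</span></div>
<div class="hpt-col3 htop"></div>
<div class="hpt-col4 htop"> <span>Website</span></div>
<div class="hpt-col5 htop wsort"> <span>Invested projects</span></div>
"""

columns_soup = BeautifulSoup(columns_html, "html.parser")

# Collect the text of every <span> inside a header div, in page order
headers = [
    span.get_text(strip=True)
    for div in columns_soup.find_all("div", class_="htop")
    for span in div.find_all("span")
]

# An empty DataFrame whose columns come from the page
empty_df = pd.DataFrame(columns=headers)
```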


In [19]:
soup.find_all('div', class_='hpt-data')

[<div class="hp-table-row hpt-data" data-fund_id="249" data-fund_slug="coinbase-ventures"><div class="hpt-col1"> 01 </div><div class="hpt-col2"> <a class="aprojects nochevron" href="/funds/coinbase-ventures"> <img alt="Coinbase Ventures" class="fundlogoinvest" src="https://crypto-fundraising.info/wp-content/uploads/funds/2022/01/cb-ven.png"/> </a><p class="mob-only">Coinbase Ventures</p></div><div class="hpt-col3"> Coinbase Ventures</div><div class="hpt-col4"> <a href="https://www.coinbase.com/ventures" rel="noopener noreferrer nofollow" target="_blank" title="https://www.coinbase.com/ventures"> https://www.coinbase.com/ventures </a> <a href="https://twitter.com/cbventures" rel="noopener noreferrer nofollow" target="_blank" title="https://twitter.com/cbventures"> https://twitter.com/cbventures </a></div><div class="hpt-col5 flexwrap"> <a class="aprojects" href="/funds/coinbase-ventures"><div class="pcount" data-funid="249"> 331</div> </a></div></div>,
 <div class="hp-table-row hpt-data

In [21]:
rows = soup.find_all('div', class_='hpt-data')
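Before looping over every page, it can help to test the field extraction on a single row. The sketch below parses an abridged copy of the Coinbase Ventures row printed above. The function name `parse_row` is hypothetical, and the assumption that the first link in column 4 is the website and the second is Twitter comes from this one sample row only.

```python
from bs4 import BeautifulSoup

# Abridged copy of the first row from the output above (logo image omitted)
row_html = (
    '<div class="hp-table-row hpt-data" data-fund_id="249" data-fund_slug="coinbase-ventures">'
    '<div class="hpt-col1"> 01 </div>'
    '<div class="hpt-col2"> <p class="mob-only">Coinbase Ventures</p></div>'
    '<div class="hpt-col3"> Coinbase Ventures</div>'
    '<div class="hpt-col4">'
    ' <a href="https://www.coinbase.com/ventures"> https://www.coinbase.com/ventures </a>'
    ' <a href="https://twitter.com/cbventures"> https://twitter.com/cbventures </a></div>'
    '<div class="hpt-col5 flexwrap"> <a class="aprojects" href="/funds/coinbase-ventures">'
    '<div class="pcount" data-funid="249"> 331</div> </a></div></div>'
)


def parse_row(row):
    """Turn one 'hpt-data' div into a dict; missing links become None."""
    links = row.find("div", class_="hpt-col4").find_all("a")
    return {
        "Rank": row.find("div", class_="hpt-col1").get_text(strip=True),
        "Name": row.find("div", class_="hpt-col3").get_text(strip=True),
        "Website": links[0]["href"].strip() if len(links) > 0 else None,
        "Twitter": links[1]["href"].strip() if len(links) > 1 else None,
        "Investments": row.find("div", class_="pcount").get_text(strip=True),
    }


sample_row = BeautifulSoup(row_html, "html.parser").find("div", class_="hpt-data")
parsed = parse_row(sample_row)
print(parsed)
```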

- Looping through all pages to scrape data.

In [23]:
# Base URL; the page number is appended to it
base_url = 'https://crypto-fundraising.info/investors/?page='

# Accumulates one dict per fund across all pages
all_data = []

# Number of pages to scrape
total_pages = 358

# Loop through all pages
for page_number in range(1, total_pages + 1):
    # URL for the current page
    url = f"{base_url}{page_number}"

    # Send the GET request
    response = requests.get(url)

    # Check if the request was successful
    if response.status_code == 200:
        # Parse the page content
        soup = BeautifulSoup(response.text, 'html.parser')

        # Each fund is one div with class 'hpt-data'
        rows = soup.find_all('div', class_='hpt-data')

        # Loop through the rows and extract the fields
        for row in rows:
            rank = row.find('div', class_='hpt-col1').text.strip()
            name = row.find('div', class_='hpt-col3').text.strip()
            # Column 4 is assumed to hold the website link first and the
            # Twitter link second; a missing link becomes None instead of
            # raising an error
            links = row.find('div', class_='hpt-col4').find_all('a')
            website = links[0]['href'].strip() if len(links) > 0 else None
            twitter = links[1]['href'].strip() if len(links) > 1 else None
            investments = row.find('div', class_='pcount').text.strip()

            # Store the data
            all_data.append({
                'Rank': rank,
                'Name': name,
                'Website': website,
                'Twitter': twitter,
                'Investments': investments
            })

        # Print the progress
        print(f"Successfully scraped page {page_number} of {total_pages}")
    else:
        print(f"Failed to retrieve page {page_number} (status {response.status_code})")

    # Pause between requests to avoid overloading the server
    time.sleep(1)

Successfully scraped page 1 of 358
Successfully scraped page 2 of 358
Successfully scraped page 3 of 358
Successfully scraped page 4 of 358
Successfully scraped page 5 of 358
Successfully scraped page 6 of 358
Successfully scraped page 7 of 358
Successfully scraped page 8 of 358
Successfully scraped page 9 of 358
Successfully scraped page 10 of 358
Successfully scraped page 11 of 358
Successfully scraped page 12 of 358
Successfully scraped page 13 of 358
Successfully scraped page 14 of 358
Successfully scraped page 15 of 358
Successfully scraped page 16 of 358
Successfully scraped page 17 of 358
Successfully scraped page 18 of 358
Successfully scraped page 19 of 358
Successfully scraped page 20 of 358
Successfully scraped page 21 of 358
Successfully scraped page 22 of 358
Successfully scraped page 23 of 358
Successfully scraped page 24 of 358
Successfully scraped page 25 of 358
Successfully scraped page 26 of 358
Successfully scraped page 27 of 358
Successfully scraped page 28 of 358
S
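The loop above relies on a hard-coded `total_pages = 358`, which goes stale when the site adds or removes funds. One possible alternative, sketched below, is to stop at the first page that yields no rows. This assumes that pages past the end contain no `hpt-data` rows, which has not been verified against the site. The function name `scrape_until_empty` and the stand-in `fake_pages` data are hypothetical.

```python
def scrape_until_empty(fetch_rows, max_pages):
    """Collect rows from pages 1..max_pages, stopping at the first empty page.

    `fetch_rows` is any callable that takes a page number and returns a
    list of already-parsed rows for that page.
    """
    collected = []
    for page_number in range(1, max_pages + 1):
        rows = fetch_rows(page_number)
        if not rows:
            # No rows on this page: assume we have run past the last page
            break
        collected.extend(rows)
    return collected


# Stand-in data so the sketch runs without network access
fake_pages = {1: ["fund A", "fund B"], 2: ["fund C"]}
result = scrape_until_empty(lambda n: fake_pages.get(n, []), max_pages=10)
print(result)  # ['fund A', 'fund B', 'fund C']
```

In the notebook, `fetch_rows` would request the page, parse the `hpt-data` rows, and call `time.sleep(1)` before returning, while `max_pages` acts as a safety cap.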

- Saving and exporting data.

In [43]:
# Convert the list of data into a DataFrame
df = pd.DataFrame(all_data)

# Save the DataFrame to a CSV file
df.to_csv('scraped_data.csv', index=False)

# Save the DataFrame to an Excel file (requires an Excel writer such as openpyxl)
df.to_excel('scraped_data.xlsx', index=False)
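The scraped values are all strings, so `Rank` and `Investments` are exported as text. A small sketch of converting them to integers before saving is shown below. It uses a single sample record taken from the row output earlier in the notebook; the variable names are illustrative.

```python
import pandas as pd

# One record in the same shape as the entries in all_data
sample_data = [{
    "Rank": "01",
    "Name": "Coinbase Ventures",
    "Website": "https://www.coinbase.com/ventures",
    "Twitter": "https://twitter.com/cbventures",
    "Investments": "331",
}]

typed_df = pd.DataFrame(sample_data)

# Convert numeric columns; values that cannot be parsed become <NA>
for col in ("Rank", "Investments"):
    typed_df[col] = pd.to_numeric(typed_df[col], errors="coerce").astype("Int64")

print(typed_df.dtypes)
```

The nullable `Int64` dtype is used so that a row with a missing or malformed count does not force the whole column to floating point.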