<a href="https://colab.research.google.com/github/Rishi625/Investment_match/blob/main/Investment_Match.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Web Scraping Top Startups Website**

This Colab notebook uses Python to scrape data from the [Top Startups website](https://topstartups.io/) and saves the results to a CSV file. Below is an explanation of the script.

---

## **Script Overview**

### **1. Required Libraries**
The script imports the following libraries:
- **`requests`**: To make HTTP requests and fetch web page content.
- **`BeautifulSoup`**: To parse HTML and extract specific elements.
- **`pandas`**: To create and save the scraped data as a structured table (DataFrame).
- **`time`**: To introduce delays between requests (politeness in web scraping).

### **2. Function: `scrape_topstartups_with_pagination`**
This function:
- Scrapes multiple pages of the Top Startups website.
- Handles pagination by incrementing the page number in the URL until no more companies are found.

#### **Key Steps in the Function:**
1. **Set Up Base URL and Headers**
   - The `base_url` points to the website's main page.
   - A `headers` dictionary sets a browser-like `User-Agent` string so the site is less likely to reject the request.

2. **Loop Through Pages**
   - A `while` loop fetches each page incrementally by appending the `?page={page}` query to the URL.
   - The HTML content of the page is parsed using `BeautifulSoup`.

3. **Extract Data**
   - For each page, the `extract_company_info` function extracts relevant information (like name, description, tags, and location) about startups.

4. **Stop Condition**
   - The loop terminates when a page yields no companies or contains no next-page link (an `a` element with class `infinite-more-link`).

5. **Save Results**
   - All scraped data is stored in a Pandas DataFrame and exported as a CSV file named `topstartups_data.csv`.
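
The control flow in the steps above can be sketched without any network access. This is a minimal illustration, not the notebook's function: `paginate`, `fetch_page`, `fake_fetch`, and `FAKE_SITE` are hypothetical names, and the fake fetcher stands in for the HTTP request plus HTML parsing.

```python
from typing import Callable, List, Tuple

def paginate(fetch_page: Callable[[int], Tuple[List[str], bool]]) -> List[str]:
    """Collect items page by page until a page is empty or has no next link."""
    collected: List[str] = []
    page = 1
    while True:
        items, has_next = fetch_page(page)
        if not items:          # stop condition 1: nothing found on this page
            break
        collected.extend(items)
        if not has_next:       # stop condition 2: no next-page link
            break
        page += 1
    return collected

# Hypothetical stand-in for a real site: three pages of made-up names.
FAKE_SITE = {1: ["A", "B"], 2: ["C", "D"], 3: ["E"]}

def fake_fetch(page: int) -> Tuple[List[str], bool]:
    items = FAKE_SITE.get(page, [])
    return items, (page + 1) in FAKE_SITE

print(paginate(fake_fetch))
```

Keeping both stop conditions means the loop ends cleanly whether the site signals the last page by omitting the link or by returning an empty page.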

---



In [None]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

def scrape_topstartups_with_pagination():
    base_url = "https://topstartups.io/"
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
    }

    all_companies = []
    page = 1

    while True:
        url = f"{base_url}?page={page}"
        # A timeout keeps the loop from hanging indefinitely on a stalled connection
        response = requests.get(url, headers=headers, timeout=10)
        soup = BeautifulSoup(response.content, 'html.parser')

        # Extract companies from current page
        companies = extract_company_info(soup)

        if not companies:
            break

        all_companies.extend(companies)
        print(f"Scraped page {page}, found {len(companies)} companies")

        # Check if next page exists
        next_page = soup.find('a', class_='infinite-more-link')
        if not next_page:
            break

        page += 1
        time.sleep(1)

    # Create and save DataFrame
    df = pd.DataFrame(all_companies)
    df.to_csv('topstartups_data.csv', index=False)
    return df

def extract_company_info(soup):
    companies = []

    for card in soup.find_all('div', class_='card card-body', id='item-card-filter'):
        company = {
            'Company Name': '',
            'Website': '',
            'Industry Tags': '',
            'Description': '',
            'Funding Tags': '',
            'Location': '',
            'Company Size': ''
        }

        # Company Name
        name_tag = card.find('h3')
        if name_tag:
            company['Company Name'] = name_tag.text.strip()

            # Website
            website_tag = card.find('a', id='startup-website-link')
            if website_tag:
                company['Website'] = website_tag.get('href', '').split('?')[0]

            # Industry Tags
            industry_tags = card.find_all('span', class_='badge rounded-pill bg-success', id='industry-tags')
            company['Industry Tags'] = ', '.join([tag.text.strip() for tag in industry_tags])

            # Extract company description
            # `string=` replaces the deprecated `text=` argument in BeautifulSoup
            description_tag = card.find('b', id='card-header', string='What they do: ')
            if description_tag:
                br_tag = description_tag.find_next('br')
                if br_tag:
                    description_text = br_tag.find_next_sibling(string=True)
                    if description_text:
                        company['Description'] = description_text.strip()

            # Funding Tags
            funding_tags = card.find_all('span', class_='badge rounded-pill bg-primary', id='funding-tags')
            company['Funding Tags'] = ', '.join([tag.text.strip() for tag in funding_tags])

            # Location
            location_tag = card.find('b', id='card-header')
            if location_tag:
                for br in card.find_all('br'):
                    sibling = br.next_sibling
                    # next_sibling may be None or a Tag; only text nodes can hold the pin marker
                    if isinstance(sibling, str) and '📍' in sibling:
                        company['Location'] = sibling.strip()

            # Company Size
            company_size_tags = card.find_all('span', class_='badge rounded-pill bg-success' , id ='company-size-tags')
            company['Company Size'] = ', '.join(tag.text.strip() for tag in company_size_tags)

            companies.append(company)
    return companies

if __name__ == "__main__":
    df = scrape_topstartups_with_pagination()
    print(f"\nTotal companies scraped: {len(df)}")



Scraped page 1, found 18 companies
Scraped page 2, found 18 companies
Scraped page 3, found 18 companies
Scraped page 4, found 18 companies
Scraped page 5, found 18 companies
Scraped page 6, found 18 companies
Scraped page 7, found 18 companies
Scraped page 8, found 18 companies
Scraped page 9, found 18 companies
Scraped page 10, found 18 companies
Scraped page 11, found 18 companies
Scraped page 12, found 18 companies
Scraped page 13, found 18 companies
Scraped page 14, found 18 companies
Scraped page 15, found 18 companies
Scraped page 16, found 18 companies
Scraped page 17, found 18 companies
Scraped page 18, found 18 companies
Scraped page 19, found 18 companies
Scraped page 20, found 18 companies
Scraped page 21, found 18 companies
Scraped page 22, found 18 companies
Scraped page 23, found 18 companies
Scraped page 24, found 18 companies
Scraped page 25, found 18 companies
Scraped page 26, found 18 companies
Scraped page 27, found 18 companies
Scraped page 28, found 18 companies

In [None]:
df.head(5)

| | Company Name | Website | Industry Tags | Description | Funding Tags | Location | Company Size |
|---|---|---|---|---|---|---|---|
| 0 | Pogo | https://www.joinpogo.com/ | Consumer, Mobile App, FinTech | Help over 1.5M+ users earn and save by unlocki... | 20VC, Josh Buckley, Founders of Honey & Carta,... | 📍HQ: New York, New York, USA | 11-50 employees, Founded: 2020 |
| 1 | Icon | https://icon.me/careers | Artificial Intelligence, E-Commerce, Creator | Help brands do AI ads with real creators | Founders Fund, Seed in 2024 | 📍HQ: New York City, New York, USA | 1-10 employees, Founded: 2024 |
| 2 | Eon | https://www.eon.io/ | SaaS, Enterprise Software | First-ever backup autopilot for modern cloud i... | Sequoia, $70M Series C in 2024 | 📍HQ: Remote | 51-100 employees, Founded: 2016 |
| 3 | Cyera | https://www.cyera.io/ | Artificial Intelligence, Cybersecurity, Enterp... | AI-powered data security platform that gives e... | Sequoia, Accel, $300M Series D in 2024, $3.0B ... | 📍HQ: New York, New York, USA | 201-500 employees, Founded: 2021 |
| 4 | Kong | https://konghq.com/ | Enterprise Software, SaaS | Kong offers the industry-leading service conne... | Andreessen Horowitz, $175M Series E in 2024, $... | 📍HQ: San Francisco Bay Area, California, USA | 201-500 employees, Founded: 2017 |


In [None]:
df.dropna(how='all', inplace=True)

In [None]:
# Split the Industry Tags column
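# Tags were joined with ', ', so split on the same separator to avoid leading spaces;
# with n=2, any tags beyond the second stay together in 'Tag 3'
df[['Tag 1', 'Tag 2', 'Tag 3']] = df['Industry Tags'].str.split(', ', n=2, expand=True)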
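
How a capped split behaves can be checked with plain Python strings. This is a small sketch rather than the pandas call itself: `split_tags` is a hypothetical helper, and the second sample string is made up for illustration.

```python
def split_tags(tags: str, max_columns: int = 3) -> list:
    """Split a comma-separated tag string into at most `max_columns` pieces."""
    parts = tags.split(",", max_columns - 1)   # at most max_columns - 1 splits
    return [part.strip() for part in parts]    # drop the space left after each comma

print(split_tags("Consumer, Mobile App, FinTech"))
print(split_tags("AI, Cybersecurity, Enterprise Software, SaaS"))  # made-up four-tag row
print(split_tags("SaaS"))
```

With four tags, the last piece keeps the remainder ("Enterprise Software, SaaS") instead of dropping it. With a single tag only one piece comes back; in the pandas version with `expand=True`, the missing columns for such rows are filled with missing values.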

In [None]:
# Extract the Funding Stage
df['Funding Stage'] = df['Funding Tags'].str.extract(r'(Seed|Series [A-Z])')
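
The extraction pattern can be tried on individual strings with the standard `re` module, which uses the same first-match semantics as `str.extract`. In this sketch `first_stage` and `STRICT_PATTERN` are assumptions added for illustration, and the "Seedcamp" sample row is invented to show an edge case.

```python
import re

# Same pattern as the notebook's str.extract call.
NOTEBOOK_PATTERN = re.compile(r"(Seed|Series [A-Z])")
# Assumed stricter variant: word boundaries stop "Seed" matching inside longer words.
STRICT_PATTERN = re.compile(r"\b(Seed|Series [A-Z])\b")

def first_stage(funding_tags, pattern=NOTEBOOK_PATTERN):
    """Return the first funding stage found in the string, or None."""
    match = pattern.search(funding_tags)
    return match.group(1) if match else None

print(first_stage("Sequoia, $70M Series C in 2024"))
print(first_stage("Founders Fund, Seed in 2024"))
print(first_stage("20VC, Josh Buckley"))

sample = "Seedcamp, $5M Series A in 2023"   # invented row, not from the dataset
print(first_stage(sample))                   # unanchored pattern matches inside "Seedcamp"
print(first_stage(sample, STRICT_PATTERN))
```

Only the first match is kept, so a row listing several rounds reports whichever stage appears first in the string. Rows with no recognizable stage yield a missing value.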

In [None]:
# Clean the Location column
df['Location'] = df['Location'].str.replace('📍HQ: ', '', regex=False)

In [None]:
# Split the Company Size column
df[['Company_Size', 'Founded']] = df['Company Size'].str.split(', Founded: ', expand=True)

In [None]:
# Drop the original Company Size column
df = df.drop(columns=['Company Size'])

In [None]:
df.to_csv('topstartups_data.csv', index=False)

# Scraping Company Descriptions from Websites

This section collects an additional description for each startup listed in `topstartups_data.csv` by scraping the company's own website. The requests run in parallel, and the result is stored in a new `Description 2` column.
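
The scraper prefers a page's meta description and falls back to the first paragraph when none exists. The sketch below shows that priority order on inline HTML using only the standard library's `html.parser` as a stand-in for BeautifulSoup. `DescriptionParser` and `describe` are hypothetical names, and the final fallback to the first 500 characters of the body is omitted for brevity.

```python
from html.parser import HTMLParser

class DescriptionParser(HTMLParser):
    """Record the meta description and the text of the first <p> element."""

    def __init__(self):
        super().__init__()
        self.meta_description = None
        self.first_paragraph = None
        self._in_p = False
        self._p_parts = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name") == "description":
            if self.meta_description is None:
                self.meta_description = attrs.get("content")
        elif tag == "p" and self.first_paragraph is None:
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            self._in_p = False
            self.first_paragraph = "".join(self._p_parts).strip()

    def handle_data(self, data):
        if self._in_p:
            self._p_parts.append(data)

def describe(html):
    """Apply the priority order: meta description, then first paragraph."""
    parser = DescriptionParser()
    parser.feed(html)
    return parser.meta_description or parser.first_paragraph or "No relevant content found"

page_with_meta = '<html><head><meta name="description" content="Backup autopilot"></head><body><p>Hello</p></body></html>'
page_without_meta = "<body><p> First <b>para</b> </p><p>Second</p></body>"
print(describe(page_with_meta))
print(describe(page_without_meta))
print(describe("<div>x</div>"))
```

Meta descriptions are usually written as a one-line summary, which is why they come first; the first paragraph is a noisier fallback that may pick up cookie banners or navigation text on real sites.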

In [None]:
import pandas as pd
import requests
from bs4 import BeautifulSoup
from urllib.parse import urlparse
from concurrent.futures import ThreadPoolExecutor, as_completed
from tqdm import tqdm

df = pd.read_csv('/content/topstartups_data.csv')

def scrape_website(url):
    try:
        response = requests.get(url, timeout=10)
        soup = BeautifulSoup(response.content, 'html.parser')

        # Try to find the most relevant content
        content = soup.find('meta', attrs={'name': 'description'})
        # Some pages have a description meta tag without a content attribute
        if content and content.get('content'):
            return content['content']

        # If no meta description, try to get the first paragraph
        content = soup.find('p')
        if content:
            return content.text.strip()

        # If still no content, return the first 500 characters of the body
        body = soup.find('body')
        if body:
            return body.text.strip()[:500]

        return "No relevant content found"
    except Exception as e:
        return f"Error: {str(e)}"

def process_row(index, url):
    parsed_url = urlparse(url)
    if not parsed_url.scheme:
        url = 'https://' + url
    return index, scrape_website(url)

df['Description 2'] = ''

# parallel scraping
with ThreadPoolExecutor(max_workers=10) as executor:
    futures = {
        executor.submit(process_row, index, row['Website']): index
        for index, row in df.iterrows() if pd.notna(row['Website'])
    }

    for future in tqdm(as_completed(futures), total=len(futures), desc="Scraping websites"):
        index, description = future.result()
        df.at[index, 'Description 2'] = description

df.to_csv('topstartups_data.csv', index=False)

print("Scraping completed. Updated data saved to 'topstartups_data.csv'")

Scraping websites: 100%|██████████| 1239/1239 [02:25<00:00,  8.54it/s]

Scraping completed. Updated data saved to 'topstartups_data.csv'





# Company Similarity Matching with Investment Criteria

This Python script identifies the top startups matching Innovius Capital's investment criteria using `SentenceTransformer` for text embeddings and `cosine_similarity` for similarity computation.
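
Cosine similarity compares the direction of two vectors: the dot product divided by the product of their lengths. The sketch below ranks toy 3-dimensional vectors in pure Python. The vectors and the names `alpha`, `beta`, and `gamma` are made up, whereas real sentence embeddings have hundreds of dimensions.

```python
import math

def cosine(a, b):
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

criteria = [1.0, 1.0, 0.0]          # toy stand-in for the criteria embedding
companies = {
    "alpha": [2.0, 2.0, 0.0],       # same direction as the criteria, different length
    "beta": [1.0, 0.0, 0.0],        # partially aligned
    "gamma": [0.0, 0.0, 3.0],       # orthogonal
}

scores = {name: cosine(criteria, vec) for name, vec in companies.items()}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

Because only direction matters, `alpha` scores about 1.0 despite being twice as long as the criteria vector. This is why the measure suits comparing texts of different lengths.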


In [None]:
import pandas as pd
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from tqdm import tqdm
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {device}")

df = pd.read_csv('/content/topstartups_data.csv')

# Initialize the SentenceTransformer model
model = SentenceTransformer('roberta-base-nli-mean-tokens', device=device)

# Define investment criteria for the company
investment_criteria = """
B2B SaaS startups in Series B stage with strong growth potential,
focusing on companies with product-market fit and annual recurring revenue (ARR) of $4M+,
overcoming challenges in go-to-market repeatability and scalability,
leveraging playbooks, resources, and a network of operating partners
"""

# Encode the Innovius criteria
# `investment_criteria` is the variable defined above
innovius_embedding = model.encode([investment_criteria], show_progress_bar=True)[0]

def create_company_description(row):
    return f"{row['Company Name']} is a {row['Industry Tags']} company in {row['Funding Stage']} stage. {row['Description']} {row['Description 2']}"

df['company_description'] = df.apply(create_company_description, axis=1)

batch_size = 64
company_embeddings = []

print("Encoding company descriptions...")
for i in tqdm(range(0, len(df), batch_size)):
    batch = df['company_description'].iloc[i:i+batch_size].tolist()
    embeddings = model.encode(batch, show_progress_bar=False)
    company_embeddings.extend(embeddings)

print("Calculating similarity scores...")
similarities = cosine_similarity([innovius_embedding], company_embeddings)[0]

df['similarity_score'] = similarities

df_sorted = df.sort_values('similarity_score', ascending=False)

def display_top_matches(n=10):
    print(f"Top {n} matches for Innovius Capital:")
    for i, row in df_sorted.head(n).iterrows():
        print(f"\n{row['Company Name']} (Similarity: {row['similarity_score']:.4f})")
        print(f"Industry: {row['Industry Tags']}")
        print(f"Funding Stage: {row['Funding Stage']}")
        print(f"Description: {row['Description']}")
        print(f"Funding: {row['Funding Tags']}")
        print(f"Location: {row['Location']}")
        print(f"Founded: {row['Founded']}")
        print(f"Company Size: {row['Company_Size']}")
        print(f"Website: {row['Website']}")

display_top_matches(10)

def get_company_details(company_name):
    company = df[df['Company Name'] == company_name].iloc[0]
    print(f"\nCompany: {company['Company Name']}")
    print(f"Industry: {company['Industry Tags']}")
    print(f"Funding Stage: {company['Funding Stage']}")
    print(f"Description: {company['Description']}")
    print(f"Additional Description: {company['Description 2']}")
    print(f"Funding: {company['Funding Tags']}")
    print(f"Location: {company['Location']}")
    print(f"Founded: {company['Founded']}")
    print(f"Company Size: {company['Company_Size']}")
    print(f"Website: {company['Website']}")


Using device: cuda


Batches:   0%|          | 0/1 [00:00<?, ?it/s]

Encoding company descriptions...


100%|██████████| 20/20 [00:06<00:00,  3.26it/s]

Calculating similarity scores...
Top 10 matches for Innovius Capital:

Gem (Similarity: 0.7715)
Industry: Recruiting, Enterprise Software, SaaS
Funding Stage: Series C
Description: Helps teams source talent 5x faster, double response rates and manage relationships with passive talent
Funding: Y Combinator, Accel, $37M Series C in 2021, $1.2B valuation
Location: San Francisco Bay Area, California, USA
Founded: 2017.0
Company Size: 201-500 employees
Website: http://www.gem.com/

Clay (Similarity: 0.7310)
Industry: Artificial Intelligence, SaaS, Enterprise Software
Funding Stage: Series B
Description: Data enrichment solutions and personalized outreach automation
Funding: Sequoia, $46M Series B in 2024, $500M valuation
Location: New York, New York, USA
Founded: 2016.0
Company Size: 11-50 employees
Website: https://www.clay.com/

Statsig (Similarity: 0.7276)
Industry: SaaS, Enterprise Software, Testing
Funding Stage: Series B
Description: Statsig is a modern application building framework 




# **Using Gemini AI for Company Analysis and Recommendation**

This section uses Google Gemini to evaluate the highest-scoring startups against the investment criteria and to recommend candidates for investment. The goal is to pick the top 3 companies from the similarity-ranked list that best align with Innovius Capital's criteria.


In [None]:
import os
import google.generativeai as genai
from typing import List, Dict
from google.colab import userdata

os.environ['GEMINI_API_KEY'] = userdata.get('GEMINI_API_KEY')

genai.configure(api_key=os.environ['GEMINI_API_KEY'])

class GeminiApi:
    def __init__(self):
        self.model = genai.GenerativeModel("gemini-1.5-pro")

    def generate_response(self, prompt):
        try:
            response = self.model.generate_content(prompt)
            return response.text
        except Exception as e:
            print(f"An error occurred while generating response: {e}")
            return None

def get_top_companies(df_sorted: pd.DataFrame, n: int = 10) -> List[Dict]:
    top_companies = []
    for _, row in df_sorted.head(n).iterrows():
        company = {
            "Company Name": row['Company Name'],
            "Industry": row['Industry Tags'],
            "Funding Stage": row['Funding Stage'],
            "Description": row['Description'],
            "Additional Description": row['Description 2'],
            "Funding": row['Funding Tags'],
            "Location": row['Location'],
            "Founded": row['Founded'],
            "Company Size": row['Company_Size'],
            "Website": row['Website'],
            "Similarity Score": row['similarity_score']
        }
        top_companies.append(company)
    return top_companies

def custom_prompt(top_companies: List[Dict], innovius_criteria: str) -> str:
    # Use the criteria passed in as an argument rather than a global variable
    prompt = f"Investment Criteria:\n{innovius_criteria}\n\n"
    prompt += "Top 10 Companies:\n"
    for company in top_companies:
        prompt += f"Company: {company['Company Name']}\n"
        prompt += f"Industry: {company['Industry']}\n"
        prompt += f"Funding Stage: {company['Funding Stage']}\n"
        prompt += f"Description: {company['Description']}\n"
        prompt += f"Additional Description: {company['Additional Description']}\n"
        prompt += f"Funding: {company['Funding']}\n"
        prompt += f"Location: {company['Location']}\n"
        prompt += f"Founded: {company['Founded']}\n"
        prompt += f"Company Size: {company['Company Size']}\n"
        prompt += f"Website: {company['Website']}\n"
        prompt += f"Similarity Score: {company['Similarity Score']:.4f}\n\n"

    prompt += "Based on the Innovius Capital investment criteria and the information provided for the top 10 companies, please select the top 3 companies that are best suited for Innovius Capital to invest in. Provide a detailed explanation for each selected company, highlighting how they align with the Innovius Capital investment strategy and criteria."
    prompt += """
    Give the response in the following format but everything in markdown format:
    1. Company Name:
      * Alignment with Criteria:
      * Strong Growth Potential:
      * Why it stands out:

    2. Company Name:
      * Alignment with Criteria:
      * Strong Growth Potential:
      * Why it stands out:

    3. Company Name:
      * Alignment with Criteria:
      * Strong Growth Potential:
      * Why it stands out:

      Companies which did not make the top 3 and why are:
    """
    return prompt

# Get top 10 companies
top_10_companies = get_top_companies(df_sorted)
prompt = custom_prompt(top_10_companies, investment_criteria)
gemini_api = GeminiApi()
response = gemini_api.generate_response(prompt)

print("Top 3 recommended companies:")
print(response)

Top 3 recommended companies:
1. Company Name: **Clay**
   * Alignment with Criteria: Clay is a Series B B2B SaaS company focused on data enrichment and outreach automation, directly addressing the go-to-market challenges many businesses face. Their AI-driven approach suggests strong product-market fit, although ARR is not explicitly stated.  Assuming their $500M valuation is based on typical SaaS multiples, it implies a healthy ARR that likely meets the $4M+ threshold.  Innovius's playbooks and resources, especially in GTM, could greatly benefit Clay.
   * Strong Growth Potential:  The market for sales automation and data enrichment tools is large and growing rapidly. Clay's use of AI for crafting outreach positions them for significant growth as businesses increasingly adopt these technologies to improve sales efficiency.
   * Why it stands out:  Clay's innovative use of AI for personalized outreach is a key differentiator. Their focus on solving a core GTM challenge for sales teams m