In this project I will attempt to build a program that:
1. Scrapes information about start-ups from relevant VC portfolios
2. Passes that information through ChatGPT's API to make sense of it and filter it against pre-determined criteria
3. Presents me with my future employer (hopefully)...

1. Web Scraper

In [2]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

# VC portfolio page to scrape
url = 'https://www.balderton.com/companies/'
# Balderton uses Cloudflare, which blocks scraper requests, so we send a browser-like User-Agent.
HEADERS = {'User-Agent': 'Mozilla/5.0 (iPad; CPU OS 12_2 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Mobile/15E148'}

def scrape_vc_portfolio(url):
    response = requests.get(url, headers=HEADERS)
    soup = BeautifulSoup(response.content, 'html.parser')

    portfolio_data = []
    for investment in soup.find_all('div', class_='col-md-6 col-lg-4 js-shuffle-item'):
        # Extract name
        name_tag = investment.find('h3')
        name = name_tag.text.strip() if name_tag else None

        # Extract description
        description_tag = investment.find('p')
        description = description_tag.text.strip() if description_tag else None

        # Extract stage and date, e.g. "Seed, 2023" or "Series A in 2016"
        stage_date_tag = investment.find('span', class_='label-m fw-medium')
        if stage_date_tag:
            stage_date = stage_date_tag.text.strip()
            # Split only once so an extra separator cannot break the unpacking
            if ', ' in stage_date:
                stage, date_of_investment = stage_date.split(', ', 1)
            elif ' in ' in stage_date:
                stage, date_of_investment = stage_date.split(' in ', 1)
            else:
                stage, date_of_investment = stage_date, None
        else:
            stage, date_of_investment = None, None

        # Extract location
        location_tag = investment.find('span', class_='label-M d-block mb-auto')
        location = location_tag.text.strip() if location_tag else None

        portfolio_data.append({
            'Name': name,
            'Description': description,
            'Stage': stage,
            'Date of Investment': date_of_investment,
            'Location': location,
        })

    return portfolio_data


portfolio_data = scrape_vc_portfolio(url)

# Convert to a DataFrame and display it
df = pd.DataFrame(portfolio_data)

print(df)

            Name                                        Description     Stage  \
0           32co                                               None      Seed   
1        Adludio        Sensory mobile ads for agencies and brands.      Seed   
2    Agave Games  Mobile puzzle game developer & publisher build...      Seed   
3        Aircall  Aircall provides an integrated, easy to use, c...  Series A   
4        Andjaro  Manage temporary personnel transfers between s...  Series A   
..           ...                                                ...       ...   
171        Yokoy           The all-in-one spend management platform  Series A   
172   Yoox Group  Global Internet retailing partner for leading ...      2000   
173         Zego   Work insurance made flexible, simple and better.  Series A   
174          ZOE  Understand how your body responds to food so y...  Series B   
175         Zopa                                Peer-to-peer loans.  Series A   

    Date of Investment     

Great! Now I have a web scraper that can extract info from VC websites.
But I'm going to have to adapt the code to the HTML structure of each website. Since GPT-4o does not have access to the internet, I cannot (yet) ask it to find the correct HTML and automatically update the code.
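
One detail in the table above: the Yoox Group row shows 2000 in the Stage column, which suggests its label held only a year and fell through to the final else branch. Here is a minimal sketch of a more forgiving parser for these labels. The function name parse_stage_date and the regex are my own assumptions, based only on the label formats visible in the output ("Seed, 2023", "Series A in 2016", and a bare year).

```python
import re

# Matches labels like "Seed, 2023", "Series A in 2016" or a bare "2000".
# The pattern is an assumption based on the labels seen in the scraped output.
STAGE_DATE_RE = re.compile(
    r"^\s*(?:(?P<stage>.*?)\s*(?:,|\bin\b)\s*)?(?P<year>\d{4})\s*$"
)


def parse_stage_date(label):
    """Return (stage, year) from a stage/date label; either part may be None."""
    if not label:
        return None, None
    match = STAGE_DATE_RE.match(label)
    if match:
        # An empty or missing stage part becomes None
        stage = match.group("stage") or None
        return stage, int(match.group("year"))
    # No four-digit year at the end: treat the whole label as the stage
    return label.strip() or None, None
```

With this, parse_stage_date("2000") gives (None, 2000) rather than putting the year in the stage slot.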

New approach:
The web scraper will now just download all the text from the VC website (like a Ctrl-A copy), maybe using Selenium. Once all the text is downloaded, the GPT API can parse it to pull out the important bits.

In [5]:
import requests
from bs4 import BeautifulSoup

url = 'https://www.balderton.com/companies/'
output_file = 'scrapedtext.txt'
# Balderton uses Cloudflare, which blocks scraper requests, so we send a browser-like User-Agent.
HEADERS = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.83 Safari/537.36'}


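def scrape_page(url):
    response = requests.get(url, headers=HEADERS)
    soup = BeautifulSoup(response.content, 'html.parser')
    # Flatten the page to plain text, one text node per line
    pagetext = soup.get_text(separator='\n', strip=True)
    # Save the text so the LLM step can read it back from output_file
    with open(output_file, 'w', encoding='utf-8') as f:
        f.write(pagetext)
    print(pagetext)


scrape_page(url)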

Balderton Capital Portfolio | Balderton Capital
About Us
Wellbeing & Performance Platform
Our Values
Sustainable Future Goals
Our team
Companies
Founder spotlights
Careers
Resources
News
Europe’s most
exciting startup
s
All companies
Status
Exited
Live
Location
Denmark
Finland
France
Germany
Greece
Ireland
Israel
Italy
Mexico
Netherlands
Norway
Poland
PRC
Spain
Sweden
Switzerland
Turkey
UK
USA
All Categories
AI & machine learning
Consumer
Cyber security
Digital Health
Enterprise
Fintech
Marketplaces
Sustainable tech
Other
London, UK
32co
Seed, 2023
London, UK
Adludio
Sensory mobile ads for agencies and brands.
Seed in 2013
Istanbul, Turkey
Agave Games
Mobile puzzle game developer & publisher building unique experiences for audiences worldwide.
Seed in 2022
Paris, France
Aircall
Aircall provides an integrated, easy to use, cloud-based phone solution
Series A in 2016
Paris , France
Andjaro
Manage temporary personnel transfers between sites.
Series A in 2018
Berlin, Germany
Anytype
Seed 2

That's much easier, and it should work for every page I put in, without my having to do anything manually. Now I just need a way to parse the text and get the important bits out. I believe this is where the GPT API can help.

...

I've done a bit of digging and it seems that for my task (i.e. taking a text input and pulling out relevant information), using the ChatGPT API would be like using a jet engine to dry my hair (thanks 4o for that one). Also I have to pay for it, and I hate paying.

So I'm going to use the Llama 3 API, which I can access for free using Groq!

2. LLM API

Groq builds LPU inference engines. I think they want to show off how fast their engines are by giving users free access (obvs for bigger data loads you have to buy an enterprise plan). Also, they are probs stealing my data: it's not cheap to run...
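
One practical point before calling the model: the model name used in the next cell, llama3-70b-8192, suggests a context window of about 8,192 tokens, so a long scraped page may not fit in a single request. Here is a minimal sketch of splitting the text on line boundaries. The helper name chunk_text and the 8,000-character default are my own assumptions, based on a rough rule of thumb of about 4 characters per token and on leaving room for the reply.

```python
def chunk_text(text, max_chars=8000):
    """Split text into chunks of up to max_chars, breaking only between lines.

    A single line longer than max_chars is kept whole in its own chunk.
    """
    chunks, current, current_len = [], [], 0
    for line in text.splitlines():
        line_len = len(line) + 1  # +1 for the newline added back on join
        # Start a new chunk when adding this line would exceed the limit
        if current and current_len + line_len > max_chars:
            chunks.append("\n".join(current))
            current, current_len = [], 0
        current.append(line)
        current_len += line_len
    if current:
        chunks.append("\n".join(current))
    return chunks
```

Each chunk could then be sent as its own request. One caveat: a company's lines may be split across two chunks, so results near the boundaries would need checking.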

In [2]:
from groq import Groq

# Groq() reads the API key from the GROQ_API_KEY environment variable
client = Groq()

# Open the scraped text file
with open('scrapedtext.txt', 'r', encoding='utf-8') as scrapedtext:
    text = scrapedtext.read()

# Call the Llama 3 model with my prompt
completion = client.chat.completions.create(
    model="llama3-70b-8192",
    messages=[
        {
            "role": "user",
            "content": "extract the: company name, description, stage of investment, date of investment & Location from this copy and paste of a venture capital website:" + text + ". note that not all information may be there. " +
            "Since the data is from a website, note that the start and end of the file will be useless (e.g. website titles and contact information). Please then filter for companies who are: based in the UK, NOT to do with health & received funding post 2019. please still give me their descriptions"
        },
    ],
    temperature=0.1,
    max_tokens=4096,
    top_p=0.2,
    stream=True,
    stop=None,
)


for chunk in completion:
    print(chunk.choices[0].delta.content or "", end="")

After filtering the data, I've extracted the companies that meet the criteria:

**Companies based in the UK, NOT related to health, and received funding post 2019:**

1. **Anytype** (London, UK) - Seed 2023
	* Description: Not provided
2. **Attio** (London, UK) - Seed in 2021
	* Description: The data-driven CRM for modern teams
3. **Brigad** (London, UK) - Growth in 2023
	* Description: The leading European marketplace connecting skilled self-employed talents to hospitality and healthcare establishments for short-term jobs
4. **Carwow** (London, UK) - Series A in 2014
	* Description: Platform for buying new cars from franchise dealers
5. **Demodesk** (London, UK) - Series A in 2020
	* Description: Intelligent meeting platform for online sales
6. **Nested** (London, UK) - Series A in 2018
	* Description: The estate agent that guarantees your move
7. **Numeral** (London, UK) - Seed in 2021
	* Description: Unlocking business innovation through payment automation
8. **Primer** (London, UK)

3. Get a Job

Got stuck with the code for this.