# Scraping The 300 Most Valuable Startups in India on Failory

<img src="https://i.ibb.co/7kxTmF6/austin-distel-rxp-Th-Owu-Vg-E-unsplash.jpg" alt="austin-distel-rxp-Th-Owu-Vg-E-unsplash" border="0">

<big> **Web scraping** is the process of automatically extracting and parsing data from websites with a computer program. It is a useful technique for building datasets for research and learning.</big>

<big> **[Failory](https://www.failory.com/)** is a content site for startup founders. It publishes weekly interviews and short- and long-form articles aimed at helping readers become better founders.</big>

<big> **Project Goal**:

The final output is a list of the 300 most valuable startups in India, along with their key details.

The details collected for each startup are:
```
Company Name            : Name of the startup.
Description             : What the company does.
City                    : The city in which the startup was started.
Year                    : The year in which the startup was started.
Founders                : Names of the founders of the startup.
Industries              : Industry domains the startup falls under.
No. of Employees        : Number of employees in the startup.
Funding Amount          : Total funding raised by the startup (in USD).
Funding Round           : Number of funding rounds.
No. of Investors        : Number of investors in the startup.
```
</big>

## 1. Download and parse the web page using Requests and BeautifulSoup

### 1.1 Import necessary libraries and modules

The cell below imports the libraries and modules this web scraping project uses.

In [1]:
import re                       # for regular expressions
import requests                 # to download web pages
import numpy as np              # for numerical computing
import pandas as pd             # for data manipulation
from bs4 import BeautifulSoup   # parsing HTML content

### 1.2 Parse the HTML content of the web page with a function

In [2]:
def get_startups_india_page():
    
    """Returns the BeautifulSoup object for the web page at the URL 'https://www.failory.com/startups/india'.
    
    Raises:
        Exception: If the HTTP response status code is not 200 (OK).
    
    Returns:
        BeautifulSoup: A BeautifulSoup object representing the parsed HTML content of the web page.
    """
    
    startups_india_url = 'https://www.failory.com/startups/india'
    
    response = requests.get(startups_india_url)
    
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(startups_india_url))
        
    doc = BeautifulSoup(response.text, 'html.parser')
    
    return doc

The result is stored in `soup`, a BeautifulSoup object representing the parsed HTML content of the web page.

In [3]:
soup = get_startups_india_page()

In [4]:
project_name = soup.title.text
print(f"PROJECT NAME : {project_name}")

PROJECT NAME : The 300 Most Valuable Startups in India
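The fetch step can be hardened a little. The sketch below is one possible variant, not the notebook's own code: `fetch_html` and `parse_html` are hypothetical helper names, the 10-second timeout is an arbitrary choice, and passing the raw bytes (`response.content`) rather than `response.text` is an assumption about how to let BeautifulSoup work out the page encoding itself.

```python
import requests
from bs4 import BeautifulSoup


def fetch_html(url, timeout=10):
    """Download a page and return its raw bytes (hypothetical helper)."""
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses.
    response.raise_for_status()
    return response.content


def parse_html(html):
    """Parse HTML (bytes or str) with Python's built-in parser."""
    return BeautifulSoup(html, 'html.parser')


# Offline demonstration on a literal snippet, so no network access is needed.
sample = b'<html><head><title>Demo page</title></head><body></body></html>'
doc = parse_html(sample)
print(doc.title.text)
```

With a live connection, `parse_html(fetch_html('https://www.failory.com/startups/india'))` would stand in for `get_startups_india_page()`. Note that `raise_for_status()` only rejects error responses, which is slightly more lenient than the strict `!= 200` check above.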


## 2. Extract the information into Python lists using functions

### Company

In [5]:
def get_company(soup):
    
    """
    Retrieves and returns a list of company names from a BeautifulSoup object.

    Args:
        soup (BeautifulSoup): A BeautifulSoup object representing the parsed HTML content of a web page.

    Returns:
        list: A list of company names extracted from the given BeautifulSoup object.
    """
    
    company = []
    
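    for companies in soup.find_all('h3', limit=300):
        # Raw string avoids the invalid-escape warning for '\)'.
        # Note: this pattern removes every digit, ')' and 'amp;' anywhere in the heading.
        company_name = re.sub(r'[0-9]|\)|amp;', '', companies.text).strip()
        company.append(company_name)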

    return company
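Because the pattern above deletes digits wherever they occur, a company whose name contains a number would lose it. A minimal sketch of a narrower alternative follows, assuming each heading starts with a rank such as `1) `; that format is inferred from the regex, and `clean_heading` and the sample names are made up for illustration.

```python
import re


def clean_heading(text):
    """Strip a leading rank such as '12) ' and surrounding whitespace."""
    # ^\s*\d+\)\s* matches only a numeric prefix at the start of the string.
    return re.sub(r'^\s*\d+\)\s*', '', text).strip()


print(clean_heading('1) Acme'))      # Acme
print(clean_heading('12) 99Labs'))   # 99Labs

# The broader pattern also drops the digits that belong to the name:
print(re.sub(r'[0-9]|\)|amp;', '', '12) 99Labs').strip())   # Labs
```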

### City

In [6]:
def get_city(soup):
    
    """
    Retrieves and returns a list of city names from a BeautifulSoup object.

    Args:
        soup (BeautifulSoup): A BeautifulSoup object representing the parsed HTML content of a web page.

    Returns:
        list: A list of city names extracted from the given BeautifulSoup object.
    """
    
    city = []

    for details in soup.find_all('li'):
        city_name = details.text.split('City: ')
        if city_name[0] == '':
            city.append(city_name[1])

    return city

### Year

In [7]:
def get_year(soup):
    
    """
    Retrieves and returns a list of startup establishment years from a BeautifulSoup object.

    Args:
        soup (BeautifulSoup): A BeautifulSoup object representing the parsed HTML content of a web page.

    Returns:
        list: A list of startup establishment years extracted from the given BeautifulSoup object.
    """
    
    year = []
    
    for details in soup.find_all('li'):
        years = details.text.split('Started in: ')
        if years[0] == '':
            year.append(years[1])
    
    return year

### Founders

In [8]:
def get_founders(soup):
    
    """
    Returns a list of startup founders from a BeautifulSoup object.

    Args:
        soup (BeautifulSoup): A BeautifulSoup object representing the parsed HTML content of a web page.

    Returns:
        list: A list of startup founders extracted from the given BeautifulSoup object.
    """
    
    founders = []
    
    containers = soup.find_all('ul', {'role': 'list'})[:300]
    
    for container in containers:
        founder = container.find_all('li')[2].text.split('Founders: ')
        if len(founder) == 2:
            founders.append(founder[1])
        else:
            founders.append('N/A')
    
    return founders

### Industries

In [9]:
def get_industries(soup):
    
    """
    Returns a list of industries for the 300 most valuable startups in India.
    
    Args:
        soup (BeautifulSoup): A BeautifulSoup object representing the parsed HTML content of a web page.

    Returns:
        list: A list of industries for the 300 most valuable startups in India.
    """
    
    industries = []
    
    containers = soup.find_all('ul',{'role':'list'})[:300]
    
    for container in containers:
        details = container.find_all('li')
        flag = False
        
        for detail in details:
            if  'Industries: ' in detail.text:
                flag = True
                industry = detail.text.split( 'Industries: ')[1].strip()
                
        if flag:
            industries.append(industry)
        else:
            industries.append('N/A')
            
    return industries

### Number of employees

In [10]:
def get_number_of_employees(soup):
    
    """
    Scrape the number of employees for each startup from the Failory India page.
    
    Args:
        soup (BeautifulSoup): A BeautifulSoup object representing the parsed HTML content of a web page.

    Returns:
        A list of the number of employees for each startup.
    """

    number_of_employees = []
    
    containers = soup.find_all('ul',{'role':'list'})[:300]
    
    for container in containers:
        details = container.find_all('li')
        flag = False
        
        for detail in details:
            if 'Number of employees: ' in detail.text:
                flag = True
                employees = detail.text.split('Number of employees: ')[1].strip()
                
        if flag:
            number_of_employees.append(employees)
        else:
            number_of_employees.append(0)
    
    return number_of_employees

### Funding amount

In [11]:
def get_funding_amount(soup):
    
    """
    Extracts funding amounts from a given Beautiful Soup object and returns them in USD.

    Args:
        soup (BeautifulSoup): A Beautiful Soup object representing the HTML of a webpage.

    Returns:
        list: Funding amounts as numeric strings (0 when no amount is listed); they are cast to float later.
    """
    
    funding_amount = []
    
    containers = soup.find_all('ul',{'role':'list'})[:300]
    
    for container in containers:
        details = container.find_all('li')
        flag = False
        
        for detail in details:
            if 'Funding amount: ' in detail.text:
                flag = True
                amount = detail.text.split('Funding amount: ')[1].strip()
                
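        if flag:
            # 'â\x82¹' is the rupee sign (₹) as it appears when its UTF-8 bytes are decoded as Latin-1.
            if 'â\x82¹' in amount or '₹' in amount:
                # Keep only digits and the decimal point, then convert INR to USD
                # at a fixed, approximate rate of 0.012.
                amount = str(float(re.sub(r'[^0-9.]', '', amount)) * 0.012)
                funding_amount.append(amount)
            else:
                amount = re.sub('[,$]','',amount)
                funding_amount.append(amount)
        else:
            funding_amount.append(0)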
    
    return funding_amount
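The odd-looking `'â\x82¹'` check deserves a short explanation: it is what the rupee sign turns into when its UTF-8 bytes are read as Latin-1. The sketch below reproduces that and shows one way to undo it. `repair` and `to_usd` are hypothetical helpers, the sample amounts are made up, and the 0.012 rate is simply the constant used in the function above.

```python
import re

RUPEE = '\u20b9'  # the rupee sign

# UTF-8 bytes of the rupee sign, misread as Latin-1, give three characters.
mojibake = RUPEE.encode('utf-8').decode('latin-1')
print(mojibake == '\xe2\x82\xb9')   # True; '\xe2\x82\xb9' is the same string as 'â\x82¹'


def repair(text):
    """Undo a UTF-8-read-as-Latin-1 mix-up; return text unchanged if it does not apply."""
    try:
        return text.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


def to_usd(amount, inr_to_usd=0.012):
    """Convert a string like '$2,500' or a rupee amount to a float in USD."""
    amount = repair(amount)
    value = float(re.sub(r'[^0-9.]', '', amount))
    return value * inr_to_usd if amount.startswith(RUPEE) else value


print(to_usd('$2,500'))                 # 2500.0
print(to_usd('\xe2\x82\xb91,000,000'))  # roughly 12000 (subject to float rounding)
```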

### Number of funding rounds

In [12]:
def get_number_of_funding_rounds(soup):
    
    """
    Extracts the number of funding rounds from the given Beautiful Soup object.

    Args:
        soup (BeautifulSoup): A Beautiful Soup object representing the HTML of a webpage.

    Returns:
        list: The number of funding rounds per startup, as strings (0 when not listed); they are cast to int later.
    """
    
    number_of_funding_rounds = []
    
    containers = soup.find_all('ul',{'role':'list'})[:300]
    
    for container in containers:
        details = container.find_all('li')
        flag = False
        
        for detail in details:
            if 'Number of funding rounds: ' in detail.text:
                flag = True
                rounds = detail.text.split('Number of funding rounds: ')[1].strip()
                
        if flag:
            number_of_funding_rounds.append(rounds)
        else:
            number_of_funding_rounds.append(0)
    
    return number_of_funding_rounds

### Number of investors

In [13]:
def get_number_of_investors(soup):
    
    """
    Extracts the number of investors from the given Beautiful Soup object.

    Args:
        soup (BeautifulSoup): A Beautiful Soup object representing the HTML of a webpage.

    Returns:
        list: The number of investors per startup, as strings (0 when not listed); they are cast to int later.
    """
    
    number_of_investors = []
    
    containers = soup.find_all('ul',{'role':'list'})[:300]
    
    for container in containers:
        details = container.find_all('li')
        flag = False
        
        for detail in details:
            if 'Number of investors: ' in detail.text:
                flag = True
                investors = detail.text.split('Number of investors: ')[1].strip()
                
        if flag:
            number_of_investors.append(investors)
        else:
            number_of_investors.append(0)
    
    return number_of_investors
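The industries, employees, funding amount, funding rounds, and investors functions all follow the same pattern: walk each `<ul role="list">`, look for an `<li>` containing a label, and fall back to a default. As a sketch of how that shared logic could be factored out, the hypothetical `get_field` helper below takes the label as a parameter. The HTML is made-up markup that only mimics the structure those functions rely on.

```python
from bs4 import BeautifulSoup


def get_field(soup, label, default='N/A', limit=300):
    """Return one value per <ul role="list">, taken from the <li> containing `label`."""
    values = []
    for container in soup.find_all('ul', {'role': 'list'})[:limit]:
        value = default
        for item in container.find_all('li'):
            if label in item.text:
                value = item.text.split(label)[1].strip()
        values.append(value)
    return values


# Made-up sample: the second startup has no "Industries: " entry.
html = """
<ul role="list">
  <li>City: Bengaluru</li>
  <li>Industries: Fintech</li>
  <li>Number of investors: 12</li>
</ul>
<ul role="list">
  <li>City: Mumbai</li>
  <li>Number of investors: 3</li>
</ul>
"""
demo = BeautifulSoup(html, 'html.parser')
print(get_field(demo, 'Industries: '))                       # ['Fintech', 'N/A']
print(get_field(demo, 'Number of investors: ', default=0))   # ['12', '3']
```

Because a value is appended for every container, found or not, the returned list always has one entry per startup, which keeps the columns aligned.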

### Description

In [14]:
def get_description(soup):
    
    """
    Extracts the descriptions from the given Beautiful Soup object.

    Args:
        soup (BeautifulSoup): A Beautiful Soup object representing the HTML of a webpage.

    Returns:
        list: A list of strings representing the descriptions extracted from the Beautiful Soup object.
    """
    
    description = []
    
    for container in soup.find_all('figure'):
        info = container.find_next_sibling('p').text
        description.append(info)
        
    return description
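`find_next_sibling` returns `None` when a `<figure>` has no following `<p>`, in which case `.text` would raise an `AttributeError`. The variant below is a sketch that guards against this; `get_description_safe`, the `'N/A'` default, and the sample markup are assumptions for illustration.

```python
from bs4 import BeautifulSoup


def get_description_safe(soup, default='N/A'):
    """Like get_description, but tolerate figures without a following <p>."""
    descriptions = []
    for figure in soup.find_all('figure'):
        paragraph = figure.find_next_sibling('p')
        # find_next_sibling returns None when no matching sibling exists.
        if paragraph is not None:
            descriptions.append(paragraph.text.strip())
        else:
            descriptions.append(default)
    return descriptions


# Made-up sample: the second figure has no sibling paragraph.
html = """
<div><figure></figure><p>Builds payment tools.</p></div>
<div><figure></figure></div>
"""
demo = BeautifulSoup(html, 'html.parser')
print(get_description_safe(demo))   # ['Builds payment tools.', 'N/A']
```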

## 3. Save the extracted information to a CSV file

The extracted lists are combined into a pandas DataFrame with `pd.DataFrame`.

The DataFrame has ten columns named 'Company', 'Description', 'City', 'Year', 'Founders', 'Industries', 'No. of Employees', 'Funding Amount', 'Funding Round', and 'No. of Investors'.

Each column holds the list returned by the corresponding function above.

In [15]:
df = pd.DataFrame({
    'Company': get_company(soup),
    'Description': get_description(soup),
    'City': get_city(soup),
    'Year': get_year(soup),
    'Founders': get_founders(soup),
    'Industries':get_industries(soup),
    'No. of Employees': get_number_of_employees(soup),
    'Funding Amount': get_funding_amount(soup),
    'Funding Round' : get_number_of_funding_rounds(soup),
    'No. of Investors': get_number_of_investors(soup)
})
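Building a DataFrame from a dictionary of lists only works when every list has the same length; otherwise pandas raises a `ValueError`. Since `get_city` and `get_year` append a value only when their label is found, a quick length check before the constructor can make a mismatch easier to diagnose. The `check_equal_lengths` helper and the sample data below are hypothetical.

```python
def check_equal_lengths(columns):
    """Return the length of each column, raising ValueError if they differ."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f'Column lengths differ: {lengths}')
    return lengths


# Made-up data: 'City' is one entry short, as if one startup had no "City: " item.
columns = {
    'Company': ['Acme', 'Globex', 'Initech'],
    'City': ['Pune', 'Delhi'],
}
try:
    check_equal_lengths(columns)
except ValueError as error:
    print(error)
```

In the notebook, the same check could be run on the dictionary of function results before it is passed to `pd.DataFrame`.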

The next cell casts selected columns of the DataFrame to the specified data types.

In [16]:
convert_dtype = {
    'Year': 'int32',
    'Funding Amount': 'float64',
    'Funding Round': 'int32',
    'No. of Investors': 'int32'
}

df = df.astype(convert_dtype)
print(df.dtypes)

Company              object
Description          object
City                 object
Year                  int32
Founders             object
Industries           object
No. of Employees     object
Funding Amount      float64
Funding Round         int32
No. of Investors      int32
dtype: object
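To see what the cast does in isolation, here is a small sketch on made-up rows shaped like the function results (strings, or `0` where a value was missing). It also shows `pd.to_numeric` with `errors='coerce'`, which is one option when a column might contain text that `astype` cannot parse; whether that is needed for this dataset is not something the notebook establishes.

```python
import pandas as pd

# Made-up rows: strings from the page, or 0 where the label was missing.
demo = pd.DataFrame({
    'Year': ['2015', '2018'],
    'Funding Amount': ['1200000', 0],
    'Funding Round': ['4', 0],
})

demo = demo.astype({
    'Year': 'int32',
    'Funding Amount': 'float64',
    'Funding Round': 'int32',
})
print(demo.dtypes)

# errors='coerce' turns unparseable text into NaN instead of raising.
coerced = pd.to_numeric(pd.Series(['12', 'unknown']), errors='coerce')
print(coerced)
```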


Use the `to_csv` method to store the data in CSV format.

In [17]:
df.to_csv('The 300 Most Valuable Startups in India.csv')
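By default `to_csv` also writes the row index as an unnamed first column. If that column is not wanted, `index=False` leaves it out. The sketch below demonstrates this on a made-up two-row frame, writing to an in-memory buffer instead of a file so that it has no side effects.

```python
import io

import pandas as pd

demo = pd.DataFrame({'Company': ['Acme', 'Globex'], 'Year': [2015, 2018]})

buffer = io.StringIO()
# index=False omits the row index from the output.
demo.to_csv(buffer, index=False)
print(buffer.getvalue())

# Read the CSV text back to confirm the columns survive the round trip.
buffer.seek(0)
restored = pd.read_csv(buffer)
print(list(restored.columns))
```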

## 4. Project Summary


This project scrapes the Failory website to obtain details of the 300 most valuable startups in India. The data extracted includes:

- Company name
- Description
- City
- Year
- Founders
- Industries
- Number of employees
- Funding amount
- Funding round
- Number of investors

The project uses Python with the Requests, BeautifulSoup, and pandas libraries for web scraping and data manipulation. A separate function extracts each category of data into a Python list. Finally, the lists are combined into a pandas DataFrame for easy manipulation and analysis, and the DataFrame is saved to a CSV file.

## 5. References

- Failory: https://www.failory.com/startups/india
- Python: https://www.python.org/
- Requests: https://docs.python-requests.org/en/latest/
- BeautifulSoup: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
- pandas: https://pandas.pydata.org/
- re: https://docs.python.org/3/library/re.html
- NumPy: https://numpy.org/doc/stable/