In [1]:
import pandas as pd
import requests
import numpy as np
from bs4 import BeautifulSoup

In [3]:
requests.get('https://www.ambitionbox.com/list-of-companies?page=1').text

'<HTML><HEAD>\n<TITLE>Access Denied</TITLE>\n</HEAD><BODY>\n<H1>Access Denied</H1>\n \nYou don\'t have permission to access "http&#58;&#47;&#47;www&#46;ambitionbox&#46;com&#47;list&#45;of&#45;companies&#63;" on this server.<P>\nReference&#32;&#35;18&#46;8eb61160&#46;1707462655&#46;7500ffc\n</BODY>\n</HTML>\n'

We get this error because some websites block requests from specific user agents (the string that identifies a browser or bot). Modifying the User-Agent header in the request might help in some cases.

In [2]:
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'}
webpage = requests.get('https://www.ambitionbox.com/list-of-companies?page=1', headers=headers).text

When web scraping, including appropriate headers in your HTTP requests can help mimic the behavior of a regular browser and reduce the chances of being blocked or running into access issues.
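
Before parsing, it can be worth confirming that the server returned the real page rather than the denial page shown earlier. The sketch below is one possible check and is not part of the original notebook: looks_blocked is a hypothetical helper that flags a non-200 status code or an "Access Denied" title in the body. The first sample string is an abridged copy of the denial page above; the second is a made-up stand-in for a normal page, and the status codes passed in are illustrative only.

```python
import re

def looks_blocked(status_code: int, html: str) -> bool:
    """Heuristic: True if the response looks like a denial page."""
    if status_code != 200:
        return True
    # The denial page shown earlier carries this title; other sites may word it differently.
    title = re.search(r"<title>(.*?)</title>", html, flags=re.IGNORECASE | re.DOTALL)
    return bool(title and "access denied" in title.group(1).lower())

# Abridged copy of the denial page from the first request.
denied = "<HTML><HEAD>\n<TITLE>Access Denied</TITLE>\n</HEAD><BODY>\n<H1>Access Denied</H1>\n</BODY>\n</HTML>\n"
# Made-up stand-in for a page that loaded normally.
ok = "<html><head><title>Some listing page</title></head><body></body></html>"

print(looks_blocked(403, denied))
print(looks_blocked(200, denied))
print(looks_blocked(200, ok))
```

With requests, the two arguments would be response.status_code and response.text, which means keeping the response object instead of calling .text on it straight away.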

# Creating a variable to extract the necessary information

In [3]:
soup = BeautifulSoup(webpage, 'lxml')

The line soup = BeautifulSoup(webpage, 'lxml') creates a BeautifulSoup object named soup by parsing the HTML document stored in the variable webpage. The second argument, 'lxml', specifies the parser to use.

'lxml': BeautifulSoup supports several parsers, and 'lxml' is one of them. It refers to the lxml library, a high-performance library for processing XML and HTML that is installed separately from BeautifulSoup. Using 'lxml' often results in faster parsing than Python's built-in 'html.parser'.

The purpose of using BeautifulSoup is to provide a convenient way to navigate and search the HTML or XML tree. Once you have a BeautifulSoup object (soup), you can use its methods and attributes to extract information from the parsed document easily.
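
As a rough illustration of what navigating and searching the tree looks like, here is a minimal sketch on a tiny hand-written document. The HTML content is invented for the example, and it uses Python's built-in 'html.parser' so that it runs without lxml; the notebook itself uses 'lxml'.

```python
from bs4 import BeautifulSoup

# A tiny hand-written document standing in for a downloaded page (hypothetical content).
html = """
<html><body>
  <h1 class="title">Sample heading</h1>
  <p>First paragraph with a <a href="/one">link</a>.</p>
  <p>Second paragraph.</p>
</body></html>
"""

# 'html.parser' ships with Python; the notebook itself uses 'lxml'.
doc = BeautifulSoup(html, "html.parser")

heading = doc.h1.text                  # attribute-style navigation to the first <h1>
heading_class = doc.h1["class"]        # tag attributes are read like dictionary keys
paragraphs = doc.find_all("p")         # search the whole tree for every <p>
link_target = doc.find("a")["href"]    # first <a> tag, then its href attribute

print(heading)
print(heading_class)
print(len(paragraphs))
print(link_target)
```

Note that class comes back as a list, because an HTML element can carry several classes.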

In [None]:
print(soup.prettify())

The line print(soup.prettify()) outputs the HTML content of a BeautifulSoup object in a formatted, visually structured way.

prettify(): This BeautifulSoup method returns a string containing a "pretty-printed" version of the HTML or XML tree, in which each tag sits on its own line and is indented according to its nesting depth.

This is especially useful for debugging and for understanding the structure of the document before deciding which tags and classes to search for.

In [4]:
soup.find_all('h1')

[<h1 class="companyListing__title">
 							List of companies in India
 						</h1>]

find_all('h1') returns all the h1 tags on the page; in this case there is only one heading.

In [10]:
soup.find_all('h1')[0].text

'\n\t\t\t\t\t\t\tList of companies in India\n\t\t\t\t\t\t'

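find_all returns a list-like result even when there is a single match, so we use indexing to pick the element we want (with several matches, a different index would select a different one). .text then returns the element's content as a plain string, including the surrounding tabs and newlines.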
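
The whitespace in that result can be removed in two ways. The sketch below reuses the h1 markup from the output above and parses it with the built-in 'html.parser' (an assumption made so the example runs without lxml). It compares str.strip() with BeautifulSoup's get_text(strip=True).

```python
from bs4 import BeautifulSoup

# Markup copied from the h1 output above, including its tabs and newlines.
html = '<h1 class="companyListing__title">\n\t\t\t\t\t\t\tList of companies in India\n\t\t\t\t\t\t</h1>'
doc = BeautifulSoup(html, "html.parser")

h1 = doc.find_all("h1")[0]
raw = h1.text                      # keeps the surrounding whitespace
stripped = h1.text.strip()         # str.strip() removes it afterwards
same = h1.get_text(strip=True)     # BeautifulSoup can strip while extracting

print(repr(raw))
print(stripped)
print(same == stripped)
```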

# Fetching all the names

In [16]:
for i in soup.find_all('h2'):
    print(i.text.strip())

TCS
Accenture
Cognizant
Wipro
Capgemini
HDFC Bank
ICICI Bank
Infosys
HCLTech
Tech Mahindra
Genpact
Axis Bank
Teleperformance
Concentrix Corporation
Jio
Amazon
IBM
Larsen & Toubro Limited
Reliance Retail
HDB Financial Services
Companies by  Industry
Companies by  Locations
Companies by  Type
Companies by  Badges


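We loop over all the h2 tags and print each one, using strip() to remove the surrounding whitespace. Note that the last four entries (Companies by Industry, Locations, Type and Badges) are section headings on the page, not company names.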
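
One way to leave out those extra headings is to look for h2 tags only inside the company cards. The sketch below uses simplified, hypothetical markup: the companyCardWrapper class name comes from this notebook, while the footerLinks class and the exact nesting are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Simplified, hypothetical markup: two company cards plus one unrelated section heading.
html = """
<div class="companyCardWrapper"><h2> TCS </h2></div>
<div class="companyCardWrapper"><h2> Accenture </h2></div>
<div class="footerLinks"><h2>Companies by  Industry</h2></div>
"""
doc = BeautifulSoup(html, "html.parser")

# Every h2 on the page, including the section heading.
all_h2 = [h.text.strip() for h in doc.find_all("h2")]

# Search for h2 only inside the company cards, so page-level headings are skipped.
names = [
    card.find("h2").text.strip()
    for card in doc.find_all("div", class_="companyCardWrapper")
]

print(all_h2)
print(names)
```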

# Fetching rating

In [72]:

ratings = soup.find_all('span', class_='companyCardWrapper__companyRatingValue')

for rating in ratings:
    print(rating.text.strip())


3.8
4.0
3.9
3.8
3.9
3.9
4.0
3.8
3.7
3.7
3.9
3.9
3.6
3.9
4.0
4.1
4.1
4.0
3.9
4.0


When we use find_all in BeautifulSoup, it returns a ResultSet, which is essentially a list-like object containing all the elements that match the specified criteria. Since several elements could match, we receive a collection.

When we iterate over the ResultSet with a loop, we can access each individual element in the collection. This is necessary because each element may have different content, and we may want to perform specific operations on each one.
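
The list-like behaviour can be seen in a small sketch. The snippet is hypothetical markup that reuses the rating class name from the cell above; converting the text to float is an extra step added here for illustration and is not done in the notebook.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet reusing the rating class name from the cell above.
html = """
<span class="companyCardWrapper__companyRatingValue"> 3.8 </span>
<span class="companyCardWrapper__companyRatingValue"> 4.0 </span>
<span class="companyCardWrapper__companyRatingValue"> 3.9 </span>
"""
doc = BeautifulSoup(html, "html.parser")
ratings = doc.find_all("span", class_="companyCardWrapper__companyRatingValue")

count = len(ratings)                                 # a ResultSet supports len() ...
first = ratings[0].text.strip()                      # ... indexing ...
values = [float(r.text.strip()) for r in ratings]    # ... and iteration
no_match = doc.find_all("table")                     # no matches: an empty ResultSet, not an error

print(count, first, values, len(no_match))
```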

In [4]:
company = soup.find_all('div', class_ = 'companyCardWrapper')

Next, we fetch the container (card) of each company first, and then extract the individual fields from within each container.

In [31]:
Name = []
Rating = []
Reviews = []
Industry = []
Employees = []
Domain = []
Years_in_Operation = []
HQ = []
for i in company:
    # find() returns the first match inside this card
    Name.append(i.find('h2').text.strip())
    Rating.append(i.find('span', class_='companyCardWrapper__companyRatingValue').text.strip())
    Reviews.append(i.find('a', class_='companyCardWrapper__ActionWrapper').text.strip())
    # The remaining details arrive as a single '|'-separated string
    info = i.find('div', class_='companyCardWrapper__interLinkingWrapper').text.strip()
    info_parts = info.split('|')
    # Unpack only when all five fields are present; otherwise store placeholders
    if len(info_parts) > 4:
        Industry.append(info_parts[0].strip())
        Employees.append(info_parts[1].strip())
        Domain.append(info_parts[2].strip())
        Years_in_Operation.append(info_parts[3].strip())
        HQ.append(info_parts[4].strip())
    else:
        Industry.append('N/A')
        Employees.append('N/A')
        Domain.append('N/A')
        Years_in_Operation.append('N/A')
        HQ.append('N/A')

The page is divided into 20 divs, one per company, so rather than fetching every field for the whole page at once, we loop over the blocks: extract the information from the first block, then move on to the next. While fetching data, always ask whether the block will give you a single value or multiple values; that tells you whether to use find or find_all.

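One detail worth keeping in mind with find is that it returns None when nothing matches, so calling .text on the result raises an AttributeError. The sketch below shows one way to guard against that. text_or_none is a hypothetical helper, and the markup is invented (only the class names are taken from this notebook); the second card deliberately has no rating element.

```python
from typing import Optional

from bs4 import BeautifulSoup
from bs4.element import Tag

def text_or_none(parent: Tag, name: str, class_: Optional[str] = None) -> Optional[str]:
    """Return the stripped text of the first matching tag, or None if it is absent."""
    found = parent.find(name, class_=class_) if class_ else parent.find(name)
    return found.text.strip() if found is not None else None

# Hypothetical cards: the second one has no rating element.
html = """
<div class="companyCardWrapper">
  <h2>TCS</h2><span class="companyCardWrapper__companyRatingValue">3.8</span>
</div>
<div class="companyCardWrapper"><h2>Accenture</h2></div>
"""
doc = BeautifulSoup(html, "html.parser")

cards = doc.find_all("div", class_="companyCardWrapper")   # many cards per page: find_all
rows = [
    {
        "Name": text_or_none(card, "h2"),                   # one name per card: find
        "Rating": text_or_none(card, "span", "companyCardWrapper__companyRatingValue"),
    }
    for card in cards
]
print(rows)
```
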
In [32]:
d = {'Name':Name, 'Rating':Rating, 'Reviews':Reviews, 'Industry':Industry, 'Employee':Employees, 'Domain':Domain, 'Years_in_Operation':Years_in_Operation, 'HQ':HQ}
df = pd.DataFrame(d)

In [35]:
df

Unnamed: 0,Name,Rating,Reviews,Industry,Employee,Domain,Years_in_Operation,HQ
0,Panasonic Appliances,4.1,91 Reviews,,,,,
1,Paras Spices,3.7,91 Reviews,,,,,
2,Augustus Healthcare Solutions,3.8,91 Reviews,,,,,
3,SKS POWER GENERATION,3.2,91 Reviews,Power,201-500 Employees,Public,16 years old,Mumbai +13 more
4,Veative Labs,3.0,91 Reviews,,,,,
5,Jarvis Technology and Strategy Consulting,3.3,91 Reviews,,,,,
6,Simplify Healthcare,3.2,91 Reviews,,,,,
7,GTPL-KCBPL,3.9,91 Reviews,,,,,
8,Meritto,3.2,91 Reviews,Internet,201-500 Employees,Startup,7 years old,Gurgaon +9 more
9,G10X,4.6,91 Reviews,,,,,


In [5]:
import numpy as np

companydf = pd.DataFrame()

for j in range(1, 400):
    url = 'https://www.ambitionbox.com/list-of-companies?page={}'.format(j)
    print(f"Scraping data from page {j}, URL: {url}")
    
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'}
    webpage = requests.get(url, headers=headers).text
    soup = BeautifulSoup(webpage, 'lxml')
    company = soup.find_all('div', class_='companyCardWrapper')
    
    data = []
    
    for i in company:
        try:
            Name = i.find('h2').text.strip()
            Rating = i.find('span', class_='companyCardWrapper__companyRatingValue').text.strip()
            Reviews = i.find('a', class_='companyCardWrapper__ActionWrapper').text.strip()
            info = i.find('span', class_='companyCardWrapper__interLinking').text.strip()
            info_parts = info.split('|')
            
            Industry = info_parts[0].strip() if len(info_parts) > 0 else np.nan
            Employees = info_parts[1].strip() if len(info_parts) > 1 else np.nan
            Domain = info_parts[2].strip() if len(info_parts) > 2 else np.nan
            Years_in_Operation = info_parts[3].strip() if len(info_parts) > 3 else np.nan
            HQ = info_parts[4].strip() if len(info_parts) > 4 else np.nan
            
            data.append({'Name': Name, 'Rating': Rating, 'Reviews': Reviews,
                         'Industry': Industry, 'Employees': Employees, 'Domain': Domain,
                         'Years_in_Operation': Years_in_Operation, 'HQ': HQ})
        except Exception as e:
            print(f"Error scraping data: {e}")

    df1 = pd.DataFrame(data)
    companydf = pd.concat([companydf, df1], ignore_index=True)

print(f"Total data length: {len(companydf)}")


Scraping data from page 1, URL: https://www.ambitionbox.com/list-of-companies?page=1
Scraping data from page 2, URL: https://www.ambitionbox.com/list-of-companies?page=2
Scraping data from page 3, URL: https://www.ambitionbox.com/list-of-companies?page=3
Scraping data from page 4, URL: https://www.ambitionbox.com/list-of-companies?page=4
Scraping data from page 5, URL: https://www.ambitionbox.com/list-of-companies?page=5
Scraping data from page 6, URL: https://www.ambitionbox.com/list-of-companies?page=6
Scraping data from page 7, URL: https://www.ambitionbox.com/list-of-companies?page=7
Scraping data from page 8, URL: https://www.ambitionbox.com/list-of-companies?page=8
Scraping data from page 9, URL: https://www.ambitionbox.com/list-of-companies?page=9
Scraping data from page 10, URL: https://www.ambitionbox.com/list-of-companies?page=10
Scraping data from page 11, URL: https://www.ambitionbox.com/list-of-companies?page=11
Scraping data from page 12, URL: https://www.ambitionbox.com/

Scraping data from page 96, URL: https://www.ambitionbox.com/list-of-companies?page=96
Scraping data from page 97, URL: https://www.ambitionbox.com/list-of-companies?page=97
Scraping data from page 98, URL: https://www.ambitionbox.com/list-of-companies?page=98
Scraping data from page 99, URL: https://www.ambitionbox.com/list-of-companies?page=99
Scraping data from page 100, URL: https://www.ambitionbox.com/list-of-companies?page=100
Scraping data from page 101, URL: https://www.ambitionbox.com/list-of-companies?page=101
Scraping data from page 102, URL: https://www.ambitionbox.com/list-of-companies?page=102
Scraping data from page 103, URL: https://www.ambitionbox.com/list-of-companies?page=103
Scraping data from page 104, URL: https://www.ambitionbox.com/list-of-companies?page=104
Scraping data from page 105, URL: https://www.ambitionbox.com/list-of-companies?page=105
Scraping data from page 106, URL: https://www.ambitionbox.com/list-of-companies?page=106
Scraping data from page 107, 

Scraping data from page 189, URL: https://www.ambitionbox.com/list-of-companies?page=189
Scraping data from page 190, URL: https://www.ambitionbox.com/list-of-companies?page=190
Scraping data from page 191, URL: https://www.ambitionbox.com/list-of-companies?page=191
Scraping data from page 192, URL: https://www.ambitionbox.com/list-of-companies?page=192
Scraping data from page 193, URL: https://www.ambitionbox.com/list-of-companies?page=193
Scraping data from page 194, URL: https://www.ambitionbox.com/list-of-companies?page=194
Scraping data from page 195, URL: https://www.ambitionbox.com/list-of-companies?page=195
Scraping data from page 196, URL: https://www.ambitionbox.com/list-of-companies?page=196
Scraping data from page 197, URL: https://www.ambitionbox.com/list-of-companies?page=197
Scraping data from page 198, URL: https://www.ambitionbox.com/list-of-companies?page=198
Scraping data from page 199, URL: https://www.ambitionbox.com/list-of-companies?page=199
Scraping data from pa

Scraping data from page 282, URL: https://www.ambitionbox.com/list-of-companies?page=282
Scraping data from page 283, URL: https://www.ambitionbox.com/list-of-companies?page=283
Scraping data from page 284, URL: https://www.ambitionbox.com/list-of-companies?page=284
Scraping data from page 285, URL: https://www.ambitionbox.com/list-of-companies?page=285
Scraping data from page 286, URL: https://www.ambitionbox.com/list-of-companies?page=286
Scraping data from page 287, URL: https://www.ambitionbox.com/list-of-companies?page=287
Scraping data from page 288, URL: https://www.ambitionbox.com/list-of-companies?page=288
Scraping data from page 289, URL: https://www.ambitionbox.com/list-of-companies?page=289
Scraping data from page 290, URL: https://www.ambitionbox.com/list-of-companies?page=290
Scraping data from page 291, URL: https://www.ambitionbox.com/list-of-companies?page=291
Scraping data from page 292, URL: https://www.ambitionbox.com/list-of-companies?page=292
Scraping data from pa

Scraping data from page 375, URL: https://www.ambitionbox.com/list-of-companies?page=375
Scraping data from page 376, URL: https://www.ambitionbox.com/list-of-companies?page=376
Scraping data from page 377, URL: https://www.ambitionbox.com/list-of-companies?page=377
Scraping data from page 378, URL: https://www.ambitionbox.com/list-of-companies?page=378
Scraping data from page 379, URL: https://www.ambitionbox.com/list-of-companies?page=379
Scraping data from page 380, URL: https://www.ambitionbox.com/list-of-companies?page=380
Scraping data from page 381, URL: https://www.ambitionbox.com/list-of-companies?page=381
Scraping data from page 382, URL: https://www.ambitionbox.com/list-of-companies?page=382
Scraping data from page 383, URL: https://www.ambitionbox.com/list-of-companies?page=383
Scraping data from page 384, URL: https://www.ambitionbox.com/list-of-companies?page=384
Scraping data from page 385, URL: https://www.ambitionbox.com/list-of-companies?page=385
Scraping data from pa

Import Libraries:

import pandas as pd: Imports the Pandas library for handling data in tabular form.
import requests: Enables making HTTP requests to retrieve web pages.
from bs4 import BeautifulSoup: Allows for easy extraction of data from HTML.
import numpy as np: Imports NumPy for numerical operations.

Initialize DataFrame:

companydf = pd.DataFrame(): Creates an empty Pandas DataFrame named companydf to store the scraped data.

Loop through Pages:

for j in range(1, 400): Iterates through page numbers from 1 to 399. This assumes that there are 399 pages to scrape.

HTTP Request and Parsing:

url = 'https://www.ambitionbox.com/list-of-companies?page={}'.format(j): Constructs the URL for the current page.
webpage = requests.get(url, headers=headers).text: Sends an HTTP GET request to the URL and retrieves the HTML content.
soup = BeautifulSoup(webpage, 'lxml'): Creates a BeautifulSoup object to parse the HTML content.

Extract Company Information:

company = soup.find_all('div', class_='companyCardWrapper'): Finds all HTML elements with the specified class that represent company cards on the webpage.
Loop through each company card and extract relevant information:
Name = i.find('h2').text.strip(): Company name.
Rating = i.find('span', class_='companyCardWrapper__companyRatingValue').text.strip(): Company rating.
Reviews = i.find('a', class_='companyCardWrapper__ActionWrapper').text.strip(): Number of reviews.
info = i.find('span', class_='companyCardWrapper__interLinking').text.strip(): Additional information about the company.
Split the additional information into parts using info.split('|').
Assign each part to variables (Industry, Employees, Domain, Years_in_Operation, HQ).
If a part is missing, assign np.nan (NumPy's representation of missing values).
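
The split-and-assign step can also be written as one small function. This is only a sketch of the same idea, not the notebook's code: split_info and FIELDS are hypothetical names, missing parts become None here rather than np.nan so that the example needs no third-party imports, and the sample strings are assembled from values in the table above on the assumption that the page separates them with '|'.

```python
from typing import Dict, Optional

FIELDS = ["Industry", "Employees", "Domain", "Years_in_Operation", "HQ"]

def split_info(info: str) -> Dict[str, Optional[str]]:
    """Split a '|'-separated string into the five named fields, None where missing."""
    parts = [p.strip() for p in info.split("|")]
    # Pad with None so that zip() has a value for every field name.
    parts += [None] * (len(FIELDS) - len(parts))
    return dict(zip(FIELDS, parts))

full = split_info("Power | 201-500 Employees | Public | 16 years old | Mumbai +13 more")
short = split_info("Internet | 201-500 Employees")

print(full)
print(short)
```

Padding the list first avoids writing one length check per field.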

Error Handling:

Encloses the information extraction in a try-except block to handle any exceptions that may occur during scraping.
If an error occurs, it prints an error message and moves on to the next company card.

Create DataFrame for Current Page:

df1 = pd.DataFrame(data): Creates a DataFrame (df1) from the extracted data on the current page.

Concatenate DataFrames:

companydf = pd.concat([companydf, df1], ignore_index=True): Concatenates the current page's DataFrame with the overall DataFrame (companydf).

Print Progress:

print(f"Scraping data from page {j}, URL: {url}"): Prints the page number and URL at the start of each iteration.
print(f"Total data length: {len(companydf)}"): Prints the total number of rows in the DataFrame after scraping all pages.
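
A possible variation on this loop, sketched under a few assumptions: collect the rows from every page in one list, pause briefly between requests, and stop at the first page that yields no rows. scrape_pages, fake_fetch and fake_parse are hypothetical names. The fetch and parse steps are passed in as functions so that the sketch runs offline with stand-in data, and the early stop is an assumption about how the site behaves past its last page (an empty page could also mean a blocked request).

```python
import time
from typing import Callable, Dict, List

Row = Dict[str, str]

def scrape_pages(
    fetch: Callable[[int], str],
    parse: Callable[[str], List[Row]],
    last_page: int,
    delay_seconds: float = 1.0,
) -> List[Row]:
    """Collect rows from pages 1..last_page, stopping early at the first empty page."""
    rows: List[Row] = []
    for page in range(1, last_page + 1):
        page_rows = parse(fetch(page))
        if not page_rows:
            break                      # nothing parsed: assume we ran past the last page
        rows.extend(page_rows)
        time.sleep(delay_seconds)      # pause between requests to go easy on the server
    return rows

# Stand-ins so the sketch runs offline: three fake pages with two rows each.
def fake_fetch(page: int) -> str:
    return f"page-{page}" if page <= 3 else ""

def fake_parse(html: str) -> List[Row]:
    return [{"Name": f"{html}-a"}, {"Name": f"{html}-b"}] if html else []

rows = scrape_pages(fake_fetch, fake_parse, last_page=10, delay_seconds=0)
print(len(rows))
```

In the notebook's setting, fetch would wrap requests.get(url, headers=headers).text and parse would hold the card-extraction loop; pd.DataFrame(rows) then builds the table once at the end instead of concatenating on every page.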

In [11]:
companydf.shape

(7980, 8)

In [14]:
companydf.to_csv('company_data.csv', index=False)