In [1]:
import pandas
import requests
from bs4 import BeautifulSoup

In [2]:
requests.get('https://www.ambitionbox.com/list-of-companies?campaign=desktop_nav&page=1')

<Response [403]>

**HTTP response status codes**<br>

HTTP response status codes indicate whether a specific HTTP request has been successfully completed. Responses are grouped in five classes:

Informational responses (100 – 199)<br>
Successful responses (200 – 299)<br>
Redirection messages (300 – 399)<br>
Client error responses (400 – 499)<br>
Server error responses (500 – 599)<br>

# **USE OF HEADERS**
Headers in the context of web scraping, specifically the 'User-Agent' header, play a crucial role in making your HTTP requests mimic those of a web browser. This is important because some websites use user agent information to identify the type of client making the request. The 'User-Agent' header typically contains information about the browser and operating system.<br>

### **Purpose of Headers in Web Scraping:**
**Simulating a Browser**:<br>
Many websites expect requests to come from browsers rather than automated scripts. By including a 'User-Agent' header that mimics a common browser, you make your request look more like a legitimate user request.<br>

**Avoiding Detection and Blocking:**<br>
Some websites may block or limit access to requests that don't have a valid 'User-Agent' header. Including a user agent helps avoid being flagged as a bot or being subject to anti-scraping measures.

In [3]:
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.162 Safari/537.36'}

web=requests.get('https://www.ambitionbox.com/list-of-companies?campaign=desktop_nav&page=1',headers=headers).text

**BeautifulSoup(web, 'lxml'):** This line creates a BeautifulSoup object named soup by parsing the HTML or XML content stored in the variable web.<br>
 **The second argument, 'lxml'**, specifies the parser to be used. In this case, the 'lxml' parser is being used, which is a fast and feature-rich XML and HTML parsing library for Python.

In [4]:
soup=BeautifulSoup(web,'lxml')

In [5]:
(soup.find_all('h2'))


[<h2 class="companyCardWrapper__companyName" title="TCS">
 										TCS
 									</h2>,
 <h2 class="companyCardWrapper__companyName" title="Accenture">
 										Accenture
 									</h2>,
 <h2 class="companyCardWrapper__companyName" title="Cognizant">
 										Cognizant
 									</h2>,
 <h2 class="companyCardWrapper__companyName" title="Wipro">
 										Wipro
 									</h2>,
 <h2 class="companyCardWrapper__companyName" title="HDFC Bank">
 										HDFC Bank
 									</h2>,
 <h2 class="companyCardWrapper__companyName" title="ICICI Bank">
 										ICICI Bank
 									</h2>,
 <h2 class="companyCardWrapper__companyName" title="Infosys">
 										Infosys
 									</h2>,
 <h2 class="companyCardWrapper__companyName" title="Capgemini">
 										Capgemini
 									</h2>,
 <h2 class="companyCardWrapper__companyName" title="HCLTech">
 										HCLTech
 									</h2>,
 <h2 class="companyCardWrapper__companyName" title="Tech Mahindra">
 										Tech Mahindra
 									</h2>

# **To get the inside text only**

`soup.find_all('h2',class_="companyCardWrapper__companyName"):` <br>This part searches for all h2 tags with the specified class "companyCardWrapper__companyName" within the BeautifulSoup object soup. The `find_all` method returns a ResultSet containing all matching elements.

for i in ...: This sets up a loop to iterate over each matching element found by `find_all`.

`print(i.text.strip()):` This prints the stripped text content of each matching element. The text attribute extracts the text content within the HTML tag, and `strip()` removes any leading or trailing whitespace.

In [6]:
for i in soup.find_all('h2',class_="companyCardWrapper__companyName"):
  print(i.text.strip())

TCS
Accenture
Cognizant
Wipro
HDFC Bank
ICICI Bank
Infosys
Capgemini
HCLTech
Tech Mahindra
Genpact
Axis Bank
Teleperformance
Concentrix Corporation
Reliance Jio
Amazon
IBM
Larsen & Toubro Limited
Reliance Retail
HDB Financial Services


In [7]:
import pandas as pd
import numpy as np

company=soup.find_all('div',class_="companyCardWrapper__primaryInformation")

In [8]:
name=[]
rating=[]
reviews=[]
bad_reviews=[]
ctype=[]
how_old=[]


for i in company:

  name.append(i.find('h2',class_="companyCardWrapper__companyName").text.strip())
  rating.append(i.find('span',class_="companyCardWrapper__companyRatingValue").text.strip())
  reviews.append(i.find('span',class_="companyCardWrapper__ratingValues").text.strip())
  bad_reviews.append(i.find_all_next('span',class_="companyCardWrapper__ratingValues")[1].text.strip())
  ctype.append(i.find('span', class_='companyCardWrapper__interLinking').get_text(strip=True).split('|')[0].strip())

  span_element = i.find('span', class_='companyCardWrapper__interLinking')
  if span_element:
      text_content = span_element.get_text(strip=True)
      split_result = text_content.split('|')
      company_age = split_result[3].strip() if len(split_result) > 4 else split_result[2].strip()
      how_old.append(company_age)
  else:
      how_old.append(np.nan)


df=pd.DataFrame({'name':name
                 ,
   'rating':rating,
   'highly_rated_for':reviews,
   'critically_rated_for':ctype,
   'company_type':ctype,
   'Company_Age':how_old
   })


In [9]:
df

Unnamed: 0,name,rating,highly_rated_for,critically_rated_for,company_type,Company_Age
0,TCS,3.8,"Job Security, Work Life Balance",IT Services & Consulting,IT Services & Consulting,56 years old
1,Accenture,4.0,"Company Culture, Job Security, Skill Developme...",IT Services & Consulting,IT Services & Consulting,35 years old
2,Cognizant,3.9,Skill Development / Learning,IT Services & Consulting,IT Services & Consulting,30 years old
3,Wipro,3.8,Job Security,IT Services & Consulting,IT Services & Consulting,79 years old
4,HDFC Bank,3.9,"Job Security, Skill Development / Learning",Banking,Banking,30 years old
5,ICICI Bank,4.0,"Job Security, Skill Development / Learning, Co...",Banking,Banking,30 years old
6,Infosys,3.8,"Job Security, Company Culture, Skill Developme...",IT Services & Consulting,IT Services & Consulting,43 years old
7,Capgemini,3.8,"Job Security, Work Life Balance, Skill Develop...",IT Services & Consulting,IT Services & Consulting,57 years old
8,HCLTech,3.6,Job Security,IT Services & Consulting,IT Services & Consulting,33 years old
9,Tech Mahindra,3.7,"Promotions / Appraisal, Salary & Benefits",IT Services & Consulting,IT Services & Consulting,38 years old
