## Web Scrapping


##### Whenever a website from which we need data isn't providing api,the only option left to get that data is web scrapping.

##### For web-scrapping , we need  BeautifulSoup library form the bs4 package.

In [119]:
import pandas as pd
import requests
from bs4 import BeautifulSoup

##### In the case of some websites ,when we try to web scrap data from their website , access is denied from them as they dont want any bot like us to scrap their data .In such cases ,we get repsonse 403 which means permission is denied .We can have a look to it -->


In [120]:
requests.get('https://www.ambitionbox.com/list-of-companies?page=1')


<Response [403]>

##### We can actually get more info about this error using text attribute.

In [121]:
requests.get('https://www.ambitionbox.com/list-of-companies?page=1').text

'<HTML><HEAD>\n<TITLE>Access Denied</TITLE>\n</HEAD><BODY>\n<H1>Access Denied</H1>\n \nYou don\'t have permission to access "http&#58;&#47;&#47;www&#46;ambitionbox&#46;com&#47;list&#45;of&#45;companies&#63;" on this server.<P>\nReference&#32;&#35;18&#46;b5fed417&#46;1702040695&#46;74ef948a\n</BODY>\n</HTML>\n'

##### In such cases ,we use a parameter called headers to make the webiste believe that request is made by any human via a browser.So in this case,it grants us acces.

In [122]:
headers={'User-Agent':'Mozilla/5.0 (Windows NT 6.3; Win 64 ; x64) Apple WeKit /537.36(KHTML , like Gecko) Chrome/80.0.3987.162 Safari/537.36'}

In [123]:
webpage=requests.get('https://www.ambitionbox.com/list-of-companies?page=1',headers=headers).text

##### So ,in above line we have stored all the html cotnet of webpage inside the webpage variable.

##### Now beautifulsoup is used to extract the data that we want from the fetched hmtl content. We use BeautifulSoup() which takes 2 arguments ,contents of webpage i.e in our case is stored in webpage and second is html parser format which is 'lxml'.

In [124]:
soup=BeautifulSoup(webpage, 'lxml')

##### Now we can access the BeautifulSoup class functions and attributes as we have create aobject of it called as soup.We have prettify() which is used to make our html content look pretty i.e in nice order .You should print it ,so as it to see its output.This step helps us in understanding the structure of webpge.

In [125]:
####print(soup.prettify())

##### We use find_all() to find any div/tags or anythin.For e.g if we want to find all the h1 tags on the webpage,we can pass it as argument.

In [126]:
soup.find_all('h1')

[<h1 class="companyListing__title">
 							List of companies in India
 						</h1>]

##### As we can see ,there is only one h1 tag element.It is stored in list.To access the text inside this element .we can use [ ] as we are working with list and select 0 as list only has one element and we want to fetch first elements detail.

In [127]:
soup.find_all('h1')[0]

<h1 class="companyListing__title">
							List of companies in India
						</h1>

##### Now ,to extract the text ,we can use text attribute.

In [128]:
soup.find_all('h1')[0].text

'\n\t\t\t\t\t\t\tList of companies in India\n\t\t\t\t\t\t'

In [129]:
soup.find_all('h2')

[<h2 class="companyCardWrapper__companyName" title="TCS">
 										TCS
 									</h2>,
 <h2 class="companyCardWrapper__companyName" title="Accenture">
 										Accenture
 									</h2>,
 <h2 class="companyCardWrapper__companyName" title="Cognizant">
 										Cognizant
 									</h2>,
 <h2 class="companyCardWrapper__companyName" title="Wipro">
 										Wipro
 									</h2>,
 <h2 class="companyCardWrapper__companyName" title="HDFC Bank">
 										HDFC Bank
 									</h2>,
 <h2 class="companyCardWrapper__companyName" title="ICICI Bank">
 										ICICI Bank
 									</h2>,
 <h2 class="companyCardWrapper__companyName" title="Infosys">
 										Infosys
 									</h2>,
 <h2 class="companyCardWrapper__companyName" title="Capgemini">
 										Capgemini
 									</h2>,
 <h2 class="companyCardWrapper__companyName" title="HCLTech">
 										HCLTech
 									</h2>,
 <h2 class="companyCardWrapper__companyName" title="Tech Mahindra">
 										Tech Mahindra
 									</h2>

##### As we can see ,there are more no og h2 tags on the webpage.We can also use len() to calculate the total elements ,as len() is used to calculate length of list.

In [130]:
len(soup.find_all('h2'))

24

##### To print all the comapny's names ,we can use for loop and to remove the whitespace characters we can use strip().

In [131]:
for i in soup.find_all('h2'):
    print(i.text.strip())

TCS
Accenture
Cognizant
Wipro
HDFC Bank
ICICI Bank
Infosys
Capgemini
HCLTech
Tech Mahindra
Genpact
Axis Bank
Concentrix Corporation
Amazon
Teleperformance
Reliance Jio
IBM
Larsen & Toubro Limited
Reliance Retail
HDB Financial Services
Companies by  Industry
Companies by  Locations
Companies by  Type
Companies by  Badges


##### As we can see ,last three rows are not names of companies.So we can remove them using remove().

In [132]:
list=soup.find_all('h2')

In [133]:
list=list[:-4]

In [134]:
for i in list:
    print(i.text.strip())

TCS
Accenture
Cognizant
Wipro
HDFC Bank
ICICI Bank
Infosys
Capgemini
HCLTech
Tech Mahindra
Genpact
Axis Bank
Concentrix Corporation
Amazon
Teleperformance
Reliance Jio
IBM
Larsen & Toubro Limited
Reliance Retail
HDB Financial Services


##### Now to extract the ratings of each company ,we need to extract the spans in html content.But there exists some spans which are showcasing some other info .To differentiate with such elements ,we can use the class of elements which is specified while bulding website for each element.We need to pass class argument which is set to the class name of the element which we want to scrap.

In [135]:
soup.find_all('span',class_='companyCardWrapper__companyRatingValue')

[<span class="companyCardWrapper__companyRatingValue">3.8</span>,
 <span class="companyCardWrapper__companyRatingValue">4.0</span>,
 <span class="companyCardWrapper__companyRatingValue">3.9</span>,
 <span class="companyCardWrapper__companyRatingValue">3.8</span>,
 <span class="companyCardWrapper__companyRatingValue">3.9</span>,
 <span class="companyCardWrapper__companyRatingValue">4.0</span>,
 <span class="companyCardWrapper__companyRatingValue">3.9</span>,
 <span class="companyCardWrapper__companyRatingValue">3.8</span>,
 <span class="companyCardWrapper__companyRatingValue">3.7</span>,
 <span class="companyCardWrapper__companyRatingValue">3.7</span>,
 <span class="companyCardWrapper__companyRatingValue">3.9</span>,
 <span class="companyCardWrapper__companyRatingValue">3.8</span>,
 <span class="companyCardWrapper__companyRatingValue">3.9</span>,
 <span class="companyCardWrapper__companyRatingValue">4.1</span>,
 <span class="companyCardWrapper__companyRatingValue">3.6</span>,
 <span cla

In [136]:
len(soup.find_all('span',class_='companyCardWrapper__companyRatingValue'))

20

##### Now ,extracting the no of reviews of companies -->

In [137]:
soup.find_all('span',class_='companyCardWrapper__ActionCount')

[<span class="companyCardWrapper__ActionCount">69.2k</span>,
 <span class="companyCardWrapper__ActionCount">847.6k</span>,
 <span class="companyCardWrapper__ActionCount">5.8k</span>,
 <span class="companyCardWrapper__ActionCount">603</span>,
 <span class="companyCardWrapper__ActionCount">11.4k</span>,
 <span class="companyCardWrapper__ActionCount">74</span>,
 <span class="companyCardWrapper__ActionCount">43.8k</span>,
 <span class="companyCardWrapper__ActionCount">577.8k</span>,
 <span class="companyCardWrapper__ActionCount">4.1k</span>,
 <span class="companyCardWrapper__ActionCount">5.9k</span>,
 <span class="companyCardWrapper__ActionCount">7k</span>,
 <span class="companyCardWrapper__ActionCount">39</span>,
 <span class="companyCardWrapper__ActionCount">39.5k</span>,
 <span class="companyCardWrapper__ActionCount">556.7k</span>,
 <span class="companyCardWrapper__ActionCount">3.4k</span>,
 <span class="companyCardWrapper__ActionCount">563</span>,
 <span class="companyCardWrapper__Acti

##### Now ,instead of extracting part one by one ,we can extract a div where the info of all companies lies.

In [138]:
company=soup.find_all('div',class_='companyCardWrapper')

In [139]:
len(company)

20

##### Now,we will loop through each div.We can also use find() to find that particular element where we have applied this method.It returns us element instead of list of elements.

In [140]:
name=[]
ratings=[]
reviews=[]
info=[]

for i in company:
    name.append(i.find('h2').text.strip())
    ratings.append(i.find('span',class_='companyCardWrapper__companyRatingValue').text.strip())
    reviews.append(i.find('span',class_='companyCardWrapper__ActionCount').text.strip())
    info.append(i.find('span',class_='companyCardWrapper__interLinking').text.strip())
    

In [141]:
info

['IT Services & Consulting | 1 Lakh+ Employees | Public | 55 years old | Mumbai +321 more',
 'IT Services & Consulting | 1 Lakh+ Employees | Public | 34 years old | Dublin +160 more',
 'IT Services & Consulting | 1 Lakh+ Employees | Forbes Global 2000 | 29 years old | Teaneck. New Jersey. +140 more',
 'IT Services & Consulting | 1 Lakh+ Employees | Public | 78 years old | Bangalore/Bengaluru +267 more',
 'Banking | 1 Lakh+ Employees | Public | 29 years old | Mumbai +1475 more',
 'Banking | 1 Lakh+ Employees | Public | 29 years old | Mumbai +1234 more',
 'IT Services & Consulting | 1 Lakh+ Employees | Public | 42 years old | Bengaluru/Bangalore +159 more',
 'IT Services & Consulting | 1 Lakh+ Employees | Public | 56 years old | Paris +110 more',
 'IT Services & Consulting | 1 Lakh+ Employees | Public | 32 years old | Noida +172 more',
 'IT Services & Consulting | 1 Lakh+ Employees | Public | 37 years old | Pune +251 more',
 'IT Services & Consulting | 1 Lakh+ Employees | Public | 26 yea

##### Now ,we can create a dataframe using DataFrame().It takes a dictionary as argument, where key is the column name and values is the list of data.

In [142]:
dict={'name':name,'ratings':ratings,'reviews':reviews,'info':info}

In [143]:
pd.DataFrame(dict)

Unnamed: 0,name,ratings,reviews,info
0,TCS,3.8,69.2k,IT Services & Consulting | 1 Lakh+ Employees |...
1,Accenture,4.0,43.8k,IT Services & Consulting | 1 Lakh+ Employees |...
2,Cognizant,3.9,39.5k,IT Services & Consulting | 1 Lakh+ Employees |...
3,Wipro,3.8,36.8k,IT Services & Consulting | 1 Lakh+ Employees |...
4,HDFC Bank,3.9,31.7k,Banking | 1 Lakh+ Employees | Public | 29 year...
5,ICICI Bank,4.0,31.7k,Banking | 1 Lakh+ Employees | Public | 29 year...
6,Infosys,3.9,30k,IT Services & Consulting | 1 Lakh+ Employees |...
7,Capgemini,3.8,27.9k,IT Services & Consulting | 1 Lakh+ Employees |...
8,HCLTech,3.7,26.7k,IT Services & Consulting | 1 Lakh+ Employees |...
9,Tech Mahindra,3.7,26k,IT Services & Consulting | 1 Lakh+ Employees |...


##### Now ,we have scrap the above data for only 1 page ,now we will do it for 10 pages.

In [150]:
final=pd.DataFrame()

for j in range(0,11):
    url='https://www.ambitionbox.com/list-of-companies?page={}'.format(j)
    headers={'User-Agent':'Mozilla/5.0 (Windows NT 6.3; Win 64 ; x64) Apple WeKit /537.36(KHTML , like Gecko) Chrome/80.0.3987.162 Safari/537.36'}
    webpage=requests.get(url=url,headers=headers).text
    soup=BeautifulSoup(webpage,'lxml')
    name=[]
    ratings=[]
    reviews=[]
    info=[]

    for i in company:
        name.append(i.find('h2').text.strip())
        ratings.append(i.find('span',class_='companyCardWrapper__companyRatingValue').text.strip())
        reviews.append(i.find('span',class_='companyCardWrapper__ActionCount').text.strip())
        info.append(i.find('span',class_='companyCardWrapper__interLinking').text.strip())
        
    dict={'name':name,'ratings':ratings,'reviews':reviews,'info':info}
    df=pd.DataFrame(dict)
    
    final=pd.concat([final,df],ignore_index=True)



In [151]:
final


Unnamed: 0,name,ratings,reviews,info
0,TCS,3.8,69.2k,IT Services & Consulting | 1 Lakh+ Employees |...
1,Accenture,4.0,43.8k,IT Services & Consulting | 1 Lakh+ Employees |...
2,Cognizant,3.9,39.5k,IT Services & Consulting | 1 Lakh+ Employees |...
3,Wipro,3.8,36.8k,IT Services & Consulting | 1 Lakh+ Employees |...
4,HDFC Bank,3.9,31.7k,Banking | 1 Lakh+ Employees | Public | 29 year...
...,...,...,...,...
215,Reliance Jio,3.9,19k,Telecom | 50k-1 Lakh Employees | Public | 16 y...
216,IBM,4.1,18.9k,Software Product | 1 Lakh+ Employees | Public ...
217,Larsen & Toubro Limited,4.0,17.7k,Engineering & Construction | 10k-50k Employees...
218,Reliance Retail,3.9,17.3k,Retail | 1 Lakh+ Employees | 17 years old | Na...
