# Web Scraping

## Top Data Analytics Companies
To make their business processes more effective, companies are focusing on collecting and using data. Data analytics companies help businesses analyze the data they acquire and put it to use. Data analytics services can assist with product development, identifying potential market gaps, improving operational efficiency, etc. [Goodfirms](https://www.goodfirms.co/big-data-analytics/data-analytics)

The aim of this project is to scrape the Goodfirms website for details about the top data analytics companies. These details include reviews, ratings, year founded, location, etc.

### Libraries

In [2]:
import requests as r
from bs4 import BeautifulSoup


### Download and save HTML file

In [3]:
# # URL link
# url = 'https://www.goodfirms.co/big-data-analytics/data-analytics'
# # access website
# html = r.get(url)
# with open('top_da_company.html', mode='wb') as file:
#     file.write(html.content)

In [6]:
with open("top_da_company.html", encoding='utf-8', mode='r') as file:
    bs = BeautifulSoup(file, 'lxml')
    # bs = BeautifulSoup(file, 'html5lib')
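
The download cell is commented out so that the page is fetched only once and later runs read the saved copy. A minimal sketch of the same idea as an explicit guard is shown below. `load_html` and its `fetch` parameter are hypothetical names introduced for illustration; when `fetch` is omitted, the sketch falls back to `requests.get`, as in the commented-out cell.

```python
from pathlib import Path


def load_html(path, url, fetch=None):
    """Return the page bytes, downloading only if `path` does not exist yet.

    `fetch` is a hypothetical hook (a callable taking a URL and returning
    bytes); when omitted, requests.get is used.
    """
    path = Path(path)
    if path.exists():
        return path.read_bytes()  # reuse the saved copy, no network access

    if fetch is None:
        import requests  # imported lazily: not needed when the file is cached

        def fetch(target):
            response = requests.get(target, timeout=30)
            response.raise_for_status()  # fail loudly on HTTP errors
            return response.content

    content = fetch(url)
    path.write_bytes(content)  # save for later runs
    return content
```

Calling `load_html('top_da_company.html', url)` twice would hit the website at most once, which keeps repeated notebook runs from re-requesting the page.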

### Name Extraction

In [7]:
firm_names = bs.find_all('span', {'itemprop': "name"})
firm_names[:5]

[<span itemprop="name">Home</span>,
 <span itemprop="name">big data analytics</span>,
 <span itemprop="name">
 Data Analytics </span>,
 <span itemprop="name">SPEC INDIA</span>,
 <span itemprop="name">instinctools</span>]

In [8]:
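# The first three matches are breadcrumb entries (Home, big data analytics,
# Data Analytics), so the firm names start at index 3.
names_lst = [name.get_text(strip=True) for name in firm_names[3:]]

print(len(names_lst))
# names_lst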

51


In [12]:
firm_taglines = bs.find_all('p', {'class': "profile-tagline"})
firm_taglines[:5]

[<p class="profile-tagline">Enterprise Software, Mobility &amp; BI Solutions</p>,
 <p class="profile-tagline">Delivering the future. Now.</p>,
 <p class="profile-tagline">Blockchain | IoT | Mobility | AI | Big Data</p>,
 <p class="profile-tagline">Discover the world of Big Data with us!</p>,
 <p class="profile-tagline">The Hub Of Data Science Innovation</p>]

In [16]:
tagline_lst = []
for tagline in firm_taglines:
    tagline_lst.append(tagline.text)

print(len(tagline_lst))
# tagline_lst

51


In [15]:
firm_reviews = bs.find_all('span', {'class': "listinv_review_label"})
firm_reviews[:5]

[<span class="listinv_review_label">4.8 (26 Reviews)</span>,
 <span class="listinv_review_label">4.8 (8 Reviews)</span>,
 <span class="listinv_review_label">5.0 (32 Reviews)</span>,
 <span class="listinv_review_label">4.7 (5 Reviews)</span>,
 <span class="listinv_review_label">5.0 (5 Reviews)</span>]

In [18]:
review_lst = []
for review in firm_reviews:
    review_lst.append(review.text)
    
print(len(review_lst))
# review_lst

51
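
Each review label packs two values into one string, the rating and the number of reviews. A minimal sketch of separating them with a regular expression follows. `parse_review` is a hypothetical helper, and the pattern assumes every label has the shape seen in the sample output, e.g. `4.8 (26 Reviews)`.

```python
import re

# Pattern for labels shaped like "4.8 (26 Reviews)"; assumed from the sample output.
REVIEW_PATTERN = re.compile(r"(\d+(?:\.\d+)?)\s*\((\d+)\s+Reviews?\)")


def parse_review(label):
    """Split a review label into (rating, review_count); None if it does not match."""
    match = REVIEW_PATTERN.search(label)
    if match is None:
        return None
    return float(match.group(1)), int(match.group(2))


sample = ["4.8 (26 Reviews)", "5.0 (32 Reviews)", "4.7 (5 Reviews)"]
print([parse_review(label) for label in sample])
# [(4.8, 26), (5.0, 32), (4.7, 5)]
```

Returning `None` for an unexpected label, rather than raising, makes it easy to spot firms whose label deviates from the assumed format.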


In [9]:
progress_value = bs.find_all('div', {'class': "circle-progress-value"})
progress_value[:5]

[<div class="circle-progress-value">20%</div>,
 <div class="circle-progress-value">15%</div>,
 <div class="circle-progress-value">5%</div>,
 <div class="circle-progress-value">10%</div>,
 <div class="circle-progress-value">15%</div>]

In [10]:
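service_pct = []
platform_pct = []
# Each firm contributes two consecutive values: even positions are stored
# as the service percentage, odd positions as the platform percentage.
for index, value in enumerate(progress_value):
    if index % 2 == 0:
        service_pct.append(value.text)
    else:
        platform_pct.append(value.text)

print(len(service_pct))
print(len(platform_pct))
# service_pct
# platform_pct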

51
51
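
The even/odd split can also be written with slice steps, and the percentage strings can be converted to integers for later analysis. This is a small sketch on plain strings; `split_alternating` and `percent_to_int` are hypothetical helper names, and the sample values are taken from the output shown earlier.

```python
def split_alternating(values):
    """Return (even-position items, odd-position items) from one flat list."""
    return values[0::2], values[1::2]


def percent_to_int(text):
    """Convert a string such as '20%' to the integer 20."""
    return int(text.strip().rstrip("%"))


sample = ["20%", "15%", "5%", "10%"]
service, platform = split_alternating(sample)
print([percent_to_int(v) for v in service])   # [20, 5]
print([percent_to_int(v) for v in platform])  # [15, 10]
```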


In [19]:
firm_prices = bs.find_all('div', {'class': "firm-pricing"})
firm_prices[:5]

[<div class="firm-pricing">
 &lt; $25/hr </div>,
 <div class="firm-pricing">
 $50 - $99/hr </div>,
 <div class="firm-pricing">
 $25 - $49/hr </div>,
 <div class="firm-pricing">
 $25 - $49/hr </div>,
 <div class="firm-pricing">
 $25 - $49/hr </div>]

In [24]:
price_lst = []
for price in firm_prices:
    # get_text(strip=True) drops the newline and padding around each price
    price_lst.append(price.get_text(strip=True))

print(len(price_lst))
# price_lst

51
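
The pricing strings are ranges, so they need to be parsed before they can be compared numerically. The sketch below handles only the two shapes visible in the sample output (`< $25/hr` and `$50 - $99/hr`) and tolerates surrounding whitespace. `parse_hourly_rate` is a hypothetical helper, and any other shape is reported as unrecognised rather than guessed at.

```python
import re


def parse_hourly_rate(text):
    """Return (low, high) in dollars per hour, or None for an unknown format.

    A missing lower bound is returned as None, e.g. (None, 25) for "< $25/hr".
    """
    cleaned = " ".join(text.split())  # collapse newlines and padding
    amounts = [int(n) for n in re.findall(r"\$(\d+)", cleaned)]
    if cleaned.startswith("<") and len(amounts) == 1:
        return None, amounts[0]  # "< $25/hr": only an upper bound
    if len(amounts) == 2:
        return amounts[0], amounts[1]  # "$50 - $99/hr"
    return None  # unrecognised format


print(parse_hourly_rate("\n < $25/hr "))      # (None, 25)
print(parse_hourly_rate("\n $50 - $99/hr "))  # (50, 99)
```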


In [25]:
firm_emps = bs.find_all('div', {'class': "firm-employees"})
firm_emps[:5]

[<div class="firm-employees">250 - 999</div>,
 <div class="firm-employees">250 - 999</div>,
 <div class="firm-employees">50 - 249</div>,
 <div class="firm-employees">250 - 999</div>,
 <div class="firm-employees">10 - 49</div>]

In [27]:
emps_lst = []
for emp in firm_emps:
    emps_lst.append(emp.text)
    
print(len(emps_lst))
# emps_lst

51


In [28]:
firm_years = bs.find_all('div', {'class': "firm-founded"})
firm_years[:5]

[<div class="firm-founded">1987</div>,
 <div class="firm-founded">2000</div>,
 <div class="firm-founded">2014</div>,
 <div class="firm-founded">2010</div>,
 <div class="firm-founded">2016</div>]

In [31]:
year_lst = []
for year in firm_years:
    year_lst.append(year.text)
    
print(len(year_lst))
# year_lst

51


In [32]:
firm_locations = bs.find_all('div', {'class': "firm-location"})
firm_locations[:5]

[<div class="firm-location">
 India, United States </div>,
 <div class="firm-location">
 United States, Germany </div>,
 <div class="firm-location">
 United States, India </div>,
 <div class="firm-location">
 United States, Australia </div>,
 <div class="firm-location">
 United States, India </div>]

In [41]:
# Strip the newline and padding around each location string
location_lst = [firm.get_text(strip=True) for firm in firm_locations]
print(len(location_lst))
# location_lst

51
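
Every extracted list has 51 entries, so the lists can be combined into one record per firm, assuming they all follow the same page order. A minimal sketch using only the standard library is given below. `build_records` and `to_csv` are hypothetical helpers, and the two-firm sample reuses values from the outputs above.

```python
import csv
import io


def build_records(columns):
    """Zip equally long lists into one dict per firm.

    `columns` maps a column name to its list of values. A length mismatch
    raises ValueError instead of silently truncating, as zip would.
    """
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"column lengths differ: {lengths}")
    names = list(columns)
    return [dict(zip(names, row)) for row in zip(*columns.values())]


def to_csv(records):
    """Serialise the records to CSV text with a header row."""
    if not records:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(records[0]))
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Two-firm sample taken from the outputs above.
records = build_records({
    "name": ["SPEC INDIA", "instinctools"],
    "founded": ["1987", "2000"],
    "location": ["India, United States", "United States, Germany"],
})
print(to_csv(records))
```

The explicit length check matters here because a single missing element on the page would otherwise shift every later firm's values by one position without any error.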
