# Web Scraping

## Top Data Analytics Companies
To make their business processes more effective, companies are focusing on collecting and using data. Data analytics companies help businesses analyze the data they acquire and put it to use. Data analytics services can assist with product development, identifying potential market gaps, improving operational efficiency, and more. (Source: [Goodfirms](https://www.goodfirms.co/big-data-analytics/data-analytics))

The aim of this project is to scrape the Goodfirms website for details about the top data analytics companies. These details include reviews, ratings, year founded, location, etc.

### Road Map
Stage 1
* Download page
* Open page
* Get details
* Extract details
* Make a dataframe

Stage 2 (Code Refactoring)
* Convert loops to functions
* Move functions to another file
* Import functions and extract details
* Make a dataframe

Libraries

In [94]:
import requests as r
from bs4 import BeautifulSoup 
import pandas as pd
import numpy as np

Accessing webpage

In [2]:
url_link = 'https://www.goodfirms.co/big-data-analytics/data-analytics'
html = r.get(url_link)
html

<Response [200]>
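
The `<Response [200]>` above confirms that this particular request succeeded, but a later run could return an error page that would then be saved and parsed as if it were valid. The following is a minimal sketch of a more defensive download, assuming a hypothetical helper named `fetch_html` that is not part of the original notebook. The `get` parameter is also an assumption, added only so the function can be exercised without network access.

```python
import requests


def fetch_html(url, get=requests.get, timeout=10):
    """Return the page body as bytes, raising if the request failed.

    Hypothetical helper: `get` defaults to requests.get and can be
    swapped for a stand-in when testing offline.
    """
    response = get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError on 4xx/5xx responses
    response.raise_for_status()
    return response.content
```

In the notebook this could be called as `fetch_html(url_link)`, and the returned bytes written to `topfirm.html` in the same way as `html.content`.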

Saving webpage

In [12]:
with open('topfirm.html', mode='wb') as file:
    file.write(html.content)

In [4]:
with open('topfirm.html',encoding='utf-8', mode='r') as openfile:
    bs = BeautifulSoup(openfile, 'lxml')

Locate Details

<!-- 1. firm position -->
2. firm name
3. firm reviews
4. firm progress value
5. firm price
6. firm employee
7. year founded
8. firm location

In [36]:
firm_name = bs.find_all('h3')
firm_motor = bs.find_all('div', {'class': 'tagline'})
firm_reviews_count = bs.find_all('span', {'class': 'review-count'})

In [45]:
extract_details(firm_name)

0                       instinctools
1                  Outsource Bigdata
2              Systango Technologies
3                        NEX Softsys
4                         Beyond Key
5                           Datapine
6                         Prismetric
7                   Rudder Analytics
8                      FreshDataLabs
9                      Kanerika Inc.
10                    ARKA Softwares
11                      Amri Systems
12    LeoMetric Technology Pvt. Ltd.
13                      InsightWhale
14                           Softura
15                         Astrokyon
16                       WebDataGuru
17                         Analytist
18                     VOLMATICA INC
19                 Estenda Solutions
20                   Juice analytics
21                            Signon
22           aQb Solutions Pvt. Ltd.
23                           Ancient
24                        XenonStack
25                    IntelliCompute
26                       DY INFOSOFT
2
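
The notebook calls `extract_details` without showing its definition. Judging by the output above and by the `return pd.Series(lst)` fragment that appears in a later cell's output, it probably collects the stripped text of each tag into a pandas Series. The sketch below is a hypothetical reconstruction under that assumption, not the author's actual function. The sample HTML is made up for illustration and parsed with `html.parser` to avoid the `lxml` dependency.

```python
import pandas as pd
from bs4 import BeautifulSoup


def extract_details(tags):
    """Hypothetical reconstruction: stripped text of each tag as a Series."""
    lst = [tag.get_text(strip=True) for tag in tags]
    # An explicit dtype keeps the result well defined when `tags` is empty
    return pd.Series(lst, dtype="object")


# Illustrative markup shaped like the firm-name headings on the page
sample = BeautifulSoup(
    "<h3><a> instinctools </a></h3><h3><a>Outsource Bigdata</a></h3>",
    "html.parser",
)
names = extract_details(sample.find_all("h3"))
print(names.tolist())  # ['instinctools', 'Outsource Bigdata']
```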

In [187]:
# Each firm card has a header block and a services block
headers = bs.find_all('div', {'class': 'firm-header-wrapper'})
services = bs.find_all('div', {'class': 'firm-services'})

# Locate the tags of interest inside each card (None when a tag is missing)
name = [h.find('h3') for h in headers]
motor = [h.find('div', {'class': 'tagline'}) for h in headers]
reviews_count = [h.find('span', {'class': 'review-count'}) for h in headers]
review_rating = [h.find('span', {'class': 'review-rating'}) for h in headers]
# Spans appear in this order: price, employees, year founded, location
peyl = [s.find_all('span') for s in services]

# Extract the text, falling back to '0' or NaN when a tag is missing
firm_reviews_count = [t.text.strip() if t and t.text else '0' for t in reviews_count]
firm_name = [t.text.strip() if t and t.text else np.nan for t in name]
firm_motor = [t.text.strip() if t and t.text else np.nan for t in motor]
firm_reviews_rating = [t.text.splitlines()[1] if t and t.text else np.nan for t in review_rating]
# The length checks avoid an IndexError on cards with fewer spans
firm_price = [s[0].text if len(s) > 0 and s[0].text else np.nan for s in peyl]
firm_employee = [s[1].text if len(s) > 1 and s[1].text else np.nan for s in peyl]
firm_year_founded = [s[2].text if len(s) > 2 and s[2].text else np.nan for s in peyl]
firm_location = [s[-1].text if s and s[-1].text else np.nan for s in peyl]
firm_location

['United States, Germany',
 'United States, Canada',
 'United Kingdom, United States',
 'United States',
 'United States',
 'Germany',
 'India',
 'India',
 'India',
 'United States',
 'United States',
 'United States',
 'India',
 'Russia',
 'United States',
 'Poland',
 'United States',
 'Thailand',
 'United States',
 'United States',
 'United States',
 'India',
 'India',
 'Mexico',
 'United States',
 'United States',
 'India',
 'Lithuania',
 'India',
 'India',
 'Brazil',
 'United States',
 'United States',
 'United States',
 'Germany',
 'Ukraine',
 'United States',
 'Luxembourg, Poland',
 'Kazakhstan',
 'United States, United Kingdom',
 'United States',
 'Portugal',
 'United States',
 'Vietnam',
 'Poland',
 'United States',
 'United States',
 'Israel',
 'Portugal']

In [142]:
df_new = pd.DataFrame()

In [188]:
df_new['firm_name'] = firm_name
df_new['firm_motor'] = firm_motor
df_new['firm_review_count'] = firm_reviews_count
df_new['firm_review_rating'] = firm_reviews_rating
# df_new['firm_serv_pct(%)'] = serv_lst
# df_new['firm_plt_pct(%)'] = plt_lst
df_new['firm_price'] = firm_price
df_new['firm_emp'] = firm_employee
df_new['firm_year_founded'] = firm_year_founded
df_new['firm_location'] = firm_location

df_new.head()

| | firm_name | firm_motor | firm_review_count | firm_review_rating | firm_price | firm_emp | firm_year_founded | firm_location |
|---|---|---|---|---|---|---|---|---|
| 0 | instinctools | BlackFriday discount for all contracts before ... | 11 Reviews | 4.9 | $25 - $49/hr | 250 - 999 | 2000 | United States, Germany |
| 1 | Outsource Bigdata | AI-driven Web Scraping & Data Labeling Provider | 19 Reviews | 5.0 | < $25/hr | 50 - 249 | 2012 | United States, Canada |
| 2 | Systango Technologies | We Make the Impossible, Possible. | 6 Reviews | 5.0 | $25 - $49/hr | 250 - 999 | 2007 | United Kingdom, United States |
| 3 | NEX Softsys | IT Partner for Global Clients | 12 Reviews | 5.0 | $25 - $49/hr | 50 - 249 | 2003 | United States |
| 4 | Beyond Key | IT Consulting and Software Development Services | 7 Reviews | 5.0 | $25 - $49/hr | 250 - 999 | 2005 | United States |
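
Every column in this frame still holds strings, so values such as `11 Reviews` or `2000` cannot be sorted or aggregated numerically yet. The sketch below shows one possible way to convert them with pandas. It assumes a small stand-in frame named `df` whose values are illustrative (the first two rows mirror the output above, the third is invented to show the `'0'` and missing-value cases) and is not a step taken in the original notebook.

```python
import pandas as pd

# Stand-in frame with the same column names as df_new (values illustrative)
df = pd.DataFrame({
    "firm_review_count": ["11 Reviews", "19 Reviews", "0"],
    "firm_review_rating": ["4.9", "5.0", None],
    "firm_year_founded": ["2000", "2012", "2007"],
})

# Pull the leading digits out of strings such as "11 Reviews"
df["firm_review_count"] = (
    df["firm_review_count"].str.extract(r"(\d+)", expand=False).astype(int)
)
# errors="coerce" turns missing or unparseable values into NaN
df["firm_review_rating"] = pd.to_numeric(df["firm_review_rating"], errors="coerce")
df["firm_year_founded"] = pd.to_numeric(df["firm_year_founded"], errors="coerce")

print(df)
```

Ranged fields such as `firm_price` and `firm_emp` are left as strings here, since turning a range into a single number requires a choice (lower bound, midpoint, and so on) that the notebook does not make.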


In [92]:
bs.find_all('div', {'class': 'firm-header-wrapper'})[0].find('h3')

<h3>
<a class="visit-website" href="https://www.instinctools.com/bi-and-big-data/?utm_source=goodfirms.co&amp;utm_medium=referral&amp;utm_campaign=data analytics" rel="nofollow" target="_blank">
instinctools
</a>
</h3>

In [None]:
<div class="firm-pricing custom_tooltip" data-content="<i>Hourly Rate</i>" data-tooltip-position=".icon-wrapper">
<div class="icon-wrapper"><i class="price-icon"></i></div>
<span>$25 - $49/hr</span>
</div>

In [134]:
bs.find_all('div', {'class': 'firm-services'})[0].find_all('span')

[<span>$25 - $49/hr</span>,
 <span>250 - 999</span>,
 <span>2000</span>,
 <span>United States, Germany</span>]

In [128]:
bs.find_all('div', {'class': 'firm-services'})[0]

<div class="firm-services">
<div class="firm-pricing custom_tooltip" data-content="&lt;i&gt;Hourly Rate&lt;/i&gt;" data-tooltip-position=".icon-wrapper">
<div class="icon-wrapper"><i class="price-icon"></i></div>
<span>$25 - $49/hr</span>
</div>
<div class="firm-employees custom_tooltip" data-content="&lt;i&gt;Employees&lt;/i&gt;" data-tooltip-position=".icon-wrapper">
<div class="icon-wrapper"><i class="employee-icon"></i></div>
<span>250 - 999</span>
</div>
<div class="firm-founded custom_tooltip" data-content="&lt;i&gt;Founded&lt;/i&gt;" data-tooltip-position=".icon-wrapper">
<div class="icon-wrapper"><i class="founded-icon"></i></div>
<span>2000</span>
</div>
<div class="firm-location custom_tooltip" data-content="&lt;i&gt;Location&lt;/i&gt;" data-tooltip-position=".icon-wrapper">
<div class="icon-wrapper"><i class="location-icon"></i></div>
<span>United States, Germany</span>
</div>
</div>
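
The markup above shows that each value sits in its own sub-block with a descriptive class (`firm-pricing`, `firm-employees`, `firm-founded`, `firm-location`). Looking fields up by class rather than by span position may be more robust when a firm omits one of them, because positional indexing would then shift or fail. The following is a minimal sketch of that idea. The names `FIELD_CLASSES` and `parse_services` are hypothetical, the sample HTML is a trimmed copy of the block above with the employees entry deliberately removed, and `None` is used for missing values (pandas treats it as missing, much like `np.nan`).

```python
from bs4 import BeautifulSoup

# Sub-block class names as they appear in the saved page's markup
FIELD_CLASSES = {
    "firm_price": "firm-pricing",
    "firm_employee": "firm-employees",
    "firm_year_founded": "firm-founded",
    "firm_location": "firm-location",
}


def parse_services(card):
    """Return a dict of field -> text (None when the sub-block is missing)."""
    row = {}
    for field, css_class in FIELD_CLASSES.items():
        block = card.find("div", class_=css_class)
        span = block.find("span") if block else None
        row[field] = span.get_text(strip=True) if span else None
    return row


# Trimmed sample with the employees block removed on purpose
html = """
<div class="firm-services">
  <div class="firm-pricing custom_tooltip"><span>$25 - $49/hr</span></div>
  <div class="firm-founded custom_tooltip"><span>2000</span></div>
  <div class="firm-location custom_tooltip"><span>United States, Germany</span></div>
</div>
"""
card = BeautifulSoup(html, "html.parser").find("div", class_="firm-services")
print(parse_services(card))
```

A list of such dictionaries, one per `firm-services` block, can be passed directly to `pd.DataFrame` to build the columns in one step.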

In [5]:
firm_name = bs.find_all('span', {'itemprop': 'name'})
firm_motor = bs.find_all('p', {'class': 'profile-tagline'})
firm_reviews = bs.find_all('span', {'class': 'listinv_review_label'})
firm_progress_value = bs.find_all('div', {'class': 'circle-progress-value'})
firm_price = bs.find_all('div', {'class': 'firm-pricing'})

In [13]:
firm_name

[]

In [7]:
firm_name_sr = extract_details(firm_name[3:])
firm_motor_sr = extract_details(firm_motor)
firm_reviews_sr = extract_details(firm_reviews)
firm_price_sr = extract_details(firm_price)

  return pd.Series(lst)


In [8]:
serv_lst = []
plt_lst = []
# Values alternate: even positions go to serv_lst, odd positions to plt_lst
for idx, tag in enumerate(firm_progress_value):
    if idx % 2 == 0:
        serv_lst.append(tag.text)
    else:
        plt_lst.append(tag.text)

serv_lst[:3]

[]

In [98]:
df_new = pd.DataFrame()

In [10]:
df_new['firm_name'] = firm_name_sr
df_new['firm_motor'] = firm_motor_sr
df_new['firm_reviews'] = firm_reviews_sr
df_new['firm_serv_pct(%)'] = serv_lst
df_new['firm_plt_pct(%)'] = plt_lst
df_new['firm_price'] = firm_price_sr

In [11]:
df_new

| | firm_name | firm_motor | firm_reviews | firm_serv_pct(%) | firm_plt_pct(%) | firm_price |
|---|---|---|---|---|---|---|
| 0 | | | | | | `\n\n$25 - $49/hr\n` |
| 1 | | | | | | `\n\n< $25/hr\n` |
| 2 | | | | | | `\n\n$25 - $49/hr\n` |
| 3 | | | | | | `\n\n$25 - $49/hr\n` |
| 4 | | | | | | `\n\n$25 - $49/hr\n` |
| 5 | | | | | | `\n\n$50 - $99/hr\n` |
| 6 | | | | | | `\n\n< $25/hr\n` |
| 7 | | | | | | `\n\n$25 - $49/hr\n` |
| 8 | | | | | | `\n\n$50 - $99/hr\n` |
| 9 | | | | | | `\n\n$50 - $99/hr\n` |
