# Web Scraping

## Top Data Analytics Companies
To make their business processes more effective, companies are focusing on collecting and using data. Data analytics companies help businesses analyze the data they acquire and put it to use. Data analytics services can assist with product development, identifying potential market gaps, and improving operational efficiency. Source: [Goodfirms](https://www.goodfirms.co/big-data-analytics/data-analytics)

The aim of this project is to scrape the Goodfirms website for details about the top data analytics companies. These details include reviews, ratings, year founded, location, etc.

### Road Map
Stage 1
* Download page
* Open page
* Get details
* Extract details
* Make a dataframe

Stage 2 (Code Refactoring)
* Change loops to functions
* Move functions to another file
* Import functions and extract
* Make a dataframe

### Libraries

In [1]:
import requests as r
from bs4 import BeautifulSoup
import pandas as pd
import plotly.express as px
import plotly.offline as po
po.init_notebook_mode(connected=True)

### Download and save HTML file

In [2]:
# # URL link
# url = 'https://www.goodfirms.co/big-data-analytics/data-analytics'
# # access website
# html = r.get(url)
# with open('top_da_company1.html', mode='wb') as file:
#     file.write(html.content)
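
The download cell above is commented out once the HTML file exists on disk. A minimal sketch of one way to make that explicit, so the page is fetched only when the local copy is missing. The helper name `save_page` and its `fetch` parameter are hypothetical and not part of the original notebook, and the 30-second timeout is an arbitrary choice.

```python
from pathlib import Path


def save_page(url, path, fetch=None):
    """Save the page at `url` to `path` unless the file already exists.

    `fetch` is a hypothetical hook: any callable that takes a URL and
    returns the page body as bytes. It defaults to a requests-based getter.
    """
    path = Path(path)
    if path.exists():
        # Reuse the cached copy instead of requesting the page again.
        return path
    if fetch is None:
        import requests  # imported lazily, only when a download is needed

        def fetch(target):
            response = requests.get(target, timeout=30)
            response.raise_for_status()  # fail loudly on 4xx/5xx responses
            return response.content

    path.write_bytes(fetch(url))
    return path
```

Calling `save_page(url, 'top_da_company1.html')` a second time would return immediately because the file is already present.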

In [3]:
with open("top_da_company1.html", encoding='utf-8', mode='r') as file:
    bs = BeautifulSoup(file, 'lxml')
    # bs = BeautifulSoup(file, 'html5lib')

### Locate Details

In [17]:
firm_position = bs.find_all('span', {'class': "position"})
firm_names = bs.find_all('span', {'itemprop': "name"})
firm_motors = bs.find_all('p', {'class': "profile-tagline"})
firm_reviews = bs.find_all('span', {'class': "listinv_review_label"})
progress_value = bs.find_all('div', {'class': "circle-progress-value"})
firm_prices = bs.find_all('div', {'class': "firm-pricing"})
firm_emps = bs.find_all('div', {'class': "firm-employees"})
firm_years = bs.find_all('div', {'class': "firm-founded"})
firm_locations = bs.find_all('div', {'class': "firm-location"})
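
Each `find_all` call above returns an independent list, so the columns line up only when every listing contains every field. A sketch of an alternative that walks one listing at a time, so a missing field becomes `None` instead of shifting later rows. The inline HTML and the wrapper class `firm-card` are made up for illustration; the container markup on the real page would need to be checked. The sketch uses Python's built-in `html.parser` so it runs without lxml.

```python
from bs4 import BeautifulSoup

# Made-up markup: two listings, the second one has no founding year.
html = """
<div class="firm-card">
  <span itemprop="name">Alpha Analytics</span>
  <div class="firm-founded">2010</div>
</div>
<div class="firm-card">
  <span itemprop="name">Beta Data</span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")


def text_or_none(tag):
    """Return the stripped text of a tag, or None when the tag is absent."""
    return tag.get_text(strip=True) if tag is not None else None


rows = []
for card in soup.find_all("div", {"class": "firm-card"}):
    # Search inside this one listing only, so fields cannot drift between rows.
    rows.append({
        "firm_name": text_or_none(card.find("span", {"itemprop": "name"})),
        "year_founded": text_or_none(card.find("div", {"class": "firm-founded"})),
    })

print(rows)
```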

### Extract Details

In [19]:
pos_lst = []
for pos in firm_position:
    pos_lst.append(pos.text)

print(len(pos_lst))
pos_lst[:5]

51


['1', '2', '3', '4', '5']

In [5]:
names_lst = []
for name in firm_names[3:]:
    names_lst.append(name.text)
    
print(len(names_lst))
names_lst[:3]

51


['SPEC INDIA', 'instinctools', 'SoluLab']

In [6]:
motor_lst = []
for motor in firm_motors:
    motor_lst.append(motor.text)

print(len(motor_lst))
motor_lst[:3]

51


['Enterprise Software, Mobility & BI Solutions',
 'Delivering the future. Now.',
 'Blockchain | IoT | Mobility | AI | Big Data']

In [7]:
review_lst = []
for review in firm_reviews:
    review_lst.append(review.text)
    
print(len(review_lst))
review_lst[:3]

51


['4.8 (26 Reviews)', '4.8 (8 Reviews)', '5.0 (32 Reviews)']

In [8]:
service_pct = []
platform_pct = []
for i, percent in enumerate(progress_value):
    # even positions go to the service list, odd positions to the platform list
    if i % 2 == 0:
        service_pct.append(percent.text)
    else:
        platform_pct.append(percent.text)
        
print(len(service_pct))
print(len(platform_pct))
print(service_pct[:3])
print(platform_pct[:3])

51
51
['20%', '5%', '15%']
['15%', '10%', '15%']
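
The loop above relies on the progress values alternating between service and platform percentages. Assuming that alternation holds, extended slicing gives the same split. The sample list below is reconstructed from the first three values of each output above.

```python
# Stand-in for the text of each tag in progress_value,
# ordered service, platform, service, platform, ...
values = ['20%', '15%', '5%', '10%', '15%', '15%']

service_pct = values[::2]    # items at even positions: 0, 2, 4
platform_pct = values[1::2]  # items at odd positions: 1, 3, 5

print(service_pct)
print(platform_pct)
```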


In [9]:
price_lst = []
for price in firm_prices:
    price_lst.append(price.text)
    
print(len(price_lst))
price_lst[:3]

51


['\n< $25/hr ', '\n$50 - $99/hr ', '\n$25 - $49/hr ']

In [10]:
emps_lst = []
for emp in firm_emps:
    emps_lst.append(emp.text)
    
print(len(emps_lst))
emps_lst[:3]

51


['250 - 999', '250 - 999', '50 - 249']

In [11]:
year_lst = []
for year in firm_years:
    year_lst.append(year.text)
    
print(len(year_lst))
year_lst[:3]

51


['1987', '2000', '2014']

In [12]:
firm_lst = [firm.text for firm in firm_locations]
print(len(firm_lst))
firm_lst[:3]

51


['\nIndia, United States ',
 '\nUnited States, Germany ',
 '\nUnited States, India ']

### Put Details in a DataFrame

In [20]:
df = pd.DataFrame()

In [21]:
df['firm_position'] = pos_lst
df['firm_name'] = names_lst
df['firm_motor'] = motor_lst
df['firm_review'] = review_lst
df['service_pct'] = service_pct
df['platform_pct'] = platform_pct
df['firm_price'] = price_lst
df['firm_employee'] = emps_lst
df['year_founded'] = year_lst
df['firm_location'] = firm_locations

In [22]:
# df
# df2 = df.copy()
# pd.concat([df, df2])
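
The commented-out `pd.concat` call hints at stacking the dataframes from both pages. A minimal sketch with two tiny made-up frames; `ignore_index=True` renumbers the rows so the combined index does not repeat.

```python
import pandas as pd

# Made-up stand-ins for the page 1 and page 2 dataframes.
page1 = pd.DataFrame({'firm_name': ['Firm A', 'Firm B'],
                      'year_founded': ['1987', '2000']})
page2 = pd.DataFrame({'firm_name': ['Firm C'],
                      'year_founded': ['2014']})

# Stack the rows; without ignore_index the result would keep labels 0, 1, 0.
combined = pd.concat([page1, page2], ignore_index=True)
print(combined)
```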

### Task
* Extract the remaining three details
* Download the second page and save it as "top_da_company2.html"
* Extract the same details as seen above
* Create a dataframe

In [38]:
# URL link
url = 'https://www.goodfirms.co/big-data-analytics/data-analytics?page=2'
# access website
html = r.get(url)
with open('top_da_company2.html', mode='wb') as file:
    file.write(html.content)

In [39]:
with open("top_da_company2.html", encoding='utf-8', mode='r') as file:
    bs = BeautifulSoup(file, 'lxml')
    # bs = BeautifulSoup(file, 'html5lib')

In [40]:
firm_names = bs.find_all('span', {'itemprop': "name"})
firm_motors = bs.find_all('p', {'class': "profile-tagline"})
firm_reviews = bs.find_all('span', {'class': "listinv_review_label"})
progress_value = bs.find_all('div', {'class': "circle-progress-value"})
firm_prices = bs.find_all('div', {'class': "firm-pricing"})
firm_emps = bs.find_all('div', {'class': "firm-employees"})
firm_years = bs.find_all('div', {'class': "firm-founded"})
firm_locations = bs.find_all('div', {'class': "firm-location"})

In [41]:
import functions as fn
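
The contents of `functions.py` are not shown in this notebook. Judging from the outputs of the calls that follow, each helper prints how many items it extracted and returns a pandas Series. A rough sketch of what one such helper might look like; the name `extract_text` and its `label` parameter are hypothetical, and the real module may differ.

```python
from types import SimpleNamespace

import pandas as pd


def extract_text(tags, label):
    """Collect the .text of each tag into a Series and report the count.

    Hypothetical helper: `tags` is any iterable of objects with a .text
    attribute, such as the result of BeautifulSoup's find_all.
    """
    values = pd.Series([tag.text for tag in tags], dtype=object)
    print(f"The number of {label} extracted is {len(values)}")
    return values


# Stand-in objects with a .text attribute, used here instead of real tags.
fake_tags = [SimpleNamespace(text='1987'), SimpleNamespace(text='2000')]
years = extract_text(fake_tags, 'year')
```

One generic helper like this could replace the per-field loops, since they differ only in the label they print.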

In [42]:
fn.extract_names(firm_names)[:5]

The number of name extracted is 51


0                           Systango Technologies
1                                         Inoxoft
2    Noventum Custom Software Development Company
3                   Napollo Software Design L.L.C
4                        SemiDot Infotech Pvt Ltd
dtype: object

In [49]:
fn.extract_motors(firm_motors)[:5]

The number of firm motor extracted is 51


0                    We Make the Impossible, Possible!
1    IEC/ISO 27001, Google, Microsoft Certified Com...
2      Custom Software & Web Development in New Mexico
3      Best Software Design Agency in Dubai & New York
4            Right Technology Partner for IT Solutions
dtype: object

In [26]:
fn.extract_reviews(firm_reviews)[:5]

The number of review extracted is 51


0    4.8 (26 Reviews)
1     4.8 (8 Reviews)
2    5.0 (32 Reviews)
3     4.7 (5 Reviews)
4     5.0 (5 Reviews)
dtype: object

In [27]:
fn.extract_progress_values(progress_value)[0][:5]

The number of service percent is 51
The number of platform percent is 51


0    20%
1     5%
2    15%
3    40%
4    25%
dtype: object

In [28]:
fn.exttract_prices(firm_prices)[:5]

The number of price extracted is 51


0        \n< $25/hr 
1    \n$50 - $99/hr 
2    \n$25 - $49/hr 
3    \n$25 - $49/hr 
4    \n$25 - $49/hr 
dtype: object

In [29]:
fn.extract_employees(firm_emps)[:5]

The number of employees extracted is 51


0    250 - 999
1    250 - 999
2     50 - 249
3    250 - 999
4      10 - 49
dtype: object

In [30]:
fn.extract_founded_year(firm_years)[:5]

The number of year extracted is 51


0    1987
1    2000
2    2014
3    2010
4    2016
dtype: object

In [31]:
fn.extract_locations(firm_locations)[:5]

The number of locations extracted is 51


0        \nIndia, United States 
1      \nUnited States, Germany 
2        \nUnited States, India 
3    \nUnited States, Australia 
4        \nUnited States, India 
dtype: object

### Data Cleaning

In [23]:
df.head(5)

|   | firm_position | firm_name | firm_motor | firm_review | service_pct | platform_pct | firm_price | firm_employee | year_founded | firm_location |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | SPEC INDIA | Enterprise Software, Mobility & BI Solutions | 4.8 (26 Reviews) | 20% | 15% | \n< $25/hr | 250 - 999 | 1987 | [\nIndia, United States ] |
| 1 | 2 | instinctools | Delivering the future. Now. | 4.8 (8 Reviews) | 5% | 10% | \n$50 - $99/hr | 250 - 999 | 2000 | [\nUnited States, Germany ] |
| 2 | 3 | SoluLab | Blockchain \| IoT \| Mobility \| AI \| Big Data | 5.0 (32 Reviews) | 15% | 15% | \n$25 - $49/hr | 50 - 249 | 2014 | [\nUnited States, India ] |
| 3 | 4 | Sigma Data Systems | Discover the world of Big Data with us! | 4.7 (5 Reviews) | 40% | 10% | \n$25 - $49/hr | 250 - 999 | 2010 | [\nUnited States, Australia ] |
| 4 | 5 | NeenOpal Inc. | The Hub Of Data Science Innovation | 5.0 (5 Reviews) | 25% | 10% | \n$25 - $49/hr | 10 - 49 | 2016 | [\nUnited States, India ] |


In [24]:
df.insert(2, 'star_rating', df['firm_review'].apply(lambda x: x.split()[0]))

In [25]:
df.insert(3, 'review', df['firm_review'].apply(lambda x: x.split()[1].strip('(')))
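
The two `split` calls above depend on the review text always having the shape `'4.8 (26 Reviews)'`. A regular expression makes that assumption explicit and converts both parts to numbers in one step. The function name `parse_review` is hypothetical, and the pattern assumes the format seen in the sample output.

```python
import re

# Expected shape: a rating, then a review count in parentheses,
# for example '4.8 (26 Reviews)'.
REVIEW_PATTERN = re.compile(r'^\s*(\d+(?:\.\d+)?)\s*\((\d+)\s+Reviews?\)\s*$')


def parse_review(text):
    """Return (star_rating, review_count), or (None, None) if the text does not match."""
    match = REVIEW_PATTERN.match(text)
    if match is None:
        return None, None
    return float(match.group(1)), int(match.group(2))


print(parse_review('4.8 (26 Reviews)'))  # (4.8, 26)
print(parse_review('no reviews yet'))    # (None, None)
```

Returning `None` for unexpected text keeps a single odd listing from raising an `IndexError` partway through the column.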

In [26]:
df.rename(columns={'service_pct':'service_pct(%)'}, inplace=True)
df['service_pct(%)'] = df['service_pct(%)'].apply(lambda x: x.strip('%'))

In [27]:
df.rename(columns={'platform_pct':'platform_pct(%)'}, inplace=True)
df['platform_pct(%)'] = df['platform_pct(%)'].apply(lambda x: x.strip('%'))

In [28]:
df['firm_price'] = df['firm_price'].apply(lambda x: x.strip('\n'))

In [29]:
df['firm_location'] = df['firm_location'].apply(lambda x: x.text.strip('\n'))
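
Both cells above call `strip('\n')`, which removes only newline characters. The raw values shown earlier also end with a space, so that space is left in place. A short illustration using one raw price string from the extraction output; for a whole column, `Series.str.strip()` is the vectorised equivalent.

```python
raw = '\n< $25/hr '  # raw price text as shown in the extraction output

print(repr(raw.strip('\n')))  # '< $25/hr '  (trailing space survives)
print(repr(raw.strip()))      # '< $25/hr'   (all surrounding whitespace removed)
```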

In [30]:
df.drop(columns='firm_review', inplace=True)

In [31]:
df.head(5)

|   | firm_position | firm_name | star_rating | review | firm_motor | service_pct(%) | platform_pct(%) | firm_price | firm_employee | year_founded | firm_location |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | SPEC INDIA | 4.8 | 26 | Enterprise Software, Mobility & BI Solutions | 20 | 15 | < $25/hr | 250 - 999 | 1987 | India, United States |
| 1 | 2 | instinctools | 4.8 | 8 | Delivering the future. Now. | 5 | 10 | $50 - $99/hr | 250 - 999 | 2000 | United States, Germany |
| 2 | 3 | SoluLab | 5.0 | 32 | Blockchain \| IoT \| Mobility \| AI \| Big Data | 15 | 15 | $25 - $49/hr | 50 - 249 | 2014 | United States, India |
| 3 | 4 | Sigma Data Systems | 4.7 | 5 | Discover the world of Big Data with us! | 40 | 10 | $25 - $49/hr | 250 - 999 | 2010 | United States, Australia |
| 4 | 5 | NeenOpal Inc. | 5.0 | 5 | The Hub Of Data Science Innovation | 25 | 10 | $25 - $49/hr | 10 - 49 | 2016 | United States, India |


In [90]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51 entries, 0 to 50
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   firm_position    51 non-null     object
 1   firm_name        51 non-null     object
 2   star_rating      51 non-null     object
 3   review           51 non-null     object
 4   firm_motor       51 non-null     object
 5   service_pct(%)   51 non-null     object
 6   platform_pct(%)  51 non-null     object
 7   firm_price       51 non-null     object
 8   firm_employee    51 non-null     object
 9   year_founded     51 non-null     object
 10  firm_location    51 non-null     object
dtypes: object(11)
memory usage: 4.5+ KB
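
`df.info()` reports every column as `object`, meaning the numbers are still stored as strings. A sketch of converting such columns to numeric dtypes, using a small made-up frame with the same column names. By default `pd.to_numeric` raises an error on values it cannot parse; passing `errors='coerce'` turns those into `NaN` instead.

```python
import pandas as pd

# Made-up sample with the same string-typed columns as the scraped frame.
sample = pd.DataFrame({
    'star_rating': ['4.8', '5.0'],
    'review': ['26', '32'],
    'platform_pct(%)': ['15', '50'],
    'year_founded': ['1987', '2014'],
})

# Convert each string column to a numeric dtype.
for column in ['star_rating', 'review', 'platform_pct(%)', 'year_founded']:
    sample[column] = pd.to_numeric(sample[column])

print(sample.dtypes)
```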


### EDA

In [89]:
df.sort_values(by='platform_pct(%)', ascending=False)

|   | firm_position | firm_name | star_rating | review | firm_motor | service_pct(%) | platform_pct(%) | firm_price | firm_employee | year_founded | firm_location |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 50 | 51 | Burning Buttons LLC | 5.0 | 6 | Digital boost for your business to succeed! | 10 | 50 | $25 - $49/hr | 50 - 249 | 2009 | Germany, Russia |
| 47 | 48 | BR Softech Pvt Ltd | 4.5 | 10 | Website & Mobile Apps Development Company | 5 | 50 | $25 - $49/hr | 1,000 - 9,999 | 2012 | India, United States |
| 37 | 38 | Sunflower Lab | 4.9 | 13 | Sunflower Lab | 5 | 50 | $50 - $99/hr | 50 - 249 | 2010 | United States |
| 36 | 37 | Prismetric | 4.7 | 14 | Delivering Quality Products and Premium Services | 5 | 50 | < $25/hr | 50 - 249 | 2008 | India, United States |
| 34 | 35 | WebClues Infotech | 4.6 | 15 | Your Vision, Our Creation | 5 | 50 | $25 - $49/hr | 50 - 249 | 2014 | Canada |
| 32 | 33 | Welby Consulting | 3.4 | 5 | The Future of Digital | 10 | 50 | $100 - $149/hr | 2 - 9 | 2015 | Canada |
| 18 | 19 | Broscorp | 4.9 | 4 | Delivering software that helps businesses to grow | 10 | 50 | $25 - $49/hr | 10 - 49 | 2016 | Ukraine |
| 13 | 14 | Huspi | 5.0 | 6 | Transforming ideas into user-friendly software. | 10 | 50 | $50 - $99/hr | 10 - 49 | 2015 | Poland, Ukraine |
| 15 | 16 | Fayrix | 5.0 | 8 | Remote software teams & services for startups. | 10 | 5 | < $25/hr | 1,000 - 9,999 | 2005 | Israel |
| 41 | 42 | 7EDGE | 5.0 | 14 | Software and Product Development \| Dedicated T... | 5 | 5 | $25 - $49/hr | 50 - 249 | 2010 | India, United States |
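
In the output above, rows with a `platform_pct(%)` of 5 rank directly below the rows with 50. That is the expected result of sorting strings: comparison goes character by character, so `'5'` sorts after `'40'`. A small illustration with made-up values; converting the column to a numeric dtype before calling `sort_values` would give a numeric ordering.

```python
values = ['5', '50', '10', '40']  # made-up percentages stored as strings

# String comparison is character by character, so '5' lands between '50' and '40'.
as_text = sorted(values, reverse=True)

# Comparing as integers gives the intended numeric order.
as_numbers = sorted(values, key=int, reverse=True)

print(as_text)
print(as_numbers)
```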
