# Web Scraping of Top Data Analytics Companies
To improve the effectiveness of their business processes, companies are focusing on collecting and using data. Data analytics companies help businesses analyze the data they acquire and put it to use. Data analytics services can assist in product development, identifying potential market gaps, improving operational efficiency, and more.

Goodfirms

The aim of this project is to scrape the Goodfirms website for details about the top data analytics companies. These details include reviews, ratings, year founded, location, etc.

# Road Map
Stage 1


1. Download page
2. Open page
3. Get details
4. Extract details
5. Make a dataframe

Stage 2 (Code Refactoring)

1. Change loops to functions
2. Move functions to another file
3. Import functions and extract
4. Make a dataframe

# Libraries

In [22]:
import requests as r
from bs4 import BeautifulSoup
import pandas as pd
import plotly.express as px
import plotly.offline as po
po.init_notebook_mode(connected=True)

# Download and save HTML file

In [None]:
# # URL links
# url_link = ['https://www.goodfirms.co/big-data-analytics/data-analytics',
#             'https://www.goodfirms.co/big-data-analytics/data-analytics?page=2']

# # access website
# access = [r.get(url) for url in url_link]

# # OR, as an explicit loop:
# # access = []
# # for url in url_link:
# #     access.append(r.get(url))


In [23]:
#url_link = 'https://www.goodfirms.co/big-data-analytics/data-analytics'
#html = r.get(url_link)
#html

<Response [200]>

# Saving webpage to PC

In [24]:
# for page in access:
#     index = access.index(page)+1
#     with open(f'page{index}.html', mode='wb') as file:
#         file.write(page.content)
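The commented-out cells above outline the download step: request each listing page, then write the response body to a numbered HTML file. A minimal sketch of that step follows. The names `download_pages`, `_default_fetch`, `out_dir` and `fetch` are hypothetical, and the 30-second timeout and the `raise_for_status()` check are additions that the original cells do not contain.

```python
from pathlib import Path


def _default_fetch(url):
    """Fetch a URL with requests and return the raw response body."""
    import requests  # imported lazily so the helper can be exercised offline

    response = requests.get(url, timeout=30)
    response.raise_for_status()  # fail on 4xx/5xx instead of saving an error page
    return response.content


def download_pages(urls, out_dir=".", fetch=_default_fetch):
    """Save each URL as page1.html, page2.html, ... and return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    saved = []
    # enumerate supplies the 1-based page number directly,
    # without a list.index() lookup for every page
    for index, url in enumerate(urls, start=1):
        path = out_dir / f"page{index}.html"
        path.write_bytes(fetch(url))
        saved.append(path)
    return saved
```

Calling `download_pages(url_link)` would write `page1.html` and `page2.html` to the working directory. Passing a different `fetch` callable makes it possible to try the function without touching the network.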

# Page 1

In [25]:
with open('topfirm.html',encoding='utf-8', mode='r') as openfile:
    bs = BeautifulSoup(openfile, 'lxml')

# Locate Details


1. firm names
2. firm mottos (taglines; the code calls these firm_motors)
3. firm reviews
4. firm progress value
5. firm prices
6. firm employees
7. firm years
8. firm locations

In [231]:
firm_names = bs.find_all('span', {'itemprop': 'name'})
firm_motors = bs.find_all('p', {'class': 'profile-tagline'})
firm_reviews = bs.find_all('span', {'class': 'listinv_review_label'})
firm_progress_value = bs.find_all('div', {'class': 'circle-progress-value'})
firm_prices = bs.find_all('div', {'class': 'firm-pricing'})
firm_employees = bs.find_all('div',{'class':'firm-employees'})
firm_years = bs.find_all('div',{'class':'firm-founded'})
firm_locations = bs.find_all('div',{'class':'firm-location'})
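Each `find_all` call above takes a tag name and an attribute dictionary and returns every matching element in document order. The sketch below shows that behaviour on a small snippet. The HTML is made up for illustration (only the class and attribute names come from the cell above), and it assumes the standard-library `html.parser` backend instead of `lxml` so that it runs without the extra dependency.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration; only the attribute names come from the notebook.
SAMPLE_HTML = """
<div><span itemprop="name">Firm A</span>
     <div class="firm-founded">2010</div>
     <div class="firm-location">
 India, United States</div></div>
<div><span itemprop="name">Firm B</span>
     <div class="firm-founded">2016</div>
     <div class="firm-location">
 United Kingdom</div></div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

name_tags = soup.find_all("span", {"itemprop": "name"})
year_tags = soup.find_all("div", {"class": "firm-founded"})
location_tags = soup.find_all("div", {"class": "firm-location"})

# get_text(strip=True) trims the leading newline that .text would keep
names = [tag.get_text(strip=True) for tag in name_tags]
years = [tag.get_text(strip=True) for tag in year_tags]
locations = [tag.get_text(strip=True) for tag in location_tags]

print(names, years, locations)
```

Using `get_text(strip=True)` at extraction time is one way to avoid the leading `\n` that later has to be removed during cleaning.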

# Function to extract details

In [267]:
def extract_detail(tag_lst):
    lst = [tag.text for tag in tag_lst]
    return pd.Series(lst)

def extract_progress_values(tag_lst):
    # values alternate: even positions hold the service %, odd positions the platform %
    serv_pct = [tag.text for i, tag in enumerate(tag_lst) if i % 2 == 0]
    plt_pct = [tag.text for i, tag in enumerate(tag_lst) if i % 2 == 1]
    return pd.Series(serv_pct), pd.Series(plt_pct)

In [268]:
serv_lst = []
plt_lst = []
# even positions hold the service %, odd positions the platform %
for i, tag in enumerate(firm_progress_value):
    if i % 2 == 0:
        serv_lst.append(tag.text)
    else:
        plt_lst.append(tag.text)

serv_lst[:3]

['5%', '10%', '10%']
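The even/odd split above can also be written with extended slicing, which avoids the index test entirely. The sketch below is one way to do it. `split_alternating` is a hypothetical helper name, and the sample list stands in for the text of the progress-value tags.

```python
def split_alternating(values):
    """Split an interleaved sequence into (even-position, odd-position) lists."""
    values = list(values)
    return values[0::2], values[1::2]


# Hypothetical sample in the order the page yields the values:
# service %, platform %, service %, platform %, ...
sample = ["5%", "10%", "10%", "50%", "10%", "50%"]
service_pct, platform_pct = split_alternating(sample)
print(service_pct)   # ['5%', '10%', '10%']
print(platform_pct)  # ['10%', '50%', '50%']
```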

# Extract Details

In [269]:
# the first three name matches are skipped (presumably non-firm elements on the page)
firm_names_sr = extract_detail(firm_names[3:])
firm_motors_sr = extract_detail(firm_motors)
firm_reviews_sr = extract_detail(firm_reviews)
firm_prices_sr = extract_detail(firm_prices)
firm_employees_sr = extract_detail(firm_employees)
firm_years_sr = extract_detail(firm_years)
firm_locations_sr = extract_detail(firm_locations)

# Put Details in a DataFrame

In [270]:
df_new = pd.DataFrame()

In [271]:
df_new['firm_names'] = firm_names_sr
df_new['firm_motors'] = firm_motors_sr
df_new['firm_reviews'] = firm_reviews_sr
df_new['firm_serv_pct'] = serv_lst
df_new['firm_plt_pct'] = plt_lst
df_new['firm_prices'] = firm_prices_sr
df_new['firm_employees'] = firm_employees_sr
df_new['year_founded'] = firm_years_sr
df_new['firm_location'] = firm_locations_sr

In [272]:
df_new.head()

Unnamed: 0,firm_names,firm_motors,firm_reviews,firm_serv_pct,firm_plt_pct,firm_prices,firm_employees,year_founded,firm_location
0,Sphinx Solutions,Inspire : Innovate : Evolve,4.6 (8 Reviews),5%,10%,\n< $25/hr,Freelancer,2010,"\nIndia, United States"
1,Redian Software,Delivering Open Source Solutions,5.0 (5 Reviews),10%,50%,\n$25 - $49/hr,10 - 49,2016,"\nUnited Kingdom, India"
2,Indus Net Technologies,We Deliver Digital Success,5.0 (5 Reviews),10%,50%,\n< $25/hr,250 - 999,1997,"\nIndia, United Kingdom"
3,RWaltz Group Inc.,Blockchain Solutions Experts,5.0 (5 Reviews),10%,50%,\n$50 - $99/hr,50 - 249,2000,\n United States
4,UNL Solutions,"Dedicated Team, Dedicated Developer",4.3 (3 Reviews),10%,100%,\n$25 - $49/hr,50 - 249,2006,"\nUnited Kingdom, Belarus"


# Page 2

In [280]:
# open page 2
# NOTE: 'topfirm.html' is the same file opened for Page 1; point this at the saved page 2 file
with open('topfirm.html',encoding='utf-8', mode='r') as openfile:
    bs2 = BeautifulSoup(openfile, 'lxml')

# locate details 
firm_names = bs2.find_all('span', {'itemprop': 'name'})
firm_motors = bs2.find_all('p', {'class': 'profile-tagline'})
firm_reviews = bs2.find_all('span', {'class': 'listinv_review_label'})
firm_progress_value = bs2.find_all('div', {'class': 'circle-progress-value'})
firm_prices = bs2.find_all('div', {'class': 'firm-pricing'})
firm_employees = bs2.find_all('div',{'class':'firm-employees'})
firm_years = bs2.find_all('div',{'class':'firm-founded'})
firm_locations = bs2.find_all('div',{'class':'firm-location'})

# extract details


names = extract_detail(firm_names[3:])
motors = extract_detail(firm_motors)
reviews = extract_detail(firm_reviews)
serv_lst, plt_lst = extract_progress_values(firm_progress_value)
prices = extract_detail(firm_prices)
employees = extract_detail(firm_employees)
years = extract_detail(firm_years)
locations = extract_detail(firm_locations)



# Put details in a dataframe
df_2 = pd.DataFrame()
df_2['firm_names'] = names
df_2['firm_motors'] = motors
df_2['firm_reviews'] = reviews
df_2['firm_serv_pct'] = serv_lst
df_2['firm_plt_pct'] = plt_lst
df_2['firm_prices'] = prices
df_2['firm_employees'] = employees
df_2['year_founded'] = years
df_2['firm_location'] = locations

# preview  
df_2.head()


Unnamed: 0,firm_names,firm_motors,firm_reviews,firm_serv_pct,firm_plt_pct,firm_prices,firm_employees,year_founded,firm_location
0,Sphinx Solutions,Inspire : Innovate : Evolve,4.6 (8 Reviews),5%,10%,\n< $25/hr,Freelancer,2010,"\nIndia, United States"
1,Redian Software,Delivering Open Source Solutions,5.0 (5 Reviews),10%,50%,\n$25 - $49/hr,10 - 49,2016,"\nUnited Kingdom, India"
2,Indus Net Technologies,We Deliver Digital Success,5.0 (5 Reviews),10%,50%,\n< $25/hr,250 - 999,1997,"\nIndia, United Kingdom"
3,RWaltz Group Inc.,Blockchain Solutions Experts,5.0 (5 Reviews),10%,50%,\n$50 - $99/hr,50 - 249,2000,\n United States
4,UNL Solutions,"Dedicated Team, Dedicated Developer",4.3 (3 Reviews),10%,100%,\n$25 - $49/hr,50 - 249,2006,"\nUnited Kingdom, Belarus"
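Pages 1 and 2 repeat the same assemble step, and both rely on every extracted list having one entry per firm. When columns are assigned as Series, pandas aligns them on the index, so a shorter Series is padded with NaN and a longer one is truncated without an error. The sketch below shows one way to check the lengths before building the frame. `build_rows` is a hypothetical helper, and the sample values are invented.

```python
def build_rows(columns):
    """Turn {column_name: list_of_values} into a list of row dicts.

    Raises ValueError when the lists differ in length, which would
    otherwise pair firms with the wrong attributes.
    """
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"column lengths differ: {lengths}")
    names = list(columns)
    return [dict(zip(names, row)) for row in zip(*columns.values())]


# Hypothetical values standing in for the extracted page details.
rows = build_rows({
    "firm_names": ["Firm A", "Firm B"],
    "year_founded": ["2010", "2016"],
})
print(rows)
```

The resulting list of dicts can be passed straight to `pd.DataFrame(rows)`, so the same function could serve both pages as part of the Stage 2 refactoring.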


# Combine All DataFrames

In [281]:
all_df = pd.concat([df_new,df_2], ignore_index= True)
all_df

Unnamed: 0,firm_names,firm_motors,firm_reviews,firm_serv_pct,firm_plt_pct,firm_prices,firm_employees,year_founded,firm_location
0,Sphinx Solutions,Inspire : Innovate : Evolve,4.6 (8 Reviews),5%,10%,\n< $25/hr,Freelancer,2010,"\nIndia, United States"
1,Redian Software,Delivering Open Source Solutions,5.0 (5 Reviews),10%,50%,\n$25 - $49/hr,10 - 49,2016,"\nUnited Kingdom, India"
2,Indus Net Technologies,We Deliver Digital Success,5.0 (5 Reviews),10%,50%,\n< $25/hr,250 - 999,1997,"\nIndia, United Kingdom"
3,RWaltz Group Inc.,Blockchain Solutions Experts,5.0 (5 Reviews),10%,50%,\n$50 - $99/hr,50 - 249,2000,\n United States
4,UNL Solutions,"Dedicated Team, Dedicated Developer",4.3 (3 Reviews),10%,100%,\n$25 - $49/hr,50 - 249,2006,"\nUnited Kingdom, Belarus"
...,...,...,...,...,...,...,...,...,...
103,Evolve Technologies,IT agency,4.8 (1 Review),30%,30%,\n$50 - $99/hr,10 - 49,2000,\nIreland
104,Virtual Electronics PTE LTD,Software and mobile app development in Singapore,5.0 (2 Reviews),5%,25%,\n$25 - $49/hr,10 - 49,2019,\nSingapore
105,Reenbit,Intelligent engineering & beyond,5.0 (3 Reviews),5%,5%,\n$25 - $49/hr,50 - 249,2018,"\nUkraine, Poland"
106,Michigan Software Labs,We Make Apps,5.0 (2 Reviews),10%,20%,\n$150 - $199/hr,10 - 49,2010,\nUnited States


# Data Cleaning

In [282]:
# extracting star rating 
values = all_df['firm_reviews'].apply(lambda x: x.split()[0])
all_df.insert(3, 'star_rating', values)

# extracting number of reviews
val = all_df['firm_reviews'].apply(lambda x: x.split()[1].strip('('))
all_df.insert(4, 'firm_rev', val)

# drop firm reviews column
all_df.drop(columns='firm_reviews', inplace=True)

# remove "%" from firm service and platform percent
all_df['firm_serv_pct'] = all_df['firm_serv_pct'].apply(lambda x: x.strip('%'))
all_df['firm_plt_pct'] = all_df['firm_plt_pct'].apply(lambda x: x.strip('%'))

# rename columns
all_df.rename(columns={'firm_serv_pct': 'service_pct(%)', 'firm_plt_pct':'platform_pct(%)'}, inplace=True)

# remove "\n" from firm price and location
all_df['firm_prices'] = all_df['firm_prices'].apply(lambda x: x.strip('\n'))
all_df['firm_location'] = all_df['firm_location'].apply(lambda x: x.strip('\n'))
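The rating and review count are pulled out above by splitting each string on whitespace. An alternative, sketched below, is a single regular expression with `Series.str.extract`, which names both fields and handles the singular "Review" as well. The sample Series is hypothetical and only mirrors the "rating (N Reviews)" format seen in the previews.

```python
import pandas as pd

# Hypothetical sample in the "rating (N Reviews)" format shown in the previews.
reviews = pd.Series(["4.6 (8 Reviews)", "5.0 (5 Reviews)", "4.8 (1 Review)"])

# Named groups become column names; "Reviews?" also matches the singular form.
pattern = r"^\s*(?P<star_rating>\d+(?:\.\d+)?)\s*\((?P<firm_rev>\d+)\s+Reviews?\)"
parsed = reviews.str.extract(pattern)

# Convert both columns to numeric types in the same step.
parsed = parsed.astype({"star_rating": "float", "firm_rev": "int"})
print(parsed)
```

Rows that do not match the pattern come back as NaN, so the `astype` call would raise if any firm had no review text. That may be a useful early warning, but it is a design choice rather than something the original cell does.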

In [284]:
all_df.head()

Unnamed: 0,firm_names,firm_motors,star_rating,firm_rev,service_pct(%),platform_pct(%),firm_prices,firm_employees,year_founded,firm_location
0,Sphinx Solutions,Inspire : Innovate : Evolve,4.6,8,5,10,< $25/hr,Freelancer,2010,"India, United States"
1,Redian Software,Delivering Open Source Solutions,5.0,5,10,50,$25 - $49/hr,10 - 49,2016,"United Kingdom, India"
2,Indus Net Technologies,We Deliver Digital Success,5.0,5,10,50,< $25/hr,250 - 999,1997,"India, United Kingdom"
3,RWaltz Group Inc.,Blockchain Solutions Experts,5.0,5,10,50,$50 - $99/hr,50 - 249,2000,United States
4,UNL Solutions,"Dedicated Team, Dedicated Developer",4.3,3,10,100,$25 - $49/hr,50 - 249,2006,"United Kingdom, Belarus"


# Save dataframe to CSV file

In [291]:

all_df.to_csv('clean_extract.csv', index=False)

# Changing Data Type

In [293]:
all_df.dtypes

firm_names         object
firm_motors        object
star_rating        object
firm_rev           object
service_pct(%)     object
platform_pct(%)    object
firm_prices        object
firm_employees     object
year_founded       object
firm_location      object
dtype: object

In [295]:
all_df['star_rating']=all_df['star_rating'].astype('float')
all_df['firm_rev']=all_df['firm_rev'].astype('int')
all_df['service_pct(%)']=all_df['service_pct(%)'].astype('int')
all_df['platform_pct(%)']=all_df['platform_pct(%)'].astype('int')

In [296]:
all_df.dtypes

firm_names          object
firm_motors         object
star_rating        float64
firm_rev             int32
service_pct(%)       int32
platform_pct(%)      int32
firm_prices         object
firm_employees      object
year_founded        object
firm_location       object
dtype: object
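After the conversion above, `year_founded` is still stored as `object`. If it is needed as a number, one possible approach is `pd.to_numeric` with `errors="coerce"`, sketched below on invented values. The "n/a" entry is an assumption used only to show how non-numeric text would be handled.

```python
import pandas as pd

# Hypothetical values; "n/a" stands in for any non-numeric entry.
years = pd.Series(["2010", "2016", "n/a"])

# errors="coerce" turns unparseable entries into NaN instead of raising;
# the nullable "Int64" dtype keeps whole-number years alongside missing values.
year_founded = pd.to_numeric(years, errors="coerce").astype("Int64")
print(year_founded.dtype)
```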

# EDA

In [317]:
px.histogram(all_df,'star_rating', width=500, title='Star Rating Distribution')

- 82 firms have a star rating of 5.0 stars
- 10 firms have a rating of 4.9
- Fewer than 5 firms have a rating between 1 and 4

In [299]:
px.histogram(all_df, 'firm_rev', width=500, title='Firm Review Distribution')

From the distribution above, about 20 firms have between 0 and 2 reviews.

In [305]:
top5_star_rev= all_df.sort_values(by=['star_rating','firm_rev'],ascending=False)[:5]
px.bar(top5_star_rev,'firm_names',['star_rating','firm_rev'],width=800, title= 'Top 5 Firms by Star Rating and Reviews')

In [318]:
top5_star_serv=all_df.sort_values(['star_rating','service_pct(%)'],ascending=False )[:5]
px.bar(top5_star_serv,'firm_names',['star_rating','service_pct(%)'], width=700, 
       title='Top 5 Firms by Star Rating and Service Percentage')

In [315]:
split= all_df['firm_location'].apply(lambda x: x.split(','))
lst=[]

for x in split:
    lst += x
    
new_lst= []
for x in lst:
    new_lst.append(x.strip())
locations = pd.Series(new_lst).value_counts(ascending=True)
locations[:5]

Malta                   2
Saudi Arabia            2
United Arab Emirates    2
Ireland                 2
Singapore               2
dtype: int64
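The split, flatten, strip and count steps above can also be chained with pandas string methods and `explode`. The sketch below uses a hypothetical sample in the same comma-separated format as `firm_location`. It keeps the default descending order of `value_counts`, so the most frequent countries come first, whereas the cell above sorts ascending and therefore lists the least frequent ones.

```python
import pandas as pd

# Hypothetical sample in the same comma-separated format as firm_location.
firm_location = pd.Series(
    ["India, United States", " United States", "United Kingdom, India"]
)

country_counts = (
    firm_location
    .str.split(",")   # each cell becomes a list of country strings
    .explode()        # one row per country
    .str.strip()      # drop stray spaces around names
    .value_counts()   # most frequent first
)
print(country_counts)
```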

In [319]:
px.bar(x=locations.values, y=locations.index,width=900, height=800, 
       title= 'Country Frequency',labels={'x':'Frequency','y':''})

# Summary
- The Goodfirms webpage was successfully accessed and downloaded using the Requests library.
- BeautifulSoup was used to locate and extract the details from the downloaded HTML file.
- The extracted details were converted to a DataFrame using pandas.
- The data was cleaned and its columns were converted to appropriate data types.
- The cleaned data was explored for insights.


# Findings

- Over 80 firms are rated 5 stars, while about 18 are rated between 4.8 and 4.9.
- Very few firms have more than 30 reviews.
- Most of the firms are located in the United States, followed by India.