# Web Scraping

## Top Data Analytics Companies
To make their business processes more effective, companies are focusing on collecting and using data. Data analytics companies help businesses analyze the data they acquire and put it to use. Data analytics services can assist in product development, identifying potential market gaps, improving operational efficiency, and more. Source: [Goodfirms](https://www.goodfirms.co/big-data-analytics/data-analytics)

The aim of this project is to scrape the Goodfirms website for details about the top data analytics companies. These details include reviews, ratings, year founded, location, etc.

### Road Map
Stage 1
* Download page
* Open page
* Get details
* Extract details
* Make a dataframe

Stage 2 (Code Refactoring)
* Change loops to functions
* Move functions to another file
* Import functions and extract
* Make a dataframe

### Libraries

In [1]:
import requests as r
from bs4 import BeautifulSoup
import pandas as pd
import plotly.express as px
import plotly.offline as po
po.init_notebook_mode(connected=True)

### Download and save HTML file

In [2]:
# URL link
url_link = [
    'https://www.goodfirms.co/big-data-analytics/data-analytics',
    'https://www.goodfirms.co/big-data-analytics/data-analytics?page=2',
    'https://www.goodfirms.co/big-data-analytics/data-analytics?page=3',
]

# access website
access = [r.get(url) for url in url_link]

# Equivalent loop form:
# access = []
# for url in url_link:
#     access.append(r.get(url))

### Saving webpage to PC

In [3]:
# enumerate gives the page number directly, without searching the list each time
for index, page in enumerate(access, start=1):
    with open(f'page{index}.html', mode='wb') as file:
        file.write(page.content)
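
The loop above writes every response to disk, including error pages. A minimal sketch of a guard, assuming each item exposes `status_code` and `content` the way a `requests` response does. The helper name `save_pages` and its `out_dir` parameter are hypothetical and not part of the original notebook:

```python
from pathlib import Path


def save_pages(responses, out_dir="."):
    """Write each successful response to page<N>.html and return the saved paths.

    `responses` is any iterable of objects with `status_code` and `content`
    attributes (for example, `requests.Response` objects).
    """
    saved = []
    for index, page in enumerate(responses, start=1):
        if page.status_code != 200:
            # Skip error pages instead of saving them as if they were data.
            print(f"page {index}: skipped (HTTP {page.status_code})")
            continue
        path = Path(out_dir) / f"page{index}.html"
        path.write_bytes(page.content)
        saved.append(path)
    return saved
```

With the notebook's variables this could be called as `save_pages(access)`; skipped pages keep their number, so `page2.html` is simply missing if the second request failed.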

### Page 1 

In [4]:
with open("page1.html", encoding='utf-8', mode='r') as file:
    bs = BeautifulSoup(file, 'lxml')

#### Locate Details

In [5]:
firm_names = bs.find_all('span', {'itemprop': "name"})
firm_motors = bs.find_all('p', {'class': "profile-tagline"})
firm_reviews = bs.find_all('span', {'class': "listinv_review_label"})
progress_value = bs.find_all('div', {'class': "circle-progress-value"})
firm_prices = bs.find_all('div', {'class': "firm-pricing"})
firm_emps = bs.find_all('div', {'class': "firm-employees"})
firm_years = bs.find_all('div', {'class': "firm-founded"})
firm_locations = bs.find_all('div', {'class': "firm-location"})

#### Function to extract details

In [6]:
def extract_detail(tag_lst):
    lst = [tag.text for tag in tag_lst]
    return pd.Series(lst)

def extract_progress_values(tag_lst):
    # Values alternate: even positions hold service percentages, odd positions platform percentages
    service_pct = [tag.text for tag in tag_lst[0::2]]
    platform_pct = [tag.text for tag in tag_lst[1::2]]
    return pd.Series(service_pct), pd.Series(platform_pct)

#### Extract Details

In [7]:
names = extract_detail(firm_names[3:])
motors = extract_detail(firm_motors)
reviews = extract_detail(firm_reviews)
ser, pct = extract_progress_values(progress_value)
prices = extract_detail(firm_prices)
emps = extract_detail(firm_emps)
years = extract_detail(firm_years)
locations = extract_detail(firm_locations)

### Put Details in a DataFrame

In [8]:

# Empty Dataframe
df1 = pd.DataFrame()

# Creating columns with extracted details
df1['firm_name'] = names
df1['firm_motor'] = motors
df1['firm_review'] = reviews
df1['service_pct'] = ser
df1['platform_pct'] = pct
df1['firm_price'] = prices
df1['firm_employee'] = emps
df1['year_founded'] = years
df1['firm_location'] = locations

# Preview
df1.head()

Unnamed: 0,firm_name,firm_motor,firm_review,service_pct,platform_pct,firm_price,firm_employee,year_founded,firm_location
0,SPEC INDIA,"Enterprise Software, Mobility & BI Solutions",4.8 (26 Reviews),20%,15%,\n< $25/hr,250 - 999,1987,"\nIndia, United States"
1,instinctools,Delivering the future. Now.,4.8 (8 Reviews),5%,10%,\n$50 - $99/hr,250 - 999,2000,"\nUnited States, Germany"
2,SoluLab,Blockchain | IoT | Mobility | AI | Big Data,5.0 (32 Reviews),15%,15%,\n$25 - $49/hr,50 - 249,2014,"\nUnited States, India"
3,Sigma Data Systems,Discover the world of Big Data with us!,4.7 (5 Reviews),40%,10%,\n$25 - $49/hr,250 - 999,2010,"\nUnited States, Australia"
4,NeenOpal Inc.,The Hub Of Data Science Innovation,5.0 (5 Reviews),25%,10%,\n$25 - $49/hr,10 - 49,2016,"\nUnited States, India"
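
Because each column is assigned as a separate `pd.Series`, pandas aligns it on the index set by the first column: a column with fewer items than `firm_name` is padded with `NaN`, and a longer one is truncated without warning. A small guard can make such a mismatch visible. This is only a sketch, and `build_frame` is a hypothetical helper, not something the notebook defines:

```python
import pandas as pd


def build_frame(columns):
    """Build a DataFrame from {column_name: list_of_values}, refusing ragged input."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        # Report every length so the offending selector is easy to spot.
        raise ValueError(f"column lengths differ: {lengths}")
    return pd.DataFrame(columns)
```

For example, `build_frame({'firm_name': list(names), 'firm_motor': list(motors)})` would raise if one selector matched more tags than the other.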


### Page 2

In [9]:
# Open page 2
with open("page2.html", encoding='utf-8', mode='r') as file:
    bs2 = BeautifulSoup(file, 'lxml')
    
# Locate details
firm_names = bs2.find_all('span', {'itemprop': "name"})
firm_motors = bs2.find_all('p', {'class': "profile-tagline"})
firm_reviews = bs2.find_all('span', {'class': "listinv_review_label"})
progress_value = bs2.find_all('div', {'class': "circle-progress-value"})
firm_prices = bs2.find_all('div', {'class': "firm-pricing"})
firm_emps = bs2.find_all('div', {'class': "firm-employees"})
firm_years = bs2.find_all('div', {'class': "firm-founded"})
firm_locations = bs2.find_all('div', {'class': "firm-location"})

# Extract details
names2 = extract_detail(firm_names[3:])
motors = extract_detail(firm_motors)
reviews = extract_detail(firm_reviews)
ser, pct = extract_progress_values(progress_value)
prices = extract_detail(firm_prices)
emps = extract_detail(firm_emps)
years = extract_detail(firm_years)
locations = extract_detail(firm_locations)

# Put details in a dataframe
df2 = pd.DataFrame()
df2['firm_name'] = names2
df2['firm_motor'] = motors
df2['firm_review'] = reviews
df2['service_pct'] = ser
df2['platform_pct'] = pct
df2['firm_price'] = prices
df2['firm_employee'] = emps
df2['year_founded'] = years
df2['firm_location'] = locations

# Preview
df2.head()

Unnamed: 0,firm_name,firm_motor,firm_review,service_pct,platform_pct,firm_price,firm_employee,year_founded,firm_location
0,Inoxoft,"IEC/ISO 27001, Google, Microsoft Certified Com...",5.0 (8 Reviews),10%,10%,\n$25 - $49/hr,50 - 249,2014,\nUkraine
1,Noventum Custom Software Development Company,Custom Software & Web Development in New Mexico,5.0 (3 Reviews),20%,100%,\n$100 - $149/hr,2 - 9,2012,\nUnited States
2,Napollo Software Design L.L.C,Best Software Design Agency in Dubai & New York,5.0 (7 Reviews),20%,20%,\nNA,50 - 249,2011,\nUnited States
3,SemiDot Infotech Pvt Ltd,Right Technology Partner for IT Solutions,4.7 (7 Reviews),10%,30%,\n< $25/hr,50 - 249,2011,"\nUnited States, India"
4,Sphinx Solutions,Inspire : Innovate : Evolve,4.6 (8 Reviews),5%,10%,\n< $25/hr,Freelancer,2010,"\nIndia, United States"


### Page 3

In [10]:
# Open page 3 
with open("page3.html", encoding='utf-8', mode='r') as file:
    bs3 = BeautifulSoup(file, 'lxml')
    
# Locate details
firm_names = bs3.find_all('span', {'itemprop': "name"})
firm_motors = bs3.find_all('p', {'class': "profile-tagline"})
firm_reviews = bs3.find_all('span', {'class': "listinv_review_label"})
progress_value = bs3.find_all('div', {'class': "circle-progress-value"})
firm_prices = bs3.find_all('div', {'class': "firm-pricing"})
firm_emps = bs3.find_all('div', {'class': "firm-employees"})
firm_years = bs3.find_all('div', {'class': "firm-founded"})
firm_locations = bs3.find_all('div', {'class': "firm-location"})

# Extract details
names = extract_detail(firm_names[3:])
motors = extract_detail(firm_motors)
reviews = extract_detail(firm_reviews)
ser, pct = extract_progress_values(progress_value)
prices = extract_detail(firm_prices)
emps = extract_detail(firm_emps)
years = extract_detail(firm_years)
locations = extract_detail(firm_locations)

# Put details in a dataframe
df3 = pd.DataFrame()
df3['firm_name'] = names
df3['firm_motor'] = motors
df3['firm_review'] = reviews
df3['service_pct'] = ser
df3['platform_pct'] = pct
df3['firm_price'] = prices
df3['firm_employee'] = emps
df3['year_founded'] = years
df3['firm_location'] = locations

# Preview
df3.head()

Unnamed: 0,firm_name,firm_motor,firm_review,service_pct,platform_pct,firm_price,firm_employee,year_founded,firm_location
0,Hiteshi,Top Mobile & Web Development Company,5.0 (2 Reviews),10%,25%,\n$25 - $49/hr,50 - 249,2006,"\nIndia, Australia"
1,Techmango Technology Services Private Limited,Best Offshore Software Development Company,4.3 (3 Reviews),10%,20%,\n< $25/hr,250 - 999,2014,\nIndia
2,Evolve Technologies,IT agency,4.8 (1 Review),30%,30%,\n$50 - $99/hr,10 - 49,2000,\nIreland
3,Virtual Electronics PTE LTD,Software and mobile app development in Singapore,5.0 (2 Reviews),5%,25%,\n$25 - $49/hr,10 - 49,2019,\nSingapore
4,Reenbit,Intelligent engineering & beyond,5.0 (3 Reviews),5%,5%,\n$25 - $49/hr,50 - 249,2018,"\nUkraine, Poland"


### Combine all DataFrames

In [11]:
all_df = pd.concat([df1, df2, df3], ignore_index=True)
all_df

Unnamed: 0,firm_name,firm_motor,firm_review,service_pct,platform_pct,firm_price,firm_employee,year_founded,firm_location
0,SPEC INDIA,"Enterprise Software, Mobility & BI Solutions",4.8 (26 Reviews),20%,15%,\n< $25/hr,250 - 999,1987,"\nIndia, United States"
1,instinctools,Delivering the future. Now.,4.8 (8 Reviews),5%,10%,\n$50 - $99/hr,250 - 999,2000,"\nUnited States, Germany"
2,SoluLab,Blockchain | IoT | Mobility | AI | Big Data,5.0 (32 Reviews),15%,15%,\n$25 - $49/hr,50 - 249,2014,"\nUnited States, India"
3,Sigma Data Systems,Discover the world of Big Data with us!,4.7 (5 Reviews),40%,10%,\n$25 - $49/hr,250 - 999,2010,"\nUnited States, Australia"
4,NeenOpal Inc.,The Hub Of Data Science Innovation,5.0 (5 Reviews),25%,10%,\n$25 - $49/hr,10 - 49,2016,"\nUnited States, India"
...,...,...,...,...,...,...,...,...,...
148,Rock Your Data,Delivering Cloud Analytics that Rocks Your Data,0.0 (0 Review),70%,10%,\n$150 - $199/hr,10 - 49,2017,\nCanada
149,CodeRiders,We desire. Together we achieve!,0.0 (0 Review),15%,50%,\n$25 - $49/hr,10 - 49,2013,\nArmenia
150,Notionmind,Software. Strategy. Managed Services.,0.0 (0 Review),15%,50%,\n$50 - $99/hr,10 - 49,2019,\nIndia
151,OptimusFox,Best Blockchain Development Company in USA,0.0 (0 Review),15%,50%,\n$50 - $99/hr,50 - 249,2018,\nUnited States
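
The three page cells above repeat the same locate, extract, and build steps. In the spirit of Stage 2 of the road map, they could be folded into functions. The sketch below reuses the tag names and classes from the cells above; the function names `texts`, `page_to_frame`, and `load_pages` are hypothetical, and `html.parser` is used here only to avoid the `lxml` dependency:

```python
from pathlib import Path

import pandas as pd
from bs4 import BeautifulSoup


def texts(soup, tag, attrs):
    """Return the text of every tag matching the given name and attributes."""
    return [node.text for node in soup.find_all(tag, attrs)]


def page_to_frame(html):
    """Parse one listing page (HTML text) into a DataFrame of firm details."""
    soup = BeautifulSoup(html, "html.parser")  # the notebook uses "lxml"
    progress = texts(soup, "div", {"class": "circle-progress-value"})
    data = {
        # The first three name tags are skipped, as in the notebook.
        "firm_name": texts(soup, "span", {"itemprop": "name"})[3:],
        "firm_motor": texts(soup, "p", {"class": "profile-tagline"}),
        "firm_review": texts(soup, "span", {"class": "listinv_review_label"}),
        "service_pct": progress[0::2],   # even positions
        "platform_pct": progress[1::2],  # odd positions
        "firm_price": texts(soup, "div", {"class": "firm-pricing"}),
        "firm_employee": texts(soup, "div", {"class": "firm-employees"}),
        "year_founded": texts(soup, "div", {"class": "firm-founded"}),
        "firm_location": texts(soup, "div", {"class": "firm-location"}),
    }
    # Unlike column-by-column assignment, this raises if the lists differ in length.
    return pd.DataFrame(data)


def load_pages(paths):
    """Parse several saved pages and stack them into one DataFrame."""
    frames = [page_to_frame(Path(p).read_text(encoding="utf-8")) for p in paths]
    return pd.concat(frames, ignore_index=True)
```

Under these assumptions, `load_pages(['page1.html', 'page2.html', 'page3.html'])` would stand in for the three page cells and the `pd.concat` call.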


### Data Cleaning

In [12]:
df = all_df.copy()
df.head(5)

Unnamed: 0,firm_name,firm_motor,firm_review,service_pct,platform_pct,firm_price,firm_employee,year_founded,firm_location
0,SPEC INDIA,"Enterprise Software, Mobility & BI Solutions",4.8 (26 Reviews),20%,15%,\n< $25/hr,250 - 999,1987,"\nIndia, United States"
1,instinctools,Delivering the future. Now.,4.8 (8 Reviews),5%,10%,\n$50 - $99/hr,250 - 999,2000,"\nUnited States, Germany"
2,SoluLab,Blockchain | IoT | Mobility | AI | Big Data,5.0 (32 Reviews),15%,15%,\n$25 - $49/hr,50 - 249,2014,"\nUnited States, India"
3,Sigma Data Systems,Discover the world of Big Data with us!,4.7 (5 Reviews),40%,10%,\n$25 - $49/hr,250 - 999,2010,"\nUnited States, Australia"
4,NeenOpal Inc.,The Hub Of Data Science Innovation,5.0 (5 Reviews),25%,10%,\n$25 - $49/hr,10 - 49,2016,"\nUnited States, India"


In [13]:
df.insert(2, 'star_rating', df['firm_review'].apply(lambda x: x.split()[0]))
df.insert(3, 'review', df['firm_review'].apply(lambda x: x.split()[1].strip('(')))
df.rename(columns={'service_pct':'service_pct(%)'}, inplace=True)
df['service_pct(%)'] = df['service_pct(%)'].apply(lambda x: x.strip('%'))
df.rename(columns={'platform_pct':'platform_pct(%)'}, inplace=True)
df['platform_pct(%)'] = df['platform_pct(%)'].apply(lambda x: x.strip('%'))
df['firm_price'] = df['firm_price'].apply(lambda x: x.strip('\n'))
df['firm_location'] = df['firm_location'].apply(lambda x: x.strip('\n'))
df.drop(columns='firm_review', inplace=True)
df.head(5)

Unnamed: 0,firm_name,firm_motor,star_rating,review,service_pct(%),platform_pct(%),firm_price,firm_employee,year_founded,firm_location
0,SPEC INDIA,"Enterprise Software, Mobility & BI Solutions",4.8,26,20,15,< $25/hr,250 - 999,1987,"India, United States"
1,instinctools,Delivering the future. Now.,4.8,8,5,10,$50 - $99/hr,250 - 999,2000,"United States, Germany"
2,SoluLab,Blockchain | IoT | Mobility | AI | Big Data,5.0,32,15,15,$25 - $49/hr,50 - 249,2014,"United States, India"
3,Sigma Data Systems,Discover the world of Big Data with us!,4.7,5,40,10,$25 - $49/hr,250 - 999,2010,"United States, Australia"
4,NeenOpal Inc.,The Hub Of Data Science Innovation,5.0,5,25,10,$25 - $49/hr,10 - 49,2016,"United States, India"
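
Splitting `firm_review` on whitespace depends on the label always having the same shape. A regular expression can state the expected shape explicitly and fail loudly on anything else. This is a sketch that assumes labels look like the ones in the output above (`4.8 (26 Reviews)`, `0.0 (0 Review)`); `parse_review` is a hypothetical helper:

```python
import re

# A rating, then a review count in parentheses followed by "Review" or "Reviews".
REVIEW_PATTERN = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*\((\d+)\s+Reviews?\)\s*$")


def parse_review(text):
    """Split a label such as '4.8 (26 Reviews)' into (star_rating, review_count)."""
    match = REVIEW_PATTERN.match(text)
    if match is None:
        raise ValueError(f"unexpected review label: {text!r}")
    return float(match.group(1)), int(match.group(2))
```

Applied to the column, `df['firm_review'].apply(parse_review)` would yield tuples that already carry numeric types, so the later `astype` step for these two columns would not be needed.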


In [17]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 153 entries, 0 to 152
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   firm_name        153 non-null    object
 1   firm_motor       153 non-null    object
 2   star_rating      153 non-null    object
 3   review           153 non-null    object
 4   service_pct(%)   153 non-null    object
 5   platform_pct(%)  153 non-null    object
 6   firm_price       153 non-null    object
 7   firm_employee    153 non-null    object
 8   year_founded     153 non-null    object
 9   firm_location    153 non-null    object
dtypes: object(10)
memory usage: 12.1+ KB


In [18]:
df['star_rating'] = df['star_rating'].astype('float')
df['review'] = df['review'].astype('int')
df['service_pct(%)'] = df['service_pct(%)'].astype('int')
df['platform_pct(%)'] = df['platform_pct(%)'].astype('int')

### EDA

In [19]:
top5_sta_rev = df.sort_values(by=['star_rating', 'review'], ascending=False)[:5]
top5_sta_rev

Unnamed: 0,firm_name,firm_motor,star_rating,review,service_pct(%),platform_pct(%),firm_price,firm_employee,year_founded,firm_location
2,SoluLab,Blockchain | IoT | Mobility | AI | Big Data,5.0,32,15,15,$25 - $49/hr,50 - 249,2014,"United States, India"
31,S-PRO,Strategic partner.,5.0,30,10,25,$50 - $99/hr,50 - 249,2014,Poland
35,MobiDev,"Web and Mobile development, AI apps",5.0,16,10,10,$50 - $99/hr,250 - 999,2009,United States
39,APPSTIRR,Award Winning App Development Company,5.0,14,5,20,$25 - $49/hr,50 - 249,2010,"United States, United Arab Emirates"
41,Idealogic,End-to-end custom software development,5.0,13,10,20,$50 - $99/hr,50 - 249,2016,"Ukraine, Estonia"


In [28]:
px.histogram(df, 'star_rating', width=500, title='Star Rating Distribution')

* More than 100 companies have a rating between 5.0 and 5.2
* Fewer than 5 companies have a rating between 1 and 4.2
* 24 companies have a rating of zero (0)
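
Counts read off histogram bars are approximate and depend on the bin edges Plotly chooses. Exact counts can be computed from the column itself. A minimal sketch, assuming `star_rating` has already been converted to `float`; the function name and the three groups are illustrative choices:

```python
import pandas as pd


def rating_summary(df):
    """Count firms that are rated exactly 5.0, unrated (0.0), or in between."""
    ratings = df["star_rating"]
    return {
        "five_star": int((ratings == 5.0).sum()),
        "unrated": int((ratings == 0.0).sum()),
        "between": int(((ratings > 0.0) & (ratings < 5.0)).sum()),
    }
```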

In [20]:
px.histogram(df, 'review', width=500, title='Review Distribution')

From the distribution, more than 50 firms have between 0 and 1 reviews. Very few firms have more than 20 reviews.

In [30]:
px.bar(top5_sta_rev, 'firm_name', ['star_rating', 'review'], width=700,
       title='Top 5 firms based on star rating and review')

The bar chart above shows the top 5 firms with a 5-star rating and the most reviews.

In [31]:
top5_sta_ser = df.sort_values(by=['star_rating', 'service_pct(%)'], ascending=False)[:5]
px.bar(top5_sta_ser, 'firm_name', ['star_rating', 'service_pct(%)'], width=600,
       title='Top 5 firms based on star rating and service percent')

Datapine is the company with the highest service percent, followed by SetuServ.

In [24]:
df.sort_values('year_founded')[:5]

Unnamed: 0,firm_name,firm_motor,star_rating,review,service_pct(%),platform_pct(%),firm_price,firm_employee,year_founded,firm_location
137,SunTec Data,Turn Data Into Sight,0.0,0,100,20,,250 - 999,,"United States, India"
88,The Analyst Agency,"Web Developers in Buffalo, NY",5.0,1,30,50,,Freelancer,,"United States, India"
0,SPEC INDIA,"Enterprise Software, Mobility & BI Solutions",4.8,26,20,15,< $25/hr,250 - 999,1987.0,"India, United States"
90,ELEKS,Your Technology Partner for Software Innovation,5.0,3,10,20,$25 - $49/hr,"1,000 - 9,999",1991.0,"Estonia, United States"
66,NIX,"Enterprise Software, Data Analytics & BI Solut...",5.0,6,10,10,$50 - $99/hr,"1,000 - 9,999",1994.0,United States
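
In the output above, the rows with no founding year sort ahead of the oldest firms. One way to avoid that is to convert the column to numbers and drop the missing values before sorting. This is a sketch that assumes missing years are empty or otherwise non-numeric strings; `oldest_firms` is a hypothetical helper:

```python
import pandas as pd


def oldest_firms(df, n=5):
    """Return the n firms with the earliest founding year, ignoring missing years."""
    # Values that cannot be parsed as numbers become NaN.
    years = pd.to_numeric(df["year_founded"], errors="coerce")
    known = df.assign(year_founded=years).dropna(subset=["year_founded"])
    return known.sort_values("year_founded").head(n)
```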


In [25]:
df['firm_price'].value_counts()

$25 - $49/hr       68
$50 - $99/hr       35
< $25/hr           23
NA                 13
$100 - $149/hr     11
$150 - $199/hr      2
 $25 - $49/hr       1
Name: firm_price, dtype: int64
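
The counts above list `$25 - $49/hr` twice, once with a leading space, because `strip('\n')` removes only newline characters. Stripping all surrounding whitespace merges the two entries. The values below are a small made-up sample for illustration, not the scraped data:

```python
import pandas as pd

# Illustrative sample mirroring the kinds of values seen in firm_price.
prices = pd.Series(["$25 - $49/hr", " $25 - $49/hr", "\n< $25/hr", "NA"])

cleaned = prices.str.strip()  # removes spaces and newlines at both ends
counts = cleaned.value_counts()
print(counts)
```

In the notebook, the same idea would be `df['firm_price'] = df['firm_price'].str.strip()`.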

In [26]:
split = df['firm_location'].apply(lambda x: x.split(', '))
country_lst = []
for lst in split:
    country_lst += lst
new = []
for country in country_lst:
    new.append(country.strip())    

countries_cnt = pd.Series(new).value_counts(ascending=True)
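
The two loops above can also be written as a single pandas chain. A sketch using `Series.str.split` and `Series.explode`, which should give the same counts for comma-separated locations; the function name `count_countries` is hypothetical:

```python
import pandas as pd


def count_countries(locations):
    """Count how often each country appears in a Series of comma-separated locations."""
    return (
        locations.str.split(",")       # "India, United States" -> ["India", " United States"]
        .explode()                     # one row per country
        .str.strip()                   # drop stray spaces and newlines
        .value_counts(ascending=True)  # least frequent first, as in the notebook
    )
```

With the notebook's data this would be called as `count_countries(df['firm_location'])`.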

In [27]:
px.bar(x=countries_cnt.values, y=countries_cnt.index, labels={'x': 'count', 'y':''}, width=800)

### Summary

* The webpages from Goodfirms were successfully accessed and downloaded using the Requests library.
* BeautifulSoup was used to locate and extract the details from the downloaded HTML files.
* The extracted details were converted to a DataFrame using pandas.
* The data was cleaned and the columns were converted to appropriate data types.
* The cleaned data was explored for some insights.

#### Findings
* About 30 firms are rated 5 stars, while 12 are rated between 4.8 and 4.9
* Very few firms (3) have more than 30 reviews
* Most of the firms are located in the United States, followed by India.