# Web Scraping

## Top Data Analytics Companies
To improve the effectiveness of their business processes, companies are focusing on collecting and using data. Data analytics companies enable businesses to analyze the data they acquire and put it to use. Data analytics services can assist in product development, identifying potential market gaps, improving operational efficiency, and more. Source: [Goodfirms](https://www.goodfirms.co/big-data-analytics/data-analytics)

The aim of this project is to scrape the Goodfirms website for details about the top data analytics companies. These details include reviews, ratings, year founded, location, etc.

### Road Map
Stage 1
* Download page
* Open page
* Get details
* Extract details
* Make a dataframe

Stage 2 (Code Refactoring)
* Turn the loops into functions
* Move the functions to another file
* Import the functions and extract
* Make a dataframe
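
As a rough idea of where Stage 2 could end up, the sketch below folds the locate, extract and dataframe steps into one function that handles a single page. The selectors are the ones used in the notebook cells; the names `SELECTORS`, `COLUMN_ORDER`, `parse_page` and the `skip_names` parameter are hypothetical, and `html.parser` is used here instead of `lxml` only to avoid an extra dependency.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Tag/attribute pairs copied from the notebook; the site markup may change.
SELECTORS = {
    "firm_name": ("span", {"itemprop": "name"}),
    "firm_motor": ("p", {"class": "profile-tagline"}),
    "firm_review": ("span", {"class": "listinv_review_label"}),
    "firm_price": ("div", {"class": "firm-pricing"}),
    "firm_employee": ("div", {"class": "firm-employees"}),
    "year_founded": ("div", {"class": "firm-founded"}),
    "firm_location": ("div", {"class": "firm-location"}),
}

COLUMN_ORDER = [
    "firm_name", "firm_motor", "firm_review", "service_pct", "platform_pct",
    "firm_price", "firm_employee", "year_founded", "firm_location",
]


def parse_page(html, skip_names=3, parser="html.parser"):
    """Return one DataFrame of firm details for a single listing page.

    skip_names mirrors the notebook's firm_names[3:] slice, which drops the
    first three name spans before the firm entries start.
    """
    soup = BeautifulSoup(html, parser)
    columns = {}
    for column, (tag, attrs) in SELECTORS.items():
        texts = [t.text for t in soup.find_all(tag, attrs)]
        if column == "firm_name":
            texts = texts[skip_names:]
        columns[column] = pd.Series(texts, dtype="object")

    # Progress values alternate: service first, then platform.
    progress = [t.text for t in soup.find_all("div", {"class": "circle-progress-value"})]
    columns["service_pct"] = pd.Series(progress[::2], dtype="object")
    columns["platform_pct"] = pd.Series(progress[1::2], dtype="object")

    return pd.DataFrame(columns)[COLUMN_ORDER]
```

With a function like this, the three per-page cells reduce to a loop over the saved files followed by a single `pd.concat`.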

### Libraries

In [1]:
import requests as r
from bs4 import BeautifulSoup
import pandas as pd
import plotly.express as px
import plotly.offline as po
po.init_notebook_mode(connected=True)

### Download and save HTML file

In [12]:
# URL link
url_link = ['https://www.goodfirms.co/big-data-analytics/data-analytics', 'https://www.goodfirms.co/big-data-analytics/data-analytics?page=2', 
       'https://www.goodfirms.co/big-data-analytics/data-analytics?page=3']

# access website
access = [r.get(url) for url in url_link]

# """ OR

# access = []
# for url in url_link:
#     access.append(r.get(url))
    
# """


### Saving webpage to PC

In [13]:
# enumerate gives the 1-based page number directly
for index, page in enumerate(access, start=1):
    with open(f'page{index}.html', mode='wb') as file:
        file.write(page.content)
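
The download and save steps above could also be wrapped in two small helpers, sketched below. The function names, the `delay` pause and the `timeout` value are assumptions and not part of the original notebook; `raise_for_status()` is added so that an error response is not silently saved as a page.

```python
import time
from pathlib import Path

import requests

BASE_URL = "https://www.goodfirms.co/big-data-analytics/data-analytics"


def page_urls(base_url, n_pages):
    """Return the listing URLs: page 1 is the bare URL, later pages add ?page=N."""
    return [base_url if n == 1 else f"{base_url}?page={n}" for n in range(1, n_pages + 1)]


def download_pages(urls, out_dir=".", delay=1.0, timeout=10):
    """Fetch each URL and save it as page<N>.html; returns the saved paths."""
    out_dir = Path(out_dir)
    saved = []
    for index, url in enumerate(urls, start=1):
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()  # stop on 4xx/5xx instead of saving an error page
        path = out_dir / f"page{index}.html"
        path.write_bytes(response.content)
        saved.append(path)
        time.sleep(delay)  # hypothetical pause between requests
    return saved


urls = page_urls(BASE_URL, 3)  # same three URLs as url_link above
```

Calling `download_pages(urls)` would then write `page1.html` to `page3.html` in the current directory.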

### Page 1 

In [14]:
with open("page1.html", encoding='utf-8', mode='r') as file:
    bs = BeautifulSoup(file, 'lxml')

#### Locate Details

In [23]:
firm_names = bs.find_all('span', {'itemprop': "name"})
firm_motors = bs.find_all('p', {'class': "profile-tagline"})
firm_reviews = bs.find_all('span', {'class': "listinv_review_label"})
progress_value = bs.find_all('div', {'class': "circle-progress-value"})
firm_prices = bs.find_all('div', {'class': "firm-pricing"})
firm_emps = bs.find_all('div', {'class': "firm-employees"})
firm_years = bs.find_all('div', {'class': "firm-founded"})
firm_locations = bs.find_all('div', {'class': "firm-location"})

#### Function to extract details

In [24]:
def extract_detail(tag_lst):
    lst = [tag.text for tag in tag_lst]
    return pd.Series(lst)

def extract_progress_values(tag_lst):
    # progress values alternate: even positions are service, odd positions are platform
    service_pct = [tag.text for tag in tag_lst[::2]]
    platform_pct = [tag.text for tag in tag_lst[1::2]]
    return pd.Series(service_pct), pd.Series(platform_pct)

#### Extract Details

In [25]:
names = extract_detail(firm_names[3:])
motors = extract_detail(firm_motors)
reviews = extract_detail(firm_reviews)
ser, pct = extract_progress_values(progress_value)
prices = extract_detail(firm_prices)
emps = extract_detail(firm_emps)
years = extract_detail(firm_years)
locations = extract_detail(firm_locations)

### Put Details in a DataFrame

In [26]:
df1 = pd.DataFrame()

In [27]:
df1['firm_name'] = names
df1['firm_motor'] = motors
df1['firm_review'] = reviews
df1['service_pct'] = ser
df1['platform_pct'] = pct
df1['firm_price'] = prices
df1['firm_employee'] = emps
df1['year_founded'] = years
df1['firm_location'] = locations

In [28]:
df1

Unnamed: 0,firm_name,firm_motor,firm_review,service_pct,platform_pct,firm_price,firm_employee,year_founded,firm_location
0,SPEC INDIA,"Enterprise Software, Mobility & BI Solutions",4.8 (26 Reviews),20%,15%,\n< $25/hr,250 - 999,1987,"\nIndia, United States"
1,instinctools,Delivering the future. Now.,4.8 (8 Reviews),5%,10%,\n$50 - $99/hr,250 - 999,2000,"\nUnited States, Germany"
2,SoluLab,Blockchain | IoT | Mobility | AI | Big Data,5.0 (32 Reviews),15%,15%,\n$25 - $49/hr,50 - 249,2014,"\nUnited States, India"
3,ELEKS,Your Technology Partner for Software Innovation,5.0 (3 Reviews),10%,20%,\n$25 - $49/hr,"1,000 - 9,999",1991,"\nEstonia, United States"
4,Sigma Data Systems,Discover the world of Big Data with us!,4.7 (5 Reviews),40%,10%,\n$25 - $49/hr,250 - 999,2010,"\nUnited States, Australia"
5,NeenOpal Inc.,The Hub Of Data Science Innovation,5.0 (5 Reviews),25%,10%,\n $25 - $49/hr,10 - 49,2016,"\nUnited States, India"
6,CodeCompletePro,IT Outsourcing Services,5.0 (1 Review),20%,5%,\n$50 - $99/hr,10 - 49,2021,"\nUnited States, Philippines"
7,Consagous Technologies,Helping brands by being their Technology Partner,4.8 (32 Reviews),10%,20%,\n$25 - $49/hr,50 - 249,2008,"\nIndia, United States"
8,NEX Softsys,IT Partner for Global Clients,5.0 (12 Reviews),20%,20%,\n$25 - $49/hr,50 - 249,2003,\nUnited States
9,Beyond Key,IT Consulting and Software Development Services,5.0 (7 Reviews),20%,10%,\n$25 - $49/hr,250 - 999,2005,"\nUnited States, India"


### Page 2

In [29]:
# Open page 2
with open("page2.html", encoding='utf-8', mode='r') as file:
    bs2 = BeautifulSoup(file, 'lxml')
    
# Locate details
firm_names = bs2.find_all('span', {'itemprop': "name"})
firm_motors = bs2.find_all('p', {'class': "profile-tagline"})
firm_reviews = bs2.find_all('span', {'class': "listinv_review_label"})
progress_value = bs2.find_all('div', {'class': "circle-progress-value"})
firm_prices = bs2.find_all('div', {'class': "firm-pricing"})
firm_emps = bs2.find_all('div', {'class': "firm-employees"})
firm_years = bs2.find_all('div', {'class': "firm-founded"})
firm_locations = bs2.find_all('div', {'class': "firm-location"})

# Extract details
names2 = extract_detail(firm_names[3:])
motors = extract_detail(firm_motors)
reviews = extract_detail(firm_reviews)
ser, pct = extract_progress_values(progress_value)
prices = extract_detail(firm_prices)
emps = extract_detail(firm_emps)
years = extract_detail(firm_years)
locations = extract_detail(firm_locations)

# Put details in a dataframe
df2 = pd.DataFrame()
df2['firm_name'] = names2
df2['firm_motor'] = motors
df2['firm_review'] = reviews
df2['service_pct'] = ser
df2['platform_pct'] = pct
df2['firm_price'] = prices
df2['firm_employee'] = emps
df2['year_founded'] = years
df2['firm_location'] = locations

# Preview
df2.head()

Unnamed: 0,firm_name,firm_motor,firm_review,service_pct,platform_pct,firm_price,firm_employee,year_founded,firm_location
0,Napollo Software Design L.L.C,Best Software Design Agency in Dubai & New York,5.0 (7 Reviews),20%,20%,\nNA,50 - 249,2011,\nUnited States
1,SemiDot Infotech Pvt Ltd,Right Technology Partner for IT Solutions,4.7 (7 Reviews),10%,30%,\n< $25/hr,50 - 249,2011,"\nUnited States, India"
2,Sphinx Solutions,Inspire : Innovate : Evolve,4.6 (8 Reviews),5%,10%,\n< $25/hr,Freelancer,2010,"\nIndia, United States"
3,Redian Software,Delivering Open Source Solutions,5.0 (5 Reviews),10%,50%,\n$25 - $49/hr,10 - 49,2016,"\nUnited Kingdom, India"
4,Indus Net Technologies,We Deliver Digital Success,5.0 (5 Reviews),10%,50%,\n< $25/hr,250 - 999,1997,"\nIndia, United Kingdom"


### Page 3

In [30]:
# Open page 3 
with open("page3.html", encoding='utf-8', mode='r') as file:
    bs3 = BeautifulSoup(file, 'lxml')
    
# Locate details
firm_names = bs3.find_all('span', {'itemprop': "name"})
firm_motors = bs3.find_all('p', {'class': "profile-tagline"})
firm_reviews = bs3.find_all('span', {'class': "listinv_review_label"})
progress_value = bs3.find_all('div', {'class': "circle-progress-value"})
firm_prices = bs3.find_all('div', {'class': "firm-pricing"})
firm_emps = bs3.find_all('div', {'class': "firm-employees"})
firm_years = bs3.find_all('div', {'class': "firm-founded"})
firm_locations = bs3.find_all('div', {'class': "firm-location"})

# Extract details
names = extract_detail(firm_names[3:])
motors = extract_detail(firm_motors)
reviews = extract_detail(firm_reviews)
ser, pct = extract_progress_values(progress_value)
prices = extract_detail(firm_prices)
emps = extract_detail(firm_emps)
years = extract_detail(firm_years)
locations = extract_detail(firm_locations)

# Put details in a dataframe
df3 = pd.DataFrame()
df3['firm_name'] = names
df3['firm_motor'] = motors
df3['firm_review'] = reviews
df3['service_pct'] = ser
df3['platform_pct'] = pct
df3['firm_price'] = prices
df3['firm_employee'] = emps
df3['year_founded'] = years
df3['firm_location'] = locations

# Preview
df3.head()

Unnamed: 0,firm_name,firm_motor,firm_review,service_pct,platform_pct,firm_price,firm_employee,year_founded,firm_location
0,Starkflow,#1 platform for hiring SaaS talent globally.,5.0 (2 Reviews),10%,20%,\n$100 - $149/hr,10 - 49,2017,"\nUnited States, Ukraine"
1,ExpertsFromIndia,We Make IT Possible,5.0 (2 Reviews),2%,25%,\n$25 - $49/hr,250 - 999,2003,"\nUnited States, India"
2,Volumetree,Impact Through Technology,5.0 (2 Reviews),20%,10%,\n$25 - $49/hr,50 - 249,2017,"\nIndia, South Africa"
3,QuadLogix Technologies Pvt. Ltd.,Next-Gen Technology Solutions,5.0 (2 Reviews),5%,20%,\n$25 - $49/hr,10 - 49,2009,"\nIndia, United Arab Emirates"
4,WOXAPP,Mobile applications for startups and businesses,5.0 (2 Reviews),5%,20%,\n$25 - $49/hr,10 - 49,2011,\nUkraine


### Combine all DataFrames

In [31]:
all_df = pd.concat([df1, df2, df3], ignore_index=True)
all_df

Unnamed: 0,firm_name,firm_motor,firm_review,service_pct,platform_pct,firm_price,firm_employee,year_founded,firm_location
0,SPEC INDIA,"Enterprise Software, Mobility & BI Solutions",4.8 (26 Reviews),20%,15%,\n< $25/hr,250 - 999,1987,"\nIndia, United States"
1,instinctools,Delivering the future. Now.,4.8 (8 Reviews),5%,10%,\n$50 - $99/hr,250 - 999,2000,"\nUnited States, Germany"
2,SoluLab,Blockchain | IoT | Mobility | AI | Big Data,5.0 (32 Reviews),15%,15%,\n$25 - $49/hr,50 - 249,2014,"\nUnited States, India"
3,ELEKS,Your Technology Partner for Software Innovation,5.0 (3 Reviews),10%,20%,\n$25 - $49/hr,"1,000 - 9,999",1991,"\nEstonia, United States"
4,Sigma Data Systems,Discover the world of Big Data with us!,4.7 (5 Reviews),40%,10%,\n$25 - $49/hr,250 - 999,2010,"\nUnited States, Australia"
...,...,...,...,...,...,...,...,...,...
157,Right Information,Perfect match when standard software is not en...,0.0 (0 Review),10%,50%,\n$50 - $99/hr,10 - 49,2001,\nPoland
158,ISHIR Secure,"ISHIR Secure, a managed security services prov...",0.0 (0 Review),20%,40%,\nNA,50 - 249,1999,\nUnited States
159,"BrookeWealth Global, LLC",Business Consultants. Creating Value.,0.0 (0 Review),50%,15%,\n$100 - $149/hr,2 - 9,2019,\nUnited States
160,Easy Code LTD,bespoke software solutions,0.0 (0 Review),5%,50%,\n$100 - $149/hr,2 - 9,2016,\nUnited Kingdom


### Data Cleaning

In [33]:
# extracting star rating 
values = all_df['firm_review'].apply(lambda x: x.split()[0])
all_df.insert(3, 'star_rating', values)

# extracting number of reviews
val = all_df['firm_review'].apply(lambda x: x.split()[1].strip('('))
all_df.insert(4, 'firm_rev', val)

# drop firm reviews column
all_df.drop(columns='firm_review', inplace=True)

# remove "%" from firm service and platform percent
all_df['service_pct'] = all_df['service_pct'].apply(lambda x: x.strip('%'))
all_df['platform_pct'] = all_df['platform_pct'].apply(lambda x: x.strip('%'))

# rename columns
all_df.rename(columns={'service_pct': 'service_pct(%)', 'platform_pct':'platform_pct(%)'}, inplace=True)

# remove the leading "\n" and any stray spaces from firm price and location
all_df['firm_price'] = all_df['firm_price'].str.strip()
all_df['firm_location'] = all_df['firm_location'].str.strip()
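
As an alternative to the two `split()` calls used for the rating and the review count, a single regular expression with named groups can pull both values out at once. This is only a sketch on three sample strings taken from the tables above; the pattern assumes every value looks like `4.8 (26 Reviews)`.

```python
import pandas as pd

# Sample values in the same format as the firm_review column
reviews = pd.Series(["4.8 (26 Reviews)", "5.0 (1 Review)", "0.0 (0 Review)"])

# Named groups become the column names of the resulting DataFrame
pattern = r"^\s*(?P<star_rating>\d+(?:\.\d+)?)\s*\((?P<firm_rev>\d+)\s+Reviews?\)"
parts = reviews.str.extract(pattern)

# Convert the captured strings to numbers in one step
parts = parts.astype({"star_rating": "float", "firm_rev": "int"})
```

Rows that do not match the pattern would come back as missing values, which makes malformed entries easy to spot before the `astype` call.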

In [34]:
all_df.head()

Unnamed: 0,firm_name,firm_motor,star_rating,firm_rev,service_pct(%),platform_pct(%),firm_price,firm_employee,year_founded,firm_location
0,SPEC INDIA,"Enterprise Software, Mobility & BI Solutions",4.8,26,20,15,< $25/hr,250 - 999,1987,"India, United States"
1,instinctools,Delivering the future. Now.,4.8,8,5,10,$50 - $99/hr,250 - 999,2000,"United States, Germany"
2,SoluLab,Blockchain | IoT | Mobility | AI | Big Data,5.0,32,15,15,$25 - $49/hr,50 - 249,2014,"United States, India"
3,ELEKS,Your Technology Partner for Software Innovation,5.0,3,10,20,$25 - $49/hr,"1,000 - 9,999",1991,"Estonia, United States"
4,Sigma Data Systems,Discover the world of Big Data with us!,4.7,5,40,10,$25 - $49/hr,250 - 999,2010,"United States, Australia"


### Save dataframe to CSV file

In [35]:
all_df.to_csv('clean_extraction.csv', index=False)
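
One thing to watch when this file is read back: by default `pd.read_csv` treats the string `NA` as a missing value, so a price listed as `NA` would come back as `NaN`. The sketch below shows the difference on a two-row sample using an in-memory buffer; whether that matters for this dataset depends on how the CSV is used later.

```python
import io

import pandas as pd

df = pd.DataFrame({
    "firm_name": ["ISHIR Secure", "Easy Code LTD"],
    "firm_price": ["NA", "$100 - $149/hr"],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
df.to_csv(buffer, index=False)

buffer.seek(0)
default_read = pd.read_csv(buffer)  # "NA" is parsed as a missing value

buffer.seek(0)
literal_read = pd.read_csv(buffer, keep_default_na=False)  # "NA" stays a string
```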

### Changing Data Type

In [36]:
all_df.dtypes

firm_name          object
firm_motor         object
star_rating        object
firm_rev           object
service_pct(%)     object
platform_pct(%)    object
firm_price         object
firm_employee      object
year_founded       object
firm_location      object
dtype: object

In [37]:
all_df['star_rating'] = all_df['star_rating'].astype('float')
all_df['firm_rev'] = all_df['firm_rev'].astype('int')
all_df['service_pct(%)'] = all_df['service_pct(%)'].astype('int')
all_df['platform_pct(%)'] = all_df['platform_pct(%)'].astype('int')
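
`astype` raises an error as soon as one value cannot be converted. Where that is a concern, for example for `year_founded`, which stays `object` in this notebook, `pd.to_numeric` with `errors="coerce"` is a more forgiving option. The sketch below uses made-up values, including a deliberately bad one, to show the behaviour.

```python
import pandas as pd

# Illustrative values only; "NA" stands in for an entry that is not a year
raw = pd.Series(["1987", "2000", "NA"])

# Unparseable entries become NaN instead of raising
years = pd.to_numeric(raw, errors="coerce")
n_bad = int(years.isna().sum())

# The nullable Int64 dtype keeps whole-number years alongside missing values
years_int = years.astype("Int64")
```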

In [38]:
all_df.dtypes

firm_name           object
firm_motor          object
star_rating        float64
firm_rev             int32
service_pct(%)       int32
platform_pct(%)      int32
firm_price          object
firm_employee       object
year_founded        object
firm_location       object
dtype: object

In [40]:
all_df

Unnamed: 0,firm_name,firm_motor,star_rating,firm_rev,service_pct(%),platform_pct(%),firm_price,firm_employee,year_founded,firm_location
0,SPEC INDIA,"Enterprise Software, Mobility & BI Solutions",4.8,26,20,15,< $25/hr,250 - 999,1987,"India, United States"
1,instinctools,Delivering the future. Now.,4.8,8,5,10,$50 - $99/hr,250 - 999,2000,"United States, Germany"
2,SoluLab,Blockchain | IoT | Mobility | AI | Big Data,5.0,32,15,15,$25 - $49/hr,50 - 249,2014,"United States, India"
3,ELEKS,Your Technology Partner for Software Innovation,5.0,3,10,20,$25 - $49/hr,"1,000 - 9,999",1991,"Estonia, United States"
4,Sigma Data Systems,Discover the world of Big Data with us!,4.7,5,40,10,$25 - $49/hr,250 - 999,2010,"United States, Australia"
...,...,...,...,...,...,...,...,...,...,...
157,Right Information,Perfect match when standard software is not en...,0.0,0,10,50,$50 - $99/hr,10 - 49,2001,Poland
158,ISHIR Secure,"ISHIR Secure, a managed security services prov...",0.0,0,20,40,,50 - 249,1999,United States
159,"BrookeWealth Global, LLC",Business Consultants. Creating Value.,0.0,0,50,15,$100 - $149/hr,2 - 9,2019,United States
160,Easy Code LTD,bespoke software solutions,0.0,0,5,50,$100 - $149/hr,2 - 9,2016,United Kingdom


### EDA

In [43]:
px.histogram(all_df, 'star_rating', width=500, title='Star Rating Distribution')

* More than 100 firms have a star rating between 4.5 and 5.4
* About 30 firms have a rating below 1 (most likely the unreviewed firms listed as 0.0)
* Fewer than 10 firms have a rating between 1 and 4
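
Counts like these can also be computed rather than read off the chart, by binning the ratings with `pd.cut`. The sketch below runs on a small illustrative series (the 3.5 is a made-up value, the others appear in the tables above); the band edges are an assumption chosen to mirror the bullets.

```python
import pandas as pd

# Illustrative ratings; on the real data this would be all_df['star_rating']
ratings = pd.Series([4.8, 4.8, 5.0, 5.0, 4.7, 0.0, 0.0, 3.5])

# Bands: [0, 1], (1, 4.5], (4.5, 5]
bands = pd.cut(
    ratings,
    bins=[0, 1, 4.5, 5],
    labels=["0 - 1", "1 - 4.5", "4.5 - 5"],
    include_lowest=True,
)

# Number of firms per band, in band order
counts = bands.value_counts().sort_index()
```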

In [44]:
px.histogram(all_df, 'firm_rev', width=500, title='Review Distribution')

From the distribution above, more than 60 firms have between 0 and 1 reviews.

In [45]:
top5_str_rev = all_df.sort_values(by=['star_rating', 'firm_rev'], ascending=False)[:5]
px.bar(top5_str_rev, 'firm_name', ['star_rating', 'firm_rev'], width=700, title='Top 5 Firms based on star rating and reviews')

In [46]:
top5_str_sev = all_df.sort_values(['star_rating', 'service_pct(%)'], ascending=False)[:5]
px.bar(top5_str_sev, 'firm_name', 'service_pct(%)', width=700, 
       title='Top 5 Firms based on star rating and service percentage')

In [48]:
split = all_df['firm_location'].apply(lambda x: x.split(', '))
lst = []
for x in split:
    lst += x
    
new_lst = []
for x in lst:
    new_lst.append(x.strip())
locations = pd.Series(new_lst).value_counts(ascending=True)
locations[:5]

Hungary        1
Armenia        1
Netherlands    1
Philippines    1
Bulgaria       1
dtype: int64
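
The two loops used to flatten and strip the location lists can be expressed with pandas string methods and `explode`. This is a sketch of the same counting logic on the five locations shown in the first rows of the table, not a change to the notebook's result.

```python
import pandas as pd

# First five firm_location values from the cleaned table
locations = pd.Series([
    "India, United States",
    "United States, Germany",
    "United States, India",
    "Estonia, United States",
    "United States, Australia",
])

# Split on commas, give each country its own row, then trim whitespace
countries = locations.str.split(",").explode().str.strip()

# Frequency of each country across all firms
counts = countries.value_counts()
```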

In [49]:
px.bar(y=locations.index, x=locations.values, width=900, height=800, 
       title= 'Country Frequency', labels={'y':'', 'x':'frequency'})

### Summary

* The webpages from Goodfirms were successfully accessed and downloaded using the Requests library.
* BeautifulSoup was used to locate and extract the details from the downloaded HTML files.
* The extracted details were converted to a dataframe using Pandas.
* The data was cleaned and its columns converted to the right data types.
* The cleaned data was explored for some insights.

#### Findings
* About 10 firms are rated between 1 and 4 stars, while 124 are rated above 4.5.
* Very few firms (2) have more than 30 reviews.
* Most of the firms are located in the United States, followed by India.