# WEB SCRAPING

Top Data Analytics Companies

Companies collect and utilize data to improve its effectiveness. 
Data Analytics services can help uncover underlying drivers of profitability and identify opportunities to optimize, etc.

The aim of this project is to scrape Goodfirms website for details about top data analytics company. 
These details includes, review, rating, year founded, location, etc.

Road Map
Stage 1

* Import necessary libraries
* Download page
* Open page
* Get details
* Extract details
* Make a dataframe

Stage 2: Code Refactoring

* Change loops to function
* Take functions to another file
* Import functions and extract
* Make a dataframe

Stage 3: Data Cleaning

Stage 4: EDA

IMPORTING THE NECESSARY LIBRARIES 

In [1]:
import requests as r 
from bs4 import BeautifulSoup 
import pandas as pd 
import plotly.express as px 
import plotly.offline as po 
po.init_notebook_mode(connected=True) 

DOWNLOADING AND SAVING THE HTML FILE 

In [2]:
url_link = ['https://www.goodfirms.co/big-data-analytics/data-analytics', 'https://www.goodfirms.co/big-data-analytics/data-analytics?page=2', 
            'https://www.goodfirms.co/big-data-analytics/data-analytics?page=3', 'https://www.goodfirms.co/big-data-analytics/data-analytics?page=4', 
            'https://www.goodfirms.co/big-data-analytics/data-analytics?page=5']

#Access url link
access = [r.get(url) for url in url_link] 

SAVING WEBPAGE TO PC 

In [3]:
for page in access:
    index = access.index(page)+1
    with open(f'page{index}.html', mode='wb') as file:
        file.write(page.content) 

PAGE 1

In [4]:
with open("page1.html", encoding='utf-8', mode='r') as file:
    bs = BeautifulSoup(file, 'lxml') 

LOCATING DETAILS 

In [5]:
firm_name = bs.find_all('span', {'itemprop': 'name'})
firm_motor = bs.find_all('p', {'class': 'profile-tagline'})
firm_reviews = bs.find_all('span', {'class': 'listinv_review_label'})
progress_value = bs.find_all('div', {'class': 'circle-progress-value'})
firm_price = bs.find_all('div', {'class': 'firm-pricing'})
firm_employees = bs.find_all('div', {'class': 'firm-employees'})
year_founded = bs.find_all('div', {'class': 'firm-founded'})
firm_location = bs.find_all('div', {'class': 'firm-location'}) 

FUNCTION TO EXTRACT DETAILS 

In [6]:
def extract_detail(tag_lst):
    lst = [tag.text for tag in tag_lst]
    return pd.Series(lst)

def extract_progress_values(tag_lst):    
    service_pct = [percent[1].text for percent in enumerate(tag_lst) if percent[0]%2 ==0]
    platform_pct = [percent[1].text for percent in enumerate(tag_lst) if percent[0]%2 ==1]       
    return pd.Series(service_pct), pd.Series(platform_pct) 

EXTRACTING DETAILS 

In [7]:
name = extract_detail(firm_name[3:])
motor = extract_detail(firm_motor)
reviews = extract_detail(firm_reviews)
service, platform = extract_progress_values(progress_value)
price = extract_detail(firm_price)
employees = extract_detail(firm_employees)
year = extract_detail(year_founded)
location = extract_detail(firm_location) 

PUTTING DETAILS IN A DATAFRAME 

In [8]:
df1 = pd.DataFrame() 

In [9]:
df1['firm_name'] = name
df1['firm_motor'] = motor
df1['firm_reviews'] = reviews
df1['service_pct'] = service
df1['platform_pct'] = platform
df1['firm_price'] = price
df1['firm_employees'] = employees
df1['year_founded'] = year
df1['firm_location'] = location 

In [34]:
df1 

Unnamed: 0,firm_name,firm_motor,firm_reviews,service_pct,platform_pct,firm_price,firm_employees,year_founded,firm_location
0,SPEC INDIA,"Enterprise Software, Mobility & BI Solutions",4.8 (26 Reviews),20%,15%,\n< $25/hr,250 - 999,1987,"\n India, United States"
1,instinctools,Delivering the future. Now.,4.8 (8 Reviews),5%,10%,\n$50 - $99/hr,250 - 999,2000,"\nUnited States, Germany"
2,SoluLab,Blockchain | IoT | Mobility | AI | Big Data,5.0 (32 Reviews),15%,15%,\n$25 - $49/hr,50 - 249,2014,"\nUnited States, India"
3,Sigma Data Systems,Discover the world of Big Data with us!,4.7 (5 Reviews),40%,10%,\n$25 - $49/hr,250 - 999,2010,"\nUnited States, Australia"
4,NeenOpal Inc.,The Hub Of Data Science Innovation,5.0 (5 Reviews),25%,10%,\n$25 - $49/hr,10 - 49,2016,"\nUnited States, India"
5,Consagous Technologies,Helping brands by being their Technology Partner,4.8 (32 Reviews),10%,20%,\n$25 - $49/hr,50 - 249,2008,"\nIndia, United States"
6,NEX Softsys,IT Partner for Global Clients,5.0 (12 Reviews),20%,20%,\n$25 - $49/hr,50 - 249,2003,\nUnited States
7,Beyond Key,IT Consulting and Software Development Services,5.0 (7 Reviews),20%,10%,\n$25 - $49/hr,250 - 999,2005,"\nUnited States, India"
8,Datapine,BUSINESS INTELLIGENCE MADE EASY,5.0 (3 Reviews),90%,5%,\n$50 - $99/hr,10 - 49,2012,\nGermany
9,Dataforest,Data Engineering and Web Product Development,5.0 (4 Reviews),30%,20%,\n$50 - $99/hr,50 - 249,2018,"\nUkraine, Estonia"


PAGE 2

In [10]:
with open("page2.html", encoding='utf-8', mode='r') as file:
    bs2 = BeautifulSoup(file, 'lxml') 

LOCATING DETAILS 

In [11]:
firm_name = bs2.find_all('span', {'itemprop': 'name'})
firm_motor = bs2.find_all('p', {'class': 'profile-tagline'})
firm_reviews = bs2.find_all('span', {'class': 'listinv_review_label'})
progress_value = bs2.find_all('div', {'class': 'circle-progress-value'})
firm_price = bs2.find_all('div', {'class': 'firm-pricing'})
firm_employees = bs2.find_all('div', {'class': 'firm-employees'})
year_founded = bs2.find_all('div', {'class': 'firm-founded'})
firm_location = bs2.find_all('div', {'class': 'firm-location'})

EXTRACTING DETAILS

In [12]:
name2 = extract_detail(firm_name[3:])
motor2 = extract_detail(firm_motor)
reviews2 = extract_detail(firm_reviews)
service2, platform2 = extract_progress_values(progress_value)
price2 = extract_detail(firm_price)
employees2 = extract_detail(firm_employees)
year2 = extract_detail(year_founded)
location2 = extract_detail(firm_location) 

PUTTING DETAILS IN A DATAFRAME 

In [13]:
df2 = pd.DataFrame() 

In [14]:
df2['firm_name'] = name2
df2['firm_motor'] = motor2
df2['firm_reviews'] = reviews2
df2['service_pct'] = service
df2['platform_pct'] = platform
df2['firm_price'] = price2
df2['firm_employees'] = employees2
df2['year_founded'] = year2
df2['firm_location'] = location2  

In [15]:
df2 

Unnamed: 0,firm_name,firm_motor,firm_reviews,service_pct,platform_pct,firm_price,firm_employees,year_founded,firm_location
0,Sphinx Solutions,Inspire : Innovate : Evolve,4.6 (8 Reviews),20%,15%,\n< $25/hr,Freelancer,2010.0,"\nIndia, United States"
1,Redian Software,Delivering Open Source Solutions,5.0 (5 Reviews),5%,10%,\n$25 - $49/hr,10 - 49,2016.0,"\nUnited Kingdom, India"
2,Indus Net Technologies,We Deliver Digital Success,5.0 (5 Reviews),15%,15%,\n< $25/hr,250 - 999,1997.0,"\nIndia, United Kingdom"
3,RWaltz Group Inc.,Blockchain Solutions Experts,5.0 (5 Reviews),40%,10%,\n$50 - $99/hr,50 - 249,2000.0,\nUnited States
4,UNL Solutions,"Dedicated Team, Dedicated Developer",4.3 (3 Reviews),25%,10%,\n$25 - $49/hr,50 - 249,2006.0,"\nUnited Kingdom, Belarus"
5,Shockoe,Mobile by Design,5.0 (2 Reviews),10%,20%,\n$100 - $149/hr,10 - 49,2010.0,\nUnited States
6,Decipher Zone Technologies Pvt Ltd,Java Development Company,5.0 (2 Reviews),20%,20%,\n< $25/hr,50 - 249,2015.0,\nIndia
7,Stellen Infotech,You Think We Create,5.0 (2 Reviews),20%,10%,\n$25 - $49/hr,50 - 249,2010.0,"\nIndia, United States"
8,N-iX,Software development company,5.0 (5 Reviews),90%,5%,\n$50 - $99/hr,"1,000 - 9,999",2002.0,"\nMalta, Ukraine"
9,Build Scale Prosper,Build. Scale. Prosper.,5.0 (1 Review),30%,20%,\n$100 - $149/hr,2 - 9,2018.0,\nUnited States


PAGE 3

In [16]:
with open("page3.html", encoding='utf-8', mode='r') as file:
    bs3 = BeautifulSoup(file, 'lxml')

LOCATING DETAILS

In [17]:
firm_name = bs3.find_all('span', {'itemprop': 'name'})
firm_motor = bs3.find_all('p', {'class': 'profile-tagline'})
firm_reviews = bs3.find_all('span', {'class': 'listinv_review_label'})
progress_value = bs3.find_all('div', {'class': 'circle-progress-value'})
firm_price = bs3.find_all('div', {'class': 'firm-pricing'})
firm_employees = bs3.find_all('div', {'class': 'firm-employees'})
year_founded = bs3.find_all('div', {'class': 'firm-founded'})
firm_location = bs3.find_all('div', {'class': 'firm-location'})

EXTRACTING DETAILS 

In [18]:
name3 = extract_detail(firm_name[3:])
motor3 = extract_detail(firm_motor)
reviews3 = extract_detail(firm_reviews)
service3, platform3 = extract_progress_values(progress_value)
price3 = extract_detail(firm_price)
employees3 = extract_detail(firm_employees)
year3 = extract_detail(year_founded)
location3 = extract_detail(firm_location) 

PUTTING DETAILS IN A DATAFRAME

In [19]:
df3 = pd.DataFrame() 

In [20]:
df3['firm_name'] = name3
df3['firm_motor'] = motor3
df3['firm_reviews'] = reviews3
df3['service_pct'] = service
df3['platform_pct'] = platform
df3['firm_price'] = price3
df3['firm_employees'] = employees3
df3['year_founded'] = year3
df3['firm_location'] = location3  

In [21]:
df3 

Unnamed: 0,firm_name,firm_motor,firm_reviews,service_pct,platform_pct,firm_price,firm_employees,year_founded,firm_location
0,ExpertsFromIndia,We Make IT Possible,5.0 (2 Reviews),20%,15%,\n$25 - $49/hr,250 - 999,2003.0,"\nUnited States, India"
1,Volumetree,Impact Through Technology,5.0 (2 Reviews),5%,10%,\n$25 - $49/hr,50 - 249,2017.0,"\nIndia, South Africa"
2,QuadLogix Technologies Pvt. Ltd.,Next-Gen Technology Solutions,5.0 (2 Reviews),15%,15%,\n$25 - $49/hr,10 - 49,2009.0,"\nIndia, United Arab Emirates"
3,WOXAPP,Mobile applications for startups and businesses,5.0 (2 Reviews),40%,10%,\n$25 - $49/hr,10 - 49,2011.0,\nUkraine
4,ISS Art,It’s your chance to break new ground in business!,5.0 (2 Reviews),25%,10%,\n$25 - $49/hr,50 - 249,2003.0,"\nRussia, United States"
5,47Billion,Data Analytics | UXUI | Product Development | ML,5.0 (1 Review),10%,20%,\n< $25/hr,50 - 249,2012.0,"\nUnited States, India"
6,Talentelgia Technologies Private Limited,Transforming ideas into innovations,5.0 (2 Reviews),20%,20%,\n< $25/hr,50 - 249,2012.0,"\nIndia, United Kingdom"
7,Endion IT,Web & Mobile App Development,5.0 (1 Review),20%,10%,\n$25 - $49/hr,10 - 49,2017.0,\nArgentina
8,Programmers.io,Your Extended Software Development Team,5.0 (1 Review),90%,5%,\n$25 - $49/hr,250 - 999,2013.0,"\nUnited States, India"
9,Intetics Inc.,Where Software Concepts Come Alive™,5.0 (1 Review),30%,20%,\n$50 - $99/hr,250 - 999,1995.0,"\nUnited States, Germany"


PAGE 4

In [22]:
with open("page4.html", encoding='utf-8', mode='r') as file:
    bs4 = BeautifulSoup(file, 'lxml')

LOCATING DETAILS 

In [23]:
firm_name = bs4.find_all('span', {'itemprop': 'name'})
firm_motor = bs4.find_all('p', {'class': 'profile-tagline'})
firm_reviews = bs4.find_all('span', {'class': 'listinv_review_label'})
progress_value = bs4.find_all('div', {'class': 'circle-progress-value'})
firm_price = bs4.find_all('div', {'class': 'firm-pricing'})
firm_employees = bs4.find_all('div', {'class': 'firm-employees'})
year_founded = bs4.find_all('div', {'class': 'firm-founded'})
firm_location = bs4.find_all('div', {'class': 'firm-location'})

EXTRACTING DETAILS

In [24]:
name4 = extract_detail(firm_name[3:])
motor4 = extract_detail(firm_motor)
reviews4 = extract_detail(firm_reviews)
service4, platform4 = extract_progress_values(progress_value)
price4 = extract_detail(firm_price)
employees4 = extract_detail(firm_employees)
year4 = extract_detail(year_founded)
location4 = extract_detail(firm_location) 

PUTTING DETAILS IN A DATAFRAME

In [25]:
df4 = pd.DataFrame() 

In [26]:
df4['firm_name'] = name4
df4['firm_motor'] = motor4
df4['firm_reviews'] = reviews4
df4['service_pct'] = service
df4['platform_pct'] = platform
df4['firm_price'] = price4
df4['firm_employees'] = employees4
df4['year_founded'] = year4
df4['firm_location'] = location4  

In [27]:
df4 

Unnamed: 0,firm_name,firm_motor,firm_reviews,service_pct,platform_pct,firm_price,firm_employees,year_founded,firm_location
0,Digital Order Technology Pvt. Ltd.,Integration to Innovation,0.0 (0 Review),20%,15%,\n< $25/hr,2 - 9,2016.0,\nIndia
1,Marlabs Inc.,Accelerate your Digital Transformation,0.0 (0 Review),5%,10%,\nNA,"1,000 - 9,999",1996.0,"\nUnited States, India"
2,Exometrics,Artificial Intelligence for Business,0.0 (0 Review),15%,15%,\n$100 - $149/hr,2 - 9,2016.0,\nUnited Kingdom
3,Visichain,Accelerate your digital procurement transforma...,0.0 (0 Review),40%,10%,\nNA,50 - 249,2015.0,\nChina
4,Abto Software,Where science and technology work for you,0.0 (0 Review),25%,10%,\n$25 - $49/hr,50 - 249,2007.0,"\nUkraine, United States"
5,Profinit,"Custom SW Development, Data Science & Outsourcing",0.0 (0 Review),10%,20%,\n$50 - $99/hr,250 - 999,1998.0,"\nCzech Republic, Slovakia"
6,Prompt Softech,Empowering Enterprises,0.0 (0 Review),20%,20%,\n< $25/hr,50 - 249,2011.0,\nIndia
7,good chain and sustainable supplies Ltd,Better be good than fake perfection,0.0 (0 Review),20%,10%,\nNA,2 - 9,2018.0,"\nChina, Australia"
8,Monique M & Company Digital Marketing,Impacting lives,0.0 (0 Review),90%,5%,\n$25 - $49/hr,2 - 9,2016.0,\nKenya
9,MindGap,Unleash the power of your data with Strategic AI,0.0 (0 Review),30%,20%,\n$50 - $99/hr,2 - 9,2019.0,\nUnited Kingdom


PAGE 5

In [28]:
with open("page5.html", encoding='utf-8', mode='r') as file:
    bs5 = BeautifulSoup(file, 'lxml')

LOCATING DETAILS

In [29]:
firm_name = bs5.find_all('span', {'itemprop': 'name'})
firm_motor = bs5.find_all('p', {'class': 'profile-tagline'})
firm_reviews = bs5.find_all('span', {'class': 'listinv_review_label'})
progress_value = bs5.find_all('div', {'class': 'circle-progress-value'})
firm_price = bs5.find_all('div', {'class': 'firm-pricing'})
firm_employees = bs5.find_all('div', {'class': 'firm-employees'})
year_founded = bs5.find_all('div', {'class': 'firm-founded'})
firm_location = bs5.find_all('div', {'class': 'firm-location'})

EXTRACTING DETAILS

In [30]:
name5 = extract_detail(firm_name[3:])
motor5 = extract_detail(firm_motor)
reviews5 = extract_detail(firm_reviews)
service5, platform5 = extract_progress_values(progress_value)
price5 = extract_detail(firm_price)
employees5 = extract_detail(firm_employees)
year5 = extract_detail(year_founded)
location5 = extract_detail(firm_location) 

PUTTING DETAILS IN A DATAFRAME

In [31]:
df5 = pd.DataFrame() 

In [32]:
df5['firm_name'] = name5
df5['firm_motor'] = motor5
df5['firm_reviews'] = reviews5
df5['service_pct'] = service
df5['platform_pct'] = platform
df5['firm_price'] = price5
df5['firm_employees'] = employees5
df5['year_founded'] = year5
df5['firm_location'] = location5 

In [33]:
df5 

Unnamed: 0,firm_name,firm_motor,firm_reviews,service_pct,platform_pct,firm_price,firm_employees,year_founded,firm_location
0,Indicium Tech,Data Science as a Service,5.0 (6 Reviews),20%,15%,\n$25 - $49/hr,10 - 49,2017.0,\nBrazil
1,Analytics8,Data and Analytics. It's what we do.,5.0 (5 Reviews),5%,10%,\n$25 - $49/hr,10 - 49,,\nUnited States
2,Quilytics,Your Data Our Analytics,5.0 (4 Reviews),15%,15%,\n$50 - $99/hr,2 - 9,2020.0,\nUnited States
3,Forte Group,Your full-spectrum software delivery partner,5.0 (8 Reviews),40%,10%,\n$50 - $99/hr,250 - 999,2000.0,"\nUnited States, Ukraine"
4,Zoomdata,The Fastest Visual Analytics for Big Data,5.0 (1 Review),25%,10%,\n$100 - $149/hr,50 - 249,2012.0,"\nUnited States, Singapore"
5,Enlightenment.ai,Optimize your processes using data and AI,5.0 (2 Reviews),10%,20%,\n$200 - $300/hr,2 - 9,2018.0,\nPortugal
6,QBurst,Technology Leveraged for Your Business,4.9 (4 Reviews),20%,20%,\n$25 - $49/hr,250 - 999,2004.0,"\nIndia, United States"
7,Qlik,Faster answers. More insights. Better outcomes,5.0 (1 Review),20%,10%,\n$100 - $149/hr,"1,000 - 9,999",1993.0,\nUnited States
8,CBIG Consulting,Have a conversation with your data.,5.0 (1 Review),90%,5%,\n$100 - $149/hr,50 - 249,2002.0,\nUnited States
9,Dimensional Insight,Analytics tools built on Diver Platform,5.0 (1 Review),30%,20%,\n$100 - $149/hr,50 - 249,1989.0,"\nUnited States, Germany"


COMBINING ALL DATAFRAMES

In [35]:
all_df = pd.concat([df1, df2, df3, df4, df5], ignore_index=True)
all_df 

Unnamed: 0,firm_name,firm_motor,firm_reviews,service_pct,platform_pct,firm_price,firm_employees,year_founded,firm_location
0,SPEC INDIA,"Enterprise Software, Mobility & BI Solutions",4.8 (26 Reviews),20%,15%,\n< $25/hr,250 - 999,1987,"\n India, United States"
1,instinctools,Delivering the future. Now.,4.8 (8 Reviews),5%,10%,\n$50 - $99/hr,250 - 999,2000,"\nUnited States, Germany"
2,SoluLab,Blockchain | IoT | Mobility | AI | Big Data,5.0 (32 Reviews),15%,15%,\n$25 - $49/hr,50 - 249,2014,"\nUnited States, India"
3,Sigma Data Systems,Discover the world of Big Data with us!,4.7 (5 Reviews),40%,10%,\n$25 - $49/hr,250 - 999,2010,"\nUnited States, Australia"
4,NeenOpal Inc.,The Hub Of Data Science Innovation,5.0 (5 Reviews),25%,10%,\n$25 - $49/hr,10 - 49,2016,"\nUnited States, India"
...,...,...,...,...,...,...,...,...,...
265,Quant Coding,To boldly do what not everyone can!,1.0 (1 Review),10%,50%,\n$25 - $49/hr,2 - 9,2020,\nMacedonia
266,Unicoconnect,Your Tech Solutions Partner,4.9 (11 Reviews),10%,10%,\n< $25/hr,10 - 49,2016,\nIndia
267,Dash Technologies,"Web & Mobile App Development Company, USA",4.8 (11 Reviews),20%,100%,\n$50 - $99/hr,50 - 249,2010,\nUnited States
268,Gzeez Tech Design and Software Development Com...,Top Software development companies,5.0 (6 Reviews),20%,20%,\n$25 - $49/hr,50 - 249,2011,\nUnited Arab Emirates


In [36]:
all_df.shape 

(270, 9)

DATA CLEANING

In [39]:
# extracting star rating 
values = all_df['firm_reviews'].apply(lambda x: x.split()[0])
all_df.insert(3, 'star_rating', values)

# extracting number of reviews
val = all_df['firm_reviews'].apply(lambda x: x.split()[1].strip('('))
all_df.insert(4, 'firm_rev', val)

# drop firm reviews column
all_df.drop(columns='firm_reviews', inplace=True)

# remove "%" from firm service and platform percent
all_df['service_pct'] = all_df['service_pct'].apply(lambda x: x.strip('%'))
all_df['platform_pct'] = all_df['platform_pct'].apply(lambda x: x.strip('%'))

# rename columns
all_df.rename(columns={'service_pct': 'service_pct(%)', 'platform_pct':'platform_pct(%)'}, inplace=True)

# remove "\n" from firm price and location
all_df['firm_price'] = all_df['firm_price'].apply(lambda x: x.strip('\n'))
all_df['firm_location'] = all_df['firm_location'].apply(lambda x: x.strip('\n'))

In [40]:
all_df.head() 

Unnamed: 0,firm_name,firm_motor,star_rating,firm_rev,service_pct(%),platform_pct(%),firm_price,firm_employees,year_founded,firm_location
0,SPEC INDIA,"Enterprise Software, Mobility & BI Solutions",4.8,26,20,15,< $25/hr,250 - 999,1987,"India, United States"
1,instinctools,Delivering the future. Now.,4.8,8,5,10,$50 - $99/hr,250 - 999,2000,"United States, Germany"
2,SoluLab,Blockchain | IoT | Mobility | AI | Big Data,5.0,32,15,15,$25 - $49/hr,50 - 249,2014,"United States, India"
3,Sigma Data Systems,Discover the world of Big Data with us!,4.7,5,40,10,$25 - $49/hr,250 - 999,2010,"United States, Australia"
4,NeenOpal Inc.,The Hub Of Data Science Innovation,5.0,5,25,10,$25 - $49/hr,10 - 49,2016,"United States, India"


SAVE DATAFRAME TO CSV FILE

In [41]:
all_df.to_csv('clean extraction csv', index=False)

CHANGING DATA TYPE

In [42]:
all_df.dtypes 

firm_name          object
firm_motor         object
star_rating        object
firm_rev           object
service_pct(%)     object
platform_pct(%)    object
firm_price         object
firm_employees     object
year_founded       object
firm_location      object
dtype: object

In [43]:
all_df['star_rating'] = all_df['star_rating'].astype('float')
all_df['firm_rev'] = all_df['firm_rev'].astype('int')
all_df['service_pct(%)'] = all_df['service_pct(%)'].astype('int')
all_df['platform_pct(%)'] = all_df['platform_pct(%)'].astype('int')

EXPLORATORY DATA ANALYSIS

In [44]:
px.histogram(all_df, 'star_rating', width=500, title='Star Rating Distribution')

* 163 firms have a star rating between 4.8 - 5.2
* 16 firms have a star rating between 4.3 - 4.7
* 84 firms have a star rating between -0.2 - 0.2
* Less than 7 firms have a rating below 4.3

In [45]:
px.histogram(all_df, 'firm_rev', width=500, title='Review Distribution')

* From the distribution above, 146 firms has a review between 0 -1.

In [46]:
top5_str_rev = all_df.sort_values(by=['star_rating', 'firm_rev'], ascending=False)[:5]
px.bar(top5_str_rev, 'firm_name', ['star_rating', 'firm_rev'], width=700, title='Top 5 Firms based on star rating and reviews')

* From the distribution above, all the top 5 firms have a five-star rating but SoluLab has the highest review of 32

In [52]:
top5_str_sev = all_df.sort_values(by=['star_rating', 'service_pct(%)'], ascending=False)[:5]
px.bar(top5_str_sev, 'firm_name', ['star_rating', 'service_pct(%)'], width=700, 
       title='Top 5 Firms based on star rating and service percentage')

* They all have a five-star rating and service percent of 90

GETTING THE OLDEST FIRM BASED ON YEAR FOUNDED

In [56]:
all_df.sort_values('year_founded')[:7]  

Unnamed: 0,firm_name,firm_motor,star_rating,firm_rev,service_pct(%),platform_pct(%),firm_price,firm_employees,year_founded,firm_location
185,Brewed @ The Lab Technologies Pvt Ltd,IT Service,0.0,0,20,20,$25 - $49/hr,10 - 49,,India
204,Unleashing AI,Your AI & Machine Learning Business Partner,0.0,0,5,25,$150 - $199/hr,2 - 9,,United States
217,Analytics8,Data and Analytics. It's what we do.,5.0,5,5,10,$25 - $49/hr,10 - 49,,United States
248,Innover Digital,Let's Solve a Problem,5.0,1,10,50,,250 - 999,,United States
137,SunTec Data,Turn Data Into Sight,0.0,0,20,10,,250 - 999,,"United States, India"
87,The Analyst Agency,"Web Developers in Buffalo, NY",5.0,1,5,50,,Freelancer,,"United States, India"
174,Heinsohn Business Technology,Software Development Experts,0.0,0,25,10,$25 - $49/hr,250 - 999,1977.0,Colombia


* From our available data, Heinsohn Business Technology is the oldest

GETTING THE FIRM PRICING

In [57]:
all_df['firm_price'].value_counts() 

$25 - $49/hr        103
$50 - $99/hr         53
NA                   35
< $25/hr             34
$100 - $149/hr       29
$150 - $199/hr        4
$200 - $300/hr        3
 $25 - $49/hr         2
$25 - $49/hr          2
 < $25/hr             1
< $25/hr              1
$50 - $99/hr          1
 NA                   1
$150 - $199/hr        1
Name: firm_price, dtype: int64

* From the above analysis, more companies pay $25 - $49/hr

In [48]:
split = all_df['firm_location'].apply(lambda x: x.split(', '))
lst = []
for x in split:
    lst += x
    
new_lst = []
for x in lst:
    new_lst.append(x.strip())
locations = pd.Series(new_lst).value_counts(ascending=True)
locations[:5]

Malta        1
Turkey       1
Lithuania    1
Vietnam      1
Romania      1
dtype: int64

In [59]:
px.bar(y=locations.index, x=locations.values, width=900, height=900, 
       title= 'Country Frequency', labels={'y':'', 'x':'frequency'})

* Most of the firms are in the United States

Summary

* The webpage from Goodfirm was succesfully accessed and downloaded using Request library.
* BeautifulSoup was used to locate and extract the details from the downloaded html file
* The extracted details was converted to data frame using Pandas.
* The file was cleaned and changed to its right datatype
* The cleaned file was explored for some insights.

Findings

* About 30 firms are rated 5 star while 12 are rated between 4.8 - 4.9
* Very few firms (3) has a review above 30
* Most of the firms are located in United States followed by India.