## Project 1 - Web scrapping

Project 1 is about webscaping information about top data Analytics company from the website [Good Firm](https://www.goodfirms.co/big-data-analytics/data-analytics)

The aim of this project is to get details about top data analytics company. These details includes, review, rating, year founded, location, etc.

#### Steps

* Download page (pages 1-3) and save it
* Open page
* Get details
* Extract details
* Change loops to function
* Take functions to another file
* Import functions and extract
* Make a dataframe
* Combine dataframes
* Clean dataset
* Perform EDA

In [1]:
# importing necessary libraries

import requests
from bs4 import BeautifulSoup 
import pandas as pd

### Download and save HTML file

In [2]:
url = 'https://www.goodfirms.co/big-data-analytics/data-analytics'

In [3]:
html = requests.get(url)
html

<Response [200]>

In [4]:
# saving website
with open('page1.html',mode='wb') as openfile:
    openfile.write(html.content)

In [5]:
with open('page1.html', encoding = 'utf-8', mode = 'r') as file:
    bs=BeautifulSoup(file,'lxml')

#### Page 1

In [7]:
# Locating details
firm_n = bs.find_all('span',{'itemprop':'name'})
firm_m = bs.find_all('p',{'class':'profile-tagline'})
firm_r = bs.find_all('span',{'class':'listinv_review_label'})
firm_pv = bs.find_all('div',{'class':'circle-progress-value'})
firm_pr = bs.find_all('div',{'class':'firm-pricing'})
firm_em = bs.find_all('div',{'class':'firm-employees'})
firm_fo = bs.find_all('div',{'class':'firm-founded'})
firm_lo = bs.find_all('div',{'class':'firm-location'})

#### Function to extract details

In [8]:
# Extracting details with a function
def extract_details(tag):
    lst = []
    for each_val in tag:
        lst.append(each_val.text)
    return pd.Series(lst)

#### Extract Details

In [9]:
name_sr = extract_details(firm_n[3:])
moto_sr = extract_details(firm_m)
reviews_sr = extract_details(firm_r)
price_sr = extract_details(firm_pr)
employee_sr = extract_details(firm_em)
founded_sr = extract_details(firm_fo)
location_sr = extract_details(firm_lo)

In [10]:
serv_lst = []
plt_lst = []
for val in enumerate(firm_pv):
    if val[0] % 2 == 0:
        serv_lst.append(val[1].text)
    else:
        plt_lst.append(val[1].text)        
    
serv_lst[:3]

['20%', '5%', '15%']

In [11]:
plt_lst[:3]

['15%', '10%', '15%']

### Put Details in a DataFrame

In [12]:
df1 = pd.DataFrame()

In [13]:
df1['firm_name'] = name_sr
df1['firm_motor'] = moto_sr
df1['firm_reviews'] = reviews_sr
df1['firm_ser_pct(%)'] = serv_lst
df1['firm_plt_pct(%)'] = plt_lst
df1['firm_price'] = price_sr
df1['firm_employee'] = employee_sr 
df1['firm_founded'] = founded_sr
df1['firm_location'] = location_sr

In [14]:
df1.head()

Unnamed: 0,firm_name,firm_motor,firm_reviews,firm_ser_pct(%),firm_plt_pct(%),firm_price,firm_employee,firm_founded,firm_location
0,SPEC INDIA,"Enterprise Software, Mobility & BI Solutions",4.8 (26 Reviews),20%,15%,\n< $25/hr,250 - 999,1987,"\nIndia, United States"
1,instinctools,Delivering the future. Now.,4.8 (8 Reviews),5%,10%,\n$50 - $99/hr,250 - 999,2000,"\nUnited States, Germany"
2,SoluLab,Blockchain | IoT | Mobility | AI | Big Data,5.0 (32 Reviews),15%,15%,\n$25 - $49/hr,50 - 249,2014,"\nUnited States, India"
3,Sigma Data Systems,Discover the world of Big Data with us!,4.7 (5 Reviews),40%,10%,\n$25 - $49/hr,250 - 999,2010,"\nUnited States, Australia"
4,NeenOpal Inc.,The Hub Of Data Science Innovation,5.0 (5 Reviews),25%,10%,\n$25 - $49/hr,10 - 49,2016,"\nUnited States, India"


#### Page 2

In [15]:
url = 'https://www.goodfirms.co/big-data-analytics/data-analytics?page=2'

In [16]:
html = requests.get(url)
html

<Response [200]>

In [17]:
# saving website
with open('page2.html',mode='wb') as openfile:
    openfile.write(html.content)
    
with open("page2.html", encoding='utf-8', mode='r') as file:
    bs2 = BeautifulSoup(file, 'lxml')

In [18]:
# Locate details
firm_names = bs2.find_all('span', {'itemprop': "name"})
firm_motors = bs2.find_all('p', {'class': "profile-tagline"})
firm_reviews = bs2.find_all('span', {'class': "listinv_review_label"})
progress_value = bs2.find_all('div', {'class': "circle-progress-value"})
firm_prices = bs2.find_all('div', {'class': "firm-pricing"})
firm_emps = bs2.find_all('div', {'class': "firm-employees"})
firm_years = bs2.find_all('div', {'class': "firm-founded"})
firm_locations = bs2.find_all('div', {'class': "firm-location"})

In [19]:
# Extract details usingmthe extract_details function
names2 = extract_details(firm_names[3:])
motors = extract_details(firm_motors)
reviews = extract_details(firm_reviews)
prices = extract_details(firm_prices)
emps = extract_details(firm_emps)
years = extract_details(firm_years)
locations = extract_details(firm_locations)

In [20]:
serv = []
plt = []
for val in enumerate(firm_pv):
    if val[0] % 2 == 0:
        serv.append(val[1].text)
    else:
        plt.append(val[1].text)        
    
serv[:3]

['20%', '5%', '15%']

In [21]:
plt[:3]

['15%', '10%', '15%']

In [22]:
df2 = pd.DataFrame()

In [23]:
df2['firm_name'] = names2
df2['firm_motor'] = motors
df2['firm_reviews'] = reviews
df2['firm_ser_pct(%)'] = serv
df2['firm_plt_pct(%)'] = plt
df2['firm_price'] = price_sr
df2['firm_employee'] = emps 
df2['firm_founded'] = years
df2['firm_location'] = locations

In [24]:
df2.head()

Unnamed: 0,firm_name,firm_motor,firm_reviews,firm_ser_pct(%),firm_plt_pct(%),firm_price,firm_employee,firm_founded,firm_location
0,Inoxoft,"IEC/ISO 27001, Google, Microsoft Certified Com...",5.0 (8 Reviews),20%,15%,\n< $25/hr,50 - 249,2014,\nUkraine
1,Noventum Custom Software Development Company,Custom Software & Web Development in New Mexico,5.0 (3 Reviews),5%,10%,\n$50 - $99/hr,2 - 9,2012,\nUnited States
2,Napollo Software Design L.L.C,Best Software Design Agency in Dubai & New York,5.0 (7 Reviews),15%,15%,\n$25 - $49/hr,50 - 249,2011,\nUnited States
3,SemiDot Infotech Pvt Ltd,Right Technology Partner for IT Solutions,4.7 (7 Reviews),40%,10%,\n$25 - $49/hr,50 - 249,2011,"\nUnited States, India"
4,Sphinx Solutions,Inspire : Innovate : Evolve,4.6 (8 Reviews),25%,10%,\n$25 - $49/hr,Freelancer,2010,"\nIndia, United States"


#### Page 3

In [25]:
url = 'https://www.goodfirms.co/big-data-analytics/data-analytics?page=3'

In [26]:
html = requests.get(url)
html

<Response [200]>

In [28]:
# saving website
with open('page3.html',mode='wb') as openfile:
    openfile.write(html.content)
    
with open("page3.html", encoding='utf-8', mode='r') as file:
    bs3 = BeautifulSoup(file, 'lxml')

In [29]:
# Locate details
firm_nam = bs3.find_all('span', {'itemprop': "name"})
firm_moto = bs3.find_all('p', {'class': "profile-tagline"})
firm_reviews = bs3.find_all('span', {'class': "listinv_review_label"})
firm_prog_value = bs3.find_all('div', {'class': "circle-progress-value"})
firm_price = bs3.find_all('div', {'class': "firm-pricing"})
firm_emp = bs3.find_all('div', {'class': "firm-employees"})
firm_year = bs3.find_all('div', {'class': "firm-founded"})
firm_locatn = bs3.find_all('div', {'class': "firm-location"})

In [30]:
# Extract details
names3_sr = extract_details(firm_nam[3:])
motors_sr = extract_details(firm_moto)
reviews_sr = extract_details(firm_reviews)
prices_sr = extract_details(firm_price)
emps_sr = extract_details(firm_emp)
years_sr = extract_details(firm_year)
locations_sr = extract_details(firm_locatn)

In [31]:
serv3 = []
plt3 = []
for val in enumerate(firm_pv):
    if val[0] % 2 == 0:
        serv3.append(val[1].text)
    else:
        plt3.append(val[1].text)        
    
serv3[:3]

['20%', '5%', '15%']

In [32]:
plt3[:3]

['15%', '10%', '15%']

In [33]:
df3 = pd.DataFrame()

In [34]:
df3['firm_name'] = names3_sr
df3['firm_motor'] = motors_sr
df3['firm_reviews'] = reviews_sr
df3['firm_ser_pct(%)'] = serv3
df3['firm_plt_pct(%)'] = plt3
df3['firm_price'] = prices_sr
df3['firm_employee'] = emps_sr 
df3['firm_founded'] = years_sr
df3['firm_location'] = locations_sr

In [35]:
df3.head()

Unnamed: 0,firm_name,firm_motor,firm_reviews,firm_ser_pct(%),firm_plt_pct(%),firm_price,firm_employee,firm_founded,firm_location
0,Hiteshi,Top Mobile & Web Development Company,5.0 (2 Reviews),20%,15%,\n$25 - $49/hr,50 - 249,2006,"\nIndia, Australia"
1,Techmango Technology Services Private Limited,Best Offshore Software Development Company,4.3 (3 Reviews),5%,10%,\n< $25/hr,250 - 999,2014,\nIndia
2,Evolve Technologies,IT agency,4.8 (1 Review),15%,15%,\n$50 - $99/hr,10 - 49,2000,\nIreland
3,Virtual Electronics PTE LTD,Software and mobile app development in Singapore,5.0 (2 Reviews),40%,10%,\n$25 - $49/hr,10 - 49,2019,\nSingapore
4,Reenbit,Intelligent engineering & beyond,5.0 (3 Reviews),25%,10%,\n$25 - $49/hr,50 - 249,2018,"\nUkraine, Poland"


### Combine all three dataframes

In [36]:
all_df = pd.concat([df1, df2, df3], ignore_index=True)
all_df

Unnamed: 0,firm_name,firm_motor,firm_reviews,firm_ser_pct(%),firm_plt_pct(%),firm_price,firm_employee,firm_founded,firm_location
0,SPEC INDIA,"Enterprise Software, Mobility & BI Solutions",4.8 (26 Reviews),20%,15%,\n< $25/hr,250 - 999,1987,"\nIndia, United States"
1,instinctools,Delivering the future. Now.,4.8 (8 Reviews),5%,10%,\n$50 - $99/hr,250 - 999,2000,"\nUnited States, Germany"
2,SoluLab,Blockchain | IoT | Mobility | AI | Big Data,5.0 (32 Reviews),15%,15%,\n$25 - $49/hr,50 - 249,2014,"\nUnited States, India"
3,Sigma Data Systems,Discover the world of Big Data with us!,4.7 (5 Reviews),40%,10%,\n$25 - $49/hr,250 - 999,2010,"\nUnited States, Australia"
4,NeenOpal Inc.,The Hub Of Data Science Innovation,5.0 (5 Reviews),25%,10%,\n$25 - $49/hr,10 - 49,2016,"\nUnited States, India"
...,...,...,...,...,...,...,...,...,...
148,Rock Your Data,Delivering Cloud Analytics that Rocks Your Data,0.0 (0 Review),5%,50%,\n$150 - $199/hr,10 - 49,2017,\nCanada
149,CodeRiders,We desire. Together we achieve!,0.0 (0 Review),5%,100%,\n$25 - $49/hr,10 - 49,2013,\nArmenia
150,Notionmind,Software. Strategy. Managed Services.,0.0 (0 Review),20%,20%,\n$50 - $99/hr,10 - 49,2019,\nIndia
151,OptimusFox,Best Blockchain Development Company in USA,0.0 (0 Review),10%,50%,\n$50 - $99/hr,50 - 249,2018,\nUnited States


In [42]:
all_df.head(3)

Unnamed: 0,firm_name,firm_motor,star_rating,firm_rev,firm_ser_pct(%),firm_plt_pct(%),firm_price,firm_employee,firm_founded,firm_location
0,SPEC INDIA,"Enterprise Software, Mobility & BI Solutions",4.8,26,20%,15%,\n< $25/hr,250 - 999,1987,"\nIndia, United States"
1,instinctools,Delivering the future. Now.,4.8,8,5%,10%,\n$50 - $99/hr,250 - 999,2000,"\nUnited States, Germany"
2,SoluLab,Blockchain | IoT | Mobility | AI | Big Data,5.0,32,15%,15%,\n$25 - $49/hr,50 - 249,2014,"\nUnited States, India"


### Data Cleaning

In [43]:
# extracting star rating 
values = all_df['firm_reviews'].apply(lambda x: x.split()[0])
all_df.insert(3, 'star_rating', values)

# extracting number of reviews
val = all_df['firm_reviews'].apply(lambda x: x.split()[1].strip('('))
all_df.insert(4, 'firm_rev', val)

# drop firm reviews column
all_df.drop(columns='firm_reviews', inplace=True)

In [45]:
# remove "%" from firm service and platform percent
all_df['firm_ser_pct(%)'] = all_df['firm_ser_pct(%)'].apply(lambda x: x.strip('%'))
all_df['firm_plt_pct(%)'] = all_df['firm_plt_pct(%)'].apply(lambda x: x.strip('%'))


# remove "\n" from firm price and location
all_df['firm_price'] = all_df['firm_price'].apply(lambda x: x.strip('\n'))
all_df['firm_location'] = all_df['firm_location'].apply(lambda x: x.strip('\n'))

In [46]:
all_df.head(5)

Unnamed: 0,firm_name,firm_motor,star_rating,firm_rev,firm_ser_pct(%),firm_plt_pct(%),firm_price,firm_employee,firm_founded,firm_location
0,SPEC INDIA,"Enterprise Software, Mobility & BI Solutions",4.8,26,20,15,< $25/hr,250 - 999,1987,"India, United States"
1,instinctools,Delivering the future. Now.,4.8,8,5,10,$50 - $99/hr,250 - 999,2000,"United States, Germany"
2,SoluLab,Blockchain | IoT | Mobility | AI | Big Data,5.0,32,15,15,$25 - $49/hr,50 - 249,2014,"United States, India"
3,Sigma Data Systems,Discover the world of Big Data with us!,4.7,5,40,10,$25 - $49/hr,250 - 999,2010,"United States, Australia"
4,NeenOpal Inc.,The Hub Of Data Science Innovation,5.0,5,25,10,$25 - $49/hr,10 - 49,2016,"United States, India"


### Save dataframe to CSV file

In [47]:
all_df.to_csv('top_data.csv', index=False)

### Changing Data Type

In [48]:
all_df.dtypes

firm_name          object
firm_motor         object
star_rating        object
firm_rev           object
firm_ser_pct(%)    object
firm_plt_pct(%)    object
firm_price         object
firm_employee      object
firm_founded       object
firm_location      object
dtype: object

In [49]:
# Changing star_rating into float, firm_rev, firm_ser and firm_plt columns into int
all_df['star_rating'] = all_df['star_rating'].astype('float')
all_df['firm_rev'] = all_df['firm_rev'].astype('int')
all_df['firm_ser_pct(%)'] = all_df['firm_ser_pct(%)'].astype('int')
all_df['firm_plt_pct(%)'] = all_df['firm_plt_pct(%)'].astype('int')

In [50]:
all_df.dtypes

firm_name           object
firm_motor          object
star_rating        float64
firm_rev             int32
firm_ser_pct(%)      int32
firm_plt_pct(%)      int32
firm_price          object
firm_employee       object
firm_founded        object
firm_location       object
dtype: object

In [51]:
# Importing visuailization libraries
import plotly.express as px
import plotly.offline as po
po.init_notebook_mode(connected=True)

### EDA

In [52]:
px.histogram(all_df, 'star_rating', width=500, title='Star Rating Distribution')

From the chart above

* More than 100 firms has a star rating between 5.0 - 5.2
* 24 firms has a rating below
* Less than 5 firms has a rating between 1 - 4

In [53]:
px.histogram(all_df, 'firm_rev', width=500, title='Review Distribution')

From the distribution above, more than 50 firms has a review between 0 -1.

In [54]:
top5_str_rev = all_df.sort_values(by=['star_rating', 'firm_rev'], ascending=False)[:5]
px.bar(top5_str_rev, 'firm_name', ['star_rating', 'firm_rev'], width=700, title='Top 5 Firms based on star rating and reviews')

In [56]:
top5_str_sev = all_df.sort_values(['star_rating', 'firm_ser_pct(%)'], ascending=False)[:5]
px.bar(top5_str_sev, 'firm_name', 'firm_ser_pct(%)', width=700, 
       title='Top 5 Firms based on star rating and service percentage')

In [57]:
split = all_df['firm_location'].apply(lambda x: x.split(', '))
lst = []
for x in split:
    lst += x
    
new_lst = []
for x in lst:
    new_lst.append(x.strip())
locations = pd.Series(new_lst).value_counts(ascending=True)
locations[:5]

Bulgaria       1
Belgium        1
Hungary        1
Thailand       1
Netherlands    1
dtype: int64

In [58]:
px.bar(y=locations.index, x=locations.values, width=900, height=800, 
       title= 'Country Frequency', labels={'y':'', 'x':'frequency'})

### Summary

* The webpage from Goodfirm was succesfully accessed and downloaded using Request library. 
* BeautifulSoup was used to locate and extract the details from the downloaded html file
* The extracted details was converted to data frame using Pandas.
* The file was cleaned 
* EDA as performed on the cleaned file for some insights.

#### Findings
* About 30 firms are rated 5 star while 12 are rated between 4.8 - 4.9
* Very few firms (3) has a review above 30
* Most of the firms are located in United States followed by India.