# Web Scraping

## Top Data Analytics Companies
To make their business processes more effective, companies are focusing on collecting and using data. Data analytics companies help businesses analyze the data they acquire and put it to use. Data analytics services can assist with product development, identifying potential market gaps, and improving operational efficiency, among other things. [Goodfirms](https://www.goodfirms.co/big-data-analytics/data-analytics)

The aim of this project is to scrape the Goodfirms website for details about the top data analytics companies. These details include reviews, ratings, year founded, location, and so on.

### Road Map
Stage 1
* Download page
* Open page
* Get details
* Extract details
* Make a dataframe

Stage 2 (Code Refactoring)
* Change loops to functions
* Move functions to another file
* Import functions and extract
* Make a dataframe

### Libraries

In [1]:
import requests as r
from bs4 import BeautifulSoup
import pandas as pd
import plotly.express as px
import plotly.offline as po
po.init_notebook_mode(connected=True)

### Download and save HTML file

In [5]:
# URLs of the three listing pages
url_link = ['https://www.goodfirms.co/big-data-analytics/data-analytics',
            'https://www.goodfirms.co/big-data-analytics/data-analytics?page=2',
            'https://www.goodfirms.co/big-data-analytics/data-analytics?page=3']

# Request each page
access = [r.get(url) for url in url_link]

# Equivalent explicit loop:
# access = []
# for url in url_link:
#     access.append(r.get(url))


### Saving webpage to PC

In [15]:
# Save each response as page1.html, page2.html, ...
for index, page in enumerate(access, start=1):
    with open(f'page{index}.html', mode='wb') as file:
        file.write(page.content)
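
The download step above assumes every request succeeds. A slightly more defensive version might look like the sketch below. The helper names `build_page_urls` and `fetch_pages` are hypothetical, as are the 10-second timeout and the one-second pause between requests. `get` defaults to `requests.get`, but any callable returning an object with `raise_for_status()` and `content` would do.

```python
import time

BASE_URL = "https://www.goodfirms.co/big-data-analytics/data-analytics"


def build_page_urls(base_url, n_pages):
    """Return the listing URLs: the base URL, then ?page=2 ... ?page=n_pages."""
    return [base_url] + [f"{base_url}?page={n}" for n in range(2, n_pages + 1)]


def fetch_pages(urls, get=None, timeout=10, pause=1.0):
    """Download each URL and return the raw bytes of every response."""
    if get is None:
        import requests  # imported lazily so the URL helper works without it
        get = requests.get
    pages = []
    for i, url in enumerate(urls):
        if i:
            time.sleep(pause)  # assumed politeness delay between requests
        response = get(url, timeout=timeout)
        response.raise_for_status()  # raise on 4xx/5xx instead of keeping an error page
        pages.append(response.content)
    return pages
```

With this in place, `pages = fetch_pages(build_page_urls(BASE_URL, 3))` would return the three page bodies as bytes, ready to be written to disk.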

### Page 1 

In [34]:
with open("page1.html", encoding='utf-8', mode='r') as file:
    bs = BeautifulSoup(file, 'lxml')

#### Locate Details

In [35]:
firm_names = bs.find_all('span', {'itemprop': "name"})
firm_motors = bs.find_all('p', {'class': "profile-tagline"})
firm_reviews = bs.find_all('span', {'class': "listinv_review_label"})
progress_value = bs.find_all('div', {'class': "circle-progress-value"})
firm_prices = bs.find_all('div', {'class': "firm-pricing"})
firm_emps = bs.find_all('div', {'class': "firm-employees"})
firm_years = bs.find_all('div', {'class': "firm-founded"})
firm_locations = bs.find_all('div', {'class': "firm-location"})

#### Function to extract details

In [36]:
def extract_detail(tag_lst):
    """Return the text of each tag as a pandas Series."""
    lst = [tag.text for tag in tag_lst]
    return pd.Series(lst)

def extract_progress_values(tag_lst):
    """Split the alternating progress values into two Series."""
    # Even positions go to service_pct, odd positions to platform_pct
    service_pct = [tag.text for tag in tag_lst[::2]]
    platform_pct = [tag.text for tag in tag_lst[1::2]]
    return pd.Series(service_pct), pd.Series(platform_pct)
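
`tag.text` returns the text exactly as it appears in the HTML, including surrounding whitespace such as leading newlines. If cleaner strings are wanted, BeautifulSoup's `get_text(strip=True)` is one option. The sketch below runs on a made-up HTML fragment (not Goodfirms' actual markup) and uses the built-in `html.parser` instead of `lxml` so that it needs no extra parser. `extract_detail_stripped` is a hypothetical name.

```python
from bs4 import BeautifulSoup

# Made-up fragment for illustration only
html = """
<div class="firm-pricing">
$25 - $49/hr</div>
<div class="firm-pricing">
&lt; $25/hr</div>
"""

soup = BeautifulSoup(html, "html.parser")
tags = soup.find_all("div", {"class": "firm-pricing"})


def extract_detail_stripped(tag_lst):
    """Like extract_detail, but strips leading and trailing whitespace."""
    return [tag.get_text(strip=True) for tag in tag_lst]


raw = [tag.text for tag in tags]        # keeps the leading newline
clean = extract_detail_stripped(tags)   # whitespace removed
```

The helper returns a plain list here; wrapping the result in `pd.Series(...)` would match what `extract_detail` returns.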

#### Extract Details

In [37]:
names = extract_detail(firm_names[3:])
motors = extract_detail(firm_motors)
reviews = extract_detail(firm_reviews)
ser, pct = extract_progress_values(progress_value)
prices = extract_detail(firm_prices)
emps = extract_detail(firm_emps)
years = extract_detail(firm_years)
locations = extract_detail(firm_locations)

### Put Details in a DataFrame

In [38]:
df1 = pd.DataFrame()

In [39]:
df1['firm_name'] = names
df1['firm_motor'] = motors
df1['firm_review'] = reviews
df1['service_pct'] = ser
df1['platform_pct'] = pct
df1['firm_price'] = prices
df1['firm_employee'] = emps
df1['year_founded'] = years
df1['firm_location'] = locations

In [40]:
df1.head()

| | firm_name | firm_motor | firm_review | service_pct | platform_pct | firm_price | firm_employee | year_founded | firm_location |
|---|---|---|---|---|---|---|---|---|---|
| 0 | SPEC INDIA | Enterprise Software, Mobility & BI Solutions | 4.8 (26 Reviews) | 20% | 15% | `\n< $25/hr` | 250 - 999 | 1987 | `\nIndia, United States` |
| 1 | instinctools | Delivering the future. Now. | 4.8 (8 Reviews) | 5% | 10% | `\n$50 - $99/hr` | 250 - 999 | 2000 | `\nUnited States, Germany` |
| 2 | SoluLab | Blockchain \| IoT \| Mobility \| AI \| Big Data | 5.0 (32 Reviews) | 15% | 15% | `\n$25 - $49/hr` | 50 - 249 | 2014 | `\nUnited States, India` |
| 3 | Sigma Data Systems | Discover the world of Big Data with us! | 4.7 (5 Reviews) | 40% | 10% | `\n$25 - $49/hr` | 250 - 999 | 2010 | `\nUnited States, Australia` |
| 4 | NeenOpal Inc. | The Hub Of Data Science Innovation | 5.0 (5 Reviews) | 25% | 10% | `\n$25 - $49/hr` | 10 - 49 | 2016 | `\nUnited States, India` |


### Page 2

In [41]:
# Open page 2
with open("page2.html", encoding='utf-8', mode='r') as file:
    bs2 = BeautifulSoup(file, 'lxml')
    
# Locate details
firm_names = bs2.find_all('span', {'itemprop': "name"})
firm_motors = bs2.find_all('p', {'class': "profile-tagline"})
firm_reviews = bs2.find_all('span', {'class': "listinv_review_label"})
progress_value = bs2.find_all('div', {'class': "circle-progress-value"})
firm_prices = bs2.find_all('div', {'class': "firm-pricing"})
firm_emps = bs2.find_all('div', {'class': "firm-employees"})
firm_years = bs2.find_all('div', {'class': "firm-founded"})
firm_locations = bs2.find_all('div', {'class': "firm-location"})

# Extract details
names2 = extract_detail(firm_names[3:])
motors = extract_detail(firm_motors)
reviews = extract_detail(firm_reviews)
ser, pct = extract_progress_values(progress_value)
prices = extract_detail(firm_prices)
emps = extract_detail(firm_emps)
years = extract_detail(firm_years)
locations = extract_detail(firm_locations)

# Put details in a dataframe
df2 = pd.DataFrame()
df2['firm_name'] = names2
df2['firm_motor'] = motors
df2['firm_review'] = reviews
df2['service_pct'] = ser
df2['platform_pct'] = pct
df2['firm_price'] = prices
df2['firm_employee'] = emps
df2['year_founded'] = years
df2['firm_location'] = locations

# Preview
df2.head()

| | firm_name | firm_motor | firm_review | service_pct | platform_pct | firm_price | firm_employee | year_founded | firm_location |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Inoxoft | IEC/ISO 27001, Google, Microsoft Certified Com... | 5.0 (8 Reviews) | 10% | 10% | `\n$25 - $49/hr` | 50 - 249 | 2014 | `\nUkraine` |
| 1 | Noventum Custom Software Development Company | Custom Software & Web Development in New Mexico | 5.0 (3 Reviews) | 20% | 100% | `\n$100 - $149/hr` | 2 - 9 | 2012 | `\nUnited States` |
| 2 | Napollo Software Design L.L.C | Best Software Design Agency in Dubai & New York | 5.0 (7 Reviews) | 20% | 20% | `\nNA` | 50 - 249 | 2011 | `\nUnited States` |
| 3 | SemiDot Infotech Pvt Ltd | Right Technology Partner for IT Solutions | 4.7 (7 Reviews) | 10% | 30% | `\n< $25/hr` | 50 - 249 | 2011 | `\nUnited States, India` |
| 4 | Sphinx Solutions | Inspire : Innovate : Evolve | 4.6 (8 Reviews) | 5% | 10% | `\n< $25/hr` | Freelancer | 2010 | `\nIndia, United States` |


### Page 3

In [42]:
# Open page 3 
with open("page3.html", encoding='utf-8', mode='r') as file:
    bs3 = BeautifulSoup(file, 'lxml')
    
# Locate details
firm_names = bs3.find_all('span', {'itemprop': "name"})
firm_motors = bs3.find_all('p', {'class': "profile-tagline"})
firm_reviews = bs3.find_all('span', {'class': "listinv_review_label"})
progress_value = bs3.find_all('div', {'class': "circle-progress-value"})
firm_prices = bs3.find_all('div', {'class': "firm-pricing"})
firm_emps = bs3.find_all('div', {'class': "firm-employees"})
firm_years = bs3.find_all('div', {'class': "firm-founded"})
firm_locations = bs3.find_all('div', {'class': "firm-location"})

# Extract details
names = extract_detail(firm_names[3:])
motors = extract_detail(firm_motors)
reviews = extract_detail(firm_reviews)
ser, pct = extract_progress_values(progress_value)
prices = extract_detail(firm_prices)
emps = extract_detail(firm_emps)
years = extract_detail(firm_years)
locations = extract_detail(firm_locations)

# Put details in a dataframe
df3 = pd.DataFrame()
df3['firm_name'] = names
df3['firm_motor'] = motors
df3['firm_review'] = reviews
df3['service_pct'] = ser
df3['platform_pct'] = pct
df3['firm_price'] = prices
df3['firm_employee'] = emps
df3['year_founded'] = years
df3['firm_location'] = locations

# Preview
df3.head()

| | firm_name | firm_motor | firm_review | service_pct | platform_pct | firm_price | firm_employee | year_founded | firm_location |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Techmango Technology Services Private Limited | Best Offshore Software Development Company | 4.3 (3 Reviews) | 10% | 20% | `\n< $25/hr` | 250 - 999 | 2014 | `\nIndia` |
| 1 | Evolve Technologies | IT agency | 4.8 (1 Review) | 30% | 30% | `\n$50 - $99/hr` | 10 - 49 | 2000 | `\nIreland` |
| 2 | Virtual Electronics PTE LTD | Software and mobile app development in Singapore | 5.0 (2 Reviews) | 5% | 25% | `\n$25 - $49/hr` | 10 - 49 | 2019 | `\nSingapore` |
| 3 | Reenbit | Intelligent engineering & beyond | 5.0 (3 Reviews) | 5% | 5% | `\n$25 - $49/hr` | 50 - 249 | 2018 | `\nUkraine, Poland` |
| 4 | Michigan Software Labs | We Make Apps | 5.0 (2 Reviews) | 10% | 20% | `\n$150 - $199/hr` | 10 - 49 | 2010 | `\nUnited States` |
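
The Page 1, Page 2 and Page 3 cells repeat the same locate, extract and assemble steps, which is the duplication Stage 2 of the road map sets out to remove. One possible shape for that refactoring is sketched below. `SELECTORS`, `parse_page`, `parse_file` and the `skip_names` parameter are hypothetical names, and the sketch returns a plain dict of lists and defaults to `html.parser` rather than reproducing the notebook's exact code.

```python
from bs4 import BeautifulSoup

# (tag name, attribute filter) for each single-valued column, copied from the cells above
SELECTORS = {
    "firm_name": ("span", {"itemprop": "name"}),
    "firm_motor": ("p", {"class": "profile-tagline"}),
    "firm_review": ("span", {"class": "listinv_review_label"}),
    "firm_price": ("div", {"class": "firm-pricing"}),
    "firm_employee": ("div", {"class": "firm-employees"}),
    "year_founded": ("div", {"class": "firm-founded"}),
    "firm_location": ("div", {"class": "firm-location"}),
}


def parse_page(soup, skip_names=3):
    """Collect every column for one parsed page into a dict of lists."""
    columns = {}
    for column, (name, attrs) in SELECTORS.items():
        columns[column] = [tag.text for tag in soup.find_all(name, attrs)]
    # The notebook drops the first three name matches (firm_names[3:])
    columns["firm_name"] = columns["firm_name"][skip_names:]
    # Progress values are split by position, as in extract_progress_values
    progress = [tag.text for tag in soup.find_all("div", {"class": "circle-progress-value"})]
    columns["service_pct"] = progress[::2]
    columns["platform_pct"] = progress[1::2]
    return columns


def parse_file(path, parser="html.parser"):
    """Open a saved page and parse it (the notebook itself uses 'lxml')."""
    with open(path, encoding="utf-8", mode="r") as file:
        return parse_page(BeautifulSoup(file, parser))
```

With the saved files in place, `frames = [pd.DataFrame(parse_file(f'page{n}.html')) for n in (1, 2, 3)]` could stand in for the three per-page cells, provided every list on a page has the same length (building a DataFrame from a dict of lists requires that).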


### Combine All DataFrames

In [44]:
all_df = pd.concat([df1, df2, df3], ignore_index=True)
all_df

| | firm_name | firm_motor | firm_review | service_pct | platform_pct | firm_price | firm_employee | year_founded | firm_location |
|---|---|---|---|---|---|---|---|---|---|
| 0 | SPEC INDIA | Enterprise Software, Mobility & BI Solutions | 4.8 (26 Reviews) | 20% | 15% | `\n< $25/hr` | 250 - 999 | 1987 | `\nIndia, United States` |
| 1 | instinctools | Delivering the future. Now. | 4.8 (8 Reviews) | 5% | 10% | `\n$50 - $99/hr` | 250 - 999 | 2000 | `\nUnited States, Germany` |
| 2 | SoluLab | Blockchain \| IoT \| Mobility \| AI \| Big Data | 5.0 (32 Reviews) | 15% | 15% | `\n$25 - $49/hr` | 50 - 249 | 2014 | `\nUnited States, India` |
| 3 | Sigma Data Systems | Discover the world of Big Data with us! | 4.7 (5 Reviews) | 40% | 10% | `\n$25 - $49/hr` | 250 - 999 | 2010 | `\nUnited States, Australia` |
| 4 | NeenOpal Inc. | The Hub Of Data Science Innovation | 5.0 (5 Reviews) | 25% | 10% | `\n$25 - $49/hr` | 10 - 49 | 2016 | `\nUnited States, India` |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 148 | Rock Your Data | Delivering Cloud Analytics that Rocks Your Data | 0.0 (0 Review) | 70% | 10% | `\n$150 - $199/hr` | 10 - 49 | 2017 | `\nCanada` |
| 149 | CodeRiders | We desire. Together we achieve! | 0.0 (0 Review) | 15% | 50% | `\n$25 - $49/hr` | 10 - 49 | 2013 | `\n Armenia` |
| 150 | Notionmind | Software. Strategy. Managed Services. | 0.0 (0 Review) | 15% | 50% | `\n$50 - $99/hr` | 10 - 49 | 2019 | `\nIndia` |
| 151 | OptimusFox | Best Blockchain Development Company in USA | 0.0 (0 Review) | 15% | 50% | `\n$50 - $99/hr` | 50 - 249 | 2018 | `\nUnited States` |
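
Several columns in the combined table still hold raw text: `firm_review` mixes the rating with the review count, and `firm_price` and `firm_location` begin with a newline. A possible cleaning step, using only the standard library, is sketched below. `parse_review` and `clean_text` are hypothetical helpers, and the regular expression assumes the `rating (N Review[s])` pattern seen in the rows shown; other formats may exist in the rows not displayed.

```python
import re

# Matches strings such as "4.8 (26 Reviews)" or "0.0 (0 Review)"
REVIEW_PATTERN = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*\((\d+)\s+Reviews?\)\s*$")


def parse_review(text):
    """Split a review label into (rating, review_count); (None, None) if it does not match."""
    match = REVIEW_PATTERN.match(text)
    if match is None:
        return None, None
    return float(match.group(1)), int(match.group(2))


def clean_text(text):
    """Strip the leading newline and any surrounding spaces."""
    return text.strip()
```

On the DataFrame itself, `all_df['firm_review'].apply(parse_review)` would yield one `(rating, review_count)` tuple per row, and `all_df['firm_price'].str.strip()` would do the same job as `clean_text` for a whole column.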
