# Web Scraping

## Top Data Analytics Companies
To make their business processes more effective, companies are focusing on collecting and utilizing data. Data analytics companies enable businesses to analyze the data they acquire and use it as required. Data analytics services can assist in product development, identifying potential market gaps, improving operational efficiency, etc. [Goodfirms](https://www.goodfirms.co/big-data-analytics/data-analytics)

The aim of this project is to scrape the Goodfirms website for details about the top data analytics companies. These details include reviews, ratings, year founded, location, etc.

### Road Map
Stage 1
* Download page
* Open page
* Get details
* Extract details
* Make a dataframe

Stage 2 (Code Refactoring)
* Change loops to function
* Take functions to another file
* Import functions and extract
* Make a dataframe

In [11]:
import requests as r
from bs4 import BeautifulSoup 
import pandas as pd
import plotly.express as px
import plotly.offline as po
po.init_notebook_mode(connected = True)

Accessing webpage

In [3]:
url_link = 'https://www.goodfirms.co/big-data-analytics/data-analytics'
html = r.get(url_link)
html

<Response [200]>

Saving webpage

In [5]:
with open('topfirm.html', mode='wb') as file:
    file.write(html.content)

In [2]:
with open('topfirm.html',encoding='utf-8', mode='r') as openfile:
    bs = BeautifulSoup(openfile, 'lxml')

Locate Details

<!-- 1. firm position -->
2. firm name
3. firm reviews
4. firm progress value
5. firm price
6. firm employee
7. year founded
8. firm location

In [3]:
firm_name = bs.find_all('span', {'itemprop': 'name'})
firm_motor = bs.find_all('p', {'class': 'profile-tagline'})
firm_reviews = bs.find_all('span', {'class': 'listinv_review_label'})
firm_progress_value = bs.find_all('div', {'class': 'circle-progress-value'})
firm_price = bs.find_all('div', {'class': 'firm-pricing'})
firm_employee = bs.find_all('div', {'class': 'firm-employee'})
year_founded = bs.find_all('div', {'class': 'firm-founded'})
firm_location = bs.find_all('div', {'class': 'firm-location'})

In [4]:
def extract_details(tag_lst):
    lst = [tag.text for tag in tag_lst]
    return pd.Series(lst)

In [5]:
firm_name_sr = extract_details(firm_name[3:])
firm_motor_sr = extract_details(firm_motor)
firm_reviews_sr = extract_details(firm_reviews)
firm_price_sr = extract_details(firm_price)
firm_employee_sr = extract_details(firm_employee)
year_founded_sr = extract_details(year_founded)
firm_location_sr = extract_details(firm_location)

In [7]:
serv_lst = []
plt_lst = []
# The progress values alternate: even positions go to serv_lst, odd positions to plt_lst
for idx, tag in enumerate(firm_progress_value):
    if idx % 2 == 0:
        serv_lst.append(tag.text)
    else:
        plt_lst.append(tag.text)

serv_lst[:3]

['20%', '5%', '15%']
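
The even/odd split can also be sketched with extended slicing instead of a loop. This is a minimal illustration, not part of the original notebook: `progress_values` is a hypothetical stand-in for `[tag.text for tag in firm_progress_value]`, filled with the first six values shown in this notebook's outputs.

```python
# Hypothetical sample: the first six progress values, interleaved as they
# appear on the page (even positions -> serv, odd positions -> plt).
progress_values = ['20%', '15%', '5%', '10%', '15%', '15%']

serv_lst = progress_values[0::2]  # items at positions 0, 2, 4, ...
plt_lst = progress_values[1::2]   # items at positions 1, 3, 5, ...

print(serv_lst)
print(plt_lst)
```

Slicing keeps the two lists in step with the page order and avoids tracking the index by hand.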

In [8]:
df_new = pd.DataFrame()

In [9]:
df_new['firm_name'] = firm_name_sr
df_new['firm_motor'] = firm_motor_sr
df_new['firm_reviews'] = firm_reviews_sr
df_new['firm_serv_pct(%)'] = serv_lst
df_new['firm_plt_pct(%)'] = plt_lst
df_new['firm_price'] = firm_price_sr
df_new['firm_employee'] = firm_employee_sr
df_new['year_founded'] = year_founded_sr
df_new['firm_location'] = firm_location_sr

In [10]:
df_new

Unnamed: 0,firm_name,firm_motor,firm_reviews,firm_serv_pct(%),firm_plt_pct(%),firm_price,firm_employee,year_founded,firm_location
0,SPEC INDIA,"Enterprise Software, Mobility & BI Solutions",4.8 (26 Reviews),20%,15%,\n< $25/hr,,1987,"\nIndia, United States"
1,instinctools,Delivering the future. Now.,4.8 (8 Reviews),5%,10%,\n$50 - $99/hr,,2000,"\nUnited States, Germany"
2,SoluLab,Blockchain | IoT | Mobility | AI | Big Data,5.0 (32 Reviews),15%,15%,\n$25 - $49/hr,,2014,"\nUnited States, India"
3,Sigma Data Systems,Discover the world of Big Data with us!,4.7 (5 Reviews),40%,10%,\n$25 - $49/hr,,2010,"\nUnited States, Australia"
4,NeenOpal Inc.,The Hub Of Data Science Innovation,5.0 (5 Reviews),25%,10%,\n$25 - $49/hr,,2016,"\nUnited States, India"
5,Consagous Technologies,Helping brands by being their Technology Partner,4.8 (32 Reviews),10%,20%,\n$25 - $49/hr,,2008,"\nIndia, United States"
6,NEX Softsys,IT Partner for Global Clients,5.0 (12 Reviews),20%,20%,\n$25 - $49/hr,,2003,\nUnited States
7,Beyond Key,IT Consulting and Software Development Services,5.0 (7 Reviews),20%,10%,\n$25 - $49/hr,,2005,"\nUnited States, India"
8,Datapine,BUSINESS INTELLIGENCE MADE EASY,5.0 (3 Reviews),90%,5%,\n$50 - $99/hr,,2012,\nGermany
9,Dataforest,Data Engineering and Web Product Development,5.0 (4 Reviews),30%,20%,\n$50 - $99/hr,,2018,"\nUkraine, Estonia"
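
The scraped values still carry layout residue: prices and locations start with `\n`, and the rating and review count share one string. A possible cleanup step is sketched below. The helper names `clean_text` and `parse_review` are hypothetical, not part of the original notebook, and the pattern assumes every review string looks like `4.8 (26 Reviews)`.

```python
import re

def clean_text(value):
    """Trim the leading newline and trailing spaces left over from the page layout."""
    return value.strip()

def parse_review(value):
    """Split a string like '4.8 (26 Reviews)' into (rating, review_count)."""
    match = re.match(r'\s*([\d.]+)\s*\((\d+)\s+Reviews?\)', value)
    if match is None:
        return None  # unexpected format: leave the decision to the caller
    return float(match.group(1)), int(match.group(2))

print(clean_text('\n< $25/hr '))
print(parse_review('4.8 (26 Reviews)'))
```

Both helpers could then be applied column-wise, for example with `df_new['firm_price'].map(clean_text)`.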


In [53]:
firm_name_lst = []
for names in firm_name:
    each_name = names.text
    firm_name_lst.append(each_name)
    
firm_name_lst =  firm_name_lst[3:]

In [41]:
firm_motor_lst = []
for motors in firm_motor:
    each_motor = motors.text
    firm_motor_lst.append(each_motor)
    
firm_motor_lst[:5]

['Enterprise Software, Mobility & BI Solutions',
 'Delivering the future. Now.',
 'Blockchain | IoT | Mobility | AI | Big Data',
 'Discover the world of Big Data with us!',
 'The Hub Of Data Science Innovation']

In [43]:
firm_reviews_lst = []
for reviews in firm_reviews:
    each_review = reviews.text
    firm_reviews_lst.append(each_review)
    
firm_reviews_lst[:5]

['4.8 (26 Reviews)',
 '4.8 (8 Reviews)',
 '5.0 (32 Reviews)',
 '4.7 (5 Reviews)',
 '5.0 (5 Reviews)']

In [44]:
serv_lst = []
plt_lst = []
# The progress values alternate: even positions go to serv_lst, odd positions to plt_lst
for idx, tag in enumerate(firm_progress_value):
    if idx % 2 == 0:
        serv_lst.append(tag.text)
    else:
        plt_lst.append(tag.text)

serv_lst[:3]

['20%', '5%', '15%']

In [29]:
plt_lst[:3]

['15%', '10%', '15%']

In [37]:
firm_price_lst = []
for prices in firm_price:
    each_price = prices.text
    firm_price_lst.append(each_price)
    
firm_price_lst[:5]

['\n< $25/hr ',
 '\n$50 - $99/hr ',
 '\n$25 - $49/hr ',
 '\n$25 - $49/hr ',
 '\n$25 - $49/hr ']

In [None]:
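firm_progress_lst = []
# Collect the text of every progress value (loop over the progress tags, not firm_name)
for progress in firm_progress_value:
    each_progress = progress.text
    firm_progress_lst.append(each_progress)

firm_progress_lst[:5]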

Dataframe

In [57]:
df = pd.DataFrame()

In [58]:
df['firm_name'] = firm_name_lst
# Use the extracted text lists, not the raw tag result sets
df['firm_motor'] = firm_motor_lst
df['firm_reviews'] = firm_reviews_lst
df['firm_serv_pct(%)'] = serv_lst
df['firm_plt_pct(%)'] = plt_lst
df['firm_price'] = firm_price_lst

In [59]:
df

Unnamed: 0,firm_name,firm_motor,firm_reviews,firm_serv_pct(%),firm_plt_pct(%),firm_price
0,SPEC INDIA,"Enterprise Software, Mobility & BI Solutions",4.8 (26 Reviews),20%,15%,\n< $25/hr
1,instinctools,Delivering the future. Now.,4.8 (8 Reviews),5%,10%,\n$50 - $99/hr
2,SoluLab,Blockchain | IoT | Mobility | AI | Big Data,5.0 (32 Reviews),15%,15%,\n$25 - $49/hr
3,Sigma Data Systems,Discover the world of Big Data with us!,4.7 (5 Reviews),40%,10%,\n$25 - $49/hr
4,NeenOpal Inc.,The Hub Of Data Science Innovation,5.0 (5 Reviews),25%,10%,\n$25 - $49/hr
5,Consagous Technologies,Helping brands by being their Technology Partner,4.8 (32 Reviews),10%,20%,\n$25 - $49/hr
6,NEX Softsys,IT Partner for Global Clients,5.0 (12 Reviews),20%,20%,\n$25 - $49/hr
7,Beyond Key,IT Consulting and Software Development Services,5.0 (7 Reviews),20%,10%,\n$25 - $49/hr
8,Datapine,BUSINESS INTELLIGENCE MADE EASY,5.0 (3 Reviews),90%,5%,\n$50 - $99/hr
9,Dataforest,Data Engineering and Web Product Development,5.0 (4 Reviews),30%,20%,\n$50 - $99/hr
