# Web Scraping

## Top Data Analytics Companies
To make their business processes more effective, companies are focusing on collecting and using data. Data analytics companies help businesses analyze the data they acquire and put it to use as required. Data analytics services can assist with product development, identifying potential market gaps, improving operational efficiency, and more. [Goodfirms](https://www.goodfirms.co/big-data-analytics/data-analytics)

The aim of this project is to scrape the Goodfirms website for details about the top data analytics companies. These details include reviews, rating, year founded, location, etc.

### Road Map
Stage 1
* Download page
* Open page
* Get details
* Extract details
* Make a dataframe

Stage 2 (Code Refactoring)
* Change loops to functions
* Move functions to another file
* Import functions and extract
* Make a dataframe

Libraries

In [1]:
import requests as r
from bs4 import BeautifulSoup 
import pandas as pd

Accessing webpage

In [77]:
url_link = 'https://www.goodfirms.co/big-data-analytics/data-analytics'
html = r.get(url_link)
html

<Response [200]>

Saving webpage

In [78]:
with open('topfirm.html', mode='wb') as file:
    file.write(html.content)

In [82]:
with open('topfirm.html',encoding='utf-8', mode='r') as openfile:
    bs = BeautifulSoup(openfile, 'lxml')

Locate Details

<!-- 1. firm position -->
2. firm name
3. firm reviews
4. firm progress value
5. firm price
6. firm employee
7. year founded
8. firm location

In [38]:
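# Collect every tag on the page that holds each detail
firm_name = bs.find_all('span', {'itemprop': 'name'})
firm_motor = bs.find_all('p', {'class': 'profile-tagline'})  # firm tagline (motto)
firm_reviews = bs.find_all('span', {'class': 'listinv_review_label'})
# Holds both the serv and plt percentages; they are split by position further down
firm_progress_value = bs.find_all('div', {'class': 'circle-progress-value'})
firm_price = bs.find_all('div', {'class': 'firm-pricing'})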

In [67]:
def extract_details(tag_lst):
    """Return the text of each tag in tag_lst as a pandas Series."""
    lst = [tag.text for tag in tag_lst]
    return pd.Series(lst)

In [83]:
# [3:] drops the first three 'name' matches before extracting the firm names
firm_name_sr = extract_details(firm_name[3:])
firm_motor_sr = extract_details(firm_motor)
firm_reviews_sr = extract_details(firm_reviews)
firm_price_sr = extract_details(firm_price)

In [None]:
serv_lst = []
plt_lst = []
# Values alternate: even positions go to serv_lst, odd positions to plt_lst
for idx, tag in enumerate(firm_progress_value):
    if idx % 2 == 0:
        serv_lst.append(tag.text)
    else:
        plt_lst.append(tag.text)

serv_lst[:3]

['20%', '5%', '15%']
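
The loop above depends on the two percentages strictly alternating in `firm_progress_value`. Assuming that alternation holds, the same split can be sketched with slice steps instead of an index test. `split_alternating` is a hypothetical helper name, and the sample list is rebuilt from the first three rows of the notebook's output rather than scraped.

```python
def split_alternating(values):
    """Split an interleaved sequence into (even-indexed, odd-indexed) lists."""
    values = list(values)
    return values[::2], values[1::2]


# Assumed page order: serv %, plt %, serv %, plt %, ...
sample = ['20%', '15%', '5%', '10%', '15%', '15%']
serv_sample, plt_sample = split_alternating(sample)
print(serv_sample)
print(plt_sample)
```

With the real data, the input would be the tag texts, for example `[tag.text for tag in firm_progress_value]`.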

In [84]:
df_new = pd.DataFrame()

In [85]:
df_new['firm_name'] = firm_name_sr
df_new['firm_motor'] = firm_motor_sr
df_new['firm_reviews'] = firm_reviews_sr
df_new['firm_serv_pct(%)'] = serv_lst
df_new['firm_plt_pct(%)'] = plt_lst
df_new['firm_price'] = firm_price_sr
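
These assignments only line up if every extracted sequence has exactly one entry per firm. pandas raises a `ValueError` when a plain list of the wrong length is assigned, but a `Series` is aligned on the index, so a shorter one quietly leaves `NaN` in the missing rows. A minimal sketch of a check that could run before building the dataframe; `check_equal_lengths` is a hypothetical helper and the sample columns are made up for illustration.

```python
def check_equal_lengths(columns):
    """Return each column's length; raise ValueError if they differ."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"column lengths differ: {lengths}")
    return lengths


# Illustrative stand-ins for the extracted sequences
columns = {
    'firm_name': ['SPEC INDIA', 'instinctools'],
    'firm_serv_pct(%)': ['20%', '5%'],
}
print(check_equal_lengths(columns))
```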

In [86]:
df_new

| | firm_name | firm_motor | firm_reviews | firm_serv_pct(%) | firm_plt_pct(%) | firm_price |
|---|---|---|---|---|---|---|
| 0 | SPEC INDIA | Enterprise Software, Mobility & BI Solutions | 4.8 (26 Reviews) | 20% | 15% | \n< $25/hr |
| 1 | instinctools | Delivering the future. Now. | 4.8 (8 Reviews) | 5% | 10% | \n$50 - $99/hr |
| 2 | SoluLab | Blockchain \| IoT \| Mobility \| AI \| Big Data | 5.0 (32 Reviews) | 15% | 15% | \n$25 - $49/hr |
| 3 | Sigma Data Systems | Discover the world of Big Data with us! | 4.7 (5 Reviews) | 40% | 10% | \n$25 - $49/hr |
| 4 | NeenOpal Inc. | The Hub Of Data Science Innovation | 5.0 (5 Reviews) | 25% | 10% | \n$25 - $49/hr |
| 5 | Consagous Technologies | Helping brands by being their Technology Partner | 4.8 (32 Reviews) | 10% | 20% | \n$25 - $49/hr |
| 6 | NEX Softsys | IT Partner for Global Clients | 5.0 (12 Reviews) | 20% | 20% | \n$25 - $49/hr |
| 7 | Beyond Key | IT Consulting and Software Development Services | 5.0 (7 Reviews) | 20% | 10% | \n$25 - $49/hr |
| 8 | Datapine | BUSINESS INTELLIGENCE MADE EASY | 5.0 (3 Reviews) | 90% | 5% | \n$50 - $99/hr |
| 9 | Dataforest | Data Engineering and Web Product Development | 5.0 (4 Reviews) | 30% | 20% | \n$50 - $99/hr |
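
In the output above, `firm_price` still carries a leading newline and `firm_reviews` packs the rating and the review count into one string. A minimal cleanup sketch, assuming every value follows the formats shown in these ten rows; `parse_review`, `clean_price` and `REVIEW_PATTERN` are hypothetical names, not part of the original notebook.

```python
import re

# Matches strings such as '4.8 (26 Reviews)'; the format is assumed from the output
REVIEW_PATTERN = re.compile(r'^\s*(\d+(?:\.\d+)?)\s*\((\d+)\s+Reviews?\)\s*$')


def parse_review(text):
    """Return (rating, review_count) parsed from a review label."""
    match = REVIEW_PATTERN.match(text)
    if match is None:
        raise ValueError(f"unexpected review format: {text!r}")
    return float(match.group(1)), int(match.group(2))


def clean_price(text):
    """Strip surrounding whitespace, including the leading newline."""
    return text.strip()


print(parse_review('4.8 (26 Reviews)'))
print(clean_price('\n< $25/hr'))
```

Applied to the dataframe, these could be mapped over the columns, for example `df_new['firm_price'].map(clean_price)`.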
