# Web Scraping

## Top Data Analytics Companies
To improve the effectiveness of their business processes, companies are focusing on collecting and utilizing data. Data analytics companies enable businesses to analyze the data they acquire and use it as required. Data analytics services can assist in product development, identifying potential market gaps, improving operational efficiency, etc. [Goodfirms](https://www.goodfirms.co/big-data-analytics/data-analytics)

The aim of this project is to scrape the Goodfirms website for details about the top data analytics companies. These details include reviews, ratings, year founded, location, etc.

### Road Map
Stage 1
* Download page
* Open page
* Get details
* Extract details
* Make a dataframe

Stage 2 (Code Refactoring)
* Convert loops into functions
* Move the functions to another file
* Import the functions and extract
* Make a dataframe

Libraries

In [4]:
import requests as r
from bs4 import BeautifulSoup 
import pandas as pd

Accessing webpage

In [None]:
url = 'https://www.goodfirms.co/big-data-analytics/data-analytics'
html = r.get(url)
html

Saving webpage

In [None]:
with open('page1.html', mode='wb') as openfile:
    openfile.write(html.content)
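
Saving the page means the site only has to be requested once. A minimal sketch of that idea follows, assuming a hypothetical helper named `download_page` (not part of the original notebook) that skips the download when the file already exists. The `fetch` parameter is also an assumption, added so the helper can be tried without network access.

```python
from pathlib import Path


def download_page(url, path, fetch=None):
    """Save the page at `url` to `path` unless the file already exists.

    `download_page` and `fetch` are hypothetical names used for illustration.
    """
    path = Path(path)
    if path.exists():
        return path  # reuse the saved copy instead of requesting the site again

    if fetch is None:
        import requests  # imported lazily so the helper can be tested offline

        def fetch(target_url):
            response = requests.get(target_url, timeout=30)
            response.raise_for_status()  # stop early on 4xx/5xx responses
            return response.content

    path.write_bytes(fetch(url))
    return path


# Example call (commented out so that nothing is downloaded on import):
# download_page('https://www.goodfirms.co/big-data-analytics/data-analytics', 'page1.html')
```

Calling `raise_for_status()` keeps an error page from being saved as if it were the real listing.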

Details to extract
1. firm name
2. firm motto (tagline; named `firm_motor` in the code)
3. firm reviews
4. progress value
5. firm price
6. firm employees
7. year founded
8. firm location

In [5]:
with open('page1.html', mode='r', encoding='utf-8') as file:
    bs = BeautifulSoup(file, 'lxml')

Extraction

In [6]:
firm_name = bs.find_all('span', {'itemprop': 'name'})
firm_motor = bs.find_all('p', {'class': 'profile-tagline'})
firm_reviews = bs.find_all('span', {'class': 'listinv_review_label'})
progress_value = bs.find_all('div', {'class': 'circle-progress-value'})
firm_price = bs.find_all('div', {'class': 'firm-pricing'})

In [7]:
firm_name[:5]

[<span itemprop="name">Home</span>,
 <span itemprop="name">big data analytics</span>,
 <span itemprop="name">
 Data Analytics </span>,
 <span itemprop="name">SPEC INDIA</span>,
 <span itemprop="name">instinctools</span>]

Function to extract details

In [8]:
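def extrac_details(tag):
    """Return the text of every element in `tag` as a pandas Series."""
    lst = []
    for each_val in tag:
        # .text holds the visible text inside the HTML element
        lst.append(each_val.text)
    return pd.Series(lst)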

In [9]:
nam_sr = extrac_details(firm_name[3:])
nam_sr[:5]

0            SPEC INDIA
1          instinctools
2               SoluLab
3    Sigma Data Systems
4         NeenOpal Inc.
dtype: object
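
The `firm_name` output above shows that some tags carry extra whitespace and line breaks (for example the `Data Analytics` span). The sketch below shows one possible way to trim that while extracting. It assumes a hypothetical variant called `extract_text` and a small inline HTML snippet standing in for the saved page. It uses the built-in `html.parser` so that it runs without `lxml`.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Small inline snippet standing in for the saved page (illustrative markup only)
html = """
<p class="profile-tagline"> First tagline </p>
<p class="profile-tagline">Second tagline
</p>
"""


def extract_text(tags):
    """Hypothetical variant of extrac_details that also trims whitespace."""
    return pd.Series([tag.get_text(strip=True) for tag in tags])


soup = BeautifulSoup(html, 'html.parser')  # built-in parser, no lxml needed
taglines = extract_text(soup.find_all('p', {'class': 'profile-tagline'}))
print(taglines.tolist())
```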

In [10]:
motor_sr = extrac_details(firm_motor)
review_sr = extrac_details(firm_reviews)
price_sr = extrac_details(firm_price)

In [11]:
progress_value[:5]

[<div class="circle-progress-value">20%</div>,
 <div class="circle-progress-value">15%</div>,
 <div class="circle-progress-value">5%</div>,
 <div class="circle-progress-value">10%</div>,
 <div class="circle-progress-value">15%</div>]

In [12]:
ser = []
plt = []
# The values alternate: even positions go to `ser`, odd positions to `plt`
for idx, pct in enumerate(progress_value):
    if idx % 2 == 0:
        ser.append(pct.text)
    else:
        plt.append(pct.text)
ser[:5]

['20%', '5%', '15%', '40%', '25%']

In [13]:
plt[:5]

['15%', '10%', '15%', '10%', '10%']
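
The even/odd split above can also be written with list slicing. This is a minimal sketch on plain strings copied from the `progress_value` output shown earlier. The helper `to_int` is a hypothetical addition for turning the percentages into numbers.

```python
# Percentages copied from the progress_value output shown earlier
values = ['20%', '15%', '5%', '10%', '15%']

ser_values = values[::2]   # even positions: 0, 2, 4, ...
plt_values = values[1::2]  # odd positions: 1, 3, ...


def to_int(pct):
    """Turn a string such as '20%' into the integer 20 (hypothetical helper)."""
    return int(pct.strip().rstrip('%'))


ser_numbers = [to_int(v) for v in ser_values]
print(ser_values, plt_values, ser_numbers)
```

With the real tags, the same slices would be applied to a list of their `.text` values.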

Convert details to a dataframe

In [14]:
df = pd.DataFrame()

In [15]:
df['firm_name'] = nam_sr
df['firm_motor'] = motor_sr
df['firm_reviews'] = review_sr
df['firm_ser_pct(%)'] = ser
df['firm_plt_pct(%)'] = plt
df['firm_price'] = price_sr
    

In [16]:
len(df)

51
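
Assigning a Series to a column aligns on the index, so a column that is shorter or longer than the others may not raise an error. One possible safeguard, sketched here with the first three rows from the outputs above, is to compare the column lengths before building the dataframe in a single step. The names `columns`, `lengths` and `df_small` are illustrative only.

```python
import pandas as pd

# First three rows taken from the outputs shown earlier in this notebook
columns = {
    'firm_name': ['SPEC INDIA', 'instinctools', 'SoluLab'],
    'firm_ser_pct(%)': ['20%', '5%', '15%'],
    'firm_plt_pct(%)': ['15%', '10%', '15%'],
}

# Every column should have one value per firm
lengths = {name: len(vals) for name, vals in columns.items()}
if len(set(lengths.values())) != 1:
    raise ValueError(f'column lengths differ: {lengths}')

df_small = pd.DataFrame(columns)
print(df_small.shape)
```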

TASK
> Extract the remaining three details: firm employees, year founded and firm location.

> Download the second page from the website, extract the same details and join them to the previous dataframe.

> The minimum is 2 pages, but you can scrape more if you want to.
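
As a hint for the joining step, `pd.concat` can stack two dataframes that share the same columns. This is only a sketch: the rows for page 2 are placeholders, not real scraped values, and the names `page1`, `page2` and `combined` are assumptions.

```python
import pandas as pd

# Two names from the page 1 output shown earlier
page1 = pd.DataFrame({'firm_name': ['SPEC INDIA', 'instinctools']})

# Placeholder rows standing in for whatever page 2 actually contains
page2 = pd.DataFrame({'firm_name': ['Placeholder Firm A', 'Placeholder Firm B']})

# ignore_index=True renumbers the rows 0..n-1 instead of repeating 0, 1
combined = pd.concat([page1, page2], ignore_index=True)
print(len(combined))
```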