# Web Scraping

## Top Data Analytics Companies
To make their business processes more effective, companies are focusing on collecting and using data. Data analytics companies help businesses analyze the data they acquire and put it to use as required. Data analytics services can assist with product development, identifying potential market gaps, improving operational efficiency, etc. [Goodfirms](https://www.goodfirms.co/big-data-analytics/data-analytics)

The aim of this project is to scrape the Goodfirms website for details about the top data analytics companies. These details include reviews, ratings, year founded, location, etc.

### Road Map
Stage 1
* Download page
* Open page
* Get details
* Extract details
* Make a dataframe

Stage 2 (Code Refactoring)
* Change loops to functions
* Move functions to another file
* Import functions and extract
* Make a dataframe
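
The Stage 2 refactoring could look roughly like the sketch below, where the per-field loops collapse into one function that takes a parsed page and returns one record per firm. The names `extract_firms` and `text_or_none` and the inline sample HTML are assumptions made for illustration; only the CSS class names come from the cells further down, and the real page markup may differ.

```python
from bs4 import BeautifulSoup


def text_or_none(tag):
    """Return the stripped text of a tag, or None when the tag is missing."""
    return tag.get_text(strip=True) if tag is not None else None


def extract_firms(soup):
    """Collect one dict per firm header block found in the parsed page."""
    records = []
    for header in soup.find_all('div', {'class': 'firm-header-wrapper'}):
        records.append({
            'firm_name': text_or_none(header.find('h3')),
            'firm_motor': text_or_none(header.find('div', {'class': 'tagline'})),
            'firm_review_count': text_or_none(header.find('span', {'class': 'review-count'})),
        })
    return records


# Hypothetical markup that only mimics the class names used in this notebook.
sample_html = """
<div class="firm-header-wrapper">
  <h3>Example Firm</h3>
  <div class="tagline">Example tagline</div>
  <span class="review-count">3 Reviews</span>
</div>
<div class="firm-header-wrapper">
  <h3>Another Firm</h3>
</div>
"""

firms = extract_firms(BeautifulSoup(sample_html, 'html.parser'))
print(firms)
```

A function like this can live in a separate module and be imported into the notebook, and its list of dicts can be passed straight to `pd.DataFrame`.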

Libraries

In [50]:

import requests as r
from bs4 import BeautifulSoup 
import pandas as pd
import numpy as np
import plotly.express as px

Accessing webpage

In [2]:
url_link = 'https://www.goodfirms.co/big-data-analytics/data-analytics'
html = r.get(url_link)
html

<Response [200]>
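
A `200` response means the request succeeded. A slightly more defensive way to make the same request is sketched below: it sets a timeout and raises an exception on error status codes instead of relying on reading the printed response. The function name `fetch_page`, the 10-second timeout, and the `FakeClient`/`FakeResponse` stand-ins (used only so the sketch runs offline) are assumptions, not part of the original notebook.

```python
import requests


def fetch_page(url, client=requests, timeout=10):
    """Request a URL and raise an exception on 4xx/5xx responses."""
    response = client.get(url, timeout=timeout)
    response.raise_for_status()
    return response


class FakeResponse:
    """Offline stand-in for requests.Response (illustration only)."""
    status_code = 200
    content = b'<html></html>'

    def raise_for_status(self):
        pass


class FakeClient:
    """Offline stand-in exposing the same get() call as requests."""
    def get(self, url, timeout=None):
        return FakeResponse()


page = fetch_page('https://www.goodfirms.co/big-data-analytics/data-analytics',
                  client=FakeClient())
print(page.status_code)
```

To query the real site, drop the `client` argument so that `requests.get` is used.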

### Download and save HTML file

In [15]:
# URL link
url_link = ['https://www.goodfirms.co/big-data-analytics/data-analytics', 'https://www.goodfirms.co/big-data-analytics/data-analytics?page=2', 
       'https://www.goodfirms.co/big-data-analytics/data-analytics?page=3']

# access website
access = [r.get(url) for url in url_link]

# Equivalent explicit loop:
# access = []
# for url in url_link:
#     access.append(r.get(url))
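
The three URLs differ only in their `page` query parameter, so the list can also be generated rather than typed out. This is a minimal sketch; `base_url` and `n_pages` are names introduced here, and the page count of 3 simply mirrors the hand-written list.

```python
base_url = 'https://www.goodfirms.co/big-data-analytics/data-analytics'
n_pages = 3  # assumption: the same three pages as the hand-written list

# Page 1 is the bare URL; later pages add a ?page=N query string.
url_link = [base_url if n == 1 else f'{base_url}?page={n}'
            for n in range(1, n_pages + 1)]
print(url_link)
```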

Saving webpage

In [19]:
# enumerate numbers the pages from 1 without searching the list on every pass
for index, page in enumerate(access, start=1):
    with open(f'page{index}.html', mode='wb') as file:
        file.write(page.content)

### Page 1 

In [31]:
with open("page1.html", encoding='utf-8', mode='r') as file:
    bs = BeautifulSoup(file, 'lxml')

In [3]:
with open('topfirm.html', mode='wb') as file:
    file.write(html.content)

In [4]:
with open('topfirm.html',encoding='utf-8', mode='r') as openfile:
    bs = BeautifulSoup(openfile, 'lxml')

Locate Details

<!-- 1. firm position -->
2. firm name
3. firm reviews
4. firm progress value
5. firm price
6. firm employee
7. year founded
8. firm location

In [32]:
# Each firm has a header block (name, tagline, reviews) and a services block
x = bs.find_all('div', {'class': 'firm-header-wrapper'})
y = bs.find_all('div', {'class': 'firm-services'})

# Pull the raw tags out of each block; find() returns None when a tag is missing
name = [i.find('h3') for i in x]
motor = [i.find('div', {'class': 'tagline'}) for i in x]
reviews_count = [i.find('span', {'class': 'review-count'}) for i in x]
review_rating = [i.find('span', {'class': "review-rating"}) for i in x]
# Spans in the services block: price, employees, year founded, ..., location
peyl = [i.find_all('span') for i in y]

# Convert tags to text, with a fallback when a tag is missing or empty
firm_reviews_count = [t.text.strip() if t and t.text else '0' for t in reviews_count]
firm_name = [t.text.strip() if t and t.text else np.nan for t in name]
firm_motor = [t.text.strip() if t and t.text else np.nan for t in motor]
# The rating value sits on the second line of the span's text
firm_reviews_rating = [t.text.splitlines()[1] if t and t.text else np.nan for t in review_rating]
firm_price = [i[0].text if i[0] and i[0].text else np.nan for i in peyl]
firm_employee = [i[1].text if i[1] and i[1].text else np.nan for i in peyl]
firm_year_founded = [i[2].text if i[2] and i[2].text else np.nan for i in peyl]
firm_location = [i[-1].text if i[-1] and i[-1].text else np.nan for i in peyl]
# [i[3] for i in peyl]
firm_location

['United States, Germany',
 'United States, Canada',
 'United Kingdom, United States',
 'United States',
 'United States',
 'Germany',
 'India',
 'India',
 'India',
 'United States',
 'United States',
 'United States',
 'India',
 'Russia',
 'United States',
 'Poland',
 'United States',
 'Thailand',
 'United States',
 'United States',
 'United States',
 'India',
 'India',
 'Mexico',
 'United States',
 'United States',
 'India',
 'Lithuania',
 'India',
 'India',
 'Brazil',
 'United States',
 'United States',
 'United States',
 'Germany',
 'Ukraine',
 'United States',
 'Luxembourg, Poland',
 'Kazakhstan',
 'United States, United Kingdom',
 'United States',
 'Portugal',
 'United States',
 'Vietnam',
 'Poland',
 'United States',
 'United States',
 'Israel',
 'Portugal']
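
Indexing the span lists by position (`i[0]`, `i[1]`, `i[2]`, `i[-1]`) raises an `IndexError` if a firm's services block has fewer spans than expected. A guarded lookup could be sketched as follows; the helper name `span_text` and the two-span sample block are hypothetical and do not come from the Goodfirms markup.

```python
import numpy as np
from bs4 import BeautifulSoup


def span_text(spans, position):
    """Return the text of spans[position], or NaN when that span is absent or empty."""
    try:
        text = spans[position].get_text(strip=True)
    except IndexError:
        return np.nan
    return text or np.nan


# Hypothetical block with only two spans.
block = BeautifulSoup(
    '<div><span>$25 - $49/hr</span><span>50 - 249</span></div>', 'html.parser')
spans = block.find_all('span')

print(span_text(spans, 0))   # first span exists
print(span_text(spans, 2))   # third span is missing, so NaN is returned
```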

In [26]:
df_new = pd.DataFrame()

In [33]:
df_new['firm_name'] = firm_name
df_new['firm_motor'] = firm_motor
df_new['firm_review_count'] = firm_reviews_count
df_new['firm_review_rating'] = firm_reviews_rating
# df_new['firm_serv_pct(%)'] = serv_lst
# df_new['firm_plt_pct(%)'] = plt_lst
df_new['firm_price'] = firm_price
df_new['firm_emp'] = firm_employee
df_new['firm_year_founded'] = firm_year_founded
df_new['firm_location'] = firm_location

df_new.head()

| | firm_name | firm_motor | firm_review_count | firm_review_rating | firm_price | firm_emp | firm_year_founded | firm_location |
|---|---|---|---|---|---|---|---|---|
| 0 | instinctools | BlackFriday discount for all contracts before ... | 11 Reviews | 4.9 | $25 - $49/hr | 250 - 999 | 2000 | United States, Germany |
| 1 | Outsource Bigdata | AI-driven Web Scraping & Data Labeling Provider | 19 Reviews | 5.0 | < $25/hr | 50 - 249 | 2012 | United States, Canada |
| 2 | Systango Technologies | We Make the Impossible, Possible. | 6 Reviews | 5.0 | $25 - $49/hr | 250 - 999 | 2007 | United Kingdom, United States |
| 3 | NEX Softsys | IT Partner for Global Clients | 12 Reviews | 5.0 | $25 - $49/hr | 50 - 249 | 2003 | United States |
| 4 | Beyond Key | IT Consulting and Software Development Services | 7 Reviews | 5.0 | $25 - $49/hr | 250 - 999 | 2005 | United States |
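
As an alternative to creating an empty frame and assigning columns one by one, the lists can be passed to `pd.DataFrame` as a dict in a single step, which removes the dependency on an empty frame created in an earlier cell. The sketch below uses made-up values and the hypothetical name `df_sketch`; in the notebook the lists would be the `firm_*` lists built above.

```python
import pandas as pd

# Hypothetical stand-ins for the scraped lists.
firm_name = ['Firm A', 'Firm B']
firm_price = ['< $25/hr', '$25 - $49/hr']
firm_location = ['India', 'United States, Canada']

# Keys become column names; every list must have the same length.
df_sketch = pd.DataFrame({
    'firm_name': firm_name,
    'firm_price': firm_price,
    'firm_location': firm_location,
})
print(df_sketch.shape)
```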


In [34]:
df_new.dtypes

firm_name             object
firm_motor            object
firm_review_count     object
firm_review_rating    object
firm_price            object
firm_emp              object
firm_year_founded     object
firm_location         object
dtype: object

In [68]:
df_new1 = df_new.copy()

In [69]:
rev_cnt = df_new1['firm_review_count'].apply(lambda x: x.split()[0])
df_new1.insert(3, 'review_count', rev_cnt)
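
The same extraction can be done with pandas string methods, which pull out the digits and convert them to integers in one pass. This is a minimal sketch on made-up values shaped like `firm_review_count`; the variable names are illustrative.

```python
import pandas as pd

# Hypothetical sample in the same "N Reviews" shape, including the '0' fallback.
counts = pd.Series(['11 Reviews', '19 Reviews', '0'])

# Capture the leading digits and convert them to integers.
review_count = counts.str.extract(r'(\d+)', expand=False).astype(int)
print(review_count.tolist())
```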


In [70]:
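# Split on ', ' so the second country does not keep a leading space
new_loc = df_new1['firm_location'].apply(lambda x: x.split(', '))
df_new1['new_loc'] = new_loc

# One row per (firm, country) pair
df_new1 = df_new1.explode('new_loc')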

In [71]:
df_new1.head()

| | firm_name | firm_motor | firm_review_count | review_count | firm_review_rating | firm_price | firm_emp | firm_year_founded | firm_location | new_loc |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | instinctools | BlackFriday discount for all contracts before ... | 11 Reviews | 11 | 4.9 | $25 - $49/hr | 250 - 999 | 2000 | United States, Germany | United States |
| 0 | instinctools | BlackFriday discount for all contracts before ... | 11 Reviews | 11 | 4.9 | $25 - $49/hr | 250 - 999 | 2000 | United States, Germany | Germany |
| 1 | Outsource Bigdata | AI-driven Web Scraping & Data Labeling Provider | 19 Reviews | 19 | 5.0 | < $25/hr | 50 - 249 | 2012 | United States, Canada | United States |
| 1 | Outsource Bigdata | AI-driven Web Scraping & Data Labeling Provider | 19 Reviews | 19 | 5.0 | < $25/hr | 50 - 249 | 2012 | United States, Canada | Canada |
| 2 | Systango Technologies | We Make the Impossible, Possible. | 6 Reviews | 6 | 5.0 | $25 - $49/hr | 250 - 999 | 2007 | United Kingdom, United States | United Kingdom |
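
The separator matters when splitting multi-country locations, and a rendered table does not make stray whitespace visible. The small check below, on a made-up two-row series, shows the difference between splitting on `','` and on `', '`.

```python
import pandas as pd

locations = pd.Series(['United States, Germany', 'India'])  # hypothetical sample

# Splitting on ',' alone keeps the space that follows each comma.
loose = locations.str.split(',').explode()
# Splitting on ', ' gives clean country names.
clean = locations.str.split(', ').explode()

print(loose.tolist())
print(clean.tolist())
```

With the loose split, `' Germany'` and `'Germany'` would count as two different countries in any later grouping.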


In [60]:
# df_new1.drop(columns=[''])

In [72]:
# Columns that hold numbers stored as text
num_cols = ['review_count', 'firm_review_rating', 'firm_year_founded']

# errors='coerce' turns values that cannot be parsed into NaN
for col in num_cols:
    df_new1[col] = pd.to_numeric(df_new1[col], errors='coerce')

In [73]:
df_new1.dtypes

firm_name              object
firm_motor             object
firm_review_count      object
review_count            int64
firm_review_rating    float64
firm_price             object
firm_emp               object
firm_year_founded     float64
firm_location          object
new_loc                object
dtype: object
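
The effect of `errors='coerce'` can be seen on a small made-up series: a value that cannot be parsed becomes `NaN`, and the presence of `NaN` forces a float dtype. That would be one possible explanation for `firm_year_founded` ending up as `float64` rather than `int64`, although the notebook does not confirm it.

```python
import pandas as pd

# Hypothetical values; 'unknown' cannot be parsed as a number.
years = pd.Series(['2000', '2012', 'unknown'])

converted = pd.to_numeric(years, errors='coerce')
print(converted.dtype)
print(converted.isna().sum())
```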

In [75]:
px.histogram(df_new1, 'firm_year_founded')

In [79]:
px.histogram(df_new1, 'firm_year_founded', color='firm_emp')

In [76]:
px.histogram(df_new1, 'review_count')

In [80]:
px.histogram(df_new1, 'review_count', color='firm_emp')

In [77]:
px.histogram(df_new1, 'firm_review_rating')

In [81]:
px.histogram(df_new1, 'firm_review_rating', color='firm_emp')

In [14]:
df_new['firm_location'].apply(lambda x: x.split(', ')).explode()

0      United States
0            Germany
1      United States
1             Canada
2     United Kingdom
2      United States
3      United States
4      United States
5            Germany
6              India
7              India
8              India
9      United States
10     United States
11     United States
12             India
13            Russia
14     United States
15            Poland
16     United States
17          Thailand
18     United States
19     United States
20     United States
21             India
22             India
23            Mexico
24     United States
25     United States
26             India
27         Lithuania
28             India
29             India
30            Brazil
31     United States
32     United States
33     United States
34           Germany
35           Ukraine
36     United States
37        Luxembourg
37            Poland
38        Kazakhstan
39     United States
39    United Kingdom
40     United States
41          Portugal
42     United
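
Once the locations are exploded to one country per row, counting firms per country is a single `value_counts` call. The sketch below uses a made-up three-row series in place of `df_new['firm_location']`; `country_counts` is a name introduced here.

```python
import pandas as pd

# Hypothetical sample of location strings.
firm_location = pd.Series(['United States, Germany', 'India', 'United States'])

# Split, explode to one country per row, then count occurrences.
country_counts = firm_location.str.split(', ').explode().value_counts()
print(country_counts)
```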

In [8]:
# serv_lst = []
# plt_lst = []
# for val in enumerate(firm_progress_value):
#     if val[0] % 2 == 0:
#         serv_lst.append(val[1].text)
#     else:
#         plt_lst.append(val[1].text)        
    
# serv_lst[:3]

In [9]:
# df_new = pd.DataFrame()

In [10]:
# df_new['firm_name'] = firm_name_sr
# df_new['firm_motor'] = firm_motor_sr
# df_new['firm_reviews'] = firm_reviews_sr
# df_new['firm_serv_pct(%)'] = serv_lst
# df_new['firm_plt_pct(%)'] = plt_lst
# df_new['firm_price'] = firm_price_sr

In [11]:
# df_new