# <strong>BeautifulSoup:</strong> Web Scraping - Market Cap

**Name:** Arsalan Ali<br>
**Email:** arslanchaos@gmail.com

---

### **Table of Contents**
* Website to Scrape: "Companies Market Cap"
* Link of the site: https://companiesmarketcap.com/tech/largest-tech-companies-by-market-cap/
* Import Libraries
* Set URL and Headers
* Fetch Webpage
* Parse Webpage Data into BeautifulSoup
* Testing
* Web Scraping of Multiple Pages
* Saving DataFrame as a CSV Dataset


**Note:** Columns to extract
*   rank
*   company name
*   market cap
*   price
*   today
*   country

---

### Import Libraries

In [104]:
import pandas as pd
import requests
import re
from bs4 import BeautifulSoup

### Set URL and Headers

In [2]:
url = 'https://companiesmarketcap.com/tech/largest-tech-companies-by-market-cap/?page=1'

# A browser-like User-Agent header makes the request look like it comes from a regular browser rather than a script
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.162 Safari/537.36'}

# If you have a proxy, you can use it too
# proxies = {'https': 'http://62.60.160.34:3129'}

### Fetch Webpage

In [4]:
webpage = requests.get(url, headers=headers).text

# To route the request through the proxy, pass it via the proxies parameter
# webpage = requests.get(url, proxies=proxies, headers=headers).text
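
The call above assumes the request always succeeds. As a minimal sketch of a more defensive fetch, the helper below adds a timeout and raises on HTTP error codes. The name `fetch_html`, its `session` parameter, and the 10-second default are assumptions for illustration and are not part of the original notebook.

```python
import requests


def fetch_html(url, headers=None, timeout=10, session=None):
    """Return the page body as text, raising on HTTP errors.

    Hypothetical helper: `session` may be any object with a
    requests-style `get` method (for example a requests.Session).
    """
    client = requests if session is None else session
    response = client.get(url, headers=headers, timeout=timeout)
    # Raise an exception for 4xx/5xx responses instead of parsing an error page
    response.raise_for_status()
    return response.text
```

With the variables defined earlier, a call would look like `webpage = fetch_html(url, headers=headers)`.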

### Parse Webpage Data into BeautifulSoup

In [5]:
soup = BeautifulSoup(webpage, "lxml")

### Testing

In [8]:
# Access the first H1 tag of the page to confirm that parsing worked

soup.find_all("h1")[0].text

'Largest tech companies by market cap'

In [23]:
# Fetch the company names by matching DIV tags with the class "company-name"

companies_html = soup.find_all("div", {"class": "company-name"})
for company in companies_html:
    print(company.text)

    # print(company.text.strip())  # use strip() to remove surrounding whitespace and blank lines

Apple
Microsoft
Alphabet (Google)
Amazon
Tesla
TSMC
Meta Platforms (Facebook)
Tencent
NVIDIA
Samsung
Alibaba
Broadcom
ASML
Oracle
Cisco
Salesforce
Texas Instruments
QUALCOMM
Adobe
Meituan
Intuit
Intel
IBM
AMD
Netflix
PayPal
Automatic Data Processing
SAP
Keyence
Sony
Jingdong Mall
Pinduoduo
ServiceNow
Analog Devices
Applied Materials
Booking Holdings (Booking.com)
Airbnb
Fiserv
Schneider Electric
Activision Blizzard
Atlassian
Micron Technology
Uber
Equinix
Snowflake
NetEase
Lam Research
Nintendo

Palo Alto Networks

Fidelity National Information Services
Synopsys
Vmware
Foxconn (Hon Hai Precision Industry)
Cadence Design Systems
Dassault Systèmes
KLA
Tokyo Electron
Baidu
MercadoLibre
Roper Technologies
NXP Semiconductors
Autodesk
SK Hynix
Fortinet
Adyen
Workday
Enphase Energy

CrowdStrike
Marvell Technology Group
TE Connectivity

Shopify
IQVIA
Microchip Technology
Arista Networks
Block
Electronic Arts
Twitter
Global Payments
Murata Manufacturing (Murata Seisakusho)
STMicroelectronics
Se

In [52]:
# Fetch the table that holds the data, then take every row except the header row

companies = soup.find_all("table", {"class": "default-table table marketcap-table dataTable"})
company_rows = companies[0].find_all("tr")[1:]

In [103]:
# Create one empty list per column so data can be appended to them
rank, company_name, market_cap, price, today, country = ([] for _ in range(6))

# Loop through the rows and pull out the required cells, stripping surrounding whitespace
for company in company_rows:
    rank.append(company.find("td", {"class": "rank-td"}).text.strip())
    company_name.append(company.find("div", {"class": "company-name"}).text.strip())
    market_cap.append(company.select("td[class='td-right']")[0].text.strip())
    price.append(company.select("td[class='td-right']")[1].text.strip())
    today.append(company.find("td", {"class": "rh-sm"}).text.strip())
    country.append(company.find("span", {"class": "responsive-hidden"}).text.strip())
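
The loop above raises an `AttributeError` as soon as one expected cell is missing, because `find` returns `None` in that case. The following is a minimal sketch of a more tolerant row parser. The helper names `cell_text` and `parse_row` are hypothetical, and the sample HTML is invented: only its class names are taken from the cells above, so the real page may be structured differently.

```python
from bs4 import BeautifulSoup

# Invented sample row; only the class names mirror the notebook's selectors
SAMPLE_ROW = """
<table><tr>
  <td class="rank-td">1</td>
  <td><div class="company-name"> Apple </div></td>
  <td class="td-right">$2.422 T</td>
  <td class="td-right">$150.77</td>
  <td class="rh-sm">0.23%</td>
  <td><span class="responsive-hidden">USA</span></td>
</tr></table>
"""


def cell_text(tag):
    """Return stripped text for a tag, or an empty string if the tag is missing."""
    return tag.get_text(strip=True) if tag is not None else ""


def parse_row(row):
    """Extract the six columns from one <tr> element (hypothetical helper)."""
    right_cells = row.select("td[class='td-right']")
    return {
        "Rank": cell_text(row.find("td", {"class": "rank-td"})),
        "Name": cell_text(row.find("div", {"class": "company-name"})),
        "Market Cap": cell_text(right_cells[0] if len(right_cells) > 0 else None),
        "Price": cell_text(right_cells[1] if len(right_cells) > 1 else None),
        "Today": cell_text(row.find("td", {"class": "rh-sm"})),
        "Country": cell_text(row.find("span", {"class": "responsive-hidden"})),
    }


# html.parser is used here only so the sketch needs no extra parser install
row = BeautifulSoup(SAMPLE_ROW, "html.parser").find("tr")
print(parse_row(row))
```

Returning one dictionary per row also means a list of such dictionaries can be passed straight to `pd.DataFrame`, instead of maintaining six parallel lists.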

In [106]:
# Creating a dictionary that'll act as columns and rows for the DataFrame
dataframe = {
    "Rank": rank,
    "Name": company_name,
    "Market Cap": market_cap,
    "Price": price,
    "Today": today,
    "Country":country
}

# Creating the DataFrame by feeding it the dictionary
company_data = pd.DataFrame(dataframe)

# Viewing the DataFrame
company_data

| | Rank | Name | Market Cap | Price | Today | Country |
|---|---|---|---|---|---|---|
| 0 | 1 | Apple | $2.422 T | $150.77 | 0.23% | USA |
| 1 | 2 | Microsoft | $1.770 T | $237.45 | 0.20% | USA |
| 2 | 3 | Alphabet (Google) | $1.285 T | $98.81 | 0.36% | USA |
| 3 | 4 | Amazon | $1.173 T | $115.15 | 1.20% | USA |
| 4 | 5 | Tesla | $858.66 B | $276.01 | 0.25% | USA |
| ... | ... | ... | ... | ... | ... | ... |
| 95 | 96 | Veeva Systems | $24.53 B | $157.99 | 2.82% | USA |
| 96 | 97 | Wolters Kluwer | $24.50 B | $95.63 | 1.91% | Netherlands |
| 97 | 98 | Nokia | $24.03 B | $4.24 | 0.47% | Finland |
| 98 | 99 | Canon | $23.40 B | $22.05 | 1.65% | Japan |


### Web Scraping of Multiple Pages

In [114]:
# Browser-like headers, defined once and reused for every page
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.162 Safari/537.36'}

# Lists that collect each column across all pages
rank, company_name, market_cap, price, today, country = ([] for _ in range(6))

# range(1, 11) requests pages 1 through 10
for page in range(1, 11):
    url = f'https://companiesmarketcap.com/tech/largest-tech-companies-by-market-cap/?page={page}'
    webpage = requests.get(url, headers=headers).text
    soup = BeautifulSoup(webpage, "lxml")
    companies = soup.find_all("table", {"class": "default-table table marketcap-table dataTable"})
    company_rows = companies[0].find_all("tr")[1:]  # skip the header row
    for company in company_rows:
        rank.append(company.find("td", {"class": "rank-td"}).text.strip())
        company_name.append(company.find("div", {"class": "company-name"}).text.strip())
        market_cap.append(company.select("td[class='td-right']")[0].text.strip())
        price.append(company.select("td[class='td-right']")[1].text.strip())
        today.append(company.find("td", {"class": "rh-sm"}).text.strip())
        country.append(company.find("span", {"class": "responsive-hidden"}).text.strip())

cols_dict = {
    "Rank": rank,
    "Name": company_name,
    "Market Cap": market_cap,
    "Price": price,
    "Today": today,
    "Country": country
}
# The dictionary keys become the column names, in insertion order
market_cap_all = pd.DataFrame(cols_dict)

market_cap_all

| | Rank | Name | Market Cap | Price | Today | Country |
|---|---|---|---|---|---|---|
| 0 | 1 | Apple | $2.422 T | $150.77 | 0.23% | USA |
| 1 | 2 | Microsoft | $1.770 T | $237.45 | 0.20% | USA |
| 2 | 3 | Alphabet (Google) | $1.285 T | $98.81 | 0.36% | USA |
| 3 | 4 | Amazon | $1.173 T | $115.15 | 1.20% | USA |
| 4 | 5 | Tesla | $858.66 B | $276.01 | 0.25% | USA |
| ... | ... | ... | ... | ... | ... | ... |
| 827 | 828 | Mobilicom | $2.25 M | $1.71 | 0.59% | Australia |
| 828 | 829 | Pareteum Corporation | $0.02 M | | 0.00% | USA |
| 829 | 830 | Yandex | | $26.49 | 16.12% | Netherlands |
| 830 | 831 | Ozon | | | 0.00% | USA |
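
The scraped values are still strings such as `$2.422 T`, and the last rows show that some cells are empty. The following is a minimal sketch of converting the market cap text into a number. The helper name `parse_market_cap` is hypothetical, and the suffix meanings (T for trillion, B for billion, M for million) are an assumption inferred from the values shown above.

```python
import re

# Assumed suffix meanings, inferred from values such as "$2.422 T" and "$2.25 M"
MULTIPLIERS = {"T": 1e12, "B": 1e9, "M": 1e6}

# Dollar sign, a number with optional thousands separators, and an optional suffix
PATTERN = re.compile(r"\$\s*(\d[\d,]*(?:\.\d+)?)\s*([TBM])?")


def parse_market_cap(text):
    """Convert a string like "$858.66 B" to a float in dollars.

    Returns None for empty or unrecognized input.
    """
    match = PATTERN.fullmatch(text.strip())
    if match is None:
        return None
    number = float(match.group(1).replace(",", ""))
    # A missing suffix leaves the number unscaled
    return number * MULTIPLIERS.get(match.group(2), 1)


for raw in ["$2.422 T", "$858.66 B", "$2.25 M", ""]:
    print(repr(raw), "->", parse_market_cap(raw))
```

Applied to the scraped data, this could look like `market_cap_all["Market Cap"].map(parse_market_cap)`, which leaves the empty cells as missing values rather than failing on them.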


### Saving DataFrame as a CSV Dataset

In [116]:
market_cap_all.to_csv("tech_market_cap.csv")
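
By default, `to_csv` also writes the DataFrame index as an extra first column without a name, and that column appears as `Unnamed: 0` when the file is read back. A minimal sketch of the difference, using a tiny stand-in frame and an in-memory buffer instead of a file (both are assumptions for illustration):

```python
import io

import pandas as pd

# Tiny stand-in frame built from two rows of the output above
sample = pd.DataFrame({"Rank": ["1", "2"], "Name": ["Apple", "Microsoft"]})

# Default behaviour: the index is written as an extra, unnamed column
with_index = pd.read_csv(io.StringIO(sample.to_csv()))
print(list(with_index.columns))

# index=False writes only the data columns
without_index = pd.read_csv(io.StringIO(sample.to_csv(index=False)))
print(list(without_index.columns))
```

If the index column is not wanted in the saved dataset, the equivalent call for the notebook's frame would be `market_cap_all.to_csv("tech_market_cap.csv", index=False)`.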