# Title: Scraping `top 1000 technology companies`

`Author` : Abdullah Khan Kakar [Github](https://github.com/AbdullahKhanKakar)--[LinkedIn](https://www.linkedin.com/in/abdullahkhankakar/)--[Kaggle](https://www.kaggle.com/abdullahkhanuet22)

`Date`   : 06.Febuary.2024

`Source`: [DISFOLD](https://disfold.com/sector/technology/companies/?page=1)

## Import Libraries
Let's start web scraping by importing all necessary libraries used in this notebook

In [133]:
import requests
from bs4 import BeautifulSoup
import pandas as pd

## Aims and Objectives
Aim is to scrape 1000 top technology companies from DISFOLD website as of report `January, 2024`. There are total 20 pages that we need to scraped because one page contains 50 companies record. I split down the task into 3 parts:

- First: Scrape one page
- Second: Scrape all 20 pages through for loop by building function
- Thrid: Clean dataset

# First Part: Scrape one page

In [134]:
url = "https://disfold.com/sector/technology/companies/?page=1"

In [135]:
page = requests.get(url).text
page

'<!DOCTYPE html>\n<html lang="en">\n<head>\n\n<!-- Google tag (gtag.js) -->\n<script async src="https://www.googletagmanager.com/gtag/js?id=G-LCD4NF3FCT"></script>\n<script>\n  window.dataLayer = window.dataLayer || [];\n  function gtag(){dataLayer.push(arguments);}\n  gtag(\'js\', new Date());\n\n  gtag(\'config\', \'G-LCD4NF3FCT\');\n</script>\n<!-- Google AdSense -->\n\n    <script async src="https://pagead2.googlesyndication.com/pagead/js/adsbygoogle.js?client=ca-pub-4014224489839616"\n         crossorigin="anonymous"></script>\n\n    \n    \n    <meta charset="utf-8">\n    <meta name="viewport" content="width=device-width, initial-scale=1.0">\n    <meta name="robots" content="index, follow, max-image-preview:large, max-snippet:-1, max-video-preview:-1">\n\n    <link rel="icon" href="/static/favicon.ico">\n    <title>Largest 1000 Technology Companies in the World in 2024</title>\n    <meta name="description" content="List of the top companies in the Technology sector in the world r

In [136]:
soup = BeautifulSoup(page, "lxml")

In [137]:
soup

<!DOCTYPE html>
<html lang="en">
<head>
<!-- Google tag (gtag.js) -->
<script async="" src="https://www.googletagmanager.com/gtag/js?id=G-LCD4NF3FCT"></script>
<script>
  window.dataLayer = window.dataLayer || [];
  function gtag(){dataLayer.push(arguments);}
  gtag('js', new Date());

  gtag('config', 'G-LCD4NF3FCT');
</script>
<!-- Google AdSense -->
<script async="" crossorigin="anonymous" src="https://pagead2.googlesyndication.com/pagead/js/adsbygoogle.js?client=ca-pub-4014224489839616"></script>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1.0" name="viewport"/>
<meta content="index, follow, max-image-preview:large, max-snippet:-1, max-video-preview:-1" name="robots"/>
<link href="/static/favicon.ico" rel="icon"/>
<title>Largest 1000 Technology Companies in the World in 2024</title>
<meta content="List of the top companies in the Technology sector in the world ranked by market capitalization" name="description"/>
<link href="https://disfold.com/sector/t

In [138]:
rows = soup.find_all("tr")

In [139]:
rows

[<tr>
 <th class="center-align">Ranking</th>
 <th>Company</th>
 <th class="center-align">Market Cap (USD)</th>
 <th class="center-align">Stock</th>
 <th>Country</th>
 <th>Sector</th>
 <th>Industry</th>
 </tr>,
 <tr>
 <td class="center-align">1</td>
 <td><a href="/company/apple-inc/">Apple Inc.</a></td>
 <td class="center-align">
 <a href="/company/apple-inc/marketcap/">
                                                 $2.866 T
                                             </a>
 </td>
 <td class="center-align">
 <a class="waves-effect waves-light btn-small light-green darken-2" href="/stock/nasdaq-aapl/">
                                             AAPL
                                         </a>
 <a class="waves-effect waves-light btn-small deep-orange" href="/stock/nasdaq-aapl/backtest/">
 <i class="material-icons">wb_incandescent</i>
 </a>
 </td>
 <td>
 <a href="/united-states/companies/">
 <img loading="lazy" src="/static/flags/us.gif"/>
                                           

In [140]:
# creating null dataframe, then one by one we will add rows in it
data = pd.DataFrame({
    'Ranking': [],
    'Company': [],
    'Market Cap': [],
    'Stock': [],
    'Country': [],
    'Sector': [],
    'Industry': []
})
data

Unnamed: 0,Ranking,Company,Market Cap,Stock,Country,Sector,Industry


In [75]:
# here is the code through which we scrape records
for item in rows[1:]:
    td = item.find_all("td")
    if td[0].text == str(a):
        ranking = td[0].text.strip()
        company = td[1].text.strip()
        market_cap = td[2].text.strip()
        stock = td[3].text.strip("\n")
        country = td[4].text.strip()
        sector = td[5].text.strip()
        industry = td[6].text.strip()
        
        new_row = {
            'Ranking': ranking,
            'Company': company,
            'Market Cap': market_cap,
            'Stock': stock,
            'Country': country,
            'Sector': sector,
            'Industry': industry
        }
        # Convert the new row to a DataFrame
        new_row_df = pd.DataFrame([new_row])

        # Concatenate the new row DataFrame with the existing DataFrame
        data = pd.concat([data, new_row_df], ignore_index=True)

In [76]:
data

Unnamed: 0,Ranking,Company,Market Cap,Stock,Country,Sector,Industry
0,1,Apple Inc.,$2.866 T,AA...,United States,Technology,Consumer Electronics
1,2,Microsoft Corporation,$2.755 T,MS...,United States,Technology,Software—Infrastructure
2,3,Nvidia Corporation,$1.186 T,NV...,United States,Technology,Semiconductors
3,4,Broadcom Inc.,$495.95 B,AV...,United States,Technology,Semiconductors
4,5,Taiwan Semiconductor Manufacturing Company Lim...,$487.64 B,23...,Taiwan,Technology,Semiconductors
5,6,"Samsung Electronics Co., Ltd.",$392.38 B,00...,South Korea,Technology,Consumer Electronics
6,7,ASML Holding N.V.,$297.10 B,AS...,Netherlands,Technology,Semiconductor Equipment & Materials
7,8,Oracle Corporation,$282.01 B,OR...,United States,Technology,Software—Infrastructure
8,9,Adobe Inc.,$260.23 B,AD...,United States,Technology,Software—Infrastructure
9,10,"salesforce.com, inc.",$243.78 B,CR...,United States,Technology,Software—Application


# Second Part: Scraping all pages through for loop

In [88]:
n = 1
for a in range(1,21):
    url = f"https://disfold.com/sector/technology/companies/?page={a}"
    page = requests.get(url).text
    soup = BeautifulSoup(page, "lxml")
    rows = soup.find_all("tr")

    for item in rows[1:]:
        td = item.find_all("td")
        if td[0].text == str(n):
            ranking = td[0].text.strip()
            company = td[1].text.strip()
            market_cap = td[2].text.strip()
            stock = td[3].text.strip("\n")
            country = td[4].text.strip()
            sector = td[5].text.strip()
            industry = td[6].text.strip()
            n+=1
            new_row = {
                'Ranking': ranking,
                'Company': company,
                'Market Cap': market_cap,
                'Stock': stock,
                'Country': country,
                'Sector': sector,
                'Industry': industry
            }
            # Convert the new row to a DataFrame
            new_row_df = pd.DataFrame([new_row])

            # Concatenate the new row DataFrame with the existing DataFrame
            data = pd.concat([data, new_row_df], ignore_index=True)


In [89]:
data

Unnamed: 0,Ranking,Company,Market Cap,Stock,Country,Sector,Industry
0,1,Apple Inc.,$2.866 T,AA...,United States,Technology,Consumer Electronics
1,2,Microsoft Corporation,$2.755 T,MS...,United States,Technology,Software—Infrastructure
2,3,Nvidia Corporation,$1.186 T,NV...,United States,Technology,Semiconductors
3,4,Broadcom Inc.,$495.95 B,AV...,United States,Technology,Semiconductors
4,5,Taiwan Semiconductor Manufacturing Company Lim...,$487.64 B,23...,Taiwan,Technology,Semiconductors
...,...,...,...,...,...,...,...
995,996,"Henan Thinker Automatic Equipment Co.,Ltd.",$825.4 M,60...,China,Technology,Scientific & Technical Instruments
996,997,"transcosmos, Inc.",$819.5 M,97...,Japan,Technology,Information Technology Services
997,998,Yeahka Ltd,$819.3 M,99...,China,Technology,Software—Infrastructure
998,999,Beijing Wanji Technology Co. Ltd,$816.1 M,30...,China,Technology,Scientific & Technical Instruments


In [90]:
data.to_csv("Top 1000 technology companies as of Jan 2024.csv", index=False)

In [114]:
data.shape

(1000, 7)

# Third Part: Clean the dataset
There are problem in 2 columns, as they contains extra information and spaces and new lines:
- Market Cap
- Stock

In [116]:
data["Market Cap"] = data["Market Cap"].apply(lambda x: x.split("\n")[0])
data["Stock"] = data["Stock"].apply(lambda x: x.split("\n")[0].strip())

In [143]:
data

Unnamed: 0,Ranking,Company,Market Cap,Stock,Country,Sector,Industry
0,1,Apple Inc.,$2.866 T,AAPL,United States,Technology,Consumer Electronics
1,2,Microsoft Corporation,$2.755 T,MSFT,United States,Technology,Software—Infrastructure
2,3,Nvidia Corporation,$1.186 T,NVDA,United States,Technology,Semiconductors
3,4,Broadcom Inc.,$495.95 B,AVGO,United States,Technology,Semiconductors
4,5,Taiwan Semiconductor Manufacturing Company Lim...,$487.64 B,2330,Taiwan,Technology,Semiconductors
...,...,...,...,...,...,...,...
995,996,"Henan Thinker Automatic Equipment Co.,Ltd.",$825.4 M,603508,China,Technology,Scientific & Technical Instruments
996,997,"transcosmos, Inc.",$819.5 M,9715,Japan,Technology,Information Technology Services
997,998,Yeahka Ltd,$819.3 M,9923,China,Technology,Software—Infrastructure
998,999,Beijing Wanji Technology Co. Ltd,$816.1 M,300552,China,Technology,Scientific & Technical Instruments


In [144]:
# saving dataset into csv file
data3.to_csv("Top 1000 technology companies.csv", index=False)

<p style="color:red;font-weight:900;">End Of Code!</p>