# Scraping the World's Top Companies from Forbes Using Python


![](https://i.imgur.com/KiAL5K7.jpg)

Web scraping is the process of extracting and parsing data from websites in an automated fashion using a computer program. It's a useful technique for creating datasets for research and learning. The source is mostly unstructured HTML, which is converted into structured data and stored in a spreadsheet or a database.

Many websites don't offer a way to save their data for personal use. One option is to copy and paste the data manually, which is both tedious and time-consuming. Web scraping automates this extraction: programs known as web scrapers load pages and extract data from them according to the user's requirements. A scraper can be custom-built for one site or configured to work with many websites.
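
As a minimal sketch of that idea, the snippet below turns a small hand-written HTML fragment into a list of dictionaries with Beautiful Soup. The markup, the `parse_rows` helper, and the column names are hypothetical and chosen only for illustration; real pages are rarely this tidy.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment standing in for a downloaded page
SAMPLE_HTML = """
<table>
  <tr><th>Company</th><th>Country</th></tr>
  <tr><td>Berkshire Hathaway</td><td>United States</td></tr>
  <tr><td>ICBC</td><td>China</td></tr>
</table>
"""

def parse_rows(html):
    """Convert the table rows of an HTML string into a list of dicts."""
    soup = BeautifulSoup(html, "html.parser")
    rows = soup.find_all("tr")
    # The first row holds the column headers
    headers = [th.text.strip() for th in rows[0].find_all("th")]
    records = []
    for row in rows[1:]:
        cells = [td.text.strip() for td in row.find_all("td")]
        records.append(dict(zip(headers, cells)))
    return records

records = parse_rows(SAMPLE_HTML)
print(records)
```

Each dictionary is one structured record, which is the shape a spreadsheet or database row needs.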

# Forbes
![](https://i.imgur.com/EtbdQYu.png)

* Forbes is an American business magazine owned by Integrated Whale Media Investments and the Forbes family. Published eight times a year, it features articles on finance, industry, investing, and marketing topics. Forbes also reports on related subjects such as technology, communications, science, politics, and law. It is based in Jersey City, New Jersey. Competitors in the national business magazine category include Fortune and Bloomberg Businessweek. Forbes has an international edition in Asia as well as editions produced under license in 27 countries and regions worldwide.

* The magazine is well known for its lists and rankings, including the richest Americans (the Forbes 400), America's Wealthiest Celebrities, the world's top companies (the Forbes Global 2000), the World's Most Powerful People, and The World's Billionaires. The motto of Forbes magazine is "Change the World". Its chair and editor-in-chief is Steve Forbes, and its CEO is Mike Federle. In 2014, it was sold to a Hong Kong–based investment group, Integrated Whale Media Investments.

## Objective:
Scrape the world's top companies from the Forbes Global 2000 list by parsing the information on the website into tabular data and saving it in CSV files.

## Project Outline:

1. Understanding the structure of the [Forbes website](https://www.forbes.com/?sh=44b2a1d42254)
2. Installing and importing the required libraries
3. Downloading the page and extracting the URLs of the individual company pages using `BeautifulSoup`
4. Parsing the top companies with 7 separate functions, one for each field: Rank, Name of the Company, Country, Sales, Profit, Assets, and Market Value.
5. Storing the extracted data in a dictionary.
6. Compiling all the data into a DataFrame using `Pandas` and saving it to a `CSV` file.

# Steps To Be Followed

## 1. Use the requests library to download web pages

In [28]:
import requests
from bs4 import BeautifulSoup

In [29]:
topic_url = "https://www.forbes.com/lists/global2000/"

In [30]:
# Use requests.get to download the content of the web page
response = requests.get(topic_url)
# A status code of 200 means the request succeeded
response.status_code
page_contents = response.text
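
Some sites reject requests that lack a browser-like `User-Agent`, or respond slowly, so a slightly more defensive download can help. The sketch below is one possible approach, not necessarily something Forbes requires: the `fetch_page` helper, the header value, and the 10-second timeout are all assumptions.

```python
import requests

# Assumed header value; any descriptive User-Agent string could be used
DEFAULT_HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; forbes-tutorial-scraper)"}

def fetch_page(url, timeout=10, getter=requests.get):
    """Download a page and return its HTML, raising on HTTP errors.

    `getter` defaults to requests.get and is a parameter only so the
    function can be exercised without network access.
    """
    response = getter(url, headers=DEFAULT_HEADERS, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses
    response.raise_for_status()
    return response.text
```

With this helper, the download step would read `page_contents = fetch_page(topic_url)`.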

In [31]:
#Writing the page contents into a html file
with open('top_companies.html', 'w', encoding="utf-8") as f:
    f.write(page_contents)

In [32]:
page_contents[:500]

'<!DOCTYPE html><html lang="en"><head><title>The Global 2000 2022</title><meta charset="utf-8"><meta http-equiv="Content-Language" content="en_US"><link rel="shortcut icon" href="https://i.forbesimg.com/48X48-F.png"><meta name="referrer" content="no-referrer-when-downgrade"><link rel="canonical" itemprop="url" href=""><meta name="viewport" content="width=device-width,initial-scale=1,maximum-scale=5,minimum-scale=1,user-scalable=yes"><meta name="description" itemprop="description" content="Despite'

## 2. Use Beautiful Soup to parse and extract information

In [33]:
# Converting the page to a Beautiful Soup document using html.parser
doc = BeautifulSoup(page_contents, 'html.parser')
a_tags = doc.find_all('a')

In [34]:
a1_tags = doc.find_all('a',{'class' : "table-row active premiumProfile"})

## 3. Extracting Company Details 
![](https://i.imgur.com/EK3ZcsO.png)

In [35]:
# Getting required tags for finding company details
rank_tag = doc.find_all('div',{'class' : "rank first table-cell rank"})

# name_tag for finding  name
name_tag = doc.find_all('div',{'class' : "organizationName second table-cell name" })

# country_tag for finding country
country_tag = doc.find_all('div',{'class' : 'country table-cell country' })

# sales_tag for finding sales
sales_tag = doc.find_all('div',{'class' : 'revenue table-cell sales' })

# profit_tag for finding profit
profit_tag = doc.find_all('div',{'class' : 'profits table-cell profit' })

# asset tag for finding asset
asset_tag = doc.find_all('div',{'class' : 'assets table-cell assets' })

# market_tag for finding market value
market_tag = doc.find_all('div',{'class' : 'marketValue table-cell market value' })

In [36]:
#function to get company url
def get_account_link(doc):
    account_url_tag = doc.find_all('a', {'class': 'table-row active'})
    account_full_url = []
    for tag in account_url_tag:
        account_full_url.append(tag['href'])
    return account_full_url

In [37]:
account_full_url = get_account_link(doc)

In [38]:
#function to get rank
def get_topic_rank(doc):  
    topic_rank_tags = doc.find_all('div', {'class': 'rank first table-cell rank'})
    topic_rank = []
    for tag in topic_rank_tags:
        topic_rank.append(tag.text)
    return topic_rank

#function to get name
def get_topic_name(doc):  
    topic_name_tags = doc.find_all('div', {'class': 'organizationName second table-cell name'})
    topic_name = []
    for tag in topic_name_tags:
        topic_name.append(tag.text)
    return topic_name

#function to get country
def get_topic_country(doc):  
    topic_country_tags = doc.find_all('div', {'class': 'country table-cell country'})
    topic_country = []
    for tag in topic_country_tags:
        topic_country.append(tag.text)
    return topic_country

#function to get sales
def get_topic_sales(doc):  
    topic_sales_tags = doc.find_all('div', {'class': 'revenue table-cell sales'})
    topic_sales = []
    for tag in topic_sales_tags:
        topic_sales.append(tag.text)
    return topic_sales

#function to get profit
def get_topic_profit(doc):  
    topic_profit_tags = doc.find_all('div', {'class': 'profits table-cell profit'})
    topic_profit = []
    for tag in topic_profit_tags:
        topic_profit.append(tag.text)
    return topic_profit

#function to get assets
def get_topic_assets(doc):  
    topic_assets_tags = doc.find_all('div', {'class': 'assets table-cell assets'})
    topic_assets = []
    for tag in topic_assets_tags:
        topic_assets.append(tag.text)
    return topic_assets

#function to get market value
def get_topic_market(doc):  
    topic_market_tags = doc.find_all('div', {'class': 'marketValue table-cell market value'})
    topic_market = []
    for tag in topic_market_tags:
        topic_market.append(tag.text)
    return topic_market
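
The seven functions above differ only in the class string they search for, so they could be collapsed into one helper. The following is a sketch of that refactoring and has not been tested against the live page: `get_column`, `get_all_columns`, `COLUMN_CLASSES`, and the sample HTML are hypothetical, while the class strings are copied from the functions above. Unlike the originals, the helper also strips surrounding whitespace.

```python
from bs4 import BeautifulSoup

# Class strings copied from the functions above, keyed by column name
COLUMN_CLASSES = {
    'Rank': 'rank first table-cell rank',
    'Name of the Company': 'organizationName second table-cell name',
    'Country': 'country table-cell country',
    'Sales': 'revenue table-cell sales',
    'Profit': 'profits table-cell profit',
    'Assets': 'assets table-cell assets',
    'Market Value': 'marketValue table-cell market value',
}

def get_column(doc, class_name):
    """Return the stripped text of every div whose class attribute is class_name."""
    return [tag.text.strip() for tag in doc.find_all('div', {'class': class_name})]

def get_all_columns(doc):
    """Build a dictionary of column name -> list of values."""
    return {name: get_column(doc, cls) for name, cls in COLUMN_CLASSES.items()}

# Tiny hand-written sample (hypothetical) to show the helper in action
sample_html = (
    '<div class="rank first table-cell rank">1.</div>'
    '<div class="organizationName second table-cell name">Berkshire Hathaway</div>'
    '<div class="country table-cell country">United States</div>'
)
sample_doc = BeautifulSoup(sample_html, 'html.parser')
columns = get_all_columns(sample_doc)
print(columns['Rank'], columns['Name of the Company'], columns['Country'])
```

Adding a new column then only needs a new entry in `COLUMN_CLASSES` rather than another function.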

## 4. Storing all the Extracted data into a Dictionary and returning Data Frame

In [39]:
import pandas as pd

topics_dict = {
        'Rank': get_topic_rank(doc),
        'Name of the Company': get_topic_name(doc),
        'Country': get_topic_country(doc),
        'Sales': get_topic_sales(doc),
        'Profit': get_topic_profit(doc),
        'Assets': get_topic_assets(doc),
        'Market Value': get_topic_market(doc),
    }
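
`pd.DataFrame` raises a `ValueError` when the lists in such a dictionary have different lengths, which can happen if one selector misses a few rows. A small check like the one below can make the cause easier to spot before the DataFrame is built. The `check_equal_lengths` helper and the sample dictionary are hypothetical.

```python
def check_equal_lengths(columns):
    """Return a dict of column name -> length, raising if the lengths differ."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError('Column lengths differ: {}'.format(lengths))
    return lengths

# Hypothetical sample in which every column has two entries
sample = {'Rank': ['1.', '2.'], 'Country': ['United States', 'China']}
print(check_equal_lengths(sample))
```

Calling `check_equal_lengths(topics_dict)` before building the DataFrame would report which column came up short.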

In [40]:
# Storing the obtained data in a DataFrame
company_Df = pd.DataFrame(topics_dict)

In [41]:
company_Df

| | Rank | Name of the Company | Country | Sales | Profit | Assets | Market Value |
|---|---|---|---|---|---|---|---|
| 0 | 1. | Berkshire Hathaway | United States | $276.09 B | $89.8 B | $958.78 B | $741.48 B |
| 1 | 2. | ICBC | China | $208.13 B | $54.03 B | $5,518.51 B | $214.43 B |
| 2 | 3. | Saudi Arabian Oil Company (Saudi Aramco) | Saudi Arabia | $400.38 B | $105.36 B | $576.04 B | $2,292.08 B |
| 3 | 4. | JPMorgan Chase | United States | $124.54 B | $42.12 B | $3,954.69 B | $374.45 B |
| 4 | 5. | China Construction Bank | China | $202.07 B | $46.89 B | $4,746.95 B | $181.32 B |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1995 | 1995. | Shenzhen Feima International Supply Chain | China | $37 M | $1.41 B | $166 M | $1.14 B |
| 1996 | 1997. | NMDC | India | $3.52 B | $1.41 B | $5.71 B | $6.4 B |
| 1997 | 1997. | Sichuan Changhong Electric | China | $15.72 B | $53.1 M | $12.11 B | $1.96 B |
| 1998 | 1999. | Satellite Chemical | China | $4.41 B | $931.3 M | $7.64 B | $9.52 B |
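
The monetary columns are scraped as strings such as `$276.09 B` or `$37 M`, so they need converting before any numeric analysis. The sketch below assumes every value follows the `$<number> <M|B|T>` pattern seen in the outputs of this notebook; anything else (for example a negative amount) raises an error. `parse_money` is a hypothetical helper and returns the amount in billions of dollars.

```python
import re

# Multipliers that convert each suffix to billions of dollars
_UNIT_IN_BILLIONS = {'M': 0.001, 'B': 1.0, 'T': 1000.0}
_MONEY_PATTERN = re.compile(r'^\$([\d,]+(?:\.\d+)?)\s*([MBT])$')

def parse_money(text):
    """Convert a string like '$5,518.51 B' to a float in billions of dollars."""
    match = _MONEY_PATTERN.match(text.strip())
    if match is None:
        raise ValueError('Unrecognised money format: {!r}'.format(text))
    number = float(match.group(1).replace(',', ''))
    return number * _UNIT_IN_BILLIONS[match.group(2)]

print(parse_money('$276.09 B'))
print(parse_money('$5,518.51 B'))
```

A column could then be converted with, for instance, `company_Df['Sales'].map(parse_money)`.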


## 5. Creating Folder and Saving CSV file(s) with the extracted information

In [42]:
import os
# Creating folder
os.makedirs('Comapny_Details', exist_ok=True)

In [43]:
# Saving the data in csv files
company_Df.to_csv('Comapny_Details/Top_Companies.csv', index=False)
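
A quick way to confirm the file was written as expected is to read it back. The sketch below uses a temporary folder and a two-row sample rather than the real `Comapny_Details` folder so that it can run on its own. Because the row index is not written, no extra `Unnamed: 0` column appears on reload.

```python
import os
import tempfile

import pandas as pd

# Two rows copied from the scraped output, used as a stand-in for company_Df
sample_df = pd.DataFrame({
    'Rank': ['1.', '2.'],
    'Name of the Company': ['Berkshire Hathaway', 'ICBC'],
    'Country': ['United States', 'China'],
})

with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, 'Top_Companies.csv')
    sample_df.to_csv(path, index=False)
    # dtype=str keeps values such as '1.' as text instead of parsing them as numbers
    reloaded = pd.read_csv(path, dtype=str)

print(reloaded.columns.tolist())
print(reloaded.shape)
```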

## 6. Scraping each company's page for more information

Scraping the first page
![](https://i.imgur.com/2tFioGS.png)

In [44]:
top1_url = account_full_url[0]
top1_url

'https://www.forbes.com/companies/berkshire-hathaway/?list=global2000'

In [45]:
response1 = requests.get(top1_url)
response1.status_code
topic_doc = BeautifulSoup(response1.text, 'html.parser')

In [46]:
h1_tag = topic_doc.find_all('span', {'class': "profile-stats__text"})
def get_comany_info(h1_tag):
    # Returns the required details about a company from its profile-stats tags
    industry = h1_tag[1].text.strip()
    founded = h1_tag[3].text.strip()
    Headquarters = h1_tag[4].text.strip()
    country = h1_tag[6].text.strip()
    Chief_Executive_Officer = h1_tag[8].text.strip()
    Employees = h1_tag[10].text.strip()

    return industry,founded,Headquarters,country,Chief_Executive_Officer,Employees

In [47]:
get_comany_info(h1_tag)

('Diversified Financials',
 '1939',
 'Omaha, Nebraska',
 'United States',
 'Warren Edward Buffett',
 '372,000')
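
Fixed positions such as `[10]` raise an `IndexError` when a company page has fewer `profile-stats__text` spans than expected. One way to guard against that is a small accessor that returns a default value instead. `safe_text` and the sample markup below are hypothetical.

```python
from bs4 import BeautifulSoup

def safe_text(tags, index, default=None):
    """Return the stripped text of tags[index], or default if the index is out of range."""
    if 0 <= index < len(tags):
        return tags[index].text.strip()
    return default

# Hypothetical page with only two matching spans
sample_doc = BeautifulSoup(
    '<span class="profile-stats__text">Industry</span>'
    '<span class="profile-stats__text"> Diversified Financials </span>',
    'html.parser',
)
tags = sample_doc.find_all('span', {'class': 'profile-stats__text'})
print(safe_text(tags, 1))
print(safe_text(tags, 10, default='N/A'))
```

A missing field then shows up as a placeholder in the results instead of stopping the whole run.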

## 7. Getting information on the top 250 companies

In [48]:
account = account_full_url[0:250]

In [49]:
def get_company_page_info(channel_url):
    response = requests.get(channel_url)
    if response.status_code != 200:
        raise Exception('Failed to load page {}'.format(channel_url))
    doc12 = BeautifulSoup(response.text, 'html.parser')
    return doc12

def get_company_info(channel_url):
    doc12 = get_company_page_info(channel_url)
    
    names = doc12.find_all('div', {'class':"listuser-header__name"})
    company  = doc12.find_all('span', {'class': "profile-stats__text"})
    revenue = doc12.find_all('div', {'class':"listuser-financial-item__value"})
    Name = names[0].text.strip()
    Industry = company[1].text.strip()
    Founded = company[3].text.strip()
    Headquarters = company[5].text.strip()
    CEO = company[8].text.strip()
    Revenue = revenue[0].text.strip()
    Assets = revenue[1].text.strip()
    Profits = revenue[2].text.strip()
  
    return Name, Industry, Founded, Headquarters, CEO, Revenue, Assets, Profits 
    
def get_final_dict(account):
    
    company_Dictionary = {
        'Name of Company' : [],
        'Industry' : [],
        'Founded' : [],
        'Headquarters' : [],
        'CEO' : [],
        'Revenue' : [],
        'Assets' : [],
        'Profits' : []
    } 

    for i in range(len(account)):
  
        details = get_company_info(account[i])
        company_Dictionary['Name of Company'].append(details[0])
        company_Dictionary['Industry'].append(details[1])
        company_Dictionary['Founded'].append(details[2])
        company_Dictionary['Headquarters'].append(details[3])
        company_Dictionary['CEO'].append(details[4])
        company_Dictionary['Revenue'].append(details[5])
        company_Dictionary['Assets'].append(details[6])
        company_Dictionary['Profits'].append(details[7])
          
    return pd.DataFrame(company_Dictionary)        
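
The loop above stops at the first page that fails to load or parse, and it sends its requests back to back. A more forgiving variant might pause between requests and skip failures. The sketch below is one such variant under stated assumptions: `scrape_all`, its `delay` parameter, and the `fetch_one` callback (which would be `get_company_info` here) are hypothetical names.

```python
import time

def scrape_all(urls, fetch_one, delay=1.0):
    """Call fetch_one(url) for each URL, pausing between calls and collecting failures."""
    results, failed = [], []
    for position, url in enumerate(urls):
        if position > 0:
            time.sleep(delay)  # be polite: wait before the next request
        try:
            results.append(fetch_one(url))
        except Exception as error:  # broad on purpose: network and parsing errors alike
            failed.append((url, str(error)))
    return results, failed

# Stand-in for get_company_info so the example runs without network access
def fake_fetch(url):
    if 'broken' in url:
        raise IndexError('missing field')
    return url.upper()

results, failed = scrape_all(['a', 'broken', 'b'], fake_fetch, delay=0)
print(results)
print(failed)
```

The `failed` list records which URLs to retry or inspect by hand.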

In [50]:
%%time
account_DFrame = get_final_dict(account)
account_DFrame

CPU times: user 29.6 s, sys: 294 ms, total: 29.9 s
Wall time: 47.2 s


| | Name of Company | Industry | Founded | Headquarters | CEO | Revenue | Assets | Profits |
|---|---|---|---|---|---|---|---|---|
| 0 | Berkshire Hathaway | Diversified Financials | 1939 | Omaha, Nebraska | Warren Edward Buffett | $276.1B | $958.8B | $89.8B |
| 1 | ICBC | Banking | 2011 | Beijing | Shu Gu | $208.1B | $5.5T | $54B |
| 2 | Saudi Arabian Oil Company (Saudi Aramco) | Oil & Gas Operations | 1933 | Dhahran | Amin bin Hasan Al-Nasser | $400.4B | $576B | $105.4B |
| 3 | JPMorgan Chase | Banking and Financial Services | 2000 | New York, New York | Jamie Dimon | $124.5B | $4T | $42.1B |
| 4 | China Construction Bank | Banking | 2014 | Beijing | Wang Zuji | $202.1B | $4.7T | $46.9B |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 245 | Humana | Insurance | 1961 | Louisville, Kentucky | Bruce D. Broussard | $84.1B | $44.7B | $2.9B |
| 246 | General Dynamics | Aerospace & Defense | 1952 | Reston, Virginia | Phebe N. Novakovic | $38.5B | $50.1B | $3.3B |
| 247 | Power Corp of Canada | Diversified Financials | 1925 | Montréal | Robert Jeffrey Orr | $57.9B | $507.1B | $2.4B |
| 248 | Qatar National Bank | Banking | 1964 | Doha | Abdulla Mubarak Nasser Al-Khalifa | $14.2B | $304.4B | $3.4B |


In [51]:
account_DFrame.to_csv('Comapny_Details/Top_250_Company_details.csv', index=False)

# Summary


- The scraping was done using Python libraries such as Requests and Beautiful Soup to extract the data.
- Scraped the top companies from the Forbes website, using 7 separate functions to extract Rank, Name of the Company, Country, Sales, Profit, Assets, and Market Value.
- Saved the scraped data in a folder with 2 CSV files: one with 2000 rows and 7 columns, and another with 250 rows and 8 columns.

# Future work
- Extracting more details about each company by accessing the individual company links
- We can now explore this data further to extract meaningful information from it.
- With these insights and further analysis of the data, we can answer questions such as which companies will be at the top for the week, the month, and so on.
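
As a small illustration of such analysis, the sketch below counts companies per country. It uses the first five rows of the scraped table rather than the full CSV so that it is self-contained; on the real data one would load `Comapny_Details/Top_Companies.csv` instead.

```python
import pandas as pd

# First five rows of the scraped table, copied from the output shown earlier
sample_df = pd.DataFrame({
    'Name of the Company': [
        'Berkshire Hathaway',
        'ICBC',
        'Saudi Arabian Oil Company (Saudi Aramco)',
        'JPMorgan Chase',
        'China Construction Bank',
    ],
    'Country': ['United States', 'China', 'Saudi Arabia', 'United States', 'China'],
})

# Number of listed companies per country
country_counts = sample_df['Country'].value_counts()
print(country_counts)
```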

# References
(1) Python official documentation. https://docs.python.org/3/

(2) Requests library. https://pypi.org/project/requests/

(3) Beautiful Soup documentation. https://www.crummy.com/software/BeautifulSoup/bs4/doc/

(4) Pandas library documentation. https://pandas.pydata.org/docs/

(5) Forbes website. https://www.forbes.com/?sh=44b2a1d42254
