## Farzaneh Shirzadeh

### Data Extraction from a Real Website
### List of Largest Companies in Canada (Wikipedia)

## Summary of the Project:
This Python script scrapes the Wikipedia page listing the largest companies in Canada. It downloads the page with requests, parses the HTML with BeautifulSoup, and extracts the rows of the companies table. The extracted fields (rank, Fortune 500 rank, name, industry, revenue, profits, employees, and headquarters) are loaded into a pandas DataFrame and then saved to a CSV file for further analysis. The project is a basic example of collecting tabular data from a real webpage and converting it into a structured format.

In [3]:
# Importing Libraries
from bs4 import BeautifulSoup  # Import BeautifulSoup to parse the downloaded HTML
import requests  # Import requests to make HTTP requests

In [4]:
# Fetching Data from Wikipedia
url = 'https://en.wikipedia.org/wiki/List_of_largest_companies_in_Canada'
page = requests.get(url)   # Send an HTTP GET request to the URL
soup = BeautifulSoup(page.text, 'html.parser')   # Parse the HTML content with Python's built-in parser
print(soup)  # Display the parsed HTML content

<!DOCTYPE html>
<html class="client-nojs vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-enabled vector-feature-main-menu-pinned-disabled vector-feature-limited-width-enabled vector-feature-limited-width-content-enabled vector-feature-zebra-design-disabled" dir="ltr" lang="en">
<head>
<meta charset="utf-8"/>
<title>List of largest companies in Canada - Wikipedia</title>
<script>document.documentElement.className="client-js vector-feature-language-in-header-enabled vector-feature-language-in-main-page-header-disabled vector-feature-sticky-header-disabled vector-feature-page-tools-pinned-disabled vector-feature-toc-pinned-enabled vector-feature-main-menu-pinned-disabled vector-feature-limited-width-enabled vector-feature-limited-width-content-enabled vector-feature-zebra-design-disabled";(function(){var cookie=document.cookie.match(/(?:^
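
The fetch step above assumes the request succeeds. As a minimal sketch, the download can be wrapped so that HTTP errors surface immediately instead of being parsed as if they were the page. The helper name `fetch_soup`, the `get` parameter, the timeout value, and the User-Agent string are all assumptions made for illustration and are not part of the original notebook.

```python
import requests
from bs4 import BeautifulSoup

URL = "https://en.wikipedia.org/wiki/List_of_largest_companies_in_Canada"


def fetch_soup(url, get=requests.get, timeout=10):
    """Download a page and parse it, raising on HTTP error responses.

    `get` defaults to requests.get; it is a parameter only so the function
    can be exercised without network access (hypothetical design choice).
    """
    # Placeholder User-Agent: some sites reject requests that do not send one
    headers = {"User-Agent": "company-table-scraper/0.1 (learning project)"}
    response = get(url, headers=headers, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
    return BeautifulSoup(response.text, "html.parser")
```

Calling `fetch_soup(URL)` would then return the same kind of `soup` object used in the rest of the notebook, and printing `soup.title` is usually a lighter check than printing the whole page.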

In [5]:
# Finding the Table
soup.find('table', class_='wikitable sortable')  # Find the first table whose class attribute is 'wikitable sortable'

<table class="wikitable sortable">
<tbody><tr>
<th>Rank
</th>
<th>Fortune 500<br/>rank
</th>
<th>Name
</th>
<th>Industry
</th>
<th>Revenue<br/><small>(USD millions)</small>
</th>
<th>Profits<br/><small>(USD millions)</small>
</th>
<th>Employees
</th>
<th>Headquarters
</th></tr>
<tr>
<td>1
</td>
<td>180
</td>
<td><a class="mw-redirect" href="/wiki/Brookfield_Asset_Management" title="Brookfield Asset Management">Brookfield Asset Management</a>
</td>
<td><a href="/wiki/Finance" title="Finance">Finance</a>
</td>
<td style="text-align:center;">56,771
</td>
<td style="text-align:center;">3,584
</td>
<td style="text-align:center;">100,750
</td>
<td><a href="/wiki/Toronto" title="Toronto">Toronto</a>
</td></tr>
<tr>
<td>2
</td>
<td>210
</td>
<td><a href="/wiki/Alimentation_Couche-Tard" title="Alimentation Couche-Tard">Alimentation Couche-Tard</a>
</td>
<td><a href="/wiki/Retail" title="Retail">Retail</a>
</td>
<td style="text-align:center;">51,394
</td>
<td style="text-align:center;">1,673
</t
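
One caveat with the lookup above: a multi-word `class_` string is compared against the whole class attribute, so it stops matching if the table carries any additional class. A possible alternative, sketched here on a small made-up HTML snippet, is a CSS selector that only requires both classes to be present. The extra `jquery-tablesorter` class and the `infobox` table are assumptions added to show the difference.

```python
from bs4 import BeautifulSoup

# Tiny stand-in for the real page; the extra class on the second table is hypothetical
html = """
<table class="infobox"><tr><th>Not this one</th></tr></table>
<table class="wikitable sortable jquery-tablesorter">
  <tr><th>Rank</th><th>Name</th></tr>
  <tr><td>1</td><td>Brookfield Asset Management</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")

# The two-word string must equal the full attribute value, so nothing is found here
exact = soup.find("table", class_="wikitable sortable")

# The CSS selector matches any table that has both classes, in any order
table = soup.select_one("table.wikitable.sortable")

print(exact)
print(table.find("th").get_text(strip=True))
```

Selecting by class also avoids depending on the table's position in the page, which a positional lookup such as `find_all('table')[0]` does.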

In [6]:
table = soup.find_all('table')[0] # Find the first table element in the parsed HTML
print(table)

<table class="wikitable sortable">
<tbody><tr>
<th>Rank
</th>
<th>Fortune 500<br/>rank
</th>
<th>Name
</th>
<th>Industry
</th>
<th>Revenue<br/><small>(USD millions)</small>
</th>
<th>Profits<br/><small>(USD millions)</small>
</th>
<th>Employees
</th>
<th>Headquarters
</th></tr>
<tr>
<td>1
</td>
<td>180
</td>
<td><a class="mw-redirect" href="/wiki/Brookfield_Asset_Management" title="Brookfield Asset Management">Brookfield Asset Management</a>
</td>
<td><a href="/wiki/Finance" title="Finance">Finance</a>
</td>
<td style="text-align:center;">56,771
</td>
<td style="text-align:center;">3,584
</td>
<td style="text-align:center;">100,750
</td>
<td><a href="/wiki/Toronto" title="Toronto">Toronto</a>
</td></tr>
<tr>
<td>2
</td>
<td>210
</td>
<td><a href="/wiki/Alimentation_Couche-Tard" title="Alimentation Couche-Tard">Alimentation Couche-Tard</a>
</td>
<td><a href="/wiki/Retail" title="Retail">Retail</a>
</td>
<td style="text-align:center;">51,394
</td>
<td style="text-align:center;">1,673
</t

In [7]:
# Extracting Table Headers
Companies_titles = table.find_all('th')   # Find all header elements (th) within the table

In [8]:
Companies_table_titles = [title.text.strip() for title in Companies_titles]   # Extract text content and remove leading/trailing whitespace from each header
print(Companies_table_titles)

['Rank', 'Fortune 500rank', 'Name', 'Industry', 'Revenue(USD millions)', 'Profits(USD millions)', 'Employees', 'Headquarters']
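
The headers `'Fortune 500rank'` and `'Revenue(USD millions)'` lose a space because the `<br/>` tag between the two parts contributes no text. A minimal sketch of one way to keep the words apart is `get_text` with a separator, shown here on a shortened copy of the header row from the output above. The variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Shortened copy of the header row printed earlier in the notebook
header_html = """
<table><tr>
<th>Rank
</th>
<th>Fortune 500<br/>rank
</th>
<th>Revenue<br/><small>(USD millions)</small>
</th>
</tr></table>
"""
table = BeautifulSoup(header_html, "html.parser")

# separator=" " joins the text pieces around <br/> with a space;
# strip=True trims each piece and drops pieces that are only whitespace
titles = [th.get_text(separator=" ", strip=True) for th in table.find_all("th")]
print(titles)
```

If this variant were adopted, the DataFrame column names would change accordingly, so any later code that refers to columns by name would need the spaced versions.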


In [9]:
import pandas as pd

In [10]:
# Creating an Empty DataFrame
Companies = pd.DataFrame(columns=Companies_table_titles)  # Empty DataFrame with the scraped headers as columns
Companies

Unnamed: 0,Rank,Fortune 500rank,Name,Industry,Revenue(USD millions),Profits(USD millions),Employees,Headquarters


In [11]:
# Extracting Data Rows and Populating the DataFrame
column_data = table.find_all('tr')   # Find all row elements (tr) within the table

In [12]:
for row in column_data[1:]:  # Iterate over each row, excluding the header row
    row_data = row.find_all('td')  # Find all data cell elements (td) within each row
    individual_row_data = [data.text.strip() for data in row_data]  # Extract text content and remove leading/trailing whitespace from each data cell
    length = len(Companies)  # Get the current length of the DataFrame
    Companies.loc[length] = individual_row_data  # Populate the DataFrame row by row with data from each row in the HTML table
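
The loop above grows the DataFrame one row at a time and raises an error if any row has a different number of cells than there are headers. As a hedged alternative, the rows can be collected in a plain list first and turned into a DataFrame in one step, skipping rows that do not fit. The sample HTML below, including the `colspan` footnote row, is made up for this sketch.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Small stand-in table; the footnote row is hypothetical and only there to show the length check
html = """<table>
<tr><th>Rank</th><th>Name</th><th>Headquarters</th></tr>
<tr><td>1</td><td>Brookfield Asset Management</td><td>Toronto</td></tr>
<tr><td colspan="3">Footnote row (made up for this sketch)</td></tr>
<tr><td>2</td><td>Alimentation Couche-Tard</td><td>Laval</td></tr>
</table>"""
table = BeautifulSoup(html, "html.parser").find("table")
headers = [th.get_text(strip=True) for th in table.find_all("th")]

records = []
for tr in table.find_all("tr")[1:]:  # skip the header row
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if len(cells) == len(headers):  # keep only rows that line up with the headers
        records.append(cells)

# Build the DataFrame once from the collected rows
companies = pd.DataFrame(records, columns=headers)
print(companies)
```

Building the frame once tends to be faster than repeated `.loc` assignment on larger tables, although for a ten-row table the difference is negligible.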

In [13]:
Companies

Unnamed: 0,Rank,Fortune 500rank,Name,Industry,Revenue(USD millions),Profits(USD millions),Employees,Headquarters
0,1,180,Brookfield Asset Management,Finance,56771,3584,100750,Toronto
1,2,210,Alimentation Couche-Tard,Retail,51394,1673,130000,Laval
2,3,256,Royal Bank of Canada,Banking,44609,9635,81870,Toronto
3,4,295,Toronto-Dominion Bank,Banking,41199,8751,84383,Toronto
4,5,299,Magna International,Automotive parts,40827,2296,174000,Aurora
5,6,325,George Weston Limited,Retail,37475,443,197000,Toronto
6,7,331,Power Corporation of Canada,Finance,37112,1033,30000,Montreal
7,8,346,Enbridge,Oil and Gas,35785,2224,12000,Calgary
8,9,398,Scotiabank,Banking,31589,6642,97629,Toronto
9,10,417,Suncor Energy,Oil and Gas,30081,2541,12480,Calgary
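
Every cell scraped with `.text` is a string, and the page writes figures with thousands separators (for example `56,771`), so sums and sorting will not behave numerically until the columns are converted. A minimal sketch of that conversion follows. The two sample rows reuse values from the table above, with the comma formatting of the second employee count assumed.

```python
import pandas as pd

# Two rows as they would arrive from the scraper: all values are strings
companies = pd.DataFrame({
    "Name": ["Brookfield Asset Management", "Alimentation Couche-Tard"],
    "Revenue(USD millions)": ["56,771", "51,394"],
    "Employees": ["100,750", "130,000"],  # "130,000" formatting is an assumption
})

numeric_columns = ["Revenue(USD millions)", "Employees"]
for col in numeric_columns:
    # Remove the thousands separator, then convert the text to numbers
    without_commas = companies[col].str.replace(",", "", regex=False)
    companies[col] = pd.to_numeric(without_commas)

print(companies.dtypes)
print(companies["Revenue(USD millions)"].sum())
```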


In [14]:
# Saving Data to a CSV File
Companies.to_csv(r'C:\Users\shirz\Desktop\new learning\Alex boot camp\python\Canada_Largest_Companies.csv', index=False)  # index=False leaves the row index out of the file
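
The save step above writes to an absolute path on one machine. A more portable variant, sketched below, builds the path with `pathlib` and reads the file back to confirm the round trip. The helper name `save_companies`, the temporary directory, and the two sample rows are assumptions for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_companies(df, directory, filename="Canada_Largest_Companies.csv"):
    """Write df to directory/filename and return the full path (hypothetical helper)."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)  # create the folder if it is missing
    path = directory / filename
    df.to_csv(path, index=False, encoding="utf-8")  # index=False omits the row index column
    return path


# Stand-in data; in the notebook this would be the scraped Companies DataFrame
companies = pd.DataFrame({
    "Rank": [1, 2],
    "Name": ["Brookfield Asset Management", "Alimentation Couche-Tard"],
})

csv_path = save_companies(companies, tempfile.mkdtemp())
restored = pd.read_csv(csv_path)  # read the file back to check what was written
print(restored)
```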