# Visualization on The Global 2000 Largest Companies 2024 from Forbes

## 1. Project description

### Objective statement and goals for working with this dataset
* Analyze company performance: Investigate which sectors are most profitable, which companies lead in revenue vs. profit, or how market value correlates with other financial metrics.
* Compare across regions/countries: Explore patterns by geographic location, for example whether certain regions dominate the list of largest companies (e.g., US vs. China).
* Visualize financial metrics: Create visualizations comparing revenue, profits, and market value across different industries or countries.
* Explore relationships: Explore correlations between factors like revenue and profit, market value and employees, etc.



## 2. Data description


### Data source statement
* The dataset comes from Kaggle and can be downloaded [here](https://www.kaggle.com/datasets/mohammadgharaei77/largest-2000-global-companies/data).
* The original Forbes list is available for reference at https://www.forbes.com/lists/global2000/
  

### **Key Attributes**

| Attribute      | Description                                                                 | Type                        | Role                                                                                   | Example                                    |
|:----------------|:-----------------------------------------------------------------------------|:-----------------------------|:----------------------------------------------------------------------------------------|:--------------------------------------------|
| Company Name   | The name of the company.                                                     | Categorical (String)        | Identifies the company in the dataset.                                                 | "Apple", "Microsoft", "Amazon"            |
| Industry       | The sector or industry in which the company operates.                       | Categorical (String)        | Allows analysis of how different industries are performing relative to each other.     | "Technology", "Healthcare", "Financials"   |
| Market Value   | The total market value (or capitalization) of the company, in billions USD. | Numerical (Float or Integer) | Indicates the company’s size and financial strength.                                   | 2,458.7 (in billions of dollars for Apple) |
| Sales        | The total income generated by the company, typically in billions USD.      | Numerical (Float or Integer) | Provides insight into the company's business scale and operations.                     | 274.5 (billions of dollars for Apple)      |
| Profit         | The net profit of the company after expenses, taxes, etc., in billions USD.| Numerical (Float or Integer) | Reflects the company's financial health and operational efficiency.                    | 57.4 (billions of dollars for Apple)       |
| Assets         | The total value of everything the company owns, in billions USD.           | Numerical (Float or Integer) | Indicates the financial resources the company has for generating revenue.              | 380.7 (billions of dollars for Apple)      |
| Employees      | The total number of employees working at the company.                      | Numerical (Integer)         | Provides insight into the scale of operations and workforce size.                      | 147,000 employees (Apple)                  |
| Country/Region | The country or region where the company is based or operates.              | Categorical (String)        | Indicates the geographical location of the company.                                    | "United States", "China", "Germany"       |




### Import data and get some basic understanding
* Importing dataset

In [1]:
import altair as alt
import pandas as pd
import numpy as np

# Load the dataset with a specified encoding
file_path = "../data/Largest-Companies.csv"
df = pd.read_csv(file_path, encoding='ISO-8859-1')


* Print the first 5 rows of the data

In [2]:
df.head()

|   | Rank | Name | Sales | Profit | Assets | Market Value | Industry | Founded | Headquarters | Country | CEO | Employees |
|:--|:-----|:-----|:------|:-------|:-------|:-------------|:---------|:--------|:-------------|:--------|:----|:----------|
| 0 | 1 | JPMorgan Chase | 252.9 | 50.0 | 4090.7 | 588.1 | Banking and Financial Services | 2000.0 | New York- New York | United States | Jamie Dimon | 186751.0 |
| 1 | 2 | Berkshire Hathaway | 369.0 | 73.4 | 1070.0 | 899.1 | Conglomerate | 1839.0 | Omaha- Nebraska | United States | Warren Edward Buffett | 396500.0 |
| 2 | 3 | Saudi Arabian Oil Company (Saudi Aramco) | 489.1 | 116.9 | 661.5 | 1919.3 | Construction- Chemicals- Raw Materials | 1933.0 | Dhahran | Saudi Arabia | Amin bin Hasan Al-Nasser | 70000.0 |
| 3 | 4 | ICBC | 223.8 | 50.4 | 6586.0 | 215.2 | Banking and Financial Services | 1984.0 | Beijing | China | Wang Jingwu | 427587.0 |
| 4 | 5 | Bank of America | 183.3 | 25.0 | 3273.8 | 307.3 | Banking and Financial Services | 1904.0 | Charlotte- North Carolina | United States | Brian T. Moynihan | 166140.0 |


* Get basic information about this dataset

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2001 entries, 0 to 2000
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Rank          2001 non-null   int64  
 1   Name          2001 non-null   object 
 2   Sales         2001 non-null   float64
 3   Profit        2001 non-null   float64
 4   Assets        2001 non-null   float64
 5   Market Value  2001 non-null   float64
 6   Industry      1999 non-null   object 
 7   Founded       1999 non-null   float64
 8   Headquarters  1991 non-null   object 
 9   Country       2001 non-null   object 
 10  CEO           1970 non-null   object 
 11  Employees     1943 non-null   float64
dtypes: float64(6), int64(1), object(5)
memory usage: 187.7+ KB


* Check null values

In [4]:
df.isnull().sum()

Rank             0
Name             0
Sales            0
Profit           0
Assets           0
Market Value     0
Industry         2
Founded          2
Headquarters    10
Country          0
CEO             31
Employees       58
dtype: int64
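
The counts above suggest that the financial columns are complete, while `Industry`, `Headquarters`, `CEO`, and `Employees` have gaps. One possible way to handle this before plotting is sketched below on a small made-up frame. The rows and the names `toy`, `missing_pct`, and `clean` are illustrative assumptions, not part of the original notebook. The idea is to drop rows only for the chart that needs the missing column, rather than dropping them globally.

```python
import pandas as pd

# Toy stand-in for df; all values are invented for illustration only.
toy = pd.DataFrame({
    "Name": ["A Corp", "B Ltd", "C Inc", "D SA"],
    "Profit": [5.0, 2.5, 1.0, 0.4],
    "Industry": ["Banking", None, "Retail", "Retail"],
    "Employees": [1000.0, 250.0, None, 80.0],
})

# Share of missing values per column, as a percentage.
missing_pct = toy.isnull().mean().mul(100).round(1)
print(missing_pct)

# Drop rows only where the column a given chart encodes is missing
# (here: a chart colored by Industry).
clean = toy.dropna(subset=["Industry"])
print(len(toy), "->", len(clean))
```

A chart that does not encode `Industry` could keep using the full frame, so no financial data is lost unnecessarily.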

### Existing visualization working with the same dataset
* https://www.kaggle.com/code/mohammadgharaei77/forbes-global-2000-largest-companies
* https://www.kaggle.com/code/devraai/unveiling-secrets-of-the-worlds-largest-companies

## 3. Task description

The tasks below are derived from the goals stated above for the **Forbes Global 2000 Largest Companies 2024** dataset.



### **Task 1: Analyzing Company Performance**

#### **Why is this task pursued? (Goal)**
The goal is to investigate which sectors are most profitable, which companies lead in revenue versus profit, and how market value correlates with other financial metrics. This analysis will provide valuable insights into the global business landscape and help identify the strongest companies across different financial measures.

#### **How is this task conducted? (Means)**
- **Data Processing**: Filter and aggregate data based on financial metrics such as `Market Value`, `Sales`, and `Profit`. For comparison, we can group by `Industry` or `Country`.
- **Visual Analysis**: Use visualizations (e.g., bar charts, scatter plots) to compare companies’ performance by sector and identify patterns.
- **Statistical Analysis**: Correlate `Sales`, `Profit`, and `Market Value` to understand relationships.
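
As a rough illustration of the data-processing step, the sketch below aggregates a made-up frame by `Industry` and derives a profit margin. The margin column (`margin_pct`) is an assumption about how "most profitable" could be measured. The original notebook does not compute it, and the toy values are invented.

```python
import pandas as pd

# Toy stand-in for df; values are invented for illustration only.
toy = pd.DataFrame({
    "Industry": ["Banking", "Banking", "Retail", "Retail"],
    "Sales": [100.0, 50.0, 200.0, 100.0],
    "Profit": [20.0, 10.0, 6.0, 3.0],
})

# Sum the financial columns per industry, then derive profit as a share of sales.
by_industry = (
    toy.groupby("Industry")[["Sales", "Profit"]]
    .sum()
    .assign(margin_pct=lambda t: t["Profit"] / t["Sales"] * 100)
    .sort_values("margin_pct", ascending=False)
)
print(by_industry)
```

Ranking by total profit instead of margin would only require changing the column passed to `sort_values`.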

#### **What does this task seek to learn about the data? (Characteristics)**
- Which companies and industry sectors have the highest revenue, profit, and market value.
- The relationship between revenue, profit, and market value for companies.
- The leading companies in each financial metric.
  
#### **Where does this task operate? (Target Data)**
- **Target Data**: The dataset contains financial attributes like `Sales`, `Profit`, `Market Value`, and `Industry`. The goal is to focus on these dimensions, particularly looking at them in the context of company performance.

#### **When is this task performed? (Workflow)**
- **Workflow**: After importing and cleaning the dataset, this task is typically performed during the exploratory data analysis (EDA) phase. It involves inspecting and summarizing the key attributes related to financial performance.

#### **Who is executing this task? (Roles)**
- **Roles**: The visualization designer (me) will conduct this analysis. The task is also relevant for stakeholders interested in financial performance, such as business analysts or investors.





### **Task 1: Visualizations**

#### **Scatter Plot: Sales vs. Profit vs. Market Value**
In this scatter plot, users can find and compare all three financial variables: sales, profit, and market value.
* Profit is mapped to the Y axis.
* Sales is mapped to the X axis.
* Industry is mapped to the color of the circles.
* Market value is mapped to the size of the circles.

Users can easily find the top companies for each measure (sales, profit, or market value) through the position and size of the circles, because position and size are suitable and expressive channels for these quantitative variables.

In [5]:

scatter_plot = alt.Chart(df).mark_point().encode(
    x='Sales:Q',  # Quantitative axis for Sales
    y='Profit:Q',  # Quantitative axis for Profit
    size='Market Value:Q',  # Size of the point based on Market Value
    color='Industry:N',  # Color points by Industry
    tooltip=['Name:N', 'Sales:Q', 'Profit:Q', 'Market Value:Q', 'Industry:N']  # Tooltip details
).properties(
    width=600,
    height=400
)

scatter_plot.show()


#### **Bar charts to show top companies in each financial variable**
* Here is another presentation that ranks the top companies for each of the financial variables.
* There are 3 separate bar charts, and the companies are sorted within each one.
* Users can therefore easily find which companies lead in revenue, profit, or market value.

In [31]:
import altair as alt

# Market Value chart
bar_market_value = alt.Chart(df).mark_bar().encode(
    x=alt.X('Name:N', sort='-y'),
    y='Market Value:Q',
    color='Industry:N',
    tooltip=['Name:N', 'Market Value:Q', 'Industry:N']
).transform_window(
    rank='rank()',  # Rank the companies by Market Value
    sort=[alt.SortField('Market Value', order='descending')]
).transform_filter(
    alt.datum.rank <= 10  # Show top 10 companies by Market Value
).properties(
    width=800,
    height=400
)


# Show the Market Value chart
# bar_market_value.show()

# Sales chart
bar_sales = alt.Chart(df).mark_bar().encode(
    x=alt.X('Name:N', sort='-y'),
    y='Sales:Q',
    color='Industry:N',
    tooltip=['Name:N', 'Sales:Q', 'Industry:N']
).transform_window(
    rank='rank()',  # Rank the companies by Sales
    sort=[alt.SortField('Sales', order='descending')]
).transform_filter(
    alt.datum.rank <= 10  # Show top 10 companies by Sales
).properties(
    width=800,
    height=400
)

# Show the Sales chart
#bar_sales.show()

# Profit chart
bar_profit = alt.Chart(df).mark_bar().encode(
    x=alt.X('Name:N', sort='-y'),
    y='Profit:Q',
    color='Industry:N',
    tooltip=['Name:N', 'Profit:Q', 'Industry:N']
).transform_window(
    rank='rank()',  # Rank the companies by Profit
    sort=[alt.SortField('Profit', order='descending')]
).transform_filter(
    alt.datum.rank <= 10  # Show top 10 companies by Profit
).properties(
    width=800,
    height=400
)

# Stack the three charts vertically
combined_chart = alt.vconcat(
    bar_market_value,
    bar_sales,
    bar_profit
).resolve_scale(
    color='independent'  # Ensure color scales are independent across the charts
)

combined_chart.show()
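
The window-rank filter used in the charts above can be cross-checked in pandas. The sketch below is a minimal example on invented data. The helper name `top_n` is hypothetical and not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for df; values are invented for illustration only.
toy = pd.DataFrame({
    "Name": ["A", "B", "C", "D", "E"],
    "Market Value": [50.0, 300.0, 120.0, 80.0, 210.0],
})

def top_n(frame, metric, n=10):
    """Return the n rows with the largest value of `metric`, largest first."""
    return frame.nlargest(n, metric)[["Name", metric]].reset_index(drop=True)

top3 = top_n(toy, "Market Value", n=3)
print(top3)
```

One difference to keep in mind: `nlargest` returns exactly `n` rows by default, whereas a `rank()` window assigns tied values the same rank, so the chart filter may keep more than 10 rows when values tie at the cutoff.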

#### **Correlation matrix for the financial variables**

In [30]:
import altair as alt
import pandas as pd

# Assuming df is already loaded and contains the necessary columns

# Create a correlation matrix for the selected columns
corr_matrix = df[['Sales', 'Profit', 'Market Value']].corr()

# Melt the correlation matrix for visualization
corr_df = corr_matrix.reset_index().melt(id_vars="index")
corr_df['value'] = corr_df['value'].round(2)

# Create the heatmap with text annotations
heatmap = alt.Chart(corr_df).mark_rect().encode(
    x='index:N',
    y='variable:N',
    color='value:Q',
    tooltip=['index:N', 'variable:N', 'value:Q']
).properties(
    width=300,
    height=300
)

# Add text annotations to the heatmap
heatmap_with_text = heatmap + alt.Chart(corr_df).mark_text(
    align='center',
    baseline='middle',
    fontSize=14,
    fontWeight='bold',
    color='white'
).encode(
    x='index:N',
    y='variable:N',
    text='value:Q'
)

heatmap_with_text.show()
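
The heatmap needs the correlation matrix in long form, with one row per pair of variables. The sketch below shows that reshaping step on invented data. The column names `metric_x`, `metric_y`, and `r` are illustrative choices that replace the default `index`, `variable`, and `value` names used in the cell above.

```python
import pandas as pd

# Toy stand-in for df; values are invented so the correlations are easy to read.
toy = pd.DataFrame({
    "Sales": [1.0, 2.0, 3.0, 4.0],
    "Profit": [2.0, 4.0, 6.0, 8.0],        # rises with Sales
    "Market Value": [4.0, 3.0, 2.0, 1.0],  # falls as Sales rises
})

# Wide 3x3 matrix -> long table with one row per (metric_x, metric_y) pair.
corr_long = (
    toy.corr()
    .rename_axis("metric_x")
    .reset_index()
    .melt(id_vars="metric_x", var_name="metric_y", value_name="r")
)
print(corr_long)
```

Explicit column names make the Altair encodings easier to read, for example `x='metric_x:N'` instead of `x='index:N'`.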

### **Task 2: Comparing Companies Across Regions/Countries**

#### **Why is this task pursued? (Goal)**
The goal is to explore geographic patterns in the largest companies, particularly to see how certain regions (like the US vs. China) dominate the list. This task can reveal regional economic strength and help identify global business hubs.

#### **How is this task conducted? (Means)**
- **Data Processing**: Group the companies by their `Country` to compare financial performance across geographies.
- **Visual Analysis**: Create visualizations such as choropleth maps, bar charts, and pie charts that compare the number of top companies or financial metrics across countries.
- **Statistical Analysis**: Perform regional breakdowns to understand whether certain regions dominate in market value or revenue.

#### **What does this task seek to learn about the data? (Characteristics)**
- The leading countries or regions in terms of total market value, revenue, and number of companies in the Global 2000.
- How financial metrics differ by country or region.
  
#### **Where does this task operate? (Target Data)**
- **Target Data**: The `Country` column, along with financial attributes like `Market Value`, `Sales`, and `Profit`, are the main dimensions of focus.

#### **When is this task performed? (Workflow)**
- **Workflow**: This task would follow the initial data cleaning and exploration phase, after which you’ll aggregate data by country and visualize geographic trends.

#### **Who is executing this task? (Roles)**
- **Roles**: Analysts or business strategists who are interested in regional market performance, global economic trends, and comparative analysis.



### **Task 2: Visualizations**

#### **Pie chart and bar chart for top countries by Market Value, Sales, or Profit**
* Here we group the companies by their `Country` to compare financial performance.
* Users are very interested in the economic strength of countries.
* So we rank the top countries for each financial variable in a bar chart.
* Additionally, a pie chart helps users see how dominant the top countries are as a share of the total.
* Although the United States and China are clearly the two strongest economies in this dataset, there is still a large gap between them.
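
The grouping and percentage step described above is repeated for each metric in the following cells. A small helper could compute it once per metric, as sketched below on invented data. The function name `country_share` is hypothetical and not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for df; values are invented for illustration only.
toy = pd.DataFrame({
    "Country": ["X", "X", "Y", "Z"],
    "Market Value": [60.0, 20.0, 15.0, 5.0],
    "Sales": [30.0, 10.0, 40.0, 20.0],
})

def country_share(frame, metric):
    """Total `metric` per country plus its share of the overall total, largest first."""
    totals = frame.groupby("Country", as_index=False)[metric].sum()
    totals["percentage"] = totals[metric] / totals[metric].sum() * 100
    return totals.sort_values(metric, ascending=False).reset_index(drop=True)

mv_share = country_share(toy, "Market Value")
print(mv_share)
```

Returning a new frame per metric avoids overwriting a shared `percentage` column between cells, which makes each chart independent of the cell execution order.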

In [10]:
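# Select several columns with a list (double brackets); tuple-style selection fails in recent pandas
agg_df = df.groupby('Country')[['Market Value', 'Sales', 'Profit']].sum().reset_index()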
# agg_df.head()

# Create the percentage column
agg_df['percentage'] = (agg_df['Market Value'] / agg_df['Market Value'].sum()) * 100

# Sort the data in descending order by Market Value
agg_df_sorted = agg_df.sort_values(by='Market Value', ascending=False)

# Create the pie chart of Market Value by country
pie_chart = alt.Chart(agg_df_sorted).mark_arc().encode(
    theta=alt.Theta(field="Market Value", type="quantitative"),
    color=alt.Color('Country:N', scale=alt.Scale(scheme='category20')),  # Color by Country
    tooltip=['Country:N', 'Market Value:Q', 'percentage:Q'],  # Tooltip with country, market value, and percentage
).properties(
    width=400,
    height=400,
    title="Top Countries by Market Value Distribution"
)

# Create the bar chart (country list) sorted by Market Value
bar_chart = alt.Chart(agg_df_sorted).mark_bar().encode(
    x='Market Value:Q',  # Market Value on x-axis
    y=alt.Y('Country:N', sort='-x'),  # Sort countries by Market Value in descending order
    color=alt.Color('Country:N', scale=alt.Scale(scheme='category20')),  # Color by country
    tooltip=['Country:N', 'Market Value:Q', 'percentage:Q']  # Tooltip with country, market value, and percentage
).properties(
    width=400,
    height=400,
    title="Top Countries by Market Value"
)

# Combine both pie chart and bar chart in a side-by-side layout
pie_chart | bar_chart



In [11]:

# Create the percentage column
agg_df['percentage'] = (agg_df['Sales'] / agg_df['Sales'].sum()) * 100

# Sort the data in descending order by Sales
agg_df_sorted = agg_df.sort_values(by='Sales', ascending=False)

# Create the pie chart sorted by sales (descending order)
pie_chart = alt.Chart(agg_df_sorted).mark_arc().encode(
    theta=alt.Theta(field="Sales", type="quantitative"),
    color=alt.Color('Country:N', scale=alt.Scale(scheme='category20')),  # Color by Country
    tooltip=['Country:N', 'Sales:Q', 'percentage:Q'],  # Tooltip with country, sales, and percentage
).properties(
    width=400,
    height=400,
    title="Top Countries by Sales Distribution"
)

# Create the bar chart (country list) sorted by sales
bar_chart = alt.Chart(agg_df_sorted).mark_bar().encode(
    x='Sales:Q',  # Sales on x-axis
    y=alt.Y('Country:N', sort='-x'),  # Sort countries by sales in descending order
    color=alt.Color('Country:N', scale=alt.Scale(scheme='category20')),  # Color by country
    tooltip=['Country:N', 'Sales:Q', 'percentage:Q']  # Tooltip with country, sales, and percentage
).properties(
    width=400,
    height=400,
    title="Top Countries by Sales"
)

# Combine both pie chart and bar chart in a side-by-side layout
pie_chart | bar_chart


In [12]:
# Create the percentage column
agg_df['percentage'] = (agg_df['Profit'] / agg_df['Profit'].sum()) * 100

# Sort the data in descending order by Profit
agg_df_sorted = agg_df.sort_values(by='Profit', ascending=False)

# Create the pie chart of Profit by country
pie_chart = alt.Chart(agg_df_sorted).mark_arc().encode(
    theta=alt.Theta(field="Profit", type="quantitative"),
    color=alt.Color('Country:N', scale=alt.Scale(scheme='category20')),  # Color by Country
    tooltip=['Country:N', 'Profit:Q', 'percentage:Q'],  # Tooltip with country, profit, and percentage
).properties(
    width=400,
    height=400,
    title="Top Countries by Profit Distribution"
)

# Create the bar chart (country list) sorted by Profit
bar_chart = alt.Chart(agg_df_sorted).mark_bar().encode(
    x='Profit:Q',  # Profit on x-axis
    y=alt.Y('Country:N', sort='-x'),  # Sort countries by Profit in descending order
    color=alt.Color('Country:N', scale=alt.Scale(scheme='category20')),  # Color by country
    tooltip=['Country:N', 'Profit:Q', 'percentage:Q']  # Tooltip with country, profit, and percentage
).properties(
    width=400,
    height=400,
    title="Top Countries by Profit"
)

# Combine both pie chart and bar chart in a side-by-side layout
pie_chart | bar_chart

#### **Maps for Profit, Sales, and Market Value by country**
* Here I generated 3 world maps and placed the previously aggregated financial data for each country on them.
* Compared with the previous pie charts and bar charts, this form of visualization is more expressive about the relationship between economic power and geographic information. It is a more intuitive and effective way to show users where powerful economies are located around the world.
* At the same time, regional economic comparisons that go beyond a single country can also be made.

In [35]:
import geopandas as gpd
import altair as alt

# Load the GeoJSON file from a local path (after downloading it)
world = gpd.read_file('../data/countries.geojson')

# Make sure to only keep relevant columns (i.e., Country and geometry)
world = world[['name', 'geometry']]

# Check your `agg_df` and ensure that 'Country' column matches the 'name' column in the world map
merged_df = world.merge(agg_df, left_on='name', right_on='Country', how='left')

# Check if the merge worked properly
merged_df.head()

# Map for Profit, color based on Profit
profit_map = alt.Chart(merged_df).mark_geoshape().encode(
    color='Profit:Q',  # Use Profit for color encoding
    tooltip=['name:N', 'Profit:Q']  # Show country name and profit value in tooltip
).properties(
    title='Global Profit Distribution by Country',
    width=600,
    height=400
)

# Map for Sales, color based on Sales
sales_map = alt.Chart(merged_df).mark_geoshape().encode(
    color='Sales:Q',  # Use Sales for color encoding
    tooltip=['name:N', 'Sales:Q']  # Show country name and sales value in tooltip
).properties(
    title='Global Sales Distribution by Country',
    width=600,
    height=400
)

# Map for Market Value, color based on Market Value
market_value_map = alt.Chart(merged_df).mark_geoshape().encode(
    color='Market Value:Q',  # Use Market Value for color encoding
    tooltip=['name:N', 'Market Value:Q']  # Show country name and market value in tooltip
).properties(
    title='Global Market Value Distribution by Country',
    width=600,
    height=400
)

# Show the maps
profit_map.show()
sales_map.show()
market_value_map.show()
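
A left merge on country names silently leaves a country blank on the map when the two sources spell its name differently. The sketch below shows one way to detect such mismatches with the `indicator` option of `merge`. The map names and the `name_fixes` table are assumptions made for illustration. The actual spellings in `countries.geojson` are not known from the notebook and would need to be inspected.

```python
import pandas as pd

# Hypothetical spellings; the real GeoJSON names may differ.
map_names = pd.DataFrame({"name": ["United States of America", "China", "Germany"]})
agg = pd.DataFrame({
    "Country": ["United States", "China", "Germany"],
    "Profit": [10.0, 6.0, 2.0],  # invented values
})

# indicator=True adds a `_merge` column that flags rows with no partner on the map side.
check = agg.merge(map_names, left_on="Country", right_on="name",
                  how="left", indicator=True)
unmatched = check.loc[check["_merge"] == "left_only", "Country"].tolist()
print(unmatched)

# Hypothetical rename table, applied before the real merge with the map data.
name_fixes = {"United States": "United States of America"}
agg["Country"] = agg["Country"].replace(name_fixes)
```

Running the same check again after the rename should produce an empty list if every country in the aggregated data has a matching shape.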


## 4. Evaluation and design methodology description


#### Design Method
* For the design method, I typically use both the Five Design Sheets and the Design Studies approach.
* First, I define the project objectives and the goals of visualizing the selected dataset. I then outline the tasks needed to achieve these goals.
* Next, I create low-fidelity visualizations, iterating on them based on feedback to refine and improve them step by step.
* Finally, I write a conclusion and provide additional explanations to help users better understand the entire project and the specific visualizations.



#### Evaluation Description
* For the evaluation, I use both Insight-based evaluation and Qualitative evaluation methods.
* As the designer of this report and the visualizations it contains, I continuously strive to uncover deeper insights from the visualizations I have designed and created during the iteration process.
* I did not receive feedback from human reviewers. Instead, I used ChatGPT to gather feedback on these visualizations and then improved them based on its suggestions.





## 5. Synthesis of Findings
**What Worked Well:**
   - **Clear Objectives and Iteration Process**: Following the weekly assignment instructions, I defined the objectives first and then iterated continuously to refine the visualizations.
   - **Use of Insight-Based and Qualitative Evaluation**: Combining both types of evaluation helped ensure that the visualizations were not only data-driven but also meaningful and intuitive for the user.

**What Could Be Refined in Future Iterations:**
   - **Human Feedback**: While I used ChatGPT for feedback, incorporating feedback from real users (target audience) would be beneficial. Testing with actual users will provide more actionable insights into the visualizations' usability and effectiveness.
   - **Data Representation in Region Comparison**: The country-level comparison makes the United States and China very prominent and downplays the economic strength of other countries. This is most visible on the world maps, where values are encoded as shades of color and every country other than the United States and China appears dim. A comparison between regions such as Asia, Europe, and North America would likely give a more realistic picture.
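
The region-level comparison suggested above could start from a simple lookup table, as in the sketch below. The `region_of` mapping is a hypothetical, incomplete example and the values are invented. A real version would need a full country-to-region table covering every country in the dataset.

```python
import pandas as pd

# Hypothetical, incomplete country-to-region lookup for illustration only.
region_of = {
    "United States": "North America", "Canada": "North America",
    "China": "Asia", "Japan": "Asia",
    "Germany": "Europe", "France": "Europe",
}

# Toy stand-in for the per-country totals; values are invented.
toy = pd.DataFrame({
    "Country": ["United States", "Canada", "China", "Japan", "Germany", "France", "Brazil"],
    "Market Value": [50.0, 5.0, 20.0, 10.0, 8.0, 7.0, 3.0],
})

# Countries missing from the lookup fall into an explicit "Other" bucket.
toy["Region"] = toy["Country"].map(region_of).fillna("Other")
by_region = toy.groupby("Region")["Market Value"].sum().sort_values(ascending=False)
print(by_region)
```

Keeping an explicit "Other" bucket makes gaps in the lookup visible instead of silently dropping those countries from the totals.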

