# Data Analysis with Python using Forbes 2022 Dataset

## Loading Data

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv("Data/forbes_2022_billionaires.csv")

In [3]:
df.head()

,rank,personName,age,finalWorth,year,month,category,source,country,state,...,organization,selfMade,gender,birthDate,title,philanthropyScore,residenceMsa,numberOfSiblings,bio,about
0,1,Elon Musk,50.0,219000.0,2022,4,Automotive,"Tesla, SpaceX",United States,Texas,...,Tesla,True,M,1971-06-28,CEO,1.0,,,Elon Musk is working to revolutionize transpor...,Musk was accepted to a graduate program at Sta...
1,2,Jeff Bezos,58.0,171000.0,2022,4,Technology,Amazon,United States,Washington,...,Amazon,True,M,1964-01-12,Entrepreneur,1.0,"Seattle-Tacoma-Bellevue, WA",,Jeff Bezos founded e-commerce giant Amazon in ...,"Growing up, Jeff Bezos worked summers on his g..."
2,3,Bernard Arnault & family,73.0,158000.0,2022,4,Fashion & Retail,LVMH,France,,...,LVMH Moët Hennessy Louis Vuitton,False,M,1949-03-05,Chairman and CEO,,,,Bernard Arnault oversees the LVMH empire of so...,"Arnault apparently wooed his wife, Helene Merc..."
3,4,Bill Gates,66.0,129000.0,2022,4,Technology,Microsoft,United States,Washington,...,Bill & Melinda Gates Foundation,True,M,1955-10-28,Cofounder,4.0,"Seattle-Tacoma-Bellevue, WA",,Bill Gates turned his fortune from software fi...,"When Gates was a kid, he spent so much time re..."
4,5,Warren Buffett,91.0,118000.0,2022,4,Finance & Investments,Berkshire Hathaway,United States,Nebraska,...,Berkshire Hathaway,True,M,1930-08-30,CEO,5.0,"Omaha, NE",,"Known as the ""Oracle of Omaha,"" Warren Buffett...","Buffett still lives in the same Omaha, Nebrask..."


# Determine the dimensions (number of rows, number of columns) of the DataFrame df.

In [6]:
df.shape

(2668, 22)

# Retrieve the column names of the DataFrame df.

In [7]:
df.columns

Index(['rank', 'personName', 'age', 'finalWorth', 'year', 'month', 'category',
       'source', 'country', 'state', 'city', 'countryOfCitizenship',
       'organization', 'selfMade', 'gender', 'birthDate', 'title',
       'philanthropyScore', 'residenceMsa', 'numberOfSiblings', 'bio',
       'about'],
      dtype='object')

## Data Preprocessing

# Select the columns "rank", "personName", "age", "finalWorth", "category", "country", and "gender" from the DataFrame df and assign the result back to df.

In [9]:
df = df.loc[:, ["rank", "personName", "age", "finalWorth", "category", "country", "gender"]]
df.head()

,rank,personName,age,finalWorth,category,country,gender
0,1,Elon Musk,50.0,219000.0,Automotive,United States,M
1,2,Jeff Bezos,58.0,171000.0,Technology,United States,M
2,3,Bernard Arnault & family,73.0,158000.0,Fashion & Retail,France,M
3,4,Bill Gates,66.0,129000.0,Technology,United States,M
4,5,Warren Buffett,91.0,118000.0,Finance & Investments,United States,M


# Set a specific column as the index of the DataFrame df, in this case "rank".

In [10]:
df = df.set_index("rank")

In [11]:
df.head()

rank,personName,age,finalWorth,category,country,gender
1,Elon Musk,50.0,219000.0,Automotive,United States,M
2,Jeff Bezos,58.0,171000.0,Technology,United States,M
3,Bernard Arnault & family,73.0,158000.0,Fashion & Retail,France,M
4,Bill Gates,66.0,129000.0,Technology,United States,M
5,Warren Buffett,91.0,118000.0,Finance & Investments,United States,M


# Check the data type of every column in the DataFrame df.

In [12]:
df.dtypes

personName     object
age           float64
finalWorth    float64
category       object
country        object
gender         object
dtype: object

# Count the number of missing or null values in each column of a pandas DataFrame df.

In [13]:
df.isnull().sum()

personName     0
age           86
finalWorth     0
category       0
country       13
gender        16
dtype: int64

# Remove rows with missing or null values from a pandas DataFrame df.

In [14]:
df.dropna(inplace = True)

# Check the dimensions of the DataFrame df again, now that rows with missing values have been removed.

In [16]:
df.shape

(2568, 6)

## Information about the gender of the richest people in the world

# Count the occurrences of each unique value in a specific column of a pandas DataFrame df.

In [17]:
df["gender"].value_counts()

gender
M    2282
F     286
Name: count, dtype: int64

# Calculate the relative frequencies or proportions of each unique value in a specific column of a pandas DataFrame df.

In [18]:
df["gender"].value_counts(normalize = True)

gender
M    0.888629
F    0.111371
Name: proportion, dtype: float64

# Calculate the relative frequencies or proportions of each unique value in the "gender" column, specifically for the rows where the "country" column is equal to "France" in a pandas DataFrame df.

In [20]:
df[df["country"] == 'France'].gender.value_counts(normalize = True)

gender
M    0.878788
F    0.121212
Name: proportion, dtype: float64

# Retrieve an array of unique values in the "country" column of a pandas DataFrame df

In [21]:
df["country"].unique()

array(['United States', 'France', 'India', 'Mexico', 'China', 'Singapore',
       'Spain', 'Canada', 'Germany', 'Switzerland', 'Belgium',
       'Hong Kong', 'United Kingdom', 'Australia', 'Austria', 'Italy',
       'Japan', 'Bahamas', 'Indonesia', 'Chile', 'Russia', 'Sweden',
       'Czechia', 'Monaco', 'United Arab Emirates', 'Nigeria', 'Denmark',
       'Thailand', 'Malaysia', 'Brazil', 'Colombia', 'New Zealand',
       'South Korea', 'South Africa', 'Philippines', 'Egypt', 'Taiwan',
       'Israel', 'Vietnam', 'Poland', 'Norway', 'Cayman Islands',
       'Netherlands', 'Eswatini (Swaziland)', 'Peru', 'Algeria',
       'Kazakhstan', 'Georgia', 'Portugal', 'British Virgin Islands',
       'Turkey', 'Finland', 'Ukraine', 'Ireland', 'Bermuda', 'Lebanon',
       'Argentina', 'Cambodia', 'Oman', 'Guernsey', 'Liechtenstein',
       'Turks and Caicos Islands', 'Qatar', 'Morocco', 'Uruguay',
       'Slovakia', 'Romania', 'Nepal', 'Tanzania', 'Bahrain', 'Greece',
       'Hungary', 'Andorra'], dtype=object)

# Calculate the relative frequencies or proportions of each unique value in the "gender" column, specifically for the rows where the "country" column is equal to "Canada" in a pandas DataFrame df.

In [22]:
df[df["country"] == 'Canada'].gender.value_counts(normalize = True)

gender
M    0.952381
F    0.047619
Name: proportion, dtype: float64

# Group a pandas DataFrame df based on the unique values in the "gender" column.

In [23]:
df_gender = df.groupby(["gender"])

# Calculate the mean (average) value of the "age" column within each group of a grouped pandas DataFrame df_gender.
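A minimal sketch of the per-group mean, assuming a small hypothetical stand-in for the cleaned DataFrame. The names and ages below are made-up placeholders, not values from the Forbes file.

```python
import pandas as pd

# Hypothetical stand-in for the cleaned df; all values are illustrative only.
sample = pd.DataFrame(
    {
        "personName": ["Person A", "Person B", "Person C", "Person D"],
        "age": [50.0, 70.0, 40.0, 60.0],
        "gender": ["M", "M", "F", "F"],
    }
)

# Group the rows by gender, then average the age column within each group.
sample_gender = sample.groupby("gender")
mean_age = sample_gender["age"].mean()
print(mean_age)
```

In the notebook itself, the equivalent call would presumably be `df_gender["age"].mean()` on the grouped object created above.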

# Import the seaborn library and set its theme. Adjust the DPI (dots per inch) setting for the figures. Additionally, import the warnings module and set it to ignore warnings.

# Create a bar plot representing the size or count of each group within a grouped pandas DataFrame df_gender.
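The numbers behind such a bar plot are the group sizes. The sketch below computes them on a hypothetical four-row sample (the gender values are placeholders) and leaves the plotting call out.

```python
import pandas as pd

# Hypothetical sample; only the gender column matters for counting.
sample = pd.DataFrame({"gender": ["M", "M", "M", "F"]})

# size() returns the number of rows in each group as a Series
# indexed by the group key.
group_sizes = sample.groupby("gender").size()
print(group_sizes)
```

The index of `group_sizes` holds the bar labels and its values hold the bar heights, so the Series can be handed to a plotting function as x and y.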

## Who are the top 10 richest in the world?

# Create a bar plot of the top 10 individuals' names (personName) and their final worth (finalWorth) from the DataFrame df.
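One way to select the rows for this plot is `nlargest`. The sketch below uses the five rows shown in the `df.head()` output, shuffled on purpose, and keeps the top 3 instead of the top 10 because the sample is that small.

```python
import pandas as pd

# Five rows copied from the df.head() output, deliberately out of order.
sample = pd.DataFrame(
    {
        "personName": [
            "Bill Gates",
            "Elon Musk",
            "Warren Buffett",
            "Bernard Arnault & family",
            "Jeff Bezos",
        ],
        "finalWorth": [129000.0, 219000.0, 118000.0, 158000.0, 171000.0],
    }
)

# nlargest orders the rows by finalWorth, largest first, and keeps n of them.
# n=3 only because the sample has five rows; the notebook would use n=10.
top = sample.nlargest(3, "finalWorth")
print(top)
```

The `personName` and `finalWorth` columns of `top` are then the two axes of the bar plot.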

## Which country has the highest number of billionaires?

# Calculate the number of unique values in the "country" column of a pandas DataFrame df.

# Group a pandas DataFrame df based on the unique values in the "country" column.

# Obtain a new DataFrame df_country_number that contains two columns: "country" (index) and "number" (count/size). The DataFrame is sorted in descending order based on the group sizes, providing a summary of the number of data points or individuals for each country in the dataset.
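The three steps above can be sketched as follows. The country list is a hypothetical six-row sample, so the counts say nothing about the real dataset; `df_country_number` and the column name "number" follow the description above.

```python
import pandas as pd

# Hypothetical sample of the country column; counts are illustrative only.
sample = pd.DataFrame(
    {
        "country": [
            "United States",
            "France",
            "United States",
            "India",
            "France",
            "United States",
        ]
    }
)

# Number of distinct countries.
n_countries = sample["country"].nunique()

# Rows per country, as a one-column DataFrame sorted from largest to smallest.
df_country_number = (
    sample.groupby("country")
    .size()
    .to_frame(name="number")
    .sort_values("number", ascending=False)
)
print(n_countries)
print(df_country_number)
```

Calling `df_country_number.head(10)` would then give the rows for a top 10 plot.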

# Create a bar plot. It should represent the top 10 countries (index) and their corresponding group sizes (number) from a DataFrame df_country_number.

## Who are the top 10 richest in France?

# Create a new DataFrame df_France by filtering the original DataFrame df to include only the rows where the "country" column is equal to "France".

# Calculate the count of non-null values in the "personName" column of the DataFrame df_France.
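A minimal sketch of the filter and the count, assuming only the five rows visible in the `df.head()` output. With this tiny sample the count is 1; the real dataset would give a different number.

```python
import pandas as pd

# Rows copied from the df.head() output shown earlier.
sample = pd.DataFrame(
    {
        "personName": [
            "Elon Musk",
            "Jeff Bezos",
            "Bernard Arnault & family",
            "Bill Gates",
            "Warren Buffett",
        ],
        "country": [
            "United States",
            "United States",
            "France",
            "United States",
            "United States",
        ],
    }
)

# A boolean mask keeps only the rows whose country is France.
df_France = sample[sample["country"] == "France"]

# count() returns the number of non-null entries in the column.
n_france = df_France["personName"].count()
print(n_france)
```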

# Use the seaborn library to create a bar plot. It should represent the top 10 individuals' names (personName) and their corresponding final worth (finalWorth) specifically for the subset of data where the country is "France" (DataFrame df_France).

## Which industry has the most billionaires?

# Retrieve an array of unique values from the "category" column of a pandas DataFrame df.

# Modify the values in the "category" column of the DataFrame df: remove the spaces and replace the ampersands with underscores in each category value.
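One possible way to do this cleanup is with two chained `str.replace` calls. The category values below come from the `df.head()` output; `regex=False` is passed so that both patterns are treated as literal text.

```python
import pandas as pd

# Category values taken from the df.head() output shown earlier.
sample = pd.DataFrame(
    {
        "category": [
            "Automotive",
            "Technology",
            "Fashion & Retail",
            "Finance & Investments",
        ]
    }
)

# First drop every space, then turn each ampersand into an underscore.
sample["category"] = (
    sample["category"]
    .str.replace(" ", "", regex=False)
    .str.replace("&", "_", regex=False)
)
print(sample["category"].unique())
```

After this step "Fashion & Retail" becomes "Fashion_Retail", while values without spaces or ampersands are unchanged.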

# Retrieve an array of unique values from the modified "category" column of a pandas DataFrame df

# Group the pandas DataFrame df based on the unique values in the "category" column and calculate the size or count of each group.

# Convert the pandas Series df_category into a DataFrame df_category.

# Rename the column of the DataFrame df_category to "numbers" and sort the DataFrame by the "numbers" column in descending order.
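The grouping, conversion, renaming, and sorting steps can be sketched together. The rows are a hypothetical sample using cleaned category names, so the counts are illustrative only.

```python
import pandas as pd

# Hypothetical sample of already-cleaned category values.
sample = pd.DataFrame(
    {
        "category": [
            "Technology",
            "Finance_Investments",
            "Technology",
            "Automotive",
            "Finance_Investments",
            "Technology",
        ]
    }
)

# Count the rows in each category; the result is a Series.
df_category = sample.groupby("category").size()

# Convert the Series into a one-column DataFrame.
df_category = df_category.to_frame()

# Name the single column "numbers" and sort from largest to smallest.
df_category.columns = ["numbers"]
df_category = df_category.sort_values("numbers", ascending=False)
print(df_category)
```

Assigning to `df_category.columns` is used here instead of `rename` so the sketch does not depend on the default column label that `to_frame()` produces.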

# Use the seaborn library to create a bar plot. It should represent the top 10 categories (index) and their corresponding counts (numbers) from the DataFrame df_category.

## Is there a relationship between money and age?

# Use the seaborn library to create a scatter plot. It should represent the relationship between the "age" and "finalWorth" variables from the DataFrame df.
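As a numeric complement to the scatter plot (a swapped-in technique, not something the notebook itself asks for), the Pearson correlation between the two columns can be computed with `Series.corr`. The sketch uses only the five rows from the `df.head()` output, which is far too few to draw any conclusion about the full dataset.

```python
import pandas as pd

# Age and finalWorth values copied from the df.head() output shown earlier.
sample = pd.DataFrame(
    {
        "age": [50.0, 58.0, 73.0, 66.0, 91.0],
        "finalWorth": [219000.0, 171000.0, 158000.0, 129000.0, 118000.0],
    }
)

# Pearson correlation coefficient, a value between -1 and 1.
corr = sample["age"].corr(sample["finalWorth"])
print(corr)
```

A coefficient near 0 would suggest no linear relationship; for these five rows it is negative only because the sample is the top of a list sorted by wealth.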

## The distribution of age

# Use the seaborn library to create a histogram. It should represent the distribution of values in the "age" variable from the DataFrame df.
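A histogram is a count of values per bin, which can be sketched in pandas alone with `pd.cut`. The ages come from the `df.head()` output; the bin edges 40, 60, 80, and 100 are an arbitrary assumption chosen for this example.

```python
import pandas as pd

# Ages copied from the df.head() output shown earlier.
ages = pd.Series([50.0, 58.0, 73.0, 66.0, 91.0], name="age")

# Assign each age to an interval; by default each bin includes its right edge.
# The bin edges are an arbitrary choice for this illustration.
age_bins = pd.cut(ages, bins=[40, 60, 80, 100])

# Count the ages per interval and keep the intervals in ascending order.
bin_counts = age_bins.value_counts().sort_index()
print(bin_counts)
```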

## The youngest billionaires

# First, sort the DataFrame df by the values in the "age" column and assign the sorted DataFrame to a new DataFrame df_age.
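A minimal sketch of the sorting step, again assuming only the five rows from the `df.head()` output. `sort_values` sorts in ascending order by default, so the youngest people come first.

```python
import pandas as pd

# Rows copied from the df.head() output shown earlier.
sample = pd.DataFrame(
    {
        "personName": [
            "Elon Musk",
            "Jeff Bezos",
            "Bernard Arnault & family",
            "Bill Gates",
            "Warren Buffett",
        ],
        "age": [50.0, 58.0, 73.0, 66.0, 91.0],
    }
)

# Sort by age, youngest first, and keep the result in a new DataFrame.
df_age = sample.sort_values("age")

# The first rows are the youngest; the notebook would take 10 instead of 3.
youngest = df_age.head(3)
print(youngest)
```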

# Use the seaborn library to create a bar plot. It should represent the names (personName) and ages (age) of the first 10 rows, that is, the 10 youngest individuals, in the sorted DataFrame df_age.