# **World's Wealthiest - Descriptive Statistics**

The World's Billionaires is an annual ranking by documented net worth of the world's wealthiest billionaires compiled and published in March annually by the American business magazine Forbes. The list was first published in March 1987. The total net worth of each individual on the list is estimated and is cited in United States dollars, based on their documented assets and accounting for debt. Royalty and dictators whose wealth comes from their positions are excluded from these lists. This ranking is an index of the wealthiest documented individuals, excluding and ranking against those with wealth that is not able to be completely ascertained. (wikipedia)

The dataset has the following features:
 - Year
 - Rank
 - Name
 - Net_Worth
 - Age
 - Nationality
 - Source_wealth

**Objective**: Perform descriptive analytics to understand what data tells us about the world's wealthiest

In [None]:
import numpy as np
import pandas as pd
import scipy
from scipy import stats
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
pd.set_option('display.max_columns',25)

Read input file billionaires.csv

In [None]:
data=pd.read_csv("../input/world-billionaires/billionaires.csv")

**Read the first 10 rows from the CSV file**

In [None]:
data.head(10)

Correct the column name from "natinality" to "nationality"

In [None]:
data.rename(columns = {'natinality':'nationality'}, inplace = True)

In [None]:
data.head(20)

In [None]:
data.info()

In [None]:
data.describe()

In [None]:
data.describe(include='object')

In [None]:
#int64 --> Numerical discrete
#float64 --> Numerical continuous
#object --> String Categories

In [None]:
data['net_worth']=data['net_worth'].astype('float')

In [None]:
data['year']=data['year'].astype('object')
data['rank']=data['rank'].astype('object')

In [None]:
data.info()

In [None]:
data.describe()



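*   count : Number of non-null entries (records)
*   mean : Average value; the average net worth is 42.36 and the average age of a billionaire is 66.7
*   std : Standard deviation, i.e. how far values typically lie from the mean
*   min : Minimum value
*   25% : First quartile (Q1); 25 percent of the values lie at or below it
*   50% : Median (Q2); 50 percent of the values lie at or below it
*   75% : Third quartile (Q3); 75 percent of the values lie at or below it
*   max : Maximum value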
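
To make the list above concrete, here is a minimal sketch that recomputes each statistic by hand and places it next to the output of `describe()`. The values are hypothetical toy numbers, not rows from the billionaires dataset.

```python
import pandas as pd

# Hypothetical toy values, not taken from the billionaires dataset
worth = pd.Series([10.0, 20.0, 30.0, 40.0, 100.0])

summary = worth.describe()

# Recompute each statistic separately to see what describe() reports
manual = pd.Series({
    "count": worth.count(),
    "mean": worth.sum() / worth.count(),
    "std": worth.std(),            # sample standard deviation (ddof=1)
    "min": worth.min(),
    "25%": worth.quantile(0.25),
    "50%": worth.median(),
    "75%": worth.quantile(0.75),
    "max": worth.max(),
})

# Both columns should agree row by row
print(pd.DataFrame({"describe": summary, "manual": manual}))
```

Note that the mean (40) sits above the median (30) here because the single large value pulls the average upward.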











In [None]:
data.describe(include='object')

In [None]:
type(data.columns[0])

In [None]:
data.columns = ['year', 'rank', 'name', 'net_worth', 'age', 'nationality','source_wealth']

In [None]:
data.head()

In [None]:
data['NewCol'] = "Something"

In [None]:
data.head()

In [None]:
data.drop(['NewCol'],inplace=True, axis=1) # drop the whole 'NewCol' column

# axis = 1: the given labels refer to columns
# axis = 0: the given labels refer to rows (index labels)
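
The effect of the `axis` argument can be seen on a small frame. This is a sketch on a hypothetical toy DataFrame (the names `toy`, `without_column`, and `without_row` are illustrative only).

```python
import pandas as pd

# Hypothetical toy frame used only to illustrate the axis argument
toy = pd.DataFrame({"name": ["A", "B", "C"], "age": [50, 60, 70], "tmp": [1, 2, 3]})

# axis=1: 'tmp' is looked up among the column labels
without_column = toy.drop(["tmp"], axis=1)

# axis=0: 0 is looked up among the row (index) labels
without_row = toy.drop([0], axis=0)

print(without_column.columns.tolist())
print(without_row.index.tolist())
```

Neither call uses `inplace=True`, so `toy` itself is left unchanged and each result is a new DataFrame.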

In [None]:
data['year_new'] = data['year'].astype('object')
data['new_worth'] = data['net_worth'].astype('object')
data['rank'] = data['rank'].astype('object')

In [None]:
data.info()

In [None]:
data['rank'].value_counts()

In [None]:
data['name']

In [None]:
data[['year','rank','net_worth']].head(20)

In [None]:
data.columns

In [None]:
data[(data['nationality'] == 'United States') & (data['year'] == 2019) & (data['net_worth'] > 50 )][['name','rank']]

In [None]:
data.groupby('year')

In [None]:
data['name'].value_counts()

In [None]:
data['source_wealth'].value_counts()

Based on the counts above, Microsoft is the most frequent source of wealth among the entries in the list.

In [None]:
data[(data['name'] == 'Mukesh Ambani')]

In [None]:
data_by_year=data.groupby('year')
data_by_year.describe()

In [None]:
data.groupby(by=["year","age"]).describe()

The `describe` function in pandas is a powerful way to get statistical details of the entire dataset; as shown above, it reports the count, mean, standard deviation, minimum, quartiles (including the median), and maximum of each numeric column.
In 2002 the mean age of a billionaire is 55, and in 2019 it is 66.7.
In 2002 the mean net worth of a billionaire is 27.53, and in 2019 it is 74.38.
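
When only a few statistics per year are needed, a narrower aggregation can be easier to read than the full `describe()` table. The sketch below uses a hypothetical toy frame with the same column names as the dataset; the numbers are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data reusing the dataset's column names
toy = pd.DataFrame({
    "year": [2002, 2002, 2019, 2019],
    "net_worth": [20.0, 30.0, 60.0, 100.0],
    "age": [50, 60, 65, 75],
})

# One row per year, with mean and median for each numeric column
per_year = toy.groupby("year")[["net_worth", "age"]].agg(["mean", "median"])
print(per_year)
```

Selecting the numeric columns before aggregating keeps text columns such as names out of the calculation.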

**Descriptive statistics** is a summary of a given data set. Descriptive statistics are broken down into 1. Measures of Central Tendency and 2. Measures of Spread

# **Measures of Central Tendency**

The obvious question when looking at, for example, a salary dataset is "How much do people make?". When asking that, nobody wants hundreds of rows of data; they want a single number that can represent the entire dataset. That is exactly what central tendency seeks to provide. There are three measures of central tendency: **Mean, Median, Mode**



**MEAN**: The mean is the average value.

In [None]:
data_by_year[['net_worth','age']].mean()

**Observation** : The table shows, for each year, the mean net worth and the mean age across that year's entries.

**Median** : Middle value of the numbers, arranged in ascending order

In [None]:
data_by_year[['net_worth','age']].median()

**Mode** : Most frequently occurring value
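
The three measures can give quite different answers when the data contain an extreme value. The sketch below uses hypothetical numbers chosen so that one value is far larger than the rest.

```python
import pandas as pd

# Hypothetical values with one extreme entry
values = pd.Series([2, 3, 3, 4, 100])

mean_value = values.mean()      # pulled upward by the extreme value
median_value = values.median()  # middle value after sorting
mode_values = values.mode()     # a Series, because several values can tie

print(mean_value, median_value, mode_values.tolist())
```

The median and mode stay close to the bulk of the data, while the mean is dominated by the single large entry. This matters for wealth data, where a few very large fortunes are common.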

In [None]:
data.describe(include='object')

In [None]:
data['net_worth'].mean()

In [None]:
data['nationality'].mode()

In [None]:
plt.figure(figsize=[16,6])
# Select the numeric column before averaging so text columns are not aggregated
y=data.groupby('year')['age'].mean()
xi = list(range(2002,2020))
plt.plot(xi,y, marker ='o', linestyle='--', color='r', label='Mean age')
plt.xticks(xi,xi)
plt.xlabel('Year')
plt.ylabel('Mean Age for that year')

In [None]:
plt.figure(figsize=[16,6])
# Select the numeric column before averaging so text columns are not aggregated
y=data.groupby('year')['net_worth'].mean()
xi = list(range(2002,2020))
plt.plot(xi,y, marker ='o', linestyle='--', color='r', label='Mean net worth')
plt.xticks(xi,xi)

plt.ylabel('Mean Net Worth for that year')
plt.xlabel('Year')

In [None]:
!pip install seaborn 
import seaborn as sns 

In [None]:
# Use a name other than `filter`, which would shadow the Python built-in
tech_names = ['Bill Gates','Jeff Bezos','Mark Zuckerberg','Paul Allen','Larry Ellison']
data[data['name'].isin(tech_names)][['name','year','net_worth']]

In [None]:
plt.figure(figsize=(16,6))
# Names to compare; avoid calling the list `filter`, which shadows the built-in
names = ['Bill Gates','Jeff Bezos','Mark Zuckerberg','Paul Allen','Larry Ellison', "Mukesh Ambani"]
comparison = data[data['name'].isin(names)][['name','year','net_worth']]
sns.lineplot(data=comparison,x='year',y='net_worth',hue='name')
plt.xticks(xi,xi)
plt.xlabel('Year')
plt.ylabel('Net Worth')
plt.title('Net Worth of Selected Billionaires over the Years')



In [None]:
plt.figure(figsize=(10,5))
# Spelling corrected from 'Bershire Hathway'; this assumes the dataset uses 'Berkshire Hathaway'
sources = ['Microsoft','Facebook','Berkshire Hathaway','Wal-Mart','Amazon','LVMH']
comparison = data[data['source_wealth'].isin(sources)][['source_wealth','year','net_worth']]
sns.lineplot(data=comparison,x='year',y='net_worth',hue='source_wealth')
plt.xticks(xi,xi)
plt.xlabel('Year')
plt.ylabel('Net Worth')
plt.title('Net Worth by Source of Wealth over the Years')
plt.show()

**Summary of data**

In [None]:
data.describe()

For net worth, the 50% value (Q2) is 39.8, which is the median: 50% of the values lie below it and 50% above it. The 25% value (Q1) is the first quartile: 25% of the values lie below it and 75% above it. The 75% value (Q3) is the third quartile: 75% of the values lie below it and 25% above it.
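
The quartiles reported by `describe()` can also be requested directly with `quantile`. A minimal sketch on hypothetical values:

```python
import pandas as pd

# Hypothetical values, already sorted for readability
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9])

# Q1, Q2 and Q3 in one call; pandas interpolates linearly by default
q1, q2, q3 = values.quantile([0.25, 0.5, 0.75])

print(q1, q2, q3)
print(q2 == values.median())  # Q2 is the median
```

With other data sizes the quartiles may fall between two observations, in which case the default linear interpolation returns a value that does not appear in the data.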

# **Measures of Spread**

Spread is also known as variability. There are four commonly used measures of spread: Range, Variance, Standard Deviation, and Interquartile Range.


**Range** : Difference between smallest and largest number

In [None]:
data_by_year['net_worth'].max()

In [None]:
data_by_year['net_worth'].max() - data_by_year['net_worth'].min()

**Variance** : Variance (σ²) in statistics is a measurement of the spread between numbers in a data set. It is the average of the squared deviations from the mean, so it measures how far each number in the set is from the mean and therefore from every other number in the set.
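
The calculation can be written out step by step. The sketch below uses hypothetical values and also shows that pandas `var()` divides by n - 1 (sample variance) by default, while `ddof=0` gives the population variance.

```python
import pandas as pd

# Hypothetical values with mean 5
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

mean = values.mean()
squared_dev = (values - mean) ** 2   # squared distance of each value from the mean

population_var = squared_dev.sum() / len(values)        # divide by n
sample_var = squared_dev.sum() / (len(values) - 1)      # divide by n - 1

print(population_var, values.var(ddof=0))  # population variance, two ways
print(sample_var, values.var())            # sample variance (pandas default)
```

For large groups the two versions are nearly identical; for a year with only a handful of entries the difference is noticeable.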

In [None]:
data_by_year['net_worth'].var()

**STANDARD DEVIATION** : Square root of Variance.
The standard deviation is a statistic that measures the dispersion of a dataset relative to its mean and is calculated as the square root of the variance. The further the data points are from the mean, the higher the deviation within the data set; thus, the more spread out the data, the higher the standard deviation.
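
As an illustration, the sketch below compares two hypothetical series that share the same mean but differ in how spread out they are, and checks that `std()` is the square root of `var()`.

```python
import numpy as np
import pandas as pd

# Two hypothetical series with the same mean (50) but different spread
tight = pd.Series([48.0, 49.0, 50.0, 51.0, 52.0])
wide = pd.Series([10.0, 30.0, 50.0, 70.0, 90.0])

print(tight.mean(), wide.mean())   # identical means
print(tight.std(), wide.std())     # very different standard deviations

# std() is the square root of var() (both use ddof=1 by default)
print(np.isclose(tight.std(), np.sqrt(tight.var())))
```

Unlike the variance, the standard deviation is expressed in the same units as the data, which makes it easier to interpret alongside the mean.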

In [None]:
data_by_year['net_worth'].std()

**INTERQUARTILE RANGE**: A measure of statistical dispersion, calculated as the difference between the upper (75th percentile) and lower (25th percentile) quartiles.
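
One reason to look at the interquartile range alongside the range is that it ignores the extremes. The sketch below uses hypothetical values and a small helper, `iqr`, which is introduced here only for illustration.

```python
import pandas as pd

def iqr(series: pd.Series) -> float:
    """Difference between the 75th and 25th percentiles."""
    return series.quantile(0.75) - series.quantile(0.25)

# Hypothetical values; the second series replaces the largest value with an extreme one
values = pd.Series([10.0, 20.0, 30.0, 40.0, 50.0])
with_extreme = pd.Series([10.0, 20.0, 30.0, 40.0, 500.0])

print(values.max() - values.min(), with_extreme.max() - with_extreme.min())  # range changes a lot
print(iqr(values), iqr(with_extreme))                                        # IQR is unchanged
```

The range jumps when a single value changes, while the IQR stays the same because it depends only on the middle half of the data.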

In [None]:
Q1 = data_by_year['net_worth'].quantile(0.25)
Q3 = data_by_year['net_worth'].quantile(0.75)
IQR = Q3 - Q1
print(IQR)

## Finding outliers in a column using Standard Deviation
Standard deviation is helpful because it describes how far away from the mean your data generally is. We can use this to find data points that are unusually far from the mean. These are called **outliers**.

Let's try to find the outliers in the age column. In other words, we are looking for people who are unusually old or unusually young compared to the entire dataset of billionaires.

In [None]:
data['age_std'] = ((data['age'] - data['age'].mean())/ data['age'].std())
data.sort_values(by='age_std').head(10)

As a rule of thumb, values that are more than 3 standard deviations away from the mean are considered outliers. So there are no outliers in this case.
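
The 3-standard-deviation rule can be applied to both tails at once by taking the absolute value of the standardized score. This is a sketch on hypothetical ages, not the dataset itself; the threshold of 3 is the rule of thumb stated above.

```python
import pandas as pd

# Hypothetical ages: fifteen values in the 60s plus one unusually young entry
ages = pd.Series([64, 66, 65, 67, 63, 68, 62, 66, 64, 65, 67, 63, 66, 64, 65, 20])

# Standardized score: distance from the mean in units of standard deviation
z = (ages - ages.mean()) / ages.std()

# Keep values whose score exceeds 3 in either direction
outliers = ages[z.abs() > 3]
print(outliers.tolist())
```

Sorting by the score and looking at `head(10)` shows only the lowest values, so filtering on `z.abs()` is a convenient way to check the oldest and youngest entries together.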

In [None]:
# Subtract the mean from each value and divide by the SD --> this gives the number of standard deviations each value lies from the mean


In [None]:
data['net_worth_std']=((data['net_worth'] - data['net_worth'].mean())/data['net_worth'].std())
data.sort_values(by='net_worth_std', ascending=False).head(10)

So the top person here can be considered an outlier as far as net worth is concerned: Jeff Bezos in 2018 and 2019 was <i>super rich</i>, i.e. far richer even than the other billionaires on the list.