
# **Data Mentorship Program - Introduction to Stats**

Data Mentorship Program presents this class as an educational event. All information and code in this file are the property of Data Mentorship Program. If you have any questions, please email us at rajvi@datamentorship.org

In this Google Colab notebook, we will use a public dataset to learn basic statistical coding in Jupyter Notebooks. Please follow along with each step.


## Import Notebooks and Packages 

This first exercise covers basic statistics in Python using the Top Billionaire dataset.

Answer the following questions throughout this exercise:

1. What are the average and median ages of the top 50 billionaires? (df.describe, basic EDA)
2. What trends are common among the top 10 billionaires, and how do they compare with the bottom 10 billionaires? (bar graph, stacked bar chart)
3. Which country has the most billionaires? (group by)
4. What is the correlation between age and net worth? Graphical representation (scatter plot)
5. Which country has the highest proportion of billionaires? Which country has the highest probability that a billionaire's net worth is above average? (group by country)
6. ANOVA F-test (United States vs. other countries)



In [None]:
## Load the libraries needed to import data from Google Sheets
# Conduct descriptive analysis in Python (mean, median, mode, standard deviation)
# Goal: this document is a step-by-step guide to conducting descriptive statistics using Jupyter

## Step 1 - Download the Google Sheet data
from google.colab import auth
auth.authenticate_user()

import gspread
from google.auth import default
creds, _ = default()

gc = gspread.authorize(creds)

workbook = gc.open_by_url('https://docs.google.com/spreadsheets/d/1ShGbohm9zma4LNJkw1eXrpufjZX6soMKlk6EISOEezo/edit#gid=0')
ws = workbook.worksheet('Sheet1')

# get_all_values gives a list of rows.
rows = ws.get_all_values()


# Convert to a DataFrame and render.
import pandas as pd
df = pd.DataFrame.from_records(rows)
#set column names equal to values in row index position 0
df.columns = df.iloc[0]
#remove first row from DataFrame
df = df[1:]
print(df.head(10))

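The cell above needs Google authentication, so it cannot be run outside Colab. As a minimal sketch of the header-promotion step on its own, the example below builds a DataFrame from a small list of rows shaped like the output of get_all_values. The row values and the NAME column are made up for illustration; only RANK, NET WORTH, AGE, and COUNTRY/TERRITORY are column names used later in this notebook.

```python
import pandas as pd

# Hypothetical rows in the shape returned by get_all_values():
# a list of lists of strings, with the header as the first row.
demo_rows = [
    ["RANK", "NAME", "NET WORTH", "AGE", "COUNTRY/TERRITORY"],
    ["1", "Person A", "$200 B", "50", "United States"],
    ["2", "Person B", "$150 B", "72", "France"],
]

# Use the first row as column names and the remaining rows as data.
demo = pd.DataFrame(demo_rows[1:], columns=demo_rows[0])

print(demo.shape)          # (2, 5)
print(demo.dtypes)         # every column is still text (object) at this point
```

Every value arrives as a string, which is why the numeric columns are converted in a later cell.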
In [None]:
##Question 1 - conduct basic EDA
##Quick Python tip - To run a cell, press Shift + Enter

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
df.head(10)

In [None]:
##Convert the columns to specific data types


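#Strip the dollar sign and the "B" suffix before converting (raw string avoids an invalid-escape warning)
df['NET WORTH'] = df['NET WORTH'].replace({r'\$':'','B':''},regex =True).astype(float)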
df['RANK'] = df['RANK'].astype(float)
df['AGE'] = df['AGE'].astype(float)

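The conversion above raises an error if any cell cannot be parsed as a number. As a hedged alternative, here is a small sketch that uses pd.to_numeric with errors="coerce", which turns unparseable entries into NaN instead. The sample strings, including the "$219 B" format, are assumptions about how the sheet stores net worth.

```python
import pandas as pd

# Hypothetical raw values; the "$219 B" format is an assumption about the sheet.
raw = pd.Series(["$219 B", "$171 B", "n/a"])

# Remove the dollar sign and the "B" suffix, trim whitespace, then convert.
# errors="coerce" replaces anything that is not a number with NaN.
cleaned = pd.to_numeric(
    raw.str.replace(r"[$B]", "", regex=True).str.strip(),
    errors="coerce",
)

print(cleaned)
print("Unparseable entries:", cleaned.isna().sum())
```

Counting the NaN values afterwards is a quick way to see how many rows failed to convert.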
In [None]:
##df.info()

df.info()

In [None]:
##df.describe()

df.describe()

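Question 1 asks for the average and median age. describe() reports the mean directly and the median as its 50% row. The sketch below shows both routes on a handful of made-up ages (not values from the dataset).

```python
import pandas as pd

# Hypothetical ages, not taken from the dataset.
ages = pd.Series([50, 72, 58, 91, 38], name="AGE", dtype=float)

print("mean:", ages.mean())      # 61.8
print("median:", ages.median())  # 58.0

# describe() returns count, mean, std, min, quartiles, and max in one Series.
summary = ages.describe()
print("50% row:", summary["50%"])  # 58.0, the same as the median
```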
In [None]:
df.columns

Question 2: What trends are common among the top 10 billionaires, and how do they compare with the bottom 10 billionaires?

In [None]:
##Separate the top 10 and bottom 10 billionaires
#RANK 1 is the richest, so the largest RANK values belong to the bottom of the list

bottom_10 = df.nlargest(10,'RANK')
top_10 = df.nsmallest(10,'RANK')

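It is easy to mix up nlargest and nsmallest when the column is a rank. Here is a minimal check on a made-up table (the names are placeholders), assuming as above that a smaller RANK means a wealthier person.

```python
import pandas as pd

# Hypothetical ranks and names, for illustration only.
demo = pd.DataFrame({
    "RANK": [3.0, 1.0, 5.0, 2.0, 4.0],
    "NAME": ["C", "A", "E", "B", "D"],
})

# Smallest RANK values = top of the list; largest RANK values = bottom.
top_2 = demo.nsmallest(2, "RANK")
bottom_2 = demo.nlargest(2, "RANK")

print(list(top_2["NAME"]))     # ['A', 'B']
print(list(bottom_2["NAME"]))  # ['E', 'D']
```

Both methods need a numeric column, which is one reason RANK is converted to float earlier in the notebook.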
In [None]:
#Compare the mean age for each group

import numpy as np
print("Top 10 Billionaires:")
print(top_10["AGE"].mean())

In [None]:
print("Bottom 10 Billionaires:")
print(bottom_10["AGE"].mean())

In [None]:
#plot a bar chart of the average ages of the top 10 and bottom 10 billionaires
import matplotlib.pyplot as plt
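#Pass the two bar heights as a single list; a third positional argument would be read as the bar width
plt.bar(["Top 10", "Bottom 10"], [top_10["AGE"].mean(), bottom_10["AGE"].mean()])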
plt.xlabel("Group")
plt.ylabel("Average Age")
plt.title("Average Age of Top 10 vs. Bottom 10 Billionaires")
plt.show()

Comparison by Country

In [None]:
#Get the number of billionaires by country for each group
print("Top 10 Billionaires by Country:")
print(top_10["COUNTRY/TERRITORY"].value_counts())
print()
print("Bottom 10 Billionaires by Country:")
print(bottom_10["COUNTRY/TERRITORY"].value_counts())
top_10_counts = top_10["COUNTRY/TERRITORY"].value_counts()
bottom_10_counts = bottom_10["COUNTRY/TERRITORY"].value_counts()

In [None]:
#Combine the two series into a single dataframe for plotting

counts = pd.DataFrame({"Top 10": top_10_counts, "Bottom 10": bottom_10_counts})
counts = counts.fillna(0).astype(int)

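The fillna(0) step matters because the two groups rarely contain the same set of countries. The sketch below uses made-up counts and placeholder country lists to show how pandas aligns the two Series on the union of their indexes and leaves NaN where a country is missing from one group.

```python
import pandas as pd

# Hypothetical counts for two groups; the values are placeholders.
top_counts = pd.Series({"United States": 8, "France": 2})
bottom_counts = pd.Series({"United States": 5, "China": 3, "Germany": 2})

# Building a DataFrame from two Series aligns them on the union of their indexes.
aligned = pd.DataFrame({"Top 10": top_counts, "Bottom 10": bottom_counts})
print(aligned)  # NaN appears where a country is absent from a group

# Replace NaN with 0 and restore integer counts.
combined = aligned.fillna(0).astype(int)
print(combined)
```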
In [None]:
#Plot the stacked bar chart
counts.plot(kind = "bar", stacked = True)
plt.xlabel("Country")
plt.ylabel("Number of Billionaires")
plt.title("Number of Billionaires by country for top 10 vs. bottom 10")
plt.legend(loc = "upper left")
plt.show()

Question 3: Which country has the most billionaires?

In [None]:
#Group the data by country/territory and count the number of billionaires in each group

grouped = df.groupby("COUNTRY/TERRITORY")["RANK"].count()

In [None]:
#Find the country/territory with the most billionaires 
most_billionaires_country = grouped.idxmax()

print("The country/territory with the most billionaires is:", most_billionaires_country)

Question 4: What is the correlation between age and net worth?


In [None]:
#Find correlation between age and net worth 
correlation = df["AGE"].corr(df["NET WORTH"])
correlation

A correlation coefficient of -0.058 indicates a very weak negative linear relationship between age and net worth. The negative sign means that net worth tends to be slightly lower at higher ages in this sample, but the magnitude is so close to zero that age tells us almost nothing about net worth: the squared coefficient is about 0.003, so a straight-line fit on age accounts for roughly 0.3% of the variation in net worth.
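To make the coefficient less of a black box, here is a minimal sketch that computes Pearson's r by hand and checks it against Series.corr. The numbers are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical values, chosen only to illustrate the formula.
age = pd.Series([40.0, 55.0, 63.0, 71.0, 88.0])
net_worth = pd.Series([120.0, 60.0, 95.0, 45.0, 80.0])

# Pearson r = sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2)),
# where dx and dy are deviations from each variable's mean.
dx = age - age.mean()
dy = net_worth - net_worth.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

# Series.corr uses the Pearson method by default.
r_pandas = age.corr(net_worth)

print("manual:", round(r_manual, 4))
print("pandas:", round(r_pandas, 4))
print("r squared:", round(r_manual ** 2, 4))
```

The squared value is the share of variation in one variable that a straight-line fit on the other would account for.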

In [None]:
import seaborn as sns

In [None]:
#plot the scatterplot 
plt.figure(figsize = (8,6))
sns.scatterplot(x = "AGE", y = "NET WORTH", data = df)
plt.show()

Question 5: Which country has the highest proportion of billionaires?

In [None]:
#Group the data by country and count the number of billionaires in each country

country_group = df.groupby("COUNTRY/TERRITORY").size().reset_index(name = 'counts')

#Calculate the proportion of billionaires from each country
country_group["proportion"] = country_group["counts"]/df.shape[0]

#Print the proportion of billionaires from each country
print(country_group[["COUNTRY/TERRITORY","proportion"]])

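As a hedged shortcut, value_counts with normalize=True returns the same proportions in one call, sorted from largest to smallest. The country labels below are placeholders rather than values from the dataset.

```python
import pandas as pd

# Hypothetical country labels for six billionaires.
countries = pd.Series(["A", "A", "A", "B", "B", "C"], name="COUNTRY/TERRITORY")

# normalize=True divides each count by the total number of rows.
proportions = countries.value_counts(normalize=True)

print(proportions)
print("Highest proportion:", proportions.idxmax())  # A
```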
To find which country has the highest probability of billionaires whose net worth is above average (among the top 50), first calculate the average net worth of all billionaires, then count the billionaires in each country whose net worth exceeds that average. Finally, divide this count by the total number of billionaires in that country. The result is the share of each country's billionaires who are above average, which we treat as the probability.
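The calculation described above can be sketched on a small made-up table. The country labels and net worth values are placeholders; the point is only to show that the mean of a True/False column within each group is the share of rows that are above average.

```python
import pandas as pd

# Hypothetical data; labels and values are placeholders.
demo = pd.DataFrame({
    "COUNTRY/TERRITORY": ["A", "A", "A", "B", "B", "C"],
    "NET WORTH": [200.0, 50.0, 40.0, 120.0, 110.0, 30.0],
})

# Step 1: overall average net worth (550 / 6, about 91.67).
average = demo["NET WORTH"].mean()

# Step 2: flag each billionaire as above (True) or not above (False) the average.
above = demo["NET WORTH"] > average

# Step 3: the mean of the flags per country is the share that is above average.
share_above = above.groupby(demo["COUNTRY/TERRITORY"]).mean()

print(share_above)
print("Highest share:", share_above.idxmax())  # B
```

Small groups can dominate this ranking: a country with a single billionaire who is above average gets a share of 1.0, so it helps to look at the group sizes alongside the result.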


In [None]:
#Calculate the average net worth of all billionaires
average_net_worth = df["NET WORTH"].mean()
#Group the data by country
grouped = df.groupby(by = "COUNTRY/TERRITORY")
#For each country, count the number of billionaires with net worth above the average
above_average_counts = grouped["NET WORTH"].apply(lambda x: (x > average_net_worth).sum())

#Calculate the total number of billionaires in each country
total_counts = grouped["NET WORTH"].count()

#Calculate the probability of having above-average billionaires by dividing the number of
#above-average billionaires by the total number of billionaires in each country

probabilities = above_average_counts/total_counts

#Find the country with the highest probability
result = probabilities.idxmax()
print("The country with the highest probability of having above-average billionaires is:", result)