### Demographic analyzer
##### In this project, we look at demographic data from the US Census: we explore the dataset, clean it where necessary, and ask a few questions.

In [45]:
import pandas as pd
import numpy as np

df = pd.read_csv("../csv/adult.data.csv")
df.columns

Index(['age', 'workclass', 'fnlwgt', 'education', 'education-num',
       'marital-status', 'occupation', 'relationship', 'race', 'sex',
       'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
       'salary'],
      dtype='object')

In [46]:
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,salary
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


How many people from each race group are in the sample?

In [47]:
race_count = df["race"].value_counts()
race_count

race
White                 27816
Black                  3124
Asian-Pac-Islander     1039
Amer-Indian-Eskimo      311
Other                   271
Name: count, dtype: int64
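
The raw counts above can also be expressed as shares of the sample. A minimal sketch, assuming a made-up five-row series (`race` and its values are illustrative, not taken from the dataset):

```python
import pandas as pd

# Hypothetical miniature sample; in the notebook this would be df["race"].
race = pd.Series(["White", "White", "Black", "White", "Other"], name="race")

# normalize=True returns each group's fraction of the total instead of a count.
race_share = race.value_counts(normalize=True)

# Format the fractions as percentages for display.
print(race_share.map("{:.1%}".format))
```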

What is the average age for male individuals in the sample?

In [48]:
filter_male = df["sex"]=="Male"
df.age[filter_male].mean()

39.43354749885268
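
The same filter-then-average idea extends to every group at once with `groupby`. A minimal sketch, assuming a made-up four-row frame (`toy` and its ages are illustrative only):

```python
import pandas as pd

# Hypothetical data; the real notebook would group df by its "sex" column.
toy = pd.DataFrame({
    "sex": ["Male", "Female", "Male", "Female"],
    "age": [40, 30, 50, 34],
})

# One mean per group, indexed by the group label.
mean_age_by_sex = toy.groupby("sex")["age"].mean()
print(mean_age_by_sex)
```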

What share of the people in the sample hold a bachelor's degree?

In [49]:
bachelors_number = len(df.education[df["education"]=="Bachelors"])
total_population = len(df)

percentage_bachelors = bachelors_number/total_population
formatted_percentage_bachelors = "{:.0%}".format(percentage_bachelors)

print(f"The percentage of people from the sample having a Bachelor's degree is {formatted_percentage_bachelors}")

The percentage of people from the sample having a Bachelor's degree is 16%
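
Since a boolean mask counts `True` as 1 and `False` as 0, its mean is directly the share of matching rows. A minimal sketch of that shortcut, assuming a made-up series (`education` here is illustrative, not the dataset column):

```python
import pandas as pd

# Hypothetical education values standing in for df["education"].
education = pd.Series(["Bachelors", "HS-grad", "Masters", "Bachelors", "11th"])

# The mean of a boolean mask is the fraction of True entries.
is_bachelor = education == "Bachelors"
share_bachelors = is_bachelor.mean()

print(f"{share_bachelors:.0%}")
```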


In [50]:
#Let's see all possible education groups
df["education"].unique()

array(['Bachelors', 'HS-grad', '11th', 'Masters', '9th', 'Some-college',
       'Assoc-acdm', 'Assoc-voc', '7th-8th', 'Doctorate', 'Prof-school',
       '5th-6th', '10th', '1st-4th', 'Preschool', '12th'], dtype=object)

In [51]:
#Let's see all possible salary groups
df["salary"].unique()

array(['<=50K', '>50K'], dtype=object)

How many people belong to the high-income group (>50K per year) while, at the same time, holding *at least* a bachelor's degree? And how many people in total hold at least a bachelor's degree?

In [52]:
# Boolean mask for the three education levels treated here as "at least a bachelor's degree"
has_degree = df["education"].isin(["Bachelors", "Masters", "Doctorate"])

rich_with_degrees = df[(df["salary"]==">50K") & has_degree]
poor_with_degrees = df[(df["salary"]=="<=50K") & has_degree]

total_with_degrees = df[has_degree]


In [53]:
percentage_rich = len(rich_with_degrees)/len(total_with_degrees)
formatted_percentage_rich = "{:.2%}".format(percentage_rich)


print(f"Of the people having at least a bachelor's degree, {formatted_percentage_rich} is rich and belongs to the high-income group, while {(len(poor_with_degrees)/len(total_with_degrees)):.2%} belongs to the low-income group.") 

Of the people having at least a bachelor's degree, 46.54% is rich and belongs to the high-income group, while 53.46% belongs to the low-income group.
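
The degree filter and the income filter can be kept as two separate boolean masks and combined as needed. A minimal sketch, assuming a made-up seven-row frame; the list `degree_levels` mirrors the three categories used above, and `toy` is illustrative only:

```python
import pandas as pd

# Hypothetical data with the same column names as the notebook's df.
toy = pd.DataFrame({
    "education": ["Bachelors", "Masters", "HS-grad", "Doctorate",
                  "11th", "Bachelors", "Some-college"],
    "salary": [">50K", ">50K", "<=50K", "<=50K", ">50K", "<=50K", "<=50K"],
})

degree_levels = ["Bachelors", "Masters", "Doctorate"]
has_degree = toy["education"].isin(degree_levels)
is_rich = toy["salary"] == ">50K"

# Share of high earners inside each education group.
share_rich_with_degree = (has_degree & is_rich).sum() / has_degree.sum()
share_rich_without_degree = (~has_degree & is_rich).sum() / (~has_degree).sum()

print(f"{share_rich_with_degree:.2%} vs {share_rich_without_degree:.2%}")
```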


Now we know that people in the sample who hold at least a bachelor's degree have roughly a 47% chance of being in the high-income group. What is that share for people whose education is **lower** than a bachelor's degree?

In [54]:
# Boolean mask for everyone whose education is not Bachelors, Masters or Doctorate
no_degree = ~df["education"].isin(["Bachelors", "Masters", "Doctorate"])

rich_without_degrees = df[(df["salary"]==">50K") & no_degree]
poor_without_degrees = df[(df["salary"]=="<=50K") & no_degree]

total_without_degrees = df[no_degree]

In [55]:
percentage_rich_nodegree = len(rich_without_degrees)/len(total_without_degrees)
percentage_poor_nodegree = len(poor_without_degrees)/len(total_without_degrees)

print(f"Among people who do not hold at least a bachelor's degree, {percentage_rich_nodegree:.2%} have a high income, while {percentage_poor_nodegree:.2%} belong to a low-income group.")

Among people who do not hold at least a bachelor's degree, 17.37% have a high income, while 82.63% belong to a low-income group.


We can then see that, in this sample, the share of rich people among those holding at least a bachelor's degree (46.54%) is roughly 2.7 times the share among those without one (17.37%).
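
The size of that gap is a simple ratio of the two shares printed above. A short worked example using those two figures (the variable names are illustrative):

```python
# Shares of high earners reported above for the two education groups.
share_rich_with_degree = 0.4654
share_rich_without_degree = 0.1737

# Ratio of the two shares: how many times more common a high income is
# among people with at least a bachelor's degree.
ratio = share_rich_with_degree / share_rich_without_degree
print(f"{ratio:.2f}")
```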

Is this difference statistically significant, or could it plausibly be due to chance?

In [56]:
from scipy.stats import chi2_contingency

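# Note: chi2_contingency expects observed counts. The rows below are percentages,
# so the test treats each group as if it contained 100 people; the raw counts
# (e.g. len(rich_without_degrees)) would reflect the real sample size.
observed_data = [[17.37, 82.63], [46.54, 53.46]]

chi2_stat, p_value, _, _ = chi2_contingency(observed_data)

if p_value < 0.05:
    print(f"There is a significant association between education level and income level, since the p-value is {p_value:.5%}.")
else:
    print(f"There is no significant association between education level and income level, since the p-value is {p_value:.5%}.")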


There is a significant association between education level and income level, since the p-value is 0.00194%.
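
A chi-square test of independence is defined on observed counts, so the contingency table is usually built from the rows themselves rather than from percentages. A minimal sketch of that variant, assuming a made-up sample of 200 people (`toy`, its `has_degree` column and all its counts are hypothetical and unrelated to the real dataset):

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical sample: 40 people with a degree, 160 without.
toy = pd.DataFrame({
    "has_degree": [True] * 40 + [False] * 160,
    "salary": [">50K"] * 18 + ["<=50K"] * 22 + [">50K"] * 28 + ["<=50K"] * 132,
})

# Cross-tabulate the two categorical columns into a 2x2 table of counts.
counts = pd.crosstab(toy["has_degree"], toy["salary"])

# Run the test on the counts, not on row percentages.
chi2_stat, p_value, dof, expected = chi2_contingency(counts.to_numpy())

print(counts)
print(f"chi2 = {chi2_stat:.2f}, dof = {dof}, p-value = {p_value:.5f}")
```

With counts, the p-value reflects the actual number of observations; the same percentages from a larger sample yield a smaller p-value.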


#### Working Conditions
Let's now look more closely at the relationship between each individual's country of origin and their working conditions and salary.

In [57]:
minimum_number_working_hours = df["hours-per-week"].min()
minimum_number_working_hours

1

In [58]:
#Let's calculate how many people, and then how many *rich* people, work only the minimum number of hours found in the dataset.
people_working_min = df[df["hours-per-week"] == minimum_number_working_hours]

rich_people_working_min = df[(df["hours-per-week"] == minimum_number_working_hours) & (df["salary"]==">50K")]

In [59]:
percentage_rich_work_min = len(rich_people_working_min)/len(people_working_min)
formatted_percentage_rich_work_min = "{:.2%}".format(percentage_rich_work_min)

print(f"{formatted_percentage_rich_work_min} of the people who work the minimum amount of hours (i.e. {minimum_number_working_hours} hour) belong to the high-income category.")

10.00% of the people who work the minimum amount of hours (i.e. 1 hour) belong to the high-income category.


In [60]:
df["native-country"].unique()

array(['United-States', 'Cuba', 'Jamaica', 'India', '?', 'Mexico',
       'South', 'Puerto-Rico', 'Honduras', 'England', 'Canada', 'Germany',
       'Iran', 'Philippines', 'Italy', 'Poland', 'Columbia', 'Cambodia',
       'Thailand', 'Ecuador', 'Laos', 'Taiwan', 'Haiti', 'Portugal',
       'Dominican-Republic', 'El-Salvador', 'France', 'Guatemala',
       'China', 'Japan', 'Yugoslavia', 'Peru',
       'Outlying-US(Guam-USVI-etc)', 'Scotland', 'Trinadad&Tobago',
       'Greece', 'Nicaragua', 'Vietnam', 'Hong', 'Ireland', 'Hungary',
       'Holand-Netherlands'], dtype=object)

In [61]:
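#Since there is a typo in the dataframe, let's amend it.
#Assigning the result back to the column avoids relying on inplace=True on a column selection.
df["native-country"] = df["native-country"].replace("Trinadad&Tobago","Trinidad&Tobago")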


In [62]:
#The original file uses "?" to represent missing values. Let's replace those with np.nan, so that they never match a country name in the comparisons below.
#Note that this marks the values as missing but keeps the rows; df.dropna(subset=["native-country"]) would drop them.
df["native-country"] = df["native-country"].replace("?", np.nan)
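
Placeholder strings such as "?" can also be turned into missing values while the file is being read, through the `na_values` parameter of `pd.read_csv`. A minimal sketch, assuming a made-up three-row CSV held in memory (`csv_text` and its rows are illustrative only):

```python
import io
import pandas as pd

# Hypothetical CSV content using "?" as the missing-value marker.
csv_text = (
    "age,native-country,salary\n"
    "39,United-States,<=50K\n"
    "28,?,>50K\n"
    "45,Cuba,<=50K\n"
)

# na_values tells the parser to read "?" as NaN.
toy = pd.read_csv(io.StringIO(csv_text), na_values="?")

# dropna with subset= removes whole rows where that column is missing.
cleaned = toy.dropna(subset=["native-country"])

print(toy["native-country"].isna().sum(), len(cleaned))
```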



In [63]:
country = df["native-country"]
country


0        United-States
1        United-States
2        United-States
3        United-States
4                 Cuba
             ...      
32556    United-States
32557    United-States
32558    United-States
32559    United-States
32560    United-States
Name: native-country, Length: 32561, dtype: object

In [64]:
#Let's print out each country existing in the dataset along with the percentage of people from that country who earn more than 50K per year.
#Countries with no high-income individuals in the sample are skipped.

countries_and_perc = {}
for name in df["native-country"].unique():
    from_country = df["native-country"] == name
    total_natives_of_country = from_country.sum()
    rich_people_of_country = (from_country & (df["salary"] == ">50K")).sum()

    if rich_people_of_country != 0:
        perc_rich_people_of_country = "{:.2%}".format(rich_people_of_country/total_natives_of_country)
        countries_and_perc[name] = perc_rich_people_of_country

countries_and_perc

{'United-States': '24.58%',
 'Cuba': '26.32%',
 'Jamaica': '12.35%',
 'India': '40.00%',
 'Mexico': '5.13%',
 'South': '20.00%',
 'Puerto-Rico': '10.53%',
 'Honduras': '7.69%',
 'England': '33.33%',
 'Canada': '32.23%',
 'Germany': '32.12%',
 'Iran': '41.86%',
 'Philippines': '30.81%',
 'Italy': '34.25%',
 'Poland': '20.00%',
 'Columbia': '3.39%',
 'Cambodia': '36.84%',
 'Thailand': '16.67%',
 'Ecuador': '14.29%',
 'Laos': '11.11%',
 'Taiwan': '39.22%',
 'Haiti': '9.09%',
 'Portugal': '10.81%',
 'Dominican-Republic': '2.86%',
 'El-Salvador': '8.49%',
 'France': '41.38%',
 'Guatemala': '4.69%',
 'China': '26.67%',
 'Japan': '38.71%',
 'Yugoslavia': '37.50%',
 'Peru': '6.45%',
 'Scotland': '25.00%',
 'Trinidad&Tobago': '10.53%',
 'Greece': '27.59%',
 'Nicaragua': '5.88%',
 'Vietnam': '7.46%',
 'Hong': '30.00%',
 'Ireland': '20.83%',
 'Hungary': '23.08%'}
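
The per-country loop can also be written as a single `groupby`, which keeps the shares as numbers until display time. A minimal sketch, assuming a made-up six-row frame (`toy` is illustrative only); unlike the loop above, this version also keeps countries whose share is zero:

```python
import pandas as pd

# Hypothetical data; the last row has a missing country.
toy = pd.DataFrame({
    "native-country": ["Iran", "Iran", "Cuba", "Cuba", "Cuba", None],
    "salary": [">50K", "<=50K", "<=50K", "<=50K", ">50K", ">50K"],
})

# Mean of a boolean mask per group = share of high earners per country.
# groupby drops missing keys by default, so the last row is ignored.
is_rich = toy["salary"].eq(">50K")
share_rich = is_rich.groupby(toy["native-country"]).mean().sort_values(ascending=False)

print(share_rich.map("{:.2%}".format))
```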

Which country of origin has the highest percentage of rich people?

In [79]:

def get_keys_from_value(dictionary, value_to_retrieve):
    """Retrieve the list of keys associated with a given value in a dictionary."""
    retrieved_keys = []
    for key, value in dictionary.items():
        if value == value_to_retrieve:
            retrieved_keys.append(key)
    return retrieved_keys

# Assuming values in countries_and_perc are string representations of percentages like '12.34%'
# Convert them to numerical values for comparison
percentages = [float(value.rstrip('%')) for value in countries_and_perc.values()]

max_percentage = np.max(percentages)

country_max_percentage = get_keys_from_value(countries_and_perc, f'{max_percentage:.2f}%')
percentage_max_country = countries_and_perc[country_max_percentage[0]]

In [80]:
print(f"The country with the highest percentage of rich people is {', '.join(country_max_percentage)}, where {percentage_max_country} of the people in the sample are rich.")


The country with the highest percentage of rich people is Iran, where 41.86% of the people in the sample are rich.
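
When the shares are stored as numbers rather than formatted strings, the lookup reduces to `idxmax`. A minimal sketch using three of the figures reported above (the series `shares` is built by hand for illustration):

```python
import pandas as pd

# Three of the shares listed above, stored as floats instead of strings.
shares = pd.Series({"Iran": 0.4186, "France": 0.4138, "India": 0.4000})

# idxmax returns the label of the largest value (the first one if tied).
top_country = shares.idxmax()
top_share = shares.max()

# Selecting on equality with the maximum returns every tied label.
all_top = shares[shares == top_share].index.tolist()

print(f"{top_country}: {top_share:.2%}")
```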


Out of curiosity, we could then ask: what is the most frequent occupation among the rich people in the sample who are originally from Iran?

In [85]:
rich_iran_people = df[(df["native-country"]=="Iran") & (df["salary"]==">50K")]

most_frequent_occupation_rich_iran = rich_iran_people["occupation"].mode()

print(f"The most frequent occupation among rich people originally from Iran is {most_frequent_occupation_rich_iran[0]}.")


The most frequent occupation among rich people originally from Iran is Prof-specialty.
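
Note that `mode()` returns a Series, because several values can be tied for most frequent; indexing with `[0]` keeps only the first of them. A minimal sketch of a tie, assuming a made-up series (`occupation` here is illustrative, not the dataset column):

```python
import pandas as pd

# Hypothetical occupations with two values tied at two occurrences each.
occupation = pd.Series(
    ["Prof-specialty", "Sales", "Prof-specialty", "Sales", "Exec-managerial"]
)

# mode() returns all tied values, sorted.
modes = occupation.mode()

# value_counts() shows the frequencies behind the mode.
counts = occupation.value_counts()

print(list(modes), counts.max())
```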
