# Demographic Data

This project is an assignment for a freeCodeCamp certification. The data is taken from the 1994 US census. Our goal is to answer a series of questions about different populations.

In [1]:
import pandas as pd
import numpy as np
my_df = pd.read_csv("adult.data.csv")
print(my_df.head())

   age         workclass  fnlwgt  education  education-num  \
0   39         State-gov   77516  Bachelors             13   
1   50  Self-emp-not-inc   83311  Bachelors             13   
2   38           Private  215646    HS-grad              9   
3   53           Private  234721       11th              7   
4   28           Private  338409  Bachelors             13   

       marital-status         occupation   relationship   race     sex  \
0       Never-married       Adm-clerical  Not-in-family  White    Male   
1  Married-civ-spouse    Exec-managerial        Husband  White    Male   
2            Divorced  Handlers-cleaners  Not-in-family  White    Male   
3  Married-civ-spouse  Handlers-cleaners        Husband  Black    Male   
4  Married-civ-spouse     Prof-specialty           Wife  Black  Female   

   capital-gain  capital-loss  hours-per-week native-country salary  
0          2174             0              40  United-States  <=50K  
1             0             0             

To begin, we're looking for a count of how many people of different races are included in the dataset.

In [2]:
counts = my_df["race"].value_counts()
print(counts)

White                 27816
Black                  3124
Asian-Pac-Islander     1039
Amer-Indian-Eskimo      311
Other                   271
Name: race, dtype: int64


Next is the average age of all male entries, which is easy enough with the .mean() method.

In [3]:
average_df = my_df[my_df["sex"]=="Male"]
aver_age = round(average_df['age'].mean(), 1)
print(aver_age)


39.4
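If we wanted the average age for every sex at once, a groupby would do it in one step. This is only a sketch on a tiny made-up frame (`toy_df` and its values are hypothetical, not census data):

```python
import pandas as pd

# Hypothetical toy data standing in for the census file
toy_df = pd.DataFrame({
    "sex": ["Male", "Female", "Male"],
    "age": [30, 40, 41],
})

# Group rows by sex, then average the age column within each group
mean_age_by_sex = toy_df.groupby("sex")["age"].mean().round(1)
print(mean_age_by_sex)

# Pull out a single group the same way the filter above does
male_mean = mean_age_by_sex["Male"]
```

The filter-then-mean approach used above is just as good when only one group is needed.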


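Next we're looking for the percentage of people who have a Bachelors degree. We can compare the education column to "Bachelors", sum up the entries where the comparison is True, then divide by the total number of entries. The cell also prints the salary value counts, which are handy for the next few questions.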

In [4]:
percent = ((my_df["education"]== "Bachelors").sum()/ len(my_df["education"])) * 100
percent = round(percent, 1)
print(percent)

print(my_df["salary"].value_counts())

16.4
<=50K    24720
>50K      7841
Name: salary, dtype: int64
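A shorter route to the same kind of percentage: the mean of a boolean Series is the fraction of True values, so the sum and the division collapse into one call. A minimal sketch on made-up data (`toy_df` is hypothetical):

```python
import pandas as pd

# Hypothetical toy data standing in for the census file
toy_df = pd.DataFrame(
    {"education": ["Bachelors", "HS-grad", "Masters", "Bachelors", "11th"]}
)

# The comparison gives True/False per row; the mean is the share of True rows
bachelors_share = (toy_df["education"] == "Bachelors").mean()
bachelors_percent = round(bachelors_share * 100, 1)
print(bachelors_percent)
```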


We're looking for the percentage of people with advanced education (Bachelors, Masters, or Doctorate) who make more than 50K. This gets a little more complicated, but combining different filters makes it fairly straightforward.

In [5]:
richer_folks_edu = len((my_df[(my_df["salary"]==">50K") & ((my_df["education"]== "Bachelors") | (my_df["education"]=="Masters") | (my_df["education"]=="Doctorate"))]))

all_edu = len((my_df[(my_df["education"]== "Bachelors") | (my_df["education"]=="Masters") | (my_df["education"]=="Doctorate")]))

edu_percent = round((richer_folks_edu/all_edu) * 100, 1)
print(edu_percent)


46.5
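The chained == comparisons can also be written with .isin(), which keeps the filter readable as the list of degrees grows. Here is a sketch of that alternative on made-up data (`toy_df` and its rows are hypothetical):

```python
import pandas as pd

# Hypothetical toy data standing in for the census file
toy_df = pd.DataFrame({
    "education": ["Bachelors", "Masters", "HS-grad", "Doctorate", "11th", "Bachelors"],
    "salary": [">50K", "<=50K", "<=50K", ">50K", ">50K", "<=50K"],
})

# One mask per condition, so each can be reused
advanced = toy_df["education"].isin(["Bachelors", "Masters", "Doctorate"])
rich = toy_df["salary"] == ">50K"

# Rows that satisfy both masks, divided by all rows with advanced education
higher_edu_rich = round((advanced & rich).sum() / advanced.sum() * 100, 1)
print(higher_edu_rich)
```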


Then we're trying to find the percentage of people without advanced education degrees who make more than 50K. I listed the value counts of the education column to see which way would be the quickest to specify the values. As there are many more non-degree values than degree values, it was faster to just use != on each of the three degree types. Then I calculated and rounded the value.

In [11]:
print(my_df["education"].value_counts())
print("\n")

less_edu = len((my_df[(my_df["salary"]==">50K") & ((my_df["education"]!= "Bachelors") & (my_df["education"]!="Masters") & (my_df["education"]!="Doctorate"))]))
all_less_edu = len(my_df[(my_df["education"]!= "Bachelors") & (my_df["education"]!="Masters") & (my_df["education"] != "Doctorate")])

less_edu_percent = round((less_edu/all_less_edu) * 100, 1)
print(less_edu_percent)


HS-grad         10501
Some-college     7291
Bachelors        5355
Masters          1723
Assoc-voc        1382
11th             1175
Assoc-acdm       1067
10th              933
7th-8th           646
Prof-school       576
9th               514
12th              433
Doctorate         413
5th-6th           333
1st-4th           168
Preschool          51
Name: education, dtype: int64


17.4
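Another way to get the "everyone else" group is to build the degree mask once and flip it with ~, rather than spelling out each != comparison. A small sketch, again on made-up data (`toy_df` is hypothetical):

```python
import pandas as pd

# Hypothetical toy data standing in for the census file
toy_df = pd.DataFrame({
    "education": ["Bachelors", "HS-grad", "11th", "Doctorate", "HS-grad"],
    "salary": [">50K", "<=50K", ">50K", ">50K", "<=50K"],
})

advanced = toy_df["education"].isin(["Bachelors", "Masters", "Doctorate"])
rich = toy_df["salary"] == ">50K"

# ~advanced is True exactly where advanced is False
no_degree = ~advanced
lower_edu_rich = round((no_degree & rich).sum() / no_degree.sum() * 100, 1)
print(lower_edu_rich)
```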


The project then asks for the minimum number of hours anyone worked per week. This was quite easy to do by finding the minimum of the hours-per-week column.

In [7]:
my_df["hours-per-week"].min()

1

Then we count the people who work the minimum number of hours per week and have a salary of >50K. Two of the 20 people who work one hour per week earn that much, which seems surprisingly high; it must be a pretty nice gig to pay a salary for one hour of work per week. We then want to find what percentage this is, which is easy enough.

In [18]:
short_hours_big_money = len((my_df[(my_df["hours-per-week"]==1) & (my_df["salary"]== ">50K")]))
print(short_hours_big_money)

short_per = short_hours_big_money/len(my_df[(my_df["hours-per-week"]==1)])
print(round(short_per*100, 1))

2
10.0
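The cell above hard-codes the minimum of 1. A slightly more general sketch reuses the computed minimum so the filter still works if the data changes; `toy_df` and its values are made up for illustration:

```python
import pandas as pd

# Hypothetical toy data standing in for the census file
toy_df = pd.DataFrame({
    "hours-per-week": [1, 40, 1, 60, 1, 1],
    "salary": [">50K", "<=50K", "<=50K", ">50K", "<=50K", "<=50K"],
})

# Compute the minimum once and reuse it in the filter
min_hours = toy_df["hours-per-week"].min()
min_workers = toy_df[toy_df["hours-per-week"] == min_hours]

# Share of minimum-hours workers who earn more than 50K
rich_percent = round((min_workers["salary"] == ">50K").mean() * 100, 1)
print(len(min_workers), rich_percent)
```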


Next I answered, "What country has the highest percentage of people that earn >50K and what is that percentage?". I found this easiest to do with a quick crosstab calculating the percentage of people with a salary over 50K, then just taking the max value. 

In [9]:
cross_df = pd.crosstab(my_df["native-country"],my_df['salary']).apply(lambda r: r/r.sum(), axis=1)
#print(cross_df)

result = cross_df[(cross_df[">50K"] == cross_df[">50K"].max())]
print(result.iloc[0])
print(round((cross_df[">50K"].max()*100), 1))

salary
<=50K    0.581395
>50K     0.418605
Name: Iran, dtype: float64
41.9
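For what it's worth, crosstab can normalize each row itself via normalize="index", and .idxmax() returns the label of the largest value directly. A sketch of that variation on made-up data (the country labels "A", "B", "C" are placeholders):

```python
import pandas as pd

# Hypothetical toy data; "A", "B", "C" are placeholder country names
toy_df = pd.DataFrame({
    "native-country": ["A", "A", "B", "B", "B", "C"],
    "salary": [">50K", "<=50K", ">50K", ">50K", "<=50K", "<=50K"],
})

# normalize="index" divides every row by its own total
shares = pd.crosstab(toy_df["native-country"], toy_df["salary"], normalize="index")

# idxmax gives the row label (country) with the highest >50K share
top_country = shares[">50K"].idxmax()
top_percent = round(shares[">50K"].max() * 100, 1)
print(top_country, top_percent)
```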


Finally, we want to identify the most popular occupation for those who earn >50K in India. This time I decided to use filters to find the needed values, then group by occupation. I then aggregated the groups by size, essentially counting the rows for each occupation, and put that in a dataframe. From there I grabbed the row with the maximum count, assigned it to a new variable, and read the occupation off its index with iloc.

This makes me miss SQL.

In [10]:
india_df = my_df[(my_df["native-country"] == "India") & (my_df["salary"] == ">50K")]
count_df = india_df.groupby("occupation")
count_df_agg = count_df.agg(np.size)

max_value = count_df_agg[(count_df_agg["age"]==count_df_agg["age"].max())]

pop_oc = (max_value.iloc[0].name)
print(pop_oc)

Prof-specialty
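In hindsight, the groupby, aggregate, and max steps can probably be squeezed into value_counts() followed by idxmax(), which reads a bit closer to a SQL GROUP BY ... ORDER BY COUNT(*). A sketch on made-up rows (`toy_df` is hypothetical, not the census data):

```python
import pandas as pd

# Hypothetical toy data standing in for the census file
toy_df = pd.DataFrame({
    "native-country": ["India", "India", "India", "India", "Iran"],
    "salary": [">50K", ">50K", ">50K", "<=50K", ">50K"],
    "occupation": ["Prof-specialty", "Prof-specialty", "Sales", "Sales", "Sales"],
})

# Keep only high earners from India
mask = (toy_df["native-country"] == "India") & (toy_df["salary"] == ">50K")

# Count rows per occupation, then take the label with the highest count
top_occupation = toy_df.loc[mask, "occupation"].value_counts().idxmax()
print(top_occupation)
```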
