## **FCC Data Analysis using Python Certification - Project 2**
### Analyzing demographic data from a 1994 census database
#### -By Azeen Hodekar
##### The task is to answer the following questions:
1. How many people of each race are represented in this dataset? This should be a Pandas series with race names as the index labels. (race column)
2. What is the average age of men?
3. What is the percentage of people who have a Bachelor's degree?
4. What percentage of people with advanced education (Bachelors, Masters, or Doctorate) make more than 50K?
5. What percentage of people without advanced education make more than 50K?
6. What is the minimum number of hours a person works per week?
7. What percentage of the people who work the minimum number of hours per week have a salary of more than 50K?
8. What country has the highest percentage of people that earn >50K and what is that percentage?
9. Identify the most popular occupation for those who earn >50K in India.

In [1]:
#Importing the libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
df=pd.read_csv("adult.data.csv")

In [3]:
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,salary
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [4]:
df.tail()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,salary
32556,27,Private,257302,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,<=50K
32557,40,Private,154374,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,>50K
32558,58,Private,151910,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,<=50K
32559,22,Private,201490,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Male,0,0,20,United-States,<=50K
32560,52,Self-emp-inc,287927,HS-grad,9,Married-civ-spouse,Exec-managerial,Wife,White,Female,15024,0,40,United-States,>50K


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       32561 non-null  object
 2   fnlwgt          32561 non-null  int64 
 3   education       32561 non-null  object
 4   education-num   32561 non-null  int64 
 5   marital-status  32561 non-null  object
 6   occupation      32561 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital-gain    32561 non-null  int64 
 11  capital-loss    32561 non-null  int64 
 12  hours-per-week  32561 non-null  int64 
 13  native-country  32561 non-null  object
 14  salary          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB


In [6]:
df.shape

(32561, 15)

In [7]:
df.describe()

Unnamed: 0,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week
count,32561.0,32561.0,32561.0,32561.0,32561.0,32561.0
mean,38.581647,189778.4,10.080679,1077.648844,87.30383,40.437456
std,13.640433,105550.0,2.57272,7385.292085,402.960219,12.347429
min,17.0,12285.0,1.0,0.0,0.0,1.0
25%,28.0,117827.0,9.0,0.0,0.0,40.0
50%,37.0,178356.0,10.0,0.0,0.0,40.0
75%,48.0,237051.0,12.0,0.0,0.0,45.0
max,90.0,1484705.0,16.0,99999.0,4356.0,99.0


#### Question 1:
- How many people of each race are represented in this dataset? This should be a Pandas series with race names as the index labels. (race column)

In [17]:
#For this we can use the value_counts() function
race_count=df.value_counts('race')
print(race_count)

race
White                 27816
Black                  3124
Asian-Pac-Islander     1039
Amer-Indian-Eskimo      311
Other                   271
Name: count, dtype: int64
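
As a side note, the same count is often written by calling `value_counts()` on the column itself, which also returns a Series indexed by the race labels. The sketch below uses a small made-up frame (`toy` and its rows are assumptions for illustration, not the census file):

```python
import pandas as pd

# Hypothetical toy rows standing in for the census data
toy = pd.DataFrame({"race": ["White", "Black", "White", "Other", "White"]})

# Series.value_counts() returns a Series whose index holds the race labels,
# sorted from most to least frequent
toy_race_count = toy["race"].value_counts()
print(toy_race_count)
```

On the real data this would be `df['race'].value_counts()`.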


#### Question 2
- What is the average age of men?

In [18]:
#For this we can use mean() function
mean_age=df.loc[df.sex=='Male','age'].mean()
#Rounding to one decimal place
avg_age_men=round(mean_age,1)
print("The average age of men is",avg_age_men)

The average age of men is 39.4
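
If the average age for every sex is of interest, a `groupby` gives all of them in one pass. This is only a sketch on made-up rows (the `toy` frame is an assumption, not the census data):

```python
import pandas as pd

# Hypothetical toy rows standing in for the census data
toy = pd.DataFrame({
    "sex": ["Male", "Female", "Male", "Female"],
    "age": [30, 40, 50, 20],
})

# Mean age per sex, rounded to one decimal place
mean_age_by_sex = toy.groupby("sex")["age"].mean().round(1)
print(mean_age_by_sex)
```

The value for men can then be read off with `mean_age_by_sex["Male"]`.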


#### Question 3
- What is the percentage of people who have a Bachelor's degree?

In [22]:
#For this we will find the ratio of people having bachelors to total people
ratio_bach=df.loc[df.education=="Bachelors",'education'].count()/df.education.count()
#We'll now multiply this ratio by 100 and round it to one decimal place
percentage_bachelors=round(100*ratio_bach,1)
print("The percentage of people who have a bachelor's degree is", percentage_bachelors)

The percentage of people who have a bachelor's degree is 16.4
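
A shorter way to get a percentage like this is to take the mean of a boolean mask, since `True` counts as 1 and `False` as 0. A minimal sketch on made-up rows (the `toy` frame is an assumption for illustration):

```python
import pandas as pd

# Hypothetical toy rows standing in for the census data
toy = pd.DataFrame(
    {"education": ["Bachelors", "HS-grad", "Masters", "Bachelors", "11th"]}
)

# Boolean mask: True where the person holds a Bachelor's degree
is_bachelor = toy["education"] == "Bachelors"

# The mean of a boolean Series is the fraction of True values
toy_percentage_bachelors = round(100 * is_bachelor.mean(), 1)
print(toy_percentage_bachelors)
```

Here two of the five toy rows match, so the result is 40.0.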


#### Question 4
- What percentage of people with advanced education (Bachelors, Masters, or Doctorate) make more than 50K?

In [26]:
#For this we will create a new dataframe with people having a higher education (Bachelors, Masters, or Doctorate)
higher_education=df[df.education.isin(['Bachelors', 'Masters', 'Doctorate'])]
higher_education

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,salary
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
5,37,Private,284582,Masters,14,Married-civ-spouse,Exec-managerial,Wife,White,Female,0,0,40,United-States,<=50K
8,31,Private,45781,Masters,14,Never-married,Prof-specialty,Not-in-family,White,Female,14084,0,50,United-States,>50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32538,38,Private,139180,Bachelors,13,Divorced,Prof-specialty,Unmarried,Black,Female,15020,0,45,United-States,>50K
32539,71,?,287372,Doctorate,16,Married-civ-spouse,?,Husband,White,Male,0,0,10,United-States,>50K
32544,31,Private,199655,Masters,14,Divorced,Other-service,Not-in-family,Other,Female,0,0,30,United-States,<=50K
32553,32,Private,116138,Masters,14,Never-married,Tech-support,Not-in-family,Asian-Pac-Islander,Male,0,0,11,Taiwan,<=50K


In [45]:
#From this dataframe we have to find the people who make more than 50K,
#i.e. the ratio of people earning more than 50K to the total number of people in this dataframe
higher_education_rich_ratio=higher_education.loc[higher_education.salary=='>50K','salary'].count()/higher_education.salary.count()
#To find percentage we will multiply this by 100 and round it to one decimal place
higher_education_rich_percentage=round(higher_education_rich_ratio*100,1)
print("The percentage of people with advanced education (Bachelors, Masters, or Doctorate) that make more than 50K is",higher_education_rich_percentage)

The percentage of people with advanced education (Bachelors, Masters, or Doctorate) that make more than 50K is 46.5


#### Question 5
- What percentage of people without advanced education make more than 50K?

In [46]:
#For this we will create a new dataframe with people NOT having a higher education (Bachelors, Masters, or Doctorate)
lower_education=df[~df.education.isin(['Bachelors', 'Masters', 'Doctorate'])]
lower_education

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,salary
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
6,49,Private,160187,9th,5,Married-spouse-absent,Other-service,Not-in-family,Black,Female,0,0,16,Jamaica,<=50K
7,52,Self-emp-not-inc,209642,HS-grad,9,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,45,United-States,>50K
10,37,Private,280464,Some-college,10,Married-civ-spouse,Exec-managerial,Husband,Black,Male,0,0,80,United-States,>50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,27,Private,257302,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,<=50K
32557,40,Private,154374,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,>50K
32558,58,Private,151910,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,<=50K
32559,22,Private,201490,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Male,0,0,20,United-States,<=50K


In [47]:
#From this dataframe we have to find the people who make more than 50K,
#i.e. the ratio of people earning more than 50K to the total number of people in this dataframe
lower_education_rich_ratio=lower_education.loc[lower_education.salary=='>50K','salary'].count()/lower_education.salary.count()
#To find percentage we will multiply this by 100 and round it to one decimal place
lower_education_rich_percentage=round(lower_education_rich_ratio*100,1)
print("The percentage of people without advanced education (Bachelors, Masters, or Doctorate) that make more than 50K is",lower_education_rich_percentage)

The percentage of people without advanced education (Bachelors, Masters, or Doctorate) that make more than 50K is 17.4
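
Questions 4 and 5 ask for the same quantity on two complementary groups, so both can also be computed together by grouping a boolean ">50K" mask by education level. The sketch below runs on made-up rows; the `toy` frame and the group labels `"advanced"` and `"other"` are assumptions for illustration:

```python
import pandas as pd

# Hypothetical toy rows standing in for the census data
toy = pd.DataFrame({
    "education": ["Bachelors", "Masters", "HS-grad", "11th", "Doctorate", "HS-grad"],
    "salary": [">50K", "<=50K", "<=50K", "<=50K", ">50K", ">50K"],
})

# True where the person has a Bachelors, Masters, or Doctorate
advanced = toy["education"].isin(["Bachelors", "Masters", "Doctorate"])
# Readable group labels instead of True/False
group = advanced.map({True: "advanced", False: "other"})

# True where the person earns more than 50K
rich = toy["salary"] == ">50K"

# Mean of the boolean mask per group = share of high earners per group
pct_rich = (rich.groupby(group).mean() * 100).round(1)
print(pct_rich)
```

In this toy frame two of three people with advanced education and one of three without earn more than 50K.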


#### Question 6
- What is the minimum number of hours a person works per week?

In [48]:
min_work_hours=df['hours-per-week'].min()
print("The minimum number of hours a person works per week is", min_work_hours)

The minimum number of hours a person works per week is 1


#### Question 7
- What percentage of the people who work the minimum number of hours per week have a salary of more than 50K?

In [74]:
#For this we will count the people who work the minimum hours and earn more than 50K,
#then divide by the number of all people who work the minimum hours
num_min_workers=df.loc[((df['hours-per-week']==min_work_hours) & (df.salary=='>50K')),'salary'].count()
rich_ratio=num_min_workers/df.loc[df['hours-per-week']==min_work_hours,'salary'].count()
rich_percentage=round(100*rich_ratio,3)
print("Percentage of the people who work the minimum number of hours per week and have a salary of more than 50K is",rich_percentage)

Percentage of the people who work the minimum number of hours per week and have a salary of more than 50K is 10.0
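
The same ratio can be obtained by filtering to the minimum-hours rows first and then taking the mean of a boolean salary mask. A sketch on made-up rows (the `toy` frame and the variable names are assumptions for illustration):

```python
import pandas as pd

# Hypothetical toy rows standing in for the census data
toy = pd.DataFrame({
    "hours-per-week": [1, 40, 1, 20, 1, 1],
    "salary": [">50K", "<=50K", "<=50K", ">50K", "<=50K", "<=50K"],
})

# Smallest number of weekly hours in the frame
toy_min_hours = toy["hours-per-week"].min()

# Keep only the people who work that minimum
toy_min_workers = toy[toy["hours-per-week"] == toy_min_hours]

# Share of those people earning more than 50K, as a percentage
toy_rich_percentage = round(100 * (toy_min_workers["salary"] == ">50K").mean(), 1)
print(toy_rich_percentage)
```

One of the four minimum-hours rows in the toy frame earns more than 50K, giving 25.0.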


#### Question 8
- What country has the highest percentage of people that earn >50K and what is that percentage?

In [61]:
#For this we will first create a df with people earning more than 50K
df_high_income = df[df['salary'] == '>50K']
#We will now calculate percentages of people by country who earn more than 50K
percentage_by_country=df_high_income['native-country'].value_counts()/df['native-country'].value_counts()*100
#We will now find that country by idxmax()
highest_earning_country = percentage_by_country.idxmax()
highest_earning_country_percentage = round((percentage_by_country.max()),1)
print(f"The country that has the highest percentage of people that earn more than 50K is {highest_earning_country} and that percentage is {highest_earning_country_percentage}")

The country that has the highest percentage of people that earn more than 50K is Iran and that percentage is 41.9
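
An alternative that avoids dividing two `value_counts()` results is to group a boolean ">50K" mask by country and take the mean. One difference worth noting: a country with no high earners is missing from the numerator in the division approach and so becomes `NaN`, whereas here it comes out as 0. The sketch uses made-up rows; the `toy` frame and the country names `A`, `B`, `C` are assumptions:

```python
import pandas as pd

# Hypothetical toy rows standing in for the census data
toy = pd.DataFrame({
    "native-country": ["A", "A", "B", "B", "B", "C"],
    "salary": [">50K", "<=50K", ">50K", ">50K", "<=50K", "<=50K"],
})

# True where the person earns more than 50K
rich = toy["salary"] == ">50K"

# Share of high earners per country, as a percentage
pct_by_country = rich.groupby(toy["native-country"]).mean() * 100

top_country = pct_by_country.idxmax()
top_percentage = round(pct_by_country.max(), 1)
print(top_country, top_percentage)
```

In the toy frame country `B` has two high earners out of three, so it ranks first.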


#### Question 9
- Identify the most popular occupation for those who earn >50K in India.

In [67]:
#For this we will create a df with people whose native country is India and who earn more than 50K
high_earning_indian=df[(df.salary=='>50K') & (df['native-country']=='India')]
high_earning_indian

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,salary
11,30,State-gov,141297,Bachelors,13,Married-civ-spouse,Prof-specialty,Husband,Asian-Pac-Islander,Male,0,0,40,India,>50K
968,48,Private,164966,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,Asian-Pac-Islander,Male,0,0,40,India,>50K
1327,52,Private,168381,HS-grad,9,Widowed,Other-service,Unmarried,Asian-Pac-Islander,Female,0,0,40,India,>50K
7258,42,State-gov,102343,Prof-school,15,Married-civ-spouse,Prof-specialty,Husband,Asian-Pac-Islander,Male,0,0,72,India,>50K
7285,54,State-gov,93449,Masters,14,Married-civ-spouse,Prof-specialty,Husband,Asian-Pac-Islander,Male,0,0,40,India,>50K
8124,36,Private,172104,Prof-school,15,Never-married,Prof-specialty,Not-in-family,Other,Male,0,0,40,India,>50K
9939,43,Federal-gov,325706,Prof-school,15,Married-civ-spouse,Prof-specialty,Husband,Asian-Pac-Islander,Male,0,0,50,India,>50K
10590,35,Private,98283,Prof-school,15,Never-married,Prof-specialty,Not-in-family,Asian-Pac-Islander,Male,0,0,40,India,>50K
10661,59,Private,122283,Prof-school,15,Married-civ-spouse,Prof-specialty,Husband,Asian-Pac-Islander,Male,99999,0,40,India,>50K
10736,30,Private,243190,Prof-school,15,Married-civ-spouse,Prof-specialty,Husband,Asian-Pac-Islander,Male,0,0,20,India,>50K


In [69]:
#Now we will find the value counts of occupations of these people
occupation_frequency = high_earning_indian['occupation'].value_counts()
occupation_frequency

occupation
Prof-specialty      25
Exec-managerial      8
Other-service        2
Tech-support         2
Transport-moving     1
Sales                1
Adm-clerical         1
Name: count, dtype: int64

In [71]:
#Now we can find the most popular occupation by idxmax()
top_IN_occupation = occupation_frequency.idxmax()
print("The most popular occupation for those who earn more than 50K in India is",top_IN_occupation)

The most popular occupation for those who earn more than 50K in India is Prof-specialty
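
The filter and the lookup can also be combined using `Series.mode()`, which returns the most frequent value or values. Taking `.iloc[0]` picks the first one, which matters only when there is a tie. A sketch on made-up rows (the `toy` frame is an assumption for illustration):

```python
import pandas as pd

# Hypothetical toy rows standing in for the census data
toy = pd.DataFrame({
    "native-country": ["India", "India", "India", "Cuba", "India"],
    "salary": [">50K", ">50K", ">50K", ">50K", "<=50K"],
    "occupation": ["Prof-specialty", "Sales", "Prof-specialty", "Sales", "Sales"],
})

# Rows for people from India who earn more than 50K
mask = (toy["salary"] == ">50K") & (toy["native-country"] == "India")

# mode() returns a Series of the most frequent value(s); take the first
toy_top_occupation = toy.loc[mask, "occupation"].mode().iloc[0]
print(toy_top_occupation)
```

Only the three rows matching both conditions are counted, so `Sales` does not win despite appearing three times overall.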
