### Introduction
The dataset contains demographic data extracted from the [1994 Census database](https://www.census.gov/en.html).

The data is organized into the following columns:
**age, workclass, fnlwgt, education, education-num, marital-status, occupation, relationship, race, sex, capital-gain, capital-loss, hours-per-week, native-country, salary.**

Our aim is to answer the following questions and do an exploratory data analysis.  
1. How many people of each race are represented in this dataset? This should be a Pandas series with race names as the index labels. (`race` column)
2. What is the average age of men?
3. What is the percentage of people who have a Bachelor's degree?
4. What percentage of people with advanced education (`Bachelors`, `Masters`, or `Doctorate`) make more than 50K?
5. What percentage of people without advanced education make more than 50K?
6. What is the minimum number of hours a person works per week?
7. What percentage of the people who work the minimum number of hours per week have a salary of more than 50K?
8. What country has the highest percentage of people that earn >50K and what is that percentage?
9. Identify the most popular occupation for those who earn >50K in India.

### Dataset Source

Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

The following code loads the necessary packages into the Jupyter Notebook.

In [2]:
import pandas as pd
pd.plotting.register_matplotlib_converters()
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
print("Setup Complete")

Setup Complete


Let's read the dataset and preview the first few rows to see what the data looks like.

In [3]:
adult_data = pd.read_csv("adult.data.csv")
adult_data.head(5)

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,salary
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


First, we need to find the shape of our dataset. The result has the format **(number of rows, number of columns)**.

In [4]:
adult_data.shape

(32561, 15)

We can explore the dataset info as follows.

In [5]:
adult_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       32561 non-null  object
 2   fnlwgt          32561 non-null  int64 
 3   education       32561 non-null  object
 4   education-num   32561 non-null  int64 
 5   marital-status  32561 non-null  object
 6   occupation      32561 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital-gain    32561 non-null  int64 
 11  capital-loss    32561 non-null  int64 
 12  hours-per-week  32561 non-null  int64 
 13  native-country  32561 non-null  object
 14  salary          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB
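
`info()` reports no null values, yet the per-country output later in this notebook lists `?` as a `native-country` value, which suggests that missing entries may be stored as the string `?`. The following is a minimal sketch of counting such placeholders per column. It runs on a hypothetical toy frame named `sample`, not on the real `adult_data`.

```python
import pandas as pd

# Hypothetical toy frame standing in for adult_data (not real census rows).
sample = pd.DataFrame({
    "workclass": ["Private", "State-gov", "Private"],
    "native-country": ["United-States", "?", "Cuba"],
})

# Compare every cell with the "?" string, then count the matches per column.
placeholder_counts = (sample == "?").sum()
print(placeholder_counts)
```

Running the same comparison on `adult_data` would show which columns, if any, contain the placeholder.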


### Analyze the data set.

### Question 1:
1. Let's figure out how many people of each race are represented in this dataset.

In [6]:
adult_data.race.value_counts()

White                 27816
Black                  3124
Asian-Pac-Islander     1039
Amer-Indian-Eskimo      311
Other                   271
Name: race, dtype: int64
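
The raw counts can also be expressed as shares of the whole dataset. The sketch below assumes a hypothetical toy frame named `sample` in place of `adult_data` and uses the `normalize=True` option of `value_counts`.

```python
import pandas as pd

# Hypothetical toy sample standing in for adult_data (not real census rows).
sample = pd.DataFrame({"race": ["White", "White", "Black", "White", "Other"]})

# normalize=True returns fractions instead of raw counts;
# multiplying by 100 turns them into percentages.
race_share = (sample["race"].value_counts(normalize=True) * 100).round(1)
print(race_share)
```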

We can analyze the `sex` column to identify how many Males and Females are represented in this dataset.

In [7]:
adult_data["sex"].value_counts()

Male      21790
Female    10771
Name: sex, dtype: int64

### Question 2:
2. The average age of Men can be calculated as follows.

In [8]:
group = adult_data.groupby(['sex']).age.mean()
average_age_men = round(group['Male'])
average_age_men

39
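
An alternative to grouping by `sex` is to filter the rows for men directly and average only those. This is a sketch on a hypothetical toy frame named `sample`, not the notebook's actual data.

```python
import pandas as pd

# Hypothetical toy sample standing in for adult_data.
sample = pd.DataFrame({
    "sex": ["Male", "Female", "Male", "Female"],
    "age": [40, 30, 39, 25],
})

# Keep only the rows where sex is "Male", then average their ages.
average_age_men = sample.loc[sample["sex"] == "Male", "age"].mean()
print(round(average_age_men, 1))
```

The `groupby` version is handy when the averages for every group are needed; the mask version avoids computing groups that are not used.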

Below, we can see how many people hold each education qualification in this dataset.

In [9]:
adult_data.education.value_counts()

HS-grad         10501
Some-college     7291
Bachelors        5355
Masters          1723
Assoc-voc        1382
11th             1175
Assoc-acdm       1067
10th              933
7th-8th           646
Prof-school       576
9th               514
12th              433
Doctorate         413
5th-6th           333
1st-4th           168
Preschool          51
Name: education, dtype: int64

### Question 3:
3. Let's calculate the percentage of people who have a Bachelor's degree.

a. Total number of people who have a Bachelor's degree.

In [10]:
bachelors = adult_data.education.value_counts()
num = bachelors.loc['Bachelors']  # look up the count by its label

b. Total number of people.

In [11]:
total = adult_data.shape[0]

c. Calculating the percentage of people who have a Bachelor's Degree.

In [12]:
round(num/total *100, 1)

16.4
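
The three steps above can also be collapsed into one expression, because the mean of a boolean Series equals the fraction of `True` values. A minimal sketch, assuming a hypothetical toy frame named `sample`:

```python
import pandas as pd

# Hypothetical toy sample standing in for adult_data.
sample = pd.DataFrame(
    {"education": ["Bachelors", "HS-grad", "Masters", "Bachelors", "11th"]}
)

# True where the person holds a Bachelor's degree; the mean is the share of True.
bachelors_pct = round((sample["education"] == "Bachelors").mean() * 100, 1)
print(bachelors_pct)
```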

### Question 4 and 5:
4. Finding the percentage of people with higher education (Bachelors, Masters, or Doctorate) who make more than 50K.
5. Finding the percentage of people without higher education who make more than 50K.

Let's find the total number of people who have higher education and the total number of people who do not.

In [13]:
education = ['Bachelors','Masters','Doctorate']

higher_education = adult_data[adult_data['education'].isin(education)]
lower_education = adult_data[~adult_data['education'].isin(education)]

higher_education_total = higher_education.shape[0]
lower_education_total = lower_education.shape[0]

print("Total number of people who have higher education:", higher_education_total)
print("Total number of people who have lower education:", lower_education_total)

Total number of people who have higher education: 7491
Total number of people who have lower education: 25070


Next, we count how many people in each group (with and without higher education) make more than 50K.

In [14]:
higher_education_salary = higher_education[higher_education['salary']=='>50K'].shape[0]
lower_education_salary = lower_education[lower_education['salary']=='>50K'].shape[0]

print("Total number of people who earn more than 50K with higher education qualifications:", higher_education_salary)
print("Total number of people who earn more than 50K without higher education qualifications:", lower_education_salary)

Total number of people who earn more than 50K with higher education qualifications: 3486
Total number of people who earn more than 50K without higher education qualifications: 4355


Percentage can be calculated as follows.

In [15]:
higher_education_rich = round((higher_education_salary / higher_education_total) * 100,1)
lower_education_rich = round((lower_education_salary / lower_education_total) * 100, 1)

print("Rich higher education percentage:", higher_education_rich,"%")
print("Rich lower education percentage:", lower_education_rich,"%")

Rich higher education percentage: 46.5 %
Rich lower education percentage: 17.4 %


So 46.5% of people with higher education make more than 50K, compared with 17.4% of people without it.
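
Both percentages can also be computed in a single pass by grouping a boolean "earns more than 50K" Series by an education label. This is only a sketch on a hypothetical toy frame named `sample`; the group labels `advanced` and `other` are illustrative names, not part of the dataset.

```python
import pandas as pd

# Hypothetical toy sample standing in for adult_data.
sample = pd.DataFrame({
    "education": ["Bachelors", "Masters", "HS-grad", "11th", "Doctorate", "HS-grad"],
    "salary": [">50K", "<=50K", "<=50K", "<=50K", ">50K", ">50K"],
})

# Label each row by whether its education level counts as advanced.
advanced = sample["education"].isin(["Bachelors", "Masters", "Doctorate"])
group_label = advanced.map({True: "advanced", False: "other"})

# Boolean Series: True where the person earns more than 50K.
earns_over_50k = sample["salary"] == ">50K"

# The mean of a boolean Series within each group is that group's share of True.
rich_pct = (earns_over_50k.groupby(group_label).mean() * 100).round(1)
print(rich_pct)
```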

### Question 6:
6. Finding the minimum number of hours a person works per week.

In [16]:
min_work_hours = adult_data['hours-per-week'].min()
print("Minimum number of hours a person works per week:", min_work_hours)

Minimum number of hours a person works per week: 1


### Question 7:
7. The percentage of people working the minimum number of hours per week who have a salary of more than 50K.

Let's find the number of people who work the minimum number of hours per week and, among them, the number who have a salary of more than 50K.

In [19]:
total_min_hour = adult_data[adult_data['hours-per-week']==min_work_hours].shape[0]
min_hour_salary = adult_data[(adult_data['salary']=='>50K') & (adult_data['hours-per-week']==min_work_hours)]
min_number = min_hour_salary.shape[0]

Percentage can be calculated as follows.

In [31]:
print("Percentage of people working the minimum number of hours per week who earn more than 50K:", (min_number/total_min_hour)*100, "%")

Percentage of people working the minimum number of hours per week who earn more than 50K: 10.0 %
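
The same ratio can be obtained by filtering once and taking the mean of a boolean comparison. A minimal sketch, assuming a hypothetical toy frame named `sample` rather than the real data:

```python
import pandas as pd

# Hypothetical toy sample standing in for adult_data.
sample = pd.DataFrame({
    "hours-per-week": [1, 40, 1, 1, 20, 1],
    "salary": ["<=50K", ">50K", ">50K", "<=50K", "<=50K", "<=50K"],
})

# Keep only the people who work the minimum number of hours.
min_hours = sample["hours-per-week"].min()
min_workers = sample[sample["hours-per-week"] == min_hours]

# Share of those people who earn more than 50K, as a percentage.
rich_min_pct = (min_workers["salary"] == ">50K").mean() * 100
print(rich_min_pct)
```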


### Question 8:
8. The country with the highest percentage of people that earn >50K, and that percentage.

In [23]:
highest_salary = adult_data[(adult_data['salary']=='>50K')]
earning_country_counts = highest_salary['native-country'].value_counts()
country_counts = adult_data['native-country'].value_counts()
earning_percentage = earning_country_counts/country_counts * 100
highest_earning_country = earning_percentage.idxmax()
highest_earning_country_percentage = round(earning_percentage.max(),1)

In [33]:
earning_percentage

?                             25.042882
Cambodia                      36.842105
Canada                        32.231405
China                         26.666667
Columbia                       3.389831
Cuba                          26.315789
Dominican-Republic             2.857143
Ecuador                       14.285714
El-Salvador                    8.490566
England                       33.333333
France                        41.379310
Germany                       32.116788
Greece                        27.586207
Guatemala                      4.687500
Haiti                          9.090909
Holand-Netherlands                  NaN
Honduras                       7.692308
Hong                          30.000000
Hungary                       23.076923
India                         40.000000
Iran                          41.860465
Ireland                       20.833333
Italy                         34.246575
Jamaica                       12.345679
Japan                         38.709677


In [28]:
print("Country with the highest percentage of people that earn >50K:", highest_earning_country)

Country with the highest percentage of people that earn >50K: Iran


In [29]:
print("Percentage: ",highest_earning_country_percentage)

Percentage:  41.9
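
Dividing one `value_counts` result by another leaves `NaN` for any country that has no >50K earners, as the `Holand-Netherlands` row in the output above shows. `idxmax` skips `NaN`, so the answer is unaffected, but grouping a boolean Series by country yields `0.0` for such countries instead. The sketch below uses a hypothetical toy frame named `sample`; its rows are made up for illustration.

```python
import pandas as pd

# Hypothetical toy sample standing in for adult_data.
sample = pd.DataFrame({
    "native-country": ["Iran", "Iran", "India", "India", "India", "Peru"],
    "salary": [">50K", "<=50K", ">50K", "<=50K", "<=50K", "<=50K"],
})

# True where the person earns more than 50K.
earns_over_50k = sample["salary"] == ">50K"

# Per-country mean of the boolean Series = share of >50K earners in that country.
pct_by_country = earns_over_50k.groupby(sample["native-country"]).mean() * 100

print(pct_by_country.idxmax(), round(pct_by_country.max(), 1))
```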


### Question 9:
9. The most popular occupation for those who earn >50K in India.

In [36]:
popular_occu = adult_data[(adult_data['salary']=='>50K') & (adult_data['native-country']=='India')]['occupation'].value_counts()
popular_occu.idxmax()

'Prof-specialty'
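
To see how far ahead the top occupation is, the full counts can be inspected before taking `idxmax`. A minimal sketch on a hypothetical toy frame named `sample`, with made-up rows:

```python
import pandas as pd

# Hypothetical toy sample standing in for adult_data.
sample = pd.DataFrame({
    "native-country": ["India", "India", "India", "Cuba", "India"],
    "salary": [">50K", ">50K", "<=50K", ">50K", ">50K"],
    "occupation": ["Prof-specialty", "Prof-specialty", "Sales",
                   "Sales", "Exec-managerial"],
})

# Rows for people from India who earn more than 50K.
mask = (sample["salary"] == ">50K") & (sample["native-country"] == "India")

# value_counts sorts by frequency, so the first entries are the most common.
occupation_counts = sample.loc[mask, "occupation"].value_counts()
print(occupation_counts.head(3))
```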