## Demographic Data Analyzer

In this challenge you must analyze demographic data using Pandas. You are given a dataset of demographic data that was extracted from the 1994 Census database. A sample of the data is displayed further down, right after the CSV file is loaded.

You must use Pandas to answer the following questions:

- How many people of each race are represented in this dataset? This should be a Pandas series with race names as the index labels. (race column)
- What is the average age of men?
- What is the percentage of people who have a Bachelor's degree?
- What percentage of people with advanced education (Bachelors, Masters, or Doctorate) make more than 50K?
- What percentage of people without advanced education make more than 50K?
- What is the minimum number of hours a person works per week?
- What percentage of the people who work the minimum number of hours per week have a salary of more than 50K?
- What country has the highest percentage of people that earn >50K and what is that percentage?
- Identify the most popular occupation for those who earn >50K in India.

Use the starter code in the file demographic_data_analyzer. Update the code so all variables set to "None" are set to the appropriate calculation or code. Round all decimals to the nearest tenth.

Unit tests are written for you under test_module.py.

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv("datas/adult.data.csv")
df

|  | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 32556 | 27 | Private | 257302 | Assoc-acdm | 12 | Married-civ-spouse | Tech-support | Wife | White | Female | 0 | 0 | 38 | United-States | <=50K |
| 32557 | 40 | Private | 154374 | HS-grad | 9 | Married-civ-spouse | Machine-op-inspct | Husband | White | Male | 0 | 0 | 40 | United-States | >50K |
| 32558 | 58 | Private | 151910 | HS-grad | 9 | Widowed | Adm-clerical | Unmarried | White | Female | 0 | 0 | 40 | United-States | <=50K |
| 32559 | 22 | Private | 201490 | HS-grad | 9 | Never-married | Adm-clerical | Own-child | White | Male | 0 | 0 | 20 | United-States | <=50K |


In [3]:
df.isnull().sum()

age               0
workclass         0
fnlwgt            0
education         0
education-num     0
marital-status    0
occupation        0
relationship      0
race              0
sex               0
capital-gain      0
capital-loss      0
hours-per-week    0
native-country    0
salary            0
dtype: int64
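
`isnull()` only counts true `NaN` values, so a string placeholder such as the `'?'` that appears among the `native-country` values in this notebook is not reported. The sketch below shows one way such placeholders could be counted. The `sample` frame and its contents are hypothetical stand-ins, not rows from the real file.

```python
import pandas as pd

# Hypothetical stand-in data; the real notebook loads adult.data.csv instead.
sample = pd.DataFrame({
    "native-country": ["United-States", "?", "India", "?"],
    "occupation": ["Sales", "Sales", "?", "Tech-support"],
})

# isnull() sees no missing values because '?' is an ordinary string.
null_counts = sample.isnull().sum()

# Comparing the whole frame to '?' gives a boolean frame; summing counts the hits per column.
placeholder_counts = (sample == "?").sum()
print(placeholder_counts)
```

Whether `'?'` rows should be excluded depends on the question being answered; the challenge answers in this notebook keep them.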

![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

### How many people of each race are represented in this dataset?

In [4]:
search = [df[df['race'] == race]['race'].count() for race in df.race.unique()]
race_count = pd.Series(search, index = df.race.unique())
print(race_count)

White                 27816
Black                  3124
Asian-Pac-Islander     1039
Amer-Indian-Eskimo      311
Other                   271
dtype: int64
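
The same tally can also be produced with `Series.value_counts()`, which returns a Series indexed by the distinct labels and sorted by count. This is a minimal sketch on a hypothetical `sample` frame, offered as an alternative to the list comprehension above rather than as the required solution.

```python
import pandas as pd

# Hypothetical stand-in for the 'race' column of the census data.
sample = pd.DataFrame({"race": ["White", "Black", "White", "Other", "White", "Black"]})

# value_counts() counts each distinct label and sorts the result in descending order.
race_count = sample["race"].value_counts()
print(race_count)
```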


![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

### What is the average age of men?

In [5]:
df['sex'].unique()

array(['Male', 'Female'], dtype=object)

In [6]:
average_age_men = round(df.query('sex == "Male"')['age'].mean(),1)
average_age_men

39.4
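
As a cross-check, the mean age can be computed for every value of `sex` at once with `groupby`, and the male figure read off the result. The sketch below assumes a small hypothetical `sample` frame in place of the real data.

```python
import pandas as pd

# Hypothetical stand-in rows; ages are made up for illustration.
sample = pd.DataFrame({
    "sex": ["Male", "Female", "Male", "Female"],
    "age": [30, 41, 45, 20],
})

# Mean age per group, rounded to the nearest tenth as the challenge asks.
mean_age_by_sex = sample.groupby("sex")["age"].mean().round(1)
average_age_men = mean_age_by_sex["Male"]
print(mean_age_by_sex)
```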

![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

### What is the percentage of people who have a Bachelor's degree?

In [7]:
df['education'].unique()

array(['Bachelors', 'HS-grad', '11th', 'Masters', '9th', 'Some-college',
       'Assoc-acdm', 'Assoc-voc', '7th-8th', 'Doctorate', 'Prof-school',
       '5th-6th', '10th', '1st-4th', 'Preschool', '12th'], dtype=object)

In [8]:
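# len() counts rows directly; count()[0] relies on positional Series indexing, which newer pandas versions deprecate
bachelors_degree = len(df.query('education == "Bachelors"'))
all_degree = len(df)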
percentage_bachelors = round(bachelors_degree/all_degree*100,1)
percentage_bachelors

16.4
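
A percentage like this one can also be read as the mean of a boolean mask, since `True` counts as 1 and `False` as 0. The following sketch illustrates the idea on a hypothetical `sample` frame; the variable name `is_bachelor` is introduced here for illustration only.

```python
import pandas as pd

# Hypothetical stand-in for the 'education' column.
sample = pd.DataFrame({"education": ["Bachelors", "HS-grad", "Masters", "Bachelors", "11th"]})

# Boolean mask: True where the row holds a Bachelor's degree.
is_bachelor = sample["education"] == "Bachelors"

# The mean of a boolean Series is the fraction of True values.
percentage_bachelors = round(is_bachelor.mean() * 100, 1)
print(percentage_bachelors)
```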

![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

### What percentage of people with advanced education (Bachelors, Masters, or Doctorate) make more than 50K?

In [9]:
advanced_education_list = ['Bachelors', 'Masters', 'Doctorate']

df_higher_education = df.loc[df['education'].isin(advanced_education_list)]
df_higher_education_rich = df_higher_education.query("salary == '>50K'")

In [10]:
count_higher_education = len(df_higher_education)
count_higher_education_rich = len(df_higher_education_rich)

In [11]:
higher_education = round(count_higher_education/len(df)*100,1)
higher_education

23.0

In [12]:
higher_education_rich = round(count_higher_education_rich/count_higher_education*100,1)
higher_education_rich

46.5

![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

### What percentage of people without advanced education make more than 50K?

In [13]:
not_advanced_education_list = ['HS-grad', '11th', '9th', 'Some-college',
         'Assoc-acdm', 'Assoc-voc', '7th-8th', 'Prof-school',
         '5th-6th', '10th', '1st-4th', 'Preschool', '12th']


In [14]:
df_lower_education = df.loc[df['education'].isin(not_advanced_education_list)]

df_lower_education_rich = df_lower_education.query("salary == '>50K'")

count_lower_education = len(df_lower_education)
count_lower_education_rich = len(df_lower_education_rich)

In [15]:
lower_education = round(count_lower_education/len(df)*100,1)
lower_education

77.0

In [16]:
lower_education_rich = round(count_lower_education_rich/count_lower_education*100,1)
lower_education_rich 

17.4
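
Because "without advanced education" is simply the complement of the advanced group, the second list of education levels does not have to be typed out by hand. The sketch below negates the `isin` mask with `~` instead. The `sample` frame and the mask names `advanced` and `rich` are assumptions made for this example.

```python
import pandas as pd

# Hypothetical stand-in rows; the real notebook uses the full census frame.
sample = pd.DataFrame({
    "education": ["Bachelors", "HS-grad", "Masters", "Prof-school", "Doctorate", "11th"],
    "salary": [">50K", "<=50K", ">50K", ">50K", "<=50K", "<=50K"],
})

# True for rows with advanced education, False for everything else.
advanced = sample["education"].isin(["Bachelors", "Masters", "Doctorate"])
# True for rows earning more than 50K.
rich = sample["salary"] == ">50K"

# Share of >50K earners inside each group; ~advanced selects the complement.
higher_education_rich = round(rich[advanced].mean() * 100, 1)
lower_education_rich = round(rich[~advanced].mean() * 100, 1)
print(higher_education_rich, lower_education_rich)
```

Negating the mask guarantees that the two groups cover every row exactly once, even if a new education label shows up in the data.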

![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

### What is the minimum number of hours a person works per week?

In [17]:
min_work_hours = df['hours-per-week'].min()
min_work_hours

1

![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

### What percentage of the people who work the minimum number of hours per week have a salary of more than 50K?

In [18]:
# Reuse min_work_hours from the previous step instead of hard-coding 1
df_num_min_workers = df[df['hours-per-week'] == min_work_hours]
num_min_workers = len(df_num_min_workers.query("salary == '>50K'"))
num_min_workers

2

In [33]:
rich_percentage = round(num_min_workers/len(df_num_min_workers)*100, 1)
rich_percentage

10.0
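
The two steps (find the minimum, then measure the share of >50K earners among those rows) can be chained so that the minimum is never hard-coded. This is a sketch on a hypothetical `sample` frame, not the notebook's actual data.

```python
import pandas as pd

# Hypothetical stand-in rows.
sample = pd.DataFrame({
    "hours-per-week": [1, 40, 1, 20, 1, 1],
    "salary": [">50K", "<=50K", "<=50K", ">50K", "<=50K", "<=50K"],
})

# Step 1: smallest number of weekly hours in the data.
min_work_hours = sample["hours-per-week"].min()

# Step 2: keep only the people working exactly that many hours.
min_workers = sample[sample["hours-per-week"] == min_work_hours]

# Step 3: fraction of those people earning >50K, as a rounded percentage.
rich_percentage = round((min_workers["salary"] == ">50K").mean() * 100, 1)
print(min_work_hours, rich_percentage)
```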

![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

### What country has the highest percentage of people that earn >50K and what is that percentage?

In [20]:
df['native-country'].unique()

array(['United-States', 'Cuba', 'Jamaica', 'India', '?', 'Mexico',
       'South', 'Puerto-Rico', 'Honduras', 'England', 'Canada', 'Germany',
       'Iran', 'Philippines', 'Italy', 'Poland', 'Columbia', 'Cambodia',
       'Thailand', 'Ecuador', 'Laos', 'Taiwan', 'Haiti', 'Portugal',
       'Dominican-Republic', 'El-Salvador', 'France', 'Guatemala',
       'China', 'Japan', 'Yugoslavia', 'Peru',
       'Outlying-US(Guam-USVI-etc)', 'Scotland', 'Trinadad&Tobago',
       'Greece', 'Nicaragua', 'Vietnam', 'Hong', 'Ireland', 'Hungary',
       'Holand-Netherlands'], dtype=object)

In [21]:
df_group1 = df[['salary','native-country']].groupby('native-country').count()
df_group1.rename(columns = {'salary':'all'},inplace = True)

In [22]:
df_salary = df.query('salary == ">50K"')
df_group2 = df_salary[['salary','native-country']].groupby('native-country').count()
df_group2.rename(columns = {'salary':'rich'},inplace = True)

In [48]:
result = pd.concat([df_group1, df_group2], axis=1)
result['percentage'] = round(result['rich']/result['all']*100,1)
result.sort_values('percentage',ascending=False, inplace = True)
result

| native-country | all | rich | percentage |
| --- | --- | --- | --- |
| Iran | 43 | 18.0 | 41.9 |
| France | 29 | 12.0 | 41.4 |
| India | 100 | 40.0 | 40.0 |
| Taiwan | 51 | 20.0 | 39.2 |
| Japan | 62 | 24.0 | 38.7 |
| Yugoslavia | 16 | 6.0 | 37.5 |
| Cambodia | 19 | 7.0 | 36.8 |
| Italy | 73 | 25.0 | 34.2 |
| England | 90 | 30.0 | 33.3 |
| Canada | 121 | 39.0 | 32.2 |


In [24]:
highest_earning_country = result.iloc[0].name
highest_earning_country

'Iran'

In [25]:
highest_earning_country_percentage = result.iloc[0]['percentage']
highest_earning_country_percentage

41.9
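
The per-country percentage can also be computed in a single pass by grouping a boolean "earns >50K" mask by country and taking the mean. One difference from the concat-based table: a country with no >50K earners gets 0.0 here, whereas concatenating two separate counts would typically leave `NaN` in the `rich` column for that country. The sketch uses a hypothetical `sample` frame and an illustrative name, `rich_share`.

```python
import pandas as pd

# Hypothetical stand-in rows; counts are made up for illustration.
sample = pd.DataFrame({
    "native-country": ["Iran", "Iran", "India", "India", "India", "Cuba"],
    "salary": [">50K", "<=50K", ">50K", "<=50K", "<=50K", "<=50K"],
})

# Mean of the boolean mask within each country = share of >50K earners there.
rich_share = (
    sample["salary"].eq(">50K")
    .groupby(sample["native-country"])
    .mean()
    .mul(100)
    .round(1)
)

# idxmax() returns the index label (country) of the largest percentage.
highest_earning_country = rich_share.idxmax()
highest_earning_country_percentage = rich_share.max()
print(highest_earning_country, highest_earning_country_percentage)
```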

![purple-divider](https://user-images.githubusercontent.com/7065401/52071927-c1cd7100-2562-11e9-908a-dde91ba14e59.png)

### Identify the most popular occupation for those who earn >50K in India

In [26]:
df_india_salary = df[df['native-country'] == 'India'].query('salary == ">50K"')

# Count rows per occupation and sort so the most common occupation comes first
df_group_india = df_india_salary.groupby('occupation').count()[['salary']].sort_values('salary', ascending=False)
df_group_india

| occupation | salary |
| --- | --- |
| Prof-specialty | 25 |
| Exec-managerial | 8 |
| Other-service | 2 |
| Tech-support | 2 |
| Adm-clerical | 1 |
| Sales | 1 |
| Transport-moving | 1 |


In [27]:
top_IN_occupation = df_group_india.iloc[0].name
top_IN_occupation 

'Prof-specialty'
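
A more compact route to the same answer combines both filters into one boolean mask and lets `value_counts()` do the counting. The sketch below assumes a hypothetical `sample` frame; only the column names match the real data.

```python
import pandas as pd

# Hypothetical stand-in rows.
sample = pd.DataFrame({
    "native-country": ["India", "India", "India", "Cuba", "India"],
    "occupation": ["Prof-specialty", "Sales", "Prof-specialty", "Sales", "Prof-specialty"],
    "salary": [">50K", ">50K", ">50K", ">50K", "<=50K"],
})

# Rows that are both from India and earning more than 50K.
mask = (sample["native-country"] == "India") & (sample["salary"] == ">50K")

# Count occupations among those rows and take the label with the highest count.
top_IN_occupation = sample.loc[mask, "occupation"].value_counts().idxmax()
print(top_IN_occupation)
```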