# Exploratory Data Analysis with Pandas

We will use Pandas to answer some questions about the [Adult](https://archive.ics.uci.edu/ml/datasets/Adult) dataset.


In [482]:
import pandas as pd
import numpy as np

In [483]:
data = pd.read_csv('./adult.data.csv')
data = data.rename_axis('id').rename_axis('id', axis='columns')
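
The comparisons in this notebook match values such as `' Female'` and `' >50K'` with a leading space, because the raw file separates fields with a comma followed by a blank. A minimal sketch of one way to avoid that, assuming a hypothetical three-row sample in place of the real file, is to pass `skipinitialspace=True` to `pd.read_csv`. The rest of this notebook does not use this option, so its filters keep the leading space.

```python
import io
import pandas as pd

# Hypothetical sample mimicking the "comma + space" layout of the Adult file.
raw = "age,sex,salary\n39, Male, <=50K\n28, Female, <=50K\n31, Female, >50K\n"

# skipinitialspace=True drops the blanks that follow each delimiter.
sample = pd.read_csv(io.StringIO(raw), skipinitialspace=True)

# Values can now be matched without the leading space.
n_female = int((sample['sex'] == 'Female').sum())
print(n_female)
```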

### 1. How many men and women (*sex* attribute) are represented in this dataset?

In [484]:
data['sex'].value_counts()

 Male      21790
 Female    10771
Name: sex, dtype: int64

### 2. What is the average age (*age* attribute) of women?

In [486]:
data.loc[(data['sex'] == ' Female'), 'age'].mean()

36.85823043357163

### 3. What is the percentage of German citizens (*native-country* attribute)?

In [487]:
data['native-country'].value_counts(normalize=True)[' Germany']

0.004207487485028101

### 4. What are the mean and standard deviation of age for those who earn more than 50K per year (*salary* attribute) and for those who earn less than 50K per year?

In [488]:
print(data.loc[data['salary'] == ' <=50K', 'age'].std())
print(data.loc[data['salary'] == ' >50K', 'age'].std())

14.02008849082488
10.519027719851826
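
The cell above prints only the standard deviations. As a hedged sketch, both statistics can be obtained in one step with `groupby` and `agg`. The `toy` frame below is a made-up four-row stand-in for `data`, so its numbers are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for the notebook's `data` frame.
toy = pd.DataFrame({
    'salary': [' <=50K', ' <=50K', ' >50K', ' >50K'],
    'age': [20, 30, 40, 60],
})

# One row per salary group, one column per statistic.
age_stats = toy.groupby('salary')['age'].agg(['mean', 'std'])
print(age_stats)
```

Applied to the real frame, the same call would be `data.groupby('salary')['age'].agg(['mean', 'std'])`.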


### 5. Is it true that people who earn more than 50K have at least higher education (*education* attribute: Bachelors, Prof-school, Assoc-acdm, Assoc-voc, Masters or Doctorate)?

In [489]:
higher_education = [' Bachelors', ' Prof-school', ' Assoc-acdm', ' Assoc-voc', ' Masters', ' Doctorate']
# True only if every person earning more than 50K has one of the education levels listed above.
print(data.loc[data['salary'] == ' >50K', 'education'].isin(higher_education).all())

False


### 6. Show age statistics for each race (*race* attribute) and each gender (*sex* attribute). Use the *groupby()* and *describe()* functions. Find the maximum age of men of the *Amer-Indian-Eskimo* race.

In [490]:
d = data.groupby([' race', 'sex'])
d.describe()

| race | sex | age count | age mean | age std | age min | age 25% | age 50% | age 75% | age max | fnlwgt count | fnlwgt mean | ... | capital-loss 75% | capital-loss max | hours-per-week count | hours-per-week mean | hours-per-week std | hours-per-week min | hours-per-week 25% | hours-per-week 50% | hours-per-week 75% | hours-per-week max |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Amer-Indian-Eskimo | Female | 119.0 | 37.117647 | 13.114991 | 17.0 | 27.0 | 36.0 | 46.0 | 80.0 | 119.0 | 112950.731092 | ... | 0.0 | 1721.0 | 119.0 | 36.579832 | 11.046509 | 4.0 | 35.0 | 40.0 | 40.0 | 84.0 |
| Amer-Indian-Eskimo | Male | 192.0 | 37.208333 | 12.049563 | 17.0 | 28.0 | 35.0 | 45.0 | 82.0 | 192.0 | 125715.364583 | ... | 0.0 | 1980.0 | 192.0 | 42.197917 | 11.59628 | 3.0 | 40.0 | 40.0 | 45.0 | 84.0 |
| Asian-Pac-Islander | Female | 346.0 | 35.089595 | 12.300845 | 17.0 | 25.0 | 33.0 | 43.75 | 75.0 | 346.0 | 147452.075145 | ... | 0.0 | 2258.0 | 346.0 | 37.439306 | 12.479459 | 1.0 | 35.0 | 40.0 | 40.0 | 99.0 |
| Asian-Pac-Islander | Male | 693.0 | 39.073593 | 12.883944 | 18.0 | 29.0 | 37.0 | 46.0 | 90.0 | 693.0 | 166175.865801 | ... | 0.0 | 2457.0 | 693.0 | 41.468975 | 12.387563 | 1.0 | 40.0 | 40.0 | 45.0 | 99.0 |
| Black | Female | 1555.0 | 37.854019 | 12.637197 | 17.0 | 28.0 | 37.0 | 46.0 | 90.0 | 1555.0 | 212971.387781 | ... | 0.0 | 4356.0 | 1555.0 | 36.834084 | 9.41996 | 2.0 | 35.0 | 40.0 | 40.0 | 99.0 |
| Black | Male | 1569.0 | 37.6826 | 12.882612 | 17.0 | 27.0 | 36.0 | 46.0 | 90.0 | 1569.0 | 242920.644997 | ... | 0.0 | 2824.0 | 1569.0 | 39.997451 | 10.909413 | 1.0 | 40.0 | 40.0 | 40.0 | 99.0 |
| Other | Female | 109.0 | 31.678899 | 11.631599 | 17.0 | 23.0 | 29.0 | 39.0 | 74.0 | 109.0 | 172519.642202 | ... | 0.0 | 1740.0 | 109.0 | 35.926606 | 10.300761 | 6.0 | 30.0 | 40.0 | 40.0 | 65.0 |
| Other | Male | 162.0 | 34.654321 | 11.355531 | 17.0 | 26.0 | 32.0 | 42.0 | 77.0 | 162.0 | 213679.104938 | ... | 0.0 | 2179.0 | 162.0 | 41.851852 | 11.084779 | 5.0 | 40.0 | 40.0 | 40.0 | 98.0 |
| White | Female | 8642.0 | 36.811618 | 14.329093 | 17.0 | 25.0 | 35.0 | 46.0 | 90.0 | 8642.0 | 183549.966906 | ... | 0.0 | 4356.0 | 8642.0 | 36.296691 | 12.190951 | 1.0 | 30.0 | 40.0 | 40.0 | 99.0 |
| White | Male | 19174.0 | 39.652498 | 13.436029 | 17.0 | 29.0 | 38.0 | 49.0 | 90.0 | 19174.0 | 188987.386148 | ... | 0.0 | 3770.0 | 19174.0 | 42.668822 | 12.194633 | 1.0 | 40.0 | 40.0 | 50.0 | 99.0 |


In [491]:
d.describe()['age']['max'][(' Amer-Indian-Eskimo', ' Male')]

82.0
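
Running `describe()` over every column just to read one value does more work than needed. A lighter sketch, assuming a small made-up `toy` frame with plain column names (the notebook's own frame uses `' race'` and values with a leading space), aggregates only the `age` column:

```python
import pandas as pd

# Hypothetical stand-in for the notebook's `data` frame.
toy = pd.DataFrame({
    'race': ['Amer-Indian-Eskimo', 'Amer-Indian-Eskimo', 'White', 'White'],
    'sex': ['Male', 'Female', 'Male', 'Male'],
    'age': [45, 50, 38, 61],
})

# Maximum age per (race, sex) pair, then a lookup by the pair of interest.
max_age = toy.groupby(['race', 'sex'])['age'].max()
result = max_age.loc[('Amer-Indian-Eskimo', 'Male')]
print(result)
```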

### 7. Considering married and single people (*marital-status* attribute), find the proportion of those who earn a lot (>50K).

In [492]:
s = data.loc[data['salary'] == ' >50K', 'marital-status'].value_counts()
# Ratio of single (never married, separated, divorced, widowed) to married people among high earners
(s[' Never-married']+s[' Separated']+s[' Divorced']+s[' Widowed'])/(s[' Married-civ-spouse']+s[' Married-spouse-absent']+s[' Married-AF-spouse'])

0.16404394299287411
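
The value above is the ratio of single to married people among high earners. Another reading of the question is the share of high earners within each marital group. The sketch below illustrates that reading on a made-up `toy` frame. It assumes values without the leading space and treats every status starting with `Married` as married; with the notebook's data the values would first need `.str.strip()`.

```python
import pandas as pd

# Hypothetical stand-in for the notebook's `data` frame.
toy = pd.DataFrame({
    'marital-status': ['Married-civ-spouse', 'Married-AF-spouse', 'Never-married',
                       'Divorced', 'Widowed', 'Married-civ-spouse'],
    'salary': ['>50K', '<=50K', '<=50K', '>50K', '<=50K', '>50K'],
})

# Label each row as married or single.
group = toy['marital-status'].str.startswith('Married').map({True: 'married', False: 'single'})

# The mean of a boolean column is the fraction of True values in each group.
share = (toy['salary'] == '>50K').groupby(group).mean()
print(share)
```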

### 8. What is the maximum number of hours a person works per week (*hours-per-week* attribute)? How many people work that number of hours, and what percentage of them earn a lot (>50K)?

In [493]:
maxi = data['hours-per-week'].max()
maxi

99

In [494]:
max_hours_salaries = data.loc[data['hours-per-week'] == maxi, 'salary']
max_hours_salaries.count()

85

In [495]:
max_hours_salaries[max_hours_salaries == ' >50K'].count()/max_hours_salaries.count()

0.29411764705882354

### 9. Determine the average working time (*hours-per-week*) of those who earn little (<=50K) and a lot (*salary* attribute) for each country (*native-country* attribute). How many of the individuals are from Japan?

In [496]:
data.groupby(['salary','native-country'])['hours-per-week'].mean()

salary  native-country             
 <=50K   ?                             40.164760
         Cambodia                      41.416667
         Canada                        37.914634
         China                         37.381818
         Columbia                      38.684211
         Cuba                          37.985714
         Dominican-Republic            42.338235
         Ecuador                       38.041667
         El-Salvador                   36.030928
         England                       40.483333
         France                        41.058824
         Germany                       39.139785
         Greece                        41.809524
         Guatemala                     39.360656
         Haiti                         36.325000
         Holand-Netherlands            40.000000
         Honduras                      34.333333
         Hong                          39.142857
         Hungary                       31.300000
         India                   

In [497]:
data.loc[data['native-country'] == ' Japan','native-country'].count()

62
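
To compare the two salary groups side by side for each country, the grouped means can be reshaped with `unstack`. This is only a sketch on a made-up `toy` frame with values free of leading spaces; it also shows a boolean sum as an alternative way to count rows for one country.

```python
import pandas as pd

# Hypothetical stand-in for the notebook's `data` frame.
toy = pd.DataFrame({
    'salary': ['<=50K', '>50K', '<=50K', '>50K', '<=50K'],
    'native-country': ['Japan', 'Japan', 'Canada', 'Canada', 'Canada'],
    'hours-per-week': [40, 50, 30, 45, 40],
})

# Rows are countries, columns are the two salary groups.
mean_hours = (
    toy.groupby(['native-country', 'salary'])['hours-per-week']
    .mean()
    .unstack('salary')
)
print(mean_hours)

# Number of individuals from Japan.
n_japan = int((toy['native-country'] == 'Japan').sum())
print(n_japan)
```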

### 10. Build a pivot table that compares the average hours worked per week (*hours-per-week* attribute) across work class (*workclass* attribute) and education (*education* attribute).

In [498]:
pd.pivot_table(data, values='hours-per-week', index=' workclass', columns=['education'], aggfunc='mean')

| workclass / education | 10th | 11th | 12th | 1st-4th | 5th-6th | 7th-8th | 9th | Assoc-acdm | Assoc-voc | Bachelors | Doctorate | HS-grad | Masters | Preschool | Prof-school | Some-college |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| ? | 33.15 | 31.711864 | 34.525 | 34.916667 | 37.233333 | 32.527778 | 32.392157 | 26.234043 | 33.885246 | 32.416185 | 29.533333 | 31.896617 | 31.541667 | 34.0 | 28.388889 | 31.363813 |
| Federal-gov | 42.166667 | 27.444444 | 35.0 | NaN | 40.0 | 35.0 | 40.0 | 41.036364 | 41.315789 | 42.617925 | 50.1875 | 40.228137 | 42.656716 | NaN | 49.896552 | 40.429134 |
| Local-gov | 38.225806 | 30.555556 | 36.631579 | 33.0 | 34.555556 | 36.535714 | 37.086957 | 40.409091 | 41.651163 | 42.140461 | 43.592593 | 39.894632 | 43.827485 | 32.5 | 45.37931 | 40.204134 |
| Never-worked | 35.0 | 10.0 | NaN | NaN | NaN | 35.0 | NaN | NaN | NaN | NaN | NaN | 40.0 | NaN | NaN | NaN | 22.0 |
| Private | 36.827338 | 33.930661 | 35.141141 | 38.955882 | 39.488722 | 39.910377 | 38.258398 | 40.871056 | 41.225871 | 42.698676 | 48.668508 | 40.507198 | 44.458613 | 37.682927 | 48.101167 | 38.731645 |
| Self-emp-inc | 38.368421 | 39.285714 | 42.571429 | 40.0 | 42.5 | 45.714286 | 46.9 | 47.914286 | 49.947368 | 49.40293 | 54.685714 | 47.007168 | 52.911392 | NaN | 50.839506 | 49.362832 |
| Self-emp-not-inc | 43.507463 | 40.483333 | 44.789474 | 36.769231 | 36.105263 | 42.882979 | 41.676471 | 44.211268 | 46.916667 | 44.177945 | 41.74 | 45.435335 | 43.16129 | NaN | 45.870229 | 44.117284 |
| State-gov | 39.076923 | 33.357143 | 39.0 | 20.0 | 31.25 | 31.8 | 39.666667 | 36.853659 | 41.086957 | 39.692593 | 46.820225 | 39.425373 | 40.775148 | 24.0 | 50.129032 | 34.698462 |
| Without-pay | NaN | NaN | NaN | NaN | NaN | 50.0 | NaN | 50.0 | NaN | NaN | NaN | 28.0 | NaN | NaN | NaN | 35.333333 |


### 11. Build a pivot table that compares the average hours worked per week (*hours-per-week* attribute) across occupation (*occupation* attribute) and gender (*sex* attribute).

In [499]:
pd.pivot_table(data, values='hours-per-week', index='occupation', columns=['sex'], aggfunc='mean')

| occupation / sex | Female | Male |
| --- | --- | --- |
| ? | 29.976219 | 33.525948 |
| Adm-clerical | 36.741033 | 39.240065 |
| Armed-Forces | NaN | 40.666667 |
| Craft-repair | 39.869369 | 42.443642 |
| Exec-managerial | 41.517688 | 46.371173 |
| Farming-fishing | 37.784615 | 47.634015 |
| Handlers-cleaners | 36.103659 | 38.198176 |
| Machine-op-inspct | 38.929091 | 41.447658 |
| Other-service | 33.437778 | 36.223411 |
| Priv-house-serv | 32.489362 | 39.875 |


### 12. Show the standard deviation of hours worked by country (*native-country* attribute) and gender (*sex* attribute). Do the same for country (*native-country* attribute) and race (*race* attribute).

In [500]:
pd.pivot_table(data, values='hours-per-week', index='native-country', columns=['sex'], aggfunc='std')

| native-country / sex | Female | Male |
| --- | --- | --- |
| ? | 15.211396 | 10.782206 |
| Cambodia | 0.0 | 2.719528 |
| Canada | 13.485235 | 13.067702 |
| China | 12.976903 | 10.267628 |
| Columbia | 7.489002 | 9.461821 |
| Cuba | 11.72634 | 8.956611 |
| Dominican-Republic | 7.375636 | 13.521971 |
| Ecuador | 5.659309 | 9.112038 |
| El-Salvador | 8.451841 | 9.947265 |
| England | 15.903407 | 12.99048 |
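
The question also asks for the same breakdown by race, which the cell above does not cover. A minimal sketch of that second table follows, using a made-up `toy` frame. It assumes a column named `race`; elsewhere this notebook refers to that column as `' race'`, with a leading space.

```python
import pandas as pd

# Hypothetical stand-in for the notebook's `data` frame.
toy = pd.DataFrame({
    'native-country': ['Cuba', 'Cuba', 'Cuba', 'Cuba'],
    'race': ['White', 'White', 'Black', 'Black'],
    'hours-per-week': [40, 50, 30, 30],
})

# Same pivot as for gender, with race as the column variable.
std_by_race = pd.pivot_table(
    toy, values='hours-per-week', index='native-country',
    columns='race', aggfunc='std',
)
print(std_by_race)
```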


### 13. Add a *retired* column indicating retirement for women older than 60 and men older than 65 (*age* and *sex* attributes).

In [502]:
data['retired'] = (((data['age'] > 60) & (data['sex'] == ' Female')) | ((data['age'] > 65) & (data['sex'] == ' Male')))
data

| id | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary | retired |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K | False |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K | False |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K | False |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K | False |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K | False |
| 5 | 37 | Private | 284582 | Masters | 14 | Married-civ-spouse | Exec-managerial | Wife | White | Female | 0 | 0 | 40 | United-States | <=50K | False |
| 6 | 49 | Private | 160187 | 9th | 5 | Married-spouse-absent | Other-service | Not-in-family | Black | Female | 0 | 0 | 16 | Jamaica | <=50K | False |
| 7 | 52 | Self-emp-not-inc | 209642 | HS-grad | 9 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 45 | United-States | >50K | False |
| 8 | 31 | Private | 45781 | Masters | 14 | Never-married | Prof-specialty | Not-in-family | White | Female | 14084 | 0 | 50 | United-States | >50K | False |
| 9 | 42 | Private | 159449 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 5178 | 0 | 40 | United-States | >50K | False |


### 14. Compute the *z-score* of weekly working time (*hours-per-week* attribute) for each individual, taking as mean (*mean*) and standard deviation (*std*) the values obtained for the groups defined by country of origin (*native-country* attribute) and sex (*sex* attribute). Hint: the *z-score* of a value x is given by *(x-mean)/std*.

In [503]:
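# Note: this returns mean/std for each group, not a z-score per individual.
# A per-individual z-score needs (x - mean) / std evaluated within each group, e.g. via transform().
data.groupby(['native-country','sex'])['hours-per-week'].apply(lambda v: np.mean(v)/np.std(v))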


native-country               sex    
 ?                            Female     2.471034
                              Male       4.000325
 Cambodia                     Female          inf
                              Male      15.594311
 Canada                       Female     2.945268
                              Male       3.154937
 China                        Female     2.842668
                              Male       3.783037
 Columbia                     Female     5.154824
                              Male       4.283108
 Cuba                         Female     3.016280
                              Male       4.759127
 Dominican-Republic           Female     5.529950
                              Male       3.357208
 Ecuador                      Female     5.726682
                              Male       4.943294
 El-Salvador                  Female     4.191277
                              Male       3.818815
 England                      Female     2.637283
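
The output above holds one value per group (the ratio mean/std), whereas a z-score is defined per individual. A hedged sketch of the per-individual version uses `transform`, which broadcasts each group's statistic back to its rows. The `toy` frame is made up, and pandas' default sample standard deviation (`ddof=1`) is assumed, which differs from `np.std`'s default.

```python
import pandas as pd

# Hypothetical stand-in for the notebook's `data` frame.
toy = pd.DataFrame({
    'native-country': ['Cuba', 'Cuba', 'Cuba', 'Canada', 'Canada'],
    'sex': ['Male', 'Male', 'Male', 'Female', 'Female'],
    'hours-per-week': [30, 40, 50, 20, 40],
})

grouped = toy.groupby(['native-country', 'sex'])['hours-per-week']

# transform returns a Series aligned with the original rows,
# so the subtraction and division happen row by row.
toy['z-score'] = (toy['hours-per-week'] - grouped.transform('mean')) / grouped.transform('std')
print(toy)
```

Groups with a single member or with zero spread have no usable standard deviation, so their z-scores come out as NaN or infinite and would need separate handling.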
             

### 15. Compute the proportion of individuals who earn well by country (*native-country* attribute), sex (*sex* attribute) and race (*race* attribute).

In [508]:
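def proportion(att):
    # Count the rows earning more than 50K in each category of `att`
    # (iloc[:, 1] simply picks one of the per-column counts).
    query = data.groupby(['salary',att]).agg('count').loc[' >50K'].iloc[:, 1]
    # Divide by the total, giving each category's share of all high earners.
    return query.apply(lambda v: v/query.sum())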

In [509]:
proportion('native-country')

native-country
 ?                     0.018620
 Cambodia              0.000893
 Canada                0.004974
 China                 0.002551
 Columbia              0.000255
 Cuba                  0.003188
 Dominican-Republic    0.000255
 Ecuador               0.000510
 El-Salvador           0.001148
 England               0.003826
 France                0.001530
 Germany               0.005612
 Greece                0.001020
 Guatemala             0.000383
 Haiti                 0.000510
 Honduras              0.000128
 Hong                  0.000765
 Hungary               0.000383
 India                 0.005101
 Iran                  0.002296
 Ireland               0.000638
 Italy                 0.003188
 Jamaica               0.001275
 Japan                 0.003061
 Laos                  0.000255
 Mexico                0.004209
 Nicaragua             0.000255
 Peru                  0.000255
 Philippines           0.007780
 Poland                0.001530
 Portugal              0.

In [510]:
proportion('sex')

sex
 Female    0.150363
 Male      0.849637
Name:  workclass, dtype: float64

In [511]:
proportion(' race')

 race
 Amer-Indian-Eskimo    0.004591
 Asian-Pac-Islander    0.035200
 Black                 0.049356
 Other                 0.003188
 White                 0.907665
Name:  workclass, dtype: float64
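
The `proportion` function reports how the high earners are distributed across categories, so its values sum to 1. If the question is read instead as "what fraction of each category earns more than 50K", a sketch along the following lines could be used. The helper name `high_earner_share` and the `toy` frame are hypothetical, and the values are written without the leading space.

```python
import pandas as pd

# Hypothetical stand-in for the notebook's `data` frame.
toy = pd.DataFrame({
    'sex': ['Male', 'Male', 'Male', 'Female', 'Female'],
    'salary': ['>50K', '<=50K', '>50K', '<=50K', '>50K'],
})

def high_earner_share(frame, att):
    """Fraction of rows earning more than 50K within each category of `att`."""
    return (frame['salary'] == '>50K').groupby(frame[att]).mean()

share = high_earner_share(toy, 'sex')
print(share)
```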