## Introduction to Machine Learning

# <center> Part 1. Exploratory analysis with Pandas

**In this part we will use Pandas for an exploratory analysis of the [Adult](https://archive.ics.uci.edu/ml/datasets/Adult) dataset.**


Variables and their type:
- `age`: continuous.
- `workclass`: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
- `fnlwgt`: continuous.
- `education`: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
- `education-num`: continuous.
- `marital-status`: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
- `occupation`: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
- `relationship`: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
- `race`: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
- `sex`: Female, Male.
- `capital-gain`: continuous.
- `capital-loss`: continuous.
- `hours-per-week`: continuous.
- `native-country`: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.   
- `salary`: >50K, <=50K.

In [5]:
import numpy as np
import pandas as pd
pd.set_option('display.max_columns', 100)

%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns

import os
import warnings
warnings.filterwarnings('ignore')

In [9]:
# Build the path relative to the current working directory
path = os.path.join(os.getcwd(), 'Data', 'adult_data.csv')
data = pd.read_csv(path)
data.head()

|  | age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
| 1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |


In [8]:
# info() prints its summary directly and returns None, so it is not wrapped in print()
data.info()
print(data.shape)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       32561 non-null  object
 2   fnlwgt          32561 non-null  int64 
 3   education       32561 non-null  object
 4   education-num   32561 non-null  int64 
 5   marital-status  32561 non-null  object
 6   occupation      32561 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital-gain    32561 non-null  int64 
 11  capital-loss    32561 non-null  int64 
 12  hours-per-week  32561 non-null  int64 
 13  native-country  32561 non-null  object
 14  salary          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB
(32561, 15)


**1. How many women and men are in this dataset? (variable *sex*)**

In [9]:
data['sex'].value_counts()

Male      21790
Female    10771
Name: sex, dtype: int64

**2. What is the average age of women?**

In [11]:
data[data["sex"] == 'Female']['age'].mean()

36.85823043357163

**3. What proportion of people have Germany as their native country? (variable *native-country*)**

In [None]:
data['native-country'].value_counts(normalize=True)['Germany']

0.004207487485028101

**4. What are the mean and standard deviation of age for those who earn more than 50k per year?**

In [12]:
print(data[data["salary"] == '>50K']['age'].mean())
print(data[data["salary"] == '>50K']['age'].std())

44.24984058155847
10.519027719851826


**5. What are the mean and standard deviation of age for those who earn 50K or less per year?**

In [13]:
print(data[data["salary"] == '<=50K']['age'].mean())
print(data[data["salary"] == '<=50K']['age'].std())

36.78373786407767
14.02008849082488


**6. Is it true that people who earn more than 50K have at least a high school education? (variable *education*: *Bachelors, Prof-school, Assoc-acdm, Assoc-voc, Masters* or *Doctorate*)**

In [18]:
# Education levels listed in the question
array = ['Bachelors', 'Prof-school', 'Assoc-acdm', 'Assoc-voc', 'Masters', 'Doctorate']
# Count people who earn more than 50K but whose education is not one of the listed levels
p_condition = len(data[(data['salary'] == '>50K') & ~data['education'].isin(array)])
print('Number of people who earn more than 50K without one of the listed education levels =', p_condition)

Number of people who earn more than 50K without one of the listed education levels = 3306
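To see which education levels account for that count, one option is to apply `value_counts()` to the filtered rows. This is a sketch on a toy frame with made-up rows; the name `listed` is introduced here for illustration only.

```python
import pandas as pd

# Toy data (hypothetical rows, not from the Adult dataset)
toy = pd.DataFrame({
    'education': ['Bachelors', 'HS-grad', 'Masters', 'Some-college', 'HS-grad', '11th'],
    'salary': ['>50K', '>50K', '>50K', '>50K', '<=50K', '>50K'],
})

listed = ['Bachelors', 'Prof-school', 'Assoc-acdm', 'Assoc-voc', 'Masters', 'Doctorate']

# Keep high earners whose education is not in the list, then count each level
high_earners = toy[toy['salary'] == '>50K']
other_levels = high_earners.loc[~high_earners['education'].isin(listed), 'education'].value_counts()
print(other_levels)
```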


**7. Show age statistics for each race and gender using *groupby()* and *describe()*. Then find the maximum age of *Amer-Indian-Eskimo* men.**

In [30]:
# Only age is described here, for simplicity
data.groupby(['race', 'sex'])['age'].describe()

| race | sex | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|
| Amer-Indian-Eskimo | Female | 119.0 | 37.117647 | 13.114991 | 17.0 | 27.0 | 36.0 | 46.0 | 80.0 |
| Amer-Indian-Eskimo | Male | 192.0 | 37.208333 | 12.049563 | 17.0 | 28.0 | 35.0 | 45.0 | 82.0 |
| Asian-Pac-Islander | Female | 346.0 | 35.089595 | 12.300845 | 17.0 | 25.0 | 33.0 | 43.75 | 75.0 |
| Asian-Pac-Islander | Male | 693.0 | 39.073593 | 12.883944 | 18.0 | 29.0 | 37.0 | 46.0 | 90.0 |
| Black | Female | 1555.0 | 37.854019 | 12.637197 | 17.0 | 28.0 | 37.0 | 46.0 | 90.0 |
| Black | Male | 1569.0 | 37.6826 | 12.882612 | 17.0 | 27.0 | 36.0 | 46.0 | 90.0 |
| Other | Female | 109.0 | 31.678899 | 11.631599 | 17.0 | 23.0 | 29.0 | 39.0 | 74.0 |
| Other | Male | 162.0 | 34.654321 | 11.355531 | 17.0 | 26.0 | 32.0 | 42.0 | 77.0 |
| White | Female | 8642.0 | 36.811618 | 14.329093 | 17.0 | 25.0 | 35.0 | 46.0 | 90.0 |
| White | Male | 19174.0 | 39.652498 | 13.436029 | 17.0 | 29.0 | 38.0 | 49.0 | 90.0 |


In [44]:
# Filter once, then reuse the subset for both the maximum and its row label
amer_indian_men = data[(data['race'] == 'Amer-Indian-Eskimo') & (data['sex'] == 'Male')]
max_age = amer_indian_men['age'].max()
max_age_id = amer_indian_men['age'].idxmax()

print('Oldest Amer-Indian-Eskimo man')
print('Id:', max_age_id, 'Age:', max_age)

Oldest Amer-Indian-Eskimo man
Id: 12492 Age: 82
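The label returned by `idxmax()` can be passed to `.loc` to inspect the full row of the oldest person. A small sketch, assuming a toy frame with made-up rows:

```python
import pandas as pd

# Toy data (hypothetical rows, not from the Adult dataset)
toy = pd.DataFrame({
    'race': ['Amer-Indian-Eskimo', 'Amer-Indian-Eskimo', 'White', 'Amer-Indian-Eskimo'],
    'sex': ['Male', 'Female', 'Male', 'Male'],
    'age': [40, 80, 90, 82],
})

# Boolean mask selecting Amer-Indian-Eskimo men
mask = (toy['race'] == 'Amer-Indian-Eskimo') & (toy['sex'] == 'Male')

# idxmax() returns the index label of the maximum age within the masked rows
oldest_id = toy.loc[mask, 'age'].idxmax()
oldest_row = toy.loc[oldest_id]
print(oldest_row)
```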


In [48]:
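# The sex filter is needed as well; filtering on race alone would include women
data[data['race'].apply(lambda race: race == 'Amer-Indian-Eskimo') & (data['sex'] == 'Male')]['age'].max()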

82

The cell above shows another way to do it, using `apply`.

**8. Which category has the larger proportion of high earners (>50K): married or single men (variable *marital-status*)? Consider as married those whose *marital-status* starts with *Married* (Married-civ-spouse, Married-spouse-absent or Married-AF-spouse); everyone else is considered single.**

In [None]:
arr = ['Married-civ-spouse', 'Married-spouse-absent', 'Married-AF-spouse']
# Note: these filters cover both sexes and give counts of >50K earners, not proportions
single = data.loc[(data['salary'] == '>50K') & ~data['marital-status'].isin(arr)]
married = data.loc[(data['salary'] == '>50K') & data['marital-status'].isin(arr)]

In [None]:
print(single['salary'].value_counts(), 'Single')
print(married['salary'].value_counts(), 'Married')

>50K    1105
Name: salary, dtype: int64 Single
>50K    6736
Name: salary, dtype: int64 Married
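The question asks for proportions among men, which the counts above do not give directly. One possible approach, sketched on a toy frame with made-up rows, is to restrict to men, label each row as married or single, and take the mean of a boolean `>50K` indicator per group. The names `men`, `group` and `share` are illustrative.

```python
import pandas as pd

# Toy data (hypothetical rows, not from the Adult dataset)
toy = pd.DataFrame({
    'sex': ['Male'] * 6 + ['Female'] * 2,
    'marital-status': ['Married-civ-spouse', 'Married-civ-spouse', 'Married-AF-spouse',
                       'Never-married', 'Divorced', 'Never-married',
                       'Married-civ-spouse', 'Widowed'],
    'salary': ['>50K', '>50K', '<=50K', '<=50K', '>50K', '<=50K', '>50K', '<=50K'],
})

# Restrict to men, then label each row by whether the status starts with "Married"
men = toy[toy['sex'] == 'Male']
group = men['marital-status'].str.startswith('Married').map({True: 'Married', False: 'Single'})

# The mean of a boolean Series is the fraction of True values in each group
share = (men['salary'] == '>50K').groupby(group).mean()
print(share)
```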


The way this output is printed is rather ugly.

**9. What is the maximum number of hours a person works per week (variable *hours-per-week*)? How many people work that number of hours, and what percentage of them earn a lot (>50K)?**

In [None]:
# Enter your Python code here
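One way to approach this question is sketched below on a toy frame with made-up values: find the maximum, select the rows that reach it, and take the mean of a boolean `>50K` indicator. The variable names are illustrative, and this is not presented as the intended solution.

```python
import pandas as pd

# Toy data (hypothetical values, not from the Adult dataset)
toy = pd.DataFrame({
    'hours-per-week': [40, 99, 60, 99, 99, 20],
    'salary': ['<=50K', '>50K', '>50K', '<=50K', '<=50K', '<=50K'],
})

# Maximum weekly hours and the rows that reach it
max_hours = toy['hours-per-week'].max()
at_max = toy[toy['hours-per-week'] == max_hours]

# Number of people at the maximum and the percentage of them earning >50K
n_people = len(at_max)
pct_rich = 100 * (at_max['salary'] == '>50K').mean()
print(max_hours, n_people, round(pct_rich, 2))
```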

**10. Compute the number of working hours per week (*hours-per-week*) for those who earn little and those who earn a lot (*salary*), for each country (*native-country*). What are these figures for Japan?**

In [None]:
data[data['native-country']== 'Japan'].groupby(['salary'])['hours-per-week'].describe(percentiles=[])

| salary | count | mean | std | min | 50% | max |
|---|---|---|---|---|---|---|
| <=50K | 38.0 | 41.0 | 11.902759 | 10.0 | 40.0 | 65.0 |
| >50K | 24.0 | 47.958333 | 16.120414 | 21.0 | 42.5 | 99.0 |
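To get these figures for every country at once rather than only for Japan, the grouping can include `native-country` as well. A minimal sketch on a toy frame with made-up values:

```python
import pandas as pd

# Toy data (hypothetical values, not from the Adult dataset)
toy = pd.DataFrame({
    'native-country': ['Japan', 'Japan', 'Japan', 'Cuba', 'Cuba'],
    'salary': ['<=50K', '>50K', '>50K', '<=50K', '<=50K'],
    'hours-per-week': [40, 50, 60, 30, 50],
})

# Count and mean of weekly hours for each (country, salary) pair
hours_by_country = toy.groupby(['native-country', 'salary'])['hours-per-week'].agg(['count', 'mean'])

# Selecting on the first index level gives the rows for a single country
japan = hours_by_country.loc['Japan']
print(japan)
```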


In [None]:
data[data['age'] == data['age'].max()]
