## <center>  Exploratory data analysis with Pandas

### Unique values of all features ([Adult](https://archive.ics.uci.edu/ml/datasets/Adult)):
- `age`: continuous.
- `workclass`: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
- `fnlwgt`: continuous.
- `education`: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
- `education-num`: continuous.
- `marital-status`: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
- `occupation`: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
- `relationship`: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
- `race`: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
- `sex`: Female, Male.
- `capital-gain`: continuous.
- `capital-loss`: continuous.
- `hours-per-week`: continuous.
- `native-country`: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.   
- `salary`: >50K, <=50K.

In [1]:
import pandas as pd

In [2]:
import numpy as np

In [3]:
data = pd.read_csv('../input/adult.data.csv')
data.head()

,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,salary
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


Data dimensionality: how many rows and columns

In [4]:
print(data.shape)

(32561, 15)


In [5]:
print(data.columns)

Index(['age', 'workclass', 'fnlwgt', 'education', 'education-num',
       'marital-status', 'occupation', 'relationship', 'race', 'sex',
       'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
       'salary'],
      dtype='object')


In [6]:
print(data.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
age               32561 non-null int64
workclass         32561 non-null object
fnlwgt            32561 non-null int64
education         32561 non-null object
education-num     32561 non-null int64
marital-status    32561 non-null object
occupation        32561 non-null object
relationship      32561 non-null object
race              32561 non-null object
sex               32561 non-null object
capital-gain      32561 non-null int64
capital-loss      32561 non-null int64
hours-per-week    32561 non-null int64
native-country    32561 non-null object
salary            32561 non-null object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB
None


### <center>  Now I am going to perform some data exploration by answering important questions to see what I am dealing with. 

**1. How many men and women (*sex* feature) are represented in this dataset?** 

In [7]:
data["sex"].value_counts()

Male      21790
Female    10771
Name: sex, dtype: int64

I decided to look at how many men and women are represented in this dataset. As we can see, there are about twice as many men as women (21790 vs. 10771).
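The ratio can also be read off directly instead of comparing raw counts by eye. Below is a minimal sketch on a tiny made-up frame (the `toy` data is an assumption for illustration, not the Adult dataset); `value_counts(normalize=True)` returns fractions instead of counts.

```python
import pandas as pd

# Toy data, made up for illustration (not the Adult dataset)
toy = pd.DataFrame({"sex": ["Male", "Male", "Female", "Male", "Female", "Male"]})

counts = toy["sex"].value_counts()                # absolute counts per category
shares = toy["sex"].value_counts(normalize=True)  # fractions that sum to 1
ratio = counts["Male"] / counts["Female"]         # how many men per woman

print(shares)
print("Men per woman:", ratio)
```

On the real data the same two calls would be applied to `data["sex"]`.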

**2. What is the average age (*age* feature) of women?**

In [8]:
round(data[data["sex"] == 'Female']["age"].mean(), 1)

36.9

In [9]:
round(data[data["sex"] == 'Male']["age"].mean(), 1)

39.4

The average age of women in our dataset is approximately 37 years, and that of men is about 39 years.
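Both averages can be obtained in one call by grouping on `sex` rather than filtering twice. A small sketch, assuming a made-up `toy` frame with the same column names:

```python
import pandas as pd

# Toy data, made up for illustration (not the Adult dataset)
toy = pd.DataFrame({
    "sex": ["Female", "Male", "Female", "Male", "Male"],
    "age": [30, 20, 40, 40, 60],
})

# One mean per group instead of one boolean filter per group
mean_age = toy.groupby("sex")["age"].mean().round(1)
print(mean_age)
```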

**3. What is the percentage of German citizens (*native-country* feature)?**

In [16]:
ger_cit = (data['native-country'] == 'Germany').sum() / data.shape[0]
# or (data['native-country'] == 'Germany').value_counts(normalize = True)
print(ger_cit) 

0.004207487485028101


Only about 0.4% of the people in the dataset are from Germany.
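Since a boolean mask is just a column of `True`/`False` values, its mean is already the share of matching rows, which is equivalent to dividing the sum by the row count. A short sketch on made-up data (the `toy` frame is an assumption):

```python
import pandas as pd

# Toy data, made up for illustration (not the Adult dataset)
toy = pd.DataFrame({
    "native-country": ["Germany", "United-States", "United-States", "Japan"],
})

is_german = toy["native-country"] == "Germany"
share = is_german.mean()                   # same as is_german.sum() / len(toy)
print(f"{share:.2%} of rows are from Germany")
```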

**4-5. What are the mean and standard deviation of age for those who earn more than 50K per year (*salary* feature) and those who earn less than 50K per year?**

In [17]:
# Age of people who earn more than 50K per year
more = data[data['salary'] == '>50K']['age']
print("The average age for those who earn more than 50K per year is", int(round(more.mean(), 0)), "and standard deviation is", round(more.std(), 1))

The average age for those who earn more than 50K per year is 44 and standard deviation is 10.5


In [19]:
# Age of people who earn 50K or less per year
less = data[data['salary'] == '<=50K']['age']
print("The average age for those who earn 50K or less per year is", int(round(less.mean(), 0)), "and standard deviation is", round(less.std(), 1))

The average age for those who earn 50K or less per year is 37 and standard deviation is 14.0
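The two cells above repeat the same computation for each salary group. As a hedged alternative, `groupby` with `agg` produces both statistics for both groups at once. The `toy` frame below is made up for illustration; note that pandas' `std` is the sample standard deviation (`ddof=1`).

```python
import pandas as pd

# Toy data, made up for illustration (not the Adult dataset)
toy = pd.DataFrame({
    "salary": [">50K", ">50K", "<=50K", "<=50K", "<=50K"],
    "age": [40, 50, 20, 30, 40],
})

# Mean and sample standard deviation of age per salary group
age_stats = toy.groupby("salary")["age"].agg(["mean", "std"])
print(age_stats.round(1))
```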


**6. Is it true that people who earn more than 50K have at least high school education? (*education – Bachelors, Prof-school, Assoc-acdm, Assoc-voc, Masters* or *Doctorate* feature)**

In [20]:
data.loc[data['salary'] == '>50K', 'education'].unique()

array(['HS-grad', 'Masters', 'Bachelors', 'Some-college', 'Assoc-voc',
       'Doctorate', 'Prof-school', 'Assoc-acdm', '7th-8th', '12th',
       '10th', '11th', '9th', '5th-6th', '1st-4th'], dtype=object)

The answer is no. The list above includes levels below `HS-grad` (`1st-4th`, `5th-6th`, `7th-8th`, `9th`, `10th`, `11th`, `12th`), so some people who make more than 50K do not have a high school education.
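Rather than scanning the array of unique values by eye, the check can be expressed as a single boolean. This is a sketch only: the `below_hs` set encodes my assumption about which `education` values count as "below high school", and the `toy` frame is made up.

```python
import pandas as pd

# Assumption: these education levels count as "below high school"
below_hs = {"Preschool", "1st-4th", "5th-6th", "7th-8th",
            "9th", "10th", "11th", "12th"}

# Toy data, made up for illustration (not the Adult dataset)
toy = pd.DataFrame({
    "salary": [">50K", ">50K", "<=50K", ">50K"],
    "education": ["Masters", "9th", "HS-grad", "Bachelors"],
})

rich_edu = toy.loc[toy["salary"] == ">50K", "education"]
# True only if no high earner has an education level below high school
all_at_least_hs = not rich_edu.isin(below_hs).any()
print(all_at_least_hs)
```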

**7. Display age statistics for each race (*race* feature) and each gender (*sex* feature). Use *groupby()* and *describe()*. Find the maximum age of men of *Amer-Indian-Eskimo* race.**

In [21]:
data.groupby(['race', 'sex'])['age'].describe()

race,sex,count,mean,std,min,25%,50%,75%,max
Amer-Indian-Eskimo,Female,119.0,37.117647,13.114991,17.0,27.0,36.0,46.0,80.0
Amer-Indian-Eskimo,Male,192.0,37.208333,12.049563,17.0,28.0,35.0,45.0,82.0
Asian-Pac-Islander,Female,346.0,35.089595,12.300845,17.0,25.0,33.0,43.75,75.0
Asian-Pac-Islander,Male,693.0,39.073593,12.883944,18.0,29.0,37.0,46.0,90.0
Black,Female,1555.0,37.854019,12.637197,17.0,28.0,37.0,46.0,90.0
Black,Male,1569.0,37.6826,12.882612,17.0,27.0,36.0,46.0,90.0
Other,Female,109.0,31.678899,11.631599,17.0,23.0,29.0,39.0,74.0
Other,Male,162.0,34.654321,11.355531,17.0,26.0,32.0,42.0,77.0
White,Female,8642.0,36.811618,14.329093,17.0,25.0,35.0,46.0,90.0
White,Male,19174.0,39.652498,13.436029,17.0,29.0,38.0,49.0,90.0


As we can see the maximum age of men of Amer-Indian-Eskimo race is 82 years old.
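The value can also be pulled out of the `describe()` result programmatically, since the grouped result is indexed by `(race, sex)` pairs. A minimal sketch on a made-up `toy` frame:

```python
import pandas as pd

# Toy data, made up for illustration (not the Adult dataset)
toy = pd.DataFrame({
    "race": ["Amer-Indian-Eskimo", "Amer-Indian-Eskimo",
             "Amer-Indian-Eskimo", "White"],
    "sex": ["Male", "Male", "Female", "Male"],
    "age": [30, 82, 25, 90],
})

stats = toy.groupby(["race", "sex"])["age"].describe()
# Select one row by its (race, sex) index pair, then the "max" column
max_age = stats.loc[("Amer-Indian-Eskimo", "Male"), "max"]
print(max_age)
```

If only this one number is needed, filtering the rows and calling `.max()` on `age` avoids computing the full table.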

**8. Among whom is the proportion of those who earn a lot (>50K) greater: married or single men (*marital-status* feature)? Consider as married those who have a *marital-status* starting with *Married* (Married-civ-spouse, Married-spouse-absent or Married-AF-spouse), the rest are considered bachelors.**

In [22]:
# marital status count
data.groupby(['marital-status'])['sex'].value_counts()

marital-status         sex   
Divorced               Female     2672
                       Male       1771
Married-AF-spouse      Female       14
                       Male          9
Married-civ-spouse     Male      13319
                       Female     1657
Married-spouse-absent  Male        213
                       Female      205
Never-married          Male       5916
                       Female     4767
Separated              Female      631
                       Male        394
Widowed                Female      825
                       Male        168
Name: sex, dtype: int64

In [55]:
# not married men statistics
data.loc[(data['sex'] == 'Male') & (data['marital-status'].isin(['Never-married','Separated',
                                                             'Divorced','Widowed'])), 'salary'].value_counts()


<=50K    7552
>50K      697
Name: salary, dtype: int64


In [56]:
# married men statistics 
data.loc[(data['sex'] == 'Male') & (data['marital-status'].isin(['Married-civ-spouse', 
                                                                 'Married-spouse-absent', 
                                                                 'Married-AF-spouse'])), 'salary'].value_counts()


<=50K    7576
>50K     5965
Name: salary, dtype: int64

If we compare the salaries of married men and bachelors, we can see a significant difference: far more married men make a lot of money. Married men (>50K) = 5965 and single men (>50K) = 697. But the two groups differ in size, so how about the ratio?

In [66]:
# Totals taken from the two value_counts() outputs above
married_men_tot = 13541      # 7576 + 5965
not_married_men_tot = 8249   # 7552 + 697
rich_married_men = 5965
rich_not_married_men = 697

# Share (%) earning >50K: married men vs. not married men
print((rich_married_men / married_men_tot) * 100)
print((rich_not_married_men / not_married_men_tot) * 100)

44.05139945351156
8.449509031397746


Now we can see that 44% of married men make more than 50K, while only about 8% of unmarried men do. Final word: the proportion of rich men is much greater among married men than among bachelors. I assume it is good to be married?
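The percentages above were computed from numbers copied by hand out of earlier outputs. A sketch of how the same proportions could be derived directly from the frame, using the "starts with *Married*" rule from the question, follows. The `toy` data and the `"married"`/`"single"` labels are made up for illustration.

```python
import pandas as pd

# Toy data, made up for illustration (not the Adult dataset)
toy = pd.DataFrame({
    "sex": ["Male"] * 6 + ["Female"],
    "marital-status": ["Married-civ-spouse", "Married-AF-spouse",
                       "Never-married", "Divorced", "Widowed",
                       "Separated", "Married-civ-spouse"],
    "salary": [">50K", "<=50K", "<=50K", "<=50K", ">50K", "<=50K", ">50K"],
})

men = toy[toy["sex"] == "Male"]
# Married = marital-status starts with "Married"; everyone else is single
status = (men["marital-status"].str.startswith("Married")
          .map({True: "married", False: "single"}))
# Mean of a boolean column per group = share of high earners in that group
rich_pct = (men["salary"] == ">50K").groupby(status).mean() * 100
print(rich_pct)
```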

**9. What is the maximum number of hours a person works per week (*hours-per-week* feature)? How many people work such a number of hours, and what is the percentage of those who earn a lot (>50K) among them?**

In [67]:
# maximum number of hours a person works per week
data['hours-per-week'].max()

99

In [68]:
# How many people work such a number of hours?
workaholics = data[data['hours-per-week'] == data['hours-per-week'].max()].shape[0]
print("Total number of hard workers is:", workaholics)

Total number of hard workers is: 85


What is the percentage of those who earn a lot (>50K) among them?

In [69]:
pd.crosstab(data['hours-per-week'].max(), data['salary'],normalize=True )

row_0,<=50K,>50K
99,0.75919,0.24081


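The maximum number of hours worked is 99 per week, and 85 people work that many hours. The table above, however, does not describe those 85 people: `pd.crosstab` receives the scalar 99 as its first argument, so every row of the dataset falls into the single row `99`, and 0.759 / 0.241 are the salary shares over all 32,561 rows (no count out of 85 gives 0.75919). To get the share of high earners among the 85, the rows have to be filtered to `hours-per-week == 99` before normalizing.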
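A minimal sketch of computing the share of high earners only among the people who work the maximum number of hours: filter first, then normalize. The `toy` frame is made up for illustration, so its numbers say nothing about the Adult dataset.

```python
import pandas as pd

# Toy data, made up for illustration (not the Adult dataset)
toy = pd.DataFrame({
    "hours-per-week": [99, 99, 99, 99, 40, 40],
    "salary": [">50K", "<=50K", "<=50K", "<=50K", ">50K", ">50K"],
})

max_hours = toy["hours-per-week"].max()
hardest = toy[toy["hours-per-week"] == max_hours]          # only max-hour rows
shares = hardest["salary"].value_counts(normalize=True)    # shares within them

print("People working", max_hours, "hours:", len(hardest))
print(shares)
```

In this toy example the share of `>50K` among the 99-hour rows (0.25) differs from the share over all rows (0.5), which is why the filtering step matters.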

**10. Count the average time of work (*hours-per-week*) for those who earn a little and a lot (*salary*) for each country (*native-country*). What will these be for Japan?**

In [70]:
# Mean hours per week for each (salary, native-country) pair
pd.crosstab(data['native-country'], data['salary'],
            values=data['hours-per-week'], aggfunc='mean').T

salary,?,Cambodia,Canada,China,Columbia,Cuba,Dominican-Republic,Ecuador,El-Salvador,England,France,Germany,Greece,Guatemala,Haiti,Holand-Netherlands,Honduras,Hong,Hungary,India,Iran,Ireland,Italy,Jamaica,Japan,Laos,Mexico,Nicaragua,Outlying-US(Guam-USVI-etc),Peru,Philippines,Poland,Portugal,Puerto-Rico,Scotland,South,Taiwan,Thailand,Trinadad&Tobago,United-States,Vietnam,Yugoslavia
<=50K,40.16476,41.416667,37.914634,37.381818,38.684211,37.985714,42.338235,38.041667,36.030928,40.483333,41.058824,39.139785,41.809524,39.360656,36.325,40.0,34.333333,39.142857,31.3,38.233333,41.44,40.947368,39.625,38.239437,41.0,40.375,40.003279,36.09375,41.857143,35.068966,38.065693,38.166667,41.939394,38.470588,39.444444,40.15625,33.774194,42.866667,37.058824,38.799127,37.193548,41.6
>50K,45.547945,40.0,45.641026,38.9,50.0,42.44,47.0,48.75,45.0,44.533333,50.75,44.977273,50.625,36.666667,42.75,,60.0,45.0,50.0,46.475,47.5,48.0,45.4,41.1,47.958333,40.0,46.575758,37.5,,40.0,43.032787,39.0,41.5,39.416667,46.666667,51.4375,46.8,58.333333,40.0,45.505369,39.2,49.5


In [71]:
# Japan only
country_salary = round(data.groupby([data['native-country'] == 'Japan', 'salary'])['hours-per-week'].mean(), 2)
print(country_salary)

native-country  salary
False           <=50K     38.84
                >50K      45.47
True            <=50K     41.00
                >50K      47.96
Name: hours-per-week, dtype: float64


In Japan, people who make more than 50K work about 48 hours per week on average, and people who make 50K or less work 41 hours. So those who earn more work about 7 hours longer per week on average, which is predictable.
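Grouping on the boolean `native-country == 'Japan'` also reports the averages for everyone outside Japan. If only one country is of interest, a hedged alternative is to group by country and salary and then select that country from the result. The `toy` frame below is made up for illustration.

```python
import pandas as pd

# Toy data, made up for illustration (not the Adult dataset)
toy = pd.DataFrame({
    "native-country": ["Japan", "Japan", "Japan", "Cuba", "Cuba"],
    "salary": ["<=50K", ">50K", ">50K", "<=50K", ">50K"],
    "hours-per-week": [40, 50, 46, 38, 42],
})

mean_hours = toy.groupby(["native-country", "salary"])["hours-per-week"].mean()
# Selecting the outer index level leaves a Series indexed by salary
japan = mean_hours.loc["Japan"].round(2)
print(japan)
```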