# People salaries dataset analysis

## Dataset description

The dataset contains information about people and their salaries. The dataset has the following columns:
- age: the age of the person
- workclass: the type of work the person does
- education: the level of education of the person
- marital-status: the marital status of the person
- occupation: the occupation of the person
- relationship: the person's role in the family (e.g. Husband, Wife, Not-in-family)
- race: the race of the person
- sex: the sex of the person
- hours-per-week: the number of hours the person works per week
- native-country: the country of origin of the person
- salary: category indicating whether the salary is <=50K or >50K
- salary K$: the salary in thousands of dollars

## Data importing and preprocessing

### Data importing and first look

In [202]:
import pandas as pd

df = pd.read_csv('./data/adult.csv', index_col=0)

df.head()

,age,workclass,education,marital-status,occupation,relationship,race,sex,hours-per-week,native-country,salary,salary K$
0,39,State-gov,Bachelors,Never-married,Adm-clerical,Not-in-family,White,Male,40,United-States,<=50K,39
1,50,Self-emp-not-inc,Bachelors,Married-civ-spouse,Exec-managerial,Husband,White,Male,13,United-States,<=50K,35
2,38,Private,HS-grad,Divorced,Handlers-cleaners,Not-in-family,White,Male,40,United-States,<=50K,27
3,53,Private,11th,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,40,United-States,<=50K,43
4,28,Private,Bachelors,Married-civ-spouse,Prof-specialty,Wife,Black,Female,40,Cuba,<=50K,25


### Data basic statistics

In [203]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 32561 entries, 0 to 32560
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       32561 non-null  object
 2   education       32561 non-null  object
 3   marital-status  32561 non-null  object
 4   occupation      32561 non-null  object
 5   relationship    32561 non-null  object
 6   race            32561 non-null  object
 7   sex             32561 non-null  object
 8   hours-per-week  32561 non-null  int64 
 9   native-country  32561 non-null  object
 10  salary          32561 non-null  object
 11  salary K$       32561 non-null  int64 
dtypes: int64(3), object(9)
memory usage: 3.2+ MB


In [204]:
df.describe()

,age,hours-per-week,salary K$
count,32561.0,32561.0,32561.0
mean,38.581647,40.437456,72.674611
std,13.640433,12.347429,84.345976
min,17.0,1.0,15.0
25%,28.0,40.0,26.0
50%,37.0,40.0,38.0
75%,48.0,45.0,49.0
max,90.0,99.0,349.0


### Cleaning missing values

In [205]:
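# Indexing with a boolean frame keeps cells where the mask is True and
# sets the others to NaN, so every cell containing "?" becomes NaN.
df = df[~df.apply(lambda x: x.astype(str).str.contains(r"\?"))]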

df.isnull().sum()

age                  0
workclass         1836
education            0
marital-status       0
occupation        1843
relationship         0
race                 0
sex                  0
hours-per-week       0
native-country     583
salary               0
salary K$            0
dtype: int64

As we can see, the workclass, occupation and native-country columns contain missing values. We will remove every row that contains at least one missing value.

In [206]:
df = df.dropna(how='any')

Rows with at least one missing value have now been removed, leaving 30162 of the original 32561 rows.
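The cleaning step can be tried out on a small frame before running it on the full data. This is a minimal sketch, assuming missing entries are encoded as the string `?`, possibly with surrounding spaces; the `toy` frame and its values are made up for illustration and are not taken from the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; "?" marks a missing value (assumption).
toy = pd.DataFrame({
    "age": [39, 50, 38, 28],
    "workclass": ["State-gov", "?", "Private", "Private"],
    "native-country": ["United-States", "Cuba", " ?", "Cuba"],
})

# Strip whitespace in the text columns so " ?" and "?" are treated alike.
text_cols = ["workclass", "native-country"]
toy[text_cols] = toy[text_cols].apply(lambda col: col.str.strip())

# Turn the placeholder into NaN, then count and drop incomplete rows.
marked = toy.replace("?", np.nan)
missing_per_column = marked.isnull().sum()
cleaned = marked.dropna(how="any")

print(missing_per_column)
print(cleaned)
```

Unlike the regex mask used in the notebook, this variant only matches cells that are exactly `?` after stripping, so a value that merely contains a question mark would be kept.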

### Data validity check

In [207]:
df['salary'].value_counts()

salary
<=50K    22654
>50K      7508
Name: count, dtype: int64

In [208]:
df['salary K$'].describe()

count    30162.000000
mean        73.968570
std         85.365144
min         15.000000
25%         26.000000
50%         38.000000
75%         49.000000
max        349.000000
Name: salary K$, dtype: float64

In [209]:
def validate_salary(row):
  if (row['salary'] == '>50K' and row['salary K$'] > 50):
    return True
  if (row['salary'] == '<=50K' and row['salary K$'] <= 50):
    return True
  return False
  
  
df.apply(validate_salary, axis=1).all()

np.True_

The check confirms that every `salary` label is consistent with its `salary K$` value: there are no invalid rows.
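The same validity check can also be written without a row-wise `apply`. The following is a sketch on made-up data: the helper name `salary_is_consistent` and the `toy` frame are hypothetical, and the last toy row is deliberately inconsistent so that the check has something to catch.

```python
import pandas as pd


def salary_is_consistent(frame: pd.DataFrame) -> pd.Series:
    """Return a boolean Series: True where label and amount agree."""
    known_label = frame["salary"].isin(["<=50K", ">50K"])
    label_is_high = frame["salary"] == ">50K"
    amount_is_high = frame["salary K$"] > 50
    # A row is valid when the label is known and both sides agree.
    return known_label & (label_is_high == amount_is_high)


# Hypothetical toy data; the last row is deliberately inconsistent.
toy = pd.DataFrame({
    "salary": ["<=50K", ">50K", "<=50K", ">50K"],
    "salary K$": [39, 207, 50, 35],
})

check = salary_is_consistent(toy)
print(check.all())   # overall verdict
print(toy[~check])   # the offending rows, if any
```

Returning the per-row result instead of a single boolean makes it possible to inspect the invalid rows when the overall check fails.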

## Data analysis

### Sex distribution

In [210]:
df['sex'].value_counts()

sex
Male      20380
Female     9782
Name: count, dtype: int64

As we can see, there are roughly twice as many males as females in the dataset.
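To put a number on the imbalance, the ratio and the percentage shares can be derived from the two counts printed above. The snippet below is only an arithmetic sketch; the variable names `counts`, `ratio` and `shares` are chosen for illustration.

```python
import pandas as pd

# Counts copied from the value_counts() output above.
counts = pd.Series({"Male": 20380, "Female": 9782})

# How many males there are per female.
ratio = counts["Male"] / counts["Female"]

# Share of each group in percent.
shares = counts / counts.sum() * 100

print(f"Male-to-female ratio: {ratio:.2f}")
print(shares.round(1))
```

On the full frame, `df['sex'].value_counts(normalize=True)` gives the same shares as fractions.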

### Average male age

In [211]:
df.groupby(by='sex')['age'].mean()

sex
Female    36.883459
Male      39.184004
Name: age, dtype: float64

The average male age is 39.2 years, compared with 36.9 years for females.

### Percentage of people from Poland

In [212]:
df['native-country'].value_counts()['Poland'] / df.shape[0] * 100

np.float64(0.18566408063125786)

About 0.19% of people are from Poland.
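A percentage like this can also be computed as the mean of a boolean mask, which returns 0 instead of raising a `KeyError` when the country does not occur at all. This is a sketch on made-up data; the helper `percent_from` and the `toy` series are hypothetical.

```python
import pandas as pd

# Hypothetical toy column, not the real data.
toy = pd.Series(
    ["United-States", "Poland", "Cuba", "United-States"],
    name="native-country",
)


def percent_from(countries: pd.Series, country: str) -> float:
    """Percentage of entries equal to `country` (0.0 if it never occurs)."""
    return float((countries == country).mean() * 100)


print(percent_from(toy, "Poland"))
print(percent_from(toy, "Atlantis"))  # absent value, no KeyError
```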

### People without a higher-education diploma and with salary >50K

In [223]:
high_diploma = ['Bachelors', 'Prof-school', 'Assoc-acdm', 'Assoc-voc', 'Masters', 'Doctorate']
with_high_diploma_amount = df[(df['education'].isin(high_diploma)) & (df['salary'] == '>50K')].shape[0]
without_high_diploma_amount = df[(~df['education'].isin(high_diploma)) & (df['salary'] == '>50K')].shape[0]

print(f'With high school diploma {with_high_diploma_amount}')
print(f'Without high school diploma {without_high_diploma_amount}')
print(f'Percentage of people without high school diploma {without_high_diploma_amount / (without_high_diploma_amount + with_high_diploma_amount) * 100:.2f}%')

With high school diploma 4330
Without high school diploma 3178
Percentage of people without high school diploma 42.33%


Among people with salary >50K, 3178 do not have one of the diplomas in the `high_diploma` list (the code labels these degrees "high school diploma", although they are higher-education degrees) and 4330 do. So 42.33% of the people with salary >50K have no such diploma.
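The percentage can be computed in one step by first selecting the high earners and then taking the mean of a boolean mask. The sketch below uses a made-up `toy` frame; only the `high_diploma` list and the two totals in the last lines come from the cell and output above.

```python
import pandas as pd

# Same list as in the cell above.
high_diploma = ["Bachelors", "Prof-school", "Assoc-acdm",
                "Assoc-voc", "Masters", "Doctorate"]

# Hypothetical toy data, not the real dataset.
toy = pd.DataFrame({
    "education": ["Bachelors", "HS-grad", "Masters", "11th", "HS-grad"],
    "salary": [">50K", ">50K", "<=50K", ">50K", "<=50K"],
})

# Restrict to high earners, then measure who lacks a listed diploma.
high_earners = toy[toy["salary"] == ">50K"]
has_diploma = high_earners["education"].isin(high_diploma)
share_without = (~has_diploma).mean() * 100
print(f"{share_without:.2f}%")

# The same ratio from the totals printed above.
reported_share = 3178 / (3178 + 4330) * 100
print(f"{reported_share:.2f}%")
```

Note that the two totals add up to 7508, which matches the number of `>50K` rows found in the validity check.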

### Age distribution by education level

In [214]:
df.groupby(by='education')['age'].describe().sort_values(by='mean')

education,count,mean,std,min,25%,50%,75%,max
12th,377.0,32.013263,14.37371,17.0,19.0,28.0,41.0,79.0
11th,1048.0,32.36355,15.089307,17.0,18.0,28.5,43.0,90.0
Some-college,6678.0,36.13537,13.073528,17.0,25.0,35.0,45.0,90.0
Assoc-acdm,1008.0,37.286706,10.509755,19.0,29.0,36.0,44.0,90.0
10th,820.0,37.897561,16.225795,17.0,23.0,36.0,52.0,90.0
Assoc-voc,1307.0,38.246366,11.181253,19.0,30.0,37.0,45.0,84.0
HS-grad,9840.0,38.640955,13.06773,17.0,28.0,37.0,48.0,90.0
Bachelors,5044.0,38.641554,11.577566,19.0,29.0,37.0,46.0,90.0
9th,455.0,40.303297,15.335754,17.0,28.0,38.0,53.0,90.0
Preschool,45.0,41.288889,15.175672,19.0,30.0,40.0,53.0,75.0


The general trend is that people with higher education are older. There are exceptions, though: for example, the average age in the `7th-8th` group is 48 years.
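Sorting by mean age makes it hard to see how age changes with education level. One option is to sort the groups by the level itself using an ordered categorical. The sketch below runs on made-up data, and the ordering in `education_order` is an assumption: it only covers the levels named in this notebook, and the relative rank of some of them (for example `Assoc-voc` and `Assoc-acdm`) is debatable.

```python
import pandas as pd

# Assumed ordering from lowest to highest level (not from the dataset docs).
education_order = [
    "Preschool", "7th-8th", "9th", "10th", "11th", "12th", "HS-grad",
    "Some-college", "Assoc-voc", "Assoc-acdm", "Bachelors", "Masters",
    "Prof-school", "Doctorate",
]

# Hypothetical toy data, not the real dataset.
toy = pd.DataFrame({
    "education": ["Bachelors", "9th", "HS-grad", "9th", "Bachelors", "HS-grad"],
    "age": [30, 50, 40, 30, 40, 20],
})

# Levels missing from education_order would become NaN here.
toy["education"] = pd.Categorical(
    toy["education"], categories=education_order, ordered=True
)

# Groups come out in category order; observed=True skips unused levels.
mean_age = toy.groupby("education", observed=True)["age"].mean()
print(mean_age)
```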

### Salary of married and single males

In [215]:
is_married = df['marital-status'].str.startswith('Married')
is_man = df['sex'] == 'Male'

# Double quotes outside so the nested single quotes also parse before Python 3.12.
print(f"Average married man salary: {df[is_married & is_man]['salary K$'].mean()}")
print(f"Average single man salary: {df[~is_married & is_man]['salary K$'].mean()}")

Average married man salary: 107.49455968688845
Average single man salary: 46.59723865877712


On average, a married man earns more than twice as much as a single man (about 107.5K$ vs. 46.6K$) :)
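Both averages can be obtained from a single `groupby`, which avoids filtering the frame twice. This is a sketch on made-up data; the `toy` frame and the labels `married` and `single` are hypothetical, and "single" here means every marital status that does not start with `Married`, as in the cell above.

```python
import pandas as pd

# Hypothetical toy data, not the real dataset.
toy = pd.DataFrame({
    "sex": ["Male", "Male", "Male", "Female", "Male"],
    "marital-status": ["Married-civ-spouse", "Never-married", "Divorced",
                       "Married-civ-spouse", "Married-civ-spouse"],
    "salary K$": [100, 40, 50, 80, 120],
})

men = toy[toy["sex"] == "Male"]

# Label each man as married or single based on the status prefix.
is_married = men["marital-status"].str.startswith("Married")
group = is_married.map({True: "married", False: "single"})

mean_salary = men.groupby(group)["salary K$"].mean()
ratio = mean_salary["married"] / mean_salary["single"]

print(mean_salary)
print(f"Married men earn {ratio:.2f} times as much on average")
```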

### Maximum hours per week worked

In [216]:
df[df['hours-per-week'].max() == df['hours-per-week']].sort_values(by='age')

,age,workclass,education,marital-status,occupation,relationship,race,sex,hours-per-week,native-country,salary,salary K$
16992,19,Private,7th-8th,Married-civ-spouse,Craft-repair,Husband,White,Male,99,United-States,<=50K,42
12788,24,State-gov,Doctorate,Never-married,Prof-specialty,Not-in-family,White,Female,99,England,<=50K,17
15180,25,Private,11th,Never-married,Other-service,Not-in-family,White,Male,99,United-States,<=50K,33
1172,25,Private,Masters,Married-civ-spouse,Farming-fishing,Not-in-family,White,Male,99,United-States,>50K,207
22313,26,Self-emp-not-inc,10th,Married-civ-spouse,Farming-fishing,Husband,White,Male,99,United-States,<=50K,26
...,...,...,...,...,...,...,...,...,...,...,...,...
26858,66,Private,Bachelors,Married-civ-spouse,Prof-specialty,Husband,White,Male,99,United-States,<=50K,17
23398,66,Private,Bachelors,Married-civ-spouse,Priv-house-serv,Other-relative,White,Male,99,United-States,<=50K,35
9831,67,Private,Masters,Never-married,Prof-specialty,Not-in-family,White,Male,99,United-States,<=50K,46
16604,73,Self-emp-not-inc,7th-8th,Married-civ-spouse,Farming-fishing,Husband,White,Male,99,United-States,>50K,236


The maximum number of hours worked per week is 99. This was probably the largest value the questionnaire allowed. There are 85 people who work 99 hours per week.
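The number of people at the maximum can be counted directly instead of being read off the table. A minimal sketch on made-up data follows; the `toy` frame is hypothetical.

```python
import pandas as pd

# Hypothetical toy data, not the real dataset.
toy = pd.DataFrame({
    "age": [19, 24, 40, 66],
    "hours-per-week": [99, 40, 99, 20],
})

max_hours = toy["hours-per-week"].max()

# Boolean mask of rows at the maximum; summing it counts the True values.
at_max = toy["hours-per-week"] == max_hours
people_at_max = int(at_max.sum())

print(f"{people_at_max} people work {max_hours} hours per week")
```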