# Adult Income data 


The Adult Income dataset is used to predict whether a person earns “>50K” or “<=50K” per year from demographic and employment attributes.

The full dataset contains 48,842 rows with 15 attributes; the adult.csv file loaded in this notebook contains 32,561 of those rows.


In [None]:
1. age

    Type: Numerical (Continuous)
    
    Represents the person’s age in years.

2. workclass

    Type: Categorical (Nominal)
    
    Examples: Private, Self-emp-not-inc, Local-gov, etc.
    
    Contains missing values indicated by "?".

3. fnlwgt

    Type: Numerical (Continuous)
    
    Represents final sample weight (used for census sampling).

4. education

    Type: Categorical (Ordinal)
    
    Examples: Bachelors, Masters, HS-grad
    
    Represents highest education level.

5. education-num

    Type: Numerical (Discrete)
    
    Encoded version of education (1 to 16).

6. marital-status

    Type: Categorical (Nominal)
    
    Examples: Married-civ-spouse, Divorced, Never-married.

7. occupation

    Type: Categorical (Nominal)
    
    Examples: Tech-support, Craft-repair
    
    Missing values represented as "?".

8. relationship

    Type: Categorical (Nominal)
    
    Example: Husband, Wife, Not-in-family

9. race

    Type: Categorical (Nominal)

10. sex

    Type: Categorical (Nominal)

    Typically: Male / Female

11. capital-gain

    Type: Numerical (Continuous)
    
    Monetary gain (mostly zero).

12. capital-loss

    Type: Numerical (Continuous)

13. hours-per-week

    Type: Numerical (Continuous)
    
    Working hours per week.

14. native-country

    Type: Categorical (Nominal)
    
    Examples: United-States, India, Mexico
    
    Missing values represented as "?".

15. income (target variable)

    Type: Categorical (Binary)
    
    Categories:
    
    <=50K
    
    >50K
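
The list above uses hyphenated names, while the CSV loaded in this notebook uses dots (for example education.num). As a rough sketch, the attribute kinds can be written down as a dictionary keyed by the dotted names; `COLUMN_KINDS` is a hypothetical helper introduced here for illustration, not part of the notebook.

```python
# Column names as they appear in adult.csv (dots, not hyphens), mapped to the
# kind of variable described in the list above. Illustrative helper only.
COLUMN_KINDS = {
    "age": "numerical",
    "workclass": "categorical",
    "fnlwgt": "numerical",
    "education": "categorical",
    "education.num": "numerical",
    "marital.status": "categorical",
    "occupation": "categorical",
    "relationship": "categorical",
    "race": "categorical",
    "sex": "categorical",
    "capital.gain": "numerical",
    "capital.loss": "numerical",
    "hours.per.week": "numerical",
    "native.country": "categorical",
    "income": "categorical",  # target variable
}

categorical = [c for c, kind in COLUMN_KINDS.items() if kind == "categorical"]
numerical = [c for c, kind in COLUMN_KINDS.items() if kind == "numerical"]

print(len(COLUMN_KINDS), len(categorical), len(numerical))
```

The counts agree with the df.info() summary further down, which reports 6 int64 columns and 9 object columns.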

In [None]:

import pandas as pd

In [None]:

import warnings
warnings.filterwarnings("ignore")

In [4]:

# Load the dataset

df = pd.read_csv('adult.csv')

In [5]:

# Display the DataFrame (pandas shows only the first and last rows)

df

Unnamed: 0,age,workclass,fnlwgt,education,education.num,marital.status,occupation,relationship,race,sex,capital.gain,capital.loss,hours.per.week,native.country,income
0,90,?,77053,HS-grad,9,Widowed,?,Not-in-family,White,Female,0,4356,40,United-States,<=50K
1,82,Private,132870,HS-grad,9,Widowed,Exec-managerial,Not-in-family,White,Female,0,4356,18,United-States,<=50K
2,66,?,186061,Some-college,10,Widowed,?,Unmarried,Black,Female,0,4356,40,United-States,<=50K
3,54,Private,140359,7th-8th,4,Divorced,Machine-op-inspct,Unmarried,White,Female,0,3900,40,United-States,<=50K
4,41,Private,264663,Some-college,10,Separated,Prof-specialty,Own-child,White,Female,0,3900,40,United-States,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,22,Private,310152,Some-college,10,Never-married,Protective-serv,Not-in-family,White,Male,0,0,40,United-States,<=50K
32557,27,Private,257302,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,<=50K
32558,40,Private,154374,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,>50K
32559,58,Private,151910,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,<=50K


In [10]:

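# Count missing values per column.
# Note: this cell's execution count (10) is later than the reload with
# na_values='?' (In [8]), so the counts shown here already treat "?" as NaN.
df.isna().sum()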

age                  0
workclass         1836
fnlwgt               0
education            0
education.num        0
marital.status       0
occupation        1843
relationship         0
race                 0
sex                  0
capital.gain         0
capital.loss         0
hours.per.week       0
native.country     583
income               0
dtype: int64

In [7]:

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             32561 non-null  int64 
 1   workclass       32561 non-null  object
 2   fnlwgt          32561 non-null  int64 
 3   education       32561 non-null  object
 4   education.num   32561 non-null  int64 
 5   marital.status  32561 non-null  object
 6   occupation      32561 non-null  object
 7   relationship    32561 non-null  object
 8   race            32561 non-null  object
 9   sex             32561 non-null  object
 10  capital.gain    32561 non-null  int64 
 11  capital.loss    32561 non-null  int64 
 12  hours.per.week  32561 non-null  int64 
 13  native.country  32561 non-null  object
 14  income          32561 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB


In [8]:

# Reload the data, parsing "?" as a missing value (NaN)
missing_val = '?'

df = pd.read_csv('adult.csv', na_values=missing_val)

In [9]:

df

Unnamed: 0,age,workclass,fnlwgt,education,education.num,marital.status,occupation,relationship,race,sex,capital.gain,capital.loss,hours.per.week,native.country,income
0,90,,77053,HS-grad,9,Widowed,,Not-in-family,White,Female,0,4356,40,United-States,<=50K
1,82,Private,132870,HS-grad,9,Widowed,Exec-managerial,Not-in-family,White,Female,0,4356,18,United-States,<=50K
2,66,,186061,Some-college,10,Widowed,,Unmarried,Black,Female,0,4356,40,United-States,<=50K
3,54,Private,140359,7th-8th,4,Divorced,Machine-op-inspct,Unmarried,White,Female,0,3900,40,United-States,<=50K
4,41,Private,264663,Some-college,10,Separated,Prof-specialty,Own-child,White,Female,0,3900,40,United-States,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,22,Private,310152,Some-college,10,Never-married,Protective-serv,Not-in-family,White,Male,0,0,40,United-States,<=50K
32557,27,Private,257302,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,<=50K
32558,40,Private,154374,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,>50K
32559,58,Private,151910,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,<=50K


In the raw file, missing values appear as "?" in the workclass, occupation, and native.country columns.

The reload above parses "?" as NaN (Not a Number), the marker pandas uses for missing data, so these entries are recognized as missing in the analysis that follows.
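
The effect of `na_values` can be seen on a tiny in-memory CSV. This is a minimal sketch; the three rows are made up to stand in for adult.csv and the variable names are illustrative.

```python
import io

import pandas as pd

# Tiny in-memory CSV standing in for adult.csv (illustrative rows only).
csv_text = """age,workclass,occupation
90,?,?
82,Private,Exec-managerial
66,?,?
"""

# Without na_values, "?" is read as an ordinary string.
raw = pd.read_csv(io.StringIO(csv_text))

# With na_values="?", the same cells are parsed as NaN.
parsed = pd.read_csv(io.StringIO(csv_text), na_values="?")

print(raw.isna().sum().sum())     # total missing cells in the raw read
print(parsed.isna().sum().sum())  # total missing cells after parsing "?"
```

This is also why df.info() reports 32561 non-null entries for every column when the file is read without `na_values`.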

In [12]:

df.dtypes

age                int64
workclass         object
fnlwgt             int64
education         object
education.num      int64
marital.status    object
occupation        object
relationship      object
race              object
sex               object
capital.gain       int64
capital.loss       int64
hours.per.week     int64
native.country    object
income            object
dtype: object

In [None]:

df.shape

In [13]:

df.isnull().sum()

age                  0
workclass         1836
fnlwgt               0
education            0
education.num        0
marital.status       0
occupation        1843
relationship         0
race                 0
sex                  0
capital.gain         0
capital.loss         0
hours.per.week       0
native.country     583
income               0
dtype: int64

In [15]:

df['occupation'].mode()[0]

'Prof-specialty'

In [16]:

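# Impute missing categorical values with each column's most frequent value (mode).
# Assigning the result back avoids fillna(inplace=True) on a column selection,
# which newer pandas versions warn about.
df['workclass'] = df['workclass'].fillna(df['workclass'].mode()[0])

df['occupation'] = df['occupation'].fillna(df['occupation'].mode()[0])

df['native.country'] = df['native.country'].fillna(df['native.country'].mode()[0])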


In [17]:

df.isnull().sum()

age               0
workclass         0
fnlwgt            0
education         0
education.num     0
marital.status    0
occupation        0
relationship      0
race              0
sex               0
capital.gain      0
capital.loss      0
hours.per.week    0
native.country    0
income            0
dtype: int64
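
Mode imputation relies on `mode()[0]`. A small sketch on a made-up Series shows two details worth knowing: missing values are ignored when the mode is computed, and when several values tie, `mode()` returns them sorted, so `[0]` picks the first in sorted order.

```python
import pandas as pd

# Made-up occupation values with two missing entries and a tie for the mode.
s = pd.Series(["Sales", None, "Craft-repair", "Sales", None, "Craft-repair"])

modes = s.mode()        # missing values are dropped by default (dropna=True)
fill_value = modes[0]   # on a tie, the first value in sorted order
filled = s.fillna(fill_value)

print(list(modes))
print(fill_value)
print(filled.value_counts().to_dict())
```

On the real data the choice is unambiguous only if one category is strictly the most frequent, which can be checked with `value_counts()` before filling.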

In [18]:

# Categorical (object-typed) columns
cat_fet = df.select_dtypes(include=['object']).columns.tolist()
print(cat_fet)


['workclass', 'education', 'marital.status', 'occupation', 'relationship', 'race', 'sex', 'native.country', 'income']


In [19]:

# Numerical columns
num_fet = df.select_dtypes(exclude=['object']).columns.tolist()
print(num_fet)


['age', 'fnlwgt', 'education.num', 'capital.gain', 'capital.loss', 'hours.per.week']


In [20]:
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()

In [21]:

label_cols = ['workclass', 'education', 'marital.status', 'occupation', 'relationship', 'race', 'sex', 'native.country']

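# Note: fit_transform refits `le` on every column, so after the loop it only
# holds the mapping for the last column ('native.country').
for col in label_cols:
    df[col] = le.fit_transform(df[col])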

In [22]:

df

Unnamed: 0,age,workclass,fnlwgt,education,education.num,marital.status,occupation,relationship,race,sex,capital.gain,capital.loss,hours.per.week,native.country,income
0,90,3,77053,11,9,6,9,1,4,0,0,4356,40,38,<=50K
1,82,3,132870,11,9,6,3,1,4,0,0,4356,18,38,<=50K
2,66,3,186061,15,10,6,9,4,2,0,0,4356,40,38,<=50K
3,54,3,140359,5,4,0,6,4,4,0,0,3900,40,38,<=50K
4,41,3,264663,15,10,5,9,3,4,0,0,3900,40,38,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,22,3,310152,15,10,4,10,1,4,1,0,0,40,38,<=50K
32557,27,3,257302,7,12,2,12,5,4,0,0,0,38,38,<=50K
32558,40,3,154374,11,9,2,6,0,4,1,0,0,40,38,>50K
32559,58,3,151910,11,9,6,0,4,4,0,0,0,40,38,<=50K
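
Because a single `LabelEncoder` is refitted for each column, the integer codes above cannot be mapped back to their category names afterwards. One possible variation, sketched here on a made-up two-column frame, keeps one fitted encoder per column; the `encoders` dictionary and the `toy` frame are assumptions for illustration, not the notebook's code.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up frame standing in for two of the categorical columns.
toy = pd.DataFrame({
    "sex": ["Female", "Male", "Female"],
    "race": ["White", "Black", "White"],
})

encoders = {}  # hypothetical helper: column name -> fitted LabelEncoder
for col in ["sex", "race"]:
    enc = LabelEncoder()
    toy[col] = enc.fit_transform(toy[col])
    encoders[col] = enc

# classes_ lists the categories in code order (sorted alphabetically).
print(list(encoders["sex"].classes_))
# inverse_transform recovers the original labels from the codes.
print(list(encoders["sex"].inverse_transform(toy["sex"])))
```

The alphabetical ordering is why Female is encoded as 0 and Male as 1 in the table above. The codes are arbitrary integers, so their numeric order carries no meaning for nominal columns such as race or occupation.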


In [23]:

df['income'] = df['income'].map({'<=50K': 0, '>50K': 1})

In [24]:

df

Unnamed: 0,age,workclass,fnlwgt,education,education.num,marital.status,occupation,relationship,race,sex,capital.gain,capital.loss,hours.per.week,native.country,income
0,90,3,77053,11,9,6,9,1,4,0,0,4356,40,38,0
1,82,3,132870,11,9,6,3,1,4,0,0,4356,18,38,0
2,66,3,186061,15,10,6,9,4,2,0,0,4356,40,38,0
3,54,3,140359,5,4,0,6,4,4,0,0,3900,40,38,0
4,41,3,264663,15,10,5,9,3,4,0,0,3900,40,38,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,22,3,310152,15,10,4,10,1,4,1,0,0,40,38,0
32557,27,3,257302,7,12,2,12,5,4,0,0,0,38,38,0
32558,40,3,154374,11,9,2,6,0,4,1,0,0,40,38,1
32559,58,3,151910,11,9,6,0,4,4,0,0,0,40,38,0


In [25]:

# Features: the first 14 columns (everything except income)
X = df.iloc[:, 0:14]
X


Unnamed: 0,age,workclass,fnlwgt,education,education.num,marital.status,occupation,relationship,race,sex,capital.gain,capital.loss,hours.per.week,native.country
0,90,3,77053,11,9,6,9,1,4,0,0,4356,40,38
1,82,3,132870,11,9,6,3,1,4,0,0,4356,18,38
2,66,3,186061,15,10,6,9,4,2,0,0,4356,40,38
3,54,3,140359,5,4,0,6,4,4,0,0,3900,40,38
4,41,3,264663,15,10,5,9,3,4,0,0,3900,40,38
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,22,3,310152,15,10,4,10,1,4,1,0,0,40,38
32557,27,3,257302,7,12,2,12,5,4,0,0,0,38,38
32558,40,3,154374,11,9,2,6,0,4,1,0,0,40,38
32559,58,3,151910,11,9,6,0,4,4,0,0,0,40,38


In [26]:

# Target: the last column (income, encoded as 0/1)
y = df.iloc[:, 14]
y

0        0
1        0
2        0
3        0
4        0
        ..
32556    0
32557    0
32558    1
32559    0
32560    0
Name: income, Length: 32561, dtype: int64

In [27]:

from sklearn.model_selection import train_test_split

# Split the dataset into training (70%) and test (30%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=122)
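
The income classes are imbalanced, so a plain random split can give the two subsets slightly different class proportions. As an optional variation (not what this notebook does), `train_test_split` accepts a `stratify` argument that preserves the class ratio. The sketch below uses synthetic data with about 24% positives; `X_demo` and `y_demo` are placeholders.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: random features, 24% positive labels.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([1] * 240 + [0] * 760)

# stratify=y_demo keeps the positive rate the same in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=122, stratify=y_demo
)

print(len(y_tr), len(y_te))
print(round(y_tr.mean(), 3), round(y_te.mean(), 3))
```

Switching the notebook's split to a stratified one would change which rows land in the test set, so the scores reported below would no longer match exactly.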


In [28]:

from sklearn.tree import DecisionTreeClassifier

dec_tree = DecisionTreeClassifier()

In [29]:

dec_tree.fit(X_train,y_train)
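
`DecisionTreeClassifier()` is created here without a `random_state`. scikit-learn permutes the candidate features at each split, so ties between equally good splits may be broken differently from run to run. A minimal sketch on synthetic data, assuming reproducibility is wanted, fixes the seed; the data and the seed value 42 are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic classification data as a stand-in for the income features.
X_demo, y_demo = make_classification(n_samples=500, n_features=8, random_state=0)

# Two trees built with the same seed on the same data.
tree_a = DecisionTreeClassifier(random_state=42).fit(X_demo, y_demo)
tree_b = DecisionTreeClassifier(random_state=42).fit(X_demo, y_demo)

# With identical seeds and data, the fitted trees make identical predictions.
same = bool((tree_a.predict(X_demo) == tree_b.predict(X_demo)).all())
print(same, tree_a.get_depth() == tree_b.get_depth())
```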

In [30]:

Y_pred_dec_tree = dec_tree.predict(X_test)
Y_pred_dec_tree


array([0, 0, 1, ..., 0, 0, 1], dtype=int64)

In [31]:

from sklearn.metrics import accuracy_score

In [32]:

print('Accuracy score:', round(accuracy_score(y_test, Y_pred_dec_tree) * 100, 2))

Accuracy score: 80.87


In [33]:

from sklearn.metrics import classification_report, confusion_matrix

print(confusion_matrix(y_test, Y_pred_dec_tree))
print(classification_report(y_test, Y_pred_dec_tree))


[[6476  938]
 [ 931 1424]]
              precision    recall  f1-score   support

           0       0.87      0.87      0.87      7414
           1       0.60      0.60      0.60      2355

    accuracy                           0.81      9769
   macro avg       0.74      0.74      0.74      9769
weighted avg       0.81      0.81      0.81      9769
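
The reported scores can be checked by hand from the confusion matrix above, whose rows are the true classes and whose columns are the predicted classes. The sketch below redoes that arithmetic in plain Python and also computes the accuracy of always predicting the majority class, which is a useful point of comparison; the variable names are illustrative.

```python
# Confusion matrix reported above: rows are true classes, columns are predictions.
tn, fp = 6476, 938    # true class 0 (<=50K)
fn, tp = 931, 1424    # true class 1 (>50K)

total = tn + fp + fn + tp

accuracy = (tn + tp) / total
precision_pos = tp / (tp + fp)   # of predicted >50K, the share that is correct
recall_pos = tp / (tp + fn)      # of actual >50K, the share that is found

# Accuracy of always predicting the majority class (0, i.e. <=50K).
baseline = (tn + fp) / total

print(f"accuracy={accuracy:.4f} precision={precision_pos:.4f} "
      f"recall={recall_pos:.4f} baseline={baseline:.4f}")
```

The tree's accuracy of about 81% is therefore only about five percentage points above the roughly 76% obtained by always predicting <=50K, and precision and recall for the >50K class are both close to 0.60.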

