# Data Mining Project: Predicting Income Category of Individuals

## Context

An individual’s annual income results from many factors. Intuitively, it is influenced by the individual’s education level, age, gender, occupation, and so on. I chose this dataset for my data preprocessing and data mining project to examine that influence while applying the CRISP-DM methodology.

The dataset has 15 attributes: demographic and other features that describe a person. We can explore whether income level can be predicted from this personal information.

### Description

This intermediate-level data set was extracted from the Census Bureau database. It contains 48,842 instances (train = 32,561, test = 16,281) with a mix of continuous and discrete attributes.

The data set has 15 attributes, which include age, sex, education level, and other relevant details of a person. Note that in the CSV file loaded below, `education-num` appears as `educational-num` and `sex` appears as `gender`.

- **age**: The age of an individual.
- **workclass**: The type of work or employment of an individual. It can have the following categories:
    - Private: Working in the private sector.
    - Self-emp-not-inc: Self-employed individuals who are not incorporated.
    - Self-emp-inc: Self-employed individuals who are incorporated.
    - Federal-gov: Working for the federal government.
    - Local-gov: Working for the local government.
    - State-gov: Working for the state government.
    - Without-pay: Working without pay (for example, unpaid family workers).
    - Never-worked: Never worked before.
- **fnlwgt (final weight)**: A sampling weight for each record. The weights on the CPS (Current Population Survey) files are controlled to independent estimates of the civilian non-institutional population of the US, which the Population Division of the Census Bureau prepares monthly. Three sets of controls are used:
    - A single cell estimate of the population 16+ for each state.
    - Controls for Hispanic Origin by age and sex.
    - Controls by Race, age, and sex.
    
    > We use all three sets of controls in our weighting program and "rake" through them 6 times so that by the end we come back to all the controls we used.
    
    > People with similar demographic characteristics should have similar weights. There is one important caveat to remember about this statement. That is that since the CPS sample is actually a collection of 51 state samples, each with its own probability of selection, the statement only applies within state.
    
- **education**: The highest level of education completed.
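- **education-num**: A numeric (ordinal) code for the education level, often read as an approximate number of years of education. In the data preview, for example, `11th` is coded 7 and `HS-grad` is coded 9.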
- **marital-status**: The marital status.
- **occupation**: Type of work performed by an individual.
- **relationship**: The individual's relationship within the household (for example, Husband or Own-child).
- **race**: The race of an individual.
- **sex**: The gender of an individual.
- **capital-gain**: The amount of capital gain (financial profit).
- **capital-loss**: The amount of capital loss an individual has incurred.
- **hours-per-week**: The number of hours worked per week.
- **native-country**: The country of origin or the native country.
- **income**: The income level of an individual and serves as the target variable. It indicates whether the income is greater than $50,000 or less than or equal to $50,000, denoted as (>50K, <=50K).
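
Because the target takes only two string labels, it can be turned into a 0/1 column before modelling. The following is a minimal sketch, assuming `>50K` is treated as the positive class (an assumption; the notebook does not fix this choice). The five labels are illustrative, and the names `income` and `target` are hypothetical.

```python
import pandas as pd

# Five illustrative labels in the format used by the income column.
income = pd.Series(["<=50K", "<=50K", ">50K", ">50K", "<=50K"], name="income")

# Assumption: '>50K' is the positive class (1), '<=50K' the negative class (0).
target = (income == ">50K").astype(int)

print(target.tolist())
print(target.value_counts().to_dict())
```

On the full data, the same comparison could be applied to `df['income']`, and `value_counts()` would then show how balanced the two classes are.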

## Exploratory Data Analysis

### Load Data

In [7]:
import pandas as pd

df = pd.read_csv('data/Income_category.csv')
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
4,18,?,103497,Some-college,10,Never-married,?,Own-child,White,Female,0,0,30,United-States,<=50K


In [8]:
import numpy as np

# The file marks missing values with '?'; convert them to NaN so pandas can detect them.
df.replace('?', np.nan, inplace=True)
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
4,18,,103497,Some-college,10,Never-married,,Own-child,White,Female,0,0,30,United-States,<=50K
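
The two cells above load the file and then replace the `?` placeholder. As an alternative sketch, `pd.read_csv` accepts an `na_values` argument that performs the same conversion while reading. The inline CSV below is a stand-in for `data/Income_category.csv`: it holds two rows and four columns taken from the preview, and the names `sample_csv` and `sample` are hypothetical.

```python
import io
import pandas as pd

# Two rows copied from the preview above, reduced to four columns
# (a stand-in for the real file so the example runs on its own).
sample_csv = io.StringIO(
    "age,workclass,occupation,income\n"
    "25,Private,Machine-op-inspct,<=50K\n"
    "18,?,?,<=50K\n"
)

# na_values tells the parser to treat '?' as a missing value at load time.
sample = pd.read_csv(sample_csv, na_values="?")

missing_per_column = sample.isnull().sum()
print(missing_per_column)
```

With the real file, the equivalent call would presumably be `pd.read_csv('data/Income_category.csv', na_values='?')`, which makes the separate `replace` step unnecessary.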


### Initial Data Exploration

In [9]:
df.shape

(48842, 15)

In [10]:
df.isnull().sum()

age                   0
workclass          2799
fnlwgt                0
education             0
educational-num       0
marital-status        0
occupation         2809
relationship          0
race                  0
gender                0
capital-gain          0
capital-loss          0
hours-per-week        0
native-country      857
income                0
dtype: int64
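
The raw counts are easier to judge as a share of the 48,842 rows. The sketch below recomputes that share from the counts printed above; the names `missing_counts` and `missing_pct` are hypothetical.

```python
import pandas as pd

# Missing-value counts copied from the df.isnull().sum() output above.
missing_counts = pd.Series(
    {"workclass": 2799, "occupation": 2809, "native-country": 857}
)
total_rows = 48842  # number of rows reported by df.shape

# Share of rows with a missing value, in percent, rounded to two decimals.
missing_pct = (missing_counts / total_rows * 100).round(2)
print(missing_pct)
```

Each of the three columns is missing in fewer than 6% of rows. On the loaded frame, `df.isnull().mean() * 100` gives the same percentages directly.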

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 15 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   age              48842 non-null  int64 
 1   workclass        46043 non-null  object
 2   fnlwgt           48842 non-null  int64 
 3   education        48842 non-null  object
 4   educational-num  48842 non-null  int64 
 5   marital-status   48842 non-null  object
 6   occupation       46033 non-null  object
 7   relationship     48842 non-null  object
 8   race             48842 non-null  object
 9   gender           48842 non-null  object
 10  capital-gain     48842 non-null  int64 
 11  capital-loss     48842 non-null  int64 
 12  hours-per-week   48842 non-null  int64 
 13  native-country   47985 non-null  object
 14  income           48842 non-null  object
dtypes: int64(6), object(9)
memory usage: 5.6+ MB


In [15]:
df.duplicated().sum()

52
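
`duplicated()` flags every repeat of a row after its first occurrence, so the 52 above counts extra copies rather than distinct duplicated records. The sketch below shows, on a small made-up frame, how such rows could be counted and dropped. Whether to drop them is a judgment call that the notebook does not settle: in census-style data, two different people can legitimately share the same attribute values.

```python
import pandas as pd

# Tiny made-up frame: the third row repeats the first one.
toy = pd.DataFrame(
    {
        "age": [25, 38, 25],
        "workclass": ["Private", "Private", "Private"],
        "income": ["<=50K", "<=50K", "<=50K"],
    }
)

# Count rows that repeat an earlier row exactly.
n_duplicates = toy.duplicated().sum()

# Keep the first occurrence of each row and rebuild a clean 0..n-1 index.
deduplicated = toy.drop_duplicates().reset_index(drop=True)

print(n_duplicates, deduplicated.shape)
```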

In [16]:
df.dtypes

age                 int64
workclass          object
fnlwgt              int64
education          object
educational-num     int64
marital-status     object
occupation         object
relationship       object
race               object
gender             object
capital-gain        int64
capital-loss        int64
hours-per-week      int64
native-country     object
income             object
dtype: object
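
The dtype listing separates the six integer columns from the nine text columns, a split that later steps such as encoding and scaling usually rely on. A minimal sketch of collecting the two groups with `select_dtypes`, using a small stand-in frame; `numeric_cols` and `categorical_cols` are hypothetical names.

```python
import pandas as pd

# Small stand-in frame with a few column names and values from the preview.
toy = pd.DataFrame(
    {
        "age": [25, 38],
        "hours-per-week": [40, 50],
        "workclass": ["Private", "Private"],
        "income": ["<=50K", "<=50K"],
    }
)

# Numeric columns are selected by dtype; everything else is treated as categorical.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(categorical_cols)
```

Note that `income` lands in the categorical group, so it would need to be set aside as the target before any feature encoding.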

### Data Analysis

### Data Cleaning

## Data Preprocessing

### Handling Missing Values

### Handling Outliers

### Imbalanced Data

### Encoding Categorical Variables

### Feature Scaling (Normalization & Standardization) 

### Feature Selection

### Feature Reduction (Dimensionality Reduction)

## Data Prediction

### Data Splitting

### Classification Models Comparison

### Fine Tuning Model

### Model Prediction

### Model Evaluation