# Assignment 7 (Week 7)

## Name: Eniola Ogunmona


### Table of Contents
- [Introduction](#intro)
- [Data preprocessing](#dataprep)
- [Model building](#modelbuild)
- [Model evaluation](#modeleval)
- [Conclusion](#conclusion)

<a id="intro"></a>
## Introduction

The objective of this project is to predict whether a person makes over 50K a year based on various demographic, educational, and employment-related features. 

The dataset contains information about individuals' age, workclass, education level, occupation, and other characteristics, as well as their income level. 

We will use this data to train a classification model that can predict whether a person makes over 50K a year or not.

> The data can be found [here](https://drive.google.com/file/d/1_c3KA14xQC02K0QZ4cpi1emjdz0rqHzb/view?usp=share_link).

### Data Dictionary

```
- Age: continuous.

- Workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.

- Final_weight: continuous.

- Education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.

- Education_num: continuous.

- Marital_status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.

- Occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.

- Relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.

- Race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.

- Sex: Female, Male.

- Capital_gain: continuous.

- Capital_loss: continuous.

- Hours_per_week: continuous.

- Country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.

- Salary: >50K, <=50K.
```

**Importing the necessary libraries**

In [1]:
# Built-in library
import itertools

# Standard imports
import numpy as np
import pandas as pd
import seaborn as sns

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, MinMaxScaler
from sklearn.metrics import accuracy_score, precision_score, recall_score

# pandas settings
pd.options.display.max_rows = 1_000
pd.options.display.max_columns = 1_000
pd.options.display.max_colwidth = 1_000


# Black code formatter (Optional)
%load_ext nb_black


In [2]:
# Loading data
df = pd.read_csv("salary.csv", skipinitialspace=True)
df.head()

| | Age | Workclass | Final_weight | Education | Education_num | Marital_status | Occupation | Relationship | Race | Sex | Capital_gain | Capital_loss | Hours_per_week | Country | Salary |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
| 1 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
| 2 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
| 3 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |
| 4 | 37 | Private | 284582 | Masters | 14 | Married-civ-spouse | Exec-managerial | Wife | White | Female | 0 | 0 | 40 | United-States | <=50K |



In [3]:
# Checking for more info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32560 entries, 0 to 32559
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Age             32560 non-null  int64 
 1   Workclass       32560 non-null  object
 2   Final_weight    32560 non-null  int64 
 3   Education       32560 non-null  object
 4   Education_num   32560 non-null  int64 
 5   Marital_status  32560 non-null  object
 6   Occupation      32560 non-null  object
 7   Relationship    32560 non-null  object
 8   Race            32560 non-null  object
 9   Sex             32560 non-null  object
 10  Capital_gain    32560 non-null  int64 
 11  Capital_loss    32560 non-null  int64 
 12  Hours_per_week  32560 non-null  int64 
 13  Country         32560 non-null  object
 14  Salary          32560 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB



In [4]:
# Checking for null values
df.isna().sum()

Age               0
Workclass         0
Final_weight      0
Education         0
Education_num     0
Marital_status    0
Occupation        0
Relationship      0
Race              0
Sex               0
Capital_gain      0
Capital_loss      0
Hours_per_week    0
Country           0
Salary            0
dtype: int64


<a id="dataprep"></a>
## Data preprocessing

### Discretisation of variables

Each categorical variable below is mapped to a coarser or shorter set of labels using a dictionary and `Series.map`.

In [5]:
# Discretisation of Workclass Variable
wc_mapper = {
    "Private": "Employed",
    "Self-emp-inc": "Employed",
    "?": "Unemployed",
    "Local-gov": "Employed",
    "Self-emp-not-inc": "Unemployed",
    "Federal-gov": "Employed",
    "State-gov": "Employed",
    "Never-worked": "Unemployed",
    "Without-pay": "Unemployed",
}

df["Workclass"] = df["Workclass"].map(wc_mapper)
df["Workclass"].unique()

array(['Unemployed', 'Employed'], dtype=object)
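`Series.map` with a dictionary returns `NaN` for any value that has no key in the dictionary, and the null check in this notebook ran before the mapping cells. A minimal sketch of a guard that could be run before each mapping is shown below; the helper name `unmapped_categories` and the toy values are made up for illustration and are not part of the original notebook.

```python
import pandas as pd


def unmapped_categories(series: pd.Series, mapper: dict) -> set:
    """Return the distinct non-null values in `series` that `mapper` has no key for."""
    return set(series.dropna().unique()) - set(mapper)


# Toy data (hypothetical): "Contractor" has no entry in the mapper.
toy = pd.Series(["Private", "?", "Contractor", "Private"])
toy_mapper = {"Private": "Employed", "?": "Unemployed"}

missing = unmapped_categories(toy, toy_mapper)
print(missing)  # {'Contractor'}

# Mapping anyway silently turns the unmapped value into NaN.
mapped = toy.map(toy_mapper)
print(mapped.isna().sum())  # 1
```

An empty set means the dictionary covers every category, so the mapping cannot introduce missing values.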


In [6]:
# Discretisation of Marital_status Variable
mar_mapper = {
    "Married-civ-spouse": "Married",
    "Never-married": "Not_Married",
    "Divorced": "Not_Married",
    "Separated": "Not_Married",
    "Widowed": "Not_Married",
    "Married-spouse-absent": "Married",
    "Married-AF-spouse": "Married",
}

df["Marital_status"] = df["Marital_status"].map(mar_mapper)
df["Marital_status"].unique()

array(['Married', 'Not_Married'], dtype=object)


In [7]:
# Discretisation of Occupation Variable
occupation_mapper = {
    "Prof-specialty": "Professional",
    "Craft-repair": "Non_technical",
    "Exec-managerial": "Non_technical",
    "Adm-clerical": "Non_technical",
    "Sales": "Non_technical",
    "Other-service": "Other_service",
    "Machine-op-inspct": "Professional",
    "?": "Other_service",
    "Transport-moving": "Non_technical",
    "Handlers-cleaners": "Non_technical",
    "Farming-fishing": "Non_technical",
    "Tech-support": "Professional",
    "Protective-serv": "Other_service",
    "Priv-house-serv": "Other_service",
    "Armed-Forces": "Other_service",
}

df["Occupation"] = df["Occupation"].map(occupation_mapper)
df["Occupation"].unique()

array(['Non_technical', 'Professional', 'Other_service'], dtype=object)


In [8]:
# Discretisation of Relationship Variable
rel_mapper = {
    "Husband": "H",
    "Not-in-family": "S",  # stranger
    "Own-child": "C",  # children
    "Unmarried": "U",
    "Wife": "W",
    "Other-relative": "E",  # extended family
}

df["Relationship"] = df["Relationship"].map(rel_mapper)
df["Relationship"].unique()

array(['H', 'S', 'W', 'C', 'U', 'E'], dtype=object)


In [9]:
# Discretisation of Race Variable
race_mapper = {
    "White": "White",
    "Black": "Black",
    "Asian-Pac-Islander": "Other",
    "Amer-Indian-Eskimo": "Other",
    "Other": "Other",
}

df["Race"] = df["Race"].map(race_mapper)
df["Race"].unique()

array(['White', 'Black', 'Other'], dtype=object)


In [10]:
# Dropping columns not to be used
vars_to_drop = ["Education_num", "Country"]
df.drop(columns=vars_to_drop, inplace=True)

df.shape

(32560, 13)


In [11]:
# View more info on numerical data
num_data = df.select_dtypes(exclude="O")
num_data.describe()

| | Age | Final_weight | Capital_gain | Capital_loss | Hours_per_week |
|---|---|---|---|---|---|
| count | 32560.0 | 32560.0 | 32560.0 | 32560.0 | 32560.0 |
| mean | 38.581634 | 189781.8 | 1077.615172 | 87.306511 | 40.437469 |
| std | 13.640642 | 105549.8 | 7385.402999 | 402.966116 | 12.347618 |
| min | 17.0 | 12285.0 | 0.0 | 0.0 | 1.0 |
| 25% | 28.0 | 117831.5 | 0.0 | 0.0 | 40.0 |
| 50% | 37.0 | 178363.0 | 0.0 | 0.0 | 40.0 |
| 75% | 48.0 | 237054.5 | 0.0 | 0.0 | 45.0 |
| max | 90.0 | 1484705.0 | 99999.0 | 4356.0 | 99.0 |



In [12]:
# View categorical properties
cat_data = df.select_dtypes(include="O")
cat_data.describe()

| | Workclass | Education | Marital_status | Occupation | Relationship | Race | Sex | Salary |
|---|---|---|---|---|---|---|---|---|
| count | 32560 | 32560 | 32560 | 32560 | 32560 | 32560 | 32560 | 32560 |
| unique | 2 | 16 | 2 | 3 | 6 | 3 | 2 | 2 |
| top | Employed | HS-grad | Not_Married | Non_technical | H | White | Male | <=50K |
| freq | 28162 | 10501 | 17143 | 19545 | 13193 | 27815 | 21789 | 24719 |



In [13]:
# Checking correlation between numerical data
corr_matrix = num_data.corr()
corr_matrix

| | Age | Final_weight | Capital_gain | Capital_loss | Hours_per_week |
|---|---|---|---|---|---|
| Age | 1.0 | -0.076646 | 0.077674 | 0.057775 | 0.068756 |
| Final_weight | -0.076646 | 1.0 | 0.000437 | -0.010259 | -0.01877 |
| Capital_gain | 0.077674 | 0.000437 | 1.0 | -0.031614 | 0.078409 |
| Capital_loss | 0.057775 | -0.010259 | -0.031614 | 1.0 | 0.054256 |
| Hours_per_week | 0.068756 | -0.01877 | 0.078409 | 0.054256 | 1.0 |
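Every off-diagonal value in the matrix above is below 0.08 in absolute value, so the numerical features are close to linearly uncorrelated with one another. A small sketch of how the pairs could be ranked programmatically follows; the helper `rank_correlated_pairs` is a hypothetical name, and the matrix is re-typed from the output above so the example runs on its own.

```python
from itertools import combinations

import pandas as pd


def rank_correlated_pairs(corr: pd.DataFrame) -> list:
    """List each unordered column pair once, sorted by absolute correlation."""
    pairs = [(a, b, float(corr.loc[a, b])) for a, b in combinations(corr.columns, 2)]
    return sorted(pairs, key=lambda item: abs(item[2]), reverse=True)


# Correlation values copied from the notebook output above.
cols = ["Age", "Final_weight", "Capital_gain", "Capital_loss", "Hours_per_week"]
values = [
    [1.0, -0.076646, 0.077674, 0.057775, 0.068756],
    [-0.076646, 1.0, 0.000437, -0.010259, -0.01877],
    [0.077674, 0.000437, 1.0, -0.031614, 0.078409],
    [0.057775, -0.010259, -0.031614, 1.0, 0.054256],
    [0.068756, -0.01877, 0.078409, 0.054256, 1.0],
]
corr = pd.DataFrame(values, index=cols, columns=cols)

ranked = rank_correlated_pairs(corr)
a, b, r = ranked[0]
print(f"{a} vs {b}: {r:.6f}")  # Capital_gain vs Hours_per_week: 0.078409
```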



In [14]:
# Encoding target variable
sal_mapper = {
    "<=50K": 0,
    ">50K": 1,
}

df["Salary"] = df["Salary"].map(sal_mapper)
df["Salary"].unique()

array([0, 1], dtype=int64)


<a id="modelbuild"></a>
## Model building

The data is split into training (80%) and test (20%) sets, and a logistic regression model is then fitted on the training set.

#### Split the data into train and test

In [15]:
RANDOM_STATE = 100
TEST_SIZE = 0.2
TARGET = "Salary"


In [16]:
# Independent features (Matrix)
X = df.drop(columns=TARGET)

# Target variable (Vector)
y = df[TARGET]

# Splitting into test and train
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=TEST_SIZE, random_state=RANDOM_STATE
)

X_train.shape, X_test.shape

((26048, 12), (6512, 12))
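The split above is purely random. Because the target is imbalanced, one option (not used in this notebook) is to pass `stratify=y` so that both splits keep the same class proportions. The sketch below uses a made-up target with 25% positives to show the effect; the variable names are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced target (hypothetical): 750 zeros and 250 ones.
y_toy = np.array([0] * 750 + [1] * 250)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)

# stratify=y_toy asks for the same class ratio in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=100, stratify=y_toy
)

print(X_tr.shape, X_te.shape)  # (800, 1) (200, 1)
print(y_tr.mean(), y_te.mean())  # both equal to the overall positive rate of 0.25
```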


In [17]:
# Selecting columns to perform preprocessing on
vars_to_encode = [
    "Workclass",
    "Marital_status",
    "Education",
    "Occupation",
    "Relationship",
    "Race",
    "Sex",
]

vars_to_scale = [
    "Age",
    "Final_weight",
    "Capital_gain",
    "Capital_loss",
    "Hours_per_week",
]

# ===== OE =====
oe = OrdinalEncoder(handle_unknown="error")

# ===== Scaler =====
scaler = MinMaxScaler()

preprocessor = ColumnTransformer(
    transformers=[
        ("oe", oe, vars_to_encode),
        ("scaler", scaler, vars_to_scale),
    ],
    remainder="passthrough",
)

preprocessor


In [18]:
# View preprocessed info
tr = preprocessor.fit_transform(X_train)
pd.DataFrame(tr, columns=preprocessor.get_feature_names_out()).head()

| | oe__Workclass | oe__Marital_status | oe__Education | oe__Occupation | oe__Relationship | oe__Race | oe__Sex | scaler__Age | scaler__Final_weight | scaler__Capital_gain | scaler__Capital_loss | scaler__Hours_per_week |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 11.0 | 1.0 | 5.0 | 2.0 | 0.0 | 0.123288 | 0.313321 | 0.0 | 0.0 | 0.295918 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.342466 | 0.241268 | 0.0 | 0.0 | 0.428571 |
| 2 | 0.0 | 1.0 | 15.0 | 0.0 | 0.0 | 2.0 | 1.0 | 0.068493 | 0.117173 | 0.0 | 0.0 | 0.397959 |
| 3 | 0.0 | 0.0 | 1.0 | 0.0 | 2.0 | 2.0 | 1.0 | 0.452055 | 0.122192 | 0.0 | 0.0 | 0.397959 |
| 4 | 0.0 | 0.0 | 10.0 | 2.0 | 2.0 | 2.0 | 1.0 | 0.328767 | 0.327879 | 1.0 | 0.0 | 0.704082 |
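The `scaler__` columns come from `MinMaxScaler`, which by default rescales each column to the range [0, 1] as `(x - min) / (max - min)` using the minimum and maximum seen during fitting. A plain-Python sketch of that formula is given below; the function name and the three sample ages are made up, with 17 and 90 taken from the `describe()` output earlier in the notebook.

```python
def min_max_scale(values):
    """Rescale a list of numbers to [0, 1] with (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # A constant column carries no spread; map everything to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


# Hypothetical sample: the reported minimum (17) and maximum (90) age plus one value between.
ages = [17, 26, 90]
scaled = min_max_scale(ages)
print([round(s, 6) for s in scaled])  # [0.0, 0.123288, 1.0]
```

Under the assumption that the training split contains the same minimum and maximum age as the full dataset, a scaled value of 0.123288 would correspond to an age of 26.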



In [19]:
# Define model
model = LogisticRegression()

# Bundle preprocessing and modeling code in a pipeline
clf = Pipeline(steps=[("preprocessor", preprocessor), ("model", model)])

# Preprocessing of training data, fit model
clf.fit(X_train, y_train)

# Preprocessing of test data, get predictions
preds = clf.predict(X_test)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
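The message above is scikit-learn's convergence warning: the solver stopped at the default `max_iter=100` before converging. One plausible contributor here is that the ordinal-encoded columns (for example `oe__Education`, which ranges from 0 to 15) are not scaled while the numerical columns lie in [0, 1]. A minimal sketch of the simplest remedy the warning suggests, raising `max_iter`, is shown on synthetic data; the value 1000 is an arbitrary assumption, not a tuned setting.

```python
import warnings

import numpy as np
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression

# Synthetic data (made up) so the example is self-contained.
rng = np.random.default_rng(0)
X_syn = rng.normal(size=(200, 3))
y_syn = (X_syn[:, 0] + 0.5 * X_syn[:, 1] + rng.normal(size=200) > 0).astype(int)

# Allow more solver iterations than the default of 100.
model_syn = LogisticRegression(max_iter=1000)

# Turn a convergence warning into an error so a silent failure cannot be missed.
with warnings.catch_warnings():
    warnings.simplefilter("error", ConvergenceWarning)
    model_syn.fit(X_syn, y_syn)

print(model_syn.n_iter_)  # iterations actually used by the solver
```

In the notebook's pipeline the equivalent change would be `LogisticRegression(max_iter=1000)` in the model definition; whether that alone removes the warning on this dataset has not been verified here.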



<a id="modeleval"></a>
## Model evaluation

The model is evaluated on the test set with three performance metrics: accuracy, precision, and recall.
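For reference, the three metrics can be written in terms of confusion-matrix counts: accuracy is (TP + TN) / N, precision is TP / (TP + FP), and recall is TP / (TP + FN), where the positive class is `>50K` (encoded as 1). The sketch below computes them by hand on ten made-up labels; it illustrates the definitions and is not the notebook's evaluation code.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


# Toy labels (hypothetical): 4 actual positives, 3 predicted positives.
y_true_toy = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred_toy = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true_toy, y_pred_toy)

accuracy_toy = (tp + tn) / len(y_true_toy)  # share of all predictions that are correct
precision_toy = tp / (tp + fp)  # share of predicted positives that are correct
recall_toy = tp / (tp + fn)  # share of actual positives that are found

print(tp, fp, fn, tn)  # 2 1 2 5
print(accuracy_toy, round(precision_toy, 4), recall_toy)  # 0.7 0.6667 0.5
```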

In [21]:
# Checking model performance
accuracy = accuracy_score(y_test, preds)
precision = precision_score(y_test, preds)
recall = recall_score(y_test, preds)

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)

Accuracy: 0.8100429975429976
Precision: 0.6864077669902913
Recall: 0.43615052436767426
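Accuracy is easier to interpret next to a trivial baseline. The `describe()` output for the categorical columns reports that 24719 of the 32560 rows are `<=50K`, so a classifier that always predicts that class would already be right about 76% of the time on the full dataset. The sketch below computes this baseline; the helper name is hypothetical, and the share in the test split may differ slightly from the full-dataset figure.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy of always predicting the most frequent label."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


# Class counts taken from the notebook output: 24719 rows of class 0 out of 32560.
labels_full = [0] * 24719 + [1] * (32560 - 24719)

baseline = majority_baseline_accuracy(labels_full)
print(round(baseline, 4))  # 0.7592
```

Against this baseline, the reported accuracy of about 0.81 is a gain of roughly five percentage points.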



<a id="conclusion"></a>
## Conclusion
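In conclusion, a logistic regression classifier was built to predict whether a person makes over 50K a year based on demographic, educational, and employment-related features.

On the test set the model reaches an accuracy of about 0.81, a precision of about 0.69, and a recall of about 0.44. The accuracy should be read against the class balance: roughly 76% of the rows in the dataset are `<=50K`, so always predicting that class would already score about 0.76. The low recall means the model misses more than half of the people who make over 50K. There is therefore clear room for improvement, and future work could involve exploring different models, resolving the convergence warning raised during fitting, or improving the feature engineering process.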