# Age vs. Salary Classification: Income Above 50K or At Most 50K with Logistic Regression

Dataset reference: https://www.kaggle.com/wenruliu/adult-income-dataset

![](https://miro.medium.com/max/668/0*g0SY0MVS41m_Yma_.png)

## NOTE!

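Attribute information (names as listed in the original dataset documentation; the CSV loaded below uses `educational-num`, `gender`, and `income` in place of `education-num`, `sex`, and `class`):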

- **age**: continuous  
- **workclass**: `Private`, `Self-emp-not-inc`, `Self-emp-inc`, `Federal-gov`, `Local-gov`, `State-gov`, `Without-pay`, `Never-worked`  
- **fnlwgt**: continuous  
- **education**: `Bachelors`, `Some-college`, `11th`, `HS-grad`, `Prof-school`, `Assoc-acdm`, `Assoc-voc`, `9th`, `7th-8th`, `12th`, `Masters`, `1st-4th`, `10th`, `Doctorate`, `5th-6th`, `Preschool`  
- **education-num**: continuous  
- **marital-status**: `Married-civ-spouse`, `Divorced`, `Never-married`, `Separated`, `Widowed`, `Married-spouse-absent`, `Married-AF-spouse`  
- **occupation**: `Tech-support`, `Craft-repair`, `Other-service`, `Sales`, `Exec-managerial`, `Prof-specialty`, `Handlers-cleaners`, `Machine-op-inspct`, `Adm-clerical`, `Farming-fishing`, `Transport-moving`, `Priv-house-serv`, `Protective-serv`, `Armed-Forces`  
- **relationship**: `Wife`, `Own-child`, `Husband`, `Not-in-family`, `Other-relative`, `Unmarried`  
- **race**: `White`, `Asian-Pac-Islander`, `Amer-Indian-Eskimo`, `Other`, `Black`  
- **sex**: `Female`, `Male`  
- **capital-gain**: continuous  
- **capital-loss**: continuous  
- **hours-per-week**: continuous  
- **native-country**: `United-States`, `Cambodia`, `England`, `Puerto-Rico`, `Canada`, `Germany`, `Outlying-US(Guam-USVI-etc)`, `India`, `Japan`, `Greece`, `South`, `China`, `Cuba`, `Iran`, `Honduras`, `Philippines`, `Italy`, `Poland`, `Jamaica`, `Vietnam`, `Mexico`, `Portugal`, `Ireland`, `France`, `Dominican-Republic`, `Laos`, `Ecuador`, `Taiwan`, `Haiti`, `Columbia`, `Hungary`, `Guatemala`, `Nicaragua`, `Scotland`, `Thailand`, `Yugoslavia`, `El-Salvador`, `Trinadad&Tobago`, `Peru`, `Hong`, `Holand-Netherlands`  
- **class**: `>50K`, `<=50K`


------------------------------------------------------------------------------

**`fnlwgt` (Final Weight)** is a feature used in datasets from the **U.S. Census Bureau** to represent the **estimated number of people** in the U.S. population that each sampled individual represents.  

### **What Does It Mean?**  
- It is a **weighting factor** that adjusts for **sampling bias** in the survey.  
- A higher `fnlwgt` means that the person surveyed **represents more people** in the general population.  

### **How Is It Used?**  
- When analyzing **population-level** statistics, `fnlwgt` helps ensure that the data reflects the actual distribution of the population.
- Example: If two people have different `fnlwgt` values, the one with a higher weight should contribute **more influence** to statistical calculations.

### **Should You Use `fnlwgt` in Machine Learning?**  
🔹 **Usually, NO**: it is a survey sampling weight rather than a characteristic of the individual, so it is often dropped from ML models.  
🔹 If working with **population-level** statistics, you might use it for weighted analysis.  
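
To make the weighted-analysis idea concrete, here is a minimal sketch comparing a plain mean with a mean weighted by `fnlwgt`. The three-row `sample` frame and its values are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical three-person sample; values are invented for illustration only
sample = pd.DataFrame({
    "age": [25, 40, 60],
    "fnlwgt": [100_000, 200_000, 100_000],
})

# Plain mean treats every surveyed person equally
unweighted_mean = sample["age"].mean()

# Weighted mean lets each person count in proportion to the population they represent
weighted_mean = np.average(sample["age"], weights=sample["fnlwgt"])

print(f"Unweighted mean age: {unweighted_mean:.2f}")
print(f"Weighted mean age:   {weighted_mean:.2f}")
```

The person with the largest weight (age 40) pulls the weighted mean toward their value, which is the "more influence" described above.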

## Load Data

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
df = pd.read_csv(r'https://github.com/kaopanboonyuen/Python-Data-Science/raw/master/Dataset/adult_data/adult.csv')
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
4,18,?,103497,Some-college,10,Never-married,?,Own-child,White,Female,0,0,30,United-States,<=50K


In [None]:
df.describe()

Unnamed: 0,age,fnlwgt,educational-num,capital-gain,capital-loss,hours-per-week
count,48842.0,48842.0,48842.0,48842.0,48842.0,48842.0
mean,38.643585,189664.1,10.078089,1079.067626,87.502314,40.422382
std,13.71051,105604.0,2.570973,7452.019058,403.004552,12.391444
min,17.0,12285.0,1.0,0.0,0.0,1.0
25%,28.0,117550.5,9.0,0.0,0.0,40.0
50%,37.0,178144.5,10.0,0.0,0.0,40.0
75%,48.0,237642.0,12.0,0.0,0.0,45.0
max,90.0,1490400.0,16.0,99999.0,4356.0,99.0


## Data Preprocessing

### Find missing data and Fill it!

In [None]:
# Drop the survey weight column; it is not used as a model feature
df.drop(['fnlwgt'], axis=1, inplace=True)

In [None]:
# Check for missing value
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype 
---  ------           --------------  ----- 
 0   age              48842 non-null  int64 
 1   workclass        48842 non-null  object
 2   education        48842 non-null  object
 3   educational-num  48842 non-null  int64 
 4   marital-status   48842 non-null  object
 5   occupation       48842 non-null  object
 6   relationship     48842 non-null  object
 7   race             48842 non-null  object
 8   gender           48842 non-null  object
 9   capital-gain     48842 non-null  int64 
 10  capital-loss     48842 non-null  int64 
 11  hours-per-week   48842 non-null  int64 
 12  native-country   48842 non-null  object
 13  income           48842 non-null  object
dtypes: int64(5), object(9)
memory usage: 5.2+ MB


In [None]:
# Count occurrences of NaN (null)
null_counts = df.isna().sum()

# Total occurrences in the whole DataFrame
total_nulls = null_counts.sum()

print("Occurrences per column (NaN):\n", null_counts)
print("Total occurrences of NaN in DataFrame:", total_nulls)

Occurrences per column (NaN):
 age                0
workclass          0
education          0
educational-num    0
marital-status     0
occupation         0
relationship       0
race               0
gender             0
capital-gain       0
capital-loss       0
hours-per-week     0
native-country     0
income             0
dtype: int64
Total occurrences of NaN in DataFrame: 0


In [None]:
# Count occurrences of "?" in each column
question_mark_counts = (df == "?").sum()

# Total occurrences in the whole DataFrame
total_question_marks = (df == "?").sum().sum()

print("Occurrences per column:\n", question_mark_counts)
print("Total occurrences in DataFrame:", total_question_marks)

Occurrences per column:
 age                   0
workclass          2799
education             0
educational-num       0
marital-status        0
occupation         2809
relationship          0
race                  0
gender                0
capital-gain          0
capital-loss          0
hours-per-week        0
native-country      857
income                0
dtype: int64
Total occurrences in DataFrame: 6465


In [None]:
# Convert '?' to NaN
df.replace("?", np.nan, inplace=True)

# Calculate the percentage of missing values per column
missing_percentage = df.isnull().mean() * 100

# Print missing values percentage for each column
print("Missing values percentage per column:")
print(missing_percentage)

Missing values percentage per column:
age                0.000000
workclass          5.730724
education          0.000000
educational-num    0.000000
marital-status     0.000000
occupation         5.751198
relationship       0.000000
race               0.000000
gender             0.000000
capital-gain       0.000000
capital-loss       0.000000
hours-per-week     0.000000
native-country     1.754637
income             0.000000
dtype: float64


No column is missing more than 6% of its values, so let's fill the gaps.

In [None]:
df.head()

Unnamed: 0,age,workclass,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,25,Private,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38,Private,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,28,Local-gov,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,44,Private,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
4,18,,Some-college,10,Never-married,,Own-child,White,Female,0,0,30,United-States,<=50K


In [None]:
# Fill missing values with mode (most common value)
categorical_columns = ['workclass', 'occupation', 'native-country']
for col in categorical_columns:
    df[col] = df[col].fillna(df[col].mode()[0])

In [None]:
# Calculate the percentage of missing values per column
missing_percentage = df.isnull().mean() * 100

# Print missing values percentage for each column
print("Missing values percentage per column:")
print(missing_percentage)

Missing values percentage per column:
age                0.0
workclass          0.0
education          0.0
educational-num    0.0
marital-status     0.0
occupation         0.0
relationship       0.0
race               0.0
gender             0.0
capital-gain       0.0
capital-loss       0.0
hours-per-week     0.0
native-country     0.0
income             0.0
dtype: float64


### Converting Categorical Features

In [None]:
df.head()

Unnamed: 0,age,workclass,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,income
0,25,Private,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,<=50K
1,38,Private,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,<=50K
2,28,Local-gov,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,>50K
3,44,Private,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,>50K
4,18,Private,Some-college,10,Never-married,Prof-specialty,Own-child,White,Female,0,0,30,United-States,<=50K


In [None]:
income = pd.get_dummies(df['income'],drop_first=True)
income

Unnamed: 0,>50K
0,False
1,False
2,True
3,True
4,False
...,...
48837,False
48838,True
48839,False
48840,False


In [None]:
df.drop('income', axis=1, inplace=True)

In [None]:
df = pd.concat([df,income],axis=1)

In [None]:
df

Unnamed: 0,age,workclass,education,educational-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,>50K
0,25,Private,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,False
1,38,Private,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,False
2,28,Local-gov,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,True
3,44,Private,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,True
4,18,Private,Some-college,10,Never-married,Prof-specialty,Own-child,White,Female,0,0,30,United-States,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48837,27,Private,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,False
48838,40,Private,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,True
48839,58,Private,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,False
48840,22,Private,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Male,0,0,20,United-States,False


In [None]:
# List categorical columns
categorical_cols = ["workclass", "education", "marital-status", "occupation", "relationship", "race", "gender", "native-country"]

# Apply One-Hot Encoding (Nominal)
df = pd.get_dummies(df, columns=categorical_cols, drop_first=True)

# Display transformed data
df.head()

Unnamed: 0,age,educational-num,capital-gain,capital-loss,hours-per-week,>50K,workclass_Local-gov,workclass_Never-worked,workclass_Private,workclass_Self-emp-inc,...,native-country_Portugal,native-country_Puerto-Rico,native-country_Scotland,native-country_South,native-country_Taiwan,native-country_Thailand,native-country_Trinadad&Tobago,native-country_United-States,native-country_Vietnam,native-country_Yugoslavia
0,25,7,0,0,40,False,False,False,True,False,...,False,False,False,False,False,False,False,True,False,False
1,38,9,0,0,50,False,False,False,True,False,...,False,False,False,False,False,False,False,True,False,False
2,28,12,0,0,40,True,True,False,False,False,...,False,False,False,False,False,False,False,True,False,False
3,44,10,7688,0,40,True,False,False,True,False,...,False,False,False,False,False,False,False,True,False,False
4,18,10,0,0,30,False,False,False,True,False,...,False,False,False,False,False,False,False,True,False,False
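
The effect of `drop_first=True` is easier to see on a tiny frame. The sketch below uses a made-up `toy` DataFrame (not the Adult data): with two categories, only one indicator column is kept, because the dropped category is implied when the remaining column is false.

```python
import pandas as pd

# Toy frame for illustration; not part of the Adult dataset
toy = pd.DataFrame({"gender": ["Male", "Female", "Male"]})

# Without drop_first: one indicator column per category
full = pd.get_dummies(toy, columns=["gender"])

# With drop_first: the first category (alphabetically, "Female") is dropped
reduced = pd.get_dummies(toy, columns=["gender"], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

Dropping one level per feature avoids perfectly collinear indicator columns, which is generally helpful for linear models such as logistic regression.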


## Building a Logistic Regression model

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
X = df.drop('>50K',axis=1)
y = df['>50K']
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.30, random_state=101)
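
The `stratify=y` argument keeps the class proportions roughly equal in the train and test splits. A minimal sketch on made-up labels (the `*_toy` names and the 25% positive rate are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 200 samples, 25% positives (illustrative, not the Adult data)
y_toy = np.array([1] * 50 + [0] * 150)
X_toy = np.arange(200).reshape(-1, 1)

X_tr_toy, X_te_toy, y_tr_toy, y_te_toy = train_test_split(
    X_toy, y_toy, stratify=y_toy, test_size=0.30, random_state=101
)

# Both splits should have a positive rate close to the overall 25%
print("Train positive rate:", y_tr_toy.mean())
print("Test positive rate: ", y_te_toy.mean())
```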

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
model = LogisticRegression()
model.fit(X_train,y_train)

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
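
The warning above suggests two remedies: raising `max_iter` or scaling the features. Below is a minimal sketch of both combined in a pipeline. It runs on synthetic data rather than the Adult dataset; `make_classification`, the inflated first column, and the `*_demo` names are illustrative assumptions, and the effect on the real data is not shown here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the encoded feature matrix (not the Adult data)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=101)

# Put one feature on a much larger scale, similar in spirit to capital-gain
X_demo[:, 0] *= 10_000

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, stratify=y_demo, test_size=0.30, random_state=101
)

# Standardize features, then fit logistic regression with a higher iteration cap
scaled_model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_model.fit(Xd_train, yd_train)

demo_pred = scaled_model.predict(Xd_test)
demo_accuracy = scaled_model.score(Xd_test, yd_test)
print(demo_pred[:5], demo_accuracy)
```

Putting the scaler inside the pipeline means it is fitted on the training split only and then reused unchanged on the test split.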


In [None]:
y_pred = model.predict(X_test)

In [None]:
print(list(y_test[:5]))
print(y_pred[:5])

[True, True, True, False, False]
[False False False False False]


## Evaluation

In [None]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report

In [None]:
confusion_matrix(y_test, y_pred, labels=[0,1])

array([[10349,   798],
       [ 1494,  2012]])

In [None]:
print(classification_report(y_test,y_pred, digits=4))

              precision    recall  f1-score   support

       False     0.8738    0.9284    0.9003     11147
        True     0.7160    0.5739    0.6371      3506

    accuracy                         0.8436     14653
   macro avg     0.7949    0.7511    0.7687     14653
weighted avg     0.8361    0.8436    0.8373     14653
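
The figures in the report can be recomputed from the confusion matrix shown earlier. The sketch below hard-codes that matrix and derives precision, recall, and F1 for the `True` (`>50K`) class, plus accuracy and the macro-averaged F1; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix from the output above: rows are actual, columns are predicted
cm = np.array([[10349, 798],
               [1494, 2012]])
tn, fp, fn, tp = cm.ravel()

# Metrics for the positive class (>50K)
precision_pos = tp / (tp + fp)
recall_pos = tp / (tp + fn)
f1_pos = 2 * precision_pos * recall_pos / (precision_pos + recall_pos)

# F1 for the negative class (<=50K), needed for the macro average
precision_neg = tn / (tn + fn)
recall_neg = tn / (tn + fp)
f1_neg = 2 * precision_neg * recall_neg / (precision_neg + recall_neg)

accuracy = (tp + tn) / cm.sum()
macro_f1 = (f1_pos + f1_neg) / 2

print(f"precision={precision_pos:.4f} recall={recall_pos:.4f} f1={f1_pos:.4f}")
print(f"accuracy={accuracy:.4f} macro_f1={macro_f1:.4f}")
```

The gap between the two per-class F1 values shows that the model identifies `<=50K` earners much more reliably than `>50K` earners.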



## Summary

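On the held-out test set (30% of the data), the logistic regression model reaches 84.36% accuracy and a macro-averaged F1-score of 76.87%. Recall for the `>50K` class is noticeably lower, at 57.39%. The solver hit its iteration limit during training, so these numbers come from a model that had not fully converged.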