In [13]:
import pandas as pd

columns = [
    'age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status',
    'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week',
    'native-country', 'income'
]

In [3]:
adult_dt = (pd.read_csv('C:/Users/anya8/05_src/data/adult/adult.data', header=None, names=columns)
              .assign(income=lambda x: (x.income.str.strip() == '>50K') * 1))

# Check the first few rows to ensure data is loaded correctly
adult_dt.head()

,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,income
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,0
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,0
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,0
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,0
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,0


Explanation:

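The file adult.data has no header row, so it is read with header=None and the column names are supplied through names=columns.

The income column is recoded as a binary target: after surrounding whitespace is stripped, '>50K' becomes 1 and every other value (here '<=50K') becomes 0.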

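The recoding step can be checked in isolation. The following is a minimal sketch on a hypothetical four-row frame (the `toy` and `encoded` names are illustrative, not part of the notebook); it reuses the same expression as the loading cell.

```python
import pandas as pd

# Toy frame mimicking raw 'income' strings that carry stray whitespace
toy = pd.DataFrame({'income': [' <=50K', ' >50K', ' <=50K', ' >50K ']})

# Same expression as in the notebook: strip whitespace, compare, cast bool -> int
encoded = toy.assign(income=lambda x: (x.income.str.strip() == '>50K') * 1)

print(encoded['income'].tolist())  # [0, 1, 0, 1]
```

Multiplying the boolean Series by 1 is what turns True/False into 1/0; `.astype(int)` would be an equivalent choice.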

Result: The dataset is successfully loaded and the target variable (income) is preprocessed for further use.


Data Splitting (Training and Testing Sets)

Objective: Split the dataset into features (X) and target (Y), then further divide the data into training (70%) and testing (30%) sets.

Code Execution:


In [4]:
# Create a dataframe X with all columns except 'income'
X = adult_dt.drop('income', axis=1)

# Create a dataframe Y with only the 'income' column (target)
Y = adult_dt['income']

# Split the data into training and testing sets (70-30% split)
from sklearn.model_selection import train_test_split

X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=42)

# Check the shapes of the resulting datasets to ensure the split was successful
print(X_train.shape, X_test.shape, Y_train.shape, Y_test.shape)

(22792, 14) (9769, 14) (22792,) (9769,)


Explanation:

The data is divided into features (X) and target (Y).

A 70-30 train-test split is applied. train_test_split shuffles the rows before splitting, and the fixed random_state=42 makes the split reproducible.

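As a small illustration of what the fixed random_state buys, the sketch below splits a hypothetical 100-row data set twice and confirms that both calls select the same rows. The toy data and all variable names are assumptions for the example. The `stratify` variant at the end is an optional alternative that the notebook does not use.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, one feature, roughly 25% positives
rng = np.random.default_rng(0)
X_toy = pd.DataFrame({'feature': rng.normal(size=100)})
y_toy = pd.Series((rng.random(100) < 0.25).astype(int), name='income')

# Two calls with the same random_state select exactly the same rows
X_tr1, X_te1, y_tr1, y_te1 = train_test_split(X_toy, y_toy, test_size=0.3, random_state=42)
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X_toy, y_toy, test_size=0.3, random_state=42)
same_rows = X_tr1.index.equals(X_tr2.index)

# Optional variant (not used in the notebook): stratify keeps the class
# ratio nearly identical in the training and test parts
X_tr_s, X_te_s, y_tr_s, y_te_s = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=42, stratify=y_toy)

print(X_tr1.shape, X_te1.shape, same_rows)
print(round(y_tr_s.mean(), 3), round(y_te_s.mean(), 3))
```

With an imbalanced target such as income, stratifying is a common way to keep the positive rate comparable across the two parts.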

Result: The split yields 22,792 training rows and 9,769 test rows, each with 14 feature columns, as the printed shapes confirm.


Preprocessing Pipeline for Numerical and Categorical Features

Objective: Preprocess the features by imputing missing values, scaling the numerical columns, and one-hot encoding the categorical columns.

Code Execution:


In [6]:
from sklearn.compose import ColumnTransformer
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Define the columns for numerical and categorical features
numerical_features = X.select_dtypes(include=['int64', 'float64']).columns
categorical_features = X.select_dtypes(include=['object']).columns

# Create a column transformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', 
         Pipeline([
             ('imputer', KNNImputer(n_neighbors=7, weights='distance')), 
             ('scaler', StandardScaler())
         ]), 
         numerical_features),
        ('cat', 
         Pipeline([
             ('imputer', SimpleImputer(strategy='most_frequent')),
             ('onehot', OneHotEncoder(handle_unknown='ignore', drop='first'))
         ]), 
         categorical_features)
    ])

# Check the transformers configuration
preprocessor

Explanation:

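Numerical Features: Missing values are filled by KNN imputation (7 neighbors, weighted by distance), and the columns are then standardized with StandardScaler.

Categorical Features: Missing values are imputed with the most frequent value, and the columns are one-hot encoded. drop='first' omits one indicator per feature to avoid redundant columns, and handle_unknown='ignore' encodes categories unseen during fitting as all zeros. Note that adult.data marks missing entries with '?', and read_csv is called above without na_values, so those entries are read as ordinary strings and the imputers only act on values that are actual NaN.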


Result: The preprocessing pipeline is defined successfully, which will handle missing values and scaling/encoding for both numerical and categorical features.



Model Training and Cross-Validation

Objective: Train the model using a Random Forest classifier, and evaluate it with cross-validation to compute key metrics.

Code Execution:

In [8]:
from sklearn.model_selection import train_test_split

# Re-create the feature/target split under the lowercase name y, which the cells below use
X = adult_dt.drop(columns=['income'])
y = adult_dt['income']

# Split the data into training (70%) and testing (30%) sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Check the dimensions of the resulting splits
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

(22792, 14) (9769, 14) (22792,) (9769,)


In [9]:
from sklearn.model_selection import cross_validate

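# Perform 5-fold cross-validation on the training set
# (`pipeline` is the preprocessor followed by a RandomForestClassifier; its defining cell is not shown here)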
cv_results = cross_validate(pipeline, X_train, y_train, cv=5, 
                            scoring=['neg_log_loss', 'roc_auc', 'accuracy', 'balanced_accuracy'], 
                            return_train_score=True)

# Convert the results into a pandas DataFrame
cv_results_df = pd.DataFrame(cv_results)

# Display fold-level results, sorted so the fold with the worst (most negative) held-out log loss comes first
cv_results_df.sort_values(by='test_neg_log_loss', ascending=True, inplace=True)
cv_results_df



,fit_time,score_time,test_neg_log_loss,train_neg_log_loss,test_roc_auc,train_roc_auc,test_accuracy,train_accuracy,test_balanced_accuracy,train_balanced_accuracy
2,26.251399,0.623426,-0.398852,-0.081152,0.903506,1.0,0.855419,0.99989,0.775963,0.999851
1,24.924754,0.533122,-0.386584,-0.081048,0.902313,1.0,0.848651,1.0,0.768572,1.0
4,23.134728,0.489428,-0.38605,-0.081197,0.902493,1.0,0.856297,1.0,0.776234,1.0
3,24.341557,0.515713,-0.356709,-0.081294,0.907058,1.0,0.862878,1.0,0.787963,1.0
0,24.631703,0.371413,-0.342274,-0.081596,0.904853,1.0,0.852819,0.999945,0.777161,0.999887


In [11]:
# Import the metric functions used below
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             log_loss, roc_auc_score)

# Fit the pipeline on the training data
pipeline.fit(X_train, y_train)

# Now make predictions on the test set
y_pred_proba = pipeline.predict_proba(X_test)  # Class probabilities, one column per class
y_pred = pipeline.predict(X_test)  # Predicted class labels

# Calculate the performance metrics
# log_loss returns the (positive) log loss; the 'neg_log_loss' scorer used
# in cross-validation reports the negation of this value
test_log_loss = log_loss(y_test, y_pred_proba)
test_roc_auc = roc_auc_score(y_test, y_pred_proba[:, 1])
test_accuracy = accuracy_score(y_test, y_pred)
test_balanced_accuracy = balanced_accuracy_score(y_test, y_pred)

# Create a dictionary to display the results
test_results = {
    'Log Loss': test_log_loss,
    'ROC AUC': test_roc_auc,
    'Accuracy': test_accuracy,
    'Balanced Accuracy': test_balanced_accuracy
}

# Display the results
test_results

{'Log Loss': 0.39376273357792796,
 'ROC AUC': 0.9017387528860863,
 'Accuracy': 0.8565871634763026,
 'Balanced Accuracy': 0.7786352514394926}

Explanation:

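The pipeline chains the preprocessing steps defined above with a RandomForestClassifier, so imputation, scaling, and encoding are refit inside each fold and never see that fold's held-out rows.

Cross-validation uses 5 folds on the training set. For each fold, negative log loss, ROC AUC, accuracy, and balanced accuracy are computed on both the training part (train_ columns) and the held-out part (test_ columns).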

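The cell that builds `pipeline` does not appear above, so the following is only a sketch of the structure the explanation describes, run on hypothetical synthetic data. The names `demo_preprocessor` and `demo_pipeline`, the simplified preprocessing, and the forest settings (n_estimators=50, random_state=42) are all assumptions; the notebook's actual hyperparameters are unknown.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical synthetic data standing in for the Adult features
rng = np.random.default_rng(0)
n = 300
X_demo = pd.DataFrame({
    'age': rng.integers(18, 70, size=n),
    'hours-per-week': rng.integers(10, 60, size=n),
    'workclass': rng.choice(['Private', 'State-gov', 'Self-emp-inc'], size=n),
})
# Target loosely tied to age so that the model has something to learn
y_demo = ((X_demo['age'] + rng.normal(0, 10, size=n)) > 45).astype(int)

demo_preprocessor = ColumnTransformer([
    ('num', StandardScaler(), ['age', 'hours-per-week']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['workclass']),
])

# Assumed structure: preprocessing step followed by a random forest
demo_pipeline = Pipeline([
    ('preprocessor', demo_preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=50, random_state=42)),
])

# Same cross_validate call shape as in the notebook
demo_cv = cross_validate(
    demo_pipeline, X_demo, y_demo, cv=5,
    scoring=['neg_log_loss', 'roc_auc', 'accuracy', 'balanced_accuracy'],
    return_train_score=True,
)
demo_cv_df = pd.DataFrame(demo_cv)
print(demo_cv_df[['test_neg_log_loss', 'test_roc_auc', 'train_accuracy']].round(3))
```

Passing the whole pipeline to cross_validate, rather than pre-transformed features, is what keeps the preprocessing from leaking information across folds.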

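Result: Across the five folds, held-out ROC AUC ranges from 0.902 to 0.907 and held-out accuracy from 0.849 to 0.863. Training scores are near perfect (accuracy of about 1.0 and log loss of about 0.081, against 0.34 to 0.40 on the held-out parts), so the forest fits the training data far more closely than it generalizes.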

Explanation:

The pipeline is refit on the full training set and evaluated once on the held-out 30% test set.

Log loss and ROC AUC are computed from the predicted probabilities, while accuracy and balanced accuracy are computed from the predicted class labels. Comparing the last two shows whether performance is even across the two classes.


Result: The final performance metrics are displayed:

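Log Loss: 0.3938 (the positive value returned by log_loss; the neg_log_loss scorer used in cross-validation reports its negation)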

ROC AUC: 0.9017

Accuracy: 0.8566

Balanced Accuracy: 0.7786


Task Analysis and Results

Problem Description: The task involved predicting whether an individual earns more than $50K annually based on demographic and employment data. This is a binary classification problem.



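Log Loss: Measures how closely the model's predicted probabilities match the true labels; lower is better. scikit-learn's neg_log_loss scorer reports the negated value so that higher is better.

ROC AUC: Measures how well the model ranks the two classes: it is the probability that a randomly chosen positive receives a higher score than a randomly chosen negative (0.5 is chance level, 1.0 is perfect).

Accuracy: The fraction of predictions that are correct.

Balanced Accuracy: The unweighted average of the recall of each class, which compensates for class imbalance.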

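The four definitions can be made concrete on a hypothetical ten-row example with 8 negatives and 2 positives. The labels, probabilities, and the 0.5 threshold below are invented for illustration and are unrelated to the notebook's results.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             log_loss, roc_auc_score)

# Hypothetical labels: 8 negatives (<=50K) and 2 positives (>50K)
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
# Hypothetical predicted probability of the positive class for each row
p_pos = np.array([0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.4, 0.2, 0.7, 0.4])
# Class labels obtained with a 0.5 threshold, chosen for illustration
y_hat = (p_pos >= 0.5).astype(int)

acc = accuracy_score(y_true, y_hat)               # 9 of 10 correct -> 0.9
bal_acc = balanced_accuracy_score(y_true, y_hat)  # (8/8 + 1/2) / 2 -> 0.75
auc = roc_auc_score(y_true, p_pos)
ll = log_loss(y_true, p_pos)

# Log loss written out by hand: mean negative log-probability of the true class
manual_ll = -np.mean(y_true * np.log(p_pos) + (1 - y_true) * np.log(1 - p_pos))

print(acc, bal_acc, round(auc, 4), round(ll, 4), np.isclose(ll, manual_ll))
```

Accuracy looks strong here because the negatives dominate, while balanced accuracy exposes that only one of the two positives is recovered.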

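The ROC AUC of 0.9017 means that the model ranks an individual with income >50K above one with income <=50K about 90% of the time.

The Accuracy of 85.66% should be read against the class imbalance: roughly three quarters of the Adult records fall in the <=50K class, so always predicting <=50K would already be right for that share.

The test Log Loss of 0.3938 is in line with the cross-validation folds (0.34 to 0.40) but well above the training log loss of about 0.08, so the predicted probabilities are less reliable on unseen data than the training scores suggest.

The Balanced Accuracy (77.86%) is about eight points below the accuracy. Because balanced accuracy weights both classes equally, this gap shows that recall is lower on the minority >50K class than on the majority <=50K class, so the model still favors the majority class.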


Conclusion

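The task was completed: the Random Forest pipeline, with imputation, scaling, and one-hot encoding, reaches a test ROC AUC of 0.90 and a test accuracy of 85.7%, consistent with the cross-validation estimates. Two caveats remain. Near-perfect training scores indicate that the forest overfits the training data, and the lower balanced accuracy (77.9%) shows weaker recall on the >50K class. The model is therefore strongest at ranking individuals by income class and weaker at identifying high earners at the default decision threshold.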