# **Project Overview: Predicting Income Levels Using a Deep Neural Network**

### **1. Objective**
This project builds a binary classification model to predict whether an individual's income exceeds $50,000 using U.S. Census data. We use a deep neural network (DNN) implemented with TensorFlow/Keras.

### **2. Data Overview**
- Target: `income` (binary)
- Features: Demographic and employment variables such as `age`, `education`, and `occupation`.

In [None]:
import pandas as pd
train_set = pd.read_csv('USCensusTraining.csv')
train_set.head()
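
Because the later resampling step only makes sense if the target is skewed, it can help to check the class distribution right after loading. The sketch below does this on a small hand-made frame; the rows and the `>50K.` label string are assumptions for illustration and are not taken from `USCensusTraining.csv`.

```python
import pandas as pd

# Toy stand-in for the training file; values are illustrative only.
toy = pd.DataFrame({
    "age": [39, 50, 38, 53, 28],
    "income": ["<=50K.", "<=50K.", ">50K.", "<=50K.", "<=50K."],
})

# Share of rows per class; a strongly skewed split motivates resampling later.
class_share = toy["income"].value_counts(normalize=True)
print(class_share)
```

On the real data the same call would be `train_set['income'].value_counts(normalize=True)`.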

### **3. Data Cleaning & Feature Engineering**
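- Replace `?` placeholders with `NaN` and impute each column with its mode
- Drop the redundant `education-num` column
- Collapse `native-country` into `United-States` and `Other`
- One-hot encode categorical features and encode the target as 0/1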

In [None]:
import numpy as np
train_set.replace('?', np.nan, inplace=True)
train_set.drop('education-num', axis=1, inplace=True)
train_set.fillna(train_set.mode().iloc[0], inplace=True)
train_set['native-country'] = train_set['native-country'].where(train_set['native-country'] == 'United-States', 'Other')

In [None]:
# Encode the target before one-hot encoding. Otherwise get_dummies would split
# `income` into indicator columns and the later train_set['income'] lookup would fail.
train_set['income'] = np.where(train_set['income'] == '<=50K.', 0, 1)
train_set = pd.get_dummies(train_set, drop_first=False)
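
The cleaning steps in this section can be sanity-checked on a tiny frame before running them on the full file. The following is a minimal sketch; the four rows and the `>50K.` label string are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical four-row frame mimicking a few census columns.
toy = pd.DataFrame({
    "workclass": ["Private", "?", "Private", "State-gov"],
    "native-country": ["United-States", "Mexico", "?", "United-States"],
    "income": ["<=50K.", ">50K.", "<=50K.", "<=50K."],
})

toy = toy.replace("?", np.nan)
toy = toy.fillna(toy.mode().iloc[0])  # most frequent value per column
toy["native-country"] = toy["native-country"].where(
    toy["native-country"] == "United-States", "Other"
)

# Encode the target first so get_dummies leaves it as a single 0/1 column.
toy["income"] = np.where(toy["income"] == "<=50K.", 0, 1)
encoded = pd.get_dummies(toy, drop_first=False)
print(encoded.columns.tolist())
```

The numeric `income` column passes through `get_dummies` untouched, while each string column becomes one indicator column per category.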

### **4. Data Preparation**
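Split the data 70/30 into training and test sets, oversample the minority class in the training split with SMOTE, and scale all features to [0, 1] with `MinMaxScaler`. Both SMOTE and the scaler are fitted on the training split only, so the test split stays untouched.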
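
SMOTE balances the classes by creating synthetic minority samples: it picks a minority point, picks one of its nearest minority neighbours, and places a new point at a random position on the segment between them. The sketch below illustrates only that interpolation idea in NumPy. It is a simplified illustration, not `imblearn`'s implementation, and the helper name `interpolate_minority` is hypothetical.

```python
import numpy as np

def interpolate_minority(X_min, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    # Pairwise Euclidean distances between minority samples.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)                # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]  # k nearest neighbours per point

    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_min))            # random minority sample
        nb = neighbours[base, rng.integers(k)]     # one of its neighbours
        gap = rng.random()                         # position along the segment
        synthetic[i] = X_min[base] + gap * (X_min[nb] - X_min[base])
    return synthetic

X_min = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_points = interpolate_minority(X_min, n_new=5)
print(new_points)
```

Every synthetic point is a convex combination of two real minority points, so it always lies inside the region the minority class already occupies.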

In [None]:
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE
from sklearn.preprocessing import MinMaxScaler

X = train_set.drop('income', axis=1)
y = train_set['income']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

smote = SMOTE(random_state=17)
X_train, y_train = smote.fit_resample(X_train, y_train)

scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
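
The scaler is fitted on the training data and then reused on the test data, which matters because min-max scaling depends on the column minima and maxima it has seen. A minimal NumPy sketch of the same arithmetic, with made-up numbers, is:

```python
import numpy as np

train = np.array([[18.0, 0.0], [40.0, 1.0], [62.0, 1.0]])
test = np.array([[29.0, 0.0], [70.0, 1.0]])

# "Fit": learn per-column min and max from the training rows only.
col_min = train.min(axis=0)
col_max = train.max(axis=0)

# "Transform": x' = (x - min) / (max - min), reusing the training statistics.
train_scaled = (train - col_min) / (col_max - col_min)
test_scaled = (test - col_min) / (col_max - col_min)
print(test_scaled)
```

Test values outside the training range (here 70 against a training maximum of 62) map outside [0, 1], which is also what `MinMaxScaler` does with its default `clip=False`.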

### **5. Model Architecture**
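Build and train a DNN with two hidden layers (64 and 32 units, sigmoid activations) and a single sigmoid output unit. The model minimizes binary cross-entropy with RMSprop for 200 epochs at a batch size of 50.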

In [None]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

model = Sequential()
model.add(Dense(64, input_shape=(X_train.shape[1],), activation='sigmoid'))
model.add(Dense(32, activation='sigmoid'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='RMSprop', metrics=['accuracy'])
model.fit(X_train, y_train, epochs=200, batch_size=50)
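
To make the architecture concrete, the forward pass of a network with this layer layout can be written out in NumPy. This is only a sketch with random, untrained weights; the input width of 10 is an assumption, since the real width depends on how many columns one-hot encoding produces.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_features = 10  # assumed input width, for illustration only
rng = np.random.default_rng(0)

# One (weights, bias) pair per Dense layer: n_features -> 64 -> 32 -> 1.
sizes = [n_features, 64, 32, 1]
params = [(rng.normal(scale=0.1, size=(m, n)), np.zeros(n))
          for m, n in zip(sizes[:-1], sizes[1:])]

def forward(x):
    a = x
    for W, b in params:
        a = sigmoid(a @ W + b)  # Section 5 uses sigmoid in every layer
    return a

x_batch = rng.random((5, n_features))  # five fake, already-scaled rows
probs = forward(x_batch)               # shape (5, 1), values in (0, 1)

# Trainable parameters: weights plus biases of each layer.
n_params = sum(W.size + b.size for W, b in params)
print(probs.shape, n_params)
```

Each output is a probability that is later rounded to a 0/1 class label. With 10 inputs the layout has 704 + 2080 + 33 = 2817 trainable parameters.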

### **6. Feature Importance**
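Extract the first-layer weights and average their absolute values per input feature. This is a rough heuristic: it ignores how the later layers use each hidden unit, so treat the ranking as indicative only.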

In [None]:
weights = model.layers[0].get_weights()[0]
importance = np.mean(np.abs(weights), axis=1)
features = X.columns

importance_df = pd.DataFrame({'Feature': features, 'Importance': importance})
importance_df.sort_values(by='Importance', ascending=False).head(10)
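
The first `Dense` layer's kernel has one row per input feature and one column per hidden unit, so averaging over `axis=1` yields one score per feature. A small sketch with a hand-written kernel (the feature names and weights are hypothetical) shows the computation:

```python
import numpy as np
import pandas as pd

# Hypothetical first-layer kernel: 3 input features feeding 4 hidden units.
weights = np.array([
    [ 0.5, -0.5,  0.5, -0.5],   # feature "age"
    [ 0.1,  0.0, -0.1,  0.0],   # feature "hours"
    [-1.0,  1.0,  1.0, -1.0],   # feature "capital"
])
features = ["age", "hours", "capital"]

# axis=1 averages across hidden units, leaving one score per input feature.
importance = np.mean(np.abs(weights), axis=1)
ranking = (pd.DataFrame({"Feature": features, "Importance": importance})
           .sort_values(by="Importance", ascending=False))
print(ranking)
```

Taking absolute values first keeps positive and negative weights from cancelling, which is why "age" scores 0.5 rather than 0.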

### **7. Evaluation**
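Report accuracy, the confusion matrix, and per-class precision and recall. Training metrics are computed on the SMOTE-resampled data and tend to be optimistic, so the held-out split is evaluated as well.

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# Training metrics (SMOTE-resampled data); ravel() flattens the (n, 1) output
predictions = model.predict(X_train).round().astype(int).ravel()
accuracy = accuracy_score(y_train, predictions)
print(f"Training Accuracy: {accuracy * 100:.2f}%")
print(confusion_matrix(y_train, predictions))
print(classification_report(y_train, predictions))

# Held-out metrics on the 30% split that SMOTE never touched
test_predictions = model.predict(X_test).round().astype(int).ravel()
print(f"Test Accuracy: {accuracy_score(y_test, test_predictions) * 100:.2f}%")
print(classification_report(y_test, test_predictions))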
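
The numbers in the confusion matrix and classification report all derive from four counts. The sketch below computes them by hand on made-up labels and probabilities, which can help when reading the scikit-learn output.

```python
import numpy as np

y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0])
y_prob = np.array([0.1, 0.4, 0.6, 0.2, 0.9, 0.3, 0.8, 0.7])

# Threshold the sigmoid outputs into class labels.
y_pred = (y_prob > 0.5).astype(int)

tp = int(np.sum((y_true == 1) & (y_pred == 1)))  # true positives
tn = int(np.sum((y_true == 0) & (y_pred == 0)))  # true negatives
fp = int(np.sum((y_true == 0) & (y_pred == 1)))  # false positives
fn = int(np.sum((y_true == 1) & (y_pred == 0)))  # false negatives

accuracy = (tp + tn) / len(y_true)
precision = tp / (tp + fp)   # of predicted positives, how many are correct
recall = tp / (tp + fn)      # of actual positives, how many are found
print(tp, tn, fp, fn, accuracy, precision, recall)
```

On imbalanced data, precision and recall for the minority class are usually more informative than accuracy alone.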

### **8. Hyperparameter Tuning**
Use `GridSearchCV` with the scikeras `KerasClassifier` wrapper to search over optimizer, batch size, and number of epochs using 3-fold cross-validation.

In [None]:
from scikeras.wrappers import KerasClassifier
from sklearn.model_selection import GridSearchCV

def build_model(optimizer='Adam'):
    model = Sequential()
    model.add(Dense(64, input_shape=(X_train.shape[1],), activation='relu'))
    model.add(Dense(32, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    model.compile(loss='binary_crossentropy', optimizer=optimizer, metrics=['accuracy'])
    return model

# Named `clf` so the trained Keras `model` from Section 5 is not overwritten
clf = KerasClassifier(model=build_model, verbose=0)
param_grid = {
    'optimizer': ['SGD', 'RMSprop', 'Adam'],
    'batch_size': [80, 100],
    'epochs': [50, 100]
}
grid = GridSearchCV(estimator=clf, param_grid=param_grid, cv=3)
grid_result = grid.fit(X_train, y_train)
print("Best Params:", grid_result.best_params_)
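
A grid search trains one model per parameter combination and per fold, so its cost grows multiplicatively. The short sketch below enumerates the grid from this section with the standard library to show how many fits that implies.

```python
from itertools import product

param_grid = {
    "optimizer": ["SGD", "RMSprop", "Adam"],
    "batch_size": [80, 100],
    "epochs": [50, 100],
}
cv_folds = 3

# Cartesian product of all parameter values, one dict per candidate.
combos = [dict(zip(param_grid, values)) for values in product(*param_grid.values())]
total_fits = len(combos) * cv_folds
print(len(combos), total_fits)
```

That is 12 candidates and 36 cross-validation fits, plus one final refit of the best candidate on the full training data because `GridSearchCV` defaults to `refit=True`.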

### **9. Test Prediction**
Apply the same preprocessing to the test file, predict with the best estimator from the grid search, and export the predictions.

In [None]:
test_set = pd.read_csv("USCensusTest.csv")
test_set.replace('?', np.nan, inplace=True)
test_set.fillna(test_set.mode().iloc[0], inplace=True)
test_set.drop('education-num', axis=1, inplace=True)
test_set['native-country'] = test_set['native-country'].where(test_set['native-country'] == 'United-States', 'Other')
test_set = pd.get_dummies(test_set, drop_first=False)
# Align with the training feature columns: add dummies missing from the test file as 0,
# drop any extras, and match the column order the scaler was fitted on
test_set = test_set.reindex(columns=X.columns, fill_value=0)
X_final = scaler.transform(test_set)
predictions = grid_result.best_estimator_.predict(X_final)
pd.DataFrame(predictions, columns=['Predictions']).to_csv('Team17_predictions.txt', index=False)
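
One-hot encoding the test file separately can yield a different column set than the training frame, for example when a category never appears in the test rows. The scaler and the model both expect the training layout, so the columns need to be aligned first. A minimal sketch of that alignment with `DataFrame.reindex`, using hypothetical column names:

```python
import pandas as pd

# Columns the model was trained on (hypothetical).
train_cols = ["age", "workclass_Private", "workclass_State-gov"]

# The test rows happen to contain only one workclass category.
test_raw = pd.DataFrame({"age": [30, 45], "workclass": ["Private", "Private"]})
test_enc = pd.get_dummies(test_raw, drop_first=False)

# Add missing indicator columns as 0, drop extras, and fix the order.
aligned = test_enc.reindex(columns=train_cols, fill_value=0)
print(aligned.columns.tolist())
```

After alignment the test frame has exactly the training columns in the training order, with zeros for categories the test file never contained.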