# Predicting Default Payments with Fully-Connected NNs

The dataset contains information on default payments, demographic factors, credit data, history of payment, and bill statements of credit card clients in Taiwan from April 2005 to September 2005.

## Dataset Description
This dataset uses a binary target variable indicating whether the client defaulted on the next month's payment (1 = Yes, 0 = No). The study selected the following 23 factors as explanatory variables:

- Variable 1: Amount of credit granted (in local currency), which includes both individual credit and family (supplementary) credit.
- Variable 2: Gender (1 = male; 2 = female).
- Variable 3: Education level (1 = graduate school; 2 = university; 3 = high school; 4 = others).
- Variable 4: Marital status (1 = married; 2 = single; 3 = others).
- Variable 5: Age (years).
- Variables 6-11: Payment history over six months. The payment status scale ranges from -1 (paid on time) to 9 (delayed by nine months or more). It tracks payments from April to September:

    - Variable 6: Payment status in September;
    - Variable 7: Payment status in August;
    - Variable 8: Payment status in July;
    - Variable 9: Payment status in June;
    - Variable 10: Payment status in May;
    - Variable 11: Payment status in April.
- Variables 12-17: Amount of monthly billing (in local currency), tracking statements from September back to April.
- Variables 18-23: Amount of previous payments (in local currency), corresponding to monthly payments made from September back to April.
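
The variable groups above correspond to the column names used in the code cells of this notebook. The following is a minimal sketch of that correspondence, assuming `train.csv` follows the naming seen in those cells (`PAY_0` for the most recent month, then `PAY_2` to `PAY_6`, with no `PAY_1`). The constant names `MONTHS`, `PAY_STATUS`, `BILL_AMOUNT`, `PAY_AMOUNT`, and `FEATURES` are illustrative and do not appear in the original notebook.

```python
# Column names as they appear in the notebook's code cells (assumed CSV layout).
MONTHS = ["September", "August", "July", "June", "May", "April"]
PAY_STATUS = ["PAY_0", "PAY_2", "PAY_3", "PAY_4", "PAY_5", "PAY_6"]  # note: no PAY_1
BILL_AMOUNT = [f"BILL_AMT{i}" for i in range(1, 7)]
PAY_AMOUNT = [f"PAY_AMT{i}" for i in range(1, 7)]

# All explanatory variables, in the order of the description above.
FEATURES = (
    ["LIMIT_BAL", "SEX", "EDUCATION", "MARRIAGE", "AGE"]
    + PAY_STATUS
    + BILL_AMOUNT
    + PAY_AMOUNT
)

# Month each payment-status column refers to, most recent first.
pay_status_month = dict(zip(PAY_STATUS, MONTHS))

print(len(FEATURES))
print(pay_status_month)
```

Keeping the groups in named lists makes it easier to apply different preprocessing to the status codes and to the monetary amounts later on.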

## Inspecting the data

any comment about data dimensionality/distribution goes here

In [None]:
# Import libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

In [None]:
# Load the dataset
train_data = pd.read_csv("train.csv")
test_data = pd.read_csv("test.csv")

In [None]:
train_data.dtypes

In [None]:
train_data.head()

In [None]:
train_data.describe()

In [None]:
train_data.info()

In [None]:
# Check the shape of the training and test datasets
print("Training dataset shape:", train_data.shape)
print("Test dataset shape:", test_data.shape)

In [None]:
# Check for missing values
train_data.isnull().sum().any()

In [None]:
# Check for duplicated rows
train_data.duplicated().sum()

In [None]:
# Check for infinite values
np.isinf(train_data.values).any()

### Univariate statistical analysis

In [None]:
# Show the unique values of each categorical column
categorical_columns = ['SEX', 'EDUCATION', 'MARRIAGE', 'PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6']
for col in categorical_columns:
    print(f"{col}: {train_data[col].unique()}")

In [None]:
# Numeric and categorical variables
numerical_columns = ['LIMIT_BAL', 'AGE'] + [f'BILL_AMT{i}' for i in range(1, 7)] + [f'PAY_AMT{i}' for i in range(1, 7)]
categorical_columns = ['SEX', 'EDUCATION', 'MARRIAGE'] + ['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6']

# Univariate analysis of the numeric variables
for column in numerical_columns:
    plt.figure(figsize=(8, 4))
    sns.histplot(train_data[column], kde=True, color='skyblue')
    plt.title(f'Distribution of {column}')
    plt.xlabel(column)
    plt.ylabel('Frequency')
    plt.show()

# Univariate analysis of the categorical variables (counts per category)
for column in categorical_columns:
    plt.figure(figsize=(8, 4))
    sns.countplot(x=train_data[column], color='lightgreen')
    plt.title(f'Count of {column}')
    plt.xlabel(column)
    plt.ylabel('Frequency')
    plt.show()


In [None]:
column = 'default payment next month'
counts = train_data[column].value_counts()
# plt.pie normalizes the counts itself; label each wedge with its own class value
# instead of a hard-coded list, so labels cannot get out of order
plt.pie(counts, labels=[str(c) for c in counts.index], autopct='%1.1f%%', colors=['green', 'orange'])
plt.title(f'Proportion of {column} (target)')
plt.legend(loc='upper right')
plt.show()

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# The target is the 'default payment next month' column of train_data
class_counts = train_data['default payment next month'].value_counts()

# Show the class distribution
print("Class distribution:")
print(class_counts)

# Pie chart
plt.figure(figsize=(6, 6))
plt.pie(class_counts, labels=class_counts.index, autopct='%1.1f%%', startangle=90, colors=['#66c2a5', '#fc8d62'])
plt.title('Class distribution')
plt.show()


### Multivariate statistical analysis

In [None]:
# Categorical features
categorical_columns = ['SEX', 'EDUCATION', 'MARRIAGE', 'PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6']

# Relationship between each categorical variable and the target variable
for column in categorical_columns:
    plt.figure(figsize=(8, 4))
    sns.countplot(x=column, hue='default payment next month', data=train_data, palette='coolwarm')
    plt.title(f'{column} distribution by Default Payment Status')
    plt.xlabel(column)
    plt.ylabel('Frequency')
    plt.legend(title='Default Payment')
    plt.show()


In [None]:
# Compute the correlation between each feature and the target
correlation = train_data.drop('default payment next month', axis=1).corrwith(train_data['default payment next month'])

# Bar chart of the correlations
plt.figure(figsize=(12, 8))
correlation.plot(kind='bar', grid=True, color='orange')
plt.title("Correlation with 'default payment next month'")
plt.xlabel("Features")
plt.ylabel("Correlation")
plt.xticks(rotation=45)
plt.show()
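
`DataFrame.corrwith` uses the Pearson coefficient by default, and with a binary target this coincides with the point-biserial correlation. The following is a minimal sketch of what is computed for a single feature, using synthetic data rather than the credit dataset. The helper `pearson` and the arrays `feature` and `target` are illustrative names, not part of the original notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: a binary target and one numeric feature loosely tied to it.
target = rng.integers(0, 2, size=500)
feature = 0.8 * target + rng.normal(size=500)

def pearson(x, y):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc, yc = x - x.mean(), y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

r = pearson(feature, target)
print(round(r, 3))
```

Pearson correlation only captures linear association, so a low value for an ordinal code such as a payment status does not by itself rule out a useful non-linear relationship with the target.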


In [None]:
# Relationships between variables
# Correlation heatmap
plt.figure(figsize=(16, 13))
correlation_matrix = train_data.corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation matrix of the numeric variables')
plt.show()


In [None]:
# Drop the ID column: it is a row identifier, not a feature
df = train_data.drop(columns=['ID'])
#df = df.drop(columns=['LIMIT_BAL'])
#df = df.drop(columns=['SEX'])
#df = df.drop(columns=['MARRIAGE'])
#df = df.drop(columns=['BILL_AMT1'])
#df = df.drop(columns=['BILL_AMT2'])
#df = df.drop(columns=['BILL_AMT3'])
#df = df.drop(columns=['BILL_AMT4'])
#df = df.drop(columns=['BILL_AMT5'])
#df = df.drop(columns=['BILL_AMT6'])
#df = df.drop(columns=['PAY_AMT1'])
#df = df.drop(columns=['PAY_AMT2'])
#df = df.drop(columns=['PAY_AMT3'])
#df = df.drop(columns=['PAY_AMT4'])
#df = df.drop(columns=['PAY_AMT5'])
#df = df.drop(columns=['PAY_AMT6'])

In [None]:
df.dtypes

## Preparing the data

Describe the choices made during preprocessing, taking into account the observations from the data inspection.

## Building the network

any description/comment about the procedure you followed in the choice of the network structure and hyperparameters goes here, together with consideration about the training/optimization procedure (e.g. optimizer choice, final activations, loss functions, training metrics)

In [None]:
# Mark the coded columns as categorical
for col in ['SEX', 'EDUCATION', 'MARRIAGE', 'PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6']:
    df[col] = df[col].astype('category')

In [None]:
from tensorflow.keras.utils import to_categorical
from sklearn.preprocessing import StandardScaler
from keras.models import Sequential
from keras.layers import Dense, Activation, Input
from keras.optimizers import SGD, Adam
from sklearn.model_selection import train_test_split
from scikeras.wrappers import KerasClassifier
from sklearn.model_selection import GridSearchCV

In [None]:
y = df['default payment next month']
X = df.drop(columns=['default payment next month'])

In [None]:
# Standardize the features (zero mean, unit variance)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

In [None]:
x_train, x_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, random_state=42)
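
The cells above fit `StandardScaler` on the full feature matrix before splitting, so the held-out rows contribute to the mean and variance used for scaling. A common alternative is to split first and compute the scaling statistics on the training portion only. The following is a minimal NumPy sketch of that order of operations on synthetic data. The array names and the 70/30 split mirror the notebook, but the data and the manual split are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(loc=50.0, scale=10.0, size=(200, 3))  # synthetic feature matrix

# Shuffle the row indices and hold out 30% as a test split.
idx = rng.permutation(len(X))
n_test = int(0.3 * len(X))
test_idx, train_idx = idx[:n_test], idx[n_test:]
x_train, x_test = X[train_idx], X[test_idx]

# Fit the scaling statistics on the training split only...
mean = x_train.mean(axis=0)
std = x_train.std(axis=0)  # population std (ddof=0), as StandardScaler uses

# ...and apply the same transformation to both splits.
x_train_scaled = (x_train - mean) / std
x_test_scaled = (x_test - mean) / std

print(x_train_scaled.mean(axis=0).round(6), x_train_scaled.std(axis=0).round(6))
```

With scikit-learn the same idea is usually written as `scaler.fit_transform(x_train)` followed by `scaler.transform(x_test)`.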

Hyperparameters still to be tuned.

In [None]:
# build the network 
dims = x_train.shape[1]
print('Input Shape =', dims)

#y_train = to_categorical(y_train)

nb_classes = 1
print('Number classes = Output Shape =', nb_classes)

model = Sequential()
model.add(Input((dims,)))
model.add(Dense(128, activation='relu'))
#model.add(Dense(64, activation='relu'))
#model.add(Dense(32, activation='relu'))
model.add(Dense(16, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))

optimizer = Adam(learning_rate=0.0001)

model.compile(optimizer=optimizer, loss='binary_crossentropy',
              metrics=['accuracy'])

In [None]:
model.summary()

In [None]:
# training
history = model.fit(x_train, y_train, batch_size=100, epochs=30, validation_split=0.1)

In [None]:
import matplotlib.pyplot as plt

def plot_loss(history):
  x_plot = list(range(1,len(history.history["loss"])+1))
  plt.figure()
  plt.title("Loss")
  plt.xlabel('Epochs')
  plt.ylabel('Loss')
  plt.plot(x_plot, history.history['loss'])
  plt.plot(x_plot, history.history['val_loss'])
  plt.legend(['Training', 'Validation'])

def plot_accuracy(history):
  x_plot = list(range(1,len(history.history["accuracy"])+1))
  plt.figure()
  plt.title("Accuracy")
  plt.xlabel('Epochs')
  plt.ylabel('Accuracy')
  plt.plot(x_plot, history.history['accuracy'])
  plt.plot(x_plot, history.history['val_accuracy'])
  plt.legend(['Training', 'Validation'])

In [None]:
plot_loss(history)
plot_accuracy(history)

In [None]:
from sklearn.metrics import classification_report
import numpy as np

# Predicted probabilities for the held-out split
y_pred_test = model.predict(x_test)

# Round the probabilities to get binary predictions (flattened to 1-D)
y_pred_test_bin = np.round(y_pred_test).astype(int).ravel()

# Predicted probabilities for the training split
y_pred_train = model.predict(x_train)

# Round the probabilities to get binary predictions (flattened to 1-D)
y_pred_train_bin = np.round(y_pred_train).astype(int).ravel()

print("Performance on the training set:")
print(classification_report(y_train, y_pred_train_bin))

print("Performance on the test set:")
print(classification_report(y_test, y_pred_test_bin))

In [None]:
from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay

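# Compute the confusion matrix: the true labels go first, then the predictions
cm = confusion_matrix(y_test, y_pred_test_bin)
# confusion_matrix orders the classes ascending, so the display labels must be [0, 1]
labels = [0, 1]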
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=labels)
disp.plot()
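
In scikit-learn's convention, entry `[i, j]` of the confusion matrix counts samples whose true class is `i` and whose predicted class is `j`, with classes in ascending order. The following is a minimal sketch of that layout built by hand. The labels and probabilities are toy values, not outputs of the model above.

```python
import numpy as np

# Toy values for illustration only (not taken from the dataset or the model).
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0])
y_prob = np.array([0.1, 0.4, 0.7, 0.2, 0.9, 0.3, 0.6, 0.05])

# Threshold the sigmoid-like outputs at 0.5 to get binary predictions.
y_pred = (y_prob > 0.5).astype(int)

# cm[i, j] counts samples whose true class is i and predicted class is j.
cm = np.zeros((2, 2), dtype=int)
for t, p in zip(y_true, y_pred):
    cm[t, p] += 1

# Row-major unpacking gives: true negatives, false positives, false negatives, true positives.
tn, fp, fn, tp = cm.ravel()
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(cm)
print(precision, recall)
```

Swapping the two arguments transposes the matrix, which exchanges false positives and false negatives and therefore swaps precision and recall.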

## Analyze and comment the training results

here goes any comment/visualization of the training history and any initial consideration on the training results  

## Validate the model and comment the results

please describe the evaluation procedure on a validation set, commenting the generalization capability of your model (e.g. under/overfitting). You may also describe the performance metrics that you choose: what is the most suitable performance measure (or set of performance measures) in this case/dataset, according to you? Why?
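
As one input to the choice of metric, the sketch below shows why accuracy alone can be misleading when the classes are imbalanced. It scores a trivial predictor that always outputs the majority class. The class counts are toy numbers chosen for illustration, not measurements from this dataset.

```python
# Toy class counts chosen for illustration only (not measured on this dataset).
n_negative, n_positive = 80, 20

# A trivial "model" that always predicts the majority class (0 = no default).
tp, fn = 0, n_positive
tn, fp = n_negative, 0

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall_positive = tp / (tp + fn)   # share of actual defaulters that are caught
recall_negative = tn / (tn + fp)   # share of non-defaulters correctly cleared
balanced_accuracy = (recall_positive + recall_negative) / 2

print(accuracy, recall_positive, balanced_accuracy)
```

The trivial predictor reaches high accuracy while detecting no defaults at all, which is why recall, precision, F1, or balanced accuracy on the positive class are often reported alongside accuracy in this kind of problem.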

## Make predictions (on the provided test set)

Based on the results obtained and analyzed during the training and the validation phases, what are your (rather _personal_) expectations with respect to the performances of your model on the blind external test set? Briefly motivate your answer.

# OPTIONAL -- Export the predictions in the format indicated in the assignment release page and verify your predictions on the [assessment page](https://aml-assignmentone-2425.streamlit.app/).