# Predicting Default Payments with Fully-Connected NNs

The dataset contains information on default payments, demographic factors, credit data, history of payment, and bill statements of credit card clients in Taiwan from April 2005 to September 2005.

## Dataset Description
This dataset uses a binary variable, default payment (1 = Yes, 0 = No), as the response variable. The study selected the following 23 factors as explanatory variables:

- Variable 1: Amount of credit granted (in local currency), which includes both individual credit and family (supplementary) credit.
- Variable 2: Gender (1 = male; 2 = female).
- Variable 3: Education level (1 = graduate school; 2 = university; 3 = high school; 4 = others).
- Variable 4: Marital status (1 = married; 2 = single; 3 = others).
- Variable 5: Age (years).
- Variables 6-11: Payment history over six months, from April to September. The payment status scale ranges from -1 (paid on time) to 9 (delayed by nine months or more), with 1 to 8 indicating a delay of that many months:

    - Variable 6: Payment status in September;
    - Variable 7: Payment status in August;
    - Variable 8: Payment status in July;
    - Variable 9: Payment status in June;
    - Variable 10: Payment status in May;
    - Variable 11: Payment status in April.
- Variables 12-17: Amount of the monthly bill statement (in local currency), listed from September back to April.
- Variables 18-23: Amount of the previous payment (in local currency), listed from September back to April.
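
The variable groups above correspond to column names used in the code cells of this notebook. The sketch below collects them in one place. `PAY_5` and `PAY_6` do not appear in the cells that follow and are assumed here from the `PAY_0`, `PAY_2`, ... naming pattern, so they should be checked against `train.csv`.

```python
# Column groups matching the variable description above.
# Names are taken from the later cells of this notebook, except PAY_5 and
# PAY_6, which are an assumption based on the naming pattern.
TARGET = 'default payment next month'

FEATURE_GROUPS = {
    'credit': ['LIMIT_BAL'],
    'demographic': ['SEX', 'EDUCATION', 'MARRIAGE', 'AGE'],
    # note: the status columns jump from PAY_0 to PAY_2
    'payment_status': ['PAY_0', 'PAY_2', 'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6'],
    'bill_amount': [f'BILL_AMT{i}' for i in range(1, 7)],
    'paid_amount': [f'PAY_AMT{i}' for i in range(1, 7)],
}

ALL_FEATURES = [name for group in FEATURE_GROUPS.values() for name in group]
print(len(ALL_FEATURES), 'explanatory variables')
```

A quick check such as `set(ALL_FEATURES) - set(train_data.columns)` would reveal any name that does not match the file.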

## Inspecting the data

any comment about data dimensionality/distribution goes here

In [None]:
# import libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np

In [None]:
# load the dataset
train_data = pd.read_csv("train.csv")
test_data = pd.read_csv("test.csv")

In [None]:
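# show the first rows of the training dataset
# keep the full dataset in df_train_data: assigning head() would leave only 5 rows for the analysis below
df_train_data = train_data.copy()
print(df_train_data.head())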

In [None]:
# check the dimensionality of the training and test datasets
print("Training set shape:", train_data.shape)
print("Test set shape:", test_data.shape)

In [None]:
df_train_data.info()  # get information about the columns

missing_values = df_train_data.isnull().sum()
print(missing_values)

In [None]:
# get descriptive statistics for the numerical variables
print(df_train_data.describe())

In [None]:
# check whether any null values are present
df_train_data.isnull().sum().any()

In [None]:
# check whether any duplicated rows are present
df_train_data.duplicated().sum()

### Univariate statistical analysis

In [None]:
column = 'default payment next month'
# class counts ordered by label, so that slices and labels always match
counts = df_train_data[column].value_counts().sort_index()
labels = ['no' if value == 0 else 'yes' for value in counts.index]
plt.pie(counts, labels=labels, autopct='%1.1f%%', colors=['green', 'orange'])
plt.title(f'Proportion of {column} (target)')
plt.legend(loc='upper right')
plt.show()

In [None]:
# Analysis of the target variable
# Count the distribution of the target variable
sns.countplot(x='default payment next month', data=df_train_data)
plt.title('Distribution of default payment next month')
plt.show()
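
If the count plot shows that one class dominates, accuracy alone can be misleading and it may help to weight the classes during training. Below is a minimal sketch of inverse-frequency class weights, using a small made-up label array in place of the real target column. The function name `inverse_frequency_weights` is hypothetical.

```python
import numpy as np

def inverse_frequency_weights(labels):
    """Return {class: weight} with weight = n_samples / (n_classes * class_count)."""
    labels = np.asarray(labels)
    classes, counts = np.unique(labels, return_counts=True)
    weights = labels.size / (len(classes) * counts)
    return {int(c): float(w) for c, w in zip(classes, weights)}

# Toy labels for illustration only: 8 non-defaults and 2 defaults.
toy_labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
class_weights = inverse_frequency_weights(toy_labels)
print(class_weights)
```

A dictionary of this shape can be passed to Keras through the `class_weight` argument of `model.fit`.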

In [None]:
# distribution of the numerical features

numerical_columns = ['LIMIT_BAL', 'AGE',
                     'PAY_AMT1', 'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6',
                     'BILL_AMT1', 'BILL_AMT2', 'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6']

# Histogram for each numerical variable.
# A fixed number of bins lets each column use its own range; bin edges derived
# from the binary target (0/1) would not cover these features.
df_train_data[numerical_columns].hist(figsize=(16, 12), bins=20, color='green')
plt.show()

In [None]:

# distribution of the categorical features
categorical_columns = ['SEX', 
                       'PAY_0', 'PAY_2','PAY_3','PAY_4',
                       'EDUCATION', 'MARRIAGE']

def print_categoric_feature(column):
    plt.figure(figsize=(10, 7))
    #sns.countplot(data=df_train_data, x=column, color='orange', legend=True)
    sns.countplot(data=df_train_data, x=column, hue=column, palette='Set2', legend=False)
    plt.title(f'{column}')
    plt.xlabel(column)
    plt.ylabel('Frequency')
    plt.show()

for i in categorical_columns:
    print_categoric_feature(i)
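
The count plots can reveal category codes that the dataset description does not list. Below is a small sketch for checking this programmatically, assuming the codes given in the description for `SEX` and `EDUCATION` are the only documented ones. The helper `undocumented_codes` and the toy frame are illustrative only.

```python
import pandas as pd

# Codes listed in the dataset description (assumed complete for this check).
DOCUMENTED_CODES = {
    'SEX': {1, 2},
    'EDUCATION': {1, 2, 3, 4},
}

def undocumented_codes(df, documented):
    """Return {column: sorted list of values not in the documented set}."""
    result = {}
    for col, allowed in documented.items():
        extra = set(df[col].dropna().unique()) - allowed
        if extra:
            result[col] = sorted(int(v) for v in extra)
    return result

# Toy data for illustration, not taken from train.csv.
toy = pd.DataFrame({'SEX': [1, 2, 2, 1], 'EDUCATION': [1, 2, 5, 0]})
print(undocumented_codes(toy, DOCUMENTED_CODES))
```

On the real data the call would be `undocumented_codes(df_train_data, DOCUMENTED_CODES)`. How to treat any unexpected code (merge it into "others", drop the rows, or keep it) is a preprocessing decision.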

In [None]:
# Compute the correlation between each feature and the target
correlation = df_train_data.drop('default payment next month', axis=1).corrwith(df_train_data['default payment next month'])

# Bar chart of the correlations
plt.figure(figsize=(12, 8))
correlation.plot(kind='bar', grid=True, color='orange')
plt.title("Correlation with 'default payment next month'")
plt.xlabel("Features")
plt.ylabel("Correlation")
plt.xticks(rotation=45)
plt.show()


In [None]:
# relationships between variables
# correlation heatmap
plt.figure(figsize=(16, 13))
correlation_matrix = df_train_data.corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation matrix of the numerical variables')
plt.show()
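
A heatmap with this many variables is hard to read cell by cell. One way to summarize it is to list the feature pairs whose absolute correlation exceeds a threshold. The sketch below does this on synthetic data. Both the helper name `correlated_pairs` and the 0.9 threshold are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

def correlated_pairs(df, threshold=0.9):
    """List (col_a, col_b, corr) for pairs with |corr| >= threshold."""
    corr = df.corr()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):  # upper triangle only
            value = corr.iloc[i, j]
            if abs(value) >= threshold:
                pairs.append((cols[i], cols[j], float(value)))
    # strongest relationships first
    return sorted(pairs, key=lambda p: abs(p[2]), reverse=True)

# Synthetic example: 'b' is almost a copy of 'a', 'c' is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
demo = pd.DataFrame({
    'a': a,
    'b': a + 0.01 * rng.normal(size=500),
    'c': rng.normal(size=500),
})
for col_a, col_b, value in correlated_pairs(demo):
    print(f'{col_a} ~ {col_b}: {value:.3f}')
```

Applied to `df_train_data`, a list like this would make it easier to see whether groups of features carry largely redundant information.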


## Preparing the data

describe the choice made during the preprocessing operations, also taking into account the previous considerations during the data inspection.
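
One preprocessing choice that is common for gradient-based training is to standardize the features, computing the statistics on the training split only so that no information leaks from held-out data. Below is a minimal NumPy sketch under that assumption. The helper names and toy arrays are hypothetical, and this is not necessarily the preprocessing used in the cells that follow.

```python
import numpy as np

def fit_standardizer(x_train):
    """Compute per-column mean and std on the training split only."""
    mean = x_train.mean(axis=0)
    std = x_train.std(axis=0)
    std = np.where(std == 0, 1.0, std)  # avoid division by zero for constant columns
    return mean, std

def apply_standardizer(x, mean, std):
    """Scale any split with the statistics learned from the training split."""
    return (x - mean) / std

# Toy arrays for illustration: two features on very different scales.
x_tr = np.array([[20000.0, 25.0], [50000.0, 35.0], [80000.0, 45.0]])
x_va = np.array([[50000.0, 30.0]])

mean, std = fit_standardizer(x_tr)
x_tr_scaled = apply_standardizer(x_tr, mean, std)
x_va_scaled = apply_standardizer(x_va, mean, std)
print(x_tr_scaled.mean(axis=0), x_tr_scaled.std(axis=0))
```

The validation row is scaled with the training mean and standard deviation, so its values are comparable with what the network saw during training.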

## Building the network

any description/comment about the procedure you followed in the choice of the network structure and hyperparameters goes here, together with consideration about the training/optimization procedure (e.g. optimizer choice, final activations, loss functions, training metrics)

### With the raw dataset

In [None]:
from keras.utils import to_categorical
from keras.models import Sequential
from keras.layers import Dense, Activation, Input
from keras.optimizers import SGD


In [None]:
# Extract the features and the target variable
y = df_train_data['default payment next month']
X = df_train_data.drop(columns=['default payment next month'])

In [None]:
# Split the dataset into training and test sets (test_size of 30%)
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
# the target is binary, so it is one-hot encoded into 2 classes
y_train = to_categorical(y_train, num_classes=2)

In [None]:
nb_classes = y_train.shape[1]  # number of one-hot columns, not number of samples
print(nb_classes, 'classes')

dims = X_train.shape[1]
print(X_train.shape, 'dims Training set')

model = Sequential()
model.add(Input((dims,)))
model.add(Dense(32, activation='relu'))
model.add(Dense(16, activation='tanh'))
model.add(Dense(nb_classes))
model.add(Activation('softmax'))

In [None]:
optimizer1 = SGD(learning_rate=0.001)

model.compile(optimizer=optimizer1, loss='categorical_crossentropy',
              metrics=['accuracy'])

In [None]:
history1 = model.fit(X_train, y_train, batch_size=128, epochs=10, validation_split=0.1)

In [None]:
def plot_loss(history):
  x_plot = list(range(1,len(history.history["loss"])+1))
  plt.figure()
  plt.title("Loss")
  plt.xlabel('Epochs')
  plt.ylabel('Loss')
  plt.plot(x_plot, history.history['loss'])
  plt.plot(x_plot, history.history['val_loss'])
  plt.legend(['Training', 'Validation'])

def plot_accuracy(history):
  x_plot = list(range(1,len(history.history["accuracy"])+1))
  plt.figure()
  plt.title("Accuracy")
  plt.xlabel('Epochs')
  plt.ylabel('Accuracy')
  plt.plot(x_plot, history.history['accuracy'])
  plt.plot(x_plot, history.history['val_accuracy'])
  plt.legend(['Training', 'Validation'])

In [None]:
plot_loss(history1)
plot_accuracy(history1)

### Start of modifications

In [None]:
import pandas as pd
from keras.utils import to_categorical
from sklearn.model_selection import train_test_split

# Select the input columns and the target
X = df_train_data.drop('default payment next month', axis=1)  # drop the target column
y = df_train_data['default payment next month']  # target column

# Split the dataset into training and test sets
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Min-max scale the features to [0, 1] using statistics from the training split only,
# so that both splits share the same scale. Some columns contain negative values,
# so dividing by the maximum alone would not map them to [0, 1].
x_min = x_train.min()
x_range = (x_train.max() - x_min).replace(0, 1)  # avoid division by zero for constant columns
x_train = ((x_train - x_min) / x_range).astype('float32')
x_test = ((x_test - x_min) / x_range).astype('float32')

# One-hot encode the binary target, since the network below ends with a softmax
# layer that has one unit per class
y_train = to_categorical(y_train, num_classes=2)



In [None]:
dims = x_train.shape[1]
print('Input Shape =', dims)

nb_classes = y_train.shape[1]
print('Number classes = Output Shape =', nb_classes)

model = Sequential()
model.add(Input((dims,)))
model.add(Dense(32, activation='relu'))
model.add(Dense(16, activation='tanh'))
model.add(Dense(nb_classes))
model.add(Activation('softmax'))

In [None]:
model.summary()

In [None]:
# compile the model
model.compile(optimizer='sgd', loss='categorical_crossentropy', metrics=['accuracy'])

In [None]:
# Train the model
model.fit(x_train, y_train)

In [None]:
# predict
predictions = model.predict(x_test)

int_predictions = np.argmax(predictions, axis=1)

print(int_predictions[:10])
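
The predicted class indices can be compared with the true labels to see how the errors are distributed between the two classes. Below is a sketch with plain NumPy, using short made-up arrays in place of `y_test` and `int_predictions`. The function name `binary_report` is hypothetical.

```python
import numpy as np

def binary_report(y_true, y_pred):
    """Confusion counts plus precision, recall and F1 for the positive class (1)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {'tp': tp, 'tn': tn, 'fp': fp, 'fn': fn,
            'precision': precision, 'recall': recall, 'f1': f1}

# Made-up labels and predictions, for illustration only.
toy_true = [0, 0, 0, 0, 1, 1, 1, 0]
toy_pred = [0, 0, 1, 0, 1, 0, 1, 0]
report = binary_report(toy_true, toy_pred)
print(report)
```

In the notebook the call would be `binary_report(y_test, int_predictions)`, since `y_test` still holds the original 0/1 labels.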

Hyperparameters to explore

In [None]:
# build the network 
dims = x_train.shape[1]
print('Input Shape =', dims)

nb_classes = y_train.shape[1]
print('Number classes = Output Shape =', nb_classes)

model = Sequential()
model.add(Input((dims,)))
model.add(Dense(64, activation='relu'))
model.add(Dense(32, activation='relu'))
model.add(Dense(2, activation='relu'))
model.add(Dense(nb_classes, activation='softmax'))

optimizer = SGD(learning_rate=0.001)

# categorical cross-entropy matches the one-hot targets and the softmax output
model.compile(optimizer=optimizer, loss='categorical_crossentropy',
              metrics=['accuracy'])

In [None]:
# training
history = model.fit(x_train, y_train, batch_size=128, epochs=50, validation_split=0.1)

In [None]:
plot_loss(history)
plot_accuracy(history)

## Analyze and comment on the training results

here goes any comment/visualization of the training history and any initial consideration on the training results  
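
Besides reading the curves by eye, the same information can be extracted from `history.history`, which is a dictionary of per-epoch lists. The sketch below finds the epoch with the lowest validation loss and the train/validation gap at that point. It runs on a hand-written dictionary with invented numbers, and `summarize_history` is a hypothetical helper.

```python
def summarize_history(hist):
    """Summarize a Keras-style history dict with 'loss' and 'val_loss' lists."""
    val_loss = hist['val_loss']
    best_idx = min(range(len(val_loss)), key=val_loss.__getitem__)
    return {
        'best_epoch': best_idx + 1,  # epochs are counted from 1 in the plots
        'best_val_loss': val_loss[best_idx],
        # a large positive gap suggests overfitting
        'gap_at_best': val_loss[best_idx] - hist['loss'][best_idx],
        # validation loss rising after its minimum is another overfitting sign
        'val_loss_rose_after_best': val_loss[-1] > val_loss[best_idx],
    }

# Invented values, only to show the shape of the input.
toy_history = {
    'loss':     [0.70, 0.60, 0.50, 0.40, 0.30],
    'val_loss': [0.72, 0.63, 0.55, 0.58, 0.65],
}
summary = summarize_history(toy_history)
print(summary)
```

With a trained model the call would be `summarize_history(history.history)`.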

## Validate the model and comment on the results

please describe the evaluation procedure on a validation set, commenting the generalization capability of your model (e.g. under/overfitting). You may also describe the performance metrics that you choose: what is the most suitable performance measure (or set of performance measures) in this case/dataset, according to you? Why?
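
When the classes are imbalanced, a model that always predicts the majority class can still reach high accuracy. Two measures that are less sensitive to this are balanced accuracy and ROC AUC. Below is a NumPy sketch of both on made-up scores. The function names are hypothetical, and the pairwise AUC is written for clarity rather than speed, since it builds an array of size n_pos x n_neg.

```python
import numpy as np

def balanced_accuracy(y_true, y_pred):
    """Mean of the recall on class 0 and the recall on class 1."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    recalls = [np.mean(y_pred[y_true == c] == c) for c in (0, 1)]
    return float(np.mean(recalls))

def roc_auc(y_true, scores):
    """Probability that a random positive gets a higher score than a random negative."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]  # all positive/negative pairs
    # ties count as half a win
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

# Made-up labels and predicted probabilities of class 1.
toy_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
toy_scores = np.array([0.1, 0.2, 0.1, 0.3, 0.2, 0.4, 0.1, 0.6, 0.7, 0.3])

always_zero = np.zeros_like(toy_true)
print('accuracy of always predicting 0:', np.mean(always_zero == toy_true))
print('balanced accuracy of the same:  ', balanced_accuracy(toy_true, always_zero))
print('ROC AUC of the scores:          ', roc_auc(toy_true, toy_scores))
```

With a two-unit softmax output, the score for class 1 would be the second column of the array returned by `model.predict`.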

## Make predictions (on the provided test set)

Based on the results obtained and analyzed during the training and validation phases, what are your (rather _personal_) expectations regarding the performance of your model on the blind external test set? Briefly motivate your answer.

# OPTIONAL -- Export the predictions in the format indicated in the assignment release page and verify your predictions on the [assessment page](https://aml-assignmentone-2425.streamlit.app/).