## Data for Admission in the University: analyzing applied Neural Network models

#### Importing the required libraries

In [1531]:
import pandas as pd
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split, KFold

#### Reading the .csv file with the data

In [1532]:
data = pd.read_csv('adm_data.csv')  

#### Renaming columns to remove unnecessary spaces and make them easier to access

In [1533]:
data.rename(columns = {'Chance of Admit ': 'Chance of Admit'}, inplace = True)
data.rename(columns = {'LOR ': 'LOR'}, inplace = True)

#### Viewing the data

In [1534]:
data.head()

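| | Serial No. | GRE Score | TOEFL Score | University Rating | SOP | LOR | CGPA | Research | Chance of Admit |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 337 | 118 | 4 | 4.5 | 4.5 | 9.65 | 1 | 0.92 |
| 1 | 2 | 324 | 107 | 4 | 4.0 | 4.5 | 8.87 | 1 | 0.76 |
| 2 | 3 | 316 | 104 | 3 | 3.0 | 3.5 | 8.0 | 1 | 0.72 |
| 3 | 4 | 322 | 110 | 3 | 3.5 | 2.5 | 8.67 | 1 | 0.8 |
| 4 | 5 | 314 | 103 | 2 | 2.0 | 3.0 | 8.21 | 0 | 0.65 |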


#### Dropping the 'Serial No.' column, which will not be used

In [1535]:
data1 = data.drop(columns = ['Serial No.'])

#### Defining features and label

In [1536]:
label = ['Chance of Admit']
features = list(set(data1.columns).difference(label))

In [1537]:
features

['GRE Score',
 'SOP',
 'Research',
 'CGPA',
 'LOR',
 'University Rating',
 'TOEFL Score']
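The order of `features` above comes from a set difference, and the iteration order of a set of strings is not guaranteed to be the same in every Python session. As a minimal sketch, assuming we want the features to keep the column order of the DataFrame, a list comprehension could be used instead. The toy frame below reuses the first two rows shown by `data.head()`, and the name `ordered_features` is only illustrative.

```python
import pandas as pd

# Toy frame with a few of the notebook's columns (first two rows of data.head()).
toy = pd.DataFrame({
    'GRE Score': [337, 324],
    'TOEFL Score': [118, 107],
    'CGPA': [9.65, 8.87],
    'Chance of Admit': [0.92, 0.76],
})

label = ['Chance of Admit']

# A list comprehension keeps the original column order, unlike a set difference.
ordered_features = [col for col in toy.columns if col not in label]
print(ordered_features)
```

A fixed feature order makes the columns of X, and therefore the inputs seen by the network, the same from one run to the next.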

#### Defining X and y

In [1538]:
X = data[features]
y = data[label]

#### Splitting the data into train and test

In [1539]:
random_state = 25
# train_test_split shuffles the rows by default.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = random_state)

#### Scaling the data to get features in the interval [0, 1]

In [1540]:
X_train_scaled = (X_train - X_train.min())/(X_train.max() - X_train.min())
X_test_scaled = (X_test - X_train.min())/(X_train.max() - X_train.min())
# y_train and y_test are already on the right scale, since they are probabilities.
y_train_scaled = y_train
y_test_scaled = y_test
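The scaling above uses the minimum and maximum of the training set for both splits, so no information from the test set leaks into the preprocessing. A minimal sketch of the same idea as a reusable helper follows. The name `min_max_scale` is hypothetical, the training rows come from `data.head()`, and the second test row is made up to show what happens with values outside the training range.

```python
import pandas as pd

def min_max_scale(train, test):
    """Scale both frames with the minimum and range of the training frame."""
    train_min = train.min()
    train_range = train.max() - train_min
    return (train - train_min) / train_range, (test - train_min) / train_range

train = pd.DataFrame({'GRE Score': [337, 324, 316], 'CGPA': [9.65, 8.87, 8.0]})
# The second test row is hypothetical and lies outside the training range.
test = pd.DataFrame({'GRE Score': [322, 340], 'CGPA': [8.67, 7.5]})

train_scaled, test_scaled = min_max_scale(train, test)
print(train_scaled)
print(test_scaled)
```

Only the training features are guaranteed to land in [0, 1]. Test values above the training maximum or below the training minimum fall slightly outside that interval, which is expected with this scheme.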

#### Creating functions to help us define and train the model

In [1541]:
def create_model(first_layer_units = 32, alpha = 0.1, reg_lambda = 0.1, features = [], random_seed = 0):
    '''Create a defined model with learning rate = alpha.'''
    # Defining Tensorflow's random seed to generate reproducible results.
    tf.keras.utils.set_random_seed(random_seed)
    tf.config.experimental.enable_op_determinism()
    
    # We will be using Tensorflow's Sequential API to build our model.
    model = tf.keras.models.Sequential()
    
    # Adding first layer with input_dim = number of features and relu activation function 
    # to learn some non-linearities. We will also add L2 regularization in layer's kernel (weights).
    model.add(tf.keras.layers.Dense(units = first_layer_units, input_shape = (len(features), ), 
                                   kernel_regularizer = tf.keras.regularizers.L2(l2 = reg_lambda),
                                   activation = 'relu'))
    
    # Adding output layer with linear activation (linear regression output).
    
    model.add(tf.keras.layers.Dense(units = 1, activation = 'linear'))
    
    # Compiling model: selecting optimizer, loss and metrics.
    
    model.compile(optimizer = tf.keras.optimizers.RMSprop(learning_rate = alpha),
                 loss = 'mean_squared_error',
                 metrics = [tf.keras.metrics.MeanSquaredError()])
    
    return model

In [1542]:
def train_model(model = None, X_train = None, y_train = None, epochs = 10,
               batch_size = 10, validation_split = 0.3,
               random_seed = 0):
    tf.keras.utils.set_random_seed(random_seed)
    tf.config.experimental.enable_op_determinism()
    history = model.fit(x = X_train, y = y_train, batch_size = batch_size,
         epochs = epochs, shuffle = False, 
         validation_split = validation_split, verbose = 0)
    
    hist = pd.DataFrame(history.history)
    train_mse = np.array(hist['mean_squared_error'])
    if validation_split != 0:
        val_mse = np.array(hist['val_mean_squared_error'])
        return train_mse, val_mse
    return train_mse

In [1543]:
def test_evaluation(model = None, X_test = None, y_test = None, batch_size = 10):
    return model.evaluate(X_test, y_test, batch_size = batch_size, verbose = 0)

#### Defining hyperparameters

In [1544]:
hp = {'first_layer_units': 64, 'epochs': 6, 'batch_size': 20, 'alpha': 0.001, 'reg_lambda': 0.1}

#### Model

In [1545]:
model = create_model(first_layer_units = hp['first_layer_units'], alpha = hp['alpha'],
                     reg_lambda = hp['reg_lambda'], features = features, random_seed = random_state)

#### Training and validating our first model

In [1546]:
val_split = 0.2

In [1547]:
train_mse, val_mse = train_model(model, X_train_scaled, y_train_scaled, epochs = hp['epochs'], 
                                 batch_size = hp['batch_size'], random_seed = random_state,
                                validation_split = val_split)

- At first sight we notice a "strange" behavior: the values of the loss function, which was chosen as mean_squared_error, differ from the values of the mean_squared_error metric. This happens because the loss reported by Keras also includes the L2 regularization penalty added through kernel_regularizer, whereas the metric is computed only from the predictions and the targets.
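One concrete source of a gap between the reported loss and the MSE metric is the L2 penalty on the kernel, which Keras adds to the loss but not to the metric. The sketch below reproduces that arithmetic with NumPy only. All numbers are hypothetical toy values, not taken from the trained model, and `kernel` is just a stand-in for a layer's weight matrix.

```python
import numpy as np

# Hypothetical toy values, not taken from the trained model.
y_true = np.array([0.92, 0.76, 0.72])
y_pred = np.array([0.90, 0.80, 0.70])
kernel = np.array([[0.1, -0.2], [0.3, 0.05]])  # stand-in for a layer's weights
reg_lambda = 0.1

mse = np.mean((y_true - y_pred) ** 2)          # what the metric reports
l2_penalty = reg_lambda * np.sum(kernel ** 2)  # term added by regularizers.L2
loss = mse + l2_penalty                        # what the loss reports

print(f'metric (mse): {mse:.6f}')
print(f'loss (mse + L2 penalty): {loss:.6f}')
```

With a nonzero `reg_lambda` the loss is always at least as large as the metric, and the two coincide only when the penalty is zero.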

In [1548]:
print(train_mse[-1], val_mse[-1])

0.006387853529304266 0.005437376908957958


In [1549]:
test_evaluation(model, X_test_scaled, y_test_scaled, batch_size = hp['batch_size'])[1]

0.006299859378486872

It looks like the model did very well: the metric values on the training, validation, and test sets are very close to each other and quite small, approaching the values we obtained with the linear regression model built from scratch. Note that the validation mse is consistently lower than the training mse. This suggests that the dataset is biased in some way and that the chosen validation data inherits that bias. Part of the gap may also come from how Keras reports the two values: the training metric is averaged over the batches of an epoch while the weights are still changing, whereas the validation metric is computed at the end of the epoch.

#### Now that we have found good values for the hyperparameters, let's test the model's performance for several numbers of nodes in the first layer, checking how the result behaves as a function of complexity.

In [1550]:
unit_values = [2, 4, 8, 16, 32, 64, 128, 256, 512, 1024]

In [1551]:
train_final_mse = {}
val_final_mse = {}
test_final_mse = {}
for first_layer_unit in unit_values:
    model = create_model(first_layer_units = first_layer_unit, alpha = hp['alpha'],
                     reg_lambda = hp['reg_lambda'], features = features, random_seed = random_state)
    train_mse, val_mse = train_model(model, X_train_scaled, y_train_scaled, epochs = hp['epochs'], 
                                     batch_size = hp['batch_size'], random_seed = random_state)
    train_final_mse[first_layer_unit] = train_mse[-1]
    val_final_mse[first_layer_unit] = val_mse[-1]
    test_eval = test_evaluation(model, X_test_scaled, y_test_scaled, batch_size = hp['batch_size'])
    test_final_mse[first_layer_unit] = test_eval[1]

In [1552]:
for unit in unit_values:
    print(f'{unit}:')
    print(f'Train final mse: {train_final_mse[unit]}')
    print(f'Val final mse: {val_final_mse[unit]}')
    print(f'Test final mse: {test_final_mse[unit]}')

2:
Train final mse: 0.28186458349227905
Val final mse: 0.25501877069473267
Test final mse: 0.25808054208755493
4:
Train final mse: 0.45445516705513
Val final mse: 0.4187878668308258
Test final mse: 0.4218711256980896
8:
Train final mse: 0.38972270488739014
Val final mse: 0.36853405833244324
Test final mse: 0.37934401631355286
16:
Train final mse: 0.016190094873309135
Val final mse: 0.016656536608934402
Test final mse: 0.016049453988671303
32:
Train final mse: 0.012281469069421291
Val final mse: 0.011341418139636517
Test final mse: 0.011271507479250431
64:
Train final mse: 0.00787552073597908
Val final mse: 0.006427847780287266
Test final mse: 0.007352294400334358
128:
Train final mse: 0.006064746994525194
Val final mse: 0.005578806158155203
Test final mse: 0.006258651614189148
256:
Train final mse: 0.005137995816767216
Val final mse: 0.004596150945872068
Test final mse: 0.005783664062619209
512:
Train final mse: 0.006234250962734222
Val final mse: 0.003946168813854456
Test final mse: 0

This experiment shows some curious facts:

- The final error tends to be larger when the model is excessively complex, while models that are not so complex perform relatively well. This happens because, as we increase the model's complexity, it tends to overfit the training data and lose its ability to generalize. With these hyperparameters, the very small models (2, 4 and 8 units) also end with large errors.

- In addition, the "optimal" number of units is data-dependent, so we cannot predict in advance which value will be best. We only know that it should not be extremely large, since we are solving a problem that linear regression fits very well because of the high correlations between the features and the target.

- The best value seems to be 128, in terms of how close the errors are to each other and how small they are.
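The "close and small" criterion used to pick the number of units can be made explicit. The sketch below is one hypothetical way to score each configuration: the worst of the three errors plus the spread between them. The function `closeness_score` is not part of the notebook, and the 512 entry is left out because its test value is cut off in the printed output.

```python
# (train, val, test) final MSEs copied from the printed output above.
results = {
    2: (0.28186458349227905, 0.25501877069473267, 0.25808054208755493),
    4: (0.45445516705513, 0.4187878668308258, 0.4218711256980896),
    8: (0.38972270488739014, 0.36853405833244324, 0.37934401631355286),
    16: (0.016190094873309135, 0.016656536608934402, 0.016049453988671303),
    32: (0.012281469069421291, 0.011341418139636517, 0.011271507479250431),
    64: (0.00787552073597908, 0.006427847780287266, 0.007352294400334358),
    128: (0.006064746994525194, 0.005578806158155203, 0.006258651614189148),
    256: (0.005137995816767216, 0.004596150945872068, 0.005783664062619209),
}

def closeness_score(mses):
    """Hypothetical score: worst error plus the spread between the errors."""
    return max(mses) + (max(mses) - min(mses))

# Lower is better: small errors that are also close to each other.
best_units = min(results, key=lambda units: closeness_score(results[units]))
print(best_units)
```

With these numbers the score selects 128 units, narrowly ahead of 256. A different weighting between error size and spread could change that ranking, so this is only one way to formalize the choice.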

#### Note that the correlations between the dependent variable (target) and the features are indeed high:

In [1553]:
def reconstruct_data(X, y):
    data = X.assign(Chance_of_admit  = y)
    data = data.rename(columns = {'Chance_of_admit': 'Chance of Admit'})
    return data

In [1554]:
final_data = reconstruct_data(X_train, y_train)

In [1555]:
final_data.corr()['Chance of Admit']

GRE Score            0.796729
SOP                  0.674721
Research             0.517726
CGPA                 0.869127
LOR                  0.686924
University Rating    0.713946
TOEFL Score          0.788699
Chance of Admit      1.000000
Name: Chance of Admit, dtype: float64
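To read these correlations more easily, they can be sorted from strongest to weakest. The short sketch below rebuilds the Series from the values printed above (the target's own entry is left out) and ranks the features. The variable names are illustrative only.

```python
import pandas as pd

# Correlations with 'Chance of Admit', copied from the output above.
corr = pd.Series({
    'GRE Score': 0.796729,
    'SOP': 0.674721,
    'Research': 0.517726,
    'CGPA': 0.869127,
    'LOR': 0.686924,
    'University Rating': 0.713946,
    'TOEFL Score': 0.788699,
})

# Sort from the strongest to the weakest linear relationship with the target.
ranked = corr.sort_values(ascending=False)
print(ranked)
```

In the notebook itself the same ranking could be obtained by dropping the target entry from `final_data.corr()['Chance of Admit']` and calling `sort_values(ascending=False)` on the result.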

### Selected model

For our model, we will initially choose first_layer_units = 128, since this model performed well and showed few signs of overfitting the training data with the selected hyperparameters.

In [1556]:
hp['first_layer_units'] = 128

In [1557]:
hp

{'first_layer_units': 128,
 'epochs': 6,
 'batch_size': 20,
 'alpha': 0.001,
 'reg_lambda': 0.1}

In [1558]:
selected_model = create_model(first_layer_units = hp['first_layer_units'], alpha = hp['alpha'],
                     reg_lambda = hp['reg_lambda'], features = features, random_seed = random_state)

In [1559]:
se_train_mse, se_val_mse = train_model(selected_model, X_train_scaled, y_train_scaled, epochs = hp['epochs'], 
                                     batch_size = hp['batch_size'], random_seed = random_state)

In [1560]:
se_test_eval = test_evaluation(selected_model, X_test_scaled, y_test_scaled, batch_size = hp['batch_size'])
print(f'\nMetrics for our selected linear regression model made with NN in Tensorflow Keras API:\n')
print(f'Selected model train mse: {se_train_mse[-1]}')
print(f'Selected model val mse: {se_val_mse[-1]}')
print(f'Selected model test mse: {se_test_eval[1]}')


Metrics for our selected linear regression model made with NN in Tensorflow Keras API:

Selected model train mse: 0.006064746994525194
Selected model val mse: 0.005578806158155203
Selected model test mse: 0.006258651614189148


#### Now that we have our model, let's check again whether it overfits using another cross-validation technique: K-fold.

In [1561]:
hp

{'first_layer_units': 128,
 'epochs': 6,
 'batch_size': 20,
 'alpha': 0.001,
 'reg_lambda': 0.1}

In [1562]:
def shuffle_data(data, random_state):
    '''Shuffles a Pandas Dataframe's data.'''
    rand = np.random.RandomState(random_state)
    return data.reindex(rand.permutation(data.index))

In [1563]:
# KFold: k iterations in which (k-1) splits are for train and 1 for test.
def kfold_cross_validation(X = None, y = None, n_splits = 5, hp = {}, shuffle = True, random_state = 0):
    features = X.columns
    label = y.columns
    # Shuffling to avoid bias; without shuffling, the original row order is kept.
    if shuffle:
        data = reconstruct_data(X, y)
        data = shuffle_data(data, random_state)
        X_shuffled = data[features]
        y_shuffled = data[label]
    else:
        X_shuffled, y_shuffled = X, y
        
    kf = KFold(n_splits = n_splits)
    k_metrics = []
    for train_index, test_index in kf.split(X):
        X_train = X_shuffled.iloc[train_index].copy()
        y_train = y_shuffled.iloc[train_index].copy()
        X_test = X_shuffled.iloc[test_index].copy()
        y_test = y_shuffled.iloc[test_index].copy()
        
        # Scaling data:
        X_train_scaled = (X_train - X_train.min())/(X_train.max() - X_train.min())
        X_test_scaled = (X_test - X_train.min())/(X_train.max() - X_train.min())
        y_train_scaled = y_train
        y_test_scaled = y_test

        # Creating selected model:
        selected_model = create_model(first_layer_units = hp['first_layer_units'], alpha = hp['alpha'],
                                        reg_lambda = hp['reg_lambda'], features = features, 
                                        random_seed = random_state)
        se_train_mse = train_model(selected_model, X_train_scaled, y_train_scaled, epochs = hp['epochs'], 
                                  batch_size = hp['batch_size'], random_seed = random_state,
                                  validation_split = 0)
        se_test_eval = test_evaluation(selected_model, X_test_scaled, y_test_scaled, batch_size = hp['batch_size'])
        se_metrics_list = [se_train_mse[-1], se_test_eval[1]]
        k_metrics.append(se_metrics_list)
    return k_metrics

In [1564]:
n_splits = 5
k_metrics = kfold_cross_validation(X, y, n_splits = n_splits, shuffle = True,
                                   random_state = random_state, hp = hp)

In [1565]:
def print_k_metrics(k_metrics):
    k = 0
    for metrics in k_metrics:
        print(f'\nMetrics for fold {k + 1}:\n')
        print(f'Selected model train mse: {metrics[0]}')
        print(f'Selected model test mse: {metrics[1]}')
        k += 1

In [1566]:
print_k_metrics(k_metrics)


Metrics for fold 1:

Selected model train mse: 0.005074808839708567
Selected model test mse: 0.005736889783293009

Metrics for fold 2:

Selected model train mse: 0.004938953556120396
Selected model test mse: 0.006124397274106741

Metrics for fold 3:

Selected model train mse: 0.005400301888585091
Selected model test mse: 0.0045867436565458775

Metrics for fold 4:

Selected model train mse: 0.005435791797935963
Selected model test mse: 0.004165829159319401

Metrics for fold 5:

Selected model train mse: 0.005468782503157854
Selected model test mse: 0.00477219745516777
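Per-fold numbers are easier to compare once they are aggregated. The sketch below summarizes the five folds printed above with the mean and standard deviation of each error, plus a count of folds where the test error exceeds the train error. The helper `summarize_k_metrics` is hypothetical and only assumes the `[train mse, test mse]` layout returned by `kfold_cross_validation`.

```python
import numpy as np

# [train mse, test mse] per fold, copied from the output above.
k_metrics = [
    [0.005074808839708567, 0.005736889783293009],
    [0.004938953556120396, 0.006124397274106741],
    [0.005400301888585091, 0.0045867436565458775],
    [0.005435791797935963, 0.004165829159319401],
    [0.005468782503157854, 0.00477219745516777],
]

def summarize_k_metrics(k_metrics):
    """Return mean/std of train and test MSE and how often test exceeds train."""
    arr = np.asarray(k_metrics, dtype=float)
    gap = arr[:, 1] - arr[:, 0]  # positive gap: test error above train error
    return {
        'train_mean': arr[:, 0].mean(),
        'train_std': arr[:, 0].std(),
        'test_mean': arr[:, 1].mean(),
        'test_std': arr[:, 1].std(),
        'folds_with_test_above_train': int((gap > 0).sum()),
    }

summary = summarize_k_metrics(k_metrics)
for name, value in summary.items():
    print(f'{name}: {value}')
```

A test error above the train error in only some folds, together with a larger spread in the test column than in the train column, points to fold-to-fold variation rather than a uniform gap.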


#### The KFold results show that there is still some degree of overfitting, so we can try to improve this result by changing some parameters.

We can try:

- Reducing alpha
- Reducing epochs
- Reducing batch_size
- Increasing reg_lambda
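Instead of changing these four values by hand, the combinations could be enumerated systematically. The sketch below is a minimal grid generator built on `itertools.product`. The helper `expand_grid` and the candidate values are hypothetical, chosen only to follow the four directions listed above.

```python
from itertools import product

base_hp = {'first_layer_units': 128, 'epochs': 6, 'batch_size': 20,
           'alpha': 0.001, 'reg_lambda': 0.1}

# Hypothetical candidate values following the four directions listed above.
candidates = {
    'alpha': [0.001, 0.0005],
    'epochs': [6, 5],
    'batch_size': [20, 15],
    'reg_lambda': [0.1, 0.16],
}

def expand_grid(base_hp, candidates):
    """Yield one hyperparameter dict per combination of candidate values."""
    names = list(candidates)
    for values in product(*(candidates[name] for name in names)):
        # Start from the base configuration and override the tuned entries.
        yield {**base_hp, **dict(zip(names, values))}

grid = list(expand_grid(base_hp, candidates))
print(len(grid))
```

Each dictionary in `grid` has the same keys as `hp`, so it could be passed as the `hp` argument of `kfold_cross_validation` and the resulting fold metrics compared across configurations.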

In [1567]:
hp

{'first_layer_units': 128,
 'epochs': 6,
 'batch_size': 20,
 'alpha': 0.001,
 'reg_lambda': 0.1}

In [1568]:
hp['reg_lambda'] = 0.16
hp['batch_size'] = 15
hp['alpha'] = 0.001/2
hp['epochs'] = 5

In [1569]:
k_metrics = kfold_cross_validation(X, y, n_splits = n_splits, shuffle = True,
                                   random_state = random_state, hp = hp)

In [1570]:
print_k_metrics(k_metrics)


Metrics for fold 1:

Selected model train mse: 0.006748770363628864
Selected model test mse: 0.005964679177850485

Metrics for fold 2:

Selected model train mse: 0.00663524866104126
Selected model test mse: 0.00767184142023325

Metrics for fold 3:

Selected model train mse: 0.0068711512722074986
Selected model test mse: 0.005178166553378105

Metrics for fold 4:

Selected model train mse: 0.006851423531770706
Selected model test mse: 0.005149665288627148

Metrics for fold 5:

Selected model train mse: 0.006702333688735962
Selected model test mse: 0.005697226617485285


In [1571]:
hp

{'first_layer_units': 128,
 'epochs': 5,
 'batch_size': 15,
 'alpha': 0.0005,
 'reg_lambda': 0.16}

Although we tried changing the hyperparameters several times, for fold 2 we could not bring the test mse closer to the train mse. In addition, all the other folds behave abnormally: their train mse is larger than their test mse. This indicates that there is bias in the data, especially in these folds. One way to try to work around this problem would be to increase the number of layers in the model or the number of nodes per layer. We can also add a dropout layer to avoid overfitting as complexity grows.
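To make the effect of a dropout layer concrete, the sketch below implements inverted dropout with NumPy: during training a fraction `rate` of the units is zeroed and the remaining ones are scaled by 1 / (1 - rate), while at inference the input passes through unchanged. This is the same scaling convention the Keras `Dropout` layer follows. The function `dropout` and the toy activations are illustrative and are not part of the notebook.

```python
import numpy as np

def dropout(activations, rate, training, rng):
    """Inverted dropout: zero a fraction `rate` of units and rescale the rest."""
    if not training or rate == 0:
        return activations
    # Each unit is kept with probability (1 - rate).
    keep_mask = rng.random(activations.shape) >= rate
    return activations * keep_mask / (1.0 - rate)

rng = np.random.default_rng(25)
activations = np.ones((1000, 128))  # toy activations of a 128-unit layer

train_out = dropout(activations, rate=0.1, training=True, rng=rng)
infer_out = dropout(activations, rate=0.1, training=False, rng=rng)

print(train_out.mean(), infer_out.mean())
```

Because of the rescaling, the average activation stays roughly the same in both modes, so the following layer sees inputs on a similar scale during training and evaluation.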

In [1589]:
def create_model(first_layer_units = 32, alpha = 0.1, reg_lambda = 0.1, dropout_rate = 0, features = [], random_seed = 0):
    '''Create a defined model with learning rate = alpha.'''
    # Defining Tensorflow's random seed to generate reproducible results.
    tf.keras.utils.set_random_seed(random_seed)
    tf.config.experimental.enable_op_determinism()
    
    # We will be using Tensorflow's Sequential API to build our model.
    model = tf.keras.models.Sequential()
    
    # Adding first layer with input_dim = number of features and relu activation function 
    # to learn some non-linearities. We will also add L2 regularization in layer's kernel (weights).
    model.add(tf.keras.layers.Dense(units = first_layer_units, input_shape = (len(features), ), 
                                   kernel_regularizer = tf.keras.regularizers.L2(l2 = reg_lambda),
                                   activation = 'relu'))
    
    # Dropout layer
    
    model.add(tf.keras.layers.Dropout(rate = dropout_rate, seed = random_seed))
    
    # Second hidden layer; no activation is given, so it is linear.
    model.add(tf.keras.layers.Dense(units = first_layer_units,
                                    kernel_regularizer = tf.keras.regularizers.L2(l2 = reg_lambda)))
    
    # Adding output layer with linear activation (linear regression output).
    
    model.add(tf.keras.layers.Dense(units = 1, activation = 'linear'))
    
    # Compiling model: selecting optimizer, loss and metrics.
    
    model.compile(optimizer = tf.keras.optimizers.RMSprop(learning_rate = alpha),
                 loss = 'mean_squared_error',
                 metrics = [tf.keras.metrics.MeanSquaredError()])
    
    return model

In [1595]:
# KFold: k iterations in which (k-1) splits are for train and 1 for test.
def kfold_cross_validation(X = None, y = None, n_splits = 5, hp = {}, shuffle = True, random_state = 0):
    features = X.columns
    label = y.columns
    # Shuffling to avoid bias; without shuffling, the original row order is kept.
    if shuffle:
        data = reconstruct_data(X, y)
        data = shuffle_data(data, random_state)
        X_shuffled = data[features]
        y_shuffled = data[label]
    else:
        X_shuffled, y_shuffled = X, y
        
    kf = KFold(n_splits = n_splits)
    k_metrics = []
    for train_index, test_index in kf.split(X):
        X_train = X_shuffled.iloc[train_index].copy()
        y_train = y_shuffled.iloc[train_index].copy()
        X_test = X_shuffled.iloc[test_index].copy()
        y_test = y_shuffled.iloc[test_index].copy()
        
        # Scaling data:
        X_train_scaled = (X_train - X_train.min())/(X_train.max() - X_train.min())
        X_test_scaled = (X_test - X_train.min())/(X_train.max() - X_train.min())
        y_train_scaled = y_train
        y_test_scaled = y_test

        # Creating selected model:
        selected_model = create_model(first_layer_units = hp['first_layer_units'], alpha = hp['alpha'],
                                        reg_lambda = hp['reg_lambda'], features = features, 
                                        random_seed = random_state, dropout_rate = hp['dropout_rate'])
        se_train_mse = train_model(selected_model, X_train_scaled, y_train_scaled, epochs = hp['epochs'], 
                                  batch_size = hp['batch_size'], random_seed = random_state,
                                  validation_split = 0)
        se_test_eval = test_evaluation(selected_model, X_test_scaled, y_test_scaled, batch_size = hp['batch_size'])
        se_metrics_list = [se_train_mse[-1], se_test_eval[1]]
        k_metrics.append(se_metrics_list)
    return k_metrics

In [1644]:
hp['reg_lambda'] = 0.16
hp['batch_size'] = 15
hp['alpha'] = 0.001/2
hp['epochs'] = 5
hp['dropout_rate'] = 0.1

In [1645]:
hp

{'first_layer_units': 128,
 'epochs': 5,
 'batch_size': 15,
 'alpha': 0.0005,
 'reg_lambda': 0.16,
 'dropout_rate': 0.1}

In [1646]:
k_metrics = kfold_cross_validation(X, y, n_splits = n_splits, shuffle = True,
                                   random_state = random_state, hp = hp)

In [1647]:
print_k_metrics(k_metrics)


Metrics for fold 1:

Selected model train mse: 0.00832164753228426
Selected model test mse: 0.009839721024036407

Metrics for fold 2:

Selected model train mse: 0.007600715849548578
Selected model test mse: 0.007761502172797918

Metrics for fold 3:

Selected model train mse: 0.008036257699131966
Selected model test mse: 0.006805111654102802

Metrics for fold 4:

Selected model train mse: 0.008295999839901924
Selected model test mse: 0.007244655396789312

Metrics for fold 5:

Selected model train mse: 0.008186479099094868
Selected model test mse: 0.006620819214731455


- Note that adding new layers to the model, even after adjusting some parameters, increased the error and still did not help reduce the difference between the train and test mse in general, since fold 1 now shows a large difference. This further indicates that certain folds do contain biased data, even though we shuffled the dataset beforehand.

- Ideally, we would have more data in this case so that, with luck, we could get a better model.

- For this specific dataset, the simpler model apparently performed better, despite the strange behaviors.