# Machine Learning: Stochastic Gradient Descent

### Libraries

In [503]:
import numpy as np
import pandas as pd

We will use the "Data for Admission in the University" dataset, available on Kaggle.

https://www.kaggle.com/datasets/akshaydattatraykhare/data-for-admission-in-the-university

In [504]:
data = pd.read_csv('adm_data.csv')

### Inspecting the dataset

In [505]:
data.shape

(400, 9)

In [506]:
data.columns

Index(['Serial No.', 'GRE Score', 'TOEFL Score', 'University Rating', 'SOP',
       'LOR ', 'CGPA', 'Research', 'Chance of Admit '],
      dtype='object')

In [507]:
data.rename(columns = {'Chance of Admit ': 'Chance of Admit'}, inplace = True)
data.rename(columns = {'LOR ': 'LOR'}, inplace = True)
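
The two renames above only remove trailing spaces from column names. A minimal sketch of doing the same for every column at once, assuming a small made-up frame (`toy` is hypothetical and not part of the notebook) that reuses the original column names:

```python
import pandas as pd

# Hypothetical two-row frame reusing the column names shown above,
# including the trailing spaces in 'LOR ' and 'Chance of Admit '.
toy = pd.DataFrame({
    'GRE Score': [337, 324],
    'LOR ': [4.5, 4.5],
    'Chance of Admit ': [0.92, 0.76],
})

# Strip leading/trailing whitespace from every column name in one step.
toy.columns = toy.columns.str.strip()
print(list(toy.columns))
```

This avoids listing each affected column by hand, at the cost of touching all column names.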

In [508]:
data.head()

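| | Serial No. | GRE Score | TOEFL Score | University Rating | SOP | LOR | CGPA | Research | Chance of Admit |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 337 | 118 | 4 | 4.5 | 4.5 | 9.65 | 1 | 0.92 |
| 1 | 2 | 324 | 107 | 4 | 4.0 | 4.5 | 8.87 | 1 | 0.76 |
| 2 | 3 | 316 | 104 | 3 | 3.0 | 3.5 | 8.0 | 1 | 0.72 |
| 3 | 4 | 322 | 110 | 3 | 3.5 | 2.5 | 8.67 | 1 | 0.8 |
| 4 | 5 | 314 | 103 | 2 | 2.0 | 3.0 | 8.21 | 0 | 0.65 |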


#### Definitions:

__Serial No.:__ the student's serial number (index + 1).

__GRE Score:__ the student's score on the GRE (Graduate Record Examination), a standardized exam similar to the GMAT.

__TOEFL Score:__ the student's score on the TOEFL, a comprehensive English exam used, among other things, for university admission.

__University Rating:__ rating of the university (the higher the rating, the more prestigious the university the student wants to enter).

__SOP:__ rating of the Statement of Purpose, an essay explaining why the student is applying to a given program at a university.

__LOR:__ rating of the Letter of Recommendation submitted to the university on the student's behalf.

__CGPA:__ Cumulative Grade Point Average, a score used to measure a student's average academic performance.

__Research:__ research experience (1 if the student has it, 0 otherwise).

__Chance of Admit:__ chance of admission to the university, ranging from 0 to 1 (100%).

### Defining the label and the features

Our label (y) will be Chance of Admit.

The list of columns ["GRE Score", "TOEFL Score", "University Rating", "SOP", "LOR", "CGPA", "Research"] contains the features that are highly correlated with the chance of admission to the university. In our case we will work only with the features treated as continuous, which leaves out the binary Research column, so the list of features we can use is: features = ["GRE Score", "TOEFL Score", "University Rating", "SOP", "LOR", "CGPA"].

We will implement the __mini-batch Stochastic Gradient Descent__ algorithm with all the continuous features of the dataset.
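
One way to check how strongly each feature relates to the label is to look at its Pearson correlation with it. The sketch below is only illustrative: it uses a small made-up frame (`toy`, hypothetical values), not the admission data, so the numbers it prints say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical toy frame (values made up for illustration).
toy = pd.DataFrame({
    'CGPA': [7.0, 8.0, 9.0, 10.0],
    'SOP': [2.0, 3.0, 2.5, 4.5],
    'Chance of Admit': [0.2, 0.4, 0.6, 0.8],
})

# Pearson correlation of every feature with the label, strongest first.
corr_with_label = toy.corr()['Chance of Admit'].drop('Chance of Admit')
print(corr_with_label.sort_values(ascending=False))
```

On the real data, the same two lines applied to `data[features + [label]]` would give the correlations the text refers to.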

### Linear Regression model with all the features

With m features, the model takes the form:

$\hat{y} = w_{1}x_{1} + w_{2}x_{2} + \dots + w_{m}x_{m} + b$

where $\hat{y}$ is the estimate (or prediction) of the label $y$ given the feature values $x_{1}, x_{2}, \dots, x_{m}$, $(w_{1}, w_{2}, \dots, w_{m})$ is the set of weights associated with the features, and $b$ is called the bias.
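
For several examples at once, the same model can be written as a matrix product. A minimal sketch with made-up numbers (the arrays `X`, `w` and the value of `b` are illustrative assumptions, not values from the dataset):

```python
import numpy as np

# Toy data (made up): 3 examples, m = 2 features.
X = np.array([[1.0, 2.0],
              [0.0, 1.0],
              [3.0, 0.5]])
w = np.array([0.5, -1.0])  # one weight per feature
b = 0.25                   # bias

# Each row of X is one example; X @ w computes w_1*x_1 + ... + w_m*x_m per row.
y_hat = X @ w + b
print(y_hat)
```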

We define the loss associated with the model through the $L_{2}$ loss function, in the same way as for the first model. For $n$ training examples:

$loss = \frac{1}{n}\sum_{i = 1}^{n} (\hat{y}_{i} - y_{i})^2$
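
As a quick illustration of the formula, the loss is the average of the squared differences between predictions and labels. The helper name `mse_loss` and the numbers below are assumptions made for this sketch:

```python
import numpy as np

def mse_loss(y_hat, y):
    """Mean squared error: the average of the squared residuals."""
    y_hat = np.asarray(y_hat, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.mean((y_hat - y) ** 2)

# Made-up predictions and labels, for illustration only.
y_hat = [0.9, 0.7, 0.5]
y = [1.0, 0.5, 0.5]

# Squared residuals are 0.01, 0.04 and 0.0, so the loss is their mean.
print(mse_loss(y_hat, y))
```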

Since we want to make the model's loss as small as possible, we move the parameters in the direction of the negative gradient of $f$, where $f$ denotes the loss as a function of the parameters. This is the principle behind gradient descent; the stochastic variant estimates the gradient from a subset of the data at each step.

In this context, let $\theta = (w_{1}, w_{2}, \dots, w_{m}, b)$. We also define the hyperparameter $\alpha$ as the model's learning rate, as in the previous model.

Then, with stochastic gradient descent, the new value of $\theta$, written $\theta'$, is given by:

$\theta' = \theta - \alpha \cdot \nabla f(w_{1}, w_{2}, \dots, w_{m}, b)$

where $\nabla f(w_{1}, w_{2}, \dots, w_{m}, b) = (\frac{\partial f}{\partial w_{1}}, \frac{\partial f}{\partial w_{2}}, \dots, \frac{\partial f}{\partial w_{m}}, \frac{\partial f}{\partial b})$ is the gradient of $f$.

Computing the partial derivatives:

$\frac{\partial f}{\partial w_{k}} = \frac{2}{n}\sum_{i = 1}^{n} (\hat{y}_{i} - y_{i}) \cdot (x_{k})_{i}$, for every feature $x_{k}$, where $(x_{k})_{i}$ is the value of feature $x_{k}$ in the $i$-th example.

$\frac{\partial f}{\partial b} = \frac{2}{n}\sum_{i = 1}^{n} (\hat{y}_{i} - y_{i})$
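
These expressions can be checked numerically. The sketch below, on random synthetic data, compares the analytic partial derivatives with central finite differences; the function names `loss` and `gradients` are assumptions made for the illustration.

```python
import numpy as np

def loss(X, y, w, b):
    """Mean squared error of the linear model on (X, y)."""
    return np.mean((X @ w + b - y) ** 2)

def gradients(X, y, w, b):
    """Analytic partial derivatives of the mean squared error."""
    n = len(y)
    diff = X @ w + b - y               # residuals (y_hat - y)
    grad_w = (2.0 / n) * (X.T @ diff)  # one entry per weight w_k
    grad_b = (2.0 / n) * np.sum(diff)
    return grad_w, grad_b

# Random synthetic data: 8 examples, 3 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
y = rng.normal(size=8)
w = rng.normal(size=3)
b = 0.3

grad_w, grad_b = gradients(X, y, w, b)

# Central finite differences as an independent check.
eps = 1e-6
num_grad_w = np.zeros_like(w)
for k in range(len(w)):
    step = np.zeros_like(w)
    step[k] = eps
    num_grad_w[k] = (loss(X, y, w + step, b) - loss(X, y, w - step, b)) / (2 * eps)
num_grad_b = (loss(X, y, w, b + eps) - loss(X, y, w, b - eps)) / (2 * eps)

print(np.allclose(grad_w, num_grad_w, atol=1e-6))
print(np.allclose(grad_b, num_grad_b, atol=1e-6))
```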

This gives the new values of the weights and the bias after one iteration:

$w_{k}' = w_{k} - \frac{2 \alpha}{n}\sum_{i = 1}^{n} (\hat{y}_{i} - y_{i}) \cdot (x_{k})_{i}$, for every feature $x_{k}$.

$b' = b - \frac{2 \alpha}{n}\sum_{i = 1}^{n} (\hat{y}_{i} - y_{i})$
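
A single application of this update rule can be sketched as follows, on synthetic noise-free data (the data, the starting point and the value of `alpha` are assumptions chosen for the example). With a small enough learning rate, the loss after the step should be lower than before it.

```python
import numpy as np

def mse(X, y, w, b):
    """Mean squared error of the linear model on (X, y)."""
    return np.mean((X @ w + b - y) ** 2)

rng = np.random.default_rng(1)
X = rng.uniform(size=(50, 2))        # synthetic features in [0, 1]
y = X @ np.array([0.4, 0.2]) + 0.1   # synthetic, noise-free labels

w = np.zeros(2)
b = 0.0
alpha = 0.1
n = len(y)

loss_before = mse(X, y, w, b)

# One update of the weights and the bias, using all n examples.
diff = X @ w + b - y
w_new = w - (2 * alpha / n) * (X.T @ diff)
b_new = b - (2 * alpha / n) * np.sum(diff)

loss_after = mse(X, y, w_new, b_new)
print(loss_before, loss_after)
```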

With this in mind, we will implement a linear regression model fitted to the data with mini-batch Stochastic Gradient Descent.
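
Before the full class, the mini-batch idea on its own can be sketched in plain NumPy. This is a simplified illustration rather than the implementation used in this notebook: the function name `minibatch_sgd`, its defaults and the synthetic data are assumptions, and the sketch reshuffles the rows at the start of every epoch, which is a common choice but also an assumption here.

```python
import numpy as np

def minibatch_sgd(X, y, alpha=0.1, batch_size=4, epochs=500, seed=0):
    """Fit y ~ X @ w + b with mini-batch stochastic gradient descent."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    w = np.zeros(m)
    b = 0.0
    for _ in range(epochs):
        order = rng.permutation(n)                 # new row order each epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]  # last batch may be smaller
            Xb, yb = X[idx], y[idx]
            diff = Xb @ w + b - yb
            # Gradient of the mean squared error on this mini-batch only.
            w -= alpha * (2.0 / len(idx)) * (Xb.T @ diff)
            b -= alpha * (2.0 / len(idx)) * np.sum(diff)
    return w, b

# Synthetic, noise-free data generated from known parameters.
rng = np.random.default_rng(42)
X = rng.uniform(size=(30, 2))
y = X @ np.array([0.5, -0.3]) + 0.2

w, b = minibatch_sgd(X, y)
print(w, b)
```

Because the labels are generated without noise, the fitted parameters should approach the generating values 0.5, -0.3 and 0.2.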

In [509]:
class Linear_Regression_Model():
    def __init__(self, features, label, ws: list,
                 b = 0, alpha = 0.1, random_state = 0):
        '''Constructor for the Linear_Regression_Model class. It takes in six arguments:

        - features: a Pandas DataFrame containing the features to be used for training the model;
        - label: a Pandas Series containing the label corresponding to the features that will be predicted using the model;
        - ws: a list of floats representing the initial values of the weights (coefficients) of the features;
        - b: a float representing the initial value of the bias term (default value is 0);
        - alpha: a float representing the learning rate (i.e., the size of the step taken in the direction of the gradient during gradient descent) (default value is 0.1);
        - random_state: an integer representing the random seed to use for generating random numbers (default value is 0).'''
        self.features = features
        self.label = label
        ws = [float(w) for w in ws]
        self.ws = np.array(ws)
        self.b = b
        self.alpha = alpha
        self.rand = np.random.RandomState(random_state)
        
    def print_parameters(self):
        '''Prints the current values of the weights ws and bias b of the model.'''
        for i in range(1, len(self.ws) + 1):
            print(f'w{i} = {self.ws[i - 1]}')
        print (f'b = {self.b}')
    
    def predict(self, X):
        '''Takes in a Pandas DataFrame X containing feature values 
        and returns a NumPy array of predicted values of the label using 
        the current values of ws and b.'''
        # Dot product of each row of X with the weights, plus the bias.
        predictions = X.to_numpy(dtype=float) @ self.ws + self.b
        return predictions
    
    def get_loss(self, X, y):
        '''Takes in a Pandas DataFrame X containing feature values 
        and a Pandas Series y containing the corresponding true label values, 
        and returns the mean squared error loss of the model on the data.'''
        predictions = self.predict(X)
        diff = predictions - y
        loss = np.mean(diff**2)
        return loss
    
    def __get_Xy_sample(self, begin_index, end_index):
        '''Private helper function which returns the rows of the features and label
        from begin_index (inclusive) up to end_index (exclusive).'''
        X = self.features.iloc[begin_index:end_index]
        y = self.label.iloc[begin_index:end_index]
        return X, y
    
    def __get_sample_partial_w(self, X, diff, batch_size):
        '''Private helper function which gets the partial derivatives of the loss with 
        respect to the weights for a sample X.'''
        partial_w = (2/batch_size) * (diff @ X)
        return partial_w
    
    def __get_sample_partial_b(self, diff, batch_size):
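        '''Private helper function which gets the partial derivative of the loss with 
        respect to the bias for a sample.'''
        partial_b = (2/batch_size) * np.sum(diff)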
        return partial_b
    
    def __batch_update_parameters(self, X, diff, batch_size, inexact_batch_size):
        '''Private helper function which updates weights (ws) and bias (b) for a batch.'''
        partial_w = self.__get_sample_partial_w(X, diff, inexact_batch_size)
        partial_b = self.__get_sample_partial_b(diff, inexact_batch_size)
        correction_constant = batch_size/inexact_batch_size
        self.ws -= self.alpha * partial_w * correction_constant
        self.b -= self.alpha * partial_b * correction_constant
        
    def __sgd_update_parameters(self, batch_size: int):
        '''Private helper function that performs one full pass over the training data,
        updating the model's parameters (ws and b) once per mini-batch.
        It takes in a single argument batch_size, which is the number of samples in each mini-batch.
        The rows are split into consecutive mini-batches in their stored order (a final, smaller batch holds any leftover rows),
        and each update uses the gradient of the mean squared error loss with respect to the parameters for that mini-batch,
        scaled by the learning rate alpha.'''
        num_of_data_rows = len(self.label)
        inexact_batch_size = num_of_data_rows % batch_size
        num_of_exact_batches = num_of_data_rows // batch_size
        for exact_batch in range(1, num_of_exact_batches + 1):
            begin_index = (exact_batch - 1) * batch_size
            end_index = exact_batch * batch_size
            X, y = self.__get_Xy_sample(begin_index, end_index)
            predictions = self.predict(X)
            diff = predictions - y
            self.__batch_update_parameters(X, diff, batch_size, batch_size)
        if inexact_batch_size != 0:
            # The leftover rows start right after the last full batch.
            begin_index = num_of_exact_batches * batch_size
            end_index = num_of_data_rows
            X, y = self.__get_Xy_sample(begin_index, end_index)
            predictions = self.predict(X)
            diff = predictions - y
            self.__batch_update_parameters(X, diff, batch_size, inexact_batch_size)
        
    def sgd(self, iterations: int, batch_size: int, print_loss: bool): # stochastic gradient descent
        '''Performs stochastic gradient descent for a specified number of iterations. It takes in three arguments:

        - iterations: an integer representing the number of passes over the training data to perform
        - batch_size: an integer representing the number of samples to use in each mini-batch
        - print_loss: a boolean indicating whether to print the training loss after each iteration. 
        If True, the loss will be printed; if False, the loss will not be printed.'''
        for i in range(0, iterations):
            self.__sgd_update_parameters(batch_size)
            if print_loss:
                print(f'loss = {self.get_loss(self.features, self.label)}')
    
    @staticmethod
    def shuffle_data(data, random_state):
        '''Shuffles a Pandas Dataframe's data.'''
        rand = np.random.RandomState(random_state)
        return data.reindex(rand.permutation(data.index))
    
    @staticmethod
    def train_val_test_split(X, y, test_split_factor: float, val_split_factor: float):
        '''Get train, validation and test Pandas Dataframes from X (features) and y (label).'''
        num_of_data_rows = len(y)
        test_size = int(test_split_factor * num_of_data_rows)
        val_size = int(val_split_factor * num_of_data_rows)
        X_test = X.iloc[0:test_size]
        y_test = y.iloc[0:test_size]
        X_val = X.iloc[test_size:test_size + val_size]
        y_val = y.iloc[test_size:test_size + val_size]
        X_train = X.iloc[test_size + val_size:num_of_data_rows]
        y_train = y.iloc[test_size + val_size:num_of_data_rows]
        return X_test, y_test, X_val, y_val, X_train, y_train

In [510]:
features = ["GRE Score", "TOEFL Score", "University Rating", "SOP", "LOR", "CGPA"]
label = 'Chance of Admit'
X_train = data[features]
y_train = data[label]

In [511]:
def scale(feature, train_feature):
    '''Min-max scaling: rescales feature so that the range of train_feature maps to [0, 1].'''
    minimum = min(train_feature)
    maximum = max(train_feature)
    return (feature - minimum)/(maximum - minimum)

In [512]:
unscaled_X_train = X_train.copy()
X_train_scaled = X_train.copy()
for feature in features:
    X_train_scaled[feature] = scale(X_train_scaled[feature], unscaled_X_train[feature])
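
The same min-max scaling can be applied to all columns at once. The sketch below uses two small made-up frames (`train` and `val` are hypothetical, not the admission data) and reuses the training minimum and maximum for the second frame, which is why `scale` takes the training column as a separate argument.

```python
import pandas as pd

# Hypothetical training and validation frames (values made up).
train = pd.DataFrame({'GRE Score': [300, 320, 340], 'CGPA': [7.0, 8.0, 9.0]})
val = pd.DataFrame({'GRE Score': [310, 330], 'CGPA': [8.5, 9.5]})

# Column-wise statistics computed on the training data only.
col_min = train.min()
col_max = train.max()

# DataFrame - Series aligns on column names, so every column is scaled
# with its own minimum and maximum.
train_scaled = (train - col_min) / (col_max - col_min)
val_scaled = (val - col_min) / (col_max - col_min)
print(train_scaled)
print(val_scaled)
```

Values outside the training range, such as the CGPA of 9.5 here, end up outside [0, 1] after scaling.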

In [513]:
ws = [0, 0, 0, 0, 0, 0]
b = 0
alpha = 0.001
multi_features_model = Linear_Regression_Model(features = X_train_scaled, 
                                               label = y_train, ws = ws, b = b, alpha = alpha,
                                               random_state = 0)

In [514]:
multi_features_model.print_parameters()

w1 = 0.0
w2 = 0.0
w3 = 0.0
w4 = 0.0
w5 = 0.0
w6 = 0.0
b = 0


In [515]:
multi_features_model.sgd(iterations = 100, batch_size = 20, print_loss = False)

In [516]:
multi_features_model.get_loss(X_train_scaled, y_train)

0.0061997711355344645

In [517]:
multi_features_model.print_parameters()

w1 = 0.131551318370844
w2 = 0.13288348783321113
w3 = 0.10749988946457083
w4 = 0.12838007846968924
w5 = 0.14179470909567662
w6 = 0.14573570611596243
b = 0.2693263331497966


In [518]:
pred = multi_features_model.predict(X_train_scaled)

In [519]:
data_with_pred = data.assign(Predictions = pred)

In [520]:
data_with_pred

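| | Serial No. | GRE Score | TOEFL Score | University Rating | SOP | LOR | CGPA | Research | Chance of Admit | Predictions |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 337 | 118 | 4 | 4.5 | 4.5 | 9.65 | 1 | 0.92 | 0.966528 |
| 1 | 2 | 324 | 107 | 4 | 4.0 | 4.5 | 8.87 | 1 | 0.76 | 0.827639 |
| 2 | 3 | 316 | 104 | 3 | 3.0 | 3.5 | 8.00 | 1 | 0.72 | 0.657297 |
| 3 | 4 | 322 | 110 | 3 | 3.5 | 2.5 | 8.67 | 1 | 0.80 | 0.713453 |
| 4 | 5 | 314 | 103 | 2 | 2.0 | 3.0 | 8.21 | 0 | 0.65 | 0.580404 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 395 | 396 | 324 | 110 | 3 | 3.5 | 3.5 | 9.04 | 1 | 0.82 | 0.771446 |
| 396 | 397 | 325 | 107 | 3 | 3.0 | 3.5 | 9.11 | 1 | 0.84 | 0.747062 |
| 397 | 398 | 330 | 116 | 4 | 5.0 | 4.5 | 9.45 | 1 | 0.91 | 0.945325 |
| 398 | 399 | 312 | 103 | 3 | 3.5 | 4.0 | 8.78 | 0 | 0.67 | 0.712233 |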
