<a href="https://colab.research.google.com/github/LuisPeMoraRod/AI-Laboratories/blob/ejercicio_3/Lab6_LogisticRegression.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# import necessary libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import accuracy_score
from math import exp as e

### Exercise 1

The dataset is imported from a GitHub repository.

The dataset contains attributes of the insurance accounts of a group of 1339 U.S. citizens. Its attributes are: age, sex, body mass index, number of children, smoker status, region, and charges.

This dataset is used to build a model that predicts whether a person is a smoker based on the remaining attributes.




In [None]:
url = 'https://raw.githubusercontent.com/LuisPeMoraRod/AI-Laboratories/logistic_reg_exercise/logistic_reg_lab_data.csv'
df = pd.read_csv(url) #store in pandas dataframe
df

Unnamed: 0,age,sex,bmi,children,smoker,region,charges
0,19,female,27.900,0,yes,southwest,16884.92400
1,18,male,33.770,1,no,southeast,1725.55230
2,28,male,33.000,3,no,southeast,4449.46200
3,33,male,22.705,0,no,northwest,21984.47061
4,32,male,28.880,0,no,northwest,3866.85520
...,...,...,...,...,...,...,...
1333,50,male,30.970,3,no,northwest,10600.54830
1334,18,female,31.920,0,no,northeast,2205.98080
1335,18,female,36.850,0,no,southeast,1629.83350
1336,21,female,25.800,0,no,southwest,2007.94500
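
Before a model can be fitted, the text columns (`sex`, `smoker`, `region`) have to become numbers. As a minimal sketch, assuming a small hand-built frame (`sample`, a hypothetical name) that copies four of the rows displayed above, this shows how pandas category codes are assigned in alphabetical order of the labels:

```python
import pandas as pd

# Four rows copied from the displayed table (indices 0, 1, 3 and 1334);
# `sample` is an illustrative name, not part of the lab.
sample = pd.DataFrame({
    "sex": ["female", "male", "male", "female"],
    "smoker": ["yes", "no", "no", "no"],
    "region": ["southwest", "southeast", "northwest", "northeast"],
})

# Map each text column to integer codes; categories are sorted alphabetically
encoded = sample.copy()
for column in ["sex", "smoker", "region"]:
    encoded[column] = encoded[column].astype("category").cat.codes

print(encoded)
```

In this sketch `female`/`male` become 0/1, `no`/`yes` become 0/1, and the four regions become 0 to 3. The codes depend on which labels are present in the column, so a subset that is missing a region would be numbered differently.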


### Exercise 2

In [None]:
# encode the categorical columns as integer codes so every attribute is numeric
df['sex'] = df['sex'].astype('category')
df['sex'] = df['sex'].cat.codes

df['region'] = df['region'].astype('category')
df['region'] = df['region'].cat.codes

df['smoker'] = df['smoker'].astype('category')
df['smoker'] = df['smoker'].cat.codes

# separate X variables from y
X = df.drop(columns = 'smoker')
y = df['smoker']

#define training and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state = 0)

# use sklearn's linear regression model to get the linear regression equation
linear_regression = LinearRegression()
linear_regression.fit(X_train, y_train)
b = linear_regression.intercept_
coeffs = linear_regression.coef_
print('Linear regression:')
print(f'b = {b}')
print(f'Coefficients = {coeffs}')

X_train = np.array(X_train)

def predict(x_data: np.ndarray, coeffs: np.ndarray, b: float, threshold: float) -> np.ndarray:
  # accept DataFrames, lists or arrays and work on plain float arrays
  x_data = np.asarray(x_data, dtype=float)
  coeffs = np.asarray(coeffs, dtype=float)

  # compute the logit for every row: each variable times its coefficient
  # from the linear regression, summed, plus the intercept
  logits = x_data @ coeffs + b

  # apply the sigmoid to every logit
  probabilities = 1 / (1 + np.exp(-logits))

  # label a row 1 (smoker) when its probability reaches the threshold, else 0
  return (probabilities >= threshold).astype(int)

threshold = 0.6
y_pred = predict(X_train, coeffs, b, threshold)
print(f'\nPredictions for training data (smokers):\n{y_pred}')
print(f'\nReal training data (smokers):\n{y_train.to_numpy()}')

accuracy = accuracy_score(y_train, y_pred)
print(f'\nAccuracy score: {accuracy}')

y_test_pred = predict(X_test, coeffs, b, threshold)
print(f'\nPredictions for test data (smokers):\n{y_test_pred}')
print(f'\nReal test data (smokers):\n{y_test.to_numpy()}')

accuracy = accuracy_score(y_test, y_test_pred)
print(f'\nAccuracy score: {accuracy}')

Linear regression:
b = 0.40430152275569914
Coefficients = [-7.84434847e-03  1.44723798e-02 -9.80706018e-03 -1.29068881e-02
  1.16749447e-02  2.99693056e-05]

Predictions for training data (smokers):
[1 1 1 ... 0 0 0]

Real training data (smokers):
[1 1 1 ... 0 0 0]

Accuracy score: 0.9485049833887044

Predictions for test data (smokers):
[0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 1 0 0 0 0 0 0 0 1 0 0 1 0 0
 1 1 0 0 0 1 1 1 1 0 0 0 0 0 1 1 1 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 0 0 1 0
 0 0 0 0 1 1 0 1 0 0 1 1 0 0 0 1 0 0 0 0 0 0 1 1 0 0 0 0 0 1 1 0 0 0 0 1 0
 0 0 1 1 0 0 0 0 0 1 0 0 0 0 0 1 1 1 0 0 0 0 0]

Real test data (smokers):
[0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 1 0 0 0 0 0 0 0 1 0 0 1 0 0
 1 1 0 0 0 1 1 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 0 0 0 0 1 1 0 0 0 0 1 0
 0 0 0 0 1 1 0 1 0 0 1 1 0 0 0 1 0 0 0 0 0 0 1 1 0 0 0 0 0 1 1 0 0 0 0 1 0
 0 0 1 1 0 0 0 0 0 1 0 0 0 0 0 1 1 1 0 0 0 0 0]

Accuracy score: 0.9850746268656716
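
To make the prediction step concrete, here is a minimal sketch that scores one row by hand. It assumes the intercept and coefficients printed above, the column order `age, sex, bmi, children, region, charges`, and alphabetical category codes (`female` = 0, `southwest` = 3); the helper name `smoker_probability` is hypothetical.

```python
from math import exp

# Intercept and coefficients as printed by the linear regression above
b = 0.40430152275569914
coeffs = [-7.84434847e-03, 1.44723798e-02, -9.80706018e-03,
          -1.29068881e-02, 1.16749447e-02, 2.99693056e-05]

def smoker_probability(row):
    """Sigmoid of the weighted sum of one row's variables plus the intercept."""
    logit = b + sum(a * x for a, x in zip(coeffs, row))
    return 1 / (1 + exp(-logit))

# First row of the table: age 19, female (0), bmi 27.9, 0 children,
# southwest (3), charges 16884.924 (encoding assumed as described above)
row = [19, 0, 27.9, 0, 3, 16884.924]
p = smoker_probability(row)

# Same decision rule as predict(): class 1 when p reaches the threshold
threshold = 0.6
label = 1 if p >= threshold else 0
print(f"p = {p:.3f}, predicted smoker = {label}")
```

For this row the probability comes out a little above the 0.6 threshold, so the row is labeled as a smoker, which agrees with its `yes` entry in the table.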


### Exercise 3
**Include, step by step, the partial derivative of the algorithm's activation function, that is, the partial derivative of the logistic function with respect to each of its variables.**

For this derivative, we differentiate the function $p = \frac{1}{1 + e^{-(a_1 x_1 + a_2 x_2 + \dots + a_n x_n + b)}}$

The intercept and coefficients obtained from the linear regression are:

$b = 0.40430152275569914 \\
a_1 = -7.84434847 \times 10^{-3} \\
a_2 = 1.44723798 \times 10^{-2} \\
a_3 = -9.80706018 \times 10^{-3} \\
a_4 = -1.29068881 \times 10^{-2} \\
a_5 = 1.16749447 \times 10^{-2} \\
a_6 = 2.99693056 \times 10^{-5} $
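
Before substituting numbers, the general derivative can be worked out with the chain rule. As a sketch, write $z = a_1 x_1 + a_2 x_2 + \dots + a_n x_n + b$ (the symbol $z$ is introduced here only as shorthand), so that $p = (1 + e^{-z})^{-1}$ and $\frac{\partial z}{\partial x_i} = a_i$:

$$
\begin{aligned}
\frac{\partial p}{\partial x_i}
&= \frac{\partial}{\partial x_i}\left(1 + e^{-z}\right)^{-1} \\
&= -\left(1 + e^{-z}\right)^{-2} \cdot \frac{\partial}{\partial x_i}\left(1 + e^{-z}\right) \\
&= -\left(1 + e^{-z}\right)^{-2} \cdot e^{-z} \cdot \left(-\frac{\partial z}{\partial x_i}\right) \\
&= \frac{a_i \, e^{-z}}{\left(1 + e^{-z}\right)^{2}} \\
&= a_i \, p \, (1 - p)
\end{aligned}
$$

The last step uses $1 - p = \frac{e^{-z}}{1 + e^{-z}}$. Since $p(1 - p)$ is always positive, each partial derivative has the same sign as its coefficient $a_i$.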

In our example, the variables are the columns of `X` in the order used to fit the model (`smoker` is the target $y$, so it is not among the inputs): \\
$
x_1 = age \\
x_2 = sex \\
x_3 = bmi \\
x_4 = children \\
x_5 = region \\
x_6 = charges
$

To avoid carrying long numbers, we truncate the constants to 5 decimal places. Writing $u$ for the exponent, that is, $u = -(a_1 x_1 + \dots + a_6 x_6 + b)$, we get the following expression: \\
 $p = \frac{1}{1 + e^{u}}, \quad u = 0.00784\,x_1 - 0.01447\,x_2 + 0.00980\,x_3 + 0.01290\,x_4 - 0.01167\,x_5 - 0.00002\,x_6 - 0.40430$

By the chain rule, $\frac{\partial p}{\partial x_i} = -\frac{\partial u}{\partial x_i} \cdot \frac{e^{u}}{(1 + e^{u})^2}$, so the partial derivatives with respect to each variable are:

$
\frac{\partial p}{\partial x_1} = -\frac{0.00784\,e^{u}}{(1 + e^{u})^2} \\
\frac{\partial p}{\partial x_2} = \frac{0.01447\,e^{u}}{(1 + e^{u})^2} \\
\frac{\partial p}{\partial x_3} = -\frac{0.00980\,e^{u}}{(1 + e^{u})^2} \\
\frac{\partial p}{\partial x_4} = -\frac{0.01290\,e^{u}}{(1 + e^{u})^2} \\
\frac{\partial p}{\partial x_5} = \frac{0.01167\,e^{u}}{(1 + e^{u})^2} \\
\frac{\partial p}{\partial x_6} = \frac{0.00002\,e^{u}}{(1 + e^{u})^2}
$
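
One way to check partial derivatives like these is to compare the closed form $\frac{a_i e^{-z}}{(1 + e^{-z})^2}$, with $z$ the weighted sum plus intercept, against central finite differences. The sketch below assumes the full-precision intercept and coefficients printed by the linear regression and an arbitrary sample point; the names `p`, `analytic_gradient` and `numeric_gradient` are hypothetical.

```python
import numpy as np

# Intercept and coefficients as printed by the linear regression
b = 0.40430152275569914
a = np.array([-7.84434847e-03, 1.44723798e-02, -9.80706018e-03,
              -1.29068881e-02, 1.16749447e-02, 2.99693056e-05])

def p(x):
    """Logistic function of the linear combination a . x + b."""
    return 1.0 / (1.0 + np.exp(-(a @ x + b)))

def analytic_gradient(x):
    """Closed form: dp/dx_i = a_i * e^(-z) / (1 + e^(-z))^2."""
    exp_neg_z = np.exp(-(a @ x + b))
    return a * exp_neg_z / (1.0 + exp_neg_z) ** 2

def numeric_gradient(x, h=1e-4):
    """Central differences, perturbing one variable at a time."""
    grad = np.zeros_like(x)
    for i in range(len(x)):
        step = np.zeros_like(x)
        step[i] = h
        grad[i] = (p(x + step) - p(x - step)) / (2 * h)
    return grad

# Arbitrary sample point (age, sex, bmi, children, region, charges)
x = np.array([19.0, 0.0, 27.9, 0.0, 3.0, 16884.924])

print(analytic_gradient(x))
print(numeric_gradient(x))
```

The two printed vectors should agree closely, and each entry should carry the sign of its coefficient: negative for $x_1$, $x_3$, $x_4$ and positive for $x_2$, $x_5$, $x_6$.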



### Exercise 4