# MACHINE LEARNING MODEL

---

0. Import libraries, load the CSV files, and set the target

In [30]:
# Libraries
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import numpy as np

train_data = pd.read_csv(r'C:\Users\rnogu\OneDrive\Documentos\GitHub\Linear-regression-model\data\processed\norm_insurance_train.csv')
test_data = pd.read_csv(r'C:\Users\rnogu\OneDrive\Documentos\GitHub\Linear-regression-model\data\processed\norm_insurance_test.csv')
target = 'charges'
train_data.head(1)

|   | age | sex | bmi | children | smoker | region | charges |
|---|-----|-----|-----|----------|--------|--------|---------|
| 0 | 0.472227 | -1.024602 | -1.756525 | 0.734336 | 0.508747 | 0.456116 | 9193.8385 |


1. Split into features (X) and target (y)

In [31]:
# Separate the target column from the feature columns
X_train = train_data.drop([target], axis = 1)
y_train = train_data[target]
X_test = test_data.drop([target], axis = 1)
y_test = test_data[target]

In [32]:
print(y_train.head(1))
X_train.head(1)

0    9193.8385
Name: charges, dtype: float64


|   | age | sex | bmi | children | smoker | region |
|---|-----|-----|-----|----------|--------|--------|
| 0 | 0.472227 | -1.024602 | -1.756525 | 0.734336 | 0.508747 | 0.456116 |


2. Loop over feature counts and fit a linear regression model for each

In [43]:
# Auxiliary list used to build the results DataFrame
results_list = []
# Number of features to try (2 to 6; 6 keeps every feature)
for num_features in range(2, 7):
    # Select the best 'num_features' features using SelectKBest
    selector = SelectKBest(score_func=f_regression, k=num_features)
    X_train_selected = selector.fit_transform(X_train, y_train)

    # Create and train the linear regression model on the selected features
    model = LinearRegression()
    model.fit(X_train_selected, y_train)

    # Apply the same feature selection to the test set
    X_test_selected = selector.transform(X_test)

    # Predict on the selected test features with the trained model
    y_pred = model.predict(X_test_selected)

    # Compute and print the evaluation metrics
    mae = mean_absolute_error(y_test, y_pred)
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)

    print(f"Model with {num_features} selected variables:")
    print(f"Mean absolute error: {mae:.2f}")
    print(f"Mean squared error: {mse:.2f}")
    print(f"Coefficient of determination (R^2): {r2:.2f}")
    print()

    # Store the results for this feature count
    results_list.append({
        'Variables': num_features,
        'MAE': mae,
        'MSE': round(mse,2),
        'R2': r2
    })

result_df = pd.DataFrame(results_list)
result_df

Model with 2 selected variables:
Mean absolute error: 3990.98
Mean squared error: 38274699.68
Coefficient of determination (R^2): 0.75

Model with 3 selected variables:
Mean absolute error: 4260.56
Mean squared error: 34512843.88
Coefficient of determination (R^2): 0.78

Model with 4 selected variables:
Mean absolute error: 4213.80
Mean squared error: 33981653.95
Coefficient of determination (R^2): 0.78

Model with 5 selected variables:
Mean absolute error: 4213.48
Mean squared error: 33979257.05
Coefficient of determination (R^2): 0.78

Model with 6 selected variables:
Mean absolute error: 4186.51
Mean squared error: 33635210.43
Coefficient of determination (R^2): 0.78



|   | Variables | MAE | MSE | R2 |
|---|-----------|-----|-----|----|
| 0 | 2 | 3990.979515 | 38274699.68 | 0.753462 |
| 1 | 3 | 4260.560091 | 34512843.88 | 0.777693 |
| 2 | 4 | 4213.798595 | 33981653.95 | 0.781115 |
| 3 | 5 | 4213.484798 | 33979257.05 | 0.781130 |
| 4 | 6 | 4186.508898 | 33635210.43 | 0.783346 |
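
The loop above reports metrics for each `k` but not which columns `SelectKBest` actually kept. A minimal sketch of how the selected feature names could be recovered with `get_support()` follows. It runs on synthetic data, not the insurance CSVs, and `X_demo`, `y_demo`, and `selected_by_k` are hypothetical names introduced only for this illustration.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic stand-in for the normalized insurance features (NOT the real data).
rng = np.random.default_rng(0)
columns = ['age', 'sex', 'bmi', 'children', 'smoker', 'region']
X_demo = pd.DataFrame(rng.normal(size=(200, 6)), columns=columns)

# Hypothetical target: dominated by 'smoker', with smaller 'age' and 'bmi' effects.
y_demo = (
    9000 * X_demo['smoker']
    + 3500 * X_demo['age']
    + 2000 * X_demo['bmi']
    + rng.normal(scale=500, size=200)
)

selected_by_k = {}
for k in range(2, 7):
    selector = SelectKBest(score_func=f_regression, k=k).fit(X_demo, y_demo)
    # get_support() returns a boolean mask aligned with the input columns
    selected_by_k[k] = list(X_demo.columns[selector.get_support()])
    print(k, selected_by_k[k])
```

In the notebook itself, the same two lines inside the loop (using `X_train.columns` and the fitted `selector`) would presumably be enough to log the chosen columns next to each set of metrics.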


---

# CONCLUSIONS

1. The results of the linear regression models, by number of selected variables, are as follows:


| Variables |         MAE |          MSE |        R2 |
|-----------|-------------|--------------|-----------|
|         2 | 3990.979515 |  38274699.68 | 0.753462  |
|         3 | 4260.560091 |  34512843.88 | 0.777693  |
|         4 | 4213.798595 |  33981653.95 | 0.781115  |
|         5 | 4213.484798 |  33979257.05 | 0.781130  |
|         6 | 4186.508898 |  33635210.43 | 0.783346  |

2. The best number of variables according to each metric:
- According to MAE (lower is better): 2 variables.
- According to MSE (lower is better): 6 variables.
- According to R2 (higher is better): 6 variables.
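
The per-metric comparison can also be read off programmatically. The sketch below rebuilds the results table from the values reported above and picks the row with the lowest MAE, the lowest MSE, and the highest R2. The `RMSE` column and the `best` dictionary are additions for illustration, not part of the original notebook. RMSE is included because it is in the same units as `charges`, which makes it easier to compare with MAE.

```python
import numpy as np
import pandas as pd

# Metrics copied from the results table above.
result_df = pd.DataFrame({
    'Variables': [2, 3, 4, 5, 6],
    'MAE': [3990.979515, 4260.560091, 4213.798595, 4213.484798, 4186.508898],
    'MSE': [38274699.68, 34512843.88, 33981653.95, 33979257.05, 33635210.43],
    'R2': [0.753462, 0.777693, 0.781115, 0.781130, 0.783346],
})

# RMSE is the square root of MSE, expressed in the units of the target.
result_df['RMSE'] = np.sqrt(result_df['MSE'])

# Errors are minimized; R2 is maximized.
best = {
    'MAE': int(result_df.loc[result_df['MAE'].idxmin(), 'Variables']),
    'MSE': int(result_df.loc[result_df['MSE'].idxmin(), 'Variables']),
    'R2': int(result_df.loc[result_df['R2'].idxmax(), 'Variables']),
}
print(best)  # {'MAE': 2, 'MSE': 6, 'R2': 6}
```

MAE and MSE disagree here because MSE penalizes large individual errors more heavily, so the 2-variable model likely has a lower typical error but a few larger misses.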