# Recall Machine Learning Linear Regression

At the end of this lesson the student will remember the main steps to train a model:

 - Split the dataset into train and test subsets
 - Standardize continuous variables
 - Transform categorical variables into dummy variables
 - Train linear regression models
 - Train classification models
 - Interpret the error and accuracy metrics to validate the built models

**You have two exercises at the end of the notebook**

### Import data and libraries

In [None]:
import pandas as pd
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression
from sklearn import metrics

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler

In [None]:
# Forward slashes work on every OS and avoid invalid escape sequences such as '\d'
df = pd.read_csv('../data/Fish.csv')
df

Unnamed: 0,Species,Weight,Length1,Length2,Length3,Height,Width
0,Bream,242.0,23.2,25.4,30.0,11.5200,4.0200
1,Bream,290.0,24.0,26.3,31.2,12.4800,4.3056
2,Bream,340.0,23.9,26.5,31.1,12.3778,4.6961
3,Bream,363.0,26.3,29.0,33.5,12.7300,4.4555
4,Bream,430.0,26.5,29.0,34.0,12.4440,5.1340
...,...,...,...,...,...,...,...
154,Smelt,12.2,11.5,12.2,13.4,2.0904,1.3936
155,Smelt,13.4,11.7,12.4,13.5,2.4300,1.2690
156,Smelt,12.2,12.1,13.0,13.8,2.2770,1.2558
157,Smelt,19.7,13.2,14.3,15.2,2.8728,2.0672


In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159 entries, 0 to 158
Data columns (total 7 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   Species  159 non-null    object 
 1   Weight   159 non-null    float64
 2   Length1  159 non-null    float64
 3   Length2  159 non-null    float64
 4   Length3  159 non-null    float64
 5   Height   159 non-null    float64
 6   Width    159 non-null    float64
dtypes: float64(6), object(1)
memory usage: 8.8+ KB


### Species variable treatment

Species is a categorical variable, so we need to transform it into dummy (0/1 indicator) variables before feeding it to the model, which only accepts numeric inputs.

https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html


In [None]:
df.Species.value_counts()

Perch        56
Bream        35
Roach        20
Pike         17
Smelt        14
Parkki       11
Whitefish     6
Name: Species, dtype: int64

Firstly, let's reduce the categories to Perch, Bream and Others

In [None]:
def fish_species(x):
    if x == 'Perch':
        return 'Perch'
    elif x == 'Bream':
        return 'Bream'
    else:
        return 'Others'

df['fish_species'] = df['Species'].apply(fish_species)
df

Unnamed: 0,Species,Weight,Length1,Length2,Length3,Height,Width,fish_species
0,Bream,242.0,23.2,25.4,30.0,11.5200,4.0200,Bream
1,Bream,290.0,24.0,26.3,31.2,12.4800,4.3056,Bream
2,Bream,340.0,23.9,26.5,31.1,12.3778,4.6961,Bream
3,Bream,363.0,26.3,29.0,33.5,12.7300,4.4555,Bream
4,Bream,430.0,26.5,29.0,34.0,12.4440,5.1340,Bream
...,...,...,...,...,...,...,...,...
154,Smelt,12.2,11.5,12.2,13.4,2.0904,1.3936,Others
155,Smelt,13.4,11.7,12.4,13.5,2.4300,1.2690,Others
156,Smelt,12.2,12.1,13.0,13.8,2.2770,1.2558,Others
157,Smelt,19.7,13.2,14.3,15.2,2.8728,2.0672,Others


### Get dummies

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.get_dummies.html

https://stats.stackexchange.com/questions/350492/why-do-we-create-dummy-variables

https://towardsdatascience.com/what-are-dummy-variables-and-how-to-use-them-in-a-regression-model-ee43640d573e

In [None]:
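# One 0/1 column per category; dtype=int keeps the dummies numeric
# (recent pandas versions return booleans by default)
df_dum = pd.get_dummies(df.fish_species, dtype=int)
# Attach the dummy columns to the original rows by index
df = df.merge(df_dum, right_index = True, left_index = True, how = 'left')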


In [None]:
df.drop(['Species','fish_species'], axis = 1, inplace = True)
df.columns

Index(['Weight', 'Length1', 'Length2', 'Length3', 'Height', 'Width', 'Bream',
       'Others', 'Perch'],
      dtype='object')
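
Keeping all three dummy columns means that `Bream + Others + Perch` equals 1 on every row, which duplicates the constant that is added to the model later (often called the dummy variable trap). A minimal sketch of one common way around it, using `drop_first=True` on a small made-up series rather than the fish data; the variable names are illustrative only:

```python
import pandas as pd

# Toy stand-in for the fish_species column (values are illustrative only)
species = pd.Series(['Bream', 'Perch', 'Others', 'Perch', 'Bream'],
                    name='fish_species')

# All categories: the three columns always sum to 1, like an intercept column
full = pd.get_dummies(species, dtype=int)

# drop_first=True drops the first category in sorted order ('Bream'),
# which becomes the baseline the remaining coefficients are compared against
reduced = pd.get_dummies(species, drop_first=True, dtype=int)

print(full.sum(axis=1).tolist())
print(list(reduced.columns))
```

With this encoding, the coefficient of each remaining dummy reads as the difference with respect to the dropped category.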

### Train test split

We randomly divide the dataset into two parts: a training split used to fit the model and a test split used to validate it.

If the error metrics on the test split are low and close to those on the training split, the model generalizes well to unseen data.


https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html


In [None]:
fish_train, fish_test = train_test_split(df, test_size=0.2, random_state=0)

In [None]:
print(fish_train.shape)
print(fish_test.shape)

(127, 9)
(32, 9)
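
The split sizes above follow from `test_size=0.2` applied to 159 rows. A small sketch on a synthetic frame of the same shape (the `toy` names are made up for illustration) that reproduces the arithmetic:

```python
import math
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic frame with the same shape as the notebook's df (159 rows, 9 columns)
toy = pd.DataFrame(np.zeros((159, 9)))

toy_train, toy_test = train_test_split(toy, test_size=0.2, random_state=0)

# The test size is rounded up: ceil(0.2 * 159) = ceil(31.8) = 32
n_test = math.ceil(0.2 * len(toy))
n_train = len(toy) - n_test

print(toy_train.shape, toy_test.shape)
print(n_train, n_test)
```

Fixing `random_state` makes the split reproducible, so the metrics later in the notebook do not change between runs.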


### Standardize the numerical variables

Numerical variables in a dataset often have very different scales, that is, their values span very different ranges from one column to another. This can hurt models that are sensitive to feature scale and makes coefficients hard to compare.

To solve this, we standardize: each continuous variable is centered at 0 and rescaled to have standard deviation 1.

We **first** fit the scaler on the training set, then transform the test set with the training-set parameters (mean and standard deviation), so that no information from the test set leaks into training.

We do not standardize the target variable

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html#sklearn.preprocessing.MinMaxScaler


https://www.askpython.com/python/examples/standardize-data-in-python
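
As a small worked example of the idea, the sketch below fits a `StandardScaler` on made-up training values and reuses its mean and standard deviation on made-up test values. The numbers are illustrative and not taken from Fish.csv:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up lengths (illustrative values, not taken from Fish.csv)
train = np.array([[10.0], [20.0], [30.0], [40.0]])
test = np.array([[25.0], [50.0]])

scaler = StandardScaler().fit(train)   # learns mean and std from train only

train_sc = scaler.transform(train)
test_sc = scaler.transform(test)       # reuses the training mean and std

# Same result by hand: z = (x - mean) / std, with the population std (ddof=0)
mean, std = train.mean(axis=0), train.std(axis=0)
manual_test_sc = (test - mean) / std

print(train_sc.mean(), train_sc.std())
print(np.allclose(test_sc, manual_test_sc))
```

The scaled test set will usually not have mean 0 and standard deviation 1 exactly, because it is transformed with the training statistics. That is expected.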



In [None]:
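# Reference pattern only: this cell uses names from a different dataset.
# X_train1, X_test1, y_train1, y_test1 and the columns in variables_sc are
# not defined in this notebook, so the cell is not meant to be run here.
# The next cell applies the same steps to the fish data.
scale= StandardScaler()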

variables_sc = ['duration_ms', 'loudness', 'tempo']

scale_fit = scale.fit(X_train1[variables_sc])

X_train_sc = pd.DataFrame(scale.transform(X_train1[variables_sc]), columns = variables_sc)

X_test_sc = pd.DataFrame(scale.transform(X_test1[variables_sc]), columns = variables_sc)

X_train_sc.shape

X_train1.drop(variables_sc, axis = 1, inplace = True)
X_train1 = X_train1.reset_index(drop = True)
y_train1 = y_train1.reset_index(drop=True)
X_train = pd.concat([X_train1, X_train_sc], axis = 1)

X_test1.drop(variables_sc, axis = 1, inplace = True)
X_test1 = X_test1.reset_index(drop = True)
y_test1 = y_test1.reset_index(drop=True)
X_test = pd.concat([X_test1, X_test_sc], axis = 1)

In [None]:
scale = StandardScaler()
variables_sc = ['Length1', 'Length2', 'Length3', 'Height', 'Width']
dummies = ['Bream', 'Others', 'Perch']

# Select predictors and target; reset the index so rows stay aligned
X_train = fish_train[variables_sc + dummies].reset_index(drop=True)
y_train = fish_train['Weight'].reset_index(drop=True)

X_test = fish_test[variables_sc + dummies].reset_index(drop=True)
y_test = fish_test['Weight'].reset_index(drop=True)

# Fit the scaler on the training set only
scale_train = scale.fit(X_train[variables_sc])

# Scale the continuous columns in place; assigning by column name keeps
# each label attached to its own data. The dummies are left untouched.
X_train[variables_sc] = scale_train.transform(X_train[variables_sc])
X_test[variables_sc] = scale_train.transform(X_test[variables_sc])

## Linear Regression

https://medium.com/swlh/interpreting-linear-regression-through-statsmodels-summary-4796d359035a

In [None]:
X_train = sm.add_constant(X_train)
result = sm.OLS(y_train, X_train).fit()

print(result.summary())

                            OLS Regression Results                            
Dep. Variable:                 Weight   R-squared:                       0.895
Model:                            OLS   Adj. R-squared:                  0.889
Method:                 Least Squares   F-statistic:                     144.5
Date:                Tue, 12 Sep 2023   Prob (F-statistic):           4.55e-55
Time:                        13:06:34   Log-Likelihood:                -774.25
No. Observations:                 127   AIC:                             1564.
Df Residuals:                     119   BIC:                             1587.
Df Model:                           7                                         
Covariance Type:            nonrobust                                         
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          0.0001     76.904    1.5e-06      1.0
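
The notebook imports `LinearRegression` from scikit-learn but fits the model with statsmodels. Both solve the same least-squares problem. A hedged sketch on synthetic data (the coefficients 3, 2 and -1 are made up) showing that `LinearRegression` matches a direct least-squares solution with an explicit constant column:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic data: y = 3 + 2*x1 - 1*x2 + small noise (coefficients are made up)
X = rng.normal(size=(200, 2))
y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# scikit-learn adds the intercept itself (fit_intercept=True by default),
# so no constant column is needed
model = LinearRegression().fit(X, y)

# Ordinary least squares solved directly, with an explicit column of ones
X_const = np.column_stack([np.ones(len(X)), X])
beta, *_ = np.linalg.lstsq(X_const, y, rcond=None)

print(model.intercept_, model.coef_)
print(beta)
```

statsmodels is convenient when the summary table (standard errors, p-values, confidence intervals) is needed. scikit-learn is convenient when only predictions are needed.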

In [None]:
X_train['predict'] = result.predict(X_train)

X_test = sm.add_constant(X_test)
X_test['predict'] = result.predict(X_test)

In [None]:
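def mean_absolute_percentage_error(y_true, y_pred):
    # Note: this compares the totals, |sum(y_true) - sum(y_pred)| / sum(y_true),
    # so positive and negative errors cancel out. It is not the per-observation
    # MAPE, mean(|y_true - y_pred| / |y_true|). On the training set of an OLS
    # model with an intercept the residuals sum to zero, so it is close to 0.
    return np.mean(np.abs((sum(y_true) - sum(y_pred)) / sum(y_true))) * 100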

print("MAE: ", metrics.mean_absolute_error(y_train, X_train['predict']))
print("MSE: ", metrics.mean_squared_error(y_train, X_train['predict']))
print("RMSE: ", np.sqrt(metrics.mean_squared_error(y_train, X_train['predict'])))
print("MAPE: ", mean_absolute_percentage_error(y_train, X_train['predict']))
print("R2: ", metrics.r2_score(y_train, X_train['predict']))


MAE:  79.48159685784069
MSE:  11556.170674549487
RMSE:  107.49963104378305
MAPE:  1.5084644534756303e-14
R2:  0.8947146195772226
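
The MAPE printed above is practically zero because the helper compares the sum of the true values with the sum of the predictions, so over- and under-predictions cancel. A sketch of the difference on made-up values; `total_pct_error` and `mape` are hypothetical names, and skipping zero-valued targets is an assumption about how to handle undefined ratios:

```python
import numpy as np

def total_pct_error(y_true, y_pred):
    """Percentage error of the totals (what the notebook's helper computes)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return abs(y_true.sum() - y_pred.sum()) / y_true.sum() * 100

def mape(y_true, y_pred):
    """Mean of per-observation absolute percentage errors, skipping zero targets."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    mask = y_true != 0            # dividing by a zero target is undefined
    return np.mean(np.abs((y_true[mask] - y_pred[mask]) / y_true[mask])) * 100

# Made-up weights: one over-prediction and one under-prediction of 50 each
y_true = np.array([100.0, 200.0])
y_pred = np.array([150.0, 150.0])

print(total_pct_error(y_true, y_pred))
print(mape(y_true, y_pred))
```

scikit-learn also provides `metrics.mean_absolute_percentage_error`, which returns a fraction rather than a percentage.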


In [None]:
print("MAE: ", metrics.mean_absolute_error(y_test, X_test['predict']))
print("MSE: ", metrics.mean_squared_error(y_test, X_test['predict']))
print("RMSE: ", np.sqrt(metrics.mean_squared_error(y_test, X_test['predict'])))
print("MAPE: ", mean_absolute_percentage_error(y_test, X_test['predict']))
print("R2: ", metrics.r2_score(y_test, X_test['predict']))


MAE:  99.17774652309242
MSE:  26636.577496367398
RMSE:  163.20716129008372
MAPE:  3.159067095609248
R2:  0.8600657372046447
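
To judge how well the model generalizes, it helps to put the train and test metrics side by side. A short sketch using the values printed in this notebook; the dictionary layout and the ratio are just one illustrative way to compare them:

```python
import math

# Values copied from the outputs printed in this notebook
train = {'MAE': 79.48159685784069, 'MSE': 11556.170674549487,
         'R2': 0.8947146195772226}
test = {'MAE': 99.17774652309242, 'MSE': 26636.577496367398,
        'R2': 0.8600657372046447}

# RMSE is the square root of MSE, expressed in the same units as Weight
rmse_train = math.sqrt(train['MSE'])
rmse_test = math.sqrt(test['MSE'])

# A ratio above 1 means the model errs more on unseen data than on training data
rmse_ratio = rmse_test / rmse_train

print(round(rmse_train, 2), round(rmse_test, 2), round(rmse_ratio, 2))
print(round(train['R2'] - test['R2'], 3))
```

A moderate gap between train and test is normal. A large one is a sign of overfitting.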


## Exercise 1

Answer each question in 4-5 lines. You can consult the links in this notebook, the theory notebooks, or search the internet:

 - Which type of variables do we transform into dummies? Why do we do it?
 - Why is it so important to divide our data into train and test datasets? What is the purpose of doing it?
 - Why do we standardize some variables? Which type of variables do we standardize?

**Exercise 1_1** \\
We transform categorical variables into dummy (indicator) variables. This is done so that these categorical variables can be included in statistical models or machine learning algorithms that require numeric input.

**Exercise 1_2** \\
Splitting the data into training and test sets is essential for evaluating and improving a machine learning model's ability to generalize. It helps avoid overfitting, tune hyperparameters effectively, and estimate the model's real performance on unseen data, which is fundamental for building reliable and useful predictive models.

**Exercise 1_3** \\
Data standardization, sometimes also called normalization, is the process of adjusting certain features so that the data conform to a common type, model, or norm, making them easier to handle, access, and use. It improves the interpretability and accuracy of results in statistical analyses and machine learning models. \\
We standardize numerical variables.

## Exercise 2

Based on the summary and the error metrics, would you use this model to predict fish weights? Justify your answer. Comment on the usefulness of the main indicators in the summary and of the error metrics.


Yes, I would use this model to predict the weights of the fish, since we can see that there is a linear relationship between all the variables and the weight of the fish. \\
As for the errors, the training errors are computed first and then the test errors.