# 6. Auditory exercise examples


### Reference:
- More information on formula syntax: https://patsy.readthedocs.io/en/latest/formulas.html
- More information about statsmodels: https://www.statsmodels.org/dev/example_formulas.html
- Dataset: Chicco, D., Jurman, G. Machine learning can predict survival of patients with heart failure from serum creatinine and ejection fraction alone. BMC Med Inform Decis Mak 20, 16 (2020). https://doi.org/10.1186/s12911-020-1023-5

In [9]:
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
from sklearn.preprocessing import LabelEncoder

# set seed for consistency
np.random.seed(2)

# Data - preview

We will use a dataset describing comet goldfish. The task is to predict the life span of a fish from factors such as care, environment, and genetics.

Features in the dataset are: 
- average length: unit is inches
- average weight: unit is ounces
- habitat: lakes, ponds, slow moving waters, rivers, idle water
- ph_of_water: pH level of the water
- color: 9 different types of color
- Gender: False for female, True for male
- life_span: age of fish

In [10]:
df = pd.read_csv('data/fish_data.csv')
df.drop(['id'], axis=1, inplace=True)
df = df.rename(columns={
    "average_length(inches))": "avg_length",
    "average_weight(inches))": "avg_weight",
})
df

| | avg_length | avg_weight | habitat | ph_of_water | color | Gender | life_span |
|---|---|---|---|---|---|---|---|
| 0 | 14.69 | 5.87 | ponds | 6.2 | Reddish_Orange | False | 10.9 |
| 1 | 1.32 | 3.86 | idlewater | 6.8 | Calico | True | 5.2 |
| 2 | 14.23 | 12.09 | lakes | 7.9 | Reddish_Orange | True | 25.3 |
| 3 | 2.54 | 3.20 | rivers | 6.7 | White | False | 16.4 |
| 4 | 13.10 | 9.81 | lakes | 7.8 | Orange | True | 3.2 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 1995 | 16.12 | 6.46 | ponds | 8.0 | Red_and_White_Bi_Color | | 6.4 |
| 1996 | 7.50 | 7.07 | rivers | 6.3 | Black_and_Orange | True | 14.0 |
| 1997 | 10.52 | 3.27 | slowmovingwaters | 6.1 | Orange | True | 13.1 |
| 1998 | 7.70 | 15.41 | ponds | 7.8 | Orange | False | 15.6 |


The dataset needs preprocessing before we continue working with it. Let's check for missing data. 

We should also encode the textual columns (such as habitat and color) as integer category codes using a label encoder.

In [None]:
df
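
A minimal sketch of the two preprocessing steps described above, run on a small hypothetical frame rather than the real `fish_data.csv`. Dropping incomplete rows is only one possible strategy and is an assumption here; imputation would be another.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy rows that mimic a few columns of the fish dataset.
toy = pd.DataFrame({
    "habitat": ["ponds", "idlewater", "lakes", "rivers", "ponds"],
    "color": ["Calico", "White", "Orange", "Orange", None],
    "Gender": [False, True, True, None, False],
    "life_span": [10.9, 5.2, 25.3, 16.4, 6.4],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Assumed strategy: drop every row that has at least one missing value.
clean = toy.dropna().copy()

# Replace each textual column with integer codes.
# LabelEncoder assigns codes to the sorted unique values of the column.
for col in ["habitat", "color"]:
    clean[col] = LabelEncoder().fit_transform(clean[col])

print(clean)
```

Because the codes follow the sorted order of the labels, the lowest code becomes the reference level when the column is later wrapped in `C(...)` inside a formula.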

# Linear regression task 1: Modeling age of fish depending on genetics

- Using linear regression, we will model the age of fish using the size of fish (length and weight) and gender.
- The model that achieves this is formulated as:
        life_span ~ avg_length + avg_weight + C(Gender)
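
The formula above could be declared and fitted roughly as follows. This is a sketch on synthetic data (the generating coefficients are invented for illustration); in the exercise, the preprocessed `df` would take the place of `toy`.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the fish data (hypothetical values, not the real dataset).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "avg_length": rng.uniform(1, 17, n),
    "avg_weight": rng.uniform(3, 16, n),
    "Gender": rng.integers(0, 2, n).astype(bool),
})
toy["life_span"] = (
    10 + 0.5 * toy["avg_length"] - 0.2 * toy["avg_weight"]
    + 2.0 * toy["Gender"] + rng.normal(0, 1, n)
)

# Declare the model with the formula from the text, then fit it by ordinary least squares.
mod = smf.ols(formula="life_span ~ avg_length + avg_weight + C(Gender)", data=toy)
res = mod.fit()
print(res.summary())
```

`C(Gender)` tells the formula parser to treat the column as categorical, so one level serves as the reference and the other gets its own coefficient.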


In [None]:
# Declares the model


- The dependent variable : life_span (how old is the fish)
- Method: The type of model that was fitted (OLS)
- No. Observations: The number of data points (??? calico fish)
- R-squared: The fraction of explained variance
- A list of predictors
- For each predictor: coefficient, standard error of the coefficients, p-value, 95% confidence intervals
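
The quantities in this list can also be read directly from the fitted results object instead of from the printed summary. A small sketch, using a hypothetical one-predictor model on synthetic data:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical toy data; in the exercise this would be the preprocessed fish DataFrame.
rng = np.random.default_rng(1)
toy = pd.DataFrame({"x": rng.normal(size=50)})
toy["y"] = 1.0 + 2.0 * toy["x"] + rng.normal(scale=0.5, size=50)

res = smf.ols("y ~ x", data=toy).fit()

print(res.model.endog_names)      # name of the dependent variable
print(int(res.nobs))              # number of observations
print(res.rsquared)               # fraction of explained variance
print(res.params)                 # coefficient of each predictor
print(res.bse)                    # standard error of each coefficient
print(res.pvalues)                # p-value of each coefficient
print(res.conf_int(alpha=0.05))   # 95% confidence intervals
```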

### We can now interpret the learned model

$\text{life\_span} = \text{intercept?} + \text{coeff1?} \cdot \text{avg\_length} + \text{coeff2?} \cdot \text{avg\_weight} + \text{coeff3?} \cdot \text{Gender}$

- What is the expected life span if the fish is female?
- How do the weight and length of the fish affect the life span?
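
To make the first question concrete, the sketch below evaluates the linear equation with placeholder coefficients. All four numbers are hypothetical and should be replaced with the values from the fitted summary. It assumes the Gender term is a dummy that equals 1 for male (True) and 0 for female (False), so the Gender term drops out for a female fish.

```python
# Hypothetical coefficient values, used only to show the arithmetic.
intercept = 12.0
coef_length = 0.10
coef_weight = -0.05
coef_male = 0.50  # coefficient of the Gender dummy (1 for male, 0 for female)


def expected_life_span(avg_length, avg_weight, is_male):
    """Evaluate the fitted linear equation for one fish."""
    return (intercept
            + coef_length * avg_length
            + coef_weight * avg_weight
            + coef_male * int(is_male))


female = expected_life_span(avg_length=10.0, avg_weight=6.0, is_male=False)
male = expected_life_span(avg_length=10.0, avg_weight=6.0, is_male=True)
print(female, male)
```

For two fish of the same size, the difference between the male and female predictions is exactly the Gender coefficient, and each size coefficient is the change in expected life span per one-unit increase in that measurement.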


# Linear regression task 2: Modeling age of fish depending on environment

- Using linear regression, we will model the age of fish using two environment variables, habitat and pH of water, and check the interaction between them.
- The model that achieves this is formulated as:
        life_span ~ C(habitat) * ph_of_water 
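
In the formula syntax, `a * b` is shorthand for `a + b + a:b`, that is, both main effects plus their interaction. The sketch below fits both spellings on synthetic data (hypothetical values with three habitat codes) to show that they describe the same model.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in (hypothetical values): three label-encoded habitats and a pH column.
rng = np.random.default_rng(2)
n = 300
toy = pd.DataFrame({
    "habitat": rng.integers(0, 3, n),
    "ph_of_water": rng.uniform(6.0, 8.0, n),
})
toy["life_span"] = 15 - 0.5 * toy["ph_of_water"] + rng.normal(0, 3, n)

# `*` requests both main effects and their interaction.
res_star = smf.ols("life_span ~ C(habitat) * ph_of_water", data=toy).fit()

# The same model written out explicitly, with `:` for the interaction term.
res_long = smf.ols(
    "life_span ~ C(habitat) + ph_of_water + C(habitat):ph_of_water", data=toy
).fit()

print(res_star.params)
```

With three habitat levels this gives six parameters: the intercept, two habitat offsets, the pH slope for the reference habitat, and two slope adjustments.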


In [23]:
# Declares the model


| | | | |
|---|---|---|---|
| Dep. Variable: | life_span | R-squared: | 0.004 |
| Model: | OLS | Adj. R-squared: | -0.001 |
| Method: | Least Squares | F-statistic: | 0.8717 |
| Date: | Fri, 08 Dec 2023 | Prob (F-statistic): | 0.55 |
| Time: | 10:02:43 | Log-Likelihood: | -6837.1 |
| No. Observations: | 1976 | AIC: | 13690.0 |
| Df Residuals: | 1966 | BIC: | 13750.0 |
| Df Model: | 9 | | |
| Covariance Type: | nonrobust | | |

| | coef | std err | t | P>\|t\| | [0.025 | 0.975] |
|---|---|---|---|---|---|---|
| Intercept | 19.2001 | 4.749 | 4.043 | 0.000 | 9.887 | 28.513 |
| C(habitat)[T.1] | -1.6310 | 6.550 | -0.249 | 0.803 | -14.476 | 11.214 |
| C(habitat)[T.2] | -2.6762 | 6.691 | -0.400 | 0.689 | -15.798 | 10.446 |
| C(habitat)[T.3] | -0.0134 | 6.869 | -0.002 | 0.998 | -13.485 | 13.458 |
| C(habitat)[T.4] | -1.3907 | 6.731 | -0.207 | 0.836 | -14.591 | 11.810 |
| ph_of_water | -0.7662 | 0.678 | -1.131 | 0.258 | -2.095 | 0.563 |
| C(habitat)[T.1]:ph_of_water | 0.2896 | 0.933 | 0.311 | 0.756 | -1.539 | 2.119 |
| C(habitat)[T.2]:ph_of_water | 0.4407 | 0.951 | 0.463 | 0.643 | -1.425 | 2.307 |
| C(habitat)[T.3]:ph_of_water | 0.1430 | 0.977 | 0.146 | 0.884 | -1.773 | 2.059 |

| | | | |
|---|---|---|---|
| Omnibus: | 1100.749 | Durbin-Watson: | 2.048 |
| Prob(Omnibus): | 0.0 | Jarque-Bera (JB): | 109.515 |
| Skew: | -0.002 | Prob(JB): | 1.66e-24 |
| Kurtosis: | 1.847 | Cond. No. | 513.0 |


### Interpretation
- Let's discuss the values.
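
As a starting point for that discussion, the sketch below shows how an interaction model is read: the intercept and the `ph_of_water` slope apply to the reference habitat (code 0), and each other habitat adds its own offset and slope adjustment. The numbers are copied from the coefficient table; the pH value of 7.0 is an arbitrary example. Note that the table reports an R-squared of 0.004, a Prob (F-statistic) of 0.55, and p-values well above 0.05 for every term except the intercept, so these differences should not be treated as reliable effects.

```python
# Coefficients copied from the regression table in this section.
intercept = 19.2001          # intercept for the reference habitat (code 0)
ph_slope = -0.7662           # pH slope in the reference habitat
habitat1_offset = -1.6310    # C(habitat)[T.1]
habitat1_ph = 0.2896         # C(habitat)[T.1]:ph_of_water

# Intercept and pH slope that apply to habitat code 1.
intercept_h1 = intercept + habitat1_offset
slope_h1 = ph_slope + habitat1_ph


def predict_h1(ph):
    """Expected life span in habitat 1 at the given pH."""
    return intercept_h1 + slope_h1 * ph


print(round(intercept_h1, 4), round(slope_h1, 4), round(predict_h1(7.0), 4))
```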


# Standardization

In [None]:
# how we standardize the continuous variables
columns_to_standardize = [

]

for col in columns_to_standardize:
    df[col] = (df[col] - df[col].mean()) / df[col].std()  # standardize column
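
A quick check of what the loop above does, on a small frame built from the first rows of the preview. The choice of columns here is an assumption for illustration only. After standardization each column should have mean 0 and standard deviation 1, so a coefficient then describes the effect of a one-standard-deviation change.

```python
import numpy as np
import pandas as pd

# Values taken from the first rows of the data preview; the column choice is illustrative.
toy = pd.DataFrame({
    "avg_length": [14.69, 1.32, 14.23, 2.54, 13.10],
    "ph_of_water": [6.2, 6.8, 7.9, 6.7, 7.8],
})

columns_to_standardize = ["avg_length", "ph_of_water"]

for col in columns_to_standardize:
    # pandas' std() uses the sample standard deviation (ddof=1) by default.
    toy[col] = (toy[col] - toy[col].mean()) / toy[col].std()

print(toy.mean().round(6))
print(toy.std().round(6))
```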


In [None]:
# Declares the model
