# 6. Auditory exercise examples


### Reference:
- More information on formula syntax: https://patsy.readthedocs.io/en/latest/formulas.html
- More information about statsmodels: https://www.statsmodels.org/dev/example_formulas.html
- Dataset: Chicco, D., Jurman, G. Machine learning can predict survival of patients with heart failure from serum creatinine and ejection fraction alone. BMC Med Inform Decis Mak 20, 16 (2020). https://doi.org/10.1186/s12911-020-1023-5

In [1]:
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
from sklearn.preprocessing import LabelEncoder

# set seed for consistency
np.random.seed(2)

# Data - preview

We will use a dataset of comet goldfish. The task is to predict the lifespan of a fish from factors such as care, environment and genetics.

Features in the dataset are: 
- average length: in inches
- average weight: in ounces
- habitat: lakes, ponds, slow moving waters, rivers, idle water
- ph_of_water: pH level of the water
- color: 9 different colors
- gender: False for female, True for male
- life_span: age of the fish

In [3]:
df = pd.read_csv('fish_data.csv')
df.drop(['id'], axis=1, inplace=True)
df = df.rename(columns={"average_length(inches))": "avg_length", "average_weight(inches))": "avg_weight"})
df

Unnamed: 0,avg_length,avg_weight,habitat,ph_of_water,color,Gender,life_span
0,14.69,5.87,ponds,6.2,Reddish_Orange,False,10.9
1,1.32,3.86,idlewater,6.8,Calico,True,5.2
2,14.23,12.09,lakes,7.9,Reddish_Orange,True,25.3
3,2.54,3.20,rivers,6.7,White,False,16.4
4,13.10,9.81,lakes,7.8,Orange,True,3.2
...,...,...,...,...,...,...,...
1995,16.12,6.46,ponds,8.0,Red_and_White_Bi_Color,,6.4
1996,7.50,7.07,rivers,6.3,Black_and_Orange,True,14.0
1997,10.52,3.27,slowmovingwaters,6.1,Orange,True,13.1
1998,7.70,15.41,ponds,7.8,Orange,False,15.6


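The dataset needs preprocessing before we continue working with it. The Gender column has missing values (see row 1995 above), so we fill them with the most frequent value and then check for any remaining missing data.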

In [8]:

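# Fill missing Gender values with the most frequent value (the mode).
# Assigning the result back avoids the chained in-place fillna, which newer pandas versions warn about.
df['Gender'] = df['Gender'].fillna(df['Gender'].mode()[0])
# Count the remaining missing values per column (wrap in print() to display them)
df.isna().sum()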
df

Unnamed: 0,avg_length,avg_weight,habitat,ph_of_water,color,Gender,life_span
0,14.69,5.87,ponds,6.2,Reddish_Orange,False,10.9
1,1.32,3.86,idlewater,6.8,Calico,True,5.2
2,14.23,12.09,lakes,7.9,Reddish_Orange,True,25.3
3,2.54,3.20,rivers,6.7,White,False,16.4
4,13.10,9.81,lakes,7.8,Orange,True,3.2
...,...,...,...,...,...,...,...
1995,16.12,6.46,ponds,8.0,Red_and_White_Bi_Color,False,6.4
1996,7.50,7.07,rivers,6.3,Black_and_Orange,True,14.0
1997,10.52,3.27,slowmovingwaters,6.1,Orange,True,13.1
1998,7.70,15.41,ponds,7.8,Orange,False,15.6
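
To make the missing-value step easier to follow, here is a minimal sketch on a toy frame. The values and the name `toy` are made up for illustration and are not part of the fish dataset. It counts the missing entries per column, fills the gaps with the mode, and counts again.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with one missing Gender entry
toy = pd.DataFrame({
    "Gender": [True, False, False, np.nan],
    "life_span": [5.2, 10.9, 15.6, 6.4],
})

# Per-column count of missing values before imputation
missing_before = toy.isna().sum()
print(missing_before)

# Fill missing entries with the most frequent value (the mode)
most_common = toy["Gender"].mode()[0]
toy["Gender"] = toy["Gender"].fillna(most_common)

# Per-column count of missing values after imputation
missing_after = toy.isna().sum()
print(missing_after)
```

Mode imputation is a simple choice for a categorical column. It keeps every row but slightly inflates the share of the most common category.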


We should also convert the text values into integer category codes using a label encoder.

In [12]:
le = LabelEncoder()
# Replace each label with an integer code; codes follow the sorted order of the labels
for col in ['habitat', 'color', 'Gender']:
    df[col] = le.fit_transform(df[col])

In [13]:
df

Unnamed: 0,avg_length,avg_weight,habitat,ph_of_water,color,Gender,life_span
0,14.69,5.87,2,6.2,6,0,10.9
1,1.32,3.86,0,6.8,1,1,5.2
2,14.23,12.09,1,7.9,6,1,25.3
3,2.54,3.20,3,6.7,7,0,16.4
4,13.10,9.81,1,7.8,3,1,3.2
...,...,...,...,...,...,...,...
1995,16.12,6.46,2,8.0,5,0,6.4
1996,7.50,7.07,3,6.3,0,1,14.0
1997,10.52,3.27,4,6.1,3,1,13.1
1998,7.70,15.41,2,7.8,3,0,15.6
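
The integer codes in the table are easier to read with the label-to-code mapping at hand. The sketch below fits a `LabelEncoder` on the five habitat labels that appear in the earlier output and prints the mapping. The variable names `habitats` and `mapping` are introduced here for illustration only.

```python
from sklearn.preprocessing import LabelEncoder

# Habitat labels as they appear in the table before encoding
habitats = ["ponds", "idlewater", "lakes", "rivers", "slowmovingwaters"]

le = LabelEncoder()
codes = le.fit_transform(habitats)

# classes_ holds the sorted labels; the position of each label is its code
mapping = {label: code for code, label in enumerate(le.classes_)}
print(mapping)
# {'idlewater': 0, 'lakes': 1, 'ponds': 2, 'rivers': 3, 'slowmovingwaters': 4}

# inverse_transform recovers the labels from the codes
print(list(le.inverse_transform([2, 0])))
# ['ponds', 'idlewater']
```

The codes are arbitrary identifiers, not quantities, which is why the formulas later wrap these columns in `C(...)` so that they are treated as categories.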


# Linear regression task 1: Modeling age of fish depending on genetics

- Using linear regression, we will model the age of fish using the size of fish (length and weight) and gender.
- The model that achieves this is formulated as:
        life_span ~ avg_length + avg_weight + C(Gender)


In [15]:
# Declare the model
mod = smf.ols(formula='life_span ~ avg_length + avg_weight + C(Gender)', data=df)
res = mod.fit()
res.summary()

0,1,2,3
Dep. Variable:,life_span,R-squared:,0.001
Model:,OLS,Adj. R-squared:,-0.0
Method:,Least Squares,F-statistic:,0.8119
Date:,"Fri, 05 Jan 2024",Prob (F-statistic):,0.487
Time:,15:17:37,Log-Likelihood:,-6920.1
No. Observations:,2000,AIC:,13850.0
Df Residuals:,1996,BIC:,13870.0
Df Model:,3,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,13.8102,0.539,25.641,0.000,12.754,14.866
C(Gender)[T.1],-0.0097,0.345,-0.028,0.978,-0.686,0.667
avg_length,-0.0009,0.031,-0.030,0.976,-0.062,0.060
avg_weight,0.0551,0.035,1.561,0.119,-0.014,0.124

0,1,2,3
Omnibus:,1217.808,Durbin-Watson:,2.058
Prob(Omnibus):,0.0,Jarque-Bera (JB):,113.171
Skew:,-0.002,Prob(JB):,2.66e-25
Kurtosis:,1.835,Cond. No.,50.7


- Dep. Variable: the dependent variable, life_span (how old the fish is)
- Model / Method: the type of model that was fitted (OLS, fitted by least squares)
- No. Observations: the number of data points (??? fish)
- R-squared: the fraction of the variance in life_span that the model explains
- A list of predictors
- For each predictor: the coefficient, its standard error, the t-statistic, the p-value and the 95% confidence interval
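
The quantities listed above can also be read directly from the fitted results object instead of the printed summary. The sketch below uses synthetic data (not the fish dataset), and the names `toy` and `res_toy` are chosen only for this illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data (not the fish dataset), generated only for illustration
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "x": rng.normal(size=200),
    "g": rng.integers(0, 2, size=200),
})
toy["y"] = 1.0 + 2.0 * toy["x"] + rng.normal(size=200)

res_toy = smf.ols("y ~ x + C(g)", data=toy).fit()

print(res_toy.nobs)        # number of observations
print(res_toy.rsquared)    # fraction of explained variance
print(res_toy.params)      # coefficients, indexed by predictor name
print(res_toy.bse)         # standard errors of the coefficients
print(res_toy.pvalues)     # p-values
print(res_toy.conf_int())  # 95% confidence intervals (alpha=0.05 by default)
```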

### We can now interpret the learned model

$\text{life span} = \text{intercept?} + \text{coeff1?} \cdot \text{avg length} + \text{coeff2?} \cdot \text{avg weight} + \text{coeff3?} \cdot \text{Gender}$

- What is the expected life span if the fish is female?
- How do the weight and length of the fish affect the life span?
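
One way to approach these questions is to plug the coefficients from the summary table into the linear equation. The sketch below does that for a hypothetical fish (length 10, weight 5). The helper `predict_life_span` is introduced here for illustration, and it assumes that the encoded Gender value 0 corresponds to False (female) and 1 to True (male).

```python
# Coefficients copied from the summary table of the first model
coef = {
    "Intercept": 13.8102,
    "C(Gender)[T.1]": -0.0097,
    "avg_length": -0.0009,
    "avg_weight": 0.0551,
}

def predict_life_span(avg_length, avg_weight, gender):
    """Evaluate the fitted linear equation; gender is the encoded value 0 or 1."""
    return (coef["Intercept"]
            + coef["avg_length"] * avg_length
            + coef["avg_weight"] * avg_weight
            + coef["C(Gender)[T.1]"] * gender)

# Hypothetical fish: length 10, weight 5 (values chosen for illustration)
pred_0 = predict_life_span(10, 5, gender=0)
pred_1 = predict_life_span(10, 5, gender=1)
print(round(pred_0, 4), round(pred_1, 4))
```

The two predictions differ only by the Gender coefficient. With p-values of 0.978, 0.976 and 0.119 for Gender, length and weight, none of these effects is statistically distinguishable from zero in this model.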


# Linear regression task 2: Modeling age of fish depending on environment

- Using linear regression, we will model the age of fish using the environment variables habitat and pH of water, and check the interaction between the two.
- The model that achieves this is formulated as:
        life_span ~ C(habitat) * ph_of_water 
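
In the formula syntax, `a * b` is shorthand for `a + b + a:b`, that is, both main effects plus their interaction. The sketch below checks this on synthetic data (not the fish dataset) by fitting both spellings and comparing the coefficients. The names `toy`, `short` and `explicit` are illustrative.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data (not the fish dataset), generated only for illustration
rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "g": rng.integers(0, 3, size=300),
    "x": rng.normal(size=300),
})
toy["y"] = rng.normal(size=300)

# Shorthand and explicit spelling of the same model
short = smf.ols("y ~ C(g) * x", data=toy).fit()
explicit = smf.ols("y ~ C(g) + x + C(g):x", data=toy).fit()

print(list(short.params.index))
same_terms = list(short.params.index) == list(explicit.params.index)
same_values = np.allclose(short.params.values, explicit.params.values)
print(same_terms, same_values)
```

Each `C(g)[T.k]:x` term measures how much the slope of `x` in category `k` differs from the slope in the reference category.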


In [16]:
# Declare the model
mod1 = smf.ols(formula='life_span ~ C(habitat) * ph_of_water', data=df)
res1 = mod1.fit()
res1.summary()

0,1,2,3
Dep. Variable:,life_span,R-squared:,0.004
Model:,OLS,Adj. R-squared:,-0.001
Method:,Least Squares,F-statistic:,0.8726
Date:,"Fri, 05 Jan 2024",Prob (F-statistic):,0.549
Time:,15:19:40,Log-Likelihood:,-6917.4
No. Observations:,2000,AIC:,13850.0
Df Residuals:,1990,BIC:,13910.0
Df Model:,9,,
Covariance Type:,nonrobust,,

0,1,2,3,4,5,6
,coef,std err,t,P>|t|,[0.025,0.975]
Intercept,19.0758,4.712,4.048,0.000,9.835,28.317
C(habitat)[T.1],-1.5789,6.517,-0.242,0.809,-14.359,11.201
C(habitat)[T.2],-2.3685,6.617,-0.358,0.720,-15.346,10.609
C(habitat)[T.3],-0.1589,6.830,-0.023,0.981,-13.554,13.236
C(habitat)[T.4],-1.2135,6.682,-0.182,0.856,-14.317,11.890
ph_of_water,-0.7521,0.672,-1.118,0.264,-2.071,0.567
C(habitat)[T.1]:ph_of_water,0.2847,0.928,0.307,0.759,-1.535,2.105
C(habitat)[T.2]:ph_of_water,0.3994,0.941,0.425,0.671,-1.445,2.244
C(habitat)[T.3]:ph_of_water,0.1605,0.971,0.165,0.869,-1.744,2.065

0,1,2,3
Omnibus:,1099.003,Durbin-Watson:,2.053
Prob(Omnibus):,0.0,Jarque-Bera (JB):,110.454
Skew:,0.001,Prob(JB):,1.04e-24
Kurtosis:,1.849,Cond. No.,512.0


### Interpretation
- Let's discuss the values.
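
As a starting point for the discussion, the pH slope for each habitat can be worked out from the table: the `ph_of_water` coefficient is the slope for the reference habitat (code 0), and each interaction coefficient is added to it for the corresponding habitat. The sketch below only does this arithmetic with the values shown in the table. The interaction row for habitat 4 does not appear there, so it is left out.

```python
# Coefficients copied from the summary table of the second model
ph_slope_reference = -0.7521  # ph_of_water (habitat 0 is the reference level)
interaction = {
    1: 0.2847,  # C(habitat)[T.1]:ph_of_water
    2: 0.3994,  # C(habitat)[T.2]:ph_of_water
    3: 0.1605,  # C(habitat)[T.3]:ph_of_water
}

# Slope of life_span with respect to pH, per habitat code
slopes = {0: ph_slope_reference}
for habitat, delta in interaction.items():
    slopes[habitat] = round(ph_slope_reference + delta, 4)

print(slopes)
```

The interaction p-values in the table (0.759, 0.671, 0.869) are all large, so these slope differences between habitats are not statistically significant.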


# Standardization

In [17]:
# How we standardize the continuous variables (list the column names below)
columns_to_standardize = [

]

for col in columns_to_standardize:
    df[col] = (df[col] - df[col].mean()) / df[col].std()  # standardize column
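
To see what the loop does, here is a minimal sketch on a toy frame built from the first five rows shown earlier. The choice of `avg_length` and `ph_of_water` is an assumption made for illustration, since the list above is left empty.

```python
import pandas as pd

# Toy frame using the first five rows of two continuous columns
toy = pd.DataFrame({
    "avg_length": [14.69, 1.32, 14.23, 2.54, 13.10],
    "ph_of_water": [6.2, 6.8, 7.9, 6.7, 7.8],
})

# Assumed column choice, for illustration only
for col in ["avg_length", "ph_of_water"]:
    toy[col] = (toy[col] - toy[col].mean()) / toy[col].std()  # z-score

# After standardization each column has mean 0 and standard deviation 1
print(toy.mean().round(6))
print(toy.std().round(6))
```

Note that pandas `std()` uses the sample standard deviation (`ddof=1`) by default. After standardization, a coefficient describes the change in the outcome per one standard deviation of the predictor, which makes coefficients easier to compare across predictors.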


In [None]:
# Declare the model
