# Predicting using Regression

Regression Models: Several regression techniques are applied to labelled data, including Ridge regression, LASSO, and Elastic Net. Additionally, Partial Least Squares (PLS) regression is used for spectral analysis to predict concentration values.

### Load the datasets

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
# load the datasets
glucose = pd.read_csv('modified_data/data_glucose2.csv')
lactate = pd.read_csv('modified_data/data_lactate.csv')
ethanol = pd.read_csv('modified_data/data_ethanol.csv')
acetate = pd.read_csv('modified_data/data_acetate.csv')
biomass = pd.read_csv('modified_data/data_biomass.csv')
formate = pd.read_csv('modified_data/data_formate.csv')

---

### For each target variable, let's build regression models.

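Each model is trained on 80% of the samples and evaluated on the held-out 20% (`test_size=0.2`, `random_state=0`). Performance is reported as the coefficient of determination (R2) and the root mean squared error (RMSE). The value printed as "Accuracy" comes from each estimator's `score` method, which for scikit-learn regressors returns R2, so it always matches the R2 line.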
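
The two metrics can be written out directly. The sketch below runs on synthetic stand-in data (an assumption; `X_demo` and `y_demo` are not the notebook's datasets). It computes R2 and RMSE by hand and compares them with scikit-learn's `r2_score`, `mean_squared_error`, and the estimator's `score` method, which for regressors returns R2.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (illustrative only, not the notebook's spectra).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 10))
y_demo = X_demo @ rng.normal(size=10) + rng.normal(scale=0.5, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)
model = Ridge(alpha=0.1).fit(X_tr, y_tr)
y_hat = model.predict(X_te)

# R2 = 1 - (residual sum of squares) / (total sum of squares)
r2_manual = 1 - np.sum((y_te - y_hat) ** 2) / np.sum((y_te - y_te.mean()) ** 2)
# RMSE = square root of the mean squared residual
rmse_manual = np.sqrt(np.mean((y_te - y_hat) ** 2))

r2_sklearn = r2_score(y_te, y_hat)
r2_from_score = model.score(X_te, y_te)
rmse_sklearn = np.sqrt(mean_squared_error(y_te, y_hat))

print('R2  :', r2_manual, r2_sklearn, r2_from_score)
print('RMSE:', rmse_manual, rmse_sklearn)
```

The three R2 values agree, which is why the "Accuracy" lines in the outputs below repeat the R2 lines.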

In [3]:
models = ['Analyte', 'Ridge Regression', 'LASSO', 'Elastic Net', 'PLS Regression']
r2 = []
rmse = []
accuracy = []

#### Glucose

In [4]:
r2glucose = ['Glucose (g/L)']
rmseglucose = ['Glucose (g/L)']
accuracyglucose = ['Glucose (g/L)']

# Ridge Regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

X = glucose.drop(['Glucose (g/L)'], axis=1)
y = glucose['Glucose (g/L)']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

ridge = Ridge(alpha=0.1)
ridge.fit(X_train, y_train)
y_pred = ridge.predict(X_test)

print('Ridge Regression')
print('R2:', r2_score(y_test, y_pred))
print('RMSE:', np.sqrt(mean_squared_error(y_test, y_pred)))
print('Accuracy:', ridge.score(X_test, y_test))

r2glucose.append(r2_score(y_test, y_pred))
rmseglucose.append(np.sqrt(mean_squared_error(y_test, y_pred)))
accuracyglucose.append(ridge.score(X_test, y_test))

# LASSO
from sklearn.linear_model import Lasso

lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)
y_pred = lasso.predict(X_test)

print('LASSO')
print('R2:', r2_score(y_test, y_pred))
print('RMSE:', np.sqrt(mean_squared_error(y_test, y_pred)))
print('Accuracy:', lasso.score(X_test, y_test))

r2glucose.append(r2_score(y_test, y_pred))
rmseglucose.append(np.sqrt(mean_squared_error(y_test, y_pred)))
accuracyglucose.append(lasso.score(X_test, y_test))

# Elastic Net
from sklearn.linear_model import ElasticNet

elastic = ElasticNet(alpha=0.1, l1_ratio=0.5)
elastic.fit(X_train, y_train)
y_pred = elastic.predict(X_test)

print('Elastic Net')
print('R2:', r2_score(y_test, y_pred))
print('RMSE:', np.sqrt(mean_squared_error(y_test, y_pred)))
print('Accuracy:', elastic.score(X_test, y_test))

r2glucose.append(r2_score(y_test, y_pred))
rmseglucose.append(np.sqrt(mean_squared_error(y_test, y_pred)))
accuracyglucose.append(elastic.score(X_test, y_test))

# Partial Least Squares (PLS) Regression
from sklearn.cross_decomposition import PLSRegression

pls = PLSRegression(n_components=2)
pls.fit(X_train, y_train)
y_pred = pls.predict(X_test)

print('PLS Regression')
print('R2:', r2_score(y_test, y_pred))
print('RMSE:', np.sqrt(mean_squared_error(y_test, y_pred)))
print('Accuracy:', pls.score(X_test, y_test))

r2glucose.append(r2_score(y_test, y_pred))
rmseglucose.append(np.sqrt(mean_squared_error(y_test, y_pred)))
accuracyglucose.append(pls.score(X_test, y_test))

r2.append(r2glucose)
rmse.append(rmseglucose)
accuracy.append(accuracyglucose)

Ridge Regression
R2: 0.07926067857458874
RMSE: 17.54091150067323
Accuracy: 0.07926067857458874
LASSO
R2: 0.06759059591468464
RMSE: 17.651724258066437
Accuracy: 0.06759059591468464
Elastic Net
R2: 0.019055770890092805
RMSE: 18.105310213884945
Accuracy: 0.019055770890092805
PLS Regression
R2: 0.07927708952759727
RMSE: 17.54075517827536
Accuracy: 0.07927708952759727


#### Interpretation

The ***PLS regression*** model has the highest R2 (0.0793) and the lowest RMSE (17.54), but its margin over the ***Ridge regression model*** is negligible: the two R2 values differ only in the fifth decimal place. Both are slightly ahead of the LASSO and Elastic Net models. In absolute terms, all four models predict glucose poorly on this split, since none explains more than about 8% of the variance in the test set.

#### Lactate

In [5]:
r2lactate = ['Lactate (g/L)']
rmselactate = ['Lactate (g/L)']
accuracylactate = ['Lactate (g/L)']

# Ridge Regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

X = lactate.drop(['Lactate (g/L)'], axis=1)
y = lactate['Lactate (g/L)']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

ridge = Ridge(alpha=0.1)
ridge.fit(X_train, y_train)
y_pred = ridge.predict(X_test)

print('Ridge Regression')
print('R2:', r2_score(y_test, y_pred))
print('RMSE:', np.sqrt(mean_squared_error(y_test, y_pred)))
print('Accuracy:', ridge.score(X_test, y_test))

r2lactate.append(r2_score(y_test, y_pred))
rmselactate.append(np.sqrt(mean_squared_error(y_test, y_pred)))
accuracylactate.append(ridge.score(X_test, y_test))

# LASSO
from sklearn.linear_model import Lasso

lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)
y_pred = lasso.predict(X_test)

print('LASSO')
print('R2:', r2_score(y_test, y_pred))
print('RMSE:', np.sqrt(mean_squared_error(y_test, y_pred)))
print('Accuracy:', lasso.score(X_test, y_test))

r2lactate.append(r2_score(y_test, y_pred))
rmselactate.append(np.sqrt(mean_squared_error(y_test, y_pred)))
accuracylactate.append(lasso.score(X_test, y_test))

# Elastic Net
from sklearn.linear_model import ElasticNet

elastic = ElasticNet(alpha=0.1, l1_ratio=0.5)
elastic.fit(X_train, y_train)
y_pred = elastic.predict(X_test)

print('Elastic Net')
print('R2:', r2_score(y_test, y_pred))
print('RMSE:', np.sqrt(mean_squared_error(y_test, y_pred)))
print('Accuracy:', elastic.score(X_test, y_test))

r2lactate.append(r2_score(y_test, y_pred))
rmselactate.append(np.sqrt(mean_squared_error(y_test, y_pred)))
accuracylactate.append(elastic.score(X_test, y_test))

# Partial Least Squares (PLS) Regression
from sklearn.cross_decomposition import PLSRegression

pls = PLSRegression(n_components=2)
pls.fit(X_train, y_train)
y_pred = pls.predict(X_test)

print('PLS Regression')
print('R2:', r2_score(y_test, y_pred))
print('RMSE:', np.sqrt(mean_squared_error(y_test, y_pred)))
print('Accuracy:', pls.score(X_test, y_test))

r2lactate.append(r2_score(y_test, y_pred))
rmselactate.append(np.sqrt(mean_squared_error(y_test, y_pred)))
accuracylactate.append(pls.score(X_test, y_test))

r2.append(r2lactate)
rmse.append(rmselactate)
accuracy.append(accuracylactate)

Ridge Regression
R2: 0.18547020908362033
RMSE: 10.34492635491047
Accuracy: 0.18547020908362033
LASSO
R2: 0.1492648925053225
RMSE: 10.572339345724044
Accuracy: 0.1492648925053225
Elastic Net
R2: 0.1521077809727589
RMSE: 10.55465985137868
Accuracy: 0.1521077809727589
PLS Regression
R2: 0.16855291231324643
RMSE: 10.451803232323138
Accuracy: 0.16855291231324643


#### Ethanol

In [6]:
r2ethanol = ['Ethanol (g/L)']
rmseethanol = ['Ethanol (g/L)']
accuracyethanol = ['Ethanol (g/L)']

# Ridge Regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

X = ethanol.drop(['Ethanol (g/L)'], axis=1)
y = ethanol['Ethanol (g/L)']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

ridge = Ridge(alpha=0.1)
ridge.fit(X_train, y_train)
y_pred = ridge.predict(X_test)

print('Ridge Regression')
print('R2:', r2_score(y_test, y_pred))
print('RMSE:', np.sqrt(mean_squared_error(y_test, y_pred)))
print('Accuracy:', ridge.score(X_test, y_test))

r2ethanol.append(r2_score(y_test, y_pred))
rmseethanol.append(np.sqrt(mean_squared_error(y_test, y_pred)))
accuracyethanol.append(ridge.score(X_test, y_test))

# LASSO
from sklearn.linear_model import Lasso

lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)
y_pred = lasso.predict(X_test)

print('LASSO')
print('R2:', r2_score(y_test, y_pred))
print('RMSE:', np.sqrt(mean_squared_error(y_test, y_pred)))
print('Accuracy:', lasso.score(X_test, y_test))

r2ethanol.append(r2_score(y_test, y_pred))
rmseethanol.append(np.sqrt(mean_squared_error(y_test, y_pred)))
accuracyethanol.append(lasso.score(X_test, y_test))

# Elastic Net
from sklearn.linear_model import ElasticNet

elastic = ElasticNet(alpha=0.1, l1_ratio=0.5)
elastic.fit(X_train, y_train)
y_pred = elastic.predict(X_test)

print('Elastic Net')
print('R2:', r2_score(y_test, y_pred))
print('RMSE:', np.sqrt(mean_squared_error(y_test, y_pred)))
print('Accuracy:', elastic.score(X_test, y_test))

r2ethanol.append(r2_score(y_test, y_pred))
rmseethanol.append(np.sqrt(mean_squared_error(y_test, y_pred)))
accuracyethanol.append(elastic.score(X_test, y_test))

# Partial Least Squares (PLS) Regression
from sklearn.cross_decomposition import PLSRegression

pls = PLSRegression(n_components=2)
pls.fit(X_train, y_train)
y_pred = pls.predict(X_test)

print('PLS Regression')
print('R2:', r2_score(y_test, y_pred))
print('RMSE:', np.sqrt(mean_squared_error(y_test, y_pred)))
print('Accuracy:', pls.score(X_test, y_test))

r2ethanol.append(r2_score(y_test, y_pred))
rmseethanol.append(np.sqrt(mean_squared_error(y_test, y_pred)))
accuracyethanol.append(pls.score(X_test, y_test))

r2.append(r2ethanol)
rmse.append(rmseethanol)
accuracy.append(accuracyethanol)

Ridge Regression
R2: 0.13727957173884253
RMSE: 0.7355352089335115
Accuracy: 0.13727957173884253
LASSO
R2: -0.0009493392149675373
RMSE: 0.7922723040325659
Accuracy: -0.0009493392149675373
Elastic Net
R2: -0.0009493392149675373
RMSE: 0.7922723040325659
Accuracy: -0.0009493392149675373
PLS Regression
R2: 0.08188774846935143
RMSE: 0.7587807720279386
Accuracy: 0.08188774846935143
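
The identical, slightly negative R2 values for LASSO and Elastic Net above are what one would expect if the L1 penalty at `alpha=0.1` has shrunk every coefficient to zero, leaving a model that predicts the training mean. The notebook does not verify this, so treat it as a hypothesis. A minimal sketch of how to check, using synthetic stand-in data (an assumption; `X_demo` and `y_demo` are not the ethanol data):

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data with a signal that is weak relative to the penalty.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 5))
y_demo = 0.02 * X_demo[:, 0] + rng.normal(scale=0.01, size=100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

lasso_demo = Lasso(alpha=0.1).fit(X_tr, y_tr)
elastic_demo = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X_tr, y_tr)

# Count surviving coefficients; zero means the model is intercept-only.
n_lasso = np.count_nonzero(lasso_demo.coef_)
n_elastic = np.count_nonzero(elastic_demo.coef_)

# An intercept-only model predicts the training mean for every sample,
# so both models give the same test-set R2, and it cannot exceed zero.
r2_lasso = r2_score(y_te, lasso_demo.predict(X_te))
r2_elastic = r2_score(y_te, elastic_demo.predict(X_te))

print('non-zero coefficients:', n_lasso, n_elastic)
print('R2:', r2_lasso, r2_elastic)
```

On the notebook's data, the equivalent check is `np.count_nonzero(lasso.coef_)` right after the fit. If it returns 0, a smaller `alpha` or standardized features may be worth trying.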


#### Acetate

In [7]:
r2acetate = ['Acetate (g/L)']
rmseacetate = ['Acetate (g/L)']
accuracyacetate = ['Acetate (g/L)']

# Ridge Regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

X = acetate.drop(['Acetate (g/L)'], axis=1)
y = acetate['Acetate (g/L)']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

ridge = Ridge(alpha=0.1)
ridge.fit(X_train, y_train)
y_pred = ridge.predict(X_test)

print('Ridge Regression')
print('R2:', r2_score(y_test, y_pred))
print('RMSE:', np.sqrt(mean_squared_error(y_test, y_pred)))
print('Accuracy:', ridge.score(X_test, y_test))

r2acetate.append(r2_score(y_test, y_pred))
rmseacetate.append(np.sqrt(mean_squared_error(y_test, y_pred)))
accuracyacetate.append(ridge.score(X_test, y_test))

# LASSO
from sklearn.linear_model import Lasso

lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)
y_pred = lasso.predict(X_test)

print('LASSO')
print('R2:', r2_score(y_test, y_pred))
print('RMSE:', np.sqrt(mean_squared_error(y_test, y_pred)))
print('Accuracy:', lasso.score(X_test, y_test))

r2acetate.append(r2_score(y_test, y_pred))
rmseacetate.append(np.sqrt(mean_squared_error(y_test, y_pred)))
accuracyacetate.append(lasso.score(X_test, y_test))

# Elastic Net
from sklearn.linear_model import ElasticNet

elastic = ElasticNet(alpha=0.1, l1_ratio=0.5)
elastic.fit(X_train, y_train)
y_pred = elastic.predict(X_test)

print('Elastic Net')
print('R2:', r2_score(y_test, y_pred))
print('RMSE:', np.sqrt(mean_squared_error(y_test, y_pred)))
print('Accuracy:', elastic.score(X_test, y_test))

r2acetate.append(r2_score(y_test, y_pred))
rmseacetate.append(np.sqrt(mean_squared_error(y_test, y_pred)))
accuracyacetate.append(elastic.score(X_test, y_test))

# Partial Least Squares (PLS) Regression
from sklearn.cross_decomposition import PLSRegression

pls = PLSRegression(n_components=2)
pls.fit(X_train, y_train)
y_pred = pls.predict(X_test)

print('PLS Regression')
print('R2:', r2_score(y_test, y_pred))
print('RMSE:', np.sqrt(mean_squared_error(y_test, y_pred)))
print('Accuracy:', pls.score(X_test, y_test))

r2acetate.append(r2_score(y_test, y_pred))
rmseacetate.append(np.sqrt(mean_squared_error(y_test, y_pred)))
accuracyacetate.append(pls.score(X_test, y_test))

r2.append(r2acetate)
rmse.append(rmseacetate)
accuracy.append(accuracyacetate)

Ridge Regression
R2: 0.2078618194270745
RMSE: 0.5727832095148687
Accuracy: 0.2078618194270745
LASSO
R2: -0.004248611417303483
RMSE: 0.6449268012856316
Accuracy: -0.004248611417303483
Elastic Net
R2: -0.004248611417303483
RMSE: 0.6449268012856316
Accuracy: -0.004248611417303483
PLS Regression
R2: 0.11622982944774107
RMSE: 0.605005711871769
Accuracy: 0.11622982944774107


#### Biomass

In [8]:
r2biomass = ['Biomass (g/L)']
rmsebiomass = ['Biomass (g/L)']
accuracybiomass = ['Biomass (g/L)']

# Ridge Regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

X = biomass.drop(['Biomass (g/L)'], axis=1)
y = biomass['Biomass (g/L)']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

ridge = Ridge(alpha=0.1)
ridge.fit(X_train, y_train)
y_pred = ridge.predict(X_test)

print('Ridge Regression')
print('R2:', r2_score(y_test, y_pred))
print('RMSE:', np.sqrt(mean_squared_error(y_test, y_pred)))
print('Accuracy:', ridge.score(X_test, y_test))

r2biomass.append(r2_score(y_test, y_pred))
rmsebiomass.append(np.sqrt(mean_squared_error(y_test, y_pred)))
accuracybiomass.append(ridge.score(X_test, y_test))

# LASSO
from sklearn.linear_model import Lasso

lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)
y_pred = lasso.predict(X_test)

print('LASSO')
print('R2:', r2_score(y_test, y_pred))
print('RMSE:', np.sqrt(mean_squared_error(y_test, y_pred)))
print('Accuracy:', lasso.score(X_test, y_test))

r2biomass.append(r2_score(y_test, y_pred))
rmsebiomass.append(np.sqrt(mean_squared_error(y_test, y_pred)))
accuracybiomass.append(lasso.score(X_test, y_test))

# Elastic Net
from sklearn.linear_model import ElasticNet

elastic = ElasticNet(alpha=0.1, l1_ratio=0.5)
elastic.fit(X_train, y_train)
y_pred = elastic.predict(X_test)

print('Elastic Net')
print('R2:', r2_score(y_test, y_pred))
print('RMSE:', np.sqrt(mean_squared_error(y_test, y_pred)))
print('Accuracy:', elastic.score(X_test, y_test))

r2biomass.append(r2_score(y_test, y_pred))
rmsebiomass.append(np.sqrt(mean_squared_error(y_test, y_pred)))
accuracybiomass.append(elastic.score(X_test, y_test))

# Partial Least Squares (PLS) Regression
from sklearn.cross_decomposition import PLSRegression

pls = PLSRegression(n_components=2)
pls.fit(X_train, y_train)
y_pred = pls.predict(X_test)

print('PLS Regression')
print('R2:', r2_score(y_test, y_pred))
print('RMSE:', np.sqrt(mean_squared_error(y_test, y_pred)))
print('Accuracy:', pls.score(X_test, y_test))

r2biomass.append(r2_score(y_test, y_pred))
rmsebiomass.append(np.sqrt(mean_squared_error(y_test, y_pred)))
accuracybiomass.append(pls.score(X_test, y_test))

r2.append(r2biomass)
rmse.append(rmsebiomass)
accuracy.append(accuracybiomass)

Ridge Regression
R2: 0.9712892052978089
RMSE: 0.1966631200962284
Accuracy: 0.9712892052978089
LASSO
R2: -0.08521430012688191
RMSE: 1.2090884489135438
Accuracy: -0.08521430012688191
Elastic Net
R2: 0.24301535550022024
RMSE: 1.0098197331735266
Accuracy: 0.24301535550022024
PLS Regression
R2: 0.9712585212381466
RMSE: 0.19676818182550534
Accuracy: 0.9712585212381466


#### Formate

In [9]:
r2formate = ['Formate (g/L)']
rmseformate = ['Formate (g/L)']
accuracyformate = ['Formate (g/L)']

# Ridge Regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

X = formate.drop(['Formate (g/L)'], axis=1)
y = formate['Formate (g/L)']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

ridge = Ridge(alpha=0.1)
ridge.fit(X_train, y_train)
y_pred = ridge.predict(X_test)

print('Ridge Regression')
print('R2:', r2_score(y_test, y_pred))
print('RMSE:', np.sqrt(mean_squared_error(y_test, y_pred)))
print('Accuracy:', ridge.score(X_test, y_test))

r2formate.append(r2_score(y_test, y_pred))
rmseformate.append(np.sqrt(mean_squared_error(y_test, y_pred)))
accuracyformate.append(ridge.score(X_test, y_test))

# LASSO
from sklearn.linear_model import Lasso

lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)
y_pred = lasso.predict(X_test)

print('LASSO')
print('R2:', r2_score(y_test, y_pred))
print('RMSE:', np.sqrt(mean_squared_error(y_test, y_pred)))
print('Accuracy:', lasso.score(X_test, y_test))

r2formate.append(r2_score(y_test, y_pred))
rmseformate.append(np.sqrt(mean_squared_error(y_test, y_pred)))
accuracyformate.append(lasso.score(X_test, y_test))

# Elastic Net
from sklearn.linear_model import ElasticNet

elastic = ElasticNet(alpha=0.1, l1_ratio=0.5)
elastic.fit(X_train, y_train)
y_pred = elastic.predict(X_test)

print('Elastic Net')
print('R2:', r2_score(y_test, y_pred))
print('RMSE:', np.sqrt(mean_squared_error(y_test, y_pred)))
print('Accuracy:', elastic.score(X_test, y_test))

r2formate.append(r2_score(y_test, y_pred))
rmseformate.append(np.sqrt(mean_squared_error(y_test, y_pred)))
accuracyformate.append(elastic.score(X_test, y_test))

# Partial Least Squares (PLS) Regression
from sklearn.cross_decomposition import PLSRegression

pls = PLSRegression(n_components=2)
pls.fit(X_train, y_train)
y_pred = pls.predict(X_test)

print('PLS Regression')
print('R2:', r2_score(y_test, y_pred))
print('RMSE:', np.sqrt(mean_squared_error(y_test, y_pred)))
print('Accuracy:', pls.score(X_test, y_test))

r2formate.append(r2_score(y_test, y_pred))
rmseformate.append(np.sqrt(mean_squared_error(y_test, y_pred)))
accuracyformate.append(pls.score(X_test, y_test))

r2.append(r2formate)
rmse.append(rmseformate)
accuracy.append(accuracyformate)

Ridge Regression
R2: 0.32446791565322575
RMSE: 0.7387522723209791
Accuracy: 0.32446791565322575
LASSO
R2: -0.004535021254018501
RMSE: 0.9008622334747978
Accuracy: -0.004535021254018501
Elastic Net
R2: -0.004535021254018501
RMSE: 0.9008622334747978
Accuracy: -0.004535021254018501
PLS Regression
R2: 0.2890289811716852
RMSE: 0.7578823418825718
Accuracy: 0.2890289811716852


---

## Let's look at the R2, RMSE and accuracy values for each model and each target variable

In [10]:
# build a DataFrame of R2 values: one row per analyte, one column per model
r2 = pd.DataFrame(r2, columns=models)
r2

| | Analyte | Ridge Regression | LASSO | Elastic Net | PLS Regression |
|---|---|---|---|---|---|
| 0 | Glucose (g/L) | 0.079261 | 0.067591 | 0.019056 | 0.079277 |
| 1 | Lactate (g/L) | 0.18547 | 0.149265 | 0.152108 | 0.168553 |
| 2 | Ethanol (g/L) | 0.13728 | -0.000949 | -0.000949 | 0.081888 |
| 3 | Acetate (g/L) | 0.207862 | -0.004249 | -0.004249 | 0.11623 |
| 4 | Biomass (g/L) | 0.971289 | -0.085214 | 0.243015 | 0.971259 |
| 5 | Formate (g/L) | 0.324468 | -0.004535 | -0.004535 | 0.289029 |


In [11]:
rmse = pd.DataFrame(rmse, columns=models)
rmse

| | Analyte | Ridge Regression | LASSO | Elastic Net | PLS Regression |
|---|---|---|---|---|---|
| 0 | Glucose (g/L) | 17.540912 | 17.651724 | 18.10531 | 17.540755 |
| 1 | Lactate (g/L) | 10.344926 | 10.572339 | 10.55466 | 10.451803 |
| 2 | Ethanol (g/L) | 0.735535 | 0.792272 | 0.792272 | 0.758781 |
| 3 | Acetate (g/L) | 0.572783 | 0.644927 | 0.644927 | 0.605006 |
| 4 | Biomass (g/L) | 0.196663 | 1.209088 | 1.00982 | 0.196768 |
| 5 | Formate (g/L) | 0.738752 | 0.900862 | 0.900862 | 0.757882 |


In [12]:
accuracy = pd.DataFrame(accuracy, columns=models)
accuracy

| | Analyte | Ridge Regression | LASSO | Elastic Net | PLS Regression |
|---|---|---|---|---|---|
| 0 | Glucose (g/L) | 0.079261 | 0.067591 | 0.019056 | 0.079277 |
| 1 | Lactate (g/L) | 0.18547 | 0.149265 | 0.152108 | 0.168553 |
| 2 | Ethanol (g/L) | 0.13728 | -0.000949 | -0.000949 | 0.081888 |
| 3 | Acetate (g/L) | 0.207862 | -0.004249 | -0.004249 | 0.11623 |
| 4 | Biomass (g/L) | 0.971289 | -0.085214 | 0.243015 | 0.971259 |
| 5 | Formate (g/L) | 0.324468 | -0.004535 | -0.004535 | 0.289029 |


## Interpretation

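Overall, ***Ridge Regression*** is the best-performing model on these datasets: it has the highest R2 and the lowest RMSE for five of the six analytes. ***PLS Regression*** is ahead only for glucose, and only by a negligible margin. Biomass is the only analyte predicted well (R2 ≈ 0.97 for both Ridge and PLS); for every other analyte the best R2 is below 0.33.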
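
One way to check the overall ranking against the R2 table is to ask pandas for the best-scoring column in each row. The sketch below copies the R2 values from the table above into a fresh DataFrame (the names `r2_table` and `best_model` are illustrative) and uses `DataFrame.idxmax` to pick the winning model per analyte.

```python
import pandas as pd

# R2 values copied from the table above, one row per analyte.
r2_table = pd.DataFrame(
    {
        'Ridge Regression': [0.079261, 0.18547, 0.13728, 0.207862, 0.971289, 0.324468],
        'LASSO': [0.067591, 0.149265, -0.000949, -0.004249, -0.085214, -0.004535],
        'Elastic Net': [0.019056, 0.152108, -0.000949, -0.004249, 0.243015, -0.004535],
        'PLS Regression': [0.079277, 0.168553, 0.081888, 0.11623, 0.971259, 0.289029],
    },
    index=[
        'Glucose (g/L)', 'Lactate (g/L)', 'Ethanol (g/L)',
        'Acetate (g/L)', 'Biomass (g/L)', 'Formate (g/L)',
    ],
)

# Column label of the highest R2 in each row.
best_model = r2_table.idxmax(axis=1)
print(best_model)

# How many analytes each model wins.
print(best_model.value_counts())
```

In the notebook itself, the same result should follow from `r2.set_index('Analyte').idxmax(axis=1)` on the `r2` DataFrame built in the cells above.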

---

# Using complete dataset

The same models are rebuilt below using the complete dataset (i.e., all the features).

### Load the data

In [13]:
data = pd.read_csv('data/data.csv')
target = pd.read_csv('data/target.csv')

### Build the regression models for each of the 6 target variables

In [14]:
models = ['Analyte', 'Ridge Regression', 'LASSO', 'Elastic Net', 'PLS Regression']
r2_complete = []
rmse_complete = []
accuracy_complete = []

In [15]:
X = data

#### Glucose

In [16]:
r2glucose = ['Glucose (g/L)']
rmseglucose = ['Glucose (g/L)']
accuracyglucose = ['Glucose (g/L)']

# Ridge Regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

y = target['Glucose (g/L)']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

ridge = Ridge(alpha=0.1)
ridge.fit(X_train, y_train)
y_pred = ridge.predict(X_test)

print('Ridge Regression')
print('R2:', r2_score(y_test, y_pred))
print('RMSE:', np.sqrt(mean_squared_error(y_test, y_pred)))
print('Accuracy:', ridge.score(X_test, y_test))

r2glucose.append(r2_score(y_test, y_pred))
rmseglucose.append(np.sqrt(mean_squared_error(y_test, y_pred)))
accuracyglucose.append(ridge.score(X_test, y_test))

# LASSO
from sklearn.linear_model import Lasso

lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)
y_pred = lasso.predict(X_test)

print('LASSO')
print('R2:', r2_score(y_test, y_pred))
print('RMSE:', np.sqrt(mean_squared_error(y_test, y_pred)))
print('Accuracy:', lasso.score(X_test, y_test))

r2glucose.append(r2_score(y_test, y_pred))
rmseglucose.append(np.sqrt(mean_squared_error(y_test, y_pred)))
accuracyglucose.append(lasso.score(X_test, y_test))

# Elastic Net
from sklearn.linear_model import ElasticNet

elastic = ElasticNet(alpha=0.1, l1_ratio=0.5)
elastic.fit(X_train, y_train)
y_pred = elastic.predict(X_test)

print('Elastic Net')
print('R2:', r2_score(y_test, y_pred))
print('RMSE:', np.sqrt(mean_squared_error(y_test, y_pred)))
print('Accuracy:', elastic.score(X_test, y_test))

r2glucose.append(r2_score(y_test, y_pred))
rmseglucose.append(np.sqrt(mean_squared_error(y_test, y_pred)))
accuracyglucose.append(elastic.score(X_test, y_test))

# Partial Least Squares (PLS) Regression
from sklearn.cross_decomposition import PLSRegression

pls = PLSRegression(n_components=2)
pls.fit(X_train, y_train)
y_pred = pls.predict(X_test)

print('PLS Regression')
print('R2:', r2_score(y_test, y_pred))
print('RMSE:', np.sqrt(mean_squared_error(y_test, y_pred)))
print('Accuracy:', pls.score(X_test, y_test))

r2glucose.append(r2_score(y_test, y_pred))
rmseglucose.append(np.sqrt(mean_squared_error(y_test, y_pred)))
accuracyglucose.append(pls.score(X_test, y_test))

r2_complete.append(r2glucose)
rmse_complete.append(rmseglucose)
accuracy_complete.append(accuracyglucose)

Ridge Regression
R2: 0.7189513662663378
RMSE: 9.691132943029633
Accuracy: 0.7189513662663378


LASSO
R2: 0.1431864815237457
RMSE: 16.92103668702178
Accuracy: 0.1431864815237457
Elastic Net
R2: 0.15021606705251667
RMSE: 16.851480797613636
Accuracy: 0.15021606705251667
PLS Regression
R2: 0.07386258844675231
RMSE: 17.592255589883425
Accuracy: 0.07386258844675231


#### Lactate

In [17]:
r2lactate = ['Lactate (g/L)']
rmselactate = ['Lactate (g/L)']
accuracylactate = ['Lactate (g/L)']

# Ridge Regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

y = target['Lactate (g/L)']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

ridge = Ridge(alpha=0.1)
ridge.fit(X_train, y_train)
y_pred = ridge.predict(X_test)

print('Ridge Regression')
print('R2:', r2_score(y_test, y_pred))
print('RMSE:', np.sqrt(mean_squared_error(y_test, y_pred)))
print('Accuracy:', ridge.score(X_test, y_test))

r2lactate.append(r2_score(y_test, y_pred))
rmselactate.append(np.sqrt(mean_squared_error(y_test, y_pred)))
accuracylactate.append(ridge.score(X_test, y_test))

# LASSO
from sklearn.linear_model import Lasso

lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)
y_pred = lasso.predict(X_test)

print('LASSO')
print('R2:', r2_score(y_test, y_pred))
print('RMSE:', np.sqrt(mean_squared_error(y_test, y_pred)))
print('Accuracy:', lasso.score(X_test, y_test))

r2lactate.append(r2_score(y_test, y_pred))
rmselactate.append(np.sqrt(mean_squared_error(y_test, y_pred)))
accuracylactate.append(lasso.score(X_test, y_test))

# Elastic Net
from sklearn.linear_model import ElasticNet

elastic = ElasticNet(alpha=0.1, l1_ratio=0.5)
elastic.fit(X_train, y_train)
y_pred = elastic.predict(X_test)

print('Elastic Net')
print('R2:', r2_score(y_test, y_pred))
print('RMSE:', np.sqrt(mean_squared_error(y_test, y_pred)))
print('Accuracy:', elastic.score(X_test, y_test))

r2lactate.append(r2_score(y_test, y_pred))
rmselactate.append(np.sqrt(mean_squared_error(y_test, y_pred)))
accuracylactate.append(elastic.score(X_test, y_test))

# Partial Least Squares (PLS) Regression
from sklearn.cross_decomposition import PLSRegression

pls = PLSRegression(n_components=2)
pls.fit(X_train, y_train)
y_pred = pls.predict(X_test)

print('PLS Regression')
print('R2:', r2_score(y_test, y_pred))
print('RMSE:', np.sqrt(mean_squared_error(y_test, y_pred)))
print('Accuracy:', pls.score(X_test, y_test))

r2lactate.append(r2_score(y_test, y_pred))
rmselactate.append(np.sqrt(mean_squared_error(y_test, y_pred)))
accuracylactate.append(pls.score(X_test, y_test))

r2_complete.append(r2lactate)
rmse_complete.append(rmselactate)
accuracy_complete.append(accuracylactate)

Ridge Regression
R2: 0.48886553812528444
RMSE: 8.194859241551061
Accuracy: 0.48886553812528444
LASSO
R2: 0.20828621377909962
RMSE: 10.199010068338598
Accuracy: 0.20828621377909962
Elastic Net
R2: 0.21556252462501224
RMSE: 10.152034464330024
Accuracy: 0.21556252462501224
PLS Regression
R2: 0.17468222006057876
RMSE: 10.413207378656207
Accuracy: 0.17468222006057876


#### Ethanol

In [18]:
r2ethanol = ['Ethanol (g/L)']
rmseethanol = ['Ethanol (g/L)']
accuracyethanol = ['Ethanol (g/L)']

# Ridge Regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

y = target['Ethanol (g/L)']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

ridge = Ridge(alpha=0.1)
ridge.fit(X_train, y_train)
y_pred = ridge.predict(X_test)

print('Ridge Regression')
print('R2:', r2_score(y_test, y_pred))
print('RMSE:', np.sqrt(mean_squared_error(y_test, y_pred)))
print('Accuracy:', ridge.score(X_test, y_test))

r2ethanol.append(r2_score(y_test, y_pred))
rmseethanol.append(np.sqrt(mean_squared_error(y_test, y_pred)))
accuracyethanol.append(ridge.score(X_test, y_test))

# LASSO
from sklearn.linear_model import Lasso

lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)
y_pred = lasso.predict(X_test)

print('LASSO')
print('R2:', r2_score(y_test, y_pred))
print('RMSE:', np.sqrt(mean_squared_error(y_test, y_pred)))
print('Accuracy:', lasso.score(X_test, y_test))

r2ethanol.append(r2_score(y_test, y_pred))
rmseethanol.append(np.sqrt(mean_squared_error(y_test, y_pred)))
accuracyethanol.append(lasso.score(X_test, y_test))

# Elastic Net
from sklearn.linear_model import ElasticNet

elastic = ElasticNet(alpha=0.1, l1_ratio=0.5)
elastic.fit(X_train, y_train)
y_pred = elastic.predict(X_test)

print('Elastic Net')
print('R2:', r2_score(y_test, y_pred))
print('RMSE:', np.sqrt(mean_squared_error(y_test, y_pred)))
print('Accuracy:', elastic.score(X_test, y_test))

r2ethanol.append(r2_score(y_test, y_pred))
rmseethanol.append(np.sqrt(mean_squared_error(y_test, y_pred)))
accuracyethanol.append(elastic.score(X_test, y_test))

# Partial Least Squares (PLS) Regression
from sklearn.cross_decomposition import PLSRegression

pls = PLSRegression(n_components=2)
pls.fit(X_train, y_train)
y_pred = pls.predict(X_test)

print('PLS Regression')
print('R2:', r2_score(y_test, y_pred))
print('RMSE:', np.sqrt(mean_squared_error(y_test, y_pred)))
print('Accuracy:', pls.score(X_test, y_test))

r2ethanol.append(r2_score(y_test, y_pred))
rmseethanol.append(np.sqrt(mean_squared_error(y_test, y_pred)))
accuracyethanol.append(pls.score(X_test, y_test))

r2_complete.append(r2ethanol)
rmse_complete.append(rmseethanol)
accuracy_complete.append(accuracyethanol)

Ridge Regression
R2: 0.15689966779360698
RMSE: 0.7271232907229364
Accuracy: 0.15689966779360698
LASSO
R2: -0.0009493392149675373
RMSE: 0.7922723040325659
Accuracy: -0.0009493392149675373
Elastic Net
R2: -0.0009493392149675373
RMSE: 0.7922723040325659
Accuracy: -0.0009493392149675373
PLS Regression
R2: 0.16178526383915837
RMSE: 0.7250134635127221
Accuracy: 0.16178526383915837


#### Acetate

In [19]:
r2acetate = ['Acetate (g/L)']
rmseacetate = ['Acetate (g/L)']
accuracyacetate = ['Acetate (g/L)']

# Ridge Regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

y = target['Acetate (g/L)']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

ridge = Ridge(alpha=0.1)
ridge.fit(X_train, y_train)
y_pred = ridge.predict(X_test)

print('Ridge Regression')
print('R2:', r2_score(y_test, y_pred))
print('RMSE:', np.sqrt(mean_squared_error(y_test, y_pred)))
print('Accuracy:', ridge.score(X_test, y_test))

r2acetate.append(r2_score(y_test, y_pred))
rmseacetate.append(np.sqrt(mean_squared_error(y_test, y_pred)))
accuracyacetate.append(ridge.score(X_test, y_test))

# LASSO
from sklearn.linear_model import Lasso

lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)
y_pred = lasso.predict(X_test)

print('LASSO')
print('R2:', r2_score(y_test, y_pred))
print('RMSE:', np.sqrt(mean_squared_error(y_test, y_pred)))
print('Accuracy:', lasso.score(X_test, y_test))

r2acetate.append(r2_score(y_test, y_pred))
rmseacetate.append(np.sqrt(mean_squared_error(y_test, y_pred)))
accuracyacetate.append(lasso.score(X_test, y_test))

# Elastic Net
from sklearn.linear_model import ElasticNet

elastic = ElasticNet(alpha=0.1, l1_ratio=0.5)
elastic.fit(X_train, y_train)
y_pred = elastic.predict(X_test)

print('Elastic Net')
print('R2:', r2_score(y_test, y_pred))
print('RMSE:', np.sqrt(mean_squared_error(y_test, y_pred)))
print('Accuracy:', elastic.score(X_test, y_test))

r2acetate.append(r2_score(y_test, y_pred))
rmseacetate.append(np.sqrt(mean_squared_error(y_test, y_pred)))
accuracyacetate.append(elastic.score(X_test, y_test))

# Partial Least Squares (PLS) Regression
from sklearn.cross_decomposition import PLSRegression

pls = PLSRegression(n_components=2)
pls.fit(X_train, y_train)
y_pred = pls.predict(X_test)

print('PLS Regression')
print('R2:', r2_score(y_test, y_pred))
print('RMSE:', np.sqrt(mean_squared_error(y_test, y_pred)))
print('Accuracy:', pls.score(X_test, y_test))

r2acetate.append(r2_score(y_test, y_pred))
rmseacetate.append(np.sqrt(mean_squared_error(y_test, y_pred)))
accuracyacetate.append(pls.score(X_test, y_test))

r2_complete.append(r2acetate)
rmse_complete.append(rmseacetate)
accuracy_complete.append(accuracyacetate)

Ridge Regression
R2: 0.22782088802725875
RMSE: 0.565521121749045
Accuracy: 0.22782088802725875
LASSO
R2: -0.004248611417303483
RMSE: 0.6449268012856316
Accuracy: -0.004248611417303483
Elastic Net
R2: -0.004248611417303483
RMSE: 0.6449268012856316
Accuracy: -0.004248611417303483
PLS Regression
R2: 0.21529014007373437
RMSE: 0.5700912301104003
Accuracy: 0.21529014007373437


#### Biomass

In [20]:
r2biomass = ['Biomass (g/L)']
rmsebiomass = ['Biomass (g/L)']
accuracybiomass = ['Biomass (g/L)']

# Ridge Regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

y = target['Biomass (g/L)']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

ridge = Ridge(alpha=0.1)
ridge.fit(X_train, y_train)
y_pred = ridge.predict(X_test)

print('Ridge Regression')
print('R2:', r2_score(y_test, y_pred))
print('RMSE:', np.sqrt(mean_squared_error(y_test, y_pred)))
print('Accuracy:', ridge.score(X_test, y_test))

r2biomass.append(r2_score(y_test, y_pred))
rmsebiomass.append(np.sqrt(mean_squared_error(y_test, y_pred)))
accuracybiomass.append(ridge.score(X_test, y_test))

# LASSO
from sklearn.linear_model import Lasso

lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)
y_pred = lasso.predict(X_test)

print('LASSO')
print('R2:', r2_score(y_test, y_pred))
print('RMSE:', np.sqrt(mean_squared_error(y_test, y_pred)))
print('Accuracy:', lasso.score(X_test, y_test))

r2biomass.append(r2_score(y_test, y_pred))
rmsebiomass.append(np.sqrt(mean_squared_error(y_test, y_pred)))
accuracybiomass.append(lasso.score(X_test, y_test))

# Elastic Net
from sklearn.linear_model import ElasticNet

elastic = ElasticNet(alpha=0.1, l1_ratio=0.5)
elastic.fit(X_train, y_train)
y_pred = elastic.predict(X_test)

print('Elastic Net')
print('R2:', r2_score(y_test, y_pred))
print('RMSE:', np.sqrt(mean_squared_error(y_test, y_pred)))
print('Accuracy:', elastic.score(X_test, y_test))

r2biomass.append(r2_score(y_test, y_pred))
rmsebiomass.append(np.sqrt(mean_squared_error(y_test, y_pred)))
accuracybiomass.append(elastic.score(X_test, y_test))

# Partial Least Squares (PLS) Regression
from sklearn.cross_decomposition import PLSRegression

pls = PLSRegression(n_components=2)
pls.fit(X_train, y_train)
y_pred = pls.predict(X_test)

print('PLS Regression')
print('R2:', r2_score(y_test, y_pred))
print('RMSE:', np.sqrt(mean_squared_error(y_test, y_pred)))
print('Accuracy:', pls.score(X_test, y_test))

r2biomass.append(r2_score(y_test, y_pred))
rmsebiomass.append(np.sqrt(mean_squared_error(y_test, y_pred)))
accuracybiomass.append(pls.score(X_test, y_test))

r2_complete.append(r2biomass)
rmse_complete.append(rmsebiomass)
accuracy_complete.append(accuracybiomass)

Ridge Regression
R2: 0.9826882646065163
RMSE: 0.1527111054173147
Accuracy: 0.9826882646065163
LASSO
R2: 0.14319857793155144
RMSE: 1.0743366894615642
Accuracy: 0.14319857793155144
Elastic Net
R2: 0.7443194944804096
RMSE: 0.5868797522546225
Accuracy: 0.7443194944804096
PLS Regression
R2: 0.9739711723109447
RMSE: 0.18725249901060223
Accuracy: 0.9739711723109447


#### Formate

In [21]:
r2formate = ['Formate (g/L)']
rmseformate = ['Formate (g/L)']
accuracyformate = ['Formate (g/L)']

# Ridge Regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

y = target['Formate (g/L)']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

ridge = Ridge(alpha=0.1)
ridge.fit(X_train, y_train)
y_pred = ridge.predict(X_test)

print('Ridge Regression')
print('R2:', r2_score(y_test, y_pred))
print('RMSE:', np.sqrt(mean_squared_error(y_test, y_pred)))
print('Accuracy:', ridge.score(X_test, y_test))

r2formate.append(r2_score(y_test, y_pred))
rmseformate.append(np.sqrt(mean_squared_error(y_test, y_pred)))
accuracyformate.append(ridge.score(X_test, y_test))

# LASSO
from sklearn.linear_model import Lasso

lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)
y_pred = lasso.predict(X_test)

print('LASSO')
print('R2:', r2_score(y_test, y_pred))
print('RMSE:', np.sqrt(mean_squared_error(y_test, y_pred)))
print('Accuracy:', lasso.score(X_test, y_test))

r2formate.append(r2_score(y_test, y_pred))
rmseformate.append(np.sqrt(mean_squared_error(y_test, y_pred)))
accuracyformate.append(lasso.score(X_test, y_test))

# Elastic Net
from sklearn.linear_model import ElasticNet

elastic = ElasticNet(alpha=0.1, l1_ratio=0.5)
elastic.fit(X_train, y_train)
y_pred = elastic.predict(X_test)

print('Elastic Net')
print('R2:', r2_score(y_test, y_pred))
print('RMSE:', np.sqrt(mean_squared_error(y_test, y_pred)))
print('Accuracy:', elastic.score(X_test, y_test))

r2formate.append(r2_score(y_test, y_pred))
rmseformate.append(np.sqrt(mean_squared_error(y_test, y_pred)))
accuracyformate.append(elastic.score(X_test, y_test))

# Partial Least Squares (PLS) Regression
from sklearn.cross_decomposition import PLSRegression

pls = PLSRegression(n_components=2)
pls.fit(X_train, y_train)
y_pred = pls.predict(X_test)

print('PLS Regression')
print('R2:', r2_score(y_test, y_pred))
print('RMSE:', np.sqrt(mean_squared_error(y_test, y_pred)))
print('Accuracy:', pls.score(X_test, y_test))

r2formate.append(r2_score(y_test, y_pred))
rmseformate.append(np.sqrt(mean_squared_error(y_test, y_pred)))
accuracyformate.append(pls.score(X_test, y_test))

r2_complete.append(r2formate)
rmse_complete.append(rmseformate)
accuracy_complete.append(accuracyformate)

Ridge Regression
R2: 0.43465930007461717
RMSE: 0.6758198841338598
Accuracy: 0.43465930007461717
LASSO
R2: -0.004535021254018501
RMSE: 0.9008622334747978
Accuracy: -0.004535021254018501
Elastic Net
R2: -0.004535021254018501
RMSE: 0.9008622334747978
Accuracy: -0.004535021254018501
PLS Regression
R2: 0.3932581669245483
RMSE: 0.7001285843402357
Accuracy: 0.3932581669245483


---

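## Let's look at the R2, RMSE and accuracy values for each model and each target variable

Note that for scikit-learn regressors `score()` returns R2, so the "accuracy" table below is identical to the R2 table.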

In [22]:
# Build a DataFrame of R2 values: one row per analyte, one column per model
r2_complete = pd.DataFrame(r2_complete, columns=models)
r2_complete

Unnamed: 0,Analyte,Ridge Regression,LASSO,Elastic Net,PLS Regression
0,Glucose (g/L),0.718951,0.143186,0.150216,0.073863
1,Lactate (g/L),0.488866,0.208286,0.215563,0.174682
2,Ethanol (g/L),0.1569,-0.000949,-0.000949,0.161785
3,Acetate (g/L),0.227821,-0.004249,-0.004249,0.21529
4,Biomass (g/L),0.982688,0.143199,0.744319,0.973971
5,Formate (g/L),0.434659,-0.004535,-0.004535,0.393258


In [23]:
rmse_complete = pd.DataFrame(rmse_complete, columns=models)
rmse_complete

Unnamed: 0,Analyte,Ridge Regression,LASSO,Elastic Net,PLS Regression
0,Glucose (g/L),9.691133,16.921037,16.851481,17.592256
1,Lactate (g/L),8.194859,10.19901,10.152034,10.413207
2,Ethanol (g/L),0.727123,0.792272,0.792272,0.725013
3,Acetate (g/L),0.565521,0.644927,0.644927,0.570091
4,Biomass (g/L),0.152711,1.074337,0.58688,0.187252
5,Formate (g/L),0.67582,0.900862,0.900862,0.700129
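
Because every model for a given analyte is scored on the same test split, R2 and RMSE are linked by RMSE = s * sqrt(1 - R2), where s is the (population) standard deviation of the test targets. The sketch below back-computes s for glucose from the two tables above as a consistency check; the variable names are illustrative.

```python
import math

# Glucose row of the R2 and RMSE tables (Ridge, LASSO, Elastic Net, PLS)
r2_glucose = [0.718951, 0.143186, 0.150216, 0.073863]
rmse_glucose = [9.691133, 16.921037, 16.851481, 17.592256]

# Implied standard deviation of the glucose test targets, one estimate per model
implied_sd = [
    rmse / math.sqrt(1.0 - r2) for r2, rmse in zip(r2_glucose, rmse_glucose)
]
print([round(s, 2) for s in implied_sd])
```

The four estimates agree at about 18.28 g/L. RMSE is expressed in the units and scale of each target, so it is best compared between models for the same analyte rather than across analytes.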


In [24]:
accuracy_complete = pd.DataFrame(accuracy_complete, columns=models)
accuracy_complete

Unnamed: 0,Analyte,Ridge Regression,LASSO,Elastic Net,PLS Regression
0,Glucose (g/L),0.718951,0.143186,0.150216,0.073863
1,Lactate (g/L),0.488866,0.208286,0.215563,0.174682
2,Ethanol (g/L),0.1569,-0.000949,-0.000949,0.161785
3,Acetate (g/L),0.227821,-0.004249,-0.004249,0.21529
4,Biomass (g/L),0.982688,0.143199,0.744319,0.973971
5,Formate (g/L),0.434659,-0.004535,-0.004535,0.393258
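
A quick check, on synthetic data, that a regressor's `score()` and `r2_score` return the same number, which is why this table repeats the R2 table. The data and variable names here are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import r2_score

# Synthetic regression data (an assumption for illustration)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=50)

# Train on the first 40 rows, evaluate on the last 10
model = Ridge(alpha=0.1).fit(X_demo[:40], y_demo[:40])
score_value = model.score(X_demo[40:], y_demo[40:])
r2_value = r2_score(y_demo[40:], model.predict(X_demo[40:]))

print(score_value, r2_value)
```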


## Interpretation

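Overall, ***Ridge Regression*** is the best performing regression model: it has the highest R2 and the lowest RMSE for five of the six analytes (glucose, lactate, acetate, biomass and formate). The exception is ethanol, where PLS regression is marginally ahead (R2 0.162 vs. 0.157). Biomass is the only target that is predicted well (R2 of about 0.98 with Ridge and 0.97 with PLS). LASSO and Elastic Net score almost exactly zero for ethanol, acetate and formate, which suggests that with `alpha=0.1` the penalty shrinks every coefficient to zero and the models simply predict the training mean.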
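
The ranking behind this statement can be read off the R2 table programmatically. The sketch below copies the values from the `r2_complete` output above into a fresh DataFrame (named `r2_table` here, a hypothetical name) and picks the best model per analyte with `idxmax`.

```python
import pandas as pd

# R2 values copied from the r2_complete table above
r2_table = pd.DataFrame(
    {
        "Ridge Regression": [0.718951, 0.488866, 0.1569, 0.227821, 0.982688, 0.434659],
        "LASSO": [0.143186, 0.208286, -0.000949, -0.004249, 0.143199, -0.004535],
        "Elastic Net": [0.150216, 0.215563, -0.000949, -0.004249, 0.744319, -0.004535],
        "PLS Regression": [0.073863, 0.174682, 0.161785, 0.21529, 0.973971, 0.393258],
    },
    index=[
        "Glucose (g/L)",
        "Lactate (g/L)",
        "Ethanol (g/L)",
        "Acetate (g/L)",
        "Biomass (g/L)",
        "Formate (g/L)",
    ],
)

# Column label of the highest R2 in each row
best_model = r2_table.idxmax(axis=1)
print(best_model)
print(best_model.value_counts())
```

With these values, Ridge regression is selected for every analyte except ethanol, where PLS regression is selected.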

---

# Comparing performance between using complete dataset and only selected features

In [25]:
accuracy_complete

Unnamed: 0,Analyte,Ridge Regression,LASSO,Elastic Net,PLS Regression
0,Glucose (g/L),0.718951,0.143186,0.150216,0.073863
1,Lactate (g/L),0.488866,0.208286,0.215563,0.174682
2,Ethanol (g/L),0.1569,-0.000949,-0.000949,0.161785
3,Acetate (g/L),0.227821,-0.004249,-0.004249,0.21529
4,Biomass (g/L),0.982688,0.143199,0.744319,0.973971
5,Formate (g/L),0.434659,-0.004535,-0.004535,0.393258


In [26]:
accuracy

Unnamed: 0,Analyte,Ridge Regression,LASSO,Elastic Net,PLS Regression
0,Glucose (g/L),0.079261,0.067591,0.019056,0.079277
1,Lactate (g/L),0.18547,0.149265,0.152108,0.168553
2,Ethanol (g/L),0.13728,-0.000949,-0.000949,0.081888
3,Acetate (g/L),0.207862,-0.004249,-0.004249,0.11623
4,Biomass (g/L),0.971289,-0.085214,0.243015,0.971259
5,Formate (g/L),0.324468,-0.004535,-0.004535,0.289029


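Regression models generally perform better using the ***complete dataset***: the scores in `accuracy_complete` are at least as high as those in `accuracy` (selected features only) for every model and analyte except PLS regression on glucose (0.074 vs. 0.079). The largest gain is for Ridge regression on glucose (0.719 vs. 0.079).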
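
One way to make this comparison explicit is to subtract the two tables cell by cell. The sketch below re-enters the values shown above; `complete`, `selected`, `gain` and `model_names` are hypothetical names chosen so as not to overwrite the notebook's own variables.

```python
import pandas as pd

model_names = ["Ridge Regression", "LASSO", "Elastic Net", "PLS Regression"]
analytes = [
    "Glucose (g/L)", "Lactate (g/L)", "Ethanol (g/L)",
    "Acetate (g/L)", "Biomass (g/L)", "Formate (g/L)",
]

# Scores copied from accuracy_complete (complete dataset)
complete = pd.DataFrame(
    [
        [0.718951, 0.143186, 0.150216, 0.073863],
        [0.488866, 0.208286, 0.215563, 0.174682],
        [0.1569, -0.000949, -0.000949, 0.161785],
        [0.227821, -0.004249, -0.004249, 0.21529],
        [0.982688, 0.143199, 0.744319, 0.973971],
        [0.434659, -0.004535, -0.004535, 0.393258],
    ],
    index=analytes, columns=model_names,
)

# Scores copied from accuracy (selected features only)
selected = pd.DataFrame(
    [
        [0.079261, 0.067591, 0.019056, 0.079277],
        [0.18547, 0.149265, 0.152108, 0.168553],
        [0.13728, -0.000949, -0.000949, 0.081888],
        [0.207862, -0.004249, -0.004249, 0.11623],
        [0.971289, -0.085214, 0.243015, 0.971259],
        [0.324468, -0.004535, -0.004535, 0.289029],
    ],
    index=analytes, columns=model_names,
)

# Positive values mean the complete dataset scored higher
gain = complete - selected
print(gain.round(3))

# Cells where the selected features scored higher than the complete dataset
worse = [(a, m) for a in gain.index for m in gain.columns if gain.loc[a, m] < 0]
print(worse)
```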

---

# Finding ridge regression accuracy by using each feature to predict each target variable

That is, a ridge regression model is built to predict each target variable from a single feature, and this is repeated for each of the 2800 features.

In [27]:
# Find the ridge regression accuracy obtained by using each feature (from the complete dataset: 'data') to predict each target variable.

# I.e., build a ridge regression model to predict each target variable using 1 feature, and repeat for each of the 2800 features.

# This will give us a list of 2800 accuracy values for each target variable.

# We can then find the mean accuracy for each target variable. This will give us an idea of how well each target variable can be predicted using a single feature. 

# We can then sort the feature variables based on their accuracy to see which features are most useful to predict each target variable.
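
The plan described in the comments above can be sketched on a small synthetic dataset. This version stores one record per (target, feature) pair in a long-format DataFrame, which is an alternative layout rather than the notebook's own; `data_demo`, `target_demo` and the `feat_*` column names are hypothetical stand-ins for `data` and `target`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Synthetic stand-ins for `data` and `target` (assumptions, not the real spectra)
rng = np.random.default_rng(0)
data_demo = pd.DataFrame(
    rng.normal(size=(60, 5)), columns=[f"feat_{i}" for i in range(5)]
)
target_demo = pd.DataFrame(
    {"Demo (g/L)": 3.0 * data_demo["feat_2"] + rng.normal(scale=0.1, size=60)}
)

records = []
for feature in data_demo.columns:
    for target_variable in target_demo.columns:
        X_train, X_test, y_train, y_test = train_test_split(
            data_demo[[feature]], target_demo[target_variable],
            test_size=0.2, random_state=0,
        )
        model = Ridge(alpha=0.1).fit(X_train, y_train)
        # score() returns R2 on the held-out split
        records.append(
            {"target": target_variable, "feature": feature,
             "r2": model.score(X_test, y_test)}
        )

scores = pd.DataFrame(records)

# Mean single-feature R2 per target, and features ranked best-first per target
mean_r2 = scores.groupby("target")["r2"].mean()
ranked = scores.sort_values(["target", "r2"], ascending=[True, False])
print(mean_r2)
print(ranked.head())
```

In this toy setup only `feat_2` carries signal, so it ranks first; on the real data the same ranking would show which wavelengths are individually most predictive of each analyte.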




In [None]:
# Initialize a dictionary to store the accuracy values for each target variable
accuracy_dict = {col: [] for col in target.columns}

# Loop through each feature in the dataset
for feature in data.columns:
    X_feature = data[[feature]]
    
    # Loop through each target variable
    for target_variable in target.columns:
        y = target[target_variable]
        
        # Split the data into training and testing sets
        X_train, X_test, y_train, y_test = train_test_split(X_feature, y, test_size=0.2, random_state=0)
        
        # Initialize and train the Ridge regression model
        ridge = Ridge(alpha=0.1)
        ridge.fit(X_train, y_train)
        
        # Score on the held-out split (score() returns R2 for regressors).
        # The result is named `score` so that it does not overwrite the
        # `accuracy` DataFrame displayed above.
        score = ridge.score(X_test, y_test)
        accuracy_dict[target_variable].append((feature, score))

# # Calculate the mean accuracy for each target variable
# mean_accuracy = {target_variable: np.mean([acc for _, acc in accuracy_list]) for target_variable, accuracy_list in accuracy_dict.items()}

# Sort the features based on their accuracy for each target variable
sorted_accuracy = {target_variable: sorted(accuracy_list, key=lambda x: x[1], reverse=True) for target_variable, accuracy_list in accuracy_dict.items()}

# # Display the mean accuracy for each target variable
# print("Mean Accuracy for each target variable:")
# for target_variable, mean_acc in mean_accuracy.items():
#     print(f"{target_variable}: {mean_acc}")

# # Display the top 5 features for each target variable based on accuracy
# print("\nTop 5 features for each target variable based on accuracy:")
# for target_variable, accuracy_list in sorted_accuracy.items():
#     print(f"\n{target_variable}:")
#     for feature, accuracy in accuracy_list[:5]:
#         print(f"Feature: {feature}, Accuracy: {accuracy}")

# convert to a dataframe and display for better visualization
sorted_accuracy_df = pd.DataFrame(sorted_accuracy)
sorted_accuracy_df

Unnamed: 0,Glucose (g/L),Lactate (g/L),Ethanol (g/L),Acetate (g/L),Biomass (g/L),Formate (g/L)
0,"(1402.0, 0.0716123401898141)","(1535.0, 0.2186540132030046)","(1881.0, 0.06787955341722784)","(1405.5, 0.09216202182662214)","(1749.0, 0.9619422590877423)","(1402.0, 0.20413760275184)"
1,"(1401.5, 0.07159684631901142)","(1534.5, 0.2186484815586206)","(1881.5, 0.06782329495982331)","(1405.0, 0.09215263482522917)","(1749.5, 0.9619407572895359)","(1402.5, 0.20413520612330494)"
2,"(1402.5, 0.07151058194856985)","(1535.5, 0.2186432568590706)","(1880.5, 0.06767231722338651)","(1406.0, 0.09209336430810633)","(1748.5, 0.9619321119880818)","(1403.0, 0.2040393952456082)"
3,"(1401.0, 0.07144038708981093)","(1536.0, 0.21861513881842398)","(1882.0, 0.06749874678645684)","(1404.5, 0.09207098185183371)","(1750.0, 0.9619258208009938)","(1401.5, 0.20402855951370946)"
4,"(1403.0, 0.0712706697796277)","(1534.0, 0.21861206598595384)","(1880.0, 0.06725974141849356)","(1406.5, 0.09195495819608657)","(1748.0, 0.9619101188532349)","(1403.5, 0.20383670070784043)"
...,...,...,...,...,...,...
2795,"(1033.0, -0.16637371235507592)","(986.5, -0.06037104365956347)","(1083.5, -0.010322662007385075)","(1112.0, -0.018393754560636966)","(949.0, -0.09159113848147293)","(1532.5, -0.007287716194751992)"
2796,"(1032.5, -0.16637593423793517)","(984.5, -0.06037163365444842)","(1085.5, -0.010323146026416818)","(1112.5, -0.018396152879967875)","(947.0, -0.09163413282095201)","(1530.5, -0.007289822781032074)"
2797,"(1031.0, -0.1663766338032333)","(985.0, -0.060376487840196)","(1084.0, -0.010324323000870717)","(1114.0, -0.018396783162072605)","(948.5, -0.09163799545127826)","(1532.0, -0.007292194040743771)"
2798,"(1032.0, -0.16637845797443163)","(986.0, -0.06037654482098964)","(1085.0, -0.010324430009631014)","(1113.0, -0.018397225429964248)","(947.5, -0.09165928156079772)","(1531.0, -0.00729240892033145)"
