# Fitting a linear regression model

## Adding additional columns

We start by importing the required libraries and loading the dataframes created from our segmentation masks. In my case, I load three dataframes because the data was originally split that way to train a CNN, but you could also load all the data from a single dataframe.

In [1]:
import pandas as pd

# load all dataframes and concatenate them
sam_train = pd.read_csv('path/to/sam_train_df.csv')
sam_val = pd.read_csv('path/to/sam_val_df.csv')
sam_test = pd.read_csv('path/to/sam_test_df.csv')
df_total = pd.concat([sam_train, sam_val, sam_test], ignore_index=True)

# load weight df to add weight column
df_weight = pd.read_csv('path/to/plant_weights.csv')
df_total_weight = pd.merge(df_total, df_weight, on=['Week', 'Plant'], how='left')

# add the pixel ratio to the df (vectorized division over the whole column)
df_total_weight['pixel_ratio'] = 32 / df_total_weight['Pot_Diameter']

# calculate the plant area in cm², using the pixel area
df_total_weight['plant_area'] = df_total_weight['Plant_Pixels'] * df_total_weight['pixel_ratio'] ** 2

# save the modified dataframe
df_total_weight.to_csv('/path/to/sam_total_weight.csv')
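
To make the conversion above concrete, here is a minimal sketch with made-up numbers. It assumes that `Pot_Diameter` is the pot diameter measured in pixels and that the constant 32 is the real pot diameter in cm; neither is stated explicitly in this notebook, and the function name `pixel_area_to_cm2` is hypothetical.

```python
def pixel_area_to_cm2(plant_pixels, pot_diameter_px, pot_diameter_cm=32.0):
    """Convert a pixel count into an area in cm² via a cm-per-pixel ratio."""
    # edge length of one pixel in cm (assumption: the real pot is 32 cm wide)
    pixel_ratio = pot_diameter_cm / pot_diameter_px
    # an area scales with the square of the length ratio
    return plant_pixels * pixel_ratio * pixel_ratio


# made-up example: the pot spans 640 px and the plant mask covers 40,000 px
area = pixel_area_to_cm2(plant_pixels=40_000, pot_diameter_px=640)
print(f"{area:.1f} cm²")
```

With these example values one pixel edge corresponds to 0.05 cm, so the 40,000 mask pixels come out at about 100 cm².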

## Loading the completed dataframe
Now we load the saved dataframe again. If you run the script repeatedly, you can skip the first step and start from here.
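
If you would rather not skip the first step by hand, one possible pattern is to rebuild the dataframe only when the saved CSV is missing. The sketch below is an illustration, not part of the original workflow: `load_or_build` and `build_demo` are hypothetical names, and the temporary file only stands in for the real path.

```python
import os
import tempfile
from pathlib import Path

import pandas as pd


def load_or_build(csv_path, build_fn):
    """Return the saved dataframe if csv_path exists, otherwise build and save it."""
    csv_path = Path(csv_path)
    if csv_path.exists():
        return pd.read_csv(csv_path, index_col=0)
    df = build_fn()
    df.to_csv(csv_path)
    return df


build_calls = []


def build_demo():
    # stand-in for the merge and area computation of the first step
    build_calls.append(1)
    return pd.DataFrame({'Plant_Pixels': [100, 200], 'Weight': [10.5, 21.0]})


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'sam_total_weight.csv')
    first = load_or_build(path, build_demo)   # builds and saves
    second = load_or_build(path, build_demo)  # loads the saved file
```

The second call reads the CSV instead of calling `build_demo` again, which is the same shortcut as starting the notebook from this cell.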

In [2]:
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import joblib

In [3]:
# load the new dataframe
df_blank = pd.read_csv('/path/to/sam_total_weight.csv', index_col=0)

## Fitting the model

I identified three variables that could plausibly be used in a linear regression model that predicts the plant weight. The plant_area variable is used as the base variable, with the camera angle "Angle" and the plant age "Week" as potential additional variables. To determine whether these variables increase performance, we create four linear regression models (Base, Base + Angle, Base + Week, and Base + Both) and compare them.
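
The four variants differ only in their feature list, so the comparison can also be written as a loop. The following is a minimal sketch of that idea, not the code used to produce the results in this notebook: `compare_feature_sets` is a hypothetical helper, and the dataframe `demo` is synthetic stand-in data with the same column names.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

FEATURE_SETS = {
    'base': ['plant_area'],
    'base + Angle': ['plant_area', 'Angle'],
    'base + Week': ['plant_area', 'Week'],
    'base + both': ['plant_area', 'Week', 'Angle'],
}


def compare_feature_sets(df, feature_sets=FEATURE_SETS, target='Weight', seed=6):
    """Fit one model per feature set on the same split and collect RMSE and R2."""
    train, test = train_test_split(df, test_size=0.2, random_state=seed)
    rows = []
    for name, cols in feature_sets.items():
        model = LinearRegression().fit(train[cols], train[target])
        for split_name, split in (('train', train), ('test', test)):
            pred = model.predict(split[cols])
            rows.append({
                'variant': name,
                'split': split_name,
                'rmse': np.sqrt(mean_squared_error(split[target], pred)),
                'r2': r2_score(split[target], pred),
            })
    return pd.DataFrame(rows)


# synthetic stand-in data (assumption: not the real plant measurements)
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    'plant_area': rng.uniform(50, 500, n),
    'Week': rng.integers(1, 9, n),
    'Angle': rng.choice([0, 45, 90], n),
})
demo['Weight'] = 0.1 * demo['plant_area'] + 4 * demo['Week'] + rng.normal(0, 3, n)

results = compare_feature_sets(demo)
print(results)
```

Because every variant is fitted on the same split, the rows of `results` can be compared directly, which mirrors the fixed `random_seed` used in the cells below.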

### Variant 1, base model
#### Setting up the model

In [4]:
df = df_blank.copy()
# split the df into "train" and "test"
random_seed = 6
train, test = train_test_split(df, test_size=0.2, random_state=random_seed)

# fit the linear regression on the training set
X1 = ['plant_area']
X = train[X1]
y = train['Weight']

regression_model = LinearRegression()
regression_model.fit(X,y)

# save the model
#joblib.dump(regression_model, 'sam_lr_model.joblib')

# predict the weights for both train and test, to assess overfitting later on
test['predicted_weights'] = regression_model.predict(test[X1])
train['predicted_weights'] = regression_model.predict(train[X1])

# compute differences
# Compute the absolute differences
train['Absolute_Difference'] = abs(train['Weight'] - train['predicted_weights'])
test['Absolute_Difference'] = abs(test['Weight'] - test['predicted_weights'])

# Compute the squared differences
train['Squared_Difference'] = (train['Weight'] - train['predicted_weights']) ** 2
test['Squared_Difference'] = (test['Weight'] - test['predicted_weights']) ** 2

# You can also compute the percentage error if desired
train['Percentage_Error'] = (train['Absolute_Difference'] / train['Weight']) * 100
test['Percentage_Error'] = (test['Absolute_Difference'] / test['Weight']) * 100

test.to_csv('sam_test_predicted.csv', index=False)
train.to_csv('sam_train_predicted.csv', index=False)



#### Measuring performance on training set

In [5]:
# assess performance with different measures
mean = train['Absolute_Difference'].mean()
median = train['Absolute_Difference'].median()
absolute_difference = f'mean: {mean}\nmedian: {median}'

mean = train['Squared_Difference'].mean()
median = train['Squared_Difference'].median()
squared_difference = f'mean: {mean}\nmedian: {median}'

mean = train['Percentage_Error'].mean()
median = train['Percentage_Error'].median()
percentage_error = f'mean: {mean}\nmedian: {median}'

print(f'Absolute Difference:\n{absolute_difference}\n\nSquared Difference:\n{squared_difference}\n\nPercentage Error:\n{percentage_error}')

# RMSE via the square root of the MSE; recent scikit-learn releases no longer accept squared=False
train_rmse = np.sqrt(mean_squared_error(y, train['predicted_weights']))
train_r2 = r2_score(y, train['predicted_weights'])

print(f"\nRMSE: {train_rmse}\nR2: {train_r2}")

Absolute Difference:
mean: 7.3451524140427695
median: 5.847600261708539

Squared Difference:
mean: 173.06115873366824
median: 34.194430772441606

Percentage Error:
mean: 22.84117231669745
median: 12.18788201736295

RMSE: 13.155271138736298
R2: 0.6966460007528668
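
The measures printed above are related to each other: the RMSE is the square root of the mean squared difference (about 13.16 is the square root of about 173.06), and R2 compares the squared residuals with the variance of the true weights. The sketch below spells these definitions out with NumPy on three made-up values; `summarize_errors` is a hypothetical helper, not a function from this notebook.

```python
import numpy as np


def summarize_errors(y_true, y_pred):
    """Compute the error measures used in this notebook from two 1-D arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    abs_diff = np.abs(residuals)
    sq_diff = residuals ** 2
    # percentage error is undefined for a true weight of 0
    pct_err = abs_diff / y_true * 100
    mse = sq_diff.mean()
    ss_tot = ((y_true - y_true.mean()) ** 2).sum()
    return {
        'mean_abs': abs_diff.mean(),
        'median_abs': np.median(abs_diff),
        'mse': mse,
        'rmse': np.sqrt(mse),
        'mean_pct_error': pct_err.mean(),
        'r2': 1 - sq_diff.sum() / ss_tot,
    }


# made-up example values
summary = summarize_errors([10, 20, 30], [12, 18, 33])
for name, value in summary.items():
    print(f'{name}: {value:.3f}')
```

Because the squared difference weights large misses more heavily, its mean can sit far above its median, as it does in the results above.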


#### Measuring performance on testing set

In [6]:
# assess performance on the test set with different measures
mean = test['Absolute_Difference'].mean()
median = test['Absolute_Difference'].median()
absolute_difference = f'mean: {mean}\nmedian: {median}'

mean = test['Squared_Difference'].mean()
median = test['Squared_Difference'].median()
squared_difference = f'mean: {mean}\nmedian: {median}'

mean = test['Percentage_Error'].mean()
median = test['Percentage_Error'].median()
percentage_error = f'mean: {mean}\nmedian: {median}'

print(f'Absolute Difference:\n{absolute_difference}\n\nSquared Difference:\n{squared_difference}\n\nPercentage Error:\n{percentage_error}')

# RMSE via the square root of the MSE; recent scikit-learn releases no longer accept squared=False
test_rmse = np.sqrt(mean_squared_error(test['Weight'], test['predicted_weights']))
test_r2 = r2_score(test['Weight'], test['predicted_weights'])

print(f"\nRMSE: {test_rmse}\nR2: {test_r2}")

Absolute Difference:
mean: 7.125439035676194
median: 5.846551793583245

Squared Difference:
mean: 131.47466777552006
median: 34.182209349719365

Percentage Error:
mean: 21.07979988652215
median: 12.326802952326553

RMSE: 11.466240350503737
R2: 0.773149808715826


### Variant 2, base model + Angle
#### Setting up the model

In [7]:
df = df_blank.copy()
# split the df into "train" and "test"
random_seed = 6
train, test = train_test_split(df, test_size=0.2, random_state=random_seed)

# fit the linear regression on the training set
X1 = ['plant_area', 'Angle']
X = train[X1]
y = train['Weight']

regression_model = LinearRegression()
regression_model.fit(X,y)

# save the model
# joblib.dump(regression_model, 'sam_lr_model.joblib')

# predict the weights for both train and test, to assess overfitting later on
test['predicted_weights'] = regression_model.predict(test[X1])
train['predicted_weights'] = regression_model.predict(train[X1])

# compute differences
# Compute the absolute differences
train['Absolute_Difference'] = abs(train['Weight'] - train['predicted_weights'])
test['Absolute_Difference'] = abs(test['Weight'] - test['predicted_weights'])

# Compute the squared differences
train['Squared_Difference'] = (train['Weight'] - train['predicted_weights']) ** 2
test['Squared_Difference'] = (test['Weight'] - test['predicted_weights']) ** 2

# You can also compute the percentage error if desired
train['Percentage_Error'] = (train['Absolute_Difference'] / train['Weight']) * 100
test['Percentage_Error'] = (test['Absolute_Difference'] / test['Weight']) * 100

test.to_csv('sam_test_predicted.csv', index=False)
train.to_csv('sam_train_predicted.csv', index=False)



#### Measuring performance on training set

In [8]:
# assess performance with different measures
mean = train['Absolute_Difference'].mean()
median = train['Absolute_Difference'].median()
absolute_difference = f'mean: {mean}\nmedian: {median}'

mean = train['Squared_Difference'].mean()
median = train['Squared_Difference'].median()
squared_difference = f'mean: {mean}\nmedian: {median}'

mean = train['Percentage_Error'].mean()
median = train['Percentage_Error'].median()
percentage_error = f'mean: {mean}\nmedian: {median}'

print(f'Absolute Difference:\n{absolute_difference}\n\nSquared Difference:\n{squared_difference}\n\nPercentage Error:\n{percentage_error}')

# RMSE via the square root of the MSE; recent scikit-learn releases no longer accept squared=False
train_rmse = np.sqrt(mean_squared_error(y, train['predicted_weights']))
train_r2 = r2_score(y, train['predicted_weights'])

print(f"\nRMSE: {train_rmse}\nR2: {train_r2}")

Absolute Difference:
mean: 7.232718906498344
median: 5.744487256869688

Squared Difference:
mean: 171.34886023070214
median: 32.9991343271778

Percentage Error:
mean: 22.52614149905948
median: 12.33065555419158

RMSE: 13.090029038573679
R2: 0.6996474402588799


#### Measuring performance on testing set

In [9]:
# assess performance on the test set with different measures
mean = test['Absolute_Difference'].mean()
median = test['Absolute_Difference'].median()
absolute_difference = f'mean: {mean}\nmedian: {median}'

mean = test['Squared_Difference'].mean()
median = test['Squared_Difference'].median()
squared_difference = f'mean: {mean}\nmedian: {median}'

mean = test['Percentage_Error'].mean()
median = test['Percentage_Error'].median()
percentage_error = f'mean: {mean}\nmedian: {median}'

print(f'Absolute Difference:\n{absolute_difference}\n\nSquared Difference:\n{squared_difference}\n\nPercentage Error:\n{percentage_error}')

# RMSE via the square root of the MSE; recent scikit-learn releases no longer accept squared=False
test_rmse = np.sqrt(mean_squared_error(test['Weight'], test['predicted_weights']))
test_r2 = r2_score(test['Weight'], test['predicted_weights'])

print(f"\nRMSE: {test_rmse}\nR2: {test_r2}")

Absolute Difference:
mean: 7.008819449512693
median: 5.709488821883291

Squared Difference:
mean: 128.5987574771839
median: 32.59847543467123

Percentage Error:
mean: 20.700013524462793
median: 12.042256887440942

RMSE: 11.340139217716153
R2: 0.7781119874558977


### Variant 3, base model + Week
#### Setting up the model

In [10]:
df = df_blank.copy()
# split the df into "train" and "test"
random_seed = 6
train, test = train_test_split(df, test_size=0.2, random_state=random_seed)

# fit the linear regression on the training set
X1 = ['plant_area', 'Week']
X = train[X1]
y = train['Weight']

regression_model = LinearRegression()
regression_model.fit(X,y)

# save the model
# joblib.dump(regression_model, 'sam_lr_model.joblib')

# predict the weights for both train and test, to assess overfitting later on
test['predicted_weights'] = regression_model.predict(test[X1])
train['predicted_weights'] = regression_model.predict(train[X1])

# compute differences
# Compute the absolute differences
train['Absolute_Difference'] = abs(train['Weight'] - train['predicted_weights'])
test['Absolute_Difference'] = abs(test['Weight'] - test['predicted_weights'])

# Compute the squared differences
train['Squared_Difference'] = (train['Weight'] - train['predicted_weights']) ** 2
test['Squared_Difference'] = (test['Weight'] - test['predicted_weights']) ** 2

# You can also compute the percentage error if desired
train['Percentage_Error'] = (train['Absolute_Difference'] / train['Weight']) * 100
test['Percentage_Error'] = (test['Absolute_Difference'] / test['Weight']) * 100

test.to_csv('sam_test_predicted.csv', index=False)
train.to_csv('sam_train_predicted.csv', index=False)



#### Measuring performance on training set

In [11]:
# assess performance with different measures
mean = train['Absolute_Difference'].mean()
median = train['Absolute_Difference'].median()
absolute_difference = f'mean: {mean}\nmedian: {median}'

mean = train['Squared_Difference'].mean()
median = train['Squared_Difference'].median()
squared_difference = f'mean: {mean}\nmedian: {median}'

mean = train['Percentage_Error'].mean()
median = train['Percentage_Error'].median()
percentage_error = f'mean: {mean}\nmedian: {median}'

print(f'Absolute Difference:\n{absolute_difference}\n\nSquared Difference:\n{squared_difference}\n\nPercentage Error:\n{percentage_error}')

# RMSE via the square root of the MSE; recent scikit-learn releases no longer accept squared=False
train_rmse = np.sqrt(mean_squared_error(y, train['predicted_weights']))
train_r2 = r2_score(y, train['predicted_weights'])

print(f"\nRMSE: {train_rmse}\nR2: {train_r2}")

Absolute Difference:
mean: 6.255032104258782
median: 4.483706927152122

Squared Difference:
mean: 94.12183684889308
median: 20.10362892733488

Percentage Error:
mean: 17.548384262067767
median: 10.12642317075743

RMSE: 9.701640935887758
R2: 0.8350165003313178


#### Measuring performance on testing set

In [12]:
# assess performance on the test set with different measures
mean = test['Absolute_Difference'].mean()
median = test['Absolute_Difference'].median()
absolute_difference = f'mean: {mean}\nmedian: {median}'

mean = test['Squared_Difference'].mean()
median = test['Squared_Difference'].median()
squared_difference = f'mean: {mean}\nmedian: {median}'

mean = test['Percentage_Error'].mean()
median = test['Percentage_Error'].median()
percentage_error = f'mean: {mean}\nmedian: {median}'

print(f'Absolute Difference:\n{absolute_difference}\n\nSquared Difference:\n{squared_difference}\n\nPercentage Error:\n{percentage_error}')

# RMSE via the square root of the MSE; recent scikit-learn releases no longer accept squared=False
test_rmse = np.sqrt(mean_squared_error(test['Weight'], test['predicted_weights']))
test_r2 = r2_score(test['Weight'], test['predicted_weights'])

print(f"\nRMSE: {test_rmse}\nR2: {test_r2}")

Absolute Difference:
mean: 6.210769613488025
median: 4.7889789944413295

Squared Difference:
mean: 81.59977459993128
median: 22.934320958893892

Percentage Error:
mean: 16.688647394261753
median: 10.411108083070697

RMSE: 9.033259356396853
R2: 0.8592053907423028


### Variant 4, base model + Both
#### Setting up the model

In [13]:
df = df_blank.copy()
# split the df into "train" and "test"
random_seed = 6
train, test = train_test_split(df, test_size=0.2, random_state=random_seed)

# fit the linear regression on the training set
X1 = ['plant_area', 'Week', 'Angle']
X = train[X1]
y = train['Weight']

regression_model = LinearRegression()
regression_model.fit(X,y)

# save the model
joblib.dump(regression_model, '../Application/sam_lr_model.joblib')

# predict the weights for both train and test, to assess overfitting later on
test['predicted_weights'] = regression_model.predict(test[X1])
train['predicted_weights'] = regression_model.predict(train[X1])

# compute differences
# Compute the absolute differences
train['Absolute_Difference'] = abs(train['Weight'] - train['predicted_weights'])
test['Absolute_Difference'] = abs(test['Weight'] - test['predicted_weights'])

# Compute the squared differences
train['Squared_Difference'] = (train['Weight'] - train['predicted_weights']) ** 2
test['Squared_Difference'] = (test['Weight'] - test['predicted_weights']) ** 2

# You can also compute the percentage error if desired
train['Percentage_Error'] = (train['Absolute_Difference'] / train['Weight']) * 100
test['Percentage_Error'] = (test['Absolute_Difference'] / test['Weight']) * 100

test.to_csv('sam_test_predicted.csv', index=False)
train.to_csv('sam_train_predicted.csv', index=False)



#### Measuring performance on training set

In [14]:
# assess performance with different measures
mean = train['Absolute_Difference'].mean()
median = train['Absolute_Difference'].median()
absolute_difference = f'mean: {mean}\nmedian: {median}'

mean = train['Squared_Difference'].mean()
median = train['Squared_Difference'].median()
squared_difference = f'mean: {mean}\nmedian: {median}'

mean = train['Percentage_Error'].mean()
median = train['Percentage_Error'].median()
percentage_error = f'mean: {mean}\nmedian: {median}'

print(f'Absolute Difference:\n{absolute_difference}\n\nSquared Difference:\n{squared_difference}\n\nPercentage Error:\n{percentage_error}')

# RMSE via the square root of the MSE; recent scikit-learn releases no longer accept squared=False
train_rmse = np.sqrt(mean_squared_error(y, train['predicted_weights']))
train_r2 = r2_score(y, train['predicted_weights'])

print(f"\nRMSE: {train_rmse}\nR2: {train_r2}")

Absolute Difference:
mean: 6.17425394268119
median: 4.396182074333936

Squared Difference:
mean: 93.59724474341316
median: 19.32643885920509

Percentage Error:
mean: 17.25772797651889
median: 9.789008129837226

RMSE: 9.674566902110563
R2: 0.8359360429619997


#### Measuring performance on testing set

In [15]:
# assess performance on the test set with different measures
mean = test['Absolute_Difference'].mean()
median = test['Absolute_Difference'].median()
absolute_difference = f'mean: {mean}\nmedian: {median}'

mean = test['Squared_Difference'].mean()
median = test['Squared_Difference'].median()
squared_difference = f'mean: {mean}\nmedian: {median}'

mean = test['Percentage_Error'].mean()
median = test['Percentage_Error'].median()
percentage_error = f'mean: {mean}\nmedian: {median}'

print(f'Absolute Difference:\n{absolute_difference}\n\nSquared Difference:\n{squared_difference}\n\nPercentage Error:\n{percentage_error}')

# RMSE via the square root of the MSE; recent scikit-learn releases no longer accept squared=False
test_rmse = np.sqrt(mean_squared_error(test['Weight'], test['predicted_weights']))
test_r2 = r2_score(test['Weight'], test['predicted_weights'])

print(f"\nRMSE: {test_rmse}\nR2: {test_r2}")

Absolute Difference:
mean: 6.13358936701372
median: 4.71827398146389

Squared Difference:
mean: 80.58171893855548
median: 22.26214204417786

Percentage Error:
mean: 16.404741064366082
median: 10.059665684281832

RMSE: 8.976732085706661
R2: 0.8609619734013696


### Selecting the final model
For all variants, the training and testing metrics are similar (the test set even scores slightly better than the training set), and the gap is smaller in variants 3 and 4. This suggests that the models generalize well to unseen data.
In terms of performance, variant 4 delivers the best values, with variant 3 close behind. Variant 1 performs worst, and variant 2 improves on it only slightly. The jump from variant 2 to variant 3 is much larger (the test RMSE drops from about 11.3 to about 9.0), suggesting that, in combination with the base variable "plant_area", the age variable "Week" is the most important variable for accurate weight estimates. The camera angle variable "Angle" can be used for a small further improvement.
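
To put numbers on this comparison, the short sketch below computes how much each variant lowers the test RMSE relative to variant 1. The RMSE values are copied from the test-set outputs above; the dictionary keys are just labels chosen for this illustration.

```python
# test-set RMSE values copied from the outputs of the four variants
test_rmse = {
    'variant 1 (base)': 11.466240350503737,
    'variant 2 (+ Angle)': 11.340139217716153,
    'variant 3 (+ Week)': 9.033259356396853,
    'variant 4 (+ both)': 8.976732085706661,
}

baseline = test_rmse['variant 1 (base)']

# relative reduction of the test RMSE compared with the base model, in percent
improvement = {
    name: (baseline - rmse) / baseline * 100
    for name, rmse in test_rmse.items()
}

for name, pct in improvement.items():
    print(f'{name}: {pct:.1f} % lower test RMSE than the base model')
```

Adding "Angle" alone lowers the test RMSE by roughly 1 %, while adding "Week" lowers it by roughly 21 %, and using both by roughly 22 %.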

In this specific use case, the camera angle is known for all images and can be extracted from the dataframe, so we will continue with variant 4 as our linear regression model. If you don't know the camera angles, I'd suggest using variant 3 for your weight estimation.
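
For completeness, here is a minimal sketch of how a model saved with `joblib.dump` could be loaded again and applied to new plants. The training rows, the new observation and the temporary file location are made up for this illustration; only the file name and the feature order `['plant_area', 'Week', 'Angle']` follow the variant 4 cell above.

```python
import os
import tempfile

import joblib
import pandas as pd
from sklearn.linear_model import LinearRegression

# same feature order as in the variant 4 cell
FEATURES = ['plant_area', 'Week', 'Angle']

# tiny made-up training set, only there to have a fitted model to save
demo_train = pd.DataFrame({
    'plant_area': [100.0, 200.0, 300.0, 400.0, 250.0],
    'Week': [1, 2, 3, 4, 2],
    'Angle': [0, 45, 90, 0, 45],
    'Weight': [12.0, 25.0, 39.0, 50.0, 30.0],
})
model = LinearRegression().fit(demo_train[FEATURES], demo_train['Weight'])

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, 'sam_lr_model.joblib')
    joblib.dump(model, model_path)     # save the fitted model
    loaded = joblib.load(model_path)   # load it again, e.g. in another script

# new observation (made up); the columns must be passed in the training order
new_plants = pd.DataFrame({'plant_area': [220.0], 'Week': [3], 'Angle': [45]})
predicted = loaded.predict(new_plants[FEATURES])
print(predicted)
```

Selecting the columns with `FEATURES` before calling `predict` keeps the column order identical to the one used during fitting, which is what the model expects.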

The next step is optional: I concatenate the two dataframes back together and save the result with all the additional columns. This also removes an extra "Unnamed" column, which is most likely a leftover index column from an earlier CSV export.

In [31]:
df = pd.concat([train, test], ignore_index=True)
df = df.loc[:, ~df.columns.str.contains('^Unnamed')]
df.to_csv('/path/to/sam_lr_applied.csv')