# Analysis of the Bananas

* Quality Score: A numerical score, likely on a scale of 1-4, that rates the overall quality of the banana sample.
* Variety: The banana's type.
* Ripeness: How ripe the banana is.
* Sugar Content (Brix): Amount of sugar, measured in degrees Brix.
* Firmness (kgf): Associated with the texture of the banana (hard or soft).
* Tree age: The age of the banana tree.
* Altitude: The altitude at which the banana plants grow, measured in meters above sea level. Some bananas grow better at certain altitudes and under certain conditions than others.
* Rainfall (mm): Amount of rainfall in the zone where the bananas were harvested. A higher amount of rainfall may indicate favorable growing conditions for bananas.
* Soil nitrogen: The concentration of nitrogen in the soil where the bananas were harvested, measured in parts per million (ppm). Nitrogen is a crucial nutrient for plant growth.


In [4]:
# Load the data
bananas <- read.csv("../data/banana.csv")

The goal of this analysis is to identify which variables matter most in determining the quality of a banana, and then to build a regression model that predicts the quality score of an unseen record.

In [20]:
# Keep only the numeric columns to build the feature matrix.
# Note: class() == "numeric" matches double columns only; integer columns are skipped.
numeric_cols <- c()
for (i in seq_len(ncol(bananas))){
    if (class(bananas[, i]) == "numeric"){
        numeric_cols <- c(numeric_cols, i)
    }
}
bananas_num <- bananas[, numeric_cols]
# Fit a linear regression with quality_score as the response and all remaining columns as explanatory variables.
model <- lm(quality_score~., data = bananas_num)
model


Call:
lm(formula = quality_score ~ ., data = bananas_num)

Coefficients:
       (Intercept)      ripeness_index  sugar_content_brix        firmness_kgf  
        -1.997e+00           2.002e-01           1.565e-01          -1.565e-02  
         length_cm            weight_g      tree_age_years          altitude_m  
         3.975e-02           3.207e-05           2.286e-04           3.833e-06  
       rainfall_mm   soil_nitrogen_ppm  
         1.692e-06          -6.756e-05  
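
Written out with the coefficients printed above, the fitted model reads as follows. This is only a restatement of the output; each coefficient is the change in predicted quality score for a one-unit increase in that variable with the others held fixed, so the magnitudes depend on each variable's units and are not directly comparable.

$$
\begin{aligned}
\widehat{\mathrm{quality\_score}} ={}& -1.997 + 0.2002\,\mathrm{ripeness\_index} + 0.1565\,\mathrm{sugar\_content\_brix} \\
& - 0.01565\,\mathrm{firmness\_kgf} + 0.03975\,\mathrm{length\_cm} + 3.207\cdot 10^{-5}\,\mathrm{weight\_g} \\
& + 2.286\cdot 10^{-4}\,\mathrm{tree\_age\_years} + 3.833\cdot 10^{-6}\,\mathrm{altitude\_m} \\
& + 1.692\cdot 10^{-6}\,\mathrm{rainfall\_mm} - 6.756\cdot 10^{-5}\,\mathrm{soil\_nitrogen\_ppm}
\end{aligned}
$$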


In this case every remaining numeric column is used as a predictor, but that does not guarantee that the regression model predicts the quality score of a banana well. What follows explains how to judge whether a model is good and how to improve it.
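
One common way to start judging a linear model is to look at `summary()`. The sketch below is a minimal illustration on simulated data: the data frame `demo`, its coefficients and its noise level are assumptions made for this example, not the banana dataset. It extracts R-squared, adjusted R-squared, the coefficient table with p-values, and the residual standard error.

```r
# Hypothetical stand-in for bananas_num: simulated data, NOT the real dataset.
set.seed(1)
n <- 200
demo <- data.frame(
  ripeness_index     = runif(n, 1, 7),
  sugar_content_brix = runif(n, 15, 22),
  rainfall_mm        = runif(n, 1000, 3000)  # has no effect in this simulation
)
demo$quality_score <- -2 + 0.2 * demo$ripeness_index +
  0.15 * demo$sugar_content_brix + rnorm(n, sd = 0.07)

fit <- lm(quality_score ~ ., data = demo)
fit_summary <- summary(fit)

# Share of variance explained, raw and adjusted for the number of predictors.
r2     <- fit_summary$r.squared
adj_r2 <- fit_summary$adj.r.squared

# One row per coefficient: estimate, standard error, t value, p-value.
coef_table <- fit_summary$coefficients
print(round(coef_table, 4))

# Residual standard error: square root of the mean squared residual.
rse <- fit_summary$sigma
cat("R^2:", r2, " adjusted R^2:", adj_r2, " residual SE:", rse, "\n")
```

On the real data the same calls would be applied to `model`. A predictor with a large p-value in the coefficient table is a candidate for removal.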

In [22]:
# Examine the range and distribution of the target variable
summary(bananas_num$quality_score)

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.920   2.090   2.440   2.465   2.850   3.890 

In [34]:
# Alternative: compute the model coefficients directly from the normal equations
Y <- bananas_num[, "quality_score"] # Dependent (target) variable
# Exclude the target variable by name rather than by position
X <- bananas_num[, setdiff(names(bananas_num), "quality_score")]
ones_col <- matrix(rep(1, nrow(bananas_num)), nrow = nrow(bananas_num))
# Prepend a column of ones, reserved for the intercept (\beta_0)
X <- as.matrix(cbind(ones_col, X))
# Coefficient vector: (X^t X)^{-1} X^t Y
betas <- solve(t(X) %*% X) %*% t(X) %*% Y
# Now we can compute the predicted values of the observations.
y_hat <- X %*% betas
# Show what y_hat contains
cat("The expected quality score for the banana associated with observation 1:", y_hat[1])

The expected quality score for the banana associated with observation 1: 1.85639
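
As a sanity check, the coefficients from the normal equations should match those returned by `lm()`. The sketch below shows this on simulated data; `demo`, `x1`, `x2` and `y` are hypothetical names used only for this example. It also shows `solve(crossprod(X), crossprod(X, Y))`, which solves the same system without forming the inverse explicitly and tends to be numerically more stable.

```r
# Simulated data for illustration only (not the banana dataset).
set.seed(42)
n <- 100
demo <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
demo$y <- 1 + 2 * demo$x1 - 0.5 * demo$x2 + rnorm(n, sd = 0.1)

# Design matrix with a leading column of ones for the intercept.
X <- cbind(1, as.matrix(demo[, c("x1", "x2")]))
Y <- demo$y

# Textbook form: explicit inverse of X^t X.
betas_explicit <- solve(t(X) %*% X) %*% t(X) %*% Y
# Same linear system, solved without computing the inverse.
betas_solve <- solve(crossprod(X), crossprod(X, Y))
# Reference values from lm().
betas_lm <- coef(lm(y ~ x1 + x2, data = demo))

# Largest absolute disagreement between the three approaches.
max_diff <- max(abs(betas_explicit - betas_lm), abs(betas_solve - betas_lm))
cat("Largest difference from lm():", max_diff, "\n")
```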

In [41]:
# Compute the mean of the squared residuals (errors) made by the model.
# If the model is accurate, this value should be close to 0.
SSR <- sum((Y - y_hat)^2)
# To average the sum of squares we need the residual degrees of freedom:
# the number of observations minus the 10 estimated betas.
df_resid <- nrow(bananas_num) - 10
MSSR <- SSR / df_resid
MSSR

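Even though all the numeric variables were used to build the model, the predictions are very close to the observed values:
$$
MSSR = 4.72 \cdot 10^{-3}
$$
Its square root, about 0.069, is the typical size of a residual, which is small relative to a quality score that ranges from 0.92 to 3.89.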
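
For reference, the quantity computed in the cell above can be written as follows, where $N$ denotes the number of rows of `bananas_num` and $p = 10$ is the number of estimated coefficients (the intercept plus nine predictors). The symbols $N$ and $p$ are introduced here only for this formula.

$$
MSSR = \frac{SSR}{N - p} = \frac{1}{N - p}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2, \qquad p = 10
$$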

In [49]:
# The same quantities can be obtained directly from the model object
MSSR <- sum(model$residuals^2) / model$df.residual
Y_hat <- model$fitted.values # Fitted (predicted) values for each observation

Another important result follows from the expression used to compute the coefficient vector:
$$
\hat{Y} = X\hat{\beta} = X(X^tX)^{-1}X^tY = HY;\quad H = X(X^tX)^{-1}X^t
$$
$H$ is the well-known hat matrix.
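
The hat matrix can be built and checked numerically. The sketch below uses simulated data (`demo`, `x1`, `x2` and `y` are hypothetical names, not the banana dataset). It verifies that $H$ maps the observed responses to the fitted values, that it is symmetric and idempotent, that its trace equals the number of estimated coefficients, and that its diagonal matches the leverages returned by `hatvalues()`.

```r
# Simulated data for illustration only.
set.seed(7)
n <- 50
demo <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
demo$y <- 1 + 2 * demo$x1 - 0.5 * demo$x2 + rnorm(n, sd = 0.1)

fit <- lm(y ~ x1 + x2, data = demo)
X <- model.matrix(fit)                  # design matrix, including the intercept column
H <- X %*% solve(t(X) %*% X) %*% t(X)   # hat matrix, n x n

# H "puts the hat on Y": H %*% y reproduces the fitted values.
fit_diff  <- max(abs(H %*% demo$y - fitted(fit)))
# H is a projection matrix: symmetric and idempotent.
sym_diff  <- max(abs(H - t(H)))
idem_diff <- max(abs(H %*% H - H))
# The trace equals the number of estimated coefficients (3 here).
trace_H   <- sum(diag(H))
# The diagonal entries are the leverages of the observations.
lev_diff  <- max(abs(diag(H) - hatvalues(fit)))

cat("trace(H):", trace_H, "\n")
```

For large datasets the full $n \times n$ matrix is rarely formed; `hatvalues(model)` gives the diagonal directly.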