Model Complexity Task:

In [136]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from scipy.optimize import minimize
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix

# Load the data into a pandas DataFrame
fish_df = pd.read_csv('Fish.csv')
fish_df.info()

# Print the first 5 rows of the DataFrame to verify it was loaded correctly
print(fish_df.head())

# Add two random columns to the DataFrame
fish_df['Rand1'] = np.random.randint(10, 100, size=len(fish_df))
fish_df['Rand2'] = np.random.randint(1, 7, size=len(fish_df))
# Print the first rows again to verify the added columns
print(fish_df.head())
fish_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 159 entries, 0 to 158
Data columns (total 7 columns):
 #   Column   Non-Null Count  Dtype  
---  ------   --------------  -----  
 0   Species  159 non-null    object 
 1   Weight   159 non-null    float64
 2   Length1  159 non-null    float64
 3   Length2  159 non-null    float64
 4   Length3  159 non-null    float64
 5   Height   159 non-null    float64
 6   Width    159 non-null    float64
dtypes: float64(6), object(1)
memory usage: 8.8+ KB
  Species  Weight  Length1  Length2  Length3   Height   Width
0   Bream   242.0     23.2     25.4     30.0  11.5200  4.0200
1   Bream   290.0     24.0     26.3     31.2  12.4800  4.3056
2   Bream   340.0     23.9     26.5     31.1  12.3778  4.6961
3   Bream   363.0     26.3     29.0     33.5  12.7300  4.4555
4   Bream   430.0     26.5     29.0     34.0  12.4440  5.1340
  Species  Weight  Length1  Length2  Length3   Height   Width  Rand1  Rand2
0   Bream   242.0     23.2     25.4     30.0  11.5200 

In [137]:
# Convert the "Species" column to numeric values
le = LabelEncoder()
fish_df['Species'] = le.fit_transform(fish_df['Species'])

## Print the categories
print(le.classes_)

print(fish_df.head())
# Split the data set into a training set and a test set
X = fish_df.drop('Weight', axis=1)
y = fish_df['Weight']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)




['Bream' 'Parkki' 'Perch' 'Pike' 'Roach' 'Smelt' 'Whitefish']
   Species  Weight  Length1  Length2  Length3   Height   Width  Rand1  Rand2
0        0   242.0     23.2     25.4     30.0  11.5200  4.0200     15      3
1        0   290.0     24.0     26.3     31.2  12.4800  4.3056     78      2
2        0   340.0     23.9     26.5     31.1  12.3778  4.6961     56      6
3        0   363.0     26.3     29.0     33.5  12.7300  4.4555     69      4
4        0   430.0     26.5     29.0     34.0  12.4440  5.1340     60      3


In [138]:
# Model-1: Fit a regression model with all variables
model1 = LinearRegression()
model1.fit(X_train, y_train)
y_pred1 = model1.predict(X_test)

# Print the model coefficients and R-squared
print('Model-1:')
print('Coefficients:', model1.intercept_, model1.coef_)
print('R-squared:', r2_score(y_test, y_pred1))



Model-1:
Coefficients: -668.3828669318921 [ 2.99370849e+01 -3.72371429e-01  6.98082854e+01 -4.45510671e+01
  4.10851184e+01  2.28271078e+00  6.12718628e-02  8.16588313e+00]
R-squared: 0.9057382248899991


In [139]:
# Model-2: Drop column Rand1 and re-run regression
X_train2 = X_train.drop('Rand1', axis=1)
X_test2 = X_test.drop('Rand1', axis=1)
model2 = LinearRegression()
model2.fit(X_train2, y_train)
y_pred2 = model2.predict(X_test2)

# Print the model coefficients and R-squared
print('Model-2:')
print('Coefficients:', model2.intercept_, model2.coef_)
print('R-squared:', r2_score(y_test, y_pred2))

Model-2:
Coefficients: -664.8011819976059 [ 29.75785298   0.65759422  68.63485301 -44.38803202  40.93220071
   2.94633573   8.08741078]
R-squared: 0.905704548904201


In [140]:
# Model-3: Drop column Rand1 and Rand2, re-run regression
X_train3 = X_train.drop(['Rand1', 'Rand2'], axis=1)
X_test3 = X_test.drop(['Rand1', 'Rand2'], axis=1)
model3 = LinearRegression()
model3.fit(X_train3, y_train)
y_pred3 = model3.predict(X_test3)

# Print the model coefficients and R-squared
print('Model-3:')
print('Coefficients:', model3.intercept_, model3.coef_)
print('R-squared:', r2_score(y_test, y_pred3))

Model-3:
Coefficients: -633.8373367836286 [ 29.62074303   3.03826783  67.13990751 -44.91323749  41.76187604
  -0.29532667]
R-squared: 0.9034878699241473


In [131]:
# Model-4: Keep only the top two records and drop the rest of the records
X_train4 = X_train.head(2)
y_train4 = y_train.head(2)

print(X_train4)
print(y_train4)
model4 = LinearRegression()
model4.fit(X_train4, y_train4)
y_pred4 = model4.predict(X_test)

# Print the model coefficients and R-squared
print('Model-4:')
print('Coefficients:', model4.intercept_, model4.coef_)
print('R-squared:', r2_score(y_test, y_pred4))

     Species  Length1  Length2  Length3  Height   Width  Rand1  Rand2
75         2     15.0     16.2     17.2  4.5924  2.6316     69      1
138        3     43.2     46.0     48.7  7.7920  4.8700     50      5
75      51.5
138    567.0
Name: Weight, dtype: float64
Model-4:
Coefficients: -54.24248868830233 [-5.51250312  3.53811463  6.10787797  6.22221691  0.42037851  0.04796116
 -2.09862248  0.47421197]
R-squared: 0.4471906069464582


After running the four linear regression models (the exact values change between runs because Rand1 and Rand2 are drawn at random without a fixed seed), we can observe the following:

Model-1, which includes all variables, has the highest test-set R-squared, 0.9057, meaning it explains about 90.6% of the variance in fish weight on the held-out data. Each coefficient is the estimated linear effect of one feature on weight with the other features held fixed.

Model-2, which drops the Rand1 variable, has an essentially identical R-squared of 0.90570 (against 0.90574 for Model-1). The Rand1 coefficient in Model-1 was only 0.06, and the remaining coefficients shift only slightly when it is removed: the large Species, Length2, Length3 and Height coefficients are almost unchanged, while the small Length1 coefficient moves from -0.37 to 0.66. Removing Rand1 therefore has no meaningful effect on model performance.

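Model-3, which drops both Rand1 and Rand2, has a slightly lower R-squared of 0.9035, a difference of about 0.002 from Model-1. A change this small on a 32-record test set does not show that Rand2 carries real information, since both random columns are noise by construction. The largest coefficients remain those of Species, Length2, Length3 and Height, but coefficient size alone does not establish statistical significance, especially because the three length measurements are strongly correlated with one another.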
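The effect of a pure-noise column can be sketched on synthetic data. This is a minimal illustration, not the notebook's analysis: the data, the helper names `fit_and_score` and `r2`, and the use of NumPy's least-squares solver in place of LinearRegression are all assumptions made for the example. It shows that adding a column can never lower the training R-squared of an ordinary least-squares fit, which is why the comparison has to be made on held-out data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the fish data: 3 informative features plus noise.
n = 200
X_info = rng.normal(size=(n, 3))
y = X_info @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=n)

# A random integer column, generated the same way as Rand1 in the notebook.
noise_col = rng.integers(10, 100, size=(n, 1)).astype(float)


def r2(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def fit_and_score(X, y, n_train=160):
    """Fit OLS with an intercept on the first n_train rows; return (train R2, test R2)."""
    A = np.column_stack([np.ones(len(X)), X])
    A_train, A_test = A[:n_train], A[n_train:]
    beta, *_ = np.linalg.lstsq(A_train, y[:n_train], rcond=None)
    return r2(y[:n_train], A_train @ beta), r2(y[n_train:], A_test @ beta)


train_base, test_base = fit_and_score(X_info, y)
train_noise, test_noise = fit_and_score(np.hstack([X_info, noise_col]), y)

print("without noise column: train R2 = %.4f, test R2 = %.4f" % (train_base, test_base))
print("with noise column:    train R2 = %.4f, test R2 = %.4f" % (train_noise, test_noise))
```

The training score with the extra column is at least as high as without it, whatever the column contains; only the test score says whether the column helps.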

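Model-4, which is trained on only the first two training records, has a much lower test-set R-squared of 0.447. With two records and eight features the model is underdetermined: it can reproduce the two training weights exactly, but the resulting coefficients do not generalize to the test set.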
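A minimal sketch of what happens when a linear model is fit on two records with eight features. The data here is synthetic, and NumPy's minimum-norm least-squares solution is used as a stand-in for LinearRegression, so the numbers will not match the notebook's; the variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data with 8 features, mirroring the shape of the notebook's X.
n, p = 100, 8
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(scale=0.5, size=n)

# Keep only the first two records for training; evaluate on the rest.
X_small, y_small = X[:2], y[:2]
X_rest, y_rest = X[2:], y[2:]


def r2(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# Two equations, nine unknowns (intercept + 8 coefficients): lstsq returns
# the minimum-norm solution among the infinitely many exact fits.
A_small = np.column_stack([np.ones(2), X_small])
beta, *_ = np.linalg.lstsq(A_small, y_small, rcond=None)

train_pred = A_small @ beta
test_pred = np.column_stack([np.ones(len(X_rest)), X_rest]) @ beta

r2_train = r2(y_small, train_pred)
r2_test = r2(y_rest, test_pred)
print("train R2:", r2_train)
print("test R2: ", r2_test)
```

The training predictions match the two targets to numerical precision, so a low test score in this setting reflects poor generalization rather than a failure to fit the training records.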

“Optimization” task: 

In [143]:
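# Load the numeric columns (Weight, Length1, Length2, Length3, Height, Width) into a numpy array
data = np.loadtxt("Fish.csv", delimiter=",", skiprows=1, usecols=(1,2,3,4,5,6))

# Split data into input variables (X) and output variable (y).
# Note: with this slicing the target y is the last column, Width,
# and X holds Weight, Length1, Length2, Length3 and Height.
X = data[:, :-1]
y = data[:, -1]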

# Define the objective function: mean squared error of a linear model without an intercept
def objective(beta, X, y):
    y_pred = np.dot(X, beta)
    residual = y - y_pred
    mse = np.mean(residual ** 2)
    return mse

# Initialize the regression coefficients with all ones
beta_init = np.ones(X.shape[1])

# Optimize the objective function to find the regression coefficients
result = minimize(objective, beta_init, args=(X, y))

# Print the regression coefficients and R^2 score
print("Regressor coefficients: ", result.x)
y_pred = np.dot(X, result.x)
SS_res = np.sum((y - y_pred) ** 2)
SS_tot = np.sum((y - np.mean(y)) ** 2)
r_squared = 1 - (SS_res / SS_tot)
print("R^2 score: ", r_squared)


Regressor coefficients:  [-2.90080873e-04 -1.20914581e-02  6.89875069e-01 -5.70251811e-01
  3.38336707e-01]
R^2 score:  0.9158987329199345


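The optimizer reaches an R^2 score of 0.916, so minimizing the mean squared error numerically does find a good linear fit. This score is not directly comparable to Models 1 to 4, however: with usecols=(1,2,3,4,5,6) and y = data[:, -1], the target here is Width (predicted from Weight, Length1, Length2, Length3 and Height), the model has no intercept, and R^2 is computed on the same data used for fitting rather than on a test set.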
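One way to check an optimizer-based fit is to compare it with the closed-form least-squares solution, which minimizes the same objective. The sketch below does this on synthetic data (not Fish.csv); `beta_true` and the data shapes are assumptions chosen for illustration, and `minimize` is called with its default settings as in the cell above.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)

# Synthetic data: 5 inputs, no intercept, known coefficients plus noise.
X = rng.normal(size=(150, 5))
beta_true = np.array([1.5, -2.0, 0.0, 0.7, 3.0])
y = X @ beta_true + rng.normal(scale=0.3, size=150)


def objective(beta, X, y):
    """Mean squared error of the linear model y ~ X @ beta (no intercept)."""
    return np.mean((y - X @ beta) ** 2)


# Numerical minimization, starting from all ones as in the notebook.
result = minimize(objective, np.ones(X.shape[1]), args=(X, y))

# Closed-form least-squares solution of the same problem.
beta_closed, *_ = np.linalg.lstsq(X, y, rcond=None)

print("optimizer:  ", result.x)
print("closed form:", beta_closed)
print("max abs difference:", np.max(np.abs(result.x - beta_closed)))
```

If the two coefficient vectors agree closely, the optimizer has converged to the least-squares solution; a large gap would point to a convergence problem, for example from badly scaled inputs.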

“Regularization” task

In [144]:
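from sklearn.linear_model import Lasso

# Load the numeric columns (Weight, Length1, Length2, Length3, Height, Width) into a numpy array
data = np.loadtxt("Fish.csv", delimiter=",", skiprows=1, usecols=(1,2,3,4,5,6))

# Split data into input variables (X) and output variable (y).
# Note: with this slicing the target y is the last column, Width,
# and X holds Weight, Length1, Length2, Length3 and Height.
X = data[:, :-1]
y = data[:, -1]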

# Initialize the Lasso model with different penalty multipliers
lasso1 = Lasso(alpha=0.1)
lasso2 = Lasso(alpha=0.5)
lasso3 = Lasso(alpha=1.0)

# Fit the Lasso model to the data
lasso1.fit(X, y)
lasso2.fit(X, y)
lasso3.fit(X, y)

# Print the regression coefficients and R^2 score
print("Regressor coefficients for alpha=0.1: ", lasso1.coef_)
print("R^2 score for alpha=0.1: ", lasso1.score(X, y))
print("Regressor coefficients for alpha=0.5: ", lasso2.coef_)
print("R^2 score for alpha=0.5: ", lasso2.score(X, y))
print("Regressor coefficients for alpha=1.0: ", lasso3.coef_)
print("R^2 score for alpha=1.0: ", lasso3.score(X, y))



Regressor coefficients for alpha=0.1:  [ 0.00127646  0.          0.16455764 -0.09915227  0.15457517]
R^2 score for alpha=0.1:  0.8806950892444032
Regressor coefficients for alpha=0.5:  [0.00252671 0.         0.03694263 0.         0.07244882]
R^2 score for alpha=0.5:  0.8466465200909636
Regressor coefficients for alpha=1.0:  [0.00381043 0.         0.         0.00941103 0.00862261]
R^2 score for alpha=1.0:  0.7993600535952389


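After running the Lasso models with different penalty multipliers, we can observe that increasing the regularization strength (the alpha parameter) lowers the R^2 score and shrinks the regression coefficients, setting some of them exactly to zero. Because of the column slicing, the target in these models is Width and the coefficients correspond, in order, to Weight, Length1, Length2, Length3 and Height; the R^2 scores are computed on the training data.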

For the Lasso model with alpha=0.1, the R^2 score is 0.881, and four of the five coefficients are non-zero. Only the Length1 coefficient is set to zero.

For the Lasso model with alpha=0.5, the R^2 score drops to 0.847, and the Length3 coefficient is also set to zero. The remaining non-zero coefficients belong to Weight, Length2 and Height.

For the Lasso model with alpha=1.0, the R^2 score drops further to 0.799. The Length1 and Length2 coefficients are zero, while Weight, Length3 and Height remain non-zero. The model has swapped Length2 for Length3, which is common when inputs are strongly correlated: Lasso tends to keep one variable from a correlated group, and which one it keeps can change with alpha.

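Overall, the Lasso models demonstrate how regularization can be used to perform feature selection. Because the inputs are not standardized, coefficient magnitudes reflect each variable's units as well as its importance, so the small Weight coefficient does not mean that Weight matters least. By adjusting the alpha parameter, we can control the level of regularization and trade model complexity against fit; choosing alpha well requires held-out data or cross-validation rather than the training R^2 reported here.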
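The relationship between alpha and sparsity can be sketched on synthetic data. This is an illustration under stated assumptions, not a rerun of the notebook: the data and true coefficients are made up, and `alpha_max` is a name introduced here for the smallest penalty at which scikit-learn's Lasso objective (which scales the squared error by 1/(2n)) has the all-zero coefficient vector as its solution.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)

# Synthetic data: two strong inputs, one weak input, two irrelevant inputs.
n = 200
X = rng.normal(size=(n, 5))
y = X @ np.array([3.0, 0.0, 1.0, 0.0, 0.2]) + rng.normal(scale=0.5, size=n)

alphas = [0.1, 0.5, 1.0]
scores = []
for alpha in alphas:
    model = Lasso(alpha=alpha).fit(X, y)
    scores.append(model.score(X, y))  # training R^2
    n_nonzero = int(np.sum(model.coef_ != 0))
    print("alpha=%.1f  non-zero coefficients=%d  R^2=%.3f"
          % (alpha, n_nonzero, scores[-1]))

# Above alpha_max every coefficient is exactly zero and only the intercept remains.
Xc = X - X.mean(axis=0)
yc = y - y.mean()
alpha_max = np.max(np.abs(Xc.T @ yc)) / n
all_zero = Lasso(alpha=1.01 * alpha_max).fit(X, y).coef_
print("alpha_max = %.3f, coefficients above it:" % alpha_max, all_zero)
```

Since alpha acts on the raw coefficient values, standardizing the inputs first (for example with scikit-learn's StandardScaler) makes the penalty treat all features on the same scale.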

What we learn from data:

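Overall, our analysis of the fish market dataset suggests that the physical measurements of a fish are strong linear predictors of one another. The regression on species, the three lengths, height and width explains about 90% of the variance in weight on the test set, and adding or removing the random columns changes this by only about 0.002. The optimization and Lasso tasks show the same kind of measurements predicting width with R^2 between 0.80 and 0.92, with Lasso retaining Weight and Height at every penalty level. Training on too few records (Model-4) reproduces the training data but generalizes poorly, so both the amount of training data and the regularization strength should be judged by held-out performance.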