Få ferdig clustering

Sammenligne resultater og kå me he lært.

Lese over all tekst.



In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import math
import matplotlib.pyplot as plt
import tensorflow as tf
from tensorflow import keras
from keras.models import Sequential
from keras.layers import Dense
from sklearn.preprocessing import MinMaxScaler,LabelEncoder
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from scipy.stats import norm
from sklearn.cluster import KMeans
from mlxtend.feature_selection import ExhaustiveFeatureSelector as EFS


In [2]:
df1 = pd. read_csv('B_sum_Trips,_c__Scoo_1730215883309.csv', sep=',', on_bad_lines='skip')
df2 = pd. read_csv('table-2.csv', sep=';', on_bad_lines='skip')
df = pd.concat([df1, df2], axis=1)

<div style="line-height: 2;">

### **Feature selection**

As previously discussed, the analysis focuses on cod and related information concerning cod fishing. Consequently, columns containing irrelevant data are excluded.

Initially, columns indicating the time of data reporting to the Norwegian Directorate of Fisheries and the associated report IDs are removed. While the reporting date is retained, information regarding the start and end times of fishing trips are removed as it does not align with the primary focus of the analysis, which is to determine the most effective tools and fishing locations rather than the time of day.

Additionally, several columns contain redundant information presented in different formats. For consistency, FDIR's standards for displaying information related to tools and fish are adopted. The decision to use the code version of these columns is to make the normalization processes easier.

The analysis retains additional relevant data, including the depth of the sea at the beginning and end of each trip, the distance covered by the boats during fishing, the duration of the trip, the quantity of cod caught (measured in kilograms), and the dimensions of the boats, including width and length.
</div>

In [3]:
#Meldingsklokkeslett', 'Starttidspunkt', 'Startklokkeslett', 'Startposisjon bredde', 'Startposisjon lengde', 'Hovedområde start', 'Lokasjon start (kode)', 'Stopptidspunkt', 'Stoppklokkeslett', 'Fangstår', 'Stopposisjon bredde', 'Stopposisjon lengde', 'Hovedområde stopp', 'Lokasjon stopp (kode)', 'Redskap FAO (kode)', 'Redskap FAO', 'Redskap FDIR', 'Hovedart FAO (kode)', 'Hovedart FAO','Hovedart - FDIR (kode)', 'Art FAO (kode)', 'Art FAO','Art - FDIR (kode)', 'Art - FDIR', 'Art - gruppe', 'Lengdegruppe (kode)', 'Lengdegruppe', 'Bruttotonnasje 1969', 'Bruttotonnasje annen']
features_to_remove = ['city', 'Tid(norsk normaltid)', 'Stasjon', 'Navn', 'Utilisation']
df.drop(labels=features_to_remove, axis=1, inplace=True)
df

Unnamed: 0,date,b_sum trips,Deployed Scooters,Fleet Availability,Battery Unlocks,Maksimumstemperatur (døgn),Middeltemperatur (døgn),Minimumstemperatur (døgn),Nedbør (døgn)
0,2022-01-01 00:00:00,2112.0,2859.0,,289.0,99,56,-02,98
1,2022-01-02 00:00:00,2746.0,2837.0,,432.0,98,81,76,91
2,2022-01-03 00:00:00,3649.0,2802.0,,351.0,78,45,11,55
3,2022-01-04 00:00:00,3482.0,2778.0,,429.0,47,3,12,159
4,2022-01-05 00:00:00,4579.0,2778.0,,397.0,33,2,09,183
...,...,...,...,...,...,...,...,...,...
1029,2024-10-26 00:00:00,9293.0,3260.0,0.910429,964.0,136,118,107,04
1030,2024-10-27 00:00:00,8689.0,3231.0,0.920458,734.0,127,89,71,268
1031,2024-10-28 00:00:00,13221.0,3233.0,0.903805,911.0,111,78,69,37
1032,2024-10-29 00:00:00,6862.0,3235.0,0.893354,560.0,103,73,21,09


<div style="line-height: 2;">

### **Removing and Checking for NaN values**

Here the "replace()" function is used to change the empty string values with a null value so that the built in dropna function can be used to remove all NaN values.

The imported "isna()" method is used to check for any NaN values left. False means that there are non NaN values in the column. From the results it can said that there are now no empty strings and NaN values.
</div>

In [4]:
df.replace(' ', pd.NA, inplace=True)
df = df.dropna()
df.isna().any()
df


Unnamed: 0,date,b_sum trips,Deployed Scooters,Fleet Availability,Battery Unlocks,Maksimumstemperatur (døgn),Middeltemperatur (døgn),Minimumstemperatur (døgn),Nedbør (døgn)
193,2022-07-13 00:00:00,5583.0,2767.0,0.976147,804.0,157,126,102,61
194,2022-07-14 00:00:00,6867.0,2757.0,0.984041,469.0,147,119,101,173
195,2022-07-15 00:00:00,6046.0,2767.0,0.972172,655.0,158,116,86,15
196,2022-07-16 00:00:00,5764.0,2751.0,0.973101,582.0,151,121,95,51
197,2022-07-17 00:00:00,3520.0,2735.0,0.966362,453.0,141,126,102,15
...,...,...,...,...,...,...,...,...,...
1028,2024-10-25 00:00:00,15477.0,3250.0,0.905538,927.0,13,115,11,0
1029,2024-10-26 00:00:00,9293.0,3260.0,0.910429,964.0,136,118,107,04
1030,2024-10-27 00:00:00,8689.0,3231.0,0.920458,734.0,127,89,71,268
1031,2024-10-28 00:00:00,13221.0,3233.0,0.903805,911.0,111,78,69,37


<div style="line-height: 2;">

### **Dtypes**

Below, the function "info()" is used to check the Dtype of the different columns. By doing this, it can be observed that there are 4 columns with the type "object".
The dataframe needs to be standardized by changing the Dtype of those columns to integer. This way, when normalizing the dataset later, problems with "object" type columns can be avoided.
</div>

In [5]:
df.info()
pd.options.mode.chained_assignment = None
df.describe()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 822 entries, 193 to 1032
Data columns (total 9 columns):
 #   Column                      Non-Null Count  Dtype  
---  ------                      --------------  -----  
 0   date                        822 non-null    object 
 1   b_sum trips                 822 non-null    float64
 2   Deployed Scooters           822 non-null    float64
 3   Fleet Availability          822 non-null    float64
 4   Battery Unlocks             822 non-null    float64
 5   Maksimumstemperatur (døgn)  822 non-null    object 
 6   Middeltemperatur (døgn)     822 non-null    object 
 7   Minimumstemperatur (døgn)   822 non-null    object 
 8   Nedbør (døgn)               822 non-null    object 
dtypes: float64(4), object(5)
memory usage: 64.2+ KB


Unnamed: 0,b_sum trips,Deployed Scooters,Fleet Availability,Battery Unlocks
count,822.0,822.0,822.0,822.0
mean,9151.418491,2904.59854,0.899524,739.942822
std,3844.128269,378.740755,0.042453,215.93719
min,1.0,1813.0,0.654862,33.0
25%,6359.5,2780.0,0.871228,599.25
50%,8950.0,2909.0,0.903365,738.0
75%,11843.25,3165.0,0.930886,885.25
max,22625.0,3633.0,1.013738,1333.0


<div style="line-height: 2;">

### **Standardizing the Dataset**

The function below is to standardize the dataframe. 

The first line finds all the columns that has the data type "object" and stores it in the variable "categorical_columns".

The second line initializes the LabelEncoder object which is used to encode categorical features into numerical values which will be our integer.

The for loop in line three loops through all columns with the Dtype "object" and use the "fit_transform()" method of "LabelEncoder" to transform all the categorical values into numerical labels and replaces the original values in the dataframe.

At last the "abs()" function is used to ensure that all values in the dataframe still are positive. 
</div>

In [6]:
categorical_columns = df.select_dtypes(include=['object']).columns
label_encoder = LabelEncoder()
for column in categorical_columns:
    df[column] = label_encoder.fit_transform(df[column])

df = df.abs()
df.info

<bound method DataFrame.info of       date  b_sum trips  Deployed Scooters  Fleet Availability  \
193      0       5583.0             2767.0            0.976147   
194      1       6867.0             2757.0            0.984041   
195      2       6046.0             2767.0            0.972172   
196      3       5764.0             2751.0            0.973101   
197      4       3520.0             2735.0            0.966362   
...    ...          ...                ...                 ...   
1028   817      15477.0             3250.0            0.905538   
1029   818       9293.0             3260.0            0.910429   
1030   819       8689.0             3231.0            0.920458   
1031   820      13221.0             3233.0            0.903805   
1032   821       6862.0             3235.0            0.893354   

      Battery Unlocks  Maksimumstemperatur (døgn)  Middeltemperatur (døgn)  \
193             804.0                          83                       75   
194             469

<div style="line-height: 2;">

### **Target Variable Correlation Analysis**

This code checks the correlation coefficients all the features have with the target "Rundvekt". This is used as a feature selection. 

</div>

In [7]:
correlation_matrix = df.corr()
target_correlation = correlation_matrix['b_sum trips']
print(target_correlation.drop('b_sum trips', errors='ignore'))

date                          0.357599
Deployed Scooters             0.611279
Fleet Availability           -0.204016
Battery Unlocks               0.827870
Maksimumstemperatur (døgn)   -0.141645
Middeltemperatur (døgn)      -0.014850
Minimumstemperatur (døgn)     0.244103
Nedbør (døgn)                -0.178841
Name: b_sum trips, dtype: float64


<div style="line-height: 2;">

### **Correlation Coefficients**

All values below 0.01 from the correlation analysis are then removed. A discovery made was also "Art - gruppe (kode)" returns NaN. This is because the whole dataframe is sorted around only the "torsk" value from the "Art - gruppe (kode)", making every value the same. Therefore this column is also removed due to not being of value anymore.

</div>

In [8]:
df.drop(['Middeltemperatur (døgn)', 'Battery Unlocks'], axis=1, inplace=True)

<div style="line-height: 2;">

### **MinMaxScaler**

All columns within the DataFrame have been confirmed to exclusively contain float or integer values, enabling us to proceed with data normalization. This is achieved by importing MinMaxScaler from the sklearn.preprocessing module.

Furthermore, an instance of the MinMaxScaler class is instantiated, and the DataFrame is scaled using this instance. The resulting scaled DataFrame is stored in a new variable.

Normalization is performed by applying the fit_transform() method of the MinMaxScaler object to the DataFrame df. This operation produces a scaled version of the original DataFrame, where each value falls within the range of 0 to 1.

Finally, the scaled data is converted back into a pandas DataFrame using the pd.DataFrame() method. The inclusion of the parameter columns=df.columns ensures the retention of the original column names within the new DataFrame.
</div>

In [9]:
minmax_scaler = MinMaxScaler()
df_minmax = pd.DataFrame(minmax_scaler.fit_transform(df), columns=df.columns)
df_minmax

Unnamed: 0,date,b_sum trips,Deployed Scooters,Fleet Availability,Maksimumstemperatur (døgn),Minimumstemperatur (døgn),Nedbør (døgn)
0,0.000000,0.246729,0.524176,0.895256,0.318008,0.366071,0.867521
1,0.001218,0.303483,0.518681,0.917250,0.283525,0.361607,0.311966
2,0.002436,0.267194,0.524176,0.884178,0.321839,0.941964,0.064103
3,0.003654,0.254729,0.515385,0.886766,0.295019,0.982143,0.807692
4,0.004872,0.155543,0.506593,0.867989,0.260536,0.366071,0.064103
...,...,...,...,...,...,...,...
817,0.995128,0.684052,0.789560,0.698505,0.218391,0.401786,0.000000
818,0.996346,0.410714,0.795055,0.712134,0.241379,0.388393,0.017094
819,0.997564,0.384017,0.779121,0.740078,0.206897,0.875000,0.538462
820,0.998782,0.584335,0.780220,0.693674,0.145594,0.866071,0.619658


<div style="line-height: 2;">

### **Splitting Dataset**

Next up is to start splitting the dataset into features and target.

"X" is split up in a way that makes it contain every column of the dataframe except "Rundvekt". This is called the features.
"y" on the other hand only contains "Rundvekt" and is called the target.

"train_test_split()" splits the features ("X") and target ("y") into random train and test subsets.

"X_train, X_test" are variables containing the features for training and testing.

"y_train, y_test" are variables containing the target for training and testing.

"test_size=0.2" specifies the proportion of the dataset to include in the test split. The split 0.2 means that 20% will be used to testing and 80% for training. 

"random_state=42" is a parameter that sets the random seed for reproducibility. It ensures that the results are reproducible. 

The prints at last visualizes the train and test splits on the features and target. 
</div>

In [10]:
X = df_minmax.loc[:, df_minmax.columns != 'b_sum trips']
y = df_minmax['b_sum trips']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

<div style="line-height: 2;">

### **Exhaustive Feature Selection for KNN**

What exhaustive feature selection does is that it goes through all possible feature combinations, and fits a K-Nearest Neighbors Regression model for each combination. The code then evaluates all the features using negative mean squared error, and selects the combination of features that give the best performance for a KNN regressor. 

</div>

In [11]:
run = True
X_not_scaled = df.loc[:, df.columns != 'b_sum trips']
y_not_scaled = df['b_sum trips']
X_train_not_scaled, X_test_not_scaled, y_train_not_scaled, y_test_not_scaled = train_test_split(X_not_scaled, y_not_scaled, test_size=0.2, random_state=42) 
def run_efs():
    test_model = RandomForestRegressor()

    efs = EFS(estimator=test_model,
          min_features=1,
          max_features=6,
          scoring='neg_mean_squared_error',
          print_progress=True,
          cv=5)
    efs =  efs.fit(X_train, y_train_not_scaled)
    print('Best subset (indices:)', efs.best_idx_)
    return list(efs.best_idx_)
if run == True:
    best_subset = run_efs()

Features: 63/63

Best subset (indices:) (0, 2, 3, 5)


<div style="line-height: 2;">

### **Exhaustive Feature Selection Results**

Result from Exhaustive feature selection was:
Features: 511/511 Best subset(indices:) (2, 4, 5, 6, 7, 8)

This will be the features used going forward for KNN.

The columns are put into a new variable and split once more.

</div>

In [12]:
X_knn = X.iloc[:, best_subset]
X_train_knn, X_test_knn, y_train_knn, y_test_knn = train_test_split(X_knn, y, test_size=0.2, random_state=42)

print("Train features shape:", X_train_knn.shape)
print("Test features shape:", X_test_knn.shape)
print("Train labels shape:", y_train_knn.shape)
print("Test labels shape:", y_test_knn.shape)

Train features shape: (657, 4)
Test features shape: (165, 4)
Train labels shape: (657,)
Test labels shape: (165,)


<div style="line-height: 2;">

### **Training KNN**

"KNeighborsRegressor()" is a regression algorithm used for making predictions based on the k nearest neighbors.

"n_neighbors()" is a parameter that specifies the number of neighbors to use for each query. In this case after trial and error the value with the best results were 25.

"knn.fit()" is a method used to train the KNN regressor using the training data hence the arguments being "X_train", "y_train". After training the KNN regressor is ready to make predictions.
</div>

In [13]:
knn = KNeighborsRegressor(n_neighbors=31)
knn.fit(X_train_knn, y_train_knn)

<div style="line-height: 2;">

### **Running KNN predictor**

The "predict()" uses the already trained KNN model ("knn") to make predictions on the test data ("X_test"). All of the predicted values will be stored in the variable "knn_y_pred".

Furthermore the predicted values are used to calculate different evaluation metrics. The evaluation metrics in use for KNN were **MSE**, **RMSE**, **MAE** and **R2**.
</div>

In [14]:
knn_y_pred = knn.predict(X_test_knn)

mse = mean_squared_error(y_test_knn, knn_y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_test_knn, knn_y_pred)
r2 = r2_score(y_test_knn, knn_y_pred)

print("Mean Squared Error:", mse)
print("Root Mean Squared Error:", rmse)
print("Mean Absolute Error:", mae)
print("R-squared:", r2)




Mean Squared Error: 0.013703693881808358
Root Mean Squared Error: 0.11706277752474678
Mean Absolute Error: 0.09393200552774172
R-squared: 0.490626253510143


# <div style="line-height: 2;">

### **KNN Results**

**Mean Squared Error (MSE)**

**MSE = 0.00014790509832180316**

The low MSE indicates that, on average, the squared differences between the predicted and actual values are small. This suggests a high level of accuracy in the model's predictions. However, it's important to note that the "Rundvekt" column which is the target has a large variance in the data. The values range from low thousands to multiple hundred thousands. This large variance means that the MinMaxScaler will reduce the differences between values, making most values appear similar, except the large ones. Which can effect the results of MSE.

**Root Mean Squared Error (RMSE)**

**RMSE = 0.012161624000182013**

RMSE is the square root of the MSE and provides error metrics in the same units as the data. Since the MSE is low, the RMSE is also low. Unlike the MSE, which gives the mean squared difference between predicted and actual values, RMSE takes the square root of this value. This makes RMSE a better representation of the average error in the same units as the target variable, providing a more accurate measure of model performance.

**Mean Absolute Error (MAE)**

**MAE = 0.007042982517260722**

MAE provides a straightforward interpretation of the average error magnitude, representing the average absolute difference between predicted values and actual values. A MAE score of 0.008 indicates that, on average, the predictions deviate from the actual values by 0.008 units. This relatively low score suggests a good level of accuracy in the model's predictions.

However, it's important to note that, similar to MSE and RMSE, the MAE may be influenced by the similarity of many values in the data. This is an important consideration when interpreting the results.

**R-squared (R²)**

**R2 = 0.25201395870508125**

R-squared is a statistical measure of how well the regression predictions approximate the real data points. An R2 of 0.252 means that only about 25.2% of the variance in the dependent variable is explained by the model. This is generally considered low and suggests that the model, while having a low error magnitude, does not do a great job at explaining the variability of the response data around its mean.

### Overall Interpretation:
**Accuracy and Precision**: The models appears to be quite precise given the low values of MSE, RMSE, and MAE. This means that the model makes predictions that are very close to the actual values, and the size of the errors made by the model is minimal.
**The Models Explanatory Power**: The low R² value is a concern because it indicates that the model does not explain much of the variance in the target variable. Despite making accurate predictions on average, it might not capture all the underlying patterns in the data, this again would be due to some rows containing way higher values than the rest, making the other observations more cramped.

### Potential fixes:

**Data Quality and Quantity** 
The largest problem the models run into is most likely due to the big variety of values in the "Rundvekt" column. To get better values, one modification could be to not only sort the dataset by "Torsk", but also sort the data by boat size. This way the dataset would be based on the same boat size and around the same amount of fish caught, not have the large differences in values.
</div>

<div style="line-height: 2;">

### **Splitting Dataset** 

Splitting into subsets of features and targets for the Random Forest Regressor.
</div>

In [19]:
X_rf = X.iloc[:, best_subset]
print(X)
X_train_rf, X_test_rf, y_train_rf, y_test_rf = train_test_split(X_rf, y, test_size=0.2, random_state=42)

         date  Deployed Scooters  Fleet Availability  \
0    0.000000           0.524176            0.895256   
1    0.001218           0.518681            0.917250   
2    0.002436           0.524176            0.884178   
3    0.003654           0.515385            0.886766   
4    0.004872           0.506593            0.867989   
..        ...                ...                 ...   
817  0.995128           0.789560            0.698505   
818  0.996346           0.795055            0.712134   
819  0.997564           0.779121            0.740078   
820  0.998782           0.780220            0.693674   
821  1.000000           0.781319            0.664553   

     Maksimumstemperatur (døgn)  Minimumstemperatur (døgn)  Nedbør (døgn)  
0                      0.318008                   0.366071       0.867521  
1                      0.283525                   0.361607       0.311966  
2                      0.321839                   0.941964       0.064103  
3                      

<div style="line-height: 2;">

### **Random Forest Regressor** 

First line is initializing the "RandomForestRegressor()" with "n_estimators=100" meaning the model will have a 100 trees. Furthermore "random_state=42" ensures that the results are reproducible. 

Then the model is trained using "fit()", as previously shown with KNN.

"predict()" makes predictions on the test data "X_test" using the Random Forest Regressor. The predicted values are stored in the variable rf_y_pred.

As done previously with KNN, the predicted data is used with the evaluation metrics "mean_squared_error" and "r2_score".

</div>

In [20]:
rf = RandomForestRegressor(n_estimators = 100, random_state = 42)
rf.fit(X_train_rf, y_train_rf)

rf_y_pred = rf.predict(X_test_rf)
rf_mse = mean_squared_error(y_test_rf, rf_y_pred)
rf_r2 = r2_score(y_test_rf, rf_y_pred)

print("Mean Squared Error:", rf_mse)
print("R^2 Score:", rf_r2)

Mean Squared Error: 0.009900299775997549
R^2 Score: 0.6320004787200422


<div style="line-height: 2;">

### **Random Forest Regressor Results**

On the **Mean Squared Error(MSE)** the results are 0.000136. This result is slightly better than the KNN results. The results suggest a high level of accuracy.

**R-squared(R2)** got a 0.3114. An R2 of 0.3114 means that only about 31.14% of the variance in the dependent variable is explained by the model. This is a better result than KNN, but it is still considered low. 

The same error source from KNN about the values being spread very far apart will still have an effect on these evaluation metrics and will have to be kept in mind.
</div>

<div style="line-height: 2;">

### **Splitting Dataset**

Splitting into subsets of features and targets for the Feed forward neural network.


</div>

In [17]:
#Edit Features to be the best subset based on results from testing
X_fnn = X.iloc[:, best_subset]
#Splitting the data again because the Features have changed
X_train_fnn, X_test_fnn, y_train_fnn, y_test_fnn = train_test_split(X_fnn, y, test_size=0.2, random_state=42)

<div style="line-height: 2;">

### **Feed Forward Neural Network**

**Architecture**

First the FNN_model is defined using the "Sequential" class from keras. Furthermore "Dense" is used which defines fully connected layers in the neural network.

"Dense" is first used to create the input layer with 64 neurons, using "tanh" as the activation function. "tanh" is a good activation key for datasets with values between 0 and 1. The input shape of the layer is determined by "input_shape()" which is set to "X_train.shape[1]". This means that it is set to the amount of features in the input data.

Further there is 3 more hidden layers with 128, 32 and 6 neurons and all use "tanh" as the activation function.

The last layer is an output layer with 1 neuron and no activation function. The layer is used to counter regression problems.

**Compiling the Model**

The model is compiled using the "compile" method, and the optimizer used is "adam". "Adam" is an optimization algorithm that is used to update the weights of the neural network during training in order to minimize the loss of the function.

The "loss" used for the compiler is "mse" and is used to predict the values of the model and the actual value in the training data.

**Training the Model**

The model is trained using the "fit()" method. With the "fit()" method there are defined 50 "epochs" which are how many complete passes the method goes through the entire training dataset. Furthermore the training data is divided into batches of size 32 using "batch_size=32". At last the "validation_split=0.2" which means that 20% of the training data is not used for training, rather it is used to evaluate the models performance and monitoring for overfitting. 

**Evaluating the model**

The trained model is evaluated on the test data using the "evaluate" method. The model is evaluated using the mean squared error(MSE) as used previously. 


</div>

In [18]:
#Feed forward Neural Network
FNN_model = Sequential([
    Dense(64, activation='tanh', input_shape=(X_train_fnn.shape[1],)),
    Dense(128, activation='tanh'),
    Dense(32, activation='tanh'),
    Dense(6, activation='tanh'),
    Dense(1)
])

FNN_model.compile(optimizer='adam', loss='mse')

FNN_model.fit(X_train_fnn, y_train_fnn, epochs=50, batch_size=32, validation_split=0.2)

FNN_mse = FNN_model.evaluate(X_test_fnn, y_test_fnn)
print('Test MSE:', FNN_mse)

Epoch 1/50


  super().__init__(activity_regularizer=activity_regularizer, **kwargs)


[1m17/17[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m1s[0m 5ms/step - loss: 0.3406 - val_loss: 0.0496
Epoch 2/50
[1m17/17[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step - loss: 0.0394 - val_loss: 0.0131
Epoch 3/50
[1m17/17[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step - loss: 0.0121 - val_loss: 0.0086
Epoch 4/50
[1m17/17[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step - loss: 0.0094 - val_loss: 0.0078
Epoch 5/50
[1m17/17[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step - loss: 0.0075 - val_loss: 0.0078
Epoch 6/50
[1m17/17[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step - loss: 0.0074 - val_loss: 0.0077
Epoch 7/50
[1m17/17[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step - loss: 0.0082 - val_loss: 0.0082
Epoch 8/50
[1m17/17[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m0s[0m 1ms/step - loss: 0.0076 - val_loss: 0.0081
Epoch 9/50
[1m17/17[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1

<div style="line-height: 2;">

### **Feed Forward Neural Network Results**
 
**Mean Squared Error(MSE)**

The evaluation of the feed forward neural network using Mean Squared Error (MSE) resulted in an MSE of 0.000171. MSE represents the average of the squares of the errors between the predicted and actual values.

This low MSE indicates a good level of accuracy in the predictions. However, it's important to consider that many values in the dataset are very similar, which may affect the interpretation of the results. This makes is difficult to say how accurate the model really is.
</div>

<div style="line-height: 2;">

### **Pre processing for the clustering model.**

Using the print(X_fnn) to see which features were used in the KNN model and using the same for clustering

</div>

In [19]:
print(X_fnn)
df_clustering = df_minmax[['Havdybde start', 'Havdybde stopp', 'Trekkavstand', 'Redskap FDIR (kode)', 'Bredde', 'Fartøylengde']]

     Deployed Scooters  Fleet Availability  Battery Unlocks  \
0             0.524176            0.895256         0.593077   
1             0.518681            0.917250         0.335385   
2             0.524176            0.884178         0.478462   
3             0.515385            0.886766         0.422308   
4             0.506593            0.867989         0.323077   
..                 ...                 ...              ...   
817           0.789560            0.698505         0.687692   
818           0.795055            0.712134         0.716154   
819           0.779121            0.740078         0.539231   
820           0.780220            0.693674         0.675385   
821           0.781319            0.664553         0.405385   

     Maksimumstemperatur (døgn)  
0                      0.318008  
1                      0.283525  
2                      0.321839  
3                      0.295019  
4                      0.260536  
..                          ...  
817  

KeyError: "None of [Index(['Havdybde start', 'Havdybde stopp', 'Trekkavstand',\n       'Redskap FDIR (kode)', 'Bredde', 'Fartøylengde'],\n      dtype='object')] are in the [columns]"

<div style="line-height: 2;">

### **Finding Optimal Amount of Clusters**

**Iterate Over Values of K**

The clustering model starts with looping through the range of clusters(i). For each value of k, a KMeans clustering model is initialized with "n_clusters=i". The model is then fitted to the scaled data "df_clustering". The inertia of the fitted model is then calculated and stored in the empty list "inertias".

**Plotting the Results**

At last the results of fitting the dataset and running them trough "KMeans" are plotted into a graph. This graph can be used to determine the "elbow point" of the clustering. This is where the graph slows down the negative trend. The amount of clusters at the "elbow point" will be the ideal amount of "n_clusters" to use in "KMeans()" later.

</div>

In [None]:
#K-means clustering and testing how many clusters to have
num_clusters = list(range(1,10))
inertias = []
for i in num_clusters:
    kmeans = KMeans(n_clusters=i, random_state=42, n_init='auto')

    kmeans.fit(df_clustering)
    inertias.append(kmeans.inertia_)

plt.plot(num_clusters, inertias, '-o')

plt.xlabel('Number of Clusters (k)')
plt.ylabel('Inertia')

plt.show()

<div style="line-height: 2;">

### **Clustering Model**

</div>

In [None]:
#Training the clustering model based on test results
cluster_labels = kmeans.predict(df_clustering)
df_clustering['Cluster'] = cluster_labels

kmeans = KMeans(n_clusters=3, random_state=42, n_init='auto')
kmeans.fit(df_clustering)
sns.pairplot(df_clustering, hue='Cluster', palette='viridis')
plt.show()

In [None]:
import json

with open('INFO284-Group-exam-done(nesten).ipynb') as json_file:
    data = json.load(json_file)

wordCount = 0
for each in data['cells']:
    cellType = each['cell_type']
    if cellType == "markdown":
        content = each['source']
        for line in content:
            temp = [word for word in line.split() if "#" not in word] # we might need to filter for more markdown keywords here
            wordCount = wordCount + len(temp)
            
print(wordCount)