## Phase 3 Predictions

### 🛠️ Preprocessing

#### 💡 Feature selection

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# Combine the relevant columns into a new DataFrame for correlation analysis
correlation_df = finalDf[['Total guests', 'PSV_Count', 'Effenaar_Count', 'Temperature', 'Rain', 'Duration rain', 'Max rain', 'Wind']]

# Compute the correlation matrix
correlation_matrix = correlation_df.corr()

# Generate a heatmap of the correlation matrix
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f", linewidths=0.5)
plt.title('Correlation Map')
plt.show()

# Encode the 'Day' column into numerical values (requires 'Day' to have a categorical dtype)
day_encoded = finalDf['Day'].cat.codes

# Calculate the correlation between 'Day' and 'Total guests'
day_guests_correlation = day_encoded.corr(finalDf['Total guests'])

print("Correlation between Day and Total guests:", day_guests_correlation)

As of now, the type of day and the number of Effenaar events are the most important features, with Effenaar events showing the highest correlation with the total number of guests. These observations may change as more data comes in, but for testing purposes this setup suffices.
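
One caveat about the `Day` correlation above: `cat.codes` assigns integers according to the category order, so the Pearson correlation depends on that order. The sketch below uses a small made-up DataFrame (the values and the names `toy`, `order_a` and `order_b` are hypothetical, not taken from `finalDf`) to show the same data giving a positive or a negative correlation depending on the ordering, while the per-day means stay the same.

```python
import pandas as pd

# Hypothetical toy data: guest numbers peak on Saturday
toy = pd.DataFrame({
    "Day": ["Mon", "Tue", "Sat", "Sun", "Mon", "Tue", "Sat", "Sun"],
    "Total guests": [100, 110, 300, 150, 90, 120, 310, 160],
})

# Two different category orders for exactly the same observations
order_a = pd.Categorical(toy["Day"], categories=["Mon", "Tue", "Sat", "Sun"])
order_b = pd.Categorical(toy["Day"], categories=["Sat", "Mon", "Tue", "Sun"])

# Pearson correlation between the integer codes and the target
corr_a = pd.Series(order_a.codes, index=toy.index).corr(toy["Total guests"])
corr_b = pd.Series(order_b.codes, index=toy.index).corr(toy["Total guests"])
print("Correlation with order A:", round(corr_a, 2))
print("Correlation with order B:", round(corr_b, 2))

# An order-independent view: the mean number of guests per day
day_means = toy.groupby("Day")["Total guests"].mean()
print(day_means)
```

For a nominal feature such as the day of the week, comparing group means like this is often easier to interpret than a single correlation coefficient.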

In [None]:
# Define features and target variables
features = ["Day", "Effenaar_Count"]
target = "Total guests"

X = finalDf[features]
y = finalDf[target]

#### 🪓 Splitting into train/test
Before the model can be trained, a small part of the data is set aside for testing. The reasoning is that the model trains on, for example, 80% of the available data and is then asked to predict the target variable for the remaining 20%. Because the true target values of that 20% are known, we can compare the predictions with the ground truth and determine how well the model performs.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=1)
print("There are in total", len(X), "observations, of which", len(X_train), "are now in the train set, and", len(X_test), "in the test set.")

#### ⚖️ Scaling
Other machine learning algorithms may need scaling, but linear regression usually does fine without it, because it fits a mathematical formula whose coefficients adapt to features in different units. For visualization purposes scaling may still be required, or plots may look bad. For now, no scaling is applied.
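
To illustrate the claim above, here is a minimal sketch on synthetic data (the arrays `X_demo` and `y_demo` are made up for the example and unrelated to `finalDf`). It fits one ordinary least-squares model on raw features and one on standardized features and checks that both produce the same predictions; only the coefficients differ.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical features on very different scales
X_demo = np.column_stack([
    rng.normal(15, 5, 50),             # e.g. a temperature-like column
    rng.integers(0, 4, 50) * 1000.0,   # e.g. a count expressed in large units
])
y_demo = 200 + 3 * X_demo[:, 0] + 0.05 * X_demo[:, 1] + rng.normal(0, 5, 50)

# Model 1: fitted on the raw features
raw_model = LinearRegression().fit(X_demo, y_demo)

# Model 2: fitted on standardized features (zero mean, unit variance)
scaler = StandardScaler().fit(X_demo)
scaled_model = LinearRegression().fit(scaler.transform(X_demo), y_demo)

raw_pred = raw_model.predict(X_demo)
scaled_pred = scaled_model.predict(scaler.transform(X_demo))

# The predictions agree up to floating-point error; the coefficients do not
same_predictions = np.allclose(raw_pred, scaled_pred)
print("Same predictions:", same_predictions)
print("Raw coefficients:   ", raw_model.coef_)
print("Scaled coefficients:", scaled_model.coef_)
```

This holds for plain least squares; regularized or distance-based models generally do depend on the feature scale.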

#### 🆔 Encoding
Because machine learning algorithms generally work only with numeric values, non-numeric input data often needs to be encoded, which means turning it into numeric representations (codes). Here the `Day` column is converted into dummy (0/1) columns.

In [None]:
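X_train_encoded = pd.get_dummies(X_train, columns=['Day'], drop_first=True)
# Encode the test set without dropping a level, then align it to the training columns
# so both sets have identical columns even if a day is missing from one of them
X_test_encoded = pd.get_dummies(X_test, columns=['Day']).reindex(columns=X_train_encoded.columns, fill_value=0)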
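
When train and test sets are encoded separately, their dummy columns can differ if a day happens to be absent from one of them (this applies when `Day` is stored as plain strings; a categorical dtype carries all its categories along). The sketch below uses two tiny made-up frames, `train_demo` and `test_demo`, to show the mismatch and one possible way to align the test columns to the training columns.

```python
import pandas as pd

# Hypothetical splits: the test set only contains Saturdays
train_demo = pd.DataFrame({"Day": ["Mon", "Sat", "Sun"], "Effenaar_Count": [0, 2, 1]})
test_demo = pd.DataFrame({"Day": ["Sat", "Sat"], "Effenaar_Count": [1, 0]})

# Training set: drop the first level ("Mon") to avoid a redundant column
train_enc = pd.get_dummies(train_demo, columns=["Day"], drop_first=True)

# Naive approach: drop_first on the test set removes its only level ("Sat")
test_naive = pd.get_dummies(test_demo, columns=["Day"], drop_first=True)
print("Train columns:     ", list(train_enc.columns))
print("Naive test columns:", list(test_naive.columns))

# Alternative: keep all levels in the test set, then align to the training columns
test_aligned = pd.get_dummies(test_demo, columns=["Day"]).reindex(
    columns=train_enc.columns, fill_value=0
)
print(test_aligned)
```

Days that never occur in the training set are dropped by the alignment and therefore treated like the baseline day, which is a simplification worth keeping in mind.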

### 🧬 Modelling
In this step only the train set is used to fit the model, which in this case is scikit-learn's linear regression implementation, [LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html). After that, the test set is used to calculate the model's score, in other words how well it performs. For regression problems the score is the [coefficient of determination](https://en.wikipedia.org/wiki/Coefficient_of_determination), denoted *R²*. Values closer to 1 are better; a score of exactly 1 (a perfect fit) is rarely achievable in practice, and the score can even be negative when the model performs worse than always predicting the mean.

In [None]:
from sklearn.linear_model import LinearRegression
model = LinearRegression()
result = model.fit(X_train_encoded, y_train)
score = model.score(X_test_encoded, y_test)
print("R²:", score)
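
To make the *R²* score more concrete, the following sketch computes it by hand on four made-up values (`y_true_demo` and `y_pred_demo` are hypothetical, not model output) and compares the result with scikit-learn's `r2_score`, which uses the same definition as `model.score` for regressors.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical true and predicted guest counts
y_true_demo = np.array([120.0, 150.0, 90.0, 200.0])
y_pred_demo = np.array([110.0, 160.0, 100.0, 180.0])

# Residual sum of squares: how far the predictions are from the truth
ss_res = np.sum((y_true_demo - y_pred_demo) ** 2)

# Total sum of squares: how far the truth is from its own mean
ss_tot = np.sum((y_true_demo - y_true_demo.mean()) ** 2)

# R² = 1 - SS_res / SS_tot
r2_manual = 1 - ss_res / ss_tot
print("Manual R²: ", r2_manual)
print("sklearn R²:", r2_score(y_true_demo, y_pred_demo))

# Always predicting the mean gives R² = 0, the baseline the model has to beat
r2_baseline = r2_score(y_true_demo, np.full_like(y_true_demo, y_true_demo.mean()))
print("Baseline R²:", r2_baseline)
```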

### 🔬 Evaluation
Now the model predicts the total number of guests for each observation in the test set. Since the true number of guests is known, the truth can be compared with the prediction to calculate an error, meaning *"how far away is the prediction from the truth?"*. Note that the error is absolute (non-negative), and in this example it is also cast to an integer for legibility.

In [None]:
# Step 1: Make predictions using the trained model on the test data
predictions = model.predict(X_test_encoded)

# Step 2: Create a DataFrame to store the true total guests, predicted total guests, and the error
prediction_overview = pd.DataFrame()
prediction_overview["truth"] = y_test
prediction_overview["prediction"] = predictions

# Step 3: Calculate the absolute error
prediction_overview["error"] = prediction_overview["truth"] - prediction_overview["prediction"]
prediction_overview["error"] = abs(prediction_overview["error"]).astype(int)

# Step 4: Reset the index of the DataFrame
prediction_overview = prediction_overview.reset_index(drop=True)

# Display the prediction overview DataFrame
print(prediction_overview)

# Plot the true amount against the predicted amount, with a fitted regression line
plot = sns.regplot(y=y_test.values.flatten(), x=predictions.flatten(), line_kws={"color": "r"})
plot.set_xlabel("predicted amount")
plot.set_ylabel("true amount")
plot

import math
from sklearn.metrics import max_error, mean_squared_error

me = max_error(y_test, predictions)
me = math.ceil(me)
print("Max Error:", me)

mse = mean_squared_error(y_test, predictions)
rmse = math.sqrt(mse)
rmse = math.ceil(rmse)
print("Root Mean Squared Error:", rmse)
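
As a small worked example of the two metrics above, the sketch below computes the max error and the root mean squared error by hand on four made-up values (`truth_demo` and `pred_demo` are hypothetical) and checks them against the scikit-learn functions used in this notebook.

```python
import numpy as np
from sklearn.metrics import max_error, mean_squared_error

# Hypothetical true and predicted guest counts
truth_demo = np.array([120.0, 150.0, 90.0, 200.0])
pred_demo = np.array([110.0, 160.0, 100.0, 180.0])

# Absolute error per observation
abs_errors = np.abs(truth_demo - pred_demo)

# Max error: the single worst prediction
max_err_manual = abs_errors.max()

# RMSE: square the errors, average them, take the square root
rmse_manual = np.sqrt(np.mean((truth_demo - pred_demo) ** 2))

print("Absolute errors:", abs_errors)
print("Max error:", max_err_manual, "| sklearn:", max_error(truth_demo, pred_demo))
print("RMSE:", rmse_manual, "| sklearn:", np.sqrt(mean_squared_error(truth_demo, pred_demo)))
```

Because the errors are squared before averaging, RMSE weighs large misses more heavily than small ones, which is why it sits between the typical error and the max error here.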

#### 🗳️ Conclusion
The first model, even though it was quite basic and used only a few factors, showed some good results. It explained about 77% of the variation in the number of guests, which is a decent start. It suggests that the type of day and the number of events at Effenaar have a noticeable effect on how many guests show up.

To make our predictions even better, new things have to be tried. First, more factors could be added to the model, like weather conditions or special occasions happening nearby. Getting more data would also help: the more information the model has to learn from, the better its predictions can become.

Trying different ways of making predictions is also important. The first model was pretty simple, but there are other methods out there that might work better for the data.

So, while the first model did okay, there's still lots of room to make it better. By adding more factors, getting more data, and trying different methods, we hope to build a model that can predict the number of guests even more accurately in the future.