<div style="width:100%;height:30px;background-color:#E31134"></div>


## Minimal Preprocessing

Since the model will not work with missing values in the dataset, minimal preprocessing is required.

To fill missing values in both the training and testing datasets, we use scikit-learn's `SimpleImputer` with the `most_frequent` strategy, which replaces each missing value with the most frequent value of its column. This keeps the data complete for further analysis and modeling.
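The imputation step could look roughly like the sketch below. The toy frames and column values are made up for illustration, and fitting the imputer on the training data and reusing it for the test data is an assumption; the text above does not say whether the two datasets were imputed separately.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frames standing in for the real training and test data.
train = pd.DataFrame({"season": [1.0, 1.0, np.nan, 2.0],
                      "temp": [0.3, np.nan, 0.3, 0.5]})
test = pd.DataFrame({"season": [np.nan, 2.0],
                     "temp": [0.4, np.nan]})

# Learn the most frequent value of each column from the training data.
imputer = SimpleImputer(strategy="most_frequent")
train_imputed = pd.DataFrame(imputer.fit_transform(train), columns=train.columns)

# Assumption: reuse the values learned on the training data for the test data.
test_imputed = pd.DataFrame(imputer.transform(test), columns=test.columns)

print(train_imputed.isna().sum().sum(), test_imputed.isna().sum().sum())
```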



## Splitting Data into Features and Labels

We split the training data into features and labels. The features, denoted as `train_features`, consist of the first 13 columns of the training dataset. The labels, represented by `train_labels`, correspond to the last column, capturing the target variable.

Similarly, we split the test data into features (`test_features`) and labels (`test_labels`) using the same logic, ensuring consistency between the training and testing datasets for subsequent modeling.
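A minimal sketch of this split, assuming the label sits in the last of 14 columns. The frame below is a made-up placeholder with generic column names, not the real dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: 13 feature columns followed by one label column.
columns = [f"f{i}" for i in range(13)] + ["cnt"]
train = pd.DataFrame(np.arange(3 * 14).reshape(3, 14), columns=columns)

train_features = train.iloc[:, :13]  # first 13 columns
train_labels = train.iloc[:, -1]     # last column (target variable)

print(train_features.shape, train_labels.name)
```

The same two `iloc` expressions applied to the test frame give `test_features` and `test_labels`.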




## Model Prediction on Test Data

We make predictions on the test examples using a baseline model. The variable `ypred` holds the predicted values generated by the model applied to the test features. These predictions are then compared with the actual test labels to assess the model's performance.


## Visualizing Model Output

We visualize the model output and compare it with the actual test labels. The `matplotlib.pyplot` library is employed to create a plot, where the blue line represents the actual test labels, and the orange line represents the predicted values (`ypred`). This visualization aids in assessing the model's ability to capture the underlying patterns in the data.
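A plot of this kind might be produced as sketched below. The values are invented stand-ins for `test_labels` and `ypred`, and the axis labels are assumptions; matplotlib's default color cycle draws the first line in blue and the second in orange.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical stand-ins for test_labels and ypred.
test_labels = np.array([3894, 1200, 5100, 4300])
ypred = np.array([3407, 2100, 4900, 4800])

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(test_labels, label="actual")   # first line: blue by default
ax.plot(ypred, label="predicted")      # second line: orange by default
ax.set_xlabel("test example")
ax.set_ylabel("bike rentals")
ax.legend()
# In a notebook the figure renders inline; in a script, call plt.show().
```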

#### What we can observe from the plot:

While the model generally follows the trend in the data, the predictions are not as accurate as desired. The deviation of the predictions from the test data ranges from -1252 (underestimation of bike rentals) to 1663 (overestimation), and the prevalence of positive values suggests a general tendency to overestimate. This suggests there is room for improvement: further exploration, feature engineering, or model tuning will be necessary to improve accuracy and reduce these discrepancies.
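The extremes quoted above can be read off the deviation between predictions and labels. The sketch below uses invented numbers in place of `ypred` and `test_labels`, so its output does not reproduce the figures from the plot.

```python
import numpy as np

# Hypothetical values; in the notebook these would be test_labels and ypred.
actual = np.array([3894.0, 1200.0, 5100.0])
predicted = np.array([3407.0, 2100.0, 4900.0])

deviation = predicted - actual  # negative values mean underestimation

print("largest underestimation:", deviation.min())
print("largest overestimation:", deviation.max())
print("share of overestimated examples:", (deviation > 0).mean())
```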


## Single Example Prediction and Evaluation

We demonstrate the model's ability to predict a single example from the test data. The features of the first test example are displayed, and the model's prediction is compared with the actual label. 

- Features of the example: `test_features.iloc[0,:]`
- Predicted label: `predicted_value`
- Actual label: `test_labels.iloc[0]`
- Deviation of predicted from actual value: `predicted_value - test_labels.iloc[0]`

This analysis provides insight into the model's performance on individual instances, helping to understand its predictive accuracy.
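The steps listed above can be sketched as follows. The data and the `LinearRegression` model are placeholders chosen for illustration; the baseline model used in the notebook is not specified in this section.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data and model standing in for the notebook's baseline.
train_features = pd.DataFrame({"temp": [0.1, 0.2, 0.3, 0.4]})
train_labels = pd.Series([1000.0, 2000.0, 3000.0, 4000.0])
test_features = pd.DataFrame({"temp": [0.25]})
test_labels = pd.Series([2600.0])

model = LinearRegression().fit(train_features, train_labels)

# iloc[[0]] keeps a one-row DataFrame, since predict expects 2D input.
predicted_value = model.predict(test_features.iloc[[0]])
deviation = predicted_value - test_labels.iloc[0]

print("Predicted label:", predicted_value)
print("Actual label:", test_labels.iloc[0])
print("Deviation:", deviation)
```

Because `predict` returns an array, the predicted label prints in brackets even for a single example.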


- **Predicted Label:** [3406.9932736]
- **Actual Label:** 3894
- **Deviation from Actual Value:** -487.006726

The model predicted a bike rental count of approximately 3407 for this example, while the actual count was 3894. This is an underestimation of about 487 rentals, roughly 12.5% of the actual value. A single example cannot establish overall accuracy, but an error of this size points to room for improvement.

## Model Evaluation

We evaluate the performance of the model using two metrics:

### Mean Absolute Error (MAE) & Coefficient of Determination (R^2)

The Mean Absolute Error is a measure of the average absolute differences between the predicted and actual values. A lower MAE indicates better model performance.


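```python
from sklearn.metrics import mean_absolute_error

# Average absolute difference between actual and predicted rental counts
mae = mean_absolute_error(test_labels, ypred)
print('MAE: %.3f' % mae)
```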
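To make both metrics concrete, the sketch below computes them by hand on a small made-up example and checks the result against scikit-learn's `mean_absolute_error` and `r2_score`. The arrays are illustrative only and unrelated to the bike rental data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Hypothetical actual and predicted values.
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_hat = np.array([110.0, 190.0, 330.0, 370.0])

# MAE: mean of the absolute errors.
mae_manual = np.mean(np.abs(y_true - y_hat))

# R^2: 1 minus residual sum of squares over total sum of squares.
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print('MAE: %.3f' % mae_manual)   # MAE: 20.000
print('R^2: %.3f' % r2_manual)    # R^2: 0.960

# The library functions agree with the manual computation.
assert np.isclose(mae_manual, mean_absolute_error(y_true, y_hat))
assert np.isclose(r2_manual, r2_score(y_true, y_hat))
```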

### Model Evaluation Results

After evaluating the model on the test data, we obtained the following metrics:

- **Mean Absolute Error (MAE):** 1054.862
  - The MAE represents the average absolute difference between the predicted and actual values; lower is better. Here, the predictions are off by about 1055 bike rentals on average.

- **Coefficient of Determination (R^2):** 0.4002
  - The R^2 value measures the proportion of the variance in the bike rental counts that the model can explain. A higher R^2 value (closer to 1.0) indicates a better fit. In this instance, the obtained R^2 value of 0.4002 suggests that the model explains 40.02% of the variance in the test data.

These results offer an assessment of the model's performance, indicating areas for potential improvement or refinement.


## 2. Preprocessing

#### Handling Outliers in Humidity and Windspeed

To achieve better results with our model, we started by preprocessing our training data. First, we counted all values of the 'hum' feature that exceeded 100 and found a total of 229 outliers. We then counted the negative values in the 'windspeed' feature, which amounted to 4.

- Humidity Outliers: 229
- Windspeed Outliers: 4

Given that 229 of the 600 values in the 'hum' data are above 100, we expect the model's performance to improve if we drop the feature altogether. The 'windspeed' feature, with only 4 negative values, is handled by replacing the negative values with the median of windspeed.
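A sketch of both steps on a made-up frame is shown below. Computing the median from the non-negative values only is an assumption; the text does not say whether the invalid values were included in the median.

```python
import pandas as pd

# Hypothetical training frame with invalid humidity and windspeed values.
train = pd.DataFrame({"hum": [55.0, 120.0, 80.0, 101.0],
                      "windspeed": [12.0, -3.0, 8.0, 10.0]})

# Count the outliers in each feature.
hum_outliers = (train["hum"] > 100).sum()
wind_outliers = (train["windspeed"] < 0).sum()
print("Humidity Outliers:", hum_outliers)
print("Windspeed Outliers:", wind_outliers)

# Drop the humidity feature altogether.
train = train.drop(columns=["hum"])

# Assumption: the median is taken over the valid (non-negative) values.
valid_median = train.loc[train["windspeed"] >= 0, "windspeed"].median()
train.loc[train["windspeed"] < 0, "windspeed"] = valid_median
```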





#### Imputing Missing Values in 'season' Based on 'dteday'

To handle missing values in the 'season' feature of our training data, we implemented an imputation strategy using the 'dteday' (date) feature. For each missing value in 'season', we used the associated date to determine the season.

The imputation process is outlined as follows:
- If the date falls between December 21 and March 20, the season is assigned as 1 (Winter).
- If the date falls between March 21 and June 20, the season is assigned as 2 (Spring).
- If the date falls between June 21 and September 20, the season is assigned as 3 (Summer).
- If the date falls between September 21 and December 20, the season is assigned as 4 (Fall).
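The rules above translate into a small helper, sketched here on a made-up frame. The function name `season_from_date` and the sample dates are hypothetical; only the cut-off dates and season codes come from the list.

```python
import numpy as np
import pandas as pd

def season_from_date(date):
    """Map a date to a season code using the cut-off dates listed above."""
    month_day = (date.month, date.day)
    if (3, 21) <= month_day <= (6, 20):
        return 2  # Spring
    if (6, 21) <= month_day <= (9, 20):
        return 3  # Summer
    if (9, 21) <= month_day <= (12, 20):
        return 4  # Fall
    return 1      # Winter: December 21 to March 20

# Hypothetical frame with two missing season values.
train = pd.DataFrame({
    "dteday": pd.to_datetime(["2011-01-15", "2011-04-02",
                              "2011-07-04", "2011-12-25"]),
    "season": [1.0, np.nan, 3.0, np.nan],
})

# Fill only the rows where 'season' is missing.
missing = train["season"].isna()
train.loc[missing, "season"] = train.loc[missing, "dteday"].apply(season_from_date)

print(train["season"].tolist())  # [1.0, 2.0, 3.0, 1.0]
```

Comparing `(month, day)` tuples avoids handling the year, and letting Winter be the fallback case covers the range that wraps around the new year.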



#### Imputing Missing Values in 'weekday' Based on 'dteday' - New Order

To address missing values in the 'weekday' feature of our training data, we used essentially the same strategy as for the season and derived the values from the date. In doing so, we introduced a new order for the weekdays.

The imputation process is as follows:
- We converted the 'dteday' column to a datetime format and extracted the day of the week using the `dayofweek` attribute.
- The resulting 'weekday' values now represent the order of weekdays (0 for Monday, 1 for Tuesday, and so on).

By deriving the 'weekday' values from the date, we effectively imputed missing values in a way that aligns with the chronological order of weekdays. 
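A minimal sketch of this step, using invented dates. It assumes the whole 'weekday' column is overwritten, which is what the new ordering implies.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with missing weekday values.
train = pd.DataFrame({
    "dteday": ["2011-01-03", "2011-01-04", "2011-01-09"],
    "weekday": [1.0, np.nan, np.nan],
})

# Convert to datetime, then derive the weekday for every row.
train["dteday"] = pd.to_datetime(train["dteday"])
train["weekday"] = train["dteday"].dt.dayofweek  # 0 = Monday, ..., 6 = Sunday

print(train["weekday"].tolist())  # [0, 1, 6]
```

Note that the existing value in the first row changes as well, so the same conversion has to be applied to any other data that uses the 'weekday' feature.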

These imputations ensure the completeness of the 'weekday' and 'season' features, contributing to a more comprehensive dataset for subsequent modeling.


## Power Transformation of 'windspeed' ???
<div style="width:100%;height:30px;background-color:#E31134"></div>





## Train Data Preprocessing

### Handling Outliers in 'cnt'

From Task 1 we know that there are four outliers in the label 'cnt' of our training data.
To address them, we identified and removed the rows where the bike rental count exceeded 20,000. Outliers in the target variable disproportionately influence model training, and their removal helps prevent skewed predictions.

### Outlier Removal Details:
- Rows with 'cnt' values greater than 20,000 were identified using boolean indexing.
- These outlier rows were subsequently dropped from the training dataset.
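The two steps above can be sketched as follows, with made-up counts in place of the real training data.

```python
import pandas as pd

# Hypothetical training labels containing two implausible counts.
train = pd.DataFrame({"cnt": [985, 25000, 4500, 7800, 31000]})

# Boolean indexing: True for every row whose count exceeds 20,000.
outlier_mask = train["cnt"] > 20000
print("Outliers found:", outlier_mask.sum())

# Drop the flagged rows from the training data.
train = train.drop(train[outlier_mask].index)
print(train["cnt"].tolist())  # [985, 4500, 7800]
```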

We double-checked our training data and generated a plot displaying the 'cnt' label.
The results appear satisfactory, with 'cnt' values ranging from approximately 0 to 8000.

## Test Data Preprocessing

### Addressing Outliers in 'hum' of the Test Data

Upon inspection, we observed that, just like in the training data, there are instances of 'hum' values exceeding 100 in the test data.

### Outlier Count:
The number of 'hum' values greater than 100 in the test data is 45.

### Missing values in 'season'

Missing values in 'season' are handled the same way as in our training data, using the imputation strategy based on the 'dteday' (date) feature:

- If the date falls between December 21 and March 20, the season is assigned as 1 (Winter).
- If the date falls between March 21 and June 20, the season is assigned as 2 (Spring).
- and so on...