<div style="width:100%;height:30px;background-color:#E31134"></div>


# 1. Baseline linear regression model

## 1.1 Minimal Preprocessing

Since the model cannot handle missing values in the dataset, minimal preprocessing is required.

To address missing values in both the training and testing datasets, we use the `SimpleImputer` from scikit-learn with the `most_frequent` strategy, which replaces each missing value with the most frequent value of its column. This keeps the data complete for further analysis and modeling.
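The imputation step can be sketched as follows. This is a minimal illustration on made-up data, not the notebook's actual code: the toy frames and the names `train_imputed` and `test_imputed` are hypothetical, and fitting the imputer on the training data only (then reusing it for the test data) is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy stand-ins for the real training and test data (hypothetical values).
train_df = pd.DataFrame({"season": [1.0, 1.0, np.nan, 2.0],
                         "temp": [10.0, np.nan, 10.0, 15.0]})
test_df = pd.DataFrame({"season": [np.nan, 2.0],
                        "temp": [15.0, np.nan]})

# Learn the most frequent value of each column from the training data
# (an assumption), then reuse those values to fill gaps in both datasets.
imputer = SimpleImputer(strategy="most_frequent")
train_imputed = pd.DataFrame(imputer.fit_transform(train_df), columns=train_df.columns)
test_imputed = pd.DataFrame(imputer.transform(test_df), columns=test_df.columns)

print(train_imputed)
print(test_imputed)
```

Fitting on the training data alone keeps information from the test set out of the preprocessing; fitting a separate imputer per dataset is also possible and would match a literal reading of "each respective column".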



## 1.2 Splitting Data into Features and Labels

We split the training data into features and labels. The features, denoted as `train_features`, consist of the first 13 columns of the training dataset. The labels, represented by `train_labels`, correspond to the last column, capturing the target variable.

Similarly, we split the test data into features (`test_features`) and labels (`test_labels`) using the same logic, ensuring consistency between the training and testing datasets for subsequent modeling.
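A small sketch of this positional split, assuming the label is the last of 14 columns. The column names `feature_0` to `feature_12` and the values are placeholders.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with 13 feature columns followed by the label column 'cnt'.
columns = [f"feature_{i}" for i in range(13)] + ["cnt"]
train_df = pd.DataFrame(np.arange(3 * 14).reshape(3, 14), columns=columns)

# The first 13 columns are the features, the last column is the label.
train_features = train_df.iloc[:, :13]
train_labels = train_df.iloc[:, -1]

print(train_features.shape)  # (3, 13)
print(train_labels.name)     # cnt
```

The same two `iloc` expressions applied to the test frame would produce `test_features` and `test_labels`.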




## 1.3. Linear Regression

For our initial model building, we employed linear regression, a supervised learning technique, to predict the bike rental count ('cnt') using labeled training data.

### Model Selection:
- We instantiated a linear regression model using the `linear_model.LinearRegression()` function.

### Training the Model:
- The model was trained on the training features (`train_features`) and labels (`train_labels`) using the `fit` method.

### Model Coefficients:
- The coefficients of the linear regression model, representing the weights assigned to each feature, are printed using `print(baseline_model.coef_)`. These coefficients provide insights into the contribution of each feature to the prediction of the target variable.

This linear regression model serves as our baseline model, providing a starting point for evaluation and potential refinement in subsequent stages of model development.
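The three steps above (instantiate, fit, inspect the coefficients) can be illustrated on synthetic data. The data below is invented so that the true weights are known; only `linear_model.LinearRegression()`, `fit`, and `coef_` come from the text.

```python
import numpy as np
from sklearn import linear_model

# Synthetic data following y = 3*x0 - 2*x1 + 5 (illustrative only).
rng = np.random.default_rng(0)
train_features = rng.normal(size=(50, 2))
train_labels = 3 * train_features[:, 0] - 2 * train_features[:, 1] + 5

baseline_model = linear_model.LinearRegression()
baseline_model.fit(train_features, train_labels)

# One weight per feature, plus a separate intercept term.
print(baseline_model.coef_)
print(baseline_model.intercept_)
```

Because the synthetic labels contain no noise, the fitted weights should recover 3 and -2 almost exactly. On real data the coefficients also depend on the scale of each feature, so their magnitudes are not directly comparable across features.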



### Model Prediction on Test Data

Using the trained linear regression model, we made predictions on the test examples to evaluate its performance on unseen data.

### Prediction Details:
- Predictions for bike rental counts ('cnt') in the test set were generated using the `predict` method on the trained linear regression model.
- The resulting predictions are stored in the `baseline_pred` array.

These predictions can be further evaluated and compared with the actual test labels to assess the model's accuracy and generalization capabilities.



### Visualizing Model Output

We visualize the model output and compare it with the actual test labels. The `matplotlib.pyplot` library is employed to create a plot, where the blue line represents the actual test labels, and the orange line represents the predicted values (`baseline_pred`). This visualization aids in assessing the model's ability to capture the underlying patterns in the data.

#### What we can observe from the plot:

While the model generally follows the trend of the data, its predictions are clearly not as accurate as desired. In some instances it underestimates bike rentals, with deviations as low as -1252, but the prevalence of positive deviations suggests a general tendency to overestimate, with predictions as much as 1663 above the test data. This suggests that there is room for improvement. Further exploration, feature engineering, or model tuning will be necessary to enhance accuracy and address these discrepancies.


## 1.4 Single Example Prediction and Evaluation

We demonstrate the model's ability to predict a single example from the test data. The features of the first test example are displayed, and the model's prediction is compared with the actual label. 

- Features of the example: `test_features.iloc[0,:]`
- Predicted label: `predicted_value`
- Actual label: `test_labels.iloc[0]`
- Deviation predicted from actual value: `predicted_value - test_labels.iloc[0]`

This analysis provides insight into the model's performance on individual instances, helping to understand its predictive accuracy.


- **Predicted Label:** [3406.9932736]
- **Actual Label:** 3894
- **Deviation from Actual Value:** -487.006726

The model predicted a bike rental count of approximately 3407 for a specific example, while the actual count was 3894. This suggests an underestimation by about 487. The model's accuracy and deviation from actual values indicate potential limitations and areas for improvement.
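The single-example comparison can be sketched like this. The toy data is hypothetical; the names `baseline_model`, `test_features`, `test_labels`, and `predicted_value` follow the text.

```python
import pandas as pd
from sklearn import linear_model

# Hypothetical toy data; the real notebook uses the bike rental dataset.
train_features = pd.DataFrame({"temp": [0.0, 1.0, 2.0, 3.0]})
train_labels = pd.Series([100.0, 200.0, 300.0, 400.0], name="cnt")
test_features = pd.DataFrame({"temp": [4.0, 5.0]})
test_labels = pd.Series([480.0, 610.0], name="cnt")

baseline_model = linear_model.LinearRegression().fit(train_features, train_labels)

# iloc[[0]] keeps the row as a one-row DataFrame, the 2D shape predict expects.
predicted_value = baseline_model.predict(test_features.iloc[[0]])
deviation = predicted_value[0] - test_labels.iloc[0]

print("Predicted label:", predicted_value)
print("Actual label:", test_labels.iloc[0])
print("Deviation:", deviation)
```

A negative deviation means the model underestimated the actual count, and a positive one means it overestimated.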

## Model Evaluation

We evaluate the performance of the model using two metrics:

### Mean Absolute Error (MAE) & Coefficient of Determination (R^2)

The Mean Absolute Error is a measure of the average absolute differences between the predicted and actual values. A lower MAE indicates better model performance.


    from sklearn.metrics import mean_absolute_error

    mae = mean_absolute_error(test_labels, ypred)
    print('MAE: %.3f' % mae)

### Model Evaluation Results

After evaluating the model on the test data, we obtained the following metrics:

- **Mean Absolute Error (MAE):** 1054.862
  - The MAE represents the average absolute difference between the predicted and actual values. In this case, a lower MAE is desirable, and the obtained value provides insight into the average magnitude of prediction errors.

- **Coefficient of Determination (R^2):** 0.4002
  - The R^2 value measures the proportion of the variance in the bike rental counts that the model can explain. A higher R^2 value (closer to 1.0) indicates a better fit. In this instance, the obtained R^2 value of 0.4002 suggests that the model explains 40.02% of the variance in the test data.

These results offer an assessment of the model's performance, indicating areas for potential improvement or refinement.
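To make the two metrics concrete, the sketch below computes them by hand and with scikit-learn's `mean_absolute_error` and `r2_score`. The arrays `y_true` and `y_pred` are made-up values, not the notebook's data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, r2_score

# Hypothetical actual and predicted counts (not the notebook's data).
y_true = np.array([3000.0, 4000.0, 5000.0, 6000.0])
y_pred = np.array([3500.0, 3800.0, 5200.0, 5500.0])

# MAE: mean of |actual - predicted|.
mae_manual = np.mean(np.abs(y_true - y_pred))

# R^2: 1 - (residual sum of squares / total sum of squares).
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print("MAE: %.3f" % mean_absolute_error(y_true, y_pred))
print("R^2: %.4f" % r2_score(y_true, y_pred))
```

MAE is expressed in the units of the target (bike rentals), while R^2 is unitless, which is why the two can move in different directions when the model changes.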


<div style="width:100%;height:30px;background-color:#E31134"></div>


## 2. Preprocessing

### 2.1 Training Data preprocessing

#### Handling Outliers in Humidity and Windspeed

To achieve better results with our model, we began by preprocessing our training data. First, we counted all values of the 'hum' feature that exceeded 100 and found a total of 229 invalid values. We then counted the negative values in the 'windspeed' feature, which amounted to 4.

- Humidity Outliers: 229
- Windspeed Outliers: 4

In addition, the 'hum' feature has 34 missing values, so 263 of its 600 entries are either invalid (above 100) or missing. We therefore expect the model's performance to improve if we drop the feature altogether. The 'windspeed' feature, with only 4 negative values, is handled by replacing the negative values with the median of windspeed.
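The counting and replacement steps might look like the following sketch. The miniature frame is hypothetical, and computing the median over the non-negative values only is an assumption, since the text does not say which values enter the median.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the training data.
train_df = pd.DataFrame({
    "hum": [55.0, 120.0, np.nan, 80.0, 101.0],
    "windspeed": [10.0, -3.0, 12.0, 14.0, 8.0],
})

# Count invalid and missing entries.
hum_invalid = (train_df["hum"] > 100).sum()
hum_missing = train_df["hum"].isna().sum()
wind_negative = (train_df["windspeed"] < 0).sum()
print(hum_invalid, hum_missing, wind_negative)

# Replace negative windspeed values with the median of the valid ones
# (using only valid values for the median is an assumption).
valid_median = train_df.loc[train_df["windspeed"] >= 0, "windspeed"].median()
train_df.loc[train_df["windspeed"] < 0, "windspeed"] = valid_median
print(train_df["windspeed"].tolist())
```

With only a handful of negative values, the median computed with or without them will usually be close, so the choice matters little in practice.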



#### Imputing Missing Values in 'season' Based on 'dteday'

To handle missing values in the 'season' feature of our training data, we implemented an imputation strategy using the 'dteday' (date) feature. For each missing value in 'season', we used the associated date information to estimate the season.

The imputation process is outlined as follows:
- If the date falls between December 21 and March 20, the season is assigned as 1 (Winter).
- If the date falls between March 21 and June 20, the season is assigned as 2 (Spring).
- If the date falls between June 21 and September 20, the season is assigned as 3 (Summer).
- If the date falls between September 21 and December 20, the season is assigned as 4 (Fall).
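The date rules above can be expressed as a small helper. This is a sketch under the stated cut-offs; the function name `season_from_date` and the sample rows are hypothetical.

```python
import numpy as np
import pandas as pd

def season_from_date(date):
    """Map a date to 1=Winter, 2=Spring, 3=Summer, 4=Fall using the cut-offs above."""
    month_day = (date.month, date.day)
    if (3, 21) <= month_day <= (6, 20):
        return 2
    if (6, 21) <= month_day <= (9, 20):
        return 3
    if (9, 21) <= month_day <= (12, 20):
        return 4
    return 1  # December 21 to March 20 wraps around the year end

# Hypothetical rows; 'dteday' and 'season' follow the dataset's column names.
train_df = pd.DataFrame({
    "dteday": pd.to_datetime(["2011-01-15", "2011-04-02", "2011-07-30", "2011-12-25"]),
    "season": [1.0, np.nan, np.nan, np.nan],
})

# Fill only the rows where 'season' is missing.
missing = train_df["season"].isna()
train_df.loc[missing, "season"] = train_df.loc[missing, "dteday"].apply(season_from_date)
print(train_df)
```

Comparing `(month, day)` tuples avoids any dependence on the year, and treating winter as the fallback case handles the range that spans the year boundary.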



#### Imputing Missing Values in 'weekday' Based on 'dteday' - New Order

To address missing values in the 'weekday' feature of our training data, we used the same strategy as for the season and derived the values from the date. This approach introduces a new order for the weekdays.

The imputation process is as follows:
- We converted the 'dteday' column to a datetime format and extracted the day of the week using the pandas `dt.dayofweek` accessor.
- The resulting 'weekday' values now represent the order of weekdays (0 for Monday, 1 for Tuesday, and so on).

By deriving the 'weekday' values from the date, we effectively imputed missing values in a way that aligns with the chronological order of weekdays. 
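A minimal sketch of this derivation, using a few made-up dates:

```python
import pandas as pd

# Hypothetical rows; 2011-01-03 was a Monday.
train_df = pd.DataFrame({"dteday": ["2011-01-03", "2011-01-04", "2011-01-09"]})

# Parse the date strings, then derive the weekday (0 = Monday ... 6 = Sunday).
train_df["dteday"] = pd.to_datetime(train_df["dteday"])
train_df["weekday"] = train_df["dteday"].dt.dayofweek

print(train_df["weekday"].tolist())  # [0, 1, 6]
```

Note that this overwrites every 'weekday' value, not only the missing ones, so any existing values that used a different numbering are replaced by the Monday-based order.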

These imputations ensure the completeness of the 'weekday' and 'season' features, contributing to a more comprehensive dataset for subsequent modeling.



### Power Transformation of 'windspeed'

Looking at the histograms of all features, the 'windspeed' data appears to be skewed (the distribution leans toward the left side).


To improve the distribution of the 'windspeed' feature, we applied a power transformation using the `PowerTransformer` from scikit-learn. The same transformation is applied to the test data later, where the power transformation is covered in more detail.
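The effect of the transformation can be illustrated on synthetic skewed data. The exponential sample below is a stand-in for 'windspeed', and relying on the `PowerTransformer` defaults (Yeo-Johnson method, standardized output) is an assumption about the notebook's settings.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import PowerTransformer

# Hypothetical skewed windspeed values (most mass at the low end).
rng = np.random.default_rng(42)
train_df = pd.DataFrame({"windspeed": rng.exponential(scale=5.0, size=500)})
skew_before = train_df["windspeed"].skew()

# PowerTransformer defaults to the Yeo-Johnson method and standardizes the output.
pt = PowerTransformer()
transformed = pt.fit_transform(train_df[["windspeed"]])
train_df["windspeed"] = transformed[:, 0]
skew_after = train_df["windspeed"].skew()

print("skew before: %.3f, after: %.3f" % (skew_before, skew_after))
```

A skewness closer to zero after the transformation indicates a more symmetric distribution. Keeping the fitted `pt` object around allows the same learned parameters to be reused via `pt.transform` on other data.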








#### Handling Outliers in 'cnt'

From Task 1 we know that the label 'cnt' contains four outliers in our training data.
To address them, we identified and removed rows where the bike rental count exceeded 20,000. Outliers in the target variable disproportionately influence model training, and their removal helps prevent skewed predictions.

#### Outlier Removal Details:
- Rows with 'cnt' values greater than 20,000 were identified using boolean indexing.
- These outlier rows were subsequently dropped from the training dataset.
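The two steps above can be sketched as follows. The values are invented, and resetting the index afterwards is an optional extra not mentioned in the text.

```python
import pandas as pd

# Hypothetical label values; two rows exceed the 20,000 threshold.
train_df = pd.DataFrame({"cnt": [3500, 25000, 7200, 41000, 150]})

# Boolean mask marking the outlier rows.
outliers = train_df["cnt"] > 20000
print("outliers found:", outliers.sum())

# Drop those rows by index and renumber the remaining ones (optional).
train_df = train_df.drop(train_df[outliers].index).reset_index(drop=True)
print(train_df["cnt"].tolist())  # [3500, 7200, 150]
```

Dropping whole rows keeps features and labels aligned, which matters because the outlier lives in the label column.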

We double-checked our training data by generating a plot of the 'cnt' label.
The results appear satisfactory, with 'cnt' values ranging from approximately 0 to 8000.

### 2.2 Test Data Preprocessing

#### Addressing Outliers in 'hum' of the Test Data

Upon inspection, we observed that, just like in the training data, there are instances of 'hum' values exceeding 100 in the test data.

The test data contains 45 invalid 'hum' values greater than 100.
We will handle these and the invalid values from the training data later via the feature selection.

#### Missing values in 'season'

Missing values in 'season' are handled the same way as in our training data, using the imputation strategy based on the 'dteday' (date) feature:

- If the date falls between December 21 and March 20, the season is assigned as 1 (Winter).
- If the date falls between March 21 and June 20, the season is assigned as 2 (Spring).
- and so on...

Looking at the histograms of all features, the 'windspeed' data appears to be skewed (the distribution leans toward the left side).

### Power Transformation of 'windspeed' in the Test Data

To maintain consistency with the preprocessing applied to the training data, a power transformation was performed on the 'windspeed' feature of the test data. The PowerTransformer from scikit-learn was utilized for this transformation.

### Transformation Steps:
1. Power transformation was applied to the 'windspeed' values in the test data.
2. The transformed values were stored in a new column named 'windspeed ptransformed' in the `feature_test_df` DataFrame.

### Applying Transformation to Test Data:
To ensure consistency in the preprocessing pipeline, the transformed 'windspeed' values were then assigned back to the 'windspeed' column in the original test dataset (`test_df`).

#### Visualization:
A histogram of the transformed 'windspeed' values is generated and displayed for visual inspection. The transformation has achieved its intended purpose of making the distribution more even or symmetric.


<div style="width:100%;height:30px;background-color:#E31134"></div>


## 3. Feature Selection

In the feature selection process, we aim to identify and retain the most relevant features for modeling, optimizing the model's performance by reducing complexity and potential overfitting.

### Removing Unnecessary Features

To streamline and focus our dataset for modeling, the 'instant', 'dteday', and 'hum' features were identified as unnecessary and subsequently removed from both the training and test datasets.

### Removed Features:
1. 'instant': Represents the instant record number and is deemed unnecessary for modeling.
2. 'dteday': Denotes the date, and since relevant temporal information is captured by other features, it is excluded for simplicity.
3. 'hum': Humidity was previously identified as having invalid and missing values and may not contribute significantly to the model.

### Dataset Simplification:
By eliminating these features, we aim to simplify the dataset, potentially improving model efficiency and interpretability while retaining the essential information needed for accurate predictions.
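Dropping the three columns from both datasets could look like this sketch. The frames are hypothetical and contain only a few of the dataset's columns.

```python
import pandas as pd

# Hypothetical frames with a few of the dataset's column names.
train_df = pd.DataFrame({"instant": [1, 2], "dteday": ["2011-01-01", "2011-01-02"],
                         "hum": [80.0, 130.0], "temp": [0.3, 0.4], "cnt": [985, 801]})
test_df = train_df.copy()

# Drop the same columns from both datasets so their layouts stay aligned.
unnecessary = ["instant", "dteday", "hum"]
train_df = train_df.drop(columns=unnecessary)
test_df = test_df.drop(columns=unnecessary)

print(train_df.columns.tolist())  # ['temp', 'cnt']
```

Using one shared list for both calls guards against the training and test sets ending up with different columns.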


## Exploring Feature Correlations

To gain insights into the relationships between features in our dataset, we again calculated the correlations among all features using the Pearson correlation coefficient, as we did in Task 1.

### Correlation Analysis:
The correlation matrix displays numerical values representing the strength and direction of the linear relationship between each pair of features. Positive values indicate a positive correlation, negative values indicate a negative correlation, and values close to zero suggest weak or no correlation. Based on the weak correlations, we will further decide which features to include or drop.


To get a better overview, we calculated the correlation coefficients of the features with the target. The resulting correlation values provide insights into how each feature correlates with the target variable 'cnt'.

These correlation coefficients help identify features that may have a stronger influence on predicting 'cnt', guiding us in the subsequent feature selection process.


## Selection of Features Based on Correlation with 'cnt'

In the feature selection process, we identified and extracted subsets of features from the preprocessed training data based on their correlation with the target variable 'cnt' (bike rental count).

### Selected Features:
1. **All Preprocessed Features:** `train_corr_high` includes all features after preprocessing.

2. **Nine Features with the Highest Correlation with 'cnt':** Subsetting the features to include the nine with the highest absolute correlation values.

3. **Four Features with the Highest Correlation with 'cnt':** Alternatively, a more focused subset includes only the four features with the highest absolute correlation values. This can be considered our favorite feature set.

These subsets serve as candidates for subsequent model training, allowing us to assess the impact of feature selection on model performance and potentially enhance the interpretability of the model.
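Selecting the top features by absolute correlation can be sketched as below. The data is synthetic, and the names `corr_with_cnt`, `top_two`, and `train_subset` are hypothetical; the notebook selects nine and four features rather than two.

```python
import numpy as np
import pandas as pd

# Synthetic data: 'temp' drives 'cnt' strongly, 'windspeed' moderately and
# negatively, 'noise' not at all (all values are illustrative).
rng = np.random.default_rng(1)
n = 500
temp = rng.normal(size=n)
windspeed = rng.normal(size=n)
noise = rng.normal(size=n)
cnt = 10 * temp - 5 * windspeed + rng.normal(size=n)
train_df = pd.DataFrame({"temp": temp, "windspeed": windspeed, "noise": noise, "cnt": cnt})

# Pearson correlation of every feature with the target.
corr_with_cnt = train_df.corr(method="pearson")["cnt"].drop("cnt")

# Rank by absolute value so strong negative correlations are kept too.
top_two = corr_with_cnt.abs().sort_values(ascending=False).head(2).index.tolist()
train_subset = train_df[top_two + ["cnt"]]
print(corr_with_cnt)
print(top_two)
```

Changing `head(2)` to `head(9)` or `head(4)` gives subsets of the sizes described above. Pearson correlation only captures linear relationships, so a feature with a low coefficient may still carry nonlinear information.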


<div style="width:100%;height:30px;background-color:#E31134"></div>

## 4. New Linear Regression Model



### 4.1 Split

#### Feature and Label Selection

In this step, we refined the feature and label selection process based on the subsets of features identified earlier.

#### Selected Features and Labels:
#### Training Data:
- **Features (`train_features`):** Extracted features from the `train_corr_high` dataset, excluding the target variable ('cnt'). Subset includes the nine features with the highest correlation with 'cnt'.
- **Labels (`train_labels`):** Extracted the target variable 'cnt' from the `train_corr_high` dataset.

#### Test Data:
- **Features (`test_features`):** Extracted specific features from the test dataset, including 'atemp', 'temp', 'yr', 'season', 'weathersit', 'windspeed', 'mnth', 'holiday', and 'weekday'.
- **Labels (`test_labels`):** Extracted the target variable 'cnt' from the test dataset.

These refined feature and label sets will be used to train and evaluate the linear regression model, providing a focused exploration of feature subsets and their impact on predictive performance.


### 4.2. Linear Regression

#### Model Building with Linear Regression (Refined Features)

Just like with our baseline model building process, we employed linear regression, but this time utilizing labeled training data with the refined feature and label subsets.

#### Model Selection:
- We instantiated a new linear regression model using the `linear_model.LinearRegression()` function.

#### Training the Model:
- The model was trained on the refined training features (`train_features`) and labels (`train_labels`) using the `fit` method.


### Model Prediction on Test Data (Refined Features)

Using the refined linear regression model, we made predictions on the test examples to evaluate its performance on unseen data.

### Prediction Results:
The predicted values for bike rental counts ('cnt') in the test set are stored in the `pred` array. These predictions are derived from the refined set of features and will be used for evaluation and for visualization in a plot, just as we did with the baseline model.




## 4.3 Evaluation

### Visualization of Model Predictions

To visually assess the performance of the refined linear regression model, we plotted the actual test labels against the predicted values.

#### Visualization Details:
- A line plot was created using Matplotlib to display the actual test labels (`test_labels`) and the corresponding predicted values (`pred`).
- The x-axis represents the data points, and the y-axis represents the bike rental counts ('cnt').

This visualization aids in understanding how well the model predictions align with the actual values, providing insights into the model's accuracy and its ability to capture the underlying patterns in the test data.

### Single Example Prediction and Evaluation

To demonstrate the model's prediction for a specific example in the test set, we selected the first row of test features.

Result:

For the given test example with features:
- 'atemp': 0.472846
- 'temp': 19.366700
- 'yr': 0.000000
- 'season': 4.000000

The model predicted a bike rental count ('cnt') of approximately 3978.94. The actual label for this example was 3894, resulting in a deviation of approximately 84.94.

Compared to our baseline model, which had a deviation of -487.006726 from the actual label, this is a clear improvement for this example. Next, we compare the first 10 examples to see whether the better results carry over.




## Prediction Deviation for the First 10 Examples

To further analyze the model's performance on the first 10 examples in the test set, the deviation between the predicted and actual bike rental counts ('cnt') was calculated and rounded for readability.
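The deviation calculation can be sketched as follows. Only the first pair of values (3978.94 predicted, 3894 actual) comes from the example above; all other numbers are placeholders standing in for `pred` and `test_labels`.

```python
import numpy as np
import pandas as pd

# Stand-ins for `test_labels` and `pred`; only the first pair is from the text.
test_labels = pd.Series([3894, 4100, 3950, 4200, 4300, 4500,
                         4700, 4600, 4800, 5000, 5100, 5200])
pred = np.array([3978.94, 4000.2, 4010.7, 4150.0, 4420.3, 4390.9,
                 4755.5, 4580.1, 4900.0, 4950.6, 5000.0, 5300.0])

# Positive values are overestimates, negative values are underestimates.
deviation = (pred[:10] - test_labels.iloc[:10].to_numpy()).round(2)
print(deviation)
```

Converting the labels with `to_numpy()` makes the subtraction purely positional, which avoids surprises if the label Series carries a non-default index.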

We evaluate the performance of the model using two metrics again:

### Mean Absolute Error (MAE) & Coefficient of Determination (R^2) of Model B


- **Mean Absolute Error (MAE):** 930.547

- **Coefficient of Determination (R^2):** 0.3664375340807854

The MAE improved from 1054.862 for the old model to 930.547, a reduction of about 12%.

However, R² actually got slightly worse, dropping from 0.4002 for the old model to about 0.3664.
