# Task: Predicting Paris House Prices

In this homework assignment, you will work with a dataset for predicting house prices in Paris. The dataset contains the following variables:

- squareMeters: Represents the area of the house in square meters.
- numberOfRooms: Indicates the number of rooms in the house.
- hasYard: Binary variable indicating whether the house has a yard (1 for yes, 0 for no).
- hasPool: Binary variable indicating whether the house has a pool (1 for yes, 0 for no).
- floors: Number of floors in the house.
- cityCode: Code representing the city location.
- cityPartRange: Represents the range of the city part.
- numPrevOwners: Number of previous owners of the house.
- made: Year when the house was built.
- isNewBuilt: Binary variable indicating whether the house is newly built (1 for yes, 0 for no).
- hasStormProtector: Binary variable indicating whether the house has a storm protector (1 for yes, 0 for no).
- basement: Binary variable indicating whether the house has a basement (1 for yes, 0 for no).
- attic: Binary variable indicating whether the house has an attic (1 for yes, 0 for no).
- garage: Binary variable indicating whether the house has a garage (1 for yes, 0 for no).
- hasStorageRoom: Binary variable indicating whether the house has a storage room (1 for yes, 0 for no).
- hasGuestRoom: Binary variable indicating whether the house has a guest room (1 for yes, 0 for no).
- price: The target variable representing the price of the house.

For this assignment, you will fit, evaluate, and compare three regression techniques: Ordinary Least Squares (OLS), Lasso, and Ridge.

You will be provided with code snippets for each step of the process, and you will need to fill in the "???" with appropriate code to complete the tasks.


### Exercise Overview: Read and split the data

In this exercise, you will split the cleaned `house price` dataset into training and test sets for further analysis. This step is crucial for evaluating the performance of machine learning models. Instead of writing repetitive code, we'll use a function stored in a script to streamline the process.
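The helper `split_data` is defined in `r_functions.r`, which is not shown here. As a rough illustration of what such a helper could look like, the sketch below uses a hypothetical function `split_data_sketch` with assumed arguments `df`, `train_fraction`, and `seed`. The `test_set` element name is also an assumption. The course's actual function may take different arguments and behave differently.

```r
# Hypothetical stand-in for the course's split_data(); the real one may differ.
split_data_sketch <- function(df, train_fraction = 0.8, seed = 42) {
  set.seed(seed)                                   # make the split reproducible
  n_train <- floor(train_fraction * nrow(df))      # number of training rows
  train_idx <- sample(seq_len(nrow(df)), n_train)  # random training row indices
  list(
    train_set = df[train_idx, , drop = FALSE],
    test_set  = df[-train_idx, , drop = FALSE]     # assumed element name
  )
}

# Toy data frame with 20 rows (not the Paris dataset)
toy <- data.frame(x = 1:20, y = (1:20) * 2)
parts <- split_data_sketch(toy, train_fraction = 0.75)
nrow(parts$train_set)  # 15
nrow(parts$test_set)   # 5
```

Returning both pieces in one named list keeps the split in a single call, and fixing the seed means every model is evaluated on the same held-out rows.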


In [None]:
# Define Path
base_path <- "???"
base_path <- path.expand(base_path)
setwd(base_path)



# Read the data from the specified folder
data <- read.csv("self_study_tutorial/scripts_and_data????")



# Read the file and load the functions
????("/????/r_functions.r")

# Call the function to split the data
??? <- split_data(??, ???)

# Access the training and test sets from the result
train <- ??$train_set
test <- ????




### Exercise Overview: Linear Regression

In this exercise, you'll fit an Ordinary Least Squares (OLS) regression to predict house prices. First, load the `glmnet` and `dplyr` libraries. OLS itself can be fit with base R's `lm`; `glmnet` is needed for the Lasso and Ridge exercises that follow. Then, fit a linear regression model (`ols`) to the training dataset (`train`) and examine the model summary to check the fit and the significance of each predictor. Next, predict house prices for the test dataset (`test`) and store the results in `ols$predict_outcome`. Finally, calculate the Mean Squared Error (MSE) between the actual and predicted test prices and print it. This test MSE is the baseline against which the Lasso and Ridge models are compared later.
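The fit, predict, and MSE steps can be sketched on simulated data. Everything in this block (the toy data frame, the 150/50 split, and the names `toy_train`, `toy_test`, `fit`, `pred`, `mse`) is made up for illustration and is not part of the assignment's dataset or solution.

```r
set.seed(1)

# Simulated toy data (not the Paris dataset): y depends linearly on x1 and x2
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- 3 + 2 * toy$x1 - 1 * toy$x2 + rnorm(n, sd = 0.5)

# Simple fixed split into training and test rows
toy_train <- toy[1:150, ]
toy_test  <- toy[151:200, ]

# Fit OLS using all remaining columns as predictors
fit <- lm(y ~ ., data = toy_train)

# Predict on the held-out rows and compute the test MSE
pred <- predict(fit, newdata = toy_test)
mse <- mean((toy_test$y - pred)^2)
print(mse)
```

Computing the MSE on rows the model has not seen is what makes it a measure of predictive accuracy rather than of in-sample fit.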

In [None]:
library(glmnet)
library(dplyr)

# Fit the linear regression model
ols <- ????
# Display the summary of the model to understand its performance
summary(ols)

# Predicting the price for the test dataset
ols$predict_outcome <- ???

# Calculating the Mean Squared Error (MSE) for our predictions
predMSEols <- ????
# Print the MSE to the console
print(predMSEols)

### Exercise Overview: Lasso Regression Modeling

In this exercise, you'll estimate a Lasso regression model that uses every column except the outcome (`price`) as a predictor. First, use the `glmnet` package to fit the Lasso model (`lasso`) on the training dataset (`train`); in `glmnet`, the Lasso penalty corresponds to `alpha = 1`. Then, use `cv.glmnet` with 5-fold cross-validation and MSE as the error measure to find the lambda value that gives the lowest cross-validated error.
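As a minimal sketch of this workflow on simulated data (the matrix `x`, the response `y`, and the object names `fit_lasso` and `cv_lasso` are illustrative assumptions, not the assignment's variables):

```r
library(glmnet)

set.seed(1)

# Simulated toy data: only the first two of five predictors affect y
n <- 200
x <- matrix(rnorm(n * 5), nrow = n, ncol = 5)
colnames(x) <- paste0("x", 1:5)
y <- 3 + 2 * x[, 1] - 1 * x[, 2] + rnorm(n, sd = 0.5)

# alpha = 1 selects the Lasso penalty; glmnet fits a whole path of lambdas
fit_lasso <- glmnet(x, y, alpha = 1)

# 5-fold cross-validation over the lambda path, scored by MSE
cv_lasso <- cv.glmnet(x, y, alpha = 1, nfolds = 5, type.measure = "mse")
print(cv_lasso$lambda.min)
```

Note that `glmnet` expects a numeric matrix of predictors rather than a data frame, which is why the exercise code wraps the predictor columns in `as.matrix`.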

In [None]:

# Estimate a Lasso model using all predictors except the outcome variable
lasso <- glmnet(as.matrix(???[, -ncol(???)]), train$price, ????)
# Perform cross-validation to find the optimal lambda value
lasso.cv <- cv.glmnet(as.matrix(train[, -ncol(train)]), ????

### Exercise Overview: Lasso Coefficient Analysis

In this exercise, you'll print the coefficients of the cross-validated Lasso model (`lasso.cv`) at the optimal lambda value (`lambda.min`). Because the Lasso penalty can shrink coefficients exactly to zero, the output shows which predictors the model keeps and which it drops. You'll also save these coefficients (`coef_lasso`) for later comparison with the other regression models.

In [None]:
# Print Lasso coefficients
print(??? "lambda.min"))

# Save for later comparison
coef_lasso <- ??? s = "lambda.min") 


### Exercise Overview: Ridge Regression Modeling

In this exercise, you'll fit a Ridge regression model that uses every column except the outcome (`price`) as a predictor. First, use the `glmnet` package to fit the Ridge model (`ridge`) on the training dataset (`train`), setting `alpha = 0` to select the Ridge penalty. Then, use `cv.glmnet` with 5-fold cross-validation, MSE as the error measure, and `alpha = 0` to find the lambda value (`lambda.min`) that minimizes the cross-validated error. Ridge shrinks coefficients toward zero without dropping predictors, which helps stabilize estimates when predictors are highly correlated.

In [None]:

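# Estimate a Ridge model using all predictors except the outcome variable
ridge <- g???

# Perform cross-validation to find the optimal lambda value
ridge.cv <- ?????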


### Exercise Overview: Ridge Coefficient Analysis

In this exercise, you'll print the coefficients of the cross-validated Ridge model (`ridge.cv`) at the optimal lambda value (`lambda.min`). Unlike Lasso, Ridge shrinks coefficients toward zero but does not set them exactly to zero, so every predictor stays in the model. You'll also save these coefficients (`coef_ridge`) for later comparison with the other regression models.
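One way to compare saved coefficients is to put them side by side in a table. The sketch below does this on simulated data; the data, the object names (`cv_l`, `cv_r`, `coef_l`, `coef_r`, `comparison`), and the table layout are assumptions for illustration only.

```r
library(glmnet)

set.seed(1)

# Simulated toy data: only x1 and x2 affect y; x3..x5 are pure noise
n <- 200
x <- matrix(rnorm(n * 5), nrow = n, ncol = 5)
colnames(x) <- paste0("x", 1:5)
y <- 3 + 2 * x[, 1] - 1 * x[, 2] + rnorm(n, sd = 0.5)

cv_l <- cv.glmnet(x, y, alpha = 1, nfolds = 5)  # Lasso
cv_r <- cv.glmnet(x, y, alpha = 0, nfolds = 5)  # Ridge

# coef() returns a sparse one-column matrix with the intercept in row 1
coef_l <- coef(cv_l, s = "lambda.min")
coef_r <- coef(cv_r, s = "lambda.min")

# Side-by-side table for comparison
comparison <- data.frame(
  term  = rownames(coef_l),
  lasso = as.matrix(coef_l)[, 1],
  ridge = as.matrix(coef_r)[, 1],
  row.names = NULL
)
print(comparison)
```

In a table like this, exact zeros can only appear in the Lasso column, while the Ridge column shows small but nonzero values for weak predictors.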

In [None]:
# Print Ridge coefficients
????

# Save for later comparison
coef_ridge <- ???

### Exercise Overview: Optimal Lambda Comparison

In this exercise, you'll compare the optimal lambda values obtained through cross-validation for Lasso and Ridge regression models. 

For Lasso regression:
- Print the optimal lambda value that minimizes cross-validated Mean Squared Error (MSE) (`lasso.cv$lambda.min`).
- Print the lambda value chosen by the one-standard-error rule (`lasso.cv$lambda.1se`), that is, the largest lambda whose cross-validated error is within one standard error of the minimum.

For Ridge regression:
- Print the optimal lambda value that minimizes cross-validated MSE (`ridge.cv$lambda.min`).
- Print the lambda value chosen by the one-standard-error rule (`ridge.cv$lambda.1se`), defined in the same way.

Comparing these values shows how much regularization cross-validation selects for each model. Since `lambda.1se` is never smaller than `lambda.min`, it gives a model that is at least as strongly regularized, trading a small increase in cross-validated error for a simpler fit.
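The relationship between the two lambda choices can be checked on simulated data. The data and the name `cv_fit` below are illustrative assumptions.

```r
library(glmnet)

set.seed(1)

# Simulated toy data (not the Paris dataset)
n <- 200
x <- matrix(rnorm(n * 5), nrow = n, ncol = 5)
y <- 3 + 2 * x[, 1] - 1 * x[, 2] + rnorm(n, sd = 0.5)

cv_fit <- cv.glmnet(x, y, alpha = 1, nfolds = 5, type.measure = "mse")

# lambda.min: lambda with the lowest mean cross-validated error
print(paste0("lambda.min: ", cv_fit$lambda.min))

# lambda.1se: largest lambda within one standard error of that minimum
print(paste0("lambda.1se: ", cv_fit$lambda.1se))

# The one-standard-error choice is never the weaker penalty
is_stronger_or_equal <- cv_fit$lambda.1se >= cv_fit$lambda.min
print(is_stronger_or_equal)
```

Calling `plot(cv_fit)` draws the cross-validation curve with both lambda values marked by vertical lines, which makes the trade-off easy to see.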

In [None]:
# Print the optimal lambda values for Lasso
print(??("Optimal lambda that minimizes cross-validated MSE: ", ????))
??"Optimal lambda using one-standard-error-rule: ", lasso.cv$lambda.1se))


# Print the optimal lambda values for Ridge
print(??("Optimal lambda that minimizes cross-validated MSE: ", ???))
print(?("Optimal lambda using one-standard-error-rule: ", ??))


### Exercise Overview: Prediction and MSE Calculation

In this exercise, you'll predict house prices using the fitted Lasso and Ridge regression models and calculate the Mean Squared Error (MSE) to evaluate prediction accuracy.

For Lasso Regression:
- Predict house prices using the fitted Lasso model (`lasso`) with the test dataset.
- Calculate the MSE between the actual and predicted house prices.
- Print the calculated MSE.

For Ridge Regression:
- Predict house prices using the fitted Ridge model (`ridge`) with the test dataset.
- Calculate the MSE between the actual and predicted house prices.
- Print the calculated MSE.

Comparing the two test MSE values shows which penalty predicts house prices more accurately on unseen data. Comparing both against the OLS test MSE shows whether regularization improves prediction at all for this dataset.
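A minimal sketch of prediction with a `glmnet` model on simulated data follows. The train/test split, the choice to predict from the cross-validated object at `lambda.min`, and all object names are assumptions made for illustration; the assignment may intend a different model object or lambda.

```r
library(glmnet)

set.seed(1)

# Simulated toy data (not the Paris dataset)
n <- 200
x <- matrix(rnorm(n * 5), nrow = n, ncol = 5)
y <- 3 + 2 * x[, 1] - 1 * x[, 2] + rnorm(n, sd = 0.5)

x_train <- x[1:150, ]
y_train <- y[1:150]
x_test  <- x[151:200, ]
y_test  <- y[151:200]

# Ridge with 5-fold cross-validation on the training rows only
cv_ridge <- cv.glmnet(x_train, y_train, alpha = 0, nfolds = 5)

# predict() needs the new predictors as a matrix (newx) and a lambda (s)
pred <- predict(cv_ridge, newx = x_test, s = "lambda.min")

# predict() returns a one-column matrix, so flatten it before subtracting
mse <- mean((y_test - as.vector(pred))^2)
print(paste0("MSE: ", mse))
```

Passing `s` explicitly avoids relying on the default lambda, which differs between `glmnet` objects and `cv.glmnet` objects.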

In [None]:



# Predict using the fitted Lasso model
lasso$predict_outcome <- predict(l???

# Calculate the MSE
predMSElasso <- mean((????
print(paste0("MSE: ", predMSElasso))



# Predict using the fitted Ridge model
ridge$predict_outcome <-????

# Calculate the MSE
predMSEridge <- m???
print(paste0("??

Compare the OLS, Ridge, and Lasso regressions in terms of predictive accuracy.

In [None]:
print(??