# Project: Linear models

## Programming project: Predicting Apartment Prices in Barcelona 2023

As the real estate market in Barcelona continues to evolve, accurate prediction of apartment prices has become a vital aspect for buyers, sellers, investors, and real estate professionals alike. Understanding the factors that influence property values and being able to forecast future prices is crucial for making informed decisions and maximizing returns on real estate investments.

This assessment aims to delve into the art and science of predicting apartment prices in Barcelona in the year 2023. By analyzing historical data, market trends, and key determinants that impact property values, we will endeavor to develop a robust predictive model capable of estimating apartment prices with a high degree of accuracy.

The data (*Regression_Train.csv*) consist of a list of features plus the resulting <i>price</i>, described below. Each row corresponds to one apartment and its price. Properties are identified by <i>id</i>.

+ Using this data, build a predictive model for <b>price</b>.
+ For faster algorithms, use the MSE criterion to choose any hyperparameters.
+ Try a first quick implementation, then try to optimize hyperparameters.
+ For this analysis there is an extra test dataset. Once your code is submitted, we will run a competition to see how you score on the test data. You should therefore also prepare the script needed to compute the MSE estimate on the test data once it is released.
+ Bonus: Try an approach to fill NA values without removing features or observations, and check whether it improves the results.
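
As one possible starting point for the bonus item, the sketch below fills missing values instead of dropping rows or columns. It assumes median imputation for numeric columns and most-frequent imputation for categorical ones; the toy values are invented and only the column names come from the variable list further down.

```python
import numpy as np
import pandas as pd

# Toy data with gaps; values are invented for illustration only.
toy = pd.DataFrame({
    "num_rooms": [2.0, 3.0, np.nan, 4.0],
    "square_meters": [60.0, np.nan, 85.0, 120.0],
    "orientation": ["north", np.nan, "south", "south"],
})

numeric_cols = ["num_rooms", "square_meters"]
categorical_cols = ["orientation"]

filled = toy.copy()
# Numeric gaps: replace with the column median.
for col in numeric_cols:
    filled[col] = filled[col].fillna(filled[col].median())
# Categorical gaps: replace with the most frequent category.
for col in categorical_cols:
    filled[col] = filled[col].fillna(filled[col].mode()[0])

print(filled)
```

In a real run, the fill values would normally be computed on the training data only and then reused for the test data, so that no information leaks from the test set.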

## 1. You can follow these **steps** in your first implementation:

1. **Dataset Exploration:** Begin by thoroughly exploring and understanding the dataset you are working with. Gain insights into its structure, variables, and any potential patterns or trends that may exist within the data.

2. **Handling Missing Data:** Identify and report any missing data present in the dataset. Implement suitable strategies to handle missing values, such as imputation or elimination, ensuring the integrity and quality of the data.

3. **Addressing Categorical Features and Outliers:** Process the categorical features within the dataset, converting them into a suitable format for machine learning algorithms. Additionally, detect and handle any outliers that may affect the model's performance and make appropriate adjustments or treatments.

4. **Model Building:** Construct your machine learning model using the preprocessed dataset. Utilize an appropriate algorithm based on the nature of your prediction task and the available data. Train the model on the input data and evaluate its performance.

5. **Assessing Accuracy with Cross-Validation:** Optionally, employ cross-validation techniques to assess the expected accuracy of your model. This will help validate the model's generalization capabilities and provide more robust performance metrics.

6. **Identifying Impactful Variables:** Analyze the model's results and identify which variables have the most significant impact on the prediction outcomes. Report these variables, as they offer valuable insights into the factors that drive the predicted prices.

It is recommended to iterate and refine the steps mentioned above based on the performance results obtained during step 5. This iterative process will enable you to enhance the accuracy and overall effectiveness of your model. 


## 2. Main criteria for grading
From more to less important (the weighting of these components will vary between the in-class and extended projects):
+ Code runs
+ Price prediction made
+ Accuracy of predictions for test properties is calculated (Kaggle)
+ Linear Model, Ridge and LASSO have been used
+ Accuracy itself
+ Data exploration
+ Data preparation
+ Hyperparameter optimization (alphas)
+ Code is combined with neat and understandable commentary, with some titles and comments (demonstrate you have understood the methods and the outputs produced)
+ Insights obtained
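
To illustrate the criteria on the three models and on alpha optimization, here is a hedged sketch that compares a plain linear model, Ridge, and LASSO by cross-validated MSE. The data come from `make_regression` rather than the project file, and the alpha grid is an arbitrary illustrative choice.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data standing in for the prepared feature matrix.
X, y = make_regression(n_samples=300, n_features=10, n_informative=4,
                       noise=20.0, random_state=0)

alphas = np.logspace(-3, 2, 20)  # illustrative grid, not a recommendation
results = {}
best_alpha = {}

# Plain linear regression has no alpha; score it with cross-validation only.
ols = make_pipeline(StandardScaler(), LinearRegression())
results["linear"] = -cross_val_score(ols, X, y, cv=5,
                                     scoring="neg_mean_squared_error").mean()

# Ridge and LASSO: pick the alpha that minimises cross-validated MSE.
for name, estimator in [("ridge", Ridge()), ("lasso", Lasso(max_iter=10_000))]:
    pipe = make_pipeline(StandardScaler(), estimator)
    search = GridSearchCV(pipe, {f"{name}__alpha": alphas}, cv=5,
                          scoring="neg_mean_squared_error")
    search.fit(X, y)
    results[name] = -search.best_score_
    best_alpha[name] = search.best_params_[f"{name}__alpha"]

for name, mse in results.items():
    print(f"{name:>6}: CV MSE = {mse:.1f}, alpha = {best_alpha.get(name, 'n/a')}")
```

If the best alpha sits at either end of the grid, it is usually worth widening the grid in that direction before drawing conclusions.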

## 3. Data provided

Here are the definitions for each of the variables:

* **num_rooms:** This variable represents the number of bedrooms in an apartment.

* **num_baths:** It refers to the number of bathrooms in an apartment, indicating the count of spaces equipped with facilities for personal hygiene, such as toilets, sinks, and showers or baths.

* **square_meters:** This variable represents the total area or size of the apartment, measured in square meters. It provides an indication of the spatial extent or physical footprint of the property.

* **orientation:** It refers to the cardinal or directional aspect of the apartment, specifying the compass direction it faces or the direction in which its windows are oriented.

* **year_built:** This variable represents the year in which the apartment was constructed or built, providing an indication of its age and potential implications for its condition, architectural style, and infrastructure.

* **door:** It refers to the specific door number or identifier associated with the apartment within a building or complex. It distinguishes one apartment from another within the same property.

* **is_furnished:** This variable indicates whether the apartment is offered or equipped with furniture. It helps determine whether the tenant or buyer will have access to pre-existing furnishings or whether they need to provide their own.

* **has_pool:** It denotes whether the apartment has a swimming pool as part of its amenities or shared facilities. This feature adds a recreational element and can influence the desirability and value of the property.

* **neighborhood:** This variable represents the specific neighborhood or locality in which the apartment is situated within Barcelona. It provides geographical context and helps capture the characteristics and amenities associated with that area.

* **num_crimes:** It refers to the count or frequency of reported crimes that have occurred in the vicinity of the apartment's location or neighborhood. It serves as an indicator of safety and security within the area.

* **has_ac:** This variable indicates whether the apartment is equipped with an air conditioning system, offering cooling or heating capabilities to maintain a comfortable indoor temperature.

* **accepts_pets:** It denotes whether the apartment allows or accepts pets as tenants or residents. This variable is essential for individuals with pets who are seeking suitable accommodations.

* **num_supermarkets:** This variable represents the count or availability of supermarkets in close proximity to the apartment. It reflects the ease of access to grocery shopping facilities in the neighborhood.

* **price:** It represents the price of the apartment, typically measured in a specific currency (e.g., Euros). It is the dependent variable in the prediction task and serves as the target value to be estimated or predicted using the other variables.

## 4. Kaggle submission

Once you have produced test set predictions, you can submit them to <i>Kaggle</i> to see how your model performs and compete with your colleagues.

The following code provides an example of generating a <i>.csv</i> file to submit to Kaggle:
1. Create a pandas dataframe with two columns, one with the test set "id" values and the other with your predicted "price" for each observation.

2. Use the <i>.to_csv</i> pandas method to create a csv file. The <i>index = False</i> argument is important to ensure the <i>.csv</i> is in the format Kaggle expects.

In [None]:
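# Produce .csv for Kaggle testing
import pandas as pd

# test_df is the test set; test_predictions holds the predicted prices for its rows
test_predictions_submit = pd.DataFrame({"id": test_df["id"], "price": test_predictions})
test_predictions_submit.to_csv("test_predictions_submit.csv", index=False)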
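
For the MSE script to run once the test prices are released, something along these lines may work. The layout of the released data is an assumption (an "id" column and a "price" column), and both data frames below are invented stand-ins: `submission` mimics `test_predictions_submit` and `released` is a hypothetical table of true prices.

```python
import pandas as pd
from sklearn.metrics import mean_squared_error

# Invented stand-ins; replace with the real submission and released prices.
submission = pd.DataFrame({"id": [101, 102, 103], "price": [1000.0, 1500.0, 900.0]})
released = pd.DataFrame({"id": [103, 101, 102], "price": [950.0, 1100.0, 1500.0]})

# Align predictions and true prices on id before comparing.
merged = submission.merge(released, on="id", suffixes=("_pred", "_true"))
test_mse = mean_squared_error(merged["price_true"], merged["price_pred"])
print(f"Test MSE: {test_mse:.1f}")
```

Merging on "id" rather than relying on row order guards against the two files being sorted differently.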