# Internship competition: Predict the correct house prices!

## Fictitious assignment
An online platform for buying and selling apartments wants to develop an app to attract more potential sellers. The app should reliably predict the
selling price based on some key data about an apartment. 


Everyone will receive
- `house_price_data.csv`, a data set with features/variables for different houses and the label (here: house price). This can be used to train and test models.
- `house_price_data_unknowns.csv`, a data set with houses for which the price is unknown.

## Goal: Predict house prices as accurately as possible
At the end, all teams should submit their price predictions for the houses in `house_price_data_unknowns.csv`. After submission, the predictions will be compared with the actual values using the MAE (mean absolute error). The team with the most accurate predictions (i.e., the lowest MAE) wins :)
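
To make the metric concrete, here is a minimal sketch of how the MAE is computed. The helper name `mean_absolute_error_manual` and the example prices are made up for illustration; in practice `sklearn.metrics.mean_absolute_error` does the same job.

```python
import numpy as np

def mean_absolute_error_manual(y_true, y_pred):
    """Average absolute difference between true and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(y_true - y_pred))

# Made-up example prices, for illustration only
true_prices = [300_000, 450_000, 250_000]
predicted = [310_000, 430_000, 250_000]

mae = mean_absolute_error_manual(true_prices, predicted)
print(mae)  # (10_000 + 20_000 + 0) / 3 = 10000.0
```

A lower MAE is better: here the predictions are off by 10,000 on average.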

In [None]:
import os
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sb

pd.set_option('display.max_columns', 100)

# Let's get started... Import and explore data

- As always, import the data using `pd.read_csv` (`house_price_data.csv`)
- The column `price` contains our label (or target variable).
- Are there any missing values?
- Are there any outliers/incorrect/strange entries?

--> `.describe()` & `.info()`
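
In addition to `.describe()` and `.info()`, missing values and suspicious entries can be checked directly. The following is only a sketch on a tiny made-up table; the column names (`bedrooms`, `sqft_living`) and the threshold are assumptions and may not match the real data.

```python
import numpy as np
import pandas as pd

# Tiny made-up table standing in for the real data (hypothetical columns)
toy = pd.DataFrame({
    "price": [300_000, 450_000, np.nan, 250_000],
    "bedrooms": [3, 33, 2, 4],          # 33 bedrooms looks like a typo/outlier
    "sqft_living": [1500, 1600, 900, np.nan],
})

# Number of missing values per column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Rows with implausible values can be inspected with a boolean filter
suspicious = toy[toy["bedrooms"] > 10]
print(suspicious)
```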

In [None]:
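# Raw string (r"...") so that the backslashes in the Windows path are not read as escape sequences
filename = r"C:\Users\Sander\Repos\DataScience\Data\Daten für Machine Learning Competition-20250528\house_price_data.csv"  # set to your own path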

data = pd.read_csv(filename)
data = data.set_index('id')

In [None]:
data.head()

In [None]:
data.info()

In [None]:
data.describe()

## Graphical overview of all features

In [None]:
data.hist(figsize=(15, 15), bins=15, rwidth=0.8)
plt.show()

## Correlations!?
- Are there meaningful correlations with our target label (`price`)? If so, does this make us confident that we can train a model to predict the price?
- Are there any features with suspiciously high (or low) correlations that could be a sign of duplication?
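
Besides the heatmap, it can help to list the correlations with the target in sorted order. This is a sketch on made-up numbers; the column names are hypothetical, and `living_area_copy` is deliberately an exact duplicate to show what a suspicious correlation of 1 looks like.

```python
import pandas as pd

# Made-up data, for illustration only
toy = pd.DataFrame({
    "price": [200, 300, 400, 500, 600],
    "living_area": [50, 70, 95, 120, 140],
    "living_area_copy": [50, 70, 95, 120, 140],   # exact duplicate of living_area
    "year_built": [1990, 1950, 2005, 1970, 1985],
})

corr = toy.corr()

# Correlation of every feature with the target, strongest first
corr_with_price = corr["price"].drop("price").sort_values(ascending=False)
print(corr_with_price)

# A correlation of (almost) exactly 1 between two features hints at duplication
print(corr.loc["living_area", "living_area_copy"])
```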

In [None]:
fig, ax = plt.subplots(figsize=(11, 10))

sb.heatmap(
    data.corr(),
    annot=True, cmap="PuOr", fmt=".1f",
    vmin=-1, vmax=1
)

# Data cleaning & splitting the data --> X, y
- Remove missing values (if any)
- Remove columns that should not be used by the machine learning models (e.g. with `.drop(..., axis=1)`).
- Convert columns that contain categories (depending on the model).
- Split the data into `X` (without the label) and `y` (only the labels)
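
The steps above could look roughly like this. It is a sketch on a made-up table: the columns `condition` and `listing_url` are hypothetical, and whether one-hot encoding via `pd.get_dummies` is the right conversion depends on the model you choose.

```python
import numpy as np
import pandas as pd

# Made-up table, for illustration only
toy = pd.DataFrame({
    "price": [300_000, 450_000, 250_000, 500_000],
    "living_area": [80.0, 120.0, np.nan, 150.0],
    "condition": ["good", "new", "good", "poor"],   # categorical column
    "listing_url": ["a", "b", "c", "d"],            # not useful for a model
})

cleaned = toy.dropna()                                     # remove rows with missing values
cleaned = cleaned.drop(["listing_url"], axis=1)            # remove unused columns
cleaned = pd.get_dummies(cleaned, columns=["condition"])   # one-hot encode categories

X_toy = cleaned.drop(["price"], axis=1)   # features only
y_toy = cleaned["price"]                  # labels only
print(X_toy.columns.tolist())
```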

In [None]:
data = data.drop(["waterfront"], axis=1)
X = data.drop(["price"], axis=1)
y = data["price"]

# Searching for the "right" model...

Predicting a price is a regression task, and there are many suitable models! See, for example, https://scikit-learn.org/stable/supervised_learning.html

Possible candidates would be:

- `sklearn.linear_model.LinearRegression`
- `sklearn.tree.DecisionTreeRegressor`
- `sklearn.neighbors.KNeighborsRegressor`
- `sklearn.ensemble.RandomForestRegressor`

**Caution: Please do not use any of the neural networks from Scikit-Learn.**

## Warning: Some of the models may require a lot of time for training (or even prediction).
(It is therefore a good idea to split the experiments across several computers within the team.)

---

## Quick guide to using Scikit-Learn models

### General procedure: Initialize, train, predict
All models in Scikit-Learn are used according to the same principle.

1) Create object: `my_model = SomeFancyModel(parameter1=4, ...)` 
2) Train model: `my_model.fit(X, y)`
3) Make predictions: `my_model.predict(X_new)`
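
As a minimal sketch, the three steps can be tried on made-up data that follows `y = 2 * x + 1`. The names `X_demo`, `y_demo` and `demo_model` are placeholders for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data following y = 2 * x + 1
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([1.0, 3.0, 5.0, 7.0])

demo_model = LinearRegression()        # 1) create object
demo_model.fit(X_demo, y_demo)         # 2) train model
prediction = demo_model.predict(np.array([[4.0]]))   # 3) make predictions
print(prediction)  # close to [9.]
```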

### Model parameters – which ones are there?

We get a list of all modifiable parameters via `my_model.get_params()` (this also works for pipelines).

However, to understand exactly what each parameter does, we need to look at the Scikit-Learn documentation (https://scikit-learn.org/stable/supervised_learning.html).
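
A short sketch of what `get_params()` returns, using `KNeighborsRegressor` as an example. For a pipeline, the parameter names are prefixed with the step name and a double underscore; the step names `"scale"` and `"model"` below are chosen freely.

```python
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

knn = KNeighborsRegressor()
print(knn.get_params())  # dictionary with entries such as 'n_neighbors' and 'weights'

pipe_demo = Pipeline([
    ("scale", StandardScaler()),
    ("model", KNeighborsRegressor()),
])
pipeline_params = pipe_demo.get_params()

# Parameters of the step named "model" appear as "model__<parameter>"
print([name for name in pipeline_params if name.startswith("model__")])
```

These prefixed names are the ones to use later in a grid search over a pipeline.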

### Pipelines

In Scikit-Learn, the various processing steps can be linked in a pipeline. This makes sense if data processing is part of the model, for example if the data needs to be scaled. Here is an example:

```python
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("model", KNeighborsRegressor()),
])
```
A pipeline object is then treated like a model, i.e., training is done with `pipe.fit()` and prediction with `pipe.predict()`.
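
To illustrate, here is a sketch that trains such a pipeline on randomly generated data with two features on very different scales (a situation where scaling typically matters for nearest-neighbor models). All data and names here are made up.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Made-up data: the second feature is about 1000 times larger than the first
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2)) * np.array([1.0, 1000.0])
y_demo = X_demo[:, 0] + X_demo[:, 1] / 1000.0

pipe_demo = Pipeline([
    ("scale", StandardScaler()),
    ("model", KNeighborsRegressor()),
])

pipe_demo.fit(X_demo, y_demo)                 # scaler and model are fitted together
predictions = pipe_demo.predict(X_demo[:5])   # scaling is applied automatically
print(predictions)
```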


### Grid search
To test multiple parameter settings of a model (or pipeline), a so-called "grid search" is useful, i.e., simply running through all combinations of the listed parameter values.

Here is an example:

```python
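from sklearn.model_selection import GridSearchCV

grid = GridSearchCV(
    estimator=my_model,
    param_grid={
        "parameter_whatever": [3, 5, 7],  # placeholder: use a real parameter name
    },
    cv=3,       # 3-fold cross validation
    verbose=2,  # print progress during the search
)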
```

A GridSearchCV object is then treated similarly to a model, i.e., training is done with `grid.fit()`. We get the results of the search via `grid.cv_results_` (Python dictionary), or to display them a little better via `pd.DataFrame(grid.cv_results_)`.

Multiple parameters can also be tested simultaneously, in which case all combinations are trained and tested accordingly.

Other information about grid search:
- `cv` stands for *cross validation*. With `cv=3`, the training data is split into 3 parts, and each part is used once for validation.
- `verbose` specifies how much information is printed during training (the default is 0, which prints nothing; higher values such as 1, 2, 3 print progressively more).
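
Here is a small end-to-end sketch of a grid search over two parameters at once, on made-up data. The parameter values are arbitrary examples, not recommendations.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Made-up data: roughly y = 3 * x plus some noise
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(60, 1))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.5, size=60)

grid_demo = GridSearchCV(
    estimator=KNeighborsRegressor(),
    param_grid={
        "n_neighbors": [3, 5, 7],
        "weights": ["uniform", "distance"],
    },  # 3 x 2 = 6 combinations
    cv=3,
)
grid_demo.fit(X_demo, y_demo)

# One row per parameter combination
results = pd.DataFrame(grid_demo.cv_results_)
print(results[["params", "mean_test_score", "rank_test_score"]])
print(grid_demo.best_params_)
```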

### Grid search scoring

If nothing else is specified, `GridSearchCV` uses the default `score` method of the respective model. This metric depends on the model type: for Scikit-Learn regressors it is R² (the coefficient of determination), not the MAE used in this competition. If different models are to be compared with each other, it is therefore often necessary to specify a common "score." This can be done as follows:

```python
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import mean_absolute_error, make_scorer

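grid = GridSearchCV(
    estimator=my_model,
    param_grid={
        # "model__" prefix: parameter of the pipeline step named "model"
        "model__whatever": [5, 10, 20],
    },
    # greater_is_better=False negates the MAE, so the reported scores are negative
    scoring={"MAE": make_scorer(mean_absolute_error, greater_is_better=False)},
    cv=3,
    refit="MAE",  # refit the best model according to the "MAE" scorer
)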
```
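
One detail that is easy to miss: because Scikit-Learn always maximizes scores, error metrics are reported with a negative sign. The sketch below uses the built-in scorer name `"neg_mean_absolute_error"` as an alternative to `make_scorer`, on made-up data, and flips the sign to obtain the actual MAE.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor

# Made-up data: roughly y = 3 * x plus some noise
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(60, 1))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.5, size=60)

grid_demo = GridSearchCV(
    estimator=KNeighborsRegressor(),
    param_grid={"n_neighbors": [3, 5, 7]},
    scoring="neg_mean_absolute_error",  # built-in scorer: the negated MAE
    cv=3,
)
grid_demo.fit(X_demo, y_demo)

print(grid_demo.best_score_)        # negative (or zero) value
best_mae = -grid_demo.best_score_   # flip the sign to get the MAE
print(best_mae)
```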

### Explore results

One option is a scatter plot:
```python
fig, ax = plt.subplots(figsize=(6,6))

ax.scatter(y_test, pipe.predict(X_test), alpha=0.25)
ax.set_xlabel("True values")
ax.set_ylabel("Predicted values")
```
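
Alongside the plot, a single number on held-out data makes models easier to compare. The following sketch, on made-up data, computes the test MAE of a model and compares it with a naive baseline that always predicts the mean training value; the baseline is an addition for illustration and not part of the assignment.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Made-up data: roughly y = 3 * x plus some noise
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(200, 1))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=0.5, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

demo_model = LinearRegression().fit(X_tr, y_tr)
mae_model = mean_absolute_error(y_te, demo_model.predict(X_te))

# Naive baseline: always predict the mean of the training labels
baseline_pred = np.full(len(y_te), y_tr.mean())
mae_baseline = mean_absolute_error(y_te, baseline_pred)

print(mae_model, mae_baseline)  # a useful model should clearly beat the baseline
```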

In [None]:
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import mean_absolute_error, make_scorer

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=0)

X_train.shape, X_test.shape, y_train.shape, y_test.shape

---
# Finally: the test!

Once the appropriate model and parameters have been found, predictions can be made on the unknown test data!

These are located in the file `house_price_data_unknowns.csv`.

Most of the code for this is in the following cells.

In [None]:
filename = "datasets/house_price_data_unknowns.csv"

data_competition = pd.read_csv(filename).dropna()
data_competition.head()

## Make your FINAL predictions!

In [None]:
# Use exactly the feature columns the model was trained on (same columns, same order as X)
predicted_prices = my_model.predict(data_competition[X.columns])
predicted_prices

## Create your FINAL results!
Here we simply assign the predictions to their `id`.

In [None]:
competition_results = pd.DataFrame({"id": data_competition["id"],
                                    "price": predicted_prices})
competition_results                                

## Save results! (and when done --> upload to Moodle)
- Should be obvious, but: **please replace `name1_name2` with your names.**

In [None]:
filename = "house_price_predictions_name1_name2.csv"

competition_results.to_csv(filename)

### Good luck!