# Predictions 
## Niccolò Simonato 
## Data & Web Mining, Academic Year 2021-2022

## Importing the dependencies and the cleaned dataset

The cleaned dataset is now imported.

The first snippet is intended for Google Colab with Google Drive mounted; just set the path variable as needed.

The second one is intended to be used in the Jupyter Notebook environment.

In [1]:
# dependencies
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import neighbors
from sklearn.metrics import RocCurveDisplay

In [None]:
# from google.colab import drive
# drive.mount('/gdrive')
# path = '/gdrive/MyDrive/Progetto DWM/Data/*.csv'
# %cd /gdrive

In [4]:
path = 'Data/'

# cleaned_df = pd.read_csv(path, low_memory = False)
train_datasets = []
test_datasets = []

for i in range(5):
    train_datasets.append(pd.read_csv(f"{path}train_dataset_2016_{i + 1}.csv", low_memory = True))
    test_datasets.append(pd.read_csv(f"{path}test_dataset_2016_{i + 1}.csv", low_memory = True))

## Predictions - Attempt 1 - k-NN algorithm

### Why k-NN? - Introduction 
I chose the k-NN algorithm because house construction usually doesn't happen randomly. It is unusual for private parties to build their own house, with their own money, wherever they like: it is more likely that the municipality's dedicated office decides where and how the houses of a given zone are built.

Therefore, I think it is safe to assume that houses in a given zone will have similar prices. k-NN will hopefully help exploit this, especially if the geolocation features are given a greater weight than the other ones.
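
One possible way to give the geolocation features a greater weight is to standardize all features and then stretch the geographic columns before fitting. The sketch below is only an illustration: the column names (`latitude`, `longitude`, `sqft`), the toy values and the factor `geo_weight` are assumptions, not taken from the cleaned dataset.

```python
import pandas as pd
from sklearn.neighbors import KNeighborsRegressor
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two zones of three houses each (illustrative values only)
toy = pd.DataFrame({
    "latitude":  [34.00, 34.01, 34.50, 34.51, 34.02, 34.52],
    "longitude": [-118.00, -118.01, -118.50, -118.51, -118.02, -118.52],
    "sqft":      [1000, 3000, 1100, 2900, 2000, 2100],
    "logerror":  [0.10, 0.12, -0.08, -0.10, 0.11, -0.09],
})
features = ["latitude", "longitude", "sqft"]
geo_weight = 5.0  # assumed factor; it would need tuning on real data

# Standardize so that every feature starts on a comparable scale
scaled = pd.DataFrame(StandardScaler().fit_transform(toy[features]), columns=features)

# Stretch the geographic axes so that they dominate the Euclidean distance
scaled[["latitude", "longitude"]] *= geo_weight

knn = KNeighborsRegressor(n_neighbors=2).fit(scaled, toy["logerror"])

# Indices of the two nearest training rows for the first house (itself included)
neighbor_idx = knn.kneighbors(scaled.iloc[[0]], return_distance=False)[0]
```

With this scaling, the neighbors of the first house come from its own zone (rows 0, 1 and 4 in the toy data).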

This attempt will use the [ScikitLearn implementation](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html#sklearn.neighbors.KNeighborsRegressor) of the k-NN algorithm for prediction.

The first attempt will be conducted with the parameter "weights" set to "uniform"; the second one will use the value "distance".
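
The difference between the two settings can be seen on a tiny one-dimensional example. The data below is made up for illustration: with "uniform" every neighbor counts equally, while with "distance" each neighbor is weighted by the inverse of its distance from the query point.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Three made-up training points on a line
X = np.array([[0.0], [1.0], [3.0]])
y = np.array([0.0, 10.0, 20.0])
query = np.array([[0.5]])  # distances to the training points: 0.5, 0.5, 2.5

uniform = KNeighborsRegressor(n_neighbors=3, weights="uniform").fit(X, y)
distance = KNeighborsRegressor(n_neighbors=3, weights="distance").fit(X, y)

# Plain average of the three targets: (0 + 10 + 20) / 3 = 10.0
pred_uniform = uniform.predict(query)[0]

# Inverse-distance weights 2, 2 and 0.4: (0*2 + 10*2 + 20*0.4) / 4.4, about 6.36
pred_distance = distance.predict(query)[0]
```

The far point at 3.0 pulls the "uniform" prediction upwards, while it barely affects the "distance" one.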

The model will be tested with a number of neighbors between 4 and 8, a range that often yields good results in practice.
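
As a hedged alternative to fixing the range by rule of thumb, the candidate values of k could be compared with cross-validation. The sketch below uses synthetic stand-in data, not the project dataset, and the choice of 5 folds and of R-squared as the score are assumptions.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data (not the real dataset)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)

# Mean cross-validated R-squared for each candidate number of neighbors
cv_scores = {}
for k in range(4, 9):
    model = KNeighborsRegressor(n_neighbors=k)
    cv_scores[k] = cross_val_score(model, X, y, cv=5, scoring="r2").mean()

# Candidate with the highest mean score
best_k = max(cv_scores, key=cv_scores.get)
```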

The following snippet contains the function that wraps the described procedure.

In [8]:
n_neighbors = [4,5,6,7,8]
    
def train_test_kNN(x_train, y_train, x_test, y_test, n_neighbors, w):
    # Fit a k-NN regressor with the given number of neighbors and weighting scheme
    knn = neighbors.KNeighborsRegressor(n_neighbors=n_neighbors, weights=w)
    model = knn.fit(x_train, y_train)
    prediction = model.predict(x_test)
    # R-squared index of the predictions on the test set
    score = model.score(x_test, y_test)
    data = {'train': (x_train, y_train),
            'test': (x_test, y_test),
            'n_neighbors': n_neighbors,
            'weights': w,
            'prediction': prediction,
            'score': score,
            'model': model
           }
    return data

In [7]:
results = []
types = ['uniform', 'distance']  # valid values for the "weights" parameter
for train in train_datasets:
    x_train, y_train = train.drop(columns='logerror'), train['logerror']
    for test in test_datasets:
        x_test, y_test = test.drop(columns='logerror'), test['logerror']
        for n in n_neighbors:
            for w in types:
                results.append(train_test_kNN(x_train, y_train, x_test, y_test, n, w))

### How did it go? - Evaluation
With the results in hand, we can proceed with their evaluation.

In order to keep this notebook as clean as possible, the evaluation relies on the built-in score method of the KNeighborsRegressor object, which returns the R-squared index of the predictions.
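
As a quick check of what the built-in evaluator reports, the sketch below (on synthetic data, not the project dataset) shows that the score method of a scikit-learn regressor matches `r2_score` from the metrics module. It also derives the adjusted R-squared from it with the standard formula, which is an addition for illustration and is not computed by scikit-learn itself.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data with three features
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 3))
y_train = X_train[:, 0] + 0.1 * rng.normal(size=100)
X_test = rng.normal(size=(40, 3))
y_test = X_test[:, 0] + 0.1 * rng.normal(size=40)

model = KNeighborsRegressor(n_neighbors=5).fit(X_train, y_train)

# Built-in evaluator and the equivalent explicit computation
r2 = model.score(X_test, y_test)
r2_manual = r2_score(y_test, model.predict(X_test))

# Adjusted R-squared: penalizes the number of features p given n samples
n, p = X_test.shape
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)
```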

In [None]:
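def show_results_kNN(parameters):
    _, y_test = parameters['test']
    # The feature matrix can have several columns, so the predictions are plotted against the actual values
    plt.figure()
    plt.scatter(y_test, parameters['prediction'], color="darkorange", label="prediction")
    # Diagonal on which a perfect prediction would lie
    limits = [y_test.min(), y_test.max()]
    plt.plot(limits, limits, color="navy", label="ideal")
    plt.xlabel("actual logerror")
    plt.ylabel("predicted logerror")
    plt.legend()
    plt.title(f"KNeighborsRegressor (k = {parameters['n_neighbors']}, weights = {parameters['weights']}, R2 = {parameters['score']:.3f})")
    plt.tight_layout()
    plt.show()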

In [None]:
for i in results:
    show_results_kNN(i)

## Predictions - Attempt 2 - Linear Regression

### Why Linear Regression? - Introduction
The idea behind adopting the LinReg model is related to the low integrity of the initial dataset.

In contrast with the previous model, which relies on the assumption that nearby houses behave alike, this is an attempt to see what happens with a model that does not make that assumption. This type of analysis is expected to highlight some previously unseen correlations and to produce some interesting predictions.

This test will be conducted with the [ScikitLearn implementation](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) of LinearRegression.
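
A minimal sketch of how a fitted linear model can expose the relationship between features and target: on synthetic data generated from known coefficients (an assumption made purely for illustration), the attributes `coef_` and `intercept_` recover them.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data: y depends linearly on two features, plus a little noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 * X[:, 0] - 1.0 * X[:, 1] + 0.5 + 0.01 * rng.normal(size=200)

model = LinearRegression().fit(X, y)

# One coefficient per feature: sign and size describe the estimated linear effect
coefficients = model.coef_
intercept = model.intercept_
```

On the real dataset the coefficients are only comparable across features if the features share a common scale.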

In [None]:
from sklearn.linear_model import LinearRegression

def train_test_LinReg(x_train, y_train, x_test, y_test):
    # Fit an ordinary least squares model and predict on the test set
    model = LinearRegression().fit(x_train, y_train)
    predictions = model.predict(x_test)
    data = {
        'x_train': x_train,
        'y_train': y_train,
        'x_test': x_test,
        'y_test': y_test,
        'predictions': predictions,
        'model': model
    }
    return data


In [None]:
res = []
for train in train_datasets:
    for test in test_datasets:
        res.append(train_test_LinReg(train.loc[:, train.columns!='logerror'], train['logerror'], test.loc[:, test.columns!='logerror'], test['logerror']))

### How did it go? - Evaluation
We can now proceed with the evaluation of the results.

The following metrics will be used:


*  Mean Squared Error
*  R-squared index


The evaluations will be done by using the [ScikitLearn Metrics](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.metrics) module.
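
As a small worked example of the two metrics, the sketch below applies `mean_squared_error` and `r2_score` to four made-up values, so that the result can be checked by hand.

```python
from sklearn.metrics import mean_squared_error, r2_score

# Made-up actual values and predictions
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

# Squared errors: 0.25, 0.25, 0, 1, whose mean is 0.375
mse = mean_squared_error(y_true, y_pred)

# 1 - (sum of squared errors) / (total sum of squares) = 1 - 1.5 / 29.1875
r2 = r2_score(y_true, y_pred)
```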

In [None]:
from sklearn.metrics import mean_squared_error

def show_results_LinReg(parameters):
    y_test = parameters['y_test']
    R_sq = parameters['model'].score(parameters['x_test'], y_test) #R-squared index
    MSE = mean_squared_error(y_test, parameters['predictions']) #Mean Squared Error

    # The feature matrix can have several columns, so the predictions are plotted against the actual values
    plt.figure()
    plt.scatter(y_test, parameters['predictions'], color="darkorange", label="prediction")
    # Diagonal on which a perfect prediction would lie
    limits = [y_test.min(), y_test.max()]
    plt.plot(limits, limits, color="navy", label="ideal")
    plt.legend()
    plt.title(f"LinearRegression (R2-index: {R_sq:.3f}, MSE: {MSE:.5f})")
    plt.tight_layout()
    plt.show()

In [None]:
for i in res:
    show_results_LinReg(i)

## Final Considerations

In order to decide which algorithm gives us a better prediction, a ROC curve will be displayed, using the library function provided by [ScikitLearn Metrics](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.RocCurveDisplay.html#sklearn.metrics.RocCurveDisplay.from_predictions). Since ROC curves are defined for binary classification, the continuous logerror target has to be binarized before the curve can be drawn.
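
A ROC analysis needs binary labels and a ranking score. The sketch below shows one possible reduction on made-up values: the threshold (logerror above zero) is an assumption chosen for illustration, and the continuous predictions are used directly as scores.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy continuous targets and predictions (illustrative values only)
y_true = np.array([-0.20, 0.10, -0.05, 0.30])
y_pred = np.array([-0.10, 0.20, 0.05, 0.15])

# Assumed binarization: the positive class is a logerror above zero
labels = (y_true > 0).astype(int)

# The continuous predictions act as ranking scores for the positive class
auc = roc_auc_score(labels, y_pred)
```

In this toy case both positive samples are scored above both negative ones, so the area under the curve is 1.0.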

### k-NN

In [None]:
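for i in results:  # List of the dictionaries that summarize each experiment for the k-NN algorithm
    # RocCurveDisplay expects binary labels; as an assumption, the target is binarized as logerror > 0
    RocCurveDisplay.from_predictions((i['test'][1] > 0).astype(int), i['prediction'])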

### LinReg

In [None]:
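for i in res:  # List of the dictionaries that summarize each experiment for the LinReg algorithm
    # RocCurveDisplay expects binary labels; as an assumption, the target is binarized as logerror > 0
    RocCurveDisplay.from_predictions((i['y_test'] > 0).astype(int), i['predictions'])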

### Conclusions