# Ames Housing Dataset - Bagged Random Forest

These notebooks represent the project submission for the course [Data and Web Mining](https://www.unive.it/data/course/337525) by Professor [Claudio Lucchese](https://www.unive.it/data/people/5590426) at [Ca' Foscari University of Venice](https://www.unive.it).

---

## Premises

### Process

Ideally, exploratory data analysis leaves the data untouched. In this project, however, I modify the data whenever an analysis step reveals a chance to improve its quality. This matters here because the dataset contains many outliers and shows evidence of inconsistent records. In other words, although the usual recommendation is to keep the *passive* analysis separate from the feature engineering phase, I believe both can be done at each step: whatever we infer at a given point can be used immediately to modify the data, working towards the best version of the original dataset.

### Requirements for the project

The project relies on a few global variables, which automate saving and loading the data.

---

## Structure of the notebooks

The project has 5 notebooks:
1. Introduction and Data Preparation
    * Domain research, Context of Data
    * Preliminary Dataset Overview
    * Correction of possible errors and coherence check 
    * Some preliminary feature creation
2. Exploratory Data Analysis
    * Univariate, Bivariate and Multivariate analysis
    * Outlier removal
3. Data Encoding, Type Conversion and Feature Selection
    * Encoding of Categorical Features
    * Conversion of all types
    * Feature Importance Assessment
    * Feature Selection
4. Bagged Random Forest Regression
    * Data Rescaling
    * Train, Validation, Test
    * Hyperparameters Tuning
    * Diagnostics and Evaluation
5. Graph Neural Network Regression
    * Data conversion to graph and edges creation
    * Data Rescaling
    * Train, Validation and Test loops
    * Diagnostics and Evaluation

**Gianmaria Pizzo - 872966@stud.unive.it**

---

## Structure of this notebook

This notebook covers the following points:
* Data Rescaling
* Train, Validation, Test
* Hyperparameters Tuning
* Diagnostics and Evaluation

## Data Preparation

As discussed in [this article](https://towardsdatascience.com/one-hot-encoding-is-making-your-tree-based-ensembles-worse-heres-why-d64b282b5769), one-hot encoding can hurt tree-based ensembles.

Because the procedure creates one new variable per category, it can produce far too many predictors when the original column has a large number of unique values. Another disadvantage is that it introduces multicollinearity among the generated variables, which can lower the model's accuracy.
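
The effect on the number of predictors can be illustrated with a minimal sketch. The toy frame below is a made-up stand-in, not the project's data, and `OrdinalEncoder` is only one possible alternative to one-hot encoding, chosen here as an assumption for comparison.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy stand-in for a categorical column with several levels (values are made up).
toy = pd.DataFrame({
    "Neighborhood": ["NAmes", "CollgCr", "OldTown", "Edwards", "NAmes", "Somerst"],
    "LotArea": [8450, 9600, 11250, 9550, 14260, 14115],
})

# One-hot encoding: one new indicator column per distinct category.
one_hot = pd.get_dummies(toy, columns=["Neighborhood"])

# Ordinal encoding: the column count is unchanged, categories become numeric codes.
ordinal = toy.copy()
ordinal["Neighborhood"] = (
    OrdinalEncoder().fit_transform(toy[["Neighborhood"]]).ravel()
)

# The indicator columns always sum to 1 per row, which is the source of the
# multicollinearity mentioned above.
indicator_sum = one_hot.filter(like="Neighborhood_").sum(axis=1)

print("one-hot shape:", one_hot.shape)   # 1 numeric + 5 indicator columns
print("ordinal shape:", ordinal.shape)   # same 2 columns as the input
print("indicator row sums:", indicator_sum.tolist())
```

With five distinct categories the one-hot frame already triples in width, while the ordinal version keeps the original two columns. Note that ordinal codes impose an arbitrary order on the categories, so this is a trade-off rather than a free improvement.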


## Model Definition

## Parameters Tuning

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Split 60/20/20: hold out 20% for testing, then take 25% of the
# remaining 80% (20% of the whole) for validation
X_train_80, X_test, y_train_80, y_test = train_test_split(df_train, df_target,
                                                          test_size = 0.20, random_state = 33)

X_train, X_valid, y_train, y_valid  = train_test_split(X_train_80, y_train_80, 
                                                       test_size=0.25, random_state=42)

accuracies = []

for c in [0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]:
    # train on the training split only
    model = SVC(C=c, kernel='poly')
    model.fit(X_train, y_train)

    # compute accuracy on the training and validation splits
    train_acc = accuracy_score(y_true=y_train,
                               y_pred=model.predict(X_train))
    valid_acc = accuracy_score(y_true=y_valid,
                               y_pred=model.predict(X_valid))
    print("C: {:8.3f} - Train Accuracy: {:.3f} - Validation Accuracy: {:.3f}"
          .format(c, train_acc, valid_acc))

    accuracies.append((valid_acc, c))

# pick the C with the highest validation accuracy (ties go to the larger C)
best_accuracy, best_c = max(accuracies)
print("Best C:", best_c)

# retrain on training + validation data together,
# so the final model exploits as much data as possible
model = SVC(C=best_c, kernel='poly')
model.fit(X_train_80,y_train_80)

test_acc = accuracy_score(y_true = y_test, 
                          y_pred = model.predict(X_test) )
print ("Test Accuracy: {:.3f}".format(test_acc) )
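
The cell above uses a classifier (`SVC`) and accuracy. For a continuous target such as a sale price, the same hold-out procedure could be sketched with a regressor and RMSE instead. The sketch below is an assumption based on the notebook title, not the project's actual model: it wraps a `RandomForestRegressor` in a `BaggingRegressor`, runs on synthetic data from `make_regression`, and the helpers `make_model` and `rmse` are hypothetical names introduced for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the prepared features and target (assumption).
X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)

# Same 60/20/20 split as in the cell above.
X_train_80, X_test, y_train_80, y_test = train_test_split(
    X, y, test_size=0.20, random_state=33)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_train_80, y_train_80, test_size=0.25, random_state=42)


def make_model(max_depth):
    """Hypothetical helper: a small random forest wrapped in a bagging ensemble."""
    forest = RandomForestRegressor(n_estimators=10, max_depth=max_depth,
                                   random_state=0)
    # The base estimator is passed positionally, since its keyword name
    # differs between scikit-learn versions.
    return BaggingRegressor(forest, n_estimators=5, random_state=0)


def rmse(model, X_part, y_part):
    """Hypothetical helper: root mean squared error of `model` on one split."""
    return float(np.sqrt(mean_squared_error(y_part, model.predict(X_part))))


scores = []
for max_depth in [2, 4, 8]:
    model = make_model(max_depth).fit(X_train, y_train)
    valid_rmse = rmse(model, X_valid, y_valid)
    print("max_depth: {:2d} - Train RMSE: {:.3f} - Validation RMSE: {:.3f}"
          .format(max_depth, rmse(model, X_train, y_train), valid_rmse))
    scores.append((valid_rmse, max_depth))

# Lower RMSE is better, so take the minimum instead of the maximum.
best_rmse, best_depth = min(scores)
print("Best max_depth:", best_depth)

# Refit on training + validation data before the final test evaluation.
final_model = make_model(best_depth).fit(X_train_80, y_train_80)
test_rmse = rmse(final_model, X_test, y_test)
print("Test RMSE: {:.3f}".format(test_rmse))
```

The only structural changes from the classification loop are the error metric and the switch from `max` to `min` when selecting the best setting.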

In [None]:
import pandas as pd  # used below to tabulate cv_results_
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

X_train_80, X_test, y_train_80, y_test = train_test_split(df_train, df_target,
                                                          test_size = 0.20, random_state = 42)

model = SVC()
parameters = { 'C': [0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0],
                'kernel': ['linear', 'poly', 'rbf', 'sigmoid']}
        
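# 5-fold cross-validation over every (C, kernel) pair; with the default
# refit=True, the best combination is refit on all of X_train_80
tuned_model = GridSearchCV(model, parameters, cv=5, verbose=0)
tuned_model.fit(X_train_80, y_train_80)

print("Best Score: {:.3f}".format(tuned_model.best_score_))
print("Best Params: ", tuned_model.best_params_)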

In [None]:
tuned_model.cv_results_

In [None]:
pd.DataFrame( tuned_model.cv_results_ )

In [None]:
test_acc = accuracy_score(y_true = y_test, 
                          y_pred = tuned_model.predict(X_test) )
print ("Test Accuracy: {:.3f}".format(test_acc) )
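
The grid search above tunes a classifier and reports accuracy. As a hedged sketch of how the same steps might look for a bagged random forest regressor, the example below tunes two `BaggingRegressor` parameters with an RMSE-based scorer. The synthetic data, the grid values, and the choice of parameters are all assumptions made for illustration.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import BaggingRegressor, RandomForestRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the prepared features and target (assumption).
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=0)
X_train_80, X_test, y_train_80, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42)

# Bagging ensemble whose members are small random forests; the base estimator
# is passed positionally because its keyword name varies across versions.
bagged_forest = BaggingRegressor(
    RandomForestRegressor(n_estimators=10, random_state=0),
    random_state=0,
)

# Illustrative grid over the bagging parameters only (values are assumptions).
param_grid = {
    "n_estimators": [3, 5],
    "max_samples": [0.5, 1.0],
}

# scikit-learn scorers are maximised, hence the negated RMSE.
search = GridSearchCV(bagged_forest, param_grid, cv=5,
                      scoring="neg_root_mean_squared_error")
search.fit(X_train_80, y_train_80)

print("Best CV RMSE: {:.3f}".format(-search.best_score_))
print("Best Params: ", search.best_params_)

# Tabulate the cross-validation results, best combination first.
results = pd.DataFrame(search.cv_results_).sort_values("rank_test_score")
print(results[["params", "mean_test_score", "std_test_score"]])

# score() applies the same scorer to the refit best model, so negate it again.
test_rmse = -search.score(X_test, y_test)
print("Test RMSE: {:.3f}".format(test_rmse))
```

Sorting `cv_results_` by `rank_test_score` makes it easier to see how close the runner-up settings are to the selected one, which helps judge whether the grid is worth refining.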

## Diagnostics and Evaluation

### Investigating Instances