## Importing necessary libs

In [None]:
import matplotlib.pyplot as plt
import pathlib
import numpy as np
import pickle
import seaborn as sns
from scipy import stats
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import FunctionTransformer

## Defining file path

In [None]:
DATA_DIR = pathlib.Path.cwd().parent / 'data'
clean_data_path = DATA_DIR / 'processed' / 'ames_clean.pkl'
print(DATA_DIR)

## Opening file with cleaned data

In [None]:
with open(clean_data_path, 'rb') as file:
    data = pickle.load(file)

## Removing outliers

Although some outliers were already removed in "02_analysis_and_preprocessing.ipynb", the dataset documentation contains this piece of information:

```There are 5 observations that an instructor may wish to remove from the data set before giving it to students (a plot of SALE PRICE versus GR LIV AREA will indicate them quickly). Three of them are true outliers (Partial Sales that likely don't represent actual market values) and two of them are simply unusual sales (very large houses priced relatively appropriately). I would recommend removing any houses with more than 4000 square feet from the data set (which eliminates these 5 unusual observations) before assigning it to students.```

So let's check whether Prof. Ayres has already removed the outliers highlighted in the documentation.

In [None]:
plt.plot(data['Gr.Liv.Area'], data.SalePrice, 'o', alpha=1)
plt.show()


Well, he did not. So let's remove them. 

In [None]:
data = data[data['Gr.Liv.Area'] < 4000]

plt.plot(data['Gr.Liv.Area'], data.SalePrice, 'o', alpha=1)
plt.show()

Looks better! Now let's start the data transformation.

## Transforming the data for the model

Many data transformations can improve model performance. To understand which ones make sense for the AMES dataset, we need to revisit the analysis in notebook "02_analysis_and_processing.ipynb". One characteristic that stood out for some features was that the points in the scatter plots were concentrated on the left, which suggests a right-skewed distribution. That may mean that taking the log of the value improves its correlation with the target variable. To check whether this is true, we first select only the continuous numerical features.

In [None]:
continuous_variables = [
    'Lot.Frontage',
    'Lot.Area',
    'Mas.Vnr.Area',
    'BsmtFin.SF.1',
    'BsmtFin.SF.2',
    'Bsmt.Unf.SF',
    'Total.Bsmt.SF',
    'X1st.Flr.SF',
    'X2nd.Flr.SF',
    'Low.Qual.Fin.SF',
    'Gr.Liv.Area',
    'Garage.Area',
    'Wood.Deck.SF',
    'Open.Porch.SF',
    'Enclosed.Porch',
    'X3Ssn.Porch',
    'Screen.Porch',
    'Pool.Area',
    'Misc.Val',
]

continuous_data = data[continuous_variables].copy()
continuous_data

Now we can validate our idea. One way to tell whether taking the log makes sense is to look at the distribution of the data. This was done in notebook "02.1_some_more_analysis.ipynb", which confirmed that some features have a distribution that improves when the log is taken.
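
To make the intuition concrete, here is a minimal sketch on synthetic data (a log-normal sample, not the AMES features). It assumes skewness is a reasonable numeric stand-in for "concentrated on the left" and compares it before and after taking the log.

```python
import numpy as np
from scipy import stats

# Synthetic right-skewed sample standing in for a feature whose values
# pile up on the left of a histogram. This is NOT AMES data.
rng = np.random.default_rng(0)
skewed = rng.lognormal(mean=8.0, sigma=0.6, size=5000)

# Skewness near 0 means a roughly symmetric distribution
skew_raw = stats.skew(skewed)
skew_log = stats.skew(np.log(skewed))

print(f"skewness before log: {skew_raw:.2f}")
print(f"skewness after log:  {skew_log:.2f}")
```

A clearly positive skewness that drops toward zero after the log is the same pattern the probability plots below show visually.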

#### Go check charts again and come back!

### Checking data distribution

In [None]:
for col in continuous_variables:

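    # Keep only rows where this feature is non-zero, since log(0) is undefined
    num_nonzero_data = continuous_data[continuous_data[col] != 0]

    fig, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(15, 5))

    # sns.distplot is deprecated; histplot with kde=True is its replacement
    sns.histplot(num_nonzero_data[col], kde=True, stat="density", ax=ax1)
    stats.probplot(num_nonzero_data[col], plot=ax2)
    stats.probplot(np.log(num_nonzero_data[col]), plot=ax3)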

    ax1.set_title(col)
    ax2.set_title('probplot')
    ax3.set_title('probplot log')
    
    plt.show()

Wow. It seems that our idea was right. For several features, taking the log brings the distribution closer to normal, so those features are candidates for a log transform. Another strategy is to take the log of every numerical feature and check whether it improves the correlation with the target. Let's do it.

### Checking log correlation

In [None]:
target = data['SalePrice'].copy()

In [None]:
for column, series in continuous_data.items():
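    # Pearson correlation between the raw feature and the target
    corr = series.corr(target)

    # Drop zeros (log(0) is undefined), then correlate the log with the target.
    # corr() aligns on the index, so only the non-zero rows are used here,
    # while the raw correlation above uses all rows.
    series = series.loc[series != 0]
    log_series = np.log(series)
    corr_log = log_series.corr(target)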
    
    if abs(corr_log) > (abs(corr) + 0.05):
        print("Correlation between", column, "and the target is", corr)
        print("Correlation between log("+column+") and the target is", corr_log)
        print()

Yay! The correlation increases for "Lot.Area", "BsmtFin.SF.2", "X2nd.Flr.SF", "Low.Qual.Fin.SF", "Enclosed.Porch", "X3Ssn.Porch", "Screen.Porch", "Pool.Area" and "Misc.Val" when the log is taken. Let's create a list of the features that showed a better correlation or an improved distribution.

In [None]:
columns_to_log = ('Gr.Liv.Area', 
                  'Lot.Area', 
                  'BsmtFin.SF.2', 
                  'X2nd.Flr.SF',
                  'Low.Qual.Fin.SF', 
                  'Enclosed.Porch', 
                  'X3Ssn.Porch', 
                  'Screen.Porch',
                  'Pool.Area', 
                  'Misc.Val', 
                  'Open.Porch.SF', 
                  'Wood.Deck.SF', 
                  'Garage.Area',
                  'X1st.Flr.SF', 
                  'Total.Bsmt.SF', 
                  'Bsmt.Unf.SF', 
                  'Mas.Vnr.Area')
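
One caveat: the checks above filtered out zeros before taking the log. If zeros are still present when the transform is applied to the full columns, `np.log` returns `-inf` for them. A common alternative is `np.log1p`, which computes log(1 + x). The sketch below uses made-up values for a hypothetical area-like column, not AMES data.

```python
import numpy as np

# Made-up values for an area-like column where 0 means "feature absent"
pool_area = np.array([0.0, 0.0, 144.0, 512.0])

# Plain log: zeros become -inf (the warning is silenced for the demo)
with np.errstate(divide="ignore"):
    plain_log = np.log(pool_area)

# log1p = log(1 + x): zeros map to 0 and every value stays finite
shifted_log = np.log1p(pool_area)

print(plain_log)
print(shifted_log)
```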

## Scaling the data

Another useful strategy for improving model performance is scaling the numerical data. Scaling has no effect on models such as decision trees, but it has a large impact on regularized linear models such as Elastic Net.

The boxplots showed many outliers. With min-max scaling the extremes define the range, so a few outliers squeeze the remaining values into a narrow band. Standard scaling (zero mean, unit variance) is still influenced by outliers but is not pinned to the extremes, so we apply a standard scaler instead.
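
The difference can be sketched on a made-up column with a single extreme value. The numbers are illustrative only and are not taken from the AMES data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Made-up column: five ordinary values and one extreme outlier
values = np.array([[100.0], [110.0], [120.0], [130.0], [140.0], [1000.0]])

minmax = MinMaxScaler().fit_transform(values).ravel()
standard = StandardScaler().fit_transform(values).ravel()

# How much room do the five ordinary points get after each scaling?
minmax_spread = minmax[:5].max() - minmax[:5].min()
standard_spread = standard[:5].max() - standard[:5].min()

print("min-max spread of ordinary points: ", round(minmax_spread, 3))
print("standard spread of ordinary points:", round(standard_spread, 3))
```

Both scalers are affected by the outlier, but min-max scaling packs the ordinary points into a smaller slice of its output range.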

## Transform data

In [None]:
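numerical_data = data.select_dtypes(include=['float64'])
numerical_columns = list(numerical_data.columns)

# Note: a column listed in both groups is emitted twice,
# once logged (not scaled) and once scaled (not logged).
transformer = ColumnTransformer([
    # log1p = log(1 + x), so any zeros in these columns stay finite
    # (plain np.log would turn them into -inf)
    ("log_calculation", FunctionTransformer(np.log1p, validate=True), list(columns_to_log)),
    ("scaler", StandardScaler(), numerical_columns),
])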

In [None]:
transformed_data = transformer.fit_transform(data)
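
`ColumnTransformer` runs each entry independently on its own columns and concatenates the results, so a column that appears in two entries shows up twice in the output. If the goal is to log a column and then scale the logged values, one option is to chain both steps in a `Pipeline`. The sketch below is an assumption about how that could look, on a toy frame with made-up values; the split into `log_cols` and `plain_cols` is hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Toy frame with made-up values; column names borrowed from the notebook
toy = pd.DataFrame({
    "Lot.Area": [8000.0, 9500.0, 12000.0, 30000.0],
    "Pool.Area": [0.0, 0.0, 0.0, 500.0],
    "Lot.Frontage": [60.0, 65.0, 70.0, 80.0],
})
log_cols = ["Lot.Area", "Pool.Area"]   # hypothetical: logged, then scaled
plain_cols = ["Lot.Frontage"]          # hypothetical: scaled only

# Log first, then scale, so the logged columns also end up standardized
log_then_scale = Pipeline([
    ("log", FunctionTransformer(np.log1p)),
    ("scale", StandardScaler()),
])

sketch = ColumnTransformer([
    ("log_scaled", log_then_scale, log_cols),
    ("scaled", StandardScaler(), plain_cols),
])

out = sketch.fit_transform(toy)
print(out.shape)                    # one output column per input column
print(out.mean(axis=0).round(6))    # each column is centered
```

Keeping the two column groups disjoint means every feature appears exactly once in the output.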

## Save the data

In [None]:
transformed_data_path = DATA_DIR / 'processed' / 'ames_transformed.pkl'

In [None]:
with open(transformed_data_path, 'wb') as file:
    # Save the transformer output, not the untransformed `data` frame
    pickle.dump(transformed_data, file)
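
A quick round trip is an easy way to confirm that the object written to disk is the one intended. The sketch below uses a stand-in array and a temporary, hypothetical file name rather than the notebook's real path.

```python
import pickle
import tempfile
from pathlib import Path

import numpy as np

# Stand-in array; in the notebook this would be the transformer output
example = np.arange(6, dtype=float).reshape(3, 2)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "example.pkl"  # hypothetical file name
    with open(path, "wb") as file:
        pickle.dump(example, file)
    with open(path, "rb") as file:
        restored = pickle.load(file)

# True when the reloaded object matches what was saved
print(np.array_equal(example, restored))
```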