This notebook documents several functions from the [utils folder](utils/__init__.py) that you can use to clean, convert, transform, and scale the data.\
For each notable function it includes the documentation, the import statement (relative to the project root), and a small example.\
Make sure the [requirements](requirements.txt) are installed in your current Python environment (see [here](https://packaging.python.org/en/latest/guides/installing-using-pip-and-virtual-environments/) for more information).

# Loading the data

In [None]:
from utils.data_cleaning import load_and_clean

This [function](utils/data_cleaning.py#L8) cleans the raw data by dropping unnecessary columns and rows and converting the target price column to a float.

**Parameters:**
- *drop_columns (bool):* Whether to drop any columns.
- *drop_rows (bool):* Whether to drop any rows. Regardless of this setting, the target price column is always converted to a float for the sake of visualization.
- *\*\*kwargs:* Additional keyword arguments to pass to the drop_features and handle_price functions.

**Returns:**
- *pd.DataFrame*: The cleaned data as a pandas DataFrame.
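
To illustrate what converting the target price column to a float can involve, here is a minimal sketch. It assumes the raw prices are stored as currency strings such as `$1,250.00`; that format and the toy data are assumptions, and the snippet is not the code used inside `load_and_clean`.

```python
import pandas as pd

# Hypothetical raw data: the string format of `price` is an assumption
raw = pd.DataFrame({"price": ["$1,250.00", "$85.00", None]})

# Strip the currency symbol and thousands separators, then parse as numbers.
# errors="coerce" turns anything unparseable into NaN instead of raising.
stripped = raw["price"].str.replace(r"[$,]", "", regex=True)
price = pd.to_numeric(stripped, errors="coerce")

print(price)
```

Missing prices stay missing (NaN) after the conversion, so they can still be dropped or imputed in a later step.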

---

Additional arguments passed through *\*\*kwargs* must be key=value pairs (**e.g.:** 'quantile=0.8, verbose=True'; **not:** '0.8, True'):
- *quantile: float = 0.99*, the quantile to use for outlier removal.
- *verbose: bool = False*, whether to print some meaningful information during computation.
- *remove_redundant_feats: bool = True*, whether to remove redundant features, specified in *manual_redundant_feats* and *add_redundant_feats*.
- *manual_redundant_feats: list = [...]*, a list of features to remove. The default is a predefined list; see the [function](utils/data_cleaning.py#L73) for exactly which features it contains.
- *add_redundant_feats: list = []*, optional list of features to add to the *manual_redundant_feats* list.
- *remove_high_corr_feats: bool = True*, whether to remove features with a high correlation to other features.
- *corr_threshold: float = 0.8*, the threshold for the correlation coefficient above which features are removed.
- *remove_missing_feats: bool = True*, whether to remove features with more than *missing_threshold* missing values.
- *missing_threshold: float = 0.5*, the threshold for the fraction of missing values above which features are removed.
- *remove_single_value_feats: bool = True*, whether to remove features with only one unique value.
- *remove_text_feats: bool = True*, whether to remove features with text values.
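
The thresholds above can be easier to reason about with a concrete example. The sketch below applies the same kinds of filters (missing values, single-valued columns, high correlation, quantile-based outlier removal) to a small synthetic dataframe with plain pandas. The column names, the data, and the exact rules (for instance, which column of a correlated pair is dropped) are assumptions for illustration, not the implementation in `utils/data_cleaning.py`.

```python
import numpy as np
import pandas as pd

# Synthetic data; all column names here are hypothetical
rng = np.random.default_rng(0)
base = rng.normal(size=200)
df = pd.DataFrame({
    "a": base,
    "a_copy": base * 2 + rng.normal(scale=0.01, size=200),  # nearly collinear with "a"
    "b": rng.normal(size=200),
    "constant": 1.0,
    "mostly_missing": np.where(np.arange(200) < 150, np.nan, base),  # 75% missing
    "price": rng.exponential(scale=100, size=200),
})

missing_threshold, corr_threshold, quantile = 0.5, 0.8, 0.99

# 1. Drop columns whose share of missing values exceeds the threshold
too_missing = df.columns[df.isna().mean() > missing_threshold]
df = df.drop(columns=too_missing)

# 2. Drop columns that contain only one unique value
single_valued = [c for c in df.columns if df[c].nunique(dropna=True) <= 1]
df = df.drop(columns=single_valued)

# 3. For each highly correlated feature pair, drop the later column (target excluded)
corr = df.drop(columns="price").corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
too_correlated = [c for c in upper.columns if (upper[c] > corr_threshold).any()]
df = df.drop(columns=too_correlated)

# 4. Drop rows whose price lies above the chosen quantile
df = df[df["price"] <= df["price"].quantile(quantile)]

print(list(df.columns), len(df))
```

Lowering `corr_threshold` or `missing_threshold` makes the corresponding filter stricter, so more columns are removed.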

In [None]:
# Load all the data as is (no cleaning or removal), useful for plotting the data before and after cleaning
# Keyword arguments make it explicit which flag is which
data = load_and_clean(drop_columns=False, drop_rows=False)

# Load the data, with different thresholds and columns to remove
data = load_and_clean(add_redundant_feats=['host_location', 'latitude'], corr_threshold=0.7, missing_threshold=.4)

# Print which columns / how many rows are removed during cleaning
data = load_and_clean(verbose=True)

# Data Preprocessing

In [None]:
from utils.pipeline import create_pipeline

In [None]:
# The code below is used to remove warnings from the notebook output, you can ignore it
import warnings
warnings.filterwarnings('ignore')

This [function](utils/pipeline.py#L18) creates a scikit-learn pipeline for preprocessing the given dataframe. It uses the feature types specified [here](utils/_feature_types.py) to determine which preprocessing steps to apply to each feature.

**Parameters:**
- *df (pd.DataFrame):* The dataframe to preprocess.
- *model (BaseEstimator)*: The model to be used in the pipeline. Default is DecisionTreeRegressor(random_state=24).
- *filter (BaseEstimator):* The filter to be used in the pipeline. Default is SelectKBest(f_regression, k=10).
- *convert (bool):* Whether to convert the features or not. Default is True.
- *impute (bool):* Whether to impute the missing values or not. Default is True.
- *encode (bool):* Whether to encode the categorical features or not. Default is True.
- *scale (bool):* Whether to scale the features or not. Default is True.
- *feature_selection (bool):* Whether to select the features or not. Default is True.
- *model_selection (bool):* Whether to add an estimator model or not. If True, an estimator model should be provided in *model*. Default is True.
    
**Returns:**
- *Pipeline:* The pipeline for preprocessing the dataframe.

---

The function can be used either to preprocess a dataframe in order to visualize the data after cleaning, or to preprocess the data and attach it to a model.\
When the *model_selection* parameter is set to True, the function returns a pipeline that can be used to fit a model and make predictions (*.fit()* / *.predict()*; *.fit_predict()* is only available when the final estimator implements it).\
When the *model_selection* parameter is set to False, the function returns a pipeline that can be used to fit and transform the data (*.fit()* / *.transform()* / *.fit_transform()*).
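
The difference between the two modes can be sketched with plain scikit-learn. The steps below (median imputation and standard scaling) and the synthetic data are assumptions chosen for brevity; the pipeline built by `create_pipeline` contains different, project-specific steps.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data
rng = np.random.default_rng(24)
X_demo = rng.normal(size=(50, 3))
y_demo = X_demo[:, 0] * 2.0 + rng.normal(scale=0.1, size=50)


def make_steps():
    """Return fresh preprocessing steps so the two pipelines share no state."""
    return [("impute", SimpleImputer(strategy="median")), ("scale", StandardScaler())]


# Without a final estimator, the pipeline behaves like a transformer
preprocess_only = Pipeline(make_steps())
X_transformed = preprocess_only.fit_transform(X_demo)

# With a final estimator, the pipeline behaves like a model
with_model = Pipeline(make_steps() + [("model", DecisionTreeRegressor(random_state=24))])
with_model.fit(X_demo, y_demo)
y_hat = with_model.predict(X_demo)

print(X_transformed.shape, y_hat.shape)
```

In both cases the preprocessing steps are fitted as part of the pipeline, so the same transformations are applied consistently to any data passed in later.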

The pipeline is constructed from native scikit-learn transformers as well as custom converters and encoders. For reference, or to add your own, you can find the custom converters [here](utils/_custom_converters.py) and the custom encoders [here](utils/_custom_encoders.py).
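
As a rough idea of what a custom converter can look like, the sketch below defines a small scikit-learn-compatible transformer. The class `PercentToFloat` and the column `rate` are hypothetical and do not exist in `utils/_custom_converters.py`; the example only shows the general `fit` / `transform` contract such a class has to follow.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class PercentToFloat(BaseEstimator, TransformerMixin):
    """Hypothetical converter: turns strings such as '95%' into 0.95."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        # Stateless converter: there is nothing to learn from the data
        return self

    def transform(self, X):
        X = X.copy()  # avoid mutating the caller's dataframe
        for col in self.columns:
            X[col] = pd.to_numeric(X[col].str.rstrip("%"), errors="coerce") / 100
        return X


demo = pd.DataFrame({"rate": ["95%", "100%", None]})
converted = PercentToFloat(columns=["rate"]).fit_transform(demo)
print(converted)
```

Because the class inherits from `BaseEstimator` and `TransformerMixin`, it can be placed in a scikit-learn `Pipeline` like any built-in transformer.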

In [None]:
# Split the data into features and target variable
# In practice, you would typically split the data into training and testing sets, but for illustration purposes, we'll use the entire dataset
X = data.drop(columns=['price'])
y = data['price']

# Preprocess the data to only convert features into interpretable values
# This will convert specific columns and return the transformed data as a pandas DataFrame
pipe = create_pipeline(data, impute=False, encode=False, scale=False, feature_selection=False, model_selection=False)
pipe.fit_transform(X)

# Preprocess the data to convert, impute, and encode all features into numeric instances
# Pass y to fit_transform() here, since target encoding needs the target values during fitting
# The returned pandas DataFrame will contain the fully transformed dataset
pipe = create_pipeline(data, scale=False, feature_selection=False, model_selection=False)
pipe.fit_transform(X, y)

# Create the full pipeline to fit data to the model, with an interactive graphical interface
# As the pipeline ends in a model, call fit() (and later predict()) instead of fit_transform()
pipe = create_pipeline(data) # Optional: pass the model to use as the second parameter, imported separately from scikit-learn
pipe.fit(X, y)

The fitted pipeline can be used to make predictions on new data and compute metrics (note this can only be done when the function argument *model_selection* is set to *True*).

You can find a full list of the available metrics [here](https://scikit-learn.org/stable/modules/model_evaluation.html).

In [None]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

# Normally, you would use the test set for making predictions, but in this example we'll use the training set for simplicity
# Because we train and predict on the same data, the scores are near perfect and overly optimistic, which is not realistic in practice
y_pred = pipe.predict(X)
print(f"""
      Values of Target 'price':
      {'-' * 80}
      predicted y values:
      {y_pred.tolist()[:10]}

      actual y values:
      {y.tolist()[:10]}



      Model Performance Metrics:
      {'-' * 80}
      {'mean squared error:': <30} {mean_squared_error(y, y_pred):.5f}
      {'mean absolute error:': <30} {mean_absolute_error(y, y_pred):.5f}
      {'r2 score:': <30} {r2_score(y, y_pred):.5f}
      """)
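
For a more realistic estimate, the model should be scored on data it has not seen during training. The following sketch shows the idea with `train_test_split` on synthetic data, using a plain `DecisionTreeRegressor` as a stand-in for the full pipeline; the data, the 80/20 split, and the variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic regression data: one informative feature plus noise
rng = np.random.default_rng(24)
X_demo = rng.normal(size=(300, 4))
y_demo = 3.0 * X_demo[:, 0] + rng.normal(scale=1.0, size=300)

# Hold out 20% of the rows for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=24
)

model = DecisionTreeRegressor(random_state=24).fit(X_train, y_train)

# Compare the score on seen data with the score on unseen data
train_r2 = r2_score(y_train, model.predict(X_train))
test_r2 = r2_score(y_test, model.predict(X_test))
print(f"train r2: {train_r2:.3f}")
print(f"test r2:  {test_r2:.3f}")
```

The gap between the two scores indicates how much the model has overfitted; the same pattern applies when `model` is replaced by a fitted pipeline.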

# Saving Figures

In [None]:
from utils import save_figure

This [function](utils/__init__.py#L6) can be used to save a figure to the `figures` folder. It will automatically create the folder or subfolder if it does not exist in the project root.

**Parameters:**
- *fig (plt.Figure):* The figure object to save.
- *name (str):* The name of the figure with or without the extension. The default extension is .png.
- *subfolder (str):* The subfolder to save the figure to. If none is provided, the figure will be saved to the root folder.
- *dpi (int):* The resolution of the figure. Defaults to 300.
- *bbox_inches (str):* The bounding box of the figure. Defaults to tight.
- *\*\*kwargs:* Additional keyword arguments to pass to the savefig function.
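
To make the parameters more concrete, here is a minimal sketch of how a helper with this behavior could be written. The function `save_figure_sketch` and its `root` parameter are hypothetical; the real `save_figure` always writes to the `figures` folder and may differ in detail.

```python
import tempfile
from pathlib import Path

from matplotlib.figure import Figure


def save_figure_sketch(fig, name, root, subfolder=None, dpi=300, bbox_inches="tight", **kwargs):
    """Hypothetical stand-in for save_figure; `root` replaces the figures folder."""
    target_dir = Path(root) / subfolder if subfolder else Path(root)
    # Create the folder (and any missing parents) if it does not exist yet
    target_dir.mkdir(parents=True, exist_ok=True)
    path = target_dir / name
    if not path.suffix:
        # No extension given: fall back to .png
        path = path.with_suffix(".png")
    # Remaining keyword arguments are forwarded to Figure.savefig
    fig.savefig(path, dpi=dpi, bbox_inches=bbox_inches, **kwargs)
    return path


# Build a figure without pyplot so the sketch also runs without a display
fig = Figure(figsize=(4, 4))
ax = fig.add_subplot(1, 1, 1)
ax.plot([0, 1], [0, 1])

out_root = tempfile.mkdtemp()  # temporary folder instead of the project root
saved = save_figure_sketch(fig, "diagonal", out_root, subfolder="test")
print(saved)
```

Returning the final path is a design choice of this sketch; it makes it easy to check where the file ended up.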

In [None]:
# Create the figure of choice
# This example will create a random smiley plot (courtesy of https://gist.github.com/bbengfort/dd9d8027a37f3a96c44323a8303520f0)
import matplotlib.pyplot as plt
import numpy as np

fig = plt.figure(figsize=(8,8))
ax = fig.add_subplot(1,1,1, aspect=1)

ax.scatter([.5],[.5], c='#FFCC00', s=120000, label="face")
ax.scatter([.35, .65], [.63, .63], c='k', s=1000, label="eyes")

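# Use dedicated names so the feature matrix X from the earlier cells is not overwritten
smile_x = np.linspace(.3, .7, 100)
smile_y = 2 * (smile_x - .5)**2 + 0.30

ax.plot(smile_x, smile_y, c='k', linewidth=8, label="smile")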

ax.set_xlim(0,1)
ax.set_ylim(0,1)

ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.spines['left'].set_visible(False)
ax.spines['bottom'].set_visible(False)
ax.set_xticks([])
ax.set_yticks([])

# Save the figure using the custom function
save_figure(fig, 'smiley', subfolder='test')