In [17]:
import warnings
warnings.filterwarnings('ignore')

In [18]:
import pandas as pd

In [19]:
import insolver
from insolver.frame import InsolverDataFrame
from insolver.transforms import InsolverTransform
from insolver.transforms import (
    DatetimeTransforms
)
from insolver.feature_engineering import DataPreprocessing

# Load dataset

Load the dataset with `pd.read_csv()` and wrap it in an `InsolverDataFrame`.

In [21]:
#https://www.kaggle.com/dgomonov/new-york-city-airbnb-open-data
dataset = InsolverDataFrame(pd.read_csv("data/AB_NYC_2019.csv"))
dataset.head()

Unnamed: 0,id,name,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,price,minimum_nights,number_of_reviews,last_review,reviews_per_month,calculated_host_listings_count,availability_365
0,2539,Clean & quiet apt home by the park,2787,John,Brooklyn,Kensington,40.64749,-73.97237,Private room,149,1,9,2018-10-19,0.21,6,365
1,2595,Skylit Midtown Castle,2845,Jennifer,Manhattan,Midtown,40.75362,-73.98377,Entire home/apt,225,1,45,2019-05-21,0.38,2,355
2,3647,THE VILLAGE OF HARLEM....NEW YORK !,4632,Elisabeth,Manhattan,Harlem,40.80902,-73.9419,Private room,150,3,0,,,1,365
3,3831,Cozy Entire Floor of Brownstone,4869,LisaRoxanne,Brooklyn,Clinton Hill,40.68514,-73.95976,Entire home/apt,89,1,270,2019-07-05,4.64,1,194
4,5022,Entire Apt: Spacious Studio/Loft by central park,7192,Laura,Manhattan,East Harlem,40.79851,-73.94399,Entire home/apt,80,10,9,2018-11-19,0.1,1,0


Inspect the loaded dataset.

In [22]:
dataset.info()

<class 'insolver.frame.frame.InsolverDataFrame'>
RangeIndex: 48895 entries, 0 to 48894
Data columns (total 16 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              48895 non-null  int64  
 1   name                            48879 non-null  object 
 2   host_id                         48895 non-null  int64  
 3   host_name                       48874 non-null  object 
 4   neighbourhood_group             48895 non-null  object 
 5   neighbourhood                   48895 non-null  object 
 6   latitude                        48895 non-null  float64
 7   longitude                       48895 non-null  float64
 8   room_type                       48895 non-null  object 
 9   price                           48895 non-null  int64  
 10  minimum_nights                  48895 non-null  int64  
 11  number_of_reviews               48895 non-null  int64  
 12  last_review          

Convert the `last_review` date column with the `DatetimeTransforms` class, which adds a `last_review_unix` column.

In [23]:
transform = DatetimeTransforms(['last_review'])
transform(dataset);
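
To make the result of the date transformation concrete, here is a minimal standard-library sketch of turning `YYYY-MM-DD` strings into Unix timestamps, with missing dates mapped to NaN. It is not the `DatetimeTransforms` implementation: the helper name `to_unix`, the choice of seconds as the unit and the UTC timezone are all assumptions made for illustration.

```python
import math
from datetime import datetime, timezone


def to_unix(date_string):
    """Return seconds since the Unix epoch (UTC) for a 'YYYY-MM-DD' string.

    Missing values (None or an empty string) become NaN, which is why the
    resulting column has a float dtype.
    """
    if not date_string:
        return math.nan
    parsed = datetime.strptime(date_string, "%Y-%m-%d")
    # Assumption: dates are interpreted as midnight UTC.
    return parsed.replace(tzinfo=timezone.utc).timestamp()


# Values taken from the first rows of the dataset; the third row has no review.
last_review = ["2018-10-19", "2019-05-21", None]
last_review_unix = [to_unix(value) for value in last_review]
print(last_review_unix)
```

Rows without a review stay missing after the conversion, which matches the equal non-null counts of `reviews_per_month` and `last_review_unix` in the `info()` output.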

Drop the columns that are not needed for preprocessing.

In [24]:
dataset.drop(['id', 'name', 'host_id', 'host_name', 'last_review'], axis = 1, inplace=True)

In [25]:
dataset.info()

<class 'insolver.frame.frame.InsolverDataFrame'>
RangeIndex: 48895 entries, 0 to 48894
Data columns (total 12 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   neighbourhood_group             48895 non-null  object 
 1   neighbourhood                   48895 non-null  object 
 2   latitude                        48895 non-null  float64
 3   longitude                       48895 non-null  float64
 4   room_type                       48895 non-null  object 
 5   price                           48895 non-null  int64  
 6   minimum_nights                  48895 non-null  int64  
 7   number_of_reviews               48895 non-null  int64  
 8   reviews_per_month               38843 non-null  float64
 9   calculated_host_listings_count  48895 non-null  int64  
 10  availability_365                48895 non-null  int64  
 11  last_review_unix                38843 non-null  float64
dtypes: float64(4), int64(

In [8]:
dataset.isnull().sum()

neighbourhood_group                   0
neighbourhood                         0
latitude                              0
longitude                             0
room_type                             0
price                                 0
minimum_nights                        0
number_of_reviews                     0
last_review                       10052
reviews_per_month                 10052
calculated_host_listings_count        0
availability_365                      0
dtype: int64

# Automated Data Preprocessing

The `DataPreprocessing` class preprocesses data automatically.  
By default, it applies the `AutoFillNA`, `OneHotEncoder` and `Normalization` transformations.  
Each of these steps can be disabled through its corresponding parameter.  
Dimensionality reduction, sampling, smoothing and feature selection are also available; they are enabled and configured through their own parameters.

**Default settings** 

You can call this class without passing any parameters, in which case the default settings are used.

In [None]:
new_dataset = DataPreprocessing().preprocess(df = dataset, target='price')

new_dataset.head()

## Column types

You can pass lists of column names through the `categorical_columns` and `numerical_columns` parameters.

In [None]:
new_dataset = DataPreprocessing(
    categorical_columns=['neighbourhood_group', 'neighbourhood', 'room_type']).preprocess(df = dataset, target='price')

new_dataset.head()

If some columns are not listed in either parameter, a warning is issued.

In [None]:
# 'latitude' and 'longitude' are left out of both lists, which triggers the warning.
new_dataset = DataPreprocessing(
    numerical_columns = ['price', 'minimum_nights', 'number_of_reviews', 'reviews_per_month', 
                         'calculated_host_listings_count', 'availability_365', 'last_review_unix'],
    categorical_columns = ['neighbourhood_group', 'neighbourhood', 'room_type']).preprocess(df = dataset, target='price')

new_dataset.head()
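
The check behind such a warning can be sketched in a few lines of plain Python. This is only an illustration of the idea, not insolver's code: the helper `find_unlisted_columns` and the warning text are hypothetical.

```python
import warnings


def find_unlisted_columns(all_columns, numerical_columns, categorical_columns):
    """Return the columns that appear in neither list, in their original order."""
    listed = set(numerical_columns) | set(categorical_columns)
    return [column for column in all_columns if column not in listed]


# Column names of the dataset after the drop step above.
all_columns = [
    "neighbourhood_group", "neighbourhood", "latitude", "longitude",
    "room_type", "price", "minimum_nights", "number_of_reviews",
    "reviews_per_month", "calculated_host_listings_count",
    "availability_365", "last_review_unix",
]
numerical_columns = [
    "price", "minimum_nights", "number_of_reviews", "reviews_per_month",
    "calculated_host_listings_count", "availability_365", "last_review_unix",
]
categorical_columns = ["neighbourhood_group", "neighbourhood", "room_type"]

unlisted = find_unlisted_columns(all_columns, numerical_columns, categorical_columns)
if unlisted:
    # Hypothetical message; the real warning text may differ.
    warnings.warn(f"Columns not assigned to a type: {unlisted}")
```

Note that the first cell of this notebook calls `warnings.filterwarnings('ignore')`, so warnings of this kind are suppressed unless that filter is removed.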

## Categorical transformation

The `transform_categorical` parameter is the name of the categorical transformation method; `one_hot_encoder` and `encoder` are supported. If True, `one_hot_encoder` is used. If False or None, categorical columns are not transformed.

The `transform_categorical_drop` parameter is a list of categorical columns to exclude from the transformation.

In [None]:
new_dataset = DataPreprocessing(transform_categorical = None).preprocess(df = dataset, target='price')

new_dataset.head()

In [None]:
new_dataset = DataPreprocessing(transform_categorical = True, 
                                transform_categorical_drop = ['room_type']).preprocess(df = dataset, target='price')

new_dataset.head()
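
For readers unfamiliar with the two encodings, the following plain-Python sketch contrasts one-hot encoding with integer (ordinal) encoding on a few `room_type` values. It does not call insolver; in particular, treating the `encoder` option as ordinal encoding is an assumption, and the helper names are hypothetical.

```python
def one_hot_encode(values, prefix):
    """Return one 0/1 indicator column per distinct category."""
    categories = sorted(set(values))
    return {
        f"{prefix}_{category}": [1 if value == category else 0 for value in values]
        for category in categories
    }


def ordinal_encode(values):
    """Replace each category with its index in the sorted list of categories."""
    codes = {category: index for index, category in enumerate(sorted(set(values)))}
    return [codes[value] for value in values]


room_type = ["Private room", "Entire home/apt", "Private room"]

print(one_hot_encode(room_type, prefix="room_type"))
print(ordinal_encode(room_type))
```

One-hot encoding adds a column per category, so a high-cardinality column such as `neighbourhood` widens the frame considerably, whereas integer encoding keeps a single column but imposes an arbitrary order on the categories.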

## Fill NA values

The `fillna` parameter is a bool: if True, automatic NA filling is applied; if False or None, it is skipped.

The `fillna_numerical` parameter is the name of the method used to fill NA values in numerical columns; `median`, `mean`, `mode` and `remove` are supported.

The `fillna_categorical` parameter is the name of the method used to fill NA values in categorical columns; `frequent`, `new_category`, `imputed_column` and `remove` are supported.

In [None]:
new_dataset = DataPreprocessing(fillna = None).preprocess(df = dataset, target='price')

new_dataset.head()

In [None]:
new_dataset = DataPreprocessing(fillna_categorical='imputed_column',
                                fillna_numerical='mode').preprocess(df = dataset, target='price')

new_dataset.head()
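
The fill strategies named above can be illustrated without insolver. The sketch below covers `median`, `mean` and `mode` for numerical values and `frequent` and `new_category` for categorical values; the helper names and the `"Unknown"` placeholder label are assumptions, and the `remove` and `imputed_column` options are not shown.

```python
import math
import statistics
from collections import Counter


def is_missing(value):
    """Treat None and float NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def fill_numerical(values, method="median"):
    """Fill missing numbers with a statistic of the observed values."""
    present = [value for value in values if not is_missing(value)]
    statistic = {
        "median": statistics.median,
        "mean": statistics.mean,
        "mode": statistics.mode,
    }[method]
    fill_value = statistic(present)
    return [fill_value if is_missing(value) else value for value in values]


def fill_categorical(values, method="frequent"):
    """Fill missing categories with the most frequent one or a new label."""
    if method == "frequent":
        present = [value for value in values if not is_missing(value)]
        fill_value = Counter(present).most_common(1)[0][0]
    elif method == "new_category":
        fill_value = "Unknown"  # hypothetical placeholder label
    else:
        raise ValueError(f"unsupported method: {method}")
    return [fill_value if is_missing(value) else value for value in values]


# reviews_per_month for the first five rows; the third listing has no reviews.
reviews_per_month = [0.21, 0.38, math.nan, 4.64, 0.10]
print(fill_numerical(reviews_per_month, method="median"))
print(fill_categorical(["John", None, "John", "Laura"], method="frequent"))
```

The median is less sensitive than the mean to outliers such as the 4.64 value here, which is one reason to prefer it for skewed columns.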

## Normalization

The `normalization` parameter is the name of the normalization method; `standard`, `minmax`, `robust`, `normalizer`, `yeo-johnson`, `box-cox` and `log` are supported. If True, `standard` is used. If False or None, normalization is not applied.

The `normalization_drop` parameter is a list of columns to exclude from normalization.

In [None]:
new_dataset = DataPreprocessing(normalization = None).preprocess(df = dataset, target='price')

new_dataset.head()

In [None]:
new_dataset = DataPreprocessing(normalization = 'minmax',
                                normalization_drop = ['last_review_unix']).preprocess(df = dataset, target='price')

new_dataset.head()
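
As a reminder of what the two most common options compute, here is a small standard-library sketch of standard scaling and min-max scaling. It illustrates the textbook formulas only; whether insolver uses the population or the sample standard deviation for `standard` is not stated in this notebook, so the use of the population version below is an assumption.

```python
import statistics


def standard_scale(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # a constant column would give std == 0
    return [(value - mean) / std for value in values]


def minmax_scale(values):
    """Rescale linearly so the minimum maps to 0 and the maximum to 1."""
    low, high = min(values), max(values)  # a constant column would give high == low
    return [(value - low) / (high - low) for value in values]


# minimum_nights for the first five rows of the dataset.
minimum_nights = [1, 1, 3, 1, 10]
print(standard_scale(minimum_nights))
print(minmax_scale(minimum_nights))
```

Min-max scaling is bounded but sensitive to extreme values, since a single large `minimum_nights` compresses all other rows towards 0.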

## Feature Selection

The `feature_selection` parameter is the name of the feature selection method; `random_forest`, `mutual_inf`, `chi2`, `f_statistic`, `lasso` and `elasticnet` are supported. If True, `random_forest` is used. If False or None, feature selection is not applied.

The `feat_select_task` parameter is the name of the feature selection task; `reg`, `class`, `multiclass` and `multiclass_multioutput` are supported. If `feature_selection` is True or a string, this parameter must be set.

The `feat_select_threshold` parameter is the feature selection threshold; it can be `mean`, `median` or a numeric value.

In [None]:
new_dataset = DataPreprocessing(feature_selection = True, 
                                feat_select_task = 'reg').preprocess(df = dataset, target='price')

new_dataset.head()

In [None]:
new_dataset = DataPreprocessing(feature_selection = 'mutual_inf', 
                                feat_select_task = 'reg', 
                                feat_select_threshold = 'mean').preprocess(df = dataset, target='price')

new_dataset.head()
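
The role of `feat_select_threshold` can be sketched as follows: each feature receives a score from the chosen method, and features whose score reaches the threshold are kept. The scores below are made up for illustration, the helper `select_features` is hypothetical, and the use of `>=` rather than `>` is an assumption about how ties at the threshold are handled.

```python
import statistics


def select_features(scores, threshold="mean"):
    """Keep the features whose score is at least the threshold."""
    values = list(scores.values())
    if threshold == "mean":
        cutoff = statistics.mean(values)
    elif threshold == "median":
        cutoff = statistics.median(values)
    else:
        cutoff = float(threshold)  # numeric threshold
    return [name for name, score in scores.items() if score >= cutoff]


# Hypothetical importance scores, not computed from the dataset.
scores = {
    "room_type": 0.50,
    "longitude": 0.30,
    "latitude": 0.15,
    "minimum_nights": 0.05,
}

print(select_features(scores, threshold="mean"))
print(select_features(scores, threshold=0.1))
```

A lower numeric threshold keeps more features, while `mean` and `median` adapt the cutoff to the distribution of the scores.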

The following methods can be used for each task:
- for classification: mutual information, F statistics, chi-squared test, Random Forest, Lasso or ElasticNet;
- for regression: mutual information, F statistics, Random Forest, Lasso or ElasticNet;
- for multiclass classification: Random Forest, Lasso or ElasticNet;
- for multiclass multioutput classification: Random Forest.
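
The compatibility rules listed above can be written down as a small lookup table, which is handy for validating a configuration before running the preprocessing. The dictionary only restates the list using the parameter values documented in this section; the names `SUPPORTED_METHODS` and `is_supported` are hypothetical and not part of insolver.

```python
# Method names per task, using the `feature_selection` and `feat_select_task` values.
SUPPORTED_METHODS = {
    "class": {"mutual_inf", "f_statistic", "chi2", "random_forest", "lasso", "elasticnet"},
    "reg": {"mutual_inf", "f_statistic", "random_forest", "lasso", "elasticnet"},
    "multiclass": {"random_forest", "lasso", "elasticnet"},
    "multiclass_multioutput": {"random_forest"},
}


def is_supported(method, task):
    """Return True if the method is listed for the given task."""
    return method in SUPPORTED_METHODS.get(task, set())


print(is_supported("mutual_inf", "reg"))
print(is_supported("chi2", "reg"))
```

For example, `chi2` is listed for classification only, so combining it with `feat_select_task = 'reg'` falls outside the documented combinations.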

## Dimensionality Reduction

The `dim_red` parameter is the name of the dimensionality reduction method; `pca`, `svd`, `lda`, `t_sne`, `isomap`, `lle`, `fa` and `nmf` are supported. If True, `pca` is used. If False or None, dimensionality reduction is not applied.

The `dim_red_n_components` parameter is the `n_components` value passed to the dimensionality reduction method. If None, `n_components` is either determined by the model or set to the default value of 2.

The `dim_red_n_neighbors` parameter is the `n_neighbors` value (or the perplexity for `t_sne`). If None, it is set to the default value of 5 (30 for `t_sne`).

In [None]:
new_dataset = DataPreprocessing(dim_red = True).preprocess(df = dataset, target='price')

new_dataset.head()

In [None]:
new_dataset = DataPreprocessing(dim_red = 'svd', dim_red_n_components = 10).preprocess(df = dataset, target='price')

new_dataset.head()
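
To give an intuition for what `n_components` controls, the sketch below reduces two-dimensional points to a single component, in the spirit of PCA. It is a deliberately simple pure-Python version that finds the first principal direction by power iteration; it is not how insolver or any production library implements PCA, and the function name is hypothetical.

```python
import math


def first_principal_component(rows, iterations=100):
    """Return the dominant direction of variance and the 1-D projection of each row."""
    n, d = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]
    # Sample covariance matrix of the centered data.
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    # Power iteration: repeatedly multiply by the covariance matrix and renormalize.
    # Assumes the data is not constant and the start vector is not orthogonal
    # to the dominant direction.
    vector = [1.0] * d
    for _ in range(iterations):
        vector = [sum(cov[i][j] * vector[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(v * v for v in vector))
        vector = [v / norm for v in vector]
    projected = [sum(c[j] * vector[j] for j in range(d)) for c in centered]
    return vector, projected


# Toy points lying on the line y = 2x, so one component captures all the variance.
points = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
direction, projected = first_principal_component(points)
print(direction)
print(projected)
```

Each input row with two columns is replaced by a single number, which is the effect of choosing one component; with `dim_red_n_components = 10` the output would have ten such columns.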

## Sampling

The `sampling` parameter is the name of the sampling method; `simple`, `systematic`, `cluster` and `stratified` are supported. If True, `simple` is used. If False or None, sampling is not applied.

The `sampling_n` parameter is the sampling `n` value. If None, it is set to a default value that depends on the method.

The `sampling_n_clusters` parameter is the number of clusters used for sampling.

In [None]:
new_dataset = DataPreprocessing(sampling = 'cluster', sampling_n = 2, 
                                sampling_n_clusters = 8).preprocess(df = dataset, target='price')

new_dataset.head()
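
The sampling methods differ in how rows are chosen. The following standard-library sketch shows one common reading of simple, systematic and stratified sampling; how insolver interprets `sampling_n` for each method is not described here, so the parameters `n`, `step` and `n_per_group` below are assumptions, as are the helper names.

```python
import random


def simple_sample(rows, n, seed=0):
    """Draw n rows uniformly at random without replacement."""
    return random.Random(seed).sample(rows, n)


def systematic_sample(rows, step, start=0):
    """Take every step-th row, beginning at index start."""
    return rows[start::step]


def stratified_sample(rows, key, n_per_group, seed=0):
    """Draw up to n_per_group rows from each group defined by key(row)."""
    rng = random.Random(seed)
    groups = {}
    for row in rows:
        groups.setdefault(key(row), []).append(row)
    sample = []
    for members in groups.values():
        sample.extend(rng.sample(members, min(n_per_group, len(members))))
    return sample


# Room type and price for the first five rows of the dataset.
listings = [
    {"room_type": "Private room", "price": 149},
    {"room_type": "Entire home/apt", "price": 225},
    {"room_type": "Private room", "price": 150},
    {"room_type": "Entire home/apt", "price": 89},
    {"room_type": "Entire home/apt", "price": 80},
]

print(simple_sample(listings, n=2))
print(systematic_sample(listings, step=2))
print(stratified_sample(listings, key=lambda row: row["room_type"], n_per_group=1))
```

Stratified sampling guarantees that every room type is represented, which simple random sampling does not.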

## Smoothing

The `smoothing` parameter is the name of the smoothing method; `moving_average`, `lowess`, `s_g_filter` and `fft` are supported. If True, `moving_average` is used. If False or None, smoothing is not applied.

The `smoothing_column` parameter is the name of the column to smooth.

In [None]:
new_dataset = DataPreprocessing(normalization_drop = ['reviews_per_month'], 
                                smoothing = 'moving_average',
                                smoothing_column = 'reviews_per_month').preprocess(df = dataset, target='price')

new_dataset.head()
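
As an illustration of the `moving_average` option, here is a minimal trailing moving average in plain Python. The window size of 3 and the handling of the first values (averaging over whatever is available so far) are assumptions; insolver's defaults may differ.

```python
def moving_average(values, window=3):
    """Average each value with up to window - 1 preceding values."""
    smoothed = []
    for index in range(len(values)):
        start = max(0, index - window + 1)
        chunk = values[start:index + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


series = [1, 2, 3, 4, 5]
print(moving_average(series, window=3))
```

The cell above also lists `reviews_per_month` in `normalization_drop`, so the smoothed column is excluded from normalization.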