# 🔻 [Return to workflow](#leftoff)

# 🏡 **AirBNB Dataset Review** 🏨

# ❌ Update target audience and guiding questions

---

**Who?**
>* 🏢 **AirBNB Corporate** interested in maximizing customer satisfaction to increase repeat guests and encourage new guests to stay with AirBNB hosts
>
>
>* 🏡 **AirBNB hosts** interested in maximizing their ratings

**Why?**
>* 💰 **Revenue Management:** 
>
>
>
>* 🤝 **Sales:**
>
>
>
>* 🛌 **Rooms Ops:**

>
>
>

**What?**
>* 🧾 Dataset comprised of... 
>  * # different features
>  * # reservation records
>  * Source cited in Readme

❌ **How?**
>* Which models/methods?
>* Data prep and feature engineering

---

# 🎯  **Goal:**

Determining whether or not a host location would receive a score greater than or equal to 4/5 (defined by `'review_scores_rating'`).

# 📌 **To-Do**

---

- [ ] [TD1](#td1)
- [ ] [TD2](#td2)
- [ ] [TD3](#td3)
- [ ] [todo4](#td4)
- [ ] [todo5](#td5)
- [ ] [todo6](#td6)
- [ ] [todo7](#td7)

---

# 📂 **Imports and Settings**

In [None]:
## Data Handling
import pandas as pd
import numpy as np
from scipy import stats

## Visualizations
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
from ipywidgets import interact_manual
import missingno

## Modeling - SKLearn
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MultiLabelBinarizer, OneHotEncoder
from sklearn.preprocessing import StandardScaler, FunctionTransformer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.dummy import DummyClassifier  # baseline for a classification task
from sklearn import set_config
set_config(display='diagram')

# from sklearn.naive_bayes import MultinomialNB # for naive bayes model

## Settings
%matplotlib inline
plt.style.use('seaborn-talk')
pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', lambda x: f'{x:,.2f}')
pd.set_option('display.max_rows', 100)

In [None]:
## Personal functions
import clf_functions.functions as cf
%load_ext autoreload
%autoreload 1
%aimport clf_functions.functions

## ✅ Show Visualizations Setting

In [None]:
## Controlling whether or not to show visualizations
show_visualizations = False

## ❓ FSDS

In [None]:
# import fsds as fs

In [None]:
# fs.ihelp_menu([fs.ihelp_menu, sort_report])

# 📖 **Read Data**

In [None]:
## Reading data and saving to a DataFrame

source = 'data/listings.csv.gz'

data = pd.read_csv(source)

In [None]:
## Inspecting imported dataset
data.head(5)

In [None]:
## Checking number of rows and columns
data.shape

---

> The initial read of the dataset shows there are 74 features and 8,033 entries. A quick glance at the `.head()` gives a sample of the entries, showing that some of the features are not relevant to my analysis.
>
> I need to get a better idea of the statistics for the dataset, especially any missing values and the datatypes for each column. I need to pre-process this data before I can perform any modeling.

---

# 👨‍💻 **Interactive Investigation**

---

> To increase accessibility to the data, **I include a widget to allow the user to sort through the data interactively.** I use [**Jupyter Widgets**](https://ipywidgets.readthedocs.io/en/latest/index.html) to create this interactive report.
>
>**To use:** select the column to sort by from the dropdown menu, then click the "Run Interact" button.
>
>***Note about 'Drop_Cols' and Cols:*** these keyword arguments are used to allow the user to drop specific columns.
>
> **Only click the "Drop_Cols" option when specifying "Cols"!** Otherwise it will cause an error.
>
>The 'Cols' dropdown menu does not affect the resulting report; the data is filtered from the report prior to displaying the results. 
>
>I chose to include this option for flexibility and adaptability, but it does have the unintended consequence of creating another drop-down menu. Please ignore this menu, as it does not provide any additional functionality. For future work, I will disable the menu to prevent confusion.

---

In [None]:
## Running report on unfiltered dataset

interact_manual(cf.sort_report, Sort_by=list(cf.report_df(data).columns),
                Source=source);

In [None]:
data.head(3)

---

> After reviewing my data, I see there are several features that contain irrelevant entries (URLs, source data, metadata) or values that are too complicated for simple processing (such as host and listing descriptions).
>
> I will drop these columns for the second report to review the remaining data for further processing.

---

In [None]:
## Specifying columns to drop

drop = ['id', 'host_id', 'name', 'description', 'neighborhood_overview', 'host_name',
        'host_about', 'host_location', 'neighbourhood', 'property_type',
        'listing_url', 'scrape_id', 'last_scraped', 'picture_url','host_url',
        'host_thumbnail_url','host_picture_url','calendar_last_scraped']

In [None]:
## Creating updated interactive report

interact_manual(cf.sort_report, Drop_Cols = True, Cols = drop,
                Sort_by=list(cf.report_df(data).columns), Source=source);

---

> **Interpretation:**
>
> The report shows that the dataset has a big problem with missing values:
>
> * **Empty:**
>   * `neighbourhood_group_cleansed`
>   * `bathrooms`
>   * `calendar_updated`
>
>
> * **Nearly empty:**
>  * `license`
>
>
> * **Missing 26-39% of data:**
>  * `host_about`
>  * `neighborhood_overview`
>  * `neighbourhood`
>  * `host_response_time`
>  * `host_response_rate`
>  * `review_scores_value`
>  * `review_scores_checkin`
>  * `review_scores_location`
>  * `review_scores_accuracy`
>  * `review_scores_communication`
>  * `review_scores_cleanliness`
>  * `host_acceptance_rate`
>  * `reviews_per_month`
>  * `first_review`
>  * `review_scores_rating`
>  * `last_review`
>
>---
>
> I will need to address these missing values before proceeding with the modeling. A few options include:
>
> * **Filling with the string "missing"** to indicate the value was missing.
>    * *I would be able to treat "missing" as a distinct category and use it for modeling as well.*
>
>
> * **Dropping the rows with missing values.**
>    * *This would shrink the dataset and may bias my results if the values are not missing at random.*
>
>
> * I could **use the `SimpleImputer` tool from SKLearn to fill the missing values** with the mean, median, or mode values for each.
>    * *I could couple this with a `GridSearchCV` to identify the method that has the strongest positive impact on my classification metrics.*

---
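> The third option above can be sketched as follows. This is a minimal illustration on synthetic data, not the notebook's final pipeline: the names `X_demo`, `y_demo`, and the column names are hypothetical, and `f1` is only one possible scoring choice.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

## Synthetic stand-in data (hypothetical; not the AirBNB listings)
rng = np.random.default_rng(42)
X_demo = pd.DataFrame(rng.normal(size=(200, 3)), columns=['a', 'b', 'c'])
y_demo = (X_demo['a'] + X_demo['b'] > 0).astype(int)

## Randomly blanking out roughly 20% of the feature values
X_demo = X_demo.mask(rng.random(X_demo.shape) < 0.2)

## Imputation is a pipeline step, so each CV fold fits it on its own training part
pipe = Pipeline(steps=[('imputer', SimpleImputer()),
                       ('scaler', StandardScaler()),
                       ('clf', LogisticRegression(max_iter=1000))])

## Searching over the imputation strategy only
param_grid = {'imputer__strategy': ['mean', 'median', 'most_frequent']}

search = GridSearchCV(pipe, param_grid, cv=5, scoring='f1')
search.fit(X_demo, y_demo)

best_strategy = search.best_params_['imputer__strategy']
print(best_strategy, round(search.best_score_, 3))
```

> Keeping the imputer inside the `Pipeline` means the fill values are learned from the training folds only, which avoids leaking information from the validation folds.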

---

> To get a better idea of the missing values, I create a visual of the values via the 'Missingno' package. This visualization package includes several options for visualizing the missing data.

---

In [None]:
## Visually inspecting missing values
if show_visualizations:
    missingno.bar(data, labels=True);

---

> Based on this visualization, I see that **there is a consistent trend in missing values for review scores:** if a row is missing one review score, it seems to be missing all of them.
>
> Additionally, **there are many missing values for the response time, response rate, and acceptance rate.** I want to use these columns in my classification, so I will need to replace those missing values.
>
> After reviewing these details, **I feel more comfortable with the option of dropping those rows with missing review values.** I will drop the values as part of my overall classification process.

---
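> The "all or nothing" pattern in the review scores can also be checked numerically rather than by eye. The sketch below uses a small hypothetical frame (`toy`) in place of the real data; the same logic should carry over to `data` with the full list of `review_scores_*` columns.

```python
import numpy as np
import pandas as pd

## Hypothetical sample: rows 1 and 3 have no review scores at all
toy = pd.DataFrame({
    'review_scores_rating':      [4.8, np.nan, 4.5, np.nan],
    'review_scores_cleanliness': [4.9, np.nan, 4.6, np.nan],
    'review_scores_value':       [4.7, np.nan, 4.4, np.nan],
    'host_response_rate':        ['100%', np.nan, np.nan, '90%'],
})

## Selecting the review score columns by prefix
review_cols = [col for col in toy.columns if col.startswith('review_scores_')]

## Counting missing review scores per row
n_missing = toy[review_cols].isna().sum(axis=1)

## True when every row is missing either none or all of the review scores
all_or_nothing = n_missing.isin([0, len(review_cols)]).all()
rows_without_reviews = int((n_missing == len(review_cols)).sum())

print(all_or_nothing, rows_without_reviews)
```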

# 🧼 **Data Cleaning and EDA**

## 🔎 Fixing Missing Values

---

> This dataset is missing a significant number of values for different columns. **In order to perform any modeling, I will need to address these missing values first.**
>
> Depending on the feature and the number of missing values per row, I will take different approaches to keep as much data as possible and in its original state.

---

In [None]:
# Dropping features with high percentages (25%+) of missing values

drop_na_cols = []
for col in data.columns:
    if ((data[col].isna().sum()) / len(data[col])) > .25 and col != 'review_scores_rating':
        drop_na_cols.append(col)

drop_na_cols

In [None]:
## Appending previous list of columns to drop (metadata, etc.)

for col in drop:
    if col not in drop_na_cols:
        drop_na_cols.append(col)

drop_na_cols

In [None]:
## Creating new dataframe that does not include the features to drop
df = data.drop(columns= drop_na_cols).copy()
df

In [None]:
## Confirming dropped columns with high missing values
cf.report_df(df)

In [None]:
## Filling missing values for 'beds' with values for 'bedrooms'

## Using .loc with a boolean mask to avoid chained-assignment problems
fill_beds = df['beds'].isna() & (df['bedrooms'] > 0)
df.loc[fill_beds, 'beds'] = df.loc[fill_beds, 'bedrooms']

In [None]:
## Filling missing values for 'bedrooms' with values for 'beds'

## Using .loc with a boolean mask to avoid chained-assignment problems
fill_bedrooms = df['bedrooms'].isna() & (df['beds'] > 0)
df.loc[fill_bedrooms, 'bedrooms'] = df.loc[fill_bedrooms, 'beds']

In [None]:
## Confirming reduction in missing values for 'beds' and 'bedrooms'

rpt_clean  = cf.report_df(df)
rpt_clean[rpt_clean['null_sum'] >0]

In [None]:
## Checking remaining missing values

df.isna().sum()

In [None]:
## Removing rows with 6+ null values

df = df[df.isna().sum(axis=1) < 6]
df.head(5)

In [None]:
df.isna().sum()

---

> At this point, **I cleaned up most of the null values via dropping columns with 25%+ missing values and dropping rows with 6+ missing values.**
>
>Additionally, **I filled missing values for 'beds'/'bedrooms' by checking the missing values for each column against the values in the other for each row.** If a row had a value in one of the columns but not the other, I filled the missing value with the value from the other column.
>
> Now, **I will take a different approach to my target feature, the "review_scores_rating."** This feature contains a substantial number of missing values. Due to the impact of changing the values of my target variable, I will use a box plot to inspect the values, then determine whether to use the mean or median value to replace the missing values.

---

In [None]:
## Visualizing the pre-processed target values

sns.boxplot(data = df['review_scores_rating']);

In [None]:
sns.displot(data=df['review_scores_rating']);

In [None]:
sns.displot(data=df['review_scores_rating'][df['review_scores_rating'] >= 4]);

In [None]:
## Calculating mean for target values
mean = df['review_scores_rating'].mean()
mean

In [None]:
## Calculating median for target values
median = df['review_scores_rating'].median()
median

---

>Based on this box plot and distributions, I see that **a large majority of the values are 4+, with a large number of values being 4.8+.**
>
>For the purposes of my classification (whether or not a given host would have a rating of 4+), I can fill the missing values with either the mean or the median, since both fall at or above the threshold of 4.
>
> **I will fill the missing values with the mean value of 4.68** to help ensure a fair representation of the overall data.

---
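> To illustrate why the choice between mean and median matters little for a target that is later split at 4, here is a small sketch on made-up ratings (the values in `ratings` are hypothetical). Any fill value at or above the threshold places the imputed rows in the 4+ class.

```python
import numpy as np
import pandas as pd

## Hypothetical ratings with two missing values
ratings = pd.Series([5.0, 4.8, 4.9, 4.0, 4.7, 4.6, 3.0, np.nan, np.nan])
threshold = 4

fill_values = {'mean': ratings.mean(), 'median': ratings.median()}
share_high = {}

for name, fill_value in fill_values.items():
    filled = ratings.fillna(fill_value)
    ## Share of rows that would land in the 4+ class after filling
    share_high[name] = (filled >= threshold).mean()
    print(f'{name}: fill value {fill_value:.2f}, share 4+ {share_high[name]:.2f}')
```

> Note that this only holds for the binary target. The filled values themselves differ, so the choice would matter for any analysis that uses the raw scores.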

In [None]:
## Filling with the mean, as decided above
df['review_scores_rating'] = df['review_scores_rating'].fillna(mean)

In [None]:
df['review_scores_rating'].describe()

In [None]:
cf.report_df(df)

In [None]:
## Resetting the index after dropping rows

df.reset_index(drop=True, inplace=True)

---

> At this point, I addressed most of the missing values in my dataset by dropping columns and filling missing values. There are still a few columns with missing values, but I will use a SimpleImputer combined with a GridSearchCV to determine the best method by which to fill those values.
>
> Now I will review the remaining data and determine if there are any other issues with my data.

---

RETURN ANCHOR FOR LINK <a name="leftoff"></a>

## **🛠** Changing DataTypes

In [None]:
## Reviewing the remaining dataframe
df.head(3)

**COMMENT:** what next? 

* DONE: T/F columns to 1/0


* DONE: 'host_since' to DT


* DONE: 'price' -$, to float


* DONE: 'neighbourhood_cleansed' split on ", " and convert to binary columns, then drop host_neighbourhood


* DONE: 'bathrooms_text' split on space, keep 1st part, convert to int


* 'host_verifications' - single string, needs extensive work in order to MLB

### Converting True/False to 1/0

In [None]:
## Creating list of true/false features to convert to 1/0, respectively

t_f_xf = ['host_is_superhost','host_has_profile_pic','host_identity_verified',
          'has_availability','instant_bookable']
t_f_xf

In [None]:
## Converting datatype to "string" to replace values

df[t_f_xf] = df[t_f_xf].astype('str')
df[t_f_xf].dtypes

In [None]:
df[t_f_xf]

In [None]:
## Converting t/f to 1/0, respectively

df[t_f_xf] = df[t_f_xf].replace({ 't' : 1, 'f' : 0})

In [None]:
df[t_f_xf]

In [None]:
df[t_f_xf] = df[t_f_xf].astype(int)

In [None]:
## Verifying results

cf.report_df(df[t_f_xf])

### Price -$, to Float

In [None]:
## Converting each value into a float for processing

try:
    df['price'] = df['price'].map(lambda price: price[1:].replace(',','')).astype('float')
    df['price'][0]
except Exception:
    print('\nValues are already processed and saved. No changes necessary')
    print(f"\nSample value: {df['price'][0]}")

In [None]:
df['price'].describe()

### Host_Since to Datetime

In [None]:
df.head(2)

In [None]:
df['host_since'] = pd.to_datetime(df['host_since'])
df['host_since']

In [None]:
df["host_since"].describe(datetime_is_numeric=True)

### Bathrooms_Text

---

> Goal: convert "bathrooms_text" into new "num_bathrooms" column to indicate number of bathrooms at a host property.
>
> The old "bathrooms" feature was empty and was dropped as part of processing missing data.

---

In [None]:
## Checking current dataframe contents

df.head(3)

In [None]:
## Checking for null values overall
df.isna().sum()[df.isna().sum() > 0]

In [None]:
## Inspecting a selection of values from the column to understand the values
df.loc[:,'bathrooms_text'][:21]

In [None]:
## Inspecting the rows in which there are null values
df[df['bathrooms_text'].isna()]

In [None]:
## Filling null values with unique string ('Baths' not present otherwise)
## Unique string can be used later to check for any other zero baths

df['bathrooms_text'] = df['bathrooms_text'].fillna('0 Baths')

In [None]:
## Verifying all null values are filled
df.isna().sum()[df.isna().sum() > 0]

In [None]:
df.loc[:,'bathrooms_text'].isna().sum()

In [None]:
## Keeping the first space-separated token of each value as the bathroom count
df['num_bathrooms'] = df['bathrooms_text'].map(lambda x: x.split(' ')[0])
df['num_bathrooms'].value_counts()

In [None]:
## Verifying results that are words, not numbers

replace = ['Half-bath', 'Shared', 'Private']

for x in df['bathrooms_text']:
    for i in replace:
        if i in x:
            print(x)

---

> **I will replace these values with the numeric value .5 as they are half-baths.** This will allow me to convert the column datatype to a float and use the column more easily in my modeling.

---
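> As an alternative sketch of the same conversion, the number can be pulled out with a regular expression and the half-bath entries filled in afterwards. The sample strings below are hypothetical examples of the kinds of values seen in `bathrooms_text`; this is an illustration, not the approach used in the cells that follow.

```python
import pandas as pd

## Hypothetical sample of bathroom descriptions
text = pd.Series(['1 bath', '2.5 baths', 'Half-bath', 'Shared half-bath',
                  'Private half-bath', '1.5 shared baths'])

## Extracting the leading number (integer or decimal); non-matches become NaN
num = pd.to_numeric(text.str.extract(r'(\d+(?:\.\d+)?)', expand=False))

## Entries without a number that mention a half-bath are set to 0.5
is_half = text.str.contains('half-bath', case=False)
num = num.where(~(num.isna() & is_half), 0.5)

print(num.tolist())
```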

In [None]:
## Replacing string values with .5 to represent half-bathrooms

replace = {'Half-bath': .5, 'Shared': .5, 'Private': .5}

df['num_bathrooms'] = df['num_bathrooms'].replace(replace)

df['num_bathrooms'] = df['num_bathrooms'].astype(float)

In [None]:
## Inspecting resulting values

df['num_bathrooms'].value_counts(dropna=False)

In [None]:
## Inspecting listings with more than 10 bathrooms

df[df['num_bathrooms'] >10]

---

> After looking up the locations listed above on Google Maps (using their latitude/longitude), I believe these three listings with more than 10 bathrooms are either duplicates or incorrect values (as with the listing showing 50 baths).
>
> Due to the questionable nature of these values, I will drop these rows to prevent these outliers from impacting my results.

---
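> Dropping those outliers could look like the sketch below, shown on a hypothetical frame (`toy`) rather than on `df`. The cutoff of 10 comes from the inspection above.

```python
import pandas as pd

## Hypothetical sample with two implausible bathroom counts
toy = pd.DataFrame({'num_bathrooms': [1.0, 2.5, 50.0, 0.5, 11.0]})

## Keeping rows at or below the cutoff and rebuilding a clean index
max_bathrooms = 10
toy = toy[toy['num_bathrooms'] <= max_bathrooms].reset_index(drop=True)

print(toy)
```

> One caveat: a comparison such as `<= 10` is False for missing values, so any rows with a null bathroom count would be dropped as well.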

In [None]:
## Identifying and using indices for rows with zero bathrooms for inspection

zero_bath_idx = df.index[df['num_bathrooms'] == 0].to_list()

## Index labels are used here, so .loc (not .iloc) is the matching accessor
df.loc[zero_bath_idx][:10]

---

> My review of the original bathroom text for the listings with zero bathrooms shows that they are associated with private rooms. This would make sense, as such listings may not include a bathroom option (such as a shared bath).
>
> Additionally I did fill 9 instances of missing values with "0 Baths," which would contribute slightly to this count.
>
> Overall, I feel the data is valid and I will use it for my modeling.

---

### neighbourhood_cleansed

In [None]:
df.loc[:,'neighbourhood_cleansed']

In [None]:
df.loc[:,'neighbourhood_cleansed'].dtype

---

> The current values for "neighbourhood_cleansed" are a single string value. **I will separate each neighborhood and convert them into a binary column to represent whether or not that neighborhood is included in the listing, then drop the old column.**

---

In [None]:
## Testing the splitting between neighborhoods

df.loc[:,'neighbourhood_cleansed'][1].split(', ')

In [None]:
## Converting values into a list of strings for each neighborhood

try:
    df['neighbourhood_cleansed'] = df['neighbourhood_cleansed'] \
                                                .apply(lambda x: x.split(', '))
    display(df.loc[:,'neighbourhood_cleansed'])
except Exception:
    print('\nValues are already processed and saved. No changes necessary.')
    print(f"\nSample value: {df.loc[:,'neighbourhood_cleansed'][3]}")

---

> The following code snippet is adapted from [here](https://stackoverflow.com/questions/45312377/how-to-one-hot-encode-from-a-pandas-column-containing-a-list#:~:text=Sparse%20solution%20(for%20Pandas%20v0.25.0%2B)) by the user [Maxu](https://stackoverflow.com/users/5741205/maxu).

---

In [None]:
## Converting each neighborhood into a binary column and dropping old column

mlb = MultiLabelBinarizer()

try:
    df = df.join(pd.DataFrame(mlb.fit_transform(df.pop('neighbourhood_cleansed')),
                              columns=mlb.classes_,index=df.index))
except Exception:
    print('\nValues are already processed and saved. No changes necessary.')

In [None]:
## Inspecting results

df.head(3)

---

> After using the MultiLabelBinarizer, I successfully added a column for each neighborhood, indicating whether or not that neighborhood was included in the listing.
>
> This enables me to use the presence/absence of a  neighborhood as a category in my modeling.

---

### Host_Verifications

In [None]:
df['host_verifications'][:10]

---

> For the "host_verifications" and "amenities" features, the values are a single string with several items within the string.
>
> It is somewhat similar to the "neighbourhood_cleansed" feature in the sense that I will need to filter out the individual items from the string. However, there is an added complication as I need to remove the brackets and quotation marks from the strings.
>
> Once I filter out the items, I will be able to use the MultiLabelBinarizer again to create a binary column for each item.

---
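> Another way to handle these list-like strings, sketched here under the assumption that each value is formatted as a valid Python list literal (for example `"['email', 'phone']"`), is to parse them with `ast.literal_eval` instead of stripping characters by hand. The sample values in `raw` are hypothetical.

```python
import ast

import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

## Hypothetical sample values, assumed to be valid list literals
raw = pd.Series(["['email', 'phone']", "['phone']", "[]"])

## Parsing each string into a real Python list
parsed = raw.apply(ast.literal_eval)

## Creating one binary column per distinct item
mlb_demo = MultiLabelBinarizer()
encoded = pd.DataFrame(mlb_demo.fit_transform(parsed),
                       columns=mlb_demo.classes_, index=raw.index)

print(encoded)
```

> With this approach an empty list produces a row of zeros rather than a column named after an empty string, and items that contain a comma stay in one piece. It will raise an error on any value that is not a valid literal, so it should be tested against the real column first.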

In [None]:
## Testing the splitting between items

df.loc[:,'host_verifications'][1]

In [None]:
## Stripping brackets and quotation marks, then splitting each string into a list
## regex=False treats '[' and ']' as literal characters rather than regex syntax
for col in ['host_verifications', 'amenities']:
    for char in ['[', ']', "'", '"']:
        df[col] = df[col].str.replace(char, '', regex=False)
    df[col] = df[col].str.split(', ')

In [None]:
## Converting each value into a binary column and dropping old column

mlb2 = MultiLabelBinarizer()
    
df = df.join(pd.DataFrame(mlb2.fit_transform(df.pop('host_verifications')),
                                  columns=mlb2.classes_,index=df.index))

df

---

> At this point, I successfully processed the 'host_verifications' feature into distinct categories for modeling.
>
> In the future, I may attempt to do the same for the 'amenities' feature, but I don't want to create too many columns before my initial modeling.

---

#### Old Code

In [None]:
# for x in ['host_verifications', 'amenities']:
#     print(df[x])

In [None]:
# df['amenities'][:10]

In [None]:
# for x in ['host_verifications', 'amenities']:
#     df[x] = df[x].str.replace('and', '')

In [None]:
# ## Converting each value into a binary column and dropping old column

# mlb = MultiLabelBinarizer()
    
# df = df.join(pd.DataFrame(mlb.fit_transform(df.pop('amenities')),
#                                   columns=mlb.classes_,index=df.index))

In [None]:
# df.loc[:,'host_verifications'] = df.loc[:,'host_verifications'].str.replace('[', '')
# df.loc[:,'host_verifications'] = df.loc[:,'host_verifications'].str.replace(']', '')
# df.loc[:,'host_verifications'] = df.loc[:,'host_verifications'].str.replace("'", '')

In [None]:
# df.loc[:,'host_verifications']

In [None]:
# df['amenities'] = df['amenities'].str.replace('[', '')
# df['amenities'] = df['amenities'].str.replace(']', '')
# df['amenities'] = df['amenities'].str.replace('"', '')

In [None]:
# df['amenities']

In [None]:
# df['amenities'] = df['amenities'].apply(lambda x: x.split(', '))

In [None]:
# df['amenities'][0]

In [None]:
# df['host_verifications'] = df['host_verifications'].apply(lambda x: x.split(', '))

In [None]:
# df['host_verifications'][0][0]

In [None]:
# def convert_to_col(df, list_cols):
#     '''For a given list of column names, separates each string value in the
#     column by the comma/space pattern to return new strings of single values.
    
#     Then, instantiates a MultiLabelBinarizer to create new columns for each 
#     new string to indicate the presence or absence of that string in the 
#     original column.'''
    
# #     mlb = MultiLabelBinarizer()
    
#     for x in list_cols:
#         try:
#             df[x] = df[x].apply(lambda x: x.split(', '))
#             print(f'Successfully split values in column "{x}"')
            
#         except Exception:
#             print('\nValues are already processed and saved.')
#             print(f"\nSample value: {df.loc[:,x][3]}")
            
# #         try:
# #             df = df.join(pd.DataFrame(mlb.fit_transform(df.pop(x)),
# #                                       columns=mlb.classes_,index=df.index))
# #         except Exception:
# #                 print('\nValues are already processed and saved.')
                
#     return df

In [None]:
# binarize_cols = ['host_verifications', 'amenities'] 

# convert_to_col(df, binarize_cols)

In [None]:
# # mlb = MultiLabelBinarizer()
    
# df = df.join(pd.DataFrame(mlb.fit_transform(df.pop('amenities')),
#                                   columns=mlb.classes_,index=df.index))

In [None]:
# ## Converting values into a list of strings for each neighborhood

# try:
#     df['host_verifications'] = df['host_verifications'] \
#                                                 .apply(lambda x: x.split(', '))
#     display(df.loc[:,'host_verifications'])
# except Exception:
#     print('\nValues are already processed and saved. No changes necessary.')
#     print(f"\nSample value: {df.loc[:,'host_verifications'][3]}")
    
    

In [None]:
# ## Inspecting results

# df.head(3)

In [None]:
# test3 = df['host_verifications'][0]
# test3[1:-1].replace('"', "'").split(",")

In [None]:
# # df['Tags'] = df.Tags.apply(lambda x: x[1:-1].split(','))

# df['host_verifications'].apply(lambda x: x.split(','))[0]

## 🎯 Inspecting the Target Variable

In [None]:
df['review_scores_rating'].value_counts(bins=5, sort=False, normalize=True)

In [None]:
df['review_scores_rating'][df['review_scores_rating'] < 4].value_counts(bins=4, sort=False, normalize=True)

---

> The target feature, `'review_scores_rating'`, is currently a range of values from 0 to 5, with 95% of the scores being 4 or above. The above results show a sub-zero lower edge for the first bin; this is a side effect of the binning, which extends the lowest edge slightly so that the minimum value is included. The lowest value is actually 0.00.
>
> Of the scores less than 4, a little under half are between 3 and 4 (rounded) and about a third are between 0 and 1 (rounded).
>
> For my classification modeling, my classes are significantly imbalanced between values less/greater than 4 with a 5/95 split. I need to perform some over-sampling of the minority class to increase my model's performance.

---
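> A minimal sketch of how the binary target and a simple over-sampling step might look is shown below. The frame `toy` and the column name `high_rating` are hypothetical, and `sklearn.utils.resample` is only one of several ways to over-sample.

```python
import pandas as pd
from sklearn.utils import resample

## Hypothetical ratings: eight at 4+ and two below 4
toy = pd.DataFrame({'review_scores_rating': [4.9, 5.0, 4.7, 4.8, 4.5,
                                             4.0, 4.95, 4.6, 3.2, 1.0]})

## Binary target: 1 if the rating is 4 or above, otherwise 0
toy['high_rating'] = (toy['review_scores_rating'] >= 4).astype(int)
print(toy['high_rating'].value_counts(normalize=True))

## Over-sampling the minority class (with replacement) to match the majority
majority = toy[toy['high_rating'] == 1]
minority = toy[toy['high_rating'] == 0]
minority_up = resample(minority, replace=True,
                       n_samples=len(majority), random_state=42)

balanced = pd.concat([majority, minority_up])
print(balanced['high_rating'].value_counts())
```

> Over-sampling should be applied to the training data only, after the train/test split, so that copies of the same row do not end up in both sets.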

# 🪓 **Train/Test Split**

---

> Before I run any further pre-processing, I split my data into training and test sets to allow me to test my model's performance.

---
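> With classes this imbalanced, a stratified split helps keep the class proportions similar in both sets. The sketch below uses hypothetical data (`X_demo`, `y_demo`) and an assumed 75/25 split; it is an illustration rather than the final split for this notebook.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

## Hypothetical features and a 10/90 imbalanced binary target
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({'price': rng.uniform(50, 500, size=100),
                       'beds': rng.integers(1, 5, size=100)})
y_demo = pd.Series([0] * 10 + [1] * 90, name='high_rating')

## stratify=y_demo keeps the class ratio roughly equal in train and test
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, stratify=y_demo, random_state=42)

print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))
```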

In [None]:
# ## Creating features/target for dataset
# target = 'review_scores_rating'

# X = df_cleaned.drop(columns = target).copy()
# y = df_cleaned[target].copy()

In [None]:
# ## Confirming same number of rows
# X.shape[0] == y.shape[0]

In [None]:
# ## Splitting to prevent data leakage
# X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# 🚿 **Preprocessing Pipeline**

In [None]:
# cat_cols = ['hotel', 'meal','arrival_date_month', 'country', 'market_segment',
#             'distribution_channel','is_repeated_guest','reserved_room_type',
#             'assigned_room_type','deposit_type', 'agent',
#             'customer_type','reservation_status']

# cont_cols = [col for col in X_train.drop(['reservation_status_date','company'],axis=1).columns if col not in cat_cols]

# cont_cols

In [None]:
# X_train[cat_cols] = X_train[cat_cols].astype(str)

In [None]:
# X_test[cat_cols] = X_test[cat_cols].astype(str)

In [None]:
# ## Creating ColumnTransformer and sub-transformers for imputation and encoding

# # Filling missing "Children"
# zero_transformer = SimpleImputer(strategy='constant', fill_value=0)

# ##  
# missing_transformer = SimpleImputer(strategy='constant', fill_value='missing')

# ## Encoding categoricals - handling errors to prevent issues w/ test set
# categorical_transformer = OneHotEncoder(handle_unknown='ignore', sparse=False)

# cat_pipe = Pipeline(steps=[('imputer', missing_transformer),
#                       ('ohe', categorical_transformer)])

# cont_pipe = Pipeline(steps=[('imputer', zero_transformer),
#                            ('scaler', StandardScaler())])

# ## Instantiating the ColumnTransformer and including all transformers
# preprocessor = ColumnTransformer(
#     transformers=[('conts', cont_pipe, cont_cols),
#                   ('cats', cat_pipe, cat_cols)])

# preprocessor

In [None]:
# preprocessor.fit(X_train)

# ## Getting feature names from OHE
# ohe_cat_names = preprocessor.named_transformers_['cats'].named_steps['ohe'].get_feature_names(cat_cols)

In [None]:
# ## Generating list for column index
# final_cols = [*cont_cols, *ohe_cat_names]

# ## Fit and transform the data via the ColumnTransformer
# X_train_tf = preprocessor.transform(X_train)
# X_train_tf_df = pd.DataFrame(X_train_tf, columns=final_cols, index=X_train.index)

# ## Transforming the test set and saving
# X_test_tf = preprocessor.transform(X_test)
# X_test_tf_df = pd.DataFrame(X_test_tf, columns=final_cols, index=X_test.index)

# display(X_train_tf_df.head(5),X_test_tf_df.head(5))

# 📝 Next Steps

* Process classification model, e.g., LogReg, KNN, DecisionTrees, etc.
* Evaluate results
* Determine if I need to redo pre-processing steps

# 🚿 Classification Pipeline