# 🔻 [Return to workflow](#leftoff)

⚓ ANCHOR FOR RETURN TO WORKFLOW LINK <a name="leftoff"></a>

# 🏡 **AirBNB Dataset Review** 🏨

# ❌ Update target audience and guiding questions

---

**Who?**
>* 🏢 **AirBNB Corporate** interested in maximizing customer satisfaction to increase repeat guests and encourage new guests to stay with AirBNB hosts
>
>
>* 🏡**AirBNB hosts** interested in maximizing the ratings

**Why?**
>* 💰 **Revenue Management:** 
>
>
>
>* 🤝 **Sales:**
>
>
>
>* 🛌 **Rooms Ops:**

>
>
>

**What?**
>* 🧾 Dataset comprised of... 
>  * different features
>  * reservation records
>  * Source cited in Readme

❌ **How?**
>* Which models/methods?
>* Data prep and feature engineering

---

# 🎯  **Goal:**

Determining whether or not a host location would receive a score greater than or equal to 4/5 (defined by `'review_scores_rating'`).

# 📌 **To-Do**

---

- [ ] [TD1](#td1)
- [ ] [TD2](#td2)
- [ ] [TD3](#td3)
- [ ] [todo4](#td4)
- [ ] [todo5](#td5)
- [ ] [todo6](#td6)
- [ ] [todo7](#td7)

---

# 📂 **Imports and Settings**

In [1]:
## Data Handling
import pandas as pd
import numpy as np
from scipy import stats


## Visualizations
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
from ipywidgets import interact_manual
import missingno

## Modeling - SKLearn
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import Binarizer, MultiLabelBinarizer, \
                                    OneHotEncoder, StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.dummy import DummyRegressor
from sklearn import set_config
set_config(display='diagram')

from imblearn.over_sampling import SMOTE,SMOTENC

# from sklearn.naive_bayes import MultinomialNB # for naive bayes model

## Settings
%matplotlib inline
plt.style.use('seaborn-talk')
pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', lambda x: f'{x:,.2f}')
pd.set_option('max_rows', 100)

In [2]:
## Personal functions
import clf_functions.functions as cf
%load_ext autoreload
%autoreload 1
%aimport clf_functions.functions

## ✅ Show Visualizations Setting

In [3]:
## Controlling whether or not to show visualizations
show_visualizations = False

## ❓ FSDS

In [4]:
# import fsds as fs

In [5]:
# fs.ihelp_menu([fs.ihelp_menu, sort_report])

# 📖 **Read Data**

In [6]:
## Reading data and saving to a DataFrame

source = 'data/listings.csv.gz'

data = pd.read_csv(source)

In [7]:
## Inspecting imported dataset
data.head(5)

Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,description,neighborhood_overview,picture_url,host_id,host_url,host_name,host_since,host_location,host_about,host_response_time,host_response_rate,host_acceptance_rate,host_is_superhost,host_thumbnail_url,host_picture_url,host_neighbourhood,host_listings_count,host_total_listings_count,host_verifications,host_has_profile_pic,host_identity_verified,neighbourhood,neighbourhood_cleansed,neighbourhood_group_cleansed,latitude,longitude,property_type,room_type,accommodates,bathrooms,bathrooms_text,bedrooms,beds,amenities,price,minimum_nights,maximum_nights,minimum_minimum_nights,maximum_minimum_nights,minimum_maximum_nights,maximum_maximum_nights,minimum_nights_avg_ntm,maximum_nights_avg_ntm,calendar_updated,has_availability,availability_30,availability_60,availability_90,availability_365,calendar_last_scraped,number_of_reviews,number_of_reviews_ltm,number_of_reviews_l30d,first_review,last_review,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,license,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,3686,https://www.airbnb.com/rooms/3686,20210710190002,2021-07-11,Vita's Hideaway,IMPORTANT NOTES<br />* Carefully read and be s...,We love that our neighborhood is up and coming...,https://a0.muscache.com/pictures/61e02c7e-3d66...,4645,https://www.airbnb.com/users/show/4645,Vita,2008-11-26,"Washington D.C., District of Columbia, United ...","I am a literary scholar, teacher, poet, vegan ...",within a day,80%,75%,f,https://a0.muscache.com/im/users/4645/profile_...,https://a0.muscache.com/im/users/4645/profile_...,Anacostia,2.0,2.0,"['email', 'phone', 'reviews', 'jumio', 'offlin...",t,t,"Washington, District of Columbia, United States",Historic Anacostia,,38.86,-76.99,Private room in house,Private room,1,,1 private bath,1.0,1.0,"[""First aid kit"", ""Long term stays allowed"", ""...",$55.00,2,365,2,2,365,365,2.0,365.0,,t,1,31,61,336,2021-07-11,75,3,0,2014-06-22,2021-01-12,4.59,4.71,4.44,4.89,4.82,3.8,4.58,,f,2,0,2,0,0.87
1,3943,https://www.airbnb.com/rooms/3943,20210710190002,2021-07-11,Historic Rowhouse Near Monuments,Please contact us before booking to make sure ...,This rowhouse is centrally located in the hear...,https://a0.muscache.com/pictures/432713/fab7dd...,5059,https://www.airbnb.com/users/show/5059,Vasa,2008-12-12,"Washington, District of Columbia, United States",I have been living and working in DC for the l...,within a few hours,100%,29%,f,https://a0.muscache.com/im/pictures/user/8ec69...,https://a0.muscache.com/im/pictures/user/8ec69...,Eckington,0.0,0.0,"['email', 'phone', 'reviews', 'kba']",t,t,"Washington, District of Columbia, United States","Edgewood, Bloomingdale, Truxton Circle, Eckington",,38.91,-77.0,Private room in townhouse,Private room,2,,1.5 shared baths,1.0,1.0,"[""Cooking basics"", ""First aid kit"", ""Dedicated...",$70.00,2,1125,2,2,1125,1125,2.0,1125.0,,t,9,39,69,344,2021-07-11,429,0,0,2010-08-08,2018-08-07,4.82,4.89,4.91,4.94,4.9,4.54,4.74,,f,2,0,2,0,3.22
2,4529,https://www.airbnb.com/rooms/4529,20210710190002,2021-07-11,Bertina's House Part One,This is large private bedroom with plenty of...,Very quiet neighborhood and it is easy accessi...,https://a0.muscache.com/pictures/86072003/6709...,5803,https://www.airbnb.com/users/show/5803,Bertina'S House,2008-12-30,"Washington, District of Columbia, United States","I am an easy going, laid back person who loves...",,,,f,https://a0.muscache.com/im/users/5803/profile_...,https://a0.muscache.com/im/users/5803/profile_...,Eastland Gardens,3.0,3.0,"['email', 'phone', 'facebook', 'reviews', 'jum...",t,t,"Washington, District of Columbia, United States","Eastland Gardens, Kenilworth",,38.91,-76.94,Private room in house,Private room,4,,1 shared bath,1.0,1.0,"[""Cooking basics"", ""First aid kit"", ""Keypad"", ...",$54.00,30,180,30,30,180,180,30.0,180.0,,t,29,59,89,179,2021-07-11,102,0,0,2014-09-23,2019-07-05,4.66,4.8,4.6,4.93,4.93,4.51,4.83,,f,1,0,1,0,1.23
3,4967,https://www.airbnb.com/rooms/4967,20210710190002,2021-07-11,"DC, Near Metro","<b>The space</b><br />Hello, my name is Seveer...",,https://a0.muscache.com/pictures/2439810/bb320...,7086,https://www.airbnb.com/users/show/7086,Seveer,2009-01-26,"Washington D.C., District of Columbia, United ...","I am fun, honest and very easy going and trave...",within a few hours,100%,78%,t,https://a0.muscache.com/im/pictures/user/6efb4...,https://a0.muscache.com/im/pictures/user/6efb4...,Ivy City,5.0,5.0,"['email', 'phone', 'reviews', 'kba']",t,t,,"Ivy City, Arboretum, Trinidad, Carver Langston",,38.91,-76.99,Private room in house,Private room,1,,3 baths,1.0,1.0,"[""Cable TV"", ""TV with standard cable"", ""Kitche...",$99.00,2,365,2,2,365,365,2.0,365.0,,t,0,0,0,146,2021-07-11,31,0,0,2012-02-13,2016-09-22,4.74,4.68,4.89,4.93,4.93,4.21,4.64,,f,3,0,3,0,0.27
4,5589,https://www.airbnb.com/rooms/5589,20210710190002,2021-07-11,Cozy apt in Adams Morgan,This is a 1 br (bedroom + living room in Adams...,"Adams Morgan spills over with hipsters, salsa ...",https://a0.muscache.com/pictures/207249/9f1df8...,6527,https://www.airbnb.com/users/show/6527,Ami,2009-01-13,"Washington D.C., District of Columbia, United ...","I am an environmentalist, and I own and operat...",within a few hours,100%,17%,f,https://a0.muscache.com/im/users/6527/profile_...,https://a0.muscache.com/im/users/6527/profile_...,Adams Morgan,4.0,4.0,"['email', 'phone', 'facebook', 'reviews', 'jum...",t,t,"Washington, District of Columbia, United States","Kalorama Heights, Adams Morgan, Lanier Heights",,38.92,-77.04,Entire apartment,Entire home/apt,3,,1 bath,1.0,1.0,"[""Window guards"", ""Cooking basics"", ""First aid...",$86.00,5,150,5,23,150,150,8.8,150.0,,t,7,32,62,121,2021-07-11,95,0,0,2010-07-30,2020-03-05,4.54,4.75,4.17,4.83,4.84,4.91,4.47,,f,2,1,1,0,0.71


In [8]:
## Checking number of rows and columns
data.shape

(8033, 74)

---

> The initial read of the dataset shows there are 74 features and 8,033 entries. A quick glance at the `.head()` gives a sample of the entries, showing that some of the features are not relevant to my analysis.
>
> I need to get a better idea of the statistics for the dataset, especially any missing values and the datatypes for each column. I need to pre-process this data before I can perform any modeling.

---

# 👨‍💻 **Interactive Investigation**

---

> To increase accessibility to the data, **I include a widget to allow the user to sort through the data interactively.** I use [**Jupyter Widgets**](https://ipywidgets.readthedocs.io/en/latest/index.html) to create this interactive report.
>
>**To use:** select which column by which you would like to sort from the dropdown menu, then click the "Run Interact" button.
>
>***Note about 'Drop_Cols' and Cols:*** these keyword arguments are used to allow the user to drop specific columns.
>
> **Only click the "Drop_Cols" option when specifying "Cols"!** Otherwise it will cause an error.
>
>The 'Cols' dropdown menu does not affect the resulting report; the data is filtered from the report prior to displaying the results. 
>
>I chose to include this option for flexibility and adaptability, but it does have the unintended consequence of creating another drop-down menu. Please ignore this menu, as it does not provide any additional functionality. For future work, I will disable the menu to prevent confusion.

---

In [9]:
## Running report on unfiltered dataset

interact_manual(cf.sort_report, Sort_by=list(cf.report_df(data).columns),
                Source=source);

interactive(children=(Text(value='data/listings.csv.gz', description='Source'), Dropdown(description='Sort_by'…

In [10]:
data.head(3)

Unnamed: 0,id,listing_url,scrape_id,last_scraped,name,description,neighborhood_overview,picture_url,host_id,host_url,host_name,host_since,host_location,host_about,host_response_time,host_response_rate,host_acceptance_rate,host_is_superhost,host_thumbnail_url,host_picture_url,host_neighbourhood,host_listings_count,host_total_listings_count,host_verifications,host_has_profile_pic,host_identity_verified,neighbourhood,neighbourhood_cleansed,neighbourhood_group_cleansed,latitude,longitude,property_type,room_type,accommodates,bathrooms,bathrooms_text,bedrooms,beds,amenities,price,minimum_nights,maximum_nights,minimum_minimum_nights,maximum_minimum_nights,minimum_maximum_nights,maximum_maximum_nights,minimum_nights_avg_ntm,maximum_nights_avg_ntm,calendar_updated,has_availability,availability_30,availability_60,availability_90,availability_365,calendar_last_scraped,number_of_reviews,number_of_reviews_ltm,number_of_reviews_l30d,first_review,last_review,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,license,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month
0,3686,https://www.airbnb.com/rooms/3686,20210710190002,2021-07-11,Vita's Hideaway,IMPORTANT NOTES<br />* Carefully read and be s...,We love that our neighborhood is up and coming...,https://a0.muscache.com/pictures/61e02c7e-3d66...,4645,https://www.airbnb.com/users/show/4645,Vita,2008-11-26,"Washington D.C., District of Columbia, United ...","I am a literary scholar, teacher, poet, vegan ...",within a day,80%,75%,f,https://a0.muscache.com/im/users/4645/profile_...,https://a0.muscache.com/im/users/4645/profile_...,Anacostia,2.0,2.0,"['email', 'phone', 'reviews', 'jumio', 'offlin...",t,t,"Washington, District of Columbia, United States",Historic Anacostia,,38.86,-76.99,Private room in house,Private room,1,,1 private bath,1.0,1.0,"[""First aid kit"", ""Long term stays allowed"", ""...",$55.00,2,365,2,2,365,365,2.0,365.0,,t,1,31,61,336,2021-07-11,75,3,0,2014-06-22,2021-01-12,4.59,4.71,4.44,4.89,4.82,3.8,4.58,,f,2,0,2,0,0.87
1,3943,https://www.airbnb.com/rooms/3943,20210710190002,2021-07-11,Historic Rowhouse Near Monuments,Please contact us before booking to make sure ...,This rowhouse is centrally located in the hear...,https://a0.muscache.com/pictures/432713/fab7dd...,5059,https://www.airbnb.com/users/show/5059,Vasa,2008-12-12,"Washington, District of Columbia, United States",I have been living and working in DC for the l...,within a few hours,100%,29%,f,https://a0.muscache.com/im/pictures/user/8ec69...,https://a0.muscache.com/im/pictures/user/8ec69...,Eckington,0.0,0.0,"['email', 'phone', 'reviews', 'kba']",t,t,"Washington, District of Columbia, United States","Edgewood, Bloomingdale, Truxton Circle, Eckington",,38.91,-77.0,Private room in townhouse,Private room,2,,1.5 shared baths,1.0,1.0,"[""Cooking basics"", ""First aid kit"", ""Dedicated...",$70.00,2,1125,2,2,1125,1125,2.0,1125.0,,t,9,39,69,344,2021-07-11,429,0,0,2010-08-08,2018-08-07,4.82,4.89,4.91,4.94,4.9,4.54,4.74,,f,2,0,2,0,3.22
2,4529,https://www.airbnb.com/rooms/4529,20210710190002,2021-07-11,Bertina's House Part One,This is large private bedroom with plenty of...,Very quiet neighborhood and it is easy accessi...,https://a0.muscache.com/pictures/86072003/6709...,5803,https://www.airbnb.com/users/show/5803,Bertina'S House,2008-12-30,"Washington, District of Columbia, United States","I am an easy going, laid back person who loves...",,,,f,https://a0.muscache.com/im/users/5803/profile_...,https://a0.muscache.com/im/users/5803/profile_...,Eastland Gardens,3.0,3.0,"['email', 'phone', 'facebook', 'reviews', 'jum...",t,t,"Washington, District of Columbia, United States","Eastland Gardens, Kenilworth",,38.91,-76.94,Private room in house,Private room,4,,1 shared bath,1.0,1.0,"[""Cooking basics"", ""First aid kit"", ""Keypad"", ...",$54.00,30,180,30,30,180,180,30.0,180.0,,t,29,59,89,179,2021-07-11,102,0,0,2014-09-23,2019-07-05,4.66,4.8,4.6,4.93,4.93,4.51,4.83,,f,1,0,1,0,1.23


---

> After reviewing my data, I see there are several features that contain irrelevant entries (URLs, source data, meta data) or values that are too complicated for simple processing (such as host and listing descriptions).
>
> I will drop these columns for the second report to review the remaining data for further processing.

---

In [11]:
## Specifying columns to drop

drop = ['id', 'host_id', 'name', 'description', 'neighborhood_overview', 'host_name',
        'host_about', 'host_location', 'neighbourhood', 'property_type',
        'listing_url', 'scrape_id', 'last_scraped', 'picture_url','host_url',
        'host_thumbnail_url','host_picture_url','calendar_last_scraped']

In [12]:
## Creating updated interactive report

interact_manual(cf.sort_report, Drop_Cols = True, Cols = drop,
                Sort_by=list(cf.report_df(data).columns), Source=source);

interactive(children=(Text(value='data/listings.csv.gz', description='Source'), Dropdown(description='Sort_by'…

---

> **Interpretation:**
>
> The report shows that the dataset has a big problem with missing values:
>
> * **Empty:**
>   * `neighbourhood_group_cleansed`
>   * `bathrooms`
>   * `calendar_updated`
>
>
> * **Nearly empty:**
>  * `license`
>
>
> * **Missing 26-39% of data:**
>  * `host_about`
>  * `neighborhood_overview`
>  * `neighbourhood`
>  * `host_response_time`
>  * `host_response_rate`
>  * `review_scores_value`
>  * `review_scores_checkin`
>  * `review_scores_location`
>  * `review_scores_accuracy`
>  * `review_scores_communication`
>  * `review_scores_cleanliness`
>  * `host_acceptance_rate`
>  * `reviews_per_month`
>  * `first_review`
>  * `review_scores_rating`
>  * `last_review`
>
>---
>
> I will need to address these missing values before processing with the modeling. A few options include:
>
> * **Filling with the string "missing"** to indicate the value was missing.
>    * *I would be able to treat "missing" as a distinct category and use it for modeling as well.*
>
>
> * **Dropping the rows with missing values.**
>    * *This may negatively impact the accuracy of my results by overfitting to the training data.*
>
>
> * I could **use the `SimpleImputer` tool from SKLearn to fill the missing values** with the mean, median, or mode values for each.
>    * *I could couple this with a `GridSearchCV` to identify the method that has the strongest positive impact on my classification metrics.*

---

---

> To get a better idea of the missing values, I create a visual of the values via the 'Missingno' package. This visualization package includes several options for visualizing the missing data.

---

In [13]:
## Visually inspecting missing values
if show_visualizations == True:
    missingno.bar(data, labels=True);

---

> Based on this visualization, I see that **there is a consistent trend in missing values for review scores:** if a row is missing one review score, it seems to be missing all of them.
>
> Additionally, **there are many missing values for the response time, response rate, and acceptance rate.** I want to use these columns in my classification, so I will need to replace those missing values.
>
> After reviewing these details, **I feel more comfortable with the option of dropping those rows with missing review values.** I will drop the values as part of my overall classification process.

---

# 🧼 **Data Cleaning and EDA**

## 🔎 Fixing Missing Values

---

> This dataset is missing a significant number of values for different columns. **In order to perform any modeling, I will need to address these missing values first.**
>
> Depending on the feature and the number of missing values per row, I will take different approaches to keep as much data as possible and in its original state.

---

In [14]:
# Dropping features with high percentages (25%+) of missing values

drop_na_cols = []
for col in data.columns:
    if ((data[col].isna().sum()) / len(data[col])) > .25 and col != 'review_scores_rating':
        drop_na_cols.append(col)

drop_na_cols

['neighborhood_overview',
 'host_about',
 'host_response_time',
 'host_response_rate',
 'host_acceptance_rate',
 'neighbourhood',
 'neighbourhood_group_cleansed',
 'bathrooms',
 'calendar_updated',
 'first_review',
 'last_review',
 'review_scores_accuracy',
 'review_scores_cleanliness',
 'review_scores_checkin',
 'review_scores_communication',
 'review_scores_location',
 'review_scores_value',
 'license',
 'reviews_per_month']

In [15]:
## Appending previous list of columns to drop (metadata, etc.)

for col in drop:
    if col not in drop_na_cols:
        drop_na_cols.append(col)

drop_na_cols

['neighborhood_overview',
 'host_about',
 'host_response_time',
 'host_response_rate',
 'host_acceptance_rate',
 'neighbourhood',
 'neighbourhood_group_cleansed',
 'bathrooms',
 'calendar_updated',
 'first_review',
 'last_review',
 'review_scores_accuracy',
 'review_scores_cleanliness',
 'review_scores_checkin',
 'review_scores_communication',
 'review_scores_location',
 'review_scores_value',
 'license',
 'reviews_per_month',
 'id',
 'host_id',
 'name',
 'description',
 'host_name',
 'host_location',
 'property_type',
 'listing_url',
 'scrape_id',
 'last_scraped',
 'picture_url',
 'host_url',
 'host_thumbnail_url',
 'host_picture_url',
 'calendar_last_scraped']

In [16]:
## Creating new dataframe that does not include the features to drop
df = data.drop(columns= drop_na_cols).copy()
df

Unnamed: 0,host_since,host_is_superhost,host_neighbourhood,host_listings_count,host_total_listings_count,host_verifications,host_has_profile_pic,host_identity_verified,neighbourhood_cleansed,latitude,longitude,room_type,accommodates,bathrooms_text,bedrooms,beds,amenities,price,minimum_nights,maximum_nights,minimum_minimum_nights,maximum_minimum_nights,minimum_maximum_nights,maximum_maximum_nights,minimum_nights_avg_ntm,maximum_nights_avg_ntm,has_availability,availability_30,availability_60,availability_90,availability_365,number_of_reviews,number_of_reviews_ltm,number_of_reviews_l30d,review_scores_rating,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms
0,2008-11-26,f,Anacostia,2.00,2.00,"['email', 'phone', 'reviews', 'jumio', 'offlin...",t,t,Historic Anacostia,38.86,-76.99,Private room,1,1 private bath,1.00,1.00,"[""First aid kit"", ""Long term stays allowed"", ""...",$55.00,2,365,2,2,365,365,2.00,365.00,t,1,31,61,336,75,3,0,4.59,f,2,0,2,0
1,2008-12-12,f,Eckington,0.00,0.00,"['email', 'phone', 'reviews', 'kba']",t,t,"Edgewood, Bloomingdale, Truxton Circle, Eckington",38.91,-77.00,Private room,2,1.5 shared baths,1.00,1.00,"[""Cooking basics"", ""First aid kit"", ""Dedicated...",$70.00,2,1125,2,2,1125,1125,2.00,1125.00,t,9,39,69,344,429,0,0,4.82,f,2,0,2,0
2,2008-12-30,f,Eastland Gardens,3.00,3.00,"['email', 'phone', 'facebook', 'reviews', 'jum...",t,t,"Eastland Gardens, Kenilworth",38.91,-76.94,Private room,4,1 shared bath,1.00,1.00,"[""Cooking basics"", ""First aid kit"", ""Keypad"", ...",$54.00,30,180,30,30,180,180,30.00,180.00,t,29,59,89,179,102,0,0,4.66,f,1,0,1,0
3,2009-01-26,t,Ivy City,5.00,5.00,"['email', 'phone', 'reviews', 'kba']",t,t,"Ivy City, Arboretum, Trinidad, Carver Langston",38.91,-76.99,Private room,1,3 baths,1.00,1.00,"[""Cable TV"", ""TV with standard cable"", ""Kitche...",$99.00,2,365,2,2,365,365,2.00,365.00,t,0,0,0,146,31,0,0,4.74,f,3,0,3,0
4,2009-01-13,f,Adams Morgan,4.00,4.00,"['email', 'phone', 'facebook', 'reviews', 'jum...",t,t,"Kalorama Heights, Adams Morgan, Lanier Heights",38.92,-77.04,Entire home/apt,3,1 bath,1.00,1.00,"[""Window guards"", ""Cooking basics"", ""First aid...",$86.00,5,150,5,23,150,150,8.80,150.00,t,7,32,62,121,95,0,0,4.54,f,2,1,1,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8028,2020-08-03,f,Southeast Washington,0.00,0.00,"['email', 'phone']",t,f,"Congress Heights, Bellevue, Washington Highlands",38.83,-77.00,Entire home/apt,8,1 bath,3.00,3.00,"[""First aid kit"", ""Dedicated workspace"", ""Smok...",$400.00,2,2,2,2,1125,1125,2.00,1125.00,t,17,43,73,348,0,0,0,,t,1,1,0,0
8029,2020-07-29,f,East Forest,125.00,125.00,"['email', 'phone']",t,t,"Howard University, Le Droit Park, Cardozo/Shaw",38.92,-77.02,Entire home/apt,5,2 baths,2.00,,"[""Cooking basics"", ""Lockbox"", ""Long term stays...",$198.00,90,365,90,90,365,365,90.00,365.00,t,30,60,90,365,0,0,0,,t,215,215,0,0
8030,2016-04-27,f,Near Northeast/H Street Corridor,32.00,32.00,"['email', 'phone', 'reviews', 'jumio', 'offlin...",t,t,"Kalorama Heights, Adams Morgan, Lanier Heights",38.92,-77.04,Entire home/apt,2,1 bath,1.00,2.00,"[""Cooking basics"", ""Shampoo"", ""Dedicated works...",$70.00,30,1125,30,30,1125,1125,30.00,1125.00,t,30,60,90,364,0,0,0,,f,30,30,0,0
8031,2020-09-23,f,Cherry Creek,2232.00,2232.00,"['email', 'phone']",t,t,"Shaw, Logan Circle",38.91,-77.03,Entire home/apt,3,1 bath,1.00,,"[""Cooking basics"", ""Elevator"", ""Lockbox"", ""Lon...",$223.00,91,365,91,91,365,365,91.00,365.00,t,30,60,90,365,0,0,0,,t,50,50,0,0


In [17]:
## Confirming dropped columns with high missing values
cf.report_df(df)

Unnamed: 0,null_sum,null_pct,datatypes,num_unique,count,mean,std,min,25%,50%,75%,max
host_since,136,0.02,object,2415,,,,,,,,
host_is_superhost,136,0.02,object,2,,,,,,,,
host_neighbourhood,741,0.09,object,183,,,,,,,,
host_listings_count,136,0.02,float64,57,7897.0,86.19,334.72,0.0,1.0,2.0,6.0,3924.0
host_total_listings_count,136,0.02,float64,57,7897.0,86.19,334.72,0.0,1.0,2.0,6.0,3924.0
host_verifications,0,0.0,object,297,,,,,,,,
host_has_profile_pic,136,0.02,object,2,,,,,,,,
host_identity_verified,136,0.02,object,2,,,,,,,,
neighbourhood_cleansed,0,0.0,object,39,,,,,,,,
latitude,0,0.0,float64,5113,8033.0,38.91,0.02,38.82,38.9,38.91,38.92,39.0


In [18]:
## Filling missing values for 'beds' with values for 'bedrooms'

for idx in list(df['beds'][df['beds'].isna()].index):
    if df['bedrooms'][idx] > 0:
        df['beds'][idx] = df['bedrooms'][idx]

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['beds'][idx] = df['bedrooms'][idx]


In [19]:
## Filling missing values for 'bedrooms' with values for 'beds'

for idx in list(df['bedrooms'][df['bedrooms'].isna()].index):
    if df['beds'][idx] > 0:
        df['bedrooms'][idx] = df['beds'][idx]

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['bedrooms'][idx] = df['beds'][idx]


In [20]:
## Confirming reduction in missing values for 'beds' and 'bedrooms'

rpt_clean  = cf.report_df(df)
rpt_clean[rpt_clean['null_sum'] >0]

Unnamed: 0,null_sum,null_pct,datatypes,num_unique,count,mean,std,min,25%,50%,75%,max
host_since,136,0.02,object,2415,,,,,,,,
host_is_superhost,136,0.02,object,2,,,,,,,,
host_neighbourhood,741,0.09,object,183,,,,,,,,
host_listings_count,136,0.02,float64,57,7897.0,86.19,334.72,0.0,1.0,2.0,6.0,3924.0
host_total_listings_count,136,0.02,float64,57,7897.0,86.19,334.72,0.0,1.0,2.0,6.0,3924.0
host_has_profile_pic,136,0.02,object,2,,,,,,,,
host_identity_verified,136,0.02,object,2,,,,,,,,
bathrooms_text,9,0.0,object,30,,,,,,,,
bedrooms,157,0.02,float64,9,7876.0,1.5,0.88,1.0,1.0,1.0,2.0,9.0
beds,58,0.01,float64,17,7975.0,1.83,1.48,0.0,1.0,1.0,2.0,50.0


In [21]:
## Checking remaining missing values

df.isna().sum()

host_since                                       136
host_is_superhost                                136
host_neighbourhood                               741
host_listings_count                              136
host_total_listings_count                        136
host_verifications                                 0
host_has_profile_pic                             136
host_identity_verified                           136
neighbourhood_cleansed                             0
latitude                                           0
longitude                                          0
room_type                                          0
accommodates                                       0
bathrooms_text                                     9
bedrooms                                         157
beds                                              58
amenities                                          0
price                                              0
minimum_nights                                

In [22]:
## Removing rows with 6+ null values

df = df[df.isna().sum(axis=1) < 6]
df.head(5)

Unnamed: 0,host_since,host_is_superhost,host_neighbourhood,host_listings_count,host_total_listings_count,host_verifications,host_has_profile_pic,host_identity_verified,neighbourhood_cleansed,latitude,longitude,room_type,accommodates,bathrooms_text,bedrooms,beds,amenities,price,minimum_nights,maximum_nights,minimum_minimum_nights,maximum_minimum_nights,minimum_maximum_nights,maximum_maximum_nights,minimum_nights_avg_ntm,maximum_nights_avg_ntm,has_availability,availability_30,availability_60,availability_90,availability_365,number_of_reviews,number_of_reviews_ltm,number_of_reviews_l30d,review_scores_rating,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms
0,2008-11-26,f,Anacostia,2.0,2.0,"['email', 'phone', 'reviews', 'jumio', 'offlin...",t,t,Historic Anacostia,38.86,-76.99,Private room,1,1 private bath,1.0,1.0,"[""First aid kit"", ""Long term stays allowed"", ""...",$55.00,2,365,2,2,365,365,2.0,365.0,t,1,31,61,336,75,3,0,4.59,f,2,0,2,0
1,2008-12-12,f,Eckington,0.0,0.0,"['email', 'phone', 'reviews', 'kba']",t,t,"Edgewood, Bloomingdale, Truxton Circle, Eckington",38.91,-77.0,Private room,2,1.5 shared baths,1.0,1.0,"[""Cooking basics"", ""First aid kit"", ""Dedicated...",$70.00,2,1125,2,2,1125,1125,2.0,1125.0,t,9,39,69,344,429,0,0,4.82,f,2,0,2,0
2,2008-12-30,f,Eastland Gardens,3.0,3.0,"['email', 'phone', 'facebook', 'reviews', 'jum...",t,t,"Eastland Gardens, Kenilworth",38.91,-76.94,Private room,4,1 shared bath,1.0,1.0,"[""Cooking basics"", ""First aid kit"", ""Keypad"", ...",$54.00,30,180,30,30,180,180,30.0,180.0,t,29,59,89,179,102,0,0,4.66,f,1,0,1,0
3,2009-01-26,t,Ivy City,5.0,5.0,"['email', 'phone', 'reviews', 'kba']",t,t,"Ivy City, Arboretum, Trinidad, Carver Langston",38.91,-76.99,Private room,1,3 baths,1.0,1.0,"[""Cable TV"", ""TV with standard cable"", ""Kitche...",$99.00,2,365,2,2,365,365,2.0,365.0,t,0,0,0,146,31,0,0,4.74,f,3,0,3,0
4,2009-01-13,f,Adams Morgan,4.0,4.0,"['email', 'phone', 'facebook', 'reviews', 'jum...",t,t,"Kalorama Heights, Adams Morgan, Lanier Heights",38.92,-77.04,Entire home/apt,3,1 bath,1.0,1.0,"[""Window guards"", ""Cooking basics"", ""First aid...",$86.00,5,150,5,23,150,150,8.8,150.0,t,7,32,62,121,95,0,0,4.54,f,2,1,1,0


In [23]:
df.isna().sum()

host_since                                         0
host_is_superhost                                  0
host_neighbourhood                               605
host_listings_count                                0
host_total_listings_count                          0
host_verifications                                 0
host_has_profile_pic                               0
host_identity_verified                             0
neighbourhood_cleansed                             0
latitude                                           0
longitude                                          0
room_type                                          0
accommodates                                       0
bathrooms_text                                     9
bedrooms                                         157
beds                                              58
amenities                                          0
price                                              0
minimum_nights                                

In [24]:
cf.report_df(df)

Unnamed: 0,null_sum,null_pct,datatypes,num_unique,count,mean,std,min,25%,50%,75%,max
host_since,0,0.0,object,2415,,,,,,,,
host_is_superhost,0,0.0,object,2,,,,,,,,
host_neighbourhood,605,0.08,object,183,,,,,,,,
host_listings_count,0,0.0,float64,57,7897.0,86.19,334.72,0.0,1.0,2.0,6.0,3924.0
host_total_listings_count,0,0.0,float64,57,7897.0,86.19,334.72,0.0,1.0,2.0,6.0,3924.0
host_verifications,0,0.0,object,296,,,,,,,,
host_has_profile_pic,0,0.0,object,2,,,,,,,,
host_identity_verified,0,0.0,object,2,,,,,,,,
neighbourhood_cleansed,0,0.0,object,39,,,,,,,,
latitude,0,0.0,float64,5076,7897.0,38.91,0.02,38.82,38.9,38.91,38.92,39.0


In [25]:
## Resetting the index after dropping rows

df.reset_index(drop=True, inplace=True)

In [26]:
print(len(df) == len(df.index),"\n")
print(len(df),len(df.index))

True 

7897 7897


---

> At this point, **I cleaned up most of the null values via dropping columns with 25%+ missing values and dropping rows with 6+ missing values.**
>
>Additionally, **I filled missing values for 'beds'/'bedrooms' by checking the missing values for each column against the values in the other for each row.** If a row had a value in one of the columns but not the other, I filled the missing value with the value from the other column.
>
> At this point, I addressed most of the missing values in my dataset by dropping columns and filling missing values. There are still a few columns with missing values, but I will use a SimpleImputer combined with a GridSearchCV to determine the best method by which to fill those values.
>
> Now I will review the remaining data and determine if there are any other issues with my data.

---

In [27]:
len(df) == len(df.index)

True

# **COMMENT:** What else to clean?? 

* DONE: T/F columns to 1/0


* DONE: 'host_since' to DT


* DONE: 'price' -$, to float


* DONE: 'neighbourhood_cleansed' split on ", " and convert to binary columns, then drop host_neighbourhood


* DONE: 'bathrooms_text' split on space, keep 1st part, convert to int


* 'host_verifications' - single string, needs extensive work in order to MLB

## Converting True/False Columns to Binary Values

In [28]:
## Creating list of true/false features to convert to 1/0, respectively

t_f_xf = ['host_is_superhost','host_has_profile_pic','host_identity_verified',
          'has_availability','instant_bookable']
t_f_xf

['host_is_superhost',
 'host_has_profile_pic',
 'host_identity_verified',
 'has_availability',
 'instant_bookable']

In [29]:
## Converting datatype to "string" to replace values

df[t_f_xf] = df[t_f_xf].astype('str')
df[t_f_xf].dtypes

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[k1] = value[k2]


host_is_superhost         object
host_has_profile_pic      object
host_identity_verified    object
has_availability          object
instant_bookable          object
dtype: object

In [30]:
df[t_f_xf]

Unnamed: 0,host_is_superhost,host_has_profile_pic,host_identity_verified,has_availability,instant_bookable
0,f,t,t,t,f
1,f,t,t,t,f
2,f,t,t,t,f
3,t,t,t,t,f
4,f,t,t,t,f
...,...,...,...,...,...
7892,f,t,f,t,t
7893,f,t,t,t,t
7894,f,t,t,t,f
7895,f,t,t,t,t


In [31]:
## Converting t/f to 1/0, respectively

df[t_f_xf] = df[t_f_xf].replace({ 't' : 1, 'f' : 0})

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[k1] = value[k2]


In [32]:
df[t_f_xf]

Unnamed: 0,host_is_superhost,host_has_profile_pic,host_identity_verified,has_availability,instant_bookable
0,0,1,1,1,0
1,0,1,1,1,0
2,0,1,1,1,0
3,1,1,1,1,0
4,0,1,1,1,0
...,...,...,...,...,...
7892,0,1,0,1,1
7893,0,1,1,1,1
7894,0,1,1,1,0
7895,0,1,1,1,1


In [33]:
df[t_f_xf] = df[t_f_xf].astype(int)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  self[k1] = value[k2]


In [34]:
## Verifying results

cf.report_df(df[t_f_xf])

Unnamed: 0,null_sum,null_pct,datatypes,num_unique,count,mean,std,min,25%,50%,75%,max
host_is_superhost,0,0.0,int32,2,7897.0,0.25,0.44,0.0,0.0,0.0,1.0,1.0
host_has_profile_pic,0,0.0,int32,2,7897.0,1.0,0.05,0.0,1.0,1.0,1.0,1.0
host_identity_verified,0,0.0,int32,2,7897.0,0.82,0.39,0.0,1.0,1.0,1.0,1.0
has_availability,0,0.0,int32,2,7897.0,0.97,0.18,0.0,1.0,1.0,1.0,1.0
instant_bookable,0,0.0,int32,2,7897.0,0.41,0.49,0.0,0.0,0.0,1.0,1.0


## Converting Price to Float 

In [35]:
## Converting each value into a float for processing

df['price'] = df['price'].map(lambda price: price[1:].replace(',','')).astype('float')
df['price'][0]

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['price'] = df['price'].map(lambda price: price[1:].replace(',','')).astype('float')


55.0

In [36]:
df['price'].describe()

count    7,897.00
mean       185.65
std        322.32
min          0.00
25%         80.00
50%        119.00
75%        187.00
max     10,000.00
Name: price, dtype: float64

## Creating "Years_Hosting"

---

> Since the 'host_since' feature is clearly a date, I will create a separate feature for how many years of activity for each host.

---

In [37]:
df['years_hosting'] = df["host_since"].map(lambda x: 2021- int(x.split("-")[0]))
df['years_hosting']

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['years_hosting'] = df["host_since"].map(lambda x: 2021- int(x.split("-")[0]))


0       13
1       13
2       13
3       12
4       12
        ..
7892     1
7893     1
7894     5
7895     1
7896     1
Name: years_hosting, Length: 7897, dtype: int64

In [38]:
df['years_hosting'].value_counts()

6     1550
5     1344
7     1047
8      824
4      609
2      513
1      478
9      450
3      423
10     257
0      199
11      95
12      87
13      21
Name: years_hosting, dtype: int64

In [39]:
df['years_hosting'].describe()

count   7,897.00
mean        5.59
std         2.58
min         0.00
25%         4.00
50%         6.00
75%         7.00
max        13.00
Name: years_hosting, dtype: float64

---

> I successfully created the new feature to represent how long each host is active (up to 2021). I will be curious to see the impact of the years of experience on the overall rating at the end of my modeling process.

---

## Bathrooms_Text to Num_Bathrooms

---

> In the raw data, the original "bathrooms" feature was empty and was dropped as part of processing missing data.
>
> **My goal is to convert the "bathrooms_text" feature into a new "num_bathrooms" feature to indicate the number of bathrooms at a host property.**
>
> I assume the number of bathrooms would have an impact on the rating . More bathrooms could mean more space/comfort for the guest, but could also cause an increase in price.


---

In [40]:
## Checking current dataframe contents
df.head(3)

Unnamed: 0,host_since,host_is_superhost,host_neighbourhood,host_listings_count,host_total_listings_count,host_verifications,host_has_profile_pic,host_identity_verified,neighbourhood_cleansed,latitude,longitude,room_type,accommodates,bathrooms_text,bedrooms,beds,amenities,price,minimum_nights,maximum_nights,minimum_minimum_nights,maximum_minimum_nights,minimum_maximum_nights,maximum_maximum_nights,minimum_nights_avg_ntm,maximum_nights_avg_ntm,has_availability,availability_30,availability_60,availability_90,availability_365,number_of_reviews,number_of_reviews_ltm,number_of_reviews_l30d,review_scores_rating,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,years_hosting
0,2008-11-26,0,Anacostia,2.0,2.0,"['email', 'phone', 'reviews', 'jumio', 'offlin...",1,1,Historic Anacostia,38.86,-76.99,Private room,1,1 private bath,1.0,1.0,"[""First aid kit"", ""Long term stays allowed"", ""...",55.0,2,365,2,2,365,365,2.0,365.0,1,1,31,61,336,75,3,0,4.59,0,2,0,2,0,13
1,2008-12-12,0,Eckington,0.0,0.0,"['email', 'phone', 'reviews', 'kba']",1,1,"Edgewood, Bloomingdale, Truxton Circle, Eckington",38.91,-77.0,Private room,2,1.5 shared baths,1.0,1.0,"[""Cooking basics"", ""First aid kit"", ""Dedicated...",70.0,2,1125,2,2,1125,1125,2.0,1125.0,1,9,39,69,344,429,0,0,4.82,0,2,0,2,0,13
2,2008-12-30,0,Eastland Gardens,3.0,3.0,"['email', 'phone', 'facebook', 'reviews', 'jum...",1,1,"Eastland Gardens, Kenilworth",38.91,-76.94,Private room,4,1 shared bath,1.0,1.0,"[""Cooking basics"", ""First aid kit"", ""Keypad"", ...",54.0,30,180,30,30,180,180,30.0,180.0,1,29,59,89,179,102,0,0,4.66,0,1,0,1,0,13


In [41]:
## Checking for null values overall
df.isna().sum()[df.isna().sum() > 0]

host_neighbourhood       605
bathrooms_text             9
bedrooms                 157
beds                      58
review_scores_rating    2081
dtype: int64

In [42]:
## Inspecting a selection of values from the column to understand the values
df.loc[:,'bathrooms_text'][:21]

0       1 private bath
1     1.5 shared baths
2        1 shared bath
3              3 baths
4               1 bath
5               1 bath
6        1 shared bath
7               1 bath
8               1 bath
9     1.5 shared baths
10              1 bath
11    1.5 shared baths
12      1 private bath
13              1 bath
14    1.5 shared baths
15      1 private bath
16       1 shared bath
17           2.5 baths
18      1 private bath
19              1 bath
20                 NaN
Name: bathrooms_text, dtype: object

In [43]:
## Inspecting the rows in which there are null values
df[df['bathrooms_text'].isna()]

Unnamed: 0,host_since,host_is_superhost,host_neighbourhood,host_listings_count,host_total_listings_count,host_verifications,host_has_profile_pic,host_identity_verified,neighbourhood_cleansed,latitude,longitude,room_type,accommodates,bathrooms_text,bedrooms,beds,amenities,price,minimum_nights,maximum_nights,minimum_minimum_nights,maximum_minimum_nights,minimum_maximum_nights,maximum_maximum_nights,minimum_nights_avg_ntm,maximum_nights_avg_ntm,has_availability,availability_30,availability_60,availability_90,availability_365,number_of_reviews,number_of_reviews_ltm,number_of_reviews_l30d,review_scores_rating,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,years_hosting
20,2010-10-04,1,Logan Circle,1.0,1.0,"['email', 'phone', 'facebook', 'reviews']",1,0,"Dupont Circle, Connecticut Avenue/K Street",38.91,-77.03,Entire home/apt,2,,1.0,1.0,"[""Cooking basics"", ""Lockbox"", ""Dedicated works...",195.0,3,365,3,3,365,365,3.0,365.0,1,17,47,77,352,156,2,0,4.85,0,1,1,0,0,11
25,2009-01-26,1,Ivy City,5.0,5.0,"['email', 'phone', 'reviews', 'kba']",1,1,"Ivy City, Arboretum, Trinidad, Carver Langston",38.9,-76.99,Private room,1,,1.0,1.0,"[""Cable TV"", ""TV with standard cable"", ""Kitche...",99.0,2,730,2,2,730,730,2.0,730.0,1,4,13,18,197,10,0,0,4.89,0,3,0,3,0,12
26,2009-01-26,1,Ivy City,5.0,5.0,"['email', 'phone', 'reviews', 'kba']",1,1,"Ivy City, Arboretum, Trinidad, Carver Langston",38.91,-76.98,Private room,1,,1.0,1.0,"[""Cable TV"", ""TV with standard cable"", ""Kitche...",99.0,2,730,2,2,730,730,2.0,730.0,1,1,14,20,198,10,0,0,4.9,0,3,0,3,0,12
120,2009-11-26,0,Capitol Hill,1.0,1.0,"['email', 'phone', 'reviews', 'kba']",1,1,"Capitol Hill, Lincoln Park",38.89,-76.99,Entire home/apt,6,,3.0,3.0,[],2000.0,4,4,4,4,4,4,4.0,4.0,1,30,60,90,365,0,0,0,,0,1,1,0,0,12
5643,2019-11-18,0,Adams Morgan,3.0,3.0,"['email', 'phone']",1,1,"Kalorama Heights, Adams Morgan, Lanier Heights",38.92,-77.04,Hotel room,4,,,,"[""First aid kit"", ""Long term stays allowed"", ""...",0.0,1,365,1,1,28,28,1.0,28.0,1,0,0,0,0,0,0,0,,0,1,0,0,0,2
5684,2014-10-28,0,16th Street Heights,1.0,1.0,"['email', 'phone', 'reviews', 'kba']",1,1,"Brightwood Park, Crestwood, Petworth",38.94,-77.03,Private room,2,,1.0,1.0,"[""Shampoo"", ""Hot water"", ""Carbon monoxide alar...",85.0,30,180,30,30,1125,1125,30.0,1125.0,1,30,60,90,365,0,0,0,,1,2,0,2,0,7
5754,2019-11-26,0,Mount Vernon Square,0.0,0.0,['phone'],0,1,"Downtown, Chinatown, Penn Quarters, Mount Vern...",38.9,-77.02,Hotel room,0,,,,"[""Bed sheets and pillows"", ""First aid kit"", ""C...",0.0,1,365,1,1,1125,1125,1.0,1125.0,1,0,0,0,0,0,0,0,,0,1,0,0,0,2
5849,2019-08-22,0,U Street Corridor,0.0,0.0,"['email', 'phone']",1,1,"Howard University, Le Droit Park, Cardozo/Shaw",38.92,-77.03,Hotel room,0,,,,"[""Bed sheets and pillows"", ""First aid kit"", ""O...",0.0,1,365,1,1,365,365,1.0,365.0,1,0,0,0,0,33,19,7,4.36,0,3,0,0,2,2
7441,2019-04-24,0,,1.0,1.0,"['email', 'phone', 'offline_government_id', 's...",1,1,"Colonial Village, Shepherd Park, North Portal ...",39.0,-77.04,Private room,1,,1.0,1.0,"[""Dedicated workspace"", ""Smoke alarm"", ""Shampo...",64.0,1,7,1,1,1125,1125,1.0,1125.0,1,16,46,76,351,0,0,0,,0,1,0,1,0,2


In [44]:
## Filling null values with unique string ('Baths' not present otherwise)
## Unique string can be used later to check for any other zero baths

df.loc[:,'bathrooms_text'].fillna('0 Baths', inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().fillna(


In [45]:
## Verifying all null values are filled
df.isna().sum()[df.isna().sum() > 0]

host_neighbourhood       605
bedrooms                 157
beds                      58
review_scores_rating    2081
dtype: int64

In [46]:
df.loc[:,'bathrooms_text'].isna().sum()

0

In [47]:
## Splitting each list into separate strings
df['num_bathrooms'] = df['bathrooms_text'].map(lambda x: x.split(' ')[0])
df['num_bathrooms'].value_counts()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['num_bathrooms'] = df['bathrooms_text'].map(lambda x: x.split(' ')[0])


1            5578
2             873
1.5           596
2.5           442
3.5           147
3             136
4              36
0              33
4.5            31
5.5             7
6.5             3
5               3
6               3
11              2
Half-bath       2
Shared          2
50              1
Private         1
8               1
Name: num_bathrooms, dtype: int64

In [48]:
## Inspecting results that are phrases, not numbers

replace = ['Half-bath', 'Shared', 'Private']

for x in df['bathrooms_text']:
    for i in replace:
        if i in x:
            print(x)

Shared half-bath
Half-bath
Shared half-bath
Half-bath
Private half-bath


---

> **I will replace these values with the numeric value .5 as they are half-baths.** This will allow me to convert the column datatype to a float and use the column more easily in my modeling.

---

In [49]:
## Replacing string values with .5 to represent half-bathrooms

replace = {'Half-bath': .5, 'Shared': .5, 'Private': .5}

df['num_bathrooms'].replace(replace, inplace = True)

df['num_bathrooms'] = df['num_bathrooms'].astype(float)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  return super().replace(
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df['num_bathrooms'] = df['num_bathrooms'].astype(float)


In [50]:
## Inspecting resulting values

df['num_bathrooms'].value_counts(dropna=False)

1.00     5578
2.00      873
1.50      596
2.50      442
3.50      147
3.00      136
4.00       36
0.00       33
4.50       31
5.50        7
0.50        5
5.00        3
6.00        3
6.50        3
11.00       2
8.00        1
50.00       1
Name: num_bathrooms, dtype: int64

In [51]:
## Inspecting listings with more than 10 rooms

df[df['num_bathrooms'] >10]

Unnamed: 0,host_since,host_is_superhost,host_neighbourhood,host_listings_count,host_total_listings_count,host_verifications,host_has_profile_pic,host_identity_verified,neighbourhood_cleansed,latitude,longitude,room_type,accommodates,bathrooms_text,bedrooms,beds,amenities,price,minimum_nights,maximum_nights,minimum_minimum_nights,maximum_minimum_nights,minimum_maximum_nights,maximum_maximum_nights,minimum_nights_avg_ntm,maximum_nights_avg_ntm,has_availability,availability_30,availability_60,availability_90,availability_365,number_of_reviews,number_of_reviews_ltm,number_of_reviews_l30d,review_scores_rating,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,years_hosting,num_bathrooms
1384,2016-01-31,0,Adams Morgan,7.0,7.0,"['email', 'phone', 'reviews']",1,0,"Kalorama Heights, Adams Morgan, Lanier Heights",38.92,-77.04,Shared room,1,11 shared baths,1.0,6.0,"[""Cooking basics"", ""First aid kit"", ""Dedicated...",47.0,1,31,13,13,91,91,13.0,91.0,1,0,0,0,0,7,1,0,4.86,1,3,0,1,2,5,11.0
4228,2015-06-09,0,Edgewood,1.0,1.0,"['email', 'phone']",1,0,"Edgewood, Bloomingdale, Truxton Circle, Eckington",38.93,-77.0,Shared room,16,50 shared baths,1.0,50.0,"[""Cable TV"", ""First aid kit"", ""TV with standar...",60.0,1,1125,1,1,1125,1125,1.0,1125.0,1,27,57,87,87,0,0,0,,0,1,0,0,1,6,50.0
6244,2016-01-31,0,Adams Morgan,7.0,7.0,"['email', 'phone', 'reviews']",1,0,"Kalorama Heights, Adams Morgan, Lanier Heights",38.92,-77.04,Shared room,1,11 shared baths,1.0,3.0,"[""Cooking basics"", ""First aid kit"", ""Dedicated...",47.0,7,180,13,13,91,91,13.0,91.0,1,0,0,0,0,6,6,0,4.17,1,3,0,1,2,5,11.0


---

> After taking a look at the locations listed above on Google Maps (using their latitude/longitude), I feel like these three listings with more than 10 bathrooms are either duplicates or incorrect values (for 50 baths).
>
> Due to the questionable nature of these values, I will drop these rows to prevent these outliers from impacting my results.

---

In [52]:
## Inspecting rows where 'num_bathrooms' is zero to validate data

df[df['num_bathrooms'] ==0]

Unnamed: 0,host_since,host_is_superhost,host_neighbourhood,host_listings_count,host_total_listings_count,host_verifications,host_has_profile_pic,host_identity_verified,neighbourhood_cleansed,latitude,longitude,room_type,accommodates,bathrooms_text,bedrooms,beds,amenities,price,minimum_nights,maximum_nights,minimum_minimum_nights,maximum_minimum_nights,minimum_maximum_nights,maximum_maximum_nights,minimum_nights_avg_ntm,maximum_nights_avg_ntm,has_availability,availability_30,availability_60,availability_90,availability_365,number_of_reviews,number_of_reviews_ltm,number_of_reviews_l30d,review_scores_rating,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,years_hosting,num_bathrooms
20,2010-10-04,1,Logan Circle,1.0,1.0,"['email', 'phone', 'facebook', 'reviews']",1,0,"Dupont Circle, Connecticut Avenue/K Street",38.91,-77.03,Entire home/apt,2,0 Baths,1.0,1.0,"[""Cooking basics"", ""Lockbox"", ""Dedicated works...",195.0,3,365,3,3,365,365,3.0,365.0,1,17,47,77,352,156,2,0,4.85,0,1,1,0,0,11,0.0
25,2009-01-26,1,Ivy City,5.0,5.0,"['email', 'phone', 'reviews', 'kba']",1,1,"Ivy City, Arboretum, Trinidad, Carver Langston",38.9,-76.99,Private room,1,0 Baths,1.0,1.0,"[""Cable TV"", ""TV with standard cable"", ""Kitche...",99.0,2,730,2,2,730,730,2.0,730.0,1,4,13,18,197,10,0,0,4.89,0,3,0,3,0,12,0.0
26,2009-01-26,1,Ivy City,5.0,5.0,"['email', 'phone', 'reviews', 'kba']",1,1,"Ivy City, Arboretum, Trinidad, Carver Langston",38.91,-76.98,Private room,1,0 Baths,1.0,1.0,"[""Cable TV"", ""TV with standard cable"", ""Kitche...",99.0,2,730,2,2,730,730,2.0,730.0,1,1,14,20,198,10,0,0,4.9,0,3,0,3,0,12,0.0
120,2009-11-26,0,Capitol Hill,1.0,1.0,"['email', 'phone', 'reviews', 'kba']",1,1,"Capitol Hill, Lincoln Park",38.89,-76.99,Entire home/apt,6,0 Baths,3.0,3.0,[],2000.0,4,4,4,4,4,4,4.0,4.0,1,30,60,90,365,0,0,0,,0,1,1,0,0,12,0.0
486,2015-03-05,0,Dupont Circle,8.0,8.0,"['email', 'phone', 'reviews']",1,1,"Dupont Circle, Connecticut Avenue/K Street",38.91,-77.04,Private room,2,0 baths,1.0,1.0,"[""Long term stays allowed"", ""Essentials"", ""Hea...",80.0,1,1125,1,1,1125,1125,1.0,1125.0,1,0,0,0,0,159,0,0,4.69,0,8,0,8,0,6,0.0
487,2015-03-05,0,Dupont Circle,8.0,8.0,"['email', 'phone', 'reviews']",1,1,"Dupont Circle, Connecticut Avenue/K Street",38.91,-77.04,Private room,2,0 baths,1.0,1.0,"[""Cable TV"", ""TV with standard cable"", ""Long t...",80.0,1,1125,1,1,1125,1125,1.0,1125.0,1,0,0,0,0,231,0,0,4.59,0,8,0,8,0,6,0.0
488,2015-03-05,0,Dupont Circle,8.0,8.0,"['email', 'phone', 'reviews']",1,1,"Dupont Circle, Connecticut Avenue/K Street",38.91,-77.04,Private room,1,0 baths,1.0,1.0,"[""Cable TV"", ""TV with standard cable"", ""Kitche...",95.0,1,1125,1,1,1125,1125,1.0,1125.0,1,0,0,0,0,256,0,0,4.56,0,8,0,8,0,6,0.0
803,2014-09-10,0,Columbia Heights,1.0,1.0,"['email', 'phone', 'reviews', 'jumio', 'govern...",1,1,"Brightwood Park, Crestwood, Petworth",38.94,-77.03,Private room,1,0 baths,1.0,1.0,"[""Shampoo"", ""Dedicated workspace"", ""Kitchen"", ...",36.0,2,1125,2,2,1125,1125,2.0,1125.0,1,0,0,0,0,0,0,0,,0,1,0,1,0,7,0.0
1178,2015-08-07,0,Petworth,3.0,3.0,"['email', 'phone', 'reviews', 'kba']",1,0,"Brightwood Park, Crestwood, Petworth",38.94,-77.02,Private room,1,0 shared baths,1.0,1.0,"[""Kitchen"", ""Carbon monoxide alarm"", ""Keypad"",...",65.0,1,1125,1,1,1125,1125,1.0,1125.0,1,30,60,90,365,43,0,0,4.86,0,3,0,3,0,6,0.0
1648,2013-09-09,1,,1.0,1.0,"['email', 'phone', 'facebook', 'reviews']",1,0,"North Cleveland Park, Forest Hills, Van Ness",38.94,-77.06,Private room,2,0 shared baths,,0.0,"[""Elevator"", ""Dedicated workspace"", ""Long term...",150.0,1,1125,1,1,1125,1125,1.0,1125.0,1,0,0,9,284,25,0,0,5.0,0,1,0,1,0,8,0.0


In [53]:
## Removing old column post-conversion

df = df.drop(columns = 'bathrooms_text')

In [54]:
## Confirming removal

'bathrooms_text' in df.columns

False

---

> My review of the original bathroom text for the zero bathrooms column shows that the listings are associated with a private room. This would make sense as the listings may not include an option such as a shared bath, etc..
>
> Additionally I did fill 9 instances of missing values with "0 Baths," which would contribute slightly to this count.
>
> Overall, I feel the data is valid and I will use it for my modeling.

---

## Cleaning Room_Type

In [55]:
df['room_type'].value_counts()

Entire home/apt    5816
Private room       1894
Shared room         159
Hotel room           28
Name: room_type, dtype: int64

In [56]:
replace_rooms = {'Entire home/apt': 'entire_home', 
                 'Private room': 'private_room',
                 'Shared room': 'shared_room',
                 'Hotel room': 'hotel_room'
                }

df['room_type'].replace(replace_rooms, inplace=True)
df['room_type'].value_counts(dropna=False)

entire_home     5816
private_room    1894
shared_room      159
hotel_room        28
Name: room_type, dtype: int64

## Binarizing Neighbourhood_Cleansed

---

> The current values for "neighbourhood_cleansed" are a single string value. **I will separate each neighborhood and convert them into a binary column to represent whether or not that neighborhood is included in the listing, then drop the old column.**

---

In [57]:
## Inspecting feature
df.loc[:,'neighbourhood_cleansed']

0                                      Historic Anacostia
1       Edgewood, Bloomingdale, Truxton Circle, Eckington
2                            Eastland Gardens, Kenilworth
3          Ivy City, Arboretum, Trinidad, Carver Langston
4          Kalorama Heights, Adams Morgan, Lanier Heights
                              ...                        
7892     Congress Heights, Bellevue, Washington Highlands
7893       Howard University, Le Droit Park, Cardozo/Shaw
7894       Kalorama Heights, Adams Morgan, Lanier Heights
7895                                   Shaw, Logan Circle
7896    Columbia Heights, Mt. Pleasant, Pleasant Plain...
Name: neighbourhood_cleansed, Length: 7897, dtype: object

In [58]:
## Identifying datatype
df.loc[:,'neighbourhood_cleansed'].dtype

dtype('O')

In [59]:
## Testing the splitting between neighborhoods

df.loc[:,'neighbourhood_cleansed'][1].split(', ')

['Edgewood', 'Bloomingdale', 'Truxton Circle', 'Eckington']

In [60]:
## Converting values into a list of strings of neighborhoods

df['neighbourhood_cleansed'] = df['neighbourhood_cleansed'] \
                                    .apply(lambda x: x.split(', '))

display(df.loc[:,'neighbourhood_cleansed'])

0                                    [Historic Anacostia]
1       [Edgewood, Bloomingdale, Truxton Circle, Eckin...
2                          [Eastland Gardens, Kenilworth]
3        [Ivy City, Arboretum, Trinidad, Carver Langston]
4        [Kalorama Heights, Adams Morgan, Lanier Heights]
                              ...                        
7892    [Congress Heights, Bellevue, Washington Highla...
7893     [Howard University, Le Droit Park, Cardozo/Shaw]
7894     [Kalorama Heights, Adams Morgan, Lanier Heights]
7895                                 [Shaw, Logan Circle]
7896    [Columbia Heights, Mt. Pleasant, Pleasant Plai...
Name: neighbourhood_cleansed, Length: 7897, dtype: object

---

> The following code snippet is adapted from [here](https://stackoverflow.com/questions/45312377/how-to-one-hot-encode-from-a-pandas-column-containing-a-list#:~:text=Sparse%20solution%20(for%20Pandas%20v0.25.0%2B)) by the user [Maxu](https://stackoverflow.com/users/5741205/maxu).

---

In [61]:
## Converting each neighborhood into a binary column and dropping old column

mlb = MultiLabelBinarizer()

df = df.join(pd.DataFrame(mlb.fit_transform(df.pop('neighbourhood_cleansed')),
                              columns=mlb.classes_,index=df.index))

In [62]:
## Inspecting results

df.head(3)

Unnamed: 0,host_since,host_is_superhost,host_neighbourhood,host_listings_count,host_total_listings_count,host_verifications,host_has_profile_pic,host_identity_verified,latitude,longitude,room_type,accommodates,bedrooms,beds,amenities,price,minimum_nights,maximum_nights,minimum_minimum_nights,maximum_minimum_nights,minimum_maximum_nights,maximum_maximum_nights,minimum_nights_avg_ntm,maximum_nights_avg_ntm,has_availability,availability_30,availability_60,availability_90,availability_365,number_of_reviews,number_of_reviews_ltm,number_of_reviews_l30d,review_scores_rating,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,years_hosting,num_bathrooms,Adams Morgan,American University Park,Arboretum,Barnaby Woods,Barry Farm,Bellevue,Benning,Benning Heights,Bloomingdale,Brentwood,Brightwood,Brightwood Park,Brookland,Buena Vista,Burleith/Hillandale,Burrville,Buzzard Point,Capitol Hill,Capitol View,Cardozo/Shaw,Carver Langston,Cathedral Heights,Chevy Chase,Chinatown,Cleveland Park,Colonial Village,Columbia Heights,Congress Heights,Connecticut Avenue/K Street,Crestwood,Deanwood,Douglas,Downtown,Dupont Circle,Dupont Park,Eastland Gardens,Eckington,Edgewood,Fairfax Village,Fairlawn,Fairmont Heights,Foggy Bottom,Forest Hills,Fort Davis Park,Fort Dupont,Fort Lincoln,Fort McNair,Fort Totten,Foxhall Crescent,Foxhall Village,Friendship Heights,GWU,Garfield Heights,Gateway,Georgetown,Georgetown Reservoir,Glover Park,Grant Park,Greenway,Hawthorne,Hillbrook,Hillcrest,Historic Anacostia,Howard University,Ivy City,Kalorama Heights,Kenilworth,Kingman Park,Knox Hill,Lamont Riggs,Langdon,Lanier Heights,Le Droit Park,Lincoln Heights,Lincoln Park,Logan Circle,Mahaning Heights,Manor Park,Marshall Heights,Massachusetts Avenue Heights,Mayfair,McLean Gardens,Michigan Park,Mount Vernon Square,Mt. Pleasant,Navy Yard,Naylor Gardens,Near Southeast,North Capitol Street,North Cleveland Park,North Michigan Park,North Portal Estates,Palisades,Park View,Penn Branch,Penn Quarters,Petworth,Pleasant Hill,Pleasant Plains,Queens Chapel,Randle Highlands,River Terrace,Shaw,Shepherd Park,Sheridan,Shipley Terrace,Southwest Employment Area,Southwest/Waterfront,Spring Valley,Stanton Park,Summit Park,Takoma,Tenleytown,Trinidad,Truxton Circle,Twining,Union Station,University Heights,Van Ness,Washington Highlands,Wesley Heights,West End,Woodland-Normanstone Terrace,Woodland/Fort Stanton,Woodley Park,Woodridge
0,2008-11-26,0,Anacostia,2.0,2.0,"['email', 'phone', 'reviews', 'jumio', 'offlin...",1,1,38.86,-76.99,private_room,1,1.0,1.0,"[""First aid kit"", ""Long term stays allowed"", ""...",55.0,2,365,2,2,365,365,2.0,365.0,1,1,31,61,336,75,3,0,4.59,0,2,0,2,0,13,1.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,2008-12-12,0,Eckington,0.0,0.0,"['email', 'phone', 'reviews', 'kba']",1,1,38.91,-77.0,private_room,2,1.0,1.0,"[""Cooking basics"", ""First aid kit"", ""Dedicated...",70.0,2,1125,2,2,1125,1125,2.0,1125.0,1,9,39,69,344,429,0,0,4.82,0,2,0,2,0,13,1.5,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
2,2008-12-30,0,Eastland Gardens,3.0,3.0,"['email', 'phone', 'facebook', 'reviews', 'jum...",1,1,38.91,-76.94,private_room,4,1.0,1.0,"[""Cooking basics"", ""First aid kit"", ""Keypad"", ...",54.0,30,180,30,30,180,180,30.0,180.0,1,29,59,89,179,102,0,0,4.66,0,1,0,1,0,13,1.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


---

> After using the MultiLabelBinarizer, I successfully added a column for each neighborhood, indicating whether or not that neighborhood was included in the listing.
>
> This enables me to use the presence/absence of a  neighborhood as a category in my modeling.

---

## Host_Verifications to Binary Columns

---

> For the "host_verifications" and "amenities" features, the values are a single string with several items within the string.
>
> It is somewhat similar to the "neighborhoods_cleaned" feature in the sense that I will need to filter out the individual items from the string. However, there is an added complication as I need to remove the brackets and quotations from the strings.
>
> Once I filter out the items, I will be able to use the MultiLabelBinarizer again to create more categories for each amenity.

---

In [63]:
## Inspecting contents
df['host_verifications'][:10]

0    ['email', 'phone', 'reviews', 'jumio', 'offlin...
1                 ['email', 'phone', 'reviews', 'kba']
2    ['email', 'phone', 'facebook', 'reviews', 'jum...
3                 ['email', 'phone', 'reviews', 'kba']
4    ['email', 'phone', 'facebook', 'reviews', 'jum...
5    ['email', 'phone', 'reviews', 'offline_governm...
6    ['email', 'phone', 'facebook', 'reviews', 'jum...
7    ['email', 'phone', 'facebook', 'reviews', 'off...
8                 ['email', 'phone', 'reviews', 'kba']
9                 ['email', 'phone', 'reviews', 'kba']
Name: host_verifications, dtype: object

In [64]:
## Testing the splitting between items

df.loc[:,'host_verifications'][1]

"['email', 'phone', 'reviews', 'kba']"

In [65]:
## Removing e'host_verifications'tra characters and splitting items

df['host_verifications'] = df['host_verifications'].str.replace('[', '')
df['host_verifications'] = df['host_verifications'].str.replace(']', '')
df['host_verifications'] = df['host_verifications'].str.replace("'", '')
df['host_verifications'] = df['host_verifications'].str.replace('"', '')
df['host_verifications'] = df['host_verifications'].apply(lambda x: x.split(', '))

In [66]:
df['host_verifications']

0       [email, phone, reviews, jumio, offline_governm...
1                            [email, phone, reviews, kba]
2       [email, phone, facebook, reviews, jumio, gover...
3                            [email, phone, reviews, kba]
4       [email, phone, facebook, reviews, jumio, offli...
                              ...                        
7892                                       [email, phone]
7893                                       [email, phone]
7894    [email, phone, reviews, jumio, offline_governm...
7895                                       [email, phone]
7896                                       [email, phone]
Name: host_verifications, Length: 7897, dtype: object

In [67]:
## Converting each value into a binary column and dropping old column

mlb2 = MultiLabelBinarizer()
    
df = df.join(pd.DataFrame(mlb2.fit_transform(df.pop('host_verifications')),
                                  columns=mlb2.classes_,index=df.index))

df

Unnamed: 0,host_since,host_is_superhost,host_neighbourhood,host_listings_count,host_total_listings_count,host_has_profile_pic,host_identity_verified,latitude,longitude,room_type,accommodates,bedrooms,beds,amenities,price,minimum_nights,maximum_nights,minimum_minimum_nights,maximum_minimum_nights,minimum_maximum_nights,maximum_maximum_nights,minimum_nights_avg_ntm,maximum_nights_avg_ntm,has_availability,availability_30,availability_60,availability_90,availability_365,number_of_reviews,number_of_reviews_ltm,number_of_reviews_l30d,review_scores_rating,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,years_hosting,num_bathrooms,Adams Morgan,American University Park,Arboretum,Barnaby Woods,Barry Farm,Bellevue,Benning,Benning Heights,Bloomingdale,Brentwood,Brightwood,Brightwood Park,Brookland,Buena Vista,Burleith/Hillandale,Burrville,Buzzard Point,Capitol Hill,Capitol View,Cardozo/Shaw,Carver Langston,Cathedral Heights,Chevy Chase,Chinatown,Cleveland Park,Colonial Village,Columbia Heights,Congress Heights,Connecticut Avenue/K Street,Crestwood,Deanwood,Douglas,Downtown,Dupont Circle,Dupont Park,Eastland Gardens,Eckington,Edgewood,Fairfax Village,Fairlawn,Fairmont Heights,Foggy Bottom,Forest Hills,Fort Davis Park,Fort Dupont,Fort Lincoln,Fort McNair,Fort Totten,Foxhall Crescent,Foxhall Village,Friendship Heights,GWU,Garfield Heights,Gateway,Georgetown,Georgetown Reservoir,Glover Park,Grant Park,Greenway,Hawthorne,Hillbrook,Hillcrest,Historic Anacostia,Howard University,Ivy City,Kalorama Heights,Kenilworth,Kingman Park,Knox Hill,Lamont Riggs,Langdon,Lanier Heights,Le Droit Park,Lincoln Heights,Lincoln Park,Logan Circle,Mahaning Heights,Manor Park,Marshall Heights,Massachusetts Avenue Heights,Mayfair,McLean Gardens,Michigan Park,Mount Vernon Square,Mt. Pleasant,Navy Yard,Naylor Gardens,Near Southeast,North Capitol Street,North Cleveland Park,North Michigan Park,North Portal Estates,Palisades,Park View,Penn Branch,Penn Quarters,Petworth,Pleasant Hill,Pleasant Plains,Queens Chapel,Randle Highlands,River Terrace,Shaw,Shepherd Park,Sheridan,Shipley Terrace,Southwest Employment Area,Southwest/Waterfront,Spring Valley,Stanton Park,Summit Park,Takoma,Tenleytown,Trinidad,Truxton Circle,Twining,Union Station,University Heights,Van Ness,Washington Highlands,Wesley Heights,West End,Woodland-Normanstone Terrace,Woodland/Fort Stanton,Woodley Park,Woodridge,Unnamed: 166,email,facebook,google,government_id,identity_manual,jumio,kba,manual_offline,manual_online,offline_government_id,phone,reviews,selfie,sent_id,weibo,work_email
0,2008-11-26,0,Anacostia,2.00,2.00,1,1,38.86,-76.99,private_room,1,1.00,1.00,"[""First aid kit"", ""Long term stays allowed"", ""...",55.00,2,365,2,2,365,365,2.00,365.00,1,1,31,61,336,75,3,0,4.59,0,2,0,2,0,13,1.00,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,1,0,0,0,1,1,1,0,0,0,1
1,2008-12-12,0,Eckington,0.00,0.00,1,1,38.91,-77.00,private_room,2,1.00,1.00,"[""Cooking basics"", ""First aid kit"", ""Dedicated...",70.00,2,1125,2,2,1125,1125,2.00,1125.00,1,9,39,69,344,429,0,0,4.82,0,2,0,2,0,13,1.50,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1,1,0,0,0,0
2,2008-12-30,0,Eastland Gardens,3.00,3.00,1,1,38.91,-76.94,private_room,4,1.00,1.00,"[""Cooking basics"", ""First aid kit"", ""Keypad"", ...",54.00,30,180,30,30,180,180,30.00,180.00,1,29,59,89,179,102,0,0,4.66,0,1,0,1,0,13,1.00,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,1,0,1,0,0,0,0,1,1,0,0,0,0
3,2009-01-26,1,Ivy City,5.00,5.00,1,1,38.91,-76.99,private_room,1,1.00,1.00,"[""Cable TV"", ""TV with standard cable"", ""Kitche...",99.00,2,365,2,2,365,365,2.00,365.00,1,0,0,0,146,31,0,0,4.74,0,3,0,3,0,12,3.00,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1,1,0,0,0,0
4,2009-01-13,0,Adams Morgan,4.00,4.00,1,1,38.92,-77.04,entire_home,3,1.00,1.00,"[""Window guards"", ""Cooking basics"", ""First aid...",86.00,5,150,5,23,150,150,8.80,150.00,1,7,32,62,121,95,0,0,4.54,0,2,1,1,0,12,1.00,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,1,1,1,1,0,0,1,1,1,1,0,0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7892,2020-08-03,0,Southeast Washington,0.00,0.00,1,0,38.83,-77.00,entire_home,8,3.00,3.00,"[""First aid kit"", ""Dedicated workspace"", ""Smok...",400.00,2,2,2,2,1125,1125,2.00,1125.00,1,17,43,73,348,0,0,0,,1,1,1,0,0,1,1.00,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0
7893,2020-07-29,0,East Forest,125.00,125.00,1,1,38.92,-77.02,entire_home,5,2.00,2.00,"[""Cooking basics"", ""Lockbox"", ""Long term stays...",198.00,90,365,90,90,365,365,90.00,365.00,1,30,60,90,365,0,0,0,,1,215,215,0,0,1,2.00,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0
7894,2016-04-27,0,Near Northeast/H Street Corridor,32.00,32.00,1,1,38.92,-77.04,entire_home,2,1.00,2.00,"[""Cooking basics"", ""Shampoo"", ""Dedicated works...",70.00,30,1125,30,30,1125,1125,30.00,1125.00,1,30,60,90,364,0,0,0,,0,30,30,0,0,5,1.00,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,1,1,0,0,0,1,1,1,1,0,0,0
7895,2020-09-23,0,Cherry Creek,2232.00,2232.00,1,1,38.91,-77.03,entire_home,3,1.00,1.00,"[""Cooking basics"", ""Elevator"", ""Lockbox"", ""Lon...",223.00,91,365,91,91,365,365,91.00,365.00,1,30,60,90,365,0,0,0,,1,50,50,0,0,1,1.00,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0


---

> At this point, I successfully processed the 'host_verification' feature into distinct categories for modeling.
>
> In the future, I may attempt to do the same for the 'amenities' feature, but I don't want to create too many columns before my initial modeling.

---

## ❌ ERROR ❌ Binarizing Room_Type

---

> **Can't get MLB/OHE to work for individual property types.**

---

In [68]:
# df['room_type'].describe()

In [69]:
# df['room_type'].value_counts(dropna=False)

In [70]:
# df['room_type'] = df['room_type'].replace('Entire home/apt', 'Home/Apt')

In [71]:
# df['room_type'] = df['room_type'].map(lambda x: x.split(' ')[0])

In [72]:
# df['room_type'].value_counts(dropna=False)

In [73]:
# ohe = OneHotEncoder(sparse=False)

# df_ohe = ohe.fit_transform([df['room_type']])
# df_ohe

In [74]:
# pd.DataFrame(df_ohe)

## ❌ ERROR ❌ Converting Amenities


---

> same issue as w/ room type

---

In [75]:
# for x in ['host_verifications', 'amenities']:
#     print(df[x])

In [76]:
# df['amenities'][:10]

In [77]:
# for x in ['host_verifications', 'amenities']:
#     df[x] = df[x].str.replace('and', '')

In [78]:
# ## Converting each value into a binary column and dropping old column

# mlb = MultiLabelBinarizer()
    
# df = df.join(pd.DataFrame(mlb.fit_transform(df.pop('amenities')),
#                                   columns=mlb.classes_,index=df.index))

In [79]:
# df.loc[:,'host_verifications'] = df.loc[:,'host_verifications'].str.replace('[', '')
# df.loc[:,'host_verifications'] = df.loc[:,'host_verifications'].str.replace(']', '')
# df.loc[:,'host_verifications'] = df.loc[:,'host_verifications'].str.replace("'", '')

In [80]:
# df.loc[:,'host_verifications']

In [81]:
# df['amenities'] = df['amenities'].str.replace('[', '')
# df['amenities'] = df['amenities'].str.replace(']', '')
# df['amenities'] = df['amenities'].str.replace('"', '')

In [82]:
# df['amenities']

In [83]:
# df['amenities'] = df['amenities'].apply(lambda x: x.split(', '))

In [84]:
# df['amenities'][0]

In [85]:
# df['host_verifications'] = df['host_verifications'].apply(lambda x: x.split(', '))

In [86]:
# df['host_verifications'][0][0]

In [87]:
# def convert_to_col(df, list_cols):
#     '''For a given list of column names, separates each string value in the
#     column by the comma/space pattern to return new strings of single values.
    
#     Then, instantiates a MultiLabelBinarizer to create new columns for each 
#     new string to indicate the presence or absence of that string in the 
#     original column.'''
    
# #     mlb = MultiLabelBinarizer()
    
#     for x in list_cols:
#         try:
#             df[x] = df[x].apply(lambda x: x.split(', '))
#             print(f'Successfully split values in column "{x}"')
            
#         except Exception:
#             print('\nValues are already processed and saved.')
#             print(f"\nSample value: {df.loc[:,x][3]}")
            
# #         try:
# #             df = df.join(pd.DataFrame(mlb.fit_transform(df.pop(x)),
# #                                       columns=mlb.classes_,index=df.index))
# #         except Exception:
# #                 print('\nValues are already processed and saved.')
                
#     return df

In [88]:
# binarize_cols = ['host_verifications', 'amenities'] 

# convert_to_col(df, binarize_cols)

In [89]:
# ## Converting each value into a binary column and dropping old column

# mlb = MultiLabelBinarizer()
    
# df = df.join(pd.DataFrame(mlb.fit_transform(df.pop('amenities')),
#                                   columns=mlb.classes_,index=df.index))

In [90]:
# # mlb = MultiLabelBinarizer()
    
# df = df.join(pd.DataFrame(mlb.fit_transform(df.pop('amenities')),
#                                   columns=mlb.classes_,index=df.index))

In [91]:
# ## Converting values into a list of strings for each neighborhood

# try:
#     df['host_verifications'] = df['host_verifications'] \
#                                                 .apply(lambda x: x.split(', '))
#     display(df.loc[:,'host_verifications'])
# except Exception:
#     print('\nValues are already processed and saved. No changes necessary.')
#     print(f"\nSample value: {df.loc[:,'host_verifications'][3]}")
    
    

In [92]:
# ## Inspecting results

# df.head(3)

In [93]:
# test3 = df['host_verifications'][0]
# test3[1:-1].replace('"', "'").split(",")

In [94]:
# # df['Tags'] = df.Tags.apply(lambda x: x[1:-1].split(','))

# df['host_verifications'].apply(lambda x: x.split(','))[0]

# Pre-Pipeline Review

In [95]:
## Review remaining data
df.head(3)

Unnamed: 0,host_since,host_is_superhost,host_neighbourhood,host_listings_count,host_total_listings_count,host_has_profile_pic,host_identity_verified,latitude,longitude,room_type,accommodates,bedrooms,beds,amenities,price,minimum_nights,maximum_nights,minimum_minimum_nights,maximum_minimum_nights,minimum_maximum_nights,maximum_maximum_nights,minimum_nights_avg_ntm,maximum_nights_avg_ntm,has_availability,availability_30,availability_60,availability_90,availability_365,number_of_reviews,number_of_reviews_ltm,number_of_reviews_l30d,review_scores_rating,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,years_hosting,num_bathrooms,Adams Morgan,American University Park,Arboretum,Barnaby Woods,Barry Farm,Bellevue,Benning,Benning Heights,Bloomingdale,Brentwood,Brightwood,Brightwood Park,Brookland,Buena Vista,Burleith/Hillandale,Burrville,Buzzard Point,Capitol Hill,Capitol View,Cardozo/Shaw,Carver Langston,Cathedral Heights,Chevy Chase,Chinatown,Cleveland Park,Colonial Village,Columbia Heights,Congress Heights,Connecticut Avenue/K Street,Crestwood,Deanwood,Douglas,Downtown,Dupont Circle,Dupont Park,Eastland Gardens,Eckington,Edgewood,Fairfax Village,Fairlawn,Fairmont Heights,Foggy Bottom,Forest Hills,Fort Davis Park,Fort Dupont,Fort Lincoln,Fort McNair,Fort Totten,Foxhall Crescent,Foxhall Village,Friendship Heights,GWU,Garfield Heights,Gateway,Georgetown,Georgetown Reservoir,Glover Park,Grant Park,Greenway,Hawthorne,Hillbrook,Hillcrest,Historic Anacostia,Howard University,Ivy City,Kalorama Heights,Kenilworth,Kingman Park,Knox Hill,Lamont Riggs,Langdon,Lanier Heights,Le Droit Park,Lincoln Heights,Lincoln Park,Logan Circle,Mahaning Heights,Manor Park,Marshall Heights,Massachusetts Avenue Heights,Mayfair,McLean Gardens,Michigan Park,Mount Vernon Square,Mt. Pleasant,Navy Yard,Naylor Gardens,Near Southeast,North Capitol Street,North Cleveland Park,North Michigan Park,North Portal Estates,Palisades,Park View,Penn Branch,Penn Quarters,Petworth,Pleasant Hill,Pleasant Plains,Queens Chapel,Randle Highlands,River Terrace,Shaw,Shepherd Park,Sheridan,Shipley Terrace,Southwest Employment Area,Southwest/Waterfront,Spring Valley,Stanton Park,Summit Park,Takoma,Tenleytown,Trinidad,Truxton Circle,Twining,Union Station,University Heights,Van Ness,Washington Highlands,Wesley Heights,West End,Woodland-Normanstone Terrace,Woodland/Fort Stanton,Woodley Park,Woodridge,Unnamed: 166,email,facebook,google,government_id,identity_manual,jumio,kba,manual_offline,manual_online,offline_government_id,phone,reviews,selfie,sent_id,weibo,work_email
0,2008-11-26,0,Anacostia,2.0,2.0,1,1,38.86,-76.99,private_room,1,1.0,1.0,"[""First aid kit"", ""Long term stays allowed"", ""...",55.0,2,365,2,2,365,365,2.0,365.0,1,1,31,61,336,75,3,0,4.59,0,2,0,2,0,13,1.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,1,0,0,0,1,1,1,0,0,0,1
1,2008-12-12,0,Eckington,0.0,0.0,1,1,38.91,-77.0,private_room,2,1.0,1.0,"[""Cooking basics"", ""First aid kit"", ""Dedicated...",70.0,2,1125,2,2,1125,1125,2.0,1125.0,1,9,39,69,344,429,0,0,4.82,0,2,0,2,0,13,1.5,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1,1,0,0,0,0
2,2008-12-30,0,Eastland Gardens,3.0,3.0,1,1,38.91,-76.94,private_room,4,1.0,1.0,"[""Cooking basics"", ""First aid kit"", ""Keypad"", ...",54.0,30,180,30,30,180,180,30.0,180.0,1,29,59,89,179,102,0,0,4.66,0,1,0,1,0,13,1.0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,1,0,1,0,0,0,0,1,1,0,0,0,0


In [96]:
## Removing columns with no impact on modeling

df.drop(columns = ['host_since', 'host_neighbourhood', 'amenities'], inplace=True)

In [97]:
## Final review

df.describe()

Unnamed: 0,host_is_superhost,host_listings_count,host_total_listings_count,host_has_profile_pic,host_identity_verified,latitude,longitude,accommodates,bedrooms,beds,price,minimum_nights,maximum_nights,minimum_minimum_nights,maximum_minimum_nights,minimum_maximum_nights,maximum_maximum_nights,minimum_nights_avg_ntm,maximum_nights_avg_ntm,has_availability,availability_30,availability_60,availability_90,availability_365,number_of_reviews,number_of_reviews_ltm,number_of_reviews_l30d,review_scores_rating,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,years_hosting,num_bathrooms,Adams Morgan,American University Park,Arboretum,Barnaby Woods,Barry Farm,Bellevue,Benning,Benning Heights,Bloomingdale,Brentwood,Brightwood,Brightwood Park,Brookland,Buena Vista,Burleith/Hillandale,Burrville,Buzzard Point,Capitol Hill,Capitol View,Cardozo/Shaw,Carver Langston,Cathedral Heights,Chevy Chase,Chinatown,Cleveland Park,Colonial Village,Columbia Heights,Congress Heights,Connecticut Avenue/K Street,Crestwood,Deanwood,Douglas,Downtown,Dupont Circle,Dupont Park,Eastland Gardens,Eckington,Edgewood,Fairfax Village,Fairlawn,Fairmont Heights,Foggy Bottom,Forest Hills,Fort Davis Park,Fort Dupont,Fort Lincoln,Fort McNair,Fort Totten,Foxhall Crescent,Foxhall Village,Friendship Heights,GWU,Garfield Heights,Gateway,Georgetown,Georgetown Reservoir,Glover Park,Grant Park,Greenway,Hawthorne,Hillbrook,Hillcrest,Historic Anacostia,Howard University,Ivy City,Kalorama Heights,Kenilworth,Kingman Park,Knox Hill,Lamont Riggs,Langdon,Lanier Heights,Le Droit Park,Lincoln Heights,Lincoln Park,Logan Circle,Mahaning Heights,Manor Park,Marshall Heights,Massachusetts Avenue Heights,Mayfair,McLean Gardens,Michigan Park,Mount Vernon Square,Mt. Pleasant,Navy Yard,Naylor Gardens,Near Southeast,North Capitol Street,North Cleveland Park,North Michigan Park,North Portal Estates,Palisades,Park View,Penn Branch,Penn Quarters,Petworth,Pleasant Hill,Pleasant Plains,Queens Chapel,Randle Highlands,River Terrace,Shaw,Shepherd Park,Sheridan,Shipley Terrace,Southwest Employment Area,Southwest/Waterfront,Spring Valley,Stanton Park,Summit Park,Takoma,Tenleytown,Trinidad,Truxton Circle,Twining,Union Station,University Heights,Van Ness,Washington Highlands,Wesley Heights,West End,Woodland-Normanstone Terrace,Woodland/Fort Stanton,Woodley Park,Woodridge,Unnamed: 162,email,facebook,google,government_id,identity_manual,jumio,kba,manual_offline,manual_online,offline_government_id,phone,reviews,selfie,sent_id,weibo,work_email
count,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7740.0,7839.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,5816.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0,7897.0
mean,0.25,86.19,86.19,1.0,0.82,38.91,-77.02,3.55,1.51,1.84,185.65,12.17,14542.12,13.17,24.26,14704.94,10892168.12,23.6,10867161.53,0.97,9.13,22.99,38.34,147.33,37.52,6.12,0.71,4.68,0.41,20.43,17.83,2.19,0.3,5.59,1.36,0.04,0.01,0.03,0.01,0.0,0.01,0.01,0.01,0.07,0.02,0.02,0.05,0.02,0.0,0.03,0.01,0.03,0.09,0.01,0.04,0.03,0.01,0.01,0.05,0.01,0.01,0.09,0.01,0.08,0.05,0.01,0.0,0.05,0.08,0.01,0.0,0.07,0.07,0.0,0.01,0.01,0.04,0.01,0.01,0.01,0.01,0.03,0.01,0.01,0.01,0.01,0.04,0.0,0.01,0.03,0.01,0.01,0.01,0.01,0.01,0.01,0.0,0.01,0.04,0.03,0.04,0.0,0.09,0.0,0.01,0.02,0.04,0.04,0.01,0.09,0.07,0.01,0.02,0.01,0.01,0.01,0.01,0.01,0.05,0.09,0.01,0.0,0.01,0.05,0.01,0.01,0.01,0.01,0.09,0.01,0.05,0.05,0.01,0.09,0.01,0.01,0.01,0.07,0.01,0.0,0.0,0.03,0.03,0.01,0.09,0.0,0.02,0.01,0.03,0.07,0.01,0.09,0.01,0.01,0.01,0.01,0.04,0.01,0.0,0.01,0.01,0.0,0.94,0.15,0.08,0.6,0.24,0.42,0.25,0.01,0.01,0.44,1.0,0.67,0.27,0.0,0.0,0.21
std,0.44,334.72,334.72,0.05,0.39,0.02,0.03,2.21,0.88,1.49,322.32,31.74,1130892.48,43.6,97.04,1130890.46,152462331.88,94.66,152111860.9,0.18,11.35,23.07,35.33,141.15,70.01,13.21,1.59,0.7,0.49,48.77,48.23,9.78,2.11,2.58,0.89,0.2,0.1,0.16,0.08,0.06,0.11,0.08,0.1,0.25,0.12,0.15,0.22,0.12,0.06,0.17,0.08,0.18,0.29,0.1,0.19,0.16,0.11,0.08,0.21,0.1,0.07,0.28,0.11,0.27,0.22,0.08,0.06,0.21,0.27,0.08,0.04,0.25,0.25,0.06,0.1,0.08,0.19,0.08,0.1,0.1,0.08,0.18,0.12,0.1,0.1,0.1,0.19,0.03,0.08,0.17,0.1,0.11,0.08,0.08,0.08,0.07,0.06,0.09,0.19,0.16,0.2,0.04,0.29,0.03,0.12,0.12,0.2,0.19,0.08,0.29,0.25,0.07,0.15,0.1,0.1,0.07,0.11,0.1,0.21,0.28,0.09,0.06,0.09,0.21,0.08,0.1,0.07,0.1,0.28,0.1,0.21,0.22,0.12,0.28,0.12,0.1,0.08,0.25,0.07,0.06,0.06,0.18,0.18,0.1,0.29,0.06,0.15,0.1,0.16,0.25,0.1,0.29,0.1,0.08,0.11,0.1,0.19,0.1,0.03,0.1,0.08,0.02,0.24,0.36,0.26,0.49,0.43,0.49,0.43,0.12,0.09,0.5,0.05,0.47,0.45,0.03,0.01,0.41
min,0.0,0.0,0.0,0.0,0.0,38.82,-77.11,0.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,1.0,1.0,1.0,1.0,38.9,-77.04,2.0,1.0,1.0,80.0,1.0,30.0,1.0,2.0,180.0,180.0,1.6,180.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,4.67,0.0,1.0,1.0,0.0,0.0,4.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
50%,0.0,2.0,2.0,1.0,1.0,38.91,-77.02,3.0,1.0,1.0,119.0,2.0,365.0,2.0,3.0,1125.0,1125.0,2.0,1125.0,1.0,3.0,17.0,37.0,102.0,7.0,0.0,0.0,4.86,0.0,2.0,1.0,0.0,0.0,6.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0
75%,1.0,6.0,6.0,1.0,1.0,38.92,-77.0,4.0,2.0,2.0,187.0,5.0,1125.0,5.0,6.0,1125.0,1125.0,5.0,1125.0,1.0,18.0,44.0,72.0,311.0,41.0,6.0,1.0,5.0,1.0,6.0,3.0,1.0,0.0,7.0,1.5,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0
max,1.0,3924.0,3924.0,1.0,1.0,39.0,-76.91,16.0,9.0,50.0,10000.0,1125.0,99999999.0,1125.0,1125.0,99999999.0,2147483647.0,1125.0,2142546905.6,1.0,30.0,60.0,90.0,365.0,662.0,153.0,17.0,5.0,1.0,215.0,215.0,86.0,20.0,13.0,50.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


## Converting Remaining Datatypes

In [98]:
df.dtypes[:40]

host_is_superhost                                 int32
host_listings_count                             float64
host_total_listings_count                       float64
host_has_profile_pic                              int32
host_identity_verified                            int32
latitude                                        float64
longitude                                       float64
room_type                                        object
accommodates                                      int64
bedrooms                                        float64
beds                                            float64
price                                           float64
minimum_nights                                    int64
maximum_nights                                    int64
minimum_minimum_nights                            int64
maximum_minimum_nights                            int64
minimum_maximum_nights                            int64
maximum_maximum_nights                          

In [99]:
df.isna().sum()[df.isna().sum() > 0]

bedrooms                 157
beds                      58
review_scores_rating    2081
dtype: int64

# 🪓 **Train/Test Split**

---

> Before I run any further pre-processing, I split my data into training and test sets to allow me to test my model's performance.
>
> **In order to split my classification target feature properly, I will convert the original values to binary values.** Since my goal is to determine whether or not a given host property will have a high score (4+), I assign all values greater-than or equal-to 4 to '1' and anything less than 4 as '0.'
>
> **This conversion also allows me to use the "stratify" parameter in my train/test split,** which will preserve the class balance when I split my data. This will be key for proper evaluation of my models.

---

# ✨ Cmt out conversion - nans

Null values are being assigned to '0', which is causing the significant increase in class balance (orig 5/95; changed 33/67)

🌟 ALSO CANNOT USE STRATIFY

In [100]:
# ## Using np.select to reassign target values based on conditional evaluations

# cond = [df['review_scores_rating'] >= 4,
#         df['review_scores_rating'] < 4
#        ]

# choice = [1,0]

# df['review_scores_rating'] = np.select(cond, choice, 0)

In [101]:
# ## Reviewing results to confirm only 0/1 values
# df['review_scores_rating'].value_counts(dropna=False)

In [102]:
## Creating features/target for dataset
target = 'review_scores_rating'

X = df.drop(columns = target).copy()
y = df[target].copy()

In [103]:
## Confirming same number of rows
X.shape[0] == y.shape[0]

True

In [104]:
## Splitting to prevent data leakage
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# 🚿 **Preprocessing Pipeline**

In [105]:
num_cols = X_train.select_dtypes(include=[int, float]).columns.to_list()
# num_cols

In [106]:
cat_cols = ['room_type']
cat_cols

['room_type']

In [107]:
# X_train[cat_cols] = X_train[cat_cols].astype(str)

In [108]:
# X_test[cat_cols] = X_test[cat_cols].astype(str)

In [109]:
## Checking missing X-values for imputation
X_train.isna().sum()[X_train.isna().sum() > 0]

bedrooms    123
beds         47
dtype: int64

In [110]:
## Checking missing y-values for imputation
y_train.isna().sum()

1563

## X-Values Preprocessor

In [111]:
## Creating ColumnTransformer and sub-transformers for imputation and encoding

### --- Creating column transformers --- ###

# Filling missing values in "Beds" and "Bedrooms"
miss_num_transformer = SimpleImputer(strategy='mean')

## Encoding categoricals - ignoring errors to prevent issues w/ test set
categorical_transformer = OneHotEncoder(handle_unknown='ignore', sparse=False)


### --- Creating column pipelines --- ###

cat_pipe = Pipeline(steps=[('ohe', categorical_transformer)])

num_pipe = Pipeline(steps=[('imputer', miss_num_transformer),
                           ('scaler', StandardScaler())])

## Instantiating the ColumnTransformer and including all transformers
preprocessor = ColumnTransformer(
    transformers=[('nums', num_pipe, num_cols),
                  ('cats', cat_pipe, cat_cols)])

preprocessor

In [112]:
## 
preprocessor.fit(X_train)

## Getting feature names from OHE
ohe_cat_names = preprocessor.named_transformers_['cats'].named_steps['ohe'].get_feature_names(cat_cols)

## Generating list for column index
final_cols = [*num_cols, *ohe_cat_names]

In [113]:

## Fit and transform the data via the ColumnTransformer
X_train_tf = preprocessor.transform(X_train)
X_train_tf_df = pd.DataFrame(X_train_tf, columns=final_cols, index=X_train.index)

## Transforming the test set and saving
X_test_tf = preprocessor.transform(X_test)
X_test_tf_df = pd.DataFrame(X_test_tf, columns=final_cols, index=X_test.index)

display(X_train_tf_df.head(5),X_test_tf_df.head(5))

Unnamed: 0,host_is_superhost,host_listings_count,host_total_listings_count,host_has_profile_pic,host_identity_verified,latitude,longitude,bedrooms,beds,price,minimum_nights_avg_ntm,maximum_nights_avg_ntm,has_availability,instant_bookable,num_bathrooms,Adams Morgan,American University Park,Arboretum,Barnaby Woods,Barry Farm,Bellevue,Benning,Benning Heights,Bloomingdale,Brentwood,Brightwood,Brightwood Park,Brookland,Buena Vista,Burleith/Hillandale,Burrville,Buzzard Point,Capitol Hill,Capitol View,Cardozo/Shaw,Carver Langston,Cathedral Heights,Chevy Chase,Chinatown,Cleveland Park,Colonial Village,Columbia Heights,Congress Heights,Connecticut Avenue/K Street,Crestwood,Deanwood,Douglas,Downtown,Dupont Circle,Dupont Park,Eastland Gardens,Eckington,Edgewood,Fairfax Village,Fairlawn,Fairmont Heights,Foggy Bottom,Forest Hills,Fort Davis Park,Fort Dupont,Fort Lincoln,Fort McNair,Fort Totten,Foxhall Crescent,Foxhall Village,Friendship Heights,GWU,Garfield Heights,Gateway,Georgetown,Georgetown Reservoir,Glover Park,Grant Park,Greenway,Hawthorne,Hillbrook,Hillcrest,Historic Anacostia,Howard University,Ivy City,Kalorama Heights,Kenilworth,Kingman Park,Knox Hill,Lamont Riggs,Langdon,Lanier Heights,Le Droit Park,Lincoln Heights,Lincoln Park,Logan Circle,Mahaning Heights,Manor Park,Marshall Heights,Massachusetts Avenue Heights,Mayfair,McLean Gardens,Michigan Park,Mount Vernon Square,Mt. Pleasant,Navy Yard,Naylor Gardens,Near Southeast,North Capitol Street,North Cleveland Park,North Michigan Park,North Portal Estates,Palisades,Park View,Penn Branch,Penn Quarters,Petworth,Pleasant Hill,Pleasant Plains,Queens Chapel,Randle Highlands,River Terrace,Shaw,Shepherd Park,Sheridan,Shipley Terrace,Southwest Employment Area,Southwest/Waterfront,Spring Valley,Stanton Park,Summit Park,Takoma,Tenleytown,Trinidad,Truxton Circle,Twining,Union Station,University Heights,Van Ness,Washington Highlands,Wesley Heights,West End,Woodland-Normanstone Terrace,Woodland/Fort Stanton,Woodley Park,Woodridge,Unnamed: 142,email,facebook,google,government_id,identity_manual,jumio,kba,manual_offline,manual_online,offline_government_id,phone,reviews,selfie,sent_id,weibo,work_email,room_type_entire_home,room_type_hotel_room,room_type_private_room,room_type_shared_room
7827,-0.59,-0.26,-0.26,0.05,0.47,1.85,0.26,-0.58,-0.61,-0.47,-0.19,-0.07,0.18,1.19,-0.5,-0.21,-0.1,-0.17,-0.08,-0.07,-0.11,-0.08,-0.1,-0.27,-0.12,6.73,-0.24,-0.12,-0.07,-0.17,-0.08,-0.19,-0.32,-0.1,-0.19,-0.17,-0.11,-0.08,-0.22,-0.1,-0.07,-0.31,-0.11,-0.29,-0.24,-0.08,-0.06,-0.22,-0.29,-0.08,-0.05,-0.27,-0.27,-0.06,-0.11,-0.08,-0.19,-0.08,-0.11,-0.11,-0.08,-0.19,-0.12,-0.1,-0.1,-0.1,-0.19,-0.03,-0.08,-0.17,-0.1,-0.11,-0.08,-0.08,-0.08,-0.08,-0.06,-0.09,-0.19,-0.17,-0.21,-0.05,-0.32,-0.03,-0.12,-0.12,-0.21,-0.19,-0.08,-0.32,-0.27,-0.08,6.73,-0.1,-0.1,-0.08,-0.11,-0.1,-0.22,-0.31,-0.09,-0.06,-0.09,-0.22,-0.08,-0.1,-0.07,-0.1,-0.31,-0.11,-0.22,-0.24,-0.12,-0.31,-0.12,-0.11,-0.08,-0.27,-0.07,-0.07,-0.06,-0.19,-0.19,-0.1,-0.32,-0.06,6.73,-0.1,-0.17,-0.27,-0.11,-0.32,-0.1,-0.08,-0.11,-0.1,-0.19,-0.1,-0.03,-0.1,-0.08,-0.01,0.25,2.37,-0.28,-1.22,-0.56,-0.85,-0.58,-0.12,-0.09,-0.89,0.05,0.71,-0.61,-0.03,0.0,-0.51,0.0,0.0,1.0,0.0
5782,-0.59,3.26,3.26,0.05,0.47,-0.4,-1.33,0.0,-1.34,-0.2,0.06,-0.07,0.18,1.19,-0.5,-0.21,-0.1,-0.17,-0.08,-0.07,-0.11,-0.08,-0.1,-0.27,-0.12,-0.15,-0.24,-0.12,-0.07,-0.17,-0.08,-0.19,-0.32,-0.1,-0.19,-0.17,-0.11,-0.08,-0.22,-0.1,-0.07,-0.31,-0.11,-0.29,-0.24,-0.08,-0.06,-0.22,-0.29,-0.08,-0.05,-0.27,-0.27,-0.06,-0.11,-0.08,5.19,-0.08,-0.11,-0.11,-0.08,-0.19,-0.12,-0.1,-0.1,-0.1,5.19,-0.03,-0.08,-0.17,-0.1,-0.11,-0.08,-0.08,-0.08,-0.08,-0.06,-0.09,-0.19,-0.17,-0.21,-0.05,-0.32,-0.03,-0.12,-0.12,-0.21,-0.19,-0.08,-0.32,-0.27,-0.08,-0.15,-0.1,-0.1,-0.08,-0.11,-0.1,-0.22,-0.31,-0.09,-0.06,-0.09,-0.22,-0.08,-0.1,-0.07,-0.1,-0.31,-0.11,-0.22,-0.24,-0.12,-0.31,-0.12,-0.11,-0.08,-0.27,-0.07,-0.07,-0.06,-0.19,-0.19,-0.1,-0.32,-0.06,-0.15,-0.1,-0.17,-0.27,-0.11,-0.32,-0.1,-0.08,-0.11,-0.1,5.19,-0.1,-0.03,-0.1,-0.08,-0.01,0.25,-0.42,-0.28,0.82,-0.56,1.18,-0.58,-0.12,-0.09,-0.89,0.05,0.71,-0.61,-0.03,0.0,1.95,1.0,0.0,0.0,0.0
439,-0.59,-0.26,-0.26,0.05,0.47,0.04,0.26,-0.58,0.85,-0.24,0.06,-0.07,0.18,-0.84,-0.5,-0.21,-0.1,-0.17,-0.08,-0.07,-0.11,-0.08,-0.1,3.7,-0.12,-0.15,-0.24,-0.12,-0.07,-0.17,-0.08,-0.19,-0.32,-0.1,-0.19,-0.17,-0.11,-0.08,-0.22,-0.1,-0.07,-0.31,-0.11,-0.29,-0.24,-0.08,-0.06,-0.22,-0.29,-0.08,-0.05,3.7,3.7,-0.06,-0.11,-0.08,-0.19,-0.08,-0.11,-0.11,-0.08,-0.19,-0.12,-0.1,-0.1,-0.1,-0.19,-0.03,-0.08,-0.17,-0.1,-0.11,-0.08,-0.08,-0.08,-0.08,-0.06,-0.09,-0.19,-0.17,-0.21,-0.05,-0.32,-0.03,-0.12,-0.12,-0.21,-0.19,-0.08,-0.32,-0.27,-0.08,-0.15,-0.1,-0.1,-0.08,-0.11,-0.1,-0.22,-0.31,-0.09,-0.06,-0.09,-0.22,-0.08,-0.1,-0.07,-0.1,-0.31,-0.11,-0.22,-0.24,-0.12,-0.31,-0.12,-0.11,-0.08,-0.27,-0.07,-0.07,-0.06,-0.19,-0.19,-0.1,-0.32,-0.06,-0.15,-0.1,-0.17,3.7,-0.11,-0.32,-0.1,-0.08,-0.11,-0.1,-0.19,-0.1,-0.03,-0.1,-0.08,-0.01,0.25,2.37,-0.28,0.82,-0.56,1.18,-0.58,-0.12,-0.09,1.12,0.05,0.71,-0.61,-0.03,0.0,1.95,1.0,0.0,0.0,0.0
1379,-0.59,-0.25,-0.25,0.05,-2.13,1.27,0.09,-0.58,-0.61,-0.17,-0.24,-0.07,0.18,-0.84,-0.5,-0.21,-0.1,-0.17,-0.08,-0.07,-0.11,-0.08,-0.1,-0.27,-0.12,-0.15,4.13,-0.12,-0.07,-0.17,-0.08,-0.19,-0.32,-0.1,-0.19,-0.17,-0.11,-0.08,-0.22,-0.1,-0.07,-0.31,-0.11,-0.29,4.13,-0.08,-0.06,-0.22,-0.29,-0.08,-0.05,-0.27,-0.27,-0.06,-0.11,-0.08,-0.19,-0.08,-0.11,-0.11,-0.08,-0.19,-0.12,-0.1,-0.1,-0.1,-0.19,-0.03,-0.08,-0.17,-0.1,-0.11,-0.08,-0.08,-0.08,-0.08,-0.06,-0.09,-0.19,-0.17,-0.21,-0.05,-0.32,-0.03,-0.12,-0.12,-0.21,-0.19,-0.08,-0.32,-0.27,-0.08,-0.15,-0.1,-0.1,-0.08,-0.11,-0.1,-0.22,-0.31,-0.09,-0.06,-0.09,-0.22,-0.08,-0.1,-0.07,-0.1,-0.31,-0.11,-0.22,4.13,-0.12,-0.31,-0.12,-0.11,-0.08,-0.27,-0.07,-0.07,-0.06,-0.19,-0.19,-0.1,-0.32,-0.06,-0.15,-0.1,-0.17,-0.27,-0.11,-0.32,-0.1,-0.08,-0.11,-0.1,-0.19,-0.1,-0.03,-0.1,-0.08,-0.01,0.25,2.37,-0.28,-1.22,-0.56,-0.85,-0.58,-0.12,-0.09,-0.89,0.05,0.71,-0.61,-0.03,0.0,-0.51,0.0,0.0,1.0,0.0
847,1.71,-0.26,-0.26,0.05,0.47,-0.79,2.75,-0.58,-0.61,-0.41,-0.23,-0.07,0.18,-0.84,-0.5,-0.21,-0.1,-0.17,-0.08,-0.07,-0.11,-0.08,9.64,-0.27,-0.12,-0.15,-0.24,-0.12,-0.07,-0.17,-0.08,-0.19,-0.32,9.64,-0.19,-0.17,-0.11,-0.08,-0.22,-0.1,-0.07,-0.31,-0.11,-0.29,-0.24,-0.08,-0.06,-0.22,-0.29,-0.08,-0.05,-0.27,-0.27,-0.06,-0.11,-0.08,-0.19,-0.08,-0.11,-0.11,-0.08,-0.19,-0.12,-0.1,-0.1,-0.1,-0.19,-0.03,-0.08,-0.17,-0.1,-0.11,-0.08,-0.08,-0.08,-0.08,-0.06,-0.09,-0.19,-0.17,-0.21,-0.05,-0.32,-0.03,-0.12,-0.12,-0.21,-0.19,-0.08,-0.32,-0.27,-0.08,-0.15,9.64,-0.1,-0.08,-0.11,-0.1,-0.22,-0.31,-0.09,-0.06,-0.09,-0.22,-0.08,-0.1,-0.07,-0.1,-0.31,-0.11,-0.22,-0.24,-0.12,-0.31,-0.12,-0.11,-0.08,-0.27,-0.07,-0.07,-0.06,-0.19,-0.19,-0.1,-0.32,-0.06,-0.15,-0.1,-0.17,-0.27,-0.11,-0.32,-0.1,-0.08,-0.11,-0.1,-0.19,-0.1,-0.03,-0.1,-0.08,-0.01,0.25,-0.42,-0.28,-1.22,-0.56,-0.85,1.73,-0.12,-0.09,-0.89,0.05,0.71,-0.61,-0.03,0.0,-0.51,1.0,0.0,0.0,0.0


Unnamed: 0,host_is_superhost,host_listings_count,host_total_listings_count,host_has_profile_pic,host_identity_verified,latitude,longitude,bedrooms,beds,price,minimum_nights_avg_ntm,maximum_nights_avg_ntm,has_availability,instant_bookable,num_bathrooms,Adams Morgan,American University Park,Arboretum,Barnaby Woods,Barry Farm,Bellevue,Benning,Benning Heights,Bloomingdale,Brentwood,Brightwood,Brightwood Park,Brookland,Buena Vista,Burleith/Hillandale,Burrville,Buzzard Point,Capitol Hill,Capitol View,Cardozo/Shaw,Carver Langston,Cathedral Heights,Chevy Chase,Chinatown,Cleveland Park,Colonial Village,Columbia Heights,Congress Heights,Connecticut Avenue/K Street,Crestwood,Deanwood,Douglas,Downtown,Dupont Circle,Dupont Park,Eastland Gardens,Eckington,Edgewood,Fairfax Village,Fairlawn,Fairmont Heights,Foggy Bottom,Forest Hills,Fort Davis Park,Fort Dupont,Fort Lincoln,Fort McNair,Fort Totten,Foxhall Crescent,Foxhall Village,Friendship Heights,GWU,Garfield Heights,Gateway,Georgetown,Georgetown Reservoir,Glover Park,Grant Park,Greenway,Hawthorne,Hillbrook,Hillcrest,Historic Anacostia,Howard University,Ivy City,Kalorama Heights,Kenilworth,Kingman Park,Knox Hill,Lamont Riggs,Langdon,Lanier Heights,Le Droit Park,Lincoln Heights,Lincoln Park,Logan Circle,Mahaning Heights,Manor Park,Marshall Heights,Massachusetts Avenue Heights,Mayfair,McLean Gardens,Michigan Park,Mount Vernon Square,Mt. Pleasant,Navy Yard,Naylor Gardens,Near Southeast,North Capitol Street,North Cleveland Park,North Michigan Park,North Portal Estates,Palisades,Park View,Penn Branch,Penn Quarters,Petworth,Pleasant Hill,Pleasant Plains,Queens Chapel,Randle Highlands,River Terrace,Shaw,Shepherd Park,Sheridan,Shipley Terrace,Southwest Employment Area,Southwest/Waterfront,Spring Valley,Stanton Park,Summit Park,Takoma,Tenleytown,Trinidad,Truxton Circle,Twining,Union Station,University Heights,Van Ness,Washington Highlands,Wesley Heights,West End,Woodland-Normanstone Terrace,Woodland/Fort Stanton,Woodley Park,Woodridge,Unnamed: 142,email,facebook,google,government_id,identity_manual,jumio,kba,manual_offline,manual_online,offline_government_id,phone,reviews,selfie,sent_id,weibo,work_email,room_type_entire_home,room_type_hotel_room,room_type_private_room,room_type_shared_room
7760,-0.59,-0.12,-0.12,0.05,0.47,-1.15,0.95,-0.58,-0.61,-0.15,0.07,-0.07,0.18,-0.84,-0.5,-0.21,-0.1,-0.17,-0.08,-0.07,-0.11,-0.08,-0.1,-0.27,-0.12,-0.15,-0.24,-0.12,-0.07,-0.17,-0.08,-0.19,3.12,-0.1,-0.19,-0.17,-0.11,-0.08,-0.22,-0.1,-0.07,-0.31,-0.11,-0.29,-0.24,-0.08,-0.06,-0.22,-0.29,-0.08,-0.05,-0.27,-0.27,-0.06,-0.11,-0.08,-0.19,-0.08,-0.11,-0.11,-0.08,-0.19,-0.12,-0.1,-0.1,-0.1,-0.19,-0.03,-0.08,-0.17,-0.1,-0.11,-0.08,-0.08,-0.08,-0.08,-0.06,-0.09,-0.19,-0.17,-0.21,-0.05,-0.32,-0.03,-0.12,-0.12,-0.21,-0.19,-0.08,3.12,-0.27,-0.08,-0.15,-0.1,-0.1,-0.08,-0.11,-0.1,-0.22,-0.31,-0.09,-0.06,-0.09,-0.22,-0.08,-0.1,-0.07,-0.1,-0.31,-0.11,-0.22,-0.24,-0.12,-0.31,-0.12,-0.11,-0.08,-0.27,-0.07,-0.07,-0.06,-0.19,-0.19,-0.1,-0.32,-0.06,-0.15,-0.1,-0.17,-0.27,-0.11,-0.32,-0.1,-0.08,-0.11,-0.1,-0.19,-0.1,-0.03,-0.1,-0.08,-0.01,0.25,-0.42,-0.28,0.82,-0.56,-0.85,1.73,-0.12,-0.09,1.12,0.05,0.71,1.63,-0.03,0.0,1.95,1.0,0.0,0.0,0.0
533,1.71,-0.25,-0.25,0.05,-2.13,-0.46,0.61,-0.58,-0.61,-0.23,-0.24,-0.07,0.18,-0.84,-0.5,-0.21,-0.1,-0.17,-0.08,-0.07,-0.11,-0.08,-0.1,-0.27,-0.12,-0.15,-0.24,-0.12,-0.07,-0.17,-0.08,-0.19,-0.32,-0.1,-0.19,-0.17,-0.11,-0.08,-0.22,-0.1,-0.07,-0.31,-0.11,-0.29,-0.24,-0.08,-0.06,-0.22,-0.29,-0.08,-0.05,-0.27,-0.27,-0.06,-0.11,-0.08,-0.19,-0.08,-0.11,-0.11,-0.08,-0.19,-0.12,-0.1,-0.1,-0.1,-0.19,-0.03,-0.08,-0.17,-0.1,-0.11,-0.08,-0.08,-0.08,-0.08,-0.06,-0.09,-0.19,-0.17,-0.21,-0.05,3.12,-0.03,-0.12,-0.12,-0.21,-0.19,-0.08,-0.32,-0.27,-0.08,-0.15,-0.1,-0.1,-0.08,-0.11,-0.1,-0.22,-0.31,-0.09,-0.06,-0.09,-0.22,-0.08,-0.1,-0.07,-0.1,-0.31,-0.11,-0.22,-0.24,-0.12,-0.31,-0.12,-0.11,-0.08,-0.27,-0.07,-0.07,-0.06,-0.19,-0.19,-0.1,3.12,-0.06,-0.15,-0.1,-0.17,-0.27,-0.11,3.12,-0.1,-0.08,-0.11,-0.1,-0.19,-0.1,-0.03,-0.1,-0.08,-0.01,0.25,-0.42,-0.28,0.82,-0.56,1.18,-0.58,-0.12,-0.09,-0.89,0.05,0.71,-0.61,-0.03,0.0,-0.51,0.0,0.0,1.0,0.0
2673,-0.59,-0.26,-0.26,0.05,-2.13,-0.03,-1.29,0.56,1.58,0.16,-0.23,-0.07,0.18,1.19,0.2,-0.21,-0.1,-0.17,-0.08,-0.07,-0.11,-0.08,-0.1,-0.27,-0.12,-0.15,-0.24,-0.12,-0.07,6.06,-0.08,-0.19,-0.32,-0.1,-0.19,-0.17,-0.11,-0.08,-0.22,-0.1,-0.07,-0.31,-0.11,-0.29,-0.24,-0.08,-0.06,-0.22,-0.29,-0.08,-0.05,-0.27,-0.27,-0.06,-0.11,-0.08,-0.19,-0.08,-0.11,-0.11,-0.08,-0.19,-0.12,-0.1,-0.1,-0.1,-0.19,-0.03,-0.08,6.06,-0.1,-0.11,-0.08,-0.08,-0.08,-0.08,-0.06,-0.09,-0.19,-0.17,-0.21,-0.05,-0.32,-0.03,-0.12,-0.12,-0.21,-0.19,-0.08,-0.32,-0.27,-0.08,-0.15,-0.1,-0.1,-0.08,-0.11,-0.1,-0.22,-0.31,-0.09,-0.06,-0.09,-0.22,-0.08,-0.1,-0.07,-0.1,-0.31,-0.11,-0.22,-0.24,-0.12,-0.31,-0.12,-0.11,-0.08,-0.27,-0.07,-0.07,-0.06,-0.19,-0.19,-0.1,-0.32,-0.06,-0.15,-0.1,-0.17,-0.27,-0.11,-0.32,-0.1,-0.08,-0.11,-0.1,-0.19,-0.1,-0.03,-0.1,-0.08,-0.01,0.25,-0.42,3.52,-1.22,-0.56,-0.85,-0.58,-0.12,-0.09,-0.89,0.05,0.71,-0.61,-0.03,0.0,-0.51,1.0,0.0,0.0,0.0
1669,-0.59,-0.26,-0.26,0.05,0.47,-1.14,0.73,0.56,0.12,0.93,-0.23,-0.07,0.18,-0.84,3.0,-0.21,-0.1,-0.17,-0.08,-0.07,-0.11,-0.08,-0.1,-0.27,-0.12,-0.15,-0.24,-0.12,-0.07,-0.17,-0.08,-0.19,3.12,-0.1,-0.19,-0.17,-0.11,-0.08,-0.22,-0.1,-0.07,-0.31,-0.11,-0.29,-0.24,-0.08,-0.06,-0.22,-0.29,-0.08,-0.05,-0.27,-0.27,-0.06,-0.11,-0.08,-0.19,-0.08,-0.11,-0.11,-0.08,-0.19,-0.12,-0.1,-0.1,-0.1,-0.19,-0.03,-0.08,-0.17,-0.1,-0.11,-0.08,-0.08,-0.08,-0.08,-0.06,-0.09,-0.19,-0.17,-0.21,-0.05,-0.32,-0.03,-0.12,-0.12,-0.21,-0.19,-0.08,3.12,-0.27,-0.08,-0.15,-0.1,-0.1,-0.08,-0.11,-0.1,-0.22,-0.31,-0.09,-0.06,-0.09,-0.22,-0.08,-0.1,-0.07,-0.1,-0.31,-0.11,-0.22,-0.24,-0.12,-0.31,-0.12,-0.11,-0.08,-0.27,-0.07,-0.07,-0.06,-0.19,-0.19,-0.1,-0.32,-0.06,-0.15,-0.1,-0.17,-0.27,-0.11,-0.32,-0.1,-0.08,-0.11,-0.1,-0.19,-0.1,-0.03,-0.1,-0.08,-0.01,0.25,-0.42,3.52,0.82,1.78,-0.85,-0.58,-0.12,-0.09,1.12,0.05,0.71,1.63,-0.03,0.0,-0.51,1.0,0.0,0.0,0.0
3442,-0.59,-0.26,-0.26,0.05,-2.13,1.95,0.78,-0.58,-0.61,-0.33,-0.24,-0.07,0.18,1.19,-0.5,-0.21,-0.1,-0.17,-0.08,-0.07,-0.11,-0.08,-0.1,-0.27,-0.12,-0.15,-0.24,-0.12,-0.07,-0.17,-0.08,-0.19,-0.32,-0.1,-0.19,-0.17,-0.11,-0.08,-0.22,-0.1,-0.07,-0.31,-0.11,-0.29,-0.24,-0.08,-0.06,-0.22,-0.29,-0.08,-0.05,-0.27,-0.27,-0.06,-0.11,-0.08,-0.19,-0.08,-0.11,-0.11,-0.08,-0.19,8.55,-0.1,-0.1,-0.1,-0.19,-0.03,-0.08,-0.17,-0.1,-0.11,-0.08,-0.08,-0.08,-0.08,-0.06,-0.09,-0.19,-0.17,-0.21,-0.05,-0.32,-0.03,8.55,-0.12,-0.21,-0.19,-0.08,-0.32,-0.27,-0.08,-0.15,-0.1,-0.1,-0.08,-0.11,-0.1,-0.22,-0.31,-0.09,-0.06,-0.09,-0.22,-0.08,-0.1,-0.07,-0.1,-0.31,-0.11,-0.22,-0.24,8.55,-0.31,8.55,-0.11,-0.08,-0.27,-0.07,-0.07,-0.06,-0.19,-0.19,-0.1,-0.32,-0.06,-0.15,-0.1,-0.17,-0.27,-0.11,-0.32,-0.1,-0.08,-0.11,-0.1,-0.19,-0.1,-0.03,-0.1,-0.08,-0.01,0.25,2.37,-0.28,-1.22,-0.56,-0.85,1.73,-0.12,-0.09,-0.89,0.05,0.71,-0.61,-0.03,0.0,-0.51,0.0,0.0,1.0,0.0


## Y-Values Preprocessor

In [114]:
### --- Creating target pipeline --- ###

# Filling missing values in target
miss_target_imputer = SimpleImputer(strategy='mean')

# Binarizing target values
binary_converter = Binarizer(threshold = 3.999)

target_pipe = Pipeline(steps=[('imputer', miss_target_imputer),
                           ('converter', binary_converter)])

target_pipe

In [115]:
## Running pipelines on each set

y_train_xf = target_pipe.fit_transform([y_train])

y_test_xf = target_pipe.fit_transform([y_test])

In [116]:
## Checking class balance between train/test split

print(pd.Series(y_train_xf[0]).value_counts(normalize=True),'\n\n',pd.Series(y_test_xf[0]).value_counts(normalize=True))

1.00   0.96
0.00   0.04
dtype: float64 

 1.00   0.96
0.00   0.04
dtype: float64


In [117]:
## Checking post-processing null value counts

print(f' Pre-pipeline y_train null totals:  {pd.Series(y_train).isna().sum()}\n',
      f'Post-pipeline y_train null totals: {pd.Series(y_train_xf[0]).isna().sum()}\n\n',
      f'Pre-pipeline y_test null totals:  {pd.Series(y_test).isna().sum()}\n',
      f'Post-pipeline y_test null totals: {pd.Series(y_test_xf[0]).isna().sum()}')

 Pre-pipeline y_train null totals:  1563
 Post-pipeline y_train null totals: 0

 Pre-pipeline y_test null totals:  518
 Post-pipeline y_test null totals: 0


# 📝 Next Steps

* Process classification model - i.e. Logreg, KNN, DecisionTrees, etc.
* Evaluate results
* Determine if I need to redo pre-processing steps

# 🚿 Classification Pipeline