# Lab 8: Implement Your Machine Learning Project Plan

In this lab assignment, you will implement the machine learning project plan you created in the written assignment. You will:

1. Load your data set and save it to a Pandas DataFrame.
2. Perform exploratory data analysis on your data to determine which feature engineering and data preparation techniques you will use.
3. Prepare your data for your model and create features and a label.
4. Fit your model to the training data and evaluate your model.
5. Improve your model by performing model selection and/or feature selection techniques to find the best model for your problem.

### Import Packages

Before you get started, import a few packages.

In [1]:
import pandas as pd
import numpy as np
import os 
import matplotlib.pyplot as plt
import seaborn as sns

<b>Task:</b> In the code cell below, import additional packages that you have used in this course that you will need for this task.

In [2]:
# YOUR CODE HERE
from sklearn.model_selection import train_test_split 
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score



## Part 1: Load the Data Set


You have chosen to work with one of four data sets. The data sets are located in a folder named "data." The file names of the four data sets are as follows:

* The "adult" data set that contains Census information from 1994 is located in file `adultData.csv`
* The Airbnb NYC "listings" data set is located in file `airbnbListingsData.csv`
* The World Happiness Report (WHR) data set is located in file `WHR2018Chapter2OnlineData.csv`
* The book review data set is located in file `bookReviewsData.csv`



<b>Task:</b> In the code cell below, load your data with `pd.read_csv()`, as you have done throughout this course, and save it to DataFrame `df`.

In [3]:
# YOUR CODE HERE
airbnbDataSet_filename = os.path.join(os.getcwd(), "data", "airbnbListingsData.csv")
df = pd.read_csv(airbnbDataSet_filename, header=0)
df


Unnamed: 0,name,description,neighborhood_overview,host_name,host_location,host_about,host_response_rate,host_acceptance_rate,host_is_superhost,host_listings_count,...,review_scores_communication,review_scores_location,review_scores_value,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month,n_host_verifications
0,Skylit Midtown Castle,"Beautiful, spacious skylit studio in the heart...",Centrally located in the heart of Manhattan ju...,Jennifer,"New York, New York, United States",A New Yorker since 2000! My passion is creatin...,0.80,0.17,True,8.0,...,4.79,4.86,4.41,False,3,3,0,0,0.33,9
1,"Whole flr w/private bdrm, bath & kitchen(pls r...","Enjoy 500 s.f. top floor in 1899 brownstone, w...",Just the right mix of urban center and local n...,LisaRoxanne,"New York, New York, United States",Laid-back Native New Yorker (formerly bi-coast...,0.09,0.69,True,1.0,...,4.80,4.71,4.64,False,1,1,0,0,4.86,6
2,"Spacious Brooklyn Duplex, Patio + Garden",We welcome you to stay in our lovely 2 br dupl...,,Rebecca,"Brooklyn, New York, United States","Rebecca is an artist/designer, and Henoch is i...",1.00,0.25,True,1.0,...,5.00,4.50,5.00,False,1,1,0,0,0.02,3
3,Large Furnished Room Near B'way,Please don’t expect the luxury here just a bas...,"Theater district, many restaurants around here.",Shunichi,"New York, New York, United States",I used to work for a financial industry but no...,1.00,1.00,True,1.0,...,4.42,4.87,4.36,False,1,0,1,0,3.68,4
4,Cozy Clean Guest Room - Family Apt,"Our best guests are seeking a safe, clean, spa...",Our neighborhood is full of restaurants and ca...,MaryEllen,"New York, New York, United States",Welcome to family life with my oldest two away...,,,True,1.0,...,4.95,4.94,4.92,False,1,0,1,0,0.87,7
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
28017,Astoria Luxury suite 2A,THIS LOVELY HOME IS THE SPACIOUS SUITE WITH PR...,,Vicky,"Queens, New York, United States",,1.00,1.00,True,8.0,...,5.00,3.00,1.00,True,8,0,8,0,1.00,2
28018,Newly renovated suite in the heart of Williams...,Just fully renovated from head to toe. On the ...,,Samuel,"New York, New York, United States","Hello, my name is Sam. I am a real estate prof...",0.91,0.89,True,0.0,...,5.00,5.00,5.00,False,1,1,0,0,2.00,5
28019,Perfect Room to Stay in Brooklyn! Near Metro!,"Amazing and comfortable space in Brooklyn, sam...",,Carlos,US,,0.99,0.99,True,6.0,...,5.00,5.00,2.00,True,7,0,7,0,1.00,2
28020,New Beautiful Modern One Bedroom in Brooklyn,This stylish place to stay is perfect for a gr...,,Lexia,"New York, New York, United States","I am a graphic designer, swell chaser and duri...",0.90,1.00,True,3.0,...,5.00,5.00,5.00,False,3,3,0,0,1.00,7


## Part 2: Exploratory Data Analysis

The next step is to inspect and analyze your data set with your machine learning problem and project plan in mind. 

This step will help you determine data preparation and feature engineering techniques you will need to apply to your data to build a balanced modeling data set for your problem and model. These data preparation techniques may include:
* addressing missingness, such as replacing missing values with means
* renaming features and labels
* finding and replacing outliers
* performing winsorization if needed
* performing one-hot encoding on categorical features
* performing vectorization for an NLP problem
* addressing class imbalance in your data sample to promote fair AI
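
Two of the techniques listed above, mean imputation and winsorization, can be sketched on a small made-up DataFrame. This is a minimal illustration, not the lab's prescribed method: the `toy` frame, its values, and the 5th/95th percentile cutoffs are all assumptions chosen for the example, and winsorization is done here by clipping with pandas rather than with a dedicated library function.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one column with a missing value, one with an outlier.
toy = pd.DataFrame({
    "bedrooms": [1.0, 2.0, np.nan, 1.0, 3.0],
    "price": [80.0, 95.0, 100.0, 110.0, 5000.0],
})

# Mean imputation: replace the missing value with the column mean.
toy["bedrooms_filled"] = toy["bedrooms"].fillna(toy["bedrooms"].mean())

# Winsorization by clipping: values beyond the 5th/95th percentiles
# (an assumed cutoff) are pulled back to those percentiles.
lower, upper = toy["price"].quantile([0.05, 0.95])
toy["price_wins"] = toy["price"].clip(lower=lower, upper=upper)

print(toy)
```

Clipping keeps every row in the data set while limiting the influence of extreme values, which is why it is often preferred over simply deleting outliers.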


Think of the different techniques you have used to inspect and analyze your data in this course. These include using Pandas to apply data filters, using the Pandas `describe()` method to get insight into key statistics for each column, using the Pandas `dtypes` property to inspect the data type of each column, and using Matplotlib and Seaborn to detect outliers and visualize relationships between features and labels. If you are working on a classification problem, use techniques you have learned to determine if there is class imbalance.
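
As a rough sketch of the inspection steps described above, the snippet below runs `dtypes`, `describe()`, and a class-balance check on a small made-up DataFrame. The `toy` frame and its values are hypothetical stand-ins; on the real data the same calls would be made on `df` and on whichever column you choose as the label.

```python
import pandas as pd

# Hypothetical toy data standing in for a few columns of a listings data set.
toy = pd.DataFrame({
    "accommodates": [1, 3, 4, 2, 1, 6],
    "price": [150.0, 75.0, 275.0, 68.0, 75.0, 140.0],
    "instant_bookable": [False, False, False, False, True, True],
})

# Data type of each column.
print(toy.dtypes)

# Summary statistics for the numeric columns.
print(toy.describe())

# Class balance for a boolean label: counts per class and share of positives.
class_counts = toy["instant_bookable"].value_counts()
positive_share = toy["instant_bookable"].mean()
print(class_counts)
print(positive_share)
```

If one class makes up most of the examples, accuracy alone can be misleading, because a model that always predicts the majority class already scores at that majority share.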


<b>Task</b>: Use the techniques you have learned in this course to inspect and analyze your data. 

<b>Note</b>: You can add code cells if needed by going to the <b>Insert</b> menu and clicking on <b>Insert Cell Below</b> in the drop-down menu.

In [4]:
df.dtypes

name                                             object
description                                      object
neighborhood_overview                            object
host_name                                        object
host_location                                    object
host_about                                       object
host_response_rate                              float64
host_acceptance_rate                            float64
host_is_superhost                                  bool
host_listings_count                             float64
host_total_listings_count                       float64
host_has_profile_pic                               bool
host_identity_verified                             bool
neighbourhood_group_cleansed                     object
room_type                                        object
accommodates                                      int64
bathrooms                                       float64
bedrooms                                        

In [5]:
# One-hot encode two categorical (object) columns
to_encode = ["neighbourhood_group_cleansed", "room_type"]

# Create the encoder:
encoder = OneHotEncoder(handle_unknown="error", sparse=False)

# Apply the encoder:
df_enc = pd.DataFrame(encoder.fit_transform(df[to_encode]))

# Reinstate the original column names:
df_enc.columns = encoder.get_feature_names(to_encode)

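# Replace the original categorical columns with their one-hot encoded versions
df.drop(columns=to_encode, inplace=True)
df = df.join(df_enc)

# Drop the remaining object (free-text) columns
df = df.select_dtypes(exclude=['object'])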
df





Unnamed: 0,host_response_rate,host_acceptance_rate,host_is_superhost,host_listings_count,host_total_listings_count,host_has_profile_pic,host_identity_verified,accommodates,bathrooms,bedrooms,...,n_host_verifications,neighbourhood_group_cleansed_Bronx,neighbourhood_group_cleansed_Brooklyn,neighbourhood_group_cleansed_Manhattan,neighbourhood_group_cleansed_Queens,neighbourhood_group_cleansed_Staten Island,room_type_Entire home/apt,room_type_Hotel room,room_type_Private room,room_type_Shared room
0,0.80,0.17,True,8.0,8.0,True,True,1,1.0,,...,9,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0
1,0.09,0.69,True,1.0,1.0,True,True,3,1.0,1.0,...,6,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
2,1.00,0.25,True,1.0,1.0,True,True,4,1.5,2.0,...,3,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
3,1.00,1.00,True,1.0,1.0,True,True,2,1.0,1.0,...,4,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
4,,,True,1.0,1.0,True,True,1,1.0,1.0,...,7,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
28017,1.00,1.00,True,8.0,8.0,True,True,2,1.0,1.0,...,2,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0,0.0
28018,0.91,0.89,True,0.0,0.0,True,True,6,1.0,2.0,...,5,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0
28019,0.99,0.99,True,6.0,6.0,True,True,2,2.0,1.0,...,2,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0
28020,0.90,1.00,True,3.0,3.0,True,True,3,1.0,1.0,...,7,0.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0


In [6]:
#remove columns with nan values
nan_count = np.sum(df.isnull(), axis = 0)

condition = nan_count != 0 # look for all columns with missing values

col_names = nan_count[condition].index # get the column names
nan_cols = list(col_names) # convert column names to list
df.drop(columns=nan_cols, axis=1, inplace=True)



## Part 3: Implement Your Project Plan

<b>Task:</b> Use the rest of this notebook to carry out your project plan. You will:

1. Prepare your data for your model and create features and a label.
2. Fit your model to the training data and evaluate your model.
3. Improve your model by performing model selection and/or feature selection techniques to find the best model for your problem.


Add code cells below and populate the notebook with commentary, code, analyses, results, and figures as you see fit.

In [7]:
df.columns

Index(['host_is_superhost', 'host_listings_count', 'host_total_listings_count',
       'host_has_profile_pic', 'host_identity_verified', 'accommodates',
       'bathrooms', 'price', 'minimum_nights', 'maximum_nights',
       'minimum_minimum_nights', 'maximum_minimum_nights',
       'minimum_maximum_nights', 'maximum_maximum_nights',
       'minimum_nights_avg_ntm', 'maximum_nights_avg_ntm', 'has_availability',
       'availability_30', 'availability_60', 'availability_90',
       'availability_365', 'number_of_reviews', 'number_of_reviews_ltm',
       'number_of_reviews_l30d', 'review_scores_rating',
       'review_scores_cleanliness', 'review_scores_checkin',
       'review_scores_communication', 'review_scores_location',
       'review_scores_value', 'instant_bookable',
       'calculated_host_listings_count',
       'calculated_host_listings_count_entire_homes',
       'calculated_host_listings_count_private_rooms',
       'calculated_host_listings_count_shared_rooms', 'reviews_per

In [12]:
#Creating features and labels
Y = df['instant_bookable']
X = df.drop(columns = 'instant_bookable', axis=1)

#train test split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=123)

#create and fit decision tree model
model = DecisionTreeClassifier(criterion = "entropy", max_depth = 8, min_samples_leaf = 1)
model.fit(X_train, y_train)
class_label_predictions = model.predict(X_test)
acc_score = accuracy_score(y_test, class_label_predictions)


In [13]:
print(acc_score)

0.7741133217993079
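
The third step of the plan, improving the model through model selection, could be sketched with a cross-validated grid search over the decision tree's hyperparameters. This is an illustration under stated assumptions rather than a result for the Airbnb data: it uses synthetic data from `make_classification`, and the grid values, the `_demo` variable names, and the 5-fold setting are arbitrary choices for the example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption): 500 examples, 10 numeric features.
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=123)
X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=123
)

# Candidate hyperparameter values to compare (assumed for illustration).
param_grid = {
    "max_depth": [2, 4, 8, 16],
    "min_samples_leaf": [1, 25, 50],
}

# 5-fold cross-validation on the training set only; the test set stays unseen.
grid = GridSearchCV(
    DecisionTreeClassifier(criterion="entropy", random_state=123),
    param_grid,
    cv=5,
    scoring="accuracy",
)
grid.fit(X_train_demo, y_train_demo)

# Best combination found, and the refit model's accuracy on the held-out test set.
best_params = grid.best_params_
test_accuracy = grid.score(X_test_demo, y_test_demo)
print(best_params)
print(test_accuracy)
```

Selecting hyperparameters with cross-validation on the training data, and touching the test set only once at the end, keeps the final accuracy estimate from being optimistically biased by the search itself.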
