# Lab 8: Define and Solve an ML Problem of Your Choosing

In [1]:
import pandas as pd
import numpy as np
import os 
import matplotlib.pyplot as plt
import seaborn as sns

In this lab assignment, you will follow the machine learning life cycle and implement a model to solve a machine learning problem of your choosing. You will select a data set and choose a predictive problem that the data set supports. You will then inspect the data with your problem in mind and begin to formulate a project plan. Finally, you will implement the machine learning project plan.

You will complete the following tasks:

1. Build Your DataFrame
2. Define Your ML Problem
3. Perform exploratory data analysis to understand your data.
4. Define Your Project Plan
5. Implement Your Project Plan:
    * Prepare your data for your model.
    * Fit your model to the training data and evaluate your model.
    * Improve your model's performance.

## Part 1: Build Your DataFrame

You will have the option to choose one of four data sets that you have worked with in this program:

* The "census" data set that contains Census information from 1994: `censusData.csv`
* Airbnb NYC "listings" data set: `airbnbListingsData.csv`
* World Happiness Report (WHR) data set: `WHR2018Chapter2OnlineData.csv`
* Book Review data set: `bookReviewsData.csv`

Note that these are variations of the data sets that you have worked with in this program. For example, some do not include some of the preprocessing necessary for specific models. 

#### Load a Data Set and Save it as a Pandas DataFrame

The code cell below contains filenames (path + filename) for each of the four data sets available to you.

<b>Task:</b> In the code cell below, use the same method you have been using to load the data using `pd.read_csv()` and save it to DataFrame `df`. 

You can load each file as a new DataFrame to inspect the data before choosing your data set.

In [2]:
# File names of the four data sets
adultDataSet_filename = os.path.join(os.getcwd(), "data", "censusData.csv")
airbnbDataSet_filename = os.path.join(os.getcwd(), "data", "airbnbListingsData.csv")
WHRDataSet_filename = os.path.join(os.getcwd(), "data", "WHR2018Chapter2OnlineData.csv")
bookReviewDataSet_filename = os.path.join(os.getcwd(), "data", "bookReviewsData.csv")


df = pd.read_csv(airbnbDataSet_filename, header = 0) # YOUR CODE HERE

print(df.columns)

Index(['name', 'description', 'neighborhood_overview', 'host_name',
       'host_location', 'host_about', 'host_response_rate',
       'host_acceptance_rate', 'host_is_superhost', 'host_listings_count',
       'host_total_listings_count', 'host_has_profile_pic',
       'host_identity_verified', 'neighbourhood_group_cleansed', 'room_type',
       'accommodates', 'bathrooms', 'bedrooms', 'beds', 'amenities', 'price',
       'minimum_nights', 'maximum_nights', 'minimum_minimum_nights',
       'maximum_minimum_nights', 'minimum_maximum_nights',
       'maximum_maximum_nights', 'minimum_nights_avg_ntm',
       'maximum_nights_avg_ntm', 'has_availability', 'availability_30',
       'availability_60', 'availability_90', 'availability_365',
       'number_of_reviews', 'number_of_reviews_ltm', 'number_of_reviews_l30d',
       'review_scores_rating', 'review_scores_cleanliness',
       'review_scores_checkin', 'review_scores_communication',
       'review_scores_location', 'review_scores_value',

## Part 2: Define Your ML Problem

Next you will formulate your ML Problem. In the markdown cell below, answer the following questions:

1. List the data set you have chosen.
2. What will you be predicting? What is the label?
3. Is this a supervised or unsupervised learning problem? Is this a clustering, classification or regression problem? Is it a binary classification or multi-class classification problem?
4. What are your features? (note: this list may change after you explore your data)
5. Explain why this is an important problem. In other words, how would a company create value with a model that predicts this label?

1. I chose the Airbnb NYC listings data set (`airbnbListingsData.csv`).
2. The label is 'review_scores_rating'; I am predicting the overall review rating of a listing.
3. This is a supervised learning problem because the model learns patterns from examples that include the label. It is a regression problem because the label is a number between 0 and 5.
4. For now, my features are 'host_is_superhost', 'room_type', 'accommodates', 'bathrooms', 'bedrooms', 'beds', 'amenities', 'price', 'minimum_nights', 'maximum_nights', 'minimum_minimum_nights', 'maximum_minimum_nights', 'minimum_maximum_nights', 'maximum_maximum_nights', 'review_scores_cleanliness', 'review_scores_checkin', 'review_scores_communication', 'review_scores_location', 'review_scores_value', 'instant_bookable', 'minimum_nights_avg_ntm', 'maximum_nights_avg_ntm', and 'has_availability'.
5. This is an important problem because a company could use the model to identify listings that are likely to receive good reviews and recommend them to customers, which should make customers happier and more likely to book.

## Part 3: Understand Your Data

The next step is to perform exploratory data analysis. Inspect and analyze your data set with your machine learning problem in mind. Consider the following as you inspect your data:

1. What data preparation techniques would you like to use? These data preparation techniques may include:

    * addressing missingness, such as replacing missing values with means
    * finding and replacing outliers
    * renaming features and labels
    * performing feature engineering techniques such as one-hot encoding on categorical features
    * selecting appropriate features and removing irrelevant features
    * performing specific data cleaning and preprocessing techniques for an NLP problem
    * addressing class imbalance in your data sample to promote fair AI
    

2. What machine learning model (or models) would you like to use that is suitable for your predictive problem and data?
    * Are there other data preparation techniques that you will need to apply to build a balanced modeling data set for your problem and model? For example, will you need to scale your data?
 
 
3. How will you evaluate and improve the model's performance?
    * Are there specific evaluation metrics and methods that are appropriate for your model?
    

Think of the different techniques you have used to inspect and analyze your data in this course. These include using Pandas to apply data filters, using the Pandas `describe()` method to get insight into key statistics for each column, using the Pandas `dtypes` property to inspect the data type of each column, and using Matplotlib and Seaborn to detect outliers and visualize relationships between features and labels. If you are working on a classification problem, use techniques you have learned to determine if there is class imbalance.
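As a minimal sketch of these inspection steps, the block below runs `dtypes`, `describe()`, and a simple interquartile-range outlier check on a small made-up DataFrame. The column values are illustrative assumptions, not taken from any of the lab data sets, and the 1.5 × IQR rule is one common convention rather than a method this lab requires.

```python
import pandas as pd

# Toy data (hypothetical values, not from the lab data sets)
toy = pd.DataFrame({
    "price": [80.0, 95.0, 110.0, 120.0, 2500.0],
    "room_type": ["Private room", "Entire home/apt", "Private room",
                  "Shared room", "Entire home/apt"],
})

print(toy.dtypes)      # data type of each column
print(toy.describe())  # summary statistics for the numeric columns

# Flag values more than 1.5 * IQR beyond the quartiles (a common rule of thumb)
q1, q3 = toy["price"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = toy[(toy["price"] < q1 - 1.5 * iqr) | (toy["price"] > q3 + 1.5 * iqr)]
print(outliers)
```

On real data, a box plot from Seaborn or Matplotlib gives a visual version of the same check.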

<b>Task</b>: Use the techniques you have learned in this course to inspect and analyze your data. You can import additional packages that you have used in this course that you will need to perform this task.

<b>Note</b>: You can add code cells if needed by going to the <b>Insert</b> menu and clicking on <b>Insert Cell Below</b> in the drop-down menu.

In [3]:
import scipy.stats as stats

In [4]:
df.columns

Index(['name', 'description', 'neighborhood_overview', 'host_name',
       'host_location', 'host_about', 'host_response_rate',
       'host_acceptance_rate', 'host_is_superhost', 'host_listings_count',
       'host_total_listings_count', 'host_has_profile_pic',
       'host_identity_verified', 'neighbourhood_group_cleansed', 'room_type',
       'accommodates', 'bathrooms', 'bedrooms', 'beds', 'amenities', 'price',
       'minimum_nights', 'maximum_nights', 'minimum_minimum_nights',
       'maximum_minimum_nights', 'minimum_maximum_nights',
       'maximum_maximum_nights', 'minimum_nights_avg_ntm',
       'maximum_nights_avg_ntm', 'has_availability', 'availability_30',
       'availability_60', 'availability_90', 'availability_365',
       'number_of_reviews', 'number_of_reviews_ltm', 'number_of_reviews_l30d',
       'review_scores_rating', 'review_scores_cleanliness',
       'review_scores_checkin', 'review_scores_communication',
       'review_scores_location', 'review_scores_value',

In [5]:
df.head(5)

Unnamed: 0,name,description,neighborhood_overview,host_name,host_location,host_about,host_response_rate,host_acceptance_rate,host_is_superhost,host_listings_count,...,review_scores_communication,review_scores_location,review_scores_value,instant_bookable,calculated_host_listings_count,calculated_host_listings_count_entire_homes,calculated_host_listings_count_private_rooms,calculated_host_listings_count_shared_rooms,reviews_per_month,n_host_verifications
0,Skylit Midtown Castle,"Beautiful, spacious skylit studio in the heart...",Centrally located in the heart of Manhattan ju...,Jennifer,"New York, New York, United States",A New Yorker since 2000! My passion is creatin...,0.8,0.17,True,8.0,...,4.79,4.86,4.41,False,3,3,0,0,0.33,9
1,"Whole flr w/private bdrm, bath & kitchen(pls r...","Enjoy 500 s.f. top floor in 1899 brownstone, w...",Just the right mix of urban center and local n...,LisaRoxanne,"New York, New York, United States",Laid-back Native New Yorker (formerly bi-coast...,0.09,0.69,True,1.0,...,4.8,4.71,4.64,False,1,1,0,0,4.86,6
2,"Spacious Brooklyn Duplex, Patio + Garden",We welcome you to stay in our lovely 2 br dupl...,,Rebecca,"Brooklyn, New York, United States","Rebecca is an artist/designer, and Henoch is i...",1.0,0.25,True,1.0,...,5.0,4.5,5.0,False,1,1,0,0,0.02,3
3,Large Furnished Room Near B'way,Please don’t expect the luxury here just a bas...,"Theater district, many restaurants around here.",Shunichi,"New York, New York, United States",I used to work for a financial industry but no...,1.0,1.0,True,1.0,...,4.42,4.87,4.36,False,1,0,1,0,3.68,4
4,Cozy Clean Guest Room - Family Apt,"Our best guests are seeking a safe, clean, spa...",Our neighborhood is full of restaurants and ca...,MaryEllen,"New York, New York, United States",Welcome to family life with my oldest two away...,,,True,1.0,...,4.95,4.94,4.92,False,1,0,1,0,0.87,7


In [6]:
# YOUR CODE HERE
df['review_scores_rating'].unique()

array([4.7 , 4.45, 5.  , 4.21, 4.91, 4.56, 4.88, 4.86, 4.87, 4.76, 4.52,
       4.89, 4.66, 4.74, 4.39, 4.81, 4.9 , 4.49, 4.14, 4.68, 4.75, 4.82,
       4.55, 4.58, 4.85, 4.93, 4.6 , 4.78, 4.83, 4.8 , 4.84, 4.41, 4.95,
       4.71, 4.69, 4.34, 4.05, 4.97, 4.43, 4.62, 4.77, 4.92, 4.33, 3.5 ,
       4.67, 4.61, 4.63, 4.42, 4.65, 4.3 , 4.  , 4.35, 4.79, 4.98, 4.72,
       4.37, 4.53, 4.23, 4.59, 4.99, 4.64, 4.51, 4.12, 4.57, 4.73, 4.28,
       4.17, 4.5 , 4.48, 4.96, 4.29, 3.67, 4.36, 4.94, 4.44, 4.54, 4.07,
       4.18, 4.25, 3.75, 4.38, 4.11, 3.8 , 4.13, 4.46, 4.15, 4.26, 2.  ,
       4.47, 0.  , 4.24, 4.4 , 4.09, 3.  , 4.31, 4.06, 4.16, 4.32, 4.2 ,
       4.27, 1.  , 4.08, 4.22, 3.9 , 3.6 , 1.5 , 4.19, 3.57, 3.86, 3.29,
       3.83, 3.88, 3.43, 3.33, 3.71, 2.67, 3.89, 2.5 , 3.78, 3.4 , 3.25,
       3.2 , 3.93, 3.13, 3.22, 4.1 , 2.33, 1.75, 3.91, 4.03, 4.02, 3.96,
       3.17, 3.98, 3.56, 3.55, 3.73, 2.89, 3.3 , 3.97, 3.94, 3.95, 3.92,
       3.63, 4.04, 2.25, 3.76, 3.81, 3.65, 3.79, 2.

In [7]:
to_drop = ['name', 'description', 'neighborhood_overview', 'host_name',
    'host_location', 'host_about', 'host_response_rate',
    'host_acceptance_rate', 'host_listings_count',
    'host_total_listings_count', 'host_has_profile_pic',
    'host_identity_verified', 'neighbourhood_group_cleansed',
    'availability_30', 'availability_60', 'availability_90', 'availability_365',
    'number_of_reviews', 'number_of_reviews_ltm', 'number_of_reviews_l30d',
    'calculated_host_listings_count',
    'calculated_host_listings_count_entire_homes',
    'calculated_host_listings_count_private_rooms',
    'calculated_host_listings_count_shared_rooms', 'reviews_per_month',
    'n_host_verifications']
df = df.drop(columns = to_drop, inplace = False)

In [8]:
nan_count = np.sum(df.isnull())
nan_count

host_is_superhost                 0
room_type                         0
accommodates                      0
bathrooms                         0
bedrooms                       2918
beds                           1354
amenities                         0
price                             0
minimum_nights                    0
maximum_nights                    0
minimum_minimum_nights            0
maximum_minimum_nights            0
minimum_maximum_nights            0
maximum_maximum_nights            0
minimum_nights_avg_ntm            0
maximum_nights_avg_ntm            0
has_availability                  0
review_scores_rating              0
review_scores_cleanliness         0
review_scores_checkin             0
review_scores_communication       0
review_scores_location            0
review_scores_value               0
instant_bookable                  0
dtype: int64

In [9]:
df.columns

Index(['host_is_superhost', 'room_type', 'accommodates', 'bathrooms',
       'bedrooms', 'beds', 'amenities', 'price', 'minimum_nights',
       'maximum_nights', 'minimum_minimum_nights', 'maximum_minimum_nights',
       'minimum_maximum_nights', 'maximum_maximum_nights',
       'minimum_nights_avg_ntm', 'maximum_nights_avg_ntm', 'has_availability',
       'review_scores_rating', 'review_scores_cleanliness',
       'review_scores_checkin', 'review_scores_communication',
       'review_scores_location', 'review_scores_value', 'instant_bookable'],
      dtype='object')

In [10]:
df.shape

(28022, 24)

In [11]:
nan_detected = nan_count != 0
nan_detected

host_is_superhost              False
room_type                      False
accommodates                   False
bathrooms                      False
bedrooms                        True
beds                            True
amenities                      False
price                          False
minimum_nights                 False
maximum_nights                 False
minimum_minimum_nights         False
maximum_minimum_nights         False
minimum_maximum_nights         False
maximum_maximum_nights         False
minimum_nights_avg_ntm         False
maximum_nights_avg_ntm         False
has_availability               False
review_scores_rating           False
review_scores_cleanliness      False
review_scores_checkin          False
review_scores_communication    False
review_scores_location         False
review_scores_value            False
instant_bookable               False
dtype: bool

In [12]:
int_float = (df.dtypes == 'int64') | (df.dtypes == 'float64')
int_float

host_is_superhost              False
room_type                      False
accommodates                    True
bathrooms                       True
bedrooms                        True
beds                            True
amenities                      False
price                           True
minimum_nights                  True
maximum_nights                  True
minimum_minimum_nights          True
maximum_minimum_nights          True
minimum_maximum_nights          True
maximum_maximum_nights          True
minimum_nights_avg_ntm          True
maximum_nights_avg_ntm          True
has_availability               False
review_scores_rating            True
review_scores_cleanliness       True
review_scores_checkin           True
review_scores_communication     True
review_scores_location          True
review_scores_value             True
instant_bookable               False
dtype: bool

In [13]:
miss_and_int_float = nan_detected & int_float
miss_and_int_float

host_is_superhost              False
room_type                      False
accommodates                   False
bathrooms                      False
bedrooms                        True
beds                            True
amenities                      False
price                          False
minimum_nights                 False
maximum_nights                 False
minimum_minimum_nights         False
maximum_minimum_nights         False
minimum_maximum_nights         False
maximum_maximum_nights         False
minimum_nights_avg_ntm         False
maximum_nights_avg_ntm         False
has_availability               False
review_scores_rating           False
review_scores_cleanliness      False
review_scores_checkin          False
review_scores_communication    False
review_scores_location         False
review_scores_value            False
instant_bookable               False
dtype: bool

In [14]:
replace = df.columns[miss_and_int_float]
replace

Index(['bedrooms', 'beds'], dtype='object')

In [15]:
df['bedrooms_na'] = df[replace[0]].isnull()
df['beds_na'] = df[replace[1]].isnull()

In [16]:
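# Assign the result back rather than calling fillna(inplace = True) on a column selection,
# which newer pandas versions warn about and may not apply to df
df['bedrooms'] = df['bedrooms'].fillna(df['bedrooms'].mean())
df['beds'] = df['beds'].fillna(df['beds'].mean())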

In [17]:
for name in replace:
    print("{} missing values count :{}".format(name, np.sum(df[name].isnull(), axis = 0)))

bedrooms missing values count :0
beds missing values count :0
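The imputation steps above can be sketched as a single reusable helper. This is a minimal sketch on made-up values; the function name `impute_mean_with_flag` is hypothetical and is not part of pandas or of this lab.

```python
import numpy as np
import pandas as pd

def impute_mean_with_flag(frame, column):
    """Return a copy in which `column` is mean-imputed and `<column>_na`
    marks the rows that were originally missing (hypothetical helper)."""
    out = frame.copy()
    out[column + "_na"] = out[column].isnull().astype(int)
    out[column] = out[column].fillna(out[column].mean())
    return out

toy = pd.DataFrame({"beds": [1.0, np.nan, 3.0, 2.0]})  # hypothetical values
filled = impute_mean_with_flag(toy, "beds")
print(filled)
```

Recording the indicator before filling matters: once the missing values are replaced, the information about which rows were missing is gone.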


In [18]:
print(np.sum(df.isnull()))

host_is_superhost              0
room_type                      0
accommodates                   0
bathrooms                      0
bedrooms                       0
beds                           0
amenities                      0
price                          0
minimum_nights                 0
maximum_nights                 0
minimum_minimum_nights         0
maximum_minimum_nights         0
minimum_maximum_nights         0
maximum_maximum_nights         0
minimum_nights_avg_ntm         0
maximum_nights_avg_ntm         0
has_availability               0
review_scores_rating           0
review_scores_cleanliness      0
review_scores_checkin          0
review_scores_communication    0
review_scores_location         0
review_scores_value            0
instant_bookable               0
bedrooms_na                    0
beds_na                        0
dtype: int64


In [19]:
# Count the unique values in each string-typed column
df.select_dtypes(include=['object']).nunique()
# room_type has only 4 values, so it is a good candidate for one-hot encoding

room_type        4
amenities    25020
dtype: int64

In [20]:
df['host_is_superhost'].unique()
# Every value of host_is_superhost is True, so it carries no information as a feature.
# amenities has 25,020 unique values, which is too many to one-hot encode.
df = df.drop(columns = ['host_is_superhost', 'amenities'])

In [21]:
df['room_type'].unique()

array(['Entire home/apt', 'Private room', 'Hotel room', 'Shared room'],
      dtype=object)

In [22]:
df_rt = pd.get_dummies(df['room_type'], prefix='room_type')
df_rt

Unnamed: 0,room_type_Entire home/apt,room_type_Hotel room,room_type_Private room,room_type_Shared room
0,1,0,0,0
1,1,0,0,0
2,1,0,0,0
3,0,0,1,0
4,0,0,1,0
...,...,...,...,...
28017,0,0,1,0
28018,1,0,0,0
28019,0,0,1,0
28020,1,0,0,0


In [23]:
df = df.join(df_rt)
df = df.drop(columns = 'room_type', inplace = False)

In [24]:
df.columns

Index(['accommodates', 'bathrooms', 'bedrooms', 'beds', 'price',
       'minimum_nights', 'maximum_nights', 'minimum_minimum_nights',
       'maximum_minimum_nights', 'minimum_maximum_nights',
       'maximum_maximum_nights', 'minimum_nights_avg_ntm',
       'maximum_nights_avg_ntm', 'has_availability', 'review_scores_rating',
       'review_scores_cleanliness', 'review_scores_checkin',
       'review_scores_communication', 'review_scores_location',
       'review_scores_value', 'instant_bookable', 'bedrooms_na', 'beds_na',
       'room_type_Entire home/apt', 'room_type_Hotel room',
       'room_type_Private room', 'room_type_Shared room'],
      dtype='object')
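For reference, the one-hot encoding step can be reproduced on a tiny made-up DataFrame. This is a sketch under assumed values; passing `dtype=int` is an optional choice made here so the indicator columns are 0/1 regardless of pandas version.

```python
import pandas as pd

toy = pd.DataFrame({
    "room_type": ["Entire home/apt", "Private room", "Shared room", "Private room"],
    "accommodates": [4, 2, 1, 2],
})  # hypothetical values

# One indicator column per category; dtype=int yields 0/1 instead of booleans
dummies = pd.get_dummies(toy["room_type"], prefix="room_type", dtype=int)

# Replace the original string column with its indicator columns
encoded = toy.drop(columns="room_type").join(dummies)
print(encoded.columns.tolist())
```

Each row has exactly one indicator set to 1, so the indicator columns together carry the same information as the original `room_type` column.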

In [25]:
df.select_dtypes(include=['object']).nunique()

Series([], dtype: float64)

In [26]:
df.select_dtypes(include = ['int']).nunique()

accommodates       16
minimum_nights     95
maximum_nights    229
dtype: int64

In [27]:
df['maximum_nights'].unique()

array([      1125,        730,         14,         21,        700,
               45,         90,         30,        365,         60,
              150,         28,        180,        100,         29,
               20,       1095,        360,         59,        999,
                5,         31,        200,         18,          7,
               27,         69,         62,       1124,        182,
              300,        120,        270,         89,         70,
              151,         40,        518,        350,          3,
              500,        101,        130,        159,       1000,
               34,         22,        760,        400,        220,
               12,        330,         75,         32,         35,
               15,          4,        800,        250,        210,
              125,         24,         99,         41,         95,
               25,        490,          8,        122,         50,
               10,          9,         61,         16,        

In [28]:
df['min_nights'] = stats.mstats.winsorize(df['minimum_nights'], limits=[0, 0.01])

In [29]:
difference = df['minimum_nights'] - df['min_nights']
difference.unique()

array([   0,   45,   15,  125,  289,  290,  105,   10,   13,   75,  295,
        106,    5,  190,  225,  104,  285,    4,  175,   35, 1049, 1175,
        123,   25,  425,  107,  135,   24,  925,   85,  111,  195,  108,
         18,  325,   33,   76,   23,   58,   14,   16,  232,  258,  145,
         70])

In [30]:
df['max_nights'] = stats.mstats.winsorize(df['maximum_nights'], limits=[0, 0.01])

In [31]:
difference = df['maximum_nights'] - df['max_nights']
difference.unique()

array([         0,       8875,        875,        125,   19998875,
       2147482522,          1,       1875,       8874])
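The very large differences above come from a handful of extreme `maximum_nights` values being pulled down to the cap. To see how the `limits` argument behaves, here is a small sketch on a made-up array; it uses a 10% upper limit instead of the 1% used in this notebook so that the effect is visible with only ten values.

```python
import numpy as np
from scipy.stats import mstats

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 1000])  # hypothetical, one extreme value

# limits=[0, 0.1]: leave the low end alone and cap the top 10% of values
# at the largest value that remains
capped = mstats.winsorize(values, limits=[0, 0.1])
print(np.asarray(capped))
```

Winsorizing replaces extreme values instead of dropping rows, so the number of examples stays the same.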

In [32]:
df.select_dtypes(include = ['float']).nunique()

bathrooms                       16
bedrooms                        12
beds                            17
price                          684
minimum_minimum_nights          98
maximum_minimum_nights         102
minimum_maximum_nights         206
maximum_maximum_nights         206
minimum_nights_avg_ntm         329
maximum_nights_avg_ntm         452
review_scores_rating           154
review_scores_cleanliness      196
review_scores_checkin          135
review_scores_communication    141
review_scores_location         153
review_scores_value            164
dtype: int64

In [33]:
df['minimum_minimum_nights'].unique()

array([3.000e+01, 1.000e+00, 5.000e+00, 2.000e+00, 4.000e+00, 2.700e+01,
       1.000e+01, 3.000e+00, 7.000e+00, 1.400e+01, 3.100e+01, 2.800e+01,
       1.200e+02, 9.000e+01, 4.500e+01, 2.000e+02, 1.500e+01, 6.000e+00,
       3.640e+02, 2.100e+01, 6.000e+01, 1.200e+01, 3.650e+02, 7.500e+01,
       2.900e+01, 1.800e+02, 2.000e+01, 8.500e+01, 1.800e+01, 9.000e+00,
       8.800e+01, 5.700e+01, 1.500e+02, 5.000e+01, 3.700e+02, 3.200e+01,
       8.000e+00, 2.500e+01, 1.810e+02, 4.400e+01, 8.000e+01, 5.800e+01,
       2.650e+02, 3.000e+02, 1.790e+02, 3.600e+02, 7.900e+01, 1.900e+01,
       2.500e+02, 1.100e+02, 1.124e+03, 4.000e+01, 1.300e+01, 1.250e+03,
       2.300e+01, 4.900e+01, 6.900e+01, 1.980e+02, 1.000e+02, 5.000e+02,
       1.820e+02, 2.100e+02, 3.500e+01, 1.600e+01, 9.900e+01, 5.500e+01,
       1.000e+03, 1.600e+02, 1.860e+02, 6.800e+01, 2.700e+02, 7.000e+01,
       1.830e+02, 9.300e+01, 4.000e+02, 2.400e+01, 5.200e+01, 1.080e+02,
       1.510e+02, 9.800e+01, 5.900e+01, 7.200e+01, 

In [34]:
df['nprice'] = stats.mstats.winsorize(df['price'], limits=[0.01, 0.01])

In [35]:
diff = df['price'] - df['nprice']
diff.unique()

array([  0.,   1., 101.,  51.,  -1., 100.,  58.,  81.,  26.,  96.,  15.,
        25.,  41.,   6.,   7.,  46.,  83.,  99.,  44.,  43.,  93.,  78.,
        71.,   2.,  87.,  86.,  50.,  12.])

In [36]:
df.select_dtypes(include = ['boolean']).nunique()

has_availability    2
instant_bookable    2
bedrooms_na         2
beds_na             2
dtype: int64

In [37]:
df['instant_bookable'] = df['instant_bookable'].astype(int)
df['instant_bookable'].unique()

array([0, 1])

In [38]:
df['bedrooms_na'] = df['bedrooms_na'].astype(int)
df['bedrooms_na'].unique()

array([1, 0])

In [39]:
df['beds_na'] = df['beds_na'].astype(int)
df['beds_na'].unique()

array([0, 1])

In [40]:
df.select_dtypes(include = ['boolean']).nunique()

has_availability    2
dtype: int64

In [41]:
df['bedrooms_na'].unique()

array([1, 0])

In [42]:
corr_m = round(df.corr(), 5)
corr_m

Unnamed: 0,accommodates,bathrooms,bedrooms,beds,price,minimum_nights,maximum_nights,minimum_minimum_nights,maximum_minimum_nights,minimum_maximum_nights,...,instant_bookable,bedrooms_na,beds_na,room_type_Entire home/apt,room_type_Hotel room,room_type_Private room,room_type_Shared room,min_nights,max_nights,nprice
accommodates,1.0,0.36944,0.72124,0.75362,0.51906,-0.0615,-0.00607,-0.06268,-0.04769,-0.00396,...,-0.00573,-0.07052,-0.05234,0.45266,-0.01507,-0.43853,-0.06092,-0.10559,0.01297,0.52443
bathrooms,0.36944,1.0,0.47263,0.37708,0.3313,-0.01273,-0.00205,-0.01481,-0.00884,-0.00903,...,-0.03001,-0.10914,-0.01265,0.03139,-0.01759,-0.0288,-0.00112,-0.01847,0.01356,0.32857
bedrooms,0.72124,0.47263,1.0,0.73367,0.45717,-0.02905,3e-05,-0.03242,-0.02243,-0.01195,...,-0.04185,0.0,-0.0424,0.35617,-0.02895,-0.34022,-0.05829,-0.05363,-0.00593,0.46218
beds,0.75362,0.37708,0.73367,1.0,0.40033,-0.05026,-0.00358,-0.05215,-0.045,-0.00824,...,-0.01469,-0.0897,0.0,0.32859,-0.01577,-0.33164,0.01555,-0.08959,-0.01826,0.40576
price,0.51906,0.3313,0.45717,0.40033,1.0,-0.07995,-0.00102,-0.07126,-0.00769,0.06401,...,0.03919,0.025,-0.02859,0.3469,0.12791,-0.35546,-0.04794,-0.13961,0.00303,0.99844
minimum_nights,-0.0615,-0.01273,-0.02905,-0.05026,-0.07995,1.0,0.0027,0.94245,0.65217,-0.01393,...,-0.09743,0.0345,-0.02274,0.04911,-0.03599,-0.04155,-0.01163,0.7342,0.14752,-0.08205
maximum_nights,-0.00607,-0.00205,3e-05,-0.00358,-0.00102,0.0027,1.0,0.0026,0.00148,0.22355,...,-0.0036,0.01748,-0.00137,0.00532,-0.00043,-0.00512,-0.00071,0.00499,0.00591,-0.00102
minimum_minimum_nights,-0.06268,-0.01481,-0.03242,-0.05215,-0.07126,0.94245,0.0026,1.0,0.68257,-0.01607,...,-0.0862,0.03016,-0.02235,0.0472,-0.03458,-0.04007,-0.01061,0.69407,0.144,-0.07308
maximum_minimum_nights,-0.04769,-0.00884,-0.02243,-0.045,-0.00769,0.65217,0.00148,0.68257,1.0,-0.00957,...,-0.00876,0.0345,-0.02276,0.06024,-0.02471,-0.05419,-0.01224,0.49256,0.12347,-0.00728
minimum_maximum_nights,-0.00396,-0.00903,-0.01195,-0.00824,0.06401,-0.01393,0.22355,-0.01607,-0.00957,1.0,...,0.03766,-0.00037,-0.00603,-0.02189,0.11307,0.00678,-0.00314,-0.02197,0.00754,0.0655


In [43]:
corrs = corr_m['review_scores_rating']
corrs

accommodates                   0.00780
bathrooms                     -0.00208
bedrooms                       0.01088
beds                           0.00022
price                          0.04507
minimum_nights                -0.03451
maximum_nights                -0.01217
minimum_minimum_nights        -0.04201
maximum_minimum_nights        -0.03237
minimum_maximum_nights        -0.00525
maximum_maximum_nights        -0.01569
minimum_nights_avg_ntm        -0.03265
maximum_nights_avg_ntm        -0.00914
has_availability               0.03040
review_scores_rating           1.00000
review_scores_cleanliness      0.75821
review_scores_checkin          0.68815
review_scores_communication    0.72775
review_scores_location         0.57446
review_scores_value            0.82063
instant_bookable              -0.05847
bedrooms_na                   -0.01924
beds_na                       -0.03202
room_type_Entire home/apt      0.09600
room_type_Hotel room          -0.02559
room_type_Private room   

In [44]:
corr_s = corrs.sort_values(ascending = False)
corr_s

review_scores_rating           1.00000
review_scores_value            0.82063
review_scores_cleanliness      0.75821
review_scores_communication    0.72775
review_scores_checkin          0.68815
review_scores_location         0.57446
room_type_Entire home/apt      0.09600
nprice                         0.04605
price                          0.04507
has_availability               0.03040
bedrooms                       0.01088
accommodates                   0.00780
beds                           0.00022
bathrooms                     -0.00208
minimum_maximum_nights        -0.00525
maximum_nights_avg_ntm        -0.00914
maximum_nights                -0.01217
maximum_maximum_nights        -0.01569
room_type_Shared room         -0.01901
bedrooms_na                   -0.01924
room_type_Hotel room          -0.02559
beds_na                       -0.03202
maximum_minimum_nights        -0.03237
minimum_nights_avg_ntm        -0.03265
minimum_nights                -0.03451
max_nights               

In [45]:
df.columns

Index(['accommodates', 'bathrooms', 'bedrooms', 'beds', 'price',
       'minimum_nights', 'maximum_nights', 'minimum_minimum_nights',
       'maximum_minimum_nights', 'minimum_maximum_nights',
       'maximum_maximum_nights', 'minimum_nights_avg_ntm',
       'maximum_nights_avg_ntm', 'has_availability', 'review_scores_rating',
       'review_scores_cleanliness', 'review_scores_checkin',
       'review_scores_communication', 'review_scores_location',
       'review_scores_value', 'instant_bookable', 'bedrooms_na', 'beds_na',
       'room_type_Entire home/apt', 'room_type_Hotel room',
       'room_type_Private room', 'room_type_Shared room', 'min_nights',
       'max_nights', 'nprice'],
      dtype='object')

In [46]:
df.shape

(28022, 30)

## Part 4: Define Your Project Plan

Now that you understand your data, in the markdown cell below, define your plan to implement the remaining phases of the machine learning life cycle (data preparation, modeling, evaluation) to solve your ML problem. Answer the following questions:

* Do you have a new feature list? If so, what are the features that you chose to keep and remove after inspecting the data? 
* Explain different data preparation techniques that you will use to prepare your data for modeling.
* What is your model (or models)?
* Describe your plan to train your model, analyze its performance and then improve the model. That is, describe your model building, validation and selection plan to produce a model that generalizes well to new data. 

I have a new feature list. The label is 'review_scores_rating', and the features are 'accommodates', 'bathrooms', 'bedrooms', 'beds', 'price', 'minimum_nights', 'maximum_nights', 'minimum_minimum_nights', 'maximum_minimum_nights', 'minimum_maximum_nights', 'maximum_maximum_nights', 'minimum_nights_avg_ntm', 'maximum_nights_avg_ntm', 'has_availability', 'review_scores_cleanliness', 'review_scores_checkin', 'review_scores_communication', 'review_scores_location', 'review_scores_value', 'instant_bookable', 'bedrooms_na', 'beds_na', 'room_type_Entire home/apt', 'room_type_Hotel room', 'room_type_Private room', 'room_type_Shared room', 'min_nights', 'max_nights', and 'nprice'. I removed 'host_is_superhost' because all of its values were the same, and 'amenities' because it had too many unique values to one-hot encode.

To prepare the data, I first dropped the features I did not want and replaced the missing values in 'bedrooms' and 'beds' with the column means, adding indicator columns for the rows that were missing. Then I one-hot encoded 'room_type', winsorized the extreme values of 'minimum_nights', 'maximum_nights', and 'price' into the new columns 'min_nights', 'max_nights', and 'nprice', and converted the boolean columns 'instant_bookable', 'bedrooms_na', and 'beds_na' to 0 / 1 instead of False / True.

The model I chose is a gradient boosted decision tree (GBDT) regressor.

To build the model, I will split the data into 70% training and 30% testing. I will use GridSearchCV with cross-validation on the training data to tune the hyperparameters, then evaluate the best model on the test data using RMSE and R^2. I will keep making changes until the model appears to generalize well to new data.
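The plan above can be sketched end to end on synthetic data. This is an illustration only: `make_regression` stands in for the prepared Airbnb features, the grid is deliberately tiny so it runs quickly, and the `random_state` values are an added assumption to make the run repeatable.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the prepared features and label (assumption)
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

# Hold out 30% for the final test; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0)

# Small illustrative grid; cross-validation uses only the training split
grid = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid={"max_depth": [2, 3], "n_estimators": [25, 50]},
    cv=3,
    scoring="neg_root_mean_squared_error",
)
grid.fit(X_train, y_train)

# GridSearchCV refits the best configuration on the whole training split
pred = grid.best_estimator_.predict(X_test)
test_rmse = np.sqrt(mean_squared_error(y_test, pred))
test_r2 = r2_score(y_test, pred)
print(grid.best_params_, test_rmse, test_r2)
```

Keeping the test split out of the grid search means the final RMSE and R^2 estimate performance on data the tuning process never saw.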

## Part 5: Implement Your Project Plan

<b>Task:</b> In the code cell below, import additional packages that you have used in this course that you will need to implement your project plan.

In [47]:
# YOUR CODE HERE
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error

<b>Task:</b> Use the rest of this notebook to carry out your project plan. 

You will:

1. Prepare your data for your model.
2. Fit your model to the training data and evaluate your model.
3. Improve your model's performance by performing model selection and/or feature selection techniques to find best model for your problem.

Add code cells below and populate the notebook with commentary, code, analyses, results, and figures as you see fit. 

In [48]:
# YOUR CODE HERE
X = df.drop(columns = 'review_scores_rating') # features: every column except the label
y = df['review_scores_rating']

In [49]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30) #splitting the data for training and testing

In [50]:
param_grid = {'max_depth': [2, 3, 5, 8, 10], 'n_estimators': [25, 50, 100, 150, 200]}

In [51]:
dt_reg = GradientBoostingRegressor()

dt_grid = GridSearchCV(dt_reg, param_grid, cv = 5, scoring = 'neg_root_mean_squared_error') # 5-fold cross-validated grid search over param_grid, scored by negative RMSE

dt_grid_search = dt_grid.fit(X_train, y_train)

In [52]:
rmse = -1 * dt_grid_search.best_score_ 
rmse # mean cross-validated RMSE for the best hyperparameters; lower is better

0.22255421370680964

In [53]:
best = dt_grid_search.best_params_ # finding the best numbers for the hyperparameters
best

{'max_depth': 2, 'n_estimators': 100}

In [54]:
model = GradientBoostingRegressor(max_depth = 2,  n_estimators = 100)
model.fit(X_train, y_train)

In [55]:
pred = model.predict(X_test)

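# Take the square root of the MSE directly; the `squared` argument is deprecated in newer scikit-learn versions
mod_rmse = np.sqrt(mean_squared_error(y_test, pred))
mod_rmse # test-set RMSE; lower is better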



0.22951695508103104

In [56]:
mod_r2 = r2_score(y_test, pred)
mod_r2 # share of the label's variance explained on the test set; closer to 1 is better

0.7960916269238213
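To make the two metrics concrete, the sketch below computes RMSE and R^2 directly from their definitions on made-up numbers (the ratings and predictions are hypothetical). RMSE is the typical size of a prediction error in the label's own units, and R^2 compares the model's squared error with the error of always predicting the mean label.

```python
import numpy as np

y_true = np.array([4.0, 5.0, 3.0, 4.0])  # hypothetical ratings
y_pred = np.array([4.5, 4.5, 3.5, 3.5])  # hypothetical predictions

residuals = y_true - y_pred
rmse = np.sqrt(np.mean(residuals ** 2))         # typical error size, in rating units

ss_res = np.sum(residuals ** 2)                 # squared error of the model
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # squared error of predicting the mean
r2 = 1 - ss_res / ss_tot

print(rmse, r2)
```

Read this way, the results above say that the model's test predictions are off by about 0.23 rating points on average in the RMSE sense, and that it explains roughly 80% of the variance in the test ratings.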