# ABSTRACT
This project analyzes Airbnb listings in New York City to better understand how different
attributes such as bedrooms, location and house type, among others, can be used to accurately predict
a listing price that is optimal for the host's profitability yet affordable to guests.

*This model is intended to be helpful to the internal pricing tools that Airbnb provides to its hosts.*

### Objectives of the project:
- Estimate listing price based on provided amenities
- How review scores affect the price of a listing
- How cancellation policy affects the price of a listing


### Dataset
Collected from the Inside Airbnb website for New York City (NYC), January to March 2020:
http://insideairbnb.com/get-the-data.html

## Data Dictionary

- id - listing identifier that can be used to join with other files
- last_scraped - date the data was scraped
- name - name of the listing
- host_id - unique id given to the host
- host_since - joining date of the host; can be used to calculate host experience based on the duration since the first listing
- host_is_superhost - categorical, t or f - describes highly rated and reliable hosts (https://www.airbnb.co.uk/superhost)
- host_identity_verified - categorical, t or f - another credibility metric
- host_response_rate - percentage of new enquiries and reservation requests the host responded to (by either accepting/pre-approving or declining) within 24 hours in the past 30 days
- host_listings_count - total number of listings the host has
- neighbourhood_group_cleansed - borough of NYC
- neighbourhood_cleansed - neighbourhood within a borough
- latitude - we will use it later to visualise the data on the map
- longitude - we will use it later to visualise the data on the map
- property_type - description of the property, e.g. apartment, private home
- room_type - type of room, e.g. shared room
- accommodates - discrete value: number of people the property can accommodate
- bathrooms - discrete value describing the property
- bedrooms - discrete value describing the property
- beds - discrete value describing the property
- bed_type - categorical value: type of bed, e.g. real bed or couch
- amenities - Wifi, TV, dryer and so on
- price - price per night for the number of included guests
- security_deposit - continuous value associated with the cost
- cleaning_fee - additional cost on top of the rent
- guests_included - number of guests covered by the price
- extra_people - cost per additional person per night
- minimum_nights - discrete value that is cost related; listings with a high minimum_nights value are likely sublets
- maximum_nights - property availability
- availability_365 - availability of the listing over the 365 days following the scrape date
- first_review - first review date
- last_review - last review date
- number_of_reviews - total number of reviews in the entire listing history
- review_scores_accuracy - discrete value - numbers between 2 and 10
- review_scores_value - discrete value - numbers between 2 and 10
- review_scores_rating - calculated as a weighted sum of the other scores
- reviews_per_month - number of reviews received per month
- instant_bookable - categorical, t or f
- cancellation_policy - ordinal value with 5 categories that can be ordered from lowest to highest level of flexibility

In [None]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
import warnings
warnings.filterwarnings("ignore")

In [None]:
#import pyforest
import statistics
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import seaborn as sns
import random

from mpl_toolkits.mplot3d import Axes3D
import plotly.graph_objs as go

# import statistical libraries
from scipy.stats import norm,skew, boxcox_normmax

In [None]:
abm1 =  pd.read_csv('../input/airbnb-new-york-city-with-106-features/airbnbmark1.csv')
abm1.head(3)
print('abm1.shape',abm1.shape)
print('abm1.size',abm1.size)

# <center>Steps Followed</center>
    
##   I. Data Processing
##  II. EDA
## III. Feature Engineering
##  IV. Feature Selection
##   V. Model Building  

 
 --------------------------------------------------------------------------------------------------------------------------  
# I. Data Processing 
 **We will follow these steps in the Data Processing part:**

- 1. Data cleaning for special characters, spaces and NaN, and checking data types
- 2. Extracting new features from existing features
- 3. Imputing missing values
- 4. Checking numerical and categorical values

## Initial Dropping of Unnecessary columns

In [None]:
# Dropping columns that are irrelevant to our analysis

# The columns are dropped from abm1 in place (drop(..., inplace=True) returns None,
# so its result is not assigned to a new variable)

abm1.drop(columns = ['id','name',
                'summary','access','interaction',
                'listing_url','scrape_id','last_scraped',
                'space','description','experiences_offered',
                'neighborhood_overview','notes','transit',
                'house_rules','thumbnail_url','medium_url',
                'picture_url','xl_picture_url','host_url',
                'host_name','host_location','host_about',
                'host_acceptance_rate','host_thumbnail_url','host_picture_url',
                'host_neighbourhood','host_verifications','host_has_profile_pic',
                'market','city','smart_location','country_code','is_location_exact',
                'square_feet','minimum_minimum_nights','maximum_minimum_nights',
                'minimum_maximum_nights','maximum_maximum_nights','minimum_nights_avg_ntm',
                'maximum_nights_avg_ntm','calendar_updated','zipcode',
                'neighbourhood','state',
                'street','host_listings_count',#'neighbourhood',
                'country','availability_30','availability_60','availability_90','host_id',
                'calendar_last_scraped','weekly_price','monthly_price',
                'review_scores_cleanliness','review_scores_checkin','review_scores_communication',
                'review_scores_location','review_scores_value','license',
                'jurisdiction_names','reviews_per_month','number_of_reviews','requires_license',
                'is_business_travel_ready','require_guest_profile_picture','require_guest_phone_verification',
                'calculated_host_listings_count','calculated_host_listings_count_entire_homes',
                'calculated_host_listings_count_private_rooms',
                'calculated_host_listings_count_shared_rooms','has_availability'],axis=1,inplace=True)

In [None]:
abm1.shape # new dataset after removing features

In [None]:
abm1 = abm1.drop_duplicates()
print('abm1.shape after dropping duplicate rows: ',abm1.shape)
print('abm1.size:  ',abm1.size)
print('DataTypes wise size: \n', abm1.dtypes.value_counts())
abm1.head(2)

**Note1: We are left with 38 features after dropping the initial columns (repetitive counts, columns irrelevant to the analysis, long text data and URLs) and 122,818 rows after dropping duplicates**

## 1. Data Cleaning

In [None]:
abm1.replace(('11249\n11249'),11249,inplace=True)
abm1.replace((' '),np.nan,inplace=True)
abm1.host_response_rate = abm1.host_response_rate.str[:-1].astype('float64')

In [None]:
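def clean_data(df):
    # Strip the currency symbol and thousands separators, then convert to float.
    # regex=False makes '$' a literal character instead of a regex anchor.
    for i in ['price','cleaning_fee','security_deposit', 'extra_people']:
        df[i]=df[i].str.replace('$','',regex=False).str.replace(',', '',regex=False).astype(float)

    # Treat empty strings as missing values
    df.replace('', np.nan, inplace=True)

    return df.head(2)
clean_data(abm1)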
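
The currency clean-up can be checked on a small frame before running it on the full dataset. The sketch below uses made-up values in the same `$1,250.00` format; the frame `toy` is hypothetical and only illustrates the string handling.

```python
import pandas as pd

# Hypothetical values formatted like the raw price columns.
toy = pd.DataFrame({"price": ["$1,250.00", "$80.00"]})

# regex=False treats "$" as a literal character rather than the
# end-of-string anchor it would be in a regular expression.
toy["price"] = (
    toy["price"]
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
    .astype(float)
)
print(toy["price"].tolist())
```

Passing `regex=False` explicitly keeps the behaviour the same across pandas versions, whose default for this argument has changed over time.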

In [None]:
# Converting Price our TARGET VAR into FLOAT
abm1['price']=abm1['price'].astype(float)

In [None]:
# Replacing f/t values with 0/1 across the dataframe
abm1.replace({'f': 0, 't': 1}, inplace=True) 

# Affected columns: host_is_superhost, instant_bookable, host_identity_verified

In [None]:
# converting Host_since dtype to datetime and creating new column host_days_active_years
from datetime import datetime

abm1.host_since = pd.to_datetime(abm1.host_since)
abm1.first_review = pd.to_datetime(abm1.first_review)
abm1.last_review = pd.to_datetime(abm1.last_review)

# Calculating the number of whole years (365.25 days per year);
# astype('timedelta64[Y]') is rejected by recent pandas versions
abm1['host_days_active_years'] = np.floor((datetime(2020, 4, 1) - abm1.host_since).dt.days / 365.25)
abm1['host_listing_since'] = np.floor((abm1.last_review - abm1.first_review).dt.days / 365.25)
#abm1['host_days_active_days'] = (datetime(2020, 4, 1) - abm1.host_since).astype('timedelta64[D]')
# Printing mean and median
#print("Mean of host_years:", round(abm1['host_days_active_years'].mean(),0))
#print("Median of host_years:", abm1['host_days_active_years'].median())
#print("Mode of host_years:", abm1['host_days_active_years'].mode())

#print('-------------')

#print("Mean of host_days:", round(abm1['host_days_active_days'].mean(),0))
#print("Median of host_days:", abm1['host_days_active_days'].median())
#print("Mode of host_days:", abm1['host_days_active_days'].mode())


# print('\nValueCounts:\n',df['host_days_active_days'].value_counts(normalize=False))


In [None]:
# splitting amenities feature 

amenities_list = list(abm1.amenities)
amenities_list_string = " ".join(amenities_list)
amenities_list_string = amenities_list_string.replace('{', '')
amenities_list_string = amenities_list_string.replace('}', ',')
amenities_list_string = amenities_list_string.replace('"', '')
amenities_set = [x.strip() for x in amenities_list_string.split(',')]
amenities_set = set(amenities_set)
print('\n Number of amenities present in total:',len(amenities_set))

abm1.loc[abm1['amenities'].str.contains('Air conditioning|Central air conditioning'), 'air_conditioning'] = 1
abm1.loc[abm1['amenities'].str.contains('Amazon Echo|Apple TV|Game console|Netflix|Projector and screen|Smart TV'), 'high_end_electronics'] = 1
abm1.loc[abm1['amenities'].str.contains('BBQ grill|Fire pit|Propane barbeque'), 'bbq'] = 1
abm1.loc[abm1['amenities'].str.contains('Balcony|Patio'), 'balcony'] = 1
abm1.loc[abm1['amenities'].str.contains('Beach view|Beachfront|Lake access|Mountain view|Ski-in/Ski-out|Waterfront'), 'nature_and_views'] = 1
abm1.loc[abm1['amenities'].str.contains('Bed linens'), 'bed_linen'] = 1
abm1.loc[abm1['amenities'].str.contains('Breakfast'), 'breakfast'] = 1
abm1.loc[abm1['amenities'].str.contains('TV'), 'tv'] = 1
abm1.loc[abm1['amenities'].str.contains('Coffee maker|Espresso machine'), 'coffee_machine'] = 1
abm1.loc[abm1['amenities'].str.contains('Cooking basics'), 'cooking_basics'] = 1
abm1.loc[abm1['amenities'].str.contains('Dishwasher|Dryer|Washer'), 'white_goods'] = 1
abm1.loc[abm1['amenities'].str.contains('Elevator'), 'elevator'] = 1
abm1.loc[abm1['amenities'].str.contains('Exercise equipment|Gym|gym'), 'gym'] = 1
abm1.loc[abm1['amenities'].str.contains('Family/kid friendly|Children|children'), 'child_friendly'] = 1
abm1.loc[abm1['amenities'].str.contains('parking'), 'parking'] = 1
abm1.loc[abm1['amenities'].str.contains('Garden|Outdoor|Sun loungers|Terrace'), 'outdoor_space'] = 1
abm1.loc[abm1['amenities'].str.contains('Host greets you'), 'host_greeting'] = 1
abm1.loc[abm1['amenities'].str.contains('Hot tub|Jetted tub|hot tub|Sauna|Pool|pool'), 'hot_tub_sauna_or_pool'] = 1
abm1.loc[abm1['amenities'].str.contains('Internet|Pocket wifi|Wifi'), 'internet'] = 1
abm1.loc[abm1['amenities'].str.contains('Long term stays allowed'), 'long_term_stays'] = 1
abm1.loc[abm1['amenities'].str.contains(r'Pets|pet|Cat\(s\)|Dog\(s\)'), 'pets_allowed'] = 1
abm1.loc[abm1['amenities'].str.contains('Private entrance'), 'private_entrance'] = 1
abm1.loc[abm1['amenities'].str.contains('Safe|Security system'), 'secure'] = 1
abm1.loc[abm1['amenities'].str.contains('Self check-in'), 'self_check_in'] = 1
abm1.loc[abm1['amenities'].str.contains('Smoking allowed'), 'smoking_allowed'] = 1
abm1.loc[abm1['amenities'].str.contains('Step-free access|Wheelchair|Accessible'), 'accessible'] = 1
abm1.loc[abm1['amenities'].str.contains('Suitable for events'), 'event_suitable'] = 1
abm1.loc[abm1['amenities'].str.contains('24-hour check-in'), 'check_in_24h'] = 1

In [None]:
print('Amenities Column Names:\n',abm1.columns[35:],'\n')
print(' Number of Amenities columns after categorizing under same names:',abm1.columns[35:].shape)

**Note 2:
Above are the columns obtained after grouping amenities with similar names: we are left with 28 features from the 148 entries in the amenities set**
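
As an illustration of the grouping step, the sketch below parses two made-up amenity strings in the curly-brace format and builds one grouped flag. The helper `parse_amenities` and the sample strings are hypothetical, and the parser assumes no amenity name contains a comma.

```python
import re
import pandas as pd

# Hypothetical amenity strings in the curly-brace format of the raw column.
amenities = pd.Series([
    '{TV,"Cable TV",Wifi,"Dog(s)"}',
    '{"Air conditioning",Kitchen}',
])

def parse_amenities(raw: str) -> set:
    """Split one raw amenities string into a set of clean amenity names."""
    items = raw.strip("{}").split(",")
    return {item.strip().strip('"') for item in items if item.strip()}

parsed = amenities.apply(parse_amenities)

# Group several raw amenities under one flag; re.escape keeps "(" and ")"
# literal instead of letting the regex engine treat them as a group.
pet_names = ["Dog(s)", "Cat(s)", "Pets allowed"]
pet_pattern = "|".join(re.escape(name) for name in pet_names)
pets_allowed = amenities.str.contains(pet_pattern).astype(int)
print(parsed.tolist(), pets_allowed.tolist())
```

Building the pattern from a list of names makes it easier to see which raw amenities feed each grouped column.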

In [None]:
frequent_amenities = []
infrequent_amenities=[]
for col in abm1.iloc[:,35:].columns:
    if abm1[col].sum() > len(abm1)/5:
        frequent_amenities.append(col)
    else:
        infrequent_amenities.append(col)
print('Common_amenities: \n',frequent_amenities)
print('-----------------------')
print('Special_amenities: \n',infrequent_amenities)
print('frequent_amenities',len(frequent_amenities))
print('infrequent_amenities',len(infrequent_amenities))

In [None]:
# Decreasing the number of categories in cancellation_policy
abm1['cancellation_policy'] = abm1['cancellation_policy'].replace({
    'super_strict_30': 'strict',
    'super_strict_60': 'strict',
    'strict_14_with_grace_period': 'strict'})
abm1.cancellation_policy.value_counts()

In [None]:
# Decreasing the number of categories in property_type
abm1['property_type'].value_counts()

abm1['property_type'].value_counts()/abm1['property_type'].value_counts().sum()*100

# Percentage of listings covered by the five most frequent property types

(abm1['property_type'].value_counts()/abm1['property_type'].value_counts().sum()*100)[0:5].sum()

In [None]:
Mod_prop_type=abm1['property_type'].value_counts()[5:len(abm1['property_type'].value_counts())].index.tolist()

def change_prop_type(label):
    if label in Mod_prop_type:
        label='Other'
    return label

In [None]:
abm1.loc[:,'property_type'] = abm1.loc[:,'property_type'].apply(change_prop_type)

In [None]:
abm1['property_type'].value_counts()
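
The same "keep the most frequent labels, merge the rest" idea can be sketched on a small series. The data and the cutoff `TOP_N = 2` are hypothetical; the notebook itself keeps the five most frequent property types.

```python
import pandas as pd

# Hypothetical property types for twelve listings.
property_type = pd.Series(
    ["Apartment"] * 5 + ["House"] * 3 + ["Loft"] * 2 + ["Boat", "Tent"]
)

TOP_N = 2  # hypothetical cutoff for this toy example
top_types = property_type.value_counts().index[:TOP_N]

# Series.where keeps values where the condition holds and replaces the rest.
grouped = property_type.where(property_type.isin(top_types), "Other")

# Share of listings (in percent) covered by the retained labels.
coverage = property_type.isin(top_types).mean() * 100
print(grouped.value_counts(), round(coverage, 1))
```

Checking `coverage` before choosing the cutoff shows how much of the data ends up in the `Other` bucket.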

**Note3: We are going to create a new column that sums up the total number of amenities offered by each listing**
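
A minimal sketch of this step on toy data, assuming 0/1 amenity flags and the same one-fifth frequency threshold used to separate frequent from infrequent amenities. The frame `flags` and its values are hypothetical.

```python
import pandas as pd

# Hypothetical 0/1 amenity flags for five listings.
flags = pd.DataFrame({
    "internet": [1, 1, 1, 1, 0],
    "tv":       [1, 0, 1, 0, 0],
    "bbq":      [0, 0, 0, 0, 1],
})

# An amenity is "frequent" when more than one fifth of listings offer it.
share = flags.mean()
frequent = share[share > 0.2].index.tolist()
infrequent = share[share <= 0.2].index.tolist()

# Per-listing counts of common and special amenities.
flags["common_amenities"] = flags[frequent].sum(axis=1)
flags["special_amenities"] = flags[infrequent].sum(axis=1)
print(flags[["common_amenities", "special_amenities"]])
```

The resulting columns are counts, not 0/1 indicators, so a listing with several special amenities scores higher than one with a single special amenity.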

In [None]:
abm1['special_amenities']=abm1[['high_end_electronics','bbq','balcony','nature_and_views','breakfast','gym',
 'outdoor_space',
 'host_greeting',
 'hot_tub_sauna_or_pool',
 'pets_allowed',
 'secure',
 'smoking_allowed',
#  'PH_Accessible',
 'event_suitable',
 'check_in_24h',
#  'Private bathroom',
#  'Baby protection'
        ]].sum(axis=1)
abm1['special_amenities'].isnull().sum()
abm1.columns
abm1['special_amenities'].astype(float)
#abm1['special_amenities']=abm1['special_amenities'].mask(abm1['special_amenities']>0,1)
abm1['special_amenities']

In [None]:
abm1.isnull().sum()

In [None]:
## Summing the common amenity flags into one count per listing (the commented-out mask below would reduce it to 1/0)
abm1['common_amenities']=abm1[['bed_linen',
 'tv',
 'coffee_machine',
 'cooking_basics',
 'white_goods',
 'elevator',
 'child_friendly',
 'parking',
 'internet',
 'long_term_stays',
 'private_entrance',
 'self_check_in',
#  'Toiletries',
#  'Safety'
                               ]].sum(axis=1)
abm1['common_amenities'].isnull().sum()
abm1.columns
abm1['common_amenities'].astype(float)
# abm1['common_amenities']=abm1['common_amenities'].mask(abm1['common_amenities']>0,1)
abm1['common_amenities']

In [None]:
abm1.columns[35:]

In [None]:
#dropping the actual columns
abm1.drop(['air_conditioning', 'high_end_electronics', 'bbq', 'balcony',
       'nature_and_views', 'bed_linen', 'breakfast', 'tv', 'coffee_machine',
       'cooking_basics', 'white_goods', 'elevator', 'gym', 'child_friendly',
       'parking', 'outdoor_space', 'host_greeting', 'hot_tub_sauna_or_pool',
       'internet', 'long_term_stays', 'pets_allowed', 'private_entrance',
       'secure', 'self_check_in', 'smoking_allowed', 'accessible',
       'event_suitable', 'check_in_24h','first_review',
        'last_review','host_since','amenities'],axis=1,inplace = True)

In [None]:
new_col = pd.DataFrame(columns=['avg_price_property_type'])
new_col['avg_price_property_type'] = abm1.groupby(['neighbourhood_cleansed','property_type'])['price'].mean()

### <font color=blue>The dataframe is named "abm2" after adding the new column</font>

In [None]:
abm2 = abm1.merge(new_col,left_on=['neighbourhood_cleansed','property_type'],right_on=['neighbourhood_cleansed','property_type'],how='left')
print(abm2.shape)
print(abm2.size)
abm2.head(2)

### <font color=blue>The dataframe is named "abm3" after adding the new column</font>

In [None]:
new_col1 = pd.DataFrame(columns=['avg_review_score'])
new_col1['avg_review_score'] = abm2.groupby(['neighbourhood_cleansed','property_type'])['review_scores_rating'].mean()

In [None]:
abm3 = abm2.merge(new_col1,left_on=['neighbourhood_cleansed','property_type'],right_on=['neighbourhood_cleansed','property_type'],how='left')
print(abm3.shape)
print(abm3.size)
abm3.head(2)
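
A group average can also be attached to every row without a separate merge by using `groupby(...).transform`. The sketch below shows this alternative on a hypothetical four-row frame; it is not the notebook's own method, which builds the averages separately and merges them back.

```python
import pandas as pd

# Hypothetical listings.
toy = pd.DataFrame({
    "neighbourhood_cleansed": ["Harlem", "Harlem", "Harlem", "Astoria"],
    "property_type": ["Apartment", "Apartment", "House", "Apartment"],
    "price": [100.0, 200.0, 300.0, 80.0],
})

keys = ["neighbourhood_cleansed", "property_type"]

# transform("mean") returns one value per original row, aligned on the index.
toy["avg_price_property_type"] = toy.groupby(keys)["price"].transform("mean")
print(toy)
```

Either way, each group average includes the listing's own price, which is worth keeping in mind when the feature is later used to predict that same price.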

In [None]:
# neighbourhood_group_cleansed is renamed as Borough
#abm3['Borough'] = abm3['neighbourhood_group_cleansed']

# guests_included is renamed as Num_of_guests_incl_forprice
#abm3['Num_of_guests_incl_forprice'] = abm3['guests_included']

# extra_people is renamed as price_per_extra_people
#abm3['price_per_extra_people'] = abm3['extra_people']

abm3=abm3.rename(columns={"neighbourhood_group_cleansed": "Borough", "guests_included": "Num_of_guests_incl_forprice"
                    ,'extra_people':'price_per_extra_people'})

In [None]:
abm3.columns

## IMPUTING MISSING NAN VALUES

In [None]:
abm3.isnull().sum()

In [None]:
abm3.bedrooms.value_counts()

In [None]:
from sklearn.impute import KNNImputer

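# Note: with a single input column KNNImputer has no other feature to measure
# distances on, so missing values are effectively filled with the column mean.
imputer = KNNImputer(missing_values=np.nan,n_neighbors=2, weights="uniform")

abm3['host_total_listings_count'] = imputer.fit_transform(abm3[['host_total_listings_count']])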
#abm1['host_days_active_years'] = imputer.fit_transform(abm1[['host_days_active_years']])
#abm1['host_days_active_days'] = imputer.fit_transform(abm1[['host_days_active_days']])

In [None]:
#Group by neighbourhood_cleansed and property_type and fill in missing values with the group mode (group mean for review_scores_rating).

# abm3["security_deposit"] = abm3.groupby(['neighbourhood_cleansed','property_type'])["security_deposit"].transform(
#     lambda x: x.fillna(x.median()))

# abm3["cleaning_fee"] = abm3.groupby(['neighbourhood_cleansed','property_type'])["cleaning_fee"].transform(
#     lambda x: x.fillna(x.median()))

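# mode() returns a Series, so its first value is used; groups with no observed value are left unchanged
abm3["beds"] = abm3.groupby(['neighbourhood_cleansed','property_type'])["beds"].transform(
    lambda x: x.fillna(x.mode().iloc[0]) if x.notna().any() else x)

abm3["bathrooms"] = abm3.groupby(['neighbourhood_cleansed','property_type'])["bathrooms"].transform(
    lambda x: x.fillna(x.mode().iloc[0]) if x.notna().any() else x)

abm3["bedrooms"] = abm3.groupby(['neighbourhood_cleansed','property_type'])["bedrooms"].transform(
    lambda x: x.fillna(x.mode().iloc[0]) if x.notna().any() else x)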

abm3["review_scores_rating"] = abm3.groupby(['neighbourhood_cleansed','property_type'])["review_scores_rating"].transform(
    lambda x: x.fillna(x.mean()))

# abm3["review_scores_accuracy"] = abm3.groupby(['neighbourhood_cleansed','property_type'])["review_scores_accuracy"].transform(
#     lambda x: x.fillna(x.mean()))


In [None]:
#Group by neighbourhood_cleansed and fill in missing values with the group mode.
# mode() returns a Series, so its first value is used; groups with no observed value are left unchanged
abm3["host_listing_since"] = abm3.groupby(['neighbourhood_cleansed'])["host_listing_since"].transform(
    lambda x: x.fillna(x.mode().iloc[0]) if x.notna().any() else x)


abm3["host_is_superhost"] = abm3.groupby(['neighbourhood_cleansed'])["host_is_superhost"].transform(
    lambda x: x.fillna(x.mode().iloc[0]) if x.notna().any() else x)


abm3["host_identity_verified"] = abm3.groupby(['neighbourhood_cleansed'])["host_identity_verified"].transform(
    lambda x: x.fillna(x.mode().iloc[0]) if x.notna().any() else x)
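
`Series.mode()` returns a Series, and `fillna` aligns a Series argument on index labels, so a single fill value has to be picked from it explicitly. The sketch below shows a group-wise mode fill on toy data; the helper name `fill_with_group_mode` is hypothetical.

```python
import numpy as np
import pandas as pd

def fill_with_group_mode(series: pd.Series) -> pd.Series:
    """Fill NaN with the most frequent value; leave all-NaN groups untouched."""
    modes = series.mode()  # mode() drops NaN and may return several values
    if modes.empty:
        return series
    return series.fillna(modes.iloc[0])

# Hypothetical listings: group "B" has no observed value at all.
toy = pd.DataFrame({
    "neighbourhood_cleansed": ["A", "A", "A", "B", "B"],
    "beds": [1.0, 1.0, np.nan, np.nan, np.nan],
})
toy["beds"] = (
    toy.groupby("neighbourhood_cleansed")["beds"].transform(fill_with_group_mode)
)
print(toy)
```

Groups without any observed value stay missing here, so a global fallback (such as the median fill applied later) is still needed.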

In [None]:
features_nan_remove=['host_response_time']
for i in features_nan_remove:
    abm3[i]=abm3[i].fillna('unknown').astype(str)
    print('{}:{}'.format(i,abm3[i].isna().sum()))

In [None]:
features_nan_remove=['review_scores_rating','host_response_rate','review_scores_accuracy','cleaning_fee','bedrooms','security_deposit','host_identity_verified','beds','host_days_active_years','review_scores_rating','host_is_superhost','host_listing_since','avg_review_score']
for i in features_nan_remove:
    abm3[i]=abm3[i].astype('str').str.replace("nan", "100000000").astype('float')
    print('{}:{}'.format(i,abm3[i].isna().sum()))

In [None]:
features_nan_remove=['review_scores_rating','host_response_rate','review_scores_accuracy','cleaning_fee','bedrooms','security_deposit','host_identity_verified','beds','host_days_active_years','review_scores_rating','host_is_superhost','host_listing_since','avg_review_score']
for i in features_nan_remove:
    # compute the median on observed values only, so the 100000000 placeholder does not distort it
    observed = abm3[i].replace(100000000, np.nan)
    abm3[i]=observed.fillna(observed.median())
    print('{}:{}'.format(i,abm3[i].isna().sum()))
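
A placeholder value distorts the median if it is still in the column when the median is computed. The sketch below illustrates the effect on a hypothetical `cleaning_fee` series in which more than half of the values are missing.

```python
import numpy as np
import pandas as pd

# Hypothetical fees: three observed values and four missing ones.
cleaning_fee = pd.Series([20.0, 40.0, 60.0, np.nan, np.nan, np.nan, np.nan])

# Median with a large placeholder standing in for NaN: the placeholder wins.
SENTINEL = 100000000
median_with_sentinel = cleaning_fee.fillna(SENTINEL).median()

# Series.median() skips NaN by default, so no placeholder is needed.
median_observed = cleaning_fee.median()
filled = cleaning_fee.fillna(median_observed)
print(median_with_sentinel, median_observed, filled.tolist())
```

When fewer than half of the values are missing the distortion is smaller but still present, because the placeholders shift the middle of the sorted values upward.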

In [None]:
abm3.isnull().sum()

In [None]:
abm3.dtypes

In [None]:
# separating categorical and numerical dtypes
categorical_types=abm3.select_dtypes(include=['object']).columns
print('categorical_types: \n',categorical_types)

print('-------------')

# select_dtypes is the public equivalent of the private _get_numeric_data()
numerical_types=abm3.select_dtypes(include=[np.number]).columns
print('numerical_types: \n',numerical_types)

##  II. EDA
- Exploring the Data.
- Getting Business Insights from the data
- Cardinality of Categorical features(PLOTS)

In [None]:
# breaking the numerical variables down further
# into CONTINUOUS AND DISCRETE VARIABLES

Discrete_features=[i for i in numerical_types if len(abm3[i].unique())<25]
print(Discrete_features)
print(len(Discrete_features))

In [None]:
#relationship between discrete var and price
for i in Discrete_features:
    abm3.groupby(i)['price'].mean().plot.bar()
    plt.xlabel(i)
    plt.ylabel('price')
    plt.title(i)
    plt.show()

Points to be taken from above:
- Being a superhost or not does not appear to affect price.
- Price increases roughly linearly with `accommodates`.
- The special amenities count does not affect price much until it exceeds 7.
- Properties with common amenities have similar prices regardless of neighbourhood.
- In these plots review scores do not appear to affect price, so we can drop review_scores_accuracy.

In [None]:
#lets compare the difference between years and price
plt.scatter(abm3.host_days_active_years,abm3.price)
plt.title('Host_active_years Vs Price')
plt.show()

**Note6: We can see that the price decreases as the number of years a host has been registered and active increases**

In [None]:
abm3.dtypes

In [None]:
## Continuous var
Continous_features=[i for i in numerical_types if i not in Discrete_features]
Continous_features[1:] # skips the first continuous feature (host_id itself was already dropped above)

In [None]:
#plotting continuous var
for i in Continous_features[1:]:
    data=abm3.copy()
    if 0 in data[i].unique():
        pass
    elif i in ['zipcode','latitude','longitude']:
        pass
    else:
#         data[i]=np.log(data[i])
        data[i].hist(bins=25)
        plt.xlabel(i)
        plt.ylabel('count')
        plt.title(i)
        plt.show()

**Note: From the above plots we observe that the data is skewed (either positively, i.e. right-skewed, or negatively) and not normally distributed.**
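
Skewness can also be quantified rather than read off a histogram. The sketch below, on hypothetical prices, compares the sample skewness before and after a `log1p` transform, one common way to reduce right skew; the notebook's own transformation choice may differ.

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed prices with two expensive outliers.
price = pd.Series([50, 60, 70, 80, 90, 100, 120, 150, 400, 2000], dtype=float)

# Series.skew() is positive for right-skewed data.
skew_raw = price.skew()

# log1p(x) = log(1 + x) compresses large values and is defined at zero.
skew_log = np.log1p(price).skew()
print(round(skew_raw, 2), round(skew_log, 2))
```

A transformed feature is still skewed here, only less so, which is why checking the number after the transform is useful.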

#### - Avg age of listings group by borough

In [None]:
print(abm3.groupby(['Borough'])['host_days_active_years'].mean().sort_values())
plt.figure(figsize=(20,8))
sns.countplot(x ='host_days_active_years',hue = "Borough",data = abm3)
plt.title("Hosting since across boroughs")
plt.show()

Brooklyn and Manhattan have the oldest listings by nearly 4 years.

### - Price distribution of room types across neighbourhood groups

- Hotel rooms are clearly costlier than entire homes in Manhattan and Brooklyn.
- Overall, entire homes/apartments are priced higher than private rooms.
- Staten Island and the Bronx do not have any hotel rooms.

In [None]:
print(abm3.groupby(['room_type','Borough'])['price'].mean().sort_values())
sns.set(rc={'figure.figsize': (16, 5)})
ax = sns.barplot(x = 'Borough', y = 'price', hue = 'room_type', data =abm3, 
                 palette ='plasma_r', ci = False)

### Most-reviewed property and room types by number of reviews (last twelve months)

- Across the entire city, Apartment listings are the most numerous.
- Except for Brooklyn, every borough has more Apartments than any other property type.

In [None]:
print(abm3.groupby(['property_type'])['number_of_reviews_ltm'].mean())
sns.set(rc={'figure.figsize': (16, 5)})
ax = sns.barplot(x = 'Borough', y = 'number_of_reviews_ltm', hue = 'property_type', data =abm3, 
                 palette ='plasma_r', ci = False)

In [None]:
print(abm3.groupby(['room_type'])['number_of_reviews_ltm'].mean())
sns.set(rc={'figure.figsize': (16, 5)})
ax = sns.barplot(x = 'Borough', y = 'number_of_reviews_ltm', hue = 'room_type', data =abm3, 
                 palette ='plasma_r', ci = False)

In [None]:
# from mpl_toolkits.mplot3d import Axes3D
# from plotly.subplots import make_subplots
# import plotly.graph_objs as go
# from mpl_toolkits.mplot3d import Axes3D
# from plotly import tools

# fig = px.scatter_mapbox(abm3, 
#                         hover_data = ['price','minimum_nights','room_type'],
#                         hover_name = 'neighbourhood_cleansed',
#                         lat="latitude", 
#                         lon="longitude", 
#                         color="Borough", 
#                         size="price", 
#                         size_max=30, 
#                         opacity = .70,
#                         zoom=10,
#                        )
# fig.layout.mapbox.style = 'stamen-terrain'
# fig.update_layout(title_text = 'Airbnb by Borough in NYC<br>(Click legend to toggle borough)', height = 800)

In [None]:
# fig = px.scatter_mapbox(abm3,
#                         hover_data=['price','property_type','room_type','number_of_reviews_ltm'], 
#                         lat="latitude", 
#                         lon="longitude", 
#                         color="neighbourhood_cleansed", 
#                         size_max=30, 
#                         opacity = .70,
#                         zoom=12,
#                        )
# fig.layout.mapbox.style = 'carto-positron'
# fig.update_layout(title_text = 'NYC Airbnb by Neighbourhood<br>(Click legend to toggle neighbourhood)', height = 800)

In [None]:
# temp_bk = abm3[abm3.Borough == 'Brooklyn']
# temp_qn = abm3[abm3.Borough == 'Queens']
# temp_mn = abm3[abm3.Borough == 'Manhattan']
# temp_bx = abm3[abm3.Borough == 'Bronx']
# temp_si = abm3[abm3.Borough == 'Staten Island']

# labels = abm3.room_type.value_counts().index.to_list()

# fig = make_subplots(1, 5, specs=[[{'type':'domain'}, {'type':'domain'}, {'type':'domain'},{'type':'domain'},{'type':'domain'}]],
#                     subplot_titles=['Manhattan', 'Brooklyn', 'Queens','Bronx','Staten Island'])
# fig1= fig.add_trace(go.Pie(labels=labels, values=temp_mn.room_type.value_counts().reset_index().sort_values(by = 'index').room_type.tolist(), scalegroup='one',
#                      name="Manhattan"),1,1)
# fig2= fig.add_trace(go.Pie(labels=labels, values=temp_bk.room_type.value_counts().reset_index().sort_values(by = 'index').room_type.tolist(), scalegroup='one',
#                      name="Brooklyn"),1,2)
# fig3= fig.add_trace(go.Pie(labels=labels, values=temp_qn.room_type.value_counts().reset_index().sort_values(by = 'index').room_type.tolist(), scalegroup='one',
#                      name="Queens"),1,3)
# fig4= fig.add_trace(go.Pie(labels=labels, values=temp_bx.room_type.value_counts().reset_index().sort_values(by = 'index').room_type.tolist(), scalegroup='one',
#                      name="Bronx"),1,4)
# fig5= fig.add_trace(go.Pie(labels=labels, values=temp_si.room_type.value_counts().reset_index().sort_values(by = 'index').room_type.tolist(), scalegroup='one',
#                      name="Staten Island"),1,5)

# fig.update_layout(title_text='room types in Boroughs')

### - Cardinality of categorical features

In [None]:
for i in categorical_types:
    print('feature-{} & number of categories-{}'.format(i,len(abm3[i].unique())))

In [None]:
#relationship b/w categorical and dependent var
# abm3 is used directly instead of the leftover `data` copy from an earlier loop
for i in categorical_types:
    if len(abm3[i].unique())>40:
        pass
    elif len(abm3[i].unique())==1:
        pass
    else:
        abm3.groupby(i)['price'].mean().plot.bar()
        plt.xlabel(i)
        plt.ylabel('price')
        plt.title(i)
        plt.show()

- From the above we can observe the relationship between each categorical feature and price, and see which labels can be merged.

### Statistics Analysis

 Objective: perform a statistical analysis of the New York Airbnb data using hypothesis tests: tests of means (Kruskal-Wallis test, one-way and two-way ANOVA), tests of proportion (z-test and chi-squared test) and tests of variance (F-test, Levene test). Before testing we check three assumptions: (i) normality of the target variable, (ii) randomness of sampling and (iii) equal variance across categories. The level of significance is 5 percent (alpha = 0.05). If the assumptions are satisfied, parametric tests, which assume an underlying distribution (e.g. ANOVA), can be performed; otherwise non-parametric tests, which do not rely on a distribution (e.g. Kruskal-Wallis, chi-squared), have to be used. The results of the tests help us assess the association and dependence between features.
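
The decision rule described above can be sketched as a small helper. The function name `compare_group_means` and the simulated data are hypothetical; the sketch assumes SciPy is available and uses Shapiro-Wilk and Levene as the assumption checks, with Kruskal-Wallis as the non-parametric fallback for comparing groups.

```python
import numpy as np
from scipy import stats

ALPHA = 0.05  # level of significance

def compare_group_means(groups, alpha=ALPHA):
    """Use one-way ANOVA if the assumption checks pass, else Kruskal-Wallis."""
    # Normality of each group (Shapiro-Wilk) and equal variances (Levene).
    normal = all(stats.shapiro(g)[1] > alpha for g in groups)
    equal_var = stats.levene(*groups)[1] > alpha
    if normal and equal_var:
        return "anova", stats.f_oneway(*groups)[1]
    return "kruskal", stats.kruskal(*groups)[1]

rng = np.random.default_rng(0)
# Hypothetical right-skewed "prices" for three room types.
groups = [rng.lognormal(mean=m, sigma=1.0, size=200) for m in (4.0, 4.5, 5.0)]
test_name, p_value = compare_group_means(groups)
print(test_name, p_value)
```

Because the simulated prices are log-normal, the normality check fails and the helper falls back to the Kruskal-Wallis test.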

In [None]:
cont = pd.crosstab(abm3.Borough,abm3.room_type)
cont

Testing the assumptions:
- Randomness of Data
- Normality Test
- Variance Test

The target variable being the price.

In [None]:
plt.figure(figsize=(10,7))
sns.distplot(abm2.price,color='r')
plt.xlabel("Price")
plt.title("Distribution of Price among property types")
plt.show()

#### Shapiro Test (for checking Normality)

- H0 (Null Hypothesis) : Distribution is normal

- H1 (Alternate Hypothesis): Distribution is not normal

In [None]:
import scipy.stats as st
st.shapiro(abm2.price)

#### Levene Test (for testing of variance)
H0 (null hypothesis): variance(private_room) = variance(shared_room) = variance(entire_home)=variance(hotel_room)

H1 (alternate hypothesis): at least one room type has a price variance different from the others

In [None]:
pvt = abm3[abm3['room_type'] == 'Private room']
share = abm3[abm3['room_type'] == 'Shared room']
apt = abm3[abm3['room_type'] == 'Entire home/apt']
hotel=abm3[abm3['room_type'] == 'Hotel room']

In [None]:
st.levene(pvt.price, share.price, apt.price,hotel.price)

Here the p-value is less than 0.05, so we reject the null hypothesis: the variance of price differs across room types.

#### price vs neighbourhood
H0 (null hypothesis): mean_price(Brooklyn) = mean_price(Manhattan) = ..... = mean_price(Bronx)

H1 (alternate hypothesis): at least one borough has a mean price different from the others

In [None]:
import  scipy.stats as stats
import statsmodels.api as sm
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

In [None]:
## one way anova
mod = ols('price ~ Borough', data =abm3).fit()
aov_table = sm.stats.anova_lm(mod, typ=1)
print(aov_table)

Here the p-value obtained is greater than 0.05, so we fail to reject the null hypothesis.
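
The F statistic reported in a one-way ANOVA table is the ratio of between-group to within-group variability. The sketch below computes it by hand with NumPy on hypothetical prices for three groups, only to show what the table summarises.

```python
import numpy as np

# Hypothetical prices for three boroughs.
groups = [
    np.array([1.0, 2.0, 3.0]),
    np.array([2.0, 3.0, 4.0]),
    np.array([6.0, 7.0, 8.0]),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()

# Sum of squares between groups and within groups.
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# Degrees of freedom: k - 1 between groups, N - k within groups.
df_between = len(groups) - 1
df_within = len(all_values) - len(groups)

f_stat = (ss_between / df_between) / (ss_within / df_within)
print(f_stat)
```

A large F value means the group means differ by more than the spread inside each group would explain; the p-value in the table is derived from this statistic and the two degrees of freedom.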

#### Room Type vs Neighbourhood Group

Since both variables, Room Type and Neighbourhood Group, are categorical with more than two categories, we can perform a chi-squared test.

#### Chi Squared Test
- H0 (null hypothesis): There is no association between Room Type and Neighbourhood Group.
- H1 (alternate hypothesis): There is an association between Room Type and Neighbourhood Group.
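
To make the test concrete, the sketch below computes the chi-squared statistic by hand from a hypothetical contingency table: expected counts under independence come from the row and column totals, and the statistic sums the squared deviations relative to those expected counts.

```python
import numpy as np

# Hypothetical room-type (rows) by borough (columns) counts.
observed = np.array([
    [30.0, 10.0, 20.0],
    [20.0, 40.0, 30.0],
])

row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)

# Expected count in each cell if the two variables were independent.
expected = row_totals @ col_totals / observed.sum()

chi2_stat = ((observed - expected) ** 2 / expected).sum()
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
print(expected, chi2_stat, dof)
```

The statistic is compared against a chi-squared distribution with `dof` degrees of freedom to obtain the p-value.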

In [None]:
tab = pd.crosstab(abm3['room_type'],abm3['Borough'])

In [None]:
st.chi2_contingency(tab)

In [None]:
ct = pd.crosstab(abm3['room_type'],abm3['Borough'])
ct.plot.bar(stacked=True)
plt.show()

### price vs review scores rating

In [None]:
mod = ols('price ~ review_scores_rating', data =abm3).fit()
aov_table = sm.stats.anova_lm(mod, typ=1)
print(aov_table)

Here the p-value obtained is less than 0.05, so we reject the null hypothesis and conclude that review scores have an effect on price.

### price vs cancellation policy

In [None]:
mod = ols('price ~ cancellation_policy', data =abm3).fit()
aov_table = sm.stats.anova_lm(mod, typ=1)
print(aov_table)

Here the p-value obtained is less than 0.05, so we reject the null hypothesis and conclude that cancellation policy has an effect on price.

### price vs availability_365

In [None]:
mod = ols('price ~ availability_365', data =abm3).fit()
aov_table = sm.stats.anova_lm(mod, typ=1)
print(aov_table)

#### chi2 on response rate and property type, borough etc

In [None]:
tab = pd.crosstab(abm3['host_response_rate'],abm3['property_type'])

In [None]:
st.chi2_contingency(tab)

In [None]:
tab = pd.crosstab(abm3['host_response_rate'],abm3['Borough'])

In [None]:
st.chi2_contingency(tab)

### Correlation Analysis

In [None]:
abm3.shape

In [None]:
numerical=[i for i in abm3.columns if abm3[i].dtypes!='O']
len(numerical)

In [None]:
cor = abm3.corr()

#Correlation with output variable
cor_target = abs(cor["price"])

# Keep features with a non-zero correlation (a threshold of 0.0 only drops NaN entries)
relevant_features = cor_target[cor_target>0.0]
relevant_features

In [None]:
top_corr_features = cor.index[abs(cor["price"])>0.0]
# plt.figure(figsize=(10,15))
sns.heatmap(abm3[top_corr_features].corr(), annot = True, 
                cbar = True,square=True)

**Since avg_price_property_type is the only feature whose correlation with price is above 0.3, we drop no features at this stage and move on to Feature Engineering and Selection.**
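
For reference, a correlation filter with a stricter threshold could look like the sketch below. The data frame is synthetic and the 0.3 cut-off is only an assumption taken from the sentence above; the notebook itself keeps all features.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 1000

# Hypothetical features: one drives price, one is pure noise
strong = rng.normal(size=n)
toy = pd.DataFrame({
    "strong_feature": strong,
    "noise_feature": rng.normal(size=n),
    "price": 3 * strong + rng.normal(size=n),
})

# Absolute correlation of every feature with the target
cor_with_price = toy.corr()["price"].drop("price").abs()

# Keep only features above the chosen threshold
selected = cor_with_price[cor_with_price > 0.3].index.tolist()
print(cor_with_price)
print(selected)
```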

In [None]:
# condition check for Multicollinearity  between 2 newly extracted features
print(abm3[["avg_review_score","avg_price_property_type"]].corr())

## III. Feature Engineering

- Reducing Labels in the Object features
- Handling Outliers
- Transforming data


In [None]:
# Checking objects in Categorical variable.
print('host_response_time: \n',abm3['host_response_time'].value_counts()/abm3['host_response_time'].value_counts().sum()*100)
print('=================================================')
print('cancellation_policy: \n',abm3.cancellation_policy.value_counts()/abm3['cancellation_policy'].value_counts().sum()*100)

In [None]:
# Reducing labels in host_response_time
abm3.replace({'within an hour':'Hour','within a few hours':'One Day','within a day':'Days'},inplace=True)

# Reducing the number of categories in property_type

# Percentage of listings covered by the 5 most frequent property types
(abm3['property_type'].value_counts()/abm3['property_type'].value_counts().sum()*100)[0:5].sum()

In [None]:
# Property types outside the 5 most frequent ones will be grouped as 'Other'
Mod_prop_type = abm3['property_type'].value_counts().index[5:].tolist()

def change_prop_type(label):
    if label in Mod_prop_type:
        label='Other'
    return label
# Mod_prop_type
abm3.loc[:,'property_type'] = abm3.loc[:,'property_type'].apply(change_prop_type)

In [None]:
abm3.property_type.value_counts()

In [None]:
numerical_types

### Handling outliers

In [None]:
cap_df = abm3.copy()

In [None]:
def features_plot(feat,df):
    plt.rcParams['figure.figsize']=(15,15)
    plt.style.use(style='ggplot')
    xxx,sub=plt.subplots(5,6)
    xxx.subplots_adjust(hspace=0.5)
    sub=sub.flatten()
    for i in range(len(feat)):
        sub[i].scatter(x=df[feat[i]], y=np.log1p(abm3["price"]),s=4)
        sub[i].set_title(feat[i], fontsize=10)
        sub[i].set_ylabel('log(Price)',fontsize=10)
        sub[i].tick_params(labelsize=10)
    plt.show()
    

Numerical_cols=[i for i in cap_df.columns if cap_df[i].dtypes=='float64']
len(Numerical_cols)
Numerical_cols = list(Numerical_cols)
# Numerical_cols.remove('price')
features_plot(sorted(Numerical_cols),cap_df)

In [None]:
def cap_data(df):
    # Clip every float64/int64 column to its 25th-75th percentile range.
    # Series.clip avoids chained assignment, which pandas may apply to a copy.
    for col in df.columns:
        if (df[col].dtype == 'float64') or (df[col].dtype == 'int64'):
            percentiles = df[col].quantile([0.25, 0.75]).values
            df[col] = df[col].clip(lower=percentiles[0], upper=percentiles[1])
            print(percentiles)
    return df

# abm3 = cap_data(abm2)

In [None]:
# Capping values at the 25th and 75th percentiles
cap_df = cap_data(cap_df)
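
Capping at the quartiles flattens half of every column. A gentler alternative, shown here only as a sketch and not as the method used in this notebook, is to cap at the box-plot whiskers, i.e. at Q1 - 1.5·IQR and Q3 + 1.5·IQR. The helper name `cap_at_whiskers` and the toy prices are hypothetical.

```python
import pandas as pd

def cap_at_whiskers(series, k=1.5):
    """Clip a numeric Series to [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile([0.25, 0.75])
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Hypothetical nightly prices with one extreme value
toy_price = pd.Series([50, 60, 70, 80, 90, 100, 110, 120, 5000], dtype="float64")
capped = cap_at_whiskers(toy_price)

print(capped.tolist())
```

Only the extreme value is pulled in; the values inside the whiskers keep their original spread.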

In [None]:
def features_plot(feat,df):
    plt.rcParams['figure.figsize']=(15,15)
    plt.style.use(style='ggplot')
    xxx,sub=plt.subplots(5,6)
    xxx.subplots_adjust(hspace=0.5)
    sub=sub.flatten()
    for i in range(len(feat)):
        sub[i].scatter(x=df[feat[i]], y=np.log1p(df["price"]),s=4)
        sub[i].set_title(feat[i], fontsize=10)
        sub[i].set_ylabel('log(Price)',fontsize=10)
        sub[i].tick_params(labelsize=10)
    plt.show()
    

Numerical_cols=[i for i in cap_df.columns if cap_df[i].dtypes=='float64']
len(Numerical_cols)
Numerical_cols = list(Numerical_cols)
# Numerical_cols.remove('price')
features_plot(sorted(Numerical_cols),cap_df)

#### Feature transformation

In [None]:
num_columns = ['host_is_superhost', 'host_total_listings_count',
       'host_identity_verified', 'accommodates',
       'bathrooms', 'bedrooms', 'beds', 'price']
num_columns1 = ['price', 'availability_365','number_of_reviews_ltm', 'review_scores_rating','review_scores_accuracy', 
               'instant_bookable', 'host_days_active_years','host_listing_since']

num_columns2 = ['special_amenities', 
               'common_amenities', 'avg_price_property_type', 'avg_review_score','price', 'security_deposit',
       'cleaning_fee', 'Num_of_guests_incl_forprice', 'price_per_extra_people','minimum_nights', 'maximum_nights']

In [None]:
sns.pairplot(data=cap_df[num_columns],height=3)
plt.show()

In [None]:
sns.pairplot(data=cap_df[num_columns1],height=3)
plt.show()

In [None]:
sns.pairplot(data=cap_df[num_columns2],height=3)
plt.show()

In [None]:
final_df = cap_df[numerical_types].drop(['latitude','longitude'],axis=1)

In [None]:
# Check the skew of all numerical features
skewed_feats = final_df.apply(lambda x: skew(x.dropna())).sort_values(ascending=False)
print("\nSkew in numerical features: \n")
skewness = pd.DataFrame({'Skew' :skewed_feats})
skewness.head(10)

In [None]:
# import modules 
import numpy as np 
from scipy import stats 

# plotting modules 
import seaborn as sns 
import matplotlib.pyplot as plt 

# generate non-normal data (exponential) as a stand-alone demonstration;
# a separate name is used so that final_df is not overwritten
demo_data = np.random.exponential(size = 1000) 

# transform the demo data & save lambda value 
fitted_data, fitted_lambda = stats.boxcox(demo_data) 

# creating axes to draw plots 
fig, ax = plt.subplots(1, 2) 

# plotting the original data (non-normal) and 
# fitted data (normal) 
sns.distplot(demo_data, hist = False, kde = True, 
	kde_kws = {'shade': True, 'linewidth': 2}, 
	label = "Non-Normal", color ="green", ax = ax[0]) 

sns.distplot(fitted_data, hist = False, kde = True, 
	kde_kws = {'shade': True, 'linewidth': 2}, 
	label = "Normal", color ="green", ax = ax[1]) 

# adding legends to the subplots 
plt.legend(loc = "upper right") 

# rescaling the subplots 
fig.set_figheight(5) 
fig.set_figwidth(10) 

print(f"Lambda value used for Transformation: {fitted_lambda}") 


In [None]:
# Check the skew of all numerical features
skewed_feats = pd.DataFrame(fitted_data).apply(lambda x: skew(x.dropna())).sort_values(ascending=False)
print("\nSkew in numerical features: \n")
skewness = pd.DataFrame({'Skew' :skewed_feats})
skewness.head(10)
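
To see how such a transformation affects skew on price-like data, the sketch below compares the raw skew with the skew after `log1p` and after Box-Cox. The prices are drawn from a log-normal distribution as a stand-in (an assumption, not the Airbnb column); note that Box-Cox requires strictly positive input.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Hypothetical right-skewed, strictly positive prices
toy_price = rng.lognormal(mean=4.5, sigma=0.8, size=2000)

log_price = np.log1p(toy_price)                 # simple log transform
boxcox_price, lam = stats.boxcox(toy_price)     # lambda estimated by maximum likelihood

skew_raw = stats.skew(toy_price)
skew_log = stats.skew(log_price)
skew_bc = stats.skew(boxcox_price)

print(f"raw skew:     {skew_raw:.3f}")
print(f"log1p skew:   {skew_log:.3f}")
print(f"Box-Cox skew: {skew_bc:.3f} (lambda={lam:.3f})")
```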

### Encoding Categorical Vars

In [None]:
categorical_types

In [None]:
cap_df

In [None]:
# # # Bin into 5 categories
cap_df['host_response_rate'].value_counts(bins=5,sort=False)

In [None]:
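# Bin into four categories (the labels below do not line up with the bin edges;
# they are kept because later cells reference 'host_response_rate_bins_100%')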
cap_df['host_response_rate_bins'] = pd.cut(cap_df.host_response_rate, bins=[50,60,70,80,100], labels=['49-60%', '50-89%', '90-99%', '100%'], include_lowest=True)

# Converting to string
cap_df['host_response_rate_bins'] = cap_df['host_response_rate_bins'].astype('str')

# Replace nulls with 'unknown'
#cap_df['host_response_rate_bins'].replace('nan', 'unknown', inplace=True)

# Category counts
cap_df['host_response_rate_bins'].value_counts()
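
When binning with `pd.cut`, the number of labels must equal the number of intervals, and it helps if each label describes its interval. The sketch below uses hypothetical edges chosen to match the label names (they are not the edges used above) and maps missing rates to `'unknown'`.

```python
import pandas as pd

# Hypothetical response rates in percent, including a missing value
rates = pd.Series([0, 45, 50, 75, 89, 90, 99, 100, None], dtype="float64")

# Intervals: [0, 49], (49, 89], (89, 99], (99, 100]
edges = [0, 49, 89, 99, 100]
labels = ["0-49%", "50-89%", "90-99%", "100%"]

binned = pd.cut(rates, bins=edges, labels=labels, include_lowest=True)

# Give missing rates their own category instead of the string 'nan'
binned = binned.cat.add_categories(["unknown"]).fillna("unknown")

print(binned.value_counts())
```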

In [None]:
dfe = pd.DataFrame(cap_df,columns = ['property_type','room_type','cancellation_policy','host_response_rate_bins','bed_type'])
dfe.head()

In [None]:
cat_cols = dfe.select_dtypes(['object']).columns

In [None]:
for col in cat_cols:
    freqs = dfe[col].value_counts()
    k = freqs.index[freqs>20][:6]
    for cat in k:
        name = col+'_'+cat
        dfe[name]=(dfe[col]==cat).astype(int)
    del dfe[col]
    print(col)
    
print(dfe.dtypes)
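
The loop above builds indicator columns by hand and keeps at most six frequent categories per variable. For comparison, `pd.get_dummies` produces the same kind of columns in one call, although it keeps every category. The toy frame below is hypothetical.

```python
import pandas as pd

# Hypothetical listings with a single categorical column
toy = pd.DataFrame({
    "room_type": ["Private room", "Shared room", "Private room", "Hotel room"]
})

# One indicator column per category, named '<column>_<category>'
dummies = pd.get_dummies(toy, columns=["room_type"], dtype=int)

print(dummies)
```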

In [None]:
abmen = pd.concat((dfe,cap_df),axis=1)
abmen.head(5)

In [None]:
final_df = pd.DataFrame()
final_df = abmen.copy()

In [None]:
final_df.drop(['host_response_time', 'host_response_rate','host_response_rate_bins', 'neighbourhood_cleansed',
       'Borough', 'property_type', 'room_type', 'bed_type',
       'cancellation_policy'],axis=1,inplace=True)

In [None]:
final_df.columns

In [None]:
# separating categorical and numerical dtypes
categorical_cols=final_df.select_dtypes(include=['object']).columns
print('categorical_types: \n',categorical_cols)

print('-------------')

numerical_cols=final_df._get_numeric_data().columns
print('numerical_types: \n',numerical_cols)

In [None]:
final_df.drop(['longitude','latitude'],axis=1,inplace=True)

##  IV. Feature Selection

#### RFE 

In [None]:
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE
from sklearn.linear_model import RidgeCV, LassoCV, Ridge, Lasso

In [None]:
# Getting Data Ready
X = final_df.drop('price', axis=1)
y= final_df['price']

In [None]:
model = LinearRegression()

In [None]:
#Initializing RFE model
rfe = RFE(model, n_features_to_select=40)

In [None]:
#Transforming data using RFE
X_rfe = rfe.fit_transform(X,y)  
#Fitting the data to model
model.fit(X_rfe,y)
print(rfe.support_)
print(rfe.ranking_)
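
`rfe.support_` is a boolean mask over the input columns, so it can be turned into a list of selected feature names. A minimal sketch on synthetic data (the feature names `f0` to `f4` are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)

# Hypothetical design matrix: only f0 and f3 influence the target
X_toy = pd.DataFrame(rng.normal(size=(300, 5)), columns=[f"f{i}" for i in range(5)])
y_toy = 4 * X_toy["f0"] - 3 * X_toy["f3"] + rng.normal(scale=0.1, size=300)

rfe_toy = RFE(LinearRegression(), n_features_to_select=2)
rfe_toy.fit(X_toy, y_toy)

# Map the boolean mask back to column names
selected = X_toy.columns[rfe_toy.support_].tolist()
print(selected)
print(dict(zip(X_toy.columns, rfe_toy.ranking_)))
```

In the notebook, `X.columns[rfe.support_]` would give the names of the 40 retained features in the same way.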

In [None]:
X.columns

In [None]:
X.drop(['bedrooms', 'review_scores_accuracy'],axis=1,inplace=True)

In [None]:
#no of features
nof_list=np.arange(1,47)            
high_score=0
#Variable to store the optimum features
nof=0           
score_list =[]
for n in range(len(nof_list)):
    X_train, X_test, y_train, y_test = train_test_split(X,y, test_size = 0.3, random_state = 0)
    model = LinearRegression()
    rfe = RFE(model, n_features_to_select=nof_list[n])
    X_train_rfe = rfe.fit_transform(X_train,y_train)
    X_test_rfe = rfe.transform(X_test)
    model.fit(X_train_rfe,y_train)
    score = model.score(X_test_rfe,y_test)
    score_list.append(score)
    if(score>high_score):
        high_score = score
        nof = nof_list[n]
print("Optimum number of features: %d" %nof)
print("Score with %d features: %f" % (nof, high_score))

## V. Model Building

In [None]:
import statsmodels.api as sm
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.feature_selection import RFE
from sklearn.linear_model import RidgeCV, LassoCV, Ridge, Lasso

In [None]:
## Raw linear regression model
X = final_df.drop('price', axis=1)
y= final_df['price']
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(X, y)

print(f'Coefficients: {lin_reg.coef_}')
print(f'Intercept: {lin_reg.intercept_}')
print(f'R^2 score: {lin_reg.score(X, y)}')

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test , y_train, y_test = train_test_split(X,y, test_size = 0.30, random_state = 1)
print(X_train.shape)
print(X_test.shape)
print(y_test.shape)

In [None]:
lin_reg = LinearRegression()
model = lin_reg.fit(X_train,y_train)
print(f'R^2 score for train: {lin_reg.score(X_train, y_train)}')
print(f'R^2 score for test: {lin_reg.score(X_test, y_test)}')

In [None]:
## Raw OLS Model
X = final_df.drop(['price'],axis=1)
y = final_df.price
X_constant = sm.add_constant(X)
lin_reg = sm.OLS(y,X_constant).fit()
lin_reg.summary()

In [None]:
##### Assumption 1- No autocorrelation

In [None]:
import statsmodels.tsa.api as smt

acf = smt.graphics.plot_acf(lin_reg.resid, lags=40 , alpha=0.05)
acf.show()

In [None]:
##### Assumption 2- Normality of Residuals
from scipy import stats
print(stats.jarque_bera(lin_reg.resid))

In [None]:
import seaborn as sns

sns.distplot(lin_reg.resid)

##### Assumption 3 - Linearity
Here we have two options. We can plot observed vs. predicted values and residuals vs. predicted values and inspect them for non-linear patterns,
or
we can run the rainbow test. Let's look at both of them one by one.

#### Rainbow test 
It checks whether a linear specification is adequate for the regression model.
Its null hypothesis is that the relationship is linear, so a large p-value supports linearity.


In [None]:
import statsmodels.api as sm
sm.stats.diagnostic.linear_rainbow(res=lin_reg, frac=0.5)

In [None]:
import scipy.stats as stats
import pylab
from statsmodels.graphics.gofplots import ProbPlot
st_residual = lin_reg.get_influence().resid_studentized_internal
stats.probplot(st_residual, dist="norm", plot = pylab)
plt.show()

In [None]:
lin_reg.resid.mean()

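The mean of the residuals is very close to 0, which is consistent with the linearity assumption. OLS with an intercept forces this mean to zero, so the rainbow test above is the stronger check.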
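
A residual mean near zero is guaranteed whenever the model includes an intercept, even if the true relationship is curved. The NumPy sketch below fits a straight line to deliberately quadratic synthetic data (all values hypothetical) to show this, and uses the correlation between the residuals and the omitted squared term as a simple symptom of non-linearity.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical data with a clearly non-linear relationship
x = rng.uniform(-3, 3, size=500)
y = x ** 2 + rng.normal(scale=0.1, size=500)

# Ordinary least squares with an intercept, fitted as a straight line
A = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)
resid = y - A @ beta

resid_mean = resid.mean()

# The residuals still carry the curvature the line could not capture
curvature_corr = np.corrcoef(resid, x ** 2)[0, 1]

print(f"mean of residuals: {resid_mean:.2e}")
print(f"corr(residuals, x^2): {curvature_corr:.3f}")
```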

In [None]:
##### Assumption 4 - Homoscedasticity (using the Goldfeld-Quandt test or the Breusch-Pagan test)

# Goldfeld-Quandt test
from statsmodels.compat import lzip
import numpy as np
import seaborn as sns 
import matplotlib.pyplot as plt
import statsmodels.stats.api as sms

model = lin_reg
fitted_vals = model.predict()
resids = model.resid
resids_standardized = model.get_influence().resid_studentized_internal

name = ['F statistic', 'p-value']
test = sms.het_goldfeldquandt(model.resid, model.model.exog)
lzip(name, test)

In [None]:
##### Assumption 4 - Homoscedasticity (using the Goldfeld-Quandt test or the Breusch-Pagan test)

##### Breusch-Pagan test
import statsmodels.api as sm
from statsmodels.compat import lzip
name = ['Lagrange multiplier statistic', 'p-value',
        'f-value', 'f p-value']
test = sms.het_breuschpagan(lin_reg.resid, lin_reg.model.exog)
lzip(name, test)

#### From the p-value above, we conclude that the residuals are heteroscedastic

##### Assumption 5- NO  MULTI COLLINEARITY

In [None]:
X.shape

In [None]:
##### Assumption 5- NO  MULTI COLLINEARITY
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = [variance_inflation_factor(X_constant.values, i) for i in range(X_constant.shape[1])]
# vif has one entry per column of X_constant (including 'const'), so index by those columns
pd.DataFrame({'vif': vif}, index=X_constant.columns).T

So, multicollinearity exists.

Note: the VIF values are computed from X_constant rather than X, because the model was fitted with an added constant.
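
For intuition, the VIF of predictor $j$ is $1 / (1 - R_j^2)$, where $R_j^2$ comes from regressing that predictor on all the others plus an intercept. The sketch below computes it by hand with NumPy on synthetic columns; the helper name `vif_by_hand` and the data are hypothetical.

```python
import numpy as np

def vif_by_hand(X):
    """VIF for each column of a 2-D array X (predictors only, no constant column)."""
    n, p = X.shape
    out = []
    for j in range(p):
        target = X[:, j]
        # Regress column j on an intercept plus all remaining columns
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ beta
        r2 = 1 - resid.var() / target.var()
        out.append(1 / (1 - r2))
    return np.array(out)

rng = np.random.default_rng(5)
a = rng.normal(size=1000)
b = a + rng.normal(scale=0.1, size=1000)   # nearly a copy of a
c = rng.normal(size=1000)                  # unrelated to a and b

vifs = vif_by_hand(np.column_stack([a, b, c]))
print(vifs)
```

The two nearly identical columns get a very large VIF, while the unrelated one stays close to 1. A common rule of thumb treats values above 5 or 10 as a sign of problematic collinearity.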

In [None]:
dict(pd.DataFrame({'vif': vif}, index=X_constant.columns).T)

In [None]:
final_df.columns

In [None]:
## Removed highly correlated variables
X = final_df[['property_type_Apartment', 'property_type_House', 'property_type_Other',
       'property_type_Townhouse', 'property_type_Condominium',
       'room_type_Entire home/apt', 'room_type_Private room',
       'room_type_Shared room', 'room_type_Hotel room',
       'cancellation_policy_strict', 'cancellation_policy_flexible', 'host_response_rate_bins_100%',
       'bed_type_Real Bed', 'bed_type_Pull-out Sofa',
       'bed_type_Airbed', 'bed_type_Couch', 'host_is_superhost',
       'host_total_listings_count', 'host_identity_verified', 'accommodates',
       'bathrooms', 'bedrooms', 'beds', 'security_deposit',
       'cleaning_fee', 'Num_of_guests_incl_forprice', 'price_per_extra_people',
       'minimum_nights', 'maximum_nights', 'availability_365',
       'number_of_reviews_ltm', 'review_scores_rating',
       'review_scores_accuracy', 'instant_bookable', 'host_days_active_years',
       'host_listing_since', 'special_amenities', 'common_amenities',
       'avg_price_property_type', 'avg_review_score']]
y = final_df['price']
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(X, y)

print(f'Coefficients: {lin_reg.coef_}')
print(f'Intercept: {lin_reg.intercept_}')
print(f'R^2 score: {lin_reg.score(X, y)}')

In [None]:
import warnings 
warnings.filterwarnings('ignore')
import statsmodels.api as sm

X_constant = sm.add_constant(X)
lin_reg = sm.OLS(y,X_constant).fit()
lin_reg.summary()

In [None]:
# remove 4 more parameters from the input
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = [variance_inflation_factor(X_constant.values, i) for i in range(X_constant.shape[1])]
pd.DataFrame({'vif': vif}, index=X_constant.columns).T

In [None]:
## Removed highly correlated variables
X = final_df[['room_type_Entire home/apt', 'room_type_Private room',
       'room_type_Shared room', 'room_type_Hotel room',
       'cancellation_policy_strict', 'cancellation_policy_flexible', 'host_response_rate_bins_100%',
       'bed_type_Real Bed', 'bed_type_Pull-out Sofa',
       'bed_type_Airbed', 'bed_type_Couch', 
       'host_total_listings_count',  'accommodates',
       'bathrooms', 'bedrooms', 'beds',  'security_deposit',
       'cleaning_fee', 'Num_of_guests_incl_forprice', 'price_per_extra_people',
       'minimum_nights', 'maximum_nights', 'availability_365',
       'number_of_reviews_ltm', 'review_scores_rating',
       'review_scores_accuracy', 'instant_bookable', 'host_days_active_years',
       'host_listing_since', 'special_amenities', 'common_amenities',
       'avg_price_property_type', 'avg_review_score']]
y = final_df['price']
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(X, y)

print(f'Coefficients: {lin_reg.coef_}')
print(f'Intercept: {lin_reg.intercept_}')
print(f'R^2 score: {lin_reg.score(X, y)}')

In [None]:
import warnings 
warnings.filterwarnings('ignore')
import statsmodels.api as sm

X_constant = sm.add_constant(X)
lin_reg = sm.OLS(y,X_constant).fit()
lin_reg.summary()

In [None]:
# remove 4 more parameters from the input
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = [variance_inflation_factor(X_constant.values, i) for i in range(X_constant.shape[1])]
pd.DataFrame({'vif': vif}, index=X_constant.columns).T

In [None]:
## Removed highly correlated variables
X = final_df[['room_type_Entire home/apt',
       'room_type_Shared room',
       'cancellation_policy_strict', 'cancellation_policy_flexible', 'host_response_rate_bins_100%',
       'bed_type_Real Bed', 'bed_type_Pull-out Sofa',
       'bed_type_Airbed', 'bed_type_Couch', 
       'host_total_listings_count',  'accommodates',
       'bathrooms', 'bedrooms', 'beds',  'security_deposit',
       'cleaning_fee', 'Num_of_guests_incl_forprice', 'price_per_extra_people',
       'minimum_nights', 'maximum_nights', 'availability_365',
       'number_of_reviews_ltm', 'review_scores_rating',
       'review_scores_accuracy', 'instant_bookable', 'host_days_active_years',
       'host_listing_since', 'special_amenities', 'common_amenities',
       'avg_price_property_type', 'avg_review_score']]
y = final_df['price']
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(X, y)

print(f'Coefficients: {lin_reg.coef_}')
print(f'Intercept: {lin_reg.intercept_}')
print(f'R^2 score: {lin_reg.score(X, y)}')

In [None]:
import warnings 
warnings.filterwarnings('ignore')
import statsmodels.api as sm

X_constant = sm.add_constant(X)
lin_reg = sm.OLS(y,X_constant).fit()
lin_reg.summary()

In [None]:
# remove 4 more parameters from the input
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = [variance_inflation_factor(X_constant.values, i) for i in range(X_constant.shape[1])]
pd.DataFrame({'vif': vif}, index=X_constant.columns).T

In [None]:
## Removed highly correlated variables
X = final_df[['room_type_Entire home/apt',
       'room_type_Shared room',
       'cancellation_policy_strict', 'cancellation_policy_flexible', 'host_response_rate_bins_100%',
       'bed_type_Real Bed', 'bed_type_Pull-out Sofa',
       'bed_type_Airbed', 'bed_type_Couch', 
       'host_total_listings_count',  'accommodates',
       'bathrooms', 'bedrooms', 'beds',  'security_deposit',
       'cleaning_fee', 'Num_of_guests_incl_forprice', 'price_per_extra_people',
       'minimum_nights', 'availability_365',
       'number_of_reviews_ltm', 'review_scores_rating',
       'review_scores_accuracy', 'instant_bookable', 'host_days_active_years',
       'host_listing_since', 'special_amenities', 'common_amenities',
       'avg_price_property_type', 'avg_review_score']]
y = final_df['price']
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(X, y)

print(f'Coefficients: {lin_reg.coef_}')
print(f'Intercept: {lin_reg.intercept_}')
print(f'R^2 score: {lin_reg.score(X, y)}')

In [None]:
import warnings 
warnings.filterwarnings('ignore')
import statsmodels.api as sm

X_constant = sm.add_constant(X)
lin_reg = sm.OLS(y,X_constant).fit()
lin_reg.summary()

In [None]:
# remove 4 more parameters from the input
from statsmodels.stats.outliers_influence import variance_inflation_factor

vif = [variance_inflation_factor(X_constant.values, i) for i in range(X_constant.shape[1])]
pd.DataFrame({'vif': vif}, index=X_constant.columns).T

In [None]:
dict(pd.DataFrame({'vif': vif}, index=X_constant.columns).T)

#### Linear Regression

In [None]:
## Removed highly correlated variables
X = final_df[['room_type_Entire home/apt',
       'room_type_Shared room',
       'cancellation_policy_strict', 'cancellation_policy_flexible', 'host_response_rate_bins_100%',
       'bed_type_Real Bed', 'bed_type_Pull-out Sofa',
       'bed_type_Airbed', 'bed_type_Couch', 
       'host_total_listings_count',  'accommodates',
       'bathrooms', 'bedrooms', 'beds',  'security_deposit',
       'cleaning_fee', 'Num_of_guests_incl_forprice', 'price_per_extra_people',
       'minimum_nights', 'availability_365',
       'number_of_reviews_ltm', 'review_scores_rating',
       'review_scores_accuracy', 'instant_bookable', 'host_days_active_years',
       'host_listing_since', 'special_amenities', 'common_amenities',
       'avg_price_property_type', 'avg_review_score']]
y = final_df['price']
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(X, y)

print(f'Coefficients: {lin_reg.coef_}')
print(f'Intercept: {lin_reg.intercept_}')
print(f'R^2 score: {lin_reg.score(X, y)}')

#### Finally let's check for overfit and underfit condition

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test , y_train, y_test = train_test_split(X,y, test_size = 0.30, random_state = 1)
print(X_train.shape)
print(X_test.shape)
print(y_test.shape)

In [None]:
lin_reg = LinearRegression()
model = lin_reg.fit(X_train,y_train)
print(f'R^2 score for train: {lin_reg.score(X_train, y_train)}')
print(f'R^2 score for test: {lin_reg.score(X_test, y_test)}')

#### Apply StandardScaler train and test

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaler = scaler.fit_transform(X_train)
# Reuse the statistics learned on the training split; refitting on the test split leaks information
X_test_scaler = scaler.transform(X_test)
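
One way to make sure the scaler is only ever fitted on training data is to wrap it in a scikit-learn `Pipeline` together with the model. The sketch below uses synthetic data and `Ridge` purely as an example estimator; none of the names refer to objects in this notebook.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(6)

# Hypothetical features on a large scale and a linear target
X_toy = rng.normal(loc=50, scale=10, size=(400, 3))
y_toy = X_toy @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=400)

X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.3, random_state=1)

# fit() scales with training statistics; score()/predict() reuse them on new data
pipe = make_pipeline(StandardScaler(), Ridge(alpha=0.01))
pipe.fit(X_tr, y_tr)

test_r2 = pipe.score(X_te, y_te)
scaler_mean = pipe.named_steps["standardscaler"].mean_

print(f"test R^2: {test_r2:.4f}")
print("scaler means (from the training split only):", scaler_mean)
```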

In [None]:
# modeling
# libraries
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import Ridge, Lasso, ElasticNet, RidgeCV, LassoCV, ElasticNetCV, LinearRegression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score, train_test_split, GridSearchCV
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.ensemble import RandomForestRegressor


### Linear reg

In [None]:
# Getting Data Ready
## Removed highly correlated variables
X = final_df[['room_type_Entire home/apt',
       'room_type_Shared room',
       'cancellation_policy_strict', 'cancellation_policy_flexible', 'host_response_rate_bins_100%',
       'bed_type_Real Bed', 'bed_type_Pull-out Sofa',
       'bed_type_Airbed', 'bed_type_Couch', 
       'host_total_listings_count',  'accommodates',
       'bathrooms', 'bedrooms', 'beds',  'security_deposit',
       'cleaning_fee', 'Num_of_guests_incl_forprice', 'price_per_extra_people',
       'minimum_nights', 'availability_365',
       'number_of_reviews_ltm', 'review_scores_rating',
       'review_scores_accuracy', 'instant_bookable', 'host_days_active_years',
       'host_listing_since', 'special_amenities', 'common_amenities',
       'avg_price_property_type', 'avg_review_score']]
y = final_df['price']

In [None]:
from sklearn.model_selection import train_test_split
train_x, test_x , train_y, test_y = train_test_split(X,y, test_size = 0.30, random_state = 1)
print(train_x.shape)
print(test_x.shape)
print(test_y.shape)

In [None]:
model = LinearRegression()

# fit the model with the training data
model.fit(train_x,train_y)

# coefficients of the trained model
print('\nCoefficient of model :', model.coef_)

# intercept of the model
print('\nIntercept of model',model.intercept_)

# predict the target on the training dataset
predict_train = model.predict(train_x)

# Root Mean Squared Error on training dataset
rmse_train = mean_squared_error(train_y,predict_train)**(0.5)
print('\nRMSE on train dataset : ', rmse_train)

# predict the target on the testing dataset
predict_test = model.predict(test_x) 

# Root Mean Squared Error on testing dataset
rmse_test = mean_squared_error(test_y,predict_test)**(0.5)
print('\nRMSE on test dataset : ', rmse_test)

# score the model fitted in this cell on the splits it was trained and tested on
print(f'R^2 score for train: {model.score(train_x, train_y)}')
print(f'R^2 score for test: {model.score(test_x, test_y)}')
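
As a reminder of what the two metrics measure, the sketch below computes RMSE and R² by hand on four made-up prices and checks them against scikit-learn. RMSE is in the units of the target, while R² is the share of variance explained.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical true and predicted prices
y_true = np.array([100.0, 150.0, 200.0, 250.0])
y_pred = np.array([110.0, 140.0, 210.0, 240.0])

# RMSE: square root of the mean squared residual
rmse_manual = np.sqrt(np.mean((y_true - y_pred) ** 2))

# R^2: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(rmse_manual, np.sqrt(mean_squared_error(y_true, y_pred)))
print(r2_manual, r2_score(y_true, y_pred))
```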

In [None]:
# Feature selection
## LassoCV on all features
X = final_df.drop(['price'],axis=1)
y = final_df.price
feature_select = LassoCV(precompute=True)
feature_select.fit(X, y)
print("Best alpha using built-in LassoCV: %f" % feature_select.alpha_)
print("Best score using built-in LassoCV: %f" %feature_select.score(X,y))
coef = pd.Series(feature_select.coef_, index = X.columns)
print("Lasso picked " + str(sum(coef != 0)) + " variables and eliminated the other " +  str(sum(coef == 0)) + " variables")
imp_coef = coef.sort_values()

In [None]:
imp_coef = imp_coef[imp_coef != 0]
imp_coef

In [None]:
X_new = final_df[['minimum_nights','number_of_reviews_ltm','maximum_nights','security_deposit','availability_365','cleaning_fee',
                  'review_scores_rating',
                  'avg_price_property_type','common_amenities','accommodates','room_type_Entire home/apt']]
y_new= final_df['price']

In [None]:
from sklearn.model_selection import train_test_split
trainx, testx , trainy, testy = train_test_split(X_new,y_new, test_size = 0.30, random_state = 1)
print(trainx.shape)
print(testx.shape)
print(testy.shape)

In [None]:
# linear regression after feature selection
# fit the model with the training data
model.fit(trainx,trainy)

# coefficients of the trained model
print('\nCoefficient of model :', model.coef_)

# intercept of the model
print('\nIntercept of model',model.intercept_)

# predict the target on the training dataset
predict_train = model.predict(trainx)

# Root Mean Squared Error on training dataset
rmse_train = mean_squared_error(trainy,predict_train)**(0.5)
print('\nRMSE on train dataset : ', rmse_train)

# predict the target on the testing dataset
predict_test = model.predict(testx) 

# Root Mean Squared Error on testing dataset
rmse_test = mean_squared_error(testy,predict_test)**(0.5)
print('\nRMSE on test dataset : ', rmse_test)

print(f'R^2 score for train: {model.score(trainx, trainy)}')
print(f'R^2 score for test: {model.score(testx, testy)}')

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.linear_model import Lasso
from sklearn.linear_model import ElasticNet
from sklearn.neighbors import KNeighborsRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor

In [None]:
# Getting Data Ready
## Removed highly correlated variables
X = final_df[['room_type_Entire home/apt',
       'room_type_Shared room',
       'cancellation_policy_strict', 'cancellation_policy_flexible', 'host_response_rate_bins_100%',
       'bed_type_Real Bed', 'bed_type_Pull-out Sofa',
       'bed_type_Airbed', 'bed_type_Couch', 
       'host_total_listings_count',  'accommodates',
       'bathrooms', 'bedrooms', 'beds',  'security_deposit',
       'cleaning_fee', 'Num_of_guests_incl_forprice', 'price_per_extra_people',
       'minimum_nights', 'availability_365',
       'number_of_reviews_ltm', 'review_scores_rating',
       'review_scores_accuracy', 'instant_bookable', 'host_days_active_years',
       'host_listing_since', 'special_amenities', 'common_amenities',
       'avg_price_property_type', 'avg_review_score']]
y = final_df['price']

#### Ridge Reg

In [None]:
# Ridge basic model

train_x = X_train
train_y = y_train
test_x = X_test
test_y = y_test

rr = Ridge(alpha=0.01)
rr.fit(train_x, train_y) 
pred_train_rr= rr.predict(train_x)
print('train_rmse: ',np.sqrt(mean_squared_error(train_y,pred_train_rr)))
print('train_r2 score: ',r2_score(train_y, pred_train_rr))
print('-------------------------')
pred_test_rr= rr.predict(test_x)
print('test_rmse: ',np.sqrt(mean_squared_error(test_y,pred_test_rr))) 
print('test_r2 score: ',r2_score(test_y, pred_test_rr))

In [None]:
# feature selected

rr = Ridge(alpha=0.01)
rr.fit(trainx, trainy) 
pred_train_rr= rr.predict(trainx)
print('train_rmse: ',np.sqrt(mean_squared_error(trainy,pred_train_rr)))
print('train_r2 score: ',r2_score(trainy, pred_train_rr))
print('-------------------------')
pred_test_rr= rr.predict(testx)
print('test_rmse: ',np.sqrt(mean_squared_error(testy,pred_test_rr))) 
print('test_r2 score: ',r2_score(testy, pred_test_rr))

### lasso Reg

In [None]:
# lasso
model_lasso = Lasso(alpha=0.01)
model_lasso.fit(train_x, train_y) 
pred_train_lasso= model_lasso.predict(train_x)
print('train_rmse: ',np.sqrt(mean_squared_error(train_y,pred_train_lasso)))
print('train_r2 score: ',r2_score(train_y, pred_train_lasso))
print('-------------------------')
pred_test_lasso= model_lasso.predict(test_x)
print('test_rmse: ',np.sqrt(mean_squared_error(test_y,pred_test_lasso))) 
print('test_r2 score: ',r2_score(test_y, pred_test_lasso))

In [None]:
# Feature Selected
model_lasso = Lasso(alpha=0.01)
model_lasso.fit(trainx, trainy) 
pred_train_lasso= model_lasso.predict(trainx)
print('train_rmse: ',np.sqrt(mean_squared_error(trainy,pred_train_lasso)))
print('train_r2 score: ',r2_score(trainy, pred_train_lasso))
print('-------------------------')
pred_test_lasso= model_lasso.predict(testx)
print('test_rmse: ',np.sqrt(mean_squared_error(testy,pred_test_lasso))) 
print('test_r2 score: ',r2_score(testy, pred_test_lasso))

### Elastic Net

In [None]:
#Elastic Net
model_enet = ElasticNet(alpha = 0.01)
model_enet.fit(trainx, trainy) 
pred_train_enet= model_enet.predict(trainx)
print('train_rmse: ',np.sqrt(mean_squared_error(trainy,pred_train_enet)))
print('train_r2 score: ',r2_score(trainy, pred_train_enet))

pred_test_enet= model_enet.predict(testx)
print('test_rmse: ',np.sqrt(mean_squared_error(testy,pred_test_enet)))
print('test_r2 score: ',r2_score(testy, pred_test_enet))

### Boosting techniques

In [None]:
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import RepeatedKFold
from sklearn.ensemble import GradientBoostingRegressor
# define a synthetic dataset (stand-alone demonstration; separate names keep the Airbnb X and y intact)
X_demo, y_demo = make_regression(n_samples=1000, n_features=20, n_informative=15, noise=0.1, random_state=7)
# define the model
model = GradientBoostingRegressor()
# define the evaluation procedure
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)
# evaluate the model
n_scores = cross_val_score(model, X_demo, y_demo, scoring='neg_mean_absolute_error', cv=cv, n_jobs=-1)
# report performance (scores are negated MAE, so values closer to 0 are better)
print('MAE: %.3f (%.3f)' % (np.mean(n_scores), np.std(n_scores)))
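
scikit-learn maximises scores, so error metrics are returned with a flipped sign under names such as `neg_mean_absolute_error`. The sketch below, again on synthetic data and with smaller settings chosen only to run quickly, negates the scores to report a conventional MAE.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import KFold, cross_val_score

# Hypothetical small regression problem
X_syn, y_syn = make_regression(n_samples=200, n_features=5, noise=0.1, random_state=7)

cv_small = KFold(n_splits=5, shuffle=True, random_state=1)
gbr = GradientBoostingRegressor(n_estimators=50, random_state=0)

# One negated MAE per fold
neg_mae = cross_val_score(gbr, X_syn, y_syn, scoring="neg_mean_absolute_error", cv=cv_small)

# Flip the sign to report MAE in the usual direction (lower is better)
mae = -neg_mae
print("MAE: %.3f (%.3f)" % (mae.mean(), mae.std()))
```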

#### 1. Gradient Boosting

In [None]:
# gradient boosting ensemble on the Lasso-selected features
from sklearn.ensemble import GradientBoostingRegressor
# define the model
model = GradientBoostingRegressor()
# fit the model on the training split
model.fit(trainx, trainy)
# predict on the test split

yhat = model.predict(testx)
# evaluate the predictions

print('test_rmse: ',np.sqrt(mean_squared_error(testy,yhat)))
print('test_r2 score: ',r2_score(testy, yhat))

In [None]:
# VIF selected
from sklearn.ensemble import GradientBoostingRegressor
# define the model
model = GradientBoostingRegressor()
# fit the model on the training split
model.fit(train_x, train_y)
# predict on the held-out test split
yhat = model.predict(test_x)
# report test-set error
print('test_rmse: ',np.sqrt(mean_squared_error(test_y,yhat)))
print('test_r2 score: ',r2_score(test_y, yhat))

#### 2. XG Boost

In [None]:
import xgboost
from xgboost import plot_importance
xgb_model = xgboost.XGBRegressor(
                 max_depth=3,
                 n_estimators=100,
                 random_state=42)  # random_state is the scikit-learn-style name for the seed
xgb_model.fit(train_x,train_y)
yhat = xgb_model.predict(test_x)
# summarize prediction

print('test_rmse: ',np.sqrt(mean_squared_error(test_y,yhat)))
print('test_r2 score: ',r2_score(test_y, yhat))

In [None]:
import xgboost
from xgboost import plot_importance
xgb_model = xgboost.XGBRegressor(
                 max_depth=3,
                 n_estimators=100,
                 random_state=42)  # random_state is the scikit-learn-style name for the seed
xgb_model.fit(trainx,trainy)
yhat = xgb_model.predict(testx)
# summarize prediction

print('test_rmse: ',np.sqrt(mean_squared_error(testy,yhat)))
print('test_r2 score: ',r2_score(testy, yhat))

In [None]:
def evaluate(model, X, y, title):
    """Print R^2 and RMSE of `model` on (X, y) and return its predictions."""
    predictions = model.predict(X)
    # y is assumed to be log1p(price), so expm1 maps values back to the price scale
    errors = abs(np.expm1(predictions) - np.expm1(y))
    mape = 100 * np.mean(errors / np.expm1(y))
    accuracy = 100 - mape
    score_gbr = model.score(X,y)
    #rsquared = r2_score(y,predictions)
    rmse_gbr = np.sqrt(mean_squared_error((y),(predictions)))
    
    print(title)
    print('R^2: {:0.4f}'.format(score_gbr))
#     print('R^2: {:0.4f}'.format(rsquared))
    # RMSE is on the same scale as y (not back-transformed), so it is not a dollar amount
    print('RMSE: {:0.4f} '.format(rmse_gbr))
#     print('Average Error: ${:0.4f}'.format(np.mean(errors)))
#     print('Accuracy = {:0.3f}%.'.format(accuracy),'\n')
    
    return predictions

    
def scatter_plot(prediction,y,title):
    plt.rcParams['figure.figsize']=(10,4)
    plt.style.use(style='ggplot')
    plt.scatter(x=prediction, y=y, alpha=.75)
    plt.ylabel('log(input price)',fontsize=16)
    plt.xlabel('log(predicted price)',fontsize=16)
    plt.tick_params(labelsize=16)
    plt.title(title,fontsize=16)
    plt.show()    
    
def feature_extraction(importances,title):
    # horizontal bar chart of the 15 most important features (largest at the top)
    plt.rcParams['figure.figsize']=(12,6)
    importances[0:15].iloc[::-1].plot(kind='barh',legend=False,fontsize=16)
    plt.tick_params(labelsize=18)
    plt.ylabel("Feature",fontsize=20)
    plt.xlabel("Importance",fontsize=20)
    plt.title(title,fontsize=20)
    plt.show()
    
def scatter_plot2(prediction1,y1,prediction2,y2,title):
    a=min(min(prediction1),min(y1),min(prediction2),min(y2))-0.2
    b=max(max(prediction1),max(y1),max(prediction2),max(y2))+0.2
    plt.rcParams['figure.figsize']=(10,4)
    plt.style.use(style='ggplot')
    plt.scatter(x=prediction1, y=prediction1-y1, color='red',label='Training data',alpha=.75)
    plt.scatter(x=prediction2, y=prediction2-y2, color='blue', marker='s', label='Test data',alpha=.75)
    plt.hlines(y = 0, xmin = a, xmax = b, color = "black")
    plt.ylabel('residual (predicted - input)',fontsize=16)
    plt.xlabel('log(predicted price)',fontsize=16)
    plt.tick_params(labelsize=16)
    plt.title(title,fontsize=16)
    plt.legend(fontsize=16)
    plt.show()

def scatter_plot3(prediction1,y1,prediction2,y2,title):
    a=min(min(prediction1),min(y1),min(prediction2),min(y2))-0.2
    b=max(max(prediction1),max(y1),max(prediction2),max(y2))+0.2
    plt.rcParams['figure.figsize']=(10,4)
    plt.style.use(style='ggplot')
    plt.scatter(x=prediction1, y=y1, color='red',label='Training data',alpha=.75)
    plt.scatter(x=prediction2, y=y2, color='blue', marker='s', label='Test data',alpha=.75)
    plt.plot([a, b], [a, b], c = "black")
    plt.ylabel('log(input price)',fontsize=16)
    plt.xlabel('log(predicted price)',fontsize=16)
    plt.tick_params(labelsize=16)
    plt.title(title,fontsize=16)
    plt.legend(fontsize=16)
    plt.show()
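
`evaluate` computes a mean absolute percentage error on the back-transformed price but never reports it. The sketch below isolates that calculation, assuming the target was stored as `log1p(price)`, which the `np.expm1` calls suggest. `price_scale_mape` is a hypothetical helper name, and the prices are made up for illustration rather than taken from the dataset.

```python
import numpy as np


def price_scale_mape(y_log_true, y_log_pred):
    """MAPE (in percent) after undoing a log1p transform on both arrays."""
    true_price = np.expm1(np.asarray(y_log_true, dtype=float))
    pred_price = np.expm1(np.asarray(y_log_pred, dtype=float))
    return 100.0 * np.mean(np.abs(pred_price - true_price) / true_price)


# Illustrative prices only (not taken from the dataset).
true_prices = np.array([100.0, 200.0, 50.0])
pred_prices = np.array([110.0, 180.0, 50.0])

mape = price_scale_mape(np.log1p(true_prices), np.log1p(pred_prices))
print(f"MAPE: {mape:.2f}%")
```

Reporting this number next to R^2 gives an error measure in relative price terms, which is easier to interpret than an RMSE on the log scale.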

In [None]:
rf= RandomForestRegressor(random_state=1, n_jobs=-2, max_features='log2')

param_grid = dict(n_estimators=[100,50],
                  max_depth=[None,3],
                  min_samples_leaf=[1,2])

grid_rf=GridSearchCV(rf, param_grid, cv=5, scoring='neg_mean_squared_error')

grid_rf.fit(X_train,y_train)

#print("Random forest grid.cv_results_ {}".format(grid_rf.cv_results_))
print("Random forest grid.best_score_ {}".format(grid_rf.best_score_))
print("Random forest grid.best_params_ {}".format(grid_rf.best_params_))
print("Random forest grid.best_estimator_ {}".format(grid_rf.best_estimator_))

model_rf = grid_rf.best_estimator_
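
With `scoring='neg_mean_squared_error'`, `best_score_` is a negated MSE, so the printed value is negative and not directly comparable to the RMSE figures used elsewhere. The sketch below shows one way to convert it, using a small synthetic dataset and a reduced grid as stand-ins for `X_train`, `y_train` and the grid above; the `demo_` names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for X_train / y_train (assumption, for a runnable demo).
X_demo, y_demo = make_regression(n_samples=200, n_features=8, noise=1.0, random_state=0)

demo_grid = GridSearchCV(
    RandomForestRegressor(random_state=1, max_features="log2"),
    param_grid={"n_estimators": [20, 40], "max_depth": [None, 3]},
    cv=3,
    scoring="neg_mean_squared_error",
)
demo_grid.fit(X_demo, y_demo)

# best_score_ is the mean negated MSE across folds; flip the sign before the square root.
best_cv_rmse = np.sqrt(-demo_grid.best_score_)
print(f"best CV RMSE: {best_cv_rmse:.4f}")
print("best params:", demo_grid.best_params_)
```

Note that this is the square root of the mean fold MSE, which is close to but not the same as the mean of the per-fold RMSE values.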

In [None]:
title0='Random Forest Regression:'
model_tmp = model_rf

title=title0 + ' training set model performance'
prediction_train=evaluate(model_tmp, X_train, y_train,title)

title=title0 + ' test set model performance'
prediction_test=evaluate(model_tmp, X_test, y_test,title)


title=title0 + ' residual plot'
scatter_plot2(prediction_train,y_train,prediction_test,y_test,title)

title=title0 + ' performance evaluation'
scatter_plot3(prediction_train,y_train,prediction_test,y_test,title)

importances_train = pd.DataFrame({'Feature':X_train.columns, 'Importance':model_rf.feature_importances_})
importances_train = importances_train.sort_values('Importance',ascending=False).set_index('Feature')
feature_extraction(importances_train,'Random Forest Regression: Training set feature importance')

In [None]:
# Hyperparameter-tuned gradient boosting regression
gbr = GradientBoostingRegressor(min_samples_split=2,
                                min_samples_leaf=2,
                                subsample=0.8,
                                random_state=1,
                                learning_rate=0.01,
                                max_features='sqrt')
#param_grid = {"n_estimators":np.arange(1000,10000,1000),'learning_rate':[0.01,0.05,0.1,0.25,0.5]}
param_grid = dict(n_estimators=[100,600], max_depth=[1,3,8])

grid_gbr=GridSearchCV(gbr, param_grid, cv=5, scoring='neg_mean_squared_error',n_jobs=-2)

grid_gbr.fit(X_train,y_train)

#print("Gradient boosting grid.cv_results_ {}".format(grid_gbr.cv_results_))
print("Gradient boosting grid.best_score_ {}".format(grid_gbr.best_score_))
print("Gradient boosting grid.best_params_ {}".format(grid_gbr.best_params_))
print("Gradient boosting grid.best_estimator_ {}".format(grid_gbr.best_estimator_))

model_gbr = grid_gbr.best_estimator_

In [None]:
title0='Gradient Boosting Regression:'
model_tmp = model_gbr

title=title0 + ' training set model performance'
prediction_train=evaluate(model_tmp, X_train, y_train,title)

title=title0 + ' test set model performance'
prediction_test=evaluate(model_tmp, X_test, y_test,title)

title=title0 + ' residual plot'
scatter_plot2(prediction_train,y_train,prediction_test,y_test,title)

title=title0 + ' performance evaluation'
scatter_plot3(prediction_train,y_train,prediction_test,y_test,title)

importances_train = pd.DataFrame({'Feature':X_train.columns, 'Importance':model_tmp.feature_importances_})
importances_train = importances_train.sort_values('Importance',ascending=False).set_index('Feature')
feature_extraction(importances_train,'Gradient Boosting Regression: Training set feature importance')

## FINAL REPORT:

Values are rounded to four decimal places; a dash marks a metric that was not recorded.

**Linear Reg, Ridge Reg, Lasso Reg, Elastic Net, Gradient Boosting and XG Boost**

| Model | Feature selection | Train RMSE | Train R^2 | Test RMSE | Test R^2 |
|---|---|---|---|---|---|
| Linear Reg | RFE | - | 0.6370 | - | 0.6354 |
| Linear Reg | After applying Standard scaler | - | 0.6370 | - | 0.6354 |
| Ridge Reg | RFE | 27.0638 | 0.6370 | 27.1220 | 0.6354 |
| Ridge Reg | Lasso CV | 27.2630 | 0.6316 | 27.3193 | 0.6301 |
| Lasso Reg | RFE | 27.0653 | 0.6369 | 27.1219 | 0.6354 |
| Lasso Reg | Lasso CV | 27.2630 | 0.6316 | 27.3194 | 0.6301 |
| Elastic Net | RFE | 27.2666 | 0.6315 | 27.3233 | 0.6300 |
| Gradient Boosting | RFE | - | - | 25.9819 | 0.6654 |
| Gradient Boosting | Lasso CV | - | - | 25.4928 | 0.6779 |
| XG Boost | RFE | - | - | 24.9095 | 0.6925 |
| XG Boost | Lasso CV | - | - | 25.5939 | 0.6753 |

**Random Forest and Gradient Boosting Regression (grid search)**

| Model | Train R^2 | Test R^2 |
|---|---|---|
| Random Forest Regression | 0.9761 | 0.8481 |
| Gradient Boosting Regression | 0.7548 | 0.7272 |