# Feature Enrichment 

There are three ways to enrich the data:
1. Feature Extraction: deriving new features from existing features.
2. Feature Engineering: transforming raw data into features suitable for modeling.
3. Feature Transformation: transforming the data to improve the accuracy of the algorithm.

In [20]:
# Import libraries:
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt 

In [2]:
df = pd.read_csv("flat_file_after_data_cleansing.csv")




In [6]:
#Expanding the output display to see more rows and columns:
pd.set_option('display.max_rows', 200 , 'display.max_columns', 200)

In [3]:
df.head(3)

Unnamed: 0.1,Unnamed: 0,listing_id,name,target_start_date_period,target_end_date_period,target_avg_dollar_price_in_period,start_date_previous_period,end_date_previous_period,host_id,host_name,...,review_scores_rating_cat,reviews_per_month_cat,DaysPassed_first_review_cat,DaysPassed_last_review_cat,beds_cat,bathrooms_cat,DaysPassed_host_since_cat,host_total_listings_count_cat,bedrooms_cat,sqrt_bedrooms_cat
0,0,7071,BrightRoom with sunny greenview!,2019-06-01,2019-08-31,,2018-11-07,2019-05-31,17391,Bright,...,review_scores_rating_25%_to_50%,reviews_per_month_75%_to_100%,DaysPassed_first_review_75%_to_100%,DaysPassed_last_review_0%_to_25%,beds_50%_to_75%,bathrooms_0%_to_25%,DaysPassed_host_since_75%_to_100%,host_total_listings_count_0%_to_25%,bedrooms_0%_to_25%,sqrt_bedrooms_0%_to_25%
1,1,7071,BrightRoom with sunny greenview!,2019-07-01,2019-09-30,,2018-11-07,2019-06-30,17391,Bright,...,review_scores_rating_25%_to_50%,reviews_per_month_75%_to_100%,DaysPassed_first_review_75%_to_100%,DaysPassed_last_review_0%_to_25%,beds_50%_to_75%,bathrooms_0%_to_25%,DaysPassed_host_since_75%_to_100%,host_total_listings_count_0%_to_25%,bedrooms_0%_to_25%,sqrt_bedrooms_0%_to_25%
2,2,7071,BrightRoom with sunny greenview!,2019-08-01,2019-11-06,,2018-11-07,2019-07-31,17391,Bright,...,review_scores_rating_25%_to_50%,reviews_per_month_75%_to_100%,DaysPassed_first_review_75%_to_100%,DaysPassed_last_review_0%_to_25%,beds_50%_to_75%,bathrooms_0%_to_25%,DaysPassed_host_since_75%_to_100%,host_total_listings_count_0%_to_25%,bedrooms_0%_to_25%,sqrt_bedrooms_0%_to_25%


In [4]:
# Drop leftover index columns whose names start with "Unnamed"
columns_to_drop = [x for x in df.columns.to_list() if x.startswith("Unnamed")]
print("dropping columns: ", columns_to_drop)  # e.g. ['Unnamed: 0', ...]
df.drop(columns=columns_to_drop, inplace=True)

dropping columns:  ['Unnamed: 0']


In [5]:
# Representing the dimensionality of the DataFrame (before adding new variables):
df.shape

(157864, 134)

In [8]:
# show_counts replaces the deprecated null_counts argument (pandas >= 1.2)
df.info(verbose=True, show_counts=True)


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 157864 entries, 0 to 157863
Data columns (total 134 columns):
 #    Column                                   Non-Null Count   Dtype  
---   ------                                   --------------   -----  
 0    listing_id                               157864 non-null  int64  
 1    name                                     157451 non-null  object 
 2    target_start_date_period                 157864 non-null  object 
 3    target_end_date_period                   157864 non-null  object 
 4    target_avg_dollar_price_in_period        43919 non-null   float64
 5    start_date_previous_period               157864 non-null  object 
 6    end_date_previous_period                 157864 non-null  object 
 7    host_id                                  157864 non-null  int64  
 8    host_name                                157682 non-null  object 
 9    neighbourhood_group                      157864 non-null  object 
 10   neighbourhood     

### Feature Extraction

The additional variables that were created in the "Addition to Flat file" notebook are:
1. size - extracted from the "description" column.
2. concat_comments_polarity (Sentiment Analysis) - extracted from the "concat_comments" column.
3. concat_comments_subjectivity (Sentiment Analysis) - extracted from the "concat_comments" column.
4. concat_comments_sentiment (Sentiment Analysis) - extracted from the "concat_comments" column.

These variables were created in the "Addition to Flat file" notebook because they contain NA values that need to be handled in the EDA and Data Cleansing sections.
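
How `size` was extracted is not shown in this notebook. As a rough sketch only, assuming the description states the area in square metres (for example "approx. 20 sqm"), a regular expression could pull out the number. The pattern and the helper name `extract_size_sqm` are illustrative assumptions, not the code from the "Addition to Flat file" notebook, and the toy descriptions are made up:

```python
import re

import numpy as np
import pandas as pd

# Hypothetical pattern: a number followed by a square-metre unit (sqm, m2, m², qm).
SIZE_PATTERN = re.compile(
    r"(\d+(?:[.,]\d+)?)\s*(?:sqm|m2|m²|qm)(?![a-z0-9])", re.IGNORECASE
)

def extract_size_sqm(description):
    """Return the first square-metre figure found in the text, or NaN."""
    if not isinstance(description, str):
        return np.nan  # missing descriptions stay missing
    match = SIZE_PATTERN.search(description)
    if match is None:
        return np.nan
    # Accept a decimal comma as well as a decimal point
    return float(match.group(1).replace(",", "."))

toy = pd.DataFrame({"description": [
    "The BrightRoom is an approx. 20 sqm room",
    "Cozy flat, 45,5 m² near the park",
    "No size mentioned here",
    None,
]})
toy["size"] = toy["description"].apply(extract_size_sqm)
print(toy)
```

Rows without a recognisable size stay NaN, which is consistent with the note above that these variables include NA and must be handled later.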


### Feature Engineering

The engineered features below are based on the patterns I found in the EDA section.

#### availability

We saw in the EDA section that there are 4 similar variables - availability_30, availability_60, availability_90 and availability_365. Those variables have high correlation (more than 0.8), so I will keep 

In [17]:
# Create a new variable: distance (in km) from the centre of Berlin, computed from latitude and longitude
from geopy.distance import great_circle

def distance_from_berlin(lat, lon):
    berlin_centre = (52.50277, 13.404166)
    record = (lat, lon)
    return great_circle(berlin_centre, record).km

# Add the distance column to the dataset
df['distance'] = df.apply(lambda x: distance_from_berlin(x.latitude, x.longitude), axis=1)


df.head(1)

Unnamed: 0,listing_id,name,target_start_date_period,target_end_date_period,target_avg_dollar_price_in_period,start_date_previous_period,end_date_previous_period,host_id,host_name,neighbourhood_group,neighbourhood,latitude,longitude,room_type,last_review,listing_url,scrape_id,last_scraped,summary,space,description,experiences_offered,notes,transit,access,interaction,house_rules,neighborhood_overview,host_about,host_since,picture_url,host_url,host_location,host_response_time,host_is_superhost,host_thumbnail_url,host_picture_url,host_neighbourhood,host_verifications,host_has_profile_pic,host_identity_verified,street,neighbourhood_cleansed,neighbourhood_group_cleansed,city,state,zipcode,smart_location,market,country_code,country,is_location_exact,property_type,bed_type,amenities,square_feet,weekly_price,monthly_price,calendar_updated,first_review,calendar_last_scraped,license,instant_bookable,is_business_travel_ready,require_guest_profile_picture,require_guest_phone_verification,cancellation_policy,concat_comments,concat_comments_sentiment,target_num_of_day_in_period,target_num_of_booked_days,booked_up_target,num_of_day_in_previous_period,num_of_booked_days_in_previous_period,occupancy_last_period,avg_dollar_price_in_previous_period,price,minimum_nights,number_of_reviews,DaysPassed_last_review,reviews_per_month,calculated_host_listings_count,availability_365,DaysPassed_host_since,host_response_rate,host_total_listings_count,accommodates,bathrooms,bedrooms,beds,security_deposit,cleaning_fee,guests_included,extra_people,maximum_nights,availability_30,availability_60,availability_90,DaysPassed_first_review,review_scores_rating,...,pack_’n_play/travel_crib,shower_chair,high_chair,microwave,carbon_monoxide_detector,well-lit_path_to_entrance,wide_doorway,shampoo,ethernet_connection,kitchenette,heating,accessible-height_toilet,kitchen,children’s_books_and_toys,translation_missing:_en.hosting_amenity_49,crib,stove,bathtub_with_bath_chair,toilet_paper,other_pet(s),fireplace_guards,dryer,room-darkening_shades,bathtub,game_console,children’s_dinnerware,air_conditioning,cable_tv,hot_tub,electric_profiling_bed,fixed_grab_bars_for_shower,toilet,buzzer/wireless_intercom,convection_oven,window_guards,ceiling_hoist,bathroom_essentials,gym,hot_water,pocket_wifi,wide_hallway_clearance,long_term_stays_allowed,sound_system,free_parking_on_premises,stair_gates,beachfront,dishes_and_silverware,bath_towel,pets_live_on_this_property,body_soap,breakfast_table,wide_clearance_to_bed,step-free_access,bbq_grill,iron,changing_table,24-hour_check-in,laptop_friendly_workspace,baby_monitor,firm_mattress,single_level_home,hair_dryer,private_living_room,fixed_grab_bars_for_toilet,netflix,keypad,roll-in_shower,safety_card,indoor_fireplace,luggage_dropoff_allowed,air_purifier,other,washer_/_dryer,lock_on_bedroom_door,lockbox,outlet_covers,patio_or_balcony,coffee_maker,cat(s),suitable_for_events,pets_allowed,waterfront,essentials,tv,self_check-in,hot_water_kettle,babysitter_recommendations,first_aid_kit,wheelchair_accessible,oven,extra_pillows_and_blankets,private_entrance,private_bathroom,beach_essentials,family/kid_friendly,wide_entryway,flat_path_to_front_door,translation_missing:_en.hosting_amenity_50,pool,distance
0,7071,BrightRoom with sunny greenview!,2019-06-01,2019-08-31,,2018-11-07,2019-05-31,17391,Bright,Pankow,Helmholtzplatz,52.543157,13.415091,Private room,2018-11-04,https://www.airbnb.com/rooms/7071,20181110000000.0,2018-11-07,Cozy and large room in the beautiful district ...,"The BrightRoom is an approx. 20 sqm (215ft²),...",Cozy and large room in the beautiful district ...,none,I hope you enjoy your stay to the fullest! Ple...,Best access to other parts of the city via pub...,"The guests have access to the bathroom, a smal...",I am glad if I can give you advice or help as ...,Please take good care of everything during you...,"Great neighborhood with plenty of Cafés, Bake...","I'm a creative person, adventurer, and travele...",2009-05-16,https://a0.muscache.com/im/pictures/21278/32a1...,https://www.airbnb.com/users/show/17391,"Berlin, Berlin, Germany",within an hour,t,https://a0.muscache.com/im/pictures/user/48c3d...,https://a0.muscache.com/im/pictures/user/48c3d...,Prenzlauer Berg,"['email', 'phone', 'reviews', 'jumio', 'govern...",t,t,"Berlin, Berlin, Germany",Helmholtzplatz,Pankow,Berlin,Berlin,10437.0,"Berlin, Germany",Berlin,DE,Germany,t,Apartment,Real Bed,"{Wifi,Heating,""Family/kid friendly"",Essentials...",,,,3 days ago,2009-08-18,2018-11-07,,f,f,f,f,moderate,##������ ������������ ������������ �����������...,positive_sentiment,92,92,1,206,180,0.87,44.3846,42.0,2.0,197.0,1042.0,1.75,1.0,26.0,4501.0,1.0,1.0,2.0,1.0,1.0,2.0,0.0,0.0,1.0,24.0,10.0,15.0,26.0,26.0,4407.0,96.0,...,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,1,0,0,1,0,4.551287
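
Calling a Python function once per row with `df.apply(..., axis=1)` can be slow on roughly 158,000 rows. As a hedged alternative, the same great-circle distance can be computed for all rows at once with the haversine formula in NumPy. This is a sketch, not the notebook's method: the function name `haversine_km` is hypothetical, and the Earth radius constant is an assumed mean radius chosen to be close to what geopy uses.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.009             # assumed mean Earth radius (spherical model)
BERLIN_CENTRE = (52.50277, 13.404166)  # same reference point as in the cell above

def haversine_km(lat, lon, centre=BERLIN_CENTRE):
    """Great-circle distance in km from `centre` to each (lat, lon) pair."""
    lat1, lon1 = np.radians(centre[0]), np.radians(centre[1])
    lat2, lon2 = np.radians(lat), np.radians(lon)
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

# First point: the coordinates of listing 7071 shown above; second: the centre itself
distances = haversine_km(np.array([52.543157, 52.50277]),
                         np.array([13.415091, 13.404166]))
print(distances)
```

On the real data this would be used as `df['distance'] = haversine_km(df['latitude'].to_numpy(), df['longitude'].to_numpy())`.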


### Feature Transformation

In [141]:
import itertools

def from_col_list_to_is_exists_columns(_df, col_of_list, ignore_values_list=('a', '')):
    # Collect the unique values that appear in the lists stored in col_of_list, then add
    # one 0/1 indicator column per value (equivalent to one-hot encoding/dummies).
    # The default ignore values presumably cover set_to_list artifacts:
    # str(NaN)[1:-1] == 'a' and an empty '{}' gives ''.
    # Note: _df is modified in place and also returned.

    # Keep only the rows where col_of_list is not missing
    df_col = _df.loc[_df[col_of_list].notna(), col_of_list]

    # Each row in df_col contains a list of strs; flatten all lists into a single list
    flat_list = list(itertools.chain.from_iterable(df_col.to_list()))
    list_of_columns_to_add = sorted(set(flat_list) - set(ignore_values_list))

    # Add one column per value: 1 if the row's list contains it, otherwise 0
    for new_col in list_of_columns_to_add:
        _df[new_col] = _df[col_of_list].apply(
            lambda x: 1 if (isinstance(x, list) and new_col in x) else 0)

    return _df
        
######################################################################
# Testing
######################################################################        
df_test = pd.DataFrame({'amenities_list': [["tv", "cable_tv"],["heating", "washer"]] } )
df_test = from_col_list_to_is_exists_columns(df_test, col_of_list = 'amenities_list')

df_expected = pd.DataFrame({'amenities_list': [["tv", "cable_tv"],["heating", "washer"]], "tv": [1,0], "cable_tv":[1,0], "heating":[0,1], "washer": [0,1] })
pd.testing.assert_frame_equal(df_test,df_expected[df_test.columns])


#### Transform the "amenities" column (set of attributes) into dummy columns
The amenities column contains a set of attributes stored as a string. Each attribute that appears in amenities is transformed into its own feature (equivalent to making dummies/one-hot encoding).

In [9]:
df['amenities'][1]

'{Wifi,Heating,"Family/kid friendly",Essentials,Shampoo,Hangers,"Hair dryer","Laptop friendly workspace","translation missing: en.hosting_amenity_50","Hot water","Bed linens","Extra pillows and blankets","Single level home"}'

In [10]:
def set_to_list(amenities_val):
    # amenities_val is a set of words stored as a str. Example: '{TV,"Cable TV",Wifi,Kitchen,Gym, ... }'
    amenities_str = str(amenities_val)[1:-1].split(",") # ['TV', '"Cable TV"', 'Wifi', 'Kitchen', 'Gym']
    
    # strip whitespace and the "" around phrases with spaces, replace spaces with _ and cast phrases to lower case
    return [s.strip().strip('"').lower().replace(" ", "_") for s in amenities_str]


# Test set_to_list method
df_test = pd.DataFrame({'amenities': ['{TV,"Cable TV",Wifi,Kitchen,Gym}', '{Heating,Washer,Essentials,Shampoo,"Hair dryer"}']} )
df_test['amenities_list'] = df_test['amenities'].apply(set_to_list)

df_expected = pd.DataFrame({'amenities': ['{TV,"Cable TV",Wifi,Kitchen,Gym}', '{Heating,Washer,Essentials,Shampoo,"Hair dryer"}'],
                           'amenities_list': [["tv", "cable_tv", "wifi", "kitchen", "gym"],["heating", "washer", "essentials", "shampoo", "hair_dryer"]] } )

pd.testing.assert_frame_equal(df_test,df_expected)
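
Splitting on every comma would break a quoted amenity that itself contains a comma. Whether the real data contains such values is not shown here, so the following is only a defensive sketch: it parses the inner string with the standard `csv` module, which honours double quotes. The name `set_to_list_csv` and the amenity "Bed, extra large" are made up for illustration.

```python
import csv

def set_to_list_csv(amenities_val):
    """Parse '{TV,"Cable TV",Wifi}' into ['tv', 'cable_tv', 'wifi'], honouring quotes."""
    if not isinstance(amenities_val, str):
        return []  # treat missing values as "no amenities"
    inner = amenities_val.strip()[1:-1]  # drop the surrounding braces
    if not inner:
        return []
    fields = next(csv.reader([inner]))   # quoted fields may contain commas
    return [f.strip().lower().replace(" ", "_") for f in fields if f.strip()]

parsed = set_to_list_csv('{TV,"Cable TV","Bed, extra large",Wifi}')
print(parsed)
```

Unlike `set_to_list`, this variant returns an empty list for missing or empty values instead of `['a']` or `['']`, so no ignore list would be needed downstream.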



In [11]:
df['amenities_list'] = df['amenities'].apply(set_to_list)

In [126]:
df = from_col_list_to_is_exists_columns(df, col_of_list = 'amenities_list')
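
The same multi-hot encoding can also be sketched without an explicit Python loop over values, using `Series.explode` and `pd.get_dummies`. This is an alternative illustration on a made-up toy frame, not the function used in this notebook:

```python
import pandas as pd

toy = pd.DataFrame({"amenities_list": [["tv", "cable_tv"], ["heating", "tv"], []]})

# One row per (listing, amenity); an empty list becomes a single NaN row
exploded = toy["amenities_list"].explode()

# One-hot encode each amenity, then collapse back to one row per listing
dummies = pd.get_dummies(exploded).groupby(level=0).max().astype(int)

toy_encoded = toy.join(dummies)
print(toy_encoded)
```

The listing with no amenities gets zeros in every indicator column, because `get_dummies` ignores NaN by default.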

#### Transform the "host_verifications" column (list of attributes) into dummy columns
The host_verifications column contains a list of attributes stored as a string. Each attribute that appears in host_verifications is transformed into its own feature (equivalent to making dummies/one-hot encoding).

In [102]:
df['host_verifications'][1]

"['email', 'phone', 'reviews', 'jumio', 'government_id']"

In [103]:
import ast
# host_verifications contains a list that was converted to a str (the string representation of a list)
# ast.literal_eval can be used to convert it back to a list
ast.literal_eval(df['host_verifications'][1])

['email', 'phone', 'reviews', 'jumio', 'government_id']

In [132]:
# host_verifications_list is the representation of host_verifications as a list (instead of a str)
df['host_verifications_list'] = df['host_verifications'].apply(ast.literal_eval)
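
`ast.literal_eval` raises an error on missing values or malformed strings. Whether `host_verifications` contains any is not shown here, so the following is only a defensive sketch; the helper name `parse_list_literal` and the toy values are hypothetical:

```python
import ast

import numpy as np
import pandas as pd

def parse_list_literal(value):
    """Turn "['email', 'phone']" into a list; return [] for NaN or unparsable input."""
    if not isinstance(value, str):
        return []  # NaN / None
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return []  # not a valid Python literal
    return list(parsed) if isinstance(parsed, (list, tuple)) else []

toy = pd.Series(["['email', 'phone']", np.nan, "[]", "not a list"])
parsed = toy.apply(parse_list_literal)
print(parsed.tolist())
```

Returning an empty list keeps every row compatible with `from_col_list_to_is_exists_columns`, which then yields zeros for that row.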

In [139]:
df = from_col_list_to_is_exists_columns(df, col_of_list = 'host_verifications_list')

['sesame', 'email', 'identity_manual', 'phone', 'selfie', 'photographer', 'manual_offline', 'facebook', 'reviews', 'jumio', 'weibo', 'work_email', 'google', 'zhima_selfie', 'kba', 'government_id', 'sent_id', 'sesame_offline', 'manual_online', 'offline_government_id']


#### One-hot encoding / dummy encoding

In [29]:
# Defining the categorical variables:
category_cols = [
    'neighbourhood_group', 'room_type', 'host_response_time', 'host_is_superhost',
    'host_has_profile_pic', 'host_identity_verified', 'bed_type', 'instant_bookable',
    'is_business_travel_ready', 'require_guest_profile_picture',
    'require_guest_phone_verification', 'cancellation_policy', 'concat_comments_sentiment',
] + [x for x in df.columns.to_list() if x.endswith("cat")]

print(category_cols)

['neighbourhood_group', 'room_type', 'host_response_time', 'host_is_superhost', 'host_has_profile_pic', 'host_identity_verified', 'bed_type', 'instant_bookable', 'is_business_travel_ready', 'require_guest_profile_picture', 'require_guest_phone_verification', 'cancellation_policy', 'concat_comments_sentiment', 'host_response_rate_cat', 'size_cat', 'avg_dollar_price_in_previous_period_cat', 'concat_comments_subjectivity_cat', 'concat_comments_polarity_cat', 'review_scores_value_cat', 'review_scores_checkin_cat', 'review_scores_location_cat', 'review_scores_communication_cat', 'review_scores_accuracy_cat', 'review_scores_cleanliness_cat', 'review_scores_rating_cat', 'reviews_per_month_cat', 'DaysPassed_first_review_cat', 'DaysPassed_last_review_cat', 'beds_cat', 'bathrooms_cat', 'DaysPassed_host_since_cat', 'host_total_listings_count_cat', 'bedrooms_cat', 'sqrt_bedrooms_cat']
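
The encoding step itself is not shown in this section. One common way to do it, sketched here on a made-up toy frame, is `pd.get_dummies` with the list of categorical columns; the names `toy` and `toy_category_cols` are illustrative stand-ins for `df` and `category_cols`:

```python
import pandas as pd

toy = pd.DataFrame({
    "room_type": ["Private room", "Entire home/apt", "Private room"],
    "instant_bookable": ["t", "f", "f"],
    "price": [42.0, 80.0, 35.0],
})
toy_category_cols = ["room_type", "instant_bookable"]

# Each listed column is replaced by one 0/1 column per category;
# columns that are not listed (price) are left unchanged.
encoded = pd.get_dummies(toy, columns=toy_category_cols, dtype=int)
print(encoded.columns.to_list())
```

For linear models, passing `drop_first=True` drops one level per variable and avoids perfectly collinear dummy columns; tree-based models usually do not need it.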


### Feature Selection

Selection is based on voting: we apply several techniques (univariate and multivariate), build
a table that lists every variable in the dataset and marks the variables each technique
recommends, then choose a threshold on the total number of votes and, on this basis, select the
variables that will be used to train our models.
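
The voting scheme can be sketched on synthetic data as follows. The three techniques (Pearson correlation, Spearman correlation, and the size of standardised least-squares coefficients), the `top_k` value and the threshold are all assumptions chosen for illustration; the techniques actually used for this dataset are not listed here.

```python
import numpy as np
import pandas as pd

# Synthetic data: only x0 and x1 drive the target
rng = np.random.default_rng(0)
n = 500
X = pd.DataFrame(rng.normal(size=(n, 4)), columns=["x0", "x1", "x2", "x3"])
y = pd.Series(3 * X["x0"] + 2 * X["x1"] + rng.normal(scale=0.5, size=n), name="target")

def top_k_mask(scores, k):
    """1 for the k highest-scoring features, 0 otherwise."""
    return (scores.rank(ascending=False, method="first") <= k).astype(int)

top_k = 2

# Univariate: absolute Pearson correlation with the target
pearson = X.corrwith(y).abs()
# Univariate: absolute Spearman correlation (Pearson correlation of the ranks)
spearman = X.rank().corrwith(y.rank()).abs()
# Multivariate: absolute least-squares coefficients on standardised features
Xs = (X - X.mean()) / X.std()
design = np.column_stack([np.ones(n), Xs.to_numpy()])
coefs, *_ = np.linalg.lstsq(design, y.to_numpy(), rcond=None)
ols = pd.Series(np.abs(coefs[1:]), index=X.columns)

# Voting table: one row per variable, one column per technique
votes = pd.DataFrame({
    "pearson": top_k_mask(pearson, top_k),
    "spearman": top_k_mask(spearman, top_k),
    "ols": top_k_mask(ols, top_k),
})
votes["total"] = votes.sum(axis=1)

threshold = 2  # minimum number of votes needed to keep a variable
selected = votes.index[votes["total"] >= threshold].tolist()
print(votes)
print("selected:", selected)
```

Raising the threshold keeps only variables that most techniques agree on; lowering it keeps variables that any single technique recommends.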