In [1]:
import numpy as np
import pandas as pd

In [2]:
'''
creates a new feature that is the total character length of the 
"description", "neighborhood_overview", and "host_about" columns
'''
def make_total_characters_feature(listings: pd.core.frame.DataFrame) -> None:
    descriptive_cols = ['description', 'neighborhood_overview', 'host_about']
    # running total that shares the index of listings, so rows stay aligned
    results = pd.Series(0, index = listings.index)
    
    for col in descriptive_cols:
        # fillna on a .loc slice only changes a temporary copy, so fill each column here;
        # missing text counts as 0 characters instead of turning the sum into NaN
        results += listings[col].fillna("").str.len()
        listings.drop(col, axis = 1, inplace = True)
    
    listings['total_description_chars'] = results
    

### Why a "total_character_count" feature?
I'm curious whether there is at least a weak correlation between overall description length and listing price. Perhaps hosts who write longer descriptions are more likely to be full-time hosts, maintain their property better, and provide better accommodations. In contrast, hosts with little to no description may be more likely to be occasional hosts who provide lower-quality accommodations. Accommodation quality is subjective and thus difficult to quantify. There is no numerical "quality" metric in these datasets, but the quality of an Airbnb listing should clearly have a strong correlation with its price. Two listings could be identical in location, number of bedrooms, number of bathrooms, amenities, etc., but their prices could be wildly different if one is of higher quality than the other.
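
One way to check this hunch is to compute a correlation between the two columns. The following is a minimal sketch, assuming the listings frame has a numeric `price` column alongside the `total_description_chars` feature; the toy values are invented for illustration and are not taken from the dataset.

```python
import pandas as pd

# toy stand-in for the listings data (values are invented for illustration)
toy = pd.DataFrame({
    "total_description_chars": [0, 120, 450, 900, 1500],
    "price": [40.0, 55.0, 80.0, 75.0, 130.0],
})

# Pearson correlation measures linear association between the raw values
pearson = toy["total_description_chars"].corr(toy["price"])

# Spearman correlation is the Pearson correlation of the ranks, so it only
# asks whether longer descriptions tend to go with higher prices
spearman = toy["total_description_chars"].rank().corr(toy["price"].rank())

print(f"pearson={pearson:.2f}, spearman={spearman:.2f}")
```

A rank-based measure is probably the safer first look here, since listing prices tend to have a few extreme values that can dominate a linear correlation.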

In [3]:
'''
creates a new feature that is the total number of listed amenities
'''
def make_total_amenities_feature(listings: pd.core.frame.DataFrame) -> None:
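    # count the comma-separated entries inside the brackets;
    # an empty list such as '[]' counts as 0 amenities rather than 1
    listings['total_amenities'] = listings.amenities.map(
        lambda amenities: len(amenities[1:-1].split(', ')) if amenities[1:-1] else 0)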

### Why a "total_amenities" feature?
The "total amenities" feature is just a general proxy feature. Clearly not all amenities should be weighted equally (e.g. pool > coffee maker), but there are so many different types of amenities that trying to determine how valuable each one is would be impractical. In addition, each entry in this column is a messy list of strings, full of typos and near-duplicate amenities:<br>
>- 'j\\u00c4son conditioner'
>- 'ikea refrigerator'
>- 'modern skyn alchemy body soap'
>- 'miele refrigerator'
>- etc.

Maybe I'll have to go back and try to clean up this data. Certain amenities like wifi, kitchen, and pool are likely to strongly affect the price.
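
Before cleaning anything, it may help to see which amenity strings are actually common. This is a rough sketch, assuming each `amenities` entry is a JSON-style list string such as `'["Wifi", "Pool"]'` (an assumption about the raw format); the toy rows and the `"refrigerator"` keyword are made up for illustration.

```python
import json
from collections import Counter

import pandas as pd

# toy stand-in for the amenities column (rows are invented for illustration)
toy = pd.DataFrame({"amenities": [
    '["Wifi", "Kitchen", "Pool"]',
    '["Wifi", "IKEA refrigerator"]',
    '["Wifi", "Miele refrigerator", "Kitchen"]',
    '[]',
]})

# json.loads copes with quotes, escapes and empty lists; lower-case for comparison
parsed = toy["amenities"].map(lambda s: [a.lower() for a in json.loads(s)])

# how often each exact amenity string appears across all listings
amenity_counts = Counter(a for amenities in parsed for a in amenities)
print(amenity_counts.most_common(3))

# collapse brand-specific variants into one flag by keyword (hypothetical keyword)
has_refrigerator = parsed.map(
    lambda amenities: any("refrigerator" in a for a in amenities))
```

Keyword matching like this would fold 'ikea refrigerator' and 'miele refrigerator' into one flag, at the cost of occasional false matches on unrelated strings that happen to contain the keyword.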

In [4]:
def make_specific_amenity_features(listings: pd.core.frame.DataFrame) -> None:
    # parses a string'd list of amenities
    def parse_amenity_list(amenity_list: str) -> list:
        return [amenity[1:-1] for amenity in amenity_list[1:-1].lower().split(", ")]

    # gets each listing's list of amenities
    listing_amenities = listings.amenities.map(parse_amenity_list)

    new_feature_names = ["has_pool", "has_wifi", "has_kitchen"]
    for feature_name in new_feature_names:
        feature = feature_name[4:]
        listings[feature_name] = listing_amenities.map(lambda amenity_list: feature in amenity_list)

### Why a "make_specific_amenity_features" function?
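This function creates one new boolean feature for each amenity in a fixed list (currently pool, wifi, and kitchen). Each entry is True if the listing has that amenity and False otherwise. A listing only counts as having the amenity when one of its parsed amenity strings matches the name exactly.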
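
To make the exact-match behaviour concrete, here is a small self-contained sketch that repeats the same parsing and membership test on a toy frame. The three rows are invented, and the `'["a", "b"]'` string format is an assumption carried over from the parsing code above.

```python
import pandas as pd

# toy stand-in for the listings data (rows are invented for illustration)
toy = pd.DataFrame({"amenities": [
    '["Wifi", "Kitchen"]',
    '["Pool", "Wifi"]',
    '["Shared pool"]',
]})

def parse_amenity_list(amenity_list: str) -> list:
    # same parsing as above: strip the brackets, split, then strip the quotes
    return [amenity[1:-1] for amenity in amenity_list[1:-1].lower().split(", ")]

listing_amenities = toy["amenities"].map(parse_amenity_list)

for feature_name in ["has_pool", "has_wifi", "has_kitchen"]:
    amenity = feature_name[len("has_"):]
    # exact membership test: 'shared pool' is not the same string as 'pool'
    toy[feature_name] = listing_amenities.map(lambda amenities: amenity in amenities)

print(toy[["has_pool", "has_wifi", "has_kitchen"]])
```

In this toy frame the third listing ends up with `has_pool` set to False even though it lists a shared pool, so a substring or keyword match might be worth considering if variants like that turn out to be common in the real data.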

In [5]:
np.nan + 0

nan