The listings data set contains listings from Seattle with descriptive and rating information about the host and the property.<br>
It also contains transactional information, such as the price, fees, and guest requirements.<br>
Focusing on the listings data set keeps the option open to add new features from the other two data sets later if necessary.<br>





## The Business Objective

The Airbnb platform is a service to bring two parties together, the host and guest.<br>
*For further analysis let us assume the perspective of a host.*<br>

**What is the host's objective?**

A host usually already has a property and does not need to acquire one to rent it out.<br>
The host wants to make extra money by renting the property out to guests.<br>
A host naturally wants to maximize the income from a property. One way to achieve that is to adjust the price.<br>
But how far can the price reasonably be raised before it becomes unrealistic and drives away potential guests?<br>

**The questions we want to answer based on the given data are:**
1. Which parameters influence a listing's price?
1. Which parameters can the host use to improve price and value?
1. Can we make a good price estimate for a new offer to assist a (new) host?






In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from ExploreData import sort_mean, value_counter, index_by_key
from TransformData import \
    price_transform, rate_transform, split_column_values, date_transform

%matplotlib inline


df_listings = pd.read_csv('./data/listings.csv')

## Understanding and Preparing the Data for Further Analysis

Before analyzing the data, some minimal cleaning needs to be done.<br>

1. We have already learned that some columns can be neglected.
2. Some columns that are very important for our questions have the wrong data type.

Since our questions are phrased openly, we might want to work with many columns in the data set.
Thus we should take a close look at all categorical columns that might be of interest.

### 1. Drop Columns

It has already been established in notebook number 00 that some columns can be dropped, since they carry no information for the questions at hand or no information at all.
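
The helper `value_counter` is imported from `ExploreData`, and its implementation is not shown here. As a rough sketch, assuming it returns one row per column with the fields `val_count` and `nan_pcnt` used in the next cell, such an overview could be built as follows. The function name `column_overview` and the demo frame are made up for illustration.

```python
import numpy as np
import pandas as pd


def column_overview(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical stand-in for value_counter: one row per column of df."""
    return pd.DataFrame({
        # Number of distinct non-null values per column
        'val_count': df.nunique(dropna=True),
        # Share of missing values per column, in percent
        'nan_pcnt': df.isna().mean() * 100,
    })


# Tiny made-up frame, only for illustration
demo = pd.DataFrame({
    'constant': ['a', 'a', 'a', 'a'],
    'all_missing': [np.nan, np.nan, np.nan, np.nan],
    'useful': [1, 2, 3, 4],
})
overview = column_overview(demo)

# Same selection logic as in the notebook cell
const_cols = overview.loc[overview['val_count'] == 1].index
nan_cols = overview.loc[overview['nan_pcnt'] >= 90].index
```

With such a table, constant columns and mostly empty columns can be selected with plain boolean filters.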

In [None]:
# Create an overview of the columns
list_column_values = value_counter(df_listings)

# Identify columns with constant values
list_const_drop = list_column_values.loc[
    list_column_values['val_count']==1
    ].index

# Identify columns with at least 90% missing values
list_nan_drop = list_column_values.loc[
    list_column_values['nan_pcnt']>=90
    ].index

# Identify url columns
list_url_drop = df_listings.columns[
    df_listings.columns.to_series().str.contains('url',case=False)]

# Additional columns to drop
list_else_drop = df_listings[[
    'city', 'state', 'smart_location',
    'host_name', 'host_location',
    'neighbourhood', 'neighbourhood_cleansed'
    ]].columns

# Combine the lists
drop_columns_list = list_const_drop.append(
    list_nan_drop).append(
    list_url_drop).append(
    list_else_drop)
# Drop the columns
listings_drop_col = df_listings.drop(columns=drop_columns_list)

# Check what has been dropped
df_listings[drop_columns_list].head()


### 2. Transform Object-Columns

In [None]:
# Isolate object column names
cat_listings = listings_drop_col.select_dtypes(include=['object'])

# Create a df with column information
cat_column_values = value_counter(cat_listings)

# How many categorical values are there?
cat_column_values.shape[0]


There are many object columns. Not all of them are of use for the questions at hand.<br>
So we look at them one by one.

**Starting with columns that have many different values**<br>
Many of these are text and have more than 80% unique values.<br>
But there are also the *amenities*, which are basically lists of values per entry. They can be extracted into a set of amenity columns.<br>
One can also transform the *price* to numeric after dropping the $-sign.

In [None]:
# Find categorical columns with more than 200 unique values
cat_column_values.loc[cat_column_values['val_count']>200]

In [None]:
# Check the data
listings_drop_col[cat_column_values.loc[cat_column_values['val_count']>200].index].head()

#### Amenities
The column has many unique values. But each entry is a list, and the set of unique list entries is not that big.<br>

**Create a column for each unique amenity with values 0 and 1**
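
For comparison, the same idea can be sketched with pandas string methods alone. This is not the `split_column_values` helper used in this notebook, only a minimal illustration on made-up entries. It assumes that no amenity name contains a comma.

```python
import pandas as pd

# Made-up entries in the same brace-delimited format as the amenities column
amenities = pd.Series([
    '{TV,"Cable TV",Internet}',
    '{Internet}',
    '{}',
])

# Strip braces and quotes, then create one 0/1 column per comma-separated value
cleaned = amenities.str.replace(r'[{}"]', '', regex=True)
amenity_dummies = cleaned.str.get_dummies(sep=',').add_prefix('amenities_')
```

Entries without any amenity end up as rows of zeros, so no separate handling of the empty string is needed in this variant.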

In [None]:
# First: Identify all possible values:
# Create an auxiliary list to carry all values from the split entries
all_splitted = []
# Split every entry into a list of amenities and append the result to all_splitted
for i in listings_drop_col.index:
    entry_split = listings_drop_col.loc[i, 'amenities'].replace('{', '').replace('}', '').replace('"', '').split(sep=",")
    all_splitted.extend(entry_split)

# Create all_entries by removing duplicates and the empty string from all_splitted
all_entries = list(set(all_splitted) - {''})

# Second: Use the split_column_values function
# to create a new column for every value
split_column_values(
    listings_drop_col,  # data-frame
    'amenities',  # column name
    all_entries,  # values in a list
    'amenities_'  # prefix for new columns
    )

# Drop the original column 'amenities'
listings_drop_col.drop(columns=['amenities'], inplace= True)

# Check the result:
listings_drop_col[index_by_key(listings_drop_col, ['amenities'])]

#### Date Columns

There are three dates in the above extract.
* first_review
* last_review
* host_since

At this point it is not clear whether they are all needed.<br>
But it is no big step to transform them into a date type.<br>
The function date_transform() also creates separate columns for day, month and year.
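
The implementation of `date_transform` lives in `TransformData` and is not shown here. A minimal sketch of such a transformation, assuming the new columns are named prefix plus `year`, `month`, and `day` (as `host_since_year` in the cell below suggests), could look like this. The function name `date_parts` is hypothetical.

```python
import pandas as pd


def date_parts(df: pd.DataFrame, column: str, prefix: str) -> pd.DataFrame:
    """Hypothetical sketch: parse a column as dates and add year/month/day columns in place."""
    # Entries that cannot be parsed become NaT instead of raising an error
    parsed = pd.to_datetime(df[column], errors='coerce')
    df[column] = parsed
    df[prefix + 'year'] = parsed.dt.year
    df[prefix + 'month'] = parsed.dt.month
    df[prefix + 'day'] = parsed.dt.day
    return df


# Made-up values, only for illustration
demo = pd.DataFrame({'host_since': ['2011-08-11', '2013-02-21', None]})
date_parts(demo, 'host_since', 'host_since_')
```

Missing dates propagate as NaN into the derived columns, which therefore become floats rather than integers.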

In [None]:
# Transform all dates into three columns
date_transform(listings_drop_col, 'host_since', 'host_since_')
date_transform(listings_drop_col, 'first_review', 'first_review_')
date_transform(listings_drop_col, 'last_review', 'last_review_')
# Drop original columns
listings_drop_col = listings_drop_col.drop(
    columns=['host_since', 'first_review', 'last_review']
    )
# Check the result:
listings_drop_col[['host_since_year']].head()

The columns with more than 200 unique values also include the *price* column, but there are more columns with currency values.<br>
We check the remaining columns first and transform all currency-values at once.

In [None]:
# Create an overview of the column values
cat_column_values.loc[cat_column_values['val_count']<=200]

In [None]:
# Inspect the selected columns
listings_drop_col[cat_column_values.loc[cat_column_values['val_count']<=200].index].head()

#### Currency Columns

There are six columns containing $-values. To transform them into numbers, one can use the *price_transform* function.
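
What `price_transform` does internally is not shown in this notebook. A plausible minimal sketch, assuming the values are strings such as `$1,250.00`, is to strip the dollar sign and the thousands separator before converting. The function name `currency_to_float` is made up for illustration.

```python
import pandas as pd


def currency_to_float(series: pd.Series) -> pd.Series:
    """Hypothetical sketch: turn strings like '$1,250.00' into floats."""
    # Remove the dollar sign and thousands separators; missing values stay missing
    cleaned = series.str.replace(r'[$,]', '', regex=True)
    return pd.to_numeric(cleaned)


# Made-up values, only for illustration
prices = pd.Series(['$85.00', '$1,250.00', None])
prices_num = currency_to_float(prices)
```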

In [None]:
# Identify currency columns
search_values = ['price', 'fee', 'deposit', 'extra']
price_column_names = index_by_key(listings_drop_col, search_values)
# Check before transformation
listings_drop_col[price_column_names].head()

In [None]:
# The currency columns are strings: convert them to float
for col in price_column_names:
    listings_drop_col[col] = price_transform(listings_drop_col[col])
# Check after transformation
listings_drop_col[price_column_names].head()

#### Rates

Columns containing a rate can be handled just like currencies, but with the *rate_transform* function.
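
As with the currencies, the body of `rate_transform` is not part of this notebook. The sketch below assumes that rates are stored as strings like `96%` and converts them to fractions between 0 and 1. Whether the original helper also divides by 100 is an assumption, and the name `percent_to_fraction` is hypothetical.

```python
import pandas as pd


def percent_to_fraction(series: pd.Series) -> pd.Series:
    """Hypothetical sketch: turn strings like '96%' into fractions such as 0.96."""
    cleaned = series.str.rstrip('%')
    # Placeholders that are not numbers (for example 'N/A') become NaN
    return pd.to_numeric(cleaned, errors='coerce') / 100


# Made-up values, only for illustration
rates = pd.Series(['96%', '100%', 'N/A', None])
rates_num = percent_to_fraction(rates)
```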

In [None]:
# Identify columns
search_values_rate = ['rate']
rate_column_names = index_by_key(listings_drop_col, search_values_rate)

listings_drop_col[rate_column_names].head()

In [None]:
# Transform to float and check the result:
for col in rate_column_names:
    listings_drop_col[col] = rate_transform(listings_drop_col[col])

listings_drop_col[rate_column_names].head()

#### Binary Object-Columns

Many columns contain only boolean information stored in strings: 't' and 'f'. These get mapped to 1 and 0.
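
A column with exactly two distinct values is not necessarily a 't'/'f' column, and mapping any other pair of values would silently produce missing values. A small guard, sketched here on a made-up frame (the column `bed_kind` is invented for illustration), only maps columns whose non-null values are all covered by the map.

```python
import pandas as pd

binary_map = {'t': 1, 'f': 0}

# Made-up frame: one real t/f column and one two-valued column that is not t/f
demo = pd.DataFrame({
    'instant_bookable': ['t', 'f', 't'],
    'bed_kind': ['Real Bed', 'Futon', 'Real Bed'],
})

for col in demo.columns:
    values = set(demo[col].dropna().unique())
    # Only map columns whose non-null values are all 't' or 'f'
    if values <= set(binary_map):
        demo[col] = demo[col].map(binary_map)
```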

In [None]:
# Identify binary object columns
binary_cols = cat_column_values.loc[cat_column_values['val_count']==2 ].index

listings_drop_col[binary_cols].head()

In [None]:
# Define a value map:
binary_map = {'t': 1, 'f': 0}

# Map binary object-columns to 0 and 1:
for col in binary_cols:
    if listings_drop_col[col].dtype == 'object':
        listings_drop_col[col] = listings_drop_col[col].map(binary_map)

listings_drop_col[binary_cols].head()

#### Ordinal Object-Columns

Two columns, host_response_time and cancellation_policy, have very few values that can be sorted in some way.<br>
We sort them and replace strings with increasing numbers.
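
An alternative to hand-written value maps is an ordered categorical, sketched below on made-up values. This is not what the following cells do. Note one difference: missing values receive the code -1 here, whereas `map` leaves them as NaN.

```python
import pandas as pd

# Order from least to most restrictive
policy_order = ['flexible', 'moderate', 'strict']

# Made-up values, only for illustration
policies = pd.Series(['strict', 'flexible', 'moderate', None])

# An ordered categorical assigns increasing integer codes in the given order
policy_cat = pd.Categorical(policies, categories=policy_order, ordered=True)
policy_codes = pd.Series(policy_cat.codes, index=policies.index)
```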

In [None]:
listings_drop_col[['cancellation_policy', 'host_response_time']].head()

In [None]:
# Values for cancellation_policy
listings_drop_col['cancellation_policy'].value_counts()

In [None]:
# Values for host_response_time
listings_drop_col['host_response_time'].value_counts()

In [None]:
# Create value maps for both columns
policy_map = {'strict':2, 'moderate':1, 'flexible':0}
response_map = {'a few days or more':3, 'within a day':2, 'within a few hours':1, 'within an hour':0}

# Map both columns to new values
listings_drop_col['host_response_time'] = listings_drop_col['host_response_time'].map(response_map)
listings_drop_col['cancellation_policy'] = listings_drop_col['cancellation_policy'].map(policy_map)

# Check
listings_drop_col[['cancellation_policy', 'host_response_time']].head()

In [None]:
# What do the host columns look like?
search_values_host = ['host', 'type']
host_column_names = listings_drop_col.columns[
    listings_drop_col.columns.to_series().str.contains('|'.join(search_values_host),case=False)]

listings_drop_col[host_column_names].head()

#### host_verifications

There is one column left that should be transformed at this point.<br>
host_verifications looks similar to amenities, with only small differences.

**Create a column for each unique verification method with values 0 and 1**
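
Because the entries look like Python list literals, they can also be parsed instead of being cleaned character by character. The following is a minimal sketch on made-up entries, assuming every entry is either a list literal or the string 'None'. The prefix `host_verifications_` and the helper name `parse_list` are chosen for illustration only.

```python
import ast

import pandas as pd

# Made-up entries in the same format as the host_verifications column
verifications = pd.Series([
    "['email', 'phone', 'reviews']",
    "['phone']",
    "[]",
    "None",
])


def parse_list(entry: str) -> list:
    """Parse a list-like string; 'None' or an empty list yields []."""
    return ast.literal_eval(entry) or []


parsed = verifications.apply(parse_list)

# Join each list with '|' and create one 0/1 column per verification method
verification_dummies = (
    parsed.str.join('|')
    .str.get_dummies(sep='|')
    .add_prefix('host_verifications_')
)
```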

In [None]:
listings_drop_col['host_verifications'].head()

In [None]:
# Split entries into single values and create one column per value

# Find all values
all_splitted = []
for i in listings_drop_col.index:
    entry_split = listings_drop_col['host_verifications'].loc[i].replace(
        '[', '').replace(']', '').replace("'", "").split(sep=", ")
    all_splitted.extend(entry_split)
# Remove duplicates as well as the placeholders '' and 'None'
all_entries = list(set(all_splitted) - {'', 'None'})
# Create new columns and add a prefix to new column names
split_column_values(
        listings_drop_col, 
        'host_verifications', 
        all_entries, 
        '_host_verifications'
        )
# Drop the original column
listings_drop_col = listings_drop_col.drop(columns='host_verifications')

# Check the result
listings_drop_col[index_by_key(listings_drop_col, ['host_verifications'])].head()


## Done

For now, all relevant columns have been transformed into a type and shape that makes them accessible for further analysis.<br>
A final look at the object columns confirms that their number has been reduced significantly.<br>
The data frame's dimensions, on the other hand, show an increased number of columns.

In [None]:
# The data frame's dimensions (shape is an attribute, not a method)
listings_drop_col.shape


In [None]:
# How many object columns are left and are they relevant to the problem?

# Isolate object column names:
cat_listings = listings_drop_col.select_dtypes(include=['object'])
# Create a df with column information: 
cat_column_values = value_counter(cat_listings)

cat_column_values