# 01 - Estimating metrics
Often, the data we are provided does not (yet) contain the metrics we need. In these first exercises, we are going to load in some data (similar to the data we used before) and estimate some useful metrics.

## Loading and understanding data
Let's import `pandas`, load in our dataset, and `print` the columns in the dataset.

(Note: the dataset is named `wk3_listings_sample.csv` because it differs slightly from the one we used before.)

In [2]:
import pandas as pd
df_listings = pd.read_csv('../data/wk3_listings_sample.csv')
print(df_listings.columns)

Index(['id', 'listing_url', 'scrape_id', 'last_scraped', 'name', 'description',
       'neighborhood_overview', 'picture_url', 'host_id', 'host_url',
       'host_name', 'host_since', 'host_location', 'host_about',
       'host_response_time', 'host_response_rate', 'host_acceptance_rate',
       'host_is_superhost', 'host_thumbnail_url', 'host_picture_url',
       'host_neighbourhood', 'host_listings_count',
       'host_total_listings_count', 'host_verifications',
       'host_has_profile_pic', 'host_identity_verified', 'neighbourhood',
       'neighbourhood_cleansed', 'neighbourhood_group_cleansed', 'latitude',
       'longitude', 'property_type', 'room_type', 'accommodates', 'bathrooms',
       'bathrooms_text', 'bedrooms', 'beds', 'amenities', 'price',
       'minimum_nights', 'maximum_nights', 'minimum_minimum_nights',
       'maximum_minimum_nights', 'minimum_maximum_nights',
       'maximum_maximum_nights', 'minimum_nights_avg_ntm',
       'maximum_nights_avg_ntm', 'calendar_upd

As you can see, there are lots of columns in this dataset. If you would like to understand the columns better, then you can look at the [data dictionary](https://docs.google.com/spreadsheets/d/1iWCNJcSutYqpULSQHlNyGInUvHg2BoUGoNRIGa6Szc4/edit#gid=982310896) of the dataset.
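
With this many columns, it can help to narrow the list down before turning to the data dictionary. The sketch below filters column names by a substring. It uses a small stand-in DataFrame (`toy`, a hypothetical name) with a handful of the column names printed above, so it runs without the CSV file; on the real data you would apply the same list comprehension to `df_listings.columns`.

```python
import pandas as pd

# Stand-in frame with a few of the real column names (no rows needed).
toy = pd.DataFrame(columns=[
    'id', 'price', 'accommodates',
    'minimum_nights', 'maximum_nights', 'minimum_maximum_nights',
])

# Keep only the columns whose name mentions "nights".
night_columns = [col for col in toy.columns if 'nights' in col]
print(night_columns)
```

Calling `df_listings.columns.tolist()` is another option if you want to see every column name in one list.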

### Exercise-01: Understanding columns
**Question:** What do the values in the `minimum_maximum_nights` column represent?

## Price per person
You are asked to estimate the "price per person" for each listing in the dataset. To do this, you first need to format the `price` column so that it contains values you can perform calculations on (as we did last week). Let's do this below. Create a `price_$` column holding the price of each listing in dollars as a `float`, and show the `head` of the `price` and `price_$` columns to check that it worked.

In [2]:
# (SOLUTION)
def format_price(price):
    # Strip the '$' sign and any thousands separators, then convert to float
    return float(price.replace('$', '').replace(',', ''))

df_listings['price_$'] = df_listings['price'].apply(format_price)
df_listings[['price','price_$']].head()

     price  price_$
0  $128.00    128.0
1   $70.00     70.0
2   $39.00     39.0
3   $85.00     85.0
4   $40.00     40.0
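
As an alternative to `apply` with a helper function, the same cleaning can be sketched with pandas' vectorised string methods. The example below runs on a small made-up DataFrame (`toy` and its values are illustrative, not taken from the dataset) so that it is self-contained; the pattern should carry over to `df_listings['price']`, assuming every price is a string like `$1,250.00`.

```python
import pandas as pd

# Illustrative prices in the same text format as the `price` column.
toy = pd.DataFrame({'price': ['$128.00', '$1,250.00', '$39.00']})

# Remove '$' and ',' as literal characters (regex=False), then convert to float.
toy['price_$'] = (
    toy['price']
    .str.replace('$', '', regex=False)
    .str.replace(',', '', regex=False)
    .astype(float)
)
print(toy)
```

Passing `regex=False` matters here because `$` has a special meaning in regular expressions.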


### Exercise-02: Estimate price-per-person
With the price values in a `float` format, we can use the `/` (division) operator to divide the values in the `price_$` column by the values in the `accommodates` column, and create a new column named `price_$/person`. Run the code shown below to do this and show the `head` of the relevant columns to check it's working OK.

In [3]:
df_listings['price_$/person'] = df_listings['price_$'] / df_listings['accommodates']
df_listings[['price_$','accommodates','price_$/person']].head()

   price_$  accommodates  price_$/person
0    128.0             5            25.6
1     70.0             1            70.0
2     39.0             2            19.5
3     85.0             3       28.333333
4     40.0             1            40.0


We can also sort the rows of `df_listings` by the values in column `price_$/person` to identify which listing has the highest `price_$/person` of all the listings in the dataset. Run the code below to show the neighbourhood of the listing with the highest `price_$/person` in the dataset.

In [4]:
df_listings.sort_values(by='price_$/person', ascending=False).head(1)['neighbourhood_cleansed']

26518    Westminster
Name: neighbourhood_cleansed, dtype: object
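
Sorting the whole DataFrame is not the only way to find the top row. As a minimal sketch, `idxmax` returns the index label of the largest value in a column, and `.loc` then looks up that row. The three-row `toy` frame below is a stand-in so the example runs on its own; it is not the full dataset.

```python
import pandas as pd

# Small stand-in frame with the two columns we need.
toy = pd.DataFrame({
    'neighbourhood_cleansed': ['Lambeth', 'Westminster', 'Greenwich'],
    'price_$/person': [25.6, 70.0, 19.5],
})

# Index label of the row with the highest price per person.
top_label = toy['price_$/person'].idxmax()

# Look up the neighbourhood for that row.
top_neighbourhood = toy.loc[top_label, 'neighbourhood_cleansed']
print(top_neighbourhood)
```

If several listings tie for the highest value, `idxmax` returns only the first of them.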

**Question:** What neighbourhood is it in?

Look again in the [data dictionary](https://docs.google.com/spreadsheets/d/1iWCNJcSutYqpULSQHlNyGInUvHg2BoUGoNRIGa6Szc4/edit#gid=982310896) to understand what values in the `accommodates` column represent. 

**Question:** Do you think using `accommodates` is a good way to estimate the price per person for each listing? If not, why not?

**Question:** Why might someone want to estimate the price per person for each listing?

## Forecasting income
You are asked to forecast how much money (in \$) each listing is likely to receive over the next 30 days. To do this, you decide to use the `availability_30` column to calculate how many of the next 30 nights are booked, and then multiply this number by `price_$` using the `*` (multiplication) operator.

### Exercise-03: Forecast 30-day income
Complete the code below to estimate the income for each listing over the next 30 days. Use Exercise-02 as a starting point.

In [5]:
# (SOLUTION)
df_listings['estimated_income_30'] = df_listings['price_$'] * (30 - df_listings['availability_30'])
df_listings[['price_$','availability_30', 'estimated_income_30']].head()

   price_$  availability_30  estimated_income_30
0    128.0               22               1024.0
1     70.0                0               2100.0
2     39.0               30                  0.0
3     85.0               28                170.0
4     40.0               16                560.0
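
To make the arithmetic explicit, the sketch below splits the calculation into two steps on a small stand-in frame (`toy`, using the first three rows shown above). It assumes, as the exercise does, that every night not listed as available is a booked night; that is a simplification, since a night could be unavailable for other reasons.

```python
import pandas as pd

# Stand-in frame with the first three rows of the output above.
toy = pd.DataFrame({
    'price_$': [128.0, 70.0, 39.0],
    'availability_30': [22, 0, 30],
})

# Step 1: nights assumed booked = 30 minus the nights still available.
booked_nights = 30 - toy['availability_30']

# Step 2: estimated income = nightly price times booked nights.
toy['estimated_income_30'] = toy['price_$'] * booked_nights
print(toy)
```

For the first row this gives 30 - 22 = 8 booked nights, and 8 × 128.0 = 1024.0.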


In the code block below, sort the rows of `df_listings` by the values in `estimated_income_30` to identify the listing with the highest forecasted income for the next 30 days.

In [6]:
# (SOLUTION)
df_listings.sort_values(by='estimated_income_30', ascending=False).head(1)['neighbourhood_cleansed']

17210    Haringey
Name: neighbourhood_cleansed, dtype: object

**Question:** What's the value of `neighbourhood_cleansed` for this listing?

**Question:** Why might someone want to forecast the next 30 days' income for each listing?

In [3]:
df_listings['neighbourhood_cleansed']

0                       Lambeth
1                   Westminster
2                     Greenwich
3                       Lambeth
4                    Wandsworth
                  ...          
29995    Kensington and Chelsea
29996                 Southwark
29997                    Ealing
29998    Kensington and Chelsea
29999                   Hackney
Name: neighbourhood_cleansed, Length: 30000, dtype: object

## Further work
If you want to explore the data further, please do, and think about what other metrics you might be able to estimate from the data and how they might be used.