
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

# Data Analysis with Pandas


<img src="https://files.training.databricks.com/images/301/sf.jpg" style="height: 200px; margin: 10px; border: 1px solid #ddd; padding: 10px"/>

You'll be analyzing data from [Inside Airbnb](http://insideairbnb.com/get-the-data.html) to better understand the San Francisco rental market.

0. Read the SF Airbnb data into a pandas DataFrame
0. Select a subset of columns
0. Sort by number of bedrooms, largest first
0. Fill in missing values
0. Compute the average number of bathrooms
0. Filter to listings in the Financial District
0. Plot the most common property listings in the Financial District

Read the file located at `/dbfs/databricks-datasets/learning-spark-v2/sf-airbnb/sf-airbnb.csv` into a pandas DataFrame, and display the first 5 records.

In [0]:
# ANSWER
import pandas as pd

df = pd.read_csv("/dbfs/databricks-datasets/learning-spark-v2/sf-airbnb/sf-airbnb.csv")
df.head()

We are not interested in all of the columns in this DataFrame, so let's select just these columns:

`"beds", "bedrooms", "bathrooms", "property_type", "neighbourhood_cleansed"` and assign the result to the variable `df`.

NOTE: We are not looking at the `price` column for now because we would first need to convert it from a string to a double (removing the `$` and `,` from the values).

In [0]:
# ANSWER
df = df[["beds", "bedrooms", "bathrooms", "property_type", "neighbourhood_cleansed"]]
df.head()
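
The note above says `price` would need cleaning before it can be used as a number. As a minimal sketch of what that conversion might look like, the cell below uses a few made-up price strings (an assumption about the column's format, not values read from the dataset) and strips the `$` and `,` before casting to float.

```python
import pandas as pd

# Hypothetical sample values; the real `price` column is assumed to look like this.
prices = pd.Series(["$1,250.00", "$85.00", "$300.00"])

# Remove "$" and "," as literal characters (regex=False), then cast to float.
cleaned = (
    prices.str.replace("$", "", regex=False)
          .str.replace(",", "", regex=False)
          .astype(float)
)

print(cleaned)
```

Passing `regex=False` matters here because `$` is a special character in regular expressions.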

Now that we have the columns that we want, we would like to view the listings with the highest number of bedrooms first. We can do this using the [.sort_values()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sort_values.html) method!

In [0]:
# ANSWER
df.sort_values("bedrooms", ascending=False)
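
Two details of `sort_values()` are easy to miss: it returns a new DataFrame rather than changing the original, and missing values are placed last by default. The sketch below illustrates both on a small toy DataFrame (invented values, not taken from the Airbnb file).

```python
import numpy as np
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({"bedrooms": [1.0, np.nan, 3.0, 2.0]})

# Descending sort; NaN rows go to the end (na_position="last" is the default).
sorted_toy = toy.sort_values("bedrooms", ascending=False)

print(sorted_toy)
print(toy)  # the original row order is unchanged
```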

### Fill Missing Values
If you scroll through the rows carefully you'll notice that some of the entries say `NaN` instead of a number. Run the following cell to pick out and display those listings using [isna()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.isna.html).

In [0]:
df[df.isna().any(axis=1)]
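
To see how the expression above works, it can help to break it into steps. The sketch below does that on a toy DataFrame with invented values; the names `na_mask`, `row_has_na`, and `rows_with_na` are only for illustration.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "beds": [1.0, np.nan, 2.0],
    "bathrooms": [1.0, 1.5, np.nan],
    "property_type": ["House", "Apartment", "Condominium"],
})

na_mask = toy.isna()               # same shape as toy, True where a value is missing
row_has_na = na_mask.any(axis=1)   # one boolean per row: True if any column is missing
rows_with_na = toy[row_has_na]     # keep only the rows flagged True

print(rows_with_na)
```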

We're going to assume that if a listing has no value for `beds`, `bedrooms`, or `bathrooms`, then the number should have been 0.

Let's go ahead and fill the missing values for `beds`, `bedrooms`, and `bathrooms` with `0`.

In [0]:
# ANSWER
df = df.fillna(0)
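
Note that `fillna(0)` fills missing values in every column, including any non-numeric ones. If you wanted to restrict the fill to the three numeric columns, one possible approach is to pass a dictionary of per-column fill values. The sketch below shows this on a toy DataFrame with invented values.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "beds": [1.0, np.nan],
    "bedrooms": [np.nan, 2.0],
    "bathrooms": [1.0, np.nan],
    "property_type": ["House", None],
})

# Only the listed columns are filled; other columns keep their missing values.
filled = toy.fillna({"beds": 0, "bedrooms": 0, "bathrooms": 0})

print(filled)
```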

### Average # Bathrooms
What is the average number of bathrooms across these Airbnb listings?

In [0]:
# ANSWER
df["bathrooms"].mean()
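
Keep in mind that the earlier choice to fill missing values with 0 affects this average. By default `mean()` skips missing values, whereas filled zeros are counted. The sketch below compares the two on a toy Series with invented values.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only.
bathrooms = pd.Series([1.0, 2.0, np.nan, 3.0])

mean_skipping_na = bathrooms.mean()            # NaN ignored: (1 + 2 + 3) / 3
mean_after_fill = bathrooms.fillna(0).mean()   # NaN counted as 0: (1 + 2 + 0 + 3) / 4

print(mean_skipping_na, mean_after_fill)
```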

### Filter

Suppose we'll only be staying near the `Financial District`, so we only want to view listings in that neighbourhood.

In [0]:
# ANSWER
financial_district_df = df[df["neighbourhood_cleansed"] == "Financial District"]
financial_district_df.head()

### Plot

We want to see what the most common types of property listings around `Financial District` are!

[Plot](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.html) the count of each `property_type`.

In [0]:
# ANSWER
financial_district_df.groupby(["property_type"]).count().sort_values("beds", ascending=False).plot(kind="bar", y="beds")
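
As an alternative to the `groupby` / `count` / `sort_values` chain, `value_counts()` counts each distinct value and returns the result already sorted from most to least common. The sketch below shows this on a toy DataFrame with invented property types; it only computes the counts and does not draw the chart.

```python
import pandas as pd

# Toy data for illustration only.
toy = pd.DataFrame({
    "property_type": [
        "Apartment", "Condominium", "Apartment",
        "Hotel", "Apartment", "Condominium",
    ]
})

# One row per distinct property type, most common first.
counts = toy["property_type"].value_counts()

print(counts)
```

Chaining `.plot(kind="bar")` onto `counts` should give a bar chart similar to the one above, assuming matplotlib is available in your environment.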

&copy; 2020 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>