<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://raw.githubusercontent.com/databricks/koalas/master/Koalas-logo.png" width="220"/>
</div>

The Koalas project makes data scientists more productive when interacting with big data, by implementing the pandas DataFrame API on top of Apache Spark. By unifying the two ecosystems with a familiar API, Koalas offers a seamless transition between small and large data.

**Goals of this notebook:**
* Demonstrate the similarities between the Koalas API and the pandas API
* Understand the differences in syntax for the same DataFrame operations in Koalas vs PySpark

[Koalas Docs](https://koalas.readthedocs.io/en/latest/index.html)

[Koalas GitHub](https://github.com/databricks/koalas)

**NOTE**: You must first install the `koalas` package from PyPI.

### Read in the dataset

* PySpark
* pandas
* Koalas

In [None]:
# PySpark
df = spark.read.parquet("/databricks-datasets/learning-spark-v2/sf-airbnb/sf-airbnb-clean.parquet")

In [None]:
# pandas (reads from the local filesystem, so the DBFS path needs the /dbfs prefix)
import pandas as pd

pdDF = pd.read_parquet("/dbfs/databricks-datasets/learning-spark-v2/sf-airbnb/sf-airbnb-clean.parquet")
pdDF.head()

In [None]:
# Koalas
import databricks.koalas as ks

kdf = ks.read_parquet("/databricks-datasets/learning-spark-v2/sf-airbnb/sf-airbnb-clean.parquet")
kdf.head()

### Converting a Koalas DataFrame to/from a Spark DataFrame

In [None]:
# Creating a Koalas DataFrame from PySpark DataFrame
kdf = ks.DataFrame(df)

In [None]:
# Alternative way of creating a Koalas DataFrame from PySpark DataFrame
kdf = df.to_koalas()
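The two cells above only go from PySpark to Koalas. For the opposite direction, a minimal sketch follows. It relies on the Koalas DataFrame method `to_spark()`; the wrapper name `koalas_to_spark` is hypothetical and exists only to keep the example self-contained.

```python
def koalas_to_spark(koalas_df):
    """Return the PySpark DataFrame behind a Koalas DataFrame.

    `koalas_to_spark` is a hypothetical helper name used for illustration;
    `to_spark()` is the Koalas method that does the conversion.
    """
    return koalas_df.to_spark()
```

In this notebook the call would be `sdf = koalas_to_spark(kdf)`, or simply `sdf = kdf.to_spark()`, after which `sdf` can be used with the regular PySpark API.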

### Value Counts

In [None]:
# To get value counts of the different property types with PySpark
display(df.groupby("property_type").count().orderBy("count", ascending=False))

In [None]:
# Value counts in Koalas
kdf["property_type"].value_counts()

### Visualizations with Koalas DataFrames

In [None]:
kdf.plot(kind="hist", x="bedrooms", y="price", bins=200)

### SQL on Koalas DataFrames

In [None]:
ks.sql("select distinct(property_type) from {kdf}")
