# 10 minutes to Koalas

This is a short introduction to Koalas, geared mainly toward new users. This notebook shows you some key differences between pandas and Koalas. You can run these examples yourself in a live notebook [here](https://mybinder.org/v2/gh/databricks/koalas/master?filepath=docs%2Fsource%2Fgetting_started%2F10min.ipynb). For Databricks Runtime, you can import and run [the current .ipynb file](https://raw.githubusercontent.com/databricks/koalas/master/docs/source/getting_started/10min.ipynb) out of the box. Try it on [Databricks Community Edition](https://community.cloud.databricks.com/) for free.

Koalas has since been migrated into Apache Spark as the pandas API on Spark (`pyspark.pandas`), so this notebook imports `pyspark.pandas` as `ps` instead of `databricks.koalas` as `ks`.

See https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/index.html

Customarily, the imports look as follows:


In [1]:
import pandas as pd
import numpy as np

In [9]:
# See: https://spark.apache.org/docs/latest/api/python/migration_guide/koalas_to_pyspark.html

# import databricks.koalas as ks
from pyspark import __version__ as pyspark_version
import pyspark.pandas as ps
from pyspark.sql import SparkSession

In [10]:
pyspark_version

'3.3.0'

## Object Creation



Creating a Koalas Series by passing a list of values, letting Koalas create a default integer index:

In [3]:
s = ps.Series([1, 3, 5, np.nan, 6, 8])


In [4]:
s

                                                                                

0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64

Creating a Koalas DataFrame by passing a dict of objects that can be converted to series-like.

In [5]:
kdf = ps.DataFrame(
    {'a': [1, 2, 3, 4, 5, 6],
     'b': [100, 200, 300, 400, 500, 600],
     'c': ["one", "two", "three", "four", "five", "six"]},
    index=[10, 20, 30, 40, 50, 60])



In [6]:
kdf

    a    b      c
10  1  100    one
20  2  200    two
30  3  300  three
40  4  400   four
50  5  500   five
60  6  600    six


Creating a pandas DataFrame by passing a numpy array, with a datetime index and labeled columns:

In [11]:
dates = pd.date_range('20130101', periods=6)

In [12]:
dates

DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')

In [13]:
pdf = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))

In [14]:
pdf

                    A          B          C          D
2013-01-01   1.469465   0.498672  -0.888052   1.026008
2013-01-02  -1.226854  -0.185018  -0.712293   1.661331
2013-01-03  -0.106912   1.601921  -0.237452   0.202532
2013-01-04  -1.803351   0.644592  -1.010620   0.311429
2013-01-05   1.109027   1.445527  -0.426108   0.378924
2013-01-06  -0.504450   0.865630   1.777991  -0.410543


Now, this pandas DataFrame can be converted to a Koalas DataFrame:

In [15]:
kdf = ps.from_pandas(pdf)



In [16]:
type(kdf)

pyspark.pandas.frame.DataFrame

It looks and behaves the same as a pandas DataFrame, though:

In [17]:
kdf

                    A          B          C          D
2013-01-01   1.469465   0.498672  -0.888052   1.026008
2013-01-02  -1.226854  -0.185018  -0.712293   1.661331
2013-01-03  -0.106912   1.601921  -0.237452   0.202532
2013-01-04  -1.803351   0.644592  -1.010620   0.311429
2013-01-05   1.109027   1.445527  -0.426108   0.378924
2013-01-06  -0.504450   0.865630   1.777991  -0.410543


It is also possible to create a Koalas DataFrame from a Spark DataFrame.

Creating a Spark DataFrame from a pandas DataFrame:

In [18]:
spark = SparkSession.builder.getOrCreate()

In [19]:
sdf = spark.createDataFrame(pdf)



In [20]:
sdf.show()

+--------------------+--------------------+-------------------+--------------------+
|                   A|                   B|                  C|                   D|
+--------------------+--------------------+-------------------+--------------------+
|  1.4694652171865765|  0.4986716439660975|-0.8880522820065168|  1.0260084423167666|
| -1.2268536655171967|-0.18501819489047755|-0.7122925692850886|  1.6613310222597077|
|-0.10691236076714528|  1.6019205912607584|-0.2374519128781401| 0.20253158909677252|
| -1.8033513489727653|  0.6445924294734232| -1.010619732641442|  0.3114287838467922|
|  1.1090269153720378|  1.4455273676180305|-0.4261077840671671| 0.37892431636089263|
| -0.5044502665930136|  0.8656301390647213| 1.7779908817147398|-0.41054299726130783|
+--------------------+--------------------+-------------------+--------------------+



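Creating a Koalas DataFrame from a Spark DataFrame.
`to_koalas()` is automatically attached to Spark DataFrames and available as an API when Koalas is imported. With `pyspark.pandas` the call still works here, but it is deprecated; newer PySpark releases provide `DataFrame.pandas_api()` as the replacement (see the migration guide linked above).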

In [21]:
kdf = sdf.to_koalas()

In [22]:
kdf

           A          B          C          D
0   1.469465   0.498672  -0.888052   1.026008
1  -1.226854  -0.185018  -0.712293   1.661331
2  -0.106912   1.601921  -0.237452   0.202532
3  -1.803351   0.644592  -1.010620   0.311429
4   1.109027   1.445527  -0.426108   0.378924
5  -0.504450   0.865630   1.777991  -0.410543


A Koalas DataFrame has specific [dtypes](http://pandas.pydata.org/pandas-docs/stable/basics.html#basics-dtypes). Types that are common to both Spark and pandas are currently supported.

In [23]:
kdf.dtypes

A    float64
B    float64
C    float64
D    float64
dtype: object

## Viewing Data

See the [API Reference](https://koalas.readthedocs.io/en/latest/reference/index.html).

See the top rows of the frame. The results may not be the same as in pandas, though: unlike pandas, the data in a Spark DataFrame is not _ordered_ and has no intrinsic notion of an index. When asked for the head of a DataFrame, Spark just takes the requested number of rows from a partition. Do not rely on it to return specific rows; use `.loc` or `.iloc` instead.

In [24]:
kdf.head()

           A          B          C          D
0   1.469465   0.498672  -0.888052   1.026008
1  -1.226854  -0.185018  -0.712293   1.661331
2  -0.106912   1.601921  -0.237452   0.202532
3  -1.803351   0.644592  -1.010620   0.311429
4   1.109027   1.445527  -0.426108   0.378924


Display the index, the columns, and the underlying numpy data.

You can also retrieve the index; an index column can be assigned to a DataFrame (see later).

In [25]:
kdf.index

Int64Index([0, 1, 2, 3, 4, 5], dtype='int64')

In [26]:
kdf.columns

Index(['A', 'B', 'C', 'D'], dtype='object')

In [27]:
kdf.to_numpy()



array([[ 1.46946522,  0.49867164, -0.88805228,  1.02600844],
       [-1.22685367, -0.18501819, -0.71229257,  1.66133102],
       [-0.10691236,  1.60192059, -0.23745191,  0.20253159],
       [-1.80335135,  0.64459243, -1.01061973,  0.31142878],
       [ 1.10902692,  1.44552737, -0.42610778,  0.37892432],
       [-0.50445027,  0.86563014,  1.77799088, -0.410543  ]])

`describe()` shows a quick statistical summary of your data:

In [28]:
kdf.describe()

                                                                                

               A          B          C          D
count   6.000000   6.000000   6.000000   6.000000
mean   -0.177179   0.811887  -0.249422   0.528280
std     1.282502   0.655508   1.033759   0.719496
min    -1.803351  -0.185018  -1.010620  -0.410543
25%    -1.226854   0.498672  -0.888052   0.202532
50%    -0.504450   0.644592  -0.712293   0.311429
75%     1.109027   1.445527  -0.237452   1.026008
max     1.469465   1.601921   1.777991   1.661331


Transposing your data

In [29]:
kdf.T

           0          1          2          3          4          5
A   1.469465  -1.226854  -0.106912  -1.803351   1.109027  -0.504450
B   0.498672  -0.185018   1.601921   0.644592   1.445527   0.865630
C  -0.888052  -0.712293  -0.237452  -1.010620  -0.426108   1.777991
D   1.026008   1.661331   0.202532   0.311429   0.378924  -0.410543


Sorting by its index

In [30]:
kdf.sort_index(ascending=False)

           A          B          C          D
5  -0.504450   0.865630   1.777991  -0.410543
4   1.109027   1.445527  -0.426108   0.378924
3  -1.803351   0.644592  -1.010620   0.311429
2  -0.106912   1.601921  -0.237452   0.202532
1  -1.226854  -0.185018  -0.712293   1.661331
0   1.469465   0.498672  -0.888052   1.026008


Sorting by value

In [31]:
kdf.sort_values(by='B')

           A          B          C          D
1  -1.226854  -0.185018  -0.712293   1.661331
0   1.469465   0.498672  -0.888052   1.026008
3  -1.803351   0.644592  -1.010620   0.311429
5  -0.504450   0.865630   1.777991  -0.410543
4   1.109027   1.445527  -0.426108   0.378924
2  -0.106912   1.601921  -0.237452   0.202532


## Missing Data
Koalas primarily uses the value `np.nan` to represent missing data. By default, it is excluded from computations.
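To see why excluding missing values matters, here is a minimal plain-Python sketch. It assumes only that `np.nan` behaves as an ordinary floating-point NaN, so `math.nan` stands in for it and no Spark session is needed. The variable names are illustrative, and the list reuses the values of the Series `s` created earlier.

```python
import math

# Same values as the Series `s` above, with math.nan standing in for np.nan.
values = [1.0, 3.0, 5.0, math.nan, 6.0, 8.0]

# NaN propagates through ordinary arithmetic, so a naive sum is itself NaN.
naive_total = sum(values)

# Dropping NaN first mirrors the "missing data is excluded" default.
present = [v for v in values if not math.isnan(v)]
skipna_mean = sum(present) / len(present)

print(math.isnan(naive_total))  # True
print(skipna_mean)              # 4.6
```

The mean is taken over the five present values only, which is the behaviour to expect from the descriptive statistics shown later in this notebook.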


In [32]:
pdf1 = pdf.reindex(index=dates[0:4], columns=list(pdf.columns) + ['E'])

In [33]:
pdf1.loc[dates[0]:dates[1], 'E'] = 1

In [35]:
kdf1 = ps.from_pandas(pdf1)



In [36]:
kdf1

                    A          B          C          D    E
2013-01-01   1.469465   0.498672  -0.888052   1.026008  1.0
2013-01-02  -1.226854  -0.185018  -0.712293   1.661331  1.0
2013-01-03  -0.106912   1.601921  -0.237452   0.202532  NaN
2013-01-04  -1.803351   0.644592  -1.010620   0.311429  NaN


To drop any rows that have missing data.

In [37]:
kdf1.dropna(how='any')

                    A          B          C          D    E
2013-01-01   1.469465   0.498672  -0.888052   1.026008  1.0
2013-01-02  -1.226854  -0.185018  -0.712293   1.661331  1.0


Filling missing data.

In [38]:
kdf1.fillna(value=5)

                    A          B          C          D    E
2013-01-01   1.469465   0.498672  -0.888052   1.026008  1.0
2013-01-02  -1.226854  -0.185018  -0.712293   1.661331  1.0
2013-01-03  -0.106912   1.601921  -0.237452   0.202532  5.0
2013-01-04  -1.803351   0.644592  -1.010620   0.311429  5.0


## Operations

### Stats
Operations in general exclude missing data.

Performing a descriptive statistic:

In [39]:
kdf.mean()



A   -0.177179
B    0.811887
C   -0.249422
D    0.528280
dtype: float64

### Spark Configurations

Various PySpark configurations can be applied internally in Koalas.
For example, you can enable the Arrow optimization to noticeably speed up internal pandas conversion. See <a href="https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html">PySpark Usage Guide for Pandas with Apache Arrow</a>.

In [40]:
prev = spark.conf.get("spark.sql.execution.arrow.pyspark.enabled")  # Keep its current value.
ps.set_option("compute.default_index_type", "distributed")  # Use the distributed default index to avoid overhead.
import warnings
warnings.filterwarnings("ignore")  # Ignore warnings coming from Arrow optimizations.

In [42]:
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", True)
%timeit ps.range(300000).to_pandas()


                                                                                

292 ms ± 98.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [43]:
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", False)
%timeit ps.range(300000).to_pandas()

665 ms ± 121 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)


In [44]:
ps.reset_option("compute.default_index_type")
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", prev)  # Set its previous value back.
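The save, set, and restore pattern used in the cells above can be wrapped in a small helper so that the previous value is restored even if the timed code raises. The following is only a sketch: `temporary_conf` and `DictConf` are hypothetical names, and `DictConf` is a stand-in for `spark.conf` so the example runs without a Spark session. It assumes the conf object exposes `get(key)` and `set(key, value)`, as `spark.conf` does in the cells above.

```python
from contextlib import contextmanager


@contextmanager
def temporary_conf(conf, key, value):
    """Set `key` to `value` on a conf-like object, then restore the old value."""
    previous = conf.get(key)  # remember the current setting
    conf.set(key, value)
    try:
        yield conf
    finally:
        conf.set(key, previous)  # restore even if the body raises


class DictConf:
    """Hypothetical stand-in for `spark.conf`, used only for this demo."""

    def __init__(self, values):
        self._values = dict(values)

    def get(self, key):
        return self._values[key]

    def set(self, key, value):
        self._values[key] = value


KEY = "spark.sql.execution.arrow.pyspark.enabled"
conf = DictConf({KEY: "false"})

with temporary_conf(conf, KEY, "true"):
    inside = conf.get(KEY)  # value seen while the block runs
after = conf.get(KEY)       # value once the block has exited

print(inside, after)  # true false
```

With a real session, the intent would be to pass `spark.conf` in place of `conf` and run the `%timeit` call inside the `with` block.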


## Grouping
By “group by” we are referring to a process involving one or more of the following steps:

- Splitting the data into groups based on some criteria
- Applying a function to each group independently
- Combining the results into a data structure
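The three steps above can be sketched in plain Python without any DataFrame library. This is an illustration of the idea only, not how Koalas or Spark implement grouping; `group_sum` and the sample rows are hypothetical.

```python
from collections import defaultdict

# Hypothetical sample rows, loosely shaped like the DataFrame built below.
rows = [
    {"A": "foo", "B": "one", "C": 1.0},
    {"A": "bar", "B": "one", "C": 2.0},
    {"A": "foo", "B": "two", "C": 3.0},
    {"A": "bar", "B": "three", "C": 4.0},
    {"A": "foo", "B": "one", "C": 5.0},
]


def group_sum(rows, keys, value):
    """Sum `value` per distinct combination of the `keys` columns."""
    # Split: bucket the values by their group key (a tuple of key columns).
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in keys)].append(row[value])
    # Apply and combine: reduce each bucket and collect the results in a dict.
    return {key: sum(vals) for key, vals in groups.items()}


by_a = group_sum(rows, ["A"], "C")
by_ab = group_sum(rows, ["A", "B"], "C")

print(by_a)                   # {('foo',): 9.0, ('bar',): 6.0}
print(by_ab[("foo", "one")])  # 6.0
```

Grouping by two columns yields tuple keys such as `('foo', 'one')`, which is the plain-Python analogue of the hierarchical index shown later in this section.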

In [45]:
kdf = ps.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                          'foo', 'bar', 'foo', 'foo'],
                    'B': ['one', 'one', 'two', 'three',
                          'two', 'two', 'one', 'three'],
                    'C': np.random.randn(8),
                    'D': np.random.randn(8)})

In [46]:
kdf

     A      B          C          D
0  foo    one   0.396455   1.163582
1  bar    one   0.232822  -1.547456
2  foo    two   1.004841   1.346642
3  bar  three   0.041570  -0.347607
4  foo    two  -0.636318   0.412455
5  bar    two   1.108491  -0.139324
6  foo    one  -0.099152  -3.094342
7  foo  three   2.038558  -1.414541


Grouping and then applying the [sum()](https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.groupby.GroupBy.sum.html#databricks.koalas.groupby.GroupBy.sum) function to the resulting groups.

In [47]:
kdf.groupby('A').sum()

                                                                                

             C          D
A
foo   2.704385  -1.586204
bar   1.382883  -2.034387


Grouping by multiple columns forms a hierarchical index, and again we can apply the sum function.

In [48]:
kdf.groupby(['A', 'B']).sum()

                   C          D
A   B
foo one     0.297303  -1.930760
bar one     0.232822  -1.547456
foo two     0.368524   1.759097
bar three   0.041570  -0.347607
bar two     1.108491  -0.139324
foo three   2.038558  -1.414541


## Plotting
See the <a href="https://koalas.readthedocs.io/en/latest/reference/frame.html#plotting">Plotting</a> docs.

In [49]:
pser = pd.Series(np.random.randn(1000),
                 index=pd.date_range('1/1/2000', periods=1000))

In [50]:
kser = ps.Series(pser)

In [51]:
kser = kser.cummax()

In [53]:
kser.plot()

22/09/22 11:20:58 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.


On a DataFrame, the <a href="https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.frame.DataFrame.plot.html#databricks.koalas.frame.DataFrame.plot">plot()</a> method is a convenience to plot all of the columns with labels:

In [54]:
pdf = pd.DataFrame(np.random.randn(1000, 4), index=pser.index,
                   columns=['A', 'B', 'C', 'D'])

In [55]:
kdf = ps.from_pandas(pdf)

In [56]:
kdf = kdf.cummax()

In [57]:
kdf.plot()

22/09/22 11:21:16 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.


## Getting data in/out
See the <a href="https://koalas.readthedocs.io/en/latest/reference/io.html">Input/Output</a> docs.

### CSV

CSV is straightforward and easy to use. See <a href="https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.to_csv.html#databricks.koalas.DataFrame.to_csv">here</a> to write a CSV file and <a href="https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.read_csv.html#databricks.koalas.read_csv">here</a> to read a CSV file.

In [58]:
kdf.to_csv('foo.csv')
ps.read_csv('foo.csv').head(10)

22/09/22 11:21:32 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.


                                                                                



          A         B         C         D
0  0.054645  1.291642  0.681389  0.162467
1  1.370169  1.345856  0.681389  0.162467
2  1.822636  1.345856  1.032442  0.162467
3  1.822636  1.345856  1.032442  0.162467
4  1.822636  1.345856  1.069375  0.162467
5  1.822636  1.345856  1.069375  0.233251
6  1.822636  1.345856  1.069375  0.754467
7  1.822636  1.780574  1.069375  2.600024
8  1.822636  1.780574  1.069375  2.600024
9  1.822636  1.780574  1.069375  2.600024


### Parquet

Parquet is an efficient, compact file format that is faster to read and write. See <a href="https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.to_parquet.html#databricks.koalas.DataFrame.to_parquet">here</a> to write a Parquet file and <a href="https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.read_parquet.html#databricks.koalas.read_parquet">here</a> to read a Parquet file.

In [59]:
kdf.to_parquet('bar.parquet')
ps.read_parquet('bar.parquet').head(10)

22/09/22 11:21:50 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.


                                                                                

          A         B         C         D
0  0.054645  1.291642  0.681389  0.162467
1  1.370169  1.345856  0.681389  0.162467
2  1.822636  1.345856  1.032442  0.162467
3  1.822636  1.345856  1.032442  0.162467
4  1.822636  1.345856  1.069375  0.162467
5  1.822636  1.345856  1.069375  0.233251
6  1.822636  1.345856  1.069375  0.754467
7  1.822636  1.780574  1.069375  2.600024
8  1.822636  1.780574  1.069375  2.600024
9  1.822636  1.780574  1.069375  2.600024


### Spark IO

In addition, Koalas fully supports Spark's various data sources, such as ORC and external data sources. See <a href="https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.DataFrame.to_spark_io.html#databricks.koalas.DataFrame.to_spark_io">here</a> to write to a specified data source and <a href="https://koalas.readthedocs.io/en/latest/reference/api/databricks.koalas.read_spark_io.html#databricks.koalas.read_spark_io">here</a> to read from one.

In [60]:
kdf.to_spark_io('zoo.orc', format="orc")
ps.read_spark_io('zoo.orc', format="orc").head(10)

22/09/22 11:22:05 WARN WindowExec: No Partition Defined for Window operation! Moving all data to a single partition, this can cause serious performance degradation.


                                                                                

          A         B         C         D
0  0.054645  1.291642  0.681389  0.162467
1  1.370169  1.345856  0.681389  0.162467
2  1.822636  1.345856  1.032442  0.162467
3  1.822636  1.345856  1.032442  0.162467
4  1.822636  1.345856  1.069375  0.162467
5  1.822636  1.345856  1.069375  0.233251
6  1.822636  1.345856  1.069375  0.754467
7  1.822636  1.780574  1.069375  2.600024
8  1.822636  1.780574  1.069375  2.600024
9  1.822636  1.780574  1.069375  2.600024
