###1. Random Data Generation
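Random data generation is useful for testing existing algorithms and for implementing randomized algorithms, such as random projection. The rand function generates independent samples drawn uniformly from [0, 1), and randn generates samples from the standard normal distribution.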

In [0]:
from pyspark.sql.functions import rand, randn
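
As a rough plain-Python illustration (not Spark code), the sketch below shows the kind of draws the two imported functions stand for: uniform samples on [0, 1) and standard normal samples. It uses Python's random module, and the helper names uniform_column and normal_column are hypothetical, so the values will not match what Spark generates for the same seeds.

```python
import random


def uniform_column(n, seed):
    """Return n draws from U[0, 1); same distribution as rand, different values."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n)]


def normal_column(n, seed):
    """Return n draws from N(0, 1); same distribution as randn, different values."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


uniform = uniform_column(10, seed=10)
normal = normal_column(10, seed=27)

# Print the two columns side by side, one row per id.
for row_id, (u, z) in enumerate(zip(uniform, normal)):
    print(row_id, u, z)
```

Fixing the seed makes the columns reproducible, which is the main reason to pass seed when generating test data.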

###2. Summary and Descriptive Statistics
The describe function returns a DataFrame containing, for each numerical column, the number of non-null entries (count), the mean, the standard deviation, and the minimum and maximum values.

In [0]:
from pyspark.sql.functions import rand, randn
# A slightly different way to generate the two random columns
df = sqlContext.range(0, 10).withColumn('uniform', rand(seed=10)).withColumn('normal', randn(seed=27))
df.describe().show()

+-------+------------------+-------------------+-------------------+
|summary|                id|            uniform|             normal|
+-------+------------------+-------------------+-------------------+
|  count|                10|                 10|                 10|
|   mean|               4.5| 0.3865003272265095| 0.4175292757722803|
| stddev|3.0276503540974917|0.34583270415314343| 0.8931883269570855|
|    min|                 0|0.03422639313807285|-1.0451987154313813|
|    max|                 9| 0.9899129399827472| 1.7264843633887004|
+-------+------------------+-------------------+-------------------+
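
To make the rows of this table concrete, here is a minimal plain-Python sketch of the same five statistics for the id column. The describe helper below is hypothetical and is not Spark's implementation; it assumes at least two non-null values. It also shows that the stddev row for id agrees with the sample standard deviation (divisor n - 1) rather than the population one.

```python
import math


def describe(values):
    """Compute count, mean, sample stddev, min and max, ignoring None entries."""
    present = [v for v in values if v is not None]
    n = len(present)
    mean = sum(present) / n
    # Sample variance: divide by n - 1, not n.
    variance = sum((v - mean) ** 2 for v in present) / (n - 1)
    return {
        "count": n,
        "mean": mean,
        "stddev": math.sqrt(variance),
        "min": min(present),
        "max": max(present),
    }


id_stats = describe(list(range(10)))
for name, value in id_stats.items():
    print(name, value)
```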



If you have a DataFrame with a large number of columns, you can also run describe on a subset of the columns:

In [0]:
df.describe('uniform', 'normal').show()

+-------+-------------------+-------------------+
|summary|            uniform|             normal|
+-------+-------------------+-------------------+
|  count|                 10|                 10|
|   mean| 0.3865003272265095| 0.4175292757722803|
| stddev|0.34583270415314343| 0.8931883269570855|
|    min|0.03422639313807285|-1.0451987154313813|
|    max| 0.9899129399827472| 1.7264843633887004|
+-------+-------------------+-------------------+



We can also control the list of descriptive statistics and the columns they apply to using the normal select on a DataFrame:

In [0]:
from pyspark.sql.functions import mean, min, max
df.select([mean('uniform'), min('uniform'), max('uniform')]).show()

+------------------+-------------------+------------------+
|      avg(uniform)|       min(uniform)|      max(uniform)|
+------------------+-------------------+------------------+
|0.3865003272265095|0.03422639313807285|0.9899129399827472|
+------------------+-------------------+------------------+



###3. Sample Covariance and Correlation
Covariance is a measure of how two variables change with respect to each other.

In [0]:
from pyspark.sql.functions import rand
df = sqlContext.range(0, 10).withColumn('rand1', rand(seed=10)).withColumn('rand2', rand(seed=27))


In [0]:
df.stat.cov('rand1', 'rand2')

Out[8]: -0.022631283696310282

In [0]:
df.stat.cov('id', 'id')

Out[7]: 9.166666666666666

The covariance value of 9.17 might be hard to interpret on its own, because its magnitude depends on the scale of the variables. Correlation is a normalized measure of covariance, ranging from -1 to 1, that is easier to understand, as it provides a quantitative measurement of the statistical dependence between two random variables.

In [0]:
df.stat.corr('rand1', 'rand2')

Out[9]: -0.3600685552217083

In [0]:
df.stat.corr('id', 'id')

Out[10]: 1.0

In the above example, id correlates perfectly with itself, while the two randomly generated columns have a low correlation value.
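
The relationship between the two measures can be sketched in plain Python. The helpers sample_cov and pearson_corr are hypothetical names, not Spark APIs; they assume the sample covariance (divisor n - 1) and the Pearson correlation, which is the covariance divided by the product of the two standard deviations. For the id column the covariance comes out at roughly 9.17 and the correlation at 1, in line with the outputs above.

```python
import math


def sample_cov(xs, ys):
    """Sample covariance of two equal-length sequences (divisor n - 1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def pearson_corr(xs, ys):
    """Covariance normalized by both standard deviations; lies in [-1, 1]."""
    return sample_cov(xs, ys) / math.sqrt(sample_cov(xs, xs) * sample_cov(ys, ys))


ids = list(range(10))
print(sample_cov(ids, ids))    # the variance of id
print(pearson_corr(ids, ids))  # a column against itself

# Rescaling a column changes the covariance but not the correlation.
scaled = [100 * x for x in ids]
print(sample_cov(ids, scaled), pearson_corr(ids, scaled))
```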

###4. Cross Tabulation (Contingency Table)
Cross Tabulation provides a table of the frequency distribution for a set of variables.

In [0]:
# Create a DataFrame with two columns (name, item)
names = ["Alice", "Bob", "Mike"]
items = ["milk", "bread", "butter", "apples", "oranges"]
df = sqlContext.createDataFrame([(names[i % 3], items[i % 5]) for i in range(100)], ["name", "item"])
# Take a look at the first 10 rows.
df.show(10)

+-----+-------+
| name|   item|
+-----+-------+
|Alice|   milk|
|  Bob|  bread|
| Mike| butter|
|Alice| apples|
|  Bob|oranges|
| Mike|   milk|
|Alice|  bread|
|  Bob| butter|
| Mike| apples|
|Alice|oranges|
+-----+-------+
only showing top 10 rows



In [0]:
df.stat.crosstab("name", "item").show()

+---------+------+-----+------+----+-------+
|name_item|apples|bread|butter|milk|oranges|
+---------+------+-----+------+----+-------+
|     Mike|     7|    6|     7|   7|      6|
|    Alice|     7|    7|     6|   7|      7|
|      Bob|     6|    7|     7|   6|      7|
+---------+------+-----+------+----+-------+
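
The counts in this table can be cross-checked without Spark. The sketch below rebuilds the same (name, item) pairs in plain Python and tallies them with collections.Counter; it only illustrates what a contingency table contains and is not how crosstab is implemented.

```python
from collections import Counter

names = ["Alice", "Bob", "Mike"]
items = ["milk", "bread", "butter", "apples", "oranges"]
rows = [(names[i % 3], items[i % 5]) for i in range(100)]

# Each key is a (name, item) pair; each value is how often that pair occurs.
counts = Counter(rows)

# Print one line per name, with items in alphabetical order as columns.
header = sorted(items)
print("name_item", *header)
for name in names:
    print(name, *(counts[(name, item)] for item in header))
```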



###5. Frequent Items
Figuring out which items are frequent in each column can be very useful for understanding a dataset.

In [0]:
df = sqlContext.createDataFrame([(1, 2, 3) if i % 2 == 0 else (i, 2 * i, i % 4) for i in range(100)], ["a", "b", "c"])
df.show(10)

+---+---+---+
|  a|  b|  c|
+---+---+---+
|  1|  2|  3|
|  1|  2|  1|
|  1|  2|  3|
|  3|  6|  3|
|  1|  2|  3|
|  5| 10|  1|
|  1|  2|  3|
|  7| 14|  3|
|  1|  2|  3|
|  9| 18|  1|
+---+---+---+
only showing top 10 rows



In [0]:
# Find the frequent items that show up 40% of the time for each column
freq = df.stat.freqItems(["a", "b", "c"], 0.4)

In [0]:
freq.collect()[0]

Out[4]: Row(a_freqItems=[11, 1], b_freqItems=[2, 22], c_freqItems=[1, 3])

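The output reports “11” and “1” as the frequent values for column “a”. Note that freqItems uses an approximate algorithm that may return false positives: in this dataset “1” appears in about half of the rows, whereas “11” appears only once.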

We can also find frequent items for column combinations by creating a composite column using the struct function:

In [0]:
from pyspark.sql.functions import struct

freq = df.withColumn('ab', struct('a', 'b')).stat.freqItems(['ab'], 0.4)

freq.collect()[0]


Out[5]: Row(ab_freqItems=[Row(a=11, b=22), Row(a=1, b=2)])

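From the above, the combinations “a=11 and b=22” and “a=1 and b=2” are reported as frequent in this dataset. Because freqItems may return false positives, the result is worth checking: “a=1 and b=2” covers about half of the rows, whereas “a=11 and b=22” occurs only once.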
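
On a dataset this small, an exact count is an easy cross-check on the approximate result. The sketch below is plain Python, not Spark; exact_frequent is a hypothetical helper, and treating "frequent" as "occurring in at least support * n rows" is an assumption made for the illustration.

```python
from collections import Counter

# Same rows as the DataFrame above, as (a, b, c) tuples.
rows = [(1, 2, 3) if i % 2 == 0 else (i, 2 * i, i % 4) for i in range(100)]


def exact_frequent(values, support):
    """Return, sorted, every value occurring in at least support * n entries."""
    values = list(values)
    threshold = support * len(values)
    return sorted(v for v, c in Counter(values).items() if c >= threshold)


freq_a = exact_frequent((r[0] for r in rows), 0.4)
freq_b = exact_frequent((r[1] for r in rows), 0.4)
freq_c = exact_frequent((r[2] for r in rows), 0.4)
# Composite key, analogous to the struct('a', 'b') column.
freq_ab = exact_frequent(((r[0], r[1]) for r in rows), 0.4)

print(freq_a, freq_b, freq_c, freq_ab)
```

Any value reported by freqItems but missing from these exact lists is a false positive at the chosen support level.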

###6. Mathematical Functions
Functions that take a single argument, such as cos, sin, floor, and ceil, require a column as input. For functions that take two arguments, such as pow and hypot, either two columns or a combination of a double and a column can be supplied.

In [0]:
from pyspark.sql.functions import *
df = sqlContext.range(0, 10).withColumn('uniform', rand(seed=10) * 3.14)
# You can reference a column object or supply the column name.
# toDegrees is the Spark 1.4 name; later releases call this function degrees.
df.select(
      'uniform',
      toDegrees('uniform'),
      (pow(cos(df['uniform']), 2) + pow(sin(df.uniform), 2))
        .alias("cos^2 + sin^2")).show()

+-------------------+------------------+------------------+
|            uniform|  DEGREES(uniform)|     cos^2 + sin^2|
+-------------------+------------------+------------------+
| 0.5367821013180484| 30.75534892368792|               1.0|
|0.10747087445354876| 6.157627526768682|               1.0|
| 1.1475525508626785| 65.74991793390322|               1.0|
|  1.310955978808693|  75.1122447131799|               1.0|
| 3.1083266315458262|178.09399733569154|0.9999999999999999|
| 0.5165986402305565|29.598921787408106|               1.0|
| 0.5696528438969835| 32.63870374292187|0.9999999999999999|
| 1.5573024855692674| 89.22685984835182|0.9999999999999999|
| 3.0450071328478523|174.46605729941354|               1.0|
|0.23646103537894467|13.548219346507173|               1.0|
+-------------------+------------------+------------------+
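
The same calculations can be reproduced for individual values with Python's math module, which helps explain the occasional 0.9999999999999999 in the last column: cos^2 + sin^2 equals 1 mathematically, but floating-point rounding can leave the result one unit in the last place short. This is a plain-Python sketch using three of the uniform values from the table above, not Spark code.

```python
import math

# Three values copied from the 'uniform' column above.
angles = [0.5367821013180484, 3.1083266315458262, 1.5573024855692674]

# Single-argument functions: degrees, cos, sin.
degrees = [math.degrees(x) for x in angles]
identities = [math.pow(math.cos(x), 2) + math.pow(math.sin(x), 2) for x in angles]

for x, d, s in zip(angles, degrees, identities):
    # s is 1.0 up to floating-point rounding error.
    print(x, d, s)

# Two-argument functions take two numbers here; in Spark they accept
# two columns or a combination of a double and a column.
print(math.hypot(3.0, 4.0))
print(math.pow(2.0, 10))
```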

