### 1. PySpark SQL sample() Usage & Examples

PySpark sampling (`pyspark.sql.DataFrame.sample()`) returns a random sample of records from a dataset. This is helpful when you have a large dataset and want to analyze or test a subset of the data, for example 10% of the original file.

Below is the syntax of the `sample()` function.

```sample(withReplacement, fraction, seed=None)```

withReplacement – Whether to sample with replacement (default False).

fraction – Fraction of rows to generate, range [0.0, 1.0]. The result is not guaranteed to contain exactly this fraction of the records.

seed – Seed for sampling (default a random seed). Pass the same seed to reproduce the same random sample.



##### 1.1 Using fraction to get a random sample in PySpark
With a fraction between 0 and 1, `sample()` returns approximately that fraction of the dataset. For example, 0.1 returns about 10% of the rows, but not necessarily exactly 10% of the records.

In [1]:
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder \
                    .master("local[1]") \
                    .appName("PySparkLearning") \
                    .getOrCreate()

df=spark.range(100)
print(df.sample(0.06).collect())

[Row(id=8), Row(id=13), Row(id=95)]


My DataFrame has 100 records and I wanted a 6% sample, which would be 6 records, but `sample()` returned only 3. This shows that the sample function does not return exactly the fraction specified.      
> Results vary for each run
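
To see why the count drifts, it helps to model sampling without replacement as an independent coin flip per row, where each row is kept with probability `fraction`. The plain-Python sketch below illustrates that idea only; it is not Spark's implementation, and `bernoulli_sample` is a hypothetical helper name. Under this model the sample size follows a binomial distribution, so for 100 rows and a fraction of 0.06 the expected size is 6 with a standard deviation of roughly 2.4, which makes a result of 3 rows unsurprising.

```python
import math
import random

def bernoulli_sample(rows, fraction, seed=None):
    """Keep each row independently with probability `fraction`."""
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]

rows = list(range(100))
fraction = 0.06

# Sample sizes from 1,000 runs, each with a different seed.
sizes = [len(bernoulli_sample(rows, fraction, seed=s)) for s in range(1000)]

expected = len(rows) * fraction                             # 6.0
std_dev = math.sqrt(len(rows) * fraction * (1 - fraction))  # about 2.37

print(f"expected size: {expected:.1f}, standard deviation: {std_dev:.2f}")
print(f"smallest: {min(sizes)}, largest: {max(sizes)}, "
      f"mean: {sum(sizes) / len(sizes):.2f}")
```

The individual sizes spread widely around 6, while their mean stays close to it.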

##### 1.2 Using seed to reproduce the same Samples in PySpark

Each call to `sample()` returns a different set of records. During development and testing, however, you may need to regenerate the same sample on every run so that you can compare the results with a previous run. To get the same random sample each time, use the same seed value for every run. Change the seed value to get a different sample.

In [3]:
print(df.sample(0.1,123).collect())
print(df.sample(0.1,123).collect())
print(df.sample(0.1,456).collect())

[Row(id=36), Row(id=37), Row(id=41), Row(id=43), Row(id=56), Row(id=66), Row(id=69), Row(id=75), Row(id=83)]
[Row(id=36), Row(id=37), Row(id=41), Row(id=43), Row(id=56), Row(id=66), Row(id=69), Row(id=75), Row(id=83)]
[Row(id=19), Row(id=21), Row(id=42), Row(id=48), Row(id=49), Row(id=50), Row(id=75), Row(id=80)]


In the examples above, the first two calls use seed 123, so the sampling results are the same. The last call uses seed 456, so it returns different records.

##### 1.3 Sample `withReplacement` (may contain duplicates)

Sometimes you may need a random sample that can contain repeated values. Passing `True` for `withReplacement` allows the same row to be selected more than once.

In [18]:
res1 = df.sample(True,0.3,123).collect()   # with Duplicates
for r in res1:
    print(r.id, end = ' ')
    
print()
print('<'*50,'>'*50)

res2 = df.sample(0.3,123).collect()        # No duplicates
for r in res2:
    print(r.id, end = ' ')

0 5 9 11 14 14 16 17 21 29 33 41 42 52 52 54 58 65 65 71 76 79 85 96 
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
0 4 17 19 24 25 26 36 37 41 43 44 53 56 66 68 69 70 71 75 76 78 83 84 88 94 96 97 98 
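
One way to picture sampling with replacement is that every row is emitted a random number of times, drawn from a Poisson distribution whose mean is `fraction`. The sketch below is plain Python under that assumption and is not Spark's actual code; `poisson_count` and `sample_with_replacement` are hypothetical names introduced for illustration. It shows where the duplicates come from: most rows are drawn zero times, some once, and a few twice or more.

```python
import math
import random

def poisson_count(rng, mean):
    """Draw one Poisson-distributed integer (Knuth's method, fine for small means)."""
    limit = math.exp(-mean)
    count, product = 0, rng.random()
    while product > limit:
        count += 1
        product *= rng.random()
    return count

def sample_with_replacement(rows, fraction, seed=None):
    """Emit each row k times, where k is Poisson-distributed with mean `fraction`."""
    rng = random.Random(seed)
    result = []
    for row in rows:
        result.extend([row] * poisson_count(rng, fraction))
    return result

picked = sample_with_replacement(range(100), 0.3, seed=123)
# Ids that were drawn more than once.
duplicates = sorted({row for row in picked if picked.count(row) > 1})
print(len(picked), "rows picked; repeated ids:", duplicates)
```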

##### 1.4 Stratified sampling in PySpark

You can perform stratified sampling without replacement in PySpark by using the `sampleBy()` method. It takes a sampling fraction for each stratum. If a stratum is not specified, its fraction is treated as zero.

**sampleBy() Syntax**


```
sampleBy(col, fractions, seed=None)
```

col – Name of the DataFrame column that defines the strata     
fractions – Dictionary that maps each stratum key to its sampling fraction.

In [19]:
df2=df.select((df.id % 3).alias("key"))

print(df2.sampleBy("key", {0: 0.1, 1: 0.2},0).collect())

[Row(key=0), Row(key=1), Row(key=1), Row(key=1), Row(key=0), Row(key=1), Row(key=1), Row(key=0), Row(key=1), Row(key=1), Row(key=1)]
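
The per-stratum behaviour can be sketched in plain Python as a coin flip whose probability depends on the row's key, with a missing key falling back to a fraction of zero. This is only an illustration of the idea, not Spark's implementation, and `sample_by` is a hypothetical helper. It mirrors the example above: key 2 has no entry in the dictionary, so it never shows up in the sample.

```python
import random
from collections import Counter

def sample_by(keys, fractions, seed=None):
    """Keep each key with the probability listed for its stratum (0 if unlisted)."""
    rng = random.Random(seed)
    return [key for key in keys if rng.random() < fractions.get(key, 0.0)]

keys = [i % 3 for i in range(100)]  # same strata as df2 in the example above
fractions = {0: 0.1, 1: 0.2}        # key 2 is deliberately left out

sampled = sample_by(keys, fractions, seed=0)
print(Counter(sampled))             # counts per stratum; key 2 is absent
```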


### 2. PySpark RDD Sample

PySpark RDD also provides a `sample()` function to get a random sample. It has a related method, `takeSample()`, that returns the sampled elements to the driver as a Python list.

**RDD sample() Syntax & Example**

PySpark RDD `sample()` returns a random sample, similar to the DataFrame version, and takes the same kinds of parameters. Unlike the DataFrame version, `withReplacement` has no default and must be passed explicitly.
```
sample(self, withReplacement, fraction, seed=None)
```

In [20]:
rdd = spark.sparkContext.range(0,100)
print(rdd.sample(False,0.1,0).collect())
print(rdd.sample(True,0.3,123).collect())

[24, 29, 41, 64, 86]
[0, 11, 13, 14, 16, 18, 21, 23, 27, 31, 32, 32, 48, 49, 49, 53, 54, 72, 74, 77, 77, 83, 88, 91, 93, 98, 99]


**RDD takeSample() Syntax & Example**

RDD `takeSample()` is an action, so you need to be careful when you use it: it returns the selected sample records to driver memory. Returning too much data results in an out-of-memory error, similar to `collect()`.

Syntax of RDD `takeSample()`:

```
takeSample(self, withReplacement, num, seed=None) 
```

In [22]:
print(rdd.takeSample(False,10,0))
print(rdd.takeSample(True,20,123))

[58, 1, 96, 74, 29, 24, 32, 37, 94, 91]
[24, 46, 51, 45, 95, 67, 45, 30, 29, 87, 25, 68, 90, 0, 13, 23, 50, 73, 70, 16]
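
Unlike `sample()`, which takes a fraction and returns an approximate number of rows, `takeSample()` takes an exact count `num`. The contrast can be sketched in plain Python with the standard `random` module. This is an analogy run on a local list, not Spark's implementation, and `take_sample` is a hypothetical helper name.

```python
import random

def take_sample(rows, with_replacement, num, seed=None):
    """Return `num` randomly chosen rows as a list (fewer if there are not enough)."""
    rng = random.Random(seed)
    rows = list(rows)
    if not rows:
        return []
    if with_replacement:
        # Independent draws, so the same row can appear more than once.
        return rng.choices(rows, k=num)
    # Each row can appear at most once, so the size is capped at len(rows).
    return rng.sample(rows, min(num, len(rows)))

data = range(100)
without_repl = take_sample(data, False, 10, seed=0)
with_repl = take_sample(data, True, 20, seed=123)
print(len(without_repl), len(with_repl))  # 10 20
```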
