# Ex - GroupBy

### Introduction:

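GroupBy follows the split-apply-combine pattern: split the rows into groups by a key, apply a function (such as a mean) to each group, and combine the per-group results into one table.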

Special thanks to: https://github.com/justmarkham for sharing the dataset and materials.

Check out this [Diagram](http://i.imgur.com/yjNkiwL.png)  
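
To make the split-apply-combine idea concrete before bringing in Spark, here is a minimal plain-Python sketch. The sample rows and the helper name `group_mean` are made up for illustration and are not part of the dataset or of any library.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sample rows (continent, beer_servings); not taken from the dataset.
rows = [("EU", 200), ("EU", 100), ("AS", 30), ("AS", 50), ("AF", 60)]


def group_mean(records):
    """Return the mean value per key (illustrative helper, not a library API)."""
    # Split: bucket the values by their group key.
    groups = defaultdict(list)
    for key, value in records:
        groups[key].append(value)
    # Apply and combine: reduce each bucket to its mean, collected by key.
    return {key: mean(values) for key, values in groups.items()}


result = group_mean(rows)
print(result)
```

`groupBy(...).agg(...)` in PySpark expresses the same three phases, with Spark handling the split and combine across partitions.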
### Step 1. Import the necessary libraries

In [1]:
!pip install pyspark

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyspark
  Downloading pyspark-3.3.0.tar.gz (281.3 MB)
Collecting py4j==0.10.9.5
  Downloading py4j-0.10.9.5-py2.py3-none-any.whl (199 kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.3.0-py2.py3-none-any.whl size=281764026 sha256=121082837704afe96905085711c9ca6d66b1e34215043325fa5bf4befb52ab1e
  Stored in directory: /root/.cache/pip/wheels/7a/8e/1b/f73a52650d2e5f337708d9f6a1750d451a7349a867f928b885
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9.5 pyspark-3.3.0


In [3]:
from pyspark.sql import SparkSession, functions as f
from pyspark.files import SparkFiles

### Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/justmarkham/DAT8/master/data/drinks.csv). 

In [5]:
spark = SparkSession.builder.appName("Exercise30").getOrCreate()
spark.sparkContext.addFile("https://raw.githubusercontent.com/justmarkham/DAT8/master/data/drinks.csv")

df = spark.read.csv("file://" + SparkFiles.get("drinks.csv"), sep=",", header=True, inferSchema=True)
df.printSchema()

root
 |-- country: string (nullable = true)
 |-- beer_servings: integer (nullable = true)
 |-- spirit_servings: integer (nullable = true)
 |-- wine_servings: integer (nullable = true)
 |-- total_litres_of_pure_alcohol: double (nullable = true)
 |-- continent: string (nullable = true)



### Step 3. Assign it to a variable called drinks.

The dataset was already assigned to the variable `df` in Step 2, so the remaining steps use `df`.

### Step 4. Which continent drinks more beer on average?

In [10]:
df.groupby("continent").avg("beer_servings").orderBy(f.desc("avg(beer_servings)")).limit(1).show()

+---------+------------------+
|continent|avg(beer_servings)|
+---------+------------------+
|       EU|193.77777777777777|
+---------+------------------+



### Step 5. For each continent print the statistics for wine consumption.

In [20]:
df.groupby("continent").agg(
    f.min("wine_servings"),
    f.max("wine_servings"),
    f.mean("wine_servings"),
).show()

+---------+------------------+------------------+------------------+
|continent|min(wine_servings)|max(wine_servings)|avg(wine_servings)|
+---------+------------------+------------------+------------------+
|       NA|                 1|               100| 24.52173913043478|
|       SA|                 1|               221|62.416666666666664|
|       AS|                 0|               123| 9.068181818181818|
|       OC|                 0|               212|            35.625|
|       EU|                 0|               370|142.22222222222223|
|       AF|                 0|               233|16.264150943396228|
+---------+------------------+------------------+------------------+



### Step 6. Print the mean alcohol consumption per continent for every column

In [41]:
df.groupBy("continent")\
  .agg({"beer_servings":"avg","spirit_servings":"avg", "wine_servings":"avg", "total_litres_of_pure_alcohol":"avg"}).show()

+---------+------------------+--------------------+---------------------------------+------------------+
|continent|avg(wine_servings)|avg(spirit_servings)|avg(total_litres_of_pure_alcohol)|avg(beer_servings)|
+---------+------------------+--------------------+---------------------------------+------------------+
|       NA| 24.52173913043478|   165.7391304347826|                5.995652173913044|145.43478260869566|
|       SA|62.416666666666664|              114.75|                6.308333333333334|175.08333333333334|
|       AS| 9.068181818181818|   60.84090909090909|               2.1704545454545454| 37.04545454545455|
|       OC|            35.625|             58.4375|               3.3812500000000005|           89.6875|
|       EU|142.22222222222223|  132.55555555555554|                8.617777777777777|193.77777777777777|
|       AF|16.264150943396228|  16.339622641509433|                 3.00754716981132|61.471698113207545|
+---------+------------------+--------------------+---------------------------------+------------------+

### Step 7. Print the median alcohol consumption per continent for every column

In [44]:
df.groupBy("continent")\
  .agg(f.percentile_approx("beer_servings",0.5),
       f.percentile_approx("spirit_servings",0.5),
       f.percentile_approx("wine_servings",0.5),
       f.percentile_approx("total_litres_of_pure_alcohol", 0.5))\
  .show()

+---------+--------------------------------------------+----------------------------------------------+--------------------------------------------+-----------------------------------------------------------+
|continent|percentile_approx(beer_servings, 0.5, 10000)|percentile_approx(spirit_servings, 0.5, 10000)|percentile_approx(wine_servings, 0.5, 10000)|percentile_approx(total_litres_of_pure_alcohol, 0.5, 10000)|
+---------+--------------------------------------------+----------------------------------------------+--------------------------------------------+-----------------------------------------------------------+
|       NA|                                         143|                                           137|                                          11|                                                        6.3|
|       SA|                                         162|                                           100|                                           8|                

### Step 8. Print the mean, min and max values for spirit consumption.
#### This time output a DataFrame

In [45]:
df.select(f.mean("spirit_servings"), f.min("spirit_servings"), f.max("spirit_servings")).show()

+--------------------+--------------------+--------------------+
|avg(spirit_servings)|min(spirit_servings)|max(spirit_servings)|
+--------------------+--------------------+--------------------+
|   80.99481865284974|                   0|                 438|
+--------------------+--------------------+--------------------+

