# Ex - GroupBy

### Introduction:

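A GroupBy operation follows the Split-Apply-Combine pattern: split the rows into groups by a key, apply an aggregation to each group, and combine the per-group results into one table.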

Special thanks to: https://github.com/justmarkham for sharing the dataset and materials.

Check out this [Diagram](http://i.imgur.com/yjNkiwL.png)  
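
To make the pattern concrete, here is a minimal plain-Python sketch of split-apply-combine, with no Spark involved. The rows are copied from the preview table in Step 3; the function name `split_apply_combine` is only illustrative and is not part of any library.

```python
from collections import defaultdict
from statistics import mean

# (country, beer_servings, continent) for five rows of the drinks preview table
rows = [
    ("Afghanistan", 0, "AS"),
    ("Albania", 89, "EU"),
    ("Algeria", 25, "AF"),
    ("Andorra", 245, "EU"),
    ("Angola", 217, "AF"),
]


def split_apply_combine(rows):
    """Average beer_servings per continent (hypothetical helper)."""
    # Split: bucket the beer values by continent
    groups = defaultdict(list)
    for _country, beer, continent in rows:
        groups[continent].append(beer)
    # Apply the mean to each bucket and combine the results into one dict
    return {continent: mean(values) for continent, values in groups.items()}


result = split_apply_combine(rows)
print(result)  # AS -> 0, EU -> 167, AF -> 121
```

`groupBy("continent").agg(avg("beer_servings"))` in the steps below does the same three things, but distributes the work across Spark partitions.
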
### Step 1. Import the necessary libraries

In [1]:
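from pyspark.sql import SparkSession

# Start (or reuse) a local Spark session for this notebook.
spark = SparkSession.builder.master("local").appName("My Application").getOrCreate()

# Caution: the wildcard import below replaces Python built-ins such as min, max and sum
# with the Spark column functions of the same name for the rest of the notebook.
from pyspark.sql.types import *
from pyspark.sql.functions import *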



### Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/justmarkham/DAT8/master/data/drinks.csv). 

### Step 3. Assign it to a variable called drinks.

In [3]:
drinks = spark.read.format("csv")\
    .option("header", "true")\
    .option("inferSchema", "true")\
    .load("drinks.csv")

drinks.show()

                                                                                

+-----------------+-------------+---------------+-------------+----------------------------+---------+
|          country|beer_servings|spirit_servings|wine_servings|total_litres_of_pure_alcohol|continent|
+-----------------+-------------+---------------+-------------+----------------------------+---------+
|      Afghanistan|            0|              0|            0|                         0.0|       AS|
|          Albania|           89|            132|           54|                         4.9|       EU|
|          Algeria|           25|              0|           14|                         0.7|       AF|
|          Andorra|          245|            138|          312|                        12.4|       EU|
|           Angola|          217|             57|           45|                         5.9|       AF|
|Antigua & Barbuda|          102|            128|           45|                         4.9|       NA|
|        Argentina|          193|             25|          221|          
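
The cell above reads a local `drinks.csv`, while Step 2 points at a URL. One way to bridge the two, sketched here as an assumption about the intended workflow, is to download the file once with the standard library and then hand the local path to Spark. The helper name `fetch_dataset` is hypothetical.

```python
from pathlib import Path
from urllib.request import urlretrieve

# URL taken from Step 2
DRINKS_URL = "https://raw.githubusercontent.com/justmarkham/DAT8/master/data/drinks.csv"


def fetch_dataset(url=DRINKS_URL, dest="drinks.csv"):
    """Download url to dest unless dest already exists; return the local path."""
    path = Path(dest)
    if not path.exists():
        urlretrieve(url, str(path))
    return path
```

With the file in place, `spark.read.csv(str(fetch_dataset()), header=True, inferSchema=True)` should give the same DataFrame as the `format("csv")` chain above.
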

### Step 4. Which continent drinks more beer on average?

In [4]:
drinks.groupBy("continent").agg(avg("beer_servings").alias("avg_beer_servings"))\
    .orderBy("avg_beer_servings", ascending=False)\
    .show()

+---------+------------------+
|continent| avg_beer_servings|
+---------+------------------+
|       EU|193.77777777777777|
|       SA|175.08333333333334|
|       NA|145.43478260869566|
|       OC|           89.6875|
|       AF|61.471698113207545|
|       AS| 37.04545454545455|
+---------+------------------+



                                                                                

### Step 5. For each continent print the statistics for wine consumption.

In [6]:
drinks.groupBy("continent").agg(
    avg("wine_servings").alias("avg_wine_servings"),
    min("wine_servings").alias("min_wine_servings"),
    max("wine_servings").alias("max_wine_servings"),
    stddev("wine_servings").alias("stddev_wine_servings")
).show()

+---------+------------------+-----------------+-----------------+--------------------+
|continent| avg_wine_servings|min_wine_servings|max_wine_servings|stddev_wine_servings|
+---------+------------------+-----------------+-----------------+--------------------+
|       NA| 24.52173913043478|                1|              100|  28.266378301658847|
|       SA|62.416666666666664|                1|              221|   88.62018888937148|
|       AS| 9.068181818181818|                0|              123|  21.667033931944484|
|       OC|            35.625|                0|              212|   64.55578982554547|
|       EU|142.22222222222223|                0|              370|   97.42173756146497|
|       AF|16.264150943396228|                0|              233|   38.84641897335842|
+---------+------------------+-----------------+-----------------+--------------------+



In [5]:
drinks.describe("beer_servings")\
    .show()

+-------+------------------+
|summary|     beer_servings|
+-------+------------------+
|  count|               193|
|   mean|106.16062176165804|
| stddev| 101.1431025393134|
|    min|                 0|
|    max|               376|
+-------+------------------+
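
A note on the `stddev` figures above: Spark's `stddev` is an alias for `stddev_samp`, the sample standard deviation (divide by n - 1), and `describe()` reports the same statistic. The plain-Python sketch below shows how much the choice of divisor matters on a tiny group. It uses only the two EU rows visible in the Step 3 preview, so the numbers are illustrative and do not reproduce the EU row of the table.

```python
from statistics import pstdev, stdev

# wine_servings for Albania and Andorra, the two EU rows in the preview
wine = [54, 312]

sample_sd = stdev(wine)       # divides by n - 1, like Spark's stddev / stddev_samp
population_sd = pstdev(wine)  # divides by n, like Spark's stddev_pop

print(sample_sd, population_sd)
```

If the population form is what you want, `stddev_pop("wine_servings")` can be used inside `agg` in place of `stddev`.
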



### Step 6. Print the mean alcohol consumption per continent for every column

In [7]:
# Only total_litres_of_pure_alcohol is averaged here, not every column.
drinks.groupBy("continent")\
    .agg(avg("total_litres_of_pure_alcohol").alias("alcohol_consumption"))\
    .show()

+---------+-------------------+
|continent|alcohol_consumption|
+---------+-------------------+
|       NA|  5.995652173913044|
|       SA|  6.308333333333334|
|       AS| 2.1704545454545454|
|       OC| 3.3812500000000005|
|       EU|  8.617777777777777|
|       AF|   3.00754716981132|
+---------+-------------------+
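
The step asks for every column, while the cell above averages a single one. A possible extension, sketched under the assumption that `inferSchema` produced `int` and `double` columns as the preview suggests, is to pick the numeric columns from `DataFrame.dtypes` and build one `avg` per column. Both function names are hypothetical.

```python
def numeric_columns(dtypes, exclude=()):
    """Return the numeric column names from (name, type) pairs such as DataFrame.dtypes."""
    numeric_types = {"tinyint", "smallint", "int", "bigint", "float", "double"}
    return [name for name, dtype in dtypes
            if dtype in numeric_types and name not in exclude]


def mean_of_every_column(df, key="continent"):
    """Average every numeric column of df within each value of key."""
    # Imported lazily so numeric_columns can be used without a Spark installation
    from pyspark.sql import functions as F

    columns = numeric_columns(df.dtypes, exclude=(key,))
    return df.groupBy(key).agg(*[F.avg(c).alias(f"avg_{c}") for c in columns])
```

Calling `mean_of_every_column(drinks).show()` would print one row per continent with four averages. For this particular case, `drinks.groupBy("continent").mean()` with no arguments is a shorter route, since `GroupedData.mean` averages all numeric columns by default.
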



### Step 7. Print the median alcohol consumption per continent for every column

In [8]:
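# median() is available in pyspark.sql.functions from PySpark 3.4 onward.
# Only total_litres_of_pure_alcohol is aggregated here, not every column.
drinks.groupBy("continent")\
    .agg(median("total_litres_of_pure_alcohol").alias("alcohol_consumption"))\
    .show()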

+---------+-------------------+
|continent|alcohol_consumption|
+---------+-------------------+
|       NA|                6.3|
|       SA|               6.85|
|       AS|                1.2|
|       OC|               1.75|
|       EU|               10.0|
|       AF|                2.3|
+---------+-------------------+
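
To cover every column, as the step is worded, one option is Spark SQL: generate a `GROUP BY` query with one `median(...)` per column. The sketch below only builds the query string; the helper name `median_query` and the view name `drinks` are assumptions, and the SQL `median` aggregate requires Spark 3.4 or later (on older versions `percentile_approx(col, 0.5)` is the usual substitute).

```python
def median_query(table, key, columns):
    """Build a Spark SQL query computing the median of each column per key.

    Names are interpolated as-is, so pass only trusted identifiers.
    """
    selects = ", ".join(f"median({c}) AS median_{c}" for c in columns)
    return f"SELECT {key}, {selects} FROM {table} GROUP BY {key}"


query = median_query(
    "drinks",
    "continent",
    ["beer_servings", "spirit_servings", "wine_servings", "total_litres_of_pure_alcohol"],
)
print(query)
```

To run it, register the DataFrame first with `drinks.createOrReplaceTempView("drinks")` and then call `spark.sql(query).show()`.
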



### Step 8. Print the mean, min and max values for spirit consumption.
#### This time output a DataFrame

In [10]:
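# Note: this cell aggregates total_litres_of_pure_alcohol; to answer the step as
# worded, use the "spirit_servings" column instead.
# agg() already returns a DataFrame; show() only prints it.
drinks.groupBy("continent")\
    .agg(mean("total_litres_of_pure_alcohol").alias("mean_alcohol_consumption"),
         max("total_litres_of_pure_alcohol").alias("max_alcohol_consumption"),
         min("total_litres_of_pure_alcohol").alias("min_alcohol_consumption"))\
    .show()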

+---------+------------------------+-----------------------+-----------------------+
|continent|mean_alcohol_consumption|max_alcohol_consumption|min_alcohol_consumption|
+---------+------------------------+-----------------------+-----------------------+
|       NA|       5.995652173913044|                   11.9|                    2.2|
|       SA|       6.308333333333334|                    8.3|                    3.8|
|       AS|      2.1704545454545454|                   11.5|                    0.0|
|       OC|      3.3812500000000005|                   10.4|                    0.0|
|       EU|       8.617777777777777|                   14.4|                    0.0|
|       AF|        3.00754716981132|                    9.1|                    0.0|
+---------+------------------------+-----------------------+-----------------------+

