# Ex - GroupBy

### Introduction:

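GroupBy can be summarized as Split-Apply-Combine: split the rows into groups by a key, apply a function to each group independently, and combine the per-group results into a single table.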
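
As a rough illustration of the three phases, the sketch below performs split, apply and combine by hand on a tiny made-up table and compares the result with a single `groupby` call. The toy values and the names `toy`, `pieces`, `manual` and `grouped` are assumptions for illustration and are not part of the dataset.

```python
import pandas as pd

# Hypothetical toy data, not the drinks dataset.
toy = pd.DataFrame({
    "continent": ["EU", "EU", "AS", "AS", "AF"],
    "beer_servings": [100, 200, 10, 30, 50],
})

# Split: iterate over one sub-frame per key.
# Apply: reduce each sub-frame to the mean of its beer_servings.
pieces = {key: part["beer_servings"].mean() for key, part in toy.groupby("continent")}

# Combine: assemble the per-group results into one Series.
manual = pd.Series(pieces, name="beer_servings").sort_index()

# The same computation expressed as a single groupby call.
grouped = toy.groupby("continent")["beer_servings"].mean()

print(manual.to_dict())
print(grouped.to_dict())
```

Both printed dictionaries should contain the same three continent means.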

Special thanks to: https://github.com/justmarkham for sharing the dataset and materials.

Check out this [Diagram](http://i.imgur.com/yjNkiwL.png)

### Step 1. Import the necessary libraries

In [1]:
import pandas as pd
from pyspark.sql import SparkSession

### Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/justmarkham/DAT8/master/data/drinks.csv). 

### Step 3. Assign it to a variable called drinks.

In [2]:
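import requests

url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/drinks.csv'

# Download the CSV and fail early on an HTTP error
request_url = requests.get(url)
request_url.raise_for_status()

# Save a local copy so that Spark and pandas read the same file
with open('data.csv', 'w', encoding='utf-8') as f:
    f.write(request_url.text)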



In [3]:
spark = SparkSession.builder\
                    .appName('drinks_')\
                    .getOrCreate()



In [4]:
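# header=True takes the column names from the first row; without inferSchema=True
# every column, including the numeric ones, is read as a string
drinks = spark.read.csv('data.csv', header=True)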

In [6]:
drinks.show()

+-----------------+-------------+---------------+-------------+----------------------------+---------+
|          country|beer_servings|spirit_servings|wine_servings|total_litres_of_pure_alcohol|continent|
+-----------------+-------------+---------------+-------------+----------------------------+---------+
|      Afghanistan|            0|              0|            0|                         0.0|       AS|
|          Albania|           89|            132|           54|                         4.9|       EU|
|          Algeria|           25|              0|           14|                         0.7|       AF|
|          Andorra|          245|            138|          312|                        12.4|       EU|
|           Angola|          217|             57|           45|                         5.9|       AF|
|Antigua & Barbuda|          102|            128|           45|                         4.9|       NA|
|        Argentina|          193|             25|          221|          

In [5]:
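# Note: by default pandas parses the string 'NA' as a missing value,
# so the continent code for North America is read as NaN
df_drinks = pd.read_csv('data.csv')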

In [7]:
df_drinks.head()

|   | country | beer_servings | spirit_servings | wine_servings | total_litres_of_pure_alcohol | continent |
|---|---|---|---|---|---|---|
| 0 | Afghanistan | 0 | 0 | 0 | 0.0 | AS |
| 1 | Albania | 89 | 132 | 54 | 4.9 | EU |
| 2 | Algeria | 25 | 0 | 14 | 0.7 | AF |
| 3 | Andorra | 245 | 138 | 312 | 12.4 | EU |
| 4 | Angola | 217 | 57 | 45 | 5.9 | AF |


### Step 4. Which continent drinks more beer on average?

In [13]:
from pyspark.sql.functions import col, avg

# Average beer servings per continent, highest first
drinks.groupby(col('continent'))\
      .agg(avg('beer_servings').alias('promedio_cervezas'))\
      .orderBy(col('promedio_cervezas'), ascending=False).show()

+---------+------------------+
|continent| promedio_cervezas|
+---------+------------------+
|       EU|193.77777777777777|
|       SA|175.08333333333334|
|       NA|145.43478260869566|
|       OC|           89.6875|
|       AF|61.471698113207545|
|       AS| 37.04545454545455|
+---------+------------------+



In [16]:
df_drinks.groupby('continent', as_index=False).agg(promedio_cervezas=('beer_servings', 'mean'))\
         .sort_values(by='promedio_cervezas', ascending=False)

|   | continent | promedio_cervezas |
|---|---|---|
| 2 | EU | 193.777778 |
| 4 | SA | 175.083333 |
| 3 | OC | 89.6875 |
| 0 | AF | 61.471698 |
| 1 | AS | 37.045455 |
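
The pandas result has five rows while the Spark result has six: the `NA` (North America) group is missing. This looks like a side effect of `pd.read_csv`, which treats the string `NA` as a missing value by default, combined with `groupby` dropping missing keys. The sketch below reproduces the effect on made-up inline data; treating `keep_default_na=False` as a suitable workaround assumes the file has no genuinely missing values.

```python
import io
import pandas as pd

# Hypothetical three-row sample with the same kind of continent codes.
csv_text = "country,beer_servings,continent\nA,100,NA\nB,200,EU\nC,50,NA\n"

# Default parsing: the literal string "NA" becomes NaN.
default = pd.read_csv(io.StringIO(csv_text))

# keep_default_na=False keeps "NA" as an ordinary string.
kept = pd.read_csv(io.StringIO(csv_text), keep_default_na=False)

default_means = default.groupby("continent")["beer_servings"].mean()
kept_means = kept.groupby("continent")["beer_servings"].mean()

print(default_means.to_dict())  # the NA group is absent
print(kept_means.to_dict())     # the NA group is present
```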


### Step 5. For each continent print the statistics for wine consumption.

In [20]:
from pyspark.sql import functions as F

# The F namespace avoids shadowing Python's built-in min and max.
# wine_servings was read as a string, so F.min and F.max compare text
# (lexicographically), not numbers.
drinks_stats = drinks.groupBy("continent").agg(
    F.count("wine_servings").alias("count"),
    F.mean("wine_servings").alias("mean"),
    F.stddev("wine_servings").alias("stddev"),
    F.min("wine_servings").alias("min"),
    F.percentile_approx("wine_servings", 0.25).alias("25%"),
    F.percentile_approx("wine_servings", 0.5).alias("median"),
    F.percentile_approx("wine_servings", 0.75).alias("75%"),
    F.max("wine_servings").alias("max")
)

drinks_stats.show(truncate=False)


+---------+-----+------------------+------------------+---+----+------+-----+---+
|continent|count|mean              |stddev            |min|25% |median|75%  |max|
+---------+-----+------------------+------------------+---+----+------+-----+---+
|NA       |23   |24.52173913043478 |28.266378301658847|1  |5.0 |11.0  |36.0 |9  |
|SA       |12   |62.416666666666664|88.62018888937148 |1  |3.0 |8.0   |74.0 |8  |
|AS       |44   |9.068181818181818 |21.667033931944484|0  |0.0 |1.0   |8.0  |9  |
|OC       |16   |35.625            |64.55578982554547 |0  |1.0 |8.0   |23.0 |9  |
|EU       |45   |142.22222222222223|97.42173756146497 |0  |59.0|128.0 |195.0|97 |
|AF       |53   |16.264150943396228|38.84641897335842 |0  |1.0 |2.0   |13.0 |9  |
+---------+-----+------------------+------------------+---+----+------+-----+---+
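
The `max` column in this table (9, 8, 9, 9, 97, 9) is not a set of numeric maxima. A likely cause is that `spark.read.csv` was called without `inferSchema=True`, so `wine_servings` is a string column and `min`/`max` compare text rather than numbers. The pure-Python sketch below reproduces that ordering on made-up values; in Spark, reading with `inferSchema=True` or casting the column (for example `col('wine_servings').cast('int')`) should give numeric results instead.

```python
# Hypothetical values, stored as text the way a schema-less CSV read stores them.
as_text = ["9", "97", "370", "128"]

# String comparison goes character by character, so "97" ranks above "370".
text_max = max(as_text)

# Converting to integers first gives the numeric maximum.
numeric_max = max(int(v) for v in as_text)

print(text_max, numeric_max)
```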



In [22]:
df_drinks.groupby('continent', as_index=False)['wine_servings'].describe()

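|   | continent | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|---|
| 0 | AF | 53.0 | 16.264151 | 38.846419 | 0.0 | 1.0 | 2.0 | 13.0 | 233.0 |
| 1 | AS | 44.0 | 9.068182 | 21.667034 | 0.0 | 0.0 | 1.0 | 8.0 | 123.0 |
| 2 | EU | 45.0 | 142.222222 | 97.421738 | 0.0 | 59.0 | 128.0 | 195.0 | 370.0 |
| 3 | OC | 16.0 | 35.625 | 64.55579 | 0.0 | 1.0 | 8.5 | 23.25 | 212.0 |
| 4 | SA | 12.0 | 62.416667 | 88.620189 | 1.0 | 3.0 | 12.0 | 98.5 | 221.0 |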
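
The quartiles differ slightly between the two outputs (for OC, 8.0 and 23.0 in Spark versus 8.5 and 23.25 in pandas). Spark's `percentile_approx` returns a value that actually occurs in the column, whereas pandas interpolates linearly between neighbouring values by default. The sketch below, on made-up numbers, shows how the `interpolation` argument of `Series.quantile` changes the result; using `'lower'` to mimic Spark is an assumption here, not an exact equivalence.

```python
import pandas as pd

# Hypothetical sample with an even number of values.
s = pd.Series([1, 3, 8, 9, 20, 40])

# Default ("linear"): interpolates between the two middle values, 8 and 9.
linear_median = s.quantile(0.5)

# "lower": picks the smaller of the two neighbouring values that exist in the data.
lower_median = s.quantile(0.5, interpolation="lower")

print(linear_median, lower_median)
```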


### Step 6. Print the mean alcohol consumption per continent for every column
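
One possible approach in pandas is to group by `continent` and call `mean` with `numeric_only=True`, so that the text column `country` is skipped. The sketch below uses made-up rows with column names borrowed from the dataset (the values and the name `toy` are hypothetical); on the real data the same call would be applied to `df_drinks`.

```python
import pandas as pd

# Hypothetical rows; only the column names follow the dataset.
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "beer_servings": [100, 200, 10, 30],
    "spirit_servings": [50, 150, 0, 20],
    "wine_servings": [10, 30, 0, 4],
    "continent": ["EU", "EU", "AS", "AS"],
})

# Mean of every numeric column, one row per continent.
means = toy.groupby("continent").mean(numeric_only=True)

print(means)
```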

### Step 7. Print the median alcohol consumption per continent for every column
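
A median per continent can be sketched the same way with `median`. The rows below are made up so that the median and the mean differ (the values and the name `toy` are hypothetical); on the real data the call would be applied to `df_drinks`.

```python
import pandas as pd

# Hypothetical rows; only the column names follow the dataset.
toy = pd.DataFrame({
    "country": ["A", "B", "C", "D", "E"],
    "beer_servings": [100, 200, 600, 10, 30],
    "spirit_servings": [50, 150, 70, 0, 20],
    "wine_servings": [10, 30, 20, 0, 4],
    "continent": ["EU", "EU", "EU", "AS", "AS"],
})

# Median of every numeric column, one row per continent.
medians = toy.groupby("continent").median(numeric_only=True)

print(medians)
```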

### Step 8. Print the mean, min and max values for spirit consumption.
#### This time output a DataFrame
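
Passing a list of function names to `agg` on a single grouped column returns a DataFrame with one column per statistic, which is one way to meet the "output a DataFrame" requirement. A minimal sketch on made-up rows (the values and the names `toy` and `spirit_stats` are hypothetical):

```python
import pandas as pd

# Hypothetical rows; only the column names follow the dataset.
toy = pd.DataFrame({
    "spirit_servings": [50, 150, 70, 0, 20],
    "continent": ["EU", "EU", "EU", "AS", "AS"],
})

# One row per continent, one column per requested statistic.
spirit_stats = toy.groupby("continent")["spirit_servings"].agg(["mean", "min", "max"])

print(spirit_stats)
```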