# Session 14: Reading files into dataframes. Operations on data. Aggregating and grouping.

## Reading files into DataFrames:

`pandas` is very versatile when it comes to loading data from different file formats into DataFrames.

We have several functions from `pandas` to read files into DataFrames:
* `pd.read_csv` converts CSV files into a `pd.DataFrame`
* `pd.read_json` converts JSON files into a `pd.DataFrame`
* `pd.read_html` parses the tables found in an HTML page and returns a list of `pd.DataFrame` objects
* `pd.read_clipboard` converts the data in your clipboard into a `pd.DataFrame`
* and many more... https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html

In general, `pandas` reads files correctly with its default settings, but sometimes we need to pass extra arguments to `read_csv()`:
* separator: `sep` can be semicolon (;), comma (,), tab (\t), etc
* encoding: `encoding` can be `utf-8`, `latin1`, ...
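
As a minimal sketch of the two arguments above, the snippet below reads a small made-up payload from memory instead of a file on disk. The sample content and the names `raw_bytes` and `sample` are illustrative assumptions, not part of the course data.

```python
import io

import pandas as pd

# Hypothetical file contents: semicolon-separated and encoded as latin1.
raw_bytes = (
    "year;district;dogs\n"
    "2019;TETUÁN;12470\n"
    "2019;CHAMBERÍ;13359\n"
).encode("latin1")

# The defaults are sep="," and encoding="utf-8", which do not match this
# payload, so we state both arguments explicitly.
sample = pd.read_csv(io.BytesIO(raw_bytes), sep=";", encoding="latin1")
print(sample)
```

If the accented characters come out garbled or the whole row lands in a single column, those two arguments are usually the first thing to check.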

In [3]:
import pandas as pd
# pd.read_clipboard()

In [2]:
import os
os.getcwd()

'c:\\Users\\SLO\\OneDrive\\Documents\\GitHub\\IE-University\\IE_MASTERS\\7_PYTHON_FOR_DATA_ANALYSIS'

In [4]:
# let's read animals.csv

import pandas as pd

animals = pd.read_csv("animals.csv", sep=";")
# animals = pd.read_csv(r"..\..\CSV_FILES\cars1.csv", sep=";")  # each ".." goes up one directory; the raw string (r"...") keeps the backslashes literal

animals.sample(5)

Unnamed: 0,year,district,dogs,cats
73,2016,MONCLOA-ARAVACA,12600,2368
55,2017,RETIRO,8309,2313
121,2014,TETUÁN,12301,2178
76,2016,RETIRO,8183,2061
18,2019,VICÁLVARO,5244,1505


In [39]:
animals.shape

(126, 4)

### pandas: `head`, `tail`, `sample`

* `df.head(n)` will display the first n rows of a dataframe. By default, n=5.
* `df.tail(n)` will display the last n rows of a dataframe. By default, n=5.
* `df.sample(n)` will display a random sample of n rows of a dataframe. By default, n=1.

In [41]:
animals.head()

Unnamed: 0,year,district,dogs,cats
0,2019,ARGANZUELA,10556,5074
1,2019,BARAJAS,5086,1515
2,2019,CARABANCHEL,20258,6387
3,2019,CENTRO,16010,9248
4,2019,CHAMARTÍN,11098,3922


In [42]:
animals.tail()

Unnamed: 0,year,district,dogs,cats
121,2014,TETUÁN,12301,2178
122,2014,USERA,11310,978
123,2014,VICÁLVARO,4584,505
124,2014,VILLA DE VALLECAS,7107,940
125,2014,VILLAVERDE,10467,851


In [47]:
animals.sample(8)

Unnamed: 0,year,district,dogs,cats
109,2014,CHAMARTÍN,12942,1793
88,2015,CHAMARTÍN,13159,1860
54,2017,PUENTE DE VALLECAS,23860,3961
24,2018,CENTRO,15881,8186
30,2018,LATINA,19282,7990
53,2017,MORATALAZ,6985,1909
91,2015,FUENCARRAL-EL PARDO,18305,2819
105,2014,ARGANZUELA,9595,1853


In [48]:
animals.columns

Index(['year', 'district', 'dogs', 'cats'], dtype='object')

## Operations with the data in the columns

With pandas we can not only store tabular data, but also perform different operations on it.

### Using `pandas` methods and attributes 

Since our columns are nothing but `pd.Series` objects, we can use all the attributes and methods that apply to them:
* https://pandas.pydata.org/docs/reference/api/pandas.Series.html

Just a sample of what we can do:
* Attributes:
    * `.index`, `.shape`, `.size`, `.values`, `.T`
* Methods:
    * `.abs()`, `.min()`, `.max()`, `.count()`, `.value_counts()`
    * `.sum()`, `.cumsum()`, `.mean()`, `.std()`
    * `.isna()`, `.isnull()`, `.idxmin()`, `.idxmax()`
    * `.unique()`, `.nunique()`, `.drop_duplicates()`

In [49]:
animals.index

RangeIndex(start=0, stop=126, step=1)

In [50]:
animals.shape # tuple with rows and columns

(126, 4)

In [51]:
animals.size # number of cells, also rows * columns

504

In [52]:
animals.values

array([[2019, 'ARGANZUELA', 10556, 5074],
       [2019, 'BARAJAS', 5086, 1515],
       [2019, 'CARABANCHEL', 20258, 6387],
       [2019, 'CENTRO', 16010, 9248],
       [2019, 'CHAMARTÍN', 11098, 3922],
       [2019, 'CHAMBERÍ', 13359, 4692],
       [2019, 'CIUDAD LINEAL', 17286, 8183],
       [2019, 'FUENCARRAL-EL PARDO', 17375, 6121],
       [2019, 'HORTALEZA', 15836, 8556],
       [2019, 'LATINA', 19049, 10564],
       [2019, 'MONCLOA-ARAVACA', 12367, 3931],
       [2019, 'MORATALAZ', 6724, 2502],
       [2019, 'PUENTE DE VALLECAS', 23437, 6208],
       [2019, 'RETIRO', 7786, 3105],
       [2019, 'SALAMANCA', 13471, 5033],
       [2019, 'SAN BLAS', 14228, 5064],
       [2019, 'TETUÁN', 12470, 5535],
       [2019, 'USERA', 12393, 2898],
       [2019, 'VICÁLVARO', 5244, 1505],
       [2019, 'VILLA DE VALLECAS', 9923, 2946],
       [2019, 'VILLAVERDE', 12917, 2694],
       [2018, 'ARGANZUELA', 10622, 4458],
       [2018, 'BARAJAS', 5203, 1300],
       [2018, 'CARABANCHEL', 20265, 5524],

In [53]:
animals.T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,116,117,118,119,120,121,122,123,124,125
year,2019,2019,2019,2019,2019,2019,2019,2019,2019,2019,...,2014,2014,2014,2014,2014,2014,2014,2014,2014,2014
district,ARGANZUELA,BARAJAS,CARABANCHEL,CENTRO,CHAMARTÍN,CHAMBERÍ,CIUDAD LINEAL,FUENCARRAL-EL PARDO,HORTALEZA,LATINA,...,MORATALAZ,PUENTE DE VALLECAS,RETIRO,SALAMANCA,SAN BLAS,TETUÁN,USERA,VICÁLVARO,VILLA DE VALLECAS,VILLAVERDE
dogs,10556,5086,20258,16010,11098,13359,17286,17375,15836,19049,...,6706,22072,8774,12942,12786,12301,11310,4584,7107,10467
cats,5074,1515,6387,9248,3922,4692,8183,6121,8556,10564,...,1153,2065,1344,1793,2043,2178,978,505,940,851


#### abs

In [54]:
my_list = [1, -4, 5, 6]

[abs(x) for x in my_list]

[1, 4, 5, 6]

In [55]:
list(map(abs, my_list))

[1, 4, 5, 6]

In [56]:
animals["dogs"].abs()

0      10556
1       5086
2      20258
3      16010
4      11098
       ...  
121    12301
122    11310
123     4584
124     7107
125    10467
Name: dogs, Length: 126, dtype: int64

#### max, min, count, value_counts

In [57]:
# max value in series
animals["dogs"].max()

23860

In [58]:
# min value in series
animals["dogs"].min()

4584

In [59]:
# number of elements in series
animals["district"].count()

126

In [60]:
animals["district"].unique() # distinct

array(['ARGANZUELA', 'BARAJAS', 'CARABANCHEL', 'CENTRO', 'CHAMARTÍN',
       'CHAMBERÍ', 'CIUDAD LINEAL', 'FUENCARRAL-EL PARDO', 'HORTALEZA',
       'LATINA', 'MONCLOA-ARAVACA', 'MORATALAZ', 'PUENTE DE VALLECAS',
       'RETIRO', 'SALAMANCA', 'SAN BLAS', 'TETUÁN', 'USERA', 'VICÁLVARO',
       'VILLA DE VALLECAS', 'VILLAVERDE', 'FUENCARRAL EL PARDO'],
      dtype=object)

In [61]:
animals["district"].nunique() # count(distinct ...)

22
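
The count of 22 is one more than the 21 rows per year shown by `value_counts`, and the `unique()` output above lists both `'FUENCARRAL-EL PARDO'` and `'FUENCARRAL EL PARDO'`. A minimal sketch of how one might merge the two spellings, on toy data and under the assumption that they refer to the same district:

```python
import pandas as pd

# Toy data reproducing the two spellings seen in the output above.
districts = pd.Series(
    ["FUENCARRAL-EL PARDO", "FUENCARRAL EL PARDO", "LATINA", "LATINA"]
)
print(districts.nunique())  # the two spellings are counted separately

# Assumption: both spellings refer to the same district.
cleaned = districts.replace({"FUENCARRAL EL PARDO": "FUENCARRAL-EL PARDO"})
print(cleaned.nunique())
```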

In [62]:
animals["district"].size

126

In [63]:
animals["district"].shape

(126,)

In [64]:
animals

Unnamed: 0,year,district,dogs,cats
0,2019,ARGANZUELA,10556,5074
1,2019,BARAJAS,5086,1515
2,2019,CARABANCHEL,20258,6387
3,2019,CENTRO,16010,9248
4,2019,CHAMARTÍN,11098,3922
...,...,...,...,...
121,2014,TETUÁN,12301,2178
122,2014,USERA,11310,978
123,2014,VICÁLVARO,4584,505
124,2014,VILLA DE VALLECAS,7107,940


In [67]:
# counts items per category
animals["year"].value_counts()

year
2019    21
2018    21
2017    21
2016    21
2015    21
2014    21
Name: count, dtype: int64

In [68]:
# if we pass normalize=True to `value_counts` we will get the proportions instead of the totals
animals["year"].value_counts(normalize=True)

year
2019    0.166667
2018    0.166667
2017    0.166667
2016    0.166667
2015    0.166667
2014    0.166667
Name: proportion, dtype: float64

#### sum, cumsum, mean, std

In [69]:
# sum of all elements
animals["cats"].sum()

419173

In [70]:
# cumulative sum
# item1, item1+item2, item1+item2+item3, ...
animals["cats"].cumsum()

0        5074
1        6589
2       12976
3       22224
4       26146
        ...  
121    415899
122    416877
123    417382
124    418322
125    419173
Name: cats, Length: 126, dtype: int64

In [71]:
# mean value of series
animals["cats"].mean()

3326.7698412698414

In [72]:
# standard deviation of series
animals["cats"].std()

2062.750967665068

In [73]:
animals.describe()

Unnamed: 0,year,dogs,cats
count,126.0,126.0,126.0
mean,2016.5,13062.206349,3326.769841
std,1.714643,4670.823654,2062.750968
min,2014.0,4584.0,505.0
25%,2015.0,10059.0,1848.5
50%,2016.5,12723.5,2814.0
75%,2018.0,16537.5,4451.75
max,2019.0,23860.0,10564.0


#### isna/isnull, idxmin/idxmax

In [74]:
import numpy as np

In [None]:
# missing values in pandas are represented as NaN (None counts as missing too)
# with isna we can locate them, and with ~ we keep everything that is not missing

s = pd.Series([1, None, "a", True, np.nan])

s[~s.isna()] # return all the values that are not null

0       1
2       a
3    True
dtype: object

In [76]:
# isna returns a boolean Series of the same shape (True where the value is missing); summing it counts the missing values
animals["dogs"].isna().sum()

0

In [78]:
animals.isna().mean()  # proportion of missing values per column (the mean of the True/False mask)

year        0.0
district    0.0
dogs        0.0
cats        0.0
dtype: float64
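
Since `animals` has no missing values, the result above is all zeros. The following sketch uses a made-up DataFrame (`toy` is an illustrative name) to show what `isna().sum()` and `isna().mean()` look like when some values are missing:

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value in "dogs" and two in "cats".
toy = pd.DataFrame({
    "dogs": [10, np.nan, 30, 40],
    "cats": [1, 2, np.nan, np.nan],
})

missing_count = toy.isna().sum()   # number of missing values per column
missing_share = toy.isna().mean()  # proportion of missing values per column

print(missing_count)
print(missing_share)
```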

In [79]:
# with dropna we can drop rows with NaN
pd.Series([1, None, "a", True, np.nan]).dropna()

0       1
2       a
3    True
dtype: object

In [80]:
# idxmax() returns the row label (index) of the highest value in series
animals["dogs"].idxmax()

54

In [81]:
# in which year was the highest number of dogs recorded in a single Madrid district?

"""
select
    year
from animals
where dogs in (select max(dogs) from animals)
"""

# idxmax() returns an index label, so we look the row up with .loc
year_max_dogs = animals.loc[animals["dogs"].idxmax(), :][["year"]]

year_max_dogs

year    2017
Name: 54, dtype: object
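
`idxmax()` returns a row label, not a position. With the default `RangeIndex` the two coincide, but they can differ once rows are filtered, dropped or re-indexed. A small sketch on a toy series (all names here are illustrative) showing the label-based and position-based lookups side by side:

```python
import pandas as pd

# Toy series whose labels (10, 20, 30) differ from its positions (0, 1, 2).
dogs = pd.Series([5, 9, 2], index=[10, 20, 30])

label = dogs.idxmax()     # label of the largest value
position = dogs.argmax()  # integer position of the largest value

print(dogs.loc[label])      # label-based lookup
print(dogs.iloc[position])  # position-based lookup
# dogs.iloc[label] would raise an IndexError here: there is no position 20.
```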

In [82]:
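# which district has the most cats over the whole period?

''' select
    district,
    sum(cats) as sum_cats
from animals
group by district
having sum(cats) = (
    select max(sum_cats)
    from (select sum(cats) as sum_cats from animals group by district) as totals
) '''

# note: the line below does not add up the years per district;
# it returns the single (year, district) row with the most cats
year_max_cats = animals.loc[animals['cats'].idxmax(), :]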

year_max_cats

year          2019
district    LATINA
dogs         19049
cats         10564
Name: 9, dtype: object
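
The SQL in the cell above totals the cats per district, whereas `idxmax` on the raw column only finds the single largest row. A sketch of the grouped version on made-up numbers (the `toy` data is an assumption chosen so that the two answers differ), using `groupby`, which is covered later in this session:

```python
import pandas as pd

# Made-up data: LATINA has the largest single row, CENTRO the largest total.
toy = pd.DataFrame({
    "year": [2018, 2018, 2019, 2019],
    "district": ["LATINA", "CENTRO", "LATINA", "CENTRO"],
    "cats": [10, 8, 3, 8],
})

# Row-level answer: district of the single row with the most cats.
single_row_district = toy.loc[toy["cats"].idxmax(), "district"]

# Grouped answer: add up cats per district, then take the largest total.
cats_per_district = toy.groupby("district")["cats"].sum()
top_district = cats_per_district.idxmax()

print(single_row_district, top_district)
```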

In [83]:
pd.Series([3, 3, 2]).idxmax()

0

In [84]:
animals.loc[animals["dogs"].idxmin(), :]

year             2014
district    VICÁLVARO
dogs             4584
cats              505
Name: 123, dtype: object

In [85]:
# idxmin() returns the row label (index) of the lowest value in series
animals["dogs"].idxmin()

123

#### unique, nunique, drop_duplicates

In [86]:
# returns an array with the unique values, like doing set(series)
animals["district"].unique()

array(['ARGANZUELA', 'BARAJAS', 'CARABANCHEL', 'CENTRO', 'CHAMARTÍN',
       'CHAMBERÍ', 'CIUDAD LINEAL', 'FUENCARRAL-EL PARDO', 'HORTALEZA',
       'LATINA', 'MONCLOA-ARAVACA', 'MORATALAZ', 'PUENTE DE VALLECAS',
       'RETIRO', 'SALAMANCA', 'SAN BLAS', 'TETUÁN', 'USERA', 'VICÁLVARO',
       'VILLA DE VALLECAS', 'VILLAVERDE', 'FUENCARRAL EL PARDO'],
      dtype=object)

In [87]:
animals["year"].unique()

array([2019, 2018, 2017, 2016, 2015, 2014], dtype=int64)

In [88]:
# nunique returns how many unique elements there are in the series, like doing len(set(series))
animals["year"].nunique()

6

In [89]:
animals["year"].value_counts()

year
2019    21
2018    21
2017    21
2016    21
2015    21
2014    21
Name: count, dtype: int64

In [4]:
animals

Unnamed: 0,year,district,dogs,cats
0,2019,ARGANZUELA,10556,5074
1,2019,BARAJAS,5086,1515
2,2019,CARABANCHEL,20258,6387
3,2019,CENTRO,16010,9248
4,2019,CHAMARTÍN,11098,3922
...,...,...,...,...
121,2014,TETUÁN,12301,2178
122,2014,USERA,11310,978
123,2014,VICÁLVARO,4584,505
124,2014,VILLA DE VALLECAS,7107,940


In [7]:
# drop_duplicates removes rows that repeat the same (year, district) pair, keeping the first occurrence and its original index label
animals.drop_duplicates(subset=["year", "district"], inplace=True)
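
To make the effect of `subset` visible, here is a sketch on a small made-up DataFrame that does contain a repeated (year, district) pair. The `toy` and `deduped` names are illustrative.

```python
import pandas as pd

# Made-up data: rows 0 and 1 share the same (year, district) pair.
toy = pd.DataFrame({
    "year": [2019, 2019, 2018],
    "district": ["LATINA", "LATINA", "LATINA"],
    "dogs": [100, 101, 90],
})

# Rows count as duplicates when they match on every column in `subset`;
# by default the first occurrence is kept and row labels are preserved.
deduped = toy.drop_duplicates(subset=["year", "district"])
print(deduped)

# Without inplace=True a new DataFrame is returned and `toy` is unchanged;
# with inplace=True the DataFrame is modified and the call returns None.
```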

In [8]:
animals

Unnamed: 0,year,district,dogs,cats
0,2019,ARGANZUELA,10556,5074
1,2019,BARAJAS,5086,1515
2,2019,CARABANCHEL,20258,6387
3,2019,CENTRO,16010,9248
4,2019,CHAMARTÍN,11098,3922
...,...,...,...,...
121,2014,TETUÁN,12301,2178
122,2014,USERA,11310,978
123,2014,VICÁLVARO,4584,505
124,2014,VILLA DE VALLECAS,7107,940


### Create new columns out of existing columns

* We can combine two or more columns with arithmetic operators
* We can build a column from a logical condition on other columns using `np.where`
    * ```Python
    np.where(condition_on_column, result_if_true, result_if_false)
    ```


In [10]:
# sum two columns

animals["total_animals"] = animals["cats"] + animals["dogs"]

animals.head()

Unnamed: 0,year,district,dogs,cats,total_animals
0,2019,ARGANZUELA,10556,5074,15630
1,2019,BARAJAS,5086,1515,6601
2,2019,CARABANCHEL,20258,6387,26645
3,2019,CENTRO,16010,9248,25258
4,2019,CHAMARTÍN,11098,3922,15020


In [11]:
animals["dogs_squared"] = animals["dogs"] ** 2 

animals.head()

Unnamed: 0,year,district,dogs,cats,total_animals,dogs_squared
0,2019,ARGANZUELA,10556,5074,15630,111429136
1,2019,BARAJAS,5086,1515,6601,25867396
2,2019,CARABANCHEL,20258,6387,26645,410386564
3,2019,CENTRO,16010,9248,25258,256320100
4,2019,CHAMARTÍN,11098,3922,15020,123165604


### np.where

```Python
np.where(
    condition_to_check,
    value_if_condition_is_true,
    value_if_condition_is_false
)
```

equivalent to the following, applied element by element:

```python
if condition:
    value_if_condition_is_true
else:
    value_if_condition_is_false
```
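
A short sketch of that equivalence on made-up values: `np.where` evaluates the condition for every element at once, and a plain Python loop with `if`/`else` produces the same labels. The names `totals` and `threshold` are illustrative.

```python
import numpy as np
import pandas as pd

totals = pd.Series([5, 20, 12])
threshold = 10

# Vectorised: the condition is checked for every element in one call.
vectorised = np.where(totals > threshold, "above", "below")

# Same logic written as an explicit if/else per element.
looped = ["above" if value > threshold else "below" for value in totals]

print(vectorised.tolist())
print(looped)
```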

In [12]:
# create a new column based on a logical condition on an existing column: `np.where`

import numpy as np

mean_animals = animals["total_animals"].mean()
print(mean_animals)

animals["total_animals_categorical"] = np.where(
    animals["total_animals"] > mean_animals, #if animals above mean
    "above_mean", # save "above_mean"
    "below_mean" # save "below_mean"
)

animals.sample(5)

16388.97619047619


Unnamed: 0,year,district,dogs,cats,total_animals,dogs_squared,total_animals_categorical
97,2015,RETIRO,8883,1391,10274,78907689,below_mean
21,2018,ARGANZUELA,10622,4458,15080,112826884,below_mean
46,2017,CHAMARTÍN,11894,3123,15017,141467236,below_mean
95,2015,MORATALAZ,6881,1173,8054,47348161,below_mean
18,2019,VICÁLVARO,5244,1505,6749,27499536,below_mean


In [13]:
animals.drop('district', axis = 1).sample(5)

Unnamed: 0,year,dogs,cats,total_animals,dogs_squared,total_animals_categorical
59,2017,12623,1847,14470,159340129,below_mean
85,2015,5217,663,5880,27217089,below_mean
8,2019,15836,8556,24392,250778896,above_mean
48,2017,17799,6273,24072,316804401,above_mean
52,2017,12738,2776,15514,162256644,below_mean


In [14]:
# concatenating strings and converting
animals["concat_string"] = animals["year"].astype(str) + " - " + animals["district"]

animals

Unnamed: 0,year,district,dogs,cats,total_animals,dogs_squared,total_animals_categorical,concat_string
0,2019,ARGANZUELA,10556,5074,15630,111429136,below_mean,2019 - ARGANZUELA
1,2019,BARAJAS,5086,1515,6601,25867396,below_mean,2019 - BARAJAS
2,2019,CARABANCHEL,20258,6387,26645,410386564,above_mean,2019 - CARABANCHEL
3,2019,CENTRO,16010,9248,25258,256320100,above_mean,2019 - CENTRO
4,2019,CHAMARTÍN,11098,3922,15020,123165604,below_mean,2019 - CHAMARTÍN
...,...,...,...,...,...,...,...,...
121,2014,TETUÁN,12301,2178,14479,151314601,below_mean,2014 - TETUÁN
122,2014,USERA,11310,978,12288,127916100,below_mean,2014 - USERA
123,2014,VICÁLVARO,4584,505,5089,21013056,below_mean,2014 - VICÁLVARO
124,2014,VILLA DE VALLECAS,7107,940,8047,50509449,below_mean,2014 - VILLA DE VALLECAS


In [15]:
# create a new column called "cats_per_dog" that contains the ratio cats/dogs
animals["cats_per_dog"] = animals["cats"] / animals["dogs"]

animals.loc[animals["cats_per_dog"].idxmax(), :]

year                                  2019
district                            CENTRO
dogs                                 16010
cats                                  9248
total_animals                        25258
dogs_squared                     256320100
total_animals_categorical       above_mean
concat_string                2019 - CENTRO
cats_per_dog                      0.577639
Name: 3, dtype: object

In [16]:
# create a new column called "cum_sum_animals" that contains 
# the cumulative sum of the total animals 
animals["cum_sum_animals"] = animals["total_animals"].cumsum()

animals.tail()

Unnamed: 0,year,district,dogs,cats,total_animals,dogs_squared,total_animals_categorical,concat_string,cats_per_dog,cum_sum_animals
121,2014,TETUÁN,12301,2178,14479,151314601,below_mean,2014 - TETUÁN,0.177059,2028269
122,2014,USERA,11310,978,12288,127916100,below_mean,2014 - USERA,0.086472,2040557
123,2014,VICÁLVARO,4584,505,5089,21013056,below_mean,2014 - VICÁLVARO,0.110166,2045646
124,2014,VILLA DE VALLECAS,7107,940,8047,50509449,below_mean,2014 - VILLA DE VALLECAS,0.132264,2053693
125,2014,VILLAVERDE,10467,851,11318,109558089,below_mean,2014 - VILLAVERDE,0.081303,2065011


### Sorting rows by column values using `.sort_values()`

We can sort our dataframes this way:

```Python
df.sort_values(by=[columns_to_order_with], ascending=True)
```
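
As a small illustration on made-up data, the sketch below sorts by two columns with a different direction for each, and shows that `sort_values` returns a new DataFrame rather than changing the original. The `toy` and `ordered` names are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    "district": ["A", "B", "C", "D"],
    "cats": [3, 1, 3, 2],
    "dogs": [10, 40, 30, 20],
})

# Sort by cats descending; ties on cats are broken by dogs ascending.
ordered = toy.sort_values(by=["cats", "dogs"], ascending=[False, True])
print(ordered)

# `toy` keeps its original order unless the result is assigned back to it.
print(toy)
```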

In [17]:
animals.sort_values(by="cats", ascending=False)

Unnamed: 0,year,district,dogs,cats,total_animals,dogs_squared,total_animals_categorical,concat_string,cats_per_dog,cum_sum_animals
9,2019,LATINA,19049,10564,29613,362864401,above_mean,2019 - LATINA,0.554570,210175
3,2019,CENTRO,16010,9248,25258,256320100,above_mean,2019 - CENTRO,0.577639,74134
8,2019,HORTALEZA,15836,8556,24392,250778896,above_mean,2019 - HORTALEZA,0.540288,180562
24,2018,CENTRO,15881,8186,24067,252206161,above_mean,2018 - CENTRO,0.515459,453995
6,2019,CIUDAD LINEAL,17286,8183,25469,298805796,above_mean,2019 - CIUDAD LINEAL,0.473389,132674
...,...,...,...,...,...,...,...,...,...,...
125,2014,VILLAVERDE,10467,851,11318,109558089,below_mean,2014 - VILLAVERDE,0.081303,2065011
85,2015,BARAJAS,5217,663,5880,27217089,below_mean,2015 - BARAJAS,0.127085,1464881
106,2014,BARAJAS,5233,659,5892,27384289,below_mean,2014 - BARAJAS,0.125932,1777685
102,2015,VICÁLVARO,4702,545,5247,22108804,below_mean,2015 - VICÁLVARO,0.115908,1739965


In [18]:
animals.sort_values(by=["cats", "dogs"], ascending=[True, False])

Unnamed: 0,year,district,dogs,cats,total_animals,dogs_squared,total_animals_categorical,concat_string,cats_per_dog,cum_sum_animals
123,2014,VICÁLVARO,4584,505,5089,21013056,below_mean,2014 - VICÁLVARO,0.110166,2045646
102,2015,VICÁLVARO,4702,545,5247,22108804,below_mean,2015 - VICÁLVARO,0.115908,1739965
106,2014,BARAJAS,5233,659,5892,27384289,below_mean,2014 - BARAJAS,0.125932,1777685
85,2015,BARAJAS,5217,663,5880,27217089,below_mean,2015 - BARAJAS,0.127085,1464881
125,2014,VILLAVERDE,10467,851,11318,109558089,below_mean,2014 - VILLAVERDE,0.081303,2065011
...,...,...,...,...,...,...,...,...,...,...
6,2019,CIUDAD LINEAL,17286,8183,25469,298805796,above_mean,2019 - CIUDAD LINEAL,0.473389,132674
24,2018,CENTRO,15881,8186,24067,252206161,above_mean,2018 - CENTRO,0.515459,453995
8,2019,HORTALEZA,15836,8556,24392,250778896,above_mean,2019 - HORTALEZA,0.540288,180562
3,2019,CENTRO,16010,9248,25258,256320100,above_mean,2019 - CENTRO,0.577639,74134


In [57]:
animals.sort_values(by=["cats", "dogs"], ascending=[False, True])

Unnamed: 0,year,district,dogs,cats,total_animals,dogs_squared,total_animals_categorical,concat_string,cats_per_dog,cum_sum_animals
9,2019,LATINA,19049,10564,29613,362864401,above_mean,2019 - LATINA,0.554570,210175
3,2019,CENTRO,16010,9248,25258,256320100,above_mean,2019 - CENTRO,0.577639,74134
8,2019,HORTALEZA,15836,8556,24392,250778896,above_mean,2019 - HORTALEZA,0.540288,180562
24,2018,CENTRO,15881,8186,24067,252206161,above_mean,2018 - CENTRO,0.515459,453995
6,2019,CIUDAD LINEAL,17286,8183,25469,298805796,above_mean,2019 - CIUDAD LINEAL,0.473389,132674
...,...,...,...,...,...,...,...,...,...,...
125,2014,VILLAVERDE,10467,851,11318,109558089,below_mean,2014 - VILLAVERDE,0.081303,2065011
85,2015,BARAJAS,5217,663,5880,27217089,below_mean,2015 - BARAJAS,0.127085,1464881
106,2014,BARAJAS,5233,659,5892,27384289,below_mean,2014 - BARAJAS,0.125932,1777685
102,2015,VICÁLVARO,4702,545,5247,22108804,below_mean,2015 - VICÁLVARO,0.115908,1739965


## Practice

### Exercise 1:
What percentage of all the dogs in the city in 2018 were in "LATINA"?

In [8]:
dogs_latina_2018 = animals[(animals["district"] == "LATINA") & (animals["year"] == 2018)]["dogs"].values[0]
dogs_latina_2018

19282

In [21]:
dogs_latina_2018 = animals[
    (animals["district"]=="LATINA")&
    (animals["year"]==2018)
]["dogs"].values[0]

dogs_2018 = animals[
    (animals["year"] == 2018)
]["dogs"].sum()

ratio = round(dogs_latina_2018 * 100 / dogs_2018, 1)

f"{ratio} % of the dogs in Madrid in 2018 are in Latina"

'6.9 % of the dogs in Madrid in 2018 are in Latina'

### Exercise 2:
How many districts had an "above_mean" rating in 2016?

In [60]:
animals[
    (animals["year"]==2016)&
    (animals["total_animals_categorical"]=="above_mean")
]["district"].unique()

array(['CARABANCHEL', 'CENTRO', 'CHAMBERÍ', 'CIUDAD LINEAL',
       'FUENCARRAL-EL PARDO', 'HORTALEZA', 'LATINA', 'PUENTE DE VALLECAS',
       'SAN BLAS'], dtype=object)

### Exercise 3:
Has the "Hortaleza" district increased or decreased its dog population in the analyzed period? By how much?

In [25]:
animals[
    (animals["district"]=="HORTALEZA")
][["year", "dogs"]].sort_values(by="year", ascending=False)

Unnamed: 0,year,dogs
8,2019,15836
29,2018,15965
50,2017,16558
71,2016,16451
92,2015,16888
113,2014,16476


In [62]:
dogs_hortaleza_2019 = animals[
    (animals["district"]=="HORTALEZA")
][["year", "dogs"]].sort_values(by="year", ascending=False)["dogs"].values[0]

dogs_hortaleza_2014 = animals[
    (animals["district"]=="HORTALEZA")
][["year", "dogs"]].sort_values(by="year", ascending=False)["dogs"].values[-1]

# to calculate the evolution we subtract the number of dogs in Hortaleza in 2014 from the number in 2019
evolution = dogs_hortaleza_2019 - dogs_hortaleza_2014

# results
result = "increased" if evolution > 0 else "decreased"

print(f"The number of dogs in Hortaleza has {result} by {abs(evolution)} dogs from 2014 to 2019")

The number of dogs in Hortaleza has decreased by 640 dogs from 2014 to 2019


## Groupby and aggregations

### `groupby`

Just like in SQL, we can use `groupby` to split a DataFrame into groups of rows and apply an operation to each group as a whole.

```Python
df.groupby([columns_to_group]).function_to_apply_to_each_group
```

In [63]:
animals.head()

Unnamed: 0,year,district,dogs,cats,total_animals,dogs_squared,total_animals_categorical,concat_string,cats_per_dog,cum_sum_animals
0,2019,ARGANZUELA,10556,5074,15630,111429136,below_mean,2019 - ARGANZUELA,0.480674,15630
1,2019,BARAJAS,5086,1515,6601,25867396,below_mean,2019 - BARAJAS,0.297877,22231
2,2019,CARABANCHEL,20258,6387,26645,410386564,above_mean,2019 - CARABANCHEL,0.315283,48876
3,2019,CENTRO,16010,9248,25258,256320100,above_mean,2019 - CENTRO,0.577639,74134
4,2019,CHAMARTÍN,11098,3922,15020,123165604,below_mean,2019 - CHAMARTÍN,0.353397,89154


In [None]:
animals.groupby("year")[["dogs"]].sum() # This uses the group by column as the index

Unnamed: 0_level_0,dogs
year,Unnamed: 1_level_1
2014,264115
2015,270281
2016,274770
2017,281339
2018,278460
2019,276873


In [2]:
# how many dogs per year
"""
select
    year,
    sum(dogs) as sum_dogs
from animals
group by year
"""

animals.groupby("year", as_index=False)[["dogs"]].sum()  # as_index=False keeps "year" as a regular column instead of the index


Unnamed: 0,year,dogs
0,2014,264115
1,2015,270281
2,2016,274770
3,2017,281339
4,2018,278460
5,2019,276873


In [26]:
energy = pd.read_csv('energy.csv')

energy.groupby("month", as_index=False)[["spot_price"]].sum()  # total spot_price per month, with "month" kept as a regular column

Unnamed: 0,month,spot_price
0,1,46098.13
1,2,36301.77
2,3,36334.07
3,4,36290.6
4,5,36005.13
5,6,33956.09
6,7,38289.44
7,8,33444.15
8,9,30333.68
9,10,35081.33


In [27]:
# read energy data
from datetime import datetime

energy = pd.read_csv("energy.csv")

energy.head()

Unnamed: 0,datetime,power_demand,nuclear,gas,solar,hydro,coal,wind,spot_price,year,month,day,hour,weekday
0,2018-12-31 23:00:00+00:00,23251.2,6059.2,2954.0,7.1,3202.8,1867.0,3830.3,66.88,2018,12,31,23,0
1,2019-01-01 00:00:00+00:00,22485.0,6059.2,3044.1,8.0,2884.4,1618.0,3172.1,66.88,2019,1,1,0,1
2,2019-01-01 01:00:00+00:00,20977.0,6059.2,3138.6,7.5,1950.8,1535.3,2980.5,66.0,2019,1,1,1,1
3,2019-01-01 02:00:00+00:00,19754.2,6059.2,3596.2,7.5,1675.7,1344.0,2840.0,63.64,2019,1,1,2,1
4,2019-01-01 03:00:00+00:00,19320.6,6063.4,3192.6,7.5,1581.8,1345.0,3253.4,58.85,2019,1,1,3,1


In [66]:
energy[["power_demand"]].head()

Unnamed: 0,power_demand
0,23251.2
1,22485.0
2,20977.0
3,19754.2
4,19320.6


In [67]:
# mean spot_price per month 

"""
select 
    month,
    avg(spot_price)
from energy
group by month
"""

energy.groupby("month", as_index=False)[["spot_price"]].mean()

Unnamed: 0,month,spot_price
0,1,61.959852
1,2,54.020491
2,3,48.836116
3,4,50.403611
4,5,48.393992
5,6,47.161236
6,7,51.464301
7,8,44.951815
8,9,42.130111
9,10,47.152325


In [68]:
# max power_demand per hour

"""
select
    hour, 
    max(power_demand)
from energy
group by 1
"""


max_energy_per_hour = energy.groupby("hour")[["power_demand"]].max()

max_energy_per_hour

Unnamed: 0_level_0,power_demand
hour,Unnamed: 1_level_1
0,27739.0
1,26459.4
2,26298.4
3,26429.5
4,28167.9
5,30511.2
6,34706.4
7,38116.0
8,39241.7
9,39898.3


In [8]:
energy.sample(5)

Unnamed: 0,datetime,power_demand,nuclear,gas,solar,hydro,coal,wind,spot_price,year,month,day,hour,weekday,fossil_fuel_consumption
587,2019-01-25 10:00:00+00:00,36503.2,7096.2,5480.3,2356.9,4304.1,5764.0,9876.0,68.25,2019,1,25,10,4,11244.3
6706,2019-10-07 09:00:00+00:00,31335.3,5940.6,10115.1,3273.3,1441.8,682.0,4595.6,51.75,2019,10,7,9,0,10797.1
8095,2019-12-04 06:00:00+00:00,31187.5,4999.7,7363.7,6.7,6539.7,1217.0,4177.1,60.63,2019,12,4,6,2,8580.7
7054,2019-10-21 21:00:00+00:00,25745.2,5966.0,9022.2,1.5,2039.1,1576.4,2698.4,44.19,2019,10,21,21,0,10598.6
673,2019-01-29 00:00:00+00:00,26199.9,7103.7,1269.5,,2231.4,1984.2,10088.2,50.15,2019,1,29,0,1,3253.7


In [28]:
# day of week with lowest average consumption of fossil fuels 

energy["fossil_fuel_consumption"] = energy["gas"] + energy["coal"]

mean_fuel_consumption = energy.groupby(["weekday"])[["fossil_fuel_consumption"]].mean()

print(mean_fuel_consumption)

print(energy.groupby(["weekday"])[["fossil_fuel_consumption"]].mean().idxmin())  # Sunday

         fossil_fuel_consumption
weekday                         
0                    7777.190530
1                    8296.544281
2                    7975.761683
3                    7776.072873
4                    7669.665060
5                    6170.212417
6                    5103.884206
fossil_fuel_consumption    6
dtype: int64


### Inside a `groupby` object

Iterating over a `groupby` object yields one tuple per `category` in the `column`(s) we're grouping by:
* The first element of the tuple is the `category` itself
* The second element is the data associated with that category:
    * ```Python
    df[df[col_groupby]==category]
    ```
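
A minimal sketch of that structure on made-up data: we loop over the `(category, group)` tuples, compare each group with the equivalent boolean filter, and fetch one group directly with `get_group`. The `toy` DataFrame is an illustrative assumption.

```python
import pandas as pd

toy = pd.DataFrame({
    "year": [2018, 2019, 2019],
    "spot_price": [66.9, 66.0, 63.6],
})

# Each iteration yields the category and the rows that belong to it.
for category, group in toy.groupby("year"):
    same_rows = toy[toy["year"] == category]
    print(category, len(group), group.equals(same_rows))

# get_group retrieves the rows of a single category without looping.
rows_2019 = toy.groupby("year").get_group(2019)
print(rows_2019)
```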

In [70]:
# what's inside a groupby object?
list(energy.groupby("year"))

[(2018,
                      datetime  power_demand  nuclear     gas  solar   hydro  \
  0  2018-12-31 23:00:00+00:00       23251.2   6059.2  2954.0    7.1  3202.8   
  
       coal    wind  spot_price  year  month  day  hour  weekday  \
  0  1867.0  3830.3       66.88  2018     12   31    23        0   
  
     fossil_fuel_consumption  
  0                   4821.0  ),
 (2019,
                         datetime  power_demand  nuclear     gas  solar   hydro  \
  1     2019-01-01 00:00:00+00:00       22485.0   6059.2  3044.1    8.0  2884.4   
  2     2019-01-01 01:00:00+00:00       20977.0   6059.2  3138.6    7.5  1950.8   
  3     2019-01-01 02:00:00+00:00       19754.2   6059.2  3596.2    7.5  1675.7   
  4     2019-01-01 03:00:00+00:00       19320.6   6063.4  3192.6    7.5  1581.8   
  5     2019-01-01 04:00:00+00:00       19262.3   6063.4  3167.9    7.5  1535.6   
  ...                         ...           ...      ...     ...    ...     ...   
  8732  2019-12-30 19:00:00+00:00    

In [29]:
groupby_object = energy.groupby("year")

In [30]:
# years present in the data of the second group (index 1)
list(groupby_object)[1][1]["year"].unique()

array([2019], dtype=int64)

In [31]:
# category
list(groupby_object)[0][0]

2018

In [32]:
# data associated with category
list(groupby_object)[0][1]

Unnamed: 0,datetime,power_demand,nuclear,gas,solar,hydro,coal,wind,spot_price,year,month,day,hour,weekday,fossil_fuel_consumption
0,2018-12-31 23:00:00+00:00,23251.2,6059.2,2954.0,7.1,3202.8,1867.0,3830.3,66.88,2018,12,31,23,0,4821.0


Now that we understand the `groupby` object, we can dig a bit deeper into the syntax.

If we want to group by several columns, we can pass a list of columns to `groupby` and perform the operation we need.

If we don't want the grouping columns to become the index of the resulting DataFrame, we can pass `as_index=False` to `groupby`.
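
A sketch of both points on made-up data: grouping by two columns produces a `MultiIndex` by default, while `as_index=False` returns the grouping columns as ordinary columns. The `toy`, `indexed` and `flat` names are illustrative.

```python
import pandas as pd

toy = pd.DataFrame({
    "month": [1, 1, 1, 2],
    "weekday": [0, 0, 1, 0],
    "coal": [10.0, 20.0, 5.0, 7.0],
})

# Default: the grouping columns become a two-level index.
indexed = toy.groupby(["month", "weekday"])[["coal"]].sum()
print(indexed)

# as_index=False: the grouping columns stay as regular columns.
flat = toy.groupby(["month", "weekday"], as_index=False)[["coal"]].sum()
print(flat)
```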

### `groupby` and `agg`

If we want to apply several different aggregations after `groupby`, we can combine `groupby` with `agg`.

In [13]:
min.__name__

'min'

In [14]:
energy.head()

"""
select
    day, 
    min(spot_price) as min_daily_price,
    max(spot_price) as max_daily_price,
    avg(spot_price) as avg_daily_price
from energy
group by day
"""

energy.groupby('day').agg({'spot_price': ['min', 'max', 'mean']}).head(5)

Unnamed: 0_level_0,spot_price,spot_price,spot_price
Unnamed: 0_level_1,min,max,mean
day,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
1,27.31,68.07,48.912049
2,16.45,69.3,46.173125
3,6.2,71.1,47.402361
4,3.52,71.94,49.242986
5,8.45,69.5,48.337986


In [None]:
group_by_df = energy.groupby('day').agg({'spot_price': ['min', 'max', 'mean']})

print(group_by_df.columns)

print(group_by_df.loc[6, ('spot_price', 'mean')])
print(group_by_df.iloc[5, 2])  # same cell by position: row 5 is day 6 and column 2 is ('spot_price', 'mean'); iloc ignores the MultiIndex labels

MultiIndex([('spot_price',  'min'),
            ('spot_price',  'max'),
            ('spot_price', 'mean')],
           )
47.40774305555556
47.40774305555556


In [76]:
energy.head(2)

Unnamed: 0,datetime,power_demand,nuclear,gas,solar,hydro,coal,wind,spot_price,year,month,day,hour,weekday,fossil_fuel_consumption
0,2018-12-31 23:00:00+00:00,23251.2,6059.2,2954.0,7.1,3202.8,1867.0,3830.3,66.88,2018,12,31,23,0,4821.0
1,2019-01-01 00:00:00+00:00,22485.0,6059.2,3044.1,8.0,2884.4,1618.0,3172.1,66.88,2019,1,1,0,1,4662.1


In [77]:
energy["power_demand"]

0       23251.2
1       22485.0
2       20977.0
3       19754.2
4       19320.6
         ...   
8732    31160.6
8733    31152.9
8734    29151.0
8735    26989.6
8736    24350.5
Name: power_demand, Length: 8737, dtype: float64

In [35]:
# extract the min, max and mean spot_price per day, and sum and std of hydro per day

"""
select
    day,
    min(spot_price),
    max(spot_price),
    avg(spot_price),
    sum(hydro),
    std(hydro)
from energy
group by day
"""

energy.groupby("day")[["hydro", "spot_price"]].agg(
    {
        "spot_price": ["min", "max", "mean"],
        "hydro": ["sum", "std"]
    }
)

Unnamed: 0_level_0,spot_price,spot_price,spot_price,hydro,hydro
Unnamed: 0_level_1,min,max,mean,sum,std
day,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
1,27.31,68.07,48.912049,744960.0,1380.662975
2,16.45,69.3,46.173125,744957.2,1330.561572
3,6.2,71.1,47.402361,777266.8,1493.712142
4,3.52,71.94,49.242986,827278.4,1713.060398
5,8.45,69.5,48.337986,786845.1,1714.552209
6,25.68,68.0,47.407743,744371.5,1553.762043
7,26.73,70.39,48.625451,770862.7,1504.654832
8,27.1,70.0,48.119132,764295.1,1431.909429
9,28.61,69.01,48.044063,756220.9,1339.658635
10,15.6,69.75,46.376111,738336.8,1350.319475


In [36]:
import numpy as np

df = energy.groupby("day")[["hydro", "spot_price"]].agg(
    {
        "spot_price": ["min", "max", "mean"],
#        "hydro": ["sum", "std"]
    }
)

df.head()

Unnamed: 0_level_0,spot_price,spot_price,spot_price
Unnamed: 0_level_1,min,max,mean
day,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
1,27.31,68.07,48.912049
2,16.45,69.3,46.173125
3,6.2,71.1,47.402361
4,3.52,71.94,49.242986
5,8.45,69.5,48.337986


In [37]:
df[('spot_price', 'mean')]

day
1     48.912049
2     46.173125
3     47.402361
4     49.242986
5     48.337986
6     47.407743
7     48.625451
8     48.119132
9     48.044063
10    46.376111
11    49.112674
12    48.225729
13    47.233333
14    47.478264
15    48.465069
16    49.890069
17    47.238160
18    49.840313
19    47.573646
20    46.326944
21    46.755729
22    46.909514
23    46.642118
24    45.106736
25    44.222049
26    46.325972
27    47.056076
28    48.337222
29    49.116288
30    49.551553
31    50.491931
Name: (spot_price, mean), dtype: float64

In [38]:
df.columns

MultiIndex([('spot_price',  'min'),
            ('spot_price',  'max'),
            ('spot_price', 'mean')],
           )

In [39]:
# change the names of the columns

df.columns = [f"{tpl[0]}_{tpl[1]}" for tpl in df.columns]

df

Unnamed: 0_level_0,spot_price_min,spot_price_max,spot_price_mean
day,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1,27.31,68.07,48.912049
2,16.45,69.3,46.173125
3,6.2,71.1,47.402361
4,3.52,71.94,49.242986
5,8.45,69.5,48.337986
6,25.68,68.0,47.407743
7,26.73,70.39,48.625451
8,27.1,70.0,48.119132
9,28.61,69.01,48.044063
10,15.6,69.75,46.376111
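The renaming step above can also be avoided altogether. The sketch below uses a small made-up DataFrame (`toy` is a hypothetical stand-in for `energy`) and compares two options: flattening the MultiIndex columns after a dict-based `agg`, and using pandas named aggregation, which should give flat column names directly.

```python
import pandas as pd

# Hypothetical toy data standing in for the `energy` DataFrame
toy = pd.DataFrame({
    "day": [1, 1, 2, 2],
    "spot_price": [10.0, 20.0, 30.0, 50.0],
})

# Option 1: aggregate with a dict, then join each (column, function) tuple
multi = toy.groupby("day").agg({"spot_price": ["min", "max", "mean"]})
flat = multi.copy()
flat.columns = ["_".join(tpl) for tpl in multi.columns]

# Option 2: named aggregation, new_name=(column, function)
named = toy.groupby("day").agg(
    spot_price_min=("spot_price", "min"),
    spot_price_max=("spot_price", "max"),
    spot_price_mean=("spot_price", "mean"),
)

print(named)
```

Named aggregation keeps the output names next to the functions that produce them, which is usually easier to read when only a few statistics are needed.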


In [40]:
# groupby on several columns and apply different aggregations to coal, wind and solar

df = energy.groupby(["month", "weekday", "hour"]).agg({
    "coal": ["sum", "mean", "std"],
    "wind": ["min", "max", "std"],
    "solar": ["min", "max", "std"],
})

df

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,coal,coal,coal,wind,wind,wind,solar,solar,solar
Unnamed: 0_level_1,Unnamed: 1_level_1,Unnamed: 2_level_1,sum,mean,std,min,max,std,min,max,std
month,weekday,hour,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2,Unnamed: 10_level_2,Unnamed: 11_level_2
1,0,0,11913.6,2978.400000,1548.304216,6924.0,12154.0,2341.175057,0.1,0.1,
1,0,1,11402.2,2850.550000,1527.859214,6895.5,12342.0,2447.057233,0.1,0.1,
1,0,2,11539.3,2884.825000,1519.787021,6949.7,12765.9,2679.795125,,,
1,0,3,11753.1,2938.275000,1437.644823,7174.2,12892.9,2733.835641,,,
1,0,4,12982.3,3245.575000,1554.457964,7290.8,13219.6,2840.367746,,,
...,...,...,...,...,...,...,...,...,...,...,...
12,6,19,2312.1,770.700000,323.748251,2025.4,13660.9,4212.173928,0.1,12.8,6.205643
12,6,20,2302.1,767.366667,324.047471,2026.8,13323.1,4045.097944,0.1,13.0,6.841296
12,6,21,2288.1,572.025000,422.352747,2055.0,13132.9,3936.590515,0.1,12.9,6.843975
12,6,22,2305.1,576.275000,368.008799,1983.1,11039.1,3263.962954,0.1,12.9,6.784541


In [84]:
energy["month"].unique()

array([12,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11])

In [85]:
# We can handle a multiindex like the one resulting from a groupby with several columns 
# and several operations in the following way:

df = energy.groupby(["month", "weekday"]).agg({
    "coal": ["sum", "mean"],
    "wind": ["sum", "mean"]
})

# mean coal generation on Tuesdays in January
df.loc[(1, 1), ("coal", "mean")]

4230.0858333333335
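As a minimal sketch of the same idea on made-up numbers (the `toy` DataFrame is hypothetical, not the course data), the cell below shows three ways to read from a result with MultiIndex rows and columns: a full key for one cell, a partial key for the outer level, and `xs` for an inner level.

```python
import pandas as pd

# Hypothetical toy data: two months, two weekdays
toy = pd.DataFrame({
    "month":   [1, 1, 1, 1, 2, 2],
    "weekday": [0, 0, 1, 1, 0, 1],
    "coal":    [10.0, 20.0, 30.0, 50.0, 5.0, 7.0],
})

agg = toy.groupby(["month", "weekday"]).agg({"coal": ["sum", "mean"]})

# One cell: row tuple (month, weekday) and column tuple (column, function)
jan_tue_mean = agg.loc[(1, 1), ("coal", "mean")]

# All weekdays of one month: a partial row key selects on the outer level
january = agg.loc[1]

# All months for one weekday: xs selects on an inner level by name
tuesdays = agg.xs(1, level="weekday")

print(jan_tue_mean)
```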

In [43]:
print(energy["datetime"].values[0][:-6])
energy["datetime"].values[0]

2018-12-31 23:00:00


'2018-12-31 23:00:00+00:00'

In [44]:
# strip the "+00:00" UTC offset (last 6 characters) and parse the rest as a timestamp
a = pd.to_datetime(energy["datetime"].values[0][:-6], format="%Y-%m-%d %H:%M:%S")

a.month

12
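The single-value conversion above can be extended to a whole column. This is a sketch on two made-up rows in the same string format (the `toy` DataFrame is hypothetical); it assumes the offset should be parsed with `utc=True` rather than sliced off, and it uses the `.dt` accessor to derive the date parts.

```python
import pandas as pd

# Hypothetical rows in the same format as energy["datetime"]
toy = pd.DataFrame({
    "datetime": ["2018-12-31 23:00:00+00:00", "2019-01-01 00:00:00+00:00"],
})

# utc=True parses the "+00:00" offset instead of slicing it off
stamps = pd.to_datetime(toy["datetime"], utc=True)

toy["month"] = stamps.dt.month
toy["weekday"] = stamps.dt.weekday  # Monday=0 ... Sunday=6
toy["hour"] = stamps.dt.hour

print(toy)
```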

When a DataFrame has several index levels (a MultiIndex), we can reshape it with `unstack()` and `stack()`:

### `stack` and `unstack`

These methods allow us to "move" labels from rows to columns and vice versa:
* `unstack` moves row labels to column labels
* `stack` moves column labels to row labels

By default, both methods operate on the last (innermost) level, i.e. `level=-1`.
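A minimal sketch of the default behaviour, using a made-up four-row DataFrame (`toy` is hypothetical): `unstack()` with no argument moves the last row level to the columns, and `stack()` moves it back.

```python
import pandas as pd

# Hypothetical toy data: one value per (month, weekday) pair
toy = pd.DataFrame({
    "month":   [1, 1, 2, 2],
    "weekday": [0, 1, 0, 1],
    "coal":    [10.0, 20.0, 30.0, 40.0],
})

# Series with a two-level row index: (month, weekday)
long = toy.groupby(["month", "weekday"])["coal"].sum()

# unstack() with no argument moves the last row level (weekday) to the columns
wide = long.unstack()

# stack() with no argument moves the last column level back to the rows
back = wide.stack()

print(wide)
```

In `wide`, each month is one row and each weekday is one column; `back` has the same (month, weekday) row index as `long`.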

In [10]:
energy = pd.read_csv('energy.csv')

In [11]:
energy.groupby(["month", "weekday"]).agg({
    "coal": ["sum", "mean"],
    "wind": ["sum", "mean"]
})

Unnamed: 0_level_0,Unnamed: 1_level_0,coal,coal,wind,wind
Unnamed: 0_level_1,Unnamed: 1_level_1,sum,mean,sum,mean
month,weekday,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
1,0,421269.8,4388.227083,824014.0,8583.479167
1,1,507610.3,4230.085833,846574.9,7054.790833
1,2,512002.1,4266.684167,1127327.0,9394.391667
1,3,499262.3,4160.519167,1107441.0,9228.675000
1,4,456063.6,4750.662500,604727.5,6299.244792
...,...,...,...,...,...
12,2,57690.3,801.254167,687588.4,7162.379167
12,3,60719.6,645.953191,966697.5,10069.765625
12,4,49367.6,530.834409,843487.3,8786.326042
12,5,31480.4,655.841667,675993.2,7041.595833


In [12]:
energy.groupby(["month", "weekday"]).agg({
    "coal": ["sum", "mean"],
    "wind": ["sum", "mean"]
}).unstack(level="weekday")

Unnamed: 0_level_0,coal,coal,coal,coal,coal,coal,coal,coal,coal,coal,...,wind,wind,wind,wind,wind,wind,wind,wind,wind,wind
Unnamed: 0_level_1,sum,sum,sum,sum,sum,sum,sum,mean,mean,mean,...,sum,sum,sum,mean,mean,mean,mean,mean,mean,mean
weekday,0,1,2,3,4,5,6,0,1,2,...,4,5,6,0,1,2,3,4,5,6
month,Unnamed: 1_level_3,Unnamed: 2_level_3,Unnamed: 3_level_3,Unnamed: 4_level_3,Unnamed: 5_level_3,Unnamed: 6_level_3,Unnamed: 7_level_3,Unnamed: 8_level_3,Unnamed: 9_level_3,Unnamed: 10_level_3,Unnamed: 11_level_3,Unnamed: 12_level_3,Unnamed: 13_level_3,Unnamed: 14_level_3,Unnamed: 15_level_3,Unnamed: 16_level_3,Unnamed: 17_level_3,Unnamed: 18_level_3,Unnamed: 19_level_3,Unnamed: 20_level_3,Unnamed: 21_level_3
1,421269.8,507610.3,512002.1,499262.3,456063.6,403004.1,299088.4,4388.227083,4230.085833,4266.684167,...,604727.5,547692.5,845505.9,8583.479167,7054.790833,9394.391667,9228.675,6299.244792,5705.130208,8807.353125
2,355023.1,412356.6,409794.9,354202.2,296157.4,228108.9,201911.1,3698.157292,4295.38125,4268.696875,...,687880.1,742590.4,709697.2,4573.125,3127.797917,3264.470833,4621.85,7165.417708,7735.316667,7392.679167
3,90575.2,93075.8,83690.6,115528.7,211058.3,136452.7,105241.6,943.491667,969.539583,871.777083,...,479457.2,509594.6,710068.8,7840.760417,8366.327083,9745.963542,6416.372917,3995.476667,4246.621667,5917.24
4,143198.9,141308.6,93011.4,93500.1,97725.8,86918.0,75626.0,1193.324167,1177.571667,968.86875,...,688169.6,654563.4,540527.0,4688.543333,5507.959167,9419.402083,7540.68125,7168.433333,6818.36875,5630.489583
5,46014.0,57783.7,67651.7,65313.1,54077.2,34526.5,34845.2,479.3125,601.913542,563.764167,...,873835.7,631901.0,625664.3,4705.53125,5094.652083,6217.274167,6080.86,7281.964167,6582.302083,6517.336458
6,60259.3,69456.2,65415.9,57449.5,53481.9,58768.7,65776.0,627.701042,723.502083,681.415625,...,449544.9,414505.6,425935.3,3330.016667,5273.940625,5041.561458,5647.36875,4682.759375,3454.213333,3549.460833
7,103709.9,120206.6,126196.5,102481.9,102255.1,59825.4,54971.8,864.249167,1001.721667,1051.6375,...,315439.9,438007.4,383020.7,5041.106667,4705.435833,4125.745,3559.652083,3285.832292,4562.577083,3989.798958
8,39654.1,47677.5,49256.2,68909.5,68596.1,54148.6,35596.5,413.063542,496.640625,513.085417,...,437170.9,334951.0,385160.3,4416.210417,3737.441667,3280.898958,3903.4125,3643.090833,2791.258333,4012.086458
9,87533.4,74152.3,73621.4,63098.8,62168.6,42217.6,57211.7,729.445,772.419792,766.889583,...,577429.3,548192.7,590555.9,4250.530833,5994.954167,5176.195833,5125.916667,6014.888542,5710.340625,4921.299167
10,100779.0,133426.4,127907.2,116723.8,78866.9,64727.8,65093.0,1049.78125,1111.886667,1065.893333,...,397112.2,492724.4,444821.8,4660.223958,5447.506667,5343.123333,5583.9275,4136.585417,5132.545833,4633.560417


In [51]:
# create DF with 2 indices
df = energy.groupby(["month", "weekday"]).agg({
    "coal": ["sum", "mean"],
    "wind": ["sum", "mean"]
})

# sum of wind on Thursdays in August
df.loc[(8, 3), ("wind", "sum")]

468409.5

In [49]:
# move `weekday` from rows to columns: unstack weekday
df = energy.groupby(["month", "weekday"]).agg({
    "coal": ["sum", "mean"],
    "wind": ["sum", "mean"]
}).unstack(level="weekday")

# average coal on Saturdays in November
df.loc[11, ("coal", "mean", 5)]

576.5266666666666

In [52]:
# move ("sum", "mean") from column labels to rows: stack level 1
df = energy.groupby(["month", "weekday"]).agg({
    "coal": ["sum", "mean"],
    "wind": ["sum", "mean"]
}).stack(1)

# what is the sum of wind on Wednesdays in July?
df.loc[(7, 2, "sum"), "wind"]

df

Unnamed: 0_level_0,Unnamed: 1_level_0,Unnamed: 2_level_0,coal,wind
month,weekday,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
1,0,sum,421269.800000,8.240140e+05
1,0,mean,4388.227083,8.583479e+03
1,1,sum,507610.300000,8.465749e+05
1,1,mean,4230.085833,7.054791e+03
1,2,sum,512002.100000,1.127327e+06
...,...,...,...,...
12,4,mean,530.834409,8.786326e+03
12,5,sum,31480.400000,6.759932e+05
12,5,mean,655.841667,7.041596e+03
12,6,sum,51225.700000,7.598387e+05


## Practice

### Exercise 1: `energy` dataset
What was the maximum solar power generation in August?

In [55]:
pd.set_option('display.max_columns', None)  # show every column instead of truncating
energy.head(3)

Unnamed: 0,datetime,power_demand,nuclear,gas,solar,hydro,coal,wind,spot_price,year,month,day,hour,weekday,fossil_fuel_consumption
0,2018-12-31 23:00:00+00:00,23251.2,6059.2,2954.0,7.1,3202.8,1867.0,3830.3,66.88,2018,12,31,23,0,4821.0
1,2019-01-01 00:00:00+00:00,22485.0,6059.2,3044.1,8.0,2884.4,1618.0,3172.1,66.88,2019,1,1,0,1,4662.1
2,2019-01-01 01:00:00+00:00,20977.0,6059.2,3138.6,7.5,1950.8,1535.3,2980.5,66.0,2019,1,1,1,1,4673.9


In [58]:
energy[energy['month'] == 8][['month', 'solar']].max()

month       8.0
solar    4050.1
dtype: float64

In [14]:
energy[energy['month'] == 8]['solar'].max()

4050.1

### Exercise 2: `energy` dataset
What is the average production of each of the following technologies at hour 5?

```Python
tech = ["nuclear", "solar", "hydro"]
```

In [17]:
energy.groupby('hour').agg({
    'nuclear': 'mean',
    'solar': 'mean',
    'hydro': 'mean'
})

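# .loc selects the hour by its index label; only the last expression is displayed
# (the output below is for hour 8; use .loc[5] for hour 5)
energy.groupby('hour')[['nuclear', 'solar', 'hydro']].apply('mean').loc[8]
energy.groupby('hour')[['nuclear', 'solar', 'hydro']].agg('mean').loc[8]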

nuclear    6386.769780
solar      1788.310165
hydro      3507.838736
Name: 8, dtype: float64
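One possible way to answer the question for hour 5 while reusing the `tech` list is sketched below on made-up numbers (the `toy` DataFrame and its values are hypothetical). It compares grouping first and then selecting the hour, against filtering the hour first and then averaging.

```python
import pandas as pd

# Hypothetical toy data with two hours only
toy = pd.DataFrame({
    "hour":    [5, 5, 8, 8],
    "nuclear": [6000.0, 6200.0, 6400.0, 6600.0],
    "solar":   [0.0, 2.0, 1500.0, 1700.0],
    "hydro":   [1000.0, 1200.0, 3000.0, 3400.0],
})

tech = ["nuclear", "solar", "hydro"]

# Group by hour, average only the listed columns, then pick hour 5 by label
by_group = toy.groupby("hour")[tech].mean().loc[5]

# Equivalent: keep only the rows for hour 5 first, then average
by_filter = toy.loc[toy["hour"] == 5, tech].mean()

print(by_group)
```

Filtering first avoids computing means for hours that are never used, which may matter on larger data.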

### Exercise 3:
Create a new column called `stop_wind` with value 1 if `spot_price` is below 20, and 0 otherwise.

In [73]:
energy['stop_wind'] = np.where(energy['spot_price'] < 20, 1, 0)
energy[energy['spot_price'] < 20].sample()

Unnamed: 0,datetime,power_demand,nuclear,gas,solar,hydro,coal,wind,spot_price,year,month,day,hour,weekday,fossil_fuel_consumption,stop_wind
7350,2019-11-03 05:00:00+00:00,19378.1,4676.9,1451.5,2.2,1061.0,495.0,15066.6,7.62,2019,11,3,5,6,1946.5,1
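As an alternative sketch for this kind of 0/1 flag (the `toy` prices are made up), the boolean comparison can be cast to integers directly. It should give the same column as `np.where`, including at the boundary value 20, which is not "below 20".

```python
import numpy as np
import pandas as pd

# Hypothetical spot prices, including the boundary value 20.0
toy = pd.DataFrame({"spot_price": [7.62, 42.68, 19.99, 20.0]})

# Same approach as above: 1 where the condition holds, 0 elsewhere
toy["stop_wind"] = np.where(toy["spot_price"] < 20, 1, 0)

# Alternative: cast the boolean mask to integers (True -> 1, False -> 0)
toy["stop_wind_alt"] = (toy["spot_price"] < 20).astype(int)

print(toy)
```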


### Exercise 4:
Create a new column called `weekend` with value 0 if `weekday` is 0, 1, 2, 3 or 4 (Monday to Friday), and 1 otherwise.

In [77]:
energy['weekend'] = np.where(energy['weekday'].isin([0, 1, 2, 3, 4]), 0, 1)

energy[energy['weekend'] == 1].sample()

Unnamed: 0,datetime,power_demand,nuclear,gas,solar,hydro,coal,wind,spot_price,year,month,day,hour,weekday,fossil_fuel_consumption,stop_wind,weekend
8039,2019-12-01 22:00:00+00:00,26526.3,4996.9,3472.9,,5073.0,763.8,7726.7,42.68,2019,12,1,22,6,4236.7,0,1
