# Quick Analysis and Summary Statistics in Pandas 
<br>
If you want a quick overview of your data, some first insights, or just a basic understanding of what is going on, the following steps can be really helpful. <br>
But first, don't forget to import the data and packages you need:

In [1]:
import pandas as pd 
import numpy as np

In [2]:
energy_flow = pd.read_csv("./data/energy/physical_flow_2021_1_01.csv", sep = ";")

In [3]:
energy_flow.head()

| | Datetime | Resolution code | Control area | Physical Flow Value |
|---|---|---|---|---|
| 0 | 2021-12-01T23:45:00+01:00 | PT15M | Netherlands | -419.704 |
| 1 | 2021-12-01T23:45:00+01:00 | PT15M | UnitedKingdom | 1021.774 |
| 2 | 2021-12-01T23:45:00+01:00 | PT15M | Luxembourg | -53.708 |
| 3 | 2021-12-01T23:45:00+01:00 | PT15M | Germany | -1001.788 |
| 4 | 2021-12-01T23:45:00+01:00 | PT15M | France | 581.908 |


<br>
Let's start with some quick analysis. You can... 

1. Get the names of all columns in your DataFrame

In [4]:
energy_flow.columns

Index(['Datetime', 'Resolution code', 'Control area', 'Physical Flow Value'], dtype='object')

2. Check out a specific column

In [5]:
energy_flow["Physical Flow Value"].tail()

475     103.704
476     501.428
477    -299.016
478    1022.926
479     -78.695
Name: Physical Flow Value, dtype: float64

3. Get overall statistics on the column you are interested in

In [6]:
print('Average Physical Flow Value: ', energy_flow["Physical Flow Value"].mean())
print('Min Physical Flow Value: ', energy_flow["Physical Flow Value"].min())
print('Median Physical Flow Value: ', energy_flow["Physical Flow Value"].median())
print('Max Physical Flow Value: ', energy_flow["Physical Flow Value"].max())

Average Physical Flow Value:  163.06674375000003
Min Physical Flow Value:  -1002.1039999999999
Median Physical Flow Value:  143.046
Max Physical Flow Value:  1707.2320000000002
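
The four separate calls above can also be collapsed into one. The following is a minimal sketch, assuming a small made-up DataFrame (`toy_flow`, with invented values) in place of `energy_flow`; it uses `Series.agg` with a list of statistic names to return all results as a single Series.

```python
import pandas as pd

# Hypothetical stand-in for energy_flow; the values are made up for illustration.
toy_flow = pd.DataFrame({
    "Control area": ["Germany", "France", "Germany", "France"],
    "Physical Flow Value": [-100.0, 50.0, 300.0, 150.0],
})

# Ask for several statistics at once; the result is a Series indexed by statistic name.
summary = toy_flow["Physical Flow Value"].agg(["mean", "min", "median", "max"])
print(summary)
```

Individual values can then be looked up by name, for example `summary["median"]`.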


4. And some more advanced metrics

- For instance, you can see how many records there are for each country

In [14]:
energy_flow["Control area"].value_counts()

Germany          96
Netherlands      96
UnitedKingdom    96
France           96
Luxembourg       96
Name: Control area, dtype: int64
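
If shares are more useful than raw counts, `value_counts` accepts `normalize=True`. A minimal sketch, assuming a small made-up Series (`toy_areas`) rather than the real `Control area` column:

```python
import pandas as pd

# Hypothetical stand-in for energy_flow["Control area"]; values are made up.
toy_areas = pd.Series(
    ["Germany", "France", "Germany", "Luxembourg"], name="Control area"
)

counts = toy_areas.value_counts()                # absolute number of rows per value
shares = toy_areas.value_counts(normalize=True)  # fraction of rows per value

print(counts)
print(shares)
```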

- You can also quickly view all the unique values of a column by calling the `.unique()` method on it

In [18]:
energy_flow["Control area"].unique()

array(['Netherlands', 'UnitedKingdom', 'Luxembourg', 'Germany', 'France'],
      dtype=object)

- Another way to find the unique values is to pass the column to Python's built-in `set`

In [21]:
set(energy_flow["Control area"])

{'France', 'Germany', 'Luxembourg', 'Netherlands', 'UnitedKingdom'}
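
The two approaches differ slightly: `.unique()` keeps the values in order of first appearance, while a `set` has no guaranteed order. When only the number of distinct values matters, `.nunique()` returns it directly. A minimal sketch on a made-up Series (`toy_areas` is a hypothetical stand-in for the real column):

```python
import pandas as pd

# Hypothetical stand-in for energy_flow["Control area"]; values are made up.
toy_areas = pd.Series(
    ["Netherlands", "Germany", "Netherlands", "France"], name="Control area"
)

unique_in_order = toy_areas.unique()  # array-like, order of first appearance
as_set = set(toy_areas)               # plain Python set, unordered
n_distinct = toy_areas.nunique()      # number of distinct (non-missing) values

print(list(unique_in_order))
print(sorted(as_set))                 # sort explicitly if a stable order is needed
print(n_distinct)
```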

&#128526; Cool, right? You can look [here](https://www.shanelynn.ie/summarising-aggregation-and-grouping-data-in-python-pandas/) for a list of the built-in pandas summary statistics. One of the most powerful built-in summary tools in pandas is `df.describe()`. Do you remember what it does?

In [24]:
energy_flow.describe()

| | Physical Flow Value |
|---|---|
| count | 480.0 |
| mean | 163.066744 |
| std | 630.222852 |
| min | -1002.104 |
| 25% | -461.035 |
| 50% | 143.046 |
| 75% | 715.429 |
| max | 1707.232 |


**Question**: Why is there only one column?
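
`describe()` can also be combined with `groupby` to get the same summary per group instead of for the whole column. A minimal sketch, assuming a small made-up DataFrame (`toy_flow`, with invented values) in place of `energy_flow`:

```python
import pandas as pd

# Hypothetical stand-in for energy_flow; the values are made up for illustration.
toy_flow = pd.DataFrame({
    "Control area": ["Germany", "France", "Germany", "France"],
    "Physical Flow Value": [-100.0, 50.0, 300.0, 150.0],
})

# One row per control area, one column per statistic (count, mean, std, ...).
per_area = toy_flow.groupby("Control area")["Physical Flow Value"].describe()
print(per_area)
```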