In [1]:
import pandas as pd
import numpy as np
from numpy.random import default_rng
rng = default_rng()

## Series

Given the Series below:

In [2]:
s = pd.Series(np.arange(5),index=list("abcde"))

Without running the statements:

- predict the values and the type of object returned for each statement:

In [3]:
s['d']        # ...
s['b':'d']    # ...
s[2::2][::-1] # ...
s[['b', 'a']] # ...

b    1
a    0
dtype: int32

- predict the contents of `s`, `s1` and `lst`:

In [None]:
lst, idx = np.arange(5), list("abcde")
s = pd.Series(lst,idx)
s[-1:] = 10               # ...
lst[0] = 5                # ...
s1 = pd.Series(s.copy())  # ...
s1[0] = -1                # ...

- predict the result of the operations

In [None]:
s1 = pd.Series({'a': 0, 'b': 1, 'c': 2, 'd': 3, 'e': 4})
s2 = pd.Series({'d': 0, 'e': 1, 'f': 2, 'g': 3})

s1 + s2           # ...
s1[3:] * s2[:-2]  # ...

## DataFrame

Given the DataFrame `df` below:

In [47]:
rng = default_rng(1234)
df = pd.DataFrame(np.array(rng.standard_normal(25)).reshape(5,5),
             index=[1, 0, 4, 3, 2], columns=list("abcde"))

retrieve:
- 2nd row as a Series
- 3rd row as a DataFrame
- rows on even positions
- rows with even indices
- 3rd column
- odd (index) rows and columns 'b' to 'd'

In [50]:
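# ...
print(df)
df.iloc[1]                           # 2nd row (by position) as a Series
df.iloc[[2]]                         # 3rd row (by position) as a DataFrame
df.iloc[1::2]                        # rows on even positions (2nd, 4th)
df.loc[df.index % 2 == 0]            # rows with even index labels
df["c"]                              # 3rd column, by label
df.iloc[:, 2]                        # 3rd column, by position
df.loc[df.index % 2 == 1, 'b':'d']   # odd index labels, columns 'b' to 'd'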


          a         b         c         d         e
1 -1.603837  0.064100  0.740891  0.152619  0.863744
0  2.913099 -1.478823  0.945473 -1.666135  0.343745
4 -0.512444  1.323759 -0.860280  0.519493 -1.265144
3 -2.159139  0.434734  1.733289  0.520134 -1.002166
2  0.268346  0.767175  1.191272 -1.157411  0.696279


Unnamed: 0,b,c,d
1,0.0641,0.740891,0.152619
3,0.434734,1.733289,0.520134
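
Because the index of `df` is not in sorted order, "the n-th row" and "the row labelled n" are different rows. The sketch below illustrates the difference on a small made-up frame (`toy` is a hypothetical name, not part of the exercise data):

```python
import pandas as pd

# Hypothetical frame whose index labels are deliberately out of order
toy = pd.DataFrame({"v": [10, 20, 30]}, index=[2, 0, 1])

by_label = toy.loc[2]        # row whose index label is 2 (here the first row)
by_position = toy.iloc[2]    # third row, whatever its label is
as_frame = toy.iloc[[1]]     # a list selector keeps the result a DataFrame

print(by_label["v"], by_position["v"], type(as_frame).__name__)
```

A scalar selector returns a Series, while a one-element list returns a one-row DataFrame.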


### Merge DataFrames

Given `df1`, `df2` and `df3` apply the following:

- merge df1 and df2 side by side
- merge df1 and df3 stacked
- merge all and reset index

In [29]:
df1 = pd.DataFrame({'name': ['ants', 'bees','wasps'] , 'order':['Hymenoptera']*3})
df2 = pd.DataFrame({'name': ['beetles', 'weevils'] , 'order':['Coleoptera']*2})
df3 = pd.DataFrame({'name': ['butterflies', 'moths'], 'order':['Lepidoptera']*2 })

In [35]:
# ...

pd.concat([df1,df2],axis=1)                  # df1 and df2 side by side
pd.concat([df1,df3])                         # df1 and df3 stacked
pd.concat([df1,df2,df3],ignore_index=True)   # all three, with a fresh 0..n-1 index

Unnamed: 0,name,order
0,ants,Hymenoptera
1,bees,Hymenoptera
2,wasps,Hymenoptera
3,beetles,Coleoptera
4,weevils,Coleoptera
5,butterflies,Lepidoptera
6,moths,Lepidoptera


### Missing values

Given the following DataFrame:

In [73]:
df = pd.DataFrame(np.arange(25).reshape(5,5))


In [None]:

#       0     1     2     3   4
# 0   NaN   1.0   NaN   3.0 NaN
# 1   5.0   6.0   7.0   8.0 NaN
# 2   NaN   NaN   NaN   NaN NaN
# 3  15.0  16.0  17.0  18.0 NaN
# 4  20.0  21.0   NaN  23.0 NaN

Set values to NaN so that `df` matches the DataFrame shown in the comment above:

In [75]:
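# ...
df = df.astype(float)     # NaN is a float, so convert the integer columns first
df.loc[2] = np.nan        # whole row with label 2
df.iloc[0, ::2] = np.nan  # first row, every other column
df.iloc[:, 4] = np.nan    # last column
df.iloc[4, 2] = np.nan    # single cell
print(df)
df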



      0     1     2     3   4
0   NaN   1.0   NaN   3.0 NaN
1   5.0   6.0   7.0   8.0 NaN
2   NaN   NaN   NaN   NaN NaN
3  15.0  16.0  17.0  18.0 NaN
4  20.0  21.0   NaN  23.0 NaN


Unnamed: 0,0,1,2,3,4
0,,1.0,,3.0,
1,5.0,6.0,7.0,8.0,
2,,,,,
3,15.0,16.0,17.0,18.0,
4,20.0,21.0,,23.0,


Apply the following on the dataframe with missing values created in the previous step.

Drop missing:
- rows with missing values
- columns with missing values
- rows where all values are missing
- columns where all values are missing

Fill missing:
- with 0
- with mean based on column values
- with median based on row values

In [95]:
# ...
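df.dropna(axis=0)               # drop rows containing any NaN
df.dropna(axis=1)               # drop columns containing any NaN
df.dropna(axis=0,how='all')     # drop rows where every value is NaN
df.dropna(axis=1,how='all')     # drop columns where every value is NaN
df.fillna(0)
df.fillna(df.mean(axis=0))      # axis=0 averages down the rows: one mean per column
df.T.fillna(df.median(axis=1)).T  # fillna aligns on column labels, so transpose to use row medians
# df.mean(axis=0)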

Unnamed: 0,0,1,2,3,4
0,,1.0,,3.0,
1,5.0,6.0,7.0,8.0,
3,15.0,16.0,17.0,18.0,
4,20.0,21.0,,23.0,
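
For the row-based fill asked for above, it helps to know that `fillna` with a Series matches the Series index against the column labels. A column statistic therefore works directly, while a row statistic needs a transpose. The following is a minimal sketch on a made-up frame (`toy`, `by_col` and `by_row` are hypothetical names):

```python
import numpy as np
import pandas as pd

# Hypothetical 2x3 frame with one gap per row
toy = pd.DataFrame([[1.0, np.nan, 3.0],
                    [np.nan, 10.0, 30.0]], columns=list("xyz"))

# Column-wise: mean(axis=0) is indexed by column label,
# which is exactly what fillna aligns on.
by_col = toy.fillna(toy.mean(axis=0))

# Row-wise: transpose so the rows become columns, fill with the
# per-row medians, then transpose back.
by_row = toy.T.fillna(toy.median(axis=1)).T

print(by_col)
print(by_row)
```

A row that consists only of NaN has a NaN median, so it stays unfilled with this approach.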


### Natural gas consumption in the Netherlands

The dataset can be downloaded from [CBS Open data StatLine](https://opendata.cbs.nl/statline/portal.html?_la=en&_catalog=CBS). A version is already included in the data directory of this session's git repository. We will be using this dataset in the exercises to prepare for visualisation later on in the course.

We first read the data with `pd.read_csv`. Here we only select the columns `Periods` and `TotalSupply_1`:

In [96]:
cbs = pd.read_csv("data/00372eng_UntypedDataSet_17032023_161051.csv",sep=";")
df0 = cbs[['Periods','TotalSupply_1']].copy()

The column `Periods` holds the year (yyyy), followed by a tag from {JJ,KW,MM} for yearly, quarterly and monthly terms respectively, and ends with two digits. The meaning of the two digits depends on the tag: for `JJ` they are always `00`, for `MM` they run over `01..12` for the twelve months, and for `KW` over `01..04` for the four quarters. The column `TotalSupply_1` holds the natural gas consumption (MCM).

To get more control over the date ranges we need to split the string according to the pattern `YYYY{MM,KW,JJ}{00,...,12}`. The Series class has a comprehensive set of accessors, one of which is `pandas.Series.str`, which provides the method `split`. The `split` method takes a [regular expression](https://docs.python.org/3/library/re.html) and splits each string wherever the pattern matches. Regular expressions fall beyond the scope of this course, so the solution for this step is given here.

In [103]:
# expand=True forces the result into a DataFrame
df = df0.Periods.str.split(r'(JJ|MM|KW)', regex=True, expand=True)
# Create DataFrame {year,term,idx}
df = pd.DataFrame({'year': df[0].astype(int),
                   'term': df[1],
                   'idx': df[2].astype(int)})

df = pd.concat([df,cbs[['TotalSupply_1']]],axis=1)
df.loc[700:]
# df.shape[0]

Unnamed: 0,year,term,idx,TotalSupply_1
700,2021,MM,2,4622
701,2021,MM,3,4414
702,2021,KW,1,14631
703,2021,MM,4,3655
704,2021,MM,5,2976
705,2021,MM,6,2303
706,2021,KW,2,8934
707,2021,MM,7,2147
708,2021,MM,8,1835
709,2021,MM,9,2131
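
As an aside, the same three fields can also be obtained with `Series.str.extract` and named groups. This is an alternative to the `split` call used above, not the approach of this notebook, and the sample codes below are made up for illustration:

```python
import pandas as pd

# Hypothetical sample codes in the YYYY{JJ,KW,MM}nn layout
periods = pd.Series(["2021JJ00", "2021KW01", "2021MM03"])

# Each named group becomes a column of the resulting DataFrame
pattern = r"(?P<year>\d{4})(?P<term>JJ|KW|MM)(?P<idx>\d{2})"
parts = periods.str.extract(pattern)

# extract returns strings, so convert the numeric fields
parts = parts.astype({"year": int, "idx": int})
print(parts)
```

With named groups the columns are labelled directly, so no separate renaming step is needed.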


1) Write a function that, given a Series with {year,term,idx}, returns a timestamp according to the following specification:

```
JJ : yyyyJJ00 => 31-12-yyyy
KW : yyyyKWmm => where mm in {1,2,3,4}
                 01: 1-1-yyyy to 31-3-yyyy
                 02: 1-4-yyyy to 30-6-yyyy
                 03: 1-7-yyyy to 30-9-yyyy
                 04: 1-10-yyyy to 31-12-yyyy
MM : yyyyMMmm => dd-mm-yyyy where dd is the last day of the month and
                 mm in {1,..,12}
```
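
One possible way to obtain real timestamps for this specification is to let pandas find the last day of the month. The sketch below is only an illustration: `period_end` is a hypothetical helper, and representing a quarter by its last day (rather than by a start and end date) is an assumption made here, not part of the specification.

```python
import pandas as pd

def period_end(year, term, idx):
    """Hypothetical helper: last day of the period as a pandas Timestamp."""
    if term == "JJ":
        month = 12              # a year ends in December
    elif term == "KW":
        month = 3 * int(idx)    # quarters end in months 3, 6, 9, 12
    elif term == "MM":
        month = int(idx)
    else:
        raise ValueError(f"unknown term: {term!r}")
    first_day = pd.Timestamp(year=int(year), month=month, day=1)
    # MonthEnd(0) rolls a date forward to the end of its own month,
    # which also takes care of leap years
    return first_day + pd.offsets.MonthEnd(0)

print(period_end(2005, "MM", 2))
print(period_end(2021, "KW", 2))
```

Such a function could be applied row by row with `df.apply(lambda r: period_end(r.year, r.term, r.idx), axis=1)`.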

2) Create a new DataFrame called `ngc` (natural gas consumption) with three columns {term, date, consumption} :
- term : {JJ,KW,MM}
- date : timestamps as specified in the previous exercise
- consumption: the column `TotalSupply_1`, only renamed

In [147]:
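# ...
s = pd.Series([2005,"MM",2])             # sample input {year,term,idx}

# scratch checks on the sample; none of these affect get_time
df2 = df.loc[df.year == s[0]]            # rows of the sample year
df2['term'].isin([s[1]]).any()           # does that year contain the term?
df2.loc[df2.term == s[1]]                # rows of that year and term
"-".join(str(x) for x in [1,2,'a'])      # joining mixed values into a date-like string

def get_time(series1):
    """Return the date string for a {year,term,idx} Series, following the spec above."""
    year, term, idx = series1            # use the argument, not the global s
    quart = [("1-1-","31-3-"),("1-4-","30-6-"),("1-7-","30-9-"),("1-10-","31-12-")]
    mon_end = [31,28,31,30,31,30,31,31,30,31,30,31]   # July has 31 days
    if term == "JJ" and idx == 0:
        return f"31-12-{year}"
    elif term == "KW" and idx in [1,2,3,4]:
        return f"{quart[idx-1][0]}{year} to {quart[idx-1][1]}{year}"
    elif term == "MM" and 1 <= idx <= 12:
        leap = (year%4==0 and year%100!=0) or year%400==0
        if idx == 2 and leap:
            return f"29-{idx}-{year}"
        return f"{mon_end[idx-1]}-{idx}-{year}"
    else:
        return "some terms are wrong"
get_time(s)
# s.isin([0]).any()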
    

'28-2-2005'

Validate the entries in the `ngc` DataFrame from the previous step:
- whether the sum of the three monthly consumptions equals the corresponding quarterly entry (KW)
- whether the sum of the four quarters adds up to the yearly entry (JJ)
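
The first check can be sketched on a small frame as follows. The consumption values are rows 703 to 706 of the output shown earlier (April to June 2021 and the second quarter). The dates are filled in by hand, and representing the quarter by its last day is an assumption of this sketch.

```python
import pandas as pd

# Rows 703-706 from the earlier output, with hand-written period-end dates
ngc = pd.DataFrame({
    "term": ["MM", "MM", "MM", "KW"],
    "date": pd.to_datetime(["2021-04-30", "2021-05-31",
                            "2021-06-30", "2021-06-30"]),
    "consumption": [3655, 2976, 2303, 8934],
})

months = ngc[ngc.term == "MM"].set_index("date")["consumption"]
quarters = ngc[ngc.term == "KW"].set_index("date")["consumption"]

# Sum the monthly values per calendar quarter
monthly_sum = months.groupby(months.index.to_period("Q")).sum()

# Label the quarterly entries by the same quarter periods so both align
quarters.index = quarters.index.to_period("Q")
check = monthly_sum.eq(quarters)
print(check)
```

The yearly check follows the same idea with `to_period("Y")` applied to the quarterly and yearly rows.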

In [None]:
# ...