# Pandas: reading files and operations

## Reading files into pandas

We can read files into pandas using the `read_csv` function. This function reads a comma-separated values (csv) file into DataFrame. The `read_csv` function is extremely powerful and can read files from a URL, a file, or a string. 

```python
import pandas as pd

# Reading a file from a CSV
df = pd.read_csv('file.csv')
```

If the file is in a different format than a csv, we can use other functions like `read_excel`, `read_json`, `read_html`, `read_sql`, `read_table`, etc.

## Energy dataset

This dataset contains information about the hourly electricity production of Spain with different sources. The dataset is in a CSV file and has the following columns:

* `datetime`: date and time of the observation
* `power_demand`: total electricity demand at the given datetime
* `nuclear`: electricity production from nuclear sources at the given datetime
* `gas`: electricity production from gas sources at the given datetime
* `solar`: electricity production from solar sources at the given datetime
* `hydro`: electricity production from hydroelectric sources at the given datetime
* `coal`: electricity production from coal sources at the given datetime
* `wind`: electricity production from wind sources at the given datetime
* `spot_price`: electricity spot price at the given datetime
* `year`: year of the observation
* `month`: month of the observation
* `day`: day of the observation
* `hour`: hour of the observation
* `weekday`: day of the week of the observation

### Exercise 1

Read the file `energy.csv` into a DataFrame called `energy` and display the first 5 rows.

In [1]:
import pandas as pd
import numpy as np

energy = pd.read_csv('energy.csv')

energy.head(5)

Unnamed: 0,datetime,power_demand,nuclear,gas,solar,hydro,coal,wind,spot_price,year,month,day,hour,weekday
0,2018-12-31 23:00:00+00:00,23251.2,6059.2,2954.0,7.1,3202.8,1867.0,3830.3,66.88,2018,12,31,23,0
1,2019-01-01 00:00:00+00:00,22485.0,6059.2,3044.1,8.0,2884.4,1618.0,3172.1,66.88,2019,1,1,0,1
2,2019-01-01 01:00:00+00:00,20977.0,6059.2,3138.6,7.5,1950.8,1535.3,2980.5,66.0,2019,1,1,1,1
3,2019-01-01 02:00:00+00:00,19754.2,6059.2,3596.2,7.5,1675.7,1344.0,2840.0,63.64,2019,1,1,2,1
4,2019-01-01 03:00:00+00:00,19320.6,6063.4,3192.6,7.5,1581.8,1345.0,3253.4,58.85,2019,1,1,3,1


### Exercise 2

Create a new column called date that contains the date of the observation. The date should be in the format `YYYY-MM-DD`.

**Hint:** You can extract a sub-string from a column of strings using the `str` attribute and the slice notation. For example, to extract the first 4 characters of a column `col`, you can use `col.str[:4]`.

In [2]:
energy['date'] = energy['datetime'].str[:4]

energy.head(5)

Unnamed: 0,datetime,power_demand,nuclear,gas,solar,hydro,coal,wind,spot_price,year,month,day,hour,weekday,date
0,2018-12-31 23:00:00+00:00,23251.2,6059.2,2954.0,7.1,3202.8,1867.0,3830.3,66.88,2018,12,31,23,0,2018
1,2019-01-01 00:00:00+00:00,22485.0,6059.2,3044.1,8.0,2884.4,1618.0,3172.1,66.88,2019,1,1,0,1,2019
2,2019-01-01 01:00:00+00:00,20977.0,6059.2,3138.6,7.5,1950.8,1535.3,2980.5,66.0,2019,1,1,1,1,2019
3,2019-01-01 02:00:00+00:00,19754.2,6059.2,3596.2,7.5,1675.7,1344.0,2840.0,63.64,2019,1,1,2,1,2019
4,2019-01-01 03:00:00+00:00,19320.6,6063.4,3192.6,7.5,1581.8,1345.0,3253.4,58.85,2019,1,1,3,1,2019


### Exercise 3

Create a new column that aggregates all the electricity production sources (nuclear, gas, solar, hydro, coal, and wind) into a single column called `total_production`.

Create a second column that calculates the sum of renewable sources (solar, hydro, and wind) into a single column called `renewable_production`.

**Hint1:** You can sum the values of multiple columns by summing the columns using the `+` operator.

**Hint2:** You can also sum the values of multiple columns using the `sum` method of the DataFrame, using the `axis` parameter to specify if you want to sum the rows (`axis=1`) or the columns (`axis=0`).

In [3]:
energy['total_production_test'] = energy['nuclear'] + energy['gas'] + energy['solar'] + energy['hydro'] + energy['coal'] + energy['wind']
energy['total_production'] = energy.iloc[:, 2:8].sum(axis = 1) 
energy['renewable_production'] = energy['solar'] + energy['hydro'] + energy['wind']

energy.head()

Unnamed: 0,datetime,power_demand,nuclear,gas,solar,hydro,coal,wind,spot_price,year,month,day,hour,weekday,date,total_production_test,total_production,renewable_production
0,2018-12-31 23:00:00+00:00,23251.2,6059.2,2954.0,7.1,3202.8,1867.0,3830.3,66.88,2018,12,31,23,0,2018,17920.4,17920.4,7040.2
1,2019-01-01 00:00:00+00:00,22485.0,6059.2,3044.1,8.0,2884.4,1618.0,3172.1,66.88,2019,1,1,0,1,2019,16785.8,16785.8,6064.5
2,2019-01-01 01:00:00+00:00,20977.0,6059.2,3138.6,7.5,1950.8,1535.3,2980.5,66.0,2019,1,1,1,1,2019,15671.9,15671.9,4938.8
3,2019-01-01 02:00:00+00:00,19754.2,6059.2,3596.2,7.5,1675.7,1344.0,2840.0,63.64,2019,1,1,2,1,2019,15522.6,15522.6,4523.2
4,2019-01-01 03:00:00+00:00,19320.6,6063.4,3192.6,7.5,1581.8,1345.0,3253.4,58.85,2019,1,1,3,1,2019,15443.7,15443.7,4842.7


### Exercise 4

Create a new column called `total_production_ratio` that contains the ratio of the total production to the power demand. Do the same with the renewable production under `renewable_production_ratio`.

In [4]:
energy['total_production_ratio'] = energy['total_production'] / energy['power_demand']
energy['renewable_production_ratio'] = energy['renewable_production'] / energy['power_demand']

energy.head(5)

Unnamed: 0,datetime,power_demand,nuclear,gas,solar,hydro,coal,wind,spot_price,year,month,day,hour,weekday,date,total_production_test,total_production,renewable_production,total_production_ratio,renewable_production_ratio
0,2018-12-31 23:00:00+00:00,23251.2,6059.2,2954.0,7.1,3202.8,1867.0,3830.3,66.88,2018,12,31,23,0,2018,17920.4,17920.4,7040.2,0.77073,0.302789
1,2019-01-01 00:00:00+00:00,22485.0,6059.2,3044.1,8.0,2884.4,1618.0,3172.1,66.88,2019,1,1,0,1,2019,16785.8,16785.8,6064.5,0.746533,0.269713
2,2019-01-01 01:00:00+00:00,20977.0,6059.2,3138.6,7.5,1950.8,1535.3,2980.5,66.0,2019,1,1,1,1,2019,15671.9,15671.9,4938.8,0.747099,0.235439
3,2019-01-01 02:00:00+00:00,19754.2,6059.2,3596.2,7.5,1675.7,1344.0,2840.0,63.64,2019,1,1,2,1,2019,15522.6,15522.6,4523.2,0.785787,0.228974
4,2019-01-01 03:00:00+00:00,19320.6,6063.4,3192.6,7.5,1581.8,1345.0,3253.4,58.85,2019,1,1,3,1,2019,15443.7,15443.7,4842.7,0.799339,0.25065


### Exercise 5

Create a new column called `thermal_gap` that contains the difference between the power demand and the sum of the nuclear, solar, hydro and wind production.

In [5]:
energy['thermal_gap'] = energy['power_demand'] - (energy['nuclear'] + energy['solar'] + energy['hydro'] + energy['wind'])

energy.head()

Unnamed: 0,datetime,power_demand,nuclear,gas,solar,hydro,coal,wind,spot_price,year,...,day,hour,weekday,date,total_production_test,total_production,renewable_production,total_production_ratio,renewable_production_ratio,thermal_gap
0,2018-12-31 23:00:00+00:00,23251.2,6059.2,2954.0,7.1,3202.8,1867.0,3830.3,66.88,2018,...,31,23,0,2018,17920.4,17920.4,7040.2,0.77073,0.302789,10151.8
1,2019-01-01 00:00:00+00:00,22485.0,6059.2,3044.1,8.0,2884.4,1618.0,3172.1,66.88,2019,...,1,0,1,2019,16785.8,16785.8,6064.5,0.746533,0.269713,10361.3
2,2019-01-01 01:00:00+00:00,20977.0,6059.2,3138.6,7.5,1950.8,1535.3,2980.5,66.0,2019,...,1,1,1,2019,15671.9,15671.9,4938.8,0.747099,0.235439,9979.0
3,2019-01-01 02:00:00+00:00,19754.2,6059.2,3596.2,7.5,1675.7,1344.0,2840.0,63.64,2019,...,1,2,1,2019,15522.6,15522.6,4523.2,0.785787,0.228974,9171.8
4,2019-01-01 03:00:00+00:00,19320.6,6063.4,3192.6,7.5,1581.8,1345.0,3253.4,58.85,2019,...,1,3,1,2019,15443.7,15443.7,4842.7,0.799339,0.25065,8414.5


### Exercise 6

Using the `corr()` method of the DataFrame, calculate the correlation between the columns of the DataFrame. Which columns are more correlated with the `spot_price` column?

**Hint:** There are columns that will raise an error when calculating correlations, like those with strings or dates. Drop them before calculating the correlation using the `drop` method of the DataFrame.

```python
df.drop(columns=['column1', 'column2'], axis=1).corr()
```

In [21]:
correlation_table = energy.drop(columns = ['datetime'])

correlation_table.head(5)

correlation_table.corr()['spot_price'].drop('spot_price').idxmax()

'thermal_gap'

### Exercise 7

Which of the columns has the highest correlation with the `spot_price` column?

### Exercise 8

* What was the date in which the `spot_price` was the highest? and the lowest?
* What was the date in which the `power_demand` was the highest? and the lowest?
* What is the month in which the highest `total_production` was registered? and the lowest?
* What is the hour in which the highest `renewable_production` was registered? and the lowest?

In [30]:
print(energy.iloc[energy['spot_price'].idxmax(), :][['date']])
print(energy.iloc[energy['spot_price'].idxmin(), :][['date']])

print(energy.iloc[energy['power_demand'].idxmax(), :][['date']])
print(energy.iloc[energy['power_demand'].idxmin(), :][['date']])

print(energy.iloc[energy['total_production'].idxmax(), :][['date']])
print(energy.iloc[energy['total_production'].idxmin(), :][['date']])

print(energy.iloc[energy['renewable_production'].idxmax(), :][['date']])
print(energy.iloc[energy['renewable_production'].idxmin(), :][['date']])

date    2019
Name: 355, dtype: object
date    2019
Name: 8572, dtype: object
date    2019
Name: 524, dtype: object
date    2019
Name: 8597, dtype: object
date    2019
Name: 524, dtype: object
date    2019
Name: 2498, dtype: object
date    2019
Name: 8317, dtype: object
date    2019
Name: 5741, dtype: object


### Exercise 9

Out of all the rows with prices below 30, what is the average `total_production` and `renewable_production`?

In [34]:
print(energy[energy['spot_price'] < 30]['total_production'].mean())
print(energy[energy['spot_price'] < 30]['renewable_production'].mean())

22892.016822429905
15641.03757763975


### Exercise 10

What's the proportion of hours in which the wind production was higher than the solar production?

**Hint:** Use `len(df)` to compare number of rows in a DataFrame.

In [6]:
len(energy[energy['wind'] > energy['solar']]) / len(energy)

0.6368318644843768

### Exercise 11

If the `spot_price` is higher than 50, what is the proportion of hours in which the `thermal_gap` is greater than 3000?

In [10]:
len(energy[(energy['spot_price'] > 50) & (energy['thermal_gap'] > 3000)]) / len(energy)

0.3494334439739041

### Exercise 12

How many unique dates are there in the period where the `spot_price` is higher than the mean plus 2 times the standard deviation of the `spot_price`?

**Hint1:** Use the `unique` method of the Series to get the unique values of a column, and then use the `len` of it to get the number of unique values.

**Hint2:** Or use `nunique` method of the Series to get the number of unique values.

In [12]:
energy[(energy['spot_price']) > ((energy['spot_price'].mean()) +  (2 * energy['spot_price'].std()))].nunique()

datetime                      102
power_demand                  102
nuclear                        55
gas                           101
solar                          64
hydro                         101
coal                           96
wind                          102
spot_price                     60
year                            1
month                           2
day                            21
hour                           16
weekday                         7
date                            1
total_production_test          77
total_production              102
renewable_production           77
total_production_ratio        102
renewable_production_ratio     77
thermal_gap                    77
dtype: int64