From 96bfeb49903822549c056bc78795e1525bd04703 Mon Sep 17 00:00:00 2001 From: OnnoEbbens Date: Wed, 1 May 2024 16:44:07 +0200 Subject: [PATCH 1/3] Update pandas notebook, add english version --- .../01_Pandas/01_pandas_basis_dutch.ipynb | 26 +- .../01_Pandas/01_pandas_basis_english.ipynb | 1336 +++++++++++++++++ 2 files changed, 1353 insertions(+), 9 deletions(-) create mode 100644 Exercise_notebooks/On_topic/01_Pandas/01_pandas_basis_english.ipynb diff --git a/Exercise_notebooks/On_topic/01_Pandas/01_pandas_basis_dutch.ipynb b/Exercise_notebooks/On_topic/01_Pandas/01_pandas_basis_dutch.ipynb index dacbe4b..c9364f0 100644 --- a/Exercise_notebooks/On_topic/01_Pandas/01_pandas_basis_dutch.ipynb +++ b/Exercise_notebooks/On_topic/01_Pandas/01_pandas_basis_dutch.ipynb @@ -26,9 +26,10 @@ "5. [DataFrame](#5)\n", "6. [Bestanden inlezen](#6)\n", "7. [Bewerken DataFrame](#7)\n", - "8. [Plotten data](#8)\n", - "9. [Opslaan](#9)\n", - "10. [Geavanceerde analyses](#10)" + "8. [Datumtijd](#8)\n", + "9. [Plotten data](#9)\n", + "10. [Opslaan](#10)\n", + "11. [Geavanceerde analyses](#11)" ] }, { @@ -637,6 +638,13 @@ "df_knmi.columns" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### [Stap 8. Datumtijd](#top)" + ] + }, { "cell_type": "markdown", "metadata": {}, @@ -698,7 +706,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### [Stap 8. plotten data](#top)" + "### [Stap 9. plotten data](#top)" ] }, { @@ -870,7 +878,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### [Stap 9. Opslaan resultaten](#top)\n", + "### [Stap 10. Opslaan resultaten](#top)\n", "Wanneer je een `DataFrame` hebt ingelezen en aangepast is het handig om de resultaten op te slaan om later te gebruiken. Ook kan het soms handig zijn om de resultaten in bijv. excel te bekijken. Dit kan eenvoudig met de `to_csv()` functie:" ] }, @@ -896,7 +904,7 @@ "source": [ "#### Opdracht 8 \n", "\n", - "Vraag de statistieken op van het `DataFrame` met de `describe()` functie. 
Sla deze statistieken op als csv bestand met de naam 'statistiek.csv'. Als je het bestand hebt opgeslagen kan je het downloaden door [hier](files/statistiek.csv) te klikken." + "Vraag de statistieken op van het `DataFrame` met de `describe()` functie. Sla deze statistieken op als csv bestand met de naam 'statistiek.csv'. Als je het bestand hebt opgeslagen kan je het downloaden door [hier](statistiek.csv) te klikken." ] }, { @@ -917,7 +925,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "### [Stap 10. Geavanceerd analyses](#top)" + "### [Stap 11. Geavanceerd analyses](#top)" ] }, { @@ -1347,9 +1355,9 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.7.6" + "version": "3.9.4" } }, "nbformat": 4, - "nbformat_minor": 2 + "nbformat_minor": 4 } diff --git a/Exercise_notebooks/On_topic/01_Pandas/01_pandas_basis_english.ipynb b/Exercise_notebooks/On_topic/01_Pandas/01_pandas_basis_english.ipynb new file mode 100644 index 0000000..6c6b56a --- /dev/null +++ b/Exercise_notebooks/On_topic/01_Pandas/01_pandas_basis_english.ipynb @@ -0,0 +1,1336 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
\n", + " \n", + " \n", + "
\n", + "\n", + "# Pandas exercise\n", + "\n", + "This exercise is used as an introduction to the `pandas` package for data analysis. In the exercise you will be using data from the Dutch meteorological organisation (KNMI)." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Table of contents\n", + "1. [Import](#1)\n", + "2. [Series](#2)\n", + "3. [Series attributes](#3)\n", + "4. [Series methods](#4)\n", + "5. [DataFrame](#5)\n", + "6. [Read data](#6)\n", + "7. [Manipulate DataFrames](#7)\n", + "8. [Dealing with Dates](#8)\n", + "9. [Plotting](#9)\n", + "10. [Writing data](#10)\n", + "11. [Advanced analyses](#11)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### [1. Import packages ](#top)\n", + "\n", + "Pandas is usually imported as `pd`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import pandas as pd" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### [2. Series](#top)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The pandas package can be used to analyse data and has many similarities with Excel. The difference is in the way you control the program. In Excel you can use your mouse to click on the data and modify it. With pandas you have to write code to manipulate the data. \n", + "\n", + "You usually start with some data. Pandas has roughly two data structures to store data:\n", + "1. Series: A one-dimensional, labelled array, for example a time series of groundwater heads where the label of each measurement is the measurement date. Similar to an Excel sheet with 2 columns.\n", + "2. DataFrame: A two-dimensional, tabular, data structure where each data point is labelled by a column name and a row name. For example a list with well locations where each location has an x and a y coordinate. Similar to an Excel spreadsheet with more than 2 columns."
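The two data structures described above can be sketched side by side; the variable names `heads` and `wells` below are purely illustrative and not part of the exercise data:

```python
import pandas as pd

# A Series: one-dimensional labelled data, e.g. groundwater heads
# where each value is labelled by a (date-like) index entry.
heads = pd.Series(
    index=["2020-01-01", "2020-01-02", "2020-01-03"],
    data=[1.2, 1.1, 1.4],
    name="head",
)

# A DataFrame: two-dimensional data with row labels (the index)
# and column labels, e.g. well locations with x and y coordinates.
wells = pd.DataFrame(
    index=["well_a", "well_b"],
    data={"x": [120500.0, 121300.0], "y": [487200.0, 486900.0]},
)

print(heads.shape)  # (3,)   -> one dimension
print(wells.shape)  # (2, 2) -> two dimensions
```

Printing either object shows the labels next to the values, which is the main difference with a plain array.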
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The easiest way to create a Series is by manually entering the data. Below we create a Series with the weight of some animal species." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "s = pd.Series(\n", + " index=[\"cow\", \"horse\", \"chicken\"], data=[656.0, 450.0, 3.8], name=\"weight\"\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The Series is created and assigned to the variable `s`. To see the data we can print the Series:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "print(s)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### [3. Series attributes](#top)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Our Series has many attributes, for example:\n", + "- index\n", + "- values\n", + "- name\n", + "- shape\n", + "- dtype\n", + "\n", + "You can get the value of an attribute using the variable name `s` followed by a dot `.` and the name of the attribute. We do this for the `index` and the `values` attributes." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "s.index" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "s.values" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In the code cells above you can see the difference between the `values` and the `index`. A Series can be used to obtain data based on the index. For this you can use `loc` or `iloc`.\n", + "\n", + "In the cell below we obtain the weight of a cow using `loc`."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "s.loc[\"cow\"]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Or we can do the same using `iloc`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "s.iloc[0]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Exercise 1 \n", + "Write the code to get the weight of the chicken using `iloc`. Type your code below." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Answer Exercise 1" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### [4. Series methods](#top)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "A Series contains methods to analyse its data. For example, you can calculate the mean weight of all animals using the code below.\n", + "\n", + "Note: You always have to add parentheses `()` at the end of a method." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "s.mean()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can get the maximum weight using `max()`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "s.max()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "It is also possible to get some descriptive statistics using the `describe` method." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "s.describe()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Exercise 2 \n", + "\n", + "The result of `s.describe()` is another pandas Series. 
You can assign this Series to another variable, for example using `stats = s.describe()`. Create a variable with the results of `s.describe()` and use `loc` to obtain the 25th percentile." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Answer Exercise 2" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### [5. DataFrame](#top)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "When you have two dimensional data you can use a pandas `DataFrame`. We create a DataFrame with the latitude, longitude and inhabitants for 3 import cities." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "df = pd.DataFrame(\n", + " index=[\"London\", \"Rome\", \"Villa Bartolomea\"],\n", + " data={\n", + " \"lat\": [51.5064, 41.8986, 45.1580],\n", + " \"lon\": [-0.1388, 12.4789, 11.3572],\n", + " \"inhabitants\": [9748000, 4332000, 5500],\n", + " },\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Just like a Series we can print the DataFrame to see the data." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "print(df)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Besides the `print` function you can also use the `display` function to get a more visually pleasing representation of the DataFrame.\n", + "\n", + "Note: the display function can only be used in IPython and not with plain Python." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "display(df)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Just like a Series a DataFrame has the attributes `index`, `values` and `shape`. 
Additionally, a DataFrame has the attribute `columns`. These are the labels used for every column. A data point in a DataFrame is defined by a row label (the index) and a column label (the columns). Every value in a single column should be of the same data type. The column names and types can be accessed using the `dtypes` attribute." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "df.columns" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "df.dtypes" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To obtain a single value from a DataFrame you can use `loc` and `iloc`. With a Series you can just specify the index label; for a DataFrame you have to specify both an index label and a column label, e.g. `df.loc[row_label, column_label]`. Using the code below we obtain the longitude of Rome." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "df.loc['Rome', 'lon']" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Exercise 3 \n", + "\n", + "Get the number of inhabitants of Villa Bartolomea using `loc`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Answer Exercise 3" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can also get some statistics for a DataFrame. By default the statistics are calculated for every column. See the examples below."
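The per-column default can be illustrated on the same three-city frame; the `axis=1` variant is an extra shown here for contrast and is not needed for the exercises:

```python
import pandas as pd

df = pd.DataFrame(
    index=["London", "Rome", "Villa Bartolomea"],
    data={
        "lat": [51.5064, 41.8986, 45.1580],
        "lon": [-0.1388, 12.4789, 11.3572],
        "inhabitants": [9748000, 4332000, 5500],
    },
)

col_max = df.max()        # default: one statistic per column
row_max = df.max(axis=1)  # axis=1: one statistic per row instead

print(col_max["inhabitants"])  # 9748000
```

Because every column is numeric here, both directions work; with mixed text and number columns the per-row version would not be meaningful.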
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "df.max()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "df.describe()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Sometimes you are only interested in a single column. You can obtain a single column from a DataFrame using the square brackets `[]`. This will return a Series." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "s = df['inhabitants']\n", + "s" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Exercise 4 \n", + "\n", + "The result of `df.describe()` is also a pandas `DataFrame` (see above). Assign the result of `df.describe()` to a variable and show the statistics for the column `inhabitants`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Answer Exercise 4" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### [6. Read files](#top)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Usually you have datasets in a certain file format (e.g. a .csv file). You don't have to copy and paste the values in your code because you can directly read the data into a DataFrame. For reading .csv files you can use the `pd.read_csv` function. For other file formats there are other read functions in pandas.\n", + "\n", + "When you read a file using the pandas read functions you usually have to specify some settings such as the seperator character, the decimal indicator, a number of rows to be skipped, etc. It usually takes a bit of trial and error to find the appropiate settings to read the data correctly. 
Once you have accomplished this other actions will be a lot faster.\n", + "\n", + "Below you find the code to read a csv file with data from the Dutch meteorological institute (KNMI). The file 'etmgeg_240.txt' is already in your course material." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "scrolled": true + }, + "outputs": [], + "source": [ + "df_knmi = pd.read_csv(\"etmgeg_240.txt\", skiprows=47, index_col=\"YYYYMMDD\", \n", + " parse_dates=[1])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The settings for reading the file are given to the `read_csv` function as function arguments. The first argument `etmgeg_240.txt` is the name of the file. Because the file is in the same directory as this notebook Python can find it. All the other arguments in the `pd.read_csv` function have a default value and we only have to change it if the default value is not applicable for our file. The arguments we defined are:\n", + "- `skiprows`: In our case the first 47 rows of the file contain metadata and do not have a tabluar structure. With this arguments we skip these lines.\n", + "- `index_col`: It is often a good idea to choose one column as the index column. The index column can be used later as a label to obtain the data. It is a good idea to choose a column with unique values for each row. In our case we choose the `YYYYMMDD` column because it has a unique date for each row.\n", + "- `parse_dates`: This is a more advanced option to tell the `read_csv` function that the second column contains dates and the values should be interpreted as date values.\n", + "\n", + "Note: you might also see a red warning appear when you run the code. This is just a warning that your code may be slow and a possible way to solve this. 
Since the code is not really that slow we choose to ignore this warning.\n", + "\n", + "When you have successfully read the .csv file into a DataFrame you can list the first 5 rows using the `df.head()` method." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "df_knmi.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Exercise 5 \n", + "\n", + "Obtain the number of rows and columns from the `DataFrame` you've read above." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Answer Exercise 5" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### [7. Manipulating DataFrames](#top)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "It is a good habit to check your DataFrame after you have read it and see if everything was interpreted correctly. We will do this now for the DataFrame with meteorological data." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "First we have a look at the columns. Annoyingly we still have some spaces in the column names." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "df_knmi.columns" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "After this we check the datatype per column. When we want to do calculations with the data in a column we have to make sure that the data is of a numeric type (int or float). The datatype 'object' is used for textual values. Looking at the dtype we can see that it is not possible to perform calculations on all columns."
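One way to see at a glance which columns are safe to compute with is `select_dtypes`; the small frame below is a toy illustration with a made-up station column, not the KNMI file itself:

```python
import pandas as pd

df = pd.DataFrame(
    {
        "STN": ["240", "240"],  # text -> dtype 'object'
        "RH": [12, 0],          # whole numbers -> int64
        "EV24": [0.3, 1.1],     # decimals -> float64
    }
)

# Keep only the columns with a numeric dtype.
numeric = df.select_dtypes(include="number")
print(list(numeric.columns))  # ['RH', 'EV24']
```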
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "scrolled": true + }, + "outputs": [], + "source": [ + "df_knmi.dtypes" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now that we have checked our DataFrame we can manipulate it to our needs. First we will remove the spaces from the column names using a for-loop. For every column name `icol` we use the `strip()` method to remove any trailing spaces. The modified string is added to the list `new_names`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "scrolled": true + }, + "outputs": [], + "source": [ + "new_names = []\n", + "for icol in df_knmi.columns:\n", + " new_names.append(icol.strip())\n", + "new_names" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The list `new_names` can be used to overwrite the original column names with our modified names." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "df_knmi.columns = new_names" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Check if everything worked according to plan: " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "df_knmi.columns" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### [8. Dealing with Dates](#top)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "When we read the meteo data file we set the index column with the datatype 'datetime'. Because of this we can now select specific periods rather easily. With the code below we only show the data for the year 2018." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "scrolled": true + }, + "outputs": [], + "source": [ + "df_knmi.loc['2018']" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If we combine this with only the columns \"RH\" (daily precipitation) and \"EV24\" (daily evaporation), we get this:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "df_knmi.loc[\"2018\", [\"RH\", \"EV24\"]]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Exercise 6 \n", + "\n", + "The column `TX` contains the maximum temperature in 0.1 degrees Celsius. Obtain the maximum temperature in 2018." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Answer Exercise 6" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### [9. Plotting](#top)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "For the plotting we will only deal with precipitation and evaporation; therefore we take a subset of our DataFrame and assign it to the new variable `dfs`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "dfs = df_knmi.loc[:, [\"RH\", \"EV24\"]]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Every DataFrame has a `plot` method. Unfortunately, when we call this we get an error:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "scrolled": true + }, + "outputs": [], + "source": [ + "dfs.plot()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The error message tells us:\n", + "\n", + " TypeError: no numeric data to plot\n", + "\n", + "This suggests that the data that we have does not have a numeric type. 
We can check this using the `dtypes` attribute." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "dfs.dtypes" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can see that both columns have the dtype 'object'. In order to plot the data the dtype should be 'int' or 'float'. We can use the `to_numeric` function to convert the datatype from 'object' to 'float'. We do this separately for each column using a for-loop. Additionally we use the keyword argument `errors='coerce'`; this will use a NaN (Not a Number) value for data that cannot be converted to a numeric type." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "for icol in dfs.columns:\n", + " print(icol)\n", + " dfs[icol] = pd.to_numeric(dfs[icol], errors=\"coerce\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Check if it worked:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "dfs.dtypes" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now we can plot!" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "dfs.plot()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The data is shown in the plot, although it could look better. Below are some ideas on how to improve the layout of a plot." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "ax = dfs.plot(figsize=(12,4))\n", + "ax.set_xlim('2017','2018')\n", + "ax.set_ylabel('0.1 mm/day')\n", + "ax.set_xlabel('')\n", + "ax.grid()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Exercise 7 \n", + "\n", + "Plot the maximum temperature between 2000 and 2005."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Answer Exercise 7" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### [10. Write](#top)\n", + "It can be useful to write a `DataFrame` to a file to use later. This is easily done with the `to_..` methods. For example, to write our dataframe to a csv file we can use `df_knmi.to_csv('modified_timeseries.csv')`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "df_knmi.to_csv('modified_timeseries.csv')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "When you run this code, a .csv file named 'modified_timeseries.csv' is created. The file is saved in the same directory as this notebook. You can now open the file in a text editor or Excel." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Exercise 8 \n", + "\n", + "Obtain the statistics from `df_knmi` using the `describe` method. Write these statistics to a .csv file named 'statistics.csv'. After you run the code you can inspect the file by clicking [here](statistics.csv)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Answer Exercise 8" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### [11. Advanced analyses](#top)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Below you find some examples of more advanced analyses you can do with pandas. There are no exercises for this." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Often we want to analyse yearly sums of precipitation or evaporation while the data is available as daily values. 
A useful method to convert our data to yearly sums is the `groupby` method. This is sort of similar to the `pivot` option in Excel." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "When we look at the index we can see that the dtype is `datetime64[ns]`. This dtype has some neat options for dealing with dates." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "dfs.index" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can easily get the year of each date in the index." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "dfs.index.year" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Or the day" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "scrolled": true + }, + "outputs": [], + "source": [ + "dfs.index.day" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Using the `groupby` method we can group the data for each year" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "gr = dfs.groupby(by=dfs.index.year)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Then we can choose how we want to aggregate the data for each year. Here we take the sum of all values for each year." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "scrolled": true + }, + "outputs": [], + "source": [ + "gr.sum()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The result of the `df.groupby()` method is a `DataFrameGroupBy` object which allows you to obtain a variety of statistics such as the mean: `gr.mean()`, median: `gr.median()` or the maximum `gr.max()`.\n", + "\n", + "You can also loop over the groups in the `DataFrameGroupBy` object to be even more flexible."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "for groupname, group in gr:\n", + " # Only print the groups after 2016\n", + " if groupname > 2016:\n", + " print(groupname)\n", + " display(group.head())" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "A bar plot is easily obtained" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "gr.sum().plot.bar(figsize=(16, 6))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Another way of analyzing data is by calculating the cumulative sum. For this we use the precipitation minus the evaporation." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "no = dfs.loc[\"2018\", \"RH\"] - dfs.loc[\"2018\", \"EV24\"]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now we can plot the cumulative precipitation, evaporation and recharge in the same plot." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "ax = dfs.loc[\"2018\", \"RH\"].cumsum().plot(legend=True, figsize=(12,4))\n", + "dfs.loc[\"2018\", \"EV24\"].cumsum().plot(ax=ax, legend=True)\n", + "no.cumsum().plot(ax=ax, label=\"Recharge\", legend=True);" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can also use a DataFrame to check how often a condition is met. With the code below we check for each row if the precipitation was higher than 15.0 mm/day (the 150 is because the value is still in 0.1 mm/day)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "gt150 = dfs.loc[:, \"RH\"] > 150" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "`gt150` is now a pandas Series with only boolean values. 
The value is `True` if the precipitation is higher than 15.0 mm/day and `False` if lower than 15.0 mm/day." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "gt150.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In Python True equals 1 and False equals 0. Thus if we take the sum of this Series we get the number of occurrences when the precipitation was higher than 15.0 mm/day." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "gt150.sum()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can also use the boolean series to obtain a subset of a DataFrame. Using the code below we get all the rows in our DataFrame where the precipitation is higher than 15.0 mm/day. This process is called boolean subsetting and is a very powerful tool in pandas." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "scrolled": true + }, + "outputs": [], + "source": [ + "dfs.loc[gt150]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "## Answers" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Answer Exercise 1 " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "s.iloc[2]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Answer Exercise 2 " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "stats = s.describe()\n", + "stats.loc['25%']" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Answer Exercise 3 " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "df.loc['Villa Bartolomea', 'inhabitants']" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + 
"source": [ + "#### Answer Exercise 4 " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "stats = df.describe()\n", + "s = stats['inhabitants']\n", + "print(s)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Answer Exercise 5 " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "df_knmi.shape\n", + "print(df_knmi.shape[0],' rows')\n", + "print(df_knmi.shape[1],' columns')" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Answer Exercise 6 " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "df_knmi.loc['2018','TX'].max()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Bonus: to get the date on which this occurred we can use idxmax():\n", + "df_knmi.loc['2018','TX'].idxmax()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Answer Exercise 7 " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "ax = df_knmi['TX'].plot(figsize=(12,6))\n", + "ax.set_xlim('2000','2005')\n", + "ax.set_ylabel('temperature (0.1$^\circ$C)')\n", + "ax.set_xlabel('')\n", + "ax.grid()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "#### Answer Exercise 8 " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "stats = df_knmi.describe()\n", + "stats.to_csv('statistics.csv')" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + 
"nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.4" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} From 8528c57539ff15f50106971cf68c4f1d96e71e13 Mon Sep 17 00:00:00 2001 From: OnnoEbbens Date: Thu, 2 May 2024 13:14:50 +0200 Subject: [PATCH 2/3] set pastas to version lower than 1.5 --- requirements.txt | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/requirements.txt b/requirements.txt index 08ac630..62e1f98 100644 --- a/requirements.txt +++ b/requirements.txt @@ -11,7 +11,7 @@ mapclassify folium bokeh seaborn -pastas +pastas < 1.5 pastastore xlrd openpyxl From c8335c06264e9778f6da4164964e73a37e605652 Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Dav=C3=ADd=20Brakenhoff?= Date: Fri, 3 May 2024 13:15:11 +0200 Subject: [PATCH 3/3] minor changes --- .../01_Pandas/01_pandas_basis_english.ipynb | 13 ++++++------- 1 file changed, 6 insertions(+), 7 deletions(-) diff --git a/Exercise_notebooks/On_topic/01_Pandas/01_pandas_basis_english.ipynb b/Exercise_notebooks/On_topic/01_Pandas/01_pandas_basis_english.ipynb index 6c6b56a..7258c87 100644 --- a/Exercise_notebooks/On_topic/01_Pandas/01_pandas_basis_english.ipynb +++ b/Exercise_notebooks/On_topic/01_Pandas/01_pandas_basis_english.ipynb @@ -288,7 +288,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "When you have two dimensional data you can use a pandas `DataFrame`. We create a DataFrame with the latitude, longitude and inhabitants for 3 import cities." + "When you have two dimensional data you can use a pandas `DataFrame`. We create a DataFrame with the latitude, longitude and inhabitants for 3 \"important\" cities." ] }, { @@ -483,7 +483,7 @@ "source": [ "Usually you have datasets in a certain file format (e.g. a .csv file). You don't have to copy and paste the values in your code because you can directly read the data into a DataFrame. For reading .csv files you can use the `pd.read_csv` function. 
For other file formats there are other read functions in pandas.\n",
 "\n",
 "When you read a file using the pandas read functions you usually have to specify some settings such as the separator character, the decimal indicator, a number of rows to be skipped, etc. It usually takes a bit of trial and error to find the appropriate settings to read the data correctly. Once you have accomplished this, other actions will be a lot faster.\n",
 "\n",
 "Below you find the code to read a csv file with data from the Dutch meteorological institute (KNMI). The file 'etmgeg_240.txt' is already in your course material."
 ]
 },
@@ -496,8 +496,7 @@
 },
 "outputs": [],
 "source": [
- "df_knmi = pd.read_csv(\"etmgeg_240.txt\", skiprows=47, index_col=\"YYYYMMDD\", \n",
- " parse_dates=[1])"
+ "df_knmi = pd.read_csv(\"etmgeg_240.txt\", skiprows=47, index_col=\"YYYYMMDD\", parse_dates=[\"YYYYMMDD\"])"
 ]
 },
@@ -507,7 +506,7 @@
 "The settings for reading the file are given to the `read_csv` function as function arguments. The first argument `etmgeg_240.txt` is the name of the file. Because the file is in the same directory as this notebook Python can find it. All the other arguments in the `pd.read_csv` function have a default value and we only have to change it if the default value is not applicable for our file. The arguments we defined are:\n",
 "- `skiprows`: In our case the first 47 rows of the file contain metadata and do not have a tabular structure. With this argument we skip these lines.\n",
 "- `index_col`: It is often a good idea to choose one column as the index column. 
The index column can be used later as a label to obtain the data. It is a good idea to choose a column with unique values for each row. In our case we choose the `YYYYMMDD` column because it has a unique date for each row.\n",
- "- `parse_dates`: This is a more advanced option to tell the `read_csv` function that the second column contains dates and the values should be interpreted as date values.\n",
+ "- `parse_dates`: This is a more advanced option to tell the `read_csv` function that the `'YYYYMMDD'` column contains dates and the values should be interpreted as date values.\n",
 "\n",
 "Note: you might also see a red warning appear when you run the code. This warning just indicates that your code may be slow and suggests a possible way to solve this. Since the code is not really that slow we choose to ignore this warning.\n",
 "\n",
@@ -848,7 +847,7 @@
 "source": [
 "#### Exercise 7 \n",
 "\n",
- "Plot the maximum temperature between 2000 en 2005."
+ "Plot the maximum daily temperature (column `\"TX\"`) between 2000 and 2005."
 ]
 },
@@ -1102,7 +1101,7 @@
 "metadata": {},
 "outputs": [],
 "source": [
- "gt150 = dfs.loc[:, \"RH\"] > 15"
+ "gt150 = dfs.loc[:, \"RH\"] > 150"
 ]
 },
 {