# Programming Assignment 13: Data Framework

<h1 style="position: absolute; display: flex; flex-grow: 0; flex-shrink: 0; flex-direction: row-reverse; top: 60px;right: 30px; margin: 0; border: 0">
    <style>
        .markdown {width:100%; position: relative}
        article { position: relative }
    </style>
    <img src="https://gitlab.tudelft.nl/mude/public/-/raw/main/tu-logo/TU_P1_full-color.png" style="width:100px" />
    <img src="https://gitlab.tudelft.nl/mude/public/-/raw/main/mude-logo/MUDE_Logo-small.png" style="width:100px" />
</h1>
<h2 style="height: 10px">
</h2>

*[CEGM1000 MUDE](http://mude.citg.tudelft.nl/): Week 2.5. Due: complete this PA prior to class on Friday, Dec 15, 2023.*

## Overview of Assignment

This assignment is a quick introduction to the package `pandas`. We only use a few small features here, to help you get familiar with it before using it more in the coming weeks. The primary purpose is to easily load data from csv files and quickly process the contents. This is accomplished with a data type provided by pandas: a `DataFrame`. pandas also makes it very easy to export data to a `*.csv` file.

If you want to learn more about pandas after finishing this assignment, the [Getting Started page](https://pandas.pydata.org/docs/getting_started/index.html) is a great resource.

## Assignment Criteria

**You will pass this assignment as long as your repository fulfills the following criteria:**  

- You have completed this notebook and it runs without errors
- Your notebook creates a file `earth_dams.csv` in the root of the repository
- Your repository contains a `.gitignore` file that ignores all csv files
- You commit a `.gitignore` file to your repository, but _not_ a csv file

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

## Introduction to pandas

Pandas dataframes are considered by some to be difficult to use. For example, here is a line of code from one of our notebooks this week. Can you understand what it is doing?
```
net_data.loc[net_data['capacity'] <= 0, 'capacity'] = 0
```

One of the reasons for this is that the primary pandas data type, a `DataFrame` object, uses a dictionary-like syntax to access and store elements. For example, remember that a dictionary is defined using curly braces. 

In [2]:
my_dict = {}
type(my_dict)

dict

Also remember that you can add items as a key-value pair:

In [3]:
my_dict = {'key': 5}

The item `key` was added with value 5. We can access it like this:

In [4]:
my_dict['key']

5

This is useful because if the value is something like a list, we can simply add the index to the end of the call to the dictionary. For example:

In [5]:
my_dict['array'] = [34, 634, 74, 7345]
my_dict['array'][3]

7345

And now that you see the "double brackets" above, i.e., `[ ][ ]`, you can see where the notation starts to get a little more complicated. Here's a fun nested example:

In [6]:
shell = ['chick']
shell = {'shell': shell}
shell = {'shell': shell}
shell = {'shell': shell}
nest = {'egg': shell}
nest['egg']['shell']['shell']['shell'][0]

'chick'

Don't worry about that too much...as long as you keep dictionaries and their syntax in mind, it becomes easier to "read" the complicated pandas syntax.
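
To make the connection concrete, here is a minimal sketch of the `.loc` line quoted at the start of this section, read with dictionary syntax in mind. The contents of `net_data` are made up for illustration (the real data is not shown in this notebook), and `mask` is a name introduced only for this example.

```python
import pandas as pd

# Hypothetical stand-in for net_data; the values are made up.
net_data = pd.DataFrame({'capacity': [12.5, -3.0, 0.0, 7.1]})

# Dictionary-style access: a column is looked up by its key, like my_dict['key'].
capacity = net_data['capacity']

# Comparing a column to a number gives one True/False per row (a boolean mask).
mask = net_data['capacity'] <= 0

# .loc[rows, column] selects the masked rows of one column;
# assigning 0 overwrites only those entries.
net_data.loc[mask, 'capacity'] = 0

print(net_data['capacity'].tolist())  # [12.5, 0.0, 0.0, 7.1]
```

Splitting the line into a mask and an assignment does the same thing as the one-line version; it is just easier to read while you are getting used to the syntax.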

Now let's go through a few simple tasks that will illustrate what a `DataFrame` is (when constructed from a dictionary), and some of its fundamental methods and characteristics.

<div style="background-color:#AABAB2; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px">
<p>
<b>Task 0.1:</b>   
    
Run the cell below and check what kind of object was created using the function <code>type</code>.
</p>
</div>

In [7]:
new_dict = {'names': ['Gauss', 'Newton', 'Lagrange', 'Euler'],
            'birth year': [1777, 1643, 1736, 1707]}
# YOUR_CODE_HERE
# Solution
type(new_dict)

dict

<div style="background-color:#AABAB2; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px">
<p>
<b>Task 0.2:</b>   
    
Run the cell below and check what kind of object was created using the function <code>type</code>.
</p>
</div>

In [8]:
df = pd.DataFrame(new_dict)
# YOUR_CODE_HERE
# Solution
type(df)

pandas.core.frame.DataFrame

<div style="background-color:#AABAB2; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px">
<p>
<b>Task 0.3:</b>   
    
Read the code below and try to predict what the answer should be before you run it and view the output. Then run the cell and confirm your guess. In the second cell, check what kind of object was created using the function <code>type</code>.
</p>
</div>

In [9]:
guess = df.loc[df['birth year'] <= 1700, 'names']
print(guess)

1    Newton
Name: names, dtype: object


In [10]:
# YOUR_CODE_HERE
# Solution
type(guess)

pandas.core.series.Series

Note that this is a `Series` data type, which is part of the pandas package (you can read about it [here](https://pandas.pydata.org/docs/reference/api/pandas.Series.html)). If you need the data stored in the `Series`, use the attribute `values`, which returns it as a NumPy `ndarray`; the example below shows that the selection from the `names` column is a `Series` whose `values` attribute has type `ndarray`.

In [11]:
print(type(df.loc[df['birth year'] <= 1700, 'names']))
print(type(df.loc[df['birth year'] <= 1700, 'names'].values))
print('The value in the series is an ndarray with first item:',
      df.loc[df['birth year'] <= 1700, 'names'].values[0])

<class 'pandas.core.series.Series'>
<class 'numpy.ndarray'>
The value in the series is an ndarray with first item: Newton
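
As a small sketch of other ways to get data out of a `Series` (not something this assignment requires), the block below rebuilds the same `DataFrame` and compares `values` with `to_numpy()` and `iloc`. The variable names are chosen only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'names': ['Gauss', 'Newton', 'Lagrange', 'Euler'],
                   'birth year': [1777, 1643, 1736, 1707]})
selection = df.loc[df['birth year'] <= 1700, 'names']

# Both of these return the underlying data as a NumPy array.
as_array = selection.values
as_array_too = selection.to_numpy()

# The selected row keeps its original row label (1), so use iloc
# to ask for "the first item" by position rather than by label.
first_item = selection.iloc[0]

print(selection.index.tolist())  # [1]
print(first_item)                # Newton
```

Note that the row label of the selected item is still 1, which is why position-based access with `iloc` is used here.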


Another useful feature of pandas is that you can quickly look at the contents of a `DataFrame`. The method `head` shows the first rows (five by default), so you can see which columns are present:

In [12]:
df.head()

| | names | birth year |
|---|---|---|
| 0 | Gauss | 1777 |
| 1 | Newton | 1643 |
| 2 | Lagrange | 1736 |
| 3 | Euler | 1707 |


You can also get summary information easily:

In [13]:
df.describe()

| | birth year |
|---|---|
| count | 4.0 |
| mean | 1715.75 |
| std | 56.364143 |
| min | 1643.0 |
| 25% | 1691.0 |
| 50% | 1721.5 |
| 75% | 1746.25 |
| max | 1777.0 |
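
If you only need one or two of these numbers, a few attributes and methods give them directly. This is a minimal sketch on the same small `DataFrame`; it is not needed for the tasks below.

```python
import pandas as pd

df = pd.DataFrame({'names': ['Gauss', 'Newton', 'Lagrange', 'Euler'],
                   'birth year': [1777, 1643, 1736, 1707]})

# Size of the table as (number of rows, number of columns).
print(df.shape)

# Column names and the data type stored in each column.
print(list(df.columns))
print(df.dtypes)

# Statistics for a single column, selected with dictionary-style syntax.
mean_year = df['birth year'].mean()
print(mean_year)  # 1715.75
print(df['birth year'].min(), df['birth year'].max())
```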


Finally, it is also very easy to read and write dataframes to a `*.csv` file, which you can do using the following commands (_you will apply this in the tasks below_):
```
df = pd.read_csv('dams.csv')
```
To write, the method is similar; the keyword argument `index=False` avoids adding a numbered index as an extra column in the csv:
```
df.to_csv('dams.csv', index=False)
```
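
To see what `index=False` changes, the sketch below writes the small `DataFrame` from above to csv text in memory instead of to a file, so nothing is created in your repository. It relies on `to_csv` returning a string when no file path is given; the variable names are chosen only for illustration.

```python
import io
import pandas as pd

df = pd.DataFrame({'names': ['Gauss', 'Newton', 'Lagrange', 'Euler'],
                   'birth year': [1777, 1643, 1736, 1707]})

# With no file path, to_csv returns the csv contents as a string.
csv_with_index = df.to_csv()
csv_without_index = df.to_csv(index=False)

# Compare the header lines: the first one starts with an empty index column.
print(csv_with_index.splitlines()[0])
print(csv_without_index.splitlines()[0])

# Reading the text back shows the extra column that the index creates.
reloaded_with = pd.read_csv(io.StringIO(csv_with_index))
reloaded_without = pd.read_csv(io.StringIO(csv_without_index))
print(list(reloaded_with.columns))
print(list(reloaded_without.columns))
```

When the index is written, reading the file back gives an extra column named `Unnamed: 0`, which is usually not what you want.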

**Now we are ready to practice using pandas and git to effectively manage data in our repositories!**

## Task 1: Get the data into our repo

For this assignment we will use a small `*.csv` file that can be downloaded using [this link](https://surfdrive.surf.nl/files/index.php/s/8xDKt0MsIcTYsJK).

The steps below outline how you should add a data set to a git repository so that you can access the data with the code (i.e., Jupyter notebook), but not commit the file to the repository. A key assumption here is that you prefer to archive the data on a different website that is more appropriate for this purpose (not git!).

<div style="background-color:#AABAB2; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px">
<p>
<b>Task 1.1:</b>   
    
Download the dataset and move it to your working directory (the git repo of this notebook).
</p>
</div>

<div style="background-color:#AABAB2; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px">
<p>
<b>Task 1.2:</b>
    
Check your GitHub Desktop to see that the file is listed as a "changed file." <b>Do not commit the dataset!</b>
</p>
</div>

As you learned in the README, we don't want to include datasets in our repositories (ignore the fact that this one is tiny). You may remember from Q1 that we can use a `.gitignore` file to tell git not to track specific files. We can do it by simply listing `dams.csv` in our `.gitignore` file.

<div style="background-color:#AABAB2; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px">
<p>
<b>Task 1.3:</b>
    
Create a <code>.gitignore</code> file to ignore the dataset. Confirm that it worked properly by making sure that the data file is no longer listed as a "changed file."
</p>
</div>

<div style="background-color:#AABAB2; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px">
<p>
<b>Task 1.4:</b>
    
Commit the <code>.gitignore</code> file.
</p>
</div>

## Task 2: Evaluate and process the data

Now that the data is stored locally, we can process it and use it in our analysis.

<div style="background-color:#AABAB2; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px">
<p>
<b>Task 2.1:</b>
    
Import the dataset as a DataFrame, then explore it and learn about its contents (use the methods presented above; you can also look inside the csv file).
</p>
</div>

In [14]:
df = pd.read_csv('dams.csv')
df.head()

| | Name | Year | Volume (1e6 m^3) | Height (m) | Type |
|---|---|---|---|---|---|
| 0 | Tarbela | 1976 | 153.0 | 143 | rock fill |
| 1 | Fort Peck | 1940 | 96.0 | 96 | earth fill |
| 2 | Ataturk | 1990 | 84.5 | 166 | rock fill |
| 3 | Houtribdijk | 1968 | 78.0 | 13 | rock fill |
| 4 | Oahe | 1963 | 70.3 | 75 | rock fill |


<div style="background-color:#FAE99E; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px; width: 95%">
<p>
<b>Solution:</b>   

We can see that this dataset has some information about dams, including the name, year constructed, volume and height. They look pretty big! It actually lists the 5 largest dams by volume and by height (10 dams total), taken from the Wikipedia <a href="https://en.wikipedia.org/wiki/List_of_largest_dams" target="_blank">list of largest dams</a>.

</p>
</div>

<div style="background-color:#AABAB2; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px">
<p>
<b>Task 2.2:</b>
    
Using the example above, find the dams in the <code>DataFrame</code> that are of type <code>earth fill</code>.
</p>
</div>

In [15]:
names_of_earth_dams = df.loc[df['Type'] == 'earth fill', 'Name'].values[:]
print('The earth fill dams are:', names_of_earth_dams)

The earth fill dams are: ['Fort Peck' 'Nurek' 'Kolnbrein' 'WAC Bennett']


_Hint: the answer should be:_ `['Fort Peck' 'Nurek' 'Kolnbrein' 'WAC Bennett']`

<div style="background-color:#AABAB2; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px">
<p>
<b>Task 2.3:</b>
    
Create a new dataframe that only includes the earth fill dams. Save it as a new csv file called <code>earth_dams.csv</code>.
</p>
</div>

_Hint: you only need to remove a small thing from the code in your answer to the task above._

In [16]:
df_earth = df.loc[df['Type'] == 'earth fill']
df_earth.to_csv('earth_dams.csv', index=False)

<div style="background-color:#AABAB2; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px">
<p>
<b>Task 2.4:</b>
    
Check the contents of the new csv file to make sure you created it correctly.
</p>
</div>
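
One possible way to do this check in code, rather than opening the file by hand, is to read the csv back and compare it with what you expected to write. The sketch below is self-contained: it uses only the five rows shown in the `df.head()` output above (not the full dataset) and writes to a temporary directory so that no file is left behind.

```python
import os
import tempfile
import pandas as pd

# First five rows of the dams table, copied from the df.head() output above.
df = pd.DataFrame({
    'Name': ['Tarbela', 'Fort Peck', 'Ataturk', 'Houtribdijk', 'Oahe'],
    'Year': [1976, 1940, 1990, 1968, 1963],
    'Volume (1e6 m^3)': [153.0, 96.0, 84.5, 78.0, 70.3],
    'Height (m)': [143, 96, 166, 13, 75],
    'Type': ['rock fill', 'earth fill', 'rock fill', 'rock fill', 'rock fill'],
})
df_earth = df.loc[df['Type'] == 'earth fill']

# Write to a temporary directory and read the file back in.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'earth_dams.csv')
    df_earth.to_csv(path, index=False)
    check = pd.read_csv(path)

# The file should have the same columns, only earth fill dams,
# and as many rows as the DataFrame that was written.
print(list(check.columns))
print(check['Type'].unique())
print(len(check) == len(df_earth))
```

In your own notebook you would read back `earth_dams.csv` directly; the temporary directory is used here only to keep the example free of side effects.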

## Task 3: Keep your repository clean

Now we have created a second csv file, but we also do not want to track it in our repo. We could add the filename to our gitignore file, but there is a better way: using a wildcard! We already used this in Q1, so hopefully you can see that adding `*.csv` to the `.gitignore` file will ignore _all_ csv files in the repository, which is exactly what we want! 
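
To get a feeling for what the wildcard matches, here is a rough sketch using the standard library module `fnmatch`. This is only an approximation: git has its own, richer matching rules for `.gitignore`, but for a simple pattern like `*.csv` the idea is the same. The notebook filename `PA13.ipynb` is a hypothetical name used for illustration.

```python
from fnmatch import fnmatchcase

pattern = '*.csv'

# Files that might be in the repository; 'PA13.ipynb' is a hypothetical name.
files = ['dams.csv', 'earth_dams.csv', 'PA13.ipynb', '.gitignore']

# Keep the names that the wildcard pattern matches.
ignored = [name for name in files if fnmatchcase(name, pattern)]
print(ignored)  # ['dams.csv', 'earth_dams.csv']
```

Both csv files match the single pattern, while the notebook and the `.gitignore` file itself do not.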

<div style="background-color:#AABAB2; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px">
<p>
<b>Task 3.1:</b>
    
Update your gitignore using the wildcard <code>*.csv</code>.
</p>
</div>

<div style="background-color:#AABAB2; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px">
<p>
<b>Task 3.2:</b>
    
Confirm that the data files do not show up as "changed files" in your GitHub Desktop application. Then commit this notebook to your repository and push it to GitLab because you are done with the assignment!
</p>
</div>

<div style="background-color:#facb8e; color: black; vertical-align: middle; padding:15px; margin: 10px; border-radius: 10px; width: 95%"> <p>Note that if we really cared about this "new" dataset, and it were too large to save it in our git repository, we would want to back it up to another (cloud) platform so that we can recover it if our files are lost. We skip this step here, but don't forget to do it if you are working on another project in the future (for example, your thesis).</p></div>

**End of notebook.**
<h2 style="height: 60px">
</h2>
<h3 style="position: absolute; display: flex; flex-grow: 0; flex-shrink: 0; flex-direction: row-reverse; bottom: 60px; right: 50px; margin: 0; border: 0">
    <style>
        .markdown {width:100%; position: relative}
        article { position: relative }
    </style>
    <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">
      <img alt="Creative Commons License" style="border-width:; width:88px; height:auto; padding-top:10px" src="https://i.creativecommons.org/l/by-nc-sa/4.0/88x31.png" />
    </a>
    <a rel="TU Delft" href="https://www.tudelft.nl/en/ceg">
      <img alt="TU Delft" style="border-width:0; width:100px; height:auto; padding-bottom:0px" src="https://gitlab.tudelft.nl/mude/public/-/raw/main/tu-logo/TU_P1_full-color.png"/>
    </a>
    <a rel="MUDE" href="http://mude.citg.tudelft.nl/">
      <img alt="MUDE" style="border-width:0; width:100px; height:auto; padding-bottom:0px" src="https://gitlab.tudelft.nl/mude/public/-/raw/main/mude-logo/MUDE_Logo-small.png"/>
    </a>
    
</h3>
<span style="font-size: 75%">
&copy; Copyright 2023 <a rel="MUDE Team" href="https://studiegids.tudelft.nl/a101_displayCourse.do?course_id=65595">MUDE Teaching Team</a> TU Delft. This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.