This is the notebook for the second week. It covers more advanced material than the first one. This week you will learn:
* How to import modules
* How to load a csv file
* How to create a data frame with pandas from csv
* How to filter rows in a data frame
    * by index/name
    * by conditions
* How to filter columns in a data frame
    * by index
    * by name
    * by condition
* Numpy
* Simple stats with numpy:
    * avg, median, mode, variance
    * standard deviation
    * order lists
    * find min/max

# Importing Modules

A module is functionality that you import from another piece of code (a library). Python provides a huge set of modules for all types of functionality: mathematics, statistics, importing and exporting data, visualization, etc. As there are so many modules available, only a tiny fraction of them are imported by default when you create a new Python program, in order to keep your program small and fast.

Here is how you import a module that gives your program the ability to read a csv file. By convention, import statements go at the beginning of your file, before you write any actual code.

You import the `csv` module like so:

In [1]:
import csv

# Reading Files

We can now load csv-files. More information on the csv package: https://docs.python.org/2/library/csv.html.

The following code opens a csv file and makes it available through the variable `csvfile`. You can think of this as a large chunk of raw data. The `'r'` means that you are opening the file in read mode, so you cannot write to it.

`with open('myfile.csv', 'r') as csvfile:`

Next, we convert the file content in `csvfile` into a csv structure with rows and columns. To do this, we use the `reader` function from the `csv` module. The result is stored in a variable we call `csvcontent` (again, since it's a variable, you can name it whatever you want, as long as you use that name consistently afterwards).

`csvcontent = csv.reader(csvfile, delimiter=' ', quotechar='|')`

The parameters `delimiter` and `quotechar` help Python understand how your csv file is formatted. In the current example, those parameters are set to a space (for `delimiter`) and `|` (say "pipe") for `quotechar`. These values depend on your specific csv file, so you need to take a look at the file to set them correctly. But what do they mean?

**delimiter**: the character (a single character on the keyboard) that separates values within a row. The following listing shows an example from a csv file, opened in a text editor. The delimiter is a comma `,`:

`Name, Height, Shoe Size
Devin, 175, 9
Pela, 178, 8.5
Jer, 182, 11`


**quotechar**: the character that encloses string values which may contain characters normally used as the delimiter. In other words: if your csv file's delimiter is a `,`, then you cannot normally use a comma inside strings, as in the following example:

`Name, Location
Devin, Paris, Texas
Marie, Paris`

In this previous example, the string `Paris, Texas` is meant to designate a single location. However, python will think that the location is only `Paris` because `Paris` is followed by a `,`. 

In order to prevent these cases, we can use **quotechars**, such as `"`:

`Name, Location
Devin, "Paris, Texas"
Marie, "Paris"`

The `"` tells Python that it should ignore all `,` between two quotechar characters. However, you need to tell Python which quotechar you are using. Technically you can use any character, e.g. `'`, `<`, `:`, `%`, `&`.
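A minimal sketch of how `delimiter` and `quotechar` change the parsing result. The two in-memory "files" below are made up for illustration and use `io.StringIO`, so no file on disk is needed; they follow the `Paris, Texas` example, but without spaces after the commas.

```python
import csv
import io

# Hypothetical file contents, held in memory instead of on disk
unquoted = io.StringIO('Name,Location\nDevin,Paris, Texas\nMarie,Paris\n')
quoted = io.StringIO('Name,Location\nDevin,"Paris, Texas"\nMarie,"Paris"\n')

# Read both variants with a comma as delimiter and " as quotechar
rows_unquoted = list(csv.reader(unquoted, delimiter=',', quotechar='"'))
rows_quoted = list(csv.reader(quoted, delimiter=',', quotechar='"'))

print(rows_unquoted[1])  # the location is split into two values
print(rows_quoted[1])    # the location stays one value
```

If your file has a space after each delimiter, passing `skipinitialspace=True` to `csv.reader` is likely needed so that the quotes are recognized at the start of a value.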

Now that we have loaded the file content into our `csvcontent` variable, we can iterate through the rows in our data. `csvcontent` behaves like a list of rows. Each row is itself a list, so you can think of `csvcontent` as a list of lists.

The following code shows the full example, loading a specific csv file with `delimiter=' '` (the delimiter is a space character) and `quotechar='|'` (the pipe character). After loading, the example iterates through all the rows (using the `for` loop) and prints each row `r` it encounters.

**NOTE** the different indentations: the second line is indented by one level because it is executed within the `with` block. The fourth line is indented by two levels because it is executed within the `for` block. Indentation is very important in Python. For more information on indentation, please read here: http://www.peachpit.com/articles/article.aspx?p=1312792&seqNum=3

In [8]:
with open('myfile.csv', 'r') as csvfile:
    csvcontent = csv.reader(csvfile, delimiter=' ', quotechar='|')
    for r in csvcontent:
        print(r)

['a', 'b', 'c']
['1', '3', '4']
['2', '4', '5']


Now, import the file `gdp_per_capita.csv` and print all its rows:

This file contains numeric data. However, by default, Python thinks the numbers are strings, not actual numbers. That means that you cannot tell python to run statistics and other mathematical functions on it. Before continuing, we need to convert all the strings into numbers. Here is how we do that:
* iterate through all cells with a nested loop (remember last class)
* convert values to numbers with the `float()` function (the values contain decimals, so `int()` would fail)

Now, here is how you iterate through all cells. We already said that `csvcontent` is a list of lists. Hence, in order to iterate through all cells, we need two loops, one inside the other:

In [14]:
with open('gdp_per_capita.csv', 'r') as csvfile:
    # Assumption: comma-separated file; adjust delimiter and quotechar to your file
    csvcontent = csv.reader(csvfile, delimiter=',', quotechar='"')
    for row in csvcontent:  # Go through each row
        # Do anything you need to do per row...
        for value in row:  # Now go through the values of each row
            # Do anything you need to do to each value...
            pass  # placeholder: a loop body cannot consist of comments only

There are two more issues to take care of: the header row and missing values.

### Working with headers

The file you just loaded contains a row for each country, with the first row containing the headers of the file. The first entry in each row shows the country name, and the following columns show the country's GDP for individual years (1950, 1955, 1960, and so on). The structure is similar to what you might see in a spreadsheet table.

This means that the first row is a header row rather than data, so take care when you convert values in the first row.

### Cleaning data
Never trust your data: in most cases, there are glitches, errors, and missing data in your file that need to be fixed before you can convert your data into some form of analyzable object in Python. Such glitches may be a misplaced comma or other characters, misspelled entries, or differently formatted dates. Sometimes certain irregularities are there on purpose.

For example, the first data row (`Afghanistan`) does not contain any values. They are missing and empty. Some values in the other rows are also missing. Be aware of this when you iterate through rows and check for values, and skip any values that are not numbers.

## Putting it together

You should write a program that does the following:

* create a new list called `data` that will contain your clean data.
* iterate through all rows, then iterate through all values in that row, using two nested loops (remember last class)
* inside the first loop, i.e. for each new row you find, create a variable called `newRow`. This will contain your clean data for that row. Also, create a variable `counter` that counts which value in the row you are parsing. Remember that you need to convert all but the first value, which contains the country name and cannot be converted to a float.
* in the second loop, the place where you iterate over each value `v` in that row, you need to make 2 decisions: 1) is this value the first one (i.e. `counter == 0`), and 2) is this value empty (i.e. `v == ''`)? If either of these is the case, simply append the value `v` to your `newRow`. In all other cases, append the converted value `float(v)`.
* don't forget to increment the `counter` by `1` after each value you have processed: `counter = counter + 1`
* after you have iterated over all values in one row, append the `newRow` to the `data` list.

Finally, after you have done all the conversions, print `data`.

You should get something like:
`[['countries', 1950.0, 1955.0, 1960.0, 1965.0, 1970.0, 1975.0, 1980.0, 1985.0, 1990.0, 1995.0, 2000.0, 2005.0, 2010.0, 2015.0], ['Afghanistan', '', '', '', '', '', '', '', '', '', '', '', '', ''], ['Brazil', `
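If you get stuck, the steps above can be sketched as follows. This is one possible version, not the only one. To keep it self-contained it reads a tiny made-up sample through `io.StringIO` instead of the real file, and it assumes a comma as delimiter; the variable name `sample` and its values are hypothetical.

```python
import csv
import io

# Hypothetical stand-in for the real file: a header, an empty row, a filled row
sample = io.StringIO(
    'countries,1950,1955\n'
    'Afghanistan,,\n'
    'Brazil,4297.8,3739.9\n'
)

data = []  # will hold the cleaned rows
for row in csv.reader(sample, delimiter=','):
    newRow = []   # cleaned version of the current row
    counter = 0   # position of the current value within the row
    for v in row:
        if counter == 0 or v == '':
            # first value (a name) or a missing value: keep it unchanged
            newRow.append(v)
        else:
            # everything else is converted from string to number
            newRow.append(float(v))
        counter = counter + 1
    data.append(newRow)

print(data)
```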

### Note: don't spend too long on this!
This is the hard way to do things, to give you experience in working with loops. If you're struggling, skip on to the Pandas section afterwards, because Pandas does a lot of this work for you.

In [15]:
# Your code here

# Pandas

Pandas (http://pandas.pydata.org) is a Python module that provides you with convenient methods to manipulate, filter, and aggregate more complex data in table format. The central structure in pandas is called a `DataFrame`. A dataframe is a table representation of your data like in the previous example, but with a lot of functionality.

## Importing and Creating Dataframes

First, we need to import the pandas module. 
When importing a module, you can give it an abbreviation, as some of the modules can have long names. Abbreviations are indicated with the `as` keyword after the `import modulename`:
`import mymodulename as myabbreviation`. 

In the following, we want to import the `pandas` module while using the abbreviation `pd`:

In [11]:
import pandas as pd

`pd` is the standard abbreviation for pandas and is used in most online tutorials.

Now, let's load our csv file into a data frame. Pandas already comes with its own csv-import function that returns a dataframe:

`pd.read_csv('somefile.csv')`
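As a minimal sketch of what `pd.read_csv` returns, the following reads a small made-up csv text from memory (via `io.StringIO`) instead of a file on disk. The variable names `sample` and `sampleFrame` and the rounded values are for illustration only.

```python
import io
import pandas as pd

# Hypothetical csv content; read_csv accepts a file name or a file-like object
sample = io.StringIO(
    'countries,1950,1955,1960\n'
    'Brazil,4297.82,3739.92,3693.28\n'
    'China,1864.10,1105.95,774.88\n'
    'Germany,25297.39,23217.33,21502.72\n'
)

sampleFrame = pd.read_csv(sample)  # the first line becomes the column names
print(sampleFrame)
```

Note that the column names taken from the header are strings (`'1950'`, not `1950`).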

Load the data file `gdp_per_capita.csv` and put the results in a variable called `myDataFrame`. Print `myDataFrame` to be sure it's properly loaded.

You should see a table like this:

  countries          1950          1955          1960
0    Brazil   4297.823854   3739.919389   3693.275820
1     China   1864.102702   1105.952557    774.884634
2   Germany  25297.385393  23217.332027  21502.721444

Now, it's time to load all the data. Load the file `gdp_per_capita.csv` and print the entire table.

As already mentioned, dataframes are sophisticated objects. So far, we have called functions from Python directly (e.g. `print()`, `len()`) or from modules (`csv.reader()`). We can also call methods on objects, such as a DataFrame object. For example, the following two methods can be called on any variable that is a `DataFrame` object:

* `DataFrame.head()`: shows the 5 first rows
* `DataFrame.tail()`: shows the 5 last rows

Can you try that for our dataframe? Replace the `DataFrame` part in the above examples with the name of our DataFrame variable.

There are some simple metrics we can read off the data frame, using the following expressions:
* `DataFrame.shape`: returns the numbers of rows and columns in the data frame in the format (rows, columns)
* `list(DataFrame.columns)`: returns all the column names
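A short sketch of these calls on a small made-up data frame (the name `demo` and its values are hypothetical), so that you can compare the results with what you get for the real data:

```python
import pandas as pd

# Hypothetical data frame with 7 rows and 2 columns
demo = pd.DataFrame({
    'countries': ['A', 'B', 'C', 'D', 'E', 'F', 'G'],
    '1950': [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
})

first = demo.head()   # first 5 rows
last = demo.tail()    # last 5 rows
size = demo.shape     # (rows, columns); an attribute, so no parentheses
column_names = list(demo.columns)

print(first)
print(last)
print(size)
print(column_names)
```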

_How many rows and columns does our DataFrame have?_

_Can you output all the column names?_ 

## Selecting Rows and Columns

Now, we want to filter rows and columns to calculate statistics on the values and create visualizations later on. There are a couple of ways to select values and sets of values in a DataFrame:

* `DataFrame['mycolumnname']`: Selects the column with the name `mycolumnname` in the column title (pay attention to the single quotes and square brackets).
* `DataFrame[startrow:endrow]`: Selects the rows from position `startrow` up to, but not including, `endrow`.
* `DataFrame.loc['mylabel']`: Selects the row with the label `mylabel`.
* `DataFrame.iloc[...]`: Selects rows and columns by index.

We will exercise each of these functions individually in the following.
More information here: https://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing
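A minimal sketch of the two bracket forms, on a made-up data frame (the name `demo` and its values are hypothetical):

```python
import pandas as pd

# Hypothetical data frame; the column names are strings
demo = pd.DataFrame({
    'countries': ['A', 'B', 'C', 'D'],
    '1950': [1.0, 2.0, 3.0, 4.0],
    '1955': [1.5, 2.5, 3.5, 4.5],
})

col = demo['1955']   # one column, selected by its name
rows = demo[1:3]     # rows at positions 1 and 2 (position 3 is not included)

print(col)
print(rows)
```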

### Selecting columns

A single column is selected through `DataFrame['columnname']`. In our example, selecting columns means selecting individual years or ranges of years. 

_Can you select the column for the year 1955?_

### Selecting rows
To select a range of rows, use `DataFrame[startRowIndex:endRowIndex]`. 
_Can you select rows 2 to 5?_

The above method selects a set of rows but what if we want one row, say `Brazil`? There are two options. 

Option 1 is to select by row position using `DataFrame.iloc[rownumber]`. To select Brazil, we have to find the row number for Brazil and pass it as a parameter to the `iloc` selector. Option 2, selecting by label with `loc`, is introduced further below.

_Can you do it?_ 

### Selecting rows and columns with `iloc[]`

`iloc` helps you select both rows and columns by their indices, i.e. their position numbers (counting starts at 0).

`iloc[ROW_SELECTION_GOES_HERE, COLUMN_SELECTION_GOES_HERE]` takes one or two parameters. The first one (`ROW_SELECTION_GOES_HERE`) specifies the rows you want to select, the second one (`COLUMN_SELECTION_GOES_HERE`) specifies the columns you want to select. Each parameter can take one of the following three forms, independently of the other:

* An **individual value (e.g., 1)**: use this when you want to select a single row or column. E.g. `DataFrame.iloc[2,4]` gives you the value at row index 2 and column index 4, i.e. the third row and fifth column (there are two individual values).

* An **enumeration/list (e.g., [0,1,3])**: use this when you have specific rows and/or columns to select, and be careful to put square brackets around your list of numbers. For example, the expression `DataFrame.iloc[[0,2], [1,3]]` returns the values of columns 1 and 3 for rows 0 and 2 (4 values in total).

* **Ranges (e.g., 1:3)**: use this when you want to select a range of rows or columns. The end of a range is not included. For example, the expression `DataFrame.iloc[0:3, 1:3]` returns the values of columns 1 and 2 for rows 0, 1, and 2 (6 values in total). When using ranges, you can leave fields blank, meaning that you refer to the first or last row or column. For example, `DataFrame.iloc[:2, 3:]` returns all columns from column 3 onward for rows 0 and 1.


Of course, you can mix the above forms for the `ROW_SELECTION_GOES_HERE` parameter and the `COLUMN_SELECTION_GOES_HERE` parameter. E.g. the statement `DataFrame.iloc[2,0:3]` will return columns 0 to 2 for row 2.
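The three forms can be tried out on a small made-up table. In this sketch the name `demo`, the column names, and the values are hypothetical; each value encodes its position (row 2, column 3 holds `33`, counting rows from 1 and columns from 0 in the number itself).

```python
import pandas as pd

# Hypothetical 4x4 table
demo = pd.DataFrame(
    [[10, 11, 12, 13],
     [20, 21, 22, 23],
     [30, 31, 32, 33],
     [40, 41, 42, 43]],
    columns=['a', 'b', 'c', 'd'],
)

single = demo.iloc[2, 3]            # one value: row index 2, column index 3
picked = demo.iloc[[0, 2], [1, 3]]  # rows 0 and 2, columns 1 and 3
block = demo.iloc[0:3, 1:3]         # rows 0-2, columns 1-2 (ends excluded)
mixed = demo.iloc[2, 0:3]           # row 2, columns 0-2

print(single)
print(picked)
print(block)
print(mixed)
```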
 
More information here: https://www.shanelynn.ie/select-pandas-dataframe-rows-and-columns-using-iloc-loc-and-ix/



_Can you select the values for the first 3 years for `Germany` and `Malawi`?_

### Creating labels for rows

The previous example required you to look up the indices for `Germany` and `Malawi` individually and pass the indices to `iloc`. Not only is this inconvenient, but with a larger data set it quickly becomes impractical.

Pandas has a selector that allows you to ask for names, rather than indices. It is called `loc` (while `iloc` stands for integer-location). `loc` allows you to **pass column names and row labels** and can be much more convenient to use.

Now, columns names are usually specified in the first row of your csv files (the table header), which in our case are the years. We have already used them with `DataFrame['somecolumnname']`.

However, in addition to years in the first row, our data has country names in the very first column (which is the column with the index 0). While Pandas assumes that you have column labels, it does not assume that you have row labels. This is because many tables have an index in the first column, rather than a name like ours.

Thus, we need to tell Pandas that our first column should be used as **labels** (which is what Pandas calls them). It is important that each label is **_unique_**, i.e. no two rows should have the same label; otherwise a lookup by label returns several rows. In our case, we have only one row per country and no two countries with the same name. Great, let's move on.
 
In order to tell Pandas which column we want to use as labels, we use the `DataFrame.set_index(COLUMN_NAME, inplace=False)` method. The `inplace=False` parameter (which is also the default) prevents Pandas from modifying the `DataFrame` variable; rather, the method returns a new data frame and leaves the old one untouched. Hence, here, we create a new dataframe which we want to call `countryData` and which is returned by calling `set_index(..)` on our first dataframe.

**NOTE:** In your own projects, you can use `inplace=True`, which is more convenient since you do not have to create a new data frame. However, for the exercises in this notebook, you need to create a new data frame, otherwise the following examples will not work properly.
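A minimal sketch of what `set_index` changes, using a made-up two-row data frame (the names `demo` and `labelled` and the values are hypothetical):

```python
import pandas as pd

# Hypothetical data frame with the default numeric row index
demo = pd.DataFrame({
    'countries': ['Brazil', 'China'],
    '1950': [4297.82, 1864.10],
})

# Returns a new data frame; demo itself is left untouched
labelled = demo.set_index('countries')

print(list(demo.index))        # numeric positions
print(list(labelled.index))    # country names
print(list(labelled.columns))  # 'countries' is no longer a regular column
```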

_Can you set the country column as labels and save the result in a new data frame called `countryData`?_

_Now, print the first few rows of both data frames (`myDataFrame` and `countryData`) to see the difference._ 

In the first printout, the first column of your table should contain numbers (the indices). In the second printout (after calling `set_index(..)`), the first column should contain the country names and the numbers are gone. My (Chrome) browser conveniently renders the entries in the first column in bold.

In the following, we will continue with the `countryData` data frame that has our country names as labels.

### Selecting rows and columns with `loc[]`

Now, we can use `loc` with both row and column labels. `loc[]` works pretty much the same way as `iloc[]`, but instead of integer indices (e.g. `DataFrame.iloc[2,3]`), `loc[]` understands our labels and column names.

The three query methods to select rows, and optionally columns, are the same as for `iloc`:
1. **Individual value (e.g. 'Brazil')**: `DataFrame.loc['Brazil']`
2. **Enumeration (e.g. ['Brazil','United Kingdom'])**: `DataFrame.loc[['Brazil', 'United Kingdom']]` (note the double square brackets), and
3. **Ranges (e.g. 'Brazil':'United Kingdom')**: `DataFrame.loc['Brazil':'United Kingdom']` (unlike with `iloc`, the end label is included)
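A short sketch of the three forms with `loc`, on a made-up data frame that already has country names as row labels (the name `labelled` and the rounded values are hypothetical):

```python
import pandas as pd

# Hypothetical data frame with country names as row labels
labelled = pd.DataFrame(
    {
        '1950': [4297.82, 1864.10, 25297.39],
        '1955': [3739.92, 1105.95, 23217.33],
        '1960': [3693.28, 774.88, 21502.72],
    },
    index=['Brazil', 'China', 'Germany'],
)

one = labelled.loc['Brazil']                               # a single row
two = labelled.loc[['Brazil', 'Germany']]                  # a list of rows
span = labelled.loc['Brazil':'China']                      # a range, end label included
part = labelled.loc[['Brazil', 'Germany'], '1955':'1960']  # rows and a column range
cell = labelled.loc['China', '1955']                       # one value

print(one)
print(two)
print(span)
print(part)
print(cell)
```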

_Can you select all values for `Germany`?_

_Can you select the values for `Germany` and `Malawi` for the years 1960 to 1980?_

**NOTE**: `loc` takes labels and column names, not integer positions.

## Conditional Selection with Pandas and Boolean Operations
One of the most useful tools in pandas is **conditional selection**, i.e. selecting rows based on their values in particular columns. For example, you may want to calculate statistics for high-income countries only. Let's see how that works. `loc` gives us almost all we need.

### Conditions
First, we need to know about **conditions**. A condition performs a test and returns `True` or `False`. The following condition tests if a value in the column `1960` is higher than `1000` (here and below, `df` stands for your data frame variable, e.g. `countryData`).

`df['1960'] > 1000`

Used with `loc`, we can filter all rows which have a value higher than `1000` in column `1960`: 

`df.loc[df['1960'] > 1000]`

The condition inside the brackets is evaluated for every row of the data frame and yields `True` or `False` for each one. Remember, when you pass a label to `loc`, it looks for the row whose label matches the passed name. Now, we match a condition instead.

If the condition in the square brackets is true for a row, this row is included in the result.

Conditions are a powerful mechanism. Besides the greater than `>` operation, they include the following numeric relations:

* **less than**: `<`, e.g. `df.loc[df['1960'] < 1000]`
* **equal**: `==`, e.g. `df.loc[df['1960'] == 1000]`
* **not equal**: `!=`, e.g. `df.loc[df['1960'] != 1000]`
* **greater than or equal**: `>=`, e.g. `df.loc[df['1960'] >= 1000]`
* **less than or equal**: `<=`, e.g. `df.loc[df['1960'] <= 1000]`
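To see what a condition produces on its own, here is a minimal sketch on a made-up data frame (the name `df` follows the examples above; the rounded values and the names `mask` and `rich` are hypothetical):

```python
import pandas as pd

# Hypothetical data frame with country names as row labels
df = pd.DataFrame(
    {'1960': [3693.28, 774.88, 21502.72]},
    index=['Brazil', 'China', 'Germany'],
)

mask = df['1960'] > 1000  # one True/False value per row
rich = df.loc[mask]       # keeps only the rows where mask is True

print(mask)
print(rich)
```

Storing the condition in a variable first is optional; `df.loc[df['1960'] > 1000]` gives the same result in one step.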

_Can you filter all rows (countries) with values lower than `10000` in 2010?_ 

### Boolean operations

Moreover, we can combine conditions through logical **boolean operations**. Boolean operations are logical constructs that work through the following **boolean operators**: 

1. **AND (`&`)**: selects a row if **all** conditions joined by an `&` sign are true:
    * e.g. `df.loc[(df['1960'] > 1000) & (df['1960'] < 3000)]` returns all countries with values between 1000 **and** 3000.

2. **OR (`|`)**: selects a row if **at least one** of the conditions joined by a `|` (say 'pipe') sign is true:
    * e.g. `df.loc[(df['1960'] < 1000) | (df['1960'] > 3000)]` returns all countries with values smaller than 1000 **or** larger than 3000.

3. **NOT (`~`)**: selects a row if a **condition is not met**. The `~` character (say 'tilde') has to stand _before_ the condition:
    * e.g. `df.loc[~(df['1960'] < 1000)]` returns all countries with values not lower than 1000 (i.e. rows with values of 1000 or higher).

You can combine these boolean operations in many ways using parentheses, as in the following example, which returns all countries with values between 1000 and 3000, or countries with values exactly 10000 in 1960.

`df.loc[((df['1960'] > 1000) & (df['1960'] < 3000)) | (df['1960'] == 10000)]`
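The boolean operators can be checked on a small made-up data frame. In this sketch the row labels `A` to `D` and the values are hypothetical, chosen so that each operator selects a different set of rows.

```python
import pandas as pd

# Hypothetical data frame with four rows
df = pd.DataFrame(
    {'1960': [500.0, 2000.0, 5000.0, 10000.0]},
    index=['A', 'B', 'C', 'D'],
)

# AND: both conditions must hold
between = df.loc[(df['1960'] > 1000) & (df['1960'] < 3000)]
# OR: at least one condition must hold
outside = df.loc[(df['1960'] < 1000) | (df['1960'] > 3000)]
# NOT: the condition must not hold
not_low = df.loc[~(df['1960'] < 1000)]
# Combination, grouped with parentheses
combined = df.loc[((df['1960'] > 1000) & (df['1960'] < 3000)) | (df['1960'] == 10000)]

print(list(between.index))
print(list(outside.index))
print(list(not_low.index))
print(list(combined.index))
```

The parentheses around each individual condition are needed because `&`, `|`, and `~` bind more tightly than the comparison operators.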

_Can you get only those countries whose values have increased from below 10000 in 1950 to over 300000 in 2010?_

Write a few more scripts to search for some more values, using conditions.

Being able to filter rows and columns by index (`iloc[]`), name (`loc[]`), and conditional values, we can now proceed with calculating statistical values on rows and columns. In the following, we introduce Numpy, a library for exactly this purpose.

# Numpy

Numpy is a python module for all sorts of numerical operations and statistical analysis (http://www.numpy.org). We import numpy as follows:

In [14]:
import numpy as np

The directive `as np` assigns an abbreviation to the module with the official name `numpy` (`np` is the standard abbreviation for numpy used in most other tutorials and references). To call a function from an imported module, you can then use your abbreviation like so:

`np.sum([1,2,3])` instead of `numpy.sum([1,2,3])` (the `sum(..)` function returns the sum of the elements of the passed array).

## Descriptive Statistics with Numpy

Given the data above, let's calculate some simple descriptive statistics for each country: mean, median, standard deviation, min, max, etc.
The functions we need are
* `np.mean(myarrayhere)` --- returns the arithmetic mean
* `np.median(myarrayhere)` --- returns the median (the value half-way through an ordered set https://en.wikipedia.org/wiki/Median)
* `np.std(myarrayhere)` --- returns the standard deviation

More useful numpy functions are found here: https://docs.scipy.org/doc/numpy/reference/routines.statistics.html
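A minimal sketch of these functions on a fixed, made-up array (the name `values` and its numbers are hypothetical, chosen so that the results are easy to check by hand):

```python
import numpy as np

# Hypothetical values: eight numbers that add up to 40
values = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

total = np.sum(values)      # 40.0
mean = np.mean(values)      # 40 / 8 = 5.0
median = np.median(values)  # middle of the sorted values: (4 + 5) / 2 = 4.5
std = np.std(values)        # square root of the mean squared deviation = 2.0

print(total, mean, median, std)
```

By default `np.std` divides by the number of values (the population standard deviation). Pandas' own `std()` method divides by one less than that by default, so the two can give slightly different results on the same data.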


## Simple Array Example

Let's start with a simple example. We create an array of length 10 of some random numbers, using numpy's `rand` function in the `random` package (`np.random.rand(DESIRED_ARRAY_LENGTH_HERE)`). Then print that array.

Now, can you calculate all the values listed above for these numbers? (mean, median, sum, std)

## Data Example

Now we turn back to our country example. In this notebook, we pass numpy one array at a time, i.e. a single row or a single column of our table, and calculate the statistics for that row or column.
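As a sketch of how a pandas selection is handed to numpy, the following uses a made-up data frame with round numbers (the name `labelled` and the values are hypothetical, not taken from the real file):

```python
import numpy as np
import pandas as pd

# Hypothetical data frame with country names as row labels
labelled = pd.DataFrame(
    {'1950': [4000.0, 2000.0], '1955': [3000.0, 1000.0]},
    index=['Brazil', 'China'],
)

col_mean = np.mean(labelled['1950'])        # mean of one column (one year)
row_mean = np.mean(labelled.loc['Brazil'])  # mean of one row (one country)

print(col_mean)
print(row_mean)
```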

_Can you calculate the mean value in 1990 across all countries?_ Tip: first, get the country data for 1990 using some of pandas selection functions, then pass them as a parameter to the corresponding numpy function.

The value should be `6528.782086357598`

_Can you calculate the mean for `Brazil` for all years?_

The value should be `3303.2866764551354`

_Can you calculate the mean for `Brazil` for the years 1950 to 1970?_

_Can you calculate the two means for Brazil and for Germany for the two years 1950 and 1970?_

### Iterating through a DataFrame. 

Now that we can print values for each row and column individually, you may ask whether this procedure can be automated: can we just print the means for all rows in one statement? Not with the tools we have seen so far, but we can iterate through the rows using a loop:

`for index, row in countryData.iterrows():
   print(index)`
    
In our case, `index` holds the label of the row (the country name) and `row` holds the values in that row.
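A minimal sketch of what the loop variables contain, on a made-up data frame (the names `labelled` and `labels` and the values are hypothetical). It only prints the raw values, so the calculation of the means is left to you.

```python
import pandas as pd

# Hypothetical data frame with country names as row labels
labelled = pd.DataFrame(
    {'1950': [4000.0, 2000.0], '1955': [3000.0, 1000.0]},
    index=['Brazil', 'China'],
)

labels = []
for index, row in labelled.iterrows():
    # index is the row label; row holds the values of that row
    print(index, list(row))
    labels.append(index)
```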

_Can you complete the loop and print all the countries' means?_

## More Array Functions with Numpy

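Numpy can do more than descriptive statistics. Here is a list of useful functions:

* `np.size(array)`: returns the length of an array, i.e. how many elements it contains
* `np.sort(array)`: returns a sorted copy of the array
* `np.max(array)`: returns the largest value in the array
* `np.min(array)`: returns the smallest value in the array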
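A small sketch of these helpers on a made-up array (the name `arr` and its values are hypothetical). It uses `np.max` and `np.min`, which return the largest and smallest element of a single array.

```python
import numpy as np

# Hypothetical array of three values
arr = np.array([3.5, 1.0, 2.25])

count = np.size(arr)    # number of elements
ordered = np.sort(arr)  # sorted copy, smallest first; arr itself is unchanged
largest = np.max(arr)   # largest element
smallest = np.min(arr)  # smallest element

print(count, ordered, largest, smallest)
```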



Can you output only the countries with the highest and lowest means across all years?

Can you output only the countries with the highest and lowest values in 1980? 

Can you output the means for the 3 countries that have the highest values in 1990?

Which country has the highest increase in values from the first to the last year? 

Which countries have higher values in 1990 than in 1960?

# Writing CSV files 

Finally, after doing some stats or even after cleaning a large table, you may want to write data back into a csv file for later use. Pandas makes this easy with the `to_csv` method, which takes a filename (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html).
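A minimal sketch of a write-and-read round trip. To avoid touching your working folder, it writes into a temporary directory created with `tempfile`; the data frame `demo`, its values, and the file name `demo.csv` are made up for illustration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical data frame whose row labels carry the name 'countries'
demo = pd.DataFrame(
    {'2005': [900.5, 7100.0], '2010': [1500.5, 8200.25]},
    index=pd.Index(['A', 'B'], name='countries'),
)

# Write to a file in a temporary directory
path = os.path.join(tempfile.mkdtemp(), 'demo.csv')
demo.to_csv(path)

# Read the file back and use the 'countries' column as row labels again
restored = pd.read_csv(path, index_col='countries')
print(restored)
```

Without `index_col`, the country names would come back as a regular column and the rows would get a fresh numeric index.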

Select all countries with a GDP of less than 10000 in 2010, and put the result in a variable called `lower`.
Now write this dataframe to a file called `lower_2010.csv`. 

Now, try reading that file back in, and make sure that the values are the same.

# That's it!

Congratulations. You have made it. 

This tutorial introduced you to `pandas` and `numpy`. We explained how to use pandas to select rows and columns and numpy to calculate some simple statistics. Numpy is very powerful and you will use it a lot in cases where you do not need `pandas`.

More info on descriptive statistics with `pandas` can be found here: 
https://pandas.pydata.org/pandas-docs/stable/basics.html#descriptive-statistics