# CIC Virtual Carpentries Workshop - Day 2

This lesson is adapted from the [Data Carpentry Ecology lesson](http://www.datacarpentry.org/python-ecology-lesson/)

We'll be using the gitter channel to share solutions to challenges, ask questions and chat:

**enter link here**


## How to use a Jupyter Notebook

https://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/index.html

https://www.packtpub.com/books/content/getting-started-jupyter-notebook-part-1

- The file autosaves
- You run a cell with **shift + enter** or using the run button in the tool bar
- If you run a cell with **alt + enter** (**option + enter** on a Mac) it will also create a new cell below
- See *Help > Keyboard Shortcuts* or the *Cheatsheet* for more info


- The notebook has different types of cells: Code and Markdown are the most commonly used
- **Code** cells expect code for the kernel you have chosen, and syntax highlighting is available. Comments are marked with `#`: anything after it on the same line is not executed
- **Markdown** cells allow you to write report-style text, using Markdown to format it (e.g. headers, bold face, etc.)

# Reading CSV Data Using Pandas

We will begin by locating and reading our survey data which are in CSV format.
We can use Pandas' `read_csv` function to pull the file directly into a
[DataFrame](http://pandas.pydata.org/pandas-docs/stable/dsintro.html#dataframe).


In [1]:
# check if you need to change your directory
import os
os.getcwd()  

'/home/paulhancock/Work/CIC_Carpentries_Python/notebooks'

In [2]:
os.listdir("../")

['environment.yml',
 'LICENSE',
 'notebooks',
 'data',
 'README.md',
 '.git',
 '.gitignore']

In [3]:
# If not already in the data folder, change directory
os.chdir("../data/")

**Be careful not to execute the above cell twice**, as it will try to change directory again, this time from your new location, which will give you an error.

In [4]:
# check we are now in the correct folder
os.getcwd()  

'/home/paulhancock/Work/CIC_Carpentries_Python/data'

In [5]:
# Load the library
import pandas as pd
#check your version, we need v0.19 or higher
pd.__version__

'1.1.4'

In [6]:
surveys_df = pd.read_csv("surveys.csv")

In [7]:
# some variables we will need later, we originally created them during the previous session
# select the numeric 'wgt' column before averaging, so text columns are never averaged
avrg_wgt = surveys_df.groupby('plot')['wgt'].mean()
plot_names = pd.unique(surveys_df['plot'])

# Indexing & Slicing in Python

We often want to work with subsets of a **DataFrame** object. There are
different ways to accomplish this, including using labels (i.e. column headings, as used previously),
numeric ranges, or specific x,y index locations.

## Extracting Range based Subsets: Slicing

**REMINDER**: Python Uses 0-based Indexing

Let's remind ourselves that Python uses 0-based
indexing. This means that the first element in an object is located at position
0. This is different from other tools like R and Matlab that index elements
within objects starting at 1.


![indexing diagram](https://datacarpentry.org/python-ecology-lesson/fig/slicing-indexing.png)

![slicing diagram](https://datacarpentry.org/python-ecology-lesson/fig/slicing-slicing.png)

## Challenges


```python
# Create a list of numbers:
a = [1,2,3,4,5]
```

1. What value does the code below return?
        a[0]
2. How about this:
        a[5]
3. Or this?
        a[len(a)]
4. In the example above, calling `a[5]` returns an error. Why is that?

In [8]:
a = [1,2,3,4,5]

## Slicing Subsets of Rows in Python

Slicing using the `[]` operator selects a set of rows and/or columns from a
DataFrame. To slice out a set of rows, you use the following syntax:
`data[start:stop]`. When slicing in pandas the start bound is included in the
output. The stop bound is one step BEYOND the row you want to select. So if you
want to select rows 0, 1 and 2 your code would look like this:

```python
# select rows 0,1,2 (but not 3)
surveys_df[0:3]
```

The stop bound in Python is different from what you might be used to in
languages like Matlab and R.

```python
# select the first, second and third rows from the surveys variable
surveys_df[0:3]
# select the first 5 rows (rows 0,1,2,3,4)
surveys_df[:5]
# select the last row of the DataFrame
surveys_df[-1:]
```
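
The slicing rules above can be tried on a small table without loading the survey file. This is a minimal sketch: `toy_df` and its values are made up for illustration and are not part of the survey data.

```python
import pandas as pd

# Hypothetical toy table standing in for surveys_df
toy_df = pd.DataFrame({"record_id": [1, 2, 3, 4, 5],
                       "wgt": [40.0, 48.0, 29.0, 46.0, 36.0]})

first_three = toy_df[0:3]   # rows at positions 0, 1, 2; position 3 is excluded
last_row = toy_df[-1:]      # a one-row DataFrame holding the final row

print(len(first_three))                 # number of rows selected
print(last_row["record_id"].tolist())   # record_id of the final row
```

Note that `toy_df[-1:]` returns a DataFrame with one row, not a single value.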

We can also reassign values within subsets of our DataFrame. But before we do that, let's make a 
copy of our DataFrame so as not to modify our original imported data. 

```python
# copy the surveys dataframe so we don't modify the original DataFrame
surveys_copy = surveys_df

# set the first three rows of data in the DataFrame to 0
surveys_copy[0:3] = 0
```

Next, try the following code: 

```python
surveys_copy.head()
surveys_df.head()
```
What is the difference between the two data frames?

In [None]:
surveys_df.head()

In [18]:
# copy the surveys dataframe so we don't modify the original DataFrame
surveys_copy = surveys_df

# set the first three rows of data in the DataFrame to 0
surveys_copy[0:3] = 0

In [19]:
print(surveys_copy.head())
print(surveys_df.head())

   record_id  month  day  year  plot species sex  wgt
0          0      0    0     0     0       0   0  0.0
1          0      0    0     0     0       0   0  0.0
2          0      0    0     0     0       0   0  0.0
3          4      7   16  1977     7      DM   M  NaN
4          5      7   16  1977     3      DM   M  NaN
   record_id  month  day  year  plot species sex  wgt
0          0      0    0     0     0       0   0  0.0
1          0      0    0     0     0       0   0  0.0
2          0      0    0     0     0       0   0  0.0
3          4      7   16  1977     7      DM   M  NaN
4          5      7   16  1977     3      DM   M  NaN


## Referencing Objects vs Copying Objects in Python

We might have thought that we were creating a fresh copy of the `surveys_df` object when we
used the code `surveys_copy = surveys_df`. However, the statement `y = x` doesn't create a copy of our DataFrame.
It creates a new variable `y` that refers to the **same** object `x` refers to. This means that there is only one object
(the DataFrame), and both `x` and `y` refer to it. So when we set the first 3 rows to 0 using the
`surveys_copy` DataFrame, the `surveys_df` DataFrame is modified too. To create a fresh copy of the `surveys_df`
DataFrame we use the syntax `y = x.copy()`. But first we have to read `surveys_df` in again, because the current version contains the unintentional changes made to the first 3 rows.

```python
# we are already inside the data folder, so the file name alone is enough
surveys_df = pd.read_csv("surveys.csv")
surveys_copy = surveys_df.copy()
```
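
The difference between a second name and a real copy can be checked directly with Python's `is` operator. The sketch below uses a made-up `toy_df`; the names `alias` and `independent` are chosen here for illustration only.

```python
import pandas as pd

# Hypothetical toy table (values are invented)
toy_df = pd.DataFrame({"plot": [2, 3, 2], "wgt": [40.0, 48.0, 29.0]})

alias = toy_df                # a second name for the same object
independent = toy_df.copy()   # a separate object with its own data

print(alias is toy_df)        # same object?
print(independent is toy_df)  # same object?

alias.loc[0, "wgt"] = 0.0        # also changes toy_df, because alias IS toy_df
independent.loc[1, "wgt"] = 0.0  # toy_df is not affected

print(toy_df)
```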

In [20]:
surveys_df = pd.read_csv("surveys.csv")
surveys_copy= surveys_df.copy()

In [21]:
surveys_df.head()

Unnamed: 0,record_id,month,day,year,plot,species,sex,wgt
0,1,7,16,1977,2,,M,
1,2,7,16,1977,3,,M,
2,3,7,16,1977,2,DM,F,
3,4,7,16,1977,7,DM,M,
4,5,7,16,1977,3,DM,M,


In [22]:
surveys_copy.head()

Unnamed: 0,record_id,month,day,year,plot,species,sex,wgt
0,1,7,16,1977,2,,M,
1,2,7,16,1977,3,,M,
2,3,7,16,1977,2,DM,F,
3,4,7,16,1977,7,DM,M,
4,5,7,16,1977,3,DM,M,


In [23]:
# set the first three rows of data in the DataFrame to 0
surveys_copy[0:3] = 0

In [24]:
surveys_df.head()

Unnamed: 0,record_id,month,day,year,plot,species,sex,wgt
0,1,7,16,1977,2,,M,
1,2,7,16,1977,3,,M,
2,3,7,16,1977,2,DM,F,
3,4,7,16,1977,7,DM,M,
4,5,7,16,1977,3,DM,M,


In [25]:
surveys_copy.head()

Unnamed: 0,record_id,month,day,year,plot,species,sex,wgt
0,0,0,0,0,0,0,0,0.0
1,0,0,0,0,0,0,0,0.0
2,0,0,0,0,0,0,0,0.0
3,4,7,16,1977,7,DM,M,
4,5,7,16,1977,3,DM,M,


## Slicing Subsets of Rows and Columns in Python Pandas

We can select specific ranges of our data in both the row and column directions
using either label or integer-based indexing.

- `loc`: indexing via *labels* (which can be integers)
- `iloc`: indexing via *integers*

![loc_iloc_subsetting](http://104.236.88.249/wp-content/uploads/2016/10/Pandas-selections-and-indexing.png)


![dataframe_indexing](https://vrzkj25a871bpq7t1ugcgmn9-wpengine.netdna-ssl.com/wp-content/uploads/2019/01/pandas-dataframe-has-indexes.png)

To select a subset of rows AND columns from our DataFrame, we can use the `iloc`
indexer. For example, we can select month, day and year (columns 2, 3 and 4 if we
start counting at 1) for the first three rows, like this:

```python
surveys_df.iloc[0:3, 1:4]
```


**Note**: the order of selection is ROW followed by COLUMN

In [26]:
surveys_df.head()

Unnamed: 0,record_id,month,day,year,plot,species,sex,wgt
0,1,7,16,1977,2,,M,
1,2,7,16,1977,3,,M,
2,3,7,16,1977,2,DM,F,
3,4,7,16,1977,7,DM,M,
4,5,7,16,1977,3,DM,M,


In [27]:
surveys_df.iloc[0:3, 1:4]

Unnamed: 0,month,day,year
0,7,16,1977
1,7,16,1977
2,7,16,1977


Notice that we asked for a slice from 0:3. This yielded 3 rows of data. When you
ask for 0:3, you are actually telling python to start at index 0 and select rows
0, 1, 2 **up to but not including 3**.


Let's next explore some other ways to index and select subsets of data:

In [28]:
# select all columns for rows of index values 0 and 10
surveys_df.loc[[0, 10], :]

Unnamed: 0,record_id,month,day,year,plot,species,sex,wgt
0,1,7,16,1977,2,,M,
10,11,7,16,1977,5,DS,F,


In [29]:
# what does this do?
surveys_df.loc[0:4, 'plot' : 'wgt']


Unnamed: 0,plot,species,sex,wgt
0,2,,M,
1,3,,M,
2,2,DM,F,
3,7,DM,M,
4,3,DM,M,


In [30]:
# What happens when you type the code below?
surveys_df.iloc[[0, 10, 45549], :]


IndexError: positional indexers are out-of-bounds

NOTE: Labels (.loc) must be found in the DataFrame or you will get a `KeyError`. The same is true for indices (.iloc), you will get an `IndexError` otherwise.
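
A small sketch of the two error types, using a made-up three-row table. The label and position `99` are arbitrary values chosen because they do not exist in `toy_df`.

```python
import pandas as pd

# Hypothetical toy table with index labels 0, 1, 2
toy_df = pd.DataFrame({"wgt": [40.0, 48.0, 29.0]})

try:
    toy_df.loc[[0, 99], :]       # label 99 does not exist
except KeyError as err:
    loc_error = type(err).__name__

try:
    toy_df.iloc[[0, 99], :]      # position 99 is past the end of the table
except IndexError as err:
    iloc_error = type(err).__name__

print(loc_error, iloc_error)
```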

When using `loc`, the start bound and the stop bound are **included** and integers
*can* be labels (e.g. row numbers), but they refer to the **index label** and not the position. Thus
when you use `loc` to select rows with index labels 1:4, you will get a different result than using `iloc` to select rows 1:4.
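
To see the difference in bounds, the same slice `1:4` can be applied with both indexers. This is an illustrative sketch on a made-up six-row table, not on the survey data.

```python
import pandas as pd

# Hypothetical toy table with the default index labels 0..5
toy_df = pd.DataFrame({"wgt": [40.0, 48.0, 29.0, 46.0, 36.0, 52.0]})

by_label = toy_df.loc[1:4]      # labels 1, 2, 3 and 4: stop bound included
by_position = toy_df.iloc[1:4]  # positions 1, 2 and 3: stop bound excluded

print(by_label.index.tolist())
print(by_position.index.tolist())
```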

We can also select a specific data value according to the specific row and
column location within the data frame using the `iloc` indexer:
`dat.iloc[row, column]`.


```python
surveys_df.iloc[2,6]
```

which gives **output**

```
'F'
```

Remember that Python indexing begins at 0. So, the index location [2, 6] selects
the element that is 3 rows down and 7 columns over in the DataFrame.

## Challenge Activities

1. What happens when you type:
    - `surveys_df[0:3]`
    - `surveys_df[:5]`
    - `surveys_df[-1:]`

2. What happens when you call:
    - `surveys_df.iloc[0:4, 1:4]`
    - `surveys_df.loc[0:4, 1:4]`
    - How are the two commands different?

In [31]:
surveys_df.iloc[0:4, 1:4]

Unnamed: 0,month,day,year
0,7,16,1977
1,7,16,1977
2,7,16,1977
3,7,16,1977


In [32]:
surveys_df.loc[0:4, 1:4]


TypeError: cannot do slice indexing on Index with these indexers [1] of type int

# Using Masks

A mask can be useful to locate where a particular subset of values exist or
don't exist - for example, NaN, or "Not a Number" values. To understand masks,
we also need to understand Boolean objects in Python.

A Boolean value is either `True` or `False`. So for example

```python
# set x to 5
x = 5
# what does the code below return?
x > 5
# how about this?
x == 5
```

In [33]:
# Useful magic to check the variables created throughout this session
%whos

Variable       Type         Data/Info
-------------------------------------
a              list         n=5
avrg_wgt       Series       plot\n1     51.822911\n2 <...>Name: wgt, dtype: float64
os             module       <module 'os' from '/usr/lib/python3.7/os.py'>
pd             module       <module 'pandas' from '/h<...>ages/pandas/__init__.py'>
plot_names     ndarray      24: 24 elems, type `int64`, 192 bytes
surveys_copy   DataFrame           record_id  month  <...>n[35549 rows x 8 columns]
surveys_df     DataFrame           record_id  month  <...>n[35549 rows x 8 columns]


In [None]:
# For the cells below, double check that the variables were created during the subsetting section earlier on.

In [34]:
# what happens when we ask whether the avrg_wgt Series is greater than 2?
avrg_wgt > 2

plot
1     True
2     True
3     True
4     True
5     True
6     True
7     True
8     True
9     True
10    True
11    True
12    True
13    True
14    True
15    True
16    True
17    True
18    True
19    True
20    True
21    True
22    True
23    True
24    True
Name: wgt, dtype: bool

In [35]:
# what happens when we ask whether the plot_names are greater than 5?
plot_names > 5

array([False, False,  True, False,  True, False,  True, False,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True])

When we ask Python what the value of `x > 5` is, we get `False`. This is because `x`
is not greater than 5; it is equal to 5. To create a boolean mask, you first define the
True / False criterion (e.g. values > 5 = True). Python will then assess each
value in the object to determine whether the value meets the criterion (True) or
not (False). Python creates an output object that is the same shape as
the original object, but with a True or False value for each index location.
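
As a minimal sketch of this idea, the made-up Series `weights` below is compared against 5 to build a mask, and the mask is then used to keep only the matching entries. The values are invented for illustration.

```python
import pandas as pd

# Hypothetical weights (not taken from the survey data)
weights = pd.Series([4.0, 7.5, 12.0, 3.0], name="wgt")

mask = weights > 5      # Series of True/False values, same shape as weights
heavy = weights[mask]   # keeps only the entries where the mask is True

print(mask.tolist())
print(heavy.tolist())
```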


### Logical evaluators
You can use the syntax below when querying data from a DataFrame. Experiment
with selecting various subsets of the "surveys" data.

* Equals: `==`
* Not equals: `!=`
* Greater than, less than: `>` or `<`
* Greater than or equal to: `>=`
* Less than or equal to: `<=`

Let's try this out. Let's identify all locations in the survey data that have
null (missing or NaN) data values. We can use the `isnull` method to do this.
Each cell with a null value will be assigned a value of  `True` in the new
boolean object.

In [None]:
pd.isnull(surveys_df)

To select the rows where there are null values,  we can use 
the mask as an index to subset our data as follows:

```python
#To select just the rows with NaN values, we can use the .any method
surveys_df[pd.isnull(surveys_df).any(axis=1)]
```

Note that there are many null or NaN values in the `wgt` column of our DataFrame.
We will explore different ways of dealing with these in Lesson 03.

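We can also combine the mask with a column selection. What does the code below do?

```python
# what does this do?
empty_weights = surveys_df[pd.isnull(surveys_df).any(axis=1)]['wgt']
```

Let's take a minute to look at the statement above. 

We are using the Boolean object as an index. 
We are asking Python to select the rows that have a `NaN` value in at least one
column, and then to keep only the `wgt` column of those rows. To select only the
rows where the weight itself is missing, build the mask from that column instead:
`surveys_df[pd.isnull(surveys_df['wgt'])]`.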
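
The sketch below contrasts a mask built from the whole table with a mask built from a single column. `toy_df` is a made-up three-row table with one missing species and one missing weight.

```python
import numpy as np
import pandas as pd

# Hypothetical toy table: row 1 lacks a species, row 2 lacks a weight
toy_df = pd.DataFrame({"species": ["DM", None, "PF"],
                       "wgt": [40.0, 48.0, np.nan]})

# rows with a missing value in ANY column
any_null_rows = toy_df[pd.isnull(toy_df).any(axis=1)]

# rows where the wgt column specifically is missing
wgt_null_rows = toy_df[pd.isnull(toy_df["wgt"])]

print(any_null_rows.index.tolist())
print(wgt_null_rows.index.tolist())
```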

---

We can also select a subset of our data using criteria. For example, we can
select all rows that have a year value of 2002.

In [None]:
surveys_df[surveys_df.year == 2002]

In [None]:
# Or we can select all rows that do not contain the year 2002.
surveys_df[surveys_df.year != 2002]

In [None]:
# We can define sets of criteria too:
surveys_df[(surveys_df.year >= 1980) & (surveys_df.year <= 1985)]

Another way to extract subsets using multiple criteria is the `.query()` method:

In [None]:
surveys_df.query(' year >= 1980 & year <= 1985')
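
To check that the two approaches select the same rows, they can be compared on a small made-up table. This is a sketch under the assumption that a `year` column exists, as it does in the survey data.

```python
import pandas as pd

# Hypothetical toy table (values are invented)
toy_df = pd.DataFrame({"year": [1979, 1980, 1983, 1985, 1990],
                       "wgt": [40.0, 48.0, 29.0, 46.0, 36.0]})

# boolean masks: each comparison needs its own parentheses around it
with_mask = toy_df[(toy_df.year >= 1980) & (toy_df.year <= 1985)]

# the same criteria written as a query string
with_query = toy_df.query("year >= 1980 and year <= 1985")

print(with_mask.equals(with_query))
```

The parentheses in the mask version matter: `&` binds more tightly than `>=` and `<=`, so leaving them out changes how the expression is evaluated.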

### Challenge Activities

1. Select a subset of rows in the `surveys_df` DataFrame that contain data from
   the year 1999 and that contain weight values less than or equal to 8. How
   many rows did you end up with? What did your neighbor get?
2. You can use the `isin` command in python to query a DataFrame based upon a
   list of values as follows:
   `surveys_df[surveys_df['species'].isin([listGoesHere])]`. Use the `isin` function
   to find all plots that contain particular species in
   the surveys DataFrame. How many records contain these values?
3. Experiment with other queries. Create a query that finds all rows with a weight value greater than or equal to 0.
4. The `~` symbol in Python can be used to return the OPPOSITE of the selection that you specify.
   It is equivalent to **is not in**. Write a query that selects all rows where sex is NOT equal to 'M' or 'F' in the surveys
   data.

In [None]:
# number of unique plot id where species are found


In [None]:
# total number of records where species are found


## Concatenating

We can use the `concat` function in Pandas to append either columns or rows from
one DataFrame to another.  Let's grab two subsets of our data to see how this
works.


In [36]:
# read in first 10 lines of surveys table
survey_sub=surveys_df.head(10)
survey_sub

Unnamed: 0,record_id,month,day,year,plot,species,sex,wgt
0,1,7,16,1977,2,,M,
1,2,7,16,1977,3,,M,
2,3,7,16,1977,2,DM,F,
3,4,7,16,1977,7,DM,M,
4,5,7,16,1977,3,DM,M,
5,6,7,16,1977,1,PF,M,
6,7,7,16,1977,2,PE,F,
7,8,7,16,1977,1,DM,M,
8,9,7,16,1977,1,DM,F,
9,10,7,16,1977,6,PF,F,


In [37]:
# grab the last 10 rows (minus the last one)
survey_sub_last10 = surveys_df[-11:-1]
survey_sub_last10

Unnamed: 0,record_id,month,day,year,plot,species,sex,wgt
35538,35539,12,31,2002,15,SF,M,68.0
35539,35540,12,31,2002,15,PB,F,23.0
35540,35541,12,31,2002,15,PB,F,31.0
35541,35542,12,31,2002,15,PB,F,29.0
35542,35543,12,31,2002,15,PB,F,34.0
35543,35544,12,31,2002,15,US,,
35544,35545,12,31,2002,15,AH,,
35545,35546,12,31,2002,15,AH,,
35546,35547,12,31,2002,10,RM,F,14.0
35547,35548,12,31,2002,7,DO,M,51.0


In [38]:
# reset the index values so the second dataframe appends properly
# drop=True option avoids adding new index column with
# old index values
survey_sub_last10 = survey_sub_last10.reset_index(drop=True)
survey_sub_last10

Unnamed: 0,record_id,month,day,year,plot,species,sex,wgt
0,35539,12,31,2002,15,SF,M,68.0
1,35540,12,31,2002,15,PB,F,23.0
2,35541,12,31,2002,15,PB,F,31.0
3,35542,12,31,2002,15,PB,F,29.0
4,35543,12,31,2002,15,PB,F,34.0
5,35544,12,31,2002,15,US,,
6,35545,12,31,2002,15,AH,,
7,35546,12,31,2002,15,AH,,
8,35547,12,31,2002,10,RM,F,14.0
9,35548,12,31,2002,7,DO,M,51.0


When we concatenate DataFrames, we need to specify the axis. `axis=0` tells
Pandas to stack the second DataFrame under the first one. It will automatically
detect whether the column names are the same and will stack accordingly.
`axis=1` will stack the columns in the second DataFrame to the RIGHT of the
first DataFrame. To stack the data vertically, we need to make sure we have the
same columns and associated column format in both datasets. When we stack
horizontally, we want to make sure what we are doing makes sense (i.e. the data are
related in some way).
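
The effect of the two axes is easiest to see from the resulting shapes. The sketch below uses two made-up two-row tables, `top` and `bottom`, which are names chosen here for illustration.

```python
import pandas as pd

# Two hypothetical tables with the same columns
top = pd.DataFrame({"record_id": [1, 2], "wgt": [40.0, 48.0]})
bottom = pd.DataFrame({"record_id": [3, 4], "wgt": [29.0, 46.0]})

stacked = pd.concat([top, bottom], axis=0)       # rows of bottom go under top
side_by_side = pd.concat([top, bottom], axis=1)  # columns of bottom go to the right

print(stacked.shape)           # (rows, columns)
print(side_by_side.shape)
print(stacked.index.tolist())  # the original row labels are kept
```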


In [39]:
# stack the DataFrames on top of each other
vertical_stack = pd.concat([survey_sub, survey_sub_last10], axis = 0)
vertical_stack

Unnamed: 0,record_id,month,day,year,plot,species,sex,wgt
0,1,7,16,1977,2,,M,
1,2,7,16,1977,3,,M,
2,3,7,16,1977,2,DM,F,
3,4,7,16,1977,7,DM,M,
4,5,7,16,1977,3,DM,M,
5,6,7,16,1977,1,PF,M,
6,7,7,16,1977,2,PE,F,
7,8,7,16,1977,1,DM,M,
8,9,7,16,1977,1,DM,F,
9,10,7,16,1977,6,PF,F,


In [40]:
# place the DataFrames side by side
horizontal_stack = pd.concat([survey_sub, survey_sub_last10], axis = 1)
horizontal_stack

Unnamed: 0,record_id,month,day,year,plot,species,sex,wgt,record_id.1,month.1,day.1,year.1,plot.1,species.1,sex.1,wgt.1
0,1,7,16,1977,2,,M,,35539,12,31,2002,15,SF,M,68.0
1,2,7,16,1977,3,,M,,35540,12,31,2002,15,PB,F,23.0
2,3,7,16,1977,2,DM,F,,35541,12,31,2002,15,PB,F,31.0
3,4,7,16,1977,7,DM,M,,35542,12,31,2002,15,PB,F,29.0
4,5,7,16,1977,3,DM,M,,35543,12,31,2002,15,PB,F,34.0
5,6,7,16,1977,1,PF,M,,35544,12,31,2002,15,US,,
6,7,7,16,1977,2,PE,F,,35545,12,31,2002,15,AH,,
7,8,7,16,1977,1,DM,M,,35546,12,31,2002,15,AH,,
8,9,7,16,1977,1,DM,F,,35547,12,31,2002,10,RM,F,14.0
9,10,7,16,1977,6,PF,F,,35548,12,31,2002,7,DO,M,51.0


In [41]:
horizontal_stack['wgt']

Unnamed: 0,wgt,wgt.1
0,,68.0
1,,23.0
2,,31.0
3,,29.0
4,,34.0
5,,
6,,
7,,
8,,14.0
9,,51.0


Notice anything unusual about the `vertical_stack`?

The row indexes for the two data frames `survey_sub` and `survey_sub_last10`
have been repeated. We can reindex the new dataframe using the `reset_index()` method.

In [42]:
vertical_stack = vertical_stack.reset_index()
vertical_stack

Unnamed: 0,index,record_id,month,day,year,plot,species,sex,wgt
0,0,1,7,16,1977,2,,M,
1,1,2,7,16,1977,3,,M,
2,2,3,7,16,1977,2,DM,F,
3,3,4,7,16,1977,7,DM,M,
4,4,5,7,16,1977,3,DM,M,
5,5,6,7,16,1977,1,PF,M,
6,6,7,7,16,1977,2,PE,F,
7,7,8,7,16,1977,1,DM,M,
8,8,9,7,16,1977,1,DM,F,
9,9,10,7,16,1977,6,PF,F,
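
Calling `reset_index()` without arguments keeps the old labels in a new `index` column, as seen above. If the old labels are not needed, `drop=True` discards them, and `pd.concat` can also renumber the rows itself with `ignore_index=True`. A minimal sketch on made-up tables:

```python
import pandas as pd

# Two hypothetical tables with the same columns
top = pd.DataFrame({"record_id": [1, 2], "wgt": [40.0, 48.0]})
bottom = pd.DataFrame({"record_id": [3, 4], "wgt": [29.0, 46.0]})

stacked = pd.concat([top, bottom], axis=0)

# option 1: renumber afterwards and throw the old labels away
tidy = stacked.reset_index(drop=True)

# option 2: ask concat to renumber while stacking
also_tidy = pd.concat([top, bottom], axis=0, ignore_index=True)

print(tidy.index.tolist())
print(tidy.columns.tolist())
```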


## Writing Out Data to CSV

We can use the `to_csv` command to export a DataFrame in CSV format. Note that a bare file name
such as `'out.csv'` saves the data into the current working directory. We can
save it to a different folder by adding the folder name and a slash before the file name:
`vertical_stack.to_csv('foldername/out.csv')`.

```python
# Write DataFrame to CSV 
vertical_stack.to_csv('data/out.csv')
```

Check out your working directory to make sure the CSV wrote out properly, and
that you can open it! If you want, try to bring it back into python to make sure
it imports properly.

```python
# let's read our output back into python and make sure all looks good
new_output = pd.read_csv('data/out.csv', keep_default_na=False, na_values=[""])
```
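
By default `to_csv` also writes the row index as an extra, unnamed first column, which shows up as `Unnamed: 0` when the file is read back. The sketch below is one way to avoid that, using `index=False`. It writes a made-up table to a temporary folder so that nothing in your working directory is touched; the file name `out.csv` is only an example.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy table with one missing species
toy_df = pd.DataFrame({"record_id": [1, 2, 3], "species": ["DM", None, "PF"]})

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "out.csv")
    toy_df.to_csv(out_path, index=False)   # index=False: do not write the row labels
    round_trip = pd.read_csv(out_path, keep_default_na=False, na_values=[""])

print(round_trip.columns.tolist())
print(round_trip["species"].isnull().sum())   # empty fields come back as NaN
```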

In [43]:
vertical_stack.to_csv('data/out.csv')

# Automating data processing using For Loops

So far, we've used Python and the pandas library to explore and manipulate
individual datasets by hand, much like we would do in a spreadsheet. The beauty
of using a programming language like Python, though, comes from the ability to
automate data processing through the use of loops and functions.

## For loops

Loops allow us to repeat a workflow (or series of actions) a given number of
times or while some condition is true. We would use a loop to automatically
process data that's stored in multiple files (daily values with one file per
year, for example). Loops lighten our work load by performing repeated tasks
without our direct involvement and make it less likely that we'll introduce
errors by making mistakes while processing each file by hand.

Let's write a simple for loop that simulates what a kid might see during a
visit to the zoo:

In [None]:
animals = ['lion','tiger','crocodile','vulture','hippo']
print(animals)

In [None]:
for creature in animals:
    print(creature)

The line defining the loop must start with `for` and end with a colon, and the
body of the loop must be indented.

In this example, `creature` is the loop variable that takes the value of the next
entry in `animals` every time the loop goes around. We can call the loop variable
anything we like. After the loop finishes, the loop variable will still exist
and will have the value of the last entry in the collection:

In [None]:
for i in range(0,5):
    print(animals[i])
    print(i)

In [None]:
for creature in animals:
    pass

In [None]:
creature

We are not asking python to print the value of the loop variable anymore, but
the for loop still runs and the value of `creature` changes on each pass through
the loop. The statement `pass` in the body of the loop just means "do nothing".
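
When both the position and the value are needed, as in the `range(0,5)` cell above, Python's built-in `enumerate` provides them together. This is offered as an alternative sketch, not as the approach the lesson relies on; the name `position` is chosen here for illustration.

```python
animals = ['lion', 'tiger', 'crocodile', 'vulture', 'hippo']

# enumerate yields (position, item) pairs, starting from 0
for position, creature in enumerate(animals):
    print(position, creature)

# both loop variables still exist after the loop and hold their last values
print(position, creature)
```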


---

The file we've been using so far, `surveys.csv`, contains 26 years of data (1977 through 2002) and is
very large. We would like to separate the data for each year into a separate
file.

Let's start by making a new directory inside the folder `data` to store all of
these files using the module `os`:

In [50]:
import os
os.getcwd()

In [51]:
os.mkdir('yearly_files')

The command `os.mkdir` is equivalent to `mkdir` in the shell. Just so we are
sure, we can check that the new directory was created within the `data` folder:

In [62]:
os.listdir('./')

['out.csv',
 'environment.yml',
 'LICENSE',
 'notebooks',
 'data',
 'README.md',
 '.git',
 '.gitignore']

The command `os.listdir` is equivalent to `ls` in the shell.
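
Note that `os.mkdir` raises a `FileExistsError` if the folder is already there, which happens easily when a notebook cell is run twice. One possible way around this is `os.makedirs` with `exist_ok=True`. The sketch below works inside a temporary folder so it does not touch your `data` directory.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as base_dir:
    target = os.path.join(base_dir, "yearly_files")
    os.makedirs(target, exist_ok=True)   # creates the folder
    os.makedirs(target, exist_ok=True)   # running it again is not an error
    created = os.path.isdir(target)
    listing = os.listdir(base_dir)

print(created, listing)
```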

---


Previously, we saw how to use the library pandas to load the survey
data into memory as a DataFrame, how to select a subset of the data using some
criteria, and how to write the DataFrame into a csv file. Let's write a script
that performs those three steps in sequence for the year 2002:

```python
import pandas as pd

# Load the data into a DataFrame
surveys_df = pd.read_csv('data/surveys.csv')

# Select only data for 2002
surveys2002 = surveys_df[surveys_df.year == 2002]

# Write the new DataFrame to a csv file
surveys2002.to_csv('data/yearly_files/surveys2002.csv')
```

To **create yearly data files**, we could repeat the last two commands over and
over, once for each year of data. Repeating code is neither elegant nor
practical, and is very likely to introduce errors into your code. **We want to
turn what we've just written into a loop** that repeats the last two commands for
every year in the dataset.

Let's start by writing a loop that simply prints the names of the files we want
to create - the dataset we are using covers 1977 through 2002, and we'll create
a separate file for each of those years. Listing the filenames is a good way to
confirm that the loop is behaving as we expect.


We have seen that we can loop over a list of items, so we need a list of years 
to loop over. We can get the *unique* years in our DataFrame with:

In [48]:
surveys_df['year'].unique()

array([1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985, 1986, 1987,
       1988, 1989, 1990, 1991, 1992, 1993, 1994, 1995, 1996, 1997, 1998,
       1999, 2000, 2001, 2002])

Putting this into our for loop we get

In [57]:
for year in surveys_df.year.unique():
    # creating filename
    filename = 'yearly_files/surveys_year' + str(year) + '.csv'
    print(filename)


data/yearly_files/surveys_year1977.csv
data/yearly_files/surveys_year1978.csv
data/yearly_files/surveys_year1979.csv
data/yearly_files/surveys_year1980.csv
data/yearly_files/surveys_year1981.csv
data/yearly_files/surveys_year1982.csv
data/yearly_files/surveys_year1983.csv
data/yearly_files/surveys_year1984.csv
data/yearly_files/surveys_year1985.csv
data/yearly_files/surveys_year1986.csv
data/yearly_files/surveys_year1987.csv
data/yearly_files/surveys_year1988.csv
data/yearly_files/surveys_year1989.csv
data/yearly_files/surveys_year1990.csv
data/yearly_files/surveys_year1991.csv
data/yearly_files/surveys_year1992.csv
data/yearly_files/surveys_year1993.csv
data/yearly_files/surveys_year1994.csv
data/yearly_files/surveys_year1995.csv
data/yearly_files/surveys_year1996.csv
data/yearly_files/surveys_year1997.csv
data/yearly_files/surveys_year1998.csv
data/yearly_files/surveys_year1999.csv
data/yearly_files/surveys_year2000.csv
data/yearly_files/surveys_year2001.csv
data/yearly_files/surveys

Notice that we use single quotes to add text strings. The variable is not
surrounded by quotes. For the year 2002, this code produces a string ending in
`yearly_files/surveys_year2002.csv`, which contains the path to the new file
AND the file name itself.
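
The same file name can be built in a few equivalent ways. The sketch below compares string concatenation with an f-string and with `os.path.join`, which inserts the path separator that suits the operating system. The variable names are chosen here for illustration.

```python
import os

year = 2002

by_concatenation = 'yearly_files/surveys_year' + str(year) + '.csv'
by_fstring = f'yearly_files/surveys_year{year}.csv'   # {year} is converted to text
by_join = os.path.join('yearly_files', f'surveys_year{year}.csv')

print(by_concatenation)
print(by_fstring)
print(by_join)
```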

We can now add the rest of the steps we need to create separate text files.
Once finished look inside the `yearly_files` directory and check a couple of the files you
just created to confirm that everything worked as expected.

In [59]:
for year in surveys_df.year.unique():
    #creating filename
    filename =  'yearly_files/surveys_year' + str(year) + '.csv'
    # extracting data of a specific year
    surveys_year = surveys_df[surveys_df.year == year]
    # writing to file
    surveys_year.to_csv(filename)

In [61]:
os.listdir('yearly_files/')

['surveys_year1986.csv',
 'surveys_year1996.csv',
 'surveys_year1999.csv',
 'surveys_year2000.csv',
 'surveys_year1994.csv',
 'surveys_year1981.csv',
 'surveys_year1995.csv',
 'surveys_year1984.csv',
 'surveys_year1979.csv',
 'surveys_year1977.csv',
 'surveys_year1988.csv',
 'surveys_year1989.csv',
 'surveys_year1990.csv',
 'surveys_year1997.csv',
 'surveys_year1987.csv',
 'surveys_year1985.csv',
 'surveys_year1993.csv',
 'surveys_year1982.csv',
 'surveys_year1980.csv',
 'surveys_year1978.csv',
 'surveys_year1983.csv',
 'surveys_year2001.csv',
 'surveys_year1992.csv',
 'surveys_year1991.csv',
 'surveys_year1998.csv',
 'surveys_year2002.csv']
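
The order returned by `os.listdir` is arbitrary, so wrapping it in `sorted()` makes the listing easier to check. As an alternative to filtering the DataFrame once per year, pandas `groupby` can hand over each year's rows directly. The sketch below is a different technique from the loop above, shown on a made-up table and a temporary folder.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy table covering three years
toy_df = pd.DataFrame({"year": [1977, 1977, 1978, 1979],
                       "wgt": [40.0, 48.0, 29.0, 46.0]})

with tempfile.TemporaryDirectory() as out_dir:
    # groupby yields one (year, rows-for-that-year) pair per unique year
    for year, surveys_year in toy_df.groupby("year"):
        filename = os.path.join(out_dir, f"surveys_year{year}.csv")
        surveys_year.to_csv(filename, index=False)
    written = sorted(os.listdir(out_dir))

print(written)
```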


### Challenge




   1. What happens if there is no data for a year in the sequence (for example, imagine we had used 1976 as the start year in range)?

   2. Let's say you only want to look at data from a given multiple of years. How would you modify your loop in order to generate a data file for only every 5th year, starting from 1977? Hint: you will need to use `range` to specify the list of numbers.

      ```python
      range(start, stop, step)
      ```

   3. Instead of splitting out the data by years, a colleague wants to analyse each species separately. How would you write a unique csv file for each species?


In [None]:
# what do we get returned for a year that does not exist? 
surveys_df[(surveys_df['year']== 1976)]

In [None]:
# only save data for every 5th year using range
for year in range(surveys_df.year.min(),surveys_df.year.max()+1,5):
    #creating filename
    filename = 'yearly_files/5yeardata_' + str(year) + '.csv'
    # extracting data of a specific year
    surveys_year = surveys_df[surveys_df.year == year]
    # writing to file
    surveys_year.to_csv(filename)

In [None]:
os.listdir('yearly_files/')

In [None]:
#find the unique species
surveys_df.species.dropna().unique()

In [None]:
# create the new folder for the species data
os.mkdir('species')

In [None]:
# save data files for each species, 
# Caution: skip the nan
for species in surveys_df.species.dropna().unique():
    #creating filename
    filename = 'species/species_' + species + '.csv'
    # extracting data of a specific species
    surveys_species = surveys_df[surveys_df.species == species]
    # writing to file
    surveys_species.to_csv(filename)

In [None]:
os.listdir('species/')

## Building reusable and modular code with functions

Suppose that separating large data files into individual yearly files is a task
that we frequently have to perform. We could write a **for loop** like the one above
every time we needed to do it but that would be time consuming and error prone.
A more elegant solution would be to create a reusable tool that performs this
task with minimum input from the user. To do this, we are going to turn the code
we've already written into a function.

Functions are reusable, self-contained pieces of code that are called with a
single command. They can be designed to accept arguments as input and return
values, but they don't need to do either. Variables declared inside functions
only exist while the function is running and if a variable within the function
(a local variable) has the same name as a variable somewhere else in the code,
the local variable hides but doesn't overwrite the other.

Many of the commands we have used in Python (for example, `print`) are functions, and the
libraries we import (say, `pandas`) are largely collections of functions. We will only
use functions that are housed within the same code that uses them, but it's also
easy to write functions that can be used by different programs.

Functions are declared following this general structure:

```python
def this_is_the_function_name(input_argument1, input_argument2):

    # The body of the function is indented
    # This function prints the two arguments to screen
    print('The function arguments are:', input_argument1, input_argument2, '(this is done inside the function!)')

    # And returns their product
    return input_argument1 * input_argument2
```

The function declaration starts with the word `def`, followed by the function
name and any arguments in parentheses, and ends in a colon. The body of the
function is indented just like loops are. If the function returns something when
it is called, it includes a return statement at the end.

In [None]:
#let's define this function
def this_is_the_function_name(input_argument1, input_argument2):

    # The body of the function is indented
    # This function prints the two arguments to screen
    print('The function arguments are:', input_argument1, input_argument2, '(this is done inside the function!)')

    # And returns their product
    return input_argument1 * input_argument2

In [None]:
#and now let's call the function:
this_is_the_function_name(5,2)

In [None]:
this_is_the_function_name(input_argument2=5, input_argument1=2)

### Challenge:

1. Try calling the function by giving it the wrong number of arguments (not 2)
2. Declare a variable inside the function and test to see where it exists (Hint:
   can you print it from outside the function?)
3. Explore what happens when a variable inside the function and a variable
   outside it have the same name. What happens to the global variable when you
   change the value of the local variable?

In [None]:
# try giving only 1 or maybe 3 inputs

In [None]:
def this_other_function(in1, in2=74646):
    new_variable = 3
    print(new_variable, in1, in2)
    return

In [None]:
this_other_function(1)

In [None]:
this_other_function(1,2)

In [None]:
new_variable

In [None]:
new_variable = 5
print(new_variable)
this_other_function(1,2)
print(new_variable)
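
The scoping behaviour explored in this challenge can be summarised in a short sketch. The names `show_scope` and `counter` are made up for this illustration and are not part of the lesson's code. The point is only that assigning to a name inside a function creates a local variable and leaves the global one untouched.

```python
# Hypothetical names, for illustration only
counter = 5  # global variable


def show_scope():
    # Assigning here creates a new local variable that hides the global one
    counter = 3
    return counter


local_value = show_scope()
print('inside the function:', local_value)  # the local value
print('outside the function:', counter)     # the global is unchanged
```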

---

We can now turn our code for saving yearly data files into a function. There are
many different "chunks" of this code that we can turn into functions, and we can
even create functions that call other functions inside them. Let's first write a
function that separates data for just one year and saves that data to a file:

```python
def one_year_csv_writer(this_year, all_data):
    """
    Writes a csv file for data from a given year.

    this_year --- year for which data is extracted
    all_data --- DataFrame with multi-year data
    """

    # Select data for the year
    surveys_year = all_data[all_data.year == this_year]

    # Write the new DataFrame to a csv file
    filename = 'yearly_files/function_surveys_year' + str(this_year) + '.csv'
    surveys_year.to_csv(filename)
```

In [None]:
def one_year_csv_writer(this_year, all_data):
    """
    Writes a csv file for data from a given year.

    this_year --- year for which data is extracted
    all_data --- DataFrame with multi-year data
    """

    # Select data for the year
    surveys_year = all_data[all_data.year == this_year]

    # Write the new DataFrame to a csv file
    filename = 'yearly_files/function_surveys_year' + str(this_year) + '.csv'
    surveys_year.to_csv(filename)

The text between the two sets of triple double quotes is called a docstring and
contains the documentation for the function. It does nothing when the function
is running and is therefore not necessary, but it is good practice to include
docstrings as a reminder of what the code does. Docstrings in functions also
become part of their 'official' documentation:

In [63]:
# find help on the built-in function sum
sum?
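
The `?` syntax is specific to IPython and Jupyter. In plain Python the same docstring is available through the built-in `help` function (for example `help(sum)`), the `__doc__` attribute, and `inspect.getdoc`. A minimal sketch, using a made-up function `double` that is not part of the lesson:

```python
import inspect


def double(value):
    """Return twice the given value."""
    return value * 2


# The docstring is stored on the function object itself
print(double.__doc__)

# inspect.getdoc returns the same text with indentation cleaned up
print(inspect.getdoc(double))
```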

In [None]:
# find help on your function one_year_csv_writer

In [None]:
one_year_csv_writer(2002,surveys_df)

In [None]:
os.listdir('yearly_files/')

We changed the root of the name of the csv file so we can distinguish it from
the one we wrote before. Check the `yearly_files` directory for the file. Did it
do what you expect?

---

What we really want to do, though, is **create files for multiple years without
having to request them one by one**. Let's write another function that replaces
the entire `for` loop by simply looping through a sequence of years and repeatedly
calling the function we just wrote, `one_year_csv_writer`:


```python
def yearly_data_csv_writer(start_year, end_year, all_data):
    """
    Writes separate csv files for each year of data.

    start_year --- the first year of data we want
    end_year --- the last year of data we want
    all_data --- DataFrame with multi-year data
    """

    # "end_year" is the last year of data we want to pull, so we loop to end_year+1
    for year in range(start_year, end_year+1):
        one_year_csv_writer(year, all_data)
```

In [None]:
def yearly_data_csv_writer(start_year, end_year, all_data):
    """
    Writes separate csv files for each year of data.

    start_year --- the first year of data we want
    end_year --- the last year of data we want
    all_data --- DataFrame with multi-year data
    """

    # "end_year" is the last year of data we want to pull, so we loop to end_year+1
    for year in range(start_year, end_year+1):
        one_year_csv_writer(year, all_data)

People will naturally expect `end_year` to be the last year for which a file is
written, so the `for` loop inside the function runs over `range(start_year, end_year + 1)`.
The `+ 1` is needed because `range()` does not include its last number. Try it for yourself:

In [None]:
list(range(5))
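
Applied to years, the same rule explains the `+ 1` in the function. A short sketch, with example values chosen only for illustration:

```python
start_year = 1980
end_year = 1983

# Without the + 1 the final year is left out
print(list(range(start_year, end_year)))      # [1980, 1981, 1982]

# With the + 1 the final year is included
print(list(range(start_year, end_year + 1)))  # [1980, 1981, 1982, 1983]
```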

By writing the entire loop into a function, we've made a reusable tool for whenever
we need to break a large data file into yearly files. Because we can specify the
first and last year for which we want files, we can even use this function to
create files for a subset of the years available. This is how we call this
function:

In [None]:
yearly_data_csv_writer(1980,1990,surveys_df)

In [None]:
os.listdir('yearly_files/')

**BEWARE!** If you are using Jupyter Notebooks and you modify a function, you MUST
re-run that cell in order for the changed function to be available to the rest
of the code. Nothing will visibly happen when you do this, though, because
simply defining a function without *calling* it doesn't produce an output. Any
cells that use the now-changed functions will also have to be re-run for their
output to change.

### Challenge:

1. **Add two arguments** to the functions we wrote that take the *path* of the
   directory where the files will be written and the *root* of the file name.
   Additionally, **add default values for all year inputs**. Note: rearrange the order of
   inputs so that arguments with default values are listed last.
   Create a new set of files with a different name in a different directory.
2. Make the functions **return a list** of the files they have written. There are
   many ways you can do this (and you should try them all!): either of the
   functions can print to screen, either can use a return statement to give back
   numbers or strings to their function call, or you can use some combination of
   the two. You could also try using the `os` library to list the contents of
   directories.

In [None]:
# adding two arguments 



In [None]:
# adding a list of filenames to be returned
# adding two arguments 

In [None]:
yearly_data_csv_writer(surveys_df, 'yearly_files/','test_')

--- 

But what if our dataset doesn't start in 1977 and end in 2002? We can modify the
function so that it looks for the start and end years in the dataset if those
dates are not provided:

```python
def yearly_data_arg_test(all_data, start_year = None, end_year = None):
    """
    Modified from yearly_data_csv_writer to test default argument values!

    start_year --- the first year of data we want --- default: None - check all_data
    end_year --- the last year of data we want --- default: None - check all_data
    all_data --- DataFrame with multi-year data
    """

    if not start_year:
        start_year = min(all_data.year)
    if not end_year:
        end_year = max(all_data.year)

    return start_year, end_year
```

In [64]:
# define function

In [None]:
yearly_data_arg_test(surveys_df, 1980, 1990)

In [None]:
# test function
yearly_data_arg_test(surveys_df)

The default values of the `start_year` and `end_year` arguments in the function
`yearly_data_arg_test` are now `None`. This is a built-in constant in Python
that indicates the absence of a value: the variable exists in the namespace of
the function (the directory of variable names), but it has not been given a
meaningful value.
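
`None` also turns up outside of default arguments. The sketch below uses a made-up function, `no_return`, purely for illustration: a function that does not return a value gives back `None`, and the usual way to test for it is `is None`.

```python
def no_return():
    # No return statement, so calling this function gives back None
    pass


result = no_return()

print(result)          # None
print(result is None)  # True
```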

The body of the test function now has two conditional blocks (**if statements**) that
check the values of `start_year` and `end_year`. An if statement executes its body
only when some condition is met.

`if statements` work like the boolean logic we saw earlier when we created masks to select our data.
They commonly look something like this:

```python
a = 5

if a<0: # meets first condition?

    # if a IS less than zero
    print('a is a negative number')

elif a>0: # did not meet first condition. meets second condition?

    # if a ISN'T less than zero and IS more than zero
    print('a is a positive number')

else: # met neither condition

    # if a ISN'T less than zero and ISN'T more than zero
    print('a must be zero!')
```

In [None]:
a = 0.0

if a<0: # meets first condition?

    # if a IS less than zero
    print('a is a negative number')

elif a>0: # did not meet first condition. meets second condition?

    # if a ISN'T less than zero and IS more than zero
    print('a is a positive number')

else: # met neither condition

    # if a ISN'T less than zero and ISN'T more than zero
    print('a must be zero!')

Change the value of `a` to see how this code works. The statement `elif`
means "else if", and all of the conditional statements must end in a colon.

The if statements in the function `yearly_data_arg_test` check whether a value has been
given for `start_year` and `end_year`. If those variables are `None`, the conditions
`not start_year` and `not end_year` evaluate to the boolean `True` and the body of each
if statement is executed. On the other hand, if the variable names are associated with
some value (they got a year in the function call), the conditions evaluate to `False`
and the bodies do not execute. The opposite conditional statements, which would be
`True` if the variables had received a value in the function call, would be
`if start_year` and `if end_year`.
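
These checks can be tried directly. The sketch below also shows one caveat that the survey years never trigger: `not value` is `True` for `0` as well as for `None`, so code that must accept `0` as a real value usually tests `value is None` instead. The variable names are illustrative only.

```python
# Each row holds: the value, the result of "not value", the result of "value is None"
results = [(value, not value, value is None) for value in (None, 1980, 0)]

for row in results:
    print(row)
```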

### Challenge:

1. Rewrite the `one_year_csv_writer` and `yearly_data_csv_writer` functions to use `None` as the default for the years.

2. The code below checks to see whether a directory exists and creates one if it
doesn't. Add some code to your function that writes out the CSV files, to check
for a directory to write to.

   ```python
   if 'dir_name_here' in os.listdir('.'):
       print('Processed directory exists')
   else:
       os.mkdir('dir_name_here')
       print('Processed directory created')
   ```

3. Modify the functions so that they don't create yearly files if there is no
data for a given year and display an alert to the user (Hint: use conditional
statements to do this.)
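
As an alternative to the directory check shown in the challenge, the `os` library can create a directory only when it is missing in a single call, `os.makedirs(..., exist_ok=True)`. This is a sketch rather than the lesson's solution. It works inside a temporary directory, and the name `processed` is a placeholder.

```python
import os
import tempfile

# Work in a throwaway directory so nothing is left behind
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, 'processed')  # placeholder directory name

    existed_before = os.path.isdir(target)   # not created yet
    os.makedirs(target, exist_ok=True)       # creates the directory

    # Calling it again does not raise an error, because exist_ok=True
    os.makedirs(target, exist_ok=True)
    exists_after = os.path.isdir(target)

print(existed_before, exists_after)
```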

In [None]:
# use none as default start/end years
# adding a list of filenames to be returned
# adding two arguments


In [None]:
# avoid making a new directory if it exists
# use none as default start/end years
# adding a list of filenames to be returned
# adding two arguments

In [None]:
#
# avoid making a new directory if it exists
# use none as default start/end years
# adding a list of filenames to be returned
# adding two arguments

In [None]:
one_year_csv_writer(all_data=surveys_df,directory='yearly_files/',file_root='final_test_', this_year=1976)

In [None]:
yearly_data_csv_writer(all_data=surveys_df,directory='test_creation/',file_root='final_test_', start_year=1977, end_year=1978)