# CIDS Carpentries Workshop - Day 1 - Part 4
This lesson is adapted from the Data Carpentries [Data Analysis and Visualization in Python for Ecologists](https://datacarpentry.org/python-ecology-lesson/index.html) lesson.

---
## How to use a Jupyter Notebook
Online Resources:
- https://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/index.html
- https://code.visualstudio.com/docs/datascience/jupyter-notebooks 

Useful Tips:
- To save the notebook/file, press <kbd>Ctrl</kbd> + <kbd>s</kbd> or go to `File -> Save`.
- You run a cell with <kbd>Shift</kbd> + <kbd>Enter</kbd> or
    - **Jupyter Notebook, JupyterLab**: you can use the run button ▶ in the tool bar.
    - **VScode**: you can use the run button ▶ in front of the cell.
- If you run a cell with <kbd>Option (Alt)</kbd> + <kbd>Enter</kbd> it will also create a new cell below.
- If you opened this as a classic notebook, you can check *Help > Keyboard Shortcuts*; otherwise see the *Cheatsheet* for more info.
- If you are using VScode, see [Jupyter Notebooks in VS Code](https://code.visualstudio.com/docs/datascience/jupyter-notebooks) for more info.
- The notebook has different types of cells (Code and Markdown are most commonly used): 
    - **Code** cells expect code for the Kernel you have chosen. Syntax highlighting is available, and comments start with `#`: anything after it on the same line will not be executed.
    - **Markdown** cells allow you to write report-style text, using markdown to format it (e.g. headers, bold face, etc.).
---

## ❓Questions and Objectives for this Notebook
What should you be able to answer by the end of this notebook?
### Questions

- What types of data can be contained in a DataFrame?
- Why is the data type important?

### Objectives

- Describe how information is stored in a Python DataFrame.
- Define the two main types of data in Python: text and numerics.
- Examine the structure of a DataFrame.
- Modify the format of values in a DataFrame.
- Describe how data types impact operations.
- Define, manipulate, and interconvert integers and floats in Python.
- Analyze datasets having missing/null values (NaN values).
- Write manipulated data to a file.



---
# Types of Data
How information is stored in a DataFrame or a Python object affects what we can do with it and the outputs of calculations as well. There are two main types of data that we will explore in this lesson: numeric and text data types.

# Numeric Data Types
Numeric data types include integers and floats. A **floating point** number (known as a float)  has decimal points even if that decimal point value is 0. For example: 1.13, 2.0, 1234.345. If we have a column that contains both integers and floating point numbers, Pandas will assign the entire column to the float data type so the decimal points are not lost.
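The promotion to float described above can be seen on a tiny example. This is a minimal sketch using made-up values (not taken from `surveys.csv`); the variable names are only for illustration.

```python
import pandas as pd

# Hypothetical toy values, not taken from surveys.csv
only_ints = pd.Series([1, 2, 3])
mixed = pd.Series([1, 2.5, 3])

print(only_ints.dtype)  # int64
print(mixed.dtype)      # float64
print(mixed.tolist())   # [1.0, 2.5, 3.0]
```

Note how the integers `1` and `3` become `1.0` and `3.0` once a single float is present in the column.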

An **integer** will never have a decimal point. Thus if we wanted to store 1.13 as an integer it would be stored as 1. Similarly, 1234.345 would be stored as 1234. You will often see the data type `int64` in pandas, which stands for 64-bit integer. The 64 refers to the number of bits of memory allocated to store each value, which determines the range of numbers that can be stored in each “cell”. Allocating space ahead of time allows computers to optimize storage and processing efficiency.
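To get a feel for what 64 bits buys us, we can ask NumPy (the library pandas builds on) for the limits of its 64-bit integer type. This is a small sketch for illustration; `np.iinfo` is a NumPy helper that is not otherwise used in this lesson.

```python
import numpy as np

# Ask NumPy for the properties of its 64-bit integer type
limits = np.iinfo(np.int64)

print(limits.bits)  # 64
print(limits.min)   # -9223372036854775808
print(limits.max)   # 9223372036854775807
```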

# Text Data Type
The text data type is known as a 'string' in Python, or 'object' in Pandas. Strings can contain numbers and / or characters. For example, a string might be a word, a sentence, or several sentences. A Pandas object might also be a plot name like ‘plot1’. A string can also consist entirely of numbers. For instance, ‘1234’ could be stored as a string, as could ‘10.23’. However, **strings that contain numbers cannot be used for mathematical operations** until they are converted to a numeric type!  
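The short sketch below shows what happens when we try to do maths with a string that looks like a number, and one possible way around it using the built-in `int()` and `float()` functions. The values are the ones mentioned above; the variable names are made up for illustration.

```python
text_number = "1234"

# Adding a number to a string raises a TypeError
try:
    text_number + 1
except TypeError as err:
    print("TypeError:", err)

# Multiplying a string repeats the text instead of doing arithmetic
repeated = text_number * 2
print(repeated)  # 12341234

# Convert the text to a number first, then do the maths
as_int = int(text_number)
as_float = float("10.23")
print(as_int + 1)    # 1235
print(as_float * 2)  # 20.46
```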

## Pandas vs Python
Pandas and base Python use slightly different names for data types. More on this is in the table below:

| Pandas Type  |  Native Python Type  | Description  |
|:---:|:---:|:---:|
| object  |  string  |  The most general dtype. Will be assigned to your column if column has mixed types (numbers and strings). |
|  int64  |  int |  Integer values. The 64 refers to the number of bits of memory allocated to hold each value. |
|  float64  |  float  |  Numeric values with decimals. If a column contains numbers and NaNs (see below), pandas will default to float64, in case your missing value has a decimal.  |
|  datetime64,  timedelta[ns]  |  N/A (but see the [datetime module](https://docs.python.org/2/library/datetime.html) in Python’s standard library)  | Values meant to hold time data. Look into these for time series experiments.  |
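The rows of the table can be reproduced with a small hand-made DataFrame. This is a sketch with hypothetical column names and values; the exact label printed for the date column (for example `datetime64[ns]`) may vary between pandas versions.

```python
import pandas as pd

# Hypothetical toy data, one column per row of the table above
toy = pd.DataFrame({
    "count": [1, 2, 3],                    # integers only
    "weight": [40.0, float("nan"), 48.5],  # numbers plus a missing value
    "mixed": [1, "two", 3.0],              # numbers and text together
    "date": pd.to_datetime(["1977-07-16", "1977-07-17", "1977-07-18"]),
})

# count -> int64, weight -> float64, mixed -> object, date -> a datetime64 type
print(toy.dtypes)
```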

# Checking the format of our data
Now that we’re armed with a basic understanding of numeric and text data types, let’s explore the format of our survey data. We’ll be working with the same surveys.csv dataset that we’ve used in previous lessons.

In [3]:
# Make sure pandas is loaded
import pandas as pd

# Load data - note that pd.read_csv is used because we imported pandas as pd
surveys_df = pd.read_csv('../data/surveys.csv')

Remember that we can check the type of an object like this:  

`type(var_name)`


In [4]:
# Check the data type of the data frame we made:
type(surveys_df)

pandas.core.frame.DataFrame

Next, let’s look at the structure of our surveys data. In pandas, we can check the type of one column in a DataFrame using the syntax `dataFrameName[column_name].dtype`:

In [21]:
# Check the data type of the column 'sex'
surveys_df['sex'].dtype

dtype('O')

A type ‘O’ just stands for “object” which in Pandas’ world is a string (text).

In [20]:
# Now check the column for 'record_id'
surveys_df['record_id'].dtype

dtype('int64')

The type `int64` tells us that Python is storing each value within this column as a 64 bit integer.   
 
We can use the `dataframe_name.dtypes` command to view the data type for each column in a DataFrame (all at once).

In [23]:
surveys_df.dtypes

record_id            int64
month                int64
day                  int64
year                 int64
plot_id              int64
species_id          object
sex                 object
hindfoot_length    float64
weight             float64
dtype: object

Note that many of the columns in our survey data are of type `int64`. This means that they are 64 bit integers. But the `hindfoot_length` and `weight` columns are floating point values, which means they can contain decimals (and missing values). The `species_id` and `sex` columns are objects, which means they contain strings.

---
# Working With Integers and Floats
So we’ve learned that computers store numbers in one of two ways: as integers or as floating-point numbers (or floats). Integers are the numbers we usually count with. Floats have fractional parts (decimal places). Let’s next consider how the data type can impact mathematical operations on our data. Addition, subtraction, division and multiplication work on floats and integers as we’d expect.

In [11]:
# Add some things
5+5

10

In [10]:
# Subtract some things
24-4

20

If we divide one integer by another, we get a float. The result on Python 3 is different than in Python 2, where the result is an integer (integer division).

In [12]:
# Divide two integers
5/9

0.5555555555555556

In [13]:
# Try some different numbers!
10/3

3.3333333333333335

We can still use integer division if we want with the `//` operator (two division symbols). This will apply a division, but throw away the remainder.

In [14]:
# Now try "Integer" division
10//3


3
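The pieces of a division can be looked at side by side. The sketch below uses the `%` (modulo) operator and the built-in `divmod()` function, neither of which is used elsewhere in this lesson, to recover the remainder that `//` throws away.

```python
quotient = 10 // 3   # integer (floor) division
remainder = 10 % 3   # what is left over after the division

print(quotient, remainder)  # 3 1
print(divmod(10, 3))        # (3, 1)

# Regular division always returns a float in Python 3, even for whole results
print(type(10 / 2))         # <class 'float'>

# With negative numbers, // rounds down towards negative infinity
print(-10 // 3)             # -4
```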

We can also convert a floating point number to an integer or an integer to floating point number. Notice that Python does not round when it converts from floating point to integer: `int()` simply drops the decimal part (it truncates towards zero), so 7.83 becomes 7.

In [17]:
# Convert a to an integer
a = 7.83
int(a)

7

In [18]:
# Convert b to a float
b = 7
float(b)


7.0
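The difference between dropping the decimals, rounding down, and rounding to the nearest whole number only shows up clearly with negative values. A small sketch comparing `int()` with `math.floor()` and the built-in `round()`:

```python
import math

rows = []
for value in (7.83, -7.83):
    # int() drops the decimals, math.floor() rounds down, round() rounds to nearest
    rows.append((value, int(value), math.floor(value), round(value)))
    print(rows[-1])
# (7.83, 7, 7, 8)
# (-7.83, -7, -8, -8)
```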

---
# Working With Our Survey Data
Getting back to our data, we can modify the format of values within our data, if we want. For instance, we could convert the record_id field to floating point values.

In [19]:
# Convert the record_id field from an integer to a float
surveys_df['record_id'] = surveys_df['record_id'].astype('float64')
surveys_df['record_id'].dtype


dtype('float64')
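One detail worth noticing in the cell above is that the result of `.astype()` is assigned back to the column. The sketch below, using a hypothetical three-row DataFrame, suggests why: `.astype()` returns a new, converted copy and leaves the original column untouched until we overwrite it.

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({"plot_id": [2, 3, 7]})

# astype() returns a converted copy...
converted = toy["plot_id"].astype("float64")
dtype_of_copy = converted.dtype
dtype_before_assignment = toy["plot_id"].dtype
print(dtype_of_copy)            # float64
print(dtype_before_assignment)  # int64

# ...so we assign it back to make the change stick
toy["plot_id"] = converted
print(toy["plot_id"].dtype)     # float64
```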

### ✏️ Challenge:
Try converting the column plot_id to floats using:  
```py
surveys_df.plot_id.astype("float")
```

Then try converting `weight` to an integer. What goes wrong here? What is Pandas telling you? We will talk about some solutions to this later.

In [24]:
surveys_df.plot_id.astype("float")

0         2.0
1         3.0
2         2.0
3         7.0
4         3.0
         ... 
35544    15.0
35545    15.0
35546    10.0
35547     7.0
35548     5.0
Name: plot_id, Length: 35549, dtype: float64

In [25]:
surveys_df.weight.astype("int")

IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer

# Missing Data Values - NaN
What happened in the last challenge activity? Notice that this throws one of the following errors, depending on your pandas version: `ValueError: Cannot convert NA to integer` or `IntCastingNaNError: Cannot convert non-finite values (NA or inf) to integer`. If we look at the weight column in the surveys data we notice that there are NaN (**N**ot **a** **N**umber) values. NaN values are undefined values that cannot be represented mathematically. Pandas, for example, will read an empty cell in a CSV or Excel sheet as a NaN. NaNs have some desirable properties: if we were to average the weight column without replacing our NaNs, pandas would skip over those cells by default.

In [26]:
# Average the weight column
surveys_df['weight'].mean()

42.672428212991356
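The skipping behaviour can be checked on a tiny Series where the answer is easy to verify by hand. This sketch uses made-up weights; `skipna` is the parameter of `.mean()` that controls whether NaN values are ignored.

```python
import pandas as pd

# Hypothetical toy weights with one missing value
weights = pd.Series([40.0, float("nan"), 50.0])

# By default the NaN is skipped: (40 + 50) / 2
print(weights.mean())              # 45.0

# If we insist on including it, the result is itself NaN
print(weights.mean(skipna=False))  # nan

# count() only counts the non-missing values
print(weights.count())             # 2
```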

Dealing with missing data values is always a challenge. It’s sometimes hard to know why values are missing - was it because of a data entry error? Or data that someone was unable to collect? Should the value be 0? We need to know how missing values are represented in the dataset in order to make good decisions. If we’re lucky, we have some metadata that will tell us more about how null values were handled.

For instance, in some disciplines, like Remote Sensing, missing data values are often defined as -9999 or -999. Having a bunch of -9999 values in your data could really alter numeric calculations. Often in spreadsheets, cells are left empty where no data are available. Pandas will, by default, replace those missing values with NaN. However it is good practice to get in the habit of intentionally marking cells that have no data, with a no data value! That way there are no questions in the future when you (or someone else) explores your data.
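If a dataset uses a placeholder such as -9999, one option is to tell pandas about it while reading the file, using the `na_values` argument of `pd.read_csv`. The sketch below uses a small made-up CSV held in memory (the `site` and `temperature` columns are hypothetical) rather than a real file.

```python
import io
import pandas as pd

# A hypothetical CSV where -9999 marks a missing measurement
csv_text = "site,temperature\nA,21.5\nB,-9999\nC,19.0\n"

# Read as-is: the placeholder is treated as a real number
raw = pd.read_csv(io.StringIO(csv_text))
print(raw["temperature"].mean())    # -3319.5

# Tell pandas that -9999 means "no data"
clean = pd.read_csv(io.StringIO(csv_text), na_values=[-9999])
print(clean["temperature"].mean())  # 20.25
```

The first mean is badly distorted by the placeholder, while the second only uses the two real measurements.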

# Where Are the NaNs?
Let’s explore the NaN values in our data a bit further. Using the tools we learned in lesson 02 (from yesterday!), we can figure out how many rows contain NaN values for weight. We can also create a new subset from our data that only contains rows with weight values > 0 (i.e., select meaningful weight values):

In [27]:
# How many rows have weight values?
len(surveys_df[surveys_df['weight'].notnull()])


32283

In [30]:
# How many rows have null values?
len(surveys_df[surveys_df['weight'].isnull()])


3266
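Selecting rows with weight values > 0, as mentioned above, could look like the sketch below. It uses a hypothetical four-row DataFrame so the counts are easy to check. Note that a comparison with NaN is always `False`, so rows with missing weights are dropped by the `> 0` filter as well.

```python
import pandas as pd

# Hypothetical toy data: one missing weight and one zero weight
toy = pd.DataFrame({"weight": [40.0, float("nan"), 0.0, 48.0]})

# Keep only rows with a weight greater than zero
positive = toy[toy["weight"] > 0]
print(len(positive))                # 2

# Compare with counting non-missing and missing values
print(toy["weight"].notna().sum())  # 3
print(toy["weight"].isna().sum())   # 1
```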

We can replace all `NaN` values with zeroes using the `.fillna()` method (after making a copy of the data so we don’t lose our work):

In [31]:
# Make a copy! Be sure to assign a new variable using the ".copy()" method!
df1 = surveys_df.copy()
# Now fill all NaN values with 0
df1['weight'] = df1['weight'].fillna(0)

However `NaN` and 0 yield different analysis results. The mean value when `NaN` values are replaced with 0 is different from when `NaN` values are simply thrown out or ignored.

In [32]:
# Check the mean of df1
df1['weight'].mean()


38.751976145601844

We can fill `NaN` values with any value that we choose. The code below fills all `NaN` values with the mean of all weight values.

In [34]:
df1['weight'] = surveys_df['weight'].fillna(surveys_df['weight'].mean())
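To see how the choice of fill value changes the result, here is a sketch on a three-value Series (made-up numbers) where the arithmetic can be checked by hand.

```python
import pandas as pd

# Hypothetical toy weights with one missing value
weights = pd.Series([40.0, float("nan"), 50.0])

ignored = weights.mean()                             # NaN skipped
zero_filled = weights.fillna(0).mean()               # NaN replaced by 0
mean_filled = weights.fillna(weights.mean()).mean()  # NaN replaced by the mean

print(ignored)      # 45.0
print(zero_filled)  # 30.0
print(mean_filled)  # 45.0

# Filling with the mean keeps the average, but shrinks the spread of the data
print(weights.std(), weights.fillna(weights.mean()).std())
```

Filling with the mean leaves the average unchanged, but it makes the data look less variable than it really is, which is one more reason to choose a strategy deliberately.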

We could also choose to create a subset of our data, only keeping rows that do not contain NaN values.
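A possible sketch of such a subset, on a hypothetical three-row DataFrame: once the rows with missing weights are gone, the integer conversion that failed earlier works. As an alternative that is not part of the original lesson, pandas also offers a nullable integer type, `"Int64"` (capital I), which can hold missing values.

```python
import pandas as pd

# Hypothetical toy data with one missing weight
toy = pd.DataFrame({
    "species_id": ["DM", "PB", "RM"],
    "weight": [40.0, float("nan"), 14.0],
})

# Keep only the rows that have a weight, then convert to integers
has_weight = toy[toy["weight"].notna()].copy()
has_weight["weight"] = has_weight["weight"].astype("int64")
print(len(has_weight))                # 2
print(has_weight["weight"].tolist())  # [40, 14]

# Alternative: keep all rows and use the nullable integer dtype
nullable = toy["weight"].astype("Int64")
print(nullable.dtype)                 # Int64
print(nullable.isna().sum())          # 1
```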

The point is to **make conscious decisions about how to manage missing data**. This is where we think about how our data will be used and how these values will impact the scientific conclusions made from the data.

Python gives us all of the tools that we need to account for these issues. We just need to be cautious about how the decisions that we make impact scientific results.

### ✏️ Challenge - Counting
Count the number of missing values per column of the original dataframe `surveys_df`

Hints:  
As with almost everything programming, there are multiple ways to do this.  
  Try the method `.count()`, which gives you the number of non-NA observations per column (we need the NA observations per column).  
  
You could also try using the `.isnull()` or `.isna()` methods too.

In [36]:
for column in surveys_df.columns:
    print(column, len(surveys_df[surveys_df[column].isna()]))

record_id 0
month 0
day 0
year 0
plot_id 0
species_id 763
sex 2511
hindfoot_length 4111
weight 3266


In [38]:
for column in surveys_df.columns:
    print(column, len(surveys_df[pd.isnull(surveys_df[column])]))

record_id 0
month 0
day 0
year 0
plot_id 0
species_id 763
sex 2511
hindfoot_length 4111
weight 3266
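The loops above can also be written without a loop. The sketch below shows two other possible approaches on a hypothetical four-row DataFrame: summing the `True` values returned by `.isna()`, and subtracting `.count()` from the number of rows, as suggested in the hints.

```python
import pandas as pd

# Hypothetical toy data with a few missing values
toy = pd.DataFrame({
    "record_id": [1, 2, 3, 4],
    "sex": ["M", None, "F", None],
    "weight": [40.0, float("nan"), 48.0, 29.0],
})

# isna() gives True/False per cell; summing counts the True values per column
missing_per_column = toy.isna().sum()
print(missing_per_column)

# count() gives the non-missing values per column, so subtract from the row count
also_missing = len(toy) - toy.count()
print(also_missing)
```

Both approaches should report 0 missing values for `record_id`, 2 for `sex` and 1 for `weight`.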


# Writing Out Data to CSV
We’ve learned about manipulating data to get desired outputs. But we’ve also discussed keeping data that has been manipulated separate from our raw data. Something we might be interested in doing is working with only the rows that have complete data. First, let’s reload the data so we’re not mixing up all of our previous manipulations.  
We do this with:  
`pd.read_csv("file location")`

In [39]:
# Read the data in again and store it in memory
surveys_df = pd.read_csv('../data/surveys.csv')

Next, let’s drop all the rows that contain missing values. We will use the method `dropna()`. By default, `dropna()` removes rows that contain missing data for even just one column.

In [40]:
# Create the new dataframe 
df_na = surveys_df.dropna()

If you now type `df_na`, you should observe that the resulting DataFrame has 30676 rows and 9 columns, much smaller than the 35549 row original.

In [41]:
df_na

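|  | record_id | month | day | year | plot_id | species_id | sex | hindfoot_length | weight |
|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|:---:|
| 62 | 63 | 8 | 19 | 1977 | 3 | DM | M | 35.0 | 40.0 |
| 63 | 64 | 8 | 19 | 1977 | 7 | DM | M | 37.0 | 48.0 |
| 64 | 65 | 8 | 19 | 1977 | 4 | DM | F | 34.0 | 29.0 |
| 65 | 66 | 8 | 19 | 1977 | 4 | DM | F | 35.0 | 46.0 |
| 66 | 67 | 8 | 19 | 1977 | 7 | DM | M | 35.0 | 36.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 35540 | 35541 | 12 | 31 | 2002 | 15 | PB | F | 24.0 | 31.0 |
| 35541 | 35542 | 12 | 31 | 2002 | 15 | PB | F | 26.0 | 29.0 |
| 35542 | 35543 | 12 | 31 | 2002 | 15 | PB | F | 27.0 | 34.0 |
| 35546 | 35547 | 12 | 31 | 2002 | 10 | RM | F | 15.0 | 14.0 |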
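The behaviour of `dropna()` can be checked on a tiny example. In this sketch (hypothetical data), only one row is complete, so the default call keeps just that row. The `subset` argument, which is not used in this lesson, restricts the check to the listed columns.

```python
import pandas as pd

# Hypothetical toy data: row 1 lacks a species, row 2 lacks a weight
toy = pd.DataFrame({
    "species_id": ["DM", None, "PB"],
    "weight": [40.0, 48.0, float("nan")],
})

# Default: drop every row with at least one missing value
complete = toy.dropna()
print(complete.shape)           # (1, 2)
print(complete.index.tolist())  # [0]

# Only look at the weight column when deciding which rows to drop
only_weight = toy.dropna(subset=["weight"])
print(only_weight.shape)        # (2, 2)
```

The surviving rows keep their original index labels, which is why the index of `df_na` above does not start at 0.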


We can now use the `to_csv()` method to export a DataFrame in CSV format. A bare filename, as in `df.to_csv('out.csv')`, saves the data into the current working directory. We can save it to a different folder by adding the folder name and a slash before the filename: `df.to_csv('foldername/out.csv')`. The code below writes to the `../data` folder. We use `index=False` so that pandas doesn’t include the index number for each line.

In [44]:
# Write DataFrame to CSV file '../data/surveys_complete.csv'
df_na.to_csv('../data/surveys_complete.csv', index=False)

We will use this data file later in the workshop. Check out your working directory to make sure the CSV wrote out properly, and that you can open it! If you want, try to bring it back into Python to make sure it imports properly.
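Bringing a file back into Python to check it could look like the sketch below. To avoid touching the workshop files, it writes a hypothetical two-row DataFrame to a temporary folder (using the standard library `tempfile` module) and reads it back in.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy data standing in for df_na
toy = pd.DataFrame({
    "record_id": [63, 64],
    "species_id": ["DM", "DM"],
    "weight": [40.0, 48.0],
})

# Write to a temporary folder, then read the file back in
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "toy_complete.csv")
    toy.to_csv(out_path, index=False)
    reloaded = pd.read_csv(out_path)

# Same shape and columns as before, and no extra index column
print(reloaded.shape)          # (2, 3)
print(list(reloaded.columns))  # ['record_id', 'species_id', 'weight']
print(reloaded)
```

Without `index=False`, the reloaded DataFrame would contain an extra `Unnamed: 0` column holding the old index.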

---
# Recap
What we’ve learned:
- How to explore the data types of columns within a DataFrame
- How to change the data type
- What NaN values are, how they might be represented, and what this means for your work
- How to replace NaN values, if desired
- How to use `to_csv` to write manipulated data to a file.


# ❗Key Points


- Pandas uses different names for data types than Python does, for example: `object` for textual data.
- A column in a DataFrame can only have one data type.
- The data type of a single column in a DataFrame can be checked using `dtype`.
- Make conscious decisions about how to manage missing data.
- A DataFrame can be saved to a CSV file using the `to_csv` method.
