# Dealing with data and intro to Pandas

_This notebook is adapted by Shannon Tubridy from materials authored by Todd Gureckis and Kelsey Moty and released under a [CC BY-SA 4.0](https://creativecommons.org/licenses/by-sa/4.0/) license._

## How to think about data and organize it

When we start accumulating and using data we need to start thinking abstractly about **organizing** our measurements into some type of collection that can help us keep track of things like the order in which measurements were made, or the relationship between different datapoints.

### Lists

Lists are simple data types that we've already learned about. Lists are fine for some cases, especially one-dimensional data like an array of names or a sequence of measurements. We can look for the location of items in a list or calculate some basic statistics.

In [None]:
sites = ['NYC', 'Beijing', 'Cairo', 'Bogota']

In [None]:
# find whether and where 'Cairo' is in sites:

In [None]:
nyc_monthly_cases = [1152, 954, 987, 1014, 876, 854, 987, 1090, 1011, 921, 842, 821]

In [None]:
# compute the average or mean value of a 
# list using sum() and len() functions:
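
One possible way to fill in the two empty cells above, as a minimal sketch: the `in` operator and the list `index()` method answer "whether" and "where", and `sum()` divided by `len()` gives the mean. The variable names `has_cairo`, `cairo_index`, and `mean_cases` are only illustrative.

```python
sites = ['NYC', 'Beijing', 'Cairo', 'Bogota']
nyc_monthly_cases = [1152, 954, 987, 1014, 876, 854, 987, 1090, 1011, 921, 842, 821]

# "whether": the in operator returns True or False
has_cairo = 'Cairo' in sites

# "where": index() returns the position of the first match, counting from 0
# (index() raises a ValueError if the item is not in the list)
cairo_index = sites.index('Cairo')

# mean = total of the values divided by how many values there are
mean_cases = sum(nyc_monthly_cases) / len(nyc_monthly_cases)

print(has_cairo)
print(cairo_index)
print(mean_cases)
```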

### Multi-dimensional data

In many cases we will have data that have at least two dimensions on which we have measurements.

For example:

- a behavioral experiment where you have a set of people (dimension 1) and their performance on some assessment (dimension 2)

- crime rates (dimension 1) in different neighborhoods within a city (dimension 2)

- a measure of behavior like working memory accuracy (dimension 1) and simultaneous fMRI measurements from an area in visual cortex (dimension 2)

- fMRI activation measurements (dimension 1) at each location in a 3D space (dimensions 2, 3, and 4) measured over time (dimension 5)

The assignment of a particular measurement to a particular dimension is arbitrary, but in each of these examples there are two or more kinds of data that need to stay in alignment.

A list of lists, approximating a matrix, is one way to handle this.

Imagine 3 people in a study and we have 4 measurements for each. 

A list of lists storing these data might look like this, where rows are people and "columns", or the sub-elements of each list, are the individual measurements within each person:

<div>
<img src="attachment:matrix.png" width="300"/>
</div>


In [None]:
# define three individual lists of numbers
p1 = [10, 15, 23, 16]
p2 = [9, 8, 57, 2]
p3 = [22, 64, 70, 19]

In [None]:
# make a new list of the lists


In [None]:
# access one of the lists using regular 
# list indexing and square brackets

# 


In [None]:
# indexing the individual lists happens 
# with a second set of square brackets

# get the second element in the first list:


# get all the elements in the third list:
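
One way the empty cells above could be completed, as a sketch. The names `data`, `first_person`, `second_of_first`, and `third_person` are illustrative choices, not names the notebook requires.

```python
p1 = [10, 15, 23, 16]
p2 = [9, 8, 57, 2]
p3 = [22, 64, 70, 19]

# a list whose elements are themselves lists: one inner list per person
data = [p1, p2, p3]

# the first set of square brackets picks one of the inner lists
first_person = data[0]

# a second set of square brackets indexes inside that inner list
second_of_first = data[0][1]

# all the elements in the third list
third_person = data[2]

print(first_person)
print(second_of_first)
print(third_person)
```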



Although a list of lists is a simple way to organize multiple observations made on multiple people, it can be confusing to keep track of which dimension is which.  Are the rows or the columns the people in the example above?  


<div>
<img src="attachment:matrix.png" width="400"/>
</div>

It is nice to have data stored along with additional information.

### Metadata and data file formats

**metadata** refers to data or information that *describes* other data. An example would be a column name or a row index in a spreadsheet.





This is much nicer than a list of lists as all of the information describing the data is embedded with the data themselves.
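
As a rough illustration of the same idea in plain Python (this is not how a spreadsheet stores its data), a dictionary can keep a descriptive label next to each list of measurements. The labels `person_1`, `person_2`, and `person_3` are hypothetical.

```python
# hypothetical labels attached to the three lists of measurements
labeled = {
    'person_1': [10, 15, 23, 16],
    'person_2': [9, 8, 57, 2],
    'person_3': [22, 64, 70, 19],
}

# the label makes it explicit whose measurements we are looking up
person_2_third = labeled['person_2'][2]
print(person_2_third)
```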


![image.png](attachment:image.png)

This is a snapshot of the file encoding for the spreadsheet visualized above. You can find the actual data in there (look for sub-22, sub-23, and sub-31) but also a bunch of other information like column names, data types (strings and numbers), font name and size, etc., and it's all in a hierarchical structure with tags.

<br>

<div>
<img src="attachment:image.png" width="750"/>
</div>

<br>

This formatted structure enables storing lots of additional information like column names and style along with the data.

This can come at the cost of complexity in the file, as in the Excel file above, and can make it difficult (but not impossible) to use xlsx files with other software.

To avoid this and facilitate the sharing and re-use of data, **it is preferable that data is stored using _plain-text formats_ such as txt and csv files.**

#### Comma-separated Values (.csv files)
CSV, or Comma-separated value files, are written using a plain-text format. No special tags, just text. The only indication that this is columnar data are the commas used to delimit different values for columns and new lines to indicate rows. 

Here's the same dataset as before but now as a CSV file:



<div>
<img src="attachment:image-2.png" width="300"/>
</div>


This means that what you can store within a CSV file is quite limited (no images or font formatting, for example). But it also means that this kind of file is great for storing and sharing data. 

CSV files can be opened by many applications, including Python, R, Excel and Google Sheets, making them accessible to others.
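
To make the "just text" point concrete, here is a small sketch that uses Python's built-in `csv` module on a string. The contents of `csv_text` are made up for illustration and are not the dataset pictured above.

```python
import csv
import io

# hypothetical contents of a small CSV file, written as a Python string
csv_text = "subject,age,score\nsub-01,24,0.81\nsub-02,31,0.77\n"

# io.StringIO lets us treat the string as if it were an open file;
# csv.reader splits each line into values wherever there is a comma
rows = list(csv.reader(io.StringIO(csv_text)))

header = rows[0]     # the first line holds the column names
records = rows[1:]   # every later line is one row of data

print(header)
print(records)
```

Note that every value comes back as a string, because a plain-text file carries no information about data types.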

#### Tab-separated Values (.tsv files)
TSV files are very similar to CSV files, except that instead of using commas to delimit the values in a dataset, they use tabs (a special type of spacing, the same type you get when you hit the tab key on your computer). We will revisit this soon when we load some data into Python from a file.

#### .txt files
Filename extensions don't actually change anything in the underlying file and instead are used as signals to software (and people) about the structure of what's inside the file. The .txt file extension is another commonly used extension that indicates plain text. One could have a txt file with comma-separated values in it or anything else. The key is that the file only has text in it and not any additional special metadata.
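
A quick sketch of this point: the same text is written to two temporary files that differ only in their extension, and both are read back with the built-in `csv` module. The file names `sample.csv` and `sample.txt` and the data are hypothetical.

```python
import csv
import os
import tempfile

text = "name,score\nana,3\nben,5\n"

parsed = {}
with tempfile.TemporaryDirectory() as folder:
    for filename in ('sample.csv', 'sample.txt'):
        path = os.path.join(folder, filename)
        # write exactly the same characters to both files
        with open(path, 'w', newline='') as f:
            f.write(text)
        # read each file back as comma-separated values
        with open(path, newline='') as f:
            parsed[filename] = list(csv.reader(f))

# the extension made no difference to what was read
print(parsed['sample.csv'] == parsed['sample.txt'])
```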

### Exporting Google Sheets to csv

To export a .csv file from Google Sheets, you click on File > Download > Comma-separated values (.csv). 


<div>
<img src="attachment:export_csv_google.png" width="500"/>
</div>


If you created a Google Sheet with multiple sheets, you will have to save each individual sheet as a .csv file because .csv files do not support multiple sheets. To do this, you will have to click on each sheet and go through the save process for each sheet. 

### Exporting to csv in Excel

To create a .csv file from Excel, click on the File menu and select Save As. Choose a save location and open the "File Format" or "Save as Type" menu (the menu name depends on the Excel version). Choose the plain csv option.<br><br>


<div>
<img src="attachment:export_csv_excel.png" width="450"/>
</div>


We will make extensive use of csv files, both to read data into Python using the Pandas library and to save new analyses back to a file for storage or sharing.

# The Pandas library for Python

Throughout this class there are several libraries (i.e., tools which extend the capabilities of the core Python language) which we will use a lot.  One of those is [Pandas](https://pandas.pydata.org).  

Pandas is an open-source project for Python that provides a data analysis and manipulation tool.  Pandas gives you a way to interact with a dataset organized like a spreadsheet (meaning columns and rows which are possibly named or labeled with metadata) in order to perform sophisticated data analyses. This data structure is called a _dataframe_.  

![image.png](attachment:image.png)

<div class="alert alert-warning" role="alert">
Pandas can be confusing! It can be difficult to "get" Pandas at first, but when it clicks it can open a universe of data management and analysis. It takes some patience in part because data manipulation is a very complex topic at least at the conceptual level.
</div>

We won't go through a complete description of everything in Pandas, but you should get a sense of how to use Pandas (with code) to do the kinds of tasks typically done in a spreadsheet like Excel or Google Sheets.

The first step of using Pandas is to load it into your current Python session using `import`.

The import.ipynb notebook has basic info on the `import` command.

In [None]:
# load the numpy library to use some random number generators
import numpy as np

# load the pandas library
import pandas as pd

As discussed in the _imports.ipynb notebook_, there are standard conventions (but not requirements) for giving short names to libraries when you import them. For pandas this is **pd** and numpy is **np**.

Using the standard names facilitates understanding code written by different people.

## Reading data (e.g., csvs) into Python using Pandas

First let's understand how to read some data into a dataframe object.  

Often you are reading data in from some other file (e.g., a CSV file) and "loading" it into Pandas.  Pandas has <a href="https://pandas.pydata.org/pandas-docs/stable/reference/io.html">many different ways</a> of reading in different file types. Here, we will use the <a href="https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html">`pd.read_csv()`</a> function because we will mostly be working with .csv files in this class.

When reading in your .csv file, there are two things you have to do:

1) Tell pandas where to find the file

2) Store your data into a variable

**1) tell `pd.read_csv()` where the file you are trying to import is located.** 

pd.read_csv() has one required input: the path or location of a file you want to load.

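If the .csv file and your notebook are in the same folder, you only have to put the name of the file (for example, `'salary.csv'`). If the file is somewhere else, as in the cell that follows, you give a path to it.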

In [None]:
pd.read_csv('../data/salary.csv')

Sometimes, your .csv file might be saved in a folder that's not in the same folder as your notebook. For example, say you have a folder called "project" as in the image below. 

In that folder, there is a folder called "code" that contains your notebooks/python code, as well as a subfolder called "data" that contains the data_sample.csv file. There is also a data directory with salary.csv in it at the same level as the code directory.

To import a file that is in a subdirectory of the directory where your notebook is we can use '.' notation. So if the current directory is code, then this:

`./data/data_sample.csv`

means the data directory that's inside the current folder which is indicated by the single '.'

In the image below, if we wanted to use the analysis_notebook.ipynb to load the salary.csv file in this project we could do it like this:

`../data/salary.csv`

The double dots mean go up one level from where you are now (i.e., from `project/code/` to `project/`) and then look in the data directory that's there.
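
The dot notation can be checked without touching any real files. This sketch uses the standard-library `posixpath` module to resolve the two relative paths from a hypothetical starting folder `project/code`.

```python
import posixpath

# hypothetical location of the notebook, following the layout described above
notebook_dir = 'project/code'

# a single '.' stays in the current folder
inside = posixpath.normpath(posixpath.join(notebook_dir, './data/data_sample.csv'))

# a double '..' moves up one level before continuing
up_one = posixpath.normpath(posixpath.join(notebook_dir, '../data/salary.csv'))

print(inside)
print(up_one)
```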



![image-2.png](attachment:image-2.png)

<div class="alert alert-warning" role="alert">
**Important note**: on Windows the relative path notation (dots) works the same, but the directory separator is a backslash ('\') rather than a forward slash ('/'). So: 
    
    ./project/data/salary.csv

on a Mac would be 
    
    .\project\data\salary.csv
on a Windows computer.
    
You can either do this manually, or you can use the os library (import os) and its os.path.join() function, which we will discuss in subsequent lectures.
</div>
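
As a brief preview, `os.path.join()` builds a path from its parts and inserts whichever separator is correct for the computer running the code. The folder and file names here simply reuse the example above.

```python
import os

# join the pieces of the path; the separator is chosen automatically
data_path = os.path.join('..', 'data', 'salary.csv')

print(data_path)
```

The resulting string can be passed to `pd.read_csv()` in place of a hand-typed path.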

**2) Store the data into a variable** 
Otherwise, you won't be able to work with your data. Here, we called the dataframe "df" but you can name it whatever you want. 

In [None]:
# Incorrect, doesn't store into a variable: 
pd.read_csv('../data/salary.csv')

In [None]:
# Correct, creates a dataframe called df that you can work with 
df = pd.read_csv('../data/salary.csv')

In [None]:
# check out the loaded data by 
# just putting the dataframe variable name
df

In [None]:
# it's a new data type:
type(df)

### Comma, tab, and other column separators

By default, `pd.read_csv()` assumes the file uses commas to differentiate between different entries in the file. But you can also explicitly tell `pd.read_csv` that you want to use commas as the delimiter or tell pandas to use a different value as the delimiter.

In [None]:
df = pd.read_csv('../data/salary.csv', sep = ",")

Here you can see the use of an **optional function argument**. If we don't include an input named `sep` then pandas will use a comma by default. The sep= argument is useful when the 'delimiter' is another character.

In addition to comma-separated data files, you will sometimes encounter tab-separated files, where the separator between values is a tab character rather than a comma.

Loading such a file with read_csv() and no `sep=` input will not give you what you want:

In [None]:
pd.read_csv('../data/xls_sample.tsv')

The result is clearly wrong: there is a single column, and the values in each row are mashed together with a \t between them. That \t is the separating character, and we can load the file correctly by passing '\t' as the `sep=` input (\t is the special notation for a tab).

In [None]:
pd.read_csv('../data/xls_sample.tsv', sep='\t')
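
The same behavior can be reproduced without a file on disk. In this sketch the tab-separated text is a made-up string, and `io.StringIO` lets `pd.read_csv()` read it as though it were a file.

```python
import io
import pandas as pd

# hypothetical tab-separated data: two columns, two rows
tsv_text = "name\tscore\nana\t3\nben\t5\n"

# default separator (comma): each line becomes a single mashed-together value
wrong = pd.read_csv(io.StringIO(tsv_text))

# tab separator: the columns are split correctly
right = pd.read_csv(io.StringIO(tsv_text), sep='\t')

print(wrong.shape)
print(right.shape)
print(list(right.columns))
```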

#### Loading .txt

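The file extension doesn't really matter: `pd.read_csv()` can load any plain-text file as long as you tell it which character separates the values. Here the separator is a dollar sign: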

In [None]:
pd.read_csv('../sample_data_dollar.txt', sep='$')

## Column names or headers

`pd.read_csv()` also assumes by default that the first row in your .csv files lists the names for each column of data (the column headers) and having the column names be in your data file simplifies keeping track of the structure of the data. 

Nonetheless, you may sometimes have csv files with no column names like this:

<br>


<div>
<img src="attachment:image.png" width="150"/>
</div>
<br>
<br>

If you load this file using the default read_csv(), you can end up with your first row of data being used as column names.
   

In [None]:
# load data that has no headers
pd.read_csv('../data/data_no_headers.csv')

You can tell pandas to have the first line be data by using the input `header=None` and it will just use numbers as column names:

In [None]:
pd.read_csv('../data/data_no_headers.csv', header=None)

If you use `header = None` and your data has column names, those column names will appear as the first row of data in your dataframe variable.

In [None]:
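# use a separate variable name so the df loaded earlier (with headers) is not overwritten
df_no_header = pd.read_csv('salary.csv', sep = ",", header = None)
df_no_header.head()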
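
Both situations can be seen side by side with a tiny in-memory example. The text in `no_header_text` is hypothetical and stands in for a file that has no column names.

```python
import io
import pandas as pd

# hypothetical data with no header row: two lines of numbers
no_header_text = "1,2\n3,4\n"

# default: the first line is consumed as column names, leaving one row of data
default_df = pd.read_csv(io.StringIO(no_header_text))

# header=None: every line is treated as data and the columns are numbered
numbered_df = pd.read_csv(io.StringIO(no_header_text), header=None)

print(default_df.shape, list(default_df.columns))
print(numbered_df.shape, list(numbered_df.columns))
```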

Sometimes a dataset has multiple header rows (e.g., both the first and second rows of the dataset list column names). `pd.read_csv` allows you to keep both rows as headers by passing a list of integers as the `header` argument (for example, `header=[0, 1]`). Remember that 0 actually means the first row. 
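
A minimal sketch of a two-row header, using made-up text in which the first row names each measure and the second row gives its units. The column names and values are hypothetical.

```python
import io
import pandas as pd

# hypothetical data: two header rows followed by two rows of values
two_header_text = "salary,age\nusd,years\n50000,30\n62000,41\n"

# header=[0, 1] tells pandas that rows 0 and 1 both hold column names
multi_df = pd.read_csv(io.StringIO(two_header_text), header=[0, 1])

# each column name is now a pair, one entry from each header row
print(list(multi_df.columns))
print(multi_df.shape)
```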

<div class="alert alert-warning" role="alert">
When creating your own dataset, refrain from using characters like a comma, a space, or a period (.) in the column names.  This will make things easier for you down the line when using Pandas for statistical modeling. 
</div>

## Viewing the data in a dataframe

In [None]:
# load a fresh dataframe
salary_df = pd.read_csv('salary.csv')


Now we have loaded a dataset into a variable called `salary_df`.  There are several ways to look at the data to check it was properly read in and also to learn more about the structure of this dataset.  

The simplest method is simply to type the name of the dataframe variable by itself in a code cell or as the last entry in a cell:

In [None]:
salary_df = pd.read_csv('salary.csv')
salary_df

This outputs a "table" view of the dataframe showing the column names, and several of the rows of the dataset.  It doesn't show you **all** of the data at once because in many large files this would be too much to make sense of.

You can also specifically request the first several rows of the dataframe:

In [None]:
salary_df.head()

or the last several rows:

In [None]:
salary_df.tail()

The "head" of the dataframe is the top.  The "tail" of the dataframe is the bottom. If you don't give an input to the head() or tail() function they will display the first (or last) 5 rows. If you give a number you'll get that many rows:

In [None]:
salary_df.head(2)
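
If you don't have the salary file handy, the same behavior can be checked on a small made-up dataframe. The name `demo_df` and its single column `x` are hypothetical.

```python
import pandas as pd

# hypothetical dataframe with 10 rows holding the values 0 through 9
demo_df = pd.DataFrame({'x': range(10)})

top_default = demo_df.head()    # no input: first 5 rows
bottom_two = demo_df.tail(2)    # with a number: that many rows from the end

print(len(top_default))
print(bottom_two['x'].to_list())
```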

## Accessing individual data columns

To access a single column you can index it like a Python dictionary, using the column name as a key:

In [None]:
salary_df['salary']

#### extract one column into a variable:

In [None]:
age = salary_df['age']

print(age)

In [None]:
age.max()

This returns a new data type which is a **Series**.

In [None]:
type(age)

Pandas **series** datatypes can be useful, but sometimes they are overcomplicated and you just want the values from a column in a list. The series.to_list() function will do this for you.

#### Convert a series to a list using series.to_list()

In [None]:
# age.to_list() with no inputs will return a list representation
# of the age series

age_list = age.to_list()
print(age_list)
print(type(age_list))

## Accessing individual rows in a dataframe

In [None]:
# view the top of the dataframe using head()
salary_df.head(3)

#### using df.iloc[]
Since the Python bracket notation is used to lookup columns, a special command is needed to access rows.  

The best way to look up a single row is to use `.iloc[]` where you pass the integer row number you want to access (zero indexed).  So to get the first row you type:

In [None]:
salary_df.iloc[0]

The output of the above is a pandas _series_ which we've briefly introduced. It's kind of like a supercharged list combined with a dictionary. 

To get the value for one of the variables or columns you can use square bracket naming of the desired column:

In [None]:
salary_df.iloc[0]['salary']

Indexing rows in this way is the same as indexing strings and lists: starts at zero, ends at the number of rows minus 1, and reverse indexing can be used starting at -1.
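
Here is a short sketch of those indexing rules on a made-up three-row dataframe. The values in `demo_df` are hypothetical and are not taken from the salary file.

```python
import pandas as pd

# hypothetical dataframe with three rows
demo_df = pd.DataFrame({'salary': [50000, 62000, 71000], 'age': [30, 41, 52]})

first_row = demo_df.iloc[0]     # rows are counted starting at zero
last_row = demo_df.iloc[-1]     # -1 counts backward from the end

# a second set of square brackets picks one column's value from the row
first_salary = demo_df.iloc[0]['salary']

print(first_salary)
print(last_row['age'])
```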

### Indexes and Columns

In [None]:
salary_df.head(3)

There are two special elements of a normal dataframe called the **column index** and the **row index** (or just index).  

The row index is the column on the left that has no name but seems like a counter of the rows (e.g., 72, 73, 74, ...).  The row index is useful in Pandas dataframes for looking things up by row.  

Although you can index a row by counting as we did using .iloc above (access the fifth row for instance), the index can be made of arbitrary types of data including strings, etc...  You don't need to know a ton about indexes to use Pandas typically but every once in a while they come up so it is useful to know the special nature of the index column.

In [None]:
salary_df.loc[0]

In [None]:
df.index

In [None]:
df.columns

The above code shows how to find the row index and column index.
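
To see why the index is different from simply counting rows, here is a sketch with a made-up dataframe whose index holds strings. `.loc[]` looks a row up by its index label, while `.iloc[]` looks it up by position. The names and values are hypothetical.

```python
import pandas as pd

# hypothetical dataframe whose row index is made of strings
demo_df = pd.DataFrame(
    {'salary': [50000, 62000, 71000]},
    index=['ana', 'ben', 'cleo'],
)

by_label = demo_df.loc['ben']     # look up a row by its index label
by_position = demo_df.iloc[1]     # look up the same row by counting

print(list(demo_df.index))
print(list(demo_df.columns))
print(by_label['salary'], by_position['salary'])
```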

You can change the row index to another column:

In [None]:
df.head(5)

In [None]:
df.set_index('departm')


Or to reset it to a sequence of row numbers use `.reset_index()`.

In [None]:
# first change the index to values from column 'departm'
# and put the new df in variable df2
df2 = df.set_index('departm')

# show df2 with the new departm index:
df2.head()


In [None]:
# reset the index to numeric values 0 to n rows
df2 = df2.reset_index()
df2

That reset the index to be numeric, and it took the old index and added it as a column ('departm')
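
The round trip can be checked on a tiny made-up dataframe. One detail worth noticing in this sketch is that `set_index()` returns a new dataframe and leaves the original unchanged, which is why the result is stored in a variable. The department names and salaries are hypothetical.

```python
import pandas as pd

# hypothetical dataframe with a 'departm' column
demo_df = pd.DataFrame({'departm': ['bio', 'chem'], 'salary': [50000, 62000]})

# move 'departm' out of the columns and into the row index
indexed = demo_df.set_index('departm')

# put 'departm' back as a column and number the rows again
restored = indexed.reset_index()

print(list(demo_df.columns))    # the original dataframe is unchanged
print(list(indexed.columns))
print(list(restored.columns), list(restored.index))
```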

**Note the syntax we've used a few times here.**

We referenced the `df` variable which is the variable we created to store the data from our file.  Then the `.functionname()` is known as a method of the data frame which is provided by pandas.  

For instance the `.head()` method prints out the first five rows of the data frame.  

The `.tail()` method prints out the last five rows of the data frame.  There are many other methods available on data frames, and you access each of them using the same `.functionname()` syntax.  The next parts of this notebook explain some of the most common ones.