**Setup**

With this Google Colaboratory (Colab) notebook open, click the "Copy to Drive" button that appears in the menu bar. The notebook will then be attached to your own user account, so you can edit it in any way you like -- you can even take notes directly in the notebook.

# Python Open Labs: Reading, exploring, and writing data with Pandas

## Welcome!

### Instructors
- Walt Gurley
- Claire Cahoon
- Scott Bailey
- Natalia Lopez
- Ashley Evans Bandy

### Open Labs agenda

1.   **Guided activity**: One of the instructors will share their screen to work through the guided activity and teach concepts along the way.

2.   **Open lab time**: After the guided portion of the Open Lab, the rest of the time is for you to ask questions, work collaboratively, or have self-guided practice time. You will have access to instructors and peers for questions and support.

Breakout rooms will be available if you would like to work in small groups. If you have trouble joining a room, ask in the chat to be moved into a room.

### Learning objectives

By the end of our workshop today, we hope you'll understand what the pandas library is and be able to work with pandas data structures like a `Series` and a `DataFrame`.

### Today's Topics
- What is pandas, and how does it relate to Python?
- Importing and using pandas
- How to read data into pandas
- Common pandas data structures (`Series` and `DataFrame`)
- Referencing data in a `DataFrame`
- How to write data from pandas


### Using Zoom

Please make sure that your mic is muted during the workshop.

We will have live captioning enabled. You can switch this on and off from the toolbar at the bottom of your screen.

### Asking questions

Please feel free to ask questions in the Zoom chat throughout the demonstration.

Other instructors will be monitoring the Zoom chat. They will answer questions as they are able, and will collect questions whose answers might help everyone to address at the end of the demonstration.

The open lab time is when you will be able to ask more questions and work together on the exercises.

### Using Jupyter Notebooks and Google Colaboratory

Jupyter notebooks are a way to write and run Python code in an interactive way. They're quickly becoming a standard way of putting together data, code, and written explanations or visualizations into a single document and sharing that. There are a lot of ways that you can run Jupyter notebooks, including just locally on your computer, but we've decided to use Google's Colaboratory notebook platform for this workshop.  Colaboratory is “a Google research project created to help disseminate machine learning education and research.”  If you would like to know more about Colaboratory in general, you can visit the [Welcome Notebook](https://colab.research.google.com/notebooks/welcome.ipynb).

Using the Google Colaboratory platform allows us to focus on learning and writing Python in the workshop rather than on setting up Python, which can sometimes take a bit of extra work depending on platforms, operating systems, and other installed applications. If you'd like to install a Python distribution locally, though, we're happy to help. Feel free to [get help from our graduate consultants](https://www.lib.ncsu.edu/dxl) or [schedule an appointment with Libraries staff](https://go.ncsu.edu/dvs-request).

## Guided Instruction
This week we're introducing the Pandas library for Python and working on importing, viewing, and referencing the data.

Content Warning: This dataset contains information relating to violence towards animals. We recognize that this may be distressing, so please feel free to step away from the workshop if you need to.

In this section, we will work through examples using data from the [Federal Aviation Administration (FAA) Wildlife Strikes Database](https://wildlife.faa.gov/search). We have filtered the data to only include North Carolina.

> "The FAA Wildlife Strike Database contains records of reported wildlife strikes since 1990. Strike reporting is voluntary. Therefore, this database only represents the information we have received from airlines, airports, pilots, and other sources." - [FAA website](https://wildlife.faa.gov/home)

### What is a Python library?

A "Library" in this context is a package of code that adds to the functionality of Python. Base Python offers a lot of features, but not everything -- Python libraries can be imported at the beginning of your code to use for your specific purpose. 

For example, you may import Matplotlib to create graphs and plots, or Natural Language Toolkit (NLTK) to do natural language processing. Today we will be using the pandas library to manipulate a dataset.
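
As a small illustration of what importing a library looks like, the sketch below uses `statistics`, a module that ships with Python. It was chosen only as an example, and the numbers are made up rather than taken from the workshop data.

```python
# Import the statistics module, which is part of Python's standard library
import statistics

# Hypothetical values, made up for illustration
strikes_per_year = [12, 15, 9, 20]

# Functions from a library are called with the library name as a prefix
average = statistics.mean(strikes_per_year)
print(average)
```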

### What is Pandas?

Pandas is a high-level data manipulation tool first created in 2008 by Wes McKinney. The name comes from the term “panel data,” an econometrics term for data sets that include observations over multiple time periods for the same individuals.<sup>[[wikipedia](https://en.wikipedia.org/wiki/Pandas_(software))]</sup>

From Jake Vanderplas’ book [**Python Data Science Handbook**](http://shop.oreilly.com/product/0636920034919.do):

> As well as offering a convenient storage interface for labeled data, Pandas implements a number of powerful data operations familiar to users of both database frameworks and spreadsheet programs.

#### What does Pandas do?
* Reading and writing data from persistent storage
* Cleaning, filtering, and otherwise preparing data
* Calculating statistics and analyzing data
* Visualization with help from Matplotlib

We can learn more about Pandas by using the help window in Google Colab.

In [None]:
# Type a name followed by a question mark and run the cell to pull up a help window.
# The name must already be defined, so we import Pandas as pd first.
# Here we will find out more about Pandas
import pandas as pd
pd?

### Importing a Python library

To use any library, we must import it into our Python document.

In [None]:
# Import the Pandas library as pd (callable in our code as pd)
import pandas as pd
pd?

### Importing files into Pandas
We have prepared the data from the FAA website for this workshop. We will import those datasets into our notebook to use them for data analysis.

Datasets can be stored in several types of files, including .csv, .json, .txt, .xls, .xlsx, and more. Here we will import a .csv file and a .json file.

CSV Files

A comma separated values (CSV) file is a plain text file containing data separated by commas.
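
To make the format concrete, here is a minimal sketch that reads a tiny CSV document held in a string instead of a file. The column names and values are made up for illustration, and `io.StringIO` is used only so that the example runs without downloading anything.

```python
import io
import pandas as pd

# A tiny CSV document: the first line holds the column names,
# and each following line is one row (values are made up)
csv_text = """AIRPORT,SPECIES,HEIGHT
Airport A,Mourning dove,50
Airport B,Canada goose,300
"""

# read_csv accepts a file path, a URL, or a file-like object such as StringIO
sample_csv = pd.read_csv(io.StringIO(csv_text))
print(sample_csv)
```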

In [None]:
# Import a comma-separated values (csv) file as a DataFrame

# The file location
csv_file_url = 'https://raw.githubusercontent.com/NCSU-Libraries/data-viz-workshops/master/Python_Open_Labs/data/FAA_Wildlife_strikes_1990-1999.csv'

# Read in the file and display the first five rows of the DataFrame
wl_strikes_csv = pd.read_csv(csv_file_url)
wl_strikes_csv.head()

JSON Files

JSON (JavaScript Object Notation) is a data storage format that uses name/value pairs to create objects and associative arrays. Learn more about [JSON file structure and syntax from W3Schools](https://www.w3schools.com/js/js_json_syntax.asp).
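
The sketch below shows one way name/value pairs can map onto rows and columns. It assumes a layout in which each top-level name identifies a row, which is what `orient='index'` tells pandas to expect. The record numbers and values are made up for illustration.

```python
import io
import pandas as pd

# A tiny JSON document: each top-level name is a row label, and each
# nested object holds that row's column names and values (all made up)
json_text = (
    '{"101": {"AIRPORT": "Airport A", "HEIGHT": 50},'
    ' "102": {"AIRPORT": "Airport B", "HEIGHT": 300}}'
)

# orient='index' tells pandas that the top-level names are row labels
sample_json = pd.read_json(io.StringIO(json_text), orient='index')
print(sample_json)
```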

In [None]:
# Importing a JavaScript object notation (JSON) file

# The file location
json_file_url = 'https://raw.githubusercontent.com/NCSU-Libraries/data-viz-workshops/master/Python_Open_Labs/data/FAA_Wildlife_strikes_2010-2019.json'

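# Read in the file and display the first five rows of the DataFrame
# Newer versions of pandas require orient to be passed as a keyword argument
wl_strikes_json = pd.read_json(json_file_url, orient='index')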
wl_strikes_json.head()

### Pandas data structures

Pandas uses two main data structures: `Series` and `DataFrame`.

<img src="https://raw.githubusercontent.com/NCSU-Libraries/data-viz-workshops/master/Data_Manipulation_with_Python/assets/nc_dataframes.png" alt="DataFrames are composed of Series" width="80%">

#### `Series`
A `Series` is a one-dimensional array of indexed data, or a single column of data. It can be thought of as a specialized dictionary or a generalized NumPy array. You can learn more about the Series data type in the [Pandas documentation for Series](https://pandas.pydata.org/pandas-docs/stable/reference/series.html).
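
A minimal sketch of both views of a `Series`, using made-up counts: the dictionary-like side is the lookup by index label, and the array-like side is arithmetic applied to every value at once.

```python
import pandas as pd

# Build a Series from a dictionary: keys become the index, values become the data
# (the counts are made up for illustration)
strike_counts = pd.Series({'Dawn': 3, 'Day': 12, 'Dusk': 4, 'Night': 7})

# Dictionary-style lookup by index label
print(strike_counts['Day'])

# Array-style arithmetic applied to every value at once
doubled = strike_counts * 2
print(doubled)
```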

#### `DataFrame`
A `DataFrame` is a two-dimensional data structure composed of one or more `Series`, similar to tabular data (think of Excel). A `DataFrame` has an `Index` of row labels as well as column labels, and both are flexible.

It can be thought of as a generalization of a two-dimensional NumPy array, or a specialization of a dictionary in which each column name maps to a `Series` of column data. You can learn more about the DataFrame data type in the [Pandas documentation for DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/frame.html).

A `DataFrame` is made up of `Series` in much the same way that a table is made up of columns. The only restriction is that the values within a single column share one data type, although different columns can have different types.  Many of the operations that can be performed on a `DataFrame` can also be performed on an individual `Series`.
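
As a rough sketch of these ideas, the example below builds a small `DataFrame` from a dictionary that maps column names to column data. The values are made up for illustration. Each column reports its own data type, and pulling out a single column gives back a `Series`.

```python
import pandas as pd

# Each dictionary key becomes a column label, each list becomes a column
sample_df = pd.DataFrame({
    'AIRPORT': ['Airport A', 'Airport B', 'Airport C'],
    'HEIGHT': [50, 300, 1200],
})

# Every column has its own data type
print(sample_df.dtypes)

# Selecting one column returns a Series
height = sample_df['HEIGHT']
print(type(height))

# Operations such as max() work on a Series as well as on a DataFrame
print(height.max())
```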

In [None]:
# The csv file we imported earlier was stored in a DataFrame.
# Let's look at that data:
wl_strikes_csv

In [None]:
# You can also view the "shape" of the DataFrame
# This tells you how many rows and columns there are
wl_strikes_csv.shape

(669, 92)

In [None]:
# A Series is a one-dimensional array, or one column of data
# When we take one column of a DataFrame, it is represented as a Series
airport = wl_strikes_csv['AIRPORT']
type(airport)

pandas.core.series.Series

In [None]:
# Now that we have created a Series, let's look at the data:
airport

In [None]:
# You can also see the shape of a Series
# Since a Series only has one column, it will tell you how many rows there are
airport.shape

(669,)

In [None]:
# You can convert a Series to a list with to_list()
airport.to_list()

### Exploring your data

Now that we have our data, we can use Pandas to explore it before analysis. This is especially useful when you are new to a dataset and want to see what it contains and where to start analyzing.

#### View DataFrame column labels

Our DataFrame has 92 columns. We can quickly view the label names for each column using the DataFrame `columns` property.

In [None]:
# View column labels (headers)
wl_strikes_csv.columns

#### View summaries of a DataFrame

We can quickly generate summaries of our DataFrame to observe some basic statistics and information such as column data types and non-null value counts.
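
Before running these summaries on the full dataset, it can help to see them on a few rows where the results are easy to check by hand. The sketch below uses made-up values, including two missing ones, to show how missing values are left out of the counts.

```python
import pandas as pd

# A small made-up DataFrame with one missing value in each column
sample_df = pd.DataFrame({
    'SPECIES': ['Gull', 'Gull', 'Hawk', None],
    'HEIGHT': [50.0, 300.0, None, 100.0],
})

# describe() on a DataFrame summarizes the numerical columns by default
numeric_summary = sample_df.describe()
print(numeric_summary)

# describe() on a text column reports count, unique, top, and freq instead
species_summary = sample_df['SPECIES'].describe()
print(species_summary)

# info() prints column data types and non-null counts
sample_df.info()
```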

In [None]:
# Get summary statistics of DataFrame columns using "describe()" (only includes
# numerical data types)
wl_strikes_csv.describe()

In [None]:
# Get summary statistics of single column using "describe()"
wl_strikes_csv['AIRCRAFT'].describe()

In [None]:
# Summarize column data types, non-null values, and memory usage using "info()"
wl_strikes_csv.info()

#### Referencing and indexing a DataFrame

Referencing Rows

In [None]:
# Reference a row by index label
# Returns a Series

# Access first row of wl_strikes_csv by index label
# In this case the index label is 0
wl_strikes_csv.loc[0]

# Access first row of wl_strikes_json by index label
# In this case the index label is not 0
# wl_strikes_json.loc[0]
wl_strikes_json.loc[1080125]

In [None]:
# Reference multiple rows by index label (in this case the index label 0 through 2)
# Returns a DataFrame
wl_strikes_csv.loc[0:2]

In [None]:
# Reference a row or multiple rows by zero-based integer position

# Access first row of wl_strikes_csv by row integer value
# In this case the row is row 0
wl_strikes_csv.iloc[0]

# Access first row of wl_strikes_json by row integer value
# In this case the row is also row 0
wl_strikes_json.iloc[0]

In [None]:
# Reference multiple rows by row number (in this case rows 0 through 2)
# Note that this time the range doesn't include the stop number
wl_strikes_csv.iloc[0:3]
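
The difference between label-based and position-based referencing is easiest to see when the index labels are not 0, 1, 2, and so on. The sketch below uses a made-up `DataFrame` whose index labels are hypothetical record numbers.

```python
import pandas as pd

# Made-up rows whose index labels are record numbers rather than 0, 1, 2, ...
sample_df = pd.DataFrame(
    {'AIRPORT': ['Airport A', 'Airport B', 'Airport C', 'Airport D']},
    index=[501, 502, 503, 504],
)

# .loc slices by label and includes the stop label
by_label = sample_df.loc[501:503]

# .iloc slices by position and excludes the stop position
by_position = sample_df.iloc[0:3]

# Both selections contain the same three rows
print(by_label.equals(by_position))
```

In this example `sample_df.loc[0]` would raise a `KeyError`, because no row carries the label 0, while `sample_df.iloc[0]` still returns the first row.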

Referencing Columns

In [None]:
# Referencing a column by column label (in this case, "INDX_NR")
wl_strikes_csv['INDX_NR']

In [None]:
# Referencing multiple columns by a list of column labels 
# (in this case, the columns "INDX_NR" and "AIRPORT")
wl_strikes_csv[['INDX_NR', 'AIRPORT']]

Referencing both rows and columns

In [None]:
# Referencing a subset of rows and columns using index and column labels
# Note that we're using a range of column labels instead of a list
# Make sure that your column range starts with the leftmost label
wl_strikes_csv.loc[:10, 'INDX_NR':'TIME']

### Writing data to a file

In [None]:
# Save a subset like the one in the previous cell in a variable
# .loc slices include the stop label, so labels 0 through 9 give exactly ten rows
first_ten = wl_strikes_csv.loc[:9, 'INDX_NR':'TIME']

# Write to csv
first_ten.to_csv("new_data.csv")

In [None]:
# Write to an Excel file (newer versions of pandas can no longer write the older .xls format)
first_ten.to_excel("new_data.xlsx")

In [None]:
# Write to a JSON file
first_ten.to_json("new_data.json")
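
One detail worth knowing when writing files is that `to_csv` writes the row index as an extra first column unless you pass `index=False`. The sketch below uses a small made-up `DataFrame` and leaves out the file path, in which case `to_csv` returns the CSV text as a string, so the effect can be seen without creating any files.

```python
import io
import pandas as pd

# A small made-up DataFrame
sample_df = pd.DataFrame({
    'INDX_NR': [501, 502],
    'AIRPORT': ['Airport A', 'Airport B'],
})

# With no file path, to_csv returns the CSV text instead of writing a file
with_index = sample_df.to_csv()
without_index = sample_df.to_csv(index=False)

print(with_index)
print(without_index)

# Reading the text back in gives a DataFrame with the same columns
round_trip = pd.read_csv(io.StringIO(without_index))
print(round_trip)
```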

----
## Open work time
You can use this time to ask questions, collaborate, or work on the following activities (on your own or in a group). 

### Exercise 1: Read in an Excel file
Take this Excel file, read it into a DataFrame, and print out the first five rows of the DataFrame.



> Hint: the syntax is very similar to reading a .csv file.



Link to the file: https://github.com/NCSU-Libraries/data-viz-workshops/blob/master/Python_Open_Labs/data/FAA_Wildlife_strikes_2000-2009.xlsx?raw=true

In [None]:
# Save the url as a variable
xls_file_url = 'https://github.com/NCSU-Libraries/data-viz-workshops/blob/master/Python_Open_Labs/data/FAA_Wildlife_strikes_2000-2009.xlsx?raw=true'

# Read the file in
wl_strikes_xls = pd.read_excel(xls_file_url)

# View the file
wl_strikes_xls.head()

### Exercise 2: Indexing cells

Use referencing and indexing to answer the following questions by finding the data in the rows, columns, and/or cells. 



#### 2a. Time of day
Airlines are interested in when they should schedule flights to minimize collisions. What is the time of day for each incident? Create a `Series` of the time of day (`TIME_OF_DAY`).

In [None]:
# 2a. Create a `Series` of the time of day (`TIME_OF_DAY`)
time_of_day = wl_strikes_csv['TIME_OF_DAY']

# Print new Series
time_of_day

#### 2b. Date and time
We want to find out when most of these collisions occur. What is the exact date and time of each incident? Print the third, fourth, and fifth columns from the data (`INCIDENT_MONTH`, `INCIDENT_YEAR`, and `TIME`).

In [None]:
# 2b. Print the third, fourth, and fifth columns from the data 
# (`INCIDENT_MONTH`, `INCIDENT_YEAR`, and `TIME`).
wl_strikes_csv[['INCIDENT_MONTH', 'INCIDENT_YEAR', 'TIME']]

#### 2c. Access the 126th row

Use row indexing to find the data in the 126th row in the `wl_strikes_json` DataFrame. Check that your result is correct by making sure your `INCIDENT_DATE` value is `2020-07-17`.

> Tip: Remember that the integer-based row location is zero based



In [None]:
# 2c. Access the 126th row from the `wl_strikes_json` DataFrame
wl_strikes_json.iloc[125]

#### 2d. Cloud cover
A particular airline has nine flights that they want to compare to see if the cloud cover in the area had anything to do with the collision. Print rows 65-73 and the columns `INDX_NR`, `SKY`, `PHASE_OF_FLIGHT`, and `AIRPORT`.

In [None]:
# 2d. Print rows 65-73 and the columns 'INDX_NR', 'SKY', 'PHASE_OF_FLIGHT', and
# 'AIRPORT'
cloud_cover = wl_strikes_csv.loc[65:73, ['INDX_NR', 'SKY', 'PHASE_OF_FLIGHT', 'AIRPORT']]
cloud_cover

### Exercise 3: Write to a file
Take your result from Exercise 2d (or another DataFrame you have created), and write it to a .csv file.

In [None]:
# Write to a new .csv file
cloud_cover.to_csv("exercise3.csv")



---


## Finding Help with Pandas

The [Pandas website](https://pandas.pydata.org/) and [online documentation](http://pandas.pydata.org/pandas-docs/stable/) are useful resources, and of course the indispensable [Stack Overflow has a "pandas" tag](https://stackoverflow.com/questions/tagged/pandas).  There is also a (much younger, much smaller) [sister site dedicated to Data Science questions that has a "pandas" tag](https://datascience.stackexchange.com/questions/tagged/pandas).

## Evaluation Survey
Please spend one minute answering these questions to help us improve future workshops.

https://go.ncsu.edu/dvs-eval

## Credits

This workshop was created by Claire Cahoon and Walt Gurley, adapted from previous workshop materials by Scott Bailey and Simon Wiles, of Stanford Libraries.