## Setup

With this Google Colaboratory (Colab) notebook open, click the "Copy to Drive" button that appears in the menu bar. The notebook will then be attached to your own user account, so you can edit it in any way you like -- you can even take notes directly in the notebook.

<h1 align="center">Python Open Labs: Exploratory analysis and visualization with Pandas</h1>

## Welcome!

### Instructors
- Walt Gurley
- Claire Cahoon
- Scott Bailey
- Natalia Lopez
- Ashley Evans Bandy

### Open Labs agenda

1.   **Guided activity**: One of the instructors will share their screen to work through the guided activity and teach concepts along the way.

2.   **Open lab time**: After the guided portion of the Open Lab, the rest of the time is for you to ask questions, work collaboratively, or have self-guided practice time. You will have access to instructors and peers for questions and support.

Breakout rooms will be available if you would like to work in small groups. If you have trouble joining a room, ask in the chat to be moved into a room.

### Learning objectives

By the end of our workshop today, we hope you'll understand how to aggregate, group, and visualize data in Pandas.

### Today's Topics
- Aggregation methods (average, minimum, maximum)
- Grouping a dataframe with `groupby`
- Creating a bar chart
- Creating a line chart
- Using other data visualization libraries


### Using Zoom

Please make sure that your mic is muted during the workshop.

We will have live captioning enabled. You can switch it on and off from the toolbar at the bottom of the screen.

### Asking questions

Please feel free to ask questions in the Zoom chat throughout the demonstration.

Other instructors will be monitoring the Zoom chat. They will answer as they are able, and will collect questions whose answers might help everyone so that we can address them at the end of the demonstration.

The open lab time is when you will be able to ask more questions and work together on the exercises.

### Using Jupyter Notebooks and Google Colaboratory

Jupyter notebooks are a way to write and run Python code in an interactive way. They're quickly becoming a standard way of putting together data, code, and written explanations or visualizations into a single document and sharing that. There are a lot of ways that you can run Jupyter notebooks, including just locally on your computer, but we've decided to use Google's Colaboratory notebook platform for this workshop.  Colaboratory is “a Google research project created to help disseminate machine learning education and research.”  If you would like to know more about Colaboratory in general, you can visit the [Welcome Notebook](https://colab.research.google.com/notebooks/welcome.ipynb).

Using the Google Colaboratory platform allows us to focus on learning and writing Python in the workshop rather than on setting up Python, which can sometimes take a bit of extra work depending on platforms, operating systems, and other installed applications. If you'd like to install a Python distribution locally, though, we're happy to help. Feel free to [get help from our graduate consultants](https://www.lib.ncsu.edu/dxl) or [schedule an appointment with Libraries staff](https://go.ncsu.edu/dvs-request).



---


## Guided Instruction
This week we're focusing on exploratory analysis of our dataset. We'll go through the process of asking a question, using Pandas to find the data to help answer it, and then visualizing that data.

Content Warning: This dataset contains information relating to violence towards animals. We understand that this may be distressing; please feel free to step away from the workshop if you need to.

In this section, we will work through examples using data from the [Federal Aviation Administration (FAA) Wildlife Strikes Database](https://wildlife.faa.gov/search). We have filtered the data to only include North Carolina.

> "The FAA Wildlife Strike Database contains records of reported wildlife strikes since 1990. Strike reporting is voluntary. Therefore, this database only represents the information we have received from airlines, airports, pilots, and other sources." - [FAA website](https://wildlife.faa.gov/home)

### Importing the dataset

In [None]:
# Import the Pandas library as pd (callable in our code as pd)


In [None]:
# Import the data from a csv file (wildlife strike data from the years 1990-2020)
# This dataset was cleaned based on methods from previous workshops
csv_file_url = 'https://raw.githubusercontent.com/NCSU-Libraries/data-viz-workshops/master/Python_Open_Labs/data/FAA_Wildlife_strikes_clean.csv'


# Print out the first five rows of the dataset

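As a minimal sketch of the two steps above, the example below reads a few made-up rows from an in-memory CSV instead of the workshop file. The column names mirror ones used later in this notebook, but the values are invented and the variable name `df` is an assumption. With the real data you would pass `csv_file_url` to `pd.read_csv()` instead.

```python
import io

import pandas as pd

# Made-up rows standing in for the workshop CSV (illustration only)
sample_csv = io.StringIO(
    "INCIDENT_YEAR,INCIDENT_MONTH,SPECIES,HEIGHT\n"
    "2018,5,Mourning dove,50\n"
    "2018,8,Killdeer,0\n"
    "2019,8,Barn swallow,200\n"
    "2019,9,Mourning dove,100\n"
    "2020,3,White-tailed deer,0\n"
    "2020,10,Killdeer,25\n"
)

# read_csv() accepts a URL, a file path, or a file-like object
df = pd.read_csv(sample_csv)

# head() returns the first five rows by default
print(df.head())
```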

### Exploratory analysis of the dataset

After observing and cleaning our dataset, it is now easier to conduct analyses on our data. We will conduct some numerical and visual analyses that will help us explore questions such as:

- How many unique species have been identified in the data set?
- Which species are struck the most?
- How has the number of strikes changed over time?
- Are there times of the year when most strikes occur?
- How frequently are land-based animals involved?

We can do this by calculating summaries of rows and columns, grouping data, and visualizing the results.

### Aggregation methods

There are several methods that can be used to calculate aggregated values from the dataset, such as the number of unique values, unique value counts, minimum, maximum, and average.

#### Unique


We can use the `unique()` method on the "SPECIES" column to create an array of unique species names. 

The length of this array will provide the number of unique species.

In [None]:
# Create an array of the unique species with unique()


# Print out the unique species


In [None]:
# Get the length of the new array using len()
# How many unique species are there?

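A small sketch of counting unique values, using a made-up `SPECIES` column rather than the workshop data (the values and the name `unique_species` are assumptions for illustration). It also shows `nunique()`, which returns the count directly.

```python
import pandas as pd

# Made-up species values (illustration only)
df = pd.DataFrame(
    {
        "SPECIES": [
            "Mourning dove",
            "Killdeer",
            "Mourning dove",
            "White-tailed deer",
            "Killdeer",
        ]
    }
)

# unique() returns an array of the distinct values in order of first appearance
unique_species = df["SPECIES"].unique()
print(unique_species)

# len() of that array is the number of distinct species
print(len(unique_species))

# nunique() gives the same count in one step
print(df["SPECIES"].nunique())
```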

#### Value counts

Value counts show how many instances there are of each unique entry in a column. Here, we are interested in seeing how many incidents there are in each month. This could tell us which months have the most strikes.

We will specify the `INCIDENT_MONTH` column in our dataset and call the method `value_counts()`. This will return a Series with an index label of each unique month number and a value corresponding to the count of that month number in the `INCIDENT_MONTH` column.

In [None]:
# Count the occurrences of unique values in the column 'INCIDENT_MONTH'


# Sort the Series by its index using sort_index(), to help view yearly trend

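The sketch below illustrates `value_counts()` followed by `sort_index()` on a made-up `INCIDENT_MONTH` column; the numbers are invented and the variable names are assumptions.

```python
import pandas as pd

# Made-up month numbers (illustration only)
df = pd.DataFrame({"INCIDENT_MONTH": [7, 8, 8, 3, 7, 8, 10]})

# value_counts() orders the result by count, largest first
month_counts = df["INCIDENT_MONTH"].value_counts()
print(month_counts)

# sort_index() reorders the Series by month number instead
monthly = month_counts.sort_index()
print(monthly)
```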

#### Minimum, maximum, average...

We can also calculate aggregates like the minimum, maximum, and mean of values in a DataFrame or Series. Here are a few examples:

- `mean()` to find the average of a range
- `min()` to find the smallest value
- `max()` to find the largest value
- `sum()` to sum the values of a range

In [None]:
# Calculate the minimum values for each column
# Note the minimum values for columns not containing numbers (e.g., the minimum
# for strings is alphabetical order, with uppercase characters preceding
# lowercase characters - "B" comes before "a")


In [None]:
# Calculate the average height at which all strikes occurred


We can use the `agg()` method to call multiple aggregate functions at once.

In [None]:
# Calculate the minimum, maximum, and average hour in which all strikes occurred

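A minimal sketch of these aggregation methods on made-up data. The column names `HEIGHT` and `INCIDENT_HOUR` are assumptions chosen for illustration and may not match the names in the workshop file.

```python
import pandas as pd

# Made-up rows; column names are assumed for illustration
df = pd.DataFrame(
    {
        "SPECIES": ["Killdeer", "mourning dove", "Barn swallow"],
        "HEIGHT": [0, 100, 500],
        "INCIDENT_HOUR": [6, 14, 22],
    }
)

# min() on a DataFrame returns the minimum of every column;
# for text columns, uppercase letters sort before lowercase ones
column_minimums = df.min()
print(column_minimums)

# mean() on a single column returns one number
average_height = df["HEIGHT"].mean()
print(average_height)

# agg() runs several aggregations at once and returns a Series
hour_stats = df["INCIDENT_HOUR"].agg(["min", "max", "mean"])
print(hour_stats)
```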

### Group values using groupby

We may be interested in seeing our data in groups. For example, what does the data look like if we group by month and find the average of each column? Which month has the highest cost for repairs?

We can do this by calling `groupby()` on our dataset and passing in the column we would like to group by. We will group our data by the column `INCIDENT_MONTH`, find the mean of each column in the grouped data, and sort by the `COST_REPAIRS_INFL_ADJ` column to see which month has the highest average cost.

In [None]:
# Group the dataset by "INCIDENT_MONTH"


# This creates a groupby object that contains information about the groups


# Find the mean of the grouped data, then sort by "COST_REPAIRS_INFL_ADJ"

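The grouping steps can be sketched on a few made-up rows as follows. The values are invented, and `numeric_only=True` is included on the assumption that the data contains text columns, which recent pandas versions refuse to average.

```python
import pandas as pd

# Made-up rows (illustration only)
df = pd.DataFrame(
    {
        "INCIDENT_MONTH": [1, 1, 6, 6, 9],
        "SPECIES": ["Killdeer", "Killdeer", "Barn swallow", "Killdeer", "Barn swallow"],
        "COST_REPAIRS_INFL_ADJ": [100.0, 300.0, 50.0, 150.0, 1000.0],
    }
)

# groupby() returns a GroupBy object; nothing is calculated yet
by_month = df.groupby("INCIDENT_MONTH")
print(by_month.ngroups)

# Average the numeric columns within each month, then sort so the
# month with the highest average repair cost comes first
monthly_means = by_month.mean(numeric_only=True).sort_values(
    "COST_REPAIRS_INFL_ADJ", ascending=False
)
print(monthly_means)
```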

You can also use `groupby()` to group data by multiple variables. We will create a hierarchical grouping of `INCIDENT_MONTH` and then `PRECIPITATION` to see the counts of different types of precipitation in each month using the `size()` groupby method.

In [None]:
# Preview the unique values in the "PRECIPITATION" column


# Group the data by month and then precipitation and get the count of each type
# of precipitation during each month using size()

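A short sketch of grouping by two columns and counting rows with `size()`. The precipitation labels here are made up and may differ from the values in the workshop data.

```python
import pandas as pd

# Made-up rows (illustration only)
df = pd.DataFrame(
    {
        "INCIDENT_MONTH": [1, 1, 1, 6, 6],
        "PRECIPITATION": ["Rain", "Snow", "Rain", "Fog", "Fog"],
    }
)

# Preview the distinct precipitation labels
print(df["PRECIPITATION"].unique())

# Passing a list of columns creates a hierarchical (two-level) grouping;
# size() counts the rows in each month/precipitation combination
precip_counts = df.groupby(["INCIDENT_MONTH", "PRECIPITATION"]).size()
print(precip_counts)
```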

### Data Visualization with matplotlib

We can use Pandas to create visualizations from our data, like bar charts and line graphs. We will start by using the [matplotlib visualization library](https://matplotlib.org/index.html) to create charts.

Import the [pyplot interface](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.html#module-matplotlib.pyplot) as `plt` to access the plotting functionality of matplotlib.

For all of our charts we will use the matplotlib integration with pandas data structures by calling the method [`plot()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.html#pandas.DataFrame.plot) on a DataFrame or Series.

In [None]:
# Import the matplotlib pyplot interface as plt (callable in our code as plt)


#### Create a Bar Chart

We want to find the top ten most struck species and show the number of incidents in which they were involved.

**Questions to consider with this data:**

- What is the most struck species?
- Approximately how many times has the tenth most struck species been involved in aircraft strikes?

We will aggregate the `SPECIES_GENERALIZE` column using the `value_counts()` method.

The `SPECIES_GENERALIZE` column is a generalized version of the `SPECIES` column in which the generalized name "Unknown flying animal" is substituted for various "Unknown" listings (e.g., "Unknown bird - large", "Unknown bat", etc.) in the `SPECIES` column. We need to remove all values with the label `Unknown flying animal` using the `drop()` method so we only include identifiable species in our analyses.

In [None]:
# Get the number of strike records for each species type using "SPECIES_GENERALIZE"


# Drop the row from the Series with the index label "Unknown flying animal"


# Print the first 10 values using head()


Now that the data is formatted, we can create a [bar graph](https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.bar.html) to visualize the ten species that are most involved in strikes and the count of records of each.

We create a bar plot of the first ten values of the "unique species" Series by calling the `plot()` method on our Series with the keyword argument `kind='bar'`.

In [None]:
# Select the first ten values from the unique species count list using iloc


# Create a bar chart ("bar") with the labels (species name) along the horizontal
# axis and value counts (number of records) along the vertical axis


In [None]:
# Now create a horizontal bar chart ("barh") with the labels (species name) along 
# the vertical axis and value counts (number of records) along the horizontal axis

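The full bar chart workflow (count, drop the unknowns, take the top entries, plot) might look like the sketch below. The species counts are made up, and the variable names are assumptions.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up generalized species labels (illustration only)
df = pd.DataFrame(
    {
        "SPECIES_GENERALIZE": ["Mourning dove"] * 3
        + ["Unknown flying animal"] * 4
        + ["Killdeer"] * 2
        + ["White-tailed deer"]
    }
)

# Count the records for each label, most frequent first
species_counts = df["SPECIES_GENERALIZE"].value_counts()

# drop() removes the row with the given index label
species_counts = species_counts.drop("Unknown flying animal")

# iloc[:10] selects up to the first ten rows by position
top_species = species_counts.iloc[:10]
print(top_species)

# kind="bar" draws vertical bars; kind="barh" would draw horizontal ones
ax = top_species.plot(kind="bar")
```

Because `value_counts()` already sorts by count, selecting the first rows by position gives the most frequently struck species.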

#### Setting global chart styles

We can set some basic global graphic parameters for our plots, namely the overall [style and size of the plots](https://matplotlib.org/3.3.3/tutorials/introductory/customizing.html), by editing `plt.rcParams` and calling `plt.style.use()`. These settings will apply to all plots created after running the following cells.

You can view examples of various styles in the [Matplotlib Style Gallery](https://tonysyu.github.io/raw_content/matplotlib-style-gallery/gallery.html).

In [None]:
# Set the default size of the plots to 10 inches wide and 8 inches tall


In [None]:
# Set the default graphic style of the plots
# See more styles in the "Matplotlib Style Gallery", here we will change the
# style to 'ggplot'

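One way to apply these global settings is sketched below. The style is applied before the size here as a precaution, in case a style sheet defines its own figure size; that ordering is a choice made for this sketch rather than a requirement.

```python
import matplotlib.pyplot as plt

# List the style names that this matplotlib installation provides
print(plt.style.available)

# Apply the 'ggplot' style to all plots created from now on
plt.style.use("ggplot")

# Set the default figure size to 10 inches wide and 8 inches tall
plt.rcParams["figure.figsize"] = (10, 8)
```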

After setting these global styles, re-run the horizontal bar chart above to see the formatting differences.

#### Create a Line Graph

We want to show the change in total strikes per year over the year range of our dataset.

**Questions to consider with this data:**

- What is the general trend in wildlife strikes over time? What could this trend be related to?
- Should there be any considerations for the year range of our dataset? Are there years that we can remove?

We will first calculate the `value_counts()` of the `INCIDENT_YEAR` column.

In [None]:
# Get the number of strikes that occurred in each year using value_counts()


# Print out the strikes per year series


We then sort the values based on the index, the year of occurrence, so that our data is sorted by year and not value counts.

In [None]:
# Sort the new Series by the index values, the year


# Print out the sorted strikes per year series


We will use the matplotlib "line" plot type to create a [line chart](https://matplotlib.org/3.1.0/api/_as_gen/matplotlib.pyplot.plot.html) that shows the change in total strikes per year over the year range of our dataset.

To create the line plot we call the `plot()` method on the `strikes_per_year` Series and pass in the keyword argument `kind='line'`.


In [None]:
# Plot the total strikes per year


#### Set the style of the plot

We can manipulate the graphical encoding of our plot data in multiple ways. We can use the keyword argument `style` in our plot method to pass a shorthand styling string. For example, the argument `style='o-g'` is shorthand to indicate that we want our data points (markers) to be represented by a green (g) circle (o) and connected by a solid line (-). You can find other shorthand style arguments for formatting markers and lines in the [matplotlib plot documentation](https://matplotlib.org/2.1.2/api/_as_gen/matplotlib.pyplot.plot.html).

In [None]:
# Plot the total strikes per year using a solid green line connecting data
# points and a solid green circle representing a data point


You can also set specific data encoding styles using various keyword arguments in the `plot()` method. In this example we explicitly set the color of the line (`color='hotpink'`), the line width (`linewidth=3`), the marker style (`marker='o'`), the marker fill (`markerfacecolor='white'`), the marker stroke (`markeredgecolor='black'`), and the marker size (`markersize=8`).

In [None]:
# Plot the total strikes per year using a solid hot pink line connecting data
# points represented by a white circle with a black outline


To make our chart more descriptive and easier to interpret, we can set the plot title and axis labels. Here we will store our plot as a matplotlib `Axes` object in a variable by calling the `plot()` method. We can then call the methods `set_title()`, `set_xlabel()`, and `set_ylabel()` on this variable to set a descriptive title, x axis label, and y axis label, respectively.

Note that you can also set these labels in the `plot()` method using keyword arguments (e.g., `xlabel`, `ylabel`, `title`) similar to the previous way in which we set style parameters.

In [None]:
# Create a variable that stores the plot


# Change titles and axis labels with set_title(), set_xlabel(), and set_ylabel()

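Putting the line chart steps together, a minimal sketch with made-up years could look like this. The title and label text are placeholders, not the wording used in the workshop.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up incident years (illustration only)
df = pd.DataFrame({"INCIDENT_YEAR": [2019, 2017, 2018, 2019, 2018, 2019]})

# Count strikes per year, then sort by year rather than by count
strikes_per_year = df["INCIDENT_YEAR"].value_counts().sort_index()
print(strikes_per_year)

# 'o-g' is shorthand for circle markers, a solid line, and green;
# plot() returns a matplotlib Axes object that we keep in a variable
ax = strikes_per_year.plot(kind="line", style="o-g")

# Placeholder title and axis labels
ax.set_title("Wildlife strikes per year")
ax.set_xlabel("Year")
ax.set_ylabel("Number of strikes")
```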

#### Saving a plot

`matplotlib` provides functionality for saving a figure as an image using the method `savefig()`. We need to get the figure from our plot by calling `get_figure()` on the matplotlib `Axes` object and then calling the method `savefig()` on that figure.

In [None]:
# Save the strikes per year plot as a png


# Save a high resolution version of the plot by setting a high dpi


# Save a vectorized version of the plot (svg)

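The sketch below saves a small made-up plot in three formats. It writes to a temporary directory so that it can run anywhere; the file names and the `dpi=300` value are assumptions for illustration. In Colab you could pass a plain file name such as `'strikes_per_year.png'` instead.

```python
import tempfile
from pathlib import Path

import matplotlib.pyplot as plt
import pandas as pd

# Made-up yearly counts (illustration only)
strikes_per_year = pd.Series([1, 2, 3], index=[2017, 2018, 2019])
ax = strikes_per_year.plot(kind="line")

# get_figure() returns the Figure that contains the Axes
fig = ax.get_figure()

# Temporary output folder for this sketch
out_dir = Path(tempfile.mkdtemp())

# The file extension determines the output format
fig.savefig(out_dir / "strikes_per_year.png")

# A higher dpi (dots per inch) produces a higher resolution image
fig.savefig(out_dir / "strikes_per_year_hires.png", dpi=300)

# SVG is a vector format that scales without losing quality
fig.savefig(out_dir / "strikes_per_year.svg")
```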

### Other Data Visualization Libraries

`matplotlib` is not the only option for creating data visualizations. Other libraries offer additional styling and interactive options.

#### Create a Histogram with Seaborn

Seaborn is a common visualization library that builds on matplotlib and provides a higher-level interface for statistical graphics. You can learn more and find documentation on the [seaborn website](https://seaborn.pydata.org/).

In [None]:
# Import the Seaborn library as sns (callable in our code as sns)


In [None]:
# Create a histogram of the incident hour by calling histplot() in Seaborn
# Set the histogram bin width to one and include a kernel density estimate (kde)
# to show a smooth histogram


#### Create an interactive line chart with Altair

Altair is an open-source, declarative visualization library. You can learn more and find documentation on the [Altair website](https://altair-viz.github.io/index.html).

In [None]:
# Import the Altair library as alt (callable in our code as alt)


In [None]:
# Create an interactive Altair line chart of total incidents per year
# Include a tooltip to show the year and incident count when a user
# hovers over a data point




---


## Open work time
You can use this time to ask questions, collaborate, or work on the following activities (on your own or in a group).

### Exercise 1: Aggregation

Find the average, minimum, and maximum from the column 'SPEED'.

In [None]:
# Get general statistics (average, minimum, maximum) of speed when strikes occurred


### Exercise 2: Grouping Values

Find how expensive repairs (the data in column `'COST_REPAIRS_INFL_ADJ'`) were at each time of day (dawn, dusk, day, and night). Group the data using `TIME_OF_DAY`, then show a table with the average cost of repairs for each time of day, sorted from least expensive to most expensive.



> Bonus discussion question: does showing the data this way tell the whole story? What other factors could affect these numbers?



In [None]:
# Create a variable that stores clean_data grouped by TIME_OF_DAY


# Find the mean of the data based on the cost of repairs


# View the data, sorted by least to most expensive


### Exercise 3: Calculate strike totals by month

We're considering whether there might be a relationship between the number of strikes and bird migration periods (March-April and August-November -- see [Bird Migration and Areas With Sensitive Fauna](https://www.faa.gov/air_traffic/publications/atpubs/aip_html/part2_enr_section_5.6.html)). Create a bar chart of total strikes by month to observe any patterns that might occur at the monthly level.



> Hint: To rotate the tick labels, there is a keyword that you can add to the `plot()` method. `rot` specifies the rotation of the x-axis tick labels in degrees (bar charts default to 90; `rot=0` draws the labels horizontally). Example: `your_data.plot(kind='bar', rot=0)`



In [None]:
# Get the number of strikes that occurred in each month


# Sort the new Series by the index values, the month number


# Create a bar chart with the data, rotate the x-axis tick labels by 90 degrees



### Exercise 4: Compare wildlife strikes between flying animals and land animals

Compare wildlife strikes based on the categorization of animals that can or cannot fly (land animals and flying animals are identified in the column `SPECIES_TYPE`). Create a bar chart that visualizes the number in each category.

In [None]:
# Show the counts for each category


In [None]:
# Create a bar chart to compare the counts of land animals vs flying animals


### Exercise 5: Recreate the graphs in Seaborn and Altair
Recreate the graphs in exercises 3 and 4 with the Seaborn and Altair libraries.

Read more about these libraries in the documentation:
- [Seaborn documentation](https://seaborn.pydata.org/)
- [Seaborn `countplot()`](https://seaborn.pydata.org/generated/seaborn.countplot.html?highlight=countplot#seaborn.countplot)
- [Altair documentation](https://altair-viz.github.io/index.html)
- [Altair's "Simple Bar Chart" example](https://altair-viz.github.io/gallery/simple_bar_chart.html)

In [None]:
# Exercise 3 in Seaborn using countplot()


In [None]:
# Exercise 3 in Altair using mark_bar()


In [None]:
# Exercise 4 in Seaborn using countplot()


In [None]:
# Exercise 4 in Altair using mark_bar()


## Further resources

### Filled version of this notebook

[Python Open Labs Week 3 filled notebook](https://colab.research.google.com/github/NCSU-Libraries/data-viz-workshops/blob/master/Python_Open_Labs/Exploratory_analysis_and_visualization_with_Pandas/Python_Open_Labs_Week3_filled.ipynb) - a version of this notebook with all code filled in for the guided activity and exercises.

### Learning resources

- [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/index.html) - a free, online version of Jake VanderPlas' introduction to data science with Python, which includes a chapter on data manipulation with pandas.

- [Python Programming for Data Science](https://www.tomasbeuzen.com/python-programming-for-data-science/README.html) - a website providing a great overview of conducting data science with Python including pandas.

- [Seaborn User guide and tutorial](https://seaborn.pydata.org/tutorial.html) - a guide and tutorial provided on the seaborn website

- [An Introduction to Altair](https://vallandingham.me/altair_intro.html) - a walkthrough introducing the basics of Altair

### Finding help with pandas

The [Pandas website](https://pandas.pydata.org/) and [online documentation](http://pandas.pydata.org/pandas-docs/stable/) are useful resources, and of course the indispensable [Stack Overflow has a "pandas" tag](https://stackoverflow.com/questions/tagged/pandas). There is also a (much younger, much smaller) [sister site dedicated to Data Science questions that has a "pandas" tag](https://datascience.stackexchange.com/questions/tagged/pandas) too.

## Evaluation Survey
Please spend one minute answering these questions to help us improve future workshops.

https://go.ncsu.edu/dvs-eval

## Credits

This workshop was created by Claire Cahoon and Walt Gurley, adapted from previous workshop materials by Scott Bailey and Simon Wiles, of Stanford Libraries.