# Exploring FAA Wildlife Strikes that occurred in North Carolina

This notebook introduces some basic concepts and methods for cleaning and exploring a dataset with Python using the data analysis library pandas and the visualization library Matplotlib.

This module does not teach Python from the ground up. Instead, this notebook provides example code with thorough documentation and a few hands-on opportunities to write your own code. Any hands-on activity is preceded by a text cell with the title “Try it yourself.” Follow the instructions in this text cell to type out and run the required code in the following code cell. A solution for each “Try it yourself” section is provided under the hidden "Solution" section.

The dataset used in this notebook comes from the [Federal Aviation Administration (FAA) Wildlife Strikes Database](https://wildlife.faa.gov/search), a database containing records of reported wildlife strikes with aircraft. The dataset has been filtered to include only wildlife strikes that have occurred in the state of North Carolina.

## Credit

These materials were originally developed for Engineering REU Library Modules by Claire Cahoon, Walt Gurley, and Natalia Lopez in the [Data & Visualization Services Department](https://www.lib.ncsu.edu/department/data-visualization-services) at the [NC State University Libraries](https://www.lib.ncsu.edu/).

## About Jupyter Notebook and Google Colab

Jupyter notebooks are a way to write and run Python code interactively. They're quickly becoming a standard way of combining data, code, and written explanations or visualizations into a single, shareable document. There are many ways to run Jupyter notebooks, including locally on your computer, but you will be using Google's Colaboratory notebook platform for this module. Colaboratory is “a Google research project created to help disseminate machine learning education and research.” If you would like to know more about Colaboratory in general, you can visit the [Welcome Notebook](https://colab.research.google.com/notebooks/welcome.ipynb).

**In short, this notebook consists of text cells and code cells. You will run code cells by pressing the "play" icon in the top-left corner of a code cell. Output from a cell will be printed below the code cell.**

**Be sure to run code cells sequentially, from top to bottom, as the output from one cell may be required as input to a following cell.**

## Import the pandas library

Begin by importing the pandas library. The pandas library provides functionality for importing, manipulating, and exploring data in Python.

We import the library under the alias `pd` and access its functionality by calling `pd.[function()]`. For example, we can read in a CSV file by calling the function `pd.read_csv('path_to_file.csv')`.

Run the Python code cell below by pressing the play button in the upper-left corner of the cell.

In [None]:
# Import the Pandas library as pd (callable in our code as pd)
import pandas as pd

## Import the North Carolina FAA Wildlife Strikes dataset

The dataset consists of a CSV file that contains records of wildlife strikes that have occurred with aircraft (i.e., recorded instances where animals have been hit by an aircraft) in North Carolina. Each row in the dataset contains the information associated with a single reported strike, while each column contains a specific attribute of a strike report (e.g., date of occurrence, airport, species of animal, etc.).

The CSV file is hosted online and is accessible through the URL stored as a string in the variable `file_url`. 

The pandas library provides the function `read_csv` to conveniently read a CSV file and store it in a pandas data structure. Our CSV file consists of a 2-dimensional dataset (multiple rows and columns) and is imported as a [pandas DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) and stored in the variable `wl_strikes`.

Because `wl_strikes` is a pandas DataFrame, we have access to all of the pandas DataFrame methods and properties by calling `wl_strikes.[function()]` or `wl_strikes.[property]`.

In [None]:
# Variable containing the string pointing to the location of the data file
file_url = 'https://raw.githubusercontent.com/NCSU-Libraries/data-viz-instruction/main/Data_Analysis_with_Python/FAA_wildlife_strikes_NC.csv'

# Read in the CSV file and store it as a pandas dataframe in the variable wl_strikes
wl_strikes = pd.read_csv(file_url)
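
To see what `read_csv` does without depending on the hosted file, here is a minimal sketch that parses a tiny CSV held in memory. The three rows and the variable names `csv_text` and `sample_strikes` are invented for illustration and are not taken from the FAA data.

```python
import io

import pandas as pd

# Hypothetical three-row CSV standing in for the hosted FAA file
csv_text = """INCIDENT_MONTH,INCIDENT_YEAR,SPECIES
1,2019,Mourning dove
7,2019,Unknown bird - small
7,2020,White-tailed deer
"""

# read_csv accepts a URL, a file path, or a file-like object such as StringIO
sample_strikes = pd.read_csv(io.StringIO(csv_text))

# Each CSV line becomes a row and each header becomes a column label
print(sample_strikes.shape)
print(sample_strikes.columns.tolist())
```

The same call works on `file_url` because `read_csv` treats a URL string like any other file location.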

## Preview the dataset

It is advisable to always check the data in the DataFrame you just created to ensure it was loaded correctly and the data matches what you expect. Below are a few ways to preview some of the values and properties of the dataset and to quickly summarize the data.

### Observe sample rows from the dataset

Print out the first five rows of the dataset using the [`head()` method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html) on the `wl_strikes` DataFrame.

From the output you can see that our dataset contains information such as the date and time of a reported strike and related information such as the aircraft type, damage location on the aircraft, the species of wildlife, etc. You can scroll right and left in the printed out table to view the columns.

In [None]:
# Print out the first five rows of the dataset
wl_strikes.head()

#### Try it yourself: `sample()`

> This is an opportunity for you to practice some of the concepts introduced so far. Read the directions below and use the examples above to write code that produces the expected results. You can check your code against the solution provided in the hidden **solution section** below (click the triangle next to the Solution header to expand the hidden cells).

**Print out a random sample of 5 rows using the [`sample()` method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.sample.html) on the `wl_strikes` dataframe.**

*Tip: To select **n** number of rows using the `sample()` method you pass the argument **n** into the method call. For example, to randomly sample 2 rows from a dataframe you would call `sample(2)`.*

In [None]:
# Print out a random sample of five rows of the dataset


#### Solution: `sample()`

In [None]:
# Print out a random sample of five rows of the dataset
wl_strikes.sample(5)

### Observe a specific column from the dataset

You can select a single column of data from a DataFrame by using an index reference of the column's name: `DataFrame['column name']`.

Print out the values contained in the "INCIDENT_MONTH" column of the `wl_strikes` DataFrame using an index reference of the column name. Note that this returns a new pandas data structure called a [Series](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html), a one-dimensional array (one column, multiple rows).

In [None]:
# Print out the values contained in the 'INCIDENT_MONTH' column
wl_strikes['INCIDENT_MONTH']
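
The difference between a Series and a DataFrame can be checked directly. The sketch below uses a small invented DataFrame, `toy_strikes`, as a stand-in for `wl_strikes`, and compares single-bracket and double-bracket selection.

```python
import pandas as pd

# Hypothetical stand-in for wl_strikes
toy_strikes = pd.DataFrame({
    'INCIDENT_MONTH': [1, 7, 7],
    'SPECIES': ['Mourning dove', 'Killdeer', 'White-tailed deer'],
})

# Single brackets with one column name return a one-dimensional Series
month_series = toy_strikes['INCIDENT_MONTH']

# Double brackets (a list of column names) return a two-dimensional DataFrame
month_frame = toy_strikes[['INCIDENT_MONTH']]

print(type(month_series).__name__, month_series.shape)
print(type(month_frame).__name__, month_frame.shape)
```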

### Summarize the data

Quickly summarizing the data provides some insight into the size and type of data in the dataset.

Get the number of rows and columns in a dataframe using the [`shape` attribute](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.shape.html) of the `wl_strikes` DataFrame. This is returned as (number of rows, number of columns).

In [None]:
# Print out the shape of the dataset (number of rows, number of columns)
wl_strikes.shape

Use the [`describe()` method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.describe.html) to generate some basic statistics (e.g., count, mean, minimum, maximum) on the `wl_strikes` dataframe. By default, `describe()` summarizes only the numeric columns of a DataFrame that also contains text columns.

In [None]:
# Generate basic statistics on each numeric column of data
wl_strikes.describe()
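
As a minimal sketch of how `describe()` treats different column types, the example below runs it on an invented two-column DataFrame (`toy_strikes` is hypothetical). Passing `include='all'` asks pandas to summarize text columns as well.

```python
import pandas as pd

# Hypothetical stand-in for wl_strikes with one numeric and one text column
toy_strikes = pd.DataFrame({
    'INCIDENT_MONTH': [1, 7, 7, 10],
    'SPECIES': ['Mourning dove', 'Killdeer', 'Killdeer', 'White-tailed deer'],
})

# Default behavior: only the numeric column is summarized
numeric_summary = toy_strikes.describe()
print(numeric_summary.columns.tolist())

# include='all' adds text columns, reported with rows such as 'unique' and 'top'
full_summary = toy_strikes.describe(include='all')
print(full_summary.loc[['count', 'unique', 'top'], 'SPECIES'])
```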

You can also perform summaries on specific columns of data by first selecting the column you wish to observe.

Use the [`unique()` method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.unique.html) on the "INCIDENT_MONTH" column of the `wl_strikes` dataframe to observe the unique values contained in this column (these should be the values 1-12, representing the months January-December).

In [None]:
# Print the unique values contained in the "INCIDENT_MONTH" column
wl_strikes['INCIDENT_MONTH'].unique()
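
A small sketch, using an invented Series named `months`, shows two details worth knowing: `unique()` returns values in order of first appearance rather than sorted, and the related `nunique()` method returns only how many distinct values there are.

```python
import pandas as pd

# Hypothetical month values standing in for the "INCIDENT_MONTH" column
months = pd.Series([7, 1, 7, 10, 1, 7], name='INCIDENT_MONTH')

# unique() returns a NumPy array in order of first appearance
unique_months = months.unique()
print(unique_months)

# Sort the values when a month-ordered list is easier to read
sorted_months = sorted(unique_months)
print(sorted_months)

# nunique() returns the number of distinct values
n_months = months.nunique()
print(n_months)
```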

#### Try it yourself: `value_counts()`

Use the [`value_counts()` method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.value_counts.html) on the "INCIDENT_MONTH" column of the `wl_strikes` DataFrame to get the counts of unique values from this column. This will provide the total number of times a unique value occurs in this column (e.g., how many strikes occurred in the month of January over the entire dataset).

In [None]:
# Print out the value counts of unique values from the "INCIDENT_MONTH" column


#### Solution: `value_counts()`

In [None]:
# Print out the value counts of unique values from the "INCIDENT_MONTH" column
wl_strikes['INCIDENT_MONTH'].value_counts()

## Clean the dataset to prepare it for analysis

Often, the dataset you download in its original form doesn't fit your data analysis needs exactly. You may want to make surface-level changes to this "raw" dataset so that it's easier to work with. This process is called "cleaning" your data, or preparing it for data analysis.

The following steps show a few common actions you may want to take while cleaning data. It is important to remember that as much as possible, cleaning data should not change the original dataset values in any way.

As we clean our dataset we will assign the altered dataset to new variables to preserve the original dataset and prevent possible errors.

### Rename column headers

In this data, the columns are named using a code. We can use the [data dictionary](https://wildlife.faa.gov/assets/fieldlist.pdf) to match up the column header names with what they stand for to make it easier for us to read.

We can use the [`rename()` method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.rename.html) on our `wl_strikes` dataframe to rename column headers of our dataset. *Note that this change is assigned to a new variable, `wl_strikes_rename`.*

In [None]:
#Renaming columns: change the name from "REG" to "AIRCRAFT_REGISTRATION"
wl_strikes_rename = wl_strikes.rename(columns={'REG':'AIRCRAFT_REGISTRATION'})

# Print out the dataset column headers
wl_strikes_rename.columns
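
The sketch below illustrates two properties of `rename()` on an invented DataFrame: several headers can be renamed in one call, and the original DataFrame is left unchanged. The "SPD" header and its replacement are hypothetical and do not come from the data dictionary.

```python
import pandas as pd

# Hypothetical stand-in for wl_strikes; "SPD" is an invented coded header
toy_strikes = pd.DataFrame({'REG': ['N123AB'], 'SPD': [140]})

# Map each old header to its new name; keys that match no column are ignored
renamed = toy_strikes.rename(columns={
    'REG': 'AIRCRAFT_REGISTRATION',
    'SPD': 'SPEED',
    'NOT_A_COLUMN': 'IGNORED',
})

print(renamed.columns.tolist())

# The original DataFrame still has its coded headers
print(toy_strikes.columns.tolist())
```

Because unmatched keys are silently skipped by default, a typo in an old header name produces no error, so printing the columns afterwards is a useful check.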

#### Try it yourself: `rename()`

Use the [data dictionary](https://wildlife.faa.gov/assets/fieldlist.pdf) to rename the column "FLT" to a more descriptive name. Store the dataframe with the new column name in the variable `wl_strikes_rename2`. Make sure to call the rename on the dataframe created in the previous step, `wl_strikes_rename`.

In [None]:
# Use the data dictionary to rename "FLT" and store the new dataframe in the
# variable wl_strikes_rename2


#### Solution: `rename()`

In [None]:
# Use the data dictionary to rename "FLT" and store the new dataframe in the
# variable wl_strikes_rename2
wl_strikes_rename2 = wl_strikes_rename.rename(columns={'FLT': "FLIGHT_NUM"})

### Removing spaces from column headers

When we renamed the column headers, we didn't use any spaces or special characters, which keeps the headers easy to work with in code. One column header in the dataset contains a space, so we will replace that space with an underscore.
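
When many headers contain spaces, renaming them one at a time becomes tedious. As an alternative sketch (not the approach used in this notebook), the string methods on the column index can replace every space at once. The DataFrame here is invented for illustration.

```python
import pandas as pd

# Hypothetical stand-in with one header that contains a space
toy_strikes = pd.DataFrame({'ENROUTE STATE': ['NC'], 'INCIDENT_MONTH': [7]})

# Work on a copy so the original DataFrame keeps its headers
no_spaces = toy_strikes.copy()

# Replace the literal space character in every header with an underscore
no_spaces.columns = no_spaces.columns.str.replace(' ', '_', regex=False)

print(no_spaces.columns.tolist())
```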

#### Try it yourself: `rename()` 2

Using the same method for renaming the column headers above, rename the column "ENROUTE STATE" to "ENROUTE_STATE" and store the new dataframe in the variable `wl_strikes_rename3`. Make sure to call the rename on the dataframe created in the previous step, `wl_strikes_rename2`.

In [None]:
# Find the column labeled "ENROUTE STATE" and change the name to "ENROUTE_STATE"
# Store the updated dataframe in the variable wl_strikes_rename3


#### Solution: `rename()` 2

In [None]:
# Find the column labeled "ENROUTE STATE" and change the name to "ENROUTE_STATE"
# Store the updated dataframe in the variable wl_strikes_rename3
wl_strikes_rename3 = wl_strikes_rename2.rename(columns={'ENROUTE STATE': 'ENROUTE_STATE'})

### Removing unnecessary columns
There may be columns in the dataset that you do not plan to use. You can remove those so that the DataFrame is smaller and easier to work with.

We will use the [`drop()` method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html) to remove the "COMMENT" column since it is not relevant to the analyses we will conduct. Store the new dataframe in a variable named `wl_strikes_drop`. Make sure to call `drop()` on the dataframe created in the previous step, `wl_strikes_rename3`.

In [None]:
# Remove the unnecessary column "COMMENT" 
wl_strikes_drop = wl_strikes_rename3.drop(columns=['COMMENT'])

# Print out the dataset column headers
wl_strikes_drop.columns
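
One behavior of `drop()` that can surprise newcomers is that it raises a `KeyError` when asked to remove a column that does not exist. The sketch below demonstrates this on an invented DataFrame and shows the `errors='ignore'` option, which skips missing names instead.

```python
import pandas as pd

# Hypothetical stand-in for the renamed dataset
toy_strikes = pd.DataFrame({
    'SPECIES': ['Killdeer'],
    'COMMENT': ['free-text note'],
})

# drop() returns a new DataFrame; toy_strikes keeps its "COMMENT" column
dropped = toy_strikes.drop(columns=['COMMENT'])
print(dropped.columns.tolist())

# Dropping the same column from the result fails because it is already gone
try:
    dropped.drop(columns=['COMMENT'])
except KeyError as error:
    print('KeyError:', error)

# errors='ignore' skips names that are not present instead of raising
dropped_again = dropped.drop(columns=['COMMENT'], errors='ignore')
print(dropped_again.columns.tolist())
```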

#### Try it yourself: `drop()`

You can drop multiple columns at one time by providing a list of multiple column names into the `columns` keyword argument. In the previous example we only passed in a list with one column name, "COMMENT".

Use the `drop()` method to remove the columns dealing with the strike and damage location on an aircraft. These column names are stored as a list in the variable `drop_columns`. Store the new DataFrame with the dropped columns in the variable `wl_strikes_drop2`. Make sure to call `drop()` on the dataframe created in the previous step, `wl_strikes_drop`.

In [None]:
# A list of column names to be dropped from the dataset
drop_columns = ['STR_RAD', 'DAM_RAD', 'STR_WINDSHLD', 'DAM_WINDSHLD', 'STR_NOSE',
                'DAM_NOSE', 'STR_ENG1','DAM_ENG1', 'STR_ENG2', 'DAM_ENG2',
                'STR_ENG3', 'DAM_ENG3', 'STR_ENG4', 'DAM_ENG4', 'STR_PROP',
                'DAM_PROP', 'STR_WING_ROT', 'DAM_WING_ROT', 'STR_FUSE',
                'DAM_FUSE', 'STR_LG', 'DAM_LG', 'STR_TAIL', 'DAM_TAIL',
                'STR_LGHTS', 'DAM_LGHTS', 'STR_OTHER', 'DAM_OTHER',
                'OTHER_SPECIFY', 'EFFECT', 'EFFECT_OTHER']

# Drop the columns dealing with the strike and damage location on an aircraft
# Store the updated dataframe in the variable wl_strikes_drop2


# Print out the dataset column headers


#### Solution: `drop()`

In [None]:
# A list of column names to be dropped from the dataset
drop_columns = ['STR_RAD', 'DAM_RAD', 'STR_WINDSHLD', 'DAM_WINDSHLD', 'STR_NOSE',
                'DAM_NOSE', 'STR_ENG1','DAM_ENG1', 'STR_ENG2', 'DAM_ENG2',
                'STR_ENG3', 'DAM_ENG3', 'STR_ENG4', 'DAM_ENG4', 'STR_PROP',
                'DAM_PROP', 'STR_WING_ROT', 'DAM_WING_ROT', 'STR_FUSE',
                'DAM_FUSE', 'STR_LG', 'DAM_LG', 'STR_TAIL', 'DAM_TAIL',
                'STR_LGHTS', 'DAM_LGHTS', 'STR_OTHER', 'DAM_OTHER',
                'OTHER_SPECIFY', 'EFFECT', 'EFFECT_OTHER']

# Drop the columns dealing with the strike and damage location on an aircraft
# Store the updated dataframe in the variable wl_strikes_drop2
wl_strikes_drop2 = wl_strikes_drop.drop(columns=drop_columns)

# Print out the dataset column headers
wl_strikes_drop2.columns

### Combining categories

Sometimes the dataset is more specific than we need. In this case, if a species was unknown, the reporter had the option to enter the general size of the animal. Unknown animals are listed as "Unknown bird," "Unknown bird - small," "Unknown bird - medium," etc. This may be helpful, but for our analyses, we want them all labeled as "Unknown flying animal".

First, we will observe the original species data to determine how many species labels contain the word "Unknown" and how many records carry these labels. We apply the Series method `value_counts()` and then the [`filter()` method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.filter.html) with a regular expression to the results. Note that `filter()` matches against the index labels, which in the output of `value_counts()` are the species names. The final output lists each "Unknown" label and the number of records with that label.

In [None]:
# Select the data from the "SPECIES" column
species = wl_strikes_drop['SPECIES']

# Get the value counts of each unique species in species
uniqueSpeciesCounts = species.value_counts()

# Filter the unique species counts to include only categories that contain the
# word "Unknown"
uniqueSpeciesCounts.filter(regex='Unknown.*')
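
To make the behavior of `filter()` concrete, the sketch below applies it to value counts built from a handful of invented species labels. The point to notice is that the regular expression is tested against the index labels (the species names), not against the counts.

```python
import pandas as pd

# Hypothetical species labels standing in for the "SPECIES" column
species = pd.Series([
    'Unknown bird - small', 'Mourning dove', 'Unknown bird - small',
    'Unknown bat', 'Mourning dove', 'Unknown bird - small',
])

# The species names become the index; the counts become the values
species_counts = species.value_counts()

# Keep only the entries whose index label matches the regular expression
unknown_counts = species_counts.filter(regex='Unknown')
print(unknown_counts)

# Total number of records carrying any "Unknown" label
print(unknown_counts.sum())
```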

From the output above you can see that there are over 2,500 animal strikes that have a 'SPECIES' label of some type of unknown bird or unknown bat.

Next, we will create a new DataFrame `wl_strikes_clean` that replaces all instances of "SPECIES" labels that begin with the string "Unknown" with the new category "Unknown flying animal." To do this we will use the [`replace()` method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.replace.html) on the `wl_strikes_drop2` DataFrame.

In [None]:
# Replace all values in the "SPECIES" column containing the word "Unknown" with
# the value "Unknown flying animal"
wl_strikes_clean = wl_strikes_drop2.replace({'SPECIES': 'Unknown.*'}, 'Unknown flying animal', regex=True)

# Print out the value counts for unique species in the updated dataset
wl_strikes_clean['SPECIES'].value_counts()
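
The dictionary passed to `replace()` limits the replacement to the named column. As a sketch under stated assumptions, the example below uses an invented DataFrame with a second text column, "REMARKS" (hypothetical here), to show that matching text outside "SPECIES" is left alone. It also anchors the pattern with `^` so that only labels that start with "Unknown" are replaced.

```python
import pandas as pd

# Hypothetical stand-in for wl_strikes_drop2
toy_strikes = pd.DataFrame({
    'SPECIES': ['Unknown bird - small', 'Mourning dove', 'Unknown bat'],
    'REMARKS': ['Unknown altitude', 'On approach', 'At night'],
})

# Only the "SPECIES" column is searched; '^' anchors the match to the start
cleaned = toy_strikes.replace(
    {'SPECIES': r'^Unknown.*'}, 'Unknown flying animal', regex=True
)

print(cleaned)
```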

## Exploratory analysis of the dataset

After observing and cleaning our dataset, it is now easier to conduct analyses on our data. We will conduct some numerical and visual analyses that will help us explore questions such as:

- How many unique species have been identified in the data set?
- Which species are struck the most?
- How has the number of strikes changed over time?
- Are there times of the year when most strikes occur?
- How frequently do land-based animals get impacted by this issue?

We will be using the [matplotlib visualization library](https://matplotlib.org/index.html) to create charts.

Start by importing the [pyplot interface](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.html#module-matplotlib.pyplot) as `plt` to access the plotting functionality of matplotlib. We will also set some basic graphic parameters for our plots: the overall [style of the plots and the size of the plots](https://matplotlib.org/3.3.3/tutorials/introductory/customizing.html).

For all of our charts we will use the matplotlib integration with pandas data structures by calling the method [`plot()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.plot.html#pandas.DataFrame.plot) on a DataFrame or Series.

In [None]:
# Import the matplotlib pyplot interface as plt (callable in our code as plt)
import matplotlib.pyplot as plt

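# Set the graphic style of the plots. Matplotlib 3.6 renamed the 'seaborn'
# style to 'seaborn-v0_8', so use whichever name this installation provides
seaborn_style = 'seaborn-v0_8' if 'seaborn-v0_8' in plt.style.available else 'seaborn'
plt.style.use(seaborn_style)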

# Set the size of the plots
plt.rcParams['figure.figsize'] = [10, 8]

### Number of unique species

It may be interesting to see the total number of unique species that are recorded in our dataset. We can use the `unique()` method on the "SPECIES" column to create an array of unique species labels, stored in the variable `unique_species`. The length of this array is the number of unique labels. We call [the built-in Python function `len()`](https://docs.python.org/3/library/functions.html#len) and pass in the `unique_species` variable to obtain the length of the array.

*Remember, one of our labels is "Unknown flying animal." We probably don't want to include this label in our count.*

In [None]:
# Get the unique species names from the "SPECIES" column
unique_species = wl_strikes_clean['SPECIES'].unique()

# Print the length of the unique_species array (the number of unique species)
# minus one (to account for the "Unknown flying animal" label we created earlier)
len(unique_species) - 1
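
Subtracting one works as long as the "Unknown flying animal" label is actually present. A slightly more defensive sketch, shown here on invented labels, removes that label explicitly before counting so the result is correct either way.

```python
import pandas as pd

# Hypothetical species labels standing in for the cleaned "SPECIES" column
species = pd.Series([
    'Mourning dove', 'Unknown flying animal', 'Killdeer',
    'Mourning dove', 'Unknown flying animal',
], name='SPECIES')

# Keep only the rows whose label is an identified species
identified = species[species != 'Unknown flying animal']

# Count the distinct identified species
n_identified = identified.nunique()
print(n_identified)
```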

### Top ten occurring species

Create a [horizontal bar graph](https://matplotlib.org/3.1.1/api/_as_gen/matplotlib.pyplot.barh.html) to show the top ten species that are most involved in strikes and the count of records of each.

**Questions to consider:**

- What is the fifth most struck species?
- Approximately how many times has the tenth most struck species been involved in aircraft strikes?

Similar to our previous analysis of unique species, we will aggregate the "SPECIES" column to obtain the unique species, but also obtain the total counts of each in the column using the `value_counts()` method. We will store the resulting Series in the variable `unique_species_count`. We must then use the `drop()` method to remove the value with the label "Unknown flying animal" so that we only include identifiable species. In this case, we will overwrite the existing `unique_species_count` variable by reassigning the output to the same variable name.

Once the data is formatted we create a horizontal bar plot of the first ten values of `unique_species_count` by first [slicing a range using the index reference](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#slicing-ranges) `[:10]` to select the first ten values in the Series and then calling the `plot()` method with the keyword argument `kind='barh'`.

In [None]:
# Get the number of strike records for each species type
unique_species_count = wl_strikes_clean['SPECIES'].value_counts()

# Drop the value from the Series with the label "Unknown flying animal"
unique_species_count = unique_species_count.drop(labels='Unknown flying animal')

# Select the first ten values from the unique species count list and create a 
# horizontal bar chart ("barh") with the labels (species name) along the vertical axis
# and value counts (number of records) along the horizontal axis
unique_species_count[:10].plot(kind='barh')
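
The selection step can be tried on a small scale. This sketch uses invented species labels and picks the top three rather than the top ten. It relies on `value_counts()` returning counts in descending order, and it shows one optional tweak: sorting the selection in ascending order, since a horizontal bar chart draws the first row at the bottom.

```python
import pandas as pd

# Hypothetical species labels standing in for the cleaned "SPECIES" column
species = pd.Series(
    ['Killdeer'] * 4 + ['Mourning dove'] * 3 + ['Barn swallow'] * 2
    + ['Unknown flying animal'] * 5 + ['Chimney swift']
)

# Count each label, then remove the catch-all "Unknown flying animal" entry
counts = species.value_counts().drop(labels='Unknown flying animal')

# value_counts() sorts from most to least frequent, so head(3) is the top three
top_three = counts.head(3)
print(top_three)

# Ascending order places the most-struck species at the top of a barh chart
top_for_barh = top_three.sort_values()
print(top_for_barh)
```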

### Incidents per year

Next, we will create a [line chart](https://matplotlib.org/3.1.0/api/_as_gen/matplotlib.pyplot.plot.html) to observe the change in the number of wildlife strikes per year over the timeframe of our dataset.

We will first count the number of records for each "INCIDENT_YEAR" value using the `value_counts()` method, and store the result in the variable `strikes_per_year`. We then sort the values based on the index, the year of occurrence, so that our data is sorted by year and not by value counts.

In [None]:
# Get the number of strikes that occurred in each year
strikes_per_year = wl_strikes_clean['INCIDENT_YEAR'].value_counts()

# Sort the new Series by the index values, the year
strikes_per_year = strikes_per_year.sort_index()

# Print the series
strikes_per_year
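
For comparison, the same per-year totals can be produced with an explicit grouping. The sketch below, using invented years, shows that `value_counts()` followed by `sort_index()` gives the same result as `groupby()` followed by `size()`, which returns its groups already sorted by year.

```python
import pandas as pd

# Hypothetical years standing in for the "INCIDENT_YEAR" column
toy_strikes = pd.DataFrame({'INCIDENT_YEAR': [2019, 2018, 2019, 2020, 2019, 2018]})

# Approach used in this notebook: count each value, then sort by year
by_value_counts = toy_strikes['INCIDENT_YEAR'].value_counts().sort_index()

# Alternative: group the rows by year and count the rows in each group
by_groupby = toy_strikes.groupby('INCIDENT_YEAR').size()

print(by_value_counts)
print(by_groupby)
```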

We will use the matplotlib "line" plot type to create a line graph that shows the change in total strikes per year over the year range of our dataset.

**Questions to consider:**

- What is the general trend in wildlife strikes over time? What could this trend be related to?
- Should there be any considerations for the data from the years 1985 and 2020? Why would the year 2020 be so low relative to previous years? Could the 1985 value be considered an outlier?

To create the line plot we call the `plot()` method on the `strikes_per_year` Series and pass in the keyword arguments `kind='line'` and `style='o-'`. The style argument is shorthand indicating that we want our data points (markers) represented by circles and connected by a solid line. You can find other style arguments for formatting markers and lines in the [matplotlib plot documentation](https://matplotlib.org/2.1.2/api/_as_gen/matplotlib.pyplot.plot.html).

In [None]:
# Plot the total strikes per year using a solid line connecting data points and
# a solid circle representing a data point
strikes_per_year.plot(kind='line', style='o-')

### Incidents by month

We will now create a bar chart to observe any patterns that might occur at the temporal level of month. 

**Questions to consider:**

- Are there any observable patterns in wildlife strikes by month?
- Could there be a relationship between bird migration occurrence (March-April and August-November) and number of strikes? (See [Bird Migration and Areas With Sensitive Fauna](https://www.faa.gov/air_traffic/publications/atpubs/aip_html/part2_enr_section_5.6.html))

We will again use the `value_counts()` method, this time on the "INCIDENT_MONTH" column to get the total strikes that have occurred in each month, stored in `strikes_by_month`. We also need to sort the resultant data by its index labels so the data is sorted by month number (1-12).

We create the bar plot by calling the `plot()` method on `strikes_by_month` using the keyword arguments `kind='bar'` and `rot=0`. The `rot` keyword specifies the rotation of the x-axis tick labels in degrees; `rot=0` keeps the labels horizontal (pandas draws bar-chart tick labels vertically by default).

In [None]:
# Get the number of strikes that occurred during a specific month
strikes_by_month = wl_strikes_clean['INCIDENT_MONTH'].value_counts()

# Sort the new Series by the index values, the month number
strikes_by_month = strikes_by_month.sort_index()

# Create a bar chart with the data, keeping the x-axis tick labels horizontal
strikes_by_month.plot(kind='bar', rot=0)
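
One caveat of `value_counts()` is that a month with no recorded strikes does not appear in the result at all, which would leave a gap in the bar chart. This is unlikely in a large dataset but can happen in a small subset. A minimal sketch, using invented month values, fills in the missing months with zero by reindexing against the full range 1-12.

```python
import pandas as pd

# Hypothetical month values in which most months have no strikes
months = pd.Series([1, 3, 3, 12], name='INCIDENT_MONTH')

# Only months that occur in the data appear in the counts
strikes_by_month = months.value_counts().sort_index()
print(strikes_by_month.index.tolist())

# Reindex against all twelve months, using 0 where no strikes were recorded
all_months = strikes_by_month.reindex(range(1, 13), fill_value=0)
print(all_months)
```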


### Compare wildlife strikes between flying animals and land animals

Finally, let's compare wildlife strikes based on the categorization of animals that can or cannot fly: flying animals vs. land animals.

Land animals are identified by a SPECIES_ID that begins with the number 1 or the number 2. This is based on the International Civil Aviation Organization (ICAO) codes found in the [ICAO 2008 - 2015 Wildlife Strike Analysis Electronic Bulletin](https://www.icao.int/safety/IBIS/2008%20-%202015%20Wildlife%20Strike%20Analyses%20(IBIS)%20-%20EN.pdf).

**Questions to consider:**

- What type of animal, flying or land animal, is involved most in strikes?
- How different or similar are the number of strikes between the two categories? Why might this be?

We first create a new Series, `land_or_flying_animal`, using the [method `str.contains()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.contains.html) to test if a value in the "SPECIES_ID" column begins with the character 1 or 2 (using the regular expression `'^[12]'`). This method returns boolean values: True if the string matches the given regular expression and False if it does not. We then use the [`replace()` method](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.replace.html) on the new Series to replace the True or False values with descriptive labels, "Land animal" or "Flying animal," respectively.

In [None]:
# Create a Series of boolean values, True if a land animal, False if a flying
# animal, using a regular expression to test if the "SPECIES_ID" string begins
# with the character 1 or 2
land_or_flying_animal = wl_strikes_clean['SPECIES_ID'].str.contains('^[12]', regex=True)

# Replace True or False values with new strings, "Land animal" or "Flying animal"
land_or_flying_animal = land_or_flying_animal.replace(
  {True: 'Land animal', False: 'Flying animal'}
)

# Print out the Series
land_or_flying_animal
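
The exact form of the regular expression matters here. In a pattern such as `'^1|2'`, the `^` anchor applies only to the first alternative, so the pattern means "starts with 1, or contains a 2 anywhere." A character class, `'^[12]'`, anchors both digits to the start of the string. The sketch below compares the two on invented ID strings (these are not real species codes).

```python
import pandas as pd

# Hypothetical ID strings; only the first two begin with 1 or 2
species_ids = pd.Series(['1F2', '2A1', 'YI005', 'K3302', 'UNKBS'])

# Alternation: the anchor binds to "1" only, so any "2" in the string matches
loose = species_ids.str.contains('^1|2', regex=True)

# Character class: the string must start with either "1" or "2"
strict = species_ids.str.contains('^[12]', regex=True)

print(pd.DataFrame({'id': species_ids, 'loose': loose, 'strict': strict}))
```

The ID "K3302" matches the loose pattern because it contains a 2, even though it does not start with 1 or 2.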

We create the bar plot by first calculating the value counts of the `land_or_flying_animal` Series and then calling the `plot()` method using the keyword arguments `kind='bar'` and `rot=0`. The `rot` keyword specifies the rotation of the x-axis tick labels in degrees; `rot=0` keeps the labels horizontal (pandas draws bar-chart tick labels vertically by default).

In [None]:
# Create a bar chart to compare the counts of land animals vs flying animals
land_or_flying_animal.value_counts().plot(kind='bar', rot=0)
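
When two categories differ greatly in size, proportions can be easier to interpret than raw counts. As a small sketch on invented labels, passing `normalize=True` to `value_counts()` returns each category's share of the total instead of its count.

```python
import pandas as pd

# Hypothetical labels standing in for the land_or_flying_animal Series
labels = pd.Series(['Flying animal'] * 4 + ['Land animal'])

# Raw counts per category
counts = labels.value_counts()
print(counts)

# normalize=True returns fractions that sum to 1
shares = labels.value_counts(normalize=True)
print(shares)
```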