# FAA Bird Strike EDA

In this section, we are going to take the abridged findings from the Anaconda course [_Exploratory Data Analysis with Python_](https://learning.anaconda.cloud/exploratory-data-analysis-eda-with-python) and gather them to tell a story. 

Let's say we were asked by a pilots' union to do an independent study on what factors cause bird strikes. Other reports exist, but the union wants its own study to see whether we reach similar conclusions, as bird strike damage continues to be a problem. 

We want to include only recent data, from 2015 onwards, because we have deemed older data to be of little value. One can argue that the nature of bird strikes never changes, but the environment certainly does. Schedules and airports grow and shrink, weather patterns change, and airlines come and go. Even the [FAA themselves](https://wildlife.faa.gov/home) say:  

> Expanding wildlife populations, increases in number of aircraft movements, a trend toward faster and quieter aircraft, and outreach to the aviation community all have contributed to the observed increase in reported wildlife strikes.


## The Bird Strike Dataset 

This dataset contains aircraft bird strikes as reported to the Federal Aviation Administration (FAA) in the United States. A bird strike occurs when a bird collides with an aircraft, and the damage can be severe. On average, there are 13,000 bird strikes each year in the United States alone, costing the aviation industry $400 million in damages. While most bird strikes are minor, some can be dangerous and even fatal. 

<a title="Greg L, CC BY 2.0 &lt;https://creativecommons.org/licenses/by/2.0&gt;, via Wikimedia Commons" href="https://commons.wikimedia.org/wiki/File:US_Airways_Flight_1549_(N106US)_after_crashing_into_the_Hudson_River_(crop_2).jpg"><img width="512" alt="US Airways Flight 1549 (N106US) after crashing into the Hudson River (crop 2)" src="https://upload.wikimedia.org/wikipedia/commons/thumb/8/8f/US_Airways_Flight_1549_%28N106US%29_after_crashing_into_the_Hudson_River_%28crop_2%29.jpg/512px-US_Airways_Flight_1549_%28N106US%29_after_crashing_into_the_Hudson_River_%28crop_2%29.jpg?20200816213116"></a>

**In 2009, US Airways Flight 1549 suffered a major bird strike resulting in an emergency landing in the Hudson River.**<br>
*Greg L, CC BY 2.0, via Wikimedia Commons*

Below, we will import the bird strike data we cleaned from the other Anaconda course. 

In [None]:
import pandas as pd 

url = r"https://github.com/thomasnield/anaconda_python_eda/raw/public/birdstrike_section2.csv"
df = pd.read_csv(url, index_col='INDEX_NR', parse_dates=["INCIDENT_DATE"])

with pd.option_context('display.max_columns', None):
  display(df)

We will also do some datatype conversions.

In [None]:
# Turn PHASE_OF_FLIGHT into a category
phase_of_flt = pd.CategoricalDtype(categories=['Parked', 'Taxi','Take-off Run', 'Approach', 'Departure', 'Climb', 'En Route',
                                               'Descent', 'Landing Roll', 'Arrival', 'Local'])

df["PHASE_OF_FLIGHT"] = df["PHASE_OF_FLIGHT"].astype(phase_of_flt)

# Turn TIME into timedelta type 
df["TIME"] = pd.to_timedelta(df["TIME"])

While we could get into the weeds with our audience (pilots are a mixed crowd, and some are very technical), we should avoid going into too many data-cleaning details. We should mention at a high level that we are only looking at data from 2015 onwards, leaving us with 141K records to analyze. We should also caution our audience that this data is self-reported. 

Each record is reported by a pilot to the FAA. It is one thing if the pilot works for a big airline like Delta or Southwest Airlines and is well-trained in procedure. Such pilots have little reason to miss filing a bird strike report other than the time it takes. But an independent pilot who owns his own small aircraft may be a wildcard. If the damage is minor or nonexistent, perhaps he just shrugs it off and does not report it. If he hits a protected species like the bald eagle, he may be even less inclined to report, as he is unsure of the consequences. 

We caution our audience that self-reporting always carries a bias with it, because not everybody is going to self-report. This could skew the results in ways that do not reflect reality. For example, if the data shows large airlines are far more prone to bird strikes than general aviation aircraft, that could be due to airlines being better at reporting, not because birds collide more with airliners. 
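
To make this concrete for ourselves, here is a minimal sketch with made-up numbers. The strike counts and reporting rates below are hypothetical and do not come from the FAA data. It shows how a difference in reporting rates alone can make one group look more strike-prone in a self-reported dataset.

```python
import pandas as pd

# Hypothetical numbers for illustration only; none of these come from the FAA data.
groups = pd.DataFrame({
    "OPERATOR_TYPE": ["Airline", "General aviation"],
    "TRUE_STRIKES": [1000, 1000],    # assume both groups hit birds equally often
    "REPORTING_RATE": [0.9, 0.3],    # assume airlines report far more consistently
})

# A self-reported dataset only ever contains the reported share
groups["REPORTED_STRIKES"] = (
    groups["TRUE_STRIKES"] * groups["REPORTING_RATE"]
).round().astype(int)

# Ratio of reported strikes, airline vs. general aviation
reported_ratio = groups.loc[0, "REPORTED_STRIKES"] / groups.loc[1, "REPORTED_STRIKES"]

print(groups)
print(f"Airlines appear {reported_ratio:.1f}x more strike-prone in the reported data")
```

Under these assumed rates, the two groups are identical in reality but look very different in the reported counts.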

We should challenge our audience not to get too caught up in just what the data says. They should also ask where it came from and what could possibly bias it. Framing these questions will help them draw more intelligent conclusions.

That being said, let's move onto some very real findings.

![](https://y.yarn.co/18f66e28-0f40-4db7-b05f-3529759e9708_text.gif)

*Courtesy: Paramount Pictures* 

## Height and Phase of Flight 

One of the first things we noticed in our analysis is bird strikes definitely skew towards lower altitudes, especially below 1000 feet. 

In [None]:
df["HEIGHT"].hist(bins=30)
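
To put a number on the skew seen in the histogram, we could compute the share of strikes at or below a given height. The helper `share_at_or_below` and the sample values below are illustrative assumptions, not part of the original analysis.

```python
import pandas as pd

def share_at_or_below(heights: pd.Series, threshold_ft: float) -> float:
    """Fraction of non-missing heights that are at or below threshold_ft."""
    known = heights.dropna()
    return float((known <= threshold_ft).mean())

# Tiny made-up sample standing in for df["HEIGHT"] (values in feet)
sample_heights = pd.Series([0, 0, 50, 300, 800, 1500, 4000, None])

print(share_at_or_below(sample_heights, 1000))
```

With the real data, calling `share_at_or_below(df["HEIGHT"], 1000)` would give a figure we could quote alongside the histogram.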

If we look at phase of flight, we surprisingly see "approach" and "landing roll" have more strikes than take-off or climb. 

In [None]:
df["PHASE_OF_FLIGHT"].value_counts().plot.bar()

Since we are talking to pilots, we do not need to define each of these phases of flight for them, as they probably know more than we do. Regardless, here they are for good measure. 

![](https://github.com/thomasnield/anaconda_python_eda/raw/public/resource/7Od2TS0O.svg)


## Speed versus Height Variable

Next let's take a look at `SPEED`. The faster a plane is going, the more likely the plane is going to be damaged colliding with a bird, hence resulting in a bird strike report. A bird that bumps into a slow-moving plane is less likely to count as a bird strike if no damage occurs, right? However, a spinning engine on a stationary aircraft can suck in a bird and certainly count as a bird strike too. 

Let's take a look. 

In [None]:
df["SPEED"].hist(bins=50)
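
A histogram's shape can be backed up with a couple of summary numbers. The sketch below is one possible check, using a hypothetical helper `shape_summary` and made-up speeds: it compares the mean with the median and reports the sample skewness, which is close to 0 for a symmetric bell curve and positive when there is a tail to the right.

```python
import pandas as pd

def shape_summary(values: pd.Series) -> pd.Series:
    """Mean, median, and sample skewness of the non-missing values."""
    known = values.dropna()
    return pd.Series({
        "mean": known.mean(),
        "median": known.median(),
        "skew": known.skew(),  # about 0 if symmetric, > 0 with a right tail
    })

# Made-up speeds with one extreme value on the right
sample_speeds = pd.Series([110, 120, 130, 140, 150, 400])

summary = shape_summary(sample_speeds)
print(summary)
```

On the real column this would be `shape_summary(df["SPEED"])`. A mean well above the median would suggest the outliers are pulling the distribution to the right.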

We seem to have a normal distribution here as indicated by the bell curve shape, with some extreme outliers to the right. This is interesting. Maybe this is explained by the height as planes don't hit their high speeds until cruise altitudes. So what happens if we bring that in alongside the speed? 

In [None]:
df.plot.scatter(x="HEIGHT",y="SPEED")


Well, when looking at bird strikes we certainly see as speed increases so does height. However, approaching cruise altitudes above 15,000 feet show sharp decreases in incidents. 
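
If we wanted a single number for how strongly speed rises with height, a rank correlation is one option, since the relationship need not be a straight line. The sketch below computes a Spearman-style correlation by ranking both columns and taking the Pearson correlation of the ranks. The helper name and the sample values are illustrative assumptions.

```python
import pandas as pd

def rank_correlation(frame: pd.DataFrame, x: str, y: str) -> float:
    """Spearman-style correlation: Pearson correlation computed on ranks."""
    pairs = frame[[x, y]].dropna()   # keep only rows where both values are known
    ranks = pairs.rank()             # replace values with their rank order
    return float(ranks[x].corr(ranks[y]))

# Made-up rows standing in for df[["HEIGHT", "SPEED"]]
sample = pd.DataFrame({
    "HEIGHT": [0, 100, 500, 2000, 8000, None],
    "SPEED": [80, 120, 140, 200, 250, 130],
})

r = rank_correlation(sample, "HEIGHT", "SPEED")
print(r)
```

A value near 1 means speed almost always increases with height. It says nothing about why incidents thin out at higher altitudes, which still needs the scatter plot.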

## Distance vs Height Variable

We can also see that when we set distance against height, bird strike incidents bottleneck quickly as both approach 0. 

In [None]:
df.plot.scatter(x="DISTANCE",y="HEIGHT")


This lines up with our finding that bird strikes happen during the approach and take-off phases of flight. 

## Time Series Analysis 

Now let's look at bird strikes by week. We will do some time series conversions and then plot the count of incidents by week. 

In [None]:
df_series = pd.DataFrame({"INCIDENT_DATE" : df["INCIDENT_DATE"], "STRIKE_COUNT" : 1})
df_series.set_index('INCIDENT_DATE', inplace=True)

df_series \
 .resample("W") \
 .sum() \
 .plot(kind='line', figsize=(15,3), title="Time Series Analysis")


Whoa, we got something quite cyclical here. Let's take a look at a single year and sniff out some seasonality. 

In [None]:
df_series \
 .loc["2021"] \
 .resample("W") \
 .sum() \
 .plot(kind='line', figsize=(15,3), title="Time Series Analysis")


So if this is representative of the typical cycle, we see bird strikes rise in April, and then rise again sharply after June. Then the decline starts to happen in the middle of October. 
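
To check whether a single year really is representative, one option is to average the incident counts per calendar month across all available years. The sketch below assumes a datetime Series like `df["INCIDENT_DATE"]`; the helper name and sample dates are made up for illustration.

```python
import pandas as pd

def strikes_by_calendar_month(dates: pd.Series) -> pd.Series:
    """Average number of incidents per calendar month across the years present."""
    dates = dates.dropna()
    year = dates.dt.year.rename("YEAR")
    month = dates.dt.month.rename("MONTH")
    # Count incidents for each (year, month) pair that appears in the data.
    # Note: a month with no incidents in some year is absent, not counted as 0.
    per_year_month = dates.groupby([year, month]).size()
    # Average those counts over the years for each calendar month
    return per_year_month.groupby(level="MONTH").mean()

# Made-up incident dates: 1 in Jan 2020, 2 in Jul 2020, 4 in Jul 2021
sample_dates = pd.to_datetime(pd.Series([
    "2020-01-05", "2020-07-01", "2020-07-15",
    "2021-07-04", "2021-07-10", "2021-07-20", "2021-07-30",
]))

monthly_avg = strikes_by_calendar_month(sample_dates)
print(monthly_avg)
```

If the monthly averages from `df["INCIDENT_DATE"]` show the same rise and fall as the 2021 plot, the seasonal pattern is not a quirk of one year.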

Let's look at 2021 onwards by week. 

In [None]:
df_series \
 .loc["2021":] \
 .resample("W") \
 .sum() \
 .plot(kind='line', figsize=(15,3), title="Time Series Analysis")


The summer peak and winter dip likely have a lot to do with bird migration patterns. In North America, birds fly south for the winter and north for the summer. We could also hypothesize that summer travel brings more flights, but people travel a lot in December for the holidays too. If we researched this matter thoroughly, we would find that, according [to the FAA](https://www.faa.gov/air_traffic/publications/atpubs/aip_html/part2_enr_section_5.6.html), "bird strike risk increases because of bird migration during the months of March through April and August through November." As a matter of fact, this is the largest factor in bird strike risk and the time series above shows this! 


What about time of day? Does that play a role? 

In [None]:
df_series_tm =  df["TIME"].dropna().dt.components.hours.value_counts().sort_index()
df_series_tm.plot(kind='line', figsize=(15,3), title="Bird Strikes by Hour")


Okay, it's pretty apparent that bird strikes occur mostly during the day, with a peak well before 10am. It makes sense that many birds would be less active at night. We could hypothesize whether the less frequent night bird strikes involve nocturnal birds like owls, but go down those rabbit holes on your own time. 
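
Rather than reading the peak off the chart, we could pull the busiest hour out directly. This is a small sketch assuming `TIME` holds timedelta values as in the conversion earlier; `peak_hour` and the sample times are hypothetical.

```python
import pandas as pd

def peak_hour(times: pd.Series) -> int:
    """Hour of day (0-23) with the most incidents; expects timedelta values."""
    hours = times.dropna().dt.components.hours
    return int(hours.value_counts().idxmax())

# Made-up times of day, including one missing value
sample_times = pd.to_timedelta(
    pd.Series(["07:15:00", "08:05:00", "08:40:00", "21:30:00", None])
)

print(peak_hour(sample_times))
```

On the real data this would be `peak_hour(df["TIME"])`.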


It would probably be helpful to see this curve separately for each month of the year, so we can account for seasonality and bird migration increasing or decreasing incidents. 

In [None]:
by_month_tm = pd.DataFrame({
    "MONTH" : df[df["TIME"].notna()]["INCIDENT_DATE"].dt.month, 
    "HOUR" : df[df["TIME"].notna()]["TIME"].dt.components.hours, 
    "STRIKES" : 1 
}).groupby(["MONTH","HOUR"]) \
.sum() \
.reset_index() \
.pivot(index="HOUR",columns="MONTH",values="STRIKES")

by_month_tm.plot(kind='line', figsize=(15,3), title="Bird Strikes by Hour for Each Month")

The peak and dip trends throughout the day generally hold for each month, where summer months are more amplified. But interestingly September (month 9), October (month 10), and May (month 5) have an uptick after 8:00pm (hour 20). These might be migratory months so perhaps there is more bird activity in later hours of the day? It's hard to say and there are many hypotheses we can explore! And we only now know to look because we did this time series analysis. 
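
Because the summer months have more strikes overall, their curves sit higher and can hide differences in shape. One way to compare the daily pattern across months is to convert each month's column into shares that sum to 1. The sketch below assumes an HOUR by MONTH table of counts like `by_month_tm`; the helper name and the sample table are illustrative.

```python
import pandas as pd

def hourly_share_by_month(counts: pd.DataFrame) -> pd.DataFrame:
    """Convert an HOUR x MONTH table of counts into each month's share per hour."""
    filled = counts.fillna(0)                 # hours with no strikes count as 0
    month_totals = filled.sum(axis=0)         # total strikes for each month
    return filled.div(month_totals, axis=1)   # divide each column by its total

# Made-up counts: month 7 has 10x the volume of month 1 but the same daily shape
sample_counts = pd.DataFrame(
    {1: [2, 6, 2], 7: [20, 60, 20]},
    index=pd.Index([6, 9, 20], name="HOUR"),
)
sample_counts.columns.name = "MONTH"

shares = hourly_share_by_month(sample_counts)
print(shares)
```

Plotting `hourly_share_by_month(by_month_tm)` would put every month on the same scale, which should make an evening uptick in particular months easier to judge.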

Time series analysis is yet another way we can detect trends and patterns in our data where a chronological component is playing a role. Just be careful and always ask where the data came from! For instance, if you ingested the entire FAA bird strike dataset (and not just post-2015) you may find bird strike reports have gone up rapidly since 2008. Does this mean bird strikes have increased? No, but the reported bird strikes have increased due to more outreach and proactive reporting since the Sully incident. 

## What Story Are We Going to Tell? 

Keep in mind this is an abridged exploratory data analysis for the purposes of this class. If you have not already done so, we highly recommend checking out [_Exploratory Data Analysis with Python_](https://learning.anaconda.cloud/exploratory-data-analysis-eda-with-python). Still, the findings here should get us started with our storytelling. 

So what is our central thesis? I think it would be this: 

> Reported bird strikes occur heavily at low altitude phases of flight, especially on approach. They also are highly seasonal due to migration patterns, climbing as summer approaches and tapering off as winter approaches. They also are much more likely to occur in the morning hours.

While this may not be a terribly ground-breaking analysis to pilots (the FAA has also concluded these findings and then some), you did exactly what you were asked: do an independent study on what causes bird strikes and your findings confirm the current hypotheses. 

We would gather the above charts and use these to back our findings. We will talk about charts in the next section. 

The pilots may ask you to find more interesting things that others might have missed. You should resist this pressure because it leads to p-hacking, something we will discuss more later. Instead, push back with "Is there something specific (a hypothesis) you want to investigate?" Put the onus back on them, as "the experts," to supply a hypothesis, rather than feeling pressure to crunch data blindly in search of correlations. Most productive analyses start with hypotheses; data mining, while appropriate in some situations, quickly yields diminishing returns. 
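
To see why blind correlation hunting is risky, here is a minimal sketch using purely random numbers, with no bird strike data involved. It correlates 200 noise columns against a noise target and counts how many clear a typical significance bar by chance alone. The sizes and the cutoff are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the sketch is repeatable

n_rows, n_features = 30, 200
target = rng.normal(size=n_rows)                  # pure noise "outcome"
features = rng.normal(size=(n_rows, n_features))  # pure noise "predictors"

# Pearson correlation of every noise feature with the noise target
correlations = np.array([
    np.corrcoef(features[:, j], target)[0, 1] for j in range(n_features)
])

# |r| > 0.361 is roughly the two-sided 5% cutoff for 30 observations
n_spurious = int((np.abs(correlations) > 0.361).sum())
print(f"{n_spurious} of {n_features} random features look 'significant'")
```

With a 5% cutoff we would expect around 10 of the 200 to pass even though none of them mean anything, which is why starting from a hypothesis matters.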

## Exercise

If your audience were not a pilots' union, but rather data scientists at a major AI conference with 300 attendees at your presentation, how would you change your storytelling? 

### SCROLL DOWN FOR ANSWER
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
v 

**If you are presenting to data scientists at an AI conference rather than a pilots' union, here are some things to consider:**

* Keep findings the same, but use more detailed statistics 
* Highly technical crowd, show code
* Use a Jupyter notebook instead of PowerPoint 
* Explain aviation jargon to them, as few (if any) will be pilots
* Challenge them to think outside the data