# Univariate Analysis and Outliers

We will cover some of the fundamentals of describing data one variable at a time, called **univariate analysis**, and this knowledge will roll into dealing with outliers. 

Keep in mind that exploratory data analysis can be time-consuming, especially as you go through each variable. Rabbit holes can consume hours, even days or weeks, so for the sake of expediency we will demonstrate this process with a few variables we hypothesize are relevant for bird strikes.

Let's start with bringing in the data from the last section. 

In [None]:
import pandas as pd 

url = r"https://github.com/thomasnield/anaconda_python_eda/raw/public/birdstrike_section2.csv"
df = pd.read_csv(url, index_col='INDEX_NR', parse_dates=["INCIDENT_DATE"])
with pd.option_context('display.max_columns', None):
  display(df)

Let's also take care of a few datatype conversions that do not get saved into the CSV. 

In [None]:
# Turn PHASE_OF_FLIGHT into a category
phase_of_flt = pd.CategoricalDtype(categories=['Parked', 'Taxi','Take-off Run', 'Approach', 'Departure', 'Climb', 'En Route',
                                               'Descent', 'Landing Roll', 'Arrival', 'Local'])

df["PHASE_OF_FLIGHT"] = df["PHASE_OF_FLIGHT"].astype(phase_of_flt)

# Turn TIME into timedelta type 
df["TIME"] = pd.to_timedelta(df["TIME"])
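One thing to watch for with the category conversion above: any value that is not in the listed categories is silently turned into a missing value. Here is a minimal sketch on a made-up series (the shortened category list and the sample values are assumptions for illustration, not the real data).

```python
import pandas as pd

# Shortened, hypothetical category list used only for this illustration
phase_dtype = pd.CategoricalDtype(categories=["Taxi", "Approach", "Climb"])

# Made-up raw values: "approach" (lowercase) is NOT one of the categories
raw = pd.Series(["Approach", "Climb", "approach", None])

converted = raw.astype(phase_dtype)

# Values outside the category list become NaN, just like the original None
print(converted)
print("Missing after conversion:", converted.isna().sum())
```

Comparing the missing-value count before and after the conversion is a quick way to catch spelling or casing mismatches.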

## Height Variable

Let's start with a few theories about the variables. Perhaps the `HEIGHT` variable (the altitude) is relevant to whether or not bird strikes occur. After all, birds need to land so they can eat and tend to their nests. We can call the `hist()` function on this column to create a histogram.

In [None]:
df["HEIGHT"].hist(bins=10)

Okay, that's interesting. It seems bird strikes skew heavily toward lower altitudes. Let's increase the number of bins to get more resolution. We do not want too many bins, though: because we have a finite amount of data, adding bins brings diminishing returns and eventually a loss of information as each bin holds too few points.

In [None]:
df["HEIGHT"].hist(bins=30)

The overwhelming majority of bird strikes happen below 1000 feet. This makes sense because birds, although frequently airborne, will largely fly close to the ground. Note you can also build a histogram directly with `matplotlib`. This lets us add more detail to the graph, such as labeling the count for each bar.

In [None]:
from matplotlib import pyplot as plt
import pandas as pd
import numpy as np

values, bins, bars = plt.hist(df['HEIGHT'], bins=30, edgecolor='white')
plt.xlabel("HEIGHT (Feet)")
plt.ylabel("# BIRD STRIKES")
plt.title('Height vs Bird Strike Incidents')
plt.bar_label(bars, fontsize=10, color='navy')
plt.margins(x=0.01, y=0.1)
plt.show()

You can detect a skew by comparing the **mean** (the average of the sample) and the **median** (the center-most value in the sample) of a given variable. If the two are very different, the variable is highly skewed, which is visually apparent in the histogram above.

In [None]:
height_mean = df["HEIGHT"].mean()
height_median = df["HEIGHT"].median()

print(f"MEAN: {height_mean} MEDIAN: {height_median}")
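To see how the mean and median pull apart under skew, here is a minimal sketch on a small made-up series (these numbers are invented for illustration and are not the bird strike data). It also calls `Series.skew()`, which returns a sample skewness that is positive when the tail stretches to the right.

```python
import pandas as pd

# Made-up altitudes in feet (NOT the real data): many low values, a few very high ones
heights = pd.Series([0, 0, 0, 50, 100, 200, 500, 1_000, 8_000, 30_000])

mean = heights.mean()      # pulled upward by the few large values
median = heights.median()  # stays near the bulk of the data
skewness = heights.skew()  # positive means a longer right tail

print(f"MEAN: {mean} MEDIAN: {median} SKEW: {skewness:.2f}")
```

Here the mean (3985.0) lands far above the median (150.0) because a handful of large values drag it upward.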

As a side note, you can approximate the distribution using a [kernel density estimation (KDE)](https://pandas.pydata.org/docs/reference/api/pandas.Series.plot.density.html).

In [None]:
df["HEIGHT"].plot.kde(xlim=(0,50_000))

## Phase of Flight Variable

Related to the `HEIGHT`, let's look at the `PHASE_OF_FLIGHT`. For some context, here is a typical cycle visualizing the phases of flight. Note that depending on the aircraft and nature of the flight, some stages will be different. For example, an `EN ROUTE` is typical for a flight going from point A to point B. But if a pilot is practicing circuits in a plane (taking off and landing over and over again), this is called `LOCAL` as a local pattern is being flown. 

![](https://github.com/thomasnield/anaconda_python_eda/raw/public/resource/7Od2TS0O.svg)


We should expect phases of flight that are closer to the ground to have more bird strikes, based on our previous variable analysis on `HEIGHT`. Let's take a look and plot the `value_counts()` as a bar chart. 

In [None]:
df["PHASE_OF_FLIGHT"].value_counts().plot.bar()

So there is nothing too surprising here. Phases of flight that are closer to the ground have more bird strikes. Since this variable is discrete, it might be useful to observe the **mode**, the most frequently occurring value(s). We can see that `Approach` is the mode, meaning it is the most common phase of flight for bird strikes.

In [None]:
df["PHASE_OF_FLIGHT"].mode()
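As a small illustration of the mode, here is a sketch on a made-up series (not the real data). Note that `mode()` returns a Series rather than a single value because several values can tie, and `value_counts(normalize=True)` shows how dominant the mode actually is.

```python
import pandas as pd

# Made-up phases of flight (NOT the real data)
phases = pd.Series(["Approach", "Approach", "Landing Roll", "Climb", "Approach", "Climb"])

# mode() returns a Series because more than one value can tie for most frequent
modes = phases.mode()
print(modes)

# normalize=True turns raw counts into proportions that sum to 1
shares = phases.value_counts(normalize=True)
print(shares)
```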

## Speed Variable

Next let's take a look at `SPEED`. The faster a plane is going, the more likely the plane is going to be damaged colliding with a bird, hence resulting in a bird strike report. A bird that bumps into a slow-moving plane is less likely to count as a bird strike if no damage occurs, right? However, a spinning engine on a stationary aircraft can suck in a bird and certainly count as a bird strike too. 

Let's take a look. 

In [None]:
df["SPEED"].hist(bins=50)

We seem to have a normal distribution here as indicated by the bell curve shape, with some extreme outliers to the right. This is going to be interesting. Let's take the mean and median of this. 

In [None]:
speed_mean = df["SPEED"].mean()
speed_median = df["SPEED"].median()

print(f"MEAN: {speed_mean} MEDIAN: {speed_median}")

Sure enough, our mean is not very far from our median, which is consistent with a roughly symmetric distribution, so this looks like a well-behaved variable that may have some predictive value. And again, this might make sense. When an aircraft is moving slowly, it is not moving fast enough for a bird to hit in a damaging way (unless it gets sucked into an engine). If it is moving fast, it is likely at cruise altitude, high and away from where birds are found. There might even be a correlation between speed and height, which we will explore in the next section.

For good measure, let's approximate the probability distribution. If we use speed for certain tasks, we might consider chopping off the outliers in that right tail. 

In [None]:
df["SPEED"].plot.kde(xlim=(0,2000))

## Outliers 

**Outliers** are values that are far away from most of the values in a distribution. How we deal with outliers depends on what we are trying to do and the context of the problem. We may remove them, replace them, or just leave them be depending on what the outlier means to the problem at hand.

While there are valid cases to remove outliers, just remember to ask what outliers mean in your application. Your smart thermostat may not need to learn from an unusually cold day in May, and that is an outlier you can safely consider removing. However, a pedestrian in a chicken costume disrupting a "self-driving" car's computer vision is a very serious issue, even if it is an outlier. We do not want to remove that as it indicates we have bigger problems with our domain. 

Outliers are a very difficult topic to get right and require not just an understanding of statistics, but also an understanding of the problem. Just keep that in mind!
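To make the three options (remove, replace, or leave) concrete, here is a minimal sketch on made-up speeds. The `cap` value is a hypothetical cutoff picked only for illustration, and capping with `clip()` is just one possible way of replacing outliers.

```python
import pandas as pd

# Made-up speeds in knots (NOT the real data); 900 is an implausible extreme
speeds = pd.Series([120, 130, 140, 150, 160, 900])
cap = 300  # hypothetical cutoff chosen only for illustration

# Option 1: remove the rows above the cutoff
removed = speeds[speeds <= cap]

# Option 2: replace extreme values with the cutoff itself
capped = speeds.clip(upper=cap)

# Option 3: leave `speeds` untouched and analyze it as-is

print(len(removed), capped.max())
```

Removing shrinks the sample, while capping keeps every row but distorts the tail. Which trade-off is acceptable depends on the problem.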

### Interquartile Range (IQR) and Percentiles

Recall that the majority of bird strikes happened well below 10,000 feet, so the data piles up on the left with a long tail stretching to the right (a heavy right skew).

In [None]:
df["HEIGHT"].hist(bins=30)

Let's take a look at records where bird strikes exceeded that height and hypothesize those as outliers. 

In [None]:
with pd.option_context('display.max_columns', None):
    display(df[df["HEIGHT"] > 10_000])

Okay, 325 rows is a somewhat small amount compared to the whole dataset. While this goes into bivariate analysis, let's satiate our curiosity and ask what species of birds are capable of flying this high according to the data.

In [None]:
df[df["HEIGHT"] > 10_000]["SPECIES"].value_counts(dropna=False)

Okay, a lot of unknown birds and a lot of diversity with no clear pattern. Are there any birds flying above 25,000 feet? 

In [None]:
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    display(df[df["HEIGHT"] > 25_000])

Interesting. We only had 3 instances where bird strikes occurred above 25,000 feet, including a "Cliff swallow" and a "Wilson's warbler". Is it possible for birds to fly this high? If we do some research, the highest recorded bird collision was in [1973 when a vulture collided at 37,000 feet](https://sora.unm.edu/sites/default/files/journals/wilson/v086n04/p0461-p0462.pdf).

Let's formalize our analysis a bit more. As we saw, `HEIGHT` is not one of those cases that follow the nice bell curve shape of the normal distribution. Another way we can approach outliers in these cases is to use the **Interquartile Range (IQR) method**. The **IQR** is the difference between the 75th and 25th percentiles. The quarterly percentiles (0, 25, 50, 75, and 100) are referred to as quartiles. The 50th percentile is the middle-most value (the median), or the average of the two most-centered values when there is an even number of values.
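To make these definitions concrete, here is a minimal sketch on a small made-up series (not the bird strike data). It uses pandas' `Series.quantile()`, which skips missing values, and checks the result against NumPy's `percentile()` on the same data with the missing value dropped.

```python
import numpy as np
import pandas as pd

# Made-up heights in feet (NOT the real data), including one missing value
heights = pd.Series([0, 0, 0, 100, 400, 1_000, 3_000, np.nan])

# quantile() ignores NaN and interpolates linearly between neighbors by default
quartiles = heights.quantile([0.25, 0.5, 0.75])
q25, q50, q75 = quartiles.loc[0.25], quartiles.loc[0.5], quartiles.loc[0.75]
iqr = q75 - q25

print(f"Q1: {q25} MEDIAN: {q50} Q3: {q75} IQR: {iqr}")

# NumPy gives the same answer once missing values are dropped
print(np.percentile(heights.dropna(), 75))
```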

A box plot (also called a "box and whiskers plot") will visualize all of this quickly as shown below.

<img src ="https://github.com/thomasnield/anaconda_python_eda/raw/public/resource/8U7f1C6A.png" width="600"> </img>

The `1.5` value is known as $ k $, and we can increase it to raise the threshold for what we consider an *outlier*. The box plot will not only show the range of the data, but also show where most data gravitates towards and its skewness. Let's show a `boxplot()` for `HEIGHT`. 

In [None]:
import seaborn as sns 
import matplotlib.pyplot as plt

sns.boxplot(x=df['HEIGHT'])

Well... that's a bit messy. The top 25% of values are above 1000 feet and spread all the way to over 30,000 feet. The bottom 25% of values, however, look extremely compressed at 0 feet, as do all the outlier markers. Is that true? Let's get the exact numbers. Let's also drop the NA values here because they provide no information and distract from the values we do have. While unreported values can be problematic, let's assume it is okay to remove them.

In [None]:
from numpy import percentile

q25 = percentile(df["HEIGHT"].dropna(), 25)
q75 = percentile(df["HEIGHT"].dropna(), 75)

q25, q75

So the bottom 25% of values are indeed at ground level, 0 feet. As a matter of fact, 44% of recorded `HEIGHT` values happen at ground level. We can calculate that like this:

In [None]:
sum(df["HEIGHT"] == 0) / df["HEIGHT"].dropna().shape[0]

This might make sense as birds tend to hang around near the ground where food, nests, water, resting spots, and other habitat essentials are.

Let's compute the same proportion for heights of at least 1000 feet. Sure enough, 26% of values are at or above 1000 feet.

In [None]:
sum(df["HEIGHT"] >= 1000) / df["HEIGHT"].dropna().shape[0]
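The same kind of proportion can also be written as the mean of a boolean series. Here is a minimal sketch on made-up data (not the real dataset), showing why dropping the missing values first matters for the denominator.

```python
import numpy as np
import pandas as pd

# Made-up heights in feet (NOT the real data), including one missing value
heights = pd.Series([0, 0, 500, 1_500, np.nan])
recorded = heights.dropna()

# The mean of a boolean series is the share of True values
share_ground = (recorded == 0).mean()
share_high = (recorded >= 1000).mean()

# Without dropna(), the NaN row compares as False and inflates the denominator
share_ground_with_nan = (heights == 0).mean()

print(share_ground, share_high, share_ground_with_nan)
```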

So what could we consider outliers? Let's try any values below $ Q1 - 1.5 \times \text{IQR} $ or above $ Q3 + 1.5 \times \text{IQR} $. That `1.5` serves as the starting `k` value, and we can increase it for a higher outlier threshold if needed (e.g. we are getting "too many" outliers).

In [None]:
iqr = q75 - q25
k = 1.5
cut_off = iqr * k
lower = q25 - cut_off
upper = q75 + cut_off

outliers = df[(df['HEIGHT'] < lower) | (df['HEIGHT'] > upper)]

with pd.option_context('display.max_columns', None):
    display(outliers)

It is not helpful to look for outliers in the lower direction: zeros make up everything below roughly the 44th percentile, so they are not really outliers. The upper direction might be useful, though, so let's focus there. Let's increase the `k` value to `10` because we really want to raise the threshold and see only the truly fringe values.

In [None]:
iqr = q75 - q25
k = 10
cut_off = iqr * k
upper = q75 + cut_off

outliers = df[(df['HEIGHT'] > upper)]

with pd.option_context('display.max_columns', None):
    display(outliers)

With such a crazy high threshold, we get 219 outliers. Browsing the data, there seem to be a lot of heavy airliners flown by UNITED AIRLINES and SOUTHWEST AIRLINES. At risk of going into bivariate analysis, let's take a look at the `AIRCRAFT` in these outliers to test this theory.

In [None]:
outliers["AIRCRAFT"].value_counts(dropna=False)

Okay, interesting... or maybe not! Airlines fly large aircraft like the 737-800 really high and quite frequently. And where there is more frequency, there is more opportunity to observe outliers like aircraft hitting birds at higher altitudes. Perhaps the [Law of Truly Large Numbers](https://en.wikipedia.org/wiki/Law_of_truly_large_numbers) is playing a role here `¯\_(ツ)_/¯`.
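The IQR steps used in this section can be wrapped in a small helper so that `k` is easy to vary. This is only a sketch: the function name `iqr_fences` is hypothetical, and the data is made up rather than taken from the bird strike dataset.

```python
import pandas as pd


def iqr_fences(values: pd.Series, k: float = 1.5) -> tuple[float, float]:
    """Return (lower, upper) cutoffs: Q1 - k*IQR and Q3 + k*IQR."""
    q25 = values.quantile(0.25)  # quantile() ignores missing values
    q75 = values.quantile(0.75)
    cut_off = (q75 - q25) * k
    return q25 - cut_off, q75 + cut_off


# Made-up heights in feet (NOT the real data)
heights = pd.Series([0, 0, 0, 100, 400, 1_000, 3_000, 40_000])

lower, upper = iqr_fences(heights, k=1.5)
outliers = heights[(heights < lower) | (heights > upper)]

print(lower, upper)
print(outliers)
```

With this made-up data the lower cutoff is negative, so only the upper cutoff can flag anything, which mirrors the situation with `HEIGHT`.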

### Standard Deviation Outliers

Since our `SPEED` variable seems to follow a normal distribution, we can detect outliers using standard deviations. 

Let's create another boxplot. 

In [None]:
import seaborn as sns 
import matplotlib.pyplot as plt

sns.boxplot(x=df['SPEED'])

Okay, that's fairly balanced. There are a few outliers on the right side but not many of them. Let's bring our attention to that right direction, specifically values greater than 3 standard deviations from the mean. 

In [None]:
speed_mean = df["SPEED"].mean()
speed_std = df["SPEED"].std()
outliers = df[df["SPEED"] > (speed_mean+speed_std*3)]

with pd.option_context('display.max_columns', None):
    display(outliers)

This leaves us with 129 records. We will explore this in relationship with other variables, like the aircraft type and the carrier, in the next section. Let's next look in the opposite direction, but there's one problem: 3 standard deviations to the left of the mean is negative, and we do not have recorded negative speeds. 

In [None]:
print(speed_mean-speed_std*3)

Let's dial it back to 2.5 standard deviations. 

In [None]:
print(speed_mean-speed_std*2.5)

In [None]:
# Use the same 2.5 standard deviations printed above
outliers = df[df["SPEED"] < (speed_mean-speed_std*2.5)]
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    display(outliers)

Huh, a lot of these aircraft that were going slowly enough to be captured as outliers seem to be on the ground. Makes sense. We'll save that bivariate deep dive for the next section.
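The standard deviation approach in this section amounts to a z-score cutoff: how many standard deviations each value sits from the mean. Here is a minimal sketch on made-up speeds (not the real data), which also shows how the lower cutoff can fall below zero when a large value inflates the standard deviation.

```python
import pandas as pd

# Made-up speeds in knots (NOT the real data): 20 typical values plus one extreme
speeds = pd.Series(list(range(100, 200, 5)) + [900])

speed_mean = speeds.mean()
speed_std = speeds.std()  # sample standard deviation (ddof=1), the pandas default

# z-score: distance from the mean measured in standard deviations
z = (speeds - speed_mean) / speed_std
outliers = speeds[z.abs() > 3]

# The lower cutoff is negative here, so no value can ever fall below it
lower_cutoff = speed_mean - 3 * speed_std

print(outliers)
print(lower_cutoff)
```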

## EXERCISE

Explore the `DISTANCE` (which is nautical miles from the airport) and `AC_CLASS` variables. What can you observe about each of them? 

For context, `AC_CLASS` is decoded in the following table: 

| Aircraft Code | Aircraft Classification |
|---------------|-------------------------|
| A             | Airplane                |
| B             | Helicopter              |
| C             | Glider                  |
| D             | Balloon                 |
| F             | Dirigible               |
| I             | Gyroplane               |
| J             | Ultralight              |
| Y             | Other                   |
| Z             | Unknown                 |

In [None]:
# PUT CODE HERE 



### SCROLL DOWN FOR ANSWER
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
|<br>
v 

Using a histogram and KDE plot, we can see that bird strikes skew heavily near the airport. 

In [None]:
df["DISTANCE"].hist(bins=30)

In [None]:
df["DISTANCE"].plot.kde(xlim=(0,150))

With `AC_CLASS`, bird strikes happen overwhelmingly to airplanes (class `A`), followed by helicopters (class `B`). This is probably because gliders and ultralight aircraft are less common, rather than because airplanes and helicopters are more vulnerable to bird strikes.

In [None]:
df["AC_CLASS"].value_counts().plot.bar()
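To make the bar labels easier to read, the codes can be mapped to the classification names from the table above before counting. This is a sketch on made-up codes (not the real data). The dictionary is copied from the exercise table, and the variable names are hypothetical.

```python
import pandas as pd

# Code-to-name lookup copied from the AC_CLASS table in the exercise
ac_class_names = {
    "A": "Airplane", "B": "Helicopter", "C": "Glider",
    "D": "Balloon", "F": "Dirigible", "I": "Gyroplane",
    "J": "Ultralight", "Y": "Other", "Z": "Unknown",
}

# Made-up aircraft class codes (NOT the real data), with one missing value
ac_class = pd.Series(["A", "A", "A", "B", "A", "J", None])

# map() swaps each code for its name; missing codes stay missing
labeled = ac_class.map(ac_class_names)

# Proportions instead of raw counts; missing values are excluded by default
shares = labeled.value_counts(normalize=True)
print(shares)
```

Calling `shares.plot.bar()` on the result would give the same kind of chart as above, but with readable labels and proportions on the y-axis.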