# Dataset Specification and Dictionary

***Aircraft-Wildlife Collisions***
**Description**
A collection of all collisions between aircraft and wildlife that were reported to the US Federal Aviation Administration between 1990 and 1997, with details on the circumstances of each collision.

**Format**
A data frame with 19302 observations on the following 17 variables.

- opid:
Three letter identification code for the operator (carrier) of the aircraft.

- operator:
Name of the aircraft operator.

- atype:
Make and model of aircraft.

- remarks:
Verbal remarks regarding the collision.

- phase_of_flt:
Phase of the flight during which the collision occurred: Approach, Climb, Descent, En Route, Landing Roll, Parked, Take-off run, Taxi.

- ac_mass:
Mass of the aircraft classified as 2250 kg or less (1), 2251-5700 kg (2), 5701-27000 kg (3), 27001-272000 kg (4), above 272000 kg (5).

- num_engs:
Number of engines on the aircraft.

- date:
Date of the collision (MM/DD/YYYY).

- time_of_day:
Light conditions: Dawn, Day, Dusk, Night.

- state:
Two letter abbreviation of the US state in which the collision occurred.

- height:
Feet above ground level.

- speed:
Knots (indicated air speed).

- effect:
Effect on flight: Aborted Take-off, Engine Shut Down, None, Other, Precautionary Landing.

- sky:
Type of cloud cover, if any: No Cloud, Overcast, Some Cloud.

- species:
Common name for bird or other wildlife.

- birds_seen:
Number of birds/wildlife seen by pilot: 1, 2-10, 11-100, Over 100.

- birds_struck:
Number of birds/wildlife struck: 0, 1, 2-10, 11-100, Over 100.
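
Several of the variables above are coded or binned and have a natural order. The following is a minimal sketch of how that order could be made explicit in pandas. The label strings in `AC_MASS_LABELS` and the three-row `sample` frame are illustrative assumptions, not part of the dataset.

```python
import pandas as pd

# Illustrative labels for the five ac_mass codes described in the dictionary.
# Keys are floats because the column is read as float64 when values are missing.
AC_MASS_LABELS = {
    1.0: "<=2250 kg",
    2.0: "2251-5700 kg",
    3.0: "5701-27000 kg",
    4.0: "27001-272000 kg",
    5.0: ">272000 kg",
}
# Bin order for birds_struck, as listed in the dictionary.
BIRDS_STRUCK_ORDER = ["0", "1", "2-10", "11-100", "Over 100"]

# Tiny made-up sample standing in for the real data.
sample = pd.DataFrame({
    "ac_mass": [4.0, 1.0, None],
    "birds_struck": ["1", "2-10", "Over 100"],
})

# Map the numeric code to a readable label; missing codes stay missing.
sample["ac_mass_label"] = sample["ac_mass"].map(AC_MASS_LABELS)
# Store the bins as an ordered categorical so comparisons and sorting respect the bin order.
sample["birds_struck"] = pd.Categorical(
    sample["birds_struck"], categories=BIRDS_STRUCK_ORDER, ordered=True
)
```

With an ordered categorical, sorting and `min`/`max` follow the bin order rather than alphabetical order.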

**Details**
The FAA National Wildlife Strike Database contains strike reports that are voluntarily reported to the FAA by pilots, airlines, airports and others. Current research indicates that only about 20% of strikes are reported. Wildlife strike reporting is not uniform, as some organizations have more robust voluntary reporting procedures. Because of variations in reporting, users are cautioned that comparisons between individual airports or airlines may be misleading.

**Source**
Aircraft Wildlife Strike Data: Search Tool - FAA Wildlife Strike Database. Available at https://dev.socrata.com/foundry/datahub.transportation.gov/jhay-dgxy. Retrieval date: Feb 4, 2012.

**Objective:**
In this project, I try to predict the effect of a bird strike on a flight, in order to speed up decision-making and airfield preparations in emergencies. I hope the project will also inspire others in their future work.

# Import libraries  

In [1]:
# libraries for EDA
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import cufflinks as cf
#Enabling the offline mode for interactive plotting locally
from plotly.offline import download_plotlyjs, init_notebook_mode, plot, iplot

init_notebook_mode(connected=True)
cf.go_offline()

#To display the plots
%matplotlib inline
from ipywidgets import interact
import plotly.io as pio

pio.renderers.default = "notebook"

# sklearn library for machine learning algorithms, data preprocessing, and evaluation
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
from sklearn.model_selection import train_test_split, GridSearchCV, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, log_loss, recall_score, accuracy_score, precision_score, f1_score

# yellowbrick library for visualizing the model performance
from yellowbrick.classifier import ConfusionMatrix
from yellowbrick.cluster import KElbowVisualizer

from sklearn.pipeline import Pipeline
# to get rid of the warnings
import warnings

warnings.filterwarnings("ignore")
warnings.warn("this will not show")

# Ingest Data

In [2]:
bird_df = pd.read_csv("birds.csv")

In [3]:
bird_df.head()

Unnamed: 0,opid,operator,atype,remarks,phase_of_flt,ac_mass,num_engs,date,time_of_day,state,height,speed,effect,sky,species,birds_seen,birds_struck
0,AAL,AMERICAN AIRLINES,MD-80,NO DAMAGE,Descent,4.0,2.0,9/30/1990 0:00:00,Night,IL,7000.0,250.0,,No Cloud,UNKNOWN BIRD - MEDIUM,,1
1,USA,US AIRWAYS,FK-28-4000,"2 BIRDS, NO DAMAGE.",Climb,4.0,2.0,11/29/1993 0:00:00,Day,MD,10.0,140.0,,No Cloud,UNKNOWN BIRD - MEDIUM,2-10,2-10
2,AAL,AMERICAN AIRLINES,B-727-200,,Approach,4.0,3.0,8/13/1993 0:00:00,Day,TN,400.0,140.0,,Some Cloud,UNKNOWN BIRD - SMALL,2-10,1
3,AAL,AMERICAN AIRLINES,MD-82,,Climb,4.0,2.0,10/7/1993 0:00:00,Day,VA,100.0,200.0,,Overcast,UNKNOWN BIRD - SMALL,,1
4,AAL,AMERICAN AIRLINES,MD-82,NO DAMAGE,Climb,4.0,2.0,9/25/1993 0:00:00,Day,SC,50.0,170.0,,Some Cloud,UNKNOWN BIRD - SMALL,2-10,1


# EDA 

In [4]:
df = bird_df.copy()

In [5]:
df.shape

(19302, 17)

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19302 entries, 0 to 19301
Data columns (total 17 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   opid          19302 non-null  object 
 1   operator      19302 non-null  object 
 2   atype         19302 non-null  object 
 3   remarks       16516 non-null  object 
 4   phase_of_flt  17519 non-null  object 
 5   ac_mass       18018 non-null  float64
 6   num_engs      17995 non-null  float64
 7   date          19302 non-null  object 
 8   time_of_day   17225 non-null  object 
 9   state         18431 non-null  object 
 10  height        16109 non-null  float64
 11  speed         12294 non-null  float64
 12  effect        13584 non-null  object 
 13  sky           15723 non-null  object 
 14  species       19302 non-null  object 
 15  birds_seen    4764 non-null   object 
 16  birds_struck  19263 non-null  object 
dtypes: float64(4), object(13)
memory usage: 2.5+ MB


* Maybe I can convert the numerical columns to integers to improve performance.
* Some features, such as "operator" and "opid", carry the same information, so I will drop them.
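
The integer conversion mentioned above could look like the following. This is a minimal sketch on a made-up `sample` frame, assuming the values are whole numbers stored as floats only because of missing entries. The main benefit is likely reduced memory rather than speed.

```python
import pandas as pd

# Made-up rows mimicking the float64 columns reported by df.info().
sample = pd.DataFrame({
    "ac_mass": [4.0, 2.0, None],
    "num_engs": [2.0, 3.0, None],
})

# "Int8" (capital I) is pandas' nullable integer dtype, so NaN becomes <NA>
# instead of forcing the column back to float.
compact = sample.astype({"ac_mass": "Int8", "num_engs": "Int8"})
```

The conversion raises an error if a column contains non-integer values, which doubles as a sanity check.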

In [7]:
df.dropna(how="all", inplace=True)

In [24]:
def missing_perct(df, print_it=False):
    # Percentage of missing values per column, rounded to 2 decimals
    pct = round(df.isnull().sum() / df.shape[0] * 100, 2)
    if print_it:
        print(pct)
    return pct

In [26]:
missing_perct(df,print_it=False)

opid             0.00
operator         0.00
atype            0.00
remarks         18.27
phase_of_flt     3.52
ac_mass          2.06
num_engs         2.17
date             0.00
time_of_day      3.31
state            4.36
height          10.51
speed           29.09
effect           0.00
sky             11.40
species          0.00
birds_seen      70.82
birds_struck     0.26
dtype: float64

* Some features have a high share of missing values. I will try to impute those with up to 40 percent missing.
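
A minimal sketch of applying that 40 percent rule. The percentages are copied from the output above (columns with 0% missing are omitted), and the names `MAX_MISSING_PCT`, `impute_candidates` and `too_sparse` are my own.

```python
import pandas as pd

MAX_MISSING_PCT = 40  # threshold stated in the note above

# Missing percentages copied from the missing_perct output.
missing_pct = pd.Series({
    "remarks": 18.27, "phase_of_flt": 3.52, "ac_mass": 2.06, "num_engs": 2.17,
    "time_of_day": 3.31, "state": 4.36, "height": 10.51, "speed": 29.09,
    "sky": 11.40, "birds_seen": 70.82, "birds_struck": 0.26,
})

# Columns with some, but not too many, missing values are imputation candidates.
within_limit = (missing_pct > 0) & (missing_pct <= MAX_MISSING_PCT)
impute_candidates = missing_pct[within_limit].index.tolist()
# Columns above the threshold are too sparse to impute reliably.
too_sparse = missing_pct[missing_pct > MAX_MISSING_PCT].index.tolist()
```

On these numbers only `birds_seen` exceeds the threshold.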

In [10]:
df.loc[df.duplicated()]

Unnamed: 0,opid,operator,atype,remarks,phase_of_flt,ac_mass,num_engs,date,time_of_day,state,height,speed,effect,sky,species,birds_seen,birds_struck
8715,ABX,ABX AIR (was AIRBORNE EXPRESS),DC-9-40,,En Route,4.0,2.0,10/13/1994 0:00:00,Night,,,,,,UNKNOWN BIRD,,1
14090,UNK,UNKNOWN,UNKNOWN,,,,,10/12/1996 0:00:00,,IL,,,,,RING-BILLED GULL,,1


* There are 2 duplicated rows; we should drop them for a cleaner analysis.

In [11]:
df.drop_duplicates(inplace=True)
df.reset_index(drop=True, inplace=True)

## Investigate the columns 

In [12]:
def first_look(df, col):
    # Print the column name followed by its value counts (NaN included)
    print("Column name:" + col)
    print(f"Value Counts \n{df[col].value_counts(dropna=False)}\n" + "*+" * 30)

In [14]:
for col in df.columns:
    first_look(df, col)

Column name:opid
Value Counts 
AAL    3139
BUS    2443
USA    1836
UNK    1613
SWA    1096
       ... 
AVQ       1
RGS       1
LSY       1
PAH       1
AAR       1
Name: opid, Length: 285, dtype: int64
*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+
Column name:operator
Value Counts 
AMERICAN AIRLINES     3139
BUSINESS              2443
US AIRWAYS            1836
UNKNOWN               1613
SOUTHWEST AIRLINES    1096
                      ... 
AVIATION SERVICES        1
RENOWN AVIATION          1
LINDSAY AVIATION         1
PANORAMA AIR TOUR        1
ASIANA AIRLINES          1
Name: operator, Length: 285, dtype: int64
*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+
Column name:atype
Value Counts 
B-737-300      1328
UNKNOWN        1252
MD-80          1058
B-737-200      1030
B-727           854
               ... 
CONVAIR 640       1
B-747-300         1
C-404             1
BE-80 QUEEN       1
HUGHES 300        1
Name: atype, Length: 284, dtype: int64
*+*+*+*+*+*+

### effect

* This will be my target, so I will drop NaNs unless it is obvious which class they belong to.

### opid

* I will use this feature only for EDA and analysis, then drop it.
* Its cardinality is too high (285 distinct values), which would affect many ML algorithms.

### operator

* This feature carries the same information as opid.
* Same process as for opid.
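
The value counts suggest that the two columns are redundant (both have 285 distinct values with matching counts). A minimal sketch of how that could be confirmed before dropping one of them, using a made-up `sample` frame in place of `df`:

```python
import pandas as pd

# Made-up rows; on the real data, replace `sample` with `df`.
sample = pd.DataFrame({
    "opid": ["AAL", "AAL", "USA", "UNK"],
    "operator": ["AMERICAN AIRLINES", "AMERICAN AIRLINES", "US AIRWAYS", "UNKNOWN"],
})

# Each code should map to exactly one name, and each name to exactly one code.
names_per_code = sample.groupby("opid")["operator"].nunique()
codes_per_name = sample.groupby("operator")["opid"].nunique()
is_one_to_one = bool((names_per_code == 1).all() and (codes_per_name == 1).all())
```

If `is_one_to_one` is `False` on the full data, the offending codes can be listed with `names_per_code[names_per_code > 1]`.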

### atype

* I will examine the distribution of effect over aircraft type.
* This feature will stay as an independent variable
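
One way to look at the distribution of effect over aircraft type is a row-normalized cross-tabulation. This is a minimal sketch on made-up rows, not the analysis itself:

```python
import pandas as pd

# Made-up rows; on the real data, replace `sample` with `df`.
sample = pd.DataFrame({
    "atype": ["MD-80", "MD-80", "B-727", "B-727"],
    "effect": ["None", "Precautionary Landing", "None", "None"],
})

# normalize="index" turns counts into per-aircraft-type shares (each row sums to 1).
effect_share = pd.crosstab(sample["atype"], sample["effect"], normalize="index")
```

With 284 aircraft types in the real data, it is probably worth restricting the table to the most frequent types before plotting.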

### remarks

* This field is free text; I could extract features from it for ML and analysis.
* For the first iteration I will drop it.
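
If the column is revisited later, a simple keyword flag could be a first feature. This is a hypothetical sketch: the `says_no_damage` column is my own name, and the remarks are taken from the sample rows shown in this notebook.

```python
import pandas as pd

sample = pd.DataFrame({
    "remarks": ["NO DAMAGE", "2 BIRDS, NO DAMAGE.", None, "FLT 663 HAD NO DAMAGAE."],
})

# Hypothetical feature: does the remark explicitly say "NO DAMAGE"?
# Missing remarks count as False; misspellings such as "DAMAGAE" are not caught.
sample["says_no_damage"] = sample["remarks"].str.contains(
    "NO DAMAGE", case=False, na=False, regex=False
)
```

A flag like this needs care, because it may describe the outcome being predicted and so leak the target.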

### phase_of_flt 

* It has missing values.
* I will use height to fill missing values.
* I will check the distribution of missing values over other features.

### ac_mass

* It has missing values.
* I can look up the mass class from the aircraft type (atype).
* I will fill NaN values with the help of the atype column.

### num_engs

* It has missing values.
* I can look up the number of engines from the aircraft type (atype).
* I will fill NaN values with the help of the atype column.
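
A minimal sketch of filling `ac_mass` and `num_engs` from `atype`, assuming the most frequent value within each aircraft type is a reasonable stand-in. The helper `most_common` and the `sample` frame are illustrative.

```python
import pandas as pd

# Made-up rows; on the real data, replace `sample` with `df`.
sample = pd.DataFrame({
    "atype": ["MD-80", "MD-80", "MD-80", "B-727", "B-727", "UNKNOWN"],
    "ac_mass": [4.0, None, 4.0, 4.0, None, None],
    "num_engs": [2.0, 2.0, None, None, 3.0, None],
})

def most_common(values: pd.Series):
    """Return the most frequent non-missing value, or NaN if there is none."""
    modes = values.mode()  # mode() ignores NaN by default
    return modes.iloc[0] if not modes.empty else float("nan")

for col in ["ac_mass", "num_engs"]:
    # Broadcast each aircraft type's most common value back to its rows.
    per_type = sample.groupby("atype")[col].transform(most_common)
    # Only fill the gaps; observed values are left untouched.
    sample[col] = sample[col].fillna(per_type)
```

Rows whose aircraft type never has an observed value (such as `UNKNOWN` here) stay missing and need another strategy.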

### date

* I will use this column for feature extraction.
* Features I will create: "season", "month_of_the_year".
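
A minimal sketch of the two date features. The season mapping assumes meteorological seasons in the Northern Hemisphere, which is my assumption; the date strings follow the format seen in `bird_df.head()`.

```python
import pandas as pd

sample = pd.DataFrame({
    "date": ["9/30/1990 0:00:00", "11/29/1993 0:00:00", "8/13/1993 0:00:00"],
})

# Parse the M/D/YYYY H:MM:SS strings into timestamps.
parsed = pd.to_datetime(sample["date"], format="%m/%d/%Y %H:%M:%S")
sample["month_of_the_year"] = parsed.dt.month

# Assumed mapping: meteorological seasons, Northern Hemisphere.
SEASON_BY_MONTH = {
    12: "Winter", 1: "Winter", 2: "Winter",
    3: "Spring", 4: "Spring", 5: "Spring",
    6: "Summer", 7: "Summer", 8: "Summer",
    9: "Fall", 10: "Fall", 11: "Fall",
}
sample["season"] = sample["month_of_the_year"].map(SEASON_BY_MONTH)
```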

### time_of_day

* The column is clean, so I will use it as is.

### state

* It has high cardinality.
* I will check bird flight routes and cluster the states into 4 or 6 groups.
* Other option: use an unsupervised approach to obtain the same number of clusters.

### height

* It has missing values.
* I can impute the NaNs via phase_of_flt.

### speed

* It has missing values.
* I will impute speed using "height", "phase_of_flt", and "atype".
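
A minimal sketch of one possible imputation: group medians with a fallback from the most specific group to a broader one. It uses only `phase_of_flt` and `atype`; bringing in `height` (for example via height bins) is left out here. The `speed_imputed` column and the `sample` rows are illustrative.

```python
import pandas as pd

# Made-up rows; on the real data, replace `sample` with `df`.
sample = pd.DataFrame({
    "phase_of_flt": ["Approach", "Approach", "Approach", "Climb", "Climb", "Taxi"],
    "atype": ["A-320", "A-320", "B-727", "MD-82", "MD-82", "SAAB-340"],
    "speed": [140.0, None, None, 200.0, None, None],
})

# Most specific estimate first: median within (phase_of_flt, atype).
by_phase_type = sample.groupby(["phase_of_flt", "atype"])["speed"].transform("median")
# Fallback: median of the whole flight phase.
by_phase = sample.groupby("phase_of_flt")["speed"].transform("median")

# Observed speeds are kept; gaps are filled from the narrowest group that has data.
sample["speed_imputed"] = sample["speed"].fillna(by_phase_type).fillna(by_phase)
```

Groups with no observed speed at all (the `Taxi` row here) remain missing.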

### species

* I will reduce the number of categories in this column to 4.
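
The grouping rule is not decided yet. One simple option, sketched here as an assumption, is to keep the three most frequent species and lump everything else into an "OTHER" bucket, which yields 4 categories.

```python
import pandas as pd

# Made-up series using species names that appear in this notebook.
species = pd.Series([
    "UNKNOWN BIRD - SMALL", "UNKNOWN BIRD - SMALL", "UNKNOWN BIRD - SMALL",
    "UNKNOWN BIRD - MEDIUM", "UNKNOWN BIRD - MEDIUM",
    "UNKNOWN BIRD", "UNKNOWN BIRD",
    "RING-BILLED GULL", "DUCKS, GEESE, SWANS",
])

N_KEEP = 3  # keep the 3 most frequent labels, so 4 categories including "OTHER"
top = species.value_counts().nlargest(N_KEEP).index
# Values outside the top labels are replaced by "OTHER".
species_grouped = species.where(species.isin(top), "OTHER")
```

A domain-based grouping (for example by bird size) may be more meaningful than a purely frequency-based one.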

### birds_seen

* I will not use this feature, because my question is whether I can predict the effect of a bird strike.
* The feature does not contribute to that goal.

### birds_struck

* This feature is crucial and it has some missing values.
* I will try to impute those values using other features.
* I can use a supervised approach to fill them.

## Deal with Missing Values

### Target feature : effect

In [15]:
df.effect.value_counts(dropna=False)

None                     11610
NaN                       5717
Precautionary Landing      965
Aborted Take-off           544
Other                      348
Engine Shut Down           116
Name: effect, dtype: int64

In [16]:
df.loc[df["effect"].isnull()].sample(2)

Unnamed: 0,opid,operator,atype,remarks,phase_of_flt,ac_mass,num_engs,date,time_of_day,state,height,speed,effect,sky,species,birds_seen,birds_struck
8492,ARY,ARGOSY AIRWAYS,PA-31 NAVAJO,FLT 546 HAD NO DAMAGE. # OF BIRDS HIT NOT REPTD.,Taxi,2.0,2.0,12/18/1992 0:00:00,Night,CA,0.0,,,No Cloud,UNKNOWN BIRD - SMALL,,1
692,UAL,UNITED AIRLINES,B-727,FLT 663 HAD NO DAMAGAE.,Approach,4.0,3.0,10/18/1990 0:00:00,Night,MN,2500.0,170.0,,Some Cloud,UNKNOWN BIRD - MEDIUM,,1


In [17]:
df.loc[df["effect"]=="None"].sample(2)

Unnamed: 0,opid,operator,atype,remarks,phase_of_flt,ac_mass,num_engs,date,time_of_day,state,height,speed,effect,sky,species,birds_seen,birds_struck
16552,BUS,BUSINESS,PA-32,STRUCK AT WING ROOT ON REINFORCED SURFACE. DAM...,En Route,1.0,1.0,9/27/1997 0:00:00,Night,KY,5500.0,,,No Cloud,"DUCKS, GEESE, SWANS",,1
14557,UAL,UNITED AIRLINES,B-757-200,"NO DAMAGE, MESSY.",Approach,4.0,2.0,10/29/1995 0:00:00,Night,DC,1700.0,170.0,,No Cloud,UNKNOWN BIRD,,1


* <span class="mark">NaNs look like the None class, but this is unclear and ambiguous, so I decided to drop them.</span>

In [18]:
df = df.dropna(subset=["effect"]).reset_index(drop=True)

In [31]:
missing_perct(df,print_it=True).iplot("bar")

opid             0.00
operator         0.00
atype            0.00
remarks         18.27
phase_of_flt     3.52
ac_mass          2.06
num_engs         2.17
date             0.00
time_of_day      3.31
state            4.36
height          10.51
speed           29.09
effect           0.00
sky             11.40
species          0.00
birds_seen      70.82
birds_struck     0.26
dtype: float64


### phase_of_flt

In [86]:
df.groupby(["phase_of_flt","atype"]).agg({"height" : ["median","mean","count"],"speed":["median","mean","count"]})

phase_of_flt,atype,height median,height mean,height count,speed median,speed mean,speed count
Approach,A-300,1000.0,1783.857143,35,150.0,166.064516,31
Approach,A-310,1000.0,1506.666667,3,170.0,168.333333,3
Approach,A-320,500.0,923.636364,33,140.0,147.464286,28
Approach,AEROS SN601,600.0,600.000000,1,130.0,130.000000,1
Approach,AEROSTAR 600,10.0,10.000000,1,90.0,90.000000,1
...,...,...,...,...,...,...,...
Taxi,PA-34 SENECA,0.0,0.000000,1,,,0
Taxi,PA-60 600,0.0,0.000000,1,,,0
Taxi,SAAB-340,0.0,0.000000,1,,,0
Taxi,SIKORSKY S-76,0.0,0.000000,1,,,0


In [52]:
temp = df.loc[df["phase_of_flt"].isnull(),"phase_of_flt"]
temp.index

Int64Index([   11,    66,    91,    95,   118,   160,   227,   350,   362,
              380,
            ...
            13453, 13459, 13497, 13502, 13512, 13530, 13532, 13566, 13572,
            13579],
           dtype='int64', length=478)

In [81]:
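for idx in temp.index:
    height, speed = df.loc[idx, "height"], df.loc[idx, "speed"]
    # Comparisons (unlike `in range(...)`) also match non-integer floats; NaN never matches
    # Prints the (still missing) phase for rows that fall inside either window
    if 50 <= height < 500 and 85 <= speed < 150:
        print(df.loc[idx, "phase_of_flt"])
    if 500 <= height < 1000 and 125 <= speed < 200:
        print(df.loc[idx, "phase_of_flt"])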

nan
nan
nan
nan
nan
nan
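
The same check can be done without a Python loop. This is a minimal sketch on made-up rows that counts how many rows with a missing phase fall inside each of the two height/speed windows used in the loop above. The names `low_window` and `high_window` are my own.

```python
import pandas as pd

# Made-up rows; on the real data, replace `sample` with `df`.
sample = pd.DataFrame({
    "phase_of_flt": [None, None, None, "Climb"],
    "height": [100.0, 800.0, None, 100.0],
    "speed": [140.0, 130.0, 120.0, 140.0],
})

h, s = sample["height"], sample["speed"]
missing_phase = sample["phase_of_flt"].isna()

# Half-open windows matching range(): lower bound included, upper bound excluded.
# Comparisons against NaN are False, so rows with missing height or speed never match.
low_window = (h >= 50) & (h < 500) & (s >= 85) & (s < 150)
high_window = (h >= 500) & (h < 1000) & (s >= 125) & (s < 200)

n_low = int((missing_phase & low_window).sum())
n_high = int((missing_phase & high_window).sum())
```

The boolean masks can later be reused with `df.loc[mask, "phase_of_flt"] = ...` once it is decided which phase each window should map to.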


In [35]:
df.loc[df["phase_of_flt"].isnull(),"height"].value_counts(dropna=False)

NaN        449
0.0          4
10.0         3
20.0         3
100.0        3
25.0         2
500.0        2
1000.0       2
50.0         1
300.0        1
550.0        1
5800.0       1
1500.0       1
800.0        1
10000.0      1
1600.0       1
30.0         1
400.0        1
Name: height, dtype: int64

In [41]:
df.loc[df["phase_of_flt"].isnull(),"speed"].value_counts(dropna=False)

NaN      444
140.0      6
130.0      4
120.0      4
150.0      3
60.0       2
80.0       2
280.0      1
145.0      1
200.0      1
122.0      1
195.0      1
215.0      1
105.0      1
115.0      1
155.0      1
170.0      1
138.0      1
240.0      1
153.0      1
Name: speed, dtype: int64