# W02d02 - Practice in data cleaning

Load the following data file about airline safety, which compares the period from 1985 to 1999 with the period from 2000 to 2014. Unfortunately, some accidents happened to this dataset, which you are asked to cure:
1. Drop unnecessary columns
1. Locate missing values by their indices
1. Locate inconsistent datatypes by their indices
1. Remove missing values
1. Transform columns to consistent datatypes suitable for statistical analysis
1. Create an additional column containing the counts of fatal accidents for each airline from 1985 to 2014
1. Provide summary statistics for all numeric columns

In [62]:
import pandas as pd
import numpy as np

df = pd.read_csv('airline_safety.csv')

#### 1. Drop unnecessary columns

In [63]:
print(df.columns)
print(df.shape)
df.head()

Index([u'Unnamed: 0', u'airline', u'avail_seat_km_per_week',
       u'incidents_85_99', u'fatal_accidents_85_99', u'fatalities_85_99',
       u'incidents_00_14', u'fatal_accidents_00_14', u'fatalities_00_14'],
      dtype='object')
(56, 9)


| | Unnamed: 0 | airline | avail_seat_km_per_week | incidents_85_99 | fatal_accidents_85_99 | fatalities_85_99 | incidents_00_14 | fatal_accidents_00_14 | fatalities_00_14 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Aer Lingus | 320906734 | 2 | 0.0 | 0 | 0.0 | 0 | 0.0 |
| 1 | 1 | Aeroflot* | 1197672318 | 76 | 14.0 | 128 | six | 1 | 88.0 |
| 2 | 2 | Aerolineas Argentinas | 385803648 | 6 | 0.0 | 0 | 1.0 | 0 | 0.0 |
| 3 | 3 | Aeromexico* | 596871813 | 3 | 1.0 | 64 | 5.0 | 0 | 0.0 |
| 4 | 4 | Air Canada | 1865253802 | 2 | 0.0 | 0 | 2.0 | 0 | 0.0 |


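None of the data columns appears to be unnecessary. The only candidate is `Unnamed: 0`, which seems to duplicate the row index; it is left in place here, so it still shows up in the later outputs.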
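If `Unnamed: 0` is judged redundant, one way to check and drop it is sketched below. The `toy` frame is a hypothetical stand-in built from a few values shown in the `df.head()` output; it is not the real file.

```python
import pandas as pd

# Hypothetical toy frame mimicking a few columns of the real data.
toy = pd.DataFrame({
    'Unnamed: 0': [0, 1, 2],
    'airline': ['Aer Lingus', 'Aeroflot*', 'Aerolineas Argentinas'],
    'incidents_85_99': [2, 76, 6],
})

# The column carries no information if it equals the row index everywhere.
is_redundant = bool((toy['Unnamed: 0'] == toy.index).all())

if is_redundant:
    toy = toy.drop(['Unnamed: 0'], axis=1)

print(toy.columns.tolist())
```

Checking against the index first avoids silently dropping a column that only looks like a counter in the first five rows.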

#### 2. Locate missing values by their indices

In [64]:
for col in df.columns:
    print(col,df[col].index[df[col].apply(pd.isnull)])

('Unnamed: 0', Int64Index([], dtype='int64'))
('airline', Int64Index([], dtype='int64'))
('avail_seat_km_per_week', Int64Index([], dtype='int64'))
('incidents_85_99', Int64Index([], dtype='int64'))
('fatal_accidents_85_99', Int64Index([10], dtype='int64'))
('fatalities_85_99', Int64Index([], dtype='int64'))
('incidents_00_14', Int64Index([40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55], dtype='int64'))
('fatal_accidents_00_14', Int64Index([], dtype='int64'))
('fatalities_00_14', Int64Index([31, 32, 33, 34], dtype='int64'))


Columns fatal_accidents_85_99 (row 10), incidents_00_14 (rows 40 to 55) and fatalities_00_14 (rows 31 to 34) have missing values, as indicated above.
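The same lookup can be written without `apply`, using the boolean mask returned by `DataFrame.isnull()`. This is a minimal sketch on a hypothetical `toy` frame; the column names `a`, `b`, `c` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a few missing entries.
toy = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, np.nan],
    'b': ['x', 'y', None, 'z'],
    'c': [1, 2, 3, 4],
})

# Boolean frame: True wherever a value is missing.
mask = toy.isnull()

# Map each affected column to the index labels of its missing values.
missing = {col: toy.index[mask[col]].tolist()
           for col in toy.columns if mask[col].any()}

print(missing)
```

Columns without missing values are left out of the dictionary, which keeps the report short on wide tables.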

#### 3. Locate inconsistent datatypes by their indices

In [65]:
df.dtypes

Unnamed: 0                  int64
airline                    object
avail_seat_km_per_week      int64
incidents_85_99             int64
fatal_accidents_85_99      object
fatalities_85_99            int64
incidents_00_14            object
fatal_accidents_00_14       int64
fatalities_00_14          float64
dtype: object

fatal_accidents_85_99 and incidents_00_14 are stored as object, which suggests that they contain non-numeric strings; both should be int64.

_I was unable to come up with an elegant solution for this and had to look through the data to find out which rows must be changed._
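One possible way to find the offending rows without reading the data by eye is to coerce the column with `pd.to_numeric(..., errors='coerce')` and look for entries that were present but failed to parse. The sketch below uses a hypothetical toy series; only the value `six` is taken from the output above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column: numeric strings, one word, one missing value.
toy = pd.Series(['0.0', '14.0', 'six', np.nan, '2.0'], name='incidents_00_14')

# Unparseable entries become NaN instead of raising an error.
as_number = pd.to_numeric(toy, errors='coerce')

# Inconsistent entries: not missing originally, but NaN after coercion.
bad = toy.index[as_number.isnull() & toy.notnull()].tolist()

print(bad)
```

Excluding the originally missing values keeps this check separate from the missing-value search in step 2.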

#### 4. Remove missing values

In [66]:
df = df.dropna()
df.shape

(35, 9)

21 rows have been dropped

#### 5. Transform columns to consistent datatypes suitable for statistical analysis

In [73]:
df['incidents_00_14'][1] = '6'
df['fatal_accidents_85_99'][8] = '0'
df['incidents_00_14'] = df['incidents_00_14'].astype("float64")
df['fatal_accidents_85_99'] = df['fatal_accidents_85_99'].astype("float64")

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  if __name__ == '__main__':
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  from ipykernel import kernelapp as app
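The warning above is raised by chained indexing of the form `df[col][row] = value`. A common way to avoid it, sketched here on a hypothetical toy frame, is to work on an explicit copy and assign through a single `.loc` call. The variable names `toy` and `clean` are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one non-numeric string and one missing value.
toy = pd.DataFrame({'incidents_00_14': ['0.0', 'six', np.nan, '5.0']})

# An explicit copy makes clear that we intend to modify the filtered frame.
clean = toy.dropna().copy()

# Single .loc call: row label first, column label second.
clean.loc[1, 'incidents_00_14'] = '6'

clean['incidents_00_14'] = clean['incidents_00_14'].astype('float64')
print(clean['incidents_00_14'].tolist())
```

Note that `.loc` uses index labels, so row `1` still refers to the original row label even after other rows have been dropped.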


_I was unable to convert to int64 and had to use float64 instead. Explanations welcome :)_
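A likely explanation, assuming the object columns hold strings such as `'14.0'` (as the `df.head()` output suggests): a direct cast to int64 fails because a string with a decimal point cannot be parsed as an integer. Going through float64 first and then casting to int64 is one workaround, sketched on a hypothetical toy series:

```python
import pandas as pd

# Hypothetical toy column of numeric strings, some with a decimal point.
col = pd.Series(['0.0', '14.0', '6'], dtype=object)

# A direct cast to int64 is expected to fail on strings like '0.0'.
try:
    col.astype('int64')
    direct_ok = True
except ValueError:
    direct_ok = False

# Two-step cast: parse as float, then convert the whole numbers to int.
as_int = col.astype('float64').astype('int64')

print(direct_ok, as_int.tolist(), as_int.dtype)
```

The second cast is only safe once missing values have been removed, since an int64 column cannot hold NaN.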

#### 6. Create an additional column containing the counts of fatal accidents for each airline from 1985 to 2014

In [76]:
df['total_fatal_accidents'] = df['fatal_accidents_85_99'] + df['fatal_accidents_00_14']

#### 7. Provide summary statistics for all numeric columns

In [77]:
df.describe()

| | Unnamed: 0 | avail_seat_km_per_week | incidents_85_99 | fatal_accidents_85_99 | fatalities_85_99 | incidents_00_14 | fatal_accidents_00_14 | fatalities_00_14 | total_fatal_accidents |
|---|---|---|---|---|---|---|---|---|---|
| count | 35.0 | 35.0 | 35.0 | 35.0 | 35.0 | 35.0 | 35.0 | 35.0 | 35.0 |
| mean | 18.285714 | 1214864000.0 | 8.0 | 2.142857 | 109.2 | 4.057143 | 0.685714 | 59.457143 | 2.828571 |
| std | 11.513602 | 1406822000.0 | 13.4186 | 3.228029 | 155.217457 | 4.819952 | 0.866753 | 105.791233 | 3.721841 |
| min | 0.0 | 277414800.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 8.5 | 404964900.0 | 2.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.5 |
| 50% | 18.0 | 613356700.0 | 4.0 | 1.0 | 47.0 | 3.0 | 0.0 | 0.0 | 1.0 |
| 75% | 26.5 | 1385945000.0 | 7.5 | 3.0 | 157.5 | 5.0 | 1.0 | 88.0 | 4.5 |
| max | 39.0 | 6525659000.0 | 76.0 | 14.0 | 535.0 | 24.0 | 3.0 | 416.0 | 15.0 |
