# W02d02 - Practice in data cleaning

Load the following data file on airline safety, which compares the period from 1985 to 1999 with the period from 2000 to 2014. Unfortunately, some accidents happened to this dataset, which you are asked to cure:
1. Drop unnecessary columns
1. Locate missing values by their indices
1. Locate inconsistent datatypes by their indices
1. Remove missing values
1. Transform columns to consistent datatypes suitable for statistical analysis
1. Create an additional column containing the counts of fatal accidents for each airline from 1985 to 2014
1. Provide summary statistics for all numeric columns

In [1]:
import pandas as pd
import numpy as np

df = pd.read_csv('airline_safety.csv')

In [2]:
df.head()

| | Unnamed: 0 | airline | avail_seat_km_per_week | incidents_85_99 | fatal_accidents_85_99 | fatalities_85_99 | incidents_00_14 | fatal_accidents_00_14 | fatalities_00_14 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Aer Lingus | 320906734 | 2 | 0.0 | 0 | 0.0 | 0 | 0.0 |
| 1 | 1 | Aeroflot* | 1197672318 | 76 | 14.0 | 128 | six | 1 | 88.0 |
| 2 | 2 | Aerolineas Argentinas | 385803648 | 6 | 0.0 | 0 | 1.0 | 0 | 0.0 |
| 3 | 3 | Aeromexico* | 596871813 | 3 | 1.0 | 64 | 5.0 | 0 | 0.0 |
| 4 | 4 | Air Canada | 1865253802 | 2 | 0.0 | 0 | 2.0 | 0 | 0.0 |


In [3]:
df.shape

(56, 9)

In [4]:
# drop the first column (a redundant copy of the row index)
df.drop(df.columns[0], axis=1, inplace=True)

In [5]:
# list the row indices of missing values for each affected column
for col in df.columns:
    missing = df.index[df[col].isnull()]
    if len(missing) > 0:
        print(col, missing)

fatal_accidents_85_99 Int64Index([10], dtype='int64')
incidents_00_14 Int64Index([40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55], dtype='int64')
fatalities_00_14 Int64Index([31, 32, 33, 34], dtype='int64')
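
The loop above can also be condensed into a single dictionary that maps each affected column to the row labels of its gaps. The following is a minimal sketch on a small toy frame; the frame and the name `missing_index` are assumptions made for illustration and are not part of `airline_safety.csv`.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values, not taken from airline_safety.csv)
toy = pd.DataFrame({
    "airline": ["A", "B", "C", "D"],
    "incidents": [2.0, np.nan, 6.0, np.nan],
    "fatalities": [0.0, 88.0, np.nan, 3.0],
})

# Map every column that contains gaps to the row labels of its missing values
missing_index = {
    col: toy.index[toy[col].isna()].tolist()
    for col in toy.columns
    if toy[col].isna().any()
}
print(missing_index)  # {'incidents': [1, 3], 'fatalities': [2]}
```

A dictionary like this is convenient when the indices are needed again later, for example to inspect the affected rows before dropping them.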


In [6]:
# drop missing values
df.dropna(inplace=True)
df.shape

(35, 8)

In [7]:
# check types
df.dtypes

airline                    object
avail_seat_km_per_week      int64
incidents_85_99             int64
fatal_accidents_85_99      object
fatalities_85_99            int64
incidents_00_14            object
fatal_accidents_00_14       int64
fatalities_00_14          float64
dtype: object

In [8]:
# locate unsuitable entry in incidents_00_14
df.incidents_00_14.index[df.incidents_00_14=='six'][0]

1
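
Searching for `'six'` works once the offending value is already known. When it is not, one possible approach is to coerce the column with `pd.to_numeric(..., errors="coerce")` and look for entries that were present but turned into NaN. This is a sketch on a toy column; the values and variable names are assumptions for illustration.

```python
import pandas as pd

# Toy column (hypothetical values) mixing numeric strings and spelled-out numbers
s = pd.Series(["0.0", "six", "1.0", "zero", "5.0"], name="incidents")

# Coerce to numbers: anything that cannot be parsed becomes NaN
as_number = pd.to_numeric(s, errors="coerce")

# Entries that exist in the original but failed to parse are the inconsistent ones
bad_mask = as_number.isna() & s.notna()
bad_entries = s[bad_mask]

print(bad_entries.index.tolist())  # [1, 3]
print(bad_entries.tolist())        # ['six', 'zero']
```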

In [9]:
# cure it and transform column type
# (.loc selects by label, so it stays correct after dropna() and avoids chained assignment)
df.loc[df.incidents_00_14 == 'six', 'incidents_00_14'] = '6'
df.incidents_00_14 = df.incidents_00_14.astype(float)


In [10]:
# locate unsuitable entry in fatal_accidents_85_99
df.fatal_accidents_85_99.index[df.fatal_accidents_85_99=='zero'][0]

8

In [11]:
# cure it and transform column type
# (.loc selects by label, so it stays correct after dropna() and avoids chained assignment)
df.loc[df.fatal_accidents_85_99 == 'zero', 'fatal_accidents_85_99'] = 0
df.fatal_accidents_85_99 = df.fatal_accidents_85_99.astype(float)
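
If several spelled-out numbers occur, the cure and the type conversion can be combined with a lookup table. The sketch below assumes a hypothetical mapping `word_to_number` and a toy frame whose index has gaps, as it would after `dropna()`; neither is taken from the original data.

```python
import pandas as pd

# Toy frame (hypothetical values); the index has gaps, as after dropna()
toy = pd.DataFrame({"incidents": ["0.0", "six", "1.0", "zero"]}, index=[0, 1, 3, 8])

# Hypothetical lookup table for spelled-out numbers found in the column
word_to_number = {"zero": "0", "six": "6"}

# replace() swaps exact matches, then to_numeric() converts the whole column
toy["incidents"] = pd.to_numeric(toy["incidents"].replace(word_to_number))

print(toy["incidents"].tolist())  # [0.0, 6.0, 1.0, 0.0]
print(toy["incidents"].dtype)     # float64
```

Because the replacement matches on values rather than positions, it does not matter that the row labels no longer run from 0 to n-1.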

In [12]:
# add new column for all fatal accidents
df['fatal_accidents_85_2014'] = df['fatal_accidents_85_99'] + df['fatal_accidents_00_14']
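
Adding the two columns with `+` is safe here because rows with missing values were dropped beforehand. The sketch below, on toy data with assumed values, illustrates how `+` and a row-wise `sum` would differ if a gap were still present.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values) with one gap left in the first period
toy = pd.DataFrame({
    "fatal_accidents_85_99": [0.0, 14.0, np.nan],
    "fatal_accidents_00_14": [0, 1, 2],
})

# Element-wise addition propagates NaN into the total
added = toy["fatal_accidents_85_99"] + toy["fatal_accidents_00_14"]

# A row-wise sum skips NaN by default and treats the gap as zero
summed = toy.filter(like="fatal_accidents").sum(axis=1)

print(added.tolist())   # [0.0, 15.0, nan]
print(summed.tolist())  # [0.0, 15.0, 2.0]
```

Which behaviour is preferable depends on the analysis: a NaN total signals that the count is incomplete, whereas the row-wise sum silently understates it.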

In [13]:
# check datatypes
df.dtypes

airline                     object
avail_seat_km_per_week       int64
incidents_85_99              int64
fatal_accidents_85_99      float64
fatalities_85_99             int64
incidents_00_14            float64
fatal_accidents_00_14        int64
fatalities_00_14           float64
fatal_accidents_85_2014    float64
dtype: object

In [14]:
# print summary statistics
df.describe()

| | avail_seat_km_per_week | incidents_85_99 | fatal_accidents_85_99 | fatalities_85_99 | incidents_00_14 | fatal_accidents_00_14 | fatalities_00_14 | fatal_accidents_85_2014 |
|---|---|---|---|---|---|---|---|---|
| count | 35.0 | 35.0 | 35.0 | 35.0 | 35.0 | 35.0 | 35.0 | 35.0 |
| mean | 1214864000.0 | 8.0 | 2.142857 | 109.2 | 4.057143 | 0.685714 | 59.457143 | 2.828571 |
| std | 1406822000.0 | 13.4186 | 3.228029 | 155.217457 | 4.819952 | 0.866753 | 105.791233 | 3.721841 |
| min | 277414800.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 404964900.0 | 2.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.5 |
| 50% | 613356700.0 | 4.0 | 1.0 | 47.0 | 3.0 | 0.0 | 0.0 | 1.0 |
| 75% | 1385945000.0 | 7.5 | 3.0 | 157.5 | 5.0 | 1.0 | 88.0 | 4.5 |
| max | 6525659000.0 | 76.0 | 14.0 | 535.0 | 24.0 | 3.0 | 416.0 | 15.0 |


In [20]:
# preview fatalities_00_14 as integers (the result is displayed only, not assigned back to df)
df.fatalities_00_14.astype(int)

0       0
1      88
2       0
3       0
4       0
5     337
6     158
7       7
8      88
9       0
11    416
12      0
13      0
14      0
15      0
16    225
17      0
18      0
19     51
20     14
21      0
22     92
23      0
24     22
25    143
26      0
27      0
28      0
29    283
30      0
35     46
36      1
37      0
38      0
39    110
Name: fatalities_00_14, dtype: int64
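
The integer cast above works only because the rows with missing values were removed first. As a hedged aside, pandas also offers the nullable `"Int64"` dtype, which keeps gaps while storing whole numbers. The sketch uses a toy series with assumed values.

```python
import numpy as np
import pandas as pd

# Toy series (hypothetical values) with one missing entry
fatalities = pd.Series([0.0, 88.0, np.nan, 337.0], name="fatalities")

# A plain int cast cannot represent NaN and raises an error
try:
    fatalities.astype(int)
    cast_failed = False
except (ValueError, TypeError):
    cast_failed = True

# The nullable integer dtype keeps the gap as <NA>
nullable = fatalities.astype("Int64")

print(cast_failed)            # True
print(nullable.isna().sum())  # 1
```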