# Data Cleaning and Preparation
## DAT540 Introduction to Data Science
## University of Stavanger
### L09
#### Antorweep Chakravorty (antorweep.chakravorty@uis.no)

In [2]:
import numpy as np
import pandas as pd

- Most real-world data sets have missing values
- Such values could lead to erroneous results
- Proper strategies should be applied prior to any data processing in order to handle missing data

- **Handling Missing Data**
- pandas uses the floating-point value *NaN* (Not a Number) to represent missing values
- Such a value is called a *sentinel value*, and pandas provides a wide array of methods to detect it
- pandas also employs the sentinel value *NA* (Not Available) to represent data that does not exist
- When cleaning data for analysis, it is crucial to first analyze the missing data itself in order to identify data collection issues or detect biases
- The built-in Python *None* value is also treated as NA
- Typically, if we have a Series or DataFrame, we can use the instance method *isnull* to detect missing values
- The *isnull* instance method returns True for every value that is missing
- Other useful instance methods are
  - *dropna*, which filters axis labels based on whether the values for each label have missing data, with varying thresholds for how much missing data to tolerate
  - *fillna*, which fills in missing data with some value or with a fill method such as 'ffill' or 'bfill'
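The three methods above can be tried on a small example. This is a minimal sketch on a made-up Series (not the Titanic data), showing that both `np.nan` and `None` are detected as missing:

```python
import numpy as np
import pandas as pd

# Toy Series (not the Titanic data): both np.nan and None count as missing
s = pd.Series([1.0, np.nan, 3.0, None])

mask = s.isnull()        # boolean mask, True where a value is missing
n_missing = mask.sum()   # number of missing values
dropped = s.dropna()     # keep only the non-missing entries
filled = s.fillna(0.0)   # replace missing entries with a scalar

print(mask.tolist())
print(n_missing, len(dropped), filled.tolist())
```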

In [3]:
titanic = pd.read_csv('./data/titanicv1.csv', delimiter=',')
print(titanic.shape)
print(titanic.isnull().head())

(156, 13)
   Unnamed: 0  PassengerId  Survived  Pclass   Name    Sex    Age  SibSp  \
0       False        False     False   False  False  False  False  False   
1       False        False     False   False  False  False  False  False   
2       False        False     False   False  False  False  False  False   
3       False        False     False   False  False  False  False  False   
4       False        False     False   False  False  False  False  False   

   Parch  Ticket   Fare  Cabin  Embarked  
0  False   False  False   True     False  
1  False   False  False  False     False  
2  False   False  False   True     False  
3  False   False  False  False     False  
4  False   False  False   True     False  


- How do we get an overview of the missing values per column?

In [4]:
titanic.isnull().sum(axis=0)

Unnamed: 0       0
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex             50
Age             30
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          125
Embarked         1
dtype: int64

- Rows or columns with missing values can be dropped with the instance method *dropna*
- The *axis* argument determines whether rows or columns are dropped
- *how* {'any', 'all'}, default 'any', determines whether a row or column is removed when it has at least one NA ('any') or only when all of its values are NA ('all')
- *thresh* requires that many non-NA values to be present for a row or column to be kept
- *subset* is an array-like of labels along the other axis to consider, e.g. if you are dropping rows, it is the list of columns to check
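To see how these arguments differ, here is a small sketch on a hypothetical DataFrame (the column names `a`, `b`, `c` are made up for illustration):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; rows 0..4 have 3, 2, 2, 1 and 0 non-NA values
df = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, np.nan, np.nan],
    'b': [1.0, 2.0, np.nan, np.nan, np.nan],
    'c': [1.0, 2.0, 3.0, 4.0, np.nan],
})

any_rows = df.dropna()                      # how='any': drop rows with at least one NA
all_rows = df.dropna(how='all')             # drop only rows where every value is NA
thresh_rows = df.dropna(thresh=2)           # keep rows with at least 2 non-NA values
subset_rows = df.dropna(subset=['a'])       # only column 'a' decides which rows go
thresh_cols = df.dropna(axis=1, thresh=3)   # keep columns with at least 3 non-NA values

print(list(any_rows.index), list(all_rows.index))
print(list(thresh_rows.index), list(subset_rows.index))
print(list(thresh_cols.columns))
```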

In [5]:
titanic.dropna(subset=['Embarked'], inplace=True)

In [6]:
titanic.shape

(155, 13)

In [7]:
titanic.isnull().sum(axis=0)

Unnamed: 0       0
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex             50
Age             30
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          125
Embarked         0
dtype: int64

- What about the Cabin column? It has around 80% missing data
- Do we drop this column? Maybe we can infer the values from the Fare column
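The share of missing values per column can be read off directly, because the mean of a boolean mask is the fraction of True entries. A minimal sketch, using a made-up frame in place of the Titanic data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the Titanic data
df = pd.DataFrame({
    'Fare': [7.25, 71.28, 7.92, 53.10, 8.05],
    'Cabin': [np.nan, 'C85', np.nan, np.nan, np.nan],
})

# Mean of the boolean mask = fraction of missing entries per column
missing_share = df.isnull().mean()
print(missing_share)
```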

In [8]:
# We drop the Cabin column because it has a lot of missing data; if its values can be inferred from the Fare or Ticket column,
# then we don't need it
# We pass thresh=100 to keep only columns with at least 100 non-missing values (roughly 65% of the 155 rows)
titanic.dropna(thresh=100, axis=1, inplace=True)

In [9]:
titanic.shape

(155, 12)

- We also want to drop all rows in which every value is NaN

In [10]:
titanic.dropna(how='all', inplace=True)
titanic.isnull().sum(axis=0)

Unnamed: 0      0
PassengerId     0
Survived        0
Pclass          0
Name            0
Sex            50
Age            30
SibSp           0
Parch           0
Ticket          0
Fare            0
Embarked        0
dtype: int64

- **Filling in Missing Data**
- The *fillna* instance method is the workhorse for filling holes in a dataset
- Its *value* argument takes a *scalar*, *dict*, *Series* or *DataFrame*. A dict/Series/DataFrame specifies which value to use for each index (Series) or column (DataFrame)
- The *method* argument {'bfill', 'pad', 'ffill', None} (default None) specifies the method used to fill holes; recent pandas versions deprecate it in favor of the *ffill* and *bfill* instance methods
- The *limit* argument can be provided in tandem with the method argument. It specifies the maximum number of consecutive NaN values to forward/backward fill, so it can leave a row or column partially filled
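The fill options can be compared side by side. This is a minimal sketch on a made-up DataFrame; it uses the `ffill` instance method rather than `fillna(method='ffill')`, which is an assumption about which pandas version is in use:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps of different lengths
df = pd.DataFrame({
    'x': [1.0, np.nan, np.nan, 4.0],
    'y': [np.nan, 2.0, np.nan, np.nan],
})

scalar_fill = df.fillna(0)                  # same scalar everywhere
dict_fill = df.fillna({'x': -1, 'y': -2})   # one fill value per column
forward = df.ffill()                        # propagate the last valid value forward
forward_limited = df.ffill(limit=1)         # fill at most 1 consecutive NaN per gap

print(forward)
print(forward_limited)
```

Note that a forward fill cannot fill a leading NaN (there is no earlier value to propagate), and that `limit=1` leaves the second NaN of a two-value gap untouched.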

- For the Titanic dataset, let us try to fill the Age column
- What would be a good strategy?

In [11]:
print(titanic.sample(3))

    Unnamed: 0  PassengerId  Survived  Pclass                         Name  \
0            0            1         0       3      Braund, Mr. Owen Harris   
62          62           63         0       1  Harris, Mr. Henry Birkhardt   
71          71           72         0       3   Goodwin, Miss. Lillian Amy   

       Sex   Age  SibSp  Parch     Ticket    Fare Embarked  
0     male  22.0      1      0  A/5 21171   7.250        S  
62    male  45.0      1      0      36973  83.475        S  
71  female  16.0      5      2    CA 2144  46.900        S  


- Maybe the fare for children is cheaper?

In [12]:
# First let us check if there is any correlation between the non missing Age and Fare values
age_fare = titanic.dropna(subset=['Age']).loc[:, ['Age', 'Fare']]
print('shape: ', age_fare.shape)
age_fare['Age'].corr(age_fare['Fare']) # We see almost no correlation

shape:  (125, 2)


0.018210834187746487

- Is there any relationship if we create bins for Age and Fare?
- Ex.
  - Age: 0 -> [0, 6), 1 -> [6, 12), 2 -> [12, 18), 3 -> [18+]
  - Fare: round the fare up to the next integer
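The same binning can also be expressed with `pd.cut`. A minimal sketch on made-up ages and fares, assuming left-closed bins so that an age of exactly 6, 12 or 18 falls into the higher group:

```python
import numpy as np
import pandas as pd

ages = pd.Series([2.0, 6.0, 11.5, 12.0, 17.0, 18.0, 40.0])  # made-up ages

# Left-closed bins [0, 6), [6, 12), [12, 18), [18, inf); labels=False returns bin codes
age_group = pd.cut(ages, bins=[0, 6, 12, 18, np.inf], right=False, labels=False)

# Round made-up fares up to the next integer
fare_up = np.ceil(pd.Series([7.25, 8.0, 83.475]))

print(age_group.tolist())
print(fare_up.tolist())
```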


In [13]:
age_fare['Age'].map(lambda x: 0 if x < 6 else 1 if x < 12 else 2 if x < 18 else 3).corr(age_fare['Fare'].apply(lambda x: np.ceil(x)))

0.04135740503165585

- At first glance, Age and Fare do not have any strong correlation
- We don't directly see any other column that might be relevant to Age
- Let us fill it with the mean value for now
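As a rough illustration of what mean imputation does, the sketch below fills a made-up Series with its mean. The mean is preserved, but the spread shrinks, which is one reason to treat this as a first attempt only:

```python
import numpy as np
import pandas as pd

age = pd.Series([22.0, np.nan, 38.0, np.nan, 30.0])  # made-up ages

mean_age = age.mean()            # NaN values are skipped by default
filled = age.fillna(mean_age)    # every gap receives the same value

print(mean_age, filled.tolist())
print(age.std(), filled.std())   # standard deviation before and after filling
```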

In [14]:
titanic.fillna({'Age': titanic['Age'].mean()}, inplace=True)

In [15]:
titanic.isnull().sum(axis=0)

Unnamed: 0      0
PassengerId     0
Survived        0
Pclass          0
Name            0
Sex            50
Age             0
SibSp           0
Parch           0
Ticket          0
Fare            0
Embarked        0
dtype: int64

- How can we fill the Sex column?
- Is there any relation with another column from which we can derive it?

In [16]:
print(titanic.sample(3))

     Unnamed: 0  PassengerId  Survived  Pclass                           Name  \
94           94           95         0       3              Coxon, Mr. Daniel   
141         141          142         1       3       Nysten, Miss. Anna Sofia   
87           87           88         0       3  Slocovski, Mr. Selman Francis   

      Sex       Age  SibSp  Parch           Ticket  Fare Embarked  
94   male  59.00000      0      0           364500  7.25        S  
141   NaN  22.00000      0      0           347081  7.75        S  
87   male  28.06264      0      0  SOTON/OQ 392086  8.05        S  


- We can observe that the salutation (Mr., Mrs., Miss., Master.) can be derived from the name, and that it indicates the gender
- Let us do that
  - First we extract the salutations from the name
  - Then we use them to replace the missing values in Sex
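One possible way to sketch these two steps is with a regular expression. The rows, the pattern and the `title_to_sex` mapping below are assumptions made for illustration; titles outside the mapping are left as NaN rather than guessed:

```python
import numpy as np
import pandas as pd

# Made-up rows in the same "Surname, Title. Given names" format as the Name column
df = pd.DataFrame({
    'Name': ['Braund, Mr. Owen Harris', 'Nysten, Miss. Anna Sofia',
             'Nye, Mrs. (Elizabeth Ramell)', 'Rice, Master. Eugene'],
    'Sex': ['male', np.nan, np.nan, np.nan],
})

# Step 1: capture the word between the comma and the first following period
title = df['Name'].str.extract(r',\s*([A-Za-z]+)\.', expand=False)

# Step 2: map titles to a gender and fill only the missing Sex values
title_to_sex = {'Mr': 'male', 'Master': 'male', 'Miss': 'female', 'Mrs': 'female'}
df['Sex'] = df['Sex'].fillna(title.map(title_to_sex))

print(title.tolist())
print(df['Sex'].tolist())
```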

In [17]:
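salutations = ['Mrs.', 'Mrs', 'Miss.', 'Miss', 'Mr.', 'Mr', 'Master', 'Master.']
# For each name, collect the words that match a known salutation (a list, possibly empty)
gender_verifier = titanic['Name'].map(lambda x: [sal.title() for sal in x.split(' ') if sal.title() in salutations])
# gender_verifier holds lists, so test the first match; names without a female salutation default to 'male'
gender = gender_verifier.map(lambda x: 'female' if x and x[0] in salutations[:4] else 'male')
# fillna aligns on the index and only touches rows where Sex is missing
titanic.fillna({'Sex': gender}, inplace=True)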

In [18]:
titanic.isnull().sum(axis=0)

Unnamed: 0     0
PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Embarked       0
dtype: int64

In [19]:
print(titanic.sample(3))

     Unnamed: 0  PassengerId  Survived  Pclass                          Name  \
139         139          140         0       1            Giglio, Mr. Victor   
66           66           67         1       2  Nye, Mrs. (Elizabeth Ramell)   
124         124          125         0       1   White, Mr. Percival Wayland   

        Sex   Age  SibSp  Parch      Ticket     Fare Embarked  
139    male  24.0      0      0    PC 17593  79.2000        C  
66   female  29.0      0      0  C.A. 29395  10.5000        S  
124    male  54.0      0      1       35281  77.2875        S  


In [24]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mean_absolute_error as MAE
from sklearn.metrics import mean_absolute_percentage_error as MAPE

In [46]:
tdataset = titanic[['Pclass', 'Age', 'SibSp', 'Parch', 'Fare', 'Survived']]
# We take a random 20% of the dataset for testing and train on the remaining 80%
# Further optimizations can be performed on the data; this is just an initial example
test = tdataset.sample(frac=.2)
train = tdataset.drop(test.index)
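If a repeatable split is wanted, the sampling can be seeded. A minimal sketch on a made-up frame (the `random_state=0` value is an arbitrary choice, not something used in the notebook):

```python
import numpy as np
import pandas as pd

# Made-up frame standing in for tdataset
data = pd.DataFrame({'Fare': np.arange(10, dtype=float),
                     'Survived': [0, 1] * 5})

test = data.sample(frac=0.2, random_state=0)  # fixed seed gives the same split each run
train = data.drop(test.index)                 # everything not sampled into test

print(len(train), len(test))
```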

In [50]:
predictions = pd.DataFrame([])
model = LogisticRegression()
model.fit(train.loc[:, train.columns != 'Survived'], train['Survived'])
predictions['Survived'] = np.ceil(model.predict(test.loc[:, test.columns != 'Survived']))
predictions['PassengerId'] = test.index.values
predictions.set_index('PassengerId', inplace=True)
# Because the train and test sets are sampled randomly, expect different MAE and MAPE values across runs
print('MAE:', MAE(test['Survived'], predictions['Survived'])) 
print('MAPE:', MAPE(test['Survived'], predictions['Survived']) * 100) 

MAE: 0.22580645161290322
MAPE: 22.58064516129032


In [56]:
# Sanity check
compare = test['Survived'] == predictions['Survived']
# Accuracy: share of test rows predicted correctly
compare.sum() / len(compare)

0.7741935483870968
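The sanity check works because, for 0/1 labels, every wrong prediction contributes an absolute error of exactly 1, so MAE is the error rate and accuracy is 1 minus MAE (here 0.2258 + 0.7742 = 1). A small sketch with made-up labels, not the notebook's actual predictions:

```python
import numpy as np

# Made-up labels and predictions, not the notebook's actual output
y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 0])

mae = np.mean(np.abs(y_true - y_pred))   # each wrong 0/1 prediction contributes 1
accuracy = np.mean(y_true == y_pred)     # share of correct predictions

print(mae, accuracy, mae + accuracy)
```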