## This notebook demonstrates some basic data pre-processing tasks on the Titanic dataset

##### VARIABLE DESCRIPTIONS

Survived - Survival (0 = No; 1 = Yes)<br>
Pclass - Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)<br>
Name - Name<br>
Sex - Sex<br>
Age - Age<br>
SibSp - Number of Siblings/Spouses Aboard<br>
Parch - Number of Parents/Children Aboard<br>
Ticket - Ticket Number<br>
Fare - Passenger Fare (British pound)<br>
Cabin - Cabin<br>
Embarked - Port of Embarkation (C = Cherbourg, France; Q = Queenstown, now Cobh, Ireland; S = Southampton, UK)

In [0]:
%fs ls /mnt/isa460/data/titanic

In [0]:
from pyspark.sql.functions import col, isnull, mean, stddev, abs, lit, desc
from pyspark.sql.types import StringType, IntegerType, DoubleType
from pyspark.ml.feature import Imputer

## Load the Titanic dataset

In [0]:
df = spark.read.csv('/mnt/isa460/data/titanic/titanic-training-data.csv', header=True, inferSchema=True)

display(df)

### Checking the target variable to make sure it contains only valid values
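
One way to run this check is to pull the distinct values of the target column back to the driver and compare them with the expected labels. The sketch below assumes `df` is the Spark DataFrame loaded above and that 0 and 1 are the only valid values, as the variable description states. The helper name `unexpected_target_values` is illustrative and not part of the original notebook.

```python
def unexpected_target_values(df, target="Survived", valid=(0, 1)):
    """Return the set of values in `target` that are not in `valid`.

    `df` is assumed to be a Spark DataFrame; a null in the column
    shows up as None in the returned set.
    """
    # distinct() keeps the collected result tiny, whatever the table size
    observed = {row[0] for row in df.select(target).distinct().collect()}
    return observed - set(valid)

# Assumed usage in the notebook:
#   unexpected_target_values(df)             # an empty set means the column is clean
#   display(df.groupBy("Survived").count())  # class balance at a glance
```

An empty result means every row carries a valid label; anything else lists the offending values.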

### Checking for missing values
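
A minimal sketch of a per-column missing-value report, assuming `df` is the Spark DataFrame loaded above. The helper name `count_missing` is hypothetical. It runs one small Spark job per column, which is fine for a dataset of this size but would be slow on a very wide table.

```python
def count_missing(df):
    """Map each column name to (number of nulls, share of rows that are null).

    `df` is assumed to be a Spark DataFrame.
    """
    total = df.count()
    report = {}
    for name in df.columns:
        # isNull() marks the rows where this column has no value
        n_missing = df.filter(df[name].isNull()).count()
        report[name] = (n_missing, n_missing / total if total else 0.0)
    return report

# Assumed usage in the notebook:
#   for name, (n, share) in count_missing(df).items():
#       print(f"{name}: {n} missing ({share:.1%})")
```

Columns with a high share of nulls are candidates for dropping; columns with a modest share are candidates for imputation.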

### Taking care of missing values
##### Dropping missing values
So let's just go ahead and drop all the variables that aren't relevant for predicting survival. We should at least keep the following:
- Survived - This variable is obviously relevant.
- Pclass - Does a passenger's class on the boat affect their survivability?
- Sex - Could a passenger's gender impact their survival rate?
- Age - Does a person's age impact their survival rate?
- SibSp - Does the number of relatives on the boat (that are siblings or a spouse) affect a person's survivability? Probably.
- Parch - Does the number of relatives on the boat (that are children or parents) affect a person's survivability? Probably.
- Fare - Does the fare a person paid affect their survivability? Maybe - let's keep it.
- Embarked - Does a person's point of embarkation matter? It depends on how the boat was filled... Let's keep it.

What about a person's name, ticket number, and passenger ID number? They're irrelevant for predicting survivability. And as you recall, the Cabin variable is almost all missing values, so we can just drop all of these.
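
The column choice above can be sketched as a single `select` of the variables to keep, which avoids having to name each dropped column. `KEEP_COLUMNS` and `keep_relevant_columns` are illustrative names, and treating the result as the `df1` used in later cells is an assumption, since the notebook does not show where `df1` is created.

```python
# Columns the text above decides to keep for predicting survival.
KEEP_COLUMNS = ["Survived", "Pclass", "Sex", "Age",
                "SibSp", "Parch", "Fare", "Embarked"]

def keep_relevant_columns(df, columns=KEEP_COLUMNS):
    """Return `df` restricted to `columns` (a Spark DataFrame is assumed)."""
    # Fail early with a readable message if a column name is misspelled
    missing = [c for c in columns if c not in df.columns]
    if missing:
        raise ValueError(f"columns not found: {missing}")
    return df.select(*columns)

# Assumed usage in the notebook:
#   df1 = keep_relevant_columns(df)
```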

### Identify outliers

In [0]:
# set the size of seaborn plot
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt
import sklearn
from pylab import rcParams

rcParams['figure.figsize'] = 15,8

# convert to pandas under a new name so the pandas module alias `pd` is not overwritten
pdf = df1.toPandas()

sb.boxplot(x='Parch', y='Age', data=pdf, palette='hls')
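
The points a box plot draws beyond its whiskers can also be listed directly in Spark. The sketch below applies the 1.5 × IQR rule, which is seaborn's default whisker length, to a single column as a whole rather than per `Parch` group. The helper names `iqr_bounds` and `iqr_outliers` are hypothetical, and `df` is assumed to be a Spark DataFrame with a numeric column.

```python
def iqr_bounds(q1, q3, k=1.5):
    """Return the (lower, upper) fences of the k * IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def iqr_outliers(df, column, k=1.5):
    """Return the rows of `df` whose `column` value lies outside the fences."""
    # relativeError=0.0 asks Spark for exact quartiles (costlier on big data)
    q1, q3 = df.approxQuantile(column, [0.25, 0.75], 0.0)
    lower, upper = iqr_bounds(q1, q3, k)
    return df.filter((df[column] < lower) | (df[column] > upper))

# Assumed usage in the notebook:
#   display(iqr_outliers(df1, "Age"))
```

For example, quartiles of 20 and 38 give an IQR of 18, so the fences sit at -7 and 65.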

### Impute missing Age values

In [0]:
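# import only what is needed; a wildcard import would shadow Python builtins such as abs, sum, and round
from pyspark.sql.functions import avg

display(df1.groupBy('Parch').agg(avg('Age').alias("Age")).orderBy('Parch'))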

Replace each passenger's missing Age value with the mean age of the Parch category that passenger belongs to.

In [0]:
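def age_approx(Age, Parch):
    # keep a known age, truncated to a whole number
    if Age is not None:
        return int(Age)

    # otherwise use the approximate mean age of the passenger's Parch group
    if Parch == 0:
        return 32
    elif Parch == 1:
        return 24
    elif Parch == 2:
        return 17
    elif Parch == 3:
        return 33
    elif Parch == 4:
        return 45
    else:
        # fallback for any other Parch value
        return 30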

In [0]:
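# create a user defined function based on age_approx
from pyspark.sql.functions import udf

age_approx_udf = udf(age_approx, IntegerType())

# apply this function to the dataframe
# (assumption: the result is the df2 used in the next section)
df2 = df1.withColumn('Age', age_approx_udf(col('Age'), col('Parch')))
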

### Converting categorical variables to dummy indicators

In [0]:
from pyspark.ml.feature import OneHotEncoder, StringIndexer

# map each Embarked label to a numeric index, then one-hot encode that index
stringIndexer = StringIndexer(inputCol="Embarked", outputCol="Embarked_index", handleInvalid="skip")
oheEncoder = OneHotEncoder(inputCol="Embarked_index", outputCol="Embarked_vec")

df4 = stringIndexer.fit(df2).transform(df2)
df5 = oheEncoder.fit(df4).transform(df4)

display(df5)

### Checking for independence between features

In [0]:
# set the size of seaborn plot
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt
import sklearn
from pylab import rcParams

rcParams['figure.figsize'] = 15,8

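# convert to pandas under a new name so the pandas module alias `pd` is not overwritten
pdf = df1.toPandas()

# numeric_only=True (pandas 1.5 or later) leaves string columns such as Sex and Embarked out of the correlation
sb.heatmap(pdf.corr(numeric_only=True))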