## Define the problem

**Given**: a dataset with information about the Titanic passengers (age, sex, socio-economic status, cabin, ...)

**Goal**: an analysis of what sorts of people were likely to survive.
    

## Prepare the data (Data Preprocessing)

### Load the data

We'll use the datasets provided by Kaggle: [titanic/data](https://www.kaggle.com/c/titanic/data).

You can download them from here: [Titanic - all.zip](https://github.com/ProgressBG-Python-Course/JupyterNotebooksExamples/blob/master/datasets/Titanic/all.zip)

In [None]:
# look at the raw files:
!head ../../datasets/Titanic/train.csv

In [None]:
import pandas as pd
# load the dataset, using PassengerId as index
df_train = pd.read_csv("../../datasets/Titanic/train.csv", index_col='PassengerId')
df_test = pd.read_csv("../../datasets/Titanic/test.csv", index_col='PassengerId')

### Data variable descriptions:
<pre>
survival        Survival
                (0 = No; 1 = Yes)
pclass          Passenger Class
                (1 = 1st; 2 = 2nd; 3 = 3rd)
name            Name
sex             Sex
age             Age
sibsp           Number of Siblings/Spouses Aboard
parch           Number of Parents/Children Aboard
ticket          Ticket Number
fare            Passenger Fare
cabin           Cabin
embarked        Port of Embarkation
                (C = Cherbourg; Q = Queenstown; S = Southampton)

SPECIAL NOTES:
Pclass is a proxy for socio-economic status (SES)
 1st ~ Upper; 2nd ~ Middle; 3rd ~ Lower

Age is in Years; Fractional if Age less than One (1)
 If the Age is Estimated, it is in the form xx.5
</pre>

### Get insight into the data (Prepare and clean)

In [None]:
df_train.head()
# df_test.head()

In [None]:
df_test.head(5)

In [None]:
# info() prints its report itself and returns None, so no print() is needed
df_train.info()
df_test.info()
# print(df_train.columns.values.tolist())

#### Visualize with Seaborn

In [None]:
import seaborn as sns
sns.countplot(x='Survived', data=df_train)

In [None]:
import matplotlib.pyplot as plt
sns.countplot(x='Survived', hue='Sex', data=df_train)
# plt.title("Male/Female Survived")
plt.legend(bbox_to_anchor=(1, 1), loc=2)

In [None]:
sns.countplot(x='Survived', hue='Pclass', data=df_train)
plt.legend(bbox_to_anchor=(1, 1), loc=2)

In [None]:
df_train['Age'].plot.hist(bins=30)

### Clean and wrangle the data

#### Check for NaN values

In [None]:
# print both results: a notebook cell only displays its last expression
print(df_train.isnull().sum())
print(df_test.isnull().sum())

Columns 'Age', 'Cabin' and 'Embarked' <span style="color:red">have NaN</span> values. We have to deal with them.
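When deciding how to treat each column, the share of missing values can be more telling than the raw count. A minimal sketch, assuming a tiny made-up frame (`toy` is hypothetical and only stands in for `df_train`):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df_train (values are made up)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", np.nan, "S"],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# isnull() gives booleans, so mean() is the fraction of missing values per column
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)
```

A column that is mostly empty may be a candidate for a placeholder value or for dropping, while a column with only a few gaps can usually be filled with a typical value.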

#### Deal with NaN values

In [None]:
def fill_nan_values(df):
    # Put port = Southampton for 'Embarked' null values:
    df["Embarked"] = df["Embarked"].fillna("S")
    
    # put the median passenger age for 'Age' null values
    df["Age"] = df["Age"].fillna(df["Age"].median())
    
    # put 0 for cabin number for all 'Cabin' null values
    df["Cabin"] = df["Cabin"].fillna(0)
    
    # put the median fare for 'Fare' null values:
    df["Fare"] = df["Fare"].fillna(df["Fare"].median())
    
    return df


In [None]:
df_train = fill_nan_values(df_train)
df_test = fill_nan_values(df_test)
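The function above fills each frame with its own median. One possible variant, shown here only as a sketch, computes the fill value on the training data and reuses it for the test data, so that nothing is learned from the test set. The frames `toy_train` and `toy_test` and the name `train_age_median` are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames standing in for df_train / df_test
toy_train = pd.DataFrame({"Age": [20.0, 30.0, np.nan, 40.0]})
toy_test = pd.DataFrame({"Age": [np.nan, 50.0]})

# Compute the fill value from the training data only (median() skips NaN)
train_age_median = toy_train["Age"].median()

# Apply the same value to both sets
toy_train["Age"] = toy_train["Age"].fillna(train_age_median)
toy_test["Age"] = toy_test["Age"].fillna(train_age_median)

print(toy_train["Age"].tolist())
print(toy_test["Age"].tolist())
```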

In [None]:
# check again:
print(df_train.isnull().sum())
print(df_test.isnull().sum())

#### Categorical text data => to numbers

In [None]:
df_train.info()

In [None]:
import numpy as np

def categories_to_numbers(df):
    if df['Sex'].dtype == "object":
        # male => 1, female => 0
        df["Sex"] = np.where(df["Sex"] == "male", 1, 0)

    if df['Embarked'].dtype == "object":
        # explicit mapping: readable, and the result is an integer column
        embarks_map = {"S": 0, "C": 1, "Q": 2}
        df["Embarked"] = df["Embarked"].map(embarks_map).astype(int)

        ### useful when we have more values:
        # generate the mapping from the sorted unique values instead
#         embarks = sorted(df['Embarked'].unique())
#         embarks_map = dict(zip(embarks, range(len(embarks))))
        
    print("df['Sex'].dtype", df['Sex'].dtype)
    print("df['Embarked'].dtype", df['Embarked'].dtype)

    return df

In [None]:
df_train = categories_to_numbers(df_train)
df_test = categories_to_numbers(df_test)
df_train.head()
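Coding the ports as 0, 1, 2 implies an order between them that does not really exist. An alternative worth knowing is one-hot encoding, sketched below with `pd.get_dummies` on a hypothetical toy frame (this is not what the rest of the notebook uses):

```python
import pandas as pd

# Hypothetical toy frame with the raw port letters
toy = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# One 0/1 column per port instead of a single ordered code
encoded = pd.get_dummies(toy, columns=["Embarked"], dtype=int)
print(encoded)
```

Each passenger gets a 1 in exactly one of the new `Embarked_*` columns, so no port is treated as "larger" than another.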

## Select features

### Show Correlations

In [None]:
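# show correlations (numeric columns only: text columns such as Name cannot be correlated)
df_train_corr = df_train.select_dtypes(include="number").corr()
sns.heatmap(df_train_corr, annot=True, cmap="Reds")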

From the heat map we can see that 'Sex', 'Pclass', 'Fare' and 'Embarked' have the strongest correlation (in absolute value) with 'Survived', so we will use them as features.
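Reading the ranking off a heat map can be error-prone. As a sketch, the same information can be listed by sorting the absolute correlations with the target. The `toy` frame below is made up for illustration and its numbers say nothing about the real dataset:

```python
import pandas as pd

# Hypothetical toy frame (values are made up)
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Sex":      [1, 0, 0, 1, 0, 1],
    "Pclass":   [3, 1, 2, 3, 1, 2],
    "SibSp":    [1, 1, 0, 0, 0, 1],
})

# Correlation of every feature with the target, ranked by absolute value
corr_with_target = (
    toy.corr()["Survived"]
    .drop("Survived")
    .abs()
    .sort_values(ascending=False)
)
print(corr_with_target)
```

Taking the absolute value matters because a strong negative correlation is just as informative as a strong positive one.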


<!-- ### Drop columns we won't use -->

In [None]:
useful_features = ['Sex', 'Pclass', 'Fare', 'Embarked', 'Survived']

df_train = df_train[useful_features]

## Separate the training data from the test data

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df_train.drop('Survived',axis=1), 
    df_train['Survived'], 
    random_state=1)

# print(f'X_train: {X_train[:5]}\n', f'y_train: {y_train[:5]}\n')
# print(f'X_test: {X_test[:5]}\n', f'y_test: {y_test[:5]}\n')

print(f'X_train shape: {X_train.shape}')
print(f'X_test shape: {X_test.shape}')
print(f'y_train shape: {y_train.shape}')
print(f'y_test shape: {y_test.shape}')
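With no `test_size` given, `train_test_split` holds out 25% of the rows. An optional refinement, sketched here on hypothetical toy data, is to pass `stratify` so that both parts keep the same share of survivors:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 rows, 8 survivors (40%)
toy = pd.DataFrame({
    "Sex": [0, 1] * 10,
    "Survived": [1] * 8 + [0] * 12,
})
X_toy = toy.drop("Survived", axis=1)
y_toy = toy["Survived"]

# stratify=y_toy keeps the class proportions equal in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=1
)

print(f"train survival rate: {y_tr.mean():.2f}")
print(f"test survival rate:  {y_te.mean():.2f}")
```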

## Choose the model

This is a classification problem, so we will first try LogisticRegression.

In [None]:
from sklearn.linear_model import LogisticRegression

## Train the model (fit the model)

In [None]:
# instantiate and fit the model
lg = LogisticRegression()
fitted = lg.fit(X_train,y_train)

### Inspect the learned parameters

In [None]:
# let's check the "learned" coefficients:
print(fitted.intercept_)
print(fitted.coef_)
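To make these numbers more concrete: for a two-class problem with default settings, the model turns the linear score `intercept + coef · x` into a probability with the sigmoid function. The sketch below checks this by hand against `predict_proba` on hypothetical toy data (`X_toy`, `y_toy` and `model` are illustration names, not part of the notebook):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: one feature, binary target
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_toy = np.array([0, 0, 1, 0, 1, 1])

model = LogisticRegression().fit(X_toy, y_toy)

# Linear score for each sample, then the sigmoid of that score
z = model.intercept_[0] + X_toy @ model.coef_[0]
manual_proba = 1.0 / (1.0 + np.exp(-z))

# Probability of class 1 as reported by scikit-learn
sklearn_proba = model.predict_proba(X_toy)[:, 1]
print(np.allclose(manual_proba, sklearn_proba))
```

A positive coefficient therefore pushes the predicted survival probability up as the feature grows, and a negative one pushes it down.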

## Predict (classify unknown input sample)

In [None]:
y_pred = fitted.predict(X_test)

## Evaluate the model

In [None]:
from sklearn.metrics import classification_report

In [None]:
print(classification_report(y_test, y_pred))
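The report summarizes precision and recall per class. It can help to look at the raw counts behind it as well. A minimal sketch with `confusion_matrix` and `accuracy_score`, using made-up labels (`y_true_toy` and `y_pred_toy` are hypothetical):

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted labels
y_true_toy = [0, 0, 1, 1, 1, 0]
y_pred_toy = [0, 1, 1, 1, 0, 0]

# Rows are the true classes, columns are the predicted classes
cm = confusion_matrix(y_true_toy, y_pred_toy)
acc = accuracy_score(y_true_toy, y_pred_toy)

print(cm)
print(f"accuracy: {acc:.2f}")
```

In the notebook itself the same calls would take `y_test` and `y_pred` as arguments.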