# [Titanic: Machine Learning from Disaster](https://www.kaggle.com/c/titanic)

![](https://upload.wikimedia.org/wikipedia/commons/6/6e/St%C3%B6wer_Titanic.jpg)

<a class="anchor" id="0"></a>

# Automatic EDA with Pandas Profiling

## This notebook uses Pandas Profiling version 2.9.0 (Sep 2020)
https://github.com/pandas-profiling/pandas-profiling/releases

Amazing Automatic Exploratory Data Analysis (EDA) with **Pandas Profiling**

This notebook demonstrates the main capabilities of Pandas Profiling, using the "[Titanic: Machine Learning from Disaster](https://www.kaggle.com/c/titanic)" competition as an example.

The analysis takes the results of feature engineering (FE) into account.

A simple example of a solution, taken from another notebook of mine, is given at the end.

See the **Pandas Profiling** changelog at https://pandas-profiling.github.io/pandas-profiling/docs/master/rtd/pages/changelog.html

Thanks to:
* https://www.kaggle.com/prashant111/eda-is-fun 
* https://www.kaggle.com/vbmokin/three-lines-of-code-for-titanic-top-15
* https://www.kaggle.com/vbmokin/three-lines-of-code-for-titanic-top-20
* https://www.kaggle.com/mauricef/titanic
* https://www.kaggle.com/kpacocha/top-6-titanic-machine-learning-from-disaster
* https://www.kaggle.com/erinsweet/simpledetect
* https://www.kaggle.com/tunguz/covid-19-eda-week-5
* [Titanic - Top score : one line of the prediction](https://www.kaggle.com/vbmokin/titanic-top-score-one-line-of-the-prediction)

<a class="anchor" id="0.1"></a>

## Table of Contents

1. [Introduction to Pandas Profiling](#1)
    -  [General information](#1.1)
    -  [Attributes](#1.2)
    -  [Methods](#1.3)
    -  [Changelog 2.9.0](#1.4)
1. [Import libraries](#2)
1. [Download datasets](#3)
1. [Feature engineering (FE)](#4)
1. [EDA with describe](#5)
1. [EDA with Pandas Profiling](#6)
    -  [EDA of training dataset](#6.1)
    -  [EDA of test dataset](#6.2)
1. [Conclusion and prediction](#7)


## 1. Introduction to Pandas Profiling <a class="anchor" id="1"></a>

[Back to Table of Contents](#0.1)

### 1.1 General information <a class="anchor" id="1.1"></a>

**The results of this tool have recently been updated.**

The first thing we do after importing a dataset is get an overview of it. This is called exploratory data analysis, or EDA for short. Pandas is commonly used for this purpose.


Pandas is one of the most widely used Python libraries for loading, processing, and exploring data, and it offers a rich set of tools for statistical operations. Listed below are some basic, common commands for getting insights about the data.

- **head() method** - view the top 5 rows of the dataset.

- **tail() method** - view the bottom 5 rows of the dataset.

- **info() method** - view concise summary of dataset.

- **describe() method** - view statistical properties of dataset.


Some basic DataFrame attributes are also useful:


- **df.shape** - gives the dimensions of the dataset.

- **df.dtypes** - gives the data types of the columns.

- **df.columns** - view the column names of the dataset.
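
As a quick illustration of the methods and attributes listed above, here is a minimal sketch on a tiny hand-made DataFrame. The `toy` frame and its values are made up for this example and are not taken from the Titanic data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, used only to demonstrate the calls.
toy = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0, 54.0, 2.0],
    "Fare": [7.25, 71.28, 7.92, 53.10, 51.86, 21.08],
    "Sex": ["male", "female", "female", "female", "male", "male"],
})

print(toy.head())      # first 5 rows by default
print(toy.tail(2))     # last 2 rows
toy.info()             # dtypes and non-null counts (prints directly)
print(toy.describe())  # count, mean, std, min, quartiles, max of numeric columns

print(toy.shape)          # (number of rows, number of columns)
print(toy.dtypes)         # data type of each column
print(list(toy.columns))  # column names
```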


However, these methods and attributes cover only the basics of EDA.


There is an alternative, called **Pandas profiling**. This library generates a complete report for your dataset, which includes:

- Basic data type information (which columns contain what).

- Descriptive statistics (mean, standard deviation, etc.)

- Quantile statistics (tells you about how your data is distributed)

- Histograms for your data (again, for visualizing distributions)

- Correlations (lets you see what's related)
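
To give a feel for what such a report contains, the sketch below computes a few of the listed items by hand with plain pandas and NumPy. It only approximates the idea and does not show how Pandas Profiling is implemented; the `toy` frame is made up for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing Age value.
toy = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0, np.nan, 54.0, 2.0, 27.0],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05, 51.86, 21.08, 11.13],
    "Pclass": [3, 1, 3, 1, 3, 1, 3, 3],
})

types = toy.dtypes                                       # basic data type information
missing = toy.isna().sum()                               # missing values per column
stats = toy.agg(["mean", "std", "min", "max"])           # descriptive statistics
quantiles = toy.quantile([0.05, 0.25, 0.5, 0.75, 0.95])  # quantile statistics
counts, edges = np.histogram(toy["Fare"], bins=4)        # histogram of one column
corr = toy.corr()                                        # pairwise Pearson correlations

print(missing, stats, quantiles, counts, corr, sep="\n\n")
```

Each of these steps needs its own call and its own output to read, which is the work a single profiling report bundles together.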


This tool outputs a single, well-structured HTML report containing all the information mentioned above, which you would otherwise have to collect by running the basic commands and attributes one by one. This saves a lot of time: EDA can now be performed with just one line of code (as explained below).


The Pandas Profiling tool can be downloaded here:

https://github.com/pandas-profiling/pandas-profiling


We will apply pandas-profiling to the Titanic dataset because it has a variety of data types and contains missing values. This tool is particularly useful when the dataset has not been cleaned and its variables require individual exploration.

### 1.2 Attributes <a class="anchor" id="1.2"></a>

[Back to Table of Contents](#0.1)

Source: **pandas_profiling/__init__.py**   

* df : DataFrame
        Data to be analyzed
* bins : int
        Number of bins in histogram.
        The default is 10.
* check_correlation : boolean
        Whether or not to check correlation.
        It's `True` by default.
* vars:
    * num:
        * quantiles: 0.05, 0.25, 0.5, 0.75, 0.95
        * skewness_threshold: 20
        * low_categorical_threshold: 5
        * chi_squared_threshold: 0.999 (set to zero to disable)
    * cat:
        * check_composition: True
        * cardinality_threshold: 50
        * n_obs: 5
        * chi_squared_threshold: 0.999 (set to zero to disable)
    * bool:
        * n_obs: 3

More settings can be found in the configuration files:
* default configuration file, 
* minimal configuration file,
* dark themed configuration file.

Example:

`profile = df.profile_report(title='Pandas Profiling Report', plot={'histogram': {'bins': 8}})`
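
The thresholds listed above decide when the report flags a variable. The sketch below mimics two such checks with plain pandas, under the assumption that a numeric column is flagged when its absolute skewness exceeds `skewness_threshold` and a categorical column is flagged when its number of distinct values exceeds `cardinality_threshold`. The helper names and sample values are hypothetical, and the logic is a simplification rather than the library's own code.

```python
import pandas as pd

QUANTILES = [0.05, 0.25, 0.5, 0.75, 0.95]  # values listed under vars.num.quantiles
SKEWNESS_THRESHOLD = 20                     # vars.num.skewness_threshold
CARDINALITY_THRESHOLD = 50                  # vars.cat.cardinality_threshold


def is_highly_skewed(series, threshold=SKEWNESS_THRESHOLD):
    """Hypothetical check: absolute sample skewness above the threshold."""
    return bool(abs(series.skew()) > threshold)


def has_high_cardinality(series, threshold=CARDINALITY_THRESHOLD):
    """Hypothetical check: more distinct values than the threshold."""
    return bool(series.nunique() > threshold)


fares = pd.Series([7.25, 71.28, 7.92, 53.10, 8.05, 512.33])  # made-up values
tickets = pd.Series([f"T{i}" for i in range(60)])            # 60 distinct made-up labels

print(fares.quantile(QUANTILES))
print(is_highly_skewed(fares))        # the skewness of six values stays far below 20
print(has_high_cardinality(tickets))  # 60 distinct values exceed the threshold of 50
```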

### 1.3 Methods <a class="anchor" id="1.3"></a>

[Back to Table of Contents](#0.1)

Source: **pandas_profiling/__init__.py**   

* get_description
        Return the description (a raw statistical summary) of the dataset.
* get_rejected_variables
        Return the list of rejected variables, or an empty list 
        if there are none.
* to_file
        Write the report to a file.
* to_html
        Return the report as an HTML string.

### 1.4 Changelog 2.9.0 <a class="anchor" id="1.4"></a>

[Back to Table of Contents](#0.1)

Source: **https://pandas-profiling.github.io/pandas-profiling/docs/master/rtd/pages/changelog.html#changelog-v2-9-0**

* Description per variable is now possible (see the metadata page or the Census example).
* Fixed bug for small DataFrames with unused categories.
* Fixed bug where parallelization would have side effects.
* Removed warning where colormap was modified in place.
* Distinguish between unique and distinct correctly.

#### Large datasets

Version 2.x introduces **minimal mode**. This is a default configuration that disables expensive computations (such as correlations and dynamic binning). Use the following syntax:

      profile = ProfileReport(large_dataset, minimal=True)
      profile.to_file(output_file="output.html")

## 2. Import libraries <a class="anchor" id="2"></a>

[Back to Table of Contents](#0.1)

In [1]:
# !pip install -U pandas-profiling==2.9.0

## In Kaggle switch "Environment" to "Always use latest environment"

![image.png](attachment:image.png)

In [2]:
import numpy as np
import pandas as pd
# import pandas_profiling as pp
# from pandas_profiling import ProfileReport

In [3]:
# pp.__version__

## 3. Download datasets <a class="anchor" id="3"></a>

[Back to Table of Contents](#0.1)

In [4]:
traindf = pd.read_csv('../input/train.csv').set_index('PassengerId')
testdf = pd.read_csv('../input/test.csv').set_index('PassengerId')

In [5]:
traindf.head(3)

| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |


In [6]:
testdf.head(3)

| PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|
| 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q |
| 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0 | NaN | S |
| 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q |
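
Both previews already show empty `Cabin` cells. A quick way to quantify missing values right after loading is `isna().sum()`. The sketch below uses an inline copy of the three training rows printed above instead of the real CSV file, so the numbers describe only this small sample.

```python
import io
import pandas as pd

# Inline copy of the three training rows printed above (not the full dataset).
sample_csv = io.StringIO(
    "PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked\n"
    '1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S\n'
    '2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C\n'
    '3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S\n'
)
sample = pd.read_csv(sample_csv).set_index("PassengerId")

missing_counts = sample.isna().sum()  # number of missing values per column
missing_share = sample.isna().mean()  # fraction of missing values per column
print(missing_counts[missing_counts > 0])
```

The same two lines can be applied to `traindf` and `testdf` to see which columns need imputation before feature engineering.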


## 4. Feature engineering (FE) <a class="anchor" id="4"></a>

[Back to Table of Contents](#0.1)

In [7]:
# Thanks to: 
# https://www.kaggle.com/mauricef/titanic
# https://www.kaggle.com/vbmokin/titanic-top-3-one-line-of-the-prediction-code

df = pd.concat([traindf, testdf], axis=0, sort=False)
df['Title'] = df.Name.str.split(',').str[1].str.split('.').str[0].str.strip()
df['IsWomanOrBoy'] = ((df.Title == 'Master') | (df.Sex == 'female'))
df['LastName'] = df.Name.str.split(',').str[0]
family = df.groupby(df.LastName).Survived
df['WomanOrBoyCount'] = family.transform(lambda s: s[df.IsWomanOrBoy].fillna(0).count())
# df['WomanOrBoyCount'] = df.mask(df.IsWomanOrBoy, df.WomanOrBoyCount - 1, axis=0)
df['FamilySurvivedCount'] = family.transform(lambda s: s[df.IsWomanOrBoy].fillna(0).sum())
# df['FamilySurvivedCount'] = df.mask(df.IsWomanOrBoy, df.FamilySurvivedCount - \
#                                     df.Survived.fillna(0), axis=0)
df['WomanOrBoySurvived'] = df.FamilySurvivedCount / df.WomanOrBoyCount.replace(0, np.nan)
df.WomanOrBoyCount = df.WomanOrBoyCount.replace(np.nan, 0)
df['Alone'] = (df.WomanOrBoyCount == 0)

#Thanks to: https://www.kaggle.com/kpacocha/top-6-titanic-machine-learning-from-disaster
#"Title" improvement
df['Title'] = df['Title'].replace('Ms','Miss')
df['Title'] = df['Title'].replace('Mlle','Miss')
df['Title'] = df['Title'].replace('Mme','Mrs')
# Embarked
df['Embarked'] = df['Embarked'].fillna('S')

# Thanks to https://www.kaggle.com/erinsweet/simpledetect
# Fare
med_fare = df.groupby(['Pclass', 'Parch', 'SibSp']).Fare.median()[3][0][0]
df['Fare'] = df['Fare'].fillna(med_fare)
#Age
# df['Age'] = df.groupby(['Sex', 'Pclass', 'Title'])['Age'].apply(lambda x: x.fillna(x.median()))
# Family_Size
df['Family_Size'] = df['SibSp'] + df['Parch'] + 1

#Thanks to https://www.kaggle.com/kpacocha/top-6-titanic-machine-learning-from-disaster
# Cabin, Deck
#df['Deck'] = df['Cabin'].apply(lambda s: s[0] if pd.notnull(s) else 'M')
#df.loc[(df['Deck'] == 'T'), 'Deck'] = 'A'

df.WomanOrBoySurvived = df.WomanOrBoySurvived.fillna(0)
df.WomanOrBoyCount = df.WomanOrBoyCount.fillna(0)
df.FamilySurvivedCount = df.FamilySurvivedCount.fillna(0)
df.Alone = df.Alone.fillna(0)
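
To make the string handling in the cell above easier to follow, here is a minimal sketch that applies the same `Title` and `LastName` expressions to three names from the previews. Only the small `names` Series is introduced for illustration.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Kelly, Mr. James",
])

# Names follow "LastName, Title. First names":
# take the text after the comma and before the first dot.
title = names.str.split(",").str[1].str.split(".").str[0].str.strip()
last_name = names.str.split(",").str[0]

print(title.tolist())      # ['Mr', 'Miss', 'Mr']
print(last_name.tolist())  # ['Braund', 'Heikkinen', 'Kelly']

# Rare spellings are merged into the common titles, as in the cell above.
merged = pd.Series(["Ms", "Mlle", "Mme"]).replace({"Ms": "Miss", "Mlle": "Miss", "Mme": "Mrs"})
print(merged.tolist())     # ['Miss', 'Miss', 'Mrs']
```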

In [8]:
df.IsWomanOrBoy

PassengerId
1       False
2        True
3        True
4        True
5       False
        ...  
1305    False
1306     True
1307    False
1308    False
1309     True
Name: IsWomanOrBoy, Length: 1309, dtype: bool
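
The family-level features built from `IsWomanOrBoy` can be illustrated on a toy frame. The sketch below is a simplified rewrite that aims to reproduce the intent of the `transform` calls in the feature-engineering cell with plain vectorized steps; it is not the notebook's code and has not been checked against the full dataset. The passengers are made up, and `NaN` in `Survived` stands for a test-set row.

```python
import numpy as np
import pandas as pd

# Made-up passengers; NaN in Survived stands for a test-set row.
toy = pd.DataFrame({
    "LastName": ["Smith", "Smith", "Smith", "Jones", "Lee"],
    "IsWomanOrBoy": [True, True, False, False, True],
    "Survived": [1.0, np.nan, 0.0, 1.0, 0.0],
})
by_family = toy["LastName"]

# Number of women and boys per family (test-set rows are counted too).
toy["WomanOrBoyCount"] = toy["IsWomanOrBoy"].astype(int).groupby(by_family).transform("sum")

# Known survivors among the women and boys of each family.
survived_wb = toy["Survived"].fillna(0).where(toy["IsWomanOrBoy"], 0)
toy["FamilySurvivedCount"] = survived_wb.groupby(by_family).transform("sum")

# Survival rate of the group; families without women or boys get 0.
toy["WomanOrBoySurvived"] = (
    toy["FamilySurvivedCount"] / toy["WomanOrBoyCount"].replace(0, np.nan)
).fillna(0)
toy["Alone"] = toy["WomanOrBoyCount"] == 0
print(toy)
```

Every member of a family receives the same group values, so the man in the Smith family also carries the 0.5 survival rate of the Smith women and boys.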

In [9]:
train_x, test_x = df.loc[traindf.index], df.loc[testdf.index]
test_x = test_x.drop('Survived', axis=1)

In [10]:
train_x.head(3)

| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Title | IsWomanOrBoy | LastName | WomanOrBoyCount | FamilySurvivedCount | WomanOrBoySurvived | Alone | Family_Size |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0.0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S | Mr | False | Braund | 0 | 0.0 | 0.0 | True | 2 |
| 2 | 1.0 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | Mrs | True | Cumings | 1 | 1.0 | 1.0 | False | 2 |
| 3 | 1.0 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S | Miss | True | Heikkinen | 1 | 1.0 | 1.0 | False | 1 |


In [11]:
test_x.head(3)

| PassengerId | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Title | IsWomanOrBoy | LastName | WomanOrBoyCount | FamilySurvivedCount | WomanOrBoySurvived | Alone | Family_Size |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 892 | 3 | Kelly, Mr. James | male | 34.5 | 0 | 0 | 330911 | 7.8292 | NaN | Q | Mr | False | Kelly | 3 | 3.0 | 1.0 | False | 1 |
| 893 | 3 | Wilkes, Mrs. James (Ellen Needs) | female | 47.0 | 1 | 0 | 363272 | 7.0 | NaN | S | Mrs | True | Wilkes | 1 | 0.0 | 0.0 | False | 2 |
| 894 | 2 | Myles, Mr. Thomas Francis | male | 62.0 | 0 | 0 | 240276 | 9.6875 | NaN | Q | Mr | False | Myles | 0 | 0.0 | 0.0 | True | 1 |


## 5. EDA with describe <a class="anchor" id="5"></a>

[Back to Table of Contents](#0.1)

In [12]:
train_x.describe()

| | Survived | Pclass | Age | SibSp | Parch | Fare | WomanOrBoyCount | FamilySurvivedCount | WomanOrBoySurvived | Family_Size |
|---|---|---|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 | 891.0 |
| mean | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 | 1.159371 | 0.540965 | 0.331304 | 1.904602 |
| std | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 | 1.591893 | 0.803961 | 0.443523 | 1.613459 |
| min | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 25% | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 | 0.0 | 0.0 | 0.0 | 1.0 |
| 50% | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 | 1.0 | 0.0 | 0.0 | 1.0 |
| 75% | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 | 2.0 | 1.0 | 1.0 | 2.0 |
| max | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 | 8.0 | 4.0 | 1.0 | 11.0 |


In [13]:
test_x.describe()

| | Pclass | Age | SibSp | Parch | Fare | WomanOrBoyCount | FamilySurvivedCount | WomanOrBoySurvived | Family_Size |
|---|---|---|---|---|---|---|---|---|---|
| count | 418.0 | 332.0 | 418.0 | 418.0 | 418.0 | 418.0 | 418.0 | 418.0 | 418.0 |
| mean | 2.26555 | 30.27259 | 0.447368 | 0.392344 | 35.560746 | 1.12201 | 0.318182 | 0.143421 | 1.839713 |
| std | 0.841838 | 14.181209 | 0.89676 | 0.981429 | 55.857021 | 1.456629 | 0.647226 | 0.281846 | 1.519072 |
| min | 1.0 | 0.17 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 25% | 1.0 | 21.0 | 0.0 | 0.0 | 7.8958 | 0.0 | 0.0 | 0.0 | 1.0 |
| 50% | 3.0 | 27.0 | 0.0 | 0.0 | 14.4542 | 1.0 | 0.0 | 0.0 | 1.0 |
| 75% | 3.0 | 39.0 | 1.0 | 0.0 | 31.471875 | 2.0 | 0.0 | 0.0 | 2.0 |
| max | 3.0 | 76.0 | 8.0 | 9.0 | 512.3292 | 8.0 | 3.0 | 1.0 | 11.0 |
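
Note that `describe()` summarizes only numeric columns by default, which is why text columns such as `Sex` or `Embarked` do not appear in the tables above. A minimal sketch of how non-numeric columns could be summarized as well, using a made-up `toy` frame and the `exclude` parameter of `describe()`:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one numeric and two text columns.
toy = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0],
    "Sex": ["male", "female", "female", "female"],
    "Embarked": ["S", "C", "S", "S"],
})

numeric_summary = toy.describe()                  # numeric columns only (default)
text_summary = toy.describe(exclude=[np.number])  # count, unique, top, freq

print(numeric_summary)
print(text_summary)
```

For text columns the summary reports the number of distinct values and the most frequent value instead of mean and quartiles.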


## 6. EDA with Pandas Profiling <a class="anchor" id="6"></a>

[Back to Table of Contents](#0.1)

### 6.1 EDA of training dataset <a class="anchor" id="6.1"></a>

[Back to Table of Contents](#0.1)

Different options for generating the report:

In [14]:
# ProfileReport(train_x, title='Pandas Profiling Report for training dataset', html={'style':{'full_width':True}})

In [15]:
# %%time
# profile = train_x.profile_report(title='Pandas Profiling Report for training dataset')
# profile.to_file(output_file="train_profile.html")

## Minimal mode for large datasets

In [16]:
# %%time
# profile = ProfileReport(train_x, title='Pandas Profiling Report for training dataset', minimal=True)
# profile.to_file(output_file="train_short_profile.html")

### 6.2 EDA of test dataset <a class="anchor" id="6.2"></a>

[Back to Table of Contents](#0.1)

In [17]:
# ProfileReport(test_x, title='Pandas Profiling Report for test dataset')
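
Since the report cells in this section are commented out, the sketch below wraps the same calls in a small helper that can be left enabled even when the library is not installed. `write_profile` is a hypothetical name introduced here; only `ProfileReport(...)`, its `title` and `minimal` arguments, and `to_file(output_file=...)` come from the cells above.

```python
def write_profile(df, title, output_file, minimal=False):
    """Write a profiling report to `output_file`.

    Hypothetical helper: returns False instead of failing
    when pandas-profiling is not installed.
    """
    try:
        from pandas_profiling import ProfileReport
    except ImportError:
        print("pandas-profiling is not installed; skipping", output_file)
        return False
    profile = ProfileReport(df, title=title, minimal=minimal)
    profile.to_file(output_file=output_file)
    return True


# Example calls (train_x and test_x are the frames built in section 4):
# write_profile(train_x, "Pandas Profiling Report for training dataset", "train_profile.html")
# write_profile(test_x, "Pandas Profiling Report for test dataset", "test_profile.html", minimal=True)
```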

## 7. Conclusion and prediction <a class="anchor" id="7"></a>

[Back to Table of Contents](#0.1)


- We can see that `Pandas Profiling` is a nice tool that summarizes the dataset information concisely.

- It generates an HTML file that gives an `overview` of the `variables` along with their `correlations`, `missing values`, and a `sample` of the data.

#### Based on the EDA results, you can build a solution. For example, you can apply the following rule, derived from the Decision Tree Classifier in the notebook [Titanic - Top score : one line of the prediction](https://www.kaggle.com/vbmokin/titanic-top-score-one-line-of-the-prediction):

In [18]:
# The one line of the code for prediction : LB = 0.80382 (Titanic Top 6%) 
test_x = pd.concat([test_x.WomanOrBoySurvived.fillna(0), test_x.Alone, \
                    test_x.Sex.replace({'male': 0, 'female': 1})], axis=1)
pd.DataFrame({'Survived': (((test_x.WomanOrBoySurvived <= 0.2381) & (test_x.Sex > 0.5) & (test_x.Alone > 0.5)) | \
                        ((test_x.WomanOrBoySurvived > 0.2381) & \
                       ~((test_x.WomanOrBoySurvived > 0.55) & (test_x.WomanOrBoySurvived <= 0.633)))).astype(int)}, index=testdf.index).reset_index()
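
The rule above packs two branches into one expression. The sketch below spells out the same thresholds as a small function and applies it to made-up rows, so each branch can be read separately. The function name `predict_survival` and the `toy` frame are introduced only for illustration.

```python
import pandas as pd


def predict_survival(frame):
    """Apply the decision rule from the cell above to a frame with the columns
    WomanOrBoySurvived (float), Sex (0 = male, 1 = female) and Alone (bool)."""
    rate = frame["WomanOrBoySurvived"]
    is_female = frame["Sex"] > 0.5
    is_alone = frame["Alone"].astype(bool)

    # Branch 1: low group survival rate, female, and flagged as Alone.
    lone_woman = (rate <= 0.2381) & is_female & is_alone
    # Branch 2: higher group survival rate, except the narrow band (0.55, 0.633].
    in_excluded_band = (rate > 0.55) & (rate <= 0.633)
    good_group = (rate > 0.2381) & ~in_excluded_band

    return (lone_woman | good_group).astype(int)


# Made-up rows that exercise each branch of the rule.
toy = pd.DataFrame({
    "WomanOrBoySurvived": [0.0, 0.0, 0.0, 1.0, 0.6],
    "Sex": [1, 0, 1, 0, 1],
    "Alone": [True, True, False, False, False],
})
print(predict_survival(toy).tolist())  # [1, 0, 0, 1, 0]
```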

I hope you find this kernel useful and enjoyable.

Your votes, comments and feedback are most welcome.

[Go to Top](#0)