Real-world datasets are rarely in a format that allows a machine learning or deep learning model to train on them. Before feeding a dataset to a model, we must thoroughly analyze its features to determine how we will process each of them.

Workflows vary widely from one analyst to another. This case study, and the ones that follow, show my personal workflow when dealing with a dataset. I developed it through training and experience, and you will most likely develop your own as you learn.

Nonetheless, we will make every effort to adhere to the best practices that professionals in this field prefer whenever possible. 

### Section 1 : First Steps

Typically, a dataset is analyzed using a Jupyter Notebook file running a Python kernel. The notebook format allows us to run code and get the resulting output using cells instead of separating it into several .py files. This speeds up data analysis and makes it much easier to structure our workflow.

We will use Markdown cells to explain and record the reasoning behind each of our decisions related to the dataset. We will also use Markdown cells to explain what a Code cell's purpose is, instead of relying only on Python comments.

__Step 1__

The following Code cell imports some basic libraries that will allow us to analyze, process and visualize data.

In [1]:
# Data Analysis and Processing
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

%matplotlib inline

__Step 2__

In this particular case, we have been given a training dataset and a testing dataset in CSV (Comma-Separated Values) format. In the next Code cell, we will use the pandas library to read these datasets and store each of them as a DataFrame object.

The DataFrame object provided by pandas is a powerful analytical tool that we will be using extensively for a variety of purposes. 

In [4]:
# GitHub doesn't play nice with '<' and '>' characters,
# and this often results in the return value of the type()
# function not showing up in output cells

# This is just a simple function to deal with this issue:
# it replaces the angle brackets with '|' characters

def git_type(var: object) -> str:
    return str(type(var)).replace('<', '|').replace('>', '|')

In [5]:
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")

# Check the type of these variables
print(type(train_df))
print(type(test_df))

# Only to show output in github repos
print(git_type(train_df))
print(git_type(test_df))

<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
|class 'pandas.core.frame.DataFrame'|
|class 'pandas.core.frame.DataFrame'|
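
As a small aside, the sketch below shows what `pd.read_csv` produces without needing the actual files. The three-row CSV text and the name `toy_df` are made up for illustration and are not part of the dataset used in this case study.

```python
import io

import pandas as pd

# Hypothetical three-row CSV, standing in for a file such as train.csv
csv_text = """PassengerId,Survived,Age
1,0,22.0
2,1,38.0
3,1,
"""

# read_csv accepts a file path or any file-like object
toy_df = pd.read_csv(io.StringIO(csv_text))

print(toy_df.shape)                # (number of rows, number of columns)
print(list(toy_df.columns))        # column names taken from the header row
print(toy_df["Age"].isna().sum())  # the empty Age field is read as a missing value
```

Checking the shape and column names right after loading is a cheap way to confirm that the file was parsed the way we expected.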


__Step 3__

Previewing the dataset is a must before moving on, so that we have a picture of what our data looks like.

DataFrame.head(n) returns the first n samples of our dataset. If n is omitted, it defaults to 5.

DataFrame.tail(n) returns the last n samples of our dataset, with the same default.

In [6]:
train_df.head(10)

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |
| 5 | 6 | 0 | 3 | Moran, Mr. James | male | NaN | 0 | 0 | 330877 | 8.4583 | NaN | Q |
| 6 | 7 | 0 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 17463 | 51.8625 | E46 | S |
| 7 | 8 | 0 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 349909 | 21.075 | NaN | S |
| 8 | 9 | 1 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 347742 | 11.1333 | NaN | S |
| 9 | 10 | 1 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 237736 | 30.0708 | NaN | C |
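
The cell above only exercises `head`. As a minimal sketch of how `head` and `tail` behave, the example below uses a made-up ten-row DataFrame (`toy_df` and its `id` column are illustrative names, not part of the dataset).

```python
import pandas as pd

# Hypothetical DataFrame with ten rows, ids 1 through 10
toy_df = pd.DataFrame({"id": range(1, 11)})

first_default = toy_df.head()  # no argument: n defaults to 5
last_three = toy_df.tail(3)    # the last 3 rows

print(len(first_default))
print(last_three["id"].tolist())
```

Looking at both ends of a dataset can reveal problems that the first few rows hide, such as trailing summary rows or a change of format partway through the file.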


### Section 2 : Using Pandas to Get Important Overviews

__Step 4__

The pandas library offers two major DataFrame methods that allow us to get an overall picture of the dataset in a compact form.

The next Code cell uses __DataFrame.info()__, which prints how many total samples we have, how many non-null values each feature has, and what the corresponding data types are.

In [8]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
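
The non-null counts above already tell us how much data is missing: Age has 891 - 714 = 177 missing values, Cabin has 891 - 204 = 687, and Embarked has 2. A minimal sketch of getting those numbers directly, shown here on a made-up four-row DataFrame (`toy_df` and its values are illustrative only), could look like this:

```python
import numpy as np
import pandas as pd

# Hypothetical data with some missing entries
toy_df = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

missing_count = toy_df.isna().sum()   # missing values per column
missing_share = toy_df.isna().mean()  # fraction of rows missing per column

# Combine both views and list the most incomplete columns first
summary = pd.DataFrame({"missing": missing_count, "share": missing_share})
print(summary.sort_values("missing", ascending=False))
```

Running the same two lines on `train_df` should reproduce the counts derived from the `info()` output.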


__Step 5__

The next Code cell uses __DataFrame.describe()__, which returns summary statistics (count, mean, standard deviation, minimum, quartiles and maximum) for each of the numeric features.

Reading this table properly and extracting useful information from it, however, requires a solid grounding in statistics.

We will come back to this table later in our analysis and discuss what important information we could have extracted from it, and how.

In [9]:
train_df.describe()

| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 891.0 | 891.0 | 891.0 | 714.0 | 891.0 | 891.0 | 891.0 |
| mean | 446.0 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.0 | 0.0 | 1.0 | 0.42 | 0.0 | 0.0 | 0.0 |
| 25% | 223.5 | 0.0 | 2.0 | 20.125 | 0.0 | 0.0 | 7.9104 |
| 50% | 446.0 | 0.0 | 3.0 | 28.0 | 0.0 | 0.0 | 14.4542 |
| 75% | 668.5 | 1.0 | 3.0 | 38.0 | 1.0 | 0.0 | 31.0 |
| max | 891.0 | 1.0 | 3.0 | 80.0 | 8.0 | 6.0 | 512.3292 |
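
By default, `describe()` only covers the numeric columns, which is why Name, Sex, Ticket, Cabin and Embarked are absent from the table above. One possible way to summarize the remaining columns is to exclude the numeric types instead. The sketch below assumes a made-up four-row DataFrame (`toy_df` and its values are illustrative only) and is not necessarily how the later sections handle these features.

```python
import numpy as np
import pandas as pd

# Hypothetical data mixing text and numeric columns
toy_df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "S", "S"],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

numeric_summary = toy_df.describe()                         # numeric columns only
categorical_summary = toy_df.describe(exclude=[np.number])  # everything else

print(numeric_summary)
print(categorical_summary)
```

For non-numeric columns, the summary reports the count, the number of unique values, the most frequent value (`top`) and how often it occurs (`freq`).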


### Section 3 : Dropping Features Completely

### Section 4 : Separating Numeric and Categorical Features

### Section 5 : Feature Engineering

### Section 6 : Completely Processing the Dataset

### Section 7 : Choosing Machine Learning Models

### Section 8 : Training the Models on the Processed Dataset

### Section 9 : Comparing the Predictive Accuracy of the Models