<img src="http://data.freehdw.com/ships-titanic-vehicles-best.jpg"  Width="800">

<a id="introduction" ></a><br>
This kernel is for all aspiring data scientists to learn from and to review their knowledge. We will do a detailed statistical analysis of the Titanic dataset along with a machine learning model implementation. I am super excited to share my first kernel with the Kaggle community. As I go on in this journey and learn new topics, I will incorporate them with each new update. So, check back for them and please <b>leave a comment</b> if you have any suggestions to make this kernel better! Going back to the topics of this kernel, I will do more in-depth visualizations to explain the data, and machine learning classifiers will be used to predict passenger survival status.


NOTE:

- This is a Julia translation
- If you are reading this on GitHub, I recommend you read it on <a href="https://www.kaggle.com/masumrumi/a-statistical-analysis-ml-workflow-of-titanic">kaggle</a>
- Follow me on github: 

# Kernel Goals

<a id="aboutthiskernel"></a>

---

There are three primary goals of this kernel.

- <b>Do a statistical analysis</b> of how some groups of people survived more than others.
- <b>Do an exploratory data analysis (EDA)</b> of the Titanic data with visualizations and storytelling.
- <b>Predict</b>: Use machine learning classification models to predict passengers' chances of survival.

P.S. If you want to learn more about regression models, try this [kernel](https://www.kaggle.com/masumrumi/a-stats-analysis-and-ml-workflow-of-house-pricing/edit/run/9585160).

# Part 1: Importing Necessary Libraries and datasets

---

<a id="import_libraries**"></a>

## 1a. Loading libraries

Julia is a fantastic language with a vibrant community that produces many amazing libraries. I am not a big fan of importing everything at once for newcomers. So, I am going to introduce a few necessary libraries for now, and as we go on, we will keep unboxing new libraries when it seems appropriate.

In [None]:
using Pkg
Pkg.activate(".")
Pkg.add(["IJulia", "DataFrames", "CSV", "CairoMakie", "StatsBase",
         "Statistics", "MLJ", "MLJModels", "MLJBase", "HypothesisTests",
         "Distributions", "Missings", "CategoricalArrays", "AlgebraOfGraphics", "Chain"])

In [None]:
import DataFrames as DF
import CSV
import CairoMakie as Makie
import AlgebraOfGraphics as AoG
import Statistics as Stats
import StatsBase
import Chain: @chain
import Random: shuffle
import IJulia


In [None]:
readdir("./input/")

## 1b. Loading Datasets

<a id="load_data"></a>

---

After loading the necessary modules, we need to import the datasets. Many business problems come with a tremendous amount of messy data, which we extract from many sources. I am hoping to write about that in a different kernel. For now, we are going to work with a less complicated and quite popular machine learning dataset.

In [None]:
## Importing the datasets (CSV was already imported in section 1a)

train = CSV.read("./input/train.csv", DF.DataFrame)
test = CSV.read("./input/test.csv", DF.DataFrame);

You are probably wondering why there are two datasets, and why I have named them "train" and "test". To explain that, I am going to give you an overall picture of the supervised machine learning process.

"Machine Learning" is simply "Machine" and "Learning". Nothing more and nothing less. In a supervised machine learning process, we give a machine/computer/model specific inputs or data (text/number/image/audio) to learn from; in other words, we train the machine to learn certain aspects based on the data and the output. Now, how can we determine that the machine is actually learning what we are trying to teach? That is where the test set comes into play. We withhold part of the data where we know the output/result of each data point, and we use this data to test the trained models. We then compare the outcomes to determine the performance of the algorithms. If you are a bit confused, that's okay. I will explain more as we keep reading. Let's take a look at sample datasets.

In [None]:
DF.first(train, 5)
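
To make the train/test idea described above concrete, here is a minimal sketch in plain Julia. Everything in it is an assumption made up for illustration: the toy data, the 80/20 split, and the one-threshold "model". None of it comes from the Titanic files.

```julia
import Random

# Toy labeled data (made up): one numeric feature and a 0/1 outcome.
# The outcome is 1 exactly when the feature is above 50.
rng = Random.MersenneTwister(42)
x = rand(rng, 1:100, 200)
y = Int.(x .> 50)

# Shuffle the row indices, then hold out the last 20% as a test set.
idx = Random.shuffle(rng, 1:length(x))
n_train = round(Int, 0.8 * length(x))
train_idx, test_idx = idx[1:n_train], idx[n_train+1:end]

# Share of rows in `rows` that a given threshold classifies correctly.
accuracy(threshold, rows) = sum((x[rows] .> threshold) .== y[rows]) / length(rows)

# "Training": pick the threshold that classifies the training rows best.
candidates = 0:100
best_threshold = candidates[argmax([accuracy(t, train_idx) for t in candidates])]

# "Testing": score the learned rule on rows it has never seen.
test_accuracy = accuracy(best_threshold, test_idx)
println("threshold = $best_threshold, test accuracy = $test_accuracy")
```

The point of the sketch is only the workflow: the threshold is chosen using the training rows, and the held-out rows are touched once, at the very end, to measure how well the rule generalizes.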

In [None]:
@chain train begin
    DF.dropmissing(:Age) # Drop rows with missing Age
    DF.groupby(:Sex)
    DF.combine(:Age => minimum => :MinAge)
end

In [None]:
DF.describe(train, :eltype)

## 1c. A Glimpse of the Datasets.

<a id="glimpse"></a>

---

### Train Set

In [None]:
DF.first(train[shuffle(1:DF.nrow(train))[1:5], :], 5)

### Test Set

In [None]:
DF.first(test[shuffle(1:DF.nrow(test))[1:5], :], 5)

This is a sample of the train and test datasets. Let's find out a bit more about them.

In [None]:
println("The shape of the train data is (row, column): $(size(train))")
println("Train dataset info:")
display(DF.describe(train))  # display explicitly; only the last expression of a cell is shown

println("The shape of the test data is (row, column): $(size(test))")
println("Test dataset info:")
display(DF.describe(test))

## 1d. About This Dataset

<a id="aboutthisdataset"></a>

---

The data is split into two groups:

- training set (train.csv)
- test set (test.csv)

**_The training set includes our target variable (dependent variable), passenger survival status_** (also known as the ground truth from the Titanic tragedy) along with other independent features like gender, age, fare, and Pclass.

The test set should be used to see how well our model performs on unseen data. When we say unseen data, we mean that the algorithm or machine learning models have no relation to the test data. We do not want to use any part of the test data in any way to modify our algorithms, which is why we clean our test data and train data separately. **_The test set does not provide passengers' survival status_**. We are going to use our model to predict passenger survival status.
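
One way to read "no relation to the test data" is that every statistic used for cleaning is learned from the training rows only. Here is a small sketch of that idea, assuming made-up fares and a median fill. It is an illustration only, not the cleaning used later in this kernel.

```julia
import Statistics as Stats

# Made-up fares; `missing` marks an unknown value.
train_fare = [7.25, 71.28, 8.05, missing, 13.0]
test_fare  = [7.9, missing, 26.0]

# Learn the fill value from the training rows only...
fill_value = Stats.median(skipmissing(train_fare))

# ...then apply that same value to both sets, so the test data
# never influences the statistic.
train_filled = coalesce.(train_fare, fill_value)
test_filled  = coalesce.(test_fare, fill_value)
```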

Now let's go through the features and describe them a little. There are a couple of different types of variables. They are...

---

**Categorical:**

- **Nominal**(variables that have two or more categories, but which do not have an intrinsic order.)
  > - **Cabin**
  > - **Embarked**(Port of Embarkation)
            C(Cherbourg)
            Q(Queenstown)
            S(Southampton)

- **Dichotomous**(Nominal variable with only two categories)
  > - **Sex**
            Female
            Male
- **Ordinal**(variables that have two or more categories just like nominal variables. Only the categories can also be ordered or ranked.)
  > - **Pclass** (A proxy for socio-economic status (SES))
            1(Upper)
            2(Middle)
            3(Lower)

---

**Numeric:**

- **Discrete**
  > - **Passenger ID**(Unique identifying # for each passenger)
  > - **SibSp**
  > - **Parch**
  > - **Survived** (Our outcome or dependent variable)
            0
            1
- **Continuous**
  > - **Age**
  > - **Fare**

---

**Text Variable**

> - **Ticket** (Ticket number for passenger.)
> - **Name**( Name of the passenger.)

## 1e. Tableau Visualization of the Data

<a id='tableau_visualization'></a>

---

I have incorporated a tableau visualization below of the training data. This visualization...

- is for us to have an overview and play around with the dataset.
- is done without making any changes(including Null values) to any features of the dataset.

---

Let's get a better perspective of the dataset through this visualization.

```{=html}
<div class='tableauPlaceholder' id='viz1516349898238' style='position: relative'><noscript><a href='#'><img alt='An Overview of Titanic Training Dataset ' src='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Ti&#47;Titanic_data_mining&#47;Dashboard1&#47;1_rss.png' style='border: none' /></a></noscript><object class='tableauViz'  style='display:none;'><param name='host_url' value='https%3A%2F%2Fpublic.tableau.com%2F' /> <param name='embed_code_version' value='3' /> <param name='site_root' value='' /><param name='name' value='Titanic_data_mining&#47;Dashboard1' /><param name='tabs' value='no' /><param name='toolbar' value='yes' /><param name='static_image' value='https:&#47;&#47;public.tableau.com&#47;static&#47;images&#47;Ti&#47;Titanic_data_mining&#47;Dashboard1&#47;1.png' /> <param name='animate_transition' value='yes' /><param name='display_static_image' value='yes' /><param name='display_spinner' value='yes' /><param name='display_overlay' value='yes' /><param name='display_count' value='yes' /><param name='filter' value='publish=yes' /></object></div>                <script type='text/javascript'>                    var divElement = document.getElementById('viz1516349898238');                    var vizElement = divElement.getElementsByTagName('object')[0];                    vizElement.style.width='100%';vizElement.style.height=(divElement.offsetWidth*0.75)+'px';                    var scriptElement = document.createElement('script');                    scriptElement.src = 'https://public.tableau.com/javascripts/api/viz_v1.js';                    vizElement.parentNode.insertBefore(scriptElement, vizElement);                </script>
```

We want to see how the left vertical bar changes when we filter on unique values of certain features. We can use multiple filters to see if there are any correlations among them. For example, if we click on the **upper** and **Female** tabs, we would see that green dominates the bar with a ratio of 91:3 between female passengers who survived and those who did not; a 97% survival rate for females. We can reset the filters by clicking anywhere in the white space. The age distribution chart on top provides some more information, such as the age range of those three unlucky females, since the red color gives away the ones who did not survive. If you would like to check out some of my other tableau charts, please click [here.](https://public.tableau.com/profile/masum.rumi#!/)

# Part 2: Overview and Cleaning the Data

<a id="cleaningthedata"></a>

---

## 2a. Overview

Datasets in the real world are often messy. However, this dataset is almost clean. Let's analyze and see what we have here.

In [None]:
DF.describe(train, :nmissing, :eltype)

It looks like the columns do not all have the same number of entries, and they hold many different types of variables. This can happen for the following reasons...

- We may have missing values in our features.
- We may have categorical features.
- We may have alphanumerical or/and text features.
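
The three reasons above can be checked column by column. Here is a minimal sketch on made-up columns held in a plain named tuple (an assumption for illustration, not the real `train` table), counting the known entries and reporting the underlying element type:

```julia
# Toy columns (made up) stored as a named tuple of vectors.
columns = (
    Age      = [22.0, missing, 26.0, 35.0],
    Embarked = ["S", "C", missing, "S"],
    Pclass   = [3, 1, 3, 1],
)

# Per column: number of non-missing entries and the element type
# once `Missing` is stripped from it.
for (name, col) in pairs(columns)
    n_known = count(!ismissing, col)
    println(name, ": ", n_known, " known values, type ", nonmissingtype(eltype(col)))
end
```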

## 2b. Dealing with Missing values

<a id="dealwithnullvalues"></a>

---

**Missing values in _train_ dataset.**

In [None]:
"""
Take a DataFrame and return the total count and percentage of missing values per column.
"""
function missing_percentage(df::DF.DataFrame)
    missing_counts = [count(ismissing, df[!, col]) for col in DF.names(df)]
    missing_pct = round.(missing_counts ./ DF.nrow(df) .* 100, digits=2)

    # Create result DataFrame
    result = DF.DataFrame(
        Column = DF.names(df),
        Total = missing_counts,
        Percent = missing_pct
    )

    # Sort by total missing values (descending)
    return DF.sort(result, :Total, rev=true)
end

In [None]:
missing_percentage(train)

**Missing values in _test_ set.**

In [None]:
missing_percentage(test)

We see that both the **train** and **test** datasets have missing values. Let's make an effort to fill these missing values, starting with the "Embarked" feature.

### Embarked feature

---

In [None]:
"""
Take a DataFrame and a column name and return the count and percentage of each value in that column.
"""
function percent_value_counts(df::DF.DataFrame, feature::Symbol)

    # Count values including missing
    counts = DF.combine(DF.groupby(df, feature), DF.nrow => :Total)

    # Calculate percentages
    counts.Percent = round.(counts.Total ./ DF.nrow(df) .* 100, digits=2)

    # Sort by total count (descending)
    return DF.sort(counts, :Total, rev=true)
end

In [None]:
percent_value_counts(train, :Embarked)

It looks like there are only two null values (~0.22%) in the Embarked feature. We could replace these with the mode value "S". However, let's dig a little deeper.
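
For reference, the mode-based fill mentioned above could be sketched as follows. The port codes are made up, and the counting is done with a plain `Dict` so that no extra package is assumed:

```julia
# Made-up port codes with two unknown entries.
embarked = ["S", "C", "S", missing, "Q", "S", missing, "C"]

# Count how often each known value occurs.
counts = Dict{String, Int}()
for port in skipmissing(embarked)
    counts[port] = get(counts, port, 0) + 1
end

# The mode is the key with the largest count.
top_count, mode_port = findmax(counts)

# Fill the gaps with the mode and report the share that was missing.
filled = coalesce.(embarked, mode_port)
share_missing = round(count(ismissing, embarked) / length(embarked) * 100, digits = 2)
```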

**Let's see what those two null values are**

In [None]:
train[ismissing.(train.Embarked), :]

We may be able to solve these two missing values by looking at other independent variables of the two rows. Both passengers paid a fare of $80, are of Pclass 1 and female Sex. Let's see how the **Fare** is distributed among all **Pclass** and **Embarked** feature values.

In [None]:
fig = Makie.Figure()

# Prepare data for plotting
train_clean = DF.dropmissing(train, [:Embarked, :Fare, :Pclass])
test_clean = DF.dropmissing(test, [:Embarked, :Fare, :Pclass])

# Create mapping for embarked ports to numbers
unique_categories = unique(train_clean.Embarked)
category_to_index = Dict(category => i for (i, category) in enumerate(unique_categories))
# Convert categorical to numeric
train_clean.Embarked_num = [category_to_index[port] for port in train_clean.Embarked]
test_clean.Embarked_num = [category_to_index[port] for port in test_clean.Embarked]

# Training set boxplot
ax1 = Makie.Axis(fig[1, 1],
    title = "Training Set",
    xlabel = "Embarked",
    ylabel = "Fare",
    xticks = (1:3, unique_categories)
)

ax2 = Makie.Axis(fig[1, 2],
    title = "Test Set",
    xlabel = "Embarked",
    ylabel = "Fare",
    xticks = (1:3, unique_categories)
)

Makie.boxplot!(ax2, test_clean.Embarked_num, test_clean.Fare,
           dodge = test_clean.Pclass,
           color = test_clean.Pclass)
Makie.boxplot!(ax1, train_clean.Embarked_num, train_clean.Fare,
           dodge = train_clean.Pclass,
           color = train_clean.Pclass)

fig

Here, in both the training set and the test set, the median fare closest to $80 is found among passengers with Embarked value <b>C</b> and Pclass 1. So, let's fill in the missing values as "C".
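
The same comparison can be done numerically instead of by eye. This is a sketch under stated assumptions: the first-class fares per port below are made up (they are not the real Titanic values), and the median is used as the summary because that is what a box plot draws.

```julia
import Statistics as Stats

# Made-up first-class fares per port (not the real Titanic values).
fares_by_port = Dict(
    "C" => [76.7, 79.2, 83.2, 110.9],
    "Q" => [90.0, 90.0],
    "S" => [26.6, 52.0, 53.1, 61.2],
)

target_fare = 80.0

# Median fare per port.
medians = Dict(port => Stats.median(fares) for (port, fares) in fares_by_port)

# Pick the port whose median is closest to the target fare.
ports = sort(collect(keys(medians)))
gaps = [abs(medians[p] - target_fare) for p in ports]
best_port = ports[argmin(gaps)]
```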

In [None]:
## Replacing the null values in the Embarked column with "C" (chosen from the fare comparison above, not the mode "S").
train.Embarked = coalesce.(train.Embarked, "C");

### Cabin Feature

---

In [None]:
println("Train Cabin missing: $(count(ismissing, train.Cabin) / DF.nrow(train))")
println("Test Cabin missing: $(count(ismissing, test.Cabin) / DF.nrow(test))")

Approximately 77% of the Cabin feature is missing in the training data and 78% in the test data.
We have two choices:

- we can either get rid of the whole feature, or
- we can brainstorm a little and find an appropriate way to put it to use. For example, we may say passengers with a cabin record had a higher socio-economic status than others. We may also say passengers with a cabin record were more likely to be taken into consideration when loading into the boats.

Let's combine train and test data first and for now, will assign all the null values as **"N"**

In [None]:
survivors = train.Survived
DF.select!(train, DF.Not(:Survived))  # Remove Survived column
all_data = vcat(train, test)

all_data.Cabin = coalesce.(all_data.Cabin, "N");

All the cabin names start with an English letter followed by multiple digits. It seems like some passengers had booked multiple cabin rooms in their name, probably because many of them travelled with family. However, they all seem to be booked under the same letter followed by different numbers. The letters seem to carry more significance than the numbers. Therefore, we can group these cabins according to the letter of the cabin name.
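
Here is a small sketch of that grouping on a handful of example cabin strings in the formats described above. The list itself is an assumption for illustration, with "N" standing in for a missing cabin.

```julia
# Example cabin strings; "N" marks an unknown cabin.
cabins = ["C85", "C23 C25 C27", "E46", "N"]

# `first` returns the leading Char; `string` turns it back into a String.
deck_letters = [string(first(cabin)) for cabin in cabins]

# Splitting on whitespace shows how many rooms were booked under one name.
n_rooms = [length(split(cabin)) for cabin in cabins]
```

Note that a multi-room entry such as "C23 C25 C27" still maps to the single letter "C", which is why grouping by letter loses very little information.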

In [None]:
all_data.Cabin = [string(cabin[1]) for cabin in all_data.Cabin];

Now let's look at the value counts of the cabin features and see how it looks.

In [None]:
percent_value_counts(all_data, :Cabin)

So, we still haven't done any effective work to replace the null values. Let's stop for a second here and think through how we can take advantage of some of the other features.

- We can use the average of the fare column. We can use DataFrames.jl's **_groupby_** function to get the mean fare of each cabin letter.

In [None]:
@chain all_data begin
    DF.dropmissing(:Fare)
    DF.groupby(:Cabin)
    DF.combine(:Fare => Stats.mean => :Mean_Fare)
    DF.sort(:Mean_Fare)
end

Now, these means can help us determine the unknown cabins if we compare the fare of each unknown cabin row with the means given above. Let's write a simple function so that we can assign cabin names based on the means.
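
Before looking at the hand-written fare ranges, it may help to see the underlying idea in its simplest form: assign the deck whose mean fare is nearest to the passenger's fare. This is a sketch of a nearest-mean rule, which is not the same thing as the range-based function in this kernel. The mean fares and the helper name `nearest_deck` are hypothetical placeholders.

```julia
# Hypothetical mean fares per deck letter (placeholders, not the
# values produced by the groupby above).
mean_fare = Dict("G" => 14.0, "F" => 18.0, "A" => 41.0, "C" => 107.0)

# Assign the deck whose mean fare is closest to the passenger's fare.
function nearest_deck(fare::Real, means::Dict{String, Float64})
    decks = sort(collect(keys(means)))      # fixed order, so ties resolve the same way
    gaps = [abs(means[d] - fare) for d in decks]
    return decks[argmin(gaps)]
end

nearest_deck(15.0, mean_fare)    # "G"
nearest_deck(60.0, mean_fare)    # "A"
nearest_deck(100.0, mean_fare)   # "C"
```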

In [None]:
"""
Estimate the cabin letter from the fare, using fare ranges based on the mean fare of each cabin letter.
"""
function cabin_estimator(fare::Union{Float64, Missing})
    # Handle missing values
    if ismissing(fare)
        return "N"  # Default cabin for missing fare
    end
    
    if fare < 16
        return "G"
    elseif 16 ≤ fare < 27
        return "F"
    elseif 27 ≤ fare < 38
        return "T"
    elseif 38 ≤ fare < 47
        return "A"
    elseif 47 ≤ fare < 53
        return "E"
    elseif 53 ≤ fare < 54
        return "D"
    elseif 54 ≤ fare < 116
        return "C"
    else
        return "B"
    end
end

Let's apply the <b>cabin_estimator</b> function to each unknown cabin (cabins with <b>null</b> values). Once that is done, we will separate our train and test sets to continue towards machine learning modeling.

In [None]:
with_N = all_data[all_data.Cabin .== "N", :]
without_N = all_data[all_data.Cabin .!= "N", :];

In [None]:
with_N.Cabin = cabin_estimator.(with_N.Fare)

# Combine back together
all_data = vcat(with_N, without_N)

# Sort by PassengerId
DF.sort!(all_data, :PassengerId)

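# Separate train and test (after sorting, the first length(survivors) rows are the training set)
n_train = length(survivors)
train = all_data[1:n_train, :]
test = all_data[n_train+1:end, :]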

# Add back survival information
train.Survived = survivors;

### Fare Feature

---

If you have paid attention so far, you know that there is only one missing value in the fare column. Let's have a look at it.

In [None]:
test[ismissing.(test.Fare), :]

Here, we could take the average of the **Fare** column to fill in the missing value. However, for the sake of learning and practicing, we will try something else. We can take the average of the values where **Pclass** is **_3_**, **Sex** is **_male_** and **Embarked** is **_S_**.
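
To see why the group average can differ a lot from the overall average, here is a minimal sketch on five made-up passengers stored as named tuples. The rows and field names are assumptions for illustration only.

```julia
import Statistics as Stats

# Made-up passengers; one fare is unknown.
passengers = [
    (pclass = 3, sex = "male",   embarked = "S", fare = 8.05),
    (pclass = 3, sex = "male",   embarked = "S", fare = 7.25),
    (pclass = 3, sex = "male",   embarked = "S", fare = missing),
    (pclass = 1, sex = "female", embarked = "C", fare = 80.0),
    (pclass = 3, sex = "female", embarked = "S", fare = 12.0),
]

# Keep only rows that match the incomplete passenger's profile.
same_profile = filter(p -> p.pclass == 3 && p.sex == "male" && p.embarked == "S", passengers)

# Average the known fares within that group, and overall for comparison.
group_fares = [p.fare for p in same_profile]
all_fares = [p.fare for p in passengers]
group_mean = Stats.mean(skipmissing(group_fares))
overall_mean = Stats.mean(skipmissing(all_fares))
```

In this toy data the single first-class fare pulls the overall mean far above what a third-class passenger would plausibly have paid, which is the motivation for averaging within the matching group.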

In [None]:
missing_value = @chain test begin
    DF.subset(:Pclass => x -> x .== 3, :Embarked => x -> x .== "S", :Sex => x -> x .== "male")
    _.Fare
    skipmissing
    Stats.mean
end

test.Fare = coalesce.(test.Fare, missing_value);

### Age Feature

---

Among the remaining features, "Age" is the one with the most missing values. Let's see it in terms of percentage.

In [None]:
println("Train age missing value: $(round(count(ismissing, train.Age) / DF.nrow(train) * 100, digits=2))%")
println("Test age missing value: $(round(count(ismissing, test.Age) / DF.nrow(test) * 100, digits=2))%")

We will take a different approach since **~20% of the data in the Age column is missing** in both the train and test datasets. The age variable seems to be promising for determining survival rate. Therefore, it would be unwise to replace the missing values with the median, mean or mode. We will use a machine learning model, a Random Forest Regressor, to impute the missing values instead. We will keep the age column unchanged for now and work on that in the feature engineering section.

# Part 3. Visualization and Feature Relations

<a id="visualization_and_feature_relations" ></a>

---

Before we dive into finding relations between independent variables and our dependent variable(survivor), let us create some assumptions about how the relations may turn-out among features.

**Assumptions:**

- Gender: More females survived than males.
- Pclass: Passengers with higher socio-economic status survived more than others.
- Age: Younger passengers survived more than other passengers.
- Fare: Passengers with higher fares survived more than other passengers. This can be quite correlated with Pclass.

Now, let's see how the features are related to each other by creating some visualizations.

## 3a. Gender and Survived

<a id="gender_and_survived"></a>

---


In [None]:
Makie.set_theme!(Makie.theme_light())

In [None]:
fig = Makie.Figure()
ax = Makie.Axis(fig[1, 1], 
    title = "Survived/Non-Survived Passenger Gender Distribution",
    xlabel = "Sex",
    ylabel = "% of passenger survived",
    xticks= (1:2, ["Male", "Female"]),
    
)

# Calculate survival rates by gender
survival_by_sex = @chain train begin
    DF.groupby(:Sex)
    DF.combine(:Survived => Stats.mean => :survival_rate)
    DF.sort(:Sex, rev=true)  # "male" comes first, matching the xticks order
end

# Create elegant barplot
Makie.barplot!(ax, 1:2, survival_by_sex.survival_rate, 
           color = ["green", "pink"])

fig

The bar plot above shows the survival rate of female and male passengers. The **_x_label_** represents the **Sex** feature while the **_y_label_** represents the % of **passengers who survived**. This bar plot shows that ~74% of female passengers survived while only ~19% of male passengers survived.
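
The percentages in this plot come from a small trick: the mean of a 0/1 column is the share of ones. Here is a sketch on eight made-up passengers (not real rows from the dataset):

```julia
import Statistics as Stats

# Made-up outcomes: 1 = survived, 0 = did not survive.
sex      = ["female", "male", "female", "male", "male", "female", "male", "female"]
survived = [1,        0,      1,        0,      1,      0,        0,      1]

# The mean of a 0/1 column is the share of ones, i.e. the survival rate.
rate_by_sex = Dict(s => Stats.mean(survived[sex .== s]) for s in unique(sex))
```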


In [None]:
fig = Makie.Figure()
ax = Makie.Axis(fig[1, 1],
    title = "Passenger Gender Distribution - Survived vs Not-survived",
    xlabel = "Sex",
    ylabel = "# of Passenger Survived",
    xticks = (1:2, ["Male", "Female"])
)

# Count data for grouped bar chart
count_data = @chain train begin
    DF.groupby([:Sex, :Survived])
    DF.combine(DF.nrow => :count)
    DF.unstack(:Survived, :count, fill=0)
end

# Create grouped bar chart
counts = [count_data[1, 2], count_data[1, 3], count_data[2, 2], count_data[2, 3]]


Makie.barplot!(ax, [1, 1, 2, 2], counts,
           dodge = [1, 2, 1,2],
           color = ["gray", "green", "gray", "green"])



# Add legend
Makie.Legend(fig[1, 2], 
    [Makie.PolyElement(color = "gray"), Makie.PolyElement(color = "green")],
    ["Not Survived", "Survived"],
    "Survival Status")

fig

This count plot shows the actual number of male and female passengers that survived and did not survive. It shows that among all the females ~230 survived and ~80 did not survive, while among male passengers ~110 survived and ~470 did not survive.

**Summary**

---

- As we suspected, female passengers have survived at a much better rate than male passengers.
- It seems about right since females and children were the priority.

## 3b. Pclass and Survived

<a id="pcalss_and_survived"></a>

---

In [None]:
fig = Makie.Figure()
ax = Makie.Axis(fig[1, 1],
    title = "Passenger Class Distribution - Survival Percentage",
    xlabel = "Passenger Class",
    ylabel = "Percentage",
    titlesize = 20,
    xlabelsize = 16,
    ylabelsize = 16,
    xticks=(1:3, ["1st Class", "2nd Class", "3rd Class"])
)

# Calculate percentages by class
class_survival = @chain train begin
    DF.groupby([:Pclass, :Survived])
    DF.combine(DF.nrow => :count)
    DF.sort([:Pclass, :Survived])  # make the row (class) and column (0, 1) order explicit
    DF.unstack(:Survived, :count, fill=0)
end

no_survived = class_survival[:, 2]
yes_survived = class_survival[:, 3]
total_by_class = no_survived + yes_survived

survived_percentage = (yes_survived ./ total_by_class) * 100
not_survived_percentage = (no_survived ./ total_by_class) * 100

flatten = vcat(not_survived_percentage ,survived_percentage)

# One stack level per survival status; colors match the legend below
Makie.barplot!(ax, [1, 2, 3, 1, 2, 3], flatten, stack = [1, 1, 1, 2, 2, 2], color = ["#F44336", "#F44336", "#F44336", "#4CAF50", "#4CAF50", "#4CAF50"], strokewidth = 1, strokecolor = :black)

# Add legend
Makie.Legend(fig[1, 2],
    [Makie.PolyElement(color = "#F44336"), Makie.PolyElement(color = "#4CAF50")],
    ["Not Survived", "Survived"],
    "Survival Status")

fig

In [None]:
Makie.barplot([1, 2, 3], survived_percentage, axis=(xticks=(1:3, ["1st Class", "2nd Class", "3rd Class"]), title = "Passenger Class Distribution - Survived vs Non-Survived"), color=["brown", "orange", "green"])

- It looks like ...
  - ~63% of first class passengers survived the Titanic tragedy, while
  - ~48% of second class and
  - only ~24% of third class passengers survived.

In [None]:
fig = Makie.Figure()
# Title and axis labels belong to the Axis, not the Figure
ax = Makie.Axis(fig[1, 1],
    title = "Passenger Class Distribution - Survived vs Non-Survived",
    xlabel = "Passenger Class",
    ylabel = "Density of Passenger Survived",
    xticks = ([1, 2, 3], ["Upper", "Middle", "Lower"]))

d1 = Makie.density!(ax, train.Pclass[train.Survived .== 0], color = (:gray, 0.2), strokecolor=:gray, strokewidth=2)

d2= Makie.density!(ax, train.Pclass[train.Survived .== 1], color = (:green, 0.2), strokecolor=:green, strokewidth=2)

Makie.axislegend(ax,
    [d1, d2],
    ["Not Survived", "Survived"],
    "Survival Status")

fig

This KDE plot is pretty self-explanatory with all the labels and colors. Something some readers might find questionable is that more lower-class passengers survived than second-class passengers. That is true in absolute numbers, since there were a lot more third-class passengers than first- and second-class passengers. The survival rate of third class was still the lowest.
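
The difference between counts and rates is easy to check with a few numbers. The class sizes and survivor counts below are hypothetical round figures chosen for illustration, not the real totals:

```julia
# Hypothetical class sizes and survivor counts (illustrative only).
class_total    = Dict("2nd" => 180, "3rd" => 500)
class_survived = Dict("2nd" => 90,  "3rd" => 120)

# Absolute counts favour third class...
more_third_class_survivors = class_survived["3rd"] > class_survived["2nd"]

# ...but the survival rate within each class tells the opposite story.
rate = Dict(c => class_survived[c] / class_total[c] for c in keys(class_total))
```

A larger group can contribute more survivors in total while each of its members had a worse chance, which is exactly what the density plot and the percentage plot show side by side.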

**Summary**

---

The first class passengers had the upper hand during the tragedy. You will probably agree with me even more in the next section of visualizations, where we look at the distribution of ticket fare against the Survived column.

## 3c. Fare and Survived

<a id="fare_and_survived"></a>

---

In [None]:
fig = Makie.Figure()

ax = Makie.Axis(fig[1, 1],
    title = "Fare Distribution - Survived vs Non-Survived",
    xlabel = "Fare",
    ylabel = "Density of Passenger Survived",
)
 
d1 = Makie.density!(ax, train.Fare[train.Survived .== 0], color = (:gray, 0.2), strokecolor=:gray, strokewidth=2)
d2 = Makie.density!(ax, train.Fare[train.Survived .== 1], color = (:green, 0.2), strokecolor=:green, strokewidth=2)

Makie.axislegend(ax,
    [d1, d2],
    ["Not Survived", "Survived"],
    "Survival Status")
fig

This plot shows something impressive.

- The spike in the plot under 100 dollars shows that a lot of passengers who bought a ticket within that range did not survive.
- When the fare is above approximately 280 dollars, there is no gray shade, which means either everyone past that fare point survived or maybe there is an outlier that clouds our judgment. Let's check...

In [None]:
train[train.Fare .> 280, :]

As we assumed, it looks like an outlier with a fare of $512. We could certainly delete it. However, we will keep it for now.
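
Instead of picking a cutoff by eye, an outlier can also be flagged with a rule of thumb. The sketch below uses Tukey's 1.5 × IQR fence, which is a common convention and not the method used in this kernel, on a made-up list of fares:

```julia
import Statistics as Stats

# Made-up fares with one extreme value.
fares = [7.25, 8.05, 13.0, 26.0, 31.0, 52.0, 80.0, 512.33]

# Tukey's rule: flag values more than 1.5 * IQR above the third quartile.
q1, q3 = Stats.quantile(fares, [0.25, 0.75])
upper_fence = q3 + 1.5 * (q3 - q1)
outliers = filter(f -> f > upper_fence, fares)

# Compare the mean with and without the flagged values.
mean_all = Stats.mean(fares)
mean_trimmed = Stats.mean(filter(f -> f <= upper_fence, fares))
```

Comparing `mean_all` with `mean_trimmed` shows how strongly a single extreme fare can pull a summary statistic, which is the reason to at least be aware of such points.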

## 3d. Age and Survived

<a id="age_and_survived"></a>

---

In [None]:
fig = Makie.Figure()

ax = Makie.Axis(fig[1, 1], title = "Age Distribution - Survived vs Non-Survived",
    xlabel = "Age",
    ylabel = "Density of Passenger Survived")


# clean missing first
clean_train =  DF.dropmissing(train, :Age)
not_survived = clean_train.Age[clean_train.Survived .== 0]
survived = clean_train.Age[clean_train.Survived .== 1]

d1 = Makie.density!(ax, not_survived, color = (:gray, 0.2), strokecolor=:gray, strokewidth=2)
d2 = Makie.density!(ax, survived, color = (:green, 0.2), strokecolor=:green, strokewidth=2)

Makie.axislegend(ax,
    [d1, d2],
    ["Not Survived", "Survived"],
    "Survival Status")

fig

There is nothing out of the ordinary about this plot, except the very left part of the distribution. This may hint at the possibility that children and infants were the priority.

## 3e. Combined Feature Relations

<a id='combined_feature_relations'></a>

---

In this section, we are going to discover more than two feature relations in a single graph. I will try my best to illustrate most of the feature relations. Let's get to it.

In [None]:
fig = Makie.Figure()

# Create subplots for each combination

for (i, (sex, survived)) in enumerate(Iterators.product(["female", "male"], [0, 1]))

    ax = Makie.Axis(fig[div(i - 1, 2) + 1, i % 2 + 1],
        title = "$sex $(survived == 1 ? "Survived" : "Not Survived")",
        xlabel = "Age",
        ylabel = "Count"
    )
    
    subset_data = train[(train.Sex .== sex) .& (train.Survived .== survived) .& .!ismissing.(train.Age), :]
    
    Makie.hist!(ax, subset_data.Age, bins = 20, 
            color = survived == 1 ? "green" : "gray",
            strokewidth = 1, strokecolor = :white)
   
end

Makie.Label(fig[0, :], "Survived by Sex and Age")
fig

A facet grid is a great way to visualize multiple variables and their relationships at once. From the chart in section 3a we have an intuition that female passengers had higher priority than males during the tragedy. However, from this facet grid, we can also understand which age groups survived more than others or were not so lucky.

In [None]:
fig = Makie.Figure()

# Create subplots for each combination
for (i, (sex, embarked)) in enumerate(Iterators.product(["female", "male"], ["S", "C", "Q"]))

    ax = Makie.Axis(fig[div(i - 1, 2) + 1, i % 2 + 1],
        title = "$sex $embarked",
    )

    subset_data = train[(train.Sex .== sex) .& (train.Embarked .== embarked) .& .!ismissing.(train.Age), :]

    for (survived) in [0, 1]
        subset_survived = subset_data[(subset_data.Survived .== survived), :]
        println("Length of subset: $(DF.nrow(subset_survived))")

        if DF.nrow(subset_survived) > 0  # skip empty groups; hist! needs data
             Makie.hist!(ax, subset_survived.Age, 
                        bins = 20,
                        color = survived == 1 ? (:green, 0.5) : (:gray, 0.5),
                        strokewidth = 1, 
                        strokecolor = :white,
                        label = survived == 1 ? "Survived" : "Not Survived"
                    )
        end
    end
end


Makie.Legend(fig[1, 3], 
    [Makie.PolyElement(color = (:gray, 0.7)), 
     Makie.PolyElement(color = (:green, 0.7))],
    ["Not Survived", "Survived"],
    "Survival Status"
)

Makie.Label(fig[0, :], "Survived by Sex, Embarked and Age")
fig

This is another compelling facet grid illustrating the relationship of four features at once. They are **Embarked, Age, Survived & Sex**.

- The color illustrates passengers' survival status (green represents survived, gray represents not survived)
- The column represents Sex (left being male, right being female)
- The row represents Embarked (from top to bottom: S, C, Q)

---

Now that I have pointed out the obvious, let's see if we can get some insights that are not so obvious as we look at the data.

- Most passengers seem to have boarded at Southampton (S).
- More than 60% of the passengers who boarded at Southampton died.
- More than 60% of the passengers who boarded at Cherbourg (C) lived.
- Pretty much every male that boarded at Queenstown (Q) did not survive.
- There were very few females who boarded at Queenstown; however, most of them survived.

In [None]:
fig = Makie.Figure(size = (1000, 600))

ax_m = Makie.Axis(fig[1, 1],
    title = "Male", 
    xlabel = "Fare",
    ylabel = "Age")
# Female subplot
ax_f = Makie.Axis(fig[1, 2], 
    title = "Female",
    xlabel = "Fare",
    ylabel = "Age")

female_data = train[(train.Sex .== "female") .& .!ismissing.(train.Age), :]
male_data = train[(train.Sex .== "male") .& .!ismissing.(train.Age), :]


Makie.scatter!(ax_m, male_data.Fare, male_data.Age,
           color = [s == 1 ? "green" : "gray" for s in male_data.Survived],
           strokewidth=1, strokecolor="white", markersize=14)
Makie.scatter!(ax_f, female_data.Fare, female_data.Age,
           color = [s == 1 ? "green" : "gray" for s in female_data.Survived],
           strokewidth=1, strokecolor="white", markersize=14)


# Add legend
Makie.Legend(fig[1, 3],
    [Makie.MarkerElement(color = "gray", marker = :circle), 
     Makie.MarkerElement(color = "green", marker = :circle)],
    ["Not Survived", "Survived"],
    "Survived")

Makie.Label(fig[0, :], "Survived by Sex, Fare and Age")
fig

This facet grid unveils a couple of interesting insights. Let's find out.

- The grid above clearly shows the three outliers with a Fare of over \$500. At this point, I think we are quite confident that these outliers should be deleted.
- Most of the passengers paid a Fare of less than \$100.

In [None]:
train = train[train.Fare .< 500, :]

fig = Makie.Figure(size = (800, 600))
ax = Makie.Axis(fig[1, 1],
    title = "Parents/Children Survival Rate",
    xlabel = "Number of Parents/Children",
    ylabel = "Survival Rate",
)

parch_survival = @chain train begin
    DF.groupby(:Parch)
    DF.combine(
        :Survived => Stats.mean => :survival_rate,
        :Survived => Stats.std => :std_dev,
        :Survived => length => :count
    )
end

parch_survival.std_error = parch_survival.std_dev ./ sqrt.(parch_survival.count)

Makie.scatterlines!(ax, parch_survival.Parch, parch_survival.survival_rate,
    color = "#2196F3", 
    linewidth = 3,
    markersize = 8
)

err_bars = Makie.errorbars!(ax, parch_survival.Parch, parch_survival.survival_rate, 
    parch_survival.std_error,
    color = "blue",
    linewidth = 2,
    whiskerwidth = 8
)

Makie.Legend(fig[1, 2],
    [Makie.PolyElement(color = "#2196F3"), Makie.PolyElement(color = "blue")],
    ["Survival Rate", "Standard Error"],
    "Legend"
)
fig

**Passengers who traveled in big groups with parents/children had a lower survival rate than other passengers.**

In [None]:
fig = Makie.Figure(size = (800, 600))
ax = Makie.Axis(fig[1, 1],
    title = "Siblings/Spouses Survival Rate",
    xlabel = "Number of Siblings/Spouses",
    ylabel = "Survival Rate",
)

sibsp_survival = @chain train begin
    DF.groupby(:SibSp)
    DF.combine(
        :Survived => Stats.mean => :survival_rate,
        :Survived => Stats.std => :std_dev,
        :Survived => length => :count
    )
end

sibsp_survival.std_error = sibsp_survival.std_dev ./ sqrt.(sibsp_survival.count)

Makie.scatterlines!(ax, sibsp_survival.SibSp, sibsp_survival.survival_rate,
    color = "#2196F3", 
    linewidth = 3,
    markersize = 8
)

err_bars = Makie.errorbars!(ax, sibsp_survival.SibSp, sibsp_survival.survival_rate,
    sibsp_survival.std_error,
    color = "blue",
    linewidth = 2,
    whiskerwidth = 8
)

Makie.Legend(fig[1, 2],
    [Makie.PolyElement(color = "#2196F3"), Makie.PolyElement(color = "blue")],
    ["Survival Rate", "Standard Error"],
    "Legend"
)
fig

**Meanwhile, passengers who traveled in small groups with siblings/spouses had better chances of surviving than other passengers.**

In [None]:
train.Sex = [sex == "female" ? 0 : 1 for sex in train.Sex]
test.Sex = [sex == "female" ? 0 : 1 for sex in test.Sex];

# Part 4: Statistical Overview

<a id="statisticaloverview"></a>

---

![title](https://cdn-images-1.medium.com/max/400/1*hFJ-LI7IXcWpxSLtaC0dfg.png)

**Train info**

In [None]:
DF.describe(train)

In [None]:
categorical_cols = [col for col in names(train) if eltype(train[!, col]) <: Union{String, AbstractString}]
DF.describe(train[!, categorical_cols])

In [None]:
survived_summary = @chain train begin
    DF.select(DF.names(train, Number)...)
    DF.groupby(:Survived)
    DF.combine(DF.All() .=> Stats.mean)
end

In [None]:
sex_summary = @chain train begin
    DF.select(DF.names(train, Number)...)
    DF.groupby(:Sex)
    DF.combine(DF.All() .=> Stats.mean)
end

In [None]:
class_summary = @chain train begin
    DF.select(DF.names(train, Number)...)
    DF.groupby(:Pclass)
    DF.combine(DF.All() .=> Stats.mean)
end

I have gathered a small summary from the statistical overview above. Let's see what they are...

- This train data set has 891 rows and 9 columns.
- Only ~38% of the passengers survived that tragedy.
- ~74% of female passengers survived, while only ~19% of male passengers survived.
- ~63% of first class passengers survived, while only ~24% of lower class passengers survived.

## 4a. Correlation Matrix and Heatmap

<a id="heatmap"></a>

---

### Correlations

In [None]:
train_clean = DF.dropmissing(train)
train_numeric = DF.select(train_clean, DF.names(train_clean, Number)...)
corr_matrix = Stats.cor(Stats.Matrix(train_numeric))

corr_df = DF.DataFrame(corr_matrix, DF.names(train_numeric))

In [None]:
DF.sort(DF.DataFrame(
    Variable = DF.names(corr_df),
    Correlation = abs.(corr_df[!, :Survived])
), [:Correlation], rev=true)

**Sex is the feature most strongly correlated with _Survived_ (the dependent variable), followed by Pclass.**

In [None]:
DF.sort(DF.DataFrame(
    Variable = DF.names(corr_df),
    Correlation = abs.(corr_df[!, :Survived]) .^ 2
), [:Correlation], rev=true)

**Squaring the correlations not only makes every value positive but also widens the gap between strong and weak relationships.**


In [None]:
n = size(corr_df, 1)

fig = Makie.Figure()
ax = Makie.Axis(fig[1, 1], 
    title = "Correlations Between Variables",
    xticks=((1:n), DF.names(corr_df)), 
    yticks=((1:n), DF.names(corr_df)),
)

hm = Makie.heatmap!(ax, (1:n), (1:n), corr_matrix, colormap="RdBu")

for i in 1:n
    for j in 1:n
        text_val = corr_matrix[j, i]
        
        Makie.text!(ax, i, j, 
            text = string(round(text_val, digits=2)),
            color = abs(text_val) > 0.5 ? :white : :black,
            fontsize = 10,
            align = (:center, :center)
        )
    end
end

Makie.Colorbar(fig[1, 2], hm, 
    label = "Correlation Coefficient",
)

fig

#### Positive Correlation Features:

- Fare and Survived: 0.26

#### Negative Correlation Features:

- Fare and Pclass: -0.6
- Sex and Survived: -0.55
- Pclass and Survived: -0.33

**So, let's analyze these correlations a bit.** We have found some moderately strong relationships between different features. There is a definite positive correlation between Fare and Survived (0.26): passengers who paid more for their ticket were more likely to survive. This aligns with another correlation, the one between Fare and Pclass (-0.6), which says that first class passengers (1) paid more than second class passengers (2), who in turn paid more than third class passengers (3). The same story is supported by the correlation between Pclass and our dependent variable, Survived, which is -0.33: first class passengers had a better chance of surviving than second class passengers, and second class passengers a better chance than third.

However, the strongest correlation with our dependent variable belongs to the Sex variable, which records whether the passenger was male or female. This negative correlation of about -0.55 points towards some undeniable insights. Let's do some statistics to see how statistically significant this correlation is.
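
The coefficients discussed above are Pearson correlations. As a minimal sketch of what that number measures, the snippet below writes the formula out by hand and checks it on a tiny made-up sample. The helper `pearson` and the toy vectors are hypothetical and are not part of the kernel's data.

```julia
using Statistics

# Pearson correlation written out: the summed product of the centered values
# divided by the square root of the product of the summed squares.
function pearson(x::AbstractVector{<:Real}, y::AbstractVector{<:Real})
    dx = x .- mean(x)
    dy = y .- mean(y)
    return sum(dx .* dy) / sqrt(sum(dx .^ 2) * sum(dy .^ 2))
end

# Hypothetical toy data: 0 = female, 1 = male, and whether each passenger survived.
sex_toy      = [0, 0, 0, 1, 1, 1, 1, 0]
survived_toy = [1, 1, 0, 0, 0, 1, 0, 1]

r = pearson(sex_toy, survived_toy)
println("r = ", round(r, digits = 2), ", r^2 = ", round(r^2, digits = 2))
```

A negative `r` here means that the group coded as 1 (male) survived less often in the toy sample, which is the same reading we apply to the real Sex and Survived correlation.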

## 4b. Statistical Test for Correlation

<a id="statistical_test"></a>

---

Statistical tests are the scientific way to check whether our theories hold up. When we look at the data, we often have an intuitive sense of where it is leading us. A statistical test adds a mathematical perspective on how significant those results are. Let's apply some of these methods and see how we are doing with our predictions.

### Hypothesis Testing Outline

A hypothesis test compares the mean of a control group and experimental group and tries to find out whether the two sample means are different from each other and if they are different, how significant that difference is.

A **hypothesis test** usually consists of multiple parts:

1. Formulate a well-developed research problem or question: The hypothesis test usually starts with a concrete and well-developed researched problem. We need to ask the right question that can be answered using statistical analysis.
2. **The null hypothesis ($H_0$) and the alternative hypothesis ($H_A$)**:
   > - The **null hypothesis ($H_0$)** is something that is assumed to be true. It is the status quo. Under the null hypothesis, the observations are the result of pure chance. When we set out to experiment, we form the null hypothesis by saying that there is no difference between the means of the control group and the experimental group.
   > - The **alternative hypothesis ($H_A$)** is a claim and the opposite of the null hypothesis. It goes against the status quo. Under the alternative hypothesis, the observations show a real effect combined with a component of chance variation.
3. Determine the **test statistic**: the test statistic is used to assess the evidence against the null hypothesis. Depending on whether the population standard deviation is known, we use either a z-statistic or a t-statistic. In addition to that, we want to identify whether the test is a one-tailed test or a two-tailed test. [This](https://support.minitab.com/en-us/minitab/18/help-and-how-to/statistics/basic-statistics/supporting-topics/basics/null-and-alternative-hypotheses/) article explains it pretty well. [This](https://stattrek.com/hypothesis-test/hypothesis-testing.aspx) article is pretty good as well.

4. Specify a **significance level** and **confidence level**: The significance level ($\alpha$) is the probability of rejecting the null hypothesis when it is actually true. In other words, it is the rate of false alarms we are **_comfortable_** with. The significance level is one minus the confidence level. For example, if our significance level is 5%, then our confidence level is (1 - 0.05) = 0.95 or 95%.

5. Compute the **T-statistic/Z-statistic**: Computing the t-statistic follows a simple equation, which differs slightly between a one-sample test and a two-sample test.

6. Compute the **P-value**: The p-value is the probability of obtaining a test statistic at least as extreme as the one observed, assuming that the null hypothesis is correct. The p-value is known to be unintuitive, and even many professors are known to explain it wrong. I think this [video](https://www.youtube.com/watch?v=E4KCfcVwzyw) explains the p-value well. **The smaller the P-value, the stronger the evidence against the null hypothesis.**

7. **Describe the result and compare the p-value with the significance level ($\alpha$)**: If p <= $\alpha$, the observed effect is statistically significant and we reject the null hypothesis in favor of the alternative hypothesis. However, if p > $\alpha$, we say that we fail to reject the null hypothesis. This phrasing sounds awkward, but it is logically right: we never accept the null hypothesis, because a test on sample data can only fail to find evidence against it.

We will follow each of the steps above to do our hypothesis testing below.

P.S. Khan Academy has a set of videos that I think are intuitive and helped me understand the concepts.

---

### Hypothesis testing for Titanic

#### Formulating a well developed researched question:

Regarding this dataset, we can formulate the null hypothesis and alternative hypothesis by asking the following questions.

> - **Is there a significant difference in the mean sex between the passengers who survived and the passengers who did not survive?**
> - **Is there a substantial difference in the survival rate between the male and female passengers?**

#### The Null Hypothesis and The Alternative Hypothesis:

We can formulate our hypothesis by asking questions differently. However, it is essential to understand what our end goal is. Here our dependent variable or target variable is **Survived**. Therefore, we say

> **Null Hypothesis ($H_0$):** There is no difference in the survival rate between the male and female passengers, or the mean difference between male and female passengers in the survival rate is zero.  
> **Alternative Hypothesis ($H_A$):** There is a difference in the survival rate between the male and female passengers, or the mean difference in the survival rate between male and female passengers is not zero.

One thing we can do is set up the null and alternative hypotheses in such a way that, when we do our t-test, we can choose to do a one-tailed test. According to [this](https://support.minitab.com/en-us/minitab/18/help-and-how-to/statistics/basic-statistics/supporting-topics/basics/null-and-alternative-hypotheses/) article, one-tailed tests are more powerful than two-tailed tests. In addition to that, [this](https://www.youtube.com/watch?v=5NcMFlrnYp8&list=PLIeGtxpvyG-LrjxQ60pxZaimkaKKs0zGF) video is also quite helpful for understanding these topics. With this in mind, we can update our null and alternative hypotheses. Let's see how we can rewrite them.

> **Null Hypothesis ($H_0$):** male mean is greater than or equal to female mean.

> **Alternative Hypothesis ($H_A$):** male mean is less than female mean.

#### Determine the test statistics:

> This will be a two-tailed test since the difference between male and female passengers in the survival rate could be higher or lower than 0.
> Since we do not know the population standard deviation ($\sigma$) and n is small, we will use the t-distribution.

#### Specify the significance level:

> Specifying a significance level is an important step of the hypothesis test. It is a balance between type 1 error and type 2 error. We will discuss those in more depth in another lesson. For now, we set our significance level ($\alpha$) to 0.05. So, our confidence level, which defines the non-rejection region, is (1 - $\alpha$) = (1 - 0.05) = 95%.

#### Computing T-statistics and P-value:

Let's start with the survival means of the male and female passengers in the training data and see the difference.

In [None]:
male_mean = Stats.mean(train[train.Sex .== 1, :Survived])
female_mean = Stats.mean(train[train.Sex .== 0, :Survived])

println("Male survival mean: ", male_mean)
println("Female survival mean: ", female_mean)
println("The mean difference between male and female survival rate: ", female_mean - male_mean)

Now, we have to understand that those two means are not **the population mean ($\mu$)**. _The population mean is a statistical term statisticians use to indicate the actual average of the entire group. The group can be any collection of measurable things, such as animals, humans, plants, money, or stocks._ For example, to find the mean age of the population of Bulgaria, we would have to record every single person's age and take the average. That is almost impossible, and if we were to go that route, there would be no point in doing statistics in the first place. Therefore we approach this problem using samples. The idea is that if we take multiple samples from the same population, compute the mean of each, and plot those means as a distribution, the distribution starts to look like a **normal distribution**. The more sample means we add, the clearer it becomes that this distribution is centered on the population mean. This is where the **central limit theorem** comes from. We will go more in depth on this topic later on.
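
To make the sampling idea concrete, here is a small simulation sketch. It assumes a made-up 0/1 population with a true rate of about 0.38 (not the Titanic data) and shows that the sample means cluster around the population mean with a spread close to $\sigma / \sqrt{n}$.

```julia
using Random, Statistics

rng = MersenneTwister(42)

# Hypothetical 0/1 "population" with a true rate of about 0.38.
population = Float64.(rand(rng, 10_000) .< 0.38)

# Draw many samples of size n (with replacement) and record each sample mean.
n = 50
sample_means = [mean(rand(rng, population, n)) for _ in 1:2_000]

println("Population mean:          ", round(mean(population), digits = 3))
println("Mean of the sample means: ", round(mean(sample_means), digits = 3))
println("Std of the sample means:  ", round(std(sample_means), digits = 3))
println("sigma / sqrt(n):          ", round(std(population) / sqrt(n), digits = 3))
```

The first two printed numbers should nearly coincide, and so should the last two, which is the behaviour the central limit theorem describes.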

Going back to our dataset, as we said, the means above are only part of the whole story. We were given part of the data to train our machine learning models, and the other part was held back for testing. Therefore, it is impossible for us at this point to know the population means of survival for males and females. A situation like this calls for a statistical approach. We will use the sampling distribution approach to do the test. Let's take 50 random samples of males and females from our train data.

In [None]:
male = train[train.Sex .== 1, :]
female = train[train.Sex .== 0, :]

# Empty vectors to store the sample means
m_mean_samples = Float64[]
f_mean_samples = Float64[]

# Draw 50 random samples
for i in 1:50
    # Randomly sample 50 elements without replacement
    male_sample = StatsBase.sample(male.Survived, 50, replace=false)
    female_sample = StatsBase.sample(female.Survived, 50, replace=false)
    
    push!(m_mean_samples, Stats.mean(male_sample))
    push!(f_mean_samples, Stats.mean(female_sample))
end

println("Male mean sample mean: ", round(Stats.mean(m_mean_samples), digits=2))
println("Female mean sample mean: ", round(Stats.mean(f_mean_samples), digits=2))
println("Difference between male and female mean sample mean: ", 
        round(Stats.mean(f_mean_samples) - Stats.mean(m_mean_samples), digits=2))

H0: male mean is greater or equal to female mean<br>
H1: male mean is less than female mean.

According to the samples, the measured difference between our male sample mean ($\bar{x}_m$) and female sample mean ($\bar{x}_f$) is ~0.55 (statistically, this is called the point estimate of the difference between the male and female population means). Keep in mind that...

- We randomly selected 50 people to be in the male group and 50 people to be in the female group.
- We know our sample is selected from a broader population (the training set).
- We know we could have ended up with a totally different random sample of males and females.

---

With all three points above in mind, how confident are we that the measured difference is real or statistically significant? We can perform a **t-test** to evaluate that. When we perform a **t-test**, we are usually looking for **evidence of a significant difference between a population mean and a hypothesized mean (one-sample t-test) or, in our case, between two population means (two-sample t-test).**

The **t-statistic** measures the degree to which our groups differ, standardized by the variability of our measurements. In other words, it is basically a measure of signal over noise. Let us describe the previous sentence a bit more for clarification. I am going to use [this post](http://blog.minitab.com/blog/statistics-and-quality-data-analysis/what-is-a-t-test-and-why-is-it-like-telling-a-kid-to-clean-up-that-mess-in-the-kitchen) as reference to describe the t-statistic here.

#### Calculating the t-statistics

# $$t = \frac{\bar{x}-\mu}{\frac{S} {\sqrt{n}} }$$

Here..

- $\bar{x}$ is the sample mean.
- $\mu$ is the hypothesized mean.
- S is the sample standard deviation.
- n is the sample size.

1. The numerator of this fraction, $(\bar{x}-\mu)$, is basically the strength of the signal: the difference between the sample mean and the hypothesized mean. The larger this difference, the stronger the signal.

2. The denominator of this fraction, ${S}/{\sqrt{n}}$, measures the amount of variation or noise in the data set. Here S is the standard deviation, which tells us how much variation there is in the data, and n is the sample size.
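
As a minimal sketch of the one-sample formula, the snippet below computes the signal and the noise separately and divides them. The function name `one_sample_t` and the measurements are hypothetical and only illustrate the arithmetic.

```julia
using Statistics

# One-sample t-statistic: signal (sample mean minus hypothesized mean)
# divided by noise (sample standard deviation over sqrt(n)).
function one_sample_t(x::AbstractVector{<:Real}, mu::Real)
    n = length(x)
    signal = mean(x) - mu
    noise = std(x) / sqrt(n)   # std uses the n - 1 (sample) correction
    return signal / noise
end

# Hypothetical measurements tested against a hypothesized mean of 5.0.
x = [5.1, 4.9, 5.6, 5.8, 6.0, 5.3, 5.7, 5.2]
t = one_sample_t(x, 5.0)
println("t = ", round(t, digits = 2))
```

A larger gap between the sample mean and 5.0, or less spread in `x`, would both push `t` further from zero.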

So, according to the explanation above, the t-value or t-statistic measures the strength of the signal (the difference) relative to the amount of noise (the variation) in the data, and that is how we calculate the t-value in a one-sample t-test. However, to compare two sample means, as in our case, we will use the following equation.

# $$t = \frac{\bar{x}_M - \bar{x}_F}{\sqrt {s^2 (\frac{1}{n_M} + \frac{1}{n_F})}}$$

This equation may seem complex; however, the idea behind the two is similar. Both of them follow the concept of signal/noise. The only difference is that we replace our hypothesized mean with another sample mean, and the two sample sizes replace the single sample size.

Here..

- $\bar{x}_M$ is the mean of our male group sample measurements.
- $\bar{x}_F$ is the mean of our female group sample measurements.
- $n_M$ and $n_F$ are the numbers of observations in each group.
- $s^2$ is the pooled sample variance.

It is good to have an understanding of what is going on in the background. However, in practice we let a statistics library compute the t-statistic (**scipy.stats** in the Python version of this kernel).
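
For reference, here is a minimal sketch of the two-sample formula above, using only the standard library. It assumes equal group variances and pools them as $s^2 = \frac{(n_M - 1)s_M^2 + (n_F - 1)s_F^2}{n_M + n_F - 2}$. The function `two_sample_t` and the toy outcomes are hypothetical and are not the kernel's original test cell.

```julia
using Statistics

# Two-sample t-statistic with a pooled variance (assumes equal group variances).
function two_sample_t(a::AbstractVector{<:Real}, b::AbstractVector{<:Real})
    n_a, n_b = length(a), length(b)
    # Pooled variance: the two sample variances weighted by their degrees of freedom.
    s2 = ((n_a - 1) * var(a) + (n_b - 1) * var(b)) / (n_a + n_b - 2)
    t = (mean(a) - mean(b)) / sqrt(s2 * (1 / n_a + 1 / n_b))
    return (t = t, df = n_a + n_b - 2)
end

# Hypothetical 0/1 survival outcomes for ten men and ten women.
male_toy   = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
female_toy = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]

result = two_sample_t(male_toy, female_toy)
println("t = ", round(result.t, digits = 2), " with ", result.df, " degrees of freedom")
```

The p-value then comes from the t-distribution with `df` degrees of freedom. One option in Julia, assuming the HypothesisTests.jl package is installed, is `HypothesisTests.pvalue(HypothesisTests.EqualVarianceTTest(m_mean_samples, f_mean_samples))`.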

#### Compare P-value with $\alpha$

> It looks like the p-value is very small compared to our significance level ($\alpha$) of 0.05. Our observed difference is statistically significant. Therefore, we reject the null hypothesis in favor of the alternative hypothesis, which is "**There is a significant difference in the survival rate between the male and female passengers.**"

# Part 5: Feature Engineering

<a id="feature_engineering"></a>

---

Feature engineering is exactly what it sounds like. Sometimes we want to create extra features from within the features that we have, and sometimes we want to remove features that are alike. Feature engineering is the umbrella term for all of that. It is important to remember that we should create new features in ways that do not cause **multicollinearity (a relationship among independent variables)**.

## name_length

**_Creating a new feature "name_length" that will take the count of letters of each name_**

In [None]:
train[!, :name_length] = [length(i) for i in train.Name]
test[!, :name_length] = [length(i) for i in test.Name]

function name_length_group(size)
    a = ""
    if size <= 20
        a = "short"
    elseif size <= 35
        a = "medium"
    elseif size <= 45
        a = "good"
    else
        a = "long"
    end
    return a
end

train[!, :nLength_group] = [name_length_group(x) for x in train.name_length]
test[!, :nLength_group] = [name_length_group(x) for x in test.name_length]

## title

**Getting the title of each name as a new feature.**

In [None]:
train[!, :title] = [split(i, '.')[1] for i in train.Name]
train[!, :title] = [split(i, ',')[2] for i in train.title]

In [None]:
println(unique(train.title))

In [None]:
## The titles still carry leading whitespace. Let's strip it.
train[!, :title] = [strip(x) for x in train.title]

In [None]:
## We can also combine all three steps above into one line for the test set
test[!, :title] = [strip(split(split(i, '.')[1], ',')[2]) for i in test.Name]
## However it is important to be able to write readable code, and the line above is not so readable.

In [None]:
## Let's replace some of the rare values with the keyword 'rare' and other word choice of our own.
## train Data
train[!, :title] = [replace(i, "Ms" => "Miss") for i in train.title]
train[!, :title] = [replace(i, "Mlle" => "Miss") for i in train.title]
train[!, :title] = [replace(i, "Mme" => "Mrs") for i in train.title]
train[!, :title] = [replace(i, "Dr" => "rare") for i in train.title]
train[!, :title] = [replace(i, "Col" => "rare") for i in train.title]
train[!, :title] = [replace(i, "Major" => "rare") for i in train.title]
train[!, :title] = [replace(i, "Don" => "rare") for i in train.title]
train[!, :title] = [replace(i, "Jonkheer" => "rare") for i in train.title]
train[!, :title] = [replace(i, "Sir" => "rare") for i in train.title]
train[!, :title] = [replace(i, "Lady" => "rare") for i in train.title]
train[!, :title] = [replace(i, "Capt" => "rare") for i in train.title]
train[!, :title] = [replace(i, "the Countess" => "rare") for i in train.title]
train[!, :title] = [replace(i, "Rev" => "rare") for i in train.title]

## Now in programming there is a term called DRY(Don't repeat yourself), whenever we are repeating
## same code over and over again, there should be a light-bulb turning on in our head and make us think
## to code in a way that is not repeating or dull. Let's write a function to do exactly what we
## did in the code above, only not repeating and more interesting.

In [None]:
## we are writing a function that can help us modify title column
"""
    This function helps modifying the title column
"""
function name_converted(feature)
    result = ""
    if feature in ["the Countess", "Capt", "Lady", "Sir", "Jonkheer", "Don", "Major", "Col", "Rev", "Dona", "Dr"]
        result = "rare"
    elseif feature in ["Ms", "Mlle"]
        result = "Miss"
    elseif feature == "Mme"
        result = "Mrs"
    else
        result = feature
    end
    return result
end

test[!, :title] = [name_converted(x) for x in test.title]
train[!, :title] = [name_converted(x) for x in train.title];

In [None]:
println(unique(train.title))
println(unique(test.title))

## family_size

**_Creating a new feature called "family_size"._**

In [None]:
## Family_size seems like a good feature to create
train[!, :family_size] = train.SibSp + train.Parch .+ 1
test[!, :family_size] = test.SibSp + test.Parch .+ 1

In [None]:
## bin the family size.
"""
This function groups(loner, small, large) family based on family size
"""
function family_group(size)
    a = ""
    if size <= 1
        a = "loner"
    elseif size <= 4
        a = "small"
    else
        a = "large"
    end
    return a
end

In [None]:
## apply the family_group function in family_size
train[!, :family_group] = [family_group(x) for x in train.family_size]
test[!, :family_group] = [family_group(x) for x in test.family_size];

## is_alone

In [None]:
train[!, :is_alone] = [i < 2 ? 1 : 0 for i in train.family_size]
test[!, :is_alone] = [i < 2 ? 1 : 0 for i in test.family_size];

## ticket

In [None]:
println(StatsBase.sample(collect(StatsBase.countmap(train.Ticket)), 10))

I have yet to figure out how best to handle the ticket feature, so any suggestion would be truly appreciated. For now, I will get rid of the ticket feature.

In [None]:
DF.select!(train, DF.Not(:Ticket))
DF.select!(test, DF.Not(:Ticket));

## calculated_fare

In [None]:
## Calculating fare based on family size.
train[!, :calculated_fare] = train.Fare ./ train.family_size
test[!, :calculated_fare] = test.Fare ./ test.family_size;

Some people travelled in groups, such as families or friends. It seems like the Fare column recorded the total fare for the group rather than the fare of each individual passenger, so a per-person calculated fare comes in handy here.

## fare_group


In [None]:
"""
    This function creates a fare group based on the fare provided
    """
function fare_group(fare::Float64)
    a = ""
    if fare <= 4
        a = "Very_low"
    elseif fare <= 10
        a = "low"
    elseif fare <= 20
        a = "mid"
    elseif fare <= 45
        a = "high"
    else
        a = "very_high"
    end
    return a
end

train[!, :fare_group] = [fare_group(x) for x in train.calculated_fare]
test[!, :fare_group] = [fare_group(x) for x in test.calculated_fare];

Fare group was calculated based on <i>calculated_fare</i>. This can further help our cause.

## PassengerId

It seems like <i>PassengerId</i> column only works as an id in this dataset without any significant effect on the dataset. Let's drop it.

In [None]:
DF.select!(train, DF.Not(:PassengerId))
DF.select!(test, DF.Not(:PassengerId))

## Creating dummy variables

You might be wondering: what is a dummy variable?

Creating dummy variables is an important **preprocessing step in machine learning**. Categorical variables are often important features, and they can be the difference between a good model and a great model. While working with a dataset, a meaningful value such as "male" or "female" is more intuitive for us than 0's and 1's. However, most algorithms do not accept categorical values such as gender as input. In order to feed data into a machine learning model, we need to convert each categorical feature into numeric indicator (dummy) columns.
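
As a minimal sketch of the idea, the snippet below turns one made-up categorical column into indicator vectors, one per category. The toy column and the `Embarked_...` names are hypothetical stand-ins for the kernel's features.

```julia
# Hypothetical toy column, standing in for a categorical feature such as Embarked.
embarked_toy = ["S", "C", "S", "Q"]

# One indicator (dummy) vector per distinct category.
dummies = Dict("Embarked_$(v)" => Int.(embarked_toy .== v) for v in unique(embarked_toy))

for name in sort(collect(keys(dummies)))
    println(name, " => ", dummies[name])
end
```

Each row has exactly one 1 across the three vectors, which is why this encoding is also called one-hot encoding.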

In [None]:
"""
    get_dummies(df, columns)

Create one-hot encoded (dummy) columns for each column in `columns`
and drop the original columns.
"""
function get_dummies(df, columns)
    result_df = copy(df)

    for col in columns
        unique_vals = unique(skipmissing(result_df[!, col]))

        # One Bool column per category, named "<column>_<value>"
        dummy_transforms = [col => DF.ByRow(isequal(val)) => Symbol(col, "_", val) for val in unique_vals]

        DF.transform!(result_df, dummy_transforms...)
        DF.select!(result_df, DF.Not(col))
    end

    return result_df
end

dummy_cols = [:title, :Pclass, :Cabin, :Embarked, :nLength_group, :family_group, :fare_group]
train = get_dummies(train, dummy_cols)
test = get_dummies(test, dummy_cols);

In [None]:
cols_to_drop = [:family_size, :Name, :Fare, :name_length]
DF.select!(train, DF.Not(cols_to_drop))
DF.select!(test, DF.Not(cols_to_drop))

## age

As I promised before, we are going to use Random forest regressor in this section to predict the missing age values. Let's do it

In [None]:
train[.!ismissing.(train.Age), :]

In [None]:
Pkg.add(["DecisionTree", "MLJDecisionTreeInterface"])
import MLJ
import MLJModels

In [None]:
age_train_data = train[.!ismissing.(train.Age), :]
y_train = DF.collect(age_train_data[:, :Age])

In [None]:
function completing_age(df::DF.DataFrame)
    #Prepare data
    age_train_data = df[.!ismissing.(df.Age), :]
    age_test_data = df[ismissing.(df.Age), :]

    features_to_exclude = [:Age]

    # Exclude "Survived" when it is present (train only)
    if ("Survived" in DF.names(df))
        push!(features_to_exclude, :Survived)
    end

    # Rows with a known age are used to fit the model
    X_train = DF.select(age_train_data, DF.Not(features_to_exclude))
    y_train = Float64.(age_train_data.Age)  # drop Missing from the element type

    # Rows with a missing age are the ones to predict
    X_test = DF.select(age_test_data, DF.Not(features_to_exclude))

    # Random forest regressor from DecisionTree.jl, loaded through MLJ
    ForestRegressor = MLJ.@load RandomForestRegressor pkg=DecisionTree
    model = ForestRegressor()
    mach = MLJ.machine(model, X_train, y_train)
    MLJ.fit!(mach)

    y_hat = MLJ.predict(mach, X_test)

    # Fill the missing values with the predictions
    df[ismissing.(df.Age), :Age] = y_hat
    # Narrow the column from Union{Missing, Float64} to Float64
    # (age_group_fun below only accepts Float64)
    df.Age = Float64.(df.Age)

    return df
end

In [None]:
completing_age(train)
completing_age(test)

Let's take a look at the histogram of the age column.

In [None]:
Makie.hist( train.Age, bins=100)

## age_group

We can create a new feature by grouping the "Age" column

In [None]:
 """
    This function creates a bin for age
    """
function age_group_fun(age::Float64)
    a = ""
    if age <= 1
        a = "infant"
    elseif age <= 4
        a = "toddler"
    elseif age <= 13
        a = "child"
    elseif age <= 18
        a = "teenager"
    elseif age <= 35
        a = "Young_Adult"
    elseif age <= 45
        a = "adult"
    elseif age <= 55
        a = "middle_aged"
    elseif age <= 65
        a = "senior_citizen"
    else
        a = "old"
    end
    return a
end

In [None]:
train[!, :age_group] = [age_group_fun(x) for x in train.Age]
test[!, :age_group] = [age_group_fun(x) for x in test.Age]

## Creating dummies for "age_group" feature.
train = get_dummies(train, [:age_group])
test = get_dummies(test, [:age_group])

<div class="alert alert-danger">
<h1>Need to paraphrase this section</h1>
<h2>Feature Selection</h2>
<h3>Feature selection is an important part of building machine learning models. There are many reasons why we use feature selection.</h3> 
<ul>
    <li>Simple models are easier to interpret. People who act on model results have a better understanding of the model.</li>
    <li>Shorter training times.</li>
    <li>Enhanced generalisation by reducing overfitting.</li>
    <li>Easier for software developers to implement in production.
        <ul>
            <li>As data scientists, we need to remember not to create models with too many variables, since that might overwhelm production engineers.</li>
        </ul>
    </li>
    <li>Reduced risk of data errors during model use.</li>
    <li>Reduced data redundancy.</li>
</ul>
</div>

# Part 6: Pre-Modeling Tasks

## 6a. Separating dependent and independent variables

<a id="dependent_independent"></a>

---

Before we apply any machine learning models, it is important to separate the dependent and independent variables. Our dependent variable, or target variable, is what we are trying to predict, and our independent variables are the features we use to predict it. We train a machine learning model by telling it which columns are the independent variables and which one is the dependent variable. To specify them, we need to separate them from each other, and the code below does just that.

P.S. In our test dataset, we do not have a dependent variable feature. We are to predict that using machine learning models.

In [None]:
# separating our independent and dependent variable
X = DF.select(train, DF.Not(:Survived))
y = train.Survived;

## 6b. Splitting the training data

<a id="split_training_data" ></a>

---

There are multiple ways of splitting data. They are...

- train_test_split.
- cross_validation.

We have separated dependent and independent features; we have separated train and test data. So, why do we still have to split our training data? If you are curious about that, I have the answer. For this competition, when we train the machine learning algorithms, we use only part of the training set, usually two-thirds of the train data. Once we have trained our algorithm on 2/3 of the train data, we test it on the remaining third. If the model performs well, we feed the competition test data to the algorithm to predict and submit to the competition. The code below splits the train data into 4 parts: **X_train**, **X_test**, **y_train**, **y_test**.

- **X_train** and **y_train** are first used to train the algorithm.
- Then, **X_test** is fed to the trained algorithm to predict **outcomes**.
- Once we get the **outcomes**, we compare them with **y_test**.

By comparing the **outcome** of the model with **y_test**, we can determine whether our algorithms are performing well or not. As we compare, we use a confusion matrix to determine different aspects of model performance.
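
As a rough sketch of that comparison, the snippet below counts the four cells of a binary confusion matrix by hand. The helper `confusion_counts` and the toy label vectors are hypothetical; they stand in for **y_test** and the model's predictions.

```julia
# Count the four cells of a binary confusion matrix (1 = survived, 0 = did not).
function confusion_counts(y_true::AbstractVector{<:Integer}, y_pred::AbstractVector{<:Integer})
    @assert length(y_true) == length(y_pred)
    tp = count((y_true .== 1) .& (y_pred .== 1))   # predicted survived, did survive
    tn = count((y_true .== 0) .& (y_pred .== 0))   # predicted not survived, did not
    fp = count((y_true .== 0) .& (y_pred .== 1))   # predicted survived, did not
    fn = count((y_true .== 1) .& (y_pred .== 0))   # predicted not survived, did
    return (tp = tp, tn = tn, fp = fp, fn = fn)
end

# Hypothetical held-out labels and model predictions.
y_test_toy = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_toy = [1, 0, 0, 1, 0, 1, 1, 0]

cm = confusion_counts(y_test_toy, y_pred_toy)
accuracy = (cm.tp + cm.tn) / length(y_test_toy)
println(cm, "  accuracy = ", accuracy)
```

Accuracy is only one summary of these four counts; precision and recall are built from the same cells.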

P.S. When we use cross validation it is important to remember not to use **X_train, X_test, y_train and y_test**, rather we will use **X and y**. I will discuss more on that.
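
To make the splitting idea concrete, here is a minimal sketch in plain Python of what a shuffled train/test split does. The function name `train_test_split_sketch`, the toy data, and the seed are assumptions made for illustration; this is not how MLJ implements `partition`, and unlike some splitters it does not stratify by class.

```python
import random

def train_test_split_sketch(X, y, train_fraction=0.67, seed=0):
    """Shuffle row indices, then cut them into a train part and a test part."""
    if len(X) != len(y):
        raise ValueError("X and y must have the same number of rows")
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)       # shuffle so the split does not depend on row order
    n_train = round(train_fraction * len(X))   # number of rows kept for training
    train_idx, test_idx = indices[:n_train], indices[n_train:]
    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test

# Toy data (made up): 9 rows with a single feature and a binary label.
X_toy = [[v] for v in range(9)]
y_toy = [0, 1, 0, 1, 0, 1, 0, 1, 0]
X_train, X_test, y_train, y_test = train_test_split_sketch(X_toy, y_toy)
print(len(X_train), len(X_test))
```

The key point is that features and labels are indexed with the same shuffled indices, so every row keeps its own label after the split.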

In [None]:
using Random
Random.seed!(0)

# Convert the target to a categorical type for better integration with the models
y = MLJ.coerce(y, MLJ.Multiclass)  
# Convert all features to continuous, e.g. (false, true) -> (0, 1)
X = MLJ.coerce(X, MLJ.Count => MLJ.Continuous, MLJ.OrderedFactor => MLJ.Continuous)  


(X_train, X_test), (y_train, y_test) = MLJ.partition((X, y), 0.67, shuffle=true, multi=true);

In [None]:
size(X_train)

In [None]:
size(X_test)

## 6c. Feature Scaling

<a id="feature_scaling" ></a>

---

Feature scaling is an important concept for machine learning models. Often a dataset contains features that vary widely in magnitude and unit. For some machine learning models this is not a problem. For many others it is: algorithms that use Euclidean distance to measure how far apart two points are will be dominated by the features with the largest magnitudes. Let's look again at a sample of the **train** dataset below.


In [None]:
sample = shuffle(1:DF.nrow(X_train))[1:5] 
X_train[sample, :]

Here **Age** and **Calculated_fare** are much higher in magnitude than the other features. This can create problems, as many machine learning models will behave as if **Age** and **Calculated_fare** carry more weight than the other features. Therefore, we need to do feature scaling to get a better result.
There are multiple ways to do feature scaling.

<ul>
    <li><b>MinMaxScaler</b>-Scales the data using the max and min values so that it fits between 0 and 1.</li>
    <li><b>StandardScaler</b>-Scales the data so that it has mean 0 and variance of 1.</li>
    <li><b>RobustScaler</b>-Scales the data similarly to StandardScaler, but uses the median and the interquartile range so as to avoid issues with large outliers.</li>
 </ul>
I will discuss more on that in a different kernel. For now we will use <b>Standard Scaler</b> to feature scale our dataset.
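
As a rough illustration of the three scalers listed above, here is a small sketch in plain Python. The function names and the toy `ages` list are made up for this example, the standard deviation used is the population one, and the quartiles are computed with the "inclusive" method; real library scalers may differ in those details.

```python
import statistics

def min_max_scale(values):
    """Map the smallest value to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standard_scale(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def robust_scale(values):
    """Subtract the median and divide by the interquartile range."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return [(v - median) / (q3 - q1) for v in values]

# Made-up ages; 80 plays the role of an outlier.
ages = [22.0, 38.0, 26.0, 35.0, 80.0]
print(min_max_scale(ages))
print(standard_scale(ages))
print(robust_scale(ages))
```

Notice how the outlier squeezes most min-max scaled values toward 0, while the robust version keeps the typical ages spread around 0.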

P.S. I am showing a sample of both before and after so that you can see how scaling changes the dataset.

<h3><font color="#5831bc" face="Comic Sans MS">Before Scaling</font></h3>

In [None]:
print(DF.names(X_train))
DF.first(X_train, 5)

In [None]:
Standardizer = MLJ.@load Standardizer pkg=MLJModels
standardizer = Standardizer()
mach_scaler = MLJ.machine(standardizer, X_train)

MLJ.fit!(mach_scaler)
X_train_scaled = MLJ.transform(mach_scaler, X_train)
X_test_scaled = MLJ.transform(mach_scaler, X_test);

<h3><font color="#5831bc" face="Comic Sans MS">After Scaling</font></h3>

You can see how the features have transformed above.

### NOTE: In this example, unlike in the original notebook, the categorical and boolean columns were not transformed, as this does not negatively affect the model

# Part 7: Modeling the Data

<a id="modelingthedata"></a>

---

In the previous versions of this kernel, I thought about explaining each model before applying it. However, this process makes this kernel too lengthy to sit and read at one go. Therefore I have decided to break this kernel down and explain each algorithm in a different kernel and add the links here. If you like to review logistic regression, please click [here](https://www.kaggle.com/masumrumi/logistic-regression-with-titanic-dataset).

In [None]:
Pkg.add("MLJLinearModels")

In [None]:
LogisticClassifier = MLJ.@load LogisticClassifier pkg=MLJLinearModels

logreg = LogisticClassifier(penalty=:l1)
mach_logreg = MLJ.machine(logreg, X_train_scaled, y_train)
MLJ.fit!(mach_logreg)

y_prob = MLJ.predict(mach_logreg, X_test_scaled);
# y_prob has the probabilistic predictions -> [0.13, 0.4, 0.75] etc.
# MLJ.mode.(y_prob) has the categorical predictions -> [0, 0, 1] etc.
y_pred = MLJ.mode.(y_prob)

<h1><font color="#5831bc" face="Comic Sans MS">Evaluating a classification model</font></h1>

There are multiple ways to evaluate a classification model.

- Confusion Matrix.
- ROC Curve
- AUC (area under the ROC curve).

## Confusion Matrix

A <b>confusion matrix</b> is a table that <b>describes the performance of a classification model</b>. It tells us how many cases our model predicted correctly and incorrectly for each outcome class by comparing actual and predicted cases. For this dataset, our model is a binary one: we are trying to classify whether a passenger survived or did not survive. We have fit the model using **X_train** and **y_train** and stored the predicted outcomes for **X_test** in the variable **y_pred**. So now we will use a confusion matrix to compare **y_test** and **y_pred**. Let's build the confusion matrix.


In [None]:
conf_matrix = MLJ.confusion_matrix(y_pred, y_test)

Our **y_test** has a total of 294 data points; it is the part of the original train set that we split off in order to evaluate our model. Each number here represents certain details about our model. MLJ puts the predictions in the rows and the ground truth in the columns, so if we think about this in terms of rows and columns, we can see that...

- the first row holds the data points that the model predicted as not-survived.
- the second row holds the data points that the model predicted as survived.
- In terms of columns, the first column holds the passengers who actually did not survive.
- and the second column holds the passengers who actually survived.

Now you can see that predicted not-survived and predicted survived overlap with actual survived and actual not-survived. We have some terminology to name these four cells more specifically. Let's see what they are.

<ul style="list-style-type:square;">
    <li><b>True Positive(TP)</b>: values that the model predicted as yes(survived) and is actually yes(survived).</li>
    <li><b>True Negative(TN)</b>: values that model predicted as no(not-survived) and is actually no(not-survived)</li>
    <li><b>False Positive(or Type I error)</b>: values that model predicted as yes(survived) but actually no(not-survived)</li>
    <li><b>False Negative(or Type II error)</b>: values that model predicted as no(not-survived) but actually yes(survived)</li>
</ul>

For this dataset, whenever the model predicts yes, it is predicting that the passenger survived, and whenever it predicts no, it is predicting that the passenger did not survive. Let's determine the values of all the terms above.

<ul style="list-style-type:square;">
    <li><b>True Positive(TP):85</b></li>
    <li><b>True Negative(TN):158</b></li>
    <li><b>False Positive(FP):28</b></li>
    <li><b>False Negative(FN):22</b></li>
</ul>
From these four counts, we can compute many other rates that are used to evaluate a binary classifier.

#### Accuracy:

**Accuracy is the measure of how often the model is correct.**

- (TP + TN)/total = (85+158)/293 = .829

We can also calculate the accuracy score using MLJ.

In [None]:
MLJ.accuracy(y_pred, y_test)

**Misclassification Rate:** Misclassification Rate is the measure of how often the model is wrong.

- Misclassification Rate and Accuracy are complements of each other.
- Misclassification Rate is equivalent to 1 minus Accuracy.
- Misclassification Rate is also known as "Error Rate".

> (FP + FN)/Total = (28+22)/293 = 0.17

**True Positive Rate/Recall/Sensitivity:** How often the model predicts yes(survived) when it's actually yes(survived)?

> TP/(TP+FN) = 85/(85+22) = 0.794392523364486

In [None]:
MLJ.recall(y_pred, y_test)

**False Positive Rate:** How often the model predicts yes(survived) when it's actually no(not-survived)?

> FP/(FP+TN) = 28/(28+158) = 0.1505376344

**True Negative Rate/Specificity:** How often the model predicts no(not-survived) when it's actually no(not-survived)?

- True Negative Rate is equivalent to 1 minus False Positive Rate.

> TN/(TN+FP) = 158/(158+28) = 0.84946236559

**Precision:** How often is the model correct when it predicts yes(survived)?

> TP/(TP+FP) = 85/(85+28) = 0.75221238938

In [None]:
MLJ.ppv(y_pred, y_test)  # aka precision
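
The rates above can also be recomputed by hand from the four counts. The short Python sketch below simply redoes that arithmetic with the TP, TN, FP and FN values quoted earlier; the variable names are my own.

```python
# Counts taken from the confusion matrix discussed above.
tp, tn, fp, fn = 85, 158, 28, 22
total = tp + tn + fp + fn

accuracy = (tp + tn) / total                 # how often the model is correct
misclassification_rate = (fp + fn) / total   # how often the model is wrong
recall = tp / (tp + fn)                      # true positive rate / sensitivity
false_positive_rate = fp / (fp + tn)
specificity = tn / (tn + fp)                 # true negative rate
precision = tp / (tp + fp)

for name, value in [
    ("accuracy", accuracy),
    ("misclassification rate", misclassification_rate),
    ("recall", recall),
    ("false positive rate", false_positive_rate),
    ("specificity", specificity),
    ("precision", precision),
]:
    print(f"{name}: {value:.4f}")
```

Accuracy and misclassification rate add up to 1, and so do specificity and false positive rate, which is a handy sanity check.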

In [None]:
function simple_classification_report(y_true, y_pred)
    println("Classification Report:")
    println("======================")
    
    # Main metrics
    println("Accuracy: ", round(MLJ.accuracy(y_pred, y_true), digits=4))
    println("Balanced Accuracy: ", round(MLJ.balanced_accuracy(y_pred, y_true), digits=4))
    println("F1 Score: ", round(MLJ.f1score(y_pred, y_true), digits=4))
    println("  Precision: ", round(MLJ.positive_predictive_value(y_pred, y_true), digits=4))
    println("  Recall:    ", round(MLJ.true_positive_rate(y_pred, y_true), digits=4))
end

# Use the function
simple_classification_report(y_test, y_pred)

We have our confusion matrix. How about we give it a little more character?

In [None]:
fig = Makie.Figure()
ax = Makie.Axis(fig[1, 1], 
    title = "Confusion Matrix",  
    xticks=((1:2),["Actual 0", "Actual 1"]), 
    yticks=((1:2), ["Predicted 0", "Predicted 1"]),
)

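# Makie maps the first matrix dimension to x, so transpose to put actual on x and predicted on y (matching the cell labels below)
hm = Makie.heatmap!(ax, (1:2), (1:2), permutedims(conf_matrix.mat), colormap="Blues")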

for i in 1:2
    for j in 1:2
        text_val = conf_matrix.mat[j, i]
        
        Makie.text!(ax, i, j, 
            text = string(round(text_val, digits=2)),
            color = abs(text_val) > 100 ? :white : :black,
            align = (:center, :center)
        )
    end
end

Makie.Colorbar(fig[1, 2], hm, 
    label = "Counts",
)

fig

<h1>AUC & ROC Curve</h1>

In [None]:
fig = Makie.Figure()
ax = Makie.Axis(fig[1, 1],
    title = "ROC Curve",
    xlabel = "False Positive Rate",
    ylabel = "True Positive Rate",
   
)
fpr, tpr, _ = MLJ.roc_curve(y_prob, y_test)

Makie.lines!(ax, fpr, tpr,
    label = "ROC Curve",
    linewidth=4
)
Makie.ablines!(ax, 0, 1,
    color = :black,
    linestyle=:dash,
)
fig

AUC:

In [None]:
roc_auc = MLJ.auc(y_prob, y_test)

## Using Cross-validation:

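Pros:

- Helps reduce the variance of the performance estimate, since every observation is used for both training and validation.
- Gives a better picture of how well the model generalizes to unseen data.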
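
To show what "folds" are, here is a minimal sketch of plain k-fold index splitting in Python. The function name `k_fold_indices` is made up for this example, and the sketch is deliberately simpler than the stratified, shuffled splitter used in the cell below: it neither shuffles nor balances the classes.

```python
def k_fold_indices(n_rows, n_folds):
    """Yield (train_indices, validation_indices) pairs, one pair per fold."""
    # Spread the remainder over the first folds so fold sizes differ by at most 1.
    fold_sizes = [n_rows // n_folds + (1 if i < n_rows % n_folds else 0)
                  for i in range(n_folds)]
    start = 0
    for size in fold_sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_rows))
        yield train, validation
        start += size

folds = list(k_fold_indices(n_rows=10, n_folds=3))
for train, validation in folds:
    print("train:", train, "validation:", validation)
```

Each row lands in the validation part exactly once, so the model is scored on every observation while never being scored on a row it was trained on in that fold.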

In [None]:
mach_standarizer = MLJ.machine(standardizer, X)
MLJ.fit!(mach_standarizer)
X = MLJ.transform(mach_standarizer, X)

In [None]:
cv = MLJ.StratifiedCV(nfolds=10, shuffle=true, rng=0)

logreg = LogisticClassifier(penalty=:l2)
# X was already standardized in the previous cell
mach_logreg = MLJ.machine(logreg, X, y)
MLJ.fit!(mach_logreg)

evaluation = MLJ.evaluate!(mach_logreg, resampling=cv, verbosity=0, measure=[MLJ.Accuracy()]);

println("Cross-Validation accuracy scores: ", evaluation.per_fold)
println("Mean Cross-Validation accuracy score: ", MLJ.mean(evaluation.per_fold[1]))

## Grid Search on Logistic Regression

- What is grid search?
- What are the pros and cons?

**Grid search** is a simple but effective technique in machine learning. The name comes from the fact that we are searching for the optimal parameter or parameters over a "grid" of candidate values. The parameters being searched are known as **hyperparameters**. **Hyperparameters are model parameters that are set before fitting the model and determine the behavior of the model.** For example, when we choose to use linear regression, we may decide to add a penalty to the loss function, such as Ridge or Lasso. These penalties require a specific alpha (the strength of the regularization) to be set beforehand; in the code below the same role is played by `lambda`. The higher the value of alpha, the stronger the penalty. Grid search finds the best value of alpha among a range of values that we provide, and we then use that value to fit the model. It is essential to understand that hyperparameters are different from model outcomes: **coefficients**, or evaluation metrics such as **accuracy score** or **mean squared error**, are model outcomes, not hyperparameters.
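
Stripped of any library, grid search is just a loop over every combination of candidate values. The Python sketch below shows that loop; `grid_search`, `toy_score`, and the candidate values are hypothetical, and `toy_score` is only a stand-in for a real cross-validated accuracy.

```python
from itertools import product

def grid_search(score_fn, param_grid):
    """Try every combination in param_grid and return the best one with its score."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for combo in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, combo))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Hypothetical stand-in for a cross-validated accuracy. A real score would come
# from fitting and evaluating a model for each combination.
def toy_score(lam, penalty):
    bonus = 0.01 if penalty == "l2" else 0.0
    return 0.8 - (lam - 0.05) ** 2 + bonus

best_params, best_score = grid_search(
    toy_score, {"lam": [0.01, 0.05, 0.1], "penalty": ["l1", "l2"]}
)
print(best_params, best_score)
```

The cost grows with the product of the grid sizes (here 3 x 2 = 6 evaluations), which is the main drawback of an exhaustive grid.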

#### This part of the kernel is a work in progress. Please check back again for future updates. ####

In [None]:
Random.seed!(30)

logreg = LogisticClassifier()
lambda_vals = exp10.(range(log10(0.01), stop=log10(0.1), length=50))
penalties = [:l1, :l2]

ranges = [
    range(logreg, :lambda, values = lambda_vals),
    range(logreg, :penalty, values = penalties)
]


cv = MLJ.StratifiedCV(nfolds=10, shuffle=true, rng=123)

tuned_logreg = MLJ.TunedModel(
    model=logreg,
    tuning=MLJ.Grid(resolution=1),
    resampling=cv,
    range=ranges,
    measure=MLJ.Accuracy(),  
    acceleration=MLJ.CPUThreads(), 
    acceleration_resampling=MLJ.CPUThreads()
)

# Create and train the machine
mach = MLJ.machine(tuned_logreg, X, y)
MLJ.fit!(mach)

In [None]:
best_model = MLJ.fitted_params(mach).best_model
println("Best parameters: ")
println("  lambda = ", best_model.lambda)
println("  penalty = ", best_model.penalty)

MLJ.report(mach).best_history_entry

#### Using the best parameters from the grid-search.

In [None]:
#MLJ.predict(mach, X_test) # the mach already has the best model

#### This concludes the Julia notebook. As you can see, the interface and structure for building models are very similar to Python. The notebook now continues with the original Python examples. Hope you enjoy reading the notebook 😊

Resources:

- [Confusion Matrix](https://www.youtube.com/watch?v=8Oog7TXHvFY)

### Under-fitting & Over-fitting:

So, we have our first model and its score. But how do we make sure that our model is performing well? Our model may be overfitting or underfitting. For those of you who don't know what overfitting and underfitting are, let's find out.

![](https://cdncontribute.geeksforgeeks.org/wp-content/uploads/fittings.jpg)

As you see in the chart above, **underfitting** is when the model fails to capture important aspects of the data and therefore introduces more bias and performs poorly. **Overfitting**, on the other hand, is when the model performs very well on the training data but does poorly on the validation or test sets. This situation is also known as having low bias but high variance, and it leads to poor performance as well. Ideally, we want a model that performs well not only on the training data but also on the test data. This is where the **bias-variance tradeoff** comes in. When we have a model that overfits, meaning low bias and high variance, we introduce some bias in exchange for much less variance. One particular tactic for this is regularization (Ridge, Lasso, Elastic Net). These models are built to deal with the bias-variance tradeoff. This [kernel](https://www.kaggle.com/dansbecker/underfitting-and-overfitting) explains this topic well. Also, the following chart gives us a mental picture of where we want our models to be.
![](http://scott.fortmann-roe.com/docs/docs/BiasVariance/biasvariance.png)

Ideally, we want to pick a sweet spot where the model performs well on the training set, validation set, and test set. As the model gets more complex, bias decreases and variance increases. However, the most critical part is the total error rate. We want our models to be at the bottom of that **U** shape, where the error rate is lowest. That sweet spot is also known as **Optimum Model Complexity (OMC).**

Now that we know what we want in terms of under-fitting and over-fitting, let's talk about how to combat them.

How to combat over-fitting?

<ul>
    <li>Simplify the model by using fewer parameters.</li>
    <li>Simplify the model by changing the hyperparameters.</li>
    <li>Introduce regularization.</li>
    <li>Use more training data.</li>
    <li>Gather more data (and gather better quality data).</li>
    </ul>

#### This part of the kernel is a work in progress. Please check back again for future updates. ####

## 7b. K-Nearest Neighbor classifier(KNN)

<a id="knn"></a>

---

In [None]:
## Importing the model and the cross-validation helpers.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score
## calling on the model object.
knn = KNeighborsClassifier(metric='minkowski', p=2)
## the Minkowski metric with p=2 is the Euclidean distance


## doing 10 rounds of stratified shuffle-split cross validation
cv = StratifiedShuffleSplit(n_splits=10, test_size=.25, random_state=2)

accuracies = cross_val_score(knn, X,y, cv = cv, scoring='accuracy')
print ("Cross-Validation accuracy scores:{}".format(accuracies))
print ("Mean Cross-Validation accuracy score: {}".format(round(accuracies.mean(),3)))
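
For readers who want to see what happens inside a KNN classifier, here is a bare-bones sketch in plain Python: compute the distance from the query to every training point, keep the k closest, and take a majority vote. The function names and toy points are made up for illustration, and this brute-force version is not how scikit-learn implements it.

```python
from collections import Counter

def minkowski_distance(a, b, p=2):
    """Minkowski distance; p=2 gives the Euclidean distance, p=1 the Manhattan distance."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def knn_predict(train_points, train_labels, query, k=3, p=2):
    """Label the query with the majority label among its k nearest training points."""
    nearest = sorted(
        zip(train_points, train_labels),
        key=lambda pair: minkowski_distance(pair[0], query, p),
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy data (made up): two well-separated clusters.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]
labels = [0, 0, 0, 1, 1, 1]
print(knn_predict(points, labels, (0.5, 0.5)))
print(knn_predict(points, labels, (5.5, 5.5)))
```

Because everything hinges on distances, this is one of the models where the feature scaling discussed earlier matters the most.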

#### Manually find the best possible k value for KNN

In [None]:
## Search for an optimal value of k for KNN.
import numpy as np
k_range = range(1,31)
k_scores = []
for k in k_range:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X,y, cv = cv, scoring = 'accuracy')
    k_scores.append(scores.mean())
print("Accuracy scores are: {}\n".format(k_scores))
print ("Mean accuracy score: {}".format(np.mean(k_scores)))

In [None]:
from matplotlib import pyplot as plt
plt.plot(k_range, k_scores)

### Grid search on KNN classifier

In [None]:
from sklearn.model_selection import GridSearchCV
## trying out multiple values for k
k_range = range(1,31)
##
weights_options=['uniform','distance']
#
param = {'n_neighbors':k_range, 'weights':weights_options}
## Using StratifiedShuffleSplit.
cv = StratifiedShuffleSplit(n_splits=10, test_size=.30, random_state=15)
# estimator = knn, param_grid = param, n_jobs = -1 to instruct scikit learn to use all available processors.
grid = GridSearchCV(KNeighborsClassifier(), param,cv=cv,verbose = False, n_jobs=-1)
## Fitting the model.
grid.fit(X,y)

In [None]:
print(grid.best_score_)
print(grid.best_params_)
print(grid.best_estimator_)

#### Using best estimator from grid search using KNN.

In [None]:
### Using the best parameters from the grid-search.
knn_grid= grid.best_estimator_
knn_grid.score(X,y)

#### Using RandomizedSearchCV

Randomized search is a close cousin of grid search. It doesn't always provide the best result, but it's fast.

In [None]:
from sklearn.model_selection import RandomizedSearchCV
## trying out multiple values for k
k_range = range(1,31)
##
weights_options=['uniform','distance']
#
param = {'n_neighbors':k_range, 'weights':weights_options}
## Using StratifiedShuffleSplit.
cv = StratifiedShuffleSplit(n_splits=10, test_size=.30)
# estimator = knn, param_grid = param, n_jobs = -1 to instruct scikit learn to use all available processors.
## for RandomizedSearchCV, n_iter sets how many parameter combinations are sampled.
grid = RandomizedSearchCV(KNeighborsClassifier(), param,cv=cv,verbose = False, n_jobs=-1, n_iter=40)
## Fitting the model.
grid.fit(X,y)

In [None]:
print (grid.best_score_)
print (grid.best_params_)
print(grid.best_estimator_)

In [None]:
### Using the best parameters from the grid-search.
knn_ran_grid = grid.best_estimator_
knn_ran_grid.score(X,y)

## Gaussian Naive Bayes

<a id="gaussian_naive"></a>

---

In [None]:
# Gaussian Naive Bayes
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

gaussian = GaussianNB()
# fit on the training split only, so that X_test stays unseen until evaluation
gaussian.fit(X_train, y_train)
y_pred = gaussian.predict(X_test)
gaussian_accy = round(accuracy_score(y_test, y_pred), 3)
print(gaussian_accy)

## Support Vector Machines(SVM)

<a id="svm"></a>

---

In [None]:
from sklearn.svm import SVC
Cs = [0.001, 0.01, 0.1, 1,1.5,2,2.5,3,4,5, 10] ## penalty parameter C for the error term.
gammas = [0.0001,0.001, 0.01, 0.1, 1]
param_grid = {'C': Cs, 'gamma' : gammas}
cv = StratifiedShuffleSplit(n_splits=10, test_size=.30, random_state=15)
grid_search = GridSearchCV(SVC(kernel = 'rbf', probability=True), param_grid, cv=cv) ## 'rbf' stands for gaussian kernel
grid_search.fit(X,y)

In [None]:
print(grid_search.best_score_)
print(grid_search.best_params_)
print(grid_search.best_estimator_)

In [None]:
# using the best found hyperparameters to get the score.
svm_grid = grid_search.best_estimator_
svm_grid.score(X,y)

## Decision Tree Classifier

A decision tree works by breaking down the dataset into smaller and smaller subsets. This is done by asking questions about the features of the dataset. The idea is to unmix the labels while asking as few questions as necessary. With each question, we break the dataset down into more subsets. Once we have a subgroup that contains only one type of label, we end the tree at that node. If you would like to get a detailed understanding of the decision tree classifier, please take a look at [this](https://www.kaggle.com/masumrumi/decision-tree-with-titanic-dataset) kernel.
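
One way to measure how "mixed" the labels are is the Gini impurity, which is one of the two criteria tried in the grid search below. The sketch that follows scores a single question of the form "is the value at most some threshold?" and picks the threshold that unmixes the labels best. The function names and the toy `fares` and `survived` lists are made up for illustration.

```python
from collections import Counter

def gini_impurity(labels):
    """0.0 for a pure group; 0.5 for a 50/50 mix of two classes."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def split_impurity(values, labels, threshold):
    """Size-weighted Gini impurity after asking 'is value <= threshold?'."""
    left = [label for v, label in zip(values, labels) if v <= threshold]
    right = [label for v, label in zip(values, labels) if v > threshold]
    n = len(labels)
    return len(left) / n * gini_impurity(left) + len(right) / n * gini_impurity(right)

def best_threshold(values, labels):
    """Try every observed value as a threshold and keep the purest split."""
    return min(sorted(set(values)), key=lambda t: split_impurity(values, labels, t))

# Toy data (made up): low fares did not survive, high fares did.
fares = [7.0, 8.0, 9.0, 30.0, 50.0, 80.0]
survived = [0, 0, 0, 1, 1, 1]
print(best_threshold(fares, survived))
```

A full tree repeats this search over every feature and then recurses into each subgroup until the groups are pure or a depth limit is reached.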

In [None]:
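from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold
max_depth = range(1,30)
## 'sqrt' replaces 'auto', which recent scikit-learn versions no longer accept
max_feature = [21,22,23,24,25,26,28,29,30,'sqrt']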
criterion=["entropy", "gini"]

param = {'max_depth':max_depth,
         'max_features':max_feature,
         'criterion': criterion}
grid = GridSearchCV(DecisionTreeClassifier(),
                                param_grid = param,
                                 verbose=False,
                                 cv=StratifiedKFold(n_splits=20, random_state=15, shuffle=True),
                                n_jobs = -1)
grid.fit(X, y)

In [None]:
print( grid.best_params_)
print (grid.best_score_)
print (grid.best_estimator_)

In [None]:
dectree_grid = grid.best_estimator_
## using the best found hyperparameters to get the score.
dectree_grid.score(X,y)

 <h4> Let's look at the feature importance from decision tree grid.</h4>

In [None]:
## feature importance
feature_importances = pd.DataFrame(dectree_grid.feature_importances_,
                                   index = column_names,
                                    columns=['importance'])
feature_importances.sort_values(by='importance', ascending=False).head(10)

These are the top 10 features that, according to the **Decision Tree**, helped classify the fates of the passengers on the Titanic that night.

## 7f. Random Forest Classifier

<a id="random_forest"></a>

I admire working with decision trees because of the foundation they provide for building a more complex model like Random Forest (RF). RF is an ensemble method (a combination of many decision trees), which is where the "forest" part comes in. One crucial detail about Random Forest is that, while building a forest of decision trees, the RF model <b>takes random subsets of the original dataset (bootstrapped)</b> and <b>random subsets of the variables (features/columns)</b>. Using this method, the RF model creates hundreds to thousands (the number can be set manually) of varied decision trees. This variety makes the RF model more effective and accurate. We then run each test data point through all of these decision trees and take a vote on the output.
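
Two of the ingredients described above are easy to sketch in plain Python: picking a random subset of feature columns, and taking a vote over the trees' outputs. The function names, the seed, and the toy tree predictions below are made up for illustration; a real forest would also train an actual tree on each bootstrapped sample.

```python
import random
from collections import Counter

def random_feature_subset(n_features, subset_size, rng):
    """Pick subset_size distinct column indices at random."""
    return sorted(rng.sample(range(n_features), subset_size))

def majority_vote(tree_predictions):
    """Return the label predicted by the largest number of trees."""
    return Counter(tree_predictions).most_common(1)[0][0]

rng = random.Random(0)
columns = random_feature_subset(n_features=10, subset_size=3, rng=rng)
print("columns considered:", columns)

# Hypothetical outputs of five trees for a single passenger (1 = survived).
forest_output = majority_vote([1, 0, 1, 1, 0])
print("forest prediction:", forest_output)
```

Because each tree sees different rows and different columns, their individual mistakes tend to differ, and the vote cancels many of them out.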

In [None]:
from sklearn.model_selection import GridSearchCV, StratifiedKFold, StratifiedShuffleSplit
from sklearn.ensemble import RandomForestClassifier
n_estimators = [140,145,150,155,160];
max_depth = range(1,10);
criterions = ['gini', 'entropy'];
cv = StratifiedShuffleSplit(n_splits=10, test_size=.30, random_state=15)


parameters = {'n_estimators':n_estimators,
              'max_depth':max_depth,
              'criterion': criterions

        }
grid = GridSearchCV(estimator=RandomForestClassifier(max_features='sqrt'), ## 'sqrt' replaces 'auto', which recent scikit-learn versions no longer accept
                                 param_grid=parameters,
                                 cv=cv,
                                 n_jobs = -1)
grid.fit(X,y)

In [None]:
print (grid.best_score_)
print (grid.best_params_)
print (grid.best_estimator_)

In [None]:
rf_grid = grid.best_estimator_
rf_grid.score(X,y)

In [None]:
from sklearn.metrics import classification_report
# Predict on the held-out split with the tuned random forest, then print the report
y_pred = rf_grid.predict(X_test)
print(classification_report(y_test, y_pred, labels=rf_grid.classes_))

## Feature Importance

In [None]:
## feature importance
feature_importances = pd.DataFrame(rf_grid.feature_importances_,
                                   index = column_names,
                                    columns=['importance'])
feature_importances.sort_values(by='importance', ascending=False).head(10)

<h3>Why Random Forest?(Pros and Cons)</h3>

---

<h2>Introducing Ensemble Learning</h2>
In statistics and machine learning, ensemble methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone.

There are two types of ensemble learning.

**Bagging/Averaging Methods**

> In averaging methods, the driving principle is to build several estimators independently and then to average their predictions. On average, the combined estimator is usually better than any of the single base estimator because its variance is reduced.

**Boosting Methods**

> The other family of ensemble methods are boosting methods, where base estimators are built sequentially and one tries to reduce the bias of the combined estimator. The motivation is to combine several weak models to produce a powerful ensemble.

<h4 align="right">Source:GA</h4>

Resource: <a href="https://towardsdatascience.com/ensemble-methods-bagging-boosting-and-stacking-c9214a10a205">Ensemble methods: bagging, boosting and stacking</a>

---

## 7g. Bagging Classifier

<a id="bagging"></a>

---

<a href="https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.BaggingClassifier.html">Bagging Classifier</a>(Bootstrap Aggregating) is the ensemble method that involves manipulating the training set by resampling and running algorithms on it. Let's do a quick review:

- A bagging classifier uses bootstrapping to create multiple datasets from one original dataset and runs an algorithm on each one of them. Here is an image to show how a bootstrapped dataset works.
<img src="https://uc-r.github.io/public/images/analytics/bootstrap/bootstrap.png" width="600">
<h4 align="center">Resampling from original dataset to bootstrapped datasets</h4>
<h4 align="right">Source: https://uc-r.github.io</h4>

- After running a learning algorithm on each one of the bootstrapped datasets, all models are combined by averaging their predictions (or taking a vote). The test data or new data then goes through this combined classifier to predict the output.

Here is an image to make it clear how bagging works:
<img src="https://prachimjoshi.files.wordpress.com/2015/07/screen_shot_2010-12-03_at_5-46-21_pm.png" width="600">

<h4 align="right">Source: https://prachimjoshi.files.wordpress.com</h4>
Please check out [this](https://www.kaggle.com/masumrumi/bagging-with-titanic-dataset) kernel if you want to find out more about bagging classifier.
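
Bootstrapping itself is simple enough to sketch in a few lines of plain Python: draw as many rows as the original dataset has, with replacement. The function name `bootstrap_sample`, the seed, and the toy dataset of 1000 row ids are assumptions made for this example.

```python
import random

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, so some rows repeat and some are left out."""
    return [rng.choice(rows) for _ in rows]

rng = random.Random(42)
original = list(range(1000))          # stand-in for 1000 row ids
sample = bootstrap_sample(original, rng)
unique_fraction = len(set(sample)) / len(original)
print(len(sample), round(unique_fraction, 3))
```

On average about 63% of the original rows (1 - 1/e) show up in a given bootstrapped dataset; the rest are "out of bag" for that base estimator.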

In [None]:
from sklearn.ensemble import BaggingClassifier
n_estimators = [10,30,50,70,80,150,160, 170,175,180,185];
cv = StratifiedShuffleSplit(n_splits=10, test_size=.30, random_state=15)

parameters = {'n_estimators':n_estimators,

        }
grid = GridSearchCV(BaggingClassifier(bootstrap_features=False), ## the default base estimator is a decision tree
                                 param_grid=parameters,
                                 cv=cv,
                                 n_jobs = -1)
grid.fit(X,y)

In [None]:
print (grid.best_score_)
print (grid.best_params_)
print (grid.best_estimator_)

In [None]:
bagging_grid = grid.best_estimator_
bagging_grid.score(X,y)

<h3>Why use Bagging? (Pros and cons)</h3>
Bagging works best with strong and complex models (for example, fully developed decision trees). However, don't let that fool you into thinking that, like a single decision tree, bagging also overfits. Instead, bagging reduces overfitting, because each base estimator is trained on a different bootstrapped sample and their predictions are averaged. As a result, bagging is not very susceptible to noisy data and therefore reduces variance. However, the downside is that this can lead to an increase in bias.

<h4>Random Forest VS. Bagging Classifier</h4>

If some of you are like me, you may find Random Forest to be similar to Bagging Classifier. However, there is a fundamental difference between the two, which is **Random Forest's ability to pick a random subset of features at each node.** I will elaborate on this in a future update.

## 7h. AdaBoost Classifier

<a id="AdaBoost"></a>

---

AdaBoost is another <b>ensemble model</b> and is quite different than Bagging. Let's point out the core concepts.

> AdaBoost combines a lot of "weak learners" (often stumps: trees with only one node and two leaves) to make classifications.

> This base model fitting is an iterative process where each stump is chained one after the other; <b>it cannot run in parallel.</b>

> <b>Some stumps get more say in the final classification than others.</b> The model uses weights that are assigned to each data point (row), indicating its "importance." Samples with higher weights have a higher influence on the total error of the next model and get more priority. The first stump starts with uniformly distributed weights, which means that, in the beginning, every data point has an equal weight.

> <b>Each stump is made by taking the previous stump's mistakes into account.</b> After each iteration, the weights get recalculated in order to take the errors/misclassifications from the last stump into consideration.

> The final prediction is typically constructed by a weighted vote, where the weight for each base model depends on its training error or misclassification rate.

To illustrate what we have talked about so far let's look at the following visualization.

<img src="https://cdn-images-1.medium.com/max/1600/0*paPv7vXuq4eBHZY7.png">
<h5 align="right"> Source: Diogo(Medium)</h5>

Let's dive into each one of the nitty-gritty stuff about AdaBoost:

---

> <b>First</b>, we determine the best feature to split the dataset using Gini index(basics from decision tree). The feature with the lowest Gini index becomes the first stump in the AdaBoost stump chain(the lower the Gini index is, the better unmixed the label is, therefore, better split).

---

> <b>Secondly</b>, we need to determine how much say a stump will have in the final classification and how we can calculate that.

- We learn how much say a stump has in the final classification by calculating how well it classified the samples (that is, by calculating its weighted total error).
- The <b>Total Error</b> for a stump is the sum of the weights associated with the incorrectly classified samples. For example, let's say we start a stump with 10 data points. The first stump distributes the weight uniformly among all the data points, which means each data point has a weight of 1/10. Let's say that once the weights are distributed we run the model and find 2 incorrect predictions. To calculate the total error, we add up the weights of the misclassified samples. Here we get 1/10 + 1/10 = 2/10 or 1/5. This is our total error. With uniform weights, we can also think about it as the misclassification rate:

$$ \epsilon_t = \frac{\text{misclassifications}_t}{\text{observations}_t} $$

- Since the weights always add up to 1 across all data points, the total error will always be between 0 (perfect stump) and 1 (horrible stump).
- We use the total error to determine the amount of say a stump has in the final classification using the following formula

$$ \alpha_t = \frac{1}{2}\ln \left(\frac{1-\epsilon_t}{\epsilon_t}\right) \text{ where } 0 < \epsilon_t < 1$$

Where $\epsilon_t$ is the total error (misclassification rate) of the current classifier, as defined above.

Here...

- $\alpha_t$ = Amount of Say
- $\epsilon_t$ = Total error

We can draw a graph to determine the amount of say using the value of total error(0 to 1)

<img src="http://chrisjmccormick.files.wordpress.com/2013/12/adaboost_alphacurve.png">
<h5 align="right"> Source: Chris McCormick</h5>

- The blue line tells us the amount of say for <b>Total Error(Error rate)</b> between 0 and 1.
- When the stump does a reasonably good job, and the <b>total error</b> is minimal, then the <b>amount of say(Alpha)</b> is relatively large, and the alpha value is positive.
- When the stump does an average job(similar to a coin flip/the ratio of getting correct and incorrect ~50%/50%), then the <b>total error</b> is ~0.5. In this case the <b>amount of say</b> is <b>0</b>.
- When the error rate is high let's say close to 1, then the <b>amount of say</b> will be negative, which means if the stump outputs a value as "survived" the included weight will turn that value into "not survived."

P.S. If the <b>Total Error</b> is exactly 1 or 0, this equation will freak out (it either divides by zero or takes the log of zero). A small amount of error is added to prevent this from happening.
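
One way to picture that safeguard is to clip the error away from 0 and 1 before applying the formula. This is only a sketch under that assumption: the constant `EPS` and the helper `safe_amount_of_say` are hypothetical, and a real library may guard against this case differently.

```python
import math

EPS = 1e-10  # assumed small constant; the exact value is a free choice

def safe_amount_of_say(error, eps=EPS):
    """Clip the error into (0, 1), then apply alpha = 0.5 * ln((1 - e) / e)."""
    error = min(max(error, eps), 1 - eps)
    return 0.5 * math.log((1 - error) / error)

# Small errors give a large positive say, 0.5 gives zero, large errors go negative.
for error in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(error, round(safe_amount_of_say(error), 3))
```

The printed values follow the shape of the curve above: finite at both ends, zero at 0.5, and mirrored in sign around it.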

---

> <b>Third</b>, we need to learn how to modify the weights so that the next stump takes the errors of the current stump into account. The pseudocode for calculating the new sample weight is as follows.

$$ \text{New Sample Weight} = \text{Sample Weight} \times e^{\pm\alpha_t} $$

Here the exponent is $+\alpha_t$ (Amount of Say) when the sample was misclassified by the current stump and $-\alpha_t$ when it was classified correctly. We want to increase the sample weight of the misclassified samples, hinting the next stump to put more emphasis on those. Conversely, we want to decrease the sample weight of the correctly classified samples, hinting the next stump to put less emphasis on those.

The following equation helps us do this calculation.

$$ D_{t+1}(i) = D_t(i) e^{-\alpha_t y_i h_t(x_i)} $$

Here,

- $D_{t+1}(i)$ = New Sample Weight.
- $D_t(i)$ = Current Sample weight.
- $\alpha_t$ = Amount of Say (the alpha value); this coefficient is recomputed in each iteration.
- $y_i h_t(x_i)$ = 1 if the stump classified sample $i$ correctly, -1 if it misclassified it.
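
To make the update rule above concrete, here is a minimal NumPy sketch that continues the 10-point example with 2 misclassified samples. The function `update_weights` is a hypothetical helper. The final renormalization step is not shown in the formula above; it is added here as the usual way to keep the weights summing to 1.

```python
import numpy as np

def update_weights(weights, alpha, y_true, y_pred):
    """Apply D_{t+1}(i) = D_t(i) * exp(-alpha * y_i * h_t(x_i)), then renormalize."""
    weights = np.asarray(weights, dtype=float)
    agreement = np.asarray(y_true) * np.asarray(y_pred)  # +1 if correct, -1 if wrong
    new_weights = weights * np.exp(-alpha * agreement)
    return new_weights / new_weights.sum()  # assumption: rescale so weights sum to 1

y_true = np.array([1, 1, 1, 1, 1, -1, -1, -1, -1, -1])
y_pred = np.array([-1, 1, 1, 1, 1, 1, -1, -1, -1, -1])  # samples 0 and 5 are wrong
weights = np.full(10, 0.1)                               # uniform starting weights
alpha = 0.5 * np.log((1 - 0.2) / 0.2)                    # amount of say for error 0.2

new_weights = update_weights(weights, alpha, y_true, y_pred)
print(new_weights.round(4))
```

The two misclassified samples end up with weight 0.25 each and the eight correct ones with 0.0625 each, so the next stump sees half of the total weight on the previous mistakes.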

Finally, we put together the combined classifier, which is

$$ AdaBoost(X) = \text{sign}\left(\sum_{t=1}^T\alpha_t h_t(X)\right) $$

Here,

$AdaBoost(X)$ is the predicted class for $y$ using predictor matrix $X$

$T$ is the number of "weak learners"

$\alpha_t$ is the contribution weight for weak learner $t$

$h_t(X)$ is the prediction of weak learner $t$

and $y$ is binary **with values -1 and 1**
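
The combined classifier can be sketched in a few lines of NumPy. The alphas and stump predictions below are made-up numbers for illustration, and `adaboost_predict` is a hypothetical helper, not a scikit-learn function.

```python
import numpy as np

def adaboost_predict(alphas, stump_predictions):
    """Return sign(sum_t alpha_t * h_t(X)) for predictions coded as -1 / +1."""
    alphas = np.asarray(alphas, dtype=float)
    stump_predictions = np.asarray(stump_predictions, dtype=float)  # shape (T, n_samples)
    weighted_votes = alphas @ stump_predictions
    # Ties (a sum of exactly 0) are sent to +1 here, which is an arbitrary choice.
    return np.where(weighted_votes >= 0, 1, -1)

# Three hypothetical stumps voting on four passengers.
alphas = [0.9, 0.4, 0.3]
stump_predictions = [
    [ 1, -1,  1, -1],
    [-1, -1,  1,  1],
    [-1,  1,  1,  1],
]
print(adaboost_predict(alphas, stump_predictions))
```

For the first and last passengers, the stump with the largest amount of say outvotes the two weaker stumps that disagree with it, which is exactly what the weighting is for.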

P.S. Since a single stump barely captures the essential structure of the dataset, the model is highly biased in the beginning. However, as the chain of stumps grows, AdaBoost combines them into a strong learner and can reduce both bias and variance.

<h3>Resources:</h3>
<ul>
    <li><a href="https://www.youtube.com/watch?v=LsK-xG1cLYA">Statquest</a></li>
    <li><a href="https://www.youtube.com/watch?v=-DUxtdeCiB4">Principles of Machine Learning | AdaBoost(Video)</a></li>
</ul>

In [None]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV, StratifiedShuffleSplit

n_estimators = [100, 140, 145, 150, 160, 170, 175, 180, 185]
learning_r = [0.1, 1, 0.01, 0.5]
cv = StratifiedShuffleSplit(n_splits=10, test_size=.30, random_state=15)

parameters = {'n_estimators': n_estimators,
              'learning_rate': learning_r}

## With no base estimator given, AdaBoost boosts a depth-1 decision tree (a stump).
grid = GridSearchCV(AdaBoostClassifier(),
                    param_grid=parameters,
                    cv=cv,
                    n_jobs=-1)
grid.fit(X, y)

In [None]:
print (grid.best_score_)
print (grid.best_params_)
print (grid.best_estimator_)

In [None]:
adaBoost_grid = grid.best_estimator_
adaBoost_grid.score(X,y)

## Pros and cons of boosting

---

### Pros

- Can achieve higher performance than bagging when the hyper-parameters are tuned properly.
- Can be used for classification and regression equally well.
- Easily handles mixed data types.
- Can use "robust" loss functions that make the model resistant to outliers.

---

### Cons

- Difficult and time-consuming to tune the hyper-parameters properly.
- Cannot be parallelized like bagging, because each learner depends on the previous one (scales poorly to huge amounts of data).
- More risk of overfitting compared to bagging.

<h3>Resources: </h3>
<ul>
    <li><a href="http://mccormickml.com/2013/12/13/adaboost-tutorial/">AdaBoost Tutorial-Chris McCormick</a></li>
    <li><a href="http://rob.schapire.net/papers/explaining-adaboost.pdf">Explaining AdaBoost by Robert Schapire (one of the original authors of AdaBoost)</a></li>
</ul>

## 7i. Gradient Boosting Classifier

<a id="gradient_boosting"></a>

---

In [None]:
# Gradient Boosting Classifier
from sklearn.ensemble import GradientBoostingClassifier

gradient_boost = GradientBoostingClassifier()
gradient_boost.fit(X, y)
y_pred = gradient_boost.predict(X_test)
gradient_accy = round(accuracy_score(y_test, y_pred), 3)  # accuracy_score expects (y_true, y_pred)
print(gradient_accy)

<div class=" alert alert-info">
<h3>Resources: </h3>
<ul>
    <li><a href="https://www.youtube.com/watch?v=sDv4f4s2SB8">Gradient Descent(StatQuest)</a></li>
    <li><a href="https://www.youtube.com/watch?v=3CC4N4z3GJc">Gradient Boost(Regression Main Ideas)(StatQuest)</a></li>
    <li><a href="https://www.youtube.com/watch?v=3CC4N4z3GJc">Gradient Boost(Regression Calculation)(StatQuest)</a></li>
    <li><a href="https://www.youtube.com/watch?v=jxuNLH5dXCs">Gradient Boost(Classification Main Ideas)(StatQuest)</a></li>
    <li><a href="https://www.youtube.com/watch?v=StWY5QWMXCw">Gradient Boost(Classification Calculation)(StatQuest)</a></li>
    <li><a href="https://www.analyticsvidhya.com/blog/2016/02/complete-guide-parameter-tuning-gradient-boosting-gbm-python/">Complete Machine Learning Guide to Parameter Tuning in Gradient Boosting (GBM) in Python</a></li>
</ul>
</div>

## 7j. XGBClassifier

<a id="XGBClassifier"></a>

---

In [None]:
# from xgboost import XGBClassifier
# XGBClassifier = XGBClassifier()
# XGBClassifier.fit(X, y)
# y_pred = XGBClassifier.predict(X_test)
# XGBClassifier_accy = round(accuracy_score(y_pred, y_test), 3)
# print(XGBClassifier_accy)

## 7k. Extra Trees Classifier

<a id="extra_tree"></a>

---

In [None]:
from sklearn.ensemble import ExtraTreesClassifier
# Note: this rebinds the class name to an instance; later cells rely on that name.
ExtraTreesClassifier = ExtraTreesClassifier()
ExtraTreesClassifier.fit(X, y)
y_pred = ExtraTreesClassifier.predict(X_test)
extraTree_accy = round(accuracy_score(y_test, y_pred), 3)
print(extraTree_accy)

## 7l. Gaussian Process Classifier

<a id="GaussianProcessClassifier"></a>

---

In [None]:
from sklearn.gaussian_process import GaussianProcessClassifier
# Note: this rebinds the class name to an instance; later cells rely on that name.
GaussianProcessClassifier = GaussianProcessClassifier()
GaussianProcessClassifier.fit(X, y)
y_pred = GaussianProcessClassifier.predict(X_test)
gau_pro_accy = round(accuracy_score(y_test, y_pred), 3)
print(gau_pro_accy)

## 7m. Voting Classifier

<a id="voting_classifer"></a>

---

In [None]:
from sklearn.ensemble import VotingClassifier

voting_classifier = VotingClassifier(estimators=[
    ('lr_grid', logreg_grid),
    ('svc', svm_grid),
    ('random_forest', rf_grid),
    ('gradient_boosting', gradient_boost),
    ('decision_tree_grid',dectree_grid),
    ('knn_classifier', knn_grid),
#     ('XGB_Classifier', XGBClassifier),
    ('bagging_classifier', bagging_grid),
    ('adaBoost_classifier',adaBoost_grid),
    ('ExtraTrees_Classifier', ExtraTreesClassifier),
    ('gaussian_classifier',gaussian),
    ('gaussian_process_classifier', GaussianProcessClassifier)
],voting='hard')

#voting_classifier = voting_classifier.fit(train_x,train_y)
voting_classifier = voting_classifier.fit(X,y)

In [None]:
y_pred = voting_classifier.predict(X_test)
voting_accy = round(accuracy_score(y_test, y_pred), 3)  # accuracy_score expects (y_true, y_pred)
print(voting_accy)
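
For intuition about `voting='hard'`: each fitted model casts one vote per passenger and the majority label wins. The sketch below reproduces that idea in plain Python with made-up predictions. It is not scikit-learn's implementation, `hard_vote` is a hypothetical helper, and ties are not handled specially.

```python
from collections import Counter

def hard_vote(predictions_per_model):
    """Return the most common predicted label for each sample across models."""
    majority = []
    for sample_votes in zip(*predictions_per_model):
        label, _count = Counter(sample_votes).most_common(1)[0]
        majority.append(label)
    return majority

votes = [
    [1, 0, 1, 0],  # hypothetical model A
    [1, 1, 0, 0],  # hypothetical model B
    [0, 1, 1, 0],  # hypothetical model C
]
print(hard_vote(votes))  # prints: [1, 1, 1, 0]
```

With an odd number of models and two classes there is always a clear majority, which is one reason ensembles often use an odd number of voters.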

In [None]:
#models = pd.DataFrame({
#    'Model': ['Support Vector Machines', 'KNN', 'Logistic Regression',
#              'Random Forest', 'Naive Bayes',
#              'Decision Tree', 'Gradient Boosting Classifier', 'Voting Classifier', 'XGB Classifier','ExtraTrees Classifier','Bagging Classifier'],
#    'Score': [svc_accy, knn_accy, logreg_accy,
#              random_accy, gaussian_accy, dectree_accy,
#               gradient_accy, voting_accy, XGBClassifier_accy, extraTree_accy, bagging_accy]})
#models.sort_values(by='Score', ascending=False)

# Part 8: Submit test predictions

<a id="submit_predictions"></a>

---

In [None]:
all_models = [logreg_grid,
              knn_grid,
              knn_ran_grid,
              svm_grid,
              dectree_grid,
              rf_grid,
              bagging_grid,
              adaBoost_grid,
              voting_classifier]

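# Score every fitted model on the hold-out split; `c` maps each model to its accuracy.
c = {}
for model in all_models:
    predictions = model.predict(X_test)
    c[model] = accuracy_score(y_test, predictions)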

In [None]:
# Predict on the competition test set with the model that scored best on the hold-out split.
test_prediction = (max(c, key=c.get)).predict(test)
submission = pd.DataFrame({
        "PassengerId": passengerid,
        "Survived": test_prediction
    })

submission.PassengerId = submission.PassengerId.astype(int)
submission.Survived = submission.Survived.astype(int)

submission.to_csv("titanic1_submission.csv", index=False)

<div class="alert alert-info">
    <h1>Resources</h1>
    <ul>
        <li><b>Statistics</b></li>
        <ul>
            <li><a href="https://statistics.laerd.com/statistical-guides/measures-of-spread-standard-deviation.php">Types of Standard Deviation</a></li>
            <li><a href="https://blog.minitab.com/blog/statistics-and-quality-data-analysis/what-is-a-t-test-and-why-is-it-like-telling-a-kid-to-clean-up-that-mess-in-the-kitchen">What Is a t-test? And Why Is It Like Telling a Kid to Clean Up that Mess in the Kitchen?</a></li>
            <li><a href="https://blog.minitab.com/blog/statistics-and-quality-data-analysis/what-are-t-values-and-p-values-in-statistics">What Are T Values and P Values in Statistics?</a></li>
            <li><a href="https://www.youtube.com/watch?v=E4KCfcVwzyw">What is p-value? How we decide on our confidence level.</a></li>
        </ul>
        <li><b>Writing pythonic code</b></li>
        <ul>
            <li><a href="https://www.kaggle.com/rtatman/six-steps-to-more-professional-data-science-code">Six steps to more professional data science code</a></li>
            <li><a href="https://www.kaggle.com/jpmiller/creating-a-good-analytics-report">Creating a Good Analytics Report</a></li>
            <li><a href="https://en.wikipedia.org/wiki/Code_smell">Code Smell</a></li>
            <li><a href="https://www.python.org/dev/peps/pep-0008/">Python style guides</a></li>
            <li><a href="https://gist.github.com/sloria/7001839">The Best of the Best Practices(BOBP) Guide for Python</a></li>
            <li><a href="https://www.python.org/dev/peps/pep-0020/">PEP 20 -- The Zen of Python</a></li>
            <li><a href="https://docs.python-guide.org/">The Hitchhiker's Guide to Python</a></li>
            <li><a href="https://realpython.com/tutorials/best-practices/">Python Best Practice Patterns</a></li>
            <li><a href="http://www.nilunder.com/blog/2013/08/03/pythonic-sensibilities/">Pythonic Sensibilities</a></li>
        </ul>
        <li><b>Why Scikit-Learn?</b></li>
        <ul>
            <li><a href="https://www.oreilly.com/content/intro-to-scikit-learn/">Introduction to Scikit-Learn</a></li>
            <li><a href="https://www.oreilly.com/content/six-reasons-why-i-recommend-scikit-learn/">Six reasons why I recommend scikit-learn</a></li>
            <li><a href="https://hub.packtpub.com/learn-scikit-learn/">Why you should learn Scikit-learn</a></li>
            <li><a href="https://www.kaggle.com/baghern/a-deep-dive-into-sklearn-pipelines">A Deep Dive Into Sklearn Pipelines</a></li>
            <li><a href="https://www.kaggle.com/sermakarevich/sklearn-pipelines-tutorial">Sklearn pipelines tutorial</a></li>
            <li><a href="https://www.kdnuggets.com/2017/12/managing-machine-learning-workflows-scikit-learn-pipelines-part-1.html">Managing Machine Learning workflows with Sklearn pipelines</a></li>
            <li><a href="https://towardsdatascience.com/a-simple-example-of-pipeline-in-machine-learning-with-scikit-learn-e726ffbb6976">A simple example of pipeline in Machine Learning using SKlearn</a></li>
        </ul>
    </ul>
    <h1>Credits</h1>
    <ul>
        <li>To Brandon Foltz for his <a href="https://www.youtube.com/channel/UCFrjdcImgcQVyFbK04MBEhA">youtube</a> channel and for being an amazing teacher.</li>
        <li>To GA where I started my data science journey.</li>
        <li>To Kaggle community for inspiring me over and over again with all the resources I need.</li>
        <li>To Udemy Course "Deployment of Machine Learning". I have used and modified some of the code from this course to help make the learning process intuitive.</li>
    </ul>
</div>

<div class="alert alert-info">
<h4>If you would like to discuss any other projects or just have a chat about data science topics, I'll be more than happy to connect with you on:</h4>
    <ul>
        <li><a href="https://www.linkedin.com/in/masumrumi/"><b>LinkedIn</b></a></li>
        <li><a href="https://github.com/masumrumi"><b>Github</b></a></li>
        <li><a href="https://masumrumi.github.io/cv/"><b>masumrumi.github.io/cv/</b></a></li>
        <li><a href="https://www.youtube.com/channel/UC1mPjGyLcZmsMgZ8SJgrfdw"><b>Youtube</b></a></li>
    </ul>

<p>This kernel will always be a work in progress. I will incorporate new data science concepts as I learn them with each update. If you have any ideas or suggestions about this notebook, please let me know. Any feedback about further improvements would be genuinely appreciated.</p>

<h1>If you have come this far, Congratulations!!</h1>

<h1>If this notebook helped you in any way or you liked it, please upvote and/or leave a comment!! :)</h1></div>