# Factor Analysis of Mixed Data with &nbsp;<a href="https://www.python.org/"><img src="https://s3.dualstack.us-east-2.amazonaws.com/pythondotorg-assets/media/community/logos/python-logo-only.png" style="max-width: 35px; display: inline" alt="Python"/></a>

## _Titanic Survival Prediction_

---

In this tutorial, we will use the [Titanic dataset](https://www.kaggle.com/competitions/titanic/data) from the Kaggle competition "_Titanic - Machine Learning from Disaster_".

Our aim here is to mobilise several dimension reduction algorithms covered in the lesson (PCA, MCA) and attempt to predict passengers' chances of survival.
As prediction algorithms are not at the heart of this course (unlike Machine Learning!), we will not optimise this part: we could implement much more efficient algorithms than the naive ones proposed here.

For this tutorial, we will be using the [`prince`](https://maxhalford.github.io/prince/) package, based on the same syntax as scikit-learn.

---

In [None]:
# pip install prince

In [None]:
import pandas as pd
import numpy as np
import prince

import matplotlib.pyplot as plt
import seaborn as sns

---
## Data Loading

First of all, we need to load the data and clean it up to make it easier to use. The Titanic dataset is available in `Seaborn` as the `titanic` dataset. It consists of the following columns:

- survived: Survival status (0 = No, 1 = Yes)
- pclass: Passenger class (1 = 1st class, 2 = 2nd class, 3 = 3rd class)
- sex: Passenger’s gender
- age: Passenger’s age
- sibsp: Number of siblings/spouses aboard
- parch: Number of parents/children aboard
- fare: Fare paid for the ticket
- embarked: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
- class: Equivalent to pclass, stored as a label (First, Second, Third)
- who: Passenger’s category (man, woman, child)
- adult_male: Whether the passenger is an adult male or not (True or False)
- deck: Cabin deck
- embark_town: Port of embarkation (Cherbourg, Queenstown, Southampton)
- alive: Survival status (yes or no)
- alone: Whether the passenger is alone or not (True or False)

In [None]:
# Load Data

titanic = sns.load_dataset("titanic")
titanic.head()

##### <span style="color:purple">**Todo:** Clean the data.</span>

1. Remove any redundant variables

2. Manage missing values.
    - Check whether or not the data contain missing values,
    - Where this seems reasonable (not too many missing values, not too many modalities in the case of a categorical variable, etc.):
        - Impute missing quantitative values with the median,
        - Impute missing qualitative values with the most frequent modality.    <br><br>
        
3. Create a new 'family_size' variable that counts the number of siblings/spouses and parents/children aboard.
    
---

The command `print(titanic.isnull().sum())` should produce the following output:
```
survived       0
sex            0
age            0
sibsp          0
parch          0
fare           0
embarked       0
class          0
who            0
adult_male     0
alone          0
family_size    0
```
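
The imputation and `family_size` steps listed above can be sketched on a toy frame. This is only an illustration under assumptions: the values are made up, and `family_size` is taken here as `sibsp + parch` (the passenger is not counted), which is one reading of the Todo rather than the reference solution.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Titanic data (values are made up).
toy = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 58.0],
    "embarked": ["S", "C", None, "S"],
    "sibsp": [1, 0, 0, 2],
    "parch": [0, 0, 1, 2],
})

# Quantitative column: fill missing values with the median.
toy["age"] = toy["age"].fillna(toy["age"].median())

# Qualitative column: fill missing values with the most frequent modality.
toy["embarked"] = toy["embarked"].fillna(toy["embarked"].mode()[0])

# Assumption: family size = siblings/spouses + parents/children.
toy["family_size"] = toy["sibsp"] + toy["parch"]

print(toy.isnull().sum())
```

The same pattern applies column by column to the real dataset once the redundant variables have been dropped.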

In [None]:
### TO BE COMPLETED ###

[...]

# --- #

print('')
print('=== Checking ===')
print(titanic.isnull().sum())

In [None]:
# %load solutions/data/clean_data.py

##### <span style="color:purple">**Todo:** Outlier detection.</span>

Based on the two figures below, would you say that the data shows outliers? If so, remove them.

In [None]:
# Fare
sns.catplot(data=titanic, x='fare', hue='class')
plt.title('Fare according to the class')
plt.show()

# Age
sns.catplot(data=titanic, x='age', hue='class')
plt.title('Age according to the class')
plt.show()
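
If a numerical criterion is preferred over visual inspection, one common convention is the 1.5 × IQR rule. The sketch below applies it to a made-up fare series; the rule and the threshold are assumptions chosen for illustration, not necessarily the criterion used in the solution.

```python
import pandas as pd

# Made-up fares with one extreme value.
fare = pd.Series([7.25, 8.05, 13.0, 26.0, 30.0, 512.33], name="fare")

# Interquartile range and the usual 1.5 * IQR fences.
q1, q3 = fare.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep only the values that fall inside the fences.
fare_kept = fare[fare.between(lower, upper)]
print(fare_kept)
```

On skewed variables such as fares this rule can be aggressive, so it is worth checking how many rows it would remove before applying it.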

In [None]:
### TO BE COMPLETED ###

[...]

In [None]:
# %load solutions/data/outliers.py

##### <span style="color:purple">**Todo:** Explore the profile of the passengers on the Titanic.</span>

1. View the distribution of the passengers according to their status (man, woman, child), according to their class, _etc._
2. View the breakdown of passengers who survived the shipwreck
3. How does the survival rate change depending on the class and type of passenger (male, female, child)?
4. What is the age profile of passengers according to the class of their ticket?
5. Same question with ticket prices
6. Do ticket prices seem to depend on the age of passengers?

In [None]:
# Passenger distribution

### TO BE COMPLETED ###

[...]

In [None]:
# %load solutions/data/passengers.py

In [None]:
# Survival Count

### TO BE COMPLETED ###

[...]

In [None]:
# %load solutions/data/survival.py

In [None]:
# Survival depending on sex and class

### TO BE COMPLETED ###

[...]

In [None]:
# %load solutions/data/survival_sex_class.py

In [None]:
# Age vs. Class

### TO BE COMPLETED ###

[...]

In [None]:
# %load solutions/data/aga_class.py

In [None]:
# Class vs. Fare

### TO BE COMPLETED ###

[...]

In [None]:
# %load solutions/data/class_fare.py

In [None]:
# Age vs. Fare

### TO BE COMPLETED ###

[...]

In [None]:
# %load solutions/data/age_fare.py

##### <span style="color:purple">**Question:** Under what age are passengers considered to be children?</span>

In [None]:
### TO BE COMPLETED ###

[...]

In [None]:
# %load solutions/data/children.py

##### <span style="color:purple">**Todo:** Visualize correlations between features.</span>

- What do you think?

In [None]:
### TO BE COMPLETED ###

[...]

In [None]:
# %load solutions/data/correlation.py

## Principal Component Analysis

PCA is the usual dimension reduction technique when working with quantitative variables.

##### <span style="color:purple">**Todo:** Using the `prince` package, perform a [PCA](https://maxhalford.github.io/prince/pca/).</span>

- Restrict to quantitative variables: survived, age, fare and family_size.
- Fit the model.
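
To make explicit what a normalized PCA computes, here is a minimal NumPy sketch on synthetic data: the columns are standardized, the correlation matrix is diagonalized, and each eigenvalue gives the inertia carried by one axis. It does not use `prince`, and the data are random, so it only illustrates the mechanics.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic quantitative table: 200 rows, 3 columns, two of them correlated.
x1 = rng.normal(size=200)
X = np.column_stack([x1,
                     0.8 * x1 + 0.2 * rng.normal(size=200),
                     rng.normal(size=200)])

# Standardize each column (zero mean, unit variance).
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Diagonalize the correlation matrix and sort axes by decreasing inertia.
corr = Z.T @ Z / len(Z)
eigenvalues, eigenvectors = np.linalg.eigh(corr)
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# Share of inertia per axis and coordinates of the rows on the axes.
explained = eigenvalues / eigenvalues.sum()
coords = Z @ eigenvectors
print(explained)
```

With standardized variables the eigenvalues sum to the number of variables, which is why an eigenvalue above 1 is often read as an axis carrying more than one variable's worth of inertia.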

In [None]:
### TO BE COMPLETED ###

titanic_quanti = ...
titanic_quanti = titanic_quanti.set_index('survived')

[...]

In [None]:
# %load solutions/pmca/titanic_quanti.py

##### <span style="color:purple">**Question:** What do you think of the inertia carried by each axis?</span>

- Consider the `eigenvalues_summary` attribute
- and/or view the `scree_plot`

In [None]:
### TO BE COMPLETED ###

[...]

In [None]:
# %load solutions/pmca/pca_eigenvalues.py

##### <span style="color:purple">**Question:** Do you think we could easily predict the survival rate in PCA space?</span>

- Show the projections of the points in the principal maps, colored according to their survival rate.
- Refer to Prince's help for figures.

In [None]:
### TO BE COMPLETED ###

[...]

In [None]:
# %load solutions/pmca/plot_pca_prince.py

> *Disclaimer:* If you get an error message when calling the Prince `plot` function on your machine (as is the case on mine), you can use the function below.

In [None]:
# %load solutions/pmca/plot_pca.py

## Multiple Correspondence Analysis

MCA is the usual dimension reduction technique when working with qualitative variables.
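
MCA operates on the indicator (one-hot) matrix of the qualitative variables, so the number of modalities drives its total inertia: with K variables and J modalities in total, the inertia equals J/K − 1. The sketch below builds that matrix on a made-up frame; the column names mirror the dataset but the rows are invented.

```python
import pandas as pd

# Made-up qualitative table (three variables).
toy = pd.DataFrame({
    "class": ["First", "Third", "Third", "Second"],
    "who": ["woman", "man", "child", "man"],
    "alone": [False, True, False, True],
}).astype("category")

# Number of modalities per variable.
modalities = toy.nunique()

# Indicator matrix: one column per modality, one 1 per variable in each row.
indicator = pd.get_dummies(toy)

n_variables = toy.shape[1]
n_modalities = indicator.shape[1]
total_inertia = n_modalities / n_variables - 1

print(modalities)
print(f"{n_modalities} modalities, total inertia = {total_inertia:.3f}")
```

Adding a variable with many modalities therefore inflates the total inertia and tends to spread it over more axes.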

##### <span style="color:purple">**Question:** Does it seem reasonable to assume family size as a qualitative variable?</span>

In [None]:
### TO BE COMPLETED ###

[...]

In [None]:
# %load solutions/pmca/family_size.py

##### <span style="color:purple">**Question:** How many modalities does the dataset contain, taking `family_size` as a qualitative variable? And without?</span>

In [None]:
### TO BE COMPLETED ###

[...]

In [None]:
# %load solutions/pmca/titanic_quali.py

##### <span style="color:purple">**Todo:** Using the `prince` package, perform a [MCA](https://maxhalford.github.io/prince/mca/).</span>

- Restrict to qualitative variables: survived, embarked, class, who, alone and possibly family_size
- Fit the model.

In [None]:
### TO BE COMPLETED ###

[...]

In [None]:
# %load solutions/pmca/mca.py

##### <span style="color:purple">**Question:** What do you think of the inertia carried by each axis?</span>

- Consider the `eigenvalues_summary` attribute
- and/or view the `scree_plot`

In [None]:
### TO BE COMPLETED ###

[...]

In [None]:
# %load solutions/pmca/mca_eigenvalues.py

##### <span style="color:purple">**Todo:** View MCA results.</span>

In [None]:
### TO BE COMPLETED ###

[...]

In [None]:
# %load solutions/pmca/plot_mca_prince.py

##### <span style="color:purple">**Todo:** Write a function `plot_mca`.</span>

- The function `plot_mca(ax1=0, ax2=1, mca=mca, data=titanic_quali)` plots the dataset projections on the MCA plane (ax1, ax2), colored according to the survival of the passengers.
- Choose colors consistent with the previous PCA graph.
- Label each axis with the percentage of variance it explains (based on the graph above).

In [None]:
### TO BE COMPLETED ###

def plot_mca(ax1=0, ax2=1, mca=mca, data=titanic_quali):
  [...]

In [None]:
# %load solutions/pmca/plot_mca.py

##### <span style="color:purple">**Question:** What do you conclude from the graphs below?</span>

In [None]:
plot_mca(0,1)
plot_mca(1,2)
plot_mca(0,2)

##### <span style="color:purple">**Question:** Which variables contribute most strongly to the axes?</span>

In [None]:
### TO BE COMPLETED ###

[...]

In [None]:
# %load solutions/pmca/contrib_mca.py

##### <span style="color:purple">**Question:** What do you think of the quality of the representation of individuals? of variables?</span>


In [None]:
### TO BE COMPLETED ###

[...]

In [None]:
# %load solutions/pmca/quality_mca.py

## Factor analysis of mixed data

We now consider the following dataset, composed of quantitative _and_ qualitative variables.

**Note**: FAMD is a special case of MFA (multiple factor analysis) in which the groups are of mixed type (some quantitative, others qualitative) and of size 1.

In [None]:
titanic_clip = titanic[['survived','age','fare','embarked','class','who','alone']] #,'family_size'
titanic_clip = titanic_clip.set_index('survived')
display(titanic_clip)
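
As a reminder of what FAMD does with such a table, the sketch below reproduces the usual preprocessing by hand on made-up rows: quantitative columns are standardized, indicator columns are divided by the square root of their modality frequency and centered, and a PCA is run on the concatenation. This is a hand-rolled illustration, not the `prince` implementation, and its scaling conventions may differ from the library's.

```python
import numpy as np
import pandas as pd

# Made-up mixed table: two quantitative and two qualitative variables.
toy = pd.DataFrame({
    "age": [20.0, 40.0, 25.0, 33.0, 50.0, 4.0],
    "fare": [8.0, 70.0, 9.0, 55.0, 50.0, 20.0],
    "who": ["man", "woman", "woman", "woman", "man", "child"],
    "alone": ["no", "no", "yes", "no", "yes", "no"],
})
quanti = toy[["age", "fare"]]
quali = toy[["who", "alone"]]

# Quantitative part: standardize (population standard deviation).
Z_quanti = (quanti - quanti.mean()) / quanti.std(ddof=0)

# Qualitative part: indicator columns scaled by 1/sqrt(frequency), then centered.
indicator = pd.get_dummies(quali).astype(float)
proportions = indicator.mean()
Z_quali = indicator / np.sqrt(proportions)
Z_quali = Z_quali - Z_quali.mean()

# PCA on the concatenated table: eigenvalues in decreasing order.
Z = np.hstack([Z_quanti.to_numpy(), Z_quali.to_numpy()])
eigenvalues = np.linalg.eigvalsh(Z.T @ Z / len(Z))[::-1]
print(eigenvalues.round(3))
```

With this scaling each quantitative variable contributes 1 to the total inertia and each qualitative variable contributes its number of modalities minus 1, which balances the two types of variables.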

##### <span style="color:purple">**Todo:** Using the `prince` package, perform a [FAMD](https://maxhalford.github.io/prince/famd/).</span>

- Fit the data
- Visualize the result of the FAMD (plot)

In [None]:
### TO BE COMPLETED ###

[...]

In [None]:
# %load solutions/pmca/famd.py

##### <span style="color:purple">**Question:** What do you think of the inertia carried by each axis?</span>

- Consider the `eigenvalues_summary` attribute
- and/or view the `scree_plot`

In [None]:
### TO BE COMPLETED ###

[...]

In [None]:
# %load solutions/pmca/famd_eigenvalues.py

##### <span style="color:purple">**Todo:** Write a function `plot_famd`.</span>

- The function `plot_famd(ax1=0, ax2=1, famd=famd, data=titanic_clip)` plots the dataset projections on the FAMD plane (ax1, ax2), colored according to the survival of the passengers.
- Choose colors consistent with the previous PCA graph.
- Label each axis with the percentage of variance it explains (based on the graph above).

In [None]:
### TO BE COMPLETED ###

def plot_famd(ax1=0, ax2=1, famd=famd, data=titanic_clip):
  [...]

In [None]:
# %load solutions/pmca/plot_famd.py

##### <span style="color:purple">**Question:** What do you conclude from the graphs below?</span>


In [None]:
plot_famd(0,1)
plot_famd(0,2)
plot_famd(1,2)

##### <span style="color:purple">**Question:** Which variables contribute most strongly to the axes?</span>

In [None]:
### TO BE COMPLETED ###

[...]

In [None]:
# %load solutions/pmca/contrib_famd.py

## Survival of the passengers on the Titanic

We are now going to take advantage of the dimension reduction techniques we have already seen to make predictions. 

First, we run an unsupervised classification (clustering) of the points to check whether we have correctly understood their distribution. Without labels for the points, the study could go no further. Here, however, the labels are available, so we also apply two common supervised classification algorithms to predict whether each passenger survived: SVM and logistic regression. These algorithms are outside the scope of this course, but you are already familiar with them. We will not go into this part in detail, in order to keep the focus on exploratory data analysis.

### Study in the FAMD space

In [None]:
titanic_famd = famd.transform(titanic_clip)
titanic_famd.reset_index(inplace=True)
titanic_famd['survived'] = titanic_famd['survived'].cat.rename_categories({'No': 0, 'Yes':1})

X = titanic_famd[np.arange(5)].to_numpy()
y = titanic_famd['survived'].to_numpy()

#### k-means on FAMD Components

In [None]:
from sklearn.cluster import KMeans
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

##### <span style="color:purple">**Todo:** Classification using the k-means algorithm (or another!)</span>

- How many classes to impose?
- What is the proportion of well-classified points?
- Display the confusion matrix
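
One point to keep in mind when scoring k-means against known labels: cluster ids are arbitrary, so cluster 0 may correspond to label 1. The sketch below shows this on synthetic two-blob data (not the FAMD coordinates); the variable names `X_toy` and `y_toy` are placeholders introduced for the illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score, confusion_matrix

rng = np.random.default_rng(0)

# Two well-separated synthetic blobs with known labels.
X_toy = np.vstack([rng.normal(0, 1, size=(50, 2)),
                   rng.normal(6, 1, size=(50, 2))])
y_toy = np.array([0] * 50 + [1] * 50)

# Two clusters, to match the two label values.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
clusters = kmeans.fit_predict(X_toy)

# Cluster ids are arbitrary: swap them if that matches the labels better.
if accuracy_score(y_toy, clusters) < 0.5:
    clusters = 1 - clusters

toy_accuracy = accuracy_score(y_toy, clusters)
toy_cm = confusion_matrix(y_toy, clusters)
print(toy_accuracy)
print(toy_cm)
```

On real projections the clusters are rarely this clean, and the confusion matrix shows which label the clustering confuses most.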

In [None]:
### TO BE COMPLETED ###

[...]

In [None]:
# %load solutions/pred/kmeans_famd.py

#### SVM Model on FAMD Components

- We will run [SVM](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html) for different values of the regularization parameter `C` and different values of the kernel coefficient `gamma`.

- We will use [RandomizedSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html) to select the optimal `C` and `gamma` hyperparameters.

- For a more exhaustive search, you could use [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) instead, at the cost of computational time.

In [None]:
from sklearn.model_selection import train_test_split

##### <span style="color:purple">**Todo:** Divide the dataset into test and train sets.</span>

- Create test and training matrices containing projections into the FAMD space.
- Create test and training vectors with related 'survived' labels.
- Keep only those dimensions that carry sufficient inertia.
- Refer to the `train_test_split` function of `scikit-learn`

**Note**: When we want to train a (supervised) model, we _always_ need to make such a split, so that performance is measured on data the model has not seen; this is how over-fitting is detected.
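
A minimal sketch of such a split with `train_test_split`, on synthetic data. The number of retained components (`n_components = 3`) and the 80/20 ratio are arbitrary choices for the illustration; `stratify` keeps the proportion of each label identical in both sets.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for projected coordinates and binary labels.
X_toy = rng.normal(size=(100, 5))
y_toy = np.array([0] * 60 + [1] * 40)

# Hypothetical choice: keep only the first three components.
n_components = 3

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy[:, :n_components], y_toy,
    test_size=0.2,      # 20% of the rows go to the test set
    stratify=y_toy,     # preserve the label proportions in both sets
    random_state=0,
)
print(X_tr.shape, X_te.shape, y_te.mean())
```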

In [None]:
### TO BE COMPLETED ###

[...]

X_train = ...
y_train = ...

X_test = ...
y_test = ...

In [None]:
# %load solutions/pred/train_test_famd.py

##### Random search for best C and Gamma

In [None]:
from sklearn.svm import SVC
from sklearn.model_selection import RandomizedSearchCV
from scipy import stats

In [None]:
svm = SVC(random_state=42, probability=True)

# stats.uniform(loc, scale) samples uniformly from [loc, loc + scale]
distributions = {"C": stats.uniform(2, 10),
                 "gamma": stats.uniform(0.1, 1)}
clf = RandomizedSearchCV(svm,
    param_distributions = distributions,
    n_iter = 20, n_jobs = 4, cv = 3,
    random_state = 0,
    scoring = 'roc_auc')

search = clf.fit(X_train, y_train)
print('Optimized Hyperparameters: %s' % search.best_params_)
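
A note on the distributions used in the search: `scipy.stats.uniform` is parameterized by `loc` and `scale`, not by lower and upper bounds, so `stats.uniform(2, 10)` draws `C` from the interval [2, 12]. The short check below illustrates this.

```python
from scipy import stats

# Same distribution as the one used for C in the random search.
c_dist = stats.uniform(2, 10)

# Bounds of the distribution: loc and loc + scale.
low, high = c_dist.support()

# Draw a reproducible sample and inspect its range.
samples = c_dist.rvs(size=1000, random_state=0)
print(low, high, samples.min(), samples.max())
```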

##### SVM with optimized C and Gamma

In [None]:
from sklearn import metrics

In [None]:
svm = SVC(random_state = 42,
          probability = True,
          kernel = 'rbf',
          gamma =search.best_params_['gamma'],
          C= search.best_params_['C'])

svm.fit(X_train, y_train)

y_pred_svm_famd = svm.predict(X_test)
svm_accuracy_famd = metrics.accuracy_score(y_test, y_pred_svm_famd)
print("Accuracy of SVM: {:.3f}%".format(100*svm_accuracy_famd))

ConfusionMatrixDisplay(confusion_matrix(y_test, y_pred_svm_famd)).plot()
plt.show()

#### Logistic Regression on FAMD Components

We can do exactly the same with logistic regression.

In [None]:
from sklearn.linear_model import LogisticRegression

In [None]:
logistic = LogisticRegression(random_state=0, solver='saga',
                              tol=1e-2, max_iter=200)

distributions = {"C": stats.uniform(0, 4),
                 "penalty": ['l2', 'l1']}
clf = RandomizedSearchCV(logistic,
    param_distributions = distributions,
    n_iter = 20, n_jobs = 4, cv = 3,
    random_state = 0,
    scoring = 'roc_auc')

search = clf.fit(X_train, y_train)
print('Optimized Hyperparameters: %s' % search.best_params_)

In [None]:
logistic = LogisticRegression(random_state=0,
                              solver='saga', tol=1e-2, max_iter=200,
                              C= search.best_params_['C'],
                              penalty =search.best_params_['penalty'])

logistic.fit(X_train, y_train)

y_pred_logistic_famd = logistic.predict(X_test)
logistic_accuracy_famd = metrics.accuracy_score(y_test, y_pred_logistic_famd)
print("Accuracy of Logistic Regression: {:.3f}%".format(100*logistic_accuracy_famd))

ConfusionMatrixDisplay(confusion_matrix(y_test, y_pred_logistic_famd)).plot()
plt.show()

### MCA for quantitative variables

We saw in the course that we could process both quantitative and qualitative variables at the same time by thresholding the quantitative variables and then applying an MCA to all the variables (which are then all qualitative).

##### <span style="color:purple">**Todo:** Threshold quantitative variables.</span>

1. Threshold quantitative variables
2. View the breakdown of classes created in this way
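
Two standard ways to threshold a quantitative variable with pandas are fixed bins (`pd.cut`) and quantile-based bins (`pd.qcut`). The sketch below uses made-up ages; the bin edges and labels are arbitrary choices for the illustration, not the ones from the solution.

```python
import pandas as pd

# Made-up ages.
ages = pd.Series([2, 15, 22, 30, 41, 58, 70], name="age")

# Fixed bins: intervals (0, 18], (18, 40], (40, 60], (60, 120].
age_fixed = pd.cut(ages, bins=[0, 18, 40, 60, 120],
                   labels=["child", "young", "middle", "senior"])

# Quantile bins: three classes of roughly equal size.
age_quantile = pd.qcut(ages, q=3, labels=["low", "mid", "high"])

# Breakdown of the classes created in each case.
print(age_fixed.value_counts(sort=False))
print(age_quantile.value_counts(sort=False))
```

Quantile bins avoid nearly empty modalities, which matters for MCA because rare modalities tend to dominate the first axes.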

In [None]:
### TO BE COMPLETED ###

titanic_thresh = titanic_clip.copy()

[...]

In [None]:
# %load solutions/pred/titanic_thresh.py

In [None]:
### TO BE COMPLETED ###

[...]

In [None]:
# %load solutions/pred/breakdown.py

##### <span style="color:purple">**Todo:** Carry out the MCA of the variables thus thresholded.</span>

How is the inertia distributed across the axes?

In [None]:
### TO BE COMPLETED ###

[...]

In [None]:
# %load solutions/pred/mca.py

##### <span style="color:purple">**Todo:** Observe the distribution of points in MCA space.</span>

One may use a previously coded function.

In [None]:
### TO BE COMPLETED ###

[...]

In [None]:
# %load solutions/pred/plot_mca.py

We are now going to repeat the previous study, but on the MCA space.

In [None]:
titanic_mca = mca.transform(titanic_thresh)
titanic_mca.reset_index(inplace=True)
titanic_mca['survived'] = titanic_mca['survived'].astype('category').cat.rename_categories({'No': 0, 'Yes':1})

X = titanic_mca.drop('survived', axis=1).to_numpy()
y = titanic_mca['survived'].to_numpy()

#### k-means on MCA Components

##### <span style="color:purple">**Todo:** Classification using the k-means algorithm (or another!)</span>

- How many classes to impose?
- What is the proportion of well-classified points?
- Display the confusion matrix

In [None]:
### TO BE COMPLETED ###

[...]

In [None]:
# %load solutions/pred/kmeans_mca.py

#### SVM Model on MCA Components

##### <span style="color:purple">**Todo:** Divide the dataset into test and train sets.</span>

- Create test and training matrices containing projections into the MCA space.
- Create test and training vectors with related 'survived' labels.

In [None]:
### TO BE COMPLETED ###

[...]

X_train = ...
y_train = ...

X_test = ...
y_test = ...

In [None]:
# %load solutions/pred/train_test_mca.py

##### SVM with optimized C and Gamma

In [None]:
svm = SVC(random_state=42, probability=True)

# stats.uniform(loc, scale) samples uniformly from [loc, loc + scale]
distributions = {"C": stats.uniform(2, 10),
                 "gamma": stats.uniform(0.1, 1)}
clf = RandomizedSearchCV(svm,
    param_distributions = distributions,
    n_iter = 20, n_jobs = 4, cv = 3,
    random_state = 0,
    scoring = 'roc_auc')

search = clf.fit(X_train, y_train)
print('Optimized Hyperparameters: %s' % search.best_params_)

In [None]:
svm = SVC(random_state = 42,
          probability = True,
          kernel = 'rbf',
          gamma =search.best_params_['gamma'],
          C= search.best_params_['C'])

svm.fit(X_train, y_train)

y_pred_svm_mca = svm.predict(X_test)
svm_accuracy_mca = metrics.accuracy_score(y_test, y_pred_svm_mca)
print("Accuracy of SVM: {:.3f}%".format(100*svm_accuracy_mca))

ConfusionMatrixDisplay(confusion_matrix(y_test, y_pred_svm_mca)).plot()
plt.show()

#### Logistic Regression on MCA Components

In [None]:
logistic = LogisticRegression(random_state=0, solver='saga',
                              tol=1e-2, max_iter=200)

distributions = {"C": stats.uniform(0, 4),
                 "penalty": ['l2', 'l1']}
clf = RandomizedSearchCV(logistic,
    param_distributions = distributions,
    n_iter = 20, n_jobs = 4, cv = 3,
    random_state = 0,
    scoring = 'roc_auc')

search = clf.fit(X_train, y_train)
print('Optimized Hyperparameters: %s' % search.best_params_)

In [None]:
logistic = LogisticRegression(random_state=0,
                              solver='saga', tol=1e-2, max_iter=200,
                              C= search.best_params_['C'],
                              penalty =search.best_params_['penalty'])

logistic.fit(X_train, y_train)

y_pred_logistic_mca = logistic.predict(X_test)
logistic_accuracy_mca = metrics.accuracy_score(y_test, y_pred_logistic_mca)
print("Accuracy of Logistic Regression: {:.3f}%".format(100*logistic_accuracy_mca))

ConfusionMatrixDisplay(confusion_matrix(y_test, y_pred_logistic_mca)).plot()
plt.show()

##### <span style="color:purple">**Todo:** Compare the different accuracies obtained previously.</span>


In [None]:
### TO BE COMPLETED ###

[...]

In [None]:
# %load solutions/pred/compare.py