In this tutorial, we use `safeds` on **Titanic passenger data** to predict which passengers survived and which did not.

### Loading Data
The data is available under [Titanic - Machine Learning from Disaster](https://www.kaggle.com/c/titanic/data):


In [1]:
from safeds.data.tabular.containers import Table

raw_data = Table.from_csv_file("data/titanic.csv")
# For visualisation purposes, we only print the first 15 rows.
raw_data.slice_rows(length=15)

id,name,sex,age,siblings_spouses,parents_children,ticket,travel_class,fare,cabin,port_embarked,survived
i64,str,str,f64,i64,i64,str,i64,f64,str,str,i64
0,"""Abbing, Mr. Anthony""","""male""",42.0,0,0,"""C.A. 5547""",3,7.55,,"""Southampton""",0
1,"""Abbott, Master. Eugene Joseph""","""male""",13.0,0,2,"""C.A. 2673""",3,20.25,,"""Southampton""",0
2,"""Abbott, Mr. Rossmore Edward""","""male""",16.0,1,1,"""C.A. 2673""",3,20.25,,"""Southampton""",0
3,"""Abbott, Mrs. Stanton (Rosa Hun…","""female""",35.0,1,1,"""C.A. 2673""",3,20.25,,"""Southampton""",1
4,"""Abelseth, Miss. Karen Marie""","""female""",16.0,0,0,"""348125""",3,7.65,,"""Southampton""",1
…,…,…,…,…,…,…,…,…,…,…,…
10,"""Adahl, Mr. Mauritz Nils Martin""","""male""",30.0,0,0,"""C 7076""",3,7.25,,"""Southampton""",0
11,"""Adams, Mr. John""","""male""",26.0,0,0,"""341826""",3,8.05,,"""Southampton""",0
12,"""Ahlin, Mrs. Johan (Johanna Per…","""female""",40.0,1,0,"""7546""",3,9.475,,"""Southampton""",0
13,"""Aks, Master. Philip Frank""","""male""",0.8333,0,1,"""392091""",3,9.35,,"""Southampton""",1


### Splitting Data into Train and Test Sets
- **Training set**: Contains 60% of the data and will be used to train the model.
- **Testing set**: Contains 40% of the data and will be used to test the model's accuracy.

In [2]:
train_table, test_table = raw_data.shuffle_rows().split_rows(0.6)
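
To make the 60/40 split concrete, here is a minimal plain-Python sketch of shuffling rows and cutting them at a given fraction. The helper `shuffle_and_split` and its `seed` parameter are hypothetical names introduced for illustration; this is not how `shuffle_rows` or `split_rows` are implemented in `safeds`.

```python
import random


def shuffle_and_split(rows, train_fraction, seed=0):
    """Shuffle a copy of `rows`, then cut it at `train_fraction` (hypothetical helper)."""
    shuffled = list(rows)                      # copy, so the input stays untouched
    random.Random(seed).shuffle(shuffled)      # seeded for reproducibility
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


rows = list(range(10))                         # stand-in for ten table rows
train_rows, test_rows = shuffle_and_split(rows, 0.6)
print(len(train_rows), len(test_rows))         # 6 4
```

Shuffling before splitting matters here because the raw file is sorted by passenger name, so an unshuffled cut would not give two comparable samples.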

### Removing Low-Quality Columns

In [3]:
train_table.summarize_statistics()

metric,id,name,sex,age,siblings_spouses,parents_children,ticket,travel_class,fare,cabin,port_embarked,survived
str,f64,str,str,f64,f64,f64,str,f64,f64,str,str,f64
"""min""",1.0,"""Abbott, Master. Eugene Joseph""","""female""",0.1667,0.0,0.0,"""110152""",1.0,0.0,"""A11""","""Cherbourg""",0.0
"""max""",1307.0,"""van Melkebeke, Mr. Philemon""","""male""",76.0,8.0,6.0,"""WE/P 5735""",3.0,512.3292,"""T""","""Southampton""",1.0
"""mean""",654.408917,"""-""","""-""",29.542191,0.518471,0.396178,"""-""",2.298089,33.849861,"""-""","""-""",0.37707
"""median""",658.0,"""-""","""-""",28.0,0.0,0.0,"""-""",3.0,14.5,"""-""","""-""",0.0
"""standard deviation""",376.780514,"""-""","""-""",14.164325,1.067841,0.818931,"""-""",0.834712,55.721765,"""-""","""-""",0.484962
"""distinct value count""",785.0,"""784""","""2""",89.0,7.0,7.0,"""618""",3.0,239.0,"""134""","""3""",2.0
"""idness""",1.0,"""0.9987261146496815""","""0.0025477707006369425""",0.11465,0.008917,0.008917,"""0.7872611464968153""",0.003822,0.305732,"""0.17197452229299362""","""0.003821656050955414""",0.002548
"""missing value ratio""",0.0,"""0.0""","""0.0""",0.189809,0.0,0.0,"""0.0""",0.0,0.001274,"""0.7745222929936306""","""0.0""",0.0
"""stability""",0.001274,"""0.0025477707006369425""","""0.6522292993630573""",0.048742,0.670064,0.75414,"""0.007643312101910828""",0.541401,0.043367,"""0.02824858757062147""","""0.7019108280254777""",0.62293


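We remove the following columns, based on the statistics above:
1. **High idness**: `id`, `ticket`. Most rows have their own value, so these columns act as identifiers rather than features.
2. **High stability**: `parents_children`. About 75% of the rows share the same value.
3. **High missing value ratio**: `cabin`. About 77% of the values are missing.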
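
The three metrics used in this list can be sketched in plain Python as follows. The definitions are assumptions that agree with the statistics table above (for example, `cabin` has 134 distinct non-missing values plus the missing value, and 135 / 785 ≈ 0.172 matches its idness); the function names and the toy `cabin` values are made up for illustration and are not the `safeds` implementation.

```python
from collections import Counter


def missing_value_ratio(values):
    """Share of rows whose value is missing (None)."""
    return sum(v is None for v in values) / len(values)


def idness(values):
    """Distinct values (a missing value counts as one) divided by the row count."""
    return len(set(values)) / len(values)


def stability(values):
    """Share of the most common value among the non-missing values."""
    present = [v for v in values if v is not None]
    most_common_count = Counter(present).most_common(1)[0][1]
    return most_common_count / len(present)


# Toy column with made-up values, not taken from the dataset.
cabin = ["C85", None, None, "E46", None, "C85", None, None]
print(missing_value_ratio(cabin))  # 0.625
print(idness(cabin))               # 0.375
print(stability(cabin))
```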

In [4]:
train_table = train_table.remove_columns(["id", "ticket", "parents_children", "cabin"])
test_table = test_table.remove_columns(["id", "ticket", "parents_children", "cabin"])

### Handling Missing Values
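We fill in missing values in the `age` and `fare` columns with the mean of the respective column. The imputer is fitted on the training set only, so missing values in the test set are filled with the training means.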


In [5]:
from safeds.data.tabular.transformation import SimpleImputer

simple_imputer = SimpleImputer(column_names=["age", "fare"], strategy=SimpleImputer.Strategy.mean())
fitted_simple_imputer_train, transformed_train_data = simple_imputer.fit_and_transform(train_table)
transformed_test_data = fitted_simple_imputer_train.transform(test_table)
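
The fit-then-transform pattern in the cell above can be illustrated with a small plain-Python sketch: the mean is learned from the training values and then reused for the test values. The helpers `fit_mean` and `fill_missing` and the toy ages are hypothetical and only mirror the idea, not the `SimpleImputer` internals.

```python
def fit_mean(values):
    """Learn the replacement value: the mean of the non-missing training values."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present)


def fill_missing(values, replacement):
    """Replace every missing value (None) with the learned replacement."""
    return [replacement if v is None else v for v in values]


# Made-up ages, not taken from the dataset.
train_age = [22.0, None, 30.0, 38.0]
test_age = [None, 50.0]

train_mean = fit_mean(train_age)                 # learned from training data only
filled_train = fill_missing(train_age, train_mean)
filled_test = fill_missing(test_age, train_mean)  # test data reuses the training mean
print(train_mean, filled_train, filled_test)
```

Reusing the training mean keeps information from the test set out of the preprocessing step.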

### Handling Nominal Categorical Data
We use `OneHotEncoder` to transform nominal, non-numerical values into numerical columns that contain only zeros and ones. In this example, we encode the `sex` and `port_embarked` columns so the model can use them to predict passenger survival.
- Pass the names of the columns to encode to the `OneHotEncoder` constructor, then call `fit_and_transform` with the training table. Use the fitted encoder to transform the test table in the same way.

In [6]:
from safeds.data.tabular.transformation import OneHotEncoder

fitted_one_hot_encoder_train, transformed_train_data = OneHotEncoder(column_names=["sex", "port_embarked"]).fit_and_transform(transformed_train_data)
transformed_test_data = fitted_one_hot_encoder_train.transform(transformed_test_data)
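
As a rough illustration of what one-hot encoding produces, the following plain-Python sketch turns one nominal column into one 0/1 column per category. The helpers `fit_categories` and `one_hot` are hypothetical; the `column__category` naming simply follows the column names visible in the statistics output of this tutorial.

```python
def fit_categories(values):
    """Collect the distinct categories in order of first appearance."""
    return list(dict.fromkeys(values))


def one_hot(column_name, values, categories):
    """Build one 0/1 column per known category."""
    return {
        f"{column_name}__{category}": [1.0 if v == category else 0.0 for v in values]
        for category in categories
    }


train_sex = ["male", "female", "male"]
categories = fit_categories(train_sex)           # learned from the training column
encoded = one_hot("sex", train_sex, categories)
print(encoded)
# {'sex__male': [1.0, 0.0, 1.0], 'sex__female': [0.0, 1.0, 0.0]}
```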

### Statistics after Data Processing
Check the data after cleaning and transformation to ensure the changes were made correctly.


In [7]:
transformed_train_data.summarize_statistics()

metric,name,age,siblings_spouses,travel_class,fare,survived,sex__male,sex__female,port_embarked__Southampton,port_embarked__Cherbourg,port_embarked__Queenstown
str,str,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64
"""min""","""Abbott, Master. Eugene Joseph""",0.1667,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
"""max""","""van Melkebeke, Mr. Philemon""",76.0,8.0,3.0,512.3292,1.0,1.0,1.0,1.0,1.0,1.0
"""mean""","""-""",29.542191,0.518471,2.298089,33.849861,0.37707,0.652229,0.347771,0.701911,0.208917,0.089172
"""median""","""-""",29.542191,0.0,3.0,14.5,0.0,1.0,0.0,1.0,0.0,0.0
"""standard deviation""","""-""",12.747491,1.067841,0.834712,55.686217,0.484962,0.476566,0.476566,0.45771,0.406794,0.285174
"""distinct value count""","""784""",90.0,7.0,3.0,240.0,2.0,2.0,2.0,2.0,2.0,2.0
"""idness""","""0.9987261146496815""",0.11465,0.008917,0.003822,0.305732,0.002548,0.002548,0.002548,0.002548,0.002548,0.002548
"""missing value ratio""","""0.0""",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
"""stability""","""0.0025477707006369425""",0.189809,0.670064,0.541401,0.043312,0.62293,0.652229,0.652229,0.701911,0.791083,0.910828


### Mark the `survived` Column as the Target Variable to Be Predicted

In [8]:
tagged_train_table = transformed_train_data.to_tabular_dataset("survived", extra_names=["name"])

### Fitting a Classifier
We use `RandomForestClassifier` as our model and pass the training dataset to its `fit` function to train it.

In [9]:
from safeds.ml.classical.classification import RandomForestClassifier

classifier = RandomForestClassifier()
fitted_classifier = classifier.fit(tagged_train_table)

### Predicting with the Classifier
Use the trained `RandomForestClassifier` to predict whether each passenger in the test set survived. <br>
Pass the transformed test table (`transformed_test_data`) into the `predict` function of the fitted model.

In [10]:
prediction = fitted_classifier.predict(transformed_test_data)

### Reverse-Transforming the Prediction

In [11]:
reverse_transformed_prediction = prediction.to_table().inverse_transform_table(fitted_one_hot_encoder_train)
# For visualisation purposes, we only print the first 15 rows.
reverse_transformed_prediction.slice_rows(length=15)

name,age,siblings_spouses,travel_class,fare,survived,sex,port_embarked
str,f64,i64,i64,f64,i64,str,str
"""Christy, Mrs. (Alice Frances)""",45.0,0,2,30.0,1,"""female""","""Southampton"""
"""Gheorgheff, Mr. Stanio""",29.542191,0,3,7.8958,0,"""male""","""Cherbourg"""
"""Miles, Mr. Frank""",29.542191,0,3,8.05,0,"""male""","""Southampton"""
"""Foley, Mr. William""",29.542191,0,3,7.75,0,"""male""","""Queenstown"""
"""Kink-Heilmann, Miss. Luise Gre…",4.0,0,3,22.025,0,"""female""","""Southampton"""
…,…,…,…,…,…,…,…
"""Zimmerman, Mr. Leo""",29.0,0,3,7.875,0,"""male""","""Southampton"""
"""Kelly, Mr. James""",44.0,0,3,8.05,0,"""male""","""Southampton"""
"""Jensen, Mr. Niels Peder""",48.0,0,3,7.8542,0,"""male""","""Southampton"""
"""White, Mr. Richard Frasar""",21.0,0,1,77.2875,0,"""male""","""Southampton"""


### Testing the Accuracy of the Model

In [12]:
fitted_classifier.accuracy(transformed_test_data)

0.7938931297709924
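
For reference, accuracy is the share of predictions that match the true labels. The sketch below computes it by hand on made-up labels; the standalone function `accuracy` is a hypothetical helper and is not the method of the fitted classifier.

```python
def accuracy(predicted, actual):
    """Share of predictions that equal the true label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Made-up labels: 1 = survived, 0 = did not survive.
predicted = [1, 0, 0, 1, 0]
actual = [1, 0, 1, 1, 0]
print(accuracy(predicted, actual))  # 0.8
```

As a point of comparison, about 62% of the passengers in the training set did not survive (the stability of `survived` in the statistics above), so a model that always predicts `0` would already reach roughly that accuracy on similar data.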