In this tutorial, we use `safeds` on **Titanic passenger data** to predict who will survive and who will not.

### Load your data into a `Table`
- The data is available under `docs/tutorials/data/titanic.csv`:


In [1]:
from safeds.data.tabular.containers import Table

raw_data = Table.from_csv_file("data/titanic.csv")
# For visualisation purposes we only print out the first 5 rows.
raw_data.slice_rows(0, 5)

id,name,sex,age,siblings_spouses,parents_children,ticket,travel_class,fare,cabin,port_embarked,survived
i64,str,str,f64,i64,i64,str,i64,f64,str,str,i64
0,"""Abbing, Mr. Anthony""","""male""",42.0,0,0,"""C.A. 5547""",3,7.55,,"""Southampton""",0
1,"""Abbott, Master. Eugene Joseph""","""male""",13.0,0,2,"""C.A. 2673""",3,20.25,,"""Southampton""",0
2,"""Abbott, Mr. Rossmore Edward""","""male""",16.0,1,1,"""C.A. 2673""",3,20.25,,"""Southampton""",0
3,"""Abbott, Mrs. Stanton (Rosa Hun…","""female""",35.0,1,1,"""C.A. 2673""",3,20.25,,"""Southampton""",1
4,"""Abelseth, Miss. Karen Marie""","""female""",16.0,0,0,"""348125""",3,7.65,,"""Southampton""",1


### Removing columns with high idness / stability / missing value ratio

In [2]:
raw_data.summarize_statistics()

metric,id,name,sex,age,siblings_spouses,parents_children,ticket,travel_class,fare,cabin,port_embarked,survived
str,f64,str,str,f64,f64,f64,str,f64,f64,str,str,f64
"""min""",0.0,"""Abbing, Mr. Anthony""","""female""",0.1667,0.0,0.0,"""110152""",1.0,0.0,"""A10""","""Cherbourg""",0.0
"""max""",1308.0,"""van Melkebeke, Mr. Philemon""","""male""",80.0,8.0,9.0,"""WE/P 5735""",3.0,512.3292,"""T""","""Southampton""",1.0
"""mean""",654.0,"""-""","""-""",29.881135,0.498854,0.385027,"""-""",2.294882,33.295479,"""-""","""-""",0.381971
"""median""",654.0,"""-""","""-""",28.0,0.0,0.0,"""-""",3.0,14.4542,"""-""","""-""",0.0
"""standard deviation""",378.020061,"""-""","""-""",14.4135,1.041658,0.86556,"""-""",0.837836,51.758668,"""-""","""-""",0.486055
"""distinct value count""",1309.0,"""1307""","""2""",98.0,7.0,8.0,"""929""",3.0,281.0,"""186""","""3""",2.0
"""idness""",1.0,"""0.998472116119175""","""0.0015278838808250573""",0.07563,0.005348,0.006112,"""0.7097020626432391""",0.002292,0.215432,"""0.14285714285714285""","""0.0030557677616501145""",0.001528
"""missing value ratio""",0.0,"""0.0""","""0.0""",0.200917,0.0,0.0,"""0.0""",0.0,0.000764,"""0.774637127578304""","""0.0015278838808250573""",0.0
"""stability""",0.000764,"""0.0015278838808250573""","""0.6440030557677616""",0.044933,0.680672,0.76547,"""0.008403361344537815""",0.541635,0.045872,"""0.020338983050847456""","""0.6993114001530222""",0.618029


We remove certain columns for the following reasons:
1. **high idness**: `name`, `id`, `ticket`
2. **high stability**: `parents_children`
3. **high missing value ratio**: `cabin`
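The three metrics can be reproduced by hand. The sketch below is plain Python and does not use `safeds`. The definitions in it are assumptions that happen to match the numbers in the statistics table above (for example, `sex` has 2 distinct values in 1309 rows, giving an idness of about 0.0015), not definitions taken from the library. The helper name `column_metrics` and the toy column are hypothetical.

```python
from collections import Counter

def column_metrics(values):
    """Return (idness, stability, missing value ratio) for one non-empty column.

    Assumed definitions (not taken from the library source):
    - idness: distinct values (missing counts as one value) / number of rows
    - stability: count of the most common non-missing value / non-missing count
    - missing value ratio: number of missing values / number of rows
    """
    n = len(values)
    present = [v for v in values if v is not None]
    idness = len(set(values)) / n
    stability = Counter(present).most_common(1)[0][1] / len(present) if present else 0.0
    missing_ratio = (n - len(present)) / n
    return idness, stability, missing_ratio

# Hypothetical toy column, not taken from the Titanic data.
cabin = ["A10", None, None, "B5", None, None, None, "A10"]
idness, stability, missing_ratio = column_metrics(cabin)
print(idness, stability, missing_ratio)
```

A column whose idness is close to 1 behaves like an identifier, a column with high stability is almost constant, and a column with a high missing value ratio carries little usable information.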

In [3]:
raw_data = raw_data.remove_columns(["name", "id", "ticket", "parents_children", "cabin"])

### Imputing columns `age` and `fare`
We fill in missing values in the `age` and `fare` columns with the mean of each column.
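As a rough illustration of what mean imputation does, here is a minimal plain-Python sketch. It is independent of `safeds`; the function name `impute_mean` and the sample values are hypothetical, and missing values are assumed to be represented as `None`.

```python
def impute_mean(values):
    """Replace missing entries (None) with the mean of the present entries."""
    present = [v for v in values if v is not None]
    mean = sum(present) / len(present)
    return [mean if v is None else v for v in values]

# Hypothetical sample values, not taken from the Titanic data.
ages = [42.0, None, 16.0, 35.0, None]
imputed = impute_mean(ages)
print(imputed)
```

Filling gaps with the mean leaves the column mean unchanged but lowers the spread, which is why the standard deviation of `age` is smaller in the statistics computed after imputation.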


In [4]:
from safeds.data.tabular.transformation import SimpleImputer

simple_transformer = SimpleImputer(column_names=["age", "fare"], strategy=SimpleImputer.Strategy.mean())
_, transformed_raw_data = simple_transformer.fit_and_transform(raw_data)

### Using `OneHotEncoder` to create an encoder and fit and transform the table
We use `OneHotEncoder` to transform categorical, non-numerical values into numerical representations with values of zero or one. In this example, we transform the values of the `sex` and `port_embarked` columns so they can be used in the model to predict passenger survival.
- Pass the names of the columns to encode to the `OneHotEncoder` constructor, then call its `fit_and_transform` function with the table. It returns the fitted encoder and the transformed table.
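To make the idea of one-hot encoding concrete, the following plain-Python sketch encodes a single column by hand. It is not the library's implementation; the helper `one_hot_encode` and the sample values are hypothetical, and the `column__value` naming is chosen only to mirror the column names that appear in the statistics output later in this tutorial.

```python
def one_hot_encode(values, column_name):
    """Map one categorical column to several 0/1 columns, one per category."""
    # Distinct categories in order of first appearance.
    categories = list(dict.fromkeys(values))
    return {
        f"{column_name}__{category}": [1 if v == category else 0 for v in values]
        for category in categories
    }

# Hypothetical sample values, not taken from the Titanic data.
sex = ["male", "male", "female", "male"]
encoded = one_hot_encode(sex, "sex")
print(encoded)
```

Each row has exactly one `1` across the generated columns, so the original category can be recovered from the encoded columns.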

In [5]:
from safeds.data.tabular.transformation import OneHotEncoder

one_hot_encoder = OneHotEncoder(column_names=["sex", "port_embarked"])
transformer_one_hot_encoder, transformed_raw_data = one_hot_encoder.fit_and_transform(transformed_raw_data)

### Reverse transforming `transformed_raw_data`

In [6]:
reverse_transformed_raw_data = transformed_raw_data.inverse_transform_table(transformer_one_hot_encoder)

### Statistics after imputing / removing / encoding
Check the data after cleaning and transformation to ensure the changes were made correctly.


In [7]:
transformed_raw_data.summarize_statistics()

metric,age,siblings_spouses,travel_class,fare,survived,sex__male,sex__female,port_embarked__Southampton,port_embarked__Cherbourg,port_embarked__Queenstown
str,f64,f64,f64,f64,f64,f64,f64,f64,f64,f64
"""min""",0.1667,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
"""max""",80.0,8.0,3.0,512.3292,1.0,1.0,1.0,1.0,1.0,1.0
"""mean""",29.881135,0.498854,2.294882,33.295479,0.381971,0.644003,0.355997,0.698243,0.206264,0.093965
"""median""",29.881135,0.0,3.0,14.4542,0.0,1.0,0.0,1.0,0.0,0.0
"""standard deviation""",12.883199,1.041658,0.837836,51.738879,0.486055,0.478997,0.478997,0.459196,0.404777,0.291891
"""distinct value count""",99.0,7.0,3.0,282.0,2.0,2.0,2.0,2.0,2.0,2.0
"""idness""",0.07563,0.005348,0.002292,0.215432,0.001528,0.001528,0.001528,0.001528,0.001528,0.001528
"""missing value ratio""",0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
"""stability""",0.200917,0.680672,0.541635,0.045837,0.618029,0.644003,0.644003,0.698243,0.793736,0.906035


### Splitting `transformed_raw_data` into train and test sets
- **Training set**: Contains 60% of the data and will be used to train the model.
- **Testing set**: Contains 40% of the data and will be used to test the model's accuracy.
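The shuffle-then-split step can be pictured with the plain-Python sketch below. It is an illustration under assumptions, not the library's code: the helper `shuffle_and_split`, the fixed `seed`, and the rounding of the cut point are all hypothetical choices.

```python
import random

def shuffle_and_split(rows, train_fraction, seed=0):
    """Shuffle a copy of the rows, then cut it into a train and a test part."""
    shuffled = rows[:]  # copy, so the caller's list is not modified
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)  # assumed rounding rule
    return shuffled[:cut], shuffled[cut:]

# Hypothetical rows: ten row indices standing in for table rows.
rows = list(range(10))
train, test = shuffle_and_split(rows, 0.6)
print(len(train), len(test))
```

Shuffling first matters here because the CSV file is sorted by passenger name, so an unshuffled split would not give two random samples.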

In [8]:
train_table, test_table = transformed_raw_data.shuffle_rows().split_rows(0.6)

### Mark the `survived` column as the target variable to be predicted

In [9]:
tagged_train_table = train_table.to_tabular_dataset("survived")

### Using `RandomForest` classifier as a model for classification
We use the `RandomForest` classifier as our model and pass the training dataset to the model's `fit` function to train it.

In [10]:
from safeds.ml.classical.classification import RandomForestClassifier

classifier = RandomForestClassifier()
fitted_classifier = classifier.fit(tagged_train_table)

### Using the trained random forest model to predict survival
Use the trained `RandomForest` model to predict whether each passenger in the test dataset survived.
Pass the `test_table` into the `predict` function, which uses our trained model for prediction.

In [11]:
prediction = fitted_classifier.predict(test_table)
# For visualisation purposes we only print out the first 15 rows.
prediction.to_table().slice_rows(start=0, length=15)

age,siblings_spouses,travel_class,fare,sex__male,sex__female,port_embarked__Southampton,port_embarked__Cherbourg,port_embarked__Queenstown,survived
f64,i64,i64,f64,u8,u8,u8,u8,u8,i64
45.0,0,2,30.0,0,1,1,0,0,1
29.881135,0,3,7.8958,1,0,0,1,0,0
29.881135,0,3,8.05,1,0,1,0,0,0
29.881135,0,3,7.75,1,0,0,0,1,0
4.0,0,3,22.025,0,1,1,0,0,0
…,…,…,…,…,…,…,…,…,…
29.0,0,3,7.875,1,0,1,0,0,0
44.0,0,3,8.05,1,0,1,0,0,0
48.0,0,3,7.8542,1,0,1,0,0,0
21.0,0,1,77.2875,1,0,1,0,0,0


### Testing the accuracy of the model

In [12]:
test_tabular_dataset = test_table.to_tabular_dataset("survived")
fitted_classifier.accuracy(test_tabular_dataset)

0.7938931297709924
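Accuracy is the share of test rows for which the predicted value equals the actual value, so the result above means that roughly 79% of the test passengers were classified correctly. A minimal plain-Python sketch of the computation, with a hypothetical helper `accuracy` and made-up labels, could look like this:

```python
def accuracy(predicted, actual):
    """Return the fraction of positions where prediction and truth agree."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

# Hypothetical labels (1 = survived, 0 = did not survive), not real model output.
predicted = [1, 0, 0, 1, 0]
actual = [1, 0, 1, 1, 0]
score = accuracy(predicted, actual)
print(score)
```

Because the rows are shuffled before splitting, the exact accuracy will usually differ slightly between runs of the notebook.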