## Preprocessing Project
### Dataset: Spaceship-titanic
---
#### Authors:
- Dawid Siera
- Anatol Kaczmarek
- Deniz Aksoy
- Marcin Leszczyński
---

Let's start with our preprocessing. The `preprocessing` library is built around the `pandas.DataFrame` class, since it is popular and commonly used when working with data. So let's begin by importing `pandas` and `preprocessing`.

In [15]:
import pandas as pd
import preprocessing as pr

Now it's time to load our dataset. From now on we will work with the `spaceship-titanic` dataset, as it was the main objective of our assignment. For this code to work, you need to download the dataset and save it in the same directory as this notebook, or provide an absolute path instead.

In [16]:
dataset = pd.read_csv('spaceship-titanic/train.csv')
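If the relative path does not resolve, it can help to check where the file is expected before calling `pd.read_csv`. Below is a minimal sketch using only the standard library. The helper name `resolve_dataset_path` is hypothetical and is not part of `preprocessing` or `pandas`.

```python
from pathlib import Path


def resolve_dataset_path(relative_path, base_dir=None):
    """Return the absolute path to a dataset file, or raise if it is missing.

    Hypothetical helper for illustration only. `base_dir` defaults to the
    current working directory, which is usually the notebook's directory.
    """
    base = Path(base_dir) if base_dir is not None else Path.cwd()
    path = (base / relative_path).resolve()
    if not path.is_file():
        raise FileNotFoundError(f"Dataset not found at {path}")
    return path
```

With this in place, the cell above could be written as `pd.read_csv(resolve_dataset_path('spaceship-titanic/train.csv'))`, and a missing file would be reported with the full path that was tried.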

Now we can check how well the raw dataset performs with a classifier (the test set is chosen randomly from part of the main dataset). To evaluate the dataset, we create an instance of `pr.benchmarks.ClassifierBenchmark` and call its `pr.benchmarks.ClassifierBenchmark.evaluate` method. The method returns the accuracy, so our objective is to maximize it. If you just want the value, you can turn printing off by passing `printing=False` to the classifier constructor.

In [17]:
clf = pr.benchmarks.ClassifierBenchmark(printing=True)
score = clf.evaluate(dataset, target='Transported')
print(f'Accuracy before preprocessing: {score*100:.2f}%')

              precision    recall  f1-score   support

       False       0.84      0.69      0.76      1284
        True       0.75      0.88      0.81      1324

    accuracy                           0.79      2608
   macro avg       0.80      0.79      0.78      2608
weighted avg       0.79      0.79      0.78      2608

Accuracy before preprocessing: 78.64%
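To see how the accuracy relates to the rest of the report above, it can be reconstructed from the per-class recall and support: recall times support gives the number of correctly classified rows in each class. The sketch below uses only the rounded values printed in the report, so it is an approximation and not how `ClassifierBenchmark` computes the score internally.

```python
# Per-class recall and support copied from the report above (rounded values).
report = {
    "False": {"recall": 0.69, "support": 1284},
    "True": {"recall": 0.88, "support": 1324},
}


def accuracy_from_report(report):
    """Accuracy = correctly classified rows / all rows."""
    total = sum(row["support"] for row in report.values())
    # recall * support approximates the number of correct rows per class
    correct = sum(row["recall"] * row["support"] for row in report.values())
    return correct / total


approx_accuracy = accuracy_from_report(report)
print(f"Reconstructed accuracy: {approx_accuracy * 100:.1f}%")
```

The result comes out at about 78.6%, which agrees with the reported accuracy up to the rounding of the recall values.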


Now, to take full advantage of our library, we will choose a selector and an extractor. So far only one selector is implemented, `VARSelector`, so that's the one we will use. For the extractor we have two options: `PCAExtractor` and `LDAExtractor`. We do not recommend `LDAExtractor` for our dataset, as it is geared toward classification into many groups. It can be used, but its results were nowhere near those of `PCAExtractor`, or even those of the raw dataset, so we will go with `PCAExtractor`.

In [18]:
selector = pr.VARSelector()
extractor = pr.PCAExtractor(num_components=8, target='Transported')
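To give an intuition for the selection step, here is a minimal sketch of variance-based feature selection in plain Python. It assumes that `VARSelector` filters features by variance, which its name suggests but this notebook does not confirm. The function `select_by_variance` and its `threshold` parameter are hypothetical and not part of the library.

```python
from statistics import pvariance


def select_by_variance(columns, threshold=0.0):
    """Keep the columns whose population variance exceeds `threshold`.

    Hypothetical illustration; `columns` maps a column name to its values.
    """
    return {
        name: values
        for name, values in columns.items()
        if pvariance(values) > threshold
    }


toy = {
    "constant": [1.0, 1.0, 1.0, 1.0],           # variance 0, carries no signal
    "almost_constant": [0.0, 0.0, 0.0, 0.1],    # variance below the threshold
    "informative": [0.0, 5.0, 10.0, 15.0],      # clearly varying feature
}

kept = select_by_variance(toy, threshold=0.01)
print(sorted(kept))
```

Only the `informative` column survives here. Note that variance depends on the scale of each feature, so a threshold like this is usually applied to features on comparable scales.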

Now we will combine these two with additional encoding and NA-handling methods. For any dataset there is the generic class `pr.Preprocessing`, but our most advanced version for this dataset is `pr.SpaceShipPreprocessing`, which inherits from `pr.Preprocessing` and implements an additional feature-splitting method with some hard-coded values tailored to our dataset.

In [19]:
preprocessing = pr.SpaceShipPreprocessing(dataset, target='Transported',
                                 selector=selector, extractor=extractor)
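As an illustration of what feature splitting can look like for this dataset, the sketch below breaks two compound columns into their parts. In the Spaceship Titanic data, `Cabin` has the form `deck/num/side` and `PassengerId` has the form `gggg_pp` (group and number within the group). Which columns `pr.SpaceShipPreprocessing` actually splits, and how, is not shown in this notebook, so the functions below are hypothetical and not the library's implementation.

```python
def split_cabin(cabin):
    """Split 'deck/num/side' into (deck, num, side).

    Missing values (None or NaN in the raw data) yield three Nones.
    """
    if not isinstance(cabin, str):
        return None, None, None
    deck, num, side = cabin.split("/")
    return deck, int(num), side


def split_passenger_id(passenger_id):
    """Split 'gggg_pp' into (group, number_in_group) as integers."""
    group, number = passenger_id.split("_")
    return int(group), int(number)


print(split_cabin("B/0/P"))
print(split_passenger_id("0001_01"))
```

Splitting such columns turns one high-cardinality string into a few low-cardinality features, which are much easier to encode afterwards.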

To run the final combination, we call the `pr.Preprocessing.preprocess()` method, which returns the preprocessed dataset.

In [20]:
new_dataset = preprocessing.preprocess()

Now let's evaluate our dataset again:

In [21]:
score = clf.evaluate(new_dataset, target='Transported')
print(f'Accuracy after preprocessing: {score*100:.2f}%')

              precision    recall  f1-score   support

         0.0       0.83      0.73      0.78      1284
         1.0       0.77      0.85      0.81      1324

    accuracy                           0.79      2608
   macro avg       0.80      0.79      0.79      2608
weighted avg       0.80      0.79      0.79      2608

Accuracy after preprocessing: 79.22%


If everything works as intended, the accuracy should increase (here from 78.64% to 79.22%), which indicates that our preprocessing had a positive effect :D