# Pretest for PCA with column 'flight'

The column 'flight' contains the flight number of each flight.
There are over 1500 unique flight numbers in the data.
PCA is used to reduce the number of columns after one-hot encoding without losing too much variance.

Background:
Each flight number is tied to a specific aircraft type with a specific kerosene consumption. Kerosene prices have an impact on flight prices. Kerosene consumption also depends on the flight altitude and speed.

The aircraft type, altitude and speed can be looked up, e.g. via flightaware.com. The average kerosene consumption can then be derived from the aircraft type.

As the flight altitude and speed are standardized for each flight, this information is implicitly contained in the flight numbers.

PCA is used here in an attempt to retain the flight numbers, together with the information hidden in them, while reducing the number of columns for model training.

In [1]:
# importing modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import validation_curve

In [None]:
# reading data
df = pd.read_csv('data/Clean_Dataset.csv')
df.head()

In [None]:
# Train-Test-Split
df_train, df_test = train_test_split(df, test_size = 0.3, random_state = 42)

print(df_train.shape)
print(df_test.shape)

In [None]:
# Second Train-Test-Split for val/aim data
df_test, df_aim = train_test_split(df_test, test_size=0.33, random_state = 42)

print(df_test.shape)
print(df_aim.shape)

In [None]:
# splitting train data into features and target
features_train = df_train.drop('price', axis = 1)
target_train = df_train['price']

print(features_train.shape)
print(target_train.shape)

In [None]:
# OHE for flight number
ohe_feature = ['flight']

# Instantiating ColumnTransformer with OHE
# Since this is only a pretest of PCA on 'flight', all other features are dropped here
preprocessor = ColumnTransformer(
    transformers = [('ohe', OneHotEncoder(sparse_output = False, handle_unknown = 'ignore'), ohe_feature)],
    remainder = 'drop'
)


# Instantiating Pipeline
ohe_pipe = Pipeline([('preprocessor', preprocessor)])

# fitting the pipeline and transforming the training features
ohe_train = ohe_pipe.fit_transform(features_train)

# getting a DataFrame back with column names and the index from features_train
ohe_train = pd.DataFrame(
    data = ohe_train,
    columns = ohe_pipe.named_steps['preprocessor'].transformers_[0][1].get_feature_names_out(ohe_feature),
    index = features_train.index
)
ohe_train.head()

In [None]:
# testing how many components are needed for a given explained variance (here 0.8)
pca = PCA(n_components = 0.8)

# results for the tested explained variance values:
# 0.9 = 618 components
# 0.8 = 371 components
# 0.7 = 239 components

ohe_train_transformed = pca.fit_transform(ohe_train)
pca.components_

In [None]:
pd.DataFrame(ohe_train_transformed).shape

## Conclusion

* It is doubtful whether the column 'flight' will influence model performance
* PCA does not reduce the number of columns to the extent expected
    - Explained variance 0.9 equals 618 features
    - Explained variance 0.8 equals 371 features
    - Explained variance 0.7 equals 239 features
* We will test it with a model, expecting a long training process
* More explicit information behind the flight numbers is available
    - aircraft type via flightaware.com
    - from aircraft type to (average) kerosene consumption (since kerosene prices have an influence on flight prices)
    - Flight altitude (average)
    - Flight speed (average)
