# **Content**:

 -  3 [Data Preprocessing](#DataPreprocessing)
   - 3.1  [Preparation](#Preparation)
   - 3.2 [Divide numerical and categorical features](#divide)
   - 3.3 [Filling in missing values](#fill)
   - 3.4 [Normalize and PCA transformation](#transform)
   - 3.5 [Normalize Y variable](#y)

# 3 Data Preprocessing <a class="anchor" id="DataPreprocessing"></a>
This part covers data preprocessing and feature selection.

## Preparation <a class="anchor" id="Preparation"></a>
This part imports the required packages and loads the data.

In [1]:
# import related packages
import pandas as pd
import numpy as np
from sklearn.preprocessing import QuantileTransformer
from sklearn.decomposition import PCA
from sklearn.preprocessing import Normalizer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

In [2]:
# load the data processed in the first part
train_data = pd.read_csv('../Data_preprocess/X_train.csv')
test_data = pd.read_csv('../Data_preprocess/X_test.csv')
y_data = pd.read_csv('../Data_preprocess/y_train.csv')

## Divide numerical and categorical features <a class="anchor" id="divide"></a>
The dataset contains two kinds of feature variables: numerical and categorical.

In [3]:
#numerical variables
num_features_final = ['Feature_2', 'Feature_3', 'Feature_4', 'Feature_6',
                      'Feature_11', 'Feature_14',
                      'Feature_17', 'Feature_18', 'Feature_19',
                      'Feature_21', 'Feature_22', 'Feature_23', 'Feature_24', 'Feature_25',
                      'Ret_MinusTwo', 'Ret_MinusOne', 'R_Agg', 'R_Agg_Std',
                      'R_Std']

#categorical variables
cat_features_final = ['Feature_1', 'Feature_5', 'Feature_7', 'Feature_8', 'Feature_9', 'Feature_10',
                      'Feature_12', 'Feature_15', 'Feature_16', 'Feature_20']

#Y variables
targets = ['Ret_PlusOne', 'Ret_PlusTwo'] 

features_final = num_features_final + cat_features_final

train_X_data = train_data[features_final]
test_X_data = test_data[features_final]
train_Y_data = y_data[targets]
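
Because the feature lists are written by hand, a quick consistency check can catch a column that is listed twice or does not exist in the data. The sketch below is only an illustration: `check_feature_lists` is a hypothetical helper, and the toy frame stands in for `train_data` with made-up values.

```python
import pandas as pd

def check_feature_lists(df, num_cols, cat_cols):
    """Return (overlap, missing): columns listed in both groups,
    and listed columns that are absent from the DataFrame."""
    overlap = sorted(set(num_cols) & set(cat_cols))
    missing = sorted(set(num_cols + cat_cols) - set(df.columns))
    return overlap, missing

# Toy frame standing in for train_data (values are made up)
toy = pd.DataFrame({
    'Feature_1': [1, 2],
    'Feature_2': [0.1, 0.2],
    'Ret_MinusOne': [0.01, -0.02],
})

overlap, missing = check_feature_lists(
    toy,
    num_cols=['Feature_2', 'Ret_MinusOne'],
    cat_cols=['Feature_1', 'Feature_5'],
)
print(overlap)  # []
print(missing)  # ['Feature_5']
```

With the real data, the same call would take `train_data`, `num_features_final`, and `cat_features_final`; both returned lists should be empty.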

## Filling in missing values <a class="anchor" id="fill"></a>
Because Kaggle does not disclose the actual meaning of each feature, we fill the missing values of all feature variables with a constant (`SimpleImputer` uses 0 for numeric columns when no `fill_value` is given). This is a simple and conservative approach.

In [4]:
imputer = SimpleImputer(strategy='constant')
train_X_data = pd.DataFrame(imputer.fit_transform(train_X_data), columns=features_final)
test_X_data = pd.DataFrame(imputer.transform(test_X_data), columns=features_final)
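
To make the behavior of constant imputation concrete, here is a minimal sketch on toy data. The frames `toy_train` and `toy_test` and their values are invented for illustration; the only behavior relied on is that `SimpleImputer(strategy='constant')` fills numeric columns with 0 when `fill_value` is left at its default.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data with missing entries (values are made up)
toy_train = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [np.nan, 5.0, 6.0]})
toy_test = pd.DataFrame({'a': [np.nan, 2.0], 'b': [7.0, np.nan]})

# fill_value is not set, so numeric columns are filled with 0
toy_imputer = SimpleImputer(strategy='constant')

# Fit on the training frame, then reuse the fitted imputer on the test frame
filled_train = pd.DataFrame(toy_imputer.fit_transform(toy_train), columns=toy_train.columns)
filled_test = pd.DataFrame(toy_imputer.transform(toy_test), columns=toy_test.columns)

print(filled_train)
print(filled_test)
```

Fitting on the training data only and calling `transform` on the test data mirrors the cell above.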

## Normalize and PCA transformation <a class="anchor" id="transform"></a>
We used a robust transformer, `QuantileTransformer`, to map the numerical variables in the dataset onto a normal distribution. This transformation greatly improved our model's accuracy.
We also apply PCA to the categorical variables and keep two principal components, which together explain more than 99% of the total variance; each row of the two components is then L2-normalized. This step also helped our predictions considerably.
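
As a rough illustration of what the quantile transformation does, the sketch below applies the same `QuantileTransformer` settings to a synthetic right-skewed column and compares the sample skewness before and after. The exponential data and the `skewness` helper are assumptions made for this example, not part of the original analysis.

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(0)
# Synthetic right-skewed column (not the competition data)
skewed = rng.exponential(scale=1.0, size=(1000, 1))

qt = QuantileTransformer(n_quantiles=300, output_distribution='normal', random_state=0)
gaussianized = qt.fit_transform(skewed)

def skewness(x):
    """Sample skewness: third central moment divided by the cubed standard deviation."""
    x = x.ravel()
    return float(np.mean((x - x.mean()) ** 3) / x.std() ** 3)

print(skewness(skewed))        # clearly positive for exponential data
print(skewness(gaussianized))  # much closer to zero after the transform
```

The transform depends only on the rank of each value, which is why it is insensitive to outliers in the raw scale.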

In [5]:
# normalize numerical variables
num_transformer = QuantileTransformer(n_quantiles=300, output_distribution='normal', random_state=0)

# apply PCA to categorical variables, then L2-normalize each row
cat_transformer_nominal = Pipeline(steps=[
    ('pca', PCA(n_components=2, random_state=0)),
    ('norm', Normalizer(norm='l2')),
])

# apply each transformer to its own group of columns
preprocessor_X = ColumnTransformer(
    transformers=[
        ('num', num_transformer, num_features_final),
        ('cat_nom', cat_transformer_nominal, cat_features_final),
    ])

# find the explained variance ratio of each principal component
pca = PCA(n_components=2, random_state=0)
pca.fit(train_X_data[cat_features_final])
pca.explained_variance_ratio_

array([9.99999918e-01, 3.10451203e-08])
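
In the output above, the first component alone carries almost all of the variance. One situation in which PCA produces such a pattern is when a single column has a much larger numeric range than the others; whether that applies to this dataset has not been checked here. The sketch below uses synthetic data (an assumption for illustration only) to show the effect and how to compute the cumulative explained variance.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic stand-in: one wide-range column plus two narrow ones
X_toy = np.column_stack([
    rng.normal(scale=1000.0, size=500),
    rng.normal(scale=1.0, size=500),
    rng.normal(scale=0.1, size=500),
])

toy_pca = PCA(n_components=2, random_state=0).fit(X_toy)
ratios = toy_pca.explained_variance_ratio_
cumulative = np.cumsum(ratios)  # running total of variance explained

print(ratios)
print(cumulative[-1])  # share of variance kept by the two components
```

If a dominant column turns out to be the cause, standardizing the inputs before PCA is a common remedy, but that would be a change to the pipeline described here.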

## Transformed data for model training 

In [6]:
final_columns = num_features_final + ['pca1', 'pca2']
Preprocessed_x_train = pd.DataFrame(preprocessor_X.fit_transform(train_X_data), columns=final_columns)
Preprocessed_x_test = pd.DataFrame(preprocessor_X.transform(test_X_data), columns=final_columns)
Preprocessed_x_train.head()

| | Feature_2 | Feature_3 | Feature_4 | Feature_6 | Feature_11 | Feature_14 | Feature_17 | Feature_18 | Feature_19 | Feature_21 | ... | Feature_23 | Feature_24 | Feature_25 | Ret_MinusTwo | Ret_MinusOne | R_Agg | R_Agg_Std | R_Std | pca1 | pca2 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.289008 | -0.507163 | -0.413924 | -0.083932 | 0.319742 | 0.927037 | 1.006071 | -0.764086 | 0.321578 | -0.306535 | ... | -1.298895 | 0.781034 | 1.078218 | 1.94094 | -0.670442 | -2.190017 | 0.508973 | 1.489095 | 1.0 | 0.000882 |
| 1 | 0.289008 | -0.507163 | -0.413924 | 0.522191 | -0.104132 | -0.098964 | -0.575548 | 0.337299 | 0.363299 | 0.510962 | ... | 0.456558 | 1.731163 | -0.394168 | 0.607782 | 0.282945 | -1.185813 | -0.970219 | -0.668901 | -1.0 | 0.000185 |
| 2 | -0.579391 | 0.36699 | -0.904483 | 0.619693 | -0.158031 | 1.036359 | -0.860541 | 1.016131 | -0.676719 | -0.306535 | ... | 1.175681 | -0.659845 | -1.148629 | 0.203559 | 0.502281 | -0.101822 | -0.258746 | -1.170008 | -1.0 | 0.000144 |
| 3 | -0.576944 | 1.110585 | 0.324578 | 0.155162 | -1.261308 | -2.325095 | -0.858695 | 0.857025 | 1.855358 | 0.729778 | ... | 0.915947 | -0.653676 | 1.078218 | 0.061111 | 0.943021 | 0.286224 | -1.436142 | -0.129346 | -1.0 | -7e-05 |
| 4 | -1.787541 | 2.063238 | 1.170043 | 1.947869 | -2.311977 | -1.766005 | 2.730373 | -2.742027 | 0.063184 | 1.720474 | ... | 1.495378 | -1.982066 | 2.649022 | -1.036283 | 0.791777 | -0.40869 | 0.519264 | 0.647671 | -1.0 | 0.000206 |


## Normalize Y variable <a class="anchor" id="y"></a>

In [7]:
# Target transformer
preprocessor_Y = Pipeline(steps=[
    ('quantile', QuantileTransformer(n_quantiles=300, output_distribution='normal', random_state=0))
])
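
The cell above only defines the target transformer; how it is applied is not shown in this part. A typical pattern, sketched here under that assumption, is to fit it on the training targets and later call `inverse_transform` to bring model predictions back to the original return scale. The synthetic `toy_targets` frame and the `toy_preprocessor_Y` name are made up for this example.

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(0)
# Synthetic heavy-tailed "returns" standing in for train_Y_data
toy_targets = pd.DataFrame(
    rng.standard_t(df=3, size=(500, 2)) * 0.01,
    columns=['Ret_PlusOne', 'Ret_PlusTwo'],
)

toy_preprocessor_Y = Pipeline(steps=[
    ('quantile', QuantileTransformer(n_quantiles=300, output_distribution='normal', random_state=0))
])

# Forward transform: the targets a model would be trained on
y_transformed = toy_preprocessor_Y.fit_transform(toy_targets)

# Inverse transform: map values in the transformed space back to returns
y_restored = toy_preprocessor_Y.inverse_transform(y_transformed)

print(np.abs(y_restored - toy_targets.to_numpy()).max())  # round-trip error
```

In practice the array passed to `inverse_transform` would be the model's predictions rather than `y_transformed` itself; the round trip here only checks that the mapping is reversible.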