# Tutorial 2: Feature Reduction and Selection

---

## Introduction

Welcome! This tutorial shows how to reduce the number of features and select the most prominent ones. This is an important pre-processing step because infrared spectrum data are large and highly correlated.

In [55]:
import pandas as pd
import numpy as np

First, let us reload the data stored in the previous notebook:

In [82]:
%store -r X
%store -r Y
%store -r df

In [125]:
X

Unnamed: 0,833.647,833.915,834.183,834.451,834.720,834.989,835.258,835.527,835.796,836.066,...,2478.624,2480.996,2483.372,2485.753,2488.138,2490.529,2492.924,2495.323,2497.727,2500.136
0,-0.808493,-0.807557,-0.808127,-0.809277,-0.810315,-0.809986,-0.808822,-0.810160,-0.814925,-0.817989,...,-0.408186,-0.390167,-0.395094,-0.416035,-0.416191,-0.393794,-0.376064,-0.394183,-0.438806,-0.461273
1,-1.045204,-1.042834,-1.040064,-1.039891,-1.041951,-1.041888,-1.039677,-1.039588,-1.042774,-1.045571,...,-1.982225,-1.971150,-1.966837,-1.972678,-1.980933,-1.986981,-1.992071,-1.999241,-2.010441,-2.015706
2,-0.933558,-0.930759,-0.930131,-0.930111,-0.930078,-0.929796,-0.930579,-0.933991,-0.938503,-0.941131,...,-0.543787,-0.546740,-0.556853,-0.563482,-0.553554,-0.538869,-0.526329,-0.530286,-0.579681,-0.641481
3,-0.956279,-0.954154,-0.953762,-0.953193,-0.951764,-0.951278,-0.952847,-0.956050,-0.959460,-0.960475,...,-1.022473,-1.018331,-1.003590,-0.988366,-0.989729,-1.006131,-1.027455,-1.034726,-1.020874,-0.996992
4,-1.071554,-1.068693,-1.068226,-1.068877,-1.070526,-1.072038,-1.072885,-1.074952,-1.078072,-1.079284,...,-1.637289,-1.630079,-1.636046,-1.644734,-1.633950,-1.608915,-1.587258,-1.577930,-1.585577,-1.603032
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
498,-0.275430,-0.274320,-0.274200,-0.274479,-0.274608,-0.277242,-0.281107,-0.280976,-0.276278,-0.272611,...,-0.491406,-0.485925,-0.464286,-0.467758,-0.490807,-0.515450,-0.524820,-0.519023,-0.512824,-0.488398
499,-0.298707,-0.295062,-0.294296,-0.294798,-0.296765,-0.301067,-0.302066,-0.298652,-0.296644,-0.297677,...,-1.157046,-1.173071,-1.175726,-1.162370,-1.142283,-1.133218,-1.141321,-1.142113,-1.121310,-1.100813
500,-0.514468,-0.516540,-0.518424,-0.519333,-0.522529,-0.528154,-0.530454,-0.526756,-0.521762,-0.520717,...,-2.076827,-2.073034,-2.058804,-2.051791,-2.053137,-2.047367,-2.043846,-2.051221,-2.057879,-2.057166
501,0.014295,0.014717,0.016864,0.017340,0.015010,0.011359,0.009173,0.008830,0.009937,0.013005,...,1.524477,1.566631,1.663975,1.720487,1.749993,1.765076,1.669538,1.535575,1.491738,1.417050
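
To see why neighbouring wavenumbers carry overlapping information, one can measure the correlation between adjacent columns. The sketch below uses synthetic smooth curves rather than the apple spectra, so the numbers are only illustrative; `fake_spectra` is a hypothetical stand-in for `X`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical stand-in for X: 50 smooth "spectra" with 200 channels each,
# built as cumulative sums of noise so that neighbouring channels are similar.
fake_spectra = pd.DataFrame(rng.normal(size=(50, 200)).cumsum(axis=1))

# Pearson correlation between each column and its right-hand neighbour.
left = fake_spectra.iloc[:, :-1].to_numpy()
right = fake_spectra.iloc[:, 1:].to_numpy()
adjacent_corr = pd.Series(
    [np.corrcoef(left[:, j], right[:, j])[0, 1] for j in range(left.shape[1])]
)
print(adjacent_corr.median())
```

When adjacent columns are this strongly correlated, aggregating blocks of neighbouring columns loses comparatively little information.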


---

One way to reduce the number of features is to average every *x* consecutive columns of a DataFrame and collect the results in a new DataFrame. How can we do that?

<b><i> Example for data reduction </i></b> 

First, let us create a dummy DataFrame:

In [98]:
df = pd.DataFrame({'A': [0, 1, 2, 3, 4], 'B': [0, 10, 20, 30, 40], 'C': [0, 100, 200, 300, 400], 'D': [0, 1, 2, 3, 4],  'E': [0, 1, 2, 3, 4],  'F': [0, 1, 2, 3, 4]})
df

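|   | A | B | C | D | E | F |
|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 10 | 100 | 1 | 1 | 1 |
| 2 | 2 | 20 | 200 | 2 | 2 | 2 |
| 3 | 3 | 30 | 300 | 3 | 3 | 3 |
| 4 | 4 | 40 | 400 | 4 | 4 | 4 |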


Now let us take the sum over every 3 consecutive columns:

In [99]:
df = df.rolling(3, axis = 1).sum()  # axis = 1 makes the rolling window slide across the columns
print(df)


    A   B      C      D      E     F
0 NaN NaN    0.0    0.0    0.0   0.0
1 NaN NaN  111.0  111.0  102.0   3.0
2 NaN NaN  222.0  222.0  204.0   6.0
3 NaN NaN  333.0  333.0  306.0   9.0
4 NaN NaN  444.0  444.0  408.0  12.0
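
In pandas 2.1 and later the `axis` argument of `rolling` is deprecated. A minimal sketch of an equivalent computation, assuming the same dummy data, is to transpose, roll over the rows, and transpose back. The names `dummy` and `rolled` are only illustrative.

```python
import pandas as pd

dummy = pd.DataFrame({'A': [0, 1, 2, 3, 4], 'B': [0, 10, 20, 30, 40],
                      'C': [0, 100, 200, 300, 400], 'D': [0, 1, 2, 3, 4],
                      'E': [0, 1, 2, 3, 4], 'F': [0, 1, 2, 3, 4]})

# Transpose so the columns become rows, roll over the rows, then transpose back.
rolled = dummy.T.rolling(3).sum().T
print(rolled)
```

The first two columns are `NaN` in both versions because a window of 3 is not yet full there.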


We are only interested in the non-overlapping windows, so we keep every third column, starting from the third:

In [101]:
df = df.iloc[:, 2::3]
df

|   | C | F |
|---|---|---|
| 0 | 0.0 | 0.0 |
| 1 | 111.0 | 3.0 |
| 2 | 222.0 | 6.0 |
| 3 | 333.0 | 9.0 |
| 4 | 444.0 | 12.0 |
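
The same non-overlapping sums can also be computed directly, without building the overlapping windows first. The following is a sketch of that alternative using a NumPy reshape, not the method used in this notebook; the names `dummy`, `block_sums`, and `reduced` are illustrative.

```python
import pandas as pd

dummy = pd.DataFrame({'A': [0, 1, 2, 3, 4], 'B': [0, 10, 20, 30, 40],
                      'C': [0, 100, 200, 300, 400], 'D': [0, 1, 2, 3, 4],
                      'E': [0, 1, 2, 3, 4], 'F': [0, 1, 2, 3, 4]})

w = 3                                  # number of columns per block
n_blocks = dummy.shape[1] // w         # complete blocks only; leftover columns are dropped
values = dummy.to_numpy()[:, :n_blocks * w]

# Reshape to (rows, blocks, w) and sum inside each block.
block_sums = values.reshape(len(dummy), n_blocks, w).sum(axis=2)

# Label each block with the name of its last column, as the rolling approach does.
reduced = pd.DataFrame(block_sums, index=dummy.index,
                       columns=dummy.columns[w - 1::w][:n_blocks])
print(reduced)
```

This avoids computing the overlapping windows that are discarded afterwards, which may matter for wide spectra.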


DONE

---

<b><i> Apple data feature reduction </i></b> 

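Now we apply the same procedure to the apple data and create several reduced datasets, one per window size. The aggregation is passed in as `method`: the default is the sum, and passing `method = mean_df` takes the mean instead.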

In [172]:
# Each helper receives a pandas Rolling object and returns the aggregated DataFrame.
def mean_df(rolled):
    return rolled.mean()

def sum_df(rolled):
    return rolled.sum()

def skew_df(rolled):
    return rolled.skew()

def kurt_df(rolled):
    return rolled.kurt()

def creat_rollingData(df = X, window_arr = (10, 20, 30, 40, 50, 100), method = sum_df, ax = 1):
    """Return one reduced DataFrame per window size in window_arr."""
    df_arr = []
    for w in window_arr:
        # Aggregate every w consecutive columns with the chosen statistic.
        df_tmp = method(df.rolling(w, axis = ax))
        # Keep only the non-overlapping windows (every w-th column).
        df_tmp = df_tmp.iloc[:, w-1::w]
        df_arr.append(df_tmp)
        print(df_tmp.shape)
    return df_arr

In [173]:
X_reduced = creat_rollingData ()


(503, 207)
(503, 103)
(503, 69)
(503, 51)
(503, 41)
(503, 20)
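
Each reduced dataset keeps one column per complete window, so its width is the original number of columns divided by the window size, rounded down; columns left over at the end are dropped. A small sketch on a hypothetical 7-column frame makes this visible. The frame `toy` and the helper `reduce_columns` are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd

def reduce_columns(frame, w, stat="mean"):
    """Aggregate every w consecutive columns (hypothetical helper for illustration)."""
    # Transpose, roll over rows, transpose back, then keep non-overlapping windows.
    rolled = getattr(frame.T.rolling(w), stat)().T
    return rolled.iloc[:, w - 1::w]

toy = pd.DataFrame(np.arange(14, dtype=float).reshape(2, 7), columns=list("ABCDEFG"))

for w in (2, 3, 4):
    out = reduce_columns(toy, w)
    # The reduced width equals the floor of n_columns / w.
    print(w, out.shape, toy.shape[1] // w)
```

With `w = 3`, for example, column `G` does not fill a window and is left out of the result.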


In [161]:
creat_rollingData ()



      836.066    838.771    841.493    844.233    846.991    849.768   \
0    -8.105650  -8.109480  -8.084608  -8.110873  -8.104027  -8.101889   
1   -10.419442 -10.433837 -10.428431 -10.430074 -10.415331 -10.418868   
2    -9.328635  -9.358261  -9.337652  -9.354185  -9.327740  -9.329143   
3    -9.549263  -9.570040  -9.572426  -9.601431  -9.592159  -9.597327   
4   -10.725107 -10.765805 -10.784403 -10.795295 -10.778947 -10.790742   
..         ...        ...        ...        ...        ...        ...   
498  -2.761251  -2.726196  -2.791043  -2.823910  -2.778839  -2.801189   
499  -2.975734  -2.985233  -3.001504  -3.064809  -3.068415  -3.079237   
500  -5.219136  -5.152752  -5.200915  -5.235324  -5.220110  -5.255679   
501   0.130532   0.181800   0.135327   0.099196   0.083375   0.076526   
502  -0.391250  -0.355770  -0.399536  -0.480859  -0.416534  -0.458868   

      852.562    855.375    858.206    861.056   ...   2292.343   2312.793  \
0    -8.076384  -8.088410  -8.073566  -8.0697