
# **Method used in a KDD 2009 competition**

We will cover the feature selection approach used by data scientists at the University of Melbourne in the KDD 2009 data science competition. The task consisted of predicting churn from a dataset with a very large number of features.

The authors describe it as an aggressive non-parametric feature selection procedure based on examining the relationship between each feature and the target. Because each feature is evaluated on its own, without training a model on the full feature set, this method is classified as a filter method.

**The procedure consists of the following steps:**

For each categorical variable:

1) Separate into train and test

2) Determine the mean value of the target within each label of the categorical variable using the train set

3) Use that mean target value per label as the prediction (using the test set) and calculate the roc-auc.

For each numerical variable:

1) Separate into train and test

2) Divide the variable into 100 quantiles

3) Calculate the mean target within each quantile using the training set

4) Use that mean target value per bin as the prediction (using the test set) and calculate the roc-auc
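
The steps for a single categorical variable can be sketched as follows. This is a minimal illustration on made-up toy data, not the competition authors' code: the dataframe `toy`, the helper `single_feature_auc` and the choice to fill unseen labels with the overall train mean are all assumptions.

```python
import pandas as pd
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Toy data, invented for illustration: one categorical feature, one binary target
toy = pd.DataFrame({
    "colour": ["red", "blue", "red", "green", "blue", "red", "green", "blue"] * 10,
    "target": [1, 0, 1, 0, 0, 1, 1, 0] * 10,
})

def single_feature_auc(data, feature, target, test_size=0.3, seed=0):
    # 1) separate into train and test
    train, test = train_test_split(data, test_size=test_size, random_state=seed)
    # 2) mean target per label, learned on the train set only
    means = train.groupby(feature)[target].mean()
    # 3) use the mean per label as the prediction on the test set;
    #    labels unseen in train fall back to the overall train mean (an assumption)
    preds = test[feature].map(means).fillna(train[target].mean())
    return roc_auc_score(test[target], preds)

auc = single_feature_auc(toy, "colour", "target")
print(f"roc-auc for 'colour': {auc:.3f}")
```

A numerical variable would go through the same function after being binned, so that the bin number plays the role of the label.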


The authors quote the following advantages of the method:

- Speed: computing means and quantiles is direct and efficient
- Stability with respect to scale: extreme values of continuous variables do not skew the predictions
- Comparability between categorical and numerical variables
- Accommodation of non-linearities

See my notes at the end of the notebook for a discussion on the method.

**Important** The authors here use the roc-auc, but in principle, we could use any metric, including those valid for regression.
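
As a rough sketch of what using a regression metric could look like, the example below scores one categorical feature against a continuous target with the mean squared error. The data, the column names `city` and `price`, and the comparison against a global-mean baseline are assumptions made for illustration only.

```python
import pandas as pd
from sklearn.metrics import mean_squared_error

# Toy data, invented for illustration: continuous target instead of a binary one
train = pd.DataFrame({
    "city": ["a", "a", "b", "b", "c", "c"],
    "price": [10.0, 12.0, 20.0, 22.0, 30.0, 32.0],
})
test = pd.DataFrame({
    "city": ["a", "b", "c"],
    "price": [12.0, 20.0, 33.0],
})

# mean target per label, learned on the train set
means = train.groupby("city")["price"].mean()

# use the mean per label as the prediction on the test set
preds = test["city"].map(means)
mse = mean_squared_error(test["price"], preds)

# baseline: predict the overall train mean for every row
baseline = [train["price"].mean()] * len(test)
baseline_mse = mean_squared_error(test["price"], baseline)

print(mse, baseline_mse)
```

With an error metric such as the MSE, lower is better, so features would be ranked in ascending order rather than descending as with the roc-auc.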

In [1]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

In [2]:
# connect the google drive
from google.colab import drive

drive.mount("/content/drive")

Mounted at /content/drive


In [3]:
# import the dataset
df = pd.read_csv("/content/drive/MyDrive/Data Science/Feature Selection/titanic_clean.csv")
df.head()

Unnamed: 0,pclass,survived,sex,age,sibsp,parch,fare,cabin,embarked
0,1,1,female,29.0,0,0,211.3375,B5,S
1,1,1,male,0.9167,1,2,151.55,C22,S
2,1,0,female,2.0,1,2,151.55,C22,S
3,1,0,male,30.0,1,2,151.55,C22,S
4,1,0,female,25.0,1,2,151.55,C22,S


In [4]:
# dict = df.groupby(["sex"])["age"].mean().to_dict()

In [5]:
# dict

In [6]:
# Select the features and the target ("survived" stays in X because mean_encoding needs it)
X = df[['pclass', 'sex', 'embarked', 'cabin', 'survived']]
y = df["survived"]

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3, random_state=0)
X_train.shape, X_test.shape

((914, 5), (392, 5))

## **Replace the values with mean values**

In [8]:
def mean_encoding(df_train, df_test, categorical_vars):
  # Replace each category with the mean of the target ("survived")
  # computed on the train set; both dataframes must contain "survived"

  # first make temporary copies of the dataframes
  df_train_temp = df_train.copy()
  df_test_temp = df_test.copy()

  # iterate over each variable
  for col in categorical_vars:

    # mean of the target per category, learned from the train set only
    target_mean_dict = df_train.groupby(col)["survived"].mean().to_dict()

    # replace the categories by the mean of the target;
    # categories absent from the train set become NaN
    df_train_temp[col] = df_train[col].map(target_mean_dict)
    df_test_temp[col] = df_test[col].map(target_mean_dict)

  # drop the target from the dataset
  df_train_temp.drop(labels=["survived"], axis=1, inplace=True)
  df_test_temp.drop(labels=["survived"], axis=1, inplace=True)

  return df_train_temp, df_test_temp

In [9]:
categorical_vars = ["pclass", "sex", "embarked", "cabin"]

X_train_enc, X_test_enc = mean_encoding(X_train, X_test, categorical_vars)

X_train_enc.head()

Unnamed: 0,pclass,sex,embarked,cabin
840,0.243902,0.199664,0.338534,0.295875
866,0.243902,0.199664,0.338534,0.295875
427,0.416667,0.199664,0.338534,0.295875
478,0.416667,0.199664,0.545946,0.295875
1305,0.243902,0.199664,0.338534,0.295875


In [10]:
X_test_enc.head()

Unnamed: 0,pclass,sex,embarked,cabin
609,0.243902,0.199664,0.338534,0.295875
412,0.416667,0.199664,0.338534,0.295875
528,0.416667,0.199664,0.338534,0.295875
1147,0.243902,0.716981,0.329545,0.295875
942,0.243902,0.199664,0.338534,0.295875


## **Determine the roc-auc values using the variable values as input**

In [11]:
# now calculate the roc-auc value, using the encoded variables
# as predictions; labels unseen in the train set encode to NaN
# and are scored as 0 here
roc_values = []

for feature in categorical_vars:

    roc_values.append(roc_auc_score(y_test, X_test_enc[feature].fillna(0)))

In [12]:
X_test_enc.isnull().sum()

pclass       0
sex          0
embarked     0
cabin       46
dtype: int64
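
The missing values in an encoded test column come from labels that never appear in the train set, so the mapping has no mean for them. The sketch below reproduces this on made-up data and shows one possible alternative to filling with 0: falling back to the overall train mean of the target. This fallback is an assumption for illustration, not what the notebook does above.

```python
import pandas as pd

# Toy data, invented for illustration
train = pd.DataFrame({
    "cabin": ["A1", "A1", "B2", "B2"],
    "survived": [1, 0, 1, 1],
})
test = pd.DataFrame({"cabin": ["A1", "B2", "Z9"]})  # "Z9" never appears in train

# mean target per label and overall mean, both from the train set
means = train.groupby("cabin")["survived"].mean()
global_mean = train["survived"].mean()

# labels missing from the mapping become NaN
encoded = test["cabin"].map(means)
n_unseen = int(encoded.isnull().sum())

# possible fallback: replace NaN with the overall train mean instead of 0
encoded_filled = encoded.fillna(global_mean)
print(n_unseen, encoded_filled.tolist())
```

Filling with 0 treats every unseen label as the lowest possible prediction, whereas the overall mean treats it as uninformative; which is preferable depends on the data.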

In [13]:
m1 = pd.Series(roc_values)
m1.index = categorical_vars
m1.sort_values(ascending=False)

sex         0.784164
pclass      0.630389
embarked    0.573342
cabin       0.477412
dtype: float64

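Sex, pclass and embarked all have a roc-auc above 0.5, so each carries some information about survival. Cabin scores 0.48, below 0.5, so on its own it is no better than random here. Note that 46 test rows encode to NaN for cabin (labels absent from the train set) and were scored as 0.
Sex seems to be the most important feature to predict survival.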

## **Feature selection using numerical variables KDD method**

The procedure is exactly the same, but it requires one additional first step which is to divide the continuous variable into bins.

The authors of the method divide the variable into 100 quantiles, that is, 100 bins. In principle, you could use fewer bins. Here I will divide the variable into only 5 bins.

I will work with the numerical variables Age and Fare.
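
For reference, a 100-quantile split like the one the authors describe could look like the sketch below. The data is synthetic and the variable names are assumptions. The second part shows that, with `duplicates="drop"`, heavily tied values can produce fewer bins than requested.

```python
import numpy as np
import pandas as pd

# Synthetic continuous variable, invented for illustration
rng = np.random.default_rng(0)
values = pd.Series(rng.exponential(scale=30.0, size=5000))

# split into 100 quantile bins and keep the bin edges
binned, edges = pd.qcut(
    values, q=100, labels=False, retbins=True, duplicates="drop"
)
n_bins = len(edges) - 1

# a variable with many tied values: several quantile edges coincide,
# so duplicates="drop" merges them and fewer bins are returned
tied = pd.Series([0] * 50 + list(range(50)))
_, tied_edges = pd.qcut(tied, q=5, labels=False, retbins=True, duplicates="drop")
n_bins_tied = len(tied_edges) - 1

print(n_bins, n_bins_tied)
```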

In [14]:
# Select the numerical variables and the target
X = df[["age", "fare", "survived"]]
y = df["survived"]

In [15]:
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3, random_state=0)
X_train.shape, X_test.shape

((914, 3), (392, 3))

In [16]:
# Fill the missing values
X_train = X_train.fillna(0)
X_test = X_test.fillna(0)

## **Bin Variable Age**

In [18]:
X_train["age_binned"], intervals = pd.qcut(
    X_train["age"],
    q=5,
    labels=False,
    retbins=True,
    precision=3,
    duplicates="drop"
)

X_train[["age_binned", "age"]].head(10)

Unnamed: 0,age_binned,age
840,2,29.813199
866,4,43.0
427,4,44.0
478,1,25.0
1305,1,29.0
453,4,63.0
117,3,30.0
482,3,34.0
294,3,39.0
261,4,50.0


In [21]:
# now apply the interval limits learned on the train set to the test set;
# values outside those limits (or equal to the lowest edge) become NaN
X_test["age_binned"] = pd.cut(X_test["age"], bins=intervals, labels=False)

X_test[["age_binned", "age"]].head(10)

Unnamed: 0,age_binned,age
609,0.0,0.8333
412,3.0,34.0
528,0.0,19.0
1147,2.0,29.813199
942,2.0,29.813199
870,2.0,29.813199
5,4.0,48.0
231,4.0,47.0
731,0.0,9.0
1289,2.0,29.813199
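
When train-set bin edges are applied to the test set with `pd.cut`, test values below the lowest edge, equal to it, or above the highest edge are not assigned to any bin. The sketch below shows this on made-up ages, along with one possible workaround: opening the outer edges to minus and plus infinity. The workaround is an assumption for illustration and is not used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration
train_age = pd.Series([5.0, 10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0, 90.0])
test_age = pd.Series([2.0, 5.0, 35.0, 95.0])

# learn 5 quantile bins on the train set and keep the edges
_, edges = pd.qcut(train_age, q=5, labels=False, retbins=True, duplicates="drop")

# applying the edges as they are: 2.0 and 95.0 fall outside the range,
# and 5.0 equals the lowest edge, which pd.cut excludes by default
naive = pd.cut(test_age, bins=edges, labels=False)
n_missing = int(naive.isnull().sum())

# possible workaround: open the outer edges so every value lands in a bin
open_edges = edges.copy()
open_edges[0], open_edges[-1] = -np.inf, np.inf
safe = pd.cut(test_age, bins=open_edges, labels=False)

print(n_missing, safe.tolist())
```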


## **Bin Variable Fare**

In [22]:
X_train["fare_binned"], intervals = pd.qcut(X_train["fare"], q=5,
                                     labels=False,
                                     retbins=True,
                                     precision=3,
                                     duplicates="drop")


X_test["fare_binned"] = pd.cut(X_test["fare"], bins=intervals, labels=False)

X_test[["fare_binned", "fare"]]

Unnamed: 0,fare_binned,fare
609,1.0,9.3500
412,2.0,21.0000
528,1.0,10.5000
1147,0.0,7.7208
942,1.0,7.8958
...,...,...
911,0.0,7.7958
578,2.0,21.0000
1257,1.0,9.8417
1140,3.0,29.1250


In [23]:
X_train.isnull().sum()

age            0
fare           0
survived       0
age_binned     0
fare_binned    0
dtype: int64

In [24]:
# now use our already created function to
# encode the variables with target mean

binned_vars = ["age_binned", "fare_binned"]

X_train_enc, X_test_enc = mean_encoding(X_train[binned_vars+["survived"]],
                                        X_test[binned_vars+["survived"]], binned_vars)

X_train_enc.head()

Unnamed: 0,age_binned,fare_binned
840,0.254237,0.367232
866,0.421965,0.256831
427,0.421965,0.367232
478,0.379487,0.629834
1305,0.379487,0.207447


In [26]:
# now we calculate the roc-auc values, using the encoded
# variables as predictions

roc_values = []

for feature in binned_vars:
  roc_values.append(roc_auc_score(y_test, X_test_enc[feature].fillna(0)))

In [27]:
m2 = pd.Series(roc_values)
m2.index = binned_vars
m2.sort_values(ascending=False)

fare_binned    0.670674
age_binned     0.489970
dtype: float64

Fare is a much better predictor of survival than age. Age produces an essentially random output: its roc-auc is 0.49, slightly below 0.5.
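
Because the roc-auc is comparable between categorical and numerical variables, the two rankings can be combined into one. The sketch below re-enters the values printed above by hand so that it runs on its own. The cut-off of 0.5 used to keep features is an assumption, and a stricter threshold may be more appropriate in practice.

```python
import pandas as pd

# roc-auc values copied from the outputs above
m1 = pd.Series({"sex": 0.784164, "pclass": 0.630389,
                "embarked": 0.573342, "cabin": 0.477412})
m2 = pd.Series({"fare_binned": 0.670674, "age_binned": 0.489970})

# single ranking across categorical and binned numerical variables
ranking = pd.concat([m1, m2]).sort_values(ascending=False)

# keep the features that score above the assumed cut-off
threshold = 0.5
selected = ranking[ranking > threshold].index.tolist()

print(ranking)
print(selected)
```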