
# **Method used in a KDD 2009 competition**

We will cover the feature selection approach used by data scientists at the University of Melbourne in the KDD 2009 data science competition. The task consisted of predicting churn from a dataset with a very large number of features.

The authors describe this as an aggressive non-parametric feature selection procedure based on examining the relationship between each feature and the target. Because each feature is evaluated on its own against the target, the method is classified as a filter method.

**The procedure consists of the following steps:**

For each categorical variable:

1) Separate into train and test

2) Determine the mean value of the target within each label of the categorical variable using the train set

3) Use that mean target value per label as the prediction (using the test set) and calculate the roc-auc.

For each numerical variable:

1) Separate into train and test

2) Divide the variable into 100 quantiles

3) Calculate the mean target within each quantile using the training set

4) Use that mean target value per bin as the prediction (using the test set) and calculate the roc-auc
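The steps for a numerical variable can be sketched as follows. This is a minimal illustration, not the authors' code: the helper name `quantile_bin_auc`, the synthetic data, and the fallback to the overall train mean for unmapped bins are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def quantile_bin_auc(x, y, n_bins=100, test_size=0.3, random_state=0):
    """ROC-AUC of one numerical feature scored by the per-bin target mean.

    Hypothetical helper written for illustration only.
    """
    # 1) separate into train and test
    x_train, x_test, y_train, y_test = train_test_split(
        x, y, test_size=test_size, random_state=random_state
    )

    # 2) learn the quantile edges on the train set only; duplicates="drop"
    #    merges repeated edges when the feature has many tied values
    _, edges = pd.qcut(x_train, q=n_bins, retbins=True, duplicates="drop")
    # widen the outer edges so test values outside the train range get a bin
    edges[0], edges[-1] = -np.inf, np.inf

    train_bins = pd.cut(x_train, bins=edges, labels=False)
    test_bins = pd.cut(x_test, bins=edges, labels=False)

    # 3) mean target within each bin, computed on the train set
    bin_means = y_train.groupby(train_bins).mean()

    # 4) use the bin mean as the prediction for each test row; bins without
    #    a train mean fall back to the overall train mean (an assumption)
    preds = test_bins.map(bin_means).fillna(y_train.mean())
    return roc_auc_score(y_test, preds)


# Synthetic data with a non-monotonic relationship: the target tends
# to be 1 when |x| is large, which a linear score would miss.
rng = np.random.default_rng(0)
x = pd.Series(rng.normal(size=2000))
y = pd.Series((np.abs(x) + rng.normal(scale=0.5, size=2000) > 1).astype(int))

auc = quantile_bin_auc(x, y, n_bins=20)
print(f"roc-auc: {auc:.3f}")
```

The example uses 20 bins instead of 100 only because the synthetic sample is small; with few observations per bin the bin means become noisy.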


The authors quote the following advantages of the method:

- Speed: computing means and quantiles is direct and efficient
- Stability with respect to scale: extreme values of continuous variables do not skew the predictions
- Comparability between categorical and numerical variables
- Accommodation of non-linearities

See my notes at the end of the notebook for a discussion on the method.

**Important:** The authors use the roc-auc here, but in principle we could use any metric, including those valid for regression.
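As a rough sketch of the regression case, the same idea can be scored with the mean squared error instead of the roc-auc, where a lower value means a more informative feature. The helper name `category_mean_mse`, the toy columns (`informative`, `noise`, `price`) and the fallback for unseen categories are assumptions for illustration, not part of the original method description.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def category_mean_mse(df, feature, target, test_size=0.3, random_state=0):
    """MSE obtained when the train-set target mean per category is the prediction.

    Hypothetical helper written for illustration only.
    """
    train, test = train_test_split(
        df, test_size=test_size, random_state=random_state
    )

    # mean of the continuous target per category, learned on the train set
    means = train.groupby(feature)[target].mean()

    # categories unseen in train fall back to the overall train mean (an assumption)
    preds = test[feature].map(means).fillna(train[target].mean())
    return mean_squared_error(test[target], preds)


# Toy data: "informative" drives the target, "noise" is unrelated to it.
rng = np.random.default_rng(0)
n = 600
toy = pd.DataFrame({
    "informative": rng.choice(["a", "b", "c"], size=n),
    "noise": rng.choice(["x", "y", "z"], size=n),
})
toy["price"] = (
    toy["informative"].map({"a": 10.0, "b": 20.0, "c": 30.0})
    + rng.normal(scale=2.0, size=n)
)

mse_informative = category_mean_mse(toy, "informative", "price")
mse_noise = category_mean_mse(toy, "noise", "price")
print(mse_informative, mse_noise)
```

With an error metric such as the MSE, features would be ranked in ascending order, the opposite of the roc-auc ranking.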

In [45]:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

In [46]:
# connect the google drive
from google.colab import drive

drive.mount("/content/drive")

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [47]:
# import the dataset
df = pd.read_csv("/content/drive/MyDrive/Data Science/Feature Selection/titanic_clean.csv")
df.head()

Unnamed: 0,pclass,survived,sex,age,sibsp,parch,fare,cabin,embarked
0,1,1,female,29.0,0,0,211.3375,B5,S
1,1,1,male,0.9167,1,2,151.55,C22,S
2,1,0,female,2.0,1,2,151.55,C22,S
3,1,0,male,30.0,1,2,151.55,C22,S
4,1,0,female,25.0,1,2,151.55,C22,S


In [48]:
# dict = df.groupby(["sex"])["age"].mean().to_dict()

In [49]:
# dict

In [50]:
# Select the features and the target
# ('survived' is kept in X because the mean encoding below needs it)
X = df[['pclass', 'sex', 'embarked', 'cabin', 'survived']]
y = df["survived"]

In [51]:
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3, random_state=0)
X_train.shape, X_test.shape

((914, 5), (392, 5))

## **Replace the categories with the mean target value**

In [52]:
def mean_encoding(df_train, df_test, categorical_vars, target="survived"):
  """Replace each category with the train-set mean of the target."""

  # first make temporary copies of the dataframes
  df_train_temp = df_train.copy()
  df_test_temp = df_test.copy()

  # iterate over each variable
  for col in categorical_vars:

    # mean of the target per category, learned from the train set only
    target_mean_dict = df_train.groupby(col)[target].mean().to_dict()

    # replace the categories by the mean of the target;
    # categories unseen in the train set become NaN in the test set
    df_train_temp[col] = df_train[col].map(target_mean_dict)
    df_test_temp[col] = df_test[col].map(target_mean_dict)

  # drop the target from the datasets
  df_train_temp.drop(labels=[target], axis=1, inplace=True)
  df_test_temp.drop(labels=[target], axis=1, inplace=True)

  return df_train_temp, df_test_temp

In [53]:
categorical_vars = ["pclass", "sex", "embarked", "cabin"]

X_train_enc, X_test_enc = mean_encoding(X_train, X_test, categorical_vars)

X_train_enc.head()

Unnamed: 0,pclass,sex,embarked,cabin
840,0.243902,0.199664,0.338534,0.295875
866,0.243902,0.199664,0.338534,0.295875
427,0.416667,0.199664,0.338534,0.295875
478,0.416667,0.199664,0.545946,0.295875
1305,0.243902,0.199664,0.338534,0.295875


In [54]:
X_test_enc.head()

Unnamed: 0,pclass,sex,embarked,cabin
609,0.243902,0.199664,0.338534,0.295875
412,0.416667,0.199664,0.338534,0.295875
528,0.416667,0.199664,0.338534,0.295875
1147,0.243902,0.716981,0.329545,0.295875
942,0.243902,0.199664,0.338534,0.295875


## **Determine the roc-auc using the encoded variables as predictions**

In [55]:
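# now calculate the roc-auc value for each feature, using the
# encoded variable (the train-set target mean) as the prediction
roc_values = []

for feature in categorical_vars:

    # categories absent from the train set map to NaN; fill them with 0
    roc_values.append(roc_auc_score(y_test, X_test_enc[feature].fillna(0)))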

In [42]:
X_test_enc.isnull().sum()

pclass      0
sex         0
embarked    0
cabin       0
dtype: int64
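Once a roc-auc value has been collected per feature, a natural next step is to rank the features and keep those that do better than random. The sketch below assumes a hypothetical helper `rank_by_auc` and a threshold of 0.5 (the roc-auc of a random prediction). The scores in the usage lines are placeholders for illustration, not results from this notebook.

```python
import pandas as pd


def rank_by_auc(feature_names, auc_values, threshold=0.5):
    """Rank features by roc-auc and keep those above the threshold.

    Hypothetical helper written for illustration only.
    """
    ranked = pd.Series(auc_values, index=feature_names).sort_values(ascending=False)
    selected = ranked[ranked > threshold].index.tolist()
    return ranked, selected


# Placeholder feature names and scores, for illustration only.
ranked, selected = rank_by_auc(["f1", "f2", "f3"], [0.48, 0.71, 0.63])
print(ranked)
print(selected)
```

In this notebook the call would be `rank_by_auc(categorical_vars, roc_values)`. The threshold is a judgment call and can be raised to make the selection more aggressive.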