### Feature Analysis on MovieLens 100K Dataset

This notebook shows how feature analysis can guide decisions about features. The purpose of feature analysis is to identify the influential features and thereby improve model performance. The process is organized as follows:
1. [Feature Preprocessing]: fix the dataset, fill in NaN values, and apply feature engineering
2. [Feature Analysis]: Feature correlation on numerical features
3. [Feature Analysis]: Chi-Square test of independence on categorical features
4. [Feature Analysis]: Two-Sample z-Test on numerical features
5. [Feature Analysis]: Feature importance from a random forest model and permutation importance
6. [Decision]: prioritize the features

The problem is formulated as follows:
- With limited time and resources, which features should be added to the model first?

- It can be treated as a CTR problem in a recommender system: `label` is set to 1 for movie ratings greater than 3 and to 0 for ratings lower than 3, and samples with a rating of exactly 3 are dropped.
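
The labeling rule can be sketched with pandas as follows. This is only an illustration: the `ratings` frame and its column names are hypothetical, and the `ML100K` loader may construct the label differently.

```python
import pandas as pd

# Hypothetical ratings table; column names are assumptions for illustration.
ratings = pd.DataFrame({
    'user_id': [1, 1, 2, 2, 3],
    'item_id': [10, 11, 10, 12, 11],
    'rating':  [5, 3, 2, 4, 1],
})

# Drop the ambiguous rating 3, then binarize what is left.
labeled = ratings[ratings['rating'] != 3].copy()
labeled['label'] = (labeled['rating'] > 3).astype(int)
print(labeled)
```

Dropping the middle rating keeps the two classes clearly separated, at the cost of discarding some interactions.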

### 1. Feature Preprocessing: Dataset
`ML100K` generates the dataset and uses `random_seed` to ensure reproducibility. The features `age_interval` and `freshness` are produced by feature engineering:

- `age_interval`: bucketizes the `age` feature into four quantile bins, each holding about 25% of the samples

- `freshness`: derived from `timestamp` and `year`; it represents the time between the movie's release and the user's rating
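
A minimal sketch of these two engineered features, under stated assumptions: `timestamp` is taken to be Unix seconds, `year` the movie's release year, and `freshness` the difference in years between the two. The exact formula inside `ML100K` is not shown in this notebook, so treat the definitions below as illustrative.

```python
import pandas as pd

# Toy frame; values and the freshness formula are assumptions for illustration.
users = pd.DataFrame({
    'age': [18, 22, 25, 31, 38, 45, 52, 60],
    'timestamp': [880000000] * 8,   # Unix seconds (late 1997)
    'year': [1995, 1994, 1977, 1996, 1960, 1997, 1985, 1990],
})

# Quartile bucketing: four bins, each with roughly 25% of the samples.
users['age_interval'] = pd.qcut(users['age'], q=4, labels=False)

# Years elapsed between the movie's release and the rating.
rating_year = pd.to_datetime(users['timestamp'], unit='s').dt.year
users['freshness'] = rating_year - users['year']
print(users)
```

With a nearly constant rating year, `freshness` is close to a negated, shifted copy of `year`, which is consistent with the strong negative correlation between the two reported in the next section.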

In [73]:
%reload_ext autoreload
%autoreload 2

from feature_analysis.data import ML100K

categorical = ['user_id', 'item_id', 'gender', 'occupation', 'age_interval']
numerical = ['timestamp', 'year', 'age', 'freshness']

data = ML100K(data_dir='/DATA/', 
              categorical=categorical, 
              numerical=numerical, 
              apply_fillnan=True, 
              apply_preprocessing=False, 
              random_seed=42)
phase_data = data.phase_data
train_df, train_label = phase_data['train']
train_df['label'] = train_label

print(f'number of train_df: {len(train_df)}')

number of train_df: 58284


### 2. Feature Analysis: Feature Correlation on numerical features
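Numerical features can be analyzed directly through their pairwise correlations. `df_correlation` returns feature pairs with their correlation, ranked by the absolute value of the correlation; with `threshold=0.1`, only pairs whose absolute correlation exceeds 0.1 appear in the output.

The result shows that no feature is highly correlated with `label`: the strongest are `year` and `freshness`, at roughly 0.17 in absolute value (Pearson). In contrast, `freshness` and `year` are almost perfectly negatively correlated with each other, because `freshness` is generated from `year` and `timestamp`.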

In [2]:
from feature_analysis.analysis import df_correlation

copy_df = train_df.copy()
copy_df['label'] = train_label

corr_pair = df_correlation(df=copy_df,
                           method='pearson',
                           show_image=False,
                           threshold=0.1,
                           descending=True,
                           features=['label', *numerical])
print('pearson correlation', corr_pair)
corr_pair = df_correlation(df=copy_df,
                           method='spearman',
                           show_image=False,
                           threshold=0.1,
                           descending=True,
                           features=['label', *numerical])
print('spearman correlation', corr_pair)

pearson correlation [(-0.9994025745385305, 'year', 'freshness'), (-0.17023564184993722, 'label', 'year'), (0.16856944163951673, 'label', 'freshness'), (0.15277789320652435, 'timestamp', 'age'), (0.14378317122976503, 'age', 'freshness'), (-0.14035769097585107, 'year', 'age')]
spearman correlation [(-0.9834729387676251, 'year', 'freshness'), (-0.1936631616037067, 'label', 'year'), (0.18701100858712852, 'label', 'freshness'), (0.12179751746853304, 'age', 'freshness'), (0.11959637376161228, 'timestamp', 'age'), (-0.11135941759729562, 'year', 'age')]
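
A ranked pair list like the one above can be reproduced with plain pandas. The helper below, `ranked_correlations`, is a hypothetical stand-in and not the actual implementation of `df_correlation`; it assumes the threshold is applied to the absolute correlation.

```python
import numpy as np
import pandas as pd


def ranked_correlations(df, features, method='pearson', threshold=0.1):
    """Return (corr, feature_a, feature_b) tuples with |corr| >= threshold,
    sorted by absolute correlation in descending order."""
    corr = df[features].corr(method=method)
    pairs = []
    for i, a in enumerate(features):
        for b in features[i + 1:]:      # upper triangle only, no self-pairs
            value = float(corr.loc[a, b])
            if abs(value) >= threshold:
                pairs.append((value, a, b))
    return sorted(pairs, key=lambda p: abs(p[0]), reverse=True)


# Toy data: freshness is a deterministic function of year, noise is unrelated.
rng = np.random.default_rng(0)
year = rng.integers(1950, 1998, size=500)
demo = pd.DataFrame({
    'year': year,
    'freshness': 1998 - year,
    'noise': rng.normal(size=500),
})
pairs = ranked_correlations(demo, ['year', 'freshness', 'noise'])
print(pairs[0])
```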


### 3. Feature Analysis: Chi-Square test of independence on categorical features and label

In this section, `gender`, `age_interval`, and `occupation` are considered. To keep the samples within each hypothesis test independent, only one movie and its label are sampled from each user per test. The whole test is run 100 times, and the results are processed with a multiple testing correction.

In [65]:
from feature_analysis.analysis import hypothesis_test

copy_df = train_df.copy()
user_df = copy_df.groupby('user_id').agg({
    'gender': lambda a: list(a)[0],
    'label': list
}).reset_index()

hypothesis_test(user_df,
                feature='gender',
                label='label',
                times=100,
                test_type='chi_square_independence')


{'total_times': 100, 'significant_count': 0}

In [64]:
copy_df = train_df.copy()
user_df = copy_df.groupby('user_id').agg({
    'age_interval': lambda a: list(a)[0],
    'label': list
}).reset_index()

hypothesis_test(user_df,
                feature='age_interval',
                label='label',
                times=100,
                test_type='chi_square_independence')


{'total_times': 100, 'significant_count': 0}

In [67]:
user_df = copy_df.groupby('user_id').agg({
    'occupation': lambda a: list(a)[0],
    'label': list
}).reset_index()
hypothesis_test(user_df.copy(),
                feature='occupation',
                label='label',
                times=100,
                test_type='chi_square_independence')


{'total_times': 100, 'significant_count': 0}
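
The sampling procedure described in this section can be sketched as follows. This is not the source of `hypothesis_test`: the function name `repeated_chi_square`, the significance level of 0.05, and the choice of a Bonferroni correction are all assumptions, since the notebook does not state which correction is used.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


def repeated_chi_square(user_df, feature, label='label',
                        times=100, alpha=0.05, seed=42):
    """Run `times` chi-square independence tests, each on one randomly
    sampled label per user, and count Bonferroni-significant runs."""
    rng = np.random.default_rng(seed)
    p_values = []
    for _ in range(times):
        # One label per user keeps the observations independent.
        sampled = user_df[label].apply(
            lambda labels: labels[rng.integers(len(labels))])
        table = pd.crosstab(user_df[feature], sampled)
        _, p_value, _, _ = chi2_contingency(table)
        p_values.append(p_value)
    # Bonferroni correction (assumed): compare each p-value to alpha / times.
    significant = sum(p < alpha / times for p in p_values)
    return {'total_times': times, 'significant_count': int(significant)}


# Toy per-user frame: gender is independent of the labels by construction.
rng = np.random.default_rng(0)
toy_users = pd.DataFrame({
    'gender': rng.choice(['M', 'F'], size=300),
    'label': [rng.integers(0, 2, size=5).tolist() for _ in range(300)],
})
print(repeated_chi_square(toy_users, feature='gender', times=20))
```

Because the 100 runs resample the same users, their p-values are correlated; a conservative correction such as Bonferroni remains valid under that dependence.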

### 4. Feature Analysis: Two Sample z-test on numerical features
In this section, `freshness`, `age`, and `year` are considered. As in the previous section, only one movie and its label are sampled from each user.

In [68]:
copy_df = train_df.copy()
user_df = copy_df.groupby('user_id').agg({'freshness': list, 'label': list})
hypothesis_test(user_df.copy(),
                feature='freshness',
                label='label',
                times=100,
                test_type='ztest')


{'total_times': 100, 'significant_count': 100}

In [71]:
user_df = copy_df.groupby('user_id').agg({'age': list, 'label': list})
hypothesis_test(user_df.copy(),
                feature='age',
                label='label',
                times=100,
                test_type='ztest')


{'total_times': 100, 'significant_count': 22}

In [72]:
user_df = copy_df.groupby('user_id').agg({'year': list, 'label': list})
hypothesis_test(user_df.copy(),
                feature='year',
                label='label',
                times=100,
                test_type='ztest')

{'total_times': 100, 'significant_count': 100}
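
For reference, a two-sample z-test on one such sample can be written in a few lines. The sketch below uses unpooled variances and a two-sided alternative; both are assumptions, and `hypothesis_test` may use a different variant (for example, a pooled variance).

```python
import math
import numpy as np


def two_sample_ztest(x, y):
    """Two-sided two-sample z-test for a difference in means.
    Returns (z statistic, p-value). Uses unpooled sample variances."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    se = math.sqrt(x.var(ddof=1) / len(x) + y.var(ddof=1) / len(y))
    z = (x.mean() - y.mean()) / se
    # Two-sided p-value: 2 * (1 - Phi(|z|)) == erfc(|z| / sqrt(2)).
    p = math.erfc(abs(z) / math.sqrt(2))
    return float(z), p


# Toy example: feature values of positive vs. negative samples.
rng = np.random.default_rng(0)
pos = rng.normal(loc=10.0, scale=5.0, size=400)
neg = rng.normal(loc=8.0, scale=5.0, size=400)
z, p = two_sample_ztest(pos, neg)
print(f'z = {z:.2f}, p = {p:.4g}')
```

In the setting of this section, `x` would hold the feature values (for example `freshness`) of the sampled rows with `label == 1` and `y` those with `label == 0`.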

### 5. Feature Analysis: Feature importance on random forest model and permutation importance
Before the random forest model is run on the dataset, the features are preprocessed by filling in NaN values and standardizing the numerical features (`apply_preprocessing=True`). To avoid feeding highly correlated features into the model at the same time, `age` is used instead of `age_interval`, and `freshness` is used instead of `year` and `timestamp`.

In [4]:
categorical = ['user_id', 'item_id', 'gender', 'occupation']
numerical = ['age', 'freshness']
data = ML100K(data_dir='/DATA/',
              categorical=categorical,
              numerical=numerical,
              apply_fillnan=True,
              apply_preprocessing=True)
phase_data = data.phase_data
train_processed_df, train_processed_label = phase_data['train']
val_processed_df, val_processed_label = phase_data['val']


### 5.1 Random Forest parameter searching
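The purpose is to find the best parameter combination. Every combination in the grid is trained on the training split and scored by AUC on the validation split (the code prints the best score as `Testing AUC`).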

In [7]:
from feature_analysis.utils import parse_hyperparams
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics

best_auc = 0
best_params = None
rf_params = {
    'n_estimators': list(range(100, 200, 25)),
    'min_samples_split': list(range(3, 21, 4)),
    'min_samples_leaf': list(range(3, 21, 4)),
    'max_depth': list(range(25, 125, 25))
}
hyperparams_set = parse_hyperparams(rf_params)
for idx, params in enumerate(hyperparams_set):
    if (idx + 1) % 20 == 0: print(f'[{idx+1}/{len(hyperparams_set)}] was ran')
    forest = RandomForestClassifier(**params, random_state=42, n_jobs=4)
    forest.fit(train_processed_df, train_processed_label)
    pred = forest.predict_proba(val_processed_df)
    auc = metrics.roc_auc_score(val_processed_label, pred[:, 1])
    if auc > best_auc:
        best_auc = auc
        best_params = params
print(f'Testing AUC: {best_auc:.2f}')
print(f'Best params: {best_params}')


[20/400] was ran
[40/400] was ran
[60/400] was ran
[80/400] was ran
[100/400] was ran
[120/400] was ran
[140/400] was ran
[160/400] was ran
[180/400] was ran
[200/400] was ran
[220/400] was ran
[240/400] was ran
[260/400] was ran
[280/400] was ran
[300/400] was ran
[320/400] was ran
[340/400] was ran
[360/400] was ran
[380/400] was ran
[400/400] was ran
Testing AUC: 0.80
Best params: {'n_estimators': 175, 'min_samples_split': 19, 'min_samples_leaf': 3, 'max_depth': 25}
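
The 400 combinations reported above are the Cartesian product of the four parameter lists (4 × 5 × 5 × 4). A minimal sketch of such an expansion is shown below; `expand_grid` is a hypothetical helper, and `parse_hyperparams` may order or represent the combinations differently.

```python
from itertools import product


def expand_grid(param_grid):
    """Expand a dict of lists into a list of dicts, one per combination."""
    keys = list(param_grid)
    return [dict(zip(keys, values))
            for values in product(*(param_grid[k] for k in keys))]


rf_params = {
    'n_estimators': list(range(100, 200, 25)),      # 100, 125, 150, 175
    'min_samples_split': list(range(3, 21, 4)),     # 3, 7, 11, 15, 19
    'min_samples_leaf': list(range(3, 21, 4)),      # 3, 7, 11, 15, 19
    'max_depth': list(range(25, 125, 25)),          # 25, 50, 75, 100
}
grid = expand_grid(rf_params)
print(len(grid))  # 400
```

scikit-learn's `sklearn.model_selection.ParameterGrid` provides the same kind of expansion if a library helper is preferred.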


### 5.2 Random Forest Feature Importance
The best parameters are applied to the random forest, with `apply_permutation_importance=False`, to observe the feature importance reported by the model itself. The drawback of the random forest's built-in feature importance is that it tends to favor high-cardinality features. Therefore, permutation importance is applied in the next section.

In [8]:
from feature_analysis.analysis import random_forest_importance

random_forest_importance(train_processed_df,
                         train_processed_label,
                         val_processed_df,
                         val_processed_label,
                         model_kwargs={
                             **best_params, 'random_state': 42
                         },
                         apply_permutation_importance=False)


{'metric_name': 'auc',
 'metric_value': 0.8025399341810802,
 'feature_importance': [('item_id', 0.34770623517269517),
  ('user_id', 0.22153512464789368),
  ('age', 0.15108650083728364),
  ('freshness', 0.14604414895992396),
  ('occupation', 0.10938771431207134),
  ('gender', 0.024240276070132138)]}

### 5.3 Permutation Importance
Permutation importance is applied here, and it gives a different view of feature importance. Its main caveat is that when features are highly correlated, the measured importance is affected: after one feature is permuted, the model can still recover similar information from the correlated feature.

In [9]:
random_forest_importance(train_processed_df,
                         train_processed_label,
                         val_processed_df,
                         val_processed_label,
                         model_kwargs={
                             **best_params, 'random_state': 42
                         },
                         apply_permutation_importance=True,
                         permutation_importance_kwargs={
                             'n_jobs': 2,
                             'random_state': 42
                         })

{'metric_name': 'auc',
 'metric_value': 0.8025399341810802,
 'feature_importance': [('freshness', 0.09765077372152975),
  ('item_id', 0.09523056666015066),
  ('age', 0.050029208540506985),
  ('user_id', 0.04839482177107688),
  ('occupation', 0.04166194839046174),
  ('gender', 0.01034771698112431)]}
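
The idea behind permutation importance can be sketched without any model library: shuffle one column at a time and measure how much the validation AUC drops. The functions below are illustrative only, not the implementation behind `random_forest_importance`; `predict` is a hypothetical scoring function standing in for a fitted model.

```python
import numpy as np


def auc_score(y_true, scores):
    """Rank-based AUC (Mann-Whitney U); assumes there are no tied scores."""
    y_true = np.asarray(y_true)
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = y_true.sum()
    n_neg = len(y_true) - n_pos
    return (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def permutation_importance_sketch(predict, X, y, n_repeats=5, seed=42):
    """Mean drop in AUC when each column of X is shuffled in turn."""
    rng = np.random.default_rng(seed)
    baseline = auc_score(y, predict(X))
    importances = []
    for col in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, col] = rng.permutation(X_perm[:, col])
            drops.append(baseline - auc_score(y, predict(X_perm)))
        importances.append(float(np.mean(drops)))
    return baseline, importances


# Toy setup: the label depends on column 0 only; the stand-in "model"
# relies mostly on column 0, slightly on column 1, and ignores column 2.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = (X[:, 0] > 0).astype(int)
predict = lambda data: data[:, 0] + 0.2 * data[:, 1]

baseline, importances = permutation_importance_sketch(predict, X, y)
print(baseline, importances)
```

A feature the model never uses, such as column 2 here, gets an importance of exactly zero, because shuffling it leaves every prediction unchanged.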

### 6. Decision
First, the baseline features are set to `(user_id, item_id)`. The previous analyses yield the following insights:

- Feature correlation on numerical features: no (feature, `label`) pair has a relatively high correlation.
- Chi-square test of independence on categorical features: none of `age_interval`, `gender`, and `occupation` is significant in any of the 100 test runs.
- Two-sample z-test on numerical features: `freshness` and `year` are significant in all 100 runs, which suggests they may be influential features; `age` is significant in only 22 runs.
- Feature importance: aside from the baseline features, `age` and `freshness` are more important than the others in the random forest importance, and in the permutation importance `freshness` is the most important of all features.

The final priority of features is: `freshness`, `age`, `occupation`, `gender`