
In [None]:
# IMPORTANT: RUN THIS CELL IN ORDER TO IMPORT YOUR KAGGLE DATA SOURCES,
# THEN FEEL FREE TO DELETE THIS CELL.
# NOTE: THIS NOTEBOOK ENVIRONMENT DIFFERS FROM KAGGLE'S PYTHON
# ENVIRONMENT SO THERE MAY BE MISSING LIBRARIES USED BY YOUR
# NOTEBOOK.
import kagglehub
mathan_fifa_2018_match_statistics_path = kagglehub.dataset_download('mathan/fifa-2018-match-statistics')
dansbecker_hospital_readmissions_path = kagglehub.dataset_download('dansbecker/hospital-readmissions')
dansbecker_new_york_city_taxi_fare_prediction_path = kagglehub.dataset_download('dansbecker/new-york-city-taxi-fare-prediction')

print('Data source import complete.')


Data source import complete.





# Introduction

One of the most basic questions we might ask of a model is: Which features have the biggest impact on predictions?

This concept is called **feature importance**.

There are multiple ways to measure feature importance.  Some approaches answer subtly different versions of the question above. Other approaches have documented shortcomings.

In this lesson, we'll focus on **permutation importance**.  Compared to most other approaches, permutation importance is:

- fast to calculate,
- widely used and understood, and
- consistent with properties we would want a feature importance measure to have.

# How It Works

Permutation importance uses models differently than anything you've seen so far, and many people find it confusing at first. So we'll start with an example to make it more concrete.  

Consider data with the following format:

![Data](https://storage.googleapis.com/kaggle-media/learn/images/wjMAysV.png)

We want to predict a person's height when they become 20 years old, using data that is available at age 10.

Our data includes useful features (*height at age 10*), features with little predictive power (*socks owned*), as well as some other features we won't focus on in this explanation.

**Permutation importance is calculated after a model has been fitted.** So we won't change the model or change what predictions we'd get for a given value of height, sock-count, etc.

Instead we will ask the following question:  If I randomly shuffle a single column of the validation data, leaving the target and all other columns in place, how would that affect the accuracy of predictions in that now-shuffled data?

![Shuffle](https://storage.googleapis.com/kaggle-media/learn/images/h17tMUU.png)

Randomly re-ordering a single column should cause less accurate predictions, since the resulting data no longer corresponds to anything observed in the real world.  Model accuracy especially suffers if we shuffle a column that the model relied on heavily for predictions.  In this case, shuffling `height at age 10` would cause terrible predictions. If we shuffled `socks owned` instead, the resulting predictions wouldn't suffer nearly as much.
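To make the shuffle concrete, here is a minimal sketch on a toy table. The column names and values are invented for illustration and are not taken from the figure above. Only one feature column is permuted; the other feature and the target keep their original row order.

```python
import numpy as np
import pandas as pd

# Toy validation data; names and values are made up for illustration.
val = pd.DataFrame({
    "height_at_10": [130, 142, 138, 150, 135],
    "socks_owned": [12, 8, 20, 5, 15],
    "height_at_20": [168, 182, 175, 190, 171],  # target
})

rng = np.random.default_rng(0)

shuffled = val.copy()
# Permute only one feature column; every other column keeps its row order.
shuffled["height_at_10"] = rng.permutation(val["height_at_10"].to_numpy())

print(shuffled)
```

The shuffled column still contains exactly the same values, so its overall distribution is unchanged; only its pairing with the rest of each row is broken.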

With this insight, the process is as follows:

1. Get a trained model.
2. Shuffle the values in a single column and make predictions using the resulting dataset.  Use these predictions and the true target values to calculate how much the loss function suffered from shuffling. That performance deterioration measures the importance of the variable you just shuffled.
3. Return the data to the original order (undoing the shuffle from step 2). Now repeat step 2 with the next column in the dataset, until you have calculated the importance of each column.
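The three steps above can be sketched from scratch as follows. This is an illustrative sketch, not the implementation any particular library uses: the function name `permutation_importance_sketch`, the choice of mean squared error as the loss, the `n_repeats` averaging, and the synthetic data are all assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error


def permutation_importance_sketch(model, X, y, n_repeats=5, seed=0):
    """Return the mean increase in MSE caused by shuffling each column of X."""
    rng = np.random.default_rng(seed)
    baseline = mean_squared_error(y, model.predict(X))  # loss on intact data
    importances = np.zeros(X.shape[1])
    for col in range(X.shape[1]):
        increases = []
        for _ in range(n_repeats):
            X_perm = X.copy()  # step 3: always start from the original order
            X_perm[:, col] = rng.permutation(X_perm[:, col])  # step 2: shuffle one column
            loss = mean_squared_error(y, model.predict(X_perm))
            increases.append(loss - baseline)
        importances[col] = np.mean(increases)
    return importances


# Synthetic data: column 0 drives the target, column 1 is pure noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 2))
y = 3.0 * X[:, 0] + rng.normal(scale=0.1, size=500)

X_train, X_val = X[:400], X[400:]
y_train, y_val = y[:400], y[400:]

model = LinearRegression().fit(X_train, y_train)  # step 1: a fitted model
importances = permutation_importance_sketch(model, X_val, y_val)
print(importances)
```

Shuffling the informative column should raise the loss sharply, while shuffling the noise column should leave it almost unchanged. Averaging over several shuffles reduces the randomness in the estimate.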

# Code Example

Our example will use a model that predicts whether a soccer/football team will have the "Man of the Match" winner based on the team's statistics.  The "Man of the Match" award is given to the best player in the game.  Model-building isn't our current focus, so the cells below load the data, take a quick look at it, and build a rudimentary model.

In [None]:
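import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Load the match statistics (this path assumes the CSV is available under /content)
data = pd.read_csv('/content/FIFA 2018 Statistics.csv')
# Only the last expression in a cell is displayed, so this shows the column names
data.columns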

Index(['Date', 'Team', 'Opponent', 'Goal Scored', 'Ball Possession %',
       'Attempts', 'On-Target', 'Off-Target', 'Blocked', 'Corners', 'Offsides',
       'Free Kicks', 'Saves', 'Pass Accuracy %', 'Passes',
       'Distance Covered (Kms)', 'Fouls Committed', 'Yellow Card',
       'Yellow & Red', 'Red', 'Man of the Match', '1st Goal', 'Round', 'PSO',
       'Goals in PSO', 'Own goals', 'Own goal Time'],
      dtype='object')
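The load cell above reads from `/content/`, which only works if the CSV has been placed there. The first cell of this notebook stores the directory returned by `kagglehub.dataset_download` in `mathan_fifa_2018_match_statistics_path`, so another option is to look for the file under that directory. The helper `find_csv` below is hypothetical (it is not part of kagglehub), and a temporary directory stands in for the download so the sketch runs offline.

```python
from pathlib import Path
import tempfile


def find_csv(dataset_dir, filename):
    """Return the first file called `filename` anywhere under `dataset_dir`."""
    matches = sorted(Path(dataset_dir).rglob(filename))
    if not matches:
        raise FileNotFoundError(f"{filename!r} not found under {dataset_dir}")
    return matches[0]


# Stand-in for the downloaded dataset directory so the sketch runs offline.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "FIFA 2018 Statistics.csv").write_text("Team,Goal Scored\nRussia,5\n")
    csv_path = find_csv(tmp, "FIFA 2018 Statistics.csv")
    print(csv_path.name)
```

In the notebook, the equivalent call would presumably be `pd.read_csv(find_csv(mathan_fifa_2018_match_statistics_path, 'FIFA 2018 Statistics.csv'))`, assuming the download succeeded and the file name matches.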

In [None]:
# Get general info about the data (column dtypes and non-null counts)
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 128 entries, 0 to 127
Data columns (total 27 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   Date                    128 non-null    object 
 1   Team                    128 non-null    object 
 2   Opponent                128 non-null    object 
 3   Goal Scored             128 non-null    int64  
 4   Ball Possession %       128 non-null    int64  
 5   Attempts                128 non-null    int64  
 6   On-Target               128 non-null    int64  
 7   Off-Target              128 non-null    int64  
 8   Blocked                 128 non-null    int64  
 9   Corners                 128 non-null    int64  
 10  Offsides                128 non-null    int64  
 11  Free Kicks              128 non-null    int64  
 12  Saves                   128 non-null    int64  
 13  Pass Accuracy %         128 non-null    int64  
 14  Passes                  128 non-null    in

In [None]:
data.describe()

| | Goal Scored | Ball Possession % | Attempts | On-Target | Off-Target | Blocked | Corners | Offsides | Free Kicks | Saves | ... | Passes | Distance Covered (Kms) | Fouls Committed | Yellow Card | Yellow & Red | Red | 1st Goal | Goals in PSO | Own goals | Own goal Time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 128.0 | 128.0 | 128.0 | 128.0 | 128.0 | 128.0 | 128.0 | 128.0 | 128.0 | 128.0 | ... | 128.0 | 128.0 | 128.0 | 128.0 | 128.0 | 128.0 | 94.0 | 128.0 | 12.0 | 12.0 |
| mean | 1.320312 | 49.992188 | 12.59375 | 3.914062 | 5.273438 | 3.359375 | 4.71875 | 1.34375 | 14.890625 | 2.726562 | ... | 462.648438 | 106.664062 | 13.546875 | 1.695312 | 0.015625 | 0.015625 | 39.457447 | 0.203125 | 1.0 | 45.833333 |
| std | 1.156519 | 10.444074 | 5.245827 | 2.234403 | 2.409675 | 2.403195 | 2.446072 | 1.193404 | 4.724262 | 2.049447 | ... | 151.186311 | 11.749537 | 4.619131 | 1.325454 | 0.124507 | 0.124507 | 24.496506 | 0.807049 | 0.0 | 29.978275 |
| min | 0.0 | 25.0 | 3.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 5.0 | 0.0 | ... | 189.0 | 80.0 | 5.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 12.0 |
| 25% | 0.0 | 42.0 | 9.0 | 2.0 | 4.0 | 1.75 | 3.0 | 0.0 | 11.0 | 1.0 | ... | 351.0 | 101.0 | 10.0 | 1.0 | 0.0 | 0.0 | 18.25 | 0.0 | 1.0 | 21.75 |
| 50% | 1.0 | 50.0 | 12.0 | 3.5 | 5.0 | 3.0 | 5.0 | 1.0 | 15.0 | 2.0 | ... | 462.0 | 104.5 | 13.0 | 2.0 | 0.0 | 0.0 | 39.0 | 0.0 | 1.0 | 35.0 |
| 75% | 2.0 | 58.0 | 15.0 | 5.0 | 7.0 | 4.0 | 6.0 | 2.0 | 18.0 | 4.0 | ... | 555.25 | 109.0 | 16.0 | 2.0 | 0.0 | 0.0 | 54.75 | 0.0 | 1.0 | 75.75 |
| max | 6.0 | 75.0 | 26.0 | 12.0 | 11.0 | 10.0 | 11.0 | 5.0 | 26.0 | 9.0 | ... | 1137.0 | 148.0 | 25.0 | 6.0 | 1.0 | 1.0 | 90.0 | 4.0 | 1.0 | 90.0 |


### Explore unique values and counts

It's often useful to see a column's unique values and their counts. The cells below preview the first rows, list the distinct values of 'On-Target', and count the classes of the target column, 'Man of the Match'.

In [None]:
data.head(4)

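| | Date | Team | Opponent | Goal Scored | Ball Possession % | Attempts | On-Target | Off-Target | Blocked | Corners | ... | Yellow Card | Yellow & Red | Red | Man of the Match | 1st Goal | Round | PSO | Goals in PSO | Own goals | Own goal Time |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14-06-2018 | Russia | Saudi Arabia | 5 | 40 | 13 | 7 | 3 | 3 | 6 | ... | 0 | 0 | 0 | Yes | 12.0 | Group Stage | No | 0 | | |
| 1 | 14-06-2018 | Saudi Arabia | Russia | 0 | 60 | 6 | 0 | 3 | 3 | 2 | ... | 0 | 0 | 0 | No | | Group Stage | No | 0 | | |
| 2 | 15-06-2018 | Egypt | Uruguay | 0 | 43 | 8 | 3 | 3 | 2 | 0 | ... | 2 | 0 | 0 | No | | Group Stage | No | 0 | | |
| 3 | 15-06-2018 | Uruguay | Egypt | 1 | 57 | 14 | 4 | 6 | 4 | 5 | ... | 0 | 0 | 0 | Yes | 89.0 | Group Stage | No | 0 | | |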


In [None]:
data['On-Target'].unique()

array([ 7,  0,  3,  4,  2,  5,  1,  6,  9, 12, 10,  8])

In [None]:
data['Man of the Match'].value_counts()

| Man of the Match | count |
|---|---|
| Yes | 64 |
| No | 64 |
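To survey every non-numeric column in one pass rather than one cell at a time, a loop over the columns works. The sketch below uses a small stand-in frame (`df`, with values copied from the first rows shown above) so that it runs on its own; in the notebook, `data` would take its place.

```python
import pandas as pd

# Small stand-in frame; the notebook would use `data` instead.
df = pd.DataFrame({
    "Team": ["Russia", "Saudi Arabia", "Egypt", "Uruguay"],
    "Round": ["Group Stage"] * 4,
    "Man of the Match": ["Yes", "No", "No", "Yes"],
    "Goal Scored": [5, 0, 0, 1],
})

# Treat every non-numeric column as categorical.
categorical_cols = df.select_dtypes(exclude="number").columns
for col in categorical_cols:
    print(f"{col}: {df[col].nunique()} unique value(s)")
    print(df[col].value_counts().to_string(), end="\n\n")
```

A balanced target, as in the 64/64 split above, means plain accuracy is a reasonable score to track when features are shuffled.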


In [None]:
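# Convert the target from the strings "Yes"/"No" to booleans
y = (data['Man of the Match'] == 'Yes')
# Use only the integer-valued columns as features
feature_names = [col for col in data.columns if data[col].dtype == np.int64]
X = data[feature_names]
# Hold out a validation set (the default split keeps 25% of the rows for validation)
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)
my_model = RandomForestClassifier(n_estimators=100, random_state=0).fit(train_X, train_y)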

Here is how to calculate and show importances with the [eli5](https://eli5.readthedocs.io/en/latest/) library:

In [None]:
# Notes to self on further exploratory analysis:
# - Visualize distributions to check for outliers.
# - Explore relationships between columns to find correlated numerical and categorical features.
# - Analyze categorical features with bar charts or pie plots to see how their values are distributed.
# - Feature engineering: create new features where needed.

In [None]:
# Install eli5, used below to compute permutation importance
!pip install eli5

Collecting eli5
  Downloading eli5-0.16.0-py2.py3-none-any.whl.metadata (18 kB)
Downloading eli5-0.16.0-py2.py3-none-any.whl (108 kB)
Installing collected packages: eli5
Successfully installed eli5-0.16.0


In [None]:
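import eli5
from eli5.sklearn import PermutationImportance

# Permute each feature of the validation set and record the change in the model's score
perm = PermutationImportance(my_model, random_state=1).fit(val_X, val_y)
# Display the features ranked by importance
eli5.show_weights(perm, feature_names=val_X.columns.tolist())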

| Weight | Feature |
|---|---|
| 0.1750 ± 0.0848 | Goal Scored |
| 0.0500 ± 0.0637 | Distance Covered (Kms) |
| 0.0437 ± 0.0637 | Yellow Card |
| 0.0187 ± 0.0637 | Free Kicks |
| 0.0187 ± 0.0500 | Off-Target |
| 0.0187 ± 0.0637 | Fouls Committed |
| 0.0125 ± 0.0637 | Pass Accuracy % |
| 0.0125 ± 0.0306 | Blocked |
| 0.0063 ± 0.0612 | Saves |
| 0.0063 ± 0.0250 | Ball Possession % |
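
If eli5 is not available, scikit-learn offers a comparable function, `sklearn.inspection.permutation_importance`. The sketch below is an alternative to the eli5 call above and not a reproduction of it: it runs on synthetic data from `make_classification` so that it is self-contained, and its numbers will not match the table. In the notebook, `my_model`, `val_X`, and `val_y` would presumably be passed instead.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the notebook would use my_model, val_X and val_y.
X, y = make_classification(n_samples=300, n_features=5, n_informative=2,
                           n_redundant=0, shuffle=False, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=1)
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# Shuffle each column n_repeats times and record the drop in the model's score.
result = permutation_importance(model, X_val, y_val, n_repeats=10, random_state=1)

# Print features from most to least important, with the spread across repeats.
for i in np.argsort(result.importances_mean)[::-1]:
    print(f"feature {i}: {result.importances_mean[i]:.4f} "
          f"+/- {result.importances_std[i]:.4f}")
```

`result.importances` holds the raw score drop for every feature and repeat, which is useful when the spread is as large relative to the mean as it is for most rows in the table above.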
