# PPS - Predictive Power Score

* PPS is a statistical metric that measures how well one variable predicts another. Unlike correlation, it can capture non-linear and asymmetric relationships.
* Correlation (for example Pearson's r) measures the strength of a linear relationship and is symmetric: the correlation of x with y equals the correlation of y with x.
* PPS is directional: it is computed by fitting a machine learning model that predicts one variable from the other, so the score for x predicting y can differ from the score for y predicting x.
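
To make the contrast concrete, here is a minimal sketch that uses only the Python standard library. It does not compute PPS itself; `pearson` and `lookup_mae` are hypothetical helper names introduced for this illustration. For y = x², Pearson's r is essentially zero in both directions, yet x determines y exactly, while y leaves the sign of x ambiguous.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    ma, mb = mean(a), mean(b)
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    var_a = sum((u - ma) ** 2 for u in a)
    var_b = sum((v - mb) ** 2 for v in b)
    return cov / sqrt(var_a * var_b)


def lookup_mae(feature, target):
    """Mean absolute error when each target value is predicted by the average
    target seen for the same feature value (a simple in-sample lookup)."""
    groups = defaultdict(list)
    for f, t in zip(feature, target):
        groups[f].append(t)
    prediction = {f: mean(ts) for f, ts in groups.items()}
    return mean(abs(t - prediction[f]) for f, t in zip(feature, target))


x = list(range(-10, 11))
y = [v ** 2 for v in x]

r_xy = pearson(x, y)           # symmetric: same value as pearson(y, x)
r_yx = pearson(y, x)
mae_x_to_y = lookup_mae(x, y)  # x pins down y exactly
mae_y_to_x = lookup_mae(y, x)  # y cannot tell +x from -x

print(f"pearson(x, y) = {r_xy:.3f}, pearson(y, x) = {r_yx:.3f}")
print(f"error predicting y from x: {mae_x_to_y}")
print(f"error predicting x from y: {mae_y_to_x:.2f}")
```

The correlation is blind to the relationship in both directions, while the prediction errors differ sharply depending on which variable plays the role of the feature. That directional gap is what PPS is designed to expose.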


## How PPS Works:
* Fits a decision tree that uses one variable as the only feature to predict the other.
* Scores the predictions on a scale from 0 to 1, where 0 means no predictive power (no better than a naive baseline) and 1 means perfect prediction.
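
The two steps above can be sketched with scikit-learn for the case of a numeric target. This is a hand-rolled approximation, not the `ppscore` implementation: the function name `pps_regression_sketch`, the 4-fold split, the median baseline, and the fixed seed are all assumptions made for the illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.tree import DecisionTreeRegressor


def pps_regression_sketch(x, y, n_splits=4, seed=0):
    """Approximate a predictive power score for a numeric target y
    using a single numeric feature x. Illustrative only."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    y = np.asarray(y, dtype=float)

    # Out-of-fold predictions from a decision tree with x as the only feature.
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    tree = DecisionTreeRegressor(random_state=seed)
    predictions = cross_val_predict(tree, x, y, cv=folds)
    model_mae = mean_absolute_error(y, predictions)

    # Naive baseline (an assumption of this sketch): always predict the median of y.
    naive_mae = mean_absolute_error(y, np.full_like(y, np.median(y)))

    # Rescale so 0 means "no better than the baseline" and 1 means "no error".
    if naive_mae == 0 or model_mae >= naive_mae:
        return 0.0
    return 1.0 - model_mae / naive_mae


x = np.arange(-300, 301) / 100.0  # evenly spaced values in [-3, 3]
y = x ** 2

score_x_to_y = pps_regression_sketch(x, y)
score_y_to_x = pps_regression_sketch(y, x)
print(f"x -> y: {score_x_to_y:.2f}")
print(f"y -> x: {score_y_to_x:.2f}")
```

Scoring on held-out folds rather than on the training data matters here: a fully grown tree can memorize its training set, so an in-sample score would look perfect even for unrelated variables.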


The Predictive Power Score (PPS) is an alternative to the correlation coefficient (like Pearson's r) that can reveal insights about the predictive relationship between two variables. While correlation measures linear relationships, PPS can detect more complex patterns such as non-linear relationships. It can be used not only with numerical data but also with categorical variables.

PPS tells you how well one variable can predict another. The underlying idea is that if a machine learning model (typically a decision tree) can predict one variable from another, a meaningful relationship between them is likely.

PPS is calculated by building a model that predicts one variable from another and then evaluating that model on held-out data. In the `ppscore` library used below, the evaluation metric is mean absolute error (MAE) for regression targets and weighted F1 for classification targets, as the `metric` column of the output further down shows. The model's score is compared against a naive baseline (`baseline_score`) and normalized to a range of 0 to 1, where 0 means the model does no better than the baseline and 1 indicates perfect prediction.
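
As a worked example of the normalization, the sketch below rescales a model score against a baseline score. The input numbers are the `baseline_score` and `model_score` values printed by `pps.matrix` further down in this document; the function names `normalized_f1` and `normalized_mae` are hypothetical. These formulas reproduce the printed `ppscore` values for those rows, but treat them as an illustration rather than a statement of the library's internals.

```python
def normalized_f1(model_f1, baseline_f1):
    """Rescale a weighted F1 so the baseline maps to 0 and a perfect score to 1."""
    if model_f1 <= baseline_f1:
        return 0.0
    return (model_f1 - baseline_f1) / (1.0 - baseline_f1)


def normalized_mae(model_mae, baseline_mae):
    """Rescale an MAE so the baseline maps to 0 and zero error maps to 1."""
    if model_mae >= baseline_mae:
        return 0.0
    return 1.0 - model_mae / baseline_mae


# (baseline_score, model_score) pairs copied from the pps.matrix output in this document.
carat_to_cut = normalized_f1(model_f1=0.354467, baseline_f1=0.29420)
carat_to_depth = normalized_mae(model_mae=1.051711, baseline_mae=1.01662)

print(round(carat_to_cut, 4))  # 0.0854, matching the ppscore column for carat -> cut
print(carat_to_depth)          # 0.0, because the tree did worse than the baseline
```

For F1 higher is better, so the score measures how much of the gap between the baseline and 1 the model closes. For MAE lower is better, so the score measures the fraction of baseline error the model removes. In both cases a model that fails to beat the baseline is clipped to 0.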

In [6]:
import seaborn as sns
import ppscore as pps
import matplotlib.pyplot as plt

# Load example dataset
diamonds = sns.load_dataset('diamonds')

# cut, color and clarity are 'category' dtype, not 'object', so a
# select_dtypes(include=['object']) + LabelEncoder step would match nothing.
# ppscore treats categorical columns as classification targets on its own.
print(diamonds.dtypes)

# Calculate the Predictive Power Score.
# pps.matrix returns a long-format DataFrame with one row per (x, y) pair.
pps_matrix = pps.matrix(diamonds)

# Pivot to a wide matrix for the heatmap:
# columns are the feature x, rows are the target y.
pps_df = pps_matrix[['x', 'y', 'ppscore']].pivot(columns='x', index='y', values='ppscore')

plt.figure(figsize=(10, 8))
sns.heatmap(pps_df, annot=True, fmt=".2f", cmap='coolwarm')
plt.title('PPS Matrix')
plt.show()



carat       float64
cut        category
color      category
clarity    category
depth       float64
table       float64
price         int64
x           float64
y           float64
z           float64
dtype: object

In [5]:
import seaborn as sns
import ppscore as pps
import matplotlib.pyplot as plt

# Load the dataset
diamonds = sns.load_dataset('diamonds')

# Calculate the Predictive Power Score.
# pps.matrix already returns a pandas DataFrame, in long format:
# one row per (x, y) pair, where x is the feature and y is the target.
pps_matrix = pps.matrix(diamonds)

# Check the structure of pps_matrix
print(pps_matrix.head())

# sns.heatmap needs a wide numeric matrix, so pivot the long table first.
# Passing the long table directly raises
# "ValueError: could not convert string to float: 'carat'".
pps_wide = pps_matrix[['x', 'y', 'ppscore']].pivot(columns='x', index='y', values='ppscore')

plt.figure(figsize=(10, 8))
sns.heatmap(pps_wide, annot=True, fmt=".2f", cmap='coolwarm', vmin=0, vmax=1)
plt.title('PPS Matrix')
plt.show()


       x        y   ppscore            case  is_valid_score  \
0  carat    carat  1.000000  predict_itself            True   
1  carat      cut  0.085389  classification            True   
2  carat    color  0.060319  classification            True   
3  carat  clarity  0.064141  classification            True   
4  carat    depth  0.000000      regression            True   

                metric  baseline_score  model_score                     model  
0                 None         0.00000     1.000000                      None  
1          weighted F1         0.29420     0.354467  DecisionTreeClassifier()  
2          weighted F1         0.15720     0.208037  DecisionTreeClassifier()  
3          weighted F1         0.18000     0.232596  DecisionTreeClassifier()  
4  mean absolute error         1.01662     1.051711   DecisionTreeRegressor()  