# Example usage

```py_predpurchase``` can be used to:

* Apply preprocessing transformations to the data, including scaling, encoding, and passing through features as specified.
* Calculate cross-validation results for four common off-the-shelf models (Dummy, KNN, SVM and RandomForest).
* Fit a given model, extract its feature importances, and return them as a DataFrame sorted in descending order.
* Calculate the classification metrics for model predictions including precision, recall, accuracy and F1 scores.

Here, we will demonstrate each of those functionalities:

In [1]:
import py_predpurchase

print(py_predpurchase.__version__)

0.1.0


### Imports

In [45]:
# Importing packages needed for the functions:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# Importing functions
from py_predpurchase.function_preprocessing import numerical_categorical_preprocess
from py_predpurchase.function_model_cross_val import model_cross_validation
from py_predpurchase.function_feature_importance import get_feature_importances 
from py_predpurchase.function_classification_metrics import calculate_classification_metrics


### Creating dummy objects to give to the functions
**Note**: this is a demonstration. When using the package, you will pass in your own objects (dataframes, models, hyperparameter values). Each function below gets its own dummy data because the functions do not cover the entire flow of an analysis, so the output of one function is not necessarily the direct input of the next.

## Preprocessing

Given a dataset with both categorical and numerical data, you can use the function ```numerical_categorical_preprocess``` to preprocess all features, converting them to a format that is compatible with most machine learning models.

In [19]:
# Creating a dummy dataset with both categorical and numerical data
data = {
    'Category': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B'],
    'Boolean': [True, False, True, False, True, False, True, False],
    'Numerical_1': np.random.rand(8),  # 8 random float numbers
    'Numerical_2': np.random.randint(1, 100, 8)  # 8 random integers between 1 and 100
}

df = pd.DataFrame(data)

# separating the explanatory features (X) from the target (y)
X = df.drop('Numerical_2', axis=1)
y = df['Numerical_2']


# performing a train/test split: 25% of rows go to the test set, 75% to the train set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

In [28]:
# defining numerical and categorical features
numeric_features = ["Numerical_1"]
categorical_features = ["Category", "Boolean"]

# applying the numerical_categorical_preprocess function
preprocessed_data = numerical_categorical_preprocess(
    X_train, 
    X_test, 
    y_train,
    y_test,
    numeric_features, 
    categorical_features
)


## Cross Validation
The ```model_cross_validation``` function calculates cross-validation results for four common off-the-shelf models (Dummy, KNN, SVM and RandomForest) using preprocessed and cleaned training and testing datasets. The Dummy and RandomForest hyperparameters are fixed for simplicity's sake.

In [38]:
# creating a dummy dataset

train_data = pd.DataFrame({
    "feature1": [1, 2, 3, 4, 5, 2, 2, 3, 3, 3, 3],
    "feature2": [5, 4, 3, 2, 1, 4, 0, 5, 5, 5, 5],
    'target': [0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0]
})

test_data = pd.DataFrame({
    "feature1": [1, 2, 3, 4, 5, 2, 2, 3, 3, 3, 3],
    "feature2": [5, 4, 3, 2, 1, 4, 0, 5, 5, 5, 5],
    'target': [0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0]
})


In [37]:
# defining dummy hyperparameters
target = "target"
k = 5
gamma = 10

cross_val_results = model_cross_validation(train_data, 
                                           test_data, 
                                           target, 
                                           k, 
                                           gamma)

pd.DataFrame(cross_val_results)

| | dummy | knn | SVM | random_forest |
|---|---|---|---|---|
| fit_time | 0.001 (+/- 0.000) | 0.002 (+/- 0.001) | 0.003 (+/- 0.001) | 0.057 (+/- 0.003) |
| score_time | 0.001 (+/- 0.001) | 0.004 (+/- 0.001) | 0.002 (+/- 0.001) | 0.004 (+/- 0.001) |
| test_score | 0.533 (+/- 0.075) | 0.467 (+/- 0.075) | 0.367 (+/- 0.217) | 0.367 (+/- 0.217) |
| train_score | 0.544 (+/- 0.025) | 0.569 (+/- 0.084) | 0.750 (+/- 0.048) | 0.750 (+/- 0.048) |
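
To make the shape of that result table easier to interpret, here is a minimal sketch of how a comparable "mean (+/- std)" summary could be assembled with scikit-learn's `cross_validate`. It is not the package's code: the helper name `summarise_cross_val`, the number of folds (`cv=3`), the Dummy strategy and the RandomForest settings are all assumptions, so the numbers it prints will not match the table above.

```python
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Same dummy features and target as the training data above.
X_cv = pd.DataFrame({
    "feature1": [1, 2, 3, 4, 5, 2, 2, 3, 3, 3, 3],
    "feature2": [5, 4, 3, 2, 1, 4, 0, 5, 5, 5, 5],
})
y_cv = pd.Series([0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0])

# Assumed model settings; only k (n_neighbors) and gamma mirror the example above.
models = {
    "dummy": DummyClassifier(strategy="most_frequent"),
    "knn": KNeighborsClassifier(n_neighbors=5),
    "SVM": SVC(gamma=10),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=123),
}


def summarise_cross_val(model, X, y, **kwargs):
    """Run cross-validation and format each metric as 'mean (+/- std)'."""
    scores = cross_validate(model, X, y, **kwargs)
    return {
        name: f"{values.mean():.3f} (+/- {values.std():.3f})"
        for name, values in scores.items()
    }


# One column per model, one row per metric returned by cross_validate.
results_df = pd.DataFrame({
    name: summarise_cross_val(model, X_cv, y_cv, cv=3, return_train_score=True)
    for name, model in models.items()
})

print(results_df)
```

Comparing `train_score` with `test_score` in such a table gives a quick read on overfitting: a large gap, as with the SVM and random forest columns above, suggests the model fits the training folds much better than unseen data.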


## Feature Importance

Given a model and the names of its explanatory features, the function ```get_feature_importances``` extracts the feature importances, sorts them in descending order, and returns them as a DataFrame. In this example, the model is fitted on dummy X and y data before it is passed to the function.

In [44]:
# creating a dummy model
model = RandomForestClassifier(max_depth=2, random_state=0)

# dummy data
X_data = pd.DataFrame({
    "feature1": [1, 2, 3, 4, 5, 2, 2, 3, 3, 3, 3],
    "feature2": [5, 4, 3, 2, 1, 4, 0, 5, 5, 5, 5],
})

y_data = [0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0]

# fitting the dummy model

model.fit(X_data, y_data)

X_columns = ["feature1", "feature2"]

get_feature_importances(model, X_columns)

| | Importance |
|---|---|
| feature2 | 0.504683 |
| feature1 | 0.495317 |
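
For readers curious about what such a table contains, the sketch below builds a similar sorted DataFrame straight from a fitted model's `feature_importances_` attribute, which tree-based scikit-learn estimators expose. The helper name `sorted_importances` is hypothetical, and this is an assumed equivalent rather than the package's actual implementation.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Same dummy data and model settings as in the example above.
X_demo = pd.DataFrame({
    "feature1": [1, 2, 3, 4, 5, 2, 2, 3, 3, 3, 3],
    "feature2": [5, 4, 3, 2, 1, 4, 0, 5, 5, 5, 5],
})
y_demo = [0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0]

fitted_model = RandomForestClassifier(max_depth=2, random_state=0).fit(X_demo, y_demo)


def sorted_importances(model, column_names):
    """Return feature importances as a DataFrame, largest first (hypothetical helper)."""
    return pd.DataFrame(
        {"Importance": model.feature_importances_},
        index=column_names,
    ).sort_values("Importance", ascending=False)


importances = sorted_importances(fitted_model, list(X_demo.columns))
print(importances)
```

For a random forest the importances are normalised to sum to 1, so each value can be read as that feature's share of the total impurity reduction.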


## Classification Metrics
Given the true labels and the predicted labels (the output of a chosen model's predictions), the ```calculate_classification_metrics``` function calculates precision, recall, accuracy and F1 score for those predictions.

In [46]:
# dummy data

y_true = [1,0,1,1,1,0,0,1,0,1]
y_pred = [1,1,1,0,1,0,0,1,0,0]

calculate_classification_metrics(y_true, y_pred)

{'Precision': 0.7200000000000001,
 'Recall': 0.7,
 'Accuracy': 0.7,
 'F1 Score': 0.7030303030303029}
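
The values above are consistent with support-weighted averages across the two classes. That reading is inferred from the numbers, not confirmed from the package source. As a cross-check, the following sketch computes the same four metrics with scikit-learn directly, assuming `average="weighted"`.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Same dummy labels as in the example above.
y_true = [1, 0, 1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0]

# average="weighted" averages the per-class scores, weighted by each class's support
# (assumed here to mirror what calculate_classification_metrics reports).
metrics = {
    "Precision": precision_score(y_true, y_pred, average="weighted"),
    "Recall": recall_score(y_true, y_pred, average="weighted"),
    "Accuracy": accuracy_score(y_true, y_pred),
    "F1 Score": f1_score(y_true, y_pred, average="weighted"),
}

for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Worked by hand: class 1 has precision 4/5 and recall 4/6 over 6 true samples, and class 0 has precision 3/5 and recall 3/4 over 4 true samples, so the weighted precision is 0.6 × 0.8 + 0.4 × 0.6 = 0.72 and the weighted recall is 0.6 × 0.667 + 0.4 × 0.75 = 0.70.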