![giskard_logo.png](https://raw.githubusercontent.com/Giskard-AI/giskard/main/readme/Logo_full_darkgreen.png)

# About Giskard

Open-source CI/CD platform for ML teams. Deliver ML products better and faster.

*   Collaborate faster with feedback from business stakeholders.
*   Deploy automated tests to eliminate regressions, errors & biases.

🏡 [Website](https://giskard.ai/)

📗 [Documentation](https://docs.giskard.ai/)

# Start by creating an ML model 🚀🚀🚀

Let's create a house pricing model based on a Kaggle dataset ([link](https://raw.githubusercontent.com/Giskard-AI/giskard-client/main/sample_data/regression/house-prices/house_price_updated.csv) to download the dataset).

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


In [None]:
# To download the file manually, go to https://github.com/Giskard-AI/giskard-client/blob/main/sample_data/regression/house-prices/house_price_updated.csv
url = 'https://raw.githubusercontent.com/Giskard-AI/giskard-client/main/sample_data/regression/house-prices/house_price_updated.csv'
data = pd.read_csv(url)

In [None]:
column_types = {'TypeOfDewelling': 'category',
                'BldgType': 'category',
                'AbvGrndLivArea': 'numeric',
                'Neighborhood': 'category',
                'KitchenQual': 'category',
                'NumGarageCars': 'numeric',
                'YearBuilt': 'numeric',
                'YearRemodAdd':  'numeric',
                'ExterQual': 'category',
                'LotArea': 'numeric',
                'LotShape': 'category',
                'Fireplaces': 'numeric',
                'NumBathroom': 'numeric',
                'Basement1Type': 'category',
                'Basement1SurfaceArea': 'numeric',
                'Basement2Type': 'category',
                'Basement2SurfaceArea': 'numeric',
                'TotalBasementArea': 'numeric',
                'GarageArea': 'numeric',
                '1stFlrArea': 'numeric',
                '2ndFlrArea': 'numeric',
                'Utilities': 'category',
                'OverallQual': 'category',
                'SalePrice': 'numeric'  # regression target, so numeric rather than category
                }
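
A mismatch between `column_types` and the columns actually present in the dataframe is easy to introduce by hand. The sketch below is one possible way to catch it early. The helper `compare_columns` is hypothetical (it is not part of Giskard), the allowed type names are limited to the two used in this notebook, and the sample columns are made up for illustration. In the notebook you would pass `data.columns` and `column_types`.

```python
def compare_columns(df_columns, column_types, allowed=("numeric", "category")):
    """Compare a dataframe's columns with a column_types mapping.

    Returns three values:
      missing    - declared in column_types but absent from the dataframe
      undeclared - present in the dataframe but not declared
      invalid    - declared with a type name outside `allowed`
    `allowed` is an assumption based on the types seen in this notebook.
    """
    df_columns = list(df_columns)
    missing = [c for c in column_types if c not in df_columns]
    undeclared = [c for c in df_columns if c not in column_types]
    invalid = {c: t for c, t in column_types.items() if t not in allowed}
    return missing, undeclared, invalid


# Made-up stand-ins; replace with data.columns and column_types.
columns = ["LotArea", "Neighborhood", "SalePrice", "Id"]
types = {
    "LotArea": "numeric",
    "Neighborhood": "category",
    "SalePrice": "numeric",
    "YearBuilt": "numbr",  # deliberate typo to show the check
}

missing, undeclared, invalid = compare_columns(columns, types)
print("missing:", missing)
print("undeclared:", undeclared)
print("invalid:", invalid)
```

An empty result for all three values means the mapping and the dataframe agree.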

In [None]:
feature_types = {i:column_types[i] for i in column_types if i!='SalePrice'}

numeric_features = [key for key in feature_types.keys() if feature_types[key]=="numeric"]
categorical_features = [key for key in feature_types.keys() if feature_types[key]=="category"]

numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(missing_values=np.nan, strategy='mean')),
    ('scaler', StandardScaler())
])

# Note: scikit-learn 1.2 renamed `sparse` to `sparse_output`; on recent
# versions use OneHotEncoder(handle_unknown='ignore', sparse_output=False).
categorical_transformer = Pipeline([
    ('imputer', SimpleImputer(missing_values=np.nan, strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse=False))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)
reg_random_forest = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('regressor', RandomForestRegressor())
])

y = data['SalePrice']
X = data.drop(columns="SalePrice")
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=30)

In [None]:
reg_random_forest.fit(X_train, y_train)
print("model score: %.3f" % reg_random_forest.score(X_test, y_test))
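
For a scikit-learn regressor, `score` returns the coefficient of determination (R²). The notebook also imports `mean_absolute_error`, `mean_squared_error` and `r2_score` without using them. The sketch below spells out in plain Python what these metrics measure, using made-up prices. In the notebook itself you would more likely call, for example, `mean_absolute_error(y_test, reg_random_forest.predict(X_test))`.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE and R² for two equally long sequences of numbers."""
    n = len(y_true)
    if n == 0 or n != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equally long")

    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)              # residual sum of squares
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares

    return {
        "mae": sum(abs(e) for e in errors) / n,
        "rmse": math.sqrt(ss_res / n),
        # R² is undefined when the target is constant; NaN is a choice made
        # for this sketch, not necessarily what scikit-learn returns.
        "r2": 1 - ss_res / ss_tot if ss_tot > 0 else float("nan"),
    }


# Made-up sale prices, for illustration only.
actual = [200000, 150000, 320000, 180000]
predicted = [210000, 140000, 300000, 185000]

metrics = regression_metrics(actual, predicted)
print("MAE:  %.1f" % metrics["mae"])    # 11250.0
print("RMSE: %.1f" % metrics["rmse"])   # 12500.0
print("R2:   %.3f" % metrics["r2"])
```

MAE and RMSE are expressed in the unit of the target (here, the sale price), which often makes them easier to discuss with business stakeholders than R².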

In [None]:
train_data = pd.concat([X_train, y_train], axis=1)
test_data = pd.concat([X_test, y_test], axis=1)

# Upload the model to Giskard 🚀🚀🚀

### Install the Giskard library

In [None]:
!pip install giskard

### Initiate a project

In [None]:
from giskard.giskard_client import GiskardClient

url1 = "http://gsk1.giskard.ai:10000"
token1 = "eyJhbGciOiJIUzUxMiJ9.eyJzdWIiOiJhZG1pbiIsInRva2VuX3R5cGUiOiJBUEkiLCJhdXRoIjoiUk9MRV9BRE1JTiIsImV4cCI6MTY2MjkzMDE3Mn0.A0hdmCnddvdhVj62mRCMvQ_N-Cor13SdcHeLa7e8J9YqEucWlZRpTt8hbK6PKIa1yfgCrwN7EQQ4Q4mYMNNeXQ"

#url1 = "http://localhost:19000" #If Giskard is installed locally
#token1 = "eyJhbGciOiJIUzUxMiJ9.eyJzdWIiOiJhZG1pbiIsInRva2VuX3R5cGUiOiJBUEkiLCJhdXRoIjoiUk9MRV9BRE1JTiIsImV4cCI6MTY2Mjc1Nzg5Nn0.vKOmgNqi3wMFq1nABvmlpi-nq1zLLFGEJwLKREXl0fF6_8kGX4a-MwQn3TszxRUngC_bElR_Ui2uivjyCZ9Tgg"
#Find your token in the Admin tab of your app (login: admin; password: admin)


client = GiskardClient(url1, token1)

house_pricing = client.create_project("house_pricing", "House pricing model", "Project to predict house prices")

#If you've already created a project with the key "house_pricing", use
#house_pricing = client.get_project("house_pricing")
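
The API tokens above are JSON Web Tokens, and their middle segment carries an `exp` (expiry) claim. Decoding the sample token `token1` gives `exp = 1662930172`, which is 11 September 2022 (UTC), so it has most likely expired and you will need your own token from the Admin tab. The sketch below shows one way to inspect a token's expiry with the standard library only. The helper names are hypothetical, the signature is not verified, and the demo token is built locally for illustration.

```python
import base64
import json
import time


def jwt_payload(token):
    """Decode the payload segment of a JWT without verifying its signature."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("expected three dot-separated JWT segments")
    segment = parts[1]
    segment += "=" * (-len(segment) % 4)  # restore stripped base64 padding
    return json.loads(base64.urlsafe_b64decode(segment))


def is_expired(token, now=None):
    """Return True if the token's `exp` claim is in the past."""
    exp = jwt_payload(token).get("exp")
    if exp is None:
        return False  # no expiry claim present
    current = time.time() if now is None else now
    return current >= exp


def make_demo_token(payload):
    """Build an unsigned, illustration-only token around `payload`."""
    raw = json.dumps(payload).encode()
    body = base64.urlsafe_b64encode(raw).rstrip(b"=").decode()
    return "header." + body + ".signature"


# Demo token with the same expiry as the sample token in this notebook.
demo_token = make_demo_token({"sub": "admin", "exp": 1662930172})
print(jwt_payload(demo_token)["exp"])
print(is_expired(demo_token))
```

Checking the expiry before calling `GiskardClient` can save a confusing authentication error. It is also safer to keep real tokens out of a shared notebook, for instance by reading them from an environment variable.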

### Upload your model and a dataset (see [documentation](https://docs.giskard.ai/start/guides/upload-your-model))

In [None]:
house_pricing.upload_model_and_df(
    prediction_function=reg_random_forest.predict, 
    model_type='regression',
    df=test_data, #the dataset you want to use to inspect your model
    column_types=column_types, #all the column types of df
    target='SalePrice', #the column name in df corresponding to the actual target variable (ground truth).
    feature_names=list(feature_types.keys()),#list of the feature names of prediction_function
    model_name='random_forest_v1',
    dataset_name='test_data'
)

### 🌟 If you want to upload a dataset without a model


For example, let's upload the train set to Giskard. This is key to creating drift tests in Giskard.

In [None]:
house_pricing.upload_df(
    df=train_data,
    column_types=column_types, #all the column types of df
    target="SalePrice", # do not pass this parameter if the dataset doesn't contain a target column
    name="train_data"
)
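
To give an idea of what a drift test compares, here is a minimal sketch of one common drift metric, the Population Stability Index (PSI), for a categorical feature. It is an illustration under stated assumptions, not Giskard's implementation: the function names, the `eps` smoothing and the sample values are all made up. The reference sample plays the role of the train set and the current sample the role of newer data.

```python
import math
from collections import Counter


def category_shares(values):
    """Return the share of each category in a non-empty sequence."""
    if not values:
        raise ValueError("values must not be empty")
    counts = Counter(values)
    total = sum(counts.values())
    return {category: count / total for category, count in counts.items()}


def psi(reference, current, eps=1e-6):
    """Population Stability Index between two categorical samples.

    `eps` replaces zero shares so the logarithm stays defined when a
    category appears in only one of the two samples.
    """
    ref = category_shares(reference)
    cur = category_shares(current)
    score = 0.0
    for category in set(ref) | set(cur):
        r = max(ref.get(category, 0.0), eps)
        c = max(cur.get(category, 0.0), eps)
        score += (c - r) * math.log(c / r)
    return score


# Made-up category labels standing in for a column such as 'Neighborhood'.
train_sample = ["A", "A", "B", "B"]
production_sample = ["A", "A", "A", "B"]

print("PSI: %.4f" % psi(train_sample, production_sample))
```

A PSI of 0 means both samples have identical category shares, and larger values indicate a stronger shift. Any threshold for raising an alert is a project-specific choice.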

You can also upload new production data to use as a validation set for your existing model. In that case, you might not have the ground-truth target variable.

In [None]:
production_data = data.drop(columns="SalePrice")

In [None]:
house_pricing.upload_df(
    df=production_data,
    column_types=feature_types, #all the column types without the target
    name="production_data"
)

### 🌟 If you just want to upload a model without a dataframe 

This happens, for instance, when you have built a new version of the model and want to inspect it using a validation dataframe that is already in Giskard.

For example, let's create a second version of the model using the CatBoost library.

In [None]:
!pip install catboost

In [None]:
from catboost import CatBoostRegressor

X['Basement1Type'] = X['Basement1Type'].fillna("")
X['Basement2Type'] = X['Basement2Type'].fillna("")
X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.20, random_state = 30)

model = CatBoostRegressor(iterations=2,
                           learning_rate=1,
                           depth=2)

model.fit(X_train, y_train, cat_features=categorical_features)

In [None]:
def prediction_function(X):
    # Work on a copy so the caller's dataframe is not modified in place
    X = X.copy()
    X['Basement1Type'] = X['Basement1Type'].fillna("")
    X['Basement2Type'] = X['Basement2Type'].fillna("")
    return model.predict(X)

In [None]:
house_pricing.upload_model(
    prediction_function=prediction_function,
    model_type='regression',
    feature_names=list(feature_types.keys()),#list of the feature names of prediction_function
    name='catboost',
    validate_df=train_data, #Optional. The validation df is not uploaded to the app; it's only used to check that the model has the expected format
    target="SalePrice", #Optional. target should be a column of validate_df
)
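
Before uploading, it can help to run a quick local check that the prediction function returns one numeric value per input row, similar in spirit to what `validate_df` is described as doing. The sketch below is a hypothetical helper, not Giskard's own validation, and it uses a stand-in prediction function on made-up rows. In the notebook you could call it as `smoke_test_regression(prediction_function, test_data[list(feature_types.keys())].head())`.

```python
import numbers


def smoke_test_regression(prediction_function, sample):
    """Check that a regression prediction function behaves as expected.

    `sample` is any object supporting len() that the function accepts,
    for example a few rows of a pandas dataframe.
    Returns the predictions as a list if the checks pass.
    """
    predictions = list(prediction_function(sample))
    if len(predictions) != len(sample):
        raise ValueError(
            "expected %d predictions, got %d" % (len(sample), len(predictions))
        )
    for value in predictions:
        # bool is a subclass of int, so exclude it explicitly
        if isinstance(value, bool) or not isinstance(value, numbers.Real):
            raise TypeError("non-numeric prediction: %r" % (value,))
    return predictions


# Stand-in model for illustration: price grows with living area.
def fake_prediction_function(rows):
    return [1000.0 * row["AbvGrndLivArea"] for row in rows]


sample_rows = [{"AbvGrndLivArea": 120}, {"AbvGrndLivArea": 85}]
print(smoke_test_regression(fake_prediction_function, sample_rows))
```

Running such a check on a handful of rows is cheap and surfaces shape or type problems before anything is sent to the server.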

### Happy exploring! 🧑‍🚀