# Regression on the hotel reviews [sklearn]
* Regression task of predicting review 'score', based on the review text.
* Reference notebook: <https://www.kaggle.com/code/jiashenliu/simple-regression-model-most-important-words/notebook>
* Dataset: <https://www.kaggle.com/code/jiashenliu/simple-regression-model-most-important-words/input>

# Quickstart

By running this notebook, you’ll create a whole test suite in a few lines of code. The model used here is a scikit-learn pipeline (TF-IDF features fed into a gradient boosting regressor) trained on the hotel reviews dataset. Feel free to use your own model (tabular, text, or LLM).

You’ll learn how to:
* Detect vulnerabilities by scanning the model
* Generate a test suite with domain-specific tests
* Customize your test suite by loading a test from the Giskard catalog
* Upload your model to the Giskard server to:
  * Compare models to decide which one to promote
  * Debug your tests to diagnose issues
  * Share your results and collect business feedback from your team

## Install Giskard

In [None]:
!pip install giskard

## Import libraries

In [None]:
import os
from pathlib import Path
from typing import Iterable
from urllib.request import urlretrieve

import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import FunctionTransformer
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_extraction.text import TfidfVectorizer

from giskard import Model, Dataset, scan, testing
from giskard.client.giskard_client import GiskardClient

## Define constants

In [None]:
# Constants.
FEATURE_COLUMN_NAME = "Full_Review"
TARGET_COLUMN_NAME = "Reviewer_Score"

# Paths.
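# URLs always use forward slashes, so build the URL as a plain string rather than with os.path.join
# (which would insert backslashes on Windows).
DATA_URL = "ftp://sys.giskard.ai/pub/unit_test_resources/hotel_text_regression_dataset/Hotel_Reviews.csv"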
DATA_PATH = Path.home() / ".giskard" / "hotel_text_regression_dataset" / "Hotel_Reviews.csv"

## Load data

In [None]:
def fetch_from_ftp(url: str, file: Path) -> None:
    """Helper to fetch data from the FTP server."""
    if not file.parent.exists():
        file.parent.mkdir(parents=True, exist_ok=True)

    if not file.exists():
        print(f"Downloading data from {url}")
        urlretrieve(url, file)

    print("Data was loaded!")

In [None]:
def load_data(**kwargs) -> pd.DataFrame:
    fetch_from_ftp(DATA_URL, DATA_PATH)
    df = pd.read_csv(DATA_PATH, **kwargs)

    # Build the text feature by concatenating the positive and negative reviews.
    df[FEATURE_COLUMN_NAME] = df.apply(lambda x: x['Positive_Review'] + ' ' + x['Negative_Review'], axis=1)

    return df

reviews_df = load_data(nrows=1000)

## Train-test split

In [None]:
train_X, test_X, train_Y, test_Y = train_test_split(reviews_df[[FEATURE_COLUMN_NAME]], reviews_df[TARGET_COLUMN_NAME], random_state=42)
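
Because no `test_size` is passed, scikit-learn falls back to its default of holding out 25% of the rows. The short sketch below works out the resulting split sizes for the 1,000 rows loaded above; the variable names are illustrative and not part of the notebook.

```python
import math

n_rows = 1000          # rows loaded with load_data(nrows=1000)
test_fraction = 0.25   # scikit-learn's default when test_size and train_size are both omitted

n_test = math.ceil(n_rows * test_fraction)  # rows held out for evaluation
n_train = n_rows - n_test                   # rows used for fitting

print(f"train rows: {n_train}, test rows: {n_test}")
```

Only the held-out rows are wrapped into the Giskard dataset in the next step, so the scan and tests run on data the model has not seen during fitting.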

## Wrap data with giskard

In [None]:
raw_data = pd.concat([test_X, test_Y], axis=1)
wrapped_data = Dataset(raw_data,
                       name="hotel_text_regression_dataset",
                       target=TARGET_COLUMN_NAME,
                       column_types={FEATURE_COLUMN_NAME: "text"})

## Define preprocessing steps

In [None]:
def adapt_vectorizer_input(df: pd.DataFrame) -> Iterable:
    """Adapt input for the vectorizers.

    Vectorizers expect a one-dimensional iterable of strings (such as a Series),
    not a DataFrame. We therefore select the single text column so that the
    input has one dimension.
    """

    df = df.iloc[:, 0]
    return df

## Define, fit and test model

In [None]:
# Define pipeline.
pipeline = Pipeline(steps=[
    ("vectorizer_adapter", FunctionTransformer(adapt_vectorizer_input)),
    ("vectorizer", TfidfVectorizer(max_features=10000)),
    ("regressor", GradientBoostingRegressor(n_estimators=10))
])

# Fit pipeline.
pipeline.fit(train_X, train_Y)

# Perform inference on train and test data.
pred_train = pipeline.predict(train_X)
pred_test = pipeline.predict(test_X)

In [None]:
train_metric = mean_absolute_error(train_Y, pred_train)
test_metric = mean_absolute_error(test_Y, pred_test)

print(f"Train MAE: {train_metric: .2f}\n"
      f"Test MAE: {test_metric: .2f}")
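
For readers unfamiliar with the metric, the sketch below spells out what mean absolute error computes, in plain Python. The function name and the toy scores are made up for illustration; the notebook itself relies on `sklearn.metrics.mean_absolute_error`.

```python
def mean_absolute_error_sketch(y_true, y_pred):
    """Average of the absolute differences between targets and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up review scores, for illustration only.
toy_true = [9.0, 7.5, 10.0, 4.0]
toy_pred = [8.0, 8.5, 9.0, 6.0]

toy_mae = mean_absolute_error_sketch(toy_true, toy_pred)
print(f"Toy MAE: {toy_mae:.2f}")
```

An MAE of 1.25 here means the toy predictions are off by 1.25 score points on average, in the same units as the target.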

## Wrap model with giskard

In [None]:
wrapped_model = Model(pipeline.predict,
                      model_type="regression",
                      name="hotel_text_regression",
                      feature_names=[FEATURE_COLUMN_NAME])

In [None]:
# Validate wrapped model.
pred_test_wrapped = wrapped_model.predict(wrapped_data).raw_prediction
wrapped_test_metric = mean_absolute_error(test_Y, pred_test_wrapped)
print(f"Wrapped Test MAE: {wrapped_test_metric: .2f}")

## Scan your model to find vulnerabilities
The Giskard scan detects vulnerabilities in your model, including performance biases, lack of robustness, data leakage, stochasticity, underconfidence, and ethical issues. For details, refer to the scan documentation.
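
To make the idea of a performance bias concrete, here is a minimal sketch in plain Python. It is not how Giskard implements its detectors: the toy records, the `mentions_noise` slicing rule, and the 1.5x ratio are all assumptions chosen for illustration. The sketch compares the error on one slice of the data with the error on the whole dataset.

```python
# Toy records: (review text, true score, predicted score). All values are made up.
records = [
    ("great location and friendly staff", 9.0, 8.8),
    ("clean room, nice breakfast", 8.5, 8.4),
    ("noise from the street all night", 4.0, 7.0),
    ("thin walls, constant noise", 3.5, 6.5),
    ("lovely view, helpful reception", 9.5, 9.0),
]


def mae(rows):
    """Mean absolute error over (text, true, predicted) rows."""
    return sum(abs(true - pred) for _, true, pred in rows) / len(rows)


def mentions_noise(row):
    """Hypothetical slicing rule: reviews that contain the word 'noise'."""
    return "noise" in row[0]


overall_mae = mae(records)
slice_mae = mae([row for row in records if mentions_noise(row)])

# Flag the slice when its error is much worse than the overall error.
RATIO_THRESHOLD = 1.5  # arbitrary illustrative threshold
slice_is_flagged = slice_mae > RATIO_THRESHOLD * overall_mae
print(f"overall MAE={overall_mae:.2f}, slice MAE={slice_mae:.2f}, flagged={slice_is_flagged}")
```

In this toy data the model does far worse on reviews that mention noise, which is the kind of pattern a performance-bias check is meant to surface.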

In [None]:
results = scan(wrapped_model, wrapped_data)

In [None]:
display(results)

## Generate a test suite from the Scan
The objects produced by the scan can be used as fixtures to generate a test suite that integrates domain-specific issues. To create custom tests, refer to the Test your ML Model page.

In [None]:
test_suite = results.generate_test_suite("My first test suite")
test_suite.run()

## Customize your suite by loading objects from the Giskard catalog

The Giskard open-source catalog lets you load:
* Tests such as metamorphic, performance, prediction & data drift, statistical tests, etc
* Slicing functions such as detectors of toxicity, hate, emotion, etc
* Transformation functions such as generators of typos, paraphrase, style tune, etc
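
As a rough illustration of how a transformation function and a metamorphic test fit together, consider the plain-Python sketch below. Nothing in it comes from the Giskard catalog: `toy_score` is a hypothetical stand-in model, and the two transformations and the 0.5 tolerance are assumptions. A metamorphic invariance test perturbs each input and checks that the prediction barely moves.

```python
def toy_score(text):
    """Hypothetical stand-in model: a base score nudged by a few keywords."""
    score = 7.0
    for word in text.lower().split():
        if word in {"great", "clean", "friendly"}:
            score += 1.0
        elif word in {"dirty", "noisy", "rude"}:
            score -= 1.5
    return score


def to_upper(text):
    """Transformation: change the casing, which should not change the meaning."""
    return text.upper()


def swap_first_letters(text):
    """Transformation: crude typo generator that swaps the first two letters of each word."""
    return " ".join(w[1] + w[0] + w[2:] if len(w) > 1 else w for w in text.split())


def invariance_pass_rate(model, texts, transform, tolerance=0.5):
    """Share of texts whose prediction moves by at most `tolerance` after the transform."""
    unchanged = sum(abs(model(t) - model(transform(t))) <= tolerance for t in texts)
    return unchanged / len(texts)


reviews = ["Great location, clean room", "Rude staff and noisy street", "Average stay"]
upper_rate = invariance_pass_rate(toy_score, reviews, to_upper)
typo_rate = invariance_pass_rate(toy_score, reviews, swap_first_letters)
print(f"uppercase pass rate: {upper_rate:.0%}, typo pass rate: {typo_rate:.0%}")
```

The toy model is robust to casing but not to typos, so a test built on the typo transformation would fail for any reasonable pass-rate threshold.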

For demo purposes, we will load a simple unit test (test_r2) that checks if the test R2 score is above the given threshold. For more examples of tests and functions, refer to the Giskard catalog.
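
The sketch below shows what an R2 threshold check amounts to, in plain Python. It is an illustration under assumptions, not Giskard's implementation: the function name and toy scores are invented, and the check is treated as passing when the score reaches the threshold.

```python
def r2_score_sketch(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot.

    Assumes y_true is not constant, otherwise SS_tot is zero.
    """
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)          # total sum of squares
    return 1.0 - ss_res / ss_tot


# Made-up review scores, for illustration only.
toy_true = [9.0, 7.0, 10.0, 4.0, 5.0]
toy_pred = [8.0, 7.0, 9.0, 5.0, 6.0]

threshold = 0.7
toy_r2 = r2_score_sketch(toy_true, toy_pred)
test_passed = toy_r2 >= threshold
print(f"R2={toy_r2:.3f}, passed={test_passed}")
```

An R2 of 1 means perfect predictions, 0 means the model does no better than always predicting the mean, and negative values mean it does worse than that.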

In [None]:
test_suite.add_test(testing.test_r2(model=wrapped_model, dataset=wrapped_data, threshold=0.7)).run()

## Upload your suite to the Giskard server

Upload your suite to the Giskard server to:
* Compare models to decide which model to promote
* Debug your tests to diagnose the issues
* Create more domain-specific tests that integrate business feedback
* Share your results

In [None]:
# Uploading the test suite automatically saves the model, dataset, tests, and slicing & transformation functions to the Giskard server.
# Create a Giskard client after installing the Giskard server (see documentation).
token = "API_TOKEN"  # Find it in Settings in the Giskard server

client = GiskardClient(
    url="http://localhost:19000",  # URL of your Giskard instance
    token=token
)

my_project = client.create_project("my_project", "PROJECT_NAME", "DESCRIPTION")

# Upload to the current project ✉️
test_suite.upload(client, "my_project")