# Amazon reviews classification [sklearn]
* Binary classification of the 'helpfulness' (quality) of product reviews.
* Reference notebook: <https://t-lanigan.github.io/amazon-review-classifier/>
* Dataset: <http://snap.stanford.edu/data/amazon/productGraph/categoryFiles/reviews_Apps_for_Android_5.json.gz>

By running this notebook, you’ll create a whole test suite in a few lines of code. The model used here is a simple classification model trained on the Amazon reviews dataset. Feel free to use your own model (tabular, text, or LLM).

You’ll learn how to:
* Detect vulnerabilities by scanning the model
* Generate a test suite with domain-specific tests
* Customize your test suite by loading a test from the Giskard catalog
* Upload your model to the Giskard server to:
    * Compare models to decide which one to promote
    * Debug your tests to diagnose issues
    * Share your results and collect business feedback from your team

## Install Giskard

In [None]:
!pip install giskard

## Import libraries

In [None]:
import os
import string

import numpy as np
import pandas as pd
from pathlib import Path
from sklearn.pipeline import Pipeline
from urllib.request import urlretrieve
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import FunctionTransformer
from sklearn.feature_extraction.text import TfidfVectorizer

import giskard
from giskard import Dataset, Model, GiskardClient, testing

# Disable chained assignment warning.
pd.options.mode.chained_assignment = None

## Define constants

In [None]:
# Constants.
RANDOM_SEED = 0
TEST_RATIO = 0.2

TARGET_THRESHOLD = 0.5
TARGET_NAME = "isHelpful"

# Paths.
# Build the URL with forward slashes; os.path.join would insert backslashes on Windows.
DATA_URL = "ftp://sys.giskard.ai/pub/unit_test_resources/amazon_review_dataset/reviews.json"
DATA_PATH = Path.home() / ".giskard" / "amazon_review_dataset" / "reviews.json"

## Dataset preparation

### Load and preprocess data

In [None]:
def fetch_from_ftp(url: str, file: Path) -> None:
    """Helper to fetch data from the FTP server."""
    if not file.parent.exists():
        file.parent.mkdir(parents=True, exist_ok=True)

    if not file.exists():
        print(f"Downloading data from {url}")
        urlretrieve(url, file)

    print("Data was loaded!")


def download_data(**kwargs) -> pd.DataFrame:
    """Download the dataset using URL."""
    fetch_from_ftp(DATA_URL, DATA_PATH)
    _df = pd.read_json(DATA_PATH, lines=True, **kwargs)
    return _df


def preprocess_data(df: pd.DataFrame) -> pd.DataFrame:
    """Perform data-preprocessing steps."""
    print("Start data preprocessing...")

    # Select columns.
    # Copy the selection so the column assignments below modify an independent frame.
    df = df[["reviewText", "helpful"]].copy()

    # Remove null characters (\x00) from the review text.
    df["reviewText"] = df["reviewText"].str.replace("\x00", "", regex=False)

    # Extract numbers of helpful and total votes.
    df['helpful_ratings'] = df.helpful.apply(lambda x: x[0])
    df['total_ratings'] = df.helpful.apply(lambda x: x[1])

    # Keep only reviews with more than 10 votes so the helpfulness ratio is meaningful.
    df = df[df.total_ratings > 10].copy()

    # Create target column.
    df[TARGET_NAME] = np.where((df.helpful_ratings / df.total_ratings) > TARGET_THRESHOLD, 1, 0).astype(int)

    # Delete columns we don't need anymore.
    df.drop(columns=["helpful", 'helpful_ratings', 'total_ratings'], inplace=True)

    print("Data preprocessing finished!")

    return df

In [None]:
reviews_df = download_data(nrows=20000)
reviews_df = preprocess_data(reviews_df)
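
The labeling rule applied in `preprocess_data` can be illustrated on a few hand-written vote pairs. The block below is a minimal sketch in plain Python, not part of the original pipeline: `label_review`, `MIN_TOTAL_VOTES`, and the sample pairs are hypothetical names and values chosen for illustration, and the thresholds mirror the ones used above.

```python
from typing import Optional

MIN_TOTAL_VOTES = 10    # mirrors the `total_ratings > 10` filter (assumed name)
TARGET_THRESHOLD = 0.5  # same value as the notebook constant


def label_review(helpful_votes: int, total_votes: int) -> Optional[int]:
    """Return 1 (helpful), 0 (not helpful), or None if the review is filtered out."""
    if total_votes <= MIN_TOTAL_VOTES:
        return None  # too few votes, the review is dropped
    return int(helpful_votes / total_votes > TARGET_THRESHOLD)


# Each pair is [helpful_votes, total_votes], like the `helpful` column.
samples = [[8, 12], [3, 20], [6, 12], [1, 2]]
labels = [label_review(h, t) for h, t in samples]
print(labels)  # [1, 0, 0, None]
```

Note that the comparison is strict: a review with exactly half of its votes marked helpful gets label 0.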

### Train-test split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(reviews_df[["reviewText"]], reviews_df[TARGET_NAME],
                                                    test_size=TEST_RATIO, random_state=RANDOM_SEED)

### Wrap dataset with Giskard

In [None]:
raw_data = pd.concat([X_test, y_test], axis=1)
wrapped_data = Dataset(raw_data, name="reviews", target=TARGET_NAME, column_types={"reviewText": "text"})

## Model training

### Define preprocessing pipeline

In [None]:
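def remove_punctuation(x):
    """Remove punctuation from the `reviewText` column of the input DataFrame."""
    # Build the translation table once instead of once per row.
    table = str.maketrans('', '', string.punctuation)
    return x.reviewText.apply(lambda row: row.translate(table))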

preprocessor = Pipeline(steps=[
    ("punctuation", FunctionTransformer(remove_punctuation)),
    ("vectorizer", TfidfVectorizer(stop_words='english', min_df=0.01))
])
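
To see what the punctuation step does to a single review, here is a small standalone sketch using only the standard library. The helper name `strip_punctuation` and the sample sentence are made up for illustration; the technique (`str.translate` with a deletion table built from `string.punctuation`) is the same one used in the cell above.

```python
import string

# Translation table that deletes every ASCII punctuation character.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def strip_punctuation(text: str) -> str:
    """Return `text` with all ASCII punctuation characters removed."""
    return text.translate(_PUNCT_TABLE)


cleaned = strip_punctuation("Great app!!! Works well, doesn't crash.")
print(cleaned)  # Great app Works well doesnt crash
```

Apostrophes are removed too, so contractions such as "doesn't" become "doesnt" before vectorization.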

### Build estimator

In [None]:
pipeline = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("estimator", LogisticRegression(random_state=RANDOM_SEED))
])

pipeline.fit(X_train, y_train)

train_metric = roc_auc_score(y_train, pipeline.predict_proba(X_train)[:, 1])
test_metric = roc_auc_score(y_test, pipeline.predict_proba(X_test)[:, 1])

print(f"Train ROC-AUC score: {train_metric:.2f}")
print(f"Test ROC-AUC score: {test_metric:.2f}")
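
ROC-AUC can be read as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it by brute force over all positive/negative pairs. It is only meant to clarify the metric and is not how scikit-learn implements `roc_auc_score`; the function name `pairwise_roc_auc` and the toy data are illustrative assumptions.

```python
from typing import Sequence


def pairwise_roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Fraction of (positive, negative) pairs ranked correctly; ties count as half."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("ROC-AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


auc = pairwise_roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)  # 0.75
```

The quadratic loop is fine for a toy example but too slow for real datasets, where a rank-based implementation such as the scikit-learn one is the better choice.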

### Wrap model with Giskard

In [None]:
wrapped_model = Model(model=pipeline.predict_proba,
                      model_type="classification",
                      feature_names=["reviewText"],
                      name="review_helpfulness_predictor",
                      classification_threshold=0.5,
                      classification_labels=[0, 1])

# Validate wrapped model.
wrapped_predict = wrapped_model.predict(wrapped_data).raw[:, 1]
wrapped_test_metric = roc_auc_score(y_test, wrapped_predict)

print(f"Wrapped Test ROC-AUC score: {wrapped_test_metric:.2f}")

## Scan your model to find vulnerabilities
With the Giskard scan feature, you can detect vulnerabilities in your model, including performance biases, lack of robustness, data leakage, stochasticity, underconfidence, ethical issues, and more. For details, refer to the scan documentation.

In [None]:
results = giskard.scan(model=wrapped_model, dataset=wrapped_data)

In [None]:
display(results)

## Generate a test suite from the Scan
The objects produced by the scan can be used as fixtures to generate a test suite that integrates domain-specific issues. To create custom tests, refer to the Test your ML Model page.

In [None]:
test_suite = results.generate_test_suite("My first test suite")
test_suite.run()

## Customize your suite by loading objects from the Giskard catalog

The Giskard open source catalog lets you load:
* Tests such as metamorphic, performance, prediction & data drift, statistical tests, etc.
* Slicing functions such as detectors of toxicity, hate, emotion, etc.
* Transformation functions such as generators of typos, paraphrases, style tuning, etc.

For demo purposes, we will load a simple unit test (test_f1) that checks if the test F1 score is above the given threshold. For more examples of tests and functions, refer to the Giskard catalog.

In [None]:
test_suite.add_test(testing.test_f1(model=wrapped_model, dataset=wrapped_data, threshold=0.7)).run()
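
For readers unfamiliar with the metric behind `test_f1`, the following sketch computes the binary F1 score from scratch and compares it to a threshold. It illustrates the metric only and makes no claim about how Giskard implements the test internally; `f1_binary` and the toy labels are hypothetical.

```python
from typing import Sequence


def f1_binary(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 score for the positive class (label 1): harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 1, 0, 1]
score = f1_binary(y_true, y_pred)
passed = score >= 0.7  # same threshold value as in the cell above
print(f"F1 = {score:.2f}, passed = {passed}")
```

In this toy case there are three true positives, one false positive, and one false negative, so precision and recall are both 0.75.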

## Upload your suite to the Giskard server

Upload your suite to the Giskard server to:
* Compare models to decide which model to promote
* Debug your tests to diagnose the issues
* Create more domain-specific tests that integrate business feedback
* Share your results

In [None]:
# Uploading the test suite automatically saves the model, dataset, tests, slicing & transformation functions to the Giskard server
# Create a Giskard client after installing the Giskard server (see documentation)
token = "API_TOKEN"  # Find it in Settings in the Giskard server

client = GiskardClient(
    url="http://localhost:19000",  # URL of your Giskard instance
    token=token
)

my_project = client.create_project("my_project", "PROJECT_NAME", "DESCRIPTION")

# Upload to the current project ✉️
test_suite.upload(client, "my_project")