# Text Classification <a class="tocSkip">

**In this notebook:**

* Create a scikit-learn classification model with a TF-IDF vectorizer and a LinearSVC classifier

* This is a demo notebook that shows the ML Lab capabilities. See the walkthrough section of the ML Lab documentation for details, such as how to obtain the dataset used here. The overall workflow is:
    1. Create a project in ML Lab
    2. Go to the Datasets view and upload the News Data
    3. Go through this notebook and execute the cells to connect to an ML Lab instance, download the dataset to the workspace, create an experiment, and run it.
    4. See the results of the experiments in the Experiments Dashboard of the web app
    5. Upload the Unified Model to the ML Lab instance
    6. One-click deploy the Unified Model via the web app
    7. Access the service's API in the web app Services view and run an example prediction

## Dependencies
Install, load, and initialize all required dependencies for this experiment.

### Import Dependencies

In [None]:
# System libraries
import logging, os, sys

# Enable logging
logging.basicConfig(format='[%(levelname)s] %(message)s', level=logging.INFO, stream=sys.stdout)

# Initialize tqdm to always use the notebook progress bar
import tqdm
tqdm.tqdm = tqdm.tqdm_notebook

# Third-party libraries
import numpy as np
import pandas as pd

# ML Lab libraries
from lab_client import Environment


### Initialize Environment

In [None]:
# Initialize environment
env = Environment(project="ml-lab-demo"  # ML Lab project you want to work on. Must exist / be created in the ML Lab instance.
                  # Only required in stand-alone workspace deployments:
                  # lab_endpoint="LAB_ENDPOINT", # ML Lab endpoint url: e.g. http://10.2.3.45:8091
                  # lab_api_token="LAB_API_TOKEN"
                 ) 

# Initialize experiment
exp = env.create_experiment('News Text Classification')

## Load Data
Download, explore, and prepare all required data for the experiment in this section.
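
The preparation in this section comes down to two steps: dropping labels with too few examples, then shuffling and cutting the rows into a train and a test part. The following is a minimal sketch of those two steps on a made-up toy DataFrame. The rows, the threshold of 2, and the variable names are illustrative assumptions, not the values or data used in this notebook.

```python
import pandas as pd

# Toy stand-in for the news dataset. The column names "text" and "label"
# match the ones used in this notebook; the rows themselves are made up.
toy_df = pd.DataFrame({
    "text": [f"doc {i}" for i in range(12)],
    "label": ["sports"] * 6 + ["politics"] * 5 + ["rare"] * 1,
})

min_label_count = 2  # illustrative threshold (the notebook uses 10)
train_size = 0.80

# Keep only rows whose label occurs more than `min_label_count` times
filtered = toy_df.groupby("label").filter(lambda g: len(g) > min_label_count)

# Shuffle once with a fixed seed, then cut at the 80% position
shuffled = filtered.sample(frac=1, random_state=2)
cut = int(train_size * len(shuffled))
train_part, test_part = shuffled.iloc[:cut], shuffled.iloc[cut:]

print(len(filtered), len(train_part), len(test_part))
```

Because the shuffle uses a fixed `random_state`, the split is reproducible across runs. Note that this is a plain random split, so class proportions in the two parts are not guaranteed to match.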

In [None]:
# Get data from remote storage of ML Lab only if it does not exist locally
dataset_path = env.get_file('datasets/news-categorized.csv')

# Read data into basic pandas dataframe
df = pd.read_csv(dataset_path, sep=";")

# Configure dataset transformation
dataset_config = {
    'train_size':0.80,
    'min_label_count': 10
}

# Add dataset configuration to experiment parameters
exp.log_params(dataset_config)

# Keep only labels that have more than `min_label_count` (10) items
df = df.groupby("label").filter(lambda x: len(x) > dataset_config['min_label_count'])

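# Shuffle the dataset, then split it into train (80%) and test (20%) based on the dataset configuration
shuffled_df = df.sample(frac=1, random_state=2)
split_index = int(dataset_config['train_size'] * len(shuffled_df))
train_df, test_df = shuffled_df.iloc[:split_index], shuffled_df.iloc[split_index:]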

# Add dataframes to experiment (will be logged and accessible within the experiment)
exp.add_artifact("train_data", train_df)
exp.add_artifact("test_data", test_df)

# Show sample
train_df.head()

## Train Model
Implementation, configuration, and evaluation of the experiment.
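
As a stand-alone illustration of the model used in this section, the sketch below fits a TF-IDF vectorizer followed by a calibrated LinearSVC on a handful of made-up sentences. The toy texts, the labels, the `tokenize` helper, and `min_df=1` are assumptions chosen to keep the example small; it does not use the ML Lab experiment API. Wrapping `LinearSVC` in `CalibratedClassifierCV` is what makes `predict_proba` available, since `LinearSVC` alone only provides decision scores.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC


def tokenize(text):
    # Normalize line breaks and split on whitespace
    return str(text).replace("\n", " ").replace("\r", "").strip().split()


# Made-up training data: six examples per class so that cv=3 has enough samples
texts = [
    "the team won the match", "the striker scored a goal",
    "the coach praised the team", "fans cheered at the match",
    "the goal decided the final", "the team lost the final",
    "parliament passed the law", "the minister gave a speech",
    "voters elected a new parliament", "the law was debated by the minister",
    "the election campaign started", "the senate rejected the law",
]
labels = ["sports"] * 6 + ["politics"] * 6

pipeline = Pipeline([
    # The identity analyzer lets the vectorizer accept pre-tokenized documents
    ("tfidf", TfidfVectorizer(analyzer=lambda tokens: tokens, min_df=1)),
    ("lsvc_calib", CalibratedClassifierCV(LinearSVC(), method="isotonic", cv=3)),
])
pipeline.fit([tokenize(t) for t in texts], labels)

# One probability per class, in the order given by pipeline.classes_
probs = pipeline.predict_proba([tokenize("the team won the final")])
print(dict(zip(pipeline.classes_, probs[0])))
```

With so few samples the calibrated probabilities are coarse; the point is only to show how the pieces fit together.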

### Define Experiment

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.calibration import CalibratedClassifierCV

# Preprocessing logic
def preprocess(data, **kwargs):
    return str(data).replace('\n', ' ').replace('\r', '').strip().split()
    
# Training logic
def train(exp, params, artifacts):
    # exp (= Experiment instance)
    # params (= parameter dictionary)
    # artifacts (= dictionary of added artifacts)
    
    # Get artifacts for the experiment run
    train_df = artifacts["train_data"]
    test_df = artifacts["test_data"]
    
    # Experiment Implementation
    classification_pipeline = Pipeline([
        ("tfidf", TfidfVectorizer(analyzer=lambda x: x,min_df=params['min_df'])),
        ("lsvc_calib", CalibratedClassifierCV(LinearSVC(verbose=0),method="isotonic", cv=3))
    ])
    
    sklearn_classifier = classification_pipeline.fit(
        [preprocess(item) for item in train_df["text"].tolist()], train_df["label"].tolist()
    )
    
    # Add trained model instance to experiment -> it can be accessed after the experiment run is finished
    exp.add_artifact("sklearn_classifier", sklearn_classifier)
    
    # Evaluate trained model
    score = sklearn_classifier.score(
        [preprocess(item) for item in test_df["text"].tolist()], test_df["label"].tolist()
    )
    
    # log a metric to the current experiment
    exp.log_metric("accuracy", score)
    
    print("Model trained. Accuracy: " + str(score))
    # optional: return the most descriptive metric (main objective) for the experiment
    return score


### Run Experiment

In [None]:
# Define parameter configuration for experiment run
params = {
    'min_df': 2
}

# Run experiment and sync all metadata
exp.run_exp(train, params)

## Deploy Model
Wrap the model with the Unified Model API and upload it to the remote storage.

### Create Unified Model
You can find information on how to create a self-contained executable model file in the [unified model library](https://github.com/SAP/machine-learning-lab/tree/master/libraries/unified-model).
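
To illustrate why a preprocessing function is bundled with the trained model, here is a minimal sketch of the general idea: a wrapper that applies the same transformation at prediction time that was used at training time, so callers can pass raw text. `TextModelWrapper` and `KeywordModel` are hypothetical names invented for this example; they are not part of the unified model library, and its actual classes and behavior may differ.

```python
class TextModelWrapper:
    """Hypothetical wrapper for illustration; this is NOT the unified model API."""

    def __init__(self, model, transform_func, name):
        self.model = model
        self.transform_func = transform_func
        self.name = name

    def predict(self, raw_text):
        # Apply the training-time preprocessing, then delegate to the model
        features = self.transform_func(raw_text)
        return self.model.predict([features])[0]


class KeywordModel:
    """Stand-in for a trained classifier that expects lists of tokens."""

    def predict(self, token_lists):
        return ["sports" if "match" in tokens else "other" for tokens in token_lists]


def preprocess(data, **kwargs):
    # Normalize line breaks and split on whitespace
    return str(data).replace("\n", " ").replace("\r", "").strip().split()


wrapped = TextModelWrapper(KeywordModel(), transform_func=preprocess, name="demo.model")
prediction = wrapped.predict("The match\nended in a draw")
print(prediction)
```

Keeping the transformation inside the wrapper means a deployed service does not need to know how the training data was tokenized.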

In [None]:
# Create unified model instance here
from unified_model.predefined_models.sklearn_models import SklearnTextClassifier

# Optional: create a descriptive name for the model
model_name = "news-categorized_sklearn_classifier.model"

# Initialize unified model instance with trained model
sklearn_model = SklearnTextClassifier(exp.get_artifact("sklearn_classifier"), 
                                      transform_func=preprocess,
                                      name=model_name)

# Save the unified model within the dedicated experiment folder
sklearn_model_path = sklearn_model.save(exp.create_file_path(model_name))

# Evaluate unified model with test data
metrics, label_scores = sklearn_model.evaluate(test_df['text'].tolist(), test_df['label'].tolist(), per_label=True)
print(metrics)

### Upload Unified Model

In [None]:
env.upload_file(sklearn_model_path, data_type="model")