# Find Good Hyperparameters for a News Recommendation System with Tune

The goal of this example is to train a very simple news recommendation system. We will:
- Prepare the training data in parallel with Ray,
- Train a simple model that classifies article titles as "popular" or "less popular" using scikit-learn, and
- Find good hyperparameter settings for the model with Tune, Ray's parallel hyperparameter optimization library.

### Downloading And Preparing The Training Data

First we download and uncompress 400,000 Hacker News submissions. This is a small subset of the articles that have been submitted to https://news.ycombinator.com. The data includes the title of each submission and its score, which roughly corresponds to the number of upvotes. The data is split across 4 JSON files, named `submission-1.json` through `submission-4.json`. The `head` command below prints the first two lines of the first file. Finally, we delete the zip file, since the data has already been extracted.

In [1]:
!wget -nc https://s3-us-west-2.amazonaws.com/ray-tutorials/hackernews.zip
!unzip -o hackernews.zip
!head -n 2 submission-1.json
!rm -rf "hackernews.zip"

--2023-01-24 15:54:00--  https://s3-us-west-2.amazonaws.com/ray-tutorials/hackernews.zip
Resolving s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)... 52.218.213.128, 52.218.246.136, 52.218.219.32, ...
Connecting to s3-us-west-2.amazonaws.com (s3-us-west-2.amazonaws.com)|52.218.213.128|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 56402193 (54M) [application/zip]
Saving to: ‘hackernews.zip’


2023-01-24 15:54:03 (18.6 MB/s) - ‘hackernews.zip’ saved [56402193/56402193]

Archive:  hackernews.zip
  inflating: submission-1.json       
  inflating: submission-2.json       
  inflating: submission-3.json       
  inflating: submission-4.json       
{"body": {"descendants": 0, "url": "http://markpincus.blogspot.com/2005/03/peopleweb-i-believe-we-are-close-to.html", "text": "", "title": "The PeopleWeb | Mark Pincus Blog (March 2005)", "by": "sayemm", "score": 3, "time": 1286515576, "type": "story", "id": 1770734}, "source": "firebase", "id": 1770734, "retriev

In [2]:
pip install pandas==1.3.2

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [3]:
pip install scikit-learn

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


In [1]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import json
import numpy as np
import pandas as pd
import ray
import time

In [2]:
# Connect to the Ray cluster, where all required packages are already installed.
ray.init(address="ray://kuberay-head-svc.kuberay:10001", runtime_env={"working_dir": "./"})

2023-01-24 15:54:37,080	INFO packaging.py:546 -- Creating a file package for local directory './'.
2023-01-24 15:54:37,573	INFO packaging.py:373 -- Pushing file package 'gcs://_ray_pkg_3fd8ce61f6676437.zip' (149.47MiB) to Ray cluster...
2023-01-24 15:54:39,089	INFO packaging.py:386 -- Successfully pushed file package 'gcs://_ray_pkg_3fd8ce61f6676437.zip'.


| Property | Value |
|---|---|
| Python version: | 3.8.13 |
| Ray version: | 2.2.0 |
| Dashboard: | http://10.244.3.8:8265 |


The function below parses a chunk of the data and produces a pandas DataFrame with the titles and scores of the submissions.

In [3]:
def parse_hn_submissions(path):
    with open(path, "r") as f:
        records = []
        for line in f.readlines():
            body = json.loads(line)["body"]
            records.append({"data": body["title"], "score": body["score"]})
        return pd.DataFrame(records)
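
To make the record layout concrete, here is a minimal sketch that parses a single line of the kind printed by `head` earlier. The sample line is a shortened, hand-written stand-in that keeps only a few fields, and `extract_record` is a hypothetical helper that is not part of the notebook; it mirrors what `parse_hn_submissions` does for each line.

```python
import json

# Shortened stand-in for one line of submission-1.json (assumption: only the
# fields needed for this illustration are kept).
sample_line = json.dumps({
    "body": {"title": "The PeopleWeb | Mark Pincus Blog (March 2005)", "score": 3},
    "source": "firebase",
    "id": 1770734,
})


def extract_record(line):
    """Return the title/score pair that parse_hn_submissions keeps per line."""
    body = json.loads(line)["body"]
    return {"data": body["title"], "score": body["score"]}


record = extract_record(sample_line)
print(record)
```

Each file contains one such JSON object per line, so the DataFrame ends up with one row per submission.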

We now process all the data chunks and concatenate them into a single dataframe:

In [4]:
start_time = time.time()

files = ["submission-" + str(i) + ".json" for i in range(1, 5)]
records = [parse_hn_submissions(file) for file in files]
df = pd.concat(records)

end_time = time.time()
duration = end_time - start_time
print("Took {} seconds to parse the hackernews submissions".format(duration))

df.head()

Took 3.5356011390686035 seconds to parse the hackernews submissions


| | data | score |
|---|---|---|
| 0 | The PeopleWeb \| Mark Pincus Blog (March 2005) | 3 |
| 1 | Computer science and programming are two separ... | 1 |
| 2 | Don't Go It Alone: Create an Advisory Board | 1 |
| 3 | Wikileaks Secret Dreams | 1 |
| 4 | MakeMyTrip.com: Is eCommerce in India Finall... | 1 |


We use the following lines to determine a cutoff for what we consider a "popular" article. The median score for articles is 1, so we label articles with a score higher than that as class "1" (stored as `True`) and everything else as class "0" (`False`).

In [5]:
df["score"].median()

1.0

In [6]:
df["target"] = df["score"] > 1.0

Note: If the line above raises an error, try downgrading pandas to version 1.3.2 by uncommenting and running the line below.

In [10]:
#!pip3 install pandas==1.3.2
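
As a small worked example of the labeling rule, the sketch below applies the same median cutoff to the scores of the five submissions shown by `df.head()`. It uses only the standard library, so it illustrates the logic rather than the pandas call itself.

```python
from statistics import median

# Scores of the five submissions shown by df.head() earlier in the notebook.
scores = [3, 1, 1, 1, 1]

cutoff = median(scores)
# Strictly greater than the cutoff: scores equal to the median are labeled False.
targets = [score > cutoff for score in scores]

print(cutoff)   # 1
print(targets)  # [True, False, False, False, False]
```

Because the comparison is strict, every submission whose score equals the median lands in the negative class.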

We are now done preparing the data and can start training a model.

### Training A Model


First we split the data into a train and test set.

In [7]:
from sklearn.model_selection import train_test_split
train, test = train_test_split(df, test_size=0.2)

The following defines a pipeline that first converts the title of the submission to a bag of words and then applies an SVM for the actual classification (`SGDClassifier` with `loss="hinge"` is a linear SVM trained with stochastic gradient descent). Note that we fit a very simple SVM here due to the computational restrictions of Binder. With more resources, a state-of-the-art model like [BERT](https://ai.googleblog.com/2018/11/open-sourcing-bert-state-of-art-pre.html) would be a better choice; in that case the code would be structured similarly.

In [8]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier

pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("clf", SGDClassifier(loss="hinge", penalty="l2",
                          alpha=0.001,
                          max_iter=5, tol=1e-3,
                          warm_start=True))])
result = pipeline.fit(train.data, train.target)

predicted = result.predict(train.data)
print("Accuracy on the training set is {}".format(np.mean(predicted == train.target)))



Accuracy on the training set is 0.585496875


In [9]:
predicted = pipeline.predict(test.data)
print("Accuracy on the test set is {}".format(np.mean(predicted == test.target)))

Accuracy on the test set is 0.5814625


We can also classify new titles as follows:

In [10]:
pipeline.predict(["Iconic consoles of the IBM System/360 mainframes, 55 years old today", "Are Banned Drugs in Your Meat?"])

array([ True, False])
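
To make the bag-of-words step of the pipeline more concrete, the following sketch runs `CountVectorizer` with its default settings on two made-up titles and prints the resulting vocabulary and count matrix. The titles are invented for illustration and do not come from the dataset.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up titles, used only for illustration.
titles = [
    "Show HN: A tiny Ray example",
    "Ray Tune example: tune a tiny model",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(titles)  # sparse matrix, one row per title

# Vocabulary terms ordered by their column index in the count matrix.
vocabulary = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
print(vocabulary)
# ['example', 'hn', 'model', 'ray', 'show', 'tiny', 'tune']
print(counts.toarray())
# [[1 1 0 1 1 1 0]
#  [1 0 1 1 0 1 2]]
```

With the default settings, titles are lowercased and single-character tokens such as "a" are dropped. Word order is discarded, so the classifier only sees how often each word occurs in a title.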

### Hyperparameter Tuning

Now let's try to improve these results with hyperparameter tuning, the process of finding the best parameters for the learning algorithm. These parameters are typically a few numbers such as the learning rate schedule (i.e. how large a step to take in each iteration), regularization parameters, or the size of the model. By tuning these knobs, we can typically make the model perform better. Tune supports a number of different algorithms for hyperparameter tuning. The simplest is grid search, which exhaustively tries out every combination of the given parameter values. More sophisticated algorithms include HyperBand and Population Based Training. If you want to learn more about these, check out the [Tune documentation](https://ray.readthedocs.io/en/latest/tune.html).
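
As a rough illustration of what "exhaustively" means for grid search, the sketch below expands a small search space into the full list of configurations in plain Python. It is not how Tune implements grid search internally; `expand_grid` is a hypothetical helper, and the `max_iter` values are invented (only `alpha` is tuned in this notebook).

```python
from itertools import product

# Hypothetical search space; only "alpha" is tuned in this notebook.
search_space = {
    "alpha": [1e-3, 1e-4],
    "max_iter": [5, 10],
}


def expand_grid(space):
    """Return every combination of the listed values as a config dict."""
    names = list(space)
    return [dict(zip(names, values)) for values in product(*space.values())]


configs = expand_grid(search_space)
for config in configs:
    print(config)
```

Two values for each of two parameters yield four configurations. The number of trials is the product of the list lengths, which is why grid search becomes expensive as more parameters are added.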

In [11]:
import os
from ray import tune

First we put the training data into the object store (so that it is re-used between training runs) and define the objective function. The objective function `train_func` takes two arguments: `config`, which contains the hyperparameters for the current trial, and `reporter`, which is used to report the performance of these hyperparameters back to Tune so that it can select the next trial based on the performance of past ones.

The following function instantiates a model corresponding to the hyperparameters in `config`, runs 5 iterations of training, and reports the accuracy on the test set after each iteration.

In [12]:
train_id = ray.put(train)
test_id = ray.put(test)

def train_func(config, reporter):
    pipeline = Pipeline([
        ("vect", CountVectorizer()),
        ("clf", SGDClassifier(loss="hinge", penalty="l2",
                              alpha=config["alpha"],
                              max_iter=5, tol=1e-3,
                              warm_start=True))])
    train = ray.get(train_id)
    test = ray.get(test_id)
    for i in range(5):
        # Perform one epoch of SGD
        X = pipeline.named_steps["vect"].fit_transform(train.data)
        pipeline.named_steps["clf"].partial_fit(X, train.target, classes=[0, 1])
        reporter(mean_accuracy=np.mean(pipeline.predict(test.data) == test.target))  # report metrics
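
The epoch loop inside `train_func` can be tried without Ray. The sketch below runs the same kind of `partial_fit` loop on a few made-up titles. It is a simplified illustration, not the notebook's training code: the data is invented, the accuracy is measured on the training titles for brevity, and the vectorizer is fitted once before the loop.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier

# Made-up toy data, only to keep the sketch self-contained.
titles = ["ray tune tutorial", "cat pictures", "ray cluster guide", "more cat pictures"]
labels = np.array([1, 0, 1, 0])

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(titles)  # fit the vocabulary once, up front

clf = SGDClassifier(loss="hinge", penalty="l2", alpha=1e-4, random_state=0)

accuracies = []
for epoch in range(5):
    # Each partial_fit call performs one pass of SGD over the given samples.
    clf.partial_fit(X, labels, classes=np.array([0, 1]))
    accuracies.append(float(np.mean(clf.predict(X) == labels)))

print(accuracies)
```

In `train_func` the vectorizer is re-fitted in every iteration. Since the training data does not change between iterations, fitting it once before the loop, as done here, should produce the same features with less work.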

We can then get the best setting for the regularization parameter $\alpha$ as follows. **You should expect the training to take about 4-5 minutes**.

In [None]:
all_trials = tune.run(
    train_func,
    name="news_recommendation",
    # With the "stop" parameter, you could also specify a stopping criterion.
    config={"alpha": tune.grid_search([1e-3, 1e-4, 1e-5, 1e-6])}
)

From the trial status table, we can read off which value of `alpha` achieved the best accuracy.
We can also retrieve the best trial programmatically:

In [36]:
best_trial = all_trials.get_best_trial(metric="mean_accuracy", mode="max", scope="all")
print(best_trial)

train_func_ac062_00001
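
The `scope` argument determines which reported values are compared. As far as the Tune documentation describes it, `scope="all"` compares trials by their best value over all reported iterations, while the default `scope="last"` only looks at the final iteration. The plain-Python sketch below mimics the two rules on made-up numbers; the trial names and accuracies are invented and are not results from this notebook.

```python
# Made-up per-iteration accuracies for two hypothetical trials.
reported = {
    "trial_a": [0.50, 0.62, 0.58],
    "trial_b": [0.55, 0.57, 0.60],
}

# scope="all": compare trials by their best value over all iterations.
best_all = max(reported, key=lambda name: max(reported[name]))

# scope="last": compare trials by the value of their final iteration.
best_last = max(reported, key=lambda name: reported[name][-1])

print(best_all)   # trial_a
print(best_last)  # trial_b
```

The two rules can disagree when a trial peaks early and then degrades, so the choice of `scope` matters when picking the final hyperparameters.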


In this example, we used Ray Tune to run trials that tune the `alpha` parameter.
Finally, we shut down the workers.

In [37]:
ray.shutdown()