# Find Good Hyperparameters For A News Recommendation System With Tune

The goal of this example is to train a very simple news recommendation system. We will:
- Prepare the training data in parallel with Ray
- Train a simple model that classifies article titles as "popular" or "less popular" using scikit-learn, and
- Find good hyperparameter settings for the model with Tune, Ray's parallel hyperparameter optimization library.

### Downloading And Preparing The Training Data

First we will download and uncompress 400,000 Hacker News submissions, a small subset of the articles that have been submitted to https://news.ycombinator.com. The data includes the title of each submission and its score, which roughly corresponds to the number of upvotes. The information is split across 4 JSON files, named `ls-1.json` through `ls-4.json`. The `head` command prints the first few lines of the first file. The zip file is deleted once the data has been extracted, since it is no longer needed.
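To make the data layout concrete, here is a minimal sketch of turning such records into labeled training examples. It assumes, without confirmation from the files themselves, that each file holds one JSON object per line with `title` and `score` fields. The median-score threshold and the helper names `parse_submissions` and `label_popular` are hypothetical choices for illustration, not necessarily what `hyperparameter.py` does.

```python
import json
from statistics import median


def parse_submissions(lines):
    """Parse JSON-lines text into (title, score) pairs, skipping incomplete records."""
    pairs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Assumed field names: "title" and "score".
        title, score = record.get("title"), record.get("score")
        if title is None or score is None:
            continue
        pairs.append((title, int(score)))
    return pairs


def label_popular(pairs, threshold=None):
    """Label a title 1 ("popular") if its score exceeds the threshold, else 0."""
    if threshold is None:
        # Assumption: split at the median score of the parsed records.
        threshold = median(score for _, score in pairs)
    return [(title, int(score > threshold)) for title, score in pairs]


# Hypothetical sample records; real data would be read from ls-1.json ... ls-4.json.
sample = [
    '{"title": "Show HN: A tiny key-value store", "score": 120}',
    '{"title": "Ask HN: Favorite text editor?", "score": 3}',
    '{"title": "Why we moved off microservices", "score": 45}',
    '{"score": 7}',
]
pairs = parse_submissions(sample)
labeled = label_popular(pairs)
print(labeled)
```

Passing an explicit `threshold` instead of relying on the median gives direct control over how many titles count as "popular".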

In [1]:
%env TUNE_DISABLE_STRICT_METRIC_CHECKING=1
%env RAY_AIR_REENABLE_DEPRECATED_SYNC_TO_HEAD_NODE=1

In [2]:
import ray
from ray.job_submission import JobSubmissionClient
import time

# Ray cluster information
ray_head_ip = "kuberay-head-svc.kuberay.svc.cluster.local"
ray_head_port = 8265
ray_address = f"http://{ray_head_ip}:{ray_head_port}"

# Submit Ray job using JobSubmissionClient
client = JobSubmissionClient(ray_address)
job_id = client.submit_job(
    entrypoint="python hyperparameter.py",
    runtime_env={
        "working_dir": "./", 
        #"entrypoint_memory": 50000,
        #"excludes": ['']
    },
    entrypoint_num_cpus=3
)


2024-01-11 12:09:45,264	INFO dashboard_sdk.py:338 -- Uploading package gcs://_ray_pkg_c6239a001d10143c.zip.
2024-01-11 12:09:45,266	INFO packaging.py:518 -- Creating a file package for local directory './'.
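The next cell waits a fixed 10 seconds before querying the job. A polling loop is often more robust than a fixed sleep. The sketch below shows one way to do it. The helper `wait_for_job` and its parameters are hypothetical. It assumes the status values are named `SUCCEEDED`, `FAILED`, and `STOPPED` when the job is finished, which matches Ray's `JobStatus` enum as far as I know. It takes a plain callable, so it can be tried without a cluster.

```python
import time

# Assumed names of the terminal job states.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED"}


def wait_for_job(get_status, timeout_s=300.0, poll_s=2.0):
    """Call get_status() repeatedly until it reports a terminal state.

    Returns the terminal state's name, or raises TimeoutError if the
    deadline passes first.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        # Accept either a plain string or an enum member with a .value.
        name = str(getattr(status, "value", status))
        if name in TERMINAL_STATES:
            return name
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still {name} after {timeout_s} seconds")
        time.sleep(poll_s)


# Stand-in status source for demonstration. With a real cluster this would be
#     wait_for_job(lambda: client.get_job_status(job_id))
fake_statuses = iter(["PENDING", "RUNNING", "SUCCEEDED"])
result = wait_for_job(lambda: next(fake_statuses), poll_s=0.0)
print(result)
```

A timeout keeps the notebook from hanging indefinitely if the job never leaves the pending or running state.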


In [3]:
print(client.__dict__)
print(f"Ray job submitted with job_id: {job_id}")

# Wait for a while to let the jobs run
time.sleep(10)

job_status = client.get_job_status(job_id)
get_job_logs = client.get_job_logs(job_id)
get_job_info = client.get_job_info(job_id)

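# Stream the job's log output until the job finishes
async for lines in client.tail_job_logs(job_id):
    print(lines, end="")

# No ray.init() connection was opened in this notebook, so this call is a harmless no-op
ray.shutdown()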

{'_client_ray_version': '2.7.0', '_address': 'http://kuberay-head-svc.kuberay.svc.cluster.local:8265', '_cookies': None, '_default_metadata': {}, '_headers': None, '_verify': True, '_ssl_context': None}
Ray job submitted with job_id: raysubmit_TF29VHNxULke6KHV
Took 0.0063669681549072266 seconds to parse the hackernews submissions
Accuracy on the training set is 1.0
Accuracy on the test set is 1.0
2024-01-11 04:09:55,651	INFO worker.py:1313 -- Using address 10.224.59.219:6379 set in the environment variable RAY_ADDRESS
2024-01-11 04:09:55,651	INFO worker.py:1431 -- Connecting to existing Ray cluster at address: 10.224.59.219:6379...
2024-01-11 04:09:55,715	INFO worker.py:1612 -- Connected to Ray cluster. View the dashboard at http://10.224.59.219:8265
2024-01-11 04:09:55,798	INFO tune.py:666 -- [output] This will use the new output engine with verbosity 2. To disable the new output and use the legacy output engine, set the environment variable RAY_AIR_NEW_OUTPUT=0. F