# Making your first submission on Numerai

## Introduction 
This tutorial walks through creating your first submission on Numerai.

## Overview

1. Using this notebook
2. Download the datasets
3. Train your first model
4. Generate your first predictions
5. Make your first submission


---



## 1. Using this notebook 

This is an interactive notebook. You can execute the code in each cell by pressing `shift+enter`. This requires you to log in with your Google account.

To make changes, save your own copy via `File -> Save a copy in Drive`.

Let's start off by installing and importing our dependencies.

In [1]:
# install dependencies
!pip install -U scikit-learn
# note: the PyPI package is named `scikit-learn`; `sklearn` is only the import name
!pip install pandas numerapi


In [None]:
# import dependencies
import pandas as pd
import numerapi
from sklearn.linear_model import LinearRegression

## 2. Download the datasets

### Datasets 
*   `training_data` is used to train your model
*   `tournament_data` is used to evaluate your model

### Column descriptions
*   id: a randomized id that corresponds to a stock 
*   era: a period of time
*   data_type: either `train`, `validation`, `test`, or `live` 
*   feature_*: abstract financial features of the stock 
*   target: abstract measure of stock performance
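
To make the column layout concrete, here is a minimal sketch on a tiny synthetic frame that mimics the columns described above. All values, ids, and era labels are made up for illustration; only the column names come from the descriptions.

```python
import pandas as pd

# Tiny synthetic stand-in for the tournament data (all values are made up).
sample = pd.DataFrame({
    "id": ["a1", "b2", "c3", "d4"],
    "era": ["era1", "era1", "era2", "eraX"],
    "data_type": ["validation", "validation", "test", "live"],
    "feature_one": [0.0, 0.25, 0.5, 1.0],
    "feature_two": [1.0, 0.75, 0.5, 0.0],
    "target": [0.25, 0.5, 0.75, 0.5],
})

# Count rows per data_type to see how the file is partitioned.
rows_per_type = sample["data_type"].value_counts().to_dict()

# Count rows per era; each era groups rows from the same period of time.
rows_per_era = sample.groupby("era").size().to_dict()

print(rows_per_type)
print(rows_per_era)
```

The same two calls can be run on `training_data` or `tournament_data` once they are downloaded.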




In [None]:
# download the latest training dataset (takes around 30s)
training_data = pd.read_csv("https://numerai-public-datasets.s3-us-west-2.amazonaws.com/latest_numerai_training_data.csv.xz")
training_data.head()

In [None]:
# download the latest tournament dataset (takes around 30s)
tournament_data = pd.read_csv("https://numerai-public-datasets.s3-us-west-2.amazonaws.com/latest_numerai_tournament_data.csv.xz")
tournament_data.head(20)

## 3. Train your first model
Let's create a basic model using sklearn's linear regression.

In [None]:
# find only the feature columns
feature_cols = training_data.columns[training_data.columns.str.startswith('feature')]

In [None]:
# select those columns out of the training dataset
training_features = training_data[feature_cols]
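
The prefix-based column selection used above can be tried on a small synthetic frame. This is only a sketch; the column names `feature_alpha` and `feature_beta` are hypothetical and the values are made up.

```python
import pandas as pd

# Synthetic frame with two feature columns and three non-feature columns.
frame = pd.DataFrame({
    "id": ["a1", "b2"],
    "era": ["era1", "era2"],
    "feature_alpha": [0.25, 0.75],
    "feature_beta": [0.5, 0.0],
    "target": [0.5, 0.25],
})

# Boolean mask over the column index: True where the name starts with "feature".
is_feature = frame.columns.str.startswith("feature")

# Indexing the column index with the mask keeps only the matching names.
toy_feature_cols = frame.columns[is_feature]
toy_features = frame[toy_feature_cols]

print(list(toy_feature_cols))
```

Selecting by prefix keeps `id`, `era`, and `target` out of the model inputs without listing every feature name by hand.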

In [None]:
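# create a linear regression model and fit it to the training data (~30 sec to run)
from sklearn.linear_model import LinearRegression  # imported here so this cell runs on its own

model = LinearRegression()
model.fit(training_features, training_data.target)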
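
To see what fitting a linear regression produces, here is a small sketch on synthetic data. The feature names, the random data, and the target formula are all assumptions made for illustration; the target is built as an exact linear combination so the learned weights can be checked by eye.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic features; the names only mimic the `feature_*` convention.
toy_features = pd.DataFrame(
    rng.uniform(0, 1, size=(200, 2)),
    columns=["feature_a", "feature_b"],
)

# Made-up target: an exact linear combination of the two features.
toy_target = 0.5 * toy_features["feature_a"] + 0.25 * toy_features["feature_b"]

toy_model = LinearRegression()
toy_model.fit(toy_features, toy_target)

# coef_ holds one learned weight per feature column; intercept_ is the offset.
weights = dict(zip(toy_features.columns, toy_model.coef_))
toy_predictions = toy_model.predict(toy_features)

print(weights)
```

On real data the relationship is noisy, so the weights will not be this clean, but the `fit` and `predict` calls are used the same way.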

## 4. Generate your first predictions
Now that we have a trained model, we can use it to make predictions on the tournament data.



In [None]:
# select the feature columns from the tournament data
live_features = tournament_data[feature_cols]

In [None]:
# predict the target on the live features
predictions = model.predict(live_features)
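
Before submitting, it can be useful to compare predictions with the target on rows where the target is known, one era at a time. The sketch below computes a rank correlation per era on synthetic data. The helper `per_era_rank_correlation` and the `prediction` column name are hypothetical, and this is not claimed to match Numerai's official scoring.

```python
import pandas as pd

def per_era_rank_correlation(df, prediction_col, target_col="target", era_col="era"):
    """Return one rank correlation per era as a pandas Series."""
    scores = {}
    for era, group in df.groupby(era_col):
        # Pearson correlation of the ranks is the Spearman rank correlation.
        scores[era] = group[prediction_col].rank().corr(group[target_col].rank())
    return pd.Series(scores)

# Synthetic example: predictions agree with the target in era1 and are reversed in era2.
toy = pd.DataFrame({
    "era": ["era1"] * 4 + ["era2"] * 4,
    "prediction": [0.1, 0.2, 0.3, 0.4, 0.4, 0.3, 0.2, 0.1],
    "target": [0.0, 0.25, 0.5, 0.75, 0.0, 0.25, 0.5, 0.75],
})

era_scores = per_era_rank_correlation(toy, prediction_col="prediction")
print(era_scores)
```

Looking at scores per era rather than over all rows at once shows whether a model does well consistently or only in some periods.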

In [None]:
# predictions must have an `id` column and a `prediction_kazutsugi` column
predictions_df = tournament_data["id"].to_frame()
predictions_df["prediction_kazutsugi"] = predictions
predictions_df.head(10)
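
A quick sanity check on the predictions frame can catch formatting mistakes before uploading. The helper `check_submission` below is a hypothetical sketch, not part of any library. The column requirement comes from the comment above, while the rule that predictions lie within [0, 1] is an assumption.

```python
import pandas as pd

def check_submission(df, prediction_col="prediction_kazutsugi"):
    """Return a list of problems found in a predictions frame (empty if none)."""
    expected = ["id", prediction_col]
    if list(df.columns) != expected:
        return [f"columns should be {expected}, got {list(df.columns)}"]

    problems = []
    if df["id"].duplicated().any():
        problems.append("duplicate ids")

    preds = df[prediction_col]
    if preds.isna().any():
        problems.append("missing predictions")
    # Assumption: valid predictions lie within [0, 1].
    if not preds.dropna().between(0, 1).all():
        problems.append("predictions outside [0, 1]")
    return problems

# Synthetic frame with made-up ids and values.
toy_submission = pd.DataFrame({
    "id": ["a1", "b2", "c3"],
    "prediction_kazutsugi": [0.4, 0.5, 0.6],
})
problems = check_submission(toy_submission)
print(problems)
```

Calling `check_submission(predictions_df)` on the real frame returns an empty list when none of these checks fail.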

## 5. Make your first submission
To enter the tournament, we must submit the predictions back to Numerai. We will use the `numerapi` library to do this.

In [None]:
# Get your API keys and model_id from https://numer.ai/submit
# Replace the placeholders below with your own values; keep real keys out of shared notebooks
public_id = "YOUR_PUBLIC_ID"
secret_key = "YOUR_SECRET_KEY"
model_id = "YOUR_MODEL_ID"
napi = numerapi.NumerAPI(public_id=public_id, secret_key=secret_key)
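
Keys typed directly into a notebook are easy to leak when the notebook is shared. One alternative is to read them from environment variables, as in the sketch below. The helper `load_credentials` and the three variable names are assumptions chosen for this example.

```python
import os

# Variable names chosen for this sketch (an assumption, not a documented requirement).
CREDENTIAL_VARS = ("NUMERAI_PUBLIC_ID", "NUMERAI_SECRET_KEY", "NUMERAI_MODEL_ID")

def load_credentials(env=None):
    """Return (public_id, secret_key, model_id) read from a mapping of variables."""
    if env is None:
        env = os.environ
    # Treat unset and empty variables the same way.
    missing = [name for name in CREDENTIAL_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    return tuple(env[name] for name in CREDENTIAL_VARS)

# Example with an explicit mapping of dummy values instead of the real environment.
example_env = {
    "NUMERAI_PUBLIC_ID": "public-placeholder",
    "NUMERAI_SECRET_KEY": "secret-placeholder",
    "NUMERAI_MODEL_ID": "model-placeholder",
}
public_id_example, secret_key_example, model_id_example = load_credentials(example_env)
```

With the variables set in your session, `public_id, secret_key, model_id = load_credentials()` yields the values to pass to `numerapi.NumerAPI` and `upload_predictions`.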

In [None]:
# Upload your predictions
predictions_df.to_csv("predictions.csv", index=False)
submission_id = napi.upload_predictions("predictions.csv", model_id=model_id)

# Done 🚀
Good job! You just made your first submission on Numerai!

Head back over to https://numer.ai/submit to continue.