# NLP training example
In this example, we'll train an NLP model for sentiment analysis of tweets using spaCy.

First, we download the spaCy multi-language model.

In [1]:
!python -m spacy download xx_ent_wiki_sm

✔ Download and installation successful
You can now load the model via spacy.load('xx_ent_wiki_sm')


Next, we import the libraries we'll use.

In [2]:
from __future__ import unicode_literals, print_function

import boto3
import json
import numpy as np
import pandas as pd
import spacy

## Data prep

Download the English and German tweet datasets from S3.

In [3]:
S3_BUCKET = "verta-strata"
EN_S3_KEY = "english-tweets.csv"
EN_FILENAME = EN_S3_KEY
DE_S3_KEY = "german-tweets.csv"
DE_FILENAME = DE_S3_KEY

# Create the S3 client once and reuse it for both downloads.
s3 = boto3.client('s3')
s3.download_file(S3_BUCKET, EN_S3_KEY, EN_FILENAME)
s3.download_file(S3_BUCKET, DE_S3_KEY, DE_FILENAME)
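
Re-running the download cell fetches both files again. As a minimal sketch, assuming you would rather reuse a local copy when one exists, a small wrapper could skip the download. `download_if_missing` is a hypothetical helper, not part of boto3 or of this tutorial's libraries; it only assumes the client exposes `download_file(bucket, key, filename)`, as used above.

```python
import os


def download_if_missing(s3_client, bucket, key, filename):
    """Download s3://bucket/key to filename unless the file already exists.

    Hypothetical helper for illustration. Returns True if a download
    happened, False if the existing local copy was reused.
    """
    if os.path.exists(filename):
        return False
    s3_client.download_file(bucket, key, filename)
    return True
```

With the names defined in the cell above, it could be called as `download_if_missing(boto3.client('s3'), S3_BUCKET, EN_S3_KEY, EN_FILENAME)`. Note that it does not detect a stale or partially written local file.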

Load both datasets, combine and shuffle them, then clean the result using our helper library.

In [4]:
import utils

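# Load the English and German tweets.
en_data = pd.read_csv(EN_FILENAME)
de_data = pd.read_csv(DE_FILENAME)

# Combine both languages into one frame, then shuffle the rows.
data = pd.concat([en_data, de_data], axis=0)
data = data.sample(frac=1).reset_index(drop=True)
# clean_data comes from our local utils module. Its return value is not
# used here, so it is presumably expected to modify data in place.
utils.clean_data(data)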

data.head()

| | text | sentiment |
|---|---|---|
| 0 | have coffee stomach... aaahhhh! need food... ... | 0 |
| 1 | TOM DELONGE &lt;3 | 1 |
| 2 | all is very far from being well in the life of... | 0 |
| 3 | Just created the account and logged in for the... | 1 |
| 4 | i could have been racking up on call of duty k... | 0 |
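
The `utils.clean_data` helper is not shown in this notebook. As a rough sketch only, assuming the raw tweets may contain HTML entities, URLs, and irregular whitespace, per-tweet cleaning could look like the following. `clean_tweet` and each of its steps are illustrative assumptions, not the tutorial's actual implementation.

```python
import html
import re

# Assumed patterns for illustration; the real helper may differ.
_URL_RE = re.compile(r"https?://\S+")
_WS_RE = re.compile(r"\s+")


def clean_tweet(text):
    """Return a normalized copy of a single tweet (hypothetical helper)."""
    text = html.unescape(text)     # e.g. "&lt;3" becomes "<3"
    text = _URL_RE.sub("", text)   # drop http(s) links
    text = _WS_RE.sub(" ", text)   # collapse runs of whitespace
    return text.strip()
```

Applied to a pandas column, this could be used as `data['text'] = data['text'].map(clean_tweet)`, which rebinds the column rather than relying on a helper that mutates the frame.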


## Set up ModelDB
ModelDB organizes our work and lets us log and version metadata.

In [5]:
from verta import Client

client = Client('https://dev.verta.ai')
client.set_project('Tweet Classification')
client.set_experiment('SpaCy')
run = client.set_experiment_run()

set email from environment
set developer key from environment
connection successfully established
set existing Project: Tweet Classification from personal workspace
set existing Experiment: SpaCy
created new ExperimentRun: Run 997181584405191309407


We'll first record our code, configuration, dataset, and environment versions to a ModelDB repository.

In [6]:
repo = client.set_repository('Verta Strata')
commit = repo.get_commit(branch='master')

set existing Repository: Verta Strata from personal workspace


In [7]:
from verta.code import Notebook
from verta.configuration import Hyperparameters
from verta.dataset import S3
from verta.environment import Python

code_ver = Notebook()
config_ver = Hyperparameters({'n_iter': 20})
dataset_ver = S3([
    "s3://{}/{}".format(S3_BUCKET, EN_S3_KEY),
    "s3://{}/{}".format(S3_BUCKET, DE_S3_KEY),
])
env_ver = Python()

commit.update("notebooks/tweet-analysis", code_ver)
commit.update("config/hyperparams", config_ver)
commit.update("data/tweets", dataset_ver)
commit.update("env/python", env_ver)
commit.save("Deployment-ready sentiment analysis")

commit


(Branch: master)
Commit dca076b0f24e38821c17120005445a4c2cfc8a0e6b3d70a1036b96c933a71a7a containing:
config/hyperparams (Blob)
data/tweets (Blob)
env/python (Blob)
notebooks/tweet-analysis (Blob)

## Train the model
We'll start from a pre-trained spaCy model and fine-tune it on our new dataset.

In [8]:
nlp = spacy.load('xx_ent_wiki_sm')

Fine-tune the model on our data using our training helper library.

In [9]:
import training

training.train(nlp, data, n_iter=20)

Using 16000 examples (12800 training, 3200 evaluation)
Training the model...
LOSS 	  P  	  R  	  F  
16.020	0.747	0.700	0.723
0.363	0.751	0.723	0.737
0.106	0.752	0.727	0.739


KeyboardInterrupt: 
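
The P, R, and F columns in the log above appear to be precision, recall, and the F1 score on the evaluation split. As a small check under that assumption, F1 is the harmonic mean of precision and recall, and recomputing it from the logged P and R values reproduces the logged F values. `f_score` is a name chosen here for illustration; it is not taken from the `training` library.

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (F1); 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (P, R) pairs copied from the training log above.
logged = [(0.747, 0.700), (0.751, 0.723), (0.752, 0.727)]
for p, r in logged:
    print("{:.3f}".format(f_score(p, r)))
# prints 0.723, 0.737, 0.739 (one per line), matching the F column
```

Because F1 is a harmonic mean, it stays closer to the lower of the two values, so improving recall moves F more than the already higher precision does in this log.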

## Save and version the model
We log the model itself as an artifact to ModelDB.

In [None]:
run.log_model(nlp)

Finally, we link the commit to our Experiment Run.

In [None]:
run.log_commit(
    commit,
    {
        'notebook': "notebooks/tweet-analysis",
        'hyperparameters': "config/hyperparams",
        'training_data': "data/tweets",
        'python_env': "env/python",
    },
)

## Deployment

Great! You now have a model that you can run predictions against. Continue to the next step of this tutorial to see how.