# Welcome to Dotscience!

Dotscience is a _run tracker_ for data engineering and machine learning (ML).

By tracking runs, Dotscience allows you to capture all of the inputs that go into creating ML models.

Let's try a practical example! **Start by clicking on the "Dotscience" tab in the top left corner of the screen.**

## Simplest possible "hello world"

In this example, we create a single run and record a _run message_ with it:

In [None]:
import dotscience as ds
ds.interactive()               # tell Dotscience we're running in Jupyter (as opposed to CLI/script mode)
ds.start()                     # start a new run (also clears any metadata from any previous runs)
ds.publish("did an empty run") # publish the run (pushes it to the Dotscience Hub)

Click on the cell above, and then click the "►" button in Jupyter, or press shift-enter.

You'll notice that some metadata is printed after the cell. This metadata being written to the notebook file is the trigger for Dotscience recording a new run.

Within a second or two, the Dotscience Status view shows that Dotscience has detected and captured the new run. The captured run includes a snapshot of the files in the workspace as well as the run's metadata (which, for now, is just the run message).

The run also appears in the Dotscience Runs status view, and it is pushed to the Dotscience Hub.

### Capture metrics

Let's pretend we're training an ML model. ML models have parameters, like learning rates, and summary statistics, like accuracy. We can include these in the Dotscience metadata to record them for posterity and share them with our team.

In [None]:
ds.start()                                 # start a new run
ds.parameter("learning_rate", 0.001)       # a pretend learning rate
ds.summary("accuracy", 0.99)               # a great accuracy score
ds.publish("trained imaginary neural net") # a meaningful run message

You should see the run captured in the Dotscience plugin and pushed to the Hub.

Now click "Activity", and you should see the parameters and summary statistics shown in the table and in the chart.
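In a real run, the accuracy recorded above would be computed rather than hard-coded. As a minimal sketch, assuming hypothetical lists of true labels and predictions (the `accuracy` helper is for illustration only and is not part of Dotscience), accuracy is just the fraction of predictions that match:

```python
# Hypothetical labels and predictions, made up for illustration.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]


def accuracy(y_true, y_pred):
    """Return the fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


acc = accuracy(y_true, y_pred)
print(acc)  # 6 of 8 predictions match, so this prints 0.75
```

The result could then be passed to `ds.summary("accuracy", acc)` between `ds.start()` and `ds.publish(...)`, in the same way as the pretend value in the cell above.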

### Ingest some data

Now let's do something a bit more realistic. We're going to ingest some raw data, and then modify it (by combining two data sources into one), and then build a linear regression model to predict house prices.

We can start by doing a data ingestion run:

In [None]:
!apt-get install -qq -y wget
ds.start()
!wget -q -O data1.csv \
    https://github.com/dotmesh-io/dotscience-demo/blob/master/bay_area_zillow_agent1.csv?raw=true
!wget -q -O data2.csv \
    https://github.com/dotmesh-io/dotscience-demo/blob/master/bay_area_zillow_agent2.csv?raw=true
ds.output("data1.csv")
ds.output("data2.csv")
ds.publish("ingested zillow property data")

Now click "Resources", and click on one of the data files, and you'll see your first provenance graph: Dotscience is tracking that the given version of the data file came from a specific run.
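If `wget` is not available in your environment, the download step could be done from Python instead. The following is a sketch using only the standard library; `fetch` is a hypothetical helper name, not a Dotscience function, and it does no retrying or error handling.

```python
import shutil
import urllib.request


def fetch(url, dest):
    """Download the resource at `url` to the local path `dest` and return `dest`."""
    # Stream the response to disk instead of loading it all into memory.
    with urllib.request.urlopen(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)
    return dest
```

Under this assumption, each `!wget` line in the ingestion cell would become a call such as `fetch(url, "data1.csv")` with the same URL, and the file would still be registered with `ds.output("data1.csv")` before `ds.publish(...)`.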

### Combine the datasets

Now we'll combine the datasets and make another run which captures the act of doing so.

In [None]:
import pandas as pd
ds.start()
inputs = [pd.read_csv(ds.input("data1.csv")), pd.read_csv(ds.input("data2.csv"))]
df = pd.concat(inputs)  # stack the two tables row-wise
df.to_csv(ds.output("combined.csv"))
ds.publish("combined data files")
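One detail of the combining step is worth seeing on a small scale: by default `pd.concat` keeps each table's own row index, so the combined index contains repeats, and `to_csv` writes that index as an extra column. The sketch below uses two tiny made-up tables (the column names mirror the ones used later in this notebook; the values are invented) to show the default behaviour next to the `ignore_index=True` and `index=False` options. Whether you want these options here is a judgment call, since the model below only reads named columns.

```python
import io

import pandas as pd

# Two tiny stand-in tables; values are made up for illustration.
csv1 = "finishedsqft,lastsoldprice\n1000,500000\n1500,750000\n"
csv2 = "finishedsqft,lastsoldprice\n2000,900000\n"

frames = [pd.read_csv(io.StringIO(text)) for text in (csv1, csv2)]

stacked = pd.concat(frames)                        # keeps each frame's own index
renumbered = pd.concat(frames, ignore_index=True)  # fresh 0..n-1 index

print(list(stacked.index))     # [0, 1, 0]
print(list(renumbered.index))  # [0, 1, 2]

# Writing without the index avoids an extra unnamed column when the file is read back.
buffer = io.StringIO()
renumbered.to_csv(buffer, index=False)
print(buffer.getvalue().splitlines()[0])  # finishedsqft,lastsoldprice
```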

### Let's build a model!

Now we'll put together a very simple linear regression. We can then track the provenance of the model file as well as the accuracy statistics from testing it.

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
import joblib  # sklearn.externals.joblib was removed in newer scikit-learn releases
from sklearn.model_selection import train_test_split
import numpy as np

ds.start()
df = pd.read_csv(ds.input('combined.csv'))  # load the combined data as a tracked input of this run

features = ['finishedsqft']
X = df[features]
Y = df['lastsoldprice']

ds.parameter('features', ", ".join(sorted(features)))

# hold out 30% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, random_state=1)

regressor = LinearRegression()
regressor.fit(X_train, y_train)
y_pred = regressor.predict(X_test)

regressor_score = regressor.score(X_test, y_test)
ds.summary('regressor_score', regressor_score)

lin_mse = mean_squared_error(y_test, y_pred)  # argument order is (y_true, y_pred)
lin_rmse = np.sqrt(lin_mse)
ds.summary('lin_rmse', lin_rmse)

joblib.dump(regressor, ds.output('linear_regressor.pkl'))
ds.publish("trained linear regression model")

Now you should be able to inspect the provenance of the model file in the Resources section, and see the metrics of its performance in the Activity section.
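To make the two summary statistics above concrete: `LinearRegression.score` returns the coefficient of determination (R²), and `lin_rmse` is the square root of the mean squared error, expressed in the same units as the sale price. The sketch below computes both by hand in plain Python on a few made-up numbers; the data and variable names are for illustration only.

```python
import math

# Hypothetical true values and predictions, made up for illustration.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]

n = len(y_true)
mean_true = sum(y_true) / n

# Residual sum of squares: how far the predictions are from the true values.
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
# Total sum of squares: how far the true values are from their own mean.
ss_tot = sum((t - mean_true) ** 2 for t in y_true)

mse = ss_res / n            # mean squared error
rmse = math.sqrt(mse)       # root mean squared error
r2 = 1 - ss_res / ss_tot    # coefficient of determination

print(mse)             # 0.5
print(round(rmse, 3))  # 0.707
print(round(r2, 3))    # 0.9
```

An R² close to 1 means the model explains most of the variation in the target, while the RMSE gives a typical error size in the target's own units.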