feat: Dagster Data pipeline #798

Draft · wants to merge 18 commits into main

Changes from 15 commits
2 changes: 2 additions & 0 deletions python/tabby-eval/.gitignore
@@ -0,0 +1,2 @@
tmp*
tabby_data_pipeline.egg-info
Member: you still need to remove the directory from this commit

48 changes: 48 additions & 0 deletions python/tabby-eval/README.md
@@ -0,0 +1,48 @@
# tabby_data_pipeline

This is a [Dagster](https://dagster.io/) project scaffolded with [`dagster project scaffold`](https://docs.dagster.io/getting-started/create-new-project).

## Getting started

First, install your Dagster code location as a Python package. By using the --editable flag, pip will install your Python package in ["editable mode"](https://pip.pypa.io/en/latest/topics/local-project-installs/#editable-installs) so that as you develop, local code changes will automatically apply.

```bash
pip install -e ".[dev]"
```

Then, start the Dagster UI web server:

```bash
dagster dev
```

Open http://localhost:3000 with your browser to see the project.

You can start writing assets in `tabby_data_pipeline/assets.py`. The assets are automatically loaded into the Dagster code location as you define them.
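
For reference, a minimal asset definition might look like the following sketch (the asset names here are illustrative, not part of this pipeline):

```python
import pandas as pd
from dagster import asset


@asset
def example_dataset() -> pd.DataFrame:
    # Illustrative upstream asset: build a small DataFrame.
    return pd.DataFrame({"x": [1, 2, 3]})


@asset
def example_summary(example_dataset: pd.DataFrame) -> pd.DataFrame:
    # Downstream asset: Dagster wires the dependency by parameter name.
    return example_dataset.describe()
```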

## Development


### Adding new Python dependencies

You can specify new Python dependencies in `setup.py`.
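
For example, adding a hypothetical `numpy` dependency would look like this sketch of `setup.py` (mirroring the file in this project; `numpy` is only an example):

```python
from setuptools import find_packages, setup

setup(
    name="tabby_data_pipeline",
    packages=find_packages(exclude=["tabby_data_pipeline_tests"]),
    install_requires=[
        "dagster",
        "dagster-cloud",
        "numpy",  # hypothetical newly added dependency
    ],
    extras_require={"dev": ["dagster-webserver", "pytest"]},
)
```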

### Unit testing

Tests are in the `tabby_data_pipeline_tests` directory and you can run tests using `pytest`:

```bash
pytest tabby_data_pipeline_tests
```

### Schedules and sensors

If you want to enable Dagster [Schedules](https://docs.dagster.io/concepts/partitions-schedules-sensors/schedules) or [Sensors](https://docs.dagster.io/concepts/partitions-schedules-sensors/sensors) for your jobs, the [Dagster Daemon](https://docs.dagster.io/deployment/dagster-daemon) process must be running. This is done automatically when you run `dagster dev`.

Once your Dagster Daemon is running, you can start turning on schedules and sensors for your jobs.
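
For example, a daily schedule over all assets could be defined like this (the job name and cron spec are illustrative):

```python
from dagster import AssetSelection, ScheduleDefinition, define_asset_job

# Illustrative: materialize every asset once a day at midnight.
daily_job = define_asset_job("daily_refresh", selection=AssetSelection.all())

daily_schedule = ScheduleDefinition(
    job=daily_job,
    cron_schedule="0 0 * * *",
)
```

A schedule defined this way still has to be registered, e.g. via `Definitions(schedules=[daily_schedule])`, before it shows up in the UI.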

## Deploy on Dagster Cloud

The easiest way to deploy your Dagster project is to use Dagster Cloud.

Check out the [Dagster Cloud Documentation](https://docs.dagster.cloud) to learn more.
452 changes: 452 additions & 0 deletions python/tabby-eval/edit_distance_analysis.ipynb

Large diffs are not rendered by default.

6 changes: 6 additions & 0 deletions python/tabby-eval/log.txt
Member: this should be in `.gitignore` as well

@@ -0,0 +1,6 @@
model: TabbyML/StarCoder-1B; language: python; file: line_completion.jsonl; Skipped 0 rows, 10 rows with predictions, 0 rows with errors

model: TabbyML/StarCoder-1B; language: python; file: line_completion_rg1_bm25.jsonl; Skipped 0 rows, 10 rows with predictions, 0 rows with errors

model: TabbyML/StarCoder-1B; language: python; file: line_completion_oracle_bm25.jsonl; Skipped 0 rows, 10 rows with predictions, 0 rows with errors

6 changes: 6 additions & 0 deletions python/tabby-eval/pyproject.toml
@@ -0,0 +1,6 @@
[build-system]
requires = ["setuptools"]
build-backend = "setuptools.build_meta"

[tool.dagster]
module_name = "tabby_data_pipeline"
2 changes: 2 additions & 0 deletions python/tabby-eval/setup.cfg
@@ -0,0 +1,2 @@
[metadata]
name = tabby_data_pipeline
17 changes: 17 additions & 0 deletions python/tabby-eval/setup.py
@@ -0,0 +1,17 @@
from setuptools import find_packages, setup

setup(
    name="tabby_data_pipeline",
    packages=find_packages(exclude=["tabby_data_pipeline_tests"]),
    install_requires=[
        "dagster",
        "dagster-cloud",
        "dagstermill",
        "papermill-origami>=0.0.8",
        "pandas",
        "matplotlib",
        "seaborn",
        "scikit-learn",
    ],
    extras_require={"dev": ["dagster-webserver", "pytest"]},
)
8 changes: 8 additions & 0 deletions python/tabby-eval/tabby_data_pipeline.egg-info/PKG-INFO
Member: This directory should be inside `.gitignore`

@@ -0,0 +1,8 @@
Metadata-Version: 2.1
Name: tabby-data-pipeline
Version: 0.0.0
Requires-Dist: dagster
Requires-Dist: dagster-cloud
Provides-Extra: dev
Requires-Dist: dagster-webserver; extra == "dev"
Requires-Dist: pytest; extra == "dev"
13 changes: 13 additions & 0 deletions python/tabby-eval/tabby_data_pipeline.egg-info/SOURCES.txt
@@ -0,0 +1,13 @@
README.md
pyproject.toml
setup.cfg
setup.py
tabby_data_pipeline/__init__.py
tabby_data_pipeline/analyze.py
tabby_data_pipeline/assets.py
tabby_data_pipeline/predict.py
tabby_data_pipeline.egg-info/PKG-INFO
tabby_data_pipeline.egg-info/SOURCES.txt
tabby_data_pipeline.egg-info/dependency_links.txt
tabby_data_pipeline.egg-info/requires.txt
tabby_data_pipeline.egg-info/top_level.txt
1 change: 1 addition & 0 deletions python/tabby-eval/tabby_data_pipeline.egg-info/dependency_links.txt
@@ -0,0 +1 @@

6 changes: 6 additions & 0 deletions python/tabby-eval/tabby_data_pipeline.egg-info/requires.txt
@@ -0,0 +1,6 @@
dagster
dagster-cloud

[dev]
dagster-webserver
pytest
1 change: 1 addition & 0 deletions python/tabby-eval/tabby_data_pipeline.egg-info/top_level.txt
@@ -0,0 +1 @@
tabby_data_pipeline
18 changes: 18 additions & 0 deletions python/tabby-eval/tabby_data_pipeline/__init__.py
@@ -0,0 +1,18 @@
from dagster import AssetIn, Definitions, Field, Int, asset, file_relative_path, load_assets_from_modules
from dagstermill import ConfigurableLocalOutputNotebookIOManager, define_dagstermill_asset

from . import assets, create_csv

all_assets = load_assets_from_modules([assets, create_csv])

defs = Definitions(
    assets=all_assets,
    resources={
        # Required by dagstermill so notebook assets can write their executed
        # output notebooks to local storage.
        "output_notebook_io_manager": ConfigurableLocalOutputNotebookIOManager(),
    },
)


87 changes: 87 additions & 0 deletions python/tabby-eval/tabby_data_pipeline/analyze.py
@@ -0,0 +1,87 @@
import pandas as pd
import json
import sys

from dagster import (
    AssetExecutionContext,
    MetadataValue,
    asset,
    StaticPartitionsDefinition,
    MultiPartitionsDefinition,
)


def get_bracket_lang_statement(completion):
    # Truncate at the first statement terminator for bracket-style languages.
    end_idx = None
    for i in range(len(completion)):
        if completion[i] in [";", "{", "}"]:
            end_idx = i
            break
    # Compare with None explicitly so a terminator at index 0 still truncates.
    return completion[:end_idx + 1] if end_idx is not None else completion


def postprocess_code_lines(prompt, target, language):
    try:
        if language in ["java", "csharp", "typescript"]:
            return get_bracket_lang_statement(target)
        elif language == "python":
            # For Python, the first line is treated as the first statement.
            return target.split("\n")[0]
    except Exception:
        return target


def analyze(model, language, file):
    line_match = 0
    statement_match = 0

    input_file = f"./data/{model}/{language}/{file}"
    output_file = f"./data/{model}/{language}/result_{file}"

    with open(output_file, "w") as fout:
        with open(input_file) as fin:
            for line in fin:
                obj = json.loads(line)
                result = {}
                prediction = ""

                # Copy fields preceding the prediction; stop early on error records.
                for k in obj.keys():
                    if k == "prediction":
                        prediction = str(obj[k])
                        break
                    elif k == "error":
                        break
                    else:
                        result[k] = obj[k]

                tabby_eval = {}
                if file == "line_completion.jsonl":
                    tabby_eval["raw_prompt"] = obj["prompt"]
                else:
                    tabby_eval["raw_prompt"] = obj["crossfile_context"]["text"] + obj["prompt"]

                tabby_eval["prediction"] = prediction

                groundtruth = obj["groundtruth"]

                tabby_eval["first_line_prediction"] = prediction.split("\n")[0]
                tabby_eval["first_line_groundtruth"] = groundtruth.split("\n")[0]
                if tabby_eval["first_line_prediction"] == tabby_eval["first_line_groundtruth"]:
                    tabby_eval["first_line_matched"] = True
                    line_match += 1
                else:
                    tabby_eval["first_line_matched"] = False

                tabby_eval["first_statement_prediction"] = postprocess_code_lines(tabby_eval["raw_prompt"], prediction, language)
                tabby_eval["first_statement_groundtruth"] = postprocess_code_lines(tabby_eval["raw_prompt"], groundtruth, language)
                if tabby_eval["first_statement_prediction"] == tabby_eval["first_statement_groundtruth"]:
                    tabby_eval["first_statement_matched"] = True
                    statement_match += 1
                else:
                    tabby_eval["first_statement_matched"] = False

                result["tabby_eval"] = tabby_eval

                json.dump(result, fout)
                fout.write("\n")

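For reference, a single invocation of this module might look like the following (model and language values are illustrative; paths follow the `./data/{model}/{language}/` layout above):

```python
from tabby_data_pipeline import analyze

# Reads ./data/StarCoder-1B/python/line_completion.jsonl and writes
# ./data/StarCoder-1B/python/result_line_completion.jsonl with per-row
# first-line and first-statement match flags.
analyze.analyze("StarCoder-1B", "python", "line_completion.jsonl")
```
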
147 changes: 147 additions & 0 deletions python/tabby-eval/tabby_data_pipeline/assets.py
@@ -0,0 +1,147 @@
import modal
import json
import os, subprocess
import pandas as pd

from dagster import (
    AssetExecutionContext,
    MetadataValue,
    asset,
    StaticPartitionsDefinition,
    MultiPartitionsDefinition,
)

from . import analyze

# All prediction and matching assets share the same model/language partition
# grid, so it is declared once and reused below.
MODEL_LANGUAGE_PARTITIONS = MultiPartitionsDefinition(
    {
        "model_id": StaticPartitionsDefinition(
            [
                "TabbyML/StarCoder-1B",
                "TabbyML/StarCoder-3B",
                "TabbyML/StarCoder-7B",
                "TabbyML/WizardCoder-1B",
                "TabbyML/WizardCoder-3B",
                "TabbyML/CodeLlama-7B",
                "TabbyML/CodeLlama-13B",
            ]
        ),
        "language": StaticPartitionsDefinition(["python", "java", "csharp", "typescript"]),
    }
)


@asset
def baseline() -> str:
    return "line_completion.jsonl"


@asset
def bm25() -> str:
    return "line_completion_rg1_bm25.jsonl"


@asset
def oracle() -> str:
    return "line_completion_oracle_bm25.jsonl"


@asset(partitions_def=MODEL_LANGUAGE_PARTITIONS)
def predict_baseline(context: AssetExecutionContext, baseline: str) -> None:
    model_id = context.partition_key.keys_by_dimension["model_id"]
    language = context.partition_key.keys_by_dimension["language"]

    # The Modal prediction script reads the model to serve from MODEL_ID.
    my_env = os.environ.copy()
    my_env["MODEL_ID"] = model_id

    context.add_output_metadata(metadata={"model_id": MetadataValue.md(model_id)})

    files = baseline

    p = subprocess.Popen(["modal", "run", "./modal/predict.py", "--language", language, "--files", files], env=my_env)
    p.wait()
    context.add_output_metadata(metadata={"modal run": MetadataValue.md("success!")})


@asset(partitions_def=MODEL_LANGUAGE_PARTITIONS)
def predict_bm25(context: AssetExecutionContext, bm25: str) -> None:
    model_id = context.partition_key.keys_by_dimension["model_id"]
    language = context.partition_key.keys_by_dimension["language"]

    my_env = os.environ.copy()
    my_env["MODEL_ID"] = model_id

    context.add_output_metadata(metadata={"model_id": MetadataValue.md(model_id)})

    files = bm25

    p = subprocess.Popen(["modal", "run", "./modal/predict.py", "--language", language, "--files", files], env=my_env)
    p.wait()
    context.add_output_metadata(metadata={"modal run": MetadataValue.md("success!")})


@asset(partitions_def=MODEL_LANGUAGE_PARTITIONS)
def predict_oracle(context: AssetExecutionContext, oracle: str) -> None:
    model_id = context.partition_key.keys_by_dimension["model_id"]
    language = context.partition_key.keys_by_dimension["language"]

    my_env = os.environ.copy()
    my_env["MODEL_ID"] = model_id

    context.add_output_metadata(metadata={"model_id": MetadataValue.md(model_id)})

    files = oracle

    p = subprocess.Popen(["modal", "run", "./modal/predict.py", "--language", language, "--files", files], env=my_env)
    p.wait()
    context.add_output_metadata(metadata={"modal run": MetadataValue.md("success!")})


@asset(partitions_def=MODEL_LANGUAGE_PARTITIONS, deps=[predict_baseline])
def matching_baseline(context) -> None:
    model_id = context.partition_key.keys_by_dimension["model_id"]
    language = context.partition_key.keys_by_dimension["language"]

    model = model_id.split("/")[-1]
    analyze.analyze(model, language, "line_completion.jsonl")


@asset(partitions_def=MODEL_LANGUAGE_PARTITIONS, deps=[predict_bm25])
def matching_bm25(context) -> None:
    model_id = context.partition_key.keys_by_dimension["model_id"]
    language = context.partition_key.keys_by_dimension["language"]

    model = model_id.split("/")[-1]
    analyze.analyze(model, language, "line_completion_rg1_bm25.jsonl")


@asset(partitions_def=MODEL_LANGUAGE_PARTITIONS, deps=[predict_oracle])
def matching_oracle(context) -> None:
    model_id = context.partition_key.keys_by_dimension["model_id"]
    language = context.partition_key.keys_by_dimension["language"]

    model = model_id.split("/")[-1]
    analyze.analyze(model, language, "line_completion_oracle_bm25.jsonl")