@kklein kklein commented Aug 12, 2022

constraints. The ``VarCharRegex`` constraint compared the columns' values to a regular
expression. The ``UniquesEquality`` constraint expected the unique values of the
``language`` column to not have changed between version 1 and version 2.
* The failing ``KolmogorovSmirnov`` constraint tells us that we shouldn't assume the

This test only failed due to a bug in the data processing.

codecov bot commented Aug 12, 2022

Codecov Report

Merging #51 (55821b2) into main (7a5d603) will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##             main      #51   +/-   ##
=======================================
  Coverage   93.90%   93.90%           
=======================================
  Files          15       15           
  Lines        1607     1607           
=======================================
  Hits         1509     1509           
  Misses         98       98           
Impacted Files               Coverage Δ
src/datajudge/__init__.py    77.77% <ø> (ø)


df_v2[fluctuating_column] = ((1 + change) * df_v1[fluctuating_column]).astype(int)

# Make old version not have data about all channels from current version.
df_v1 = df_v1.sample(frac=0.85, random_state=SEED)

Previously, this subsampling happened before the numeric perturbations but after the copying.

Hence, df_v2 only received updated values for the subsampled rows/indices. Remaining rows/indices were assigned NA values.
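
The effect described above can be reproduced on a toy frame. This is a minimal sketch, not the tutorial's actual script: the `followers` column, the scalar `change`, the value of `SEED` and the 50% sampling fraction are all assumptions made for illustration. It relies only on pandas aligning a column assignment on the index.

```python
import pandas as pd

SEED = 1337  # hypothetical seed; the tutorial's actual value is not shown here

# Toy stand-in for the tutorial data; column names and values are illustrative.
df = pd.DataFrame(
    {"channel": list("abcdef"), "followers": [10, 20, 30, 40, 50, 60]}
)
change = 0.1  # hypothetical relative change applied to the numeric column


def process_buggy(df_v1):
    """Old order: copy, subsample, then perturb."""
    df_v2 = df_v1.copy()
    df_v1 = df_v1.sample(frac=0.5, random_state=SEED)
    # df_v1 now holds only half of the index labels. Assignment aligns on the
    # index, so rows of df_v2 without a matching label become NaN.
    df_v2["followers"] = ((1 + change) * df_v1["followers"]).astype(int)
    return df_v1, df_v2


def process_fixed(df_v1):
    """New order: copy, perturb, then subsample."""
    df_v2 = df_v1.copy()
    df_v2["followers"] = ((1 + change) * df_v1["followers"]).astype(int)
    df_v1 = df_v1.sample(frac=0.5, random_state=SEED)
    return df_v1, df_v2


_, buggy_v2 = process_buggy(df)
_, fixed_v2 = process_fixed(df)

n_missing_buggy = int(buggy_v2["followers"].isna().sum())
n_missing_fixed = int(fixed_v2["followers"].isna().sum())
print(n_missing_buggy, n_missing_fixed)
```

In the old order, every row dropped by the subsampling ends up as NA in `df_v2`; in the new order, `df_v2` is perturbed against the full frame and has no missing values.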

# Introduce a data error.
index = (~df_v2["channel"].isin(df_v1["channel"])).idxmax()
df_v2.loc[index, "language"] = "Sw3d1zh"

df_v1 = pd.read_csv("twitch_version1.csv")

Dump the data to the file system so that a separate upload script can be run without having to run the processing oneself.
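
A rough sketch of that split between processing and upload, assuming both versions are written as CSV files. Only `twitch_version1.csv` appears in the diff; the second file name, the helper names and the toy data are hypothetical, and the actual upload step is left out.

```python
import tempfile
from pathlib import Path

import pandas as pd


def dump_versions(df_v1, df_v2, directory):
    """Processing side: write both versions to disk as CSV files."""
    directory = Path(directory)
    df_v1.to_csv(directory / "twitch_version1.csv", index=False)
    # The name of the second file is an assumption, not taken from the diff.
    df_v2.to_csv(directory / "twitch_version2.csv", index=False)


def load_versions(directory):
    """Upload side: read the dumped files without rerunning the processing."""
    directory = Path(directory)
    df_v1 = pd.read_csv(directory / "twitch_version1.csv")
    df_v2 = pd.read_csv(directory / "twitch_version2.csv")
    return df_v1, df_v2


# Toy data standing in for the processed tutorial frames.
toy_v1 = pd.DataFrame({"channel": ["a", "b"], "followers": [10, 20]})
toy_v2 = pd.DataFrame({"channel": ["a", "b", "c"], "followers": [11, 22, 33]})

with tempfile.TemporaryDirectory() as tmp_dir:
    dump_versions(toy_v1, toy_v2, tmp_dir)
    loaded_v1, loaded_v2 = load_versions(tmp_dir)
```

Writing with `index=False` keeps the pandas index out of the files, so the upload side sees only the data columns.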

@kklein kklein changed the title from "More docs cleanup" to "Fix data processing error in tutorial" Aug 15, 2022
@kklein kklein requested a review from ivergara August 15, 2022 09:16
@kklein kklein marked this pull request as ready for review August 15, 2022 09:30
@kklein kklein merged commit e0ea007 into main Aug 15, 2022
@kklein kklein deleted the more_docs_cleanup branch August 15, 2022 09:31