Fix data processing error in tutorial #51
constraints. The ``VarCharRegex`` constraint compared the columns' values to a regular
expression. The ``UniquesEquality`` constraint expected the unique values of the
``language`` column to not have changed between version 1 and version 2.
* The failing ``KolmogorovSmirnov`` constraint tells us that we shouldn't assume the
This test only failed due to a bug in the data processing.
Codecov Report
@@ Coverage Diff @@
## main #51 +/- ##
=======================================
Coverage 93.90% 93.90%
=======================================
Files 15 15
Lines 1607 1607
=======================================
Hits 1509 1509
Misses 98 98
df_v2[fluctuating_column] = ((1 + change) * df_v1[fluctuating_column]).astype(int)

# Make old version not have data about all channels from current version.
df_v1 = df_v1.sample(frac=0.85, random_state=SEED)
Previously, this subsampling happened before the numeric perturbations but after the copying.
Hence, `df_v2` only received updated values for the subsampled rows/indices. Remaining rows/indices were assigned NA values.
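The effect described in this comment comes from pandas aligning on the index when a Series is assigned to a column. The following is a minimal sketch of both orderings, assuming a toy frame with hypothetical `channel` and `followers` columns, an arbitrary `SEED`, and an arbitrary scaling factor; none of these are the tutorial's actual data.

```python
import pandas as pd

SEED = 0  # hypothetical seed, for illustration only

# Toy stand-in for the tutorial data (hypothetical columns and values).
df = pd.DataFrame(
    {"channel": list("abcdef"), "followers": [10, 20, 30, 40, 50, 60]}
)

# Old order: copy, subsample df_v1, then perturb.
df_v1 = df.copy()
df_v2 = df.copy()
df_v1 = df_v1.sample(frac=0.5, random_state=SEED)
# The right-hand side only has the subsampled index labels. On assignment,
# pandas aligns on the index, so every other row of df_v2 becomes NaN.
df_v2["followers"] = (1.1 * df_v1["followers"]).astype(int)
n_missing_old_order = int(df_v2["followers"].isna().sum())

# New order: copy, perturb, then subsample df_v1.
df_v1 = df.copy()
df_v2 = df.copy()
df_v2["followers"] = (1.1 * df_v1["followers"]).astype(int)
df_v1 = df_v1.sample(frac=0.5, random_state=SEED)
n_missing_new_order = int(df_v2["followers"].isna().sum())

print(n_missing_old_order, n_missing_new_order)
```

With the old order, half of the rows in `df_v2` end up missing; with the new order, none do.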
# Introduce a data error.
index = (~df_v2["channel"].isin(df_v1["channel"])).idxmax()
df_v2.loc[index, "language"] = "Sw3d1zh"
df_v1 = pd.read_csv("twitch_version1.csv")
Dump the data to the file system so that a separate upload script can be run even by someone who does not run the processing themselves.
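A minimal sketch of this dump-and-reload pattern, assuming the processing step writes with `DataFrame.to_csv` and `index=False` (the write call is not shown in this excerpt, so both are assumptions). The toy frame and the temporary directory are hypothetical; only the file name `twitch_version1.csv` is taken from the diff.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-in for the processed data (hypothetical columns and values).
df_v1 = pd.DataFrame(
    {"channel": ["a", "b", "c"], "language": ["English", "German", "Swedish"]}
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "twitch_version1.csv"

    # Processing side: dump the frame. index=False is an assumption; it
    # avoids an extra unnamed index column when the file is read back.
    df_v1.to_csv(path, index=False)

    # Upload side: a separate script only needs the file, not the
    # processing code that produced it.
    reloaded = pd.read_csv(path)

roundtrip_ok = list(reloaded.columns) == list(df_v1.columns) and len(reloaded) == len(df_v1)
print(roundtrip_ok)
```

Splitting the steps this way means the upload script depends only on the CSV file, so it can be rerun without repeating the processing.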
Rendered docs:
https://datajudge--51.org.readthedocs.build/en/51/