Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Error Creating Index due to Bad Chars #14

Closed
ScarletGuo opened this issue Nov 14, 2018 · 4 comments
Closed

Error Creating Index due to Bad Chars #14

ScarletGuo opened this issue Nov 14, 2018 · 4 comments

Comments

@ScarletGuo
Copy link
Collaborator

ScarletGuo commented Nov 14, 2018

Error:

LINE 1: ...ATE INDEX census_0126 ON census_01 (migration_code-change_in...
                                                             ^
 [SQL: 'CREATE INDEX census_0126 ON census_01 (migration_code-change_in_msa)'] (Background on this error at: http://sqlalche.me/e/f405)
For table: census_01 ERROR while creating index for table census_01 on attributes ['migration_code-change_in_msa']

===================

tmp solution added to my local holoclean/dataset/table:36

bad_chars = re.compile('[^a-zA-Z0-9]+')
renamed_cols = [bad_chars.sub(' ', col).strip().replace(' ', '_') for col in columns]

@ScarletGuo
Copy link
Collaborator Author

ScarletGuo commented Nov 14, 2018

However, the tmp solution above can result in inconsistency of attribute names when it comes to evaluation. Similar changes might be needed for eval.py when loading clean data

@richardwu
Copy link
Collaborator

richardwu commented Nov 15, 2018

The error arises because we don't quote columns (attributes) in our queries since Postgres's lexer attempts to resolve these unquoted tokens while also downcasing the attribute (see https://stackoverflow.com/questions/20878932/are-postgresql-column-names-case-sensitive).

I fixed how we treat attributes as columns in #18 by retaining their original format from the raw datasets. This should no longer be an issue: I tested this by replacing the attribute "HospitalName" with "Hospital-Name" in the hospital dataset

// hospital.csv
ProviderNumber,Hospital-Name,Address1,Address2,Address3,City,State,ZipCode,CountyName,PhoneNumber,HospitalType,HospitalOwner,EmergencyService,Condition,MeasureCode,MeasureName,Score,Sample,Stateavg
10018,callahan eye foundation hospital,1720 university blvd,,,birmingham,al,35233,jefferson,2053258100,acute care hospitals,voluntary non-profit - private,yes,surgical infection prevention,scip-card-2,surgery patients who were taking heart drugs caxxed beta bxockers before coming to the hospitax who were kept on the beta bxockers during the period just before and after their surgery,,,al_scip-card-2
...
// hospital_clean.csv
tid,attribute,correct_val
0,ProviderNumber,10018
0,Hospital-Name,callahan eye foundation hospital
0,Address1,1720 university blvd
...
// hospital_constraints_att.txt
t1&t2&EQ(t1.Condition,t2.Condition)&EQ(t1.MeasureName,t2.MeasureName)&IQ(t1.HospitalType,t2.HospitalType)
t1&t2&EQ(t1.Hospital-Name,t2.Hospital-Name)&IQ(t1.ZipCode,t2.ZipCode)
t1&t2&EQ(t1.Sample,t2.Sample)&IQ(t1.Score,t2.Score)
...

and the test ran perfectly

Precision = 0.96, Recall = 0.69, Repairing Recall = 0.80, F1 = 0.80, Repairing F1 = 0.88, Detected Errors = 435, Total Errors = 509, Correct Repairs = 350, Total Repairs = 365, Total Repairs (Grdth present) = 365

@thodrek
Copy link

thodrek commented Nov 22, 2018

@richardwu is this fixed? if so please close.

@richardwu
Copy link
Collaborator

Yes this is fixed. I can't seem to close this issue (I believe only the author and/or maintainers can).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants