Index error in docker-compose article-relevance-prediction. #103

Open
SimonGoring opened this issue Jul 24, 2023 · 2 comments

Comments

@SimonGoring (Contributor)

Running the docker compose in the root directory, I am now running into a new error with indices:

simon@partyLaptop:~/Documents/Neotoma/MetaExtractor$ docker-compose up article-relevance-prediction
Starting metaextractor_article-relevance-prediction_1 ... done
Attaching to metaextractor_article-relevance-prediction_1
article-relevance-prediction_1  | 2023-07-24 18:04:08,781 - gdd_api_query.py:113 - get_new_gdd_articles - INFO - Querying by n_recent = 1000
article-relevance-prediction_1  | 2023-07-24 18:04:09,379 - gdd_api_query.py:151 - get_new_gdd_articles - INFO - 1000 articles queried from GeoDeepDive (page 1).
article-relevance-prediction_1  | 2023-07-24 18:04:09,379 - gdd_api_query.py:174 - get_new_gdd_articles - INFO - GeoDeepDive query completed.
article-relevance-prediction_1  | 2023-07-24 18:04:09,854 - gdd_api_query.py:197 - get_new_gdd_articles - INFO - 1000 articles returned from GeoDeepDive.
article-relevance-prediction_1  | 2023-07-24 18:04:12,763 - relevance_prediction_parquet.py:57 - crossref_extract - INFO - Running crossref_extract function.
article-relevance-prediction_1  | 2023-07-24 18:04:12,766 - relevance_prediction_parquet.py:77 - crossref_extract - INFO - Querying CrossRef API for article metadata.
article-relevance-prediction_1  | 2023-07-24 18:10:54,843 - relevance_prediction_parquet.py:98 - crossref_extract - INFO - CrossRef API query completed for 1000 articles.
article-relevance-prediction_1  | 2023-07-24 18:10:54,877 - relevance_prediction_parquet.py:164 - data_preprocessing - INFO - Prediction data preprocessing begin.
article-relevance-prediction_1  | /app/src/article_relevance/relevance_prediction_parquet.py:178: DeprecationWarning: In a future version, `df.iloc[:, i] = newvals` will attempt to set the values inplace instead of always setting a new array. To retain the old behavior, use either `df[df.columns[i]] = newvals` or, if columns are non-unique, `df.isetitem(i, newvals)`
article-relevance-prediction_1  |   metadata_df.loc[valid_condition, 'has_abstract'] = metadata_df.loc[valid_condition, "abstract"].isnull()
article-relevance-prediction_1  | 2023-07-24 18:10:54,896 - relevance_prediction_parquet.py:189 - data_preprocessing - INFO - Running article language imputation.
article-relevance-prediction_1  | 2023-07-24 18:10:54,903 - relevance_prediction_parquet.py:201 - data_preprocessing - INFO - 81 articles require language imputation
article-relevance-prediction_1  | 2023-07-24 18:10:54,903 - relevance_prediction_parquet.py:203 - data_preprocessing - INFO - 81 cannot be imputed due to too short text metadata(title, subtitle and abstract less than 5 character).
article-relevance-prediction_1  | 2023-07-24 18:10:54,905 - relevance_prediction_parquet.py:213 - data_preprocessing - INFO - Missing language imputation completed
article-relevance-prediction_1  | 2023-07-24 18:10:54,906 - relevance_prediction_parquet.py:214 - data_preprocessing - INFO - After imputation, there are 1000 non-English articles in total excluded from the prediction pipeline.
article-relevance-prediction_1  | 2023-07-24 18:10:54,912 - relevance_prediction_parquet.py:238 - data_preprocessing - INFO - 0 articles has missing feature and its relevance cannot be predicted.
article-relevance-prediction_1  | 2023-07-24 18:10:54,912 - relevance_prediction_parquet.py:239 - data_preprocessing - INFO - Data preprocessing completed.
article-relevance-prediction_1  | 2023-07-24 18:10:54,912 - relevance_prediction_parquet.py:257 - add_embeddings - INFO - Sentence embedding start.
Downloading (…)2c72f/.gitattributes: 100%|██████████| 1.48k/1.48k [00:00<00:00, 3.53MB/s]
Downloading (…)be7662c72f/README.md: 100%|██████████| 8.09k/8.09k [00:00<00:00, 22.2MB/s]
Downloading (…)7662c72f/config.json: 100%|██████████| 754/754 [00:00<00:00, 2.62MB/s]
Downloading pytorch_model.bin: 100%|██████████| 440M/440M [00:10<00:00, 40.7MB/s] 
Downloading (…)cial_tokens_map.json: 100%|██████████| 125/125 [00:00<00:00, 810kB/s]
Downloading (…)2c72f/tokenizer.json: 100%|██████████| 717k/717k [00:00<00:00, 5.29MB/s]
Downloading (…)okenizer_config.json: 100%|██████████| 453/453 [00:00<00:00, 1.59MB/s]
Downloading (…)be7662c72f/vocab.txt: 100%|██████████| 228k/228k [00:00<00:00, 3.56MB/s]
article-relevance-prediction_1  | No sentence-transformers model found with name /root/.cache/torch/sentence_transformers/allenai_specter2. Creating a new one with MEAN pooling.
article-relevance-prediction_1  | 2023-07-24 18:11:09,041 - relevance_prediction_parquet.py:275 - add_embeddings - INFO - Sentence embedding completed.
article-relevance-prediction_1  | 2023-07-24 18:11:09,050 - relevance_prediction_parquet.py:294 - relevance_prediction - INFO - Prediction start.
article-relevance-prediction_1  | 2023-07-24 18:11:09,064 - relevance_prediction_parquet.py:307 - relevance_prediction - INFO - Running prediction for 0 articles.
article-relevance-prediction_1  | Traceback (most recent call last):
article-relevance-prediction_1  |   File "/app/src/article_relevance/relevance_prediction_parquet.py", line 456, in <module>
article-relevance-prediction_1  |     main()
article-relevance-prediction_1  |   File "/app/src/article_relevance/relevance_prediction_parquet.py", line 445, in main
article-relevance-prediction_1  |     predicted = relevance_prediction(embedded, model_path, predict_thld = 0.5)
article-relevance-prediction_1  |   File "/app/src/article_relevance/relevance_prediction_parquet.py", line 311, in relevance_prediction
article-relevance-prediction_1  |     nan_exists = valid_df.loc[:, feature_col].isnull().any(axis = 1)
article-relevance-prediction_1  |   File "/usr/local/lib/python3.10/site-packages/pandas/core/indexing.py", line 1067, in __getitem__
article-relevance-prediction_1  |     return self._getitem_tuple(key)
article-relevance-prediction_1  |   File "/usr/local/lib/python3.10/site-packages/pandas/core/indexing.py", line 1256, in _getitem_tuple
article-relevance-prediction_1  |     return self._getitem_tuple_same_dim(tup)
article-relevance-prediction_1  |   File "/usr/local/lib/python3.10/site-packages/pandas/core/indexing.py", line 924, in _getitem_tuple_same_dim
article-relevance-prediction_1  |     retval = getattr(retval, self.name)._getitem_axis(key, axis=i)
article-relevance-prediction_1  |   File "/usr/local/lib/python3.10/site-packages/pandas/core/indexing.py", line 1301, in _getitem_axis
article-relevance-prediction_1  |     return self._getitem_iterable(key, axis=axis)
article-relevance-prediction_1  |   File "/usr/local/lib/python3.10/site-packages/pandas/core/indexing.py", line 1239, in _getitem_iterable
article-relevance-prediction_1  |     keyarr, indexer = self._get_listlike_indexer(key, axis)
article-relevance-prediction_1  |   File "/usr/local/lib/python3.10/site-packages/pandas/core/indexing.py", line 1432, in _get_listlike_indexer
article-relevance-prediction_1  |     keyarr, indexer = ax._get_indexer_strict(key, axis_name)
article-relevance-prediction_1  |   File "/usr/local/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 6070, in _get_indexer_strict
article-relevance-prediction_1  |     self._raise_if_missing(keyarr, indexer, axis_name)
article-relevance-prediction_1  |   File "/usr/local/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 6133, in _raise_if_missing
article-relevance-prediction_1  |     raise KeyError(f"{not_found} not in index")
article-relevance-prediction_1  | KeyError: "['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '43', '44', '45', '46', '47', '48', '49', '50', '51', '52', '53', '54', '55', '56', '57', '58', '59', '60', '61', '62', '63', '64', '65', '66', '67', '68', '69', '70', '71', '72', '73', '74', '75', '76', '77', '78', '79', '80', '81', '82', '83', '84', '85', '86', '87', '88', '89', '90', '91', '92', '93', '94', '95', '96', '97', '98', '99', '100', '101', '102', '103', '104', '105', '106', '107', '108', '109', '110', '111', '112', '113', '114', '115', '116', '117', '118', '119', '120', '121', '122', '123', '124', '125', '126', '127', '128', '129', '130', '131', '132', '133', '134', '135', '136', '137', '138', '139', '140', '141', '142', '143', '144', '145', '146', '147', '148', '149', '150', '151', '152', '153', '154', '155', '156', '157', '158', '159', '160', '161', '162', '163', '164', '165', '166', '167', '168', '169', '170', '171', '172', '173', '174', '175', '176', '177', '178', '179', '180', '181', '182', '183', '184', '185', '186', '187', '188', '189', '190', '191', '192', '193', '194', '195', '196', '197', '198', '199', '200', '201', '202', '203', '204', '205', '206', '207', '208', '209', '210', '211', '212', '213', '214', '215', '216', '217', '218', '219', '220', '221', '222', '223', '224', '225', '226', '227', '228', '229', '230', '231', '232', '233', '234', '235', '236', '237', '238', '239', '240', '241', '242', '243', '244', '245', '246', '247', '248', '249', '250', '251', '252', '253', '254', '255', '256', '257', '258', '259', '260', '261', '262', '263', '264', '265', '266', '267', '268', '269', '270', '271', '272', '273', '274', '275', '276', '277', '278', '279', '280', '281', '282', '283', '284', '285', '286', '287', '288', '289', '290', '291', '292', '293', '294', '295', '296', '297', '298', '299', '300', '301', '302', '303', '304', '305', '306', '307', '308', '309', '310', '311', '312', '313', '314', '315', '316', '317', '318', '319', '320', '321', '322', '323', '324', '325', '326', '327', '328', '329', '330', '331', '332', '333', '334', '335', '336', '337', '338', '339', '340', '341', '342', '343', '344', '345', '346', '347', '348', '349', '350', '351', '352', '353', '354', '355', '356', '357', '358', '359', '360', '361', '362', '363', '364', '365', '366', '367', '368', '369', '370', '371', '372', '373', '374', '375', '376', '377', '378', '379', '380', '381', '382', '383', '384', '385', '386', '387', '388', '389', '390', '391', '392', '393', '394', '395', '396', '397', '398', '399', '400', '401', '402', '403', '404', '405', '406', '407', '408', '409', '410', '411', '412', '413', '414', '415', '416', '417', '418', '419', '420', '421', '422', '423', '424', '425', '426', '427', '428', '429', '430', '431', '432', '433', '434', '435', '436', '437', '438', '439', '440', '441', '442', '443', '444', '445', '446', '447', '448', '449', '450', '451', '452', '453', '454', '455', '456', '457', '458', '459', '460', '461', '462', '463', '464', '465', '466', '467', '468', '469', '470', '471', '472', '473', '474', '475', '476', '477', '478', '479', '480', '481', '482', '483', '484', '485', '486', '487', '488', '489', '490', '491', '492', '493', '494', '495', '496', '497', '498', '499', '500', '501', '502', '503', '504', '505', '506', '507', '508', '509', '510', '511', '512', '513', '514', '515', '516', 
'517', '518', '519', '520', '521', '522', '523', '524', '525', '526', '527', '528', '529', '530', '531', '532', '533', '534', '535', '536', '537', '538', '539', '540', '541', '542', '543', '544', '545', '546', '547', '548', '549', '550', '551', '552', '553', '554', '555', '556', '557', '558', '559', '560', '561', '562', '563', '564', '565', '566', '567', '568', '569', '570', '571', '572', '573', '574', '575', '576', '577', '578', '579', '580', '581', '582', '583', '584', '585', '586', '587', '588', '589', '590', '591', '592', '593', '594', '595', '596', '597', '598', '599', '600', '601', '602', '603', '604', '605', '606', '607', '608', '609', '610', '611', '612', '613', '614', '615', '616', '617', '618', '619', '620', '621', '622', '623', '624', '625', '626', '627', '628', '629', '630', '631', '632', '633', '634', '635', '636', '637', '638', '639', '640', '641', '642', '643', '644', '645', '646', '647', '648', '649', '650', '651', '652', '653', '654', '655', '656', '657', '658', '659', '660', '661', '662', '663', '664', '665', '666', '667', '668', '669', '670', '671', '672', '673', '674', '675', '676', '677', '678', '679', '680', '681', '682', '683', '684', '685', '686', '687', '688', '689', '690', '691', '692', '693', '694', '695', '696', '697', '698', '699', '700', '701', '702', '703', '704', '705', '706', '707', '708', '709', '710', '711', '712', '713', '714', '715', '716', '717', '718', '719', '720', '721', '722', '723', '724', '725', '726', '727', '728', '729', '730', '731', '732', '733', '734', '735', '736', '737', '738', '739', '740', '741', '742', '743', '744', '745', '746', '747', '748', '749', '750', '751', '752', '753', '754', '755', '756', '757', '758', '759', '760', '761', '762', '763', '764', '765', '766', '767'] not in index"
metaextractor_article-relevance-prediction_1 exited with code 1

This seems to be coming from this line:

nan_exists = valid_df.loc[:, feature_col].isnull().any(axis = 1)

in the relevance_prediction() function.
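
For context, this KeyError is what pandas raises when a label-based column lookup includes labels that don't exist in the frame. A minimal standalone reproduction, assuming feature_col mixes ordinary feature names with embedding columns named '0' through '767' (the column names come from the traceback above; the frame contents here are illustrative):

import pandas as pd

# A frame that has an ordinary feature column but never had embeddings
# attached, as would happen if add_embeddings ran on zero articles.
valid_df = pd.DataFrame({"has_abstract": [True]})

# Mix of one existing column and the 768 missing embedding columns.
feature_col = ["has_abstract"] + [str(i) for i in range(768)]

# Raises: KeyError: "['0', '1', ..., '767'] not in index"
nan_exists = valid_df.loc[:, feature_col].isnull().any(axis=1)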

I'll try a bit of debugging to see why it's popping up.

@SimonGoring (Contributor, Author)

Just looking at the code a bit more, does it have to do with the fact that a number is hard-coded here:

https://github.com/NeotomaDB/MetaExtractor/blob/main/src/article_relevance/relevance_prediction_parquet.py#L310
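
For reference, the KeyError above lists labels '0' through '767', i.e. 768 embedding-dimension columns, so the hard-coded number presumably builds those column names. A guess at the shape of it (hypothetical reconstruction, not the actual source):

# Hypothetical: 768 matching the sentence-embedding dimensionality
feature_col = [str(i) for i in range(768)]

If the embedding step effectively ran on zero articles, those 768 columns were never added to the dataframe, and the .loc lookup on the next line fails.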

@tieandrews (Collaborator) commented Jul 29, 2023

I believe the issue is that, for some reason, none of those 1000 articles queried from xDD are detected as English (which gets checked), which flags all the articles as "not valid for prediction". So it's a bug in the function: it doesn't handle the case where the dataframe of articles is empty and therefore never gets the sentence embeddings added to it.

We should be able to use the steps to query xDD and CrossRef, then inspect the returned papers' metadata before it gets fed into the prediction step; I can't imagine all 1000 articles are actually not in English!
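
In the meantime, a guard in relevance_prediction would at least fail gracefully instead of raising the KeyError. A sketch, assuming the function receives the preprocessed dataframe and the feature column list (names and signature are illustrative, not the actual code):

import logging

import pandas as pd

logger = logging.getLogger(__name__)

def relevance_prediction(valid_df: pd.DataFrame, feature_col: list) -> pd.DataFrame:
    # Guard: if preprocessing excluded every article, the embedding columns
    # were never added, so skip prediction instead of indexing missing labels.
    if valid_df.empty or not set(feature_col).issubset(valid_df.columns):
        logger.warning("No valid articles to predict; returning input unchanged.")
        return valid_df
    nan_exists = valid_df.loc[:, feature_col].isnull().any(axis=1)
    # ... rest of the prediction logic ...
    return valid_df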
