Index error in docker-compose article-relevance-prediction. #103

Open
SimonGoring opened this issue Jul 24, 2023 · 2 comments

Comments

@SimonGoring (Contributor)

Running the docker compose in the root directory, I am now running into a new error with indices:

simon@partyLaptop:~/Documents/Neotoma/MetaExtractor$ docker-compose up article-relevance-prediction
Starting metaextractor_article-relevance-prediction_1 ... done
Attaching to metaextractor_article-relevance-prediction_1
article-relevance-prediction_1  | 2023-07-24 18:04:08,781 - gdd_api_query.py:113 - get_new_gdd_articles - INFO - Querying by n_recent = 1000
article-relevance-prediction_1  | 2023-07-24 18:04:09,379 - gdd_api_query.py:151 - get_new_gdd_articles - INFO - 1000 articles queried from GeoDeepDive (page 1).
article-relevance-prediction_1  | 2023-07-24 18:04:09,379 - gdd_api_query.py:174 - get_new_gdd_articles - INFO - GeoDeepDive query completed.
article-relevance-prediction_1  | 2023-07-24 18:04:09,854 - gdd_api_query.py:197 - get_new_gdd_articles - INFO - 1000 articles returned from GeoDeepDive.
article-relevance-prediction_1  | 2023-07-24 18:04:12,763 - relevance_prediction_parquet.py:57 - crossref_extract - INFO - Running crossref_extract function.
article-relevance-prediction_1  | 2023-07-24 18:04:12,766 - relevance_prediction_parquet.py:77 - crossref_extract - INFO - Querying CrossRef API for article metadata.
article-relevance-prediction_1  | 2023-07-24 18:10:54,843 - relevance_prediction_parquet.py:98 - crossref_extract - INFO - CrossRef API query completed for 1000 articles.
article-relevance-prediction_1  | 2023-07-24 18:10:54,877 - relevance_prediction_parquet.py:164 - data_preprocessing - INFO - Prediction data preprocessing begin.
article-relevance-prediction_1  | /app/src/article_relevance/relevance_prediction_parquet.py:178: DeprecationWarning: In a future version, `df.iloc[:, i] = newvals` will attempt to set the values inplace instead of always setting a new array. To retain the old behavior, use either `df[df.columns[i]] = newvals` or, if columns are non-unique, `df.isetitem(i, newvals)`
article-relevance-prediction_1  |   metadata_df.loc[valid_condition, 'has_abstract'] = metadata_df.loc[valid_condition, "abstract"].isnull()
article-relevance-prediction_1  | 2023-07-24 18:10:54,896 - relevance_prediction_parquet.py:189 - data_preprocessing - INFO - Running article language imputation.
article-relevance-prediction_1  | 2023-07-24 18:10:54,903 - relevance_prediction_parquet.py:201 - data_preprocessing - INFO - 81 articles require language imputation
article-relevance-prediction_1  | 2023-07-24 18:10:54,903 - relevance_prediction_parquet.py:203 - data_preprocessing - INFO - 81 cannot be imputed due to too short text metadata(title, subtitle and abstract less than 5 character).
article-relevance-prediction_1  | 2023-07-24 18:10:54,905 - relevance_prediction_parquet.py:213 - data_preprocessing - INFO - Missing language imputation completed
article-relevance-prediction_1  | 2023-07-24 18:10:54,906 - relevance_prediction_parquet.py:214 - data_preprocessing - INFO - After imputation, there are 1000 non-English articles in total excluded from the prediction pipeline.
article-relevance-prediction_1  | 2023-07-24 18:10:54,912 - relevance_prediction_parquet.py:238 - data_preprocessing - INFO - 0 articles has missing feature and its relevance cannot be predicted.
article-relevance-prediction_1  | 2023-07-24 18:10:54,912 - relevance_prediction_parquet.py:239 - data_preprocessing - INFO - Data preprocessing completed.
article-relevance-prediction_1  | 2023-07-24 18:10:54,912 - relevance_prediction_parquet.py:257 - add_embeddings - INFO - Sentence embedding start.
Downloading (…)2c72f/.gitattributes: 100%|██████████| 1.48k/1.48k [00:00<00:00, 3.53MB/s]
Downloading (…)be7662c72f/README.md: 100%|██████████| 8.09k/8.09k [00:00<00:00, 22.2MB/s]
Downloading (…)7662c72f/config.json: 100%|██████████| 754/754 [00:00<00:00, 2.62MB/s]
Downloading pytorch_model.bin: 100%|██████████| 440M/440M [00:10<00:00, 40.7MB/s] 
Downloading (…)cial_tokens_map.json: 100%|██████████| 125/125 [00:00<00:00, 810kB/s]
Downloading (…)2c72f/tokenizer.json: 100%|██████████| 717k/717k [00:00<00:00, 5.29MB/s]
Downloading (…)okenizer_config.json: 100%|██████████| 453/453 [00:00<00:00, 1.59MB/s]
Downloading (…)be7662c72f/vocab.txt: 100%|██████████| 228k/228k [00:00<00:00, 3.56MB/s]
article-relevance-prediction_1  | No sentence-transformers model found with name /root/.cache/torch/sentence_transformers/allenai_specter2. Creating a new one with MEAN pooling.
article-relevance-prediction_1  | 2023-07-24 18:11:09,041 - relevance_prediction_parquet.py:275 - add_embeddings - INFO - Sentence embedding completed.
article-relevance-prediction_1  | 2023-07-24 18:11:09,050 - relevance_prediction_parquet.py:294 - relevance_prediction - INFO - Prediction start.
article-relevance-prediction_1  | 2023-07-24 18:11:09,064 - relevance_prediction_parquet.py:307 - relevance_prediction - INFO - Running prediction for 0 articles.
article-relevance-prediction_1  | Traceback (most recent call last):
article-relevance-prediction_1  |   File "/app/src/article_relevance/relevance_prediction_parquet.py", line 456, in <module>
article-relevance-prediction_1  |     main()
article-relevance-prediction_1  |   File "/app/src/article_relevance/relevance_prediction_parquet.py", line 445, in main
article-relevance-prediction_1  |     predicted = relevance_prediction(embedded, model_path, predict_thld = 0.5)
article-relevance-prediction_1  |   File "/app/src/article_relevance/relevance_prediction_parquet.py", line 311, in relevance_prediction
article-relevance-prediction_1  |     nan_exists = valid_df.loc[:, feature_col].isnull().any(axis = 1)
article-relevance-prediction_1  |   File "/usr/local/lib/python3.10/site-packages/pandas/core/indexing.py", line 1067, in __getitem__
article-relevance-prediction_1  |     return self._getitem_tuple(key)
article-relevance-prediction_1  |   File "/usr/local/lib/python3.10/site-packages/pandas/core/indexing.py", line 1256, in _getitem_tuple
article-relevance-prediction_1  |     return self._getitem_tuple_same_dim(tup)
article-relevance-prediction_1  |   File "/usr/local/lib/python3.10/site-packages/pandas/core/indexing.py", line 924, in _getitem_tuple_same_dim
article-relevance-prediction_1  |     retval = getattr(retval, self.name)._getitem_axis(key, axis=i)
article-relevance-prediction_1  |   File "/usr/local/lib/python3.10/site-packages/pandas/core/indexing.py", line 1301, in _getitem_axis
article-relevance-prediction_1  |     return self._getitem_iterable(key, axis=axis)
article-relevance-prediction_1  |   File "/usr/local/lib/python3.10/site-packages/pandas/core/indexing.py", line 1239, in _getitem_iterable
article-relevance-prediction_1  |     keyarr, indexer = self._get_listlike_indexer(key, axis)
article-relevance-prediction_1  |   File "/usr/local/lib/python3.10/site-packages/pandas/core/indexing.py", line 1432, in _get_listlike_indexer
article-relevance-prediction_1  |     keyarr, indexer = ax._get_indexer_strict(key, axis_name)
article-relevance-prediction_1  |   File "/usr/local/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 6070, in _get_indexer_strict
article-relevance-prediction_1  |     self._raise_if_missing(keyarr, indexer, axis_name)
article-relevance-prediction_1  |   File "/usr/local/lib/python3.10/site-packages/pandas/core/indexes/base.py", line 6133, in _raise_if_missing
article-relevance-prediction_1  |     raise KeyError(f"{not_found} not in index")
article-relevance-prediction_1  | KeyError: "['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '21', '22', '23', '24', '25', '26', '27', '28', '29', '30', '31', '32', '33', '34', '35', '36', '37', '38', '39', '40', '41', '42', '43', '44', '45', '46', '47', '48', '49', '50', '51', '52', '53', '54', '55', '56', '57', '58', '59', '60', '61', '62', '63', '64', '65', '66', '67', '68', '69', '70', '71', '72', '73', '74', '75', '76', '77', '78', '79', '80', '81', '82', '83', '84', '85', '86', '87', '88', '89', '90', '91', '92', '93', '94', '95', '96', '97', '98', '99', '100', '101', '102', '103', '104', '105', '106', '107', '108', '109', '110', '111', '112', '113', '114', '115', '116', '117', '118', '119', '120', '121', '122', '123', '124', '125', '126', '127', '128', '129', '130', '131', '132', '133', '134', '135', '136', '137', '138', '139', '140', '141', '142', '143', '144', '145', '146', '147', '148', '149', '150', '151', '152', '153', '154', '155', '156', '157', '158', '159', '160', '161', '162', '163', '164', '165', '166', '167', '168', '169', '170', '171', '172', '173', '174', '175', '176', '177', '178', '179', '180', '181', '182', '183', '184', '185', '186', '187', '188', '189', '190', '191', '192', '193', '194', '195', '196', '197', '198', '199', '200', '201', '202', '203', '204', '205', '206', '207', '208', '209', '210', '211', '212', '213', '214', '215', '216', '217', '218', '219', '220', '221', '222', '223', '224', '225', '226', '227', '228', '229', '230', '231', '232', '233', '234', '235', '236', '237', '238', '239', '240', '241', '242', '243', '244', '245', '246', '247', '248', '249', '250', '251', '252', '253', '254', '255', '256', '257', '258', '259', '260', '261', '262', '263', '264', '265', '266', '267', '268', '269', '270', '271', '272', '273', '274', '275', '276', '277', '278', '279', '280', '281', '282', '283', '284', '285', '286', '287', '288', '289', '290', '291', '292', '293', '294', '295', '296', '297', '298', '299', '300', '301', '302', '303', '304', '305', '306', '307', '308', '309', '310', '311', '312', '313', '314', '315', '316', '317', '318', '319', '320', '321', '322', '323', '324', '325', '326', '327', '328', '329', '330', '331', '332', '333', '334', '335', '336', '337', '338', '339', '340', '341', '342', '343', '344', '345', '346', '347', '348', '349', '350', '351', '352', '353', '354', '355', '356', '357', '358', '359', '360', '361', '362', '363', '364', '365', '366', '367', '368', '369', '370', '371', '372', '373', '374', '375', '376', '377', '378', '379', '380', '381', '382', '383', '384', '385', '386', '387', '388', '389', '390', '391', '392', '393', '394', '395', '396', '397', '398', '399', '400', '401', '402', '403', '404', '405', '406', '407', '408', '409', '410', '411', '412', '413', '414', '415', '416', '417', '418', '419', '420', '421', '422', '423', '424', '425', '426', '427', '428', '429', '430', '431', '432', '433', '434', '435', '436', '437', '438', '439', '440', '441', '442', '443', '444', '445', '446', '447', '448', '449', '450', '451', '452', '453', '454', '455', '456', '457', '458', '459', '460', '461', '462', '463', '464', '465', '466', '467', '468', '469', '470', '471', '472', '473', '474', '475', '476', '477', '478', '479', '480', '481', '482', '483', '484', '485', '486', '487', '488', '489', '490', '491', '492', '493', '494', '495', '496', '497', '498', '499', '500', '501', '502', '503', '504', '505', '506', '507', '508', '509', '510', '511', '512', '513', '514', '515', '516', 
'517', '518', '519', '520', '521', '522', '523', '524', '525', '526', '527', '528', '529', '530', '531', '532', '533', '534', '535', '536', '537', '538', '539', '540', '541', '542', '543', '544', '545', '546', '547', '548', '549', '550', '551', '552', '553', '554', '555', '556', '557', '558', '559', '560', '561', '562', '563', '564', '565', '566', '567', '568', '569', '570', '571', '572', '573', '574', '575', '576', '577', '578', '579', '580', '581', '582', '583', '584', '585', '586', '587', '588', '589', '590', '591', '592', '593', '594', '595', '596', '597', '598', '599', '600', '601', '602', '603', '604', '605', '606', '607', '608', '609', '610', '611', '612', '613', '614', '615', '616', '617', '618', '619', '620', '621', '622', '623', '624', '625', '626', '627', '628', '629', '630', '631', '632', '633', '634', '635', '636', '637', '638', '639', '640', '641', '642', '643', '644', '645', '646', '647', '648', '649', '650', '651', '652', '653', '654', '655', '656', '657', '658', '659', '660', '661', '662', '663', '664', '665', '666', '667', '668', '669', '670', '671', '672', '673', '674', '675', '676', '677', '678', '679', '680', '681', '682', '683', '684', '685', '686', '687', '688', '689', '690', '691', '692', '693', '694', '695', '696', '697', '698', '699', '700', '701', '702', '703', '704', '705', '706', '707', '708', '709', '710', '711', '712', '713', '714', '715', '716', '717', '718', '719', '720', '721', '722', '723', '724', '725', '726', '727', '728', '729', '730', '731', '732', '733', '734', '735', '736', '737', '738', '739', '740', '741', '742', '743', '744', '745', '746', '747', '748', '749', '750', '751', '752', '753', '754', '755', '756', '757', '758', '759', '760', '761', '762', '763', '764', '765', '766', '767'] not in index"
metaextractor_article-relevance-prediction_1 exited with code 1

This seems to be coming from this line:

nan_exists = valid_df.loc[:, feature_col].isnull().any(axis = 1)

in the relevance_prediction() function.
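
For context, this KeyError is what pandas raises when a label-based column lookup includes labels that don't exist in the frame. A minimal standalone reproduction, assuming feature_col mixes ordinary feature names with embedding columns named '0' through '767' (the column names come from the traceback above; the frame contents here are illustrative):

import pandas as pd

# A frame that has an ordinary feature column but never had embeddings
# attached, as would happen if add_embeddings ran on zero articles.
valid_df = pd.DataFrame({"has_abstract": [True]})

# Mix of one existing column and the 768 missing embedding columns.
feature_col = ["has_abstract"] + [str(i) for i in range(768)]

# Raises: KeyError: "['0', '1', ..., '767'] not in index"
nan_exists = valid_df.loc[:, feature_col].isnull().any(axis=1)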

I'll try a bit of debugging to see why it's popping up.

@SimonGoring (Contributor, Author)

Just looking at the code a bit more, does it have to do with the fact that a number is hard-coded here:

https://github.com/NeotomaDB/MetaExtractor/blob/main/src/article_relevance/relevance_prediction_parquet.py#L310
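
For reference, the KeyError above lists labels '0' through '767', i.e. 768 embedding-dimension columns, so the hard-coded number presumably builds those column names. A guess at the shape of it (hypothetical reconstruction, not the actual source):

# Hypothetical: 768 matching the sentence-embedding dimensionality
feature_col = [str(i) for i in range(768)]

If the embedding step effectively ran on zero articles, those 768 columns were never added to the dataframe, and the .loc lookup on the next line fails.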

@tieandrews (Collaborator) commented Jul 29, 2023

I believe the issue is that, for some reason, none of those 1000 articles queried from xDD are detected as English (which gets checked), which flags all the articles as "not valid for prediction". So it's a bug in the function: it doesn't handle the case where the dataframe of articles is empty and therefore never gets the sentence embeddings added to it.

We should be able to use the steps to query xDD and CrossRef, then inspect the returned papers' metadata before it gets fed into the prediction step; I can't imagine all 1000 articles are actually not in English!
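
In the meantime, a guard in relevance_prediction would at least fail gracefully instead of raising the KeyError. A sketch, assuming the function receives the preprocessed dataframe and the feature column list (names and signature are illustrative, not the actual code):

import logging

import pandas as pd

logger = logging.getLogger(__name__)

def relevance_prediction(valid_df: pd.DataFrame, feature_col: list) -> pd.DataFrame:
    # Guard: if preprocessing excluded every article, the embedding columns
    # were never added, so skip prediction instead of indexing missing labels.
    if valid_df.empty or not set(feature_col).issubset(valid_df.columns):
        logger.warning("No valid articles to predict; returning input unchanged.")
        return valid_df
    nan_exists = valid_df.loc[:, feature_col].isnull().any(axis=1)
    # ... rest of the prediction logic ...
    return valid_df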
