Describe the bug
A few days ago, my ETL audits started reporting duplicate records in my all_tickers BigQuery table, which itself is based on polygon_client.list_tickers(). I dug into these 102 duplicates, and they appear to be caused by the same ticker having a different last_updated_utc and/or a different CIK.
In 12 of the 51 affected tickers I found, the same ticker has two CIKs listed at the polygon_client.list_tickers() endpoint.
In 39 of the 51 affected tickers I found, the same ticker has the same CIK listed at the polygon_client.list_tickers() endpoint, and the only difference is the last_updated_utc.
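For illustration, the two duplicate patterns can be mocked up with a small synthetic frame (the tickers, CIKs, and timestamps below are made up, not real API output):

```python
import pandas as pd

# Synthetic rows mimicking the two duplicate patterns seen in list_tickers()
rows = [
    # Pattern 1: same ticker, two different CIKs
    {"ticker": "AAA", "cik": "0000000001", "last_updated_utc": "2025-03-24T00:00:00Z"},
    {"ticker": "AAA", "cik": "0000000002", "last_updated_utc": "2025-03-24T00:00:00Z"},
    # Pattern 2: same ticker and CIK, different last_updated_utc
    {"ticker": "BBB", "cik": "0000000003", "last_updated_utc": "2025-03-17T00:00:00Z"},
    {"ticker": "BBB", "cik": "0000000003", "last_updated_utc": "2025-03-24T00:00:00Z"},
]
df = pd.DataFrame(rows)

# Restrict to duplicated tickers, then flag the ones with more than one CIK
dups = df[df.groupby("ticker")["ticker"].transform("count") > 1]
multi_cik = dups.groupby("ticker")["cik"].nunique() > 1
print(multi_cik.to_dict())  # {'AAA': True, 'BBB': False}
```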
To Reproduce
import os
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from threading import Lock

import pandas as pd
from polygon import RESTClient

WORKERS = 50

polygon_secret = os.getenv("POLYGON_API_KEY")
polygon_client = RESTClient(polygon_secret, retries=10, trace=False)

# Fetch ticker info
tickers_active = polygon_client.list_tickers()

dfs = []
dfs_lock = Lock()

def fetch_ticker_info(ticker):
    ticker_info = {
        "active": ticker.active,
        "cik": ticker.cik,
        "composite_figi": ticker.composite_figi,
        "currency_symbol": ticker.currency_symbol,
        "base_currency_symbol": ticker.base_currency_symbol,
        "currency_name": ticker.currency_name,
        "base_currency_name": ticker.base_currency_name,
        "delisted_utc": ticker.delisted_utc,
        "locale": ticker.locale,
        "title": ticker.name,
        "primary_exchange": ticker.primary_exchange,
        "share_class_figi": ticker.share_class_figi,
        "ticker": ticker.ticker,
        "type": ticker.type,
        "source_feed": ticker.source_feed,
    }
    with dfs_lock:
        dfs.append(ticker_info)
    return ticker_info

# Use ThreadPoolExecutor to process ticker data concurrently
print("Scraping web data")
with ThreadPoolExecutor(max_workers=WORKERS) as executor:
    executor.map(fetch_ticker_info, tickers_active)

all_tickers_df = pd.DataFrame(dfs)
all_tickers_df

# Show the affected records;
# there are 51 unique tickers and 102 records that appear to be duplicated
all_tickers_df[all_tickers_df.groupby('ticker')['ticker'].transform('count') > 1]

# Consider the duplicate records above:
# count the number of tickers with more than one CIK (True)
# vs. the number of tickers with exactly one CIK (False)
Counter(all_tickers_df[all_tickers_df.groupby('ticker')['ticker'].transform('count') > 1].groupby("ticker")["cik"].transform('nunique') > 1)
Expected behavior
all_tickers_df[all_tickers_df.groupby('ticker')['ticker'].transform('count') > 1] should return an empty DataFrame. Instead, the results of this command show 51 unique tickers with two rows each: 12 unique tickers for which the same ticker has two CIKs, and 39 unique tickers for which the same ticker has only one CIK but a different last_updated_utc timestamp.
Additional context
I join to this table and assume the tickers are unique. I have 6 downstream tables that are affected by this. I can deduplicate the all tickers endpoint on my end, but it feels to me as if the endpoint shouldn't be returning what appear to be duplicate records in the first place.
FWIW, nothing on my end changed and the audits I've put in place have been active for over four months without issue. According to my audit records, these duplicates started showing up on March 24th. I only download these data weekly, so I believe it's likely that the problem was introduced some time between March 17th and March 24th.
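For anyone else hitting this before it's fixed upstream, here is a minimal deduplication sketch (assuming you also capture last_updated_utc in your rows; the data below is synthetic). It keeps the most recently updated row per ticker:

```python
import pandas as pd

# Synthetic duplicated data standing in for the list_tickers() output
all_tickers_df = pd.DataFrame([
    {"ticker": "AAA", "cik": "0000000001", "last_updated_utc": "2025-03-17T00:00:00Z"},
    {"ticker": "AAA", "cik": "0000000001", "last_updated_utc": "2025-03-24T00:00:00Z"},
    {"ticker": "BBB", "cik": "0000000002", "last_updated_utc": "2025-03-24T00:00:00Z"},
])

# ISO-8601 timestamps sort lexicographically in chronological order,
# so sorting then keeping the last row per ticker retains the newest record
deduped = (
    all_tickers_df
    .sort_values("last_updated_utc")
    .drop_duplicates(subset="ticker", keep="last")
    .reset_index(drop=True)
)
print(len(deduped))  # 2
```

Note this arbitrarily keeps one row for the tickers that have two different CIKs; those 12 likely need manual resolution rather than a blind dedup.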
It seems that duplicated data are continuing to be added to the data source on your end. I'm now seeing 46 near-duplicates where the only difference is the last_updated_utc timestamp, and 11 partial dups where one ticker has two different CIKs but the records are otherwise the same.