polygon_client.list_tickers() returns duplicate tickers #637

Open
evanvolgas opened this issue Apr 2, 2024 · 3 comments
Labels: bug, data

Comments

evanvolgas commented Apr 2, 2024

Describe the bug

A few days ago, my ETL audits started reporting duplicate records in my all_tickers BigQuery table, which itself is based on polygon_client.list_tickers(). I dug into these 102 duplicates and it appears to me they are being caused by the same ticker having different last_updated_utc and/or a different CIK.

In 12 of the 51 affected tickers I found, the same ticker has two CIKs listed at the polygon_client.list_tickers() endpoint.

In 39 of the 51 affected tickers I found, the same ticker has the same CIK listed at the polygon_client.list_tickers() endpoint, and the only difference is the last_updated_utc.

To Reproduce

import os

from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from threading import Lock

import pandas as pd

from polygon import RESTClient

WORKERS = 50
polygon_secret = os.getenv("POLYGON_API_KEY")
polygon_client = RESTClient(polygon_secret, retries=10, trace=False)

# Fetch Ticker Info
tickers_active = polygon_client.list_tickers()
dfs = []
dfs_lock = Lock()

def fetch_ticker_info(ticker):
    ticker_info = {
        "active": ticker.active,
        "cik": ticker.cik,
        "composite_figi": ticker.composite_figi,
        "currency_symbol": ticker.currency_symbol,
        "base_currency_symbol": ticker.base_currency_symbol,
        "currency_name": ticker.currency_name,
        "base_currency_name": ticker.base_currency_name,
        "delisted_utc": ticker.delisted_utc,
        "locale": ticker.locale,
        "title": ticker.name,
        "primary_exchange": ticker.primary_exchange,
        "share_class_figi": ticker.share_class_figi,
        "ticker": ticker.ticker,
        "type": ticker.type,
        "source_feed": ticker.source_feed,
    }
    with dfs_lock:
        dfs.append(ticker_info)
    return ticker_info


# Using ThreadPoolExecutor to fetch tickers data concurrently
print("Scraping web data")
with ThreadPoolExecutor(max_workers=WORKERS) as executor:
    executor.map(fetch_ticker_info, tickers_active)

all_tickers_df = pd.DataFrame(dfs)
all_tickers_df

# Show the affected records;
# there are 51 unique tickers and 102 records that appear to be duplicated
duplicated_rows = all_tickers_df[all_tickers_df.groupby('ticker')['ticker'].transform('count') > 1]
duplicated_rows

# Consider the duplicate records above;
# count the number of tickers with more than one CIK (True)
# vs. the number of tickers with exactly one CIK (False).
# Aggregating with nunique() yields one value per ticker, so the Counter
# counts tickers rather than rows.
Counter(duplicated_rows.groupby("ticker")["cik"].nunique() > 1)

Expected behavior
all_tickers_df[all_tickers_df.groupby('ticker')['ticker'].transform('count') > 1] should return an empty DataFrame. Instead, the result shows 51 unique tickers with two rows each. For 12 of those tickers, the same ticker has two different CIKs. For the other 39, the ticker has only one CIK, but the two rows have different last_updated_utc timestamps.
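The breakdown described above (CIK conflicts vs. rows that differ only by timestamp) can be sketched as a small per-ticker summary. This is a minimal illustration, not part of the original report: the helper name summarize_duplicates and the sample rows are made up, and the only assumption about the real data is that the frame has ticker and cik columns like all_tickers_df.

```python
import pandas as pd


def summarize_duplicates(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per duplicated ticker with its row count and distinct CIK count.

    Hypothetical helper for illustration; assumes 'ticker' and 'cik' columns.
    """
    # Keep every row whose ticker appears more than once.
    dupes = df[df.duplicated("ticker", keep=False)]
    grouped = dupes.groupby("ticker")["cik"]
    summary = pd.DataFrame(
        {
            "rows": grouped.size(),              # rows returned for this ticker
            "distinct_ciks": grouped.nunique(),  # missing CIKs are not counted
        }
    )
    summary["cik_conflict"] = summary["distinct_ciks"] > 1
    return summary


# Made-up sample rows, not real API output.
sample = pd.DataFrame(
    {
        "ticker": ["AAA", "AAA", "BBB", "BBB", "CCC"],
        "cik": ["0001", "0002", "0003", "0003", "0004"],
    }
)

summary = summarize_duplicates(sample)
print(summary)
print(summary["cik_conflict"].value_counts())
```

An empty summary corresponds to the expected behavior. On the real data, summarize_duplicates(all_tickers_df) would be the equivalent call.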

Additional context
I join to this table and assume the tickers are unique. I have 6 downstream tables that are affected by this. I can deduplicate the all tickers endpoint on my end, but it feels to me as if the endpoint shouldn't be returning what appear to be duplicate records in the first place.

FWIW, nothing on my end changed and the audits I've put in place have been active for over four months without issue. According to my audit records, these duplicates started showing up on March 24th. I only download these data weekly, so I believe it's likely that the problem was introduced some time between March 17th and March 24th.
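For reference, the client-side deduplication mentioned above could look roughly like the sketch below. It is a workaround under stated assumptions, not a fix: it assumes a last_updated_utc column has been captured alongside the other fields (the reproduction script does not collect it), the helper name dedupe_tickers and the sample rows are hypothetical, and for tickers with two different CIKs it simply keeps whichever row was updated most recently, which may or may not be the correct one.

```python
import pandas as pd


def dedupe_tickers(
    df: pd.DataFrame, key: str = "ticker", ts_col: str = "last_updated_utc"
) -> pd.DataFrame:
    """Keep only the most recently updated row for each ticker.

    Hypothetical helper for illustration; assumes ts_col holds ISO 8601 strings.
    """
    # Parse timestamps into a temporary column so the sort is chronological.
    ordered = df.assign(_ts=pd.to_datetime(df[ts_col], utc=True)).sort_values(
        [key, "_ts"], kind="stable", na_position="first"
    )
    # After sorting, the last row in each group is the newest one;
    # rows with a missing timestamp sort first and are kept only if alone.
    return (
        ordered.drop_duplicates(subset=key, keep="last")
        .drop(columns="_ts")
        .reset_index(drop=True)
    )


# Made-up sample rows, not real API output.
sample = pd.DataFrame(
    {
        "ticker": ["AAA", "AAA", "BBB"],
        "cik": ["0001", "0001", "0002"],
        "last_updated_utc": [
            "2024-03-17T00:00:00Z",
            "2024-03-24T00:00:00Z",
            "2024-03-24T00:00:00Z",
        ],
    }
)

deduped = dedupe_tickers(sample)
print(deduped)
```

Downstream joins that assume unique tickers can then check deduped["ticker"].is_unique before loading.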

evanvolgas added the bug label Apr 2, 2024

evanvolgas commented Apr 10, 2024

It seems that duplicated data are continuing to be added to the data source on your end. I'm now seeing 46 exact duplicates where the only difference is the last_updated_utc timestamp, and 11 partial duplicates where one ticker has two different CIKs but the records are otherwise the same.

@justinpolygon (Contributor)

Hi @evanvolgas, we're checking this out now. We created an internal ticket earlier, but we're bumping it up in priority. Thank you for reporting.

@evanvolgas (Author)

Thank you!
