Describe the bug
A few days ago, my ETL audits started reporting duplicate records in my all_tickers BigQuery table, which itself is based on polygon_client.list_tickers(). I dug into these 102 duplicates, and they appear to be caused by the same ticker having a different last_updated_utc and/or a different CIK.
In 12 of the 51 affected tickers I found, the same ticker has two CIKs listed at the polygon_client.list_tickers() endpoint.
In 39 of the 51 affected tickers I found, the same ticker has the same CIK listed at the polygon_client.list_tickers() endpoint, and the only difference is the last_updated_utc.
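For illustration, the two duplicate patterns can be mocked up with a small synthetic frame (the tickers, CIKs, and timestamps below are made up, not real API output):

```python
import pandas as pd

# Synthetic rows mimicking the two duplicate patterns seen in list_tickers()
rows = [
    # Pattern 1: same ticker, two different CIKs
    {"ticker": "AAA", "cik": "0000000001", "last_updated_utc": "2025-03-24T00:00:00Z"},
    {"ticker": "AAA", "cik": "0000000002", "last_updated_utc": "2025-03-24T00:00:00Z"},
    # Pattern 2: same ticker and CIK, different last_updated_utc
    {"ticker": "BBB", "cik": "0000000003", "last_updated_utc": "2025-03-17T00:00:00Z"},
    {"ticker": "BBB", "cik": "0000000003", "last_updated_utc": "2025-03-24T00:00:00Z"},
]
df = pd.DataFrame(rows)

# Restrict to duplicated tickers, then flag the ones with more than one CIK
dups = df[df.groupby("ticker")["ticker"].transform("count") > 1]
multi_cik = dups.groupby("ticker")["cik"].nunique() > 1
print(multi_cik.to_dict())  # {'AAA': True, 'BBB': False}
```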
To Reproduce
import os
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from threading import Lock

import pandas as pd
from polygon import RESTClient

WORKERS = 50

polygon_secret = os.getenv("POLYGON_API_KEY")
polygon_client = RESTClient(polygon_secret, retries=10, trace=False)

# Fetch ticker info
tickers_active = polygon_client.list_tickers()

dfs = []
dfs_lock = Lock()

def fetch_ticker_info(ticker):
    ticker_info = {
        "active": ticker.active,
        "cik": ticker.cik,
        "composite_figi": ticker.composite_figi,
        "currency_symbol": ticker.currency_symbol,
        "base_currency_symbol": ticker.base_currency_symbol,
        "currency_name": ticker.currency_name,
        "base_currency_name": ticker.base_currency_name,
        "delisted_utc": ticker.delisted_utc,
        "locale": ticker.locale,
        "title": ticker.name,
        "primary_exchange": ticker.primary_exchange,
        "share_class_figi": ticker.share_class_figi,
        "ticker": ticker.ticker,
        "type": ticker.type,
        "source_feed": ticker.source_feed,
    }
    with dfs_lock:
        dfs.append(ticker_info)
    return ticker_info

# Use ThreadPoolExecutor to process ticker data concurrently
print("Scraping web data")
with ThreadPoolExecutor(max_workers=WORKERS) as executor:
    executor.map(fetch_ticker_info, tickers_active)

all_tickers_df = pd.DataFrame(dfs)
all_tickers_df

# Show the affected records;
# there are 51 unique tickers and 102 records that appear to be duplicated
all_tickers_df[all_tickers_df.groupby('ticker')['ticker'].transform('count') > 1]

# Consider the duplicate records above:
# count the number of tickers with more than one CIK (True)
# vs. the number of tickers with exactly one CIK (False)
Counter(all_tickers_df[all_tickers_df.groupby('ticker')['ticker'].transform('count') > 1].groupby("ticker")["cik"].transform('nunique') > 1)
Expected behavior
all_tickers_df[all_tickers_df.groupby('ticker')['ticker'].transform('count') > 1] should return an empty DataFrame. Instead, the results of this command show 51 unique tickers with two rows each: 12 unique tickers for which the same ticker has two CIKs, and 39 unique tickers for which the same ticker has only one CIK but a different last_updated_utc timestamp.
Additional context
I join to this table and assume the tickers are unique. I have 6 downstream tables that are affected by this. I can deduplicate the all tickers endpoint on my end, but it feels to me as if the endpoint shouldn't be returning what appear to be duplicate records in the first place.
FWIW, nothing on my end changed and the audits I've put in place have been active for over four months without issue. According to my audit records, these duplicates started showing up on March 24th. I only download these data weekly, so I believe it's likely that the problem was introduced some time between March 17th and March 24th.
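For anyone else hitting this before it's fixed upstream, here is a minimal deduplication sketch (assuming you also capture last_updated_utc in your rows; the data below is synthetic). It keeps the most recently updated row per ticker:

```python
import pandas as pd

# Synthetic duplicated data standing in for the list_tickers() output
all_tickers_df = pd.DataFrame([
    {"ticker": "AAA", "cik": "0000000001", "last_updated_utc": "2025-03-17T00:00:00Z"},
    {"ticker": "AAA", "cik": "0000000001", "last_updated_utc": "2025-03-24T00:00:00Z"},
    {"ticker": "BBB", "cik": "0000000002", "last_updated_utc": "2025-03-24T00:00:00Z"},
])

# ISO-8601 timestamps sort lexicographically in chronological order,
# so sorting then keeping the last row per ticker retains the newest record
deduped = (
    all_tickers_df
    .sort_values("last_updated_utc")
    .drop_duplicates(subset="ticker", keep="last")
    .reset_index(drop=True)
)
print(len(deduped))  # 2
```

Note this arbitrarily keeps one row for the tickers that have two different CIKs; those 12 likely need manual resolution rather than a blind dedup.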
It seems that duplicated data are continuing to be added to the data source on your end. I'm now seeing 46 near-duplicates where the only difference is the last_updated_utc timestamp, and 11 partial dups where one ticker has two different CIKs but the records are otherwise the same.