pypi.projects seems to have silently stopped tracking releases for a lot of packages
The pipeline itself is running fine (latest upload_time is from today), and pypi_downloads_per_day is current. But for specific packages, pypi.projects just stopped getting new rows at some point in the past.
A few examples:
- boto3 (1.8B monthly downloads) - last pypi.projects entry is from 2015
- botocore (1.2B) - also 2015
- numpy (776M) - stopped in 2018
This means queries like argMax(version, upload_time) return the wrong latest version for a lot of packages.
To reproduce, this query finds packages with >1M monthly downloads but stale pypi.projects data:
SELECT
p.norm_name,
p.max_upload,
d.monthly_downloads
FROM (
SELECT
replaceRegexpAll(lower(name), '[-_.]+', '-') as norm_name,
max(upload_time) as max_upload
FROM pypi.projects
GROUP BY norm_name
HAVING max_upload < '2025-01-01'
) p
INNER JOIN (
SELECT project as norm_name, sum(count) as monthly_downloads
FROM pypi.pypi_downloads_per_day
WHERE date >= today() - 30
GROUP BY project
HAVING monthly_downloads > 1000000
) d ON p.norm_name = d.norm_name
ORDER BY d.monthly_downloads DESC
LIMIT 30
The download tables (e.g. pypi_downloads_per_day_by_version) do have current version data for these packages, it's specifically the pypi.projects table that's stale.
Thanks for maintaining this dataset.
pypi.projects seems to have silently stopped tracking releases for a lot of packages
The pipeline itself is running fine (latest upload_time is from today), and pypi_downloads_per_day is current. But for specific packages, pypi.projects just stopped getting new rows at some point in the past.
A few examples:
This means queries like argMax(version, upload_time) return the wrong latest version for a lot of packages.
To reproduce, this query finds packages with >1M monthly downloads but stale pypi.projects data:
SELECT
p.norm_name,
p.max_upload,
d.monthly_downloads
FROM (
SELECT
replaceRegexpAll(lower(name), '[-_.]+', '-') as norm_name,
max(upload_time) as max_upload
FROM pypi.projects
GROUP BY norm_name
HAVING max_upload < '2025-01-01'
) p
INNER JOIN (
SELECT project as norm_name, sum(count) as monthly_downloads
FROM pypi.pypi_downloads_per_day
WHERE date >= today() - 30
GROUP BY project
HAVING monthly_downloads > 1000000
) d ON p.norm_name = d.norm_name
ORDER BY d.monthly_downloads DESC
LIMIT 30
The download tables (e.g. pypi_downloads_per_day_by_version) do have current version data for these packages, it's specifically the pypi.projects table that's stale.
Thanks for maintaining this dataset.