Skip to content

pypi.projects table missing releases for many popular packages #210

@williajm

Description

@williajm

pypi.projects seems to have silently stopped tracking releases for a lot of packages

The pipeline itself is running fine (latest upload_time is from today), and pypi_downloads_per_day is current. But for specific packages, pypi.projects just stopped getting new rows at some point in the past.

A few examples:

  • boto3 (1.8B monthly downloads) - last pypi.projects entry is from 2015
  • botocore (1.2B) - also 2015
  • numpy (776M) - stopped in 2018

This means queries like argMax(version, upload_time) return the wrong latest version for a lot of packages.

To reproduce, this query finds packages with >1M monthly downloads but stale pypi.projects data:

SELECT
p.norm_name,
p.max_upload,
d.monthly_downloads
FROM (
SELECT
replaceRegexpAll(lower(name), '[-_.]+', '-') as norm_name,
max(upload_time) as max_upload
FROM pypi.projects
GROUP BY norm_name
HAVING max_upload < '2025-01-01'
) p
INNER JOIN (
SELECT project as norm_name, sum(count) as monthly_downloads
FROM pypi.pypi_downloads_per_day
WHERE date >= today() - 30
GROUP BY project
HAVING monthly_downloads > 1000000
) d ON p.norm_name = d.norm_name
ORDER BY d.monthly_downloads DESC
LIMIT 30

The download tables (e.g. pypi_downloads_per_day_by_version) do have current version data for these packages, it's specifically the pypi.projects table that's stale.

Thanks for maintaining this dataset.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions