# Anaconda Package Data Quickstart

This notebook gives you easy access to package download stats: a single day, month, or year at hourly resolution, and a single year at monthly resolution. Please see the [README](https://github.com/ContinuumIO/anaconda-package-data/blob/master/README.md#quickstart) for environment setup instructions.

## Notebook variables

This notebook first sets some example parameters for the data you may want. The parameters and their allowed values are listed below. Year, month, and day must be integers; all other parameters must be strings. If you get zero total downloads, your settings may refer to data that does not exist. For example, the version you specified may not exist for that package name.

### Required variables

- *year*, *month*, *day* - the year (e.g., 2024), month (e.g., 4 for April), and day of interest. Single-digit months must not be preceded by a zero.

- *pkg_name* - the name of the conda package for which you want download statistics. Examples: "pandas", "numpy", "scikit-learn".

- *data_source* - the channel for which you want download stats, as a string. Must be one of: "conda-forge", "anaconda", "bioconda", "nvidia", "rapidsai-nightly", "rapidsai", "pyviz", "rdkit", "plotly", "pytorch".

- *pkg_platform* - the target platform of interest. Examples: "linux-64", "linux-aarch64", "win-64".

- *pkg_python* - the Python version of interest. Only major and minor versions are handled (e.g., "3.10").

- *pkg_version* - the version of interest. Must be the _full_ version as listed via `conda search`.


In [1]:
# required variable assignments
year=2024
month=5
day=12
pkg_name="numpy"
data_source="anaconda"
pkg_platform="linux-64"
pkg_python="3.10"
pkg_version="1.23.4"
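
Because a type or spelling mistake in these settings tends to surface only later as missing data, it can help to check them up front. The following is a minimal sketch, not part of the notebook or the catalog: `check_settings` and `ALLOWED_DATA_SOURCES` are hypothetical names, the channel list is copied from the description above, and the check assumes you are filtering on a major.minor Python version.

```python
import datetime
import re

# Channel names copied from the allowed data_source values listed above.
ALLOWED_DATA_SOURCES = {
    "conda-forge", "anaconda", "bioconda", "nvidia", "rapidsai-nightly",
    "rapidsai", "pyviz", "rdkit", "plotly", "pytorch",
}


def check_settings(year, month, day, pkg_name, data_source,
                   pkg_platform, pkg_python, pkg_version):
    """Return a list of problems; an empty list means nothing obvious is wrong.

    Hypothetical helper for illustration only.
    """
    problems = []

    # year, month and day must be plain integers that form a real date.
    if all(type(v) is int for v in (year, month, day)):
        try:
            datetime.date(year, month, day)
        except ValueError as err:
            problems.append(f"invalid date: {err}")
    else:
        problems.append("year, month and day must be integers")

    # Every other setting must be a string.
    strings = {
        "pkg_name": pkg_name,
        "data_source": data_source,
        "pkg_platform": pkg_platform,
        "pkg_python": pkg_python,
        "pkg_version": pkg_version,
    }
    for name, value in strings.items():
        if not isinstance(value, str):
            problems.append(f"{name} must be a string")

    if isinstance(data_source, str) and data_source not in ALLOWED_DATA_SOURCES:
        problems.append(f"unknown data_source: {data_source!r}")

    # Only major.minor Python versions are handled, e.g. "3.10".
    if isinstance(pkg_python, str) and not re.fullmatch(r"\d+\.\d+", pkg_python):
        problems.append("pkg_python must be major.minor, e.g. '3.10'")

    return problems


print(check_settings(2024, 5, 12, "numpy", "anaconda", "linux-64", "3.10", "1.23.4"))
```

Passing `pkg_python=3.10` (a float) is a common slip: it would compare as `3.1` and never match, which is why the sketch insists on a string.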

## Setup

The intake catalog defines four sources (`anaconda_package_data_by_day`, `anaconda_package_data_by_month`, `anaconda_package_data_by_year`, `anaconda_package_data_by_year_month_res`) that we can use to access the underlying S3 buckets storing package data. This notebook outlines the usage of each source using the `intake` and `dask` packages.

In [None]:
import hvplot.pandas
import intake
import dask
import pandas as pd
from typing import Optional

# Get the data
cat = intake.open_catalog('https://raw.githubusercontent.com/anaconda/anaconda-package-data/master/catalog/anaconda_package_data.yaml')
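
Each of the four sources takes a different subset of the date parameters, as the cells further down show. A small sketch of that mapping follows; `SOURCE_PARAMS` and `source_kwargs` are hypothetical names introduced here for illustration, and the parameter lists are taken from how this notebook calls each source rather than from the catalog file itself.

```python
# Date parameters each catalog source is called with in this notebook.
SOURCE_PARAMS = {
    "anaconda_package_data_by_day": ("year", "month", "day"),
    "anaconda_package_data_by_month": ("year", "month"),
    "anaconda_package_data_by_year": ("year",),
    "anaconda_package_data_by_year_month_res": ("year",),
}


def source_kwargs(source_name, year, month, day):
    """Select only the date parameters that the named source takes."""
    available = {"year": year, "month": month, "day": day}
    return {name: available[name] for name in SOURCE_PARAMS[source_name]}


for name in SOURCE_PARAMS:
    print(name, source_kwargs(name, 2024, 5, 12))
```

With the catalog loaded, something like `getattr(cat, name)(**source_kwargs(name, year, month, day))` should then mirror the attribute-style calls used in the rest of this notebook.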

We will filter for the same values throughout this notebook, so we define a pandas filter function for reusability. Your application will likely need its own data processing steps.

In [3]:
def filter_package_data(
    df: pd.DataFrame,
    pkg_name: str,
    data_source: str,
    pkg_platform: str,
    pkg_python: str,
    pkg_version: str
) -> Optional[pd.DataFrame]:
    """
    Filter package data based on specified criteria.

    Returns:
        Optional[pd.DataFrame]: Filtered DataFrame, or None if the input is empty.

    Raises:
        ValueError: If no rows match the given combination of values.
    """
        
    if df.empty:
        print("Data not available - input DataFrame is empty.")
        return None
        
    filtered_df = df[
        (df["pkg_name"] == pkg_name) & 
        (df["data_source"] == data_source) & 
        (df["pkg_platform"] == pkg_platform) &
        (df["pkg_python"] == pkg_python) &
        (df["pkg_version"] == pkg_version)   
    ].reset_index(drop=True)

    if filtered_df.empty:
        raise ValueError("This combination of variables has not matched any package download data. Please modify the required variables.")
    return filtered_df
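
When the filter raises `ValueError`, it is not obvious which setting eliminated the last matching rows. One way to find out, sketched below, is to apply the criteria one at a time and record how many rows survive each step. `remaining_rows_per_filter` is a hypothetical helper and the DataFrame holds made-up rows, not real download data; only the column names come from the filter function above.

```python
import pandas as pd


def remaining_rows_per_filter(df: pd.DataFrame, criteria: dict) -> dict:
    """Apply each equality filter in order and record how many rows survive."""
    remaining = {}
    mask = pd.Series(True, index=df.index)
    for column, value in criteria.items():
        mask &= df[column] == value
        remaining[column] = int(mask.sum())
    return remaining


# Synthetic rows for illustration only.
sample = pd.DataFrame({
    "pkg_name": ["numpy", "numpy", "pandas"],
    "data_source": ["anaconda", "anaconda", "anaconda"],
    "pkg_platform": ["linux-64", "win-64", "linux-64"],
    "pkg_python": ["3.10", "3.10", "3.10"],
    "pkg_version": ["1.23.4", "1.23.4", "2.0.0"],
    "counts": [5, 3, 7],
})

report = remaining_rows_per_filter(sample, {
    "pkg_name": "numpy",
    "data_source": "anaconda",
    "pkg_platform": "linux-64",
    "pkg_python": "3.10",
    "pkg_version": "9.9.9",  # deliberately non-existent version
})
print(report)
```

The first column whose count drops to zero is the setting to revisit; in this synthetic case it is `pkg_version`.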

## Hourly resolution

### Import and process a single day of package download data

In [4]:
df_day = cat.anaconda_package_data_by_day(year=year, month=month, day=day).to_dask().compute()

pkg_day = filter_package_data(
    df_day,
    pkg_name,
    data_source,
    pkg_platform,
    pkg_python,
    pkg_version
)

### Plot downloads for a day

In [None]:
pkg_day.hvplot("time", "counts")
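
Besides plotting, a couple of summary numbers are often useful for a single day: the total downloads and the busiest hour. The sketch below uses a synthetic stand-in for `pkg_day` with made-up counts; it assumes only the `time` and `counts` columns used in the plot above, and that `time` holds timestamps.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for pkg_day: 24 hourly rows with made-up counts.
hours = pd.Timestamp("2024-05-12") + pd.to_timedelta(np.arange(24), unit="h")
counts = [10] * 24
counts[14] = 50  # pretend 14:00 was the busiest hour
pkg_day_example = pd.DataFrame({"time": hours, "counts": counts})

# Total downloads across the day.
total_downloads = int(pkg_day_example["counts"].sum())

# Row with the highest hourly count.
busiest = pkg_day_example.loc[pkg_day_example["counts"].idxmax()]

print(total_downloads)
print(busiest["time"], busiest["counts"])
```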

### Import and process a single month of package download data

In [6]:
df_month = cat.anaconda_package_data_by_month(year=year, month=month).to_dask().compute()

pkg_month = filter_package_data(
    df_month,
    pkg_name,
    data_source,
    pkg_platform,
    pkg_python,
    pkg_version
)

### Plot downloads for a month

In [None]:
pkg_month.hvplot("time", "counts")

### Import and process a single year of package download data

In [8]:
# This data import will take quite a while
df_year = cat.anaconda_package_data_by_year(year=year).to_dask().compute()

pkg_year = filter_package_data(
    df_year,
    pkg_name,
    data_source,
    pkg_platform,
    pkg_python,
    pkg_version
)

### Plot downloads for a year

In [None]:
pkg_year.hvplot("time", "counts")
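
A full year at hourly resolution yields a dense plot, so it may be easier to read after aggregating to daily totals. The sketch below shows one way to do that with `DataFrame.resample`, using three days of synthetic hourly rows in place of `pkg_year`. It assumes the `time` column has a datetime dtype; if it does not, converting it with `pd.to_datetime` first would be needed.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for pkg_year: 72 hourly rows, 2 downloads each.
hours = pd.Timestamp("2024-01-01") + pd.to_timedelta(np.arange(72), unit="h")
pkg_year_example = pd.DataFrame({"time": hours, "counts": 2})

# Sum the hourly counts into one row per calendar day.
daily = pkg_year_example.resample("D", on="time")["counts"].sum()

print(daily)
```

The result is a Series indexed by day, so `daily.reset_index()` gives back `time` and `counts` columns that can be passed to `hvplot("time", "counts")` in the same way as the hourly data.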

## Monthly resolution

### Import and process a single year of package download data

In [10]:
df_year_monthly_res = cat.anaconda_package_data_by_year_month_res(year=year).to_dask().compute()

pkg_year_monthly_res = filter_package_data(
    df_year_monthly_res,
    pkg_name,
    data_source,
    pkg_platform,
    pkg_python,
    pkg_version
)

In [None]:
pkg_year_monthly_res.hvplot("time", "counts")