# Anaconda Package Data Quickstart

This notebook gives you easy access to package download stats. See the [README](https://github.com/ContinuumIO/anaconda-package-data/blob/master/README.md#quickstart) for setup instructions.

## Settings

First, set the parameters for the data you want. The parameters and their allowed values are listed below. Year and month must be integers; all other settings must be strings.

#### Required

*year*, *month* - the year (e.g., 2024) and month (e.g., 4, for April) of interest. The data covers that month, with a graph showing daily downloads within it. Do not zero-pad single-digit months: write 4, not 04, which is a syntax error in Python 3.

#### Optional

If any of the settings below is not given, the data for all values of that setting is included.

*data_source* - the channel for which you want download stats, as a string. Must be one of: "conda-forge", "anaconda", "bioconda", "nvidia", "rapidsai-nightly", "rapidsai", "pyviz", "rdkit", "plotly".

*pkg_name* - the name of the conda package for which you want download statistics. Examples: "pandas", "numpy", "scikit-learn".

*pkg_version* - the version of interest. Must be the _full_ version as listed via `conda search`.

*pkg_platform* - the target platform of interest. Examples: "linux-64", "linux-aarch64", "win-64".

*pkg_python* - the Python version of interest. Only major and minor versions are supported (e.g., "3.11").
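The type rules above can be checked before any data is downloaded. The following is a minimal sketch: `validate_settings` is a hypothetical helper, not part of this notebook or the package data project, and it checks only types and the month range, not whether a value actually exists in the dataset.

```python
def validate_settings(year, month, **optional):
    """Return a list of problems found in the settings (empty if none).

    Hypothetical helper, for illustration only.
    """
    problems = []
    # year and month must be plain integers (bool is a subclass of int, so exclude it)
    for name, value in (("year", year), ("month", month)):
        if isinstance(value, bool) or not isinstance(value, int):
            problems.append(f"{name} must be an integer, got {value!r}")
    # only range-check month when it is a genuine integer
    if isinstance(month, int) and not isinstance(month, bool) and not 1 <= month <= 12:
        problems.append(f"month must be between 1 and 12, got {month}")
    # every optional setting must be a string
    for name, value in optional.items():
        if not isinstance(value, str):
            problems.append(f"{name} must be a string, got {value!r}")
    return problems


print(validate_settings(2024, 5, data_source="anaconda", pkg_name="pandas"))
print(validate_settings(2024, "05", pkg_python=3.11))
```

The string requirement matters most for *pkg_python*: written as a float, a version such as 3.10 would silently become 3.1.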

In [None]:
year=2024
month=5
data_source="anaconda"
pkg_name="pandas"
#pkg_version="2.2.2"
#pkg_platform="linux-64"
#pkg_python="3.11"

## Download and process data

In [None]:
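import dask
from datetime import datetime
import hvplot.pandas  # registers the .hvplot accessor on pandas objects
import intake
import numpy as np
import pandas as pd

# Show a progress bar whenever Dask computes a result
from dask.diagnostics import ProgressBar
pbar = ProgressBar()
pbar.register()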

# Get the data
cat = intake.open_catalog('https://raw.githubusercontent.com/ContinuumIO/anaconda-package-data/master/catalog/anaconda_package_data.yaml')
df = cat.anaconda_package_data_by_month(year=year, month=month).to_dask()

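# Initial processing
if not len(df):
    print("Data not available for given month. Note that data is only updated at the end of each month.")
else:
    df['day'] = df.time.dt.day
    df = df.drop(columns='time')

    # Combine one equality test per optional setting that is defined.
    # Note: a setting defined earlier in this session still counts,
    # even if its line has since been commented out.
    condition = None
    for setting in ["data_source", "pkg_name", "pkg_version", "pkg_platform", "pkg_python"]:
        if setting in globals():
            matches = df[setting] == globals()[setting]
            condition = matches if condition is None else condition & matches

    # With no optional settings defined, keep every row.
    selected = df if condition is None else df.loc[condition]

    # Sum only the download counts per day.
    pkg_day_agg = (
        selected
        .groupby('day')['counts']
        .sum()
        .reset_index()
        .compute()
    )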
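To see what the filter-and-aggregate step does without downloading a month of data, the same pattern can be tried on a tiny pandas DataFrame. This is a sketch with made-up rows: only the column names mirror those used in the cell above, and the `settings` dict is an assumed stand-in for the notebook-level variables.

```python
import pandas as pd

# Made-up rows for illustration; these are not real download counts.
toy = pd.DataFrame({
    "day": [1, 1, 2, 2, 2],
    "data_source": ["anaconda", "conda-forge", "anaconda", "anaconda", "anaconda"],
    "pkg_name": ["pandas", "pandas", "pandas", "numpy", "pandas"],
    "pkg_platform": ["linux-64", "linux-64", "win-64", "linux-64", "linux-64"],
    "counts": [10, 7, 3, 5, 4],
})

# Stand-in for the optional notebook settings that are defined.
settings = {"data_source": "anaconda", "pkg_name": "pandas"}

# Start from an all-True mask and narrow it once per setting.
mask = pd.Series(True, index=toy.index)
for column, value in settings.items():
    mask &= toy[column] == value

# Keep the matching rows, then sum the counts for each day.
daily = toy.loc[mask].groupby("day")["counts"].sum().reset_index()
print(daily)
```

Leaving `settings` empty keeps the mask all-True, so every row contributes to the daily totals.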

## Compute results

#### Total downloads

In [None]:
pkg_day_agg['counts'].sum()

## Troubleshooting

If the total is zero, one or more of your settings probably has a value that does not exist in the data for that month.
Uncomment the code below and run it to find out which one.

In [None]:
# df_filtered = df
# error_suffix = ""

# # Perform nested checks, skipping settings that are not defined
# for setting in ["data_source", "pkg_name", "pkg_version", "pkg_platform", "pkg_python"]:
#     if setting not in globals():
#         continue
#     value = globals()[setting]
#     if value not in df_filtered[setting].compute().unique():
#         print(f"The value '{value}' is not present for '{setting}'{error_suffix}.")
#         break
#     error_suffix = f" for {setting}='{value}'" + error_suffix
#     df_filtered = df_filtered[df_filtered[setting] == value]
# else:
#     print("The data exists in the dataset.")

In [None]:
df["pkg_version"].head()
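When a value is reported as not present, a frequent cause is a near miss such as a typo or an unexpected spelling. Below is a minimal sketch of suggesting close alternatives with the standard library's `difflib.get_close_matches`. `suggest` is a hypothetical helper and the candidate lists are made up; in the notebook, the candidates would come from the unique values of the relevant column.

```python
from difflib import get_close_matches


def suggest(value, known_values, n=3):
    """Return up to n known values that look similar to value (hypothetical helper)."""
    return get_close_matches(value, list(known_values), n=n, cutoff=0.6)


# Made-up candidate lists for illustration.
known_platforms = ["linux-64", "linux-aarch64", "osx-64", "osx-arm64", "win-64"]
known_packages = ["pandas", "numpy", "scikit-learn"]

print(suggest("linux-aarch-64", known_platforms))
print(suggest("pandsa", known_packages))
```

The results are ordered from most to least similar, so the first entry is usually the intended value.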

#### Downloads by day

In [None]:
pkg_day_agg.hvplot('day','counts')

#### Downloads by platform

In [None]:
# Note: this uses the full month of data (df), not only the rows matching the settings above
platform_month = df.groupby(['pkg_platform'])['counts'].sum().reset_index().compute()
platform_month.hvplot.bar('pkg_platform', 'counts', rot=90)