# BW \#71 Holidays
The goal is to create a data frame of holidays from around the world, based on a database available on PyPI, and then run a series of queries against it.

## Data and six questions
The original data comes from the `holidays` package on PyPI, which lets you retrieve the holidays for any supported country over any range of years. To query this data for all countries, we also need the `pycountry` package, whose two-character country codes serve as the keys for our queries to the `holidays` package.

## Challenges
The learning goals include creating a data frame from Python data, string handling, date handling, grouping, and joins.
- Create a data frame with four columns (country name, alpha2, date, and holiday name) for all countries, covering the years 2010 through 2024. Use the `pycountry` module (from PyPI) to iterate over all of the countries in the world, and the `holidays` module (also from PyPI) to retrieve each country's holidays. The dates should be stored in a datetime column.
- Which countries have holidays in June 2024? Which of this month's holidays, if any, are celebrated in more than one country? Do we see any issues that might result in a mis-count?
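
As a warm-up for the first challenge, here is a minimal sketch of turning a list of Python rows into a data frame with a proper datetime column. The rows are a few hand-written examples in the target layout (not the full data set), and the variable names `rows` and `sample` are illustrative assumptions.

```python
from datetime import date

import pandas as pd

# Hand-written sample rows: country name, alpha-2 code, date, holiday name.
rows = [
    ["Aruba", "AW", date(2016, 1, 1), "Aña Nobo"],
    ["Aruba", "AW", date(2016, 1, 25), "Dia di Betico"],
    ["Zimbabwe", "ZW", date(2015, 12, 25), "Christmas Day"],
]

sample = pd.DataFrame(rows, columns=["country", "alpha_2", "date", "holiday"])

# datetime.date objects end up in an object column, so convert them
# explicitly to pandas' datetime64 dtype.
sample["date"] = pd.to_datetime(sample["date"])

print(sample.dtypes)
```

Converting the column up front is what later makes the `dt` accessor and date-based indexing available.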


In [1]:
import pandas as pd 
import holidays
import pycountry

In [2]:
# Create an empty DataFrame with specified columns
df = pd.DataFrame(columns=['country name', 'alpha2', 'date', 'holiday name'])

# Generate a date range
df['date'] = pd.date_range(start='2010-01-01', end='2024-12-31', freq='D')

# date_range already yields datetime64 values, so this conversion is only a safeguard
df['date'] = pd.to_datetime(df['date'], errors='coerce')


In [23]:
!pip install --upgrade holidays

Collecting holidays
  Downloading holidays-0.55-py3-none-any.whl.metadata (23 kB)
Downloading holidays-0.55-py3-none-any.whl (1.1 MB)
Installing collected packages: holidays
  Attempting uninstall: holidays
    Found existing installation: holidays 0.53
    Uninsta

In [33]:
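all_holidays = []

# Visit every country known to pycountry and collect its holidays for 2010-2024
for one_country in pycountry.countries:
    try:
        country_holidays = holidays.country_holidays(
            one_country.alpha_2, years=range(2010, 2025)
        )
    except NotImplementedError:
        # The holidays package does not support this country, so skip it
        continue

    for holiday_date, holiday_name in country_holidays.items():
        all_holidays.append(
            [one_country.name, one_country.alpha_2, holiday_date, holiday_name]
        )

# Build the data frame and convert the date column to datetime64
df = (
    pd.DataFrame(all_holidays, columns=['country', 'alpha_2', 'date', 'holiday'])
    .assign(date=lambda df_: pd.to_datetime(df_['date']))
)
df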



,country,alpha_2,date,holiday
0,Aruba,AW,2016-01-01,Aña Nobo
1,Aruba,AW,2016-01-25,Dia di Betico
2,Aruba,AW,2016-02-08,Dialuna despues di Carnaval Grandi
3,Aruba,AW,2016-03-18,Dia di Himno y Bandera
4,Aruba,AW,2016-03-25,Bierna Santo
...,...,...,...,...
30953,Zimbabwe,ZW,2015-08-10,Zimbabwe Heroes' Day
30954,Zimbabwe,ZW,2015-08-11,Defense Forces Day
30955,Zimbabwe,ZW,2015-12-22,Unity Day
30956,Zimbabwe,ZW,2015-12-25,Christmas Day


## Which countries have holidays in June 2024?

Let's start by finding all of the rows in our data frame with a holiday in June 2024. One easy way to do this is to make the date column the data frame's index with `set_index`. With that in place, we can pass `loc` a partial date string that gives only the year and month, leaving out the day, to retrieve just the rows from that month.

Now that we've removed rows from other months and years, let's find the distinct countries that remain. We can do this by retrieving only the country column and then running `drop_duplicates` on the result.

We find that 88 different countries have at least one holiday this month.



In [42]:
(
    df
    .set_index('date')
    .loc['2024-06']
    ['country']
    .drop_duplicates()
)


date
2024-06-16                              Albania
2024-06-15                 United Arab Emirates
2024-06-20                            Argentina
2024-06-19                       American Samoa
2024-06-15                           Azerbaijan
                            ...                
2024-06-16                           Uzbekistan
2024-06-29        Holy See (Vatican City State)
2024-06-24    Venezuela, Bolivarian Republic of
2024-06-19                 Virgin Islands, U.S.
2024-06-16                         South Africa
Name: country, Length: 88, dtype: object
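
To see what the partial-string lookup is doing, here is a small sketch on made-up data. The frame `toy` and the country labels `A`, `B`, and `C` are hypothetical placeholders, not values from the real data set.

```python
import pandas as pd

# Hypothetical toy data: three June 2024 rows and one July 2024 row.
toy = pd.DataFrame(
    {
        "country": ["A", "A", "B", "C"],
        "date": pd.to_datetime(
            ["2024-06-03", "2024-06-17", "2024-06-17", "2024-07-01"]
        ),
    }
)

# With a DatetimeIndex, a "YYYY-MM" string selects every row in that month.
june = toy.set_index("date").loc["2024-06"]

# Country A appears twice in June; drop_duplicates keeps one row per country.
unique_countries = june["country"].drop_duplicates()

print(unique_countries.tolist())
```

Because `drop_duplicates` keeps the first occurrence of each value, the dates shown in the index of the result are simply the first June date found for each country.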

An alternative is to filter the rows with the `dt` accessor, keeping only dates in June 2024, and then count the distinct countries with `nunique`. This confirms the total of 88.



In [45]:
df[(df['date'].dt.year == 2024) & (df['date'].dt.month == 6)]['country'].nunique()

88

## Which of this month's holidays, if any, are celebrated in more than one country? 

In [58]:
holiday_count = df[(df['date'].dt.year == 2024) & (df['date'].dt.month == 6)].groupby('holiday')['country'].unique()
holiday_count[holiday_count.apply(len) > 1]


holiday
(تقدير) عطلة عيد الأضحى                 [United Arab Emirates, Bahrain, Algeria, Egypt...
(تقدير) عيد الأضحى                      [United Arab Emirates, Bahrain, Algeria, Egypt...
(تقدير) يوم عرفة                           [Egypt, Jordan, Kuwait, Saudi Arabia, Tunisia]
Eid al-Adha (estimated)                 [Albania, Burkina Faso, Cameroon, Gabon, India...
Eid al-Adha (estimated) (observed)                                    [Albania, Cameroon]
Juneteenth National Independence Day    [American Samoa, Guam, Northern Mariana Island...
King's Birthday                                           [New Zealand, Papua New Guinea]
San Pedro y San Pablo                                                       [Chile, Peru]
Δευτέρα του Αγίου Πνεύματος                                              [Cyprus, Greece]
Name: country, dtype: object

In [50]:
(
    df
    .set_index('date')
    .loc['2024-06']
    .groupby('holiday')['country'].count()
    .loc[lambda s_: s_ > 1]
    .sort_values(ascending=False)    
)

holiday
(تقدير) عطلة عيد الأضحى                 17
Eid al-Adha (estimated)                 11
(تقدير) عيد الأضحى                       9
Juneteenth National Independence Day     7
(تقدير) يوم عرفة                         5
Söndag                                   5
Kurban Bayramı                           4
Eid-ul-Adha (estimated)                  3
Eid al-Adha (estimated) (observed)       2
King's Birthday                          2
Qurban bayrami (təxmini)                 2
Rusaliile                                2
San Pedro y San Pablo                    2
Δευτέρα του Αγίου Πνεύματος              2
Name: country, dtype: int64
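
Note that `count` tallies rows rather than distinct countries, so a holiday that falls on several dates within a single country can pass the `> 1` threshold on its own. That may be why names such as `Söndag` and `Kurban Bayramı` appear in this result but not in the `unique`-based one from the previous cell. The sketch below illustrates the difference on made-up data; the frame `toy` and its country and holiday labels are hypothetical.

```python
import pandas as pd

# Hypothetical toy data: one holiday repeated three times in country A,
# and another holiday shared by countries B and C.
toy = pd.DataFrame(
    {
        "country": ["A", "A", "A", "B", "C"],
        "holiday": ["Weekly rest day"] * 3 + ["Shared feast"] * 2,
    }
)

# count() tallies rows, so the repeated single-country holiday scores 3.
rows_per_holiday = toy.groupby("holiday")["country"].count()

# nunique() tallies distinct countries, so the same holiday scores 1.
countries_per_holiday = toy.groupby("holiday")["country"].nunique()

# Only holidays observed in more than one distinct country remain.
shared = countries_per_holiday.loc[lambda s_: s_ > 1]

print(rows_per_holiday)
print(shared)
```

If the question is strictly about holidays celebrated in more than one country, counting distinct countries is probably the safer choice.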

## Do we see any issues that might result in a mis-count?
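
One issue visible in the output above is that what appears to be the same holiday is listed under several names: Eid al-Adha shows up in Arabic, English, Turkish, and Azerbaijani, with different spellings and with qualifiers such as `(estimated)` and `(observed)`. Grouping on the raw holiday name therefore splits one holiday across several groups. The sketch below is one possible partial remedy, not a complete fix: it strips the English qualifiers with a regular expression. The names come from the output above, while the variable names and the pattern are illustrative assumptions.

```python
import pandas as pd

# Holiday names copied from the June 2024 results.
names = pd.Series(
    [
        "Eid al-Adha (estimated)",
        "Eid al-Adha (estimated) (observed)",
        "Eid-ul-Adha (estimated)",
        "King's Birthday",
    ]
)

# Remove every "(estimated)" or "(observed)" qualifier, then trim whitespace.
cleaned = (
    names
    .str.replace(r"\s*\((?:estimated|observed)\)", "", regex=True)
    .str.strip()
)

print(cleaned.unique().tolist())
```

Even after this cleanup, spelling variants such as `Eid al-Adha` and `Eid-ul-Adha` remain separate, and names in other languages are untouched, so a reliable count would likely need a manual mapping from each local name to a canonical one.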