# **4.2 Exercise Assignment 5**
# Michael J. Montana
# College of Science and Technology, Bellevue University
# DSC400: Big Data, Technology, and Algorithms
# Professor Shawn Hermans
# July 2, 2023

# Acquiring and Storing Data

## Assignment 5

For this assignment and future assignments, assume that you are the owner of a small but growing retail business, *Datums R Us*. Your store sells technology, tools, and clothing for the discerning data scientist. You currently have stores in the following five locations. 

- Bellevue, Nebraska
- Columbus, Ohio
- Denver, Colorado
- San Francisco, California
- Baltimore, Maryland

You have been tasked with creating a data lake for the company using a [directory structure based on Cookiecutter Data Science recommendations](https://drivendata.github.io/cookiecutter-data-science/#directory-structure). This basic directory structure works well both for small, self-contained data science projects and for organizing large-scale data warehouses.

```
├── data
│   ├── external       <- Data from third-party sources.
│   ├── interim        <- Intermediate data that has been transformed.
│   ├── processed      <- The final, canonical data sets for modeling and reports.
│   └── raw            <- The original, immutable data dump.
```

You have identified the following items for initial inclusion in the data lake. 

**External Data Sets**

- Census (Updated Yearly)
- Weather Forecasts (Updated Daily)

**Raw Data Dumps**

- Sales (Updated Hourly)
- Inventory (Updated Daily)
- Expenses (Updated Daily)

**Processed Data Sets and Reports**

*Weekly*

- Modeling Data Set

*Monthly*

- Inventory Update Request

*Quarterly*

- Quarterly Financial Report

### Assignment 5.1

In the first part of the assignment, you will describe the directory structure for the data lake. For the most part, this directory structure will not depend on the technical details of how you store the data. You could be storing the data in a local filesystem, a distributed filesystem such as HDFS, or object storage, such as Amazon S3. 
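As a minimal sketch of why the layout does not depend on the storage backend, the same `PurePosixPath` can be rendered as a URI for different systems. The helper name `to_uri`, the bucket name, and the namenode address are illustrative assumptions, not part of the assignment.

```python
from pathlib import PurePosixPath


def to_uri(path: PurePosixPath, scheme: str, authority: str) -> str:
    """Render an absolute data lake path as a URI for a storage backend."""
    # Drop the leading "/" so the path can be appended to the authority.
    relative = path.relative_to("/")
    return f"{scheme}://{authority}/{relative}"


lake_path = PurePosixPath("/data/raw/sales")

# The bucket and namenode below are placeholders, not real endpoints.
print(to_uri(lake_path, "s3", "example-bucket"))
print(to_uri(lake_path, "hdfs", "namenode.example:8020"))
```

Because `PurePosixPath` never touches a filesystem, the directory logic can be written and checked before any storage system is chosen.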

You will only be creating the directory structures and not populating actual content. Real-world data lakes store data in a variety of formats, including Apache Parquet, Google Protocol Buffers, Apache Avro, JSONL, and CSV. 

You will use Python's built-in [calendar library](https://docs.python.org/3/library/calendar.html) and [datetime library](https://docs.python.org/3/library/datetime.html) to work with the dates and times required for this assignment. You will use the [PurePosixPath](https://docs.python.org/3/library/pathlib.html#pathlib.PurePosixPath) class from Python's built-in [pathlib library](https://docs.python.org/3/library/pathlib.html) to represent locations on the data lake. 

You will generate the output directories for an entire year's worth of data starting on January 1st of this year. Unless otherwise specified, all times will be in Coordinated Universal Time (UTC). 
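One way to sanity-check the number of directories a full year should produce is to subtract two January 1st dates and compare the result with `calendar.isleap`. This is only a sketch; the helper name `days_in` is an assumption made for illustration.

```python
import calendar
import datetime


def days_in(year: int) -> int:
    """Count the days in a year by subtracting consecutive January 1st dates."""
    start = datetime.date(year, 1, 1)
    end = datetime.date(year + 1, 1, 1)
    return (end - start).days


# Take the current year from a UTC clock rather than local time.
utc_year = datetime.datetime.now(datetime.timezone.utc).year
assert days_in(utc_year) == (366 if calendar.isleap(utc_year) else 365)

print(days_in(2022), days_in(2024))  # 365 366
```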

In [33]:
# Imports the required Python libraries and 
# sets global variables for the assignment
import calendar
import datetime
from pathlib import PurePosixPath

today = datetime.date.today()
current_year = today.year
days_in_year = 365

if calendar.isleap(current_year):
    days_in_year +=1

hours_in_year = days_in_year * 24

In [34]:
# Creates paths for the external, interim, processed, and raw directories
# Use these paths when creating new paths

root_data_dir = PurePosixPath('/data')
external_data_dir = root_data_dir.joinpath('external')
interim_data_dir = root_data_dir.joinpath('interim')
processed_data_dir = root_data_dir.joinpath('processed')
raw_data_dir = root_data_dir.joinpath('raw')

print('Root Data Directory: {}'.format(root_data_dir))
print('External Data Directory: {}'.format(external_data_dir))
print('Interim Data Directory: {}'.format(interim_data_dir))
print('Processed Data Directory: {}'.format(processed_data_dir))
print('Raw Data Directory: {}'.format(raw_data_dir))

Root Data Directory: /data
External Data Directory: /data/external
Interim Data Directory: /data/interim
Processed Data Directory: /data/processed
Raw Data Directory: /data/raw


#### Assignment 5.1.a

For the purposes of this assignment, we will be using three Census data sets as examples of external data updated yearly. These data sets are:

- [American Community Survey (ACS) Summary File](https://www.census.gov/programs-surveys/acs/data/summary-file.html)
- [American Community Survey (ACS) Public Use Microdata Sample (PUMS)]( https://www.census.gov/programs-surveys/acs/microdata.html)
- [Tiger/Line Shapefiles](https://www.census.gov/geographies/mapping-files/time-series/geo/tiger-line-file.html)

If you are curious, you can find the actual data sets at the following locations: 

- [ACS Summary File](https://www2.census.gov/programs-surveys/acs/summary_file/)
- [PUMS](https://www2.census.gov/programs-surveys/acs/data/pums/)
- [Tiger](https://www2.census.gov/geo/tiger/)

For this assignment, we use the following naming convention for external data sets:

```
/data/external/<source>/<data-set>/<year>/
```
where *source* is the organization providing the data, *data-set* is the specific data set, and *year* is the year. 

```
data
├── external
│   ├── census
│   │   ├── acs-summaryfile
│   │   │   ├── 2015
│   │   │   ├── 2016
│   │   │   ...
│   │   │   ...
│   │   │   └── 2019
│   │   ├── pums
│   │   │   ├── 2015
│   │   │   ├── 2016
│   │   │   ...
│   │   │   ...
│   │   │   └── 2020
│   │   └── tiger
│   │       ├── 2015
│   │       ├── 2016
│   │       ...
│   │       ...
│   │       └── 2020
│   └── nwc-wpc
├── interim
├── processed
└── raw
```
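The `<source>/<data-set>/<year>` convention above could be wrapped in a small helper. This is a sketch only: the function name `external_year_dirs` and its inclusive `last_year` argument are assumptions, and the year range is chosen to match the example tree.

```python
from pathlib import PurePosixPath

EXTERNAL_ROOT = PurePosixPath("/data/external")


def external_year_dirs(source, data_set, first_year, last_year):
    """Return yearly directories for a data set, both end years inclusive."""
    base = EXTERNAL_ROOT.joinpath(source, data_set)
    # range() excludes its stop value, so add 1 to include last_year.
    return [base / str(year) for year in range(first_year, last_year + 1)]


dirs = external_year_dirs("census", "pums", 2015, 2020)
print(dirs[0])    # /data/external/census/pums/2015
print(dirs[-1])   # /data/external/census/pums/2020
print(len(dirs))  # 6
```

Whether the final year should be included depends on when the provider publishes it, so the inclusive upper bound is a design choice rather than a requirement.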

Create and add the paths for these data sets. Verify that you have added the paths correctly.

## <font color=orange>**TODO**</font>

In [35]:
acs_summary_file_dirs = set()

# TODO: Create and add the paths for this data set
acs_summary_file_dir = external_data_dir.joinpath('census','acs-summaryfile')
for year in range(2015, current_year):
    acs_summary_file_dirs.add(acs_summary_file_dir.joinpath(str(year)))

# Should output sorted directories from 2015 to present 
sorted(list(acs_summary_file_dirs))

[PurePosixPath('/data/external/census/acs-summaryfile/2015'),
 PurePosixPath('/data/external/census/acs-summaryfile/2016'),
 PurePosixPath('/data/external/census/acs-summaryfile/2017'),
 PurePosixPath('/data/external/census/acs-summaryfile/2018'),
 PurePosixPath('/data/external/census/acs-summaryfile/2019'),
 PurePosixPath('/data/external/census/acs-summaryfile/2020'),
 PurePosixPath('/data/external/census/acs-summaryfile/2021'),
 PurePosixPath('/data/external/census/acs-summaryfile/2022')]

#### Assignment 5.1.b

## <font color=orange>**TODO**</font>

In [36]:
pums_dirs = set()

# TODO: Create and add the paths for this data set
pums_dir = external_data_dir.joinpath('census','pums')
for year in range(2015, current_year):
    pums_dirs.add(pums_dir.joinpath(str(year)))
# Should output sorted directories from 2015 to present 
sorted(list(pums_dirs)) 

[PurePosixPath('/data/external/census/pums/2015'),
 PurePosixPath('/data/external/census/pums/2016'),
 PurePosixPath('/data/external/census/pums/2017'),
 PurePosixPath('/data/external/census/pums/2018'),
 PurePosixPath('/data/external/census/pums/2019'),
 PurePosixPath('/data/external/census/pums/2020'),
 PurePosixPath('/data/external/census/pums/2021'),
 PurePosixPath('/data/external/census/pums/2022')]

#### Assignment 5.1.c

## <font color=orange>**TODO**</font>

In [37]:
tiger_dirs = set()

# TODO: Create and add the paths for this data set
tiger_dir = external_data_dir.joinpath('census','tiger')
for year in range(2015, current_year):
    tiger_dirs.add(tiger_dir.joinpath(str(year)))
# Should output sorted directories from 2015 to present 
sorted(list(tiger_dirs))

[PurePosixPath('/data/external/census/tiger/2015'),
 PurePosixPath('/data/external/census/tiger/2016'),
 PurePosixPath('/data/external/census/tiger/2017'),
 PurePosixPath('/data/external/census/tiger/2018'),
 PurePosixPath('/data/external/census/tiger/2019'),
 PurePosixPath('/data/external/census/tiger/2020'),
 PurePosixPath('/data/external/census/tiger/2021'),
 PurePosixPath('/data/external/census/tiger/2022')]

#### Assignment 5.1.d

Finally, you will create directories for a daily data set based on the [National Weather Service's (NWS) Weather Prediction Center's (WPC) daily forecasts](https://www.wpc.ncep.noaa.gov/kml/kmlproducts.php). 

For this part, we use the following naming convention:

```
/data/external/nwc-wpc/forecasts/<year>/<month>/<day>/
```
where *year* is the year, *month* is the two-digit month, and *day* is the two-digit day. We use this convention when working with date-based data as the directories are naturally in date order. 

```
data
├── external
│   ├── census
│   └── nwc-wpc
│       └── forecasts
│           └── 2020
│               ├── 01
│               │   ├── 01
│               │   ├── 02
│               │   ├── 03
│               │   ...
│               │   ...
│               │   ├── 30
│               │   └── 31
│               ├── 02
│               │   ├── 01
│               │   ├── 02
│               │   ...
│               │   ...
│               │   ├── 28
│               │   └── 29
│               ├── 03
│               ...
│               ...
│               ├── 11
│               └── 12
│                   ├── 01
│                   ├── 02
│                   ...
│                   ...
│                   ├── 29
│                   ├── 30
│                   └── 31
├── interim
├── processed
└── raw
```
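The claim that two-digit months and days keep directories in date order can be checked with a short sketch. The month values below are chosen only to illustrate the difference between padded and unpadded names.

```python
months = [1, 2, 10, 11]

# Zero-padded names sort lexicographically in calendar order.
padded = sorted(f"{month:02d}" for month in months)

# Unpadded names sort as text, which puts "10" and "11" before "2".
unpadded = sorted(str(month) for month in months)

print(padded)    # ['01', '02', '10', '11']
print(unpadded)  # ['1', '10', '11', '2']
```

Most tools that list directories or object keys return names in lexicographic order, which is why the padded form is the safer choice.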

Create and add the paths for these data sets. Verify that you have added the paths correctly.

## <font color=orange>**TODO**</font>

In [38]:
forecast_dirs = set()
# TODO: Create and add the paths for this data set
forecast_dir = external_data_dir.joinpath('nwc-wpc','forecasts')
start_day = datetime.date(2022, 1, 1)
end_day = datetime.date(2023, 1, 1)
current_day = start_day
while current_day < end_day:
    forecast_dirs.add(forecast_dir.joinpath(f"{current_day.year:04d}", f"{current_day.month:02d}",f"{current_day.day:02d}"))
    current_day += datetime.timedelta(days=1)
forecast_dirs=sorted(list(forecast_dirs))
# Should have 365 directories (366 if leap year)
forecast_dirs[:10],len(forecast_dirs)


([PurePosixPath('/data/external/nwc-wpc/forecasts/2022/01/01'),
  PurePosixPath('/data/external/nwc-wpc/forecasts/2022/01/02'),
  PurePosixPath('/data/external/nwc-wpc/forecasts/2022/01/03'),
  PurePosixPath('/data/external/nwc-wpc/forecasts/2022/01/04'),
  PurePosixPath('/data/external/nwc-wpc/forecasts/2022/01/05'),
  PurePosixPath('/data/external/nwc-wpc/forecasts/2022/01/06'),
  PurePosixPath('/data/external/nwc-wpc/forecasts/2022/01/07'),
  PurePosixPath('/data/external/nwc-wpc/forecasts/2022/01/08'),
  PurePosixPath('/data/external/nwc-wpc/forecasts/2022/01/09'),
  PurePosixPath('/data/external/nwc-wpc/forecasts/2022/01/10')],
 365)

### Assignment 5.2

In the second part of the assignment, you will create the structure for the raw source data. We will use the following directory naming convention. 

```
/data/raw/inventory/<location>/<year>/<month>/<day>/
/data/raw/expenses/<location>/<year>/<month>/<day>/
/data/raw/sales/<location>/<year>/<month>/<day>/<hour>/
```
For *location*, we will use the three-letter IATA code for the airport nearest to the location.  We will use the same year, month, and day convention from the previous example. For *hour*, we will use the two-digit hour value based on a 24-hour clock set to UTC. 
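A sketch of how the daily convention might be generated for every store follows. The mapping from store city to IATA code and the helper name `daily_dirs` are assumptions for illustration; the codes themselves are the ones used in the directory examples in this assignment.

```python
import datetime
from pathlib import PurePosixPath

# Nearest-airport codes for the five stores (illustrative mapping).
STORE_CODES = {
    "Baltimore, Maryland": "bwi",
    "Columbus, Ohio": "cmh",
    "Denver, Colorado": "den",
    "Bellevue, Nebraska": "oma",
    "San Francisco, California": "sfo",
}


def daily_dirs(data_set: str, year: int) -> list:
    """Return one directory per store per day of the given year."""
    root = PurePosixPath("/data/raw") / data_set
    dirs = []
    day = datetime.date(year, 1, 1)
    while day.year == year:
        for code in sorted(STORE_CODES.values()):
            dirs.append(
                root.joinpath(code, f"{day.year:04d}", f"{day.month:02d}", f"{day.day:02d}")
            )
        day += datetime.timedelta(days=1)
    return dirs


inventory = daily_dirs("inventory", 2020)
print(inventory[0])    # /data/raw/inventory/bwi/2020/01/01
print(len(inventory))  # 1830 (366 days x 5 stores in a leap year)
```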

#### Assignment 5.2.a

The following is an example of the directory structure for daily data dumps. 

```
data
├── external
├── interim
├── processed
└── raw
    ├── expenses
    ├── inventory
    │   ├── bwi
    │   ├── cmh
    │   ├── den
    │   ├── oma
    │   │   └── 2020
    │   │       ├── 01
    │   │       │   ├── 01
    │   │       │   ├── 02
    │   │       │   ...    
    │   │       │   └── 31
    │   │       ├── 02
    │   │       │   ├── 01
    │   │       │   ...
    │   │       │   └── 29
    │   │       ├── 03
    │   │       ... 
    │   │       ├── 11
    │   │       └── 12
    │   │           ├── 01
    │   │           ├── 02
    │   │           ...  
    │   │           └── 31
    │   └── sfo
    └── sales
```

Create and add the paths for these data sets. Verify that you have added the paths correctly.

## <font color=orange>**TODO**</font>

In [39]:
# /data/raw/inventory/<location>/<year>/<month>/<day>/

inventory_dirs = set()
# TODO: Create and add the paths for this data set
inventory_dir = raw_data_dir.joinpath('inventory')
inventory_loc_dir= [inventory_dir.joinpath('bwi'),
                inventory_dir.joinpath('cmh'),
                inventory_dir.joinpath('den'),
                inventory_dir.joinpath('oma'),
                inventory_dir.joinpath('sfo')
                ]
start_day = datetime.date(2022, 1, 1)
end_day = datetime.date(2023, 1, 1)
current_day = start_day

while current_day < end_day:
    for directory in inventory_loc_dir:
        inventory_dirs.add(directory.joinpath(f"{current_day.year:04d}", f"{current_day.month:02d}",f"{current_day.day:02d}"))
    current_day += datetime.timedelta(days=1)
inventory_dirs=sorted(list(inventory_dirs))
# Should have 1825 directories (1830 if leap year)
inventory_dirs[:10],len(inventory_dirs)

([PurePosixPath('/data/raw/inventory/bwi/2022/01/01'),
  PurePosixPath('/data/raw/inventory/bwi/2022/01/02'),
  PurePosixPath('/data/raw/inventory/bwi/2022/01/03'),
  PurePosixPath('/data/raw/inventory/bwi/2022/01/04'),
  PurePosixPath('/data/raw/inventory/bwi/2022/01/05'),
  PurePosixPath('/data/raw/inventory/bwi/2022/01/06'),
  PurePosixPath('/data/raw/inventory/bwi/2022/01/07'),
  PurePosixPath('/data/raw/inventory/bwi/2022/01/08'),
  PurePosixPath('/data/raw/inventory/bwi/2022/01/09'),
  PurePosixPath('/data/raw/inventory/bwi/2022/01/10')],
 1825)

## <font color=orange>**TODO**</font>

In [40]:
expenses_dirs = set()
# /data/raw/expenses/<location>/<year>/<month>/<day>/

# TODO: Create and add the paths for this data set
expenses_dir = raw_data_dir.joinpath('expenses')
expenses_loc_dir= [expenses_dir.joinpath('bwi'),
                expenses_dir.joinpath('cmh'),
                expenses_dir.joinpath('den'),
                expenses_dir.joinpath('oma'),
                expenses_dir.joinpath('sfo')
                ]
start_day = datetime.date(2022, 1, 1)
end_day = datetime.date(2023, 1, 1)
current_day = start_day

while current_day < end_day:
    for directory in expenses_loc_dir:
        expenses_dirs.add(directory.joinpath(f"{current_day.year:04d}", f"{current_day.month:02d}",f"{current_day.day:02d}"))
    current_day += datetime.timedelta(days=1)
expenses_dirs=sorted(list(expenses_dirs))
# Should have 1825 directories (1830 if leap year)
expenses_dirs[:10],len(expenses_dirs)

([PurePosixPath('/data/raw/expenses/bwi/2022/01/01'),
  PurePosixPath('/data/raw/expenses/bwi/2022/01/02'),
  PurePosixPath('/data/raw/expenses/bwi/2022/01/03'),
  PurePosixPath('/data/raw/expenses/bwi/2022/01/04'),
  PurePosixPath('/data/raw/expenses/bwi/2022/01/05'),
  PurePosixPath('/data/raw/expenses/bwi/2022/01/06'),
  PurePosixPath('/data/raw/expenses/bwi/2022/01/07'),
  PurePosixPath('/data/raw/expenses/bwi/2022/01/08'),
  PurePosixPath('/data/raw/expenses/bwi/2022/01/09'),
  PurePosixPath('/data/raw/expenses/bwi/2022/01/10')],
 1825)

#### Assignment 5.2.b

Finally, create the paths for the hourly sales data. The following is an example of the directory structure for the sales data. 

```
├── external
├── interim
├── processed
└── raw
    ├── expenses
    ├── inventory
    └── sales
        ├── bwi
        ├── cmh
        ├── den
        ├── oma
        │   └── 2020
        │       ├── 01
        │       │   └── 01
        │       │       ├── 00
        │       │       ├── 01   
        │       │       ├── 02
        │       │       ...     
        │       │       ├── 22
        │       │       └── 23
        │       ├── 02
        │       ...
        │       └── 12
        └── sfo
```
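Because the hour directories are based on UTC, a sale recorded in local time may land in a different day's directory. The sketch below assumes a hypothetical helper `sales_dir` and a fixed UTC-6 offset for the example timestamp; a real system would need proper time zone rules, including daylight saving time.

```python
import datetime
from pathlib import PurePosixPath

SALES_ROOT = PurePosixPath("/data/raw/sales")


def sales_dir(location: str, sold_at: datetime.datetime) -> PurePosixPath:
    """Return the hourly sales directory for a timezone-aware timestamp."""
    if sold_at.tzinfo is None:
        raise ValueError("sold_at must be timezone-aware")
    # Convert to UTC before extracting the path components.
    utc = sold_at.astimezone(datetime.timezone.utc)
    return SALES_ROOT.joinpath(
        location,
        f"{utc.year:04d}",
        f"{utc.month:02d}",
        f"{utc.day:02d}",
        f"{utc.hour:02d}",
    )


# Hypothetical sale at 18:30 on January 1, 2020, at a fixed UTC-6 offset.
local = datetime.timezone(datetime.timedelta(hours=-6))
sale_time = datetime.datetime(2020, 1, 1, 18, 30, tzinfo=local)
print(sales_dir("oma", sale_time))  # /data/raw/sales/oma/2020/01/02/00
```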

## <font color=orange>**TODO**</font>

In [41]:
sales_dirs = set()
# /data/raw/sales/<location>/<year>/<month>/<day>/<hour>/
# TODO: Create and add the paths for this data set
sales_dir = raw_data_dir.joinpath('sales')
sales_loc_dir= [sales_dir.joinpath('bwi'),
                sales_dir.joinpath('cmh'),
                sales_dir.joinpath('den'),
                sales_dir.joinpath('oma'),
                sales_dir.joinpath('sfo')
                ]
start_day = datetime.datetime(2022, 1, 1)  # uses datetime.datetime to use hours
end_day = datetime.datetime(2023, 1, 1)
current_day = start_day


while current_day < end_day:
    for directory in sales_loc_dir:
        sales_dirs.add(directory.joinpath(f"{current_day.year:04d}", f"{current_day.month:02d}",f"{current_day.day:02d}",f"{current_day.hour:02d}"))
    current_day += datetime.timedelta(hours=1)
sales_dirs=sorted(list(sales_dirs))
# Should have 43,800 directories (43,920 if leap year)
sales_dirs[:10],len(sales_dirs)

([PurePosixPath('/data/raw/sales/bwi/2022/01/01/00'),
  PurePosixPath('/data/raw/sales/bwi/2022/01/01/01'),
  PurePosixPath('/data/raw/sales/bwi/2022/01/01/02'),
  PurePosixPath('/data/raw/sales/bwi/2022/01/01/03'),
  PurePosixPath('/data/raw/sales/bwi/2022/01/01/04'),
  PurePosixPath('/data/raw/sales/bwi/2022/01/01/05'),
  PurePosixPath('/data/raw/sales/bwi/2022/01/01/06'),
  PurePosixPath('/data/raw/sales/bwi/2022/01/01/07'),
  PurePosixPath('/data/raw/sales/bwi/2022/01/01/08'),
  PurePosixPath('/data/raw/sales/bwi/2022/01/01/09')],
 43800)

### Assignment 5.3

#### Assignment 5.3.a

We have two choices for structuring our weekly data set. We can use the following naming convention where the date is based on the first day of the week. 

```
/data/processed/modeling/<year>/<month>/<day>/
```

Alternatively, we could use a naming convention where *week* is the number of weeks it has been since the beginning of the year. 
 
```
/data/processed/modeling/<year>/<week>/
```

We will use the first option for our naming convention. Python's *calendar* library has a function that determines the first day of the week.
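To compare the two conventions for a single date, the sketch below computes both directory names. Using the ISO week number for the second option is an assumption; the text above only says "the number of weeks since the beginning of the year", which could be defined in other ways.

```python
import calendar
import datetime
from pathlib import PurePosixPath

modeling_root = PurePosixPath("/data/processed/modeling")
day = datetime.date(2022, 7, 2)  # a Saturday

# Option 1: directory named after the first day of the week.
# calendar.firstweekday() is 0 (Monday) unless changed with setfirstweekday().
offset = (day.weekday() - calendar.firstweekday()) % 7
week_start = day - datetime.timedelta(days=offset)
by_date = modeling_root.joinpath(
    f"{week_start.year:04d}", f"{week_start.month:02d}", f"{week_start.day:02d}"
)

# Option 2: directory named after the ISO week number (assumed definition).
iso_year, iso_week, _ = day.isocalendar()
by_week = modeling_root.joinpath(f"{iso_year:04d}", f"{iso_week:02d}")

print(by_date)  # /data/processed/modeling/2022/06/27
print(by_week)  # /data/processed/modeling/2022/26
```

The date-based form reuses the same `<year>/<month>/<day>` layout as the daily data sets, which keeps the conventions consistent across the data lake.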

## <font color=orange>**TODO**</font>

In [42]:
# /data/processed/modeling/<year>/<month>/<day>/
modeling_data_dirs = set()
modeling_data_dir = processed_data_dir.joinpath('modeling')

mod_year = 2022
cal = calendar.Calendar()

for month in range(1, 13):
    for week in cal.monthdatescalendar(mod_year, month):
        first_day = week[0]
        if first_day.year == mod_year:  # keep only weeks that start in 2022
            modeling_data_dirs.add(modeling_data_dir.joinpath(f"{first_day.year:04d}", f"{first_day.month:02d}", f"{first_day.day:02d}"))
modeling_data_dirs = sorted(list(modeling_data_dirs))
# Should have 52 directories
modeling_data_dirs[:10], len(modeling_data_dirs)

([PurePosixPath('/data/processed/modeling/2022/01/03'),
  PurePosixPath('/data/processed/modeling/2022/01/10'),
  PurePosixPath('/data/processed/modeling/2022/01/17'),
  PurePosixPath('/data/processed/modeling/2022/01/24'),
  PurePosixPath('/data/processed/modeling/2022/01/31'),
  PurePosixPath('/data/processed/modeling/2022/02/07'),
  PurePosixPath('/data/processed/modeling/2022/02/14'),
  PurePosixPath('/data/processed/modeling/2022/02/21'),
  PurePosixPath('/data/processed/modeling/2022/02/28'),
  PurePosixPath('/data/processed/modeling/2022/03/07')],
 52)

#### Assignment 5.3.b

Next, create the monthly inventory requests using the following convention. 

```
/data/processed/inventory/requests/<year>/<month>/
```

## <font color=orange>**TODO**</font>

In [43]:
# /data/processed/inventory/requests/<year>/<month>/
inventory_request_dirs = set()
inventory_request_dir = processed_data_dir.joinpath('inventory','requests')
# TODO: Create and add the paths for this data set
for month in range(1, 13):
    inventory_request_dirs.add(inventory_request_dir.joinpath(f"{2022:04d}", f"{month:02d}"))
# Should output 12 directories
inventory_request_dirs, len(inventory_request_dirs)

({PurePosixPath('/data/processed/inventory/requests/2022/01'),
  PurePosixPath('/data/processed/inventory/requests/2022/02'),
  PurePosixPath('/data/processed/inventory/requests/2022/03'),
  PurePosixPath('/data/processed/inventory/requests/2022/04'),
  PurePosixPath('/data/processed/inventory/requests/2022/05'),
  PurePosixPath('/data/processed/inventory/requests/2022/06'),
  PurePosixPath('/data/processed/inventory/requests/2022/07'),
  PurePosixPath('/data/processed/inventory/requests/2022/08'),
  PurePosixPath('/data/processed/inventory/requests/2022/09'),
  PurePosixPath('/data/processed/inventory/requests/2022/10'),
  PurePosixPath('/data/processed/inventory/requests/2022/11'),
  PurePosixPath('/data/processed/inventory/requests/2022/12')},
 12)

#### Assignment 5.3.c

Finally, create the quarterly financial reports using the following convention. 

```
/data/processed/financials/quarterly/<year>/<quarter>/
```
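While it does not matter for this assignment, financial quarters typically align with the calendar year: Q1 is January through March, Q2 is April through June, Q3 is July through September, and Q4 is October through December.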
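As a sketch, the start and end dates of each quarter can be computed with `calendar.monthrange`. This assumes quarters follow the calendar year; a company with a different fiscal year would need different boundaries. The helper name `quarter_bounds` is illustrative.

```python
import calendar
import datetime


def quarter_bounds(year: int, quarter: int):
    """Return the first and last dates of a calendar-year quarter (1 to 4)."""
    first_month = 3 * (quarter - 1) + 1
    last_month = first_month + 2
    # monthrange returns (weekday of the 1st, number of days in the month).
    last_day = calendar.monthrange(year, last_month)[1]
    return (
        datetime.date(year, first_month, 1),
        datetime.date(year, last_month, last_day),
    )


for quarter in range(1, 5):
    start, end = quarter_bounds(2022, quarter)
    print(f"Q{quarter}: {start} to {end}")
```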

## <font color=orange>**TODO**</font>

In [44]:
# /data/processed/financials/quarterly/<year>/<quarter>/
financials_dirs = set()
financials_dir = processed_data_dir.joinpath('financials','quarterly')
# TODO: Create and add the paths for this data set
for quarter in range(1, 5):
    financials_dirs.add(financials_dir.joinpath(f"{2022:04d}", f"{quarter:02d}"))
# Should output four quarterly directories
financials_dirs, len(financials_dirs)

({PurePosixPath('/data/processed/financials/quarterly/2022/01'),
  PurePosixPath('/data/processed/financials/quarterly/2022/02'),
  PurePosixPath('/data/processed/financials/quarterly/2022/03'),
  PurePosixPath('/data/processed/financials/quarterly/2022/04')},
 4)