## Rubric

Instructions: DELETE this cell before you submit via a `git push` to your repo before the deadline. This cell is for your reference only and is not needed in your report. 

Scoring: Out of 10 points

- Each Developing  => -2 pts
- Each Unsatisfactory/Missing => -4 pts
  - until the score is 

If students address the detailed feedback in a future checkpoint they will earn these points back


|                  | Unsatisfactory                                                                                                                                                                                                    | Developing                                                                                                                                                                                              | Proficient                                     | Excellent                                                                                                                              |
|------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------|
| Data relevance   | Did not have data relevant to their question. Or the datasets don't work together because there is no way to line them up against each other. If there are multiple datasets, most of them have this trouble | Data was only tangentially relevant to the question or a bad proxy for the question. If there are multiple datasets, some of them may be irrelevant or can't be easily combined.                       | All data sources are relevant to the question. | Multiple data sources for each aspect of the project. It's clear how the data supports the needs of the project.                         |
| Data description | Dataset or its cleaning procedures are not described. If there are multiple datasets, most have this trouble                                                                                              | Data was not fully described. If there are multiple datasets, some of them are not fully described                                                                                                      | Data was fully described                       | The details of the data descriptions and perhaps some very basic EDA also make it clear how the data supports the needs of the project. |
| Data wrangling   | Did not obtain data. They did not clean/tidy the data they obtained.  If there are multiple datasets, most have this trouble                                                                                 | Data was partially cleaned or tidied. Perhaps you struggled to verify that the data was clean because they did not present it well. If there are multiple datasets, some have this trouble | The data is cleaned and tidied.                | The data is spotless and they used tools to visualize the data cleanliness and you were convinced at first glance                      |


# COGS 108 - Data Checkpoint

## Authors

Instructions: REPLACE the contents of this cell with your team list and their contributions. Note that this will change over the course of the checkpoints

This is a modified [CRediT taxonomy of contributions](https://credit.niso.org). For each group member please list how they contributed to this project using these terms:
> Analysis, Background research, Conceptualization, Data curation, Experimental investigation, Methodology, Project administration, Software, Visualization, Writing – original draft, Writing – review & editing

Example team list and credits:
- Alice Anderson: Conceptualization, Data curation, Methodology, Writing - original draft
- Bob Barker:  Analysis, Software, Visualization
- Charlie Chang: Project administration, Software, Writing - review & editing
- Dani Delgado: Analysis, Background research, Visualization, Writing - original draft

## Research Question

Instructions: REPLACE the contents of this cell with your work, including any updates to recover points lost in your proposal feedback



## Background and Prior Work

Instructions: REPLACE the contents of this cell with your work, including any updates to recover points lost in your proposal feedback

## Hypothesis


Instructions: REPLACE the contents of this cell with your work, including any updates to recover points lost in your proposal feedback


## Data

In [16]:
# import dependencies
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

### Data overview

Instructions: REPLACE the contents of this cell with descriptions of your actual datasets.

For each dataset include the following information
- Dataset #1
  - Dataset Name:
  - Link to the dataset:
  - Number of observations:
  - Number of variables:
  - Description of the variables most relevant to this project
  - Descriptions of any shortcomings this dataset has with respect to the project
- Dataset #2 (if you have more than one!)
  - same as above
- etc

Each dataset deserves either a set of bullet points as above or a few sentences if you prefer that method.

If you plan to use multiple datasets, add a few sentences about how you plan to combine these datasets.
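
One way to combine county-level datasets is to reshape each into one row per county and year, then join on a shared key. The sketch below uses small made-up frames; the column names `fips`, `year`, `per_capita_income`, and `unemployment_rate` are hypothetical placeholders, not names taken from the actual files.

```python
import pandas as pd

# Toy stand-ins for two county-level tables (values are made up for illustration).
income = pd.DataFrame({
    "fips": ["06001", "06001", "06003"],
    "year": [2010, 2011, 2010],
    "per_capita_income": [50000.0, 52000.0, 40000.0],
})
unemployment = pd.DataFrame({
    "fips": ["06001", "06001", "06005"],
    "year": [2010, 2011, 2010],
    "unemployment_rate": [11.0, 10.2, 13.5],
})

# Outer join with an indicator column so unmatched county-years stay visible.
combined = income.merge(
    unemployment, on=["fips", "year"], how="outer", indicator=True
)

# Rows that appear in only one table point to keys that fail to line up.
unmatched = combined[combined["_merge"] != "both"]
print(combined)
print(f"Unmatched county-years: {len(unmatched)}")
```

An outer join with `indicator=True` is used here instead of an inner join so that county-years missing from either table are reported and not silently dropped.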

In [17]:
# Run this code every time when you're actively developing modules in .py files.  It's not needed if you aren't making modules
#
## this code is necessary for making sure that any modules we load are updated here 
## when their source code .py files are modified

%load_ext autoreload
%autoreload 2

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


In [18]:
# Setup code -- this only needs to be run once after cloning the repo!
# this code downloads the data from its source to the `data/00-raw/` directory
# if the data hasn't updated you don't need to do this again!

# if you don't already have these packages (you should!) uncomment this line
# %pip install requests tqdm

import sys
sys.path.append('./modules') # this tells python where to look for modules to import

import get_data # this is where we get the function we need to download data

# replace the urls and filenames in this list with your actual datafiles
# yes you can use Google drive share links or whatever
# format is a list of dictionaries; 
# each dict has keys of 
#   'url' where the resource is located
#   'filename' for the local filename where it will be stored 
datafiles = [
    { 'url': 'https://raw.githubusercontent.com/COGS108/Group078_WI26/refs/heads/master/data/00-raw/BEA_econ_dataset.csv', 'filename':'BEA_econ_dataset.csv'},
    { 'url': 'https://raw.githubusercontent.com/COGS108/Group078_WI26/refs/heads/master/data/00-raw/labour_stats_season_adjusted_dataset.csv', 'filename': 'labour_stats_season_adjusted_dataset.csv'},
    { 'url': 'https://raw.githubusercontent.com/COGS108/Group078_WI26/refs/heads/master/data/00-raw/BLS_county_level_CA.csv', 'filename': 'BLS_county_level_CA.csv'}
]

get_data.get_raw(datafiles,destination_directory='data/00-raw/')

Overall Download Progress:  33%|███▎      | 1/3 [00:00<00:00,  5.40it/s]

Successfully downloaded: BEA_econ_dataset.csv


Overall Download Progress: 100%|██████████| 3/3 [00:00<00:00,  3.39it/s]

Successfully downloaded: labour_stats_season_adjusted_dataset.csv
Error downloading BLS_county_level_CA.csv: 404 Client Error: Not Found for url: https://raw.githubusercontent.com/COGS108/Group078_WI26/refs/heads/master/data/00-raw/BLS_county_level_CA.csv





### Dataset #1 

Instructions: 
1. Change the header from Dataset #1 to something more descriptive of the dataset
2. Write a few paragraphs about this dataset. Make sure to cover
   1. Describe the important metrics, what units they are in, and give some sense of what they mean.  For example "Fasting blood glucose in units of mg glucose per deciliter of blood.  Normal values for healthy individuals range from 70 to 100 mg/dL.  Values 100-125 are prediabetic and values >125mg/dL indicate diabetes. Values <70 indicate hypoglycemia. Fasting indicates the patient hasn't eaten in the last 8 hours.  If blood glucose is >250 or <50 at any time (regardless of the time of last meal) the patient's life may be in immediate danger"
   2. If there are any major concerns with the dataset, describe them. For example "Dataset is composed of people who are serious enough about eating healthy that they voluntarily downloaded an app dedicated to tracking their eating patterns. This sample is likely biased because of that self-selection. These people own smartphones and may be healthier and may have more disposable income than the average person.  Those who voluntarily log conscientiously and for long amounts of time are also likely even more interested in health than those who download the app and only log a bit before getting tired of it"
3. Use the cell below to 
    1. load the dataset 
    2. make the dataset tidy or demonstrate that it was already tidy
    3. demonstrate the size of the dataset
    4. find out how much data is missing, where it's missing, and if it's missing at random or seems to have any systematic relationships in its missingness
    5. find and flag any outliers or suspicious entries
    6. clean the data or demonstrate that it was already clean.  You may choose how to deal with missingness (dropna or fillna... how='any' or 'all') and you should justify your choice in some way
    7. You will load raw data from `data/00-raw/`, you will (optionally) write intermediate stages of your work to `data/01-interim` and you will write the final fully wrangled version of your data to `data/02-processed`
4. Optionally you can also show some summary statistics for variables that you think are important to the project
5. Feel free to add more cells here if that's helpful for you
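
The checklist in step 3 could be sketched roughly as follows. This is a minimal illustration on a made-up frame, not a prescribed solution: the column names `county` and `value` are hypothetical, and the 1.5 × IQR rule is only one common convention for flagging outliers.

```python
import numpy as np
import pandas as pd

# Toy data standing in for a raw file; a real notebook would read from data/00-raw/.
raw = pd.DataFrame({
    "county": ["A", "B", "C", "D", "E", "F"],
    "value": [10.0, 12.0, np.nan, 11.0, 13.0, 500.0],
})

# Size of the dataset as (rows, columns).
print("Shape:", raw.shape)

# How much data is missing, and in which columns.
missing_per_column = raw.isna().sum()
print(missing_per_column)

# Flag suspicious entries with the 1.5 * IQR rule (one convention among several).
q1, q3 = raw["value"].quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (raw["value"] < q1 - 1.5 * iqr) | (raw["value"] > q3 + 1.5 * iqr)
raw["outlier_flag"] = is_outlier

# Drop rows missing the key measurement; flagged rows are kept so that any
# later decision to exclude them is explicit.
clean = raw.dropna(subset=["value"]).reset_index(drop=True)
print(clean)
```

Flagging outliers in a separate column, instead of deleting them immediately, keeps the raw values available for the justification the instructions ask for. The final frame could then be written to `data/02-processed/` with `DataFrame.to_csv`.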


In [None]:
## YOUR CODE TO LOAD/CLEAN/TIDY/WRANGLE THE DATA GOES HERE


### Dataset #2 U.S. Bureau of Economic Analysis Regional GDP and Personal Income Dataset

#### Dataset Description:
This dataset was obtained directly from the U.S. Bureau of Economic Analysis. It provides gross personal income, population, and per capita personal income for all 58 California counties, across 15 years from 2010 to 2014.<br>
On its own, the dataset is a good baseline for many of the downstream analyses we will run.
#### Dataset Limitations:
1. This dataset only includes county-level income data and has relatively low temporal resolution, i.e. only one measurement per county is taken per year, instead of tracking the change in per capita income each month.

2. In some areas, particularly rural parts of California, the informal economy (where people trade without documenting the transaction or paying tax) may be large. We therefore have to keep in mind that this dataset may not fully reflect the per capita income landscape across California counties.
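
Because the BEA table stores one column per year, a long layout with one row per county and year may be easier to plot and to merge with other datasets. Below is a minimal sketch using `melt` and `pivot_table` on a toy frame that mimics the `GeoName`, `Description`, and year columns used in the following cells. The values are made up, and the output column names `Year` and `Value` are assumptions.

```python
import pandas as pd

# Toy wide table mimicking the layout: one row per county and metric,
# one column per year (values are made up).
wide = pd.DataFrame({
    "GeoName": ["Alameda, CA", "Alameda, CA"],
    "Description": ["Personal income (dollars)", "Population (persons) 1"],
    "2010": [1000.0, 10.0],
    "2011": [1100.0, 11.0],
})

year_columns = [col for col in wide.columns if col.isdigit()]

# Unpivot the year columns into (Year, Value) pairs.
long = wide.melt(
    id_vars=["GeoName", "Description"],
    value_vars=year_columns,
    var_name="Year",
    value_name="Value",
)
long["Year"] = long["Year"].astype(int)

# Pivot the metrics into columns so each row is one county-year observation.
tidy = long.pivot_table(
    index=["GeoName", "Year"], columns="Description", values="Value"
).reset_index()

# Sanity check: personal income divided by population gives a per capita figure.
tidy["per_capita_check"] = (
    tidy["Personal income (dollars)"] / tidy["Population (persons) 1"]
)
print(tidy)
```

In the real data, the `per_capita_check` column could be compared against the reported per capita personal income to confirm that the unit conversion was applied consistently.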


In [None]:
# Importing Dataset 2
# Skip the first 3 rows which are metadata/legends
df_CA_personal_income = pd.read_csv("data/00-raw/BEA_econ_dataset.csv", on_bad_lines='skip', skiprows=3)

# Quick Look at the Dataset
print("Quick Look at the U.S. Bureau of Economic Analysis Regional GDP and Personal Income Dataset")
print(df_CA_personal_income.head())
print(type(df_CA_personal_income))
print("\nDataFrame shape:", df_CA_personal_income.shape)
print("\nColumn names:")
print(df_CA_personal_income.columns.tolist())

In [None]:
# Making sure each row uses the same unit for measurement

# Converting rows with unit (thousands of dollars) into the unit (dollars)
# Filter rows where LineCode == 1.0 (Personal income rows)
personal_income_rows = df_CA_personal_income['LineCode'] == 1.0

# Multiply all year columns by 1000 for personal income rows
year_columns = [col for col in df_CA_personal_income.columns if col.isdigit()]
df_CA_personal_income.loc[personal_income_rows, year_columns] = df_CA_personal_income.loc[personal_income_rows, year_columns] * 1000

# Update the description column for personal income rows
df_CA_personal_income.loc[personal_income_rows, 'Description'] = 'Personal income (dollars)'

# Verify the changes
print("Updated DataFrame (showing personal income rows):")
print(df_CA_personal_income[df_CA_personal_income['LineCode'] == 1.0].head())


In [None]:
# Removing the rows with NaN from the dataset
# See where NaN values are
print(df_CA_personal_income.isnull().sum())
print("\nRows with any NaN:")
print(df_CA_personal_income.isnull().any(axis=1).sum())

# Remove rows with NaN
df_CA_personal_income = df_CA_personal_income.dropna()

In [None]:
# Columns and Rows in the dataset
print("====Properties of the dataset====")

print("====Years====")
print(*year_columns)
print(f"Number of years included in the Dataset: {len(year_columns)}")

print("====County in California====")
county_list_raw = df_CA_personal_income["GeoName"].unique()
# Keep only the GeoName entries that contain "CA"
county_list = [name for name in county_list_raw if "CA" in str(name)]
# print(df_CA_personal_income["GeoName"].unique())
print(county_list)
print(f"Number of counties included in the Dataset: {len(county_list)}")

print("====Economic Metrics Included====")
for metric in df_CA_personal_income["Description"].unique():
    print(metric)

In [None]:
# Description of the basic data statistics
# Breaking each variable into its own df
df_county_personal_income = df_CA_personal_income[df_CA_personal_income["Description"] == "Personal income (dollars)"]
df_county_population = df_CA_personal_income[df_CA_personal_income["Description"] == "Population (persons) 1"]
df_county_per_cap_income = df_CA_personal_income[df_CA_personal_income["Description"] == "Per capita personal income (dollars) 2"]

print(df_county_personal_income.describe())
print(df_county_population.describe())
print(df_county_per_cap_income.describe())

In [None]:
# Filter for per capita personal income rows
per_capita_rows = df_CA_personal_income['Description'].str.contains('Per capita personal income')
df_per_capita = df_CA_personal_income[per_capita_rows]

# Set up the years for plotting
years = [int(col) for col in year_columns]

plt.figure(figsize=(12, 8))

# Plot each county's per capita income as a line
for idx, row in df_per_capita.iterrows():
    county = row['GeoName']
    values = row[year_columns].values.astype(float)
    plt.plot(years, values, label=county)

plt.xlabel('Year')
plt.ylabel('Per Capita Personal Income (dollars)')
plt.title('County Per Capita Personal Income Over Time')
plt.legend(loc='upper left', bbox_to_anchor=(1, 1), fontsize='small', ncol=2)
plt.tight_layout()
plt.show()

### Dataset #3.1 Local Area Unemployment Statistics (LAUS), Seasonally Adjusted (provided by California Employment Development Department)

#### Dataset Description:
Metro Area or <br>
On its own, the dataset is a good baseline for many of the downstream analyses we will run.
#### Dataset Limitations:
1. This dataset only includes county-level income data and has relatively low temporal resolution, i.e. only one measurement per county is taken per year, instead of tracking the change in per capita income each month.

2. In some areas, particularly rural parts of California, the informal economy (where people trade without documenting the transaction or paying tax) may be large. We therefore have to keep in mind that this dataset may not fully reflect the per capita income landscape across California counties.


In [None]:
# Import the County level unemployment dataset
df_CA_unemployment = pd.read_csv("data/00-raw/labour_stats_season_adjusted_dataset.csv", on_bad_lines='skip')

print("====Quick Look at the Data Frame Imported====")
print(df_CA_unemployment.head())
print(f"Size of the raw unemployment dataset: {df_CA_unemployment.shape}")

In [None]:
# Inspecting NaN values before removing the empty columns
# See where NaN values are
print("Columns NaN counts")
print(df_CA_unemployment.isnull().sum())
print("\nRows with any NaN:")
# axis=1 checks across the columns of each row, so the sum counts rows
print(df_CA_unemployment.isnull().any(axis=1).sum())

In [None]:
# Removing columns 11 and 12 and rows with NaN
df_CA_unemployment_clean = df_CA_unemployment.drop(['Unnamed: 11', 'Unnamed: 12'], axis=1)
df_CA_unemployment_clean = df_CA_unemployment_clean.dropna(axis=0)
print(df_CA_unemployment_clean.columns)
print(df_CA_unemployment_clean)

### Dataset 3.2 US Bureau of Labor Statistics dataset

#### Dataset Description:
This dataset includes the following monthly labor statistics for all CA counties (2010-2025)
* Unemployment
* Unemployment Rate
* Employment
* Labor force
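
These four measures are related: the labor force is the sum of employed and unemployed persons, and the unemployment rate is the number of unemployed persons as a percentage of the labor force. The sketch below uses made-up numbers and hypothetical column names; a check like this could be run once the data is loaded to confirm the columns are internally consistent.

```python
import pandas as pd

# Made-up monthly figures for one county (not real data).
labor = pd.DataFrame({
    "employment": [900, 950],
    "unemployment": [100, 50],
})

# Labor force = employed + unemployed.
labor["labor_force"] = labor["employment"] + labor["unemployment"]

# Unemployment rate = unemployed / labor force, expressed as a percentage.
labor["unemployment_rate"] = 100 * labor["unemployment"] / labor["labor_force"]
print(labor)
```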

#### Dataset Limitations:
1.
2.




In [15]:
# Importing dataset
df_labor_data_CA = pd.read_csv("data/00-raw/BLS_county_CA.csv")
print(df_labor_data_CA.head())

FileNotFoundError: [Errno 2] No such file or directory: 'data/00-raw/BLS_county_CA.csv'

In [None]:
# 1. Grab the 3rd row (index 2) and set it as the column headers
df_labor_data_CA.columns = df_labor_data_CA.iloc[2]

# 2. Slice the DataFrame to drop rows 0, 1, and 2, keeping only the actual data
df_labor_data_CA = df_labor_data_CA.iloc[3:].reset_index(drop=True)

In [None]:
# Create a clean column for the FIPS code (grabbing the '06001' part)
df_labor_data_CA['FIPS'] = df_labor_data_CA['Local Area Unemployment Statistics'].str[5:10]

# Create a clean column to verify the metric (grabbing the '03' part)
df_labor_data_CA['Measure_Code'] = df_labor_data_CA['Local Area Unemployment Statistics'].str[-2:]

# Filter the dataframe to ONLY keep the Unemployment Rate rows
# df_labor_data_CA = df_labor_data_CA[df_labor_data_CA['Measure_Code'] == '03']
print(df_labor_data_CA.head())
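
The slicing in the cell above relies on series IDs with a fixed layout. Here is a small sketch of the same parsing on two example IDs written in the assumed layout (a 5-character prefix, a 5-digit county FIPS code, padding, then a 2-digit measure code). The example IDs and the measure-code mapping are assumptions to verify against the BLS documentation for the downloaded file.

```python
import pandas as pd

# Example IDs in the assumed layout: prefix (5 chars), county FIPS (5 chars),
# zero padding, measure code (2 chars).
series_ids = pd.Series(["LAUCN060010000000003", "LAUCN060370000000004"])

# Assumed mapping from measure code to metric name; verify before relying on it.
measure_names = {
    "03": "unemployment rate",
    "04": "unemployment",
    "05": "employment",
    "06": "labor force",
}

parsed = pd.DataFrame({
    "series_id": series_ids,
    "FIPS": series_ids.str[5:10],
    "Measure_Code": series_ids.str[-2:],
})
parsed["Measure"] = parsed["Measure_Code"].map(measure_names)

# Keep only the unemployment rate rows by filtering on the parsed code.
rate_rows = parsed[parsed["Measure_Code"] == "03"]
print(rate_rows)
```

Filtering on the parsed `Measure_Code` column, and not on the full series ID, is what lets a single comparison against `'03'` select the unemployment rate rows.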

## Ethics

Instructions: REPLACE the contents of this cell with your work, including any updates to recover points lost in your proposal feedback

## Team Expectations 

Instructions: REPLACE the contents of this cell with your work, including any updates to recover points lost in your proposal feedback


## Project Timeline Proposal

Instructions: Replace this with your timeline.  **PLEASE UPDATE your Timeline!** No battle plan survives contact with the enemy, so make sure we understand how your plans have changed.  Also if you have lost points on the previous checkpoint fix them