Step 1: Getting the data into Python and cleaning it.
- will need to write code to import and clean, then wrap it in a function.

Steps to clean data:

*Remaining balance:
- need to remove $ and ,
- convert to integer
pandas already did it

*location:
city names contain misspellings and stray characters
state names are not all abbreviated
some zip codes include the +4 extension
- start by assigning all blank values as missing
- pull only the first part of the zip into a new col
- if original zip col is not missing, look zip up in Zippopotam.us to return city and state
- if original zip is missing or blank, look up city and state
- if city and state are missing, return missing
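
A minimal sketch of the "first part of the zip" step, assuming the zip column arrives as text. The helper name `normalize_zip` and the sample values are hypothetical, not taken from the data.

```python
import pandas as pd

def normalize_zip(series: pd.Series) -> pd.Series:
    """Return the leading five digits of each zip as text, or <NA> if there are none."""
    as_text = series.astype("string").str.strip()
    # Anchored at the start, so "68355-1234" keeps only "68355"
    return as_text.str.extract(r"^(\d{5})", expand=False)

# Hypothetical sample values covering a +4 zip, padding, a blank, and text
raw = pd.Series(["68355-1234", " 68025 ", "", None, "N/A"])
zip5 = normalize_zip(raw)
```

One caveat: if pandas reads the column as numbers, zips with a leading zero lose that zero before this step runs, so reading the column with `dtype=str` may be the safer choice.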

Language:
-correct language blanks to missing

*DOB:
-assign DOBs after today as NA

Marital status:
- assign blanks to missing

*Gender:
- assign blanks to missing

*Race:
- some values for white misspelled
- some values for American Indian misspelled
- if contains american indian, then American Indian or Alaska Native
 - if starts with W, then white
- assign blanks to missing

*Hispanic/Latino:
- some values misspelled
- non-hispanic or latino not consistent
- some values are just "no"
- assign blanks to missing
- Everything that starts with no should be assigned to non-hispanic or latino
- if it doesn't start with no, and isn't missing, decline to answer, or non-hispanic, assign to Hispanic or Latino

*Sexual orientation:
- assign blanks and N/As to missing
- assign decline to "decline to answer"
- If it starts with st assign to straight

*Insurance type:
- if contains medicare or medicaid, assign to Medicare & Medicaid
- if starts with un then uninsured
- if missing or blank, assign missing

Household size:
- assign the row with 4602 to blank
- assign missings to blank?
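
One possible way to handle the 4602 row without hard-coding it is a plausibility range. This is only a sketch: the cutoff `MAX_PLAUSIBLE_HOUSEHOLD` and the helper name are assumptions, not something decided in these notes.

```python
import pandas as pd

MAX_PLAUSIBLE_HOUSEHOLD = 20  # hypothetical cutoff; adjust to what the data supports

def clean_household_size(series: pd.Series) -> pd.Series:
    """Convert to numbers and blank out values outside a plausible range."""
    sizes = pd.to_numeric(series, errors="coerce")  # text and blanks become NaN
    # Keep values inside the range; everything else (including 4602) becomes NaN
    return sizes.where(sizes.between(1, MAX_PLAUSIBLE_HOUSEHOLD))

# Hypothetical sample values
household = clean_household_size(pd.Series(["2", 4602, "", None, 3]))
```

Leaving the gaps as NaN keeps the column numeric, which answers the "assign missings to blank?" question in favor of a true missing value rather than a text label.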

*Household income:
- remove $, - and , then convert to integers
- assign missings to blank
pandas already did it

*Distance round trip:
- take only numbers, assign text to missing

referral source:
- assign blanks to missing

*Amount:
 - take only numbers, remove $ - and ,
- assign blanks to missing
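
The same stripping rule applies to Amount and Household income, so it could live in one helper. A rough sketch, assuming the "-" is a placeholder dash rather than a minus sign; `parse_currency` and the sample values are hypothetical.

```python
import pandas as pd

def parse_currency(series: pd.Series) -> pd.Series:
    """Strip $, commas, dashes, and spaces, then convert to numbers (NaN if not numeric)."""
    text = series.astype(str).str.replace(r"[$,\-\s]", "", regex=True)
    # Anything left that is not a number (blanks, "None", stray text) becomes NaN
    return pd.to_numeric(text, errors="coerce")

# Hypothetical sample values
amounts = parse_currency(pd.Series(["$1,500.00", "320", " - ", "", None, 21.61]))
```

Because the dash is removed, a genuinely negative amount would come out positive; if negatives can occur, the dash should be left in the pattern.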

*payment method:
only text, assign responses with only numbers to missing

payable to:
Surely I don't have to do anything with this

*patient letter notified:
- assign na, n/a, missing, and blanks to No
- assign dates to Y

Application signed:
should be fine.

other:
make sure data types align with what's needed
- use type "object" to handle numerical and non-numerical data?
Columns marked with * need to be cleaned
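
On the "object" question above, a small sketch of the trade-off (the values are made up): putting a text label such as "Missing" into a numeric column turns it into dtype object, while coercing to NaN keeps it numeric so sums and means still work.

```python
import pandas as pd

# A numeric column with a text label mixed in is stored as dtype "object"
mixed = pd.Series([100, "Missing", 250])

# Coercing turns the label into NaN and gives a float64 column
numeric = pd.to_numeric(mixed, errors="coerce")

# Numeric operations skip NaN by default
total = numeric.sum()
```

This suggests keeping "Missing" labels for the text columns only and using NaN for anything that needs arithmetic later.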



To-do:
-cleaning for address using the Zippopotam.us API
-use lesson 9 as a guide

In [None]:
import pandas as pd
import numpy as np
import requests
import os
import sys

def fetch_zip_info(zip_codes):
    """Fetch city and state info for zip codes."""
    zip_to_locale = {}
    for zip_code in zip_codes:
        try:
            url = f"https://api.zippopotam.us/us/{zip_code}"
            response = requests.get(url, timeout=10)  # avoid hanging on a stalled request
            if response.status_code == 200:
                zip_data = response.json()
                city = zip_data['places'][0]['place name']
                state = zip_data['places'][0]['state abbreviation']
                zip_to_locale[zip_code] = {'City': city, 'State': state}
            else:
                zip_to_locale[zip_code] = {'City': 'Unknown', 'State': 'Unknown'}
        except (requests.RequestException, KeyError, IndexError, ValueError):
            zip_to_locale[zip_code] = {'City': 'Error', 'State': 'Error'}
    return zip_to_locale

def clean_data(filepath, sheet_name=None):
    """Clean the service learning data."""
    today = pd.Timestamp.today()

    # Read file (Excel or CSV)
    if filepath.endswith('.xlsx'):
        data = pd.read_excel(filepath, sheet_name=sheet_name)
    elif filepath.endswith('.csv'):
        data = pd.read_csv(filepath)
    else:
        raise ValueError("Unsupported file format: Only .csv or .xlsx allowed.")

    # Clean Zip, City, State
    # Keep only the leading digits of each zip (drops any +4 extension); blanks become "Missing"
    zip_part = data['Pt Zip'].astype(str).str.strip().str.extract(r'^(\d+)', expand=False)
    data['Pt Zip'] = zip_part.fillna("Missing")

    valid_zips = data.loc[data['Pt Zip'] != "Missing", 'Pt Zip'].unique()
    zip_to_locale = fetch_zip_info(valid_zips)

    # Zips with no lookup result (including "Missing") fall back to "Missing"
    data['Pt City'] = data['Pt Zip'].map(lambda z: zip_to_locale.get(z, {}).get('City', 'Missing'))
    data['Pt State'] = data['Pt Zip'].map(lambda z: zip_to_locale.get(z, {}).get('State', 'Missing'))

    # Clean DOB
    data['DOB'] = pd.to_datetime(data['DOB'], errors='coerce')
    data.loc[data['DOB'] > today, 'DOB'] = pd.NaT  # future dates are invalid; keep the column datetime

    # Clean Gender
    data['Gender'] = data['Gender'].replace(r'^\s*$', "Missing", regex=True).fillna("Missing")

    # Clean Race
    data['Race'] = data['Race'].astype(str).str.strip().str.lower()
    data['Race'] = data['Race'].apply(lambda x: (
        'American Indian or Alaska Native' if 'american indian' in x else
        'White' if x.startswith('w') else
        "Missing" if x in ['', 'nan'] else x.title()
    ))

    # Clean Hispanic/Latino
    data['Hispanic/Latino'] = data['Hispanic/Latino'].astype(str).str.strip().str.lower()
    data['Hispanic/Latino'] = data['Hispanic/Latino'].apply(lambda x: (
        'Non-Hispanic or Latino' if x.startswith('no') else
        np.nan if x in ['nan', '', 'missing', 'decline to answer'] else
        'Hispanic or Latino'
    ))

    # Clean Sexual Orientation
    data['Sexual Orientation'] = data['Sexual Orientation'].astype(str).str.strip().str.lower()
    data['Sexual Orientation'] = data['Sexual Orientation'].apply(lambda x: (
        'Decline to answer' if x.startswith('decline') else
        'Straight' if x.startswith('st') else
        np.nan if x in ['n/a', '', 'nan'] else x.title()
    ))

    # Clean Insurance Type
    data['Insurance Type'] = data['Insurance Type'].astype(str).str.strip().str.lower()
    data['Insurance Type'] = data['Insurance Type'].apply(lambda x: (
        'Medicare & Medicaid' if 'medicare' in x or 'medicaid' in x else
        'Uninsured' if x.startswith('un') else
        'Missing' if x in ['', 'nan'] else
        x.title()
    ))

    # Clean Distance roundtrip
    data['Distance roundtrip/Tx'] = pd.to_numeric(
        data['Distance roundtrip/Tx'].astype(str).str.extract(r'(\d+\.?\d*)')[0],
        errors='coerce'
    )

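    # Clean Payment Method: keep text responses; blanks and number-only responses become missing
    data['Payment Method'] = data['Payment Method'].astype(str).str.strip()
    data['Payment Method'] = data['Payment Method'].apply(
        lambda x: np.nan if x.lower() in ['', 'nan'] or x.replace('.', '', 1).isdigit() else x
    )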

    # Clean Patient Letter Notified
    def letter_notified(val):
        val = str(val).strip().lower()
        if val in ['na', 'n/a', 'missing', '', 'nan']:
            return 'No'
        try:
            pd.to_datetime(val)
            return 'Y'
        except (ValueError, TypeError, OverflowError):  # not parseable as a date
            return 'No'

    data['Patient Letter Notified? (Directly/Indirectly through rep)'] = data['Patient Letter Notified? (Directly/Indirectly through rep)'].apply(letter_notified)

    return data

def main():
    """Entry point for GitHub Action or local run."""
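    # Get the changed file from GitHub Action argument
    if len(sys.argv) < 2 or not sys.argv[1]:
        sys.exit("Usage: python clean_data_script.py <input .csv or .xlsx>")
    input_file = sys.argv[1]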
    output_file = os.path.splitext(input_file)[0] + "_CLEANED.csv"

    # Detect sheet_name only if Excel
    sheet_name = "PA Log Sheet" if input_file.endswith(".xlsx") else None

    cleaned_df = clean_data(input_file, sheet_name=sheet_name)
    cleaned_df.to_csv(output_file, index=False)

    print(f"✅ Cleaned {input_file} -> {output_file}")

if __name__ == "__main__":
    main()



Unnamed: 0,Patient ID#,Grant Req Date,App Year,Remaining Balance,Request Status,Payment Submitted?,Reason - Pending/No,Pt City,Pt State,Pt Zip,Language,DOB,Marital Status,Gender,Race,Hispanic/Latino,Sexual Orientation,Insurance Type,Household Size,Total Household Gross Monthly Income,Distance roundtrip/Tx,Referral Source,Referred By:,Type of Assistance (CLASS),Amount,Payment Method,Payable to:,Patient Letter Notified? (Directly/Indirectly through rep),Application Signed?,Notes
0,180001,2018-10-17,1,1180.00,Approved,Yes,,Missing,Missing,Missing,Missing,NaT,Missing,Missing,Missing,,Missing,Missing,Missing,,,NCS,Dr. Natarajan/Lily Salinas,Medical Supplies/Prescription Co-pay(s),320,Missing,Missing,No,Missing,
1,190001,2019-01-03,1,1428.39,Approved,Yes,,Missing,Missing,Missing,Missing,NaT,Missing,Missing,Missing,,Missing,Missing,Missing,,,NCS,Pam Owen/Sheri Shannon\n,Medical Supplies/Prescription Co-pay(s),21.61,Missing,Missing,No,Missing,
2,190001,2019-03-11,1,1428.39,Approved,Yes,,Missing,Missing,Missing,Missing,NaT,Missing,Missing,Missing,,Missing,Missing,Missing,,,NCS,Teresa Pfister,Food/Groceries,50,GC,Missing,No,Missing,
3,190002,2019-05-20,1,1400.00,Approved,Yes,,Missing,Missing,Missing,Missing,NaT,Missing,Missing,Missing,,Missing,Missing,Missing,,,NCS,AG/Susan Keith,Food/Groceries,100,GC,Missing,No,Missing,
4,190003,2019-05-22,1,1425.00,Approved,Yes,,Missing,Missing,Missing,Missing,NaT,Missing,Missing,Missing,,Missing,Missing,Missing,,,NCS,AG/Kristi McHugh,Medical Supplies/Prescription Co-pay(s),75,,Missing,No,Missing,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2287,240393,2025-01-31,2,1000.00,Pending,No,HS,Falls City,NE,68355,English,1960-09-23 00:00:00,Widowed,Female,Asian,Non-Hispanic or Latino,Straight,Uninsured,1,4000,100.0,CPN,,Gas,500,,,No,,Waiting on HS
2288,240393,2025-01-31,2,1000.00,Pending,No,HS,Falls City,NE,68355,English,1960-09-23 00:00:00,Widowed,Female,Asian,Non-Hispanic or Latino,Straight,Uninsured,1,4000,100.0,CPN,,Food/Groceries,500,,,No,,Waiting on HS
2289,240548,2025-01-31,2,1000.00,Pending,No,,Fremont,NE,68025,English,1962-04-03 00:00:00,Married,Male,White,Non-Hispanic or Latino,Straight,Private,2,2895,15.0,NCS,ALISA SEIDLER,Multiple,1068.56,,,No,,
2290,250038,2025-01-31,1,1500.00,Pending,No,,Hastings,NE,68901,Spanish,1980-10-02 00:00:00,Single,Female,Other,Hispanic or Latino,Straight,Uninsured,2,918,2.0,Morrison Cancer Center,Kellie Sterkel-SW,Housing,1500,,,No,,


name: Clean New Data

on:
  push:
    paths:
      - '**/*.csv'
      - '**/*.xlsx'
    branches:
      - main  # or your main branch

jobs:
  clean-data:
    runs-on: ubuntu-latest
    permissions:
      contents: write  # the final step pushes the cleaned file back to the repo

    steps:
    - name: Checkout code
      uses: actions/checkout@v4
      with:
        fetch-depth: 0  # full history so git diff can see the commit before the push

    - name: Set up Python
      uses: actions/setup-python@v5
      with:
        python-version: '3.x'

    - name: Install dependencies
      run: |
        pip install pandas numpy requests openpyxl

    - name: Detect changed file
      id: detect_file
      run: |
        echo "CHANGED_FILE=$(git diff --name-only ${{ github.event.before }} ${{ github.sha }} | grep -E '\.(csv|xlsx)$' | grep -v '_CLEANED\.csv$' | head -n 1)" >> "$GITHUB_ENV"

    - name: Run data cleaning script
      if: env.CHANGED_FILE != ''
      run: |
        python clean_data_script.py "$CHANGED_FILE"

    - name: Commit cleaned data
      run: |
        git config --global user.name 'github-actions[bot]'
        git config --global user.email 'github-actions[bot]@users.noreply.github.com'
        git add -- '*_CLEANED.csv'
        git commit -m "Automated: Cleaned $CHANGED_FILE"
        git push
      continue-on-error: true