In [2]:
import pandas as pd

In [3]:
pd.options.display.max_columns = None  # None removes the column limit so wide tables display in full

## 1. Clean tbl_JuvenileHistory

### Initial Data Inspection


The dataset contains approximately 2.9 million rows. To avoid memory issues and reduce unnecessary processing, only the first 100 rows will be inspected initially:

- Get an overview of the data types (`dtypes`)
- Identify columns/features worth keeping for the EDA
- Skip any columns that appear to be irrelevant or redundant

This initial check will help streamline the analysis and focus only on useful information.

In [4]:
juvenile_history_path = "tbl_JuvenileHistory.csv"
juvenile_history = pd.read_csv(
    filepath_or_buffer=juvenile_history_path, delimiter="\t", nrows=100
)

In [5]:
juvenile_history.head()

Unnamed: 0,idnJuvenileHistory,idnCase,idnProceeding,idnJuvenile,DATCREATEDON,DATMODIFIEDON
0,5,2046990,3200129,1,2014-09-06 19:24:46.373,
1,6,2047179,3199488,1,2014-09-06 19:24:46.373,
2,7,2047179,3199489,1,2014-09-06 19:24:46.373,
3,8,2047199,3199497,1,2014-09-06 19:24:46.373,
4,9,2047199,3199498,1,2014-09-06 19:24:46.373,


In [6]:
juvenile_history.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 6 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   idnJuvenileHistory  100 non-null    int64  
 1   idnCase             100 non-null    int64  
 2   idnProceeding       100 non-null    int64  
 3   idnJuvenile         100 non-null    int64  
 4   DATCREATEDON        100 non-null    object 
 5   DATMODIFIEDON       0 non-null      float64
dtypes: float64(1), int64(4), object(1)
memory usage: 4.8+ KB


In [7]:
juvenile_history.columns

Index(['idnJuvenileHistory', 'idnCase', 'idnProceeding', 'idnJuvenile',
       'DATCREATEDON', 'DATMODIFIEDON'],
      dtype='object')

**`idnJuvenile`**: Although stored as `int64`, this column only contains values from 1 to 6 (as per `Lookup/tblLookup_Juvenile.csv`). It should be treated as a categorical feature.

**`DATCREATEDON`** and **`DATMODIFIEDON`**:
  - `DATCREATEDON` likely reflects the date when the record was entered into the system, not when the case itself was initiated.
  - `DATMODIFIEDON` contains only null values and may be dropped.
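
As a rough illustration of why a low-cardinality integer column such as `idnJuvenile` benefits from the `category` dtype, the sketch below compares memory use on synthetic data. The values are randomly generated stand-ins, not rows from `tbl_JuvenileHistory.csv`, and the exact byte counts will vary by pandas version.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for idnJuvenile: integer codes 1-6 repeated many times.
rng = np.random.default_rng(0)
codes = pd.Series(rng.integers(1, 7, size=100_000), name="idnJuvenile")

as_int = codes.astype("int64")
as_cat = codes.astype("category")

# Compare the footprint of the two representations.
int_bytes = as_int.memory_usage(deep=True)
cat_bytes = as_cat.memory_usage(deep=True)
print(f"int64: {int_bytes:,} bytes, category: {cat_bytes:,} bytes")
```

With only a handful of distinct values, the categorical column stores small integer codes plus one copy of each category, which is typically much smaller than eight bytes per row.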

In [8]:
juvenile_history = pd.read_csv(
    filepath_or_buffer=juvenile_history_path,
    delimiter="\t",
    usecols=["idnJuvenileHistory", "idnCase", "idnProceeding", "idnJuvenile"],
    dtype={
        "idnJuvenileHistory": "Int64",
        "idnCase": "Int64",
        "idnProceeding": "Int64",
        "idnJuvenile": "category",
    },
    low_memory=False,
)

### Data Inspection

In [9]:
juvenile_history.head()

Unnamed: 0,idnJuvenileHistory,idnCase,idnProceeding,idnJuvenile
0,5,2046990,3200129,1
1,6,2047179,3199488,1
2,7,2047179,3199489,1
3,8,2047199,3199497,1
4,9,2047199,3199498,1


In [10]:
juvenile_history.shape

(2857093, 4)

In [11]:
juvenile_history.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2857093 entries, 0 to 2857092
Data columns (total 4 columns):
 #   Column              Dtype   
---  ------              -----   
 0   idnJuvenileHistory  Int64   
 1   idnCase             Int64   
 2   idnProceeding       Int64   
 3   idnJuvenile         category
dtypes: Int64(3), category(1)
memory usage: 76.3 MB


In [12]:
juvenile_history.isna().sum()

idnJuvenileHistory      0
idnCase                 0
idnProceeding          98
idnJuvenile           999
dtype: int64

In [13]:
juvenile_history.duplicated().sum()

0
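
Note that `duplicated()` with no arguments only flags rows that repeat across every column. Whether a single column such as `idnJuvenileHistory` is unique is a separate check. A minimal sketch on a hypothetical toy frame (the values are made up for illustration):

```python
import pandas as pd

# Toy frame: no fully duplicated rows, but one repeated key value.
toy = pd.DataFrame(
    {
        "idnJuvenileHistory": [5, 6, 6],
        "idnCase": [2046990, 2047179, 2047199],
    }
)

# Whole-row duplicates (what duplicated().sum() reports).
full_row_duplicates = int(toy.duplicated().sum())

# Uniqueness of the candidate key column on its own.
key_is_unique = toy["idnJuvenileHistory"].is_unique
repeated_keys = toy.loc[toy["idnJuvenileHistory"].duplicated(keep=False)]

print(full_row_duplicates, key_is_unique)
print(repeated_keys)
```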

#### `idnCase` List

Before working with `A_TblCase`, we extract the list of `idnCase` values  
associated with juvenile records. This will allow us to filter the case-level  
dataset to include only juvenile-related cases, without performing a full merge.

In [14]:
juvenile_case_ids = juvenile_history["idnCase"].unique()

In [15]:
juvenile_history.to_csv(
    "../outputs/juvenile_history_cleaned.csv.gz", index=False, compression="gzip"
)

### Cleaned `tbl_JuvenileHistory`

- Loaded ~2.86M rows (2,857,093) from the raw CSV file.
- Dropped irrelevant or fully null columns.
- Converted `idnJuvenile` to a categorical variable.
- Retained only 4 key fields:
  - `idnJuvenileHistory`: primary key
  - `idnCase`: foreign key to `A_TblCase` (column `IDNCASE`)
  - `idnProceeding`: foreign key to `tbl_Proceeding`
  - `idnJuvenile`: foreign key to `tblLookup_Juvenile`
- Missing values:
  - `idnProceeding`: 98 missing
  - `idnJuvenile`: 999 missing

Saved cleaned file as:
- `juvenile_history_cleaned.csv.gz` 

## 2. Clean A_TblCase

### Initial Data Inspection


The dataset contains approximately 11.7 million rows. To avoid memory issues and reduce unnecessary processing, only the first 1000 rows will be inspected initially:

- Get an overview of the data types (`dtypes`)
- Identify columns/features worth keeping for the EDA
- Skip any columns that appear to be irrelevant or redundant

This initial check will help streamline the analysis and focus only on useful information.

In [16]:
case_path = "A_TblCase.csv"
cases = pd.read_csv(filepath_or_buffer=case_path, delimiter="\t", nrows=1000)

In [17]:
cases.head()

Unnamed: 0,IDNCASE,ALIEN_CITY,ALIEN_STATE,ALIEN_ZIPCODE,UPDATED_ZIPCODE,UPDATED_CITY,NAT,LANG,CUSTODY,SITE_TYPE,E_28_DATE,ATTY_NBR,CASE_TYPE,UPDATE_SITE,LATEST_HEARING,LATEST_TIME,LATEST_CAL_TYPE,UP_BOND_DATE,UP_BOND_RSN,CORRECTIONAL_FAC,RELEASE_MONTH,RELEASE_YEAR,INMATE_HOUSING,DATE_OF_ENTRY,C_ASY_TYPE,C_BIRTHDATE,C_RELEASE_DATE,UPDATED_STATE,ADDRESS_CHANGEDON,ZBOND_MRG_FLAG,GENDER,DATE_DETAINED,DATE_RELEASED,LPR,DETENTION_DATE,DETENTION_LOCATION,DCO_LOCATION,DETENTION_FACILITY_TYPE,CASEPRIORITY_CODE
0,11782069,SAINT CHARLES,IL,60174,,,VE,SP,N,M,,,RMV,CHI,2026-01-20 00:00:00.000,900.0,M,,,,,,,2023-05-04 00:00:00.000,,11/1999,,,,,F,,,,,,,,
1,11782070,PLYMOUTH,MA,2360,,,EC,SP,D,M,,,RMV,CHE,,,,,,,,,,2023-04-30 00:00:00.000,,11/1983,,,2024-11-19 08:32:39.000,,M,2024-11-14 00:00:00.000,,,,,,,
2,11782071,WEST SACRAMENTO,CA,95691,,,RU,RUS,N,M,,,RMV,SFR,2026-08-19 00:00:00.000,1330.0,M,,,,,,,2023-05-04 00:00:00.000,E,5/1985,,,2024-11-01 12:22:02.000,,F,,,,,,,,
3,11782072,SAN JOSE,CA,95117,,,MX,SP,N,M,2024-04-19 11:32:17.227,0.0,RMV,SFR,2028-02-17 00:00:00.000,1330.0,M,,,,,,,2023-05-03 00:00:00.000,E,9/1999,,,2025-03-17 14:07:53.000,,F,,,,,,,,
4,11782074,KEARNS,UT,84118,,,VE,SP,N,M,,,RMV,SLC,2025-11-25 00:00:00.000,1400.0,M,,,,,,,2023-04-30 00:00:00.000,E,9/2004,,,,,F,,,,,,,,


In [18]:
cases.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 39 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   IDNCASE                  1000 non-null   int64  
 1   ALIEN_CITY               958 non-null    object 
 2   ALIEN_STATE              974 non-null    object 
 3   ALIEN_ZIPCODE            974 non-null    object 
 4   UPDATED_ZIPCODE          0 non-null      float64
 5   UPDATED_CITY             0 non-null      float64
 6   NAT                      1000 non-null   object 
 7   LANG                     1000 non-null   object 
 8   CUSTODY                  1000 non-null   object 
 9   SITE_TYPE                878 non-null    object 
 10  E_28_DATE                271 non-null    object 
 11  ATTY_NBR                 526 non-null    float64
 12  CASE_TYPE                1000 non-null   object 
 13  UPDATE_SITE              1000 non-null   object 
 14  LATEST_HEARING           

In [19]:
cases.columns

Index(['IDNCASE', 'ALIEN_CITY', 'ALIEN_STATE', 'ALIEN_ZIPCODE',
       'UPDATED_ZIPCODE', 'UPDATED_CITY', 'NAT', 'LANG', 'CUSTODY',
       'SITE_TYPE', 'E_28_DATE', 'ATTY_NBR', 'CASE_TYPE', 'UPDATE_SITE',
       'LATEST_HEARING', 'LATEST_TIME', 'LATEST_CAL_TYPE', 'UP_BOND_DATE',
       'UP_BOND_RSN', 'CORRECTIONAL_FAC', 'RELEASE_MONTH', 'RELEASE_YEAR',
       'INMATE_HOUSING', 'DATE_OF_ENTRY', 'C_ASY_TYPE', 'C_BIRTHDATE',
       'C_RELEASE_DATE', 'UPDATED_STATE', 'ADDRESS_CHANGEDON',
       'ZBOND_MRG_FLAG', 'GENDER', 'DATE_DETAINED', 'DATE_RELEASED', 'LPR',
       'DETENTION_DATE', 'DETENTION_LOCATION', 'DCO_LOCATION',
       'DETENTION_FACILITY_TYPE', 'CASEPRIORITY_CODE'],
      dtype='object')

#### Selected Features for EDA

The list of selected columns below was discussed earlier in the documentation for the source dataset. It represents the core case-level features relevant to our analysis of juvenile immigration cases.

These fields include:

- Demographic information (e.g., `GENDER`, `NAT`, `LANG`)  
- Case characteristics (e.g., `CASE_TYPE`, `CUSTODY`, `LATEST_HEARING`)  
- Key dates (e.g., `DATE_OF_ENTRY`, `DETENTION_DATE`, `C_BIRTHDATE`)

In [20]:
selected_columns = [
    "IDNCASE",
    "NAT",
    "LANG",
    "CUSTODY",
    "CASE_TYPE",
    "LATEST_HEARING",
    "LATEST_CAL_TYPE",
    "DATE_OF_ENTRY",
    "C_BIRTHDATE",
    "GENDER",
    "DATE_DETAINED",
    "DATE_RELEASED",
    "DETENTION_DATE",
]

In [21]:
cases = cases[selected_columns]

In [22]:
cases.head()

Unnamed: 0,IDNCASE,NAT,LANG,CUSTODY,CASE_TYPE,LATEST_HEARING,LATEST_CAL_TYPE,DATE_OF_ENTRY,C_BIRTHDATE,GENDER,DATE_DETAINED,DATE_RELEASED,DETENTION_DATE
0,11782069,VE,SP,N,RMV,2026-01-20 00:00:00.000,M,2023-05-04 00:00:00.000,11/1999,F,,,
1,11782070,EC,SP,D,RMV,,,2023-04-30 00:00:00.000,11/1983,M,2024-11-14 00:00:00.000,,
2,11782071,RU,RUS,N,RMV,2026-08-19 00:00:00.000,M,2023-05-04 00:00:00.000,5/1985,F,,,
3,11782072,MX,SP,N,RMV,2028-02-17 00:00:00.000,M,2023-05-03 00:00:00.000,9/1999,F,,,
4,11782074,VE,SP,N,RMV,2025-11-25 00:00:00.000,M,2023-04-30 00:00:00.000,9/2004,F,,,


In [23]:
cases.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 13 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   IDNCASE          1000 non-null   int64  
 1   NAT              1000 non-null   object 
 2   LANG             1000 non-null   object 
 3   CUSTODY          1000 non-null   object 
 4   CASE_TYPE        1000 non-null   object 
 5   LATEST_HEARING   859 non-null    object 
 6   LATEST_CAL_TYPE  854 non-null    object 
 7   DATE_OF_ENTRY    643 non-null    object 
 8   C_BIRTHDATE      558 non-null    object 
 9   GENDER           554 non-null    object 
 10  DATE_DETAINED    39 non-null     object 
 11  DATE_RELEASED    5 non-null      object 
 12  DETENTION_DATE   0 non-null      float64
dtypes: float64(1), int64(1), object(11)
memory usage: 101.7+ KB


#### Note on `DETENTION_DATE`

This field is entirely null in the sample, but we'll reassess after  
loading the full dataset.

In [24]:
cases.dtypes

IDNCASE              int64
NAT                 object
LANG                object
CUSTODY             object
CASE_TYPE           object
LATEST_HEARING      object
LATEST_CAL_TYPE     object
DATE_OF_ENTRY       object
C_BIRTHDATE         object
GENDER              object
DATE_DETAINED       object
DATE_RELEASED       object
DETENTION_DATE     float64
dtype: object

#### Specifying Column Data Types

- `Int64`: Used for `IDNCASE` to allow nullable integer values.
- `category`: Applied to string columns with repeated values  
  (e.g., `NAT`, `LANG`, `CUSTODY`, `CASE_TYPE`, `LATEST_CAL_TYPE`, `GENDER`)  
  for efficient storage and faster processing.
- `string`: Used for `C_BIRTHDATE` since it uses a partial date format (`MM/YYYY`)  
  and may contain nulls.
- `float64`: Used for `DETENTION_DATE`, which appears to contain only nulls in the  
  sample but may include numeric timestamps or intervals in the full dataset.

In [25]:
dtype = {
    "IDNCASE": "Int64",
    "NAT": "category",
    "LANG": "category",
    "CUSTODY": "category",
    "CASE_TYPE": "category",
    "LATEST_CAL_TYPE": "category",
    "GENDER": "category",
    "C_BIRTHDATE": "string",
    "DETENTION_DATE": "float64",
}

In [26]:
cases = pd.read_csv(
    filepath_or_buffer=case_path,
    delimiter="\t",
    on_bad_lines="skip",
    usecols=selected_columns,
    dtype=dtype,
    parse_dates=["LATEST_HEARING", "DATE_OF_ENTRY", "DATE_DETAINED", "DATE_RELEASED"],
    low_memory=False,
    skiprows=[11711221],  # skip malformed row: C error - EOF inside string on this line
)
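
To see how a `dtype` mapping like the one above behaves, the sketch below applies a reduced version of it to a tiny in-memory, tab-separated sample. The three rows are hypothetical and only mimic the shape of `A_TblCase.csv`.

```python
import io
import pandas as pd

# Hypothetical three-row sample; the second row has empty GENDER and C_BIRTHDATE.
sample_tsv = (
    "IDNCASE\tNAT\tGENDER\tC_BIRTHDATE\n"
    "1\tVE\tF\t11/1999\n"
    "2\tEC\t\t\n"
    "3\tVE\tM\t5/1985\n"
)

sample = pd.read_csv(
    io.StringIO(sample_tsv),
    delimiter="\t",
    dtype={
        "IDNCASE": "Int64",       # nullable integer
        "NAT": "category",        # repeated codes
        "GENDER": "category",
        "C_BIRTHDATE": "string",  # partial date kept as text for now
    },
)
print(sample.dtypes)
print(sample)
```

Empty fields are read as missing values in every one of these dtypes, so no separate null handling is needed at load time.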

### Data Inspection

In [27]:
cases.head()

Unnamed: 0,IDNCASE,NAT,LANG,CUSTODY,CASE_TYPE,LATEST_HEARING,LATEST_CAL_TYPE,DATE_OF_ENTRY,C_BIRTHDATE,GENDER,DATE_DETAINED,DATE_RELEASED,DETENTION_DATE
0,11782069,VE,SP,N,RMV,2026-01-20 00:00:00.000,M,2023-05-04 00:00:00.000,11/1999,F,,,
1,11782070,EC,SP,D,RMV,,,2023-04-30 00:00:00.000,11/1983,M,2024-11-14 00:00:00.000,,
2,11782071,RU,RUS,N,RMV,2026-08-19 00:00:00.000,M,2023-05-04 00:00:00.000,5/1985,F,,,
3,11782072,MX,SP,N,RMV,2028-02-17 00:00:00.000,M,2023-05-03 00:00:00.000,9/1999,F,,,
4,11782074,VE,SP,N,RMV,2025-11-25 00:00:00.000,M,2023-04-30 00:00:00.000,9/2004,F,,,


In [28]:
cases.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11711220 entries, 0 to 11711219
Data columns (total 13 columns):
 #   Column           Dtype   
---  ------           -----   
 0   IDNCASE          Int64   
 1   NAT              category
 2   LANG             category
 3   CUSTODY          category
 4   CASE_TYPE        category
 5   LATEST_HEARING   object  
 6   LATEST_CAL_TYPE  category
 7   DATE_OF_ENTRY    object  
 8   C_BIRTHDATE      string  
 9   GENDER           category
 10  DATE_DETAINED    object  
 11  DATE_RELEASED    object  
 12  DETENTION_DATE   float64 
dtypes: Int64(1), category(6), float64(1), object(4), string(1)
memory usage: 726.0+ MB


In [29]:
cases.shape

(11711220, 13)
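
The four columns passed to `parse_dates` still appear as `object` in the `info()` output above. A likely cause, though not verified against the full file here, is that `read_csv` leaves a column unconverted when it contains a value that cannot be parsed as a date. The sketch below reproduces that behavior on hypothetical data and shows an explicit conversion afterwards.

```python
import io
import warnings
import pandas as pd

# Hypothetical sample: the second row holds a non-date value in the date column.
sample_tsv = (
    "IDNCASE\tLATEST_HEARING\n"
    "1\t2026-01-20 00:00:00.000\n"
    "2\tSFR\n"
)

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # silence any date-format inference warning
    sample = pd.read_csv(
        io.StringIO(sample_tsv), delimiter="\t", parse_dates=["LATEST_HEARING"]
    )

# The column is returned unparsed because one value is not a date.
parsed_as_datetime = pd.api.types.is_datetime64_any_dtype(sample["LATEST_HEARING"])
print(parsed_as_datetime)

# An explicit conversion turns the unparsable value into NaT instead.
converted = pd.to_datetime(sample["LATEST_HEARING"], errors="coerce")
print(converted)
```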

#### Filtering for Juvenile Cases

The current `cases` DataFrame includes both adult and juvenile records.  
To isolate only juvenile cases, we will filter it using the list of  
`idnCase` values from `tbl_JuvenileHistory`, without performing a full merge.

In [30]:
juvenile_cases = cases[cases["IDNCASE"].isin(juvenile_case_ids)].reset_index(drop=True)
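
The difference between filtering with `isin` and performing a merge can be sketched on toy data. The frames and values below are hypothetical; only the column names follow the notebook.

```python
import pandas as pd

# Toy history table: case 10 appears twice (two proceedings for the same case).
history = pd.DataFrame({"idnCase": pd.array([10, 10, 30], dtype="Int64")})
all_cases = pd.DataFrame(
    {
        "IDNCASE": pd.array([10, 20, 30, 40], dtype="Int64"),
        "NAT": ["GT", "MX", "VE", "EC"],
    }
)

# Membership filter: each matching case is kept exactly once.
case_ids = history["idnCase"].unique()
subset = all_cases[all_cases["IDNCASE"].isin(case_ids)].reset_index(drop=True)

# Inner merge for comparison: case 10 is repeated once per history row.
merged = all_cases.merge(history, left_on="IDNCASE", right_on="idnCase")

print(len(subset), len(merged))
```

Because a case can appear in several history rows, the membership filter keeps the case table at one row per case, whereas a merge would repeat cases.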

#### Category Cleanup

Removed unused category levels after filtering to ensure all categorical columns reflect only the values present in the `juvenile_cases` subset.

In [31]:
for col in juvenile_cases.select_dtypes(include='category'):
    juvenile_cases[col] = juvenile_cases[col].cat.remove_unused_categories()

In [32]:
juvenile_cases.head()

Unnamed: 0,IDNCASE,NAT,LANG,CUSTODY,CASE_TYPE,LATEST_HEARING,LATEST_CAL_TYPE,DATE_OF_ENTRY,C_BIRTHDATE,GENDER,DATE_DETAINED,DATE_RELEASED,DETENTION_DATE
0,13758313,GT,SP,N,RMV,,,2024-01-21 00:00:00.000,2/2008,F,,,
1,14870586,MX,SP,D,RFR,2025-02-04 00:00:00.000,I,,,,2025-01-22 00:00:00.000,,
2,14870588,GT,SP,R,WHO,2025-07-31 00:00:00.000,M,,6/1997,F,2025-01-29 00:00:00.000,2025-02-06 00:00:00.000,
3,13816559,MX,SP,N,RMV,2027-02-18 00:00:00.000,I,,5/2020,M,,,
4,13816560,MX,SP,N,RMV,2027-02-18 00:00:00.000,I,,5/2021,F,,,


In [33]:
juvenile_cases.shape

(1858773, 13)

In [34]:
juvenile_cases.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1858773 entries, 0 to 1858772
Data columns (total 13 columns):
 #   Column           Dtype   
---  ------           -----   
 0   IDNCASE          Int64   
 1   NAT              category
 2   LANG             category
 3   CUSTODY          category
 4   CASE_TYPE        category
 5   LATEST_HEARING   object  
 6   LATEST_CAL_TYPE  category
 7   DATE_OF_ENTRY    object  
 8   C_BIRTHDATE      string  
 9   GENDER           category
 10  DATE_DETAINED    object  
 11  DATE_RELEASED    object  
 12  DETENTION_DATE   float64 
dtypes: Int64(1), category(6), float64(1), object(4), string(1)
memory usage: 115.2+ MB


### Missing Values Summary

Counts and percentages of missing values were calculated for each column in the `juvenile_cases` dataset to assess data completeness.  
`.isna().sum()` computes the total number of missing entries, and the percentage is derived by dividing by the total number of rows.  
Columns are sorted by missing count to highlight those requiring attention during data cleaning.

In [35]:
null_counts = juvenile_cases.isna().sum()
percent_missing = (null_counts / len(juvenile_cases)) * 100

missing_summary = pd.DataFrame(
    {"Missing Count": null_counts, "Missing %": percent_missing.round(2)}
).sort_values(by="Missing Count", ascending=False)

display(missing_summary)

Unnamed: 0,Missing Count,Missing %
DETENTION_DATE,1858773,100.0
DATE_RELEASED,1411804,75.95
DATE_DETAINED,1053320,56.67
GENDER,488535,26.28
DATE_OF_ENTRY,480722,25.86
C_BIRTHDATE,446724,24.03
LATEST_CAL_TYPE,303538,16.33
LATEST_HEARING,296813,15.97
NAT,3057,0.16
LANG,1692,0.09


The `DETENTION_DATE` column is entirely null (100% missing) and will therefore be dropped from the dataset.

In [36]:
juvenile_cases = juvenile_cases.drop("DETENTION_DATE", axis=1)

#### Datetime Format Validation

The following date-related features will be the focus of the next stage of preprocessing:  
- `LATEST_HEARING`  
- `DATE_OF_ENTRY`  
- `C_BIRTHDATE`  
- `DATE_DETAINED`  
- `DATE_RELEASED`  

These features may require format standardization and conversion to datetime objects to enable accurate temporal analysis.

Every datetime feature (except `C_BIRTHDATE`) follows the format `'YYYY-MM-DD 00:00:00.000'` (e.g., `'2025-02-04 00:00:00.000'`).

Before conversion, each feature will be tested against this pattern to ensure values are valid.  
All non-null entries will be checked to avoid unintended data loss during transformation with `pd.to_datetime()`.

Only the **`YYYY-MM-DD`** portion of each timestamp will be retained.

In [37]:
def find_invalid_dates(df, column):
    """
    Returns non-null rows that don’t match the `YYYY-MM-DD` pattern.
    """
    return df[
        df[column].notna() &
        ~df[column].astype(str).str.contains(r'\d{4}-\d{2}-\d{2}', regex=True)
    ][[column]]
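
Since `str.contains` matches the pattern anywhere in the string, a value with a date embedded in other text would pass the check. A stricter variant, sketched here under the assumption that valid values always begin with the date, uses `str.match`, which anchors at the start of the string. The name `find_invalid_dates_strict` and the toy data are hypothetical.

```python
import pandas as pd


def find_invalid_dates_strict(df, column):
    """Return non-null rows whose value does not start with YYYY-MM-DD."""
    starts_with_date = (
        df[column].astype(str).str.match(r"\d{4}-\d{2}-\d{2}", na=False)
    )
    return df[df[column].notna() & ~starts_with_date][[column]]


toy = pd.DataFrame(
    {"LATEST_HEARING": ["2025-02-04 00:00:00.000", "SFR", None, "x2025-02-04"]}
)
invalid = find_invalid_dates_strict(toy, "LATEST_HEARING")
print(invalid)
```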

In [38]:
invalid_latest_hearing = find_invalid_dates(juvenile_cases, 'LATEST_HEARING')
invalid_date_of_entry = find_invalid_dates(juvenile_cases, 'DATE_OF_ENTRY')
invalid_date_detained = find_invalid_dates(juvenile_cases, 'DATE_DETAINED')
invalid_date_released = find_invalid_dates(juvenile_cases, 'DATE_RELEASED')

In [39]:
def report_invalid(name, df):
    """
    Prints the number of invalid entries and displays the DataFrame if not empty.
    """
    count = len(df)
    print(f"{name}: {count} invalid entr{'y' if count == 1 else 'ies'}")
    if count > 0:
        display(df)  

In [40]:
report_invalid("LATEST_HEARING", invalid_latest_hearing)
report_invalid("DATE_OF_ENTRY", invalid_date_of_entry)
report_invalid("DATE_DETAINED", invalid_date_detained)
report_invalid("DATE_RELEASED", invalid_date_released)

LATEST_HEARING: 9 invalid entries


Unnamed: 0,LATEST_HEARING
1676217,SFR
1678888,PHI
1701983,NEW
1774914,NYV
1775104,NYV
1788271,PHI
1790694,HAR
1796974,DAL
1842725,SNA


DATE_OF_ENTRY: 0 invalid entries
DATE_DETAINED: 7 invalid entries


Unnamed: 0,DATE_DETAINED
1676217,M
1678888,F
1774914,M
1775104,M
1788271,M
1790694,F
1842725,M


DATE_RELEASED: 0 invalid entries


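The invalid entries reported above (9 in `LATEST_HEARING`, 7 in `DATE_DETAINED`) are not corrected individually. They share row indices and look like values shifted in from neighboring columns, which suggests a small number of malformed rows.  
Passing `errors='coerce'` to `pd.to_datetime()` converts these values to `NaT`, which keeps the cleaning step simple.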

In [41]:
date_cols = ['LATEST_HEARING', 'DATE_OF_ENTRY', 'DATE_DETAINED', 'DATE_RELEASED']

juvenile_cases[date_cols] = juvenile_cases[date_cols].apply(
    lambda col: pd.to_datetime(col, errors='coerce')
)
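
One way to confirm how many values are lost to coercion is to compare missingness before and after the conversion. The sketch below uses a hypothetical four-value series, and `dt.normalize()` is shown as one option for keeping only the date portion.

```python
import pandas as pd

# Hypothetical raw column: two valid timestamps, one stray code, one missing value.
raw = pd.Series(
    ["2025-02-04 00:00:00.000", "SFR", None, "2027-02-18 00:00:00.000"]
)
converted = pd.to_datetime(raw, errors="coerce")

# Values that were present before conversion but are NaT afterwards.
lost_to_coercion = int((raw.notna() & converted.isna()).sum())
print(f"Values coerced to NaT: {lost_to_coercion}")

# Set the time component to midnight so only the date carries information.
date_only = converted.dt.normalize()
print(date_only)
```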

Identified non-null `C_BIRTHDATE` values that do not match the expected `MM/YYYY` format.  
This helps reveal alternate formats and prevents unintended data loss during conversion.

In [42]:
invalid_birthdates = juvenile_cases[
    juvenile_cases['C_BIRTHDATE'].notna() &
    ~juvenile_cases['C_BIRTHDATE'].astype(str).str.contains(r'^\d{1,2}/\d{4}$', regex=True)
][['C_BIRTHDATE']]

In [43]:
print(f"Invalid C_BIRTHDATE entries: {len(invalid_birthdates)}")
display(invalid_birthdates)

Invalid C_BIRTHDATE entries: 7


Unnamed: 0,C_BIRTHDATE
1678888,E
1774914,E
1775104,E
1788271,E
1790694,E
1796974,E
1842725,E


In [44]:
juvenile_cases['C_BIRTHDATE'] = pd.to_datetime(
    juvenile_cases['C_BIRTHDATE'],
    format='%m/%Y',
    errors='coerce'
)
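
The effect of parsing a partial `MM/YYYY` date with an explicit format can be seen on a few hypothetical values:

```python
import pandas as pd

# Hypothetical values: two valid partial dates, one stray code, one missing value.
birth = pd.Series(["2/2008", "11/1999", "E", None])
parsed = pd.to_datetime(birth, format="%m/%Y", errors="coerce")
print(parsed)
```

Because the source carries no day, each parsed date falls on the first of the month, so any age derived from it can be off by up to about a month.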

#### Date Conversion Check

Confirmed that all date columns were successfully converted to `datetime64[ns]` format.

In [45]:
date_cols = ['LATEST_HEARING', 'DATE_OF_ENTRY', 'DATE_DETAINED', 'DATE_RELEASED', 'C_BIRTHDATE']
print(juvenile_cases[date_cols].dtypes)

LATEST_HEARING    datetime64[ns]
DATE_OF_ENTRY     datetime64[ns]
DATE_DETAINED     datetime64[ns]
DATE_RELEASED     datetime64[ns]
C_BIRTHDATE       datetime64[ns]
dtype: object


In [46]:
juvenile_cases.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1858773 entries, 0 to 1858772
Data columns (total 12 columns):
 #   Column           Dtype         
---  ------           -----         
 0   IDNCASE          Int64         
 1   NAT              category      
 2   LANG             category      
 3   CUSTODY          category      
 4   CASE_TYPE        category      
 5   LATEST_HEARING   datetime64[ns]
 6   LATEST_CAL_TYPE  category      
 7   DATE_OF_ENTRY    datetime64[ns]
 8   C_BIRTHDATE      datetime64[ns]
 9   GENDER           category      
 10  DATE_DETAINED    datetime64[ns]
 11  DATE_RELEASED    datetime64[ns]
dtypes: Int64(1), category(6), datetime64[ns](5)
memory usage: 101.0 MB


In [47]:
juvenile_cases.head()

Unnamed: 0,IDNCASE,NAT,LANG,CUSTODY,CASE_TYPE,LATEST_HEARING,LATEST_CAL_TYPE,DATE_OF_ENTRY,C_BIRTHDATE,GENDER,DATE_DETAINED,DATE_RELEASED
0,13758313,GT,SP,N,RMV,NaT,,2024-01-21,2008-02-01,F,NaT,NaT
1,14870586,MX,SP,D,RFR,2025-02-04,I,NaT,NaT,,2025-01-22,NaT
2,14870588,GT,SP,R,WHO,2025-07-31,M,NaT,1997-06-01,F,2025-01-29,2025-02-06
3,13816559,MX,SP,N,RMV,2027-02-18,I,NaT,2020-05-01,M,NaT,NaT
4,13816560,MX,SP,N,RMV,2027-02-18,I,NaT,2021-05-01,F,NaT,NaT


#### Categorical Value Cleanup

Inspected the value counts of the key categorical features (`CUSTODY`, `CASE_TYPE`, `LATEST_CAL_TYPE`, `GENDER`) to spot typos, misplaced values, and rare variants.  
Only `LATEST_CAL_TYPE` and `GENDER` are cleaned below; the rare unexpected codes in `CUSTODY` and `CASE_TYPE` are left unchanged.

`NAT` and `LANG` contain a large number of unique values.  
These were retained in full to preserve detail and will be grouped or simplified as needed during analysis.

In [48]:
juvenile_cases['CUSTODY'].value_counts()

CUSTODY
N      1010878
R       471998
D       375884
SP           8
POR          1
Name: count, dtype: int64

In [49]:
juvenile_cases['CASE_TYPE'].value_counts()

CASE_TYPE
RMV    1721295
CFR      88087
RFR      21395
WHO      21372
AOC       4358
DEP       1830
REC        218
EXC        153
CSR         51
0            5
AOL          2
NAC          2
DCC          1
Name: count, dtype: int64

In [50]:
juvenile_cases['LATEST_CAL_TYPE'].value_counts()

LATEST_CAL_TYPE
M       1069154
I        486049
N            22
0900          4
1300          2
R             2
0830          1
1030          1
Name: count, dtype: int64

Cleaned `LATEST_CAL_TYPE` by keeping known values (`M` = Master, `I` = Individual) and replacing unexpected entries (e.g., time strings or unknown codes) with `NaN`.

In [51]:
valid_types = ['M', 'I']

juvenile_cases['LATEST_CAL_TYPE'] = juvenile_cases['LATEST_CAL_TYPE'].apply(
    lambda x: x if x in valid_types else pd.NA
).astype('category')

juvenile_cases['LATEST_CAL_TYPE'] = juvenile_cases['LATEST_CAL_TYPE'].cat.remove_unused_categories()
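
The same cleanup can also be written without a row-wise `apply`, which may be faster on a column of this size. This is an alternative sketch rather than the approach used in the notebook, and the toy series below is hypothetical.

```python
import pandas as pd

# Hypothetical categorical column with valid codes, stray values, and a null.
cal_type = pd.Series(["M", "I", "0900", "N", None, "M"], dtype="category")
valid_types = ["M", "I"]

# Keep values found in valid_types; everything else becomes missing.
cleaned = cal_type.where(cal_type.isin(valid_types))

# Drop category levels that no longer occur.
cleaned = cleaned.cat.remove_unused_categories()
print(cleaned)
```

The result stays categorical throughout, so no `astype('category')` round trip is needed.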

In [52]:
juvenile_cases['GENDER'].value_counts()

GENDER
M    875507
F    494724
N         6
U         1
Name: count, dtype: int64

Cleaned `GENDER` by keeping known values (`M` = Male, `F` = Female) and replacing rare or unclear codes (`N`, `U`) with `NaN` to ensure consistency and avoid ambiguity in gender-related analysis.

In [53]:
juvenile_cases['GENDER'] = juvenile_cases['GENDER'].apply(
    lambda x: x if x in ['M', 'F'] else pd.NA
).astype('category')

juvenile_cases['GENDER'] = juvenile_cases['GENDER'].cat.remove_unused_categories()

In [55]:
juvenile_cases.head()

Unnamed: 0,IDNCASE,NAT,LANG,CUSTODY,CASE_TYPE,LATEST_HEARING,LATEST_CAL_TYPE,DATE_OF_ENTRY,C_BIRTHDATE,GENDER,DATE_DETAINED,DATE_RELEASED
0,13758313,GT,SP,N,RMV,NaT,,2024-01-21,2008-02-01,F,NaT,NaT
1,14870586,MX,SP,D,RFR,2025-02-04,I,NaT,NaT,,2025-01-22,NaT
2,14870588,GT,SP,R,WHO,2025-07-31,M,NaT,1997-06-01,F,2025-01-29,2025-02-06
3,13816559,MX,SP,N,RMV,2027-02-18,I,NaT,2020-05-01,M,NaT,NaT
4,13816560,MX,SP,N,RMV,2027-02-18,I,NaT,2021-05-01,F,NaT,NaT


In [54]:
juvenile_cases.to_csv(
    "../outputs/juvenile_cases_cleaned.csv.gz", index=False, compression="gzip"
)
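
CSV files do not store dtypes, so the nullable integer, categorical, and datetime columns need to be declared again when the cleaned file is read back. The sketch below round-trips a hypothetical two-row frame through a temporary gzip file; the file name and values are illustrative only.

```python
import os
import tempfile
import pandas as pd

# Hypothetical cleaned frame with the three kinds of columns used above.
cleaned = pd.DataFrame(
    {
        "IDNCASE": pd.array([13758313, 14870586], dtype="Int64"),
        "GENDER": pd.Categorical(["F", None]),
        "DATE_OF_ENTRY": pd.to_datetime(["2024-01-21", None]),
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_cleaned.csv.gz")  # hypothetical file name
    cleaned.to_csv(path, index=False, compression="gzip")

    # Restate the dtypes on load; compression is inferred from the extension.
    reloaded = pd.read_csv(
        path,
        dtype={"IDNCASE": "Int64", "GENDER": "category"},
        parse_dates=["DATE_OF_ENTRY"],
    )

print(reloaded.dtypes)
```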

### Cleaned `A_TblCase`

- Filtered to ~1.86M juvenile-related cases based on `IDNCASE`.
- Reset index after filtering.
- Removed unused categories from categorical columns.
- Dropped `DETENTION_DATE` (100% missing).
- Converted date fields to `datetime64[ns]`:
  - `LATEST_HEARING`
  - `DATE_OF_ENTRY`
  - `C_BIRTHDATE` (parsed from `MM/YYYY` format)
  - `DATE_DETAINED`
  - `DATE_RELEASED`
- Cleaned categorical features:
  - `LATEST_CAL_TYPE`: kept only `M` (Master) and `I` (Individual); others set to `NaN`
  - `GENDER`: kept only `M` and `F`; others set to `NaN`

Saved cleaned file as:  
`juvenile_cases_cleaned.csv.gz`