In [15]:
import pandas as pd

In [16]:
pd.options.display.max_columns = None  # None removes the column display limit

### Clean `A_TblCase` table

#### Initial Data Inspection

The dataset contains approximately 12 million rows. To avoid memory issues and unnecessary processing, only the first 1,000 rows are loaded at first, in order to:

- Get an overview of the data types (`dtypes`)
- Identify columns/features worth keeping for the EDA
- Skip any columns that appear to be irrelevant or redundant

This initial check will help streamline the analysis and focus only on useful information.

In [17]:
case_path = "../../data/raw/A_TblCase.csv"

cases = pd.read_csv(
    filepath_or_buffer=case_path, delimiter="\t", nrows=1000
)

In [18]:
cases.head()

Unnamed: 0,IDNCASE,ALIEN_CITY,ALIEN_STATE,ALIEN_ZIPCODE,UPDATED_ZIPCODE,UPDATED_CITY,NAT,LANG,CUSTODY,SITE_TYPE,E_28_DATE,ATTY_NBR,CASE_TYPE,UPDATE_SITE,LATEST_HEARING,LATEST_TIME,LATEST_CAL_TYPE,UP_BOND_DATE,UP_BOND_RSN,CORRECTIONAL_FAC,RELEASE_MONTH,RELEASE_YEAR,INMATE_HOUSING,DATE_OF_ENTRY,C_ASY_TYPE,C_BIRTHDATE,C_RELEASE_DATE,UPDATED_STATE,ADDRESS_CHANGEDON,ZBOND_MRG_FLAG,Sex,DATE_DETAINED,DATE_RELEASED,LPR,DETENTION_DATE,DETENTION_LOCATION,DCO_LOCATION,DETENTION_FACILITY_TYPE,CASEPRIORITY_CODE
0,6807606,TACOMA,WA,98421,,,MX,SP,D,,,,RMV,TAC,,,,,,,,,,,,,,,,,,,,,,,,,
1,6807608,GRANITE CITY,IL,62040,,,RP,ENG,R,,2012-03-05 00:00:00.000,3.0,RMV,KAN,2012-08-16 00:00:00.000,1300.0,M,,,,,,,1996-02-19 00:00:00.000,,7/1969,,,2012-02-27 12:50:34.993,,,2011-07-19 00:00:00.000,2011-08-24 00:00:00.000,0.0,,,,,
2,6807609,PEARSALL,TX,78061,,,MX,SP,D,,2020-04-08 00:00:00.000,0.0,RMV,PSD,2021-07-19 00:00:00.000,1330.0,M,,,,,,,,,,,,2020-02-10 17:03:43.040,,,2020-02-03 00:00:00.000,,,,,,,
3,6807610,ANNANDALE,VA,22003,,,ES,SP,R,,2023-01-17 00:00:00.000,0.0,RMV,ANN,2024-09-09 00:00:00.000,1430.0,I,,,,,,,,E,5/1971,,,2023-04-28 10:46:46.000,,,2011-07-07 00:00:00.000,2011-09-15 00:00:00.000,,,,,,
4,6807611,ORANGE,CA,92868,,,MX,SP,D,,,,RMV,NLA,,,,,,,,,,,,7/1981,,,,,,,,,,,,,


In [19]:
cases.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 39 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   IDNCASE                  1000 non-null   int64  
 1   ALIEN_CITY               998 non-null    object 
 2   ALIEN_STATE              999 non-null    object 
 3   ALIEN_ZIPCODE            999 non-null    object 
 4   UPDATED_ZIPCODE          6 non-null      float64
 5   UPDATED_CITY             6 non-null      object 
 6   NAT                      983 non-null    object 
 7   LANG                     985 non-null    object 
 8   CUSTODY                  985 non-null    object 
 9   SITE_TYPE                601 non-null    object 
 10  E_28_DATE                485 non-null    object 
 11  ATTY_NBR                 775 non-null    float64
 12  CASE_TYPE                1000 non-null   object 
 13  UPDATE_SITE              985 non-null    object 
 14  LATEST_HEARING           

In [20]:
cases.columns

Index(['IDNCASE', 'ALIEN_CITY', 'ALIEN_STATE', 'ALIEN_ZIPCODE',
       'UPDATED_ZIPCODE', 'UPDATED_CITY', 'NAT', 'LANG', 'CUSTODY',
       'SITE_TYPE', 'E_28_DATE', 'ATTY_NBR', 'CASE_TYPE', 'UPDATE_SITE',
       'LATEST_HEARING', 'LATEST_TIME', 'LATEST_CAL_TYPE', 'UP_BOND_DATE',
       'UP_BOND_RSN', 'CORRECTIONAL_FAC', 'RELEASE_MONTH', 'RELEASE_YEAR',
       'INMATE_HOUSING', 'DATE_OF_ENTRY', 'C_ASY_TYPE', 'C_BIRTHDATE',
       'C_RELEASE_DATE', 'UPDATED_STATE', 'ADDRESS_CHANGEDON',
       'ZBOND_MRG_FLAG', 'Sex', 'DATE_DETAINED', 'DATE_RELEASED', 'LPR',
       'DETENTION_DATE', 'DETENTION_LOCATION', 'DCO_LOCATION',
       'DETENTION_FACILITY_TYPE', 'CASEPRIORITY_CODE'],
      dtype='object')

#### Selected Features for EDA – `A_TblCase`

The list of selected columns below was discussed earlier in the documentation for the source dataset. It represents the core case-level features relevant to our analysis of juvenile immigration cases.

These fields include:

- Demographic information (e.g., `Sex`, `NAT`, `LANG`)  
- Case characteristics (e.g., `CASE_TYPE`, `CUSTODY`, `LATEST_HEARING`)  
- Key dates (e.g., `DATE_OF_ENTRY`, `DETENTION_DATE`, `C_BIRTHDATE`)
- Representation status: `ATTY_NBR` exists in `A_TblCase` only as a legacy field, so it is not selected here. Accurate, up-to-date representation data should be derived from the `tbl_RepsAssigned` table using the `IDNCASE` key.

In [21]:
selected_columns = [
    "IDNCASE",
    "NAT",
    "LANG",
    "CUSTODY",
    "CASE_TYPE",
    "LATEST_HEARING",
    "LATEST_CAL_TYPE",
    "DATE_OF_ENTRY",
    "C_BIRTHDATE",
    "Sex",
    "DATE_DETAINED",
    "DATE_RELEASED",
    "DETENTION_DATE",
]

In [22]:
cases = cases[selected_columns]

In [23]:
cases.head()

Unnamed: 0,IDNCASE,NAT,LANG,CUSTODY,CASE_TYPE,LATEST_HEARING,LATEST_CAL_TYPE,DATE_OF_ENTRY,C_BIRTHDATE,Sex,DATE_DETAINED,DATE_RELEASED,DETENTION_DATE
0,6807606,MX,SP,D,RMV,,,,,,,,
1,6807608,RP,ENG,R,RMV,2012-08-16 00:00:00.000,M,1996-02-19 00:00:00.000,7/1969,,2011-07-19 00:00:00.000,2011-08-24 00:00:00.000,
2,6807609,MX,SP,D,RMV,2021-07-19 00:00:00.000,M,,,,2020-02-03 00:00:00.000,,
3,6807610,ES,SP,R,RMV,2024-09-09 00:00:00.000,I,,5/1971,,2011-07-07 00:00:00.000,2011-09-15 00:00:00.000,
4,6807611,MX,SP,D,RMV,,,,7/1981,,,,


In [24]:
cases.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 13 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   IDNCASE          1000 non-null   int64  
 1   NAT              983 non-null    object 
 2   LANG             985 non-null    object 
 3   CUSTODY          985 non-null    object 
 4   CASE_TYPE        1000 non-null   object 
 5   LATEST_HEARING   817 non-null    object 
 6   LATEST_CAL_TYPE  818 non-null    object 
 7   DATE_OF_ENTRY    723 non-null    object 
 8   C_BIRTHDATE      249 non-null    object 
 9   Sex              253 non-null    object 
 10  DATE_DETAINED    195 non-null    object 
 11  DATE_RELEASED    78 non-null     object 
 12  DETENTION_DATE   0 non-null      float64
dtypes: float64(1), int64(1), object(11)
memory usage: 101.7+ KB


#### Note on `DETENTION_DATE`

This field is entirely null in the sample, but we'll reassess after loading the full dataset.

In [25]:
cases.dtypes

IDNCASE              int64
NAT                 object
LANG                object
CUSTODY             object
CASE_TYPE           object
LATEST_HEARING      object
LATEST_CAL_TYPE     object
DATE_OF_ENTRY       object
C_BIRTHDATE         object
Sex                 object
DATE_DETAINED       object
DATE_RELEASED       object
DETENTION_DATE     float64
dtype: object

#### Specifying Column Data Types

- `Int64`: Used for `IDNCASE` to allow nullable integer values.
- `category`: Applied to string columns with repeated values (`NAT`, `LANG`, `CUSTODY`, `CASE_TYPE`, `LATEST_CAL_TYPE`, `Sex`) for efficient storage and faster processing.
- `string`: Used for `C_BIRTHDATE` since it uses a partial date format (`MM/YYYY`) and may contain nulls.
- `float64`: Used for `DETENTION_DATE`, which appears to contain only nulls in the sample but may include numeric timestamps or intervals in the full dataset.
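
The storage benefit of `category` for low-cardinality string columns can be illustrated on a small synthetic column. This is only a sketch: the `codes` Series is made-up data standing in for a field such as `CASE_TYPE`, and the exact byte counts depend on the pandas version.

```python
import pandas as pd

# Synthetic low-cardinality column (illustrative data, not from A_TblCase).
codes = pd.Series(["RMV", "CFR", "RFR", "WHO"] * 25_000)

as_object = codes.astype("object")      # one Python string per row
as_category = codes.astype("category")  # small integer codes + 4 labels

object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f"object:   {object_bytes:,} bytes")
print(f"category: {category_bytes:,} bytes")
```

The gain shrinks as the number of distinct values approaches the number of rows, which is why identifier-like columns such as `IDNCASE` are not stored as categories.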

In [26]:
dtype = {
    "IDNCASE": "Int64",
    "NAT": "category",
    "LANG": "category",
    "CUSTODY": "category",
    "CASE_TYPE": "category",
    "LATEST_CAL_TYPE": "category",
    "Sex": "category",
    "C_BIRTHDATE": "string",
    "DETENTION_DATE": "float64",
}

In [None]:
cases = pd.read_csv(
    filepath_or_buffer=case_path,
    delimiter="\t",
    encoding="utf-8",
    on_bad_lines="skip",
    usecols=selected_columns,
    dtype=dtype,
    parse_dates=["LATEST_HEARING", "DATE_OF_ENTRY", "DATE_DETAINED", "DATE_RELEASED"],
    low_memory=False
)

#### Data Inspection

In [28]:
cases.head()

Unnamed: 0,IDNCASE,NAT,LANG,CUSTODY,CASE_TYPE,LATEST_HEARING,LATEST_CAL_TYPE,DATE_OF_ENTRY,C_BIRTHDATE,Sex,DATE_DETAINED,DATE_RELEASED,DETENTION_DATE
0,6807606,MX,SP,D,RMV,,,,,,,,
1,6807608,RP,ENG,R,RMV,2012-08-16 00:00:00.000,M,1996-02-19 00:00:00.000,7/1969,,2011-07-19 00:00:00.000,2011-08-24 00:00:00.000,
2,6807609,MX,SP,D,RMV,2021-07-19 00:00:00.000,M,,,,2020-02-03 00:00:00.000,,
3,6807610,ES,SP,R,RMV,2024-09-09 00:00:00.000,I,,5/1971,,2011-07-07 00:00:00.000,2011-09-15 00:00:00.000,
4,6807611,MX,SP,D,RMV,,,,7/1981,,,,


In [29]:
cases.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12124239 entries, 0 to 12124238
Data columns (total 13 columns):
 #   Column           Dtype   
---  ------           -----   
 0   IDNCASE          Int64   
 1   NAT              category
 2   LANG             category
 3   CUSTODY          category
 4   CASE_TYPE        category
 5   LATEST_HEARING   object  
 6   LATEST_CAL_TYPE  category
 7   DATE_OF_ENTRY    object  
 8   C_BIRTHDATE      string  
 9   Sex              category
 10  DATE_DETAINED    object  
 11  DATE_RELEASED    object  
 12  DETENTION_DATE   float64 
dtypes: Int64(1), category(6), float64(1), object(4), string(1)
memory usage: 751.6+ MB


In [30]:
cases.shape

(12124239, 13)

#### Filtering for Juvenile Cases

The `cases` DataFrame includes both adult and juvenile records.  
Juvenile cases are isolated by filtering with the list of `idnCase` values extracted from `tbl_JuvenileHistory`.  
This avoids performing a full table join and improves processing efficiency.
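
The difference between filtering with `isin` and joining can be seen on toy data. In this sketch, `cases_demo` and `history_demo` are hypothetical stand-ins for the two tables; the history table is assumed to hold several rows per case, which is what makes a plain inner join duplicate case rows.

```python
import pandas as pd

# Hypothetical miniature versions of the case and juvenile-history tables.
cases_demo = pd.DataFrame(
    {
        "IDNCASE": pd.array([1, 2, 3, 4], dtype="Int64"),
        "NAT": ["MX", "HO", "ES", "GT"],
    }
)
history_demo = pd.DataFrame(
    {"idnCase": pd.array([2, 2, 4, None], dtype="Int64")}
)

# Membership filter: each matching case appears exactly once.
ids = history_demo["idnCase"].dropna().unique()
via_isin = cases_demo[cases_demo["IDNCASE"].isin(ids)].reset_index(drop=True)

# Plain inner join: case 2 is repeated because it has two history rows.
naive_merge = cases_demo.merge(
    history_demo.dropna(), left_on="IDNCASE", right_on="idnCase", how="inner"
)

print(len(via_isin), len(naive_merge))
```

A join would give the same rows only after de-duplicating the history keys first, so the membership filter is the simpler choice when no history columns are needed.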

In [37]:
juvenile_history = pd.read_csv(
    "../../data/cleaned/juvenile_history_cleaned.csv.gz",
    usecols=["idnCase"],
    dtype={"idnCase": "Int64"},
)

In [38]:
juvenile_case_ids = juvenile_history["idnCase"].dropna().unique()

In [39]:
juvenile_cases = cases[cases["IDNCASE"].isin(juvenile_case_ids)].reset_index(drop=True)

In [47]:
juvenile_cases.head()

Unnamed: 0,IDNCASE,NAT,LANG,CUSTODY,CASE_TYPE,LATEST_HEARING,LATEST_CAL_TYPE,DATE_OF_ENTRY,C_BIRTHDATE,Sex,DATE_DETAINED,DATE_RELEASED
0,6807609,MX,SP,D,RMV,2021-07-19 00:00:00.000,M,,,,2020-02-03 00:00:00.000,
1,6807621,TZ,SWA,D,RMV,,,,,,2011-09-27 00:00:00.000,
2,6807631,HO,SP,D,RMV,,,2011-07-01 00:00:00.000,,,2011-10-19 00:00:00.000,
3,6807676,MX,SP,N,RMV,,,,,,,
4,6807683,MX,ENG,N,RMV,2016-03-10 00:00:00.000,M,,,,,


In [42]:
juvenile_cases.shape

(1923172, 13)

In [43]:
juvenile_cases.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1923172 entries, 0 to 1923171
Data columns (total 13 columns):
 #   Column           Dtype   
---  ------           -----   
 0   IDNCASE          Int64   
 1   NAT              category
 2   LANG             category
 3   CUSTODY          category
 4   CASE_TYPE        category
 5   LATEST_HEARING   object  
 6   LATEST_CAL_TYPE  category
 7   DATE_OF_ENTRY    object  
 8   C_BIRTHDATE      string  
 9   Sex              category
 10  DATE_DETAINED    object  
 11  DATE_RELEASED    object  
 12  DETENTION_DATE   float64 
dtypes: Int64(1), category(6), float64(1), object(4), string(1)
memory usage: 119.2+ MB


#### Category Cleanup

Removed unused category levels after filtering to ensure all categorical columns reflect only the values present in the `juvenile_cases` subset.

In [40]:
for col in juvenile_cases.select_dtypes(include="category"):
    juvenile_cases[col] = juvenile_cases[col].cat.remove_unused_categories()
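
Why this step is needed: filtering rows does not shrink a categorical column's list of categories. A minimal sketch on a made-up Series:

```python
import pandas as pd

# Illustrative categorical column; values are made up.
cal_type = pd.Series(["M", "I", "N", "M"], dtype="category")

# Row filtering keeps "N" in the category list even though no row uses it.
filtered = cal_type[cal_type != "N"]
before = list(filtered.cat.categories)

# Dropping unused levels makes the category list match the data.
trimmed = filtered.cat.remove_unused_categories()
after = list(trimmed.cat.categories)

print(before, after)
```

Leftover levels would otherwise show up as zero-count rows in `value_counts()` and as empty groups in plots.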

#### Missing Values Summary

Counts and percentages of missing values were calculated for each column in the `juvenile_cases` dataset to assess data completeness.  
`.isna().sum()` computes the total number of missing entries, and the percentage is derived by dividing by the total number of rows.  
Columns are sorted by missing count to highlight those requiring attention during data cleaning.

In [44]:
null_counts = juvenile_cases.isna().sum()
percent_missing = (null_counts / len(juvenile_cases)) * 100

missing_summary = pd.DataFrame(
    {"Missing Count": null_counts, "Missing %": percent_missing.round(2)}
).sort_values(by="Missing Count", ascending=False)

display(missing_summary)

Unnamed: 0,Missing Count,Missing %
DETENTION_DATE,1923171,100.0
DATE_RELEASED,1460521,75.94
DATE_DETAINED,1084523,56.39
Sex,503601,26.19
DATE_OF_ENTRY,500057,26.0
C_BIRTHDATE,460967,23.97
LATEST_CAL_TYPE,317532,16.51
LATEST_HEARING,310755,16.16
NAT,3290,0.17
LANG,1870,0.1


The `DETENTION_DATE` column is almost entirely null (1,923,171 of 1,923,172 rows missing, which rounds to 100%) and will therefore be dropped from the dataset.

In [45]:
juvenile_cases = juvenile_cases.drop("DETENTION_DATE", axis=1)

#### Datetime Format Validation

The following date-related features will be the focus of the next stage of preprocessing:  
- `LATEST_HEARING`  
- `DATE_OF_ENTRY`  
- `C_BIRTHDATE`  
- `DATE_DETAINED`  
- `DATE_RELEASED`  

These features may require format standardization and conversion to datetime objects to enable accurate temporal analysis.

Every datetime feature (except `C_BIRTHDATE`) follows the format `'YYYY-MM-DD 00:00:00.000'` (e.g., `'2025-02-04 00:00:00.000'`).

Before conversion, each feature will be tested against this pattern to ensure values are valid.  
All non-null entries will be checked to avoid unintended data loss during transformation with `pd.to_datetime()`.

Only the **`YYYY-MM-DD`** portion of each timestamp will be retained.

In [48]:
def find_invalid_dates(df, column):
    """
    Return non-null rows whose value does not contain a `YYYY-MM-DD` date.
    """
    return df[
        df[column].notna()
        & ~df[column].astype(str).str.contains(r"\d{4}-\d{2}-\d{2}", regex=True)
    ][[column]]
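
Because `str.contains` matches anywhere in the string, the check above accepts any value that merely embeds a date. A stricter variant is sketched below, assuming the values are either `YYYY-MM-DD` or `YYYY-MM-DD HH:MM:SS.fff`; the helper name `find_invalid_dates_strict`, the pattern constant, and the demo frame are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Assumed layout: a date, optionally followed by a time with fractional seconds.
TIMESTAMP_PATTERN = r"\d{4}-\d{2}-\d{2}(?: \d{2}:\d{2}:\d{2}(?:\.\d+)?)?"


def find_invalid_dates_strict(df, column):
    """Return non-null rows whose whole value does not match the pattern."""
    values = df[column].astype("string")
    matches = values.str.fullmatch(TIMESTAMP_PATTERN).fillna(False).astype(bool)
    return df.loc[values.notna() & ~matches, [column]]


# Made-up values: one valid timestamp, one site code, one null, one embedded date.
demo = pd.DataFrame(
    {"LATEST_HEARING": ["2020-02-03 00:00:00.000", "PHI", None, "x2020-02-03y"]}
)

strict_invalid = find_invalid_dates_strict(demo, "LATEST_HEARING")

# The unanchored check, for comparison, flags only "PHI".
loose_mask = demo["LATEST_HEARING"].notna() & ~demo["LATEST_HEARING"].astype(
    str
).str.contains(r"\d{4}-\d{2}-\d{2}", regex=True)

print(strict_invalid["LATEST_HEARING"].tolist(), int(loose_mask.sum()))
```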

In [49]:
invalid_latest_hearing = find_invalid_dates(juvenile_cases, "LATEST_HEARING")
invalid_date_of_entry = find_invalid_dates(juvenile_cases, "DATE_OF_ENTRY")
invalid_date_detained = find_invalid_dates(juvenile_cases, "DATE_DETAINED")
invalid_date_released = find_invalid_dates(juvenile_cases, "DATE_RELEASED")

In [50]:
def report_invalid(name, df):
    """
    Prints the number of invalid entries and displays the DataFrame if not empty.
    """
    count = len(df)
    print(f"{name}: {count} invalid entr{'y' if count == 1 else 'ies'}")
    if count > 0:
        display(df)

In [51]:
report_invalid("LATEST_HEARING", invalid_latest_hearing)
report_invalid("DATE_OF_ENTRY", invalid_date_of_entry)
report_invalid("DATE_DETAINED", invalid_date_detained)
report_invalid("DATE_RELEASED", invalid_date_released)

LATEST_HEARING: 11 invalid entries


Unnamed: 0,LATEST_HEARING
1658348,PHI
1679699,SFR
1683625,NEW
1723497,CHE
1738962,HAR
1742864,NYV
1743005,NYV
1747408,PHI
1775947,DAL
1805844,SNA


DATE_OF_ENTRY: 0 invalid entries
DATE_DETAINED: 8 invalid entries


Unnamed: 0,DATE_DETAINED
1658348,F
1679699,M
1738962,F
1742864,M
1743005,M
1747408,M
1805844,M
1908366,F


DATE_RELEASED: 0 invalid entries


Convert columns `LATEST_HEARING`, `DATE_OF_ENTRY`, `DATE_DETAINED`, and `DATE_RELEASED` to datetime using `pd.to_datetime()` with `errors='coerce'`.  
This ensures consistent formatting while automatically converting invalid entries to `NaT`, avoiding the need for separate error handling.

In [52]:
date_cols = ["LATEST_HEARING", "DATE_OF_ENTRY", "DATE_DETAINED", "DATE_RELEASED"]

juvenile_cases[date_cols] = juvenile_cases[date_cols].apply(
    lambda col: pd.to_datetime(col, errors="coerce")
)
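
Since `errors="coerce"` silently replaces unparseable values with `NaT`, it can be useful to count how many non-null values were lost in the conversion. A minimal sketch on made-up values (the `raw` Series is illustrative):

```python
import pandas as pd

# Made-up column: two valid timestamps, one stray site code, one null.
raw = pd.Series(
    ["2021-07-19 00:00:00.000", "PHI", None, "2016-03-10 00:00:00.000"]
)

parsed = pd.to_datetime(raw, errors="coerce")

# Values that were present before parsing but are NaT afterwards.
newly_missing = int((raw.notna() & parsed.isna()).sum())

print(f"coerced to NaT: {newly_missing}")
```

Comparing this count with the output of the validation step gives a quick check that only the expected malformed entries were discarded.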

Identify non-null `C_BIRTHDATE` values that do not match the expected `MM/YYYY` format.  
This helps reveal alternate formats and prevents unintended data loss during conversion.

In [53]:
invalid_birthdates = juvenile_cases[
    juvenile_cases["C_BIRTHDATE"].notna()
    & ~juvenile_cases["C_BIRTHDATE"]
    .astype(str)
    .str.contains(r"^\d{1,2}/\d{4}$", regex=True)
][["C_BIRTHDATE"]]

In [54]:
print(f"Invalid C_BIRTHDATE entries: {len(invalid_birthdates)}")
display(invalid_birthdates)

Invalid C_BIRTHDATE entries: 8


Unnamed: 0,C_BIRTHDATE
1658348,E
1723497,E
1738962,E
1742864,E
1743005,E
1747408,E
1775947,E
1805844,E


In [55]:
juvenile_cases["C_BIRTHDATE"] = pd.to_datetime(
    juvenile_cases["C_BIRTHDATE"], format="%m/%Y", errors="coerce"
)
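
One consequence of parsing `MM/YYYY` values with `format="%m/%Y"` is that the day is not in the data and defaults to the first of the month, so ages derived from `C_BIRTHDATE` are accurate only to the month. A small sketch with made-up values:

```python
import pandas as pd

# Illustrative partial birthdates, one stray code, and one null.
births = pd.Series(["7/1969", "5/1971", "E", None], dtype="string")

parsed_births = pd.to_datetime(births, format="%m/%Y", errors="coerce")

# The missing day component is filled in as the 1st of the month.
print(parsed_births)
```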

#### Date Conversion Check

Confirmed that all date columns were successfully converted to `datetime64[ns]` format.

In [56]:
date_cols = [
    "LATEST_HEARING",
    "DATE_OF_ENTRY",
    "DATE_DETAINED",
    "DATE_RELEASED",
    "C_BIRTHDATE",
]
print(juvenile_cases[date_cols].dtypes)

LATEST_HEARING    datetime64[ns]
DATE_OF_ENTRY     datetime64[ns]
DATE_DETAINED     datetime64[ns]
DATE_RELEASED     datetime64[ns]
C_BIRTHDATE       datetime64[ns]
dtype: object


In [57]:
juvenile_cases.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1923172 entries, 0 to 1923171
Data columns (total 12 columns):
 #   Column           Dtype         
---  ------           -----         
 0   IDNCASE          Int64         
 1   NAT              category      
 2   LANG             category      
 3   CUSTODY          category      
 4   CASE_TYPE        category      
 5   LATEST_HEARING   datetime64[ns]
 6   LATEST_CAL_TYPE  category      
 7   DATE_OF_ENTRY    datetime64[ns]
 8   C_BIRTHDATE      datetime64[ns]
 9   Sex              category      
 10  DATE_DETAINED    datetime64[ns]
 11  DATE_RELEASED    datetime64[ns]
dtypes: Int64(1), category(6), datetime64[ns](5)
memory usage: 104.5 MB


In [58]:
juvenile_cases.head()

Unnamed: 0,IDNCASE,NAT,LANG,CUSTODY,CASE_TYPE,LATEST_HEARING,LATEST_CAL_TYPE,DATE_OF_ENTRY,C_BIRTHDATE,Sex,DATE_DETAINED,DATE_RELEASED
0,6807609,MX,SP,D,RMV,2021-07-19,M,NaT,NaT,,2020-02-03,NaT
1,6807621,TZ,SWA,D,RMV,NaT,,NaT,NaT,,2011-09-27,NaT
2,6807631,HO,SP,D,RMV,NaT,,2011-07-01,NaT,,2011-10-19,NaT
3,6807676,MX,SP,N,RMV,NaT,,NaT,NaT,,NaT,NaT
4,6807683,MX,ENG,N,RMV,2016-03-10,M,NaT,NaT,,NaT,NaT


#### Categorical Value Cleanup

Standardize values in key categorical features (`NAT`, `LANG`, `CUSTODY`, `CASE_TYPE`, `LATEST_CAL_TYPE`, `Sex`) to ensure consistency and prevent issues caused by typos or rare variants.

`NAT` and `LANG` contain a large number of unique values.  
These are retained in full to preserve detail and will be grouped or simplified as needed during analysis.

In [59]:
juvenile_cases["CUSTODY"].value_counts()

CUSTODY
N      1040503
R       488510
D       394144
SP           9
LIN          1
POR          1
Name: count, dtype: int64

In [60]:
juvenile_cases["CASE_TYPE"].value_counts()

CASE_TYPE
RMV    1781888
CFR      89495
RFR      22294
WHO      22261
AOC       4909
DEP       1872
REC        227
EXC        156
CSR         54
0            6
AOL          2
NAC          2
DCC          1
Name: count, dtype: int64

In [61]:
juvenile_cases["LATEST_CAL_TYPE"].value_counts()

LATEST_CAL_TYPE
M       1100100
I        505503
N            25
0900          4
0830          2
1030          2
R             2
1000          1
1300          1
Name: count, dtype: int64

Clean `LATEST_CAL_TYPE` by keeping known values (`M` = Master, `I` = Individual) and replacing unexpected entries (e.g., time strings or unknown codes) with `NaN`.

In [62]:
valid_types = ["M", "I"]

juvenile_cases["LATEST_CAL_TYPE"] = (
    juvenile_cases["LATEST_CAL_TYPE"]
    .apply(lambda x: x if x in valid_types else pd.NA)
    .astype("category")
)

juvenile_cases["LATEST_CAL_TYPE"] = juvenile_cases[
    "LATEST_CAL_TYPE"
].cat.remove_unused_categories()
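
As an alternative to the row-wise `apply`, the same cleanup can be written with vectorized operations, which tends to be faster on columns of this size. This is a sketch on a made-up Series, not a change to the cell above:

```python
import pandas as pd

# Made-up calendar-type values, including time strings and a null.
cal_type = pd.Series(["M", "I", "N", "0900", None, "M"], dtype="category")
valid_types = ["M", "I"]

# Keep valid codes, set everything else to missing, then drop unused levels.
cleaned = cal_type.where(cal_type.isin(valid_types)).cat.remove_unused_categories()

print(cleaned.value_counts())
```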

In [63]:
juvenile_cases["Sex"].value_counts()

Sex
M    909482
F    510082
N         6
U         1
Name: count, dtype: int64

Clean `Sex` by keeping known values (`M` = Male, `F` = Female) and replacing rare or unclear codes (`N`, `U`) with `NaN` to ensure consistency and avoid ambiguity in sex-related analysis.

In [64]:
juvenile_cases["Sex"] = (
    juvenile_cases["Sex"]
    .apply(lambda x: x if x in ["M", "F"] else pd.NA)
    .astype("category")
)

juvenile_cases["Sex"] = juvenile_cases["Sex"].cat.remove_unused_categories()

In [65]:
juvenile_cases.head()

Unnamed: 0,IDNCASE,NAT,LANG,CUSTODY,CASE_TYPE,LATEST_HEARING,LATEST_CAL_TYPE,DATE_OF_ENTRY,C_BIRTHDATE,Sex,DATE_DETAINED,DATE_RELEASED
0,6807609,MX,SP,D,RMV,2021-07-19,M,NaT,NaT,,2020-02-03,NaT
1,6807621,TZ,SWA,D,RMV,NaT,,NaT,NaT,,2011-09-27,NaT
2,6807631,HO,SP,D,RMV,NaT,,2011-07-01,NaT,,2011-10-19,NaT
3,6807676,MX,SP,N,RMV,NaT,,NaT,NaT,,NaT,NaT
4,6807683,MX,ENG,N,RMV,2016-03-10,M,NaT,NaT,,NaT,NaT


In [66]:
juvenile_cases.to_csv(
    "../../data/cleaned/juvenile_cases_cleaned.csv.gz", index=False, compression="gzip"
)
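
CSV does not store dtypes, so the nullable integer, categorical, and datetime types set above have to be re-declared when the cleaned file is read back. The sketch below shows the round trip on a tiny made-up frame written to a temporary directory; the file name `demo_cases.csv.gz` is hypothetical.

```python
import os
import tempfile

import pandas as pd

# Tiny stand-in for the cleaned table (values are illustrative).
demo = pd.DataFrame(
    {
        "IDNCASE": pd.array([6807609, 6807621], dtype="Int64"),
        "CUSTODY": pd.Categorical(["D", "N"]),
        "LATEST_HEARING": pd.to_datetime(["2021-07-19", None]),
    }
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_cases.csv.gz")
    demo.to_csv(path, index=False, compression="gzip")

    # Re-declare dtypes on load; compression is inferred from the extension.
    reloaded = pd.read_csv(
        path,
        dtype={"IDNCASE": "Int64", "CUSTODY": "category"},
        parse_dates=["LATEST_HEARING"],
    )

print(reloaded.dtypes)
```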

### Cleaned `A_TblCase`

- Filtered to ~1.92M juvenile-related cases (1,923,172 rows) using `IDNCASE`.
- Dropped irrelevant or fully null columns.
- Removed unused categories from categorical features.
- Converted date columns to `datetime64[ns]`:
  - `LATEST_HEARING`, `DATE_OF_ENTRY`, `C_BIRTHDATE` (parsed from `MM/YYYY`), `DATE_DETAINED`, `DATE_RELEASED`
- Standardized categorical features:
  - `LATEST_CAL_TYPE`: retained `M` (Master) and `I` (Individual); others set to `NaN`
  - `Sex`: retained `M` and `F`; others set to `NaN`

Saved as: `juvenile_cases_cleaned.csv.gz`  
Original row count: 12,124,239