In [480]:
import pandas as pd
import numpy as np

In [481]:
pd.options.display.max_columns = None  # None shows every column

## 1. Clean tbl_JuvenileHistory

### Initial Data Inspection


The dataset contains approximately 3 million rows. To avoid memory issues and unnecessary processing, only the first 100 rows are inspected at first. The goals are to:

- Get an overview of the data types (`dtypes`)
- Identify columns/features worth keeping for the EDA
- Skip any columns that appear to be irrelevant or redundant

This initial check will help streamline the analysis and focus only on useful information.

In [482]:
juvenile_history_path = "tbl_JuvenileHistory.csv"
juvenile_history = pd.read_csv(
    filepath_or_buffer=juvenile_history_path, delimiter="\t", nrows=100
)

In [483]:
juvenile_history.head()

Unnamed: 0,idnJuvenileHistory,idnCase,idnProceeding,idnJuvenile,DATCREATEDON,DATMODIFIEDON
0,5,2046990,3200129,1,2014-09-06 19:24:46.373,
1,6,2047179,3199488,1,2014-09-06 19:24:46.373,
2,7,2047179,3199489,1,2014-09-06 19:24:46.373,
3,8,2047199,3199497,1,2014-09-06 19:24:46.373,
4,9,2047199,3199498,1,2014-09-06 19:24:46.373,


In [484]:
juvenile_history.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 6 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   idnJuvenileHistory  100 non-null    int64  
 1   idnCase             100 non-null    int64  
 2   idnProceeding       100 non-null    int64  
 3   idnJuvenile         100 non-null    int64  
 4   DATCREATEDON        100 non-null    object 
 5   DATMODIFIEDON       0 non-null      float64
dtypes: float64(1), int64(4), object(1)
memory usage: 4.8+ KB


In [485]:
juvenile_history.columns

Index(['idnJuvenileHistory', 'idnCase', 'idnProceeding', 'idnJuvenile',
       'DATCREATEDON', 'DATMODIFIEDON'],
      dtype='object')

**`idnJuvenile`**: Although stored as `int64`, this column only contains values from 1 to 6 (as per `Lookup/tblLookup_Juvenile.csv`). It should be treated as a categorical feature.

**`DATCREATEDON`** and **`DATMODIFIEDON`**:
  - `DATCREATEDON` likely reflects the date when the record was entered into the system, not when the case itself was initiated.
  - `DATMODIFIEDON` likely reflects the date when the record was edited in the system, not when the case itself was updated.
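
As a rough illustration of why a small code set such as `idnJuvenile` is worth storing as a category, the sketch below compares memory use on synthetic data. The values 1 to 6 mirror the range noted above; the row count and variable names are made up for the example.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for idnJuvenile: integer codes 1..6 (not the real data).
rng = np.random.default_rng(0)
codes = pd.Series(rng.integers(1, 7, size=100_000), name="idnJuvenile")

as_int = codes.astype("int64")
as_cat = codes.astype("category")

# Compare the bytes used by the values themselves (index excluded).
int_bytes = as_int.memory_usage(index=False, deep=True)
cat_bytes = as_cat.memory_usage(index=False, deep=True)

print(f"int64:    {int_bytes:,} bytes")
print(f"category: {cat_bytes:,} bytes")
```

One caveat, as far as the pandas I/O documentation describes it: when `dtype="category"` is passed to `read_csv`, the categories are parsed as strings, so the codes would be `"1"` to `"6"` rather than integers.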

In [486]:
juvenile_history = pd.read_csv(
    filepath_or_buffer=juvenile_history_path,
    delimiter="\t",
    usecols=["idnJuvenileHistory", "idnCase", "idnProceeding", "idnJuvenile"],
    dtype={
        "idnJuvenileHistory": "Int64",
        "idnCase": "Int64",
        "idnProceeding": "Int64",
        "idnJuvenile": "category",
    },
    low_memory=False,
)

### Data Inspection

In [487]:
juvenile_history.head()

Unnamed: 0,idnJuvenileHistory,idnCase,idnProceeding,idnJuvenile
0,5,2046990,3200129,1
1,6,2047179,3199488,1
2,7,2047179,3199489,1
3,8,2047199,3199497,1
4,9,2047199,3199498,1


In [488]:
juvenile_history.shape

(2876040, 4)

In [489]:
juvenile_history.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2876040 entries, 0 to 2876039
Data columns (total 4 columns):
 #   Column              Dtype   
---  ------              -----   
 0   idnJuvenileHistory  Int64   
 1   idnCase             Int64   
 2   idnProceeding       Int64   
 3   idnJuvenile         category
dtypes: Int64(3), category(1)
memory usage: 76.8 MB


In [490]:
juvenile_history.isna().sum()

idnJuvenileHistory      0
idnCase                 0
idnProceeding          99
idnJuvenile           999
dtype: int64

In [491]:
juvenile_history.duplicated().sum()

0

#### `idnCase` List

Extracted unique `idnCase` values from `tbl_JuvenileHistory`  
to filter `A_TblCase` and retain only juvenile-related records  
without performing a full table join.

In [492]:
juvenile_case_ids = juvenile_history["idnCase"].unique()

#### `idnProceeding` List

Extracted unique `idnProceeding` values from `tbl_JuvenileHistory`  
to filter `B_TblProceeding` and retain only juvenile-related records  
without performing a full table join.

In [493]:
juvenile_proceeding_ids = juvenile_history["idnProceeding"].dropna().unique()

In [494]:
juvenile_history.to_csv(
    "../cleaned/juvenile_history_cleaned.csv.gz", index=False, compression="gzip"
)

### Cleaned `tbl_JuvenileHistory`

- Loaded ~2.88M rows from the raw CSV file.
- Dropped `DATCREATEDON` and `DATMODIFIEDON` (record-keeping timestamps; the latter was fully null in the 100-row sample).
- Converted `idnJuvenile` to a categorical variable.
- Retained only 4 key fields:
  - `idnJuvenileHistory`: primary key
  - `idnCase`: foreign key to `A_TblCase`
  - `idnProceeding`: foreign key to `B_TblProceeding`
  - `idnJuvenile`: foreign key to `tblLookup_Juvenile`
- Missing values:
  - `idnProceeding`: 99 missing
  - `idnJuvenile`: 999 missing

Saved cleaned file as:
- `juvenile_history_cleaned.csv.gz` 

## 2. Clean A_TblCase

### Initial Data Inspection


The dataset contains approximately 12 million rows. To avoid memory issues and unnecessary processing, only the first 1000 rows are inspected at first. The goals are to:

- Get an overview of the data types (`dtypes`)
- Identify columns/features worth keeping for the EDA
- Skip any columns that appear to be irrelevant or redundant

This initial check will help streamline the analysis and focus only on useful information.

In [495]:
case_path = "A_TblCase.csv"
cases = pd.read_csv(filepath_or_buffer=case_path, delimiter="\t", nrows=1000)

In [496]:
cases.head()

Unnamed: 0,IDNCASE,ALIEN_CITY,ALIEN_STATE,ALIEN_ZIPCODE,UPDATED_ZIPCODE,UPDATED_CITY,NAT,LANG,CUSTODY,SITE_TYPE,E_28_DATE,ATTY_NBR,CASE_TYPE,UPDATE_SITE,LATEST_HEARING,LATEST_TIME,LATEST_CAL_TYPE,UP_BOND_DATE,UP_BOND_RSN,CORRECTIONAL_FAC,RELEASE_MONTH,RELEASE_YEAR,INMATE_HOUSING,DATE_OF_ENTRY,C_ASY_TYPE,C_BIRTHDATE,C_RELEASE_DATE,UPDATED_STATE,ADDRESS_CHANGEDON,ZBOND_MRG_FLAG,Sex,DATE_DETAINED,DATE_RELEASED,LPR,DETENTION_DATE,DETENTION_LOCATION,DCO_LOCATION,DETENTION_FACILITY_TYPE,CASEPRIORITY_CODE
0,6807606,TACOMA,WA,98421,,,MX,SP,D,,,,RMV,TAC,,,,,,,,,,,,,,,,,,,,,,,,,
1,6807608,GRANITE CITY,IL,62040,,,RP,ENG,R,,2012-03-05 00:00:00.000,3.0,RMV,KAN,2012-08-16 00:00:00.000,1300.0,M,,,,,,,1996-02-19 00:00:00.000,,7/1969,,,2012-02-27 12:50:34.993,,,2011-07-19 00:00:00.000,2011-08-24 00:00:00.000,0.0,,,,,
2,6807609,PEARSALL,TX,78061,,,MX,SP,D,,2020-04-08 00:00:00.000,0.0,RMV,PSD,2021-07-19 00:00:00.000,1330.0,M,,,,,,,,,,,,2020-02-10 17:03:43.040,,,2020-02-03 00:00:00.000,,,,,,,
3,6807610,ANNANDALE,VA,22003,,,ES,SP,R,,2023-01-17 00:00:00.000,0.0,RMV,ANN,2024-09-09 00:00:00.000,1430.0,I,,,,,,,,E,5/1971,,,2023-04-28 10:46:46.000,,,2011-07-07 00:00:00.000,2011-09-15 00:00:00.000,,,,,,
4,6807611,ORANGE,CA,92868,,,MX,SP,D,,,,RMV,NLA,,,,,,,,,,,,7/1981,,,,,,,,,,,,,


In [497]:
cases.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 39 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   IDNCASE                  1000 non-null   int64  
 1   ALIEN_CITY               998 non-null    object 
 2   ALIEN_STATE              999 non-null    object 
 3   ALIEN_ZIPCODE            999 non-null    object 
 4   UPDATED_ZIPCODE          6 non-null      float64
 5   UPDATED_CITY             6 non-null      object 
 6   NAT                      983 non-null    object 
 7   LANG                     985 non-null    object 
 8   CUSTODY                  985 non-null    object 
 9   SITE_TYPE                601 non-null    object 
 10  E_28_DATE                485 non-null    object 
 11  ATTY_NBR                 775 non-null    float64
 12  CASE_TYPE                1000 non-null   object 
 13  UPDATE_SITE              985 non-null    object 
 14  LATEST_HEARING           

In [498]:
cases.columns

Index(['IDNCASE', 'ALIEN_CITY', 'ALIEN_STATE', 'ALIEN_ZIPCODE',
       'UPDATED_ZIPCODE', 'UPDATED_CITY', 'NAT', 'LANG', 'CUSTODY',
       'SITE_TYPE', 'E_28_DATE', 'ATTY_NBR', 'CASE_TYPE', 'UPDATE_SITE',
       'LATEST_HEARING', 'LATEST_TIME', 'LATEST_CAL_TYPE', 'UP_BOND_DATE',
       'UP_BOND_RSN', 'CORRECTIONAL_FAC', 'RELEASE_MONTH', 'RELEASE_YEAR',
       'INMATE_HOUSING', 'DATE_OF_ENTRY', 'C_ASY_TYPE', 'C_BIRTHDATE',
       'C_RELEASE_DATE', 'UPDATED_STATE', 'ADDRESS_CHANGEDON',
       'ZBOND_MRG_FLAG', 'Sex', 'DATE_DETAINED', 'DATE_RELEASED', 'LPR',
       'DETENTION_DATE', 'DETENTION_LOCATION', 'DCO_LOCATION',
       'DETENTION_FACILITY_TYPE', 'CASEPRIORITY_CODE'],
      dtype='object')

### Selected Features for EDA – `A_TblCase`

The list of selected columns below was discussed earlier in the documentation for the source dataset. It represents the core case-level features relevant to our analysis of juvenile immigration cases.

These fields include:

- Demographic information (e.g., `Sex`, `NAT`, `LANG`)  
- Case characteristics (e.g., `CASE_TYPE`, `CUSTODY`, `LATEST_HEARING`)  
- Key dates (e.g., `DATE_OF_ENTRY`, `DETENTION_DATE`, `C_BIRTHDATE`)
- Representation status: `ATTY_NBR` exists as a legacy field in `A_TblCase`, but accurate and up-to-date representation data should be derived from the `tbl_RepsAssigned` table using the `IDNCASE` key.
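
A minimal sketch of how a representation flag could be derived from `tbl_RepsAssigned` via `IDNCASE`. The two small frames and the `HAS_REP` column name are hypothetical; `STRATTYTYPE` is the column inspected in section 3.

```python
import pandas as pd

# Toy stand-ins for A_TblCase and tbl_RepsAssigned (values are made up).
cases_demo = pd.DataFrame({"IDNCASE": [101, 102, 103, 104]})
reps_demo = pd.DataFrame(
    {
        "IDNCASE": [101, 101, 103, 104],
        "STRATTYTYPE": ["ALIEN", "ALIEN", "INS", "ALIEN"],
    }
)

# Cases with at least one attorney acting for the noncitizen.
alien_rows = reps_demo["STRATTYTYPE"] == "ALIEN"
represented_ids = reps_demo.loc[alien_rows, "IDNCASE"].unique()

# HAS_REP is a hypothetical column name used only for this sketch.
cases_demo["HAS_REP"] = cases_demo["IDNCASE"].isin(represented_ids)
print(cases_demo)
```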

In [499]:
selected_columns = [
    "IDNCASE",
    "NAT",
    "LANG",
    "CUSTODY",
    "CASE_TYPE",
    "LATEST_HEARING",
    "LATEST_CAL_TYPE",
    "DATE_OF_ENTRY",
    "C_BIRTHDATE",
    "Sex",
    "DATE_DETAINED",
    "DATE_RELEASED",
    "DETENTION_DATE",
]

In [500]:
cases = cases[selected_columns]

In [501]:
cases.head()

Unnamed: 0,IDNCASE,NAT,LANG,CUSTODY,CASE_TYPE,LATEST_HEARING,LATEST_CAL_TYPE,DATE_OF_ENTRY,C_BIRTHDATE,Sex,DATE_DETAINED,DATE_RELEASED,DETENTION_DATE
0,6807606,MX,SP,D,RMV,,,,,,,,
1,6807608,RP,ENG,R,RMV,2012-08-16 00:00:00.000,M,1996-02-19 00:00:00.000,7/1969,,2011-07-19 00:00:00.000,2011-08-24 00:00:00.000,
2,6807609,MX,SP,D,RMV,2021-07-19 00:00:00.000,M,,,,2020-02-03 00:00:00.000,,
3,6807610,ES,SP,R,RMV,2024-09-09 00:00:00.000,I,,5/1971,,2011-07-07 00:00:00.000,2011-09-15 00:00:00.000,
4,6807611,MX,SP,D,RMV,,,,7/1981,,,,


In [502]:
cases.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 13 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   IDNCASE          1000 non-null   int64  
 1   NAT              983 non-null    object 
 2   LANG             985 non-null    object 
 3   CUSTODY          985 non-null    object 
 4   CASE_TYPE        1000 non-null   object 
 5   LATEST_HEARING   817 non-null    object 
 6   LATEST_CAL_TYPE  818 non-null    object 
 7   DATE_OF_ENTRY    723 non-null    object 
 8   C_BIRTHDATE      249 non-null    object 
 9   Sex              253 non-null    object 
 10  DATE_DETAINED    195 non-null    object 
 11  DATE_RELEASED    78 non-null     object 
 12  DETENTION_DATE   0 non-null      float64
dtypes: float64(1), int64(1), object(11)
memory usage: 101.7+ KB


#### Note on `DETENTION_DATE`

This field is entirely null in the sample, but we'll reassess after  
loading the full dataset.

In [503]:
cases.dtypes

IDNCASE              int64
NAT                 object
LANG                object
CUSTODY             object
CASE_TYPE           object
LATEST_HEARING      object
LATEST_CAL_TYPE     object
DATE_OF_ENTRY       object
C_BIRTHDATE         object
Sex                 object
DATE_DETAINED       object
DATE_RELEASED       object
DETENTION_DATE     float64
dtype: object

#### Specifying Column Data Types

- `Int64`: Used for `IDNCASE` to allow nullable integer values.
- `category`: Applied to string columns with repeated values  
  (e.g., `NAT`, `LANG`, `CUSTODY`, `CASE_TYPE`, `LATEST_CAL_TYPE`, `Sex`)  
  for efficient storage and faster processing.
- `string`: Used for `C_BIRTHDATE` since it uses a partial date format (`MM/YYYY`)  
  and may contain nulls.
- `float64`: Used for `DETENTION_DATE`, which appears to contain only nulls in the  
  sample but may include numeric timestamps or intervals in the full dataset.

In [504]:
dtype = {
    "IDNCASE": "Int64",
    "NAT": "category",
    "LANG": "category",
    "CUSTODY": "category",
    "CASE_TYPE": "category",
    "LATEST_CAL_TYPE": "category",
    "Sex": "category",
    "C_BIRTHDATE": "string",
    "DETENTION_DATE": "float64",
}

In [505]:
cases = pd.read_csv(
    filepath_or_buffer=case_path,
    delimiter="\t",
    encoding="utf-8",
    on_bad_lines="skip",
    usecols=selected_columns,
    dtype=dtype,
    parse_dates=["LATEST_HEARING", "DATE_OF_ENTRY", "DATE_DETAINED", "DATE_RELEASED"],
    low_memory=False,
    # skiprows=[11711221],  # skip malformed row: C error - EOF inside string on this line
)

### Data Inspection

In [506]:
cases.head()

Unnamed: 0,IDNCASE,NAT,LANG,CUSTODY,CASE_TYPE,LATEST_HEARING,LATEST_CAL_TYPE,DATE_OF_ENTRY,C_BIRTHDATE,Sex,DATE_DETAINED,DATE_RELEASED,DETENTION_DATE
0,6807606,MX,SP,D,RMV,,,,,,,,
1,6807608,RP,ENG,R,RMV,2012-08-16 00:00:00.000,M,1996-02-19 00:00:00.000,7/1969,,2011-07-19 00:00:00.000,2011-08-24 00:00:00.000,
2,6807609,MX,SP,D,RMV,2021-07-19 00:00:00.000,M,,,,2020-02-03 00:00:00.000,,
3,6807610,ES,SP,R,RMV,2024-09-09 00:00:00.000,I,,5/1971,,2011-07-07 00:00:00.000,2011-09-15 00:00:00.000,
4,6807611,MX,SP,D,RMV,,,,7/1981,,,,


In [507]:
cases.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12124239 entries, 0 to 12124238
Data columns (total 13 columns):
 #   Column           Dtype   
---  ------           -----   
 0   IDNCASE          Int64   
 1   NAT              category
 2   LANG             category
 3   CUSTODY          category
 4   CASE_TYPE        category
 5   LATEST_HEARING   object  
 6   LATEST_CAL_TYPE  category
 7   DATE_OF_ENTRY    object  
 8   C_BIRTHDATE      string  
 9   Sex              category
 10  DATE_DETAINED    object  
 11  DATE_RELEASED    object  
 12  DETENTION_DATE   float64 
dtypes: Int64(1), category(6), float64(1), object(4), string(1)
memory usage: 751.6+ MB


In [508]:
cases.shape

(12124239, 13)

#### Filtering for Juvenile Cases

The current `cases` DataFrame includes both adult and juvenile records.  
To isolate only juvenile cases, we will filter it using the list of  
`idnCase` values from `tbl_JuvenileHistory`, without performing a full merge.

In [509]:
juvenile_cases = cases[cases["IDNCASE"].isin(juvenile_case_ids)].reset_index(drop=True)
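
If loading all ~12 million rows before filtering is too heavy, the same `isin` filter could in principle be applied chunk by chunk while reading. The sketch below assumes a tiny in-memory file and a made-up ID list; it is not what the notebook actually ran.

```python
import io

import pandas as pd

# Tiny tab-separated stand-in for A_TblCase.csv (made-up values).
raw = "IDNCASE\tNAT\n1\tMX\n2\tES\n3\tHO\n4\tMX\n5\tGT\n"
wanted_ids = [2, 5]  # stand-in for juvenile_case_ids

kept = []
# chunksize=2 is only for the demo; a real run would use a much larger value.
for chunk in pd.read_csv(io.StringIO(raw), delimiter="\t", chunksize=2):
    kept.append(chunk[chunk["IDNCASE"].isin(wanted_ids)])

filtered = pd.concat(kept, ignore_index=True)
print(filtered)
```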

#### Category Cleanup

Removed unused category levels after filtering to ensure all categorical columns reflect only the values present in the `juvenile_cases` subset.

In [510]:
for col in juvenile_cases.select_dtypes(include="category"):
    juvenile_cases[col] = juvenile_cases[col].cat.remove_unused_categories()
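
To see why this step matters, the toy example below (made-up nationality codes) shows that a filtered categorical column still carries the levels of the rows that were removed until `remove_unused_categories()` is called.

```python
import pandas as pd

# Made-up nationality codes, stored as a category.
nat = pd.Series(["MX", "ES", "HO", "MX"], dtype="category")

# Filtering keeps the original category levels, even those no longer present.
subset = nat[nat == "MX"]
print(list(subset.cat.categories))

# Dropping unused levels leaves only the values that remain in the data.
cleaned = subset.cat.remove_unused_categories()
print(list(cleaned.cat.categories))
```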

In [511]:
juvenile_cases.head()

Unnamed: 0,IDNCASE,NAT,LANG,CUSTODY,CASE_TYPE,LATEST_HEARING,LATEST_CAL_TYPE,DATE_OF_ENTRY,C_BIRTHDATE,Sex,DATE_DETAINED,DATE_RELEASED,DETENTION_DATE
0,6807609,MX,SP,D,RMV,2021-07-19 00:00:00.000,M,,,,2020-02-03 00:00:00.000,,
1,6807621,TZ,SWA,D,RMV,,,,,,2011-09-27 00:00:00.000,,
2,6807631,HO,SP,D,RMV,,,2011-07-01 00:00:00.000,,,2011-10-19 00:00:00.000,,
3,6807676,MX,SP,N,RMV,,,,,,,,
4,6807683,MX,ENG,N,RMV,2016-03-10 00:00:00.000,M,,,,,,


In [512]:
juvenile_cases.shape

(1923172, 13)

In [513]:
juvenile_cases.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1923172 entries, 0 to 1923171
Data columns (total 13 columns):
 #   Column           Dtype   
---  ------           -----   
 0   IDNCASE          Int64   
 1   NAT              category
 2   LANG             category
 3   CUSTODY          category
 4   CASE_TYPE        category
 5   LATEST_HEARING   object  
 6   LATEST_CAL_TYPE  category
 7   DATE_OF_ENTRY    object  
 8   C_BIRTHDATE      string  
 9   Sex              category
 10  DATE_DETAINED    object  
 11  DATE_RELEASED    object  
 12  DETENTION_DATE   float64 
dtypes: Int64(1), category(6), float64(1), object(4), string(1)
memory usage: 119.2+ MB


### Missing Values Summary

Counts and percentages of missing values were calculated for each column in the `juvenile_cases` dataset to assess data completeness.  
`.isna().sum()` computes the total number of missing entries, and the percentage is derived by dividing by the total number of rows.  
Columns are sorted by missing count to highlight those requiring attention during data cleaning.

In [514]:
null_counts = juvenile_cases.isna().sum()
percent_missing = (null_counts / len(juvenile_cases)) * 100

missing_summary = pd.DataFrame(
    {"Missing Count": null_counts, "Missing %": percent_missing.round(2)}
).sort_values(by="Missing Count", ascending=False)

display(missing_summary)

Unnamed: 0,Missing Count,Missing %
DETENTION_DATE,1923171,100.0
DATE_RELEASED,1460521,75.94
DATE_DETAINED,1084523,56.39
Sex,503601,26.19
DATE_OF_ENTRY,500057,26.0
C_BIRTHDATE,460967,23.97
LATEST_CAL_TYPE,317532,16.51
LATEST_HEARING,310755,16.16
NAT,3290,0.17
LANG,1870,0.1


The `DETENTION_DATE` column is missing in 1,923,171 of 1,923,172 rows (100.0% after rounding), so it will be dropped from the dataset.

In [515]:
juvenile_cases = juvenile_cases.drop("DETENTION_DATE", axis=1)

#### Datetime Format Validation

The following date-related features will be the focus of the next stage of preprocessing:  
- `LATEST_HEARING`  
- `DATE_OF_ENTRY`  
- `C_BIRTHDATE`  
- `DATE_DETAINED`  
- `DATE_RELEASED`  

These features may require format standardization and conversion to datetime objects to enable accurate temporal analysis.

Every datetime feature (except `C_BIRTHDATE`) follows the format `'YYYY-MM-DD 00:00:00.000'` (e.g., `'2025-02-04 00:00:00.000'`).

Before conversion, each feature will be tested against this pattern to ensure values are valid.  
All non-null entries will be checked to avoid unintended data loss during transformation with `pd.to_datetime()`.

Only the **`YYYY-MM-DD`** portion of each timestamp will be retained.

In [516]:
def find_invalid_dates(df, column):
    """
    Returns non-null rows that don’t start with a `YYYY-MM-DD` date.
    """
    # str.match anchors the pattern at the start of each value.
    starts_with_date = df[column].astype(str).str.match(r"\d{4}-\d{2}-\d{2}")
    return df.loc[df[column].notna() & ~starts_with_date, [column]]

In [517]:
invalid_latest_hearing = find_invalid_dates(juvenile_cases, "LATEST_HEARING")
invalid_date_of_entry = find_invalid_dates(juvenile_cases, "DATE_OF_ENTRY")
invalid_date_detained = find_invalid_dates(juvenile_cases, "DATE_DETAINED")
invalid_date_released = find_invalid_dates(juvenile_cases, "DATE_RELEASED")

In [518]:
def report_invalid(name, df):
    """
    Prints the number of invalid entries and displays the DataFrame if not empty.
    """
    count = len(df)
    print(f"{name}: {count} invalid entr{'y' if count == 1 else 'ies'}")
    if count > 0:
        display(df)

In [519]:
report_invalid("LATEST_HEARING", invalid_latest_hearing)
report_invalid("DATE_OF_ENTRY", invalid_date_of_entry)
report_invalid("DATE_DETAINED", invalid_date_detained)
report_invalid("DATE_RELEASED", invalid_date_released)

LATEST_HEARING: 11 invalid entries


Unnamed: 0,LATEST_HEARING
1658348,PHI
1679699,SFR
1683625,NEW
1723497,CHE
1738962,HAR
1742864,NYV
1743005,NYV
1747408,PHI
1775947,DAL
1805844,SNA


DATE_OF_ENTRY: 0 invalid entries
DATE_DETAINED: 8 invalid entries


Unnamed: 0,DATE_DETAINED
1658348,F
1679699,M
1738962,F
1742864,M
1743005,M
1747408,M
1805844,M
1908366,F


DATE_RELEASED: 0 invalid entries


The invalid values above (three-letter codes in `LATEST_HEARING`, `F`/`M` in `DATE_DETAINED`) share several row indices, which suggests rows whose fields were shifted in the raw file.  
They are not repaired individually: `errors='coerce'` in `pd.to_datetime()` converts any value that cannot be parsed to `NaT`.

In [520]:
date_cols = ["LATEST_HEARING", "DATE_OF_ENTRY", "DATE_DETAINED", "DATE_RELEASED"]

juvenile_cases[date_cols] = juvenile_cases[date_cols].apply(
    lambda col: pd.to_datetime(col, errors="coerce")
)
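
Because `errors="coerce"` is silent, it can be useful to count how many previously non-null values became `NaT`. The following is a small sketch on made-up values, not the notebook's own check.

```python
import pandas as pd

# Made-up column mixing valid timestamps, a stray code, and a missing value.
raw = pd.Series(
    ["2021-07-19 00:00:00.000", "PHI", None, "2016-03-10 00:00:00.000"]
)

parsed = pd.to_datetime(raw, errors="coerce")

# Values that were present before conversion but are NaT afterwards.
lost = raw.notna() & parsed.isna()
print(f"Values coerced to NaT: {lost.sum()}")
print(raw[lost].tolist())
```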

Identified non-null `C_BIRTHDATE` values that do not match the expected `MM/YYYY` format.  
This helps reveal alternate formats and prevents unintended data loss during conversion.

In [521]:
invalid_birthdates = juvenile_cases[
    juvenile_cases["C_BIRTHDATE"].notna()
    & ~juvenile_cases["C_BIRTHDATE"]
    .astype(str)
    .str.contains(r"^\d{1,2}/\d{4}$", regex=True)
][["C_BIRTHDATE"]]

In [522]:
print(f"Invalid C_BIRTHDATE entries: {len(invalid_birthdates)}")
display(invalid_birthdates)

Invalid C_BIRTHDATE entries: 8


Unnamed: 0,C_BIRTHDATE
1658348,E
1723497,E
1738962,E
1742864,E
1743005,E
1747408,E
1775947,E
1805844,E


In [523]:
juvenile_cases["C_BIRTHDATE"] = pd.to_datetime(
    juvenile_cases["C_BIRTHDATE"], format="%m/%Y", errors="coerce"
)
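
A short sketch of what this conversion does, using values of the kind seen above. Since `MM/YYYY` carries no day, each parsed date falls on the first of the month, so any age derived from it is accurate only to the month.

```python
import pandas as pd

# Sample values: two MM/YYYY strings, one stray code, one missing entry.
birth = pd.Series(["7/1969", "5/1971", "E", None])

# Unparseable entries such as "E" become NaT instead of raising an error.
parsed = pd.to_datetime(birth, format="%m/%Y", errors="coerce")
print(parsed)
```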

#### Date Conversion Check

Confirmed that all date columns were successfully converted to `datetime64[ns]` format.

In [524]:
date_cols = [
    "LATEST_HEARING",
    "DATE_OF_ENTRY",
    "DATE_DETAINED",
    "DATE_RELEASED",
    "C_BIRTHDATE",
]
print(juvenile_cases[date_cols].dtypes)

LATEST_HEARING    datetime64[ns]
DATE_OF_ENTRY     datetime64[ns]
DATE_DETAINED     datetime64[ns]
DATE_RELEASED     datetime64[ns]
C_BIRTHDATE       datetime64[ns]
dtype: object


In [525]:
juvenile_cases.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1923172 entries, 0 to 1923171
Data columns (total 12 columns):
 #   Column           Dtype         
---  ------           -----         
 0   IDNCASE          Int64         
 1   NAT              category      
 2   LANG             category      
 3   CUSTODY          category      
 4   CASE_TYPE        category      
 5   LATEST_HEARING   datetime64[ns]
 6   LATEST_CAL_TYPE  category      
 7   DATE_OF_ENTRY    datetime64[ns]
 8   C_BIRTHDATE      datetime64[ns]
 9   Sex              category      
 10  DATE_DETAINED    datetime64[ns]
 11  DATE_RELEASED    datetime64[ns]
dtypes: Int64(1), category(6), datetime64[ns](5)
memory usage: 104.5 MB


In [526]:
juvenile_cases.head()

Unnamed: 0,IDNCASE,NAT,LANG,CUSTODY,CASE_TYPE,LATEST_HEARING,LATEST_CAL_TYPE,DATE_OF_ENTRY,C_BIRTHDATE,Sex,DATE_DETAINED,DATE_RELEASED
0,6807609,MX,SP,D,RMV,2021-07-19,M,NaT,NaT,,2020-02-03,NaT
1,6807621,TZ,SWA,D,RMV,NaT,,NaT,NaT,,2011-09-27,NaT
2,6807631,HO,SP,D,RMV,NaT,,2011-07-01,NaT,,2011-10-19,NaT
3,6807676,MX,SP,N,RMV,NaT,,NaT,NaT,,NaT,NaT
4,6807683,MX,ENG,N,RMV,2016-03-10,M,NaT,NaT,,NaT,NaT


#### Categorical Value Cleanup

Reviewed the value counts of the key categorical features (`NAT`, `LANG`, `CUSTODY`, `CASE_TYPE`, `LATEST_CAL_TYPE`, `Sex`)  
to spot typos and rare variants. Only `LATEST_CAL_TYPE` and `Sex` are cleaned below; the rare codes in `CUSTODY` (`SP`, `LIN`, `POR`) and `CASE_TYPE` (`0`) are left unchanged.

`NAT` and `LANG` contain a large number of unique values.  
These were retained in full to preserve detail and will be grouped or simplified as needed during analysis.

In [527]:
juvenile_cases["CUSTODY"].value_counts()

CUSTODY
N      1040503
R       488510
D       394144
SP           9
LIN          1
POR          1
Name: count, dtype: int64

In [528]:
juvenile_cases["CASE_TYPE"].value_counts()

CASE_TYPE
RMV    1781888
CFR      89495
RFR      22294
WHO      22261
AOC       4909
DEP       1872
REC        227
EXC        156
CSR         54
0            6
AOL          2
NAC          2
DCC          1
Name: count, dtype: int64

In [529]:
juvenile_cases["LATEST_CAL_TYPE"].value_counts()

LATEST_CAL_TYPE
M       1100100
I        505503
N            25
0900          4
0830          2
1030          2
R             2
1000          1
1300          1
Name: count, dtype: int64

Cleaned `LATEST_CAL_TYPE` by keeping known values (`M` = Master, `I` = Individual) and replacing unexpected entries (e.g., time strings or unknown codes) with `NaN`.

In [530]:
valid_types = ["M", "I"]

# Keep known calendar types; every other value becomes NaN.
juvenile_cases["LATEST_CAL_TYPE"] = (
    juvenile_cases["LATEST_CAL_TYPE"]
    .where(juvenile_cases["LATEST_CAL_TYPE"].isin(valid_types))
    .cat.remove_unused_categories()
)

In [531]:
juvenile_cases["Sex"].value_counts()

Sex
M    909482
F    510082
N         6
U         1
Name: count, dtype: int64

Cleaned `Sex` by keeping known values (`M` = Male, `F` = Female) and replacing rare or unclear codes (`N`, `U`) with `NaN` to ensure consistency and avoid ambiguity in sex-related analysis.

In [532]:
# Keep M and F; every other code becomes NaN.
juvenile_cases["Sex"] = (
    juvenile_cases["Sex"]
    .where(juvenile_cases["Sex"].isin(["M", "F"]))
    .cat.remove_unused_categories()
)

In [533]:
juvenile_cases.head()

Unnamed: 0,IDNCASE,NAT,LANG,CUSTODY,CASE_TYPE,LATEST_HEARING,LATEST_CAL_TYPE,DATE_OF_ENTRY,C_BIRTHDATE,Sex,DATE_DETAINED,DATE_RELEASED
0,6807609,MX,SP,D,RMV,2021-07-19,M,NaT,NaT,,2020-02-03,NaT
1,6807621,TZ,SWA,D,RMV,NaT,,NaT,NaT,,2011-09-27,NaT
2,6807631,HO,SP,D,RMV,NaT,,2011-07-01,NaT,,2011-10-19,NaT
3,6807676,MX,SP,N,RMV,NaT,,NaT,NaT,,NaT,NaT
4,6807683,MX,ENG,N,RMV,2016-03-10,M,NaT,NaT,,NaT,NaT


In [534]:
juvenile_cases.to_csv(
    "../cleaned/juvenile_cases_cleaned.csv.gz", index=False, compression="gzip"
)

### Cleaned `A_TblCase`

- Filtered to ~1.92M juvenile-related cases based on `IDNCASE`.
- Reset index after filtering.
- Removed unused categories from categorical columns.
- Dropped `DETENTION_DATE` (missing in all but one row).
- Converted date fields to `datetime64[ns]`:
  - `LATEST_HEARING`
  - `DATE_OF_ENTRY`
  - `C_BIRTHDATE` (parsed from `MM/YYYY` format)
  - `DATE_DETAINED`
  - `DATE_RELEASED`
- Cleaned categorical features:
  - `LATEST_CAL_TYPE`: kept only `M` (Master) and `I` (Individual); others set to `NaN`
  - `Sex`: kept only `M` and `F`; others set to `NaN`

Saved cleaned file as:  
`juvenile_cases_cleaned.csv.gz`

## 3. Clean tbl_RepsAssigned

### Table Description

The `tbl_RepsAssigned` table contains detailed records of represented cases in the immigration system, including both noncitizen (Alien) and government (INS) representation. Each row represents an instance of legal representation assigned to a case.


This table is essential for identifying the presence and type of legal representation in each case.

### Data Inspection

In [535]:
reps_assigned = pd.read_csv("tbl_RepsAssigned.csv", delimiter="\t", low_memory=False)

In [536]:
reps_assigned.head()

Unnamed: 0,IDNREPSASSIGNED,IDNCASE,STRATTYLEVEL,STRATTYTYPE,PARENT_TABLE,PARENT_IDN,BASE_CITY_CODE,INS_TA_DATE_ASSIGNED,E_27_DATE,E_28_DATE,BLNPRIMEATTY
0,5945062,2047751,COURT,ALIEN,A_Tblcase,2047751.0,WAS,,,2004-04-06 00:00:00.000,1
1,5945065,2052183,COURT,ALIEN,A_Tblcase,2052183.0,WAS,,,2004-04-15 00:00:00.000,1
2,5945066,2052595,COURT,ALIEN,A_Tblcase,2052595.0,WAS,,,1987-07-21 00:00:00.000,1
3,5945067,2054039,COURT,ALIEN,A_Tblcase,2054039.0,WAS,,,1994-06-15 00:00:00.000,1
4,5945068,2056837,COURT,ALIEN,A_Tblcase,2056837.0,WAS,,,2003-05-15 00:00:00.000,1


In [537]:
reps_assigned.columns

Index(['IDNREPSASSIGNED', 'IDNCASE', 'STRATTYLEVEL', 'STRATTYTYPE',
       'PARENT_TABLE', 'PARENT_IDN', 'BASE_CITY_CODE', 'INS_TA_DATE_ASSIGNED',
       'E_27_DATE', 'E_28_DATE', 'BLNPRIMEATTY'],
      dtype='object')

In [538]:
reps_assigned.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24350854 entries, 0 to 24350853
Data columns (total 11 columns):
 #   Column                Dtype  
---  ------                -----  
 0   IDNREPSASSIGNED       int64  
 1   IDNCASE               int64  
 2   STRATTYLEVEL          object 
 3   STRATTYTYPE           object 
 4   PARENT_TABLE          object 
 5   PARENT_IDN            float64
 6   BASE_CITY_CODE        object 
 7   INS_TA_DATE_ASSIGNED  object 
 8   E_27_DATE             object 
 9   E_28_DATE             object 
 10  BLNPRIMEATTY          int64  
dtypes: float64(1), int64(3), object(7)
memory usage: 2.0+ GB


In [539]:
reps_assigned.shape

(24350854, 11)

In [540]:
reps_assigned.isna().sum()

IDNREPSASSIGNED                0
IDNCASE                        0
STRATTYLEVEL                   0
STRATTYTYPE                    0
PARENT_TABLE                   0
PARENT_IDN                     1
BASE_CITY_CODE           7912377
INS_TA_DATE_ASSIGNED    11642124
E_27_DATE               23405446
E_28_DATE               12693828
BLNPRIMEATTY                   0
dtype: int64

In [541]:
# check the values in STRATTYLEVEL
reps_assigned["STRATTYLEVEL"].unique()

array(['COURT', 'BOARD'], dtype=object)

In [542]:
# check the values in STRATTYTYPE
reps_assigned["STRATTYTYPE"].unique()

array(['ALIEN', 'INS'], dtype=object)

#### Filtering for Alien Representation

The current `reps_assigned` DataFrame includes records for both noncitizen (Alien) and government (Immigration and Naturalization Service, or INS) representation.

To focus only on cases where the attorney is representing the noncitizen, the data will be filtered to keep only rows where the `STRATTYTYPE` column has the value `ALIEN`.

In [543]:
reps_assigned = reps_assigned[reps_assigned["STRATTYTYPE"] == "ALIEN"].reset_index(
    drop=True
)
# to check the unique values in STRATTYTYPE after filtering
reps_assigned["STRATTYTYPE"].unique()

array(['ALIEN'], dtype=object)

#### Filtering for Juvenile Cases

The current `reps_assigned` DataFrame contains both adult and juvenile records.  

To isolate juvenile cases, it is filtered using the `idnCase` values from `tbl_JuvenileHistory`, without performing a full merge.


In [544]:
juvenile_reps = reps_assigned[
    reps_assigned["IDNCASE"].isin(juvenile_case_ids)
].reset_index(drop=True)

In [545]:
juvenile_reps.shape

(2704603, 11)

### Deciding Between `IDNCASE` and `PARENT_IDN` for later Merging with Cases Table
#### Comparing `PARENT_IDN` and `IDNCASE`

- `PARENT_IDN` is the ID in the parent table, which may or may not be `A_TblCase`.
- `IDNCASE` is the ID of the case in `A_TblCase`.

Before proceeding, we need to determine whether to use `IDNCASE` or `PARENT_IDN` as the key for merging with the cases table.

To make this decision, the following steps will be taken:

- Filter to keep only rows where the parent table is `A_TblCase`.
- Compare `IDNCASE` and `PARENT_IDN` to check whether their values are identical; in theory they should match.
- Assess missing values and data types for both columns to decide which is more reliable for merging.



#### Filter to keep only rows where the parent table is `A_TblCase`.

In [546]:
# checking the possible parent tables and their counts
juvenile_reps["PARENT_TABLE"].value_counts()

PARENT_TABLE
a_tblcase    2529175
tblappeal      89452
A_Tblcase      50325
tblAppeal      34703
A_TblCase        948
Name: count, dtype: int64

**Note:** `A_TblCase` appears under three capitalization variants in `PARENT_TABLE`:  
- `A_TblCase`  
- `A_Tblcase`  
- `a_tblcase`  

All three must be included when filtering for rows whose parent table is `A_TblCase`.

In [547]:
juvenile_reps = juvenile_reps[
    juvenile_reps["PARENT_TABLE"].isin(["A_TblCase", "A_Tblcase", "a_tblcase"])
].reset_index(drop=True)

juvenile_reps["PARENT_TABLE"].value_counts()

PARENT_TABLE
a_tblcase    2529175
A_Tblcase      50325
A_TblCase        948
Name: count, dtype: int64
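
An alternative to listing each spelling is to compare on a lower-cased copy of the column. This is a sketch on a made-up sample that reuses the spellings seen in `PARENT_TABLE`; it should select the same rows as the explicit list as long as no other variants exist.

```python
import pandas as pd

# Made-up sample using the spellings seen in PARENT_TABLE.
parent = pd.Series(
    ["a_tblcase", "A_Tblcase", "tblappeal", "A_TblCase", "tblAppeal"]
)

# Compare on a lower-cased copy so every capitalization variant is caught.
is_case_table = parent.str.lower() == "a_tblcase"
print(parent[is_case_table].tolist())
```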

#### Compare `IDNCASE` and `PARENT_IDN` to check if their values are identical.


In [548]:
juvenile_reps["Match_IDNCASE_PARENTS"] = (juvenile_reps["IDNCASE"]
    == juvenile_reps["PARENT_IDN"])

print(juvenile_reps["Match_IDNCASE_PARENTS"].value_counts())
juvenile_reps = juvenile_reps.drop("Match_IDNCASE_PARENTS", axis=1)

Match_IDNCASE_PARENTS
True     2580425
False         23
Name: count, dtype: int64


In [None]:
# show sample of the rows where IDNCASE and PARENT_IDN are different
mismatched_rows = juvenile_reps[juvenile_reps["IDNCASE"] != juvenile_reps["PARENT_IDN"]]
mismatched_rows.head()

Unnamed: 0,IDNREPSASSIGNED,IDNCASE,STRATTYLEVEL,STRATTYTYPE,PARENT_TABLE,PARENT_IDN,BASE_CITY_CODE,INS_TA_DATE_ASSIGNED,E_27_DATE,E_28_DATE,BLNPRIMEATTY
28197,12086960,6163622,COURT,ALIEN,a_tblcase,3317728.0,NYC,,,2009-03-05 00:00:00.000,0
31780,12573607,6271339,COURT,ALIEN,a_tblcase,6271917.0,DET,,,2010-01-28 00:00:00.000,1
35667,13110143,6597692,COURT,ALIEN,a_tblcase,6622225.0,HON,,,2010-12-06 00:00:00.000,1
57232,15128637,7304902,COURT,ALIEN,a_tblcase,7258006.0,,,,2015-11-27 00:00:00.000,1
127551,16946073,6163622,COURT,ALIEN,a_tblcase,3317728.0,NYC,,,2009-03-05 00:00:00.000,0


The two keys are identical for most cases, but further investigation is needed to determine which is more accurate and reliable for merging later.

#### Assess missing values and data types for both columns to decide which is more reliable for merging.

In [550]:
print(f"Empty IDNCASE values: {juvenile_reps['IDNCASE'].isna().sum()}")
print(f"Empty PARENT_IDN values: {juvenile_reps['PARENT_IDN'].isna().sum()}")

Empty IDNCASE values: 0
Empty PARENT_IDN values: 0


In [551]:
# Single quotes inside the f-string keep this valid on Python versions before 3.12
print(f"IDNCASE data type is: {juvenile_reps['IDNCASE'].dtype}")
print(f"PARENT_IDN data type is: {juvenile_reps['PARENT_IDN'].dtype}")

IDNCASE data type is: int64
PARENT_IDN data type is: float64
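
The dtype difference can be removed with a cast. The sketch below uses a toy frame (the third row is made up for illustration) and assumes every `PARENT_IDN` value is a whole number, which holds for the sample rows shown earlier but is not verified for the full column.

```python
import pandas as pd

# Toy data: the first two rows reuse mismatched pairs from the sample above,
# the third row is a hypothetical matching pair.
demo = pd.DataFrame(
    {
        "IDNCASE": [6163622, 6271339, 2607017],
        "PARENT_IDN": [3317728.0, 6271917.0, 2607017.0],
    }
)

# Cast the float key to pandas' nullable integer dtype so both keys share a format.
demo["PARENT_IDN"] = demo["PARENT_IDN"].astype("Int64")

# Rows where the two keys disagree.
mismatches = demo[demo["IDNCASE"] != demo["PARENT_IDN"]]
print(mismatches)
```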


A definitive conclusion regarding which key is more accurate could not be reached. The main difference is the data type: `IDNCASE` is stored as an integer, which matches the format used in other tables, whereas `PARENT_IDN` is stored as a float. This is not critical, since data types can be easily adjusted. Given that only 23 of roughly 2.58 million records contain differing values, the `IDNCASE` field will be used for subsequent merging.

### Checking for Repeated Cases

Identify and document cases that appear multiple times in the dataset, as these duplicates may impact downstream analysis and interpretation.

In [552]:
juvenile_reps.shape

(2580448, 11)

In [553]:
# checking for repeated cases in all columns
repeated_cases = juvenile_reps[juvenile_reps.duplicated(keep=False)]
repeated_cases.shape[0]

0

In [None]:
# checking for repeated cases in idnCase
repeated_cases_idncase = juvenile_reps[juvenile_reps.duplicated(subset=["IDNCASE"], keep=False)]
repeated_cases_idncase.shape[0]

In [555]:
juvenile_reps["STRATTYLEVEL"].unique()

array(['COURT'], dtype=object)

Most `IDNCASE` values are repeated, with some differences in the other columns.  
However, every record in this representation table belongs to a represented case and shares the same `STRATTYLEVEL` (`"COURT"` for all alien juvenile cases),  
so keeping a single row per `IDNCASE` does not affect the analysis.

In [556]:
juvenile_reps = juvenile_reps.drop_duplicates(subset=["IDNCASE"], keep="first")

In [557]:
juvenile_reps.shape

(1031166, 11)
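
`keep="first"` retains whichever row comes first in file order. If a specific row per case were preferred, one option is to sort before dropping duplicates. The sketch below is illustrative only: it uses toy data, and choosing the earliest `E_28_DATE` is an assumption, not something the notebook does.

```python
import pandas as pd

# Toy data: cases 1 and 2 appear twice, case 3 once.
demo = pd.DataFrame(
    {
        "IDNCASE": [1, 1, 2, 2, 3],
        "E_28_DATE": pd.to_datetime(
            ["2010-05-01", "2009-03-05", None, "2015-11-27", "2004-09-23"]
        ),
    }
)

# Count the cases that appear on more than one row.
rows_per_case = demo.groupby("IDNCASE").size()
n_repeated_cases = int((rows_per_case > 1).sum())

# Sort so the earliest E_28_DATE comes first within each case (missing dates last),
# then keep that first row.
earliest = (
    demo.sort_values(["IDNCASE", "E_28_DATE"], na_position="last")
    .drop_duplicates(subset=["IDNCASE"], keep="first")
    .reset_index(drop=True)
)
print(n_repeated_cases)
print(earliest)
```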

### Selected Features – `tbl_RepsAssigned`


- `IDNREPSASSIGNED`: Primary key.
- `IDNCASE`: Foreign key; links this table to the `tblCase` table.
- `STRATTYLEVEL`: The stage or jurisdiction level at which the attorney is participating:
  - `"COURT"` = Immigration Court
  - `"BOARD"` = Board of Immigration Appeals
- `STRATTYTYPE`: The party that the attorney or representative is representing in the case:
  - `ALIEN`: The attorney or representative is representing the noncitizen, which means the person is represented.
  - `INS`: Stands for the Immigration and Naturalization Service.
- `E_28_DATE`: Date of submission of **Form EOIR-28**, Notice of Entry of Appearance as Attorney or Representative Before the Immigration Court.
- `E_27_DATE`: Date of submission of **Form EOIR-27**, Notice of Entry of Appearance as Attorney or Representative Before the Board of Immigration Appeals (BIA).


In [558]:
columns = [
    "IDNREPSASSIGNED",
    "IDNCASE",
    "STRATTYLEVEL",
    "STRATTYTYPE",
    "E_28_DATE",
    "E_27_DATE",
]

juvenile_reps = juvenile_reps[columns]
juvenile_reps.head()

Unnamed: 0,IDNREPSASSIGNED,IDNCASE,STRATTYLEVEL,STRATTYTYPE,E_28_DATE,E_27_DATE
0,5947072,2607017,COURT,ALIEN,2004-09-23 00:00:00.000,
1,5947352,2848551,COURT,ALIEN,1991-10-29 00:00:00.000,
2,5948254,2824744,COURT,ALIEN,1999-06-25 00:00:00.000,
3,5948542,2047758,COURT,ALIEN,2004-07-15 00:00:00.000,
4,5949322,3131023,COURT,ALIEN,2005-05-26 00:00:00.000,


In [559]:
juvenile_reps.shape

(1031166, 6)

#### Datetime Format Validation

The following date-related features will be the focus of the next stage of preprocessing:  
- `E_28_DATE`  
- `E_27_DATE`  

These features may require format standardization and conversion to datetime objects to enable accurate temporal analysis.

Every datetime feature follows the format `'YYYY-MM-DD 00:00:00.000'` (e.g., `'2025-02-04 00:00:00.000'`).

Before conversion, each feature will be tested against this pattern to ensure values are valid.  
All non-null entries will be checked to avoid unintended data loss during transformation with `pd.to_datetime()`.

Only the **`YYYY-MM-DD`** portion of each timestamp will be retained.

The `find_invalid_dates` function, which was defined earlier, will be used.
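
The helper's definition is not repeated in this section. For orientation, a pattern check of this kind might look like the sketch below; `check_date_pattern` and `DATE_PATTERN` are hypothetical names for illustration and are not the notebook's `find_invalid_dates`.

```python
import pandas as pd

# Expected shape: 'YYYY-MM-DD 00:00:00.000'
DATE_PATTERN = r"^\d{4}-\d{2}-\d{2} 00:00:00\.000$"


def check_date_pattern(df, column):
    """Return the non-null values of `column` that do not match DATE_PATTERN."""
    values = df[column].dropna().astype(str)
    return values[~values.str.match(DATE_PATTERN)]


# Toy data: one valid value, one missing value, two malformed values.
demo = pd.DataFrame(
    {"E_28_DATE": ["2004-09-23 00:00:00.000", None, "23/09/2004", "2004-09-23"]}
)
invalid = check_date_pattern(demo, "E_28_DATE")
print(invalid.tolist())
```

Missing values are dropped before matching, so only non-null entries are ever reported as invalid.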

In [560]:
invalid_E_28_DATE = find_invalid_dates(juvenile_reps, "E_28_DATE")
invalid_E_27_DATE = find_invalid_dates(juvenile_reps, "E_27_DATE")

The `report_invalid` function, which was defined earlier, will be used.


In [561]:
report_invalid("E_28_DATE", invalid_E_28_DATE)
report_invalid("E_27_DATE", invalid_E_27_DATE)

E_28_DATE: 0 invalid entries
E_27_DATE: 0 invalid entries


In [562]:
date_cols = ["E_28_DATE", "E_27_DATE"]

juvenile_reps[date_cols] = juvenile_reps[date_cols].apply(
    lambda col: pd.to_datetime(col, errors="coerce")
)

#### Specifying Column Data Types

In [563]:
juvenile_reps.dtypes

IDNREPSASSIGNED             int64
IDNCASE                     int64
STRATTYLEVEL               object
STRATTYTYPE                object
E_28_DATE          datetime64[ns]
E_27_DATE          datetime64[ns]
dtype: object

In [564]:
juvenile_reps = juvenile_reps.astype(
    {"STRATTYLEVEL": "category", "STRATTYTYPE": "category"}
)
juvenile_reps.dtypes

IDNREPSASSIGNED             int64
IDNCASE                     int64
STRATTYLEVEL             category
STRATTYTYPE              category
E_28_DATE          datetime64[ns]
E_27_DATE          datetime64[ns]
dtype: object

In [565]:
juvenile_reps.to_csv(
    "../cleaned/juvenile_reps_assigned.csv.gz", index=False, compression="gzip"
)

### Cleaned `tbl_RepsAssigned`

- Filtered to ~2.7M representation records for alien juvenile cases, based on `STRATTYTYPE` and `IDNCASE`.
- Kept only the rows whose `PARENT_TABLE` refers to `tbl_case` (2,580,448 rows), then kept one row per `IDNCASE` (1,031,166 rows).
- Reset index after filtering.
- Selected only the relevant 6 columns for further analysis:
  - `IDNREPSASSIGNED`, `IDNCASE`, `STRATTYLEVEL`, `STRATTYTYPE`, `E_28_DATE`, `E_27_DATE`
- Converted categorical fields to `category`:
  - `STRATTYLEVEL`
  - `STRATTYTYPE`
- Converted date fields to `datetime64[ns]`:
  - `E_28_DATE`
  - `E_27_DATE`

Saved cleaned file as:  
`juvenile_reps_assigned.csv.gz`
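
CSV does not store dtypes, so the `category` and `datetime64[ns]` conversions have to be reapplied when the file is read back. The sketch below shows one way to do that on a toy frame written to a temporary directory; the file name `demo.csv.gz` is made up for the example.

```python
import os
import tempfile

import pandas as pd

# Toy frame with the same kinds of columns as the cleaned table.
df = pd.DataFrame(
    {
        "IDNCASE": [1, 2],
        "STRATTYTYPE": pd.Categorical(["ALIEN", "ALIEN"]),
        "E_28_DATE": pd.to_datetime(["2004-09-23", None]),
    }
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.csv.gz")
    df.to_csv(path, index=False, compression="gzip")

    # Reapply dtypes on load; compression is inferred from the .gz extension.
    restored = pd.read_csv(
        path,
        dtype={"STRATTYTYPE": "category"},
        parse_dates=["E_28_DATE"],
    )

print(restored.dtypes)
```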

## 4. Clean B_TblProceeding

### Initial Data Inspection


The dataset contains approximately 12 million rows. To avoid memory issues and reduce unnecessary processing, only the first 1000 rows will be inspected initially:

- Get an overview of the data types (`dtypes`)
- Identify columns/features worth keeping for the EDA
- Skip any columns that appear to be irrelevant or redundant

This initial check will help streamline the analysis and focus only on useful information.

In [566]:
proceedings_path = "B_TblProceeding.csv"
proceedings = pd.read_csv(
    filepath_or_buffer=proceedings_path, delimiter="\t", nrows=1000
)

In [567]:
proceedings.head()

Unnamed: 0,IDNPROCEEDING,IDNCASE,OSC_DATE,INPUT_DATE,BASE_CITY_CODE,HEARING_LOC_CODE,IJ_CODE,TRANS_IN_DATE,PREV_HEARING_LOC,PREV_HEARING_BASE,PREV_IJ_CODE,TRANS_NBR,HEARING_DATE,HEARING_TIME,DEC_TYPE,DEC_CODE,DEPORTED_1,DEPORTED_2,OTHER_COMP,APPEAL_RSVD,APPEAL_NOT_FILED,COMP_DATE,ABSENTIA,VENUE_CHG_GRANTED,TRANSFER_TO,DATE_APPEAL_DUE_STATUS,TRANSFER_STATUS,CUSTODY,CASE_TYPE,NAT,LANG,SCHEDULED_HEAR_LOC,CORRECTIONAL_FAC,CRIM_IND,IHP,AGGRAVATE_FELON,DATE_DETAINED,DATE_RELEASED
0,9329335,9431121,2019-12-22 00:00:00.000,2020-01-31 00:00:00.000,CHL,CHL,SC2,,,,,,2022-02-22 00:00:00.000,800.0,,,,,,,,,,,,,,N,RMV,MX,SP,,,N,,0.0,,
1,9329336,9431122,2019-12-22 00:00:00.000,2020-01-31 00:00:00.000,CHL,CHL,SC2,,,,,,2022-02-22 00:00:00.000,800.0,,,,,,,,,,,,,,N,RMV,MX,SP,,,N,,0.0,,
2,9329337,9431123,2019-12-22 00:00:00.000,2020-01-31 00:00:00.000,CHL,CHL,SC2,,,,,,2022-02-22 00:00:00.000,800.0,,,,,,,,,,,,,,N,RMV,MX,SP,,,N,,0.0,,
3,9329338,9431124,2019-12-09 00:00:00.000,2020-02-13 00:00:00.000,KRO,BCJ,DL,,,,,,2020-02-24 00:00:00.000,830.0,O,X,CU,,,,,2020-02-24 00:00:00.000,N,,,2020-03-25 00:00:00.000,,D,RMV,CU,SP,,,Y,,0.0,2020-02-07 00:00:00.000,
4,9329339,9431125,2019-12-12 00:00:00.000,2020-01-31 00:00:00.000,CHL,CHL,SC2,,,,,,2022-02-22 00:00:00.000,800.0,,,,,,,,,,,,,,R,RMV,MX,SP,,,N,,0.0,2020-01-31 00:00:00.000,2020-01-31 00:00:00.000


In [568]:
proceedings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 38 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   IDNPROCEEDING           1000 non-null   int64  
 1   IDNCASE                 1000 non-null   int64  
 2   OSC_DATE                1000 non-null   object 
 3   INPUT_DATE              971 non-null    object 
 4   BASE_CITY_CODE          1000 non-null   object 
 5   HEARING_LOC_CODE        1000 non-null   object 
 6   IJ_CODE                 983 non-null    object 
 7   TRANS_IN_DATE           208 non-null    object 
 8   PREV_HEARING_LOC        430 non-null    object 
 9   PREV_HEARING_BASE       430 non-null    object 
 10  PREV_IJ_CODE            426 non-null    object 
 11  TRANS_NBR               436 non-null    float64
 12  HEARING_DATE            965 non-null    object 
 13  HEARING_TIME            965 non-null    float64
 14  DEC_TYPE                624 non-null    o

In [569]:
proceedings.columns

Index(['IDNPROCEEDING', 'IDNCASE', 'OSC_DATE', 'INPUT_DATE', 'BASE_CITY_CODE',
       'HEARING_LOC_CODE', 'IJ_CODE', 'TRANS_IN_DATE', 'PREV_HEARING_LOC',
       'PREV_HEARING_BASE', 'PREV_IJ_CODE', 'TRANS_NBR', 'HEARING_DATE',
       'HEARING_TIME', 'DEC_TYPE', 'DEC_CODE', 'DEPORTED_1', 'DEPORTED_2',
       'OTHER_COMP', 'APPEAL_RSVD', 'APPEAL_NOT_FILED', 'COMP_DATE',
       'ABSENTIA', 'VENUE_CHG_GRANTED', 'TRANSFER_TO',
       'DATE_APPEAL_DUE_STATUS', 'TRANSFER_STATUS', 'CUSTODY', 'CASE_TYPE',
       'NAT', 'LANG', 'SCHEDULED_HEAR_LOC', 'CORRECTIONAL_FAC', 'CRIM_IND',
       'IHP', 'AGGRAVATE_FELON', 'DATE_DETAINED', 'DATE_RELEASED'],
      dtype='object')

### Selected Features for EDA – `B_TblProceeding`

The selected columns below are core proceeding-level features relevant to analyzing juvenile immigration cases. They were chosen based on the source dataset documentation and their importance for understanding case timelines, decisions, and access to justice.

These fields include:

- **Case timeline**: `OSC_DATE`, `COMP_DATE`, `INPUT_DATE`
- **Case outcomes**: `DEC_CODE`, `ABSENTIA`, `CRIM_IND`
- **Demographics & access**: `NAT`, `LANG`, `CUSTODY`, `CASE_TYPE`
- **Court geography**: `BASE_CITY_CODE`, `HEARING_LOC_CODE`
- **Detention timeline**: `DATE_DETAINED`, `DATE_RELEASED`

These variables allow us to track the duration of proceedings, categorize outcomes, and evaluate regional disparities and vulnerability factors.
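
As an illustration of the duration tracking mentioned above, the sketch below computes the number of days between `OSC_DATE` and `COMP_DATE` on toy data. The column name `DURATION_DAYS` is hypothetical, and the dates are assumed to be already parsed as datetimes.

```python
import pandas as pd

# Toy data: one completed proceeding and one still pending (no COMP_DATE).
demo = pd.DataFrame(
    {
        "OSC_DATE": pd.to_datetime(["2019-12-09", "2019-12-22"]),
        "COMP_DATE": pd.to_datetime(["2020-02-24", None]),
    }
)

# Subtracting datetimes yields timedeltas; .dt.days extracts whole days.
# Pending proceedings stay missing instead of receiving a fake duration.
demo["DURATION_DAYS"] = (demo["COMP_DATE"] - demo["OSC_DATE"]).dt.days
print(demo)
```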

In [570]:
selected_columns = [
    "IDNPROCEEDING",
    "IDNCASE",
    "OSC_DATE",
    "INPUT_DATE",
    "COMP_DATE",
    "BASE_CITY_CODE",
    "HEARING_LOC_CODE",
    "DEC_CODE",
    "ABSENTIA",
    "CRIM_IND",
    "NAT",
    "LANG",
    "CASE_TYPE",
    "CUSTODY",
    "DATE_DETAINED",
    "DATE_RELEASED",
]

In [571]:
proceedings = proceedings[selected_columns]

In [572]:
proceedings.head()

Unnamed: 0,IDNPROCEEDING,IDNCASE,OSC_DATE,INPUT_DATE,COMP_DATE,BASE_CITY_CODE,HEARING_LOC_CODE,DEC_CODE,ABSENTIA,CRIM_IND,NAT,LANG,CASE_TYPE,CUSTODY,DATE_DETAINED,DATE_RELEASED
0,9329335,9431121,2019-12-22 00:00:00.000,2020-01-31 00:00:00.000,,CHL,CHL,,,N,MX,SP,RMV,N,,
1,9329336,9431122,2019-12-22 00:00:00.000,2020-01-31 00:00:00.000,,CHL,CHL,,,N,MX,SP,RMV,N,,
2,9329337,9431123,2019-12-22 00:00:00.000,2020-01-31 00:00:00.000,,CHL,CHL,,,N,MX,SP,RMV,N,,
3,9329338,9431124,2019-12-09 00:00:00.000,2020-02-13 00:00:00.000,2020-02-24 00:00:00.000,KRO,BCJ,X,N,Y,CU,SP,RMV,D,2020-02-07 00:00:00.000,
4,9329339,9431125,2019-12-12 00:00:00.000,2020-01-31 00:00:00.000,,CHL,CHL,,,N,MX,SP,RMV,R,2020-01-31 00:00:00.000,2020-01-31 00:00:00.000


In [573]:
proceedings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 16 columns):
 #   Column            Non-Null Count  Dtype 
---  ------            --------------  ----- 
 0   IDNPROCEEDING     1000 non-null   int64 
 1   IDNCASE           1000 non-null   int64 
 2   OSC_DATE          1000 non-null   object
 3   INPUT_DATE        971 non-null    object
 4   COMP_DATE         675 non-null    object
 5   BASE_CITY_CODE    1000 non-null   object
 6   HEARING_LOC_CODE  1000 non-null   object
 7   DEC_CODE          546 non-null    object
 8   ABSENTIA          657 non-null    object
 9   CRIM_IND          983 non-null    object
 10  NAT               998 non-null    object
 11  LANG              1000 non-null   object
 12  CASE_TYPE         1000 non-null   object
 13  CUSTODY           1000 non-null   object
 14  DATE_DETAINED     213 non-null    object
 15  DATE_RELEASED     110 non-null    object
dtypes: int64(2), object(14)
memory usage: 125.1+ KB


In [574]:
proceedings.dtypes

IDNPROCEEDING        int64
IDNCASE              int64
OSC_DATE            object
INPUT_DATE          object
COMP_DATE           object
BASE_CITY_CODE      object
HEARING_LOC_CODE    object
DEC_CODE            object
ABSENTIA            object
CRIM_IND            object
NAT                 object
LANG                object
CASE_TYPE           object
CUSTODY             object
DATE_DETAINED       object
DATE_RELEASED       object
dtype: object

#### Specifying Column Data Types

- `Int64`: Used for `IDNPROCEEDING`, `IDNCASE` to support nullable integer IDs.  
- `category`: Applied to repeated string fields for memory efficiency:
  `BASE_CITY_CODE`, `HEARING_LOC_CODE`, `DEC_CODE`, `ABSENTIA`, `CRIM_IND`,
  `NAT`, `LANG`, `CASE_TYPE`, `CUSTODY`.  
- `object`: Used for date fields (`OSC_DATE`, `INPUT_DATE`, `COMP_DATE`,
  `DATE_DETAINED`, `DATE_RELEASED`) to preserve original formats before validation
  and conversion to `datetime`.
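
To illustrate the memory argument for `category`, the sketch below compares the footprint of the same repeated codes stored as `object` and as `category`. The data is synthetic, and the exact byte counts depend on the pandas version and platform.

```python
import pandas as pd

# Synthetic column: four city codes repeated many times.
codes = pd.Series(["MIA", "NYC", "SFR", "DAL"] * 25_000)

as_object = codes.astype("object")
as_category = codes.astype("category")

# deep=True includes the memory held by the Python string objects.
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f"object: {object_bytes:,} bytes, category: {category_bytes:,} bytes")
```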

In [575]:
dtypes = {
    "IDNPROCEEDING": "Int64",
    "IDNCASE": "Int64",
    "OSC_DATE": "object",
    "INPUT_DATE": "object",
    "COMP_DATE": "object",
    "BASE_CITY_CODE": "category",
    "HEARING_LOC_CODE": "category",
    "DEC_CODE": "category",
    "ABSENTIA": "category",
    "CRIM_IND": "category",
    "NAT": "category",
    "LANG": "category",
    "CASE_TYPE": "category",
    "CUSTODY": "category",
    "DATE_DETAINED": "object",
    "DATE_RELEASED": "object",
}

### Skipping Malformed Rows

Some lines in the dataset have an incorrect number of tab-separated fields, which can cause parsing errors during loading.  
The script first determines the expected number of columns from the header, then identifies and skips any rows that don't match.  
This approach ensures only properly structured records are read without triggering errors or discarding valid data.
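
The idea can be tried on a small in-memory example before running it on the full file. The sketch below is a single-pass variant; `find_bad_rows` is a hypothetical helper name and `raw` is made-up data.

```python
import csv
import io

import pandas as pd

# Made-up tab-separated data with one short row and one long row.
raw = (
    "A\tB\tC\n"
    "1\t2\t3\n"
    "4\t5\n"           # too few fields
    "6\t7\t8\n"
    "9\t10\t11\t12\n"  # too many fields
)


def find_bad_rows(buffer, sep="\t"):
    """Return 0-based line numbers (header = 0) whose field count differs from the header."""
    n_fields = len(buffer.readline().rstrip("\n").split(sep))
    return [
        i
        for i, line in enumerate(buffer, start=1)
        if len(line.rstrip("\n").split(sep)) != n_fields
    ]


bad_rows = find_bad_rows(io.StringIO(raw))
df = pd.read_csv(
    io.StringIO(raw),
    sep="\t",
    dtype=str,
    skiprows=bad_rows,
    quoting=csv.QUOTE_NONE,
)
print(bad_rows)
print(df)
```

As far as the pandas documentation describes it, the `on_bad_lines` option of `read_csv` targets lines with too many fields, so counting fields manually is what also catches lines with too few.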

In [576]:
import csv

with open(proceedings_path, "r", encoding="utf-8", errors="ignore") as f:
    header = f.readline().rstrip("\n").split("\t")
n_fields = len(header)

bad_rows = []
with open(proceedings_path, "r", encoding="utf-8", errors="ignore") as f:
    for i, line in enumerate(f):
        if i == 0:
            continue  # skip header
        if len(line.rstrip("\n").split("\t")) != n_fields:
            bad_rows.append(i)

print("Rows to skip:", bad_rows)

Rows to skip: [97835, 570096, 570097, 711871, 808187, 925893, 1340981, 1374429, 1375092, 1526869, 1540198, 1634092, 2027082, 2178293, 2178294, 2799401, 2833726, 2845396, 2874449, 2874454, 3050966, 3072502, 3089401, 3094348, 3314332, 3314333, 3499731, 3529540, 3834981, 4010908, 4192732, 4192733, 4247596, 4328751, 4328752, 4402946, 4456120, 4485669, 4485670, 4594041, 4594042, 4777655, 5037195, 5289573, 5294225, 5460027, 5505089, 5685628, 5765052, 5767554, 5770037, 6079161, 6079162, 6354988, 6372650, 6720545, 6720546, 6906838, 7537286, 7537287, 7903157, 7903158, 8106585, 8106586, 8192257, 8296762, 8561598, 8660111, 8738299, 8738764, 8744904, 8831487, 8914785, 9882723, 9973801, 10102404, 10244630, 10284073, 10438487, 10551043, 10813073, 10894138, 10894139, 10937485, 11256349, 11256350, 11415193, 11415194, 11451606, 11482022, 11523116, 11539956, 11551964, 11586386, 11668880, 13514773, 13514774, 13529532, 13529533, 13535869, 14139697, 14139698, 14392108, 14392109, 14464216, 14557193, 1455719

In [577]:
proceedings = pd.read_csv(
    proceedings_path,
    sep="\t",
    usecols=selected_columns,
    dtype=str,
    skiprows=bad_rows,
    quoting=csv.QUOTE_NONE,
)

In [578]:
proceedings = proceedings.astype(dtypes)

In [579]:
juvenile_proceedings = proceedings[
    proceedings["IDNPROCEEDING"].isin(juvenile_proceeding_ids)
].reset_index(drop=True)

In [580]:
print(
    f"Number of idnProceeding keys in tbl_JuvenileHistory.csv table: {len(juvenile_proceeding_ids):,}"
)
print(
    f"Number of matched juvenile rows in B_TblProceeding.csv based on those keys: {juvenile_proceedings.shape[0]:,}"
)

Number of idnProceeding keys in tbl_JuvenileHistory.csv table: 2,819,053
Number of matched juvenile rows in B_TblProceeding.csv based on those keys: 2,819,051
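
The two counts differ by two keys. A sketch for listing keys that found no match is shown below on toy data; the names `juvenile_ids_demo` and `proceedings_demo` are made up. One possible cause, not verified here, is that the missing keys sit on rows skipped as malformed.

```python
import pandas as pd

# Toy stand-ins for the juvenile key list and the proceedings table.
juvenile_ids_demo = pd.Series([10, 11, 12, 13], dtype="Int64")
proceedings_demo = pd.DataFrame(
    {"IDNPROCEEDING": pd.array([10, 11, 13, 99], dtype="Int64")}
)

# Keys present in the juvenile list but absent from the proceedings table.
unmatched = juvenile_ids_demo[
    ~juvenile_ids_demo.isin(proceedings_demo["IDNPROCEEDING"])
]
print(unmatched.tolist())
```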


In [581]:
juvenile_proceedings.head()

Unnamed: 0,IDNPROCEEDING,IDNCASE,OSC_DATE,INPUT_DATE,BASE_CITY_CODE,HEARING_LOC_CODE,DEC_CODE,COMP_DATE,ABSENTIA,CUSTODY,CASE_TYPE,NAT,LANG,CRIM_IND,DATE_DETAINED,DATE_RELEASED
0,9329335,9431121,2019-12-22 00:00:00.000,2020-01-31 00:00:00.000,CHL,CHL,,,,N,RMV,MX,SP,N,,
1,9329336,9431122,2019-12-22 00:00:00.000,2020-01-31 00:00:00.000,CHL,CHL,,,,N,RMV,MX,SP,N,,
2,9329337,9431123,2019-12-22 00:00:00.000,2020-01-31 00:00:00.000,CHL,CHL,,,,N,RMV,MX,SP,N,,
3,9329338,9431124,2019-12-09 00:00:00.000,2020-02-13 00:00:00.000,KRO,BCJ,X,2020-02-24 00:00:00.000,N,D,RMV,CU,SP,Y,2020-02-07 00:00:00.000,
4,9329339,9431125,2019-12-12 00:00:00.000,2020-01-31 00:00:00.000,CHL,CHL,,,,R,RMV,MX,SP,N,2020-01-31 00:00:00.000,2020-01-31 00:00:00.000


In [582]:
juvenile_proceedings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2819051 entries, 0 to 2819050
Data columns (total 16 columns):
 #   Column            Dtype   
---  ------            -----   
 0   IDNPROCEEDING     Int64   
 1   IDNCASE           Int64   
 2   OSC_DATE          object  
 3   INPUT_DATE        object  
 4   BASE_CITY_CODE    category
 5   HEARING_LOC_CODE  category
 6   DEC_CODE          category
 7   COMP_DATE         object  
 8   ABSENTIA          category
 9   CUSTODY           category
 10  CASE_TYPE         category
 11  NAT               category
 12  LANG              category
 13  CRIM_IND          category
 14  DATE_DETAINED     object  
 15  DATE_RELEASED     object  
dtypes: Int64(2), category(9), object(5)
memory usage: 188.3+ MB


#### Datetime Format Validation

The following date-related features will be the focus of the next stage of preprocessing:  
- `OSC_DATE`  
- `INPUT_DATE`  
- `COMP_DATE`  
- `DATE_DETAINED`  
- `DATE_RELEASED`  

These features may require format standardization and conversion to datetime objects to enable accurate temporal analysis.

Every datetime feature in this table follows the format `'YYYY-MM-DD 00:00:00.000'` (e.g., `'2025-02-04 00:00:00.000'`).

Before conversion, each feature will be tested against this pattern to ensure values are valid.  
All non-null entries will be checked to avoid unintended data loss during transformation with `pd.to_datetime()`.

Only the **`YYYY-MM-DD`** portion of each timestamp will be retained.
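
Because `errors="coerce"` silently turns unparseable values into `NaT`, comparing non-null counts before and after conversion is a simple way to confirm nothing was lost. The sketch below uses toy data and an explicit `format`, which is an assumption based on the pattern described above.

```python
import pandas as pd

# Toy column: two valid timestamps, one missing value, one malformed value.
raw = pd.Series(
    ["2020-02-24 00:00:00.000", None, "not a date", "2019-12-09 00:00:00.000"]
)

parsed = pd.to_datetime(raw, format="%Y-%m-%d %H:%M:%S.%f", errors="coerce")

# Values that were non-null before parsing but became NaT afterwards.
lost = int(raw.notna().sum() - parsed.notna().sum())
print(f"Values lost to coercion: {lost}")
```

A result of zero on the real columns would confirm that the conversion did not discard any data.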

In [583]:
invalid_osc_date = find_invalid_dates(juvenile_proceedings, "OSC_DATE")
invalid_input_date = find_invalid_dates(juvenile_proceedings, "INPUT_DATE")
invalid_comp_date = find_invalid_dates(juvenile_proceedings, "COMP_DATE")
invalid_date_detained = find_invalid_dates(juvenile_proceedings, "DATE_DETAINED")
invalid_date_released = find_invalid_dates(juvenile_proceedings, "DATE_RELEASED")

In [584]:
report_invalid("OSC_DATE", invalid_osc_date)
report_invalid("INPUT_DATE", invalid_input_date)
report_invalid("COMP_DATE", invalid_comp_date)
report_invalid("DATE_DETAINED", invalid_date_detained)
report_invalid("DATE_RELEASED", invalid_date_released)

OSC_DATE: 0 invalid entries
INPUT_DATE: 0 invalid entries
COMP_DATE: 0 invalid entries
DATE_DETAINED: 0 invalid entries
DATE_RELEASED: 0 invalid entries


In [585]:
date_cols = ["OSC_DATE", "INPUT_DATE", "COMP_DATE", "DATE_DETAINED", "DATE_RELEASED"]

juvenile_proceedings[date_cols] = juvenile_proceedings[date_cols].apply(
    lambda col: pd.to_datetime(col, errors="coerce")
)

In [586]:
juvenile_proceedings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2819051 entries, 0 to 2819050
Data columns (total 16 columns):
 #   Column            Dtype         
---  ------            -----         
 0   IDNPROCEEDING     Int64         
 1   IDNCASE           Int64         
 2   OSC_DATE          datetime64[ns]
 3   INPUT_DATE        datetime64[ns]
 4   BASE_CITY_CODE    category      
 5   HEARING_LOC_CODE  category      
 6   DEC_CODE          category      
 7   COMP_DATE         datetime64[ns]
 8   ABSENTIA          category      
 9   CUSTODY           category      
 10  CASE_TYPE         category      
 11  NAT               category      
 12  LANG              category      
 13  CRIM_IND          category      
 14  DATE_DETAINED     datetime64[ns]
 15  DATE_RELEASED     datetime64[ns]
dtypes: Int64(2), category(9), datetime64[ns](5)
memory usage: 188.3 MB


In [587]:
null_counts = juvenile_proceedings.isna().sum()
percent_missing = (null_counts / len(juvenile_proceedings)) * 100

missing_summary = pd.DataFrame(
    {"Missing Count": null_counts, "Missing %": percent_missing.round(2)}
).sort_values(by="Missing Count", ascending=False)

display(missing_summary)

Unnamed: 0,Missing Count,Missing %
DATE_RELEASED,1797051,63.75
DATE_DETAINED,1339487,47.52
DEC_CODE,1325544,47.02
ABSENTIA,443884,15.75
COMP_DATE,419983,14.9
CRIM_IND,173907,6.17
OSC_DATE,24218,0.86
INPUT_DATE,11182,0.4
NAT,8194,0.29
LANG,1919,0.07


### Categorical Value Cleanup

The categorical columns are reviewed to identify and clean inconsistent or invalid entries.  
Suspicious values (e.g., empty strings, control characters, or rare invalid codes) will be converted to `NaN` for clarity and downstream compatibility.

Target columns:
- `BASE_CITY_CODE`
- `HEARING_LOC_CODE`
- `DEC_CODE`
- `ABSENTIA`
- `CUSTODY`
- `CASE_TYPE`
- `NAT`
- `LANG`
- `CRIM_IND`

This step ensures categorical data is standardized and ready for analysis.
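
One way to express this cleanup is an allow-list: keep the codes known to be valid and turn everything else into `NaN`. The sketch below applies that to a toy `ABSENTIA`-like column; the list in `valid` is an assumption for illustration.

```python
import pandas as pd

# Toy column with valid codes, a whitespace value, a stray code, and a missing value.
absentia = pd.Series(["N", "Y", " ", "5", None, "N"], dtype="category")

valid = ["N", "Y"]  # assumed allow-list for this example

# Keep values in the allow-list, set the rest to NaN,
# then drop categories that no longer occur.
cleaned = absentia.where(absentia.isin(valid)).cat.remove_unused_categories()

print(cleaned.value_counts(dropna=False))
```

An allow-list also catches whitespace-only or otherwise unexpected values that an exact-match replacement could miss.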

In [588]:
juvenile_proceedings["BASE_CITY_CODE"].value_counts()

BASE_CITY_CODE
MIA    130860
NYC    121506
SFR    105165
SNA    103132
DAL     95447
        ...  
ELC         0
LAN         0
TST         0
BDC         0
NYD         0
Name: count, Length: 88, dtype: int64

In [589]:
juvenile_proceedings["HEARING_LOC_CODE"].value_counts()

HEARING_LOC_CODE
MIA    123648
NYC    117607
SFR     82208
DAL     76035
CHL     75154
        ...  
POL         0
ECE         0
ECD         0
PRC         0
1AC         0
Name: count, Length: 836, dtype: int64

In [590]:
display(
    juvenile_proceedings[
        juvenile_proceedings["CASE_TYPE"].notna()
        & juvenile_proceedings["DEC_CODE"].notna()
    ][["CASE_TYPE", "DEC_CODE"]]
)

Unnamed: 0,CASE_TYPE,DEC_CODE
3,RMV,X
5,RMV,V
7,RMV,X
8,RMV,X
9,CFR,A
...,...,...
2819042,RMV,X
2819044,CFR,V
2819045,RMV,X
2819046,RMV,X


In [591]:
juvenile_proceedings["DEC_CODE"].value_counts()

DEC_CODE
X    712195
U    284771
T    183240
V    105445
A     84170
R     82231
      14761
D     11935
W      5522
G      2843
O      2084
E      1479
Z      1457
L       953
H       342
J        71
S         4
C         3
K         1
2         0
Name: count, dtype: int64

In [592]:
juvenile_proceedings["DEC_CODE"] = juvenile_proceedings["DEC_CODE"].replace("2", np.nan)


In [593]:
juvenile_proceedings["ABSENTIA"].value_counts()

ABSENTIA
N    1969135
Y     406018
          14
5          0
Name: count, dtype: int64

In [594]:
juvenile_proceedings["ABSENTIA"] = juvenile_proceedings["ABSENTIA"].replace(
    {"": np.nan, "5": np.nan}
)


In [595]:
juvenile_proceedings["CUSTODY"].value_counts()

CUSTODY
N    1257933
R    1076461
D     484641
          1
          0
          0
          0
C          0
Name: count, dtype: int64

In [596]:
valid_custody = ["N", "R", "D"]
juvenile_proceedings["CUSTODY"] = juvenile_proceedings["CUSTODY"].where(
    juvenile_proceedings["CUSTODY"].isin(valid_custody), np.nan
)

In [597]:
juvenile_proceedings[["CRIM_IND"]].value_counts()

CRIM_IND
N           2586750
Y             58378
                 16
Name: count, dtype: int64

In [598]:
juvenile_proceedings["CRIM_IND"] = juvenile_proceedings["CRIM_IND"].replace(
    r"^\s*$", np.nan, regex=True
)

In [599]:
juvenile_proceedings["LANG"].value_counts()

LANG
SP     2225917
ENG     121951
POR      62926
PUN      48666
CRE      30086
        ...   
BGQ          0
RMS          0
ILG          0
BEM          0
FSI          0
Name: count, Length: 567, dtype: int64

In [600]:
juvenile_proceedings["NAT"].value_counts()

NAT
GT    599057
HO    532679
MX    371442
ES    333348
CU    109364
       ...  
PC         0
PF         0
RQ         0
SB         0
b6         0
Name: count, Length: 264, dtype: int64

With invalid values replaced by `NaN`, categories that no longer occur in the data are removed so that later summaries list only the codes that are actually present.

In [601]:
categorical_cols = [
    "BASE_CITY_CODE",
    "HEARING_LOC_CODE",
    "DEC_CODE",
    "ABSENTIA",
    "CUSTODY",
    "CASE_TYPE",
    "NAT",
    "LANG",
    "CRIM_IND",
]

for col in categorical_cols:
    if isinstance(juvenile_proceedings[col].dtype, pd.CategoricalDtype):
        juvenile_proceedings[col] = juvenile_proceedings[
            col
        ].cat.remove_unused_categories()

In [602]:
juvenile_proceedings.duplicated().sum()

0

In [603]:
juvenile_proceedings.head()

Unnamed: 0,IDNPROCEEDING,IDNCASE,OSC_DATE,INPUT_DATE,BASE_CITY_CODE,HEARING_LOC_CODE,DEC_CODE,COMP_DATE,ABSENTIA,CUSTODY,CASE_TYPE,NAT,LANG,CRIM_IND,DATE_DETAINED,DATE_RELEASED
0,9329335,9431121,2019-12-22,2020-01-31,CHL,CHL,,NaT,,N,RMV,MX,SP,N,NaT,NaT
1,9329336,9431122,2019-12-22,2020-01-31,CHL,CHL,,NaT,,N,RMV,MX,SP,N,NaT,NaT
2,9329337,9431123,2019-12-22,2020-01-31,CHL,CHL,,NaT,,N,RMV,MX,SP,N,NaT,NaT
3,9329338,9431124,2019-12-09,2020-02-13,KRO,BCJ,X,2020-02-24,N,D,RMV,CU,SP,Y,2020-02-07,NaT
4,9329339,9431125,2019-12-12,2020-01-31,CHL,CHL,,NaT,,R,RMV,MX,SP,N,2020-01-31,2020-01-31


In [604]:
juvenile_proceedings.to_csv(
    "../cleaned/juvenile_proceedings_cleaned.csv.gz", index=False, compression="gzip"
)

### Cleaned `B_TblProceeding`

- The raw file (about 12 million rows) was filtered to only include proceedings related to juveniles by matching `idnProceeding` values found in `tbl_JuvenileHistory`, resulting in approximately 2.8 million rows.
- Malformed rows with missing or extra fields were detected and skipped during import to prevent errors.
- Only relevant columns (as defined in `selected_columns` and `dtypes` in the notebook) were retained for analysis.
- All data was initially read as `str`, then column data types were updated as needed.
- The DataFrame index was reset after filtering.
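- Date fields were validated against the expected pattern and converted to `datetime64[ns]`.
- Invalid categorical values were replaced with `NaN`, and unused categories were removed.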
- The final table contains only juvenile-related proceeding records.

Saved the cleaned file as:  
- `juvenile_proceedings_cleaned.csv.gz`