In [1]:
import pandas as pd

In [2]:
# None removes the column limit so every column is displayed.
pd.options.display.max_columns = None

### Clean `B_TblProceeding` table

#### Initial Data Inspection

The dataset contains approximately 16 million rows. To avoid memory issues and unnecessary processing, only the first 1,000 rows are loaded for the initial inspection, with three goals:

- Get an overview of the data types (`dtypes`)
- Identify columns/features worth keeping for the EDA
- Skip any columns that appear to be irrelevant or redundant

This first pass narrows the full load to the columns that matter for the analysis.

In [3]:
# The file has a .csv extension but is tab-delimited.
proceedings_path = "../../data/raw/B_TblProceeding.csv"

proceedings = pd.read_csv(
    filepath_or_buffer=proceedings_path, delimiter="\t", nrows=1000
)

In [4]:
proceedings.head()

Unnamed: 0,IDNPROCEEDING,IDNCASE,OSC_DATE,INPUT_DATE,BASE_CITY_CODE,HEARING_LOC_CODE,IJ_CODE,TRANS_IN_DATE,PREV_HEARING_LOC,PREV_HEARING_BASE,PREV_IJ_CODE,TRANS_NBR,HEARING_DATE,HEARING_TIME,DEC_TYPE,DEC_CODE,DEPORTED_1,DEPORTED_2,OTHER_COMP,APPEAL_RSVD,APPEAL_NOT_FILED,COMP_DATE,ABSENTIA,VENUE_CHG_GRANTED,TRANSFER_TO,DATE_APPEAL_DUE_STATUS,TRANSFER_STATUS,CUSTODY,CASE_TYPE,NAT,LANG,SCHEDULED_HEAR_LOC,CORRECTIONAL_FAC,CRIM_IND,IHP,AGGRAVATE_FELON,DATE_DETAINED,DATE_RELEASED
0,9329335,9431121,2019-12-22 00:00:00.000,2020-01-31 00:00:00.000,CHL,CHL,SC2,,,,,,2022-02-22 00:00:00.000,800.0,,,,,,,,,,,,,,N,RMV,MX,SP,,,N,,0.0,,
1,9329336,9431122,2019-12-22 00:00:00.000,2020-01-31 00:00:00.000,CHL,CHL,SC2,,,,,,2022-02-22 00:00:00.000,800.0,,,,,,,,,,,,,,N,RMV,MX,SP,,,N,,0.0,,
2,9329337,9431123,2019-12-22 00:00:00.000,2020-01-31 00:00:00.000,CHL,CHL,SC2,,,,,,2022-02-22 00:00:00.000,800.0,,,,,,,,,,,,,,N,RMV,MX,SP,,,N,,0.0,,
3,9329338,9431124,2019-12-09 00:00:00.000,2020-02-13 00:00:00.000,KRO,BCJ,DL,,,,,,2020-02-24 00:00:00.000,830.0,O,X,CU,,,,,2020-02-24 00:00:00.000,N,,,2020-03-25 00:00:00.000,,D,RMV,CU,SP,,,Y,,0.0,2020-02-07 00:00:00.000,
4,9329339,9431125,2019-12-12 00:00:00.000,2020-01-31 00:00:00.000,CHL,CHL,SC2,,,,,,2022-02-22 00:00:00.000,800.0,,,,,,,,,,,,,,R,RMV,MX,SP,,,N,,0.0,2020-01-31 00:00:00.000,2020-01-31 00:00:00.000


In [5]:
proceedings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 38 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   IDNPROCEEDING           1000 non-null   int64  
 1   IDNCASE                 1000 non-null   int64  
 2   OSC_DATE                1000 non-null   object 
 3   INPUT_DATE              971 non-null    object 
 4   BASE_CITY_CODE          1000 non-null   object 
 5   HEARING_LOC_CODE        1000 non-null   object 
 6   IJ_CODE                 983 non-null    object 
 7   TRANS_IN_DATE           208 non-null    object 
 8   PREV_HEARING_LOC        430 non-null    object 
 9   PREV_HEARING_BASE       430 non-null    object 
 10  PREV_IJ_CODE            426 non-null    object 
 11  TRANS_NBR               436 non-null    float64
 12  HEARING_DATE            965 non-null    object 
 13  HEARING_TIME            965 non-null    float64
 14  DEC_TYPE                624 non-null    o

In [6]:
proceedings.columns

Index(['IDNPROCEEDING', 'IDNCASE', 'OSC_DATE', 'INPUT_DATE', 'BASE_CITY_CODE',
       'HEARING_LOC_CODE', 'IJ_CODE', 'TRANS_IN_DATE', 'PREV_HEARING_LOC',
       'PREV_HEARING_BASE', 'PREV_IJ_CODE', 'TRANS_NBR', 'HEARING_DATE',
       'HEARING_TIME', 'DEC_TYPE', 'DEC_CODE', 'DEPORTED_1', 'DEPORTED_2',
       'OTHER_COMP', 'APPEAL_RSVD', 'APPEAL_NOT_FILED', 'COMP_DATE',
       'ABSENTIA', 'VENUE_CHG_GRANTED', 'TRANSFER_TO',
       'DATE_APPEAL_DUE_STATUS', 'TRANSFER_STATUS', 'CUSTODY', 'CASE_TYPE',
       'NAT', 'LANG', 'SCHEDULED_HEAR_LOC', 'CORRECTIONAL_FAC', 'CRIM_IND',
       'IHP', 'AGGRAVATE_FELON', 'DATE_DETAINED', 'DATE_RELEASED'],
      dtype='object')

#### Selected Features for EDA – `B_TblProceeding`

The selected columns below are core proceeding-level features relevant to analyzing juvenile immigration cases. They were chosen based on the source dataset documentation and their importance for understanding case timelines, decisions, and access to justice.

These fields include:

- **Identifiers**: `IDNPROCEEDING`, `IDNCASE`
- **Case timeline**: `OSC_DATE`, `INPUT_DATE`, `COMP_DATE`
- **Case outcomes**: `DEC_CODE`, `ABSENTIA`, `CRIM_IND`
- **Demographics & access**: `NAT`, `LANG`, `CUSTODY`, `CASE_TYPE`
- **Detention dates**: `DATE_DETAINED`, `DATE_RELEASED`
- **Court geography**: `BASE_CITY_CODE`, `HEARING_LOC_CODE`

These variables allow us to track the duration of proceedings, categorize outcomes, and evaluate regional disparities and vulnerability factors.
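
As a rough illustration of the duration idea, the sketch below computes the number of days between `OSC_DATE` and `COMP_DATE` on two toy rows written in the table's timestamp format. Treating these two columns as the start and end of a proceeding is an assumption made here for illustration, and `DAYS_TO_COMPLETION` is a hypothetical column name that does not exist in the source data.

```python
import pandas as pd

# Toy rows in the same timestamp format as the source table.
toy = pd.DataFrame(
    {
        "OSC_DATE": ["2019-12-09 00:00:00.000", "2019-12-22 00:00:00.000"],
        "COMP_DATE": ["2020-02-24 00:00:00.000", None],
    }
)

for col in ["OSC_DATE", "COMP_DATE"]:
    toy[col] = pd.to_datetime(toy[col], errors="coerce")

# Hypothetical derived column: days from OSC_DATE to COMP_DATE.
# It stays missing when COMP_DATE is missing.
toy["DAYS_TO_COMPLETION"] = (toy["COMP_DATE"] - toy["OSC_DATE"]).dt.days
print(toy)
```

Rows without a completion date keep a missing duration rather than a misleading zero.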

In [7]:
selected_columns = [
    "IDNPROCEEDING",
    "IDNCASE",
    "OSC_DATE",
    "INPUT_DATE",
    "COMP_DATE",
    "BASE_CITY_CODE",
    "HEARING_LOC_CODE",
    "DEC_CODE",
    "ABSENTIA",
    "CRIM_IND",
    "NAT",
    "LANG",
    "CASE_TYPE",
    "CUSTODY",
    "DATE_DETAINED",
    "DATE_RELEASED",
]

#### Specifying Column Data Types

- `Int64`: Assigned to `IDNPROCEEDING` (primary key) and `IDNCASE` (foreign key to `cases` table) to support nullable integer identifiers.  
- `category`: Applied to repeated string fields for memory efficiency:
  `BASE_CITY_CODE`, `HEARING_LOC_CODE`, `DEC_CODE`, `ABSENTIA`, `CRIM_IND`, `NAT`, `LANG`, `CASE_TYPE`, `CUSTODY`.  
- `object`: Used for date fields (`OSC_DATE`, `INPUT_DATE`, `COMP_DATE`, `DATE_DETAINED`, `DATE_RELEASED`) to preserve original formats before validation and conversion to `datetime`.
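
To make the memory argument concrete, here is a minimal sketch on synthetic data (not the real table) that compares `object` and `category` storage for a column of repeated codes, and shows how `Int64` keeps a missing identifier as `<NA>`. The sample values and sizes are made up for illustration.

```python
import pandas as pd

# Synthetic column of repeated short codes, similar in spirit to NAT or LANG.
codes = pd.Series(["MX", "CU", "SV", "MX"] * 25_000, dtype="object")

as_object = codes.memory_usage(deep=True)
as_category = codes.astype("category").memory_usage(deep=True)

print(f"object:   {as_object:,} bytes")
print(f"category: {as_category:,} bytes")

# Nullable Int64 keeps a missing identifier as <NA> instead of casting to float.
ids = pd.Series([9329335, None, 9329337], dtype="Int64")
print(ids)
```

The savings depend on how few distinct values a column has relative to its length, so high-cardinality columns benefit less.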

In [8]:
dtypes = {
    "IDNPROCEEDING": "Int64",
    "IDNCASE": "Int64",
    "OSC_DATE": "object",
    "INPUT_DATE": "object",
    "COMP_DATE": "object",
    "BASE_CITY_CODE": "category",
    "HEARING_LOC_CODE": "category",
    "DEC_CODE": "category",
    "ABSENTIA": "category",
    "CRIM_IND": "category",
    "NAT": "category",
    "LANG": "category",
    "CASE_TYPE": "category",
    "CUSTODY": "category",
    "DATE_DETAINED": "object",
    "DATE_RELEASED": "object",
}

#### Skipping Malformed Rows

Some lines in the dataset have an incorrect number of tab-separated fields, which can cause parsing errors during loading.  
The script first determines the expected number of columns from the header, then identifies and skips any rows that don't match.  
This approach ensures only properly structured records are read without triggering errors or discarding valid data.

In [9]:
import csv

bad_rows = []
with open(proceedings_path, "r", encoding="utf-8", errors="ignore") as f:
    # The header row defines the expected number of tab-separated fields.
    header = f.readline().rstrip("\n").split("\t")
    n_fields = len(header)

    # The header is file line 0, so data lines are numbered from 1.
    # These 0-based line numbers are what `skiprows` expects.
    for i, line in enumerate(f, start=1):
        if len(line.rstrip("\n").split("\t")) != n_fields:
            bad_rows.append(i)

print("Rows to skip:", bad_rows)

Rows to skip: [97835, 570096, 570097, 711871, 808187, 925893, 1340981, 1374429, 1375092, 1526869, 1540198, 1634092, 2027082, 2178293, 2178294, 2799401, 2833726, 2845396, 2874449, 2874454, 3050966, 3072502, 3089401, 3094348, 3314332, 3314333, 3499731, 3529540, 3834981, 4010908, 4192732, 4192733, 4247596, 4328751, 4328752, 4402946, 4456120, 4485669, 4485670, 4594041, 4594042, 4777655, 5037195, 5289573, 5294225, 5460027, 5505089, 5685628, 5765052, 5767554, 5770037, 6079161, 6079162, 6354988, 6372650, 6720545, 6720546, 6906838, 7537286, 7537287, 7903157, 7903158, 8106585, 8106586, 8192257, 8296762, 8561598, 8660111, 8738299, 8738764, 8744904, 8831487, 8914785, 9882723, 9973801, 10102404, 10244630, 10284073, 10438487, 10551043, 10813073, 10894138, 10894139, 10937485, 11256349, 11256350, 11415193, 11415194, 11451606, 11482022, 11523116, 11539956, 11551964, 11586386, 11668880, 13514773, 13514774, 13529532, 13529533, 13535869, 14139697, 14139698, 14392108, 14392109, 14464216, 14557193, 1455719

In [10]:
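proceedings = pd.read_csv(
    proceedings_path,
    sep="\t",
    usecols=selected_columns,
    dtype=str,  # read everything as text; the final dtypes are applied afterwards
    skiprows=bad_rows,  # 0-based file line numbers collected above
    quoting=csv.QUOTE_NONE,  # treat quote characters as ordinary data
)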
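
The field-count scan and `skiprows` can be tried on a small in-memory sample. The sketch below uses a made-up tab-separated string in which one line carries an extra field; the column names and values are hypothetical.

```python
import csv
import io

import pandas as pd

# Hypothetical tab-separated sample: file line 2 has one field too many.
raw = "ID\tNAT\tLANG\n1\tMX\tSP\n2\tCU\tSP\tEXTRA\n3\tMX\tENG\n"

lines = raw.splitlines()
n_fields = len(lines[0].split("\t"))

# Line 0 is the header, so data lines are numbered from 1.
bad_rows = [
    i
    for i, line in enumerate(lines[1:], start=1)
    if len(line.split("\t")) != n_fields
]

sample = pd.read_csv(
    io.StringIO(raw),
    sep="\t",
    dtype=str,
    skiprows=bad_rows,
    quoting=csv.QUOTE_NONE,
)
print(bad_rows)
print(sample)
```

Because the header is line 0 in both the scan and `skiprows`, the collected indices can be passed through without any offset.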

In [11]:
proceedings = proceedings.astype(dtypes)

In [12]:
proceedings.shape

(15560384, 16)

#### Filtering for Juvenile Proceedings

To isolate juvenile records from the `proceedings` DataFrame, apply a filter using `idnProceeding` values from `tbl_JuvenileHistory`.  
This approach avoids an expensive join operation and reduces memory usage while ensuring that only juvenile-related proceedings are retained.

In [13]:
juvenile_history = pd.read_csv(
    "../../data/cleaned/juvenile_history_cleaned.csv.gz",
    usecols=["idnProceeding"],
    dtype={"idnProceeding": "Int64"},
)

In [14]:
juvenile_proceeding_ids = juvenile_history["idnProceeding"].dropna().unique()

In [15]:
juvenile_proceedings = proceedings[
    proceedings["IDNPROCEEDING"].isin(juvenile_proceeding_ids)
].reset_index(drop=True)

In [16]:
print(
    f"Number of idnProceeding keys in tbl_JuvenileHistory.csv table: {len(juvenile_proceeding_ids):,}"
)
print(
    f"Number of matched juvenile rows in B_TblProceeding.csv based on those keys: {juvenile_proceedings.shape[0]:,}"
)

Number of idnProceeding keys in tbl_JuvenileHistory.csv table: 2,819,053
Number of matched juvenile rows in B_TblProceeding.csv based on those keys: 2,819,051
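
The two counts differ by two, which suggests that two juvenile keys have no counterpart in the loaded proceedings (the cause is not established here). A minimal sketch of how such keys could be listed, using toy stand-ins for both tables with made-up values:

```python
import pandas as pd

# Hypothetical stand-ins for the two tables.
toy_proceedings = pd.DataFrame(
    {
        "IDNPROCEEDING": pd.array([1, 2, 3, 4], dtype="Int64"),
        "NAT": ["MX", "CU", "MX", "SV"],
    }
)
toy_juvenile_keys = pd.Series([2, 4, 5], dtype="Int64")

# Same membership filter as in the notebook.
matched = toy_proceedings[toy_proceedings["IDNPROCEEDING"].isin(toy_juvenile_keys)]

# Keys present in the juvenile list but absent from the proceedings table.
unmatched = toy_juvenile_keys[
    ~toy_juvenile_keys.isin(toy_proceedings["IDNPROCEEDING"])
]

print(len(matched), unmatched.tolist())
```

Running the same negated `isin` on the real key array would show which identifiers are missing from the loaded table.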


#### Data Inspection

In [17]:
juvenile_proceedings.head()

Unnamed: 0,IDNPROCEEDING,IDNCASE,OSC_DATE,INPUT_DATE,BASE_CITY_CODE,HEARING_LOC_CODE,DEC_CODE,COMP_DATE,ABSENTIA,CUSTODY,CASE_TYPE,NAT,LANG,CRIM_IND,DATE_DETAINED,DATE_RELEASED
0,9329335,9431121,2019-12-22 00:00:00.000,2020-01-31 00:00:00.000,CHL,CHL,,,,N,RMV,MX,SP,N,,
1,9329336,9431122,2019-12-22 00:00:00.000,2020-01-31 00:00:00.000,CHL,CHL,,,,N,RMV,MX,SP,N,,
2,9329337,9431123,2019-12-22 00:00:00.000,2020-01-31 00:00:00.000,CHL,CHL,,,,N,RMV,MX,SP,N,,
3,9329338,9431124,2019-12-09 00:00:00.000,2020-02-13 00:00:00.000,KRO,BCJ,X,2020-02-24 00:00:00.000,N,D,RMV,CU,SP,Y,2020-02-07 00:00:00.000,
4,9329339,9431125,2019-12-12 00:00:00.000,2020-01-31 00:00:00.000,CHL,CHL,,,,R,RMV,MX,SP,N,2020-01-31 00:00:00.000,2020-01-31 00:00:00.000


In [18]:
juvenile_proceedings.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2819051 entries, 0 to 2819050
Data columns (total 16 columns):
 #   Column            Dtype   
---  ------            -----   
 0   IDNPROCEEDING     Int64   
 1   IDNCASE           Int64   
 2   OSC_DATE          object  
 3   INPUT_DATE        object  
 4   BASE_CITY_CODE    category
 5   HEARING_LOC_CODE  category
 6   DEC_CODE          category
 7   COMP_DATE         object  
 8   ABSENTIA          category
 9   CUSTODY           category
 10  CASE_TYPE         category
 11  NAT               category
 12  LANG              category
 13  CRIM_IND          category
 14  DATE_DETAINED     object  
 15  DATE_RELEASED     object  
dtypes: Int64(2), category(9), object(5)
memory usage: 188.3+ MB


#### Datetime Format Validation

The following date-related features will be the focus of the next stage of preprocessing:  
- `OSC_DATE`  
- `INPUT_DATE`  
- `COMP_DATE`  
- `DATE_DETAINED`  
- `DATE_RELEASED`  

These features may require format standardization and conversion to datetime objects to enable accurate temporal analysis.

Every datetime feature in this table follows the format `'YYYY-MM-DD 00:00:00.000'` (e.g., `'2025-02-04 00:00:00.000'`). The one stated exception, `C_BIRTHDATE`, is not a column of `B_TblProceeding`.

Before conversion, each feature will be tested against this pattern to ensure values are valid.  
All non-null entries will be checked to avoid unintended data loss during transformation with `pd.to_datetime()`.

Only the **`YYYY-MM-DD`** portion of each timestamp will be retained.

In [19]:
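def find_invalid_dates(df, column):
    """
    Returns non-null rows that contain no `YYYY-MM-DD` substring.

    The pattern is unanchored, so this is a loose check: it does not verify
    the time component or that the date is a real calendar date.
    """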
    return df[
        df[column].notna()
        & ~df[column].astype(str).str.contains(r"\d{4}-\d{2}-\d{2}", regex=True)
    ][[column]]

In [20]:
invalid_osc_date = find_invalid_dates(juvenile_proceedings, "OSC_DATE")
invalid_input_date = find_invalid_dates(juvenile_proceedings, "INPUT_DATE")
invalid_comp_date = find_invalid_dates(juvenile_proceedings, "COMP_DATE")
invalid_date_detained = find_invalid_dates(juvenile_proceedings, "DATE_DETAINED")
invalid_date_released = find_invalid_dates(juvenile_proceedings, "DATE_RELEASED")

In [21]:
def report_invalid(name, df):
    """
    Prints the number of invalid entries and displays the DataFrame if not empty.
    """
    count = len(df)
    print(f"{name}: {count} invalid entr{'y' if count == 1 else 'ies'}")
    if count > 0:
        display(df)

In [22]:
report_invalid("OSC_DATE", invalid_osc_date)
report_invalid("INPUT_DATE", invalid_input_date)
report_invalid("COMP_DATE", invalid_comp_date)
report_invalid("DATE_DETAINED", invalid_date_detained)
report_invalid("DATE_RELEASED", invalid_date_released)

OSC_DATE: 0 invalid entries
INPUT_DATE: 0 invalid entries
COMP_DATE: 0 invalid entries
DATE_DETAINED: 0 invalid entries
DATE_RELEASED: 0 invalid entries
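
The check above uses `str.contains` with an unanchored pattern, so any value that merely contains a date-like substring passes. A stricter variant, sketched below on made-up values, uses `str.fullmatch` against the full `'YYYY-MM-DD 00:00:00.000'` layout. `STRICT_PATTERN` is a hypothetical name, and whether the real columns pass this stricter test has not been verified here.

```python
import pandas as pd

# Hypothetical values: the second embeds a date in other text,
# the third has a non-midnight time component.
toy = pd.Series(
    ["2020-01-31 00:00:00.000", "see 2020-01-31", "2020-01-31 12:30:00.000", None]
)

STRICT_PATTERN = r"\d{4}-\d{2}-\d{2} 00:00:00\.000"

loose_ok = toy.str.contains(r"\d{4}-\d{2}-\d{2}", regex=True, na=False)
strict_ok = toy.str.fullmatch(STRICT_PATTERN, na=False)

# Non-null values that pass the loose check but fail the strict one.
print(toy[toy.notna() & loose_ok & ~strict_ok])
```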


In [23]:
date_cols = ["OSC_DATE", "INPUT_DATE", "COMP_DATE", "DATE_DETAINED", "DATE_RELEASED"]

juvenile_proceedings[date_cols] = juvenile_proceedings[date_cols].apply(
    lambda col: pd.to_datetime(col, errors="coerce")
)
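
Because `errors="coerce"` silently turns unparseable values into `NaT`, one way to confirm that nothing was lost is to compare non-null values before and after parsing. The sketch below uses a made-up column containing an impossible calendar date that would still pass a regex check. The explicit `format` string is an assumption based on the layout described earlier.

```python
import pandas as pd

# Hypothetical column: the second value looks like a date but is not a real one.
raw = pd.Series(["2020-02-24 00:00:00.000", "2020-02-30 00:00:00.000", None])

parsed = pd.to_datetime(raw, format="%Y-%m-%d %H:%M:%S.%f", errors="coerce")

# Values that were present before parsing but became NaT afterwards.
lost = raw[raw.notna() & parsed.isna()]
print(f"{len(lost)} value(s) lost to coercion")
print(lost)
```

On the real data, this comparison would have to run on a copy of the string columns taken before the conversion overwrites them.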

In [24]:
juvenile_proceedings.dtypes

IDNPROCEEDING                Int64
IDNCASE                      Int64
OSC_DATE            datetime64[ns]
INPUT_DATE          datetime64[ns]
BASE_CITY_CODE            category
HEARING_LOC_CODE          category
DEC_CODE                  category
COMP_DATE           datetime64[ns]
ABSENTIA                  category
CUSTODY                   category
CASE_TYPE                 category
NAT                       category
LANG                      category
CRIM_IND                  category
DATE_DETAINED       datetime64[ns]
DATE_RELEASED       datetime64[ns]
dtype: object

In [25]:
null_counts = juvenile_proceedings.isna().sum()
percent_missing = (null_counts / len(juvenile_proceedings)) * 100

missing_summary = pd.DataFrame(
    {"Missing Count": null_counts, "Missing %": percent_missing.round(2)}
).sort_values(by="Missing Count", ascending=False)

display(missing_summary)

Unnamed: 0,Missing Count,Missing %
DATE_RELEASED,1797051,63.75
DATE_DETAINED,1339487,47.52
DEC_CODE,1325544,47.02
ABSENTIA,443884,15.75
COMP_DATE,419983,14.9
CRIM_IND,173907,6.17
OSC_DATE,24218,0.86
INPUT_DATE,11182,0.4
NAT,8194,0.29
LANG,1919,0.07


#### Categorical Value Cleanup

This step reviews the key categorical columns to detect and standardize inconsistent or invalid entries.  
Suspicious values (e.g., empty strings, control characters, or rare codes) are converted to `NaN` so that missing data is represented consistently in downstream tasks.

**Target columns:**
- `BASE_CITY_CODE`
- `HEARING_LOC_CODE`
- `DEC_CODE`
- `ABSENTIA`
- `CUSTODY`
- `CASE_TYPE`
- `NAT`
- `LANG`
- `CRIM_IND`

This step ensures categorical fields are clean, consistent, and analysis-ready.

First, unused category levels left over from filtering are removed so that every categorical column reflects only the values present in the `juvenile_proceedings` subset.

In [26]:
for col in juvenile_proceedings.select_dtypes(include="category"):
    juvenile_proceedings[col] = juvenile_proceedings[col].cat.remove_unused_categories()

In [27]:
juvenile_proceedings["BASE_CITY_CODE"].value_counts()

BASE_CITY_CODE
MIA    130860
NYC    121506
SFR    105165
SNA    103132
DAL     95447
        ...  
AGA       362
SAI       299
LOS        10
OIT         2
YOR         1
Name: count, Length: 80, dtype: int64

In [28]:
juvenile_proceedings["HEARING_LOC_CODE"].value_counts()

HEARING_LOC_CODE
MIA    123648
NYC    117607
SFR     82208
DAL     76035
CHL     75154
        ...  
CLI         1
HUG         1
CJD         1
CHV         1
ELD         1
Name: count, Length: 612, dtype: int64
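
Both location columns have a long tail of codes that appear only once or twice. A small helper like the one below could list those levels for manual review. `rare_levels` and the threshold are hypothetical choices, and the toy codes are made up.

```python
import pandas as pd


def rare_levels(series: pd.Series, min_count: int) -> pd.Series:
    """Return the value counts of levels seen fewer than `min_count` times."""
    counts = series.value_counts()
    return counts[counts < min_count]


# Hypothetical codes; the threshold of 2 is arbitrary.
toy = pd.Series(["MIA", "MIA", "NYC", "NYC", "NYC", "YOR"], dtype="category")
print(rare_levels(toy, min_count=2))
```

A rare code is not necessarily an error, so the output is better treated as a list to check against the dataset documentation than as a list to delete.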

In [29]:
display(
    juvenile_proceedings[
        juvenile_proceedings["CASE_TYPE"].notna()
        & juvenile_proceedings["DEC_CODE"].notna()
    ][["CASE_TYPE", "DEC_CODE"]]
)

Unnamed: 0,CASE_TYPE,DEC_CODE
3,RMV,X
5,RMV,V
7,RMV,X
8,RMV,X
9,CFR,A
...,...,...
2819042,RMV,X
2819044,CFR,V
2819045,RMV,X
2819046,RMV,X


In [30]:
juvenile_proceedings["DEC_CODE"].value_counts()

DEC_CODE
X    712195
U    284771
T    183240
V    105445
A     84170
R     82231
      14761
D     11935
W      5522
G      2843
O      2084
E      1479
Z      1457
L       953
H       342
J        71
S         4
C         3
K         1
Name: count, dtype: int64
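
The counts above include a blank `DEC_CODE` level with 14,761 rows. One possible way to null out such levels in any categorical column is to drop the offending categories, since `cat.remove_categories` sets the affected values to `NaN`. The sketch below is an illustration on made-up values, not the notebook's own method, and `drop_blank_levels` is a hypothetical helper.

```python
import re

import pandas as pd

# Matches strings made only of whitespace or ASCII control characters.
BLANK = re.compile(r"[\s\x00-\x1F\x7F]*")


def drop_blank_levels(series: pd.Series) -> pd.Series:
    """Set categorical levels that are blank or control-only to NaN."""
    blank = [lvl for lvl in series.cat.categories if BLANK.fullmatch(str(lvl))]
    # remove_categories turns every value in the removed levels into NaN.
    return series.cat.remove_categories(blank)


toy = pd.Series(["X", " ", "U", "", "\x00", None], dtype="category")
cleaned = drop_blank_levels(toy)
print(cleaned.cat.categories.tolist(), int(cleaned.isna().sum()))
```

Working on the category levels rather than on each value keeps the operation cheap even for millions of rows, and it leaves no zero-count level behind.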

In [31]:
juvenile_proceedings["ABSENTIA"].value_counts()

ABSENTIA
N    1969135
Y     406018
          14
Name: count, dtype: int64

In [32]:
juvenile_proceedings["ABSENTIA"] = juvenile_proceedings["ABSENTIA"].replace(
    r"^\s*$", pd.NA, regex=True
)

In [33]:
juvenile_proceedings["ABSENTIA"].value_counts()

ABSENTIA
N    1969135
Y     406018
           0
Name: count, dtype: int64

In [34]:
juvenile_proceedings["CUSTODY"].value_counts()

CUSTODY
N    1257933
R    1076461
D     484641
          1
Name: count, dtype: int64

In [35]:
juvenile_proceedings["CUSTODY"] = juvenile_proceedings["CUSTODY"].replace(
    r"^[\s\x00-\x1F\x7F]*$", pd.NA, regex=True
)

In [36]:
juvenile_proceedings["CUSTODY"].value_counts()

CUSTODY
N    1257933
R    1076461
D     484641
          0
Name: count, dtype: int64

In [37]:
juvenile_proceedings[["CRIM_IND"]].value_counts()

CRIM_IND
N           2586750
Y             58378
                 16
Name: count, dtype: int64

In [38]:
juvenile_proceedings["CRIM_IND"] = juvenile_proceedings["CRIM_IND"].replace(
    r"^[\s\x00-\x1F\x7F]*$", pd.NA, regex=True
)

In [39]:
juvenile_proceedings[["CRIM_IND"]].value_counts()

CRIM_IND
N           2586750
Y             58378
                  0
Name: count, dtype: int64

In [40]:
juvenile_proceedings["LANG"].value_counts()

LANG
SP     2225917
ENG     121951
POR      62926
PUN      48666
CRE      30086
        ...   
KAM          1
FLE          1
TAT          1
BAV          1
BKM          1
Name: count, Length: 462, dtype: int64

In [41]:
for col in juvenile_proceedings.select_dtypes(include="category"):
    juvenile_proceedings[col] = juvenile_proceedings[col].cat.remove_unused_categories()

In [42]:
juvenile_proceedings.duplicated().sum()

np.int64(0)
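
`duplicated()` with no arguments only flags rows that repeat in every column. Since `IDNPROCEEDING` is described as the primary key, a complementary check is whether the key alone is unique. The sketch below shows the difference on a made-up frame.

```python
import pandas as pd

# Hypothetical rows: the last two share a key but differ in NAT,
# so they are not full-row duplicates.
toy = pd.DataFrame({"IDNPROCEEDING": [1, 2, 2], "NAT": ["MX", "CU", "SV"]})

full_row_dupes = int(toy.duplicated().sum())
key_dupes = int(toy.duplicated(subset="IDNPROCEEDING").sum())

print(full_row_dupes, key_dupes, toy["IDNPROCEEDING"].is_unique)
```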

In [43]:
juvenile_proceedings.head()

Unnamed: 0,IDNPROCEEDING,IDNCASE,OSC_DATE,INPUT_DATE,BASE_CITY_CODE,HEARING_LOC_CODE,DEC_CODE,COMP_DATE,ABSENTIA,CUSTODY,CASE_TYPE,NAT,LANG,CRIM_IND,DATE_DETAINED,DATE_RELEASED
0,9329335,9431121,2019-12-22,2020-01-31,CHL,CHL,,NaT,,N,RMV,MX,SP,N,NaT,NaT
1,9329336,9431122,2019-12-22,2020-01-31,CHL,CHL,,NaT,,N,RMV,MX,SP,N,NaT,NaT
2,9329337,9431123,2019-12-22,2020-01-31,CHL,CHL,,NaT,,N,RMV,MX,SP,N,NaT,NaT
3,9329338,9431124,2019-12-09,2020-02-13,KRO,BCJ,X,2020-02-24,N,D,RMV,CU,SP,Y,2020-02-07,NaT
4,9329339,9431125,2019-12-12,2020-01-31,CHL,CHL,,NaT,,R,RMV,MX,SP,N,2020-01-31,2020-01-31
