In [15]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [16]:
pd.options.display.max_columns = None  # None = no limit on displayed columns

## 1. Clean tbl_JuvenileHistory

### Initial Data Inspection


The dataset contains approximately 3 million rows. To avoid memory issues and unnecessary processing, we first read only the first 100 rows in order to:

- Get an overview of the data types (`dtypes`)
- Identify columns/features worth keeping for the EDA
- Skip any columns that appear to be irrelevant or redundant

This initial check will help streamline the analysis and focus only on useful information.

In [17]:
juvenile_history_path = 'tbl_JuvenileHistory.csv'
juvenile_history = pd.read_csv(
    filepath_or_buffer=juvenile_history_path,
    delimiter='\t',
    nrows=100
)

In [18]:
juvenile_history.head()

Unnamed: 0,idnJuvenileHistory,idnCase,idnProceeding,idnJuvenile,DATCREATEDON,DATMODIFIEDON
0,5,2046990,3200129,1,2014-09-06 19:24:46.373,
1,6,2047179,3199488,1,2014-09-06 19:24:46.373,
2,7,2047179,3199489,1,2014-09-06 19:24:46.373,
3,8,2047199,3199497,1,2014-09-06 19:24:46.373,
4,9,2047199,3199498,1,2014-09-06 19:24:46.373,


In [19]:
juvenile_history.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 6 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   idnJuvenileHistory  100 non-null    int64  
 1   idnCase             100 non-null    int64  
 2   idnProceeding       100 non-null    int64  
 3   idnJuvenile         100 non-null    int64  
 4   DATCREATEDON        100 non-null    object 
 5   DATMODIFIEDON       0 non-null      float64
dtypes: float64(1), int64(4), object(1)
memory usage: 4.8+ KB


In [20]:
juvenile_history.columns

Index(['idnJuvenileHistory', 'idnCase', 'idnProceeding', 'idnJuvenile',
       'DATCREATEDON', 'DATMODIFIEDON'],
      dtype='object')

**`idnJuvenile`**: Although stored as `int64`, this column only contains values from 1 to 6 (as per `Lookup/tblLookup_Juvenile.csv`). It should be treated as a categorical feature.

**`DATCREATEDON`** and **`DATMODIFIEDON`**:
  - `DATCREATEDON` likely reflects when the record was entered into the system, not when the case itself was initiated, so it is not loaded for the EDA.
  - `DATMODIFIEDON` contains only null values in the 100-row sample and is dropped.
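The all-null observation above rests on 100 rows only. A minimal sketch of how it could be confirmed on the full file without loading it at once, assuming chunked reading is acceptable here. The helper name `null_counts_in_chunks` and the tiny in-memory sample are hypothetical stand-ins, not part of the original notebook:

```python
import io
import pandas as pd

def null_counts_in_chunks(source, sep="\t", chunksize=500_000):
    """Return (rows_seen, per-column null counts) by streaming the file in chunks."""
    total_rows = 0
    null_counts = None
    for chunk in pd.read_csv(source, sep=sep, chunksize=chunksize):
        total_rows += len(chunk)
        counts = chunk.isna().sum()
        null_counts = counts if null_counts is None else null_counts + counts
    return total_rows, null_counts

# Tiny stand-in for the real file (hypothetical values)
sample = io.StringIO(
    "idnJuvenileHistory\tDATCREATEDON\tDATMODIFIEDON\n"
    "5\t2014-09-06 19:24:46.373\t\n"
    "6\t2014-09-06 19:24:46.373\t\n"
    "7\t\t\n"
)
rows_seen, null_counts = null_counts_in_chunks(sample, chunksize=2)

# Columns whose null count equals the number of rows are entirely empty
fully_null = null_counts[null_counts == rows_seen].index.tolist()
print(fully_null)
```

On the real data, `source` would be `juvenile_history_path`; a column listed in `fully_null` is empty across the whole file, not just the sample.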

In [21]:
juvenile_history = pd.read_csv(
    filepath_or_buffer=juvenile_history_path,
    delimiter='\t',
    usecols=['idnJuvenileHistory', 
             'idnCase', 
             'idnProceeding', 
             'idnJuvenile'
             ],
    dtype={
        'idnJuvenileHistory': 'Int64',
        'idnCase': 'Int64',
        'idnProceeding': 'Int64',
        'idnJuvenile': 'category'
    },
    low_memory=False,
)

### Data Inspection

In [22]:
juvenile_history.head()

Unnamed: 0,idnJuvenileHistory,idnCase,idnProceeding,idnJuvenile
0,5,2046990,3200129,1
1,6,2047179,3199488,1
2,7,2047179,3199489,1
3,8,2047199,3199497,1
4,9,2047199,3199498,1


In [23]:
juvenile_history.shape

(2857093, 4)

In [24]:
juvenile_history.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2857093 entries, 0 to 2857092
Data columns (total 4 columns):
 #   Column              Dtype   
---  ------              -----   
 0   idnJuvenileHistory  Int64   
 1   idnCase             Int64   
 2   idnProceeding       Int64   
 3   idnJuvenile         category
dtypes: Int64(3), category(1)
memory usage: 76.3 MB


In [25]:
juvenile_history.isna().sum()

idnJuvenileHistory      0
idnCase                 0
idnProceeding          98
idnJuvenile           999
dtype: int64

In [26]:
juvenile_history.duplicated().sum()

0

#### `idnCase` List

Before working with `A_TblCase`, we extract the list of `idnCase` values  
associated with juvenile records. This will allow us to filter the case-level  
dataset to include only juvenile-related cases, without performing a full merge.
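One reason to prefer an ID list over a merge: the sample above shows the same `idnCase` on several history rows (e.g. `2047179`), so a merge would repeat the matching case row once per history row. A toy illustration with made-up IDs (the frames below are hypothetical, not the real tables):

```python
import pandas as pd

# Toy stand-ins: case 20 has two juvenile-history rows
history = pd.DataFrame({"idnCase": [10, 20, 20, 30]})
cases = pd.DataFrame({"IDNCASE": [10, 20, 40], "NAT": ["VE", "MX", "EC"]})

ids = history["idnCase"].unique()

# Semi-join style filter: each matching case row is kept exactly once
filtered = cases[cases["IDNCASE"].isin(ids)]

# Inner merge: case 20 appears twice, once per history row
merged = cases.merge(history, left_on="IDNCASE", right_on="idnCase")

print(len(filtered), len(merged))
```

The filter keeps the case table at one row per matching row of `cases`, which is usually what a case-level EDA needs.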

In [27]:
juvenile_case_ids = juvenile_history['idnCase'].unique()

In [28]:
juvenile_history.to_csv(
    '../outputs/juvenile_history_cleaned.csv.gz',
    index=False,
    compression='gzip'
)

### Cleaned `tbl_JuvenileHistory`

- Loaded ~2.86M rows (2,857,093) from the raw CSV file.
- Dropped `DATCREATEDON` (record-entry timestamp) and `DATMODIFIEDON` (fully null in the sample).
- Converted `idnJuvenile` to a categorical variable.
- Retained only 4 key fields:
  - `idnJuvenileHistory`: primary key
  - `idnCase`: foreign key to `A_TblCase`
  - `idnProceeding`: foreign key to `tbl_Proceeding`
  - `idnJuvenile`: foreign key to `tblLookup_Juvenile`
- Missing values:
  - `idnProceeding`: 98 missing
  - `idnJuvenile`: 999 missing

Saved cleaned file as:
- `../outputs/juvenile_history_cleaned.csv.gz`

## 2. Clean A_TblCase

### Initial Data Inspection


The dataset contains approximately 12 million rows. To avoid memory issues and unnecessary processing, we first read only the first 1000 rows in order to:

- Get an overview of the data types (`dtypes`)
- Identify columns/features worth keeping for the EDA
- Skip any columns that appear to be irrelevant or redundant

This initial check will help streamline the analysis and focus only on useful information.

In [29]:
case_path = 'A_TblCase.csv'
cases = pd.read_csv(
    filepath_or_buffer=case_path,
    delimiter='\t',
    nrows=1000
)

In [30]:
cases.head()

Unnamed: 0,IDNCASE,ALIEN_CITY,ALIEN_STATE,ALIEN_ZIPCODE,UPDATED_ZIPCODE,UPDATED_CITY,NAT,LANG,CUSTODY,SITE_TYPE,E_28_DATE,ATTY_NBR,CASE_TYPE,UPDATE_SITE,LATEST_HEARING,LATEST_TIME,LATEST_CAL_TYPE,UP_BOND_DATE,UP_BOND_RSN,CORRECTIONAL_FAC,RELEASE_MONTH,RELEASE_YEAR,INMATE_HOUSING,DATE_OF_ENTRY,C_ASY_TYPE,C_BIRTHDATE,C_RELEASE_DATE,UPDATED_STATE,ADDRESS_CHANGEDON,ZBOND_MRG_FLAG,GENDER,DATE_DETAINED,DATE_RELEASED,LPR,DETENTION_DATE,DETENTION_LOCATION,DCO_LOCATION,DETENTION_FACILITY_TYPE,CASEPRIORITY_CODE
0,11782069,SAINT CHARLES,IL,60174,,,VE,SP,N,M,,,RMV,CHI,2026-01-20 00:00:00.000,900.0,M,,,,,,,2023-05-04 00:00:00.000,,11/1999,,,,,F,,,,,,,,
1,11782070,PLYMOUTH,MA,2360,,,EC,SP,D,M,,,RMV,CHE,,,,,,,,,,2023-04-30 00:00:00.000,,11/1983,,,2024-11-19 08:32:39.000,,M,2024-11-14 00:00:00.000,,,,,,,
2,11782071,WEST SACRAMENTO,CA,95691,,,RU,RUS,N,M,,,RMV,SFR,2026-08-19 00:00:00.000,1330.0,M,,,,,,,2023-05-04 00:00:00.000,E,5/1985,,,2024-11-01 12:22:02.000,,F,,,,,,,,
3,11782072,SAN JOSE,CA,95117,,,MX,SP,N,M,2024-04-19 11:32:17.227,0.0,RMV,SFR,2028-02-17 00:00:00.000,1330.0,M,,,,,,,2023-05-03 00:00:00.000,E,9/1999,,,2025-03-17 14:07:53.000,,F,,,,,,,,
4,11782074,KEARNS,UT,84118,,,VE,SP,N,M,,,RMV,SLC,2025-11-25 00:00:00.000,1400.0,M,,,,,,,2023-04-30 00:00:00.000,E,9/2004,,,,,F,,,,,,,,


In [31]:
cases.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 39 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   IDNCASE                  1000 non-null   int64  
 1   ALIEN_CITY               958 non-null    object 
 2   ALIEN_STATE              974 non-null    object 
 3   ALIEN_ZIPCODE            974 non-null    object 
 4   UPDATED_ZIPCODE          0 non-null      float64
 5   UPDATED_CITY             0 non-null      float64
 6   NAT                      1000 non-null   object 
 7   LANG                     1000 non-null   object 
 8   CUSTODY                  1000 non-null   object 
 9   SITE_TYPE                878 non-null    object 
 10  E_28_DATE                271 non-null    object 
 11  ATTY_NBR                 526 non-null    float64
 12  CASE_TYPE                1000 non-null   object 
 13  UPDATE_SITE              1000 non-null   object 
 14  LATEST_HEARING           

In [32]:
cases.columns

Index(['IDNCASE', 'ALIEN_CITY', 'ALIEN_STATE', 'ALIEN_ZIPCODE',
       'UPDATED_ZIPCODE', 'UPDATED_CITY', 'NAT', 'LANG', 'CUSTODY',
       'SITE_TYPE', 'E_28_DATE', 'ATTY_NBR', 'CASE_TYPE', 'UPDATE_SITE',
       'LATEST_HEARING', 'LATEST_TIME', 'LATEST_CAL_TYPE', 'UP_BOND_DATE',
       'UP_BOND_RSN', 'CORRECTIONAL_FAC', 'RELEASE_MONTH', 'RELEASE_YEAR',
       'INMATE_HOUSING', 'DATE_OF_ENTRY', 'C_ASY_TYPE', 'C_BIRTHDATE',
       'C_RELEASE_DATE', 'UPDATED_STATE', 'ADDRESS_CHANGEDON',
       'ZBOND_MRG_FLAG', 'GENDER', 'DATE_DETAINED', 'DATE_RELEASED', 'LPR',
       'DETENTION_DATE', 'DETENTION_LOCATION', 'DCO_LOCATION',
       'DETENTION_FACILITY_TYPE', 'CASEPRIORITY_CODE'],
      dtype='object')

#### Selected Features for EDA

The list of selected columns below was discussed earlier in the documentation for the source dataset. It represents the core case-level features relevant to our analysis of juvenile immigration cases.

These fields include:

- Demographic information (e.g., `GENDER`, `NAT`, `LANG`)  
- Case characteristics (e.g., `CASE_TYPE`, `CUSTODY`, `LATEST_HEARING`)  
- Key dates (e.g., `DATE_OF_ENTRY`, `DETENTION_DATE`, `C_BIRTHDATE`)

In [33]:
selected_columns = [
    'IDNCASE',
    'NAT',
    'LANG',
    'CUSTODY',
    'CASE_TYPE',
    'LATEST_HEARING', 'LATEST_CAL_TYPE',
    'DATE_OF_ENTRY',
    'C_BIRTHDATE',
    'GENDER',
    'DATE_DETAINED', 'DATE_RELEASED',
    'DETENTION_DATE'
]

In [34]:
cases = cases[selected_columns]

In [35]:
cases.head()

Unnamed: 0,IDNCASE,NAT,LANG,CUSTODY,CASE_TYPE,LATEST_HEARING,LATEST_CAL_TYPE,DATE_OF_ENTRY,C_BIRTHDATE,GENDER,DATE_DETAINED,DATE_RELEASED,DETENTION_DATE
0,11782069,VE,SP,N,RMV,2026-01-20 00:00:00.000,M,2023-05-04 00:00:00.000,11/1999,F,,,
1,11782070,EC,SP,D,RMV,,,2023-04-30 00:00:00.000,11/1983,M,2024-11-14 00:00:00.000,,
2,11782071,RU,RUS,N,RMV,2026-08-19 00:00:00.000,M,2023-05-04 00:00:00.000,5/1985,F,,,
3,11782072,MX,SP,N,RMV,2028-02-17 00:00:00.000,M,2023-05-03 00:00:00.000,9/1999,F,,,
4,11782074,VE,SP,N,RMV,2025-11-25 00:00:00.000,M,2023-04-30 00:00:00.000,9/2004,F,,,


In [36]:
cases.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 13 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   IDNCASE          1000 non-null   int64  
 1   NAT              1000 non-null   object 
 2   LANG             1000 non-null   object 
 3   CUSTODY          1000 non-null   object 
 4   CASE_TYPE        1000 non-null   object 
 5   LATEST_HEARING   859 non-null    object 
 6   LATEST_CAL_TYPE  854 non-null    object 
 7   DATE_OF_ENTRY    643 non-null    object 
 8   C_BIRTHDATE      558 non-null    object 
 9   GENDER           554 non-null    object 
 10  DATE_DETAINED    39 non-null     object 
 11  DATE_RELEASED    5 non-null      object 
 12  DETENTION_DATE   0 non-null      float64
dtypes: float64(1), int64(1), object(11)
memory usage: 101.7+ KB


#### Note on `DETENTION_DATE`

This field is entirely null in the sample, but we'll reassess after  
loading the full dataset.

In [37]:
cases.dtypes

IDNCASE              int64
NAT                 object
LANG                object
CUSTODY             object
CASE_TYPE           object
LATEST_HEARING      object
LATEST_CAL_TYPE     object
DATE_OF_ENTRY       object
C_BIRTHDATE         object
GENDER              object
DATE_DETAINED       object
DATE_RELEASED       object
DETENTION_DATE     float64
dtype: object

#### Specifying Column Data Types

- `Int64`: Used for `IDNCASE` to allow nullable integer values.
- `category`: Applied to string columns with repeated values  
  (e.g., `NAT`, `LANG`, `CUSTODY`, `CASE_TYPE`, `LATEST_CAL_TYPE`, `GENDER`)  
  for efficient storage and faster processing.
- `string`: Used for `C_BIRTHDATE` since it uses a partial date format (`MM/YYYY`)  
  and may contain nulls.
- `float64`: Used for `DETENTION_DATE`, which contains only nulls in the  
  sample; pandas infers `float64` for an all-null column, and the type will be revisited if the full dataset contains actual values.
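To make the storage argument for `category` concrete, here is a small sketch comparing the memory footprint of a repeated-code column stored as Python objects versus as a categorical. The codes are taken from the sample rows; the series length is arbitrary:

```python
import pandas as pd

# 100,000 values drawn from three distinct case-type codes
codes = pd.Series(["RMV", "RFR", "WHO", "RMV", "RMV"] * 20_000, dtype=object)
as_category = codes.astype("category")

# deep=True counts the string objects themselves, not just the pointers
object_bytes = codes.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(f"object:   {object_bytes:,} bytes")
print(f"category: {category_bytes:,} bytes")
```

A categorical stores each distinct string once plus a small integer code per row, so the saving grows with the number of rows and shrinks as the number of distinct values rises.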

In [38]:
dtype = {
    'IDNCASE': 'Int64',
    'NAT': 'category',
    'LANG': 'category',
    'CUSTODY': 'category',
    'CASE_TYPE': 'category',
    'LATEST_CAL_TYPE': 'category',
    'GENDER': 'category',
    'C_BIRTHDATE': 'string',        
    'DETENTION_DATE': 'float64'     
}

In [39]:
cases = pd.read_csv(
    filepath_or_buffer=case_path,
    delimiter='\t',
    on_bad_lines='skip',
    usecols=selected_columns,
    dtype=dtype,
    parse_dates=[
        'LATEST_HEARING',
        'DATE_OF_ENTRY',
        'DATE_DETAINED',
        'DATE_RELEASED'
    ],
    low_memory=False,
    skiprows=[11711221]  # skip malformed row: C error - EOF inside string on this line
)
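The "EOF inside string" error mentioned in the comment typically arises when a field contains an unbalanced double quote. As a possible alternative to skipping a hard-coded row number, the parser can be told to treat quotes as ordinary characters. This is only a sketch on a hypothetical two-row sample; whether it suits the real file (which may contain legitimately quoted fields) is an assumption that would need checking:

```python
import csv
import io
import pandas as pd

# Hypothetical sample: the first data row has a stray, unclosed double quote
raw = 'IDNCASE\tALIEN_CITY\n1\t"PLYMOUTH\n2\tKEARNS\n'

# QUOTE_NONE disables quote handling, so the stray quote stays in the value
df = pd.read_csv(io.StringIO(raw), delimiter="\t", quoting=csv.QUOTE_NONE)
print(df)
```

With quoting disabled no row is dropped, at the cost of leaving literal quote characters in the affected values for later cleanup.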

### Data Inspection

In [40]:
cases.head()

Unnamed: 0,IDNCASE,NAT,LANG,CUSTODY,CASE_TYPE,LATEST_HEARING,LATEST_CAL_TYPE,DATE_OF_ENTRY,C_BIRTHDATE,GENDER,DATE_DETAINED,DATE_RELEASED,DETENTION_DATE
0,11782069,VE,SP,N,RMV,2026-01-20 00:00:00.000,M,2023-05-04 00:00:00.000,11/1999,F,,,
1,11782070,EC,SP,D,RMV,,,2023-04-30 00:00:00.000,11/1983,M,2024-11-14 00:00:00.000,,
2,11782071,RU,RUS,N,RMV,2026-08-19 00:00:00.000,M,2023-05-04 00:00:00.000,5/1985,F,,,
3,11782072,MX,SP,N,RMV,2028-02-17 00:00:00.000,M,2023-05-03 00:00:00.000,9/1999,F,,,
4,11782074,VE,SP,N,RMV,2025-11-25 00:00:00.000,M,2023-04-30 00:00:00.000,9/2004,F,,,


In [41]:
cases.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11711220 entries, 0 to 11711219
Data columns (total 13 columns):
 #   Column           Dtype   
---  ------           -----   
 0   IDNCASE          Int64   
 1   NAT              category
 2   LANG             category
 3   CUSTODY          category
 4   CASE_TYPE        category
 5   LATEST_HEARING   object  
 6   LATEST_CAL_TYPE  category
 7   DATE_OF_ENTRY    object  
 8   C_BIRTHDATE      string  
 9   GENDER           category
 10  DATE_DETAINED    object  
 11  DATE_RELEASED    object  
 12  DETENTION_DATE   float64 
dtypes: Int64(1), category(6), float64(1), object(4), string(1)
memory usage: 726.0+ MB
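Note that the four columns passed to `parse_dates` are still reported as `object` above, which suggests that some values could not be parsed and the columns were left as strings. A minimal sketch of an explicit conversion, assuming unparseable values may be turned into `NaT`. The toy frame and the `BIRTH_MONTH` column name are hypothetical:

```python
import pandas as pd

frame = pd.DataFrame({
    "LATEST_HEARING": ["2026-01-20 00:00:00.000", None, "not a date"],
    "C_BIRTHDATE": ["11/1999", "5/1985", None],
})

# errors="coerce" turns anything unparseable into NaT instead of raising
frame["LATEST_HEARING"] = pd.to_datetime(frame["LATEST_HEARING"], errors="coerce")

# Partial MM/YYYY dates are anchored to the first day of the month
frame["BIRTH_MONTH"] = pd.to_datetime(
    frame["C_BIRTHDATE"], format="%m/%Y", errors="coerce"
)

print(frame.dtypes)
```

Counting `NaT` values before and after the conversion shows how many entries were coerced, which is worth checking before relying on the dates.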


In [42]:
cases.shape

(11711220, 13)

#### Filtering for Juvenile Cases

The current `cases` DataFrame includes both adult and juvenile records.  
To isolate only juvenile cases, we will filter it using the list of  
`idnCase` values from `tbl_JuvenileHistory`, without performing a full merge.

In [43]:
juvenile_cases = cases[cases['IDNCASE'].isin(juvenile_case_ids)]

In [44]:
juvenile_cases.head()

Unnamed: 0,IDNCASE,NAT,LANG,CUSTODY,CASE_TYPE,LATEST_HEARING,LATEST_CAL_TYPE,DATE_OF_ENTRY,C_BIRTHDATE,GENDER,DATE_DETAINED,DATE_RELEASED,DETENTION_DATE
18,13758313,GT,SP,N,RMV,,,2024-01-21 00:00:00.000,2/2008,F,,,
34,14870586,MX,SP,D,RFR,2025-02-04 00:00:00.000,I,,,,2025-01-22 00:00:00.000,,
35,14870588,GT,SP,R,WHO,2025-07-31 00:00:00.000,M,,6/1997,F,2025-01-29 00:00:00.000,2025-02-06 00:00:00.000,
69,13816559,MX,SP,N,RMV,2027-02-18 00:00:00.000,I,,5/2020,M,,,
70,13816560,MX,SP,N,RMV,2027-02-18 00:00:00.000,I,,5/2021,F,,,


In [45]:
juvenile_cases.shape

(1858773, 13)
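The row count alone does not show whether every ID in `juvenile_case_ids` found a matching case, or whether `IDNCASE` is unique in the case table. A sketch of both checks on toy data (the IDs below are made up):

```python
import pandas as pd

# Toy stand-ins for juvenile_case_ids and the cases table
juvenile_ids = pd.Index([10, 20, 30])
cases = pd.DataFrame({"IDNCASE": [10, 20, 20, 40]})

# IDs present in the juvenile list but absent from the case table
unmatched = juvenile_ids.difference(pd.Index(cases["IDNCASE"]))

# Case rows that repeat an IDNCASE already seen earlier
duplicate_rows = int(cases["IDNCASE"].duplicated().sum())

print(list(unmatched), duplicate_rows)
```

On the real data, a non-empty `unmatched` would point to history rows without a case record, and a non-zero `duplicate_rows` would mean the filtered frame is not strictly one row per case.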