In [23]:
import pandas as pd

In [24]:
pd.options.display.max_columns = None  # None removes the limit, so every column is displayed

### Clean `tbl_JuvenileHistory` table

#### Initial Data Inspection

The dataset contains roughly 2.9 million rows. To avoid memory issues and unnecessary processing, only the first 100 rows are loaded at first, in order to:

- Get an overview of the data types (`dtypes`)
- Identify columns/features worth keeping for the EDA
- Skip any columns that appear to be irrelevant or redundant

This initial check will help streamline the analysis and focus only on useful information.

In [25]:
juvenile_history_path = "../../data/raw/tbl_JuvenileHistory.csv"

juvenile_history = pd.read_csv(
    filepath_or_buffer=juvenile_history_path, delimiter="\t", nrows=100
)

In [26]:
juvenile_history.head()

|   | idnJuvenileHistory | idnCase | idnProceeding | idnJuvenile | DATCREATEDON | DATMODIFIEDON |
|---|---|---|---|---|---|---|
| 0 | 5 | 2046990 | 3200129 | 1 | 2014-09-06 19:24:46.373 | NaN |
| 1 | 6 | 2047179 | 3199488 | 1 | 2014-09-06 19:24:46.373 | NaN |
| 2 | 7 | 2047179 | 3199489 | 1 | 2014-09-06 19:24:46.373 | NaN |
| 3 | 8 | 2047199 | 3199497 | 1 | 2014-09-06 19:24:46.373 | NaN |
| 4 | 9 | 2047199 | 3199498 | 1 | 2014-09-06 19:24:46.373 | NaN |


In [27]:
juvenile_history.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 6 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   idnJuvenileHistory  100 non-null    int64  
 1   idnCase             100 non-null    int64  
 2   idnProceeding       100 non-null    int64  
 3   idnJuvenile         100 non-null    int64  
 4   DATCREATEDON        100 non-null    object 
 5   DATMODIFIEDON       0 non-null      float64
dtypes: float64(1), int64(4), object(1)
memory usage: 4.8+ KB


In [28]:
juvenile_history.columns

Index(['idnJuvenileHistory', 'idnCase', 'idnProceeding', 'idnJuvenile',
       'DATCREATEDON', 'DATMODIFIEDON'],
      dtype='object')

In [29]:
juvenile_history["DATCREATEDON"].unique()

array(['2014-09-06 19:24:46.373'], dtype=object)

#### Column Notes

- **`idnJuvenile`**:  
  Although stored as `int64`, this column only contains values from 1 to 6 (based on `Lookup/tblLookup_Juvenile.csv`). It should be treated as a categorical feature.

- **`DATCREATEDON`** and **`DATMODIFIEDON`**:  
  These columns likely reflect internal system activity (record creation and modification timestamps) rather than meaningful case-level events. In the 100-row sample:
  - `DATCREATEDON` has the same value (`2014-09-06 19:24:46.373`) in every row.
  - `DATMODIFIEDON` is entirely null.

  Both columns are excluded from further analysis.
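
Both observations rest on the first 100 rows only. The following is a minimal sketch of how they could be checked against the whole file without loading it at once, by reading it in chunks. The helper name `profile_columns`, the chunk size, and the in-memory demo data are assumptions made for illustration; on the real file the first argument would be `juvenile_history_path`. Note that collecting distinct values in a set can itself use a lot of memory if a column turns out to have many unique values.

```python
import io

import pandas as pd


def profile_columns(source, columns, chunksize=500_000):
    """Count non-null and distinct values per column, reading in chunks."""
    non_null = {col: 0 for col in columns}
    distinct = {col: set() for col in columns}
    reader = pd.read_csv(
        source, delimiter="\t", usecols=columns, chunksize=chunksize
    )
    for chunk in reader:
        for col in columns:
            values = chunk[col].dropna()
            non_null[col] += len(values)
            distinct[col].update(values.unique())
    return pd.DataFrame(
        {
            "non_null": pd.Series(non_null),
            "n_distinct": pd.Series({c: len(v) for c, v in distinct.items()}),
        }
    )


# Tiny in-memory stand-in for the raw file (hypothetical rows).
demo = io.StringIO(
    "idnJuvenileHistory\tDATCREATEDON\tDATMODIFIEDON\n"
    "5\t2014-09-06 19:24:46.373\t\n"
    "6\t2014-09-06 19:24:46.373\t\n"
    "7\t2014-09-06 19:24:46.373\t\n"
)
profile = profile_columns(demo, ["DATCREATEDON", "DATMODIFIEDON"], chunksize=2)
print(profile)
```

A column with `n_distinct` of 1 is constant across the file, and a column with `non_null` of 0 is entirely empty; any other result would be a reason to revisit the decision to drop it.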

In [30]:
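# Nullable "Int64" keeps the ID columns as integers even when values are missing;
# "category" stores the small set of lookup codes compactly.
dtype = {
    "idnJuvenileHistory": "Int64",
    "idnCase": "Int64",
    "idnProceeding": "Int64",
    "idnJuvenile": "category",
}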

In [31]:
juvenile_history = pd.read_csv(
    filepath_or_buffer=juvenile_history_path,
    delimiter="\t",
    usecols=["idnJuvenileHistory", "idnCase", "idnProceeding", "idnJuvenile"],
    dtype=dtype,
    low_memory=False,
)
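
Since `idnJuvenile` is expected to hold only the lookup codes 1 to 6, one possible sanity check after loading is to list any observed codes outside that range. This is a sketch under that assumption: `EXPECTED_CODES` and `unexpected_codes` are hypothetical names, and the demo series stands in for `juvenile_history["idnJuvenile"]`. The codes are converted to numbers first because a column read with `dtype="category"` may hold its categories as strings.

```python
import pandas as pd

# Assumption: the lookup table defines codes 1 through 6, as noted above.
EXPECTED_CODES = {1, 2, 3, 4, 5, 6}


def unexpected_codes(series, expected=EXPECTED_CODES):
    """Return the sorted observed codes that are not in the expected set."""
    # to_numeric raises ValueError on non-numeric codes, which surfaces them loudly.
    observed = pd.to_numeric(series.dropna().astype(str))
    return sorted({int(value) for value in observed.unique()} - set(expected))


# Hypothetical stand-in for juvenile_history["idnJuvenile"].
demo = pd.Series(["1", "6", None, "3", "9"], dtype="category")
print(unexpected_codes(demo))  # prints [9]
```

An empty list would mean every non-missing code has a counterpart in the lookup table.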

#### Data Inspection

In [32]:
juvenile_history.head()

|   | idnJuvenileHistory | idnCase | idnProceeding | idnJuvenile |
|---|---|---|---|---|
| 0 | 5 | 2046990 | 3200129 | 1 |
| 1 | 6 | 2047179 | 3199488 | 1 |
| 2 | 7 | 2047179 | 3199489 | 1 |
| 3 | 8 | 2047199 | 3199497 | 1 |
| 4 | 9 | 2047199 | 3199498 | 1 |


In [33]:
juvenile_history.shape

(2876040, 4)

In [34]:
juvenile_history.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2876040 entries, 0 to 2876039
Data columns (total 4 columns):
 #   Column              Dtype   
---  ------              -----   
 0   idnJuvenileHistory  Int64   
 1   idnCase             Int64   
 2   idnProceeding       Int64   
 3   idnJuvenile         category
dtypes: Int64(3), category(1)
memory usage: 76.8 MB


In [35]:
juvenile_history.isna().sum()

idnJuvenileHistory      0
idnCase                 0
idnProceeding          99
idnJuvenile           999
dtype: int64
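
Against 2,876,040 rows these counts are small: 99 missing values is roughly 0.003% of rows and 999 is roughly 0.035%. A minimal sketch of reporting counts and percentages together follows; `missing_summary` is a hypothetical helper, and the miniature frame is made up for the demonstration (the real call would pass `juvenile_history`).

```python
import pandas as pd


def missing_summary(frame):
    """Return the count and percentage of missing values per column."""
    counts = frame.isna().sum()
    return pd.DataFrame(
        {"missing": counts, "percent": (counts / len(frame) * 100).round(4)}
    )


# Hypothetical miniature frame standing in for juvenile_history.
demo = pd.DataFrame(
    {
        "idnProceeding": pd.array([3200129, None, 3199489, 3199497], dtype="Int64"),
        "idnJuvenile": pd.Categorical(["1", None, None, "2"]),
    }
)
summary = missing_summary(demo)
print(summary)
```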

In [36]:
juvenile_history.duplicated().sum()

np.int64(0)
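
Called without arguments, `duplicated()` flags only rows that repeat in every column, so a result of 0 does not by itself show that `idnJuvenileHistory` is unique. The sketch below shows one way the key column could be checked on its own; `key_report` is a hypothetical helper and the demo frame deliberately contains a repeated key.

```python
import pandas as pd


def key_report(frame, key):
    """Summarise whether `key` could serve as a unique, non-null identifier."""
    return {
        "rows": len(frame),
        "null_keys": int(frame[key].isna().sum()),
        # keep=False counts every row that shares its key with another row.
        "repeated_keys": int(frame[key].duplicated(keep=False).sum()),
        "is_unique": bool(frame[key].is_unique),
    }


# Hypothetical frame: the last two rows share the same key value.
demo = pd.DataFrame(
    {
        "idnJuvenileHistory": pd.array([5, 6, 7, 7], dtype="Int64"),
        "idnCase": pd.array([2046990, 2047179, 2047179, 2047199], dtype="Int64"),
    }
)
report = key_report(demo, "idnJuvenileHistory")
print(report)
```

On the real data, `key_report(juvenile_history, "idnJuvenileHistory")` returning `is_unique` as `True` with no null keys would support treating the column as an identifier.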

In [38]:
juvenile_history.to_csv(
    "../../data/cleaned/juvenile_history_cleaned.csv.gz",
    index=False,
    compression="gzip",
)
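
CSV does not store column types, so the `Int64` and `category` dtypes have to be supplied again when the cleaned file is read back. A minimal round-trip sketch follows, using a temporary file and made-up rows rather than the real output path; `READ_DTYPES` and `load_cleaned` are hypothetical names.

```python
import os
import tempfile

import pandas as pd

# Same dtype mapping as used when loading the raw file.
READ_DTYPES = {
    "idnJuvenileHistory": "Int64",
    "idnCase": "Int64",
    "idnProceeding": "Int64",
    "idnJuvenile": "category",
}


def load_cleaned(path):
    """Read the cleaned file; gzip compression is inferred from the .gz suffix."""
    return pd.read_csv(path, dtype=READ_DTYPES)


# Hypothetical two-row frame with one missing value per nullable column.
demo = pd.DataFrame(
    {
        "idnJuvenileHistory": pd.array([5, 6], dtype="Int64"),
        "idnCase": pd.array([2046990, 2047179], dtype="Int64"),
        "idnProceeding": pd.array([3200129, None], dtype="Int64"),
        "idnJuvenile": pd.Categorical(["1", None]),
    }
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo_cleaned.csv.gz")
    demo.to_csv(path, index=False, compression="gzip")
    restored = load_cleaned(path)

print(restored.dtypes)
```

Without the `dtype` argument, pandas would infer `float64` for an integer column that contains missing values and a numeric type for the lookup codes.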

### Cleaned `tbl_JuvenileHistory`

- Loaded 2,876,040 rows (~2.88M) from the raw tab-delimited CSV file.
- Dropped `DATCREATEDON` (constant in the 100-row sample) and `DATMODIFIEDON` (entirely null in the sample).
- Converted `idnJuvenile` to a categorical variable.
- Retained only 4 key fields:
  - `idnJuvenileHistory`: primary key
  - `idnCase`: foreign key to `tbl_Case`
  - `idnProceeding`: foreign key to `tbl_Proceeding`
  - `idnJuvenile`: foreign key to `tblLookup_Juvenile`
- Missing values:
  - `idnProceeding`: 99 missing
  - `idnJuvenile`: 999 missing

Saved cleaned file as:
- `../../data/cleaned/juvenile_history_cleaned.csv.gz`