# **1. Summary of Project Idea**

The core idea of this project is to use the **Supreme Court Database (SCDB)**, specifically the modern data (1946-2023), to predict the *time between a case's oral argument and the announcement of its final decision*. We approach this with machine learning (ML), currently XGBoost.

Beyond just prediction, a key goal is to use Explainable AI (XAI) methods (like SHAP and Dalex) to understand which characteristics of a case (e.g., the type of legal issue, the parties involved, lower court conflict, case complexity indicators) are most influential in determining this argument-to-decision duration, according to the ML model.

## **1.1 Preparation of Data Steps**

The data preparation steps we've taken are needed for building a valid and meaningful predictive model for this specific task:

**Using caseCentered_Docket.csv:** We chose this file because its unit of analysis (the docket) aligns well with tracking a case's journey. It allows us to handle consolidated cases (multiple dockets per caseId) effectively, which is important for measuring case complexity. Files organized by issue or vote would be too granular for predicting the overall case duration.

**Calculating duration_days:** Since "time of case" isn't directly in the SCDB, we defined it operationally as *dateDecision - dateArgument*. This is the most feasible measure using only SCDB data for the post-argument phase.
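As a minimal sketch of this definition, the snippet below computes the target on a few toy rows. The dates are illustrative, and the month/day/year string format is an assumption based on the first rows of the raw file.

```python
import pandas as pd

# Toy rows for illustration only; the second row has no oral argument.
toy = pd.DataFrame({
    "dateArgument": ["10/10/1945", None, "03/01/1950"],
    "dateDecision": ["11/18/1946", "05/02/1949", "03/01/1950"],
})

# Parse the strings into datetimes; a missing argument date becomes NaT.
for col in ["dateArgument", "dateDecision"]:
    toy[col] = pd.to_datetime(toy[col], format="%m/%d/%Y", errors="coerce")

# Subtracting two datetime columns gives a Timedelta; .dt.days keeps whole days.
# Rows with NaT on either side end up with NaN as their duration.
toy["duration_days"] = (toy["dateDecision"] - toy["dateArgument"]).dt.days
print(toy)
```

Because a missing argument date propagates to a missing duration, the NaN filter described next falls out of this definition directly.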

**Filtering NaN duration_days:** We must remove rows where duration is missing (NaN). This happens primarily because the case was decided without oral argument (dateArgument is missing). Our target variable literally doesn't exist for these cases, so the model cannot learn from them for this specific prediction task. This step focuses the analysis on orally argued cases.

**Filtering Negative (< 0) duration_days:** A negative duration is logically impossible and indicates an error in the recorded dates (dateDecision before dateArgument), so such rows are excluded; keeping them would feed incorrect targets to the model. In the current file the check found no negative durations.

**Keeping Zero (== 0) duration_days (For Now):** We decided to keep cases with exactly zero duration for the time being. While potentially data errors or atypical same-day decisions, removing them immediately might discard valid edge cases. We flagged their presence and noted that they could be removed later if they prove problematic for the model. (There are 2 such cases in the dataset.)
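The three filtering rules above can be sketched on toy data as follows. The column `zero_duration` is a hypothetical helper flag, not something the notebook creates; it only illustrates one way to keep the zero-duration rows easy to exclude later.

```python
import numpy as np
import pandas as pd

# Hypothetical durations covering each case discussed above.
toy = pd.DataFrame({
    "docketId": ["a", "b", "c", "d", "e"],
    "duration_days": [120.0, np.nan, -3.0, 0.0, 45.0],
})

# 1. Drop rows with no target (case not orally argued).
argued = toy.dropna(subset=["duration_days"])

# 2. Drop negative durations (decision recorded before argument).
kept = argued[argued["duration_days"] >= 0].copy()

# 3. Keep zeros for now, but flag them (hypothetical column name).
kept["zero_duration"] = kept["duration_days"].eq(0)

print(kept)
```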

**Feature Engineering (num_dockets_in_case, had_reargument, docket_category):** We created new features to capture potentially predictive information not directly present as single variables:

*num_dockets_in_case:* Measures complexity from consolidation: the number of docket rows that share a caseId in the docket file.

*had_reargument:* Captures complexity/contentiousness indicated by the Court needing a second argument (binary flag: 1 if dateRearg is present, 0 otherwise).

*docket_category:* Attempts to extract the type of docket (Original, Miscellaneous, Merits) from the raw docket string, as this might correlate with different processing timelines.
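A rough sketch of the three engineered features on toy rows is shown below. The function `categorize_docket` and its matching rules are assumptions made for illustration, based only on the example docket strings quoted in this document ("24", "133M", "5, Orig."); real docket strings are less regular and would need checking against the data.

```python
import re
import pandas as pd

def categorize_docket(docket):
    """Hypothetical rule of thumb for the docket type; not the SCDB's own rule."""
    if not isinstance(docket, str) or not docket.strip():
        return "Unknown"
    text = docket.strip().lower()
    if "orig" in text:                       # e.g. "5, Orig."
        return "Original"
    if re.search(r"\d\s*m(isc)?\b", text):   # e.g. "133M" or "133 Misc"
        return "Miscellaneous"
    return "Merits"

toy = pd.DataFrame({
    "caseId": ["1946-001", "1946-002", "1946-002", "1946-003"],
    "docketId": ["1946-001-01", "1946-002-01", "1946-002-02", "1946-003-01"],
    "docket": ["24", "133M", "134M", "5, Orig."],
    "dateRearg": pd.to_datetime(["1946-10-23", None, None, None]),
})

# Number of docket rows sharing a caseId (consolidation size).
toy["num_dockets_in_case"] = toy.groupby("caseId")["docketId"].transform("count")
# 1 if a reargument date exists, else 0.
toy["had_reargument"] = toy["dateRearg"].notna().astype(int)
# Coarse docket type derived from the raw docket string.
toy["docket_category"] = toy["docket"].map(categorize_docket)

print(toy[["caseId", "num_dockets_in_case", "had_reargument", "docket_category"]])
```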

**Excluding Leakage Variables:** This is critical. We carefully removed variables whose values are only known after the decision is made (e.g., partyWinning, decisionDirection, majVotes, caseDisposition). Including these would allow the model to "cheat" by using information from the future (relative to the prediction point), leading to artificially inflated performance and invalid results for predicting duration before the decision is known.
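One way to make the leakage exclusion explicit is a single named list that is dropped before building the feature matrix. The sketch below uses only the four columns named above; `LEAKAGE_COLS` and `drop_leakage` are hypothetical names, and the list would need to be extended from the variable catalog.

```python
import pandas as pd

# Outcome columns named in the text; extend from the variable catalog as needed.
LEAKAGE_COLS = ["partyWinning", "decisionDirection", "majVotes", "caseDisposition"]

def drop_leakage(frame, leakage_cols=LEAKAGE_COLS):
    """Return a copy without leakage columns; columns not present are ignored."""
    present = [c for c in leakage_cols if c in frame.columns]
    return frame.drop(columns=present)

# Toy frame with two legitimate features, two leakage columns and the target.
toy = pd.DataFrame({
    "issueArea": [1, 8],
    "lcDisagreement": [0, 1],
    "partyWinning": [1, 0],
    "majVotes": [5, 9],
    "duration_days": [120, 45],
})

X = drop_leakage(toy).drop(columns=["duration_days"])  # features only
y = toy["duration_days"]                               # target
print(list(X.columns))
```

Keeping the list in one place makes it easy to audit which columns were treated as post-decision information.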

**Removing identifiers and raw date columns:** Identifier columns and the raw date columns are dropped once the derived features have been extracted from them.

**Need for Categorical Simplification (Placeholder):** Many SCDB variables (petitioner, respondent, issueArea, jurisdiction, etc.) use hundreds of numeric codes. Directly using these high-cardinality features can make modeling difficult and less interpretable. Grouping these codes into meaningful categories based on the SCDB Codebook (e.g., 'Business' vs 'Government' petitioner types, broader issue areas) is essential. This step still needs manual implementation.
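Since this step is still a placeholder, the following is only a sketch of the mechanics. The helper `simplify_codes` is hypothetical, and the mapping contains just the three issueArea codes quoted later in this document (1=CrimPro, 2=CivRts, 8=Econ); every other entry would have to come from the SCDB Codebook.

```python
import pandas as pd

# Only the three codes quoted in this document; the rest must be filled in
# from the SCDB Codebook before real use.
ISSUE_AREA_NAMES = {1: "CrimPro", 2: "CivRts", 8: "Econ"}

def simplify_codes(series, mapping, other="Other", missing="Missing"):
    """Map numeric codes to labels; unmapped codes -> other, NaN -> missing."""
    def label(code):
        if pd.isna(code):
            return missing
        return mapping.get(int(code), other)
    return series.map(label).astype("category")

issue_area = pd.Series([1.0, 8.0, 5.0, float("nan")], name="issueArea")
issue_area_group = simplify_codes(issue_area, ISSUE_AREA_NAMES)
print(issue_area_group)
```

Treating missing values as their own category, rather than dropping them, keeps the row count unchanged; whether that is appropriate depends on the variable.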

In essence, the preparation aimed to create a clean dataset containing only the relevant cases (orally argued, with non-negative duration) and a feature matrix (X) containing potentially predictive information available before the decision, while excluding invalid data and leakage variables.


## **1.2. Variables in the Dataset**

--- Descriptions for ALL 53 SCDB Variables Listed ---

caseId: (Identification)
- Description: Unique identifier assigned by the SCDB to each distinct Supreme Court dispute or consolidated set of disputes.
- Values: String, typically YYYY-NNN format (e.g., "1946-001").
- Use: Primary key for linking related rows (e.g., dockets in a consolidated case, or linking to external data like Salience). Not typically used directly as a feature.

docketId: (Identification)
- Description: Unique identifier for each specific docket number associated with a caseId. Multiple docketIds can share a caseId in consolidated cases.
- Values: String, often YYYY-NNN-DocketSeq format (e.g., "1946-001-01").
- Use: Primary key in Docket-centered files. Useful for identifying consolidated cases (via caseId). Not typically used directly as a feature.

caseIssuesId: (Identification)
- Description: Unique identifier for each specific set of issue(s) and legal provision(s) addressed within a docketId. More granular than docketId. Found in Issue/LegalProvision organized files.
- Values: String, builds on docketId (e.g., "1946-001-01-01").
- Use: Identifier for issue-level analysis. Not typically used for case-level duration prediction.

voteId: (Identification)
- Description: Unique identifier for each specific voting alignment on a caseIssuesId, mainly relevant for rare split votes. Most granular identifier. Found in Vote-organized files.
- Values: String, builds on caseIssuesId (e.g., "1946-001-01-01-01").
- Use: Identifier for vote-level analysis, especially complex voting patterns. Not typically used for case-level duration prediction.

dateDecision: (Chronological)
- Description: The date the Supreme Court announced its decision.
- Values: Date object.
- Use: Crucial endpoint for calculating duration. Can derive features like decision_year, decision_month. Cannot be used directly as a predictive *feature* for duration ending on this date (as it defines the end point). Potential inaccuracies exist.

decisionType: (Outcome / Process)
- Description: Code indicating how the Court processed/decided the case procedurally.
- Values: Numeric codes (1=Opinion Post-Argument, 2=Per Curiam Opinion, 4=Decree, 5=Judgment, 7=Per Curiam Vacated/Remanded, etc.). See Codebook.
- Use: Explains *why* dateArgument might be missing (codes 2, 7 often lack argument). Can be a feature itself, but determined at/near decision time, so potential leakage risk depending on exact prediction point.
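To check the link between decisionType and a missing dateArgument, one could tabulate the share of missing argument dates per code. The sketch below uses toy rows, so the shares it prints say nothing about the real data.

```python
import pandas as pd

# Toy rows: codes 2 and 7 sometimes lack an argument date.
toy = pd.DataFrame({
    "decisionType": [1, 1, 2, 7, 2],
    "dateArgument": pd.to_datetime(
        ["1946-10-15", "1947-01-08", None, None, "1948-03-02"]
    ),
})

# Share of rows per decisionType that lack an oral-argument date.
missing_share = (
    toy["dateArgument"].isna()
    .groupby(toy["decisionType"])
    .mean()
    .rename("share_missing_dateArgument")
)
print(missing_share)
```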

usCite: (Identification / Background)
- Description: Citation for the case in the official United States Reports.
- Values: String (e.g., "329 U.S. 1"). Can have NaNs if not yet published or not applicable.
- Use: Case identifier/linking. Not typically used as a feature.

sctCite: (Identification / Background)
- Description: Citation in the Supreme Court Reporter (West).
- Values: String (e.g., "67 S. Ct. 6"). Can have NaNs.
- Use: Case identifier/linking. Not typically used as a feature.

ledCite: (Identification / Background)
- Description: Citation in the Lawyers' Edition (LexisNexis).
- Values: String (e.g., "91 L. Ed. 3"). Can have NaNs.
- Use: Case identifier/linking. Not typically used as a feature.

lexisCite: (Identification / Background)
- Description: Citation in the LexisNexis database format.
- Values: String (e.g., "1946 U.S. LEXIS 1724"). Can have NaNs.
- Use: Case identifier/linking. Not typically used as a feature.

term: (Chronological / Context)
- Description: The Supreme Court Term in which the decision was handed down.
- Values: Numeric year representing start of Term (e.g., 1946 for Oct 1946 - June 1947).
- Use: Key chronological feature for trends, era effects, temporal splits, merging external data (like MQ scores). Can treat as numeric or categorical.
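A temporal split on term could look like the sketch below. The helper `temporal_split` and the cutoff of 2010 are arbitrary choices for illustration, not settings used in this project.

```python
import pandas as pd

def temporal_split(frame, cutoff_term, term_col="term"):
    """Rows from terms before cutoff_term go to train, the rest to test."""
    train = frame[frame[term_col] < cutoff_term]
    test = frame[frame[term_col] >= cutoff_term]
    return train, test

toy = pd.DataFrame({
    "term": [1946, 1980, 2005, 2015, 2022],
    "duration_days": [120, 90, 75, 101, 60],
})

# Arbitrary example cutoff: train on earlier terms, evaluate on later ones.
train, test = temporal_split(toy, cutoff_term=2010)
print(len(train), len(test))
```

Splitting by term rather than at random avoids evaluating the model on cases that predate the ones it was trained on.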

naturalCourt: (Chronological / Context)
- Description: Code identifying periods of stable membership on the Court.
- Values: Numeric codes (e.g., 1301). See Codebook Appendix 1.
- Use: Feature capturing effects of specific Court compositions. Treat as categorical.

chief: (Chronological / Context)
- Description: Code identifying the Chief Justice presiding.
- Values: String label for the Chief Justice (the column loads with dtype object in this file), e.g., Vinson, Warren, Roberts. See Codebook.
- Use: Feature capturing effects of Chief Justice eras. Treat as categorical.

docket: (Identification / Background)
- Description: The original, raw docket number string assigned by the SC (e.g., "24", "133M", "5, Orig.").
- Values: String. Format can be inconsistent (See Guide Section V).
- Use: Linking to external court records. Can be *engineered* into features (like docket_category), but generally not used directly as a feature due to inconsistency and high cardinality.

caseName: (Identification / Background)
- Description: The name of the case (e.g., "HALLIBURTON OIL WELL CEMENTING CO. v. WALKER...").
- Values: String.
- Use: Identifier. Not suitable as a standard feature for ML (text analysis techniques would be needed).

dateArgument: (Chronological)
- Description: The date of the first day of oral argument. Missing (NaN/NaT) if case was not orally argued.
- Values: Date object or NaT.
- Use: Crucial starting point for calculating Argument-to-Decision duration. Can derive features like argument_month. Potential inaccuracies exist (See Guide Section V).

dateRearg: (Chronological)
- Description: The date of the first day of reargument, if held. Missing (NaN/NaT) if no reargument occurred.
- Values: Date object or NaT. Very high percentage of missing values.
- Use: Primarily used to engineer the 'had_reargument' binary flag feature, indicating case complexity/uncertainty.

petitioner / respondent: (Background)
- Description: Identifies the type of party petitioning the Court / responding.
- Values: Hundreds of numeric codes. See Codebook Appendix 10.
- Use: Potential feature. *Simplification into broad groups (e.g., 'Business', 'Individual', 'US Govt', 'State Govt') is essential.*

petitionerState / respondentState: (Background)
- Description: Identifies the state associated with the petitioner/respondent, if applicable.
- Values: Numeric state codes (FIPS codes). See Codebook Appendix 11. High number of NaNs.
- Use: Potential geographic feature, especially for state actors. Treat as categorical. Consider impact of missing values.

jurisdiction: (Background)
- Description: Code for how the case reached the Supreme Court.
- Values: Numeric codes (1=Certiorari, 2=Appeal, 3=Original, etc.). See Codebook Appendix 2.
- Use: Potential feature (different paths may have different processing). Treat as categorical. *Simplification recommended.*

adminAction: (Background)
- Description: Code identifying if the case reviewed a federal administrative agency action, and which agency. 0=Not applicable.
- Values: Numeric codes. See Codebook Appendix 6. High number of NaNs (or 0 values).
- Use: Potential feature. Treat as categorical. *Simplification (e.g., binary flag 'IsAdminAction', or grouping agencies) recommended.*
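The binary-flag simplification could be sketched as below, assuming that both NaN and 0 mean "no administrative action" as described above. The agency codes 12 and 117 are arbitrary placeholders, not looked up in the Codebook.

```python
import numpy as np
import pandas as pd

# Toy values: NaN and 0 both stand for "not applicable" (assumption from the text).
admin_action = pd.Series([np.nan, 0, 12, 117, np.nan], name="adminAction")

# 1 when any agency code is present, 0 otherwise.
is_admin_action = (admin_action.fillna(0) != 0).astype(int).rename("IsAdminAction")
print(is_admin_action)
```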

adminActionState: (Background)
- Description: State associated with the administrative action, if applicable.
- Values: Numeric state codes. Very high number of NaNs.
- Use: Limited use due to high missingness. Potentially a feature if imputed/handled carefully. Treat as categorical.

threeJudgeFdc: (Background)
- Description: Flag indicating if a three-judge Federal District Court was involved.
- Values: 0=No, 1=Yes.
- Use: Potential feature indicating specific case types. Treat as categorical or binary numeric.

caseOrigin: (Background)
- Description: Code for the specific court/body where the case originated before appeals.
- Values: Hundreds of numeric codes. See Codebook Appendix 5. Some NaNs possible.
- Use: Potential feature. *Simplification (grouping by type/level/region) is essential.*

caseOriginState: (Background)
- Description: State associated with the originating court/body.
- Values: Numeric state codes. High number of NaNs.
- Use: Potential geographic feature. Treat as categorical. Consider impact of missing values.

caseSource: (Background)
- Description: Code for the court whose decision the SC is directly reviewing.
- Values: Hundreds of numeric codes. See Codebook Appendix 4. Some NaNs possible.
- Use: Potential feature indicating case posture/context. *Simplification (grouping by type/level/circuit) is essential.*

caseSourceState: (Background)
- Description: State associated with the source court.
- Values: Numeric state codes. High number of NaNs.
- Use: Potential geographic feature. Treat as categorical. Consider impact of missing values.

lcDisagreement: (Background)
- Description: Flag indicating explicit disagreement among lower federal courts.
- Values: 0=No, 1=Yes.
- Use: Potential feature indicating complexity/reason for grant. Treat as categorical or binary numeric.

certReason: (Background)
- Description: Code(s) for the Court's stated reason for granting review.
- Values: Numeric codes (1=Fed conflict, 4=Important fed question, etc.). See Codebook Appendix 7. Some NaNs possible.
- Use: Potential feature indicating perceived importance/reason for grant. Treat as categorical. *Simplification may be useful.*

lcDisposition: (Background)
- Description: Code for the lower court's disposition (outcome).
- Values: Numeric codes (2=Affirmed, 3=Reversed, etc.). See Codebook Appendix 8. Some NaNs possible.
- Use: Potential feature indicating case posture. Treat as categorical. *Simplification may be useful.*

lcDispositionDirection: (Background)
- Description: Ideological direction assigned to the lower court's disposition.
- Values: 1=Conservative, 2=Liberal, 3=Unspecifiable. Some NaNs possible.
- Use: Potential feature indicating ideological context. Treat as categorical.

## --- Outcome, Substantive, and Voting Variables (Leakage Status Noted per Variable) --- ##

declarationUncon: (Outcome)
- Description: Flag indicating if the SC declared a law/action unconstitutional.
- Values: Numeric codes (0=No, 1=Yes-Fed, 2=Yes-State, 3=Yes-Local).
- Use: Outcome variable. **Leakage Variable** - cannot be used as predictor for duration.

caseDisposition: (Outcome)
- Description: Code for how the SC ultimately disposed of the case.
- Values: Numeric codes (1=Stay, 2=Affirmed, 3=Reversed, 5=Vacated/Remanded, 6=Affirmed/Reversed in part, etc.). See Codebook Appendix 12. Some ambiguity (e.g., DIGs).
- Use: Outcome variable. **Leakage Variable**.

caseDispositionUnusual: (Outcome)
- Description: Flag for unusual case dispositions.
- Values: 0=No, 1=Yes.
- Use: Outcome characteristic. **Leakage Variable**.

partyWinning: (Outcome)
- Description: Flag indicating if the petitioner won (vs. respondent).
- Values: 0=Respondent won, 1=Petitioner won, NA=Unclear/Other.
- Use: Outcome variable. **Leakage Variable**.

precedentAlteration: (Outcome)
- Description: Flag indicating if the decision formally altered existing SC precedent.
- Values: 0=No, 1=Yes.
- Use: Outcome characteristic. **Leakage Variable**.

voteUnclear: (Voting/Opinion)
- Description: Flag indicating if the voting alignment was unclear.
- Values: 0=Clear, 1=Unclear.
- Use: Data quality flag related to outcome/voting. Determined at/after decision, potential **Leakage Variable**.

issue: (Substantive)
- Description: Code for the specific legal issue within the broader issueArea.
- Values: Many numeric codes, nested under issueArea. See Codebook section on Issues. Some NaNs possible.
- Use: Potential feature (more granular than issueArea). Treat as categorical. High cardinality may require careful handling or using only issueArea.

issueArea: (Substantive)
- Description: Broad subject matter category of the legal issue.
- Values: Numeric codes (1=CrimPro, 2=CivRts, 8=Econ, etc.). See Codebook Appendix 3. Some NaNs possible.
- Use: Key substantive feature. Treat as categorical. *Mapping to names recommended.*

decisionDirection: (Outcome)
- Description: Ideological direction assigned to the SC's decision.
- Values: 1=Conservative, 2=Liberal, 3=Unspecifiable. Some NaNs possible.
- Use: Common target variable for *outcome* prediction. **Leakage Variable** for duration prediction.

decisionDirectionDissent: (Outcome)
- Description: Ideological direction assigned to the primary dissent.
- Values: 1=Conservative, 2=Liberal, 3=Unspecifiable. High number of NaNs (no dissent).
- Use: Outcome characteristic. **Leakage Variable**.

authorityDecision1 / authorityDecision2: (Outcome)
- Description: Codes for the primary/secondary legal authority the Court relied upon.
- Values: Numeric codes (1=Conflict, 2=Federal Con. interp, 3=Federal Statute interp, etc.). See Codebook. High NaNs for authorityDecision2.
- Use: Outcome characteristic. **Leakage Variable**.

lawType: (Substantive)
- Description: Code for the type of law or action under review (e.g., statute, constitution, regulation).
- Values: Numeric codes. See Codebook Appendix 9. Some NaNs possible.
- Use: Potential feature indicating legal basis. Treat as categorical. *Simplification might be useful.*

lawSupp: (Substantive)
- Description: Code providing supplemental detail about the law under review (e.g., specific amendment, act name category).
- Values: Numeric codes. See Codebook Appendix 9. Some NaNs possible.
- Use: Potential feature (more detail than lawType). Treat as categorical. High cardinality.

lawMinor: (Substantive)
- Description: Free text field intended for minor legal points or specific statute sections.
- Values: String. Very high number of NaNs.
- Use: Generally **not usable** for ML due to inconsistency, typos, and high missingness (See Guide Section V). Usually dropped.

majOpinWriter: (Voting/Opinion)
- Description: Code identifying the justice who wrote the majority/plurality opinion.
- Values: Numeric justice codes (e.g., 102=Black, 112=Roberts). See Codebook Justice List. Some NaNs possible (per curiam).
- Use: Outcome characteristic. **Leakage Variable**. Requires Justice-centered data or aggregation for use as feature in outcome prediction.

majOpinAssigner: (Voting/Opinion)
- Description: Code identifying the justice who assigned the majority opinion (Chief Justice or senior justice in majority).
- Values: Numeric justice codes. Some NaNs possible.
- Use: Outcome characteristic. **Leakage Variable**.

splitVote: (Voting/Opinion)
- Description: Flag indicating if the case involved multiple distinct voting alignments on different aspects of the same issue/legal provision.
- Values: Numeric codes (0=No split, 1=Vote info pertains to 1st vote, 2=Vote info pertains to 2nd vote). See Codebook.
- Use: Indicator of high voting complexity. If engineered into a simple flag ('had_split_vote'), potentially usable as a pre-decision complexity feature, but the raw code itself describes the outcome voting. Treat with caution regarding leakage.

majVotes: (Voting/Opinion)
- Description: Number of justices voting in the majority coalition.
- Values: Integer (e.g., 5, 6, 9).
- Use: Outcome characteristic (vote margin). **Leakage Variable**.

minVotes: (Voting/Opinion)
- Description: Number of justices voting in the primary minority coalition (dissent).
- Values: Integer (e.g., 4, 3, 0).
- Use: Outcome characteristic (vote margin). **Leakage Variable**.

The section above is an early pass over the SCDB variables and their potential use in the model.
It aims to:

- Use only information available before the event we are trying to predict.
- Include variables that have a plausible theoretical connection to the complexity, importance, context, or procedural aspects of a case, which in turn could influence deliberation time.
- Clean out variables that are non-informative, problematic due to data quality, or redundant.

# **2. Coding Part**

Loading data

In [1]:
import os

# The relative path below assumes the notebook sits one level below the project root.
notebook_dir = os.getcwd()
project_root = os.path.abspath(os.path.join(notebook_dir, ".."))
print(f"Project Root: {project_root}")

# Location of the raw SCDB file, relative to the project root.
file_path_root = 'data/raw/SCDB_2024_01_caseCentered_Docket.csv'
file_path = os.path.join(project_root, file_path_root)
print(f"Attempting to load data from: {file_path}")

Project Root: h:\000_Projects\01_GitHub\05_PythonProjects\scdb-case-timing-prediction
Attempting to load data from: h:\000_Projects\01_GitHub\05_PythonProjects\scdb-case-timing-prediction\data/raw/SCDB_2024_01_caseCentered_Docket.csv


In [2]:
import pandas as pd
import numpy as np

# Try the default UTF-8 decoding first; ISO-8859-1 (latin-1) is common for older datasets.
try:
    df = pd.read_csv(file_path)
except UnicodeDecodeError:
    df = pd.read_csv(file_path, encoding='ISO-8859-1')
except FileNotFoundError:
    print(f"Error: File not found at {file_path}")
    raise  # df would be undefined, so the cells below cannot run
except Exception as e:
    print(f"Error loading file: {e}")
    raise

print(f"Data loaded successfully. Shape: {df.shape}")
print("\nFirst 5 rows of data:")
display(df.head())

print("\nData Information:")
df.info()

print("\nMissing values per column:")
display(df.isnull().sum())

Data loaded successfully. Shape: (10783, 53)

First 5 rows of data:


Unnamed: 0,caseId,docketId,caseIssuesId,voteId,dateDecision,decisionType,usCite,sctCite,ledCite,lexisCite,...,authorityDecision1,authorityDecision2,lawType,lawSupp,lawMinor,majOpinWriter,majOpinAssigner,splitVote,majVotes,minVotes
0,1946-001,1946-001-01,1946-001-01-01,1946-001-01-01-01,11/18/1946,1,329 U.S. 1,67 S. Ct. 6,91 L. Ed. 3,1946 U.S. LEXIS 1724,...,4.0,,6.0,600.0,35 U.S.C. § 33,78.0,78.0,1,8,1
1,1946-002,1946-002-01,1946-002-01-01,1946-002-01-01-01,11/18/1946,1,329 U.S. 14,67 S. Ct. 13,91 L. Ed. 12,1946 U.S. LEXIS 1725,...,4.0,,6.0,600.0,18 U.S.C. § 398,81.0,87.0,1,6,3
2,1946-002,1946-002-02,1946-002-02-01,1946-002-02-01-01,11/18/1946,1,329 U.S. 14,67 S. Ct. 13,91 L. Ed. 12,1946 U.S. LEXIS 1725,...,4.0,,6.0,600.0,18 U.S.C. § 398,81.0,87.0,1,6,3
3,1946-002,1946-002-03,1946-002-03-01,1946-002-03-01-01,11/18/1946,1,329 U.S. 14,67 S. Ct. 13,91 L. Ed. 12,1946 U.S. LEXIS 1725,...,4.0,,6.0,600.0,18 U.S.C. § 398,81.0,87.0,1,6,3
4,1946-002,1946-002-04,1946-002-04-01,1946-002-04-01-01,11/18/1946,1,329 U.S. 14,67 S. Ct. 13,91 L. Ed. 12,1946 U.S. LEXIS 1725,...,4.0,,6.0,600.0,18 U.S.C. § 398,81.0,87.0,1,6,3



Data Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10783 entries, 0 to 10782
Data columns (total 53 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   caseId                    10783 non-null  object 
 1   docketId                  10783 non-null  object 
 2   caseIssuesId              10783 non-null  object 
 3   voteId                    10783 non-null  object 
 4   dateDecision              10783 non-null  object 
 5   decisionType              10783 non-null  int64  
 6   usCite                    10282 non-null  object 
 7   sctCite                   10779 non-null  object 
 8   ledCite                   10777 non-null  object 
 9   lexisCite                 10783 non-null  object 
 10  term                      10783 non-null  int64  
 11  naturalCourt              10783 non-null  int64  
 12  chief                     10783 non-null  object 
 13  docket                    10754 non-null  

caseId                          0
docketId                        0
caseIssuesId                    0
voteId                          0
dateDecision                    0
decisionType                    0
usCite                        501
sctCite                         4
ledCite                         6
lexisCite                       0
term                            0
naturalCourt                    0
chief                           0
docket                         29
caseName                        0
dateArgument                 1249
dateRearg                   10552
petitioner                      3
petitionerState              8657
respondent                      6
respondentState              7854
jurisdiction                    3
adminAction                  7632
adminActionState            10037
threeJudgeFdc                  23
caseOrigin                    430
caseOriginState              7969
caseSource                    266
caseSourceState              8395
lcDisagreement

In [3]:
df.columns

Index(['caseId', 'docketId', 'caseIssuesId', 'voteId', 'dateDecision',
       'decisionType', 'usCite', 'sctCite', 'ledCite', 'lexisCite', 'term',
       'naturalCourt', 'chief', 'docket', 'caseName', 'dateArgument',
       'dateRearg', 'petitioner', 'petitionerState', 'respondent',
       'respondentState', 'jurisdiction', 'adminAction', 'adminActionState',
       'threeJudgeFdc', 'caseOrigin', 'caseOriginState', 'caseSource',
       'caseSourceState', 'lcDisagreement', 'certReason', 'lcDisposition',
       'lcDispositionDirection', 'declarationUncon', 'caseDisposition',
       'caseDispositionUnusual', 'partyWinning', 'precedentAlteration',
       'voteUnclear', 'issue', 'issueArea', 'decisionDirection',
       'decisionDirectionDissent', 'authorityDecision1', 'authorityDecision2',
       'lawType', 'lawSupp', 'lawMinor', 'majOpinWriter', 'majOpinAssigner',
       'splitVote', 'majVotes', 'minVotes'],
      dtype='object')

Datetime conversion

In [4]:
# We need to turn the following string-formatted date columns into true `datetime` objects so we can do arithmetic on them.  
# - `dateDecision`: when the Court announced its decision  
# - `dateArgument`: when the Court held oral argument  
# - `dateRearg`:  when (if) the Court re-argued the case  

date_cols = ['dateDecision', 'dateArgument', 'dateRearg']

print("\nConverting date columns...")
for col in date_cols:
    df[col] = pd.to_datetime(df[col], errors='coerce')

print("\nDate columns after conversion:")
df[['caseId'] + date_cols].info()
print(df[['caseId'] + date_cols].head())


Converting date columns...

Date columns after conversion:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10783 entries, 0 to 10782
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   caseId        10783 non-null  object        
 1   dateDecision  10783 non-null  datetime64[ns]
 2   dateArgument  9534 non-null   datetime64[ns]
 3   dateRearg     231 non-null    datetime64[ns]
dtypes: datetime64[ns](3), object(1)
memory usage: 337.1+ KB
     caseId dateDecision dateArgument  dateRearg
0  1946-001   1946-11-18   1946-01-09 1946-10-23
1  1946-002   1946-11-18   1945-10-10 1946-10-17
2  1946-002   1946-11-18   1945-10-10 1946-10-17
3  1946-002   1946-11-18   1945-10-10 1946-10-17
4  1946-002   1946-11-18   1945-10-10 1946-10-17
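Because `errors='coerce'` silently turns unparseable strings into NaT, it can be worth counting how many non-missing values were lost in the conversion. The sketch below assumes the raw dates are month/day/year strings, as the first rows of the file suggest; the last toy value is deliberately malformed.

```python
import pandas as pd

# Toy raw values: two valid dates, one genuinely missing, one malformed.
raw = pd.Series(["11/18/1946", "01/09/1946", None, "1946-13-45"])

# An explicit format avoids ambiguous day/month guessing.
parsed = pd.to_datetime(raw, format="%m/%d/%Y", errors="coerce")

# Values that were present in the raw data but became NaT after parsing.
silently_coerced = raw[parsed.isna() & raw.notna()]
print(f"Values lost to coercion: {len(silently_coerced)}")
print(silently_coerced)
```

In the conversion above, the non-null counts after parsing match the original missing-value counts (1249 missing dateArgument, 231 present dateRearg), which suggests nothing was lost there.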


- dateDecision:  
    This is when the Supreme Court announces its decision on a case. The time from oral argument to the decision can vary widely, depending on the complexity of the case, the number of justices, and the nature of the legal issues.

- dateArgument:  
    This is when the Supreme Court hears the initial oral arguments. During oral argument, each side presents its case and the Justices ask questions to clarify issues. This step is crucial because it allows the Court to gather information directly from the attorneys before deliberating.

- dateRearg (dateReargument):  
    This is the date of a reargument session—an additional oral argument scheduled if the Court needs further clarification after the initial argument. Reargument is relatively rare and typically happens when the Justices cannot reach a clear decision or believe critical issues require more discussion.

Simple Conclusions:
1. If the Justices are satisfied with the initial argument, they proceed to deliberation and announce the decision on dateDecision.  
2. In rare cases, the Court orders reargument, held on dateRearg, when further oral presentation or clarification is needed.  
3. In both paths, the stage ends when the Court releases its opinion on dateDecision; duration_days is measured from dateArgument, not from dateRearg.
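To make the timeline concrete, the sketch below computes the intervals for the first row of the printed output above (caseId 1946-001). The column `days_rearg_to_decision` is a hypothetical extra, shown only to contrast it with the project's target, which starts at the first argument.

```python
import pandas as pd

# Dates copied from the first row of the printed output (caseId 1946-001).
case = pd.DataFrame({
    "dateArgument": pd.to_datetime(["1946-01-09"]),
    "dateRearg": pd.to_datetime(["1946-10-23"]),
    "dateDecision": pd.to_datetime(["1946-11-18"]),
})

# Target used in this project: first argument -> decision.
case["duration_days"] = (case["dateDecision"] - case["dateArgument"]).dt.days

# Hypothetical extra column: reargument -> decision (NaN when no reargument).
case["days_rearg_to_decision"] = (case["dateDecision"] - case["dateRearg"]).dt.days

print(case[["duration_days", "days_rearg_to_decision"]])
```

For this case the target is 313 days, while only 26 days pass between reargument and decision, which illustrates why reargued cases tend to sit at the long end of the duration distribution.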

Calculating court decision duration

In [5]:
import pandas as pd
import numpy as np

print("--- Starting Full Data Preparation & Feature Engineering ---")
print(f"Initial DataFrame 'df' shape: {df.shape}")

print("\n=== Step 1: Calculate Duration & Verify ===")

# Target: days from the first day of oral argument to the decision announcement.
# Rows without an argument date (NaT) get NaN here and are filtered in Step 2.

# Fail early with an exception; exit() is unreliable inside a notebook.
required_cols = ['dateDecision', 'dateArgument']
missing_cols = [c for c in required_cols if c not in df.columns]
if missing_cols:
    raise KeyError(f"Missing input column(s) for duration calculation: {missing_cols}")

# Date subtraction needs datetime columns, so verify the dtype before computing.
for col in required_cols:
    if not pd.api.types.is_datetime64_any_dtype(df[col]):
        raise TypeError(f"Column '{col}' is not datetime type. Check initial conversion.")

df['duration_days'] = (df['dateDecision'] - df['dateArgument']).dt.days
print("Calculated 'duration_days'.")
print("Verification Checks Passed.")

--- Starting Full Data Preparation & Feature Engineering ---
Initial DataFrame 'df' shape: (10783, 53)

=== Step 1: Calculate Duration & Verify ===
Calculated 'duration_days'.
Verification Checks Passed.


Filtering data

In [6]:
print("\n=== Step 2: Filter Data for Modeling ===")
initial_rows_before_filter = df.shape[0]

# 1. Remove Rows with Missing Duration (NaN)
#    - Rationale: Necessary as the target variable ('duration_days') cannot be calculated
#      for these rows (typically non-argued cases). Models require a valid target.

# Note (found later): by far the most common reason is that the Court decided the case without hearing oral argument.

df_argued = df.dropna(subset=['duration_days']).copy()
rows_after_nan_drop = df_argued.shape[0]
print(f"1. Dropped {initial_rows_before_filter - rows_after_nan_drop} rows due to missing 'duration_days' (NaN).")

# 2. Address Rows with Non-Positive Duration (<= 0)
#    - Rationale: Negative durations indicate data errors. Zero durations are abnormal
#      for argued cases and may indicate errors. Removing negatives is essential.
#      Keeping zeros is optional (currently kept as there are only 2).
negative_durations = df_argued[df_argued['duration_days'] < 0]
zero_durations = df_argued[df_argued['duration_days'] == 0]
if not negative_durations.empty:
    num_negative = negative_durations.shape[0]
    print(f"- Found {num_negative} rows with negative (< 0) duration. Excluding as errors...")
    df_argued = df_argued[df_argued['duration_days'] >= 0].copy()  # keep zero and positive
    print(f"- Dropped {num_negative} rows with negative duration.")
else:
    print("- No negative (< 0) durations found.")
if not zero_durations.empty:
    print(f"- Found {zero_durations.shape[0]} rows with zero (0) duration. Kept for now.")
else:
    print("- No zero (0) durations found.")

final_rows = df_argued.shape[0]
print(f"\nFinal DataFrame 'df_argued' for modeling has {final_rows} rows.")


=== Step 2: Filter Data for Modeling ===
1. Dropped 1249 rows due to missing 'duration_days' (NaN).
- No negative (< 0) durations found.
- Found 2 rows with zero (0) duration. Kept for now.

Final DataFrame 'df_argued' for modeling has 9534 rows.


In [7]:
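# Side check on data validity: a reargument should always precede the decision.

df_features = df_argued  # same object as df_argued here; an independent copy is made in Step 3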
# 1. Filter for rows that HAVE a reargument date first
rearg_cases = df_features[df_features['dateRearg'].notna()].copy()

if not rearg_cases.empty:
    # 2. Within those cases, find rows where reargument is on or after decision
    erroneous_rearg_dates = rearg_cases[rearg_cases['dateRearg'] >= rearg_cases['dateDecision']]

    # 3. Report Findings
    if not erroneous_rearg_dates.empty:
        num_errors = len(erroneous_rearg_dates)
        print(f"WARNING: Found {num_errors} instance(s) where dateRearg >= dateDecision.")
        print("This indicates a potential data error, as reargument should precede the decision.")
        print("Showing details for these instances:")
        # Display relevant columns for the erroneous rows
        display(erroneous_rearg_dates[[
            'caseId', 'docketId', 'term',
            'dateArgument', 'dateRearg', 'dateDecision', 'duration_days'
        ]])

    else:
        print("OK: No instances found where dateRearg >= dateDecision among cases with reargument.")

else:
    print("OK: No cases with a dateRearg found in the filtered dataset to check.")

OK: No instances found where dateRearg >= dateDecision among cases with reargument.
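The same kind of check could be reused for other date pairs. Below is a small sketch of a generic helper; the name `find_date_order_violations` and the toy dates are assumptions made for illustration.

```python
import pandas as pd

def find_date_order_violations(df, earlier_col, later_col, strict=True):
    """Return rows where `earlier_col` does not precede `later_col`.

    Rows with a missing value in either column are ignored.
    With strict=True, equal dates also count as violations.
    """
    both = df[df[earlier_col].notna() & df[later_col].notna()]
    if strict:
        return both[both[earlier_col] >= both[later_col]]
    return both[both[earlier_col] > both[later_col]]

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "dateRearg": pd.to_datetime([None, "2001-01-05", "2001-03-01"]),
    "dateDecision": pd.to_datetime(["2001-02-01", "2001-02-01", "2001-02-01"]),
})
violations = find_date_order_violations(toy, "dateRearg", "dateDecision")
print(violations)
```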


Feature engineering

In [8]:
# Work on a copy so that df_argued stays unchanged; create it before reporting its shape
df_features = df_argued.copy()

print(f"\n=== Step 3: Basic Feature Engineering (on df_features shape: {df_features.shape}) ===")

# Engineer temporal features from the existing date columns
df_features['decision_year'] = df_features['dateDecision'].dt.year
df_features['decision_month'] = df_features['dateDecision'].dt.month
df_features['argument_month'] = df_features['dateArgument'].dt.month
print("- Engineered basic temp features (decision_year/month, argument_month).")

# Engineered complexity features
# Count dockets per caseId
docket_counts = df_features.groupby('caseId')['docketId'].transform('count')
df_features['num_dockets_in_case'] = docket_counts
# Flag for reargument
df_features['had_reargument'] = df_features['dateRearg'].notna().astype(int)
print("- Engineered complexity features (num_dockets_in_case, had_reargument).")
# print(df_features['had_reargument'].value_counts())

# Engineered BROAD docket category
# Note: the broad substring check for 'm' may misclassify some edge cases; it could be tightened in the future.
print("- Engineering BROAD 'docket_category' feature...")
if 'docket' in df_features.columns:
    # Fill NaN before casting: astype(str) would otherwise turn NaN into the literal string 'nan'
    df_features['docket_str_temp'] = df_features['docket'].fillna('').astype(str).str.lower()
    conditions = [
        df_features['docket_str_temp'].str.contains('orig', regex=False, na=False), # Original Jurisdiction
        df_features['docket_str_temp'].str.contains('m', regex=False, na=False)    # Miscellaneous (any 'm')
    ]
    categories = ['Original', 'Miscellaneous']
    default_category = 'Merits/Other' # Standard cases
    df_features['docket_category'] = np.select(conditions, categories, default=default_category)
    df_features = df_features.drop('docket_str_temp', axis=1)
    print("  Example counts for BROAD 'docket_category':")
    print(df_features['docket_category'].value_counts(dropna=False))
else:
    print("  'docket' column not found, cannot engineer 'docket_category'.")
    df_features['docket_category'] = 'Unknown'


=== Step 3: Basic Feature Engineering (on df_features shape: (9534, 54)) ===
- Engineered basic temp features (decision_year/month, argument_month).
- Engineered complexity features (num_dockets_in_case, had_reargument).
- Engineering BROAD 'docket_category' feature...
  Example counts for BROAD 'docket_category':
docket_category
Merits/Other     9409
Original          103
Miscellaneous      22
Name: count, dtype: int64
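One way the broad `'m'` check might be tightened is with anchored regular expressions. The sketch below rests on an assumption we have not verified against the SCDB codebook: that miscellaneous dockets end in a standalone `M` or `Misc` token and original-jurisdiction dockets contain `Orig`. The example docket strings are hypothetical.

```python
import re
import pandas as pd

# Assumed patterns (not confirmed against the codebook):
# 'orig' not preceded by a letter; a trailing 'm' or 'misc' not preceded by a letter
ORIG_RE = re.compile(r"(?<![A-Za-z])orig", re.IGNORECASE)
MISC_RE = re.compile(r"(?<![A-Za-z])(?:misc|m)\.?\s*$", re.IGNORECASE)

def classify_docket(docket):
    """Map a raw docket string to 'Original', 'Miscellaneous', or 'Merits/Other'."""
    if pd.isna(docket):
        return "Merits/Other"
    text = str(docket).strip()
    if ORIG_RE.search(text):
        return "Original"
    if MISC_RE.search(text):
        return "Miscellaneous"
    return "Merits/Other"

# Hypothetical docket strings for illustration
examples = pd.Series(["8 Orig.", "114 M", "10 Misc.", "70-18", "5 Term", None])
print(examples.apply(classify_docket).tolist())
```

Unlike a plain substring test, a string such as the hypothetical `"5 Term"` is not flagged as miscellaneous, because its final `m` is part of a longer word.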


Simplifying categorical variables

In [9]:
print("\n=== Step 4a: Simplifying Categoricals (Examples using Codebook) ===")

# --- Simplify 'jurisdiction' ---
JURISDICTION_COL = 'jurisdiction'; NEW_JURISDICTION_COL = 'jurisdiction_group'
if JURISDICTION_COL in df_features.columns:
    print(f"\nSimplifying '{JURISDICTION_COL}' into '{NEW_JURISDICTION_COL}'...")
    # Based on Appendix A15 'varJurisdiction'
    jurisdiction_map = { 1: 'Certiorari', 2: 'Appeal', 4: 'Certification', 13: 'Writ of Error', 9: 'Original', 5: 'Writ', 6: 'Writ', 7: 'Writ', 14: 'Writ', 8: 'Writ', 10: 'Writ', 3: 'Other Procedural', 12: 'Other Procedural', 15: 'Other Procedural' }
    df_features[NEW_JURISDICTION_COL] = df_features[JURISDICTION_COL].map(jurisdiction_map).fillna('Unknown/Missing')
    print(f"  Value counts for '{NEW_JURISDICTION_COL}':\n{df_features[NEW_JURISDICTION_COL].value_counts(dropna=False)}")
else: print(f"'{JURISDICTION_COL}' column not found.")

# --- Simplify 'petitioner' ---
PETITIONER_COL = 'petitioner'; NEW_PETITIONER_COL = 'petitioner_group'
if PETITIONER_COL in df_features.columns:
    print(f"\nSimplifying '{PETITIONER_COL}' into '{NEW_PETITIONER_COL}' (Partial Example)...")
    # Based on Appendix A24 'varParties'. Partial mapping only: unmapped codes fall into 'Other/Unknown' and still need a full mapping.
    base_map = { 27: 'US Govt', 28: 'State Govt', 1: 'US Govt Official', 3: 'Local Govt', 5: 'Local Govt', 18: 'Local Govt', 21: 'Local Govt', 113: 'Business', 119: 'Business', 122: 'Business', 151: 'Business', 184: 'Business', 231: 'Business', 245: 'Business', 171: 'Business', 100: 'Individual', 106: 'Individual', 111: 'Individual', 145: 'Employee', 170: 'Indian', 174: 'Individual', 175: 'Individual', 212: 'Individual', 214: 'Individual', 215: 'Individual', 249: 'Union'}
    agency_map = {code: 'US Agency' for code in range(301, 423)}
    petitioner_map = {**base_map, **agency_map}
    df_features[NEW_PETITIONER_COL] = df_features[PETITIONER_COL].map(petitioner_map).fillna('Other/Unknown')
    print(f"  Value counts for '{NEW_PETITIONER_COL}' (Top 20):\n{df_features[NEW_PETITIONER_COL].value_counts(dropna=False).head(20)}")
else: print(f"'{PETITIONER_COL}' column not found.")

# --- Simplify 'respondent' ---
RESPONDENT_COL = 'respondent'; NEW_RESPONDENT_COL = 'respondent_group'
if RESPONDENT_COL in df_features.columns:
    print(f"\nSimplifying '{RESPONDENT_COL}' into '{NEW_RESPONDENT_COL}' (Using same map as petitioner)...")
    # Using the same map created for petitioner based on Appendix A24
    if 'petitioner_map' in locals():
         df_features[NEW_RESPONDENT_COL] = df_features[RESPONDENT_COL].map(petitioner_map).fillna('Other/Unknown')
         print(f"  Value counts for '{NEW_RESPONDENT_COL}' (Top 20):\n{df_features[NEW_RESPONDENT_COL].value_counts(dropna=False).head(20)}")
    else:
         print("  ERROR: 'petitioner_map' not defined. Cannot simplify respondent.")
         df_features[NEW_RESPONDENT_COL] = 'Error_Map_Missing' # Mark error
else: print(f"'{RESPONDENT_COL}' not found.")

# --- Map 'issueArea' codes to names ---
ISSUE_AREA_COL = 'issueArea'; NEW_ISSUE_AREA_COL = 'issueArea_name'
if ISSUE_AREA_COL in df_features.columns:
    print(f"\nMapping '{ISSUE_AREA_COL}' codes into '{NEW_ISSUE_AREA_COL}'...")
    # Based on Appendix A14 'varIssuesAreas'
    issue_area_map_names = { 1: 'Criminal Procedure', 2: 'Civil Rights', 3: 'First Amendment', 4: 'Due Process', 5: 'Privacy', 6: 'Attorneys', 7: 'Unions', 8: 'Economic Activity', 9: 'Judicial Power', 10: 'Federalism', 11: 'Interstate Relations', 12: 'Federal Taxation', 13: 'Miscellaneous', 14: 'Private Action' }
    df_features[NEW_ISSUE_AREA_COL] = df_features[ISSUE_AREA_COL].map(issue_area_map_names).fillna('Unknown/Missing Code')
    print(f"  Value counts for '{NEW_ISSUE_AREA_COL}':\n{df_features[NEW_ISSUE_AREA_COL].value_counts(dropna=False)}")
else: print(f"'{ISSUE_AREA_COL}' column not found.")

# --- Simplify 'caseSource' ---
CASE_SOURCE_COL = 'caseSource'; NEW_CASE_SOURCE_COL = 'caseSource_group'
if CASE_SOURCE_COL in df_features.columns:
    print(f"\nSimplifying '{CASE_SOURCE_COL}' into '{NEW_CASE_SOURCE_COL}'...")
    # Based on Appendix A6 'varCaseSources' - Grouping major types
    combined_source_map = {
         **{code: 'FedCircCt' for code in range(21, 33)}, 8: 'FedCircCt',
         300: 'StateHighCt', 301: 'StateAppCt', 302: 'StateTrialCt',
         **{code: 'FedDistCt' for code in range(41, 188)},
         **{code: 'SpecialtyFedCt' for code in [1,2,3,4,5,6,7,9,10,12,13,14,18,20,601]},
         **{code: 'TerritorialCt' for code in [15,16,17]},
         **{code: 'LegacyFedCircCt' for code in range(400, 450)},
         19: 'DC_FedRelated'
        }
    df_features[NEW_CASE_SOURCE_COL] = df_features[CASE_SOURCE_COL].map(combined_source_map).fillna('Other/Unknown')
    print(f"  Value counts for '{NEW_CASE_SOURCE_COL}':\n{df_features[NEW_CASE_SOURCE_COL].value_counts(dropna=False)}")
else: print(f"'{CASE_SOURCE_COL}' not found.")

# --- Simplify 'caseOrigin' ---
CASE_ORIGIN_COL = 'caseOrigin'; NEW_CASE_ORIGIN_COL = 'caseOrigin_group'
if CASE_ORIGIN_COL in df_features.columns:
     print(f"\nSimplifying '{CASE_ORIGIN_COL}' into '{NEW_CASE_ORIGIN_COL}' (Using same map as caseSource)...")
     # Using the same map created for caseSource based on Appendix A6
     if 'combined_source_map' in locals():
          df_features[NEW_CASE_ORIGIN_COL] = df_features[CASE_ORIGIN_COL].map(combined_source_map).fillna('Other/Unknown')
          print(f"  Value counts for '{NEW_CASE_ORIGIN_COL}':\n{df_features[NEW_CASE_ORIGIN_COL].value_counts(dropna=False)}")
     else:
          print("   ERROR: 'combined_source_map' not defined. Cannot simplify caseOrigin.")
          df_features[NEW_CASE_ORIGIN_COL] = 'Error_Map_Missing'
else: print(f"'{CASE_ORIGIN_COL}' not found.")


# --- Simplify 'lawType' ---
LAW_TYPE_COL = 'lawType'; NEW_LAW_TYPE_COL = 'lawType_group'
if LAW_TYPE_COL in df_features.columns:
    print(f"\nSimplifying '{LAW_TYPE_COL}' into '{NEW_LAW_TYPE_COL}'...")
    # Based on Appendix A20 'varLawArea'
    law_type_map = {
        1: 'Constitution', 2: 'Const Amendment', 3: 'Fed Statute',
        4: 'Court Rules', 5: 'Other', 6: 'Infrequent Fed Statute',
        8: 'State/Local Law', 9: 'No Legal Provision' }
    df_features[NEW_LAW_TYPE_COL] = df_features[LAW_TYPE_COL].map(law_type_map).fillna('Unknown/Missing')
    print(f"  Value counts for '{NEW_LAW_TYPE_COL}':\n{df_features[NEW_LAW_TYPE_COL].value_counts(dropna=False)}")
else: print(f"'{LAW_TYPE_COL}' not found.")

# --- Simplify 'certReason' ---
CERT_REASON_COL = 'certReason'; NEW_CERT_REASON_COL = 'certReason_group'
if CERT_REASON_COL in df_features.columns:
    print(f"\nSimplifying '{CERT_REASON_COL}' into '{NEW_CERT_REASON_COL}'...")
    # Based on Appendix A7 'varCertReason'
    cert_reason_map = {
        1: 'Not Applicable / Denied', # Case did not arise on cert or cert not granted
        2: 'Court Conflict',          # federal court conflict
        3: 'Court Conflict',          # federal court conflict and important question (group with conflict)
        4: 'Court Conflict',          # putative conflict (group with conflict)
        5: 'Court Conflict',          # conflict between federal and state court
        6: 'Court Conflict',          # state court conflict
        7: 'Confusion/Uncertainty',   # federal court confusion or uncertainty
        8: 'Confusion/Uncertainty',   # state court confusion or uncertainty
        9: 'Confusion/Uncertainty',   # federal and state court confusion or uncertainty
        10: 'Important Question',     # to resolve important or significant question
        11: 'Important Question',     # to resolve question presented (group with important question)
        12: 'No Reason Given',        # no reason given
        13: 'Other Reason'            # other reason
    }
    df_features[NEW_CERT_REASON_COL] = df_features[CERT_REASON_COL].map(cert_reason_map).fillna('Unknown/Missing')
    print(f"  Value counts for '{NEW_CERT_REASON_COL}':\n{df_features[NEW_CERT_REASON_COL].value_counts(dropna=False)}")
else: print(f"'{CERT_REASON_COL}' not found.")

# --- Simplify 'lcDisposition' ---
LC_DISP_COL = 'lcDisposition'; NEW_LC_DISP_COL = 'lcDisposition_group'
if LC_DISP_COL in df_features.columns:
    print(f"\nSimplifying '{LC_DISP_COL}' into '{NEW_LC_DISP_COL}'...")
    # Based on Appendix A3 'varCaseDispositionLc'
    lc_disp_map = {
        2: 'Affirmed',           # affirmed
        6: 'Affirmed',           # affirmed and reversed (or vacated) in part (treat as affirmation overall)
        7: 'Affirmed',           # affirmed and reversed (or vacated) in part and remanded (treat as affirmation overall)
        3: 'Reversed',           # reversed
        4: 'Reversed',           # reversed and remanded (treat as reversal overall)
        5: 'Vacated/Remanded',   # vacated and remanded
        8: 'Vacated/Remanded',   # vacated (group with vacate/remand)
        11: 'Vacated/Remanded',  # remand (group with vacate/remand)
        1: 'Procedural',         # stay, petition, or motion granted
        9: 'Procedural',         # petition denied or appeal dismissed
        10: 'Procedural',        # modify (treat as procedural adjustment)
        12: 'Procedural'         # unusual disposition (treat as procedural)
        # This grouping is subjective, especially codes 6, 7, 10, 11, 12. 
    }
    df_features[NEW_LC_DISP_COL] = df_features[LC_DISP_COL].map(lc_disp_map).fillna('Unknown/Missing')
    print(f"  Value counts for '{NEW_LC_DISP_COL}':\n{df_features[NEW_LC_DISP_COL].value_counts(dropna=False)}")
else: print(f"'{LC_DISP_COL}' not found.")

print("\n First part done.")


=== Step 4a: Simplifying Categoricals (Examples using Codebook) ===

Simplifying 'jurisdiction' into 'jurisdiction_group'...
  Value counts for 'jurisdiction_group':
jurisdiction_group
Certiorari          7768
Appeal              1615
Original             112
Writ                  20
Other Procedural      11
Certification          6
Unknown/Missing        2
Name: count, dtype: int64

Simplifying 'petitioner' into 'petitioner_group' (Partial Example)...
  Value counts for 'petitioner_group' (Top 20):
petitioner_group
Other/Unknown       4309
Individual          1007
US Govt              918
US Agency            906
State Govt           823
Business             755
Local Govt           277
Employee             218
Union                214
Indian                63
US Govt Official      44
Name: count, dtype: int64

Simplifying 'respondent' into 'respondent_group' (Using same map as petitioner)...
  Value counts for 'respondent_group' (Top 20):
respondent_group
Other/Unknown       4042
St
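A large share of petitioner and respondent codes currently lands in 'Other/Unknown' because the party map is only partial. A small helper like the one below could show which raw codes are responsible, so the map can be extended where it matters most. This is a sketch under our own assumptions: the function name `unmapped_code_counts` is hypothetical, and the toy codes 126 and 999 are placeholders, not a statement about the codebook.

```python
import pandas as pd

def unmapped_code_counts(codes, mapping, top=10):
    """Count the most frequent non-missing codes that have no entry in `mapping`."""
    known = list(mapping.keys())
    mask = codes.notna() & ~codes.isin(known)
    return codes[mask].value_counts().head(top)

# Toy data, invented for illustration only
toy_codes = pd.Series([27, 28, 126, 126, 126, 999, None, 27])
toy_map = {27: "US Govt", 28: "State Govt"}

report = unmapped_code_counts(toy_codes, toy_map)
coverage = toy_codes.isin(list(toy_map.keys())).mean()
print(report)
print(f"Share of rows covered by the map: {coverage:.1%}")
```

Applied to the real data, the call would look like `unmapped_code_counts(df_features['petitioner'], petitioner_map)`.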

In [10]:
print("\n=== Step 4b: Advanced Feature Engineering ===")

# --- Enhanced Case Complexity Metrics ---
print("\n--- Engineering Advanced Complexity Metrics ---")
if 'num_dockets_in_case' in df_features.columns and 'lcDisagreement' in df_features.columns:
    # Rationale: Captures potentially heightened complexity when consolidation AND lower court disagreement occur together (hypothesis: higher complexity leads to longer duration).
    df_features['complex_consolidated_disagreement'] = df_features['num_dockets_in_case'] * df_features['lcDisagreement']
    print("- Engineered 'complex_consolidated_disagreement'.")
else: print("- Skipping 'complex_consolidated_disagreement'.")

NEW_ISSUE_AREA_COL = 'issueArea_name' # 4a. variable
ISSUE_AREA_COL = 'issueArea'
issue_area_for_econ = NEW_ISSUE_AREA_COL if NEW_ISSUE_AREA_COL in df_features.columns else ISSUE_AREA_COL
econ_val_to_check = 'Economic Activity' if NEW_ISSUE_AREA_COL in df_features.columns else 8
if 'adminAction' in df_features.columns and issue_area_for_econ in df_features.columns:
    # Flags the intersection of administrative agency review and economic issues, which might have distinct durations
    is_admin = (df_features['adminAction'] > 0).astype(int)
    is_econ = (df_features[issue_area_for_econ] == econ_val_to_check).astype(int)
    df_features['is_AdminAction_x_Economic'] = is_admin * is_econ
    print(f"- Engineered 'is_AdminAction_x_Economic'.")
else: print("- Skipping 'is_AdminAction_x_Economic'.")

# --- Temporal Dynamics within Term ---
print("\n--- Engineering Advanced Temporal Dynamics ---")
if 'term' in df_features.columns and 'dateDecision' in df_features.columns and 'dateArgument' in df_features.columns:
    # Measures case timing relative to the Court's annual calendar start, captures potential workload/pace variations throughout the term
    df_features['term'] = df_features['term'].astype(int)
    df_features['approx_term_start_date'] = pd.to_datetime(df_features['term'].astype(str) + '-10-01', errors='coerce')
    df_features['days_from_term_start_to_decision'] = (df_features['dateDecision'] - df_features['approx_term_start_date']).dt.days
    df_features['days_from_term_start_to_argument'] = (df_features['dateArgument'] - df_features['approx_term_start_date']).dt.days
    # Clip negative offsets (dates before the approximate October 1 term start) to 0; NaN values pass through unchanged
    df_features['days_from_term_start_to_decision'] = df_features['days_from_term_start_to_decision'].clip(lower=0)
    df_features['days_from_term_start_to_argument'] = df_features['days_from_term_start_to_argument'].clip(lower=0)
    df_features = df_features.drop('approx_term_start_date', axis=1)
    print("- Engineered 'days_from_term_start_to_decision'/'_argument'.")
else: print("- Skipping 'Term Timing' features.")

if 'decision_month' in df_features.columns:
    # Identifies cases decided during the end-of-term "June Rush" period (May onward)
    df_features['is_late_term_decision'] = (df_features['decision_month'] >= 5).astype(int)
    print("- Engineered 'is_late_term_decision'.")
else: print("- Skipping 'is_late_term_decision'.")

if 'argument_month' in df_features.columns:
    # Identifies cases argued late in the schedule (April onward)
    df_features['is_late_term_argument'] = (df_features['argument_month'] >= 4).astype(int)
    print("- Engineered 'is_late_term_argument'.")
else: print("- Skipping 'is_late_term_argument'.")

# --- Additional Interaction Features ---
print("\n--- Engineering Additional Interaction Features ---")
# Explores how lower court disagreement interacts with the ideological direction of the lower court's ruling
if 'lcDisagreement' in df_features.columns and 'lcDispositionDirection' in df_features.columns:
    cond_list = [ (df_features['lcDisagreement'] == 0) & (df_features['lcDispositionDirection'] == 1), (df_features['lcDisagreement'] == 0) & (df_features['lcDispositionDirection'] == 2), (df_features['lcDisagreement'] == 1) & (df_features['lcDispositionDirection'] == 1), (df_features['lcDisagreement'] == 1) & (df_features['lcDispositionDirection'] == 2) ]
    choice_list = [ 'LC_Agree_Conservative', 'LC_Agree_Liberal', 'LC_Disagree_Conservative', 'LC_Disagree_Liberal' ]
    df_features['lc_disagree_direction'] = np.select(cond_list, choice_list, default='LC_Unspec/Other')
    print("- Engineered 'lc_disagree_direction' interaction.")
else: print("- Skipping 'lc_disagree_direction': Base columns missing.")

# Tests if the combination of issue area and jurisdictional path influences duration
NEW_JURISDICTION_COL = 'jurisdiction_group' # 4a
NEW_ISSUE_AREA_COL = 'issueArea_name'       # 4a
if NEW_JURISDICTION_COL in df_features.columns and NEW_ISSUE_AREA_COL in df_features.columns:
    df_features['issue_x_jurisdiction'] = df_features[NEW_ISSUE_AREA_COL].astype(str) + '_via_' + df_features[NEW_JURISDICTION_COL].astype(str)
    print("- Engineered 'issue_x_jurisdiction' interaction.")
else: print("- Skipping 'issue_x_jurisdiction': Requires simplified columns.")

# Effect of a three-judge district court origin across issue areas
if 'threeJudgeFdc' in df_features.columns and NEW_ISSUE_AREA_COL in df_features.columns:
     df_features['threeJudge_x_issue'] = df_features[NEW_ISSUE_AREA_COL].astype(str) + '_3Judge_' + df_features['threeJudgeFdc'].astype(str)
     print("- Engineered 'threeJudge_x_issue' interaction.")
else: print("- Skipping 'threeJudge_x_issue': Base columns missing.")

# --- Implementing Placeholders ---
print("\n--- Implementing Placeholder Features ---")
# Captures potential duration effects of specific litigant pairings (e.g., government vs. business)
NEW_PETITIONER_COL = 'petitioner_group' # 4a
NEW_RESPONDENT_COL = 'respondent_group' # 4a
if NEW_PETITIONER_COL in df_features.columns and NEW_RESPONDENT_COL in df_features.columns and \
   'Error_Map_Missing' not in df_features[NEW_RESPONDENT_COL].unique():  # sentinel set in Step 4a when the shared party map is unavailable
    print("- Engineering 'Party Configuration' features...")
    govt_list = ['US Govt', 'State Govt', 'Local Govt', 'US Govt Official', 'Govt Official', 'US Agency']
    business_list = ['Business']
    individual_list = ['Individual', 'Employee', 'Indian']
    is_govt_pet = df_features[NEW_PETITIONER_COL].isin(govt_list); is_business_resp = df_features[NEW_RESPONDENT_COL].isin(business_list)
    is_business_pet = df_features[NEW_PETITIONER_COL].isin(business_list); is_govt_resp = df_features[NEW_RESPONDENT_COL].isin(govt_list)
    is_indiv_pet = df_features[NEW_PETITIONER_COL].isin(individual_list); is_indiv_resp = df_features[NEW_RESPONDENT_COL].isin(individual_list)
    is_state_pet = (df_features[NEW_PETITIONER_COL] == 'State Govt'); is_state_resp = (df_features[NEW_RESPONDENT_COL] == 'State Govt')
    df_features['is_Govt_vs_Business'] = ((is_govt_pet & is_business_resp) | (is_business_pet & is_govt_resp)).astype(int)
    df_features['is_Individual_vs_Govt'] = ((is_indiv_pet & is_govt_resp) | (is_govt_pet & is_indiv_resp)).astype(int)
    df_features['is_State_vs_State'] = (is_state_pet & is_state_resp).astype(int)
    print("  - Engineered party configuration flags.")
else: print("- Skipping 'Party Configuration': Requires COMPLETED simplified petitioner AND respondent columns.")

# Checks if lower court disagreement originates specifically from the influential federal circuit courts
NEW_CASE_SOURCE_COL = 'caseSource_group' # 4a
FED_CIRC_CATEGORY_NAME = 'FedCircCt'
if NEW_CASE_SOURCE_COL in df_features.columns and 'lcDisagreement' in df_features.columns and \
   'Unknown_Not_Simplified' not in df_features[NEW_CASE_SOURCE_COL].unique():
    print("- Engineering 'Lower Court Context' features...")
    is_fed_circ = (df_features[NEW_CASE_SOURCE_COL] == FED_CIRC_CATEGORY_NAME)
    df_features['is_FedCirc_Conflict'] = (is_fed_circ & (df_features['lcDisagreement'] == 1)).astype(int)
    print("  - Engineered is_FedCirc_Conflict flag.")
else: print("- Skipping 'Lower Court Context': Requires COMPLETED simplified caseSource column.")

print(f"Final shape of df_features after advanced engineering: {df_features.shape}")


=== Step 4b: Advanced Feature Engineering ===

--- Engineering Advanced Complexity Metrics ---
- Engineered 'complex_consolidated_disagreement'.
- Engineered 'is_AdminAction_x_Economic'.

--- Engineering Advanced Temporal Dynamics ---
- Engineered 'days_from_term_start_to_decision'/'_argument'.
- Engineered 'is_late_term_decision'.
- Engineered 'is_late_term_argument'.

--- Engineering Additional Interaction Features ---
- Engineered 'lc_disagree_direction' interaction.
- Engineered 'issue_x_jurisdiction' interaction.
- Engineered 'threeJudge_x_issue' interaction.

--- Implementing Placeholder Features ---
- Engineering 'Party Configuration' features...
  - Engineered party configuration flags.
- Engineering 'Lower Court Context' features...
  - Engineered is_FedCirc_Conflict flag.
Final shape of df_features after advanced engineering: (9534, 82)
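To make the term-timing features concrete, here is a minimal sketch of the offset calculation on made-up dates. It assumes, as the cell above does, that every term starts on October 1 of the term year, which is only an approximation; the helper name `days_from_term_start` is our own.

```python
import pandas as pd

def days_from_term_start(dates, terms, month_day="10-01"):
    """Days between an approximate term start (term year + month_day) and `dates`.

    Negative offsets (dates before the assumed start) are clipped to 0.
    """
    term_start = pd.to_datetime(terms.astype(str) + "-" + month_day, errors="coerce")
    return (dates - term_start).dt.days.clip(lower=0)

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "term": [2000, 2000, 2000],
    "dateArgument": pd.to_datetime(["2000-10-03", "2001-01-10", "2000-09-28"]),
})
toy["days_from_term_start_to_argument"] = days_from_term_start(toy["dateArgument"], toy["term"])
print(toy)
```

The third row lies three days before the assumed term start, so its offset is clipped to 0 rather than reported as negative.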


In [11]:
df_features.columns

Index(['caseId', 'docketId', 'caseIssuesId', 'voteId', 'dateDecision',
       'decisionType', 'usCite', 'sctCite', 'ledCite', 'lexisCite', 'term',
       'naturalCourt', 'chief', 'docket', 'caseName', 'dateArgument',
       'dateRearg', 'petitioner', 'petitionerState', 'respondent',
       'respondentState', 'jurisdiction', 'adminAction', 'adminActionState',
       'threeJudgeFdc', 'caseOrigin', 'caseOriginState', 'caseSource',
       'caseSourceState', 'lcDisagreement', 'certReason', 'lcDisposition',
       'lcDispositionDirection', 'declarationUncon', 'caseDisposition',
       'caseDispositionUnusual', 'partyWinning', 'precedentAlteration',
       'voteUnclear', 'issue', 'issueArea', 'decisionDirection',
       'decisionDirectionDissent', 'authorityDecision1', 'authorityDecision2',
       'lawType', 'lawSupp', 'lawMinor', 'majOpinWriter', 'majOpinAssigner',
       'splitVote', 'majVotes', 'minVotes', 'duration_days', 'decision_year',
       'decision_month', 'argument_month', 'num_do

Saving the modified dataset

In [12]:
# --- Define Output Directory and File ---
output_dir = 'data/processed'
output_filename = 'scdb_processed_uncropped_df1.csv'
output_filepath = os.path.join(project_root, output_dir, output_filename)

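os.makedirs(os.path.dirname(output_filepath), exist_ok=True)  # make sure the target directory exists before writing
df_features.to_csv(output_filepath, index=False)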

Identifying leakage variables

In [13]:
print(f"\n=== Starting Step 5: Select Final Features (X) and Target (y) ===")
print(f"Working with df_features shape: {df_features.shape}")

# Defining dependent variable
TARGET_COL = 'duration_days'
if TARGET_COL not in df_features.columns:
    # Raise instead of calling exit(), which would stop the notebook kernel without a traceback
    raise KeyError(f"Target column '{TARGET_COL}' not found in df_features!")
y = df_features[TARGET_COL]
print(f"\nTarget variable 'y' ('{TARGET_COL}') defined. Shape: {y.shape}")


print("\n--- Defining Columns to Exclude from Features ---")

# 1. Leakage Variables:
#    Rationale: These variables contain information that would only be known *after*
#               the case decision ('dateDecision') is made. Including them to predict
#               'duration_days' (which ends at 'dateDecision') would mean using
#               information from the "future" relative to the prediction point.
#               They must therefore not be used for prediction.
#
leakage_vars = [
    'decisionType', 'declarationUncon', 'caseDisposition', 'caseDispositionUnusual',
    'partyWinning', 'precedentAlteration', 'voteUnclear', 'decisionDirection',
    'decisionDirectionDissent', 'authorityDecision1', 'authorityDecision2',
    'majOpinWriter', 'majOpinAssigner', 'splitVote', 'majVotes', 'minVotes'
]
print(f"\n- Identifying {len(leakage_vars)} potential LEAKAGE variables to exclude.")

# 2. Identifiers & Reference Information:
#    Rationale: These columns serve to uniquely identify a case, docket, issue, vote,
#               or provide citation/name information. They generally do not contain
#               substantive information predictive of case duration itself. 
#               They are dropped as either uninformative or outside the scope of this study.
identifier_vars = [
    'caseId', 'docketId', 'caseIssuesId', 'voteId', # Core IDs
    'usCite', 'sctCite', 'ledCite', 'lexisCite', # Citations
    'docket', # Raw docket string
    'caseName' # Case name string
]
identifier_vars = [col for col in identifier_vars if col in df_features.columns] # Check which exist
print(f"- Identifying {len(identifier_vars)} IDENTIFIER variables to exclude.")

# 3. Raw Date Columns:
#    Rationale: The relevant predictive information *from* these date columns (year,
#               month, duration itself, reargument flag) has already been extracted
#               into other variables.
raw_date_vars = ['dateDecision', 'dateArgument', 'dateRearg']
print(f"- Identifying {len(raw_date_vars)} RAW DATE variables to exclude.")

# 4. Target Variable:
#    Rationale: 'duration_days' is the prediction target (y) and must not appear in X.
target_var = ['duration_days']
print(f"- Identifying 1 TARGET variable ('{target_var[0]}') to exclude.")

# 5. High Missing / Poor Quality Variables: 
#    Rationale: Variables with an extremely high percentage of missing values (>50%) often provide little predictive signal and require complex
#               imputation that might introduce noise. Vars that are simply more or less specific versions of other variables are dropped too.
high_nan_or_poor_quality_vars = [
     'petitionerState', 'respondentState', 
     'adminActionState', 'adminAction',
     'caseOriginState', 'caseSourceState',
     'lawMinor', 
     'lawSupp', # Will use lawType
     'issue' # Will use issueArea
]
high_nan_or_poor_quality_vars = [col for col in high_nan_or_poor_quality_vars if col in df_features.columns]
print(f"- Identifying {len(high_nan_or_poor_quality_vars)} HIGH NAN / POOR QUALITY variables to exclude.")

# 6. Original Categorical Columns that were Simplified:
#    Rationale: If simplified versions (e.g., 'jurisdiction_group', 'petitioner_group')
#               were created in Step 4a, the original numeric code columns become redundant
#               and should be excluded to avoid feeding both raw codes and grouped categories
#               into the model (which can confuse encoding/interpretation).
original_categoricals_simplified = [
    'jurisdiction',   # Excluded because 'jurisdiction_group' was created
    'petitioner',     # Excluded because 'petitioner_group' was created
    'respondent',     # Excluded because 'respondent_group' was created
    'issueArea',      # Excluded because 'issueArea_name' was created
    'caseSource',     # Excluded because 'caseSource_group' was created
    'caseOrigin',     # Excluded because 'caseOrigin_group' was created
    'lawType',        # Excluded because 'lawType_group' was created
    'certReason',     # Excluded because 'certReason_group' was created
    'lcDisposition',  # Excluded because 'lcDisposition_group' was created.
]
original_categoricals_simplified = [col for col in original_categoricals_simplified if col in df_features.columns] # Check existence
print(f"- Identifying {len(original_categoricals_simplified)} ORIGINAL CATEGORICAL variables (simplified versions will be kept).")

columns_to_exclude = list(set(
    leakage_vars + identifier_vars + raw_date_vars + target_var 
    + original_categoricals_simplified 
    + high_nan_or_poor_quality_vars
))
print(f"\nTotal unique columns identified for exclusion from features: {len(columns_to_exclude)}")


# Define feature_columns as everything in df_features MINUS the excluded columns
all_columns = df_features.columns.tolist()
feature_columns = [col for col in all_columns if col not in columns_to_exclude]

print(f"\nFinal selected feature columns for X ({len(feature_columns)}):")
feature_columns.sort() # Sort alphabetically for consistent order
print(feature_columns)

# Final check if all selected feature columns actually exist
missing_cols_check = [col for col in feature_columns if col not in df_features.columns]
if missing_cols_check:
    raise KeyError(
        f"Selected feature columns missing from df_features: {missing_cols_check}. "
        "Check Step 4 engineering and Step 5 exclusion lists."
    )
else:
    X = df_features[feature_columns].copy()
    print(f"\nFeature matrix 'X' defined. Shape: {X.shape}")

print("\n--- Step 5: Feature Selection Complete ---")
print("You now have the final 'X' (features) and 'y' (target) ready.")
print("Next steps: Handle duplicates in X/y, Split data, Preprocess (impute, encode, scale).")


=== Starting Step 5: Select Final Features (X) and Target (y) ===
Working with df_features shape: (9534, 82)

Target variable 'y' ('duration_days') defined. Shape: (9534,)

--- Defining Columns to Exclude from Features ---

- Identifying 16 potential LEAKAGE variables to exclude.
- Identifying 10 IDENTIFIER variables to exclude.
- Identifying 3 RAW DATE variables to exclude.
- Identifying 1 TARGET variable ('duration_days') to exclude.
- Identifying 9 HIGH NAN / POOR QUALITY variables to exclude.
- Identifying 9 ORIGINAL CATEGORICAL variables (simplified versions will be kept).

Total unique columns identified for exclusion from features: 48

Final selected feature columns for X (34):
['argument_month', 'caseOrigin_group', 'caseSource_group', 'certReason_group', 'chief', 'complex_consolidated_disagreement', 'days_from_term_start_to_argument', 'days_from_term_start_to_decision', 'decision_month', 'decision_year', 'docket_category', 'had_reargument', 'is_AdminAction_x_Economic', 'is_Fed
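The exclusion lists above cover the raw decision-stage columns, but several engineered features are themselves derived from `dateDecision` (`decision_year`, `decision_month`, `days_from_term_start_to_decision`, `is_late_term_decision`). Because `days_from_term_start_to_decision` and `days_from_term_start_to_argument` are measured from the same approximate term start, their difference equals `duration_days` whenever neither value was clipped at 0. A simple name-based check, sketched below, could flag such columns for review. This is a heuristic of our own (the helper `flag_decision_derived` is hypothetical) and it does not decide whether the columns should be dropped.

```python
def flag_decision_derived(feature_columns, markers=("decision",)):
    """Return feature names containing any marker substring (case-insensitive).

    Purely name-based: it cannot detect leakage hidden behind other names.
    """
    return sorted(
        col for col in feature_columns
        if any(marker in col.lower() for marker in markers)
    )

# Subset of the engineered feature names created earlier in this notebook
example_features = [
    "argument_month", "decision_month", "decision_year",
    "days_from_term_start_to_argument", "days_from_term_start_to_decision",
    "is_late_term_decision", "had_reargument",
]
suspects = flag_decision_derived(example_features)
print(suspects)
```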

Saving the modified, cropped dataset

In [14]:
print("\n=== Step 6: Saving Processed Data ===")

output_dir = 'data/processed'
output_filename = 'scdb_processed_cropped_df2.csv'
output_filepath = os.path.join(project_root, output_dir, output_filename)

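# Check the directory under project_root (where the file is written), not relative to the working directory
output_dir_path = os.path.join(project_root, output_dir)
if not os.path.exists(output_dir_path):
    print(f"Creating output directory: {output_dir}")
    os.makedirs(output_dir_path)
else:
    print(f"Output directory '{output_dir}' already exists.")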

# --- Combine X and y for Saving ---
print(f"\nCombining features (X shape: {X.shape}) and target (y shape: {y.shape}) for saving...")
y.name = 'duration_days'
df_to_save = pd.concat([X, y], axis=1)
print(f"Combined DataFrame shape: {df_to_save.shape}")

# --- Save to CSV ---
try:
    print(f"Saving combined data to: {output_filepath}")
    df_to_save.to_csv(output_filepath, index=False)
    print("Data saved successfully!")
except Exception as e:
    print(f"ERROR saving data: {e}")


=== Step 6: Saving Processed Data ===
Creating output directory: data/processed

Combining features (X shape: (9534, 34)) and target (y shape: (9534,)) for saving...
Combined DataFrame shape: (9534, 35)
Saving combined data to: h:\000_Projects\01_GitHub\05_PythonProjects\scdb-case-timing-prediction\data/processed\scdb_processed_cropped_df2.csv
Data saved successfully!
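A quick round-trip check can confirm that the saved file reloads with the expected shape and columns. The sketch below uses a temporary directory and a toy DataFrame; the helper name `save_and_verify` is an assumption for illustration and is not used elsewhere in this project.

```python
import os
import tempfile
import pandas as pd

def save_and_verify(df, path):
    """Write `df` to CSV, reload it, and compare shape and column order."""
    os.makedirs(os.path.dirname(path), exist_ok=True)
    df.to_csv(path, index=False)
    reloaded = pd.read_csv(path)
    same_shape = reloaded.shape == df.shape
    same_columns = list(reloaded.columns) == list(df.columns)
    return same_shape and same_columns

# Toy data, invented for illustration only
toy = pd.DataFrame({"had_reargument": [0, 1], "duration_days": [60, 0]})
with tempfile.TemporaryDirectory() as tmp:
    ok = save_and_verify(toy, os.path.join(tmp, "data", "processed", "toy.csv"))
print(ok)
```

Note that a CSV round trip does not preserve dtypes such as datetimes or categoricals, so this check only covers shape and column names.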


# Further steps continue in the 002 file, which contains both light EDA for presentation and code for further cleaning.