# **1. Summary of Project Idea**

The core idea of this project is to use the **Supreme Court Database (SCDB)**, specifically the modern era data (1946-2023), to predict the duration of time between when a case is orally argued and when the final decision is announced. We aim to achieve this using machine learning (ML) techniques.

Beyond just prediction, a key goal is to use Explainable AI (XAI) methods (like SHAP) to understand which characteristics of a case (e.g., the type of legal issue, the parties involved, lower court conflict, case complexity indicators) are most influential in determining this argument-to-decision duration, according to the ML model.

## **1.1 Preparation of Data Steps**

The data preparation steps we've taken are needed for building a valid and meaningful predictive model for this specific task:

**Using caseCentered_Docket.csv:** We chose this file because its unit of analysis (the docket) aligns well with tracking a case's journey. It allows us to handle consolidated cases (multiple dockets per caseId) effectively, which is important for measuring case complexity. Files organized by issue or vote would be too granular for predicting the overall case duration.

**Calculating duration_days:** Since "time of case" isn't directly in the SCDB, we defined it operationally as dateDecision - dateArgument. This is the most feasible measure using only SCDB data for the post-argument phase.

**Filtering NaN duration_days:** We must remove rows where duration is missing (NaN). This happens primarily because the case was decided without oral argument (dateArgument is missing). Our target variable literally doesn't exist for these cases, so the model cannot learn from them for this specific prediction task. This step focuses the analysis on orally argued cases.

**Filtering Negative (< 0) duration_days:** We removed these rows because a negative duration is logically impossible and indicates data errors in the recorded dates (dateDecision before dateArgument). Keeping these would introduce noise and errors into the model.

**Keeping Zero (== 0) duration_days (For Now):** We decided to keep cases with exactly zero duration for the time being. These may be data errors or atypical same-day decisions, but removing them immediately might discard valid edge cases. We flagged their presence and noted that they can be removed later if they prove problematic for the model. (There are 2 such rows in the dataset.)

**Feature Engineering (num_dockets_in_case, had_reargument, docket_category):** We created new features to capture potentially predictive information not directly present as single variables:

*num_dockets_in_case:* Measures complexity from consolidation (using the docket file structure).

*had_reargument:* Captures complexity/contentiousness indicated by the Court needing a second argument (Boolean variable).

*docket_category:* Attempts to extract the type of docket (Original, Miscellaneous, Merits) from the raw docket string, as this might correlate with different processing timelines.
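
A minimal sketch of how `docket_category` might be derived from the raw docket string. The function name `categorize_docket` and its rules are assumptions for illustration, based only on the example formats quoted in this document ("24", "133M", "5, Orig."): a trailing `M` or the substring "misc" marks Miscellaneous, the substring "orig" marks Original, and everything else is treated as Merits. The real column contains more variants and should be inspected before settling on rules.

```python
import re

def categorize_docket(docket):
    """Map a raw docket string to a coarse category (hypothetical rules)."""
    if docket is None:
        return "Unknown"
    text = str(docket).strip()
    if not text or text.lower() == "nan":
        return "Unknown"
    # e.g. "5, Orig.": original-jurisdiction docket
    if re.search(r"orig", text, flags=re.IGNORECASE):
        return "Original"
    # e.g. "133M": miscellaneous docket
    if re.search(r"\d\s*M$", text) or re.search(r"misc", text, flags=re.IGNORECASE):
        return "Miscellaneous"
    # Plain numbers and anything unrecognized fall through to Merits
    return "Merits"

examples = ["24", "133M", "5, Orig.", None]
categories = [categorize_docket(d) for d in examples]
print(categories)  # ['Merits', 'Miscellaneous', 'Original', 'Unknown']
```

Applied to the DataFrame, this would be something like `df_features['docket'].map(categorize_docket)`; rows that land in "Merits" only by falling through are worth spot-checking.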

**Excluding Leakage Variables:** This is critical. We carefully removed variables whose values are only known after the decision is made (e.g., partyWinning, decisionDirection, majVotes, caseDisposition). Including these would allow the model to "cheat" by using information from the future (relative to the prediction point), leading to artificially inflated performance and invalid results for predicting duration before the decision is known.

We also removed identifiers and raw date columns after extracting useful info from them.
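
The exclusion step can be sketched as follows. The column lists are taken from the variables this document marks as leakage or as identifiers; the helper name `build_feature_matrix` and the toy frame are assumptions for illustration. Treating `voteUnclear` and `splitVote` as leakage is the conservative choice here, and `decisionType` (flagged above only as a potential risk) is left to the reader's judgment.

```python
import pandas as pd

# Columns this document flags as known only at or after the decision.
# decision_year / decision_month are included because they derive from dateDecision.
LEAKAGE_COLS = [
    "declarationUncon", "caseDisposition", "caseDispositionUnusual",
    "partyWinning", "precedentAlteration", "voteUnclear",
    "decisionDirection", "decisionDirectionDissent",
    "authorityDecision1", "authorityDecision2",
    "majOpinWriter", "majOpinAssigner", "splitVote", "majVotes", "minVotes",
    "decision_year", "decision_month",
]
# Identifiers, citations and raw dates that should not be fed to the model
ID_AND_DATE_COLS = [
    "caseId", "docketId", "caseIssuesId", "voteId", "usCite", "sctCite",
    "ledCite", "lexisCite", "docket", "caseName",
    "dateDecision", "dateArgument", "dateRearg",
]

def build_feature_matrix(frame, target="duration_days"):
    """Return (X, y) with leakage, identifier and raw date columns removed."""
    candidates = LEAKAGE_COLS + ID_AND_DATE_COLS + [target]
    to_drop = [c for c in candidates if c in frame.columns]
    return frame.drop(columns=to_drop), frame[target]

# Tiny made-up frame, only to show the call
toy = pd.DataFrame({
    "caseId": ["a", "b"], "term": [1946, 1947], "issueArea": [8, 1],
    "majVotes": [8, 6], "partyWinning": [1, 0], "duration_days": [313, 40],
})
X, y = build_feature_matrix(toy)
print(list(X.columns))  # ['term', 'issueArea']
```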

**Need for Categorical Simplification (Placeholder):** Many SCDB variables (petitioner, respondent, issueArea, jurisdiction, etc.) use hundreds of numeric codes. Directly using these high-cardinality features can make modeling difficult and less interpretable. Grouping these codes into meaningful categories based on the SCDB Codebook (e.g., 'Business' vs 'Government' petitioner types, broader issue areas) is essential. This step still needs manual implementation.
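
Until the Codebook-based grouping is written, one possible interim approach is frequency-based lumping: keep the most common codes and relabel the rest as "Other". This is a stand-in technique, not the Codebook grouping described above, and the helper name `lump_rare_codes`, the `top_n` value, and the code values in the toy series are all made up for illustration.

```python
import pandas as pd

def lump_rare_codes(codes, top_n=2):
    """Keep the top_n most frequent codes as labels; relabel the rest."""
    counts = codes.value_counts()              # NaN is excluded by default
    frequent = set(counts.nlargest(top_n).index)

    def relabel(value):
        if pd.isna(value):
            return "Missing"                   # keep missingness as its own level
        return str(int(value)) if value in frequent else "Other"

    return codes.map(relabel)

# Made-up code values standing in for a high-cardinality column such as petitioner
toy_codes = pd.Series([27.0, 27.0, 27.0, 28.0, 28.0, 100.0, 151.0, None])
lumped = lump_rare_codes(toy_codes, top_n=2)
print(lumped.tolist())
# ['27', '27', '27', '28', '28', 'Other', 'Other', 'Missing']
```

Frequency lumping reduces cardinality but is less interpretable than named groups such as 'Business' or 'US Govt', so it is best treated as a placeholder.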

In essence, the preparation aimed to create a clean dataset (df_argued) containing only the relevant cases (orally argued, with non-negative duration, since the two zero-duration rows are kept for now) and a feature matrix (X) containing potentially predictive information available before the decision, while excluding invalid data and leakage variables.


## **1.2 Variables in the Dataset**

The dataset (df_argued) contains the following variables:

--- Descriptions for ALL 53 SCDB Variables Listed ---
Note: Role/Use descriptions relate primarily to the task of
predicting Argument-to-Decision duration unless otherwise specified.

caseId: (Identification)
- Description: Unique identifier assigned by the SCDB to each distinct Supreme Court dispute or consolidated set of disputes.
- Values: String, typically YYYY-NNN format (e.g., "1946-001").
- Use: Primary key for linking related rows (e.g., dockets in a consolidated case, or linking to external data like Salience). Not typically used directly as a feature.

docketId: (Identification)
- Description: Unique identifier for each specific docket number associated with a caseId. Multiple docketIds can share a caseId in consolidated cases.
- Values: String, often YYYY-NNN-DocketSeq format (e.g., "1946-001-01").
- Use: Primary key in Docket-centered files. Useful for identifying consolidated cases (via caseId). Not typically used directly as a feature.

caseIssuesId: (Identification)
- Description: Unique identifier for each specific set of issue(s) and legal provision(s) addressed within a docketId. More granular than docketId. Found in Issue/LegalProvision organized files.
- Values: String, builds on docketId (e.g., "1946-001-01-01").
- Use: Identifier for issue-level analysis. Not typically used for case-level duration prediction.

voteId: (Identification)
- Description: Unique identifier for each specific voting alignment on a caseIssuesId, mainly relevant for rare split votes. Most granular identifier. Found in Vote-organized files.
- Values: String, builds on caseIssuesId (e.g., "1946-001-01-01-01").
- Use: Identifier for vote-level analysis, especially complex voting patterns. Not typically used for case-level duration prediction.

dateDecision: (Chronological)
- Description: The date the Supreme Court announced its decision.
- Values: Date object.
- Use: Crucial endpoint for calculating duration. Because it defines the end point of the target, neither the date itself nor features derived from it (e.g., decision_year, decision_month) can serve as predictive *features* for duration: combined with the argument date they would largely reveal the target. Potential inaccuracies exist (see Guide Section V).

decisionType: (Outcome / Process)
- Description: Code indicating how the Court processed/decided the case procedurally.
- Values: Numeric codes (1=Opinion of the Court, orally argued; 2=Per Curiam, no oral argument; 4=Decree; 5=Equally divided vote; 6=Per Curiam, orally argued; 7=Judgment of the Court, orally argued; etc.). Verify against the Codebook.
- Use: Explains *why* dateArgument might be missing (code 2 covers per curiam decisions without oral argument). Can be a feature itself, but determined at/near decision time, so potential leakage risk depending on exact prediction point.

usCite: (Identification / Background)
- Description: Citation for the case in the official United States Reports.
- Values: String (e.g., "329 U.S. 1"). Can have NaNs if not yet published or not applicable.
- Use: Case identifier/linking. Not typically used as a feature.

sctCite: (Identification / Background)
- Description: Citation in the Supreme Court Reporter (West).
- Values: String (e.g., "67 S. Ct. 6"). Can have NaNs.
- Use: Case identifier/linking. Not typically used as a feature.

ledCite: (Identification / Background)
- Description: Citation in the Lawyers' Edition (LexisNexis).
- Values: String (e.g., "91 L. Ed. 3"). Can have NaNs.
- Use: Case identifier/linking. Not typically used as a feature.

lexisCite: (Identification / Background)
- Description: Citation in the LexisNexis database format.
- Values: String (e.g., "1946 U.S. LEXIS 1724"). Can have NaNs.
- Use: Case identifier/linking. Not typically used as a feature.

term: (Chronological / Context)
- Description: The Supreme Court Term in which the decision was handed down.
- Values: Numeric year representing start of Term (e.g., 1946 for Oct 1946 - June 1947).
- Use: Key chronological feature for trends, era effects, temporal splits, merging external data (like MQ scores). Can treat as numeric or categorical.
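
A temporal split on `term` might look like the sketch below. The helper name `split_by_term`, the cutoff of 2010, and the toy values are assumptions for illustration; the document does not fix a cutoff. Splitting on `term` rather than at random keeps the test set strictly later than the training set, and consolidated dockets that share a caseId also share a term, so they stay on the same side.

```python
import pandas as pd

def split_by_term(frame, cutoff_term):
    """Train on terms before cutoff_term, test on cutoff_term and later."""
    train = frame[frame["term"] < cutoff_term]
    test = frame[frame["term"] >= cutoff_term]
    return train, test

# Made-up rows, only to show the call
toy = pd.DataFrame({
    "term": [1946, 1980, 2009, 2010, 2022],
    "duration_days": [313, 90, 75, 120, 101],
})
train, test = split_by_term(toy, cutoff_term=2010)
print(len(train), len(test))  # 3 2
```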

naturalCourt: (Chronological / Context)
- Description: Code identifying periods of stable membership on the Court.
- Values: Numeric codes (e.g., 1301). See Codebook Appendix 1.
- Use: Feature capturing effects of specific Court compositions. Treat as categorical.

chief: (Chronological / Context)
- Description: Code identifying the Chief Justice presiding.
- Values: Chief Justice surname stored as a string in this CSV release (e.g., Vinson, Warren, Roberts); the column loads with object dtype. See Codebook.
- Use: Feature capturing effects of Chief Justice eras. Treat as categorical.

docket: (Identification / Background)
- Description: The original, raw docket number string assigned by the SC (e.g., "24", "133M", "5, Orig.").
- Values: String. Format can be inconsistent (See Guide Section V).
- Use: Linking to external court records. Can be *engineered* into features (like docket_category), but generally not used directly as a feature due to inconsistency and high cardinality.

caseName: (Identification / Background)
- Description: The name of the case (e.g., "HALLIBURTON OIL WELL CEMENTING CO. v. WALKER...").
- Values: String.
- Use: Identifier. Not suitable as a standard feature for ML (text analysis techniques would be needed).

dateArgument: (Chronological)
- Description: The date of the first day of oral argument. Missing (NaN/NaT) if case was not orally argued.
- Values: Date object or NaT.
- Use: Crucial starting point for calculating Argument-to-Decision duration. Can derive features like argument_month. Potential inaccuracies exist (See Guide Section V).

dateRearg: (Chronological)
- Description: The date of the first day of reargument, if held. Missing (NaN/NaT) if no reargument occurred.
- Values: Date object or NaT. Very high percentage of missing values.
- Use: Primarily used to engineer the 'had_reargument' binary flag feature, indicating case complexity/uncertainty.

petitioner / respondent: (Background)
- Description: Identifies the type of party petitioning the Court / responding.
- Values: Hundreds of numeric codes. See Codebook Appendix 10.
- Use: Potential feature. *Simplification into broad groups (e.g., 'Business', 'Individual', 'US Govt', 'State Govt') is essential.*

petitionerState / respondentState: (Background)
- Description: Identifies the state associated with the petitioner/respondent, if applicable.
- Values: Numeric state codes (FIPS codes). See Codebook Appendix 11. High number of NaNs.
- Use: Potential geographic feature, especially for state actors. Treat as categorical. Consider impact of missing values.
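
One way to handle the heavy missingness (petitionerState is missing in 8,657 of 10,783 rows, about 80%) is to treat "no state" as its own category instead of imputing a value. The sketch below assumes that choice; the helper name `state_as_category`, the label "NoState", and the code values are made up for illustration.

```python
import pandas as pd

def state_as_category(codes, missing_label="NoState"):
    """Convert numeric state codes to string labels with an explicit missing level."""
    labels = codes.map(lambda v: missing_label if pd.isna(v) else str(int(v)))
    return labels.astype("category")

toy_states = pd.Series([6.0, None, 35.0, None])  # made-up code values
petitioner_state_cat = state_as_category(toy_states)
print(petitioner_state_cat.tolist())  # ['6', 'NoState', '35', 'NoState']
```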

jurisdiction: (Background)
- Description: Code for how the case reached the Supreme Court.
- Values: Numeric codes (1=Certiorari, 2=Appeal, 3=Original, etc.). See Codebook Appendix 2.
- Use: Potential feature (different paths may have different processing). Treat as categorical. *Simplification recommended.*

adminAction: (Background)
- Description: Code identifying if the case reviewed a federal administrative agency action, and which agency. 0=Not applicable.
- Values: Numeric codes. See Codebook Appendix 6. High number of NaNs (or 0 values).
- Use: Potential feature. Treat as categorical. *Simplification (e.g., binary flag 'IsAdminAction', or grouping agencies) recommended.*
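
The binary flag mentioned above could be built as in this sketch, which treats both NaN and 0 as "no administrative action". The helper name `admin_action_flag` and the code values in the toy series are assumptions for illustration.

```python
import pandas as pd

def admin_action_flag(admin_action):
    """Return 1 where an agency code is present and non-zero, else 0."""
    return (admin_action.notna() & (admin_action != 0)).astype(int)

toy_admin = pd.Series([None, 0.0, 117.0, 66.0])  # made-up code values
is_admin_action = admin_action_flag(toy_admin)
print(is_admin_action.tolist())  # [0, 0, 1, 1]
```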

adminActionState: (Background)
- Description: State associated with the administrative action, if applicable.
- Values: Numeric state codes. Very high number of NaNs.
- Use: Limited use due to high missingness. Potentially a feature if imputed/handled carefully. Treat as categorical.

threeJudgeFdc: (Background)
- Description: Flag indicating if a three-judge Federal District Court was involved.
- Values: 0=No, 1=Yes.
- Use: Potential feature indicating specific case types. Treat as categorical or binary numeric.

caseOrigin: (Background)
- Description: Code for the specific court/body where the case originated before appeals.
- Values: Hundreds of numeric codes. See Codebook Appendix 5. Some NaNs possible.
- Use: Potential feature. *Simplification (grouping by type/level/region) is essential.*

caseOriginState: (Background)
- Description: State associated with the originating court/body.
- Values: Numeric state codes. High number of NaNs.
- Use: Potential geographic feature. Treat as categorical. Consider impact of missing values.

caseSource: (Background)
- Description: Code for the court whose decision the SC is directly reviewing.
- Values: Hundreds of numeric codes. See Codebook Appendix 4. Some NaNs possible.
- Use: Potential feature indicating case posture/context. *Simplification (grouping by type/level/circuit) is essential.*

caseSourceState: (Background)
- Description: State associated with the source court.
- Values: Numeric state codes. High number of NaNs.
- Use: Potential geographic feature. Treat as categorical. Consider impact of missing values.

lcDisagreement: (Background)
- Description: Flag indicating explicit disagreement among lower federal courts.
- Values: 0=No, 1=Yes.
- Use: Potential feature indicating complexity/reason for grant. Treat as categorical or binary numeric.

certReason: (Background)
- Description: Code(s) for the Court's stated reason for granting review.
- Values: Numeric codes (1=Fed conflict, 4=Important fed question, etc.). See Codebook Appendix 7. Some NaNs possible.
- Use: Potential feature indicating perceived importance/reason for grant. Treat as categorical. *Simplification may be useful.*

lcDisposition: (Background)
- Description: Code for the lower court's disposition (outcome).
- Values: Numeric codes (2=Affirmed, 3=Reversed, etc.). See Codebook Appendix 8. Some NaNs possible.
- Use: Potential feature indicating case posture. Treat as categorical. *Simplification may be useful.*

lcDispositionDirection: (Background)
- Description: Ideological direction assigned to the lower court's disposition.
- Values: 1=Conservative, 2=Liberal, 3=Unspecifiable. Some NaNs possible.
- Use: Potential feature indicating ideological context. Treat as categorical.

--- Outcome Variables (Leakage for Duration Prediction) ---

declarationUncon: (Outcome)
- Description: Flag indicating if the SC declared a law/action unconstitutional.
- Values: Numeric codes (0=No, 1=Yes-Fed, 2=Yes-State, 3=Yes-Local).
- Use: Outcome variable. **Leakage Variable** - cannot be used as predictor for duration.

caseDisposition: (Outcome)
- Description: Code for how the SC ultimately disposed of the case.
- Values: Numeric codes (1=Stay, 2=Affirmed, 3=Reversed, 5=Vacated/Remanded, 6=Affirmed/Reversed in part, etc.). See Codebook Appendix 12. Some ambiguity (e.g., DIGs).
- Use: Outcome variable. **Leakage Variable**.

caseDispositionUnusual: (Outcome)
- Description: Flag for unusual case dispositions.
- Values: 0=No, 1=Yes.
- Use: Outcome characteristic. **Leakage Variable**.

partyWinning: (Outcome)
- Description: Flag indicating if the petitioner won (vs. respondent).
- Values: 0=Respondent won, 1=Petitioner won, NA=Unclear/Other.
- Use: Outcome variable. **Leakage Variable**.

precedentAlteration: (Outcome)
- Description: Flag indicating if the decision formally altered existing SC precedent.
- Values: 0=No, 1=Yes.
- Use: Outcome characteristic. **Leakage Variable**.

voteUnclear: (Voting/Opinion)
- Description: Flag indicating if the voting alignment was unclear.
- Values: 0=Clear, 1=Unclear.
- Use: Data quality flag related to outcome/voting. Determined at/after decision, potential **Leakage Variable**.

issue: (Substantive)
- Description: Code for the specific legal issue within the broader issueArea.
- Values: Many numeric codes, nested under issueArea. See Codebook section on Issues. Some NaNs possible.
- Use: Potential feature (more granular than issueArea). Treat as categorical. High cardinality may require careful handling or using only issueArea.

issueArea: (Substantive)
- Description: Broad subject matter category of the legal issue.
- Values: Numeric codes (1=CrimPro, 2=CivRts, 8=Econ, etc.). See Codebook Appendix 3. Some NaNs possible.
- Use: Key substantive feature. Treat as categorical. *Mapping to names recommended.*
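
Mapping codes to names can be sketched as below. Only the three codes spelled out in this entry are included; the remaining names must come from the Codebook and are deliberately not filled in here. The helper name `name_issue_area` and the fallback labels ("Code_N", "Missing") are assumptions for illustration.

```python
import pandas as pd

# Only the codes listed in this document; extend from the Codebook
ISSUE_AREA_NAMES = {1: "CrimPro", 2: "CivRts", 8: "Econ"}

def name_issue_area(codes):
    """Replace numeric issueArea codes with readable labels."""
    def label(value):
        if pd.isna(value):
            return "Missing"
        # Unmapped codes keep a visible placeholder instead of being dropped
        return ISSUE_AREA_NAMES.get(int(value), f"Code_{int(value)}")
    return codes.map(label)

toy_issue_area = pd.Series([8.0, 1.0, 2.0, 9.0, None])
issue_area_name = name_issue_area(toy_issue_area)
print(issue_area_name.tolist())
# ['Econ', 'CrimPro', 'CivRts', 'Code_9', 'Missing']
```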

decisionDirection: (Outcome)
- Description: Ideological direction assigned to the SC's decision.
- Values: 1=Conservative, 2=Liberal, 3=Unspecifiable. Some NaNs possible.
- Use: Common target variable for *outcome* prediction. **Leakage Variable** for duration prediction.

decisionDirectionDissent: (Outcome)
- Description: Ideological direction assigned to the primary dissent.
- Values: 1=Conservative, 2=Liberal, 3=Unspecifiable. High number of NaNs (no dissent).
- Use: Outcome characteristic. **Leakage Variable**.

authorityDecision1 / authorityDecision2: (Outcome)
- Description: Codes for the primary/secondary legal authority the Court relied upon.
- Values: Numeric codes (1=Conflict, 2=Federal Con. interp, 3=Federal Statute interp, etc.). See Codebook. High NaNs for authorityDecision2.
- Use: Outcome characteristic. **Leakage Variable**.

lawType: (Substantive)
- Description: Code for the type of law or action under review (e.g., statute, constitution, regulation).
- Values: Numeric codes. See Codebook Appendix 9. Some NaNs possible.
- Use: Potential feature indicating legal basis. Treat as categorical. *Simplification might be useful.*

lawSupp: (Substantive)
- Description: Code providing supplemental detail about the law under review (e.g., specific amendment, act name category).
- Values: Numeric codes. See Codebook Appendix 9. Some NaNs possible.
- Use: Potential feature (more detail than lawType). Treat as categorical. High cardinality.

lawMinor: (Substantive)
- Description: Free text field intended for minor legal points or specific statute sections.
- Values: String. Very high number of NaNs.
- Use: Generally **not usable** for ML due to inconsistency, typos, and high missingness (See Guide Section V). Usually dropped.

majOpinWriter: (Voting/Opinion)
- Description: Code identifying the justice who wrote the majority/plurality opinion.
- Values: Numeric justice codes (e.g., 78=Black, 111=Roberts; verify against the Codebook Justice List). Some NaNs possible (per curiam).
- Use: Outcome characteristic. **Leakage Variable**. Requires Justice-centered data or aggregation for use as feature in outcome prediction.

majOpinAssigner: (Voting/Opinion)
- Description: Code identifying the justice who assigned the majority opinion (Chief Justice or senior justice in majority).
- Values: Numeric justice codes. Some NaNs possible.
- Use: Outcome characteristic. **Leakage Variable**.

splitVote: (Voting/Opinion)
- Description: Flag indicating if the case involved multiple distinct voting alignments on different aspects of the same issue/legal provision.
- Values: Numeric codes (0=No split, 1=Vote info pertains to 1st vote, 2=Vote info pertains to 2nd vote). See Codebook.
- Use: Indicator of high voting complexity. If engineered into a simple flag ('had_split_vote'), potentially usable as a pre-decision complexity feature, but the raw code itself describes the outcome voting. Treat with caution regarding leakage.

majVotes: (Voting/Opinion)
- Description: Number of justices voting in the majority coalition.
- Values: Integer (e.g., 5, 6, 9).
- Use: Outcome characteristic (vote margin). **Leakage Variable**.

minVotes: (Voting/Opinion)
- Description: Number of justices voting in the primary minority coalition (dissent).
- Values: Integer (e.g., 4, 3, 0).
- Use: Outcome characteristic (vote margin). **Leakage Variable**.

# **2. Coding Part**

In [43]:
# import os 

# os.getcwd() 

In [44]:
import os 

os.listdir()

['.git',
 '.gitattributes',
 'data',
 'SCDB_2024_01_caseCentered_Docket.csv.zip',
 'SCDB_2024_01_caseCentered_Vote.csv.zip',
 'SCDB_2024_01_codebook.pdf',
 'DataCleaning.ipynb',
 'variable_description.pdf',
 'Untitled-1 - Copy.ipynb']

In [45]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt # Useful for plotting later

file_path = 'data/SCDB_2024_01_caseCentered_Docket.csv'

# Try reading with different encodings if it fails
try:
    df = pd.read_csv(file_path)
except UnicodeDecodeError:
    # ISO-8859-1 (or latin-1) is common for older datasets
    df = pd.read_csv(file_path, encoding='ISO-8859-1')
except FileNotFoundError:
    print(f"Error: File not found at {file_path}")
    raise  # Re-raise: df would be undefined in the lines below
except Exception as e:
    print(f"Error loading file: {e}")
    raise  # Re-raise rather than continue without df

print(f"Data loaded successfully. Shape: {df.shape}")
print("\nFirst 5 rows of data:")
display(df.head())

print("\nData Information:")
df.info()

print("\nMissing values per column:")
display(df.isnull().sum())

Data loaded successfully. Shape: (10783, 53)

First 5 rows of data:


| | caseId | docketId | caseIssuesId | voteId | dateDecision | decisionType | usCite | sctCite | ledCite | lexisCite | ... | authorityDecision1 | authorityDecision2 | lawType | lawSupp | lawMinor | majOpinWriter | majOpinAssigner | splitVote | majVotes | minVotes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1946-001 | 1946-001-01 | 1946-001-01-01 | 1946-001-01-01-01 | 11/18/1946 | 1 | 329 U.S. 1 | 67 S. Ct. 6 | 91 L. Ed. 3 | 1946 U.S. LEXIS 1724 | ... | 4.0 | NaN | 6.0 | 600.0 | 35 U.S.C. § 33 | 78.0 | 78.0 | 1 | 8 | 1 |
| 1 | 1946-002 | 1946-002-01 | 1946-002-01-01 | 1946-002-01-01-01 | 11/18/1946 | 1 | 329 U.S. 14 | 67 S. Ct. 13 | 91 L. Ed. 12 | 1946 U.S. LEXIS 1725 | ... | 4.0 | NaN | 6.0 | 600.0 | 18 U.S.C. § 398 | 81.0 | 87.0 | 1 | 6 | 3 |
| 2 | 1946-002 | 1946-002-02 | 1946-002-02-01 | 1946-002-02-01-01 | 11/18/1946 | 1 | 329 U.S. 14 | 67 S. Ct. 13 | 91 L. Ed. 12 | 1946 U.S. LEXIS 1725 | ... | 4.0 | NaN | 6.0 | 600.0 | 18 U.S.C. § 398 | 81.0 | 87.0 | 1 | 6 | 3 |
| 3 | 1946-002 | 1946-002-03 | 1946-002-03-01 | 1946-002-03-01-01 | 11/18/1946 | 1 | 329 U.S. 14 | 67 S. Ct. 13 | 91 L. Ed. 12 | 1946 U.S. LEXIS 1725 | ... | 4.0 | NaN | 6.0 | 600.0 | 18 U.S.C. § 398 | 81.0 | 87.0 | 1 | 6 | 3 |
| 4 | 1946-002 | 1946-002-04 | 1946-002-04-01 | 1946-002-04-01-01 | 11/18/1946 | 1 | 329 U.S. 14 | 67 S. Ct. 13 | 91 L. Ed. 12 | 1946 U.S. LEXIS 1725 | ... | 4.0 | NaN | 6.0 | 600.0 | 18 U.S.C. § 398 | 81.0 | 87.0 | 1 | 6 | 3 |



Data Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10783 entries, 0 to 10782
Data columns (total 53 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   caseId                    10783 non-null  object 
 1   docketId                  10783 non-null  object 
 2   caseIssuesId              10783 non-null  object 
 3   voteId                    10783 non-null  object 
 4   dateDecision              10783 non-null  object 
 5   decisionType              10783 non-null  int64  
 6   usCite                    10282 non-null  object 
 7   sctCite                   10779 non-null  object 
 8   ledCite                   10777 non-null  object 
 9   lexisCite                 10783 non-null  object 
 10  term                      10783 non-null  int64  
 11  naturalCourt              10783 non-null  int64  
 12  chief                     10783 non-null  object 
 13  docket                    10754 non-null  

caseId                          0
docketId                        0
caseIssuesId                    0
voteId                          0
dateDecision                    0
decisionType                    0
usCite                        501
sctCite                         4
ledCite                         6
lexisCite                       0
term                            0
naturalCourt                    0
chief                           0
docket                         29
caseName                        0
dateArgument                 1249
dateRearg                   10552
petitioner                      3
petitionerState              8657
respondent                      6
respondentState              7854
jurisdiction                    3
adminAction                  7632
adminActionState            10037
threeJudgeFdc                  23
caseOrigin                    430
caseOriginState              7969
caseSource                    266
caseSourceState              8395
lcDisagreement

In [46]:
df.columns

Index(['caseId', 'docketId', 'caseIssuesId', 'voteId', 'dateDecision',
       'decisionType', 'usCite', 'sctCite', 'ledCite', 'lexisCite', 'term',
       'naturalCourt', 'chief', 'docket', 'caseName', 'dateArgument',
       'dateRearg', 'petitioner', 'petitionerState', 'respondent',
       'respondentState', 'jurisdiction', 'adminAction', 'adminActionState',
       'threeJudgeFdc', 'caseOrigin', 'caseOriginState', 'caseSource',
       'caseSourceState', 'lcDisagreement', 'certReason', 'lcDisposition',
       'lcDispositionDirection', 'declarationUncon', 'caseDisposition',
       'caseDispositionUnusual', 'partyWinning', 'precedentAlteration',
       'voteUnclear', 'issue', 'issueArea', 'decisionDirection',
       'decisionDirectionDissent', 'authorityDecision1', 'authorityDecision2',
       'lawType', 'lawSupp', 'lawMinor', 'majOpinWriter', 'majOpinAssigner',
       'splitVote', 'majVotes', 'minVotes'],
      dtype='object')

In [47]:
date_cols = ['dateDecision', 'dateArgument', 'dateRearg']

print("\nConverting date columns...")
for col in date_cols:
    df[col] = pd.to_datetime(df[col], errors='coerce')

print("\nDate columns after conversion:")
df[['caseId'] + date_cols].info()
print(df[['caseId'] + date_cols].head())


Converting date columns...

Date columns after conversion:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10783 entries, 0 to 10782
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   caseId        10783 non-null  object        
 1   dateDecision  10783 non-null  datetime64[ns]
 2   dateArgument  9534 non-null   datetime64[ns]
 3   dateRearg     231 non-null    datetime64[ns]
dtypes: datetime64[ns](3), object(1)
memory usage: 337.1+ KB
     caseId dateDecision dateArgument  dateRearg
0  1946-001   1946-11-18   1946-01-09 1946-10-23
1  1946-002   1946-11-18   1945-10-10 1946-10-17
2  1946-002   1946-11-18   1945-10-10 1946-10-17
3  1946-002   1946-11-18   1945-10-10 1946-10-17
4  1946-002   1946-11-18   1945-10-10 1946-10-17


The three date columns converted above are defined as follows:

• dateDecision:  
    The date on which the Supreme Court announces its decision in a case. The time from oral argument to decision varies widely, depending on the complexity of the case and the nature of the legal issues.

• dateArgument:  
    The date on which the Court hears the initial oral argument. Each side presents its case and the Justices ask questions to clarify the issues before deliberating.

• dateRearg (dateReargument):  
    The date of a reargument session, an additional oral argument scheduled when the Court wants further argument after the initial one. Reargument is rare: only 231 of the 10,783 rows have a dateRearg value.

Simple Conclusions:
1. In the usual case, the Court moves from the initial argument to deliberation and announces its decision on dateDecision.  
2. In rare cases, the Court orders reargument, held on dateRearg, before it decides.  
3. In either case, the release of the opinion on dateDecision is the end point used to measure duration.
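
As a worked example, the dates shown in the conversion output above for caseId 1946-001 (argued 1946-01-09, reargued 1946-10-23, decided 1946-11-18) give very different durations depending on the starting point. The variable names below are illustrative, and whether reargued cases should be measured from the reargument date is a modeling choice this document does not make; the sketch only shows the arithmetic.

```python
from datetime import date

# Dates for caseId 1946-001, as printed in the date-conversion output
date_argument = date(1946, 1, 9)
date_rearg = date(1946, 10, 23)
date_decision = date(1946, 11, 18)

days_from_first_argument = (date_decision - date_argument).days
days_from_reargument = (date_decision - date_rearg).days

print(days_from_first_argument, days_from_reargument)  # 313 26
```

Measured from the first argument, as duration_days is defined here, this case takes 313 days; measured from reargument it takes 26. This is one reason a flag such as had_reargument may carry a lot of signal.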

In [50]:
import pandas as pd
import numpy as np

# import matplotlib.pyplot as plt
# import seaborn as sns

# 1. Calculate duration_days
try:
    df['duration_days'] = (df['dateDecision'] - df['dateArgument']).dt.days
    print("\nCalculated 'duration_days'.")
except KeyError as e:
    print(f"\nError calculating duration: Missing input column {e}. Cannot proceed.")
    raise  # Re-raise instead of exit(), which would stop the notebook kernel

# 2. Verify column creation and inputs (as before)
print("\nChecking if 'duration_days' column exists:")
if 'duration_days' in df.columns:
    print("- 'duration_days' column FOUND.")
else:
    raise RuntimeError("'duration_days' column NOT FOUND. Calculation likely failed.")
print("\nChecking input date columns used for duration:")
if 'dateDecision' in df.columns and 'dateArgument' in df.columns:
    if not pd.api.types.is_datetime64_any_dtype(df['dateDecision']) or \
       not pd.api.types.is_datetime64_any_dtype(df['dateArgument']):
        print("  WARNING: Input date columns are not datetime type!")
else:
    raise KeyError("'dateDecision' or 'dateArgument' column missing!")


Calculated 'duration_days'.

Checking if 'duration_days' column exists:
- 'duration_days' column FOUND.

Checking input date columns used for duration:


In [51]:
print("\nFiltering DataFrame based on 'duration_days'...")
initial_rows_before_filter = df.shape[0]

# 1. Remove Rows with Missing Duration (NaN) - NECESSARY FOR TARGET
df_argued = df.dropna(subset=['duration_days']).copy()
rows_after_nan_drop = df_argued.shape[0]
print(f"\n1. Dropped {initial_rows_before_filter - rows_after_nan_drop} rows due to missing 'duration_days' (NaN).")

# 2. Address Rows with Non-Positive Duration (<= 0) - Remove negatives, keep zeros (may remove later if needed but only 2 rows)
negative_durations = df_argued[df_argued['duration_days'] < 0]
zero_durations = df_argued[df_argued['duration_days'] == 0]
if not negative_durations.empty:
    num_negative = negative_durations.shape[0]
    print(f"- Found {num_negative} rows with strictly negative (< 0) duration. Excluding as errors...")
    df_argued = df_argued[df_argued['duration_days'] >= 0].copy() # Keep >= 0
    print(f"- Dropped {num_negative} rows with negative duration.")
else:
    print("- No negative (< 0) durations found.")
if not zero_durations.empty:
    print(f"- Found {zero_durations.shape[0]} rows with exactly zero (0) duration. Keeping them for now.")
    # Optional: uncomment below to remove zeros if needed later
    # df_argued = df_argued[df_argued['duration_days'] > 0].copy()
else:
    print("- No zero (0) durations found.")

final_rows = df_argued.shape[0]
print(f"\nFinal DataFrame 'df_argued' for modeling has {final_rows} rows.")


Filtering DataFrame based on 'duration_days'...

1. Dropped 1249 rows due to missing 'duration_days' (NaN).
- No negative (< 0) durations found.
- Found 2 rows with exactly zero (0) duration. Keeping them for now.

Final DataFrame 'df_argued' for modeling has 9534 rows.


In [53]:
df_features = df_argued.copy()
print(f"\nStarting Feature Engineering on 'df_features' (shape: {df_features.shape})...")

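# Engineer temporal features
# NOTE: decision_year and decision_month derive from dateDecision, the end point of
# the target, so they must be excluded from the feature matrix X (leakage).
# argument_month is known before the decision and is safe to keep.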
df_features['decision_year'] = df_features['dateDecision'].dt.year
df_features['decision_month'] = df_features['dateDecision'].dt.month
df_features['argument_month'] = df_features['dateArgument'].dt.month
print("- Engineered temporal features (decision_year/month, argument_month).")

# Engineer complexity features (KEEP)
docket_counts = df_features.groupby('caseId')['docketId'].transform('count')
df_features['num_dockets_in_case'] = docket_counts
df_features['had_reargument'] = df_features['dateRearg'].notna().astype(int)
print("- Engineered complexity features (num_dockets_in_case, had_reargument).")

df_features['had_reargument'].value_counts()


Starting Feature Engineering on 'df_features' (shape: (9534, 54))...
- Engineered temporal features (decision_year/month, argument_month).
- Engineered complexity features (num_dockets_in_case, had_reargument).


had_reargument
0    9303
1     231
Name: count, dtype: int64
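
As a small illustration of how `num_dockets_in_case` behaves, the sketch below applies the same `groupby(...).transform('count')` pattern to a toy frame. The case and docket IDs are made up for the example and are not SCDB values.

```python
import pandas as pd

# Toy docket-level rows (hypothetical IDs): case 'A' is consolidated from three dockets.
toy = pd.DataFrame({
    'caseId':   ['A', 'A', 'A', 'B', 'C', 'C'],
    'docketId': ['A-01', 'A-02', 'A-03', 'B-01', 'C-01', 'C-02'],
})

# transform('count') returns one value per ROW, aligned to the original index,
# so every docket of a case carries the size of its case.
toy['num_dockets_in_case'] = toy.groupby('caseId')['docketId'].transform('count')

# Collapsing to one row per case shows how many cases are consolidated.
case_level = toy.drop_duplicates('caseId')[['caseId', 'num_dockets_in_case']]
print(toy)
print(case_level)
```

Because the count is taken on the already filtered frame, dockets removed during filtering are not counted. Rows of the same case also share the same dates, so a consolidated case appears several times with an identical target; it may be worth checking how much of the data that affects.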

In [None]:
import numpy as np
import re # Make sure regex library is imported

# This section helps understand the different formats present before categorizing.
print("\n--- Inspecting raw 'docket' column before engineering category ---")
if 'docket' in df_features.columns:
    # Create a clean string series, dropping NaNs first for inspection
    docket_series_inspect = df_features['docket'].dropna().astype(str)

    # 1. Identify dockets NOT strictly numeric
    not_strictly_numeric_mask_inspect = ~docket_series_inspect.str.match(r'^[0-9]+$', na=False)
    non_numeric_dockets_inspect = docket_series_inspect[not_strictly_numeric_mask_inspect]

    print(f"\nFound {len(non_numeric_dockets_inspect)} non-numeric docket entries out of {len(docket_series_inspect)} non-NaN entries.")
    print("Examples of non-numeric docket entries (first 20):")
    print(non_numeric_dockets_inspect.head(20))
    print("\nUnique non-numeric docket values (first 50 examples):")
    try:
        unique_non_numeric_inspect = non_numeric_dockets_inspect.unique()
        print(unique_non_numeric_inspect[:50])
    except Exception as e:
        print(f"Could not display unique values: {e}")
    # Clean up inspection variable
    del docket_series_inspect, not_strictly_numeric_mask_inspect, non_numeric_dockets_inspect
else:
    print("\n'docket' column not found, skipping inspection.")



# This section creates the 'docket_category' based on inspection findings.
print("\n- Engineering refined 'docket_category' feature...")
if 'docket' in df_features.columns:
    # 1. Create temporary lowercase string column; fill NaNs BEFORE casting,
    #    because astype(str) would turn NaN into the literal string 'nan'
    df_features['docket_str_temp'] = df_features['docket'].fillna('').astype(str).str.lower()

    # 2. Define conditions using plain substring checks (regex=False)
    conditions = [
        # Condition 1: Look for 'orig' ANYWHERE in the lowercase string
        df_features['docket_str_temp'].str.contains('orig', regex=False, na=False), # regex=False for simple substring check

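        # Condition 2: Look for 'm' ANYWHERE in the lowercase string
        # This will catch '133m', '202 m', 'misc', 'm term', etc. Strings that also contain
        # 'orig' are already claimed by Condition 1, because np.select uses the first match.
        # Warning: This is very broad. Any docket containing the letter 'm' becomes 'Miscellaneous'.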
        df_features['docket_str_temp'].str.contains('m', regex=False, na=False) # regex=False for simple substring check
    ]
    # Define the corresponding categories
    categories = [
        'Original',         # Category for condition 1
        'Miscellaneous'     # Category for condition 2
    ]
    # Default category if none of the above match
    default_category = 'Merits/Other' 

    # 3. Use numpy.select to apply conditions and create the new column
    df_features['docket_category'] = np.select(conditions, categories, default=default_category)

    # 4. Drop the temporary string column
    df_features = df_features.drop('docket_str_temp', axis=1)

    print("  Example counts for refined 'docket_category':")
    # Check distribution to see if 'Miscellaneous' is now captured
    print(df_features['docket_category'].value_counts(dropna=False))

else:
    print("  'docket' column not found, cannot engineer 'docket_category'.")
    # Create a default column if 'docket' was missing entirely
    df_features['docket_category'] = 'Unknown'

print("\n--- Docket Category Engineering Step Complete ---")


--- Inspecting raw 'docket' column before engineering category ---

Found 6141 non-numeric docket entries out of 9522 non-NaN entries.
Examples of non-numeric docket entries (first 20):
184                 133M
239                 325M
350                   1M
440                 206M
441                 233M
442                 269M
450                 265M
451                 106M
452                  47M
453                 374M
454                 184M
455                 109M
456                 372M
631                   2M
634     No. 12, Original
635         13, Original
826                 159M
830                  15M
1038                ORIG
1096                ORIG
Name: docket, dtype: object

Unique non-numeric docket values (first 50 examples):
['133M' '325M' '1M' '206M' '233M' '269M' '265M' '106M' '47M' '374M' '184M'
 '109M' '372M' '2M' 'No. 12, Original' '13, Original' '159M' '15M'
 '   ORIG' 'ORIG' '11 ORIG' '10 ORIG' '13 ORIG' '202 M' '12 ORIG' '1 M'
 '15 ORIG' '2 OR

In [67]:
df_features['docket']

0            24
1            12
2            17
3            14
4            19
          ...  
10778    23-726
10779    23-727
10780    23-939
10781    22-277
10782    22-555
Name: docket, Length: 9534, dtype: object

In [65]:
import numpy as np
import re # Make sure regex library is imported

# Assume 'df_features' is your DataFrame after filtering (df_argued.copy())
# and potentially with other features already engineered.

# --- Broad Docket Category Feature Engineering ---
print("\n- Engineering BROAD 'docket_category' feature...")
if 'docket' in df_features.columns:
    # 1. Create temporary lowercase string column; fill NaNs BEFORE casting,
    #    because astype(str) would turn NaN into the literal string 'nan'.
    #    Lowercasing handles case-insensitivity for the simple checks below.
    df_features['docket_str_temp'] = df_features['docket'].fillna('').astype(str).str.lower()

    # 2. Define conditions using broad 'contains' checks
    conditions = [
        # Condition 1: Look for 'orig' ANYWHERE in the lowercase string
        df_features['docket_str_temp'].str.contains('orig', regex=False, na=False), # regex=False for simple substring check

        # Condition 2: Look for 'm' ANYWHERE in the lowercase string
        # This will catch '133m', '202 m', 'misc', 'm term', etc. (anything with 'orig' is
        # already labeled 'Original' by Condition 1).
        # Warning: This is very broad. Any docket containing the letter 'm' becomes 'Miscellaneous'.
        df_features['docket_str_temp'].str.contains('m', regex=False, na=False) # regex=False for simple substring check
    ]
    # Define the corresponding categories
    categories = [
        'Original',         # Category for condition 1
        'Miscellaneous'     # Category for condition 2 (will catch anything with 'm' not already 'Original')
    ]
    # Default category if none of the above match
    default_category = 'Merits/Other'

    # 3. Use numpy.select to apply conditions efficiently
    #    np.select applies the conditions in order. If 'orig' is found, it's 'Original'.
    #    Only if 'orig' is NOT found, it checks for 'm'. If found, it's 'Miscellaneous'.
    #    Otherwise, it's 'Merits/Other'.
    df_features['docket_category'] = np.select(conditions, categories, default=default_category)

    # 4. Drop the temporary string column
    df_features = df_features.drop('docket_str_temp', axis=1)

    print("  Example counts for BROAD 'docket_category':")
    # Check distribution again - Expect 'Miscellaneous' count to increase significantly
    print(df_features['docket_category'].value_counts(dropna=False))

    # --- Optional Verification: Spot-check classifications ---
    # Uncomment these lines to verify the broader categorization
    # print("\nSpot-checking 'Original' cases:")
    # print(df_features[df_features['docket_category'] == 'Original'][['docket', 'docket_category']].head())
    # print("\nSpot-checking 'Miscellaneous' cases (Check for unexpected matches):")
    # print(df_features[df_features['docket_category'] == 'Miscellaneous'][['docket', 'docket_category']].head(20)) # Show more
    # print("\nSpot-checking 'Merits/Other' cases (Should NOT contain 'orig' or 'm'):")
    # print(df_features[df_features['docket_category'] == 'Merits/Other'][['docket', 'docket_category']].sample(10, random_state=4))


else:
    print("  'docket' column not found, cannot engineer 'docket_category'.")
    # Create a default column if 'docket' was missing entirely
    df_features['docket_category'] = 'Unknown'

print("\n--- BROAD Docket Category Engineering Step Complete ---")
# df_features now contains the broadly defined 'docket_category' column.
# You can proceed with feature selection. Remember this categorization is less precise.


- Engineering BROAD 'docket_category' feature...
  Example counts for BROAD 'docket_category':
docket_category
Merits/Other     9409
Original          103
Miscellaneous      22
Name: count, dtype: int64

--- BROAD Docket Category Engineering Step Complete ---
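
The broad `'m'` substring check is flagged above as imprecise. A tighter alternative can be sketched as follows, assuming that miscellaneous dockets look like a number followed by `M` (as in `133M` or `202 M` from the inspection output). The helper name `categorize_dockets` and the exact pattern are assumptions for illustration, not the notebook's method, and the pattern should be checked against the full list of unique docket strings before use.

```python
import numpy as np
import pandas as pd

def categorize_dockets(docket: pd.Series) -> pd.Series:
    """Label dockets as Original / Miscellaneous / Merits/Other (hypothetical helper)."""
    # Fill NaN before casting so missing values become '' rather than 'nan'.
    s = docket.fillna('').astype(str).str.strip().str.lower()

    # Original jurisdiction: same substring rule as the notebook.
    is_original = s.str.contains('orig', regex=False)

    # Miscellaneous (assumed format): digits, optional space, then 'm' or 'misc',
    # optionally followed by a period, and nothing else.
    is_misc = s.str.fullmatch(r'\d+\s*m(?:isc)?\.?')

    labels = np.select(
        [is_original, is_misc],
        ['Original', 'Miscellaneous'],
        default='Merits/Other',
    )
    return pd.Series(labels, index=docket.index, name='docket_category')

# Values taken from the inspection output, plus one made-up string ('Mandamus 5')
# that the broad 'm' rule would label Miscellaneous.
sample = pd.Series(['133M', 'No. 12, Original', '   ORIG', '202 M',
                    '23-726', '24', None, 'Mandamus 5'])
result = categorize_dockets(sample)
print(pd.DataFrame({'docket': sample, 'docket_category': result}))
```

Comparing `value_counts()` from this stricter rule with the broad rule would show whether any of the 22 'Miscellaneous' rows were matched only because of a stray letter.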


In [68]:
# --- Step 7 Continued: Select Final Features and Target ---

# Define the target variable 'y'
y = df_features['duration_days']
print(f"\nTarget variable 'y' (duration_days) defined. Length: {len(y)}")

# Define lists of columns to EXCLUDE from features 'X'

# 1. Leakage Variables (Determined at/after decision)
leakage_vars = [
    'decisionType', 'declarationUncon', 'caseDisposition',
    'caseDispositionUnusual', 'partyWinning', 'precedentAlteration',
    'voteUnclear', 'decisionDirection', 'decisionDirectionDissent',
    'authorityDecision1', 'authorityDecision2', 'majOpinWriter',
    'majOpinAssigner', 'splitVote', 'majVotes', 'minVotes'
]
print(f"\nExcluding {len(leakage_vars)} potential leakage variables.")

# 2. Identifiers and Reference Information
identifier_vars = [
    'caseId', 'docketId', 'caseIssuesId', 'voteId', # Core IDs
    'usCite', 'sctCite', 'ledCite', 'lexisCite', # Citations
    'docket', # Raw docket string (used to engineer docket_category)
    'caseName' # Case name string
]
# Check which identifiers actually exist in df_features to avoid errors
identifier_vars = [col for col in identifier_vars if col in df_features.columns]
print(f"Excluding {len(identifier_vars)} identifier variables.")

# 3. Raw Date Columns (Info extracted into features/target)
raw_date_vars = ['dateDecision', 'dateArgument', 'dateRearg']
print(f"Excluding {len(raw_date_vars)} raw date variables (info extracted).")

# 4. Target Variable Itself
target_var = ['duration_days']

# 5. High Missing/Cautious Use Variables (Optional - initially exclude states)
#    Decide whether to keep or drop these based on NaN analysis and imputation strategy
high_nan_vars_to_drop = [
     'petitionerState', 'respondentState', 'adminActionState',
     'caseOriginState', 'caseSourceState'
     # Add others like 'lawSupp' or 'adminAction' if dropping them
]
high_nan_vars_to_drop = [col for col in high_nan_vars_to_drop if col in df_features.columns]
print(f"Excluding {len(high_nan_vars_to_drop)} variables with very high NaNs (e.g., states).")


# Combine all columns to exclude
columns_to_exclude = leakage_vars + identifier_vars + raw_date_vars + target_var + high_nan_vars_to_drop
# Remove duplicates just in case
columns_to_exclude = list(set(columns_to_exclude))

# Define final feature_columns list
all_columns = df_features.columns.tolist()
feature_columns = [col for col in all_columns if col not in columns_to_exclude]

print(f"\nFinal selected feature columns for X ({len(feature_columns)}):")
print(feature_columns)

# Define the final feature matrix 'X'
# Final check if all selected feature columns exist
missing_cols = [col for col in feature_columns if col not in df_features.columns]
if missing_cols:
    # Raise instead of calling exit(), which would shut down the notebook kernel
    raise KeyError(f"The following selected feature columns are missing: {missing_cols}")

X = df_features[feature_columns].copy()
print(f"\nFeature matrix 'X' defined. Shape: {X.shape}")
print("First 5 rows of feature matrix 'X':")
print(X.head())

print("\n--- Data Filtering and Feature Engineering Complete ---")
print("You now have 'X' (features) and 'y' (target) ready for splitting and preprocessing.")


Target variable 'y' (duration_days) defined. Length: 9534

Excluding 16 potential leakage variables.
Excluding 10 identifier variables.
Excluding 3 raw date variables (info extracted).
Excluding 5 variables with very high NaNs (e.g., states).

Final selected feature columns for X (20):
['term', 'naturalCourt', 'chief', 'petitioner', 'respondent', 'jurisdiction', 'adminAction', 'threeJudgeFdc', 'caseOrigin', 'caseSource', 'lcDisagreement', 'certReason', 'lcDisposition', 'lcDispositionDirection', 'issue', 'issueArea', 'lawType', 'lawSupp', 'lawMinor', 'docket_category']

Feature matrix 'X' defined. Shape: (9534, 20)
First 5 rows of feature matrix 'X':
   term  naturalCourt   chief  petitioner  respondent  jurisdiction  \
0  1946          1301  Vinson       198.0       172.0           6.0   
1  1946          1301  Vinson       100.0        27.0           1.0   
2  1946          1301  Vinson       100.0        27.0           1.0   
3  1946          1301  Vinson       100.0        27.0    

In [69]:
import pandas as pd
import numpy as np
import re # For regex if needed later, though broad approach uses simple contains

# --- Assume 'df' exists: Loaded from CSV (e.g., SCDB_2024_01_caseCentered_Docket.csv) ---
# --- Assume date columns ('dateDecision', 'dateArgument', 'dateRearg') are converted to datetime ---
print("Starting Data Preparation and Feature Engineering...")
print(f"Initial DataFrame 'df' shape: {df.shape}")

# --- Step 1: Calculate Duration and Verify ---

try:
    df['duration_days'] = (df['dateDecision'] - df['dateArgument']).dt.days
    print("\nCalculated 'duration_days'.")
except KeyError as e:
    # Raise instead of calling exit(), which would shut down the notebook kernel
    raise KeyError(f"Cannot calculate duration: missing input column {e}.") from e

# The three columns are guaranteed to exist once the calculation above succeeds,
# so only the dtype of the inputs still needs checking.
print("\nVerifying 'duration_days' column and inputs...")
if not pd.api.types.is_datetime64_any_dtype(df['dateDecision']) or \
   not pd.api.types.is_datetime64_any_dtype(df['dateArgument']):
    print("WARNING: Input date columns are not datetime type! Check conversion.")
else:
    print("- Duration calculation checks passed.")


# --- Step 2: Filter Data for Modeling Based on Target Variable ---
print("\nFiltering DataFrame based on 'duration_days' for modeling...")
initial_rows_before_filter = df.shape[0]

# 1. Remove Rows with Missing Duration (NaN)
#    - Rationale: Necessary as the target variable cannot be calculated for these rows
#      (typically cases decided without oral argument). Models require a valid target.
df_argued = df.dropna(subset=['duration_days']).copy()
rows_after_nan_drop = df_argued.shape[0]
print(f"\n1. Dropped {initial_rows_before_filter - rows_after_nan_drop} rows due to missing 'duration_days' (NaN).")

# 2. Address Rows with Non-Positive Duration (<= 0)
#    - Rationale: Negative durations indicate data errors. Zero durations are atypical
#      for argued cases and may also indicate errors or processes outside the norm.
#      We remove negatives; keeping zeros is optional but flagged.
negative_durations = df_argued[df_argued['duration_days'] < 0]
zero_durations = df_argued[df_argued['duration_days'] == 0]
if not negative_durations.empty:
    num_negative = negative_durations.shape[0]
    print(f"- Found {num_negative} rows with negative (< 0) duration. Excluding as errors...")
    df_argued = df_argued[df_argued['duration_days'] >= 0].copy() # Keep >= 0
    print(f"- Dropped {num_negative} rows with negative duration.")
else:
    print("- No negative (< 0) durations found.")
if not zero_durations.empty:
    print(f"- Found {zero_durations.shape[0]} rows with zero (0) duration. Kept for now.")
else:
    print("- No zero (0) durations found.")

final_rows = df_argued.shape[0]
print(f"\nFinal DataFrame 'df_argued' for modeling has {final_rows} rows.")

# --- Step 3 & 4: Feature Engineering ---
df_features = df_argued.copy()
print(f"\nStarting Feature Engineering on 'df_features' (shape: {df_features.shape})...")

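# Engineer temporal features
# NOTE: decision_year and decision_month are derived from dateDecision, which is only known
# once the case is decided. By the leakage criterion applied in the feature selection step,
# they are candidates for exclusion from X; argument-date features do not have this problem.
df_features['decision_year'] = df_features['dateDecision'].dt.year # Capture long-term trends
df_features['decision_month'] = df_features['dateDecision'].dt.month # Capture seasonality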
df_features['argument_month'] = df_features['dateArgument'].dt.month # Capture seasonality
print("- Engineered temporal features (decision_year/month, argument_month).")

# Engineer complexity features
docket_counts = df_features.groupby('caseId')['docketId'].transform('count')
df_features['num_dockets_in_case'] = docket_counts # Proxy for consolidation complexity
df_features['had_reargument'] = df_features['dateRearg'].notna().astype(int) # Proxy for process complexity/uncertainty
print("- Engineered complexity features (num_dockets_in_case, had_reargument).")

# Engineer BROAD docket category
# Warning: Broad 'm' check might misclassify some edge cases.
print("- Engineering BROAD 'docket_category' feature...")
if 'docket' in df_features.columns:
    # Fill NaNs before casting: astype(str) would turn NaN into the literal string 'nan'
    df_features['docket_str_temp'] = df_features['docket'].fillna('').astype(str).str.lower()
    conditions = [
        df_features['docket_str_temp'].str.contains('orig', regex=False, na=False), # Original Jurisdiction
        df_features['docket_str_temp'].str.contains('m', regex=False, na=False)    # Miscellaneous (any 'm')
    ]
    categories = ['Original', 'Miscellaneous']
    default_category = 'Merits/Other' # Standard cases
    df_features['docket_category'] = np.select(conditions, categories, default=default_category)
    df_features = df_features.drop('docket_str_temp', axis=1)
    print("  Example counts for BROAD 'docket_category':")
    print(df_features['docket_category'].value_counts(dropna=False))
else:
    print("  'docket' column not found, cannot engineer 'docket_category'.")
    df_features['docket_category'] = 'Unknown'

# Drop poor quality column
if 'lawMinor' in df_features.columns:
    df_features = df_features.drop('lawMinor', axis=1)
    print("- Dropped 'lawMinor' column due to poor quality/high NaNs.")

# --- Placeholder for User Action: Simplify Other Categoricals ---
print("- Placeholder: Further categorical simplification (e.g., petitioner, issueArea) required using Codebook for better results.")


# --- Step 5 & 6: Select Final Features (X) and Target (y) ---

# Define the target variable 'y'
y = df_features['duration_days']
print(f"\nTarget variable 'y' (duration_days) defined. Shape: {y.shape}")

# Define lists of columns to EXCLUDE from features 'X' with explanations

# 1. Leakage Variables: Information determined AT or AFTER the decision date.
#    Cannot use these to predict duration ending on that date.
leakage_vars = [
    'decisionType', 'declarationUncon', 'caseDisposition',
    'caseDispositionUnusual', 'partyWinning', 'precedentAlteration',
    'voteUnclear', 'decisionDirection', 'decisionDirectionDissent',
    'authorityDecision1', 'authorityDecision2', 'majOpinWriter',
    'majOpinAssigner', 'splitVote', 'majVotes', 'minVotes'
]
print(f"\nExcluding {len(leakage_vars)} LEAKAGE variables (known only post-decision).")

# 2. Identifiers & Reference Info: Used for linking/reference, not direct prediction.
identifier_vars = [
    'caseId', 'docketId', 'caseIssuesId', 'voteId', # Core IDs
    'usCite', 'sctCite', 'ledCite', 'lexisCite', # Citations
    'docket', # Raw docket string (info captured in docket_category)
    'caseName' # Case name string
]
identifier_vars = [col for col in identifier_vars if col in df_features.columns] # Check which exist
print(f"Excluding {len(identifier_vars)} IDENTIFIER variables.")

# 3. Raw Date Columns: Information extracted into target or features.
raw_date_vars = ['dateDecision', 'dateArgument', 'dateRearg']
print(f"Excluding {len(raw_date_vars)} RAW DATE variables (info extracted).")

# 4. Target Variable Itself: Cannot use the target to predict itself.
target_var = ['duration_days']
print(f"Excluding 1 TARGET variable ('{target_var[0]}').")

# 5. High Missing/Cautious Use Variables: Dropping state variables due to high NaNs.
#    Requires robust imputation strategy if kept. Consider dropping others like adminAction too.
high_nan_vars_to_drop = [
     'petitionerState', 'respondentState', 'adminActionState',
     'caseOriginState', 'caseSourceState'
     # Consider adding 'adminAction', 'lawSupp' here if not planning imputation/simplification
]
high_nan_vars_to_drop = [col for col in high_nan_vars_to_drop if col in df_features.columns]
print(f"Excluding {len(high_nan_vars_to_drop)} HIGH NAN variables (e.g., states).")

# 6. Poor Quality Variables (Already dropped but good practice to list)
#    poor_quality_vars = ['lawMinor'] # Already dropped above


# Combine all columns to exclude
columns_to_exclude = list(set(
    leakage_vars + identifier_vars + raw_date_vars + target_var + high_nan_vars_to_drop
))
print(f"\nTotal columns to exclude from features: {len(columns_to_exclude)}")

# Define final feature_columns list
all_columns = df_features.columns.tolist()
feature_columns = [col for col in all_columns if col not in columns_to_exclude]

print(f"\nFinal selected feature columns for X ({len(feature_columns)}):")
# Sort for easier reading (optional)
feature_columns.sort()
print(feature_columns)

# Define the final feature matrix 'X'
missing_cols = [col for col in feature_columns if col not in df_features.columns]
if missing_cols:
    # Raise instead of calling exit(), which would shut down the notebook kernel
    raise KeyError(f"The following selected feature columns are missing: {missing_cols}")

X = df_features[feature_columns].copy()
print(f"\nFeature matrix 'X' defined. Shape: {X.shape}")
print("First 5 rows of feature matrix 'X':")
print(X.head())

print("\n--- Data Filtering and Feature Engineering Complete ---")
print("You now have 'X' (features) and 'y' (target) ready for splitting and preprocessing.")

Starting Data Preparation and Feature Engineering...
Initial DataFrame 'df' shape: (10783, 54)

Calculated 'duration_days'.

Verifying 'duration_days' column and inputs...
- Duration calculation checks passed.

Filtering DataFrame based on 'duration_days' for modeling...

1. Dropped 1249 rows due to missing 'duration_days' (NaN).
- No negative (< 0) durations found.
- Found 2 rows with zero (0) duration. Kept for now.

Final DataFrame 'df_argued' for modeling has 9534 rows.

Starting Feature Engineering on 'df_features' (shape: (9534, 54))...
- Engineered temporal features (decision_year/month, argument_month).
- Engineered complexity features (num_dockets_in_case, had_reargument).
- Engineering BROAD 'docket_category' feature...
  Example counts for BROAD 'docket_category':
docket_category
Merits/Other     9409
Original          103
Miscellaneous      22
Name: count, dtype: int64
- Dropped 'lawMinor' column due to poor quality/high NaNs.
- Placeholder: Further categorical simplificati
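
The month features above are meant to capture seasonality, but calendar months (1-12) do not follow the order of a Court term, which begins in October. One possible encoding, sketched here as an assumption rather than as part of the pipeline, maps each month to its position within an October-start term. The helper `term_month_index` and the toy dates are hypothetical.

```python
import pandas as pd

def term_month_index(month: pd.Series) -> pd.Series:
    """Map calendar month (1-12) to its position in an October-start term.

    October -> 0, November -> 1, ..., June -> 8, ..., September -> 11.
    """
    return (month - 10) % 12

# Toy argument dates (made up for the example).
toy = pd.DataFrame({
    'dateArgument': pd.to_datetime(
        ['1946-10-15', '1946-12-09', '1947-01-13', '1947-04-28']
    ),
})
toy['argument_month'] = toy['dateArgument'].dt.month
toy['argument_term_month'] = term_month_index(toy['argument_month'])

# April sits at index 6, so "argued in April or later in the term" becomes a
# simple threshold that does not accidentally include October-December.
toy['is_late_term_argument'] = (toy['argument_term_month'] >= 6).astype(int)
print(toy)
```

With this encoding a December argument (index 2) sorts before a January argument (index 3), which matches the order in which the Court actually hears cases.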

In [70]:
import pandas as pd
import numpy as np

# Assume 'df_features' DataFrame exists and contains necessary base columns
# (including the results of basic engineering and potentially simplified categoricals)
print(f"\n--- Starting Advanced Feature Engineering on df_features (Shape: {df_features.shape}) ---")

# --- 1. Enhanced Case Complexity Metrics ---

# a) Interaction Term: num_dockets * lcDisagreement
if 'num_dockets_in_case' in df_features.columns and 'lcDisagreement' in df_features.columns:
    df_features['complex_consolidated_disagreement'] = df_features['num_dockets_in_case'] * df_features['lcDisagreement']
    print("- Engineered 'complex_consolidated_disagreement' (Interaction).")
    # print(df_features['complex_consolidated_disagreement'].value_counts()) # Optional check
else:
    print("- Skipping 'complex_consolidated_disagreement': Base columns missing.")

# b) Interaction Term: Admin Action in Economic Area
#    Requires checking/knowing codes: Assume issueArea 8 = Economic Activity, adminAction > 0 means admin case.
ADMIN_ACTION_COL = 'adminAction' # Replace if name differs
ISSUE_AREA_COL = 'issueArea'   # Replace if name differs
ECONOMIC_AREA_CODE = 8         # Verify this code in your Codebook

if ADMIN_ACTION_COL in df_features.columns and ISSUE_AREA_COL in df_features.columns:
    is_admin = (df_features[ADMIN_ACTION_COL] > 0).astype(int)
    is_econ = (df_features[ISSUE_AREA_COL] == ECONOMIC_AREA_CODE).astype(int)
    df_features['is_AdminAction_x_Economic'] = is_admin * is_econ
    print(f"- Engineered 'is_AdminAction_x_Economic' (Flag for Admin action in IssueArea {ECONOMIC_AREA_CODE}).")
    # print(df_features['is_AdminAction_x_Economic'].value_counts()) # Optional check
else:
    print("- Skipping 'is_AdminAction_x_Economic': Base columns missing.")

# c) Issue Granularity within Area (Conceptual - Requires Different Prep)
#    This requires counting unique 'issue' codes per 'caseId' *before* filtering
#    or using a different source file structure. Cannot be directly computed here.
print("- Skipping 'Issue Granularity': Requires different data prep (grouping before filtering).")
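# Conceptual:
# If you had df_granular with caseId and issue (one row per case issue):
# issue_counts = (df_granular.groupby('caseId')['issue'].nunique()
#                 .rename('num_issues_in_case').reset_index())  # One row per caseId
# Then merge onto df_features: df_features.merge(issue_counts, on='caseId', how='left')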


# --- 2. Temporal Dynamics within Term ---

# a) Decision/Argument Timing from Term Start (Approximate)
#    NOTE: Assumes term starts roughly Oct 1st. For accuracy, use precise start dates.
if 'term' in df_features.columns and 'dateDecision' in df_features.columns and 'dateArgument' in df_features.columns:
    # Approximate term start date (Year from 'term', Month=10, Day=1)
    # Ensure 'term' is integer if necessary: df_features['term'] = df_features['term'].astype(int)
    df_features['approx_term_start_date'] = pd.to_datetime(df_features['term'].astype(str) + '-10-01', errors='coerce')

    # Calculate days from approx start to decision/argument
    # NOTE: duration_days equals days-to-decision minus days-to-argument (before clipping),
    # so 'days_from_term_start_to_decision' reveals the target and should not be used in X.
    df_features['days_from_term_start_to_decision'] = (df_features['dateDecision'] - df_features['approx_term_start_date']).dt.days
    df_features['days_from_term_start_to_argument'] = (df_features['dateArgument'] - df_features['approx_term_start_date']).dt.days

    # Clip negative values to 0 (e.g., a date before the approximate Oct 1st start); NaN is preserved
    df_features['days_from_term_start_to_decision'] = df_features['days_from_term_start_to_decision'].clip(lower=0)
    df_features['days_from_term_start_to_argument'] = df_features['days_from_term_start_to_argument'].clip(lower=0)

    df_features = df_features.drop('approx_term_start_date', axis=1) # Remove temporary column
    print("- Engineered 'days_from_term_start_to_decision'/'_argument' (Approximate).")
else:
    print("- Skipping 'Term Timing' features: Base columns missing.")


# b) "June Rush" Effect / Late Term Decisions
if 'decision_month' in df_features.columns:
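    # Flag decisions from May (5) through September (9), i.e. late in an October-start term.
    # A plain '>= 5' test would also flag October-December, which is the START of a term.
    # NOTE: derived from dateDecision, so this flag is not known before the decision (leakage risk).
    df_features['is_late_term_decision'] = df_features['decision_month'].between(5, 9).astype(int)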
    print("- Engineered 'is_late_term_decision' (Flag for >= May).")
    # print(df_features['is_late_term_decision'].value_counts()) # Optional check
else:
    print("- Skipping 'is_late_term_decision': 'decision_month' missing.")

# c) Late Term Arguments
if 'argument_month' in df_features.columns:
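    # Flag arguments from April (4) through September (9); late arguments often indicate a late decision.
    # A plain '>= 4' test would also flag October-December, which is the START of a term.
    df_features['is_late_term_argument'] = df_features['argument_month'].between(4, 9).astype(int)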
    print("- Engineered 'is_late_term_argument' (Flag for >= April).")
    # print(df_features['is_late_term_argument'].value_counts()) # Optional check
else:
    print("- Skipping 'is_late_term_argument': 'argument_month' missing.")


# --- 3. Court Composition and Ideology ---
#    Requires merging external data (e.g., Justice-Term MQ scores) first.
print("\n- Skipping 'Court Composition/Ideology': Requires merging external MQ scores data.")
# Conceptual (assuming df_merged has 'term' and 'justice_mq_score' per justice;
# the output column names are placeholders):
# mq_by_term = (
#     df_merged.groupby('term')['justice_mq_score']
#     .agg(court_median_mq='median', court_std_mq='std')  # One row per term
#     .reset_index()
# )
# df_features = df_features.merge(mq_by_term, on='term', how='left')  # Merge aggregated scores


# --- 4. Party Configuration Archetypes ---
#    Requires simplified 'petitioner_type' and 'respondent_type' columns exist from manual coding.
#    Replace 'petitioner_type', 'respondent_type' with your actual column names.
#    Replace 'USGovt', 'StateGovt', 'Business', 'Individual' with your actual category names.
print("\n- Engineering 'Party Configuration' features...")
PETITIONER_TYPE_COL = 'petitioner_type' # CHANGE TO YOUR COLUMN NAME
RESPONDENT_TYPE_COL = 'respondent_type' # CHANGE TO YOUR COLUMN NAME

if PETITIONER_TYPE_COL in df_features.columns and RESPONDENT_TYPE_COL in df_features.columns:
    # Example: Government vs. Business
    is_govt_pet = df_features[PETITIONER_TYPE_COL].isin(['USGovt', 'StateGovt'])
    is_business_resp = (df_features[RESPONDENT_TYPE_COL] == 'Business')
    is_business_pet = (df_features[PETITIONER_TYPE_COL] == 'Business')
    is_govt_resp = df_features[RESPONDENT_TYPE_COL].isin(['USGovt', 'StateGovt'])
    df_features['is_Govt_vs_Business'] = ((is_govt_pet & is_business_resp) | (is_business_pet & is_govt_resp)).astype(int)

    # Example: Individual vs. Government
    is_indiv_pet = (df_features[PETITIONER_TYPE_COL] == 'Individual')
    is_indiv_resp = (df_features[RESPONDENT_TYPE_COL] == 'Individual')
    df_features['is_Individual_vs_Govt'] = ((is_indiv_pet & is_govt_resp) | (is_govt_pet & is_indiv_resp)).astype(int)

    # Example: State vs. State (Original Jurisdiction often)
    is_state_pet = (df_features[PETITIONER_TYPE_COL] == 'StateGovt')
    is_state_resp = (df_features[RESPONDENT_TYPE_COL] == 'StateGovt')
    df_features['is_State_vs_State'] = (is_state_pet & is_state_resp).astype(int)

    print("- Engineered party configuration flags (e.g., 'is_Govt_vs_Business').")
    # print(df_features['is_Govt_vs_Business'].value_counts()) # Optional checks
else:
    print("- Skipping 'Party Configuration': Requires simplified petitioner/respondent type columns.")


# --- 5. Lower Court Context Refined ---
#    Requires simplified 'caseSource_type' column exists from manual coding.
#    Replace 'caseSource_type' with your actual column name.
#    Replace 'FedCirc' with your actual category name for Federal Circuit Courts.
print("\n- Engineering 'Lower Court Context' features...")
CASE_SOURCE_TYPE_COL = 'caseSource_type' # CHANGE TO YOUR COLUMN NAME
FED_CIRC_CATEGORY = 'FedCirc'           # CHANGE TO YOUR CATEGORY NAME

if CASE_SOURCE_TYPE_COL in df_features.columns and 'lcDisagreement' in df_features.columns:
    # Interaction: Federal Circuit source AND Lower Court Disagreement
    is_fed_circ = (df_features[CASE_SOURCE_TYPE_COL] == FED_CIRC_CATEGORY)
    df_features['is_FedCirc_Conflict'] = (is_fed_circ & (df_features['lcDisagreement'] == 1)).astype(int)
    print("- Engineered 'is_FedCirc_Conflict' flag.")
else:
    print("- Skipping 'is_FedCirc_Conflict': Requires simplified caseSource type column.")

# Circuit Effects (Advanced - Conceptual)
#    Requires mapping 'caseSource' codes to specific circuits (1-11, DC, Fed) -> 'circuit' column
#    Then create dummy variables. High cardinality - use with caution.
print("- Skipping 'Circuit Effects': Requires mapping caseSource codes to circuits and careful handling of high cardinality.")
# Conceptual:
# if 'circuit' in df_features.columns: # Assuming 'circuit' column was created
#    circuit_dummies = pd.get_dummies(df_features['circuit'], prefix='circuit', drop_first=True) # One flag per circuit; drop_first=True omits the first category as the baseline
#    df_features = pd.concat([df_features, circuit_dummies], axis=1)
#    print("- Conceptual: Created circuit dummy variables.")


# --- 6. External Attention Indicators ---
print("\n- Skipping 'External Attention': Requires merging external Salience / Amicus data.")
# Conceptual:
# if 'nytSalience' in df_features.columns:        # present only after merging salience data
#    feature_columns.append('nytSalience')
# if 'num_amicus_briefs' in df_features.columns:  # present only after merging amicus data
#    feature_columns.append('num_amicus_briefs')


# --- Final Check ---
print("\n--- Advanced Feature Engineering Attempted ---")
print(f"DataFrame 'df_features' shape after additions: {df_features.shape}")
# Display new columns created (optional)
# new_cols = [col for col in df_features.columns if col not in df_argued.columns] # Get list of new cols
# print("\nNew columns created:")
# print(new_cols)
# print(df_features[new_cols].head())

# --- IMPORTANT ---
# Remember to add the names of any new features you want to use
# (e.g., 'complex_consolidated_disagreement', 'is_late_term_decision', 'is_Govt_vs_Business', etc.)
# to your final 'feature_columns' list before creating matrix 'X'.
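
The reminder above can be sketched as follows. This is a minimal illustration, not the notebook's actual code: the toy `df_features` frame and the names `candidate_features`, `present`, and `missing` are assumptions introduced here, and the base feature list is a placeholder. The point is that some engineered flags may have been skipped in a given run, so it is safer to append only the columns that actually exist before building `X`.

```python
import pandas as pd

# Toy stand-in for df_features (assumption); the real frame comes from the steps above.
df_features = pd.DataFrame({
    "duration_days": [45, 120, 80],
    "lcDisagreement": [0, 1, 1],
    "complex_consolidated_disagreement": [0, 1, 0],
    "is_late_term_decision": [0, 1, 1],
})

# Placeholder for the base features already selected earlier (assumption).
feature_columns = ["lcDisagreement"]

# Engineered features we would like to use; some may have been skipped.
candidate_features = [
    "complex_consolidated_disagreement",
    "is_late_term_decision",
    "is_Govt_vs_Business",   # absent here, as if its step was skipped
    "is_FedCirc_Conflict",   # absent here, as if its step was skipped
]

# Keep only candidates that exist in the frame and are not already listed.
present = [c for c in candidate_features
           if c in df_features.columns and c not in feature_columns]
missing = [c for c in candidate_features if c not in df_features.columns]
feature_columns.extend(present)

# Build the feature matrix and target from the final list.
X = df_features[feature_columns]
y = df_features["duration_days"]
```

Inspecting `missing` afterwards gives a quick record of which engineered features were unavailable, which is useful when comparing runs with and without the external data merges.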


--- Starting Advanced Feature Engineering on df_features (Shape: (9534, 59)) ---
- Engineered 'complex_consolidated_disagreement' (Interaction).
- Engineered 'is_AdminAction_x_Economic' (Flag for Admin action in IssueArea 8).
- Skipping 'Issue Granularity': Requires different data prep (grouping before filtering).
- Engineered 'days_from_term_start_to_decision'/'_argument' (Approximate).
- Engineered 'is_late_term_decision' (Flag for >= May).
- Engineered 'is_late_term_argument' (Flag for >= April).

- Skipping 'Court Composition/Ideology': Requires merging external MQ scores data.

- Engineering 'Party Configuration' features...
- Skipping 'Party Configuration': Requires simplified petitioner/respondent type columns.

- Engineering 'Lower Court Context' features...
- Skipping 'is_FedCirc_Conflict': Requires simplified caseSource type column.
- Skipping 'Circuit Effects': Requires mapping caseSource codes to circuits and careful handling of high cardinality.

- Skipping 'External Atte