# *Data Preprocessing*

*The data has been preprocessed to extract the subset of features needed for analysis.*

*The selected features include:*

1. State Code
2. District Code
3. Court Number
4. Judge Position
5. Defendant's Gender
6. Gender of Defendant's Advocate
7. Gender of Petitioner's Advocate
8. Case Type
9. Case Purpose
10. Case Completion Time
11. Judge's Gender
12. Judge's Experience

*By focusing on these specific features, we have obtained a subset of data that can be used for further analysis and classification tasks.*

*To ensure a balanced representation of "acquitted" and "convicted" cases, a subset of cases from the time period between 2010 and 2015 has been selected. This specific time range was chosen to obtain a sufficient number of cases for both dispositions, thereby avoiding any potential class imbalance issues.*

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

## *Cases from 2010-2015*

In [None]:
cases_2010 = pd.read_csv("/kaggle/input/court-data/cases/cases/cases_2010.csv")
cases_2010.drop(columns=["year", "cino", "female_petitioner", "date_first_list", "date_last_list", "date_next_list"], inplace=True)
cases_2010.head()

In [None]:
cases_2011 = pd.read_csv("/kaggle/input/court-data/cases/cases/cases_2011.csv")
cases_2011.drop(columns=["year", "cino", "female_petitioner", "date_first_list", "date_last_list", "date_next_list"], inplace=True)
cases_2011.drop(cases_2011[cases_2011["disp_name"] != 19].index, inplace=True)
cases_2011.head()

In [None]:
cases_2012 = pd.read_csv("/kaggle/input/court-data/cases/cases/cases_2012.csv")
cases_2012.drop(columns=["year", "cino", "female_petitioner", "date_first_list", "date_last_list", "date_next_list"], inplace=True)
cases_2012.drop(cases_2012[cases_2012["disp_name"] != 19].index, inplace=True)
cases_2012.head()

In [None]:
cases_2013 = pd.read_csv("/kaggle/input/court-data/cases/cases/cases_2013.csv")
cases_2013.drop(columns=["year", "cino", "female_petitioner", "date_first_list", "date_last_list", "date_next_list"], inplace=True)
cases_2013.drop(cases_2013[cases_2013["disp_name"] != 19].index, inplace=True)
cases_2013.head()

In [None]:
cases_2014 = pd.read_csv("/kaggle/input/court-data/cases/cases/cases_2014.csv")
cases_2014.drop(columns=["year", "cino", "female_petitioner", "date_first_list", "date_last_list", "date_next_list"], inplace=True)
cases_2014.drop(cases_2014[cases_2014["disp_name"] != 19].index, inplace=True)
cases_2014.head()

In [None]:
cases_2015 = pd.read_csv("/kaggle/input/court-data/cases/cases/cases_2015.csv")
cases_2015.drop(columns=["year", "cino", "female_petitioner", "date_first_list", "date_last_list", "date_next_list"], inplace=True)
cases_2015.drop(cases_2015[cases_2015["disp_name"] != 19].index, inplace=True)
cases_2015.head()
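
*The six yearly cells above repeat the same steps: read the file, drop the unused columns, and (for 2011 to 2015 only) keep rows with disposition code 19. A minimal sketch of how that could be folded into one helper follows. The function name `load_cases`, its `disp_code` parameter, and the toy file are hypothetical and only illustrate the pattern.*

```python
import os
import tempfile

import pandas as pd

# Columns the notebook drops from every yearly file.
UNUSED_COLUMNS = ["year", "cino", "female_petitioner",
                  "date_first_list", "date_last_list", "date_next_list"]


def load_cases(path, disp_code=None):
    """Read one yearly file, drop unused columns, optionally keep one disposition code."""
    df = pd.read_csv(path).drop(columns=UNUSED_COLUMNS)
    if disp_code is not None:
        df = df[df["disp_name"] == disp_code]
    return df


# Tiny synthetic file standing in for a real cases_<year>.csv (all values are made up).
toy = pd.DataFrame({
    "ddl_case_id": ["a", "b", "c"],
    "year": [2011, 2011, 2011],
    "cino": ["x", "y", "z"],
    "female_petitioner": [0, 1, 0],
    "date_first_list": ["2011-01-01"] * 3,
    "date_last_list": ["2011-02-01"] * 3,
    "date_next_list": ["2011-03-01"] * 3,
    "disp_name": [19, 4, 19],
})
with tempfile.TemporaryDirectory() as tmp:
    toy_path = os.path.join(tmp, "cases_2011.csv")
    toy.to_csv(toy_path, index=False)
    all_rows = load_cases(toy_path)               # like the 2010 cell: no disposition filter
    only_19 = load_cases(toy_path, disp_code=19)  # like the 2011-2015 cells

print(list(only_19.columns))
print(len(all_rows), len(only_19))
```

*With real data the helper would be called once per year with the corresponding `/kaggle/input/...` path.*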

*Concatenating the data frames from all six years.*

In [None]:
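# ignore_index=True: fresh unique index, so later label-based drops cannot hit unrelated rows
cases_6yrs = pd.concat([cases_2010, cases_2011, cases_2012, cases_2013, cases_2014, cases_2015], axis=0, ignore_index=True)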
cases_6yrs.head()
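
*Frames read from separate files each carry their own index, so stacking them can leave repeated index labels unless `ignore_index=True` is passed. A label-based `drop` then removes every row sharing a label, not only the rows that matched the condition. A small sketch on made-up data shows the difference:*

```python
import pandas as pd

a = pd.DataFrame({"case_duration": [10, -5]})   # index labels 0, 1
b = pd.DataFrame({"case_duration": [20, 30]})   # index labels 0, 1 again

# Stacking without ignore_index keeps the repeated labels: 0, 1, 0, 1
stacked = pd.concat([a, b], axis=0)
bad_labels = stacked[stacked["case_duration"] <= 0].index
label_drop = stacked.drop(bad_labels)           # removes every row labelled 1

# A fresh index (0, 1, 2, 3) plus a boolean mask removes only the offending row
fresh = pd.concat([a, b], axis=0, ignore_index=True)
mask_drop = fresh[fresh["case_duration"] > 0]

print(len(label_drop), len(mask_drop))
```

*The label-based version loses the valid row with duration 30 because it shares label 1 with the invalid row.*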

In [None]:
import gc
import warnings

warnings.simplefilter('ignore')  # silence library warnings for the rest of the notebook

# Free the per-year frames now that they are combined in cases_6yrs
del cases_2010, cases_2011, cases_2012, cases_2013, cases_2014, cases_2015
gc.collect()

*To ensure a clean and accurate prediction process, any instances with missing values (NaN) in the dataset have been dropped.*

In [None]:
cases_6yrs.dropna(inplace=True)
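
*Before dropping rows, it can help to see which columns hold the missing values and how many rows the step removes. A minimal sketch on a made-up frame (the column names mirror the dataset, the values do not):*

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "date_of_filing": ["2010-01-05", None, "2010-03-01"],
    "female_adv_def": [0, 1, np.nan],
    "disp_name": [19, 19, 4],
})

missing_per_column = toy.isna().sum()   # NaN count for each column
rows_before = len(toy)
clean = toy.dropna()                    # drops any row with at least one NaN
rows_removed = rows_before - len(clean)

print(missing_per_column.to_dict())
print(rows_removed)
```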

*The case filing and decision dates have been converted to datetime values, with unparseable entries coerced to `NaT`. This conversion lets us compute the duration between the two dates in days.*

In [None]:
cases_6yrs['date_of_decision'] =  pd.to_datetime(cases_6yrs['date_of_decision'], errors='coerce')
cases_6yrs['date_of_filing'] =  pd.to_datetime(cases_6yrs['date_of_filing'], errors='coerce')
cases_6yrs.info()

In [None]:
# Case duration in days between filing and decision
cases_6yrs['case_duration'] = (cases_6yrs['date_of_decision'] - cases_6yrs['date_of_filing']).dt.days
cases_6yrs.drop(columns=['date_of_filing'], inplace=True)
# Keep positive durations only; a boolean mask does not depend on index labels being unique
cases_6yrs = cases_6yrs[cases_6yrs['case_duration'] > 0]

cases_6yrs.head()
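
*The duration step can be traced on a few made-up rows. The sketch below assumes ISO-formatted date strings. An entry that cannot be parsed becomes `NaT`, its duration becomes `NaN`, and the `> 0` mask drops it along with zero or negative durations.*

```python
import pandas as pd

toy = pd.DataFrame({
    "date_of_filing": ["2010-01-01", "2010-06-15", "not a date"],
    "date_of_decision": ["2010-01-31", "2010-06-15", "2011-01-01"],
})
for col in ["date_of_filing", "date_of_decision"]:
    toy[col] = pd.to_datetime(toy[col], errors="coerce")  # bad strings become NaT

# Timedelta in whole days; NaN where either date is missing
toy["case_duration"] = (toy["date_of_decision"] - toy["date_of_filing"]).dt.days
kept = toy[toy["case_duration"] > 0]

print(toy["case_duration"].tolist())
print(len(kept))
```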

*Rows with an unknown gender are removed, and the defendant's gender is then converted from an object (string) column to an integer.*

In [None]:
# Keep rows where both advocates' genders are known (coded 0 or 1)
known_adv = cases_6yrs["female_adv_def"].isin([0, 1]) & cases_6yrs["female_adv_pet"].isin([0, 1])
# Keep rows where the defendant's gender is known
known_def = cases_6yrs["female_defendant"].isin(["0 male", "1 female"])
cases_6yrs = cases_6yrs[known_adv & known_def].copy()

# Encode the defendant's gender as 1 (female) or 0 (male)
cases_6yrs["female_defendant"] = np.where(cases_6yrs["female_defendant"] == "1 female", 1, 0)
cases_6yrs
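
*The gender filter keeps only rows whose codes fall in a known set, which `isin` expresses directly. A short sketch on made-up values follows. The placeholder codes `-9998` and `"-9998 unclear"` are hypothetical stand-ins for whatever the dataset uses for unknown gender.*

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "female_adv_def": [0, 1, -9998, 1],
    "female_defendant": ["0 male", "1 female", "1 female", "-9998 unclear"],
})

# A row survives only if every gender column holds a known value
known = (
    toy["female_adv_def"].isin([0, 1])
    & toy["female_defendant"].isin(["0 male", "1 female"])
)
toy = toy[known].copy()

# Encode the string label as an integer flag
toy["female_defendant"] = np.where(toy["female_defendant"] == "1 female", 1, 0)
print(toy)
```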

## *Judges and Cases Relational data*

In [None]:
judges_keys = pd.read_csv("/kaggle/input/court-data/keys/keys/judge_case_merge_key.csv")
judges_keys.drop(columns=["ddl_filing_judge_id"], inplace=True)

judges_keys.head()

*Merging two dataframes based on a common key, that is, `case id`. The resulting merged dataframe provides a consolidated view of the data, incorporating information from both original data sources.*

In [None]:
cases = pd.merge(cases_6yrs, judges_keys, on='ddl_case_id', how='left')
cases.dropna(inplace=True)
cases
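
*A left merge followed by `dropna` silently discards cases that have no judge key. The sketch below, on made-up frames, uses the `indicator` and `validate` options of `pd.merge` to count unmatched cases and to confirm that each case id appears at most once in the key table. Whether the real key file satisfies that uniqueness is an assumption.*

```python
import pandas as pd

toy_cases = pd.DataFrame({"ddl_case_id": ["a", "b", "c"], "case_duration": [30, 45, 60]})
toy_keys = pd.DataFrame({"ddl_case_id": ["a", "b"], "ddl_decision_judge_id": [7.0, None]})

merged = pd.merge(
    toy_cases, toy_keys, on="ddl_case_id", how="left",
    validate="many_to_one",   # raises if a case id repeats in toy_keys
    indicator=True,           # adds a _merge column: 'both' or 'left_only'
)
unmatched = int((merged["_merge"] == "left_only").sum())

# Keep only cases that actually have a decision judge id
with_judge = merged.dropna(subset=["ddl_decision_judge_id"]).drop(columns="_merge")

print(unmatched, len(with_judge))
```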

In [None]:
del cases_6yrs
gc.collect()

## *Judges data*

In [None]:
judges = pd.read_csv("/kaggle/input/court-data/judges_clean/judges_clean.csv")
judges.drop(columns=["state_code", "dist_code", "court_no", "judge_position", "end_date"], inplace=True)
judges.head()

*Merging two dataframes based on a common key, that is, `judge id`.*

In [None]:
cases = pd.merge(cases, judges, how='left', left_on='ddl_decision_judge_id', right_on='ddl_judge_id')
cases.head()

*The judge's start date has been converted to a datetime value so that the judge's experience, in days up to the decision date, can be computed.*

In [None]:
cases['start_date'] =  pd.to_datetime(cases['start_date'], errors='coerce')
cases['judge_experience'] = (cases['date_of_decision'] - cases['start_date']).dt.days
cases.drop(columns=['date_of_decision', 'start_date'], inplace=True)
cases.drop(cases[ cases['judge_experience'] <= 0 ].index, inplace = True)

cases.head()

In [None]:
cases.drop(cases[(cases["female_judge"] != "0 nonfemale") & (cases["female_judge"] != "1 female")].index, inplace=True)
cases["female_judge"] = np.where(cases["female_judge"] == "1 female", 1, 0)

cases

In [None]:
cases.drop(columns=["ddl_decision_judge_id","ddl_judge_id"], inplace=True)

## *Disposition data*

In [None]:
disp_key = pd.read_csv("/kaggle/input/court-data/keys/keys/disp_name_key.csv")
disp_key = disp_key.sort_values(by='disp_name', ascending=True, ignore_index=True)
disp_key.drop(columns=['year', 'count'], inplace=True)
disp_key.drop_duplicates(subset='disp_name_s', inplace=True, ignore_index=True)
disp_key
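
*The key file carries a `year` column, which suggests that the same numeric code could map to different disposition names in different years. That is an assumption about the data, not something confirmed here. A quick check on a made-up key table shows how such ambiguous codes could be detected before deduplicating:*

```python
import pandas as pd

# Made-up key rows; in this toy table code 19 means different things in 2010 and 2011
toy_key = pd.DataFrame({
    "year": [2010, 2011, 2010, 2011],
    "disp_name": [1, 1, 19, 19],
    "disp_name_s": ["acquitted", "acquitted", "convicted", "dismissed"],
    "count": [100, 120, 40, 35],
})

# Number of distinct names attached to each code across all years
names_per_code = toy_key.groupby("disp_name")["disp_name_s"].nunique()
ambiguous_codes = names_per_code[names_per_code > 1].index.tolist()

print(ambiguous_codes)
```

*An empty list on the real key file would indicate that merging on `disp_name` alone is safe.*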

*Merging two dataframes based on a common key, that is, `disposition name`.*

In [None]:
cases = pd.merge(cases, disp_key, how='left', on='disp_name')
cases.drop(columns=['disp_name'], inplace=True)
cases

*Filtering the data to keep only cases with a disposition of `convicted` or `acquitted`.*

In [None]:
cases.drop(cases[(cases["disp_name_s"] != "convicted") & (cases["disp_name_s"] != "acquitted")].index, inplace=True)
cases

In [None]:
cases["disp_name_s"].value_counts()

*Converting the disposition to a binary integer label: 1 for `convicted`, 0 for `acquitted`.*

In [None]:
cases["disp_name_s"] = np.where(cases["disp_name_s"] == "convicted", 1, 0)
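
*Since the year range was chosen to balance the two classes, it is worth checking the class shares alongside the encoding. The sketch below uses made-up labels. It also shows an explicit dictionary mapping as an alternative to `np.where`, which would leave an unexpected label as `NaN` instead of silently turning it into 0.*

```python
import pandas as pd

toy = pd.DataFrame({"disp_name_s": ["convicted", "acquitted", "acquitted", "convicted"]})

# Share of each class; values near 0.5 indicate a balanced target
share = toy["disp_name_s"].value_counts(normalize=True)

# Explicit mapping: any label missing from the dict becomes NaN
label_map = {"convicted": 1, "acquitted": 0}
toy["label"] = toy["disp_name_s"].map(label_map)
all_mapped = bool(toy["label"].notna().all())

print(share.to_dict())
print(toy["label"].tolist(), all_mapped)
```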

In [None]:
cases.to_csv("/kaggle/working/cases_convicted_acquitted.csv", index=False)