# *Data Preprocessing*

*The data has been preprocessed to create a dataset for classifying and predicting the time taken to resolve a case into five distinct categories (1-100 days, 100-500 days, 500-1000 days, 1000-1500 days, and 1500+ days). The dataset includes the following features:*

1. State Code
2. District Code
3. Court Number
4. Judge Position
5. Gender of Defendant's Advocate
6. Gender of Petitioner's Advocate
7. Case Type
8. Case Purpose
9. Disposition Name
10. Act
11. Section
12. Number of Sections IPC
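
*The five duration categories could be derived from a day count with `pandas.cut`. The following is only a sketch: the text does not say which category a boundary value such as 100 belongs to, so right-inclusive intervals are assumed here, and the names `DURATION_BINS`, `DURATION_LABELS` and `bin_duration` are hypothetical.*

```python
import numpy as np
import pandas as pd

# Assumed bin edges and labels; intervals are right-inclusive,
# i.e. (0, 100], (100, 500], (500, 1000], (1000, 1500], (1500, inf)
DURATION_BINS = [0, 100, 500, 1000, 1500, np.inf]
DURATION_LABELS = ["1-100", "100-500", "500-1000", "1000-1500", "1500+"]


def bin_duration(days: pd.Series) -> pd.Series:
    """Map a duration in days to one of the five categories."""
    return pd.cut(days, bins=DURATION_BINS, labels=DURATION_LABELS, right=True)


# Made-up durations, including two boundary values
example = pd.Series([45, 100, 519, 1000, 1200, 2000])
binned = bin_duration(example)
print(binned.tolist())
```

*Passing `right=False` instead would move each boundary value into the next category.*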

*Including cases from the years 2016 to 2018 expands the dataset and provides a more comprehensive representation of case resolution time during that period.* 

*In this notebook, we filter out the unnecessary columns and rows from the case records for 2016, 2017, and 2018.*

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/court-data/judges_clean/judges_clean.csv
/kaggle/input/court-data/acts_sections/acts_sections.csv
/kaggle/input/court-data/cases/cases/cases_2015.csv
/kaggle/input/court-data/cases/cases/cases_2012.csv
/kaggle/input/court-data/cases/cases/cases_2018.csv
/kaggle/input/court-data/cases/cases/cases_2013.csv
/kaggle/input/court-data/cases/cases/cases_2017.csv
/kaggle/input/court-data/cases/cases/cases_2010.csv
/kaggle/input/court-data/cases/cases/cases_2014.csv
/kaggle/input/court-data/cases/cases/cases_2016.csv
/kaggle/input/court-data/cases/cases/cases_2011.csv
/kaggle/input/court-data/keys/keys/type_name_key.csv
/kaggle/input/court-data/keys/keys/cases_district_key.csv
/kaggle/input/court-data/keys/keys/act_key.csv
/kaggle/input/court-data/keys/keys/disp_name_key.csv
/kaggle/input/court-data/keys/keys/purpose_name_key.csv
/kaggle/input/court-data/keys/keys/cases_state_key.csv
/kaggle/input/court-data/keys/keys/section_key.csv
/kaggle/input/court-data/keys/keys/cases_court_

## *Cases from 2016-2018*

In [2]:
cases_2016 = pd.read_csv("/kaggle/input/court-data/cases/cases/cases_2016.csv")
cases_2016.head()

Unnamed: 0,ddl_case_id,year,state_code,dist_code,court_no,cino,judge_position,female_defendant,female_petitioner,female_adv_def,female_adv_pet,type_name,purpose_name,disp_name,date_of_filing,date_of_decision,date_first_list,date_last_list,date_next_list
0,01-01-01-201900000012016,2016,1,1,1,MHNB030000032016,chief judicial magistrate,0 male,0 male,-9999,0,1940.0,767.0,26,2016-01-02,2016-02-16,2016-01-02,2016-02-16,2016-02-16
1,01-01-01-201900000022016,2016,1,1,1,MHNB030000042016,chief judicial magistrate,0 male,0 male,-9999,0,1940.0,767.0,26,2016-01-02,2016-02-16,2016-01-02,2016-02-16,2016-02-16
2,01-01-01-201900000032016,2016,1,1,1,MHNB030000052016,chief judicial magistrate,0 male,1 female,-9999,0,1940.0,4878.0,43,2016-01-02,2016-02-13,2016-01-02,2016-02-13,2016-02-13
3,01-01-01-201900000042016,2016,1,1,1,MHNB030000062016,chief judicial magistrate,0 male,0 male,-9999,1,1940.0,7430.0,23,2016-01-05,2017-06-07,2016-01-05,2017-06-07,2017-06-07
4,01-01-01-201900000052016,2016,1,1,1,MHNB030000072016,chief judicial magistrate,0 male,0 male,-9999,0,1940.0,5251.0,26,2016-01-06,2016-02-18,2016-01-06,2016-02-18,2016-02-18


In [3]:
cases_2016.drop(columns=["year", "cino", "female_defendant", "female_petitioner", "date_first_list", "date_last_list", "date_next_list"], inplace=True)

In [4]:
cases_2017 = pd.read_csv("/kaggle/input/court-data/cases/cases/cases_2017.csv")
cases_2017.head()

Unnamed: 0,ddl_case_id,year,state_code,dist_code,court_no,cino,judge_position,female_defendant,female_petitioner,female_adv_def,female_adv_pet,type_name,purpose_name,disp_name,date_of_filing,date_of_decision,date_first_list,date_last_list,date_next_list
0,01-01-01-201900000012017,2017,1,1,1,MHNB030000102017,chief judicial magistrate,0 male,0 male,-9999,0,1943.0,3206.0,26,2017-01-02,2017-03-06,2017-01-02,2017-03-06,2017-03-06
1,01-01-01-201900000022017,2017,1,1,1,MHNB030000282017,chief judicial magistrate,0 male,0 male,-9999,1,1943.0,6212.0,26,2017-01-04,2017-02-27,2017-01-04,2017-02-27,2017-02-27
2,01-01-01-201900000032017,2017,1,1,1,MHNB030000292017,chief judicial magistrate,-9998 unclear,0 male,-9999,0,1943.0,6004.0,31,2017-01-04,2017-01-21,2017-01-04,2017-01-21,2017-01-21
3,01-01-01-201900000042017,2017,1,1,1,MHNB030000302017,chief judicial magistrate,0 male,0 male,-9999,1,1943.0,3206.0,31,2017-01-04,2017-02-14,2017-01-04,2017-02-14,2017-02-14
4,01-01-01-201900000052017,2017,1,1,1,MHNB030000312017,chief judicial magistrate,-9998 unclear,0 male,-9999,0,1943.0,7122.0,31,2017-01-04,2017-01-06,2017-01-04,2017-01-06,2017-01-06


In [5]:
cases_2017.drop(columns=["year", "cino", "female_defendant", "female_petitioner", "date_first_list", "date_last_list", "date_next_list"], inplace=True)

In [6]:
cases_2018 = pd.read_csv("/kaggle/input/court-data/cases/cases/cases_2018.csv")
cases_2018.head()

Unnamed: 0,ddl_case_id,year,state_code,dist_code,court_no,cino,judge_position,female_defendant,female_petitioner,female_adv_def,female_adv_pet,type_name,purpose_name,disp_name,date_of_filing,date_of_decision,date_first_list,date_last_list,date_next_list
0,01-01-01-201900000012018,2018,1,1,1,MHNB030000022018,chief judicial magistrate,0 male,0 male,-9999,0,1943,2975.0,33,2018-01-01,2018-02-07,2018-01-01,2018-02-07,2018-02-07
1,01-01-01-201900000022018,2018,1,1,1,MHNB030000032018,chief judicial magistrate,0 male,0 male,-9999,0,1943,3315.0,52,2018-01-01,2018-02-01,2018-01-01,2018-02-01,2018-02-01
2,01-01-01-201900000032018,2018,1,1,1,MHNB030000042018,chief judicial magistrate,0 male,0 male,-9999,0,1943,5877.0,52,2018-01-01,2018-02-01,2018-01-01,2018-02-01,2018-02-01
3,01-01-01-201900000042018,2018,1,1,1,MHNB030000052018,chief judicial magistrate,0 male,0 male,-9999,0,1943,840.0,52,2018-01-01,2018-02-01,2018-01-01,2018-02-01,2018-02-01
4,01-01-01-201900000052018,2018,1,1,1,MHNB030000062018,chief judicial magistrate,-9998 unclear,0 male,-9999,1,1943,840.0,5,2018-01-01,2018-01-09,2018-01-01,2018-01-09,2018-01-09


In [7]:
cases_2018.drop(columns=["year", "cino", "female_defendant", "female_petitioner", "date_first_list", "date_last_list", "date_next_list"], inplace=True)
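
*The three load-and-drop cells above repeat the same steps. One possible way to factor them out, and to skip the unused columns while parsing rather than afterwards, is sketched below. The helper name `load_cases` and the in-memory sample are hypothetical; the column names are the ones dropped above.*

```python
import io

import pandas as pd

# Columns the notebook discards from every yearly file
DROP_COLUMNS = {
    "year", "cino", "female_defendant", "female_petitioner",
    "date_first_list", "date_last_list", "date_next_list",
}


def load_cases(source) -> pd.DataFrame:
    """Read one yearly cases file, skipping the unused columns at parse time."""
    return pd.read_csv(source, usecols=lambda name: name not in DROP_COLUMNS)


# Tiny in-memory stand-in for a yearly CSV (made-up rows, reduced column set)
sample_csv = io.StringIO(
    "ddl_case_id,year,state_code,cino,date_of_filing,date_next_list\n"
    "A-1,2016,1,X1,2016-01-02,2016-02-16\n"
    "A-2,2016,1,X2,2016-01-05,2017-06-07\n"
)
sample = load_cases(sample_csv)
print(sample.columns.tolist())
```

*On Kaggle the same helper would be called with a file path, for example `load_cases("/kaggle/input/court-data/cases/cases/cases_2016.csv")`.*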

*Concatenating the data frames from all three years*

In [8]:
cases_3yrs = pd.concat([cases_2016, cases_2017, cases_2018], axis=0)
cases_3yrs.head()

Unnamed: 0,ddl_case_id,state_code,dist_code,court_no,judge_position,female_adv_def,female_adv_pet,type_name,purpose_name,disp_name,date_of_filing,date_of_decision
0,01-01-01-201900000012016,1,1,1,chief judicial magistrate,-9999,0,1940.0,767.0,26,2016-01-02,2016-02-16
1,01-01-01-201900000022016,1,1,1,chief judicial magistrate,-9999,0,1940.0,767.0,26,2016-01-02,2016-02-16
2,01-01-01-201900000032016,1,1,1,chief judicial magistrate,-9999,0,1940.0,4878.0,43,2016-01-02,2016-02-13
3,01-01-01-201900000042016,1,1,1,chief judicial magistrate,-9999,1,1940.0,7430.0,23,2016-01-05,2017-06-07
4,01-01-01-201900000052016,1,1,1,chief judicial magistrate,-9999,0,1940.0,5251.0,26,2016-01-06,2016-02-18
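
*`pd.concat` keeps each frame's own index by default, so the row labels of the combined frame repeat once per year. A small sketch with made-up frames shows the difference that `ignore_index=True` makes:*

```python
import pandas as pd

# Two small stand-in frames, each with its own default 0..n-1 index
a = pd.DataFrame({"ddl_case_id": ["a0", "a1"]})
b = pd.DataFrame({"ddl_case_id": ["b0", "b1"]})

stacked = pd.concat([a, b], axis=0)                        # labels repeat
renumbered = pd.concat([a, b], axis=0, ignore_index=True)  # fresh 0..n-1 labels

print(stacked.index.tolist())
print(renumbered.index.tolist())
print(stacked.index.is_unique, renumbered.index.is_unique)
```

*A unique index matters for any later step that selects or drops rows by label.*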


In [9]:
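import gc
import warnings

# Silence library warnings in the notebook output
warnings.simplefilter('ignore')

# The yearly frames are no longer needed after concatenation; release their memory
del cases_2016
gc.collect()
del cases_2017
gc.collect()
del cases_2018
gc.collect()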

0

In [10]:
cases_3yrs.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 38139072 entries, 0 to 13724298
Data columns (total 12 columns):
 #   Column            Dtype  
---  ------            -----  
 0   ddl_case_id       object 
 1   state_code        int64  
 2   dist_code         int64  
 3   court_no          int64  
 4   judge_position    object 
 5   female_adv_def    int64  
 6   female_adv_pet    int64  
 7   type_name         float64
 8   purpose_name      float64
 9   disp_name         int64  
 10  date_of_filing    object 
 11  date_of_decision  object 
dtypes: float64(2), int64(6), object(4)
memory usage: 3.7+ GB
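
*The combined frame takes several gigabytes. One possible way to reduce that, sketched here on a made-up three-row frame, is to downcast the small integer codes and store the repetitive `judge_position` strings as a categorical. Whether this is worthwhile for the full dataset is an assumption that would need to be measured.*

```python
import pandas as pd

# Made-up rows using column names from the frame above
frame = pd.DataFrame({
    "state_code": [1, 13, 33],
    "court_no": [1, 4, 75],
    "judge_position": ["chief judicial magistrate"] * 3,
})

before = frame.memory_usage(deep=True).sum()

# Downcast integer columns to the smallest integer type that fits the values
for column in ["state_code", "court_no"]:
    frame[column] = pd.to_numeric(frame[column], downcast="integer")

# Store a repetitive string column as a categorical
frame["judge_position"] = frame["judge_position"].astype("category")

after = frame.memory_usage(deep=True).sum()
print(frame.dtypes)
print(before, after)
```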


In [11]:
cases_3yrs.describe()

Unnamed: 0,state_code,dist_code,court_no,female_adv_def,female_adv_pet,type_name,purpose_name,disp_name
count,38139070.0,38139070.0,38139070.0,38139070.0,38139070.0,38139030.0,36879470.0,38139070.0
mean,11.39615,18.63644,6.73391,-9106.285,-5373.051,3369.984,3831.728,26.63822
std,7.466314,15.30588,7.122301,2851.185,4985.453,2139.658,2407.072,11.31671
min,1.0,1.0,1.0,-9999.0,-9999.0,1.0,1.0,1.0
25%,4.0,7.0,2.0,-9999.0,-9999.0,1561.0,1211.0,23.0
50%,13.0,15.0,4.0,-9999.0,-9998.0,2701.0,4342.0,27.0
75%,16.0,26.0,9.0,-9999.0,0.0,5312.0,5644.0,30.0
max,33.0,76.0,75.0,1.0,1.0,7556.0,8785.0,52.0
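
*`describe()` treats the coded columns as ordinary numbers, so the means of "female_adv_def" and "female_adv_pet" mostly reflect how often the negative codes occur. Counting the codes directly is often easier to read. The values below are made up and use only codes that are visible in the outputs above.*

```python
import pandas as pd

# Made-up values using the codes -9999, -9998, 0 and 1
adv = pd.Series([-9999, -9999, -9998, 0, 1, 0], name="female_adv_pet")

# Number of rows per code, and the same as a share of all rows
counts = adv.value_counts().sort_index()
shares = adv.value_counts(normalize=True).sort_index()

print(counts)
print(shares.round(3))
```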


In [12]:
cases_3yrs['date_of_decision'].isnull().sum()

14107430

In [13]:
cases_3yrs['date_of_filing'].isnull().sum()

0

*The "type_name", "purpose_name" and "date_of_decision" columns contain null values; "date_of_decision" alone is missing in 14,107,430 rows.*

*A case duration cannot be computed without a decision date, so every row with a missing value (NaN) is dropped.*

In [14]:
cases_3yrs.dropna(inplace=True)
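
*The null counts for all columns can be obtained in a single call, and `dropna` can be limited to chosen columns with `subset`. A minimal sketch on made-up rows:*

```python
import numpy as np
import pandas as pd

# Made-up rows with one missing purpose and one missing decision date
frame = pd.DataFrame({
    "purpose_name": [767.0, np.nan, 4878.0],
    "date_of_filing": ["2016-01-02", "2016-01-02", "2016-01-05"],
    "date_of_decision": ["2016-02-16", "2016-02-16", None],
})

# Null count for every column at once
null_counts = frame.isna().sum()
print(null_counts)

# Drop only the rows that are missing one of the listed columns
complete = frame.dropna(subset=["purpose_name", "date_of_decision"])
print(len(frame), "->", len(complete))
```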

*The filing and decision dates are stored as strings, so both columns are converted to datetime values. This makes it possible to compute the number of days between the two dates.*

In [15]:
cases_3yrs['date_of_decision'] =  pd.to_datetime(cases_3yrs['date_of_decision'], errors='coerce')
cases_3yrs['date_of_filing'] =  pd.to_datetime(cases_3yrs['date_of_filing'], errors='coerce')
cases_3yrs.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 22998510 entries, 0 to 13724298
Data columns (total 12 columns):
 #   Column            Dtype         
---  ------            -----         
 0   ddl_case_id       object        
 1   state_code        int64         
 2   dist_code         int64         
 3   court_no          int64         
 4   judge_position    object        
 5   female_adv_def    int64         
 6   female_adv_pet    int64         
 7   type_name         float64       
 8   purpose_name      float64       
 9   disp_name         int64         
 10  date_of_filing    datetime64[ns]
 11  date_of_decision  datetime64[ns]
dtypes: datetime64[ns](2), float64(2), int64(6), object(2)
memory usage: 2.2+ GB
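
*With `errors='coerce'`, any string that cannot be parsed becomes `NaT` instead of raising an error. Because the conversion runs after `dropna`, such rows would need a separate check. The sketch below uses made-up strings and assumes the `YYYY-MM-DD` format seen in the outputs above.*

```python
import pandas as pd

# Made-up date strings; the second one is deliberately invalid
raw = pd.Series(["2016-01-02", "not-a-date", "2017-06-07"])

parsed = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")
print(parsed)

# Rows that failed to parse are marked NaT and can be counted or removed
invalid = parsed.isna()
print(int(invalid.sum()))
parsed_clean = parsed[~invalid]
```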


In [16]:
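# Duration in whole days between filing and decision
cases_3yrs['case_duration'] = (cases_3yrs['date_of_decision'] - cases_3yrs['date_of_filing']).dt.days
cases_3yrs.drop(columns=['date_of_filing', 'date_of_decision'], inplace=True)
# Drop cases with a non-positive duration. The index labels repeat across the three
# concatenated years, so this label-based drop also removes rows from the other
# years that carry the same label.
cases_3yrs.drop(cases_3yrs[cases_3yrs['case_duration'] <= 0].index, inplace=True)

cases_3yrs.head()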

Unnamed: 0,ddl_case_id,state_code,dist_code,court_no,judge_position,female_adv_def,female_adv_pet,type_name,purpose_name,disp_name,case_duration
0,01-01-01-201900000012016,1,1,1,chief judicial magistrate,-9999,0,1940.0,767.0,26,45.0
1,01-01-01-201900000022016,1,1,1,chief judicial magistrate,-9999,0,1940.0,767.0,26,45.0
2,01-01-01-201900000032016,1,1,1,chief judicial magistrate,-9999,0,1940.0,4878.0,43,42.0
3,01-01-01-201900000042016,1,1,1,chief judicial magistrate,-9999,1,1940.0,7430.0,23,519.0
4,01-01-01-201900000052016,1,1,1,chief judicial magistrate,-9999,0,1940.0,5251.0,26,43.0
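
*As a worked example, a case filed on 2016-01-02 and decided on 2016-02-16 lasts 45 days, which matches the first row above. The sketch below recomputes a few durations on made-up frames and compares two ways of removing non-positive durations when index labels repeat: dropping by label and filtering with a boolean mask. The mask is an alternative shown for comparison, not the method used in this notebook.*

```python
import pandas as pd

# Stand-ins for two yearly frames; both use index labels 0 and 1
y1 = pd.DataFrame({"date_of_filing": ["2016-01-02", "2016-01-05"],
                   "date_of_decision": ["2016-02-16", "2017-06-07"]})
y2 = pd.DataFrame({"date_of_filing": ["2017-01-04", "2017-01-04"],
                   "date_of_decision": ["2017-01-04", "2017-02-14"]})
cases = pd.concat([y1, y2], axis=0)

filing = pd.to_datetime(cases["date_of_filing"])
decision = pd.to_datetime(cases["date_of_decision"])
cases["case_duration"] = (decision - filing).dt.days

# Label-based drop: label 0 belongs to the zero-day case in y2
# and also to the 45-day case in y1, so both rows are removed
by_label = cases.drop(cases[cases["case_duration"] <= 0].index)

# Boolean mask: only the row that fails the condition is removed
by_mask = cases[cases["case_duration"] > 0]

print(cases["case_duration"].tolist())
print(len(by_label), len(by_mask))
```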


*Removing the rows in which the gender of either advocate is unknown, i.e. where "female_adv_def" or "female_adv_pet" is neither 0 nor 1.*

In [17]:
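# Keep only rows where both advocate gender codes are 0 or 1. The drop is label-based
# and the index labels repeat across years, so every row that shares a label with an
# unknown-gender row is removed as well.
cases_3yrs.drop(cases_3yrs[(cases_3yrs["female_adv_def"] != 0) & (cases_3yrs["female_adv_def"] != 1)].index, inplace=True)
cases_3yrs.drop(cases_3yrs[(cases_3yrs["female_adv_pet"] != 0) & (cases_3yrs["female_adv_pet"] != 1)].index, inplace=True)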
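
*The same condition can be written more compactly with `isin` and applied as a row-wise filter. This is a sketch on made-up rows; because it selects rows by position rather than by label, it can keep more rows than a label-based drop when index labels repeat.*

```python
import pandas as pd

# Made-up rows: only the last two have a known gender for both advocates
frame = pd.DataFrame({
    "female_adv_def": [-9999, 0, 1, 0],
    "female_adv_pet": [0, -9998, 1, 0],
})

# True where both codes are 0 or 1
known = frame["female_adv_def"].isin([0, 1]) & frame["female_adv_pet"].isin([0, 1])
filtered = frame[known]
print(filtered)
```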

In [18]:
cases_3yrs

Unnamed: 0,ddl_case_id,state_code,dist_code,court_no,judge_position,female_adv_def,female_adv_pet,type_name,purpose_name,disp_name,case_duration
131,01-01-01-201900001322016,1,1,1,chief judicial magistrate,0,0,1940.0,2813.0,23,947.0
178,01-01-01-201900001792016,1,1,1,chief judicial magistrate,0,0,1940.0,6253.0,16,241.0
300,01-01-01-201900003012016,1,1,1,chief judicial magistrate,0,0,1940.0,4776.0,5,602.0
463,01-01-01-201900004642016,1,1,1,chief judicial magistrate,0,1,1940.0,5426.0,28,445.0
520,01-01-01-202800000242016,1,1,1,chief judicial magistrate,1,0,4777.0,4342.0,5,599.0
...,...,...,...,...,...,...,...,...,...,...,...
13724269,33-02-03-220700000052018,33,2,3,chief judicial magistrate,1,1,2795.0,4476.0,25,176.0
13724285,33-02-03-224100000012018,33,2,3,chief judicial magistrate,0,0,5889.0,840.0,30,29.0
13724290,33-02-03-224600000022018,33,2,3,chief judicial magistrate,0,0,3364.0,2234.0,16,107.0
13724293,33-02-03-224600000052018,33,2,3,chief judicial magistrate,0,0,3364.0,1163.0,16,57.0


In [19]:
cases_3yrs.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 274652 entries, 131 to 13724296
Data columns (total 11 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   ddl_case_id     274652 non-null  object 
 1   state_code      274652 non-null  int64  
 2   dist_code       274652 non-null  int64  
 3   court_no        274652 non-null  int64  
 4   judge_position  274652 non-null  object 
 5   female_adv_def  274652 non-null  int64  
 6   female_adv_pet  274652 non-null  int64  
 7   type_name       274652 non-null  float64
 8   purpose_name    274652 non-null  float64
 9   disp_name       274652 non-null  int64  
 10  case_duration   274652 non-null  float64
dtypes: float64(3), int64(6), object(2)
memory usage: 25.1+ MB


In [20]:
cases_3yrs.to_csv("/kaggle/working/3yrs_cases.csv", index=False)