# *Data Preprocessing*

*The data is preprocessed into a dataset for classifying the time taken to resolve a case into five categories (1-100 days, 100-500 days, 500-1000 days, 1000-1500 days, and 1500+ days). It includes the following features:*

1. State Code
2. District Code
3. Court Number
4. Judge Position
5. Gender of Defendant's Advocate
6. Gender of Petitioner's Advocate
7. Case Type
8. Case Purpose
9. Disposition Name
10. Act
11. Section
12. Number of Sections IPC

*Including cases from the years 2016 to 2018 expands the dataset and provides a more comprehensive representation of case resolution time during that period.* 

*In this notebook we filter out the unnecessary data points from all the cases from the years 2016, 2017, and 2018.*

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

## *Cases from 2016-2018*

In [None]:
cases_2016 = pd.read_csv("/kaggle/input/court-data/cases/cases/cases_2016.csv")
cases_2016.head()

In [None]:
cases_2016.drop(columns=["year", "cino", "female_defendant", "female_petitioner", "date_first_list", "date_last_list", "date_next_list"], inplace=True)

In [None]:
cases_2017 = pd.read_csv("/kaggle/input/court-data/cases/cases/cases_2017.csv")
cases_2017.head()

In [None]:
cases_2017.drop(columns=["year", "cino", "female_defendant", "female_petitioner", "date_first_list", "date_last_list", "date_next_list"], inplace=True)

In [None]:
cases_2018 = pd.read_csv("/kaggle/input/court-data/cases/cases/cases_2018.csv")
cases_2018.head()

In [None]:
cases_2018.drop(columns=["year", "cino", "female_defendant", "female_petitioner", "date_first_list", "date_last_list", "date_next_list"], inplace=True)
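
*The three load-and-drop cells above repeat the same steps. A minimal sketch of folding them into one helper follows; `load_cases` and `DROP_COLS` are hypothetical names that do not appear in the original notebook, and the in-memory CSV only stands in for a yearly file.*

```python
import io
import pandas as pd

# Columns the notebook discards for every year
DROP_COLS = ["year", "cino", "female_defendant", "female_petitioner",
             "date_first_list", "date_last_list", "date_next_list"]

def load_cases(source, drop_cols=DROP_COLS):
    """Read one year's CSV and skip the unwanted columns while parsing."""
    unwanted = set(drop_cols)
    # A callable usecols keeps only the columns for which it returns True
    return pd.read_csv(source, usecols=lambda col: col not in unwanted)

# Tiny stand-in for a yearly file (made-up values, for illustration only)
sample_csv = io.StringIO(
    "year,cino,state_code,female_defendant,date_of_filing\n"
    "2016,abc,1,0,2016-01-05\n"
)
sample = load_cases(sample_csv)
print(sample.columns.tolist())
```

*Unlike `drop(columns=...)`, this variant does not raise an error if one of the listed columns is absent from a file, which may or may not be desirable.*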

*Concatenating the data frames from all three years*

In [None]:
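# ignore_index=True gives the combined frame unique row labels; otherwise labels repeat across years
cases_3yrs = pd.concat([cases_2016, cases_2017, cases_2018], axis=0, ignore_index=True)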
cases_3yrs.head()
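
*One thing to watch when stacking frames with `pd.concat`: unless `ignore_index=True` is passed, each frame keeps its own row labels, so labels repeat in the result. A label-based `drop` then removes every row sharing a label, not just the rows that matched the condition. A toy sketch with made-up values:*

```python
import pandas as pd

a = pd.DataFrame({"x": [1, -1]})
b = pd.DataFrame({"x": [5, 6]})

# Labels repeat: the stacked index is [0, 1, 0, 1]
stacked = pd.concat([a, b], axis=0)
# Dropping by label 1 removes x=-1 AND the unrelated row x=6
over_dropped = stacked.drop(stacked[stacked["x"] <= 0].index)

# With fresh labels 0..3, only the matching row is removed
relabeled = pd.concat([a, b], axis=0, ignore_index=True)
correct = relabeled.drop(relabeled[relabeled["x"] <= 0].index)

print(over_dropped["x"].tolist(), correct["x"].tolist())
```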

In [None]:
import gc
import warnings

# Silence library warnings for the rest of the notebook
warnings.simplefilter('ignore')

# Free the per-year frames now that they are combined in cases_3yrs
del cases_2016, cases_2017, cases_2018
gc.collect()

In [None]:
cases_3yrs.info()

In [None]:
cases_3yrs.describe()

In [None]:
cases_3yrs['date_of_decision'].isnull().sum()

In [None]:
cases_3yrs['date_of_filing'].isnull().sum()

*There are null values in "purpose_name" and in "date_of_decision".*

*To ensure a clean and accurate prediction process, any instances with missing values (NaN) in the dataset have been dropped.*

In [None]:
cases_3yrs.dropna(inplace=True)
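
*Called without arguments, `dropna()` removes a row if any column is missing. The sketch below, on made-up rows, contrasts that with restricting the check to selected columns through `subset`; which behaviour is appropriate depends on which columns are actually used as features.*

```python
import pandas as pd

toy = pd.DataFrame({
    "date_of_filing": ["2016-01-05", "2016-02-10", None, "2016-05-01"],
    "date_of_decision": ["2016-03-01", None, "2017-01-01", "2016-09-01"],
    "purpose_name": [None, "hearing", "orders", "evidence"],
})

# Drops every row with a missing value in any column
dropped_any = toy.dropna()
# Drops only rows where one of the two date columns is missing
dropped_dates = toy.dropna(subset=["date_of_filing", "date_of_decision"])

print(len(dropped_any), len(dropped_dates))
```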

*The date and time variables representing the case filing and decision date have been converted to datetime series. This conversion enables us to compute and compare the duration between different dates and times accurately.*

In [None]:
cases_3yrs['date_of_decision'] =  pd.to_datetime(cases_3yrs['date_of_decision'], errors='coerce')
cases_3yrs['date_of_filing'] =  pd.to_datetime(cases_3yrs['date_of_filing'], errors='coerce')
cases_3yrs.info()
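
*With `errors='coerce'`, values that cannot be parsed become `NaT` instead of raising an error, so new missing values can appear after this step. A small sketch, on made-up strings, of counting how many entries were lost to parsing:*

```python
import pandas as pd

raw = pd.Series(["2016-01-05", "2016-13-45", None])
parsed = pd.to_datetime(raw, errors="coerce")

# Entries that were present in the raw data but failed to parse
newly_missing = parsed.isna() & raw.notna()
print(int(newly_missing.sum()))
```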

In [None]:
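# Duration in days between filing and decision (NaN if either date failed to parse)
cases_3yrs['case_duration'] = (cases_3yrs['date_of_decision'] - cases_3yrs['date_of_filing']).dt.days
cases_3yrs.drop(columns=['date_of_filing', 'date_of_decision'], inplace=True)
# Keep only positive durations; a boolean mask also discards NaN durations and
# avoids label-based drops, which misbehave when index labels repeat
cases_3yrs = cases_3yrs[cases_3yrs['case_duration'] > 0].copy()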

cases_3yrs.head()
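
*The five duration categories named in the introduction could be derived from `case_duration` with `pd.cut`. This is only a sketch: the stated ranges share their boundaries (100, 500, ...), so it assumes right-closed intervals, meaning a 100-day case falls into "1-100". The label strings and the sample durations are illustrative, not taken from the dataset.*

```python
import numpy as np
import pandas as pd

durations = pd.Series([1, 100, 101, 750, 1500, 1501], name="case_duration")

# Right-closed bins: (0, 100], (100, 500], (500, 1000], (1000, 1500], (1500, inf)
bins = [0, 100, 500, 1000, 1500, np.inf]
labels = ["1-100", "100-500", "500-1000", "1000-1500", "1500+"]

duration_class = pd.cut(durations, bins=bins, labels=labels)
print(duration_class.tolist())
```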

*Removing the rows with unknown gender and then converting object datatype to integer.*

In [None]:
# Keep only rows where both advocate-gender flags are 0 or 1, then cast to integer
known_gender = cases_3yrs["female_adv_def"].isin([0, 1]) & cases_3yrs["female_adv_pet"].isin([0, 1])
cases_3yrs = cases_3yrs[known_gender].copy()
cases_3yrs[["female_adv_def", "female_adv_pet"]] = cases_3yrs[["female_adv_def", "female_adv_pet"]].astype(int)

In [None]:
cases_3yrs

In [None]:
cases_3yrs.info()

In [None]:
cases_3yrs.to_csv("/kaggle/working/3yrs_cases.csv", index=False)