# *Classification of Case Resolution Time: Categorizing Cases into Time-Based Categories*
*We use a classification approach to categorize a subset of our data by the number of days a case takes to be resolved. The data is divided into five categories: up to 100 days, 101-500 days, 501-1000 days, 1001-1500 days, and more than 1500 days.*

## *Features Used for Classification*
1. State Code: The code representing the state where the case is being heard.
2. District Code: The code indicating the specific district within the state where the case is being heard.
3. Court Number: The number identifying the court where the case is being heard.
4. Judge Position: The position or rank of the judge presiding over the case.
5. Gender of Defendant's Advocate: The gender of the lawyer representing the defendant.
6. Gender of Petitioner's Advocate: The gender of the lawyer or advocate representing the petitioner.
7. Case Type: The type or category of the case (e.g., criminal, civil, family law).
8. Case Purpose: The intended purpose or objective of the case.
9. Disposition Name: The specific name or label representing the case disposition (e.g., conviction, acquittal).
10. Act: The relevant act or legislation associated with the case.
11. Section: The specific section of the act or legislation relevant to the case.
12. Number of Sections IPC: The number of sections of the Indian Penal Code (IPC) applicable to the case (for cases in India).


## *Model Used for Classification*
*We chose a random forest classifier. It trains many decision trees, each on a bootstrap sample of the data and with a random subset of features considered at each split, and combines their votes. This randomness decorrelates the trees, so a random forest handles high-dimensional data well and is less prone to overfitting than a single decision tree.*
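
The overfitting claim can be checked on synthetic data. The sketch below is illustrative only: the dataset comes from `make_classification`, and the names `X_demo`, `y_demo`, `tree` and `forest` are hypothetical, not part of this notebook's pipeline. It compares the train and test accuracy of a single fully grown tree with those of a forest.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (an assumption for illustration, not the court-case data)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=20, n_informative=5, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0
)

# A single unpruned tree versus an ensemble of 100 randomized trees
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

# A large gap between train and test accuracy is a sign of overfitting
for name, model in [("single tree", tree), ("random forest", forest)]:
    print(name, "train:", model.score(X_tr, y_tr), "test:", model.score(X_te, y_te))
```

The exact numbers depend on the data, so the comparison should be read as a diagnostic rather than a guarantee.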

## *Libraries*

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [None]:
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import ConfusionMatrixDisplay

## *Data*
Import the pre-processed data file into the notebook.

In [None]:
cases = pd.read_csv("/kaggle/input/data-preprocessing-2/preprocessed_3yrs_cases.csv")
cases

In [None]:
cases.describe()

#### *The `case_duration` column holds continuous values (in days), so we group them into the five duration categories*

In [None]:
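# Integer labels for the five duration categories, in ascending order:
# 1: (0, 100], 2: (100, 500], 3: (500, 1000], 4: (1000, 1500], 5: (1500, inf) days
categories = [1, 2, 3, 4, 5]
# Bin edges in days; pd.cut builds right-closed intervals by default
bins = [0, 100, 500, 1000, 1500, float('inf')]

# Note: astype(int) raises if any duration falls outside the bins (e.g. 0 or a negative value)
cases['duration_category'] = pd.cut(cases['case_duration'], bins=bins, labels=categories).astype(int)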

cases
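
How `pd.cut` treats values that sit exactly on a bin edge matters for the category boundaries. The following sketch uses hypothetical durations (not taken from the dataset) to show that the intervals are closed on the right, and that a duration of 0 is left unbinned unless `include_lowest=True` is passed.

```python
import pandas as pd

# Hypothetical durations (in days) chosen to sit on or near the bin edges
durations = pd.Series([1, 100, 101, 500, 1500, 1501, 4000])

bins = [0, 100, 500, 1000, 1500, float('inf')]
labels = [1, 2, 3, 4, 5]

# right=True (the default) makes each interval (lower, upper]
binned = pd.cut(durations, bins=bins, labels=labels)
print(binned.astype(int).tolist())

# A duration of exactly 0 falls outside (0, 100] and becomes NaN,
# unless include_lowest=True is passed
zero_default = pd.cut(pd.Series([0]), bins=bins, labels=labels)
zero_included = pd.cut(pd.Series([0]), bins=bins, labels=labels, include_lowest=True)
print(zero_default.isna().tolist(), zero_included.astype(int).tolist())
```

If the data can contain same-day resolutions, `include_lowest=True` is one way to keep them in the first category.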

In [None]:
cases.info()

## *Training*

In [None]:
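# Feature columns to encode. The target 'duration_category' is already an
# integer code (1-5), so it is left unchanged.
columns = ['state_code', 'dist_code', 'court_no', 'judge_position',
       'female_adv_def', 'female_adv_pet', 'type_name', 'purpose_name',
       'disp_name', 'act', 'section', 'number_sections_ipc']

# Keep one fitted encoder per column so that each mapping can be inverted later
encoders = {}

for col in columns:
    encoders[col] = LabelEncoder()
    cases[col] = encoders[col].fit_transform(cases[col])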
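
Label encoding maps each distinct value in a column to an integer, assigned in sorted order. A minimal sketch on hypothetical toy data (the frame `toy` and its values are made up for illustration), keeping one fitted encoder per column so the original values can be recovered:

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy frame with two categorical columns
toy = pd.DataFrame({
    "type_name": ["civil", "criminal", "civil", "family"],
    "disp_name": ["dismissed", "acquitted", "convicted", "dismissed"],
})

# One encoder per column; a single shared encoder would only
# remember the mapping of the last column it was fitted on
encoders = {}
for col in toy.columns:
    encoders[col] = LabelEncoder()
    toy[col] = encoders[col].fit_transform(toy[col])

print(toy)

# Recover the original labels from the integer codes
print(encoders["type_name"].inverse_transform([0, 1, 2]))
```

The integer codes carry no meaningful order, which tree-based models such as random forests tolerate better than linear models do.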

In [None]:
X = cases[['state_code', 'dist_code', 'court_no', 'judge_position',
           'female_adv_def', 'female_adv_pet', 'type_name', 'purpose_name',
           'disp_name', 'act', 'section', 'number_sections_ipc']]
y = cases['duration_category']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=101)
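
When the duration categories are unevenly represented, a plain random split can leave the test set with a different class mix than the training set. One option, which this notebook does not use and which is shown here only as an assumption, is the `stratify` argument of `train_test_split`. The arrays `X_toy` and `y_toy` are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 80 samples of class 1, 20 of class 5
y_toy = np.array([1] * 80 + [5] * 20)
X_toy = np.arange(100).reshape(-1, 1)

# stratify=y_toy keeps the 80/20 class ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=101, stratify=y_toy
)

print("test counts:", int((y_te == 1).sum()), int((y_te == 5).sum()))
print("train counts:", int((y_tr == 1).sum()), int((y_tr == 5).sum()))
```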

In [None]:
classifier = RandomForestClassifier(n_estimators=500, max_depth=5, random_state=101,
                                    class_weight='balanced_subsample')
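
The `class_weight` setting gives rarer classes a larger weight. In the `'balanced'` mode the weight of each class is `n_samples / (n_classes * count_of_class)`; `'balanced_subsample'` applies the same formula to the bootstrap sample of each tree. The sketch below checks the formula against scikit-learn's `compute_class_weight` on a hypothetical target vector `y_toy`.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced target: six samples of class 1, two each of classes 2 and 3
y_toy = np.array([1] * 6 + [2] * 2 + [3] * 2)
classes = np.unique(y_toy)

# Weights as computed by scikit-learn
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)

# The same weights from the formula n_samples / (n_classes * class_count)
counts = np.bincount(y_toy)[classes]
manual = len(y_toy) / (len(classes) * counts)

print(dict(zip(classes.tolist(), weights.round(4).tolist())))
```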

In [None]:
classifier.fit(X_train, y_train)

## *Performance Metrics*

#### *Making predictions on the testing set*

In [None]:
y_pred = classifier.predict(X_test)

#### *Classification Report*

In [None]:
report = classification_report(y_test, y_pred)
print(report)

#### *Score*

In [None]:
# Mean accuracy on the test set, as a percentage
print(f"Accuracy: {classifier.score(X_test, y_test) * 100:.2f}%")
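
The score printed here is plain accuracy, which can look good on imbalanced data even when a rare class is never predicted. As a hedged illustration on hypothetical labels (not the notebook's results), `balanced_accuracy_score` averages the recall of each class and exposes that case:

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Hypothetical labels: eight cases in category 1, two in category 5
y_true_toy = [1] * 8 + [5] * 2
# A degenerate model that always predicts the majority category
y_pred_toy = [1] * 10

acc = accuracy_score(y_true_toy, y_pred_toy)
bal_acc = balanced_accuracy_score(y_true_toy, y_pred_toy)

print("accuracy:", acc)
print("balanced accuracy:", bal_acc)
```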

#### *Confusion Matrix*

In [None]:
print(confusion_matrix(y_test, y_pred))
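
In scikit-learn's confusion matrix, rows are the true classes and columns are the predicted classes. The per-class recall and precision in the classification report follow directly from it, as this sketch on hypothetical labels shows:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted categories
y_true_toy = [1, 1, 1, 2, 2, 3]
y_pred_toy = [1, 1, 2, 2, 3, 3]

cm = confusion_matrix(y_true_toy, y_pred_toy)
print(cm)

# Recall: correct predictions divided by the row total (all true members of the class)
recall = cm.diagonal() / cm.sum(axis=1)
# Precision: correct predictions divided by the column total (all predictions of the class)
precision = cm.diagonal() / cm.sum(axis=0)

print("recall:", recall.round(3))
print("precision:", precision.round(3))
```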

## *Tuning the parameters of the `RandomForestClassifier`*
*We use `GridSearchCV` from scikit-learn to run a grid search over the parameter grid. The `param_grid` dictionary lists the values to try for each parameter, and every combination of them is evaluated. The `cv` parameter of `GridSearchCV` sets the number of cross-validation folds used to score each combination.*

In [None]:
# Defining the parameter grid to search over
param_grid = {
    'n_estimators': [300, 400, 500, 600], 
    'max_depth': [4, 5, 8],     
}
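
The cost of a grid search grows with the number of parameter combinations times the number of folds. A small sketch, assuming the same grid as above and 5 folds, counts the model fits with scikit-learn's `ParameterGrid`:

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the notebook, repeated here so the example is self-contained
param_grid = {
    'n_estimators': [300, 400, 500, 600],
    'max_depth': [4, 5, 8],
}

# Every combination of the listed values is one candidate
n_candidates = len(ParameterGrid(param_grid))

# Each candidate is fitted once per cross-validation fold
cv_folds = 5
total_fits = n_candidates * cv_folds

print(n_candidates, "candidates,", total_fits, "fits")
```

With the default `refit=True`, one more fit is run at the end to train the best candidate on the full training set.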

In [None]:
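# Base estimator for the search. Unlike the first model above, class_weight is left
# at its default here, so the two accuracy scores are not strictly like-for-like.
classifier = RandomForestClassifier(random_state=101)

# Performing grid search using 5-fold cross-validation
grid_search = GridSearchCV(classifier, param_grid, cv=5, verbose=3)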

In [None]:
grid_search.fit(X_train, y_train)

In [None]:
print("Best Parameters:", grid_search.best_params_)
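
By default `GridSearchCV` ranks candidates by the estimator's own score, which is accuracy for a classifier. On imbalanced classes another criterion may be preferable. The sketch below is an assumption for illustration only: it uses synthetic data, a deliberately small grid, and `scoring='f1_macro'`, none of which appear in the notebook's own search.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic three-class data with an uneven class mix (hypothetical)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=10, n_informative=4,
    n_classes=3, weights=[0.7, 0.2, 0.1], random_state=0,
)

# A small grid keeps the example fast; scoring='f1_macro' weighs all classes equally
search = GridSearchCV(
    RandomForestClassifier(random_state=0, class_weight='balanced_subsample'),
    {'n_estimators': [20, 40], 'max_depth': [3, 5]},
    cv=3,
    scoring='f1_macro',
)
search.fit(X_demo, y_demo)

print("Best Parameters:", search.best_params_)
print("Best macro F1:", search.best_score_)
```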

### *Using the best model for prediction*

In [None]:
best_classifier = grid_search.best_estimator_
y_pred = best_classifier.predict(X_test)

## *Performance Metrics of the Tuned Model*

#### *Classification Report*

In [None]:
report = classification_report(y_test, y_pred, zero_division=0)
print(report)

#### *Score*

In [None]:
# Mean accuracy of the tuned model on the test set, as a percentage
print(f"Accuracy: {best_classifier.score(X_test, y_test) * 100:.2f}%")

#### *Confusion Matrix*

In [None]:
print(confusion_matrix(y_test, y_pred))

In [None]:
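fig, axs = plt.subplots(figsize=(10,10))
# 'mako' is a seaborn colormap; it is registered with matplotlib when seaborn is imported
# The result is named 'disp' to avoid shadowing IPython's built-in display()
disp = ConfusionMatrixDisplay.from_predictions(y_test, y_pred, cmap='mako', ax=axs)

plt.savefig('confusion_matrix.png')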

#### *Our first model reaches an accuracy of about 70% in classifying resolution time, and tuning the parameters with grid search raises this to about 79%. The tuned model was trained without the first model's `class_weight` setting, so part of the gain may come from that change rather than from `n_estimators` and `max_depth` alone. Accuracy by itself can hide weak performance on the less frequent categories, so the per-class precision and recall in the classification report should be read alongside it. The model would likely improve further with more training data, while a larger test set would make the accuracy estimate itself more reliable.*