## Classification Problem

The given dataset can be used to train a model that predicts whether a case was dismissed.
This is a binary classification problem: the model outputs "Yes" if the case was dismissed and "No" if it was not.
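
As a rough illustration of what such a target could look like, the sketch below derives a "Yes"/"No" label from a disposition column. The column name `disp_name_s` is taken from the disposition key loaded later in this notebook; the literal label `"dismissed"` and the toy rows are assumptions made for illustration, not values confirmed by the dataset.

```python
import numpy as np
import pandas as pd

# Toy rows; the disposition labels here are illustrative assumptions
toy = pd.DataFrame({"disp_name_s": ["dismissed", "allowed", "dismissed", "withdrawn"]})

# Binary target: "Yes" if the case was dismissed, "No" otherwise
toy["dismissed"] = np.where(toy["disp_name_s"] == "dismissed", "Yes", "No")
print(toy)
```

Keeping the label as a separate column leaves the original disposition text available for inspection.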

We use the scikit-learn and seaborn libraries; if they aren't installed, run the following cell.

In [1]:
%pip install -U scikit-learn seaborn

Note: you may need to restart the kernel to use updated packages.


In [1]:
# Standard Imports
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt
import pickle

# Transformers
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler, MinMaxScaler

# Modeling Evaluation
from sklearn.model_selection import train_test_split, cross_val_score, KFold, GridSearchCV
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report
from IPython.display import display, Markdown

# Pipelines
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer

# Machine Learning
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

In [2]:
df = pd.read_csv("../cases/mixed_sample.csv")
df.head()

Unnamed: 0.1,Unnamed: 0,ddl_case_id,year,state_code,dist_code,court_no,cino,judge_position,female_defendant,female_petitioner,female_adv_def,female_adv_pet,type_name,purpose_name,disp_name,date_of_filing,date_of_decision,date_first_list,date_last_list,date_next_list
0,0,01-01-01-201908000172010,2010,1,1,1,MHNB030000482010,chief judicial magistrate,0 male,-9998 unclear,-9999,0,1429.0,4946.0,25,2010-01-20,2012-10-19,2010-04-20,2012-09-13,2012-10-19
1,1,01-01-01-201908000292010,2010,1,1,1,MHNB030001902010,chief judicial magistrate,0 male,1 female,0,1,1429.0,3006.0,30,2010-02-25,2010-12-01,2010-03-18,2010-11-29,2010-12-01
2,2,01-01-01-201908000602010,2010,1,1,1,MHNB030004262010,chief judicial magistrate,-9998 unclear,0 male,-9998,1,1429.0,509.0,25,2010-05-06,2010-05-26,2010-05-10,2010-05-25,2010-05-26
3,3,01-01-01-201908000742010,2010,1,1,1,MHNB030005682010,chief judicial magistrate,-9998 unclear,0 male,-9999,0,1429.0,2237.0,30,2010-05-29,2013-04-29,2010-06-17,2013-04-12,2013-04-29
4,4,01-01-01-201908001032010,2010,1,1,1,MHNB030006862010,chief judicial magistrate,0 male,1 female,-9999,-9998,1429.0,4882.0,25,2010-07-08,2010-11-20,2010-08-10,2010-10-30,2010-11-20


In [3]:
state_frame = pd.read_csv("../keys/cases_state_key.csv")[['year', 'state_code', 'state_name']]
district_frame = pd.read_csv("../keys/cases_district_key.csv")[['year', 'state_code', 'dist_code', 'district_name']]
disp_frame = pd.read_csv("../keys/disp_name_key.csv")[['year', 'disp_name', 'disp_name_s']]

In [5]:
df_merged = pd.merge(df, state_frame, on=['year','state_code'], how='left')
df_merged = pd.merge(df_merged, district_frame, on=['year', 'state_code', 'dist_code'], how='left')
df_merged = pd.merge(df_merged, disp_frame, on=['disp_name', 'year'], how='left')
df_merged.shape

(1350000, 23)
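
A left merge against a key table can silently duplicate rows if the key table contains repeated keys, and it leaves NaN where no match exists. The following is a minimal sketch, on toy data, of how pandas' `validate` and `indicator` options could be used to check both conditions. The toy frames and their values are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-ins for the case table and the state key (values are illustrative)
cases = pd.DataFrame({"year": [2010, 2010, 2011], "state_code": [1, 2, 9]})
state_key = pd.DataFrame({
    "year": [2010, 2010],
    "state_code": [1, 2],
    "state_name": ["State A", "State B"],
})

# validate="many_to_one" raises MergeError if the key table has duplicate keys,
# which would otherwise silently duplicate case rows
merged = pd.merge(cases, state_key, on=["year", "state_code"], how="left",
                  validate="many_to_one", indicator=True)

# Rows tagged "left_only" found no match in the key table
unmatched = (merged["_merge"] == "left_only").sum()
print(len(merged) == len(cases), unmatched)
```

If the row count after the merge equals the row count before it, no case was duplicated.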

In [6]:
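def clean_gender_1(row):
    """Map a raw gender label to 0 (male/nonfemale), 1 (female), or NaN (anything else)."""
    if row == '0 male' or row == '0 nonfemale':
        return 0
    elif row == '1 female':
        return 1
    # Any other label (e.g. '-9998 unclear') is treated as missing.
    # np.nan is used because the np.NaN alias was removed in NumPy 2.0.
    return np.nan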

df_merged.female_defendant = [clean_gender_1(row) for row in df_merged.female_defendant]
df_merged.female_petitioner = [clean_gender_1(row) for row in df_merged.female_petitioner] 
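
On 1.35 million rows, a Python-level loop over each value can be slow. One possible alternative, sketched here on toy data, is `Series.map` with a dictionary: labels absent from the dictionary become NaN, which matches the fallback branch of `clean_gender_1`. The name `gender_map` is hypothetical.

```python
import pandas as pd

# Labels handled by clean_gender_1; anything not listed maps to NaN
gender_map = {"0 male": 0, "0 nonfemale": 0, "1 female": 1}

raw = pd.Series(["0 male", "1 female", "-9998 unclear", "0 nonfemale"])
cleaned = raw.map(gender_map)
print(cleaned.tolist())
```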

In [7]:
df_merged.date_of_decision = pd.to_datetime(df_merged.date_of_decision, errors='coerce')
df_merged.date_of_filing = pd.to_datetime(df_merged.date_of_filing, errors='coerce')
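
With `errors='coerce'`, values that cannot be parsed become `NaT` instead of raising an error, so it can be worth counting how many dates were lost. A small sketch on made-up values:

```python
import pandas as pd

raw_dates = pd.Series(["2010-01-20", "not a date", None])

# Unparseable or missing entries become NaT instead of raising
parsed = pd.to_datetime(raw_dates, errors="coerce")

# Number of values that could not be parsed (or were already missing)
n_missing = parsed.isna().sum()
print(n_missing)
```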

In [8]:
df_merged['filing_year'] = df_merged.date_of_filing.dt.year
df_merged['filing_month'] = df_merged.date_of_filing.dt.month
df_merged['filing_day'] = df_merged.date_of_filing.dt.day
df_merged['decision_year'] = df_merged.date_of_decision.dt.year
df_merged['decision_month'] = df_merged.date_of_decision.dt.month
df_merged['decision_day'] = df_merged.date_of_decision.dt.day
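
When a datetime column contains `NaT`, the `.dt` accessors return floats, which is the likely reason the decision columns print as `2010.0` further down while the filing columns print as `2010`. The sketch below, on toy values, shows the effect and one optional way to keep whole numbers using pandas' nullable `Int64` dtype. Whether that conversion suits the later modeling steps is an assumption.

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(["2010-12-01", None]))

# With a missing date present, .dt.year is float64 (2010.0, NaN)
year_float = dates.dt.year

# Optional: the nullable integer dtype keeps whole numbers and <NA>
year_int = year_float.astype("Int64")
print(year_float.dtype, year_int.dtype)
```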

In [9]:
columns=['year', 'state_name', 'district_name', 'female_defendant', 'female_petitioner', 'judge_position',
        'filing_year', 'filing_month', 'filing_day', 'decision_year', 'decision_month', 'decision_day']

numerical_columns = ['year', 'female_defendant', 'female_petitioner',
        'filing_year', 'filing_month', 'filing_day', 'decision_year', 'decision_month', 'decision_day']

categorical_columns = ['state_name', 'district_name', 'judge_position']
df_merged = df_merged[columns]
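
Splitting the features into numerical and categorical lists suggests that each group will be preprocessed differently. The following is a minimal sketch of one way such lists might feed a `ColumnTransformer`, using a toy frame with a subset of the columns. The choice of `StandardScaler` and `OneHotEncoder`, and the names `num_cols`, `cat_cols` and `preprocess`, are assumptions rather than the notebook's confirmed pipeline.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame with a subset of the columns; values are illustrative
toy = pd.DataFrame({
    "filing_year": [2010, 2010, 2011],
    "filing_month": [1, 2, 5],
    "state_name": ["State A", "State A", "State B"],
})
num_cols = ["filing_year", "filing_month"]
cat_cols = ["state_name"]

preprocess = ColumnTransformer([
    ("num", StandardScaler(), num_cols),                        # scale numeric features
    ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),  # one column per category
])

features = preprocess.fit_transform(toy)
print(features.shape)
```

Setting `handle_unknown="ignore"` lets the encoder accept categories at prediction time that were not seen during fitting.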

In [11]:
df_merged.shape

(1350000, 12)

In [11]:
df_merged = df_merged.dropna()
df_merged.shape

(151260, 12)
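
Dropping every row with a missing value reduces the data from 1,350,000 to 151,260 rows, so about 89% of cases are discarded. Before accepting that loss, it may help to see which columns account for the missing values. A minimal sketch on toy data (the values are made up):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "female_defendant": [0, np.nan, 1, np.nan],
    "female_petitioner": [np.nan, 0, 1, np.nan],
    "filing_year": [2010, 2010, 2011, 2012],
})

# Fraction of missing values per column, largest first
missing_share = toy.isna().mean().sort_values(ascending=False)

# Rows that survive dropping any row with a missing value
rows_kept = len(toy.dropna())
print(missing_share)
print(rows_kept)
```

If one or two columns account for most of the loss, imputing or excluding them could be an alternative to dropping rows.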

In [12]:
df_merged.head()

Unnamed: 0,year,state_name,district_name,female_defendant,female_petitioner,judge_position,filing_year,filing_month,filing_day,decision_year,decision_month,decision_day
0,2010,Maharashtra,Nandurbar,,,chief judicial magistrate,2010,2,25,2010.0,11.0,21.0
1,2010,Maharashtra,Nandurbar,,0.0,civil judge senior division,2010,2,25,2010.0,11.0,21.0
2,2010,Maharashtra,Nandurbar,0.0,,civil judge junior division,2010,2,4,2013.0,3.0,3.0
3,2010,Maharashtra,Nandurbar,0.0,,civil judge junior division,2010,4,20,2013.0,4.0,21.0
4,2010,Maharashtra,Dhule,1.0,0.0,district and sessions court,2010,1,19,2013.0,11.0,23.0
