# **Classification Problem**
## This notebook predicts the case verdict as one of three classes (convicted, acquitted and others) from the case type (which depends on the crime), the district and state (as a proxy for the prevalent beliefs and constructs of the region), the genders of the judge, defendant and advocates (to surface any gender bias) and the position of the judge. To keep the running time per notebook low, I separated neural network training from the traditional machine learning models. This notebook trains the traditional ML models: Categorical Naive Bayes, K-Nearest Neighbours, Decision Tree and Random Forest. Random Forest achieved the highest accuracy, 0.8503.

## **Importing relevant modules**

In [1]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import CategoricalNB
from sklearn.metrics import classification_report

## **Reading the case CSVs and concatenating them to a single dataframe**

In [2]:
case=pd.read_csv("/kaggle/input/precog-cases/cases_2010.csv")
for i in range(1,5):
    case_year=pd.read_csv("/kaggle/input/precog-cases/cases_201%s.csv" %i)
    case=pd.concat([case, case_year])
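
The loop above grows the dataframe one file at a time. A minimal alternative sketch, assuming the same `cases_<year>.csv` naming, collects the frames in a list and concatenates once. The helper name `load_cases` and the tiny stand-in files are hypothetical and exist only so the example runs outside Kaggle.

```python
import tempfile
from pathlib import Path

import pandas as pd


def load_cases(data_dir, years):
    """Read one cases_<year>.csv per year and concatenate them in a single call."""
    frames = [pd.read_csv(Path(data_dir) / f"cases_{year}.csv") for year in years]
    # ignore_index=True gives the combined frame a fresh 0..n-1 index
    return pd.concat(frames, ignore_index=True)


# Tiny stand-in files (hypothetical data) so the sketch is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    for year in range(2010, 2013):
        pd.DataFrame(
            {"ddl_case_id": [f"{year}-a", f"{year}-b"], "year": year}
        ).to_csv(Path(tmp) / f"cases_{year}.csv", index=False)
    case_demo = load_cases(tmp, range(2010, 2013))

print(case_demo.shape)
```

On Kaggle the call would presumably look like `load_cases("/kaggle/input/precog-cases", range(2010, 2015))`.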
    

## **Reading the Judge related CSVs**

In [3]:
# getting judges per case data
judges= pd.read_csv("/kaggle/input/keys-precog/judge_case_merge_key.csv")
# getting the gender of judges
judge_gender= pd.read_csv("/kaggle/input/judges-clean/judges_clean.csv")

In [4]:
#renaming judge_gender column name for merging later
judge_gender = judge_gender.rename(columns={'ddl_judge_id': 'ddl_decision_judge_id'})
judge_gender

Unnamed: 0,ddl_decision_judge_id,state_code,dist_code,court_no,judge_position,female_judge,start_date,end_date
0,1,1,1,1,chief judicial magistrate,0 nonfemale,20-09-2013,20-02-2014
1,2,1,1,1,chief judicial magistrate,0 nonfemale,31-10-2013,20-02-2014
2,3,1,1,1,chief judicial magistrate,0 nonfemale,21-02-2014,31-05-2016
3,4,1,1,1,chief judicial magistrate,0 nonfemale,01-06-2016,06-06-2016
4,5,1,1,1,chief judicial magistrate,0 nonfemale,06-06-2016,07-07-2018
...,...,...,...,...,...,...,...,...
98473,98474,30,2,9,criminal cases,1 female,21-04-2004,14-11-2013
98474,98475,30,2,9,criminal cases,1 female,16-01-2015,16-01-2016
98475,98476,30,2,9,criminal cases,1 female,09-12-2016,31-07-2017
98476,98477,30,2,10,criminal cases,1 female,15-05-2017,28-01-2019


In [5]:
#merging judges and case database
judge_case=pd.merge(judges,case,on='ddl_case_id')
# Merging judge_case with acts_sections was skipped, since many acts
# didn't have enough data points to train on:
# judge_case_act=pd.merge(judge_case,acts_sections,on='ddl_case_id')
judge_case


Unnamed: 0,ddl_case_id,ddl_filing_judge_id,ddl_decision_judge_id,year,state_code,dist_code,court_no,cino,judge_position,female_defendant,...,female_adv_def,female_adv_pet,type_name,purpose_name,disp_name,date_of_filing,date_of_decision,date_first_list,date_last_list,date_next_list
0,01-01-01-201908000012013,50.0,50.0,2013,1,1,1,MHNB030000112013,chief judicial magistrate,0 male,...,-9999,0,1919.0,7062.0,25,2013-01-01,2013-01-01,2013-01-01,2013-01-01,2013-01-01
1,01-01-01-201908000022014,92.0,93.0,2014,1,1,1,MHNB030000982014,chief judicial magistrate,-9998 unclear,...,-9999,-9998,1907.0,5487.0,25,2014-01-03,2014-10-28,2014-01-16,2014-09-30,2014-10-28
2,01-01-01-201908000052013,50.0,50.0,2013,1,1,1,MHNB030000212013,chief judicial magistrate,0 male,...,-9999,0,1919.0,5148.0,25,2013-01-02,2013-01-02,2013-01-02,2013-01-02,2013-01-02
3,01-01-01-201908000052014,92.0,93.0,2014,1,1,1,MHNB030001302014,chief judicial magistrate,0 male,...,-9999,0,1907.0,4366.0,25,2014-01-02,2014-06-10,2014-02-12,2014-05-29,2014-06-10
4,01-01-01-201908000062012,92.0,92.0,2012,1,1,1,MHNB030000182012,chief judicial magistrate,-9998 unclear,...,-9999,1,1849.0,3035.0,25,2012-01-05,2012-01-12,2012-01-09,2012-01-09,2012-01-12
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3100892,30-02-06-201400000992011,,98452.0,2011,30,2,6,GASG020016602011,criminal cases,0 male,...,-9999,-9999,2870.0,4126.0,30,2011-04-19,2013-02-18,2011-07-26,2013-02-14,2013-02-18
3100893,30-02-06-201400001002011,,98462.0,2011,30,2,6,GASG020016612011,criminal cases,0 male,...,-9999,-9999,2870.0,423.0,33,2011-04-19,2016-12-13,2011-07-14,2016-12-13,2016-12-13
3100894,30-02-06-201400001042011,,98364.0,2011,30,2,6,GASG020017162011,criminal cases,0 male,...,0,-9999,2870.0,1547.0,26,2011-04-26,,2011-06-06,2019-01-08,2019-03-07
3100895,30-02-06-201400001092011,,98452.0,2011,30,2,6,GASG020018622011,criminal cases,0 male,...,-9999,-9999,2870.0,4126.0,30,2011-05-03,2012-08-30,2011-07-14,2012-08-21,2012-08-30


In [6]:
#merging dataframes
judge_case=pd.merge(judge_case,judge_gender,on='ddl_decision_judge_id')
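
`pd.merge` defaults to an inner join, so cases whose `ddl_decision_judge_id` has no match in the judge table are dropped without any message. Judging by the row indices displayed in this notebook, the merged frame shrinks from roughly 3.10 million to roughly 2.98 million rows. A minimal sketch of how the unmatched rows could be counted, using toy frames with hypothetical values:

```python
import pandas as pd

# Toy stand-ins for the real tables (hypothetical values).
judges_demo = pd.DataFrame(
    {"ddl_case_id": ["a", "b", "c"], "ddl_decision_judge_id": [1, 2, 99]}
)
judge_info_demo = pd.DataFrame(
    {
        "ddl_decision_judge_id": [1, 2, 3],
        "female_judge": ["0 nonfemale", "1 female", "0 nonfemale"],
    }
)

merged = pd.merge(
    judges_demo,
    judge_info_demo,
    on="ddl_decision_judge_id",
    how="left",              # keep every case, matched or not
    indicator=True,          # adds a _merge column: both / left_only
    validate="many_to_one",  # raises if a judge id is duplicated on the right
)
unmatched = int((merged["_merge"] == "left_only").sum())
print(f"{unmatched} of {len(merged)} cases have no judge record")
```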

## **Checking the above process**

In [7]:
judge_case

Unnamed: 0,ddl_case_id,ddl_filing_judge_id,ddl_decision_judge_id,year,state_code_x,dist_code_x,court_no_x,cino,judge_position_x,female_defendant,...,date_first_list,date_last_list,date_next_list,state_code_y,dist_code_y,court_no_y,judge_position_y,female_judge,start_date,end_date
0,01-01-01-201908000012013,50.0,50.0,2013,1,1,1,MHNB030000112013,chief judicial magistrate,0 male,...,2013-01-01,2013-01-01,2013-01-01,1,1,2,chief judicial magistrate,0 nonfemale,01-10-2011,10-06-2013
1,01-01-01-201908000052013,50.0,50.0,2013,1,1,1,MHNB030000212013,chief judicial magistrate,0 male,...,2013-01-02,2013-01-02,2013-01-02,1,1,2,chief judicial magistrate,0 nonfemale,01-10-2011,10-06-2013
2,01-01-01-201908000092012,50.0,50.0,2012,1,1,1,MHNB030000552012,chief judicial magistrate,-9998 unclear,...,2012-01-21,2012-02-01,2012-02-02,1,1,2,chief judicial magistrate,0 nonfemale,01-10-2011,10-06-2013
3,01-01-01-201908000132013,50.0,50.0,2013,1,1,1,MHNB030001882013,chief judicial magistrate,-9998 unclear,...,2013-02-04,2013-02-20,2013-02-28,1,1,2,chief judicial magistrate,0 nonfemale,01-10-2011,10-06-2013
4,01-01-01-201908000192012,50.0,50.0,2012,1,1,1,MHNB030001272012,chief judicial magistrate,0 male,...,2012-02-02,2012-02-02,2012-02-02,1,1,2,chief judicial magistrate,0 nonfemale,01-10-2011,10-06-2013
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2979524,30-02-06-201400000682011,,98452.0,2011,30,2,6,GASG020010982011,criminal cases,0 male,...,2011-04-19,2012-06-16,2012-07-07,30,2,6,criminal cases,0 nonfemale,04-06-2012,31-05-2013
2979525,30-02-06-201400000992011,,98452.0,2011,30,2,6,GASG020016602011,criminal cases,0 male,...,2011-07-26,2013-02-14,2013-02-18,30,2,6,criminal cases,0 nonfemale,04-06-2012,31-05-2013
2979526,30-02-06-201400001092011,,98452.0,2011,30,2,6,GASG020018622011,criminal cases,0 male,...,2011-07-14,2012-08-21,2012-08-30,30,2,6,criminal cases,0 nonfemale,04-06-2012,31-05-2013
2979527,30-02-06-201400001242011,,98452.0,2011,30,2,6,GASG020020212011,criminal cases,0 male,...,2011-07-26,2012-08-07,2012-08-28,30,2,6,criminal cases,0 nonfemale,04-06-2012,31-05-2013


## **Choosing the relevant columns and dropping null values**

In [8]:
judge_case=judge_case[['ddl_decision_judge_id','type_name','judge_position_x','state_code_x','dist_code_x','female_judge','female_defendant','female_adv_def','female_adv_pet','disp_name']]
judge_case=judge_case.dropna()
display(judge_case)

Unnamed: 0,ddl_decision_judge_id,type_name,judge_position_x,state_code_x,dist_code_x,female_judge,female_defendant,female_adv_def,female_adv_pet,disp_name
0,50.0,1919.0,chief judicial magistrate,1,1,0 nonfemale,0 male,-9999,0,25
1,50.0,1919.0,chief judicial magistrate,1,1,0 nonfemale,0 male,-9999,0,25
2,50.0,1849.0,chief judicial magistrate,1,1,0 nonfemale,-9998 unclear,0,0,25
3,50.0,1919.0,chief judicial magistrate,1,1,0 nonfemale,-9998 unclear,-9999,0,25
4,50.0,1849.0,chief judicial magistrate,1,1,0 nonfemale,0 male,-9999,0,25
...,...,...,...,...,...,...,...,...,...,...
2979524,98452.0,2870.0,criminal cases,30,2,0 nonfemale,0 male,-9999,-9999,33
2979525,98452.0,2870.0,criminal cases,30,2,0 nonfemale,0 male,-9999,-9999,30
2979526,98452.0,2870.0,criminal cases,30,2,0 nonfemale,0 male,-9999,-9999,30
2979527,98452.0,2870.0,criminal cases,30,2,0 nonfemale,0 male,-9999,-9999,33


## **Dividing data into 3 labels-- convicted, acquitted and others**

In [9]:
#collapsing every disposition code other than 4 and 19 (acquitted and convicted) into class 1 ('others')
judge_case['disp_name'] = judge_case['disp_name'].mask(judge_case['disp_name'] < 4, 1)
judge_case['disp_name'] = judge_case['disp_name'].mask(judge_case['disp_name'] >19, 1)
judge_case['disp_name'] = judge_case['disp_name'].mask((judge_case['disp_name'] < 19) & (judge_case['disp_name'] > 4), 1)
display(judge_case.head())

Unnamed: 0,ddl_decision_judge_id,type_name,judge_position_x,state_code_x,dist_code_x,female_judge,female_defendant,female_adv_def,female_adv_pet,disp_name
0,50.0,1919.0,chief judicial magistrate,1,1,0 nonfemale,0 male,-9999,0,1
1,50.0,1919.0,chief judicial magistrate,1,1,0 nonfemale,0 male,-9999,0,1
2,50.0,1849.0,chief judicial magistrate,1,1,0 nonfemale,-9998 unclear,0,0,1
3,50.0,1919.0,chief judicial magistrate,1,1,0 nonfemale,-9998 unclear,-9999,0,1
4,50.0,1849.0,chief judicial magistrate,1,1,0 nonfemale,0 male,-9999,0,1
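
The three `mask` calls above keep codes 4 and 19 and map everything else to 1. For integer codes the same result can be written in one step with `where` and `isin`; this is a sketch on a toy series, not the code that produced the output above.

```python
import pandas as pd

# Toy disposition codes (hypothetical values).
disp = pd.Series([25, 4, 19, 2, 33, 10, 19], name="disp_name")

# Keep 4 and 19 where the condition holds; replace every other code with 1.
collapsed = disp.where(disp.isin([4, 19]), 1)
print(collapsed.value_counts().sort_index())
```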


In [10]:
#label-encoding the categorical columns as integers
from sklearn import preprocessing
le = preprocessing.LabelEncoder()

judge_case['judge_position_x'] = le.fit_transform(judge_case['judge_position_x'])
judge_case['female_defendant'] = le.fit_transform(judge_case['female_defendant'])
judge_case['female_adv_def'] = le.fit_transform(judge_case['female_adv_def'])
judge_case['female_adv_pet'] = le.fit_transform(judge_case['female_adv_pet'])
judge_case['female_judge'] = le.fit_transform(judge_case['female_judge'])
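
Reusing one `LabelEncoder` works here, but each `fit_transform` overwrites the previous mapping, so only the last column can be decoded afterwards. A minimal sketch of an alternative, `OrdinalEncoder`, which encodes several feature columns at once and keeps every mapping in `categories_`. The toy frame and its values are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy frame with two of the notebook's categorical columns (hypothetical rows).
demo = pd.DataFrame(
    {
        "judge_position_x": [
            "chief judicial magistrate",
            "criminal cases",
            "criminal cases",
        ],
        "female_defendant": ["0 male", "-9998 unclear", "0 male"],
    }
)
cat_cols = ["judge_position_x", "female_defendant"]

encoder = OrdinalEncoder()
# Each column gets its own integer codes, assigned in sorted category order.
demo[cat_cols] = encoder.fit_transform(demo[cat_cols]).astype(int)

# The mapping for every column stays available for decoding later.
for col, cats in zip(cat_cols, encoder.categories_):
    print(col, list(cats))
```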


## **Checking the above process**

In [11]:
display(judge_case[judge_case.disp_name==19])

Unnamed: 0,ddl_decision_judge_id,type_name,judge_position_x,state_code_x,dist_code_x,female_judge,female_defendant,female_adv_def,female_adv_pet,disp_name
34,50.0,5487.0,48,1,1,1,2,0,0,19
35,50.0,5487.0,48,1,1,1,2,0,0,19
36,50.0,5487.0,48,1,1,1,2,0,0,19
37,50.0,5487.0,48,1,1,1,2,0,0,19
38,50.0,5487.0,48,1,1,1,2,0,0,19
...,...,...,...,...,...,...,...,...,...,...
2978683,98032.0,820.0,79,29,10,2,2,1,1,19
2978753,98032.0,915.0,79,29,10,2,3,0,1,19
2979101,98032.0,682.0,79,29,10,2,2,0,1,19
2979283,98030.0,915.0,79,29,10,1,3,2,1,19


## **Categorical Naive Bayes**

In [12]:
from sklearn.naive_bayes import CategoricalNB

X = judge_case.drop(columns=['disp_name'],axis=1)
y = judge_case.disp_name
# note: train_size=0.3 trains on 30% of the rows and tests on the remaining 70%
X_train,X_test,y_train,y_test = train_test_split(X,y,train_size=0.3,random_state=101, stratify=y)
#
# Fit Categorical Naive Bayes on the training set
#
CNB = CategoricalNB()
CNB.fit(X_train,y_train)
#
# Predict for test set
#
y_pred = CNB.predict(X_test)
print(classification_report(y_test,y_pred))



              precision    recall  f1-score   support

           1       0.92      0.86      0.89   1665812
           4       0.48      0.61      0.54    339902
          19       0.32      0.38      0.35     79956

    accuracy                           0.80   2085670
   macro avg       0.57      0.62      0.59   2085670
weighted avg       0.82      0.80      0.81   2085670
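
`CategoricalNB` treats each feature value as a non-negative category index and, by default, sizes its count tables from the largest index seen during `fit`. If the test set contains a larger index, `predict` can fail. The sketch below shows the `min_categories` parameter, which reserves room for categories that are absent from the training rows. The toy arrays are hypothetical and are not the notebook's data.

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB

# Feature 0 only takes the values 0 and 1 in training (hypothetical data).
X_train_demo = np.array([[0, 1], [1, 0], [1, 1], [0, 0]])
y_train_demo = np.array([1, 4, 4, 1])

# Category 2 of feature 0 never appears in training.
X_test_demo = np.array([[2, 1]])

# Reserve 3 categories for feature 0 and 2 for feature 1.
nb = CategoricalNB(min_categories=[3, 2])
nb.fit(X_train_demo, y_train_demo)

pred = nb.predict(X_test_demo)
print(nb.n_categories_, pred)
```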



## **Reporting Accuracy**

In [13]:
from sklearn import metrics
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

Accuracy: 0.8017318175933873


## **K-Nearest Neighbours Algorithms**

In [14]:
#Import knearest neighbors Classifier model
from sklearn.neighbors import KNeighborsClassifier
#Create KNN Classifier
knn = KNeighborsClassifier(n_neighbors=5)

#Train the model using the training sets
knn.fit(X_train, y_train)

#Predict the response for test dataset
y_pred = knn.predict(X_test)

## Reporting performance and accuracy

In [15]:
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           1       0.89      0.93      0.91   1665812
           4       0.58      0.51      0.54    339902
          19       0.59      0.30      0.39     79956

    accuracy                           0.84   2085670
   macro avg       0.69      0.58      0.62   2085670
weighted avg       0.83      0.84      0.83   2085670



In [16]:
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

Accuracy: 0.8411076536556598


## **Decision Tree Classifier and corresponding accuracy**

In [17]:
from sklearn.tree import DecisionTreeClassifier
d=DecisionTreeClassifier()
d.fit(X_train,y_train)
y_pred=d.predict(X_test)

print(metrics.accuracy_score(y_test,y_pred))

0.8401012624240651


In [18]:
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           1       0.89      0.93      0.91   1665812
           4       0.59      0.51      0.55    339902
          19       0.48      0.32      0.38     79956

    accuracy                           0.84   2085670
   macro avg       0.65      0.59      0.61   2085670
weighted avg       0.83      0.84      0.83   2085670



## **Random Forest Classifier**

## Dividing the dataset into test and train

In [19]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3) # 70% training and 30% test
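
This split differs from the one used for the earlier models: it holds out 30% for testing instead of 70%, and it sets neither `random_state` nor `stratify`. A minimal sketch of a single seeded, stratified split that all four models could share, assuming a like-for-like comparison is wanted. The synthetic data is hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the real features and labels.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 5, size=(1000, 3))
y_demo = rng.choice([1, 4, 19], size=1000, p=[0.8, 0.16, 0.04])

# One split, reused by every model: 70% train, 30% test, fixed seed,
# with class proportions preserved in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=101, stratify=y_demo
)

print(len(X_tr), len(X_te))
print("share of class 1:", (y_tr == 1).mean().round(3), (y_te == 1).mean().round(3))
```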

## Training the model

In [20]:
#Import Random Forest Model
from sklearn.ensemble import RandomForestClassifier

#Create a Random Forest classifier with 100 trees
clf=RandomForestClassifier(n_estimators=100)

#Train the model using the training set
clf.fit(X_train,y_train)

y_pred=clf.predict(X_test)

## Reporting Accuracy

In [21]:
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))

Accuracy: 0.8502817558473987


In [22]:
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           1       0.90      0.94      0.92    713547
           4       0.61      0.55      0.58    145888
          19       0.54      0.33      0.41     34424

    accuracy                           0.85    893859
   macro avg       0.68      0.61      0.64    893859
weighted avg       0.84      0.85      0.84    893859



## **The highest accuracy, 0.8503, was achieved with the Random Forest Classifier.**
## Precision and recall are both much higher for label 1 than for labels 4 and 19.


### *Precision is the ratio of correctly classified positive samples (true positives) to all samples classified as positive, whether correctly or incorrectly.*
### *Recall is the ratio of positive samples correctly classified as positive to the total number of positive samples. It measures the model's ability to detect positive samples.*
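
To make the two definitions concrete, the sketch below computes per-class precision and recall directly from a confusion matrix and checks them against scikit-learn. The ten labels are made-up toy values, not results from this notebook.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

labels = [1, 4, 19]
y_true = np.array([1, 1, 1, 1, 4, 4, 4, 19, 19, 19])
y_hat = np.array([1, 1, 1, 4, 4, 4, 1, 19, 1, 1])

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_hat, labels=labels)
true_pos = np.diag(cm)

precision = true_pos / cm.sum(axis=0)  # TP / everything predicted as the class
recall = true_pos / cm.sum(axis=1)     # TP / everything that truly is the class

for lab, p, r in zip(labels, precision, recall):
    print(f"label {lab}: precision={p:.2f} recall={r:.2f}")

# The manual values agree with scikit-learn's per-class scores.
assert np.allclose(precision, precision_score(y_true, y_hat, labels=labels, average=None))
assert np.allclose(recall, recall_score(y_true, y_hat, labels=labels, average=None))
```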

Reasons I think these ML models didn't perform better:
1. Data quality: there was a lot of missing and unknown data, such as
    - missing names (for NN training)
    - unknown gender for some names
    - judge data missing for many cases
    - missing disposition names
    - few features with real predictive power: people's genders, acts and geography can only predict up to a certain accuracy, beyond which more detailed data would be required, such as the number of witnesses or the track record of the participating lawyers
2. Skewed data (a consequence of the choice of classification problem)
    - 'others' cases far outnumbered 'convicted' and 'acquitted' cases
    - this imbalance made many ML models less effective
3. Lack of proper pre-processing
    - pre-processing such as one-hot encoding, data normalisation and feature engineering was not done because of time constraints and the added complexity
4. Data limit on Kaggle
    - some of the above issues might have been fixed if I could have trained on more data, but Kaggle's data limit made that harder
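
Points 2 and 3 could be explored with a pipeline along the following lines. It is a sketch under assumptions, not something that was run in this notebook: the features are one-hot encoded and the forest uses `class_weight="balanced"` to counter the class skew. The data is synthetic with random labels, so the printed scores are meaningless; only the wiring is of interest.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Synthetic stand-in using a few of the notebook's column names.
rng = np.random.default_rng(0)
n = 600
X_demo = pd.DataFrame(
    {
        "judge_position_x": rng.choice(
            ["chief judicial magistrate", "criminal cases"], size=n
        ),
        "female_judge": rng.choice(["0 nonfemale", "1 female"], size=n),
        "state_code_x": rng.integers(1, 4, size=n),
    }
)
y_demo = rng.choice([1, 4, 19], size=n, p=[0.8, 0.16, 0.04])

cat_cols = ["judge_position_x", "female_judge", "state_code_x"]
model = Pipeline(
    [
        # One-hot encode every categorical column; unseen categories become all zeros.
        ("onehot", ColumnTransformer(
            [("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols)]
        )),
        # class_weight="balanced" reweights classes inversely to their frequency.
        ("forest", RandomForestClassifier(
            n_estimators=50, class_weight="balanced", random_state=0
        )),
    ]
)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0
)
model.fit(X_tr, y_tr)
y_hat = model.predict(X_te)
print(classification_report(y_te, y_hat, zero_division=0))
```

One-hot encoding high-cardinality columns such as `ddl_decision_judge_id` would produce a very wide matrix, so those columns would likely need a different treatment.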
