### Data exploration of Law School Admissions Bar Passage Dataset

##### dataset source: https://www.kaggle.com/datasets/danofer/law-school-admissions-bar-passage/data

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

print('libraries imported')

libraries imported


In [2]:
#load the dataset
dataset_path = 'C:/Users/20181588/Desktop/AA SDG/synthetic-data-generation/Law School Admissions Bar Passage/Dataset/bar_pass_prediction.csv'
original_data = pd.read_csv(dataset_path)

# Work on a copy so the original data stays untouched
data_full = original_data.copy()

There are many features but we will keep only the following in the upcoming exploration:

ugpa - Undergraduate GPA. Every other feature with "gpa" in its name is either perfectly positively or perfectly negatively correlated with ugpa, so this is the one I decided to keep.

decile1, decile1b, decile3 - These features represent the law school rankings by decile of each candidate in years 1 and 3 (and I’m assuming year one second semester, but I can’t be sure about that).

lsat - This feature is the LSAT score of each candidate, but oddly it’s not formatted in the 120 to 180 score range of the actual LSAT. That said, it has a strong correlation with whether someone passed the bar and, well, the feature is called lsat!

grad - This seems to be whether the student who took the bar exam graduated from law school. It’s a binary variable and there are very few 0s, so that makes the most sense. (Special kudos to the 65 students surveyed who didn’t graduate law school, but still managed to pass the bar!)

fulltime - Whether the student was a full-time student.

fam_inc - This feature is family income by quintile.

tier - The tier of the law school the student attended, by quintile.

race1 - Categorizes students by race. Of all the race variables this was the most complete, so it’s the one I kept. All the information in the other race variables is also contained in race1.

gender - Student gender. The dataset also contains a sex column; gender is the one kept in the column selection below.

pass_bar - This is the target variable: whether the student passed the bar.
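
The claim about the gpa features can be checked with a correlation against ugpa. The following is a minimal sketch on a toy frame, not the real data: the values and the column names `gpa_copy` and `gpa_negated` are invented for illustration, and `gpa_correlations` is a hypothetical helper.

```python
import pandas as pd

# Toy stand-in for the real data; all values and the two extra
# gpa column names are invented for illustration only.
toy = pd.DataFrame({
    "ugpa": [3.5, 3.1, 2.8, 3.9, 3.3],
    "lsat": [44.0, 29.0, 36.0, 39.0, 48.0],
})
toy["gpa_copy"] = toy["ugpa"]           # perfectly correlated with ugpa
toy["gpa_negated"] = 4.0 - toy["ugpa"]  # perfectly negatively correlated


def gpa_correlations(frame, reference="ugpa"):
    """Pearson correlation of each other 'gpa' column with the reference column."""
    gpa_cols = [c for c in frame.columns
                if "gpa" in c.lower() and c != reference]
    return frame[gpa_cols].corrwith(frame[reference])


corr = gpa_correlations(toy)
print(corr)
```

On the real dataset the same helper could be called as `gpa_correlations(data_full)`; values close to 1 or -1 would support dropping the redundant columns.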

##### Dataset preparation

In [3]:
# Check all columns available
data_full.columns

Index(['decile1b', 'decile3', 'ID', 'decile1', 'sex', 'race', 'cluster',
       'lsat', 'ugpa', 'zfygpa', 'DOB_yr', 'grad', 'zgpa', 'bar1', 'bar1_yr',
       'bar2', 'bar2_yr', 'fulltime', 'fam_inc', 'age', 'gender', 'parttime',
       'male', 'race1', 'race2', 'Dropout', 'other', 'asian', 'black', 'hisp',
       'pass_bar', 'bar', 'bar_passed', 'tier', 'index6040', 'indxgrp',
       'indxgrp2', 'dnn_bar_pass_prediction', 'gpa'],
      dtype='object')

In [4]:
# List of columns to keep
columns_to_keep = [
    'pass_bar',  'ugpa', 'decile1', 'decile1b',
    'decile3', 'lsat', 'grad', 'fulltime', 'fam_inc',
    'tier', 'race1', 'gender'
]



# Take an explicit copy so later column assignments do not raise SettingWithCopyWarning
df = data_full[columns_to_keep].copy()

In [5]:
# Check the shape and names of the columns left
print(df.shape)
print(df.columns)


(22407, 12)
Index(['pass_bar', 'ugpa', 'decile1', 'decile1b', 'decile3', 'lsat', 'grad',
       'fulltime', 'fam_inc', 'tier', 'race1', 'gender'],
      dtype='object')


In [6]:
# Count the missing values in each column
print(df.isnull().sum())


pass_bar       0
ugpa           0
decile1     1092
decile1b    1604
decile3     1604
lsat           0
grad           3
fulltime      34
fam_inc      289
tier          96
race1         16
gender         5
dtype: int64


In [7]:
# Inspect the dataframe
display(df.head())

,pass_bar,ugpa,decile1,decile1b,decile3,lsat,grad,fulltime,fam_inc,tier,race1,gender
0,1,3.5,10.0,10.0,10.0,44.0,Y,1.0,5.0,4.0,white,female
1,1,3.5,5.0,5.0,4.0,29.0,Y,1.0,4.0,2.0,white,female
2,1,3.5,3.0,3.0,2.0,36.0,Y,1.0,1.0,3.0,white,male
3,1,3.5,7.0,7.0,4.0,39.0,Y,1.0,4.0,3.0,white,male
4,1,3.5,9.0,9.0,8.0,48.0,Y,1.0,4.0,5.0,white,male


In [8]:
# Remove all rows with missing values
initial_row_count = df.shape[0]
df = df.dropna()
removed_rows = initial_row_count - df.shape[0]
print(f"Removed {removed_rows} rows with missing values.")

Removed 1895 rows with missing values.
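
The per-column counts printed earlier add up to 4743 missing values, yet only 1895 rows were removed, because a single row can be missing several columns at once. A minimal sketch of that distinction, using a toy frame with invented values:

```python
import numpy as np
import pandas as pd

# Toy frame with invented values; row 1 is missing two columns at once.
toy = pd.DataFrame({
    "decile1": [10.0, np.nan, 3.0, np.nan],
    "decile3": [10.0, np.nan, 2.0, 4.0],
    "fam_inc": [5.0, 4.0, np.nan, 4.0],
})

# Total number of missing cells, summed over columns
missing_cells = int(toy.isnull().sum().sum())

# Number of rows that contain at least one missing cell
rows_with_missing = int(toy.isnull().any(axis=1).sum())

# dropna() removes exactly those rows
cleaned = toy.dropna()

print(missing_cells, rows_with_missing, len(cleaned))
```

Running `df.isnull().any(axis=1).sum()` before `dropna()` gives the number of rows that will be removed, which makes the result easy to sanity-check.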


In [9]:
# Check if all the rows with missing values are removed
print(df.shape)
print(df.columns)
print(df.isnull().sum())


(20512, 12)
Index(['pass_bar', 'ugpa', 'decile1', 'decile1b', 'decile3', 'lsat', 'grad',
       'fulltime', 'fam_inc', 'tier', 'race1', 'gender'],
      dtype='object')
pass_bar    0
ugpa        0
decile1     0
decile1b    0
decile3     0
lsat        0
grad        0
fulltime    0
fam_inc     0
tier        0
race1       0
gender      0
dtype: int64


In [10]:
# Inspect the dataframe
display(df.head())

Unnamed: 0,pass_bar,ugpa,decile1,decile1b,decile3,lsat,grad,fulltime,fam_inc,tier,race1,gender
0,1,3.5,10.0,10.0,10.0,44.0,Y,1.0,5.0,4.0,white,female
1,1,3.5,5.0,5.0,4.0,29.0,Y,1.0,4.0,2.0,white,female
2,1,3.5,3.0,3.0,2.0,36.0,Y,1.0,1.0,3.0,white,male
3,1,3.5,7.0,7.0,4.0,39.0,Y,1.0,4.0,3.0,white,male
4,1,3.5,9.0,9.0,8.0,48.0,Y,1.0,4.0,5.0,white,male


In [11]:
# Convert 'grad' to binary where 'Y' is 1 and anything else is 0
df['grad'] = np.where(df['grad'] == 'Y', 1, 0)
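
Two details of this conversion are worth illustrating: assigning into a frame produced by column selection may trigger pandas' SettingWithCopyWarning unless the frame is an explicit copy, and "anything else" includes missing values if the conversion runs before `dropna()`. A minimal sketch on toy data; the value `"X"` is a hypothetical non-graduate code, not taken from the dataset:

```python
import numpy as np
import pandas as pd

# Toy data: "X" is a hypothetical non-'Y' code, None stands for a missing value.
toy = pd.DataFrame({
    "grad": ["Y", "X", "Y", None],
    "lsat": [44.0, 29.0, 36.0, 39.0],
})

# An explicit copy makes the later column assignments unambiguous
subset = toy[["grad"]].copy()

# Approach used in the notebook: 'Y' -> 1, everything else -> 0
subset["grad_np"] = np.where(subset["grad"] == "Y", 1, 0)

# Equivalent pandas-only formulation
subset["grad_eq"] = subset["grad"].eq("Y").astype(int)

print(subset)
```

Both columns agree, and the missing entry is mapped to 0 in each case, which is why removing missing rows first (as done above) matters.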

In [12]:
# Inspect the dataframe
display(df.head())

Unnamed: 0,pass_bar,ugpa,decile1,decile1b,decile3,lsat,grad,fulltime,fam_inc,tier,race1,gender
0,1,3.5,10.0,10.0,10.0,44.0,1,1.0,5.0,4.0,white,female
1,1,3.5,5.0,5.0,4.0,29.0,1,1.0,4.0,2.0,white,female
2,1,3.5,3.0,3.0,2.0,36.0,1,1.0,1.0,3.0,white,male
3,1,3.5,7.0,7.0,4.0,39.0,1,1.0,4.0,3.0,white,male
4,1,3.5,9.0,9.0,8.0,48.0,1,1.0,4.0,5.0,white,male


In [13]:
# Check types of columns
print(df.dtypes)

pass_bar      int64
ugpa        float64
decile1     float64
decile1b    float64
decile3     float64
lsat        float64
grad          int32
fulltime    float64
fam_inc     float64
tier        float64
race1        object
gender       object
dtype: object


In [14]:
# Identify the categorical columns
categorical_columns = df.select_dtypes(include=['object'])
print(categorical_columns.columns)

Index(['race1', 'gender'], dtype='object')
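
Once the categorical columns are known, their category frequencies are a natural next check. The sketch below uses a toy frame with invented values and, as an assumption, treats every non-numeric column as categorical instead of relying on the `object` dtype, since newer pandas versions may store strings under a dedicated string dtype.

```python
import pandas as pd

# Toy frame with invented values, mirroring the two categorical columns.
toy = pd.DataFrame({
    "race1": ["white", "white", "black", "hisp", "white"],
    "gender": ["female", "male", "male", "female", "female"],
    "lsat": [44.0, 29.0, 36.0, 39.0, 48.0],
})

# Assumption: any non-numeric column is treated as categorical
categorical_cols = [c for c in toy.columns
                    if not pd.api.types.is_numeric_dtype(toy[c])]

# Frequency of each category, per categorical column
counts = {c: toy[c].value_counts() for c in categorical_cols}

for name, freq in counts.items():
    print(name)
    print(freq)
```

Strongly imbalanced categories would show up here directly and are worth knowing about before any modelling.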


##### Basic dataset exploration