A Binary Classifier Optimized for Maximum Area Under the Receiver Operating Characteristic Curve (AUC-ROC):
From data cleaning to model validation: classifying whether a blight ticket will be paid on time. Three different classifiers were trained on a highly imbalanced dataset of around 160,000 tickets provided by the Detroit Open Data Portal.
Data Cleaning : (Data Cleaning & Feature Engineering Notebook)
- There are not many null values in the dataset, so I simply dropped the rows with missing data. I also dropped the 'violation_zip_code', 'non_us_str_zip_code', and 'grafitti status' fields, as they were more than 60% missing, and dropped the 'payment_date' and 'collection_status' fields to avoid data leakage.
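The dropping step above can be sketched with pandas; this is a minimal sketch on a toy frame standing in for train.csv, using the column names from the write-up (the toy values themselves are made up):

```python
import pandas as pd

# Toy stand-in for train.csv; the real data has ~160,000 rows.
train = pd.DataFrame({
    "ticket_id": [1, 2, 3],
    "violation_zip_code": [None, None, None],            # >60% missing
    "payment_date": ["2005-01-01", None, "2005-02-01"],  # leaks the label
    "fine_amount": [250.0, None, 100.0],
})

# Drop the mostly-missing and leakage-prone columns,
# then drop the remaining rows that still contain nulls.
train = train.drop(columns=["violation_zip_code", "payment_date"])
train = train.dropna()
print(len(train))  # rows that survived the cleaning
```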
- Then, I combined the lat-lon.csv and addresses.csv datasets with train.csv to map each ticket_id to its corresponding latitude and longitude.
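The two-step join can be sketched as follows; a minimal sketch assuming addresses.csv maps ticket_id to an address string and lat-lon.csv maps addresses to coordinates (the exact column names are assumptions):

```python
import pandas as pd

# Toy stand-ins for the three files.
train = pd.DataFrame({"ticket_id": [10, 11], "fine_amount": [250.0, 100.0]})
addresses = pd.DataFrame({"ticket_id": [10, 11],
                          "address": ["123 main st", "456 oak ave"]})
latlons = pd.DataFrame({"address": ["123 main st", "456 oak ave"],
                        "lat": [42.33, 42.41], "lon": [-83.05, -83.10]})

# Two left joins give every ticket its latitude and longitude.
train = train.merge(addresses, on="ticket_id", how="left")
train = train.merge(latlons, on="address", how="left")
```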
- Cleaned up some text-based errors and inconsistencies in the dataset.
- e.g., some fields were filled with 'Deter' and 'Determi' instead of 'Determination'.
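Collapsing the truncated spellings into one canonical value can be done with a replacement map; a minimal sketch using the variants quoted above:

```python
import pandas as pd

# The 'Deter' / 'Determi' variants come from the write-up;
# the column itself is a toy stand-in.
s = pd.Series(["Deter", "Determi", "Determination", "Determination"])
s = s.replace({"Deter": "Determination", "Determi": "Determination"})
print(s.unique())
```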
Feature Engineering : (Data Cleaning & Feature Engineering Notebook)
- Used several data fields to extract model-ready features:
'Disposition' ----> 'Responsible_by' & 'Fine_Waived'
'Violation_description' ----> 'Len_Description' & 'Count_Violation'
'Violator_name' ----> 'Type of violator'
'Ticket_Issued_date' ----> 'Month_Bin' & 'Ticket_Time'
- Categorical data was mapped to numeric codes for EDA.
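Two of the derivations above can be sketched with pandas datetime and string accessors. The month/time binning shown here is a hypothetical choice; the notebook may bin differently:

```python
import pandas as pd

tickets = pd.DataFrame({
    "ticket_issued_date": ["2005-03-21 10:30:00", "2005-11-02 19:15:00"],
    "violation_description": ["Failure to obtain certificate",
                              "Excessive weeds"],
})
issued = pd.to_datetime(tickets["ticket_issued_date"])

# Month_Bin: the calendar month; Ticket_Time: a coarse time-of-day bucket.
tickets["Month_Bin"] = issued.dt.month
tickets["Ticket_Time"] = pd.cut(
    issued.dt.hour, bins=[0, 6, 12, 18, 24],
    labels=["night", "morning", "afternoon", "evening"], right=False)

# Len_Description: character length of the violation description text.
tickets["Len_Description"] = tickets["violation_description"].str.len()
```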
Exploratory Data Analysis : (EDA & Feature Selection Notebook)
Feature Selection : (EDA & Feature Selection Notebook)
Two methods were used for feature selection, namely univariate selection (chi-square test) and feature importances from a simple Extra Trees classifier.
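Both selection methods are available in scikit-learn; a minimal sketch on synthetic data standing in for the cleaned ticket features (note the chi-square test requires non-negative inputs, hence the shift):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectKBest, chi2

# Synthetic stand-in for the cleaned feature matrix.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
X = X - X.min(axis=0)  # chi2 needs non-negative features

# Method 1: univariate chi-square scores, keeping the top k features.
selector = SelectKBest(chi2, k=4).fit(X, y)
chi2_mask = selector.get_support()

# Method 2: impurity-based importances from an Extra Trees classifier.
forest = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X, y)
importances = forest.feature_importances_
```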
Evaluation Metric Selection : (Model Building Notebook)
True Positive Rate (TPR):
- The true positives divided by the sum of the false negatives and the true positives.
- TPR describes the proportion of the actual positive samples that were correctly classified as positive, so we want this value to approach 1.
False Positive Rate (FPR):
- The false positives divided by the sum of the false positives and the true negatives.
- FPR describes the proportion of the actual negative samples that were incorrectly classified as positive, so we want this value to approach 0.
ROC Curve:
- A comparison of the true positive rate and the false positive rate at every classification threshold.
- The goal is a curve that bends toward the top-left corner (AUC close to 1), as this suggests the model separates the classes well.
- The ROC curve can help guide where the best threshold might be.
Area Under the ROC Curve (AUC):
- i.e., the integral of the ROC curve over the threshold domain: \(\mathrm{AUC} = \int_0^1 \mathrm{ROC}(x)\,dx\).
- This provides an aggregated measure of performance across all thresholds.
- A general idea of the overall potential quality of a model.
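These quantities can be computed directly with scikit-learn; a minimal sketch on four hand-made labels and scores:

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Tiny hand-made example: true labels and predicted probabilities.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# The ROC curve as (FPR, TPR) pairs at each candidate threshold,
# and its area summarized as a single number.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)
print(auc)  # 0.75
```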
Model Building and HyperParameter Tuning : (Model Building Notebook)
This dataset is highly imbalanced, so I will use dummy classifiers as a baseline for performance and evaluation, namely:
- Dummy Classifier (strategy = 'most_frequent')
- Dummy Classifier (strategy = 'uniform')
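A minimal sketch of the two baselines on synthetic imbalanced labels (the ~7% positive rate is an illustrative assumption): both dummy strategies ignore the features, so their AUC-ROC hovers around 0.5, which is the score any real model must beat.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
# Imbalanced toy labels (~7% positive) with uninformative features.
y = (rng.random(1000) < 0.07).astype(int)
X = rng.normal(size=(1000, 3))

aucs = {}
for strategy in ("most_frequent", "uniform"):
    clf = DummyClassifier(strategy=strategy, random_state=0).fit(X, y)
    aucs[strategy] = roc_auc_score(y, clf.predict_proba(X)[:, 1])
print(aucs)  # both baselines sit near 0.5
```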
For the final model building, since the data is very sparse and the features do not relate to each other that much, I will fit and compare three different models, namely:
- KNN Classifier
- Logistic Regression
- Decision Tree
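Comparing the three candidates by cross-validated AUC-ROC can be sketched as follows; the data is a synthetic imbalanced stand-in, and the hyperparameters are defaults rather than the tuned values from the notebook:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced stand-in for the ticket features.
X, y = make_classification(n_samples=500, n_features=10,
                           weights=[0.93], random_state=0)

models = {
    "knn": KNeighborsClassifier(),
    "logreg": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(random_state=0),
}
# Mean AUC-ROC over 5 stratified folds for each candidate.
scores = {name: cross_val_score(m, X, y, cv=5, scoring="roc_auc").mean()
          for name, m in models.items()}
print(scores)
```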
Finally, the Logistic Regression classifier was chosen: as a linear model it makes predictions quickly, and it achieved the highest AUC-ROC score on the training set.
Model Validation : (Model Validation Notebook)
The validation dataset was used to validate the final model. It was made sure that the validation set was never seen by the model during its training phase. You can see more on this in the Model Validation notebook.
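The hold-out discipline described above can be sketched as: split before any fitting, train only on the training portion, and score AUC-ROC on the untouched validation portion (synthetic data; the 25% split size is an assumption):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.93], random_state=0)

# Hold out a validation split before any fitting,
# so the model never sees it during training.
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
val_auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
print(val_auc)
```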