A Binary Classifier Optimized for Maximum Area Under the Receiver Operating Characteristic Curve (AUC-ROC):
From data cleaning to model validation: classifying whether a blight ticket will be paid on time. Three different classifiers were trained on a highly imbalanced dataset of around 160,000 tickets provided by the Detroit Open Data Portal.
Data Cleaning : (Data Cleaning & Feature Engineering Notebook)
- There are not many null values in the dataset, so I simply dropped the rows with missing data. I also dropped the 'violation_zip_code', 'non_us_str_zip_code', and 'grafitti status' fields, as they were more than 60% missing, and dropped the 'payment_date' and 'collection_status' fields to avoid data leakage.
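The dropping step above can be sketched with pandas; this is a minimal sketch on a toy frame standing in for train.csv, using the column names from the write-up (the toy values themselves are made up):

```python
import pandas as pd

# Toy stand-in for train.csv; the real data has ~160,000 rows.
train = pd.DataFrame({
    "ticket_id": [1, 2, 3],
    "violation_zip_code": [None, None, None],            # >60% missing
    "payment_date": ["2005-01-01", None, "2005-02-01"],  # leaks the label
    "fine_amount": [250.0, None, 100.0],
})

# Drop the mostly-missing and leakage-prone columns,
# then drop the remaining rows that still contain nulls.
train = train.drop(columns=["violation_zip_code", "payment_date"])
train = train.dropna()
print(len(train))  # rows that survived the cleaning
```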
- Then, I combined the lat-lon.csv and addresses.csv datasets with train.csv to map each ticket_id to its corresponding latitude and longitude.
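The two-step join can be sketched as follows; a minimal sketch assuming addresses.csv maps ticket_id to an address string and lat-lon.csv maps addresses to coordinates (the exact column names are assumptions):

```python
import pandas as pd

# Toy stand-ins for the three files.
train = pd.DataFrame({"ticket_id": [10, 11], "fine_amount": [250.0, 100.0]})
addresses = pd.DataFrame({"ticket_id": [10, 11],
                          "address": ["123 main st", "456 oak ave"]})
latlons = pd.DataFrame({"address": ["123 main st", "456 oak ave"],
                        "lat": [42.33, 42.41], "lon": [-83.05, -83.10]})

# Two left joins give every ticket its latitude and longitude.
train = train.merge(addresses, on="ticket_id", how="left")
train = train.merge(latlons, on="address", how="left")
```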
- Cleaned up some text-based errors and inconsistencies in the dataset.
- e.g., some fields were filled with 'Deter' and 'Determi' instead of 'Determination'.
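Collapsing the truncated spellings into one canonical value can be done with a replacement map; a minimal sketch using the variants quoted above:

```python
import pandas as pd

# The 'Deter' / 'Determi' variants come from the write-up;
# the column itself is a toy stand-in.
s = pd.Series(["Deter", "Determi", "Determination", "Determination"])
s = s.replace({"Deter": "Determination", "Determi": "Determination"})
print(s.unique())
```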
Feature Engineering : (Data Cleaning & Feature Engineering Notebook)
- Used several data fields to extract model-ready features:
'Disposition' ----> 'Responsible_by' & 'Fine_Waived'
'Violation_description' ----> 'Len_Description' & 'Count_Violation'
'Violator_name' ----> 'Type of violator'
'Ticket_Issued_date' ----> 'Month_Bin' & 'Ticket_Time'
- Categorical data was mapped to numeric codes for EDA.
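Two of the derivations above can be sketched with pandas datetime and string accessors. The month/time binning shown here is a hypothetical choice; the notebook may bin differently:

```python
import pandas as pd

tickets = pd.DataFrame({
    "ticket_issued_date": ["2005-03-21 10:30:00", "2005-11-02 19:15:00"],
    "violation_description": ["Failure to obtain certificate",
                              "Excessive weeds"],
})
issued = pd.to_datetime(tickets["ticket_issued_date"])

# Month_Bin: the calendar month; Ticket_Time: a coarse time-of-day bucket.
tickets["Month_Bin"] = issued.dt.month
tickets["Ticket_Time"] = pd.cut(
    issued.dt.hour, bins=[0, 6, 12, 18, 24],
    labels=["night", "morning", "afternoon", "evening"], right=False)

# Len_Description: character length of the violation description text.
tickets["Len_Description"] = tickets["violation_description"].str.len()
```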
Exploratory Data Analysis : (EDA & Feature Selection Notebook)
Feature Selection : (EDA & Feature Selection Notebook)
Two methods were used for feature selection, namely univariate selection (chi-square test) and feature importances from a simple Extra Trees classifier.
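Both selection methods are available in scikit-learn; a minimal sketch on synthetic data standing in for the cleaned ticket features (note the chi-square test requires non-negative inputs, hence the shift):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectKBest, chi2

# Synthetic stand-in for the cleaned feature matrix.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
X = X - X.min(axis=0)  # chi2 needs non-negative features

# Method 1: univariate chi-square scores, keeping the top k features.
selector = SelectKBest(chi2, k=4).fit(X, y)
chi2_mask = selector.get_support()

# Method 2: impurity-based importances from an Extra Trees classifier.
forest = ExtraTreesClassifier(n_estimators=100, random_state=0).fit(X, y)
importances = forest.feature_importances_
```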
Evaluation Metric Selection : (Model Building Notebook)
True Positive Rate (TPR):
- The true positives divided by the sum of the false negatives and the true positives.
- TPR describes the proportion of the actual positive samples that were correctly classified as positive, so we want this value to approach 1.
False Positive Rate (FPR):
- The false positives divided by the sum of the false positives and the true negatives.
- FPR describes the proportion of the actual negative samples that were incorrectly classified as positive, so we want this value to approach 0.
ROC Curve:
- A comparison of the true positive rate and the false positive rate at every classification threshold.
- The goal is a curve that bends toward the top-left corner (AUC close to 1), as this suggests the model separates the classes well.
- The ROC curve can help guide where the best threshold might be.
Area Under the ROC Curve (AUC):
- i.e., the integral of the ROC curve over the threshold domain: \(\mathrm{AUC} = \int_0^1 \mathrm{ROC}(x)\,dx\).
- This provides an aggregated measure of performance across all thresholds.
- A general idea of the overall potential quality of a model.
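These quantities can be computed directly with scikit-learn; a minimal sketch on four hand-made labels and scores:

```python
from sklearn.metrics import roc_auc_score, roc_curve

# Tiny hand-made example: true labels and predicted probabilities.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# The ROC curve as (FPR, TPR) pairs at each candidate threshold,
# and its area summarized as a single number.
fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)
print(auc)  # 0.75
```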
Model Building and HyperParameter Tuning : (Model Building Notebook)
This dataset is highly imbalanced, so I will use dummy classifiers as a baseline for performance and evaluation, namely:
- Dummy Classifier (strategy = 'most_frequent')
- Dummy Classifier (strategy = 'uniform')
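A minimal sketch of the two baselines on synthetic imbalanced labels (the ~7% positive rate is an illustrative assumption): both dummy strategies ignore the features, so their AUC-ROC hovers around 0.5, which is the score any real model must beat.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
# Imbalanced toy labels (~7% positive) with uninformative features.
y = (rng.random(1000) < 0.07).astype(int)
X = rng.normal(size=(1000, 3))

aucs = {}
for strategy in ("most_frequent", "uniform"):
    clf = DummyClassifier(strategy=strategy, random_state=0).fit(X, y)
    aucs[strategy] = roc_auc_score(y, clf.predict_proba(X)[:, 1])
print(aucs)  # both baselines sit near 0.5
```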
For the final model building, since the data is very sparse and the features do not relate to each other that much, I will fit and compare three different models, namely:
- KNN Classifier
- Logistic Regression
- Decision Tree
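Comparing the three candidates by cross-validated AUC-ROC can be sketched as follows; the data is a synthetic imbalanced stand-in, and the hyperparameters are defaults rather than the tuned values from the notebook:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced stand-in for the ticket features.
X, y = make_classification(n_samples=500, n_features=10,
                           weights=[0.93], random_state=0)

models = {
    "knn": KNeighborsClassifier(),
    "logreg": LogisticRegression(max_iter=1000),
    "tree": DecisionTreeClassifier(random_state=0),
}
# Mean AUC-ROC over 5 stratified folds for each candidate.
scores = {name: cross_val_score(m, X, y, cv=5, scoring="roc_auc").mean()
          for name, m in models.items()}
print(scores)
```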
Finally, the Logistic Regression classifier was chosen: as a linear model it makes predictions quickly, and it achieved the highest AUC-ROC score on the training set.
Model Validation : (Model Validation Notebook)
The validation dataset was used to validate the final model. It was made sure that the validation set was never seen by the model during its training phase. You can see more on this in the Model Validation notebook.
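The hold-out discipline described above can be sketched as: split before any fitting, train only on the training portion, and score AUC-ROC on the untouched validation portion (synthetic data; the 25% split size is an assumption):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.93], random_state=0)

# Hold out a validation split before any fitting,
# so the model never sees it during training.
X_tr, X_val, y_tr, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
val_auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
print(val_auc)
```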