
### Problem Statement: Build a blight model using Decision Trees and Random Forests

This assignment is based on a data challenge from the Michigan Data Science Team ([MDST](http://midas.umich.edu/mdst/)).

Blight violations are issued by the city to individuals who allow their properties to remain in a deteriorated condition. Every year, the city of Detroit issues millions of dollars in fines to residents and every year, many of these fines remain unpaid. Enforcing unpaid blight fines is a costly and tedious process, so the city wants to know: how can we increase blight ticket compliance?

The first step in answering this question is understanding when and why a resident might fail to comply with a blight ticket. This is where predictive modeling comes in. For this assignment, your task is to predict whether a given blight ticket will be paid on time.

We provide you with a data file for use in training and validating your models: blight_data.csv. Each row in the file corresponds to a single blight ticket and includes information about when, why, and to whom the ticket was issued. The target variable is compliance, which is True if the ticket was paid early, on time, or within one month of the hearing date, False if the ticket was paid after the hearing date or not at all, and Null if the violator was found not responsible.


**Data fields**


    ticket_id - unique identifier for tickets
    agency_name - Agency that issued the ticket
    inspector_name - Name of inspector that issued the ticket
    violator_name - Name of the person/organization that the ticket was issued to
    violation_street_number, violation_street_name, violation_zip_code - Address where the violation occurred
    mailing_address_str_number, mailing_address_str_name, city, state, zip_code, non_us_str_code, country - Mailing address of the violator
    ticket_issued_date - Date and time the ticket was issued
    hearing_date - Date and time the violator's hearing was scheduled
    violation_code, violation_description - Type of violation
    disposition - Judgment and judgment type
    fine_amount - Violation fine amount, excluding fees
    admin_fee - $20 fee assigned to responsible judgments
    state_fee - $10 fee assigned to responsible judgments
    late_fee - 10% fee assigned to responsible judgments
    discount_amount - discount applied, if any
    clean_up_cost - DPW clean-up or graffiti removal cost
    judgment_amount - Sum of all fines and fees
    grafitti_status - Flag for graffiti violations
    payment_amount - Amount paid, if any
    payment_date - Date payment was made, if it was received
    payment_status - Current payment status as of Feb 1 2017
    balance_due - Fines and fees still owed
    collection_status - Flag for payments in collections
    compliance [target variable for prediction]
     Null = Not responsible
     0 = Responsible, non-compliant
     1 = Responsible, compliant
    compliance_detail - More information on why each ticket was marked compliant or non-compliant


___

## Evaluation

Your predictions will be given as the probability that the corresponding blight ticket will be paid on time.

The evaluation metric for this assignment is the Area Under the ROC Curve (AUC).
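
As a rough illustration of the metric, the sketch below computes AUC on a handful of made-up labels and scores (these values are not from the blight data). It also shows why a model that gives every ticket the same score lands at 0.5.

```python
from sklearn.metrics import roc_auc_score

# Toy labels and predicted probabilities, invented for illustration only
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# AUC is the fraction of (positive, negative) pairs ranked correctly:
# 3 of the 4 pairs here have the positive scored higher
auc_value = roc_auc_score(y_true, y_score)
print(auc_value)  # 0.75

# A constant score cannot rank anything, so every pair counts as a tie
auc_constant = roc_auc_score(y_true, [0.5, 0.5, 0.5, 0.5])
print(auc_constant)  # 0.5
```

Because AUC depends only on how the scores rank the tickets, it is unaffected by the choice of probability threshold.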

In [20]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import warnings
warnings.filterwarnings('ignore')

In [21]:
df = pd.read_csv('blight_data_train.csv', encoding = 'ISO-8859-1')
df = df[['disposition', 'fine_amount',
        'late_fee', 'discount_amount',
        'clean_up_cost', 'judgment_amount', 'grafitti_status','compliance']]
df.head()

Unnamed: 0,disposition,fine_amount,late_fee,discount_amount,clean_up_cost,judgment_amount,grafitti_status,compliance
0,Responsible by Default,250.0,25.0,0.0,0.0,305.0,,0.0
1,Responsible by Determination,750.0,75.0,0.0,0.0,855.0,,1.0
2,Not responsible by Dismissal,250.0,0.0,0.0,0.0,0.0,,
3,Not responsible by City Dismissal,250.0,0.0,0.0,0.0,0.0,,
4,Not responsible by Dismissal,250.0,0.0,0.0,0.0,0.0,,


In [22]:
df.isnull().sum()

disposition             0
fine_amount             1
late_fee                0
discount_amount         0
clean_up_cost           0
judgment_amount         0
grafitti_status    250305
compliance          90426
dtype: int64

In [25]:
df1 = df.drop(['grafitti_status','clean_up_cost'], axis = 1)
df1.head()

Unnamed: 0,disposition,fine_amount,late_fee,discount_amount,judgment_amount,compliance
0,Responsible by Default,250.0,25.0,0.0,305.0,0.0
1,Responsible by Determination,750.0,75.0,0.0,855.0,1.0
2,Not responsible by Dismissal,250.0,0.0,0.0,0.0,
3,Not responsible by City Dismissal,250.0,0.0,0.0,0.0,
4,Not responsible by Dismissal,250.0,0.0,0.0,0.0,


In [None]:
df1.describe()

In [None]:
df1.info()

In [37]:
df1.isnull().sum()

disposition        0
fine_amount        0
late_fee           0
discount_amount    0
judgment_amount    0
compliance         0
dtype: int64

In [39]:
df1.head()

Unnamed: 0,disposition,fine_amount,late_fee,discount_amount,judgment_amount,compliance
0,Responsible by Default,250.0,25.0,0.0,305.0,0.0
1,Responsible by Determination,750.0,75.0,0.0,855.0,1.0
2,Not responsible by Dismissal,250.0,0.0,0.0,0.0,0.072536
3,Not responsible by City Dismissal,250.0,0.0,0.0,0.0,0.072536
4,Not responsible by Dismissal,250.0,0.0,0.0,0.0,0.072536


In [41]:
# Column means of the numeric columns only (disposition is a string column)
mean_values = df1.mean(numeric_only=True)

# Replace null values with the respective column mean
df1.fillna(mean_values, inplace=True)

In [42]:
df1.isnull().sum()

disposition        0
fine_amount        0
late_fee           0
discount_amount    0
judgment_amount    0
compliance         0
dtype: int64

Q. Are there any columns with null values?

In [None]:
# Yes: fine_amount, grafitti_status and compliance contain null values
print(df.isnull().any(axis=0))

disposition        False
fine_amount         True
late_fee           False
discount_amount    False
clean_up_cost      False
judgment_amount    False
grafitti_status     True
compliance          True
dtype: bool


Q. Which columns are categorical vs continuous? What is our target variable?

In [None]:
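df.info()
# categorical: disposition, grafitti_status (object dtype)
# continuous: fine_amount, late_fee, discount_amount, clean_up_cost, judgment_amount
# target: compliance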

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250306 entries, 0 to 250305
Data columns (total 8 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   disposition      250306 non-null  object 
 1   fine_amount      250305 non-null  float64
 2   late_fee         250306 non-null  float64
 3   discount_amount  250306 non-null  float64
 4   clean_up_cost    250306 non-null  float64
 5   judgment_amount  250306 non-null  float64
 6   grafitti_status  1 non-null       object 
 7   compliance       159880 non-null  float64
dtypes: float64(6), object(2)
memory usage: 15.3+ MB


Q. Two columns need to be dropped before building the models. Which columns are they, and why do they need to be dropped?

In [None]:
# clean_up_cost and grafitti_status
# grafitti_status is NaN in all but one row (250305 of 250306)
# clean_up_cost is 0.0 in every row, so it carries no information

Q. Write code to drop the rows in df that have null values. What percentage of rows have null values?

In [None]:
percentage_null_values = (df.isnull().mean() * 100).round(5)

print(percentage_null_values)

disposition         0.00000
fine_amount         0.00040
late_fee            0.00000
discount_amount     0.00000
clean_up_cost       0.00000
judgment_amount     0.00000
grafitti_status    99.99960
compliance         36.12618
dtype: float64
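
The percentages above are per column. The row-wise share and the drop itself could be sketched as follows, using a small made-up frame in place of `df` (the values and the names `toy` and `toy_clean` are illustrative assumptions).

```python
import numpy as np
import pandas as pd

# Small made-up frame standing in for df (values are illustrative only)
toy = pd.DataFrame({
    "fine_amount": [250.0, np.nan, 100.0, 50.0],
    "compliance": [0.0, 1.0, np.nan, 1.0],
})

# Share of rows that contain at least one null value
pct_rows_with_null = toy.isnull().any(axis=1).mean() * 100
print(pct_rows_with_null)  # 50.0

# Drop every row that contains a null value
toy_clean = toy.dropna()
print(len(toy_clean))  # 2
```

On the real data this would presumably be applied after removing grafitti_status, since that column is null in almost every row and would otherwise cause nearly all rows to be dropped.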


In [43]:
# Column means of the numeric columns only (disposition is a string column)
mean_values = df1.mean(numeric_only=True)

# Replace null values with the respective column mean
df1.fillna(mean_values, inplace=True)

In [44]:
df1.isnull().sum()

disposition        0
fine_amount        0
late_fee           0
discount_amount    0
judgment_amount    0
compliance         0
dtype: int64

Q. Convert categorical variables into dummy variables, and then apply *train_test_split* to split the data into training and test set. Use *random_state = 0* keeping other arguments at default.

In [None]:
df2 = pd.get_dummies(df1)
print(df2.head())

In [None]:
df2.info()

In [64]:
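# Cast every column to int64. Note that this truncates the mean-imputed
# compliance values (0.072536) to 0 and drops any fractional amounts.
df2 = df2.astype('int64')
df2.dtypes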

fine_amount                                       int64
late_fee                                          int64
discount_amount                                   int64
judgment_amount                                   int64
compliance                                        int64
disposition_Not responsible by City Dismissal     int64
disposition_Not responsible by Determination      int64
disposition_Not responsible by Dismissal          int64
disposition_PENDING JUDGMENT                      int64
disposition_Responsible (Fine Waived) by Deter    int64
disposition_Responsible by Admission              int64
disposition_Responsible by Default                int64
disposition_Responsible by Determination          int64
disposition_SET-ASIDE (PENDING JUDGMENT)          int64
dtype: object

In [65]:
from sklearn.model_selection import train_test_split

X = df2.drop(['compliance'], axis=1)
y = df2['compliance']
# test_size=0.3 is set explicitly here; scikit-learn's default is 0.25
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
y.head()

0    0
1    1
2    0
3    0
4    0
Name: compliance, dtype: int64

Q. Use the training set to train a dummy classifier (*from sklearn.dummy import DummyClassifier*) that classifies everything as the majority class (most_frequent strategy) of the training data. What is the accuracy and roc auc score of this classifier?

Note: For calculating roc auc score, you need to pass predicted probability by using *predict_proba* method. For example, for a classifier with name *'clf'*, predicted probability can be calculated by:

*y_clf_proba = clf.predict_proba(X_test)[:,1]*

In [66]:
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, roc_auc_score


dummy_classifier = DummyClassifier(strategy='most_frequent')
dummy_classifier.fit(X_train, y_train)

# Make predictions using the dummy classifier
y_pred = dummy_classifier.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)

# Calculate predicted probabilities using predict_proba method
y_dummy_proba = dummy_classifier.predict_proba(X_test)[:, 1]

# Calculating ROC AUC score
roc_auc = roc_auc_score(y_test, y_dummy_proba)
print("ROC AUC Score:", roc_auc)

Accuracy: 0.9526847067597081
ROC AUC Score: 0.5


### Logistic Regression

Q. Use X_train to train two logistic regression models, one with fit_intercept=True and one with fit_intercept=False. Calculate the accuracy, confusion matrix, TPR and FPR. Also report the metrics if the probability threshold is changed to 0.3 and 0.7.

In [73]:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, roc_auc_score

logreg_intercept_true = LogisticRegression(fit_intercept=True)
logreg_intercept_true.fit(X_train, y_train)
y_prob_intercept_true = logreg_intercept_true.predict_proba(X_test)[:, 1]

logreg_intercept_false = LogisticRegression(fit_intercept=False)
logreg_intercept_false.fit(X_train, y_train)
y_prob_intercept_false = logreg_intercept_false.predict_proba(X_test)[:, 1]

# AUC depends only on the predicted probabilities, not on the threshold
auc_intercept_true = roc_auc_score(y_test, y_prob_intercept_true)
auc_intercept_false = roc_auc_score(y_test, y_prob_intercept_false)

thresholds = [0.3, 0.7]

for threshold in thresholds:
    print(f"Threshold: {threshold}")
    for label, y_prob, auc_value in [
        ("fit_intercept=True", y_prob_intercept_true, auc_intercept_true),
        ("fit_intercept=False", y_prob_intercept_false, auc_intercept_false),
    ]:
        # Classify as compliant when the predicted probability reaches the threshold
        y_pred = (y_prob >= threshold).astype(int)
        cm = confusion_matrix(y_test, y_pred)
        tn, fp, fn, tp = cm.ravel()

        print(f"Model with {label}:")
        print("Accuracy:", accuracy_score(y_test, y_pred))
        print("Confusion Matrix:\n", cm)
        print(f"TPR: {tp / (tp + fn):.4f}")  # true positive rate at this threshold
        print(f"FPR: {fp / (fp + tn):.4f}")  # false positive rate at this threshold
        print(f"AUC: {auc_value:.4f}")
        print("")

Threshold: 0.3
Model with fit_intercept=True:
Accuracy: 0.949248921323177
Confusion Matrix:
 [[70197  1342]
 [ 2469  1084]]
TPR: 0.3051
FPR: 0.0188
[remaining output truncated]

In [68]:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline,make_pipeline

model = make_pipeline(StandardScaler(),LogisticRegression())

model.fit(X_train,y_train)

Q. For the question above, which threshold do you think is better for this problem statement, and why?

Q. Report the coefficients corresponding to each of the features. How do these values explain the effect on the final classification outcome?
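
One way to lay out the coefficients is sketched below. It uses synthetic data from `make_classification` rather than the blight features, and the names `X_demo`, `feature_names` and `coef_table` are assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for X_train / y_train (not the blight data)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=4, n_informative=2, n_redundant=0, random_state=0
)
feature_names = [f"feature_{i}" for i in range(4)]

clf = LogisticRegression().fit(X_demo, y_demo)

# One coefficient per feature; exp(coefficient) is the multiplicative change
# in the odds of the positive class for a one-unit increase in that feature
coef_table = pd.DataFrame({
    "feature": feature_names,
    "coefficient": clf.coef_[0],
    "odds_ratio": np.exp(clf.coef_[0]),
}).sort_values("coefficient", key=np.abs, ascending=False)

print(coef_table)
print("Intercept:", clf.intercept_[0])
```

A positive coefficient pushes the predicted probability of the positive class up as the feature grows, and a negative one pushes it down. Magnitudes are only comparable across features when the features are on similar scales.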

Q. Plot the ROC curve
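
A minimal plotting sketch, assuming a fitted classifier with `predict_proba`. The data here is synthetic and imbalanced to loosely mimic the class balance in this assignment; it is not the blight data, and the variable names are illustrative.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the blight features
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=5, weights=[0.9, 0.1], random_state=0
)
Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, stratify=y_demo, random_state=0
)

clf = LogisticRegression().fit(Xtr, ytr)
proba = clf.predict_proba(Xte)[:, 1]

# roc_curve sweeps over all thresholds and returns one (FPR, TPR) pair per step
fpr, tpr, _ = roc_curve(yte, proba)
auc_value = roc_auc_score(yte, proba)

fig, ax = plt.subplots()
ax.plot(fpr, tpr, label=f"Logistic regression (AUC = {auc_value:.3f})")
ax.plot([0, 1], [0, 1], linestyle="--", label="Chance (AUC = 0.5)")
ax.set_xlabel("False positive rate")
ax.set_ylabel("True positive rate")
ax.set_title("ROC curve")
ax.legend()
```

The dashed diagonal is the curve a classifier with no ranking ability would trace, which is where the dummy classifier above sits.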

### Decision Trees

Q. Use X_train to train a decision tree model using *DecisionTreeClassifier*, with max_depth = 3. Calculate the accuracy and roc auc score on X_test for this model.

Q. Rank the feature importances of this classifier. Which is the most important feature used in the model?

Q. Fit 5 different decision tree classifiers by varying the 'max_depth' parameter from 1 to 5. Plot the accuracy as a function of 'max_depth' using the test dataset. At what 'max_depth' value does the classifier achieve maximum accuracy?
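
The three decision tree questions above could be approached along these lines. This is a sketch on synthetic data, so the numbers it prints say nothing about the blight model; `results`, `best_depth` and `importances` are names chosen for the example.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the blight features
X_arr, y_demo = make_classification(n_samples=1000, n_features=6, random_state=0)
cols = [f"feature_{i}" for i in range(6)]
X_demo = pd.DataFrame(X_arr, columns=cols)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, random_state=0)

# Accuracy and ROC AUC on the test set for max_depth = 1..5
results = {}
for depth in range(1, 6):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(Xtr, ytr)
    acc = accuracy_score(yte, tree.predict(Xte))
    auc_value = roc_auc_score(yte, tree.predict_proba(Xte)[:, 1])
    results[depth] = (acc, auc_value)
    print(f"max_depth={depth}: accuracy={acc:.3f}, roc_auc={auc_value:.3f}")

best_depth = max(results, key=lambda d: results[d][0])
print("Depth with highest test accuracy:", best_depth)

# Feature importances of the max_depth=3 tree, largest first
tree3 = DecisionTreeClassifier(max_depth=3, random_state=0).fit(Xtr, ytr)
importances = pd.Series(tree3.feature_importances_, index=cols)
importances = importances.sort_values(ascending=False)
print(importances)
```

To get the requested plot, the accuracies in `results` can be passed to `plt.plot` with the depths on the x-axis.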

Q. What is 'max_depth' in decision tree model?

In [None]:
'''
max_depth is a hyperparameter that sets the maximum depth the tree is allowed to grow to.
The deeper the tree, the more complex the model becomes and the more likely it is to overfit.
'''

Q. How does a decision tree algorithm decide where to split?

In [None]:
'''
At each node, the decision tree algorithm evaluates candidate splits (a feature and a
threshold) and picks the one that most reduces impurity, measured by a metric such as
entropy or Gini impurity.
'''
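
To make the impurity idea concrete, here is a small hand-rolled sketch of Gini impurity on made-up labels. The helper names `gini` and `weighted_gini` are hypothetical and are not part of scikit-learn.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def weighted_gini(left, right):
    """Impurity of a split: child impurities weighted by child size."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)


parent = [0, 0, 0, 0, 1, 1, 1, 1]

# Candidate split A separates the classes perfectly
split_a = ([0, 0, 0, 0], [1, 1, 1, 1])
# Candidate split B leaves both children as mixed as the parent
split_b = ([0, 0, 1, 1], [0, 0, 1, 1])

print(gini(parent))                            # 0.5
print(gini(parent) - weighted_gini(*split_a))  # 0.5 (impurity reduction)
print(gini(parent) - weighted_gini(*split_b))  # 0.0 (no reduction)
```

The tree would prefer split A, because it removes all of the parent's impurity while split B removes none.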

Q. List one advantage and one disadvantage of using a decision tree model.

In [None]:
#ADVANTAGES OF DECISION TREE

'''
Can handle both categorical and numerical data
Can handle missing values and outliers
Can be used for classification and regression problems
'''

#DISADVANTAGES
'''
Overfitting can occur
Large decision trees can be hard to interpret
Decision trees can be sensitive to small changes in the data
'''


### Random Forests

Q. What is bagging? How is bagging used in Random Forests?

In [None]:
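'''
Bagging (bootstrap aggregating) trains several models on bootstrap samples of the
training data (rows drawn with replacement) and combines their predictions, by
majority vote for classification or by averaging for regression.
It reduces variance and helps to avoid overfitting, which improves stability and accuracy.
A random forest bags decision trees: each tree is fit on its own bootstrap sample and
considers only a random subset of features at each split.
'''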
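
The idea can be sketched by hand with plain decision trees. This is an illustration on synthetic data, not how `RandomForestClassifier` is implemented internally; the number of trees (25) and all variable names are arbitrary choices for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the blight features
X_demo, y_demo = make_classification(n_samples=600, n_features=8, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, random_state=0)

rng = np.random.default_rng(0)
trees = []
for _ in range(25):
    # Bootstrap sample: draw len(Xtr) row indices with replacement
    idx = rng.integers(0, len(Xtr), size=len(Xtr))
    # max_features="sqrt" makes each split consider a random subset of features
    tree = DecisionTreeClassifier(
        max_features="sqrt", random_state=int(rng.integers(10**9))
    )
    trees.append(tree.fit(Xtr[idx], ytr[idx]))

# Aggregate: majority vote across the trees for each test row
votes = np.stack([t.predict(Xte) for t in trees])  # shape (n_trees, n_test)
bagged_pred = (votes.mean(axis=0) >= 0.5).astype(int)

print("Single tree accuracy:", (trees[0].predict(Xte) == yte).mean())
print("Bagged accuracy:", (bagged_pred == yte).mean())
```

For regression the aggregation step would average the trees' numeric predictions instead of voting. scikit-learn's random forest classifier averages the trees' predicted class probabilities rather than counting hard votes.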

Q. Build a random forest classifier using X_train. Experiment with different values of parameters such as *n_estimators*, *max_features* and *max_depth* to come up with a model such that roc auc score is maximized on X_test. Set *random_state = 0*.

Q. What accuracy and roc auc score did you achieve on the test dataset? At what parameter values did you achieve this score?
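
A small parameter sweep along the lines the question describes might look like the sketch below. The grid values are arbitrary, the data is synthetic, and `best` is simply a dictionary introduced for the example.

```python
from itertools import product

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the blight features
X_demo, y_demo = make_classification(n_samples=800, n_features=8, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, random_state=0)

# Arbitrary illustrative grid: n_estimators x max_features x max_depth
grid = product([50, 100], ["sqrt", None], [3, 5])

best = None
for n_estimators, max_features, max_depth in grid:
    rf = RandomForestClassifier(
        n_estimators=n_estimators,
        max_features=max_features,
        max_depth=max_depth,
        random_state=0,
    ).fit(Xtr, ytr)
    auc_value = roc_auc_score(yte, rf.predict_proba(Xte)[:, 1])
    acc = accuracy_score(yte, rf.predict(Xte))
    # Keep the setting with the highest test ROC AUC seen so far
    if best is None or auc_value > best["auc"]:
        best = {
            "n_estimators": n_estimators,
            "max_features": max_features,
            "max_depth": max_depth,
            "auc": auc_value,
            "accuracy": acc,
        }

print(best)
```

Picking parameters by their test-set score, as the assignment asks, tends to give an optimistic estimate. Outside of this exercise, cross-validation on the training set (for example with `GridSearchCV`) is the more usual choice.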

Q. How are predictions made by a random forest regression and classification on a new data point?

Q. List one advantage and one disadvantage of using a random forest model.

### K-Nearest Neighbors (KNN)

Q. Use *KNeighborsClassifier* in *sklearn.neighbors* to fit a model using training data set. Set n_neighbors = 5 keeping other parameters as default.

Q. Use the testing dataset to calculate the accuracy, precision and recall of this classifier.

Q. Fit 20 different KNN classification models by varying the 'k' parameter from 1 to 20. Plot the accuracy as a function of 'k' using the test dataset. At what 'k' value does the classifier achieve maximum accuracy?

Q. Fit the KNN model using the best 'k' value found. Calculate the accuracy, precision and recall for this new model? Comment if this model is better than the Dummy Classifier and why.
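
The KNN questions above could be worked through roughly as follows. The data is synthetic and the names `accuracy_by_k`, `best_k` and `best_knn` are assumptions for the example, so the printed values do not reflect the blight data.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the blight features
X_demo, y_demo = make_classification(n_samples=600, n_features=6, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, random_state=0)

# Test accuracy for k = 1..20
accuracy_by_k = {}
for k in range(1, 21):
    knn = KNeighborsClassifier(n_neighbors=k).fit(Xtr, ytr)
    accuracy_by_k[k] = accuracy_score(yte, knn.predict(Xte))

best_k = max(accuracy_by_k, key=accuracy_by_k.get)
print("Best k:", best_k, "accuracy:", accuracy_by_k[best_k])

# Refit with the best k and report accuracy, precision and recall
best_knn = KNeighborsClassifier(n_neighbors=best_k).fit(Xtr, ytr)
y_pred = best_knn.predict(Xte)
precision = precision_score(yte, y_pred, zero_division=0)
recall = recall_score(yte, y_pred, zero_division=0)
print("Accuracy:", accuracy_score(yte, y_pred))
print("Precision:", precision)
print("Recall:", recall)
```

When comparing against the dummy classifier, precision and recall matter more than accuracy: a most_frequent dummy never predicts the positive class, so its recall is 0 even when its accuracy is high.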

Q. Briefly describe how KNN regression and classification predict on a new data point after a model is built?

Q. What is one major drawback of using KNN?

### Support Vector Machines (SVM)

Q. Use SVC (Support Vector Classifier) to fit a model using training dataset. Use default model parameters.

Q. Use the testing dataset to calculate the accuracy, precision and recall of this SVM classifier.

Q. Use *kernel = 'rbf', gamma = 1e-07, C = 1e9* to train an SVM classifier model. Calculate the accuracy, precision and recall for this new model. Comment on whether this model is better than the Dummy Classifier and why.
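
A sketch of fitting `SVC` under a few settings and comparing the metrics is shown below. It uses synthetic data and moderate, arbitrarily chosen `gamma` and `C` values instead of the ones in the question, because very large `C` values can make training slow; the `settings` list and `rows` table are assumptions for the example.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the blight features
X_demo, y_demo = make_classification(n_samples=400, n_features=6, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X_demo, y_demo, random_state=0)

# (gamma, C) pairs: the scikit-learn defaults, then a small and a large gamma
settings = [("scale", 1.0), (0.01, 10.0), (10.0, 10.0)]

rows = []
for gamma, C in settings:
    svm = SVC(kernel="rbf", gamma=gamma, C=C).fit(Xtr, ytr)
    y_pred = svm.predict(Xte)
    rows.append({
        "gamma": gamma,
        "C": C,
        "train_accuracy": accuracy_score(ytr, svm.predict(Xtr)),
        "test_accuracy": accuracy_score(yte, y_pred),
        "precision": precision_score(yte, y_pred, zero_division=0),
        "recall": recall_score(yte, y_pred, zero_division=0),
    })

for row in rows:
    print(row)
```

Comparing train and test accuracy across the rows gives a feel for how `gamma` and `C` trade off fitting the training data against generalizing to new data.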

Q. What is a kernel in SVM?

Q. What does 'gamma' value in *rbf* kernel control?

Q. What is the significance of 'C' parameter in SVM?

Q. Briefly describe how the SVM algorithm works.

Q. For fraud detection, which model would you choose among the three models built (Dummy, KNN, SVM) and why? What things would you consider while choosing the best model for your purpose?