###### Which performance metric am I considering?

For this dataset I am using precision as the performance metric to evaluate how well each model predicts the original rate of the products (the `rate_category` target).

Precision is the share of positive predictions that are actually correct, TP / (TP + FP), so it is the metric to watch when false positives are the costly error. In this sales product rate prediction and analysis task, it tells us how often a predicted rate matches the product's actual rate, out of all the predictions made for that rate.

Optimizing for precision therefore means minimizing false positives, that is, incorrect rate predictions, which matters when the predictions feed into business decisions.
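
As a small illustration of the definition, the sketch below computes precision by hand and with scikit-learn's `precision_score`. The labels are hypothetical binary values chosen for the example, not the assignment data.

```python
from sklearn.metrics import precision_score

# Hypothetical binary labels, for illustration only (not the assignment data).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

# Count true positives and false positives by hand.
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)

manual_precision = tp / (tp + fp)
sklearn_precision = precision_score(y_true, y_pred)

print(manual_precision, sklearn_precision)
```

Both values should agree, since `precision_score` applies the same TP / (TP + FP) ratio for a binary target.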

## Setup

In [1]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.svm import SVC
from sklearn.linear_model import LinearRegression

np.random.seed(1)

## Load data

In [2]:
X_train = pd.read_csv(r"C:\Users\akhil\OneDrive - University of South Florida\Desktop\DSP\Assignment1\assign1_train_X.csv", encoding='ISO-8859-1')
X_test = pd.read_csv(r"C:\Users\akhil\OneDrive - University of South Florida\Desktop\DSP\Assignment1\assign1_test_X.csv", encoding='ISO-8859-1')
y_train = pd.read_csv(r"C:\Users\akhil\OneDrive - University of South Florida\Desktop\DSP\Assignment1\assign1_train_y.csv", encoding='ISO-8859-1')
y_test = pd.read_csv(r"C:\Users\akhil\OneDrive - University of South Florida\Desktop\DSP\Assignment1\assign1_test_y.csv", encoding='ISO-8859-1')

In [3]:
y_train

Unnamed: 0,rate_category
0,2
1,3
2,1
3,1
4,4
...,...
787,4
788,4
789,3
790,2


## Model the data

In [4]:
performance = pd.DataFrame({"model": [], "Accuracy": [], "Precision": [], "Recall": [], "F1": []})

## Fit and test a Logistic Regression model

In [5]:
from sklearn.tree import DecisionTreeClassifier

In [6]:
log_reg_model = LogisticRegression(max_iter=5000)
_ = log_reg_model.fit(X_train, np.ravel(y_train))

In [7]:
model_preds = log_reg_model.predict(X_test)
c_matrix = confusion_matrix(y_test, model_preds)
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
performance = pd.concat([performance, pd.DataFrame({'model':"default logistic", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])
performance

Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,default logistic,0.944828,0.981481,0.883333,0.929825
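
One caveat about the metric code above: the `y_train` preview shows at least four `rate_category` values, so `confusion_matrix` probably returns a matrix larger than 2×2, and indexing `[0][0]`, `[0][1]`, `[1][0]`, `[1][1]` would then cover only the first two classes. The sketch below, on hypothetical four-class labels, shows one way to get per-class and macro-averaged precision instead. Macro averaging is an assumption here; another averaging mode may suit the assignment better.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score

# Hypothetical four-class labels, for illustration only (not the assignment data).
y_true = np.array([1, 1, 2, 2, 3, 3, 4, 4, 4, 1])
y_pred = np.array([1, 2, 2, 2, 3, 4, 4, 4, 3, 1])

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)

# Per-class precision: correct predictions (diagonal) divided by
# the number of times each class was predicted (column sums).
per_class = np.diag(cm) / cm.sum(axis=0)

# Macro average: unweighted mean of the per-class precisions.
macro = precision_score(y_true, y_pred, average="macro")

print(cm.shape, per_class, macro)
```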


## Change to liblinear solver

In [8]:
log_reg_liblin_model = LogisticRegression(solver='liblinear').fit(X_train, np.ravel(y_train))

In [9]:
model_preds = log_reg_liblin_model.predict(X_test)
c_matrix = confusion_matrix(y_test, model_preds)
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
performance = pd.concat([performance, pd.DataFrame({'model':"liblinear logistic", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])
performance

Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,default logistic,0.944828,0.981481,0.883333,0.929825
0,liblinear logistic,0.896296,0.973684,0.74,0.840909
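
The metric-collection code is repeated for every model in this notebook. A possible refactoring is sketched below: a hypothetical helper, `add_result`, that appends one row per model using scikit-learn's metric functions. The helper name, the `average="macro"` default, and the demo labels are assumptions for illustration, not part of the original notebook.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score


def add_result(performance, name, y_true, y_pred, average="macro"):
    """Return `performance` with one extra row of metrics for model `name`."""
    row = pd.DataFrame({
        "model": [name],
        "Accuracy": [accuracy_score(y_true, y_pred)],
        "Precision": [precision_score(y_true, y_pred, average=average, zero_division=0)],
        "Recall": [recall_score(y_true, y_pred, average=average, zero_division=0)],
        "F1": [f1_score(y_true, y_pred, average=average, zero_division=0)],
    })
    # Nothing to concatenate with yet: the new row is the whole table.
    if performance.empty:
        return row
    # ignore_index=True gives each model its own row label instead of repeating 0.
    return pd.concat([performance, row], ignore_index=True)


# Usage with hypothetical labels (not the assignment data).
performance_demo = pd.DataFrame({"model": [], "Accuracy": [], "Precision": [], "Recall": [], "F1": []})
performance_demo = add_result(performance_demo, "demo model", [1, 2, 3, 4, 1, 2], [1, 2, 3, 4, 2, 2])
print(performance_demo)
```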


## L2 Regularization

In [10]:
log_reg_L2_model = LogisticRegression(penalty='l2', max_iter=5000)
_ = log_reg_L2_model.fit(X_train, np.ravel(y_train))

In [11]:
model_preds = log_reg_L2_model.predict(X_test)
c_matrix = confusion_matrix(y_test, model_preds)
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
performance = pd.concat([performance, pd.DataFrame({'model':"L2 logistic", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])
performance

Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,default logistic,0.944828,0.981481,0.883333,0.929825
0,liblinear logistic,0.896296,0.973684,0.74,0.840909
0,L2 logistic,0.944828,0.981481,0.883333,0.929825


## L1 Regularization

In [12]:
log_reg_L1_model = LogisticRegression(solver='liblinear', penalty='l1')
_ = log_reg_L1_model.fit(X_train, np.ravel(y_train))

In [13]:
model_preds = log_reg_L1_model.predict(X_test)
c_matrix = confusion_matrix(y_test, model_preds)
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
performance = pd.concat([performance, pd.DataFrame({'model':"L1 logistic", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])
performance

Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,default logistic,0.944828,0.981481,0.883333,0.929825
0,liblinear logistic,0.896296,0.973684,0.74,0.840909
0,L2 logistic,0.944828,0.981481,0.883333,0.929825
0,L1 logistic,0.939394,0.97561,0.851064,0.909091


## Elastic Net Regularization

In [14]:
log_reg_elastic_model = LogisticRegression(solver='saga', penalty='elasticnet', l1_ratio=0.5, max_iter=5000)
_ = log_reg_elastic_model.fit(X_train, np.ravel(y_train))

In [15]:
model_preds = log_reg_elastic_model.predict(X_test)
c_matrix = confusion_matrix(y_test, model_preds)
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
performance = pd.concat([performance, pd.DataFrame({'model':"Elestic logistic", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])
performance

Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,default logistic,0.944828,0.981481,0.883333,0.929825
0,liblinear logistic,0.896296,0.973684,0.74,0.840909
0,L2 logistic,0.944828,0.981481,0.883333,0.929825
0,L1 logistic,0.939394,0.97561,0.851064,0.909091
0,Elestic logistic,0.855172,0.914894,0.716667,0.803738


## Random search logistic

In [None]:
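from sklearn.model_selection import RandomizedSearchCV

# Plain "precision" only supports binary targets; macro averaging is assumed here
# because rate_category has more than two classes.
score_measure = "precision_macro"
kfolds = 5

# 'lasso' and 'elastic' are not valid penalty names ('l1' is the lasso penalty).
# lbfgs supports only the l2 penalty; liblinear supports l1 and l2.
param_grid = [
    {'solver': ['lbfgs'], 'penalty': ['l2'], 'C': [0.001, 0.1, 1, 10]},
    {'solver': ['liblinear'], 'penalty': ['l1', 'l2'], 'C': [0.001, 0.1, 1, 10]},
]

LR = LogisticRegression(max_iter=5000)
grid = RandomizedSearchCV(LR, param_grid, cv=kfolds, scoring=score_measure, refit=True, verbose=3)

# fitting the model for randomized search
grid.fit(X_train, np.ravel(y_train))
print(f"The best {score_measure} score is {grid.best_score_}")
print(f"... with parameters: {grid.best_params_}")

best_logistic = grid.best_estimator_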

In [None]:
c_matrix = confusion_matrix(y_test, grid.predict(X_test))
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
performance = pd.concat([performance, pd.DataFrame({'model':"Random search logistic ", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])
performance

## Grid search logistic

In [None]:
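from sklearn.model_selection import GridSearchCV

# Plain "precision" only supports binary targets; macro averaging is assumed here
# because rate_category has more than two classes.
score_measure = "precision_macro"
kfolds = 5

# 'lasso' and 'elastic' are not valid penalty names ('l1' is the lasso penalty).
# lbfgs supports only the l2 penalty; liblinear supports l1 and l2.
param_grid = [
    {'solver': ['lbfgs'], 'penalty': ['l2'], 'C': [0.1, 1, 10]},
    {'solver': ['liblinear'], 'penalty': ['l1', 'l2'], 'C': [0.1, 1, 10]},
]

grid = GridSearchCV(LogisticRegression(max_iter=5000), param_grid, cv=kfolds,
                    scoring=score_measure, refit=True, verbose=3)

# fitting the model for grid search
grid.fit(X_train, np.ravel(y_train))
print(f"The best {score_measure} score is {grid.best_score_}")
print(f"... with parameters: {grid.best_params_}")

best_logistic = grid.best_estimator_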

In [None]:
model_preds = grid.predict(X_test)
c_matrix = confusion_matrix(y_test, model_preds)
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
performance = pd.concat([performance, pd.DataFrame({'model':"Grid search logistic", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])

## SVM classification model using linear kernel

In [16]:
svm_lin_model = SVC(kernel="linear")
_ = svm_lin_model.fit(X_train, np.ravel(y_train))

In [17]:
model_preds = svm_lin_model.predict(X_test)
c_matrix = confusion_matrix(y_test, model_preds)
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
performance = pd.concat([performance, pd.DataFrame({'model':"svm with linear kernel", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])
performance

Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,default logistic,0.944828,0.981481,0.883333,0.929825
0,liblinear logistic,0.896296,0.973684,0.74,0.840909
0,L2 logistic,0.944828,0.981481,0.883333,0.929825
0,L1 logistic,0.939394,0.97561,0.851064,0.909091
0,Elestic logistic,0.855172,0.914894,0.716667,0.803738
0,svm with linear kernel,0.938356,0.964286,0.885246,0.923077


## SVM classification model using rbf kernel

In [18]:
svm_rbf_model = SVC(kernel="rbf", C=10, gamma='scale')
_ = svm_rbf_model.fit(X_train, np.ravel(y_train))

In [19]:
model_preds = svm_rbf_model.predict(X_test)
c_matrix = confusion_matrix(y_test, model_preds)
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
performance = pd.concat([performance, pd.DataFrame({'model':"svm with rbf kernel", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])
performance

Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,default logistic,0.944828,0.981481,0.883333,0.929825
0,liblinear logistic,0.896296,0.973684,0.74,0.840909
0,L2 logistic,0.944828,0.981481,0.883333,0.929825
0,L1 logistic,0.939394,0.97561,0.851064,0.909091
0,Elestic logistic,0.855172,0.914894,0.716667,0.803738
0,svm with linear kernel,0.938356,0.964286,0.885246,0.923077
0,svm with rbf kernel,0.547771,0.509434,0.375,0.432


## SVM classification model using polynomial kernel

In [20]:
svm_poly_model = SVC(kernel="poly", degree=3, coef0=1, C=10)
_ = svm_poly_model.fit(X_train, np.ravel(y_train))

In [21]:
model_preds = svm_poly_model.predict(X_test)
c_matrix = confusion_matrix(y_test, model_preds)
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
performance = pd.concat([performance, pd.DataFrame({'model':"svm with polynomial kernel", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])
performance

Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,default logistic,0.944828,0.981481,0.883333,0.929825
0,liblinear logistic,0.896296,0.973684,0.74,0.840909
0,L2 logistic,0.944828,0.981481,0.883333,0.929825
0,L1 logistic,0.939394,0.97561,0.851064,0.909091
0,Elestic logistic,0.855172,0.914894,0.716667,0.803738
0,svm with linear kernel,0.938356,0.964286,0.885246,0.923077
0,svm with rbf kernel,0.547771,0.509434,0.375,0.432
0,svm with polynomial kernel,0.452229,0.436364,0.666667,0.527473
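
The RBF and polynomial kernels score far below the linear kernel in the table above. One common cause is features on very different scales, since these kernels depend on distances and dot products between rows. Whether that applies to this dataset is an assumption, because the feature columns are not shown. The sketch below uses synthetic data from `make_classification` with one deliberately inflated feature to compare an unscaled `SVC` with one wrapped in a `StandardScaler` pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data; one feature is inflated to mimic mixed scales.
X, y = make_classification(n_samples=400, n_features=6, n_informative=4, random_state=1)
X[:, 0] *= 1000.0
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

# Same SVC settings as the notebook's RBF model, with and without scaling.
raw_svm = SVC(kernel="rbf", C=10, gamma="scale").fit(X_tr, y_tr)
scaled_svm = make_pipeline(
    StandardScaler(),
    SVC(kernel="rbf", C=10, gamma="scale"),
).fit(X_tr, y_tr)

print("raw accuracy:", raw_svm.score(X_te, y_te))
print("scaled accuracy:", scaled_svm.score(X_te, y_te))
```

Putting the scaler inside the pipeline means it is fit on the training data only, so the same object can be passed to `GridSearchCV` or `RandomizedSearchCV` without leaking test-fold statistics.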


## Random search SVM

In [31]:
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

In [None]:
kfolds = 5
param_grid = {'C': [0.1, 1, 10], 
              'gamma': [1, 0.1, 0.01, 0.001],
              'kernel': ['linear','poly','rbf']} 

# No `scoring` argument is passed, so candidates are ranked by SVC's default score (accuracy).
grid = RandomizedSearchCV(SVC(), param_grid, cv=kfolds, refit=True, verbose=3)

# fitting the model for randomized search; np.ravel passes y as a 1-D array
grid.fit(X_train, np.ravel(y_train))
print(f"The best cross-validated accuracy is {grid.best_score_}")
print(f"... with parameters: {grid.best_params_}")

best_svm = grid.best_estimator_

Fitting 5 folds for each of 10 candidates, totalling 50 fits
[CV 1/5] END ....C=0.1, gamma=0.001, kernel=rbf;, score=0.415 total time=   0.0s
[CV 2/5] END ....C=0.1, gamma=0.001, kernel=rbf;, score=0.377 total time=   0.0s
[CV 3/5] END ....C=0.1, gamma=0.001, kernel=rbf;, score=0.399 total time=   0.0s
[CV 4/5] END ....C=0.1, gamma=0.001, kernel=rbf;, score=0.487 total time=   0.0s
[CV 5/5] END ....C=0.1, gamma=0.001, kernel=rbf;, score=0.392 total time=   0.0s
[CV 1/5] END .........C=10, gamma=1, kernel=rbf;, score=0.698 total time=   0.0s
[CV 2/5] END .........C=10, gamma=1, kernel=rbf;, score=0.730 total time=   0.0s
[CV 3/5] END .........C=10, gamma=1, kernel=rbf;, score=0.722 total time=   0.0s
[CV 4/5] END .........C=10, gamma=1, kernel=rbf;, score=0.696 total time=   0.0s
[CV 5/5] END .........C=10, gamma=1, kernel=rbf;, score=0.734 total time=   0.0s
[CV 1/5] END ......C=10, gamma=0.01, kernel=rbf;, score=0.723 total time=   0.0s
[CV 2/5] END ......C=10, gamma=0.01, kernel=rbf;, score=0.692 total time=   0.0s
[CV 3/5] END ......C=10, gamma=0.01, kernel=rbf;, score=0.677 total time=   0.0s
[CV 4/5] END ......C=10, gamma=0.01, kernel=rbf;, score=0.665 total time=   0.0s
[CV 5/5] END ......C=10, gamma=0.01, kernel=rbf;, score=0.639 total time=   0.0s
[CV 1/5] END ......C=0.1, gamma=0.1, kernel=rbf;, score=0.377 total time=   0.0s
[CV 2/5] END ......C=0.1, gamma=0.1, kernel=rbf;, score=0.358 total time=   0.0s
[CV 3/5] END ......C=0.1, gamma=0.1, kernel=rbf;, score=0.386 total time=   0.0s
[CV 4/5] END ......C=0.1, gamma=0.1, kernel=rbf;, score=0.418 total time=   0.0s
[CV 5/5] END ......C=0.1, gamma=0.1, kernel=rbf;, score=0.418 total time=   0.0s
[CV 1/5] END ......C=1, gamma=0.001, kernel=rbf;, score=0.453 total time=   0.0s
[CV 2/5] END ......C=1, gamma=0.001, kernel=rbf;, score=0.403 total time=   0.0s
[CV 3/5] END ......C=1, gamma=0.001, kernel=rbf;, score=0.513 total time=   0.0s
[CV 4/5] END ......C=1, gamma=0.001, kernel=rbf;, score=0.487 total time=   0.0s
[CV 5/5] END ......C=1, gamma=0.001, kernel=rbf;, score=0.424 total time=   0.0s
[CV 1/5] END ....C=1, gamma=0.01, kernel=linear;, score=0.881 total time=   4.2s
[CV 2/5] END ....C=1, gamma=0.01, kernel=linear;, score=0.855 total time=   7.6s
[CV 3/5] END ....C=1, gamma=0.01, kernel=linear;, score=0.880 total time=   4.7s
[CV 4/5] END ....C=1, gamma=0.01, kernel=linear;, score=0.905 total time=   5.3s
[CV 5/5] END ....C=1, gamma=0.01, kernel=linear;, score=0.880 total time=   5.6s
[CV 1/5] END ........C=1, gamma=0.1, kernel=rbf;, score=0.736 total time=   0.0s
[CV 2/5] END ........C=1, gamma=0.1, kernel=rbf;, score=0.711 total time=   0.0s
[CV 3/5] END ........C=1, gamma=0.1, kernel=rbf;, score=0.722 total time=   0.0s
[CV 4/5] END ........C=1, gamma=0.1, kernel=rbf;, score=0.747 total time=   0.0s
[CV 5/5] END ........C=1, gamma=0.1, kernel=rbf;, score=0.734 total time=   0.0s
[CV 1/5] END .C=0.1, gamma=0.001, kernel=linear;, score=0.805 total time=   1.3s
[CV 2/5] END .C=0.1, gamma=0.001, kernel=linear;, score=0.843 total time=   1.3s
[CV 3/5] END .C=0.1, gamma=0.001, kernel=linear;, score=0.804 total time=   1.3s
[CV 4/5] END .C=0.1, gamma=0.001, kernel=linear;, score=0.778 total time=   1.5s
[CV 5/5] END .C=0.1, gamma=0.001, kernel=linear;, score=0.741 total time=   1.7s

In [None]:
c_matrix = confusion_matrix(y_test, grid.predict(X_test))
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
performance = pd.concat([performance, pd.DataFrame({'model':"Random search SVM ", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])
performance

## Grid search SVM

In [None]:
# Plain "precision" only supports binary targets; macro averaging is assumed here
# because rate_category has more than two classes.
score_measure = "precision_macro"
kfolds = 5
param_grid = {'C': [0.1, 1, 10], 
              'gamma': [1, 0.1, 0.01, 0.001],
              'kernel': ['linear','poly','rbf']} 

grid = GridSearchCV(SVC(), param_grid, cv=kfolds, scoring=score_measure, refit=True, verbose=3)

# fitting the model for grid search; np.ravel passes y as a 1-D array
grid.fit(X_train, np.ravel(y_train))
print(f"The best {score_measure} score is {grid.best_score_}")
print(f"... with parameters: {grid.best_params_}")

best_svm = grid.best_estimator_

In [None]:
c_matrix = confusion_matrix(y_test, grid.predict(X_test))
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
performance = pd.concat([performance, pd.DataFrame({'model':"Grid search SVM ", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])
performance

## Decision Tree Classifier

In [22]:
from sklearn.tree import DecisionTreeClassifier

In [23]:
dt=DecisionTreeClassifier()
dt=dt.fit(X_train, y_train)

In [24]:
c_matrix = confusion_matrix(y_test, dt.predict(X_test))
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
performance = pd.concat([performance, pd.DataFrame({'model':"Decision Tree", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])
performance

Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,default logistic,0.944828,0.981481,0.883333,0.929825
0,liblinear logistic,0.896296,0.973684,0.74,0.840909
0,L2 logistic,0.944828,0.981481,0.883333,0.929825
0,L1 logistic,0.939394,0.97561,0.851064,0.909091
0,Elestic logistic,0.855172,0.914894,0.716667,0.803738
0,svm with linear kernel,0.938356,0.964286,0.885246,0.923077
0,svm with rbf kernel,0.547771,0.509434,0.375,0.432
0,svm with polynomial kernel,0.452229,0.436364,0.666667,0.527473
0,Decision Tree,0.980263,0.984848,0.970149,0.977444


## Randomized Search Decision Tree

In [25]:
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier 

In [26]:
score_measure = "precision"
kfolds = 5

param_grid = {
    'min_samples_split': np.arange(1,50),  
    'min_samples_leaf': np.arange(1,50),
    'min_impurity_decrease': np.arange(0.0001, 0.01, 0.0005),
    'max_leaf_nodes': np.arange(5, 200), 
    'max_depth': np.arange(1,60), 
    'criterion': ['entropy', 'gini'],
}

dtree = DecisionTreeClassifier()
rand_search = RandomizedSearchCV(estimator = dtree, param_distributions=param_grid, cv=kfolds, n_iter=500,
                           scoring=score_measure, verbose=1, n_jobs=-1,  # n_jobs=-1 will utilize all available CPUs 
                           return_train_score=True)

_ = rand_search.fit(X_train, y_train)

print(f"The best {score_measure} score is {rand_search.best_score_}")
print(f"... with parameters: {rand_search.best_params_}")

bestRecallTree = rand_search.best_estimator_

Fitting 5 folds for each of 500 candidates, totalling 2500 fits
The best precision score is nan
... with parameters: {'min_samples_split': 15, 'min_samples_leaf': 26, 'min_impurity_decrease': 0.0001, 'max_leaf_nodes': 66, 'max_depth': 9, 'criterion': 'gini'}


25 fits failed out of a total of 2500.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
25 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\akhil\anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\akhil\anaconda3\lib\site-packages\sklearn\tree\_classes.py", line 937, in fit
    super().fit(
  File "C:\Users\akhil\anaconda3\lib\site-packages\sklearn\tree\_classes.py", line 250, in fit
    raise ValueError(
ValueError: min_samples_split must be an integer greater than 1 or a float in (0.0, 1.0]; got the integer 1

 nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
 na
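
Two separate problems seem to show up in the output above. The traceback states that `min_samples_split=1` is rejected, which accounts for the 25 failed fits. The best score being `nan` for every candidate is more likely caused by the scorer: plain `"precision"` assumes a binary target, and with a multi-class target each scoring call fails and is recorded as `nan`. That second diagnosis is an assumption based on the `y_train` preview. The sketch below runs a smaller search on synthetic four-class data with both points addressed; `"precision_macro"` is one possible multi-class scorer, not necessarily the intended one.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic four-class data standing in for the assignment files.
X, y = make_classification(n_samples=300, n_features=8, n_informative=5,
                           n_classes=4, n_clusters_per_class=1, random_state=1)

param_distributions = {
    "min_samples_split": np.arange(2, 50),  # start at 2: the value 1 is rejected
    "min_samples_leaf": np.arange(1, 50),
    "max_depth": np.arange(1, 60),
    "criterion": ["entropy", "gini"],
}

search = RandomizedSearchCV(
    DecisionTreeClassifier(random_state=1),
    param_distributions=param_distributions,
    n_iter=20,
    cv=5,
    scoring="precision_macro",  # multi-class scorer; plain "precision" is binary-only
    random_state=1,
    error_score="raise",        # surface failures instead of recording nan
)
search.fit(X, y)

print(search.best_score_, search.best_params_)
```

Setting `error_score="raise"`, as the warning message itself suggests, makes a misconfigured search stop at the first failing fit rather than silently returning `nan` scores.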

In [27]:
c_matrix = confusion_matrix(y_test,rand_search.predict(X_test))
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
performance = pd.concat([performance, pd.DataFrame({'model':"Decisio Tree random", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])
performance

Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,default logistic,0.944828,0.981481,0.883333,0.929825
0,liblinear logistic,0.896296,0.973684,0.74,0.840909
0,L2 logistic,0.944828,0.981481,0.883333,0.929825
0,L1 logistic,0.939394,0.97561,0.851064,0.909091
0,Elestic logistic,0.855172,0.914894,0.716667,0.803738
0,svm with linear kernel,0.938356,0.964286,0.885246,0.923077
0,svm with rbf kernel,0.547771,0.509434,0.375,0.432
0,svm with polynomial kernel,0.452229,0.436364,0.666667,0.527473
0,Decision Tree,0.980263,0.984848,0.970149,0.977444
0,Decisio Tree random,0.987179,1.0,0.971831,0.985714


## Grid search Decision Tree

In [28]:
score_measure = "precision"
kfolds = 5

param_grid = {
    'min_samples_split': np.arange(13,18),  
    'min_samples_leaf': np.arange(24,28),
    'min_impurity_decrease': np.arange(0.0000, 0.0004, 0.0001),
    'max_leaf_nodes': np.arange(63, 68), 
    'max_depth': np.arange(7,11), 
    'criterion': ['entropy'],
}

dtree = DecisionTreeClassifier()
grid_search = GridSearchCV(estimator = dtree, param_grid=param_grid, cv=kfolds, 
                           scoring=score_measure, verbose=1, n_jobs=-1,  # n_jobs=-1 will utilize all available CPUs 
                           return_train_score=True)

_ = grid_search.fit(X_train, y_train)

print(f"The best {score_measure} score is {grid_search.best_score_}")
print(f"... with parameters: {grid_search.best_params_}")

bestRecallTree = grid_search.best_estimator_

Fitting 5 folds for each of 1600 candidates, totalling 8000 fits
The best precision score is nan
... with parameters: {'criterion': 'entropy', 'max_depth': 7, 'max_leaf_nodes': 63, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 24, 'min_samples_split': 13}




In [29]:
c_matrix = confusion_matrix(y_test,grid_search.predict(X_test))
TP = c_matrix[1][1]
TN = c_matrix[0][0]
FP = c_matrix[0][1]
FN = c_matrix[1][0]
performance = pd.concat([performance, pd.DataFrame({'model':"Decision Tree grid Search", 
                                                    'Accuracy': [(TP+TN)/(TP+TN+FP+FN)], 
                                                    'Precision': [TP/(TP+FP)], 
                                                    'Recall': [TP/(TP+FN)], 
                                                    'F1': [2*TP/(2*TP+FP+FN)]
                                                     }, index=[0])])
performance

Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,default logistic,0.944828,0.981481,0.883333,0.929825
0,liblinear logistic,0.896296,0.973684,0.74,0.840909
0,L2 logistic,0.944828,0.981481,0.883333,0.929825
0,L1 logistic,0.939394,0.97561,0.851064,0.909091
0,Elestic logistic,0.855172,0.914894,0.716667,0.803738
0,svm with linear kernel,0.938356,0.964286,0.885246,0.923077
0,svm with rbf kernel,0.547771,0.509434,0.375,0.432
0,svm with polynomial kernel,0.452229,0.436364,0.666667,0.527473
0,Decision Tree,0.980263,0.984848,0.970149,0.977444
0,Decisio Tree random,0.987179,1.0,0.971831,0.985714


In [30]:
performance.sort_values(by=['Precision'])

Unnamed: 0,model,Accuracy,Precision,Recall,F1
0,svm with rbf kernel,0.547771,0.509434,0.375,0.432
0,svm with polynomial kernel,0.452229,0.436364,0.666667,0.527473
0,Elestic logistic,0.855172,0.914894,0.716667,0.803738
0,liblinear logistic,0.896296,0.973684,0.74,0.840909
0,L1 logistic,0.939394,0.97561,0.851064,0.909091
0,svm with linear kernel,0.938356,0.964286,0.885246,0.923077
0,default logistic,0.944828,0.981481,0.883333,0.929825
0,L2 logistic,0.944828,0.981481,0.883333,0.929825
0,Decision Tree,0.980263,0.984848,0.970149,0.977444
0,Decisio Tree random,0.987179,1.0,0.971831,0.985714


## Conclusion

Based on the performance table above, the randomized-search Decision Tree ("Decisio Tree random") has the highest precision, 1.000000, meaning that none of its positive predictions were false positives. Its recall of 0.971831 shows that it still missed some positive cases. The grid-search Decision Tree is reported to match this score, although its row does not appear in the table above. With precision as the chosen metric, we can conclude that the tuned Decision Tree models perform best among all the models compared.
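
A precision of 1.0 means there were no false positives; it does not by itself mean that every positive case was found, which is what recall measures. The sketch below uses hypothetical binary labels to show a model with perfect precision and imperfect recall.

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical labels: every predicted positive is correct, but one positive is missed.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0]

precision = precision_score(y_true, y_pred)  # 3 / (3 + 0)
recall = recall_score(y_true, y_pred)        # 3 / (3 + 1)

print(precision, recall)
```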