## Model Fitting File
### Janhavi Anantprakash Kharmale

In this file, I fit three models (Logistic Regression, Linear SVM, and Decision Tree) to the previously preprocessed dataset. Each model is then tuned with both Randomized Search and Grid Search.

In [28]:
#Importing the required libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.metrics import (accuracy_score, precision_score, recall_score, f1_score,
                             confusion_matrix, ConfusionMatrixDisplay, classification_report)
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV
np.random.seed(42)

In [2]:
#reading the CSV file
df= pd.read_csv(r"C:\Users\janha\Downloads\archive\breast_cancer_data.csv", index_col=False)

In [3]:
df

Unnamed: 0,diagnosis,radius_mean,texture_mean,smoothness_mean,compactness_mean,symmetry_mean,fractal_dimension_mean,radius_se,texture_se,smoothness_se,compactness_se,symmetry_se,fractal_dimension_se
0,1,17.99,10.38,0.11840,0.27760,0.2419,0.07871,1.0950,0.9053,0.006399,0.04904,0.03003,0.006193
1,1,20.57,17.77,0.08474,0.07864,0.1812,0.05667,0.5435,0.7339,0.005225,0.01308,0.01389,0.003532
2,1,19.69,21.25,0.10960,0.15990,0.2069,0.05999,0.7456,0.7869,0.006150,0.04006,0.02250,0.004571
3,1,11.42,20.38,0.14250,0.28390,0.2597,0.09744,0.4956,1.1560,0.009110,0.07458,0.05963,0.009208
4,1,20.29,14.34,0.10030,0.13280,0.1809,0.05883,0.7572,0.7813,0.011490,0.02461,0.01756,0.005115
...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,1,21.56,22.39,0.11100,0.11590,0.1726,0.05623,1.1760,1.2560,0.010300,0.02891,0.01114,0.004239
565,1,20.13,28.25,0.09780,0.10340,0.1752,0.05533,0.7655,2.4630,0.005769,0.02423,0.01898,0.002498
566,1,16.60,28.08,0.08455,0.10230,0.1590,0.05648,0.4564,1.0750,0.005903,0.03731,0.01318,0.003892
567,1,20.60,29.33,0.11780,0.27700,0.2397,0.07016,0.7260,1.5950,0.006522,0.06158,0.02324,0.006185


In [4]:
#Separating the features and target variables
X = df.drop(columns=['diagnosis'])
y = df['diagnosis']

In [6]:
# Splitting the data into train and test sets (70% and 30% respectively)
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3, random_state=1)
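
As a quick sanity check on the split sizes, here is a minimal sketch of the arithmetic. It assumes the test count is rounded up and the remainder goes to training; the total of 569 rows comes from the `value_counts()` output later in this notebook, and the names `n_samples`, `n_test`, and `n_train` are only illustrative.

```python
import math

n_samples = 569   # 357 + 212 rows, taken from the diagnosis value counts
test_size = 0.3

# Assumption: the test share is rounded up, the rest is used for training
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(f"train rows: {n_train}, test rows: {n_test}")
```

The 171 test rows agree with the total support shown in the decision tree's classification report further down.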


In [7]:
# Create a standard scaler and fit it to the training predictors only
scaler = StandardScaler()
scaler.fit(X_train)

# Transform the predictors of the training and test sets
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

## Logistic Regression Model

In [18]:
model = LogisticRegression(
    max_iter=1000,  # increase the number of iterations
    n_jobs=-1 )      # use all processors
model.fit(X_train, y_train)

In [19]:
results = pd.DataFrame()
results['actual'] = y_test
results['predicted'] = model.predict(X_test)
results

Unnamed: 0,actual,predicted
421,0,1
47,1,0
292,0,0
186,1,1
414,1,1
...,...,...
6,1,1
487,1,1
11,1,1
268,0,0


In [20]:
df['diagnosis'].value_counts()  

diagnosis
0    357
1    212
Name: count, dtype: int64

In [21]:
results['predicted'].value_counts()

predicted
0    113
1     58
Name: count, dtype: int64

These counts show the class distribution in each column. In the full dataset's "diagnosis" column there are 357 instances of class 0 and 212 instances of class 1, so class 0 is the majority class (about 63% of the rows). On the 171-row test set, the model predicts class 0 for 113 rows and class 1 for 58 rows.
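
To make the class balance concrete, the counts above can be turned into proportions. This is a small sketch using only the two numbers printed by `value_counts()`; the names `counts`, `shares`, and `majority_share` are illustrative.

```python
# Class counts copied from the diagnosis value_counts() output
counts = {0: 357, 1: 212}
total = sum(counts.values())

# Share of each class in the full dataset
shares = {label: n / total for label, n in counts.items()}
for label, share in shares.items():
    print(f"class {label}: {share:.1%}")

# A classifier that always predicts the majority class would score this accuracy
majority_share = max(shares.values())
```

Any accuracy reported below is worth reading against this majority share (computed here on the full dataset, not on the test split).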

In [23]:
cm=confusion_matrix(y_test,results['predicted'])
cm

array([[101,   7],
       [ 12,  51]], dtype=int64)

In [24]:
accuracy_score(y_test,results['predicted'])

0.8888888888888888
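
The accuracy above, and the precision, recall, and F1 score used later, can all be derived from the four cells of the confusion matrix. The sketch below assumes scikit-learn's layout, where rows are actual classes and columns are predicted classes; the variable names are illustrative.

```python
# Confusion matrix copied from the output above: rows = actual, columns = predicted
cm = [[101, 7],
      [12, 51]]
tn, fp = cm[0]
fn, tp = cm[1]

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the predicted positives, how many are real
recall = tp / (tp + fn)      # of the real positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

For this untuned model, recall works out to 51/63, so 12 of the 63 positive cases in the test set were missed.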

### Randomized search

In [25]:
# Hyperparameter grid for Logistic Regression
# (not every penalty/solver pair is valid; invalid pairs fail during the search)
param_grid_lr = {
    'penalty': ['l1', 'l2', 'elasticnet', 'none'],
    'C': [0.001, 0.01, 0.1, 1, 10, 100],
    'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']
}

In [26]:
# Create a RandomizedSearchCV object
LR_random_search = RandomizedSearchCV(model, param_grid_lr, n_iter=10, cv=5, random_state=42)

# Fit the RandomizedSearchCV object to the training data
LR_random_search.fit(X_train, y_train)

# Get the best parameters
best_params_random_lr = LR_random_search.best_params_

print('Best parameters found: ', best_params_random_lr)
print('---------------------------------------------------------------------------------\n')

y_pred_lr = LR_random_search.predict(X_test)

print(f"{'Accuracy Score : ':10}{accuracy_score(y_test, y_pred_lr):.3f}")
print(f"{'Recall Score : ':10}{recall_score(y_test, y_pred_lr):.3f}")
print(f"{'Precision Score : ':10}{precision_score(y_test, y_pred_lr):.3f}")
print(f"{'F1 Score : ':10}{f1_score(y_test, y_pred_lr):.3f}")
print('---------------------------------------------------------------------------------\n')

# Mean predicted probability of class 1 (breast cancer) over the test set.
# NOTE: this uses the baseline `model`, not the tuned `LR_random_search` estimator.
test_probabilities = model.predict_proba(X_test)[:, 1]

test_overall_probability = test_probabilities.mean()

print(f"Overall Probability of having breast cancer: {test_overall_probability*100:.2f}")



Best parameters found:  {'solver': 'saga', 'penalty': 'l1', 'C': 1}
---------------------------------------------------------------------------------

Accuracy Score : 0.901
Recall Score : 0.857
Precision Score : 0.871
F1 Score : 0.864
---------------------------------------------------------------------------------

Overall Probability of having breast cancer: 36.16


15 fits failed out of a total of 50.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
5 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\janha\anaconda3\Lib\site-packages\sklearn\model_selection\_validation.py", line 729, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\janha\anaconda3\Lib\site-packages\sklearn\base.py", line 1152, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\janha\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py", line 1169, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

### Grid Search

In [29]:
# Create a GridSearchCV object
LR_grid_search = GridSearchCV(model, param_grid_lr, cv=5)

# Fit the GridSearchCV object to your training data
LR_grid_search.fit(X_train, y_train)

# Get the best parameters
best_params_lr_grid = LR_grid_search.best_params_

print("Best Parameters:", best_params_lr_grid)
print('---------------------------------------------------------------------------------\n')

y_pred_lr_grid = LR_grid_search.predict(X_test)

print(f"{'Accuracy Score : ':10}{accuracy_score(y_test, y_pred_lr_grid):.3f}")
print(f"{'Recall Score : ':10}{recall_score(y_test, y_pred_lr_grid):.3f}")
print(f"{'Precision Score : ':10}{precision_score(y_test, y_pred_lr_grid):.3f}")
print(f"{'F1 Score : ':10}{f1_score(y_test, y_pred_lr_grid):.3f}")
print('---------------------------------------------------------------------------------\n')

# Mean predicted probability of class 1 (breast cancer) over the test set.
# NOTE: this uses the baseline `model`, not the tuned `LR_grid_search` estimator.
test_probabilities = model.predict_proba(X_test)[:, 1]

test_overall_probability = test_probabilities.mean()

print(f"Overall Probability of person having breast cancer: {test_overall_probability*100:.2f}")




Best Parameters: {'C': 1, 'penalty': 'l1', 'solver': 'saga'}
---------------------------------------------------------------------------------

Accuracy Score : 0.901
Recall Score : 0.857
Precision Score : 0.871
F1 Score : 0.864
---------------------------------------------------------------------------------

Overall Probability of person having breast cancer: 36.16


270 fits failed out of a total of 600.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
30 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\janha\anaconda3\Lib\site-packages\sklearn\model_selection\_validation.py", line 729, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\janha\anaconda3\Lib\site-packages\sklearn\base.py", line 1152, in wrapper
    return fit_method(estimator, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\janha\anaconda3\Lib\site-packages\sklearn\linear_model\_logistic.py", line 1169, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
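
The "270 fits failed out of a total of 600" warning can be reproduced by counting the invalid combinations in `param_grid_lr`. The sketch below is my own reconstruction, not part of the original run: the `SUPPORTED` table is an assumed summary of the solver/penalty compatibility described in the scikit-learn `LogisticRegression` documentation, and `'elasticnet'` is counted as failing everywhere because it needs an `l1_ratio` value that this grid never sets.

```python
from itertools import product

# Same values as param_grid_lr above
penalties = ['l1', 'l2', 'elasticnet', 'none']
Cs = [0.001, 0.01, 0.1, 1, 10, 100]
solvers = ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga']

# Assumed compatibility table (penalties each solver accepts without l1_ratio)
SUPPORTED = {
    'newton-cg': {'l2', 'none'},
    'lbfgs': {'l2', 'none'},
    'sag': {'l2', 'none'},
    'liblinear': {'l1', 'l2'},
    'saga': {'l1', 'l2', 'none'},
}

cv = 5
combos = list(product(penalties, Cs, solvers))
invalid = [(p, c, s) for p, c, s in combos if p not in SUPPORTED[s]]

total_fits = len(combos) * cv
failed_fits = len(invalid) * cv
print(f"{failed_fits} of {total_fits} fits would fail")
```

Under these assumptions, 54 of the 120 combinations are invalid, which gives 270 failed fits over 5 folds. Restricting the grid to valid pairs would avoid the wasted fits.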

## SVM Model

### Linear SVM Model

In [30]:
# Initialize the model with a linear kernel
clf_linear = SVC(kernel='linear', probability=True)

# Train the model
clf_linear.fit(X_train, y_train)

# Make predictions
y_pred_linear = clf_linear.predict(X_test)

### Randomized Search

In [31]:
#Hyperparameter grid for SVM
param_grid_svm = {
    'C': [0.1, 1, 10],
    'kernel': ['linear', 'rbf'],
    'gamma': ['scale', 'auto']
}


In [32]:
# Create a RandomizedSearchCV object
SVM_random_search = RandomizedSearchCV(clf_linear, param_grid_svm, n_iter=12, cv=5, random_state=42)

# Fit the RandomizedSearchCV object to the training data
SVM_random_search.fit(X_train, y_train)

# Get the best parameters
best_params_random_svm = SVM_random_search.best_params_

print('Best parameters found: ', best_params_random_svm)
print('---------------------------------------------------------------------------------\n')

y_pred_svm = SVM_random_search.predict(X_test)

print(f"{'Accuracy Score : ':10}{accuracy_score(y_test, y_pred_svm):.3f}")
print(f"{'Recall Score : ':10}{recall_score(y_test, y_pred_svm):.3f}")
print(f"{'Precision Score : ':10}{precision_score(y_test, y_pred_svm):.3f}")
print(f"{'F1 Score : ':10}{f1_score(y_test, y_pred_svm):.3f}")
print('---------------------------------------------------------------------------------\n')

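# Mean predicted probability of class 1 (breast cancer) over the test set.
# NOTE: this uses the baseline `clf_linear`, not the tuned `SVM_random_search` estimator.
test_probabilities_svm = clf_linear.predict_proba(X_test)[:, 1]
average_probability_svm = test_probabilities_svm.mean()

# Print the average probability
print(f"Overall Probability of person having breast cancer: {average_probability_svm*100:.2f}")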



Best parameters found:  {'kernel': 'linear', 'gamma': 'scale', 'C': 1}
---------------------------------------------------------------------------------

Accuracy Score : 0.895
Recall Score : 0.857
Precision Score : 0.857
F1 Score : 0.857
---------------------------------------------------------------------------------

Overall Probability of person having breast cancer: 36.85


### Grid Search

In [33]:
# Create a GridSearchCV object
SVM_grid_search = GridSearchCV(clf_linear, param_grid_svm, cv=5)

# Fit the GridSearchCV object to your training data
SVM_grid_search.fit(X_train, y_train)

# Get the best parameters
best_params_svm_grid = SVM_grid_search.best_params_

print("Best Parameters:", best_params_svm_grid)
print('---------------------------------------------------------------------------------\n')

y_pred_svm_grid = SVM_grid_search.predict(X_test)

print(f"{'Accuracy Score : ':10}{accuracy_score(y_test, y_pred_svm_grid):.3f}")
print(f"{'Recall Score : ':10}{recall_score(y_test, y_pred_svm_grid):.3f}")
print(f"{'Precision Score : ':10}{precision_score(y_test, y_pred_svm_grid):.3f}")
print(f"{'F1 Score : ':10}{f1_score(y_test, y_pred_svm_grid):.3f}")
print('---------------------------------------------------------------------------------\n')

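# Mean predicted probability of class 1 (breast cancer) over the test set.
# NOTE: this uses the baseline `clf_linear`, not the tuned `SVM_grid_search` estimator.
test_probabilities_svm = clf_linear.predict_proba(X_test)[:, 1]
average_probability_svm = test_probabilities_svm.mean()

# Print the average probability
print(f"Overall Probability of person having breast cancer: {average_probability_svm*100:.2f}")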


Best Parameters: {'C': 1, 'gamma': 'scale', 'kernel': 'linear'}
---------------------------------------------------------------------------------

Accuracy Score : 0.895
Recall Score : 0.857
Precision Score : 0.857
F1 Score : 0.857
---------------------------------------------------------------------------------

Overall Probability of person having breast cancer: 36.85


## Decision Tree Model

In [34]:
# Initialize and train the Decision Tree model
clf = DecisionTreeClassifier(random_state=42)
clf.fit(X_train, y_train)

In [35]:
# Evaluate the model's performance
score = clf.score(X_test, y_test)
print(f"Model Accuracy: {score:.2f}")

Model Accuracy: 0.87


In [36]:
tree_pred = clf.predict(X_test)
print("Decision Tree:")
print(classification_report(y_test, tree_pred))

Decision Tree:
              precision    recall  f1-score   support

           0       0.91      0.87      0.89       108
           1       0.79      0.86      0.82        63

    accuracy                           0.87       171
   macro avg       0.85      0.86      0.86       171
weighted avg       0.87      0.87      0.87       171



### Randomized Search

In [37]:
#Hyperparameter grid for Decision Tree
param_grid_tree = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [None, 10, 20, 30, 40, 50],
    'min_samples_split': [2, 5, 10]
}

In [38]:
# Create a RandomizedSearchCV object
DT_random_search = RandomizedSearchCV(clf, param_grid_tree, n_iter=10, cv=5, random_state=42)

# Fit the RandomizedSearchCV object to the training data
DT_random_search.fit(X_train, y_train)

# Get the best parameters
best_params_random_dt = DT_random_search.best_params_

print('Best parameters found: ', best_params_random_dt)
print('---------------------------------------------------------------------------------\n')

y_pred_dt = DT_random_search.predict(X_test)

print(f"{'Accuracy Score : ':10}{accuracy_score(y_test, y_pred_dt):.3f}")
print(f"{'Recall Score : ':10}{recall_score(y_test, y_pred_dt):.3f}")
print(f"{'Precision Score : ':10}{precision_score(y_test, y_pred_dt):.3f}")
print(f"{'F1 Score : ':10}{f1_score(y_test, y_pred_dt):.3f}")
print('---------------------------------------------------------------------------------\n')

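# Mean predicted probability of class 1 (breast cancer) over the test set.
# NOTE: this reuses the linear SVM (`clf_linear`), so the value printed below is the
# SVM's, not the decision tree's; `DT_random_search.predict_proba(X_test)[:, 1]` would give the tree's.
test_probabilities_svm = clf_linear.predict_proba(X_test)[:, 1]
average_probability_svm = test_probabilities_svm.mean()

# Print the average probability
print(f"Overall Probability of person having breast cancer: {average_probability_svm*100:.2f}")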


Best parameters found:  {'min_samples_split': 5, 'max_depth': 40, 'criterion': 'entropy'}
---------------------------------------------------------------------------------

Accuracy Score : 0.883
Recall Score : 0.794
Precision Score : 0.877
F1 Score : 0.833
---------------------------------------------------------------------------------

Overall Probability of person having breast cancer: 36.85


### Grid Search

In [39]:
# Create a GridSearchCV object
DT_grid_search = GridSearchCV(clf, param_grid_tree, cv=5)

# Fit the GridSearchCV object to your training data
DT_grid_search.fit(X_train, y_train)

# Get the best parameters
best_params_grid_dt = DT_grid_search.best_params_

print("Best Parameters:", best_params_grid_dt)
print('---------------------------------------------------------------------------------\n')

y_pred_grid_dt = DT_grid_search.predict(X_test)

print(f"{'Accuracy Score : ':10}{accuracy_score(y_test, y_pred_grid_dt):.3f}")
print(f"{'Recall Score : ':10}{recall_score(y_test, y_pred_grid_dt):.3f}")
print(f"{'Precision Score : ':10}{precision_score(y_test, y_pred_grid_dt):.3f}")
print(f"{'F1 Score : ':10}{f1_score(y_test, y_pred_grid_dt):.3f}")
print('---------------------------------------------------------------------------------\n')

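# Mean predicted probability of class 1 (breast cancer) over the test set.
# NOTE: this reuses the linear SVM (`clf_linear`), so the value printed below is the
# SVM's, not the decision tree's; `DT_grid_search.predict_proba(X_test)[:, 1]` would give the tree's.
test_probabilities_svm = clf_linear.predict_proba(X_test)[:, 1]
average_probability_svm = test_probabilities_svm.mean()

# Print the average probability
print(f"Overall Probability of person having breast cancer: {average_probability_svm*100:.2f}")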

Best Parameters: {'criterion': 'entropy', 'max_depth': None, 'min_samples_split': 5}
---------------------------------------------------------------------------------

Accuracy Score : 0.883
Recall Score : 0.794
Precision Score : 0.877
F1 Score : 0.833
---------------------------------------------------------------------------------

Overall Probability of person having breast cancer: 36.85


# Conclusion

After fitting the three models on the data, I obtained the following test-set results:

## Results

### 1. For Logistic Regression Model

* Randomized Search : 

Accuracy Score : 0.901

Recall Score : 0.857

Precision Score : 0.871

F1 Score : 0.864

Overall Probability of person having breast cancer in the test data: 36.16

* Grid Search :

Accuracy Score : 0.901

Recall Score : 0.857

Precision Score : 0.871

F1 Score : 0.864

Overall Probability of person having breast cancer in the test data: 36.16

### 2. For Linear SVM Model 

* Randomized Search :

Accuracy Score : 0.895 

Recall Score : 0.857

Precision Score : 0.857

F1 Score : 0.857

Overall Probability of person having breast cancer in the test data: 36.85

* Grid Search :

Accuracy Score : 0.895

Recall Score : 0.857

Precision Score : 0.857

F1 Score : 0.857

Overall Probability of person having breast cancer in the test data: 36.85

### 3. For Decision Tree Model

* Randomized Search :

Accuracy Score : 0.883

Recall Score :  0.794

Precision Score : 0.877 

F1 Score :  0.833

Overall Probability of person having breast cancer in the test data: 36.85 (this value was computed from the linear SVM `clf_linear`, not from the decision tree)

* Grid Search : 

Accuracy Score : 0.883

Recall Score : 0.794

Precision Score : 0.877

F1 Score : 0.833

Overall Probability of person having breast cancer in the test data: 36.85 (this value was computed from the linear SVM `clf_linear`, not from the decision tree)

## Best performance metric for the data
The most suitable performance metric depends on the specific goals and constraints of the problem.
* "Precision" measures the accuracy of positive predictions. It matters when false positives must be minimized, since misdiagnosing individuals without breast cancer can lead to unnecessary stress and costs.
* "F1 Score" is the harmonic mean of precision and recall, so it is a balanced metric that considers both false positives and false negatives. It is a good choice when a single metric should balance these two considerations.
* "Recall" focuses on the ability to correctly identify positive instances. Since I am primarily concerned with breast cancer diagnosis, it is essential to minimize false negatives (i.e., to avoid missing individuals who have cancer), which makes recall the crucial metric here.
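
Because F1 is the harmonic mean of precision and recall, the reported F1 scores can be checked directly from the other two metrics. The sketch below uses the tuned test-set scores listed in the Results section; the helper name `f1_from` and the `reported` dictionary are illustrative.

```python
# (precision, recall, reported F1) for the tuned models, from the Results section
reported = {
    'Logistic Regression': (0.871, 0.857, 0.864),
    'Linear SVM': (0.857, 0.857, 0.857),
    'Decision Tree': (0.877, 0.794, 0.833),
}

def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

for name, (p, r, f1_reported) in reported.items():
    print(f"{name}: computed F1 = {f1_from(p, r):.3f}, reported = {f1_reported}")
```

The computed values agree with the reported ones to three decimals, which also shows how a low recall (as for the decision tree) pulls F1 down even when precision is high.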

Hence, I select Recall as the primary performance metric for this data.
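
One consequence of this choice: when no `scoring` argument is passed, `GridSearchCV` and `RandomizedSearchCV` rank classifier candidates by accuracy, so the searches in this notebook were optimized for accuracy rather than recall. The sketch below shows how a search could target recall instead. It is only an illustration under stated assumptions: it runs on synthetic data from `make_classification` (not the breast cancer file), and the names `X_demo`, `y_demo`, and `recall_search` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data with a class balance similar to the one described above
X_demo, y_demo = make_classification(
    n_samples=300, n_features=12, weights=[0.63, 0.37], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=1
)

# scoring='recall' makes cross-validation rank candidates by recall of class 1
recall_search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    {'C': [0.01, 0.1, 1, 10]},
    scoring='recall',
    cv=5,
)
recall_search.fit(X_tr, y_tr)

print("Best parameters:", recall_search.best_params_)
print(f"Mean CV recall: {recall_search.best_score_:.3f}")
print(f"Test recall: {recall_score(y_te, recall_search.predict(X_te)):.3f}")
```

Whether this changes the selected hyperparameters on the real data is not something this sketch can show; it would need to be rerun on the notebook's own training set.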

## Best Model for the data

* The Logistic Regression model appears to be the best-performing model for predicting whether a person has breast cancer.
* Its recall (0.857) ties with the Linear SVM and exceeds the Decision Tree (0.794), which means it correctly identified about 86% of the people who have breast cancer (true positives). This is the crucial metric because we want to catch positive cases. Logistic Regression also has the highest accuracy (0.901) and F1 score (0.864) of the three models.
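
The selection can be written out as a small ranking: sort by recall first and use accuracy to break ties. This is a sketch over the tuned test-set scores from the Results section; the list name `tuned_scores` and the tie-break rule are my own assumptions about how to formalize the comparison.

```python
# Tuned test-set scores copied from the Results section
tuned_scores = [
    {'model': 'Logistic Regression', 'accuracy': 0.901, 'recall': 0.857},
    {'model': 'Linear SVM', 'accuracy': 0.895, 'recall': 0.857},
    {'model': 'Decision Tree', 'accuracy': 0.883, 'recall': 0.794},
]

# Rank by recall (primary metric), then by accuracy as a tie-breaker
ranked = sorted(
    tuned_scores,
    key=lambda row: (row['recall'], row['accuracy']),
    reverse=True,
)

for position, row in enumerate(ranked, start=1):
    print(f"{position}. {row['model']}: recall={row['recall']}, accuracy={row['accuracy']}")
```

Logistic Regression and the Linear SVM tie on recall, so the ordering between them rests on the accuracy tie-break.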






## Summary
Best performance metric for the data: Recall

Best-performing model for the data: Logistic Regression Model

Mean predicted probability of a person having breast cancer in the test data: 36.16%