<a href="https://colab.research.google.com/github/BinilTomJose1278/daily-python-scripts/blob/main/BreastCancerPrediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
# Importing necessary libraries for machine learning tasks
import numpy as np                # Library for numerical operations
import pandas as pd               # Library for data manipulation and analysis
from sklearn.model_selection import train_test_split  # Function to split data into training and testing sets
from sklearn.preprocessing import StandardScaler      # Function to standardize features
from sklearn.linear_model import LogisticRegression   # Logistic Regression algorithm
from sklearn.ensemble import RandomForestClassifier   # Random Forest algorithm
from sklearn.svm import SVC                           # Support Vector Machine algorithm
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report  # Metrics for model evaluation


This section imports the libraries used throughout the notebook: numpy for numerical operations, pandas for data manipulation and analysis, and scikit-learn modules for data splitting, preprocessing, model training, and evaluation.


In [None]:
# Loading a dataset using scikit-learn's built-in datasets
from sklearn.datasets import load_breast_cancer
data = load_breast_cancer()
X = data.data    # Features
y = data.target  # Target labels

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# train_test_split function splits the data into training (80%) and testing (20%) sets



We start by loading a dataset using scikit-learn's built-in datasets. For this example, we use the Breast Cancer Wisconsin (Diagnostic) dataset via load_breast_cancer, a binary classification dataset with 569 samples and 30 numeric features. The dataset is split into features (X) and target labels (y), where label 0 is malignant and label 1 is benign. We then use train_test_split to hold out 20% of the samples (114) as a test set, which lets us evaluate the models on data they were not trained on.
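As a quick check of what was loaded, the sketch below inspects the dataset object and repeats the split. The `stratify=y` argument is an assumption added here for illustration and is not used in the notebook's own split; it keeps the class proportions the same in both subsets.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X, y = data.data, data.target

print(X.shape)                      # (569, 30): 569 samples, 30 numeric features
print(data.target_names.tolist())   # ['malignant', 'benign'], so label 0 is malignant

# Assumption: stratify=y is an optional addition, not part of the notebook's split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(X_train.shape[0], X_test.shape[0])  # 455 114
```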

In [None]:
# Standardizing the features to have mean=0 and variance=1
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)  # Fit and transform the training data
X_test = scaler.transform(X_test)        # Transform the testing data



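Preprocessing is a vital step in the machine learning workflow. Here, we standardize each feature to have a mean of 0 and a variance of 1 using StandardScaler. The scaler is fitted on the training set only and then applied unchanged to the test set, so no information from the test data leaks into preprocessing. Standardization helps iterative solvers converge and keeps features with large numeric ranges from dominating models that depend on distances or margins, such as SVMs.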
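The fit-on-train, apply-to-test pattern can be sketched for a single feature column in plain Python. The column values and helper names below are made up for illustration; the formula is z = (x - mean) / std, using the population standard deviation as StandardScaler does.

```python
from statistics import fmean, pstdev

def fit_standardizer(column):
    """Learn the mean and population standard deviation from training values."""
    return fmean(column), pstdev(column)

def standardize(column, mean, std):
    """Apply z = (x - mean) / std using previously learned statistics."""
    return [(x - mean) / std for x in column]

# Hypothetical toy values for one feature
train_col = [2.0, 4.0, 6.0, 8.0]
test_col = [5.0, 10.0]

mean, std = fit_standardizer(train_col)             # statistics come from training data only
train_scaled = standardize(train_col, mean, std)
test_scaled = standardize(test_col, mean, std)      # test data reuses the training statistics

print(fmean(train_scaled), pstdev(train_scaled))
print(test_scaled)
```

The scaled training column has mean 0 and standard deviation 1 by construction; the test column generally does not, because it is transformed with statistics it did not contribute to.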

In [None]:
# Training a Logistic Regression model
log_reg = LogisticRegression(max_iter=10000)
log_reg.fit(X_train, y_train)  # Fit the model to the training data

# Making predictions with the trained model
y_pred_log_reg = log_reg.predict(X_test)

# Evaluating the model
print("Logistic Regression Accuracy:", accuracy_score(y_test, y_pred_log_reg))
# accuracy_score function calculates the accuracy of the model
print(confusion_matrix(y_test, y_pred_log_reg))
# confusion_matrix function returns the confusion matrix
print(classification_report(y_test, y_pred_log_reg))
# classification_report function provides a detailed classification report


Logistic Regression Accuracy: 0.9736842105263158
[[41  2]
 [ 1 70]]
              precision    recall  f1-score   support

           0       0.98      0.95      0.96        43
           1       0.97      0.99      0.98        71

    accuracy                           0.97       114
   macro avg       0.97      0.97      0.97       114
weighted avg       0.97      0.97      0.97       114



Logistic Regression is a simple yet powerful classification algorithm. In this section, we train a Logistic Regression model on the standardized training data, make predictions on the test set, and evaluate them with accuracy, the confusion matrix, and the classification report. In the confusion matrix, rows are true labels and columns are predicted labels, so the model misclassifies 3 of the 114 test samples: 2 malignant tumors predicted as benign and 1 benign tumor predicted as malignant.
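The numbers in the classification report follow directly from the confusion matrix. The sketch below recomputes accuracy, precision, and recall from the matrix printed above; the helper functions are written here for illustration and are not part of scikit-learn.

```python
# Confusion matrix from the Logistic Regression output above
# (rows: true label, columns: predicted label)
cm = [[41, 2],
      [1, 70]]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total

def precision(cm, k):
    """Of all samples predicted as class k, the fraction that truly are class k."""
    predicted_k = sum(row[k] for row in cm)
    return cm[k][k] / predicted_k

def recall(cm, k):
    """Of all samples that truly are class k, the fraction predicted as class k."""
    return cm[k][k] / sum(cm[k])

print(round(accuracy, 4))  # 0.9737
for k in (0, 1):
    print(k, round(precision(cm, k), 2), round(recall(cm, k), 2))
# 0 0.98 0.95
# 1 0.97 0.99
```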

In [None]:
# Training a Random Forest Classifier
rf_clf = RandomForestClassifier(n_estimators=100)
rf_clf.fit(X_train, y_train)  # Fit the model to the training data

# Making predictions with the trained model
y_pred_rf = rf_clf.predict(X_test)

# Evaluating the model
print("Random Forest Accuracy:", accuracy_score(y_test, y_pred_rf))
print(confusion_matrix(y_test, y_pred_rf))
print(classification_report(y_test, y_pred_rf))



Random Forest Accuracy: 0.9649122807017544
[[40  3]
 [ 1 70]]
              precision    recall  f1-score   support

           0       0.98      0.93      0.95        43
           1       0.96      0.99      0.97        71

    accuracy                           0.96       114
   macro avg       0.97      0.96      0.96       114
weighted avg       0.97      0.96      0.96       114



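Random Forest is an ensemble learning method that combines multiple decision trees to improve predictive performance. We train a Random Forest classifier with 100 trees and evaluate it using the same metrics as before. Ensembles like Random Forest are usually more robust and accurate than a single decision tree, although on this test set the forest makes one more error than Logistic Regression (4 versus 3). Because no random_state is set, the exact figures can vary slightly from run to run.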
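A minimal sketch of two things the forest offers beyond predictions: reproducible training through a fixed seed, and per-feature importance scores. The seed value `random_state=0` is an arbitrary assumption, and the raw (unscaled) features are used here because tree splits do not depend on feature scale.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, test_size=0.2, random_state=42
)

# Assumption: random_state=0 is an arbitrary seed chosen for this sketch.
rf_a = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
rf_b = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# With the same seed and data, both forests make identical predictions.
same_predictions = bool((rf_a.predict(X_test) == rf_b.predict(X_test)).all())
print("Reproducible:", same_predictions)

# Rank features by impurity-based importance (the scores sum to 1).
ranked = sorted(zip(rf_a.feature_importances_, data.feature_names), reverse=True)
for importance, name in ranked[:5]:
    print(f"{name}: {importance:.3f}")
```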

In [None]:
# Training a Support Vector Machine
svm_clf = SVC()
svm_clf.fit(X_train, y_train)  # Fit the model to the training data

# Making predictions with the trained model
y_pred_svm = svm_clf.predict(X_test)

# Evaluating the model
print("SVM Accuracy:", accuracy_score(y_test, y_pred_svm))
print(confusion_matrix(y_test, y_pred_svm))
print(classification_report(y_test, y_pred_svm))



SVM Accuracy: 0.9824561403508771
[[41  2]
 [ 0 71]]
              precision    recall  f1-score   support

           0       1.00      0.95      0.98        43
           1       0.97      1.00      0.99        71

    accuracy                           0.98       114
   macro avg       0.99      0.98      0.98       114
weighted avg       0.98      0.98      0.98       114



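Support Vector Machine (SVM) is a powerful algorithm for classification tasks. We train an SVM with scikit-learn's SVC, which uses an RBF kernel by default, on the standardized training data and evaluate its performance. It makes 2 errors on the 114 test samples, the fewest of the three models. SVMs are effective in high-dimensional spaces and are versatile, as they can be used for both classification and regression tasks.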
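To make the defaults behind `SVC()` visible, the following sketch refits the same model and prints its main hyperparameters along with the number of support vectors per class. It repeats the notebook's split and scaling so that it runs on its own.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

svm_clf = SVC().fit(X_train, y_train)

params = svm_clf.get_params()
print(params["kernel"], params["C"], params["gamma"])  # rbf 1.0 scale

# Only the support vectors define the decision boundary.
print("Support vectors per class:", svm_clf.n_support_)
print("Test accuracy:", svm_clf.score(X_test, y_test))
```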

In [None]:
# Function to evaluate the model using accuracy score, confusion matrix, and classification report
def evaluate_model(y_true, y_pred):
    print("Accuracy:", accuracy_score(y_true, y_pred))
    print("Confusion Matrix:\n", confusion_matrix(y_true, y_pred))
    print("Classification Report:\n", classification_report(y_true, y_pred))

# Evaluating Logistic Regression
evaluate_model(y_test, y_pred_log_reg)

# Evaluating Random Forest
evaluate_model(y_test, y_pred_rf)

# Evaluating SVM
evaluate_model(y_test, y_pred_svm)


Accuracy: 0.9736842105263158
Confusion Matrix:
 [[41  2]
 [ 1 70]]
Classification Report:
               precision    recall  f1-score   support

           0       0.98      0.95      0.96        43
           1       0.97      0.99      0.98        71

    accuracy                           0.97       114
   macro avg       0.97      0.97      0.97       114
weighted avg       0.97      0.97      0.97       114

Accuracy: 0.9649122807017544
Confusion Matrix:
 [[40  3]
 [ 1 70]]
Classification Report:
               precision    recall  f1-score   support

           0       0.98      0.93      0.95        43
           1       0.96      0.99      0.97        71

    accuracy                           0.96       114
   macro avg       0.97      0.96      0.96       114
weighted avg       0.97      0.96      0.96       114

Accuracy: 0.9824561403508771
Confusion Matrix:
 [[41  2]
 [ 0 71]]
Classification Report:
               precision    recall  f1-score   support

           0      

Model evaluation is crucial to understand how well our model generalizes to unseen data. We define a function that prints accuracy, the confusion matrix, and the classification report, then call it on the test-set predictions of the Logistic Regression, Random Forest, and SVM models, in that order. The results repeat the per-model figures shown earlier and place them side by side for comparison.
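For a compact side-by-side view, the confusion matrices printed above can be reduced to error counts. This sketch only re-reads the figures already shown in the outputs; it does not retrain anything.

```python
# Confusion matrices copied from the outputs above
# (rows: true label, columns: predicted label)
confusion = {
    "Logistic Regression": [[41, 2], [1, 70]],
    "Random Forest": [[40, 3], [1, 70]],
    "SVM": [[41, 2], [0, 71]],
}

summary = {}
for name, cm in confusion.items():
    total = sum(map(sum, cm))
    errors = cm[0][1] + cm[1][0]          # off-diagonal entries are misclassifications
    summary[name] = (errors, (total - errors) / total)
    print(f"{name}: {errors} errors out of {total}, accuracy {summary[name][1]:.4f}")
# Logistic Regression: 3 errors out of 114, accuracy 0.9737
# Random Forest: 4 errors out of 114, accuracy 0.9649
# SVM: 2 errors out of 114, accuracy 0.9825
```

The three models differ by only one or two test samples, so the ranking on a single 114-sample split should be read with caution.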

In [None]:
from sklearn.model_selection import cross_val_score

# Cross-validation for Logistic Regression
cv_scores = cross_val_score(log_reg, X, y, cv=5)
# cross_val_score function performs cross-validation and returns scores for each fold
print("Cross-validation scores (Logistic Regression):", cv_scores)
print("Mean cross-validation score (Logistic Regression):", np.mean(cv_scores))



Cross-validation scores (Logistic Regression): [0.93859649 0.94736842 0.98245614 0.92982456 0.95575221]
Mean cross-validation score (Logistic Regression): 0.9507995652848935


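Cross-validation is a technique to assess how well our model generalizes to an independent dataset. It splits the dataset into k folds; in each round the model is trained on k-1 folds and scored on the remaining held-out fold, so every sample is used for validation exactly once. We perform 5-fold cross-validation for the Logistic Regression model and report the per-fold scores and their mean, which indicate the model's stability. Note that this call passes the raw X rather than the standardized features, so the scores (mean 0.951) describe the model on unscaled data and are not directly comparable to the test accuracy reported earlier.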
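One way to keep standardization inside each cross-validation round is to wrap the scaler and the classifier in a pipeline, so the scaler is refitted on the training folds only. This is a sketch of that alternative, not the notebook's original call, and its scores will differ from the ones printed above.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# The pipeline is cloned and refitted per fold: the scaler sees only that fold's training part.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=10000))
cv_scores = cross_val_score(pipeline, X, y, cv=5)

print("Cross-validation scores (scaled pipeline):", cv_scores)
print("Mean cross-validation score (scaled pipeline):", np.mean(cv_scores))
```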



In [None]:
from sklearn.model_selection import GridSearchCV

# Hyperparameter tuning for Random Forest
param_grid = {
    'n_estimators': [50, 100, 200],  # Number of trees in the forest
    'max_depth': [None, 10, 20, 30]  # Maximum depth of the tree
}
grid_search = GridSearchCV(estimator=rf_clf, param_grid=param_grid, cv=5)
grid_search.fit(X_train, y_train)  # Fit the model to the training data with cross-validation

print("Best parameters found:", grid_search.best_params_)
print("Best cross-validation score:", grid_search.best_score_)



Best parameters found: {'max_depth': None, 'n_estimators': 50}
Best cross-validation score: 0.9648351648351647


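Hyperparameter tuning is the process of optimizing a model's hyperparameters to improve performance. We use GridSearchCV to perform an exhaustive search over a specified parameter grid for the Random Forest classifier: every combination of n_estimators and max_depth is scored with 5-fold cross-validation on the training set, and the combination with the highest mean score is reported. Here the best result (0.965) comes from 50 trees with no depth limit, which suggests that larger forests bring little benefit on this dataset.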
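To see what "exhaustive" means for this grid, the sketch below enumerates the candidate combinations in plain Python and counts the resulting model fits. It mirrors the grid above but does not call scikit-learn.

```python
from itertools import product

param_grid = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 10, 20, 30],
}
cv_folds = 5

# Every combination of one value per hyperparameter is a candidate.
names = list(param_grid)
candidates = [dict(zip(names, values)) for values in product(*param_grid.values())]

total_fits = len(candidates) * cv_folds
print(len(candidates), total_fits)  # 12 60
print(candidates[0])                # {'n_estimators': 50, 'max_depth': None}
```

With GridSearchCV's default `refit=True`, one additional fit of the best candidate on the full training set follows the 60 cross-validation fits.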

In [None]:
import joblib

# Saving the Logistic Regression model
joblib.dump(log_reg, 'logistic_regression_model.pkl')

# Loading the saved model
loaded_model = joblib.load('logistic_regression_model.pkl')



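Saving a trained model allows us to reuse it without retraining. We use joblib to save the Logistic Regression model to a file and then load it back, which is particularly useful for deploying models to production environments. The saved file contains only the classifier: it still expects inputs standardized with the same fitted StandardScaler, so the scaler must be kept alongside it. Since joblib files are pickle-based, they should only be loaded from trusted sources.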
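One possible way to keep the scaler and the classifier together is to persist them as a single pipeline. The sketch below is an alternative to the notebook's approach, and the file name `breast_cancer_pipeline.joblib` is hypothetical; a temporary directory is used so nothing is left on disk.

```python
import os
import tempfile

import joblib
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# The pipeline bundles preprocessing and model, so raw features can be passed at predict time.
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=10000))
pipeline.fit(X_train, y_train)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "breast_cancer_pipeline.joblib")  # hypothetical file name
    joblib.dump(pipeline, path)
    restored = joblib.load(path)

# The restored pipeline should reproduce the original predictions on raw test features.
predictions_match = bool((restored.predict(X_test) == pipeline.predict(X_test)).all())
print("Predictions match after reload:", predictions_match)
```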

In [None]:
# Predicting with the loaded model
y_pred_loaded = loaded_model.predict(X_test)
print("Loaded Model Accuracy:", accuracy_score(y_test, y_pred_loaded))


Loaded Model Accuracy: 0.9736842105263158


After loading the previously saved model, we make predictions on the test set to confirm that its behavior is unchanged. The accuracy of 0.9737 matches the original Logistic Regression result exactly, which validates that the saved model can be reused for making predictions on new data.