<div style="background-color: rgba(247, 200, 115, 0.3); padding: 30px 0;">
    <div style="max-width: 800px; margin: 0 auto; text-align: center;">
        <h1 style="font-size: 48px; color: #cc7a00; margin-bottom: 10px;">🚀 Machine Learning 📊</h1>
        <h3 style="font-size: 28px; color: #cc7a00; margin-bottom: 10px;">SVM & Decision Tree & Cross Validation</h3>
        <h4 style="font-size: 18px; color: #cc7a00;"><a href="https://www.linkedin.com/in/mohammadreza-qaderi/" style="color: #1e90ff; text-decoration: none;">MohammadReza Qaderi</a></h4>
        <h4 style="font-size: 18px; color: #cc7a00;"><a href="https://github.com/MR-Qaderi/MachineLearningCourseMaterials" style="color: #1e90ff; text-decoration: none;">GitHub Repository</a></h4>
    </div>
</div>


<div align="center" style="border: 2px solid #e74c3c; padding: 10px; background-color: #f39c12; border-radius: 5px;">
  <h1 style="font-family: 'Palatino Linotype', serif; color: white;">🔥 SVM 🔥</h1>
</div>

<div style="background-color: #f9f9f9; border: 1px solid #ccc; padding: 15px; border-radius: 10px; font-family: Arial;">

<h2 style="color: #555; text-align: center;">Support Vector Machine (SVM) Algorithm</h2>

<p>The Support Vector Machine (SVM) is a powerful supervised learning algorithm used for both classification and regression tasks. It operates by finding the hyperplane that best separates classes in a high-dimensional feature space.</p>

<ul>
    <li><strong>Margin Maximization:</strong> SVM aims to find a hyperplane that maximizes the margin, which is the distance between the hyperplane and the nearest data points of each class. A wider margin tends to generalize better to unseen data.</li>
    <li><strong>Kernel Trick:</strong> SVM can handle data that is not linearly separable by implicitly mapping the input features into a higher-dimensional space. A kernel function computes inner products in that space directly, so the mapping itself is never constructed, and the resulting decision boundary can be non-linear in the original features.</li>
    <li><strong>Robustness to Outliers:</strong> The decision boundary depends only on the support vectors, so points that lie far from the boundary on the correct side have no effect on it. With a soft margin, the regularization parameter <code>C</code> controls how strongly margin violations are penalized, and a smaller <code>C</code> reduces the influence of individual outliers.</li>
    <li><strong>Effective in High-Dimensional Spaces:</strong> SVM performs well even when the number of features is greater than the number of samples, making it suitable for complex datasets.</li>
    <li><strong>Memory Efficiency:</strong> It uses a subset of training points (support vectors) to define the decision boundary, which leads to memory efficiency.</li>
</ul>

<p>SVM is widely used in various fields such as image classification, text mining, bioinformatics, and more. Its versatility, robustness, and ability to handle non-linear data make it a valuable tool in machine learning.</p>

<p style="font-style: italic; text-align: center;">Keep in mind that SVM's performance may be influenced by the choice of kernel and its hyperparameters. Therefore, careful tuning is crucial for achieving optimal results.</p>

</div>
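
The kernel trick described above can be illustrated with a small sketch. It assumes a degree-2 polynomial kernel k(x, z) = (x · z)² on 2-D inputs and a hand-written feature map `phi`; both are illustrative choices and are not used elsewhere in this notebook. The kernel value computed in the original 2-D space matches the dot product in the mapped 3-D space.

```python
import math

def dot(a, b):
    """Plain dot product of two equal-length sequences."""
    return sum(ai * bi for ai, bi in zip(a, b))

def poly2_kernel(x, z):
    """Degree-2 polynomial kernel (illustrative choice): k(x, z) = (x . z)^2."""
    return dot(x, z) ** 2

def phi(x):
    """Explicit feature map matching poly2_kernel on 2-D inputs (hypothetical helper)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

x = (1.0, 2.0)
z = (3.0, -1.0)

kernel_value = poly2_kernel(x, z)       # computed in the original 2-D space
explicit_value = dot(phi(x), phi(z))    # computed in the mapped 3-D space

print(kernel_value, explicit_value)
```

The two numbers agree up to floating-point rounding, which is why a kernel SVM never needs to build the higher-dimensional features explicitly.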


<img src = "https://alpopkes.com/posts/machine_learning/images/separating_hyperplanes.png" >

<img src = "https://static.javatpoint.com/tutorial/machine-learning/images/support-vector-machine-algorithm.png" >

<img src = "https://upload.wikimedia.org/wikipedia/commons/thumb/7/72/SVM_margin.png/600px-SVM_margin.png" >

<img src = "https://docs.opencv.org/3.4/sample-errors-dist.png" >

<img src = "https://i.stack.imgur.com/kP0j9.png" >

In [1]:
import time

# Record the start time
start_time = time.time()

In [2]:
import pandas as pd
from sklearn.datasets import load_breast_cancer
import seaborn as sns
import matplotlib.pyplot as plt

# Load Breast Cancer Wisconsin (Diagnostic) dataset
breast_cancer = load_breast_cancer()
data_df = pd.DataFrame(breast_cancer.data, columns=breast_cancer.feature_names)
data_df['diagnosis'] = breast_cancer.target
# Exploratory Data Analysis (EDA)
# Display the DataFrame for a first look at the features and the target
data_df

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,diagnosis
0,17.99,10.38,122.80,1001.0,0.11840,0.27760,0.30010,0.14710,0.2419,0.07871,...,17.33,184.60,2019.0,0.16220,0.66560,0.7119,0.2654,0.4601,0.11890,0
1,20.57,17.77,132.90,1326.0,0.08474,0.07864,0.08690,0.07017,0.1812,0.05667,...,23.41,158.80,1956.0,0.12380,0.18660,0.2416,0.1860,0.2750,0.08902,0
2,19.69,21.25,130.00,1203.0,0.10960,0.15990,0.19740,0.12790,0.2069,0.05999,...,25.53,152.50,1709.0,0.14440,0.42450,0.4504,0.2430,0.3613,0.08758,0
3,11.42,20.38,77.58,386.1,0.14250,0.28390,0.24140,0.10520,0.2597,0.09744,...,26.50,98.87,567.7,0.20980,0.86630,0.6869,0.2575,0.6638,0.17300,0
4,20.29,14.34,135.10,1297.0,0.10030,0.13280,0.19800,0.10430,0.1809,0.05883,...,16.67,152.20,1575.0,0.13740,0.20500,0.4000,0.1625,0.2364,0.07678,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,21.56,22.39,142.00,1479.0,0.11100,0.11590,0.24390,0.13890,0.1726,0.05623,...,26.40,166.10,2027.0,0.14100,0.21130,0.4107,0.2216,0.2060,0.07115,0
565,20.13,28.25,131.20,1261.0,0.09780,0.10340,0.14400,0.09791,0.1752,0.05533,...,38.25,155.00,1731.0,0.11660,0.19220,0.3215,0.1628,0.2572,0.06637,0
566,16.60,28.08,108.30,858.1,0.08455,0.10230,0.09251,0.05302,0.1590,0.05648,...,34.12,126.70,1124.0,0.11390,0.30940,0.3403,0.1418,0.2218,0.07820,0
567,20.60,29.33,140.10,1265.0,0.11780,0.27700,0.35140,0.15200,0.2397,0.07016,...,39.42,184.60,1821.0,0.16500,0.86810,0.9387,0.2650,0.4087,0.12400,0


In [3]:
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

# Separate X (features) and y (target)
X = data_df.drop('diagnosis', axis=1)
y = data_df['diagnosis']

In [4]:
# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [5]:
# Initialize and train the SVM classifier
svm_classifier = SVC(kernel='linear')
svm_classifier.fit(X_train, y_train)

In [6]:
# Predict on the test set
y_pred = svm_classifier.predict(X_test)

In [7]:
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print("SVM Accuracy:", accuracy)

SVM Accuracy: 0.956140350877193


In [8]:
from sklearn.model_selection import cross_val_score, GridSearchCV

# Initialize SVM classifier
svm_classifier_cv = SVC()

<div style="background-color: #f9f9f9; border: 1px solid #ccc; padding: 15px; border-radius: 10px; font-family: Arial;">

<h2 style="color: #555; text-align: center;">Cross-Validation: Enhancing Model Reliability</h2>

<p>Cross-validation is a technique for assessing how well a predictive model generalizes. In k-fold cross-validation, the dataset is partitioned into k subsets, known as folds. The model is trained on k-1 folds and evaluated on the remaining fold, and the process is repeated k times so that every fold serves as the validation set exactly once.</p>

<h3 style="color: #777;">Importance of Cross-Validation:</h3>

<ol>
    <li><strong>Detecting Overfitting:</strong> Cross-validation helps identify whether a model is overfitting to the training data. If a model performs exceptionally well on the training folds but poorly on the validation fold, it's a sign of overfitting.</li>
    <li><strong>Optimizing Hyperparameters:</strong> It aids in the selection of hyperparameters. By comparing performance across different folds, one can fine-tune hyperparameters for optimal model performance.</li>
    <li><strong>Improved Model Evaluation:</strong> It provides a more reliable estimate of a model's performance compared to a single train-test split. This is crucial for ensuring the model's performance on new, unseen data.</li>
    <li><strong>Utilization of Data:</strong> Cross-validation makes better use of the available data. In k-fold cross-validation, each data point is used for validation exactly once and for training in the other k-1 rounds, which is especially important when data is limited.</li>
    <li><strong>Reduced Variance in Results:</strong> Averaging the scores over several folds gives an estimate that depends less on one particular split, leading to more stable and consistent results.</li>
</ol>

<p style="font-style: italic; text-align: center; color: #777;">In summary, cross-validation enhances the reliability of machine learning models by providing a robust assessment of their performance. It helps in avoiding overfitting, fine-tuning models, and obtaining more accurate estimates of their real-world performance.</p>

</div>
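
The fold rotation described above can be sketched in plain Python. This is a minimal, unshuffled k-fold split written for illustration; the helper names `k_fold_indices` and `k_fold_splits` are hypothetical. Note that scikit-learn's `cross_val_score` uses stratified folds for classifiers when `cv` is an integer, which this sketch does not reproduce.

```python
def k_fold_indices(n_samples, k):
    """Split range(n_samples) into k contiguous folds of near-equal size."""
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)  # the first `extra` folds get one more sample
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def k_fold_splits(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k rounds."""
    folds = k_fold_indices(n_samples, k)
    for i, val_idx in enumerate(folds):
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, val_idx

n_samples, k = 10, 5
validation_counts = [0] * n_samples
for train_idx, val_idx in k_fold_splits(n_samples, k):
    for j in val_idx:
        validation_counts[j] += 1
    print("train:", train_idx, "validate:", val_idx)

# Every sample lands in a validation fold exactly once.
print(validation_counts)
```

Each round holds out a different fold, so the per-round scores can be averaged into a single estimate.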


<img src = "https://scikit-learn.org/stable/_images/grid_search_cross_validation.png" >

In [9]:
# Cross Validation
cv_scores = cross_val_score(svm_classifier_cv, X, y, cv=5)
print("Cross Validation Scores:", cv_scores)
print("Mean CV Score:", cv_scores.mean())

Cross Validation Scores: [0.85087719 0.89473684 0.92982456 0.94736842 0.9380531 ]
Mean CV Score: 0.9121720229777983


<img src = "https://i.stack.imgur.com/JcaO2.png" >

<div style="background-color: #f9f9f9; border: 1px solid #ccc; padding: 15px; border-radius: 10px; font-family: Arial;">

<h2 style="color: #555; text-align: center;">Grid Search: Fine-Tuning Model Hyperparameters</h2>

<p>Grid search is a powerful technique used in machine learning for systematically searching through a specified hyperparameter space to find the optimal combination that yields the best model performance.</p>

<h3 style="color: #777;">How Grid Search Works:</h3>

<p>Grid search operates by defining a grid of hyperparameters to explore. For each combination of hyperparameters, the model is trained and evaluated using cross-validation. The performance metrics (e.g., accuracy, F1 score) are recorded for each combination. Finally, the hyperparameter values resulting in the highest performance are selected.</p>

<h3 style="color: #777;">Importance of Grid Search:</h3>

<ol>
    <li><strong>Hyperparameter Optimization:</strong> Grid search automates the process of hyperparameter tuning, saving time and effort in manually testing different combinations.</li>
    <li><strong>Improved Model Performance:</strong> By systematically exploring a range of hyperparameters, grid search helps find the configuration that leads to the highest model performance.</li>
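    <li><strong>Prevention of Overfitting:</strong> Because each candidate is scored on held-out folds rather than on its own training data, grid search favors hyperparameters that generalize well. The best cross-validation score is still an optimistic estimate, so a separate test set that is not used during the search is needed for the final evaluation.</li>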
    <li><strong>Enhanced Model Robustness:</strong> A well-tuned model with optimized hyperparameters is more likely to perform consistently across different datasets and scenarios.</li>
</ol>

<p style="font-style: italic; text-align: center; color: #777;">In summary, grid search is an essential tool for fine-tuning model hyperparameters, leading to improved performance, robustness, and generalizability of machine learning models.</p>

</div>
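
How a grid expands into candidates can be sketched with the standard library alone. The grid values below are the ones used for the SVM search in the next cell; `expand_grid` is a hypothetical helper written for illustration and is not part of scikit-learn.

```python
from itertools import product

# Same grid values as the SVM search in this notebook.
param_grid = {
    "C": [0.1, 1, 10, 100],
    "gamma": [1, 0.1, 0.01, 0.001],
    "kernel": ["linear", "rbf"],
}

def expand_grid(grid):
    """Return every combination of the grid's values as a list of dicts."""
    keys = sorted(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[key] for key in keys))]

candidates = expand_grid(param_grid)
print(len(candidates))   # 4 * 4 * 2 combinations
print(candidates[0])

# The linear kernel ignores gamma, so splitting the grid avoids redundant fits.
split_grid = [
    {"kernel": ["linear"], "C": [0.1, 1, 10, 100]},
    {"kernel": ["rbf"], "C": [0.1, 1, 10, 100], "gamma": [1, 0.1, 0.01, 0.001]},
]
reduced = [candidate for grid in split_grid for candidate in expand_grid(grid)]
print(len(reduced))
```

`GridSearchCV` also accepts a list of dictionaries like `split_grid` as its `param_grid` argument, which is one way to skip the duplicated linear-kernel fits.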


In [10]:
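# Perform Grid Search
# Note: 'gamma' is ignored by the linear kernel, so each linear candidate is
# refit once per gamma value with an identical score.
param_grid = {'C': [0.1, 1, 10, 100],
              'gamma': [1, 0.1, 0.01, 0.001],
              'kernel': ['linear', 'rbf']}
# cv is left at its default of 5 folds. The search is fit on the full dataset
# (X, y), which includes the rows in X_test, so later test-set scores for the
# tuned model are optimistic.
grid_search = GridSearchCV(svm_classifier_cv, param_grid, refit=True, verbose=3)
grid_search.fit(X, y)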

Fitting 5 folds for each of 32 candidates, totalling 160 fits
[CV 1/5] END .....C=0.1, gamma=1, kernel=linear;, score=0.939 total time=   0.0s
[CV 2/5] END .....C=0.1, gamma=1, kernel=linear;, score=0.947 total time=   0.0s
[CV 3/5] END .....C=0.1, gamma=1, kernel=linear;, score=0.982 total time=   0.3s
[CV 4/5] END .....C=0.1, gamma=1, kernel=linear;, score=0.921 total time=   0.0s
[CV 5/5] END .....C=0.1, gamma=1, kernel=linear;, score=0.956 total time=   0.0s
[CV 1/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.623 total time=   0.0s
[CV 2/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.623 total time=   0.0s
[CV 3/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.632 total time=   0.0s
[CV 4/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.632 total time=   0.0s
[CV 5/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.628 total time=   0.0s
[CV 1/5] END ...C=0.1, gamma=0.1, kernel=linear;, score=0.939 total time=   0.0s
[CV 2/5] END ...C=0.1, gamma=0.1, kernel=linear

In [11]:
# Best parameters from Grid Search
print("Best Parameters:", grid_search.best_params_)

Best Parameters: {'C': 100, 'gamma': 1, 'kernel': 'linear'}


In [12]:
# Best accuracy from Grid Search
best_accuracy = grid_search.best_score_
print("Best Accuracy:", best_accuracy)

Best Accuracy: 0.9631268436578171


In [13]:
from sklearn.preprocessing import StandardScaler

# Initialize the StandardScaler
scaler = StandardScaler()

In [14]:
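# Scale the features
# Note: the scaler is fit on all rows, including those later used as X_test,
# so the test rows influence the scaling statistics.
X_scaled = scaler.fit_transform(X)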

In [15]:
# Perform Grid Search with scaled data
grid_search_scaled = GridSearchCV(svm_classifier_cv, param_grid, refit=True, verbose=3)
grid_search_scaled.fit(X_scaled, y)

Fitting 5 folds for each of 32 candidates, totalling 160 fits
[CV 1/5] END .....C=0.1, gamma=1, kernel=linear;, score=0.974 total time=   0.0s
[CV 2/5] END .....C=0.1, gamma=1, kernel=linear;, score=0.974 total time=   0.0s
[CV 3/5] END .....C=0.1, gamma=1, kernel=linear;, score=0.982 total time=   0.0s
[CV 4/5] END .....C=0.1, gamma=1, kernel=linear;, score=0.965 total time=   0.0s
[CV 5/5] END .....C=0.1, gamma=1, kernel=linear;, score=0.982 total time=   0.0s
[CV 1/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.623 total time=   0.0s
[CV 2/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.623 total time=   0.0s
[CV 3/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.632 total time=   0.0s
[CV 4/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.632 total time=   0.0s
[CV 5/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.628 total time=   0.0s
[CV 1/5] END ...C=0.1, gamma=0.1, kernel=linear;, score=0.974 total time=   0.0s
[CV 2/5] END ...C=0.1, gamma=0.1, kernel=linear

In [16]:
# Best parameters from Grid Search with scaled data
best_params_scaled = grid_search_scaled.best_params_
print("Best Parameters (Scaled Data):", best_params_scaled)

Best Parameters (Scaled Data): {'C': 10, 'gamma': 0.01, 'kernel': 'rbf'}


In [17]:
# Best accuracy from Grid Search with scaled data
best_accuracy_scaled = grid_search_scaled.best_score_
print("Best Accuracy (Scaled Data):", best_accuracy_scaled)

Best Accuracy (Scaled Data): 0.9789318428815401
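
In the cells above, the scaler and the grid search are both fit on the full dataset, which includes the rows later used as `X_test`. A minimal sketch of a variant that keeps the test rows out of tuning is shown below. It assumes a scikit-learn `Pipeline` and a deliberately small grid chosen to keep the run short; the `_demo` variable names are hypothetical and chosen so they do not overwrite the notebook's own variables.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_demo, y_demo = load_breast_cancer(return_X_y=True)
X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The scaler sits inside the pipeline, so each CV round fits it on that
# round's training folds only.
pipeline_demo = Pipeline([("scaler", StandardScaler()), ("svc", SVC())])

# Parameter names are prefixed with the pipeline step name.
# Small illustrative grid (an assumption, not the notebook's full grid).
param_grid_demo = {
    "svc__C": [1, 10],
    "svc__gamma": [0.01, 0.001],
    "svc__kernel": ["rbf"],
}

search_demo = GridSearchCV(pipeline_demo, param_grid_demo, cv=5)
search_demo.fit(X_train_demo, y_train_demo)  # test rows are never seen during tuning

test_accuracy_demo = search_demo.score(X_test_demo, y_test_demo)
print("Best parameters:", search_demo.best_params_)
print("Held-out test accuracy:", test_accuracy_demo)
```

With this layout the held-out accuracy is measured on rows that influenced neither the scaling statistics nor the choice of hyperparameters.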


In [18]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Metrics for the simple SVM model
y_pred_simple = svm_classifier.predict(X_test)
accuracy_simple = accuracy_score(y_test, y_pred_simple)
precision_simple = precision_score(y_test, y_pred_simple)
recall_simple = recall_score(y_test, y_pred_simple)
f1_simple = f1_score(y_test, y_pred_simple)

In [19]:
# Metrics for the SVM with Cross Validation and Grid Search (unscaled data)
best_svm = grid_search.best_estimator_
y_pred_cv = best_svm.predict(X_test)
accuracy_cv = accuracy_score(y_test, y_pred_cv)
precision_cv = precision_score(y_test, y_pred_cv)
recall_cv = recall_score(y_test, y_pred_cv)
f1_cv = f1_score(y_test, y_pred_cv)

In [20]:
# Metrics for the SVM with Cross Validation and Grid Search (scaled data)
best_svm_scaled = grid_search_scaled.best_estimator_
X_test_scaled = scaler.transform(X_test)
y_pred_scaled = best_svm_scaled.predict(X_test_scaled)
accuracy_scaled = accuracy_score(y_test, y_pred_scaled)
precision_scaled = precision_score(y_test, y_pred_scaled)
recall_scaled = recall_score(y_test, y_pred_scaled)
f1_scaled = f1_score(y_test, y_pred_scaled)

In [21]:
# Create a comparison DataFrame
comparison_df = pd.DataFrame({
    'Metric': ['Accuracy', 'Precision', 'Recall', 'F1 Score'],
    'Simple SVM': [accuracy_simple, precision_simple, recall_simple, f1_simple],
    'SVM with CV & GridSearch (Unscaled)': [accuracy_cv, precision_cv, recall_cv, f1_cv],
    'SVM with CV & GridSearch (Scaled)': [accuracy_scaled, precision_scaled, recall_scaled, f1_scaled]
})

In [22]:
# Display the comparison DataFrame
print(comparison_df)

      Metric  Simple SVM  SVM with CV & GridSearch (Unscaled)  \
0   Accuracy    0.956140                             0.964912   
1  Precision    0.945946                             0.958904   
2     Recall    0.985915                             0.985915   
3   F1 Score    0.965517                             0.972222   

   SVM with CV & GridSearch (Scaled)  
0                           0.982456  
1                           0.972603  
2                           1.000000  
3                           0.986111  


In [23]:
# Create a comparison table
comparison_data = [
    ['Accuracy', accuracy_simple, accuracy_cv, accuracy_scaled],
    ['Precision', precision_simple, precision_cv, precision_scaled],
    ['Recall', recall_simple, recall_cv, recall_scaled],
    ['F1 Score', f1_simple, f1_cv, f1_scaled]
]


In [24]:
# Display the comparison table using tabulate
from tabulate import tabulate

comparison_table = tabulate(comparison_data, headers=['Metric', 'Simple SVM', 'SVM with CV & GridSearch (Unscaled)', 'SVM with CV & GridSearch (Scaled)'], tablefmt='pretty')

print(comparison_table)

+-----------+--------------------+-------------------------------------+-----------------------------------+
|  Metric   |     Simple SVM     | SVM with CV & GridSearch (Unscaled) | SVM with CV & GridSearch (Scaled) |
+-----------+--------------------+-------------------------------------+-----------------------------------+
| Accuracy  | 0.956140350877193  |         0.9649122807017544          |        0.9824561403508771         |
| Precision | 0.9459459459459459 |          0.958904109589041          |        0.9726027397260274         |
|  Recall   | 0.9859154929577465 |         0.9859154929577465          |                1.0                |
| F1 Score  | 0.9655172413793103 |         0.9722222222222222          |        0.9861111111111112         |
+-----------+--------------------+-------------------------------------+-----------------------------------+


<div align="center" style="border: 2px solid #e74c3c; padding: 10px; background-color: #f39c12; border-radius: 5px;">
  <h1 style="font-family: 'Palatino Linotype', serif; color: white;">🔥 Decision Tree 🔥</h1>
</div>

<div style="background-color: #f9f9f9; border: 1px solid #ccc; padding: 15px; border-radius: 10px; font-family: Arial;">

<h2 style="color: #555; text-align: center;">Decision Tree Algorithm: Hierarchical Decision-Making</h2>

<p>The Decision Tree algorithm is a versatile supervised learning method used for both classification and regression tasks. It operates by recursively partitioning the feature space into subsets based on the values of the input features, ultimately leading to the prediction of the target variable.</p>

<h3 style="color: #777;">How Decision Tree Works:</h3>

<p>The algorithm begins by selecting the best feature to split the data, which is determined by maximizing information gain (or minimizing impurity) at each step. This process is repeated recursively for each subset, creating a hierarchical structure of decisions.</p>

<h3 style="color: #777;">Key Characteristics:</h3>

<ul>
    <li><strong>Interpretability:</strong> Decision trees provide easily interpretable rules, making them valuable for extracting insights and explaining predictions.</li>
    <li><strong>Handling Non-Linear Relationships:</strong> Decision trees can capture non-linear relationships between features and the target variable, making them suitable for complex datasets.</li>
    <li><strong>Robust to Outliers:</strong> They are less sensitive to outliers compared to some other algorithms.</li>
    <li><strong>Ensemble Learning:</strong> Decision trees form the basis of ensemble methods like Random Forests and Gradient Boosting, which can further enhance performance.</li>
</ul>

<h3 style="color: #777;">Importance of Decision Trees:</h3>

<p>Decision trees are widely used in fields like healthcare, finance, and marketing for tasks such as risk assessment, customer segmentation, and medical diagnosis. Their intuitive nature and ability to handle complex relationships make them a valuable tool in the machine learning toolkit.</p>

<p style="font-style: italic; text-align: center; color: #777;">In summary, the Decision Tree algorithm empowers hierarchical decision-making, providing interpretable rules and the ability to handle complex relationships in data.</p>

</div>


<img src = "https://av-eks-blogoptimized.s3.amazonaws.com/905753.png" >

<img src = "https://av-eks-blogoptimized.s3.amazonaws.com/542834.png" >

<img src = "https://media.geeksforgeeks.org/wp-content/uploads/20200620180439/Gini-Impurity-vs-Entropy.png" >

## Example: Decision Tree Split Using Gini Index

Consider a simplified example with two columns: "Outlook" and "Temperature." We'll use the "Play" column as the target variable, indicating whether or not to play a sport based on the weather conditions.

| Outlook   | Temperature | Play |
|-----------|-------------|------|
| Sunny     | Hot         | No   |
| Overcast  | Mild        | Yes  |
| Rainy     | Hot         | No   |
| Overcast  | Hot         | Yes  |
| Sunny     | Mild        | Yes  |
| Rainy     | Mild        | Yes  |
| Sunny     | Cool        | Yes  |
| Overcast  | Cool        | Yes  |
| Rainy     | Mild        | Yes  |
| Rainy     | Cool        | No   |
| Sunny     | Mild        | Yes  |
| Overcast  | Mild        | Yes  |
| Overcast  | Hot         | Yes  |
| Rainy     | Mild        | No   |

### Calculating Gini Index for "Outlook":
- P(Yes|Sunny) = 3/4
- P(No|Sunny) = 1/4
- Gini Index(Sunny) = 1 - ((3/4)^2 + (1/4)^2) = 0.375
- P(Yes|Overcast) = 5/5 = 1
- P(No|Overcast) = 0/5 = 0
- Gini Index(Overcast) = 1 - ((1)^2 + (0)^2) = 0
- P(Yes|Rainy) = 2/5
- P(No|Rainy) = 3/5
- Gini Index(Rainy) = 1 - ((2/5)^2 + (3/5)^2) = 0.48


- P(Sunny) = 4/14
- P(Overcast) = 5/14
- P(Rainy) = 5/14
- Weighted Gini Index(Outlook) = (4/14) * 0.375 + (5/14) * 0 + (5/14) * 0.48 ≈ 0.279


### Calculating Gini Index for "Temperature":
- P(Yes|Hot) = 2/4
- P(No|Hot) = 2/4
- Gini Index(Hot) = 1 - ((2/4)^2 + (2/4)^2) = 0.5
- P(Yes|Mild) = 6/7
- P(No|Mild) = 1/7
- Gini Index(Mild) = 1 - ((6/7)^2 + (1/7)^2) ≈ 0.245
- P(Yes|Cool) = 2/3
- P(No|Cool) = 1/3
- Gini Index(Cool) = 1 - ((2/3)^2 + (1/3)^2) ≈ 0.444


- P(Hot) = 4/14
- P(Mild) = 7/14
- P(Cool) = 3/14
- Weighted Gini Index(Temperature) = (4/14) * 0.5 + (7/14) * 0.245 + (3/14) * 0.444 ≈ 0.36


In this case, the "Outlook" attribute has a lower weighted Gini Index (≈ 0.279) than the "Temperature" attribute (≈ 0.36). Therefore, we would choose to split on the "Outlook" attribute as it results in a better separation of the classes.

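Keep in mind that this is a simplified example. In practice, a decision tree algorithm evaluates every candidate split across all features using one chosen criterion (Gini Index or Entropy) and picks the split with the lowest weighted impurity. Implementations such as scikit-learn's `DecisionTreeClassifier` build binary trees, so they compare two-way splits rather than creating one branch per category.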
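
The Gini calculation can also be recomputed directly from the table, without rounding the intermediate values. This is a small sketch in plain Python; the helpers `gini` and `weighted_gini` are hypothetical names written for this example.

```python
from collections import Counter

# (Outlook, Temperature, Play) rows copied from the table above.
rows = [
    ("Sunny", "Hot", "No"), ("Overcast", "Mild", "Yes"), ("Rainy", "Hot", "No"),
    ("Overcast", "Hot", "Yes"), ("Sunny", "Mild", "Yes"), ("Rainy", "Mild", "Yes"),
    ("Sunny", "Cool", "Yes"), ("Overcast", "Cool", "Yes"), ("Rainy", "Mild", "Yes"),
    ("Rainy", "Cool", "No"), ("Sunny", "Mild", "Yes"), ("Overcast", "Mild", "Yes"),
    ("Overcast", "Hot", "Yes"), ("Rainy", "Mild", "No"),
]

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    total = len(labels)
    return 1.0 - sum((count / total) ** 2 for count in Counter(labels).values())

def weighted_gini(rows, feature_index):
    """Average the Gini impurity of each feature value, weighted by group size."""
    groups = {}
    for row in rows:
        groups.setdefault(row[feature_index], []).append(row[-1])
    return sum(len(group) / len(rows) * gini(group) for group in groups.values())

gini_outlook = weighted_gini(rows, 0)
gini_temperature = weighted_gini(rows, 1)
print(f"Outlook: {gini_outlook:.3f}, Temperature: {gini_temperature:.3f}")
```

The attribute with the smaller weighted value is the preferred split under this criterion.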

## Example: Decision Tree Split Using Entropy

We continue with the same simplified dataset: the two columns "Outlook" and "Temperature," with the "Play" column as the target variable.

### Calculating Entropy for "Outlook":
- Entropy for "Sunny": - (3/4) * log2(3/4) - (1/4) * log2(1/4) ≈ 0.811
- Entropy for "Overcast": - (5/5) * log2(5/5) - (0/5) * log2(0/5) = 0 (using the convention 0 * log2(0) = 0)
- Entropy for "Rainy": - (2/5) * log2(2/5) - (3/5) * log2(3/5) ≈ 0.971
- Weighted Entropy for "Outlook": (4/14) * 0.811 + (5/14) * 0 + (5/14) * 0.971 ≈ 0.579

### Calculating Entropy for "Temperature":
- Entropy for "Hot": - (2/4) * log2(2/4) - (2/4) * log2(2/4) = 1
- Entropy for "Mild": - (6/7) * log2(6/7) - (1/7) * log2(1/7) ≈ 0.592
- Entropy for "Cool": - (2/3) * log2(2/3) - (1/3) * log2(1/3) ≈ 0.918
- Weighted Entropy for "Temperature": (4/14) * 1 + (7/14) * 0.592 + (3/14) * 0.918 ≈ 0.778

Similarly, in this case, the "Outlook" attribute has a lower weighted Entropy (≈ 0.579) than the "Temperature" attribute (≈ 0.778). Therefore, we would choose to split on the "Outlook" attribute as it results in a better separation of the classes.

Entropy is a measure of impurity in a dataset. Information gain is the entropy of the parent node minus the weighted entropy of its children, so minimizing the weighted entropy and maximizing information gain select the same split.
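
The entropy figures can be recomputed from the table in the same way. The sketch below uses only the standard library; `entropy` and `weighted_entropy` are hypothetical helper names, and information gain is taken as the parent entropy minus the weighted child entropy.

```python
import math
from collections import Counter

# (Outlook, Temperature, Play) rows copied from the example table.
rows = [
    ("Sunny", "Hot", "No"), ("Overcast", "Mild", "Yes"), ("Rainy", "Hot", "No"),
    ("Overcast", "Hot", "Yes"), ("Sunny", "Mild", "Yes"), ("Rainy", "Mild", "Yes"),
    ("Sunny", "Cool", "Yes"), ("Overcast", "Cool", "Yes"), ("Rainy", "Mild", "Yes"),
    ("Rainy", "Cool", "No"), ("Sunny", "Mild", "Yes"), ("Overcast", "Mild", "Yes"),
    ("Overcast", "Hot", "Yes"), ("Rainy", "Mild", "No"),
]

def entropy(labels):
    """Shannon entropy in bits. Absent classes never appear in the Counter,
    which matches the convention 0 * log2(0) = 0."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def weighted_entropy(rows, feature_index):
    """Average the entropy of each feature value, weighted by group size."""
    groups = {}
    for row in rows:
        groups.setdefault(row[feature_index], []).append(row[-1])
    return sum(len(group) / len(rows) * entropy(group) for group in groups.values())

parent_entropy = entropy([row[-1] for row in rows])
entropy_outlook = weighted_entropy(rows, 0)
entropy_temperature = weighted_entropy(rows, 1)

for name, child in (("Outlook", entropy_outlook), ("Temperature", entropy_temperature)):
    print(f"{name}: weighted entropy {child:.3f}, "
          f"information gain {parent_entropy - child:.3f}")
```

The attribute with the larger information gain is the one with the smaller weighted entropy, so both views rank the candidate splits identically.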

In [25]:
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.metrics import classification_report, accuracy_score

# Load the Breast Cancer dataset
data = load_breast_cancer()
X = data.data
y = data.target

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [26]:
# Simple Decision Tree
simple_tree = DecisionTreeClassifier(random_state=42)
simple_tree.fit(X_train, y_train)
y_pred_simple = simple_tree.predict(X_test)
accuracy_simple = accuracy_score(y_test, y_pred_simple)

In [27]:
# Cross Validation
cross_val_scores = cross_val_score(simple_tree, X, y, cv=5)
mean_cv_score = cross_val_scores.mean()

In [28]:
# Grid Search with Cross Validation
param_grid = {
    'criterion': ['gini', 'entropy'],
    'splitter': ['best', 'random'],
    'max_depth': [None, 5, 10, 15, 20],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': [None, 'sqrt', 'log2']
}

In [29]:
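# Note: the search is fit on the full dataset (X, y), which includes the rows in
# X_test, so test-set metrics for the tuned tree below are optimistic.
grid_search = GridSearchCV(DecisionTreeClassifier(random_state=42), param_grid, cv=5)
grid_search.fit(X, y)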
best_params = grid_search.best_params_
best_score = grid_search.best_score_

In [30]:
# Print results
print("Simple Decision Tree Accuracy:", accuracy_simple)
print("Cross Validation Mean Score:", mean_cv_score)
print("Best Parameters from Grid Search:", best_params)
print("Best Cross Validation Score from Grid Search:", best_score)

Simple Decision Tree Accuracy: 0.9473684210526315
Cross Validation Mean Score: 0.9173420276354604
Best Parameters from Grid Search: {'criterion': 'entropy', 'max_depth': 5, 'max_features': None, 'min_samples_leaf': 1, 'min_samples_split': 10, 'splitter': 'random'}
Best Cross Validation Score from Grid Search: 0.9490296537804689


In [31]:
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score, precision_score, recall_score, f1_score

# Evaluate Simple Decision Tree
print("Simple Decision Tree Metrics:")
print("Accuracy:", accuracy_score(y_test, y_pred_simple))
print("Precision:", precision_score(y_test, y_pred_simple))
print("Recall:", recall_score(y_test, y_pred_simple))
print("F1-Score:", f1_score(y_test, y_pred_simple))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_simple))
print("Classification Report:\n", classification_report(y_test, y_pred_simple))

Simple Decision Tree Metrics:
Accuracy: 0.9473684210526315
Precision: 0.9577464788732394
Recall: 0.9577464788732394
F1-Score: 0.9577464788732394
Confusion Matrix:
 [[40  3]
 [ 3 68]]
Classification Report:
               precision    recall  f1-score   support

           0       0.93      0.93      0.93        43
           1       0.96      0.96      0.96        71

    accuracy                           0.95       114
   macro avg       0.94      0.94      0.94       114
weighted avg       0.95      0.95      0.95       114



In [32]:
# Evaluate Grid Search Decision Tree
y_pred_grid = grid_search.predict(X_test)
print("\nGrid Search Decision Tree Metrics:")
print("Accuracy:", accuracy_score(y_test, y_pred_grid))
print("Precision:", precision_score(y_test, y_pred_grid))
print("Recall:", recall_score(y_test, y_pred_grid))
print("F1-Score:", f1_score(y_test, y_pred_grid))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_grid))
print("Classification Report:\n", classification_report(y_test, y_pred_grid))


Grid Search Decision Tree Metrics:
Accuracy: 0.956140350877193
Precision: 0.9342105263157895
Recall: 1.0
F1-Score: 0.9659863945578232
Confusion Matrix:
 [[38  5]
 [ 0 71]]
Classification Report:
               precision    recall  f1-score   support

           0       1.00      0.88      0.94        43
           1       0.93      1.00      0.97        71

    accuracy                           0.96       114
   macro avg       0.97      0.94      0.95       114
weighted avg       0.96      0.96      0.96       114
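
The reported precision, recall, F1, and accuracy can be checked by hand from the two confusion matrices printed above. The sketch assumes scikit-learn's layout for binary labels, where rows are true classes and columns are predicted classes, so the matrix reads `[[TN, FP], [FN, TP]]`; `binary_metrics` is a hypothetical helper written for this check.

```python
def binary_metrics(confusion):
    """Derive positive-class (label 1) metrics from a 2x2 confusion matrix
    laid out as [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = confusion
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

# Confusion matrices copied from the outputs above.
simple_tree_metrics = binary_metrics([[40, 3], [3, 68]])
tuned_tree_metrics = binary_metrics([[38, 5], [0, 71]])

for name, metrics in (("Simple", simple_tree_metrics), ("Grid search", tuned_tree_metrics)):
    print(name, {key: round(value, 6) for key, value in metrics.items()})
```

Working from the raw counts makes the trade-off visible: the tuned tree removes all false negatives at the cost of two extra false positives.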



In [33]:
from tabulate import tabulate

# Create a dictionary to store the evaluation metrics
# y_pred_simple holds the decision tree predictions (y_pred is the earlier SVM output)
metrics_dict = {
    'Model': ['Simple Decision Tree', 'Tuned Decision Tree'],
    'Accuracy': [accuracy_score(y_test, y_pred_simple), accuracy_score(y_test, y_pred_grid)],
    'Precision': [precision_score(y_test, y_pred_simple), precision_score(y_test, y_pred_grid)],
    'Recall': [recall_score(y_test, y_pred_simple), recall_score(y_test, y_pred_grid)],
    'F1 Score': [f1_score(y_test, y_pred_simple), f1_score(y_test, y_pred_grid)],
}

In [34]:
# Convert the dictionary to a table using tabulate
table = tabulate(metrics_dict, headers='keys', tablefmt='grid')

In [35]:
# Print the table
print(table)

+----------------------+------------+-------------+----------+------------+
| Model                |   Accuracy |   Precision |   Recall |   F1 Score |
+======================+============+=============+==========+============+
| Simple Decision Tree |   0.947368 |    0.957746 | 0.957746 |   0.957746 |
+----------------------+------------+-------------+----------+------------+
| Tuned Decision Tree  |   0.95614  |    0.934211 | 1        |   0.965986 |
+----------------------+------------+-------------+----------+------------+


In [36]:
# Record the end time
end_time = time.time()

In [37]:
# Calculate the elapsed time
elapsed_time = end_time - start_time

print(f"Execution time: {elapsed_time:.2f} seconds")

Execution time: 339.35 seconds


In [38]:
# Convert the elapsed time to minutes
elapsed_time_minutes = elapsed_time / 60

print(f"Execution time: {elapsed_time_minutes:.2f} minutes")

Execution time: 5.66 minutes
