🧠 Introduction (Beginner Summary)

In this activity, you'll learn how to build and test different machine learning models that can classify data into categories, just like how AI can tell if an email is spam or not.

You'll use Python to create models such as logistic regression, decision trees, and support vector machines (SVMs). You'll also learn how to prepare your data, train the models, and compare how well they perform.

🪜 What You'll Do Step-by-Step

Set up your environment: Create a new Jupyter Notebook and use the right Python version.

Load and explore the dataset: Open your data and take a look at what it contains.

Preprocess the data: Clean and prepare the data so the models can understand it.

Build a Logistic Regression model: A simple model that predicts between categories.

Build a Decision Tree model: A visual, step-by-step model that splits data to make decisions.

Build a Support Vector Machine (SVM) model: A model that separates the classes with the widest possible margin.

Evaluate and compare the models: Use accuracy, precision, recall, and the F1 score to see how the models perform.

Step 1: Set up the environment
Instructions

First, ensure you have the necessary libraries installed. We'll be using Scikit-Learn for machine learning models, pandas for data manipulation, and matplotlib or seaborn for visualization.

Install the required libraries using the following commands:

In [2]:
!pip install scikit-learn
!pip install pandas
!pip install matplotlib seaborn



Explanation

These libraries will provide the tools to load, manipulate, and visualize the dataset, as well as implement and evaluate classification models.
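
A quick way to confirm the installation worked is to look up each package's installed version. The sketch below is only illustrative: `installed_versions` is a hypothetical helper (not part of any library), and the versions it prints depend entirely on your environment.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Hypothetical helper: map each package name to its installed
    version string, or to None if the package is not installed."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


versions = installed_versions(["scikit-learn", "pandas", "matplotlib", "seaborn"])
for name, ver in versions.items():
    print(f"{name}: {ver if ver else 'NOT INSTALLED'}")
```

If any package reports NOT INSTALLED, rerun the matching pip command above and restart the notebook kernel.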

🧩 Step 2: Load and Explore the Dataset (Summary)

In this step, you'll load your dataset and take a closer look at what's inside before building your model.

You'll:

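Obtain the dataset that contains both the inputs (features) and the outputs (labels) for your classification task. This activity uses the Breast Cancer dataset bundled with Scikit-Learn, so no separate download is needed.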

Load it into a pandas DataFrame so you can easily view and analyze it.

Explore the data by checking for missing values, data types, and using commands like .head() to see the first few rows.

In [4]:
# Load Breast Cancer dataset and convert to DataFrame
from sklearn.datasets import load_breast_cancer
import pandas as pd

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

# Explore the dataset
print(df.head())
df.info()  # info() prints its summary itself and returns None, so it needs no print()

   mean radius  mean texture  mean perimeter  mean area  mean smoothness  \
0        17.99         10.38          122.80     1001.0          0.11840   
1        20.57         17.77          132.90     1326.0          0.08474   
2        19.69         21.25          130.00     1203.0          0.10960   
3        11.42         20.38           77.58      386.1          0.14250   
4        20.29         14.34          135.10     1297.0          0.10030   

   mean compactness  mean concavity  mean concave points  mean symmetry  \
0           0.27760          0.3001              0.14710         0.2419   
1           0.07864          0.0869              0.07017         0.1812   
2           0.15990          0.1974              0.12790         0.2069   
3           0.28390          0.2414              0.10520         0.2597   
4           0.13280          0.1980              0.10430         0.1809   

   mean fractal dimension  ...  worst texture  worst perimeter  worst area  \
[output truncated]

Explanation

Understanding the structure of your dataset is crucial for selecting the right preprocessing steps and models.
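
The checks listed above (missing values, data types, and a look at the labels) could be sketched as follows. This is one possible way to do it, not the only one. The variable names `missing_total` and `class_counts` are illustrative, and mapping the numeric target back to `data.target_names` is an addition for readability.

```python
from sklearn.datasets import load_breast_cancer
import pandas as pd

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df["target"] = data.target

# Total number of missing values across every column
missing_total = int(df.isnull().sum().sum())
print(f"Rows, columns: {df.shape}")
print(f"Missing values: {missing_total}")

# How many columns there are of each data type
print(df.dtypes.value_counts())

# Class balance, with the numeric target mapped back to its class name
label_names = dict(enumerate(data.target_names))
class_counts = df["target"].map(label_names).value_counts()
print(class_counts)
```

Checking the class balance early is useful because a heavily skewed target changes which evaluation metrics are meaningful later on.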

⚙️ Step 3: Preprocess the Data (Summary)

Before training or testing your model, you need to clean and prepare your dataset so it's ready for machine learning.

You'll:

Decide how to split your data

If you are not tuning hyperparameters (for example, when using a pretrained model or default settings, as in this activity), you usually just split the data into training and testing sets; there is no need for a separate validation set.

This is especially useful if your dataset is small or you're only evaluating model performance.

Handle missing values

Fill them using the mean or median of the column, or

Remove rows/columns with too many missing entries.

Encode categorical data

Convert text or category labels into numbers using LabelEncoder or pd.get_dummies() (one-hot encoding).

Split the dataset

Use train_test_split from Scikit-Learn to divide data into 80% training and 20% testing.

Set a random seed so you get the same split every time.

Verify the split

Check the shapes of your training and testing data to ensure everything worked correctly.
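
The Breast Cancer dataset is fully numeric, so the encoding step does not apply to it directly. To illustrate the missing-value and encoding steps above, here is a minimal sketch on a small, made-up DataFrame. The `toy` data and its column names are hypothetical and exist only for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: a numeric column with one gap and a categorical column
toy = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "color": ["red", "blue", "red", "green"],
})

# Fill the missing numeric value with the column median
# (the median of 25, 31, and 40 is 31)
toy["age"] = toy["age"].fillna(toy["age"].median())

# One-hot encode the categorical column into one indicator column per category
encoded = pd.get_dummies(toy, columns=["color"])

print(encoded)
```

One-hot encoding avoids implying an order between categories, which LabelEncoder's integer codes can do when applied to input features.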

In [6]:
from sklearn.model_selection import train_test_split

# Handle missing data (example: filling missing values with the median)
df.fillna(df.median(), inplace=True)

# Split the data into features and labels
X = df.drop('target', axis=1)
y = df['target']

# Split into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Explanation

Preprocessing ensures that your data is clean and ready for ML models to use. Splitting the dataset into training and test sets allows us to evaluate the model's performance on unseen data.
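
The "verify the split" step could look like the sketch below. It reloads the data so that it runs on its own; `return_X_y=True` and `as_frame=True` are used here purely for brevity and are not part of the original cell.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Load features as a DataFrame and labels as a Series
X, y = load_breast_cancer(return_X_y=True, as_frame=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Shapes should reflect an 80/20 split of the 569 rows
print("Training features:", X_train.shape)
print("Testing features: ", X_test.shape)

# Compare the class proportions in each split
print("Class share (train):", y_train.value_counts(normalize=True).round(2).to_dict())
print("Class share (test): ", y_test.value_counts(normalize=True).round(2).to_dict())
```

If the class proportions differ noticeably between the two sets, passing `stratify=y` to `train_test_split` keeps them aligned. Note that doing so would change the split and therefore the accuracy figures reported in this activity.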

Step 4: Implement a logistic regression model
Instructions

Train a logistic regression model on the training data, and evaluate its performance on the test data.
Steps

    Import LogisticRegression from Scikit-Learn.

    Train the model using fit().

    Predict the labels for the test data, and calculate accuracy.

In [8]:
# Train logistic regression model
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)

# Make predictions
y_pred = log_reg.predict(X_test)

# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
print(f"Logistic Regression Accuracy: {accuracy * 100:.2f}%")

Logistic Regression Accuracy: 95.61%


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Explanation

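Logistic regression is a simple yet effective model for binary classification tasks. Accuracy, the fraction of test samples classified correctly, is one of the metrics used to evaluate how well the model is performing. The warning in the output above means the solver reached its iteration limit before converging; as the message suggests, scaling the features or increasing max_iter usually resolves it.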
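
One way to act on the solver's suggestion is to standardize the features inside a pipeline before fitting. The sketch below is an assumption about how you might do that, not the activity's original cell. The pipeline, `max_iter=1000`, and the warning capture are all additions, and the accuracy it prints may differ from the figure above.

```python
import warnings

from sklearn.datasets import load_breast_cancer
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# The scaler is fitted on the training data only, then applied to the test data
scaled_log_reg = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Record any warnings raised during fitting so we can check for convergence issues
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    scaled_log_reg.fit(X_train, y_train)

convergence_warnings = [
    w for w in caught if issubclass(w.category, ConvergenceWarning)
]
scaled_accuracy = accuracy_score(y_test, scaled_log_reg.predict(X_test))

print(f"Convergence warnings: {len(convergence_warnings)}")
print(f"Scaled Logistic Regression Accuracy: {scaled_accuracy * 100:.2f}%")
```

Putting the scaler inside the pipeline means the test set never influences the scaling statistics, which keeps the evaluation honest.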

Step 5: Implement a decision tree model

Decision trees split the data based on feature values and make decisions at each node.
Instructions

Train a decision tree model, and evaluate its performance on the test set.
Steps

    Import DecisionTreeClassifier from Scikit-Learn.

    Train the model on the training data.

    Make predictions and evaluate the accuracy.

In [9]:
from sklearn.tree import DecisionTreeClassifier

# Train decision tree model
tree = DecisionTreeClassifier()
tree.fit(X_train, y_train)

# Make predictions
y_pred_tree = tree.predict(X_test)

# Evaluate the model
accuracy_tree = accuracy_score(y_test, y_pred_tree)
print(f"Decision Tree Accuracy: {accuracy_tree * 100:.2f}%")

Decision Tree Accuracy: 92.98%


Explanation

Decision trees are highly interpretable models that make decisions by splitting the data based on the most informative features. However, they can be prone to overfitting if not tuned properly.
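
To see the overfitting tendency mentioned above, you can compare training and test accuracy for an unrestricted tree and a depth-limited one. This is a sketch under assumed settings: `max_depth=3` and `random_state=42` are illustrative choices, not tuned values.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

results = {}
for label, depth in [("unlimited depth", None), ("max_depth=3", 3)]:
    # random_state fixes the tie-breaking so repeated runs give the same tree
    clf = DecisionTreeClassifier(max_depth=depth, random_state=42)
    clf.fit(X_train, y_train)
    results[label] = {
        "train": accuracy_score(y_train, clf.predict(X_train)),
        "test": accuracy_score(y_test, clf.predict(X_test)),
        "depth": clf.get_depth(),
    }

for label, r in results.items():
    print(f"{label}: depth={r['depth']}, "
          f"train={r['train'] * 100:.2f}%, test={r['test'] * 100:.2f}%")
```

A large gap between training and test accuracy is the usual sign of overfitting; limiting the depth trades some training accuracy for a simpler tree.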

Step 6: Implement a support vector machine model

SVMs tend to perform well in high-dimensional feature spaces. They find a hyperplane that separates the data points into different classes with the maximum margin.
Instructions

Train a support vector machine (SVM) model, and evaluate its performance on the test set.
Steps

    Import support vector classifier (SVC) from Scikit-Learn.

    Train the model on the training data.

    Make predictions and evaluate the accuracy.

In [10]:
from sklearn.svm import SVC

# Train SVM model
svm = SVC()
svm.fit(X_train, y_train)

# Make predictions
y_pred_svm = svm.predict(X_test)

# Evaluate the model
accuracy_svm = accuracy_score(y_test, y_pred_svm)
print(f"SVM Accuracy: {accuracy_svm * 100:.2f}%")

SVM Accuracy: 94.74%


Explanation

SVMs are powerful models, particularly in high-dimensional spaces. Because the margin is measured as a distance in feature space, SVMs are sensitive to feature scale, so standardizing the features before training often improves accuracy.
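
A minimal sketch of an SVM trained with and without feature standardization follows. The pipeline with StandardScaler is an addition to the activity's code, and the numbers it prints depend on your Scikit-Learn version.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Same default SVC as in the cell above, trained on the raw features
svm_raw = SVC()
svm_raw.fit(X_train, y_train)
raw_accuracy = accuracy_score(y_test, svm_raw.predict(X_test))

# Default SVC preceded by standardization (zero mean, unit variance per feature)
svm_scaled = make_pipeline(StandardScaler(), SVC())
svm_scaled.fit(X_train, y_train)
scaled_svm_accuracy = accuracy_score(y_test, svm_scaled.predict(X_test))

print(f"SVM Accuracy (raw features):    {raw_accuracy * 100:.2f}%")
print(f"SVM Accuracy (scaled features): {scaled_svm_accuracy * 100:.2f}%")
```

The features in this dataset span very different ranges (areas in the hundreds or thousands, smoothness values near 0.1), so without scaling the large-valued features dominate the distance calculation.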

Step 7: Evaluate and compare model performance
Instructions

Compare the performance of the different models using accuracy, precision, recall, and the F1 score.
Steps

    Import additional evaluation metrics, including precision_score, recall_score, and f1_score.

    Calculate these metrics for each model, and print the results for comparison.

In [11]:
from sklearn.metrics import precision_score, recall_score, f1_score

# Evaluate performance (logistic regression shown; repeat with y_pred_tree and y_pred_svm for the other models)
precision = precision_score(y_test, y_pred, average='weighted')
recall = recall_score(y_test, y_pred, average='weighted')
f1 = f1_score(y_test, y_pred, average='weighted')

print(f"Logistic Regression - Precision: {precision:.2f}, Recall: {recall:.2f}, F1 Score: {f1:.2f}")

Logistic Regression - Precision: 0.96, Recall: 0.96, F1 Score: 0.96


Explanation

Accuracy is not always the best metric for evaluating classification models, especially with imbalanced datasets. Precision, recall, and the F1 score provide a more complete picture of model performance.
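
The cell above reports metrics for logistic regression only. A sketch of computing the same metrics for all three models in a single table follows. The constructor arguments (`max_iter=10000`, `random_state=42`) and the `comparison` DataFrame are assumptions made for this example, so the figures may differ slightly from those printed earlier.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=10000),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "SVM": SVC(),
}

rows = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    # Weighted averages match the averaging used in the cell above
    rows[name] = {
        "accuracy": accuracy_score(y_test, predictions),
        "precision": precision_score(y_test, predictions, average="weighted"),
        "recall": recall_score(y_test, predictions, average="weighted"),
        "f1": f1_score(y_test, predictions, average="weighted"),
    }

# One row per model, one column per metric
comparison = pd.DataFrame.from_dict(rows, orient="index").round(3)
print(comparison)
```

Collecting the scores in one DataFrame makes it easy to sort by a metric, for example `comparison.sort_values("f1", ascending=False)`.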

Conclusion

In this activity, you successfully implemented several classification models using Python, including logistic regression, decision trees, and SVMs. By training and evaluating these models on a dataset, you gained experience in using common metrics to compare their performance. Understanding how different models work and how to evaluate them is crucial for building reliable machine learning systems.