# Lab 1 - Simple ML Pipeline with Scikit-Learn



## Introduction

In this notebook, we'll walk through a basic machine learning pipeline. We will:
1. Load and understand the dataset.
2. Preprocess the data.
3. Split the data into training and test sets.
4. Train a machine learning model.
5. Evaluate the model's performance.

Resources:
- https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression
- https://pandas.pydata.org/docs/

In [41]:
# Import necessary libraries
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

## Step 1: Load the Dataset

In [42]:
# Load the breast cancer dataset
data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

# Display the first few rows of the dataset
df.head()

|  | mean radius | mean texture | mean perimeter | mean area | mean smoothness | mean compactness | mean concavity | mean concave points | mean symmetry | mean fractal dimension | ... | worst texture | worst perimeter | worst area | worst smoothness | worst compactness | worst concavity | worst concave points | worst symmetry | worst fractal dimension | target |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | 0.2419 | 0.07871 | ... | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 | 0 |
| 1 | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | 0.1812 | 0.05667 | ... | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 | 0 |
| 2 | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | 0.2069 | 0.05999 | ... | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 | 0 |
| 3 | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | 0.2597 | 0.09744 | ... | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 | 0 |
| 4 | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | 0.1809 | 0.05883 | ... | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 | 0 |


In [43]:
# TODO:
# 1. How many features (columns) are there, and what are the column names?
print(f"Total columns (features + target): {len(df.columns)}. Column names: {df.columns}")
# 2. What are the data types of the columns?
for col in df.columns:
    # df[col].dtype is the dtype of the column's values;
    # type(col) would only report the type of the column label (always str here)
    print(col, df[col].dtype)
# 3. Are there any missing values?
print("Missing values", df.isna().sum())
# 4. What are the unique values of the target column?
print("Unique targets", df['target'].unique())

Total columns (features + target): 31. Column names: Index(['mean radius', 'mean texture', 'mean perimeter', 'mean area',
       'mean smoothness', 'mean compactness', 'mean concavity',
       'mean concave points', 'mean symmetry', 'mean fractal dimension',
       'radius error', 'texture error', 'perimeter error', 'area error',
       'smoothness error', 'compactness error', 'concavity error',
       'concave points error', 'symmetry error', 'fractal dimension error',
       'worst radius', 'worst texture', 'worst perimeter', 'worst area',
       'worst smoothness', 'worst compactness', 'worst concavity',
       'worst concave points', 'worst symmetry', 'worst fractal dimension',
       'target'],
      dtype='object')
mean radius float64
mean texture float64
mean perimeter float64
mean area float64
mean smoothness float64
mean compactness float64
mean concavity float64
mean concave points float64
mean symmetry float64
mean fractal dimension float64
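
The four questions above can also be answered with vectorized pandas calls instead of a loop. The following is a minimal sketch rather than the lab's reference solution; mapping the integer labels to `data.target_names` is an addition made here for readability.

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()
df = pd.DataFrame(data.data, columns=data.feature_names)
df["target"] = data.target

# Dtype of every column in one call (a Series indexed by column name).
dtypes = df.dtypes

# Total number of missing cells across the whole frame.
n_missing = int(df.isna().sum().sum())

# Map integer labels to class names, then count the rows in each class.
label_names = {i: str(name) for i, name in enumerate(data.target_names)}
class_counts = df["target"].map(label_names).value_counts()

print(dtypes.value_counts())
print("Missing cells:", n_missing)
print(class_counts)
```

Checking the class counts early is useful because plain accuracy is harder to interpret when one class dominates.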

## Step 2: Data Preprocessing

In [44]:
# Separate features and target variable
X = df.drop(columns='target')
y = df['target']

In [45]:
# Standardize each feature to zero mean and unit variance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

In [46]:
# TODO:
# Print first item of X and X_scaled to compare the values
print(X.iloc[0], X_scaled[0])

mean radius                  17.990000
mean texture                 10.380000
mean perimeter              122.800000
mean area                  1001.000000
mean smoothness               0.118400
mean compactness              0.277600
mean concavity                0.300100
mean concave points           0.147100
mean symmetry                 0.241900
mean fractal dimension        0.078710
radius error                  1.095000
texture error                 0.905300
perimeter error               8.589000
area error                  153.400000
smoothness error              0.006399
compactness error             0.049040
concavity error               0.053730
concave points error          0.015870
symmetry error                0.030030
fractal dimension error       0.006193
worst radius                 25.380000
worst texture                17.330000
worst perimeter             184.600000
worst area                 2019.000000
worst smoothness              0.162200
worst compactness        
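
Comparing a single row shows that the values changed, but not whether the scaling did what was intended. A short sketch of a more direct check follows: after `StandardScaler`, every column should have mean close to 0 and standard deviation close to 1, and the result should match the z-score formula `(x - mean) / std` computed by hand. The variable names below are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.preprocessing import StandardScaler

X = load_breast_cancer().data
X_scaled = StandardScaler().fit_transform(X)

# Manual z-score per column; np.std defaults to the population std (ddof=0),
# which is what StandardScaler uses.
manual = (X - X.mean(axis=0)) / X.std(axis=0)

col_means = X_scaled.mean(axis=0)
col_stds = X_scaled.std(axis=0)

print("Largest absolute column mean:", np.abs(col_means).max())
print("Column std range:", col_stds.min(), col_stds.max())
print("Matches manual z-score:", np.allclose(X_scaled, manual))
```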

## Step 3: Split the Data into Training and Testing Sets

In [47]:
# Hold out 30% of the rows for testing; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, random_state=42)
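
In this lab the scaler is fitted on the full dataset before the split, so the test rows contribute to the means and standard deviations used for scaling. A common alternative is to split the raw features first and let a `Pipeline` fit the scaler on the training rows only. The sketch below shows that variant; `make_pipeline` and `stratify=y` are choices made here and are not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

# Split the raw (unscaled) features first; stratify keeps class proportions similar.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# fit() learns the scaling statistics from X_train only, then trains the model.
pipe = make_pipeline(StandardScaler(), LogisticRegression())
pipe.fit(X_train, y_train)

# score() applies the training-set scaling to X_test before predicting.
test_accuracy = pipe.score(X_test, y_test)
print(f"Accuracy: {test_accuracy:.2f}")
```

On a small, clean dataset like this one the difference in accuracy is likely to be minor, but the pipeline form also works unchanged with cross-validation.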

## Step 4: Train a Model

In [48]:
model = LogisticRegression()
model.fit(X_train, y_train)

| Parameter | Value |
| --- | --- |
| penalty | 'l2' |
| dual | False |
| tol | 0.0001 |
| C | 1.0 |
| fit_intercept | True |
| intercept_scaling | 1 |
| class_weight | None |
| random_state | None |
| solver | 'lbfgs' |
| max_iter | 100 |


## Step 5: Make Predictions and Evaluate the Model

In [49]:
# Make predictions on the test data
y_pred = model.predict(X_test)

In [50]:
# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

Accuracy: 0.98
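
The imports cell also brings in `confusion_matrix` and `classification_report`, which are not used above. Accuracy alone hides which class the errors fall in, so a sketch of how those two functions could round out Step 5 follows. It rebuilds the same split and model so that it runs on its own; passing `target_names` to the report is an addition made here.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

data = load_breast_cancer()
X_scaled = StandardScaler().fit_transform(data.data)
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, data.target, test_size=0.3, random_state=42
)
model = LogisticRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are true classes and columns are predicted classes, in label order 0, 1.
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Per-class precision, recall and F1, labelled with the class names.
report = classification_report(y_test, y_pred, target_names=data.target_names)
print(report)
```

The diagonal of the matrix counts correct predictions, so its sum divided by the total equals the accuracy printed above.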


## Exercise 1: Experiment with Different Models

Objective: To understand how different models perform on the same dataset and to practice replacing models in a pipeline.

Instructions:

- Replace the `LogisticRegression` model with another classifier, such as `KNeighborsClassifier` or `DecisionTreeClassifier`.
- Train and evaluate the new model on the same training and test sets.
- Compare its accuracy with that of the `LogisticRegression` model.

Questions:

- How did the performance of the new model compare to Logistic Regression?

In [51]:
from sklearn.tree import DecisionTreeClassifier

In [52]:
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")

Accuracy: 0.93
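
To compare several classifiers on identical data, one option is to loop over a dictionary of models. This is a sketch under a few assumptions: the dictionary layout and the name `results` are illustrative, and `random_state=42` is set on the decision tree here (the cell above leaves it unset, so its score can vary slightly between runs).

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.3, random_state=42
)

models = {
    "LogisticRegression": LogisticRegression(),
    "KNeighborsClassifier": KNeighborsClassifier(),
    # A fixed random_state makes the tree's result repeatable across runs.
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=42),
}

# Fit each model on the same training set and score it on the same test set.
results = {}
for name, clf in models.items():
    clf.fit(X_train, y_train)
    results[name] = accuracy_score(y_test, clf.predict(X_test))
    print(f"{name}: {results[name]:.2f}")
```

Because every model sees the same split, differences in the printed scores reflect the models rather than the data they happened to receive.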


## Exercise 2: Experiment with Different Datasets

Choose one of these datasets, rerun the pipeline on it, and compare the accuracy with the breast cancer result.

- Wine Dataset

```python
from sklearn.datasets import load_wine
data = load_wine()
```
- Diabetes Dataset (the target is continuous, so this is a regression task and accuracy does not apply)
```python
from sklearn.datasets import load_diabetes
data = load_diabetes()
```
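
For the diabetes option, a regressor and a regression metric take the place of the classifier and accuracy. The following is a minimal sketch, assuming `LinearRegression` and `r2_score` as stand-ins; neither appears in the original lab. The features returned by `load_diabetes` are already mean-centered and scaled, so no `StandardScaler` step is used here.

```python
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

X, y = load_diabetes(return_X_y=True)

# The target is a continuous disease-progression measure, not a class label.
n_unique_targets = len(np.unique(y))
print("Distinct target values:", n_unique_targets)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

reg = LinearRegression().fit(X_train, y_train)

# R^2 measures the share of target variance explained on the held-out rows.
r2 = r2_score(y_test, reg.predict(X_test))
print(f"R^2: {r2:.2f}")
```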

In [55]:
from sklearn.datasets import load_wine
data = load_wine()
df = pd.DataFrame(data.data, columns=data.feature_names)
df['target'] = data.target

In [56]:
df.head()

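|  | alcohol | malic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | color_intensity | hue | od280/od315_of_diluted_wines | proline | target |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 14.23 | 1.71 | 2.43 | 15.6 | 127.0 | 2.8 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065.0 | 0 |
| 1 | 13.2 | 1.78 | 2.14 | 11.2 | 100.0 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.4 | 1050.0 | 0 |
| 2 | 13.16 | 2.36 | 2.67 | 18.6 | 101.0 | 2.8 | 3.24 | 0.3 | 2.81 | 5.68 | 1.03 | 3.17 | 1185.0 | 0 |
| 3 | 14.37 | 1.95 | 2.5 | 16.8 | 113.0 | 3.85 | 3.49 | 0.24 | 2.18 | 7.8 | 0.86 | 3.45 | 1480.0 | 0 |
| 4 | 13.24 | 2.59 | 2.87 | 21.0 | 118.0 | 2.8 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735.0 | 0 |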


In [57]:
X = df.drop(columns='target')
y = df['target']

In [58]:
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

In [59]:
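# Re-split using the wine features; without this step X_train and X_test
# still hold the breast cancer data from Step 3.
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, random_state=42)

model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")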
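
When the same steps are repeated for a new dataset, it is easy to reuse `X_train` and `X_test` from an earlier cell by accident. One way to guard against that is to wrap the steps in a function so that each dataset gets its own local split. In the sketch below, `evaluate_dataset` is a hypothetical helper written for this lab, not a scikit-learn function.

```python
from sklearn.datasets import load_breast_cancer, load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler


def evaluate_dataset(loader, test_size=0.3, random_state=42):
    """Scale, split, fit and score one classification dataset.

    Hypothetical helper: `loader` is any sklearn loader that accepts
    return_X_y=True. Returns the test-set accuracy.
    """
    X, y = loader(return_X_y=True)
    X_scaled = StandardScaler().fit_transform(X)
    # All split variables are local, so no data can leak in from another dataset.
    X_train, X_test, y_train, y_test = train_test_split(
        X_scaled, y, test_size=test_size, random_state=random_state
    )
    model = LogisticRegression().fit(X_train, y_train)
    return accuracy_score(y_test, model.predict(X_test))


scores = {
    "breast_cancer": evaluate_dataset(load_breast_cancer),
    "wine": evaluate_dataset(load_wine),
}
for name, acc in scores.items():
    print(f"{name}: {acc:.2f}")
```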
