
Tutorial Section on SKLEARN


Scikit-learn, also known as sklearn, is a robust, open-source machine learning library for Python. It was created to simplify the implementation of machine learning and statistical models in Python.

In [None]:
#Install Sklearn
%pip install -U scikit-learn

We will use the wine dataset for this tutorial. The task is to classify wines into one of three cultivars. The three cultivars (classes) represented in the sklearn wine dataset correspond to different varieties of wine grapes. These cultivars are often associated with specific wine-producing regions and have distinct characteristics that influence the flavors, aromas, and overall profiles of the wines produced from them. (https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_wine.html#sklearn.datasets.load_wine)
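
Before building the DataFrame, it can help to confirm the size of the dataset and how many wines fall into each class. The following is a minimal sketch; the variable names `wine` and `class_counts` are illustrative and not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import load_wine

wine = load_wine()

# Shape of the feature matrix: (number of samples, number of features)
n_samples, n_features = wine.data.shape
print(n_samples, n_features)

# Number of wines in each of the three classes
class_counts = np.bincount(wine.target)
for name, count in zip(wine.target_names, class_counts):
    print(name, count)
```

The classes are not perfectly balanced, which is worth keeping in mind when reading per-class metrics later on.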

In [4]:
#import the dataset

import pandas as pd
from sklearn.datasets import load_wine

wine_data = load_wine()

# Convert data to pandas dataframe
wine_df = pd.DataFrame(wine_data.data, columns=wine_data.feature_names)

# Add the target label
wine_df["WineType"] = wine_data.target

# Take a preview
wine_df.head(5)

In [2]:
# cross tab 

table = pd.crosstab(wine_df['alcohol'], wine_df['WineType'])
table.head(5)
# save in a csv
# table.to_csv('wine.csv')


In [5]:
# Perform a t-test comparing the alcohol content of class 0 and class 1 wines
from scipy.stats import ttest_ind

# Select the alcohol values for each of the two classes
class_0_alcohol = wine_df.loc[wine_df['WineType'] == 0, 'alcohol']
class_1_alcohol = wine_df.loc[wine_df['WineType'] == 1, 'alcohol']

# equal_var=False selects Welch's t-test, which does not assume equal variances
ttest_2 = ttest_ind(class_0_alcohol, class_1_alcohol, alternative='two-sided', equal_var=False)
print(ttest_2)

TtestResult(statistic=16.71133933365354, pvalue=5.926412330452344e-34, df=127.84705824080136)
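
To make the result above less of a black box, the Welch t statistic and its degrees of freedom can be reproduced by hand from the two group means and variances. This is a sketch for illustration only; the names `t_manual` and `df_manual` are assumptions, and the data is reloaded so the block runs on its own.

```python
import numpy as np
from scipy.stats import ttest_ind
from sklearn.datasets import load_wine

wine = load_wine()
alcohol = wine.data[:, 0]  # 'alcohol' is the first feature column
a = alcohol[wine.target == 0]
b = alcohol[wine.target == 1]

# Variance of each sample mean (sample variance divided by group size)
va = a.var(ddof=1) / len(a)
vb = b.var(ddof=1) / len(b)

# Welch t statistic: difference in means over its estimated standard error
t_manual = (a.mean() - b.mean()) / np.sqrt(va + vb)

# Welch-Satterthwaite approximation of the degrees of freedom
df_manual = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))

t_scipy = ttest_ind(a, b, equal_var=False).statistic
print(t_manual, df_manual, t_scipy)
```

The very small p-value indicates that the mean alcohol content of the two classes differs by far more than sampling noise would explain.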


In [6]:
#Data Exploration
wine_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 178 entries, 0 to 177
Data columns (total 14 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   alcohol                       178 non-null    float64
 1   malic_acid                    178 non-null    float64
 2   ash                           178 non-null    float64
 3   alcalinity_of_ash             178 non-null    float64
 4   magnesium                     178 non-null    float64
 5   total_phenols                 178 non-null    float64
 6   flavanoids                    178 non-null    float64
 7   nonflavanoid_phenols          178 non-null    float64
 8   proanthocyanins               178 non-null    float64
 9   color_intensity               178 non-null    float64
 10  hue                           178 non-null    float64
 11  od280/od315_of_diluted_wines  178 non-null    float64
 12  proline                       178 non-null    float64
 13  WineT

In [4]:
#Data Exploration
wine_df.describe()

statistic,alcohol,malic_acid,ash,alcalinity_of_ash,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,hue,od280/od315_of_diluted_wines,proline,WineType
count,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0,178.0
mean,13.000618,2.336348,2.366517,19.494944,99.741573,2.295112,2.02927,0.361854,1.590899,5.05809,0.957449,2.611685,746.893258,0.938202
std,0.811827,1.117146,0.274344,3.339564,14.282484,0.625851,0.998859,0.124453,0.572359,2.318286,0.228572,0.70999,314.907474,0.775035
min,11.03,0.74,1.36,10.6,70.0,0.98,0.34,0.13,0.41,1.28,0.48,1.27,278.0,0.0
25%,12.3625,1.6025,2.21,17.2,88.0,1.7425,1.205,0.27,1.25,3.22,0.7825,1.9375,500.5,0.0
50%,13.05,1.865,2.36,19.5,98.0,2.355,2.135,0.34,1.555,4.69,0.965,2.78,673.5,1.0
75%,13.6775,3.0825,2.5575,21.5,107.0,2.8,2.875,0.4375,1.95,6.2,1.12,3.17,985.0,2.0
max,14.83,5.8,3.23,30.0,162.0,3.88,5.08,0.66,3.58,13.0,1.71,4.0,1680.0,2.0


Data preprocessing

Data preprocessing is a vital step in the machine learning workflow because data from the real world is messy. It may contain:

Missing values
Redundant values
Outliers
Errors
Noise

You must deal with all of this before feeding the data to a machine learning model; otherwise, the model will incorporate these mistakes into its approximation function and learn to make mistakes on new instances. This is the idea behind the well-known machine learning saying, “Garbage in, garbage out.”

Another reason is that machine learning models typically require numeric data.  

Apart from the features being on very different scales (compare proline with hue in the summary statistics), there is not much wrong with our data at first glance. To address this, let’s standardize the features using sklearn’s StandardScaler class, which rescales each feature to have a mean of approximately zero and a standard deviation of one.

In [9]:
from sklearn.preprocessing import StandardScaler

# Split data into features and label 
X = wine_df[wine_data.feature_names].copy()
y = wine_df["WineType"].copy() 

# Instantiate scaler and fit on features
scaler = StandardScaler()
scaler.fit(X)
# Transform features (pass the same DataFrame used for fitting so the feature names match)
X_scaled = scaler.transform(X)

# View first instance
print(X_scaled[0])



[ 1.51861254 -0.5622498   0.23205254 -1.16959318  1.91390522  0.80899739
  1.03481896 -0.65956311  1.22488398  0.25171685  0.36217728  1.84791957
  1.01300893]
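
To see what StandardScaler computes, the same standardization can be written by hand: subtract each column's mean and divide by its standard deviation. The sketch below reloads the data so it is self-contained; `X_raw`, `X_manual`, and `X_sklearn` are illustrative names.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

X_raw = load_wine().data

# Standardize by hand: z = (x - column mean) / column standard deviation.
# StandardScaler uses the population standard deviation (ddof=0), as does ndarray.std() by default.
X_manual = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

# The same transformation with scikit-learn
X_sklearn = StandardScaler().fit_transform(X_raw)

print(np.allclose(X_manual, X_sklearn))
print(X_manual.mean(axis=0).round(6))  # each column mean is approximately 0
print(X_manual.std(axis=0).round(6))   # each column standard deviation is 1
```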


Model Training --- Splitting the dataset

Before a machine learning model can make predictions, it must be trained on a set of data to learn an approximation function. 

There are several ways to split data into train and test sets, but scikit-learn has a built-in function to do this on our behalf called train_test_split(). 

We’ll use this function to split our data such that 70% is used to train the model and 30% is used to evaluate the model's ability to generalize to unseen instances. 

In [10]:
from sklearn.model_selection import train_test_split

# Split data into train and test
X_train_scaled, X_test_scaled, y_train, y_test = train_test_split(
    X_scaled,
    y,
    train_size=0.7,
    random_state=0,
)

# Check the splits are correct
print(f"Train size: {round(len(X_train_scaled) / len(X) * 100)}% \n\
Test size: {round(len(X_test_scaled) / len(X) * 100)}%")



Train size: 70% 
Test size: 30%
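
One caveat about the workflow above: the scaler was fitted on all 178 rows before the split, so the test rows influenced the scaling statistics. A common alternative, sketched here as an assumption rather than as part of the original notebook, is to split the raw features first and let a Pipeline fit the scaler on the training rows only. The names `leak_free_model`, `X_tr`, and `X_te` are hypothetical.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X_raw, y_raw = load_wine(return_X_y=True)

# Split the raw (unscaled) features first
X_tr, X_te, y_tr, y_te = train_test_split(X_raw, y_raw, train_size=0.7, random_state=0)

# The pipeline fits the scaler on the training rows only,
# then applies the same scaling to the test rows at prediction time
leak_free_model = make_pipeline(StandardScaler(), LogisticRegression())
leak_free_model.fit(X_tr, y_tr)

test_accuracy = leak_free_model.score(X_te, y_te)
print(f"Test accuracy: {test_accuracy:.2f}")
```

On a small, clean dataset like this one the difference is likely to be minor, but the habit matters more on larger or noisier data.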


Building the model
Thanks to sklearn, building a machine learning model is extremely simple. 

We are going to build three models to predict the class of wine: 

Logistic regression (https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression)
Support vector machine (https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html#sklearn.svm.SVC)
Decision tree classifier (https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier)

In [13]:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Instantiating the models
logistic_regression = LogisticRegression()
svm = SVC()
# Why do you think random_state is set here but not for the other two models?
# Some machine learning algorithms involve randomness during training and will not
# produce the same result on every run unless random_state is declared.
tree = DecisionTreeClassifier(random_state=0)

# Training the models 
logistic_regression.fit(X_train_scaled, y_train)
svm.fit(X_train_scaled, y_train)
tree.fit(X_train_scaled, y_train)

# Making predictions with each model
log_reg_preds = logistic_regression.predict(X_test_scaled)
svm_preds = svm.predict(X_test_scaled)
tree_preds = tree.predict(X_test_scaled)

[0 2 1 0 1 1 0 2 1 1 2 2 0 1 2 1 0 0 1 0 1 0 0 1 1 1 1 1 1 2 0 0 1 0 0 0 2
 1 1 2 0 0 1 1 1 0 2 1 2 0 2 2 0 2]
[0 2 1 0 1 1 0 2 1 1 2 2 0 1 2 1 0 0 1 0 1 0 0 1 1 1 1 1 1 2 0 0 1 0 0 0 2
 1 1 2 0 0 1 1 1 0 2 1 2 0 2 2 0 2]
[0 2 1 0 1 1 0 2 1 1 2 2 0 1 2 1 0 0 1 0 1 0 0 1 1 1 1 1 1 2 0 0 1 0 0 0 2
 1 1 2 0 0 1 1 1 0 2 1 2 0 2 2 0 2]


Model evaluation

Model evaluation tests how well the model generalizes to unseen instances. Scikit-learn provides an array of classification and regression metrics to evaluate a trained model's performance.

In [14]:
from sklearn.metrics import classification_report

# Store model predictions in a dictionary;
# this makes it easier to iterate through each model
# and print the results.
model_preds = {
    "Logistic Regression": log_reg_preds,
    "Support Vector Machine": svm_preds,
    "Decision Tree": tree_preds
}

for model, preds in model_preds.items():
    # classification_report returns a formatted string with per-class metrics
    print(f"{model} Results:\n{classification_report(y_test, preds)}")



Logistic Regression Results:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        19
           1       1.00      1.00      1.00        22
           2       1.00      1.00      1.00        13

    accuracy                           1.00        54
   macro avg       1.00      1.00      1.00        54
weighted avg       1.00      1.00      1.00        54

Support Vector Machine Results:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        19
           1       1.00      1.00      1.00        22
           2       1.00      1.00      1.00        13

    accuracy                           1.00        54
   macro avg       1.00      1.00      1.00        54
weighted avg       1.00      1.00      1.00        54

Decision Tree Results:
              precision    recall  f1-score   support

           0       1.00      0.89      0.94        19
           1       0.91      0.95      0.93  
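
To make the columns of the classification report concrete, precision and recall can be computed by hand from a confusion matrix. The example below uses a small made-up set of labels (`y_true_toy`, `y_pred_toy`), not the wine predictions, so the arithmetic is easy to follow.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score

# Toy labels for illustration only (not taken from the wine models)
y_true_toy = [0, 0, 0, 1, 1, 1, 2, 2]
y_pred_toy = [0, 0, 1, 1, 1, 1, 2, 0]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_toy, y_pred_toy)
print(cm)

# Correct predictions for each class sit on the diagonal
true_positives = np.diag(cm)

# Precision: of everything predicted as class c, the fraction that really is c
precision_manual = true_positives / cm.sum(axis=0)

# Recall: of everything that really is class c, the fraction predicted as c
recall_manual = true_positives / cm.sum(axis=1)

precision_sklearn = precision_score(y_true_toy, y_pred_toy, average=None)
recall_sklearn = recall_score(y_true_toy, y_pred_toy, average=None)
accuracy = accuracy_score(y_true_toy, y_pred_toy)

print(precision_manual, recall_manual, accuracy)
```

The "support" column in the report is simply the number of true instances of each class, which corresponds to the row sums of the confusion matrix.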

Tasks

1. Change the random state of the decision tree classifier (for example, set it to 42). What effect does this change have?
2. Conduct an experiment using the three machine learning algorithms on the sklearn breast cancer dataset (https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_breast_cancer.html#sklearn.datasets.load_breast_cancer)
3. What is the performance of the random forest algorithm on the breast cancer dataset? (https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier)

Clustering is a type of unsupervised learning technique used to group data points or objects based on their similarity. The goal of clustering is to identify inherent patterns or structures in the data without prior knowledge of true labels. Clustering algorithms partition the data into groups or clusters such that data points within the same cluster are more similar to each other than to those in other clusters.

In this tutorial we will use two clustering algorithms: K-means and Agglomerative Clustering.

We will evaluate them with the Silhouette Score, Davies-Bouldin Index, Calinski-Harabasz Score, and Adjusted Rand Index (ARI).

K-means Clustering:
K-means is a popular centroid-based clustering algorithm. It partitions the data into K clusters by iteratively assigning data points to the nearest cluster centroid and updating the centroids based on the mean of data points assigned to each cluster.
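
The assign-then-update loop described above can be sketched in a few lines of NumPy. This is a simplified illustration, not scikit-learn's implementation: the function name `kmeans_sketch`, the random initialization, and the toy points are all assumptions made for the example.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Simplified K-means: random initial centroids, then alternate assign/update."""
    rng = np.random.default_rng(seed)
    # Pick k distinct data points as the starting centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: distance from every point to every centroid, shape (n, k)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Two well-separated groups of four points each
toy_points = np.array([[0, 0], [0, 1], [1, 0], [1, 1],
                       [10, 10], [10, 11], [11, 10], [11, 11]], dtype=float)
toy_labels, toy_centroids = kmeans_sketch(toy_points, k=2)
print(toy_labels)
print(toy_centroids)
```

Scikit-learn's KMeans adds a smarter initialization (k-means++) and other refinements, which is why it is preferred in practice.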


Agglomerative Clustering:
Agglomerative clustering is a hierarchical clustering method that starts with each data point as a separate cluster and iteratively merges the closest pair of clusters according to a linkage criterion (e.g., the distance between clusters).
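
A tiny one-dimensional example may help show the merge order. The sketch below uses five made-up points and single linkage (the distance between two clusters is the distance between their closest members); both the points and the choice of linkage are assumptions for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from sklearn.cluster import AgglomerativeClustering

# Five points on a line: two tight pairs and one far-away point
points = np.array([[0.0], [1.0], [5.0], [7.0], [20.0]])

# Each row of Z records one merge; column 2 holds the distance at which it happened
Z = linkage(points, method="single")
merge_distances = Z[:, 2]
print(merge_distances)

# Stopping the merging when three clusters remain
labels = AgglomerativeClustering(n_clusters=3, linkage="single").fit_predict(points)
print(labels)
```

The two closest points merge first, then the next closest pair, and the isolated point at 20 joins last. Stopping at three clusters therefore leaves the two pairs and the isolated point as separate groups.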


Performance Metrics:
Silhouette Score: Measures how similar a data point is to its own cluster compared to other clusters. Scores range from -1 to 1; a higher score indicates dense and well-separated clusters.

Davies-Bouldin Index: Computes the average similarity between each cluster and its most similar cluster; lower values indicate better clustering.

Calinski-Harabasz Score: Ratio of between-cluster dispersion to within-cluster dispersion; higher values indicate better-defined clusters.

Adjusted Rand Index (ARI): Compares the true class assignments with the clustering results, corrected for chance agreement. Unlike the other three metrics, it requires ground-truth labels; a score of 1 indicates a perfect match.
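
One property of the ARI is worth seeing in code: cluster labels are arbitrary names, so a clustering that recovers the true groups under different label numbers still scores 1. The label lists below are made up for illustration.

```python
from sklearn.metrics import adjusted_rand_score

true_labels = [0, 0, 1, 1, 2, 2]

# Same grouping as the true labels, but with the cluster numbers shuffled
same_grouping_renamed = [2, 2, 0, 0, 1, 1]

# A grouping that never puts two same-class points together
different_grouping = [0, 1, 2, 0, 1, 2]

ari_same = adjusted_rand_score(true_labels, same_grouping_renamed)
ari_diff = adjusted_rand_score(true_labels, different_grouping)
print(ari_same, ari_diff)
```

This is why the K-means and agglomerative labels can be compared directly against the wine classes without first matching cluster numbers to class numbers.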

In [None]:
# Importing necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import silhouette_score, davies_bouldin_score, calinski_harabasz_score, adjusted_rand_score


In [None]:
# Load the wine dataset
data = load_wine()
X = data.data  # Features
y = data.target  # True labels (for adjusted Rand index)


In [None]:
# Standardize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)



In [None]:
# Define the clustering algorithms
kmeans = KMeans(n_clusters=3, random_state=42)
agg_clustering = AgglomerativeClustering(n_clusters=3)

In [None]:
# Fit the clustering algorithms to the scaled data
kmeans_labels = kmeans.fit_predict(X_scaled)
agg_labels = agg_clustering.fit_predict(X_scaled)

In [None]:
# Evaluate K-means clustering performance using multiple metrics
metrics_kmeans = {
    'Silhouette Score': silhouette_score(X_scaled, kmeans_labels),
    'Davies-Bouldin Index': davies_bouldin_score(X_scaled, kmeans_labels),
    'Calinski-Harabasz Score': calinski_harabasz_score(X_scaled, kmeans_labels),
    'Adjusted Rand Index': adjusted_rand_score(y, kmeans_labels)  # Using true labels for ARI
}

print("K-means Clustering Performance:")
for metric_name, metric_value in metrics_kmeans.items():
    print(f"{metric_name}: {metric_value:.4f}")

In [None]:
# Evaluate Agglomerative Clustering performance using multiple metrics
metrics_agg = {
    'Silhouette Score': silhouette_score(X_scaled, agg_labels),
    'Davies-Bouldin Index': davies_bouldin_score(X_scaled, agg_labels),
    'Calinski-Harabasz Score': calinski_harabasz_score(X_scaled, agg_labels),
    'Adjusted Rand Index': adjusted_rand_score(y, agg_labels)  # Using true labels for ARI
}

print("\nAgglomerative Clustering Performance:")
for metric_name, metric_value in metrics_agg.items():
    print(f"{metric_name}: {metric_value:.4f}")

We knew that there are three cultivars in the wine dataset. What if we did not? Several methods can be used to estimate the optimal number of clusters when it is not known in advance, including the Elbow Method, Silhouette Analysis, the Gap Statistic, and Hierarchical Clustering.

Elbow Method:
The elbow method involves plotting the within-cluster sum of squares (inertia) against the number of clusters (K) and identifying the "elbow" point, beyond which adding more clusters reduces inertia only slightly. This point is a good estimate of the optimal number of clusters.

In [1]:
inertias = []
for k in range(1, 11):
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(X_scaled)
    inertias.append(kmeans.inertia_)

plt.plot(range(1, 11), inertias, marker='o')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Inertia')
plt.title('Elbow Method for Optimal K')
plt.show()



Silhouette Analysis:
Silhouette analysis measures how well each data point fits into its assigned cluster and can be used to estimate the optimal number of clusters. The value of K that gives the highest average silhouette score is taken as the best candidate.

In [None]:
silhouette_scores = []
for k in range(2, 11):
    kmeans = KMeans(n_clusters=k, random_state=42)
    labels = kmeans.fit_predict(X_scaled)
    silhouette_avg = silhouette_score(X_scaled, labels)
    silhouette_scores.append(silhouette_avg)

plt.plot(range(2, 11), silhouette_scores, marker='o')
plt.xlabel('Number of Clusters (K)')
plt.ylabel('Silhouette Score')
plt.title('Silhouette Analysis for Optimal K')
plt.show()


Gap Statistic:
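The gap statistic compares the within-cluster dispersion of the data with the dispersion expected under a reference null distribution that has no cluster structure (typically uniform). A larger gap suggests stronger cluster structure, so the optimal number of clusters is chosen at or near the K where the gap statistic is largest.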

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans


def compute_gap_statistic(data, k_range, n_ref_samples=10, random_seed=None):
    """
    Compute the Gap Statistic for estimating the optimal number of clusters.
    
    Parameters:
        data (numpy.ndarray): Input data matrix (n_samples, n_features).
        k_range (list): List of integers specifying the range of k values (number of clusters) to evaluate.
        n_ref_samples (int): Number of reference samples to generate for calculating the reference distribution.
        random_seed (int): Random seed for reproducibility.
    
    Returns:
        tuple: Tuple containing the calculated gap statistics and standard deviations for each k value.
    """
    np.random.seed(random_seed)
    
    # Initialize arrays to store gap statistics and standard deviations
    gap_stats = []
    gap_stds = []
    
    for k in k_range:
        # Fit KMeans clustering to the data
        kmeans_model = KMeans(n_clusters=k, random_state=random_seed)
        kmeans_model.fit(data)
        
        # Calculate the within-cluster dispersion (log of sum of square distances)
        Wk = np.log(kmeans_model.inertia_)
        
        # Generate reference datasets and calculate their within-cluster dispersions
        ref_Wks = []
        for _ in range(n_ref_samples):
            # Generate a reference dataset with the same shape as the original data,
            # drawn uniformly between the minimum and maximum of each feature
            ref_data = np.random.uniform(data.min(axis=0), data.max(axis=0), size=data.shape)
            
            # Fit KMeans to reference dataset
            ref_kmeans_model = KMeans(n_clusters=k, random_state=random_seed)
            ref_kmeans_model.fit(ref_data)
            
            # Calculate within-cluster dispersion of reference dataset
            ref_Wk = np.log(ref_kmeans_model.inertia_)
            ref_Wks.append(ref_Wk)
        
        # Calculate Gap Statistic and its standard deviation
        gap_stat = np.mean(ref_Wks) - Wk
        gap_std = np.std(ref_Wks) * np.sqrt(1 + 1/n_ref_samples)
        
        gap_stats.append(gap_stat)
        gap_stds.append(gap_std)
    
    return np.array(gap_stats), np.array(gap_stds)


# Define the range of k values (number of clusters) to evaluate
k_range = range(1, 11)

# Compute Gap Statistic for the range of k values
gap_stats, gap_stds = compute_gap_statistic(X_scaled, k_range, n_ref_samples=10, random_seed=42)

# Plotting the Gap Statistic curve
plt.figure(figsize=(10, 6))
# errorbar draws both the line with markers and the error bars, so a separate plot call is not needed
plt.errorbar(k_range, gap_stats, yerr=gap_stds, fmt='-o', color='b', capsize=3, label='Gap Statistic with Std Dev')
plt.xlabel('Number of Clusters (k)')
plt.ylabel('Gap Statistic')
plt.title('Gap Statistic for Optimal k')
plt.xticks(k_range)
plt.legend()
plt.grid(True)
plt.show()
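
Reading the best k off the plot can be subjective. One commonly cited rule, proposed by Tibshirani, Walther and Hastie, picks the smallest k for which Gap(k) >= Gap(k+1) - s(k+1), where s is the standard deviation term. The sketch below applies that rule; the helper name `choose_k_from_gap` is hypothetical and the numbers are made up for illustration, not results from the wine data.

```python
import numpy as np

def choose_k_from_gap(k_values, gap_stats, gap_stds):
    """Return the smallest k with Gap(k) >= Gap(k+1) - s(k+1).

    Falls back to the k with the largest gap if no k satisfies the rule.
    """
    k_values = list(k_values)
    for i in range(len(k_values) - 1):
        if gap_stats[i] >= gap_stats[i + 1] - gap_stds[i + 1]:
            return k_values[i]
    return k_values[int(np.argmax(gap_stats))]

# Made-up gap values for k = 1..5 (illustrative only)
example_gaps = np.array([0.10, 0.30, 0.60, 0.62, 0.61])
example_stds = np.array([0.05, 0.05, 0.05, 0.05, 0.05])

chosen_k = choose_k_from_gap(range(1, 6), example_gaps, example_stds)
print(chosen_k)
```

With the arrays returned by compute_gap_statistic, the call would be choose_k_from_gap(k_range, gap_stats, gap_stds).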

Hierarchical Clustering (Dendrogram):
Hierarchical clustering can provide insights into the underlying structure of the data by visualizing a dendrogram, which represents the hierarchical merging of clusters. The height at which branches are merged can help determine the number of clusters.

In [None]:
from scipy.cluster.hierarchy import dendrogram, linkage
Z = linkage(X_scaled, method='ward')
plt.figure(figsize=(12, 8))
dendrogram(Z)
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Sample Index')
plt.ylabel('Distance')
plt.show()
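
Once a number of clusters has been chosen from the dendrogram, the tree can be cut to obtain a flat cluster label for each sample. The sketch below uses SciPy's fcluster for this; it reloads and rescales the data so it runs on its own, and the names `X_std` and `flat_labels` are illustrative.

```python
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import load_wine
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

X_raw, y_true = load_wine(return_X_y=True)
X_std = StandardScaler().fit_transform(X_raw)

# Build the same Ward linkage used for the dendrogram
Z = linkage(X_std, method="ward")

# Cut the tree so that at most 3 flat clusters remain (fcluster labels start at 1)
flat_labels = fcluster(Z, t=3, criterion="maxclust")

print(sorted(set(flat_labels)))
# Compare the flat clusters with the known cultivars
ari_hierarchical = adjusted_rand_score(y_true, flat_labels)
print(f"Adjusted Rand Index: {ari_hierarchical:.4f}")
```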


Task: Repeat the clustering tasks on the sklearn breast cancer (BRCA) dataset (load_breast_cancer).