
# **Isolation Forest in Scikit-learn**


In [None]:
from sklearn.ensemble import IsolationForest
iforest = IsolationForest(n_estimators = 100).fit(df)

# Inspect the first 9 trees of the forest (iforest.estimators_[:9]):


iforest.estimators_[:9]

output:-

[ExtraTreeRegressor(max_depth=3, max_features=1, random_state=637945643),
 ExtraTreeRegressor(max_depth=3, max_features=1, random_state=1319570945)]

The code above, step by step:

```python
from sklearn.ensemble import IsolationForest

# Create an Isolation Forest model with 100 estimators
iforest = IsolationForest(n_estimators=100).fit(df)
```

In this code snippet, we import the `IsolationForest` class from the `sklearn.ensemble` module of the scikit-learn library. The Isolation Forest is an anomaly detection algorithm that is used to identify outliers or anomalies in a dataset. We create an instance of the Isolation Forest model with 100 estimators and fit it to the `df` dataset.

```python
iforest.estimators_[:9]
```

The `estimators_` attribute of the Isolation Forest model returns the individual decision trees that are part of the forest. In this code, we access the first 9 trees using slicing (`[:9]`). This provides a glimpse into the structure of the forest by displaying the specifications of the decision trees.

The output shown lists two of the trees in the forest, each represented as an `ExtraTreeRegressor` object. Each tree has parameters such as `max_depth` (the maximum depth of the tree), `max_features` (the number of features considered at each split; 1 means a single randomly chosen feature), and `random_state` (the random seed of that tree).

The purpose of this code is to demonstrate the creation of an Isolation Forest model using the `sklearn.ensemble` module. By accessing the `estimators_` attribute, we can examine the individual trees in the forest and check their settings, such as the depth limit and the per-tree seed.
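As a hedged illustration of inspecting `estimators_`, the sketch below fits a forest on synthetic data (the array `X_demo` and its size are assumptions, not the notebook's `df`) and checks how deep the trees are allowed to grow. In scikit-learn's implementation the depth limit is derived from the subsample size as ceil(log2(max_samples)), so the `max_depth=3` in the output above would be consistent with a `df` of only 5 to 8 rows.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic stand-in for the notebook's df (assumption: 200 rows, 2 features)
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 2))

iforest_demo = IsolationForest(n_estimators=100, random_state=0).fit(X_demo)

# One fitted tree per estimator
n_trees = len(iforest_demo.estimators_)

# Depth cap applied to every tree: ceil(log2(subsample size))
depth_cap = int(np.ceil(np.log2(max(iforest_demo.max_samples_, 2))))
deepest_tree = max(tree.get_depth() for tree in iforest_demo.estimators_)

print(n_trees, iforest_demo.max_samples_, depth_cap, deepest_tree)
```

With the default `max_samples='auto'`, the subsample size is the smaller of 256 and the number of rows, which is why small datasets produce very shallow trees.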

# **Isolation Forest**

In [None]:
from sklearn.ensemble import IsolationForest

# Create the Isolation Forest model
iforest = IsolationForest(n_estimators=100, contamination=0.1, random_state=42)

# Fit the model to the data
iforest.fit(X)

# Predict anomaly scores
anomaly_scores = iforest.decision_function(X)

# Predict anomalies
predictions = iforest.predict(X)



Explanation of each step:

Import the IsolationForest class from the sklearn.ensemble module.

Create an instance of the Isolation Forest model. In this example, we set n_estimators to 100, which determines the number of isolation trees in the forest. The contamination parameter is used to specify the expected proportion of anomalies in the data. Here, we set it to 0.1, assuming that 10% of the data are anomalies. The random_state parameter ensures reproducibility of the results.

Fit the model to the data X. The Isolation Forest builds its trees from random splits on subsamples of X; no labels are needed.

Use the decision_function() method to compute anomaly scores for each observation in the data. Anomaly scores represent the degree of abnormality of each instance. The lower the score, the more anomalous the instance; negative scores correspond to predicted anomalies.

Use the predict() method to classify each instance as an anomaly (-1) or a normal observation (1). Anomalies are assigned a label of -1, while normal observations are assigned a label of 1.

Isolation Forest is a popular algorithm for outlier/anomaly detection. It works by constructing an ensemble of isolation trees, which are binary trees that recursively partition the data space. The isolation trees separate anomalies quickly, as they require fewer splits to isolate them from the rest of the data. The final anomaly score is computed based on the average depth of isolation of each instance across all the trees in the forest.

By using the decision_function() and predict() methods, we can obtain anomaly scores and predict anomalies in the data based on the Isolation Forest model's learned patterns.
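The relationship between the two methods can be checked directly. The following is a minimal sketch on synthetic data (the arrays and the names `X_demo` and `model` are assumptions for illustration): `predict()` returns -1 exactly where `decision_function()` is negative, and `decision_function()` is `score_samples()` shifted by the fitted `offset_`, which is chosen from the `contamination` setting.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data: a dense blob plus a few far-away points (assumption)
rng = np.random.RandomState(42)
inliers = rng.normal(loc=0.0, scale=1.0, size=(190, 2))
outliers = rng.uniform(low=6.0, high=8.0, size=(10, 2))
X_demo = np.vstack([inliers, outliers])

model = IsolationForest(n_estimators=100, contamination=0.1, random_state=42).fit(X_demo)
scores = model.decision_function(X_demo)
labels = model.predict(X_demo)

# predict() is a thresholded view of decision_function(): negative score -> -1
same_as_threshold = np.array_equal(labels == -1, scores < 0)
# decision_function() is score_samples() shifted by the fitted offset_
shift_matches = np.allclose(scores, model.score_samples(X_demo) - model.offset_)
# Share of training points flagged; close to the contamination setting
flagged_fraction = np.mean(labels == -1)

print(same_as_threshold, shift_matches, flagged_fraction)
```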



In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest

# Load the dataset
df = pd.read_csv('data.csv')

# Select the feature for anomaly detection
feature = 'Sales'

# Create the Isolation Forest model
isolation_forest = IsolationForest(n_estimators=100, contamination=0.1, random_state=42)

# Fit the model to the data
isolation_forest.fit(df[[feature]])

# Generate outlier scores and predictions
xx = np.linspace(df[feature].min(), df[feature].max(), len(df)).reshape(-1,1)
anomaly_score = isolation_forest.decision_function(xx)
outlier = isolation_forest.predict(xx)

# Plot the anomaly scores and outliers
plt.figure(figsize=(10,4))
plt.plot(xx, anomaly_score, label='anomaly score')
plt.fill_between(xx.T[0], np.min(anomaly_score), np.max(anomaly_score),
                 where=outlier==-1, color='r',
                 alpha=.4, label='outlier region')
plt.legend()
plt.ylabel('anomaly score')
plt.xlabel('Sales')
plt.show()




3. Select the feature for anomaly detection:
```python
feature = 'Sales'
```
Replace `'Sales'` with the name of the column in your dataset that you want to detect anomalies in.

4. Create the Isolation Forest model:
```python
isolation_forest = IsolationForest(n_estimators=100, contamination=0.1, random_state=42)
```
Adjust the hyperparameters according to your needs. `n_estimators` determines the number of trees in the forest, `contamination` sets the expected proportion of outliers in the dataset (and with it the score threshold used by `predict`), and `random_state` ensures reproducibility.

5. Fit the model to the data:
```python
isolation_forest.fit(df[[feature]])
```
Extract the feature column from the dataset and fit the model to it.

6. Generate outlier scores and predictions:
```python
xx = np.linspace(df[feature].min(), df[feature].max(), len(df)).reshape(-1,1)
anomaly_score = isolation_forest.decision_function(xx)
outlier = isolation_forest.predict(xx)
```
Create a range of values spanning the minimum and maximum of the feature column. Then, use `decision_function` to obtain the anomaly scores for each value and `predict` to classify each value as an outlier (-1) or not (1).

7. Plot the anomaly scores and outliers:
```python
plt.figure(figsize=(10,4))
plt.plot(xx, anomaly_score, label='anomaly score')
plt.fill_between(xx.T[0], np.min(anomaly_score), np.max(anomaly_score),
                 where=outlier==-1, color='r',
                 alpha=.4, label='outlier region')
plt.legend()
plt.ylabel('anomaly score')
plt.xlabel('Sales')
plt.show()
```
Plot the anomaly scores on the y-axis and the feature values on the x-axis. The fill_between function is used to highlight the region where the outliers are detected.

Make sure to replace `'data.csv'` with the actual path or filename of your dataset, and `'Sales'` with the appropriate feature column name. Adjust the hyperparameters of the Isolation Forest model as needed.
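Since `data.csv` is not available here, the grid-scoring step can be tried on a synthetic column. This is only a sketch under stated assumptions: the `Sales` values, the name `df_demo`, and the idea of reading region edges from label flips are illustrative additions, not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Synthetic stand-in for data.csv (assumption): typical sales plus a few extremes
rng = np.random.RandomState(0)
sales = np.concatenate([rng.normal(200, 30, size=195), [900, 950, 1000, 5, 1]])
df_demo = pd.DataFrame({"Sales": sales})

forest = IsolationForest(n_estimators=100, contamination=0.1, random_state=42)
forest.fit(df_demo[["Sales"]].to_numpy())

# Evenly spaced grid across the observed range, shaped (n_points, 1)
xx = np.linspace(df_demo["Sales"].min(), df_demo["Sales"].max(),
                 len(df_demo)).reshape(-1, 1)
grid_score = forest.decision_function(xx)
grid_label = forest.predict(xx)

# Grid positions where the label flips mark the edges of the shaded regions
flip_idx = np.flatnonzero(np.diff(grid_label) != 0)
region_edges = xx[flip_idx, 0]
print(region_edges)
```

Fitting on a NumPy array rather than a DataFrame avoids the feature-name warning that recent scikit-learn versions emit when a model fitted on a DataFrame is later given a plain array such as `xx`.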

In [None]:
isolation_forest = IsolationForest(n_estimators=100, contamination=0.01)
isolation_forest.fit(df['Profit'].values.reshape(-1, 1))
df['anomaly_score_univariate_profit'] = isolation_forest.decision_function(df['Profit'].values.reshape(-1, 1))
df['outlier_univariate_profit'] = isolation_forest.predict(df['Profit'].values.reshape(-1, 1))


df.sort_values('anomaly_score_univariate_profit')

A step-by-step explanation of the code:

1. Create an Isolation Forest model: Initialize an IsolationForest object with `n_estimators=100` (number of trees in the forest) and `contamination=0.01` (expected proportion of outliers in the dataset).

2. Fit the model to the data: Train the Isolation Forest model on the 'Profit' column of the DataFrame. Reshape the data using `values.reshape(-1, 1)` to ensure it is in the correct shape for the model.

3. Generate anomaly scores: Use the `decision_function()` method to calculate the anomaly scores for each data point in the 'Profit' column. The anomaly score represents how different each data point is compared to the rest of the dataset.

4. Generate outlier predictions: Use the `predict()` method to classify each data point as an outlier (-1) or not (1) based on the Isolation Forest model.

5. Sort the DataFrame by anomaly score: Sort the DataFrame in ascending order based on the 'anomaly_score_univariate_profit' column. The rows with the lowest anomaly scores, which are the most anomalous, appear first.

By sorting the DataFrame based on the anomaly scores, you can easily identify the data points that are deemed most anomalous or likely to be outliers in the 'Profit' column. The lower the anomaly score, the more likely a data point is considered an outlier according to the Isolation Forest model.
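The direction of the sort is easy to verify on a small synthetic example. In this sketch the `Profit` values, the planted extreme at row 17, and the shortened column names are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(1)
profit = rng.normal(50, 10, size=200)
profit[17] = 5000.0  # one deliberately extreme value (assumption for the demo)
df_demo = pd.DataFrame({"Profit": profit})

values = df_demo["Profit"].to_numpy().reshape(-1, 1)
forest = IsolationForest(n_estimators=100, contamination=0.01, random_state=0).fit(values)
df_demo["anomaly_score"] = forest.decision_function(values)
df_demo["outlier"] = forest.predict(values)

# Ascending sort puts the most anomalous rows (lowest scores) first
ranked = df_demo.sort_values("anomaly_score")
most_anomalous_index = ranked.index[0]
print(ranked.head(3))
```

The planted value lands at the top of the sorted frame with a negative score and an `outlier` label of -1.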

In [None]:
xx = np.linspace(df['Profit'].min(), df['Profit'].max(), len(df)).reshape(-1,1)
anomaly_score = isolation_forest.decision_function(xx)
outlier = isolation_forest.predict(xx)
plt.figure(figsize=(10,4))
plt.plot(xx, anomaly_score, label='anomaly score')
plt.fill_between(xx.T[0], np.min(anomaly_score), np.max(anomaly_score),
                 where=outlier==-1, color='r',
                 alpha=.4, label='outlier region')
plt.legend()
plt.ylabel('anomaly score')
plt.xlabel('Profit')
plt.show();

An explanation of the code and its purpose:

1. Generate data points for plotting: Using `np.linspace()`, create a series of evenly spaced values between the minimum and maximum values of the 'Profit' column. Reshape the array using `reshape(-1, 1)` to match the expected shape for the decision function.

2. Calculate anomaly scores: Use the `decision_function()` method of the Isolation Forest model to calculate the anomaly score at each grid value in `xx`. This measures how anomalous a 'Profit' value at that position would be relative to the data the model was fitted on.

3. Generate outlier predictions: Use the `predict()` method of the Isolation Forest model to classify each grid value as an outlier (-1) or not (1) based on the model's learned decision boundary.

4. Plot the anomaly scores: Create a figure with a size of (10, 4) for the plot. Plot the anomaly scores on the y-axis and the 'Profit' values on the x-axis. This shows how the anomaly scores vary across the range of 'Profit' values.

5. Visualize the outlier region: Use `fill_between()` to shade in red, between the minimum and maximum anomaly scores, the 'Profit' ranges where the model predicts -1. This highlights the values the Isolation Forest treats as outliers.

6. Add labels and display the plot: Add labels for the legend, y-axis, and x-axis. Finally, display the plot using `plt.show()`.

The purpose of this code is to show how the anomaly score varies across the range of 'Profit' values and which ranges the model classifies as outliers. The red shaded region marks the 'Profit' values that fall on the outlier side of the Isolation Forest's decision boundary, that is, where the anomaly score is negative.

In [None]:
from sklearn.preprocessing import MinMaxScaler

# Scale both features to the [0, 1] range
minmax = MinMaxScaler(feature_range=(0, 1))
X = minmax.fit_transform(df[['Sales','Profit']])

clf = IsolationForest(n_estimators=100, contamination=0.01, random_state=0)
clf.fit(X)

# predict raw anomaly score
df['multivariate_anomaly_score'] = clf.decision_function(X)

# prediction of a datapoint category outlier or inlier
df['multivariate_outlier'] = clf.predict(X)

# plt.get_cmap replaces plt.cm.get_cmap, which newer Matplotlib releases removed
plt.scatter(df['Sales'], df['Profit'],
            c=df.multivariate_outlier, edgecolor='none', alpha=0.5,
            cmap=plt.get_cmap('Paired', 10))
plt.xlabel('Sales')
plt.ylabel('Profit')
plt.colorbar();

This code performs multivariate anomaly detection using Isolation Forest. Here's an explanation of the code and its purpose:

1. Feature Scaling: The 'Sales' and 'Profit' columns from the DataFrame 'df' are selected and passed through a `MinMaxScaler`. This scales the values of each column to between 0 and 1, so that both features have the same range. The scaled features are stored in the variable 'X'.

2. Isolation Forest model initialization: An instance of the Isolation Forest model is created with parameters such as the number of estimators (100), contamination rate (0.01), and random state (0).

3. Model fitting: The Isolation Forest model is fitted on the scaled data 'X' using the `fit()` method.

4. Anomaly Score Calculation: The `decision_function()` method of the Isolation Forest model is used to calculate the anomaly scores for each data point in 'X'. These scores represent the degree of abnormality of each data point.

5. Outlier Prediction: The `predict()` method is applied to 'X' to predict whether each data point is an outlier (-1) or an inlier (1).

6. Scatter Plot Visualization: A scatter plot is created using 'Sales' and 'Profit' as the x and y coordinates, respectively. Each data point is colored based on its predicted outlier status. The 'c' parameter in the scatter plot function is set to 'df.multivariate_outlier' to assign different colors to outliers and inliers. The colorbar is added to provide a color legend for the outliers.

The purpose of this code is to visualize the anomalies detected by the Isolation Forest model in a scatter plot. The scatter plot allows you to visually identify the outliers based on their positions in the 'Sales' and 'Profit' feature space. The outliers are highlighted with different colors, helping to distinguish them from the inliers.
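The scaling and prediction steps can be sanity-checked without plotting. The sketch below uses synthetic 'Sales' and 'Profit' columns (the distributions and the name `df_demo` are assumptions) and confirms that each scaled column spans exactly [0, 1] and that only a small share of rows is flagged at `contamination=0.01`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import MinMaxScaler

# Synthetic two-feature data (assumption)
rng = np.random.RandomState(0)
df_demo = pd.DataFrame({
    "Sales": rng.gamma(shape=2.0, scale=150.0, size=300),
    "Profit": rng.normal(loc=20.0, scale=40.0, size=300),
})

# Rescale each column independently to the [0, 1] interval
X_scaled = MinMaxScaler(feature_range=(0, 1)).fit_transform(df_demo[["Sales", "Profit"]])
col_min, col_max = X_scaled.min(axis=0), X_scaled.max(axis=0)

clf_demo = IsolationForest(n_estimators=100, contamination=0.01, random_state=0).fit(X_scaled)
df_demo["multivariate_outlier"] = clf_demo.predict(X_scaled)
n_flagged = int((df_demo["multivariate_outlier"] == -1).sum())

print(col_min, col_max, n_flagged)
```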

# **k-means clustering**


In [None]:
from sklearn.datasets import make_blobs
X, y_true = make_blobs(n_samples=300, centers=4,
                       cluster_std=0.60, random_state=0)
plt.scatter(X[:, 0], X[:, 1], s=50)

This code generates a synthetic dataset using the `make_blobs` function from the `sklearn.datasets` module. Here's an explanation of the code and its purpose:

1. Dataset Generation: The `make_blobs` function is used to generate a synthetic dataset with 300 samples. The dataset is generated in such a way that it contains four clusters (`centers=4`). Each cluster has a standard deviation of 0.60 (`cluster_std=0.60`), meaning that the points within each cluster are relatively close to each other. The `random_state` parameter is set to 0 to ensure reproducibility of the generated dataset.

2. Data Visualization: The `plt.scatter` function from `matplotlib.pyplot` is used to create a scatter plot of the generated dataset. The `X[:, 0]` and `X[:, 1]` are the x and y coordinates of the data points, respectively. The `s=50` parameter specifies the size of the scatter plot markers.

The purpose of this code is to visualize the synthetic dataset generated using the `make_blobs` function. The scatter plot shows the distribution of the data points in two-dimensional space. Since the dataset contains four clusters, you can observe distinct groupings or clusters of data points in the plot. This synthetic dataset can be used for various purposes, such as testing clustering algorithms or demonstrating data visualization techniques.
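What `make_blobs` returns can be checked numerically as well as visually. A minimal sketch using the same arguments as above (the names `X_demo`, `y_demo`, and `spread` are illustrative):

```python
import numpy as np
from sklearn.datasets import make_blobs

X_demo, y_demo = make_blobs(n_samples=300, centers=4,
                            cluster_std=0.60, random_state=0)

# X holds the coordinates, y the index of the blob each point was drawn from
points_per_blob = np.bincount(y_demo)

# The per-blob spread should sit near the requested cluster_std of 0.60
spread = np.array([X_demo[y_demo == k].std(axis=0).mean() for k in range(4)])

print(X_demo.shape, points_per_blob, spread.round(2))
```

The 300 samples are split evenly, 75 per blob, and each blob is two-dimensional by default.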

In [None]:
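from sklearn.cluster import KMeans
# Fit k-means first (assumed step: 4 clusters, matching centers=4 above)
kmeans = KMeans(n_clusters=4, random_state=0)
y_kmeans = kmeans.fit_predict(X)
plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=50, cmap='viridis')

centers = kmeans.cluster_centers_
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=200, alpha=0.5)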

This code visualizes the results of a clustering algorithm, specifically the K-means algorithm. Here's an explanation of the code and its purpose:

1. Data Visualization: The `plt.scatter` function from `matplotlib.pyplot` is used to create a scatter plot of the data points. The `X[:, 0]` and `X[:, 1]` are the x and y coordinates of the data points, respectively. The `c=y_kmeans` parameter assigns a different color to each data point based on its cluster label obtained from the K-means algorithm. The `s=50` parameter specifies the size of the scatter plot markers. The `cmap='viridis'` parameter sets the color map to 'viridis', which is a perceptually uniform colormap.

2. Cluster Centers Visualization: The `centers` variable holds the coordinates of the cluster centers obtained from the K-means algorithm. The `plt.scatter` function is called again with `centers[:, 0]` and `centers[:, 1]` as the x and y coordinates of the cluster centers, respectively. The `c='black'` parameter sets the color of the cluster centers to black. The `s=200` parameter specifies the size of the scatter plot markers for the cluster centers. The `alpha=0.5` parameter controls the transparency of the cluster center markers.

The purpose of this code is to visualize the clustering results obtained from the K-means algorithm. The scatter plot shows the data points colored according to their assigned cluster labels, and the cluster centers are displayed as distinct markers. This visualization helps in understanding the grouping of data points and the location of the cluster centers in the feature space.
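To make the meaning of the labels and `cluster_centers_` concrete, the sketch below fits a model on the same kind of blob data and checks two properties: every point is labelled with its nearest centre, and `inertia_` equals the summed squared distance to those centres. The hyperparameters (`n_init=10`, `random_state=0`) and the `_demo` names are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances_argmin

X_demo, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)

kmeans_demo = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_demo)
y_demo = kmeans_demo.predict(X_demo)
centers_demo = kmeans_demo.cluster_centers_

# Each label is the index of the nearest centre
nearest = pairwise_distances_argmin(X_demo, centers_demo)

# inertia_ is the total squared distance from points to their assigned centre
manual_inertia = ((X_demo - centers_demo[y_demo]) ** 2).sum()

print(centers_demo.shape, manual_inertia, kmeans_demo.inertia_)
```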

# **The k-Means algorithm**

In [None]:
from sklearn.metrics import pairwise_distances_argmin

def find_clusters(X, n_clusters, rseed=2):
    # 1. Randomly choose initial cluster centers from the data points
    rng = np.random.RandomState(rseed)
    i = rng.permutation(X.shape[0])[:n_clusters]
    centers = X[i]
    print(centers)

    while True:
        # 2a. Assign labels based on closest center
        labels = pairwise_distances_argmin(X, centers)

        # 2b. Find new centers from means of points
        new_centers = np.array([X[labels == i].mean(0)
                                for i in range(n_clusters)])
        print(new_centers)
        # 2c. Check for convergence
        if np.all(centers == new_centers):
            break
        centers = new_centers
        print(centers)

    return centers, labels

centers, labels = find_clusters(X, 4)
plt.scatter(X[:, 0], X[:, 1], c=labels,
            s=50, cmap='viridis')



output:-

[[ 0.27239604  5.46996004]
 [-1.36999388  7.76953035]

This code implements the K-means clustering algorithm to find clusters in a dataset. Here's an explanation of the code and its purpose:

1. Function Definition: The code defines a function `find_clusters` that takes the following parameters:
   - `X`: The input data array of shape (n_samples, n_features).
   - `n_clusters`: The number of clusters to find.
   - `rseed`: The random seed for reproducibility.

2. Random Initialization: The function starts by randomly choosing `n_clusters` data points from the input data `X` as the initial cluster centers. These initial centers are stored in the `centers` variable.

3. Cluster Assignment and Center Update: The function enters a loop that iteratively performs the following steps:
   - 3a. Assign Labels: For each data point in `X`, the function calculates the pairwise distances to the current cluster centers and assigns the label of the closest center to each data point. These labels are stored in the `labels` variable.
   - 3b. Update Centers: The function calculates the mean of the data points assigned to each cluster label and updates the cluster centers accordingly. The new centers are stored in the `new_centers` variable.
   - 3c. Convergence Check: The function checks if the cluster centers have converged by comparing the current centers with the new centers. If they are identical, the loop is terminated.

4. Return Centers and Labels: Once the loop is finished, the function returns the final cluster centers (`centers`) and the labels assigned to each data point (`labels`).

5. Visualization: The code uses `plt.scatter` to create a scatter plot of the data points (`X[:, 0]` and `X[:, 1]`) colored according to their assigned cluster labels (`labels`). The `s=50` parameter sets the size of the scatter plot markers, and the `cmap='viridis'` parameter defines the color map used for coloring the data points.

The purpose of this code is to implement the K-means clustering algorithm and visualize the resulting clusters. It iteratively assigns data points to the nearest cluster center and updates the cluster centers until convergence. The scatter plot allows for a visual representation of the clusters in the feature space.
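The loop described above compares centres with exact equality and has no guard against an empty cluster or a run that never settles. One possible variant is sketched here; the function name `find_clusters_safe`, the `max_iter` cap, and the tolerance check via `np.allclose` are additions for illustration, not part of the original function.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances_argmin

def find_clusters_safe(X, n_clusters, rseed=2, max_iter=100):
    """k-means loop with a tolerance check, an iteration cap and empty-cluster handling."""
    rng = np.random.RandomState(rseed)
    centers = X[rng.permutation(X.shape[0])[:n_clusters]]
    for _ in range(max_iter):
        # Assign each point to its closest centre
        labels = pairwise_distances_argmin(X, centers)
        new_centers = centers.copy()
        for k in range(n_clusters):
            members = X[labels == k]
            if len(members) > 0:  # keep the old centre if a cluster is empty
                new_centers[k] = members.mean(axis=0)
        converged = np.allclose(centers, new_centers)
        centers = new_centers
        if converged:
            break
    # Final assignment so the labels always match the returned centres
    return centers, pairwise_distances_argmin(X, centers)

X_demo, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.60, random_state=0)
centers_demo, labels_demo = find_clusters_safe(X_demo, 4)
print(centers_demo)
```

Like the original, this variant still depends on the random initial centres, so different `rseed` values can lead to different final clusters.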

# **Selecting the number of clusters with silhouette analysis on KMeans clustering**

Silhouette analysis can be used to study the separation distance between the resulting clusters. The silhouette plot displays a measure of how close each point in one cluster is to points in the neighboring clusters, and thus provides a way to assess parameters such as the number of clusters visually. This measure has a range of [-1, 1].
Silhouette coefficients (as these values are called) near +1 indicate that the sample is far away from the neighboring clusters. A value of 0 indicates that the sample is on or very close to the decision boundary between two neighboring clusters, and negative values indicate that the sample might have been assigned to the wrong cluster.
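For a single sample, the coefficient is s = (b - a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the mean distance to the points of the nearest other cluster. The sketch below computes this by hand on six made-up points (the data and the helper name `silhouette_by_hand` are assumptions) and compares the result with `silhouette_samples`.

```python
import numpy as np
from sklearn.metrics import pairwise_distances, silhouette_samples

# Two small, well-separated groups (made-up data)
X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
labels_demo = np.array([0, 0, 0, 1, 1, 1])

def silhouette_by_hand(X, labels):
    D = pairwise_distances(X)
    scores = np.zeros(len(labels))
    for i, own in enumerate(labels):
        same = labels == own
        same[i] = False  # exclude the point itself from its own cluster
        a = D[i, same].mean()
        # mean distance to each other cluster; keep the nearest one
        b = min(D[i, labels == other].mean()
                for other in np.unique(labels) if other != own)
        scores[i] = (b - a) / max(a, b)
    return scores

manual = silhouette_by_hand(X_demo, labels_demo)
reference = silhouette_samples(X_demo, labels_demo)
print(manual.round(3))
```

All six coefficients come out well above 0, as expected for two compact groups that sit far apart.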

In [None]:
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

import matplotlib.pyplot as plt
import matplotlib.cm as cm
import numpy as np

print(__doc__)

# Generating the sample data from make_blobs
# This particular setting has one distinct cluster and 3 clusters placed close
# together.
X, y = make_blobs(n_samples=500,
                  n_features=2,
                  centers=4,
                  cluster_std=1,
                  center_box=(-10.0, 10.0),
                  shuffle=True,
                  random_state=1)  # For reproducibility

range_n_clusters = [2, 3, 4, 5, 6]

for n_clusters in range_n_clusters:
    # Create a subplot with 1 row and 2 columns
    fig, (ax1, ax2) = plt.subplots(1, 2)
    fig.set_size_inches(18, 7)
    # The 1st subplot is the silhouette plot
    # The silhouette coefficient can range from -1, 1 but in this example all
    # lie within [-0.1, 1]
    ax1.set_xlim([-0.1, 1])
    # The (n_clusters+1)*10 is for inserting blank space between silhouette
    # plots of individual clusters, to demarcate them clearly.
    ax1.set_ylim([0, len(X) + (n_clusters + 1) * 10])

    # Initialize the clusterer with n_clusters value and a random generator
    # seed of 10 for reproducibility.
    clusterer = KMeans(n_clusters=n_clusters, random_state=10)
    cluster_labels = clusterer.fit_predict(X)

    # The silhouette_score gives the average value for all the samples.
    # This gives a perspective into the density and separation of the formed
    # clusters
    silhouette_avg = silhouette_score(X, cluster_labels)
    print("For n_clusters =", n_clusters,
          "The average silhouette_score is :", silhouette_avg)

    # Compute the silhouette scores for each sample
    sample_silhouette_values = silhouette_samples(X, cluster_labels)
    y_lower = 10
    for i in range(n_clusters):
        # Aggregate the silhouette scores for samples belonging to
        # cluster i, and sort them
        ith_cluster_silhouette_values = \
            sample_silhouette_values[cluster_labels == i]

        ith_cluster_silhouette_values.sort()

        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i

        color = cm.nipy_spectral(float(i) / n_clusters)
        ax1.fill_betweenx(np.arange(y_lower, y_upper),
                          0, ith_cluster_silhouette_values,
                          facecolor=color, edgecolor=color, alpha=0.7)

        # Label the silhouette plots with their cluster numbers at the middle
        ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))

        # Compute the new y_lower for next plot
        y_lower = y_upper + 10  # 10 for the 0 samples

    ax1.set_title("The silhouette plot for the various clusters.")
    ax1.set_xlabel("The silhouette coefficient values")
    ax1.set_ylabel("Cluster label")
    # The vertical line for average silhouette score of all the values
    ax1.axvline(x=silhouette_avg, color="red", linestyle="--")

    ax1.set_yticks([])  # Clear the yaxis labels / ticks
    ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])

    # 2nd Plot showing the actual clusters formed
    colors = cm.nipy_spectral(cluster_labels.astype(float) / n_clusters)
    ax2.scatter(X[:, 0], X[:, 1], marker='.', s=30, lw=0, alpha=0.7,
                c=colors, edgecolor='k')

    # Labeling the clusters
    centers = clusterer.cluster_centers_
    # Draw white circles at cluster centers
    ax2.scatter(centers[:, 0], centers[:, 1], marker='o',
                c="white", alpha=1, s=200, edgecolor='k')

    for i, c in enumerate(centers):
        ax2.scatter(c[0], c[1], marker='$%d$' % i, alpha=1,
                    s=50, edgecolor='k')

    ax2.set_title("The visualization of the clustered data.")
    ax2.set_xlabel("Feature space for the 1st feature")
    ax2.set_ylabel("Feature space for the 2nd feature")
    plt.suptitle(("Silhouette analysis for KMeans clustering on sample data "
                  "with n_clusters = %d" % n_clusters),
                 fontsize=14, fontweight='bold')

plt.show()





output:-

Automatically created module for IPython interactive environment
For n_clusters = 2 The average silhouette_score is : 0.7049787496083262
For n_clusters = 3 The average silhouette_score is : 0.5882004012129721

A stepwise explanation of the code:

1. Import Libraries: The code imports the necessary libraries, including `make_blobs` from `sklearn.datasets` for generating sample data, `KMeans` from `sklearn.cluster` for performing KMeans clustering, and `silhouette_samples` and `silhouette_score` from `sklearn.metrics` for computing silhouette scores. It also imports `matplotlib.pyplot` and `matplotlib.cm` for visualizations.

2. Generate Sample Data: The code uses the `make_blobs` function to generate sample data. It creates a dataset with 500 samples, 2 features, 4 centers (clusters), a standard deviation of 1, and a range of (-10.0, 10.0) for center placement. The generated data is stored in `X`, and the corresponding labels are stored in `y`.

3. Define Range of Clusters: The code defines the list `range_n_clusters` with values [2, 3, 4, 5, 6]. These are the candidate numbers of clusters to evaluate.

4. Iterate over Clusters: The code loops over each value in `range_n_clusters`. For each `n_clusters` value, it performs steps 5 to 15:

5. Create Subplots: The code creates a figure with two subplots, `ax1` and `ax2`, for displaying the silhouette plot and the scatter plot of the clustered data, respectively. The figure size is set to (18, 7).

6. Set Silhouette Plot Parameters: For the silhouette plot (`ax1`), the code sets the x-axis limit to [-0.1, 1] and the y-axis limit to accommodate all samples and clusters. These parameters ensure proper visualization of the silhouette coefficients.

7. Initialize KMeans Clustering: The code initializes the `KMeans` clustering algorithm with the current `n_clusters` value and a random state of 10 for reproducibility.

8. Perform KMeans Clustering: The code fits the KMeans model to the sample data `X` and obtains the cluster labels for each data point.

9. Compute Silhouette Score: The code computes the average silhouette score using the `silhouette_score` function. This score provides insight into the density and separation of the formed clusters.

10. Compute Silhouette Values: The code computes the silhouette scores for each sample using the `silhouette_samples` function. These scores represent the silhouette coefficient for each individual sample.

11. Visualize Silhouette Plot: The code visualizes the silhouette plot (`ax1`) by filling the area between silhouette scores for each cluster with different colors. It also labels each silhouette plot with the cluster number and includes a vertical line indicating the average silhouette score.

12. Visualize Scatter Plot: The code visualizes the scatter plot (`ax2`) of the clustered data points. It assigns different colors to each cluster and displays the cluster centers as white circles.

13. Add Labels to Cluster Centers: The code labels the cluster centers with their respective cluster numbers.

14. Set Titles and Labels: The code sets the titles, x-axis labels, and y-axis labels for both subplots.

15. Add a Figure Title: The code sets a bold overall title with `plt.suptitle` that names the current `n_clusters` value.

16. Show the Plots: After the loop finishes, `plt.show()` displays the figures created for every `n_clusters` value.

Overall Purpose: The code performs silhouette analysis for KMeans clustering on sample data to determine the optimal number of clusters. It visualizes the silhouette coefficients and the scatter plot of the clustered data, providing insights into the quality and separation of the clusters. The analysis helps in understanding the effectiveness of the clustering algorithm and selecting the appropriate number of clusters for the given dataset.
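When only the averages are needed, the plotting can be left out. The following sketch reuses the same `make_blobs` settings and collects the average silhouette score per candidate; the dictionary `avg_scores`, the explicit `n_init=10`, and picking the maximum as `best_k` are illustrative choices, not the notebook's method.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X_demo, _ = make_blobs(n_samples=500, n_features=2, centers=4, cluster_std=1,
                       center_box=(-10.0, 10.0), shuffle=True, random_state=1)

avg_scores = {}
for k in [2, 3, 4, 5, 6]:
    labels_k = KMeans(n_clusters=k, n_init=10, random_state=10).fit_predict(X_demo)
    avg_scores[k] = silhouette_score(X_demo, labels_k)

for k, s in avg_scores.items():
    print(f"n_clusters = {k}: average silhouette = {s:.3f}")

# Candidate with the highest average score
best_k = max(avg_scores, key=avg_scores.get)
```

The average alone can favour a coarse split: the recorded output above gives about 0.705 for 2 clusters against 0.588 for 3, so the per-cluster silhouette plots remain worth checking before settling on a value.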

In [None]:
from sklearn.datasets import make_moons
X, y = make_moons(200, noise=.05, random_state=0)
plt.scatter(X[:, 0], X[:, 1],c=y,s=50, cmap='viridis')

This code generates a dataset using the `make_moons` function from `sklearn.datasets`. Here's a step-by-step explanation of the code:

1. Import Libraries: The code imports `make_moons` from `sklearn.datasets` for generating moon-shaped data. `matplotlib.pyplot` (as `plt`) is already imported in an earlier cell.

2. Generate Moon-Shaped Data: The `make_moons` function is used to generate a dataset with 200 samples. The `noise` parameter controls the amount of random noise added to the data points, and the `random_state` parameter ensures reproducibility of the generated data. The generated data is stored in `X`, and the corresponding labels are stored in `y`.

3. Plot the Data: The `plt.scatter` function is used to create a scatter plot of the data points. The `X[:, 0]` and `X[:, 1]` provide the x and y coordinates of the points, respectively. The `c=y` parameter assigns different colors to the points based on their labels. The `s=50` parameter sets the size of the markers, and the `cmap='viridis'` parameter selects the color map for the plot.

Overall Purpose: The code generates and visualizes a moon-shaped dataset using the `make_moons` function. It helps in understanding and exploring nonlinear data patterns that cannot be captured by simple linear models.

# **SpectralClustering**

In [None]:
from sklearn.cluster import SpectralClustering
model = SpectralClustering(n_clusters=2, affinity='nearest_neighbors',
                           assign_labels='kmeans')
labels = model.fit_predict(X)
plt.scatter(X[:, 0], X[:, 1], c=labels,
            s=50, cmap='viridis')

This code performs spectral clustering on the given dataset `X` using the `SpectralClustering` algorithm from `sklearn.cluster`. Here's a step-by-step explanation of the code:

1. Import Libraries: The code imports `SpectralClustering` from `sklearn.cluster` for performing spectral clustering. `matplotlib.pyplot` (as `plt`) is already imported in an earlier cell.

2. Perform Spectral Clustering: The `SpectralClustering` class is used to create a spectral clustering model. The `n_clusters` parameter specifies the number of clusters to identify. The `affinity` parameter determines the similarity measure between samples, which is set to `'nearest_neighbors'` to use the nearest neighbors graph. The `assign_labels` parameter specifies the strategy for assigning labels to the samples, which is set to `'kmeans'` to use the k-means algorithm.

3. Fit and Predict: The `fit_predict` method of the model is used to perform clustering on the dataset `X` and obtain the predicted labels for each sample. The labels are stored in the variable `labels`.

4. Plot the Clusters: The `plt.scatter` function is used to create a scatter plot of the data points. The `X[:, 0]` and `X[:, 1]` provide the x and y coordinates of the points, respectively. The `c=labels` parameter assigns different colors to the points based on their cluster labels. The `s=50` parameter sets the size of the markers, and the `cmap='viridis'` parameter selects the color map for the plot.

Overall Purpose: The code performs spectral clustering on the given dataset and visualizes the resulting clusters. Spectral clustering builds a similarity graph over the samples, embeds them using eigenvectors of the graph Laplacian, and then clusters the points in that embedding. Because it works from graph connectivity rather than distance to a centroid, it can identify non-convex clusters such as the two moons.
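
As a rough check of the non-convex claim, the sketch below compares spectral clustering with k-means on moon-shaped data, scoring both against the true labels with the adjusted Rand index. The dataset parameters (`noise=0.05`, `random_state=0`), the variable names, and the use of `adjusted_rand_score` are assumptions introduced for this example.

```python
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Assumed dataset parameters, chosen only for illustration.
X_moons, y_moons = make_moons(n_samples=200, noise=0.05, random_state=0)

# Same settings as the cell above, with a fixed seed for repeatability.
spectral = SpectralClustering(n_clusters=2, affinity='nearest_neighbors',
                              assign_labels='kmeans', random_state=0)
spectral_labels = spectral.fit_predict(X_moons)

# k-means assumes roughly convex clusters around a centroid.
kmeans_labels = KMeans(n_clusters=2, n_init=10,
                       random_state=0).fit_predict(X_moons)

# Adjusted Rand index: 1.0 means perfect agreement with the true labels.
ari_spectral = adjusted_rand_score(y_moons, spectral_labels)
ari_kmeans = adjusted_rand_score(y_moons, kmeans_labels)
print(f"spectral ARI: {ari_spectral:.2f}, k-means ARI: {ari_kmeans:.2f}")
```

The adjusted Rand index ignores how cluster ids are numbered, so it is a convenient way to compare the two labelings without matching label 0 to label 0 by hand.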

In [None]:
from sklearn.cluster import AgglomerativeClustering
model = AgglomerativeClustering(distance_threshold=0, n_clusters=None)

model = model.fit(X)

# Number of clusters
model.n_clusters_
# Output: 50

# Distances between clusters
distances = model.distances_
distances.min()
# Output: 0.09999999999999964
distances.max()
# Output: 3.828052620290243

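from scipy.cluster import hierarchy
import matplotlib.pyplot as plt

# Compute the linkage matrix from the data itself. `model.children_` is a
# merge tree, not a set of observations, so it must not be passed to `linkage`.
Z = hierarchy.linkage(X, 'ward')

# Plot the dendrogram on a wide figure so the leaves stay readable.
plt.figure(figsize=(20,10))
dn = hierarchy.dendrogram(Z)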

Explanation:

1. Number of Clusters: The code `model.n_clusters_` returns the number of clusters found by the agglomerative clustering algorithm. Because `distance_threshold=0` and `n_clusters=None`, the full merge tree is computed but no merge falls below the threshold, so every sample remains its own cluster. The value 50 therefore equals the number of samples in the fitted data.

2. Distances between Clusters: The code `model.distances_` returns the linkage distance at which each pair of clusters was merged, one entry per merge. The `min()` and `max()` methods give the smallest and largest of these merge distances. In this example, the smallest merge distance is about 0.1, and the largest is about 3.83.

3. Dendrogram: The code uses the `linkage` function from `scipy.cluster.hierarchy` to compute the hierarchical clustering with Ward linkage and generate a linkage matrix `Z`. Each row of `Z` records one merge: the two clusters joined, the distance between them, and the size of the new cluster.

4. Plotting the Dendrogram: The `plt.figure` function is called to create a figure with a specified size. The `hierarchy.dendrogram` function is used to plot the dendrogram based on the linkage matrix `Z`. The resulting dendrogram visualizes the hierarchical clustering structure, with clusters represented as branches and the height of each branch indicating the distance between the merged clusters.

Purpose:
The purpose of this code is to explore the results of agglomerative clustering further by examining the number of clusters, distances between clusters, and generating a dendrogram. The number of clusters provides insight into the granularity of the clustering solution. The distances between clusters highlight the range of similarity or dissimilarity between clusters. The dendrogram allows for a visual representation of the hierarchical structure of the clusters, showing how they are linked and the distances at which they merge. These additional analyses provide a deeper understanding of the clustering results and can be useful in determining the optimal number of clusters or exploring cluster relationships.
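
The fitted `AgglomerativeClustering` model already holds everything needed for a dendrogram, so the hierarchy does not have to be recomputed. The sketch below assembles a SciPy-style linkage matrix from `children_` and `distances_`. The 50-sample dataset `X_demo` is a stand-in created for this example, and `no_plot=True` is used only so the block runs without a display.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Hypothetical stand-in data: 50 samples, matching the cluster count above.
X_demo, _ = make_blobs(n_samples=50, centers=3, random_state=0)
model = AgglomerativeClustering(distance_threshold=0,
                                n_clusters=None).fit(X_demo)

# Count how many original samples sit under each merge node.
n_samples = len(model.labels_)
counts = np.zeros(model.children_.shape[0])
for i, merge in enumerate(model.children_):
    for child_idx in merge:
        if child_idx < n_samples:
            counts[i] += 1                              # leaf: one sample
        else:
            counts[i] += counts[child_idx - n_samples]  # earlier merge

# Linkage matrix columns: child 1, child 2, merge distance, cluster size.
Z = np.column_stack([model.children_, model.distances_,
                     counts]).astype(float)

# no_plot=True returns the layout only; drop it to draw with matplotlib.
info = dendrogram(Z, no_plot=True)
print(Z.shape, len(info['leaves']))
```

Building `Z` this way keeps the dendrogram tied to the model that was actually fitted, including its linkage setting.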

In [None]:
#We can now create a DBSCAN object and fit the data:

from sklearn.cluster import DBSCAN
db = DBSCAN(eps=0.4, min_samples=20)
db.fit(X)

y_pred = db.fit_predict(X)
plt.figure(figsize=(10,6))
plt.scatter(X[:,0], X[:,1],c=y_pred, cmap='Paired')
plt.title("Clusters determined by DBSCAN")


Explanation:

1. DBSCAN Clustering: The code creates an instance of the DBSCAN class with specified parameters. `eps` is the maximum distance between two samples for one to be considered in the neighborhood of the other, and `min_samples` is the number of samples (including the point itself) that must lie within a point's neighborhood for it to count as a core point.

2. Fitting the Data: The `fit` method is called on the DBSCAN object to fit the model to the data. This process identifies clusters as regions of densely packed core points and stores the result in `db.labels_`.

3. Predicting Cluster Labels: The `fit_predict` method fits the model again and returns the cluster label of each data point, which is stored in `y_pred`. Since `fit` has already been called, this second fit is redundant: `db.labels_` holds the same labels. Points that belong to no cluster are treated as noise and labeled `-1`.

4. Plotting the Clusters: The code uses the `plt.scatter` function to create a scatter plot of the data points. The `c` parameter is set to `y_pred`, which assigns different colors to data points based on their cluster labels. The `cmap` parameter specifies the colormap used for coloring the points. The resulting plot shows the clusters determined by the DBSCAN algorithm.

Purpose:
The purpose of this code is to demonstrate how to perform DBSCAN clustering on a dataset. DBSCAN is a density-based clustering algorithm that can discover clusters of arbitrary shape. The code fits the DBSCAN model to the data, assigns cluster labels to each data point, and visualizes the clusters in a scatter plot. This allows for an understanding of the clustering results and the identification of distinct groups within the dataset.
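
To inspect the result beyond the plot, the following sketch counts the clusters and noise points that DBSCAN reports. It regenerates the blob dataset with the same parameters used elsewhere in these notes so that it runs on its own; the names `X_demo`, `n_clusters`, `n_noise`, and `core_mask` are introduced here for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Recreate the standardized three-blob dataset so the block is self-contained.
centers = [[0.5, 2], [-1, -1], [1.5, -1]]
X_demo, _ = make_blobs(n_samples=400, centers=centers,
                       cluster_std=0.5, random_state=0)
X_demo = StandardScaler().fit_transform(X_demo)

# A single fit is enough; the labels are available as an attribute.
db = DBSCAN(eps=0.4, min_samples=20).fit(X_demo)
labels = db.labels_

# Noise points carry the label -1 and are not a cluster of their own.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))

# Boolean mask of core points (dense enough to seed a cluster).
core_mask = np.zeros(labels.shape, dtype=bool)
core_mask[db.core_sample_indices_] = True

print(f"clusters: {n_clusters}, noise points: {n_noise}, "
      f"core points: {int(core_mask.sum())}")
```

Checking the noise count is a quick way to judge whether `eps` and `min_samples` suit the data: many noise points usually suggest that `eps` is too small or `min_samples` too large.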

In [None]:
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
%matplotlib inline

#Determine centroids
centers = [[0.5, 2], [-1, -1], [1.5, -1]]

#Create dataset
X, y = make_blobs(n_samples=400, centers=centers,
                  cluster_std=0.5, random_state=0)

#Standardize the features (zero mean, unit variance)
X = StandardScaler().fit_transform(X)
X.shape

# We can also plot the dataset to see how each cluster looks:

plt.figure(figsize=(10,6))
plt.scatter(X[:,0], X[:,1], c=y, cmap='Paired')

Explanation:

1. Generating Data: The code uses the `make_blobs` function from scikit-learn to generate synthetic data. The `centers` parameter specifies the coordinates of the cluster centers, and the `cluster_std` parameter controls the standard deviation of the clusters. The `n_samples` parameter determines the number of data points generated.

2. Normalizing the Data: The `StandardScaler` class from scikit-learn is used to standardize the features of the dataset, rescaling each one to zero mean and unit variance. This puts all features on the same scale, which matters for distance-based algorithms such as DBSCAN, where `eps` is measured in the units of the features.

3. Scatter Plot: The code uses `plt.scatter` to create a scatter plot of the dataset. The `X[:,0]` and `X[:,1]` indexing selects the first and second columns of the feature matrix `X`, representing the x and y coordinates of the data points. The `c` parameter is set to `y`, which assigns different colors to data points based on their corresponding cluster labels. The `cmap` parameter specifies the colormap used for coloring the points.

Purpose:
The purpose of this code is to generate a synthetic dataset with three distinct clusters and visualize the data points in a scatter plot. The clusters are defined by their centers and standard deviation. The data is then normalized using `StandardScaler` to ensure consistent scaling across features. The scatter plot allows for a visual inspection of the data and helps in understanding the structure and separation of the clusters.
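
The effect of `StandardScaler` can be verified directly. This sketch regenerates the same blobs and compares the scaler's output with a manual computation; the names `X_raw`, `X_scaled`, and `X_manual` are introduced here for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

centers = [[0.5, 2], [-1, -1], [1.5, -1]]
X_raw, y = make_blobs(n_samples=400, centers=centers,
                      cluster_std=0.5, random_state=0)

# Scale with scikit-learn.
X_scaled = StandardScaler().fit_transform(X_raw)

# Same transformation by hand: subtract the column mean and divide by the
# column standard deviation (population form, ddof=0).
X_manual = (X_raw - X_raw.mean(axis=0)) / X_raw.std(axis=0)

print("column means:", X_scaled.mean(axis=0).round(6))
print("column stds: ", X_scaled.std(axis=0).round(6))
print("matches manual computation:", np.allclose(X_scaled, X_manual))
```

Standardization shifts and rescales each axis independently, so the relative arrangement of the three clusters is preserved while both features end up on a comparable scale.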