<a href="https://www.bigdatauniversity.com"><img src = "https://ibm.box.com/shared/static/cw2c7r3o20w9zn8gkecaeyjhgw3xdgbj.png" width = 400, align = "center"></a>
# <center>Density-Based Clustering</center>

Most traditional clustering techniques, such as k-means, hierarchical, and fuzzy clustering, can be used to group data without supervision.

However, when applied to tasks with arbitrarily shaped clusters, or clusters within clusters, these techniques might not achieve good results: elements in the same cluster might not share enough similarity, or the performance may be poor.
In contrast, density-based clustering locates regions of high density that are separated from one another by regions of low density. Density, in this context, is defined as the number of points within a specified radius.
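
To make the density definition concrete, here is a minimal sketch that counts, for each point, how many points lie within a given radius. The toy coordinates, the helper name `neighbor_counts`, and the values `eps=0.5` and `min_samples=3` are assumptions chosen for illustration, not values from this notebook.

```python
from math import dist

def neighbor_counts(points, eps):
    """For each point, count the points (itself included) within distance eps."""
    return [sum(1 for q in points if dist(p, q) <= eps) for p in points]

# Hypothetical toy data: a tight group of four points and one isolated point.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
counts = neighbor_counts(points, eps=0.5)

# A point sits in a dense region if its neighborhood holds at least min_samples points.
min_samples = 3  # illustrative threshold
is_dense = [c >= min_samples for c in counts]

print(counts)    # [4, 4, 4, 4, 1]
print(is_dense)  # [True, True, True, True, False]
```

The isolated point has only itself in its neighborhood, so it falls in a low-density region.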



In this section, the main focus will be manipulating the data and properties of DBSCAN and observing the resulting clustering.

Import the following libraries:
<ul>
    <li> <b>numpy as np</b> </li>
    <li> <b>DBSCAN</b> from <b>sklearn.cluster</b> </li>
    <li> <b>make_blobs</b> from <b>sklearn.datasets</b> </li>
    <li> <b>StandardScaler</b> from <b>sklearn.preprocessing</b> </li>
    <li> <b>matplotlib.pyplot as plt</b> </li>
</ul> <br>
Remember <b> %matplotlib inline </b> to display plots

In [None]:
# Note: visualizing the map requires the basemap package.
# If basemap is not installed on your machine, you can use the following line to install it:
# !conda install -c conda-forge  basemap==1.1.0  matplotlib==2.2.2  -y
# Note: you might have to refresh your page and re-run the notebook after installation.

In [2]:
import csv
import pandas as pd
import numpy as np

filename='dataset_hotel_bookings.csv'

#Read csv
pdf = pd.read_csv(filename)
pdf.head(5)



| Unnamed: 0 | hotel | is_canceled | lead_time | arrival_date_year | arrival_date_month | arrival_date_week_number | arrival_date_day_of_month | stays_in_weekend_nights | stays_in_week_nights | adults | ... | days_in_waiting_list | customer_type | adr | required_car_parking_spaces | total_of_special_requests | reservation_status | reservation_status_date | var_r | filter_$ | arrival_month |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 342 | 2015 | 7 | 27 | 1 | 0 | 0 | 2 | ... | 0 | Transient | 0.0 | 0 | 0 | Check-Out | 2015-07-01 | 491 | 0 | 7 |
| 1 | 1 | 0 | 737 | 2015 | 7 | 27 | 1 | 0 | 0 | 2 | ... | 0 | Transient | 0.0 | 0 | 0 | Check-Out | 2015-07-01 | 737 | 0 | 7 |
| 2 | 1 | 0 | 7 | 2015 | 7 | 27 | 1 | 0 | 1 | 1 | ... | 0 | Transient | 75.0 | 0 | 0 | Check-Out | 2015-07-02 | 245 | 0 | 7 |
| 3 | 1 | 0 | 13 | 2015 | 7 | 27 | 1 | 0 | 1 | 1 | ... | 0 | Transient | 75.0 | 0 | 0 | Check-Out | 2015-07-02 | 245 | 0 | 7 |
| 4 | 1 | 0 | 14 | 2015 | 7 | 27 | 1 | 0 | 2 | 2 | ... | 0 | Transient | 98.0 | 0 | 1 | Check-Out | 2015-07-03 | 245 | 0 | 7 |


### 4-Visualization
Visualization of stations on a map using the basemap package. The matplotlib basemap toolkit is a library for plotting 2D data on maps in Python. Basemap does not do any plotting on its own, but provides the facilities to transform coordinates to map projections.

Please note that the size of each data point represents the average maximum temperature for each station in a year.
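
As a rough illustration of what "transforming coordinates to a map projection" means, the sketch below applies the textbook spherical Mercator formulas to the corner coordinates used later in this notebook. It is not Basemap's implementation: the function name `mercator_xy`, the sphere radius, and the central meridian `lon0_deg` are assumptions, and Basemap's own output will differ by offsets and projection parameters.

```python
import math

EARTH_RADIUS_M = 6370997.0  # assumed sphere radius, in metres

def mercator_xy(lon_deg, lat_deg, lon0_deg=0.0, radius=EARTH_RADIUS_M):
    """Project longitude/latitude (degrees) to spherical Mercator x, y (metres)."""
    lam = math.radians(lon_deg - lon0_deg)  # longitude relative to the central meridian
    phi = math.radians(lat_deg)
    x = radius * lam
    y = radius * math.log(math.tan(math.pi / 4 + phi / 2))
    return x, y

x0, y0 = mercator_xy(0.0, 0.0)      # equator at the central meridian maps to the origin
x1, y1 = mercator_xy(-140.0, 40.0)  # lower-left corner used in this notebook
x2, y2 = mercator_xy(-50.0, 65.0)   # upper-right corner used in this notebook

print(x1, y1)
print(x2, y2)
```

Longitude maps linearly to x, while y grows faster than latitude, which is why high-latitude regions look stretched on a Mercator map.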

In [4]:
# Importing necessary libraries
from mpl_toolkits.basemap import Basemap  # Basemap toolkit for rendering geographic maps
import matplotlib.pyplot as plt  # Matplotlib's pyplot for plotting
from pylab import rcParams  # PyLab's rcParams for setting figure properties
# Magic command for Jupyter Notebook to display plots inline
# (kept on its own line, because a line magic receives the rest of its line as arguments)
%matplotlib inline

# Setting the size of the figure for plots
rcParams['figure.figsize'] = (14,10)  # Set the default figure size to 14x10 inches

# Defining geographical boundaries for the map
llon = -140  # Lower longitude boundary
ulon = -50   # Upper longitude boundary
llat = 40    # Lower latitude boundary
ulat = 65    # Upper latitude boundary

# Selecting specific columns from the DataFrame for analysis
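# NOTE: the projection step below reads `Long` and `Lat`, and later cells read `Stn_Name`, `Tx`, `Tm`, and `Tn`.
# None of these columns are kept by this selection, so those steps fail unless the selection is extended.
# .copy() avoids pandas' SettingWithCopyWarning when 'xm' and 'ym' are added below.
pdf = pdf[['lead_time', 'arrival_date_month', 'stays_in_weekend_nights', 'stays_in_week_nights']].copy()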

# Creating a Basemap instance
my_map = Basemap(projection='merc',  # Mercator projection
                 resolution='l',     # Low resolution
                 area_thresh=1000.0, # Minimum area threshold in square kilometers for displaying features
                 llcrnrlon=llon, llcrnrlat=llat,  # Lower-left corner longitude and latitude
                 urcrnrlon=ulon, urcrnrlat=ulat)  # Upper-right corner longitude and latitude

# Drawing map elements
my_map.drawcoastlines()  # Draw coastlines
my_map.drawcountries()   # Draw country boundaries
# my_map.drawmapboundary()  # Uncomment to draw the map boundary
my_map.fillcontinents(color='white', alpha=0.3)  # Fill continents with white color and some transparency
my_map.shadedrelief()    # Add shaded relief to the map for a 3D effect

# Projecting the longitude and latitude data to the map's coordinate system
xs, ys = my_map(np.asarray(pdf.Long), np.asarray(pdf.Lat))  # Convert longitude and latitude to map coordinates
pdf['xm'] = xs.tolist()  # Store the projected x-coordinates in the DataFrame
pdf['ym'] = ys.tolist()  # Store the projected y-coordinates in the DataFrame

# Visualization: Plotting data points on the map
for index, row in pdf.iterrows():  # Iterate through each row in the DataFrame
    my_map.plot(row.xm, row.ym, markerfacecolor=([1,0,0]), marker='o', markersize=5, alpha=0.75)  # Plot each data point

# Display the plot
plt.show()  # Show the final plot


ModuleNotFoundError: No module named 'mpl_toolkits.basemap'

### 5- Clustering of stations based on their location i.e. Lat & Lon

__DBSCAN__ from the sklearn library runs DBSCAN clustering on a vector array or a distance matrix. In our case, we pass it the NumPy array Clus_dataSet; it finds core samples of high density and expands clusters from them.
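
The idea of "finding core samples and expanding clusters from them" can be sketched in plain Python as follows. This is a simplified, quadratic-time illustration on hypothetical toy points, not scikit-learn's implementation; the function name `dbscan_sketch` and the parameter values are assumptions.

```python
from math import dist

NOISE = -1

def dbscan_sketch(points, eps, min_samples):
    """Label each point with a cluster id (0, 1, ...) or NOISE (-1)."""
    n = len(points)
    # Neighborhoods include the point itself.
    neighbors = [[j for j in range(n) if dist(points[i], points[j]) <= eps]
                 for i in range(n)]
    is_core = [len(nb) >= min_samples for nb in neighbors]

    labels = [NOISE] * n
    cluster_id = 0
    for i in range(n):
        if labels[i] != NOISE or not is_core[i]:
            continue  # already assigned, or not a core point to start from
        # Grow a new cluster outward from core point i.
        labels[i] = cluster_id
        frontier = [i]
        while frontier:
            p = frontier.pop()
            if not is_core[p]:
                continue  # border points join a cluster but do not extend it
            for q in neighbors[p]:
                if labels[q] == NOISE:
                    labels[q] = cluster_id
                    frontier.append(q)
        cluster_id += 1
    return labels

# Hypothetical toy data: two square groups and one far-away outlier.
toy_points = [(0, 0), (0, 1), (1, 0), (1, 1),
              (10, 10), (10, 11), (11, 10), (11, 11),
              (50, 50)]
toy_labels = dbscan_sketch(toy_points, eps=1.5, min_samples=3)
print(toy_labels)  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

Points that are never reached from any core point keep the label -1, which is how noise ends up with its own label.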

In [None]:
# Importing necessary libraries for clustering
from sklearn.cluster import DBSCAN  # DBSCAN clustering algorithm
import sklearn.utils  # General utility functions from scikit-learn
from sklearn.preprocessing import StandardScaler  # StandardScaler for normalization

# DBSCAN takes no random_state and is deterministic for a fixed input order,
# so this call (which only returns a RandomState object) does not affect the result
sklearn.utils.check_random_state(1000)

# Preparing the data for clustering
Clus_dataSet = pdf[['xm', 'ym']]  # Selecting only the 'xm' and 'ym' columns for clustering
Clus_dataSet = np.nan_to_num(Clus_dataSet)  # Replacing NaN with 0.0 and infinities with large finite values
Clus_dataSet = StandardScaler().fit_transform(Clus_dataSet)  # Standardizing each feature to zero mean and unit variance

# Compute DBSCAN
db = DBSCAN(eps=0.15, min_samples=10).fit(Clus_dataSet)  # Applying DBSCAN algorithm to the normalized data
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)  # Initializing a mask for core samples
core_samples_mask[db.core_sample_indices_] = True  # Marking core samples in the mask
labels = db.labels_  # Extracting the labels assigned by the DBSCAN algorithm
pdf["Clus_Db"] = labels  # Adding the cluster labels to the original DataFrame

# Calculating the number of clusters
realClusterNum = len(set(labels)) - (1 if -1 in labels else 0)  # The number of actual clusters, excluding noise points
clusterNum = len(set(labels))  # Number of distinct labels, counting the noise label (-1) as one if present

# Display a sample of the clustered data
pdf[["Stn_Name", "Tx", "Tm", "Clus_Db"]].head(5)  # Displaying the first 5 rows with cluster labels


As you can see, outliers receive the cluster label -1.

In [None]:
set(labels)
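
A small sketch of how a label array can be summarized, separating real clusters from noise. The array `example_labels` is hypothetical and only mimics the shape of DBSCAN's `labels_` output.

```python
from collections import Counter

# Hypothetical label array, shaped like DBSCAN's labels_ output.
example_labels = [0, 0, 1, -1, 1, 0, -1, 2]

# Distinct labels minus the noise label, if present.
n_clusters = len(set(example_labels)) - (1 if -1 in example_labels else 0)
n_noise = example_labels.count(-1)
cluster_sizes = Counter(label for label in example_labels if label != -1)

print(n_clusters)           # 3
print(n_noise)              # 2
print(dict(cluster_sizes))  # {0: 3, 1: 2, 2: 1}
```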

### 6- Visualization of clusters based on location
Now, we can visualize the clusters using basemap:

In [None]:
# Importing necessary libraries for map visualization
from mpl_toolkits.basemap import Basemap  # Basemap toolkit for creating geographic maps
import matplotlib.pyplot as plt  # Matplotlib's pyplot for plotting
from pylab import rcParams  # PyLab's rcParams for setting figure properties
# Magic command for Jupyter Notebook to display plots inline
# (kept on its own line, because a line magic receives the rest of its line as arguments)
%matplotlib inline
rcParams['figure.figsize'] = (14,10)  # Setting the default figure size

# Creating a Basemap instance for map visualization
my_map = Basemap(projection='merc',  # Mercator projection
                 resolution='l',     # Low resolution of the map
                 area_thresh=1000.0, # Minimum area threshold for displaying features
                 llcrnrlon=llon, llcrnrlat=llat,  # Lower-left corner longitude and latitude
                 urcrnrlon=ulon, urcrnrlat=ulat)  # Upper-right corner longitude and latitude

# Drawing map elements
my_map.drawcoastlines()  # Draw coastlines on the map
my_map.drawcountries()   # Draw country boundaries on the map
# my_map.drawmapboundary()  # Uncomment to draw the map boundary
my_map.fillcontinents(color='white', alpha=0.3)  # Fill continents with white color and some transparency
my_map.shadedrelief()    # Add shaded relief to the map for a 3D effect

# Creating a color map for different clusters
colors = plt.get_cmap('jet')(np.linspace(0.0, 1.0, clusterNum))  # Using 'jet' colormap for cluster colors

# Visualization of clusters on the map
for clust_number in set(labels):  # Loop through each cluster number
    # Assign gray color to noise points, otherwise use a color from the color map
    c = ([0.4, 0.4, 0.4] if clust_number == -1 else colors[int(clust_number)])  # np.int was removed in NumPy 1.24; use the built-in int
    
    # Selecting data points that belong to the current cluster
    clust_set = pdf[pdf.Clus_Db == clust_number]
    
    # Plotting data points for the cluster on the map
    my_map.scatter(clust_set.xm, clust_set.ym, color=c, marker='o', s=20, alpha=0.85)
    
    # If not a noise cluster, calculate and display the cluster center and print average temperature
    if clust_number != -1:
        cenx = np.mean(clust_set.xm)  # Calculate mean x-coordinate of the cluster
        ceny = np.mean(clust_set.ym)  # Calculate mean y-coordinate of the cluster
        plt.text(cenx, ceny, str(clust_number), fontsize=25, color='red')  # Display cluster number at the cluster center
        print("Cluster " + str(clust_number) + ', Avg Temp: ' + str(np.mean(clust_set.Tm)))  # Print average temperature of the cluster


### 7- Clustering of stations based on their location, mean, max, and min Temperature
In this section we re-run DBSCAN, but this time on a 5-dimensional dataset:
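
Because DBSCAN applies a single radius `eps` across all dimensions, features measured on very different scales are usually standardized first. The sketch below shows the z-score rescaling on two hypothetical columns; the helper name `standardize` and the data are assumptions. It uses the population standard deviation, which is also what `StandardScaler` uses.

```python
from statistics import fmean, pstdev

def standardize(column):
    """Rescale values to zero mean and unit (population) standard deviation."""
    mu = fmean(column)
    sigma = pstdev(column)  # assumes the column is not constant (sigma > 0)
    return [(v - mu) / sigma for v in column]

# Hypothetical columns on very different scales.
x_metres = [100000.0, 200000.0, 300000.0, 400000.0]
temp_c = [-5.0, 0.0, 5.0, 10.0]

x_scaled = standardize(x_metres)
t_scaled = standardize(temp_c)

print([round(v, 4) for v in x_scaled])  # [-1.3416, -0.4472, 0.4472, 1.3416]
print([round(v, 4) for v in t_scaled])  # [-1.3416, -0.4472, 0.4472, 1.3416]
```

After rescaling, a one-unit step means "one standard deviation" in every column, so no single feature dominates the distance computation.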

In [None]:
# Importing necessary libraries for clustering and data preprocessing
from sklearn.cluster import DBSCAN  # DBSCAN clustering algorithm
import sklearn.utils  # General utility functions from scikit-learn
from sklearn.preprocessing import StandardScaler  # StandardScaler for normalization

# DBSCAN takes no random_state and is deterministic for a fixed input order,
# so this call (which only returns a RandomState object) does not affect the result
sklearn.utils.check_random_state(1000)

# Preparing the data for clustering
Clus_dataSet = pdf[['xm', 'ym', 'Tx', 'Tm', 'Tn']]  # Selecting specific columns for clustering
Clus_dataSet = np.nan_to_num(Clus_dataSet)  # Replacing NaN with 0.0 and infinities with large finite values
Clus_dataSet = StandardScaler().fit_transform(Clus_dataSet)  # Standardizing each feature to zero mean and unit variance

# Compute DBSCAN
db = DBSCAN(eps=0.3, min_samples=10).fit(Clus_dataSet)  # Applying DBSCAN algorithm to the normalized data
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)  # Initializing a mask for core samples
core_samples_mask[db.core_sample_indices_] = True  # Marking core samples in the mask
labels = db.labels_  # Extracting the labels assigned by the DBSCAN algorithm
pdf["Clus_Db"] = labels  # Adding the cluster labels to the original DataFrame

# Calculating the number of clusters
realClusterNum = len(set(labels)) - (1 if -1 in labels else 0)  # The number of actual clusters, excluding noise points
clusterNum = len(set(labels))  # Number of distinct labels, counting the noise label (-1) as one if present

# Display a sample of the clustered data
pdf[["Stn_Name", "Tx", "Tm", "Clus_Db"]].head(5)  # Displaying the first 5 rows with cluster labels
