# Unsupervised Machine Learning
In this type of learning, the model is trained on data that has no labelled responses. The main goal is to uncover hidden patterns, structures, or relationships within the data.

## Types of Unsupervised Learning
### Clustering 
The goal is to group similar data points together based on their shared features. E.g., Safaricom customers have been grouped into:

a) SMS Services: Send Promotional SMS updates that only target these customers.

b) Call Services

c) Bundles Subscribers

#### Clustering using the Following Algorithms:

- K-Means: Groups data points by their closeness to each cluster's mean (centroid), e.g. mean age. It works ONLY on numerical data.

- K-Modes: Groups data points based on categorical data, e.g. gender, location.

- K-Prototypes: Groups data points using BOTH numerical (mean) and categorical (mode) features.
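To make the K-Means idea concrete, here is a minimal NumPy sketch of a single round: each point is assigned to its nearest centroid, then each centroid is recomputed as the mean of its points. The toy points, the starting centroids, and the helper name `kmeans_step` are illustrative assumptions, not part of the dataset used later.

```python
import numpy as np

def kmeans_step(points, centroids):
    """One K-Means round: assign points to the nearest centroid, then recompute the means."""
    # Euclidean distance from every point to every centroid: shape (n_points, k)
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # New centroid = mean of its assigned points (keep the old one if a cluster is empty)
    new_centroids = np.array([
        points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
        for j in range(len(centroids))
    ])
    return labels, new_centroids

# Hypothetical (age, spending score) pairs
points = np.array([[20.0, 80.0], [22.0, 75.0], [60.0, 10.0], [65.0, 15.0]])
start_centroids = np.array([[25.0, 70.0], [55.0, 20.0]])  # arbitrary starting guesses

labels, updated_centroids = kmeans_step(points, start_centroids)
print(labels)
print(updated_centroids)
```

A full K-Means run repeats this step until the assignments stop changing. K-Modes follows the same loop in spirit but, for categorical data, replaces the mean with the mode and the Euclidean distance with a count of mismatched categories.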

#### Applications of Clustering Include:

- Market/Customer Segmentation

- Image Segmentation

- Social Network Analysis

### Association Rule Mining
Finds relationships between variables (for example, items that are bought together) using a technique called basket analysis, also known as market basket analysis.
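Two measures commonly used in basket analysis are support (how often items appear together) and confidence (how often the rule holds when its left-hand side is present). The sketch below computes both in plain Python; the baskets and the function names are hypothetical and only serve as an illustration.

```python
# Hypothetical shopping baskets (one set of items per transaction)
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk", "eggs"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent): support(both) / support(antecedent)."""
    both = support(antecedent | consequent, transactions)
    return both / support(antecedent, transactions)

rule_support = support({"bread", "milk"}, baskets)
rule_confidence = confidence({"bread"}, {"milk"}, baskets)
print(f"support(bread, milk) = {rule_support:.2f}")
print(f"confidence(bread -> milk) = {rule_confidence:.2f}")
```

Here bread and milk appear together in 3 of 5 baskets, and 3 of the 4 baskets containing bread also contain milk.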

### Real-life scenarios for clustering
- Customer Segmentation – grouping customers based on behavior or purchase patterns.

- Image Segmentation – separating objects within an image.

- Anomaly Detection – identifying unusual data points in datasets.

- Document or News Categorization – grouping similar documents or news articles.

## Clustering
### Customer segmentation: Grouping customers based on their similar purchasing behaviour

In [2]:
# Step 1: Read the mall customers data

# install libraries
# The exclamation mark `!` before `pip` is used in Jupyter Notebook to run shell commands directly from a notebook cell
# !pip install kagglehub 
# !pip install pandas

# import libraries
import kagglehub
import os
import pandas as pd



In [3]:
# Get path
path = kagglehub.dataset_download("vjchoudhary7/customer-segmentation-tutorial-in-python")

print("Path to dataset files:", path)
print(os.listdir(path))

file_path = os.path.join(path, "Mall_Customers.csv")
print("File path: ", file_path)

# Read file
data = pd.read_csv(file_path)
data

Path to dataset files: /home/harris/.cache/kagglehub/datasets/vjchoudhary7/customer-segmentation-tutorial-in-python/versions/1
['Mall_Customers.csv']
File path:  /home/harris/.cache/kagglehub/datasets/vjchoudhary7/customer-segmentation-tutorial-in-python/versions/1/Mall_Customers.csv


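| | CustomerID | Gender | Age | Annual Income (k$) | Spending Score (1-100) |
|---|---|---|---|---|---|
| 0 | 1 | Male | 19 | 15 | 39 |
| 1 | 2 | Male | 21 | 15 | 81 |
| 2 | 3 | Female | 20 | 16 | 6 |
| 3 | 4 | Female | 23 | 16 | 77 |
| 4 | 5 | Female | 31 | 17 | 40 |
| ... | ... | ... | ... | ... | ... |
| 195 | 196 | Female | 35 | 120 | 79 |
| 196 | 197 | Female | 45 | 126 | 28 |
| 197 | 198 | Male | 32 | 126 | 74 |
| 198 | 199 | Male | 32 | 137 | 18 |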


In [4]:
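# Step 2: Data transformation
# Check for and remove empty/missing records

# Detect: count the missing values per column
# (the result is displayed only when this is the last expression in a cell)
data.isnull().sum()

# Remove records that contain missing values
data.dropna(inplace=True)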
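As a quick illustration of what these two calls do, the sketch below applies them to a tiny hypothetical DataFrame (the values are made up; only the column names mirror the mall dataset).

```python
import numpy as np
import pandas as pd

# Hypothetical records with two missing values
toy = pd.DataFrame({
    "Age": [19, np.nan, 35],
    "Annual Income (k$)": [15, 16, np.nan],
    "Spending Score (1-100)": [39, 81, 6],
})

# Count missing values in each column
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# dropna() without inplace=True returns a new DataFrame and leaves `toy` untouched
clean = toy.dropna()
print(len(toy), "rows before,", len(clean), "rows after")
```

Any row with at least one missing value is dropped, so only the first row survives here.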

In [5]:
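# Step 3: Unsupervised ML
# Create a variable X to store our features
# NB: There is no target y in unsupervised learning

array = data.values
print("Shape: ", array.shape)
# print(array)

# Columns 0 and 1 are CustomerID and Gender; keep columns 2 onwards
# (Age, Annual Income, Spending Score) as the numerical features
X = array[ : , 2:]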
print("Features(X): ", X)
print("Features(X) shape: ", X.shape)

Shape:  (200, 5)
Features(X):  [[19 15 39]
 [21 15 81]
 [20 16 6]
 [23 16 77]
 [31 17 40]
 [22 17 76]
 [35 18 6]
 [23 18 94]
 [64 19 3]
 [30 19 72]
 [67 19 14]
 [35 19 99]
 [58 20 15]
 [24 20 77]
 [37 20 13]
 [22 20 79]
 [35 21 35]
 [20 21 66]
 [52 23 29]
 [35 23 98]
 [35 24 35]
 [25 24 73]
 [46 25 5]
 [31 25 73]
 [54 28 14]
 [29 28 82]
 [45 28 32]
 [35 28 61]
 [40 29 31]
 [23 29 87]
 [60 30 4]
 [21 30 73]
 [53 33 4]
 [18 33 92]
 [49 33 14]
 [21 33 81]
 [42 34 17]
 [30 34 73]
 [36 37 26]
 [20 37 75]
 [65 38 35]
 [24 38 92]
 [48 39 36]
 [31 39 61]
 [49 39 28]
 [24 39 65]
 [50 40 55]
 [27 40 47]
 [29 40 42]
 [31 40 42]
 [49 42 52]
 [33 42 60]
 [31 43 54]
 [59 43 60]
 [50 43 45]
 [47 43 41]
 [51 44 50]
 [69 44 46]
 [27 46 51]
 [53 46 46]
 [70 46 56]
 [19 46 55]
 [67 47 52]
 [54 47 59]
 [63 48 51]
 [18 48 59]
 [43 48 50]
 [68 48 48]
 [19 48 59]
 [32 48 47]
 [70 49 55]
 [47 49 42]
 [60 50 49]
 [60 50 56]
 [59 54 47]
 [26 54 54]
 [45 54 53]
 [40 54 48]
 [23 54 52]
 [49 54 42]
 [57 54 51]
 ...
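Positional slicing such as `array[:, 2:]` depends on the column order of the file. A hedged alternative is to select the feature columns by name. The sketch below rebuilds the first two rows of the dataset by hand for illustration; the names `sample`, `feature_columns`, and `X_named` are assumptions.

```python
import pandas as pd

# First two rows of Mall_Customers.csv, rebuilt by hand for illustration
sample = pd.DataFrame({
    "CustomerID": [1, 2],
    "Gender": ["Male", "Male"],
    "Age": [19, 21],
    "Annual Income (k$)": [15, 15],
    "Spending Score (1-100)": [39, 81],
})

# Select the numerical features by name instead of by position
feature_columns = ["Age", "Annual Income (k$)", "Spending Score (1-100)"]
X_named = sample[feature_columns].to_numpy()

# Positional slicing gives the same values as long as the column order is unchanged
X_positional = sample.values[:, 2:]
print(X_named)
print((X_named == X_positional).all())
```

Selecting by name also yields a numeric array directly, whereas `data.values` on a frame with a text column produces an object array.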

In [6]:
# Step 4: Import the clustering algorithm and fit the X data
# Use the KMeans algorithm: it assigns each data point to the cluster with the nearest mean (centroid)
# The goal is to partition the dataset into k clusters such that data points within each cluster are more similar to each other than to those in other clusters.

# !pip install scikit-learn

from sklearn.cluster import KMeans

In [7]:
# Fit K-Means with k = 5 clusters
# NB: The elbow method can be used to determine the optimum number of clusters; here k is set directly
model = KMeans(n_clusters=5, random_state=42)
model.fit(X)

| Parameter | Value |
|---|---|
| n_clusters | 5 |
| init | 'k-means++' |
| n_init | 'auto' |
| max_iter | 300 |
| tol | 0.0001 |
| verbose | 0 |
| random_state | 42 |
| copy_x | True |
| algorithm | 'lloyd' |
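The elbow method is one common way to choose the number of clusters: fit K-Means for a range of k values, record the inertia (within-cluster sum of squared distances) of each fit, and look for the k after which the inertia stops dropping sharply. The sketch below is a minimal illustration on synthetic data; the three blobs and the name `X_demo` are assumptions, not the mall dataset.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: three well-separated blobs (an assumption for illustration)
rng = np.random.default_rng(42)
centers = np.array([[0.0, 0.0], [10.0, 10.0], [20.0, 0.0]])
X_demo = np.vstack([c + rng.normal(scale=1.0, size=(50, 2)) for c in centers])

k_values = range(1, 7)
inertias = []
for k in k_values:
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_demo)
    inertias.append(km.inertia_)  # within-cluster sum of squared distances

# Inspect (or plot) inertia against k and look for the "elbow"
for k, inertia in zip(k_values, inertias):
    print(k, round(inertia, 1))
```

On data like this the inertia should fall steeply up to k = 3 and only slightly afterwards, which suggests three clusters. The same loop could be run on `X` to check whether 5 is a sensible choice for the mall data.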


In [8]:
# Step 5: Get the cluster means (centroids), one per cluster

centroids = model.cluster_centers_
centroids

array([[ 46.21348315,  47.71910112,  41.79775281],
       [ 32.45454545, 108.18181818,  82.72727273],
       [ 24.68965517,  29.5862069 ,  73.65517241],
       [ 40.39473684,  87.        ,  18.63157895],
       [ 31.78787879,  76.09090909,  77.75757576]])

In [9]:
#  Step 6: Store the centroids in a dataframe

centroids_dataframe = pd.DataFrame(centroids, columns=["Customer Age", "Annual Income", "Spending Score"])
centroids_dataframe

| | Customer Age | Annual Income | Spending Score |
|---|---|---|---|
| 0 | 46.213483 | 47.719101 | 41.797753 |
| 1 | 32.454545 | 108.181818 | 82.727273 |
| 2 | 24.689655 | 29.586207 | 73.655172 |
| 3 | 40.394737 | 87.0 | 18.631579 |
| 4 | 31.787879 | 76.090909 | 77.757576 |
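Once the centroids are known, a new customer can be placed in the cluster whose centroid is closest. The sketch below does this by hand with NumPy, using the centroid values from the output above; the customer `[30, 80, 80]` is hypothetical.

```python
import numpy as np

# Centroids copied from the output above (Customer Age, Annual Income, Spending Score)
centroids = np.array([
    [46.21348315, 47.71910112, 41.79775281],
    [32.45454545, 108.18181818, 82.72727273],
    [24.68965517, 29.5862069, 73.65517241],
    [40.39473684, 87.0, 18.63157895],
    [31.78787879, 76.09090909, 77.75757576],
])

# Hypothetical customer: age 30, annual income 80 k$, spending score 80
new_customer = np.array([30.0, 80.0, 80.0])

# Euclidean distance from the customer to each centroid
distances = np.linalg.norm(centroids - new_customer, axis=1)
nearest_cluster = int(distances.argmin())
print(nearest_cluster)
```

With the fitted model at hand, `model.predict` performs the same nearest-centroid assignment without copying the values manually.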


In [10]:
# Step 7: Assign members to their corresponding cluster

data["Cluster"] = model.labels_
data

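| | CustomerID | Gender | Age | Annual Income (k$) | Spending Score (1-100) | Cluster |
|---|---|---|---|---|---|---|
| 0 | 1 | Male | 19 | 15 | 39 | 2 |
| 1 | 2 | Male | 21 | 15 | 81 | 2 |
| 2 | 3 | Female | 20 | 16 | 6 | 0 |
| 3 | 4 | Female | 23 | 16 | 77 | 2 |
| 4 | 5 | Female | 31 | 17 | 40 | 0 |
| ... | ... | ... | ... | ... | ... | ... |
| 195 | 196 | Female | 35 | 120 | 79 | 1 |
| 196 | 197 | Female | 45 | 126 | 28 | 3 |
| 197 | 198 | Male | 32 | 126 | 74 | 1 |
| 198 | 199 | Male | 32 | 137 | 18 | 3 |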
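A quick way to see how many members each cluster has is `value_counts` on the label column. The sketch below uses only the nine rows visible in the output above, so the counts are illustrative rather than the real cluster sizes; the name `visible` is an assumption.

```python
import pandas as pd

# CustomerID and Cluster for the nine rows visible in the output above
visible = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4, 5, 196, 197, 198, 199],
    "Cluster": [2, 2, 0, 2, 0, 1, 3, 1, 3],
})

# Number of members per cluster, ordered by cluster label
cluster_sizes = visible["Cluster"].value_counts().sort_index()
print(cluster_sizes)
```

Running `data["Cluster"].value_counts()` on the full DataFrame would give the actual size of each of the five clusters.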


In [11]:
# Generating members of the same cluster
# Cluster 3 members

cluster_3 = data[data["Cluster"] == 3]
cluster_3

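| | CustomerID | Gender | Age | Annual Income (k$) | Spending Score (1-100) | Cluster |
|---|---|---|---|---|---|---|
| 124 | 125 | Female | 23 | 70 | 29 | 3 |
| 126 | 127 | Male | 43 | 71 | 35 | 3 |
| 128 | 129 | Male | 59 | 71 | 11 | 3 |
| 130 | 131 | Male | 47 | 71 | 9 | 3 |
| 132 | 133 | Female | 25 | 72 | 34 | 3 |
| 134 | 135 | Male | 20 | 73 | 5 | 3 |
| 136 | 137 | Female | 44 | 73 | 7 | 3 |
| 138 | 139 | Male | 19 | 74 | 10 | 3 |
| 140 | 141 | Female | 57 | 75 | 5 | 3 |
| 142 | 143 | Female | 28 | 76 | 40 | 3 |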
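After isolating a cluster, it helps to summarize it so the segment can be described in business terms. The sketch below profiles the ten cluster-3 rows shown above, rebuilt by hand; the names `cluster_3_sample`, `profile`, and `gender_counts` are assumptions.

```python
import pandas as pd

# The ten cluster-3 rows shown above, rebuilt by hand for illustration
cluster_3_sample = pd.DataFrame({
    "Gender": ["Female", "Male", "Male", "Male", "Female",
               "Male", "Female", "Male", "Female", "Female"],
    "Age": [23, 43, 59, 47, 25, 20, 44, 19, 57, 28],
    "Annual Income (k$)": [70, 71, 71, 71, 72, 73, 73, 74, 75, 76],
    "Spending Score (1-100)": [29, 35, 11, 9, 34, 5, 7, 10, 5, 40],
})

# Average value of each numerical feature in this sample
numeric_columns = ["Age", "Annual Income (k$)", "Spending Score (1-100)"]
profile = cluster_3_sample[numeric_columns].mean()

# Gender breakdown of the same rows
gender_counts = cluster_3_sample["Gender"].value_counts()

print(profile)
print(gender_counts)
```

These rows combine a fairly high income with a low spending score, which matches the cluster-3 centroid reported earlier (annual income of about 87, spending score of about 18.6).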
