<a href="https://colab.research.google.com/github/DeepsMaxi305/Data_Science/blob/main/clustering.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Clustering
You should build an end-to-end machine learning pipeline using a clustering model. In particular, you should do the following:
- Load the `customers` dataset using [Pandas](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html). You can find this dataset in the datasets folder.
- Build an end-to-end machine learning pipeline, including a clustering model, such as [k-means](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html), [hdbscan](https://hdbscan.readthedocs.io/en/latest/), or [agglomerative clustering](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html).
- Identify the optimal number of clusters using the elbow method or the silhouette score if needed.
- Check the documentation to identify the most important hyperparameters, attributes, and methods of the model. Use them in practice.

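The steps listed above can be chained into a single estimator. The following is a minimal sketch, assuming scikit-learn's `make_pipeline` helper and a synthetic dataset from `make_blobs` as a stand-in for the `customers` table; the sample count, feature count, and `n_clusters=3` are illustrative choices, not values taken from this notebook.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the customers table (assumption): 300 samples, 4 features.
features, _ = make_blobs(n_samples=300, n_features=4, centers=3, random_state=0)

# Chain scaling and clustering so both steps are fitted together.
pipeline = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=3, n_init=10, random_state=0),
)
labels = pipeline.fit_predict(features)

# make_pipeline names each step after its lowercased class name.
kmeans = pipeline.named_steps["kmeans"]
print(kmeans.inertia_)
print(kmeans.cluster_centers_.shape)
```

Fixing `random_state` makes repeated runs comparable, which helps when comparing different values of `n_clusters`.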
# Importing Libraries

In [9]:
import pandas as pd
import sklearn.metrics
import sklearn.cluster
import sklearn.preprocessing
import sklearn.model_selection
import plotly.graph_objects as go


# Loading the dataset

In [10]:
df = pd.read_csv("/content/customers.csv")  # Read the customers dataset
df = df.set_index("ID")                     # Use the customer ID as the row index
df.head()

| ID | Sex | Marital status | Age | Education | Income | Occupation | Settlement size |
|---|---|---|---|---|---|---|---|
| 100000001 | 0 | 0 | 67 | 2 | 124670 | 1 | 2 |
| 100000002 | 1 | 1 | 22 | 1 | 150773 | 1 | 2 |
| 100000003 | 0 | 0 | 49 | 1 | 89210 | 0 | 0 |
| 100000004 | 0 | 0 | 45 | 1 | 171565 | 1 | 1 |
| 100000005 | 0 | 0 | 53 | 1 | 149031 | 1 | 1 |


# Scaling features

In [11]:
scaler = sklearn.preprocessing.StandardScaler()
scaler.fit(df)
x = scaler.transform(df)

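As a rough illustration of what `StandardScaler` does to a single column, the sketch below subtracts the mean and divides by the population standard deviation. It uses only the five `Age` values visible in the `df.head()` output, so the numbers differ from what the scaler produces on the full dataset; the `standardize` helper is hypothetical and exists only for this example.

```python
from statistics import fmean, pstdev

def standardize(column):
    """Rescale a list of numbers to zero mean and unit (population) variance."""
    mean = fmean(column)
    std = pstdev(column) or 1.0  # Guard against a constant column
    return [(value - mean) / std for value in column]

ages = [67, 22, 49, 45, 53]  # Age values from the first five rows shown above
scaled_ages = standardize(ages)
print(scaled_ages)
```

Scaling matters for k-means because it relies on Euclidean distances: without it, a wide-ranging column such as `Income` would dominate the smaller-valued categorical codes.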
# Training the Model

In [12]:
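model = sklearn.cluster.KMeans(n_clusters=2)  # Start with two clusters
model.fit(x)

print(model.inertia_)          # Sum of squared distances to the nearest centroid
print(model.labels_)           # Cluster index assigned to each sample
print(model.cluster_centers_)  # Centroid coordinates in the scaled feature space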

10514.498873652414
[1 1 0 ... 0 0 0]
[[ 0.47998607  0.30744043 -0.29114263 -0.03738237 -0.50369925 -0.51686446
  -0.58571874]
 [-0.65607564 -0.42022923  0.39795235  0.05109661  0.68848832  0.70648338
   0.80059781]]

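The first printed value is the model's inertia: the sum of squared distances from each sample to its closest cluster centre. A minimal sketch of that computation on a toy dataset follows; the `inertia` helper and the four points are made up for illustration and are not part of scikit-learn or this notebook's data.

```python
def inertia(points, centers):
    """Sum of squared Euclidean distances from each point to its nearest centre."""
    total = 0.0
    for point in points:
        total += min(
            sum((a - b) ** 2 for a, b in zip(point, center))
            for center in centers
        )
    return total

# Toy data (assumption): two tight groups, each centred on one centroid.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centers = [(0.0, 1.0), (10.0, 1.0)]

print(inertia(points, centers))  # Each point is 1 unit from its centre, so this prints 4.0
```

Lower inertia means tighter clusters, but inertia always decreases as more clusters are added, which is why it is not compared across values of k directly.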

In [13]:
df['cluster id'] = model.labels_
df.head()

| ID | Sex | Marital status | Age | Education | Income | Occupation | Settlement size | cluster id |
|---|---|---|---|---|---|---|---|---|
| 100000001 | 0 | 0 | 67 | 2 | 124670 | 1 | 2 | 1 |
| 100000002 | 1 | 1 | 22 | 1 | 150773 | 1 | 2 | 1 |
| 100000003 | 0 | 0 | 49 | 1 | 89210 | 0 | 0 | 0 |
| 100000004 | 0 | 0 | 45 | 1 | 171565 | 1 | 1 | 1 |
| 100000005 | 0 | 0 | 53 | 1 | 149031 | 1 | 1 | 1 |


# Identifying the number of Clusters


# Elbow Method

In [17]:
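k_list = []
elbow_scores = []
for k in range(2, 51):
    k_list.append(k)
    model = sklearn.cluster.KMeans(n_clusters=k)
    model.fit(x)
    es = model.inertia_  # Within-cluster sum of squares for this k
    elbow_scores.append(es)

# Plot inertia against k; the "elbow" is where the curve starts to flatten.
fig = go.Figure(data=go.Scatter(x=k_list, y=elbow_scores))
fig.update_layout(xaxis_title="Number of clusters (k)", yaxis_title="Inertia")
fig.show()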

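The elbow is usually read off the plot by eye. One common heuristic for locating it programmatically picks the point farthest from the straight line joining the first and last points of the curve. The sketch below is one such heuristic, not the method used in this notebook; `find_elbow` is a hypothetical helper, it assumes the scores decrease overall, and the example scores are invented.

```python
def find_elbow(ks, scores):
    """Return the k whose point lies farthest below the chord joining the curve's endpoints."""
    k_span = ks[-1] - ks[0]
    score_span = scores[0] - scores[-1]  # Assumes the curve decreases overall
    best_k, best_gap = ks[0], 0.0
    for k, score in zip(ks, scores):
        # Rescale both axes to [0, 1] so neither dominates the comparison.
        x_norm = (k - ks[0]) / k_span
        y_norm = (score - scores[-1]) / score_span
        # The chord runs from (0, 1) to (1, 0); the gap is proportional to the
        # perpendicular distance of the point below that line.
        gap = 1.0 - x_norm - y_norm
        if gap > best_gap:
            best_k, best_gap = k, gap
    return best_k

# Invented inertia values (assumption) with a clear bend at k = 3.
example_ks = [1, 2, 3, 4, 5, 6]
example_scores = [100.0, 40.0, 15.0, 12.0, 10.0, 9.0]
print(find_elbow(example_ks, example_scores))  # prints 3
```

With the lists built in the loop above, the call would be `find_elbow(k_list, elbow_scores)`; the result is best treated as a suggestion to check against the plot.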

# Silhouette Score

In [20]:
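k_list = []
silhouette_scores = []
for k in range(2, 51):
    k_list.append(k)
    model = sklearn.cluster.KMeans(n_clusters=k)
    model.fit(x)
    y_predicted = model.predict(x)  # Cluster label for each sample
    # Mean silhouette coefficient over all samples; higher means better-separated clusters.
    ss = sklearn.metrics.silhouette_score(x, y_predicted)
    silhouette_scores.append(ss)

fig = go.Figure(data=go.Scatter(x=k_list, y=silhouette_scores))
fig.update_layout(xaxis_title="Number of clusters (k)", yaxis_title="Silhouette score")
fig.show()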

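Unlike inertia, the silhouette score does not decrease automatically as k grows, so the candidate with the highest score can be selected directly. A minimal sketch follows, assuming two parallel lists like `k_list` and `silhouette_scores`; the helper name and the example scores are hypothetical and are not results from this notebook.

```python
def best_k_by_silhouette(ks, scores):
    """Return the (k, score) pair with the highest silhouette score."""
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return ks[best_index], scores[best_index]

# Invented scores (assumption) used only to demonstrate the selection.
example_ks = [2, 3, 4, 5]
example_scores = [0.25, 0.31, 0.28, 0.22]
print(best_k_by_silhouette(example_ks, example_scores))  # prints (3, 0.31)
```

Because `KMeans` is initialised randomly here, the selected k may shift slightly between runs unless `random_state` is fixed.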