# Clustering



Clustering is an unsupervised machine learning method for discovering patterns and similarities within data samples. Samples are grouped into clusters so that members of the same cluster are highly similar to one another. Clustering matters because it reveals the intrinsic grouping in unlabeled data.

It can be defined as: "A method of sorting data points into different clusters based on their similarity. Objects that are similar to one another are kept in one group, and they have few or no similarities with the objects in other groups."

It accomplishes this by identifying comparable patterns in the unlabeled dataset, such as activity, size, color, and shape, and categorizing the samples according to the presence or absence of those patterns. Because clustering is an unsupervised learning method, the algorithm receives no supervision and works with an unlabeled dataset.

Following the application of the clustering technique, each group or cluster is given a cluster-ID, which the ML system can utilize to facilitate the processing of huge and complicated datasets.

The scikit-learn library provides a module called `sklearn.cluster`, which contains algorithms for clustering unlabeled data.
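As a minimal sketch of how cluster IDs come out of `sklearn.cluster`, the example below fits `KMeans` on a small toy array. The data points and the choice of two clusters are assumptions made purely for illustration; they are not taken from the case-study dataset.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data (made up for illustration): two well-separated groups of 2-D points.
X = np.array([
    [1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
    [8.0, 8.2], [8.1, 7.9], [7.9, 8.0],
])

# n_clusters=2 is an assumption that matches how the toy data was built.
model = KMeans(n_clusters=2, n_init=10, random_state=0)

# fit_predict returns one integer cluster ID per sample.
cluster_ids = model.fit_predict(X)
print(cluster_ids)
```

The first three points should share one ID and the last three the other. Which group receives ID 0 and which receives ID 1 is arbitrary, so the IDs identify group membership rather than carrying any meaning of their own.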

Now that we understand clustering, let us explore the types of clustering methods available in scikit-learn.




### Clustering Techniques

- **K-Means clustering**
- **Hierarchical clustering**
- **Mean shift**
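The three techniques listed above are each available as a class in `sklearn.cluster`. The sketch below runs them on synthetic data from `make_blobs`; the dataset, the number of clusters, and the random seeds are illustrative assumptions, not settings from the case study.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans, MeanShift
from sklearn.datasets import make_blobs

# Synthetic data (an assumption for illustration): 150 points around 3 centers.
X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.6, random_state=42)

models = {
    # K-Means and hierarchical clustering need the number of clusters up front.
    "K-Means": KMeans(n_clusters=3, n_init=10, random_state=42),
    "Hierarchical": AgglomerativeClustering(n_clusters=3),
    # Mean shift estimates the number of clusters from the data density.
    "Mean shift": MeanShift(),
}

# Each estimator returns one cluster ID per sample.
labels = {name: model.fit_predict(X) for name, model in models.items()}

for name, ids in labels.items():
    print(name, "found", len(set(ids)), "clusters")
```

One practical difference shows up in the constructor arguments: K-Means and hierarchical clustering are told how many clusters to produce, while mean shift decides that number itself, so its result may differ from what you expect.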


For a good understanding of how these algorithms work, watch the videos below:<br>
> **K-Means** $\to$ https://www.youtube.com/watch?v=4b5d3muPQmA&t=385s <br>
> **Hierarchical** $\to$ https://www.youtube.com/watch?v=7xHsRkOdVwo

# Clustering Case Study


We will work with a dataset from a movie streaming service. It contains information about customers' movie-watching habits, such as the number of times they open the streaming service, the number of minutes they watch per week, whether they chose to watch the movies suggested by the service's recommendation system, and so on.

The streaming service wants to know more about the customers using its services and has employed you as a data scientist to explore the dataset. We have already discussed some statistical methods that can give statistical insights into the dataset. In this case, however, we will derive new pieces of information from the dataset by identifying clusters of people whose traits are unique to their cluster.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
# Importing the dataset from different file formats.
# Each assignment below overwrites `data`, so a single read is enough in practice.

#data = pd.read_csv("Movie_Watchers_Data.csv")

data = pd.read_parquet("Movie_Watchers_Data.parquet")

data = pd.read_feather("Movie_Watchers_Data.feather")

In [None]:
data.head()

In [None]:
data.info()

### Data issues:


**Watched_recommendations and mobile_app_usage** are percentages. To use them for machine learning, they should be converted to their decimal representations (for example, 45% becomes 0.45).
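One possible way to do this conversion is sketched below. The helper `percent_to_fraction` is hypothetical, the lowercase column names are an assumption about how the columns are spelled in the dataset, and the sample values are made up. The sketch also assumes the percentages are stored either as numbers on a 0 to 100 scale or as strings such as `"45%"`; check `data.info()` and `data.head()` to see which applies.

```python
import pandas as pd

def percent_to_fraction(column: pd.Series) -> pd.Series:
    """Convert percentages such as 45 or "45%" to fractions such as 0.45."""
    # Work on the text form so numbers and "45%"-style strings are both handled.
    as_text = column.astype(str).str.strip().str.rstrip("%")
    # Values that cannot be parsed become NaN instead of raising an error.
    return pd.to_numeric(as_text, errors="coerce") / 100.0

# Hypothetical sample standing in for the real dataset.
sample = pd.DataFrame({
    "watched_recommendations": ["45%", "80%", "10%"],
    "mobile_app_usage": [25, 50, 100],
})

# Column names are assumed; adjust them to match the actual dataset.
for col in ["watched_recommendations", "mobile_app_usage"]:
    sample[col] = percent_to_fraction(sample[col])

print(sample)
```

The same loop could be applied to `data` once the column names and storage format have been confirmed.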