Dec 11, 2021

Jason Cardinal Exercise Notebook DataCamp - Cluster Analysis in Python

[https://campus.datacamp.com/courses/cluster-analysis-in-python]

This notebook contains the code-along notes.

## 1. Introduction to Clustering

Before you are ready to classify news articles, you need to be introduced to the basics of clustering. This chapter familiarizes you with a class of machine learning algorithms called unsupervised learning and then introduces you to clustering, one of the most popular unsupervised learning techniques. You will learn about two popular clustering techniques: hierarchical clustering and k-means clustering. The chapter concludes with the basic pre-processing steps to apply before you start clustering data.

### Unsupervised learning in real world

Which of the following examples can be solved with unsupervised learning?

Possible Answers

1. A list of tweets to be classified based on their sentiment. The data has tweets associated with a positive or negative sentiment.
2. A spam recognition system that marks incoming emails as spam. The data has emails marked as spam and not spam.
3. Segmentation of learners at DataCamp based on courses they complete. The training data has no labels.

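Answer: 3. Only the learner segmentation has no labels to learn from; the tweet and spam tasks come with labeled data, which makes them supervised learning problems.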

### Pokémon sightings

There have been reports of sightings of rare, legendary Pokémon. You have been asked to investigate! Plot the coordinates of sightings to find out where the Pokémon might be. The X and Y coordinates of the points are stored in the lists x and y, respectively.

Instructions:

1. Import the pyplot module from the matplotlib library as plt.
2. Create a scatter plot using the pyplot module.
3. Display the scatter plot created in the earlier step.

In [None]:
# Import plotting module from matplotlib library
from matplotlib import ____ as plt

# Create a scatter plot
plt.____(x, y)

# Display the scatter plot
plt.____()

In [None]:
# Import plotting module from matplotlib library
from matplotlib import pyplot as plt

# Create a scatter plot
plt.scatter(x, y)

# Display the scatter plot
plt.show()

### Pokémon sightings: hierarchical clustering

We are going to continue the investigation into the sightings of legendary Pokémon from the previous exercise. Remember that in the scatter plot of the previous exercise, you identified two areas where Pokémon sightings were dense. This means that the points seem to separate into two clusters. In this exercise, you will form two clusters of the sightings using hierarchical clustering.

'x' and 'y' are columns of X and Y coordinates of the locations of sightings, stored in a Pandas data frame, df. The following are available for use: matplotlib.pyplot as plt, seaborn as sns, and pandas as pd.

Instructions:

1. Import the linkage and fcluster functions.
2. Use the linkage() function to compute distances using the ward method.
3. Generate cluster labels for each data point with two clusters using the fcluster() function.
4. Plot the points with seaborn and assign a different color to each cluster.

In [None]:
# Import linkage and fcluster functions
from scipy.cluster.hierarchy import ____, ____

# Use the linkage() function to compute distance
Z = ____(____, 'ward')

# Generate cluster labels
df['cluster_labels'] = ____(____, ____, criterion='maxclust')

# Plot the points with seaborn
sns.scatterplot(x=____, y=____, hue=____, data=df)
plt.show()

In [None]:
# Import linkage and fcluster functions
from scipy.cluster.hierarchy import linkage, fcluster

# Use the linkage() function to compute distances
# (only the coordinate columns, so re-running the cell ignores cluster_labels)
Z = linkage(df[['x', 'y']], 'ward')

# Generate cluster labels
df['cluster_labels'] = fcluster(Z, 2, criterion='maxclust')

# Plot the points with seaborn
sns.scatterplot(x='x', y='y', hue='cluster_labels', data=df)
plt.show()
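
The solution above relies on the df that DataCamp preloads. As a minimal sketch that runs outside the course environment, the cell below builds a hypothetical stand-in for the sightings (two synthetic blobs; the blob centers, sizes, and seed are assumptions for illustration, not the course data) and applies the same linkage() and fcluster() calls.

```python
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical stand-in for the sightings data: two well-separated blobs
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=[0, 0], scale=1.0, size=(20, 2))
blob_b = rng.normal(loc=[50, 50], scale=1.0, size=(20, 2))
df = pd.DataFrame(np.vstack([blob_a, blob_b]), columns=['x', 'y'])

# Build the linkage matrix from the coordinate columns only
Z = linkage(df[['x', 'y']].to_numpy(), 'ward')

# Cut the hierarchy into at most two flat clusters (labels start at 1)
df['cluster_labels'] = fcluster(Z, 2, criterion='maxclust')

print(df['cluster_labels'].value_counts().sort_index())
```

With blobs this far apart, each blob should end up in its own cluster; real sightings data may be less clear-cut.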

### Pokémon sightings: k-means clustering

We are going to continue the investigation into the sightings of legendary Pokémon, using the same data as in the previous exercise. In this exercise, you will form clusters of the sightings using k-means clustering.

x and y are columns of X and Y coordinates of the locations of sightings, stored in a Pandas data frame, df. The following are available for use: matplotlib.pyplot as plt, seaborn as sns, and pandas as pd.

Instructions:

1. Import the kmeans and vq functions.
2. Use the kmeans() function to compute cluster centers by defining two clusters.
3. Assign cluster labels to each data point using the vq() function.
4. Plot the points with seaborn and assign a different color to each cluster.

In [None]:
# Import kmeans and vq functions
from scipy.cluster.vq import ____, ____

# Compute cluster centers
centroids,_ = ____(____, ____)

# Assign cluster labels
df['cluster_labels'], _ = ____(____, ____)

# Plot the points with seaborn
sns.scatterplot(x=____, y=____, hue=____, data=df)
plt.show()

In [None]:
# Import kmeans and vq functions
from scipy.cluster.vq import kmeans, vq

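# Compute cluster centers
# (use only the coordinate columns; df may already hold cluster_labels
# from the hierarchical clustering exercise)
centroids,_ = kmeans(df[['x', 'y']], 2)

# Assign cluster labels
df['cluster_labels'], _ = vq(df[['x', 'y']], centroids)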

# Plot the points with seaborn
sns.scatterplot(x='x', y='y', hue='cluster_labels', data=df)
plt.show()
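
As a self-contained sketch of the same two-step pattern (kmeans() to find centroids, then vq() to assign each point to its nearest centroid), the cell below uses hypothetical synthetic points rather than the course data. k-means starts from randomly chosen centroids, so which blob receives label 0 can differ between runs.

```python
import numpy as np
import pandas as pd
from scipy.cluster.vq import kmeans, vq

# Hypothetical stand-in for the sightings data: two well-separated blobs
rng = np.random.default_rng(1)
points = np.vstack([
    rng.normal(loc=[0, 0], scale=1.0, size=(25, 2)),
    rng.normal(loc=[40, 40], scale=1.0, size=(25, 2)),
])
df = pd.DataFrame(points, columns=['x', 'y'])
coords = df[['x', 'y']].to_numpy()

# kmeans() returns the centroids and the mean distortion
centroids, distortion = kmeans(coords, 2)

# vq() returns a label (starting at 0) and a distance for every point
labels, distances = vq(coords, centroids)
df['cluster_labels'] = labels

print(centroids)
print(df['cluster_labels'].value_counts())
```

Note that vq() labels start at 0, whereas fcluster() labels in the hierarchical exercise start at 1.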

### Normalize basic list data

Now that you are aware of normalization, let us try to normalize some data. goals_for is a list of goals scored by a football team in their last ten matches. Let us standardize the data using the whiten() function.

Instructions:

1. Import the whiten function.
2. Use the whiten() function to standardize the data.

In [None]:
# Import the whiten function
from scipy.cluster.vq import ____

goals_for = [4,3,2,3,1,1,2,0,1,4]

# Use the whiten() function to standardize the data
scaled_data = ____(____)
print(scaled_data)

In [None]:
# Import the whiten function
from scipy.cluster.vq import whiten

goals_for = [4,3,2,3,1,1,2,0,1,4]

# Use the whiten() function to standardize the data
scaled_data = whiten(goals_for)
print(scaled_data)
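
To make explicit what whiten() does here, the following sketch compares its output with a manual division by the standard deviation. It rescales the values but does not subtract the mean, so the scaled data has a standard deviation of 1 while its mean stays non-zero. The variable name manual is introduced only for this illustration.

```python
import numpy as np
from scipy.cluster.vq import whiten

goals_for = [4, 3, 2, 3, 1, 1, 2, 0, 1, 4]

# whiten() divides each value by the standard deviation of the data
scaled_data = whiten(goals_for)

# The same rescaling done by hand (np.std uses the population formula)
manual = np.array(goals_for) / np.std(goals_for)

print(np.allclose(scaled_data, manual))
print(scaled_data.std())
print(scaled_data.mean())
```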

### Visualize normalized data

After normalizing your data, you can compare the scaled data to the original data to see the difference. The variables from the last exercise, goals_for and scaled_data, are already available to you.

Instructions:

1. Use the matplotlib library to plot the original and scaled data.
2. Show the legend in the plot.
3. Display the plot.

In [None]:
# Plot original data
plt.____(____, label='original')

# Plot scaled data
plt.____(____, label='scaled')

# Show the legend in the plot
plt.____()

# Display the plot
plt.____()

In [None]:
# Plot original data
plt.plot(goals_for, label='original')

# Plot scaled data
plt.plot(scaled_data, label='scaled')

# Show the legend in the plot
plt.legend()

# Display the plot
plt.show()

### Normalization of small numbers

In earlier examples, you have seen the normalization of whole numbers. In this exercise, you will look at the treatment of fractional numbers: the change of interest rates in the country of Bangalla over the years. For your use, matplotlib.pyplot is imported as plt.

Instructions:

1. Scale the list rate_cuts, which contains the changes in interest rates.
2. Plot the original data against the scaled data.

In [None]:
# Prepare data
rate_cuts = [0.0025, 0.001, -0.0005, -0.001, -0.0005, 0.0025, -0.001, -0.0015, -0.001, 0.0005]

# Use the whiten() function to standardize the data
scaled_data = ____(____)

# Plot original data
plt.____(____, label='original')

# Plot scaled data
plt.____(____, label='scaled')

plt.legend()
plt.show()

In [None]:
# Prepare data
rate_cuts = [0.0025, 0.001, -0.0005, -0.001, -0.0005, 0.0025, -0.001, -0.0015, -0.001, 0.0005]

# Use the whiten() function to standardize the data
scaled_data = whiten(rate_cuts)

# Plot original data
plt.plot(rate_cuts, label='original')

# Plot scaled data
plt.plot(scaled_data, label='scaled')

plt.legend()
plt.show()
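
One way to see why small magnitudes are not a problem: dividing by the standard deviation removes the unit of measurement. The sketch below, which multiplies the rates by an arbitrary factor of 10,000 purely as an illustration, suggests that whiten() returns the same scaled values either way.

```python
import numpy as np
from scipy.cluster.vq import whiten

rate_cuts = np.array([0.0025, 0.001, -0.0005, -0.001, -0.0005,
                      0.0025, -0.001, -0.0015, -0.001, 0.0005])

# The same changes expressed on a different scale (illustrative factor)
rescaled = rate_cuts * 10_000

scaled_small = whiten(rate_cuts)
scaled_large = whiten(rescaled)

# Both versions collapse to the same unit-variance values
print(np.allclose(scaled_small, scaled_large))
print(scaled_small.std())
```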

### FIFA 18: Normalize data

FIFA 18 is a football video game that was released in 2017 for PC and consoles. The dataset that you are about to work on contains data on the 1000 top individual players in the game. You will explore various features of the data as we move ahead in the course. In this exercise, you will work with two columns, eur_wage, the wage of a player in Euros and eur_value, their current transfer market value.

The data for this exercise is stored in a Pandas dataframe, fifa. whiten from scipy.cluster.vq and matplotlib.pyplot as plt have been pre-loaded.

Instructions 1/3:

1. Scale the values of eur_wage and eur_value using the whiten() function.

In [None]:
# Scale wage and value
fifa['scaled_wage'] = ____(fifa[____])
fifa['scaled_value'] = ____(fifa[____])

In [None]:
# Scale wage and value
fifa['scaled_wage'] = whiten(fifa['eur_wage'])
fifa['scaled_value'] = whiten(fifa['eur_value'])

Instructions 2/3:

1. Plot the scaled wages and transfer values of players using the .plot() method of Pandas.

In [None]:
# Scale wage and value
fifa['scaled_wage'] = whiten(fifa['eur_wage'])
fifa['scaled_value'] = whiten(fifa['eur_value'])

# Plot the two columns in a scatter plot
fifa.plot(x=____, y=____, kind='scatter')
plt.show()

In [None]:
# Scale wage and value
fifa['scaled_wage'] = whiten(fifa['eur_wage'])
fifa['scaled_value'] = whiten(fifa['eur_value'])

# Plot the two columns in a scatter plot
fifa.plot(x='scaled_wage', y='scaled_value', kind='scatter')
plt.show()

Instructions 3/3:

1. Check the mean and standard deviation of the scaled data using the .describe() method of Pandas.

In [None]:
# Scale wage and value
fifa['scaled_wage'] = whiten(fifa['eur_wage'])
fifa['scaled_value'] = whiten(fifa['eur_value'])

# Plot the two columns in a scatter plot
fifa.plot(x='scaled_wage', y='scaled_value', kind = 'scatter')
plt.show()

# Check mean and standard deviation of scaled values
print(fifa[['scaled_wage', 'scaled_value']].____())

In [None]:
# Scale wage and value
fifa['scaled_wage'] = whiten(fifa['eur_wage'])
fifa['scaled_value'] = whiten(fifa['eur_value'])

# Plot the two columns in a scatter plot
fifa.plot(x='scaled_wage', y='scaled_value', kind = 'scatter')
plt.show()

# Check mean and standard deviation of scaled values
print(fifa[['scaled_wage', 'scaled_value']].describe())
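
When reading the .describe() output, two details are worth keeping in mind: the mean is not zero, because whiten() does not center the data, and the reported standard deviation can sit slightly above 1, because Pandas uses the sample formula (ddof=1) while whiten() divides by the population standard deviation. The sketch below illustrates this with a few hypothetical values standing in for a column such as fifa['eur_wage'].

```python
import numpy as np
import pandas as pd
from scipy.cluster.vq import whiten

# Hypothetical values standing in for a column such as fifa['eur_wage']
wages = pd.Series([100.0, 200.0, 300.0, 400.0, 500.0])
scaled_wage = pd.Series(whiten(wages.to_numpy()))

# describe() reports the sample standard deviation (ddof=1)
summary = scaled_wage.describe()

# The population standard deviation (ddof=0) is what whiten() sets to 1
population_std = scaled_wage.std(ddof=0)

print(summary[['mean', 'std']])
print(population_std)
```

With 1000 players the gap between the two standard deviations is tiny, but with only five values, as here, it is clearly visible.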

## 2. Hierarchical Clustering

This chapter focuses on a popular clustering algorithm, hierarchical clustering, and its implementation in SciPy. In addition to the procedure to perform hierarchical clustering, it attempts to help you answer an important question: how many clusters are present in your data? The chapter concludes with a discussion of the limitations of hierarchical clustering and the considerations to keep in mind when using it.

### Hierarchical clustering: ward method

It is time for Comic-Con! Comic-Con is an annual comic-based convention held in major cities in the world. You have the data of last year's footfall, the number of people at the convention ground at a given time. You would like to decide the location of your stall to maximize sales. Using the ward method, apply hierarchical clustering to find the two points of attraction in the area.

The data is stored in a Pandas data frame, comic_con. x_scaled and y_scaled are the column names of the standardized X and Y coordinates of people at a given point in time.

Instructions:

1. Import fcluster and linkage from scipy.cluster.hierarchy.
2. Use the ward method in the linkage() function.
3. Assign cluster labels by forming 2 flat clusters from distance_matrix.
4. Run the plotting code to see the results.

In [None]:
# Import the fcluster and linkage functions
from scipy.cluster.hierarchy import ____, ____

# Use the linkage() function
distance_matrix = ____(comic_con[['x_scaled', 'y_scaled']], ____ = ____, metric = 'euclidean')

# Assign cluster labels
comic_con['cluster_labels'] = ____(____, ____, criterion='maxclust')

# Plot clusters
sns.scatterplot(x='x_scaled', y='y_scaled',
                hue='cluster_labels', data = comic_con)
plt.show()

In [None]:
# Import the fcluster and linkage functions
from scipy.cluster.hierarchy import fcluster, linkage

# Use the linkage() function
distance_matrix = linkage(comic_con[['x_scaled', 'y_scaled']], method = 'ward', metric = 'euclidean')

# Assign cluster labels
comic_con['cluster_labels'] = fcluster(distance_matrix, 2, criterion='maxclust')

# Plot clusters
sns.scatterplot(x='x_scaled', y='y_scaled',
                hue='cluster_labels', data = comic_con)
plt.show()
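
Although the exercise names the variable distance_matrix, linkage() returns a linkage matrix: one row per merge, holding the two cluster indices that were joined, the distance between them, and the size of the new cluster. The sketch below shows this on four hypothetical points (chosen for illustration only) and then cuts the hierarchy into two flat clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Four hypothetical points forming two obvious pairs
points = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])

# One row per merge: [cluster_a, cluster_b, distance, new_cluster_size]
distance_matrix = linkage(points, method='ward', metric='euclidean')
print(distance_matrix.shape)
print(distance_matrix)

# Cut the hierarchy into two flat clusters
labels = fcluster(distance_matrix, 2, criterion='maxclust')
print(labels)
```

For n observations the linkage matrix has n - 1 rows, and the last row describes the final merge into a single cluster.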

### Hierarchical clustering: single method

Let us use the same footfall dataset and check if any changes are seen if we use a different method for clustering.

The data is stored in a Pandas data frame, comic_con. x_scaled and y_scaled are the column names of the standardized X and Y coordinates of people at a given point in time.

Instructions:

1. Import fcluster and linkage from scipy.cluster.hierarchy.
2. Use the single method in the linkage() function.

In [None]:
# Import the fcluster and linkage functions
from ____ import ____, ____

# Use the linkage() function
distance_matrix = ____(comic_con[[____, ____]], ____ = ____, metric = ____)

# Assign cluster labels
comic_con['cluster_labels'] = ____(____, ____, ____)

# Plot clusters
sns.scatterplot(x='x_scaled', y='y_scaled',
                hue='cluster_labels', data = comic_con)
plt.show()

In [None]:
# Import the fcluster and linkage functions
from scipy.cluster.hierarchy import fcluster, linkage

# Use the linkage() function
distance_matrix = linkage(comic_con[['x_scaled', 'y_scaled']], method = 'single', metric = 'euclidean')

# Assign cluster labels
comic_con['cluster_labels'] = fcluster(distance_matrix, 2, criterion='maxclust')

# Plot clusters
sns.scatterplot(x='x_scaled', y='y_scaled',
                hue='cluster_labels', data = comic_con)
plt.show()

### Hierarchical clustering: complete method

For the third and final time, let us use the same footfall dataset and check if any changes are seen if we use a different method for clustering.

The data is stored in a Pandas data frame, comic_con. x_scaled and y_scaled are the column names of the standardized X and Y coordinates of people at a given point in time.

Instructions:

1. Import fcluster and linkage from scipy.cluster.hierarchy.
2. Use the complete method in the linkage() function.

In [None]:
# Import the fcluster and linkage functions
____

# Use the linkage() function
distance_matrix = ____(____, ____, ____)

# Assign cluster labels
comic_con['cluster_labels'] = ____

# Plot clusters
sns.scatterplot(x='x_scaled', y='y_scaled',
                hue='cluster_labels', data = comic_con)
plt.show()

In [None]:
# Import the fcluster and linkage functions
from scipy.cluster.hierarchy import fcluster, linkage

# Use the linkage() function
distance_matrix = linkage(comic_con[['x_scaled', 'y_scaled']], method = 'complete', metric = 'euclidean')

# Assign cluster labels
comic_con['cluster_labels'] = fcluster(distance_matrix, 2, criterion='maxclust')

# Plot clusters
sns.scatterplot(x='x_scaled', y='y_scaled',
                hue='cluster_labels', data = comic_con)
plt.show()
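
To compare the three methods side by side without plotting, the sketch below runs ward, single, and complete linkage on the same hypothetical data (two synthetic blobs, not the comic_con data) and records the resulting cluster sizes. On data this well separated all three are expected to agree; differences between methods tend to show up when clusters overlap or are elongated.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Hypothetical stand-in for the footfall data: two well-separated blobs
rng = np.random.default_rng(42)
points = np.vstack([
    rng.normal(loc=[0, 0], scale=1.0, size=(30, 2)),
    rng.normal(loc=[20, 20], scale=1.0, size=(30, 2)),
])

cluster_sizes = {}
for method in ['ward', 'single', 'complete']:
    # Only the method argument changes between runs
    Z = linkage(points, method=method, metric='euclidean')
    labels = fcluster(Z, 2, criterion='maxclust')
    # fcluster labels start at 1, so drop the empty bin for label 0
    cluster_sizes[method] = sorted(np.bincount(labels)[1:].tolist())
    print(method, cluster_sizes[method])
```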

### Visualize clusters with matplotlib

We have discussed that visualizations are necessary to assess the clusters that are formed and spot trends in your data. Let us now focus on visualizing the footfall dataset from Comic-Con using the matplotlib module.

The data is stored in a Pandas data frame, comic_con. x_scaled and y_scaled are the column names of the standardized X and Y coordinates of people at a given point in time. cluster_labels has the cluster labels. A linkage object is stored in the variable distance_matrix.

Instructions:

1. Import the pyplot module from the matplotlib library as plt.
2. Define a colors dictionary for two cluster labels, 1 and 2.
3. Plot a scatter plot with colors for each cluster as defined by the colors dictionary.

In [None]:
# Import the pyplot module
____

# Define a colors dictionary for clusters
colors = {____:'red', ____:'blue'}

# Plot a scatter plot
comic_con.plot.scatter(x=____,
                       y=____,
                       c=comic_con['cluster_labels'].apply(____))
plt.show()

In [None]:
# Import the pyplot module
from matplotlib import pyplot as plt

# Define a colors dictionary for clusters
colors = {1:'red', 2:'blue'}

# Plot a scatter plot
comic_con.plot.scatter(x='x_scaled',
                       y='y_scaled',
                       c=comic_con['cluster_labels'].apply(lambda x: colors[x]))
plt.show()
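
The c argument in the solution receives one color name per row. A small sketch of that lookup step, using a hypothetical four-row comic_con frame, shows what the .apply() call produces and that .map() with the dictionary is an equivalent, slightly shorter spelling.

```python
import pandas as pd

# Hypothetical four-row stand-in for the comic_con data frame
comic_con = pd.DataFrame({
    'x_scaled': [0.1, 1.9, 0.2, 2.1],
    'y_scaled': [0.3, 1.7, 0.1, 2.0],
    'cluster_labels': [1, 2, 1, 2],
})

colors = {1: 'red', 2: 'blue'}

# Look up a color for every row, as in the solution above
point_colors = comic_con['cluster_labels'].apply(lambda x: colors[x])

# Equivalent lookup using the dictionary directly
mapped_colors = comic_con['cluster_labels'].map(colors)

print(point_colors.tolist())
```

Either Series can be passed as c= to comic_con.plot.scatter().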

### Visualize clusters with seaborn

Let us now visualize the footfall dataset from Comic-Con using the seaborn module. Visualizing clusters using seaborn is easier with the built-in hue argument for cluster labels.

The data is stored in a Pandas data frame, comic_con. x_scaled and y_scaled are the column names of the standardized X and Y coordinates of people at a given point in time. cluster_labels has the cluster labels. A linkage object is stored in the variable distance_matrix.

Instructions:

1. Import the seaborn module as sns.
2. Plot a scatter plot using the scatterplot() function of seaborn, with the cluster labels as the hue argument.

In [None]:
# Import the seaborn module
____

# Plot a scatter plot using seaborn
sns.scatterplot(x=____,
                y=____,
                ____=____,
                data = comic_con)
plt.show()

In [None]:
# Import the seaborn module
import seaborn as sns

# Plot a scatter plot using seaborn
sns.scatterplot(x='x_scaled',
                y='y_scaled',
                hue='cluster_labels',
                data=comic_con)
plt.show()