# K Means Clustering Project

This project uses `K Means Clustering` to try to cluster **universities** into two groups, **Private** and **Public**.

**Important: we actually have the labels for this data set, but we will NOT use them to fit the `K Means Clustering` algorithm, since it is an unsupervised learning algorithm.**

Under normal circumstances, we use the `K Means` algorithm because we don't have labels. In this case we will use the labels only to get an idea of how well the algorithm performed. Since labels are usually unavailable, the classification report and confusion matrix at the end of this project don't truly make sense in a real-world setting!

## The Data

> `College_Data.csv` with 777 observations on the following 18 variables:
* `Private` A factor with levels No and Yes indicating private or public university
* `Apps` Number of applications received
* `Accept` Number of applications accepted
* `Enroll` Number of new students enrolled
* `Top10perc` Pct. new students from top 10% of H.S. class
* `Top25perc` Pct. new students from top 25% of H.S. class
* `F.Undergrad` Number of fulltime undergraduates
* `P.Undergrad` Number of parttime undergraduates
* `Outstate` Out-of-state tuition
* `Room.Board` Room and board costs
* `Books` Estimated book costs
* `Personal` Estimated personal spending
* `PhD` Pct. of faculty with Ph.D.’s
* `Terminal` Pct. of faculty with terminal degree
* `S.F.Ratio` Student/faculty ratio
* `perc.alumni` Pct. alumni who donate
* `Expend` Instructional expenditure per student
* `Grad.Rate` Graduation rate

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

%matplotlib inline

## Loading Data

In [None]:
# Read in the College_Data.csv file and set the first column as the index
universities = pd.read_csv('College_Data.csv', index_col=0)

In [None]:
# Check the head of the data
universities.head()

In [None]:
# Check the info() method on the data
universities.info()

In [None]:
# Check the describe() on the data
universities.describe()

## Exploratory Data Analysis

In [None]:
# Create a scatterplot of `Grad.Rate` versus `Room.Board` where
# the points are colored by the `Private` column
plt.figure(
    figsize=(6.5, 6.5), # Size of plot
    facecolor='#91EAD8', # The background color
)

sns.set(
    rc={
        'axes.facecolor':'#E0F9F4',
        'figure.facecolor':'#E0F9F4'
    }
)

sns.scatterplot(
    data=universities,
    x='Room.Board',
    y='Grad.Rate',
    hue='Private',
    alpha=0.9,
)

In [None]:
# Create a scatterplot of `F.Undergrad` versus `Outstate` where
# the points are colored by the `Private` column
plt.figure(
    figsize=(6.5, 6.5),
    facecolor='white'
)

sns.set(
    rc={
        'axes.facecolor':'#F5F5E3',
        'figure.facecolor':'white'
    }
)

sns.scatterplot(
    data=universities,
    x='Outstate',
    y='F.Undergrad',
    hue='Private',
    alpha=0.5
)

In [None]:
# Create a stacked histogram showing `Outstate` (Out of State Tuition)
# based on the `Private` column
sns.set_theme() # Use seaborn default theme

g = sns.FacetGrid(
    data=universities,
    height=4,
    aspect=2, # width = aspect * height
)
# multiple='stack' stacks the two `Private` groups; the default ('layer') overlays them
g.map_dataframe(sns.histplot, x='Outstate', hue='Private', multiple='stack')
g.set_xlabels(label='Outstate')

In [None]:
# Create two separate histograms showing `Outstate` (Out of State Tuition)
# based on the `Private` column
sns.set_theme() # Use seaborn default theme

g = sns.FacetGrid(
    data=universities,
    col='Private',
    height=4,
    aspect=2, # width = aspect * height
)
g.map_dataframe(sns.histplot, x='Outstate', hue='Private')
g.set_xlabels(label='Outstate')

In [None]:
# Create a stacked histogram showing `Grad.Rate` (Graduation Rate)
# based on the `Private` column
sns.set_theme() # Use seaborn default theme

g = sns.FacetGrid(
    data=universities,
    height=4,
    aspect=2, # width = aspect * height
)
# multiple='stack' stacks the two `Private` groups; the default ('layer') overlays them
g.map_dataframe(sns.histplot, x='Grad.Rate', hue='Private', multiple='stack')
g.set_xlabels(label='Graduation Rate')

<font color=magenta>Notice that one private school appears to have a graduation rate higher than 100%. Let's find that school!</font>

In [None]:
# Find school whose graduation rate is higher than 100%
universities[universities['Grad.Rate'] > 100]

<font color=magenta>Fix this anomaly by setting that school's `Grad.Rate` to 100, the largest valid percentage for a graduation rate.</font>

In [None]:
# Fix `Grad.Rate` > 100
universities.loc[universities['Grad.Rate'] > 100, 'Grad.Rate'] = 100
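
The boolean-mask assignment above caps only the offending rows. As an alternative, here is a minimal sketch using `Series.clip`, which caps every value above an upper bound in one call. The `toy` frame and its values are made up for illustration and merely stand in for `universities`.

```python
import pandas as pd

# Hypothetical toy frame standing in for `universities`; the values are made up
toy = pd.DataFrame(
    {'Grad.Rate': [60, 118, 100, 45]},
    index=['School A', 'School B', 'School C', 'School D'],
)

# Cap every value above 100 at 100; values that are already valid are unchanged
toy['Grad.Rate'] = toy['Grad.Rate'].clip(upper=100)

print(toy)
```

Both approaches give the same result here. `clip` is a little more compact, while the `.loc` version makes it explicit which rows are being modified.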

In [None]:
# Confirm that no university has a `Grad.Rate` higher than 100
# by recreating the stacked histogram showing `Grad.Rate` (Graduation Rate)
# based on the `Private` column
sns.set_theme() # Use seaborn default theme

g = sns.FacetGrid(
    data=universities,
    height=4,
    aspect=2, # width = aspect * height
)
# multiple='stack' stacks the two `Private` groups; the default ('layer') overlays them
g.map_dataframe(sns.histplot, x='Grad.Rate', hue='Private', multiple='stack')
g.set_xlabels(label='Graduation Rate')

## K Means Cluster Creation

In [None]:
from sklearn.cluster import KMeans

In [None]:
# Create an instance of a K Means model with 2 clusters
kmeans = KMeans(n_clusters=2)

In [None]:
# Fit the model to all the data except for the Private label
kmeans.fit(universities.drop('Private', axis=1))

In [None]:
# Check the cluster center vectors
kmeans.cluster_centers_

In [None]:
# Check the cluster labels that the kmeans model returned
kmeans.labels_

## Evaluation

<font color=magenta>Normally, real-world problems in the `unsupervised learning` domain do not come with true labels. In this project, however, we do have them, so we can compare the true labels with the clusters that the `K Means Clustering` model assigned and see how well the model did.</font>
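
For the usual case where no true labels exist, one common label-free check is the silhouette score, which compares how close each point is to its own cluster versus the nearest other cluster (values near 1 suggest well-separated clusters). The sketch below is not part of the original project. It runs on synthetic blobs rather than the college data, and the `n_init` and `random_state` values are arbitrary choices made for reproducibility.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Synthetic stand-in data (NOT the college data): two well-separated 2-D blobs
rng = np.random.default_rng(42)
blob_a = rng.normal(loc=[0, 0], scale=1.0, size=(100, 2))
blob_b = rng.normal(loc=[8, 8], scale=1.0, size=(100, 2))
X = np.vstack([blob_a, blob_b])

# n_init and random_state are illustrative choices, not values from this project
model = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)

# The silhouette score needs only the features and the assigned cluster IDs
score = silhouette_score(X, model.labels_)
print(f'Silhouette score: {score:.3f}')
```

On the real data the same call would presumably take the feature matrix passed to `kmeans.fit` and `kmeans.labels_`.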

In [None]:
# Add a column called 'Cluster' to the `universities` DataFrame,
# which is 1 for a private school and 0 for a public school.
universities['Cluster'] = universities['Private'] == 'Yes'
universities['Cluster'] = universities['Cluster'].astype(int)

In [None]:
universities

In [None]:
# Print out a classification report and a confusion matrix.
# Note: K Means cluster IDs are arbitrary, so its 0/1 may be swapped relative to 'Cluster'.
from sklearn.metrics import classification_report, confusion_matrix

print(classification_report(universities['Cluster'], kmeans.labels_))
print(confusion_matrix(universities['Cluster'], kmeans.labels_))
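
One caveat when reading this report: `K Means` numbers its clusters arbitrarily, so cluster `0` may correspond to private schools on one run and to public schools on another. If the IDs are swapped, the report looks far worse than the grouping really is. The sketch below shows one simple way to align two-cluster IDs with binary labels before scoring. The helper `align_binary_clusters` and the label arrays are hypothetical, made up for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels for illustration only (not taken from the college data)
true_labels = np.array([1, 1, 1, 0, 0, 1, 0, 1])
cluster_ids = np.array([0, 0, 0, 1, 1, 0, 1, 1])  # similar grouping, IDs swapped


def align_binary_clusters(y_true, y_cluster):
    """Return y_cluster, flipped (0 <-> 1) if the flip agrees better with y_true."""
    y_true = np.asarray(y_true)
    y_cluster = np.asarray(y_cluster)
    agreement = np.mean(y_true == y_cluster)
    return y_cluster if agreement >= 0.5 else 1 - y_cluster


aligned = align_binary_clusters(true_labels, cluster_ids)
print(confusion_matrix(true_labels, aligned))
```

In this project the same idea would presumably be applied to `universities['Cluster']` and `kmeans.labels_` before calling `classification_report`.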

<font color=magenta>Not so bad, considering the algorithm uses only the features to cluster the universities into 2 distinct groups! That is what `K Means` is useful for: clustering `unlabeled` data.</font>
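
A possible refinement, which this project does not apply: `K Means` relies on Euclidean distances, so columns with large values (such as `Apps` or `Expend`) can dominate columns with small values (such as `S.F.Ratio`). Standardizing the features first often changes the clusters. The sketch below illustrates the effect on synthetic data only. The column roles, the `agreement` helper, and all parameter values are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (NOT the college data)
rng = np.random.default_rng(0)
n = 100
groups = np.repeat([0, 1], n)  # the hidden grouping we hope to recover

# One large-scale column that carries no group information ...
big = rng.normal(loc=10_000, scale=3_000, size=2 * n)
# ... and two small-scale columns that do separate the groups
small_a = np.concatenate([rng.normal(10, 1, n), rng.normal(20, 1, n)])
small_b = np.concatenate([rng.normal(0.2, 0.05, n), rng.normal(0.6, 0.05, n)])
X = np.column_stack([big, small_a, small_b])

# K Means on raw features: distances are dominated by the large-scale column
raw_clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# K Means after standardizing each column to zero mean and unit variance
scaled_model = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=2, n_init=10, random_state=0),
)
scaled_clusters = scaled_model.fit_predict(X)


def agreement(y_true, y_cluster):
    """Fraction of matching labels, allowing for swapped cluster IDs."""
    acc = np.mean(y_true == y_cluster)
    return max(acc, 1 - acc)


print('raw features:   ', agreement(groups, raw_clusters))
print('scaled features:', agreement(groups, scaled_clusters))
```

Whether scaling helps on the college data is an empirical question. Comparing the evaluation above with and without a scaling step would be one way to find out.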