# K Means Clustering and KNN Project: Private vs. Public Universities


### The Universities Data:

We will use a data frame with 777 observations on the following 18 columns (features):
* Private: A factor with levels No and Yes indicating private or public university
* Apps: Number of applications received
* Accept: Number of applications accepted
* Enroll: Number of new students enrolled
* Top10perc: Pct. new students from top 10% of H.S. class
* Top25perc: Pct. new students from top 25% of H.S. class
* F.Undergrad: Number of fulltime undergraduates
* P.Undergrad: Number of parttime undergraduates
* Outstate: Out-of-state tuition
* Room.Board: Room and board costs
* Books: Estimated book costs
* Personal: Estimated personal spending
* PhD: Pct. of faculty with Ph.D.’s
* Terminal: Pct. of faculty with terminal degree
* S.F.Ratio: Student/faculty ratio
* perc.alumni: Pct. alumni who donate
* Expend: Instructional expenditure per student
* Grad.Rate: Graduation rate

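**Note: we actually have the labels for this data set, and we will use them only to compare the clusters against the known target label. We will NOT use them to fit the KMeans clustering algorithm, since that is an unsupervised machine learning algorithm.** In practice, K-means is applied when no labels are available: the algorithm is used to make sense of the data by grouping it into clusters.

The code below refers to the columns by lowercase, underscore-separated names (for example `private`, `grad_rate`, `room_board`), which is how they appear in the CSV file used here.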

Import libraries:

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Get the Data:

In [None]:
uni = pd.read_csv('../input/college-data/data.csv')

Check the head of the data:

In [None]:
uni.head()

Info and summary statistics of the data:

In [None]:
uni.info()

In [None]:
uni.describe()

**Note:** no data appears to be missing, so we can proceed. A heatmap of the null values gives a quick visual check:

In [None]:
plt.figure(figsize=(15,3))
sns.heatmap(uni.isnull(),yticklabels=False,cbar=False,cmap='viridis')
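
A heatmap is easy to misread on a wide frame, so a numeric count per column can complement it. The sketch below uses a small made-up frame named `toy` (an assumption standing in for `uni`) with one missing value planted on purpose:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `uni`; the values are made up for illustration.
toy = pd.DataFrame({
    "apps": [1660, 2186, np.nan],
    "accept": [1232, 1924, 1097],
})

# Count missing values per column; all zeros would mean nothing is missing.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# Single boolean answer for the whole frame.
has_missing = bool(toy.isnull().values.any())
print(has_missing)
```

On the real data the same two calls would be `uni.isnull().sum()` and `uni.isnull().values.any()`.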

# 1. Exploratory Data Analysis

Create a scatterplot of Grad.Rate versus Room.Board, hue by Private/Public ('Private' column):

In [None]:
# SEABORN SETTINGS
sns.set() # DEFAULTS
sns.set_style('whitegrid')
sns.set_palette("coolwarm")

In [None]:
sns.lmplot(x='room_board', y='grad_rate', data=uni, hue='private', palette='coolwarm')

**Obs:** the linear relationship appears stronger for the private universities.

Create a scatterplot of Outstate versus F.Undergrad, hue by Private/Public ('Private' column):

In [None]:
sns.lmplot(x='f_undergrad', y='outstate', data=uni, hue='private', palette='coolwarm')

**Obs:** again, the linear relationship appears stronger for the private universities.
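
The observations above are read off the plots by eye. One way to put a number on them is the Pearson correlation within each group. The helper `corr_by_group` and the toy frame below are hypothetical names and made-up values used only for illustration:

```python
import pandas as pd

def corr_by_group(df, group_col, x_col, y_col):
    """Return the Pearson correlation of x_col and y_col within each group."""
    return {
        name: group[x_col].corr(group[y_col])
        for name, group in df.groupby(group_col)
    }

# Made-up data: 'Yes' rows lie on a straight line, 'No' rows do not.
toy = pd.DataFrame({
    "private": ["Yes", "Yes", "Yes", "No", "No", "No"],
    "x": [1.0, 2.0, 3.0, 1.0, 2.0, 3.0],
    "y": [2.0, 4.0, 6.0, 3.0, 1.0, 3.0],
})
result = corr_by_group(toy, "private", "x", "y")
print(result)
```

On the real data the call would look like `corr_by_group(uni, 'private', 'room_board', 'grad_rate')`.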

Overlaid histograms of out-of-state tuition (Outstate), split by the Private column:

In [None]:
uni_priv=uni[uni['private']=='Yes']
uni_public=uni[uni['private']=='No']

In [None]:
plt.figure(figsize=(13,6))
# histplot replaces the deprecated distplot(kde=False); alpha keeps both groups visible
sns.histplot(uni_priv['outstate'], bins=20, color='red', alpha=0.5, label='Private')
sns.histplot(uni_public['outstate'], bins=20, color='blue', alpha=0.5, label='Public')
plt.legend()

Similar histogram for the Grad.Rate column:

In [None]:
plt.figure(figsize=(13,6))
# histplot replaces the deprecated distplot(kde=False); alpha keeps both groups visible
sns.histplot(uni_priv['grad_rate'], bins=20, color='red', alpha=0.5, label='Private')
sns.histplot(uni_public['grad_rate'], bins=20, color='blue', alpha=0.5, label='Public')
plt.legend()

**Obs:** there seems to be a private school with a graduation rate higher than 100%. Which school is that?

In [None]:
uni[uni['grad_rate'] > 100]

Set that school's graduation rate to 100%:

In [None]:
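# Select by condition instead of a hard-coded row position
uni.loc[uni['grad_rate'] > 100, 'grad_rate']

In [None]:
# A single .loc assignment avoids chained indexing, which may not modify `uni`
uni.loc[uni['grad_rate'] > 100, 'grad_rate'] = 100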

Replot the stacked histogram:

In [None]:
uni_priv=uni[uni['private']=='Yes']
uni_public=uni[uni['private']=='No']

plt.figure(figsize=(13,6))
# histplot replaces the deprecated distplot(kde=False); alpha keeps both groups visible
sns.histplot(uni_priv['grad_rate'], bins=20, color='red', alpha=0.5, label='Private')
sns.histplot(uni_public['grad_rate'], bins=20, color='blue', alpha=0.5, label='Public')
plt.legend()

# 2. K Means Clustering Model

In [None]:
from sklearn.cluster import KMeans
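kmeans = KMeans(n_clusters=2, random_state=42) # fixed seed (arbitrary value) so the run is reproducible
X = uni.drop('private', axis=1) # features only; the labels are not used for fitting
kmeans.fit(X)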

What are the cluster center vectors?

In [None]:
kmeans.cluster_centers_

The cluster labels predicted by the algorithm:

In [None]:
kmeans.labels_
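
To see where the centers and labels come from, here is a minimal NumPy sketch of the standard K-means iteration (Lloyd's algorithm): assign each point to its nearest center, then move each center to the mean of its points. It is an illustration on made-up data, not scikit-learn's implementation, which adds smarter initialization and multiple restarts. The function name `lloyd_kmeans` is hypothetical.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Minimal K-means (Lloyd's algorithm); returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random.
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each center moves to the mean of its assigned points.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# Two well-separated toy blobs (made-up data).
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centers, labels = lloyd_kmeans(X_toy, k=2)
print(centers)
print(labels)
```

Which blob ends up as cluster 0 depends on the random start, which is why cluster ids carry no meaning of their own.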

**Evaluation and Performance**:

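Comparison to known target labels (Private column in the uni dataframe):

Encode the Private column (Yes/No) as 1/0 using LabelEncoder so it can be compared with the cluster labels (the pandas apply function would work too). Note that K-means numbers its clusters arbitrarily, so cluster 1 does not necessarily correspond to 'Yes'; if the confusion matrix looks inverted, the two cluster ids are simply swapped.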

In [None]:
from sklearn.preprocessing import LabelEncoder
labelencoder = LabelEncoder()
uni['Cluster'] = labelencoder.fit_transform(uni['private'])

In [None]:
uni.head(5)

Confusion matrix and Classification report:

In [None]:
from sklearn.metrics import confusion_matrix,classification_report
print(confusion_matrix(uni['Cluster'],kmeans.labels_))
print(classification_report(uni['Cluster'],kmeans.labels_))
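
Because cluster ids are arbitrary, a report computed on the raw ids can understate the agreement. A minimal sketch of one way to handle this for two clusters: score both possible mappings and keep the better one. The helper name `best_binary_alignment` and the toy labels are assumptions for illustration.

```python
import numpy as np

def best_binary_alignment(y_true, cluster_ids):
    """Return (aligned_ids, accuracy) under the better of the two id mappings."""
    y_true = np.asarray(y_true)
    cluster_ids = np.asarray(cluster_ids)
    acc_as_is = np.mean(y_true == cluster_ids)
    acc_flipped = np.mean(y_true == 1 - cluster_ids)
    if acc_flipped > acc_as_is:
        return 1 - cluster_ids, acc_flipped
    return cluster_ids, acc_as_is

# Made-up labels: the grouping is mostly right, but the ids are swapped.
y_true = [1, 1, 1, 0, 0]
cluster_ids = [0, 0, 1, 1, 1]
aligned, acc = best_binary_alignment(y_true, cluster_ids)
print(aligned, acc)
```

On the real data the aligned ids from `best_binary_alignment(uni['Cluster'], kmeans.labels_)` could be passed to the confusion matrix and classification report instead of the raw labels.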

# 3. K Nearest Neighbors (KNN) Model

Using the same dataset, let's try to predict labels in the 'Private' column using KNN:

In [None]:
uni = pd.read_csv('../input/college-data/data.csv')
uni.head()

Label-encode the 'Private' column with pandas category codes (equivalent to LabelEncoder):

In [None]:
uni['private']=uni['private'].astype('category').cat.codes
uni['private']
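
It helps to know which code stands for which level. pandas sorts the categories of a string column, so 'No' becomes 0 and 'Yes' becomes 1. A small sketch on a made-up Series:

```python
import pandas as pd

# Made-up values standing in for the 'private' column.
private = pd.Series(["Yes", "No", "Yes", "Yes", "No"])
as_category = private.astype("category")

# Categories are stored in sorted order; their positions are the codes.
mapping = dict(enumerate(as_category.cat.categories))
codes = as_category.cat.codes

print(mapping)         # code -> original level
print(codes.tolist())  # encoded column
```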

In [None]:
uni.head()

### Standardize features (X) for better performance:

In [None]:
X=uni.drop('private',axis=1)
y=uni['private'] # OUR TARGET LABEL for predictions

In [None]:
from sklearn.preprocessing import StandardScaler
scaler=StandardScaler()
scaler.fit(X)
X_std=scaler.transform(X) # numpy array here

# Create a new DataFrame of standardized features, keeping the column names
X_2 = pd.DataFrame(X_std, columns=X.columns)
X_2.head()
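
Standardizing means replacing each value with its z-score, (x - mean) / std, computed per column. This matters for KNN because distances would otherwise be dominated by columns with large numeric ranges. The sketch below checks the formula against StandardScaler on a made-up array:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: the second column has a much larger scale than the first.
X_toy = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])

# z = (x - mean) / std per column; StandardScaler uses the population std (ddof=0),
# which is also NumPy's default.
manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)
scaled = StandardScaler().fit_transform(X_toy)

print(manual)
print(scaled)
```

After scaling, both columns have mean 0 and standard deviation 1, so they contribute on equal footing to a Euclidean distance.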

Train test split on standardized X and y (target labels):

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_2, y, test_size=0.33, random_state=42)
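
A common variation, shown here only as a sketch on random made-up data, is to split first and fit the scaler on the training rows alone, so that the test rows have no influence on the mean and standard deviation used for scaling. All names below are hypothetical and separate from the notebook's variables.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Random made-up features and binary labels.
rng = np.random.default_rng(0)
X_toy = rng.normal(loc=50.0, scale=10.0, size=(100, 3))
y_toy = rng.integers(0, 2, size=100)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.33, random_state=42
)

scaler = StandardScaler().fit(X_tr)   # statistics come from the training rows only
X_tr_std = scaler.transform(X_tr)
X_te_std = scaler.transform(X_te)     # test rows reuse the training mean and std

print(X_tr_std.mean(axis=0))  # approximately 0 by construction
print(X_te_std.mean(axis=0))  # close to 0, but not exactly
```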

### KNN Model building:

In [None]:
from sklearn.neighbors import KNeighborsClassifier
knn=KNeighborsClassifier(n_neighbors=2)
knn.fit(X_train, y_train)

In [None]:
pred = knn.predict(X_test)
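
Conceptually, a KNN prediction is a majority vote among the closest training points. The NumPy sketch below illustrates the idea on made-up data; it is not scikit-learn's implementation, and the function name `knn_predict` is hypothetical. With an even K such as 2 the vote can tie; this sketch breaks ties toward the smaller label, which is an arbitrary choice.

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k):
    """Predict the label of x_query by majority vote among its k nearest neighbours."""
    # Euclidean distance from the query to every training point.
    dists = np.linalg.norm(X_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]
    # Count votes per class; argmax returns the smaller label on a tie.
    votes = np.bincount(y_train[nearest])
    return int(votes.argmax())

# Made-up training data: class 0 near the origin, class 1 near (5, 5).
X_tr = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                 [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_tr = np.array([0, 0, 0, 1, 1, 1])

pred_a = knn_predict(X_tr, y_tr, np.array([0.5, 0.5]), k=3)
pred_b = knn_predict(X_tr, y_tr, np.array([5.5, 5.5]), k=3)
print(pred_a, pred_b)
```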

### Predictions and evaluation:

In [None]:
from sklearn.metrics import classification_report,confusion_matrix, accuracy_score
print(confusion_matrix(y_test,pred))
print('\n')
print(classification_report(y_test,pred))
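
The numbers in the classification report can be derived from the confusion matrix. A short sketch with made-up counts (not the results of this notebook), using scikit-learn's layout where rows are true classes and columns are predicted classes:

```python
import numpy as np

# Made-up 2x2 confusion matrix: rows = true class, columns = predicted class.
cm = np.array([[50, 10],
               [ 5, 35]])
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)   # of the predicted positives, how many are correct
recall = tp / (tp + fn)      # of the true positives, how many were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / cm.sum()

print(precision, recall, f1, accuracy)
```

These are the values reported for class 1; swapping the roles of the two classes gives the row for class 0.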

### Choosing an optimal K value:

Loop for both **Error Rate vs K** and **Accuracy Score vs K** for optimal choice of K:

In [None]:
error_rate = []
scores = []

for i in range(1,40): # check every K from 1 to 39
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train,y_train)
    pred_i = knn.predict(X_test)
    score=accuracy_score(y_test,pred_i)
    scores.append(score)
    error_rate.append(np.mean(pred_i != y_test)) # error rate = fraction of misclassified test samples (1 - accuracy)

In [None]:
sns.set_style('whitegrid') # the 'seaborn-whitegrid' matplotlib style name was removed in recent matplotlib releases
plt.figure(figsize=(10,6))
plt.plot(range(1,40),error_rate,color='blue', linestyle='dashed', marker='o',
         markerfacecolor='red', markersize=10)
plt.title('Error Rate vs. K Value')
plt.xlabel('K')
plt.ylabel('Error Rate')

In [None]:
sns.set_style('whitegrid') # the 'seaborn-whitegrid' matplotlib style name was removed in recent matplotlib releases
plt.figure(figsize=(10,6))
plt.plot(range(1,40),scores,color='blue', linestyle='dashed', marker='o',
         markerfacecolor='red', markersize=10)
plt.title('Accuracy Score vs. K Value')
plt.xlabel('K')
plt.ylabel('Accuracy Score')

**Obs:** based on the two plots, we choose K=7.
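
The loop above picks K by scoring on the test set. An alternative, sketched here on synthetic data from `make_classification` (an assumption, not the university data), is to compare K values by cross-validation so that the test set stays untouched until the final evaluation:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic binary classification data, for illustration only.
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

k_values = range(1, 21)
# Mean 5-fold cross-validated accuracy for each candidate K.
cv_scores = [
    cross_val_score(KNeighborsClassifier(n_neighbors=k), X_toy, y_toy, cv=5).mean()
    for k in k_values
]
best_k = k_values[int(np.argmax(cv_scores))]
print(best_k)
```

In the notebook the same loop could be run on `X_train` and `y_train` only.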

### Retraining with K=7:

In [None]:
knn=KNeighborsClassifier(n_neighbors=7)
knn.fit(X_train,y_train)

In [None]:
pred_7 = knn.predict(X_test)

In [None]:
from sklearn.metrics import classification_report,confusion_matrix, accuracy_score
print(confusion_matrix(y_test,pred_7))
print('\n')
print(classification_report(y_test,pred_7))

**Obs:** compared with K=2, accuracy went up by 3% on all categories, and the number of false positives and false negatives decreased significantly!