# Predicting Myopia

In [None]:
# Dependencies
import pandas as pd
from pathlib import Path
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

In [None]:
# Import the Data and show dataframe
file_path = Path("C:/Users/anico/Desktop/boot_camp_work/Challenges/UnsupervisedML/unsupervised-machine-learning-challenge/myopia.csv")
df = pd.read_csv(file_path)
df.head(5)

## Prepare the data
* Remove "MYOPIC" column
* Standardize the dataset so that columns with larger values do not dominate the outcome
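
To make the standardization step concrete, here is a minimal pure-Python sketch of the z-score that `StandardScaler` applies to each column by default (subtract the column mean, then divide by the population standard deviation). The two columns and the `standardize` helper are hypothetical and are not taken from `myopia.csv`.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Return z-scores: (x - mean) / population standard deviation."""
    mean = fmean(values)
    std = pstdev(values)  # population std (ddof=0)
    # Note: a constant column would give std == 0; this sketch does not handle that case.
    return [(x - mean) / std for x in values]


# Hypothetical columns on very different scales (illustrative values only)
small_scale = [5.0, 6.0, 7.0, 8.0, 9.0]
large_scale = [100.0, 300.0, 500.0, 700.0, 900.0]

small_scaled = standardize(small_scale)
large_scaled = standardize(large_scale)

print(small_scaled)
print(large_scaled)
```

After scaling, both columns have mean 0 and standard deviation 1, so neither one dominates later variance-based or distance-based steps simply because of its units.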

In [None]:
# Remove the MYOPIC column
new_df = df.drop(columns=["MYOPIC"])
new_df.head(5)

In [None]:
# Standardize data using scaler
scaler = StandardScaler()
scaled_data = scaler.fit_transform(new_df)

In [None]:
# Name the resulting scaled data X
# (fit_transform returns a NumPy array, not a DataFrame)
X = scaled_data
X

## Apply Dimensionality Reduction

Rather than specifying the number of principal components when you instantiate the PCA model, state the desired explained variance. For example, say that a dataset has 100 features. Using `PCA(n_components=0.99)` creates a model that keeps as many principal components as are needed to preserve approximately 99% of the variance, whether that means reducing the dataset to 80 principal components or 3.

For this assignment, preserve 90% of the explained variance in dimensionality reduction.
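
As an illustration of how a fractional `n_components` translates into a number of components, the sketch below picks the smallest number of leading components whose cumulative explained variance ratio exceeds the target. The ratios and the `components_needed` helper are made up for illustration and only approximate the selection rule. With a fitted model, the actual ratios are available in `pca.explained_variance_ratio_` and the chosen count in `pca.n_components_`.

```python
from itertools import accumulate


def components_needed(variance_ratios, target):
    """Smallest number of leading components whose cumulative
    explained variance ratio exceeds `target`."""
    for count, cumulative in enumerate(accumulate(variance_ratios), start=1):
        if cumulative > target:
            return count
    # Fall back to all components if the target is never exceeded
    return len(variance_ratios)


# Hypothetical explained variance ratios, sorted from largest to smallest
ratios = [0.40, 0.25, 0.15, 0.08, 0.07, 0.05]

n_90 = components_needed(ratios, 0.90)
n_99 = components_needed(ratios, 0.99)
print(n_90, n_99)
```

With these made-up ratios, a 90% target keeps fewer components than a 99% target, which is the trade-off the threshold controls.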


In [None]:
# Perform dimensionality reduction with PCA. 
# How did the number of features change?

# Initialize the PCA model, preserving 90% of the explained variance
pca = PCA(n_components=0.90)

myopia_pca = pca.fit_transform(X)

# Transform the PCA data into a DataFrame
df_myopia_pca = pd.DataFrame(data=myopia_pca)
df_myopia_pca.head()

After performing dimensionality reduction with PCA (and preserving 90% of the explained variance), the number of features went from 14 to 10.

In [None]:
# Further reduce the dataset dimensions with t-SNE and visually inspect the results
# run t-SNE on the principal components (the output of the PCA transformation)
tsne = TSNE(learning_rate=250)
tsne_features = tsne.fit_transform(myopia_pca)

In [None]:
tsne_features.shape

In [None]:
# Create a scatter plot of the t-SNE output.
# Are there distinct clusters?
plt.scatter(tsne_features[:, 0], tsne_features[:, 1])

# Save before plt.show(); saving afterwards typically writes a blank image
plt.savefig("Output/Fig01.png")
plt.show()

No, there are no distinct clusters.

## Perform a Cluster Analysis with K-means
Create an elbow plot to identify the best number of clusters
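
For reference, the inertia that the elbow plot tracks is the sum of squared distances from each sample to its closest cluster center. The sketch below computes it by hand for a tiny made-up dataset. The `inertia_of` helper, the points, and the centroids are all hypothetical, and the centroids are fixed by hand rather than fitted by K-means.

```python
def inertia_of(points, centroids):
    """Sum of squared Euclidean distances from each point to its nearest centroid."""
    total = 0.0
    for point in points:
        total += min(
            sum((p - c) ** 2 for p, c in zip(point, centroid))
            for centroid in centroids
        )
    return total


# Two well-separated pairs of points (illustrative values only)
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

# One centroid in the middle versus one centroid per pair
one_cluster = inertia_of(points, [(5.0, 1.0)])
two_clusters = inertia_of(points, [(0.0, 1.0), (10.0, 1.0)])
print(one_cluster, two_clusters)
```

When real groups exist, inertia drops sharply once k reaches the number of groups and flattens afterwards. That sharp bend is the elbow to look for.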

In [None]:
# Use a for loop to determine the inertia for each k from 1 to 10.
# If possible, determine where the elbow of the plot is, and at which value of k it appears.

# Identify the best number of clusters using the elbow curve
inertia = []
k = list(range(1, 11))

# Fit a K-means model for each candidate k and record its inertia
for i in k:
    km = KMeans(n_clusters=i, random_state=0)
    km.fit(df_myopia_pca)
    inertia.append(km.inertia_)

# Define a DataFrame to plot the elbow curve with Matplotlib
elbow_data = {"k": k, "inertia": inertia}
df_elbow = pd.DataFrame(elbow_data)

plt.plot(df_elbow["k"], df_elbow["inertia"])
plt.xticks(range(1, 11))
plt.xlabel("Number of clusters")
plt.ylabel("Inertia")

# Save before plt.show(); saving afterwards typically writes a blank image
plt.savefig("Output/Fig02.png")
plt.show()

## Recommendation
State a brief conclusion on whether patients can be clustered together and, if so, into how many clusters. Support it with findings.

Based on the models, it does not appear that patients can be clustered in a way that would help predict myopia. After performing PCA, a large number of features (10 of the original 14) are still needed to preserve 90% of the explained variance. Additionally, after running t-SNE on the principal components, the scatter plot shows no clear clusters in the data. Finally, the elbow plot from the K-means models does not show a clear elbow that would indicate distinct groups in the data.