### **Instructions**

This activity is broken down into four parts:

- Part 1: Prepare the Data.
- Part 2: Apply Dimensionality Reduction.
- Part 3: Perform a Cluster Analysis with K-means.
- Part 4: Make a Recommendation.

In [24]:
# Initial imports
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

### **Part 1: Prepare the Data**

1. Read `myopia.csv` into a Pandas DataFrame.
    - **Note:** This file can be found in your Module 20 Challenge files.
2. Remove the "MYOPIC" column from the dataset.
    - **Note:** The target column is needed for supervised machine learning, but it will make an unsupervised model biased. After all, the target column is effectively providing clusters already!
3. Standardize your dataset so that columns that contain larger values do not influence the outcome more than columns with smaller values.

In [4]:
df_myopia = pd.read_csv("myopia.csv")
df_myopia.head(10)

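|   | AGE | SPHEQ | AL | ACD | LT | VCD | SPORTHR | READHR | COMPHR | STUDYHR | TVHR | DIOPTERHR | MOMMY | DADMY | MYOPIC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | -0.052 | 21.889999 | 3.69 | 3.498 | 14.7 | 45 | 8 | 0 | 0 | 10 | 34 | 1 | 1 | 1 |
| 1 | 6 | 0.608 | 22.379999 | 3.702 | 3.392 | 15.29 | 4 | 0 | 1 | 1 | 7 | 12 | 1 | 1 | 0 |
| 2 | 6 | 1.179 | 22.49 | 3.462 | 3.514 | 15.52 | 14 | 0 | 2 | 0 | 10 | 14 | 0 | 0 | 0 |
| 3 | 6 | 0.525 | 22.200001 | 3.862 | 3.612 | 14.73 | 18 | 11 | 0 | 0 | 4 | 37 | 0 | 1 | 1 |
| 4 | 5 | 0.697 | 23.290001 | 3.676 | 3.454 | 16.16 | 14 | 0 | 0 | 0 | 4 | 4 | 1 | 0 | 0 |
| 5 | 6 | 1.744 | 22.139999 | 3.224 | 3.556 | 15.36 | 10 | 6 | 2 | 1 | 19 | 44 | 0 | 1 | 0 |
| 6 | 6 | 0.683 | 22.33 | 3.186 | 3.654 | 15.49 | 12 | 7 | 2 | 1 | 8 | 36 | 0 | 1 | 0 |
| 7 | 6 | 1.272 | 22.389999 | 3.732 | 3.584 | 15.08 | 12 | 0 | 0 | 0 | 8 | 8 | 0 | 0 | 0 |
| 8 | 7 | 1.396 | 22.620001 | 3.464 | 3.408 | 15.74 | 4 | 0 | 3 | 1 | 3 | 12 | 0 | 0 | 0 |
| 9 | 6 | 0.972 | 22.74 | 3.504 | 3.696 | 15.54 | 30 | 5 | 1 | 0 | 10 | 27 | 0 | 0 | 0 |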


In [5]:
df_myopia.dtypes

AGE            int64
SPHEQ        float64
AL           float64
ACD          float64
LT           float64
VCD          float64
SPORTHR        int64
READHR         int64
COMPHR         int64
STUDYHR        int64
TVHR           int64
DIOPTERHR      int64
MOMMY          int64
DADMY          int64
MYOPIC         int64
dtype: object

In [6]:
df_myopia = df_myopia.drop(columns=["MYOPIC"])
df_myopia.head(10)

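|   | AGE | SPHEQ | AL | ACD | LT | VCD | SPORTHR | READHR | COMPHR | STUDYHR | TVHR | DIOPTERHR | MOMMY | DADMY |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | -0.052 | 21.889999 | 3.69 | 3.498 | 14.7 | 45 | 8 | 0 | 0 | 10 | 34 | 1 | 1 |
| 1 | 6 | 0.608 | 22.379999 | 3.702 | 3.392 | 15.29 | 4 | 0 | 1 | 1 | 7 | 12 | 1 | 1 |
| 2 | 6 | 1.179 | 22.49 | 3.462 | 3.514 | 15.52 | 14 | 0 | 2 | 0 | 10 | 14 | 0 | 0 |
| 3 | 6 | 0.525 | 22.200001 | 3.862 | 3.612 | 14.73 | 18 | 11 | 0 | 0 | 4 | 37 | 0 | 1 |
| 4 | 5 | 0.697 | 23.290001 | 3.676 | 3.454 | 16.16 | 14 | 0 | 0 | 0 | 4 | 4 | 1 | 0 |
| 5 | 6 | 1.744 | 22.139999 | 3.224 | 3.556 | 15.36 | 10 | 6 | 2 | 1 | 19 | 44 | 0 | 1 |
| 6 | 6 | 0.683 | 22.33 | 3.186 | 3.654 | 15.49 | 12 | 7 | 2 | 1 | 8 | 36 | 0 | 1 |
| 7 | 6 | 1.272 | 22.389999 | 3.732 | 3.584 | 15.08 | 12 | 0 | 0 | 0 | 8 | 8 | 0 | 0 |
| 8 | 7 | 1.396 | 22.620001 | 3.464 | 3.408 | 15.74 | 4 | 0 | 3 | 1 | 3 | 12 | 0 | 0 |
| 9 | 6 | 0.972 | 22.74 | 3.504 | 3.696 | 15.54 | 30 | 5 | 1 | 0 | 10 | 27 | 0 | 0 |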


In [7]:
for column in df_myopia.columns:
    print(f"Column {column} has {df_myopia[column].isnull().sum()} null values")


Column AGE has 0 null values
Column SPHEQ has 0 null values
Column AL has 0 null values
Column ACD has 0 null values
Column LT has 0 null values
Column VCD has 0 null values
Column SPORTHR has 0 null values
Column READHR has 0 null values
Column COMPHR has 0 null values
Column STUDYHR has 0 null values
Column TVHR has 0 null values
Column DIOPTERHR has 0 null values
Column MOMMY has 0 null values
Column DADMY has 0 null values
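
The per-column loop above can also be written as a single vectorized call. The following is a minimal sketch on a small stand-in frame; the column names mirror the dataset, but the values (including the missing one) are made up for illustration.

```python
import pandas as pd

# Stand-in frame: names mirror the dataset, values are invented for this sketch.
sample = pd.DataFrame({"AGE": [6, 6, None], "SPHEQ": [-0.052, 0.608, 1.179]})

# isnull() marks missing cells; sum() then counts them per column.
null_counts = sample.isnull().sum()
print(null_counts)
```

On the real data, `df_myopia.isnull().sum()` would give the same counts as the loop in one Series.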


In [8]:
print(f"Duplicate entries: {df_myopia.duplicated().sum()}")

Duplicate entries: 0


In [9]:
# Standardize every remaining column to zero mean and unit variance
scaler = StandardScaler()
scaled_data = scaler.fit_transform(df_myopia)

In [12]:
df_myopia.columns

Index(['AGE', 'SPHEQ', 'AL', 'ACD', 'LT', 'VCD', 'SPORTHR', 'READHR', 'COMPHR',
       'STUDYHR', 'TVHR', 'DIOPTERHR', 'MOMMY', 'DADMY'],
      dtype='object')

In [13]:
new_df_myopia = pd.DataFrame(scaled_data, columns=df_myopia.columns)

new_df_myopia.head()

|   | AGE | SPHEQ | AL | ACD | LT | VCD | SPORTHR | READHR | COMPHR | STUDYHR | TVHR | DIOPTERHR | MOMMY | DADMY |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.420219 | -1.363917 | -0.892861 | 0.483784 | -0.281443 | -1.019792 | 4.150661 | 1.69745 | -0.689311 | -0.672996 | 0.184058 | 0.498304 | 0.987138 | 1.003241 |
| 1 | -0.420219 | -0.308612 | -0.17184 | 0.53591 | -0.967997 | -0.130763 | -0.998898 | -0.912062 | -0.361875 | -0.221409 | -0.340932 | -0.875088 | 0.987138 | 1.003241 |
| 2 | -0.420219 | 0.604386 | -0.009977 | -0.506628 | -0.177812 | 0.215809 | 0.257092 | -0.912062 | -0.034439 | -0.672996 | 0.184058 | -0.750234 | -1.01303 | -0.996769 |
| 3 | -0.420219 | -0.441325 | -0.436703 | 1.230936 | 0.456927 | -0.974587 | 0.759488 | 2.676017 | -0.689311 | -0.672996 | -0.865922 | 0.685585 | -1.01303 | 1.003241 |
| 4 | -1.823978 | -0.166306 | 1.167204 | 0.42297 | -0.566427 | 1.180178 | 0.257092 | -0.912062 | -0.689311 | -0.672996 | -0.865922 | -1.374503 | 0.987138 | -0.996769 |
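
A quick way to confirm that standardization worked is to check that each scaled column has a mean near 0 and a standard deviation near 1. The sketch below uses synthetic stand-in data rather than `myopia.csv`; the column names are borrowed from the dataset, and the values are randomly generated for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with columns on very different scales (values are made up).
rng = np.random.default_rng(0)
raw = pd.DataFrame({
    "AL": rng.normal(22.5, 0.7, size=200),
    "SPORTHR": rng.integers(0, 45, size=200),
})

scaled = pd.DataFrame(StandardScaler().fit_transform(raw), columns=raw.columns)

# StandardScaler divides by the population standard deviation, so use ddof=0.
print(scaled.mean().round(6))
print(scaled.std(ddof=0).round(6))
```

The same two checks could be run on `new_df_myopia` in the notebook.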


### **Part 2: Apply Dimensionality Reduction**

1. Perform dimensionality reduction with PCA. How did the number of features change?
    
    # **HINT**
    
2. Further reduce the dataset dimensions with t-SNE and visually inspect the results. To do this, run t-SNE on the principal components, which is the output of the PCA transformation.
3. Create a scatter plot of the t-SNE output. Are there distinct clusters?

### **Dimensionality Reduction (40 points)**

- PCA model is created and used to reduce dimensions of the scaled dataset (10 points)
- PCA model’s explained variance is set to 90% (0.9) (5 points)
- The shape of the reduced dataset is examined for reduction in number of features (5 points)
- t-SNE model is created and used to reduce dimensions of the scaled dataset (10 points)
- t-SNE is used to create a plot of the reduced features (10 points)
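
The cells that follow keep a fixed two components, which by their own output explain only about 37% of the variance (0.212 + 0.157). The rubric's 90% criterion could instead be met by passing a float to `n_components`. This is a minimal sketch on synthetic stand-in data with 14 features; the data and variable names are assumptions, not part of the assignment files.

```python
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 300 samples, 14 features (same width as the scaled dataset).
X, _ = make_blobs(n_samples=300, n_features=14, centers=3,
                  cluster_std=5.0, random_state=0)
X_scaled = StandardScaler().fit_transform(X)

# A float in (0, 1) asks PCA to keep enough components to exceed that
# fraction of the total variance.
pca_90 = PCA(n_components=0.9)
X_pca = pca_90.fit_transform(X_scaled)

# Comparing shapes shows how many features were dropped.
print(X_scaled.shape, "->", X_pca.shape)
print("Cumulative explained variance:", pca_90.explained_variance_ratio_.sum())
```

In the notebook, the equivalent call would presumably be `PCA(n_components=0.9).fit_transform(new_df_myopia)`, followed by a look at the `.shape` of the result.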

In [15]:
pca = PCA(n_components=2)

# Get two principal components for the data.
myopia_pca = pca.fit_transform(new_df_myopia)
myopia_pca

array([[ 0.53550271,  1.14500427],
       [-0.62470559, -1.57578643],
       [-0.93347937, -0.71707622],
       ...,
       [-0.89008202, -2.3080052 ],
       [-1.12399979,  0.45188978],
       [-0.69153391, -0.73704619]])

In [16]:
df_myopia_pca = pd.DataFrame(
    data=myopia_pca, columns=["principal component 1", "principal component 2"]
)
df_myopia_pca.head()

|   | principal component 1 | principal component 2 |
|---|---|---|
| 0 | 0.535503 | 1.145004 |
| 1 | -0.624706 | -1.575786 |
| 2 | -0.933479 | -0.717076 |
| 3 | 0.106354 | 1.192475 |
| 4 | -0.388503 | -2.839655 |


In [17]:
# Fetch the explained variance
pca.explained_variance_ratio_

array([0.21177355, 0.15659716])
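
Part 2 also asks for a t-SNE reduction of the principal components and a scatter plot of the result, which the cells above do not include. The sketch below shows one way that step might look. It runs on synthetic stand-in data, and the `perplexity` and `random_state` values are illustrative assumptions rather than settings taken from the assignment.

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled myopia features (values are made up).
X, _ = make_blobs(n_samples=300, n_features=14, centers=3,
                  cluster_std=5.0, random_state=0)
X_pca = PCA(n_components=0.9).fit_transform(StandardScaler().fit_transform(X))

# t-SNE embeds the principal components in two dimensions for plotting.
tsne = TSNE(n_components=2, perplexity=30, random_state=0)
X_tsne = tsne.fit_transform(X_pca)

# Tight, separated groups of points would suggest distinct clusters.
plt.scatter(X_tsne[:, 0], X_tsne[:, 1], s=10)
plt.title("t-SNE of principal components")
plt.xlabel("t-SNE 1")
plt.ylabel("t-SNE 2")
plt.show()
```

If the plot of the real data shows one diffuse cloud rather than separated groups, that is worth stating in the Part 4 recommendation.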

### **Part 3: Perform a Cluster Analysis with K-means**

Create an elbow plot to identify the best number of clusters. Make sure to do the following:

- Use a `for` loop to determine the inertia for each `k` from 1 through 10.
- If possible, determine where the elbow of the plot is, and at which value of `k` it appears.

### **Clustering (30 points)**

- A K-means model is created (10 points)
- A `for` loop is used to create a list of inertias for each k from 1 to 10, inclusive (5 points)
- A plot is created to examine any elbows that exist (10 points)
- States a brief (1-2 sentence) conclusion on whether patients can be clustered together, and supports it with findings (10 points)

In [21]:
# Initialize the K-means model with k = 2 clusters
model = KMeans(n_clusters=2, random_state=5)

model.fit(df_myopia)

KMeans(n_clusters=2, random_state=5)

In [22]:
predictions = model.predict(df_myopia)
print(predictions)

[1 0 0 1 0 1 1 0 0 0 1 0 0 0 0 0 0 1 1 0 1 0 0 1 1 1 0 0 0 0 0 1 0 0 0 1 1
 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 1 1 0 0 0 0 1 0 1 1
 0 0 0 1 1 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 0 0 0 0 1 0 1 1 1 1 0 1 0 0 1
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 1 1 0 0 0 0 0 1 0 0 1 0 0 0
 1 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1 0 0 1 1 0 1 0 0 0 1 0 0 1 0 0
 1 0 0 1 1 0 0 0 1 1 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 1 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 1 0 1 0 0 0 1 0 0 0 0 1 0 1 0
 0 0 1 0 0 0 1 0 0 0 1 0 0 1 1 0 0 0 0 0 0 0 0 1 1 1 0 1 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0
 0 0 0 0 1 1 0 1 0 0 1 0 0 0 0 1 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1
 1 1 1 0 0 0 1 0 0 0 0 0 0 1 1 0 0 1 0 0 1 0 0 1 0 0 0 0 0 0 1 0 1 0 0 0 1
 0 0 0 0 0 1 1 1 1 0 0 0 1 0 0 0 0 0 0 0 1 0 0 1 0 1 0 1 0 0 1 0 0 0 0 0 0
 0 0 1 0 0 0 1 0 1 1 0 0 
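
The raw label array is hard to read at a glance; counting the members of each cluster gives a quicker summary. The sketch below uses synthetic stand-in data. Note that the cell above fits on `df_myopia`, the unscaled frame; because K-means is distance-based, fitting on the standardized features may be the better choice.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data (values are made up, not the myopia records).
X, _ = make_blobs(n_samples=300, n_features=14, centers=2, random_state=5)

# n_init is set explicitly so the behavior does not depend on the library default.
model = KMeans(n_clusters=2, n_init=10, random_state=5)
labels = model.fit_predict(X)

# np.bincount tallies how many samples received each cluster label.
sizes = np.bincount(labels)
for cluster_id, size in enumerate(sizes):
    print(f"Cluster {cluster_id}: {size} samples")
```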

In [None]:
# Imported here as well so the cell runs even if the initial imports cell was skipped
import matplotlib.pyplot as plt

inertia = []
k = list(range(1, 11))

# Calculate the inertia for each k from 1 through 10
for i in k:
    km = KMeans(n_clusters=i, random_state=0)
    km.fit(df_myopia_pca)
    inertia.append(km.inertia_)

# Create the elbow curve
elbow_data = {"k": k, "inertia": inertia}
df_elbow = pd.DataFrame(elbow_data)

plt.plot(df_elbow['k'], df_elbow['inertia'])
plt.xticks(k)
plt.title('Elbow Curve')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.show()
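
When the elbow is hard to judge by eye, the percentage drop in inertia from one `k` to the next can help: the elbow tends to sit where the drops level off. This is a minimal sketch on synthetic stand-in data; the `pct_drop` column name is a hypothetical choice for this example.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in with three groups (values are made up).
X, _ = make_blobs(n_samples=300, n_features=2, centers=3, random_state=0)

k_values = list(range(1, 11))
inertia = [
    KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
    for k in k_values
]

df_elbow = pd.DataFrame({"k": k_values, "inertia": inertia})
# Percentage drop in inertia relative to the previous k; the first row is NaN.
df_elbow["pct_drop"] = -df_elbow["inertia"].pct_change() * 100
print(df_elbow.round(2))
```

A table like this, built from the notebook's own `inertia` list, can back up the one or two sentence conclusion the rubric asks for.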

### **Part 4: Make a Recommendation**

Based on your findings, write up a brief (one or two sentences) recommendation for your supervisor in your Jupyter Notebook. Can the patients be clustered? If so, into how many clusters?