# Project 1
### Jacob Hofer, Flint Morgan, Joseph Winjum, Keith Filler
---
## Loading data
Below is the code that loads our data from the CSV file into a Pandas DataFrame.

We drop the columns that contain semantic data, such as the product name, because we do not want to cluster on those columns.

We also drop the GFLOPS columns, as those are specific to GPUs, and we want to include both GPUs and CPUs.

Lastly, we drop any rows that have missing data. 

We cannot fill the missing values with the column mean, because die size, transistor count, and frequency increase quickly with release year, so a single column-wide mean could misrepresent chips from any particular year.
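
To illustrate why a column-wide mean is a poor fill value for a feature that trends with release year, here is a minimal sketch on synthetic data (not our dataset). The starting value, the doubling period, and the year range are assumptions made only for this illustration.

```python
import numpy as np

# Synthetic example: transistor count (millions) doubling every two years.
years = np.arange(2000, 2021)
transistors = 50.0 * 2.0 ** ((years - 2000) / 2.0)

column_mean = transistors.mean()

# Relative error if a missing value were replaced by the column mean.
err_oldest = abs(column_mean - transistors[0]) / transistors[0]
err_newest = abs(column_mean - transistors[-1]) / transistors[-1]

print(f"column mean: {column_mean:.0f} million")
print(f"relative error for the {years[0]} chip: {err_oldest:.1f}x")
print(f"relative error for the {years[-1]} chip: {err_newest:.2f}x")
```

In this toy setting the mean overshoots the oldest chip by orders of magnitude and undershoots the newest one, which is the kind of distortion we want to avoid.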

In [6]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.read_csv("chip_dataset.csv")

# Keep the Type column for later reference before dropping it
Type = df['Type']
# Drop semantic data
df = df.drop(columns=["Unnamed: 0", "Type", "Foundry", "Vendor", "Product", "Release Date"])
# GFLOPS are specific to GPUs, so we exclude them here so we can also look at CPUs
df = df.drop(columns=["FP16 GFLOPS", "FP32 GFLOPS", "FP64 GFLOPS"])
df = df.dropna()
display(df)


Unnamed: 0,Process Size (nm),TDP (W),Die Size (mm^2),Transistors (million),Freq (MHz)
0,65.0,45.0,77.0,122.0,2200.0
1,14.0,35.0,192.0,4800.0,3200.0
3,22.0,80.0,160.0,1400.0,1800.0
4,45.0,125.0,258.0,758.0,3700.0
5,22.0,95.0,160.0,1400.0,2400.0
...,...,...,...,...,...
4844,40.0,150.0,334.0,2154.0,700.0
4845,40.0,20.0,80.0,10.0,416.0
4846,28.0,21.0,68.0,302.0,550.0
4849,40.0,75.0,332.0,1950.0,450.0


## Part 2
---
### 2.1 What is the multivariate mean of the numerical data matrix (where categorical data have been converted to numerical values)?

In [7]:
multivariate_mean = np.mean(df, axis=0)
print(multivariate_mean)

Process Size (nm)          53.048510
TDP (W)                    83.740795
Die Size (mm^2)           200.003799
Transistors (million)    2163.295441
Freq (MHz)               1507.964641
dtype: float64


### 2.2 What is the covariance matrix of the numerical data matrix (where categorical data have been converted to numerical values)?

In [8]:
# Sample covariance matrix, rounded to 2 decimals for readability
np.round(np.cov(df.T), 2)
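
As a sanity check on what `np.cov` returns, the sketch below recomputes the sample covariance from its definition, $\Sigma = \frac{1}{n-1} Z^T Z$ with $Z$ the mean-centered data matrix. It uses the first three attributes of the first four rows displayed earlier; the variable names are ours and only illustrative.

```python
import numpy as np

# Rows are data points, columns are attributes
# (Process Size, TDP, Die Size for the first four rows shown above).
X = np.array([[65.0, 45.0, 77.0],
              [14.0, 35.0, 192.0],
              [22.0, 80.0, 160.0],
              [45.0, 125.0, 258.0]])

n = X.shape[0]
Z = X - X.mean(axis=0)            # center each column
cov_manual = Z.T @ Z / (n - 1)    # sample covariance divides by n - 1

# np.cov expects variables in rows by default, hence the transpose.
cov_numpy = np.cov(X.T)

print(np.round(cov_manual, 2))
print(np.allclose(cov_manual, cov_numpy))
```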

### 2.3 Choose 2 pairs of attributes that you think could be related. Create scatter plots of all 2 pairs and include these in your report, along with a description and analysis that summarizes why these pairs of attributes might be related, and how the scatter plots do or do not support this intuition.

We believe that die size and transistor count are related: for a fixed transistor size, a larger die has room for more transistors.

In [None]:
np_df = df.to_numpy()
titles = df.columns
n_features = np_df.shape[1]
# Scatter plot for every unordered pair of attributes (i < j)
for i in range(n_features):
    for j in range(i + 1, n_features):
        Title = titles[i]+" vs "+titles[j]
        plt.figure()
        plt.scatter(np_df[:,i],np_df[:,j])
        plt.title(Title);plt.xlabel(titles[i]);plt.ylabel(titles[j])
        plt.tight_layout()
        plt.show()
        plt.close()

### 2.4 Which range-normalized numerical attributes have the greatest sample covariance? What is their sample covariance? Create a scatter plot of these range-normalized attributes.

In [None]:
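from sklearn.preprocessing import MinMaxScaler
# Range-normalize each attribute to [0, 1]
normalized_df = MinMaxScaler().fit_transform(np_df)
# Centering does not change the covariance; it only centers the scatter plot at 0
normalized_df = normalized_df - np.mean(normalized_df,axis=0)
cov_matrix = np.cov(normalized_df.T)
# Zero the diagonal so variances are ignored when searching for the largest covariance
cov_only = cov_matrix.copy(); np.fill_diagonal(cov_only,0)
# Row and column of the off-diagonal entry with the largest magnitude
x,y = np.unravel_index(np.abs(cov_only).argmax(), cov_only.shape)
max_cov = cov_matrix[x,y]
print(cov_matrix,"\n Max sample covariance:",max_cov)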

Title = titles[x]+" vs "+titles[y]
plt.figure()
plt.scatter(normalized_df[:,x],normalized_df[:,y])
plt.title(Title);plt.xlabel(titles[x]);plt.ylabel(titles[y])
plt.tight_layout()
plt.show()
plt.close()

### 2.5 Which Z-score-normalized numerical attributes have the greatest correlation? What is their correlation? Create a scatter plot of these Z-score-normalized attributes.

In [None]:
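from scipy.stats import zscore
# Z-score each attribute separately (column-wise mean and standard deviation)
z_df = zscore(np_df, axis=0)
corr = np.corrcoef(z_df.T)
# Zero the diagonal so each attribute's correlation with itself is ignored
np.fill_diagonal(corr,0)
# Row and column of the off-diagonal entry with the largest magnitude
x,y = np.unravel_index(np.abs(corr).argmax(), corr.shape)
max_corr = corr[x,y]
print(corr,"\n Max sample correlation:",max_corr)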

Title = titles[x]+" vs "+titles[y]
plt.figure()
plt.scatter(z_df[:,x],z_df[:,y])
plt.title(Title);plt.xlabel(titles[x]);plt.ylabel(titles[y])
plt.tight_layout()
plt.show()
plt.close()
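
Correlation is the covariance of Z-score-normalized attributes, which is why Z-scoring changes the scatter plot's axes but not the correlation matrix itself. The sketch below checks this on synthetic data; the data and variable names are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data with very different column scales (illustration only).
X = rng.normal(size=(200, 3)) * np.array([1.0, 50.0, 1000.0])
X[:, 2] += 20.0 * X[:, 1]          # make columns 1 and 2 correlated

# Z-score each column separately; ddof=1 matches np.cov's default divisor.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

cov_of_z = np.cov(Z.T)
corr_of_x = np.corrcoef(X.T)

print(np.allclose(cov_of_z, corr_of_x))
```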

### 2.6 How many pairs of features have correlation greater than or equal to 0.5?

In [None]:
# Count each pair once by looking only above the diagonal
print(np.count_nonzero(np.triu(corr, k=1) >= 0.5),"pairs of features have a correlation greater than or equal to 0.5")

### 2.7 How many pairs of features have negative sample covariance?

In [None]:
# Count each pair once by looking only above the diagonal
print(np.count_nonzero(np.triu(cov_matrix, k=1) < 0),"pairs of features have negative sample covariance")

### 2.8 What is the total variance of the data?

In [None]:
print("The total variance of the data is", np.trace(cov_matrix))
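
The trace of the covariance matrix is the sum of the per-attribute variances. As a quick check on synthetic data (not our dataset), the sketch below confirms that it matches the summed squared deviations from the mean divided by $n - 1$, and also the sum of the covariance eigenvalues, which is the quantity PCA later splits across components.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))          # synthetic stand-in for the data matrix

cov = np.cov(X.T)
total_var_trace = np.trace(cov)

# Same quantity from the definition: summed squared deviations over n - 1.
Z = X - X.mean(axis=0)
total_var_direct = (Z ** 2).sum() / (X.shape[0] - 1)

# The trace also equals the sum of the covariance eigenvalues.
eigvals = np.linalg.eigvalsh(cov)

print(total_var_trace, total_var_direct, eigvals.sum())
```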

### 2.9 What is the total variance of the data, restricted to the five features that have the 3 greatest sample variance?
We interpreted this as the three features with the greatest sample variance.

In [None]:
variances = np.diagonal(cov_matrix)

print("The total variance of the three features that have the greatest sample variance:", sum(sorted(variances,reverse=True)[0:3]))


---

## K-Means Clustering
### Computing the Clusters
The code below computes clusters in the data using our custom K-Means Clustering algorithm.

We found that 3 is a good number of clusters, as higher values of K tended to produce clusters containing very few points.

We chose $\epsilon = 0.001$ as the threshold on centroid change below which the algorithm terminates.

The code that performs K-Means Clustering is available in `kmeans.py`.
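
We do not reproduce `kmeans.py` here. As a rough picture of the iteration and the $\epsilon$ stopping rule, the following is a minimal sketch of the standard Lloyd procedure. The function name `kmeans_sketch`, the random initialization from data points, and the use of summed squared centroid movement as the change measure are illustrative assumptions, not necessarily what `kmeans.kMeans` does.

```python
import numpy as np

def kmeans_sketch(X, k, eps, max_iter=100, seed=0):
    """Illustrative Lloyd's algorithm; not the code in kmeans.py."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    # Assumption: initialize centroids from k distinct data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assign each point to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        assignments = dists.argmin(axis=1)
        # Recompute centroids; keep the old one if a cluster is empty.
        new_centroids = np.array([
            X[assignments == c].mean(axis=0) if np.any(assignments == c)
            else centroids[c]
            for c in range(k)
        ])
        # Assumed change measure: summed squared centroid movement.
        shift = np.sum((new_centroids - centroids) ** 2)
        centroids = new_centroids
        if shift <= eps:
            break
    return centroids, assignments

# Usage on synthetic blobs (illustration only).
rng = np.random.default_rng(42)
blobs = np.vstack([rng.normal(loc=m, scale=0.3, size=(50, 2))
                   for m in ([0, 0], [5, 5], [0, 5])])
centroids, assignments = kmeans_sketch(blobs, k=3, eps=0.001)
print(centroids.round(2))
```

A smaller `eps` makes the loop run until the centroids have essentially stopped moving, at the cost of more iterations.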

In [None]:
import kmeans

k = 3

(centroids, assignments) = kmeans.kMeans(df, k, 0.001);
for i, centroid in enumerate(centroids):
    print("Cluster", i, "Centroid:", centroid)

### Plotting the Clusters
Below is an example plot of our clustering. The colors of each point correspond to their assigned cluster.

Cluster centroids are denoted by the larger black points.

Changing the `xAxis` and `yAxis` values to any integer between 0 and 4 will change which two features are compared.

In [None]:
import matplotlib.pyplot as plt

xAxis = 2
yAxis = 4

plt.scatter(df.values[:,xAxis], df.values[:,yAxis], c=assignments, s=5)
plt.scatter(centroids[:,xAxis], centroids[:,yAxis], c='black', s=50)
plt.xlabel(df.columns[xAxis])
plt.ylabel(df.columns[yAxis]);

## DBScan Based Clustering

In [None]:
import dbscan

dbScanAlg = dbscan.DBScan(40, 0.6, -3)
# dbscan works a bit better with normalized data
normalizedDF = (df-df.mean())/df.std()
(assignments, corePts, borderPts, noisePts) = dbScanAlg.runAlgorithm(normalizedDF);
for i in assignments:
    print(i)

In [None]:
xAxis = 2
yAxis = 4

plt.scatter(df.values[:,xAxis], df.values[:,yAxis], c=assignments, s=5)
plt.title("DBScan Clusters + Noise")
plt.xlabel(df.columns[xAxis])
plt.ylabel(df.columns[yAxis]);

In [None]:
for i in noisePts:
    plt.plot(df.values[i,xAxis], df.values[i,yAxis], marker="o", markerfacecolor='blue', markersize=3, markeredgecolor='blue')

plt.title("DBScan Noise Points")
plt.xlabel(df.columns[xAxis])
plt.ylabel(df.columns[yAxis]);
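
`dbscan.py` is not shown in this report. As a rough illustration of how DBSCAN sorts points into core, border, and noise points, here is a minimal sketch on synthetic data. The function name `classify_points` and the parameters `eps` and `min_pts` are assumptions for this example and are not tied to the constructor arguments used above.

```python
import numpy as np

def classify_points(X, eps, min_pts):
    """Label each point as 'core', 'border', or 'noise' (illustrative)."""
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances (fine for small data; O(n^2) memory).
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = dists <= eps                  # includes the point itself
    is_core = neighbors.sum(axis=1) >= min_pts
    # A border point is not core but lies within eps of some core point.
    is_border = ~is_core & (neighbors & is_core[None, :]).any(axis=1)
    labels = np.full(len(X), "noise", dtype=object)
    labels[is_border] = "border"
    labels[is_core] = "core"
    return labels

rng = np.random.default_rng(0)
dense = rng.normal(loc=0.0, scale=0.2, size=(60, 2))   # one dense blob
outlier = np.array([[10.0, 10.0]])                     # far from the blob
labels = classify_points(np.vstack([dense, outlier]), eps=0.5, min_pts=5)
print({name: int((labels == name).sum()) for name in ("core", "border", "noise")})
```

Because the neighborhood test depends on a single distance threshold, features on very different scales would dominate it, which is one reason to normalize before running DBSCAN.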

## 4.1 PCA graph in 2 dimensions

The data is standardized because the features have different units. Two PCA models are fit: one with 2 components and one with the number of components left unspecified.

In [None]:
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler

scaler = StandardScaler(with_std=True,
                        with_mean=True)
df_scaled = scaler.fit_transform(df)

pcadf = PCA() #unspecified number of components
pcadf2 = PCA(n_components=2) #2 components
pcadf.fit(df_scaled)
pcadf2.fit(df_scaled)
pcadf2.mean_

In [None]:
scores = pcadf2.transform(df_scaled)
pcadf2.components_

Graph of the 2D PCA.

"Does it look like there are clusters in these two dimensions? If so, how many would you say there are?"

In [None]:
i, j = 0, 1 # which components
fig, ax = plt.subplots(1, 1, figsize=(8, 8))
ax.scatter(scores[:,i], scores[:,j], s=5, color='green', alpha=0.3)
ax.set_xlabel('PC%d' % (i+1))
ax.set_ylabel('PC%d' % (j+1))
# Draw one loading arrow per original feature
for k in range(pcadf2.components_.shape[1]):
    ax.arrow(0, 0, pcadf2.components_[i,k], pcadf2.components_[j,k])
    ax.text(pcadf2.components_[i,k],
            pcadf2.components_[j,k],
            df.columns[k])

Same graph, but with the arrows and text scaled for readability.
The second component is flipped for a clearer perspective. The flip is applied to copies of the scores and loadings, so re-running the cell gives the same plot.

In [None]:
scale_arrow = s_ = 3
# Flip PC2 on copies so scores and loadings stay consistent with each other
flipped_scores = scores * np.array([1, -1])
flipped_components = pcadf2.components_ * np.array([[1], [-1]])
fig, ax = plt.subplots(1, 1, figsize=(8, 8))
ax.scatter(flipped_scores[:,0], flipped_scores[:,1], s=5)
ax.set_xlabel('PC%d' % (i+1))
ax.set_ylabel('PC%d' % (j+1))
for k in range(flipped_components.shape[1]):
    ax.arrow(0, 0, s_*flipped_components[i,k], s_*flipped_components[j,k])
    ax.text(s_*flipped_components[i,k],
            s_*flipped_components[j,k],
            df.columns[k])

In [None]:
scores.std(0, ddof=1)

Explained variance and explained variance ratio for our two-component PCA:

In [None]:
pcadf2.explained_variance_

In [None]:
pcadf2.explained_variance_ratio_

## 4.2 PCA unspecified components graphs

Graphs for 4.2:

"Based on this plot, choose a number of principal components to reduce the dimensionality of the data. Report how many principal components will be used as well as the fraction of total variance captured using this many components."

Based on the graphs, we would keep 4 principal components. The fraction of total variance they capture is the fourth value of the cumulative curve in the right-hand plot.

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(15, 6))
ticks = np.arange(pcadf.n_components_)+1
ax = axes[0]
ax.plot(ticks,
    pcadf.explained_variance_ratio_,
    marker='o')
ax.set_xlabel('Principal Component');
ax.set_ylabel('Proportion of Variance Explained')
ax.set_ylim([0,1])
ax.set_xticks(ticks)
ax = axes[1]
ax.plot(ticks,
    pcadf.explained_variance_ratio_.cumsum(),
    marker='o')
ax.set_xlabel('Principal Component')
ax.set_ylabel('Cumulative Proportion of Variance Explained')
ax.set_ylim([0, 1])
ax.set_xticks(ticks)


In [None]:
pcadf.explained_variance_

In [None]:
pcadf.explained_variance_ratio_
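
To turn the cumulative plot into a number, one option is to pick the smallest number of components whose cumulative explained-variance ratio reaches a chosen threshold. The sketch below does this with plain NumPy on synthetic data; the 0.9 threshold and the variable names are assumptions for illustration, not values from our analysis. With the fitted `pcadf` object, the same computation would start from `pcadf.explained_variance_ratio_`.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for the scaled data: 5 columns, two of them correlated.
X = rng.normal(size=(500, 5))
X[:, 1] += 2.0 * X[:, 0]
X = (X - X.mean(axis=0)) / X.std(axis=0)

# PCA variances are the eigenvalues of the covariance matrix, largest first.
eigvals = np.linalg.eigvalsh(np.cov(X.T))[::-1]
ratio = eigvals / eigvals.sum()
cumulative = ratio.cumsum()

threshold = 0.9   # assumed target fraction of total variance
# Index of the first cumulative value that reaches the threshold, plus one.
n_keep = int(np.searchsorted(cumulative, threshold) + 1)

print("explained variance ratio:", ratio.round(3))
print("components kept:", n_keep,
      "capturing", round(float(cumulative[n_keep - 1]), 3))
```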

## 4.3

Remaining work: run a k-means analysis (using both our own implementation and scikit-learn) for 4.3, and a DBSCAN analysis for 4.4.