For the following exercises, work with the cereals data set. A link to the raw data is here:

https://raw.githubusercontent.com/ArashVafa/DESC624/master/cereals.CSV

* Standardize or normalize the predictors Sugars, Fiber, and Potass.
* Construct the correlation matrix for Sugars, Fiber, and Potass. Which variables are highly correlated?
* Build a regression model to estimate Rating based on Sugars, Fiber, and Potass. 
* Run PCA using three components. What percent of the variability is explained by one component? By two components? By all three components?
* Say we want to explain at least 70% of the variability. How many components would you retain?
* Run PCA using two components. What percent of the variability do the two components explain?
* Use the two components as the predictor variables in a regression model to estimate Rating. What are the regression coefficients of the two components?
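
As a rough illustration of the first two steps, the sketch below standardizes three numeric columns and builds their correlation matrix with NumPy. The data are synthetic stand-ins, not the cereals file: the names `sugars`, `fiber`, and `potass` are reused only for readability, and the second and third columns are deliberately constructed to be correlated.

```python
import numpy as np

# Synthetic stand-in for three numeric predictors (NOT the cereals data).
rng = np.random.default_rng(0)
sugars = rng.normal(7.0, 4.0, size=200)
fiber = rng.normal(2.0, 1.0, size=200)
potass = 40.0 * fiber + rng.normal(0.0, 10.0, size=200)  # built to track fiber
X = np.column_stack([sugars, fiber, potass])

# z-score standardization: subtract each column's mean, divide by its std
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Correlation matrix of the columns (rowvar=False: columns are variables)
R = np.corrcoef(Z, rowvar=False)
print(np.round(R, 2))
```

Correlation is unaffected by standardization, so `np.corrcoef(X, rowvar=False)` gives the same matrix; standardizing matters for PCA, where unscaled variables with large variance would otherwise dominate the components.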

For the following exercises, work with the red_wine_PCA_training and red_wine_PCA_test data sets. A link to the raw data is here:

red_wine_PCA_training: 'https://raw.githubusercontent.com/ArashVafa/DESC624/master/red_wine_PCA_training'

red_wine_PCA_test: 'https://raw.githubusercontent.com/ArashVafa/DESC624/master/red_wine_PCA_test'

The predictors are alcohol, residual sugar, pH, density, and fixed acidity.

* Standardize or normalize the predictors.
* Construct the correlation matrix for the predictors. Between which predictors do you find the highest correlations?
* Build a regression model to estimate quality based on the predictors.
* Perform PCA. What percent of the variability is explained by one component? By two components? By three components? By four components? By all five components?
* Say we want to explain at least 90% of the variability. How many components does the proportion of variance explained criterion suggest we extract?
* Combine the recommendations from the two criteria to reach a consensus as to how many components we should extract.
* Produce the correlation matrix for the components. What do these values mean?
* Next, use only the components you extracted to estimate wine quality using a regression model. Do not include the original predictors.
* Compare the R^2 values of the PCA regression and the original regression model.
* Explain why the original model slightly outperformed the PCA model.
* Explain how the PCA model may be considered superior, even though it is slightly outperformed.
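
For the question about the correlation matrix of the components, a small sketch may help. It runs PCA by hand (an SVD of standardized data) on synthetic, deliberately correlated predictors and then correlates the resulting component scores. The data and variable names are assumptions made for illustration, not the wine data.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic predictors: three noisy copies of one signal plus two unrelated columns
base = rng.normal(size=(300, 1))
X = np.hstack([base + 0.3 * rng.normal(size=(300, 1)) for _ in range(3)]
              + [rng.normal(size=(300, 2))])
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# PCA via SVD of the standardized data: rows of Vt are the component directions
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
scores = Z @ Vt.T  # component scores, one column per component

# Correlation matrix of the components
comp_corr = np.corrcoef(scores, rowvar=False)
print(np.round(comp_corr, 2))
```

The off-diagonal entries come out as zero up to rounding: the components are uncorrelated by construction, which is what removes the multicollinearity present among the original predictors.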

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Only the metrics module is needed up front; estimators are imported in the cells that use them.
from sklearn import metrics

In [None]:
url = 'https://raw.githubusercontent.com/ArashVafa/DESC624/master/red_wine_PCA_training'
df = pd.read_csv(url)
df

In [None]:
datatype = df.dtypes
print(datatype)

In [None]:
testurl = 'https://raw.githubusercontent.com/ArashVafa/DESC624/master/red_wine_PCA_test'
dftest = pd.read_csv(testurl)
dftest

In [None]:
print(np.unique(df['type']))
print(np.unique(df['quality']))

In [None]:
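# Predictors (training set)
X_train = df[['alcohol','density','fixed acidity','pH','residual sugar']]
# Target as a 1-D Series, the shape scikit-learn estimators expect for y
y_train = df['quality']
# Same predictors and target from the test set
X_test = dftest[['alcohol','density','fixed acidity','pH','residual sugar']]
y_test = dftest['quality']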

In [None]:
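# Standardize or normalize the predictors. The predictors are alcohol, residual sugar, pH, density, and fixed acidity.
from sklearn.preprocessing import StandardScaler
# Fit the scaler on the training data only, then apply the same means and
# standard deviations to the test data so both sets share one scale.
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)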

In [None]:
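# Construct the correlation matrix for the predictors. Between which predictors do you find the highest correlations?
# Restrict to the five predictors; df also holds the target and a non-predictor 'type' column.
correlation_matrix = df[['alcohol','density','fixed acidity','pH','residual sugar']].corr().round(2)
sns.heatmap(data=correlation_matrix, annot=True)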

In [None]:
# Build a regression model to estimate quality based on the predictors.
# Quality is treated as a numeric target, so ordinary least squares (LinearRegression)
# is used here in place of LogisticRegression, which is a classifier.
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred = lr.predict(X_test)

In [None]:
print('R2 score (original regression, test set):', metrics.r2_score(y_test, y_pred))

In [None]:
#Perform PCA. What percent of the variability is explained by one component? By two components? By three components? By four components? By all five components?
from sklearn.decomposition import PCA
pca1 = PCA(n_components=1)
principalComponents = pca1.fit_transform(X_train)
principalDf = pd.DataFrame(data = principalComponents
             , columns = ['principal component 1'])
#one component
pca1.explained_variance_ratio_.sum()

In [None]:
from sklearn.decomposition import PCA
pca2 = PCA(n_components=2)
principalComponents2 = pca2.fit_transform(X_train)
principalDf2 = pd.DataFrame(data = principalComponents2
             , columns = ['principal component 1','principal component 2'])
#two component
pca2.explained_variance_ratio_.sum()

In [None]:
finalDf2 = pd.concat([principalDf2, df[['quality']]], axis = 1)
import matplotlib.pyplot as plt
fig = plt.figure(figsize = (8,8))
ax = fig.add_subplot(1,1,1) 
ax.set_xlabel('Principal Component 1', fontsize = 15)
ax.set_ylabel('Principal Component 2', fontsize = 15)
ax.set_title('2 component PCA', fontsize = 20)
targets = [3, 4, 5, 6, 7, 8]
colors = ['r', 'g', 'b','aqua','yellow','k']
for target, color in zip(targets,colors):
    indicesToKeep = finalDf2['quality'] == target
    ax.scatter(finalDf2.loc[indicesToKeep, 'principal component 1']
               , finalDf2.loc[indicesToKeep, 'principal component 2']
               , c = color
               , s = 50)
ax.legend(targets)
ax.grid()

In [None]:
from sklearn.decomposition import PCA
pca3 = PCA(n_components=3)
principalComponents = pca3.fit_transform(X_train)
principalDf = pd.DataFrame(data = principalComponents
             , columns = ['principal component 1','principal component 2','principal component 3'])
#three component
pca3.explained_variance_ratio_.sum()

In [None]:
from sklearn.decomposition import PCA
pca4 = PCA(n_components=4)
principalComponents = pca4.fit_transform(X_train)
principalDf = pd.DataFrame(data = principalComponents
             , columns = ['principal component 1','principal component 2','principal component 3','principal component 4'])
#four component
pca4.explained_variance_ratio_.sum()
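
The cells above refit PCA once per component count and stop at four. A possible shortcut, sketched below on synthetic data (not the wine data), is to compute all five explained-variance ratios at once and read the cumulative totals, which also answers the 90% question directly. The eigenvalues of the correlation matrix are used here; on standardized data these ratios should match what `PCA().explained_variance_ratio_` reports. The variable names and the synthetic data are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic stand-in for five predictors driven by two underlying factors
latent = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 5))
X = latent @ mixing + 0.5 * rng.normal(size=(500, 5))
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Eigenvalues of the correlation matrix, sorted from largest to smallest
eigvals = np.linalg.eigvalsh(np.corrcoef(Z, rowvar=False))[::-1]
ratio = eigvals / eigvals.sum()   # share of variance per component
cumulative = np.cumsum(ratio)     # share explained by the first k components

threshold = 0.90
# Smallest k whose cumulative share reaches the threshold
k = int(np.argmax(cumulative >= threshold)) + 1

for i, c in enumerate(cumulative, start=1):
    print(f"{i} component(s): {100 * c:.1f}%")
print("components needed for 90%:", k)
```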

In [None]:
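# Next, use only the components you extracted to estimate wine quality using a regression model. Do not include the original predictors.
from sklearn.decomposition import PCA
trainpca = PCA(n_components=3)
X_pca_train = trainpca.fit_transform(X_train)

# Project the test data onto the components learned from the training data.
# Fitting a second PCA on the test set would produce different axes, so the
# model's coefficients would not apply to them.
X_pca_test = trainpca.transform(X_test)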

In [None]:
# Regress quality on the principal components only.
# LinearRegression is used in place of LogisticRegression because the exercise
# asks for a regression model evaluated with R^2.
from sklearn.linear_model import LinearRegression
lrpca = LinearRegression()
lrpca.fit(X_pca_train, y_train)
y_pca_pred = lrpca.predict(X_pca_test)

In [None]:
print('R2 score (PCA regression, test set):', metrics.r2_score(y_test, y_pca_pred))

In [None]:
# Compare the R^2 values of the PCA regression and the original regression model.
print('R2 score (original predictors):', metrics.r2_score(y_test, y_pred))
print('R2 score (PCA components):', metrics.r2_score(y_test, y_pca_pred))
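
For the last two questions, the sketch below may help show why a regression on a subset of components cannot beat the regression on all predictors in-sample. It uses synthetic data and a hypothetical helper `r2_of_linear_fit` built on NumPy least squares; it reports training-set R^2, so it illustrates the principle rather than reproducing the test-set numbers above.

```python
import numpy as np

def r2_of_linear_fit(A, y):
    """In-sample R^2 of a least-squares fit of y on the columns of A plus an intercept."""
    A1 = np.column_stack([np.ones(len(A)), A])
    coef, *_ = np.linalg.lstsq(A1, y, rcond=None)
    resid = y - A1 @ coef
    centered = y - y.mean()
    return 1.0 - (resid @ resid) / (centered @ centered)

rng = np.random.default_rng(3)

# Synthetic predictors (two of them correlated) and a noisy linear target
X = rng.normal(size=(400, 5))
X[:, 1] += 0.8 * X[:, 0]
y = X @ np.array([0.5, -0.3, 0.2, 0.1, 0.4]) + rng.normal(size=400)
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Component scores from an SVD of the standardized predictors
_, _, Vt = np.linalg.svd(Z, full_matrices=False)
scores = Z @ Vt.T

r2_full = r2_of_linear_fit(Z, y)
r2_by_k = [r2_of_linear_fit(scores[:, :k], y) for k in range(1, 6)]

print("all five predictors:", round(r2_full, 4))
for k, r2 in enumerate(r2_by_k, start=1):
    print(f"first {k} component(s):", round(r2, 4))
```

With all five components the fit matches the original model, because the components are just a rotation of the standardized predictors. Dropping components discards some information, so R^2 can only stay level or fall; in exchange, the retained components are uncorrelated and fewer in number.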