## Codio Activity 6.4: Adjusting Parameters for Variance

**Expected Time: 60 Minutes**

**Total Points: 20 Points**

This activity focuses on using the $\Sigma$ matrix to limit the number of principal components based on how much variance should be kept.  In the last activity, a scree plot was used to see where the gain in variance explained levels off.  Here, you will determine how many components are required to explain a given proportion of the variance.  The dataset is a larger housing dataset that describes individual houses and their features in Ames, Iowa.  For our purposes, only the numeric columns with no missing values are used.

## Index:

- [Problem 1](#Problem-1)
- [Problem 2](#Problem-2)
- [Problem 3](#Problem-3)
- [Problem 4](#Problem-4)

In [1]:
import numpy as np
from scipy.linalg import svd
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from sklearn.datasets import fetch_openml

In [2]:
#fetching the data
housing = fetch_openml(name="house_prices", as_frame=True, data_home='data')

In [3]:
#examine the dataframe
housing.frame

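| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave | | Reg | Lvl | AllPub | ... | 0 | | | | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave | | Reg | Lvl | AllPub | ... | 0 | | | | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave | | IR1 | Lvl | AllPub | ... | 0 | | | | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave | | IR1 | Lvl | AllPub | ... | 0 | | | | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave | | IR1 | Lvl | AllPub | ... | 0 | | | | 0 | 12 | 2008 | WD | Normal | 250000 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1455 | 1456 | 60 | RL | 62.0 | 7917 | Pave | | Reg | Lvl | AllPub | ... | 0 | | | | 0 | 8 | 2007 | WD | Normal | 175000 |
| 1456 | 1457 | 20 | RL | 85.0 | 13175 | Pave | | Reg | Lvl | AllPub | ... | 0 | | MnPrv | | 0 | 2 | 2010 | WD | Normal | 210000 |
| 1457 | 1458 | 70 | RL | 66.0 | 9042 | Pave | | Reg | Lvl | AllPub | ... | 0 | | GdPrv | Shed | 2500 | 5 | 2010 | WD | Normal | 266500 |
| 1458 | 1459 | 20 | RL | 68.0 | 9717 | Pave | | Reg | Lvl | AllPub | ... | 0 | | | | 0 | 4 | 2010 | WD | Normal | 142125 |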


In [4]:
# select the numeric columns, then drop every column that contains a missing value
df = housing.frame.select_dtypes(['float', 'int']).dropna(axis=1)

In [5]:
df.shape

(1460, 35)

[Back to top](#Index:) 

## Problem 1

### Scale the data

**5 Points**

After selecting our numeric data, standardize each column by subtracting its mean and dividing by its standard deviation so that the data is ready for SVD.  Assign the scaled data to `df_scaled` below.  Your answer should be of type DataFrame.

In [6]:
### GRADED

df_scaled = ''

### BEGIN SOLUTION
df_scaled = (df - df.mean())/df.std()
### END SOLUTION

# Answer check
print(type(df_scaled))

<class 'pandas.core.frame.DataFrame'>


In [7]:
### BEGIN HIDDEN TESTS
df_scaled_ = (df - df.mean())/df.std()
#
#
#
pd.testing.assert_frame_equal(df_scaled, df_scaled_)
### END HIDDEN TESTS
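
One detail worth knowing when comparing scaling approaches: pandas' `DataFrame.std` uses the sample standard deviation (`ddof=1`) by default, while NumPy's `np.std` defaults to the population version (`ddof=0`). The sketch below uses a small made-up frame (`toy` is hypothetical, not the housing data) to show that the two conventions give slightly different scaled values, which matters for an exact comparison such as `assert_frame_equal`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the Ames housing set
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 20.0, 30.0, 50.0]})

# pandas default: sample standard deviation (ddof=1)
scaled_sample = (toy - toy.mean()) / toy.std()

# population standard deviation (ddof=0), the NumPy default
scaled_population = (toy - toy.mean()) / toy.std(ddof=0)

# After scaling, each column has mean 0 and unit sample standard deviation
print(scaled_sample.mean().round(6).tolist())
print(scaled_sample.std().round(6).tolist())

# The two conventions differ by a constant factor of sqrt(n / (n - 1))
n = len(toy)
ratio = (scaled_population / scaled_sample).to_numpy()
print(np.allclose(ratio, np.sqrt(n / (n - 1))))
```

With 1460 rows the factor is very close to 1, but it is still large enough to make an exact equality check fail.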

[Back to top](#Index:) 

## Problem 2

### Extracting $\Sigma$

**5 Points**

Using the scaled data, extract the singular values with the `svd` function from `scipy.linalg`.  The function returns three arrays; assign them to `U`, `sigma`, and `VT` below, where `sigma` holds the singular values.

In [8]:
### GRADED

U, sigma, VT = '', '', ''

### BEGIN SOLUTION
U, sigma, VT = svd(df_scaled)
### END SOLUTION

# Answer check
print(type(sigma))
print(sigma.shape)

<class 'numpy.ndarray'>
(35,)


In [9]:
### BEGIN HIDDEN TESTS
df_scaled_ = (df - df.mean())/df.std()
U_, sigma_, VT_ = svd(df_scaled_)
#
#
#
np.testing.assert_array_equal(sigma_, sigma)
assert sigma.shape == sigma_.shape
### END HIDDEN TESTS
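
To make the shapes printed in the answer check easier to interpret, here is a minimal sketch on a small random matrix (`X` is hypothetical). It uses `numpy.linalg.svd` rather than the `scipy.linalg.svd` call from the activity; both accept a `full_matrices` keyword. The singular values come back as a 1-D array with one entry per column, which is why the housing data with 35 columns gives a `sigma` of shape `(35,)`.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))  # hypothetical 6 x 3 data matrix

# Full SVD: U is square (6 x 6); s has one singular value per column of X
U_full, s_full, VT_full = np.linalg.svd(X, full_matrices=True)

# Reduced ("economy") SVD: U keeps only the first 3 columns
U_red, s_red, VT_red = np.linalg.svd(X, full_matrices=False)

print(U_full.shape, s_full.shape, VT_full.shape)
print(U_red.shape, s_red.shape, VT_red.shape)

# Singular values are returned in descending order
print(np.all(np.diff(s_red) <= 0))

# The reduced factors rebuild X up to floating-point error
X_rebuilt = U_red @ np.diag(s_red) @ VT_red
print(np.allclose(X, X_rebuilt))
```

The singular values are identical in both forms; only the size of `U` changes, so the reduced form can save memory when only `sigma` is needed.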

[Back to top](#Index:) 

## Problem 3

### Percent Variance Explained

**5 Points**

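To compute the percent variance explained in this activity, divide each singular value by the sum of the singular values.  Strictly speaking, the variance along each component is proportional to the squared singular value, so this ratio is a simplified proxy; use the unsquared ratio here.  Assign your proportions as an array to `percent_variance_explained` below.  Note that because of floating-point rounding, the proportions may not sum to exactly 1.  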

In [10]:
### GRADED

percent_variance_explained = ''

### BEGIN SOLUTION
U, sigma, VT = svd(df_scaled)
percent_variance_explained = sigma/sigma.sum()
### END SOLUTION
print(percent_variance_explained.shape)
print(percent_variance_explained.sum())

(35,)
0.9999999999999999


In [11]:
### BEGIN HIDDEN TESTS
df_scaled_ = (df - df.mean())/df.std()
U_, sigma_, VT_ = svd(df_scaled_)
percent_variance_explained_ = sigma_/sigma_.sum()
#
#
#
np.testing.assert_array_equal(percent_variance_explained, percent_variance_explained_)
### END HIDDEN TESTS
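
As a side note on the ratio used in Problem 3, the following sketch compares it with the ratio based on squared singular values. It runs on hypothetical random data (`raw`), not the housing set, and uses `numpy.linalg` functions as a stand-in for the `scipy` call above. For centered data with $n$ rows, the variance along component $i$ is $\sigma_i^2 / (n - 1)$, which the code checks against the eigenvalues of the sample covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
raw = rng.normal(size=(50, 4)) @ rng.normal(size=(4, 4))  # hypothetical correlated data

# Standardize with the sample standard deviation (ddof=1)
Z = (raw - raw.mean(axis=0)) / raw.std(axis=0, ddof=1)

s = np.linalg.svd(Z, compute_uv=False)  # singular values only

# Eigenvalues of the sample covariance matrix, sorted in descending order
eigvals = np.sort(np.linalg.eigvalsh(np.cov(Z, rowvar=False)))[::-1]

# Variance along each component equals sigma_i^2 / (n - 1)
n = Z.shape[0]
print(np.allclose(s**2 / (n - 1), eigvals))

share_unsquared = s / s.sum()        # ratio used in this activity
share_squared = s**2 / (s**2).sum()  # ratio based on variances
print(share_unsquared.round(3))
print(share_squared.round(3))
```

The squared version always gives the leading component at least as large a share as the unsquared version, so the two ratios can lead to different component counts for the same threshold.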

[Back to top](#Index:) 

## Problem 4

### Cumulative Variance Explained

**5 Points**

Using the solution to Problem 3, count how many principal components, taken in descending order, have a cumulative proportion of variance explained that is still below 80%.  The 80% mark itself is first reached by the next component.  Assign your response to `ans4` below as an integer. 

**HINT**: explore the `np.cumsum` function.

In [12]:
### GRADED

ans4 = ''

### BEGIN SOLUTION
U, sigma, VT = svd(df_scaled)
percent_variance_explained = sigma/sigma.sum()
# count the leading components whose cumulative share is still below 0.8
ans4 = int((np.cumsum(percent_variance_explained) < .8).sum())

### END SOLUTION
print(type(ans4))
print(ans4)

<class 'int'>
21


In [13]:
### BEGIN HIDDEN TESTS
df_scaled_ = (df - df.mean())/df.std()
U_, sigma_, VT_ = svd(df_scaled_)
percent_variance_explained_ = sigma_/sigma_.sum()
ans4_ = int((np.cumsum(percent_variance_explained_) < .8).sum())
#
#
#
assert ans4 == ans4_
### END HIDDEN TESTS
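
The counting step in Problem 4 can be illustrated on a tiny made-up array (`shares` and `threshold` are hypothetical). The sketch shows two related counts: how many components keep the running total below the threshold, and the smallest number of components whose running total reaches it. The second count is always one more than the first, provided the threshold is reached at all.

```python
import numpy as np

shares = np.array([0.4, 0.3, 0.2, 0.1])  # hypothetical shares, already in descending order
cumulative = np.cumsum(shares)           # running totals, roughly 0.4, 0.7, 0.9, 1.0

threshold = 0.8

# Components whose running total is still below the threshold
n_below = int((cumulative < threshold).sum())

# Smallest number of components whose running total reaches the threshold
n_needed = int(np.argmax(cumulative >= threshold)) + 1

print(n_below, n_needed)  # 2 3
```

Which count is appropriate depends on whether the goal is to stay under the threshold or to meet it.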