## Codio Activity 6.4: Adjusting Parameters for Variance

**Expected Time: 60 Minutes**

**Total Points: 20 Points**

This activity uses the singular values in the $\Sigma$ matrix to choose how many principal components to keep, based on how much variance should be retained.  In the last activity, a scree plot was used to see where the gain in variance explained levels off.  Here, you will determine how many components are required to explain a given proportion of the variance.  The dataset is a larger housing dataset describing individual houses and their features in Ames, Iowa.  For our purposes, only the numeric columns with no missing values are selected.

## Index:

- [Problem 1](#Problem-1)
- [Problem 2](#Problem-2)
- [Problem 3](#Problem-3)
- [Problem 4](#Problem-4)

In [1]:
import numpy as np
from scipy.linalg import svd
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
from sklearn.datasets import fetch_openml

In [2]:
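# Disable HTTPS certificate verification for this session so the download below
# does not fail on certificate errors. Avoid this outside a sandboxed environment.
import ssl
ssl._create_default_https_context = ssl._create_unverified_context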

In [3]:
#fetching the data
housing = fetch_openml(name="house_prices", as_frame=True)

In [4]:
#examine the dataframe
housing.frame

| | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 60.0 | RL | 65.0 | 8450.0 | Pave | | Reg | Lvl | AllPub | ... | 0.0 | | | | 0.0 | 2.0 | 2008.0 | WD | Normal | 208500.0 |
| 1 | 2.0 | 20.0 | RL | 80.0 | 9600.0 | Pave | | Reg | Lvl | AllPub | ... | 0.0 | | | | 0.0 | 5.0 | 2007.0 | WD | Normal | 181500.0 |
| 2 | 3.0 | 60.0 | RL | 68.0 | 11250.0 | Pave | | IR1 | Lvl | AllPub | ... | 0.0 | | | | 0.0 | 9.0 | 2008.0 | WD | Normal | 223500.0 |
| 3 | 4.0 | 70.0 | RL | 60.0 | 9550.0 | Pave | | IR1 | Lvl | AllPub | ... | 0.0 | | | | 0.0 | 2.0 | 2006.0 | WD | Abnorml | 140000.0 |
| 4 | 5.0 | 60.0 | RL | 84.0 | 14260.0 | Pave | | IR1 | Lvl | AllPub | ... | 0.0 | | | | 0.0 | 12.0 | 2008.0 | WD | Normal | 250000.0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1455 | 1456.0 | 60.0 | RL | 62.0 | 7917.0 | Pave | | Reg | Lvl | AllPub | ... | 0.0 | | | | 0.0 | 8.0 | 2007.0 | WD | Normal | 175000.0 |
| 1456 | 1457.0 | 20.0 | RL | 85.0 | 13175.0 | Pave | | Reg | Lvl | AllPub | ... | 0.0 | | MnPrv | | 0.0 | 2.0 | 2010.0 | WD | Normal | 210000.0 |
| 1457 | 1458.0 | 70.0 | RL | 66.0 | 9042.0 | Pave | | Reg | Lvl | AllPub | ... | 0.0 | | GdPrv | Shed | 2500.0 | 5.0 | 2010.0 | WD | Normal | 266500.0 |
| 1458 | 1459.0 | 20.0 | RL | 68.0 | 9717.0 | Pave | | Reg | Lvl | AllPub | ... | 0.0 | | | | 0.0 | 4.0 | 2010.0 | WD | Normal | 142125.0 |


In [5]:
# select the numeric columns, then drop every column that contains a missing value
df = housing.frame.select_dtypes(['float', 'int']).dropna(axis=1)

In [6]:
df.shape

(1460, 35)
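
To make the column selection concrete, here is a minimal sketch on a small made-up frame (the column names and values are hypothetical, not taken from the Ames data). It assumes `select_dtypes(include="number")` as a stand-in for the `['float', 'int']` list used above, and shows that `dropna(axis=1)` removes whole columns rather than rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame (not the Ames data): mixed dtypes with one gap.
toy = pd.DataFrame({
    "area": [8450.0, 9600.0, 11250.0],   # float, complete
    "rooms": [3, 2, 4],                  # int, complete
    "frontage": [65.0, np.nan, 68.0],    # float, one missing value
    "zoning": ["RL", "RL", "RM"],        # non-numeric, filtered out first
})

# Keep numeric columns only, then drop every column that contains a NaN.
numeric = toy.select_dtypes(include="number")
complete = numeric.dropna(axis=1)

kept_columns = list(complete.columns)
print(kept_columns)
print(complete.shape)
```

Because `axis=1` drops columns, the number of rows is unchanged; a column with even a single missing value is removed entirely.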

[Back to top](#Index:) 

## Problem 1

### Scale the data

**5 Points**

After selecting our numeric data, scale it (subtract each column's mean and divide by its standard deviation) so that it is ready for SVD.  Assign the scaled data to `df_scaled` below.  Your answer should be of type DataFrame.

In [7]:
### GRADED

df_scaled = ''

# YOUR CODE HERE
mu = df.mean()
s = df.std()
df_scaled = (df - mu) / s
#raise NotImplementedError()

# Answer check
print(type(df_scaled))

<class 'pandas.core.frame.DataFrame'>
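
As a quick sanity check on this kind of scaling, the sketch below standardizes a small made-up frame (hypothetical columns and values) and confirms that each column ends up with mean 0 and standard deviation 1. It relies on pandas' default sample standard deviation (`ddof=1`).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, used only to illustrate the scaling step.
toy = pd.DataFrame({
    "area": [8450.0, 9600.0, 11250.0, 9550.0],
    "rooms": [3.0, 2.0, 4.0, 3.0],
})

# z-score each column: subtract the column mean, divide by the column std.
scaled = (toy - toy.mean()) / toy.std()

col_means = scaled.mean().to_numpy()
col_stds = scaled.std().to_numpy()   # pandas uses ddof=1 by default

print(type(scaled))
print(np.allclose(col_means, 0.0), np.allclose(col_stds, 1.0))
```

Note that scikit-learn's `StandardScaler` divides by the population standard deviation (`ddof=0`), so its output differs from this by a constant factor per column.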


[Back to top](#Index:) 

## Problem 2

### Extracting $\Sigma$

**5 Points**

Using the scaled data, extract the singular values from the data using the `scipy.linalg` function `svd`.  Assign your results to `sigma` below. 

In [8]:
### GRADED

U, sigma, VT = '', '', ''

# YOUR CODE HERE
U, sigma, VT = svd(df_scaled, full_matrices=False)
#raise NotImplementedError()

# Answer check
print(type(sigma))
print(sigma.shape)

<class 'numpy.ndarray'>
(35,)
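
To see what `svd` returns with `full_matrices=False`, here is a small sketch on synthetic data (a random 50 x 4 matrix, an assumption made for illustration only). It checks the shapes of the three factors, that the singular values come back in descending order, and that the factors reproduce the input.

```python
import numpy as np
from scipy.linalg import svd

# Synthetic stand-in for the scaled data: 50 rows, 4 columns.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Reduced SVD: U is (50, 4), sigma is a 1-D array of 4 values, VT is (4, 4).
U, sigma, VT = svd(X, full_matrices=False)

# Rebuild X from the factors; sigma must be placed on a diagonal first.
X_rebuilt = U @ np.diag(sigma) @ VT

print(U.shape, sigma.shape, VT.shape)
print(np.allclose(X, X_rebuilt))
```

`sigma` is returned as a 1-D array rather than a matrix, which is why `sigma.shape` above is `(35,)` for the 35 housing columns.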


[Back to top](#Index:) 

## Problem 3

### Percent Variance Explained

**5 Points**

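To compute the percent variance explained, this activity divides each singular value by the sum of the singular values.  (Strictly, the variance captured by a component is proportional to its squared singular value; the unsquared ratio is the simplification used here.)  Assign the resulting proportions as an array to `percent_variance_explained` below.  Note that, because of floating-point rounding, these proportions may not sum to exactly 1.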

In [9]:
### GRADED

percent_variance_explained = ''

# YOUR CODE HERE
percent_variance_explained = sigma / sigma.sum()

#raise NotImplementedError()
print(percent_variance_explained.shape)
print(percent_variance_explained.sum())

(35,)
0.9999999999999999
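
For comparison, the sketch below computes two ratios on synthetic data (an assumption for illustration, not the housing data): the unsquared ratio used in this activity and the ratio of squared singular values. It also checks that the squared singular values divided by `n - 1` match the eigenvalues of the sample covariance matrix, which is why the squared form is the one usually read as variance explained.

```python
import numpy as np
from scipy.linalg import svd

# Synthetic data with deliberately unequal column variances.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5)) * np.array([5.0, 3.0, 2.0, 1.0, 0.5])
X = X - X.mean(axis=0)   # center only, so the columns keep unequal spread
n = X.shape[0]

_, sigma, _ = svd(X, full_matrices=False)

# Ratio as defined in this activity.
ratio_plain = sigma / sigma.sum()
# Ratio based on squared singular values.
ratio_squared = sigma**2 / (sigma**2).sum()

# Eigenvalues of the sample covariance matrix, sorted in descending order.
eigvals = np.sort(np.linalg.eigvalsh(np.cov(X, rowvar=False)))[::-1]

print(np.round(ratio_plain, 3))
print(np.round(ratio_squared, 3))
print(np.allclose(sigma**2 / (n - 1), eigvals))
```

The squared ratio puts more weight on the leading components, so a threshold such as 80% is typically reached with fewer components than under the unsquared ratio.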


[Back to top](#Index:) 

## Problem 4

### Cumulative Variance Explained

**5 Points**

Using the solution to Problem 3, how many principal components are needed to retain 80% of the variance when they are considered in descending order of variance explained?  Assign your response to `ans4` below as an integer. 

**HINT**: explore the `np.cumsum` function.

In [17]:
(np.cumsum(percent_variance_explained) < .8).sum()

21

In [19]:
### GRADED

ans4 = ''

# YOUR CODE HERE
#raise NotImplementedError()
# count the components whose cumulative share is still below 80%
ans4 = (np.cumsum(percent_variance_explained) < .8).sum()

print(np.cumsum(percent_variance_explained[:ans4]))
print(type(ans4))
print(ans4)

[0.08757692 0.14576166 0.19370719 0.23970772 0.27948317 0.31513422
 0.35026271 0.38473414 0.41867326 0.45244718 0.48582951 0.51854891
 0.5511652  0.58337053 0.61509345 0.645868   0.67623342 0.70563948
 0.73491301 0.76358878 0.79116773]
<class 'numpy.int64'>
21
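
The expression `(cumsum < threshold).sum()` counts the components whose cumulative share is still below the threshold; in the output above, those 21 components reach about 0.791. The sketch below uses made-up proportions (hypothetical values) to contrast that count with the number of components needed to reach the threshold, computed here with `np.searchsorted`. Which of the two the grader expects is not stated in this notebook.

```python
import numpy as np

# Hypothetical proportions in descending order (they sum to 1).
proportions = np.array([0.40, 0.25, 0.15, 0.10, 0.06, 0.04])
cumulative = np.cumsum(proportions)
threshold = 0.75

# Components whose cumulative share is still strictly below the threshold.
n_below = int((cumulative < threshold).sum())

# Smallest number of components whose cumulative share reaches the threshold.
# searchsorted returns the first index where cumulative >= threshold.
n_to_reach = int(np.searchsorted(cumulative, threshold) + 1)

print(n_below, n_to_reach)
```

For an increasing cumulative sum the two counts always differ by exactly one, provided the threshold does not exceed the total.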
