#### Michael Perrine
#### DSC 550 Data Mining
#### Professor Werner
#### Week 7 Assignment

<h1><center>Dimensionality Reduction and Feature Selection</center></h1>

### Part 1 PCA and Variance Threshold in a Linear Regression

In [1]:
# Import libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import warnings
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, root_mean_squared_error

In [2]:
# This code suppresses minor warnings
warnings.filterwarnings("ignore")

1. Import the housing data and ensure that the data is loaded properly

In [3]:
# This code imports the data and validates it is loaded properly
housing = pd.read_csv(r"housing.csv")
housing.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,...,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,...,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,...,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,...,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,...,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,...,0,,,,0,12,2008,WD,Normal,250000


2. Drop the "Id" column and any features that are missing more than 40% of their values.

In [4]:
# This code drops the Id column
housing = housing.drop(columns=["Id"])

In [5]:
# This code removes columns with greater than 40% of their data missing
new_housing = housing.drop(housing.columns[housing.isnull().mean() >0.40], axis = 1)
new_housing.head()

Unnamed: 0,MSSubClass,MSZoning,LotFrontage,LotArea,Street,LotShape,LandContour,Utilities,LotConfig,LandSlope,...,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,60,RL,65.0,8450,Pave,Reg,Lvl,AllPub,Inside,Gtl,...,0,0,0,0,0,2,2008,WD,Normal,208500
1,20,RL,80.0,9600,Pave,Reg,Lvl,AllPub,FR2,Gtl,...,0,0,0,0,0,5,2007,WD,Normal,181500
2,60,RL,68.0,11250,Pave,IR1,Lvl,AllPub,Inside,Gtl,...,0,0,0,0,0,9,2008,WD,Normal,223500
3,70,RL,60.0,9550,Pave,IR1,Lvl,AllPub,Corner,Gtl,...,272,0,0,0,0,2,2006,WD,Abnorml,140000
4,60,RL,84.0,14260,Pave,IR1,Lvl,AllPub,FR2,Gtl,...,0,0,0,0,0,12,2008,WD,Normal,250000
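
The 40% rule above can be checked on a small frame. The following is a minimal sketch with made-up values (the `toy` frame and its contents are hypothetical, not taken from housing.csv); it computes the share of missing values per column and keeps only the columns at or below the threshold.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: "Alley" is 80% missing, "LotFrontage" is 20% missing
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250, 9550, 14260],
    "Alley": [np.nan, np.nan, "Grvl", np.nan, np.nan],
    "LotFrontage": [65.0, np.nan, 68.0, 60.0, 84.0],
})

# isnull().mean() gives the fraction of missing values in each column
missing_share = toy.isnull().mean()

# Keep only the columns whose missing share does not exceed 40%
kept = toy.loc[:, missing_share <= 0.40]
print(missing_share)
print(list(kept.columns))
```

Selecting with a boolean mask in `.loc` is equivalent to dropping the columns where the mask is false; either form should give the same result.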


3. For numerical columns, fill in any missing data with the median value.

In [6]:
# This code displays the dimension of the data frame
new_housing.shape

(1460, 74)

In [7]:
# This code displays columns that are missing values
new_housing.isnull().sum()

MSSubClass         0
MSZoning           0
LotFrontage      259
LotArea            0
Street             0
                ... 
MoSold             0
YrSold             0
SaleType           0
SaleCondition      0
SalePrice          0
Length: 74, dtype: int64

In [8]:
# This code loops over the numerical columns and fills missing values with the column median
# Assigning the result back avoids relying on an in-place fill of a column selection
for column in new_housing.select_dtypes(include='number').columns:
    median_value = new_housing[column].median()
    new_housing[column] = new_housing[column].fillna(median_value)

In [9]:
# This code displays missing values and confirms that the missing values have been replaced
new_housing.isnull().sum()

MSSubClass       0
MSZoning         0
LotFrontage      0
LotArea          0
Street           0
                ..
MoSold           0
YrSold           0
SaleType         0
SaleCondition    0
SalePrice        0
Length: 74, dtype: int64

4. For categorical columns, fill in any missing data with the most common value (mode).

In [10]:
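# This code counts the missing values per column; no mode fill is applied here
# The summary below is truncated, so it shows only the first and last five columns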
new_housing.isna().sum()

MSSubClass       0
MSZoning         0
LotFrontage      0
LotArea          0
Street           0
                ..
MoSold           0
YrSold           0
SaleType         0
SaleCondition    0
SalePrice        0
Length: 74, dtype: int64
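
A mode fill along the lines described in step 4 could look like the sketch below. It runs on a hypothetical toy frame (the values are invented for illustration), and it treats every non-numeric column as categorical, which is an assumption that may not hold for every data set.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with gaps in two categorical columns
toy = pd.DataFrame({
    "BsmtQual": ["Gd", "TA", np.nan, "Gd", "Ex"],
    "GarageType": ["Attchd", np.nan, "Attchd", "Detchd", np.nan],
    "LotArea": [8450, 9600, 11250, 9550, 14260],
})

# Assumption: every non-numeric column is categorical
categorical_columns = toy.select_dtypes(exclude="number").columns

for column in categorical_columns:
    # mode() ignores NaN by default; [0] takes the first of any tied values
    most_common = toy[column].mode()[0]
    toy[column] = toy[column].fillna(most_common)

print(toy.isna().sum())
```

Printing `isna().sum()` restricted to the categorical columns, rather than the truncated full summary, makes it easier to confirm that nothing is left unfilled.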

5. Convert the categorical columns to dummy variables.

In [11]:
# This code creates a new data frame that isolates the categorical columns
# and displays the first 5 rows
new_housing_1 = new_housing.select_dtypes(include=["object"])
new_housing_1.head()

Unnamed: 0,MSZoning,Street,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,...,Electrical,KitchenQual,Functional,GarageType,GarageFinish,GarageQual,GarageCond,PavedDrive,SaleType,SaleCondition
0,RL,Pave,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,...,SBrkr,Gd,Typ,Attchd,RFn,TA,TA,Y,WD,Normal
1,RL,Pave,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,...,SBrkr,TA,Typ,Attchd,RFn,TA,TA,Y,WD,Normal
2,RL,Pave,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,...,SBrkr,Gd,Typ,Attchd,RFn,TA,TA,Y,WD,Normal
3,RL,Pave,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,Norm,...,SBrkr,Gd,Typ,Detchd,Unf,TA,TA,Y,WD,Abnorml
4,RL,Pave,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,Norm,Norm,...,SBrkr,Gd,Typ,Attchd,RFn,TA,TA,Y,WD,Normal


In [12]:
# This code shows the unique labels in each column
for column in new_housing_1.columns:
    print(column, ':', len(new_housing_1[column].unique()), "labels" )

MSZoning : 5 labels
Street : 2 labels
LotShape : 4 labels
LandContour : 4 labels
Utilities : 2 labels
LotConfig : 5 labels
LandSlope : 3 labels
Neighborhood : 25 labels
Condition1 : 9 labels
Condition2 : 8 labels
BldgType : 5 labels
HouseStyle : 8 labels
RoofStyle : 6 labels
RoofMatl : 8 labels
Exterior1st : 15 labels
Exterior2nd : 16 labels
ExterQual : 4 labels
ExterCond : 5 labels
Foundation : 6 labels
BsmtQual : 5 labels
BsmtCond : 5 labels
BsmtExposure : 5 labels
BsmtFinType1 : 7 labels
BsmtFinType2 : 7 labels
Heating : 6 labels
HeatingQC : 5 labels
CentralAir : 2 labels
Electrical : 6 labels
KitchenQual : 4 labels
Functional : 7 labels
GarageType : 7 labels
GarageFinish : 4 labels
GarageQual : 6 labels
GarageCond : 6 labels
PavedDrive : 3 labels
SaleType : 9 labels
SaleCondition : 6 labels


In [13]:
# This code creates a data frame with dummy variables and displays the first 5 rows
df = pd.get_dummies(new_housing_1, drop_first= True)
df.head()

Unnamed: 0,MSZoning_FV,MSZoning_RH,MSZoning_RL,MSZoning_RM,Street_Pave,LotShape_IR2,LotShape_IR3,LotShape_Reg,LandContour_HLS,LandContour_Low,...,SaleType_ConLI,SaleType_ConLw,SaleType_New,SaleType_Oth,SaleType_WD,SaleCondition_AdjLand,SaleCondition_Alloca,SaleCondition_Family,SaleCondition_Normal,SaleCondition_Partial
0,False,False,True,False,True,False,False,True,False,False,...,False,False,False,False,True,False,False,False,True,False
1,False,False,True,False,True,False,False,True,False,False,...,False,False,False,False,True,False,False,False,True,False
2,False,False,True,False,True,False,False,False,False,False,...,False,False,False,False,True,False,False,False,True,False
3,False,False,True,False,True,False,False,False,False,False,...,False,False,False,False,True,False,False,False,False,False
4,False,False,True,False,True,False,False,False,False,False,...,False,False,False,False,True,False,False,False,True,False


In [14]:
# This code drops the original categorical variables in the housing data frame
new_housing.drop(['MSZoning','Street','LotShape','LandContour','Utilities',
                'LotConfig', 'LandSlope', 'Neighborhood', 'Condition1',
                'Condition2', 'BldgType', 'HouseStyle', 'RoofStyle', 'RoofMatl',
                'Exterior1st', 'Exterior2nd', 'ExterQual', 'ExterCond',
                'Foundation', 'BsmtCond', 'BsmtExposure', 'BsmtFinType1',
                'BsmtFinType2', 'Heating', 'GarageCond', 'PavedDrive',
                'SaleType', 'SaleCondition'], axis = 1, inplace = True)

In [15]:
# This code locates the remaining categorical columns
new_housing_2 = new_housing.select_dtypes(include=["object"])
new_housing_2.head()


Unnamed: 0,BsmtQual,HeatingQC,CentralAir,Electrical,KitchenQual,Functional,GarageType,GarageFinish,GarageQual
0,Gd,Ex,Y,SBrkr,Gd,Typ,Attchd,RFn,TA
1,Gd,Ex,Y,SBrkr,TA,Typ,Attchd,RFn,TA
2,Gd,Ex,Y,SBrkr,Gd,Typ,Attchd,RFn,TA
3,TA,Gd,Y,SBrkr,Gd,Typ,Detchd,Unf,TA
4,Gd,Ex,Y,SBrkr,Gd,Typ,Attchd,RFn,TA


In [16]:
# This code creates a second data frame with dummy variables and displays the first 5 rows
df1 = pd.get_dummies(new_housing_2, drop_first= True)
df1.head()

Unnamed: 0,BsmtQual_Fa,BsmtQual_Gd,BsmtQual_TA,HeatingQC_Fa,HeatingQC_Gd,HeatingQC_Po,HeatingQC_TA,CentralAir_Y,Electrical_FuseF,Electrical_FuseP,...,GarageType_Basment,GarageType_BuiltIn,GarageType_CarPort,GarageType_Detchd,GarageFinish_RFn,GarageFinish_Unf,GarageQual_Fa,GarageQual_Gd,GarageQual_Po,GarageQual_TA
0,False,True,False,False,False,False,False,True,False,False,...,False,False,False,False,True,False,False,False,False,True
1,False,True,False,False,False,False,False,True,False,False,...,False,False,False,False,True,False,False,False,False,True
2,False,True,False,False,False,False,False,True,False,False,...,False,False,False,False,True,False,False,False,False,True
3,False,False,True,False,True,False,False,True,False,False,...,False,False,False,True,False,True,False,False,False,True
4,False,True,False,False,False,False,False,True,False,False,...,False,False,False,False,True,False,False,False,False,True


In [17]:
# This code drops the remaining categorical columns from the new_housing dataframe
new_housing.drop(['BsmtQual', 'HeatingQC', 'CentralAir', 'Electrical', 'KitchenQual',
                'Functional', 'GarageType', 'GarageFinish', 'GarageQual'], axis = 1, inplace = True)

In [18]:
# This code merges all three data frames into one 
new_housing = pd.concat([new_housing, df, df1], axis = 1).replace({True : 1, False : 0})
new_housing.head()

Unnamed: 0,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,...,GarageType_Basment,GarageType_BuiltIn,GarageType_CarPort,GarageType_Detchd,GarageFinish_RFn,GarageFinish_Unf,GarageQual_Fa,GarageQual_Gd,GarageQual_Po,GarageQual_TA
0,60,65.0,8450,7,5,2003,2003,196.0,706,0,...,0,0,0,0,1,0,0,0,0,1
1,20,80.0,9600,6,8,1976,1976,0.0,978,0,...,0,0,0,0,1,0,0,0,0,1
2,60,68.0,11250,7,5,2001,2002,162.0,486,0,...,0,0,0,0,1,0,0,0,0,1
3,70,60.0,9550,7,5,1915,1970,0.0,216,0,...,0,0,0,1,0,1,0,0,0,1
4,60,84.0,14260,8,5,2000,2000,350.0,655,0,...,0,0,0,0,1,0,0,0,0,1


In [19]:
# This code shows the dimension of the new_housing data frame
new_housing.shape

(1460, 262)
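
As a possible shortcut, `pd.get_dummies` can be applied to a whole data frame at once: numeric columns pass through unchanged and the categorical columns are replaced by their dummies, so no manual drop and concat is needed. The sketch below uses a hypothetical toy frame; `dtype=int` yields 0/1 values directly instead of True/False.

```python
import pandas as pd

# Hypothetical toy frame with one numeric and two categorical columns
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "Street": ["Pave", "Grvl", "Pave"],
    "CentralAir": ["Y", "Y", "N"],
})

# Encode every categorical column in one call; drop_first removes the
# first (alphabetically sorted) level of each column
encoded = pd.get_dummies(toy, drop_first=True, dtype=int)
print(encoded)
```

Whether this reproduces exactly the same column order as the step-by-step approach above has not been checked here; the set of dummy columns should match.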

6. Split the data into a training and test set, where the SalePrice column is the target.

In [20]:
# This series of code splits the data between a training and testing set
X = new_housing.drop(columns=["SalePrice"])
y = new_housing['SalePrice']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state= 19, test_size = 0.20)

In [21]:
# This code displays the first 5 rows in the X data frame
X.head()

Unnamed: 0,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,BsmtFinSF2,...,GarageType_Basment,GarageType_BuiltIn,GarageType_CarPort,GarageType_Detchd,GarageFinish_RFn,GarageFinish_Unf,GarageQual_Fa,GarageQual_Gd,GarageQual_Po,GarageQual_TA
0,60,65.0,8450,7,5,2003,2003,196.0,706,0,...,0,0,0,0,1,0,0,0,0,1
1,20,80.0,9600,6,8,1976,1976,0.0,978,0,...,0,0,0,0,1,0,0,0,0,1
2,60,68.0,11250,7,5,2001,2002,162.0,486,0,...,0,0,0,0,1,0,0,0,0,1
3,70,60.0,9550,7,5,1915,1970,0.0,216,0,...,0,0,0,1,0,1,0,0,0,1
4,60,84.0,14260,8,5,2000,2000,350.0,655,0,...,0,0,0,0,1,0,0,0,0,1


In [22]:
# This code displays the first 5 values of the y series (the SalePrice target)
y.head()

0    208500
1    181500
2    223500
3    140000
4    250000
Name: SalePrice, dtype: int64

7. Run a linear regression and report the R2-value and RMSE on the test set.

In [23]:
# This code creates the linear regression model
lr = LinearRegression()

In [24]:
# This code fits the training data into the linear regression
lr.fit(X_train, y_train)

In [25]:
# This code generates predictions for the test set
y_pred = lr.predict(X_test)

In [26]:
# This code calculates the rmse for the regression
root_mean_squared_error(y_test, y_pred)

57991.347929554184

In [27]:
# This code calculates the R-squared for the regression
r2_score(y_test, y_pred)

0.13156622022341757

8. Fit and transform the training features with a PCA so that 90% of the variance is retained

In [28]:
# This code builds the standard scaler
scaleStandard = StandardScaler()

In [74]:
# This code standardizes the features
# Note: the scaler is fit on the full X rather than on X_train only
#X_train = scaleStandard.fit_transform(X_train)
X = scaleStandard.fit_transform(X)

In [75]:
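# This code fits a PCA that retains 90% of the variance
# Note: the PCA is fit on the full scaled X rather than on X_train only
pca = PCA(0.90)
#X_pca = pca.fit_transform(X_train)
X_pca = pca.fit_transform(X)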

9. How many features are in the PCA-transformed matrix?

In [76]:
# This code displays the dimensions of the PCA-transformed matrix
X_pca.shape

(1460, 129)
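
When a fraction such as 0.90 is passed to `PCA`, scikit-learn keeps the smallest number of components whose cumulative explained variance ratio passes that fraction. The sketch below illustrates this on synthetic data (the `demo_features` matrix is a hypothetical stand-in, not the housing data) by comparing the fitted `n_components_` with a manual count from the cumulative ratios.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical stand-in: 10 columns built from 3 latent factors plus small noise
latent = rng.normal(size=(200, 3))
demo_features = latent @ rng.normal(size=(3, 10)) + 0.05 * rng.normal(size=(200, 10))
scaled = StandardScaler().fit_transform(demo_features)

# PCA with a variance target chooses the number of components itself
demo_pca = PCA(0.90).fit(scaled)

# Manual count: first position where the cumulative ratio passes 0.90
full_pca = PCA().fit(scaled)
cumulative = np.cumsum(full_pca.explained_variance_ratio_)
components_needed = int(np.searchsorted(cumulative, 0.90, side="right") + 1)

print(demo_pca.n_components_, components_needed)
```

The second dimension of the transformed matrix equals `n_components_`, so reading that attribute answers step 9 without transforming the data first.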

10. Transform but DO NOT fit the test features with the same PCA.

In [67]:
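# This code scales the X_test data with the fitted scaler and then applies the fitted PCA
# Neither the scaler nor the PCA is refit here
X_pca2 = pca.transform(scaleStandard.transform(X_test))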


11. Repeat step 7 with your PCA transformed data.

In [77]:
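# This code splits the PCA-transformed data into training and test sets
# Note: no random_state is set, so this split differs from the one in step 6
# and the scores below can change between runs
X_train_pca, X_test_pca, y_train, y_test = train_test_split(X_pca, y, test_size= 0.20)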

In [None]:
# This code creates a second linear regression model
lr_1 = LinearRegression()

In [79]:

# This code fits the regression on the PCA-transformed training data
lr_1.fit(X_train_pca, y_train)

In [85]:
# This code generates predictions for the PCA-transformed test set
y_pred = lr_1.predict(X_test_pca)


In [81]:
# This code calculates the RMSE for the PCA regression
root_mean_squared_error(y_test, y_pred)

37668.4465352836

In [82]:
# This code calculates the R-squared for the PCA regression
r2_score(y_test, y_pred)

0.7972510124065972
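
Steps 8 through 11 ask for the scaler and PCA to be fit on the training features only and then reused on the test features. One way to enforce that is a scikit-learn pipeline, sketched below on synthetic regression data (the `demo_*` names and the `make_regression` data are hypothetical stand-ins for the housing features). The RMSE is computed as the square root of `mean_squared_error` so the sketch does not depend on a recent scikit-learn release.

```python
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in data, split with the same seed and test size as step 6
demo_X, demo_y = make_regression(
    n_samples=300, n_features=20, n_informative=10, noise=10.0, random_state=19
)
X_tr, X_te, y_tr, y_te = train_test_split(
    demo_X, demo_y, test_size=0.20, random_state=19
)

# fit() fits the scaler and PCA on the training rows only;
# predict() only transforms the test rows with those fitted steps
pca_model = make_pipeline(StandardScaler(), PCA(0.90), LinearRegression())
pca_model.fit(X_tr, y_tr)

demo_pred = pca_model.predict(X_te)
demo_rmse = mean_squared_error(y_te, demo_pred) ** 0.5
demo_r2 = r2_score(y_te, demo_pred)
print(pca_model.named_steps["pca"].n_components_, demo_rmse, demo_r2)
```

Because the split is made before any fitting and uses a fixed `random_state`, the resulting scores are comparable with a baseline regression trained on the same split.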

12. Take your original training features (from step 6) and apply a min-max scaler to them.

In [83]:
# This code recreates the original (unscaled) features and the training and test split from step 6
X = new_housing.drop(columns=["SalePrice"])
y = new_housing['SalePrice']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state= 19, test_size = 0.20)

13. Find the min-max scaled features in your training set that have a variance above 0.1

14. Transform but DO NOT fit the test features with the same steps applied in steps 12 and 13.

15. Repeat step 7 with the high variance data.
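
Steps 12 through 15 could be approached as in the sketch below, which uses scikit-learn's `MinMaxScaler` and `VarianceThreshold`. It runs on synthetic data (the `demo_*` arrays are hypothetical and only mimic a balanced dummy, a rare dummy, and a continuous column), so the numbers it prints say nothing about the housing data.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(19)
n_rows = 400

# Hypothetical features: balanced dummy, rare dummy, continuous column
demo_X = np.column_stack([
    rng.integers(0, 2, size=n_rows),
    (rng.random(n_rows) < 0.02).astype(int),
    rng.normal(size=n_rows),
])
demo_y = 3.0 * demo_X[:, 0] + demo_X[:, 2] + rng.normal(scale=0.1, size=n_rows)

X_tr, X_te, y_tr, y_te = train_test_split(
    demo_X, demo_y, test_size=0.20, random_state=19
)

# Fit the scaler and the selector on the training features only
scaler = MinMaxScaler()
selector = VarianceThreshold(threshold=0.1)
X_tr_high = selector.fit_transform(scaler.fit_transform(X_tr))

# Transform (do not fit) the test features with the same fitted steps
X_te_high = selector.transform(scaler.transform(X_te))

model = LinearRegression().fit(X_tr_high, y_tr)
demo_pred = model.predict(X_te_high)
print(selector.get_support())
print(mean_squared_error(y_te, demo_pred) ** 0.5, r2_score(y_te, demo_pred))
```

After min-max scaling, a 0/1 dummy with share p of ones has variance p(1 - p), so only reasonably balanced dummies clear a 0.1 threshold, while continuous columns squeezed into [0, 1] often fall below it.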

16. Summarize your findings.

### Part 2 Categorical Feature Selection

1. Import the data as a data frame and ensure it is loaded correctly.

2. Convert the categorical features (all of them) to dummy variables.

3. Split the data into a training and test set.

4. Fit a decision tree classifier on the training set.

5. Report the accuracy and create a confusion matrix for the model prediction on the test set.

6. Create a visualization of the decision tree.

7. Use a χ2-statistic selector to pick the five best features for this data 

8. Which five features were selected in step 7? Hint: Use the get_support function.

9. Repeat steps 4 and 5 with the five best features selected in step 7.

10. Summarize your findings.
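
The Part 2 workflow (steps 2 through 9) could be sketched as follows. The data frame here is synthetic and hypothetical: the column names, categories, and the rule that generates the label are invented for illustration, since the assignment's data set is not shown. `export_text` is used as a plain-text view of the tree; `sklearn.tree.plot_tree` would be the graphical alternative.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(19)
n_rows = 500

# Hypothetical categorical data: only "color" determines the label
demo = pd.DataFrame({
    "color": rng.choice(["red", "green", "blue"], size=n_rows),
    "texture": rng.choice(["smooth", "rough"], size=n_rows),
    "region": rng.choice(["north", "south", "east", "west"], size=n_rows),
})
demo_target = (demo["color"] == "red").astype(int)

# Step 2: convert all categorical features to dummy variables
demo_dummies = pd.get_dummies(demo, dtype=int)

# Step 3: split into training and test sets
X_tr, X_te, y_tr, y_te = train_test_split(
    demo_dummies, demo_target, test_size=0.20, random_state=19
)

# Steps 4-6: fit a tree, score it, and print a text view of its rules
tree = DecisionTreeClassifier(random_state=19).fit(X_tr, y_tr)
tree_pred = tree.predict(X_te)
full_accuracy = accuracy_score(y_te, tree_pred)
full_confusion = confusion_matrix(y_te, tree_pred)
tree_rules = export_text(tree, feature_names=list(demo_dummies.columns))

# Steps 7-8: chi-squared selection of the five best dummy columns
selector = SelectKBest(chi2, k=5).fit(X_tr, y_tr)
selected_columns = X_tr.columns[selector.get_support()]

# Step 9: refit and rescore with the selected columns only
small_tree = DecisionTreeClassifier(random_state=19).fit(X_tr[selected_columns], y_tr)
small_accuracy = accuracy_score(y_te, small_tree.predict(X_te[selected_columns]))

print(full_accuracy, small_accuracy)
print(full_confusion)
print(list(selected_columns))
print(tree_rules)
```

The chi-squared test requires non-negative features, which 0/1 dummies satisfy. The selector is fit on the training split only, so the test set plays no part in choosing the features.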