Welcome to the linear regression model for CS4650 Big Data, Analysis, and Cloud Computing.

This model uses a dataset from the Kaggle competition:

`House Prices - Advanced Regression Techniques`

The dataset includes these features and their descriptions:

---

    1. MSSubClass: Identifies the type of dwelling involved in the sale. (categorical) 
    
    2. MSZoning: Identifies the general zoning classification of the sale. (categorical)
    
    3. LotFrontage: Linear feet of street connected to property (numeric)
    
    4. LotArea: Lot size in square feet (numeric)
    
    5. Street: Type of road access to property (categorical)
    
    6. Alley: Type of alley access to property (categorical)
    
    7. LotShape: General shape of property (categorical)
    
    8. LandContour: Flatness of the property (categorical)
    
    9. Utilities: Type of utilities available (categorical)
    
    10. LotConfig: Lot configuration (categorical)
    
    11. LandSlope: Slope of property (categorical)
    
    12. Neighborhood: Physical locations within Ames city limits (categorical)
    
    13. Condition1: Proximity to various conditions (categorical)
    
    14. Condition2: Proximity to various conditions (if more than one is present) (categorical)
    
    15. BldgType: Type of dwelling (categorical)
    
    16. HouseStyle: Style of dwelling (categorical)
    
    17. OverallQual: Rates the overall material and finish of the house (categorical)
    
    18. OverallCond: Rates the overall condition of the house (categorical)
    
    19. YearBuilt: Original construction date (numeric)
    
    20. YearRemodAdd: Remodel date (same as construction date if no remodeling or additions) (numeric)
    
    21. RoofStyle: Type of roof (categorical)
    
    22. RoofMatl: Roof material (categorical)
    
    23. Exterior1st: Exterior covering on house (categorical)
    
    24. Exterior2nd: Exterior covering on house (if more than one material) (categorical)
    
    25. MasVnrType: Masonry veneer type (categorical)
    
    26. MasVnrArea: Masonry veneer area in square feet (numeric)
    
    27. ExterQual: Evaluates the quality of the material on the exterior (categorical)
    
    28. ExterCond: Evaluates the present condition of the material on the exterior (categorical)
    
    29. Foundation: Type of foundation (categorical)
    
    30. BsmtQual: Evaluates the height of the basement (categorical)
    
    31. BsmtCond: Evaluates the general condition of the basement (categorical)
    
    32. BsmtExposure: Refers to walkout or garden level walls (categorical)
    
    33. BsmtFinType1: Rating of basement finished area (categorical)
    
    34. BsmtFinSF1: Type 1 finished square feet (numeric)

    35. BsmtFinType2: Rating of basement finished area (if multiple types) (categorical)
    
    36. BsmtFinSF2: Type 2 finished square feet (numeric)
    
    37. BsmtUnfSF: Unfinished square feet of basement area (numeric)
    
    38. TotalBsmtSF: Total square feet of basement area (numeric)

    39. Heating: Type of heating (categorical)
    
    40. HeatingQC: Heating quality and condition (categorical)
    
    41. CentralAir: Central air conditioning (categorical)
    
    42. Electrical: Electrical system (categorical)
    
    43. 1stFlrSF: First Floor square feet (numeric)
    
    44. 2ndFlrSF: Second floor square feet (numeric)
    
    45. LowQualFinSF: Low quality finished square feet (all floors) (numeric)
    
    46. GrLivArea: Above grade (ground) living area square feet (numeric)
    
    47. BsmtFullBath: Basement full bathrooms (numeric)
    
    48. BsmtHalfBath: Basement half bathrooms (numeric)
    
    49. FullBath: Full bathrooms above grade (numeric)
    
    50. HalfBath: Half baths above grade (numeric)
    
    51. BedroomAbvGr: Bedrooms above grade (does NOT include basement bedrooms) (numeric)
    
    52. KitchenAbvGr: Kitchens above grade (numeric)
    
    53. KitchenQual: Kitchen quality (categorical)
    
    54. TotRmsAbvGrd: Total rooms above grade (does not include bathrooms) (numeric)
    
    55. Functional: Home functionality (assume typical unless deductions are warranted) (categorical)
    
    56. Fireplaces: Number of fireplaces (numeric)
    
    57. FireplaceQu: Fireplace quality (categorical)
    
    58. GarageType: Garage location (categorical)
    
    59. GarageYrBlt: Year garage was built (numeric)
    
    60. GarageFinish: Interior finish of the garage (categorical)
    
    61. GarageCars: Size of garage in car capacity (numeric)
    
    62. GarageArea: Size of garage in square feet (numeric)
    
    63. GarageQual: Garage quality (categorical)
    
    64. GarageCond: Garage condition (categorical)
    
    65. PavedDrive: Paved driveway (categorical)
    
    66. WoodDeckSF: Wood deck area in square feet (numeric)
    
    67. OpenPorchSF: Open porch area in square feet (numeric)
    
    68. EnclosedPorch: Enclosed porch area in square feet (numeric)
    
    69. 3SsnPorch: Three season porch area in square feet (numeric)
    
    70. ScreenPorch: Screen porch area in square feet (numeric)
    
    71. PoolArea: Pool area in square feet (numeric)
    
    72. PoolQC: Pool quality (categorical)
    
    73. Fence: Fence quality (categorical)
    
    74. MiscFeature: Miscellaneous feature not covered in other categories (categorical)
    
    75. MiscVal: Dollar value of miscellaneous feature (numeric)
    
    76. MoSold: Month Sold (MM) (numeric)
    
    77. YrSold: Year Sold (YYYY) (numeric)
    
    78. SaleType: Type of sale (categorical)
    
    79. SaleCondition: Condition of sale (categorical)


---    







In [30]:
import pandas as pd 
import numpy as np
from sklearn.model_selection import KFold
from sklearn.model_selection import train_test_split 
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn import preprocessing
import matplotlib.pyplot as plt

In [32]:
df=pd.read_csv("./train.csv")

In [34]:
object_columns = df.loc[:, df.dtypes == object]
df_converted = pd.get_dummies(df, columns= object_columns.columns)
df_converted.sample()

Unnamed: 0,Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,...,SaleType_ConLw,SaleType_New,SaleType_Oth,SaleType_WD,SaleCondition_Abnorml,SaleCondition_AdjLand,SaleCondition_Alloca,SaleCondition_Family,SaleCondition_Normal,SaleCondition_Partial
462,463,20,60.0,8281,5,5,1965,1965,0.0,553,...,False,False,False,True,False,False,False,False,True,False
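
As a minimal sketch of what the one-hot encoding step does, the example below applies `pd.get_dummies` to a small made-up frame. The frame `toy` and its values are assumptions for illustration only; they are not rows from `train.csv`.

```python
import pandas as pd

# Toy frame standing in for train.csv (values are invented for illustration).
toy = pd.DataFrame({
    "LotArea": [8450, 9600, 11250],
    "Street": ["Pave", "Grvl", "Pave"],
})

# Each level of the categorical column becomes its own indicator column,
# named <column>_<level>; numeric columns are passed through unchanged.
encoded = pd.get_dummies(toy, columns=["Street"])

print(encoded.columns.tolist())
print(encoded)
```

The original `Street` column is dropped and replaced by one indicator column per level, which is why the converted frame above has many more columns than the raw data.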


The following cell converts the True/False values in the dummy columns to 1 and 0.

In [37]:
df_converted = df_converted*1
df_converted.sample()

Unnamed: 0,Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,...,SaleType_ConLw,SaleType_New,SaleType_Oth,SaleType_WD,SaleCondition_Abnorml,SaleCondition_AdjLand,SaleCondition_Alloca,SaleCondition_Family,SaleCondition_Normal,SaleCondition_Partial
821,822,20,60.0,6000,4,4,1953,1953,0.0,0,...,0,0,0,1,0,0,0,0,1,0
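
The multiplication trick works because pandas promotes booleans to integers in arithmetic. The sketch below, on a hypothetical two-row frame, shows that it gives the same values as an explicit cast of the boolean columns; the cast is offered only as a possible alternative, not as the notebook's method.

```python
import pandas as pd

# Toy frame with one numeric and one boolean column (invented values).
toy = pd.DataFrame({
    "LotArea": [8450, 9600],
    "Street_Pave": [True, False],
})

# Multiplying by 1 turns True/False into 1/0 and leaves numbers unchanged.
as_int_mul = toy * 1

# More explicit alternative: cast only the boolean columns to int.
bool_cols = toy.select_dtypes(include="bool").columns
as_int_cast = toy.astype({col: int for col in bool_cols})

print(as_int_mul)
print(as_int_cast)
```

Casting only the boolean columns makes the intent visible and avoids touching columns that are already numeric.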


In [39]:
scaler_Standard = StandardScaler()

columns_to_scaler_Standard = df_converted.drop(columns=['Id'])
scaled_data_Standard = scaler_Standard.fit_transform(columns_to_scaler_Standard)

scaled_df_Standard = pd.DataFrame(scaled_data_Standard, columns=columns_to_scaler_Standard.columns)
scaled_df_Standard = pd.concat([df_converted[['Id']], scaled_df_Standard], axis=1)
scaled_df_Standard.sample()

Unnamed: 0,Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,...,SaleType_ConLw,SaleType_New,SaleType_Oth,SaleType_WD,SaleCondition_Abnorml,SaleCondition_AdjLand,SaleCondition_Alloca,SaleCondition_Family,SaleCondition_Normal,SaleCondition_Partial
1044,1045,-0.872563,0.409895,-0.091886,1.374795,-0.5172,0.322337,-0.187309,-0.572835,1.448343,...,-0.058621,-0.301962,-0.045376,0.390293,-0.272616,-0.052414,-0.091035,-0.117851,0.467651,-0.305995
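
For reference, `StandardScaler` rescales each column to zero mean and unit variance, i.e. z = (x - mean) / std, using the population standard deviation. The sketch below checks this on a small made-up array; the array `values` is an assumption for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# One made-up feature column with four observations.
values = np.array([[1.0], [2.0], [3.0], [4.0]])

# Scaling done by scikit-learn.
scaled = StandardScaler().fit_transform(values)

# The same computation by hand: subtract the column mean and divide by the
# population standard deviation (NumPy's default, ddof=0).
manual = (values - values.mean(axis=0)) / values.std(axis=0)

print(np.allclose(scaled, manual))
```

This also explains the negative values in the output above: any value below its column mean maps to a negative z-score.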


Next, we create a MinMax scaler so that we can compare the two types of scalers (Standard vs. MinMax).

In [42]:
scaler_MinMax = preprocessing.MinMaxScaler()

columns_to_scaler_MinMax = df_converted.drop(columns=['Id'])
scaled_data_MinMax = scaler_MinMax.fit_transform(columns_to_scaler_MinMax)

scaled_df_MinMax = pd.DataFrame(scaled_data_MinMax, columns=columns_to_scaler_MinMax.columns)
scaled_df_MinMax = pd.concat([df_converted[['Id']], scaled_df_MinMax], axis=1)
scaled_df_MinMax.sample()

Unnamed: 0,Id,MSSubClass,LotFrontage,LotArea,OverallQual,OverallCond,YearBuilt,YearRemodAdd,MasVnrArea,BsmtFinSF1,...,SaleType_ConLw,SaleType_New,SaleType_Oth,SaleType_WD,SaleCondition_Abnorml,SaleCondition_AdjLand,SaleCondition_Alloca,SaleCondition_Family,SaleCondition_Normal,SaleCondition_Partial
248,249,0.235294,0.174658,0.04682,0.666667,0.5,0.949275,0.883333,0.063125,0.0,...,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0
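
To make the difference between the two scalers concrete, the sketch below applies both to the same made-up column, which includes one large value. `MinMaxScaler` maps each column to the range [0, 1] via (x - min) / (max - min), while `StandardScaler` centers it at zero with unit variance. The array `values` is an assumption for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# One made-up feature column containing a large outlier.
values = np.array([[1.0], [2.0], [3.0], [100.0]])

minmax = MinMaxScaler().fit_transform(values)
standard = StandardScaler().fit_transform(values)

# Min-max scaling by hand, to show what the scaler computes.
col_min = values.min(axis=0)
col_max = values.max(axis=0)
manual_minmax = (values - col_min) / (col_max - col_min)

print("min-max :", minmax.ravel())
print("standard:", standard.ravel())
```

With min-max scaling, a single extreme value compresses the remaining values toward 0, whereas standardized values are unbounded and can be negative. Indicator columns that are already 0/1 are left unchanged by min-max scaling, as the output above shows.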


Now create the K-fold tests and analyze the accuracy of each test.
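
As a rough sketch of the K-fold procedure, the example below fits a `LinearRegression` on each training split and scores it on the held-out split. It runs on synthetic data rather than the housing frame; the arrays `X` and `y`, the choice of 5 folds, and the random seeds are all assumptions. For a regressor, `score` returns R² rather than a classification accuracy.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# Synthetic regression data (hypothetical stand-in for the scaled features
# and the sale price target).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=100)

# Split the rows into 5 folds; each fold serves once as the test set.
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
for train_idx, test_idx in kfold.split(X):
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    # R^2 on the held-out fold.
    fold_scores.append(model.score(X[test_idx], y[test_idx]))

print(fold_scores)
print("mean R^2:", np.mean(fold_scores))
```

On the real data, note that `LinearRegression` does not accept missing values, so columns such as `LotFrontage` would need to be imputed or dropped before fitting.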