# Housing Price Prediction w/ Kaggle Dataset

### Purpose: To further refine my data science skills and build concrete evidence of them.

### Tools: Jupyter Notebook, Python 3.11, TensorFlow, Pandas, PySpark

### Section 1: Data Exploration

#### The dataset ships with a description of every field. We will load it here:

In [3]:
with open('HousingPredictionData/data_description.txt','r') as read_file:
    print(read_file.read())

MSSubClass: Identifies the type of dwelling involved in the sale.	

        20	1-STORY 1946 & NEWER ALL STYLES
        30	1-STORY 1945 & OLDER
        40	1-STORY W/FINISHED ATTIC ALL AGES
        45	1-1/2 STORY - UNFINISHED ALL AGES
        50	1-1/2 STORY FINISHED ALL AGES
        60	2-STORY 1946 & NEWER
        70	2-STORY 1945 & OLDER
        75	2-1/2 STORY ALL AGES
        80	SPLIT OR MULTI-LEVEL
        85	SPLIT FOYER
        90	DUPLEX - ALL STYLES AND AGES
       120	1-STORY PUD (Planned Unit Development) - 1946 & NEWER
       150	1-1/2 STORY PUD - ALL AGES
       160	2-STORY PUD - 1946 & NEWER
       180	PUD - MULTILEVEL - INCL SPLIT LEV/FOYER
       190	2 FAMILY CONVERSION - ALL STYLES AND AGES

MSZoning: Identifies the general zoning classification of the sale.
		
       A	Agriculture
       C	Commercial
       FV	Floating Village Residential
       I	Industrial
       RH	Residential High Density
       RL	Residential Low Density
       RP	Residential Low Density Park 
       RM

#### Based on this document, we will have to one-hot encode a large number of categorical variables (e.g. ratings like excellent, good, ok, bad, very bad).

**We need to one-hot encode the following categories (according to the file):** MSSubClass, MSZoning, Street, Alley, LotShape, LandContour, Utilities, LotConfig, LandSlope, Neighborhood, Condition1, Condition2, BldgType, HouseStyle, OverallQual, OverallCond, RoofStyle, RoofMatl, Exterior1st, Exterior2nd, MasVnrType, ExterQual, ExterCond, Foundation, BsmtQual, BsmtCond, BsmtExposure, BsmtFinType1, BsmtFinType2, Heating, HeatingQC, Electrical, KitchenQual, Functional, FireplaceQu, GarageType, GarageFinish, GarageQual, GarageCond, PavedDrive, PoolQC, Fence, CentralAir, MiscFeature, MoSold, SaleType, SaleCondition

**Continuous variables that need to be normalized:** LotFrontage, LotArea, MasVnrArea, BsmtFinSF2, BsmtUnfSF, TotalBsmtSF, 1stFlrSF, 2ndFlrSF, LowQualFinSF, GrLivArea, BsmtFullBath, BsmtHalfBath, FullBath, HalfBath, BedroomAbvGr (listed as Bedroom in the description file), KitchenAbvGr (listed as Kitchen in the description file), TotRmsAbvGrd, Fireplaces, GarageYrBlt, GarageCars, GarageArea, WoodDeckSF, OpenPorchSF, EnclosedPorch, 3SsnPorch, ScreenPorch, PoolArea, MiscVal, YearBuilt, YearRemodAdd, YrSold
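
To make the one-hot encoding step concrete, here is a minimal plain-Python sketch for a single categorical column. The helper names `fit_categories` and `one_hot` are hypothetical and exist only for this illustration; in the notebook itself the same idea would more likely be handled by PySpark's `StringIndexer` and `OneHotEncoder` from `pyspark.ml.feature`.

```python
def fit_categories(values):
    """Return the sorted list of distinct categories seen in the data."""
    return sorted(set(values))


def one_hot(value, categories):
    """Encode one value as a 0/1 list with a single 1 at the category's position.

    A value that was not seen during fitting maps to an all-zero list.
    """
    return [1 if value == category else 0 for category in categories]


# A few MSZoning codes taken from the description above.
zoning = ["RL", "RM", "RL", "FV"]
categories = fit_categories(zoning)  # ['FV', 'RL', 'RM']
encoded = [one_hot(z, categories) for z in zoning]

for z, row in zip(zoning, encoded):
    print(z, row)
```

Note that Spark's `OneHotEncoder` drops the last category by default (`dropLast=True`), so its output vectors are one element shorter than the lists in this sketch.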

### Section 2: Loading and Transforming Data

#### We have a large number of features here. We need to apply one-hot encoding to our categorical features and normalization to our continuous features. To make sure that both transformations stay consistent, we will fit the encoder and scaler models on the training data and reuse the fitted models on any later data.
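
The consistency point can be illustrated with a small plain-Python sketch: the minimum and maximum are learned once from the training values and then reused for any other data. The function names and the sample numbers are invented for illustration, and the code follows the standard min-max formula `(x - min) / (max - min)` rather than reproducing PySpark's implementation.

```python
def fit_min_max(values):
    """Learn the scaling parameters (min and max) from training values."""
    return min(values), max(values)


def transform_min_max(values, lo, hi):
    """Rescale values with previously learned parameters.

    A constant column (hi == lo) is mapped to 0.5; that is a choice made
    for this sketch.
    """
    if hi == lo:
        return [0.5 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical lot areas, invented for this example.
train_lot_area = [8000, 10000, 12000]
test_lot_area = [9000, 14000]

lo, hi = fit_min_max(train_lot_area)
train_scaled = transform_min_max(train_lot_area, lo, hi)
test_scaled = transform_min_max(test_lot_area, lo, hi)

print(train_scaled)  # [0.0, 0.5, 1.0]
print(test_scaled)   # [0.25, 1.5]
```

The second test value lands above 1 because it lies outside the range seen during fitting. Refitting on the test data would hide this, but it would also put the two datasets on different scales.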

#### We will now load our data into a PySpark DataFrame for processing. 

In [5]:
from pyspark.sql import SparkSession

In [8]:
spark = SparkSession.builder.appName("Read CSV").getOrCreate()

In [10]:
myDataFrame = spark.read.csv("HousingPredictionData/train.csv",header=True,inferSchema=True)

In [15]:
myDataFrame.select("MSSubClass").show()

+----------+
|MSSubClass|
+----------+
|        60|
|        20|
|        60|
|        70|
|        60|
|        50|
|        20|
|        60|
|        50|
|       190|
|        20|
|        60|
|        20|
|        20|
|        20|
|        45|
|        20|
|        90|
|        20|
|        20|
+----------+
only showing top 20 rows



#### We will first define which features we will one-hot encode and which features we will min-max scale (using our analysis above).

In [None]:
# Categorical columns to one-hot encode (names as they appear in train.csv).
oneHotList=['MSSubClass', 'MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig',
            'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'OverallQual', 'OverallCond',
            'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
            'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'Heating', 'HeatingQC', 'Electrical', 'KitchenQual', 'Functional',
            'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond', 'PavedDrive', 'PoolQC', 'Fence', 'CentralAir',
            'MiscFeature', 'MoSold', 'SaleType', 'SaleCondition']
# Continuous columns to min-max scale. The CSV header uses BedroomAbvGr and KitchenAbvGr
# where the description file says Bedroom and Kitchen.
minMaxList=['LotFrontage', 'LotArea', 'MasVnrArea','BsmtFinSF2','BsmtUnfSF','TotalBsmtSF','1stFlrSF','2ndFlrSF','LowQualFinSF','GrLivArea',
            'BsmtFullBath','BsmtHalfBath','FullBath','HalfBath','BedroomAbvGr','KitchenAbvGr', 'TotRmsAbvGrd','Fireplaces','GarageYrBlt',
            'GarageCars','GarageArea','WoodDeckSF','OpenPorchSF','EnclosedPorch','3SsnPorch','ScreenPorch','PoolArea','MiscVal', 'YearBuilt','YearRemodAdd','YrSold']
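
A single misspelled or merged name in these lists only surfaces later as a missing-column error, so it can help to check them against the DataFrame header first. The sketch below uses a hypothetical helper, `missing_columns`, and a short hand-written header so that it runs without Spark; in the notebook, `myDataFrame.columns` would supply the real header.

```python
def missing_columns(wanted, available):
    """Return the names in `wanted` that do not appear in `available`, in order."""
    available_set = set(available)
    return [name for name in wanted if name not in available_set]


# Hand-written stand-in for myDataFrame.columns (only a few columns shown).
header = ["Id", "MSSubClass", "MSZoning", "Street", "LotArea", "SalePrice"]

# Two names accidentally merged into one string are reported as missing.
candidate = ["MSSubClass", "MSZoning, Street", "LotArea"]
print(missing_columns(candidate, header))  # ['MSZoning, Street']
```

An empty result for both `oneHotList` and `minMaxList` would indicate that every listed name matches a column in the loaded data.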

#### In order to use a MinMaxScaler properly in PySpark, we will first have to construct a vector column from the continuous features by using the VectorAssembler.
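
Conceptually, the assembler packs several numeric columns into one vector per row, and the scaler then rescales each position of that vector independently. The plain-Python sketch below mimics that two-step flow. The functions `assemble` and `min_max_scale` and the sample rows are invented for illustration and are not the PySpark API.

```python
def assemble(rows, columns):
    """Pack the chosen columns of each row (a dict) into one list per row."""
    return [[float(row[name]) for name in columns] for row in rows]


def min_max_scale(vectors):
    """Rescale each vector position to [0, 1] using that position's min and max."""
    lows = [min(col) for col in zip(*vectors)]
    highs = [max(col) for col in zip(*vectors)]
    return [
        # A constant position (hi == lo) is mapped to 0.5 in this sketch.
        [(v - lo) / (hi - lo) if hi > lo else 0.5 for v, lo, hi in zip(vec, lows, highs)]
        for vec in vectors
    ]


# Invented sample rows; the column names come from the dataset description.
rows = [
    {"LotArea": 8000, "GarageCars": 0},
    {"LotArea": 10000, "GarageCars": 2},
    {"LotArea": 12000, "GarageCars": 1},
]
vectors = assemble(rows, ["LotArea", "GarageCars"])
scaled = min_max_scale(vectors)

print(vectors)  # [[8000.0, 0.0], [10000.0, 2.0], [12000.0, 1.0]]
print(scaled)   # [[0.0, 0.0], [0.5, 1.0], [1.0, 0.5]]
```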

In [None]:
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import MinMaxScaler

vectoer_assembler = VectorAssembler
minmax_scaler = MinMaxScaler(inputCol="