# Housing Price Prediction w/ Kaggle Dataset

#### Note: I have decided to abandon this notebook because I realize that the PySpark library does not have all of the features that I want in order to successfully build my model. Instead of making an awkward workaround, I will instead write the functions that I need myself.

### Purpose: To refine my data science skills even more and have more concrete evidence of them.

### Tools: Jupyter Notebook, Python 3.11, TensorFlow, Pandas, PySpark

### Section 1: Data Exploration

#### We are already given a description of all of the data. We will load it here:

In [3]:
with open('HousingPredictionData/data_description.txt','r') as read_file:
    print(read_file.read())

MSSubClass: Identifies the type of dwelling involved in the sale.	

        20	1-STORY 1946 & NEWER ALL STYLES
        30	1-STORY 1945 & OLDER
        40	1-STORY W/FINISHED ATTIC ALL AGES
        45	1-1/2 STORY - UNFINISHED ALL AGES
        50	1-1/2 STORY FINISHED ALL AGES
        60	2-STORY 1946 & NEWER
        70	2-STORY 1945 & OLDER
        75	2-1/2 STORY ALL AGES
        80	SPLIT OR MULTI-LEVEL
        85	SPLIT FOYER
        90	DUPLEX - ALL STYLES AND AGES
       120	1-STORY PUD (Planned Unit Development) - 1946 & NEWER
       150	1-1/2 STORY PUD - ALL AGES
       160	2-STORY PUD - 1946 & NEWER
       180	PUD - MULTILEVEL - INCL SPLIT LEV/FOYER
       190	2 FAMILY CONVERSION - ALL STYLES AND AGES

MSZoning: Identifies the general zoning classification of the sale.
		
       A	Agriculture
       C	Commercial
       FV	Floating Village Residential
       I	Industrial
       RH	Residential High Density
       RL	Residential Low Density
       RP	Residential Low Density Park 
       RM

#### Based on this document, we will have to do a lot of one-hot encoding for all of these categorical variables (e.g. categories like excellent, good, ok, bad, very bad).

**We need to do one-hot encoding for the following categories (according to the file):** MSSubClass, MSZoning, Street, Alley, LotShape, LandContour, Utilities, LotConfig, LandSlope, Neighborhood, Condition1, Condition2, BldgType, HouseStyle, OverallQual, OverallCond, RoofStyle, RoofMatl, Exterior1st, Exterior2nd, MasVnrType, ExterQual, ExterCond, Foundation, BsmtQual, BsmtCond, BsmtExposure, BsmtFinType1, BsmtFinType2, Heating, HeatingQC, Electrical, KitchenQual, Functional, FireplaceQu, GarageType, GarageFinish, GarageQual, GarageCond, PavedDrive, PoolQC, Fence, CentralAir, MiscFeature, MoSold, SaleType, SaleCondition

**Continuous variables that need to be normalized:** LotFrontage, LotArea, MasVnrArea, BsmtFinSF2, BsmtUnfSF, TotalBsmtSF, 1stFlrSF, 2ndFlrSF, LowQualFinSF, GrLivArea, BsmtFullBath, BsmtHalfBath, FullBath, HalfBath, BedroomAbvGr, KitchenAbvGr, TotRmsAbvGrd, Fireplaces, GarageYrBlt, GarageCars, GarageArea, WoodDeckSF, OpenPorchSF, EnclosedPorch, 3SsnPorch, ScreenPorch, PoolArea, MiscVal, YearBuilt, YearRemodAdd, YrSold

#### Note that the data descriptor file is wrong about two of the column names! Bedroom and Kitchen do not exist; the columns are called BedroomAbvGr and KitchenAbvGr, respectively. I figured this out later when I got an error using the pipeline.

### Section 2: Loading and Transforming Data

#### Obviously, we have a huge number of features here. We need to apply one-hot encoding to our categorical features and normalization to our continuous features. To make sure that the encoding and normalization stay consistent, we will fit the transformers on the data once and reuse the fitted model.

#### We will now load our data into a PySpark DataFrame for processing. 

In [45]:
from pyspark.sql import SparkSession

In [46]:
spark = SparkSession.builder.appName("Read CSV").getOrCreate()

In [47]:
myDataFrame = spark.read.csv("HousingPredictionData/train.csv",header=True,inferSchema=True)

#### We will first define which features we will one-hot encode and which features we will min-max scale (using our analysis above).

In [75]:
oneHotList=['MSSubClass', 'MSZoning', 'Street', 'Alley', 'LotShape', 'LandContour', 'Utilities', 'LotConfig', 
            'LandSlope', 'Neighborhood', 'Condition1', 'Condition2', 'BldgType', 'HouseStyle', 'OverallQual', 'OverallCond',
            'RoofStyle', 'RoofMatl', 'Exterior1st', 'Exterior2nd', 'MasVnrType', 'ExterQual', 'ExterCond', 'Foundation', 'BsmtQual',
            'BsmtCond', 'BsmtExposure', 'BsmtFinType1', 'BsmtFinType2', 'Heating', 'HeatingQC', 'Electrical', 'KitchenQual', 'Functional',
            'FireplaceQu', 'GarageType', 'GarageFinish', 'GarageQual', 'GarageCond', 'PavedDrive', 'PoolQC', 'Fence', 'CentralAir',
            'MiscFeature', 'MoSold', 'SaleType', 'SaleCondition']
minMaxList=['LotFrontage', 'LotArea', 'MasVnrArea','BsmtFinSF2','BsmtUnfSF','TotalBsmtSF','1stFlrSF','2ndFlrSF','LowQualFinSF','GrLivArea',
            'BsmtFullBath','BsmtHalfBath','FullBath','HalfBath','BedroomAbvGr','KitchenAbvGr', 'TotRmsAbvGrd','Fireplaces','GarageYrBlt',
            'GarageCars','GarageArea','WoodDeckSF','OpenPorchSF','EnclosedPorch','3SsnPorch','ScreenPorch','PoolArea','MiscVal', 'YearBuilt','YearRemodAdd','YrSold']  
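
Because the description file and the CSV header can disagree (as with Bedroom versus BedroomAbvGr), it may help to compare the feature lists against the actual header before building any pipeline. The sketch below is plain Python; `find_missing_columns` is a hypothetical helper, the header is a made-up stand-in, and in the notebook `available` would presumably be `myDataFrame.columns`.

```python
def find_missing_columns(requested, available):
    """Return the requested column names that are absent from `available`,
    preserving the order in which they were requested."""
    available_set = set(available)
    return [name for name in requested if name not in available_set]


# Toy header standing in for myDataFrame.columns (illustrative values only).
header = ['Id', 'LotArea', 'BedroomAbvGr', 'KitchenAbvGr', 'SalePrice']

# 'Bedroom' and 'Kitchen' are the names used by the description file.
requested = ['LotArea', 'Bedroom', 'Kitchen']

missing = find_missing_columns(requested, header)
print(missing)
```

An empty result means every requested name exists; anything else points at a name to correct before the pipeline is fitted.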

#### We will now check that each column has the correct data type.

In [87]:
myDataFrame.printSchema()

root
 |-- Id: integer (nullable = true)
 |-- MSSubClass: string (nullable = true)
 |-- MSZoning: string (nullable = true)
 |-- LotFrontage: integer (nullable = true)
 |-- LotArea: integer (nullable = true)
 |-- Street: string (nullable = true)
 |-- Alley: string (nullable = true)
 |-- LotShape: string (nullable = true)
 |-- LandContour: string (nullable = true)
 |-- Utilities: string (nullable = true)
 |-- LotConfig: string (nullable = true)
 |-- LandSlope: string (nullable = true)
 |-- Neighborhood: string (nullable = true)
 |-- Condition1: string (nullable = true)
 |-- Condition2: string (nullable = true)
 |-- BldgType: string (nullable = true)
 |-- HouseStyle: string (nullable = true)
 |-- OverallQual: string (nullable = true)
 |-- OverallCond: string (nullable = true)
 |-- YearBuilt: integer (nullable = true)
 |-- YearRemodAdd: integer (nullable = true)
 |-- RoofStyle: string (nullable = true)
 |-- RoofMatl: string (nullable = true)
 |-- Exterior1st: string (nullable = true)
 |-- E

#### The LotFrontage, MasVnrArea, and GarageYrBlt columns are all strings, but they should be integers! Let's make sure that each column has the right data type.

In [80]:
def fixTypes(dataframe):
    # Cast from the DataFrame passed in, not the global myDataFrame,
    # so the function also works on other data (e.g. the test set).
    for oneHotColumn in oneHotList:
        dataframe=dataframe.withColumn(oneHotColumn,dataframe[oneHotColumn].cast("string"))
    for minMaxColumn in minMaxList:
        dataframe=dataframe.withColumn(minMaxColumn,dataframe[minMaxColumn].cast("int"))
    return dataframe

#### Now we have to make sure that we have no null values in our columns. We are ok with null values in our categorical variables - this will one-hot encode to its own column, and that will not affect the data processing. We are not ok with null values in our continuous variables - the nulls will cause errors. Thus, we will set the nulls to 0.

In [120]:
def fixNA(dataframe):
    dataframe=dataframe.fillna(0,minMaxList)
    return dataframe
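
One detail worth checking here: in Spark, casting a string that is not a valid number (for example the text `NA`) to `int` produces a null, and `fillna` with a numeric value ignores columns that are not numeric. The plain-Python sketch below imitates that behavior to show why the fill only takes effect once a column has been cast. `cast_to_int` and `fill_nulls` are hypothetical stand-ins, not PySpark functions, and the sample values are made up.

```python
def cast_to_int(values):
    """Imitate a string-to-int cast: unparseable entries become None."""
    result = []
    for value in values:
        try:
            result.append(int(value))
        except (TypeError, ValueError):
            result.append(None)
    return result


def fill_nulls(values, replacement=0):
    """Replace None entries with `replacement`, leaving other values untouched."""
    return [replacement if value is None else value for value in values]


raw_column = ['65', 'NA', '80', None]

# Filling before the cast leaves the text 'NA' in place,
# so it still turns into None when the cast happens afterwards.
filled_first = cast_to_int(fill_nulls(raw_column, replacement='0'))

# Casting first turns 'NA' into None, which the fill can then replace.
cast_first = fill_nulls(cast_to_int(raw_column), replacement=0)

print(filled_first)
print(cast_first)
```

Only the cast-then-fill order leaves the column free of missing values, which is what the later vector assembly step needs.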

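In [121]:
# Cast first: fillna(0, ...) skips string columns, so nulls can only be filled once the columns are numeric.
myDataFrame=fixTypes(myDataFrame)

In [122]:
myDataFrame=fixNA(myDataFrame)
myDataFrame.printSchema()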

root
 |-- Id: integer (nullable = true)
 |-- MSSubClass: string (nullable = true)
 |-- MSZoning: string (nullable = true)
 |-- LotFrontage: integer (nullable = true)
 |-- LotArea: integer (nullable = true)
 |-- Street: string (nullable = true)
 |-- Alley: string (nullable = true)
 |-- LotShape: string (nullable = true)
 |-- LandContour: string (nullable = true)
 |-- Utilities: string (nullable = true)
 |-- LotConfig: string (nullable = true)
 |-- LandSlope: string (nullable = true)
 |-- Neighborhood: string (nullable = true)
 |-- Condition1: string (nullable = true)
 |-- Condition2: string (nullable = true)
 |-- BldgType: string (nullable = true)
 |-- HouseStyle: string (nullable = true)
 |-- OverallQual: string (nullable = true)
 |-- OverallCond: string (nullable = true)
 |-- YearBuilt: integer (nullable = true)
 |-- YearRemodAdd: integer (nullable = true)
 |-- RoofStyle: string (nullable = true)
 |-- RoofMatl: string (nullable = true)
 |-- Exterior1st: string (nullable = true)
 |-- E

#### Great! Now we can do some data transformation.

#### First we will manipulate the continuous data.

#### In order to use a MinMaxScaler properly in PySpark, we first have to construct a vector column from the continuous features by using the VectorAssembler.

In [123]:
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.feature import MinMaxScaler

In [124]:
vector_assembler = VectorAssembler(inputCols=minMaxList,outputCol='vecFeatures')
minmax_scaler = MinMaxScaler(inputCol='vecFeatures',outputCol='scaled')
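
For reference, min-max scaling maps each value x to (x - min) / (max - min), where min and max are learned from the data the scaler was fitted on. The following is a minimal hand-written sketch of that idea, not the PySpark implementation; `fit_min_max` and `apply_min_max` are hypothetical names, the numbers are made up, and the handling of constant columns is an assumption of the sketch.

```python
def fit_min_max(values):
    """Learn the minimum and maximum of a training column."""
    return min(values), max(values)


def apply_min_max(values, fitted):
    """Rescale values to [0, 1] using a previously fitted (min, max) pair.

    A constant column (max == min) is mapped to 0.5 here; that choice is an
    assumption of this sketch.
    """
    low, high = fitted
    if high == low:
        return [0.5 for _ in values]
    return [(value - low) / (high - low) for value in values]


train_lot_area = [8000, 10000, 12000]   # made-up training values
new_lot_area = [9000, 14000]            # made-up unseen values

fitted = fit_min_max(train_lot_area)
print(apply_min_max(train_lot_area, fitted))
print(apply_min_max(new_lot_area, fitted))
```

Reusing the fitted pair on new data keeps the scale consistent, but values outside the training range can land outside [0, 1], as the second print shows.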

#### Now we will use one-hot encoding to create the rest of the features. Note that we first have to use the StringIndexer, which maps each distinct string to a numeric category index, because the PySpark OneHotEncoder works on category indices rather than on raw strings. We will also create lists of the intermediate and output column names for this purpose.

In [135]:
from pyspark.ml.feature import OneHotEncoder
from pyspark.ml.feature import StringIndexer

In [155]:
outputlist1=[]
outputlist2=[]
for columnName in oneHotList:
    outputlist1.append(columnName+'index')
    outputlist2.append(columnName+'encoded')

In [156]:
indexer = StringIndexer(inputCols=oneHotList,outputCols=outputlist1)
encoder = OneHotEncoder(inputCols=outputlist1,outputCols=outputlist2)
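
To make the two steps concrete, here is a small plain-Python sketch of indexing followed by one-hot encoding. It is an illustration under assumptions, not the PySpark code: `fit_string_index` and `one_hot` are hypothetical helpers and the sample labels are made up. It orders labels by descending frequency and drops the last category, which mirrors the PySpark defaults as far as I know and would explain why a two-valued column such as Street shows up as a one-element vector in the output.

```python
from collections import Counter


def fit_string_index(values):
    """Map each distinct label to an integer index, most frequent first.

    Ties are broken alphabetically so the result is deterministic.
    """
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: index for index, label in enumerate(ordered)}


def one_hot(index, num_categories, drop_last=True):
    """Encode a category index as a list of 0.0/1.0 values.

    With drop_last=True the final category is represented by all zeros,
    so the vector has num_categories - 1 entries.
    """
    size = num_categories - 1 if drop_last else num_categories
    return [1.0 if position == index else 0.0 for position in range(size)]


zoning = ['RL', 'RM', 'RL', 'FV']          # made-up sample values
mapping = fit_string_index(zoning)
encoded = [one_hot(mapping[label], len(mapping)) for label in zoning]

print(mapping)
print(encoded)
```

Dropping the last category avoids a redundant column, since its value is implied whenever all the other entries are zero.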

#### We will now turn these transformations into a data pipeline so that we can easily apply them to data.

In [157]:
from pyspark.ml import Pipeline

In [158]:
pipeline=Pipeline(stages=[vector_assembler,minmax_scaler,indexer,encoder])

In [159]:
final_pipeline = pipeline.fit(myDataFrame)

In [160]:
transformedDataFrame=final_pipeline.transform(myDataFrame)

In [196]:
from pyspark.ml.stat import Correlation  # needed: nothing above binds the name `pyspark`
print(Correlation.corr(transformedDataFrame,'MSSubClassencoded').collect()[0][0])

DenseMatrix([[ 1.        , -0.38651459, -0.2519415 , -0.19172148, -0.16963185,
              -0.16174039, -0.15767325, -0.15491248, -0.14636821, -0.11031613,
              -0.08975945, -0.08017202, -0.06933504, -0.06325027],
             [-0.38651459,  1.        , -0.16786988, -0.12774498, -0.11302655,
              -0.10776843, -0.10505847, -0.10321896, -0.09752587, -0.07350419,
              -0.05980717, -0.05341902, -0.04619828, -0.04214397],
             [-0.2519415 , -0.16786988,  1.        , -0.08326791, -0.07367401,
              -0.07024661, -0.06848018, -0.06728113, -0.06357021, -0.04791218,
              -0.03898406, -0.03482007, -0.03011339, -0.02747067],
             [-0.19172148, -0.12774498, -0.08326791,  1.        , -0.05606416,
              -0.053456  , -0.05211179, -0.05119934, -0.04837541, -0.03646003,
              -0.02966594, -0.02649725, -0.02291557, -0.02090453],
             [-0.16963185, -0.11302655, -0.07367401, -0.05606416,  1.        ,
              -0.0472

In [167]:
outputlist2.append('scaled')
finalColumns=outputlist2
pdReadyDataFrame=transformedDataFrame[finalColumns].toPandas()

In [170]:
pdReadyDataFrame.tail(5)

Unnamed: 0,MSSubClassencoded,MSZoningencoded,Streetencoded,Alleyencoded,LotShapeencoded,LandContourencoded,Utilitiesencoded,LotConfigencoded,LandSlopeencoded,Neighborhoodencoded,...,Fenceencoded,CentralAirencoded,MiscFeatureencoded,MoSoldencoded,SaleTypeencoded,SaleConditionencoded,scaled,scaled.1,scaled.2,scaled.3
1455,"(0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(1.0, 0.0, 0.0, 0.0)",(1.0),"(1.0, 0.0)","(1.0, 0.0, 0.0)","(1.0, 0.0, 0.0)",(1.0),"(1.0, 0.0, 0.0, 0.0)","(1.0, 0.0)","(0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ...",...,"(1.0, 0.0, 0.0, 0.0)",(1.0),"(1.0, 0.0, 0.0, 0.0)","(0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, ...","(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)","(1.0, 0.0, 0.0, 0.0, 0.0)","[0.19808306709265175, 0.030928509663698613, 0....","[0.19808306709265175, 0.030928509663698613, 0....","[0.19808306709265175, 0.030928509663698613, 0....","[0.19808306709265175, 0.030928509663698613, 0...."
1456,"(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(1.0, 0.0, 0.0, 0.0)",(1.0),"(1.0, 0.0)","(1.0, 0.0, 0.0)","(1.0, 0.0, 0.0)",(1.0),"(1.0, 0.0, 0.0, 0.0)","(1.0, 0.0)","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, ...",...,"(0.0, 1.0, 0.0, 0.0)",(1.0),"(1.0, 0.0, 0.0, 0.0)","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)","(1.0, 0.0, 0.0, 0.0, 0.0)","[0.2715654952076677, 0.055504919488653624, 0.0...","[0.2715654952076677, 0.055504919488653624, 0.0...","[0.2715654952076677, 0.055504919488653624, 0.0...","[0.2715654952076677, 0.055504919488653624, 0.0..."
1457,"(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, ...","(1.0, 0.0, 0.0, 0.0)",(1.0),"(1.0, 0.0)","(1.0, 0.0, 0.0)","(1.0, 0.0, 0.0)",(1.0),"(1.0, 0.0, 0.0, 0.0)","(1.0, 0.0)","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",...,"(0.0, 0.0, 1.0, 0.0)",(1.0),"(0.0, 1.0, 0.0, 0.0)","(0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)","(1.0, 0.0, 0.0, 0.0, 0.0)","[0.2108626198083067, 0.036186870457360534, 0.0...","[0.2108626198083067, 0.036186870457360534, 0.0...","[0.2108626198083067, 0.036186870457360534, 0.0...","[0.2108626198083067, 0.036186870457360534, 0.0..."
1458,"(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(1.0, 0.0, 0.0, 0.0)",(1.0),"(1.0, 0.0)","(1.0, 0.0, 0.0)","(1.0, 0.0, 0.0)",(1.0),"(1.0, 0.0, 0.0, 0.0)","(1.0, 0.0)","(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",...,"(1.0, 0.0, 0.0, 0.0)",(1.0),"(1.0, 0.0, 0.0, 0.0)","(0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)","(1.0, 0.0, 0.0, 0.0, 0.0)","(0.21725239616613418, 0.03934188693355769, 0.0...","(0.21725239616613418, 0.03934188693355769, 0.0...","(0.21725239616613418, 0.03934188693355769, 0.0...","(0.21725239616613418, 0.03934188693355769, 0.0..."
1459,"(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(1.0, 0.0, 0.0, 0.0)",(1.0),"(1.0, 0.0)","(1.0, 0.0, 0.0)","(1.0, 0.0, 0.0)",(1.0),"(1.0, 0.0, 0.0, 0.0)","(1.0, 0.0)","(0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",...,"(1.0, 0.0, 0.0, 0.0)",(1.0),"(1.0, 0.0, 0.0, 0.0)","(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)","(1.0, 0.0, 0.0, 0.0, 0.0)","[0.23961661341853036, 0.0403701885998738, 0.0,...","[0.23961661341853036, 0.0403701885998738, 0.0,...","[0.23961661341853036, 0.0403701885998738, 0.0,...","[0.23961661341853036, 0.0403701885998738, 0.0,..."


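#### Our output has 4 copies of the same scaled column. This is most likely because the earlier cell that builds finalColumns appends 'scaled' to outputlist2 in place, so every re-run of that cell adds another copy. We will remove 3 of them, and we will write a function for this so that we can drop duplicate columns in the future.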

In [176]:
def removeCopyColumns(dataframe):
    dataframe=dataframe.loc[:,~dataframe.columns.duplicated()].copy()
    return dataframe
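
A likely explanation for the repeated `scaled` column (an inference from the code, not something the notebook confirms) is that `outputlist2.append('scaled')` changes the list in place and `finalColumns` is just another name for the same list, so each re-run of that cell adds one more `'scaled'` entry. The sketch below reproduces the effect in plain Python and shows two ways around it; `rerun_cell` and `select_columns_once` are hypothetical helpers.

```python
encoded_columns = ['MSSubClassencoded', 'MSZoningencoded']


def rerun_cell(column_list):
    """Imitate the original cell: append in place, then alias the list."""
    column_list.append('scaled')
    return column_list


# Running the "cell" three times grows the same list each time.
for _ in range(3):
    final_columns = rerun_cell(encoded_columns)
print(final_columns)


def select_columns_once(columns):
    """Return the columns with duplicates removed, keeping first-seen order."""
    return list(dict.fromkeys(columns))


# Option 1: build a new list instead of mutating, so re-runs are harmless.
safe_columns = ['MSSubClassencoded', 'MSZoningencoded'] + ['scaled']

# Option 2: de-duplicate an already polluted list.
print(select_columns_once(final_columns))
```

Building a fresh list on each run is the simpler fix; the de-duplication helper is a safety net for lists that have already been modified.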

In [178]:
pdReadyDataFrame=removeCopyColumns(pdReadyDataFrame)
pdReadyDataFrame.tail(5)

Unnamed: 0,MSSubClassencoded,MSZoningencoded,Streetencoded,Alleyencoded,LotShapeencoded,LandContourencoded,Utilitiesencoded,LotConfigencoded,LandSlopeencoded,Neighborhoodencoded,...,GarageCondencoded,PavedDriveencoded,PoolQCencoded,Fenceencoded,CentralAirencoded,MiscFeatureencoded,MoSoldencoded,SaleTypeencoded,SaleConditionencoded,scaled
1455,"(0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(1.0, 0.0, 0.0, 0.0)",(1.0),"(1.0, 0.0)","(1.0, 0.0, 0.0)","(1.0, 0.0, 0.0)",(1.0),"(1.0, 0.0, 0.0, 0.0)","(1.0, 0.0)","(0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, ...",...,"(1.0, 0.0, 0.0, 0.0, 0.0)","(1.0, 0.0)","(1.0, 0.0, 0.0)","(1.0, 0.0, 0.0, 0.0)",(1.0),"(1.0, 0.0, 0.0, 0.0)","(0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, ...","(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)","(1.0, 0.0, 0.0, 0.0, 0.0)","[0.19808306709265175, 0.030928509663698613, 0...."
1456,"(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(1.0, 0.0, 0.0, 0.0)",(1.0),"(1.0, 0.0)","(1.0, 0.0, 0.0)","(1.0, 0.0, 0.0)",(1.0),"(1.0, 0.0, 0.0, 0.0)","(1.0, 0.0)","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, ...",...,"(1.0, 0.0, 0.0, 0.0, 0.0)","(1.0, 0.0)","(1.0, 0.0, 0.0)","(0.0, 1.0, 0.0, 0.0)",(1.0),"(1.0, 0.0, 0.0, 0.0)","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)","(1.0, 0.0, 0.0, 0.0, 0.0)","[0.2715654952076677, 0.055504919488653624, 0.0..."
1457,"(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, ...","(1.0, 0.0, 0.0, 0.0)",(1.0),"(1.0, 0.0)","(1.0, 0.0, 0.0)","(1.0, 0.0, 0.0)",(1.0),"(1.0, 0.0, 0.0, 0.0)","(1.0, 0.0)","(0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",...,"(1.0, 0.0, 0.0, 0.0, 0.0)","(1.0, 0.0)","(1.0, 0.0, 0.0)","(0.0, 0.0, 1.0, 0.0)",(1.0),"(0.0, 1.0, 0.0, 0.0)","(0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)","(1.0, 0.0, 0.0, 0.0, 0.0)","[0.2108626198083067, 0.036186870457360534, 0.0..."
1458,"(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(1.0, 0.0, 0.0, 0.0)",(1.0),"(1.0, 0.0)","(1.0, 0.0, 0.0)","(1.0, 0.0, 0.0)",(1.0),"(1.0, 0.0, 0.0, 0.0)","(1.0, 0.0)","(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",...,"(1.0, 0.0, 0.0, 0.0, 0.0)","(1.0, 0.0)","(1.0, 0.0, 0.0)","(1.0, 0.0, 0.0, 0.0)",(1.0),"(1.0, 0.0, 0.0, 0.0)","(0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)","(1.0, 0.0, 0.0, 0.0, 0.0)","(0.21725239616613418, 0.03934188693355769, 0.0..."
1459,"(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(1.0, 0.0, 0.0, 0.0)",(1.0),"(1.0, 0.0)","(1.0, 0.0, 0.0)","(1.0, 0.0, 0.0)",(1.0),"(1.0, 0.0, 0.0, 0.0)","(1.0, 0.0)","(0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...",...,"(1.0, 0.0, 0.0, 0.0, 0.0)","(1.0, 0.0)","(1.0, 0.0, 0.0)","(1.0, 0.0, 0.0, 0.0)",(1.0),"(1.0, 0.0, 0.0, 0.0)","(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...","(1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)","(1.0, 0.0, 0.0, 0.0, 0.0)","[0.23961661341853036, 0.0403701885998738, 0.0,..."


#### Now we need to take all of our scaled vectors and turn them into separate columns. We will also take all of the one-hot encoded vectors and turn them into separate columns. Note that after this we will remove columns that are extremely highly correlated, so that we do not keep more than one copy of the same or a very similar feature.
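
As a rough illustration of both steps, the sketch below expands a vector column into named columns and then greedily drops any column whose absolute Pearson correlation with an already kept column exceeds a threshold. It is plain Python under stated assumptions: the helper names, the `0.95` threshold, and the sample vectors are all made up, and PySpark vector objects would first need converting to plain lists (for example with their `toArray()` method).

```python
import math


def expand_vector_column(rows, prefix):
    """Turn a list of equal-length vectors into a dict of named columns."""
    width = len(rows[0])
    return {f'{prefix}_{i}': [row[i] for row in rows] for i in range(width)}


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences.

    Returns 0.0 if either sequence is constant (correlation is undefined).
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def drop_correlated(columns, threshold=0.95):
    """Keep a column only if its |correlation| with every kept column
    stays at or below `threshold` (greedy, first column wins)."""
    kept = {}
    for name, values in columns.items():
        if all(abs(pearson(values, other)) <= threshold for other in kept.values()):
            kept[name] = values
    return kept


# Made-up scaled vectors: the third entry is an exact copy of the first.
scaled_rows = [[0.0, 0.9, 0.0], [0.5, 0.1, 0.5], [1.0, 0.4, 1.0]]
columns = expand_vector_column(scaled_rows, 'scaled')
print(list(drop_correlated(columns)))
```

The greedy rule depends on column order, so which member of a correlated pair survives is a design choice rather than a property of the data.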

#### I have decided to abandon this notebook because I realize that the PySpark library does not have all of the features that I want in order to successfully build my model. Instead of making an awkward workaround, I will instead write the functions that I need myself.