# Data Load and Investigation

In this notebook, I load the `Parcel` data and inspect its columns to get a feel for the data types, the missing values, and the overall structure of the data.

In [3]:
# import pandas for loading and inspecting the data:
import pandas as pd

# load the parcel extract into a DataFrame (the file is latin-1 encoded):
parcel = pd.read_csv("../data/EXTR_Parcel.csv", encoding='latin-1')


#### Preview data:

In [4]:
parcel.head()

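| | Major | Minor | PropName | PlatName | PlatLot | PlatBlock | Range | Township | Section | QuarterSection | ... | SeismicHazard | LandslideHazard | SteepSlopeHazard | Stream | Wetland | SpeciesOfConcern | SensitiveAreaTract | WaterProblems | TranspConcurrency | OtherProblems |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 916110 | 346 | | WARDALL PARK ADD | 20-21-22 | 3.0 | 3 | 24 | 14 | SW | ... | N | N | N | N | N | N | N | N | N | N |
| 1 | 132606 | 9228 | | | | | 6 | 26 | 13 | SE | ... | N | N | N | N | N | N | N | N | N | N |
| 2 | 329870 | 12 | | HIGHLAND PARK | 2 | 1.0 | 4 | 24 | 31 | SW | ... | N | N | N | N | N | N | N | N | N | N |
| 3 | 884530 | 50 | | UPPERS H S LIBERTY HEIGHTS ADD | 9 | 1.0 | 3 | 24 | 26 | SW | ... | N | N | N | N | N | N | N | N | N | N |
| 4 | 261730 | 220 | | FOUR LAKES | 2 | 3.0 | 6 | 23 | 27 | NE | ... | N | N | N | N | N | N | N | N | N | N |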


In [6]:
# determine how many entries, nulls, dtypes etc.
parcel.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 614843 entries, 0 to 614842
Data columns (total 81 columns):
Major                     614843 non-null int64
Minor                     614843 non-null int64
PropName                  601757 non-null object
PlatName                  520495 non-null object
PlatLot                   614843 non-null object
PlatBlock                 614843 non-null object
Range                     614843 non-null int64
Township                  614843 non-null int64
Section                   614843 non-null int64
QuarterSection            614843 non-null object
PropType                  614843 non-null object
Area                      614813 non-null float64
SubArea                   614813 non-null float64
SpecArea                  17332 non-null float64
SpecSubArea               17332 non-null float64
DistrictName              614843 non-null object
LevyCode                  614843 non-null int64
CurrentZoning             614842 non-null object
HBUAsIfVaca

In [8]:
# view unique col entries:
columns = parcel.columns

for col in columns:
    print(f"Unique entries for {col}: {parcel[col].unique()}")

Unique entries for Major: [916110 132606 329870 ... 515520 358690  51910]
Unique entries for Minor: [ 346 9228   12 ... 6771 7011 3918]
Unique entries for PropName: [' ' 'sugar loaf mountain' 'JIFFY LUBE' ... 'LOW INCOME ELDERLY APT'
 'SERVICE LINEN' 'THE OLD VINE COURT BUILDING']
Unique entries for PlatName: ['WARDALL PARK ADD' nan 'HIGHLAND PARK' ...
 'FOX BOROUGH(2ND AMENDMENT) PH 01' 'MARINER  THE' 'INNISFREE']
Unique entries for PlatLot: ['20-21-22      ' '              ' '2             ' ... '22&23         '
 'APT 1-T       ' 'POR 161       ']
Unique entries for PlatBlock: ['3      ' '       ' '1      ' ... 'ALL 3  ' '171A   ' '396    ']
Unique entries for Range: [ 3  6  4  5  7  2  9 11  8 10 12 13  0 14]
Unique entries for Township: [24 26 23 25 22 21 20 19  0 99]
Unique entries for Section: [14 13 31 26 27 30 19  6 17  7 24 16  9  2 33  1  4 35 21 34 11  5 22 15
 36  8 18  3 28 10 23 25 32 29 20 12  0]
Unique entries for QuarterSection: ['SW' 'SE' 'NE' 'NW' '  ']
Unique entrie
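
The output above includes whitespace-only strings (for example `' '` in `PropName` and `'  '` in `QuarterSection`), and `info()` counts these as non-null. The sketch below shows one possible way to count such blanks and convert them to `NaN`. It runs on a small toy frame rather than the real `parcel` data, and the helper names `blank_string_counts` and `blanks_to_na` are hypothetical, introduced only for illustration.

```python
import pandas as pd
from pandas.api.types import is_object_dtype, is_string_dtype


def _is_text(series: pd.Series) -> bool:
    # Treat both object and dedicated string dtypes as text columns.
    return is_object_dtype(series) or is_string_dtype(series)


def blank_string_counts(df: pd.DataFrame) -> pd.Series:
    """Count values that are empty or whitespace-only in each text column."""
    counts = {}
    for col in df.columns:
        if _is_text(df[col]):
            is_blank = df[col].map(lambda v: isinstance(v, str) and v.strip() == "")
            counts[col] = int(is_blank.sum())
    return pd.Series(counts, dtype="int64")


def blanks_to_na(df: pd.DataFrame) -> pd.DataFrame:
    """Strip padding from text columns and replace blank strings with NaN."""
    out = df.copy()
    for col in out.columns:
        if _is_text(out[col]):
            stripped = out[col].map(lambda v: v.strip() if isinstance(v, str) else v)
            out[col] = stripped.mask(stripped == "")
    return out


# Toy data modeled on the padded values printed above (not the real file).
demo = pd.DataFrame({
    "PlatLot": ["20-21-22      ", "              ", "2             "],
    "QuarterSection": ["SW", "  ", "SE"],
    "Range": [3, 6, 4],
})

counts = blank_string_counts(demo)
cleaned = blanks_to_na(demo)
print(counts)
print(cleaned.isna().sum())
```

On the real data, `blank_string_counts(parcel)` would give a per-column picture of the missing values that `parcel.info()` does not report.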

#### Cleaning columns

- From this we can see that many columns hold boolean-like values, so we might want to convert them to a boolean type.
- The majority of this data is categorical, so we need to decide how best to clean and encode it: one-hot encoding or label encoding.
- Work out how to deal with NaNs.
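
The cleaning ideas above can be sketched roughly as follows. This is a minimal illustration on toy data, not the final cleaning code. It assumes the flag columns are coded `'Y'`/`'N'` (the preview only shows `'N'`, so `'Y'` is an assumption), and the helper names `find_yn_columns` and `yn_to_bool` are hypothetical. Missing flags are kept as `<NA>` using the pandas nullable `boolean` dtype, and one categorical column is one-hot encoded with `pd.get_dummies`.

```python
import pandas as pd


def find_yn_columns(df: pd.DataFrame) -> list:
    """Return columns whose non-null values are all 'Y' or 'N' (assumed coding)."""
    found = []
    for col in df.columns:
        values = set(df[col].dropna().unique())
        if values and values <= {"Y", "N"}:
            found.append(col)
    return found


def yn_to_bool(df: pd.DataFrame, cols: list) -> pd.DataFrame:
    """Map 'Y'/'N' to True/False; anything else becomes <NA>."""
    out = df.copy()
    for col in cols:
        out[col] = out[col].map({"Y": True, "N": False}).astype("boolean")
    return out


# Toy data using a few column names from the preview (values are made up).
demo = pd.DataFrame({
    "SeismicHazard": ["N", "Y", "N"],
    "Stream": ["N", "N", None],
    "QuarterSection": ["SW", "SE", "SW"],
    "Range": [3, 6, 4],
})

flags = find_yn_columns(demo)
converted = yn_to_bool(demo, flags)

# One-hot encode a categorical column; each category gets its own indicator column.
encoded = pd.get_dummies(converted, columns=["QuarterSection"])

print(flags)
print(converted.dtypes)
print(encoded.columns.tolist())
```

One-hot encoding suits low-cardinality columns such as `QuarterSection`. For columns with many distinct values (for example `PlatName`), it would create a very wide frame, so label encoding or dropping the column may be a better fit.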