# Project 1 Part 4 - Creating a master parcel database

In this part of the project, we will use Python to read, process, and load all of the parcel data into a database.  Note that this is not our only option: in Project 1 Part 4b, we will look at an alternative, namely reading all of the original, raw files into their own database tables and then using SQL to join/link/aggregate the tables.

## Chunking Files in Pandas – Part 1 (20 Points)

In this part of the project, you will use `pandas` to process the data from the MinneMUDAC 2016 competition, Dive into Water Data.  The data can be found at the [MinneMUDAC site](http://minneanalytics.org/minnemudac/data/).  You should document your work in a Jupyter notebook, which will be used to submit your solution.  **For the rest of the parts of this project, we will limit ourselves to the years 2004-2014.**

1. Remind me why we want to skip 2003.

> 2003 has fewer columns than the other files, so excluding 2003 allows us to keep more data.

2. Import the common columns list and translation dictionaries from the `.py` file you created in the last part of the project.

In [3]:
from project_data_Miertschin import common_columns, ll_dist_dict, ll_code_dict, code_name_dict, ll_idnamedist_dict

In [39]:
import pandas as pd
from dfply import *
from glob import glob
import re
from toolz import first


3. Use glob and a list comprehension to get a list of file names for the years 2004-2014.

In [11]:
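# glob does not guarantee an order, so sort first and filter by year rather than by position.
all_files = sorted(glob('../MinneMUDAC_raw_files/20*_metro_tax_parcels.txt'))
files = [f for f in all_files if 2004 <= int(re.search(r'(20\d\d)_metro', f).group(1)) <= 2014]
files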

['../MinneMUDAC_raw_files/2004_metro_tax_parcels.txt',
 '../MinneMUDAC_raw_files/2005_metro_tax_parcels.txt',
 '../MinneMUDAC_raw_files/2006_metro_tax_parcels.txt',
 '../MinneMUDAC_raw_files/2007_metro_tax_parcels.txt',
 '../MinneMUDAC_raw_files/2008_metro_tax_parcels.txt',
 '../MinneMUDAC_raw_files/2009_metro_tax_parcels.txt',
 '../MinneMUDAC_raw_files/2010_metro_tax_parcels.txt',
 '../MinneMUDAC_raw_files/2011_metro_tax_parcels.txt',
 '../MinneMUDAC_raw_files/2012_metro_tax_parcels.txt',
 '../MinneMUDAC_raw_files/2013_metro_tax_parcels.txt',
 '../MinneMUDAC_raw_files/2014_metro_tax_parcels.txt']

In [12]:
# Capture the file stem (e.g. '2004_metro_tax_parcels') from each path.
FILE_NAME_RE = re.compile(r'^\.\./MinneMUDAC_raw_files/(20\d\d_metro_tax_parcels)\.txt$')
file_name = lambda p: FILE_NAME_RE.match(p).group(1)
file_names = lambda files: [file_name(p) for p in files]
years = file_names(files)
years

['2004_metro_tax_parcels',
 '2005_metro_tax_parcels',
 '2006_metro_tax_parcels',
 '2007_metro_tax_parcels',
 '2008_metro_tax_parcels',
 '2009_metro_tax_parcels',
 '2010_metro_tax_parcels',
 '2011_metro_tax_parcels',
 '2012_metro_tax_parcels',
 '2013_metro_tax_parcels',
 '2014_metro_tax_parcels']

4. Use the first chunk of the first file to prototype an expression that <br>
    a. Selects the common columns <br>
    b. Fixes any issues with the column names <br>
    c. Changes columns to the correct types (if necessary).  More information about the columns can be found [here](ftp://ftp.gisdata.mn.gov/pub/gdrs/data/pub/us_mn_state_metrogis/plan_regonal_prcls_open/metadata/metadata.html). It is **imperative** that you keep the lat and long columns as strings. <br>
    d. Use the translation dictionaries from the last part to add three new columns to the chunk: lake code, lake name, parcel distance to the lake. <br>
    e. Filters to only properties that are within 1600 m (~1 mile) of the closest lake.
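
Steps a, d, and e above can be sketched on a toy chunk. This is only a minimal sketch, not the assignment solution: the tiny data frame and the `toy_*` dictionaries are hypothetical stand-ins for `common_columns`, `ll_code_dict`, `ll_dist_dict`, and `code_name_dict`, and it assumes the real dictionaries are keyed by `(lat, long)` tuples of strings.

```python
import pandas as pd

# Hypothetical stand-ins for the translation dictionaries from the last part.
# Assumption: keys are (lat, long) tuples of strings, which is why the
# centroid columns must stay strings.
toy_code = {('45.41332', '-93.26739'): 'L1', ('45.41354', '-93.2701'): 'L2'}
toy_dist = {('45.41332', '-93.26739'): 250.0, ('45.41354', '-93.2701'): 4000.0}
toy_name = {'L1': 'Toy Lake', 'L2': 'Far Lake'}
toy_common = ['PIN', 'Year', 'centroid_lat', 'centroid_long']

chunk = pd.DataFrame({
    'PIN': ['003-1', '003-2'],
    'Year': [2004, 2004],
    'EXTRA': [1, 2],  # a column that is not in the common list
    'centroid_lat': ['45.41332', '45.41354'],
    'centroid_long': ['-93.26739', '-93.2701'],
})

# a. Select the common columns.
selected = chunk[toy_common]

# d. Build one (lat, long) key per row and look it up in each dictionary.
keys = list(zip(selected['centroid_lat'], selected['centroid_long']))
with_lakes = selected.assign(
    lake_code=[toy_code.get(k) for k in keys],
    distance_to_lake=[toy_dist.get(k) for k in keys],
)
with_lakes = with_lakes.assign(
    lake_name=[toy_name.get(c) for c in with_lakes['lake_code']]
)

# e. Keep only parcels within 1600 m of their closest lake.
near = with_lakes[with_lakes['distance_to_lake'] <= 1600]
```

Whether the boundary should be `<=` or `<` is a judgment call that the assignment text does not settle.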

In [56]:
parcel_files = [pd.read_csv(file,chunksize=500,sep='|',dtype = {'centroid_lat': str,'centroid_long': str}) for file in files]

In [57]:
first_chunks = [first(df) for df in parcel_files]

In [69]:
first_chunk = first_chunks[0]
first_chunk.head()

Unnamed: 0,ACRES_DEED,ACRES_POLY,AGPRE_ENRD,AGPRE_EXPD,AG_PRESERV,BASEMENT,BLDG_NUM,BLOCK,CITY,CITY_USPS,COOLING,COUNTY_ID,DWELL_TYPE,EMV_BLDG,EMV_LAND,EMV_TOTAL,FIN_SQ_FT,GARAGE,GARAGESQFT,GREEN_ACRE,HEATING,HOMESTEAD,HOME_STYLE,ID,LANDMARK,LOT,MULTI_USES,NUM_UNITS,OPEN_SPACE,OWNER_MORE,OWNER_NAME,OWN_ADD_L1,OWN_ADD_L2,OWN_ADD_L3,PARC_CODE,PIN,PLAT_NAME,PREFIXTYPE,PREFIX_DIR,SALE_DATE,SALE_VALUE,SCHOOL_DST,SPEC_ASSES,STREETNAME,STREETTYPE,SUFFIX_DIR,Shape_Area,Shape_Leng,TAX_ADD_L1,TAX_ADD_L2,TAX_ADD_L3,TAX_CAPAC,TAX_EXEMPT,TAX_NAME,TOTAL_TAX,UNIT_INFO,USE1_DESC,USE2_DESC,USE3_DESC,USE4_DESC,WSHD_DIST,XUSE1_DESC,XUSE2_DESC,XUSE3_DESC,XUSE4_DESC,YEAR_BUILT,Year,ZIP,ZIP4,centroid_lat,centroid_long
0,0.0,8.03,,,N,,,,SAINT FRANCIS,,,3,,0.0,17750.0,23964.0,0.0,,,N,,N,,,,,,,N,,,24457 DOGWOOD ST NW,BETHEL,"MN, 55005",0.0,003-253424110001,,,,,0.0,15,0.0,,,,,,24457 DOGWOOD ST NW,BETHEL,"MN, 55005",351.0,N,,589.0,,,,,,UPPER RUM RIVER WMO,,,,,1980.0,2004,,,45.41332,-93.26739
1,0.0,0.93,,,N,,24457.0,,SAINT FRANCIS,BETHEL,,3,,106719.0,55400.0,171837.0,0.0,,,N,,Y,,,,,,,N,,,24457 DOGWOOD ST NW,BETHEL,"MN, 55005",0.0,003-253424110002,,,,,0.0,15,0.0,DOGWOOD,ST,NW,,,24457 DOGWOOD ST NW,BETHEL,"MN, 55005",1475.0,N,,1308.0,,,,,,UPPER RUM RIVER WMO,,,,,1974.0,2004,55005.0,,45.41354,-93.2701
2,0.0,8.75,,,N,,24442.0,,SAINT FRANCIS,BETHEL,,3,,95958.0,85876.0,195751.0,0.0,,,N,,Y,,,,,,,N,,,24442 DOGWOOD ST NW,ST FRANCIS,"MN, 55005",0.0,003-253424120001,,,,2001-04-26,215000.0,15,0.0,DOGWOOD,ST,NW,,,1757 TAPO CANYON RD SV-24 #300,SIMI VALLEY,"CA, 93063",1586.0,N,,1432.0,,,,,,UPPER RUM RIVER WMO,,,,,1969.0,2004,55005.0,,45.41318,-93.27344
3,0.0,11.17,,,N,,410.0,,SAINT FRANCIS,BETHEL,,3,,120762.0,77000.0,210338.0,0.0,,,N,,Y,,,,,,,N,,,PO BOX 14,BETHEL,"MN, 55005",0.0,003-253424210002,,,,1995-03-23,101000.0,15,0.0,245TH,AVE,NW,,,14528 SO OUTER FORTY RD,CHESTERFIELD,"MO, 63017",1744.0,N,,1609.0,,,,,,UPPER RUM RIVER WMO,,,,,1989.0,2004,55005.0,,45.41167,-93.27684
4,0.0,14.46,,,N,,480.0,,SAINT FRANCIS,BETHEL,,3,,114220.0,80505.0,204359.0,0.0,,,N,,Y,,,,,,,N,,,480 245TH AVE NW,EAST BETHEL,"MN, 55005",0.0,003-253424210003,,,,1995-04-04,101900.0,15,0.0,245TH,AVE,NW,,,"%VALUTREE REAL EST SER, LLC TAX SER DIV",RICHMOND,"VA, 23285",1636.0,N,,1488.0,,,,,,UPPER RUM RIVER WMO,,,,,1995.0,2004,55070.0,,45.41169,-93.27849


In [72]:
ll_code_dict.get((45.413,-93.26739))

In [74]:
ll_code_dict.keys()


In [67]:
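# dict.get cannot hash dfply's symbolic X.column (it raises a TypeError), so do the
# lookups eagerly. Assumes the dictionaries are keyed by (lat, long) string tuples.
selected = first_chunk >> select(common_columns)
keys = list(zip(selected.centroid_lat, selected.centroid_long))
new_chunk = selected.assign(lake_code = [ll_code_dict.get(k) for k in keys],
                            distance_to_lake = [ll_dist_dict.get(k) for k in keys])
new_chunk = new_chunk.assign(lake_name = [code_name_dict.get(c) for c in new_chunk.lake_code])
new_chunk.shape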

In [60]:
new_chunk.dtypes

GARAGE           float64
XUSE1_DESC       float64
GREEN_ACRE        object
Year               int64
COUNTY_ID          int64
OWNER_MORE       float64
OWN_ADD_L1        object
OWN_ADD_L2        object
SUFFIX_DIR        object
USE1_DESC        float64
AG_PRESERV        object
AGPRE_ENRD       float64
STREETTYPE        object
BLDG_NUM         float64
DWELL_TYPE       float64
SALE_DATE         object
ZIP4             float64
SPEC_ASSES       float64
OWN_ADD_L3        object
OWNER_NAME       float64
centroid_long     object
CITY              object
MULTI_USES       float64
TAX_ADD_L3        object
PLAT_NAME        float64
TAX_EXEMPT        object
COOLING          float64
HOME_STYLE       float64
LANDMARK         float64
NUM_UNITS        float64
                  ...   
SCHOOL_DST         int64
WSHD_DIST         object
PIN               object
YEAR_BUILT       float64
EMV_BLDG         float64
EMV_LAND         float64
XUSE2_DESC       float64
SALE_VALUE       float64
ZIP              float64


In [54]:
common_columns

['GARAGE',
 'XUSE1_DESC',
 'GREEN_ACRE',
 'Year',
 'COUNTY_ID',
 'OWNER_MORE',
 'OWN_ADD_L1',
 'OWN_ADD_L2',
 'SUFFIX_DIR',
 'USE1_DESC',
 'AG_PRESERV',
 'AGPRE_ENRD',
 'STREETTYPE',
 'BLDG_NUM',
 'DWELL_TYPE',
 'SALE_DATE',
 'ZIP4',
 'SPEC_ASSES',
 'OWN_ADD_L3',
 'OWNER_NAME',
 'centroid_long',
 'CITY',
 'MULTI_USES',
 'TAX_ADD_L3',
 'PLAT_NAME',
 'TAX_EXEMPT',
 'COOLING',
 'HOME_STYLE',
 'LANDMARK',
 'NUM_UNITS',
 'HOMESTEAD',
 'TAX_CAPAC',
 'PARC_CODE',
 'PREFIX_DIR',
 'UNIT_INFO',
 'TAX_NAME',
 'USE4_DESC',
 'centroid_lat',
 'PREFIXTYPE',
 'AGPRE_EXPD',
 'SCHOOL_DST',
 'WSHD_DIST',
 'PIN',
 'YEAR_BUILT',
 'EMV_BLDG',
 'EMV_LAND',
 'XUSE2_DESC',
 'SALE_VALUE',
 'ZIP',
 'TAX_ADD_L1',
 'TAX_ADD_L2',
 'ACRES_DEED',
 'XUSE3_DESC',
 'BASEMENT',
 'BLOCK',
 'Shape_Leng',
 'USE2_DESC',
 'HEATING',
 'CITY_USPS',
 'Shape_Area',
 'ACRES_POLY',
 'TOTAL_TAX',
 'GARAGESQFT',
 'LOT',
 'FIN_SQ_FT',
 'OPEN_SPACE',
 'STREETNAME',
 'EMV_TOTAL',
 'USE3_DESC',
 'XUSE4_DESC']

5. Now convert your expression from the last problem to a function and test that this function works on the first few chunks of each file.
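
One way to test a chunk function on only the first few chunks of each file is `itertools.islice`, sketched below. The `process_chunk` function is a hypothetical placeholder for your own expression from step 4, and the two in-memory strings stand in for the yearly parcel files.

```python
import io
from itertools import islice

import pandas as pd


def process_chunk(chunk, columns):
    """Hypothetical placeholder for the expression from step 4."""
    return chunk[columns]


# Two small in-memory "files" stand in for the yearly parcel files.
toy_files = {
    '2004': "PIN|Year|EXTRA\na|2004|1\nb|2004|2\nc|2004|3\n",
    '2005': "PIN|Year|EXTRA\na|2005|1\nb|2005|2\nc|2005|3\n",
}

results = {}
for name, text in toy_files.items():
    reader = pd.read_csv(io.StringIO(text), sep='|', chunksize=1)
    # islice pulls only the first two chunks and leaves the rest unread.
    results[name] = [process_chunk(c, ['PIN', 'Year']) for c in islice(reader, 2)]
```

Because the reader is lazy, the third row of each toy file is never parsed, which is the point when the real files are large.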

6. We need to make a unique primary key for each row in the combined parcel file.<br>
    a. There is a column that appears to be a unique parcel id.  Double check that this is a true primary key for each individual file. (To do this you need to verify that the number of unique values is the same as the number of rows for each of the parcel files.  **Hint:** For each file, use the accumulator pattern with two accumulators, one number and one data frame.) <br>
    b. Explain why this column will not work as a primary key if we want to combine all years in one database. <br>
    c. Suppose we make a new column that consists of `str(year) + '-' + PID`.  Explain why this should make a proper primary key for the combined data. <br>
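
The two-accumulator check from part a can be sketched for a single file as follows. The in-memory toy file and the use of `PIN` as the id column are assumptions made for illustration; substitute the column you identified.

```python
import io

import pandas as pd

# A toy stand-in for one parcel file.
toy_file = "PIN|Year\na|2004\nb|2004\nc|2004\nd|2004\n"

total_rows = 0                                               # number accumulator
unique_ids = pd.DataFrame({'PIN': pd.Series(dtype='object')})  # data frame accumulator

reader = pd.read_csv(io.StringIO(toy_file), sep='|', chunksize=2, dtype={'PIN': str})
for chunk in reader:
    total_rows += len(chunk)
    # Append this chunk's ids, then drop repeats so the frame holds only unique ids.
    unique_ids = pd.concat([unique_ids, chunk[['PIN']]]).drop_duplicates()

# The column is a primary key for this file if no id repeats.
is_primary_key = (len(unique_ids) == total_rows)
```

Dropping duplicates after every chunk keeps the accumulated frame no larger than the number of distinct ids seen so far.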

7. Make a function to add the key suggested in the last problem (`str(year) + '-' + PID`) to a given chunk.
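
A minimal sketch of such a function is shown below. The column names `Year` and `PIN` are assumptions based on the `head()` output earlier in the notebook (the text above calls the id `PID`), and the output column name `key` is hypothetical.

```python
import pandas as pd


def add_key(chunk, year_col='Year', id_col='PIN'):
    """Return a copy of chunk with a 'key' column of the form '<year>-<id>'."""
    key = chunk[year_col].astype(str) + '-' + chunk[id_col].astype(str)
    return chunk.assign(key=key)


# The same parcel id appears in two different years.
toy = pd.DataFrame({
    'Year': [2004, 2005],
    'PIN': ['003-253424110001', '003-253424110001'],
})
keyed = add_key(toy)
```

Here the same parcel id in two years yields two distinct keys, which is the property the combined table needs.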

#### Note: If you are clever, you can do parts 8 in one double loop, which will save you from having to read the parcel files twice.

8. It is probably worth our time to test that our new key column is truly unique. (If not, we might waste our time loading the data into a database, only to have the process fail hours in.) Test that the new column works by <br>
    a. Iterating over all the files.<br>
    b. Using an accumulator to count total number of rows across all parcel files. <br>
    c. Using an accumulator to accumulate a set of all unique values of our new key. <br>
    d. Verifying that we have as many total rows as unique keys. <br>
    e. Selecting just this column. <br>
    f. Dumping this column into a temporary database <br>
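
The double loop over files and chunks can be sketched with two accumulators, a running row count and a set of keys. The in-memory toy files and the `Year` and `PIN` column names are assumptions for illustration only.

```python
import io

import pandas as pd

# Toy stand-ins for the yearly parcel files; the same PINs repeat across years.
toy_files = [
    "PIN|Year\na|2004\nb|2004\nc|2004\n",
    "PIN|Year\na|2005\nb|2005\n",
]

total_rows = 0      # accumulates the number of rows across all files
seen_keys = set()   # accumulates every distinct key value

for text in toy_files:                      # outer loop: files
    reader = pd.read_csv(io.StringIO(text), sep='|', chunksize=2, dtype={'PIN': str})
    for chunk in reader:                    # inner loop: chunks of one file
        keys = chunk['Year'].astype(str) + '-' + chunk['PIN']
        total_rows += len(chunk)
        seen_keys.update(keys)

# The key is unique across all files if no value was seen twice.
key_is_unique = (total_rows == len(seen_keys))
```

A set holds every key in memory at once, so on the full data it is worth checking that this fits before committing to the approach.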

9. If the last step succeeded, you can proceed to make a master parcel data database.  If not, you will need to figure out another primary key, probably an `id` column similar to the example in the lectures.
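
Loading chunks into a database can be sketched as below. This is a sketch under stated assumptions, not the method from the lectures: it uses the standard library `sqlite3` with an in-memory database, and the table name `parcels`, the column name `parcel_key`, and the toy file are all hypothetical.

```python
import io
import sqlite3

import pandas as pd

# A toy stand-in for one parcel file.
toy_file = "PIN|Year\na|2004\nb|2004\nc|2004\n"

conn = sqlite3.connect(':memory:')
# Declaring the key as PRIMARY KEY makes the database reject duplicate keys.
conn.execute(
    'CREATE TABLE parcels (parcel_key TEXT PRIMARY KEY, PIN TEXT, Year INTEGER)'
)

reader = pd.read_csv(io.StringIO(toy_file), sep='|', chunksize=2, dtype={'PIN': str})
for chunk in reader:
    keyed = chunk.assign(parcel_key=chunk['Year'].astype(str) + '-' + chunk['PIN'])
    # Append each processed chunk to the existing table.
    keyed.to_sql('parcels', conn, if_exists='append', index=False)

row_count = conn.execute('SELECT COUNT(*) FROM parcels').fetchone()[0]
conn.close()
```

With the primary key constraint in place, a duplicate key would raise an error at insert time, which is why verifying uniqueness beforehand saves a failed multi-hour load.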