# Project 1 Part 4 - Creating a master parcel database

In this part of the project, we will use Python to read, process, and load all of the parcel data into a database.  Note that this is not our only option: in Project 1 Part 4b, we will look at an alternative, namely reading all of the original, raw files into their own database tables, then using SQL to join/link/aggregate the tables.

## Chunking Files in Pandas – Part 1 (20 Points)

In this part of the project, you will use `pandas` to process the data from the MinneMUDAC 2016 competition, Dive into Water Data.  The data can be found at the [MinneMUDAC site](http://minneanalytics.org/minnemudac/data/).  You should document your work in a Jupyter notebook, which will be used to submit your solution.  **For the rest of the parts of this project, we will limit ourselves to the years 2004-2014.**

1. Remind me why we want to skip 2003.<br>
=> Because the columns in the 2003 file are not consistent with those in the later years.


2. Import the common columns list and translation dictionaries from the `.py` file you created in the last part of the project.

In [1]:
from project_data_khanal import common_columns,lat_long_distance_ID_dict



In [2]:
len(common_columns)

70

3. Use glob and a list comprehension to get a list of file names for the years 2004-2014.

In [2]:
from project_data_khanal import filePaths
filePaths = ['./Data/MinneMUDAC/2015_metro_tax_parcels.txt',
 './Data/MinneMUDAC/2009_metro_tax_parcels.txt',
 './Data/MinneMUDAC/2007_metro_tax_parcels.txt',
 './Data/MinneMUDAC/2011_metro_tax_parcels.txt',
 './Data/MinneMUDAC/2005_metro_tax_parcels.txt',
 './Data/MinneMUDAC/2013_metro_tax_parcels.txt',
 './Data/MinneMUDAC/2014_metro_tax_parcels.txt',
 './Data/MinneMUDAC/2008_metro_tax_parcels.txt',
 './Data/MinneMUDAC/2010_metro_tax_parcels.txt',
 './Data/MinneMUDAC/2006_metro_tax_parcels.txt',
 './Data/MinneMUDAC/2012_metro_tax_parcels.txt',
 './Data/MinneMUDAC/2004_metro_tax_parcels.txt']
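
Note that the list above also contains the 2015 file, which falls outside the 2004-2014 range. A minimal sketch of the glob-plus-comprehension approach that step 3 asks for, assuming the files live under `./Data/MinneMUDAC/` and that every file name starts with a four-digit year (as in the list above). The helper name `year_of` is hypothetical.

```python
from glob import glob
import os

def year_of(path):
    """Return the four-digit year that prefixes a parcel file name."""
    return int(os.path.basename(path)[:4])

# Every parcel file whose name starts with four digits
# (an empty list if the directory does not exist).
all_paths = glob('./Data/MinneMUDAC/[0-9][0-9][0-9][0-9]_metro_tax_parcels.txt')

# Keep only 2004-2014, sorted so the files are processed in year order.
filePaths = sorted([p for p in all_paths if 2004 <= year_of(p) <= 2014])
```

Sorting is optional, but it makes `filePaths[0]` predictable (the earliest year present) instead of depending on the order the file system returns.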




4. Use the first chunk of the first file to prototype an expression that <br>
    a. Selects the common columns <br>
    b. Fixes any issues with the column names <br>
    c. Changes columns to the correct types (if necessary).  More information about the columns can be found [here](ftp://ftp.gisdata.mn.gov/pub/gdrs/data/pub/us_mn_state_metrogis/plan_regonal_prcls_open/metadata/metadata.html). It is **imperative** that you keep the lat and long columns as strings. <br>
    d. Use the translation dictionaries from the last part to add three new columns to the chunk: lake code, lake name, parcel distance to the lake.<br>
    e. Filters to only properties that are within 1600 m (~1 mile) of the closest lake.
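
Steps a-e can also be sketched in plain `pandas`, without the `dfply` pipeline used in the cells that follow. The three-row frame, the shortened column list, the coordinates, and the `lookup` dictionary (a string lat/long pair mapped to lake ID, lake name, and distance in meters) are all made-up stand-ins for the real chunk, `common_columns`, and the translation dictionary from the previous part.

```python
import pandas as pd

# Hypothetical stand-ins for the real chunk, column list, and translation dictionary.
chunk = pd.DataFrame({
    'PIN': ['A1', 'B2', 'C3'],
    'Year': [2004, 2004, 2004],
    'centroid_lat': ['45.00000', '45.10000', '45.20000'],
    'centroid_long': ['-93.00000', '-93.10000', '-93.20000'],
    'EXTRA': [1, 2, 3],
})
common = ['PIN', 'Year', 'centroid_lat', 'centroid_long']
lookup = {
    ('45.00000', '-93.00000'): ('LAKE-01', 'Example Lake', 1200.0),
    ('45.20000', '-93.20000'): ('LAKE-01', 'Example Lake', 5000.0),
}

out = chunk[common].copy()                                      # a. common columns
out.columns = out.columns.str.lower()                           # b. consistent names
out = out.astype({'centroid_lat': str, 'centroid_long': str})   # c. lat/long stay strings

# d. look up (lake ID, lake name, distance) for each (lat, long) string pair
keys = list(zip(out['centroid_lat'], out['centroid_long']))
hits = [lookup.get(k, (None, None, None)) for k in keys]
out['lake_id'] = [h[0] for h in hits]
out['lake_name'] = [h[1] for h in hits]
out['lake_distance'] = pd.Series([h[2] for h in hits], index=out.index, dtype=float)

# e. keep parcels within 1600 m; unmatched rows have NaN distance and drop out
near = out[out['lake_distance'] <= 1600]
```

When reading the real files, passing `dtype={'centroid_lat': str, 'centroid_long': str}` to `pd.read_csv` keeps the coordinates as strings from the start, so they never pass through a float conversion.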

In [17]:
import pandas as pd
from functoolz import first
from dfply import *
from more_dfply import *
from itertools import zip_longest
from project_data_khanal import lake_code_dict, lat_long_id__lake_dict, lat_long_distance_ID_dict

# Read the first file lazily, 1000 rows at a time; the files are pipe-delimited.
df_chunks = pd.read_csv(filePaths[0], chunksize=1000, sep="|")

# Pull off the first chunk to prototype on.
first_chunk = first(df_chunks)
first_chunk.columns

Index(['ACRES_DEED', 'ACRES_POLY', 'AGPRE_ENRD', 'AGPRE_EXPD', 'AG_PRESERV',
       'BASEMENT', 'BLDG_NUM', 'BLOCK', 'CITY', 'CITY_USPS', 'COOLING',
       'COUNTY_ID', 'DWELL_TYPE', 'EMV_BLDG', 'EMV_LAND', 'EMV_TOTAL',
       'FIN_SQ_FT', 'GARAGE', 'GARAGESQFT', 'GREEN_ACRE', 'HEATING',
       'HOMESTEAD', 'HOME_STYLE', 'LANDMARK', 'LOT', 'MULTI_USES', 'NUM_UNITS',
       'OPEN_SPACE', 'OWNER_MORE', 'OWNER_NAME', 'OWN_ADD_L1', 'OWN_ADD_L2',
       'OWN_ADD_L3', 'PARC_CODE', 'PIN', 'PLAT_NAME', 'PREFIXTYPE',
       'PREFIX_DIR', 'SALE_DATE', 'SALE_VALUE', 'SCHOOL_DST', 'SPEC_ASSES',
       'STREETNAME', 'STREETTYPE', 'SUFFIX_DIR', 'Shape_Area', 'Shape_Leng',
       'TAX_ADD_L1', 'TAX_ADD_L2', 'TAX_ADD_L3', 'TAX_CAPAC', 'TAX_EXEMPT',
       'TAX_NAME', 'TOTAL_TAX', 'UNIT_INFO', 'USE1_DESC', 'USE2_DESC',
       'USE3_DESC', 'USE4_DESC', 'WSHD_DIST', 'XUSE1_DESC', 'XUSE2_DESC',
       'XUSE3_DESC', 'XUSE4_DESC', 'YEAR_BUILT', 'Year', 'ZIP', 'ZIP4',
       'centroid_long', 'centroid_lat'],

In [3]:
first_chunk = (first_chunk
               >> select(list(common_columns))
               >> fix_names(make_lower=True)
               >> mutate(latitude=X.centroid_lat.astype("float").round(5).astype('object'))
               >> mutate(longitude=X.centroid_long.astype("float").round(5).astype('object'))
               >> drop(X.centroid_lat, X.centroid_long))

first_chunk = (first_chunk
               >> mutate(lat_long=tuple(zip(first_chunk.latitude, first_chunk.longitude)))
               >> mutate(lat_long_tup=X.lat_long.map(lambda x: (str(x[0]), str(x[1]))))
               >> drop(X.lat_long))

first_chunk

Unnamed: 0,SALE_VALUE,BLDG_NUM,MULTI_USES,HEATING,SALE_DATE,XUSE1_DESC,USE3_DESC,TAX_ADD_L1,TAX_ADD_L3,LANDMARK,...,OWN_ADD_L1,AG_PRESERV,USE4_DESC,AGPRE_ENRD,TAX_NAME,USE2_DESC,UNIT_INFO,latitude,longitude,lat_long_tup
0,0.0,,,,,,,23725 NACRE ST NW,"MN, 55330",,...,23725 NACRE ST NW,N,,,JONES TRUSTEE RAYMOND,,,45.3977,-93.4622,"(45.39768, -93.46219)"
1,155000.0,23640.0,,Forced Air Furnace,2008-02-25,,,5288 FOREST RIDGE RD,"MN, 56450",,...,5288 FOREST RIDGE RD,N,,,BRAUSEN RICHARD,,,45.3985,-93.4513,"(45.39852, -93.45134)"
2,332900.0,23449.0,,Forced Air Furnace,2005-12-05,,,23449 VARIOLITE ST NW,"MN, 55330",,...,23449 VARIOLITE ST NW,N,,,HARVEY JAMES,,,45.3946,-93.4577,"(45.39462, -93.45767)"
3,474900.0,23309.0,,Forced Air Furnace,2006-03-31,,,23309 VARIOLITE ST,"MN, 55330",,...,23309 VARIOLITE ST,N,,,NACHREINER CONNIE,,,45.3928,-93.4577,"(45.39283, -93.45771)"
4,170000.0,23925.0,,Forced Air Furnace,2009-07-31,,,23925 GERMANIUM ST NW,"MN, 55070",,...,23925 GERMANIUM ST NW,N,,,COOPER RONALD,,,45.4032,-93.407,"(45.40319, -93.40698)"
5,0.0,23922.0,,Forced Air Furnace,,,,23922 GERMANIUM ST NW,"MN, 55070",,...,23922 GERMANIUM ST NW,N,,,LUDFORD KELLY L,,,45.4032,-93.4084,"(45.4032, -93.40839)"
6,165000.0,5763.0,,Forced Air Furnace,2009-10-02,,,5763 244TH CT NW,"MN, 55070",,...,5763 244TH CT NW,N,,,HERING CHAD,,,45.4133,-93.4129,"(45.41326, -93.41293)"
7,115000.0,5729.0,,Forced Air Furnace,2008-12-29,,,5729 244TH CT NW,"MN, 55070",,...,5729 244TH CT NW,N,,,SCHWAN JENNIFER,,,45.4133,-93.412,"(45.41328, -93.41204)"
8,85000.0,23946.0,,Forced Air Furnace,2010-11-18,,,23946 QUICKSILVER ST NW,"MN, 55070",,...,23946 QUICKSILVER ST NW,N,,,KOGLER JULIE,,,45.4075,-93.4256,"(45.40747, -93.42559)"
9,0.0,,,,,,,23580 NACRE ST NW,"MN, 55330",,...,23580 NACRE ST NW,N,,,WATROBA MICHAEL & SANDRA,,,45.3964,-93.4527,"(45.39637, -93.45266)"


In [5]:
from project_data_khanal import lat_long_id__lake_dict
lat_long_name = {(lat,long):name for (lat,long),(id_site,name,distance) in lat_long_id__lake_dict.items() }
lat_long_id = {(lat,long):id_site for (lat,long),(id_site,name,distance) in lat_long_id__lake_dict.items() }
lat_long_distance = {(lat,long):distance for (lat,long),(id_site,name,distance) in lat_long_id__lake_dict.items() }

first_chunk = (first_chunk
               >> mutate(lake_distance=X.lat_long_tup.map(lat_long_distance))
               >> mutate(lake_name=X.lat_long_tup.map(lat_long_name))
               >> mutate(lake_ID=X.lat_long_tup.map(lat_long_id)))
first_chunk

Unnamed: 0,SALE_VALUE,BLDG_NUM,MULTI_USES,HEATING,SALE_DATE,XUSE1_DESC,USE3_DESC,TAX_ADD_L1,TAX_ADD_L3,LANDMARK,...,AGPRE_ENRD,TAX_NAME,USE2_DESC,UNIT_INFO,latitude,longitude,lat_long_tup,lake_distance,lake_name,lake_ID
0,0.0,,,,,,,23725 NACRE ST NW,"MN, 55330",,...,,JONES TRUSTEE RAYMOND,,,45.3977,-93.4622,"(45.39768, -93.46219)",6224.260278,Pickerel Lake,02013000-01
1,155000.0,23640.0,,Forced Air Furnace,2008-02-25,,,5288 FOREST RIDGE RD,"MN, 56450",,...,,BRAUSEN RICHARD,,,45.3985,-93.4513,"(45.39852, -93.45134)",6200.012223,Pickerel Lake,02013000-01
2,332900.0,23449.0,,Forced Air Furnace,2005-12-05,,,23449 VARIOLITE ST NW,"MN, 55330",,...,,HARVEY JAMES,,,45.3946,-93.4577,"(45.39462, -93.45767)",5825.372769,Pickerel Lake,02013000-01
3,474900.0,23309.0,,Forced Air Furnace,2006-03-31,,,23309 VARIOLITE ST,"MN, 55330",,...,,NACHREINER CONNIE,,,45.3928,-93.4577,"(45.39283, -93.45771)",5629.547543,Pickerel Lake,02013000-01
4,170000.0,23925.0,,Forced Air Furnace,2009-07-31,,,23925 GERMANIUM ST NW,"MN, 55070",,...,,COOPER RONALD,,,45.4032,-93.407,"(45.40319, -93.40698)",,,
5,0.0,23922.0,,Forced Air Furnace,,,,23922 GERMANIUM ST NW,"MN, 55070",,...,,LUDFORD KELLY L,,,45.4032,-93.4084,"(45.4032, -93.40839)",,,
6,165000.0,5763.0,,Forced Air Furnace,2009-10-02,,,5763 244TH CT NW,"MN, 55070",,...,,HERING CHAD,,,45.4133,-93.4129,"(45.41326, -93.41293)",,,
7,115000.0,5729.0,,Forced Air Furnace,2008-12-29,,,5729 244TH CT NW,"MN, 55070",,...,,SCHWAN JENNIFER,,,45.4133,-93.412,"(45.41328, -93.41204)",,,
8,85000.0,23946.0,,Forced Air Furnace,2010-11-18,,,23946 QUICKSILVER ST NW,"MN, 55070",,...,,KOGLER JULIE,,,45.4075,-93.4256,"(45.40747, -93.42559)",7336.674394,Pickerel Lake,02013000-01
9,0.0,,,,,,,23580 NACRE ST NW,"MN, 55330",,...,,WATROBA MICHAEL & SANDRA,,,45.3964,-93.4527,"(45.39637, -93.45266)",5969.943145,Pickerel Lake,02013000-01


In [6]:
# The first chunk appears to contain no parcels within 1600 m of a lake, so the filter returns an empty frame.
first_chunk = (first_chunk>>
               filter_by(X.lake_distance<=1600.00)
)
first_chunk

Unnamed: 0,SALE_VALUE,BLDG_NUM,MULTI_USES,HEATING,SALE_DATE,XUSE1_DESC,USE3_DESC,TAX_ADD_L1,TAX_ADD_L3,LANDMARK,...,AGPRE_ENRD,TAX_NAME,USE2_DESC,UNIT_INFO,latitude,longitude,lat_long_tup,lake_distance,lake_name,lake_ID


list(enumerate(sorted(first_chunk.columns)))[22:29] #18,53,65,13,49,2,41 + 1 kapil



list(enumerate(first_chunk.columns)) #25,9,48,16,30,33,4 already added 1


def dtypes(col,chunk):
    col = chunk>>select(col)>>distinct()
    dtypes = [type(val[0]) for val in col.values]
    return dtypes


 #unique_types = [{col: set(dtypes(col,second_chunk)) for col in [19,54,66,14,50,3,42]} for chunk in df_chunks]
 #unique_types
unique_types = [{col: set(dtypes(col,chunk)) for col in [25,9,48,16,30,33,4]} for chunk in df_chunks]
unique_types

unique_across_chunks = dict()
for chunk_set in unique_types:
    for k,v in chunk_set.items():
        if k not in unique_across_chunks:
            unique_across_chunks[k] = [v]
        else:
            if v not in unique_across_chunks[k]:
                unique_across_chunks[k].append(v)
            else:
                pass

#unique_all ={k:set(v) for k,v in unique_across_chunks.items()}
unique_across_chunks
            
            

5. Now convert your expression from the last problem to a function and test that this function works on the first few chunks of each file.

In [None]:
@pipeable
def select_common_cols(chunk):
    common_chunk = (chunk >> select(list(common_columns)) 
             >>fix_names(make_lower=True)
             >>mutate(latitude = X.centroid_lat.astype("float").round(5).astype('object'))
             >>mutate(longitude = X.centroid_long.astype("float").round(5).astype('object'))
            >>drop(X.centroid_lat,X.centroid_long))
    return common_chunk
@pipeable
def map_lake(chunk):
    map_chunk = (chunk
            >>mutate(lat_long = tuple(zip(chunk.latitude,chunk.longitude)))
             >>mutate(lat_long_tup =X.lat_long.map(lambda x: (str(x[0]),str(x[1]))))
            >>drop(X.lat_long)
               )
    return map_chunk

@pipeable
def getClean_chunk(chunk):
    from project_data_khanal import lat_long_id__lake_dict
    lat_long_name = {(lat,long):name for (lat,long),(id_site,name,distance) in lat_long_id__lake_dict.items() }
    lat_long_id = {(lat,long):id_site for (lat,long),(id_site,name,distance) in lat_long_id__lake_dict.items() }
    lat_long_distance = {(lat,long):distance for (lat,long),(id_site,name,distance) in lat_long_id__lake_dict.items() }

    clean_chunk = (chunk
                      >> mutate(lake_distance = X.lat_long_tup.map(lat_long_distance))
                        >> mutate(lake_name = X.lat_long_tup.map(lat_long_name))
                          >> mutate(lake_ID = X.lat_long_tup.map(lat_long_id))
                  )
    
    return clean_chunk
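
To test a chunk-processing function on only the first few chunks of each file, `itertools.islice` can cut each reader short. This is a rough sketch: `first_chunks` and `check_sources` are hypothetical helper names, the two in-memory "files" stand in for the real parcel files, and the identity function stands in for the cleaning function under test.

```python
from io import StringIO
from itertools import islice
import pandas as pd

def first_chunks(source, n_chunks=3, chunksize=1000, sep='|'):
    """Yield at most n_chunks chunks from one pipe-delimited file."""
    reader = pd.read_csv(source, chunksize=chunksize, sep=sep)
    yield from islice(reader, n_chunks)

def check_sources(sources, process, **kwargs):
    """Apply process to the leading chunks of each source; return output row counts."""
    return [[len(process(chunk)) for chunk in first_chunks(src, **kwargs)]
            for src in sources]

# Two tiny in-memory "files" stand in for the real parcel files.
fake_files = [StringIO('PIN|Year\nA|2004\nB|2004\nC|2004\n'),
              StringIO('PIN|Year\nA|2005\nB|2005\n')]
counts = check_sources(fake_files, lambda df: df, n_chunks=2, chunksize=2)
```

With the real data, `sources` would be `filePaths` and `process` would be the composed cleaning function; any exception raised while processing points to the file and chunk that need attention.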

    
    

6. We need to make a unique primary key for each row in the combined parcel file.<br>
    a. There is a column that appears to be a unique parcel id.  Double check that this is a true primary key for each individual file. (To do this, you need to verify that the number of unique values is the same as the number of rows for each of the parcel files.)  **Hint:** For each file, use the accumulator pattern with two accumulators (one number and one data frame). <br>
    b. Explain why this column will not work as a primary key if we want to combine all years in one database. <br>
    c. Suppose we make a new column that consists of `str(year) + '-' + PID`.  Explain why this should make a proper primary key for the combined data. <br>
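
The reasoning behind parts b and c can be illustrated with a toy example. The (year, PIN) pairs below are made up, and the argument only holds if the parcel id is unique within each year's file, which is exactly what part a has to confirm.

```python
# Hypothetical (year, PIN) pairs: the same parcel appears once in each yearly file.
rows = [(2004, '123'), (2004, '456'), (2005, '123'), (2005, '456')]

pins = [pin for _, pin in rows]
keys = [str(year) + '-' + pin for year, pin in rows]

# The bare PIN repeats every year, so it cannot identify a row in the combined data.
pin_is_unique = len(set(pins)) == len(rows)

# Prefixing the year separates the yearly copies of the same parcel.
key_is_unique = len(set(keys)) == len(rows)
```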

In [15]:
# Check the chunks of the first file only; if the check fails here, there is no need to check the other files. df_chunks may already be exhausted, in which case re-create it before running this cell.
for chunk in df_chunks:
    nrows = chunk.shape[0]
    id_len = len(set(chunk['PIN']))
    if nrows != id_len:
        print("Not equal on one of the chunk")
        break

Not equal on one of the chunk
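
The loop above compares rows and unique ids within each chunk separately. A per-file check needs accumulators that persist across chunks. The sketch below is one way to do that; it uses a `set` as the second accumulator rather than the data frame the hint suggests, and the function name `is_primary_key` and the in-memory file are hypothetical.

```python
from io import StringIO
import pandas as pd

def is_primary_key(source, column='PIN', chunksize=1000, sep='|'):
    """Return (total_rows, unique_values, is_key) for one pipe-delimited file."""
    total_rows = 0          # accumulator 1: number of rows seen so far
    seen = set()            # accumulator 2: distinct ids seen so far
    reader = pd.read_csv(source, chunksize=chunksize, sep=sep, dtype={column: str})
    for chunk in reader:
        total_rows += len(chunk)
        seen.update(chunk[column])
    return total_rows, len(seen), total_rows == len(seen)

# A tiny in-memory "file" in which the id 'A' appears twice, in different chunks.
fake_file = StringIO('PIN|Year\nA|2004\nB|2004\nA|2004\n')
result = is_primary_key(fake_file, chunksize=2)
```

Because the duplicate `'A'` rows land in different chunks, a chunk-by-chunk comparison would miss them, while the accumulated set catches them.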


In [18]:
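import numpy as np

@pipeable #only works on def statements
def add_primary_key(chunk,start):
    # Number the rows start, start+1, ... so that ids continue from one chunk to the next.
    chunk = (chunk >> mutate(primary_key = np.arange(start, start + chunk.shape[0])))
    return chunk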

7. Make a function to add the key suggested in the last problem (`str(year) + '-' + PID`) to a given chunk.
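
A minimal plain-`pandas` sketch of such a function. The name `add_year_pin_key`, the new column name `year_pin_key`, and the default column names `Year` and `PIN` are assumptions; after lower-casing the column names, the defaults would need to be `year` and `pin` instead.

```python
import pandas as pd

def add_year_pin_key(chunk, year_col='Year', pin_col='PIN'):
    """Return a copy of chunk with a 'year-PIN' string key column added."""
    key = chunk[year_col].astype(str) + '-' + chunk[pin_col].astype(str)
    return chunk.assign(year_pin_key=key)

# Made-up rows: the same parcel id in two different years.
demo = pd.DataFrame({'Year': [2004, 2005], 'PIN': ['123', '123']})
keyed = add_year_pin_key(demo)
```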

#### Note: If you are clever, you can do all of part 8 in one double loop, which will save you from having to read the parcel files twice.

8. It is probably worth our time to test that our new key column is truly unique. (If not, we might waste our time loading the data into a database, only to have the process fail hours in.) Test that the new column works by <br>
    a. Iterating over all the files.<br>
    b. Using an accumulator to count total number of rows across all parcel files. <br>
    c. Using an accumulator to accumulate a set of all unique values of our new key. <br>
    d. Verifying that we have as many total rows as unique keys. <br>
    e. Selecting just this column. <br>
    f. Dumping this column into a temporary database. <br>
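
The double loop mentioned in the note might look like the sketch below: the outer loop walks the files, the inner loop walks the chunks, and both accumulators are updated in a single pass. The function name and the in-memory files are hypothetical, and the sketch assumes the raw column names `Year` and `PIN`.

```python
from io import StringIO
import pandas as pd

def count_rows_and_keys(sources, chunksize=1000, sep='|'):
    """Single pass over all files: return (total rows, distinct year-PIN keys)."""
    total_rows = 0
    keys = set()
    for source in sources:                      # outer loop: files
        reader = pd.read_csv(source, chunksize=chunksize, sep=sep,
                             dtype={'PIN': str, 'Year': str})
        for chunk in reader:                    # inner loop: chunks
            total_rows += len(chunk)
            keys.update(chunk['Year'] + '-' + chunk['PIN'])
    return total_rows, len(keys)

# Made-up files: the second one repeats the same id within a single year.
fake_files = [StringIO('PIN|Year\nA|2004\nB|2004\n'),
              StringIO('PIN|Year\nA|2005\nA|2005\n')]
total, unique = count_rows_and_keys(fake_files, chunksize=1)
```

The key is a proper primary key only if the two numbers match. Note that the set holds every key in memory at once, which should be fine for string keys but is worth keeping in mind for very large files.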

9. If the last step succeeded, you can proceed to make a master parcel data database.  If not, you will need to figure out another primary key, probably an `id` column similar to the example in the lectures.
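
If the year-based key turns out not to be unique, a running integer `id` can serve as the primary key instead. The sketch below uses the standard-library `sqlite3` module with an in-memory database; the table layout, the column names, and the helper `load_chunk` are hypothetical, and the rows are made up.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE parcels (id INTEGER PRIMARY KEY, year_pin TEXT, pin TEXT)')

def load_chunk(conn, rows, start):
    """Insert rows with ids start, start+1, ...; return the next unused id."""
    data = [(start + i, year_pin, pin) for i, (year_pin, pin) in enumerate(rows)]
    conn.executemany('INSERT INTO parcels VALUES (?, ?, ?)', data)
    conn.commit()
    return start + len(rows)

# The first "chunk" contains a duplicated year-PIN value, which the running id tolerates.
next_id = load_chunk(conn, [('2004-A', 'A'), ('2004-A', 'A')], start=0)
next_id = load_chunk(conn, [('2005-A', 'A')], start=next_id)
n_rows = conn.execute('SELECT COUNT(*) FROM parcels').fetchone()[0]
```

Carrying `next_id` from one chunk to the next is what keeps the ids unique across chunks and files.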