# Project 1 Part 4 - Creating a master parcel database

In this part of the project, we will use Python to read, process, and dump all of the parcel data into a database.  Note that this is not our only option: in Project 1 Part 4b, we will look at an alternative, namely reading all of the original, raw files into their own database tables and then using SQL to join/link/aggregate the tables.

## Chunking Files in Pandas – Part 1 (20 Points)

In this part of the project, you will use `pandas` to process the data from the MinneMUDAC 2016 competition, Dive into Water Data.  The data can be found at the [MinneMUDAC site](http://minneanalytics.org/minnemudac/data/).  You should document your work in a Jupyter notebook, which will be used to submit your solution.  **For the rest of the parts of this project, we will limit ourselves to the years 2004-2014.**

1. Remind me why we want to skip 2003.

> 2003 has fewer columns than the other files, so excluding it lets us keep more columns in common across the remaining years.

2. Import the common columns list and translation dictionaries from the `.py` file you created in the last part of the project.

In [1]:
from project_data_Miertschin import common_columns, ll_dist_dict, ll_code_dict, code_name_dict, ll_idnamedist_dict

In [2]:
import numpy as np
import pandas as pd
from dfply import *
from glob import glob
import re
from toolz import first
from more_dfply import recode
from functoolz import pipeable


3. Use glob and a list comprehension to get a list of file names for the years 2004-2014.

In [3]:
files = glob('../MinneMUDAC_raw_files/20**_metro_tax_parcels.txt')[2:-1]
files

['../MinneMUDAC_raw_files/2004_metro_tax_parcels.txt',
 '../MinneMUDAC_raw_files/2005_metro_tax_parcels.txt',
 '../MinneMUDAC_raw_files/2006_metro_tax_parcels.txt',
 '../MinneMUDAC_raw_files/2007_metro_tax_parcels.txt',
 '../MinneMUDAC_raw_files/2008_metro_tax_parcels.txt',
 '../MinneMUDAC_raw_files/2009_metro_tax_parcels.txt',
 '../MinneMUDAC_raw_files/2010_metro_tax_parcels.txt',
 '../MinneMUDAC_raw_files/2011_metro_tax_parcels.txt',
 '../MinneMUDAC_raw_files/2012_metro_tax_parcels.txt',
 '../MinneMUDAC_raw_files/2013_metro_tax_parcels.txt',
 '../MinneMUDAC_raw_files/2014_metro_tax_parcels.txt']
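The `[2:-1]` slice depends on exactly which files sit in the folder and on the order `glob` happens to return them in, which Python does not guarantee. A minimal sketch of a selection that reads the year out of each file name instead is shown here. The helper name `select_year_files` and the example paths are hypothetical, not part of the assignment.

```python
import re

# Hypothetical helper: keep only parcel files whose leading year is in range.
YEAR_RE = re.compile(r'(\d{4})_metro_tax_parcels\.txt$')

def select_year_files(paths, first_year=2004, last_year=2014):
    """Return the paths for first_year..last_year, sorted by year."""
    selected = []
    for path in paths:
        match = YEAR_RE.search(path)
        if match and first_year <= int(match.group(1)) <= last_year:
            selected.append((int(match.group(1)), path))
    return [path for _, path in sorted(selected)]

# Made-up, unsorted input standing in for the result of glob(...).
example_paths = [
    '../MinneMUDAC_raw_files/2015_metro_tax_parcels.txt',
    '../MinneMUDAC_raw_files/2004_metro_tax_parcels.txt',
    '../MinneMUDAC_raw_files/2003_metro_tax_parcels.txt',
    '../MinneMUDAC_raw_files/2014_metro_tax_parcels.txt',
]
selected_files = select_year_files(example_paths)
```

In the notebook this would be called as `select_year_files(glob(...))`, so adding or removing a raw file does not silently shift the selection.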

In [4]:
FILE_NAME_RE = re.compile(r'^\.\./MinneMUDAC_raw_files/(20\d\d_metro_tax_parcels)\.txt$')
file_name = lambda p: FILE_NAME_RE.match(p).group(1) 
file_names = lambda files: [file_name(p) for p in files]
years= file_names(files)
years

['2004_metro_tax_parcels',
 '2005_metro_tax_parcels',
 '2006_metro_tax_parcels',
 '2007_metro_tax_parcels',
 '2008_metro_tax_parcels',
 '2009_metro_tax_parcels',
 '2010_metro_tax_parcels',
 '2011_metro_tax_parcels',
 '2012_metro_tax_parcels',
 '2013_metro_tax_parcels',
 '2014_metro_tax_parcels']

4. Use the first chunk of the first file to prototype an expression that <br>
    a. Selects the common columns <br>
    b. Fixes any issues with the column names <br>
    c. Changes columns to the correct types (if necessary).  More information about the columns can be found [here](ftp://ftp.gisdata.mn.gov/pub/gdrs/data/pub/us_mn_state_metrogis/plan_regonal_prcls_open/metadata/metadata.html). It is **imperative** that you keep the lat and long columns as strings. <br>
    d. Use the translation dictionaries from the last part to add three new columns to the chunk: lake code, lake name, parcel distance to the lake. <br>
    e. Filters to only properties that are within 1600 m (~1 mile) of the closest lake.

In [5]:
c_size = 50000

In [6]:
parcel_files = [pd.read_csv(file,chunksize=c_size,sep='|',dtype = {'centroid_lat': str,'centroid_long': str, 'PIN': str, 'Year':str}) for file in files]

In [7]:
first_chunks = [first(df) for df in parcel_files]


In [10]:
first_chunk = first_chunks[0]
first_chunk2 = first_chunk.head()
first_chunk2

Unnamed: 0,ACRES_DEED,ACRES_POLY,AGPRE_ENRD,AGPRE_EXPD,AG_PRESERV,BASEMENT,BLDG_NUM,BLOCK,CITY,CITY_USPS,...,XUSE1_DESC,XUSE2_DESC,XUSE3_DESC,XUSE4_DESC,YEAR_BUILT,Year,ZIP,ZIP4,centroid_lat,centroid_long
0,0.0,8.03,,,N,,,,SAINT FRANCIS,,...,,,,,1980.0,2004,,,45.41332,-93.26739
1,0.0,0.93,,,N,,24457.0,,SAINT FRANCIS,BETHEL,...,,,,,1974.0,2004,55005.0,,45.41354,-93.2701
2,0.0,8.75,,,N,,24442.0,,SAINT FRANCIS,BETHEL,...,,,,,1969.0,2004,55005.0,,45.41318,-93.27344
3,0.0,11.17,,,N,,410.0,,SAINT FRANCIS,BETHEL,...,,,,,1989.0,2004,55005.0,,45.41167,-93.27684
4,0.0,14.46,,,N,,480.0,,SAINT FRANCIS,BETHEL,...,,,,,1995.0,2004,55070.0,,45.41169,-93.27849


In [11]:
new_chunk = (first_chunk2 
             >> select(common_columns)
             >> mutate(
                 lat_long = pd.Series(zip(first_chunk2.centroid_lat,first_chunk2.centroid_long)))
             >> mutate(
                 lake_code = recode(X.lat_long,ll_code_dict),
                 distance_to_lake = recode(X.lat_long,ll_dist_dict))
             >> mutate(
                 lake_name = recode(X.lake_code,code_name_dict))
             >> filter_by(X.distance_to_lake <= 1600)
            )

In [12]:
new_chunk

Unnamed: 0,GARAGE,XUSE1_DESC,GREEN_ACRE,Year,COUNTY_ID,OWNER_MORE,OWN_ADD_L1,OWN_ADD_L2,SUFFIX_DIR,USE1_DESC,...,FIN_SQ_FT,OPEN_SPACE,STREETNAME,EMV_TOTAL,USE3_DESC,XUSE4_DESC,lat_long,lake_code,distance_to_lake,lake_name
2,,,N,2004,3,,24442 DOGWOOD ST NW,ST FRANCIS,NW,,...,0.0,N,DOGWOOD,195751.0,,,"(45.41318, -93.27344)",27019101-01,311.355787,Sarah Lake
3,,,N,2004,3,,PO BOX 14,BETHEL,NW,,...,0.0,N,245TH,210338.0,,,"(45.41167, -93.27684)",27010700-01,962.625892,Parkers Lake
4,,,N,2004,3,,480 245TH AVE NW,EAST BETHEL,NW,,...,0.0,N,245TH,204359.0,,,"(45.41169, -93.27849)",82002000-01,549.680374,McKusick Lake


5. Now convert your expression from the last problem to a function and test that this function works on the first few chunks of each file.

In [13]:
def add_lake_columns(df):
    new_df = (df 
                 >> select(common_columns)
                 >> mutate(
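                     # Reuse df's index so the pairs stay aligned on chunks that do not start at row 0
                     lat_long = pd.Series(list(zip(df.centroid_lat, df.centroid_long)), index=df.index))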
                 >> mutate(
                     lake_code = recode(X.lat_long,ll_code_dict),
                     distance_to_lake = recode(X.lat_long,ll_dist_dict))
                 >> mutate(
                     lake_name = recode(X.lake_code,code_name_dict))
                 >> filter_by(X.distance_to_lake <= 1600)
             )
    return new_df
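One detail matters once this function is applied to chunks beyond the first of a file: `read_csv` with `chunksize` keeps counting the row index across chunks, so a later chunk does not start at 0. Whether a `lat_long` Series lines up with its rows then depends on the index it is given. The sketch below uses toy data, and plain pandas `assign` stands in for dfply's `mutate`, on the assumption that both align an assigned Series on index labels.

```python
import pandas as pd

# Toy stand-in for a second chunk: its index continues from the previous chunk.
chunk = pd.DataFrame(
    {'centroid_lat': ['45.1', '45.2'], 'centroid_long': ['-93.1', '-93.2']},
    index=[50000, 50001],
)
pairs = list(zip(chunk.centroid_lat, chunk.centroid_long))

# Default 0-based index: assignment aligns on labels, so no row finds a match.
misaligned = chunk.assign(lat_long=pd.Series(pairs))

# Reusing the chunk's own index keeps each pair on its row.
aligned = chunk.assign(lat_long=pd.Series(pairs, index=chunk.index))
```

Here `misaligned.lat_long` is entirely missing while `aligned.lat_long` holds the tuples, which is why passing `index=df.index` is the safer form inside a chunk-processing function.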

In [14]:
new_first_chunks = [add_lake_columns(chunk) for chunk in first_chunks]

In [15]:
[chunk.head() for chunk in new_first_chunks]

[   GARAGE  XUSE1_DESC GREEN_ACRE  Year  COUNTY_ID  OWNER_MORE  \
 2     NaN         NaN          N  2004          3         NaN   
 3     NaN         NaN          N  2004          3         NaN   
 4     NaN         NaN          N  2004          3         NaN   
 5     NaN         NaN          N  2004          3         NaN   
 7     NaN         NaN          N  2004          3         NaN   
 
             OWN_ADD_L1   OWN_ADD_L2 SUFFIX_DIR  USE1_DESC  ... FIN_SQ_FT  \
 2  24442 DOGWOOD ST NW   ST FRANCIS         NW        NaN  ...       0.0   
 3            PO BOX 14       BETHEL         NW        NaN  ...       0.0   
 4     480 245TH AVE NW  EAST BETHEL         NW        NaN  ...       0.0   
 5     500 LAFAYETTE RD      ST PAUL        NaN        NaN  ...       0.0   
 7     550 245TH AVE NE       ISANTI         NW        NaN  ...       0.0   
 
    OPEN_SPACE STREETNAME  EMV_TOTAL  USE3_DESC XUSE4_DESC  \
 2           N    DOGWOOD   195751.0        NaN        NaN   
 3           N

6. We need to make a unique primary key for each row in the combined parcel file.<br>
    a. There is a column that appears to be a unique parcel id.  Double check that this is a true primary key for each individual file. (To do this you need to verify that the number of unique values is the same as the number of rows for each of the parcel files.  **Hint:** For each file, use the accumulator pattern with two accumulators (one number and one data frame).) <br>
    b. Explain why this column will not work as a primary key if we want to combine all years in one database. <br>
    c. Suppose we make a new column that consists of `str(year) + '-' + PIN`.  Explain why this should make a proper primary key for the combined data. <br>

In [54]:
pin_set=set()
row_num=0

for file in first_chunks:
    pin_set = pin_set.union(set(file.PIN))
    row_num=row_num+len(file)    

print(row_num)
print(len(pin_set))

5500
2136


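Pooling just the first chunk of each file already gives more rows (5500) than unique PINs (2136), because the same parcel shows up in more than one year's file. PIN alone therefore cannot serve as the primary key for the combined data.

Prefixing the PIN with the year distinguishes the same parcel across years, so the combined key should be unique as long as PIN is unique within each year's file.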
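The cell above pools the first chunks of every year, which speaks to part (b). Part (a) asks about each file on its own. A minimal sketch of that per-file check follows, using one numeric accumulator and a set of PINs rather than the data frame suggested in the hint. The function name `pin_is_unique` and the toy chunks are hypothetical; in the notebook the chunks would come from `pd.read_csv(..., chunksize=...)`.

```python
import pandas as pd

def pin_is_unique(chunks):
    """Accumulate a row count and the set of PINs over one file's chunks."""
    row_count = 0
    pins = set()
    for chunk in chunks:
        row_count += len(chunk)
        pins |= set(chunk.PIN)
    # PIN is a key for this file only if no value was seen twice.
    return row_count == len(pins)

# Toy chunks standing in for two parcel files.
file_2004 = [pd.DataFrame({'PIN': ['003-1', '003-2']}),
             pd.DataFrame({'PIN': ['003-3']})]
file_2005 = [pd.DataFrame({'PIN': ['003-1', '003-1']})]

per_file = {'2004': pin_is_unique(file_2004),
            '2005': pin_is_unique(file_2005)}
```

Any file that reports `False` has a repeated PIN within a single year, in which case `str(year) + '-' + PIN` cannot be unique either.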

7. Make a function to add the key suggested in the last problem (`str(year) + '-' + PIN`) to a given chunk.

In [82]:
def add_key(df):
    df2 = (df
           >> mutate(key = X.Year.str.cat(X.PIN, sep='-'))
          )
    return df2

In [83]:
add_key(first_chunk2)

Unnamed: 0,ACRES_DEED,ACRES_POLY,AGPRE_ENRD,AGPRE_EXPD,AG_PRESERV,BASEMENT,BLDG_NUM,BLOCK,CITY,CITY_USPS,...,XUSE2_DESC,XUSE3_DESC,XUSE4_DESC,YEAR_BUILT,Year,ZIP,ZIP4,centroid_lat,centroid_long,key
0,0.0,8.03,,,N,,,,SAINT FRANCIS,,...,,,,1980.0,2004,,,45.41332,-93.26739,2004-003-253424110001
1,0.0,0.93,,,N,,24457.0,,SAINT FRANCIS,BETHEL,...,,,,1974.0,2004,55005.0,,45.41354,-93.2701,2004-003-253424110002
2,0.0,8.75,,,N,,24442.0,,SAINT FRANCIS,BETHEL,...,,,,1969.0,2004,55005.0,,45.41318,-93.27344,2004-003-253424120001
3,0.0,11.17,,,N,,410.0,,SAINT FRANCIS,BETHEL,...,,,,1989.0,2004,55005.0,,45.41167,-93.27684,2004-003-253424210002
4,0.0,14.46,,,N,,480.0,,SAINT FRANCIS,BETHEL,...,,,,1995.0,2004,55070.0,,45.41169,-93.27849,2004-003-253424210003


#### Note: If you are clever, you can do part 8 in one double loop, which will save you from having to read the parcel files twice.

8. It is probably worth our time to test that our new key column is truly unique. (If not, we might waste our time loading the data into a database, only to have the process fail hours in.) Test that the new column works by <br>
    a. Iterating over all the files.<br>
    b. Using an accumulator to count total number of rows across all parcel files. <br>
    c. Using an accumulator to accumulate a set of all unique values of our new key. <br>
    d. Verifying that we have as many total rows as unique keys. <br>
    a. Selecting just this column. <br>
    b. Dumping this column into a temporary database <br>

In [8]:
key_set=set()
row_num=0

In [98]:
for file in first_chunks:
    file_with_key = add_key(file)
    key_set = key_set.union(set(file_with_key.key))
    row_num=row_num+len(file_with_key)  
    print(len(set(file_with_key.key)))

490
489
498
498
500
500
500
500
500
500
500


In [97]:
print(row_num)
print(len(key_set))

5500
5475
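The totals above show 25 more rows than distinct keys (5500 versus 5475), and the per-file counts of 490, 489, 498 and 498 suggest the repeats sit in the first four files, assuming each first chunk holds 500 rows. The `Year-PIN` key is therefore not unique, which is why the next step falls back to an integer `id`. A minimal sketch for listing the offending rows with pandas `duplicated`, on toy data rather than the real chunks:

```python
import pandas as pd

# Toy frame; the real input would be a chunk with Year and PIN read as strings.
toy = pd.DataFrame({
    'Year': ['2004', '2004', '2004', '2005'],
    'PIN': ['003-1', '003-1', '003-2', '003-1'],
})
toy['key'] = toy.Year.str.cat(toy.PIN, sep='-')

# keep=False marks every member of a duplicated group, not just the repeats.
repeated = toy[toy.key.duplicated(keep=False)]
repeat_counts = repeated.key.value_counts()
```

Inspecting `repeated` on a real chunk would show whether the rows sharing a key are exact copies or distinct records that happen to share a PIN.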


9. If the last step succeeded, you can proceed to make a master parcel data database.  If not, you will need to figure out another primary key, probably an `id` column similar to the example in the lectures.

In [9]:
add_lake_columns = pipeable(lambda df: (df 
                                         >> select(common_columns)
                                         >> mutate(
                                             # Reuse df's index so the pairs stay aligned on later chunks
                                             lat_long = pd.Series(list(zip(df.centroid_lat, df.centroid_long)), index=df.index))
                                         >> mutate(
                                             lake_code = recode(X.lat_long,ll_code_dict),
                                             distance_to_lake = recode(X.lat_long,ll_dist_dict))
                                         >> mutate(
                                             lake_name = recode(X.lake_code,code_name_dict))
                                         >> filter_by(X.distance_to_lake <= 1600))
                           )

In [10]:
add_primary_key = pipeable(lambda start, df: (df
                                              >> mutate(id = np.arange(start, start + len(df))
                                              )))

In [11]:
process_chunk = pipeable(lambda start, df, chunksize=c_size: (df >> add_lake_columns >> add_primary_key(start)))
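The `start` argument only produces unique ids if the caller advances it by at least the number of rows written for each chunk. The sketch below uses plain pandas to show the bookkeeping. The helper name `with_running_id` and the toy filter are hypothetical stand-ins for `add_primary_key` and the distance filter.

```python
import numpy as np
import pandas as pd

def with_running_id(df, start):
    """Hypothetical plain-pandas equivalent of add_primary_key."""
    return df.assign(id=np.arange(start, start + len(df)))

chunks = [pd.DataFrame({'PIN': ['a', 'b', 'c']}),
          pd.DataFrame({'PIN': ['d', 'e']})]

rows_so_far = 0
ids = []
for chunk in chunks:
    kept = chunk[chunk.PIN != 'b']        # stand-in for the distance filter
    kept = with_running_id(kept, rows_so_far)
    ids.extend(kept.id.tolist())
    rows_so_far += len(chunk)             # advance by the unfiltered length
```

Advancing by the unfiltered chunk length, as the loading loop later in this notebook does, keeps the ids unique but leaves gaps wherever rows were filtered out. Advancing by the filtered length would give consecutive ids instead.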

In [12]:
from more_sqlalchemy import get_sql_types
i = 0
complete_first_chunk = first_chunks[0] >> process_chunk(i)
sql_types = get_sql_types(complete_first_chunk)

In [16]:
from sqlalchemy import String, Float, Integer, DateTime
sql_types = {col:String if 'DESC' in col else t for col,t in sql_types.items()}

In [14]:
[col for col in enumerate(sorted(common_columns))]


[(0, 'ACRES_DEED'),
 (1, 'ACRES_POLY'),
 (2, 'AGPRE_ENRD'),
 (3, 'AGPRE_EXPD'),
 (4, 'AG_PRESERV'),
 (5, 'BASEMENT'),
 (6, 'BLDG_NUM'),
 (7, 'BLOCK'),
 (8, 'CITY'),
 (9, 'CITY_USPS'),
 (10, 'COOLING'),
 (11, 'COUNTY_ID'),
 (12, 'DWELL_TYPE'),
 (13, 'EMV_BLDG'),
 (14, 'EMV_LAND'),
 (15, 'EMV_TOTAL'),
 (16, 'FIN_SQ_FT'),
 (17, 'GARAGE'),
 (18, 'GARAGESQFT'),
 (19, 'GREEN_ACRE'),
 (20, 'HEATING'),
 (21, 'HOMESTEAD'),
 (22, 'HOME_STYLE'),
 (23, 'LANDMARK'),
 (24, 'LOT'),
 (25, 'MULTI_USES'),
 (26, 'NUM_UNITS'),
 (27, 'OPEN_SPACE'),
 (28, 'OWNER_MORE'),
 (29, 'OWNER_NAME'),
 (30, 'OWN_ADD_L1'),
 (31, 'OWN_ADD_L2'),
 (32, 'OWN_ADD_L3'),
 (33, 'PARC_CODE'),
 (34, 'PIN'),
 (35, 'PLAT_NAME'),
 (36, 'PREFIXTYPE'),
 (37, 'PREFIX_DIR'),
 (38, 'SALE_DATE'),
 (39, 'SALE_VALUE'),
 (40, 'SCHOOL_DST'),
 (41, 'SPEC_ASSES'),
 (42, 'STREETNAME'),
 (43, 'STREETTYPE'),
 (44, 'SUFFIX_DIR'),
 (45, 'Shape_Area'),
 (46, 'Shape_Leng'),
 (47, 'TAX_ADD_L1'),
 (48, 'TAX_ADD_L2'),
 (49, 'TAX_ADD_L3'),
 (50, 'TAX_CAP

In [28]:
[first_chunks[i].XUSE4_DESC.unique() for i in range(0,11)]
#Very sparse, should code as String

[array([nan]),
 array([nan]),
 array([nan]),
 array([nan]),
 array([nan]),
 array([nan]),
 array([nan]),
 array([nan]),
 array([nan]),
 array([nan]),
 array([nan])]

In [208]:
[first_chunks[i].YEAR_BUILT.unique() for i in range(0,11)]
#Pretty well filled-in, a few missing values coded as 0
#Should be Integer

[array([1980., 1974., 1969., 1989., 1995.,    0., 1985., 1997., 1970.,
        1994., 2001., 1958., 1998., 1986., 1991., 1973., 1965., 1982.,
        1971., 1988., 1900., 1984., 1895., 1890., 1999., 1972., 1931.,
        1962., 1990., 1880., 1910., 1936., 1916., 1993., 1987., 1979.,
        1912., 1977., 1966., 1978., 1920., 2000., 2003., 1996., 1918.,
        1992., 1968., 1930., 1948., 1923., 1956., 1950., 1981., 1922.,
        1955., 1976., 1967., 1921., 1935., 1940., 1903., 1913., 1957.,
        2002., 1945., 1959., 1902., 1901., 1899., 1952., 1905., 1947.,
        1906., 1964., 1919., 1914., 1915., 1983., 1917., 1975., 1943.,
        1942., 1938., 1888., 1926., 1893., 1927., 1961., 1909., 1963.,
        1960., 1864., 1870., 1925., 1933., 1953., 1891., 1949., 1924.,
        1928., 1939., 1874., 1946., 1951., 1954., 1941., 1932., 1937.,
        1934., 1944., 1911., 1860., 1889., 1878., 1887., 1886., 1875.,
        1897., 2004., 1892., 1873., 1881., 1879., 1896., 1929., 1882.,
      

In [209]:
[first_chunks[i].Year.unique() for i in range(0,11)]
#Filled-in, Should be Integer

[array(['2004'], dtype=object),
 array(['2005'], dtype=object),
 array(['2006'], dtype=object),
 array(['2007'], dtype=object),
 array(['2008'], dtype=object),
 array(['2009'], dtype=object),
 array(['2010'], dtype=object),
 array(['2011'], dtype=object),
 array(['2012'], dtype=object),
 array(['2013'], dtype=object),
 array(['2014'], dtype=object)]

In [210]:
[first_chunks[i].ZIP.unique() for i in range(0,11)]
#Mostly filled-in with a few nan, Should be Integer

[array([nan, 55005.0, 55070.0, 55040.0, 55330.0, 55011.0, 55303.0, 5070.0,
        55448.0, 55433.0, '55433', '55448', '55443', '55303', '55330', 'W',
        55304.0, 55079.0, 55092.0, 55025.0, 55038.0, 55014.0, 55110.0,
        55126.0], dtype=object),
 array([nan, 55005.0, 55070.0, 55040.0, 55330.0, 55011.0, 55303.0, 5070.0,
        55448.0, 55433.0, '55433', '55448', '55443', '55303', '55330', 'W',
        55304.0, 55079.0, 55092.0, 55025.0, 55038.0, 55014.0, 55110.0],
       dtype=object),
 array([55304.,    nan, 55011., 55448., 55303., 55449., 55434., 55014.,
        55126., 55433., 55432., 55499., 55070., 55330., 55421.]),
 array([55304.,    nan, 55011., 55448., 55303., 55005., 55092., 55449.,
        55434., 55014., 55126., 55432., 55070., 55330., 55421.]),
 array([55070., 55330., 55303., 55040., 55005.,    nan, 55432., 55434.,
        55011., 55304., 55092., 55079., 55025., 55014., 55126., 55110.,
        55038., 55449., 55421.]),
 array([55070., 55330., 55303., 55040., 55432.
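
The ZIP output mixes floats, digit strings, a stray `'W'`, and `nan`, so the column cannot be cast to integer directly. One possible cleanup, sketched on toy values that mirror that mix, is to coerce to numeric and then use pandas' nullable `Int64` type. Treating unparseable entries such as `'W'` as missing is an assumption here, not something the assignment specifies.

```python
import numpy as np
import pandas as pd

# Toy values mirroring the mix seen in the output above.
zip_raw = pd.Series([55005.0, '55433', 'W', np.nan], dtype=object)

# errors='coerce' turns unparseable entries into NaN;
# 'Int64' is pandas' nullable integer type, so missing values survive the cast.
zip_clean = pd.to_numeric(zip_raw, errors='coerce').astype('Int64')
```

The same two calls could be applied to `ZIP4`, which shows a similar pattern of floats and missing values.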

In [211]:
[first_chunks[i].ZIP4.unique() for i in range(0,11)]
#Lots of nan, Should be Integer

[array([nan]),
 array([  nan, 9547., 9404., ..., 1413., 1421., 6400.]),
 array([4187., 4253., 6795., ..., 2188., 3191., 3947.]),
 array([4187., 4253., 6795., ..., 9691., 5421., 2874.]),
 array([7601., 9691., 9766., ..., 9516., 9495., 8538.]),
 array([7601., 9691.,   nan, ..., 9791., 9808., 9469.]),
 array([  nan, 8461., 2535., ..., 4015., 2923., 2540.]),
 array([  nan, 9013., 9029., ..., 8647., 1125., 4818.]),
 array([nan]),
 array([nan]),
 array([nan])]

In [212]:
[first_chunks[i].centroid_lat.unique() for i in range(0,11)]
#Filled-in, Should be String for matching purposes

[array(['45.41332', '45.41354', '45.41318', ..., '45.13798', '45.13735',
        '45.13752'], dtype=object),
 array(['45.41332', '45.41354', '45.41318', ..., '45.13219', '45.13222',
        '45.13177'], dtype=object),
 array(['45.22905', '45.22892', '45.22864', ..., '45.03794', '45.03772',
        '45.03774'], dtype=object),
 array(['45.22905', '45.22892', '45.22864', ..., '45.06136', '45.06141',
        '45.06129'], dtype=object),
 array(['45.40663', '45.40537', '45.40447', ..., '45.31657', '45.31644',
        '45.31621'], dtype=object),
 array(['45.40663', '45.40447', '45.39637', ..., '45.34378', '45.35623',
        '45.37276'], dtype=object),
 array(['45.39768', '45.39852', '45.39462', ..., '45.19125', '45.17541',
        '45.30513'], dtype=object),
 array(['inf', '44.97589', '44.976', ..., '44.75941', '44.75946',
        '44.75943'], dtype=object),
 array(['45.39768', '45.39852', '45.39462', ..., '45.16454', '45.19125',
        '45.17541'], dtype=object),
 array(['45.39768', '45.39

In [39]:
[first_chunks[i].SALE_VALUE.unique() for i in range(0,11)]
#Filled-in numeric values, should be Integer

[array([     0., 215000., 101000., ..., 375087., 299773., 379500.]),
 array([     0., 215000.,  55500., ..., 185490., 397500., 297900.]),
 array([295547.,      0., 300065., ..., 203633., 131351., 134500.]),
 array([295547.,      0., 300065., ..., 202487., 162150.,  86700.]),
 array([     0., 251400., 250000., ...,   5500., 152451., 162875.]),
 array([299000.,      0., 155000., ...,  85960., 245929., 108900.]),
 array([     0., 155000., 332900., ..., 201530., 205550., 174100.]),
 array([      0., 1427750., 3019000., ...,  218020.,  283697.,  224148.]),
 array([     0., 155000., 332900., ..., 566500., 177111., 205550.]),
 array([0.00000e+00, 1.55000e+05, 3.32900e+05, ..., 9.23855e+06,
        1.76400e+05, 3.30000e+03]),
 array([     0., 155000., 332900., ..., 201367., 134329., 266965.])]

In [18]:
import sqlalchemy.sql.sqltypes

sql_types = {'GARAGE': sqlalchemy.sql.sqltypes.String,
 'XUSE1_DESC': sqlalchemy.sql.sqltypes.String,
 'GREEN_ACRE': sqlalchemy.sql.sqltypes.String,
 'Year': sqlalchemy.sql.sqltypes.String,
 'COUNTY_ID': sqlalchemy.sql.sqltypes.Integer,
 'OWNER_MORE': sqlalchemy.sql.sqltypes.String,
 'OWN_ADD_L1': sqlalchemy.sql.sqltypes.String,
 'OWN_ADD_L2': sqlalchemy.sql.sqltypes.String,
 'SUFFIX_DIR': sqlalchemy.sql.sqltypes.String,
 'USE1_DESC': sqlalchemy.sql.sqltypes.String,
 'AG_PRESERV': sqlalchemy.sql.sqltypes.String,
 'AGPRE_ENRD': sqlalchemy.sql.sqltypes.String,
 'STREETTYPE': sqlalchemy.sql.sqltypes.String,
 'BLDG_NUM': sqlalchemy.sql.sqltypes.Integer,
 'DWELL_TYPE': sqlalchemy.sql.sqltypes.String,
 'SALE_DATE': sqlalchemy.sql.sqltypes.String,
 'ZIP4': sqlalchemy.sql.sqltypes.Integer,
 'SPEC_ASSES': sqlalchemy.sql.sqltypes.Integer,
 'OWN_ADD_L3': sqlalchemy.sql.sqltypes.String,
 'OWNER_NAME': sqlalchemy.sql.sqltypes.String,
 'centroid_long': sqlalchemy.sql.sqltypes.String,
 'CITY': sqlalchemy.sql.sqltypes.String,
 'MULTI_USES': sqlalchemy.sql.sqltypes.String,
 'TAX_ADD_L3': sqlalchemy.sql.sqltypes.String,
 'PLAT_NAME': sqlalchemy.sql.sqltypes.String,
 'TAX_EXEMPT': sqlalchemy.sql.sqltypes.String,
 'COOLING': sqlalchemy.sql.sqltypes.String,
 'HOME_STYLE': sqlalchemy.sql.sqltypes.String,
 'LANDMARK': sqlalchemy.sql.sqltypes.String,
 'NUM_UNITS': sqlalchemy.sql.sqltypes.Integer,
 'HOMESTEAD': sqlalchemy.sql.sqltypes.String,
 'TAX_CAPAC': sqlalchemy.sql.sqltypes.Integer,
 'PARC_CODE': sqlalchemy.sql.sqltypes.Integer,
 'PREFIX_DIR': sqlalchemy.sql.sqltypes.String,
 'UNIT_INFO': sqlalchemy.sql.sqltypes.String,
 'TAX_NAME': sqlalchemy.sql.sqltypes.String,
 'USE4_DESC': sqlalchemy.sql.sqltypes.String,
 'centroid_lat': sqlalchemy.sql.sqltypes.String,
 'PREFIXTYPE': sqlalchemy.sql.sqltypes.String,
 'AGPRE_EXPD': sqlalchemy.sql.sqltypes.String,
 'SCHOOL_DST': sqlalchemy.sql.sqltypes.Integer,
 'WSHD_DIST': sqlalchemy.sql.sqltypes.String,
 'PIN': sqlalchemy.sql.sqltypes.String,
 'YEAR_BUILT': sqlalchemy.sql.sqltypes.Integer,
 'EMV_BLDG': sqlalchemy.sql.sqltypes.Integer,
 'EMV_LAND': sqlalchemy.sql.sqltypes.String,
 'XUSE2_DESC': sqlalchemy.sql.sqltypes.String,
 'SALE_VALUE': sqlalchemy.sql.sqltypes.Integer,
 'ZIP': sqlalchemy.sql.sqltypes.Integer,
 'TAX_ADD_L1': sqlalchemy.sql.sqltypes.String,
 'TAX_ADD_L2': sqlalchemy.sql.sqltypes.String,
 'ACRES_DEED': sqlalchemy.sql.sqltypes.Float,
 'XUSE3_DESC': sqlalchemy.sql.sqltypes.String,
 'BASEMENT': sqlalchemy.sql.sqltypes.String,
 'BLOCK': sqlalchemy.sql.sqltypes.String,
 'Shape_Leng': sqlalchemy.sql.sqltypes.String,
 'USE2_DESC': sqlalchemy.sql.sqltypes.String,
 'HEATING': sqlalchemy.sql.sqltypes.String,
 'CITY_USPS': sqlalchemy.sql.sqltypes.String,
 'Shape_Area': sqlalchemy.sql.sqltypes.String,
 'ACRES_POLY': sqlalchemy.sql.sqltypes.Float,
 'TOTAL_TAX': sqlalchemy.sql.sqltypes.Integer,
 'GARAGESQFT': sqlalchemy.sql.sqltypes.Integer,
 'LOT': sqlalchemy.sql.sqltypes.String,
 'FIN_SQ_FT': sqlalchemy.sql.sqltypes.Integer,
 'OPEN_SPACE': sqlalchemy.sql.sqltypes.String,
 'STREETNAME': sqlalchemy.sql.sqltypes.String,
 'EMV_TOTAL': sqlalchemy.sql.sqltypes.Integer,
 'USE3_DESC': sqlalchemy.sql.sqltypes.String,
 'XUSE4_DESC': sqlalchemy.sql.sqltypes.String,
 'lat_long': sqlalchemy.sql.sqltypes.String,
 'lake_code': sqlalchemy.sql.sqltypes.String,
 'distance_to_lake': sqlalchemy.sql.sqltypes.Float,
 'lake_name': sqlalchemy.sql.sqltypes.String,
 'id': sqlalchemy.sql.sqltypes.Integer}

sql_types_Strings = {'ACRES_DEED':String,
                       'ACRES_POLY':String,
                       'AGPRE_ENRD':String,
                       'AGPRE_EXPD':String,
                       'AG_PRESERV':String,
                       'BASEMENT':String, #Drop
                       'BLDG_NUM':String, #Drop
                       'BLOCK':String, #Drop
                       'CITY':String, #Drop
                       'CITY_USPS':String, #Drop
                       'COOLING':String,
                       'COUNTY_ID':String, #Drop
                       'DWELL_TYPE':String,
                       'EMV_BLDG':String,
                       'EMV_LAND':String,
                       'EMV_TOTAL':String,
                       'FIN_SQ_FT':String,
                       'GARAGE':String,
                       'GARAGESQFT':String,
                       'GREEN_ACRE':String,
                       'HEATING':String,
                       'HOMESTEAD':String,
                       'HOME_STYLE':String,
                       'ID':String,
                       'LANDMARK':String,
                       'LOT':String,
                       'MULTI_USES':String,
                       'NUM_UNITS':String,
                       'OPEN_SPACE':String,
                       'OWNER_MORE':String, #Drop
                       'OWNER_NAME':String, #Drop
                       'OWN_ADD_L1':String, #Drop
                       'OWN_ADD_L2':String, #Drop
                       'OWN_ADD_L3':String, #Drop
                       'PARC_CODE':String,
                       'PIN':String,
                       'PLAT_NAME':String, #Drop
                       'PREFIXTYPE':String, #Drop
                       'PREFIX_DIR':String, #Drop
                       'SALE_DATE':String, #Drop
                       'SALE_VALUE':String,
                       'SCHOOL_DST':String, #Drop
                       'SPEC_ASSES':String,
                       'STREETNAME':String, #Drop
                       'STREETTYPE':String, #Drop
                       'SUFFIX_DIR':String, #Drop
                       'Shape_Area':String, #Drop
                       'Shape_Leng':String, #Drop
                       'TAX_ADD_L1':String, #Drop
                       'TAX_ADD_L2':String, #Drop
                       'TAX_ADD_L3':String, #Drop
                       'TAX_CAPAC':String, #Drop
                       'TAX_EXEMPT':String, #Drop
                       'TAX_NAME':String, #Drop
                       'TOTAL_TAX':String, #Drop
                       'UNIT_INFO':String, #Drop
                       'USE1_DESC':String,
                       'USE2_DESC':String,
                       'USE3_DESC':String,
                       'USE4_DESC':String,
                       'WSHD_DIST':String,
                       'XUSE1_DESC':String,
                       'XUSE2_DESC':String,
                       'XUSE3_DESC':String,
                       'XUSE4_DESC':String,
                       'YEAR_BUILT':String,
                       'Year':String, 
                       'ZIP':String, #Drop
                       'ZIP4':String, #Drop
                       'centroid_lat':String,
                       'centroid_long':String}

#sql_types

In [19]:
!rm -f ./databases/project_1_part_4.db

In [20]:
from sqlalchemy import create_engine
engine = create_engine('sqlite:///databases/project_1_part_4.db', echo=False)

In [21]:
schema = pd.io.sql.get_schema(complete_first_chunk, # dataframe
                              'project_4', # name in SQL db
                              keys='id', # primary key
                              con=engine, # connection
                              dtype=sql_types_Strings # SQL types
)
#print(schema)
engine.execute(schema)

<sqlalchemy.engine.result.ResultProxy at 0x7fafc02b5cc0>

In [22]:
df_iter = lambda file : pd.read_csv(file,
                                dtype = {'centroid_lat': str,'centroid_long': str, 'PIN': str, 'Year':str},
                                chunksize=c_size,
                                sep='|',
                                engine='python')

In [23]:
rows_so_far = 0

for f in files:
    print("Beginning file {0}".format(f))
    for i, chunk in enumerate(df_iter(f)):
        processed_chunk = chunk >> process_chunk(rows_so_far)
        print('\t writing chunk {0}'.format(i))
        processed_chunk.to_sql('project_4', 
                               con=engine, 
                               dtype=sql_types_Strings, 
                               index=False,
                               if_exists='append')
        rows_so_far = rows_so_far+len(chunk)


Beginning file ../MinneMUDAC_raw_files/2004_metro_tax_parcels.txt
	 writing chunk 0


InterfaceError: (sqlite3.InterfaceError) Error binding parameter 70 - probably unsupported type. [SQL: 'INSERT INTO project_4 ("GARAGE", "XUSE1_DESC", "GREEN_ACRE", "Year", "COUNTY_ID", "OWNER_MORE", "OWN_ADD_L1", "OWN_ADD_L2", "SUFFIX_DIR", "USE1_DESC", "AG_PRESERV", "AGPRE_ENRD", "STREETTYPE", "BLDG_NUM", "DWELL_TYPE", "SALE_DATE", "ZIP4", "SPEC_ASSES", "OWN_ADD_L3", "OWNER_NAME", centroid_long, "CITY", "MULTI_USES", "TAX_ADD_L3", "PLAT_NAME", "TAX_EXEMPT", "COOLING", "HOME_STYLE", "LANDMARK", "NUM_UNITS", "HOMESTEAD", "TAX_CAPAC", "PARC_CODE", "PREFIX_DIR", "UNIT_INFO", "TAX_NAME", "USE4_DESC", centroid_lat, "PREFIXTYPE", "AGPRE_EXPD", "SCHOOL_DST", "WSHD_DIST", "PIN", "YEAR_BUILT", "EMV_BLDG", "EMV_LAND", "XUSE2_DESC", "SALE_VALUE", "ZIP", "TAX_ADD_L1", "TAX_ADD_L2", "ACRES_DEED", "XUSE3_DESC", "BASEMENT", "BLOCK", "Shape_Leng", "USE2_DESC", "HEATING", "CITY_USPS", "Shape_Area", "ACRES_POLY", "TOTAL_TAX", "GARAGESQFT", "LOT", "FIN_SQ_FT", "OPEN_SPACE", "STREETNAME", "EMV_TOTAL", "USE3_DESC", "XUSE4_DESC", lat_long, lake_code, distance_to_lake, lake_name, id) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)'] [parameters: ((None, None, 'N', '2004', 3, None, '24457 DOGWOOD ST NW', 'BETHEL', None, None, 'N', None, None, None, None, None, None, 0.0, 'MN,  55005', None, '-93.26739', 'SAINT FRANCIS', None, 'MN,  55005', None, 'N', None, None, None, None, 'N', 351.0, 0.0, None, None, None, None, '45.41332', None, None, 15, 'UPPER RUM RIVER WMO', '003-253424110001', 1980.0, 0.0, 17750.0, None, 0.0, None, '24457 DOGWOOD ST NW', 'BETHEL', 0.0, None, None, None, None, None, None, None, None, 8.03, 589.0, None, None, 0.0, 'N', None, 23964.0, None, None, ('45.41332', '-93.26739'), '27001600-01', 371.59328527423776, 'Harriet Lake', 0), (None, None, 'N', '2004', 3, None, '24457 DOGWOOD ST NW', 
'BETHEL', 'NW', None, 'N', None, 'ST', 24457.0, None, None, None, 0.0, 'MN,  55005', None, '-93.2701', 'SAINT FRANCIS', None, 'MN,  55005', None, 'N', None, None, None, None, 'Y', 1475.0, 0.0, None, None, None, None, '45.41354', None, None, 15, 'UPPER RUM RIVER WMO', '003-253424110002', 1974.0, 106719.0, 55400.0, None, 0.0, '55005', '24457 DOGWOOD ST NW', 'BETHEL', 0.0, None, None, None, None, None, None, 'BETHEL', None, 0.93, 1308.0, None, None, 0.0, 'N', 'DOGWOOD', 171837.0, None, None, ('45.41354', '-93.2701'), '02008100-01', 1407.5312639208016, 'Hart Lake', 1), (None, None, 'N', '2004', 3, None, 'PO BOX 14', 'BETHEL', 'NW', None, 'N', None, 'AVE', 410.0, None, '1995-03-23', None, 0.0, 'MN,  55005', None, '-93.27684', 'SAINT FRANCIS', None, 'MO,  63017', None, 'N', None, None, None, None, 'Y', 1744.0, 0.0, None, None, None, None, '45.41167', None, None, 15, 'UPPER RUM RIVER WMO', '003-253424210002', 1989.0, 120762.0, 77000.0, None, 101000.0, '55005', '14528 SO OUTER FORTY RD', 'CHESTERFIELD', 0.0, None, None, None, None, None, None, 'BETHEL', None, 11.17, 1609.0, None, None, 0.0, 'N', '245TH', 210338.0, None, None, ('45.41167', '-93.27684'), '27001900-01', 678.3843123463598, 'Nokomis Lake', 2), (None, None, 'N', '2004', 3, None, '480 245TH AVE NW', 'EAST BETHEL', 'NW', None, 'N', None, 'AVE', 480.0, None, '1995-04-04', None, 0.0, 'MN,  55005', None, '-93.27849', 'SAINT FRANCIS', None, 'VA,  23285', None, 'N', None, None, None, None, 'Y', 1636.0, 0.0, None, None, None, None, '45.41169', None, None, 15, 'UPPER RUM RIVER WMO', '003-253424210003', 1995.0, 114220.0, 80505.0, None, 101900.0, '55070', '%VALUTREE REAL EST SER, LLC  TAX SER DIV', 'RICHMOND', 0.0, None, None, None, None, None, None, 'BETHEL', None, 14.46, 1488.0, None, None, 0.0, 'N', '245TH', 204359.0, None, None, ('45.41169', '-93.27849'), '82014700-01', 278.20919444020785, 'Egg Lake', 3), (None, None, 'N', '2004', 3, None, '921 235TH AVE NE', 'EAST BETHEL', None, None, 'N', None, None, None, None, 
'2004-06-25', None, 0.0, 'MN,  55005', None, '-93.27973', 'SAINT FRANCIS', None, 'CA,  91724', None, 'N', None, None, None, None, 'N', 66.0, 0.0, None, None, None, None, '45.41172', None, None, 15, 'UPPER RUM RIVER WMO', '003-253424210004', 0.0, 0.0, 10005.0, None, 55500.0, None, '1123 S PARKVIEW', 'COVINA', 0.0, None, None, None, None, None, None, None, None, 4.69, 68.0, None, None, 0.0, 'N', None, 10005.0, None, None, ('45.41172', '-93.27973'), '19019800-01', 242.1349001467506, 'Scout Lake', 4), (None, None, 'N', '2004', 3, None, '24485 PARTRIDGE ST NW', 'ST FRANCIS', 'NW', None, 'N', None, 'ST', 24485.0, None, '1995-04-05', None, 0.0, 'MN,  55070', None, '-93.3182', 'SAINT FRANCIS', None, 'TX,  75247', None, 'N', None, None, None, None, 'Y', 2342.0, 0.0, None, None, None, None, '45.41316', None, None, 15, 'UPPER RUM RIVER WMO', '003-273424210010', 1985.0, 170164.0, 78000.0, None, 185000.0, '55070', '8435 N STEMMONS FREEWAY', 'DALLAS', 0.0, None, None, '1', None, None, None, 'SAINT FRANCIS', None, 4.49, 2304.0, None, '4', 0.0, 'N', 'PARTRIDGE', 269806.0, None, None, ('45.41316', '-93.3182'), '27017100-01', 697.7387467511415, 'Sylvan Lake', 5), (None, None, 'N', '2004', 3, None, '24464 PARTRIDGE ST NW', 'ST FRANCIS', 'NW', None, 'N', None, 'ST', 24464.0, None, '1995-05-13', None, 0.0, 'MN,  55070', None, '-93.32075', 'SAINT FRANCIS', None, 'UT,  84107', None, 'N', None, None, None, None, 'Y', 1792.0, 0.0, None, None, None, None, '45.41317', None, None, 15, 'UPPER RUM RIVER WMO', '003-273424210009', 1995.0, 122099.0, 79000.0, None, 125857.0, '55070', '6053 S FASHION SQ DR #200', 'MURRAY', 0.0, None, None, '1', None, None, None, 'SAINT FRANCIS', None, 4.62, 1662.0, None, '3', 0.0, 'N', 'PARTRIDGE', 221632.0, None, None, ('45.41317', '-93.32075'), '62001002-01', 1081.8514085399743, None, 6), (None, None, 'Y', '2004', 3, None, '24443 VERDIN ST NW', 'ST FRANCIS', 'NW', None, 'N', None, 'ST', 24443.0, None, None, None, 0.0, 'MN,  55070', None, '-93.32399', 'SAINT 
FRANCIS', None, 'MN,  55070', None, 'N', None, None, None, None, 'Y', 1709.0, 0.0, None, None, None, None, '45.41282', None, None, 15, 'UPPER RUM RIVER WMO', '003-273424220003', 1974.0, 115638.0, 108625.0, None, 0.0, '55070', '24443 VERDIN ST NW', 'ST FRANCIS', 0.0, None, None, None, None, None, None, 'SAINT FRANCIS', None, 14.95, 1431.0, None, None, 0.0, 'N', 'VERDIN', 265496.0, None, None, ('45.41282', '-93.32399'), '27016000-01', 728.8497877115113, 'Long Lake', 7)  ... displaying 10 of 25225 total bound parameter sets ...  (None, None, 'N', '2004', 3, None, '6325 CHEROKEE TRL', 'LINO LAKES', None, None, 'N', None, 'TRL', 6325.0, None, '2000-02-15', None, 0.0, 'MN,  55038', None, '-93.08149', 'LINO LAKES', None, 'MN,  55038', None, 'N', None, None, None, 0.0, 'Y', 2974.0, 0.0, None, None, None, None, '45.13623', None, None, 12, 'RICE CREEK WATERSHED DISTRICT', '003-333122110026', 1999.0, 209509.0, 77550.0, None, 299773.0, None, '6325 CHEROKEE TRL', 'LINO LAKES', 0.0, None, None, '2', None, None, None, 'HUGO', None, 1.08, 3581.0, None, '15', 0.0, 'N', 'CHEROKEE', 303731.0, None, None, ('45.13623', '-93.08149'), '27004700-01', 1245.3489063158154, 'Bush Lake', 25223), (None, None, 'N', '2004', 3, None, '1186 DURANGO POINT', 'LINO LAKES', None, None, 'N', None, 'PT', 1186.0, None, '2002-01-31', None, 0.0, 'MN,  55038', None, '-93.08196', 'LINO LAKES', None, 'IA,  50328', None, 'N', None, None, None, 0.0, 'Y', 3721.0, 0.0, None, None, None, None, '45.13731', None, None, 12, 'RICE CREEK WATERSHED DISTRICT', '003-333122110013', 1999.0, 280864.0, 77550.0, None, 379500.0, None, '1 HOME CAMPUS MAC X2502-011', 'DES MOINES', 0.0, None, None, '2', None, None, None, 'HUGO', None, 0.74, 4565.0, None, '2', 0.0, 'N', 'DURANGO', 375468.0, None, None, ('45.13731', '-93.08196'), '19044600-01', 638.4496824701953, 'Lac Lavon Lake', 25224))] (Background on this error at: http://sqlalche.me/e/rvf5)
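
Counting from zero through the column list of the `INSERT` statement, parameter 70 is `lat_long`, and the bound values shown for it are Python tuples such as `('45.41332', '-93.26739')`. SQLite's Python driver has no built-in way to bind a tuple, which appears to be what triggers the `InterfaceError`. A minimal sketch using only the standard-library `sqlite3` module reproduces the failure and shows one possible workaround, joining the pair into a single string. The table name and layout are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE parcels (id INTEGER PRIMARY KEY, lat_long TEXT)')

pair = ('45.41332', '-93.26739')

# Binding the tuple itself fails: sqlite3 has no built-in adapter for tuples.
# The exact exception class differs between Python versions, so catch the base class.
try:
    conn.execute('INSERT INTO parcels VALUES (?, ?)', (0, pair))
    tuple_rejected = False
except sqlite3.Error:
    tuple_rejected = True

# Joining the pair into one string gives sqlite3 a type it can store.
conn.execute('INSERT INTO parcels VALUES (?, ?)', (0, ','.join(pair)))
stored = conn.execute('SELECT lat_long FROM parcels').fetchone()[0]
```

In the notebook's pipeline the analogous step would be to convert `lat_long` to a string, or drop it (the same information is already in `centroid_lat` and `centroid_long`), inside `process_chunk` before calling `to_sql`.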