## Using SQLalchemy to fix issues with the dataset

There are two problems with the initial data which can be solved at once. The first is that the data files are too big to easily load into my aging macbook pro. The second is that the file which contains the data's target files, the "train" csv's have much fewer entries than the datasets ('properties'). So for the purposes of fitting data to a model and training it, the majority of the properties data can't be used.

The solution is to to load the large datasets chunk by chunk into a sql database. Then use SQL commands to filter out the unused properties data in order to build a pandas database. We can then export to csv to create a file more easily used by personal computers without sacrificing data points like earlier attempts.

In [1]:
import pandas as pd
import numpy as np
#import random as rnd
#import nltk
#import datetime
import math
from sqlalchemy import create_engine
#nltk.download()

In [2]:
#vals_2016 = pd.read_csv('train_2016_v2.csv')
#vals_2017 = pd.read_csv('train_2017.csv')

csv_database = create_engine('sqlite:///csv_database.db')

# Next we load a the 2016 dataset into a SQL engine table chunk by chunk
chunksize = 30000
i_1 = 0
j_1 = 1
for df in pd.read_csv('properties_2016.csv', chunksize=chunksize, iterator=True):
      df = df.rename(columns={c: c.replace(' ', '') for c in df.columns}) 
      df.index += j_1
      i_1+=1
      df.to_sql('table_2016', csv_database, if_exists='append')
      j = df.index[-1] + 1
#now we do the same for 2017 into another table
i_2 = 0
j_2 = 1
for df in pd.read_csv('properties_2017.csv', chunksize=chunksize, iterator=True):
      df = df.rename(columns={c: c.replace(' ', '') for c in df.columns}) 
      df.index += j_2
      i_2+=1
      df.to_sql('table_2017', csv_database, if_exists='append')
      j = df.index[-1] + 1
        
        
i_3 = 0
j_3 = 1
for df in pd.read_csv('train_2016_v2.csv', chunksize=chunksize, iterator=True):
      df = df.rename(columns={c: c.replace(' ', '') for c in df.columns}) 
      df.index += j_3
      i_3+=1
      df.to_sql('vals_2016', csv_database, if_exists='append')
      j = df.index[-1] + 1
        
i_4 = 0
j_4 = 1
for df in pd.read_csv('train_2017.csv', chunksize=chunksize, iterator=True):
      df = df.rename(columns={c: c.replace(' ', '') for c in df.columns}) 
      df.index += j_4
      i_4+=1
      df.to_sql('vals_2017', csv_database, if_exists='append')
      j = df.index[-1] + 1
#df = pd.read_sql_query('SELECT COl1 table_2017', csv_database)
#df.head()

In [3]:
df2016 = pd.read_sql_query('Select * From table_2016 T inner join vals_2016 V on T.parcelid = V.parcelid', csv_database)
df2017 = pd.read_sql_query('Select * From table_2017 T inner join vals_2017 V on T.parcelid = V.parcelid', csv_database)

In [4]:
df2016.head()

Unnamed: 0,index,parcelid,airconditioningtypeid,architecturalstyletypeid,basementsqft,bathroomcnt,bedroomcnt,buildingclasstypeid,buildingqualitytypeid,calculatedbathnbr,...,assessmentyear,landtaxvaluedollarcnt,taxamount,taxdelinquencyflag,taxdelinquencyyear,censustractandblock,index.1,parcelid.1,logerror,transactiondate
0,363,17073783,,,,2.5,3.0,,,2.5,...,2015,76724.0,2015.06,,,61110020000000.0,5558,17073783,0.0953,2016-01-27
1,429,17088994,,,,1.0,2.0,,,1.0,...,2015,95870.0,2581.3,,,61110020000000.0,20708,17088994,0.0198,2016-03-30
2,471,17100444,,,,2.0,3.0,,,2.0,...,2015,14234.0,591.64,,,61110010000000.0,39718,17100444,0.006,2016-05-27
3,481,17102429,,,,1.5,2.0,,,1.5,...,2015,17305.0,682.78,,,61110010000000.0,42867,17102429,-0.0566,2016-06-07
4,508,17109604,,,,2.5,4.0,,,2.5,...,2015,277000.0,5886.92,,,61110010000000.0,63904,17109604,0.0573,2016-08-08


In [5]:
df2017.head()

Unnamed: 0,index,parcelid,airconditioningtypeid,architecturalstyletypeid,basementsqft,bathroomcnt,bedroomcnt,buildingclasstypeid,buildingqualitytypeid,calculatedbathnbr,...,assessmentyear,landtaxvaluedollarcnt,taxamount,taxdelinquencyflag,taxdelinquencyyear,censustractandblock,index.1,parcelid.1,logerror,transactiondate
0,350,17054981,,,,5.0,4.0,,,5.0,...,2016.0,370922.0,9673.46,,,61110010000000.0,46825,17054981,-0.013099,2017-06-15
1,356,17055743,,,,2.0,3.0,,,2.0,...,2016.0,305312.0,5538.8,,,61110010000000.0,60944,17055743,0.073985,2017-07-26
2,384,17068109,,,,1.5,3.0,,,1.5,...,2016.0,93193.0,2987.36,,,61110010000000.0,62346,17068109,0.071886,2017-07-28
3,407,17073952,,,,2.0,2.0,,,2.0,...,2016.0,168531.0,2706.24,,,61110020000000.0,42888,17073952,0.30568,2017-06-02
4,424,17078502,,,,1.0,2.0,,,1.0,...,2016.0,444178.0,6220.7,,,61110020000000.0,54699,17078502,-0.073787,2017-07-07


In [6]:
df2016 = df2016.loc[:,~df2016.columns.duplicated()]
df2017 = df2017.loc[:,~df2016.columns.duplicated()]

In [7]:
df2016 = df2016.drop('index', 1)
df2017 = df2017.drop('index', 1)

In [8]:
print(df2016.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 90275 entries, 0 to 90274
Data columns (total 60 columns):
parcelid                        90275 non-null int64
airconditioningtypeid           28781 non-null float64
architecturalstyletypeid        261 non-null float64
basementsqft                    43 non-null float64
bathroomcnt                     90275 non-null float64
bedroomcnt                      90275 non-null float64
buildingclasstypeid             16 non-null float64
buildingqualitytypeid           57364 non-null float64
calculatedbathnbr               89093 non-null float64
decktypeid                      658 non-null float64
finishedfloor1squarefeet        6856 non-null float64
calculatedfinishedsquarefeet    89614 non-null float64
finishedsquarefeet12            85596 non-null float64
finishedsquarefeet13            33 non-null float64
finishedsquarefeet15            3564 non-null float64
finishedsquarefeet50            6856 non-null float64
finishedsquarefeet6          

In [9]:
print(df2017.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 77613 entries, 0 to 77612
Data columns (total 59 columns):
parcelid                        77613 non-null int64
airconditioningtypeid           25007 non-null float64
architecturalstyletypeid        207 non-null float64
basementsqft                    50 non-null float64
bathroomcnt                     77579 non-null float64
bedroomcnt                      77579 non-null float64
buildingclasstypeid             15 non-null float64
buildingqualitytypeid           49809 non-null float64
calculatedbathnbr               76963 non-null float64
decktypeid                      614 non-null float64
finishedfloor1squarefeet        6037 non-null float64
calculatedfinishedsquarefeet    77378 non-null float64
finishedsquarefeet12            73923 non-null float64
finishedsquarefeet13            42 non-null float64
finishedsquarefeet15            3027 non-null float64
finishedsquarefeet50            6037 non-null float64
finishedsquarefeet6          

In [10]:
df2016.to_csv('2016.csv')
df2017.to_csv('2017.csv')