# Regression Exercises I
Let's review the steps we take at the beginning of each new module.

1. Create a new repository named ```regression-exercises``` in your ```GitHub```; all of your Regression work will be housed here.
2. Clone this repository within your local ```codeup-data-science``` directory.
3. Create a ```.gitignore``` and make sure your list of 'files to ignore' includes your ```env.py``` file.
4. Create a ```README.md``` file that outlines the contents and purpose of your repository.
5. ```Add```, ```commit```, and ```push``` these two files.
6. Now you can add your ```env.py``` file to this repository to access the ```Codeup database server```.
7. For these exercises, you will create ```wrangle.ipynb``` and ```wrangle.py``` files to hold necessary functions.
8. As always, ```add```, ```commit```, and ```push``` your work often.

# Regression Exercises II
Let's set up an example scenario as perspective for our regression exercises using the ```Zillow dataset```.

As a Codeup data science graduate, you want to show off your skills to the Zillow data science team in hopes of getting an interview for a position you saw pop up on [LinkedIn](www.LinkedIn.com). You thought it might look impressive to build an end-to-end project in which you use some of their Kaggle data to predict property values using some of their available features; who knows, you might even do some feature engineering to blow them away. 

##### Your goal is to predict the values of single unit properties using the observations from 2017.

In these exercises, you will complete the first step toward the above goal: ```acquire``` and ```prepare``` the necessary ```Zillow data``` from the ```zillow database``` in the ```Codeup database server```.

In [1]:
import numpy as np
import pandas as pd

import QMCBT_wrangle as w
import QMCBT_explore_evaluate as ee
from env import user, password, host

### 1. ```Acquire``` 
* bedroomcnt 
* bathroomcnt 
* calculatedfinishedsquarefeet
* taxvaluedollarcnt 
* yearbuilt 
* taxamount  
* fips 


from the ```zillow database``` for all '```Single Family Residential```' properties.

In [2]:
connection = f'mysql+pymysql://{user}:{password}@{host}'
endpoint = '/zillow'
connection = connection+endpoint
# select the requested columns for Single Family Residential (type id 261)
query = '''
SELECT propertylandusetypeid, propertylandusedesc, bedroomcnt, bathroomcnt,
       calculatedfinishedsquarefeet, taxvaluedollarcnt, yearbuilt, taxamount, fips
FROM properties_2017
LEFT JOIN propertylandusetype USING (propertylandusetypeid)
WHERE propertylandusetypeid = 261
'''

In [3]:
df = pd.read_sql(query, connection)
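Pulling more than two million rows from the server on every run is slow, and a later cell mentions deleting a CSV file to fully test the wrangle functions, which suggests the data is cached locally. The following is a minimal sketch of that caching idea, not the actual code in `QMCBT_wrangle`; the helper name `cached_frame`, the file name `zillow.csv`, and the stand-in loader are all assumptions made for illustration.

```python
import os
import tempfile

import pandas as pd


def cached_frame(path, loader):
    """Return the CSV at `path` if it exists; otherwise call `loader()` and cache the result."""
    if os.path.isfile(path):
        return pd.read_csv(path)
    df = loader()
    df.to_csv(path, index=False)
    return df


# Stand-in for the real database call (hypothetical data, no server needed)
calls = []


def fake_loader():
    calls.append(1)
    return pd.DataFrame({'bedroomcnt': [4.0, 3.0], 'fips': [6037.0, 6059.0]})


with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, 'zillow.csv')
    first = cached_frame(csv_path, fake_loader)   # runs the loader and writes the CSV
    second = cached_frame(csv_path, fake_loader)  # reads the CSV; the loader is not called again
```

In this notebook the loader would be something like `lambda: pd.read_sql(query, connection)`.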

### 2. Using your acquired Zillow data, walk through the summarization and cleaning steps in your ```wrangle.ipynb``` file like we did above. You may handle the missing values however you feel is appropriate and meaningful; remember to document your process and decisions using markdown and code commenting where helpful.

In [4]:
df.T

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,2152853,2152854,2152855,2152856,2152857,2152858,2152859,2152860,2152861,2152862
propertylandusetypeid,261.0,261.0,261.0,261.0,261.0,261.0,261.0,261.0,261.0,261.0,...,261.0,261.0,261.0,261.0,261.0,261.0,261.0,261.0,261.0,261.0
propertylandusedesc,Single Family Residential,Single Family Residential,Single Family Residential,Single Family Residential,Single Family Residential,Single Family Residential,Single Family Residential,Single Family Residential,Single Family Residential,Single Family Residential,...,Single Family Residential,Single Family Residential,Single Family Residential,Single Family Residential,Single Family Residential,Single Family Residential,Single Family Residential,Single Family Residential,Single Family Residential,Single Family Residential
bedroomcnt,0.0,0.0,0.0,0.0,4.0,0.0,3.0,3.0,0.0,0.0,...,4.0,0.0,3.0,4.0,0.0,4.0,4.0,0.0,3.0,4.0
bathroomcnt,0.0,0.0,0.0,0.0,2.0,0.0,4.0,2.0,0.0,0.0,...,2.0,0.0,2.5,4.0,0.0,3.0,4.5,0.0,2.5,4.0
calculatedfinishedsquarefeet,,,,,3633.0,,1620.0,2077.0,,,...,1987.0,,1809.0,4375.0,,2262.0,3127.0,,1974.0,2110.0
taxvaluedollarcnt,27516.0,10.0,10.0,2108.0,296425.0,124.0,847770.0,646760.0,6730242.0,15532.0,...,259913.0,1198476.0,405547.0,422400.0,1087111.0,960756.0,536061.0,208057.0,424353.0,554009.0
yearbuilt,,,,,2005.0,,2011.0,1926.0,,,...,1955.0,,2012.0,2015.0,,2015.0,2014.0,,2015.0,2014.0
taxamount,,,,174.21,6941.39,,10244.94,7924.68,80348.13,248.89,...,3175.66,,4181.1,13877.56,19313.08,13494.52,6244.16,5783.88,5302.7,6761.2
fips,6037.0,6037.0,6037.0,6037.0,6037.0,6037.0,6037.0,6037.0,6037.0,6037.0,...,6059.0,6037.0,6059.0,6037.0,6059.0,6059.0,6059.0,6059.0,6059.0,6037.0


In [5]:
df.shape

(2152863, 9)

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2152863 entries, 0 to 2152862
Data columns (total 9 columns):
 #   Column                        Dtype  
---  ------                        -----  
 0   propertylandusetypeid         float64
 1   propertylandusedesc           object 
 2   bedroomcnt                    float64
 3   bathroomcnt                   float64
 4   calculatedfinishedsquarefeet  float64
 5   taxvaluedollarcnt             float64
 6   yearbuilt                     float64
 7   taxamount                     float64
 8   fips                          float64
dtypes: float64(8), object(1)
memory usage: 147.8+ MB


In [7]:
df.describe().T

| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| propertylandusetypeid | 2152863.0 | 261.0 | 0.0 | 261.0 | 261.0 | 261.0 | 261.0 | 261.0 |
| bedroomcnt | 2152852.0 | 3.287196 | 0.954754 | 0.0 | 3.0 | 3.0 | 4.0 | 25.0 |
| bathroomcnt | 2152852.0 | 2.230688 | 0.99928 | 0.0 | 2.0 | 2.0 | 3.0 | 32.0 |
| calculatedfinishedsquarefeet | 2144379.0 | 1862.855178 | 1222.125124 | 1.0 | 1257.0 | 1623.0 | 2208.0 | 952576.0 |
| taxvaluedollarcnt | 2152370.0 | 461896.237963 | 699676.0496 | 1.0 | 188170.25 | 327671.0 | 534527.0 | 98428909.0 |
| yearbuilt | 2143526.0 | 1960.949681 | 22.162196 | 1801.0 | 1949.0 | 1958.0 | 1976.0 | 2016.0 |
| taxamount | 2148421.0 | 5634.865978 | 8178.910249 | 1.85 | 2534.98 | 4108.95 | 6414.32 | 1337755.86 |
| fips | 2152863.0 | 6048.377335 | 20.433292 | 6037.0 | 6037.0 | 6037.0 | 6059.0 | 6111.0 |


### Check isblank

In [8]:
# Return (row count)
row_count = df.shape[0]
row_count

2152863

In [9]:
# creates list of columns
column_list = df.columns
column_list

Index(['propertylandusetypeid', 'propertylandusedesc', 'bedroomcnt',
       'bathroomcnt', 'calculatedfinishedsquarefeet', 'taxvaluedollarcnt',
       'yearbuilt', 'taxamount', 'fips'],
      dtype='object')

In [10]:
# returns the count of rows that have a value in every column
# DataFrame.value_counts() drops any row containing a NaN (not whitespace), so incomplete rows are skipped
row_value_count = df[column_list].value_counts().sum()
row_value_count

2140235

In [11]:
# subtract the complete-row count from the row count
# to get the number of rows with at least one NaN
incomplete_row_count = row_count - row_value_count
incomplete_row_count

12628
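The difference above counts rows with at least one NaN, because `DataFrame.value_counts()` drops rows containing missing values by default; blank or whitespace-only strings are still counted as values. A small sketch on made-up data (the `demo` frame and its values are hypothetical) shows the distinction and a more direct way to get each number:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: rows 1 and 2 contain a NaN, rows 1 and 3 contain a blank string
demo = pd.DataFrame({
    'desc': ['house', '   ', 'house', ''],
    'bedroomcnt': [3.0, np.nan, 4.0, 2.0],
    'yearbuilt': [1950.0, 1960.0, np.nan, 2001.0],
})

# value_counts() skips rows with any NaN, so this is the number of complete rows
complete_rows = int(demo.value_counts().sum())

# Direct count of rows that contain at least one NaN
rows_with_nan = int(demo.isna().any(axis=1).sum())

# Blank or whitespace-only strings have to be looked for explicitly
blank_cells = int(demo['desc'].str.fullmatch(r'\s*', na=False).sum())
```

Row 3 holds an empty string but no NaN, so it still counts as a complete row.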

### Check isnull
* Create Function to automate this to one level of NaN check or figure out how to loop it

In [12]:
df.isnull().sum()

propertylandusetypeid              0
propertylandusedesc                0
bedroomcnt                        11
bathroomcnt                       11
calculatedfinishedsquarefeet    8484
taxvaluedollarcnt                493
yearbuilt                       9337
taxamount                       4442
fips                               0
dtype: int64
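One way to automate the NaN check noted above is a small helper that reports the count and percentage of missing values per column. This is only a sketch; the function name `nan_summary` and the demo data are assumptions, not part of the original wrangle module.

```python
import numpy as np
import pandas as pd


def nan_summary(df):
    """Return NaN count and NaN percentage per column, largest count first."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        'nan_count': counts,
        'nan_pct': (counts / len(df) * 100).round(2),
    })
    return summary.sort_values('nan_count', ascending=False)


# Hypothetical data for illustration
demo = pd.DataFrame({
    'bedroomcnt': [3.0, np.nan, 4.0, 2.0],
    'yearbuilt': [np.nan, np.nan, 1950.0, np.nan],
    'fips': [6037.0, 6037.0, 6059.0, 6111.0],
})
summary = nan_summary(demo)
```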

In [13]:
# total number of NaN cells across all columns
# worst case: every NaN sits in its own row, so dividing the NaN count
# by the row count gives an upper bound on the share of rows we might delete
NaN_count = int(df.isnull().sum().sum())
NaN_count, NaN_count/row_count

(22778, 0.01058032954256727)

In [14]:
# let's take the column with the largest NaN count and keep only its NaN rows
isnull_df = df[df.yearbuilt.isnull()]

In [15]:
# check our work
isnull_df.T

Unnamed: 0,0,1,2,3,5,8,9,10,12,13,...,2152540,2152644,2152672,2152767,2152771,2152823,2152844,2152854,2152857,2152860
propertylandusetypeid,261.0,261.0,261.0,261.0,261.0,261.0,261.0,261.0,261.0,261.0,...,261.0,261.0,261.0,261.0,261.0,261.0,261.0,261.0,261.0,261.0
propertylandusedesc,Single Family Residential,Single Family Residential,Single Family Residential,Single Family Residential,Single Family Residential,Single Family Residential,Single Family Residential,Single Family Residential,Single Family Residential,Single Family Residential,...,Single Family Residential,Single Family Residential,Single Family Residential,Single Family Residential,Single Family Residential,Single Family Residential,Single Family Residential,Single Family Residential,Single Family Residential,Single Family Residential
bedroomcnt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
bathroomcnt,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
calculatedfinishedsquarefeet,,,,,,,,,,,...,,,,,,,,,,
taxvaluedollarcnt,27516.0,10.0,10.0,2108.0,124.0,6730242.0,15532.0,11009.0,2171.0,378.0,...,177192.0,5126781.0,98234.0,49892.0,540464.0,2568893.0,92679.0,1198476.0,1087111.0,208057.0
yearbuilt,,,,,,,,,,,...,,,,,,,,,,
taxamount,,,,174.21,,80348.13,248.89,,,,...,1917.4,80251.5,1597.27,566.92,5709.26,27309.3,1090.16,,19313.08,5783.88
fips,6037.0,6037.0,6037.0,6037.0,6037.0,6037.0,6037.0,6037.0,6037.0,6037.0,...,6059.0,6059.0,6037.0,6111.0,6111.0,6059.0,6111.0,6037.0,6059.0,6059.0


In [16]:
# now let's sum NaNs of all columns within just that filter
isnull_df.isnull().sum()

propertylandusetypeid              0
propertylandusedesc                0
bedroomcnt                         4
bathroomcnt                        4
calculatedfinishedsquarefeet    7877
taxvaluedollarcnt                375
yearbuilt                       9337
taxamount                       1734
fips                               0
dtype: int64

In [17]:
isnull_df.isnull().sum().max()

9337

In [18]:
# Returns (isnull row count)
# This will show us how many records would be deleted
isnull_row_count = isnull_df.shape[0]
isnull_row_count

9337

In [19]:
# now we can easily compare our first NaN_count against our filtered_NaN_count
# this shows us that only 3,447 NaNs sit in rows where yearbuilt is not null
filtered_NaN_count = int(isnull_df.isnull().sum().sum())
filtered_NaN_count, NaN_count - filtered_NaN_count

(19331, 3447)

In [20]:
# 9337 + (22778 - 19331)
# This equals the potential row count of NaNs to be deleted
NaN_delete_row_count = isnull_row_count + (NaN_count - filtered_NaN_count)
NaN_delete_row_count

12784

In [21]:
# worst case: each of the additional NaNs sits in its own row
# so this ratio is an upper bound on the share of rows
# that dropping every NaN would remove from the full DataFrame
NaN_delete_row_count / row_count 

0.005938139119860391

<div class="alert alert-info">
    
As you can see, deleting <b>at most*</b> 12,784 records (rows) results in less than 1% data loss (about 0.59%).  
    
<b>at most*</b> - because we assume each of the 3,447 additional NaNs sits in its own row, when in reality many probably share rows.
    
### Therefore, it is reasonably safe to remove all NaN records from the DataFrame.</div>
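The estimate above can be computed without typing the counts by hand, and compared against the exact number of rows that `dropna()` would remove. A minimal sketch, assuming a hypothetical helper named `nan_row_bounds` and made-up demo data:

```python
import numpy as np
import pandas as pd


def nan_row_bounds(df):
    """Return (upper_bound, exact) counts of rows containing at least one NaN."""
    nan_mask = df.isna()
    worst_col = nan_mask.sum().idxmax()            # column with the most NaNs
    in_worst = nan_mask[worst_col]                 # rows where that column is NaN
    outside = int(nan_mask[~in_worst].sum().sum()) # NaN cells in all other rows
    upper_bound = int(in_worst.sum()) + outside    # assumes one NaN per remaining row
    exact = int(nan_mask.any(axis=1).sum())        # rows dropna() would actually remove
    return upper_bound, exact


# Hypothetical data: row 2 holds two NaNs, so the bound overshoots by one
demo = pd.DataFrame({
    'a': [np.nan, np.nan, 1.0, 1.0, 1.0],
    'b': [np.nan, 1.0, np.nan, 1.0, 1.0],
    'c': [1.0, 1.0, np.nan, 1.0, 1.0],
})
upper_bound, exact = nan_row_bounds(demo)
```

The exact count is never larger than the bound, which is why the percentage above is a safe worst case.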

In [22]:
# Clean whitespace using R trick shared by Codeup Instructor Madeleine Capper

# teensy tiny regex mini lesson for very smols
# ^ : "starts with"
# \s : "any type of whitespace"
# * : " zero or more times"
# $ : "ends with"

# '^\s*$' : a string that is empty or holds nothing but
          # whitespace from its start to its end

# let's change the whitespace into a null value,
# since that's effectively what it is
# Note: regex is not the only option/way to do this
# but it's an awfully convenient one that I want y'all to see

# calling replace on the whole df without specifying a column will 
    # turn every whitespace-only cell in ANY column into NaN and update the df
# np.nan is used because the np.NaN alias was removed in NumPy 2.0
df = df.replace(r'^\s*$', np.nan, regex=True)
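To see which values the pattern actually catches, it helps to run the same replacement on a handful of strings. This sketch uses a made-up `samples` Series; only empty or whitespace-only strings become NaN, while strings that merely contain spaces are left alone.

```python
import numpy as np
import pandas as pd

# Hypothetical sample values: three blank-like strings and three real values
samples = pd.Series(['', '   ', '\t', 'a b', ' x ', '0'])

# Same pattern as above: the whole cell must be empty or whitespace
cleaned = samples.replace(r'^\s*$', np.nan, regex=True)

# True where the cell was converted to NaN
became_nan = cleaned.isna().tolist()
```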

In [23]:
# Check the outcome
# the isnull sums are identical to the earlier check,
    # so the replace found no whitespace-only cells to convert
df.isnull().sum()

propertylandusetypeid              0
propertylandusedesc                0
bedroomcnt                        11
bathroomcnt                       11
calculatedfinishedsquarefeet    8484
taxvaluedollarcnt                493
yearbuilt                       9337
taxamount                       4442
fips                               0
dtype: int64

In [24]:
# Now, let's remove all of the NaN's 
    # since we know they are insignificant in size and percentage
df = df.dropna()

In [25]:
# Check our work
df.isnull().sum()

propertylandusetypeid           0
propertylandusedesc             0
bedroomcnt                      0
bathroomcnt                     0
calculatedfinishedsquarefeet    0
taxvaluedollarcnt               0
yearbuilt                       0
taxamount                       0
fips                            0
dtype: int64

### Clean dtypes

In [26]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2140235 entries, 4 to 2152862
Data columns (total 9 columns):
 #   Column                        Dtype  
---  ------                        -----  
 0   propertylandusetypeid         float64
 1   propertylandusedesc           object 
 2   bedroomcnt                    float64
 3   bathroomcnt                   float64
 4   calculatedfinishedsquarefeet  float64
 5   taxvaluedollarcnt             float64
 6   yearbuilt                     float64
 7   taxamount                     float64
 8   fips                          float64
dtypes: float64(8), object(1)
memory usage: 163.3+ MB


In [27]:
# Drop Columns propertylandusetypeid and propertylandusedesc
# The type id and description were only needed to filter and verify
df = df.drop(columns=['propertylandusetypeid', 'propertylandusedesc'])

In [28]:
# Check work
df.T

Unnamed: 0,4,6,7,11,14,15,18,19,20,21,...,2152850,2152851,2152852,2152853,2152855,2152856,2152858,2152859,2152861,2152862
bedroomcnt,4.0,3.0,3.0,0.0,0.0,0.0,3.0,3.0,3.0,4.0,...,3.0,3.0,3.0,4.0,3.0,4.0,4.0,4.0,3.0,4.0
bathroomcnt,2.0,4.0,2.0,0.0,0.0,0.0,1.0,2.0,2.0,4.0,...,2.5,3.5,2.0,2.0,2.5,4.0,3.0,4.5,2.5,4.0
calculatedfinishedsquarefeet,3633.0,1620.0,2077.0,1200.0,171.0,203.0,1244.0,1300.0,1222.0,4144.0,...,2033.0,1980.0,1917.0,1987.0,1809.0,4375.0,2262.0,3127.0,1974.0,2110.0
taxvaluedollarcnt,296425.0,847770.0,646760.0,5328.0,6920.0,14166.0,169471.0,233266.0,290492.0,1303522.0,...,641757.0,773358.0,408680.0,259913.0,405547.0,422400.0,960756.0,536061.0,424353.0,554009.0
yearbuilt,2005.0,2011.0,1926.0,1972.0,1973.0,1960.0,1950.0,1950.0,1951.0,2016.0,...,2015.0,2014.0,1946.0,1955.0,2012.0,2015.0,2015.0,2014.0,2015.0,2014.0
taxamount,6941.39,10244.94,7924.68,91.6,255.17,163.79,2532.88,3110.99,3870.25,14820.1,...,10009.46,8347.9,4341.32,3175.66,4181.1,13877.56,13494.52,6244.16,5302.7,6761.2
fips,6037.0,6037.0,6037.0,6037.0,6037.0,6037.0,6037.0,6037.0,6037.0,6037.0,...,6059.0,6059.0,6111.0,6059.0,6059.0,6037.0,6059.0,6059.0,6059.0,6037.0


In [29]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2140235 entries, 4 to 2152862
Data columns (total 7 columns):
 #   Column                        Dtype  
---  ------                        -----  
 0   bedroomcnt                    float64
 1   bathroomcnt                   float64
 2   calculatedfinishedsquarefeet  float64
 3   taxvaluedollarcnt             float64
 4   yearbuilt                     float64
 5   taxamount                     float64
 6   fips                          float64
dtypes: float64(7)
memory usage: 195.1 MB


In [30]:
# Function will show best dtype based on values (ignore objects)
# be sure to double check that the computer got this right before converting
df.convert_dtypes(infer_objects=False).dtypes

bedroomcnt                        Int64
bathroomcnt                     Float64
calculatedfinishedsquarefeet      Int64
taxvaluedollarcnt                 Int64
yearbuilt                         Int64
taxamount                       Float64
fips                              Int64
dtype: object

In [31]:
# df itself is unchanged: convert_dtypes returned a new DataFrame that was not assigned
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2140235 entries, 4 to 2152862
Data columns (total 7 columns):
 #   Column                        Dtype  
---  ------                        -----  
 0   bedroomcnt                    float64
 1   bathroomcnt                   float64
 2   calculatedfinishedsquarefeet  float64
 3   taxvaluedollarcnt             float64
 4   yearbuilt                     float64
 5   taxamount                     float64
 6   fips                          float64
dtypes: float64(7)
memory usage: 195.1 MB


In [32]:
# make actual conversion
df = df.convert_dtypes(infer_objects=False)
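The behavior of `convert_dtypes` is easier to see on a tiny frame: it returns a new DataFrame, turns float columns that hold only whole numbers into the nullable `Int64` dtype, and turns the remaining float columns into nullable `Float64`. The `demo` data below is made up for illustration.

```python
import pandas as pd

# Hypothetical data: bedroomcnt holds whole numbers, bathroomcnt has a half bath
demo = pd.DataFrame({
    'bedroomcnt': [4.0, 3.0, 0.0],
    'bathroomcnt': [2.0, 2.5, 1.0],
})

# convert_dtypes returns a new DataFrame; demo keeps its float64 columns
converted = demo.convert_dtypes(infer_objects=False)

before = demo.dtypes.astype(str).to_dict()
after = converted.dtypes.astype(str).to_dict()
```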

In [33]:
# Check work
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2140235 entries, 4 to 2152862
Data columns (total 7 columns):
 #   Column                        Dtype  
---  ------                        -----  
 0   bedroomcnt                    Int64  
 1   bathroomcnt                   Float64
 2   calculatedfinishedsquarefeet  Int64  
 3   taxvaluedollarcnt             Int64  
 4   yearbuilt                     Int64  
 5   taxamount                     Float64
 6   fips                          Int64  
dtypes: Float64(2), Int64(5)
memory usage: 209.4 MB


In [34]:
# Check work
df.T

Unnamed: 0,4,6,7,11,14,15,18,19,20,21,...,2152850,2152851,2152852,2152853,2152855,2152856,2152858,2152859,2152861,2152862
bedroomcnt,4.0,3.0,3.0,0.0,0.0,0.0,3.0,3.0,3.0,4.0,...,3.0,3.0,3.0,4.0,3.0,4.0,4.0,4.0,3.0,4.0
bathroomcnt,2.0,4.0,2.0,0.0,0.0,0.0,1.0,2.0,2.0,4.0,...,2.5,3.5,2.0,2.0,2.5,4.0,3.0,4.5,2.5,4.0
calculatedfinishedsquarefeet,3633.0,1620.0,2077.0,1200.0,171.0,203.0,1244.0,1300.0,1222.0,4144.0,...,2033.0,1980.0,1917.0,1987.0,1809.0,4375.0,2262.0,3127.0,1974.0,2110.0
taxvaluedollarcnt,296425.0,847770.0,646760.0,5328.0,6920.0,14166.0,169471.0,233266.0,290492.0,1303522.0,...,641757.0,773358.0,408680.0,259913.0,405547.0,422400.0,960756.0,536061.0,424353.0,554009.0
yearbuilt,2005.0,2011.0,1926.0,1972.0,1973.0,1960.0,1950.0,1950.0,1951.0,2016.0,...,2015.0,2014.0,1946.0,1955.0,2012.0,2015.0,2015.0,2014.0,2015.0,2014.0
taxamount,6941.39,10244.94,7924.68,91.6,255.17,163.79,2532.88,3110.99,3870.25,14820.1,...,10009.46,8347.9,4341.32,3175.66,4181.1,13877.56,13494.52,6244.16,5302.7,6761.2
fips,6037.0,6037.0,6037.0,6037.0,6037.0,6037.0,6037.0,6037.0,6037.0,6037.0,...,6059.0,6059.0,6111.0,6059.0,6059.0,6037.0,6059.0,6059.0,6059.0,6037.0


### 3. Store all of the necessary functions to automate your process from acquiring the data to returning a cleaned dataframe with no missing values in your ```wrangle.py``` file. Name your final function ```wrangle_zillow```.

In [35]:
# reset df to zero to test wrangle function
# delete csv file to fully test all wrangle functions
df = 0

In [36]:
# use custom wrangle function to automate Acquire and Prepare
df = w.wrangle_zillow()
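The `QMCBT_wrangle` module is not shown in this notebook, so the following is only a sketch of how the preparation steps above could be collected into one function. The name `clean_zillow` and the demo data are assumptions; a full `wrangle_zillow` would first acquire the raw data and then pass it through a function like this.

```python
import numpy as np
import pandas as pd


def clean_zillow(raw):
    """Apply the notebook's cleaning steps to a raw zillow DataFrame."""
    # Treat empty or whitespace-only strings as missing values
    df = raw.replace(r'^\s*$', np.nan, regex=True)
    # Drop every row that has at least one missing value
    df = df.dropna()
    # These two columns were only needed to filter and verify the query
    df = df.drop(columns=['propertylandusetypeid', 'propertylandusedesc'])
    # Switch to nullable dtypes (whole-number floats become Int64)
    return df.convert_dtypes(infer_objects=False)


# Hypothetical raw data shaped like the query result; row 1 has missing values
raw = pd.DataFrame({
    'propertylandusetypeid': [261.0, 261.0, 261.0],
    'propertylandusedesc': ['Single Family Residential'] * 3,
    'bedroomcnt': [4.0, np.nan, 3.0],
    'bathroomcnt': [2.0, 0.0, 2.5],
    'yearbuilt': [2005.0, np.nan, 1926.0],
})
clean = clean_zillow(raw)
```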

In [37]:
# Check our work
df.T

Unnamed: 0,4,6,7,11,14,15,18,19,20,21,...,2152850,2152851,2152852,2152853,2152855,2152856,2152858,2152859,2152861,2152862
bedroomcnt,4.0,3.0,3.0,0.0,0.0,0.0,3.0,3.0,3.0,4.0,...,3.0,3.0,3.0,4.0,3.0,4.0,4.0,4.0,3.0,4.0
bathroomcnt,2.0,4.0,2.0,0.0,0.0,0.0,1.0,2.0,2.0,4.0,...,2.5,3.5,2.0,2.0,2.5,4.0,3.0,4.5,2.5,4.0
calculatedfinishedsquarefeet,3633.0,1620.0,2077.0,1200.0,171.0,203.0,1244.0,1300.0,1222.0,4144.0,...,2033.0,1980.0,1917.0,1987.0,1809.0,4375.0,2262.0,3127.0,1974.0,2110.0
taxvaluedollarcnt,296425.0,847770.0,646760.0,5328.0,6920.0,14166.0,169471.0,233266.0,290492.0,1303522.0,...,641757.0,773358.0,408680.0,259913.0,405547.0,422400.0,960756.0,536061.0,424353.0,554009.0
yearbuilt,2005.0,2011.0,1926.0,1972.0,1973.0,1960.0,1950.0,1950.0,1951.0,2016.0,...,2015.0,2014.0,1946.0,1955.0,2012.0,2015.0,2015.0,2014.0,2015.0,2014.0
taxamount,6941.39,10244.94,7924.68,91.6,255.17,163.79,2532.88,3110.99,3870.25,14820.1,...,10009.46,8347.9,4341.32,3175.66,4181.1,13877.56,13494.52,6244.16,5302.7,6761.2
fips,6037.0,6037.0,6037.0,6037.0,6037.0,6037.0,6037.0,6037.0,6037.0,6037.0,...,6059.0,6059.0,6111.0,6059.0,6059.0,6037.0,6059.0,6059.0,6059.0,6037.0


In [38]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 2140235 entries, 4 to 2152862
Data columns (total 7 columns):
 #   Column                        Dtype  
---  ------                        -----  
 0   bedroomcnt                    Int64  
 1   bathroomcnt                   Float64
 2   calculatedfinishedsquarefeet  Int64  
 3   taxvaluedollarcnt             Int64  
 4   yearbuilt                     Int64  
 5   taxamount                     Float64
 6   fips                          Int64  
dtypes: Float64(2), Int64(5)
memory usage: 209.4 MB


### Personal Exploration

In [39]:
# Custom function that shows the value counts for each column
ee.nunique_column_all(df)

3     962944
4     633608
2     334221
5     150671
6      25117
1      22895
7       4792
0       4397
8       1103
9        290
10       118
11        34
13        15
12        12
14         7
15         5
18         3
16         2
25         1
Name: bedroomcnt, dtype: Int64

2.0     942463
3.0     422398
1.0     412582
2.5     142827
4.0      82039
1.5      31157
3.5      28464
5.0      28306
4.5      19474
6.0      10717
5.5       6201
7.0       4381
0.0       4274
8.0       1681
6.5       1330
9.0        707
7.5        382
10.0       322
11.0       145
8.5        108
12.0        73
9.5         50
13.0        39
14.0        25
15.0        17
0.5         16
10.5        14
16.0        12
18.0         8
20.0         6
17.0         4
1.75         3
12.5         3
11.5         3
19.5         1
14.5         1
32.0         1
19.0         1
Name: bathroomcnt, dtype: Int64

1200     5184
1080     4376
1120     4354
1400     3828
1440     3684
         ... 
12208       1
11580       1
13730 
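The custom `ee.nunique_column_all` function is not shown here. Judging only from its output, it reports the value counts of every column; a minimal sketch of that idea, with the hypothetical name `value_counts_by_column` and made-up data, could look like this:

```python
import pandas as pd


def value_counts_by_column(df):
    """Return a dict mapping each column name to its value counts."""
    return {col: df[col].value_counts() for col in df.columns}


# Hypothetical data for illustration
demo = pd.DataFrame({
    'bedroomcnt': [3, 3, 4, 2, 3],
    'bathroomcnt': [2.0, 2.0, 2.5, 1.0, 1.0],
})
counts = value_counts_by_column(demo)

# Print one block per column, most frequent value first
for col, vc in counts.items():
    print(vc, end='\n\n')
```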

In [40]:
# Build filtered DataFrames to explore out of curiosity
bathroomcnt_df = df[df['bathroomcnt'] > 18]
yearbuilt_df = df[df['yearbuilt'] < 1900]

In [41]:
bathroomcnt_df.T

| | 26485 | 32114 | 701366 | 1114403 | 1174755 | 1618393 | 1657947 | 2051558 | 2135273 |
|---|---|---|---|---|---|---|---|---|---|
| bedroomcnt | 25.0 | 10.0 | 0.0 | 3.0 | 3.0 | 7.0 | 10.0 | 14.0 | 10.0 |
| bathroomcnt | 20.0 | 19.5 | 20.0 | 20.0 | 20.0 | 20.0 | 32.0 | 20.0 | 19.0 |
| calculatedfinishedsquarefeet | 11700.0 | 26345.0 | 1650.0 | 80.0 | 66.0 | 28725.0 | 39170.0 | 16198.0 | 31415.0 |
| taxvaluedollarcnt | 1608491.0 | 11689668.0 | 152598.0 | 95692.0 | 237056.0 | 83196095.0 | 31038350.0 | 9359259.0 | 6401936.0 |
| yearbuilt | 2010.0 | 1981.0 | 1947.0 | 2006.0 | 1960.0 | 1938.0 | 2009.0 | 1952.0 | 1991.0 |
| taxamount | 19238.87 | 369.08 | 1892.01 | 1250.16 | 2831.76 | 994030.96 | 372142.72 | 14900.91 | 73571.72 |
| fips | 6037.0 | 6059.0 | 6037.0 | 6037.0 | 6037.0 | 6037.0 | 6037.0 | 6037.0 | 6037.0 |


In [42]:
yearbuilt_df.T

Unnamed: 0,207,3549,3551,4085,4357,7803,10616,14253,14286,14341,...,2138339,2138340,2142306,2145584,2146107,2146220,2147922,2148015,2148840,2150772
bedroomcnt,3.0,2.0,3.0,3.0,2.0,3.0,0.0,3.0,2.0,3.0,...,2.0,4.0,3.0,2.0,2.0,3.0,2.0,2.0,3.0,2.0
bathroomcnt,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,2.0,1.0,1.0,1.0
calculatedfinishedsquarefeet,1319.0,840.0,1674.0,1190.0,1012.0,1223.0,650.0,1789.0,1172.0,1298.0,...,1028.0,940.0,1090.0,926.0,614.0,1290.0,1276.0,1270.0,1469.0,801.0
taxvaluedollarcnt,114727.0,393000.0,418000.0,283587.0,183632.0,369748.0,182000.0,60891.0,253923.0,133558.0,...,319803.0,57550.0,40272.0,207196.0,155977.0,431694.0,85584.0,284667.0,147329.0,30277.0
yearbuilt,1890.0,1885.0,1887.0,1898.0,1895.0,1896.0,1899.0,1896.0,1898.0,1893.0,...,1895.0,1890.0,1890.0,1887.0,1894.0,1890.0,1894.0,1895.0,1885.0,1898.0
taxamount,1270.8,4850.66,5197.58,3549.97,2395.48,4405.32,2111.66,935.64,3165.13,1791.76,...,3954.89,803.06,459.78,2565.23,2004.42,5342.14,1459.52,3486.72,1974.96,597.24
fips,6111.0,6037.0,6037.0,6037.0,6037.0,6059.0,6037.0,6037.0,6037.0,6037.0,...,6037.0,6037.0,6111.0,6037.0,6037.0,6037.0,6037.0,6037.0,6037.0,6059.0


In [43]:
# look at oldest property just out of curiosity
yearbuilt_df.min()

bedroomcnt                         0.00
bathroomcnt                        0.00
calculatedfinishedsquarefeet     311.00
taxvaluedollarcnt               3731.00
yearbuilt                       1801.00
taxamount                         61.68
fips                            6037.00
dtype: float64

In [44]:
# Is this even possible? Are these errors?
# Do we need to drop rows < 300? 600? 
df.loc[df['calculatedfinishedsquarefeet'] == 1].T

| | 58438 | 1046787 | 1276353 | 1359288 | 1895866 | 2017745 |
|---|---|---|---|---|---|---|
| bedroomcnt | 2.0 | 0.0 | 1.0 | 2.0 | 5.0 | 3.0 |
| bathroomcnt | 1.0 | 0.0 | 3.0 | 1.0 | 5.0 | 1.0 |
| calculatedfinishedsquarefeet | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| taxvaluedollarcnt | 121376.0 | 28091.0 | 124906.0 | 147577.0 | 563977.0 | 31800.0 |
| yearbuilt | 1907.0 | 1963.0 | 1953.0 | 1991.0 | 1997.0 | 1900.0 |
| taxamount | 1996.35 | 439.55 | 2020.66 | 1855.4 | 6808.84 | 870.36 |
| fips | 6037.0 | 6037.0 | 6037.0 | 6037.0 | 6037.0 | 6037.0 |
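Before deciding on a cutoff such as 300 or 600 square feet, it can help to count how many rows each candidate threshold would remove. This is a sketch on a small made-up sample; the helper name `rows_below` is an assumption, and the thresholds are simply the ones raised in the comment above.

```python
import pandas as pd


def rows_below(df, column, thresholds):
    """Count the rows whose value in `column` is below each threshold."""
    return {t: int((df[column] < t).sum()) for t in thresholds}


# Hypothetical sample of square footage values
demo = pd.DataFrame({
    'calculatedfinishedsquarefeet': [1, 66, 171, 311, 650, 1200, 3633],
})
below = rows_below(demo, 'calculatedfinishedsquarefeet', [300, 600])
```

Running the same helper on the full DataFrame would show how much data each cutoff costs before any rows are dropped.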
