We'll use this notebook to prep the data in us_realtor_data.csv.

In [21]:
# import packages
import pandas as pd

In [22]:
# import data
us_df = pd.read_csv(
    "us_realtor_data.csv"
)

us_df

Unnamed: 0,status,price,bed,bath,acre_lot,full_address,street,city,state,zip_code,house_size,sold_date
0,for_sale,105000.0,3.0,2.0,0.12,"Sector Yahuecas Titulo # V84, Adjuntas, PR, 00601",Sector Yahuecas Titulo # V84,Adjuntas,Puerto Rico,601.0,920.0,
1,for_sale,80000.0,4.0,2.0,0.08,"Km 78 9 Carr # 135, Adjuntas, PR, 00601",Km 78 9 Carr # 135,Adjuntas,Puerto Rico,601.0,1527.0,
2,for_sale,67000.0,2.0,1.0,0.15,"556G 556-G 16 St, Juana Diaz, PR, 00795",556G 556-G 16 St,Juana Diaz,Puerto Rico,795.0,748.0,
3,for_sale,145000.0,4.0,2.0,0.10,"R5 Comunidad El Paraso Calle De Oro R-5 Ponce,...",R5 Comunidad El Paraso Calle De Oro R-5 Ponce,Ponce,Puerto Rico,731.0,1800.0,
4,for_sale,65000.0,6.0,2.0,0.05,"14 Navarro, Mayaguez, PR, 00680",14 Navarro,Mayaguez,Puerto Rico,680.0,,
...,...,...,...,...,...,...,...,...,...,...,...,...
923154,for_sale,445000.0,1.0,2.0,0.99,"1008 King St, Chappaqua, NY, 10514",1008 King St,Chappaqua,New York,10514.0,1052.0,5/9/11
923155,for_sale,418000.0,4.0,2.0,0.40,"3 Elmwood Dr, Monroe, NY, 10950",3 Elmwood Dr,Monroe,New York,10950.0,1650.0,7/21/15
923156,for_sale,469000.0,4.0,2.0,0.18,"13 N Conger Ave, Congers, NY, 10920",13 N Conger Ave,Congers,New York,10920.0,2123.0,
923157,for_sale,825000.0,5.0,5.0,0.79,"7 Miller Rd, Valley Cottage, NY, 10989",7 Miller Rd,Valley Cottage,New York,10989.0,3775.0,6/2/10


The first thing I want to handle is the commas in full_address. I had a lot of issues importing this data into MySQL Workbench, and while pandas handled the commas well, they still make me nervous about future exploration with other programs.

Technically we could just drop the full_address column, since the data is redundant, but I figure it doesn't hurt to keep it while our total column count is pretty small.

In [23]:
# remove ',' from full_address (regex=False treats the comma as a literal character)
us_df["full_address"] = us_df["full_address"].str.replace(",", "", regex=False)

us_df

Unnamed: 0,status,price,bed,bath,acre_lot,full_address,street,city,state,zip_code,house_size,sold_date
0,for_sale,105000.0,3.0,2.0,0.12,Sector Yahuecas Titulo # V84 Adjuntas PR 00601,Sector Yahuecas Titulo # V84,Adjuntas,Puerto Rico,601.0,920.0,
1,for_sale,80000.0,4.0,2.0,0.08,Km 78 9 Carr # 135 Adjuntas PR 00601,Km 78 9 Carr # 135,Adjuntas,Puerto Rico,601.0,1527.0,
2,for_sale,67000.0,2.0,1.0,0.15,556G 556-G 16 St Juana Diaz PR 00795,556G 556-G 16 St,Juana Diaz,Puerto Rico,795.0,748.0,
3,for_sale,145000.0,4.0,2.0,0.10,R5 Comunidad El Paraso Calle De Oro R-5 Ponce ...,R5 Comunidad El Paraso Calle De Oro R-5 Ponce,Ponce,Puerto Rico,731.0,1800.0,
4,for_sale,65000.0,6.0,2.0,0.05,14 Navarro Mayaguez PR 00680,14 Navarro,Mayaguez,Puerto Rico,680.0,,
...,...,...,...,...,...,...,...,...,...,...,...,...
923154,for_sale,445000.0,1.0,2.0,0.99,1008 King St Chappaqua NY 10514,1008 King St,Chappaqua,New York,10514.0,1052.0,5/9/11
923155,for_sale,418000.0,4.0,2.0,0.40,3 Elmwood Dr Monroe NY 10950,3 Elmwood Dr,Monroe,New York,10950.0,1650.0,7/21/15
923156,for_sale,469000.0,4.0,2.0,0.18,13 N Conger Ave Congers NY 10920,13 N Conger Ave,Congers,New York,10920.0,2123.0,
923157,for_sale,825000.0,5.0,5.0,0.79,7 Miller Rd Valley Cottage NY 10989,7 Miller Rd,Valley Cottage,New York,10989.0,3775.0,6/2/10
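A quick way to confirm the replacement did its job is to count the rows that still contain a comma. This is only a sketch: `sample` is a hypothetical stand-in for `us_df`, built from two addresses shown in the first output, and `remaining` is a name made up for the check.

```python
import pandas as pd

# hypothetical stand-in for us_df, using two addresses from the original output
sample = pd.DataFrame(
    {
        "full_address": [
            "14 Navarro, Mayaguez, PR, 00680",
            "3 Elmwood Dr, Monroe, NY, 10950",
        ]
    }
)

# strip the commas as a literal (non-regex) replacement
sample["full_address"] = sample["full_address"].str.replace(",", "", regex=False)

# count rows that still contain a comma
remaining = sample["full_address"].str.contains(",", regex=False).sum()
print(remaining)
```

On the real data, the same check would be run against `us_df["full_address"]`.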


Now that the address looks better, the remaining problems I'm seeing are just formatting issues with some of the numerical columns.

We can cast most of these columns as ints to remove the unnecessary trailing ".0"s. Note we'll use the nullable dtype 'Int64' (capital I) here, since a plain int column can't hold null values.

The zipcode also looks a bit odd. The zipcodes for Puerto Rico start with "00...", and those leading zeros were dropped when pandas read the column as a number.

Since zipcodes are really a categorical rather than numerical feature, we can cast them as strings and fill zeros to a length of 5 digits.

Finally, we'll just drop the sold_date column, because it's mostly null, and we don't really need it anyway.
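Before dropping a column for being mostly null, it can help to measure how much of it is actually missing. The sketch below does that on a small hypothetical `sample` frame; the values are invented for illustration and are not taken from the real dataset.

```python
import pandas as pd

# hypothetical stand-in for us_df; None marks a missing sold_date
sample = pd.DataFrame({"sold_date": [None, None, None, "5/9/11"]})

# fraction of rows where sold_date is missing
null_share = sample["sold_date"].isna().mean()
print(null_share)
```

On the real data, the equivalent call would be `us_df["sold_date"].isna().mean()`.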

In [24]:
# remove trailing .0s by casting as nullable ints
us_df["price"] = us_df["price"].astype("Int64")
us_df["bed"] = us_df["bed"].astype("Int64")
us_df["bath"] = us_df["bath"].astype("Int64")
us_df["house_size"] = us_df["house_size"].astype("Int64")

# format zipcodes properly; the nullable "string" dtype keeps any missing zip codes as <NA>
us_df["zip_code"] = us_df["zip_code"].astype("Int64").astype("string").str.zfill(5)

# drop mostly null sold_date column
us_df_drop = us_df.drop(columns=["sold_date"])

us_df_drop

Unnamed: 0,status,price,bed,bath,acre_lot,full_address,street,city,state,zip_code,house_size
0,for_sale,105000,3,2,0.12,Sector Yahuecas Titulo # V84 Adjuntas PR 00601,Sector Yahuecas Titulo # V84,Adjuntas,Puerto Rico,00601,920
1,for_sale,80000,4,2,0.08,Km 78 9 Carr # 135 Adjuntas PR 00601,Km 78 9 Carr # 135,Adjuntas,Puerto Rico,00601,1527
2,for_sale,67000,2,1,0.15,556G 556-G 16 St Juana Diaz PR 00795,556G 556-G 16 St,Juana Diaz,Puerto Rico,00795,748
3,for_sale,145000,4,2,0.10,R5 Comunidad El Paraso Calle De Oro R-5 Ponce ...,R5 Comunidad El Paraso Calle De Oro R-5 Ponce,Ponce,Puerto Rico,00731,1800
4,for_sale,65000,6,2,0.05,14 Navarro Mayaguez PR 00680,14 Navarro,Mayaguez,Puerto Rico,00680,
...,...,...,...,...,...,...,...,...,...,...,...
923154,for_sale,445000,1,2,0.99,1008 King St Chappaqua NY 10514,1008 King St,Chappaqua,New York,10514,1052
923155,for_sale,418000,4,2,0.40,3 Elmwood Dr Monroe NY 10950,3 Elmwood Dr,Monroe,New York,10950,1650
923156,for_sale,469000,4,2,0.18,13 N Conger Ave Congers NY 10920,13 N Conger Ave,Congers,New York,10920,2123
923157,for_sale,825000,5,5,0.79,7 Miller Rd Valley Cottage NY 10989,7 Miller Rd,Valley Cottage,New York,10989,3775
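To see how the zip code formatting behaves when a value is missing, here is a minimal sketch on a hypothetical Series named `zips`. It uses pandas' nullable `string` dtype so a missing zip code stays missing. Depending on the pandas version, casting with `astype("str")` instead can turn a missing value into the literal text `<NA>`, which `zfill` would then pad.

```python
import pandas as pd

# hypothetical stand-in: zip codes parsed as floats, with one missing value
zips = pd.Series([601.0, 10514.0, None])

# nullable int -> nullable string -> left-pad with zeros to 5 characters
formatted = zips.astype("Int64").astype("string").str.zfill(5)
print(formatted.tolist())
```

The first two values come out as `00601` and `10514`, and the third remains a missing value rather than a padded string.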


I'm happy with the way this data looks at this point. Now we'll just export to a new csv for further exploration in the future.

In [25]:
# write prepped data to new csv
filepath = "us_realtor_prep.csv"

us_df_drop.to_csv(
    filepath,
    header=True,  # write the column names as the first row
)
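One thing to watch when reading the prepped file back in: `read_csv` infers column types again, so the zip codes would be parsed as numbers and lose their leading zeros unless a dtype is given. The sketch below shows a round trip on a hypothetical `sample` frame, writing to an in-memory buffer rather than a real file. It also passes `index=False`, which is an assumption about what is wanted here: it skips writing the row index, which otherwise tends to show up as an extra unnamed column on the next read.

```python
import io

import pandas as pd

# hypothetical stand-in for us_df_drop
sample = pd.DataFrame({"zip_code": ["00601", "10514"], "price": [105000, 445000]})

# write to an in-memory buffer instead of a file; index=False skips the row index
buffer = io.StringIO()
sample.to_csv(buffer, index=False)

# read it back, telling pandas to treat zip_code as text
buffer.seek(0)
reloaded = pd.read_csv(buffer, dtype={"zip_code": "string"})
print(reloaded["zip_code"].tolist())
```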