# 1. Data Wrangling <a id="data_wrangling"></a>

<a id="contents"></a>
# Table of Contents  
1. [Data Wrangling](#data_wrangling)
    - [1.1 Introduction](#introduction)
    - [1.2 Imports](#imports)
    - [1.3 Load the Data](#load)
    - [1.4 Dataset Cleaning](#cleaning)

## 1.1 Introduction<a id="introduction"></a>

### Problem
Real estate investors need to identify profitable investment opportunities in dynamic markets. Understanding market trends and segmenting opportunities based on risk and return profiles is crucial for optimizing investment strategies. The goal of this project is to maximize investment returns by leveraging data-driven approaches to identify undervalued properties, forecast market trends, and optimize portfolio allocations.


### Clients
The findings of this study will be of interest to a broad range of stakeholders, specifically real estate investors, portfolio managers, and real estate agents and brokers who can benefit from understanding market trends and leverage the insights from the project to provide more accurate and data-driven recommendations.


### Data
The dataset for this project was downloaded from Kaggle and has been filtered and cleaned to include housing data from New York, extracted via the Zillow API. This comprehensive dataset provides detailed information about various properties, capturing a wide range of features relevant to real estate analysis. The primary goal of this project is to develop a predictive model that analyzes housing data to forecast property prices accurately. By leveraging this data, the model aims to provide valuable insights into the New York housing market, potentially aiding buyers, sellers, and investors in making informed decisions.

Link to Kaggle dataset: https://www.kaggle.com/datasets/ericpierce/new-york-housing-zillow-api


## 1.2 Imports

In [1]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 
import seaborn as sns 
import os
import csv
from tqdm.notebook import tqdm
from datetime import datetime, timezone
from geopy.geocoders import Nominatim
from geopy.extra.rate_limiter import RateLimiter
from collections import Counter

## 1.3 Load the Data<a id="load"></a>

## To begin, we are focusing the data on the regions wihthin New York State (NY) in the US. After loading all the datasets in to the notebook, we will filter out any region/location that is not in NY.

df_ = df[df['address/streetAddress']=='282 East Rd']
df_['price']

In [2]:
df = pd.read_csv('/Users/heatheradler/Documents/GitHub/Springboard/Springboard_Projects/Capstone 3/Datasets/newyork_housing.csv')

  df = pd.read_csv('/Users/heatheradler/Documents/GitHub/Springboard/Springboard_Projects/Capstone 3/Datasets/newyork_housing.csv')


In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 75630 entries, 0 to 75629
Columns: 1507 entries, address/city to zpid
dtypes: bool(13), float64(440), int64(2), object(1052)
memory usage: 863.0+ MB


In [4]:
df.head()

Unnamed: 0,address/city,address/community,address/neighborhood,address/state,address/streetAddress,address/subdivision,address/zipcode,bathrooms,bedrooms,currency,...,schools/2/link,schools/2/name,schools/2/rating,schools/2/size,schools/2/studentsPerTeacher,schools/2/totalCount,schools/2/type,url,yearBuilt,zpid
0,New York,,,NY,60 Terrace View Ave,,10463.0,2.0,5.0,USD,...,,,,,,,,https://www.zillow.com/homedetails/60-Terrace-...,1920.0,31554050.0
1,Bronx,,,NY,625 W 246th St,,10471.0,8.0,8.0,USD,...,,,,,,,,https://www.zillow.com/homedetails/625-W-246th...,1940.0,29854120.0
2,Bronx,,,NY,716 W 231st St,,10463.0,3.0,4.0,USD,...,,,,,,,,https://www.zillow.com/homedetails/716-W-231st...,1920.0,29851860.0
3,Bronx,,,NY,750 W 232nd St,,10463.0,6.0,5.0,USD,...,,,,,,,,https://www.zillow.com/homedetails/750-W-232nd...,1950.0,29851860.0
4,Bronx,,,NY,632 W 230th St,,10463.0,6.0,5.0,USD,...,,,,,,,,https://www.zillow.com/homedetails/632-W-230th...,2020.0,2077107000.0


## 1.4 Dataset Cleaning

## We validated the unique values in column house_type as this is a categorical feature and we expect there to be a finite and relatively small set of values. Through this, we identified that there were several date values in the column and removed them to ensure accuracy.

In [5]:
df.loc[df['resoFactsStats/atAGlanceFacts/0/factValue'].str.contains('2021', na=False), 
       'resoFactsStats/atAGlanceFacts/0/factValue'] = df['resoFactsStats/atAGlanceFacts/1/factValue']

## The dataset has 1507 columns, many of which are unnecessary for our purposes. As there are so many columns, we have first identified some words that appear in many unneeded columns to reduce the column count and make it easier to look through. From there, we identified the target and feature columns that would be necessary to create our model and filtered the dataset accordingly. For readability and efficiency, we re-named most of the columns kept. Additioanlly, we removed null values.

In [6]:
# List of keywords to remove
remove_keywords = ['photo', 'url', 'History', 'link', 'zpid', 'level', 'Fact']

# Remove columns with any of the keywords in their names
df_1 = df.drop(columns=[col for col in df.columns if any(keyword in col for keyword in remove_keywords)])

# Display the result
print("\nDataFrame after removing columns with 'photo', 'url', 'History':")
print(df_1)


DataFrame after removing columns with 'photo', 'url', 'History':
               address/city  address/community address/neighborhood  \
0                  New York                NaN                  NaN   
1                     Bronx                NaN                  NaN   
2                     Bronx                NaN                  NaN   
3                     Bronx                NaN                  NaN   
4                     Bronx                NaN                  NaN   
...                     ...                ...                  ...   
75625              Flushing                NaN                  NaN   
75626  Forest Hills Gardens                NaN                  NaN   
75627  Forest Hills Gardens                NaN                  NaN   
75628              Flushing                NaN                  NaN   
75629              Flushing                NaN                  NaN   

      address/state address/streetAddress address/subdivision  \
0               

In [7]:
filtered_df = df.filter(like='school')

filtered_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 75630 entries, 0 to 75629
Data columns (total 37 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   schools                       0 non-null      float64
 1   schools/0/assigned            0 non-null      float64
 2   schools/0/distance            75499 non-null  float64
 3   schools/0/grades              74988 non-null  object 
 4   schools/0/isAssigned          75499 non-null  object 
 5   schools/0/level               75499 non-null  object 
 6   schools/0/link                75499 non-null  object 
 7   schools/0/name                75499 non-null  object 
 8   schools/0/rating              74614 non-null  float64
 9   schools/0/size                75002 non-null  float64
 10  schools/0/studentsPerTeacher  74504 non-null  float64
 11  schools/0/totalCount          75499 non-null  float64
 12  schools/0/type                75499 non-null  object 
 13  s

In [8]:
# Keep only relevant columns
columns_to_keep = ['address/city', 'address/streetAddress', 'address/state', 'address/zipcode', 'resoFactsStats/atAGlanceFacts/0/factValue', 'price', 'bathrooms', 'bedrooms', 'schools/0/name', 'schools/0/rating', 'schools/2/name', 'schools/2/rating', 'yearBuilt', 'latitude', 'longitude', 'livingArea', 'resoFactsStats/atAGlanceFacts/2/factValue','resoFactsStats/atAGlanceFacts/3/factValue', 'resoFactsStats/atAGlanceFacts/4/factValue', 'resoFactsStats/basement', 'resoFactsStats/taxAssessedValue', 'resoFactsStats/taxAnnualAmount', 'resoFactsStats/stories', 'resoFactsStats/lotSize']

# Keep only the specified columns
df_1 = df.loc[:, columns_to_keep]

# Display the result
print("\nDataFrame after keeping only specified columns:")
print(df_1)


DataFrame after keeping only specified columns:
               address/city address/streetAddress address/state  \
0                  New York   60 Terrace View Ave            NY   
1                     Bronx        625 W 246th St            NY   
2                     Bronx        716 W 231st St            NY   
3                     Bronx        750 W 232nd St            NY   
4                     Bronx        632 W 230th St            NY   
...                     ...                   ...           ...   
75625              Flushing         6829 Manse St            NY   
75626  Forest Hills Gardens       82 Greenway Ter            NY   
75627  Forest Hills Gardens       86 Greenway Ter            NY   
75628              Flushing         8913 70th Ave            NY   
75629              Flushing         7049 Manse St            NY   

       address/zipcode resoFactsStats/atAGlanceFacts/0/factValue      price  \
0              10463.0                               Residential   

duplicate_rows = df_1[df_1.duplicated()]
sorted_duplicate_rows = duplicate_rows.sort_values(by='address/streetAddress')
sorted_duplicate_rows

In [9]:
# Dictionary of columns to rename
columns_to_rename = {
    'address/city': 'city',
    'address/streetAddress': 'street_address',
    'address/state': 'state',
    'address/zipcode': 'zipcode',
    'resoFactsStats/atAGlanceFacts/0/factValue': 'house_type',
    'schools/0/name': 'school_name',
    'schools/0/rating': 'school_rating',
    'schools/2/name': 'school_name_2',
    'schools/2/rating': 'school_rating_2',
    'livingArea': 'sqft',
    'resoFactsStats/atAGlanceFacts/2/factValue': 'heating',
    'resoFactsStats/atAGlanceFacts/3/factValue': 'cooling',
    'resoFactsStats/atAGlanceFacts/4/factValue': 'parking',
    'resoFactsStats/basement': 'basement',
    'resoFactsStats/taxAssessedValue': 'tax_assessed_value', 
    'resoFactsStats/taxAnnualAmount': 'tax_amount',
    'resoFactsStats/stories': 'stories',
    'resoFactsStats/lotSize':'lot_size'
}

# Rename the specified columns
df_2 = df_1.rename(columns=columns_to_rename)

# Display the result
print("\nDataFrame after renaming specified columns:")
print(df_2)


DataFrame after renaming specified columns:
                       city       street_address state  zipcode  \
0                  New York  60 Terrace View Ave    NY  10463.0   
1                     Bronx       625 W 246th St    NY  10471.0   
2                     Bronx       716 W 231st St    NY  10463.0   
3                     Bronx       750 W 232nd St    NY  10463.0   
4                     Bronx       632 W 230th St    NY  10463.0   
...                     ...                  ...   ...      ...   
75625              Flushing        6829 Manse St    NY  11375.0   
75626  Forest Hills Gardens      82 Greenway Ter    NY  11375.0   
75627  Forest Hills Gardens      86 Greenway Ter    NY  11375.0   
75628              Flushing        8913 70th Ave    NY  11375.0   
75629              Flushing        7049 Manse St    NY  11375.0   

          house_type      price  bathrooms  bedrooms  \
0        Residential   799999.0        2.0       5.0   
1      Single Family  3995000.0       

In [10]:
df_2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 75630 entries, 0 to 75629
Data columns (total 24 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   city                75629 non-null  object 
 1   street_address      75629 non-null  object 
 2   state               75629 non-null  object 
 3   zipcode             75611 non-null  float64
 4   house_type          75344 non-null  object 
 5   price               75591 non-null  float64
 6   bathrooms           56577 non-null  float64
 7   bedrooms            56166 non-null  float64
 8   school_name         75499 non-null  object 
 9   school_rating       74614 non-null  float64
 10  school_name_2       55543 non-null  object 
 11  school_rating_2     55506 non-null  float64
 12  yearBuilt           69898 non-null  float64
 13  latitude            75604 non-null  float64
 14  longitude           75604 non-null  float64
 15  sqft                66419 non-null  float64
 16  heat

In [11]:
# Count the number of null values in each column
null_counts = df_2.isnull().sum()

# Display the result
print("Number of null values in each column:")
null_counts

Number of null values in each column:


city                      1
street_address            1
state                     1
zipcode                  19
house_type              286
price                    39
bathrooms             19053
bedrooms              19464
school_name             131
school_rating          1016
school_name_2         20087
school_rating_2       20124
yearBuilt              5732
latitude                 26
longitude                26
sqft                   9211
heating               44598
cooling               55230
parking                2200
basement              51989
tax_assessed_value    12486
tax_amount            10594
stories               27817
lot_size              11175
dtype: int64

In [12]:
distinct_house_types = df_2['house_type'].unique()
distinct_house_types

array(['Residential', 'Single Family', 'Condo', 'Multiple Occupancy',
       'Apartment', 'Vacant Land', 'Townhouse', 'Residential Income', nan,
       'Other', 'Available Now', 'Land', 'Mixed Use',
       'Mobile / Manufactured'], dtype=object)

In [13]:
#We should drop all hometype which is vacant land as these are not typical houses and mess up the distribution of house prices 
vacant_lands = df_2[df_2['house_type'] == 'Vacant Land'].index
df_2 = df_2.drop(vacant_lands)

## We then looked to validate that remaining categorical values were also accurate. We noted that there were additional states to NY included in the dataset which needed to be removed. Additionally, we transformed all values in the 'city' column to be lowercase in order to ensure there were no duplicates. From there we noted that there were 3 values that were not cities (blvd and n.y). Upon further investigation noted below, we identified the zip codes of the addresses within the corresponding rows and updated the 'city' accordingly.

In [14]:
for column in df_2.select_dtypes(include=['object']).columns:
    print(f"\nUnique values in '{column}' column:")
    print(df_2[column].unique())


Unique values in 'city' column:
['New York' 'Bronx' 'Manhattan' 'new york' 'New york' 'Street' 'Pelham'
 'Staten Island' 'Howard Beach' 'Broad Channel' 'Jamaica' 'Far Rockaway'
 'Hamilton Beach' 'Rosedale' 'Rockaway Beach' 'Queens' 'Rockaway Park'
 'Belle Harbor' 'Neponsit' 'Breezy Point' 'Rockaway park' 'Far rockaway'
 'Broad channel' 'Breezy Pt' 'BELLE HARBOR' 'Belle harbor' 'rosedale'
 'Rockaway point' 'ROCKAWAY PARK' 'Far Rockway' 'Lawrence'
 'Washington Heights' 'Avenue' 'Brooklyn' 'Maspeth' 'Little Neck'
 'Flushing' 'Douglaston' 'Little neck' 'District heights' 'NEW YORK'
 'Staten island' 'staten Island' 'staten island' 'BROOKLYN'
 'Queens Village' 'Cambria Heights' 'Springfield Gardens' 'Laurelton'
 'Cambria heights' 'South Ozone Park' 'Arverne' 'far rockaway'
 'belle harbor' 'New york City' 'NY' 'New York City' 'Yonkers' 'College'
 'bronx' 'Cen' 'elmont' 'boulevard' 'Rockaway Point' 'Averne' '350w42ndst'
 'Oval' 'BRONX' 'East 214th Street' 'West 156th' 'Concourse' 'Narrowsburg

In [15]:
non_ny_rows = df_2[df_2['state'] != 'NY']
non_ny_rows

Unnamed: 0,city,street_address,state,zipcode,house_type,price,bathrooms,bedrooms,school_name,school_rating,...,longitude,sqft,heating,cooling,parking,basement,tax_assessed_value,tax_amount,stories,lot_size
2403,District heights,Lorring Dr,MD,20747.0,Apartment,58521100.0,,,Longfields Elementary School,5.0,...,-74.01693,,,,0 spaces,,79200.0,1177.0,,0.54 Acres
13283,Aberdeen,Battle Ave,MD,21001.0,Single Family,5720.0,1.0,,Ps 94 David D Porter,10.0,...,-73.729507,672.0,,,Garage,,67800.0,1116.0,1.0,"2,500 sqft"
15834,,,,,Other,,,,,,...,,,,,0 spaces,,,,,
17976,Warwick,Cecilton-wa Rd,MD,21912.0,Single Family,370000.0,,,Ps 46 Alley Pond,8.0,...,-73.75779,,,,Garage,,23900.0,276.0,,49.09 Acres
22854,Scarborough,103 Running Hill Road,ME,4074.0,Single Family,299999.0,2.0,3.0,Ps 60 Alice Austen,4.0,...,-74.181206,1728.0,,,0 spaces,,,,,
24590,BENNETTSVILLE,W Main St,SC,29512.0,Single Family,63000.0,2.0,3.0,Ps 8 Robert Fulton,8.0,...,-73.99086,1764.0,Gas,Central,Garage - Detached,,51110.0,,,0.44 Acres
25175,Brooklyn,53 Boerum Place #5H,NV,11201.0,Condo,704500.0,1.0,1.0,P.S. 261 Philip Livingston,4.0,...,-73.989609,600.0,,,0 spaces,,,,,
25623,Staten Island,9113 N Park Plaza Ct,WI,10314.0,Single Family,45000.0,,,Ps 60 Alice Austen,4.0,...,-74.165787,,,,0 spaces,,30000.0,1009.0,,"3,245 sqft"
36295,Wakefield,130-52 120th,NC,11419.0,Multiple Occupancy,615000.0,2.0,4.0,Ps 161 Arthur Ashe School,9.0,...,-73.822197,2500.0,,,0 spaces,,,,,
36797,richmond hill,105-07 133,NC,11419.0,Single Family,480000.0,2.0,3.0,Ps 121,8.0,...,-73.812538,2500.0,,,0 spaces,,,,,


In [16]:
non_ny_rows = non_ny_rows[non_ny_rows['street_address'].str.contains('\d', na=False)]
non_ny_rows

Unnamed: 0,city,street_address,state,zipcode,house_type,price,bathrooms,bedrooms,school_name,school_rating,...,longitude,sqft,heating,cooling,parking,basement,tax_assessed_value,tax_amount,stories,lot_size
22854,Scarborough,103 Running Hill Road,ME,4074.0,Single Family,299999.0,2.0,3.0,Ps 60 Alice Austen,4.0,...,-74.181206,1728.0,,,0 spaces,,,,,
25175,Brooklyn,53 Boerum Place #5H,NV,11201.0,Condo,704500.0,1.0,1.0,P.S. 261 Philip Livingston,4.0,...,-73.989609,600.0,,,0 spaces,,,,,
25623,Staten Island,9113 N Park Plaza Ct,WI,10314.0,Single Family,45000.0,,,Ps 60 Alice Austen,4.0,...,-74.165787,,,,0 spaces,,30000.0,1009.0,,"3,245 sqft"
36295,Wakefield,130-52 120th,NC,11419.0,Multiple Occupancy,615000.0,2.0,4.0,Ps 161 Arthur Ashe School,9.0,...,-73.822197,2500.0,,,0 spaces,,,,,
36797,richmond hill,105-07 133,NC,11419.0,Single Family,480000.0,2.0,3.0,Ps 121,8.0,...,-73.812538,2500.0,,,0 spaces,,,,,
41864,Lewiston,2 Granite Street,ME,4240.0,Multiple Occupancy,155000.0,6.0,9.0,Ps 33 Chelsea Prep,7.0,...,-73.999069,5840.0,,,0 spaces,,,,,
42015,Voluntown,82 Pendleton Hill Rd,CT,6384.0,Single Family,595000.0,2.0,4.0,Ps 130 Hernando De Soto,8.0,...,-73.99836,1550.0,"Baseboard, Oil, Other",Wall,"Garage, Garage - Attached, Off-street, Covered",Basement (not specified),142100.0,4151.0,1.0,2.74 Acres
48625,Brooklyn,53 Boerum Place #5H,NV,11201.0,Condo,704500.0,1.0,1.0,P.S. 261 Philip Livingston,4.0,...,-73.989609,600.0,,,0 spaces,,,,,
53114,Staten Island,9113 N Park Plaza Ct,WI,10314.0,Single Family,45000.0,,,Ps 60 Alice Austen,4.0,...,-74.165787,,,,0 spaces,,30000.0,1009.0,,"3,245 sqft"
71192,New York,36 Watson Straits Rd,MA,10172.0,Residential,120000.0,1.0,1.0,Ps 59 Beekman Hill International,10.0,...,-73.974312,639.0,Wood,,Off Street,,115600.0,2779.0,,0.73 Acres


In [17]:
cities_to_update = ['Brooklyn', 'Staten Island', 'New York', 'Richmond Hill']

# Update the 'state' column to 'NY' where the 'city' column matches the cities in the list
non_ny_rows.loc[non_ny_rows['city'].str.lower().isin([city.lower() for city in cities_to_update]), 'state'] = 'NY'
non_ny_rows

Unnamed: 0,city,street_address,state,zipcode,house_type,price,bathrooms,bedrooms,school_name,school_rating,...,longitude,sqft,heating,cooling,parking,basement,tax_assessed_value,tax_amount,stories,lot_size
22854,Scarborough,103 Running Hill Road,ME,4074.0,Single Family,299999.0,2.0,3.0,Ps 60 Alice Austen,4.0,...,-74.181206,1728.0,,,0 spaces,,,,,
25175,Brooklyn,53 Boerum Place #5H,NY,11201.0,Condo,704500.0,1.0,1.0,P.S. 261 Philip Livingston,4.0,...,-73.989609,600.0,,,0 spaces,,,,,
25623,Staten Island,9113 N Park Plaza Ct,NY,10314.0,Single Family,45000.0,,,Ps 60 Alice Austen,4.0,...,-74.165787,,,,0 spaces,,30000.0,1009.0,,"3,245 sqft"
36295,Wakefield,130-52 120th,NC,11419.0,Multiple Occupancy,615000.0,2.0,4.0,Ps 161 Arthur Ashe School,9.0,...,-73.822197,2500.0,,,0 spaces,,,,,
36797,richmond hill,105-07 133,NY,11419.0,Single Family,480000.0,2.0,3.0,Ps 121,8.0,...,-73.812538,2500.0,,,0 spaces,,,,,
41864,Lewiston,2 Granite Street,ME,4240.0,Multiple Occupancy,155000.0,6.0,9.0,Ps 33 Chelsea Prep,7.0,...,-73.999069,5840.0,,,0 spaces,,,,,
42015,Voluntown,82 Pendleton Hill Rd,CT,6384.0,Single Family,595000.0,2.0,4.0,Ps 130 Hernando De Soto,8.0,...,-73.99836,1550.0,"Baseboard, Oil, Other",Wall,"Garage, Garage - Attached, Off-street, Covered",Basement (not specified),142100.0,4151.0,1.0,2.74 Acres
48625,Brooklyn,53 Boerum Place #5H,NY,11201.0,Condo,704500.0,1.0,1.0,P.S. 261 Philip Livingston,4.0,...,-73.989609,600.0,,,0 spaces,,,,,
53114,Staten Island,9113 N Park Plaza Ct,NY,10314.0,Single Family,45000.0,,,Ps 60 Alice Austen,4.0,...,-74.165787,,,,0 spaces,,30000.0,1009.0,,"3,245 sqft"
71192,New York,36 Watson Straits Rd,NY,10172.0,Residential,120000.0,1.0,1.0,Ps 59 Beekman Hill International,10.0,...,-73.974312,639.0,Wood,,Off Street,,115600.0,2779.0,,0.73 Acres


In [18]:
df_3 = df_2.copy()
non_ny_rows = pd.DataFrame(non_ny_rows)
non_ny_rows.index = df_3.index.intersection(non_ny_rows.index)

# Update df_3 with the new values from non_ny_rows
df_3.update(non_ny_rows)
df_3.info()

<class 'pandas.core.frame.DataFrame'>
Index: 74787 entries, 0 to 75629
Data columns (total 24 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   city                74786 non-null  object 
 1   street_address      74786 non-null  object 
 2   state               74786 non-null  object 
 3   zipcode             74769 non-null  float64
 4   house_type          74501 non-null  object 
 5   price               74748 non-null  float64
 6   bathrooms           56573 non-null  float64
 7   bedrooms            56094 non-null  float64
 8   school_name         74668 non-null  object 
 9   school_rating       73791 non-null  float64
 10  school_name_2       55035 non-null  object 
 11  school_rating_2     54998 non-null  float64
 12  yearBuilt           69558 non-null  float64
 13  latitude            74761 non-null  float64
 14  longitude           74761 non-null  float64
 15  sqft                66066 non-null  float64
 16  heating  

In [19]:
# Keep only rows where the 'state' column is 'NY'
df_4 = df_3[df_3['state'] == 'NY']

In [20]:
# Convert all values in the 'city' column to lower case
df_4['city'] = df_4['city'].str.lower()

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_4['city'] = df_4['city'].str.lower()


In [21]:
# Keep only rows where the 'state' column is 'NY'
#df_4= df_3[df_3['state'] == 'NY']

# Filter rows with 'blvd' and 'n.y' in the 'city' column
filtered_df = df_4[df_4['city'].isin(['blvd', 'n.y'])]

# Display the result
print("Rows with 'blvd' and 'n.y' in the 'city' column:")
filtered_df

Rows with 'blvd' and 'n.y' in the 'city' column:


Unnamed: 0,city,street_address,state,zipcode,house_type,price,bathrooms,bedrooms,school_name,school_rating,...,longitude,sqft,heating,cooling,parking,basement,tax_assessed_value,tax_amount,stories,lot_size
19733,blvd,87-40 Francis Lewis #B28,NY,11427.0,Condo,195000.0,1.0,2.0,The Bellaire School,6.0,...,-73.761993,800.0,,,0 spaces,,,,,
74792,blvd,63-93 Woodhaven #3H,NY,11374.0,Condo,285000.0,1.0,,Ps 174 William Sidney Mount,8.0,...,-73.866508,600.0,,,0 spaces,,,,,
75083,blvd,67-35 Yellowstone Blvd #6T,NY,11375.0,Condo,505000.0,2.0,2.0,Ps 175 The Lynn Gross Discovery School,7.0,...,-73.84861,1000.0,,,0 spaces,,,,,
75543,n.y,85-36 67th Ave Rego Park,NY,11374.0,Single Family,899000.0,2.0,3.0,Ps 174 William Sidney Mount,8.0,...,-73.860992,1440.0,,,0 spaces,,,,,"2,230 sqft"


## To validate the accuracy of the numerical features, we looked to validated the bedroom count was appropriate. To do this we looked to validate that single family homes did not contain more than 15 bedrooms as this amount of bedrooms would not make sense for a single family home but would be more appropriate for a multi-family home. We noted there were 5 properties that met this criteria. Upon further analysis shown below, we updated the house type of 2 of the properties to reflect them being multi-family properties (validated by a google search). We removed the remaining properties from our dataset as they could not be validated.

In [22]:
# Filter rows where the 'bedrooms' column has values greater than 15 and 'house_type' is 'Single Family'
df_bedrooms = df_4[(df_4['bedrooms'] >= 15) & (df_4['house_type'] == 'Single Family')]

# Display the result
print("Rows with more than 15 bedrooms and house_type 'Single Family':")
df_bedrooms

Rows with more than 15 bedrooms and house_type 'Single Family':


Unnamed: 0,city,street_address,state,zipcode,house_type,price,bathrooms,bedrooms,school_name,school_rating,...,longitude,sqft,heating,cooling,parking,basement,tax_assessed_value,tax_amount,stories,lot_size
750,far rockaway,508 Beach 135th St,NY,11694.0,Single Family,975000.0,3.0,41.0,Ps Ms 114 Belle Harbor,8.0,...,-73.855034,1500.0,Other,Central,"Garage, Garage - Attached",Full,910000.0,8552.0,,"3,998 sqft"
11735,brooklyn,6 Saint Nicholas Ave,NY,11237.0,Single Family,1600000.0,16.0,20.0,Ps 123 Suydam,2.0,...,-73.922455,12600.0,Forced air,Central,0 spaces,See Remarks,624000.0,4924.0,,"2,250 sqft"
11736,brooklyn,6 Saint Nicholas Ave,NY,11237.0,Single Family,1600000.0,16.0,20.0,Ps 123 Suydam,2.0,...,-73.922455,12600.0,Forced air,Central,0 spaces,See Remarks,624000.0,4924.0,,"2,250 sqft"
11738,brooklyn,6 Saint Nicholas Ave,NY,11237.0,Single Family,1600000.0,16.0,20.0,Ps 123 Suydam,2.0,...,-73.922455,12600.0,Forced air,Central,0 spaces,See Remarks,624000.0,4924.0,,"2,250 sqft"
11739,brooklyn,6 Saint Nicholas Ave,NY,11237.0,Single Family,1600000.0,16.0,20.0,Ps 123 Suydam,2.0,...,-73.922455,12600.0,Forced air,Central,0 spaces,See Remarks,624000.0,4924.0,,"2,250 sqft"
11740,brooklyn,6 Saint Nicholas Ave,NY,11237.0,Single Family,1600000.0,16.0,20.0,Ps 123 Suydam,2.0,...,-73.922455,12600.0,Forced air,Central,0 spaces,See Remarks,624000.0,4924.0,,"2,250 sqft"
11750,brooklyn,6 Saint Nicholas Ave,NY,11237.0,Single Family,1600000.0,16.0,20.0,Ps 123 Suydam,2.0,...,-73.922455,12600.0,Forced air,Central,0 spaces,See Remarks,624000.0,4924.0,,"2,250 sqft"
17164,jamaica,8019 190th St,NY,11423.0,Single Family,800000.0,3.0,15.0,Ps Is 178 Holliswood,8.0,...,-73.778816,1864.0,,,"Garage, Garage - Attached",,986000.0,10050.0,2.0,"4,000 sqft"
35145,brooklyn,2114 Fulton St,NY,11233.0,Single Family,750000.0,,18.0,Ps 41 Francis White,5.0,...,-73.912659,2600.0,,,0 spaces,,210000.0,8120.0,2.0,"1,992 sqft"
35968,brooklyn,658 Ashford St,NY,11207.0,Single Family,499000.0,,15.0,Ps 202 Ernest S Jenkyns,3.0,...,-73.883232,3766.0,,,"Garage, Garage - Attached",,321600.0,8422.0,2.0,"3,920 sqft"


In [23]:
# Identify the rows to be removed
rows_to_remove = df_3[(df_3['house_type'] == 'Single Family') & (df_3['bedrooms'] >= 15)]

# Drop these rows from df
df_4 = df_3.drop(rows_to_remove.index)

In [24]:
# Get unique zip codes
unique_new_york_zip_codes = df_4['zipcode'].unique()

# Print unique zip codes for 'new york'
print(unique_new_york_zip_codes)

[10463. 10471. 10034. 10040. 10803. 10466. 10469. 10475. 10314. 11414.
 11693. 11691. 11422. 10003. 11694. 11697. 11559. 10006. 10280. 10004.
 10032. 10033. 10453. 10452. 10027. 10456. 10031. 10030. 10457.    nan
 10464. 10035. 10454. 10029. 10037. 10451. 10474. 10455. 10024. 11222.
 11378. 11362. 20747. 11231. 10306. 10309. 10312. 10308. 11235. 11224.
 11429. 11411. 11413. 11420. 11692. 11357. 10023. 10069. 10282. 10704.
 10470. 10468. 10465. 10301. 10304. 11003. 10018. 10016. 10036. 10001.
 10017. 10123. 10467. 10458. 10039. 12764. 10460. 10459. 10472. 10473.
 10462. 10461.   148. 10021. 10075. 10028. 10007. 10022. 10044. 10065.
 11101. 11102. 11105. 11103. 11106. 11377. 11356. 11370. 11369. 11354.
 11368. 11104. 10010. 11109. 10009. 11755. 11249. 11211. 11385. 11206.
 11237. 10025. 11360. 11359. 11358. 11363. 11361. 11364. 11005. 11020.
 21001. 11355. 11367. 11365. 11366. 11375. 11432. 11415. 11418. 11435.
 11423. 11430. 11433. 11428. 11412. 21912. 11427. 11426. 11004. 11424.
 11001

In [25]:
# Dictionary mapping zip codes to boroughs
zip_to_borough = {
    10463: "Bronx",10471: "Bronx",10034: "Manhattan",10040: "Manhattan",10803: "Bronx",10466: "Bronx",10469: "Bronx",
    10475: "Bronx",10314: "Staten Island",11414: "Queens",11693: "Queens",11691: "Queens",11422: "Queens",10003: "Manhattan",11694: "Queens",
    11697: "Queens",11559: "Queens",10006: "Manhattan",10280: "Manhattan",10004: "Manhattan",10032: "Manhattan",
    10033: "Manhattan",10453: "Bronx",10452: "Bronx",10027: "Manhattan",10456: "Bronx",10031: "Manhattan",10030: "Manhattan",10457: "Bronx",
    10464: "Bronx",10035: "Manhattan",10454: "Bronx",10029: "Manhattan",10037: "Manhattan",10451: "Bronx",10474: "Bronx",
    10455: "Bronx",10024: "Manhattan",11222: "Brooklyn",11378: "Queens",11362: "Queens",11231: "Brooklyn",10306: "Staten Island",
    10309: "Staten Island",10312: "Staten Island",10308: "Staten Island",11235: "Brooklyn",11224: "Brooklyn",11429: "Queens",
    11411: "Queens",11413: "Queens",11420: "Queens",11692: "Queens",11357: "Queens",10023: "Manhattan",10069: "Manhattan",
    10282: "Manhattan",10704: "Bronx",10470: "Bronx",10468: "Bronx",10465: "Bronx",10301: "Staten Island",10304: "Staten Island",
    11003: "Queens",10018: "Manhattan",10016: "Manhattan",10036: "Manhattan",10001: "Manhattan",10017: "Manhattan",
    10123: "Manhattan",10467: "Bronx",10458: "Bronx",10039: "Manhattan",12764: "Bronx",10460: "Bronx",10459: "Bronx",
    10472: "Bronx",10473: "Bronx",10462: "Bronx",10461: "Bronx",148: "Bronx",10021: "Manhattan",10075: "Manhattan",10028: "Manhattan",
    10007: "Manhattan",10022: "Manhattan",10044: "Manhattan",10065: "Manhattan",11101: "Queens",11102: "Queens",
    11105: "Queens",11358: "Queens",11103: "Queens",11106: "Queens",11377: "Queens",11356: "Queens",11370: "Queens",
    11369: "Queens",11354: "Queens",11368: "Queens",11104: "Queens",10010: "Manhattan",11109: "Queens",10009: "Manhattan",
    11755: "Queens",11249: "Brooklyn",11211: "Brooklyn",11385: "Queens",11206: "Brooklyn",11237: "Brooklyn",10025: "Manhattan",
    11360: "Queens",11359: "Queens",11363: "Queens",11361: "Queens",11364: "Queens",11005: "Queens",11020: "Queens",
    11355: "Queens",11367: "Queens",11365: "Queens",11366: "Queens",11375: "Queens",11432: "Queens",11415: "Queens",11418: "Queens",
    11435: "Queens",11423: "Queens",11430: "Queens",11433: "Queens",11428: "Queens",11412: "Queens",11427: "Queens",11426: "Queens",
    11004: "Queens",11424: "Queens",11001: "Queens",11040: "Queens",10303: "Staten Island",10302: "Staten Island",
    10310: "Staten Island",11201: "Brooklyn",10005: "Manhattan",11242: "Brooklyn",11217: "Brooklyn",11215: "Brooklyn",11232: "Brooklyn",11214: "Brooklyn",
    10307: "Staten Island",10305: "Staten Island",11223: "Brooklyn",11229: "Brooklyn",11234: "Brooklyn",11419: "Queens",11417: "Queens",11434: "Queens",
    11436: "Queens",11205: "Brooklyn",11238: "Brooklyn",11221: "Brooklyn",11218: "Brooklyn",11216: "Brooklyn",11225: "Brooklyn",11213: "Brooklyn",11203: "Brooklyn",
    11233: "Brooklyn",11212: "Brooklyn",11236: "Brooklyn",11421: "Queens",11207: "Brooklyn",11208: "Brooklyn",11239: "Brooklyn",11416: "Queens",
    10433: "Bronx",11491: "Queens",11210: "Brooklyn",10019: "Manhattan",10011: "Manhattan",10038: "Manhattan",10013: "Manhattan",10002: "Manhattan",10550: "Bronx",11220: "Brooklyn",11209: "Brooklyn",11228: "Brooklyn",
    11219: "Brooklyn",11204: "Brooklyn",11230: "Brooklyn",11226: "Brooklyn",11379: "Queens",10014: "Manhattan",10128: "Manhattan",
    11372: "Queens",11374: "Queens",11373: "Queens",13277: "Bronx"
}

# Create the 'borough' column based on the 'zip_code' column
df_4['borough'] = df_4['zipcode'].map(zip_to_borough)

df_4

Unnamed: 0,city,street_address,state,zipcode,house_type,price,bathrooms,bedrooms,school_name,school_rating,...,sqft,heating,cooling,parking,basement,tax_assessed_value,tax_amount,stories,lot_size,borough
0,New York,60 Terrace View Ave,NY,10463.0,Residential,799999.0,2.0,5.0,Ps 37 Multiple Intelligence School,4.0,...,1889.0,"Natural Gas, Hot Water",,Driveway,Finished,711000.0,5096.0,,,Bronx
1,Bronx,625 W 246th St,NY,10471.0,Single Family,3995000.0,8.0,8.0,Ps 24 Spuyten Duyvil,10.0,...,7000.0,,Central,"Garage, Garage - Attached",,1937000.0,13941.0,1.0,0.29 Acres,Bronx
2,Bronx,716 W 231st St,NY,10463.0,Single Family,1495000.0,3.0,4.0,Ps 24 Spuyten Duyvil,10.0,...,4233.0,,,"Garage, Garage - Attached",,2341000.0,12253.0,2.0,0.42 Acres,Bronx
3,Bronx,750 W 232nd St,NY,10463.0,Single Family,3450000.0,6.0,5.0,Ps 24 Spuyten Duyvil,10.0,...,7000.0,,Central,"Garage, Garage - Attached",,3011000.0,19472.0,2.0,0.26 Acres,Bronx
4,Bronx,632 W 230th St,NY,10463.0,Single Family,1790000.0,6.0,5.0,Ps 24 Spuyten Duyvil,10.0,...,,,Central,0 spaces,,,,,,Bronx
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
75625,Flushing,6829 Manse St,NY,11375.0,Single Family,825000.0,2.0,3.0,Ps 144 Col Jeromus Remsen,8.0,...,2417.0,Other,,"Garage, Garage - Attached",,907000.0,6447.0,2.0,"2,417 sqft",Queens
75626,Forest Hills Gardens,82 Greenway Ter,NY,11375.0,Townhouse,2704000.0,6.0,6.0,Ps 101 School In The Gardens,9.0,...,6085.0,,,"Garage, Garage - Attached",,2513000.0,18430.0,2.0,"3,255 sqft",Queens
75627,Forest Hills Gardens,86 Greenway Ter,NY,11375.0,Townhouse,2750000.0,5.0,6.0,Ps 101 School In The Gardens,9.0,...,4564.0,,,0 spaces,,2893000.0,24649.0,2.0,"6,603 sqft",Queens
75628,Flushing,8913 70th Ave,NY,11375.0,Single Family,935000.0,,,Ps 144 Col Jeromus Remsen,8.0,...,1216.0,,,"Garage, Garage - Attached",,783000.0,4979.0,2.0,"2,367 sqft",Queens


In [26]:
null_zipcode_rows = df_4[df_4['zipcode'].isnull()]
null_zipcode_rows

Unnamed: 0,city,street_address,state,zipcode,house_type,price,bathrooms,bedrooms,school_name,school_rating,...,sqft,heating,cooling,parking,basement,tax_assessed_value,tax_amount,stories,lot_size,borough
1207,Bronx,W Fordham Rd,NY,,Single Family,31550000.0,,,Ps 291,4.0,...,,,,0 spaces,,7651000.0,97169.0,,3.72 Acres,
2765,Staten island,Kent St,NY,,Single Family,32500.0,,,Ps 23 Richmondtown,9.0,...,,,,0 spaces,,88000.0,4173.0,,"4,000 sqft",
5876,Bronx,Bailey Ave,NY,,Single Family,2775000.0,,,Ps 310 Marble Hill,6.0,...,,,,0 spaces,,46000.0,46.0,,732 sqft,
5877,Bronx,Bailey Ave,NY,,Single Family,2775000.0,,,Ps 310 Marble Hill,6.0,...,,,,0 spaces,,345000.0,1011.0,,"5,517 sqft",
10502,Queens,50-02 Midtown Tun Plz,NY,,Single Family,1500000.0,,,P.S. 78,9.0,...,,,,0 spaces,,1895000.0,64094.0,,0.44 Acres,
15834,,,,,Other,,,,,,...,,,,0 spaces,,,,,,
24379,Staten island,N Gannon Ave,NY,,Single Family,500000.0,,,Ps 29 Bardwell,7.0,...,,,,0 spaces,,379000.0,2282.0,,"7,000 sqft",
24850,Brooklyn,Henry St,NY,,Single Family,3000.0,,,Ps 58 The Carroll,8.0,...,,,,0 spaces,,2000.0,13.0,,308 sqft,
28400,Staten island,Hargold Ave,NY,,Single Family,32500.0,,,Ps 6 Cpl Allan F Kivlehan School,7.0,...,,,,0 spaces,,83000.0,3936.0,,"4,000 sqft",
28401,Staten island,Englewood Ave,NY,,Single Family,130000.0,,,Ps 6 Cpl Allan F Kivlehan School,7.0,...,,,,0 spaces,,51000.0,2418.0,,"2,700 sqft",


In [27]:
non_digit = null_zipcode_rows[null_zipcode_rows['street_address'].str.contains('\d', na=False)]
non_digit

Unnamed: 0,city,street_address,state,zipcode,house_type,price,bathrooms,bedrooms,school_name,school_rating,...,sqft,heating,cooling,parking,basement,tax_assessed_value,tax_amount,stories,lot_size,borough
10502,Queens,50-02 Midtown Tun Plz,NY,,Single Family,1500000.0,,,P.S. 78,9.0,...,,,,0 spaces,,1895000.0,64094.0,,0.44 Acres,
39113,Brooklyn,24 Brighton 3 Ln,NY,,Residential,549000.0,1.0,3.0,Ps 225 The Eileen E Zaglin,7.0,...,560.0,"Natural Gas, Hot Water",,,,,2500.0,,"1,600 sqft",


In [28]:
if 10502 in non_digit.index:
    # Update the value in the 'zipcode' column of index 10502
    non_digit.at[10502, 'zipcode'] = '11101.0'
    print(f"Updated 'zipcode' column at index 10502 to '11101.0'")
else:
    print("Index 10502 not found in the DataFrame")
    
if 39113 in non_digit.index:
    # Update the value in the 'zipcode' column of index 39113
    non_digit.at[39113, 'zipcode'] = '11235.0'
    print(f"Updated 'zipcode' column at index 39113 to '11235.0'")
else:
    print("Index 39113 not found in the DataFrame")

Updated 'zipcode' column at index 10502 to '11101.0'
Updated 'zipcode' column at index 39113 to '11235.0'


In [29]:
# Update df_5 with the new values from non_ny_rows
df_4.update(non_digit)
df_4.info()

<class 'pandas.core.frame.DataFrame'>
Index: 74773 entries, 0 to 75629
Data columns (total 25 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   city                74772 non-null  object 
 1   street_address      74772 non-null  object 
 2   state               74772 non-null  object 
 3   zipcode             74757 non-null  object 
 4   house_type          74487 non-null  object 
 5   price               74734 non-null  float64
 6   bathrooms           56561 non-null  float64
 7   bedrooms            56080 non-null  float64
 8   school_name         74654 non-null  object 
 9   school_rating       73777 non-null  float64
 10  school_name_2       55025 non-null  object 
 11  school_rating_2     54988 non-null  float64
 12  yearBuilt           69544 non-null  float64
 13  latitude            74747 non-null  float64
 14  longitude           74747 non-null  float64
 15  sqft                66052 non-null  float64
 16  heating  

### Handling Null Values

In [30]:
result = df_4.groupby('borough', as_index=False).agg(
    count=('price', 'size'),
    mean_price=('price', 'mean')
).round()
result

Unnamed: 0,borough,count,mean_price
0,Bronx,13471,658376.0
1,Brooklyn,14817,1378240.0
2,Manhattan,4477,3794234.0
3,Queens,26102,801940.0
4,Staten Island,15871,634491.0


In [31]:
df_4.info()

<class 'pandas.core.frame.DataFrame'>
Index: 74773 entries, 0 to 75629
Data columns (total 25 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   city                74772 non-null  object 
 1   street_address      74772 non-null  object 
 2   state               74772 non-null  object 
 3   zipcode             74757 non-null  object 
 4   house_type          74487 non-null  object 
 5   price               74734 non-null  float64
 6   bathrooms           56561 non-null  float64
 7   bedrooms            56080 non-null  float64
 8   school_name         74654 non-null  object 
 9   school_rating       73777 non-null  float64
 10  school_name_2       55025 non-null  object 
 11  school_rating_2     54988 non-null  float64
 12  yearBuilt           69544 non-null  float64
 13  latitude            74747 non-null  float64
 14  longitude           74747 non-null  float64
 15  sqft                66052 non-null  float64
 16  heating  

In [32]:
missing = pd.concat([df_4.isnull().sum(), 100 * df_4.isnull().mean()], axis=1)
missing.columns=['count', '%']
missing.sort_values(by=['count', '%'])

Unnamed: 0,count,%
city,1,0.001337
street_address,1,0.001337
state,1,0.001337
zipcode,16,0.021398
latitude,26,0.034772
longitude,26,0.034772
borough,35,0.046808
price,39,0.052158
school_name,119,0.159148
house_type,286,0.382491


In [33]:
null_house_type_rows = df_4[df_4['house_type'].isnull()]
null_house_type_rows

Unnamed: 0,city,street_address,state,zipcode,house_type,price,bathrooms,bedrooms,school_name,school_rating,...,sqft,heating,cooling,parking,basement,tax_assessed_value,tax_amount,stories,lot_size,borough
205,Bronx,3426 Hunter Ave,NY,10475.0,,165000.0,,,Ps 111 Seton Falls,4.0,...,,,,,,,1281.0,,"5,836 sqft",Bronx
222,Broad Channel,308 E 8th Rd,NY,11693.0,,179000.0,,,Ps 47 Chris Galas,7.0,...,,,,,,378000.0,1238.0,,"1,654 Acres",Queens
232,Hamilton Beach,99 165 Ave,NY,11414.0,,305000.0,,,Ps 146 Howard Beach,8.0,...,,,,,,,688.0,,0.09 Acres,Queens
242,Howard Beach,164th Ave,NY,11414.0,,375000.0,,,Ps 146 Howard Beach,8.0,...,,,,,,,,,0.16 Acres,Queens
255,Far Rockaway,The Strand,NY,11691.0,,95900.0,,,Ps 104 The Bays Water,6.0,...,,,,,,,,,0.19 Acres,Queens
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
72140,Forest Hills,68-37A 112th St,NY,11375.0,,1690000.0,,,Ps 220 Edward Mandel,6.0,...,,,,,,,,,0.11 Acres,Queens
72142,Flushing,5814 Granger St,NY,11368.0,,1950000.0,,,Ps 14 Fairview,3.0,...,,,,,,841000.0,6826.0,,"4,000 Acres",Queens
72359,Woodside,3823 54th St,NY,11377.0,,1395000.0,,,Ps 152 Gwendolyn N Alleyne School,9.0,...,,,,,,,14995.0,,0.25 Acres,Queens
73185,Woodside,7208 45th Ave,NY,11377.0,,1100000.0,,,Ps 12 James B Colgate,8.0,...,,,,,,,750.0,,0.50 Acres,Queens


In [34]:
df_5 = df_4.dropna(subset=['zipcode', 'price', 'house_type'])

In [35]:
#fill in NaN for story building column with 1 they must have at least one floor 
df_5['stories'].fillna(1,inplace=True)

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_5['stories'].fillna(1,inplace=True)


In [36]:
df_5 = df_5.drop(df_5[df_5['price'] < 100000].index, axis=0)

In [37]:
missing = pd.concat([df_5.isnull().sum(), 100 * df_5.isnull().mean()], axis=1)
missing.columns=['count', '%']
missing.sort_values(by=['count', '%'])

Unnamed: 0,count,%
city,0,0.0
street_address,0,0.0
state,0,0.0
zipcode,0,0.0
house_type,0,0.0
price,0,0.0
stories,0,0.0
latitude,1,0.001402
longitude,1,0.001402
borough,12,0.016822


In [38]:
# Check if the columns exist in the dataframe
columns_to_check = ['heating', 'cooling', 'basement', 'parking', 'house_type']

for column in columns_to_check:
    if column in df_5.columns:
        unique_values = df_5[column].unique()
        print(f"Unique values in '{column}':\n{unique_values}\n")
    else:
        print(f"Column '{column}' does not exist in the dataframe.\n")

Unique values in 'heating':
['Natural Gas, Hot Water' nan 'Natural Gas, Forced Air' 'Forced air'
 'Electric, Forced Air' 'Forced air, Gas' 'Natural Gas, Baseboard'
 'Other, Gas' 'Natural Gas, Hot Water, Steam' 'Forced air, Oil'
 'Electric, Gas' 'Electric, See Remarks' 'Other, Hot Water'
 'Natural Gas, See Remarks' 'Oil' 'Gas' 'Other' 'Wall, Gas' 'Other, Oil'
 'Other, Gas, Oil' 'Radiant, Gas' 'Forced air, Electric, Gas'
 'Forced air, Gas, Oil' 'Natural Gas, Oil, Forced Air' 'Other, Steam'
 'Baseboard, Other' 'Gas, Other' 'Natural Gas, Baseboard, Hot Water'
 'Baseboard, Gas' 'Oil, Hot Water, See Remarks' 'Electric, Baseboard'
 'Natural Gas, Steam' 'Propane, Forced Air' 'Natural Gas' 'Baseboard, Oil'
 'Oil, Hot Water' 'Oil, Forced Air' 'Baseboard, Oil, Propane / Butane'
 'Oil, Baseboard' 'Natural Gas, Baseboard, Radiant'
 'Natural Gas, Steam, Other' 'Forced Air' 'None, Other'
 'Natural Gas, Other' 'Natural Gas, Baseboard, Steam'
 'Natural Gas, Hot Water, Baseboard' 'Baseboard, Other, Gas'

### Parking

In [39]:
# Count the number of rows that contain the word 'none' in the 'parking' column
parking_none_count = df_5['parking'].str.contains('none', case=False, na=False).sum()

# Display the count
print(parking_none_count)

683


In [40]:
# Update all values in the 'parking' column with False if it contains 'none'
df_5['parking'] = df_5['parking'].apply(lambda x: False if pd.notna(x) and 'none' in str(x).lower() else x)

# Update all non-False values to True
df_5['parking'] = df_5['parking'].apply(lambda x: True if x != False else x)

# Check the updated values in the 'parking' column
print(df_5['parking'].unique())

[ True False]


### Heating

In [41]:
# Heating
df_5['heating'] = df_5['heating'].fillna(False)

In [42]:
# Display unique values in the 'heating' column
unique_heating_values = df_5['heating'].unique()
unique_heating_values

array(['Natural Gas, Hot Water', False, 'Natural Gas, Forced Air',
       'Forced air', 'Electric, Forced Air', 'Forced air, Gas',
       'Natural Gas, Baseboard', 'Other, Gas',
       'Natural Gas, Hot Water, Steam', 'Forced air, Oil',
       'Electric, Gas', 'Electric, See Remarks', 'Other, Hot Water',
       'Natural Gas, See Remarks', 'Oil', 'Gas', 'Other', 'Wall, Gas',
       'Other, Oil', 'Other, Gas, Oil', 'Radiant, Gas',
       'Forced air, Electric, Gas', 'Forced air, Gas, Oil',
       'Natural Gas, Oil, Forced Air', 'Other, Steam', 'Baseboard, Other',
       'Gas, Other', 'Natural Gas, Baseboard, Hot Water',
       'Baseboard, Gas', 'Oil, Hot Water, See Remarks',
       'Electric, Baseboard', 'Natural Gas, Steam', 'Propane, Forced Air',
       'Natural Gas', 'Baseboard, Oil', 'Oil, Hot Water',
       'Oil, Forced Air', 'Baseboard, Oil, Propane / Butane',
       'Oil, Baseboard', 'Natural Gas, Baseboard, Radiant',
       'Natural Gas, Steam, Other', 'Forced Air', 'None, Other'

In [43]:
# Update all values in the 'heating' column with False if it contains 'none'
df_5['heating'] = df_5['heating'].apply(lambda x: False if pd.notna(x) and 'none' in str(x).lower() else x)

# Update all non-False values to True
df_5['heating'] = df_5['heating'].apply(lambda x: True if x != False else x)

# Check the updated values in the 'parking' column
print(df_5['heating'].unique())

[ True False]


### Cooling

In [44]:
# Cooling
df_5['cooling'] = df_5['cooling'].fillna(False)

In [45]:
# Display unique values in the 'heating' column
unique_heating_values = df_5['cooling'].unique()
unique_heating_values

array([False, 'Central', 'Central Air', 'Window Unit(s)', 'Wall Unit(s)',
       'Other', 'Wall', 'Ductless',
       'ENERGY STAR Qualified Equipment, Wall Unit(s)',
       'Ductless, Window Unit(s)', 'ENERGY STAR Qualified Equipment',
       'Wall Unit(s), Window Unit(s)', 'Ductless, Zoned',
       'Refrigerator, Wall', 'A/C Unit, Central Air', 'Central, Solar',
       'Central, Other', 'Zoned', 'None, Window Unit(s)',
       'Ductless, Wall Unit(s)', 'Other, Wall', 'Units',
       'Units, ENERGY STAR Qualified Equipment',
       'Central Air, ENERGY STAR Qualified Equipment', 'Solar, Wall',
       'Central, Wall', 'Refrigerator', 'Central Air, Wall Unit(s)',
       'Zoned, Window Unit(s)', 'Central Air, Ductless', 'Evaporative',
       'Refrigerator, Central', 'Ductless, Wall Unit(s), Window Unit(s)',
       'None, Wall Unit(s)', 'Zoned, Wall Unit(s)', 'None, Other',
       'Central Air, Wall Unit(s), Window Unit(s)', 'Central Air, Zoned',
       'Central Air - Split', 'Zoned, Wall U

In [46]:
# Update all values in the 'cooling' column with False if it contains 'none'
df_5['cooling'] = df_5['cooling'].apply(lambda x: False if pd.notna(x) and 'none' in str(x).lower() else x)

# Update all non-False values to True
df_5['cooling'] = df_5['cooling'].apply(lambda x: True if x != False else x)

# Check the updated values in the 'parking' column
print(df_5['cooling'].unique())

[False  True]


### Basement

In [47]:
# Basement
df_5['basement'] = df_5['basement'].fillna(False)
#For all values that are none or unfinished we put false as well and make all the ones with basements True
for i in df_5.index:
    if type(df_5['basement'][i]) == str:
        if df_5['basement'][i] in ('None'):
            df_5['basement'][i] = False
        else:
            df_5['basement'][i] = True

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_5['basement'][i] = True
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_5['basement'][i] = True
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_5['basement'][i] = True
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_5['basement'][i] = True
A value is trying to be set on a copy of a slice from a Data

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)



In [48]:
print(df_5['basement'].unique())

[True False]


### School Names and Ratings

In [49]:
# Check rows where school_name is null and school_name_2 is not null
null_school_name_not_null_school_name_2 = df_5[df_5['school_name'].isnull() & df_5['school_name_2'].notnull()]

# Display the rows with null school_name and not null school_name_2 values
null_school_name_not_null_school_name_2


Unnamed: 0,city,street_address,state,zipcode,house_type,price,bathrooms,bedrooms,school_name,school_rating,...,sqft,heating,cooling,parking,basement,tax_assessed_value,tax_amount,stories,lot_size,borough


In [50]:
df_5.drop(columns=['school_name_2', 'school_rating_2', 'school_name'], inplace=True)

### Latitude and Longitude

In [51]:
null_lat = df_5[df_5['latitude'].isnull()]
null_lat

Unnamed: 0,city,street_address,state,zipcode,house_type,price,bathrooms,bedrooms,school_rating,yearBuilt,...,sqft,heating,cooling,parking,basement,tax_assessed_value,tax_amount,stories,lot_size,borough
71274,Long Island City,(Undisclosed Address),NY,11101.0,Condo,1120000.0,2.0,1.0,9.0,1920.0,...,946.0,False,False,True,False,,,1.0,,Queens


### Boroughs

In [52]:
null_borough_rows = df_5[df_5['borough'].isnull()]
null_borough_rows

Unnamed: 0,city,street_address,state,zipcode,house_type,price,bathrooms,bedrooms,school_rating,yearBuilt,...,sqft,heating,cooling,parking,basement,tax_assessed_value,tax_amount,stories,lot_size,borough
2403,District heights,Lorring Dr,MD,20747.0,Apartment,58521100.0,,,5.0,,...,,False,False,True,False,79200.0,1177.0,1.0,0.54 Acres,
10502,Queens,50-02 Midtown Tun Plz,NY,11101.0,Single Family,1500000.0,,,9.0,,...,,False,False,True,False,1895000.0,64094.0,1.0,0.44 Acres,
17976,Warwick,Cecilton-wa Rd,MD,21912.0,Single Family,370000.0,,,8.0,,...,,False,False,True,False,23900.0,276.0,1.0,49.09 Acres,
22854,Scarborough,103 Running Hill Road,ME,4074.0,Single Family,299999.0,2.0,3.0,4.0,,...,1728.0,False,False,True,False,,,1.0,,
39113,Brooklyn,24 Brighton 3 Ln,NY,11235.0,Residential,549000.0,1.0,3.0,7.0,1899.0,...,560.0,True,False,True,False,,2500.0,1.0,"1,600 sqft",
41864,Lewiston,2 Granite Street,ME,4240.0,Multiple Occupancy,155000.0,6.0,9.0,7.0,,...,5840.0,False,False,True,False,,,1.0,,
42015,Voluntown,82 Pendleton Hill Rd,CT,6384.0,Single Family,595000.0,2.0,4.0,8.0,1962.0,...,1550.0,True,True,True,True,142100.0,4151.0,1.0,2.74 Acres,
42096,Harrison,Harrison Ave,OH,45030.0,Single Family,1561205.0,,,10.0,,...,,False,False,True,False,99100.0,2229.0,1.0,17.28 Acres,
61780,Loxahatchee,E Grand National Dr,FL,33470.0,Single Family,125000.0,,,5.0,,...,,False,False,True,False,79492.0,1938.0,1.0,1.67 Acres,
68924,Landover,Landover Rd,MD,20785.0,Apartment,65850000.0,,,10.0,,...,,False,False,True,False,48200.0,716.0,1.0,"6,087 sqft",


In [53]:
result = df_5.groupby('borough', as_index=False).agg(
    count=('price', 'size'),
    mean_price=('price', 'mean')
).round()
result

Unnamed: 0,borough,count,mean_price
0,Bronx,12626,697702.0
1,Brooklyn,14156,1438336.0
2,Manhattan,4266,3980364.0
3,Queens,24995,832136.0
4,Staten Island,15280,649903.0


In [54]:
missing = pd.concat([df_5.isnull().sum(), 100 * df_5.isnull().mean()], axis=1)
missing.columns=['count', '%']
missing.sort_values(by=['count', '%'])

Unnamed: 0,count,%
city,0,0.0
street_address,0,0.0
state,0,0.0
zipcode,0,0.0
house_type,0,0.0
price,0,0.0
heating,0,0.0
cooling,0,0.0
parking,0,0.0
basement,0,0.0


In [55]:
df_6 = df_5.copy(deep=True)

### Tax Assessed Value

In [56]:
pd.set_option('display.float_format', lambda x: '%.5f' % x)

In [57]:
df_2[['price', 'tax_assessed_value']].describe()

Unnamed: 0,price,tax_assessed_value
count,74748.0,62423.0
mean,1035985.20413,1195003.24068
std,2423901.25722,8523943.62052
min,1.0,1.0
25%,480000.0,492000.0
50%,678000.0,631000.0
75%,963000.0,901000.0
max,92725017.0,1680557850.0


In [58]:
# Fill null values in 'tax_assessed_value' with the corresponding 'price' value
df_6['tax_assessed_value'] = df_6.apply(
    lambda row: row['price'] if pd.isnull(row['tax_assessed_value']) else row['tax_assessed_value'], 
    axis=1
)

# Check if there are any remaining null values in the tax_assessed_value column
print("\nNull value counts in tax_assessed_value column after imputation:")
print(df_6['tax_assessed_value'].isnull().sum())


Null value counts in tax_assessed_value column after imputation:
0


### Sqft

In [59]:
# Function to convert acres to square feet and remove 'Acres'
def convert_acres_to_sqft(value):
    if isinstance(value, str):
        if 'Acres' in value:
            # Extract the numeric part, remove commas, and convert to float
            acres = float(value.replace('Acres', '').replace(',', '').strip())
            # Convert acres to square feet (1 acre = 43560 sqft)
            sqft = acres * 43560
            return sqft
        elif 'sqft' in value:
            # Remove 'sqft' and commas, and convert to float
            sqft = float(value.replace('sqft', '').replace(',', '').strip())
            return sqft
        else:
            # Remove commas and convert to float
            value = float(value.replace(',', '').strip())
            return value
    return value

# Apply the function to the lot_size column
df_6['lot_size'] = df_6['lot_size'].apply(convert_acres_to_sqft)

In [60]:
# Convert the column to numeric, forcing errors to NaN
df_6['lot_size'] = pd.to_numeric(df_6['lot_size'], errors='coerce')

In [61]:
df_6.rename(columns={'lot_size': 'lot_size_sqft'}, inplace=True)
df_5.rename(columns={'lot_size': 'lot_size_sqft'}, inplace=True)

In [62]:
missing = pd.concat([df_6.isnull().sum(), 100 * df_6.isnull().mean()], axis=1)
missing.columns=['count', '%']
missing.sort_values(by=['count', '%'])

Unnamed: 0,count,%
city,0,0.0
street_address,0,0.0
state,0,0.0
zipcode,0,0.0
house_type,0,0.0
price,0,0.0
heating,0,0.0
cooling,0,0.0
parking,0,0.0
basement,0,0.0


### Impute Values for null values

In [63]:
import pandas as pd
import numpy as np
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

# Define the columns to be imputed, ensuring they exist in the DataFrame
numerical_cols = ['bedrooms', 'bathrooms', 'yearBuilt', 'sqft', 'lot_size_sqft', 'school_rating', 'tax_amount']

# Filter the columns that actually exist in the DataFrame
numerical_cols = [col for col in numerical_cols if col in df_6.columns]

In [64]:
def impute_by_group(df, group_cols, numerical_cols):
    df_imputed = df.copy()
    imputer = IterativeImputer(random_state=0, max_iter=20)
    
    # Group by the specified columns
    grouped = df_imputed.groupby(group_cols)
    
    # Apply imputation to each group
    for name, group in grouped:
        group_numerical = group[numerical_cols].copy()
        
        # Convert non-numeric values to NaN in numerical columns
        for col in numerical_cols:
            group_numerical[col] = pd.to_numeric(group_numerical[col], errors='coerce')
        
        # Filter the columns that actually exist in the group
        valid_cols = [col for col in numerical_cols if col in group_numerical.columns]
        
        # Skip groups that don't have enough data for imputation
        if group_numerical[valid_cols].shape[0] > 1 and group_numerical[valid_cols].shape[1] == len(valid_cols):
            # Perform the imputation on numerical data
            try:
                group_numerical_imputed = pd.DataFrame(imputer.fit_transform(group_numerical[valid_cols]), columns=valid_cols)
                
                # Update the imputed values in the original group
                df_imputed.loc[group.index, valid_cols] = group_numerical_imputed.values
            except ValueError as e:
                print(f"Error imputing group {name}: {e}")
    
    return df_imputed

In [65]:
# Apply the imputation function
df_6_imputed = impute_by_group(df_6, ['zipcode', 'house_type'], numerical_cols)



Error imputing group (148.0, 'Townhouse'): Shape of passed values is (2, 4), indices imply (2, 7)




Error imputing group (10002.0, 'Residential'): Shape of passed values is (3, 6), indices imply (3, 7)
Error imputing group (10003.0, 'Apartment'): Shape of passed values is (4, 4), indices imply (4, 7)
Error imputing group (10004.0, 'Townhouse'): Shape of passed values is (2, 3), indices imply (2, 7)
Error imputing group (10005.0, 'Single Family'): Shape of passed values is (3, 6), indices imply (3, 7)
Error imputing group (10006.0, 'Condo'): Shape of passed values is (8, 6), indices imply (8, 7)




Error imputing group (10010.0, 'Apartment'): Shape of passed values is (2, 6), indices imply (2, 7)




Error imputing group (10017.0, 'Multiple Occupancy'): Shape of passed values is (4, 6), indices imply (4, 7)
Error imputing group (10017.0, 'Single Family'): Shape of passed values is (3, 5), indices imply (3, 7)




Error imputing group (10018.0, 'Single Family'): Shape of passed values is (11, 5), indices imply (11, 7)
Error imputing group (10019.0, 'Single Family'): Shape of passed values is (8, 6), indices imply (8, 7)




Error imputing group (10029.0, 'Condo'): Shape of passed values is (16, 5), indices imply (16, 7)
Error imputing group (10029.0, 'Residential'): Shape of passed values is (2, 6), indices imply (2, 7)
Error imputing group (10029.0, 'Single Family'): Shape of passed values is (21, 5), indices imply (21, 7)




Error imputing group (10032.0, 'Residential'): Shape of passed values is (3, 6), indices imply (3, 7)
Error imputing group (10033.0, 'Apartment'): Shape of passed values is (2, 5), indices imply (2, 7)




Error imputing group (10034.0, 'Multiple Occupancy'): Shape of passed values is (3, 6), indices imply (3, 7)
Error imputing group (10035.0, 'Apartment'): Shape of passed values is (5, 5), indices imply (5, 7)




Error imputing group (10035.0, 'Residential Income'): Shape of passed values is (4, 5), indices imply (4, 7)




Error imputing group (10036.0, 'Mixed Use'): Shape of passed values is (2, 5), indices imply (2, 7)
Error imputing group (10037.0, 'Apartment'): Shape of passed values is (2, 5), indices imply (2, 7)
Error imputing group (10037.0, 'Condo'): Shape of passed values is (18, 6), indices imply (18, 7)
Error imputing group (10037.0, 'Single Family'): Shape of passed values is (6, 6), indices imply (6, 7)




Error imputing group (10039.0, 'Apartment'): Shape of passed values is (2, 4), indices imply (2, 7)
Error imputing group (10039.0, 'Residential'): Shape of passed values is (2, 4), indices imply (2, 7)
Error imputing group (10040.0, 'Apartment'): Shape of passed values is (2, 5), indices imply (2, 7)
Error imputing group (10040.0, 'Multiple Occupancy'): Shape of passed values is (2, 5), indices imply (2, 7)




Error imputing group (10280.0, 'Condo'): Shape of passed values is (20, 6), indices imply (20, 7)
Error imputing group (10282.0, 'Condo'): Shape of passed values is (4, 5), indices imply (4, 7)
Error imputing group (10301.0, 'Land'): Shape of passed values is (6, 3), indices imply (6, 7)
Error imputing group (10301.0, 'Mixed Use'): Shape of passed values is (3, 6), indices imply (3, 7)




Error imputing group (10302.0, 'Land'): Shape of passed values is (2, 3), indices imply (2, 7)
Error imputing group (10302.0, 'Mixed Use'): Shape of passed values is (2, 6), indices imply (2, 7)
Error imputing group (10302.0, 'Other'): Shape of passed values is (2, 5), indices imply (2, 7)
Error imputing group (10303.0, 'Land'): Shape of passed values is (2, 3), indices imply (2, 7)




Error imputing group (10304.0, 'Land'): Shape of passed values is (2, 3), indices imply (2, 7)
Error imputing group (10305.0, 'Land'): Shape of passed values is (5, 3), indices imply (5, 7)




Error imputing group (10306.0, 'Land'): Shape of passed values is (4, 3), indices imply (4, 7)




Error imputing group (10309.0, 'Land'): Shape of passed values is (3, 3), indices imply (3, 7)
Error imputing group (10310.0, 'Land'): Shape of passed values is (2, 3), indices imply (2, 7)
Error imputing group (10310.0, 'Mixed Use'): Shape of passed values is (2, 6), indices imply (2, 7)
Error imputing group (10310.0, 'Mobile / Manufactured'): Shape of passed values is (2, 6), indices imply (2, 7)




Error imputing group (10312.0, 'Land'): Shape of passed values is (3, 3), indices imply (3, 7)




Error imputing group (10314.0, 'Land'): Shape of passed values is (2, 3), indices imply (2, 7)




Error imputing group (10451.0, 'Condo'): Shape of passed values is (30, 6), indices imply (30, 7)
Error imputing group (10451.0, 'Land'): Shape of passed values is (4, 3), indices imply (4, 7)
Error imputing group (10452.0, 'Land'): Shape of passed values is (2, 3), indices imply (2, 7)




Error imputing group (10452.0, 'Townhouse'): Shape of passed values is (2, 5), indices imply (2, 7)
Error imputing group (10454.0, 'Condo'): Shape of passed values is (4, 5), indices imply (4, 7)




Error imputing group (10455.0, 'Condo'): Shape of passed values is (2, 6), indices imply (2, 7)
Error imputing group (10456.0, 'Condo'): Shape of passed values is (25, 6), indices imply (25, 7)




Error imputing group (10457.0, 'Land'): Shape of passed values is (6, 3), indices imply (6, 7)
Error imputing group (10458.0, 'Condo'): Shape of passed values is (4, 6), indices imply (4, 7)
Error imputing group (10458.0, 'Land'): Shape of passed values is (4, 3), indices imply (4, 7)




Error imputing group (10459.0, 'Land'): Shape of passed values is (4, 3), indices imply (4, 7)




Error imputing group (10463.0, 'Land'): Shape of passed values is (4, 3), indices imply (4, 7)




Error imputing group (10465.0, 'Land'): Shape of passed values is (2, 3), indices imply (2, 7)




Error imputing group (10466.0, 'Land'): Shape of passed values is (4, 3), indices imply (4, 7)
Error imputing group (10467.0, 'Land'): Shape of passed values is (2, 3), indices imply (2, 7)




Error imputing group (10467.0, 'Other'): Shape of passed values is (2, 6), indices imply (2, 7)




Error imputing group (10469.0, 'Land'): Shape of passed values is (3, 3), indices imply (3, 7)




Error imputing group (11001.0, 'Residential Income'): Shape of passed values is (4, 6), indices imply (4, 7)
Error imputing group (11003.0, 'Single Family'): Shape of passed values is (4, 6), indices imply (4, 7)




Error imputing group (11005.0, 'Residential'): Shape of passed values is (4, 5), indices imply (4, 7)




Error imputing group (11109.0, 'Condo'): Shape of passed values is (3, 6), indices imply (3, 7)




Error imputing group (11201.0, 'Mixed Use'): Shape of passed values is (2, 6), indices imply (2, 7)




Error imputing group (11204.0, 'Mixed Use'): Shape of passed values is (9, 6), indices imply (9, 7)
Error imputing group (11205.0, 'Residential'): Shape of passed values is (2, 6), indices imply (2, 7)




Error imputing group (11207.0, 'Mixed Use'): Shape of passed values is (3, 6), indices imply (3, 7)




Error imputing group (11209.0, 'Mixed Use'): Shape of passed values is (14, 6), indices imply (14, 7)
Error imputing group (11210.0, 'Condo'): Shape of passed values is (6, 6), indices imply (6, 7)
Error imputing group (11210.0, 'Land'): Shape of passed values is (2, 3), indices imply (2, 7)
Error imputing group (11210.0, 'Mixed Use'): Shape of passed values is (3, 6), indices imply (3, 7)




Error imputing group (11211.0, 'Mixed Use'): Shape of passed values is (7, 6), indices imply (7, 7)




Error imputing group (11212.0, 'Condo'): Shape of passed values is (2, 6), indices imply (2, 7)




Error imputing group (11213.0, 'Condo'): Shape of passed values is (4, 6), indices imply (4, 7)
Error imputing group (11213.0, 'Residential Income'): Shape of passed values is (2, 6), indices imply (2, 7)
Error imputing group (11214.0, 'Mixed Use'): Shape of passed values is (17, 6), indices imply (17, 7)




Error imputing group (11217.0, 'Residential'): Shape of passed values is (2, 5), indices imply (2, 7)
Error imputing group (11218.0, 'Mixed Use'): Shape of passed values is (3, 6), indices imply (3, 7)




Error imputing group (11219.0, 'Mixed Use'): Shape of passed values is (13, 6), indices imply (13, 7)




Error imputing group (11220.0, 'Mixed Use'): Shape of passed values is (10, 6), indices imply (10, 7)




Error imputing group (11221.0, 'Condo'): Shape of passed values is (7, 5), indices imply (7, 7)
Error imputing group (11221.0, 'Mixed Use'): Shape of passed values is (2, 6), indices imply (2, 7)
Error imputing group (11222.0, 'Mobile / Manufactured'): Shape of passed values is (6, 6), indices imply (6, 7)




Error imputing group (11223.0, 'Mixed Use'): Shape of passed values is (7, 6), indices imply (7, 7)




Error imputing group (11224.0, 'Mixed Use'): Shape of passed values is (4, 6), indices imply (4, 7)
Error imputing group (11224.0, 'Residential Income'): Shape of passed values is (2, 6), indices imply (2, 7)




Error imputing group (11225.0, 'Mixed Use'): Shape of passed values is (2, 6), indices imply (2, 7)
Error imputing group (11226.0, 'Mixed Use'): Shape of passed values is (4, 6), indices imply (4, 7)




Error imputing group (11228.0, 'Condo'): Shape of passed values is (3, 6), indices imply (3, 7)
Error imputing group (11228.0, 'Mixed Use'): Shape of passed values is (2, 6), indices imply (2, 7)
Error imputing group (11229.0, 'Mixed Use'): Shape of passed values is (5, 6), indices imply (5, 7)




Error imputing group (11231.0, 'Mixed Use'): Shape of passed values is (2, 6), indices imply (2, 7)
Error imputing group (11232.0, 'Mixed Use'): Shape of passed values is (4, 6), indices imply (4, 7)




Error imputing group (11233.0, 'Condo'): Shape of passed values is (3, 6), indices imply (3, 7)
Error imputing group (11234.0, 'Mixed Use'): Shape of passed values is (4, 6), indices imply (4, 7)




Error imputing group (11236.0, 'Mixed Use'): Shape of passed values is (3, 6), indices imply (3, 7)
Error imputing group (11237.0, 'Apartment'): Shape of passed values is (2, 6), indices imply (2, 7)




Error imputing group (11239.0, 'Condo'): Shape of passed values is (3, 6), indices imply (3, 7)
Error imputing group (11239.0, 'Single Family'): Shape of passed values is (41, 5), indices imply (41, 7)




Error imputing group (11354.0, 'Condo'): Shape of passed values is (6, 5), indices imply (6, 7)
Error imputing group (11355.0, 'Land'): Shape of passed values is (2, 3), indices imply (2, 7)




Error imputing group (11366.0, 'Condo'): Shape of passed values is (5, 5), indices imply (5, 7)




Error imputing group (11370.0, 'Condo'): Shape of passed values is (17, 6), indices imply (17, 7)




Error imputing group (11378.0, 'Condo'): Shape of passed values is (7, 6), indices imply (7, 7)




Error imputing group (11385.0, 'Mixed Use'): Shape of passed values is (2, 6), indices imply (2, 7)
Error imputing group (11411.0, 'Residential Income'): Shape of passed values is (2, 6), indices imply (2, 7)




Error imputing group (11414.0, 'Land'): Shape of passed values is (5, 3), indices imply (5, 7)




Error imputing group (11416.0, 'Residential'): Shape of passed values is (4, 6), indices imply (4, 7)
Error imputing group (11416.0, 'Residential Income'): Shape of passed values is (5, 6), indices imply (5, 7)




Error imputing group (11423.0, 'Condo'): Shape of passed values is (15, 5), indices imply (15, 7)




Error imputing group (11691.0, 'Land'): Shape of passed values is (2, 3), indices imply (2, 7)




Error imputing group (11692.0, 'Condo'): Shape of passed values is (15, 6), indices imply (15, 7)




Error imputing group (11694.0, 'Other'): Shape of passed values is (2, 6), indices imply (2, 7)
Error imputing group (11697.0, 'Single Family'): Shape of passed values is (4, 5), indices imply (4, 7)
Error imputing group (11755.0, 'Condo'): Shape of passed values is (2, 4), indices imply (2, 7)
Error imputing group (12764.0, 'Single Family'): Shape of passed values is (2, 2), indices imply (2, 7)




In [66]:
# Check null values after imputation
print("\nNull value counts after imputation:")
print(df_6_imputed.isnull().sum())


Null value counts after imputation:
city                    0
street_address          0
state                   0
zipcode                 0
house_type              0
price                   0
bathrooms             353
bedrooms              459
school_rating          14
yearBuilt             240
latitude                1
longitude               1
sqft                  238
heating                 0
cooling                 0
parking                 0
basement                0
tax_assessed_value      0
tax_amount            272
stories                 0
lot_size_sqft         319
borough                12
dtype: int64


In [67]:
missing = pd.concat([df_6_imputed.isnull().sum(), 100 * df_6_imputed.isnull().mean()], axis=1)
missing.columns=['count', '%']
missing.sort_values(by=['count', '%'])

Unnamed: 0,count,%
city,0,0.0
street_address,0,0.0
state,0,0.0
zipcode,0,0.0
house_type,0,0.0
price,0,0.0
heating,0,0.0
cooling,0,0.0
parking,0,0.0
basement,0,0.0


### Lot_size_sqft

In [68]:
# Filter the DataFrame to get rows with lot_size_sqft under 700
rows_with_small_lot_size = df_6_imputed[df_6_imputed['lot_size_sqft'] < 700]

# Display the filtered rows
rows_with_small_lot_size

Unnamed: 0,city,street_address,state,zipcode,house_type,price,bathrooms,bedrooms,school_rating,yearBuilt,...,sqft,heating,cooling,parking,basement,tax_assessed_value,tax_amount,stories,lot_size_sqft,borough
148,Bronx,5291 Independence Ave,NY,10471.00000,Single Family,4200000.00000,5.50000,7.00000,7.00000,2003.00000,...,7692.00000,False,False,True,False,2804000.00000,24648.00000,2.00000,422.00000,Bronx
273,Broad Channel,39 W 14th Rd,NY,11693.00000,Residential,560000.00000,2.00000,3.00000,7.00000,1920.00000,...,1885.54494,True,False,True,True,448000.00000,4663.00000,1.00000,0.11000,Queens
290,Jamaica,9978 164th Rd,NY,11414.00000,Residential,565000.00000,2.00000,3.00000,8.00000,2003.00000,...,2017.17097,True,True,True,True,605000.00000,6215.05240,1.00000,0.07000,Queens
307,Far Rockaway,3229 Mott Ave,NY,11691.00000,Residential,445000.00000,2.00000,3.00000,6.00000,1955.00000,...,2199.97367,True,False,True,True,363000.00000,3840.00000,1.00000,0.13000,Queens
384,Jamaica,14939 Huxley St,NY,11422.00000,Residential Income,840000.00000,5.00000,6.00000,2.00000,1960.00000,...,2586.54328,True,False,True,True,752000.00000,6935.00000,1.00000,0.13000,Queens
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
75533,Glendale,7705 82nd St,NY,11385.00000,Residential,685000.00000,2.00000,3.00000,9.00000,1930.00000,...,1430.95218,True,False,True,True,616000.00000,6100.00000,1.00000,0.04000,Queens
75547,Glendale,8425 Doran Ave,NY,11385.00000,Residential,705000.00000,2.00000,2.00000,9.00000,1940.00000,...,1000.00000,True,False,True,True,725000.00000,5996.00000,1.00000,0.05000,Queens
75551,Flushing,6680 79th Pl,NY,11379.00000,Residential Income,1238000.00000,4.00000,7.00000,6.00000,1970.00000,...,1842.97574,True,False,True,True,1475000.00000,9319.00000,1.00000,0.07000,Queens
75584,Forest Hills,67-54 Ingram St,NY,11375.00000,Residential,928000.00000,3.00000,3.00000,8.00000,1935.00000,...,1406.42485,True,False,True,True,928000.00000,5850.00000,1.00000,0.05000,Queens


In [69]:
# Make the values of lot_size_sqft that are under 700 to be null
df_6_imputed.loc[df_6_imputed['lot_size_sqft'] < 700, 'lot_size_sqft'] = np.nan

# Calculate the mean lot_size_sqft for each zipcode and house_type
zipcode_house_type_mean = df_6_imputed.groupby(['zipcode', 'house_type'])['lot_size_sqft'].transform('mean')

# Function to fill missing lot_size_sqft with the mean of the same zipcode and house_type
def fill_lot_size(row):
    if pd.isnull(row['lot_size_sqft']):
        return zipcode_house_type_mean[row.name]
    else:
        return row['lot_size_sqft']

# Apply the function to the DataFrame
df_6_imputed['lot_size_sqft'] = df_6_imputed.apply(fill_lot_size, axis=1)

# Display the rows with updated lot_size_sqft values
print("\nNull value counts in lot_size_sqft column after imputation:")
print(df_6_imputed['lot_size_sqft'].isnull().sum())


Null value counts in lot_size_sqft column after imputation:
326


### Sqft

In [70]:
# Filter the DataFrame to get rows with sqft under 500
rows_with_small_sqft = df_6_imputed[df_6_imputed['sqft'] < 500]

# Display the filtered rows
rows_with_small_sqft

Unnamed: 0,city,street_address,state,zipcode,house_type,price,bathrooms,bedrooms,school_rating,yearBuilt,...,sqft,heating,cooling,parking,basement,tax_assessed_value,tax_amount,stories,lot_size_sqft,borough
159,New York,90 Park Ter E APT 4-G,NY,10034.00000,Condo,305000.00000,1.00000,1.00000,5.00000,1948.00000,...,-17835.39886,False,True,True,False,305000.00000,0.00000,1.00000,12248.86584,Manhattan
174,New York,77 Park Ter APT D19,NY,10034.00000,Condo,244500.00000,1.00000,1.55532,5.00000,1939.00000,...,475.00000,False,False,True,False,244500.00000,96422.33828,1.00000,12372.75987,Manhattan
254,New York,23 E 10th St APT 4E,NY,10003.00000,Condo,590000.00000,1.00000,1.00000,7.00000,1923.00000,...,450.00000,False,False,True,False,590000.00000,0.00000,6.00000,8074.27638,Manhattan
409,Rockaway park,171 Beach 119th St,NY,11694.00000,Single Family,695000.00000,1.70546,3.70966,7.00000,1942.28646,...,-66.32708,False,False,True,False,87000.00000,878.00000,1.00000,2000.00000,Queens
436,Far Rockaway,2514 Brookhaven Ave,NY,11691.00000,Single Family,295000.00000,1.00000,3.00000,2.00000,1910.00000,...,468.00000,True,True,True,False,179000.00000,1365.00000,1.00000,1324.00000,Queens
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
75107,Forest Hills,6741 Burns St APT 610,NY,11375.00000,Condo,378000.00000,1.00000,2.00000,8.00000,1959.00000,...,-12719.36330,False,False,True,False,378000.00000,0.00000,1.00000,20714.49317,Queens
75143,Queens,99-45 67th Rd UNIT 523,NY,11375.00000,Condo,225000.00000,1.00000,1.26453,8.00000,1938.00000,...,-12719.35652,True,True,True,False,225000.00000,0.00000,6.00000,20714.49362,Queens
75220,Forest Hills,144-40 71st Ave #3B,NY,11375.00000,Condo,335000.00000,1.13791,1.38982,8.00000,1949.00000,...,-12719.36009,False,False,True,False,335000.00000,0.00000,1.00000,20714.49338,Queens
75304,Rego Park,6486 Wetherole St APT 6D,NY,11374.00000,Condo,758596.00000,1.22670,1.69686,7.00000,1957.74487,...,-39198.14774,False,False,True,False,161755.00000,359.00000,7.00000,34703.82210,Queens


In [71]:
# Make the values of lot_size_sqft that are under 500 to be null
df_6_imputed.loc[df_6_imputed['sqft'] < 500, 'sqft'] = np.nan

# Calculate the mean lot_size_sqft for each zipcode and house_type
zipcode_house_type_mean = df_6_imputed.groupby(['zipcode', 'house_type'])['sqft'].transform('mean')

# Function to fill missing lot_size_sqft with the mean of the same zipcode and house_type
def fill_lot_size(row):
    if pd.isnull(row['sqft']):
        return zipcode_house_type_mean[row.name]
    else:
        return row['sqft']

# Apply the function to the DataFrame
df_6_imputed['sqft'] = df_6_imputed.apply(fill_lot_size, axis=1)

# Display the rows with updated lot_size_sqft values
print("\nNull value counts in sqft column after imputation:")
print(df_6_imputed['sqft'].isnull().sum())


Null value counts in sqft column after imputation:
164


In [72]:
df_7 = df_6_imputed.dropna()

In [73]:
missing = pd.concat([df_7.isnull().sum(), 100 * df_7.isnull().mean()], axis=1)
missing.columns=['count', '%']
missing.sort_values(by=['count', '%'])

Unnamed: 0,count,%
city,0,0.0
street_address,0,0.0
state,0,0.0
zipcode,0,0.0
house_type,0,0.0
price,0,0.0
bathrooms,0,0.0
bedrooms,0,0.0
school_rating,0,0.0
yearBuilt,0,0.0


In [74]:
df_7.info()

<class 'pandas.core.frame.DataFrame'>
Index: 70578 entries, 0 to 75629
Data columns (total 22 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   city                70578 non-null  object 
 1   street_address      70578 non-null  object 
 2   state               70578 non-null  object 
 3   zipcode             70578 non-null  object 
 4   house_type          70578 non-null  object 
 5   price               70578 non-null  float64
 6   bathrooms           70578 non-null  float64
 7   bedrooms            70578 non-null  float64
 8   school_rating       70578 non-null  float64
 9   yearBuilt           70578 non-null  float64
 10  latitude            70578 non-null  float64
 11  longitude           70578 non-null  float64
 12  sqft                70578 non-null  float64
 13  heating             70578 non-null  bool   
 14  cooling             70578 non-null  bool   
 15  parking             70578 non-null  bool   
 16  basement 

In [75]:
df_8 = df_7.drop_duplicates()
df_8.info()

<class 'pandas.core.frame.DataFrame'>
Index: 58860 entries, 0 to 75629
Data columns (total 22 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   city                58860 non-null  object 
 1   street_address      58860 non-null  object 
 2   state               58860 non-null  object 
 3   zipcode             58860 non-null  object 
 4   house_type          58860 non-null  object 
 5   price               58860 non-null  float64
 6   bathrooms           58860 non-null  float64
 7   bedrooms            58860 non-null  float64
 8   school_rating       58860 non-null  float64
 9   yearBuilt           58860 non-null  float64
 10  latitude            58860 non-null  float64
 11  longitude           58860 non-null  float64
 12  sqft                58860 non-null  float64
 13  heating             58860 non-null  bool   
 14  cooling             58860 non-null  bool   
 15  parking             58860 non-null  bool   
 16  basement 

In [76]:
# Round the bedrooms values to the nearest 0.5
df_8['bedrooms'] = df_8['bedrooms'].apply(lambda x: round(x * 2) / 2)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_8['bedrooms'] = df_8['bedrooms'].apply(lambda x: round(x * 2) / 2)


In [77]:
# Round the bathrooms values to the nearest 0.5
df_8['bathrooms'] = df_8['bathrooms'].apply(lambda x: round(x * 2) / 2)

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df_8['bathrooms'] = df_8['bathrooms'].apply(lambda x: round(x * 2) / 2)


In [78]:
df_8.head()

Unnamed: 0,city,street_address,state,zipcode,house_type,price,bathrooms,bedrooms,school_rating,yearBuilt,...,sqft,heating,cooling,parking,basement,tax_assessed_value,tax_amount,stories,lot_size_sqft,borough
0,New York,60 Terrace View Ave,NY,10463.0,Residential,799999.0,2.0,5.0,4.0,1920.0,...,1889.0,True,False,True,True,711000.0,5096.0,1.0,2845.69599,Bronx
1,Bronx,625 W 246th St,NY,10471.0,Single Family,3995000.0,8.0,8.0,10.0,1940.0,...,7000.0,False,True,True,False,1937000.0,13941.0,1.0,12632.4,Bronx
2,Bronx,716 W 231st St,NY,10463.0,Single Family,1495000.0,3.0,4.0,10.0,1920.0,...,4233.0,False,False,True,False,2341000.0,12253.0,2.0,18295.2,Bronx
3,Bronx,750 W 232nd St,NY,10463.0,Single Family,3450000.0,6.0,5.0,10.0,1950.0,...,7000.0,False,True,True,False,3011000.0,19472.0,2.0,11325.6,Bronx
4,Bronx,632 W 230th St,NY,10463.0,Single Family,1790000.0,6.0,5.0,10.0,2020.0,...,4079.81465,False,True,True,False,1790000.0,19016.38407,1.0,7392.42898,Bronx


## Save df

In [79]:
# save updated dataframe
df_8.to_csv('/Users/heatheradler/Documents/GitHub/Springboard/Springboard_Projects/Capstone 3/df_dw.csv', index=False)

In [80]:
df_8.info()

<class 'pandas.core.frame.DataFrame'>
Index: 58860 entries, 0 to 75629
Data columns (total 22 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   city                58860 non-null  object 
 1   street_address      58860 non-null  object 
 2   state               58860 non-null  object 
 3   zipcode             58860 non-null  object 
 4   house_type          58860 non-null  object 
 5   price               58860 non-null  float64
 6   bathrooms           58860 non-null  float64
 7   bedrooms            58860 non-null  float64
 8   school_rating       58860 non-null  float64
 9   yearBuilt           58860 non-null  float64
 10  latitude            58860 non-null  float64
 11  longitude           58860 non-null  float64
 12  sqft                58860 non-null  float64
 13  heating             58860 non-null  bool   
 14  cooling             58860 non-null  bool   
 15  parking             58860 non-null  bool   
 16  basement 