# 1. Data Wrangling <a id="data_wrangling"></a>

<a id="contents"></a>
# Table of Contents  
1. [Data Wrangling](#data_wrangling)
    - [1.1 Introduction](#introduction)
    - [1.2 Imports](#imports)
    - [1.3 Load the Data](#load)
    - [1.4 Dataset Cleaning](#cleaning)

## 1.1 Introduction<a id="introduction"></a>

### Problem
Real estate investors need to identify profitable investment opportunities in dynamic markets. Understanding market trends and segmenting opportunities based on risk and return profiles is crucial for optimizing investment strategies. The goal of this project is to maximize investment returns by leveraging data-driven approaches to identify undervalued properties, forecast market trends, and optimize portfolio allocations.


### Clients
The findings of this study will be of interest to real estate investors, portfolio managers, and real estate agents and brokers, who can use the project's insights into market trends to provide more accurate, data-driven recommendations.


### Data
The dataset for this project was downloaded from Kaggle and has been filtered and cleaned to include housing data from New York, extracted via the Zillow API. This comprehensive dataset provides detailed information about various properties, capturing a wide range of features relevant to real estate analysis. The primary goal of this project is to develop a predictive model that analyzes housing data to forecast property prices accurately. By leveraging this data, the model aims to provide valuable insights into the New York housing market, potentially aiding buyers, sellers, and investors in making informed decisions.

Link to Kaggle dataset: https://www.kaggle.com/datasets/ericpierce/new-york-housing-zillow-api


## 1.2 Imports<a id="imports"></a>

In [1]:
import numpy as np 
import pandas as pd 
import matplotlib.pyplot as plt 
import seaborn as sns 
import os
import csv
from tqdm.notebook import tqdm
from datetime import datetime, timezone

## 1.3 Load the Data<a id="load"></a>

To begin, we focus on the regions within New York State (NY) in the US. After loading the dataset into the notebook, we filter out any region/location that is not in NY.

In [2]:
df = pd.read_csv('/Users/heatheradler/Documents/GitHub/Springboard/Springboard_Projects/Capstone 3/Datasets/newyork_housing.csv')



In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 75630 entries, 0 to 75629
Columns: 1507 entries, address/city to zpid
dtypes: bool(13), float64(440), int64(2), object(1052)
memory usage: 863.0+ MB
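
Since only a small subset of the 1507 columns is kept later on, one option is to restrict the load itself with the `usecols` parameter of `pd.read_csv`, which may reduce the memory footprint. The sketch below is illustrative only: it reads a tiny in-memory CSV with made-up values rather than the real file, and the column list is an assumption.

```python
import io
import pandas as pd

# Tiny in-memory stand-in for the real CSV (hypothetical values)
csv_text = (
    "address/city,price,bedrooms,photos/0/url\n"
    "Bronx,1495000,4,http://example.com/a.jpg\n"
    "Flushing,825000,3,http://example.com/b.jpg\n"
)

# Only these columns are parsed; everything else is skipped at read time
wanted = ["address/city", "price", "bedrooms"]
small_df = pd.read_csv(io.StringIO(csv_text), usecols=wanted)

print(small_df.columns.tolist())
```

With the real file, the same call would take the file path in place of the `io.StringIO` object.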


In [4]:
df.head()

Unnamed: 0,address/city,address/community,address/neighborhood,address/state,address/streetAddress,address/subdivision,address/zipcode,bathrooms,bedrooms,currency,...,schools/2/link,schools/2/name,schools/2/rating,schools/2/size,schools/2/studentsPerTeacher,schools/2/totalCount,schools/2/type,url,yearBuilt,zpid
0,New York,,,NY,60 Terrace View Ave,,10463.0,2.0,5.0,USD,...,,,,,,,,https://www.zillow.com/homedetails/60-Terrace-...,1920.0,31554050.0
1,Bronx,,,NY,625 W 246th St,,10471.0,8.0,8.0,USD,...,,,,,,,,https://www.zillow.com/homedetails/625-W-246th...,1940.0,29854120.0
2,Bronx,,,NY,716 W 231st St,,10463.0,3.0,4.0,USD,...,,,,,,,,https://www.zillow.com/homedetails/716-W-231st...,1920.0,29851860.0
3,Bronx,,,NY,750 W 232nd St,,10463.0,6.0,5.0,USD,...,,,,,,,,https://www.zillow.com/homedetails/750-W-232nd...,1950.0,29851860.0
4,Bronx,,,NY,632 W 230th St,,10463.0,6.0,5.0,USD,...,,,,,,,,https://www.zillow.com/homedetails/632-W-230th...,2020.0,2077107000.0


## 1.4 Dataset Cleaning<a id="cleaning"></a>

**The dataset has 1507 columns, many of which are unnecessary for our purposes. Because there are so many columns, we first identified keywords that appear in many unneeded column names and used them to reduce the column count and make the dataset easier to look through. From there, we identified the target and feature columns needed to create our model and filtered the dataset accordingly. For readability and efficiency, we renamed most of the columns we kept. Additionally, we removed null values.**

In [5]:
# List of keywords to remove
remove_keywords = ['photo', 'url', 'History', 'link', 'zpid', 'level', 'Fact']

# Remove columns with any of the keywords in their names
df_1 = df.drop(columns=[col for col in df.columns if any(keyword in col for keyword in remove_keywords)])

# Display the result
print("\nDataFrame after removing columns with 'photo', 'url', 'History':")
print(df_1)


DataFrame after removing columns with 'photo', 'url', 'History':
               address/city  address/community address/neighborhood  \
0                  New York                NaN                  NaN   
1                     Bronx                NaN                  NaN   
2                     Bronx                NaN                  NaN   
3                     Bronx                NaN                  NaN   
4                     Bronx                NaN                  NaN   
...                     ...                ...                  ...   
75625              Flushing                NaN                  NaN   
75626  Forest Hills Gardens                NaN                  NaN   
75627  Forest Hills Gardens                NaN                  NaN   
75628              Flushing                NaN                  NaN   
75629              Flushing                NaN                  NaN   

      address/state address/streetAddress address/subdivision  \
0               

In [6]:
# Keep only relevant columns
columns_to_keep = ['address/city', 'address/streetAddress', 'address/state', 'address/zipcode', 'resoFactsStats/atAGlanceFacts/0/factValue', 'price', 'bathrooms', 'bedrooms', 'schools/2/name', 'schools/2/rating', 'yearBuilt', 'latitude', 'longitude', 'livingArea']

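# Select from the original df rather than df_1: the keyword filter above also
# removed the house-type column, whose name contains 'Fact'
df_1 = df.loc[:, columns_to_keep]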

# Display the result
print("\nDataFrame after keeping only specified columns:")
print(df_1)


DataFrame after keeping only specified columns:
               address/city address/streetAddress address/state  \
0                  New York   60 Terrace View Ave            NY   
1                     Bronx        625 W 246th St            NY   
2                     Bronx        716 W 231st St            NY   
3                     Bronx        750 W 232nd St            NY   
4                     Bronx        632 W 230th St            NY   
...                     ...                   ...           ...   
75625              Flushing         6829 Manse St            NY   
75626  Forest Hills Gardens       82 Greenway Ter            NY   
75627  Forest Hills Gardens       86 Greenway Ter            NY   
75628              Flushing         8913 70th Ave            NY   
75629              Flushing         7049 Manse St            NY   

       address/zipcode resoFactsStats/atAGlanceFacts/0/factValue      price  \
0              10463.0                               Residential   

In [7]:
# Dictionary of columns to rename
columns_to_rename = {
    'address/city': 'city',
    'address/streetAddress': 'street_address',
    'address/state': 'state',
    'address/zipcode': 'zipcode',
    'resoFactsStats/atAGlanceFacts/0/factValue': 'house_type',
    'schools/2/name': 'school_name',
    'schools/2/rating': 'school_rating',
    'livingArea': 'sqft',
}

# Rename the specified columns
df_2 = df_1.rename(columns=columns_to_rename)

# Display the result
print("\nDataFrame after renaming specified columns:")
print(df_2)


DataFrame after renaming specified columns:
                       city       street_address state  zipcode  \
0                  New York  60 Terrace View Ave    NY  10463.0   
1                     Bronx       625 W 246th St    NY  10471.0   
2                     Bronx       716 W 231st St    NY  10463.0   
3                     Bronx       750 W 232nd St    NY  10463.0   
4                     Bronx       632 W 230th St    NY  10463.0   
...                     ...                  ...   ...      ...   
75625              Flushing        6829 Manse St    NY  11375.0   
75626  Forest Hills Gardens      82 Greenway Ter    NY  11375.0   
75627  Forest Hills Gardens      86 Greenway Ter    NY  11375.0   
75628              Flushing        8913 70th Ave    NY  11375.0   
75629              Flushing        7049 Manse St    NY  11375.0   

          house_type      price  bathrooms  bedrooms  \
0        Residential   799999.0        2.0       5.0   
1      Single Family  3995000.0       

In [8]:
df_2.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 75630 entries, 0 to 75629
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   city            75629 non-null  object 
 1   street_address  75629 non-null  object 
 2   state           75629 non-null  object 
 3   zipcode         75611 non-null  float64
 4   house_type      75344 non-null  object 
 5   price           75591 non-null  float64
 6   bathrooms       56577 non-null  float64
 7   bedrooms        56166 non-null  float64
 8   school_name     55543 non-null  object 
 9   school_rating   55506 non-null  float64
 10  yearBuilt       69898 non-null  float64
 11  latitude        75604 non-null  float64
 12  longitude       75604 non-null  float64
 13  sqft            66419 non-null  float64
dtypes: float64(9), object(5)
memory usage: 8.1+ MB


In [9]:
# Count the number of null values in each column
null_counts = df_2.isnull().sum()

# Display the result
print("Number of null values in each column:")
null_counts

Number of null values in each column:


city                  1
street_address        1
state                 1
zipcode              19
house_type          286
price                39
bathrooms         19053
bedrooms          19464
school_name       20087
school_rating     20124
yearBuilt          5732
latitude             26
longitude            26
sqft               9211
dtype: int64

**After validating the null counts, we noted that there were many more null school rating values than null zipcodes. To remediate this, we calculated the mean school rating for each zipcode and assigned it to every property in that zipcode. This simplifies the data and provides a more general view of school quality within each zipcode. As school quality is often a significant factor in real estate pricing, using aggregated data can help capture this relationship more effectively.**

In [10]:
# Group by zipcode and calculate the average school rating
avg_school_rating = df_2.groupby('zipcode')['school_rating'].mean().reset_index()

# Drop school_name (unused) and the per-row school_rating (replaced by the zipcode average)
df_2 = df_2.drop(columns=['school_name', 'school_rating'])

# Merge the average school ratings back onto each row by zipcode
df_2 = df_2.merge(avg_school_rating, on='zipcode', how='left')

# Display the result
print("DataFrame with average school ratings and without the school_name column:")
df_2

DataFrame with average school ratings and without the school_name column:


Unnamed: 0,city,street_address,state,zipcode,house_type,price,bathrooms,bedrooms,yearBuilt,latitude,longitude,sqft,school_rating
0,New York,60 Terrace View Ave,NY,10463.0,Residential,799999.0,2.0,5.0,1920.0,40.877743,-73.910866,1889.0,
1,Bronx,625 W 246th St,NY,10471.0,Single Family,3995000.0,8.0,8.0,1940.0,40.892689,-73.910667,7000.0,
2,Bronx,716 W 231st St,NY,10463.0,Single Family,1495000.0,3.0,4.0,1920.0,40.883419,-73.918106,4233.0,
3,Bronx,750 W 232nd St,NY,10463.0,Single Family,3450000.0,6.0,5.0,1950.0,40.885033,-73.917793,7000.0,
4,Bronx,632 W 230th St,NY,10463.0,Single Family,1790000.0,6.0,5.0,2020.0,40.881702,-73.914185,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
75625,Flushing,6829 Manse St,NY,11375.0,Single Family,825000.0,2.0,3.0,1920.0,40.714203,-73.855263,2417.0,5.372685
75626,Forest Hills Gardens,82 Greenway Ter,NY,11375.0,Townhouse,2704000.0,6.0,6.0,1925.0,40.717163,-73.843124,6085.0,5.372685
75627,Forest Hills Gardens,86 Greenway Ter,NY,11375.0,Townhouse,2750000.0,5.0,6.0,1925.0,40.717052,-73.843025,4564.0,5.372685
75628,Flushing,8913 70th Ave,NY,11375.0,Single Family,935000.0,,,1930.0,40.709549,-73.854385,1216.0,5.372685
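
The per-zipcode averaging above can also be sketched with `groupby(...).transform('mean')`, which returns one value per original row and so avoids the separate merge. This is an alternative illustration on made-up toy data, not the approach used in this notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two zipcodes, some ratings missing
toy = pd.DataFrame({
    "zipcode": [10463.0, 10463.0, 11375.0, 11375.0, 11375.0],
    "school_rating": [4.0, np.nan, 5.0, 6.0, np.nan],
})

# transform('mean') is aligned to the original index, so no merge is needed;
# NaN ratings are skipped when the group mean is computed
toy["school_rating"] = toy.groupby("zipcode")["school_rating"].transform("mean")

print(toy)
```

A zipcode in which every rating is missing still ends up with a null average, so some null school ratings can remain after this step.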


In [11]:
# Count the number of null values in each column
null_counts = df_2.isnull().sum()

# Display the result
print("Number of null values in each column:")
null_counts

Number of null values in each column:


city                  1
street_address        1
state                 1
zipcode              19
house_type          286
price                39
bathrooms         19053
bedrooms          19464
yearBuilt          5732
latitude             26
longitude            26
sqft               9211
school_rating      2213
dtype: int64

In [12]:
# Remove all rows with any null values
df_3 = df_2.dropna()

# Reset the index
df_3.reset_index(drop=True, inplace=True)

# Display the result
print("DataFrame after removing all rows with null values:")
df_3

DataFrame after removing all rows with null values:


Unnamed: 0,city,street_address,state,zipcode,house_type,price,bathrooms,bedrooms,yearBuilt,latitude,longitude,sqft,school_rating
0,New York,24 Cooper St #5CD,NY,10034.0,Condo,230000.0,2.0,3.0,1925.0,40.867687,-73.924606,994.0,1.551724
1,New York,1825 Riverside Dr APT 2D,NY,10034.0,Condo,599000.0,1.0,2.0,1926.0,40.866562,-73.930374,1150.0,1.551724
2,New York,420 W 206th St #6B,NY,10034.0,Residential,325000.0,1.0,1.0,1946.0,40.863277,-73.918770,800.0,1.551724
3,New York,57 Park Ter W #WIC,NY,10034.0,Condo,369000.0,1.0,1.0,1937.0,40.871239,-73.917900,750.0,1.551724
4,Manhattan,75 Park Ter E #D70,NY,10034.0,Condo,629000.0,1.0,2.0,1939.0,40.871101,-73.916397,950.0,1.551724
...,...,...,...,...,...,...,...,...,...,...,...,...,...
46175,Forest Hills,93-19 71st Ave,NY,11375.0,Single Family,1255000.0,2.0,4.0,1930.0,40.712009,-73.850281,2200.0,5.372685
46176,Flushing,6829 Manse St,NY,11375.0,Single Family,825000.0,2.0,3.0,1920.0,40.714203,-73.855263,2417.0,5.372685
46177,Forest Hills Gardens,82 Greenway Ter,NY,11375.0,Townhouse,2704000.0,6.0,6.0,1925.0,40.717163,-73.843124,6085.0,5.372685
46178,Forest Hills Gardens,86 Greenway Ter,NY,11375.0,Townhouse,2750000.0,5.0,6.0,1925.0,40.717052,-73.843025,4564.0,5.372685


In [13]:
df_3.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 46180 entries, 0 to 46179
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   city            46180 non-null  object 
 1   street_address  46180 non-null  object 
 2   state           46180 non-null  object 
 3   zipcode         46180 non-null  float64
 4   house_type      46180 non-null  object 
 5   price           46180 non-null  float64
 6   bathrooms       46180 non-null  float64
 7   bedrooms        46180 non-null  float64
 8   yearBuilt       46180 non-null  float64
 9   latitude        46180 non-null  float64
 10  longitude       46180 non-null  float64
 11  sqft            46180 non-null  float64
 12  school_rating   46180 non-null  float64
dtypes: float64(9), object(4)
memory usage: 4.6+ MB
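
Dropping every row with any null reduced the dataset from 75630 to 46180 rows. Before committing to such a step, it can help to see which columns drive the loss. The following is a minimal sketch on a made-up toy frame (the values are hypothetical) that reports the share of nulls per column and the number of rows that survive `dropna()`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df_2
toy = pd.DataFrame({
    "price": [500000.0, 750000.0, np.nan, 620000.0],
    "bedrooms": [2.0, np.nan, 3.0, 1.0],
    "sqft": [900.0, 1200.0, 1500.0, np.nan],
})

# Share of missing values per column, largest first
null_share = toy.isnull().mean().sort_values(ascending=False)

# Rows that survive dropping any row with at least one null
rows_before = len(toy)
rows_after = len(toy.dropna())

print(null_share)
print(f"kept {rows_after} of {rows_before} rows")
```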


In [14]:
non_string_count = df_3['house_type'].apply(lambda x: not isinstance(x, str)).sum()

# Display the result
print("Number of non-string values in the 'house_type' column:", non_string_count)

Number of non-string values in the 'house_type' column: 0


**We validated the unique values in column house_type as this is a categorical feature and we expect there to be a finite and relatively small set of values. Through this, we identified that there were several date values in the column and removed them to ensure accuracy.**

In [15]:
distinct_house_types = df_3['house_type'].unique()
distinct_house_types

array(['Condo', 'Residential', 'Single Family', 'Multiple Occupancy',
       'Residential Income', 'Apartment', 'Other', 'Available Now',
       'Townhouse', 'Mobile / Manufactured', 'Mon Feb 1 2021',
       'Mon Feb 15 2021', 'Wed Feb 10 2021', 'Vacant Land',
       'Tue Feb 2 2021', 'Sun Jan 24 2021'], dtype=object)

In [16]:
count_2021 = df_3['house_type'].astype(str).str.contains('2021').sum()

# Display the result
print("Number of values containing '2021' in the 'house_type' column:", count_2021)

Number of values containing '2021' in the 'house_type' column: 11


In [17]:
# Remove rows containing "2021" in the 'house_type' column
df_3 = df_3[~df_3['house_type'].astype(str).str.contains('2021', na=False)]

# Remove rows with 'Available Now' in the 'house_type' column
df_3 = df_3[df_3['house_type'] != 'Available Now']
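
Matching on the substring '2021' works for the values seen here, but it would miss date strings from other years. A slightly more general check, sketched below on a hypothetical sample, matches the weekday/month/day/year shape with a regular expression. The pattern is an assumption based on the five date values listed above.

```python
import pandas as pd

# Hypothetical sample mixing real categories with the stray values seen above
house_type = pd.Series(
    ["Condo", "Mon Feb 1 2021", "Townhouse", "Available Now", "Sun Jan 24 2021"]
)

# Weekday, month abbreviation, day, four-digit year (e.g. "Mon Feb 1 2021")
date_like = r"^(?:Mon|Tue|Wed|Thu|Fri|Sat|Sun) [A-Z][a-z]{2} \d{1,2} \d{4}$"

is_date = house_type.str.match(date_like)
is_placeholder = house_type.eq("Available Now")

# Keep only values that are neither date-like nor the placeholder
cleaned = house_type[~(is_date | is_placeholder)]
print(cleaned.tolist())
```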

In [18]:
df_3.info()

<class 'pandas.core.frame.DataFrame'>
Index: 46122 entries, 0 to 46179
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   city            46122 non-null  object 
 1   street_address  46122 non-null  object 
 2   state           46122 non-null  object 
 3   zipcode         46122 non-null  float64
 4   house_type      46122 non-null  object 
 5   price           46122 non-null  float64
 6   bathrooms       46122 non-null  float64
 7   bedrooms        46122 non-null  float64
 8   yearBuilt       46122 non-null  float64
 9   latitude        46122 non-null  float64
 10  longitude       46122 non-null  float64
 11  sqft            46122 non-null  float64
 12  school_rating   46122 non-null  float64
dtypes: float64(9), object(4)
memory usage: 4.9+ MB


**We then validated that the remaining categorical values were also accurate. We noted that the dataset included states other than NY, which needed to be removed. Additionally, we converted all values in the 'city' column to lowercase so that the same city would not appear under several capitalizations. From there, we noted that some values were not cities (for example, 'blvd' and 'n.y'). Upon further investigation, shown below, we looked up the zip codes and coordinates of the addresses in the corresponding rows and updated the 'city' accordingly.**

In [19]:
for column in df_3.select_dtypes(include=['object']).columns:
    print(f"\nUnique values in '{column}' column:")
    print(df_3[column].unique())


Unique values in 'city' column:
['New York' 'Manhattan' 'New york' 'Bronx' 'Howard Beach' 'Broad Channel'
 'Far Rockaway' 'Jamaica' 'Rosedale' 'Rockaway Park' 'Neponsit'
 'Belle Harbor' 'Queens' 'Far rockaway' 'BELLE HARBOR' 'Belle harbor'
 'Far Rockway' 'Lawrence' 'Brooklyn' 'Maspeth' 'Little Neck' 'Flushing'
 'NEW YORK' 'Staten Island' 'Staten island' 'staten Island'
 'staten island' 'BROOKLYN' 'Cambria Heights' 'Queens Village'
 'Springfield Gardens' 'belle harbor' 'Cen' 'Rockaway Beach' 'BRONX'
 'Long Island City' 'Astoria' 'College Pt' 'East Elmhurst' 'Woodside'
 'College Point' 'east elmhurst' 'Corona' 'Sunnyside' 'Ridgewood'
 'Bayside' 'Whitestone' 'Beechhurst' 'Douglaston' 'DOUGLASTON'
 'Douglas Manor' 'little neck' 'Great Neck' 'Little neck' 'Fresh Meadows'
 'FLUSHING' 'Oakland Gardens' 'Fresh meadows' 'Forest Hills'
 'Kew Gardens Hills' 'Kew Garden Hl' 'Kew Garden Hills' 'Kew Garden Hill'
 'Kew Gardens' 'Forest Hills Gardens' 'Richmond Hill' 'Kew gardens'
 'kew gardens' 'Bri

In [20]:
# Keep only rows where the 'state' column is 'NY'
# (copy so that later column assignments do not act on a slice)
df_3 = df_3[df_3['state'] == 'NY'].copy()

In [21]:
# Convert all values in the 'city' column to lower case
df_3['city'] = df_3['city'].str.lower()
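
Lowercasing removes capitalization differences, but spelling variants can still remain as separate values. One possible follow-up, sketched here on a hypothetical sample, is to map known variants to a single spelling. The mapping below is an assumption for illustration and would need to be checked by hand against the real values.

```python
import pandas as pd

# Hypothetical sample of raw city strings
city = pd.Series(["Far Rockaway", "Far Rockway", "College Pt", "BROOKLYN", " Flushing "])

# Assumed corrections; a real mapping would need manual verification
variants = {
    "far rockway": "far rockaway",
    "college pt": "college point",
}

# Trim whitespace, lowercase, then map known variants to one spelling
normalized = city.str.strip().str.lower().replace(variants)
print(normalized.unique())
```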

In [22]:
for column in df_3.select_dtypes(include=['object']).columns:
    print(f"\nUnique values in '{column}' column:")
    print(df_3[column].unique())


Unique values in 'city' column:
['new york' 'manhattan' 'bronx' 'howard beach' 'broad channel'
 'far rockaway' 'jamaica' 'rosedale' 'rockaway park' 'neponsit'
 'belle harbor' 'queens' 'far rockway' 'lawrence' 'brooklyn' 'maspeth'
 'little neck' 'flushing' 'staten island' 'cambria heights'
 'queens village' 'springfield gardens' 'cen' 'rockaway beach'
 'long island city' 'astoria' 'college pt' 'east elmhurst' 'woodside'
 'college point' 'corona' 'sunnyside' 'ridgewood' 'bayside' 'whitestone'
 'beechhurst' 'douglaston' 'douglas manor' 'great neck' 'fresh meadows'
 'oakland gardens' 'forest hills' 'kew gardens hills' 'kew garden hl'
 'kew garden hills' 'kew garden hill' 'kew gardens' 'forest hills gardens'
 'richmond hill' 'briarwood' 'jamaica estates' 'hollis' 'holliswood'
 'saint albans' 'bayside hills' 'hollis hills' 'bellerose manor'
 'glen oaks' 'bellerose' 'floral park' 'new hyde park' 'north hills'
 'st. albans' 'brooklyn heights' 'pinedale' 'south richmond hill'
 'south ozone par

In [23]:
# Filter rows with '11375.0' or '11374.0' in the 'zipcode' column
filtered_df = df_3[df_3['zipcode'].isin([11375.0, 11374.0])]

# Display the result
print("Rows with '11375.0' or '11374.0' in the 'zipcode' column:")
print(filtered_df)

Rows with '11375.0' or '11374.0' in the 'zipcode' column:
                       city          street_address state  zipcode  \
8635           forest hills        113-05 Jewel Ave    NY  11375.0   
8643           forest hills      112-29 75th Ave #A    NY  11375.0   
8648           forest hills  71-40 112th St APT 601    NY  11375.0   
8653           forest hills    7235 112th St APT 3B    NY  11375.0   
8655               flushing           11048 72nd Rd    NY  11375.0   
...                     ...                     ...   ...      ...   
46175          forest hills          93-19 71st Ave    NY  11375.0   
46176              flushing           6829 Manse St    NY  11375.0   
46177  forest hills gardens         82 Greenway Ter    NY  11375.0   
46178  forest hills gardens         86 Greenway Ter    NY  11375.0   
46179              flushing           7049 Manse St    NY  11375.0   

               house_type      price  bathrooms  bedrooms  yearBuilt  \
8635          Residential  39

In [24]:
# Bounding box spanning the coordinates of the 'n.y' and 'blvd' rows,
# padded by 0.000001 degrees on each side
latitude_range = [40.717743 - 0.000001, 40.727482 + 0.000001]
longitude_range = [-73.860992 - 0.000001, -73.848610 + 0.000001]

# Filter rows within the specified range
filtered_df = df_3[
    (df_3['latitude'] >= latitude_range[0]) & (df_3['latitude'] <= latitude_range[1]) &
    (df_3['longitude'] >= longitude_range[0]) & (df_3['longitude'] <= longitude_range[1])
]

# Display the result
print("Filtered DataFrame with latitude and longitude values within the specified range:")
print(filtered_df)

Filtered DataFrame with latitude and longitude values within the specified range:
               city            street_address state  zipcode     house_type  \
45526     rego park          6430 Alderton St    NY  11374.0  Single Family   
45528     rego park         86-03 66th Ave #A    NY  11374.0  Single Family   
45530     rego park    6547 Dieterle Crescent    NY  11374.0  Single Family   
45531  forest hills            85-69 66th Ave    NY  11375.0    Residential   
45534      flushing          6406 Alderton St    NY  11374.0  Single Family   
...             ...                       ...   ...      ...            ...   
46107      flushing             9521 68th Ave    NY  11375.0  Single Family   
46119           n.y  85-36 67th Ave Rego Park    NY  11374.0  Single Family   
46148      flushing            6782 Groton St    NY  11375.0  Single Family   
46166      flushing             9607 68th Ave    NY  11375.0  Single Family   
46169  forest hills            6772 Groton St    

In [25]:
# Define the central points for latitude and longitude
central_latitudes = [40.727482, 40.717743]
central_longitudes = [-73.848610, -73.860992]

# Half-width of the search window around each point, in degrees
latitude_range = .001  # ±0.001 degrees
longitude_range = .001  # ±0.001 degrees

# Filter rows within the specified range
filtered_df = df_3[
    ((df_3['latitude'] >= (central_latitudes[0] - latitude_range)) & (df_3['latitude'] <= (central_latitudes[0] + latitude_range)) &
     (df_3['longitude'] >= (central_longitudes[0] - longitude_range)) & (df_3['longitude'] <= (central_longitudes[0] + longitude_range))) |
    ((df_3['latitude'] >= (central_latitudes[1] - latitude_range)) & (df_3['latitude'] <= (central_latitudes[1] + latitude_range)) &
     (df_3['longitude'] >= (central_longitudes[1] - longitude_range)) & (df_3['longitude'] <= (central_longitudes[1] + longitude_range)))
]

# Display the result
print("Filtered DataFrame with latitude and longitude values within 0.001 degrees of the specified points:")
filtered_df

Filtered DataFrame with latitude and longitude values within 0.001 degrees of the specified points:


Unnamed: 0,city,street_address,state,zipcode,house_type,price,bathrooms,bedrooms,yearBuilt,latitude,longitude,sqft,school_rating
45592,flushing,8546 66th Rd,NY,11374.0,Single Family,868000.0,3.0,3.0,1931.0,40.718204,-73.861191,1416.0,5.418539
45598,forest hills,67-70 Yellowstone Blvd #5P,NY,11375.0,Condo,498000.0,2.0,2.0,1941.0,40.726524,-73.849159,1150.0,5.372685
45599,forest hills,102-55 67 Road #1-X,NY,11375.0,Condo,294000.0,1.0,1.0,1955.0,40.727901,-73.848801,800.0,5.372685
45626,forest hills,10525 67th Rd #2H,NY,11375.0,Condo,280000.0,1.0,1.0,1955.0,40.728199,-73.848297,750.0,5.372685
45660,rego park,8546 67th Ave,NY,11374.0,Single Family,875000.0,3.0,3.0,1929.0,40.717587,-73.86068,1512.0,5.418539
45663,flushing,8542 67th Ave,NY,11374.0,Townhouse,810000.0,2.0,3.0,1929.0,40.717533,-73.860832,1500.0,5.418539
45672,flushing,8524 67th Ave,NY,11374.0,Single Family,850000.0,3.0,4.0,1929.0,40.717339,-73.861412,1928.0,5.418539
45743,flushing,8536 67th Ave,NY,11374.0,Single Family,899000.0,2.0,3.0,1929.0,40.717464,-73.861053,1440.0,5.418539
45805,flushing,8543 66th Rd,NY,11374.0,Single Family,735000.0,2.0,3.0,1932.0,40.718548,-73.861572,1464.0,5.418539
45884,blvd,67-35 Yellowstone Blvd #6T,NY,11375.0,Condo,505000.0,2.0,2.0,1947.0,40.727482,-73.84861,1000.0,5.372685
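
The coordinate window used above can be wrapped in a small helper so the same check is not written out for each point. This is a minimal sketch: the function name `within_box` and the toy rows are hypothetical, and the tolerance mirrors the 0.001 degrees used above.

```python
import pandas as pd

def within_box(frame, lat, lon, tol):
    """Return rows whose coordinates lie within +/- tol degrees of (lat, lon)."""
    return frame[
        frame["latitude"].between(lat - tol, lat + tol)
        & frame["longitude"].between(lon - tol, lon + tol)
    ]

# Hypothetical toy rows; the first uses the 'blvd' coordinates shown above
toy = pd.DataFrame({
    "city": ["blvd", "forest hills", "bronx"],
    "latitude": [40.727482, 40.727901, 40.877743],
    "longitude": [-73.848610, -73.848801, -73.910866],
})

neighbors = within_box(toy, 40.727482, -73.848610, tol=0.001)
print(neighbors["city"].tolist())
```

Looking at the city labels of nearby rows is one way to decide which label a mislabeled row should receive.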


In [26]:
# Keep only rows where the 'state' column is 'NY' (copy to avoid modifying a slice of df_3)
df_4 = df_3[df_3['state'] == 'NY'].copy()

# Convert all values in the 'city' column to lower case
df_4['city'] = df_4['city'].str.lower()

# Filter rows with 'blvd' and 'n.y' in the 'city' column
filtered_df = df_4[df_4['city'].isin(['blvd', 'n.y'])]

# Display the result
print("Rows with 'blvd' and 'n.y' in the 'city' column:")
filtered_df

Rows with 'blvd' and 'n.y' in the 'city' column:


Unnamed: 0,city,street_address,state,zipcode,house_type,price,bathrooms,bedrooms,yearBuilt,latitude,longitude,sqft,school_rating
45884,blvd,67-35 Yellowstone Blvd #6T,NY,11375.0,Condo,505000.0,2.0,2.0,1947.0,40.727482,-73.84861,1000.0,5.372685
46119,n.y,85-36 67th Ave Rego Park,NY,11374.0,Single Family,899000.0,2.0,3.0,1929.0,40.717743,-73.860992,1440.0,5.418539


In [27]:
df_3['city'] = df_3['city'].replace('n.y', 'rego park')

df_3['city'] = df_3['city'].replace('blvd', 'flushing')
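
The two substitutions can also be written as a single mapping, followed by a check on the same frame that was modified. The sketch below uses a made-up toy column. The variable names are hypothetical.

```python
import pandas as pd

# Hypothetical toy column containing the two bad labels
toy = pd.DataFrame({"city": ["blvd", "n.y", "flushing", "rego park"]})

# Same substitutions as above, expressed as one mapping
fixes = {"n.y": "rego park", "blvd": "flushing"}
toy["city"] = toy["city"].replace(fixes)

# Check the frame that was modified, not an earlier copy of it
remaining = int(toy["city"].isin(list(fixes)).sum())
print("bad labels remaining:", remaining)
```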

In [28]:
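# Filter rows with 'blvd' and 'n.y' in the 'city' column
# Note: this checks df_4, which was created before the replacement was applied
# to df_3, so the old labels still appear here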
filtered_df = df_4[df_4['city'].isin(['blvd', 'n.y'])]

# Display the result
print("Rows with 'blvd' and 'n.y' in the 'city' column:")
filtered_df

Rows with 'blvd' and 'n.y' in the 'city' column:


Unnamed: 0,city,street_address,state,zipcode,house_type,price,bathrooms,bedrooms,yearBuilt,latitude,longitude,sqft,school_rating
45884,blvd,67-35 Yellowstone Blvd #6T,NY,11375.0,Condo,505000.0,2.0,2.0,1947.0,40.727482,-73.84861,1000.0,5.372685
46119,n.y,85-36 67th Ave Rego Park,NY,11374.0,Single Family,899000.0,2.0,3.0,1929.0,40.717743,-73.860992,1440.0,5.418539


In [29]:
# Remove duplicate rows
df_3 = df_3.drop_duplicates()
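
To see how many rows a deduplication step removes, the duplicates can be counted first with `duplicated()`. A minimal sketch on made-up rows (the repeated listing here is hypothetical):

```python
import pandas as pd

# Hypothetical toy rows with one exact repeat
toy = pd.DataFrame({
    "street_address": ["6829 Manse St", "6829 Manse St", "82 Greenway Ter"],
    "price": [825000.0, 825000.0, 2704000.0],
})

# duplicated() marks every repeat after the first occurrence
n_duplicates = int(toy.duplicated().sum())
deduped = toy.drop_duplicates().reset_index(drop=True)

print(f"{n_duplicates} duplicate row(s) removed, {len(deduped)} remain")
```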

**To validate the accuracy of the numerical features, we checked that the bedroom counts were plausible. Specifically, we checked for single-family homes with 15 or more bedrooms, as this many bedrooms would not make sense for a single-family home but would be more appropriate for a multi-family home. We noted that 5 properties met this criterion. Upon further analysis, shown below, we updated the house type of 2 of the properties to reflect that they are multi-family properties (validated by a Google search). We removed the remaining properties from our dataset as they could not be validated.**

In [30]:
# Filter rows where 'bedrooms' is 15 or more and 'house_type' is 'Single Family'
filtered_df = df_3[(df_3['bedrooms'] >= 15) & (df_3['house_type'] == 'Single Family')]

# Display the result
print("Rows with 15 or more bedrooms and house_type 'Single Family':")
filtered_df

Rows with 15 or more bedrooms and house_type 'Single Family':


Unnamed: 0,city,street_address,state,zipcode,house_type,price,bathrooms,bedrooms,yearBuilt,latitude,longitude,sqft,school_rating
337,far rockaway,508 Beach 135th St,NY,11694.0,Single Family,975000.0,3.0,41.0,1940.0,40.578411,-73.855034,1500.0,2.0
6653,brooklyn,6 Saint Nicholas Ave,NY,11237.0,Single Family,1600000.0,16.0,20.0,1901.0,40.707729,-73.922455,12600.0,1.676471
9581,jamaica,8019 190th St,NY,11423.0,Single Family,800000.0,3.0,15.0,1940.0,40.729176,-73.778816,1864.0,3.997297
31423,brooklyn,1459 47th St,NY,11219.0,Single Family,2999999.0,12.0,16.0,1986.0,40.634071,-73.987061,11900.0,3.183544
37350,brooklyn,1605 E 34th St,NY,11234.0,Single Family,375000.0,40.0,40.0,1925.0,40.616104,-73.940559,15800.0,3.989222


In [31]:
# Count the number of rows
count_filtered_rows = filtered_df.shape[0]
count_filtered_rows

5

In [32]:
# Update the row at index 5750 to have a house_type of "multiple occupancy"
df_3.loc[5750, 'house_type'] = 'Multiple Occupancy'

# Display the updated row to verify the change
print("Updated row at index 5750:")
print(df_3.loc[5750])

Updated row at index 5750:
city                long island city
street_address          3426 10th St
state                             NY
zipcode                      11106.0
house_type        Multiple Occupancy
price                      1215695.0
bathrooms                        5.0
bedrooms                         6.0
yearBuilt                     2001.0
latitude                   40.764244
longitude                 -73.939339
sqft                          2840.0
school_rating                    3.0
Name: 5750, dtype: object


In [33]:
# Identify the rows to be removed
rows_to_remove = df_3[(df_3['house_type'] == 'Single Family') & (df_3['bedrooms'] >= 15)]

# Drop these rows from df_3
df_3 = df_3.drop(rows_to_remove.index)
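
Index labels can shift between runs once rows are filtered, so relabeling by a stable key such as the street address may be more robust than a hard-coded index. The sketch below illustrates the relabel-then-drop sequence on made-up rows. The addresses, the list of confirmed properties, and the constant name are all hypothetical.

```python
import pandas as pd

# Hypothetical toy rows; addresses and counts are made up for illustration
toy = pd.DataFrame({
    "street_address": ["1 Example St", "2 Example St", "3 Example St"],
    "house_type": ["Single Family", "Single Family", "Single Family"],
    "bedrooms": [20.0, 41.0, 3.0],
})

MAX_SINGLE_FAMILY_BEDROOMS = 15  # assumed cutoff, matching the >= 15 check above

# Rows confirmed by hand to be multi-family are relabeled by address
confirmed_multi = ["1 Example St"]
toy.loc[toy["street_address"].isin(confirmed_multi), "house_type"] = "Multiple Occupancy"

# Remaining implausible single-family rows are dropped
implausible = (toy["house_type"] == "Single Family") & (
    toy["bedrooms"] >= MAX_SINGLE_FAMILY_BEDROOMS
)
toy = toy[~implausible]
print(toy)
```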

In [34]:
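# Drop rows with implausible room counts: one bathroom with more than 50 bedrooms,
# and the specific outlier values of 1502 bedrooms and 1346 bathrooms
df_3 = df_3[
    ~((df_3['bathrooms'] == 1) & (df_3['bedrooms'] > 50))
    & (df_3['bedrooms'] != 1502)
    & (df_3['bathrooms'] != 1346)
]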

In [35]:
df_3.info()

<class 'pandas.core.frame.DataFrame'>
Index: 38152 entries, 0 to 46179
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   city            38152 non-null  object 
 1   street_address  38152 non-null  object 
 2   state           38152 non-null  object 
 3   zipcode         38152 non-null  float64
 4   house_type      38152 non-null  object 
 5   price           38152 non-null  float64
 6   bathrooms       38152 non-null  float64
 7   bedrooms        38152 non-null  float64
 8   yearBuilt       38152 non-null  float64
 9   latitude        38152 non-null  float64
 10  longitude       38152 non-null  float64
 11  sqft            38152 non-null  float64
 12  school_rating   38152 non-null  float64
dtypes: float64(9), object(4)
memory usage: 4.1+ MB


**After confirming there were no null values after our data cleaning, we saved the dataframe to be used for EDA.**

In [36]:
# save updated dataframe
df_3.to_csv('/Users/heatheradler/Documents/GitHub/Springboard/Springboard_Projects/Capstone 3/df_dw.csv', index=False)
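
An absolute path ties the notebook to one machine. As a hedged sketch, the save step could build the path with `pathlib` and read the file back to confirm the round trip. Here a temporary directory stands in for the project folder, and the toy frame is made up.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical toy frame standing in for the cleaned data
toy = pd.DataFrame({"city": ["bronx", "flushing"], "price": [1495000.0, 825000.0]})

# A temporary directory stands in for the project folder (assumption)
with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "df_dw.csv"
    toy.to_csv(out_path, index=False)

    # Reading the file back confirms the shape survived the round trip
    reloaded = pd.read_csv(out_path)

print(reloaded.shape == toy.shape)
```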