In [5]:
import pandas as pd

file_path = "NY-House-Dataset.csv"
try:
    df = pd.read_csv(file_path)
    print("Dataset loaded successfully.")
except FileNotFoundError:
    print(f"File '{file_path}' not found.")
    raise  # Stop the cell here; otherwise the lines below fail with a NameError on df

print("\nFirst few rows of the dataset:")
print(df.head())

print("\nSummary statistics of the dataset:")
print(df.describe())

print("\nInformation about the dataset:")
df.info()  # info() prints its report itself and returns None, so it is not wrapped in print()


Dataset loaded successfully.

First few rows of the dataset:
                 TYPE      PRICE  BEDS       BATH  PROPERTYSQFT  \
0      Condo for sale     315000     2   2.000000        1400.0   
1      Condo for sale  195000000     7  10.000000       17545.0   
2      House for sale     260000     4   2.000000        2015.0   
3      Condo for sale      69000     3   1.000000         445.0   
4  Townhouse for sale   55000000     7   2.373861       14175.0   

                     STATE                                       MAIN_ADDRESS  \
0       New York, NY 10022             2 E 55th St Unit 803New York, NY 10022   
1       New York, NY 10019  Central Park Tower Penthouse-217 W 57th New Yo...   
2  Staten Island, NY 10312            620 Sinclair AveStaten Island, NY 10312   
3      Manhattan, NY 10022         2 E 55th St Unit 908W33Manhattan, NY 10022   
4       New York, NY 10065                      5 E 64th StNew York, NY 10065   

  ADMINISTRATIVE_AREA_LEVEL_2  LOCALITY      SUBL

**New York Housing Market dataset:**
- There is a wide range of housing types listed, including condos, houses, and townhouses.
- Prices vary by orders of magnitude: the first five rows alone range from $69,000 to $195,000,000.
- The number of bedrooms and bathrooms also varies, indicating differences in property sizes and types.
- Geographical coordinates (latitude and longitude) are provided for each property, allowing for spatial analysis.


> **NYC Rat Sightings dataset:**
> - Rat sightings occur across various types of locations, including residential buildings, commercial buildings, and mixed-use buildings.
> - There are missing values in the "Incident Address" column, indicating that not all sightings are associated with specific addresses.
> - Rat sightings are reported across different zip codes and neighborhoods in New York City.
> - The dataset includes geographical coordinates (latitude and longitude) for each sighting, facilitating spatial analysis and mapping.

> **Observations:**
> - There may be a relationship between the type of location (e.g., residential, commercial) and the frequency of rat sightings.
> - It's possible that certain neighborhoods or zip codes have a higher incidence of rat sightings compared to others.
> - There might be spatial correlations between areas with high house prices and areas with a high frequency of rat sightings, but further analysis is needed to confirm this.
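
The first two observations can be checked with simple frequency counts. The sketch below is a minimal illustration on a toy frame, not the notebook's actual data: the rows and the `Residential`, `Commercial`, and `Mixed Use` labels are placeholders, and only the column names `Location Type` and `Incident Zip` come from the dataset.

```python
import pandas as pd

# Toy stand-in for Rat_Sightings.csv; every row here is made up for illustration.
rats = pd.DataFrame({
    "Location Type": ["Residential", "Commercial", "Residential", "Mixed Use", None],
    "Incident Zip": [10022.0, 10019.0, 10022.0, 10312.0, None],
})

# Sightings per location type (missing types are ignored by default).
by_type = rats["Location Type"].value_counts()

# Sightings per ZIP code, most affected ZIP first.
by_zip = (
    rats.dropna(subset=["Incident Zip"])
    .groupby("Incident Zip")
    .size()
    .sort_values(ascending=False)
)

print(by_type)
print(by_zip)
```

On the real data, the same two tables would show whether a few location types or ZIP codes account for most reports.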


**What feature/s would you like to be able to predict?** House price (`PRICE`) is the target, with the number of rat sightings in a listing's ZIP code as a candidate predictor alongside beds, baths, and square footage. The geographical coordinates (latitude and longitude) of both house listings and rat sightings make it possible to map the two and look for spatial patterns and correlations between them.

Because the coordinates give a precise location for each record, plotting them on a map shows where high and low values of each feature cluster across the neighborhoods of New York City.
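
One way to look for hotspots without drawing a map is to bin the coordinates into a coarse grid and summarize each cell. The following is a sketch under stated assumptions: the coordinates are invented, the 0.01-degree cell size is an arbitrary choice, and `to_cell` is a hypothetical helper, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy listings; the coordinates are invented for illustration.
listings = pd.DataFrame({
    "LATITUDE": [40.761, 40.766, 40.541, 40.761, 40.767],
    "LONGITUDE": [-73.974, -73.981, -74.196, -73.975, -73.969],
    "PRICE": [315000, 195000000, 260000, 69000, 55000000],
})

CELL = 0.01  # assumed grid cell size in degrees (about 1 km of latitude)

def to_cell(df, lat_col, lon_col, cell=CELL):
    """Map each row's coordinates to integer grid indices (hypothetical helper)."""
    return pd.DataFrame(
        {
            "lat_bin": np.floor(df[lat_col] / cell).astype(int),
            "lon_bin": np.floor(df[lon_col] / cell).astype(int),
        },
        index=df.index,
    )

# Median price and number of listings per grid cell.
summary = (
    listings.join(to_cell(listings, "LATITUDE", "LONGITUDE"))
    .groupby(["lat_bin", "lon_bin"])["PRICE"]
    .agg(["median", "count"])
)
print(summary)
```

Applying the same binning to the rat sightings would give two tables keyed by the same grid cells, which can then be joined and compared.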



In [2]:
import pandas as pd

housing_data = pd.read_csv("NY-House-Dataset.csv")

missing_values_housing = housing_data.isnull().sum()
print("Missing values in New York Housing Market dataset:\n", missing_values_housing)

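# Take the trailing five-digit ZIP code; addresses that do not end in one are dropped.
housing_data['ZIPCODE'] = housing_data['MAIN_ADDRESS'].str.extract(r'(\d{5})\s*$')[0]
housing_data = housing_data.dropna(subset=['ZIPCODE']).copy()
housing_data['ZIPCODE'] = housing_data['ZIPCODE'].astype(int)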

housing_data_cleaned = housing_data[['PRICE', 'ZIPCODE', 'BEDS', 'BATH', 'PROPERTYSQFT']]

housing_data_cleaned.to_csv("cleaned_housing_data.csv", index=False)

rat_sightings_data = pd.read_csv("Rat_Sightings.csv")

print("Missing values in Rat Sightings dataset:\n", rat_sightings_data.isnull().sum())

# Drop sightings without a ZIP code, then count the sightings in each ZIP code.
rat_sightings_data_cleaned = (
    rat_sightings_data.dropna(subset=['Incident Zip'])
    .groupby('Incident Zip')
    .size()
    .reset_index(name='Sightings Count')
)

# Store ZIP codes as integers so they match the housing ZIPCODE column.
rat_sightings_data_cleaned['Incident Zip'] = rat_sightings_data_cleaned['Incident Zip'].astype(int)

rat_sightings_data_cleaned.to_csv("cleaned_rat_sightings_data.csv", index=False)



Missing values in New York Housing Market dataset:
 TYPE                           0
PRICE                          0
BEDS                           0
BATH                           0
PROPERTYSQFT                   0
STATE                          0
MAIN_ADDRESS                   0
ADMINISTRATIVE_AREA_LEVEL_2    0
LOCALITY                       0
SUBLOCALITY                    0
STREET_NAME                    0
LATITUDE                       0
LONGITUDE                      0
dtype: int64
Missing values in Rat Sightings dataset:
 Location Type          6
Incident Zip         336
Incident Address    9074
Street Name         9075
City                 342
Latitude             706
Longitude            706
dtype: int64
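
The ZIP code step above assumes that every `MAIN_ADDRESS` ends in five digits. A quick way to test that assumption is to extract the ZIP code with a regular expression and count the rows that do not match. This is a minimal sketch: the first three addresses are taken from the `head()` output earlier, and the last row is made up to show the failure case.

```python
import pandas as pd

addresses = pd.Series([
    "2 E 55th St Unit 803New York, NY 10022",
    "620 Sinclair AveStaten Island, NY 10312",
    "5 E 64th StNew York, NY 10065",
    "Address without a ZIP code",  # made-up row: no trailing ZIP code
])

# Capture five digits at the very end of the string; non-matching rows become NaN.
zips = addresses.str.extract(r"(\d{5})\s*$")[0]

print("Rows without a ZIP code:", zips.isna().sum())

# Convert only the rows that matched.
valid = zips.dropna().astype(int)
print(valid.tolist())
```

If the count of unmatched rows is zero on the full dataset, plain slicing and the regular expression give the same result.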


In [3]:
import pandas as pd
from sklearn.model_selection import train_test_split

housing_data = pd.read_csv("cleaned_housing_data.csv")
rat_sightings_data = pd.read_csv("cleaned_rat_sightings_data.csv")

merged_data = pd.merge(housing_data, rat_sightings_data, how='inner', left_on='ZIPCODE', right_on='Incident Zip')

X = merged_data[['Sightings Count', 'BEDS', 'BATH', 'PROPERTYSQFT']].values  # Features
y = merged_data['PRICE'].values  # Target variable as a 1-D array, the shape scikit-learn estimators expect

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
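
An inner merge silently drops every listing whose ZIP code has no recorded sightings. To see how many rows that affects, a left merge with `indicator=True` can be run as a check before modeling. The sketch below uses toy frames: the prices and ZIP codes echo the `head()` output, while the sightings counts are made up.

```python
import pandas as pd

# Toy frames shaped like the cleaned CSVs; the sightings counts are made up.
housing = pd.DataFrame({
    "ZIPCODE": [10022, 10019, 10312],
    "PRICE": [315000, 195000000, 260000],
})
rats = pd.DataFrame({
    "Incident Zip": [10022.0, 10312.0, 11201.0],
    "Sightings Count": [12, 3, 40],
})

# Align the key dtypes before merging.
rats["Incident Zip"] = rats["Incident Zip"].astype(int)

# indicator=True adds a _merge column: 'both' or 'left_only' for a left merge.
check = housing.merge(
    rats, how="left", left_on="ZIPCODE", right_on="Incident Zip", indicator=True
)

# Listings that an inner merge would have dropped.
unmatched = check[check["_merge"] == "left_only"]
print(f"{len(unmatched)} of {len(housing)} listings have no matching sightings row")
```

Whether unmatched listings should be dropped or kept with a count of zero is a modeling choice; the notebook's inner merge drops them.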
