<a href="https://colab.research.google.com/github/Nelkit/36103-AT2-data-analysis-project/blob/main/36103_AT2B_data_analysis_project(NYC_Property_Sales).ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Assessment task 2: Data analysis project




## 📝 TODO List

✅ Project Overview  
- [ ] **1.1 Project Description**  
- [ ] **1.2 Business Objective**  
- [ ] **1.3 Research Questions**  

📥 Data Loading and Understanding  
- [x] Load the dataset  
- [ ] Check for missing values and duplicates  
- [ ] Understand data types and structure  

📊 Exploratory Data Analysis (EDA)  
- [ ] **3.1 Explore features**  
- [ ] **3.2 Explore target variable**  
- [ ] Visualize distributions, correlations, and patterns  

🎯 Feature Selection  
- [ ] **4.1 Feature Selection Approach** (correlation, importance scores, etc.)  
- [ ] **4.2 Final Selected Features**  

🛠 Data Preprocessing  
- [ ] **5.1 Data Cleaning** (handle missing values, outliers, duplicates)  
- [ ] **5.2 Feature Engineering** (create new features, transformations)  
- [ ] **5.3 Data Transformation** (scaling, encoding, normalization)  

🤖 Data Modeling  
- [ ] **6.1 Generate Predictions with Baseline Model**  
- [ ] **6.2 Assess the Baseline Model**  

📈 Model Evaluation  
- [ ] **7.1 Generate Predictions with Model Selected**  
- [ ] **7.2 Assess the Selected Model** (metrics, performance comparison)  

🔍 Insights and Conclusions  
- [ ] Summarize key findings  
- [ ] Discuss model performance and business impact  
- [ ] Identify limitations and potential improvements


## 0. Setup Environment

### 0.a Install Mandatory Packages

> Do not modify this code before running it

In [1]:
# Do not modify this code

import os
import sys
from pathlib import Path

COURSE = "36103"
ASSIGNMENT = "AT2"
DATA = "data"

asgmt_path = f"{COURSE}/assignment/{ASSIGNMENT}"
root_path = "./"

if os.getenv("COLAB_RELEASE_TAG"):

    from google.colab import drive
    from pathlib import Path

    print("\n###### Connect to personal Google Drive ######")
    gdrive_path = "/content/gdrive"
    drive.mount(gdrive_path)
    root_path = f"{gdrive_path}/MyDrive/"

print("\n###### Setting up folders ######")
folder_path = Path(f"{root_path}/{asgmt_path}/") / DATA
folder_path.mkdir(parents=True, exist_ok=True)
print(f"\nYou can now save your data files in: {folder_path}")

if os.getenv("COLAB_RELEASE_TAG"):
    %cd {folder_path}



###### Connect to personal Google Drive ######
Mounted at /content/gdrive

###### Setting up folders ######

You can now save your data files in: /content/gdrive/MyDrive/36103/assignment/AT2/data
/content/gdrive/MyDrive/36103/assignment/AT2/data


### 0.b Disable Warning Messages

> Do not modify this code before running it

In [None]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

### 0.c Install Additional Packages

> If you are using additional packages, you need to install them here using the command: `! pip install <package_name>`

In [2]:
!pip install scipy



### 0.d Import Packages

In [3]:
import pandas as pd
import numpy as np
import altair as alt
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.stats as stats
import matplotlib.gridspec as gridspec

### 0.e Reusable Functions

## 1. Project Overview

### 1.1 Project Description

`Type your project description here`

### 1.2 Business Objective

`Type your business objective here`

### 1.3 Research Questions

`Type your research questions here`

## 2. Data Loading and Understanding

In [5]:
original_df = pd.read_csv(folder_path / "nyc-rolling-sales.csv")
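
The TODO items for this section (missing values, duplicates, data types and structure) could be covered with a small helper along the following lines. This is only a sketch: `summarize_frame` and `toy_df` are hypothetical names, the toy frame stands in for `original_df`, and treating `Unnamed: 0` as a leftover row identifier is an assumption that should be verified against the CSV.

```python
import pandas as pd


def summarize_frame(df: pd.DataFrame, id_cols=("Unnamed: 0",)) -> dict:
    """Return shape, per-column missing counts, and duplicate-row count (hypothetical helper)."""
    # Assumption: id_cols are row identifiers; drop them so they do not mask duplicate rows.
    content = df.drop(columns=[c for c in id_cols if c in df.columns])
    return {
        "shape": df.shape,
        "dtypes": df.dtypes.astype(str).to_dict(),
        "missing_per_column": df.isna().sum().to_dict(),
        "duplicate_rows": int(content.duplicated().sum()),
    }


# Toy frame standing in for original_df (values are illustrative only).
toy_df = pd.DataFrame({
    "Unnamed: 0": [4, 5, 6, 7],
    "BOROUGH": [1, 1, 1, 1],
    "YEAR BUILT": [1900, 1913, 1913, 1900],
    "SALE PRICE": [6625000.0, 3936272.0, 3936272.0, None],
})

summary = summarize_frame(toy_df)
print(summary)
```

On the real data the call would be `summarize_frame(original_df)`. Excluding the identifier column matters because two otherwise identical sales would never compare equal if each carries a unique row number.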

## 3. Exploratory Data Analysis (EDA)

### 3.1 Explore features

In [6]:
original_df.head()

|  | Unnamed: 0 | BOROUGH | NEIGHBORHOOD | BUILDING CLASS CATEGORY | TAX CLASS AT PRESENT | BLOCK | LOT | EASE-MENT | BUILDING CLASS AT PRESENT | ADDRESS | ... | RESIDENTIAL UNITS | COMMERCIAL UNITS | TOTAL UNITS | LAND SQUARE FEET | GROSS SQUARE FEET | YEAR BUILT | TAX CLASS AT TIME OF SALE | BUILDING CLASS AT TIME OF SALE | SALE PRICE | SALE DATE |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 4 | 1 | ALPHABET CITY | 07 RENTALS - WALKUP APARTMENTS | 2A | 392 | 6 |  | C2 | 153 AVENUE B | ... | 5 | 0 | 5 | 1633 | 6440 | 1900 | 2 | C2 | 6625000 | 2017-07-19 00:00:00 |
| 1 | 5 | 1 | ALPHABET CITY | 07 RENTALS - WALKUP APARTMENTS | 2 | 399 | 26 |  | C7 | 234 EAST 4TH STREET | ... | 28 | 3 | 31 | 4616 | 18690 | 1900 | 2 | C7 | - | 2016-12-14 00:00:00 |
| 2 | 6 | 1 | ALPHABET CITY | 07 RENTALS - WALKUP APARTMENTS | 2 | 399 | 39 |  | C7 | 197 EAST 3RD STREET | ... | 16 | 1 | 17 | 2212 | 7803 | 1900 | 2 | C7 | - | 2016-12-09 00:00:00 |
| 3 | 7 | 1 | ALPHABET CITY | 07 RENTALS - WALKUP APARTMENTS | 2B | 402 | 21 |  | C4 | 154 EAST 7TH STREET | ... | 10 | 0 | 10 | 2272 | 6794 | 1913 | 2 | C4 | 3936272 | 2016-09-23 00:00:00 |
| 4 | 8 | 1 | ALPHABET CITY | 07 RENTALS - WALKUP APARTMENTS | 2A | 404 | 55 |  | C2 | 301 EAST 10TH STREET | ... | 6 | 0 | 6 | 2369 | 4615 | 1900 | 2 | C2 | 8000000 | 2016-11-17 00:00:00 |


In [7]:
original_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 84548 entries, 0 to 84547
Data columns (total 22 columns):
 #   Column                          Non-Null Count  Dtype 
---  ------                          --------------  ----- 
 0   Unnamed: 0                      84548 non-null  int64 
 1   BOROUGH                         84548 non-null  int64 
 2   NEIGHBORHOOD                    84548 non-null  object
 3   BUILDING CLASS CATEGORY         84548 non-null  object
 4   TAX CLASS AT PRESENT            84548 non-null  object
 5   BLOCK                           84548 non-null  int64 
 6   LOT                             84548 non-null  int64 
 7   EASE-MENT                       84548 non-null  object
 8   BUILDING CLASS AT PRESENT       84548 non-null  object
 9   ADDRESS                         84548 non-null  object
 10  APARTMENT NUMBER                84548 non-null  object
 11  ZIP CODE                        84548 non-null  int64 
 12  RESIDENTIAL UNITS               84548 non-null
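
The `info()` output reports every listed column as fully non-null, yet `head()` shows `-` in `SALE PRICE` and blank `EASE-MENT` cells, so missing values appear to be stored as placeholder strings. A possible way to surface them is sketched below. `normalize_placeholders` and `toy_df` are hypothetical names, the toy values only mimic the rows shown above, and the assumption that placeholders may be padded with whitespace should be checked against the real file.

```python
import pandas as pd


def normalize_placeholders(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy where blank or dash-only strings become NaN (hypothetical helper)."""
    cleaned = df.copy()
    for col in cleaned.columns:
        series = cleaned[col]
        is_text = (pd.api.types.is_object_dtype(series)
                   or pd.api.types.is_string_dtype(series))
        if not is_text:
            continue
        # Assumption: text columns hold strings only; strip padding before comparing.
        stripped = series.str.strip()
        cleaned[col] = stripped.mask(stripped.isin(["", "-"]))
    return cleaned


# Toy frame standing in for original_df (values are illustrative only).
toy_df = pd.DataFrame({
    "EASE-MENT": [" ", " ", " "],
    "SALE PRICE": ["6625000", " -  ", "8000000"],
})

cleaned_df = normalize_placeholders(toy_df)
# Convert the target to a numeric dtype; anything unparseable becomes NaN.
cleaned_df["SALE PRICE"] = pd.to_numeric(cleaned_df["SALE PRICE"], errors="coerce")
missing_counts = cleaned_df.isna().sum()
print(missing_counts)
```

Running the same steps on `original_df` before counting nulls would give a more realistic picture of missingness than the raw `info()` output.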

### 3.2 Explore target variable

## 4. Feature Selection

### 4.1 Feature Selection Approach

### 4.2 Final Selected Features

## 5. Data Preprocessing

### 5.1 Data Cleaning

### 5.2 Feature Engineering

### 5.3 Data Transformation

## 6. Data Modeling

### 6.1 Generate Predictions with Baseline Model

### 6.2 Assess the Baseline Model

## 7. Model Evaluation

### 7.1 Generate Predictions with Model Selected

### 7.2 Assess the Selected Model

## 8. Insights and Conclusions