
In [None]:
import pandas as pd
from scipy.stats import zscore
from sklearn.preprocessing import StandardScaler


Load the dataset

In [None]:
file_path = "/content/AB_NYC_2019.csv"
df = pd.read_csv(file_path)


1. Data Integrity Check

In [None]:
print("Initial Dataset Shape:", df.shape)
print("Columns:\n", df.columns)
print("\nSummary of Missing Values:\n", df.isnull().sum())


Initial Dataset Shape: (48895, 16)
Columns:
 Index(['id', 'name', 'host_id', 'host_name', 'neighbourhood_group',
       'neighbourhood', 'latitude', 'longitude', 'room_type', 'price',
       'minimum_nights', 'number_of_reviews', 'last_review',
       'reviews_per_month', 'calculated_host_listings_count',
       'availability_365'],
      dtype='object')

Summary of Missing Values:
 id                                    0
name                                 16
host_id                               0
host_name                            21
neighbourhood_group                   0
neighbourhood                         0
latitude                              0
longitude                             0
room_type                             0
price                                 0
minimum_nights                        0
number_of_reviews                     0
last_review                       10052
reviews_per_month                 10052
calculated_host_listings_count        0
availability_3
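
Raw missing-value counts are easier to judge as a share of all rows. In the summary above, 10052 missing values out of 48895 rows is about 20.6%, well under the 50% cut-off used in the next step. The following is a minimal sketch on a toy frame; the column values are invented for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame with invented values; only the column names mirror the dataset
toy = pd.DataFrame({
    "name": ["a", None, "c", "d"],
    "reviews_per_month": [0.2, np.nan, np.nan, 1.0],
})

# isna() gives a boolean frame; its column mean is the fraction of missing values
missing_share = toy.isna().mean().mul(100).round(1)
print(missing_share)
```

On the real data the same expression would be `df.isna().mean().mul(100).round(1)`.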

In [None]:
# Replace placeholder values such as -999 or '?' with NaN.
# float("nan") keeps numeric columns numeric; pd.NA can turn them into object dtype.
df = df.replace([-999, '?'], float("nan"))
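
To see what the placeholder replacement does, here is a small sketch on a toy frame. The sentinel values -999 and '?' come from the cell above; the rows themselves are invented, and using `np.nan` as the replacement value is an assumption made here so that the numeric column stays numeric.

```python
import numpy as np
import pandas as pd

# Toy frame with sentinel values standing in for missing data (invented rows)
toy = pd.DataFrame({
    "price": [100, -999, 250],
    "room_type": ["private room", "?", "entire home/apt"],
})

# Replace both sentinels in one call; price becomes float64 because of NaN
cleaned = toy.replace([-999, "?"], np.nan)

print(cleaned.isna().sum())
print(cleaned.dtypes)
```

Each column ends up with one missing value, and `price` is still a numeric column, so the median fill in the next step applies to it.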


2. Handle Missing Values


In [None]:
# Drop columns with more than 50% missing values.
# thresh is the minimum number of non-missing values a column needs to be kept.
threshold = int(len(df) * 0.5)
df = df.dropna(axis=1, thresh=threshold)

# Fill remaining missing values by assigning the result back to the column.
# Calling fillna(..., inplace=True) on df[column] is chained assignment, which
# pandas warns about and which stops working in pandas 3.0.
for column in df.columns:
    if df[column].dtype == 'object':  # Categorical columns: most frequent value
        df[column] = df[column].fillna(df[column].mode()[0])
    else:  # Numeric columns: median
        df[column] = df[column].fillna(df[column].median())
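
A minimal sketch of this step on a toy frame, with invented column values. It shows that `thresh` counts non-missing values, and it fills by assigning the result back to the column rather than calling `fillna(..., inplace=True)` on a selected column. The dtype test via `pd.api.types.is_numeric_dtype` is a choice made for this sketch, not something the notebook uses.

```python
import numpy as np
import pandas as pd

# Toy frame: one column is 75% missing, the others have a single gap each
toy = pd.DataFrame({
    "mostly_missing": [np.nan, np.nan, np.nan, 1.0],
    "price": [100.0, np.nan, 300.0, 200.0],
    "room_type": ["private room", None, "private room", "shared room"],
})

# Keep only columns with at least half of their values present
min_non_null = int(len(toy) * 0.5)
kept = toy.dropna(axis=1, thresh=min_non_null)

# Median for numeric columns, most frequent value for everything else
filled = kept.copy()
for column in filled.columns:
    if pd.api.types.is_numeric_dtype(filled[column]):
        filled[column] = filled[column].fillna(filled[column].median())
    else:
        filled[column] = filled[column].fillna(filled[column].mode()[0])

print(filled)
```

Here `mostly_missing` is dropped, the gap in `price` is filled with the median 200.0, and the gap in `room_type` is filled with "private room".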


3. Rename and Normalize Columns


In [None]:
df.rename(columns=lambda x: x.strip().replace(" ", "_").lower(), inplace=True)
for column in df.select_dtypes(include=['object']):
    df[column] = df[column].str.strip().str.lower()
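
The effect of the renaming and string normalization can be checked on a toy frame. The messy header and values below are invented for illustration; the transformations are the same ones used in the cell above.

```python
import pandas as pd

# Toy frame with a messy header and inconsistent casing/whitespace (invented)
toy = pd.DataFrame({
    " Room Type ": pd.Series(["  Private Room", "ENTIRE HOME/APT "], dtype="object"),
})

# Column names: trim, replace spaces with underscores, lowercase
toy = toy.rename(columns=lambda name: name.strip().replace(" ", "_").lower())

# String values: trim and lowercase
for column in toy.select_dtypes(include=["object"]).columns:
    toy[column] = toy[column].str.strip().str.lower()

print(toy)
```

Note that lowercasing every text column also changes free-text fields such as `name` and `host_name`, which may or may not be wanted downstream.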

4. Remove Duplicates


In [None]:
df.drop_duplicates(inplace=True)
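
Before dropping, it can help to count how many rows are exact duplicates. This is a small sketch with invented rows; note that `drop_duplicates()` without arguments only removes rows that match in every column.

```python
import pandas as pd

# Toy frame: the third row repeats the second exactly (invented values)
toy = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "name": ["loft", "studio", "studio", "loft"],
})

# duplicated() marks every repeat of an earlier identical row
full_row_duplicates = int(toy.duplicated().sum())
deduplicated = toy.drop_duplicates()

print(full_row_duplicates, len(deduplicated))
```

Rows that share a `name` but differ in `id` are kept, because they are not identical across all columns.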


5. Outlier Detection

In [None]:
numeric_columns = df.select_dtypes(include=['float64', 'int64']).columns

# Z-score method: keep only rows with |z| < 3, one numeric column at a time.
# Filtering with a boolean mask avoids adding a temporary column to a slice,
# which triggered SettingWithCopyWarning.
for column in numeric_columns:
    z_scores = zscore(df[column].to_numpy())
    df = df[abs(z_scores) < 3]
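
The z-score filter can be illustrated without SciPy. The sketch below computes the population z-score, z = (x - mean) / std with `ddof=0`, which is the default that `scipy.stats.zscore` uses, on an invented price series with one extreme value.

```python
import pandas as pd

# Nineteen ordinary prices and one extreme value (invented data)
prices = pd.Series([100.0] * 19 + [10_000.0], name="price")

# Population z-score: subtract the mean, divide by the ddof=0 standard deviation
z = (prices - prices.mean()) / prices.std(ddof=0)

# Keep only values within three standard deviations of the mean
kept = prices[z.abs() < 3]

print(round(z.max(), 2), len(kept))
```

The extreme value has a z-score of about 4.36 and is removed; the other nineteen values stay. Because the notebook filters one column at a time, each column's z-scores are computed on the rows that survived the previous columns.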


6. Standardize Numeric Columns


In [None]:
scaler = StandardScaler()
df[numeric_columns] = scaler.fit_transform(df[numeric_columns])
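
`StandardScaler` rescales each column to zero mean and unit variance, using the population standard deviation. Since `numeric_columns` includes every numeric column, identifiers such as `id` and `host_id` are scaled as well. The sketch below reproduces the same arithmetic with pandas on invented values and, as an assumption of this example rather than something the notebook does, leaves the identifier column untouched.

```python
import pandas as pd

# Toy frame with an identifier and two numeric features (invented values)
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "price": [50.0, 100.0, 150.0, 200.0],
    "minimum_nights": [1, 1, 3, 1],
})

# Hypothetical choice: scale every numeric column except the identifier
feature_columns = [
    column for column in toy.select_dtypes(include="number").columns
    if column != "id"
]

# Cast features to float first, then apply (x - mean) / std with ddof=0
scaled = toy.astype({column: "float64" for column in feature_columns})
features = scaled[feature_columns]
scaled[feature_columns] = (features - features.mean()) / features.std(ddof=0)

print(scaled)
```

Whether identifiers should be scaled depends on how the cleaned file is used; if they are needed later as join keys, keeping them unscaled is the safer option.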


In [None]:
# Final Dataset Info
print("Cleaned Dataset Shape:", df.shape)
df.info()  # info() prints its report directly and returns None


Cleaned Dataset Shape: (44074, 16)
<class 'pandas.core.frame.DataFrame'>
Index: 44074 entries, 0 to 48894
Data columns (total 16 columns):
 #   Column                          Non-Null Count  Dtype  
---  ------                          --------------  -----  
 0   id                              44074 non-null  float64
 1   name                            44074 non-null  object 
 2   host_id                         44074 non-null  float64
 3   host_name                       44074 non-null  object 
 4   neighbourhood_group             44074 non-null  object 
 5   neighbourhood                   44074 non-null  object 
 6   latitude                        44074 non-null  float64
 7   longitude                       44074 non-null  float64
 8   room_type                       44074 non-null  object 
 9   price                           44074 non-null  float64
 10  minimum_nights                  44074 non-null  float64
 11  number_of_reviews               44074 non-null  float64
 12  la

In [None]:
# Save Cleaned Dataset
# Note: output_path is the same file that was loaded, so the raw CSV is overwritten.
output_path = "/content/AB_NYC_2019.csv"
df.to_csv(output_path, index=False)
print("Cleaned Dataset Saved to:", output_path)


Cleaned Dataset Saved to: /content/AB_NYC_2019.csv
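
A quick way to confirm the save worked is to read the file back and compare it with what was written. The sketch below does this in a temporary directory with an invented two-row frame; the file name `cleaned_sample.csv` is hypothetical.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in for the cleaned frame (invented values)
cleaned = pd.DataFrame({
    "price": [0.5, -1.2],
    "room_type": ["private room", "shared room"],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    output_path = Path(tmp_dir) / "cleaned_sample.csv"  # hypothetical file name
    cleaned.to_csv(output_path, index=False)
    reloaded = pd.read_csv(output_path)

print(reloaded.shape)
```

Writing to a path different from the input file keeps the raw data available for re-running the notebook.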
