# Data Processing

Goal: Understand and clean our data so we can derive better insights.

## 1. Import Libraries

In [121]:
import pandas as pd

## 2. Load the Dataset

In [122]:
df = pd.read_csv("data/NY-House-Dataset-Small.csv")

In [114]:
df.info()
df.columns

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4815 entries, 0 to 4814
Data columns (total 11 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   BROKERTITLE                  1789 non-null   object 
 1   TYPE                         4815 non-null   object 
 2   PRICE                        4815 non-null   int64  
 3   BEDS                         4815 non-null   int64  
 4   BATH                         4815 non-null   float64
 5   PROPERTYSQFT                 4815 non-null   float64
 6   STATE                        4815 non-null   object 
 7   MAIN_ADDRESS                 4815 non-null   object 
 8   ADMINISTRATIVE_AREA_LEVEL_2  2135 non-null   object 
 9   LOCALITY                     4791 non-null   object 
 10  SUBLOCALITY                  4815 non-null   object 
dtypes: float64(2), int64(2), object(7)
memory usage: 413.9+ KB


Index(['BROKERTITLE', 'TYPE', 'PRICE', 'BEDS', 'BATH', 'PROPERTYSQFT', 'STATE',
       'MAIN_ADDRESS', 'ADMINISTRATIVE_AREA_LEVEL_2', 'LOCALITY',
       'SUBLOCALITY'],
      dtype='object')

## 3. Handle Duplicates

In [115]:
# Find the # of duplicated rows
df.duplicated().sum()

# Find duplicates by column
df.duplicated(["MAIN_ADDRESS"]).sum()

# Filter to get duplicates
df.loc[df.duplicated()]

# Display all duplicates (even first occurrences)
df.loc[df.duplicated(keep=False)].sort_values("PRICE")



Unnamed: 0,BROKERTITLE,TYPE,PRICE,BEDS,BATH,PROPERTYSQFT,STATE,MAIN_ADDRESS,ADMINISTRATIVE_AREA_LEVEL_2,LOCALITY,SUBLOCALITY
3639,,Co-op for sale,119000,3,1.000000,2184.207862,"Jamaica, NY 11432","89-00 170 St Unit 11NJamaica, NY 11432",New York,Queens County,Queens
3629,,Co-op for sale,119000,3,1.000000,2184.207862,"Jamaica, NY 11432","89-00 170 St Unit 11NJamaica, NY 11432",New York,Queens County,Queens
1520,,Co-op for sale,174000,1,1.000000,800.000000,"Brooklyn, NY 11229","3105 Avenue V Apt 1HBrooklyn, NY 11229",,Kings County,Brooklyn
1522,Brokered by TRACEY REAL ESTATE,Co-op for sale,174000,1,1.000000,800.000000,"Brooklyn, NY 11229","3105 Avenue V Apt 1HBrooklyn, NY 11229",New York,Kings County,Brooklyn
2128,,Co-op for sale,174000,1,1.000000,800.000000,"Brooklyn, NY 11229","3105 Avenue V Apt 1HBrooklyn, NY 11229",,Kings County,Brooklyn
...,...,...,...,...,...,...,...,...,...,...,...
2678,Brokered by Keller Williams Realty NYC Grp,Multi-family home for sale,3200000,3,2.373861,3735.000000,"New York, NY 10035","2117 5th AveNew York, NY 10035",,New York,New York County
3469,,Condo for sale,7600000,4,4.000000,3216.000000,"New York, NY 10007","100 Barclay St Apt 20CNew York, NY 10007",,New York County,New York
3473,,Condo for sale,7600000,4,4.000000,3216.000000,"New York, NY 10007","100 Barclay St Apt 20CNew York, NY 10007",,New York County,New York
2355,,Multi-family home for sale,16995000,5,4.000000,4230.000000,"New York, NY 10014","31 Grove StNew York, NY 10014",,New York,New York County


In [116]:
df = df.drop_duplicates()

In [117]:
df.shape

(4761, 11)
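
The calls above differ mainly in which copies get flagged. As a minimal sketch on a small made-up frame (the `toy` data is hypothetical and not taken from the NY housing file), the `keep` and `subset` arguments behave roughly like this:

```python
import pandas as pd

# Hypothetical toy data, not from the NY housing dataset
toy = pd.DataFrame({
    "MAIN_ADDRESS": ["1 A St", "1 A St", "2 B Ave", "1 A St"],
    "PRICE": [100, 100, 200, 150],
})

# Default keep="first": only the later copies of a fully identical row are flagged
first_flags = toy.duplicated().tolist()

# keep=False: every member of a duplicate group is flagged, including the first
all_flags = toy.duplicated(keep=False).tolist()

# subset: rows are compared on the listed columns only
address_flags = toy.duplicated(subset=["MAIN_ADDRESS"]).tolist()

# drop_duplicates keeps the first occurrence of each fully identical row
deduped = toy.drop_duplicates().reset_index(drop=True)
```

Note that comparing on `MAIN_ADDRESS` alone flags more rows than comparing whole rows, so the two counts in the cell above do not have to match.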

## 4. Handle Missing Data

#### Generally Dropping Data

In [118]:
# Does a cell have a null value
df.isna()

# Does a cell have a non-null value
df.notna()

# Get all columns with null values
df.isna().any()

# Get all rows with null values
df.isna().any(axis=1)

# Filter rows with null values
df.loc[df.isna().any(axis=1)]

# Drop null values (row axis)
df.dropna()

# Drop all columns with null values
df.dropna(axis=1)

Unnamed: 0,TYPE,PRICE,BEDS,BATH,PROPERTYSQFT,STATE,MAIN_ADDRESS,SUBLOCALITY
0,Condo for sale,315000,2,2.000000,1400.000000,"New York, NY 10022","2 E 55th St Unit 803New York, NY 10022",Manhattan
1,Condo for sale,195000000,7,10.000000,17545.000000,"New York, NY 10019",Central Park Tower Penthouse-217 W 57th New Yo...,New York County
2,House for sale,260000,4,2.000000,2015.000000,"Staten Island, NY 10312","620 Sinclair AveStaten Island, NY 10312",Richmond County
3,Condo for sale,69000,3,1.000000,445.000000,"Manhattan, NY 10022","2 E 55th St Unit 908W33Manhattan, NY 10022",New York County
4,Townhouse for sale,55000000,7,2.373861,14175.000000,"New York, NY 10065","5 E 64th StNew York, NY 10065",New York County
...,...,...,...,...,...,...,...,...
4810,Co-op for sale,599000,1,1.000000,2184.207862,"Manhattan, NY 10075","222 E 80th St Apt 3AManhattan, NY 10075",New York
4811,Co-op for sale,245000,1,1.000000,2184.207862,"Rego Park, NY 11374","97-40 62 Dr Unit LgRego Park, NY 11374",Queens County
4812,Co-op for sale,1275000,1,1.000000,2184.207862,"New York, NY 10011","427 W 21st St Unit GardenNew York, NY 10011",New York County
4813,Condo for sale,598125,2,1.000000,655.000000,"Elmhurst, NY 11373","91-23 Corona Ave Unit 4GElmhurst, NY 11373",Queens
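
To make the difference between the `dropna` variants concrete, here is a small sketch on made-up data (the `toy` frame and its values are hypothetical). It also shows `subset`, which is not used in the cell above but may be useful when only one column matters:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not from the NY housing dataset
toy = pd.DataFrame({
    "TYPE": ["Condo", "House", "Co-op"],
    "LOCALITY": ["New York", np.nan, "Queens County"],
    "BROKERTITLE": [np.nan, np.nan, "Brokered by X"],
})

# Row axis (default): keep only rows that have no nulls at all
rows_dropped = toy.dropna()

# Column axis: keep only columns that have no nulls at all
cols_dropped = toy.dropna(axis=1)

# subset: drop a row only if the listed column is null
subset_dropped = toy.dropna(subset=["LOCALITY"])
```

Each call returns a new DataFrame and leaves `toy` unchanged, so the result has to be assigned if it should be kept.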


## 5. Missing Data By Column
Steps:
1. Use Descriptive Statistics to examine data
2. Identify missing values
3. Understand why the data is missing
4. Decide to impute or drop values
5. Document your approach
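
Steps 1 and 2 can be sketched with a small helper that reports the count and percentage of missing values per column. The function name `missing_summary` and the toy data are assumptions made for illustration; they are not part of the notebook:

```python
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column, worst first."""
    summary = pd.DataFrame({
        "missing": frame.isna().sum(),
        "percent": (frame.isna().mean() * 100).round(2),
    })
    return summary.sort_values("percent", ascending=False)

# Hypothetical toy data, not from the NY housing dataset
toy = pd.DataFrame({
    "PRICE": [100, 200, 300, 400],
    "LOCALITY": ["New York", None, "Queens County", "Kings County"],
    "BROKERTITLE": [None, None, None, "Brokered by X"],
})

summary = missing_summary(toy)
```

A table like this only covers the "how much is missing" part; why the data is missing and whether to impute or drop still has to be decided column by column.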


### Broker Title

In [119]:
df["BROKERTITLE"].head(20)

# Figure out how many missing values
df["BROKERTITLE"].isna().sum()

# Determine percentage of missing values
print(str(round(df["BROKERTITLE"].isna().sum() / df.shape[0] * 100, 2)) + "% of data is missing")

# Drop column
df = df.drop(columns="BROKERTITLE")

62.72% of data is missing


#### Conclusion

BROKERTITLE is missing in 62.72% of the rows, so the column is dropped.

# Homework

## Question 1

In [120]:
df["ADMINISTRATIVE_AREA_LEVEL_2"].head(18)
df["ADMINISTRATIVE_AREA_LEVEL_2"].value_counts()

# Find percentage of values missing
print(str(round(df["ADMINISTRATIVE_AREA_LEVEL_2"].isna().sum() / df.shape[0] * 100, 2)) + "% of data is missing") # Output: "55.66% of data is missing"

# Drop Column
df = df.drop(columns="ADMINISTRATIVE_AREA_LEVEL_2")

55.66% of data is missing


### Conclusion:

I would drop the ADMINISTRATIVE_AREA_LEVEL_2 column. Over 50% of its values are missing, so dropping the affected rows would discard more than half of the dataset, and since I do not know what an Administrative Area (Level 2) is, it would not make sense to try to fill the values in.
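
The same reasoning could be applied to several columns at once. The sketch below drops every column whose share of missing values exceeds a cut-off; the helper name `drop_sparse_columns` and the 50% default are assumptions chosen to mirror the conclusion above, not a general rule:

```python
import pandas as pd

def drop_sparse_columns(frame: pd.DataFrame, max_missing: float = 0.5) -> pd.DataFrame:
    """Keep only columns whose fraction of missing values is at most max_missing."""
    missing_share = frame.isna().mean()  # fraction of nulls per column
    return frame.loc[:, missing_share <= max_missing]

# Hypothetical toy data, not from the NY housing dataset
toy = pd.DataFrame({
    "PRICE": [100, 200, 300, 400],
    "LOCALITY": ["New York", None, "Queens County", "Kings County"],
    "ADMINISTRATIVE_AREA_LEVEL_2": [None, None, None, "New York"],
})

kept = drop_sparse_columns(toy)
```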

## Question 2

In [129]:
df["LOCALITY"].isna().sum() # 24 N/A values

df.loc[df["LOCALITY"].notna(), "LOCALITY"].unique() # "-" and "Na" should probably be considered as N/A

df["LOCALITY"] = df["LOCALITY"].map(lambda x: None if x in ["-", "Na"] else x)

df["LOCALITY"].isna().sum() # 76 N/A values

df["LOCALITY"].value_counts()


LOCALITY
New York           2468
New York County     966
Queens County       555
Kings County        462
Bronx County        179
Richmond County      58
United States        34
Brooklyn              6
Queens                6
The Bronx             4
Flatbush              1
Name: count, dtype: int64

### Conclusion

I would either replace all of the N/A values with "New York", which most likely covers every location a house in this dataset could be in since it refers to the entire state of New York, or just drop those rows, since they make up less than 10% of the dataset.
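
Both options can be sketched on a small made-up column (the `toy` values are hypothetical). The placeholder strings "-" and "Na" are first turned into real missing values, here with `mask` and `isin` as one possible alternative to the `map` call used above:

```python
import pandas as pd

# Hypothetical toy data, not from the NY housing dataset
toy = pd.DataFrame({"LOCALITY": ["New York", "-", "Na", None, "Queens County"]})

# Treat the placeholder strings as missing values
cleaned = toy["LOCALITY"].mask(toy["LOCALITY"].isin(["-", "Na"]))

# Option 1: fill every missing locality with the broadest category
filled = cleaned.fillna("New York")

# Option 2: drop the rows whose locality is missing
dropped = toy.assign(LOCALITY=cleaned).dropna(subset=["LOCALITY"])
```

Filling keeps every row but inflates the "New York" group, while dropping keeps the categories honest at the cost of a few rows.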

## Question 3

In [161]:
df["PRICE"].value_counts() # 79 rows have a value of 0

print(str(df["PRICE"].loc[lambda x: x == 0].count() / df.shape[0] * 100) + "% of data is missing") # Around 1.6% of data is missing

1.640706126687435% of data is missing


### Conclusion:

I would just drop the houses with no price (a price of 0), because they make up only about 1.6% of the dataset and it would be very difficult to find their actual prices.

If I had to fill in the prices instead of dropping the rows, I would use the median. This would distort the average less than leaving the prices at 0 or setting them to the mode.
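
A minimal sketch of both options on made-up prices follows (the `toy` values are hypothetical). Computing the median over the non-zero prices only is an assumption; the conclusion above does not say which rows the median should be taken from:

```python
import pandas as pd

# Hypothetical toy data, not from the NY housing dataset
toy = pd.DataFrame({"PRICE": [0, 100_000, 200_000, 300_000, 10_000_000]})

is_zero = toy["PRICE"] == 0

# Option 1: drop the rows with a price of 0
dropped = toy.loc[~is_zero]

# Option 2: replace the zeros with the median of the non-zero prices (assumption)
median_price = toy.loc[~is_zero, "PRICE"].median()
imputed = toy["PRICE"].mask(is_zero, median_price)
```

The median is used rather than the mean because a few very expensive listings, like the last toy value, would pull a mean-based fill upward.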