# Finding the Best Neighborhood in Pittsburgh: 
## Factoring in Property Values
Data borrowed from https://data.wprdc.org/dataset/real-estate-sales

### Getting Started

In [1]:
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt


In [None]:
# Importing the data
property_data = pd.read_csv("PghPropertySaleData.csv", low_memory = False)

In [None]:
# Previewing the data
property_data.head(5)

In [None]:
# In the preview, I noticed that the property in the fourth row sold for a price of $0.
# Looking through the .dbf file, I noticed several other extremely low sale prices, such as $0, $1, and $10.
# Therefore, I am only considering prices above $1,000 to mitigate the influence of global outliers.
# The .copy() avoids a SettingWithCopyWarning when the zip code column is reassigned below.
property_data = property_data[property_data.PRICE > 1000].copy()
# In importing the CSV file, the zip codes were turned into floats. This casts them back into the int data type.
property_data['PROPERTYZIP'] = property_data['PROPERTYZIP'].astype(int)
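
To illustrate the filtering step on a small scale, here is a minimal sketch using made-up sale records (the values are hypothetical, not taken from the WPRDC data). It also shows one way to guard the integer cast against missing zip codes. Whether the real file contains missing zip codes is an assumption on my part.

```python
import numpy as np
import pandas as pd

# Hypothetical toy records standing in for the real sales data
toy = pd.DataFrame({
    "PROPERTYZIP": [15222.0, 15213.0, np.nan, 15222.0, 15213.0],
    "PRICE": [250000, 0, 90000, 1, 120000],
})

# Keep only sales above $1,000, as in the cell above
toy = toy[toy.PRICE > 1000].copy()

# Assumption: rows without a zip code cannot be grouped, so drop them
# before casting. astype(int) raises an error on missing values.
toy = toy.dropna(subset=["PROPERTYZIP"])
toy["PROPERTYZIP"] = toy["PROPERTYZIP"].astype(int)

print(toy)
```

Only the two sales that have both a zip code and a price above $1,000 remain.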

In [None]:
# Finding the mean of the property sold prices for all properties sharing the same zip code
price_property_data = property_data[['PROPERTYZIP','PRICE']].groupby(['PROPERTYZIP']).mean()
price_property_data

In [None]:
# Rounding the mean property sales prices to the nearest dollar
# (round() first, because astype(int) on its own truncates instead of rounding)
price_property_data['PRICE'] = price_property_data['PRICE'].round().astype(int)
# Sorting the data by price
price_property_data.sort_values(by=['PRICE'],inplace=True)
price_property_data.head(10)
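
The grouping and rounding steps can be sketched on a handful of made-up sales (the prices here are hypothetical). The sketch calls `round()` before the integer cast, since a bare `astype(int)` drops the fractional part rather than rounding to the nearest dollar.

```python
import pandas as pd

# Hypothetical sales: two in one zip code, three in another
sales = pd.DataFrame({
    "PROPERTYZIP": [15222, 15222, 15213, 15213, 15213],
    "PRICE": [300000, 200000, 100000, 150000, 125002],
})

# Mean sale price per zip code
mean_by_zip = sales.groupby("PROPERTYZIP")[["PRICE"]].mean()

# What a bare integer cast would give (fractional part dropped)
truncated = mean_by_zip["PRICE"].astype(int)

# Rounding to the nearest dollar, then casting
mean_by_zip["PRICE"] = mean_by_zip["PRICE"].round().astype(int)

# Sorting from cheapest to most expensive
mean_by_zip = mean_by_zip.sort_values(by="PRICE")
print(mean_by_zip)
```

For zip code 15213 the mean is about 125000.67, so rounding gives 125001 while truncation would give 125000.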

### Establishing a points system and price brackets
Each zip code will be assigned a fixed number of points based on the price bracket that its mean sale price falls into.

My team members' data sets also use a point system.

The neighborhood with the highest combined number of points will be considered the best.

In [None]:
# First, price brackets need to be established. To do this, I will divide the distribution into five tiers, based on percentiles.

# Creating a new column called "Percentile Rank", which gives the fraction of zip codes whose mean price is less than or equal to that zip code's mean price.
price_property_data['Percentile Rank'] = price_property_data.PRICE.rank(pct = True)
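
As a quick illustration of what `rank(pct=True)` returns, here is a minimal sketch on five made-up mean prices (the labels A through E are placeholders, not real zip codes). The largest value always receives a rank of 1.0, and with five distinct values the ranks land exactly on the tier boundaries.

```python
import pandas as pd

# Hypothetical mean prices, labeled A-E instead of real zip codes
prices = pd.Series(
    [50000, 80000, 120000, 200000, 6000000],
    index=["A", "B", "C", "D", "E"],
    name="PRICE",
)

# Fraction of values that are less than or equal to each value
pct = prices.rank(pct=True)
print(pct)
```

Note that the extreme value in row E does not distort the ranks of the others, which is one reason to build the tiers from percentiles rather than from raw prices.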

In [None]:
# Now, I am creating conditions for the program to check in order to set a point value based on the Percentile Rank values. 

conditions = [
    (price_property_data['Percentile Rank'] <= .2),
    (price_property_data['Percentile Rank'] > .2) & (price_property_data['Percentile Rank'] <= .4),
    (price_property_data['Percentile Rank'] > .4) & (price_property_data['Percentile Rank'] <= .6),
    (price_property_data['Percentile Rank'] > .6) & (price_property_data['Percentile Rank'] <= .8),
    (price_property_data['Percentile Rank'] > .8)]
# The points line up with the above conditions. If the first condition is met (percentile rank at or below .2), one point is assigned.
# If the second condition is met (percentile rank greater than .2 and at or below .4), then two points are assigned.
# This method gives more points to zip codes with higher percentile prices.
# Integer point values keep the column numeric, so it sorts and sums correctly.
points = [1, 2, 3, 4, 5]
# Making a new column called "Points" and adding point values based on the above conditions.
price_property_data['Points'] = np.select(conditions, points)
# Sorting the data first by points, and then by price.
price_property_data.sort_values(by=['Points', 'PRICE'],inplace=True, ascending=False)
price_property_data.head(5)
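
The same five tiers could also be expressed with `pd.cut`, which bins values into right-closed intervals by default. This is an alternative sketch, not the approach used in this notebook, and the percentile ranks below are made up for illustration.

```python
import pandas as pd

# Hypothetical percentile ranks, including values that sit on tier boundaries
ranks = pd.Series([0.1, 0.2, 0.35, 0.6, 0.61, 1.0], name="Percentile Rank")

# Bins are (0, 0.2], (0.2, 0.4], (0.4, 0.6], (0.6, 0.8], (0.8, 1.0],
# matching the <= / > comparisons in the conditions above
tier_points = pd.cut(
    ranks,
    bins=[0, 0.2, 0.4, 0.6, 0.8, 1.0],
    labels=[1, 2, 3, 4, 5],
).astype(int)

print(tier_points.tolist())
```

A rank exactly on a boundary (such as 0.2 or 0.6) falls into the lower tier, the same as with the `np.select` conditions.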


### As seen above, the top zip code judged purely on property values is 15275, which is listed only as Pittsburgh. Neither that zip code nor the runner-up has a specific neighborhood attached to it, so I will conclude that 15222 - Downtown - is the best neighborhood.

In [None]:
# Grabbing the data to put in the final notebook
price_property_data.to_csv('property_data.csv')

## Visualizing the data

In [None]:
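# Moving PROPERTYZIP from the index back into a regular column so it can be used as the x-axis
price_property_data.reset_index(inplace=True)
# Casting only the Points column to int; casting the whole frame would truncate Percentile Rank to 0 or 1
price_property_data['Points'] = price_property_data['Points'].astype(int)
graph = price_property_data.plot.bar(x='PROPERTYZIP', y='PRICE', title="Distribution of Average Property Sale Prices in Pittsburgh")
# Hiding the axes and tick labels
plt.axis('off')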


### As we can see, there is one major outlier in the form of zip code 15275, with an average property sale price of 6 million dollars.