# Determine best locations to open new communities

## Data available through https://www.census.gov that may impact community success
* Home counts, home age, # bedrooms, years living in home, vehicles at home
* Vacant vs occupied, owned vs rent, value range, mortgage status, monthly cost
* Age, sex, race, language spoken
* Population, income, internet use, education

## Steps used for both supervised and unsupervised models
* Input all zip codes along with data we feel is important to community success
* Separate the zip codes where we currently have communities from the ones where we don't, generating two dataframes (current_communities and other_locations)
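
The split described above could be sketched as follows. This is only an illustration: the `all_zips` frame holds toy values, and `community_zips` is a hypothetical set of zip codes standing in for wherever our list of existing communities actually lives.

```python
import pandas as pd

# Toy stand-in for the full zip code table; values are illustrative only.
all_zips = pd.DataFrame({
    "ZipCode": ["00601", "00602", "00603", "00606"],
    "TotalHousingUnits": [7306, 17311, 24771, 2786],
})

# Hypothetical set of zip codes where we already have communities.
community_zips = {"00602", "00606"}

# Boolean mask: True for rows whose zip code is in the community set.
has_community = all_zips["ZipCode"].isin(community_zips)

# Two dataframes, one per group, each with a fresh index.
current_communities = all_zips[has_community].reset_index(drop=True)
other_locations = all_zips[~has_community].reset_index(drop=True)
```

Keeping zip codes as strings avoids losing leading zeros when matching against the community list.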

## Idea #1 - Supervised with target

* Add a new column to the current_communities dataframe that scores the success of each community
    * <span style="color:red;">Could a success score be based on community sales velocity or profit margin?  Do we have this data?</span>
* Train a model using the current_communities dataframe as the inputs and the success score as the target
* Run the other_locations dataframe through the model to get predicted success scores for those locations
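
A minimal sketch of the train-then-predict flow above, assuming a success score exists. No model has been chosen yet, so plain least-squares linear regression from NumPy is used here purely as a stand-in; the feature values and scores are made up for illustration.

```python
import numpy as np

# Made-up feature rows for zip codes with communities
# (for example: share owner-occupied, share with a mortgage).
X_current = np.array([[0.66, 0.15], [0.76, 0.19], [0.58, 0.31], [0.73, 0.13]])
# Made-up success scores for those communities (hypothetical target).
y_current = np.array([0.40, 0.55, 0.70, 0.35])

# Made-up feature rows for zip codes without communities.
X_other = np.array([[0.72, 0.29], [0.60, 0.20]])

def add_intercept(X):
    """Prepend a column of ones so the model can learn a constant term."""
    return np.column_stack([np.ones(len(X)), X])

# Fit: least-squares coefficients on the current communities.
coef, *_ = np.linalg.lstsq(add_intercept(X_current), y_current, rcond=None)

# Predict: apply the fitted coefficients to the other locations.
predicted_scores = add_intercept(X_other) @ coef
```

Any regressor with a fit and predict step could replace the least-squares call. The important part is that the model only ever sees success scores from zip codes where we already build.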

## Idea #2 - Unsupervised without target

* Cluster all zip codes and see which of the ones we are not building in are closest to those where we are building
* Display the result as a two-color scatter plot and as a spreadsheet listing, for every location we do not build in, the vector distance to the nearest community we do build in
    * Locations with the smallest distances are where we should be building
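
The distance spreadsheet described above could be computed along these lines. This sketch skips the clustering step and measures Euclidean distance directly on standardized features, which is one assumption among several possible choices; the feature values are toy numbers.

```python
import numpy as np

# Toy feature rows; each row is one zip code, each column one Census feature.
built = np.array([[0.0, 0.0], [10.0, 10.0]])                    # we build here
candidates = np.array([[1.0, 0.0], [5.0, 5.0], [10.0, 13.0]])   # we do not

# Standardize with statistics from all zip codes so that no single
# feature dominates the distance because of its scale.
all_rows = np.vstack([built, candidates])
mean = all_rows.mean(axis=0)
std = all_rows.std(axis=0)
std[std == 0] = 1.0  # Guard against constant columns
built_z = (built - mean) / std
cand_z = (candidates - mean) / std

# Pairwise distances: rows are candidates, columns are built locations.
diffs = cand_z[:, None, :] - built_z[None, :, :]
dists = np.sqrt((diffs ** 2).sum(axis=2))

# Distance from each candidate to its nearest built location,
# and the candidate indices ordered from most to least similar.
nearest_dist = dists.min(axis=1)
ranking = np.argsort(nearest_dist)
```

Sorting by `nearest_dist` gives the ordering for the spreadsheet, with the most similar candidate locations at the top.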

In [21]:
import pandas as pd

# Selected Housing Characteristics (Census table DP04)
input_df = pd.read_csv('../inputs/census_DP04_data.csv', low_memory=False)

df = pd.DataFrame()
df['ZipCode'] = input_df['GEO_ID'].str[-5:] # Last five characters of GEO_ID are the zip code
df['TotalHousingUnits'] = input_df['DP04_0001E']
df['OccupiedHousingUnits'] = input_df['DP04_0002E']
df['VacantHousingUnits'] = input_df['DP04_0003E']
df['OwnerOccupied'] = input_df['DP04_0046E']
df['RenterOccupied'] = input_df['DP04_0047E']
df['OwnerOccupiedHouseholdSize'] = input_df['DP04_0048E']
df['RenterOccupiedHouseholdSize'] = input_df['DP04_0049E']
# In the sample output this column equals OccupiedHousingUnits, so DP04_0057E
# may be the section total rather than a count of vehicles; verify before use
df['VehiclesAvailable'] = input_df['DP04_0057E']
df['NoVehiclesAvailable'] = input_df['DP04_0058E']
df['WithMortgage'] = input_df['DP04_0091E']
df['WithoutMortgage'] = input_df['DP04_0092E']

# Educational Attainment (Census table S1501); loaded, but no columns are copied from it yet
input_df = pd.read_csv('../inputs/census_S1501_data.csv', low_memory=False)
del input_df # Free memory
df = df.drop(0) # Drop the header description row
df.head()

| | ZipCode | TotalHousingUnits | OccupiedHousingUnits | VacantHousingUnits | OwnerOccupied | RenterOccupied | OwnerOccupiedHouseholdSize | RenterOccupiedHouseholdSize | VehiclesAvailable | NoVehiclesAvailable | WithMortgage | WithoutMortgage |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 601 | 7306 | 5397 | 1909 | 3553 | 1844 | 3.09 | 3.29 | 5397 | 775 | 540 | 3013 |
| 2 | 602 | 17311 | 12858 | 4453 | 9782 | 3076 | 3.05 | 2.6 | 12858 | 1600 | 1876 | 7906 |
| 3 | 603 | 24771 | 19295 | 5476 | 11254 | 8041 | 2.56 | 2.4 | 19295 | 3397 | 3487 | 7767 |
| 4 | 606 | 2786 | 1968 | 818 | 1440 | 528 | 2.93 | 2.86 | 1968 | 265 | 193 | 1247 |
| 5 | 610 | 12494 | 8934 | 3560 | 6452 | 2482 | 3.05 | 2.57 | 8934 | 973 | 1902 | 4550 |
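
Because the Census file carries a text description row under the header, the value columns are likely read as strings rather than numbers. A minimal sketch of converting them before modeling, using a small stand-in frame; the `"-"` entry is a hypothetical placeholder for a missing value, not something confirmed in the data.

```python
import pandas as pd

# Small stand-in for the frame built above, with values held as strings.
sample = pd.DataFrame({
    "ZipCode": ["00601", "00602"],
    "TotalHousingUnits": ["7306", "17311"],
    "OwnerOccupiedHouseholdSize": ["3.09", "-"],  # "-" is a hypothetical missing marker
})

# Convert every column except ZipCode; entries that cannot be parsed become NaN.
value_cols = [c for c in sample.columns if c != "ZipCode"]
sample[value_cols] = sample[value_cols].apply(pd.to_numeric, errors="coerce")
```

ZipCode is left as a string on purpose so that leading zeros are preserved.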


### Additional Census data in file DP04 that we may want to add later

* attached/detached, with count of how many units are attached
* year structure built
* number of rooms
* number of bedrooms
* housing tenure
* count of vehicles available
* house heating fuel
* house value
* monthly mortgage cost
* monthly rent
* cost as a percentage of income
* rent as a percentage of income