# DataHacks 2020 Beginner Track:
## San Diego Housing Information 1970

We will be using two datasets that describe the same census tracts: small areas that a city is divided into for census purposes.
- The SD1970_population table contains features such as the age, gender, race, and marital status of San Diego residents in 1970. The SD1970_housing table contains features such as average listing price and monthly rent, number of rooms, and kitchen and plumbing facilities of their houses in the same year. Each row is identified by its tract (“Census Tract Name”) combined with its block number (“Block Group”).
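Since each row is identified by tract plus block group, it can help to build one combined identifier early. Here is a minimal sketch on made-up rows; `block_id` and `Total Persons` are hypothetical names chosen for illustration, and only "Census Tract Name" and "Block Group" come from the description above.

```python
import pandas as pd

# Toy rows standing in for the real table; the values are made up.
toy = pd.DataFrame({
    "Census Tract Name": ["Census Tract 1", "Census Tract 1", "Census Tract 2"],
    "Block Group": [1, 2, 1],
    "Total Persons": [120, 95, 210],  # hypothetical column
})

# Combine tract and block group into a single identifier per row.
# "block_id" is a hypothetical column name, not one from the dataset.
toy["block_id"] = (
    toy["Census Tract Name"].str.strip() + " / BG " + toy["Block Group"].astype(str)
)

# A usable identifier should be unique across rows.
print(toy["block_id"].is_unique)
print(toy["block_id"].tolist())
```

If the uniqueness check fails on the real data, that usually points to duplicate rows worth inspecting before any merge.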

What are we dealing with?

## "Census Tract"

Census tracts are small, relatively permanent statistical subdivisions of a county.
- Each tract is uniquely numbered within its county with a numeric code
- Tracts average about 4,000 inhabitants
- Minimum population: 1,200; maximum population: 8,000

GIS link (San Diego Census Tracts, 2010):
http://www.arcgis.com/home/webmap/viewer.html?url=https://services.arcgis.com/uEH09Hfm70zI2ZxR/ArcGIS/rest/services/CENSUS_TRACTS_2010/FeatureServer/0&source=sd

![Screen%20Shot%202020-02-08%20at%209.04.19%20AM.png](attachment:Screen%20Shot%202020-02-08%20at%209.04.19%20AM.png)

## Block groups: smaller subdivision of tracts

## Place name: city/neighborhood name that the blocks belong to
   - can have multiples per tract
   
***All of them belong to San Diego County; do not delete rows with other Place names!***

In [6]:
# Basic Python libraries; feel free to add more!

import pandas as pd
import numpy as np
import matplotlib

In [7]:
# make sure the filepath is correct so the datasets can be imported

population = pd.read_csv('SD1970_population.csv', thousands=',')
housing = pd.read_csv('SD1970_housing.csv', thousands=',')

## Requires a LOT of cleaning!


In [None]:
population.head()

In [None]:
housing.head()

### Ellipses (...s) - look at column 6 above

- An ellipsis means that the data was not released for privacy reasons
  - This happens when few people live in a block

- This is different from a null value or a 0

- You can drop these values, replace them, turn them into binary variables using the rest of the row, etc.

In [None]:
# one example
housing = housing.replace('...', np.nan)  # np.NaN was removed in NumPy 2.0; np.nan works everywhere
housing.head()
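Another option from the list above is to keep track of which values were suppressed before converting the column to numbers. This is a rough sketch on toy data; "Average value" is used as a stand-in column name, `value_suppressed` is a hypothetical new column, and the numbers are made up.

```python
import pandas as pd

# Toy data; the numbers are made up.
toy = pd.DataFrame({"Average value": ["21,500", "...", "0", "18,300"]})

# 1 if the value was withheld for privacy, 0 otherwise.
toy["value_suppressed"] = (toy["Average value"] == "...").astype(int)

# Blank out the suppressed cells, strip thousands separators, then convert.
# Suppressed cells become NaN, while real zeros stay 0.
as_text = toy["Average value"].mask(toy["value_suppressed"] == 1)
as_text = as_text.str.replace(",", "", regex=False)
toy["Average value"] = pd.to_numeric(as_text)

print(toy)
```

Keeping the flag means a model can still tell "withheld" apart from "zero" after the ellipses are gone.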

## What do these columns mean?

- Columns act as "features" for machine learning

We have a huge number of features to work from, feel free to add your own features.

### Some Terms in SD1970_housing.csv

**Housing unit**: 1 basic unit of house/apartment

- Unit structure: 1 basic unit of building containing houses

- Room units: 1 basic unit of rooms

**Aggregate (total) value**: total value of all houses in the block combined

**Average value**: average value of individual houses in the block

Housing units with **all plumbing facilities**: houses that have all three of the following facilities:
 - Hot and cold running water, a flush toilet, and a bathtub or shower

Housing units with **direct access**: accessible from the outside or through a common hall
 - Exists as an independent unit; can be ignored because it is used with another condition (below).

Housing units with **complete kitchen facilities**: houses that have all three of the following facilities:
 - A sink with a faucet, a stove or range, and a refrigerator

**Seasonal and migratory**: house is not offered year-round

Housing units for which value is **tabulated**: data is collected and arranged in this table

**Roomers, boarders, or lodgers**: people who live with the landlord
- **Renters** do not live with the landlord

**Family head**: head of family; only 1 member per household

## COLUMN NAMES TOO LONG???

- Rename column names to make data manipulation easy!

In [4]:
population = population.rename(columns={"Census Tract Name": "Tract name"})
population.head()

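Renaming one column at a time gets tedious, so a rule-based pass can shorten many names at once. A small sketch, assuming made-up long headers: only "Census Tract Name" comes from the dataset description, and the "Count of Persons: ..." names are hypothetical.

```python
import pandas as pd

# Empty toy frame; the second and third headers are hypothetical examples.
toy = pd.DataFrame(columns=[
    "Census Tract Name",
    "Count of Persons: Male: Under 5 years",
    "Count of Persons: Female: Under 5 years",
])

# Explicit mapping for the columns you care most about.
toy = toy.rename(columns={"Census Tract Name": "tract"})

# Rule-based shortening for the rest: drop a shared prefix,
# lowercase everything, and replace separators with underscores.
toy.columns = (
    toy.columns
    .str.replace("Count of Persons: ", "", regex=False)
    .str.lower()
    .str.replace(": ", "_", regex=False)
    .str.replace(" ", "_", regex=False)
)

print(list(toy.columns))
```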
## TOO MANY COLUMNS???

Some columns you might want to get rid of if it becomes too overwhelming:

- "Vacant~" columns: get rid of them or add them to "Occupied" columns 
- Columns containing data about family heads: not enough supporting data
- Columns with no relevant data in other columns
  - "seasonal and vacant migratory", "Roomers, boarders, or lodgers:"

*These are only suggestions; use them if you want.*
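One way to act on these suggestions is to drop columns by keyword rather than typing every name. The sketch below uses hypothetical column names and made-up numbers; check the keywords against the real headers before relying on it.

```python
import pandas as pd

# Toy frame with hypothetical column names and made-up counts.
toy = pd.DataFrame({
    "Block Group": [1, 2],
    "Occupied housing units": [40, 55],
    "Vacant housing units": [3, 1],
    "Vacant seasonal and migratory": [0, 0],
    "Family heads: Male": [30, 41],
})

# Optionally fold vacant units into a total before dropping them.
toy["Total housing units"] = toy["Occupied housing units"] + toy["Vacant housing units"]

# Drop every column whose name mentions one of these keywords (case-insensitive).
keywords = ["vacant", "family head"]
to_drop = [c for c in toy.columns if any(k in c.lower() for k in keywords)]
slim = toy.drop(columns=to_drop)

print(to_drop)
print(list(slim.columns))
```

Printing `to_drop` first is a cheap way to confirm the keywords did not catch a column you meant to keep.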

# Tasks:

Detailed steps can be found in the rubric.

**THE GUIDELINE QUESTIONS ARE THERE TO HELP YOU EXPERIMENT WITH THE DATASET**

***BE SURE TO DO ALL OF THEM***




## 1. Data Cleaning (3 pts): 

### Drop columns you don't need/ want to use
- We have a LOT of columns; you can't be efficient by using all of those columns

### Deal with null values, duplicate/ambiguous rows, and incorrect data types

Most of these columns hold numerical data, so convert them to appropriate data types.

In [5]:
population.dtypes

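A column that still contains commas or ellipses will show up as `object` in `dtypes`. Here is a minimal sketch of converting such columns, assuming hypothetical column names ("Total Persons", "Median Rent") and made-up values.

```python
import pandas as pd

# Toy frame; numeric columns arrive as text because of commas and "...".
toy = pd.DataFrame({
    "Census Tract Name": ["Census Tract 1", "Census Tract 2"],
    "Total Persons": ["1,204", "987"],   # hypothetical column
    "Median Rent": ["95", "..."],        # hypothetical column
})

# Leave identifier columns alone and convert everything else.
id_cols = ["Census Tract Name"]
num_cols = [c for c in toy.columns if c not in id_cols]

for col in num_cols:
    as_text = toy[col].astype(str).str.replace(",", "", regex=False)
    # errors="coerce" turns anything non-numeric (such as "...") into NaN.
    toy[col] = pd.to_numeric(as_text, errors="coerce")

print(toy.dtypes)
```

Note that `errors="coerce"` silently hides every unparseable value, so it is worth counting the NaNs before and after the conversion.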
### Column manipulation
- Lump similar columns together, add new columns based on existing ones, etc.


***Guideline Q1***: Add the columns “Total Male Persons” and “Total Female Persons” to the population dataframe, and include the value of each column for the 1st row (Census Tract 1, Block 1) in the report.

- Data should be optimally cleaned, with similar columns combined together/dropped
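For Guideline Q1, one possible approach is to sum all columns belonging to each gender row by row. The source column names below are hypothetical and the counts are made up; only the two new column names come from the question.

```python
import pandas as pd

# Toy frame; the real table has different (and many more) age columns.
toy = pd.DataFrame({
    "Male: Under 18 years": [30, 12],
    "Male: 18 years and over": [55, 40],
    "Female: Under 18 years": [28, 15],
    "Female: 18 years and over": [60, 44],
})

# Match on the start of the name. A case-insensitive "contains" check
# would be risky here because "female" contains "male".
male_cols = [c for c in toy.columns if c.startswith("Male")]
female_cols = [c for c in toy.columns if c.startswith("Female")]

toy["Total Male Persons"] = toy[male_cols].sum(axis=1)
toy["Total Female Persons"] = toy[female_cols].sum(axis=1)

# Values for the first row, as the question asks.
print(toy.loc[0, ["Total Male Persons", "Total Female Persons"]])
```

If the real table already contains subtotal columns, exclude them from the lists so nobody is counted twice.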


### Merge data on each block from both housing and population tables

***EXAMPLE ONLY***

![Screen%20Shot%202020-02-08%20at%2010.29.12%20AM.png](attachment:Screen%20Shot%202020-02-08%20at%2010.29.12%20AM.png)
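The merge itself can be sketched as below, joining on tract and block group together. The two toy frames and the "Total Persons" and "Housing units" columns are hypothetical; the key names follow the dataset description.

```python
import pandas as pd

# Toy stand-ins for the two tables; values are made up.
pop = pd.DataFrame({
    "Census Tract Name": ["Census Tract 1", "Census Tract 1", "Census Tract 2"],
    "Block Group": [1, 2, 1],
    "Total Persons": [120, 95, 210],
})
hous = pd.DataFrame({
    "Census Tract Name": ["Census Tract 1", "Census Tract 1", "Census Tract 3"],
    "Block Group": [1, 2, 1],
    "Housing units": [48, 37, 80],
})

keys = ["Census Tract Name", "Block Group"]

# An outer merge with indicator=True shows which rows matched;
# validate="one_to_one" raises an error if a key is duplicated on either side.
merged = pop.merge(hous, on=keys, how="outer", indicator=True, validate="one_to_one")
print(merged["_merge"].value_counts())

# Keep only blocks present in both tables.
both = merged[merged["_merge"] == "both"].drop(columns="_merge")
print(both)
```

Checking the `_merge` counts before filtering shows how many blocks would be lost by an inner join.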

### Intro to Python & Pandas workshop @ 1:30pm

## 2. Data Visualization (3 pts): 

### Examine the overall distribution of data using different plots
Focus on different categories and find out how the different groups of people living in San Diego are distributed.

For example: what does the age distribution look like for males in San Diego, and for females? How are residents distributed across races?

![Screen%20Shot%202020-02-08%20at%2010.30.57%20AM.png](attachment:Screen%20Shot%202020-02-08%20at%2010.30.57%20AM.png)

![Screen%20Shot%202020-02-08%20at%2010.32.04%20AM.png](attachment:Screen%20Shot%202020-02-08%20at%2010.32.04%20AM.png)
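A distribution plot like the ones above can be sketched by summing each age column over all blocks and plotting the shares. The age-group column names and counts here are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this line inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Toy counts per block; the age-group columns are hypothetical.
toy = pd.DataFrame({
    "Male: Under 18 years": [30, 12, 25],
    "Male: 18 to 64 years": [55, 40, 61],
    "Male: 65 years and over": [9, 14, 7],
})

# Sum over all blocks, then convert counts to shares of the total.
age_totals = toy.sum()
age_share = age_totals / age_totals.sum()

ax = age_share.plot(kind="bar", rot=30)
ax.set_xlabel("Age group")
ax.set_ylabel("Share of male residents")
ax.set_title("Male age distribution (toy data)")
plt.tight_layout()
```

Plotting shares instead of raw counts makes it easier to compare groups of different sizes, for example males against females.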

### How big is each block/residential area? 
- ***Guideline Q2***: Which block is the second most common, after San Diego?



### How is the pricing of houses distributed in San Diego? 
- ***Guideline Q3***: What's the average price of all houses in San Diego?
- ***Guideline Q4***: What block has the highest average price (based on the owner-occupied average value)?
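One thing to watch for in Q3: the mean of per-block averages is not the same as the average over all houses, because blocks contain different numbers of houses. The sketch below shows the difference on made-up numbers; `block_id` and "Units with value tabulated" are hypothetical column names.

```python
import pandas as pd

# Toy housing rows; values are made up.
toy = pd.DataFrame({
    "block_id": ["T1 / BG 1", "T1 / BG 2", "T2 / BG 1"],
    "Aggregate value": [400_000, 900_000, 500_000],
    "Units with value tabulated": [20, 30, 10],
})

# Per-block average value.
toy["Average value"] = toy["Aggregate value"] / toy["Units with value tabulated"]

# Average over all houses: total value divided by total number of houses.
overall = toy["Aggregate value"].sum() / toy["Units with value tabulated"].sum()

# Mean of block averages: treats every block equally, whatever its size.
naive = toy["Average value"].mean()

# Block with the highest average value (Q4-style lookup).
top_block = toy.loc[toy["Average value"].idxmax(), "block_id"]

print(overall, naive, top_block)
```

Whichever definition you pick, state it in the report so the number can be reproduced.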



### Data Visualization workshop @ 3:45pm

## 3. Machine Learning (3 pts):

### Use machine learning to see what factors are correlated to the living conditions of houses.

### Define the condition of a house and what columns are relevant to it.
- We will use this as an arbitrary feature to differentiate between rows of housing blocks
- You can choose to select one specific column to define it, or use multiple to up your accuracy!
- Does the gender of the occupants affect how expensive their house is? What about age? Or marital status?

### Main goal: connecting the two tables (population, housing) together

### Find a meaningful relationship between the residents of a house and its conditions using different machine learning techniques.
- Perform linear regression on 2 or more features in the population dataset.
- Divide census tracts into 2 groups, make a new column containing the group data, and perform logistic regression on the column
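The two bullets above could look roughly like this with scikit-learn, which is an assumption on our part since the notebook only imports pandas, NumPy, and matplotlib. The data is synthetic and the column names (`share_married`, `median_age`, `avg_value`, `high_value`) are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a merged table; the relationship is invented.
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "share_married": rng.uniform(0.2, 0.8, n),
    "median_age": rng.uniform(20, 60, n),
})
toy["avg_value"] = (
    10_000 + 20_000 * toy["share_married"] + 150 * toy["median_age"]
    + rng.normal(0, 1_000, n)
)

# Split the rows into 2 groups at the median and store the group in a new column.
toy["high_value"] = (toy["avg_value"] > toy["avg_value"].median()).astype(int)

X = toy[["share_married", "median_age"]]
X_train, X_test, y_train, y_test, g_train, g_test = train_test_split(
    X, toy["avg_value"], toy["high_value"], test_size=0.25, random_state=0
)

# Linear regression on 2 population features, scored on held-out rows.
lin = LinearRegression().fit(X_train, y_train)
r2 = lin.score(X_test, y_test)

# Logistic regression on the group column; scaling puts both features on similar ranges.
logit = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_train, g_train)
acc = logit.score(X_test, g_test)

print(round(r2, 3), round(acc, 3))
```

Scoring on a held-out split rather than the training rows gives a more honest picture of how well the relationship generalizes.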

### Use PCA to select useful features: appropriate for our dataset because it has many features
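Strictly speaking, PCA builds new combined features (components) rather than picking original columns, but the loadings show which columns drive each component. A minimal sketch with scikit-learn, which is assumed here and not imported in the notebook; the data is synthetic and the column names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic data: three columns driven by one hidden factor, plus pure noise.
rng = np.random.default_rng(0)
n = 200
size = rng.normal(0, 1, n)
toy = pd.DataFrame({
    "rooms": 5 + size + rng.normal(0, 0.3, n),
    "avg_value": 20_000 + 5_000 * size + rng.normal(0, 1_500, n),
    "avg_rent": 100 + 20 * size + rng.normal(0, 6, n),
    "noise": rng.normal(0, 1, n),
})

# Standardize first so columns with large units do not dominate.
Z = StandardScaler().fit_transform(toy)

pca = PCA(n_components=2).fit(Z)
print(pca.explained_variance_ratio_)

# Loadings: how strongly each original column contributes to each component.
loadings = pd.DataFrame(pca.components_.T, index=toy.columns, columns=["PC1", "PC2"])
print(loadings.round(2))
```

Columns with large absolute loadings on the first few components are reasonable candidates to keep, and the transformed components themselves can also be fed to a model.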



### Machine Learning workshop @ 8:00pm

## 4. Analysis - Final Report (1 pt)

### Requirements:
- Minimum 2 pages, double spaced, ~12 pt with clear & legible font
- Include all aspects of the 3 steps given above with details on the procedure.
- Must be a clear and concise communication of end results and final analysis
- Minimal grammatical/vocabulary errors and consistent formatting

### Report Writing workshop on Sun. 9:30am

# Any Questions?