# Final Project Submission

* Student name: **Dennis Trimarchi**
* Student pace: **full time**
* Scheduled project review date/time: **Not yet scheduled**
* Instructor name: **Rafael Cassaro**
* Blog post URL: **Not yet created**


# Motivation
I was motivated to do this project because I have a keen interest in the home values of King County, Washington!!... Oh wait, that's a complete lie. I am doing this project because I am required to review data related to King County housing, scrub the data, do some analysis, and develop a regression model. With that said, there is a great deal of data available and I am interested in going through a full analysis without being "walked through" the process. I believe that doing this project will give me an idea of my strengths and weaknesses at this stage of my learning, as well as give me practice applying data science concepts in Python.

# Outline

I have broken this notebook out into sections. Most of my Python work is completed in other notebooks. Where I have done this, I will provide the filename as a reference. In all, I have three notebooks. The contents of these notebooks are summarized below:

1. [student_1_Data_cleaning.ipynb](student_1_Data_cleaning.ipynb): Read in King County housing data. Scrub the data. Handle null values, placeholders and outliers.
2. [student_2_EDA_Questions.ipynb](student_2_EDA_Questions.ipynb): Start to look at the contents of each feature. Formulate three questions, investigate them, and provide an answer/conclusion to each question.
3. [student_3_Model.ipynb](student_3_Model.ipynb): Create a model. Iteratively improve the model. Provide a final model and perform model validation.

This readme contains sections that summarize the work from the notebooks above as well as provide Recommendations / Conclusions and links to additional project deliverables.

**Table Of Contents**
1. Import Data and Scrub
2. EDA & Questions
3. Modeling
4. Recommendations / Conclusion
5. Further Work
6. Slideshow - Link
7. Video - Link
8. Blog - Link

# Import Data and Scrub

See details in [student_1_Data_cleaning.ipynb](student_1_Data_cleaning.ipynb) if desired.

In total I removed **634** out of **21597** rows of data during Data Cleaning.

I read in the data from the provided csv file. My process was to first look at the datatype and convert it. Second, look for Null values. Third, look for outliers by running feature histograms and scatterplots against the target. However, in trying to convert some feature datatypes, null values became apparent on their own. Below is a summary of the data cleaning actions for each feature:

### **Field:** Action Taken

* **id:** Dropped column. Could've used as index, but auto-index in DataFrame was sufficient. Verified no duplicates before dropping.

* **date:** Dropped column. I converted it to a datetime object and thought it might be good to see if there was a correlation between sale price and time of year. However, the dataset only spans one year, so I decided to drop it.

* **price:** THIS IS THE TARGET - dependent variable. I ran a histogram and decided to remove records that were >=2000000 (208 rows). The histogram was very skewed and removing these outliers gave a more normal distribution - although still skewed right. 

* **bedrooms:** There was one very clear outlier at 33 bedrooms. Decided to remove records with bedrooms >= 8 (20 rows).

* **bathrooms:** Removed records that were >5, which looked like outliers (21 rows removed). I also decided to round the data to the nearest 0.5 bathroom. I didn't think that 0.25 or 0.75 of a bathroom was meaningful.

* **sqft_living:** Removed records that were >6000 (only 12 rows after other data slicing).

* **sqft_lot:** Removed outliers where square footage is greater than 150,000 (332 rows). I feel that I could've removed more according to the histogram, but I did not want to slice away too much data at this time.

* **floors:** Left alone.

* **waterfront:** 2354 rows have a Null value. Decided to fill Null values with 0 (i.e. not waterfront). Unfortunately the feature only has 101 records equal to 1 (i.e. view to waterfront).

* **view:** Dropped column. Over 90% of the data has zero as the value. It is also unclear what this feature means... for example, view = 4.

* **condition:** Left alone.

* **grade:** Left alone.

* **sqft_above:** Left alone.

* **sqft_basement:** Feature engineered. Converted to float from object type. However, I needed to fill 454 records that had "?" for a value. Decided to use 0 in place of "?" with the assumption that these properties did not have a basement, or that there was no evidence of a basement. The median value for the feature is 0, so I feel that this was a good assumption. Basement sizes other than zero followed a roughly normal distribution, but there was a huge peak at zero. Decided to convert this to a categorical variable **has_basement** with 0 meaning "no basement" and 1 meaning "has a basement".

* **yr_built:** Left alone.

* **yr_renovated:** Updated with yr_built or yr_renovated, whichever is newer. The idea behind this is that the year a home was built should count as its last renovation. This filled in most of the data, as only 713 rows had a non-zero value to start with.

* **zipcode:** Feature engineered. Stored as a number but should be considered a categorical feature. Since there are 70 unique zip codes in the dataset, I decided to look at zip code prefixes. There are two in the dataset: 980, representing areas around Seattle, and 981, representing the city of Seattle. I created a categorical variable **zip_981** that flags the 981 (city of Seattle) zip codes. I left zipcode alone in case it would come in handy later.

* **lat:** Left alone.

* **long:** Left alone.

* **sqft_living15:** Left alone.

* **sqft_lot15:** Removed outliers where square footage is greater than 150,000. I feel that I could've removed more according to the histogram, but I did not want to slice away too much data at this time.
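
The cleaning actions listed above could be sketched in pandas roughly as follows. This is a minimal sketch, not the code from the cleaning notebook: the function name `clean_housing` is hypothetical, the column names are assumed to match the field names in the list, and the order of operations (filter first, then round bathrooms) is an assumption.

```python
import pandas as pd


def clean_housing(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the cleaning rules summarized above (hypothetical helper)."""
    df = df.copy()

    # Outlier thresholds taken from the list above, applied to the raw values.
    keep = (
        (df["price"] < 2_000_000)
        & (df["bedrooms"] < 8)
        & (df["bathrooms"] <= 5)
        & (df["sqft_living"] <= 6000)
        & (df["sqft_lot"] <= 150_000)
        & (df["sqft_lot15"] <= 150_000)
    )
    df = df[keep].copy()

    # Missing waterfront flags are assumed to mean "not waterfront".
    df["waterfront"] = df["waterfront"].fillna(0)

    # "?" placeholders become 0 sq ft, then collapse to a binary flag.
    basement = pd.to_numeric(df["sqft_basement"], errors="coerce").fillna(0)
    df["has_basement"] = (basement > 0).astype(int)

    # Use the build year as the last renovation when it is more recent
    # (this also covers yr_renovated == 0 or missing).
    df["yr_renovated"] = df[["yr_built", "yr_renovated"]].max(axis=1)

    # Zip codes starting with 981 are the city of Seattle.
    df["zip_981"] = df["zipcode"].astype(str).str.startswith("981").astype(int)

    # Round bathrooms to the nearest 0.5 (exact ties are rounded to even).
    df["bathrooms"] = (df["bathrooms"] * 2).round() / 2

    return df.drop(columns=["id", "date", "view"]).reset_index(drop=True)
```

Filtering before rounding keeps the `bathrooms > 5` rule tied to the raw values; rounding first would let a value such as 5.25 slip through as 5.0.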

# Questions

I came up with the following questions as I was initially reviewing the data. 

1. Are waterfront properties significantly more expensive?
2. Are newer homes more expensive?
3. What impacts price more, square footage of living space or number of bedrooms?

## Question 1 - waterfront view properties

Unfortunately I didn't realize that there would only be about 100 records where waterfront view was selected. While I don't believe that this is enough data to make any meaningful determinations, I nevertheless did some data analysis to see what came out. 

I ran a regression of price on waterfront, which yielded a p-value near zero and an R-squared of 0.021. Unfortunately the residuals were problematic: they were not normally distributed. I ended up taking the log of the price (target variable) and re-ran the regression. This yielded a p-value near zero again, with an R-squared of 0.012.

Having a p-value of almost zero in both models tells me that waterfront view is statistically significant. Luckily, the residuals in the second model were closer to normally distributed. Based on my limited set of data, and with a constant of about 13, a waterfront view increases the log of the price by 0.7883. Because the target is logged, that is a multiplicative effect: exp(0.7883) ≈ 2.2, so waterfront homes sell for roughly 2.2 times the price of non-waterfront homes rather than $2.2K more. That is a large premium, but it rests on a small sample of waterfront properties.
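
A coefficient on log(price) is easiest to read after exponentiating it, since it then becomes a multiplier on price. The short sketch below does that arithmetic for the 0.7883 coefficient quoted above; the helper name `log_coef_to_pct_change` is hypothetical.

```python
import math


def log_coef_to_pct_change(coef: float) -> float:
    """Percent change in price implied by a coefficient on log(price)."""
    return (math.exp(coef) - 1.0) * 100.0


waterfront_coef = 0.7883  # coefficient from the log-price regression above
factor = math.exp(waterfront_coef)

print(f"multiplicative factor: {factor:.2f}")
print(f"percent change: {log_coef_to_pct_change(waterfront_coef):.0f}%")
```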


## Question 2 - newer homes

The scatterplot of yr_built vs price didn't look to have any kind of relationship. The correlation coefficient of 0.053 would indicate a very weak positive correlation between the two. Again, I ended up having to take the log of the price in order to get normally distributed residuals from the model. 
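
As a rough illustration of the two numbers discussed above (the correlation coefficient and a fit against log price), here is a NumPy-only sketch. The data is synthetic and all parameters in it are made up for the example, so the output will differ from the project's results.

```python
import numpy as np

# Synthetic stand-in data: a weak positive link between year built and log price.
rng = np.random.default_rng(1)
yr_built = rng.integers(1900, 2016, size=5000)
log_price = 10.5 + 0.0013 * yr_built + rng.normal(0.0, 0.5, size=5000)
price = np.exp(log_price)

# Pearson correlation between the feature and the raw target.
r = np.corrcoef(yr_built, price)[0, 1]

# Least-squares line on the logged target: log(price) = slope * yr_built + intercept.
slope, intercept = np.polyfit(yr_built, np.log(price), deg=1)
residuals = np.log(price) - (slope * yr_built + intercept)

print(f"Pearson r (price vs yr_built): {r:.3f}")
print(f"slope on log(price): {slope:.4f}")
print(f"mean residual: {residuals.mean():.2e}")
```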

The model coefficient of 0.0013 indicates that log price increases by 0.0013 for each year newer. Since exp(0.0013) ≈ 1.0013, this is about a 0.13% increase per year (roughly $650 on a $500,000 home), not a fixed dollar amount. I suppose it's possible, but it's not very interesting or useful. The p-value was less than 0.05, which indicates statistical significance.

## Question 3 - Square footage vs. Num Bedrooms



# Modeling

 

# Notebook
* Needs to be clean and organized
    * markdowns
    * introduction
        * outline
            * table of contents if possible
    * clear comments
    
    
* In order you should have
    * introduction
    * obtain section
    * scrubbing
        * handling null values
        * handling placeholders
        * handling outliers
        * etc
    * eda section
        * questions clearly stated
            * question
            * investigation + eda
            * conclusions + insights + recommendation
        * eda summary
    * modeling
        * model
        * iterate by using those results to make a new model
        * repeat 
        * final model - **do not have multicollinearity**
            * **DON'T**
                * have multicollinearity
                * pvalues > 0.05
                * P(F) > 0.05
            * your price equation written out 
                * $price = 23\text{sqft_living} + \dots$
                * writing a latex equation
                
                $$p = \beta_1 f_1 + \beta_2 f_2 + \dots$$
                
                $$p = 1232 sqftliving + 3232 grade + \dots$$
                
            * final model sm.OLS.summary()
            * model validation
                * train test split
                * cross validation in sklearn using
                * show the scores and interpret them
    * recommendations
    * conclusion
    * further work
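
The model validation items in the checklist above (train/test split, cross-validation, interpreting scores) could look something like the sketch below. It assumes scikit-learn's `train_test_split` and `cross_val_score` as the tools, and it uses synthetic features and a plain `LinearRegression`, not the project's data or final model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in features and target (NOT the project data).
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=500)

# Hold out 20% of the rows for a final check.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
train_r2 = model.score(X_train, y_train)
test_r2 = model.score(X_test, y_test)

# 5-fold cross-validated R^2 on the training portion.
cv_scores = cross_val_score(
    LinearRegression(), X_train, y_train, cv=5, scoring="r2"
)

print(f"train R^2: {train_r2:.3f}, test R^2: {test_r2:.3f}")
print(f"CV R^2: mean {cv_scores.mean():.3f}, std {cv_scores.std():.3f}")
```

Train and test scores that sit close together, with a small spread across the folds, suggest the model is not overfitting.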



# Slideshow
* 6-8/10 slides
* not too text heavy
* good visual aids
* model summary for features
    * insights
* recommendations
* further investigation
* thank you slide



# README.md
* Either
    * convert notebook
* or
    * intro
    * outline
    * notebooks used
    * conclusion
        * final model equation



# Video
* present your project
* upload to youtube, google drive or other cloud service



# Blog
* 800 words
* add some visuals
* have fun with it

