# Project Design Writeup and Approval Template

Follow this as a guide to completing the project design write-up. The questions for each section are merely there to suggest what the baseline should cover; be sure to use detail as it will make the project much easier to approach as the class moves on.

### Project Problem and Hypothesis
- What is the project about? What problem are you solving?

    - The aim of this project is to predict the Sale Price of houses in Ames, Iowa. Individual house sale data is available for the period 2006 to 2010. This information will be used to identify the factors that contribute significantly to Sale Price. A model will then be built to predict house prices in Ames with minimal error.
    - Background info: Ames has a population of about 62,000 and has been ranked the 9th best place to live in the USA. It has a robust, stable economy, and the average temperature is 9.6 degrees Celsius.
    - Null hypothesis: Sale Price has no relationship with any of the independent variables available in the Ames Housing dataset.
    - Alternate hypothesis: Sale Price has a relationship with at least one of the independent variables available in the Ames Housing dataset.
    - The dataset used for this project has changed since my initial communication. I was planning to use datasets from the Australian Bureau of Statistics. However, further analysis showed that those datasets contain aggregated data rather than raw data, so they are not fit for this purpose. The Ames Housing dataset contains record-level data that is fit for this purpose.

- Where does this seem to reside as a machine learning problem? Are you predicting some continuous number, or predicting a binary value?
    - The dependent variable, SalePrice, is continuous, so this is a regression problem. The predictors are a mixture of continuous and categorical variables. A regression model (e.g. linear regression) is likely to be used to predict the sale price.

- What kind of impact do you think it could have?
    - Sale Price can be influenced by various factors such as the size of the house, the condition of the house, the neighborhood and any refurbishments. Exploratory analysis will need to be conducted to understand the correlation between these factors and Sale Price.

- What do you think will have the most impact in predicting the value you are interested in solving for?
    - Exploratory analysis will show the correlation between each independent variable and the dependent variable (Sale Price), and will also identify correlations among the independent variables themselves. Independent variables that are strongly correlated with each other will need to be analysed and removed where appropriate, so that they do not distort the final model. Independent variables with a strong correlation to the dependent variable (a coefficient close to 1 or -1) are expected to have the highest impact on Sale Price.
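
As a rough sketch of the correlation screening described above, the snippet below ranks numeric predictors by their correlation with `SalePrice` and lists predictor pairs that are strongly correlated with each other. The data is synthetic, and the helper names (`correlation_with_target`, `correlated_predictor_pairs`) and the 0.8 threshold are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a few numeric Ames columns (all values are made up).
rng = np.random.default_rng(0)
n = 200
gr_liv_area = rng.normal(1500, 400, n)
garage_area = rng.normal(470, 150, n)
df = pd.DataFrame({
    "GrLivArea": gr_liv_area,
    "GarageArea": garage_area,
    # Deliberately built from GarageArea so the two predictors are collinear.
    "GarageCars": garage_area / 240 + rng.normal(0, 0.2, n),
    "LotArea": rng.normal(10000, 3000, n),
    "SalePrice": 100 * gr_liv_area + 80 * garage_area + rng.normal(0, 20000, n),
})


def correlation_with_target(frame, target):
    """Pearson correlation of each numeric predictor with the target, strongest first."""
    numeric = frame.select_dtypes(include="number")
    corr = numeric.corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.loc[order]


def correlated_predictor_pairs(frame, target, threshold=0.8):
    """Predictor pairs whose absolute correlation exceeds the (assumed) threshold."""
    predictors = frame.select_dtypes(include="number").drop(columns=[target])
    corr = predictors.corr().abs()
    cols = list(corr.columns)
    pairs = []
    for i, first in enumerate(cols):
        for second in cols[i + 1:]:
            value = float(corr.loc[first, second])
            if value > threshold:
                pairs.append((first, second, round(value, 3)))
    return pairs


ranked = correlation_with_target(df, "SalePrice")
pairs = correlated_predictor_pairs(df, "SalePrice")
print(ranked)
print(pairs)
```

On the real data, one variable from each flagged pair would be a candidate for removal, keeping whichever has the stronger correlation with Sale Price.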

### Data sets
- Description of data set available, at the field level (see table)
    - Housing data for individual house sales in Ames was collected over the period from 2006 to 2010. The Kaggle dataset for Ames Housing contains 1460 observations and 81 variables. The variables include 20 continuous variables and 61 categorical variables. The dependent variable, Sale Price, is a continuous variable.

    - The original Ames Housing dataset contains 2930 observations and 82 variables. The variables include 20 continuous variables and 62 categorical variables. More details are available in an additional attachment, DataDocumentation.txt, which will be provided with the submission. That document explains the range of values and the categories for each of the 82 variables.
    
    - The Parcel Identification Number (PID) is the variable that has been removed from the Kaggle dataset. This variable has no impact on the analysis of Sale Price in Ames.
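
The split into continuous and categorical variables could be set up along the following lines. This is only a sketch: the four rows are made-up values with Ames-style column names, and `coded_categoricals` is a hypothetical list of columns that are stored as numbers but represent categories (the table below classes `MSSubClass` as categorical, for example).

```python
import pandas as pd

# Tiny made-up sample; the real data would be loaded from the Kaggle CSV instead.
df = pd.DataFrame({
    "MSSubClass": [60, 20, 60, 70],
    "MSZoning": ["RL", "RL", "RM", "RL"],
    "LotArea": [8400, 9600, 11200, 9500],
    "GrLivArea": [1700, 1250, 1800, 1720],
    "SalePrice": [205000, 180000, 225000, 140000],
})

# Hypothetical list: numerically coded columns that should be treated as categories.
coded_categoricals = ["MSSubClass"]
for col in coded_categoricals:
    df[col] = df[col].astype("category")

# Everything non-numeric is treated as categorical; the target is kept separate.
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()
continuous_cols = (
    df.select_dtypes(include="number").columns.drop("SalePrice").tolist()
)

# One-hot encode the categorical predictors so a regression model can use them.
# drop_first=True drops one level per variable to avoid redundant dummy columns.
X = pd.get_dummies(
    df.drop(columns="SalePrice"), columns=categorical_cols, drop_first=True
)
y = df["SalePrice"]

print(categorical_cols)
print(continuous_cols)
print(X.columns.tolist())
```

One-hot encoding 61 categorical variables can produce a large number of columns, which is one reason to shortlist predictors before modelling.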


Variable | Description | Type of variable
---|---|---
Id|Identification number|Categorical
MSSubClass| The building class|Categorical
MSZoning| The general zoning classification|Categorical
LotFrontage| Linear feet of street connected to property|Continuous
LotArea| Lot size in square feet|Continuous
Street| Type of road access|Categorical
Alley| Type of alley access|Categorical
LotShape| General shape of property|Categorical
LandContour| Flatness of the property|Categorical
Utilities| Type of utilities available|Categorical
LotConfig| Lot configuration|Categorical
LandSlope| Slope of property|Categorical
Neighborhood| Physical locations within Ames city limits|Categorical
Condition1| Proximity to main road or railroad|Categorical
Condition2| Proximity to main road or railroad (if a second is present)|Categorical
BldgType| Type of dwelling|Categorical
HouseStyle| Style of dwelling|Categorical
OverallQual| Overall material and finish quality|Categorical
OverallCond| Overall condition rating|Categorical
YearBuilt| Original construction date|Categorical
YearRemodAdd| Remodel date|Categorical
RoofStyle| Type of roof|Categorical
RoofMatl| Roof material|Categorical
Exterior1st| Exterior covering on house|Categorical
Exterior2nd| Exterior covering on house (if more than one material)|Categorical
MasVnrType| Masonry veneer type|Categorical
MasVnrArea| Masonry veneer area in square feet|Continuous
ExterQual| Exterior material quality|Categorical
ExterCond| Present condition of the material on the exterior|Categorical
Foundation| Type of foundation|Categorical
BsmtQual| Height of the basement|Categorical
BsmtCond| General condition of the basement|Categorical
BsmtExposure| Walkout or garden level basement walls|Categorical
BsmtFinType1| Quality of basement finished area|Categorical
BsmtFinSF1| Type 1 finished square feet|Continuous
BsmtFinType2| Quality of second finished area (if present)|Categorical
BsmtFinSF2| Type 2 finished square feet|Continuous
BsmtUnfSF| Unfinished square feet of basement area|Continuous
TotalBsmtSF| Total square feet of basement area|Continuous
Heating| Type of heating|Categorical
HeatingQC| Heating quality and condition|Categorical
CentralAir| Central air conditioning|Categorical
Electrical| Electrical system|Categorical
1stFlrSF| First Floor square feet|Continuous
2ndFlrSF| Second floor square feet|Continuous
LowQualFinSF| Low quality finished square feet (all floors)|Continuous
GrLivArea| Above grade (ground) living area square feet|Continuous
BsmtFullBath| Basement full bathrooms|Categorical
BsmtHalfBath| Basement half bathrooms|Categorical
FullBath| Full bathrooms above grade|Categorical
HalfBath| Half baths above grade|Categorical
BedroomAbvGr| Number of bedrooms above basement level|Categorical
KitchenAbvGr| Number of kitchens|Categorical
KitchenQual| Kitchen quality|Categorical
TotRmsAbvGrd| Total rooms above grade (does not include bathrooms)|Categorical
Functional| Home functionality rating|Categorical
Fireplaces| Number of fireplaces|Categorical
FireplaceQu| Fireplace quality|Categorical
GarageType| Garage location|Categorical
GarageYrBlt| Year garage was built|Categorical
GarageFinish| Interior finish of the garage|Categorical
GarageCars| Size of garage in car capacity|Categorical
GarageArea| Size of garage in square feet|Continuous
GarageQual| Garage quality|Categorical
GarageCond| Garage condition|Categorical
PavedDrive| Paved driveway|Categorical
WoodDeckSF| Wood deck area in square feet|Continuous
OpenPorchSF| Open porch area in square feet|Continuous
EnclosedPorch| Enclosed porch area in square feet|Continuous
3SsnPorch| Three season porch area in square feet|Continuous
ScreenPorch| Screen porch area in square feet|Continuous
PoolArea| Pool area in square feet|Continuous
PoolQC| Pool quality|Categorical
Fence| Fence quality|Categorical
MiscFeature| Miscellaneous feature not covered in other categories|Categorical
MiscVal| Dollar value of miscellaneous feature|Continuous
MoSold| Month Sold|Categorical
YrSold| Year Sold|Categorical
SaleType| Type of sale|Categorical
SaleCondition| Condition of sale|Categorical
SalePrice|Sale Price of house - dependent variable|Continuous

- If from an API, include a sample return (this is usually included in API documentation!) (if doing this in markdown, use the javascript code tag)

### Domain knowledge
- What experience do you already have around this area?
    - I am interested in the real estate space for personal investment and potentially as a future business option. I have been analysing the factors that influence house prices using various websites that cover the Australian market.
    - Background info: Ames has a population of about 62,000 and has been ranked the 9th best place to live in the USA. It has a robust, stable economy, and the average temperature is 9.6 degrees Celsius. This background will also inform the project, e.g. the likely need for heating and a possible upward trend in sale prices.

- Does it relate or help inform the project in any way?
    - Yes. Based on my experience to date, it will help me identify and shortlist some of the predictors related to Sale Price. Plots such as box plots, histograms and scatter plots will further support the decision-making process.

- What other research efforts exist?
    - I searched Google extensively for an Australian housing dataset and found some datasets on the ABS website. However, the data was aggregated rather than raw, so I kept looking and found the Ames Housing dataset, which has a single row for each individual house sale and looked more appropriate for this project.
    - The benchmark at this stage is to obtain an Adjusted R-squared of at least 0.70.

    - Use a quick Google search to see what approaches others have made, or talk with your colleagues if it is work related about previous attempts at similar problems.
    - This could even just be something like "the marketing team put together a forecast in excel that does not do well."
    - Include a benchmark, how other models have performed, even if you are unsure what the metric means.

### Project Concerns
- What questions do you have about your project? What are you not sure you quite yet understand? (The more honest you are about this, the easier your instructors can help).
     - I am not sure whether there is a target Adjusted R-squared that I need to meet, how many regression models I should be comparing, or what RMSE I should be aiming for. I am also unsure whether there is a minimum or maximum number of predictors I should use, considering the large number of categorical variables in the dataset.

- What are the assumptions and caveats to the problem?
    - What data do you not have access to but wish you had?
    - I wish I had access to Australian housing data as this would have been more useful for me personally and people in the class.
    - What is already implied about the observations in your data set? For example, if your primary data set is twitter data, it may not be representative of the whole sample (say, predicting who would win an election)
    - The dataset contains sales between 2006 and 2010 only. Changes after 2010 have not been captured, so predictions may differ from current house sale prices in Ames.
- What are the risks to the project?
    - What is the cost of your model being wrong? (What is the benefit of your model being right?)
    - There are no risks for this project as the outcome will only be used privately. If the model were published and people used its predictions to buy properties online, there would be a low risk of buyers being misled and potentially overpaying or underpaying for a property.
    - Is any of the data incorrect? Could it be incorrect?
    - Not that I am aware of. There are some outliers that could potentially be removed, but have not been removed from this analysis.
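
If outliers are examined later, one common rule of thumb is to flag values more than 1.5 interquartile ranges outside the quartiles. The sketch below applies that rule to a made-up `GrLivArea` series. The function name `iqr_outlier_mask` and the multiplier `k=1.5` are assumptions for illustration, and this is not necessarily the rule that will be used in the final analysis.

```python
import pandas as pd


def iqr_outlier_mask(values, k=1.5):
    """Boolean mask marking values more than k * IQR outside the quartiles."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)


# Made-up living areas in square feet; the last value is deliberately extreme.
gr_liv_area = pd.Series(
    [1200, 1350, 1500, 1600, 1710, 1800, 1950, 5600], name="GrLivArea"
)

mask = iqr_outlier_mask(gr_liv_area)
print(gr_liv_area[mask])
```

Flagged rows would still need to be inspected by hand before removal, since an unusually large house can be a genuine sale rather than a data error.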

### Outcomes
- What do you expect the output to look like?
    - I expect the predicted values to be close to the actual Sale Price in the test set when the model is evaluated with a train/test split.

- What does your target audience expect the output to look like?
    - The target audience will also expect the predicted Sale Price to be as close to the actual sale price as possible.

- What gain do you expect from your most important feature on its own?
    - An adjusted R-squared greater than 0.5 at the very least, with the most important feature contributing significantly to the variance explained in Sale Price.

- How complicated does your model have to be?
    - It is better to keep the model simple if possible. Identifying the most prominent features will be the key to this.

- How successful does your project have to be in order to be considered a "success"?
    - Adjusted R-squared greater than 0.70, p-values of the retained predictors less than 0.05, and RMSE as low as possible.

- What will you do if the project is a bust (this happens! but it should not here)?
    - I will explain why the project did not produce the expected outcome, along with all the steps undertaken to reach that conclusion.
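
The success criteria above could be checked roughly as follows. This is a sketch on synthetic data using only NumPy: the two predictors, their coefficients, the 30% hold-out share and the helper names (`fit_ols`, `predict`, `rmse`, `adjusted_r2`) are all assumptions, and p-values are not computed here.

```python
import numpy as np

# Synthetic data standing in for two predictors (e.g. living area and quality).
rng = np.random.default_rng(42)
n = 300
X = np.column_stack([rng.normal(1500, 400, n), rng.normal(6, 1.5, n)])
y = 20000 + 90 * X[:, 0] + 15000 * X[:, 1] + rng.normal(0, 15000, n)

# Shuffle the rows and hold out 30% of them as a test set.
idx = rng.permutation(n)
test_size = int(0.3 * n)
test_idx, train_idx = idx[:test_size], idx[test_size:]
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]


def fit_ols(X, y):
    """Ordinary least squares with an intercept; returns [intercept, slopes...]."""
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def predict(coef, X):
    """Predictions from the fitted intercept and slopes."""
    return coef[0] + X @ coef[1:]


def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as the target."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def adjusted_r2(y_true, y_pred, n_predictors):
    """R-squared penalised for the number of predictors used."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    r2 = 1 - ss_res / ss_tot
    n_obs = len(y_true)
    return float(1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1))


coef = fit_ols(X_train, y_train)
y_pred = predict(coef, X_test)
test_rmse = rmse(y_test, y_pred)
test_adj_r2 = adjusted_r2(y_test, y_pred, n_predictors=X.shape[1])

print(f"Test RMSE: {test_rmse:,.0f}")
print(f"Test adjusted R-squared: {test_adj_r2:.3f}")
print("Meets 0.70 benchmark:", test_adj_r2 > 0.70)
```

Reporting both metrics on the held-out rows, rather than on the training rows, gives a fairer view of how close predictions are likely to be for unseen sales.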