## Project 3 – Final Project Submission

* Student name: Greg Osborne
* Student pace: self paced / part time
* Scheduled project review date/time: 10/X/22
* Instructor name: Morgan Jones
* Blog post URL: https://medium.com/@gregosborne

## Project Overview

For this project, you will engage in the full data science process from start to finish, solving a **classification** problem using a **dataset of your choice**.

### Business Problem and Data

Similar to the Phase 2 project, it is up to you to define a stakeholder and business problem. Unlike the Phase 2 project, you are also responsible for choosing a dataset.

For complete details, see [Phase 3 Project - Choosing a Dataset](https://github.com/learn-co-curriculum/dsc-phase-3-choosing-a-dataset).

### Key Points

#### Classification

Recall the distinction between *classification* and *regression* models:

 * Classification is used when the target variable is a *category*
 * Regression is used when the target variable is a *numeric value*

(Categorical data may be represented as numbers, e.g. 0 and 1, but these are not truly numeric values. If you're unsure, ask yourself: "Is a target value of 1 _one more than_ a target value of 0?" If so, it is a regression target; if not, it is a classification target.)
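
The point about numeric-looking categories can be illustrated with a short sketch. The `status` and `price` columns below are hypothetical and invented for illustration only; they are not part of any project dataset.

```python
import pandas as pd

# Hypothetical classification target: text labels for two categories
status = pd.Series(["denied", "approved", "approved", "denied", "approved"], name="status")

# Encoding the labels as integers does not make them numeric quantities.
# pandas sorts the categories alphabetically: approved -> 0, denied -> 1
codes = status.astype("category").cat.codes
print(codes.tolist())  # [1, 0, 0, 1, 0]

# "denied" (1) is not "one more than" "approved" (0); the codes are only labels,
# so this is a classification target.

# Hypothetical regression target: differences between values are meaningful
price = pd.Series([250_000, 251_000], name="price")
print(price.diff().iloc[1])  # 1000.0, a real quantity
```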

You already practiced performing a regression analysis in Phase 2, and you will have additional opportunities to work on regression problems in later phases, but **for this project, you must be modeling a classification problem**.

#### Findings and Recommendations

In the previous two projects, the framing was primarily *descriptive* and *inferential*, meaning that you were trying to understand the distributions of variables and the relationship between them. For this project you can still use these techniques, but make sure you are also using a ***predictive*** approach.

A predictive *finding* might include:

* How well your model is able to predict the target
* What features are most important to your model

A predictive *recommendation* might include:

* The contexts/situations where the predictions made by your model would and would not be useful for your stakeholder and business problem
* Suggestions for how the business might modify certain input variables to achieve certain target results

#### Iterative Approach to Modeling

The expectations from the Phase 2 project still stand:

> You should demonstrate an iterative approach to modeling. This means that you must build multiple models. Begin with a basic model, evaluate it, and then provide justification for and proceed to a new model. After you finish refining your models, you should provide 1-3 paragraphs in the notebook discussing your final model.

With the additional techniques you have learned in Phase 3, be sure to explore:

1. Model features and preprocessing approaches
2. Different kinds of models (logistic regression, k-nearest neighbors, decision trees, etc.)
3. Different model hyperparameters

At minimum you must build three models:

* A simple, interpretable baseline model (logistic regression or single decision tree)
* A more-complex model (e.g. random forest)
* A version of either the simple model or more-complex model with tuned hyperparameters
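
The three-model minimum above could be sketched as follows. This is a minimal illustration, assuming scikit-learn is available and using synthetic data from `make_classification` in place of a real dataset. The model choices, the parameter grid, and names such as `models` and `scores` are hypothetical placeholders, not requirements of the project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for a real dataset (assumption for illustration only)
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=4, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# 1. Simple, interpretable baseline
baseline = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# 2. More-complex model
forest = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_train, y_train)

# 3. Tuned version of the more-complex model (small, hypothetical grid)
param_grid = {"max_depth": [3, 5, None], "min_samples_leaf": [1, 5]}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=42),
    param_grid,
    cv=3,
    scoring="f1",
)
search.fit(X_train, y_train)
tuned = search.best_estimator_

# Compare all three models on the held-out test split
models = {"baseline": baseline, "forest": forest, "tuned": tuned}
scores = {name: model.score(X_test, y_test) for name, model in models.items()}
for name, acc in scores.items():
    print(f"{name}: test accuracy = {acc:.3f}")
```

In a real notebook, each step would be followed by an evaluation and a short justification for moving on to the next model, as the iterative approach requires.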

#### Classification Metrics

**You must choose appropriate classification metrics and use them to evaluate your models.** Choosing the right classification metrics is a key data science skill, and should be informed by data exploration and the business problem itself. You must then use these metrics to evaluate your model performance on both training and testing data.
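
One way to report the same metrics on both splits is sketched below. It assumes scikit-learn and an imbalanced synthetic dataset; the `evaluate` helper is a hypothetical name introduced here, and which metric matters most depends on the business problem.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Imbalanced synthetic data (about 90% / 10%), an assumption for illustration
X, y = make_classification(
    n_samples=1000, n_features=6, weights=[0.9, 0.1], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

model = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)


def evaluate(model, X, y):
    """Return several classification metrics for one data split (hypothetical helper)."""
    pred = model.predict(X)
    return {
        "accuracy": accuracy_score(y, pred),
        "precision": precision_score(y, pred, zero_division=0),
        "recall": recall_score(y, pred, zero_division=0),
        "f1": f1_score(y, pred, zero_division=0),
    }


train_metrics = evaluate(model, X_train, y_train)
test_metrics = evaluate(model, X_test, y_test)

# Side-by-side comparison: a large train/test gap suggests overfitting
for name in train_metrics:
    print(f"{name:>9}: train = {train_metrics[name]:.3f}, test = {test_metrics[name]:.3f}")
```

With imbalanced classes, accuracy alone can look strong even when recall on the minority class is poor, which is why several metrics are shown together.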

## Deliverables

There are three deliverables for this project:

* A **non-technical presentation**
* A **Jupyter Notebook**
* A **GitHub repository**

The deliverable requirements are almost the same as in the Phase 1 and Phase 2 projects. ***The only difference between the Phase 2 and Phase 3 project checklists is that the "Regression Results" element has been replaced with an "Evaluation" element.***

# Stakeholder: Maple Homes
# Problem: Maximize Salesmen Productivity
Real estate company Maple Homes employs several salesmen who are paid a commission for each home they sell. As is customary for salesmen, the commission is a percentage of the sale price. Salesmen, like all of us, only have so much time. How do they know which houses will yield the highest prices? What should they look for? Which three independent variables in a home indicate a high sale price, and thus a higher commission?

Maple Homes contracted FurPig industries to analyze county home sales data and give three recommendations to answer these questions.

# Functions
Imports for the libraries used in this notebook: pandas and NumPy, statsmodels, plotting tools, and scikit-learn.

In [1]:
#DataFrames and computation
import pandas as pd
import numpy as np

#Statsmodels for OLS modeling
import statsmodels.api as sm
from statsmodels.formula.api import ols
import statsmodels.stats.api as sms


#For plotting
import seaborn as sns
import matplotlib.pyplot as plt
#On matplotlib >= 3.6 this style is named 'seaborn-v0_8'; the name 'seaborn' was removed in 3.8
plt.style.use('seaborn')
import scipy.stats as stats

#To draw linear regression lines
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

#For combinations
from itertools import product

#Setting DataFrame Display settings
pd.set_option("display.max_columns", None)