# Apartment Price Prediction Case Study

You are a data scientist at Willoz, a real-estate online marketplace based in Albuquerque, NM. 

Your team has been contracted by a property aggregator to improve user engagement by offering a personalized recommendation widget. This feature will allow users to enter their apartment preferences (such as size and pet policy) and receive a prediction of a typical apartment price. You've been provided with a historical dataset of 10,000 rental listings scraped from various sources, containing fields such as bedrooms, price, location, and descriptions.

We will demo

Your task is to:
* Explore and clean the dataset,
* Engineer new features using the insights you've discovered from your EDA,
* Deploy your price predictor locally using a lightweight Streamlit app that takes in apartment features and returns a price.

Let's use the patterns we've learned about in class to complete this case study together.

## EDA

Let's get started with exploratory data analysis. 

* Load the dataset and identify the structure and content of each column. 
* Identify analytical & predictive questions to determine valuable analyses. 
* Generate summary statistics and visualizations on features such as price, bedrooms, bathrooms, square_feet, and cityname.  
* Investigate missing or inconsistent values and decide how to address them.  
* Identify correlations or relationships that might impact housing price or desirability.  

In [None]:
import pandas as pd
import numpy as np

import seaborn as sns
import matplotlib.pyplot as plt

### Identifying the Structure of the Dataset

In [None]:
rentals = pd.read_csv("apartments_for_rent.csv")

In [None]:
# identify the shape
...

In [None]:
# identify the columns
...

In [None]:
# take a look at the first few rows
...
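
The three inspection steps above can be sketched as follows. This is a minimal illustration on a tiny made-up DataFrame: `toy` and all of its values are invented stand-ins, not rows from `apartments_for_rent.csv`. The same attributes and methods should apply to `rentals` once it is loaded.

```python
import pandas as pd

# Toy stand-in for `rentals`; the values are invented for illustration only.
toy = pd.DataFrame(
    {
        "bedrooms": [1, 2, 3],
        "bathrooms": [1.0, 1.5, 2.0],
        "price": [950, 1400, 2100],
        "cityname": ["Albuquerque", "Denver", "Austin"],
    }
)

n_rows, n_cols = toy.shape        # (number of rows, number of columns)
column_names = list(toy.columns)  # column labels, in order
preview = toy.head(2)             # first two rows

print(n_rows, n_cols)
print(column_names)
print(preview)
toy.info()                        # dtypes and non-null counts per column
```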

## Forming Good Exploratory Questions

Let's consider good exploratory questions given the structure of our dataset and the types of variables our dataset contains. Consider these questions and how we can apply them to our dataset to ask specific questions about our columns.

* Which column are we attempting to predict?
* What numerical variables do we have? Which visualization techniques can we apply to these variables?
* What categorical variables do we have? Which visualization techniques can we apply to these variables?
* How can we explore the relationships between numeric vs numeric? What about categorical vs numeric? 

### Generate Visualizations

Let's perform both univariate and bivariate analysis.

### Investigate Missing, Inconsistent, or Outlier Values. Identify the "Shape" of your Data (Univariate Analysis)

Let's identify the distributions of our numeric columns, as well as our categorical columns. Additionally, let's observe if we have any null values.

In [None]:
...
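
One possible starting point for the univariate pass is sketched below, again on an invented `toy` frame rather than the real listings. It summarizes missingness and distribution shape numerically; a histogram (for example `toy["price"].hist()` or `sns.histplot`) would show the same skew visually.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `rentals`; the values are invented for illustration only.
toy = pd.DataFrame(
    {
        "price": [900, 1100, 1250, 1300, 15000],
        "bathrooms": [1.0, np.nan, 2.0, 1.0, np.nan],
        "pets_allowed": ["Cats", None, "Cats,Dogs", None, "Dogs"],
    }
)

# Fraction of missing values in each column.
null_share = toy.isna().mean()

# Count, mean, std, and quartiles for the numeric columns.
numeric_summary = toy.describe()

# Positive skew suggests a long right tail (a few very expensive listings).
price_skew = toy["price"].skew()

# Frequency of each category, keeping missing values as their own bucket.
pet_counts = toy["pets_allowed"].value_counts(dropna=False)

print(null_share)
print(numeric_summary)
print(price_skew)
print(pet_counts)
```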

### Identify correlations or relationships (Bivariate Analysis)

Now, let's observe if we have any clear relationships between our predictors & our target (price). Remember, our predictors could be categorical as well! Furthermore, even if our predictors are expressed as numbers, they can still *manifest* themselves as discrete when we don't expect decimal values and they are limited to a small range of numeric values.

In [None]:
...
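
A minimal sketch of both kinds of bivariate check, assuming an invented `toy` frame whose column names mirror the ones mentioned earlier. Numeric predictors are compared to `price` with a correlation, while discrete or categorical predictors are compared through a grouped median. Plots such as `sns.scatterplot` or `sns.boxplot` are the visual counterparts.

```python
import pandas as pd

# Toy stand-in for `rentals`; the values are invented for illustration only.
toy = pd.DataFrame(
    {
        "price": [900, 1200, 1500, 2100, 2600, 1000],
        "square_feet": [500, 700, 900, 1300, 1600, 650],
        "bedrooms": [1, 1, 2, 3, 3, 1],
        "cityname": ["A", "A", "B", "B", "B", "A"],
    }
)

# Numeric vs numeric: Pearson correlation of each predictor with price.
price_corr = toy[["price", "square_feet", "bedrooms"]].corr()["price"].drop("price")

# Discrete numeric vs numeric: typical price at each bedroom count.
median_by_bedrooms = toy.groupby("bedrooms")["price"].median()

# Categorical vs numeric: typical price in each city.
median_by_city = toy.groupby("cityname")["price"].median()

print(price_corr)
print(median_by_bedrooms)
print(median_by_city)
```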

### Observations

Looking at our distributions and null values, it appears that we do have outlier values as well as missing data. Furthermore, it appears that there are specific columns which **should not** be included in our prediction step, as they could bias our model toward specific samples. This includes columns like `id`, `category`, `title`, `body`, `address`, `latitude`, `longitude`, `source`, `time`, and others.

Remember, we want to be able to predict the price of a rental property given the independent features of an apartment that we have at the ready. An identifier such as `id` tells us nothing about the apartment itself, and much of what free-text fields such as `title` and `body` describe is already expressed in our sample through other columns such as `bathrooms` and `bedrooms`.

For our predictive columns, we must make an executive decision as to what we want to do with missing values. We could either:
* Drop columns with missing values,
* Drop rows with missing values (and remove roughly half of our dataset),
* Impute missing values,
* Or perform all 3 steps in some specific order. The order in which you apply these steps results in different datasets!
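
To see why the order matters, here is a small sketch on an invented `toy` frame. The 50% missingness threshold for dropping a column is an arbitrary assumption chosen for illustration, not a rule from this case study.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `rentals`; the values are invented for illustration only.
toy = pd.DataFrame(
    {
        "price": [900, 1200, 1500, 2100],
        "bathrooms": [1.0, np.nan, 2.0, 2.0],
        "address": [np.nan, np.nan, np.nan, "12 Main St"],
    }
)

# Order A: drop every row that has any missing value.
rows_first = toy.dropna()

# Order B: first drop columns that are mostly empty (assumed threshold: > 50%),
# then drop the rows that still contain missing values.
sparse_cols = toy.columns[toy.isna().mean() > 0.5]
cols_first = toy.drop(columns=sparse_cols).dropna()

print(len(rows_first), "row(s) left when dropping rows first")
print(len(cols_first), "row(s) left when dropping sparse columns first")
```

Dropping the sparse column first keeps more rows, because those rows are no longer penalized for a column we were going to discard anyway.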

Take a close look at your dataset, however! A "None" might not always mean `NA` in this dataset. For example, what might "None" mean in the context of the `pets_allowed` or `amenities` columns?
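
One possible treatment is sketched below, under a loud assumption: that a missing `pets_allowed` value means "no pets" rather than "unknown". The `"No pets"` label and the choice of median imputation for `bathrooms` are illustrative choices, and the `toy` values are invented.

```python
import numpy as np
import pandas as pd

# Toy stand-in for `rentals`; the values are invented for illustration only.
toy = pd.DataFrame(
    {
        "pets_allowed": ["Cats", np.nan, "Cats,Dogs", np.nan],
        "bathrooms": [1.0, np.nan, 2.0, 1.0],
    }
)

cleaned = toy.copy()

# Assumption: a missing pets_allowed entry is a real category ("no pets"),
# so we label it instead of dropping or imputing it.
cleaned["pets_allowed"] = cleaned["pets_allowed"].fillna("No pets")

# A genuinely missing numeric value is imputed with the column median.
cleaned["bathrooms"] = cleaned["bathrooms"].fillna(cleaned["bathrooms"].median())

print(cleaned)
```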

We should also decide what to do with outlier values. We could either:
* Transform columns to fit a normal distribution,
* Or filter our dataset so that atypical values are excluded from the analysis.
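
Both options can be sketched on an invented `toy` price column. The log transform and the 1.5 × IQR rule are common choices used here as assumptions; other transformations or cutoffs may suit the real data better.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the price column; the values are invented for illustration.
toy = pd.DataFrame({"price": [900, 1000, 1100, 1200, 1300, 1400, 25000]})

# Option 1: compress the long right tail with a log transform.
# log1p(x) computes log(1 + x), which stays defined when x is 0.
toy["log_price"] = np.log1p(toy["price"])

# Option 2: keep only rows within 1.5 * IQR of the middle 50% of prices.
q1, q3 = toy["price"].quantile([0.25, 0.75])
iqr = q3 - q1
in_range = toy["price"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
filtered = toy[in_range]

print(toy)
print(filtered)
```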

More information on possible data transformations is listed here: https://developers.google.com/machine-learning/crash-course/numerical-data/normalization.

Furthermore, it looks like we have data that could be transformed to ensure we have as much data as possible available for greater predictive capabilities. This includes:
* Encoding specific categorical predictors in our analysis,
* Converting values with alternative units of measurement into one standard unit.

Which data transformation technique have we learned about before which can assist us in this?
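
As one candidate answer, the sketch below shows one-hot encoding with `pd.get_dummies`, plus a unit conversion. The `price_type` column, its `"Weekly"`/`"Monthly"` labels, and the 52/12 weeks-per-month factor are hypothetical assumptions for illustration; check which columns in the real dataset actually mix units before adapting it.

```python
import pandas as pd

# Toy stand-in for `rentals`; column values are invented, and `price_type`
# is a hypothetical column marking the unit in which `price` is quoted.
toy = pd.DataFrame(
    {
        "price": [300, 1200, 1500],
        "price_type": ["Weekly", "Monthly", "Monthly"],
        "pets_allowed": ["Cats", "No pets", "Cats,Dogs"],
    }
)

# Convert weekly prices to a monthly equivalent (assumed factor: 52 / 12).
WEEKS_PER_MONTH = 52 / 12
is_weekly = toy["price_type"] == "Weekly"
toy["monthly_price"] = toy["price"].where(~is_weekly, toy["price"] * WEEKS_PER_MONTH)

# One-hot encode the categorical predictor into 0/1 indicator columns.
encoded = pd.get_dummies(toy, columns=["pets_allowed"], prefix="pets", dtype=int)

print(encoded)
```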

Remember, there is no one concrete methodology. You must be able to choose transformation techniques and justify them!