## Module 2 Final Project Submission
* Name: Vivienne DiFrancesco
* Pace: Full Time
* Instructor: James Irving

# Introduction

# Obtaining the data

In [None]:
# Importing libraries that I will use
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
import matplotlib.ticker as mtick
%matplotlib inline

# Setting default seaborn setting for my visuals
sns.set(style="whitegrid")

# Suppressing warnings
import warnings
warnings.filterwarnings('ignore')

# Importing the statsmodels packages I will use
import statsmodels.formula.api as smf
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Importing scikit learn packages I will use
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, make_scorer
from sklearn.model_selection import cross_val_score
from sklearn.metrics import r2_score

In [None]:
# Setting pandas to display max columns and rows
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)

# Turning off scientific notation in pandas
pd.set_option('display.float_format', lambda x: '%.3f' % x)

In [None]:
# Loading in the data
df = pd.read_csv('kc_house_data.csv')
df.head()

When I attempted to set the index with the verify_integrity=True parameter, pandas raised an error saying there were duplicate keys. That is how I knew that some houses had been sold more than once in the dataset. I saved those duplicated rows as their own dataframe so I could return to them later for EDA.

In [None]:
# Making a new dataframe to look at later of houses sold multiple times
houses_resold = df[df.duplicated(keep=False, subset=['id'])]
houses_resold.head()

In [None]:
# Set the index to the id
df.set_index('id', inplace=True)
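
The duplicate-key error described above can be reproduced on a small example. This is a minimal sketch on a made-up frame (`toy` and its values are hypothetical, not taken from kc_house_data.csv); it shows `set_index` refusing duplicate ids and `duplicated(keep=False)` pulling out every row of a resold house.

```python
import pandas as pd

# Toy stand-in for the housing data; id 1 appears twice (a resale)
toy = pd.DataFrame({"id": [1, 1, 2], "price": [300000, 350000, 500000]})

try:
    toy.set_index("id", verify_integrity=True)
    duplicate_error = None
except ValueError as err:
    # pandas refuses to build an index that contains duplicate keys
    duplicate_error = err

# Rows whose id appears more than once, keeping every occurrence
toy_resold = toy[toy.duplicated(subset=["id"], keep=False)]
```

Using `keep=False` matters here: the default would keep only the later sales and drop the first sale of each house from the subset.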

In [None]:
# Checking the number of rows and columns
df.shape

In [None]:
# Checking the data types and where there might be nulls
df.info()

# Scrub

## Addressing the price column

I started with the price column since that is the target. I used describe() to get a feel for the distribution, then looked at value_counts() to make sure there were no placeholder values (a repeated 0, for example) that stand in for missing data without registering as nulls.

In [None]:
# Making price an integer instead of a float
df.price = df.price.astype('int64')

In [None]:
# Checking the stats for the column to see if everything looks normal
df.price.describe()

In [None]:
# Double checking that there aren't rogue values hiding in the data
df.price.value_counts()[:20]
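
To illustrate why value_counts() is useful alongside isna(), here is a small sketch on a made-up column (`toy_price` is hypothetical, and using 0 as the placeholder is an assumption for the example). A placeholder value is invisible to the null check but stands out as an unusually frequent entry.

```python
import numpy as np
import pandas as pd

# Hypothetical price column where 0 stands in for a missing sale price
toy_price = pd.Series([450000, 0, 325000, 0, 610000, np.nan])

# Only the true NaN is counted as missing
n_null = toy_price.isna().sum()

# The repeated placeholder rises to the top of the frequency table
top_values = toy_price.value_counts()

# An implausible minimum in describe() is another hint
lowest = toy_price.describe()["min"]
```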

## Dealing with NA values

I then turned to the other columns to deal with NA values. For each one, I filled the NA values, cast the column to the correct data type, and then used value_counts() to check for rogue entries that may have been missed.

In [None]:
# Looking at all NA values in all columns
df.isna().sum()

I mapped the entries that were missing a waterfront value, and at least some of them do appear to be on the water, so filling every null with 0 would be wrong. I decided to fill the null values randomly, in the same ratio of 0 to 1 that already exists in the dataset.

In [None]:
# Creating a sub-dataframe of the missing entries to use for visualizing
waterfront_check = df.copy()
waterfront_check = waterfront_check[waterfront_check['waterfront'].isna()]

In [None]:
# Saving the file

# waterfront_check.to_csv(r'C:\Users\drudi\DataScience\Module02\FinalProject\waterfront_check.csv')

This map was created by loading the waterfront_check dataframe into Tableau Public. The screenshot is a zoomed-in view that makes individual entries easier to see. The full image can be viewed and downloaded from https://public.tableau.com/profile/vivienne4370

<img src="waterfrontcheck.png">

In [None]:
# Checking the percentages of the different values
df.waterfront.value_counts(normalize=True)

In [None]:
# Checking value counts before filling the missing values
df.waterfront.value_counts()

In [None]:
# Setting the probability ratios based on the value counts
prob = [.992, .008]

# Filling the missing values with either 0 or 1 using the probability
df["waterfront"] = df["waterfront"].apply(lambda x: np.random.choice([0, 1], p=prob) if (np.isnan(x)) else x)

In [None]:
# Making sure the value counts changed appropriately
df["waterfront"].value_counts()

In [None]:
# Changing the datatype
df.waterfront = df.waterfront.astype('int64')
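
As a variation on the fill above, the proportions can be read from the data instead of typed in by hand, and the draws can be made repeatable. This is a sketch on a made-up column; `toy_wf`, the seed of 42, and the use of `np.random.default_rng` are assumptions for illustration, not what the cells above do.

```python
import numpy as np
import pandas as pd

# Fixed seed (an assumption) so the random fill gives the same result each run
rng = np.random.default_rng(42)

# Toy waterfront column with two missing entries
toy_wf = pd.Series([0.0, 0.0, 1.0, np.nan, 0.0, np.nan, 0.0, 0.0])

# Observed share of each value, ignoring the missing entries
shares = toy_wf.value_counts(normalize=True)

# Draw one replacement per missing entry, weighted by those shares
missing = toy_wf.isna()
draws = rng.choice(shares.index.to_numpy(), size=missing.sum(), p=shares.to_numpy())

toy_wf_filled = toy_wf.copy()
toy_wf_filled[missing] = draws
toy_wf_filled = toy_wf_filled.astype("int64")
```

Drawing all replacements in one call avoids a Python-level lambda per row, and the seed means rerunning the notebook does not change which houses are marked as waterfront.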

I dropped the view column since it is not clear what this data represents. It does not represent the views from the house but likely has something to do with listing views. Without knowing what it could mean, I dropped it to avoid any confusion from the column.

In [None]:
# Dropping the view column
df.drop(columns='view', inplace=True)

I decided to fill the yr_renovated column with zeros because I thought it was a fair assumption that if the entry is null, then the house probably hasn't been renovated.

In [None]:
# Filling NA values with 0 (assigning back instead of using inplace on a column)
df['yr_renovated'] = df['yr_renovated'].fillna(0)

In [None]:
# Checking for rogue values
df.yr_renovated.value_counts()[:10]

In [None]:
# Changing the datatype
df.yr_renovated = df.yr_renovated.astype('int64')
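
The order used in this section, filling first and casting second, is not arbitrary: a float column that still contains NaN cannot be converted to int64. The sketch below shows this on a made-up series (`toy_yr` and its values are hypothetical).

```python
import numpy as np
import pandas as pd

# Toy yr_renovated column with one missing entry
toy_yr = pd.Series([1991.0, np.nan, 0.0, 2005.0])

try:
    toy_yr.astype("int64")
    cast_error = None
except ValueError as err:
    # NaN has no integer representation, so the cast fails
    cast_error = err

# Filling the gap first lets the cast succeed
toy_yr_clean = toy_yr.fillna(0).astype("int64")
```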

In [None]:
# Verifying that all NAs were dealt with
df.isna().sum()

## Checking for strange values in other columns

I looked through the rest of my columns for rogue entries.