# Chapter 2: Overview of the Data Mining Process



Import Libraries

```python
%matplotlib inline
from pathlib import Path
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.linear_model import LinearRegression

import matplotlib.pylab as plt
```

## Data Exploration
Load the West Roxbury data set

Determine the shape of the data frame. It has 5802 rows and 14 columns.

Show the top rows of the dataframe
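A minimal sketch of these three steps follows. Because the path to the West Roxbury file is not given here, the example reads a tiny in-memory CSV instead. The column names and values in it are made-up placeholders, not the real data.

```python
from io import StringIO

import pandas as pd

# Stand-in for the real CSV file: three made-up rows, a few assumed columns.
csv_text = StringIO(
    "TOTAL VALUE ,TAX,LOT SQFT ,ROOMS\n"
    "344.2,4330,9965,6\n"
    "412.6,5190,6590,10\n"
    "330.1,4152,7500,8\n"
)

# With the real data, pass the path of the CSV file instead of csv_text.
housing_df = pd.read_csv(csv_text)

print(housing_df.shape)   # (number of rows, number of columns)
print(housing_df.head())  # top rows, five by default
```

For the full data set, `shape` should report the 5802 rows and 14 columns mentioned above.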

## Cleanup
Preprocessing and cleaning up data is an important aspect of data analysis. 

Show the column names.

Note that some column titles end with spaces and some consist of two space-separated words. For further analysis, it is more convenient to have column names that are single words.


We therefore strip trailing spaces and replace the remaining spaces with an underscore (`_`). Instead of using the `rename` method, we create a modified copy of `columns` and assign it to the `columns` attribute of the data frame.
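The renaming step can be sketched as follows. The small data frame is a made-up stand-in whose column names only mimic the problem described above.

```python
import pandas as pd

# Made-up frame: two names end with a space, two contain an inner space.
housing_df = pd.DataFrame(
    [[344.2, 9965, 6]], columns=["TOTAL VALUE ", "LOT SQFT ", "ROOMS"]
)

# Build a cleaned copy of the names and assign it back in one step.
housing_df.columns = [name.strip().replace(" ", "_") for name in housing_df.columns]

print(list(housing_df.columns))
```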

## Accessing subsets of the data
Pandas provides two indexers for accessing rows in a data frame: `loc` and `iloc`. `loc` is the more general of the two and selects rows by label. `iloc`, on the other hand, accepts only integer positions. To specify a range of rows, use slice notation, e.g. `0:9`. Note that a `loc` slice includes its end label, whereas an `iloc` slice excludes its end position.

To show the first four rows of the data frame, you can use the following commands.
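As a sketch, both indexers can return the first four rows. The data frame below is a made-up stand-in with the default integer index, so labels and positions coincide.

```python
import pandas as pd

# Made-up stand-in data with the default index 0, 1, 2, ...
housing_df = pd.DataFrame({
    "TOTAL_VALUE": [344.2, 412.6, 330.1, 498.6, 331.5, 337.4],
    "ROOMS": [6, 10, 8, 9, 7, 6],
})

first_by_label = housing_df.loc[0:3]      # labels 0 to 3, end label included
first_by_position = housing_df.iloc[0:4]  # positions 0 to 3, end excluded

print(first_by_label)
print(first_by_label.equals(first_by_position))
```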

Show the first ten rows of the first column

Show the fifth row of the first 10 columns. The `iloc` indexer allows specifying the rows and columns within one set of brackets: `dataframe.iloc[rows, columns]`.

If you prefer to preserve the data frame format, use a slice for the rows as well.
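The selections described above might look like this. The frame is a numeric placeholder with hypothetical column names `C0` to `C11`; only the `iloc` calls matter.

```python
import numpy as np
import pandas as pd

# Placeholder frame: 12 rows and 12 columns of consecutive integers.
housing_df = pd.DataFrame(
    np.arange(144).reshape(12, 12), columns=[f"C{i}" for i in range(12)]
)

first_column = housing_df.iloc[0:10, 0]       # first ten rows of the first column
fifth_row = housing_df.iloc[4, 0:10]          # single row index: result is a Series
fifth_row_df = housing_df.iloc[4:5, 0:10]     # row slice: result stays a DataFrame

print(type(fifth_row).__name__, type(fifth_row_df).__name__)
```

Using the slice `4:5` instead of the integer `4` selects the same row but keeps the two-dimensional data frame layout.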

Use the `pd.concat` function if you want to combine non-consecutive columns into a new data frame. The `axis` argument specifies the dimension along which the concatenation happens: 0 for rows, 1 for columns.
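A short sketch of combining two non-adjacent column ranges. The frame and its column names `C0` to `C7` are hypothetical placeholders.

```python
import numpy as np
import pandas as pd

# Placeholder frame: 8 rows and 8 columns of consecutive integers.
housing_df = pd.DataFrame(
    np.arange(64).reshape(8, 8), columns=[f"C{i}" for i in range(8)]
)

# Rows 4 and 5 of columns 0-1 and 4-5, glued together side by side.
subset = pd.concat(
    [housing_df.iloc[4:6, 0:2], housing_df.iloc[4:6, 4:6]], axis=1
)

print(subset)
```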

A single column can also be subset with a slice.

Pandas provides a number of ways to access statistics of the columns.

A data frame also has the method `describe`, which returns a number of common statistics for each numeric column.
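For illustration, a few of the per-column statistics and the `describe` summary on a made-up four-row frame:

```python
import pandas as pd

# Made-up stand-in data.
housing_df = pd.DataFrame({
    "TOTAL_VALUE": [300.0, 350.0, 400.0, 450.0],
    "ROOMS": [6, 7, 8, 9],
})

# Statistics of a single column.
print("Mean:", housing_df["TOTAL_VALUE"].mean())
print("Median:", housing_df["TOTAL_VALUE"].median())
print("Min / max:", housing_df["TOTAL_VALUE"].min(), housing_df["TOTAL_VALUE"].max())
print("Std. dev.:", housing_df["TOTAL_VALUE"].std())

# Summary table with count, mean, std, min, quartiles, and max per column.
summary = housing_df.describe()
print(summary)
```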

## Sampling
Use the `sample` method to retrieve a random sample of observations. Here we sample 5 observations without replacement.

The `sample` method allows specifying weights for the individual rows. We use this here to oversample houses with over 10 rooms.
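Both kinds of sampling can be sketched as below. The data and the weight values (0.9 versus 0.01) are assumptions chosen for illustration; `random_state` is set only to make the result repeatable.

```python
import pandas as pd

# Made-up stand-in data: 30 houses, 10 of them with more than 10 rooms.
housing_df = pd.DataFrame({"ROOMS": [6, 7, 8, 9, 11, 12] * 5})

# Simple random sample of 5 rows without replacement (the default).
simple_sample = housing_df.sample(5, random_state=1)

# Give houses with over 10 rooms a much larger weight so they are oversampled.
weights = [0.9 if rooms > 10 else 0.01 for rooms in housing_df["ROOMS"]]
weighted_sample = housing_df.sample(5, weights=weights, random_state=1)

print(simple_sample)
print(weighted_sample)
```

Pandas rescales the weights so that they sum to one, so only their relative size matters.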

## Variable Types

The REMODEL column is a factor, so we need to change its type.

Other columns also have types.

It's also possible to show the data types of all columns.
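A sketch of the type conversion, assuming that REMODEL holds a small set of text labels. The three labels used here are placeholders.

```python
import pandas as pd

# Made-up stand-in data; the REMODEL labels are assumed values.
housing_df = pd.DataFrame({
    "ROOMS": [6, 8, 7],
    "REMODEL": ["None", "Recent", "Old"],
})

# Convert the text column to pandas' categorical (factor) type.
housing_df["REMODEL"] = housing_df["REMODEL"].astype("category")

print(housing_df["REMODEL"].cat.categories)  # the distinct levels
print(housing_df["ROOMS"].dtype)             # type of a single column
print(housing_df.dtypes)                     # types of all columns
```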

## Dummy / One Hot Encoding Variables
Pandas provides a method to convert factors into dummy variables.
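One way to do this is `pd.get_dummies`. The sketch below uses made-up data with assumed REMODEL labels. Dropping the first level with `drop_first=True` is an optional choice that avoids a redundant column; it is not necessarily what the original notebook did.

```python
import pandas as pd

# Made-up stand-in data; the REMODEL labels are assumed values.
housing_df = pd.DataFrame({
    "ROOMS": [6, 8, 7],
    "REMODEL": ["None", "Recent", "Old"],
})
housing_df["REMODEL"] = housing_df["REMODEL"].astype("category")

# One dummy column per category level.
dummies = pd.get_dummies(housing_df, prefix_sep="_")

# Same, but without the first level, which the other columns already imply.
reduced = pd.get_dummies(housing_df, prefix_sep="_", drop_first=True)

print(list(dummies.columns))
print(list(reduced.columns))
```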

## Handling Missing Data
To illustrate missing data procedures, we first set a few entries for bedrooms to missing (NA). We then impute these missing values using the median of the remaining values.

Replace the missing values using the median of the remaining values.
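The two steps could look like this. The data are made up, the column is stored as floats so that `NaN` can be assigned without a type change, and blanking exactly two rows is an arbitrary choice.

```python
import numpy as np
import pandas as pd

# Made-up stand-in data, stored as floats so NaN fits into the column.
housing_df = pd.DataFrame({"BEDROOMS": [2.0, 3.0, 3.0, 4.0, 3.0, 5.0, 3.0, 2.0]})

# Step 1: pick a few random rows and set their BEDROOMS value to missing.
missing_rows = housing_df.sample(2, random_state=1).index
housing_df.loc[missing_rows, "BEDROOMS"] = np.nan
print("Valid BEDROOMS values:", housing_df["BEDROOMS"].count())

# Step 2: impute with the median of the remaining (non-missing) values.
median_bedrooms = housing_df["BEDROOMS"].median()
housing_df["BEDROOMS"] = housing_df["BEDROOMS"].fillna(median_bedrooms)
print("Valid BEDROOMS values:", housing_df["BEDROOMS"].count())
```

`median` skips missing values by default, so it is computed from the remaining entries only.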


## Normalizing / Scaling Data

Standardizing the dataset may raise a `DataConversionWarning`. It informs you that the integer columns in the data frame are automatically converted to real numbers (`float64`). This is expected, so you can safely ignore the warning. If you want to avoid it, you can explicitly convert the integer columns to real numbers:
<pre>
# Option 1: identify all integer columns and change their type to float64
intColumns = [c for c in housing_df.columns
              if pd.api.types.is_integer_dtype(housing_df[c])]
housing_df[intColumns] = housing_df[intColumns].astype('float64')
</pre>
Alternatively, you can suppress the warning as follows:
<pre>
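# Option 2: use the warnings package to suppress the display of the warning
import warnings
from sklearn.preprocessing import StandardScaler

# Assumption: scaler is a StandardScaler, matching the standardization described above
scaler = StandardScaler()
with warnings.catch_warnings():
    warnings.simplefilter('ignore')
    norm_df = pd.DataFrame(scaler.fit_transform(housing_df),
                           index=housing_df.index, columns=housing_df.columns)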
</pre>
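For reference, a minimal standardization sketch on made-up, already float-valued data. It assumes that scikit-learn's `StandardScaler` is the scaler in use, which the text above does not state explicitly.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up stand-in data, already float64 so no conversion is needed.
housing_df = pd.DataFrame({
    "TOTAL_VALUE": [300.0, 350.0, 400.0, 450.0],
    "ROOMS": [6.0, 7.0, 8.0, 9.0],
})

# Standardize: subtract each column's mean and divide by its standard deviation.
scaler = StandardScaler()
norm_df = pd.DataFrame(
    scaler.fit_transform(housing_df),
    index=housing_df.index,
    columns=housing_df.columns,
)

print(norm_df)
```

`fit_transform` returns a plain NumPy array, which is why the result is wrapped in a new data frame that reuses the original index and column names.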

## Splitting Datasets
Split the dataset into training (60%) and validation (40%) sets. Randomly sample 60% of the dataset into a new data frame `trainData`. The remaining 40% serve as validation.
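A sketch of this split using only pandas. The ten-row frame is made up, the name `validData` is an assumption mirroring `trainData`, and `random_state` is set only for repeatability.

```python
import pandas as pd

# Made-up stand-in data with 10 rows.
housing_df = pd.DataFrame({
    "TOTAL_VALUE": list(range(300, 400, 10)),
    "ROOMS": [6, 7, 8, 9, 6, 7, 8, 9, 6, 7],
})

# Randomly draw 60% of the rows for training.
trainData = housing_df.sample(frac=0.6, random_state=1)

# Everything that was not drawn becomes the validation set.
validData = housing_df.drop(trainData.index)

print("Training:", trainData.shape)
print("Validation:", validData.shape)
```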

Partition the dataset into training (50%), validation (30%), and test sets (20%). 
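One possible way to obtain three partitions is to call `train_test_split` twice: first split off 50% for training, then divide the remainder 60/40, which gives 30% and 20% of the full data. The data and the names `temp`, `validData`, and `testData` are assumptions for this sketch.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up stand-in data with 100 rows.
housing_df = pd.DataFrame({
    "TOTAL_VALUE": [300.0 + i for i in range(100)],
    "ROOMS": [6 + i % 4 for i in range(100)],
})

# First split: 50% training, 50% held back.
trainData, temp = train_test_split(housing_df, train_size=0.5, random_state=1)

# Second split: 60% of the held-back half for validation, 40% for testing.
validData, testData = train_test_split(temp, train_size=0.6, random_state=1)

print("Training:", trainData.shape)
print("Validation:", validData.shape)
print("Test:", testData.shape)
```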

## Linear Regression
Let's create a linear regression model to predict `TOTAL_VALUE`.

Exclude TAX from analysis

Predict the validation data
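The modelling steps can be sketched end to end as follows. Since the real file is not loaded here, the example generates synthetic data; the predictor names, the coefficients, and the tax rate are all made up. Only the workflow (exclude `TAX`, fit on training data, predict the validation data) follows the text.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data with a known linear relationship plus noise.
rng = np.random.default_rng(1)
n = 200
living_area = rng.uniform(1000, 3000, n)
rooms = rng.integers(4, 12, n)
total_value = 100 + 0.1 * living_area + 10 * rooms + rng.normal(0, 5, n)
housing_df = pd.DataFrame({
    "TOTAL_VALUE": total_value,
    "TAX": 12.5 * total_value,  # made-up rate, for illustration only
    "LIVING_AREA": living_area,
    "ROOMS": rooms,
})

# Exclude the outcome and TAX from the predictors.
excludeColumns = ("TOTAL_VALUE", "TAX")
predictors = [c for c in housing_df.columns if c not in excludeColumns]
outcome = "TOTAL_VALUE"

# Partition the data, then fit the model on the training part only.
train_X, valid_X, train_y, valid_y = train_test_split(
    housing_df[predictors], housing_df[outcome], test_size=0.4, random_state=1
)
model = LinearRegression()
model.fit(train_X, train_y)

# Predict the validation data and collect actual, predicted, and residual.
valid_pred = model.predict(valid_X)
valid_results = pd.DataFrame({
    "TOTAL_VALUE": valid_y,
    "predicted": valid_pred,
    "residual": valid_y - valid_pred,
})
print(valid_results.head())
```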

## Error Metrics
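As a small illustration, common error metrics can be computed from actual and predicted values with NumPy and `r2_score`. The four value pairs below are made up; with a fitted model, the validation outcomes and predictions would take their place. The choice of metrics is an assumption, not necessarily the set used in the original notebook.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up actual and predicted values.
actual = np.array([300.0, 350.0, 400.0, 450.0])
predicted = np.array([310.0, 340.0, 405.0, 445.0])
residuals = actual - predicted

mean_error = residuals.mean()            # average signed error (bias)
mae = np.abs(residuals).mean()           # mean absolute error
rmse = np.sqrt((residuals ** 2).mean())  # root mean squared error
r2 = r2_score(actual, predicted)         # proportion of variance explained

print(f"ME: {mean_error:.3f}  MAE: {mae:.3f}  RMSE: {rmse:.3f}  R2: {r2:.3f}")
```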
