# 1. Introduction

To understand how linear regression works, we've so far used only features from the training dataset that contained no missing values and were already in a convenient numeric representation. In this mission, we'll explore how to transform some of the remaining features so we can use them in our model.

Broadly, the `process of transforming existing features and creating new ones is known as feature engineering`. **Feature engineering is a bit of an art, and knowledge of the specific domain (in this case, real estate) can help you create better features.**

`In this mission, we'll focus on some domain-independent strategies that apply to most problems.`

In the first half of this mission, we'll focus only on columns that contain no missing values but still aren't in the proper format to use in a linear regression model. In the latter half of this mission, we'll explore some ways to deal with missing values.

**Amongst the columns that don't contain missing values, some of the common issues include:**

* the column is `not numerical` (e.g. a zoning code represented using text)
* the column is `numerical but not ordinal` (e.g. zip code values, where a larger number doesn't mean more of anything)
* the column is `numerical but doesn't represent the kind of relationship we want with the target column` (e.g. year values)

In [1]:
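import pandas as pd

# The Ames housing file is tab-separated.
house = pd.read_csv('AmesHousing.txt', delimiter='\t')
# First 1460 rows for training, the remaining rows for testing.
train = house[0:1460]
test = house[1460:]

# Number of missing values in each training column.
train_null_counts = train.isnull().sum()
print(train_null_counts)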

Order                0
PID                  0
MS SubClass          0
MS Zoning            0
Lot Frontage       249
Lot Area             0
Street               0
Alley             1351
Lot Shape            0
Land Contour         0
Utilities            0
Lot Config           0
Land Slope           0
Neighborhood         0
Condition 1          0
Condition 2          0
Bldg Type            0
House Style          0
Overall Qual         0
Overall Cond         0
Year Built           0
Year Remod/Add       0
Roof Style           0
Roof Matl            0
Exterior 1st         0
Exterior 2nd         0
Mas Vnr Type        11
Mas Vnr Area        11
Exter Qual           0
Exter Cond           0
                  ... 
Bedroom AbvGr        0
Kitchen AbvGr        0
Kitchen Qual         0
TotRms AbvGrd        0
Functional           0
Fireplaces           0
Fireplace Qu       717
Garage Type         74
Garage Yr Blt       75
Garage Finish       75
Garage Cars          0
Garage Area          0
Garage Qual

In [2]:
# dataframe with no missing values
df_no_mv=train[train_null_counts[train_null_counts==0].index]

In [3]:
df_no_mv.isnull().sum()

Order              0
PID                0
MS SubClass        0
MS Zoning          0
Lot Area           0
Street             0
Lot Shape          0
Land Contour       0
Utilities          0
Lot Config         0
Land Slope         0
Neighborhood       0
Condition 1        0
Condition 2        0
Bldg Type          0
House Style        0
Overall Qual       0
Overall Cond       0
Year Built         0
Year Remod/Add     0
Roof Style         0
Roof Matl          0
Exterior 1st       0
Exterior 2nd       0
Exter Qual         0
Exter Cond         0
Foundation         0
Heating            0
Heating QC         0
Central Air        0
Electrical         0
1st Flr SF         0
2nd Flr SF         0
Low Qual Fin SF    0
Gr Liv Area        0
Full Bath          0
Half Bath          0
Bedroom AbvGr      0
Kitchen AbvGr      0
Kitchen Qual       0
TotRms AbvGrd      0
Functional         0
Fireplaces         0
Garage Cars        0
Garage Area        0
Paved Drive        0
Wood Deck SF       0
Open Porch SF

# 2. Categorical Features

You'll notice that some of the columns in the data frame `df_no_mv` contain string values. `If these columns contain only a limited set of unique values, they're known as categorical features`. As the name suggests, `a categorical feature groups a specific training example into a specific category`.

In [4]:
df_no_mv.head()

Unnamed: 0,Order,PID,MS SubClass,MS Zoning,Lot Area,Street,Lot Shape,Land Contour,Utilities,Lot Config,...,Enclosed Porch,3Ssn Porch,Screen Porch,Pool Area,Misc Val,Mo Sold,Yr Sold,Sale Type,Sale Condition,SalePrice
0,1,526301100,20,RL,31770,Pave,IR1,Lvl,AllPub,Corner,...,0,0,0,0,0,5,2010,WD,Normal,215000
1,2,526350040,20,RH,11622,Pave,Reg,Lvl,AllPub,Inside,...,0,0,120,0,0,6,2010,WD,Normal,105000
2,3,526351010,20,RL,14267,Pave,IR1,Lvl,AllPub,Corner,...,0,0,0,0,12500,6,2010,WD,Normal,172000
3,4,526353030,20,RL,11160,Pave,Reg,Lvl,AllPub,Corner,...,0,0,0,0,0,4,2010,WD,Normal,244000
4,5,527105010,60,RL,13830,Pave,IR1,Lvl,AllPub,Inside,...,0,0,0,0,0,3,2010,WD,Normal,189900


In [5]:
df_no_mv['MS Zoning'].value_counts()

RL         1123
RM          232
FV           83
RH           11
C (all)      10
I (all)       1
Name: MS Zoning, dtype: int64

In [6]:
df_no_mv['Utilities'].value_counts()

AllPub    1457
NoSewr       2
NoSeWa       1
Name: Utilities, dtype: int64

In [7]:
df_no_mv['Sale Condition'].value_counts()

Normal     1267
Abnorml      98
Partial      63
Family       18
Alloca       14
Name: Sale Condition, dtype: int64

In [8]:
df_no_mv['Sale Type'].value_counts()

WD       1309
New        59
COD        54
ConLD      18
ConLI       6
ConLw       5
Oth         4
Con         4
CWD         1
Name: Sale Type, dtype: int64

In [9]:
df_no_mv['Lot Shape'].value_counts()

Reg    940
IR1    479
IR2     34
IR3      7
Name: Lot Shape, dtype: int64

`To use these features in our model, we need to transform them into numerical representations.` Thankfully, pandas makes this easy because the library has a `special categorical data type.` We can convert a column to the categorical data type using the `pandas.Series.astype()` method. In current pandas versions this also works for columns with missing values (the cells below convert columns such as Alley without error); a missing value stays NaN and gets the code -1:

`train['Utilities'] = train['Utilities'].astype('category')`

When a column is converted to the categorical data type, pandas assigns a code to each unique value in the column. Unless we access these codes directly, most of the pandas manipulation operations that work for string columns will work for categorical ones as well.

We need to use the `.cat` accessor followed by the `.codes` property to access the underlying numerical representation of a column:

`train['Utilities'].cat.codes`
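
As a small illustration on made-up data (the Series below is hypothetical, not taken from the Ames dataset), the following sketch converts a text column to the categorical type and reads back the categories and their integer codes. It assumes a recent pandas version, where a missing value is kept as NaN and reported with the code -1.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column; the labels mimic the Utilities column.
utilities = pd.Series(['AllPub', 'NoSewr', 'AllPub', np.nan, 'NoSeWa'])

utilities_cat = utilities.astype('category')

# Categories are the sorted unique non-missing values.
categories = list(utilities_cat.cat.categories)

# Each value is replaced by the position of its category; NaN maps to -1.
codes = utilities_cat.cat.codes.tolist()

print(categories)
print(codes)
```

Because the codes follow the sort order of the labels, they carry no information about which category is "larger" in any real-world sense.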

## TODO
* Convert all of the text columns in train to the categorical data type.
* Select the Utilities column, return the categorical codes, and display the unique value counts for those codes: train['Utilities'].cat.codes.value_counts()

In [10]:
traiin=train.copy()

len(traiin.select_dtypes(include=['category']).columns)

0

In [11]:
text_cols=traiin.select_dtypes(include=['object']).columns

for col in text_cols:
    print(col+":",len(traiin[col].unique()))
    traiin[col]=traiin[col].astype('category')

MS Zoning: 6
Street: 2
Alley: 3
Lot Shape: 4
Land Contour: 4
Utilities: 3
Lot Config: 5
Land Slope: 3
Neighborhood: 26
Condition 1: 9
Condition 2: 6
Bldg Type: 5
House Style: 8
Roof Style: 6
Roof Matl: 5
Exterior 1st: 14
Exterior 2nd: 16
Mas Vnr Type: 5
Exter Qual: 4
Exter Cond: 5
Foundation: 6
Bsmt Qual: 5
Bsmt Cond: 6
Bsmt Exposure: 5
BsmtFin Type 1: 7
BsmtFin Type 2: 7
Heating: 6
Heating QC: 4
Central Air: 2
Electrical: 4
Kitchen Qual: 5
Functional: 7
Fireplace Qu: 6
Garage Type: 7
Garage Finish: 4
Garage Qual: 6
Garage Cond: 6
Paved Drive: 3
Pool QC: 2
Fence: 5
Misc Feature: 4
Sale Type: 9
Sale Condition: 5


In [12]:
train.head()

Unnamed: 0,Order,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,...,Pool Area,Pool QC,Fence,Misc Feature,Misc Val,Mo Sold,Yr Sold,Sale Type,Sale Condition,SalePrice
0,1,526301100,20,RL,141.0,31770,Pave,,IR1,Lvl,...,0,,,,0,5,2010,WD,Normal,215000
1,2,526350040,20,RH,80.0,11622,Pave,,Reg,Lvl,...,0,,MnPrv,,0,6,2010,WD,Normal,105000
2,3,526351010,20,RL,81.0,14267,Pave,,IR1,Lvl,...,0,,,Gar2,12500,6,2010,WD,Normal,172000
3,4,526353030,20,RL,93.0,11160,Pave,,Reg,Lvl,...,0,,,,0,4,2010,WD,Normal,244000
4,5,527105010,60,RL,74.0,13830,Pave,,IR1,Lvl,...,0,,MnPrv,,0,3,2010,WD,Normal,189900


In [13]:
len(traiin.select_dtypes(include=['category']).columns)

43

In [14]:
traiin['Utilities'].cat.codes[:6]

0    0
1    0
2    0
3    0
4    0
5    0
dtype: int8

# 3. Dummy Coding

**`When we convert a column to the categorical data type, pandas assigns a number from 0 to n-1 (where n is the number of unique values in a column) for each value.` The drawback with this approach is that it violates one of the assumptions of linear regression. `Linear regression operates under the assumption that the features are linearly correlated with the target column.` `For a categorical feature, however, there's no actual numerical meaning to the categorical codes that pandas assigned for that column`. An increase in the Utilities column from 1 to 2 doesn't correspond to any meaningful increase that could relate to the target column; the categorical codes are instead used for uniqueness and exclusivity (the category associated with 0 is different from the one associated with 1).**

The common solution is to use a technique called [dummy coding](https://en.wikipedia.org/wiki/Dummy_variable_%28statistics%29). `Instead of having a single column with n integer codes, we have n binary columns`, each one set to 1 when the row belongs to that category and 0 otherwise.
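
To make the idea concrete, here is a minimal sketch on a hypothetical toy column (the values are made up, and `dtype=int` is chosen only so the result prints as 0/1 rather than booleans). It turns one text column with three categories into three binary columns.

```python
import pandas as pd

# Hypothetical toy column with three distinct categories.
lot_shape = pd.Series(['Reg', 'IR1', 'Reg', 'IR2'], name='Lot Shape')

# One binary column per category; dtype=int gives 0/1 instead of booleans.
dummies = pd.get_dummies(lot_shape, dtype=int)

print(dummies)
```

Every row has exactly one 1 across the new columns, so the model can learn a separate coefficient per category instead of treating the codes as a scale.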

## TODO:
* Convert all of the columns in text_cols from the train data frame into dummy columns.
* Delete the original columns from text_cols from the train data frame

In [15]:
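# NOTE: each pass rebuilds train from traiin plus the current column's dummies,
# so after the loop train holds dummy columns only for the last text column
# (Sale Condition), as the output of the next cell shows.
dummy_cols = pd.DataFrame()  # unused

for col in text_cols:
    col_dummies = pd.get_dummies(traiin[col])
    train = pd.concat([traiin, col_dummies], axis=1)
    del train[col]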

In [16]:
train.head(3)

Unnamed: 0,Order,PID,MS SubClass,MS Zoning,Lot Frontage,Lot Area,Street,Alley,Lot Shape,Land Contour,...,Misc Val,Mo Sold,Yr Sold,Sale Type,SalePrice,Abnorml,Alloca,Family,Normal,Partial
0,1,526301100,20,RL,141.0,31770,Pave,,IR1,Lvl,...,0,5,2010,WD,215000,0,0,0,1,0
1,2,526350040,20,RH,80.0,11622,Pave,,Reg,Lvl,...,0,6,2010,WD,105000,0,0,0,1,0
2,3,526351010,20,RL,81.0,14267,Pave,,IR1,Lvl,...,12500,6,2010,WD,172000,0,0,0,1,0
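
The loop in cell 15 rebuilds `train` from `traiin` on each pass, so only the dummy columns for the last text column (`Sale Condition`) survive, which matches the output above. One possible way to dummy-code every text column at once is sketched below on a hypothetical toy frame. The names `toy` and `encoded` are illustrative only. Passing `columns=` lets `pd.get_dummies` drop the original columns and prefix each new column with its source column name, which avoids clashes when two columns share a label such as `Gd`.

```python
import pandas as pd

# Hypothetical toy frame: two text columns that share the label 'Gd'.
toy = pd.DataFrame({
    'Exter Qual': ['Gd', 'TA', 'Gd'],
    'Kitchen Qual': ['TA', 'Gd', 'Ex'],
    'Lot Area': [31770, 11622, 14267],
})

# Listed explicitly here; in the notebook this would be text_cols.
text_cols = ['Exter Qual', 'Kitchen Qual']

# Replaces every listed column with prefixed 0/1 columns in one call.
encoded = pd.get_dummies(toy, columns=text_cols, dtype=int)

print(encoded.columns.tolist())
```

Applied to the notebook's data, the same call would take `traiin` and `text_cols` as inputs; assigning the result to a new name would leave the later cells unaffected.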


# 4. Transforming Improper Numerical Features

Let's now look at `numerical features that aren't categorical, but whose numerical representation needs to be improved`. We'll focus on the Year Remod/Add and Year Built columns:

In [17]:
print(traiin[['Year Remod/Add', 'Year Built']][:5])

   Year Remod/Add  Year Built
0            1960        1960
1            1961        1961
2            1958        1958
3            1968        1968
4            1998        1997


The two main issues with these features are:

* Year values aren't representative of how old a house is
* The Year Remod/Add column doesn't actually provide useful information for a linear regression model

## TODO:
* Create a new column years_until_remod in the train data frame that represents the difference between Year Remod/Add (the later value) and Year Built (the earlier value).

In [18]:
traiin['years_until_remod']=traiin['Year Remod/Add']-traiin['Year Built']
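
The same subtraction can be tried on a hypothetical toy frame (the rows below are made up). The negative-value check at the end is an added sanity check, not part of the original exercise.

```python
import pandas as pd

# Hypothetical toy rows shaped like the two year columns.
toy = pd.DataFrame({
    'Year Built': [1960, 1997, 1958],
    'Year Remod/Add': [1960, 1998, 2003],
})

# Years between construction and the most recent remodel.
toy['years_until_remod'] = toy['Year Remod/Add'] - toy['Year Built']

# A negative difference would mean a remodel recorded before construction,
# which would point to a data-entry problem worth inspecting.
suspicious = toy[toy['years_until_remod'] < 0]

print(toy)
print(len(suspicious))
```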

# 5. Missing Values

When values are missing in a column, there are two main approaches we can take:

* **Remove rows containing missing values for specific columns**
  * Pro: Rows containing missing values are removed, leaving only clean data for modeling
  * Con: Entire observations from the training set are removed, which can reduce overall prediction accuracy
* **Impute (or replace) missing values using a descriptive statistic from the column**
  * Pro: Missing values are replaced with potentially similar estimates, preserving the rest of the observation in the model.
  * Con: Depending on the approach, we may be adding noisy data for the model to learn
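
The trade-off between the two approaches can be seen on a hypothetical four-row frame (values made up for illustration): dropping loses a whole observation, while imputing keeps every row but inserts an estimated value.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing Lot Frontage value.
toy = pd.DataFrame({
    'Lot Frontage': [141.0, np.nan, 81.0, 93.0],
    'Lot Area': [31770, 11622, 14267, 11160],
})

# Approach 1: drop rows with a missing value in the chosen column.
dropped = toy.dropna(subset=['Lot Frontage'])

# Approach 2: replace the missing value with the column mean.
imputed = toy.fillna({'Lot Frontage': toy['Lot Frontage'].mean()})

print(len(dropped), len(imputed))
```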

`Given that we only have 1460 training examples (with ~80 potentially useful features), we don't want to remove any of these rows from the dataset. Let's instead focus on imputation techniques.`

## TODO
* Select only the columns from train that contain more than 0 missing values but fewer than 584 missing values (584 is 40% of the 1460 training rows). Assign the resulting data frame to df_missing_values.
* Display the number of missing values for each column in df_missing_values.
* Display the data type for each column in df_missing_values.

In [19]:
train_null_counts=train.isnull().sum()

df_missing_values=train[train_null_counts[(train_null_counts>0) & (train_null_counts<584)].index]

In [20]:
df_missing_values.isnull().sum()

Lot Frontage      249
Mas Vnr Type       11
Mas Vnr Area       11
Bsmt Qual          40
Bsmt Cond          40
Bsmt Exposure      41
BsmtFin Type 1     40
BsmtFin SF 1        1
BsmtFin Type 2     41
BsmtFin SF 2        1
Bsmt Unf SF         1
Total Bsmt SF       1
Bsmt Full Bath      1
Bsmt Half Bath      1
Garage Type        74
Garage Yr Blt      75
Garage Finish      75
Garage Qual        75
Garage Cond        75
dtype: int64

In [21]:
df_missing_values.dtypes

Lot Frontage       float64
Mas Vnr Type      category
Mas Vnr Area       float64
Bsmt Qual         category
Bsmt Cond         category
Bsmt Exposure     category
BsmtFin Type 1    category
BsmtFin SF 1       float64
BsmtFin Type 2    category
BsmtFin SF 2       float64
Bsmt Unf SF        float64
Total Bsmt SF      float64
Bsmt Full Bath     float64
Bsmt Half Bath     float64
Garage Type       category
Garage Yr Blt      float64
Garage Finish     category
Garage Qual       category
Garage Cond       category
dtype: object

# 6. Imputing Missing Values

It looks like about half of the columns in df_missing_values are text columns (shown with the category data type because of the earlier conversion), while about half are float64 columns. `For numerical columns with missing values, a common strategy is to compute the mean, median, or mode of each column and replace all missing values in that column with that value.`
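
As a rough sketch of the three statistics, the snippet below fills a hypothetical column (values made up) with its mean, its median, and its mode in turn.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column with one missing value.
area = pd.Series([0.0, 0.0, 108.0, np.nan, 112.0])

mean_value = area.mean()      # average of the non-missing values
median_value = area.median()  # middle of the non-missing values
mode_value = area.mode()[0]   # most frequent value (first one if tied)

filled_mean = area.fillna(mean_value)
filled_median = area.fillna(median_value)
filled_mode = area.fillna(mode_value)

print(mean_value, median_value, mode_value)
```

Which statistic suits a column depends on its distribution; for a skewed column like this one, the three choices give quite different fill values.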

Because imputation is a common task, pandas provides the `pandas.DataFrame.fillna()` method. If we pass in a single value, all of the missing values (NaN) in the data frame are replaced by that value.
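
A minimal sketch of `fillna()` on a hypothetical two-column frame follows (the data is made up). It contrasts passing a single scalar with passing a Series of per-column means, which is the form used in the cell further down.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value per column.
toy = pd.DataFrame({
    'Lot Frontage': [80.0, np.nan, 100.0],
    'Mas Vnr Area': [np.nan, 0.0, 50.0],
})

# A scalar replaces every NaN in the frame with the same value.
filled_zero = toy.fillna(0)

# A Series indexed by column name fills each column with its own value,
# here the column mean computed from the non-missing entries.
filled_mean = toy.fillna(toy.mean())

print(filled_zero)
print(filled_mean)
```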

## TODO:
* Select the float columns from df_missing_values, fill their missing values with each column's mean using fillna(), and display the first few rows.

In [22]:
float_cols=df_missing_values.select_dtypes(include=['float'])
float_cols=float_cols.fillna(float_cols.mean())
float_cols.head()

Unnamed: 0,Lot Frontage,Mas Vnr Area,BsmtFin SF 1,BsmtFin SF 2,Bsmt Unf SF,Total Bsmt SF,Bsmt Full Bath,Bsmt Half Bath,Garage Yr Blt
0,141.0,112.0,639.0,0.0,441.0,1080.0,1.0,0.0,1960.0
1,80.0,0.0,468.0,144.0,270.0,882.0,0.0,0.0,1961.0
2,81.0,108.0,923.0,0.0,406.0,1329.0,0.0,0.0,1958.0
3,93.0,0.0,1065.0,0.0,1045.0,2110.0,1.0,0.0,1968.0
4,74.0,0.0,791.0,0.0,137.0,928.0,0.0,0.0,1997.0
