# Introduction

This is a perfect competition for data science students who have completed a course in machine learning and are looking to expand their skill set before trying a featured competition. 

**Competition Description**

Ask a home buyer to describe their dream house, and they probably won't begin with the height of the basement ceiling or the proximity to an east-west railroad. But this playground competition's dataset proves that much more influences price negotiations than the number of bedrooms or a white-picket fence. With 79 explanatory variables describing (almost) every aspect of residential homes in Ames, Iowa, this competition challenges you to predict the final price of each home.

The dataset is in CSV format, and its target variable is SalePrice. SalePrice is modeled as a function of the feature columns (X) contained in the dataset. In this version of predicting home prices, we preprocess both the categorical and numerical variables in the training data using scikit-learn pipelines, and we write a hyperparameter search function to help us choose good parameter values for our Random Forest Regressor model.

---

### Importing Relevant Packages 

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import cross_val_score

/kaggle/input/sample_submission.csv
/kaggle/input/sample_submission.csv.gz
/kaggle/input/train.csv.gz
/kaggle/input/data_description.txt
/kaggle/input/test.csv.gz
/kaggle/input/train.csv
/kaggle/input/test.csv


## 1. Data handling and visualisation
### (a) Use the ‘pandas’ library or the ‘numpy’ library to read the data set.

---

The read_csv() function in pandas reads data from a .csv file into a DataFrame. pandas supports many file formats and data sources out of the box, such as CSV, Excel, SQL, JSON, and Parquet, and the reader function for each one starts with the prefix read_.

Check the data every time you read it in. By default, displaying a large DataFrame shows only the first and last 5 rows. To see just the first 5 rows, use the pandas head() method.
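
As a minimal sketch of this read-then-inspect habit, the snippet below reads a tiny in-memory CSV with `pd.read_csv` and looks at the first rows with `head()`. The three rows reuse a few values from the outputs shown in this notebook, but the cut-down column set and the names `csv_text` and `demo_df` are assumptions made for illustration.

```python
import io
import pandas as pd

# A tiny stand-in for train.csv (only three columns, chosen for illustration)
csv_text = """Id,LotArea,Street,SalePrice
1,8450,Pave,208500
2,9600,Pave,181500
3,11250,Pave,223500
"""

# index_col='Id' uses the Id column as the row index instead of a default 0..n-1 index
demo_df = pd.read_csv(io.StringIO(csv_text), index_col='Id')

# head(n) returns the first n rows (n defaults to 5)
first_rows = demo_df.head(2)
print(first_rows)
print(demo_df.shape)
```

Passing a file path instead of the `io.StringIO` object works the same way, which is what the next cell does.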

In [2]:
# Read the data
train_full_data = pd.read_csv("../input/train.csv", index_col='Id')

#Read Test data
X_test_full = pd.read_csv("../input/test.csv", index_col='Id')

# Separate target from predictors
train_full_data.dropna(axis=0, subset=['SalePrice'], inplace=True)
y = train_full_data["SalePrice"]
X = train_full_data.drop(["SalePrice"], axis=1)

X.head()

| Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | LotConfig | ... | ScreenPorch | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | Inside | ... | 0 | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal |
| 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | FR2 | ... | 0 | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal |
| 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | Inside | ... | 0 | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal |
| 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | Corner | ... | 0 | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml |
| 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | FR2 | ... | 0 | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal |


Print the first 5 rows of the dependent variable y:

In [3]:
y.head()

Id
1    208500
2    181500
3    223500
4    140000
5    250000
Name: SalePrice, dtype: int64

### (b) Subsequently, split the data set into training and validation sets

___

Here we split the dataset into training and validation sets using scikit-learn's train_test_split function.

X: This DataFrame contains all the feature columns of the dataset, that is, every column except the target column (SalePrice).

y: This Series contains only the target column.

We also check the cardinality of the categorical features and keep only those with fewer than 26 unique values. All numerical columns are kept as well, and the two lists of column names are used to build X_train and X_valid.

In [4]:
# Divide data into training and validation subsets
X_train_full, X_valid_full, y_train, y_valid = train_test_split(X, y, train_size=0.8, test_size=0.2,
                                                                random_state=0)

# "Cardinality" means the number of unique values in a column
# Select categorical columns with relatively low cardinality (convenient but arbitrary)
categorical_cols = [cname for cname in X_train_full.columns if X_train_full[cname].nunique() < 26 and 
                        X_train_full[cname].dtype == "object"]

# Select numerical columns
numerical_cols = [cname for cname in X_train_full.columns if X_train_full[cname].dtype in ['int64', 'float64']]

# Keep selected columns only
my_cols = categorical_cols + numerical_cols
X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()
X_train.head()


| Id | MSZoning | Street | Alley | LotShape | LandContour | Utilities | LotConfig | LandSlope | Neighborhood | Condition1 | ... | GarageArea | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | MiscVal | MoSold | YrSold |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 619 | RL | Pave | NaN | Reg | Lvl | AllPub | Inside | Gtl | NridgHt | Norm | ... | 774 | 0 | 108 | 0 | 0 | 260 | 0 | 0 | 7 | 2007 |
| 871 | RL | Pave | NaN | Reg | Lvl | AllPub | Inside | Gtl | NAmes | PosN | ... | 308 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 8 | 2009 |
| 93 | RL | Pave | Grvl | IR1 | HLS | AllPub | Inside | Gtl | Crawfor | Norm | ... | 432 | 0 | 0 | 44 | 0 | 0 | 0 | 0 | 8 | 2009 |
| 818 | RL | Pave | NaN | IR1 | Lvl | AllPub | CulDSac | Gtl | Mitchel | Norm | ... | 857 | 150 | 59 | 0 | 0 | 0 | 0 | 0 | 7 | 2008 |
| 303 | RL | Pave | NaN | IR1 | Lvl | AllPub | Corner | Gtl | CollgCr | Norm | ... | 843 | 468 | 81 | 0 | 0 | 0 | 0 | 0 | 1 | 2006 |
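
The column selection above can be sketched on a toy frame. This is only an illustration under stated assumptions: the DataFrame `toy`, its `Address` column, and the threshold of 3 are invented here, and `select_dtypes` is used as one possible way to split numeric from non-numeric columns rather than the dtype comparison used in the cell above.

```python
import pandas as pd

# Toy frame (hypothetical values) with one low-cardinality text column,
# one high-cardinality text column, and two numeric columns
toy = pd.DataFrame({
    "Street": ["Pave", "Grvl", "Pave", "Pave"],
    "Address": ["1 A St", "2 B St", "3 C St", "4 D St"],
    "LotArea": [8450, 9600, 11250, 9550],
    "LotFrontage": [65.0, 80.0, None, 60.0],
})

max_cardinality = 3  # small threshold for the toy data; the notebook uses 26

# Non-numeric columns whose number of distinct values is below the threshold
text_cols = toy.select_dtypes(exclude="number").columns
low_card_cols = [c for c in text_cols if toy[c].nunique() < max_cardinality]

# Numeric columns, selected by dtype
num_cols = toy.select_dtypes(include="number").columns.tolist()

print(low_card_cols)
print(num_cols)
```

Dropping high-cardinality text columns keeps one-hot encoding from producing a very wide matrix. The cutoff itself is a convenience, as the code comment in the cell above already notes.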


## 2. Preprocessing using Pipelines and Simple Imputer
### (a) Use the ‘sklearn.impute’ library to handle missing values 

---

The pipeline below uses SimpleImputer() to replace missing values in the data before RandomForestRegressor() trains a random forest model to make predictions. We set the number of trees in the random forest with the n_estimators parameter, and setting random_state ensures reproducibility.

We begin by writing a function get_score() that reports the average MAE (over four cross-validation folds) of a machine learning pipeline that uses:

- the data in X and y to create folds,
- SimpleImputer() to replace missing values (a constant fill for numerical columns and the most frequent value for categorical columns, which are then one-hot encoded), and
- RandomForestRegressor() (with random_state=1) to fit a random forest model.

The n_estimators parameter supplied to get_score() sets the number of trees in the random forest model.

In [5]:
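def get_score(n_estimators):
    """Return the average MAE over 4 cross-validation folds.

    n_estimators -- the number of trees in the random forest
    """
    # Preprocessing for numerical data (missing values are filled with a constant, 0 by default)
    numerical_transformer = SimpleImputer(strategy='constant')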

    # Preprocessing for categorical data
    categorical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('onehot', OneHotEncoder(handle_unknown='ignore'))
    ])

    # Bundle preprocessing for numerical and categorical data
    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numerical_transformer, numerical_cols),
            ('cat', categorical_transformer, categorical_cols)
        ])
    
    # Bundle preprocessing and modeling code in a pipeline
    my_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                                  ('model', RandomForestRegressor(n_estimators=n_estimators, random_state=1))
                                 ])
    scores = -1 * cross_val_score(my_pipeline, X, y,
                                  cv=4,
                                  scoring='neg_mean_absolute_error')
    return scores.mean()
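
The `-1 *` in get_score() is there because scikit-learn scorers follow a "higher is better" convention, so MAE is returned as a negative number. The sketch below shows this on synthetic data; the arrays `X_demo` and `y_demo`, the 10-tree forest, and the single-imputer pipeline are assumptions chosen to keep the example small, not the notebook's actual setup.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Synthetic data (made up for illustration): 40 rows, 3 numeric features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 3))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=40)
X_demo[::7, 1] = np.nan  # punch a few holes so the imputer has work to do

demo_pipeline = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value=0)),
    ("model", RandomForestRegressor(n_estimators=10, random_state=1)),
])

# One score per fold; each is the negated MAE on that fold's held-out rows
neg_mae = cross_val_score(demo_pipeline, X_demo, y_demo, cv=4,
                          scoring="neg_mean_absolute_error")
mae_per_fold = -neg_mae  # flip the sign to get ordinary, non-negative MAE values
print(mae_per_fold.mean())
```

Because the imputer lives inside the pipeline, it is refit on the training part of each fold, so the held-out rows never influence the preprocessing.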

### (b) Test different parameter values

Now we use the get_score() function defined above to evaluate model performance for sixteen different values of the number of trees in the random forest: 50, 100, 150, ..., 700, 750, 800.

The results are stored in a Python dictionary results, where results[i] is the average MAE returned by get_score(i).

This lets us visualize the scores and select the best parameter value for building our RandomForestRegressor model.
___

In [6]:
#  results = {}
#  for i in range(1,17):
#      results[50*i] = get_score(50*i)

The next cell visualizes the results of the parameter sweep.


In [7]:
# %matplotlib inline
# plt.figure(figsize=(7,7))  # create the figure first so the plot is drawn on it
# plt.title('Plot of different parameter values')
# plt.plot(list(results.keys()), list(results.values()))
# plt.xlabel("N Estimators")
# plt.ylabel("Mean Absolute Error")
# plt.show()

We can see that values of n_estimators from 700 upward give the lowest mean absolute error, so we set n_estimators to 800, fit the model on the training set, and predict on the validation set.
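
Besides reading the plot, the best value can be picked programmatically from a dictionary of scores. The sketch below assumes a dictionary shaped like results (tree count mapped to average MAE); the name `example_results` and its numbers are invented toy values, not scores from this notebook.

```python
# Toy scores, invented for illustration only (tree count -> average MAE)
example_results = {50: 3.0, 100: 2.5, 150: 2.7}

# min() with key=dict.get returns the key whose value is smallest,
# i.e. the tree count with the lowest average MAE
best_n = min(example_results, key=example_results.get)
print(best_n, example_results[best_n])
```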

In [8]:
# Preprocessing for numerical data
numerical_transformer = SimpleImputer(strategy='constant')

# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])



model = RandomForestRegressor(n_estimators=800, random_state=1)


# Bundle preprocessing and modeling code in a pipeline
my_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', model)
                             ])

# Preprocessing of training data, fit model 
my_pipeline.fit(X_train, y_train)

# Preprocessing of validation data, get predictions
preds = my_pipeline.predict(X_valid)

# Evaluate the model
score = mean_absolute_error(y_valid, preds)
print('MAE:', score)


MAE: 17342.208578767124
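
For reference, the mean absolute error is the average of the absolute differences between the true and predicted values, $\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i|$. The sketch below checks a hand computation against `mean_absolute_error`; the three prices and predictions are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up true prices and predictions, for illustration only
y_true = np.array([200000.0, 150000.0, 300000.0])
y_pred = np.array([210000.0, 140000.0, 330000.0])

# Average of the absolute errors: (10000 + 10000 + 30000) / 3
manual_mae = np.mean(np.abs(y_true - y_pred))
library_mae = mean_absolute_error(y_true, y_pred)
print(manual_mae, library_mae)
```

Since MAE is in the same units as the target, the score printed above can be read as a typical prediction error in dollars.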


## 3. Predicting test results for the competition

Here we predict test results for the competition and submit our predictions.

___


In [9]:
pred_test = my_pipeline.predict(X_test_full)
# pred_test = (np.rint(pred_test)).astype(int)
pred_test

array([128097.8725 , 154711.48625, 184409.53   , ..., 151654.38125,
       109218.7325 , 227193.63125])

In [10]:
# Run the code to save predictions in the format used for competition scoring

output = pd.DataFrame({'Id': X_test_full.index,
                       'SalePrice': pred_test})
output.to_csv('submission.csv', index=False)
print("Your submission was successfully saved!")



Your submission was successfully saved!
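
Before uploading, it can be worth checking that the file has the expected shape. The sketch below is one possible check, not part of the competition's requirements: it writes a small frame to an in-memory buffer, reads it back, and inspects the columns. The three prices are the first three predictions printed earlier, while the Id values and the names `demo_output` and `round_trip` are hypothetical placeholders.

```python
import io
import pandas as pd

# First three predictions from the earlier output; the Ids are hypothetical placeholders
demo_output = pd.DataFrame({
    "Id": [1, 2, 3],
    "SalePrice": [128097.8725, 154711.48625, 184409.53],
})

# Write to an in-memory buffer instead of disk, then read it back
buffer = io.StringIO()
demo_output.to_csv(buffer, index=False)
buffer.seek(0)
round_trip = pd.read_csv(buffer)

# Checks: exactly the two expected columns, and no missing predictions
print(list(round_trip.columns))
print(round_trip["SalePrice"].isna().sum())
```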
