## **Feature Engineering and Linear Regression**

#### **Description of the Dataset**

* The Ames Housing dataset describes the sale of individual residential properties in Ames, Iowa, from 2006 to 2010. It contains a wide range of features, making it an excellent dataset to practice feature engineering techniques. In this assignment, you will explore a series of feature engineering tasks aimed at improving linear regression predictions.

* Before you start implementing your homework, please make sure you have read the provided `data_description.txt` file, which gives additional information about the dataset and its features.

### **Question 1: Import libraries, Load Train and Test datasets into separate DataFrames**

In [None]:
# Write your code here
import pandas as pd


# Load train and test datasets
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')


train.head()

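|   | Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 60 | RL | 65.0 | 8450 | Pave |  | Reg | Lvl | AllPub | ... | 0 |  |  |  | 0 | 2 | 2008 | WD | Normal | 208500 |
| 1 | 2 | 20 | RL | 80.0 | 9600 | Pave |  | Reg | Lvl | AllPub | ... | 0 |  |  |  | 0 | 5 | 2007 | WD | Normal | 181500 |
| 2 | 3 | 60 | RL | 68.0 | 11250 | Pave |  | IR1 | Lvl | AllPub | ... | 0 |  |  |  | 0 | 9 | 2008 | WD | Normal | 223500 |
| 3 | 4 | 70 | RL | 60.0 | 9550 | Pave |  | IR1 | Lvl | AllPub | ... | 0 |  |  |  | 0 | 2 | 2006 | WD | Abnorml | 140000 |
| 4 | 5 | 60 | RL | 84.0 | 14260 | Pave |  | IR1 | Lvl | AllPub | ... | 0 |  |  |  | 0 | 12 | 2008 | WD | Normal | 250000 |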


In [None]:
# Optional: display all columns so that every feature is visible

pd.set_option('display.max_columns', None)

### **Question 2: Identify columns with missing data and then determine an appropriate strategy for each.**

In [None]:
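# Write your code here

# Combine train and test so missing values are counted across both sets
missing_data = pd.concat([train, test], axis=0, ignore_index=True)
missing_data = missing_data.isnull().sum()
# Note: sort_values() returns a new Series; the result is not assigned here,
# so the output below stays in the original column order
missing_data.sort_values(ascending=False)
missing_data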

Id                  0
MSSubClass          0
MSZoning            4
LotFrontage       486
LotArea             0
                 ... 
MoSold              0
YrSold              0
SaleType            1
SaleCondition       0
SalePrice        1459
Length: 81, dtype: int64
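
To decide on a strategy per column, it can help to look only at the columns that actually have gaps, together with the share of rows affected. The following is a minimal sketch on a made-up toy frame (the values and the `summary` name are assumptions for illustration, not part of the assignment data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the combined train/test data (values are made up)
df = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, np.nan],
    "MSZoning": ["RL", "RL", None, "RM"],
    "LotArea": [8450, 9600, 11250, 9550],
})

missing_count = df.isnull().sum()
summary = pd.DataFrame({
    "missing": missing_count,
    "percent": 100 * missing_count / len(df),
})

# Keep only columns that have gaps, most affected first
summary = summary[summary["missing"] > 0].sort_values("missing", ascending=False)
print(summary)
```

Columns with a large share of missing values may call for a different treatment (for example dropping or a dedicated "missing" category) than columns with only a handful of gaps.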

### **Question 3: One-hot Encoding for Categorical Variables**

* Apply one-hot encoding to transform columns with categorical values.

In [None]:
# Write your code here
train_encoded = pd.get_dummies(train, drop_first=True)
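
To see what `drop_first=True` does, here is a small sketch on a made-up two-column frame (the `toy` and `encoded` names are assumptions for illustration). Each categorical column with k distinct values becomes k-1 indicator columns, while numeric columns pass through unchanged:

```python
import pandas as pd

# Toy frame with one categorical and one numeric column (values are made up)
toy = pd.DataFrame({
    "Street": ["Pave", "Grvl", "Pave"],
    "LotArea": [8450, 9600, 11250],
})

# 'Street' has two categories; drop_first removes the first one ('Grvl'),
# leaving a single indicator column 'Street_Pave'
encoded = pd.get_dummies(toy, drop_first=True)
print(encoded.columns.tolist())
```

Dropping one level per feature avoids perfectly collinear indicator columns, which matters for an unregularized linear regression.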

### **Question 4: Scaling and Normalization**

* Using `StandardScaler` from `scikit-learn`, scale the features in some of the numeric columns.
* You will need to import required libraries and modules here as well.

In [None]:
# Write your code here
import numpy as np
from sklearn.preprocessing import StandardScaler

# Identify numeric columns and put them in a python list
numeric_columns = train_encoded.select_dtypes(include=[np.number]).columns.tolist()

# From the `numeric_columns`, exclude the target column 'SalePrice' and 'Id' column
numeric_columns = [col for col in numeric_columns if col not in ['SalePrice', 'Id']]

# Scaling the numeric columns: initialize StandardScaler from `scikit-learn`
scaler = StandardScaler()

# fit and transform the numeric columns using the scaler you defined above
train_encoded[numeric_columns] = scaler.fit_transform(train_encoded[numeric_columns])
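
When a separate test set is scaled later, the usual pattern is to reuse the statistics learned from the training data rather than fitting a second scaler. A minimal sketch with made-up numbers (the array names are assumptions for illustration):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up single-feature data
train_values = np.array([[1.0], [2.0], [3.0]])
test_values = np.array([[4.0]])

scaler = StandardScaler()
# fit_transform learns the mean and standard deviation from the training data
train_scaled = scaler.fit_transform(train_values)
# transform reuses those training statistics on the test data
test_scaled = scaler.transform(test_values)

print(train_scaled.ravel(), test_scaled.ravel())
```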

### **Question 5: Feature Extraction from Year Variables**

* Extract the information about the age of the house at the time of the sale by calculating the difference between `YrSold` and `YearBuilt` features and create a new Column (feature) named `HouseAge` from the result.
* Extract the number of years since remodelling when the house was sold by calculating the difference between `YrSold` and `YearRemodAdd` features and create a new Column (feature) named `YearsSinceRemod` from the result.

In [None]:
# Write your code here

# Age of the house at the time of sale
train_encoded['HouseAge'] = train_encoded['YrSold'] - train_encoded['YearBuilt']

# Number of years since remodelling when the house was sold
train_encoded['YearsSinceRemod'] = train_encoded['YrSold'] - train_encoded['YearRemodAdd']

# Display the new features you have created above
train_encoded[['HouseAge', 'YearsSinceRemod']].head()



|   | HouseAge | YearsSinceRemod |
|---|---|---|
| 0 | -0.912216 | -0.739891 |
| 1 | -0.771172 | -0.184862 |
| 2 | -0.845975 | -0.691437 |
| 3 | 0.495977 | -0.647357 |
| 4 | -0.812854 | -0.59453 |
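
The fractional values in the output above arise because `YrSold`, `YearBuilt`, and `YearRemodAdd` were already standardized in Question 4, so their differences are no longer measured in years. A minimal sketch, on a made-up frame named `raw` (an assumption for illustration), of computing the same features from unscaled year columns:

```python
import pandas as pd

# Made-up, unscaled year values
raw = pd.DataFrame({
    "YrSold": [2008, 2007],
    "YearBuilt": [2003, 1976],
    "YearRemodAdd": [2003, 1976],
})

# Differences of raw years are directly interpretable as ages in years
raw["HouseAge"] = raw["YrSold"] - raw["YearBuilt"]
raw["YearsSinceRemod"] = raw["YrSold"] - raw["YearRemodAdd"]
print(raw[["HouseAge", "YearsSinceRemod"]])
```

Deriving such features before scaling, and scaling afterwards, keeps the new columns meaningful.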


### **Question 6: Dimensionality Reduction**

* In this question you will apply the PCA algorithm to reduce the dimensions of the dataset. PCA will not work if there are missing values in your data. For this reason, you will first fill all the missing values in the numerical columns with the median of the column.
* Then, you will fill the missing values in the categorical columns with mode imputation.
* Tip: To fill the missing values in all columns, first identify all numeric columns with missing data and store them in a variable called `numeric_cols_with_missing`. Do the same for categorical columns with missing data and store them in a variable called `categorical_cols_with_missing`. Then, you can iterate over these two sets of columns (using a for loop) and fill them with the appropriate values described above. Do not forget to pass the `inplace=True` argument to the `fillna()` method that you use to fill missing values, so that you modify the original dataset instead of creating a copy of it.

In [None]:
# Write your code here
# Numeric columns: median imputation
numeric_part = train_encoded.select_dtypes(include=[np.number])
numeric_cols_with_missing = numeric_part.columns[numeric_part.isnull().any()].tolist()
for col in numeric_cols_with_missing:
    # fill the missing values with the column median
    train_encoded[col].fillna(train_encoded[col].median(), inplace=True)

# Categorical columns: mode imputation
categorical_part = train_encoded.select_dtypes(exclude=[np.number])
categorical_cols_with_missing = categorical_part.columns[categorical_part.isnull().any()].tolist()
for col in categorical_cols_with_missing:
    # fill the missing values with the most frequent value
    train_encoded[col].fillna(train_encoded[col].mode()[0], inplace=True)

# Verifying that there are no more missing values
assert train_encoded.isnull().sum().sum() == 0
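
Depending on the pandas version, calling `fillna(..., inplace=True)` on a selected column may emit a warning about chained assignment. As an alternative, here is a minimal sketch that assigns the filled result back to the frame, shown on a made-up toy frame (the `df` name and its values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Toy frame with one numeric and one categorical gap (values are made up)
df = pd.DataFrame({
    "LotFrontage": [60.0, np.nan, 80.0],
    "MSZoning": ["RL", None, "RL"],
})

numeric_cols = df.select_dtypes(include=[np.number]).columns
categorical_cols = df.select_dtypes(exclude=[np.number]).columns

# Median imputation for numeric columns, assigned back instead of inplace
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

# Mode imputation for categorical columns
for col in categorical_cols:
    df[col] = df[col].fillna(df[col].mode()[0])

print(df)
```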

* **Apply the PCA algorithm to the dataset with no missing values**

In [None]:
# Write your code here
from sklearn.decomposition import PCA
X = train_encoded.drop(columns=['SalePrice', 'Id'])

pca = PCA()

# fit the PCA and project the data onto the principal components
X_pca = pca.fit_transform(X)

# Number of components retained after PCA
num_components = pca.n_components_
num_components


247
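
With the default `PCA()`, every component is kept (247 in the output above), so the dimensionality is not actually reduced. One common option is to pass a fraction to `n_components`, which keeps just enough components to explain that share of the variance. A minimal sketch on synthetic data (the data, the 0.95 threshold, and the variable names are assumptions for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data: 5 observed features driven by 2 latent factors plus small noise
latent = rng.normal(size=(200, 2))
X = latent @ rng.normal(size=(2, 5)) + 0.01 * rng.normal(size=(200, 5))

# A float in (0, 1) keeps the smallest number of components
# whose cumulative explained variance exceeds that fraction
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

print(pca.n_components_, pca.explained_variance_ratio_.sum())
```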

### **Question 7: Engineering Ordinal Features**

* For ordinal columns like `ExterQual` and `ExterCond`, map the values to numbers using the [`map()`](https://docs.python.org/3/library/functions.html#map) method.

In [None]:
# Write your code here
# Mapping ordinal values to numbers
ordinal_mappings = {
    'Ex': 5,
    'Gd': 4,
    'TA': 3,
    'Fa': 2,
    'Po': 1
}

train['ExterQual'] = train['ExterQual'].map(ordinal_mappings)
train['ExterCond'] = train['ExterCond'].map(ordinal_mappings)


train[['ExterQual', 'ExterCond']].head()

|   | ExterQual | ExterCond |
|---|---|---|
| 0 | 4 | 3 |
| 1 | 3 | 3 |
| 2 | 4 | 3 |
| 3 | 3 | 3 |
| 4 | 4 | 3 |
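
One thing to watch with dictionary-based mapping on a pandas `Series` is that any value missing from the dictionary becomes `NaN`. A minimal sketch (the `"Unknown"` label and the variable names are hypothetical, for illustration only):

```python
import pandas as pd

quality_scale = {"Ex": 5, "Gd": 4, "TA": 3, "Fa": 2, "Po": 1}

# 'Unknown' is a hypothetical label that is not in the mapping
ratings = pd.Series(["Gd", "TA", "Unknown"])
mapped = ratings.map(quality_scale)

# Count how many values could not be mapped
unmapped = mapped.isnull().sum()
print(mapped.tolist(), unmapped)
```

Checking the number of unmapped values after the call is a cheap way to catch typos or unexpected categories.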


### **Question 8: Feature Interaction**

* An **interaction feature** is a new feature created by combining or relating two or more existing features. It is based on the idea that two or more variables together may have a synergistic effect on the target variable that is not captured when they are used independently.

* Create an interaction feature, such as the total area of the house, by using the available features `TotalBsmtSF`, `1stFlrSF`, and `2ndFlrSF`.
* Tip: You will need to add these `Series` to get the `TotalArea`.

In [None]:
# Write your code here

# Total area of the house
train['TotalArea'] = train['TotalBsmtSF'] + train['1stFlrSF'] + train['2ndFlrSF']

# Displaying the new feature
train[['TotalArea']].head()

|   | TotalArea |
|---|---|
| 0 | 2566 |
| 1 | 2524 |
| 2 | 2706 |
| 3 | 2473 |
| 4 | 3343 |
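
When any of the added columns contains a missing value, `+` propagates `NaN` into the result, whereas a row-wise `sum(axis=1)` skips missing values by default. A minimal sketch on a made-up frame (the `areas` name and its values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

# Made-up areas; the second row has a missing basement area
areas = pd.DataFrame({
    "TotalBsmtSF": [856.0, np.nan],
    "1stFlrSF": [856, 1262],
    "2ndFlrSF": [854, 0],
})

# Element-wise addition: a NaN in any operand gives NaN in the result
total_plus = areas["TotalBsmtSF"] + areas["1stFlrSF"] + areas["2ndFlrSF"]

# Row-wise sum: NaN values are skipped by default (skipna=True)
total_sum = areas[["TotalBsmtSF", "1stFlrSF", "2ndFlrSF"]].sum(axis=1)

print(total_plus.tolist(), total_sum.tolist())
```

Which behavior is preferable depends on whether a missing area should be treated as zero or as unknown.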


### **Question 9: Binning**

* Group the `LotArea` feature into 5 bins

In [None]:
# Write your code here
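# Bin boundaries must be defined before calling pd.cut()
bins = [0, 5000, 10000, 15000, 20000, float('inf')]
labels = ['Bin1', 'Bin2', 'Bin3', 'Bin4', 'Bin5']
train['LotAreaBin'] = pd.cut(train['LotArea'], bins=bins, labels=labels)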


# Displaying the binned column
train[['LotArea', 'LotAreaBin']].head()

|   | LotArea | LotAreaBin |
|---|---|---|
| 0 | 8450 | Bin2 |
| 1 | 9600 | Bin2 |
| 2 | 11250 | Bin3 |
| 3 | 9550 | Bin2 |
| 4 | 14260 | Bin3 |


In [None]:
# Check the unique values in the 'LotAreaBin' column and the corresponding bin boundaries
bins = [0, 5000, 10000, 15000, 20000, float('inf')]
unique_values = train['LotAreaBin'].value_counts().sort_index()

bins, unique_values

([0, 5000, 10000, 15000, 20000, inf],
 Bin1    147
 Bin2    695
 Bin3    496
 Bin4     69
 Bin5     53
 Name: LotAreaBin, dtype: int64)

* Note that `pd.cut()` behaves very differently when it is asked for 5 equal-width bins (for example `bins=5` with `labels=False`) instead of the explicit boundaries above. In that case, the vast majority of houses (1447 out of 1460) have lot areas that fall into the first bin (Bin 0), which corresponds to the interval (1086.055, 44089.0], so the `LotAreaBin` column consists mostly of 0 values.

* This behavior is due to a few properties with very large lot areas, which stretch the range and hence the bin boundaries. The first bin captures most of the data, while the other bins capture only a few outliers. In such cases, it might be more appropriate to use quantile-based binning (using `pd.qcut()`) to ensure a more even distribution of data points across bins.

In [None]:
# Observe the results of the quantile-based binning and compare it to the result from only `pd.cut()` above

train['LotAreaQuantileBin'] = pd.qcut(train['LotArea'], q=5, labels=False)

# Check the unique values in the 'LotAreaQuantileBin' column and the corresponding bin boundaries
quantile_bins = pd.qcut(train['LotArea'], q=5, retbins=True)
unique_values_quantile = train['LotAreaQuantileBin'].value_counts().sort_index()

quantile_bins, unique_values_quantile

((0          (7078.4, 8793.4]
  1         (8793.4, 10198.2]
  2        (10198.2, 12205.8]
  3         (8793.4, 10198.2]
  4       (12205.8, 215245.0]
                 ...         
  1455       (7078.4, 8793.4]
  1456    (12205.8, 215245.0]
  1457      (8793.4, 10198.2]
  1458      (8793.4, 10198.2]
  1459      (8793.4, 10198.2]
  Name: LotArea, Length: 1460, dtype: category
  Categories (5, interval[float64, right]): [(1299.999, 7078.4] < (7078.4, 8793.4] < (8793.4, 10198.2] <
                                             (10198.2, 12205.8] < (12205.8, 215245.0]],
  array([  1300. ,   7078.4,   8793.4,  10198.2,  12205.8, 215245. ])),
 0    292
 1    292
 2    292
 3    292
 4    292
 Name: LotAreaQuantileBin, dtype: int64)
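
To see the contrast between equal-width and quantile-based binning in isolation, here is a minimal sketch on ten made-up lot areas with one extreme outlier (the values and variable names are assumptions for illustration):

```python
import pandas as pd

# Ten made-up lot areas; the last one is an extreme outlier
lot_area = pd.Series([1300, 7000, 8000, 9000, 9500,
                      10000, 11000, 12000, 14000, 215245])

# Equal-width bins: the outlier stretches the range, so most values share one bin
equal_width = pd.cut(lot_area, bins=5, labels=False)

# Quantile bins: boundaries are chosen so each bin holds about the same number of rows
quantile = pd.qcut(lot_area, q=5, labels=False)

print(equal_width.value_counts().sort_index())
print(quantile.value_counts().sort_index())
```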

### **Question 10: Linear Regression Model**

* Now, let's train a Linear Regression model using the feature-engineered dataset and evaluate its performance on the test set. You will need to apply the same feature engineering steps you applied to the training dataset to the test dataset as well. This means that in the test set you should:

- Handle missing data for both numeric and categorical columns.
- Apply one-hot encoding to categorical columns.
- Align the train and test datasets by columns and handle any missing values introduced by the alignment. This alignment step is critical: it guarantees that the training and testing datasets have the same columns, especially after one-hot encoding. It's possible that after one-hot encoding, some columns present in the training data are not in the testing data (and vice versa) because the two sets contain different categorical values. The `align` method makes sure both datasets have the same columns in the same order. This is crucial because a machine learning model expects the input features in the same order and structure as it was trained on.
- Don't forget to pass the `inplace=True` parameter when filling values so that the original DataFrame is modified.
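
The alignment step can be illustrated with a minimal sketch on two made-up frames whose categorical values differ (the frame names and the use of `dtype=int` are assumptions for illustration):

```python
import pandas as pd

# Made-up data: 'FV' and 'RM' appear only in train, 'RH' only in test
train_toy = pd.DataFrame({"MSZoning": ["RL", "RM", "FV"]})
test_toy = pd.DataFrame({"MSZoning": ["RL", "RH"]})

# dtype=int keeps the indicator columns numeric
train_enc = pd.get_dummies(train_toy, dtype=int)
test_enc = pd.get_dummies(test_toy, dtype=int)

# join='left' keeps exactly the training columns, in training order;
# columns missing from test are created as NaN, test-only columns are dropped
train_enc, test_enc = train_enc.align(test_enc, join="left", axis=1)

# Indicator columns that never occurred in test are filled with 0
test_enc = test_enc.fillna(0)

print(test_enc)
```

Note that a left join also carries over every other training-only column, including the target, so those columns still need to be dropped from the feature matrix before prediction.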

In [None]:
# Write your code here

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

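# Handling missing values in the test set similar to the train set
# Numeric columns: median imputation, using medians from the raw (unscaled) training data,
# because `train_encoded` was standardized in Question 4
for col in numeric_cols_with_missing:
    test[col].fillna(train[col].median(), inplace=True)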


# Categorical columns: mode imputation
for col in categorical_cols_with_missing:
    test[col].fillna(train_encoded[col].mode()[0], inplace=True)

# One-hot encoding the test dataset
test_encoded = pd.get_dummies(test, drop_first=True)

# Aligning train and test datasets by columns
train_encoded, test_encoded = train_encoded.align(test_encoded, join='left', axis=1)

# Filling any new NaN values introduced by alignment
test_encoded.fillna(0, inplace=True)


# Separating features and target
# align() added a 'SalePrice' column to the test frame (filled with 0 above), so drop it there too
X_train = train_encoded.drop(columns=['SalePrice'])
y_train = train_encoded['SalePrice']
X_test = test_encoded.drop(columns=['SalePrice'])

# Training (fitting) the Linear Regression model
lr = LinearRegression()
lr.fit(X_train, y_train)

# Predicting on the test set
y_pred = lr.predict(X_test)

# The test file has no SalePrice values (see the missing-value counts in Question 2),
# so RMSE is computed on the training data here
rmse = np.sqrt(mean_squared_error(y_train, lr.predict(X_train)))
rmse
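
The missing-value counts in Question 2 suggest that the test file carries no `SalePrice` values, so an RMSE cannot be computed against it directly. One common workaround is to hold out part of the labeled training data for validation. A minimal sketch on synthetic data (the data, the 80/20 split, and the variable names are assumptions for illustration, not the assignment's required method):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic regression data: 3 features, known coefficients, small noise
X = rng.normal(size=(200, 3))
y = X @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.1, size=200)

# Hold out 20% of the labeled rows for validation
X_fit, X_val, y_fit, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

model = LinearRegression().fit(X_fit, y_fit)

# RMSE on rows the model has not seen during fitting
rmse = np.sqrt(mean_squared_error(y_val, model.predict(X_val)))
print(rmse)
```

An RMSE measured on held-out rows is usually a more honest estimate of generalization than one measured on the rows used for fitting.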