# **Import Important Libraries**

In [3]:
!pip install dmba

Collecting dmba
  Downloading dmba-0.2.4-py3-none-any.whl.metadata (1.9 kB)
Downloading dmba-0.2.4-py3-none-any.whl (11.8 MB)
Installing collected packages: dmba
Successfully installed dmba-0.2.4


In [4]:
import pandas as pd
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
#from dmba import regressionSummary
from dmba import forward_selection, backward_elimination, stepwise_selection, AIC_score
import statsmodels.api as sm
from sklearn.model_selection import train_test_split

Colab environment detected.


# **Data loading and preprocessing**

In [6]:
data = pd.read_csv('/content/housing_price_dataset.csv')

2. Inspect the dataset


In [7]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 50000 entries, 0 to 49999
Data columns (total 6 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   SquareFeet    50000 non-null  int64  
 1   Bedrooms      50000 non-null  int64  
 2   Bathrooms     50000 non-null  int64  
 3   Neighborhood  50000 non-null  object 
 4   YearBuilt     50000 non-null  int64  
 5   Price         50000 non-null  float64
dtypes: float64(1), int64(4), object(1)
memory usage: 2.3+ MB


- The output shows that there are no missing values: every one of the 6 columns has 50,000 non-null entries.
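
The same check can be made explicit by counting missing values per column. The sketch below uses a small made-up frame; the `toy` values are assumptions for illustration, not rows from the housing dataset.

```python
import pandas as pd

# Hypothetical toy frame with two deliberately missing entries.
toy = pd.DataFrame({
    "squarefeet": [1200, None, 1800],
    "price": [200000.0, 250000.0, None],
})

# isna() marks missing cells; sum() counts them per column.
missing_per_column = toy.isna().sum()
total_missing = int(missing_per_column.sum())
print(missing_per_column)
print("Total missing:", total_missing)
```

On the real dataframe, a total of zero would confirm what data.info() already suggests.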


3. Checking for duplicates

In [8]:
data.duplicated().sum()

0

- There are no duplicate rows, so nothing needs to be removed.
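
For reference, this is roughly how the check and the removal would look if duplicates did exist. The toy frame is an assumption for illustration only.

```python
import pandas as pd

# Hypothetical toy frame in which the last row repeats the first.
toy = pd.DataFrame({
    "bedrooms": [2, 3, 2],
    "price": [100.0, 150.0, 100.0],
})

# duplicated() flags every row that repeats an earlier row.
n_dupes = int(toy.duplicated().sum())

# drop_duplicates() keeps the first occurrence of each row.
deduped = toy.drop_duplicates().reset_index(drop=True)
print(n_dupes, len(deduped))
```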

4. Fixing structural errors

In [9]:
data.columns = data.columns.str.lower().str.strip()

- I remove whitespace before and after each column name using strip(), called through the pandas .str accessor on data.columns.
- I convert the column names to lowercase using lower(), called through the same .str accessor.
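
A minimal sketch of the effect, using a made-up frame whose untidy headers are assumptions chosen to show both operations:

```python
import pandas as pd

# Hypothetical headers with stray whitespace and mixed case.
toy = pd.DataFrame({" SquareFeet ": [1500], "Price\t": [250000.0]})

# Strip surrounding whitespace, then lowercase every column name.
toy.columns = toy.columns.str.strip().str.lower()
cleaned = list(toy.columns)
print(cleaned)
```

The order of strip() and lower() does not matter here, since neither changes what the other acts on.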

1. Dealing with categorical variables

In [10]:
# One-hot encode neighborhood; drop_first=True drops one category to avoid perfectly collinear dummies
data = pd.get_dummies(data, columns=['neighborhood'], drop_first=True)

# Names of the boolean dummy columns created above
cat_boo = data.select_dtypes('bool').columns.tolist()
# Convert True/False to 1/0
data[cat_boo] = data[cat_boo].astype(int)

- I first created dummy variables with the pandas get_dummies() function for the categorical variable neighborhood. With drop_first=True, the first category is dropped and serves as the baseline, which avoids perfectly collinear dummy columns.
- I collected the names of the boolean columns in a list called cat_boo.
- Then I converted those boolean columns to integers (1/0), so that every feature passed to the regression has a numeric dtype.
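
The encoding can be seen on a tiny example. The category values below (Rural, Suburb, Urban) are assumed for illustration, and `dtype=int` is an optional get_dummies() argument that produces 1/0 columns in one step instead of converting booleans afterwards.

```python
import pandas as pd

# Hypothetical toy data; the category labels are assumptions.
toy = pd.DataFrame({
    "neighborhood": ["Rural", "Suburb", "Urban", "Rural"],
    "price": [1.0, 2.0, 3.0, 4.0],
})

# drop_first=True drops the first category in sorted order ("Rural"),
# which then acts as the baseline (both dummies equal 0).
encoded = pd.get_dummies(toy, columns=["neighborhood"], drop_first=True, dtype=int)
print(encoded)
```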

2. Scale numerical features (min-max normalization)

In [11]:
num_var = ['squarefeet','yearbuilt','bedrooms','bathrooms']
scaler = MinMaxScaler()
data[num_var] = scaler.fit_transform(data[num_var])

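- First I listed the numerical features in num_var.
- I then created a MinMaxScaler, which rescales each feature to the range [0, 1]. This is min-max normalization rather than standardization (zero mean, unit variance).
- fit_transform() learns the minimum and maximum of each column and returns the rescaled values, which I assigned back to the dataframe.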
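
The formula behind min-max scaling is (x - min) / (max - min). The notebook scales the full dataset before splitting; a common alternative is to learn the minimum and maximum from the training rows only and reuse them for the test rows. The sketch below illustrates that alternative in plain Python; the helper names and the numbers are hypothetical.

```python
def fit_min_max(values):
    """Learn the scaling parameters (min and max) from training values."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot scale a constant feature")
    return lo, hi


def apply_min_max(values, lo, hi):
    """Rescale values with previously learned parameters."""
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical square-footage values.
train_sqft = [1000, 1500, 3000]
test_sqft = [2000, 3500]

lo, hi = fit_min_max(train_sqft)                 # learned from training data only
train_scaled = apply_min_max(train_sqft, lo, hi)
test_scaled = apply_min_max(test_sqft, lo, hi)   # may fall outside [0, 1]
print(train_scaled, test_scaled)
```

With this approach, test values outside the training range map outside [0, 1], which is expected.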


3. Define features (X) and target (y)

In [12]:
target = data['price']
features = data.drop(columns=['price'])

- I stored the price variable from the dataframe in a variable called target.
- I stored the remaining variables of the dataframe in a variable called features.

4. Add a constant for the intercept (required for statsmodels)

In [13]:
features = sm.add_constant(features)
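
add_constant() is needed when fitting with statsmodels OLS, which does not add an intercept on its own. The models in this notebook are fitted with sklearn's LinearRegression, which fits an intercept by default, so the const column carries no extra information for them. A small sketch with synthetic data (all values are assumptions) shows that the constant column then receives a coefficient of essentially zero:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=50)
y = 3.0 * x + 2.0  # exact line: slope 3, intercept 2

# Column 0 is all ones and plays the role of 'const'.
X = np.column_stack([np.ones_like(x), x])

model = LinearRegression()  # fit_intercept=True by default
model.fit(X, y)

const_coef, slope = model.coef_
intercept = model.intercept_
print(const_coef, slope, intercept)
```

This is consistent with the stepwise output further down, where const is offered as a candidate but never selected.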

5. Split the data

In [14]:
features_train, features_test, target_train, target_test = train_test_split(features, target, test_size=0.2, random_state=42)


- I used the train_test_split() function to split the records of my dataset in an 80:20 ratio (test_size=0.2).
- 80% of the records are stored in features_train and target_train and are used for training.
- The remaining 20% are stored in features_test and target_test and are used for testing.
- Setting random_state=42 makes the split reproducible.
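
A minimal sketch of the split on ten synthetic rows (the arrays are assumptions) shows the resulting sizes and that a fixed random_state reproduces the same partition:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Ten synthetic records with two features each.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Repeating the call with the same random_state gives the same split.
_, _, _, y_test_again = train_test_split(X, y, test_size=0.2, random_state=42)

print(len(X_train), len(X_test))
```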

# Model fitting and Prediction

1. Fitting the model

In [15]:
model = LinearRegression()
model.fit(features_train,target_train)

- I created a regression model using the LinearRegression class from the sklearn library.
- Then I fitted the model using the features_train and target_train variables.

2. Make predictions

In [16]:
fitted_v = model.predict(features_test)

3. Evaluate the model

In [17]:
mae1 = mean_absolute_error(target_test, fitted_v)
mse1 = mean_squared_error(target_test, fitted_v)
rmse1 = mse1 ** 0.5
r21 = r2_score(target_test, fitted_v)

- I computed three error metrics (MAE, MSE, RMSE) and the coefficient of determination (R²) to evaluate my model on the test set.
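
To make the four metrics concrete, here is a plain-Python sketch of their definitions on four made-up values. It is meant to mirror what the sklearn functions compute, not to replace them.

```python
# Hypothetical true and predicted values.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]
n = len(y_true)

errors = [t - p for t, p in zip(y_true, y_pred)]

mae = sum(abs(e) for e in errors) / n   # mean absolute error
mse = sum(e ** 2 for e in errors) / n   # mean squared error
rmse = mse ** 0.5                       # root mean squared error

# R² = 1 - SS_res / SS_tot
mean_true = sum(y_true) / n
ss_res = sum(e ** 2 for e in errors)
ss_tot = sum((t - mean_true) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot

print(mae, mse, rmse, r2)
```

RMSE is in the same unit as the target (here, price), which makes it easier to interpret than MSE.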

In [18]:
print(f"Mean Absolute Error: {mae1: .3f}")
print(f"Mean Squared Error: {mse1: .3f}")
print(f"Root Mean Squared Error: {rmse1: .3f}")
print(f"R² Score: {r21:.4f}")

Mean Absolute Error:  39430.165
Mean Squared Error:  2436249371.307
Root Mean Squared Error:  49358.377
R² Score: 0.5756


# Feature selection and Model fitting 2

1. Select features using a stepwise selection process

In [19]:
def train_model(variables):
    if len(variables) == 0:
        return None
    model = LinearRegression()
    model.fit(features_train[variables], target_train)
    return model

def score_model(model, variables):
    if model is None:
        return AIC_score(target_train, [target_train.mean()] * len(target_train), model, df=1)
    return AIC_score(target_test, model.predict(features_test[variables]), model)


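- The train_model() function creates and fits a LinearRegression on the given subset of variables, as in the previous model. It returns None when no variables are given.
- The score_model() function returns an AIC score (lower is better). For a fitted model, it is computed from the target_test records and the model's predictions on features_test. For the empty model, it is computed on target_train, using the training mean as the prediction.
- Because the empty model is scored on the 40,000 training rows and the fitted models on the 10,000 test rows, the Start score in the output below is not directly comparable to the later steps. Scoring on the test set also lets the test data influence which features are selected; scoring every model on the training data would avoid both issues.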
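
As a rough illustration of what an AIC score measures, the sketch below implements the usual Gaussian AIC for a least-squares fit. The function name gaussian_aic and the parameter counting (coefficients plus one for the error variance) are assumptions; they are intended to approximate what AIC_score computes, not to reproduce the dmba source.

```python
import math


def gaussian_aic(y_true, y_pred, n_coef):
    """AIC for a least-squares fit with Gaussian errors.

    n_coef counts the fitted coefficients including the intercept;
    one extra parameter is added for the error variance.
    """
    n = len(y_true)
    sse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return n * math.log(sse / n) + n * (1 + math.log(2 * math.pi)) + 2 * (n_coef + 1)


# Hypothetical values, all scored on the same data.
y = [1.0, 2.0, 3.0, 4.0]
mean_only = [2.5] * 4             # intercept-only baseline
better = [1.1, 1.9, 3.2, 3.8]     # predictions from a one-feature model

aic_baseline = gaussian_aic(y, mean_only, n_coef=1)
aic_model = gaussian_aic(y, better, n_coef=2)
print(aic_baseline, aic_model)
```

A lower value is better: the first term rewards a smaller sum of squared errors, while the last term adds a penalty of 2 per additional parameter.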

In [20]:
print('Stepwise selection:')
best_model, best_variable = stepwise_selection(features_train.columns, train_model, score_model, verbose=True)

Stepwise selection:
Variables: const, squarefeet, bedrooms, bathrooms, yearbuilt, neighborhood_Suburb, neighborhood_Urban
Start: score=1012845.58, constant
Step: score=244644.59, add squarefeet
Step: score=244543.60, add bedrooms
Step: score=244532.55, add bathrooms
Step: score=244529.18, add neighborhood_Urban
Step: score=244529.18, unchanged None
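
The log above shows variables being added one at a time until the score stops improving. The following is a minimal forward-only sketch of that idea using NumPy least squares and a Gaussian AIC. It is not dmba's implementation: it omits the removal step that stepwise selection also tries, and every name and all of the synthetic data are assumptions.

```python
import math
import numpy as np


def sse_of_fit(X, y, cols):
    """Fit y on an intercept plus the chosen columns; return the SSE."""
    design = np.column_stack([np.ones(len(y))] + [X[:, c] for c in cols])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    return float(resid @ resid)


def aic(sse, n, n_coef):
    """Gaussian AIC; n_coef includes the intercept."""
    return n * math.log(sse / n) + n * (1 + math.log(2 * math.pi)) + 2 * (n_coef + 1)


def forward_select(X, y):
    """Greedily add the column that lowers AIC most; stop when none does."""
    n, p = X.shape
    selected = []
    best = aic(sse_of_fit(X, y, selected), n, 1)  # intercept-only model
    while len(selected) < p:
        trials = {
            c: aic(sse_of_fit(X, y, selected + [c]), n, len(selected) + 2)
            for c in range(p) if c not in selected
        }
        candidate = min(trials, key=trials.get)
        if trials[candidate] >= best:
            break  # no remaining column improves the score
        selected.append(candidate)
        best = trials[candidate]
    return selected, best


# Synthetic data: y depends strongly on column 0, weakly on column 2.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 5.0 * X[:, 0] + 1.0 * X[:, 2] + rng.normal(scale=0.1, size=200)

chosen, score = forward_select(X, y)
print(chosen, round(score, 2))
```

Here every candidate is scored on the same data it was fitted on, so all scores in a run are comparable with each other.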


In [21]:
# Stepwise selection: pair each coefficient with the variable it belongs to
print(f'Intercept: {best_model.intercept_: .3f}')
print('Coefficients:')
for name, coef in zip(best_variable, best_model.coef_):
    print(f'{name}: {coef}')

Intercept:  113466.397
Coefficients:
squarefeet: 198590.12765248335
bedrooms: 15692.827716552423
bathrooms: 5935.626505146774
neighborhood_Urban: 1715.1744806171191

4. Extract the selected features of the stepwise selection algorithm

In [22]:
data2 = features_test[best_variable]

5. Predict the fitted values using the selected features

In [23]:
y_pred = best_model.predict(data2)

6. Calculate errors and a determination score

In [24]:
mae = mean_absolute_error(target_test, y_pred)
mse = mean_squared_error(target_test, y_pred)
rmse = mse ** 0.5
r2 = r2_score(target_test, y_pred)

7. Display the results

In [25]:
print(f"Mean Absolute Error: {mae: .3f}")
print(f"Mean Squared Error: {mse: .3f}")
print(f"Root Mean Squared Error: {rmse: .3f}")
print(f"R² Score: {r2:.4f}\n")


Mean Absolute Error:  39432.371
Mean Squared Error:  2436531259.723
Root Mean Squared Error:  49361.232
R² Score: 0.5755



# **Interpretation**
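Based on these two models, the first one (all features) performs slightly better on the test set: its Mean Absolute Error is lower (39430.165 vs. 39432.371) and its coefficient of determination is higher (0.5756 vs. 0.5755). The differences are very small, however, and the stepwise model reaches nearly the same accuracy with four predictors (squarefeet, bedrooms, bathrooms, neighborhood_Urban) instead of six.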
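
To put the size of the gap in perspective, the arithmetic below uses the metrics printed above for the two models. The variable names are hypothetical.

```python
# Test-set metrics reported above.
mae_full, mae_step = 39430.165, 39432.371
r2_full, r2_step = 0.5756, 0.5755

mae_gap = mae_step - mae_full          # absolute difference in MAE
mae_gap_pct = 100 * mae_gap / mae_full # relative difference in percent
r2_gap = r2_full - r2_step             # difference in R²

print(f"MAE gap: {mae_gap:.3f} ({mae_gap_pct:.4f}%)")
print(f"R² gap: {r2_gap:.4f}")
```

The MAE gap is a few units of price on an error of roughly 39,000, so the two models are practically equivalent on this test set.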
