<a href="https://colab.research.google.com/github/Ash100/Biopython/blob/main/Regression_on_Boston_House_Prices_5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##Regression of Boston House Prices
I am **Dr. Ashfaq Ahmad**, and this notebook was created for teaching and research purposes. With readers from the field of biology in mind, I have tried my best to keep it as simple as possible. For detailed instructions and explanations, please watch the video tutorial at **https://www.youtube.com/@Bioinformaticsinsights**

This notebook is based on the book **Deep Learning with Python** by Jason Brownlee.

In this project tutorial you will discover how to develop and evaluate neural network models using Keras for a regression problem.
Regression analysis is a set of statistical processes for estimating the relationships between a dependent variable (often called the 'outcome' or 'response' variable, or a 'label' in machine learning parlance) and one or more independent variables (often called 'predictors', 'covariates', 'explanatory variables' or 'features'). The most common form of regression analysis is linear regression, in which one finds the line (or a more complex linear combination) that most closely fits the data according to a specific mathematical criterion.
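
To make "the line that most closely fits the data" concrete, here is a minimal sketch that fits a straight line by ordinary least squares in plain Python. The helper name `fit_line` and the toy points are illustrative assumptions and are not part of the Boston dataset or the original notebook.

```python
# Minimal sketch: fit y = slope * x + intercept by ordinary least squares.
# `fit_line` and the toy data are hypothetical, for illustration only.

def fit_line(xs, ys):
    """Return (slope, intercept) minimizing the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Toy points lying exactly on y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)  # prints: 2.0 1.0
```

A neural network regressor pursues the same goal (small squared error) but can represent non-linear relationships between the inputs and the output.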

##About the Dataset
The problem that we will look at in this tutorial is the Boston house price dataset. The dataset describes properties of houses in Boston suburbs and is concerned with modeling the price of houses in those suburbs in thousands of dollars. As such, this is a regression predictive modeling problem. The dataset can be downloaded from the Kaggle repository.

There are 13 input variables that describe the properties of a given Boston suburb, plus one output variable (MEDV).
The full list of attributes in this dataset is as follows:
1. CRIM: per capita crime rate by town.
2. ZN: proportion of residential land zoned for lots over 25,000 sq.ft.
3. INDUS: proportion of non-retail business acres per town.
4. CHAS: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
5. NOX: nitric oxides concentration (parts per 10 million).
6. RM: average number of rooms per dwelling.
7. AGE: proportion of owner-occupied units built prior to 1940.
8. DIS: weighted distances to five Boston employment centers.
9. RAD: index of accessibility to radial highways.
10. TAX: full-value property-tax rate per 10,000 dollars.
11. PTRATIO: pupil-teacher ratio by town.
12. B: $1000(B_k - 0.63)^2$, where $B_k$ is the proportion of blacks by town.
13. LSTAT: % lower status of the population.
14. MEDV: median value of owner-occupied homes in thousands of dollars (the output variable).

##Develop a Baseline Neural Network Model

In [None]:
#Install Keras and Scikit
!pip install --upgrade keras
!pip install --upgrade scikit_learn

In [None]:
#Install SciKeras with CPU-only TensorFlow (run this cell or the next one, as per your need)
!pip install scikeras[tensorflow-cpu]

In [None]:
#Install SciKeras with standard TensorFlow (run this cell or the previous one, as per your need)
!pip install scikeras[tensorflow]      # gpu compute platform

In [3]:
import numpy
import pandas
from keras.models import Sequential
from keras.layers import Dense
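# keras.wrappers.scikit_learn is no longer shipped with recent Keras; SciKeras (installed above) provides the wrapper
from scikeras.wrappers import KerasRegressor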
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

##Load your Dataset
We can now load our dataset from a file in the local directory. The dataset is in fact not in CSV format; the attributes are instead separated by whitespace. We can load this easily using the Pandas library. We can then split the input (X) and output (Y) attributes so that they are easier to model with Keras and scikit-learn.
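
To show what "separated by whitespace" means in practice, the following sketch parses such records without pandas and applies the same 13-plus-1 column split. The two rows are made-up placeholder values, not records read from `housing.csv`.

```python
# Minimal sketch: parse whitespace-separated records and split inputs from output.
# The rows below are illustrative placeholders, NOT real records from housing.csv.
raw_text = """\
0.1 18.0 2.3 0 0.5 6.5 65.2 4.0 1 296.0 15.3 396.9 5.0 24.0
0.2  0.0 7.1 0 0.4 6.4 78.9 4.9 2 242.0 17.8 396.9 9.1 21.6
"""

rows = []
for line in raw_text.splitlines():
    # str.split() with no argument splits on any run of spaces or tabs.
    fields = [float(value) for value in line.split()]
    rows.append(fields)

# The first 13 columns are the inputs (X); the 14th column is the target (Y).
X = [row[0:13] for row in rows]
Y = [row[13] for row in rows]

print(len(X), len(X[0]), Y)
```

The slicing mirrors `dataset[:,0:13]` and `dataset[:,13]` used in the notebook cell, only on plain Python lists instead of a NumPy array.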

In [4]:
# load dataset
# sep=r"\s+" splits on any run of whitespace (delim_whitespace is deprecated in recent pandas)
dataframe = pandas.read_csv("/content/sample_data/housing.csv", sep=r"\s+", header=None)
dataset = dataframe.values
# split into input (X) and output (Y) variables
X = dataset[:,0:13]
Y = dataset[:,13]

In [5]:
from keras.models import Sequential
from keras.layers import Dense

# Define base model
def baseline_model():
    # Create model
    model = Sequential()
    model.add(Dense(13, input_dim=13, kernel_initializer='normal', activation='relu'))
    model.add(Dense(1, kernel_initializer='normal'))

    # Compile model
    model.compile(loss='mean_squared_error', optimizer='adam')
    return model

The Keras wrapper object for use in scikit-learn as a regression estimator is called KerasRegressor.
We will create an instance and pass it both the function that creates the neural network model and some parameters to pass along to the *fit()* function of the model later, such as the number of epochs and the batch size.
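
As a rough guide to what the epochs and batch size imply, the sketch below counts how many weight updates one training run performs. The helper `updates_per_run` is hypothetical, and the sample count of 455 is an assumption (about nine tenths of a 506-row dataset, as in one training fold of 10-fold cross-validation).

```python
import math

def updates_per_run(n_samples, batch_size, epochs):
    """Number of gradient updates when every epoch visits all samples once."""
    # The last batch of an epoch may be smaller, hence the ceiling.
    batches_per_epoch = math.ceil(n_samples / batch_size)
    return batches_per_epoch * epochs

n_train = 455  # assumed size of one training fold
total = updates_per_run(n_train, batch_size=5, epochs=100)
print(total)  # prints: 9100
```

A small batch size such as 5 therefore means many small updates per epoch, which is one reason these cells take a while to run.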

In [None]:
from scikeras.wrappers import KerasRegressor

# fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)
# wrap the baseline model as a scikit-learn estimator (inputs are not standardized yet)
estimator = KerasRegressor(model=baseline_model, epochs=100, batch_size=5, verbose=0)


In [None]:
#Model Evaluation
from sklearn.model_selection import KFold, cross_val_score

# Define the number of folds for cross-validation
kfold = KFold(n_splits=10, shuffle=True, random_state=seed)

# Cross-validate the Keras estimator defined in the previous cell.
# scikit-learn reports MSE as a negated score so that higher is always better.
results = cross_val_score(estimator, X, Y, cv=kfold, scoring="neg_mean_squared_error")

# Flip the sign to print the mean and standard deviation of the MSE
print("Baseline: %.2f (%.2f) MSE" % (-results.mean(), results.std()))
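
The 10-fold procedure used in this cell can be pictured with a small pure-Python sketch of how sample indices are partitioned. scikit-learn's `KFold` does this for you; the helper `kfold_indices` is hypothetical and only illustrates the idea (it does not reproduce scikit-learn's exact shuffling).

```python
import random

def kfold_indices(n_samples, n_splits, seed):
    """Yield (train_idx, test_idx) pairs; each sample is held out exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Spread the remainder so that fold sizes differ by at most one.
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

# 23 samples is an arbitrary small number chosen for readability.
folds = list(kfold_indices(n_samples=23, n_splits=10, seed=7))
for train_idx, test_idx in folds:
    print(len(train_idx), len(test_idx))
```

The model is trained once per fold on `train_idx` and scored on `test_idx`; the mean and standard deviation of those ten scores are what the cell prints.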


##Standardizing the dataset
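An important concern with the Boston house price dataset is that the input attributes all vary in their scales because they measure different quantities. It is almost always good practice to prepare your data before modeling it using a neural network model. Continuing on from the above baseline model, we can re-evaluate the same model using a standardized version of the input dataset. Wrapping the scaler and the model in a scikit-learn Pipeline means the scaler is fitted only on the training portion of each cross-validation fold, which avoids leaking information from the held-out fold.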
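
Standardization itself is simple arithmetic: subtract the mean and divide by the standard deviation, both learned from the training data. The sketch below shows this for a single feature in plain Python; the helper names and the numbers are illustrative assumptions, and in the notebook `StandardScaler` performs this step for every column.

```python
import statistics

def fit_scaler(column):
    """Learn the mean and population standard deviation of one feature."""
    return statistics.fmean(column), statistics.pstdev(column)

def transform(column, mean, std):
    """Rescale values using statistics learned elsewhere (the training data)."""
    return [(value - mean) / std for value in column]

# Made-up values for one feature, split into training and held-out parts.
train_values = [200.0, 300.0, 400.0]
test_values = [500.0]

mean, std = fit_scaler(train_values)               # fit on training data only
train_scaled = transform(train_values, mean, std)
test_scaled = transform(test_values, mean, std)    # reuse the training statistics
print(train_scaled, test_scaled)
```

After the transform the training values have zero mean and unit variance, while the held-out value is expressed on the same scale without having influenced the statistics.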

In [None]:
import numpy
import pandas
from keras.models import Sequential
from keras.layers import Dense
from scikeras.wrappers import KerasRegressor
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Load dataset (whitespace-separated, no header row)
dataframe = pandas.read_csv("/content/sample_data/housing.csv", sep=r"\s+", header=None)
dataset = dataframe.values
# split into input (X) and output (Y) variables
X = dataset[:,0:13]
Y = dataset[:,13]

# Define base model
def baseline_model():
    # Create model
    model = Sequential()
    model.add(Dense(13, input_dim=13, kernel_initializer='normal', activation='relu'))
    model.add(Dense(1, kernel_initializer='normal'))
    # Compile model
    model.compile(loss='mean_squared_error', optimizer='adam')
    return model

# Fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)

# Evaluate model with standardized dataset
estimators = []
estimators.append(('standardize', StandardScaler()))
estimators.append(('mlp', KerasRegressor(model=baseline_model, epochs=50,
                                          batch_size=5, verbose=0)))
pipeline = Pipeline(estimators)
kfold = KFold(n_splits=10, random_state=seed, shuffle=True)
# Score with negated MSE (scikit-learn convention), then flip the sign for printing
results = cross_val_score(pipeline, X, Y, cv=kfold, scoring="neg_mean_squared_error")
print("Standardized: %.2f (%.2f) MSE" % (-results.mean(), results.std()))
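
Because MEDV is recorded in thousands of dollars, a mean squared error is expressed in squared thousands of dollars. Taking the square root (RMSE) gives a number that is easier to read. The following sketch uses made-up true and predicted values purely for illustration; `mean_squared_error` here is a local helper, not the Keras or scikit-learn function.

```python
import math

def mean_squared_error(y_true, y_pred):
    """Average of squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Made-up house prices in thousands of dollars (illustrative only).
y_true = [24.0, 21.6, 34.7]
y_pred = [26.0, 20.6, 32.7]

mse = mean_squared_error(y_true, y_pred)   # squared thousands of dollars
rmse = math.sqrt(mse)                      # thousands of dollars
print("MSE: %.2f  RMSE: %.2f" % (mse, rmse))
```

An RMSE of about 1.7 on these toy values would mean predictions are typically off by roughly 1,700 dollars.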

##Tuning of Network Topology by adding extra hidden layers

In [None]:
import numpy
import pandas
from keras.models import Sequential
from keras.layers import Dense
from scikeras.wrappers import KerasRegressor
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Load dataset (whitespace-separated, no header row)
dataframe = pandas.read_csv("/content/sample_data/housing.csv", sep=r"\s+", header=None)
dataset = dataframe.values
# split into input (X) and output (Y) variables
X = dataset[:,0:13]
Y = dataset[:,13]

# Define the model
def larger_model():
    # Create model: a second hidden layer with 6 neurons makes the network deeper
    model = Sequential()
    model.add(Dense(13, input_dim=13, kernel_initializer='normal', activation='relu'))
    model.add(Dense(6, kernel_initializer='normal', activation='relu'))
    model.add(Dense(1, kernel_initializer='normal'))

    # Compile model
    model.compile(loss='mean_squared_error', optimizer='adam')
    return model

# Fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)

# Evaluate model with standardized dataset
estimators = []
estimators.append(('standardize', StandardScaler()))
estimators.append(('mlp', KerasRegressor(model=larger_model, epochs=50, batch_size=5,
                                          verbose=0)))
pipeline = Pipeline(estimators)
kfold = KFold(n_splits=10, shuffle=True, random_state=seed)  # shuffle with a fixed seed
# Score with negated MSE (scikit-learn convention), then flip the sign for printing
results = cross_val_score(pipeline, X, Y, cv=kfold, scoring="neg_mean_squared_error")
print("Larger: %.2f (%.2f) MSE" % (-results.mean(), results.std()))

##Let's make the network wider
Here we widen the baseline network by giving its single hidden layer 20 neurons instead of 13.
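
One way to compare the three topologies in this notebook is to count their trainable parameters. The sketch below does the arithmetic for fully connected layers with one bias per neuron (the Keras `Dense` default); the helper `count_parameters` is hypothetical.

```python
def count_parameters(layer_sizes):
    """Weights plus biases for fully connected layers of the given sizes."""
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out + n_out  # weight matrix plus one bias per neuron
    return total

# Layer sizes: 13 inputs, hidden layer(s), 1 output
topologies = {
    "baseline": [13, 13, 1],
    "larger": [13, 13, 6, 1],
    "wider": [13, 20, 1],
}
counts = {name: count_parameters(sizes) for name, sizes in topologies.items()}
print(counts)  # prints: {'baseline': 196, 'larger': 273, 'wider': 301}
```

The wider network has the most parameters of the three even though it has only one hidden layer, so depth and width are two different ways of adding capacity.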

In [None]:
import numpy
import pandas
from keras.models import Sequential
from keras.layers import Dense
from scikeras.wrappers import KerasRegressor
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# load dataset (whitespace-separated, no header row)
dataframe = pandas.read_csv("/content/sample_data/housing.csv", sep=r"\s+", header=None)
dataset = dataframe.values
# split into input (X) and output (Y) variables
X = dataset[:,0:13]
Y = dataset[:,13]

# define wider model
def wider_model():
    # create model: one hidden layer with 20 neurons
    model = Sequential()
    model.add(Dense(20, input_dim=13, kernel_initializer='normal', activation='relu'))
    model.add(Dense(1, kernel_initializer='normal'))
    # Compile model
    model.compile(loss='mean_squared_error', optimizer='adam')
    return model

# fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)

# evaluate model with standardized dataset
estimators = []
estimators.append(('standardize', StandardScaler()))
estimators.append(('mlp', KerasRegressor(model=wider_model, epochs=100, batch_size=5,
                                         verbose=0)))
pipeline = Pipeline(estimators)
kfold = KFold(n_splits=10, shuffle=True, random_state=seed)
# score with negated MSE (scikit-learn convention), then flip the sign for printing
results = cross_val_score(pipeline, X, Y, cv=kfold, scoring="neg_mean_squared_error")
print("Wider: %.2f (%.2f) MSE" % (-results.mean(), results.std()))