# Regression data using scikit-learn

Regression is the process of predicting a dependent variable by analyzing its relationship with one or more independent variables. Several common algorithms help us uncover these relationships so that we can better predict the value.

In this notebook, we'll use `scikit-learn` to predict values. `scikit-learn` provides implementations of many regression algorithms. Here, we compare three of them: simple linear regression, multiple linear regression, and polynomial linear regression.

To help visualize what we are doing, we'll use 2D and 3D charts, built with the matplotlib and seaborn Python libraries, to show how the data looks (with three selected dimensions).


<a id="top"></a>
## Table of Contents

1. [Load libraries](#load_libraries)
2. [Helper methods for metrics](#helper_methods)
3. [Data exploration](#explore_data)
4. [Prepare data for building regression model](#prepare_data)
5. [Build Simple Linear Regression model](#model_slr)
6. [Build Multiple Linear Regression model](#model_mlr)
7. [Build Polynomial Linear Regression model](#model_plr) 

### Quick set of instructions to work through the notebook

If you are new to Notebooks, here's a quick overview of how to work in this environment.

1. The notebook has 2 types of cells - markdown (text) such as this and code such as the one below. 
2. Each cell with code can be executed independently or together (see options under the Cell menu). When working in this notebook, we will be running one cell at a time because we need to make code changes to some of the cells.
3. To run the cell, position cursor in the code cell and click the Run (arrow) icon. The cell is running when you see the * next to it. Some cells have printable output.
4. Work through this notebook by reading the instructions and executing code cell by cell. Some cells will require modifications before you run them. 

<a id="load_libraries"></a>
## 1. Load libraries
[Top](#top)

It is conventional to import all of your Python libraries at the top of the file. While it is possible to import libraries at any point in a Python notebook, importing them in one place makes it easy to see where each symbol comes from.

In [None]:
# Fill in missing values
from sklearn.impute import SimpleImputer

# Encode categorical columns
from sklearn.preprocessing import OneHotEncoder

# Scale numerical columns
from sklearn.preprocessing import StandardScaler

# Chain a sequence of transformations
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Separate the data into Training and Testing sets
from sklearn.model_selection import train_test_split

# Compute performance metrics for models
from sklearn.metrics import mean_squared_error, r2_score

# Data manipulation
import pandas as pd

# Plotting
import matplotlib.pyplot as plt
import seaborn as sns


<a id="helper_methods"></a>
## 2. Helper methods for metrics
[Top](#top)

One of the benefits of using Python for data science is that you can simplify your work by wrapping repetitive tasks in functions.

In the following section, we define three helper functions that we will reuse throughout the notebook.

In [None]:

def two_d_compare(X_test,y_test,y_pred,model_name):
    '''
    Plot the predicted values and actual values on two side-by-side plots.

    :param X_test: A series containing the X values.
    :param y_test: A series containing the actual Y values corresponding to X_test entries.
    :param y_pred: A series containing the predicted Y values corresponding to X_test entries.
    :param model_name: name of the model. Used for placing in the plot's title.
    '''

    # Defining a plot with two subplots
    plt.subplots(ncols=2, figsize=(10,4))

    # Naming the plots
    plt.suptitle('Actual vs Predicted data : ' +model_name + '. Variance score: %.2f' % r2_score(y_test, y_pred))

    # Populating the first subplot
    plt.subplot(121)
    plt.scatter(X_test, y_test, alpha=0.8, color='#8CCB9B')
    plt.title('Actual')

    # Populating the second subplot
    plt.subplot(122)
    plt.scatter(X_test, y_pred,alpha=0.8, color='#E5E88B')
    plt.title('Predicted')

    # directive to display the created plot
    plt.show()
    

def model_metrics(y_test,y_pred):
    '''
    Calculate the MSE and the R2 score, print them, and return them as a list.

    :param y_test: A series containing the actual Y values
    :param y_pred: A series containing the predicted Y values
    '''

    # Calculate and print Mean Squared Error (MSE)
    mse = mean_squared_error(y_test,y_pred)
    print("Mean squared error: %.2f" % mse)
    
    # Calculate and print R^2 
    r2 = r2_score(y_test, y_pred)
    print('R2 score: %.2f' % r2 )
    
    return [mse, r2]

def two_vs_three(x_test,y_test,y_pred,z=None, isLinear = False) : 
    '''
    Create a 2D plot of YEAR BUILT vs SELLING PRICE and, optionally, a 3D plot of
    LOT AREA vs YEAR BUILT vs SELLING PRICE.

    Technically, this function creates 2D and 3D scatter plots of any inputs. Since in this
    notebook we only use it to generate the plots mentioned above, we hardcode
    the axis names to avoid having to pass them in as parameters every time.

    :param x_test: A series containing the x values (year built)
    :param y_test: A series containing the actual Y values
    :param y_pred: A series containing the predicted Y values
    :param z: A series containing the values for the third axis (lot area); used only for the 3D plot
    :param isLinear: If True, draw only the 2D plot and skip the 3D plot
    '''
    
    area = 60
    
    # Define the size of the graph and its title
    fig = plt.figure(figsize=(12,6))
    fig.suptitle('2D and 3D view of sales price data')

    # First subplot
    ax = fig.add_subplot(1, 2,1)
    ax.scatter(x_test, y_test, alpha=0.5,color='blue', s= area)
    # ax.plot(x_test, y_pred, alpha=0.9,color='red', linewidth=2)
    ax.plot(x_test, y_pred, alpha=0.5,color='red', marker='s', linewidth=0)
    ax.set_xlabel('YEAR BUILT')
    ax.set_ylabel('SELLING PRICE')
    
    plt.title('YEARBUILT vs SALEPRICE')
    
    if not isLinear : 
        # Second subplot
        ax = fig.add_subplot(1,2,2, projection='3d')

        ax.scatter(z, x_test, y_test, color='blue', marker='o')
        # ax.plot(z, x_test, y_pred, alpha=0.9,color='red', linewidth=2)
        ax.plot(z, x_test, y_pred, alpha=0.5,color='red', marker='s', linewidth=0)
        ax.set_ylabel('YEAR BUILT')
        ax.set_zlabel('SELLING PRICE')
        ax.set_xlabel('LOT AREA')

        # Title only the 3D subplot; outside this block it would overwrite the 2D title
        plt.title('LOT AREA vs YEAR BUILT vs SELLING PRICE')

    plt.show()
    

<a id="explore_data"></a>
## 3. Data exploration
[Top](#top)

Data can be easily loaded within IBM Watson Studio. Instructions to load data within IBM Watson Studio can be found [here](https://developer.ibm.com/tutorials/watson-studio-using-jupyter-notebook/). The data set can be located by its name and inserted into the notebook as a pandas DataFrame as shown below.

![insert_spark_dataframe.png](https://raw.githubusercontent.com/IBM/icp4d-customer-churn-classifier/master/doc/source/images/insert_spark_dataframe.png)

The generated code comes up with a generic name and it is good practice to rename the dataframe to match the use case context.


To simplify this notebook, we will use a pandas feature that lets us load a CSV file directly from the internet. You can follow the instructions above if you want to load your own data set.

In [None]:
# Load the Data
df_pd =  pd.read_csv("https://raw.githubusercontent.com/IBM/ml-learning-path-assets/master/data/predict_home_value.csv")



### About the Data
The data that we are loading contains housing-related information. Using several independent variables from this domain, we are going to predict the sale price of a house. 

In [None]:
# Show the first 5 rows of the data.
# Good for quick inspection of the data and column names.
df_pd.head()

Let's try creating a scatter plot for the price of the house vs. the year the house was built.

In [None]:
year_column = df_pd['YEARBUILT']
price_column = df_pd['SALEPRICE']

sns.scatterplot(x = year_column, y =price_column)

## Exercise 1

Let's practice our plotting. Create a scatter plot of the lot area versus the sales price. Do you see any trends? Did you notice the outliers?

**Hint:** Print the column names first if you don't know the column name. 
**Note** that you can create a scatter plot only for the numerical columns (`int64` in this case)

In [None]:
# Your Answer:

# Uncomment the line below if you need the column names
# df_pd.columns



In [None]:
# Solution 

area_column = df_pd['LOTAREA']
price_column = df_pd['SALEPRICE']

sns.scatterplot(x = area_column, y =price_column)

Next, let's take a look at creating histograms using `seaborn`. The `histplot` function has many options. In this case, we pass `kde=True` to overlay a kernel density estimate, which gives a smoothed view of the distribution.

See the [histplot documentation](https://seaborn.pydata.org/generated/seaborn.histplot.html) to learn about the other options.

In [None]:
sns.histplot(df_pd['SALEPRICE'], kde=True)

Now that we have learned how to explore the data visually, let's see how we can find out the data types of the columns. This tells us which columns are numerical and which are categorical, so we can apply the correct visualization tool to each.

`int64` denotes a 64-bit integer, that is, a numerical value. `object` on the other hand denotes a non-numerical value which in this case we know is a string (text) that indicates categorical values.
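
As a small illustration of this distinction, the sketch below builds a toy DataFrame and separates its numerical and categorical columns by dtype. The values are invented for illustration; only the column names mirror the housing data. Depending on the pandas version, text columns may be reported as `object` or as a dedicated string dtype, so the sketch selects "everything that is not a number" rather than relying on the exact name.

```python
import pandas as pd

# Toy data: the values are invented for illustration only
toy = pd.DataFrame({
    "LOTAREA": [8450, 9600, 11250],               # whole numbers -> integer dtype
    "GARAGETYPE": ["Attchd", "Detchd", "Attchd"],  # text -> non-numeric dtype
})

print(toy.dtypes)

# Split the column names by kind of data
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print("Numerical:", numeric_cols)
print("Categorical:", categorical_cols)
```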

In [None]:
print("The dataset contains columns of the following data types : \n" +str(df_pd.dtypes))

### Missing Values
Notice below that FIREPLACEQU, GARAGETYPE, GARAGEFINISH, GARAGECOND, FENCE, and POOLQC have missing values.

**Important:** Take care of missing data before feeding the data into your ML model. Most regression algorithms cannot handle missing values, so it is up to you to decide what to do with them before passing the data to the next step.

You could, for instance, remove those rows, fill them in with the average of the column, interpolate based on other rows, or any other statistical method. Nonetheless, you should handle missing values (`NaN` or Not a Number) first.
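
To make these options concrete, here is a minimal sketch on a toy column with one missing value. The numbers are invented, and which strategy is appropriate depends on your data; this only shows what each one does.

```python
import numpy as np
import pandas as pd

# Toy column with one missing value (numbers are invented)
toy = pd.DataFrame({"LOTAREA": [8000.0, np.nan, 10000.0, 12000.0]})

# Option 1: remove the rows that contain a missing value
dropped = toy.dropna()

# Option 2: fill the gap with the column average
filled_mean = toy.fillna(toy["LOTAREA"].mean())

# Option 3: interpolate linearly from the neighbouring rows
interpolated = toy.interpolate()

print(len(dropped))
print(filled_mean["LOTAREA"].iloc[1])
print(interpolated["LOTAREA"].iloc[1])
```

Dropping rows loses data, while filling or interpolating keeps every row at the cost of introducing estimated values.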

In [None]:
# Place True for each cell if the value is missing and False if a value is present
missing_values = df_pd.isna()

# Since True is a 1 and False is a 0, by summing each column, we effectively count the number of Trues 
# which is equal to the count of missing values.
missing_values_count = missing_values.sum()

print("The dataset contains the following number of missing values for each of the columns : \n" + str(missing_values_count) )


Alternatively, you can use the following code to simply indicate if there are *any* missing values in each column. By the time you start your machine learning experiment, you want to see `False` for every column that is used in the model.


In [None]:
# Show if there are any missing values in each column
df_pd.isnull().any()

<a id="prepare_data"></a>
## 4. Prepare data for building regression model
[Top](#top)

Data preparation is a very important step in building a machine learning model, because a model can perform well only when the data it is trained on is clean and well prepared. As a result, this step consumes the bulk of a data scientist's time when building models.

During this process, we identify the categorical columns in the data set. Categories need to be indexed, which means the string labels are converted to label indices. These label indices are then encoded using one-hot encoding into a binary vector with at most a single one-value, indicating the presence of a specific feature value from among the set of all feature values. This encoding allows algorithms that expect continuous features to use categorical features.


We begin by identifying columns that will not add any value toward predicting the outputs. While some of these columns are easily identified, a subject matter expert is usually engaged to identify most of them. Removing such columns helps in reducing dimensionality of the model.

In [None]:
#remove columns that are not required
df_pd = df_pd.drop(['ID'], axis=1)

df_pd.head()


The preprocessing techniques that are applied must be customized for each of the columns. Sklearn provides a class called `ColumnTransformer`, which applies different preprocessing steps to selected columns; the steps for each group of columns can be chained in a pipeline.


A common problem when dealing with data sets is missing values. scikit-learn provides classes to fill these empty values with something that is applicable in their context. We use the `SimpleImputer` class and fill the missing values with the most frequent value in the column.
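
The following sketch shows the most-frequent strategy on a toy categorical column. The category values are invented for illustration and are not taken from the actual data set.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy categorical column with one missing entry (values are invented)
toy = pd.DataFrame({"GARAGETYPE": ["Attchd", np.nan, "Detchd", "Attchd"]}, dtype=object)

imputer = SimpleImputer(strategy="most_frequent")
filled = imputer.fit_transform(toy)  # returns a NumPy array, not a DataFrame

print(filled.ravel())
```

The missing entry is replaced by `Attchd`, the value that occurs most often in the column.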


Also, because machine learning algorithms perform better with numbers than with strings, we want to identify columns that have categories and convert them into numbers. We use the OneHotEncoder class provided by Sklearn. The idea of one hot encoder is to create binary variables that each represent a category. By doing this, we remove any ordinal relationship that might occur by just assigning numbers to categories. Basically, we go from a single column that contains multiple class numbers to multiple columns that contain only binary class numbers.
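
As a rough sketch of what one-hot encoding produces, the example below encodes a toy column with three categories. The category names are invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy categorical column (values are invented)
styles = np.array([["1Story"], ["2Story"], ["1Story"], ["SLvl"]], dtype=object)

encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(styles).toarray()  # the default output is sparse

print(encoder.categories_[0])  # learned categories, in sorted order
print(encoded)                 # one binary column per category

# With handle_unknown="ignore", a category not seen during fit becomes all zeros
unknown = encoder.transform(np.array([["Split"]], dtype=object)).toarray()
print(unknown)
```

Each row contains exactly one 1, in the column that corresponds to its category.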

In [None]:
# Defining the categorical columns 
categoricalColumns = df_pd.select_dtypes(include=[object]).columns

print("Categorical columns : " )
print(categoricalColumns)

impute_categorical = SimpleImputer(strategy="most_frequent")
onehot_categorical =  OneHotEncoder(handle_unknown='ignore')

categorical_transformer = Pipeline(steps=[('impute',impute_categorical),('onehot',onehot_categorical)])

The numerical columns in the data set are identified, and `StandardScaler` is applied to each of them. This way, the column mean is subtracted from each value, and the result is divided by the column's standard deviation.
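
To check that description, the sketch below standardizes a toy column by hand and compares the result with `StandardScaler`. The year values are invented for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy numerical column (values are invented)
years = np.array([[1950.0], [1975.0], [2000.0]])

# Manual standardization: subtract the mean, divide by the standard deviation
manual = (years - years.mean(axis=0)) / years.std(axis=0)

# The same computation through scikit-learn
scaled = StandardScaler().fit_transform(years)

print(manual.ravel())
print(scaled.ravel())
```

After scaling, the column has a mean of 0 and a standard deviation of 1, which puts columns with very different ranges (such as year and lot area) on a comparable scale.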

In [None]:
# Defining the numerical columns 
numericalColumns = [col for col in df_pd.select_dtypes(include=[float,int]).columns if col not in ['SALEPRICE']]
print("Numerical columns : " )
print(numericalColumns)

scaler_numerical = StandardScaler()

numerical_transformer = Pipeline(steps=[('scale',scaler_numerical)])


As discussed previously, the techniques are grouped by the columns they need to be applied to and are combined using the `ColumnTransformer`. Ideally, this runs in the pipeline just before the model is trained. However, to understand what the data will look like, we transform the data into temporary variables here.
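
Before applying this to the full data set, it may help to see the idea on a toy frame with one categorical and one numerical column. The values and the name `toy_preprocessor` are invented for this sketch; `sparse_threshold=0` is set only so that the result is always a dense array that prints readably.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame (values are invented)
toy = pd.DataFrame({
    "GARAGETYPE": ["Attchd", "Detchd", "Attchd", "Detchd"],
    "YEARBUILT": [1950, 1975, 2000, 2010],
})

toy_preprocessor = ColumnTransformer(
    transformers=[
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["GARAGETYPE"]),
        ("num", StandardScaler(), ["YEARBUILT"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

transformed = toy_preprocessor.fit_transform(toy)

# Two one-hot columns for GARAGETYPE followed by one scaled YEARBUILT column
print(transformed.shape)
print(transformed)
```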

In [None]:
preprocessorForCategoricalColumns = ColumnTransformer(transformers=[('cat', categorical_transformer, categoricalColumns)],
                                                      remainder="passthrough")
preprocessorForAllColumns = ColumnTransformer(transformers=[('cat', categorical_transformer, categoricalColumns),('num',numerical_transformer,numericalColumns)],
                                              remainder="passthrough")


# The transformation happens in the pipeline. It is done here temporarily to show what the intermediate values look like
df_pd_temp = preprocessorForCategoricalColumns.fit_transform(df_pd)
print("Data after transforming :")
print(df_pd_temp)

df_pd_temp_2 = preprocessorForAllColumns.fit_transform(df_pd)
print("Data after transforming :")
print(df_pd_temp_2)

These are some of the popular preprocessing steps that are applied to data sets. For more information and examples, take a look at the [Data preprocessing in detail](https://developer.ibm.com/articles/data-preprocessing-in-detail/) article.

In [None]:
# Prepare the features and the label for splitting into train and test datasets

# Features: every column except the target
features = df_pd.drop(['SALEPRICE'], axis=1)

# Label: the target column that we want to predict
label = df_pd['SALEPRICE']

print(" value of label : " + str(label))




<a id="model_slr"></a>
## 5. Build Simple Linear Regression model
[Top](#top)

This is the most basic form of linear regression, in which the variable to be predicted depends on only one other variable. The model is the equation of a straight line:

$$y = w_0 + w_1 x_1$$

In the above equation, $y$ refers to the target variable and $x_1$ refers to the independent variable. $w_1$ is the coefficient that expresses the relationship between $y$ and $x_1$; it is also known as the slope. $w_0$ is the constant coefficient, also known as the intercept: the value that $y$ takes when $x_1$ is zero.
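
For a single input variable, the least-squares slope and intercept have a simple closed form. The sketch below computes them with NumPy on toy data that lies exactly on a known line, so the recovered values can be checked by eye. This illustrates the idea only; it is not a description of how scikit-learn computes the fit internally.

```python
import numpy as np

# Toy data lying exactly on the line y = 1 + 2*x
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1.0 + 2.0 * x

# Least-squares estimates for a single input variable
x_centered = x - x.mean()
w1 = np.sum(x_centered * (y - y.mean())) / np.sum(x_centered ** 2)  # slope
w0 = y.mean() - w1 * x.mean()                                       # intercept

print("Intercept w0:", w0)
print("Slope w1:", w1)
```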

Since simple linear regression assumes that output depends on only one variable, we are assuming that it depends on the YEARBUILT. Of course, this will not be the most useful model as it is ignoring all but one column. But it is a good starting point and helps us get familiar with the syntax.

**Important Note:**
Data is split up into training and test sets. This is a common practice where we split the data into two sets before training our model: train and test. Training Data is what the ML algorithm looks at to learn the patterns while the Test portion is never shown to the model during the training. Once the training is complete, we show the previously unseen data set to our model and compare its predictions with the actual values that we have.

To see why this is important, imagine if we used all the data for training and created a model that effectively memorized all the input/output pairs. Now, if we compare our model's predictions to the actual labels that we have, they will match 100% (since the model memorized them). However, if you show any new data to the model, it will perform poorly because it hasn't really uncovered any real pattern!

This is why we keep a portion of the data out during the training phase so that we can evaluate how well our model generalizes after training.
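
The sketch below shows the split on a toy data set of eight rows. The variable names are made up for this example. When no size is given, `train_test_split` holds out 25% of the rows for testing.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 8 samples with one feature each, and a label that is 10x the feature
X_toy = np.arange(8).reshape(-1, 1)
y_toy = np.arange(8) * 10

# random_state makes the shuffle reproducible
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, random_state=0)

print(X_tr.shape, X_te.shape)

# No row appears in both sets
overlap = set(X_tr.ravel()) & set(X_te.ravel())
print(overlap)
```

Features and labels are shuffled together, so each label still belongs to its original row after the split.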

In [None]:
X = features['YEARBUILT'].values.reshape(-1,1)
X_train_slr, X_test_slr, y_train_slr, y_test_slr = train_test_split(X,label , random_state=0)

print("Dimensions of datasets that will be used for training : Input features"+str(X_train_slr.shape)+ 
      " Output label" + str(y_train_slr.shape))
print("Dimensions of datasets that will be used for testing : Input features"+str(X_test_slr.shape)+ 
      " Output label" + str(y_test_slr.shape))

In [None]:
from sklearn.linear_model import LinearRegression

# Our model's name
model_name = 'Simple Linear Regression'

# Create an instance of the LinearRegression class (imported above)
# and assign it to a variable
slRegressor = LinearRegression()

# Train the model by calling .fit() method on it
slRegressor.fit(X_train_slr,y_train_slr)

# Perform prediction on the Test portion of the data
y_pred_slr= slRegressor.predict(X_test_slr)

print("Predictions vs Real labels")
print(pd.DataFrame({
                    'predictions' : y_pred_slr, 
                    'actual values' : y_test_slr.values
                    }))

Since this is a linear regression, we can easily print the intercept and coefficient of the prediction.

In [None]:
print('Intercept: \n',slRegressor.intercept_)
print('Coefficients: \n', slRegressor.coef_)

In [None]:
two_vs_three(X_test_slr[:,0],   # Isolating the first column
             y_test_slr,        # Actual values of the sale price
             y_pred_slr,        # Predicted values of the sale price
             None, True)

# This creates a single graph only: Year Built vs. price. Because simple linear regression
# uses only one input variable, a 3D plot doesn't make sense here.

In [None]:
two_d_compare(X_test_slr,y_test_slr,y_pred_slr,model_name)

Let's look at how well our model is performing by looking at the R^2 and MSE values. Note that we don't expect this model to be doing too well because it's using only a single variable from the whole input, and the following metrics also confirm that.
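
To make the two metrics less abstract, here is a sketch that computes them by hand on a handful of invented values. MSE is the average squared difference between actual and predicted values (lower is better), and R^2 compares the model's squared error with that of always predicting the mean (closer to 1 is better).

```python
import numpy as np

# Toy actual and predicted values (invented for illustration)
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.0, 5.0, 8.0, 9.0])

# Mean squared error: average of the squared residuals
mse = np.mean((y_true - y_hat) ** 2)

# R^2: 1 minus (residual sum of squares / total sum of squares)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2 = 1 - ss_res / ss_tot

print("MSE:", mse)
print("R2:", r2)
```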

In [None]:
# Remember slr stands for Simple Linear Regression
slrMetrics = model_metrics(y_test_slr,y_pred_slr)

<a id="model_mlr"></a>
## 6. Build Multiple Linear Regression model
[Top](#top)

Multiple linear regression is an extension of simple linear regression. In this setup, the target value depends on more than one variable. The number of variables depends on the use case at hand. Usually, a subject matter expert is involved in identifying the fields that will contribute toward better predicting the output.

$$y = w_0 + w_1 x_1 + w_2 x_2 + \dots + w_n x_n$$
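
As a quick check of this formula, the sketch below generates noise-free toy data from known weights and lets `LinearRegression` recover them. The data and the names `X_toy`, `y_toy`, and `toy_model` are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data generated from y = 3 + 2*x1 - 1*x2 (no noise)
X_toy = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 5.0], [4.0, 3.0], [5.0, 8.0]])
y_toy = 3.0 + 2.0 * X_toy[:, 0] - 1.0 * X_toy[:, 1]

toy_model = LinearRegression().fit(X_toy, y_toy)

print("Intercept (w0):", toy_model.intercept_)
print("Coefficients (w1, w2):", toy_model.coef_)

# A prediction is the intercept plus the weighted sum of the inputs
manual_pred = toy_model.intercept_ + X_toy @ toy_model.coef_
```

There is one coefficient per input column and a single intercept.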

In [None]:
X_train, X_test, y_train, y_test = train_test_split(features,label , random_state=0)

print("Dimensions of datasets that will be used for training : Input features"+str(X_train.shape)+ 
      " Output label" + str(y_train.shape))
print("Dimensions of datasets that will be used for testing : Input features"+str(X_test.shape)+ 
      " Output label" + str(y_test.shape))

In [None]:
from sklearn.linear_model import LinearRegression

model_name = 'Multiple Linear Regression'

mlRegressor = LinearRegression()

mlr_model = Pipeline(steps=[('preprocessorAll',preprocessorForAllColumns),('regressor', mlRegressor)])

mlr_model.fit(X_train,y_train)

y_pred_mlr= mlr_model.predict(X_test)

print(mlRegressor)

Notice how we now have many coefficients but still a single intercept: there is one coefficient for each column produced by the preprocessing step. This is because we are now solving a multiple linear regression problem instead of a simple linear regression problem.

In [None]:
print('Intercept: \n',mlRegressor.intercept_)
print('Coefficients: \n', mlRegressor.coef_)

In [None]:
two_vs_three(X_test['YEARBUILT'],y_test,y_pred_mlr,X_test['LOTAREA'], False)  

In [None]:
two_d_compare(X_test['YEARBUILT'],y_test,y_pred_mlr,model_name)

In [None]:
mlrMetrics = model_metrics(y_test,y_pred_mlr)

<a id="model_plr"></a>
## 7. Build Polynomial Linear Regression model
[Top](#top)

The prediction line generated by simple linear regression is a straight line, and multiple linear regression likewise captures only a first-order relationship between each column and the output (label). When a simple or multiple linear regression does not fit the data points accurately, we can use polynomial linear regression, which adds higher powers of the inputs as extra features. For a single input $x_1$, a polynomial of degree $n$ has the form:

$$y = w_0 + w_1 x_1 + w_2 x_1^2 + \dots + w_n x_1^n$$
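
With more than one input, `PolynomialFeatures` generates not only the powers of each input but also the products between inputs. The sketch below expands a single toy sample with two inputs to degree 2 so that each generated column can be identified.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One toy sample with two inputs: x1 = 2, x2 = 3
sample = np.array([[2.0, 3.0]])

expanded = PolynomialFeatures(degree=2).fit_transform(sample)

# Column order: 1, x1, x2, x1^2, x1*x2, x2^2
print(expanded)
```

A linear regression fitted on these expanded columns is still linear in its weights, which is why the technique is called polynomial *linear* regression.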

We assume that the output depends on YEARBUILT and LOTAREA. The data is again split into training and test sets. 

Use this section as practice and try to fill in the blocks yourself. Where you need help, refer to the `# Answer` cell that follows each exercise.

In [None]:
# Select the two input columns by name: column 0 is LOTAREA, column 1 is YEARBUILT
X = features[['LOTAREA', 'YEARBUILT']].values

# Exercise 
# Split the data to train and test


In [None]:
# Answer
# Select the two input columns by name: column 0 is LOTAREA, column 1 is YEARBUILT
X = features[['LOTAREA', 'YEARBUILT']].values
X_train, X_test, y_train, y_test = train_test_split(X,label, random_state=0)

print("Dimensions of datasets that will be used for training : Input features"+str(X_train.shape)+ 
      " Output label" + str(y_train.shape))
print("Dimensions of datasets that will be used for testing : Input features"+str(X_test.shape)+ 
      " Output label" + str(y_test.shape))

In [None]:
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

model_name = 'Polynomial Linear Regression'

polynomial_features= PolynomialFeatures(degree=3)
plRegressor = LinearRegression()

plr_model = Pipeline(steps=[('polyFeature',polynomial_features ),('regressor', plRegressor)])

# Exercise
# train the plr model
# make predictions for X_test using your model





In [None]:
# Answer
plr_model.fit(X_train,y_train)
y_pred_plr= plr_model.predict(X_test)

In [None]:
# Exercise
# print the intercepts and coefficients

In [None]:
# Answer
print('Intercept: \n',plRegressor.intercept_)
print('Coefficients: \n', plRegressor.coef_)

Once again, let's take a look at how our predictions compare with the actual data.

In [None]:
two_vs_three(X_test[:,1],y_test,y_pred_plr,X_test[:,0], False)  

In [None]:
two_d_compare(X_test[:,1],y_test,y_pred_plr,model_name)

In [None]:
# Exercise
# Finally, use the model_metrics() function to compute MSE and R^2

In [None]:
# Answer
plrMetrics = model_metrics(y_test,y_pred_plr)