# Regression

In the previous chapter, you used image and political datasets to predict binary and multiclass outcomes. But what if your problem requires a continuous outcome? Regression is best suited to solving such problems. You will learn about fundamental concepts in regression and apply them to predict the life expectancy in a given country using Gapminder data.

# (1) Introduction to regression

## Boston housing data

In [None]:
import pandas as pd

boston = pd.read_csv('boston.csv')
print(boston.head())

## Creating feature and target arrays

In [None]:
X = boston.drop('MEDV', axis=1).values
y = boston['MEDV'].values

## Predicting house value from a single feature

In [None]:
X_rooms = X[:, 5]
type(X_rooms), type(y)

In [None]:
# Reshape both arrays into single-column 2D arrays
y = y.reshape(-1, 1)
X_rooms = X_rooms.reshape(-1, 1)

## Plotting house value vs. number of rooms

In [None]:
import matplotlib.pyplot as plt

plt.scatter(X_rooms, y)
plt.ylabel('Value of house /1000 ($)')
plt.xlabel('Number of rooms')
plt.show()

## Fitting a regression model

In [None]:
import numpy as np
from sklearn.linear_model import LinearRegression

reg = LinearRegression()
reg.fit(X_rooms, y)
# Evenly spaced room counts across the observed range, as a column
prediction_space = np.linspace(min(X_rooms),
                               max(X_rooms)).reshape(-1, 1)

<img src="image/Screenshot 2021-02-01 172327.png">
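The fit-and-predict steps above can be sketched end to end without the CSV file. The block below is a minimal illustration that assumes a synthetic stand-in for the rooms feature and house values (the names `rooms` and `value` and the generating slope and intercept are invented here, not taken from the Boston data).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data (assumption): NOT the real Boston values
rng = np.random.default_rng(0)
rooms = rng.uniform(4, 9, size=100).reshape(-1, 1)
value = 9.0 * rooms - 34.0 + rng.normal(0, 2, size=(100, 1))

# Fit a one-feature linear regression
reg = LinearRegression()
reg.fit(rooms, value)

# np.linspace returns 50 points by default; reshape them into a column
prediction_space = np.linspace(rooms.min(), rooms.max()).reshape(-1, 1)
line = reg.predict(prediction_space)

print(prediction_space.shape, line.shape)
print(reg.coef_, reg.intercept_)
```

Passing `prediction_space` and `line` to `plt.plot()` on top of the scatter plot would draw the fitted line.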

# Exercise I: Which of the following is a regression problem?

Andy introduced regression to you using the Boston housing dataset. But regression models can be used in a variety of contexts to solve a variety of different problems.

Given below are four example applications of machine learning. Your job is to pick the one that is best framed as a **regression** problem.

- An e-commerce company using labeled customer data to predict whether or not a customer will purchase a particular item.

- A healthcare company using data about cancer tumors (such as their geometric measurements) to predict whether a new tumor is benign or malignant.

- A restaurant using review data to ascribe positive or negative sentiment to a given review.

- A bike share company using time and weather data to predict the number of bikes being rented at any given hour. (T)

# Exercise II: Importing data for supervised learning

In this chapter, you will work with [Gapminder](https://www.gapminder.org/data/) data that we have consolidated into one CSV file available in the workspace as `'gapminder.csv'`. Specifically, your goal will be to use this data to predict the life expectancy in a given country based on features such as the country's GDP, fertility rate, and population. As in Chapter 1, the dataset has been preprocessed.

Since the target variable here is quantitative, this is a regression problem. To begin, you will fit a linear regression with just one feature: `'fertility'`, which is the average number of children a woman in a given country gives birth to. In later exercises, you will use all the features to build regression models.

Before that, however, you need to import the data and get it into the form needed by scikit-learn. This involves creating feature and target variable arrays. Furthermore, since you are going to use only one feature to begin with, you need to do some reshaping using NumPy's `.reshape()` method. Don't worry too much about this reshaping right now, but it is something you will have to do occasionally when working with scikit-learn so it is useful to practice.
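As a small illustration of the reshaping step, the sketch below uses a made-up four-element array (the values are hypothetical). scikit-learn expects the feature array to be two-dimensional, with one row per sample and one column per feature, and passing `-1` lets NumPy infer the number of rows.

```python
import numpy as np

# Hypothetical one-dimensional feature values
x = np.array([2.7, 1.8, 5.1, 3.3])
print(x.shape)      # (4,)

# -1 means "infer this dimension"; 1 asks for a single column
col = x.reshape(-1, 1)
print(col.shape)    # (4, 1)
```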

### Instructions

- Import `numpy` and `pandas` as their standard aliases.
- Read the file `'gapminder.csv'` into a DataFrame `df` using the `read_csv()` function.
- Create array `X` for the `'fertility'` feature and array `y` for the `'life'` target variable.
- Reshape the arrays by using the `.reshape()` method and passing in `-1` and `1`.


In [None]:
# Import numpy and pandas
import numpy as np
import pandas as pd

# Read the CSV file into a DataFrame: df
df = pd.read_csv('gapminder.csv')

# Create arrays for features and target variable
y = df['life'].values
X = df['fertility'].values

# Print the dimensions of X and y before reshaping
print("Dimensions of y before reshaping: {}".format(y.shape))
print("Dimensions of X before reshaping: {}".format(X.shape))

# Reshape X and y
y = y.reshape(-1, 1)
X = X.reshape(-1, 1)

# Print the dimensions of X and y after reshaping
print("Dimensions of y after reshaping: {}".format(y.shape))
print("Dimensions of X after reshaping: {}".format(X.shape))


# Exercise III: Exploring the Gapminder data

As always, it is important to explore your data before building models. The heatmap below shows the correlation between the different features of the Gapminder dataset, which has been pre-loaded into a DataFrame as `df` and is available for exploration in the IPython Shell. Green cells show positive correlation, while red cells show negative correlation. Take a moment to explore this: which features are positively correlated with `life`, and which ones are negatively correlated? Does this match your intuition?

Then, in the IPython Shell, explore the DataFrame using pandas methods such as `.info()`, `.describe()`, `.head()`.

In case you are curious, the heatmap was generated using Seaborn's heatmap function and the following line of code, where `df.corr()` computes the pairwise correlation between columns:

```
sns.heatmap(df.corr(), square=True, cmap='RdYlGn')
```
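To see what `df.corr()` returns, here is a minimal sketch on a toy DataFrame (the column names and values are invented for illustration). Each cell of the result is the pairwise correlation between two columns, which is what the heatmap colors.

```python
import pandas as pd

# Toy data (assumption): b rises with a, c falls as a rises
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],
    "c": [8.0, 6.0, 4.0, 2.0],
})

# Square table of pairwise correlations, with 1.0 on the diagonal
corr = toy.corr()
print(corr)
```

Here `a` and `b` are perfectly positively correlated and `a` and `c` are perfectly negatively correlated, so they would sit at the green and red ends of the `'RdYlGn'` colormap respectively.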

Once you have a feel for the data, consider the statements below and select the one that is **not** true. After this, Hugo will explain the mechanics of linear regression in the next video, and you will be on your way to building regression models!

<img src="image/2021-02-01-173742.svg">

### Instructions

- The DataFrame has 139 samples (or rows) and 9 columns.
- `life` and `fertility` are negatively correlated.
- The mean of `life` is 69.602878.
- `fertility` is of type `int64`. (T)
- `GDP` and `life` are positively correlated.
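Statements like the ones above can each be checked with a single pandas call. The sketch below uses a tiny DataFrame with invented values (not the real Gapminder figures) to show which attribute or method answers which kind of question.

```python
import pandas as pd

# Invented values for illustration only, NOT real Gapminder data
toy = pd.DataFrame({
    "fertility": [2.7, 6.4, 2.2, 1.4],
    "life": [75.3, 58.3, 75.5, 72.5],
})

print(toy.shape)                            # (rows, columns)
print(toy.dtypes)                           # data type of each column
print(toy["life"].mean())                   # mean of one column
print(toy["life"].corr(toy["fertility"]))   # sign gives the direction
```

Running the same calls on `df` in the IPython Shell gives the actual figures to compare against each statement.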