# Exercise: Focus on the Data

As a brief reminder, we split our data into train and test sets so that we can evaluate the model on data it did not see during training.

## Preparing data

Let's start by loading the chocolate dataset and looking at its first few rows to understand what it contains.


In [1]:
import pandas as pd

# Load data from our dataset file into a pandas dataframe
dataset = pd.read_csv('Data/chocolate data.txt', index_col=False, sep="\t", header=0)

# Let's take a look at the data
dataset.head()


   weight  cocoa_percent  sugar_percent  milk_percent  customer_happiness
0     185             65             11            24                  47
1     247             44             34            22                  55
2     133             33             21            47                  35
3     145             30             38            32                  34
4     110             22             70             7                  40


We can use `dataset.head()` to show just the column headers and the first few rows of our dataset (we don't want to print everything if we have thousands of rows).

But how much data is there?



In [2]:
dataset.shape

(100, 5)

The `shape` property of a DataFrame lets us know how many rows and columns it has: here, 100 rows and 5 columns.
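Beyond `head()` and `shape`, a couple of other standard pandas calls give a quick overview of a DataFrame. The sketch below rebuilds only the five rows displayed above in a small DataFrame (called `sample` here, a name chosen purely for illustration) so that it runs without the data file. On the full dataset, the same calls would apply to `dataset`.

```python
import pandas as pd

# Only the five rows displayed by dataset.head() above, typed in by hand
sample = pd.DataFrame({
    "weight": [185, 247, 133, 145, 110],
    "cocoa_percent": [65, 44, 33, 30, 22],
    "sugar_percent": [11, 34, 21, 38, 70],
    "milk_percent": [24, 22, 47, 32, 7],
    "customer_happiness": [47, 55, 35, 34, 40],
})

# Number of rows and columns, as a (rows, columns) tuple
print(sample.shape)

# Data type of each column
print(sample.dtypes)

# Summary statistics (count, mean, std, min, quartiles, max) per column
summary = sample.describe()
print(summary)
```

Note that statistics computed on five rows say little about the full 100-row dataset; the point is only to show the calls.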

## Splitting Data into Train and Test Sets
We are going to split our dataset into training and test sets using a 70/30 split.
That means that 70% of our data will be used only to train our model, while the other 30% will be used to evaluate how well it performs.

We also randomly shuffle our data before the split (we will explain why in a later chapter).

In [3]:
# Dataset 70/30 split here:
import sklearn.model_selection as model_selection
from sklearn.linear_model import LinearRegression

# First we need to separate features and labels
# X holds everything but the column with the labels
X = dataset[["weight", "cocoa_percent", "sugar_percent", "milk_percent"]]

# y should have labels only
y = dataset["customer_happiness"]

# This code randomly shuffles our data, then splits it into train and test datasets in a 70/30 ratio
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, train_size=0.70, test_size=0.30, random_state=101)

# Let's check the shapes of the resulting sets
print(f"The shape for X_train is {X_train.shape}")
print(f"The shape for y_train is {y_train.shape}")
print(f"The shape for X_test is {X_test.shape}")
print(f"The shape for y_test is {y_test.shape}")


The shape for X_train is (70, 4)
The shape for y_train is (70,)
The shape for X_test is (30, 4)
The shape for y_test is (30,)


As you can see, we have 70 rows in the train sets and 30 rows in the test sets.
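To make the "shuffle, then split" idea concrete, here is a minimal sketch that does it by hand with NumPy. It is not how `train_test_split` is implemented internally, and the helper name `shuffle_and_split` is hypothetical. It also will not select the same rows as `random_state=101` does in scikit-learn; the seed here only makes this sketch repeatable.

```python
import numpy as np

def shuffle_and_split(n_rows, train_fraction=0.70, seed=101):
    """Return shuffled row positions for the train and test sets."""
    rng = np.random.default_rng(seed)
    # Random ordering of the row positions 0 .. n_rows - 1
    shuffled = rng.permutation(n_rows)
    # The first 70% of the shuffled positions go to training, the rest to testing
    n_train = int(round(n_rows * train_fraction))
    return shuffled[:n_train], shuffled[n_train:]

train_idx, test_idx = shuffle_and_split(100)
print(len(train_idx), len(test_idx))

# A row must never appear in both sets
overlap = set(train_idx) & set(test_idx)
print(len(overlap))
```

Calling the function again with the same seed returns the same split, which is the role `random_state` plays in the cell above.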



## Building a New Model
To build our model, we are going to use the Linear Regression algorithm.


In [4]:
# Training our model using linear regression, training features and labels
model = LinearRegression().fit(X_train, y_train)


It really is that simple!
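`LinearRegression` fits an ordinary least squares model: an intercept plus one coefficient per feature. As a rough illustration of what `fit` computes, the sketch below solves the same kind of problem with NumPy on made-up data. The data, the "true" coefficients, and all variable names are assumptions for this demo, not values from the chocolate dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: 200 rows, 2 features, generated from a known linear rule plus noise
X_demo = rng.uniform(0, 100, size=(200, 2))
true_coefs = np.array([0.5, -0.2])
true_intercept = 10.0
y_demo = X_demo @ true_coefs + true_intercept + rng.normal(0, 1, size=200)

# Add a column of ones so the intercept is estimated together with the coefficients
X_with_ones = np.column_stack([np.ones(len(X_demo)), X_demo])

# Ordinary least squares: find the weights that minimise the squared error
weights, *_ = np.linalg.lstsq(X_with_ones, y_demo, rcond=None)
intercept, coefs = weights[0], weights[1:]

print(f"Estimated intercept: {intercept:.2f}")
print(f"Estimated coefficients: {coefs}")
```

For the model trained above, the corresponding values are available as `model.intercept_` and `model.coef_`.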



## Testing the Model
There are two things we need to test on our new model:

 1. Does it generate predictions at all? (This basically checks whether there are errors in our code.)
 2. Are the predictions **reasonably** close to the values we expect?

In [5]:
from sklearn import metrics

# To evaluate the model, we can make predictions using our test data
y_pred = model.predict(X_test)

# Builds a dataframe with Predicted and Expected values for comparison
results_dataset = pd.DataFrame({'Predicted': y_pred, 'Expected': y_test})

# Print the first ten rows so we can get a glimpse of how it performed
print(results_dataset[:10])


    Predicted  Expected
16  37.105292        48
1   56.540508        55
43  54.664053        47
67  15.186745        10
89  44.588821        52
21  38.066163        43
97  22.350649        19
51  24.281470        19
6   42.255609        41
41  52.478854        49


The output above shows that:

1. The code works.
2. The predictions are within the range of the values we expected.
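Both checks can also be done in code rather than by eye. The sketch below copies the ten rows printed above into NumPy arrays (the variable names are chosen for illustration) and measures how far each prediction is from the expected value. It covers only those ten rows, so it is a spot check and not a full evaluation.

```python
import numpy as np

# The ten rows printed above, copied by hand
predicted = np.array([37.105292, 56.540508, 54.664053, 15.186745, 44.588821,
                      38.066163, 22.350649, 24.281470, 42.255609, 52.478854])
expected = np.array([48, 55, 47, 10, 52, 43, 19, 19, 41, 49])

# Check 1: every prediction is a finite number
all_finite = bool(np.isfinite(predicted).all())

# Check 2: how far each prediction is from the expected value
abs_errors = np.abs(predicted - expected)

print(f"All predictions finite: {all_finite}")
print(f"Largest error in these rows: {abs_errors.max():.2f}")
print(f"Average error in these rows: {abs_errors.mean():.2f}")
```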

But can we get better, more accurate results?

A better way to evaluate the model is to calculate its Mean Squared Error (MSE): the average of the squared differences between the predicted and expected values.

We can make adjustments to the model later and measure the MSE again. If we get a lower MSE, it means that our model is improving and making better predictions.

In [6]:
# Now calculate the model's Mean Squared Error
mse = metrics.mean_squared_error(y_test, y_pred)

print(f"The mean squared error for this model is {mse}.")

The mean squared error for this model is 41.37942823989544.
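The MSE is simple enough to compute by hand, which helps demystify the number reported above. The sketch below is an illustrative re-implementation (the function name `mean_squared_error_manual` is hypothetical), applied to a toy example. It also takes the square root of the reported MSE, which expresses the typical error in the same units as `customer_happiness`.

```python
import numpy as np

def mean_squared_error_manual(y_true, y_pred):
    """Average of the squared differences between expected and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

# Toy example: errors of 1 and -2 give squared errors of 1 and 4
toy_mse = mean_squared_error_manual([3, 5], [2, 7])
print(toy_mse)

# Square root of the MSE reported above (the root mean squared error)
rmse = np.sqrt(41.37942823989544)
print(f"{rmse:.2f}")
```

Because errors are squared before averaging, a few large misses raise the MSE much more than many small ones.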


## Summary
You've reached a few goals in this exercise:

- Quickly visualize data and shapes in your datasets.
- Shuffle and split datasets into train and test sets.
- Build a Linear Regression model.
- Evaluate the model by visually comparing results with the expected values and by calculating its Mean Squared Error (MSE).