## 3-Prepare Data

Import the MFR Data:

```python
import numpy as np
import pandas as pd
url = 'http://apmonitor.com/pds/uploads/Main/polymer_reactor.txt'
data = pd.read_csv(url)
data.columns = ['Time','C3=','H2R','Pressure','Level','C2=','Cat','Temp','MFR']
data['lnMFR'] = np.log(data['MFR'].values)
del data['Time']
data = data.dropna() # drop any row with NaN
data.head(10)
```

Run this code to import the data as a DataFrame.

Graphical techniques help detect outliers. A box plot marks points that lie beyond the whiskers, and a histogram shows them as isolated bars far from the bulk of the data.

```python
import matplotlib.pyplot as plt
%matplotlib inline
plt.figure(figsize=(12,8))
for i,c in enumerate(data.columns):
    if i<=7:
        plt.subplot(2,4,i+1)
        plt.title(c)
        plt.boxplot(data[c])
plt.show()
```
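The points a box plot draws as outliers can also be counted numerically. The sketch below applies the 1.5×IQR rule, which is the default whisker setting in Matplotlib, to a synthetic column. The values and the name `values` are made up for illustration and are not the reactor data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for one column (assumed values, not the real data)
rng = np.random.default_rng(0)
values = pd.Series(np.append(rng.normal(0.15, 0.02, 200), [0.9, 1.2]), name='H2R')

# Quartiles and interquartile range
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1

# Points more than 1.5*IQR beyond the quartiles are flagged
lower, upper = q1 - 1.5*iqr, q3 + 1.5*iqr
flagged = values[(values < lower) | (values > upper)]
print(f'{len(flagged)} of {len(values)} points fall outside [{lower:.3f}, {upper:.3f}]')
```

A flagged point is not necessarily bad data. The rule only highlights candidates that deserve a closer look before rows are removed.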

Remove outliers by filtering out selected rows, for example with:

```python
data = data[data['H2R']<0.7]
data = data[data['H2R']>0.01]
```

This keeps only the rows where `H2R` (hydrogen-to-monomer ratio) is strictly between 0.01 and 0.7.
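The two filters can also be combined into a single boolean mask. This is a minimal sketch on a small made-up DataFrame `df`. The column values are assumptions chosen to show which rows survive.

```python
import pandas as pd

# Made-up values standing in for the reactor data
df = pd.DataFrame({'H2R': [0.005, 0.05, 0.2, 0.69, 0.7, 1.5],
                   'Temp': [80, 81, 79, 80, 82, 81]})

# Combine both conditions; each comparison needs its own parentheses
mask = (df['H2R'] > 0.01) & (df['H2R'] < 0.7)

# Keep matching rows and renumber the index from zero
filtered = df[mask].reset_index(drop=True)
print(filtered)
```

Both comparisons are strict, so a row with `H2R` exactly equal to 0.7 is dropped.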

Show the boxplot again to verify that the data set does not have outliers.

```python
import matplotlib.pyplot as plt
%matplotlib inline
plt.figure(figsize=(12,8))
for i,c in enumerate(data.columns):
    if i<=7:
        plt.subplot(2,4,i+1)
        plt.title(c)
        plt.boxplot(data[c])
plt.show()
```

Remove MFR and keep only ln(MFR) as `lnMFR`. You can delete a column `x` with `del data['x']`. 
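As an alternative to `del`, `DataFrame.drop` returns a new DataFrame without the named column. The sketch below uses a small made-up DataFrame `df` rather than the reactor data.

```python
import numpy as np
import pandas as pd

# Made-up values; only the column names mirror the data set
df = pd.DataFrame({'Temp': [80.0, 81.0], 'MFR': [3.0, 12.0]})
df['lnMFR'] = np.log(df['MFR'])

# drop returns a copy without the column; del would modify df in place
df = df.drop(columns=['MFR'])
print(list(df.columns))
```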

### Scale Data

Scale the data with `StandardScaler` from scikit-learn, which shifts each column to zero mean and scales it to unit variance.

```python
from sklearn.preprocessing import StandardScaler
s = StandardScaler()
ds = s.fit_transform(data)
```

The value `ds` is returned as a `numpy` array so we need to convert it back to a `pandas` `DataFrame`.

```python
ds = pd.DataFrame(ds,columns=data.columns)
```

Re-use the column names from `data`.
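One way to check the scaling is to confirm that each scaled column has a mean near 0 and a standard deviation near 1, and that `inverse_transform` recovers the original units. The sketch below uses made-up numbers, and the names `df` and `restored` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Made-up values; column names are placeholders
df = pd.DataFrame({'Temp': [78.0, 80.0, 82.0, 84.0],
                   'lnMFR': [1.0, 1.5, 2.5, 3.0]})

s = StandardScaler()
ds = pd.DataFrame(s.fit_transform(df), columns=df.columns)

# StandardScaler uses the population standard deviation (ddof=0)
print(ds.mean().tolist())
print(ds.std(ddof=0).tolist())

# The fitted scaler converts scaled values back to the original units
restored = pd.DataFrame(s.inverse_transform(ds), columns=df.columns)
```

Keeping the fitted scaler `s` is useful later, because model predictions made in scaled units can be converted back with the same object.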

![idea](https://apmonitor.com/che263/uploads/Begin_Python/idea.png)

### Divide Data

Data is divided into train and test sets to separate a fraction of the rows for evaluating classification or regression models. A typical split is 80% for training and 20% for testing, although the range depends on how much data is available and the objective of the study.

`train_test_split` is a function in `sklearn` that splits data into train and test sets.

```python
from sklearn.model_selection import train_test_split
train,test = train_test_split(ds, test_size=0.2, shuffle=True)
```

The `shuffle=True` option (the default) randomizes the row order before splitting, so the test set is a random sample rather than the last 20% of the rows.
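Because the shuffle is random, each run produces a different split. A fixed `random_state` makes the split repeatable. The seed value 42 and the small DataFrame `df` below are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Ten made-up rows so the 80/20 split is easy to see
df = pd.DataFrame({'x': np.arange(10), 'y': np.arange(10) * 2.0})

# The same seed gives the same rows in each set on every call
train, test = train_test_split(df, test_size=0.2, shuffle=True, random_state=42)
train2, test2 = train_test_split(df, test_size=0.2, shuffle=True, random_state=42)

print(train.shape, test.shape)
print(test.index.equals(test2.index))
```

The train and test sets never share a row, so the test set measures performance on data the model has not seen.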

### Save Data

Save values that will be needed in the subsequent notebook.

```python
import pickle
info = [data,train,test,ds,s]

with open('mfr_data.pkl', 'wb') as handle:
    pickle.dump(info, handle, protocol=pickle.HIGHEST_PROTOCOL)
```
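The saved list can be read back with `pickle.load` and unpacked in the same order it was written. The sketch below is self-contained: it writes placeholder objects to a temporary folder instead of the real `data`, `train`, `test`, `ds`, and `s`, and the temporary path is an assumption for the example.

```python
import os
import pickle
import tempfile

# Placeholder objects standing in for [data, train, test, ds, s]
info = [{'name': 'data'}, [1, 2, 3], [4, 5], (0.1, 0.2), 'scaler']

# Write to a temporary folder so the example leaves no file behind in the project
path = os.path.join(tempfile.mkdtemp(), 'mfr_data.pkl')
with open(path, 'wb') as handle:
    pickle.dump(info, handle, protocol=pickle.HIGHEST_PROTOCOL)

# Unpack in the same order the list was saved
with open(path, 'rb') as handle:
    data, train, test, ds, s = pickle.load(handle)
```

Loading a pickled scikit-learn object generally requires scikit-learn to be installed, ideally in the same version that saved it.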