# Preprocessing in Python

Let's look at doing some preprocessing using pandas and scikit-learn in Python.

First of all, we need to import the packages we want to use: numpy, pandas and scikit-learn.

# Required Libraries

```conda install scikit-learn```

or

```pip install scikit-learn```

In [None]:
# For data analysis
import numpy as np
import pandas as pd
import sklearn as sk

# For graphing/visualisations
import matplotlib.pyplot as plt
#import seaborn as sns

## Opening the data file
Now we need to open the abalone file. We do this using pandas. You can look these up in the pandas documentation yourself, but the functions of interest are:
- `pandas.read_csv` and `DataFrame.to_csv` to read and write CSV files
- `pandas.read_excel` and `DataFrame.to_excel` to read and write MS Excel files

Note: you will need to edit the code so that it points to where you have downloaded the `abalone-small.xls` file. Reading the older `.xls` format also requires the `xlrd` package to be installed.

In [None]:
abalone_data = pd.read_excel("./abalone-small.xls")
abalone_data
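
The CSV functions follow the same read/write pattern. The sketch below is illustrative only: it uses a small made-up table (`toy`) and an in-memory buffer rather than the abalone file, so it runs without any files on disk.

```python
import io
import pandas as pd

# Hypothetical toy table standing in for the real data
toy = pd.DataFrame({
    "Sex": ["M", "F", "I"],
    "Length": [0.455, 0.53, 0.33],
    "Rings": [15, 9, 7],
})

# Write the table as CSV text; pass a file path instead of a buffer to write to disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)  # index=False leaves out the row labels

# Rewind the buffer and read the CSV back into a new DataFrame
buffer.seek(0)
round_trip = pd.read_csv(buffer)
print(round_trip)
```

Passing `index=False` avoids an extra unnamed column appearing when the file is read back.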

# Abalone Data Description

Attribute Information:

Given is the attribute name, attribute type, the measurement unit and a brief description. The number of rings is the value to predict: either as a continuous value or as a classification problem.

| Name            | Data Type    | Measurement Unit    | Description        |
|:----------------|:------------|:-------|:-----------------------------|
| Sex            | nominal    | --    | M, F, and I (infant)        |
| Length         | continuous | mm    | Longest shell measurement   |
| Diameter       | continuous | mm    | perpendicular to length     |
| Height         | continuous | mm    | with meat in shell          |
| Whole weight   | continuous | grams | whole abalone               |
| Shucked weight | continuous | grams | weight of meat              |
| Viscera weight | continuous | grams | gut weight (after bleeding) |
| Shell weight   | continuous | grams | after being dried           |
| Rings          | integer    | --    | +1.5 gives the age in years |
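
As a small worked example of the last row of the table, the age in years can be derived by adding 1.5 to the ring count. The ring counts below are made up for illustration and are not taken from the abalone file.

```python
import pandas as pd

# Made-up ring counts, for illustration only
rings = pd.Series([15, 7, 9, 10], name="Rings")

# Age in years is the ring count plus 1.5
age_years = rings + 1.5
print(age_years.tolist())  # [16.5, 8.5, 10.5, 11.5]
```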

## Taking a quick look at the data
You can display the variable itself, or use `.head()`, `.tail()` or `.sample()` to see the first rows, the last rows, or a random selection of rows. We can also check the `head` **and** `tail` together by using `pd.concat()` to join the two tables along the row axis.

### Concatenate
> In formal language theory and computer programming, string concatenation is the operation of joining character strings end-to-end. For example, the concatenation of "snow" and "ball" is "snowball".

`pd.concat()` applies the same idea to tables: it joins DataFrames end-to-end.

In [None]:
pd.concat([abalone_data.head(), abalone_data.tail()])
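
To see what happens to the row labels when two pieces are joined, here is a minimal sketch on a made-up ten-row table (`toy` is hypothetical, not the abalone data).

```python
import pandas as pd

# Hypothetical ten-row table with row labels 0..9
toy = pd.DataFrame({"Rings": range(10, 20)})

# Join the first two and last two rows; the original row labels are kept
ends = pd.concat([toy.head(2), toy.tail(2)])
print(ends.index.tolist())  # [0, 1, 8, 9]

# ignore_index=True renumbers the rows of the result from zero
renumbered = pd.concat([toy.head(2), toy.tail(2)], ignore_index=True)
print(renumbered.index.tolist())  # [0, 1, 2, 3]
```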

In [None]:
# Check the shape of the abalone data
abalone_data.shape

You can see the column names with `.columns` and the row indices with `.index`.

In [None]:
# To access the columns of the data
abalone_data.columns

## Preprocessing the data

First, let's look at scaling (normalising) the data. We do this with the following:

- `sklearn.preprocessing.StandardScaler` for Z-score normalisation
- `sklearn.preprocessing.MinMaxScaler` for min-max normalisation

For each, you call `.fit` to work out the scaler's parameters (e.g., the mean and variance), then `.transform` to apply them. That means you can scale different DataFrames in the same way. If you want to do both steps together, you can use `.fit_transform`.
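
The separate `.fit` and `.transform` steps can be sketched as follows. The two small DataFrames (`train` and `new`) are made up for illustration; the point is that the second one is scaled with the parameters learned from the first.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Two hypothetical tables with the same columns
train = pd.DataFrame({"Length": [1.0, 2.0, 3.0], "Height": [10.0, 20.0, 30.0]})
new = pd.DataFrame({"Length": [2.0, 4.0], "Height": [20.0, 40.0]})

# Learn the per-column mean and standard deviation from the first table only
scaler = StandardScaler()
scaler.fit(train)
print(scaler.mean_)  # per-column means learned from `train`

# Apply the same learned parameters to both tables
train_scaled = scaler.transform(train)
new_scaled = scaler.transform(new)
print(new_scaled)
```

Because `new` is scaled with the mean and standard deviation of `train`, its columns will generally not have a mean of exactly zero.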

In [None]:
# Drop the column "Sex"
abalone_subdata = abalone_data.drop(['Sex'],axis=1)
abalone_subdata.head()

Standardization rescales each feature so that it has a mean of zero and a standard deviation of one. It does not change the shape of the distribution, so the result only follows a standard normal distribution if the feature was normally distributed to begin with.

In [None]:
# Performing a Standard scaler transform of the Abalone dataset

from sklearn.preprocessing import StandardScaler
from matplotlib import pyplot
std_scaler = StandardScaler()
data = std_scaler.fit_transform(abalone_subdata)

# convert the array back to a dataframe
dataset = pd.DataFrame(data,columns=['Index','Length', 'Diameter', 'Height', 'Gross mass',
       'Meat mass', 'Gut mass', 'Shell mass', 'Age'])

# summarize
dataset.describe()

# Check mean & Standard dev
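
One way to check the mean and standard deviation is sketched below on a made-up table (`toy` is hypothetical). It also compares the scaler's output with a manual z-score. Note that `StandardScaler` divides by the population standard deviation (`ddof=0`), whereas pandas `.std()` and `.describe()` report the sample standard deviation (`ddof=1`), so the value shown by `describe()` is slightly above one.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy table
toy = pd.DataFrame({"Length": [1.0, 2.0, 3.0, 4.0], "Height": [2.0, 4.0, 4.0, 6.0]})

# Z-score via scikit-learn
scaled = pd.DataFrame(StandardScaler().fit_transform(toy), columns=toy.columns)

# Z-score by hand: subtract the mean, divide by the population standard deviation
manual = (toy - toy.mean()) / toy.std(ddof=0)

print(scaled.mean())       # approximately 0 for every column
print(scaled.std(ddof=0))  # 1 for every column
print(scaled.std())        # sample standard deviation, slightly above 1
```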

In [None]:
# histograms of the variables

dataset.hist(figsize=(12,12))
#pyplot.show()

In [None]:
# Performing a minmax scaler transform of the Abalone dataset
from sklearn.preprocessing import MinMaxScaler
from matplotlib import pyplot


min_max = MinMaxScaler()
data1 = min_max.fit_transform(abalone_subdata)

# convert the array back to a dataframe
dataset1 = pd.DataFrame(data1,columns=['Index','Length', 'Diameter', 'Height', 'Gross mass',
       'Meat mass', 'Gut mass', 'Shell mass', 'Age'])

# summarize
print(dataset1.describe())

In [None]:
# histograms of the variables
dataset1.hist(figsize=(12,12))
pyplot.show()
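
Min-max normalisation maps each column onto [0, 1] using (x - min) / (max - min). The sketch below checks this by hand against `MinMaxScaler` on a made-up column; the values are chosen only to keep the arithmetic simple.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy column
toy = pd.DataFrame({"Height": [2.0, 4.0, 6.0, 10.0]})

# Min-max by hand: (x - min) / (max - min)
manual = (toy - toy.min()) / (toy.max() - toy.min())
print(manual["Height"].tolist())  # [0.0, 0.25, 0.5, 1.0]

# The same result from scikit-learn
scaled = MinMaxScaler().fit_transform(toy)
print(np.allclose(scaled, manual.values))  # True
```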

In [None]:
# Performing a minmax scaler transform of the Abalone dataset with range (-3,3)
from sklearn.preprocessing import MinMaxScaler
from matplotlib import pyplot

min_max = MinMaxScaler(feature_range=(-3,3))
data3 = min_max.fit_transform(abalone_subdata["Height"].values.reshape(-1, 1))

# convert the array back to a dataframe
dataset3 = pd.DataFrame(data3,columns=['Height'])

# summarize
print(dataset3.describe())

# histograms of the variables
dataset3.hist()
pyplot.show()
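
With a custom `feature_range=(lo, hi)`, the [0, 1] result is stretched and shifted: scaled = x01 * (hi - lo) + lo. The sketch below illustrates this on a made-up column and also shows `inverse_transform`, which maps scaled values back to the original units.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical heights as a single column (2D array, one feature)
heights = np.array([[2.0], [4.0], [6.0], [10.0]])

lo, hi = -3, 3
scaler = MinMaxScaler(feature_range=(lo, hi))
scaled = scaler.fit_transform(heights)

# By hand: first map to [0, 1], then stretch and shift to [lo, hi]
x01 = (heights - heights.min()) / (heights.max() - heights.min())
manual = x01 * (hi - lo) + lo
print(manual.ravel().tolist())  # [-3.0, -1.5, 0.0, 3.0]

# inverse_transform recovers the original values
restored = scaler.inverse_transform(scaled)
```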

## Other preprocessing classes of interest

You might also be interested in `sklearn.preprocessing.OneHotEncoder` and `sklearn.preprocessing.LabelBinarizer` (for the target column). These do a one-hot encoding of the data and labels respectively. Note: these are quite different in older versions of scikit-learn.
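
A minimal sketch of `LabelBinarizer` on a made-up list of labels (the values mirror the `Sex` codes, but the list itself is hypothetical):

```python
from sklearn.preprocessing import LabelBinarizer

# Hypothetical target labels
labels = ["M", "F", "I", "M"]

lb = LabelBinarizer()
encoded = lb.fit_transform(labels)

# Classes are stored in sorted order; each row has a 1 in its class's column
print(lb.classes_.tolist())  # ['F', 'I', 'M']
print(encoded.tolist())      # [[0, 0, 1], [1, 0, 0], [0, 1, 0], [0, 0, 1]]

# inverse_transform maps the encoded rows back to the original labels
decoded = lb.inverse_transform(encoded)
```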

There is also `sklearn.preprocessing.KBinsDiscretizer`, which does binning. Again, this only exists in newer versions of scikit-learn, so you may need to write it yourself if it isn't there.

In [None]:
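from sklearn.preprocessing import OneHotEncoder

# sparse_output=False returns a dense array (scikit-learn >= 1.2;
# older versions use sparse=False instead)
onehot = OneHotEncoder(dtype=int, sparse_output=False)
abalone_sex = onehot.fit_transform(abalone_data[['Sex']])

# The encoder orders categories alphabetically (F, I, M),
# so the column names must follow that order
abalone_sex = pd.DataFrame(abalone_sex, columns=['Female', 'Infant', 'Male'])
abalone_sex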
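
When naming the encoded columns, it helps to know that `OneHotEncoder` stores the categories in sorted order, not in order of appearance. The sketch below shows this on a made-up column and takes the column names from the fitted encoder's `categories_` attribute instead of typing them by hand. Calling `.toarray()` on the default sparse output is used here so the example does not depend on the `sparse`/`sparse_output` parameter name.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy column using the same codes as the Sex attribute
toy = pd.DataFrame({"Sex": ["M", "F", "I", "M"]})

encoder = OneHotEncoder()
encoded = encoder.fit_transform(toy[["Sex"]]).toarray()  # default output is sparse

# categories_ holds one array per input column, in sorted order
categories = encoder.categories_[0].tolist()
print(categories)  # ['F', 'I', 'M']

# Use the learned categories as column names so they cannot be mismatched
toy_sex = pd.DataFrame(encoded, columns=categories)
print(toy_sex)
```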


Just like categorical data can be encoded, numerical features can be ‘decoded’ into categorical features.
The two most common ways to do this are discretization and binarization.

Discretization, also known as quantization or binning, divides a continuous feature into a pre-specified number of categories (bins). Binarization is the special case of thresholding a feature into two categories.

In [None]:
from sklearn.preprocessing import KBinsDiscretizer
disc = KBinsDiscretizer(n_bins=4, encode='ordinal')
abalone_subdata1 = disc.fit_transform(abalone_subdata)
abalone_subdata2= pd.DataFrame(abalone_subdata1,columns=['Index','Length', 'Diameter', 'Height', 'Gross mass',
       'Meat mass', 'Gut mass', 'Shell mass', 'Age'])
abalone_subdata2
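
Binarization, the second approach named above, can be sketched with `sklearn.preprocessing.Binarizer`. The ring counts and the threshold of 9 are made up for illustration; values strictly greater than the threshold become 1 and all others become 0.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Hypothetical ring counts as a single column
rings = np.array([[4.0], [9.0], [10.0], [15.0]])

# Arbitrary threshold chosen for illustration
binarizer = Binarizer(threshold=9)
flags = binarizer.fit_transform(rings)

print(flags.ravel().tolist())  # [0.0, 0.0, 1.0, 1.0]
```

Note that a value equal to the threshold (9 here) maps to 0.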

# Exercise

1) Load the iris data used in previous weeks to a pandas dataframe

2) Create a "data dictionary" describing the dataset's attributes (as above).

3) Perform standard (Z-score) normalisation on a copy of the dataset

4) Perform min-max normalisation on a copy of the dataset

5) On either the min-max normalised or the standardised dataset, perform one-hot encoding on the "species" attribute.