# Data Preprocessing

### Objectives
- Removing and imputing missing values in a dataset
- Getting categorical data into shape for machine learning algorithms
- Selecting relevant features for model construction

## Dealing with missing data 

### Identifying missing values
Missing values can be caused by errors in data collection, by measurements that aren't applicable to a sample, or by fields that were optional and left blank. To hold the data in memory, you'll typically use a Pandas DataFrame, which is built on top of optimized NumPy arrays.

Here's an easy way to count the missing values in each column.
```Python 
# isnull() must be called; it returns a boolean mask that sum() counts per column
df.isnull().sum()
```
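To make the call above concrete, here is a minimal sketch on a small made-up DataFrame. The column names and values are assumptions chosen only for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: two cells are deliberately left missing
df = pd.DataFrame(
    {
        "A": [1.0, 5.0, 10.0],
        "B": [2.0, 6.0, 11.0],
        "C": [3.0, np.nan, 12.0],
        "D": [4.0, 8.0, np.nan],
    }
)

# isnull() returns a boolean mask; sum() counts the True values per column
missing_per_column = df.isnull().sum()
print(missing_per_column)
```

Columns `C` and `D` each report one missing value, while `A` and `B` report none.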

---
### Method 1: Eliminating samples or features
Remove the affected features (columns) or samples (rows) from the dataset entirely. The risk is removing too many samples or losing valuable feature columns. That's why the other technique, imputation, is often used instead.

```Python 
# Note: dropna() returns a new DataFrame; df itself is left unchanged.

# Drop every row that has one or more missing values
df.dropna(axis=0)

# Only drop rows where all column values are missing
df.dropna(how='all')

# Drop rows that have fewer than 4 non-missing values
df.dropna(thresh=4)

# Drop rows that are missing a value in specific columns (here: column 'C')
df.dropna(subset=['C'])

# Drop every column that has one or more missing values
df.dropna(axis=1)
```
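The effect of these options is easiest to see on a small example. The following is a sketch on a made-up DataFrame (the data is an assumption for illustration), comparing which rows or columns survive each call.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row 1 misses 'C', row 2 misses 'D'
df = pd.DataFrame(
    {
        "A": [1.0, 5.0, 10.0],
        "B": [2.0, 6.0, 11.0],
        "C": [3.0, np.nan, 12.0],
        "D": [4.0, 8.0, np.nan],
    }
)

rows_kept = df.dropna(axis=0)         # keeps only fully populated rows
cols_kept = df.dropna(axis=1)         # keeps only fully populated columns
at_least_4 = df.dropna(thresh=4)      # keeps rows with at least 4 non-missing values
c_present = df.dropna(subset=["C"])   # keeps rows where 'C' is not missing

print(rows_kept.shape, cols_kept.shape, at_least_4.shape, c_present.shape)
print(df.shape)  # the original DataFrame is not modified
```

Because every call returns a new DataFrame, the result has to be assigned to a variable if you want to keep it.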

---
### Method 2: Imputing missing values
Imputation means estimating what a missing value would be, based on the data we already have, typically with an interpolation technique.

- **Mean imputation**: Replace each missing value with the mean of its entire feature column. The arithmetic mean is the usual choice, but you can also try other measures of center such as the median or the mode. The mode (most frequent value) is especially useful for imputing categorical features.
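As a quick illustration of mean and median imputation, here is a sketch using plain Pandas on made-up numbers (the DataFrame is an assumption for illustration only).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value in column 'B'
df = pd.DataFrame(
    {
        "A": [1.0, 5.0, 9.0, 13.0],
        "B": [2.0, np.nan, 8.0, 20.0],
    }
)

# Replace each missing value with the mean of its own column
df_mean = df.fillna(df.mean())

# The median works the same way and is less sensitive to outliers
df_median = df.fillna(df.median())

print(df_mean)
print(df_median)
```

Here the mean of column `B` is 10.0 while its median is 8.0, so the two strategies fill the gap with different values.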


scikit-learn provides **transformer** classes for imputation. Transformers have a `fit` method, which learns parameters from the training data, and a `transform` method, which uses those parameters to transform your data. scikit-learn also has the estimator API, which is used by classifiers. Estimators have a `fit` and a `predict` method, and can also have a `transform` method:
1. Training data and training labels are passed to `fit()`, allowing the model to learn.
2. The model then takes test data through `predict(X_test)` and outputs the class labels it predicts.
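The two APIs described above can be sketched side by side. This example assumes scikit-learn's `SimpleImputer` as the imputing transformer and `LogisticRegression` as an arbitrary stand-in estimator; the arrays are made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data with missing entries
X_train = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, 6.0], [8.0, np.nan]])
y_train = np.array([0, 0, 1, 1])
X_test = np.array([[np.nan, 2.5], [7.5, np.nan]])

# Transformer: fit() learns the column means from the training data only
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputer.fit(X_train)

# transform() reuses those learned means on both training and test data
X_train_imp = imputer.transform(X_train)
X_test_imp = imputer.transform(X_test)

# Estimator: fit() learns from features and labels, predict() outputs class labels
clf = LogisticRegression()
clf.fit(X_train_imp, y_train)
y_pred = clf.predict(X_test_imp)
print(y_pred)
```

Fitting the imputer on the training data alone means the test set is filled in with the training means, so no information from the test set leaks into preprocessing.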


## Handling categorical data

---
### Premise: Nominal and ordinal features
- **Ordinal features:** Categorical values that can be sorted and ordered. E.g. t-shirt size is ordinal since we can define an order XL > L > M.
- **Nominal features:** Categorical values that don't imply any order. E.g. the color of the t-shirt.

Categorical values need to be converted into numerical representations for machine learning algorithms to work, since the algorithms rely on math. The same applies to ordinal features, but their encoding also has to preserve the order of the values.

---
### Mapping ordinal features
There's no convenient function that can derive the correct order of the labels for the `size` feature, so we'll define the mapping ourselves. Let $XL = L + 1 = M + 2$. Something like this would work:
```Python
size_mapping = {
  'XL': 3,
  'L': 2,
  'M': 1
}
df['size'] = df['size'].map(size_mapping)
```
For example, any row with the size "XL" now has the integer 3 in the `size` column, and so on according to our mapping.
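If you later need the original string labels back, one option is to invert the dictionary. The following is a sketch; the small DataFrame and the name `inv_size_mapping` are assumptions introduced for illustration.

```python
import pandas as pd

# Hypothetical toy data
df = pd.DataFrame({"size": ["M", "L", "XL"]})

size_mapping = {"XL": 3, "L": 2, "M": 1}
df["size"] = df["size"].map(size_mapping)

# Swap keys and values to go from integers back to the string labels
inv_size_mapping = {value: key for key, value in size_mapping.items()}
original_sizes = df["size"].map(inv_size_mapping)

print(df["size"].tolist())
print(original_sizes.tolist())
```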

---
### Encoding class labels
Categorical class labels should be encoded as integer values too. We'll do something similar to mapping ordinal features, but since class labels aren't ordinal, it doesn't matter which integer we assign to which label. So we can simply enumerate the labels starting from 0.
```Python 
import numpy as np
# Enumerate the unique class labels, starting from 0
class_mapping = {label: idx for idx, label in enumerate(np.unique(df['class_label']))}
df['class_label'] = df['class_label'].map(class_mapping)
```

A more convenient way to do this is to use the `LabelEncoder` class from scikit-learn:
```Python
from sklearn.preprocessing import LabelEncoder
class_label_encoder = LabelEncoder()
y = class_label_encoder.fit_transform(df['class_label'].values)
```
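The encoder also remembers the mapping it learned, so the integers can be turned back into the original labels with `inverse_transform`. A minimal sketch, with made-up label strings as an assumption:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical class labels
labels = np.array(["class2", "class1", "class2"])

encoder = LabelEncoder()
y = encoder.fit_transform(labels)   # labels are numbered in sorted order

# inverse_transform maps the integers back to the original strings
restored = encoder.inverse_transform(y)

print(y.tolist())
print(restored.tolist())
```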

---
### One Hot Encoding for nominal features
One big issue to avoid when encoding features is that your model may treat higher values as more important. Imagine if `red=1` but `yellow=10`: the model may assume yellow carries more weight or influence, even though these are just arbitrary identifiers that we picked.

The solution is one-hot encoding: create a new binary column for each unique value of the nominal feature, so that each sample has a 1 in the column matching its value and a 0 in the others. This creates many more columns, so keep that trade-off in mind.
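One way to try one-hot encoding is Pandas' `get_dummies` function. This is a sketch on a made-up DataFrame; the `color` and `price` columns are assumptions for illustration.

```python
import pandas as pd

# Hypothetical toy data with one nominal feature
df = pd.DataFrame(
    {
        "color": ["green", "red", "blue"],
        "price": [10.1, 13.5, 15.3],
    }
)

# One new indicator column per unique color; 'price' is left untouched
dummies = pd.get_dummies(df, columns=["color"])

# drop_first=True removes one indicator column, since it can be inferred
# from the remaining ones, which slightly reduces the column count
dummies_reduced = pd.get_dummies(df, columns=["color"], drop_first=True)

print(dummies.columns.tolist())
print(dummies_reduced.columns.tolist())
```

Each row ends up with exactly one active indicator among the `color_*` columns, so no color is treated as larger than another.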

## Partitioning data into training and testing sets


### Premise 
We're going to take the Wine dataset, preprocess it, and then apply feature selection techniques to reduce the dimensionality of the dataset.

### Wine dataset
- 13 different features
- 178 samples
- Each sample has one of three class labels {1,2,3}, which refer to the 3 different types of grapes grown in the same region in Italy. 



```Python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the Wine dataset; the file has no header row
df = pd.read_csv('https://archive.ics.uci.edu/ml/'
                 'machine-learning-databases/wine/wine.data',
                 header=None)

df.columns = ['Class label', 'Alcohol',
              'Malic acid', 'Ash',
              'Alcalinity of ash', 'Magnesium',
              'Total phenols', 'Flavanoids',
              'Nonflavanoid phenols',
              'Proanthocyanins',
              'Color intensity', 'Hue',
              'OD280/OD315 of diluted wines',
              'Proline']

# Column 0 holds the class label; the remaining 13 columns are the features
X = df.iloc[:, 1:].values
y = df.iloc[:, 0].values

# Hold out 30% of the samples for testing, keeping class proportions (stratify=y)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)
```
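To see what `stratify=y` does, the sketch below compares class proportions before and after the split. It assumes the copy of the Wine data bundled with scikit-learn (`load_wine`) rather than the CSV download, so it runs offline; note that this copy labels the classes 0, 1, 2 instead of 1, 2, 3.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

# Assumption: use scikit-learn's bundled Wine data instead of the UCI URL
X, y = load_wine(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

# Fraction of samples per class in the full, training, and test sets
full_ratio = np.bincount(y) / len(y)
train_ratio = np.bincount(y_train) / len(y_train)
test_ratio = np.bincount(y_test) / len(y_test)

print(X_train.shape, X_test.shape)
print(full_ratio, train_ratio, test_ratio)
```

With stratification, the three sets of proportions should be nearly identical, which keeps the test set representative of the full dataset.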

