# Data Preprocessing Tools

## Import libs

In [386]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Import dataset

In [387]:
dataset = pd.read_csv("./filez/Data.csv")
dataset

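| | Country | Age | Salary | Purchased |
|---|---|---|---|---|
| 0 | France | 44.0 | 72000.0 | No |
| 1 | Spain | 27.0 | 48000.0 | Yes |
| 2 | Germany | 30.0 | 54000.0 | No |
| 3 | Spain | 38.0 | 61000.0 | No |
| 4 | Germany | 40.0 | NaN | Yes |
| 5 | France | 35.0 | 58000.0 | Yes |
| 6 | Spain | NaN | 52000.0 | No |
| 7 | France | 48.0 | 79000.0 | Yes |
| 8 | Germany | 50.0 | 83000.0 | No |
| 9 | France | 37.0 | 67000.0 | Yes |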


In [388]:
# features: all rows, every column except the last one
X = dataset.iloc[:, :-1].values
# target: all rows, last column only
y = dataset.iloc[:, -1].values

In [389]:
print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


In [390]:
print(y)

['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']


## Taking care of missing data

The choice between excluding rows with missing values (`NaN`) and imputing them with the mean (or another statistic) depends on the context of your dataset and the importance of the missing values. Here are some considerations to help you decide:

#### Excluding Rows with NaN:

- **Data Loss**: If you simply exclude rows with `NaN`, you might lose valuable information, especially if the missing values are not systematically related to the data you're analyzing. If the proportion of such rows is small, this approach may be acceptable.
- **Sample Size**: If your dataset is large and the missing data is a small fraction, removing rows with missing values might not significantly impact the dataset's integrity.
- **Bias**: If the missingness is not random (i.e., there's a pattern to which data points are missing), removing those rows could introduce bias into your model.

#### Assigning the Mean:

- **Preserving Data**: Imputing missing values allows you to keep rows with missing data, which can be particularly important if your dataset is small or if removing the data would result in bias.
- **Distortion**: Imputing with the mean can distort the distribution of your data, especially if the missingness is substantial. It can reduce variance and affect the covariance between features.
- **Impact on Models**: Models may yield less accurate predictions if the imputation doesn't represent the true underlying distribution of the data. However, it may still be a better option than losing data entirely.
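
To make the trade-off concrete, here is a minimal sketch that applies both options to the Age and Salary values from the table above. The DataFrame is rebuilt inline so the snippet runs on its own, and the variable names (`dropped`, `imputed`) are illustrative assumptions rather than part of the notebook.

```python
import numpy as np
import pandas as pd

# Age/Salary values copied from Data.csv as displayed earlier.
df = pd.DataFrame({
    "Age": [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, np.nan, 58000, 52000, 79000, 83000, 67000],
})

dropped = df.dropna()            # option 1: discard every row that has a NaN
imputed = df.fillna(df.mean())   # option 2: fill each NaN with its column mean

print(len(dropped), len(imputed))  # 8 10

# Mean imputation adds a value exactly at the mean, so the variance shrinks
# relative to the variance of the observed (non-missing) salaries.
observed_var = df["Salary"].var()      # pandas skips NaN by default
imputed_var = imputed["Salary"].var()
print(imputed_var < observed_var)      # True
```

Dropping costs two of the ten rows here, while imputing keeps all ten at the price of a slightly narrower Salary distribution.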

#### How Imputation Affects Models:

1. **Machine Learning Assumptions**: Many machine learning models assume that there are no missing values in the dataset. Imputation allows you to meet this requirement and use a broader range of models.
2. **Robustness**: The mean is sensitive to outliers, so a few extreme values can pull the imputed value away from what is typical. When a feature contains outliers or is heavily skewed, imputing with the median is usually more robust; the mode is the usual choice for categorical features.
3. **Feature Relationships**: Mean imputation does not account for the relationships between features. More sophisticated imputation methods, like multivariate imputation, can sometimes provide better estimates by considering these relationships.
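
As a small illustration of how an outlier affects the imputed value, the sketch below compares the `"mean"` and `"median"` strategies of `SimpleImputer`. The salary column is made up for the example (one extreme value, one missing entry) and is not taken from `Data.csv`.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical salaries: four typical values, one outlier, one missing entry.
salaries = np.array([[48000.0], [52000.0], [54000.0], [58000.0], [1000000.0], [np.nan]])

mean_filled = SimpleImputer(strategy="mean").fit_transform(salaries)
median_filled = SimpleImputer(strategy="median").fit_transform(salaries)

print(mean_filled[-1, 0])    # 242400.0, pulled upward by the outlier
print(median_filled[-1, 0])  # 54000.0, stays close to the typical salaries
```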

#### Better Practices:

- **Exploratory Analysis**: Perform an exploratory analysis to understand the pattern of missingness. If the values are missing completely at random (MCAR), simple imputation methods are more justified.
- **Advanced Techniques**: Consider using advanced imputation techniques like k-nearest neighbors (KNN) imputation, multivariate imputation by chained equations (MICE), or models that inherently handle missing values like XGBoost.
- **Model Comparison**: After imputation, compare models trained on imputed data with those trained on non-imputed data (if you have enough data after dropping `NaN`s) to understand the impact.
- **Validation**: Always validate your model with out-of-sample data to check if the imputation strategy has led to overfitting.
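
The first two practices can be sketched as follows: count the missing entries per column, then try scikit-learn's `KNNImputer` as one of the more advanced options. This is only a sketch under assumptions: the DataFrame is rebuilt inline, `n_neighbors=2` is an arbitrary choice, and the features are left unscaled, which matters because KNN imputation relies on distances between rows.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Age/Salary values copied from Data.csv as displayed earlier.
df = pd.DataFrame({
    "Age": [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, np.nan, 58000, 52000, 79000, 83000, 67000],
})

# Exploratory step: how many values are missing in each column?
missing_per_column = df.isna().sum()
print(missing_per_column)

# Each missing value is replaced by the average of that feature over the
# 2 nearest rows, where distance is measured on the features both rows have.
knn = KNNImputer(n_neighbors=2)
filled = knn.fit_transform(df)
print(filled)
```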

In [391]:
from sklearn.impute import SimpleImputer

# replace all NaN values by the mean of the feature
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")

In [392]:
# fit the imputer on the Age and Salary columns (indices 1 and 2)
# best practice: include every numeric column
imputer.fit(X[:, 1:3])

In [393]:
# Update new values in X
X[:, 1:3] = imputer.transform(X[:, 1:3])
X

array([['France', 44.0, 72000.0],
       ['Spain', 27.0, 48000.0],
       ['Germany', 30.0, 54000.0],
       ['Spain', 38.0, 61000.0],
       ['Germany', 40.0, 63777.77777777778],
       ['France', 35.0, 58000.0],
       ['Spain', 38.77777777777778, 52000.0],
       ['France', 48.0, 79000.0],
       ['Germany', 50.0, 83000.0],
       ['France', 37.0, 67000.0]], dtype=object)
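
To check where the filled-in values come from, the fitted imputer exposes the per-column statistics it learned in its `statistics_` attribute. The sketch below rebuilds the two numeric columns inline (the name `numeric` is an assumption for illustration) and confirms that the learned means match the values inserted above.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Age and Salary columns from Data.csv, with the two gaps as NaN.
numeric = np.array([
    [44, 72000], [27, 48000], [30, 54000], [38, 61000], [40, np.nan],
    [35, 58000], [np.nan, 52000], [48, 79000], [50, 83000], [37, 67000],
], dtype=float)

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputer.fit(numeric)

# One learned statistic per column: the mean of the observed values.
print(imputer.statistics_)  # approximately [38.78, 63777.78]
```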

## Encoding categorical data

One-hot encoding converts a categorical variable into a numeric form that machine learning algorithms can consume. For example, a feature with three categories (Red, Green, and Blue) is categorical because its values are distinct labels, not numbers.

In many machine learning scenarios, you need numerical input, but you can't just assign Red = 1, Green = 2, and Blue = 3, because this would imply an order or magnitude difference between the colors that doesn't exist (i.e., Blue is not "greater than" Green, and Red is not "less than" Green).

One-hot encoding converts each categorical value into a binary vector that contains a single 1 and 0s everywhere else. The length of the vector equals the number of categories in the feature. For our example:

- Red might become [1, 0, 0]
- Green might become [0, 1, 0]
- Blue might become [0, 0, 1]

This way, each color is equally distant from all others (in terms of vector distance), and you can provide these vectors as inputs to your machine learning algorithm without implying any non-existent order or scale.
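
The color example can be reproduced with scikit-learn's `OneHotEncoder`, as a minimal sketch. Note that the encoder orders its columns by sorted category (Blue, Green, Red), so the concrete vectors differ from the illustrative assignment above; the property that matters, equal distance between any two colors, still holds.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

colors = np.array([["Red"], ["Green"], ["Blue"]])

encoder = OneHotEncoder()
# fit_transform returns a sparse matrix by default; convert it for printing.
one_hot = encoder.fit_transform(colors).toarray()

print(encoder.categories_[0])  # column order: ['Blue' 'Green' 'Red']
print(one_hot)

# Euclidean distance between every pair of distinct colors.
pair_distances = [
    np.linalg.norm(one_hot[i] - one_hot[j])
    for i in range(3)
    for j in range(i + 1, 3)
]
print(pair_distances)  # every distance equals sqrt(2)
```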



### Encoding the Independent Variable

In [394]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

**ColumnTransformer**:

- `ColumnTransformer` takes a list of `transformers`. Each transformer is a tuple with a name, the transformer itself, and the column indices to apply it to.
- `"encoder"` is just a name for this specific step.
- `OneHotEncoder()` is the transformer that will be applied to the specified column(s).
- `[0]` indicates that the transformer should be applied to the first column of the data (Python uses zero-based indexing). If your categorical data that needs to be one-hot encoded is in the first column of your dataset, this will apply the `OneHotEncoder` to that column.
- `remainder="passthrough"` specifies that all columns not specified in the `transformers` list should be allowed to "pass through" the transformer unchanged.
- **Outcome** -> a NumPy array with the first column one-hot encoded, and the rest of the data unchanged from the original `X`.

In [395]:
ct = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [0])], remainder="passthrough"
)
X = np.array(ct.fit_transform(X))
X

array([[1.0, 0.0, 0.0, 44.0, 72000.0],
       [0.0, 0.0, 1.0, 27.0, 48000.0],
       [0.0, 1.0, 0.0, 30.0, 54000.0],
       [0.0, 0.0, 1.0, 38.0, 61000.0],
       [0.0, 1.0, 0.0, 40.0, 63777.77777777778],
       [1.0, 0.0, 0.0, 35.0, 58000.0],
       [0.0, 0.0, 1.0, 38.77777777777778, 52000.0],
       [1.0, 0.0, 0.0, 48.0, 79000.0],
       [0.0, 1.0, 0.0, 50.0, 83000.0],
       [1.0, 0.0, 0.0, 37.0, 67000.0]], dtype=object)
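
To see which of the three new columns corresponds to which country, the fitted encoder can be looked up by its step name through `named_transformers_`, and its `categories_` attribute lists the categories in column order. The sketch below uses a three-row stand-in for `X` (the names `X_demo` and `ct_demo` are assumptions) so it runs independently of the cells above.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Three rows in the same layout as X: Country, Age, Salary.
X_demo = np.array(
    [["France", 44.0, 72000.0], ["Spain", 27.0, 48000.0], ["Germany", 30.0, 54000.0]],
    dtype=object,
)

ct_demo = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [0])], remainder="passthrough"
)
encoded = np.array(ct_demo.fit_transform(X_demo))

# Categories are listed in the same order as the one-hot columns.
countries = ct_demo.named_transformers_["encoder"].categories_[0]
print(countries)  # ['France' 'Germany' 'Spain']
print(encoded)
```

This matches the output above: the first one-hot column flags France, the second Germany, and the third Spain.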

### Encoding the Dependent Variable

In [396]:
# encode target label y with values between 0 and n_classes-1.
# @dev: the mapping from categorical labels to integers is determined based on 
#       the alphabetical order of the labels by default
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y = le.fit_transform(y)
y


array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])
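
The mapping chosen by `LabelEncoder` can be inspected through its `classes_` attribute, and `inverse_transform` converts integers back to the original labels. A minimal sketch, with the labels rebuilt inline and `le_demo` as an illustrative name:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

labels = np.array(["No", "Yes", "No", "No", "Yes", "Yes", "No", "Yes", "No", "Yes"])

le_demo = LabelEncoder()
encoded_labels = le_demo.fit_transform(labels)

# The position of each label in classes_ is the integer it is encoded as.
print(le_demo.classes_)  # ['No' 'Yes'], so No -> 0 and Yes -> 1

# Map the first three integers back to their labels.
decoded = le_demo.inverse_transform(encoded_labels[:3])
print(decoded)  # ['No' 'Yes' 'No']
```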

## Splitting dataset into Train/Test set

    ☝🏻 Scaling BEFORE or AFTER splitting data? AFTER splitting!
=> A scaler computes the mean and standard deviation of the data it is fitted on. If it is fitted before splitting, those statistics include the test rows, so information from the test set leaks into training. The test data must stay completely independent of the training data.

In [397]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

In [398]:
print(X_train)

[[0.0 0.0 1.0 38.77777777777778 52000.0]
 [0.0 1.0 0.0 40.0 63777.77777777778]
 [1.0 0.0 0.0 44.0 72000.0]
 [0.0 0.0 1.0 38.0 61000.0]
 [0.0 0.0 1.0 27.0 48000.0]
 [1.0 0.0 0.0 48.0 79000.0]
 [0.0 1.0 0.0 50.0 83000.0]
 [1.0 0.0 0.0 35.0 58000.0]]


In [399]:
print(X_test)

[[0.0 1.0 0.0 30.0 54000.0]
 [1.0 0.0 0.0 37.0 67000.0]]


In [400]:
print(y_train)

[0 1 0 0 1 1 0 1]


In [401]:
print(y_test)

[0 1]


## Feature Scaling
    ☝🏻 Normalisation OR Standardisation?
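-> **Normalisation** (min-max scaling, which maps each feature to the range [0, 1]): suggested when most features follow a normal distribution

-> **Standardisation** (subtract the mean, then divide by the standard deviation): works well in most cases, so it is **preferred**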
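
For reference, the two techniques differ only in the formula applied to each feature: normalisation computes `(x - min) / (max - min)`, and standardisation computes `(x - mean) / std`. The sketch below applies scikit-learn's `MinMaxScaler` and `StandardScaler` to a small, made-up column of ages to show the resulting ranges.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical single feature (one column, five rows).
ages = np.array([[27.0], [35.0], [38.0], [44.0], [50.0]])

normalised = MinMaxScaler().fit_transform(ages)      # (x - min) / (max - min)
standardised = StandardScaler().fit_transform(ages)  # (x - mean) / std

print(normalised.ravel())    # values between 0 and 1
print(standardised.ravel())  # mean 0 and standard deviation 1
```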

In [402]:
from sklearn.preprocessing import StandardScaler

sc = StandardScaler()

    ☝🏻 Do we have to apply scaling to the dummy features?
=> **NO**! Dummy features only take the values 0 and 1, so scaling them would mean:
- No improvement at all (they are already within the typical range of standardised values)
- Losing the interpretability of the dummies, since the scaled values no longer show which category a row belongs to

So apply scaling only to the *non-dummy* numerical features.

In [403]:
X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])


In [404]:
# we must apply the same scaler that was used with the X_train data
X_test[:, 3:] = sc.transform(X_test[:, 3:])

In [405]:
print(X_train)

[[0.0 0.0 1.0 -0.19159184384578545 -1.0781259408412425]
 [0.0 1.0 0.0 -0.014117293757057777 -0.07013167641635372]
 [1.0 0.0 0.0 0.566708506533324 0.633562432710455]
 [0.0 0.0 1.0 -0.30453019390224867 -0.30786617274297867]
 [0.0 0.0 1.0 -1.9018011447007988 -1.420463615551582]
 [1.0 0.0 0.0 1.1475343068237058 1.232653363453549]
 [0.0 1.0 0.0 1.4379472069688968 1.5749910381638885]
 [1.0 0.0 0.0 -0.7401495441200351 -0.5646194287757332]]


In [406]:
print(X_test)

[[0.0 1.0 0.0 -1.4661817944830124 -0.9069571034860727]
 [1.0 0.0 0.0 -0.44973664397484414 0.2056403393225306]]
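
As an alternative arrangement (not the approach used in the cells above), the same steps can be bundled into a `ColumnTransformer` with a nested `Pipeline` and fitted on the training rows only. In this sketch the imputation means are also learned from the training set alone, so the numbers differ from the outputs above, where the imputer was fitted before splitting. The DataFrame is rebuilt inline, and the step names and `handle_unknown="ignore"` are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Feature columns copied from Data.csv as displayed earlier.
features = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "Germany",
                "France", "Spain", "France", "Germany", "France"],
    "Age": [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, np.nan, 58000, 52000, 79000, 83000, 67000],
})

# Numeric columns: fill gaps with the mean, then standardise.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])

# Country is one-hot encoded and left unscaled; Age and Salary go through numeric_steps.
preprocess = ColumnTransformer(
    transformers=[
        ("encoder", OneHotEncoder(handle_unknown="ignore"), ["Country"]),
        ("numeric", numeric_steps, ["Age", "Salary"]),
    ]
)

train_rows, test_rows = train_test_split(features, test_size=0.2, random_state=1)

# All statistics (means, standard deviations, categories) come from the training rows.
train_prepared = preprocess.fit_transform(train_rows)
# The test rows are transformed with those training statistics, never refitted.
test_prepared = preprocess.transform(test_rows)

print(train_prepared.shape, test_prepared.shape)
```

Keeping every fitted step inside one object makes it harder to fit something on the full dataset by accident.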


# Data Processing Template

### Importing libraries

In [407]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

### Importing the dataset

In [408]:
dataset = pd.read_csv("./filez/Data.csv")
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

### Splitting the dataset into Train/Test sets

In [409]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)