<a href="https://colab.research.google.com/github/Aftabgazali/Building_Good_Training_Datasets_-_Data_Preprocessing.ipynb/blob/main/Building_Good_Training_Datasets_%E2%80%93_Data_Preprocessing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Importing Libraries

In [None]:
import pandas as pd
from io import StringIO
import numpy as np

In [None]:
!pip install ydata-profiling

# Handling Missing Data

In [None]:
# Build CSV-formatted dummy data with two missing values
csv_data = \
'''A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
10.0,11.0,12.0,'''

df = pd.read_csv(StringIO(csv_data))

In [None]:
df.head()

## Get the count of null values in each column

In [None]:
df.isna().sum()

# Imputing Missing Values

**Missing values are replaced by the mean of their feature column**

***Note: other strategies include `median` and `most_frequent`; the
latter replaces missing values with the most frequent value in the column***

In [None]:
from sklearn.impute import SimpleImputer

# Replace each NaN with the mean of its column
imr = SimpleImputer(missing_values=np.nan, strategy='mean')
imr = imr.fit(df.values)

new_df = pd.DataFrame(imr.transform(df.values), columns=df.columns)
new_df.values

**The same imputation, done more concisely with pandas**

In [None]:
df.fillna(df.mean())
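The note on alternative strategies can be checked with a small sketch. It rebuilds the same 3x4 dummy matrix as a NumPy array (an assumption made so the block runs on its own) and records what each `SimpleImputer` strategy puts into the two missing cells.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Same dummy data as above: one NaN in column C, one NaN in column D
X = np.array([[1.0, 2.0, 3.0, 4.0],
              [5.0, 6.0, np.nan, 8.0],
              [10.0, 11.0, 12.0, np.nan]])

results = {}
for strategy in ('mean', 'median', 'most_frequent'):
    imputer = SimpleImputer(missing_values=np.nan, strategy=strategy)
    imputed = imputer.fit_transform(X)
    # Keep only the two cells that were missing
    results[strategy] = (imputed[1, 2], imputed[2, 3])

for strategy, filled in results.items():
    print(strategy, filled)
```

With only two observed values per column, the mean and the median coincide; `most_frequent` is more useful for categorical columns, where a mean is not defined.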

# Handling Categorical Features

*When we talk about categorical data, we have to further distinguish between ordinal and nominal
features. Ordinal features are categorical values that can be sorted or ordered.
For example, t-shirt size is an ordinal feature, because we can define an **order: XL > L > M.** In
contrast, nominal features don’t imply any order; to continue with the previous example, t-shirt
color is a nominal feature, since it typically doesn’t make sense to say that, for example,
red is larger than blue.*

In [None]:
tshirts = [
           ['green', 'M', 10.1,'class_1'],
           ['red','L',13.5,'class_2'],
           ['blue','XL',15.3,'class_3']
           ]
df = pd.DataFrame(tshirts)
df.columns = ['color', 'size','price','class']
df.head()

**Mapping the size feature**

In [None]:
size_mapping = {'M': 1, 'L': 2,'XL':3}
df['size'] = [size_mapping[item] for item in df['size']]
df.head()
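If the integer codes later need to be turned back into size labels, the mapping can be inverted. The following is a minimal sketch; `inv_size_mapping` and the small `sizes` series are illustrative names, not part of the notebook above.

```python
import pandas as pd

size_mapping = {'M': 1, 'L': 2, 'XL': 3}
# Swap keys and values to obtain the reverse lookup
inv_size_mapping = {code: label for label, code in size_mapping.items()}

sizes = pd.Series(['M', 'L', 'XL'], name='size')
encoded = sizes.map(size_mapping)        # strings -> ordered integers
decoded = encoded.map(inv_size_mapping)  # integers -> original strings

print(encoded.tolist())
print(decoded.tolist())
```

`Series.map` is used here instead of a list comprehension; both give the same result for keys that are present in the dictionary.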

**Encoding Class Labels**

In [None]:
class_mapping = {feature_value: label for label,feature_value in enumerate(np.unique(df['class']))}
class_mapping

In [None]:
df['class'] = [class_mapping[item] for item in df['class']]
df.head()

**Alternative approach to the manual mapping above: scikit-learn's `LabelEncoder`**

In [None]:
from sklearn.preprocessing import LabelEncoder
class_labels = LabelEncoder()
y = class_labels.fit_transform(df['class'].values)
y

In [None]:
X = df.iloc[[0,1,2],:-1].values
X

In [None]:
color_labels = LabelEncoder()
# Perform encoding on color labels
X[:,0] = color_labels.fit_transform(X[:,0])
X

In [None]:
## Performing OneHotEncoding
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
X = df.iloc[[0,1,2],:-1].values
column_transform = ColumnTransformer([
    ('onehot', OneHotEncoder(),[0]),
    ('nothing', 'passthrough',[1,2])
])
column_transform.fit_transform(X).astype(float)

**An alternative is the pandas `get_dummies` method, which
converts only string columns and leaves all other columns unchanged**

In [None]:
pd.get_dummies(df[['price','color','size']])
pd.get_dummies(df[['price', 'color', 'size']],drop_first=True)

# Trying out ydata profiling for EDA

In [None]:
!pip install ydata-profiling

In [None]:
from ydata_profiling import ProfileReport

In [None]:
import seaborn as sns

iris = sns.load_dataset('iris')
iris.head()

In [None]:
profile = ProfileReport(iris, title="Profiling Report")
profile

# Feature Scaling

*Decision trees and random forests are two of the very few machine learning algorithms where we
**don’t need to worry about feature scaling**, because those algorithms are scale-invariant.
However, the majority of machine learning and optimization algorithms behave much better
if features are on the same scale.*
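The scale-invariance claim for trees can be illustrated with a short sketch. It assumes scikit-learn's bundled `load_iris` data (rather than the seaborn copy used elsewhere in this notebook) and a fixed `random_state`, and compares a shallow tree fitted on raw features with one fitted on standardized features.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Identical hyperparameters and seed; only the feature scale differs
tree_raw = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
tree_scaled = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_scaled, y)

same = np.array_equal(tree_raw.predict(X), tree_scaled.predict(X_scaled))
print("Identical predictions:", same)
```

Tree splits depend only on the ordering of values within each feature, and standardization preserves that ordering, which is why both trees partition the samples the same way.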

**Two common approaches to bringing different features onto the same scale: normalization and standardization**

*Normalization usually refers to rescaling the features
to the range [0, 1], which is a special case of min-max scaling.*
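Written out for a single feature column, min-max scaling takes the form below. The symbols are notation introduced here for illustration: $x^{(i)}$ is the value of the feature for sample $i$, and $x_{\min}$ and $x_{\max}$ are the smallest and largest values in that column.

$$
x_{\text{norm}}^{(i)} = \frac{x^{(i)} - x_{\min}}{x_{\max} - x_{\min}}
$$

The smallest value maps to 0 and the largest to 1; everything else falls in between.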

**Implementing Normalization using `MinMaxScaler`**

In [None]:
iris = sns.load_dataset('iris')
iris.head()

In [None]:
X = iris.iloc[:,[0,1,2,3]].values
X

In [None]:
y = iris.iloc[:,-1].values

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2)

In [None]:
from sklearn.preprocessing import MinMaxScaler
mns = MinMaxScaler()
X_train_norm = mns.fit_transform(X_train)
X_test_norm = mns.transform(X_test)

*Although normalization via min-max scaling is a commonly used technique that is useful when we
need values in a bounded interval, standardization can be more practical for many machine learning
algorithms, especially for optimization algorithms such as gradient descent. The reason is that many
linear models, such as the logistic regression and SVM from Chapter 3, initialize the weights to 0 or
small random values close to 0. Using standardization, we center the feature columns at mean 0 with
standard deviation 1 so that the feature columns have the same parameters as a standard normal
distribution (zero mean and unit variance), which makes it easier to learn the weights.*
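For a single feature column, the standardization described above can be written as follows, where $\mu_x$ and $\sigma_x$ (notation assumed here) are the mean and standard deviation of that column estimated on the training data.

$$
x_{\text{std}}^{(i)} = \frac{x^{(i)} - \mu_x}{\sigma_x}
$$

Unlike min-max scaling, the result is not bounded to a fixed interval, but it is less sensitive to a single extreme value setting the range.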

**Implementing Standardization using `StandardScaler`**

In [None]:
from sklearn.preprocessing import StandardScaler
stdc = StandardScaler()
X_train_std = stdc.fit_transform(X_train)
X_test_std = stdc.transform(X_test)

*Manually applying both methods*

In [None]:
arr = np.array([0,1,2,3,4,5])
# Standardization: subtract the mean, divide by the standard deviation
std = (arr - arr.mean()) / arr.std()
# Min-max normalization: the denominator is the full range (max - min)
nrm = (arr - arr.min()) / (arr.max() - arr.min())

table = pd.DataFrame({'Standardization': std, 'Normalization': nrm})
print(f"Standardization {std}")
print(f"Normalization {nrm}")

In [None]:
table.head()

*It is important to highlight that we fit the StandardScaler class only once, on the training
data, and use those parameters to transform the test dataset or any new data point.
Other, more advanced methods for feature scaling are available from scikit-learn, such as **RobustScaler**.*


*RobustScaler is especially helpful and recommended if we are working with small datasets that
contain many outliers. Similarly, if the machine learning algorithm applied to this dataset is prone
to overfitting, RobustScaler can be a good choice. Operating on each feature column independently,
RobustScaler removes the median value and scales the dataset according to the 1st and 3rd quartile of
the dataset (that is, the 25th and 75th percentile, respectively) such that more extreme values and outliers
become less pronounced.*
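A minimal sketch of this behavior, assuming `RobustScaler` with its default settings and a made-up single column containing one outlier, compares the scaler's output with a manual "subtract the median, divide by the interquartile range" computation.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# One feature column with a single extreme value (illustrative data)
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

scaled = RobustScaler().fit_transform(X)

# Manual equivalent: center on the median, scale by Q3 - Q1
median = np.median(X, axis=0)
q1, q3 = np.percentile(X, [25, 75], axis=0)
manual = (X - median) / (q3 - q1)

print(scaled.ravel())
print(np.allclose(scaled, manual))
```

Because the median and the quartiles barely move when the value 100 is added, the four regular points keep a sensible spread, whereas a mean and standard deviation would be dominated by the outlier.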

# L1 & L2 Regularization

In [None]:
from sklearn.datasets import load_wine

wine = load_wine()

df = pd.DataFrame(data = wine.data, columns = wine.feature_names)
df['target'] = wine.target
df.head()

In [None]:
X = df.iloc[:,:-1].values
y = df.iloc[:,-1].values

X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.3)

In [None]:
from sklearn.preprocessing import StandardScaler
stdc = StandardScaler()
X_train_std = stdc.fit_transform(X_train)
X_test_std = stdc.transform(X_test)

**Applying L1 Regularization**

*For regularized models in scikit-learn that support L1 regularization, we can simply set the `penalty`
parameter to `'l1'` to obtain a sparse solution.*

**Note:** *we also need to select a different optimization algorithm (for example, `solver='liblinear'`),
since `'lbfgs'` currently does not support L1-regularized loss optimization.*

In [None]:
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(C=1.0,penalty='l1', solver='liblinear', multi_class='ovr')
model.fit(X_train_std, y_train)
print(f"Training Accuracy {model.score(X_train_std,y_train)}")

In [None]:
model.coef_

In [None]:
import matplotlib.pyplot as plt

fig = plt.figure()
ax = plt.subplot(111)
colors = ['blue', 'green', 'red', 'cyan','magenta', 'yellow', 'black','pink', 'lightgreen', 'lightblue','gray', 'indigo', 'orange']

weights, params = [], []
for c in np.arange(-4., 6.):
  model = LogisticRegression(penalty='l1', C=10.**c, solver='liblinear', multi_class='ovr')
  model.fit(X_train_std, y_train)
  weights.append(model.coef_[2])
  params.append(10.**c)

weights = np.array(weights)
# column is the index
for column, color in zip(range(weights.shape[1]), colors):
  plt.plot(params, weights[:, column], label=df.columns[column],
           color = color)
plt.axhline(0, color='black', linestyle='--', linewidth=3)
plt.xlim([10**(-5), 10**5])
plt.ylabel('Weight coefficient')
plt.xlabel('C (inverse regularization strength)')
plt.xscale('log')
plt.legend(loc='upper left')
ax.legend(loc='upper center',bbox_to_anchor=(1.38, 1.03),ncol=1, fancybox=True)
plt.show()

**The resulting plot provides further insight into the behavior of L1 regularization. As we can
see, all feature weights become zero if we penalize the model with a strong regularization parameter
(C < 0.01), where C is the inverse of the regularization parameter, 𝜆.**
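The sparsity effect can also be summarized numerically by counting non-zero weights at a few values of C. This is a sketch under stated assumptions: it uses the full standardized Wine data without a train/test split, and it wraps the binary L1 model in an explicit `OneVsRestClassifier` instead of passing `multi_class='ovr'`.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)
X_std = StandardScaler().fit_transform(X)

nonzero_counts = {}
for C in (0.001, 0.1, 10.0):
    # One binary L1-penalized model per class (one-vs-rest)
    ovr = OneVsRestClassifier(
        LogisticRegression(penalty='l1', C=C, solver='liblinear'))
    ovr.fit(X_std, y)
    # Stack the per-class weight vectors into a (n_classes, n_features) array
    coefs = np.vstack([est.coef_ for est in ovr.estimators_])
    nonzero_counts[C] = int(np.count_nonzero(coefs))

print(nonzero_counts)
```

Smaller C means a stronger penalty, so fewer weights should survive at the low end of the range than at the high end.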

# Feature Selection Techniques

***Sequential Feature Algorithms***

*An alternative way to reduce the complexity of the model and avoid overfitting is dimensionality
reduction via feature selection, which is especially useful for unregularized models. There are two
main categories of dimensionality reduction techniques: feature selection and feature extraction. Via
feature selection, we select a subset of the original features, whereas in feature extraction, we derive
information from the feature set to construct a new feature subspace.*

* Sequential feature selection algorithms are a family of greedy search algorithms that are used to
reduce an initial d-dimensional feature space to a k-dimensional feature subspace where k<d. The
motivation behind feature selection algorithms is to automatically select a subset of features that are
most relevant to the problem, to improve computational efficiency, or to reduce the generalization
error of the model by removing irrelevant features or noise, which can be useful for algorithms that
don’t support regularization.

* A classic sequential feature selection algorithm is sequential backward selection (SBS), which aims to
reduce the dimensionality of the initial feature subspace with a minimum decay in the performance
of the classifier to improve upon computational efficiency. The idea behind the SBS algorithm is quite
simple: SBS sequentially removes features from the full feature subset until the new feature subspace
contains the desired number of features.
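The SBS idea described above can be sketched in a few lines. This is an illustrative from-scratch version, not the mlxtend implementation used later; the function name `sequential_backward_selection` and the choice of 5-fold cross-validated accuracy as the criterion are assumptions.

```python
from itertools import combinations

import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def sequential_backward_selection(estimator, X, y, k_features, cv=5):
    """Greedily drop one feature at a time until k_features remain."""
    indices = tuple(range(X.shape[1]))
    while len(indices) > k_features:
        # Every subset obtained by removing exactly one remaining feature
        candidates = list(combinations(indices, len(indices) - 1))
        scores = [cross_val_score(estimator, X[:, list(c)], y, cv=cv).mean()
                  for c in candidates]
        # Keep the subset whose removal hurts the score the least
        indices = candidates[int(np.argmax(scores))]
    return indices


X, y = load_wine(return_X_y=True)
# Scaling lives inside the pipeline so each CV fold is scaled on its own training part
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
selected = sequential_backward_selection(knn, X, y, k_features=5)
print(selected)
```

The search is greedy: a feature removed early is never reconsidered, which keeps the number of model fits manageable but does not guarantee the best possible subset.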

In [None]:
!pip install mlxtend

In [None]:
wine_data = load_wine()
df = pd.DataFrame(data = wine_data.data, columns = wine_data.feature_names)
df.head()

In [None]:
df['Target'] = wine_data.target
df.head()

In [None]:
print(f"Class Labels {np.unique(df['Target'])}")

*Standardizing the features, since KNN is sensitive to feature scale*

In [None]:
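# Features are every column except the last; 'Target' holds the class labels
X = df.iloc[:, :-1].values
y = df['Target'].values
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.2)

sc = StandardScaler()
X_train_std = sc.fit_transform(X_train)
X_test_std = sc.transform(X_test)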

*Build a baseline KNN model*

In [None]:
from sklearn.neighbors import KNeighborsClassifier

model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train_std, y_train)
print(f"Training Accuracy {model.score(X_train_std, y_train)}")
print(f"Testing Accuracy {model.score(X_test_std, y_test)}")

**Selecting Best 5 Features**

* `forward=True` & `floating=False` indicates it is SFS
* `forward=False` & `floating=False` indicates it is SBS

In [None]:
from mlxtend.feature_selection import SequentialFeatureSelector as SFS

sfs_1 = SFS(model, k_features=5,forward=True,floating=False,verbose=2,scoring='accuracy',cv=5)
sfs_1 = sfs_1.fit(X_train_std, y_train)

**Get the best feature Index using `k_feature_idx_`**

In [None]:
sfs_1.k_feature_idx_

**Get the best features**

In [None]:
df.columns[list(sfs_1.k_feature_idx_)]

**Extracting data based on the best features using `transform`**

In [None]:
X_train_selected = sfs_1.transform(X_train_std)
X_test_selected = sfs_1.transform(X_test_std)

model.fit(X_train_selected, y_train)

In [None]:
print(f"Training Accuracy {model.score(X_train_selected, y_train)}")
print(f"Testing Accuracy {model.score(X_test_selected, y_test)}")

**We can see from the graph that with 5 features we get an accuracy of nearly 99%**

In [None]:
from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs
metric_dict = sfs_1.get_metric_dict(confidence_interval=0.95)
fig1 = plot_sfs(metric_dict, kind='std_dev')

**Selecting the best features without specifying how many features to keep**

*Set `k_features='best'`*

In [None]:
sfs_2 = SFS(model, k_features='best',forward=True,floating=False,verbose=2,scoring='accuracy',cv=5)
sfs_2 = sfs_2.fit(X_train_std, y_train)

In [None]:
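# Refit and evaluate KNN on the feature subset chosen by sfs_2
X_tr2, X_te2 = sfs_2.transform(X_train_std), sfs_2.transform(X_test_std)
model.fit(X_tr2, y_train)
print(f"Training Accuracy {model.score(X_tr2, y_train)}")
print(f"Testing Accuracy {model.score(X_te2, y_test)}")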

**So, using 9 features is probably the best idea to get maximum accuracy**

In [None]:
metric_dict = sfs_2.get_metric_dict(confidence_interval=0.95)
fig1 = plot_sfs(metric_dict, kind='std_dev')

# Assessing feature importance with random forests

*We learnt how to use L1 regularization to zero out irrelevant features via logistic
regression and how to use sequential feature selection and apply it to a KNN algorithm.
Another useful approach for selecting relevant features from a dataset is using a random forest,
an ensemble technique.*

*We can measure the feature
importance as the averaged impurity decrease computed from all decision trees in the forest, without
making any assumptions about whether our data is linearly separable or not. Conveniently, the random
forest implementation in scikit-learn already collects the feature importance values for us so that we
can access them via the `feature_importances_` attribute after fitting a `RandomForestClassifier`.*

In [None]:
from sklearn.ensemble import RandomForestClassifier

feature_labels = df.columns[:-1]
forest = RandomForestClassifier(n_estimators=500)
forest.fit(X_train, y_train)
importances = forest.feature_importances_
# Feature indices sorted from most to least important
indices = np.argsort(importances)[::-1]
# for f in range(X_train.shape[1]):
#   print("%2d) %-*s %f" % (f + 1, 30,feature_labels[indices[f]],importances[indices[f]]))
plt.title('Feature importance')
plt.bar(range(X_train.shape[1]), importances[indices], align='center')
plt.xticks(range(X_train.shape[1]), feature_labels[indices], rotation=90)
plt.xlim([-1, X_train.shape[1]])
plt.tight_layout()
plt.show()

**We can conclude that the proline and flavonoid levels, the color intensity, the OD280/OD315 diffraction,
and the alcohol concentration of wine are the most discriminative features in the dataset based on
the average impurity decrease in the 500 decision trees**

*To extract the data, we could set the threshold to 0.1 to
reduce the dataset to the five most important features.*

In [None]:
from sklearn.feature_selection import SelectFromModel
sfm = SelectFromModel(forest, threshold=0.1, prefit=True)
X_selected = sfm.transform(X_train)
print('Number of features that meet this threshold','criterion:', X_selected.shape[1])
for f in range(X_selected.shape[1]):
  print("%2d) %-*s %f" % (f + 1,30,feature_labels[indices[f]],importances[indices[f]]))
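As an alternative to indexing with the sorted `indices` array, the selector can report which columns it kept. The sketch below is self-contained and makes a few assumptions: it fits on the full Wine data, uses 100 trees and a fixed `random_state` for repeatability, and relies on `get_support()` to obtain a boolean mask.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

wine = load_wine()
X, y = wine.data, wine.target
feature_names = np.array(wine.feature_names)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
sfm = SelectFromModel(forest, threshold=0.1, prefit=True)

# Boolean mask marking the features whose importance meets the threshold
mask = sfm.get_support()
X_selected = sfm.transform(X)

for name, importance in zip(feature_names[mask],
                            forest.feature_importances_[mask]):
    print(f"{name:<30s} {importance:.4f}")
```

The mask keeps the original column order, so the printed names line up with the columns of `X_selected` rather than with the importance ranking.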