#### About

> Data pre-processing

Data pre-processing is an essential stage in machine learning: it converts raw data into a format that machine learning algorithms can readily consume and analyse. Commonly used pre-processing techniques include the following:

1. Data cleaning: Ensuring the data is accurate and consistent is a crucial part of pre-processing. A typical way to handle missing data is to impute missing values with the feature's mean or median. If only a few rows or columns contain missing values, another option is to drop them. Detecting and removing outliers can also improve data quality.

2. Data transformation: This entails scaling the data so that each feature has a comparable range and distribution. This matters because many machine learning methods are sensitive to the scale of the features. Standardisation and normalisation are the most frequently used scaling methods. Normalisation rescales the data to a fixed range such as [0, 1], while standardisation rescales it to have zero mean and unit variance.


3. Feature selection: This entails finding the features that matter most for predicting the target variable. This is important because using too many features can lead to overfitting and reduce the model's performance. Correlation analysis, recursive feature elimination, and feature importance ranking are common feature selection strategies.

4. Feature engineering: This is the process of deriving new, more informative features from existing ones to improve the model's performance. If the dataset has a date field, for example, features such as day of the week or month can be derived from it. Other feature engineering approaches include one-hot encoding, binning, and text processing.

5. Data partitioning: The dataset is split into training, validation, and test sets. The training set is used to train the model, the validation set is used to tune the model's hyperparameters, and the test set is used to evaluate the model's performance. The most common way to assign samples to each set is random sampling.

6. Data augmentation: This generates synthetic data from existing data to increase the dataset size and improve the model's robustness. Common augmentation methods include shifting or rotating images, adding noise to the data, and perturbing input variables.
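
As a rough illustration of the cleaning and feature engineering steps above, here is a minimal sketch using pandas. The toy DataFrame and its column names (`age`, `signup_date`, `plan`) are hypothetical and exist only for this example; median imputation is just one of the options mentioned in the list.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing numeric value, a date column,
# and a categorical column.
raw = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "signup_date": pd.to_datetime(
        ["2023-01-02", "2023-01-07", "2023-02-15", "2023-03-20"]
    ),
    "plan": ["free", "pro", "free", "team"],
})

# Data cleaning: impute the missing age with the column median.
clean = raw.copy()
clean["age"] = clean["age"].fillna(clean["age"].median())

# Feature engineering: derive calendar features from the date field.
clean["day_of_week"] = clean["signup_date"].dt.dayofweek  # Monday = 0
clean["month"] = clean["signup_date"].dt.month

# One-hot encoding: one indicator column per category of "plan".
encoded = pd.get_dummies(clean, columns=["plan"])
print(encoded)
```

Whether to impute or to drop rows depends on how much data is missing; with a single gap in a small table, imputation keeps the row available for training.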



In [1]:
from sklearn.datasets import load_iris
import pandas as pd

In [2]:
iris = load_iris()

In [3]:
df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
df['target'] = iris.target

In [4]:
print(df.head())

   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
0                5.1               3.5                1.4               0.2   
1                4.9               3.0                1.4               0.2   
2                4.7               3.2                1.3               0.2   
3                4.6               3.1                1.5               0.2   
4                5.0               3.6                1.4               0.2   

   target  
0       0  
1       0  
2       0  
3       0  
4       0  
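
With the features and target in a single frame, the feature selection step described earlier can be tried on this dataset. The sketch below assumes scikit-learn's `SelectKBest` with the ANOVA F-test (`f_classif`) as the scoring function; this is one possible ranking method, and `k=2` is an arbitrary choice made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif
import pandas as pd

iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = iris.target

# Score each feature against the class label with a one-way ANOVA F-test
# and keep the two highest-scoring features (k=2 is an arbitrary choice).
selector = SelectKBest(score_func=f_classif, k=2).fit(X, y)

# Rank all features by score, then list the ones the selector kept.
scores = pd.Series(selector.scores_, index=X.columns).sort_values(ascending=False)
selected = list(X.columns[selector.get_support()])
print(scores.round(1))
print("selected:", selected)
```

On iris, the two petal measurements score well above the sepal measurements, so they are the ones retained.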


In [5]:
print(df.describe())


       sepal length (cm)  sepal width (cm)  petal length (cm)  \
count         150.000000        150.000000         150.000000   
mean            5.843333          3.057333           3.758000   
std             0.828066          0.435866           1.765298   
min             4.300000          2.000000           1.000000   
25%             5.100000          2.800000           1.600000   
50%             5.800000          3.000000           4.350000   
75%             6.400000          3.300000           5.100000   
max             7.900000          4.400000           6.900000   

       petal width (cm)      target  
count        150.000000  150.000000  
mean           1.199333    1.000000  
std            0.762238    0.819232  
min            0.100000    0.000000  
25%            0.300000    0.000000  
50%            1.300000    1.000000  
75%            1.800000    2.000000  
max            2.500000    2.000000  
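
The summary above shows that the four measurements have different means and spreads, which is where the scaling step described earlier becomes relevant. The following is a minimal sketch, assuming scikit-learn's `StandardScaler` and `MinMaxScaler`; the variable names `X_std` and `X_norm` are introduced here for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler, StandardScaler
import pandas as pd

iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)

# Standardisation: each column is rescaled to zero mean and unit variance.
X_std = pd.DataFrame(StandardScaler().fit_transform(X), columns=X.columns)

# Normalisation (min-max): each column is rescaled to the range [0, 1].
X_norm = pd.DataFrame(MinMaxScaler().fit_transform(X), columns=X.columns)

print(X_std.describe().loc[["mean", "std"]].round(3))
print(X_norm.describe().loc[["min", "max"]])
```

`StandardScaler` uses the population standard deviation, while `describe()` reports the sample standard deviation, so the printed `std` comes out slightly above 1. In a real pipeline the scaler would normally be fitted on the training set only and then applied to the validation and test sets.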


In [6]:
print(df['target'].value_counts())



0    50
1    50
2    50
Name: target, dtype: int64
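
Since the three classes are perfectly balanced, a stratified random split can preserve that balance in each subset. The sketch below assumes scikit-learn's `train_test_split` applied twice; the 60/20/20 ratio and `random_state=42` are illustrative choices, not values taken from this notebook.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First hold out 20% as the test set, then split the remainder 75/25
# into training and validation sets (60/20/20 overall).
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, stratify=y_rest, random_state=42
)

print(len(X_train), len(X_val), len(X_test))
```

Passing `stratify` keeps the class proportions equal across the three sets, which matters more on imbalanced data than it does here.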
