In [2]:
pip install numpy pandas matplotlib seaborn scikit-learn jupyterlab


## Data Preprocessing Methods

Data preprocessing is a crucial step in the machine learning pipeline. It transforms raw data into a clean, consistent format that learning algorithms can consume, which typically improves model accuracy. Common preprocessing steps include handling missing values, encoding categorical variables, scaling features, and splitting the dataset into training and test sets.
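
As a rough preview, the steps listed above can be chained with scikit-learn's `Pipeline` and `ColumnTransformer` so that every transformer is fitted on the training rows only. This is a minimal sketch, not the method used in the cells of this notebook: the data frame is rebuilt inline from the values printed further down, and the variable names, `handle_unknown="ignore"`, and `sparse_threshold=0` are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Ten rows rebuilt inline (assumption: mirrors the contents of data.csv).
df = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "Germany",
                "France", "Spain", "France", "Germany", "France"],
    "Age": [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, np.nan,
               58000, 52000, 79000, 83000, 67000],
    "Purchased": ["No", "Yes", "No", "No", "Yes",
                  "Yes", "No", "Yes", "No", "Yes"],
})

# Numeric columns: fill missing values with the mean, then standardize.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])

# Categorical column: one-hot encode; sparse_threshold=0 forces a dense array.
preprocess = ColumnTransformer(
    transformers=[
        ("num", numeric_steps, ["Age", "Salary"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["Country"]),
    ],
    sparse_threshold=0,
)

# Split first, so the statistics are learned from the training rows only.
X_train, X_test, y_train, y_test = train_test_split(
    df.drop(columns=["Purchased"]), df["Purchased"],
    test_size=0.2, random_state=1,
)
X_train_t = preprocess.fit_transform(X_train)
X_test_t = preprocess.transform(X_test)
print(X_train_t.shape, X_test_t.shape)
```

Splitting before fitting means the imputation means and scaling statistics never see the test rows, which is the main reason to prefer this arrangement over transforming the full dataset up front.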


In [24]:
import numpy as np
import pandas as pd

data = pd.read_csv('data.csv')
data.isnull().sum()

Country      0
Age          1
Salary       1
Purchased    0
dtype: int64

### Splitting the Matrix of Features (X) and Dependent Variable Vector (y)

In supervised machine learning, the dataset is separated into two parts: the matrix of features (X) and the dependent variable vector (y). The features are the input variables used to make predictions, while the dependent variable is the output, or target, that we want to predict. Here the features are every column except the last (Country, Age, Salary), and the target is the last column (Purchased).

In [40]:
x = data.iloc[:, :-1].values
y = data.iloc[:, -1].values
print(x)
print(y)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']
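
Positional slicing with `iloc` works as long as the target stays in the last column. A sketch of a name-based alternative follows; it assumes the column names reported by `isnull().sum()` above, rebuilds the ten rows inline instead of reading `data.csv`, and uses the hypothetical names `X_df` and `y_sr`.

```python
import numpy as np
import pandas as pd

# Same ten rows as the printed output, rebuilt inline so the example
# does not depend on data.csv being present.
data = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "Germany",
                "France", "Spain", "France", "Germany", "France"],
    "Age": [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, np.nan,
               58000, 52000, 79000, 83000, 67000],
    "Purchased": ["No", "Yes", "No", "No", "Yes",
                  "Yes", "No", "Yes", "No", "Yes"],
})

target = "Purchased"
X_df = data.drop(columns=[target])  # every column except the target
y_sr = data[target]                 # the target column as a Series

print(X_df.shape, y_sr.shape)  # (10, 3) (10,)
```

Selecting by name keeps the split correct even if columns are later reordered or new feature columns are appended.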


### Taking Care of Missing Data

Real-world datasets often contain missing or incomplete values, which can degrade the performance of machine learning models. Common techniques include replacing missing values with the mean, median, or mode of the column, or removing the rows or columns that contain them. The cell below replaces each missing Age and Salary value with the mean of its column. Addressing missing data deliberately helps prevent biases and errors during model training.

In [41]:
from sklearn.impute import SimpleImputer

# Replace each NaN in the numeric columns (Age, Salary) with that column's mean
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
x[:, 1:] = imputer.fit_transform(x[:, 1:])
print(x)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 63777.77777777778]
 ['France' 35.0 58000.0]
 ['Spain' 38.77777777777778 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
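
The two filled-in values can be checked by hand: each is the mean of the nine observed entries in its column. The sketch below recomputes them with NumPy only; the array names are illustrative.

```python
import numpy as np

# Age and Salary columns as printed before imputation (NaN marks the gaps).
age = np.array([44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37], dtype=float)
salary = np.array([72000, 48000, 54000, 61000, np.nan,
                   58000, 52000, 79000, 83000, 67000], dtype=float)

# nanmean ignores NaN entries, so each mean is taken over nine values.
age_mean = np.nanmean(age)        # 349 / 9
salary_mean = np.nanmean(salary)  # 574000 / 9

# Fill each gap with its column mean, as the 'mean' strategy does.
age_filled = np.where(np.isnan(age), age_mean, age)
salary_filled = np.where(np.isnan(salary), salary_mean, salary)

print(round(age_mean, 2), round(salary_mean, 2))
```

The results, roughly 38.78 and 63777.78, agree with the values that appear in the imputed rows of the output.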


### Encoding Categorical Data

Many machine learning algorithms require numerical input, but real-world datasets often contain categorical (text) data. Encoding converts these text labels into numbers. Label encoding assigns each category an integer, which suits a binary target such as Purchased but implies an ordering when applied to a feature with several categories. One-hot encoding instead creates one binary column per category, so it is used here for Country.

In [43]:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, LabelEncoder
col_transformer = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
x = np.array(col_transformer.fit_transform(x))
print(x)
label_encoder_y = LabelEncoder()
y = label_encoder_y.fit_transform(y)
print(y)

[[0.0 1.0 0.0 0.0 44.0 72000.0]
 [1.0 0.0 0.0 1.0 27.0 48000.0]
 [1.0 0.0 1.0 0.0 30.0 54000.0]
 [1.0 0.0 0.0 1.0 38.0 61000.0]
 [1.0 0.0 1.0 0.0 40.0 63777.77777777778]
 [0.0 1.0 0.0 0.0 35.0 58000.0]
 [1.0 0.0 0.0 1.0 38.77777777777778 52000.0]
 [0.0 1.0 0.0 0.0 48.0 79000.0]
 [1.0 0.0 1.0 0.0 50.0 83000.0]
 [0.0 1.0 0.0 0.0 37.0 67000.0]]
[0 1 0 0 1 1 0 1 0 1]
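
With three countries, a single one-hot pass should produce three indicator columns, for five columns in total. The printed matrix has six, which is what results when the encoding cell is executed a second time on an array that is already encoded: the first indicator column is itself split into two. The sketch below reproduces both cases on the first three rows; the helper `encode_first_column` is hypothetical, and `sparse_threshold=0` is added only to force a dense result.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder


def encode_first_column(arr):
    """One-hot encode column 0 and pass the other columns through unchanged."""
    ct = ColumnTransformer(
        transformers=[("encoder", OneHotEncoder(), [0])],
        remainder="passthrough",
        sparse_threshold=0,  # always return a dense array
    )
    return np.array(ct.fit_transform(arr))


# First three rows of the imputed feature matrix.
x_raw = np.array([
    ["France", 44.0, 72000.0],
    ["Spain", 27.0, 48000.0],
    ["Germany", 30.0, 54000.0],
], dtype=object)

x_once = encode_first_column(x_raw)    # France, Germany, Spain indicators
x_twice = encode_first_column(x_once)  # re-encodes the France indicator

print(x_once.shape)   # (3, 5)
print(x_twice.shape)  # (3, 6)
print(x_twice[0])
```

After one pass the France row starts with `1.0 0.0 0.0`; after two passes it starts with `0.0 1.0 0.0 0.0`, matching the first row printed above. Restarting the kernel and running the cells once, in order, gives the five-column layout, and any later slice that assumes a fixed number of indicator columns needs to match whichever layout is in use.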


### Splitting the Dataset into Training and Test Sets

To evaluate the performance of a machine learning model, the dataset is split into a training set and a test set. The training set is used to fit the model, while the test set is used to assess how well the model generalizes to new, unseen data. Because the test set plays no part in training, it exposes overfitting and gives a more realistic estimate of the model's accuracy on real-world data. With test_size=0.2, 2 of the 10 rows are held out for testing.

In [46]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1)
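
The effect of `test_size` and `random_state` can be seen on a small stand-in array. The sketch below also passes `stratify`, which the cell above does not use; it is shown as an optional assumption that keeps the class balance equal in both parts, which can matter on a dataset this small. The `x_demo` features are placeholders, while `y_demo` copies the encoded labels printed earlier.

```python
import numpy as np
from sklearn.model_selection import train_test_split

x_demo = np.arange(20).reshape(10, 2)               # placeholder features
y_demo = np.array([0, 1, 0, 0, 1, 1, 0, 1, 0, 1])   # encoded Purchased labels

x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo,
    test_size=0.2,      # 2 of 10 rows go to the test set
    random_state=1,     # fixed seed makes the split reproducible
    stratify=y_demo,    # optional: preserve the 5/5 class balance
)

print(x_tr.shape, x_te.shape)             # (8, 2) (2, 2)
print(np.bincount(y_tr), np.bincount(y_te))
```

With stratification, the two test rows contain one example of each class and the training set keeps four of each.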


### Applying Feature Scaling for Training and Testing Data

Feature scaling normalizes or standardizes the range of the independent variables (features). Many algorithms perform better when features are on a similar scale, especially those that rely on distance calculations (e.g., KNN, SVM) or gradient descent. Scaling can speed up convergence, improve accuracy, and keep any single feature from dominating the learning process because of its magnitude. The scaler is fitted on the training set only and then applied to the test set, so no information from the test set leaks into training.

In [47]:
from sklearn.preprocessing import StandardScaler
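scaler_x = StandardScaler()
# Scale only the last two columns (Age, Salary); the one-hot columns stay as 0/1
x_train[:, -2:] = scaler_x.fit_transform(x_train[:, -2:])
# Reuse the training-set mean and standard deviation on the test set
x_test[:, -2:] = scaler_x.transform(x_test[:, -2:])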
print(x_train)
print(x_test)

[[1.0 0.0 0.0 1.0 -0.19159184384578545 -1.0781259408412425]
 [1.0 0.0 1.0 0.0 -0.014117293757057777 -0.07013167641635372]
 [0.0 1.0 0.0 0.0 0.566708506533324 0.633562432710455]
 [1.0 0.0 0.0 1.0 -0.30453019390224867 -0.30786617274297867]
 [1.0 0.0 0.0 1.0 -1.9018011447007988 -1.420463615551582]
 [0.0 1.0 0.0 0.0 1.1475343068237058 1.232653363453549]
 [1.0 0.0 1.0 0.0 1.4379472069688968 1.5749910381638885]
 [0.0 1.0 0.0 0.0 -0.7401495441200351 -0.5646194287757332]]
[[1.0 0.0 1.0 0.0 -1.4661817944830124 -0.9069571034860727]
 [0.0 1.0 0.0 0.0 -0.44973664397484414 0.2056403393225306]]
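
The scaled values printed above can be reproduced by hand. Standardization computes z = (value - mean) / std, where the mean and the population standard deviation come from the training rows only. The sketch below rebuilds the Age and Salary columns of the eight training rows and two test rows from the outputs in this notebook; the array names are illustrative.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Age and Salary of the training rows, in the order they appear in the
# printed x_train (two entries are the imputed column means).
train = np.array([
    [349 / 9, 52000.0],
    [40.0, 574000 / 9],
    [44.0, 72000.0],
    [38.0, 61000.0],
    [27.0, 48000.0],
    [48.0, 79000.0],
    [50.0, 83000.0],
    [35.0, 58000.0],
])
# Age and Salary of the two test rows.
test = np.array([
    [30.0, 54000.0],
    [37.0, 67000.0],
])

mean = train.mean(axis=0)
std = train.std(axis=0)          # population std (ddof=0), as StandardScaler uses

manual_test = (test - mean) / std
sklearn_test = StandardScaler().fit(train).transform(test)

print(manual_test)
print(np.allclose(manual_test, sklearn_test))
```

The first test row comes out at roughly -1.466 for Age and -0.907 for Salary, in line with the last two columns of the printed test set.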
