# <div class="alert alert-block alert-success" dir="rtl" style="text-align: center;"><strong><span style="font-size: 20pt">Data PreProcessing <br /></span></strong></div>

# Working with this Notebook

## Step 1: Understand the Notebook

- **Review existing code:** Understand the context of the `?` placeholders by examining the surrounding code and comments.

## Step 2: Identify the Function Type
- **Look at the comment beside `?`:** Use the comment to infer what function or code is needed.
  - Example: `# Load data` suggests using a data loading function like `pd.read_csv()`.

## Step 3: Replace `?` with the Appropriate Function
- **Loading Data:**
  ```python
  ? # Load data
  data = pd.read_csv('file.csv')  # Replace '?' with 'pd.read_csv' or an appropriate data loading function
  ```

## Step 4: Test the Code
- **Run each cell:** After replacing the `?` placeholders with the appropriate functions or code, execute each cell in the notebook to ensure everything works correctly.
- **Debug if necessary:** If any errors occur, review the code and comments to confirm that the correct functions were chosen, and make any necessary adjustments.


# <font color='deepskyblue'>Why Data Preprocessing in Machine Learning?</font>
Data preprocessing is the first step in building a Machine Learning model. Real-world data is typically incomplete, inconsistent, and inaccurate (it contains errors or outliers), and it often lacks specific attribute values or trends. Data preprocessing cleans, formats, and organizes the raw data so that it is ready for Machine Learning models. Let's explore the steps of data preprocessing in machine learning.

# <font color='deepskyblue'>Step 0 : Download the Dataset</font>

Run the cell below to download the dataset automatically.

In [None]:
import requests

dropbox_url = 'https://www.dropbox.com/scl/fi/wlwe84143qn7y21fo0gfl/data2.csv?rlkey=y80m2qjenl5xnldweg6jt83m0&st=iivuy9fn&dl=1'

destination = 'data.csv'

response = requests.get(dropbox_url, stream=True)
response.raise_for_status()  # Stop here if the download failed instead of saving an error page

with open(destination, 'wb') as file:
    for chunk in response.iter_content(1024):
        file.write(chunk)

print(f"Downloaded the file to {destination}")


Downloaded the file to data.csv


# <font color='deepskyblue'>Step 1 : Import all the necessary Libraries</font>


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import sklearn

# <font color='deepskyblue'>Step 2 : Load the dataset</font>


In [None]:
dataset = pd.read_csv('data.csv') # load the dataset

In [None]:
df = pd.DataFrame(dataset)

In [None]:
df

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes
5,France,35.0,58000.0,Yes
6,Spain,,52000.0,No
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No
9,France,37.0,67000.0,Yes


In [None]:
X = df.iloc[:, :-1].values
Y = df.iloc[:, -1].values

In [None]:
print(X)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]


In [None]:
print(Y)

['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']


# <font color='deepskyblue'>Step 3 : Identify and Handle all the missing values</font>


In [None]:
df.isnull()  # Find which columns have missing values

Unnamed: 0,Country,Age,Salary,Purchased
0,False,False,False,False
1,False,False,False,False
2,False,False,False,False
3,False,False,False,False
4,False,False,True,False
5,False,False,False,False
6,False,True,False,False
7,False,False,False,False
8,False,False,False,False
9,False,False,False,False
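The boolean table above becomes hard to scan on larger data. A minimal sketch of counting the missing values per column instead, assuming a small hand-built frame (`sample` is a hypothetical name, not part of the notebook) that mirrors the dataset's columns:

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the dataset, with one missing Age and one missing Salary.
sample = pd.DataFrame({
    "Country": ["France", "Germany", "Spain"],
    "Age": [44.0, 40.0, np.nan],
    "Salary": [72000.0, np.nan, 52000.0],
    "Purchased": ["No", "Yes", "No"],
})

# isnull() marks each missing cell; sum() counts the True values per column.
missing_per_column = sample.isnull().sum()
print(missing_per_column)

# Keep only the names of the columns that contain at least one missing value.
columns_with_missing = missing_per_column[missing_per_column > 0].index.tolist()
print(columns_with_missing)
```

In the notebook itself, the same idea would be `df.isnull().sum()`.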


# <font color='deepskyblue'>Solution 1 : Drop all the missing rows</font>

In [None]:
df1 = df.copy()  # Create a copy of the dataframe

In [None]:
# Summarize the shape of the raw data
print("Before:", df1.shape)

df1_dropped = df1.dropna()  # dropna() returns a new DataFrame, so keep the result

# Summarize the shape of the data with the incomplete rows removed
print("After:", df1_dropped.shape)

Before: (10, 4)
After: (8, 4)


# <font color='deepskyblue'>Solution 2 : Fill the missing values with mean/median/mode</font>

In [None]:
df2 = df1.copy()  # Create a copy of the dataframe

In [None]:
import warnings
warnings.filterwarnings('ignore')

In [None]:
df2['Age'] = df2['Age'].fillna(df2['Age'].median())  # Fill the missing Age values with the median
df2['Salary'] = df2['Salary'].fillna(df2['Salary'].mean())  # Fill the missing Salary values with the mean
print(df2.isnull().sum())

df2

Country      0
Age          0
Salary       0
Purchased    0
dtype: int64


Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.000000,No
1,Spain,27.0,48000.000000,Yes
2,Germany,30.0,54000.000000,No
3,Spain,38.0,61000.000000,No
4,Germany,40.0,63777.777778,Yes
5,France,35.0,58000.000000,Yes
6,Spain,38.0,52000.000000,No
7,France,48.0,79000.000000,Yes
8,Germany,50.0,83000.000000,No
9,France,37.0,67000.000000,Yes


# <font color='deepskyblue'>Step 4:  Encoding the categorical data</font>

In [None]:
# Find all the columns whose entries are categorical
categorical_columns = df2.select_dtypes(include=['object', 'category']).columns
print("Categorical columns:", categorical_columns)

Categorical columns: Index(['Country', 'Purchased'], dtype='object')


# <font color='deepskyblue'>Solution 1 : One-Hot Encoder </font>

In [None]:
df2

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.000000,No
1,Spain,27.0,48000.000000,Yes
2,Germany,30.0,54000.000000,No
3,Spain,38.0,61000.000000,No
4,Germany,40.0,63777.777778,Yes
5,France,35.0,58000.000000,Yes
6,Spain,38.0,52000.000000,No
7,France,48.0,79000.000000,Yes
8,Germany,50.0,83000.000000,No
9,France,37.0,67000.000000,Yes


In [None]:
from sklearn.preprocessing import OneHotEncoder  # Import the one-hot encoder (not used below)
import pandas as pd

# Use one-hot encoding to encode categorical values.
# pd.get_dummies does the encoding here. It runs on the original df,
# so the missing Salary in row 4 is still NaN in the output.
categorical_columns = df.select_dtypes(include=['object', 'category']).columns
print(f"Categorical columns: {categorical_columns}")
df_one_hot = pd.get_dummies(df, columns=categorical_columns, drop_first=True)
print(df_one_hot.head())

Categorical columns: Index(['Country', 'Purchased'], dtype='object')
    Age   Salary  Country_Germany  Country_Spain  Purchased_Yes
0  44.0  72000.0            False          False          False
1  27.0  48000.0            False           True           True
2  30.0  54000.0             True          False          False
3  38.0  61000.0            False           True          False
4  40.0      NaN             True          False           True


# <font color='deepskyblue'>Solution 2 : LabelEncoder</font>

In [None]:
import pandas as pd
from sklearn.preprocessing import LabelEncoder  # Import LabelEncoder

# Use label encoding to encode categorical variables.
# A separate name (df_label) keeps the main dataframe df from being overwritten.
data = {
    'Country': ['France', 'Spain', 'Germany']
}
df_label = pd.DataFrame(data)
label_encoder = LabelEncoder()
df_label['Country_encoded'] = label_encoder.fit_transform(df_label['Country'])
print(df_label)

   Country  Country_encoded
0   France                0
1    Spain                2
2  Germany                1
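The codes in the output follow the alphabetical order of the labels, which is why Spain gets 2 and Germany gets 1. A small sketch of inspecting that mapping and reversing it (the variable names here are illustrative only):

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
codes = encoder.fit_transform(["France", "Spain", "Germany"])

# classes_ lists the labels in sorted order; a label's position is its code.
print(list(encoder.classes_))

# inverse_transform maps the integer codes back to the original labels.
print(list(encoder.inverse_transform(codes)))
```

Note that the integer codes imply an order (France < Germany < Spain) that has no real meaning for countries, so one-hot encoding is often the safer choice for such features in distance-based models like KNN.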


# <font color='deepskyblue'>Step 5 : Splitting the dataset</font>
Splitting the dataset is the next step in data preprocessing. A dataset for a Machine Learning model is split into two separate sets: a training set, used to fit the model, and a test set, used to evaluate it on data the model has not seen.

In [None]:
print(X)
print(Y)

[['France' 44.0 72000.0]
 ['Spain' 27.0 48000.0]
 ['Germany' 30.0 54000.0]
 ['Spain' 38.0 61000.0]
 ['Germany' 40.0 nan]
 ['France' 35.0 58000.0]
 ['Spain' nan 52000.0]
 ['France' 48.0 79000.0]
 ['Germany' 50.0 83000.0]
 ['France' 37.0 67000.0]]
['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes' 'No' 'Yes']


In [None]:
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='mean')
X[:, 1:] = imputer.fit_transform(X[:, 1:])

In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
X[:,0] = le.fit_transform(X[:,0])
X = X.astype(float)

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size = 0.2, random_state = 1)

In [None]:
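# X was already imputed before the split, so X_train has no missing values left
# and this second imputation leaves it unchanged.
from sklearn.impute import SimpleImputer
imputer = SimpleImputer(strategy='mean')
X_train[:, 1:] = imputer.fit_transform(X_train[:, 1:])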

In [None]:
print(X_train)

[[2.00000000e+00 3.87777778e+01 5.20000000e+04]
 [1.00000000e+00 4.00000000e+01 6.37777778e+04]
 [0.00000000e+00 4.40000000e+01 7.20000000e+04]
 [2.00000000e+00 3.80000000e+01 6.10000000e+04]
 [2.00000000e+00 2.70000000e+01 4.80000000e+04]
 [0.00000000e+00 4.80000000e+01 7.90000000e+04]
 [1.00000000e+00 5.00000000e+01 8.30000000e+04]
 [0.00000000e+00 3.50000000e+01 5.80000000e+04]]


In [None]:
print(X_test)

[[1.0e+00 3.0e+01 5.4e+04]
 [0.0e+00 3.7e+01 6.7e+04]]


In [None]:
print(Y_train)

['No' 'Yes' 'No' 'No' 'Yes' 'Yes' 'No' 'Yes']


In [None]:
print(Y_test)

['No' 'Yes']
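In the cells above, the missing values are filled using means computed over all ten rows, so the test rows contribute to the fill values. A common alternative is to split first and learn the means from the training rows only. A minimal sketch of that ordering, using the Age and Salary values from the dataset but hypothetical variable names (`features`, `labels`, `f_train`, and so on):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Age and Salary values copied from the dataset, including the two missing entries.
features = np.array([
    [44.0, 72000.0], [27.0, 48000.0], [30.0, 54000.0], [38.0, 61000.0],
    [40.0, np.nan], [35.0, 58000.0], [np.nan, 52000.0], [48.0, 79000.0],
    [50.0, 83000.0], [37.0, 67000.0],
])
labels = np.array(["No", "Yes", "No", "No", "Yes", "Yes", "No", "Yes", "No", "Yes"])

# Split first, so the test rows play no part in computing the fill values.
f_train, f_test, l_train, l_test = train_test_split(
    features, labels, test_size=0.2, random_state=1
)

# Learn the column means from the training rows only, then reuse them on the test rows.
imputer = SimpleImputer(strategy="mean")
f_train = imputer.fit_transform(f_train)
f_test = imputer.transform(f_test)

print(f_train.shape, f_test.shape)
```

With only ten rows the difference is small, but on larger datasets this ordering keeps the test set a fair stand-in for unseen data.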


# <font color='deepskyblue'>Step 6 : Feature scaling</font>



Feature scaling marks the end of data preprocessing in Machine Learning. It is a method for bringing the independent variables of a dataset onto a common scale, so that features with large numeric ranges (such as Salary) do not dominate features with small ranges (such as Age).

Another reason to apply feature scaling is that some algorithms, such as those trained with gradient descent, converge much faster with it than without it.

# <font color='deepskyblue'>Standard Scaler</font>
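StandardScaler standardizes each feature by subtracting its mean and dividing by its standard deviation, so the scaled feature has mean 0 and unit variance. It does not make the data normally distributed; the shape of each feature's distribution is unchanged.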

# <img src="https://i.stack.imgur.com/PZgJ2.png" width="50%"/>


# <img src="https://journaldev.nyc3.digitaloceanspaces.com/2020/10/Standardization.png" width="50%"/>
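The standardization formula pictured above, z = (x - mean) / standard deviation, can be checked by hand. A minimal NumPy sketch, assuming a few made-up Age values (`ages` is a hypothetical array):

```python
import numpy as np

# Hypothetical Age values; any 1-D numeric array works.
ages = np.array([27.0, 35.0, 38.0, 44.0, 48.0])

# z = (x - mean) / std, using the population standard deviation (ddof=0),
# which is the convention StandardScaler uses.
z_scores = (ages - ages.mean()) / ages.std()

print(z_scores.mean())  # approximately 0, up to floating-point error
print(z_scores.std())   # approximately 1
```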


In [None]:
? # Import StandardScaler
sta = StandardScaler()
X_train[:, 1:] = sta.?(X_train[:, 1:]) #Standardize the data
X_test[:, 1:] = sta.?(X_test[:, 1:])
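One possible way to approach the placeholders above, shown on stand-in data rather than the notebook's `X_train` and `X_test`: fit the scaler on the training rows and reuse its statistics for the test rows. The arrays `train` and `test` below are assumptions made for this sketch.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical (Age, Salary) rows standing in for the numeric columns.
train = np.array([[44.0, 72000.0], [27.0, 48000.0], [38.0, 61000.0], [48.0, 79000.0]])
test = np.array([[30.0, 54000.0], [37.0, 67000.0]])

scaler = StandardScaler()

# fit_transform learns the mean and standard deviation from the training rows.
train_scaled = scaler.fit_transform(train)

# transform reuses those training statistics, so the scaler is not refit on the test rows.
test_scaled = scaler.transform(test)

print(train_scaled.mean(axis=0))  # close to 0 for each column
print(test_scaled)
```

Calling `fit_transform` on the test set as well would scale it with its own mean and standard deviation, which makes the two sets inconsistent.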

In [None]:
print(X_train)

In [None]:
print(X_test)

# <font color='deepskyblue'>Step 7 : KNN</font>

In [None]:
? # Import KNN and fit the data into the KNN model and generate predictions
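As a hedged illustration of the steps named in the comment above (import, fit, predict), here is a sketch on toy data. The arrays and the choice of `n_neighbors=3` are assumptions for this example, not values taken from the notebook.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical, already-scaled training and test features with string labels.
train_X = np.array([[0.0, 0.1], [0.2, 0.0], [1.0, 0.9], [0.9, 1.1]])
train_y = np.array(["No", "No", "Yes", "Yes"])
test_X = np.array([[0.1, 0.0], [1.0, 1.0]])

# n_neighbors=3 is an arbitrary choice for this sketch.
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(train_X, train_y)

# Each test point gets the majority label among its 3 nearest training points.
predictions = knn.predict(test_X)
print(predictions)
```

In the notebook, the same three calls would presumably take `X_train`, `Y_train`, and `X_test`.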

# <font color='deepskyblue'>Step 8 : Decision Tree Classifier</font>


In [None]:
? # Import DecisionTreeClassifier and fit the data into the DecisionTreeClassifier and generate predictions
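The same import, fit, predict pattern applies to a decision tree. A minimal sketch on toy data, where the arrays and `random_state=0` are assumptions made for repeatability rather than settings from the notebook:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training and test features with string labels.
train_X = np.array([[0.0, 0.1], [0.2, 0.0], [1.0, 0.9], [0.9, 1.1]])
train_y = np.array(["No", "No", "Yes", "Yes"])
test_X = np.array([[0.1, 0.0], [1.0, 1.0]])

# random_state fixes the tie-breaking between equally good splits.
tree = DecisionTreeClassifier(random_state=0)
tree.fit(train_X, train_y)

predictions = tree.predict(test_X)
print(predictions)
```

Unlike KNN, a decision tree splits on one feature at a time, so its predictions do not depend on feature scaling.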

# <font color='deepskyblue'>Bonus Task</font>
## Implementing KNN from scratch

In [None]:
# @title Run this cell to generate data
data = np.random.randint(0,100,size=(10,3))
for i in range(10):
    if (data[i][0] + data[i][1]) % 4 == 0:
        data[i][2] = 1
    else:
        data[i][2] = 0
data

array([[73, 19,  1],
       [19, 68,  0],
       [33, 38,  0],
       [74, 69,  0],
       [33, 95,  1],
       [46,  9,  0],
       [91, 35,  0],
       [93, 17,  0],
       [30, 31,  0],
       [72, 96,  1]])

### Your generated data is in a variable called `data`

In [None]:
# Convert the data array into a pandas DataFrame with column names `x1`, `x2`, `y`.
df = ?
df
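For reference, a sketch of naming the columns of a NumPy array when building a DataFrame. It uses a small hypothetical array `points` in the same layout as `data`, so it does not replace the exercise cell.

```python
import numpy as np
import pandas as pd

# Small hypothetical array in the same layout as `data`: two features and a label.
points = np.array([[73, 19, 1], [19, 68, 0], [33, 38, 0]])

# columns= assigns a name to each column of the array, in order.
points_df = pd.DataFrame(points, columns=["x1", "x2", "y"])
print(points_df)
```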

In [None]:
# generate a scatter plot between x1 and x2

?

In [None]:
# Find Euclidean distance between the first two rows of matrix data.
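The Euclidean distance is the square root of the sum of squared differences. A short sketch on two hypothetical rows (`points` is an assumed name), using all three columns as the comment above is worded:

```python
import numpy as np

# Two hypothetical rows in the same layout as `data`.
points = np.array([[73, 19, 1], [19, 68, 0]])

# Square root of the sum of squared differences between the two rows.
diff = points[0] - points[1]
distance = np.sqrt(np.sum(diff ** 2))
print(distance)

# np.linalg.norm computes the same quantity directly.
print(np.linalg.norm(diff))
```

To measure distance on the features only, the label column could be left out with `points[:, :2]`.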

In [None]:
# Create a vector V with three random values.

In [None]:
# Find the Euclidean distance between each row of M and V. Store the distances in a vector and print it.

In [None]:
# Sort the Dist matrix based on the last column. You can use argsort() to sort the matrix.
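The last three steps (a random vector, row-wise distances, sorting by the distance column) can be sketched together. This is one possible reading of the comments above, with hypothetical names (`points`, `v`, `dist_matrix`) and a seeded generator used only so the sketch is repeatable:

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded only so the sketch is repeatable

# Hypothetical rows in the same layout as `data`.
points = np.array([[73, 19, 1], [19, 68, 0], [33, 38, 0], [74, 69, 0]])

# Hypothetical query vector V with three random integer values.
v = rng.integers(0, 100, size=3)

# Broadcasting subtracts v from every row; axis=1 sums the squares within each row.
distances = np.sqrt(np.sum((points - v) ** 2, axis=1))

# Append the distances as a last column, then order the rows by that column.
dist_matrix = np.column_stack([points, distances])
order = np.argsort(dist_matrix[:, -1])
sorted_matrix = dist_matrix[order]

print(sorted_matrix)
```

After sorting, the first k rows are the k nearest neighbours of `v`, and a majority vote over their labels gives the KNN prediction.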