In [None]:
#1. importing the required libraries

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.decomposition import PCA


In [None]:
#2.creating a raw dataset

data={
    'Age':[25,30,np.nan,45,50],
    'Salary':[30000,40000,50000,60000,70000],
    'Gender':['male','female','female','male','female']
}

#storing data to tables
df=pd.DataFrame(data)
print("Data stored")
print(df,"\n")


#3.data preprocessing

#3.1. filling the missing values with the column mean
df['Age']=df['Age'].fillna(df['Age'].mean())  #assign back; inplace=True on a column selection is unreliable in recent pandas
print("Missing values updated")
print(df,"\n")


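#3.2. encoding data
  #->method1-label encoding (shown for comparison only; df is left unchanged)
label_encoder=LabelEncoder()
gender_labels=label_encoder.fit_transform(df['Gender'])  #classes are sorted alphabetically: female->0, male->1
print("Label encoded Gender")
print(gender_labels,"\n")

  #->method2-one hot encoding
df=pd.get_dummies(df,columns=['Gender'], dtype=int)  #replaces Gender with the dummy columns Gender_female and Gender_male
print("One hot encoded data")
print(df,"\n")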


#3.3 Scaling technique
  #->method1-Normalisation technique
scaler=MinMaxScaler()
df[['Age','Salary']]=scaler.fit_transform(df[['Age','Salary']])
print("Scaled data normalised")
print(df,"\n")

  #->method2-Standardisation technique (overwrites the normalised values)
std_scaler=StandardScaler()
df[['Age','Salary']]=std_scaler.fit_transform(df[['Age','Salary']])
print("Scaled data standardised")
print(df,"\n")

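#3.4 Feature transformation
df['salary_log']=np.log1p(np.array(data['Salary']))  #log(1+x) of the raw salaries; the standardised column has values below -1, where log(x+1) is undefined
print("Feature transformation")
print(df,"\n")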

#3.5 Handling outliers-winsorization
upper_limit=df['Salary'].quantile(0.95) #95th percentile; values above it are capped at this limit rather than dropped
df['Salary']=np.where(df['Salary']>upper_limit,upper_limit,df['Salary']) #np.where(condn,x,y): where condn is True, x is chosen; where False, y is chosen
print("Outliers capped")
print(df,"\n")

#3.6 Binning
df['Age_group']=pd.cut(df['Age'],bins=3,labels=['Young', 'Adult', 'Old'])  #3 equal-width bins
print("Binned data")
print(df,"\n")

#3.7 Feature construction
df['Age_to_salary']=df['Age']/(df['Salary']+1)
print("Feature constructed Age to salary")
print(df,"\n")

#3.8 Dimensionality reduction
pca=PCA(n_components=2)  #with only two input columns this rotates the data without dropping a dimension
reduced_data=pca.fit_transform(df[['Age','Salary']])
print("Reduced data")
print(reduced_data,"\n")


### Explanation of `np.where()`

`numpy.where()` is a NumPy function that returns elements chosen from `x` or `y` depending on `condition`. It works like an if-else statement, but it is vectorized: the choice is made element by element across whole arrays at once.

**Syntax:**
`numpy.where(condition, x, y)`

*   **`condition`**: An array-like object of booleans. Where `condition` is `True`, `x` is chosen; where `False`, `y` is chosen.
*   **`x`**: Values from which to choose when `condition` is `True`.
*   **`y`**: Values from which to choose when `condition` is `False`.
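
As a minimal sketch of the three arguments in action, the snippet below uses a small made-up array (the values are illustrative and do not come from the notebook):

```python
import numpy as np

# Illustrative values, not taken from the notebook
values = np.array([10, 25, 40, 55])

# Where the condition is True take x ("high"), otherwise take y ("low")
labels = np.where(values > 30, "high", "low")

# x and y may be scalars or arrays: here values above 30 are replaced by 30
capped = np.where(values > 30, 30, values)

print(labels)
print(capped)
```

The result always has the same shape as `condition`; scalars passed as `x` or `y` are broadcast to that shape.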

**How it was used in your notebook for handling outliers:**

In your notebook, `np.where()` was used in the line:
`df['Salary']=np.where(df['Salary']>upper_limit,upper_limit,df['Salary'])`

Let's break this down:

*   **`condition`**: `df['Salary'] > upper_limit`
    *   This condition checks, for each value in the 'Salary' column, if that value is greater than the calculated `upper_limit` (which was the 95th percentile of the 'Salary' column).

*   **`x`**: `upper_limit`
    *   If the condition is `True` (i.e., a salary is greater than the `upper_limit`), then that salary value is replaced with the `upper_limit`. This effectively 'caps' or 'winsorizes' the outliers at the upper end.

*   **`y`**: `df['Salary']`
    *   If the condition is `False` (i.e., a salary is not greater than the `upper_limit`), then the original salary value is retained.

**In summary:** This line of code replaces all salary values that are above the 95th percentile with the 95th percentile value itself, thus mitigating the impact of extreme high outliers.
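
To make the capping concrete, here is a small sketch on the raw (unscaled) salaries from the dataset. Running it on the raw values is an assumption made for readability; in the notebook the column has already been scaled by this point. `Series.clip` is shown only as an equivalent alternative, not as what the notebook uses.

```python
import numpy as np
import pandas as pd

# Raw salaries from the notebook's dataset (before any scaling)
salary = pd.Series([30000, 40000, 50000, 60000, 70000], dtype=float)

# 95th percentile, using pandas' default linear interpolation
upper_limit = salary.quantile(0.95)

# Cap with np.where, as in the notebook
capped_where = pd.Series(np.where(salary > upper_limit, upper_limit, salary))

# Series.clip expresses the same upper cap in a single call
capped_clip = salary.clip(upper=upper_limit)

print(upper_limit)
print(capped_where.tolist())
```

With five evenly spaced values the 95th percentile falls between 60000 and 70000 (about 68000), so only the largest salary is changed.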

### Standardization vs. Normalization

Both standardization and normalization are common data preprocessing techniques used to scale numerical features, but they achieve this in different ways and are suitable for different scenarios.

**1. Normalization (Min-Max Scaling)**

*   **Goal:** To scale features to a fixed range, usually between 0 and 1.
*   **Formula:** `X_scaled = (X - X_min) / (X_max - X_min)`
    *   `X`: The original feature value.
    *   `X_min`: The minimum value of the feature.
    *   `X_max`: The maximum value of the feature.
*   **Characteristics:**
    *   Transforms data to a specific, bounded range.
    *   Sensitive to outliers, as they will compress the range of the majority of the data.
*   **When to use:**
    *   When the data does not follow a Gaussian distribution, or its distribution is unknown.
    *   Algorithms that expect bounded inputs or are sensitive to feature scale (e.g., neural networks with sigmoid activation functions, k-nearest neighbors).
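
The formula can be checked by hand against `MinMaxScaler`. This is a sketch that assumes the `Age` column after the missing value has been filled with the mean (37.5):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Ages from the notebook after the missing value was filled with the mean
age = np.array([25.0, 30.0, 37.5, 45.0, 50.0])

# Min-max formula applied by hand
manual = (age - age.min()) / (age.max() - age.min())

# MinMaxScaler expects a 2-D array of shape (n_samples, n_features)
scaled = MinMaxScaler().fit_transform(age.reshape(-1, 1)).ravel()

print(manual)
print(np.allclose(manual, scaled))
```

The smallest age maps to 0, the largest to 1, and the values in between (0.2, 0.5, 0.8) keep their relative spacing.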

**2. Standardization (Z-score Normalization)**

*   **Goal:** To scale features such that they have a mean of 0 and a standard deviation of 1.
*   **Formula:** `X_scaled = (X - μ) / σ`
    *   `X`: The original feature value.
    *   `μ`: The mean of the feature.
    *   `σ`: The standard deviation of the feature.
*   **Characteristics:**
    *   Does not bound values to a specific range, but it makes them unit-less.
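    *   Less affected by outliers than min-max scaling, because a single extreme value does not fix the output range. The mean and standard deviation are still pulled by outliers, however, so standardization is not a robust method in the strict sense.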
*   **When to use:**
    *   When the data follows a Gaussian (bell curve) distribution.
    *   Algorithms that assume a Gaussian distribution or are sensitive to the scale of features (e.g., Linear Regression, Logistic Regression, Support Vector Machines, K-Means Clustering, Principal Component Analysis).
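
The same kind of check works for standardization. This sketch uses the raw salaries from the dataset; the detail worth noticing is that `StandardScaler` divides by the population standard deviation (`ddof=0`), which is also NumPy's default, whereas pandas' `Series.std()` defaults to `ddof=1`.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Raw salaries from the notebook's dataset
salary = np.array([30000.0, 40000.0, 50000.0, 60000.0, 70000.0])

# Z-score by hand; np.std uses ddof=0 by default, matching StandardScaler
manual = (salary - salary.mean()) / salary.std()

# StandardScaler expects a 2-D array of shape (n_samples, n_features)
scaled = StandardScaler().fit_transform(salary.reshape(-1, 1)).ravel()

print(manual)
print(np.allclose(manual, scaled))
```

After the transformation the column has mean 0 and standard deviation 1, but its values are not confined to a fixed range.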

In your notebook, you applied both:

*   **Normalization** using `MinMaxScaler()`: `df[['Age','Salary']]=scaler.fit_transform(df[['Age','Salary']])`
*   **Standardization** using `StandardScaler()`: `df[['Age','Salary']]=std_scaler.fit_transform(df[['Age','Salary']])`

Note that applying both sequentially means the last one applied (Standardization in this case) is the one that determines the final scaling of the `Age` and `Salary` columns.
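
A short sketch of why this is the case: min-max scaling only shifts and rescales each column by a positive factor, and the z-score is unchanged by such a transformation. Standardizing the normalized data therefore gives the same result as standardizing the original data directly. The array below assumes the `Age` and `Salary` values after the missing age was filled.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Age and Salary columns after the missing age was filled with the mean
X = np.array([
    [25.0, 30000.0],
    [30.0, 40000.0],
    [37.5, 50000.0],
    [45.0, 60000.0],
    [50.0, 70000.0],
])

# Normalise first, then standardise (the order used in the notebook)
both = StandardScaler().fit_transform(MinMaxScaler().fit_transform(X))

# Standardise the original values directly
only_std = StandardScaler().fit_transform(X)

print(np.allclose(both, only_std))
```

In practice this means one scaler should be chosen per feature; chaining the two adds no information.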

### Explanation of One-Hot Encoding

One-hot encoding converts a categorical variable into numeric columns that ML algorithms can work with. If a categorical feature has `n` unique values, one-hot encoding transforms it into `n` new features (dummy variables), one per unique value. Each new feature has a value of `1` in the rows where the original feature had that category, and `0` otherwise.

In your notebook, for the `Gender` column, which contained 'male' and 'female':

1. **`LabelEncoder()`** was initially used, which assigns `0` to 'female' and `1` to 'male' (classes are sorted alphabetically). However, integer codes can suggest an ordinal relationship that might not exist.

2. **`pd.get_dummies(df, columns=['Gender'])`** then correctly applied one-hot encoding. It created two new columns: `Gender_female` and `Gender_male`. For each row, one of these columns will have a `1` and the other a `0`, indicating the gender.

If `get_dummies` is applied to the `LabelEncoder` output instead of the original strings, the resulting columns are named `Gender_0` and `Gender_1`, which correspond to the encoded 'female' and 'male' categories, respectively.
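
The two encodings, and the two possible sets of column names, can be compared in a small sketch. It rebuilds only the `Gender` column from the dataset; the variable names are chosen for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Gender column from the notebook's dataset
df = pd.DataFrame({"Gender": ["male", "female", "female", "male", "female"]})

# Label encoding: classes are sorted alphabetically, so female -> 0, male -> 1
encoder = LabelEncoder()
codes = encoder.fit_transform(df["Gender"])

# One-hot encoding of the original strings
one_hot = pd.get_dummies(df, columns=["Gender"], dtype=int)

# One-hot encoding of the integer codes produced by LabelEncoder
one_hot_from_codes = pd.get_dummies(
    pd.DataFrame({"Gender": codes}), columns=["Gender"], dtype=int
)

print(list(encoder.classes_), list(codes))
print(list(one_hot.columns))
print(list(one_hot_from_codes.columns))
```

Encoding the strings directly keeps the category names in the column headers (`Gender_female`, `Gender_male`), which is usually easier to read than `Gender_0` and `Gender_1`.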