# Data Preprocessing with StandardScaler and PCA
**Authors:** Magudeshwaran and Senthilkumaran

**Goal:** To demonstrate common data preprocessing techniques, including feature scaling with `StandardScaler` and dimensionality reduction with `PCA`. 

### Step 1: Import Libraries and Load Data

We'll use the Iris dataset from `sklearn.datasets` as our sample data.

In [2]:
import pandas as pd
import numpy as np
from sklearn.datasets import load_iris

# Load the Iris dataset
iris = load_iris()
x = iris.data
y = iris.target
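
Before preprocessing, it can help to confirm what was loaded. The following is a minimal sketch of such a check; it reloads the dataset so that it runs on its own, and it uses only attributes that `load_iris()` provides (`data`, `target`, `feature_names`, `target_names`).

```python
from sklearn.datasets import load_iris

iris = load_iris()
x = iris.data
y = iris.target

# 150 samples with 4 numeric features each, plus one integer class label per sample
print(x.shape, y.shape)
print(iris.feature_names)
print(iris.target_names)
```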

### Step 2: Feature Scaling with `StandardScaler`

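`StandardScaler` standardizes each feature by subtracting its mean and dividing by its standard deviation, so every feature ends up with zero mean and unit variance. This matters for algorithms that are sensitive to feature scale, such as PCA, k-nearest neighbors, and support vector machines, where a feature with a large numeric range would otherwise dominate the result.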

In [4]:
from sklearn.preprocessing import StandardScaler
# Learn each feature's mean and standard deviation, then standardize the data
sc = StandardScaler()
x_scaled = sc.fit_transform(x)
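
As a sanity check on what the scaler does, the sketch below recomputes the standardization by hand as `(x - mean) / std` per column and compares it with the `StandardScaler` output. The name `manual_scaled` is illustrative and not part of the original notebook; the block reloads the data so that it runs on its own.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

x = load_iris().data
x_scaled = StandardScaler().fit_transform(x)

# Standardize by hand: subtract each column's mean and divide by its
# population standard deviation (ddof=0, which is NumPy's default)
manual_scaled = (x - x.mean(axis=0)) / x.std(axis=0)

print(np.allclose(x_scaled, manual_scaled))      # same values as the scaler
print(np.allclose(x_scaled.mean(axis=0), 0.0))   # each column has mean ~0
print(np.allclose(x_scaled.std(axis=0), 1.0))    # each column has std ~1
```

All three checks should print `True`, up to floating-point tolerance.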

### Step 3: Label Encoding

`LabelEncoder` converts categorical labels (like 'setosa', 'versicolor', 'virginica') into integers (0, 1, 2). The Iris dataset target is already numerical, so encoding it leaves the values unchanged; the step is included to show how it would be done.

In [5]:
from sklearn.preprocessing import LabelEncoder
# Map each distinct label to an integer (a no-op here, since y is already 0, 1, 2)
lb = LabelEncoder()
y_encoded = lb.fit_transform(y)
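
Because the Iris target is already numeric, the effect of the encoder is easier to see on string labels. The sketch below builds a string label array from `iris.target_names` (an illustrative step that is not part of the original notebook) and encodes it; the names `species`, `encoder`, and `species_encoded` are assumptions made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import LabelEncoder

iris = load_iris()

# Turn the integer targets back into species names, e.g. 0 -> 'setosa'
species = iris.target_names[iris.target]

encoder = LabelEncoder()
species_encoded = encoder.fit_transform(species)

print(encoder.classes_)                                  # labels in sorted order
print(species_encoded[:5])                               # integer codes
print(encoder.inverse_transform(species_encoded[:5]))    # back to strings
```

`LabelEncoder` assigns codes in sorted label order, so here the codes coincide with the original `iris.target` values.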

### Step 4: Dimensionality Reduction with PCA

Principal Component Analysis (PCA) reduces the number of features in a dataset by projecting the data onto a smaller set of orthogonal directions (the principal components) that capture as much of the variance as possible. Here, we reduce the 4 standardized features of the Iris dataset down to 2 principal components for easy visualization.

In [7]:
from sklearn.decomposition import PCA

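# Project the standardized features onto the first two principal components
pcaRes = PCA(n_components=2)
principalComponents = pcaRes.fit_transform(x_scaled)

# Label the components in a DataFrame, then convert back to a NumPy array for plotting
# (fit_transform already returns an array, so this round trip is optional)
pcaFinal = pd.DataFrame(data=principalComponents, columns=['Att1', 'Att2'])
pcaFinal = np.array(pcaFinal)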
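
To see how much information the two components retain, one option is to inspect the fitted model's `explained_variance_ratio_` attribute, which gives the fraction of total variance captured by each component. The sketch below repeats the scaling and PCA steps so that it runs on its own; the variable name `pca` is illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

x_scaled = StandardScaler().fit_transform(load_iris().data)
pca = PCA(n_components=2).fit(x_scaled)

# Fraction of the total variance captured by each principal component
print(pca.explained_variance_ratio_)

# Combined fraction retained by the 2-component projection
print(pca.explained_variance_ratio_.sum())
```

A sum close to 1 suggests that little variance is lost by dropping the remaining components.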

### Step 5: Visualize the Results

Now we can plot the 2 principal components and color the points by their original class to see how well the classes are separated in the reduced feature space.

In [10]:
import matplotlib.pyplot as plt
plt.scatter(pcaFinal[:, 0], pcaFinal[:, 1], c=y, cmap='autumn')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA of Iris Dataset')
plt.show()