# Experiment 1: Data Preprocessing Techniques Using Pandas and Scikit-Learn

## Aim
To explore and apply various data preprocessing techniques using Pandas and Scikit-Learn for preparing datasets for machine learning tasks.

## Objectives
- Understand the importance of data preprocessing in machine learning.
- Implement common preprocessing steps such as handling missing data, encoding categorical variables, scaling features, and splitting data.

## Tools Used
- **Pandas**: For data manipulation.
- **Scikit-Learn**: For preprocessing utilities.
- **NumPy**: For numerical operations.

## Implementation

### Step 1: Import Libraries
```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder, OneHotEncoder
```

### Step 2: Create a Sample Dataset
```python
# Sample dataset creation
data = {
    'Age': [25, 27, np.nan, 29, 24],
    'Salary': [50000, 54000, np.nan, 62000, 58000],
    'Gender': ['Male', 'Female', 'Female', 'Male', np.nan],
    'Purchased': ['No', 'Yes', 'No', 'Yes', 'No']
}
df = pd.DataFrame(data)

# Display the dataset
print("Original Dataset:")
print(df)
```
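Before choosing an imputation strategy, it helps to see where the gaps are. The following is a minimal sketch that rebuilds the same sample data under the hypothetical name `demo_df` (so it does not overwrite `df`) and counts missing values per column.

```python
import numpy as np
import pandas as pd

# Same sample values as above, under a separate (hypothetical) name
demo_df = pd.DataFrame({
    'Age': [25, 27, np.nan, 29, 24],
    'Salary': [50000, 54000, np.nan, 62000, 58000],
    'Gender': ['Male', 'Female', 'Female', 'Male', np.nan],
    'Purchased': ['No', 'Yes', 'No', 'Yes', 'No'],
})

# Number of missing entries in each column
missing_counts = demo_df.isna().sum()
print(missing_counts)

# Column dtypes indicate which columns are numeric and which are categorical
print(demo_df.dtypes)
```

Each of `Age`, `Salary`, and `Gender` has one missing entry, while `Purchased` is complete.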

### Step 3: Handle Missing Data
```python
# Handling missing numerical data with mean imputation
num_imputer = SimpleImputer(strategy='mean')
df[['Age', 'Salary']] = num_imputer.fit_transform(df[['Age', 'Salary']])

# Handling missing categorical data with mode imputation
# fit_transform returns a 2D array, so flatten it before assigning to a single column
cat_imputer = SimpleImputer(strategy='most_frequent')
df['Gender'] = cat_imputer.fit_transform(df[['Gender']]).ravel()

print("\nDataset After Handling Missing Data:")
print(df)
```
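To see which values the imputers actually fill in, one option is to inspect the fitted `statistics_` attribute. The sketch below uses a separate copy of the sample data (the name `impute_demo` is hypothetical). Note that `Gender` has a tie between `Male` and `Female`; with `strategy='most_frequent'`, scikit-learn returns the smallest of the tied values, which for strings is the alphabetically first one.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Separate copy of the sample data (hypothetical name)
impute_demo = pd.DataFrame({
    'Age': [25, 27, np.nan, 29, 24],
    'Salary': [50000, 54000, np.nan, 62000, 58000],
    'Gender': ['Male', 'Female', 'Female', 'Male', np.nan],
})

num_cols = ['Age', 'Salary']
num_imputer = SimpleImputer(strategy='mean')
impute_demo[num_cols] = num_imputer.fit_transform(impute_demo[num_cols])

# statistics_ holds the per-column fill values learned during fit
num_fill_values = dict(zip(num_cols, num_imputer.statistics_))
print(num_fill_values)

cat_imputer = SimpleImputer(strategy='most_frequent')
impute_demo['Gender'] = cat_imputer.fit_transform(impute_demo[['Gender']]).ravel()
print(cat_imputer.statistics_)

print(impute_demo)
```

The mean of the four observed ages is 26.25 and the mean of the four observed salaries is 56000, so those are the values written into row 2.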

### Step 4: Encode Categorical Variables
```python
# Label encoding for the target variable 'Purchased'
label_encoder = LabelEncoder()
df['Purchased'] = label_encoder.fit_transform(df['Purchased'])

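# One-hot encoding for the 'Gender' column
# drop_first=True keeps a single indicator column; dtype=int gives 0/1 instead of booleans
one_hot = pd.get_dummies(df['Gender'], prefix='Gender', drop_first=True, dtype=int)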
df = pd.concat([df.drop('Gender', axis=1), one_hot], axis=1)

print("\nDataset After Encoding Categorical Variables:")
print(df)
```

### Step 5: Scale Features
```python
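# Standard scaling for numerical features (zero mean, unit variance per column)
# Note: the scaler is fitted on the full dataset here for simplicity; in a real
# workflow, fit it on the training set only so test data does not influence it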
scaler = StandardScaler()
df[['Age', 'Salary']] = scaler.fit_transform(df[['Age', 'Salary']])

print("\nDataset After Feature Scaling:")
print(df)
```
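For comparison, the sketch below applies both `StandardScaler` and `MinMaxScaler` to the imputed numeric columns and checks the standardized result against a manual calculation. It assumes the imputed values from Step 3 (26.25 and 56000); the names `features`, `standardized`, `manual`, and `min_max` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Numeric columns after mean imputation (hypothetical standalone copy)
features = pd.DataFrame({
    'Age': [25, 27, 26.25, 29, 24],
    'Salary': [50000, 54000, 56000, 62000, 58000],
}, dtype=float)

# Standardization: (x - mean) / std, using the population std (ddof=0)
standardized = StandardScaler().fit_transform(features)
manual = (features - features.mean()) / features.std(ddof=0)
print(np.allclose(standardized, manual.to_numpy()))

# Min-max scaling: (x - min) / (max - min), mapping each column to [0, 1]
min_max = MinMaxScaler().fit_transform(features)
print(pd.DataFrame(min_max, columns=features.columns))
```

Standardization is a common default for models that assume centered inputs, while min-max scaling may be preferable when a bounded range is needed.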

### Step 6: Split the Data into Training and Testing Sets
```python
# Split the dataset into features and target variable
X = df.drop('Purchased', axis=1)
y = df['Purchased']

# Split into training and testing sets
# With 5 rows, test_size=0.2 holds out a single row; random_state makes the shuffle reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("\nTraining Features:")
print(X_train)
print("\nTesting Features:")
print(X_test)
print("\nTraining Target:")
print(y_train)
print("\nTesting Target:")
print(y_test)
```
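The steps above impute and scale the whole dataset before splitting, which lets information from the test rows influence the preprocessing. One possible alternative, sketched here and not part of the experiment as written, is to split first and fit the preprocessing on the training rows only using `Pipeline` and `ColumnTransformer`. All variable names in this block are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Raw sample data, before any preprocessing
raw = pd.DataFrame({
    'Age': [25, 27, np.nan, 29, 24],
    'Salary': [50000, 54000, np.nan, 62000, 58000],
    'Gender': ['Male', 'Female', 'Female', 'Male', np.nan],
    'Purchased': ['No', 'Yes', 'No', 'Yes', 'No'],
})
X_raw = raw.drop('Purchased', axis=1)
y_raw = raw['Purchased'].map({'No': 0, 'Yes': 1})

# Split first, so the test row plays no part in fitting
Xr_train, Xr_test, yr_train, yr_test = train_test_split(
    X_raw, y_raw, test_size=0.2, random_state=42
)

numeric_steps = Pipeline([
    ('impute', SimpleImputer(strategy='mean')),
    ('scale', StandardScaler()),
])
categorical_steps = Pipeline([
    ('impute', SimpleImputer(strategy='most_frequent')),
    # handle_unknown='ignore' avoids errors on categories unseen during fit
    ('encode', OneHotEncoder(handle_unknown='ignore')),
])

# sparse_threshold=0 forces a dense array as output
preprocessor = ColumnTransformer([
    ('num', numeric_steps, ['Age', 'Salary']),
    ('cat', categorical_steps, ['Gender']),
], sparse_threshold=0)

# Fit on training rows only, then apply the fitted transform to the test row
X_train_prepared = preprocessor.fit_transform(Xr_train)
X_test_prepared = preprocessor.transform(Xr_test)

print(X_train_prepared.shape, X_test_prepared.shape)
```

With only five rows the difference is small, but on larger datasets this ordering gives a more honest estimate of how a model performs on unseen data.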

### Step 7: Summary and Observations
```python
print("\nSummary:")
print("1. Missing data was handled using mean and mode imputation.")
print("2. Categorical variables were encoded using label encoding and one-hot encoding.")
print("3. Numerical features were scaled using standard scaling.")
print("4. The dataset was split into training and testing sets for model evaluation.")
```
