## 📌 Student Information

**Student Name:** _____________________________  
**Roll Number:** ____________________________  
**Date of Practical:** ______________________  
**IDE Used:** ☐ Jupyter ☐ Anaconda ☐ VS Code ☐ Colab ☐ PyCharm  
**Practical Status:** ☐ In Progress ☐ Completed

## Step 1: Import Required Libraries

This cell imports all necessary libraries for data preprocessing.

In [None]:
# Data manipulation
import numpy as np
import pandas as pd

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Preprocessing
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder, OneHotEncoder
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

# Utilities
import warnings
warnings.filterwarnings('ignore')

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

import sklearn  # imported here only to report its version

print("✅ All libraries imported successfully!")
print(f"NumPy Version: {np.__version__}")
print(f"Pandas Version: {pd.__version__}")
print(f"Scikit-learn Version: {sklearn.__version__}")

## Step 2: Create Sample Dataset

Since we don't have an external file, we'll create a realistic sample dataset with common data quality issues.

In [None]:
# Set random seed for reproducibility
np.random.seed(42)

# Create sample dataset with realistic data quality issues
n_samples = 300

data = {
    'price': np.random.normal(300000, 100000, n_samples),
    'square_feet': np.random.normal(2000, 500, n_samples),
    'bedrooms': np.random.choice([1, 2, 3, 4, 5, np.nan], n_samples),
    'bathrooms': np.random.choice([1, 1.5, 2, 2.5, 3, np.nan], n_samples),
    'age': np.random.choice(range(0, 100), n_samples),
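    # dtype=object keeps np.nan as a real missing value instead of the string 'nan'
    'city': np.random.choice(np.array(['New York', 'new york', 'Los Angeles', 'los angeles', 'Chicago', np.nan], dtype=object), n_samples),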
    'garage': np.random.choice([0, 1, 2, 3, np.nan], n_samples)
}

df_raw = pd.DataFrame(data)

# Add some outliers intentionally
df_raw.loc[10, 'price'] = 5000000  # Unrealistic price
df_raw.loc[20, 'square_feet'] = 10000  # Unrealistic size

print("✅ Sample dataset created!")
print(f"\nDataset Shape: {df_raw.shape}")
print(f"\nFirst 10 rows:")
print(df_raw.head(10))

## Step 3: Exploratory Data Analysis (EDA)

Analyze data quality issues before preprocessing.

In [None]:
print("="*70)
print("DATA QUALITY ANALYSIS - BEFORE PREPROCESSING")
print("="*70)

# 1. Data types and basic info
print("\n1️⃣  DATA TYPES AND BASIC INFO:")
print(df_raw.dtypes)

# 2. Missing values analysis
print("\n2️⃣  MISSING VALUES ANALYSIS:")
missing_count = df_raw.isnull().sum()
missing_percent = (df_raw.isnull().sum() / len(df_raw) * 100).round(2)
missing_df = pd.DataFrame({
    'Column': missing_count.index,
    'Missing_Count': missing_count.values,
    'Missing_Percentage': missing_percent.values
})
print(missing_df)
print(f"Total Missing Values: {df_raw.isnull().sum().sum()}")

# 3. Duplicate rows
print(f"\n3️⃣  DUPLICATE ROWS: {df_raw.duplicated().sum()} duplicates found")

# 4. Statistical summary
print("\n4️⃣  STATISTICAL SUMMARY:")
print(df_raw.describe())

# 5. Data consistency issues
print("\n5️⃣  CONSISTENCY ISSUES:")
print(f"Unique cities: {df_raw['city'].unique()}")
print("Note: Inconsistent case in city names (New York vs new york)")
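
Printing the unique values shows the case problem by eye, but it can also be measured. The sketch below is only an illustration on a hypothetical `cities` series (the trailing space in one entry is an assumption added to show why stripping matters): it compares the number of unique values before and after normalization and lists which normalized names have more than one spelling.

```python
import pandas as pd

# Hypothetical sample with inconsistent case and one trailing space
cities = pd.Series(["New York", "new york", "Los Angeles", "los angeles ", "Chicago"])

# Unique spellings as stored
raw_unique = cities.nunique()

# Normalize: remove surrounding whitespace, then lowercase
normalized = cities.str.strip().str.lower()
clean_unique = normalized.nunique()

# For each normalized name, count how many distinct raw spellings map to it
spellings_per_city = cities.groupby(normalized).nunique()
inconsistent = spellings_per_city[spellings_per_city > 1].index.tolist()

print(f"Unique values before normalization: {raw_unique}")
print(f"Unique values after normalization:  {clean_unique}")
print(f"Cities with more than one spelling: {inconsistent}")
```

A gap between the two counts is a quick signal that a text column needs standardizing before encoding.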

## Step 4: Data Cleaning Phase 1 - Handle Missing Values

Different strategies for handling missing values based on data characteristics.

In [None]:
df_cleaned = df_raw.copy()

print("PHASE 1: HANDLING MISSING VALUES")
print("="*70)

# Strategy: Drop rows where critical columns have missing values
print("\nDropping rows with missing values in critical columns (price, square_feet)...")
df_cleaned = df_cleaned.dropna(subset=['price', 'square_feet'])
print(f"✅ Rows after dropping: {len(df_cleaned)} (removed {len(df_raw) - len(df_cleaned)} rows)")

# Strategy: Impute missing numerical values with mean
print("\nImputing numerical missing values with mean...")
numerical_cols = ['bedrooms', 'bathrooms', 'garage']
imputer = SimpleImputer(strategy='mean')
df_cleaned[numerical_cols] = imputer.fit_transform(df_cleaned[numerical_cols])
print(f"✅ Numerical imputation complete")

# Strategy: Fill categorical missing values with mode (most common value)
print("\nFilling categorical missing values with mode...")
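# Assign the result back; calling fillna(inplace=True) on a selected column is deprecated in recent pandas
df_cleaned['city'] = df_cleaned['city'].fillna(df_cleaned['city'].mode()[0])
print(f"✅ Categorical imputation complete")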

# Verify no missing values remain
print(f"\n✅ RESULT: Total missing values after cleaning: {df_cleaned.isnull().sum().sum()}")
print(f"   Dataset shape: {df_cleaned.shape}")
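
Mean imputation fills every gap in a column with the same value. `KNNImputer`, which Step 1 imports but the cell above does not use, instead fills each gap from the most similar rows. The following is a minimal sketch on a hypothetical `toy` frame (its values are made up for illustration), not a replacement for the cell above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer

# Hypothetical toy data: small homes have 2 bedrooms, large homes have 4
toy = pd.DataFrame({
    "square_feet": [900.0, 1000.0, 1100.0, 2900.0, 3000.0, 3100.0],
    "bedrooms":    [2.0,   np.nan, 2.0,    4.0,    np.nan, 4.0],
})

# Mean imputation: both gaps receive the column mean
mean_filled = pd.DataFrame(
    SimpleImputer(strategy="mean").fit_transform(toy), columns=toy.columns
)

# KNN imputation: each gap receives the average of its 2 nearest rows,
# where "nearest" is measured on the features that are present
knn_filled = pd.DataFrame(
    KNNImputer(n_neighbors=2).fit_transform(toy), columns=toy.columns
)

print("Mean imputation:")
print(mean_filled)
print("\nKNN imputation:")
print(knn_filled)
```

Because `KNNImputer` is distance-based, features on very different scales can dominate the neighbor search, so it is usually worth considering scaling before using it on real data.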

## Step 5: Data Cleaning Phase 2 - Remove Duplicates

Identify and remove duplicate rows from the dataset.

In [None]:
print("PHASE 2: REMOVING DUPLICATES")
print("="*70)

duplicates_count = df_cleaned.duplicated().sum()
print(f"\nDuplicate rows found: {duplicates_count}")

if duplicates_count > 0:
    print(f"\nSample duplicate rows:")
    print(df_cleaned[df_cleaned.duplicated(keep=False)].head())
    
    # Remove duplicates
    df_cleaned = df_cleaned.drop_duplicates()
    print(f"\n✅ Duplicates removed")
else:
    print("✅ No duplicates found")

print(f"\n✅ RESULT: Dataset shape after removing duplicates: {df_cleaned.shape}")
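
One thing to keep in mind: `duplicated()` compares values exactly, so two rows that differ only in text case or whitespace are not reported as duplicates. The sketch below uses a hypothetical `listings` frame (the rows and the city "Boston" are invented for illustration) to show how normalizing text first can reveal such hidden duplicates.

```python
import pandas as pd

# Hypothetical listings: the first two rows describe the same record
listings = pd.DataFrame({
    "city": ["Chicago", "chicago ", "Boston"],
    "price": [250000, 250000, 400000],
})

# Exact comparison misses the duplicate because the city strings differ
exact_dupes = int(listings.duplicated().sum())

# Normalize the text column, then check again
normalized = listings.assign(city=listings["city"].str.strip().str.lower())
hidden_dupes = int(normalized.duplicated().sum())

# Keep the first occurrence of each duplicated row
deduped = normalized.drop_duplicates(keep="first")

print(f"Duplicates found on raw text:        {exact_dupes}")
print(f"Duplicates found after normalizing:  {hidden_dupes}")
print(deduped)
```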

## Step 6: Data Cleaning Phase 3 - Detect and Treat Outliers

Using IQR (Interquartile Range) method to identify and handle outliers.

In [None]:
print("PHASE 3: DETECTING AND TREATING OUTLIERS")
print("="*70)

def detect_outliers_iqr(data, column, multiplier=1.5):
    """Detect outliers using IQR method."""
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - multiplier * IQR
    upper_bound = Q3 + multiplier * IQR
    
    outliers = data[(data[column] < lower_bound) | (data[column] > upper_bound)]
    return outliers, lower_bound, upper_bound

# Detect outliers in numerical columns
numerical_cols_all = df_cleaned.select_dtypes(include=[np.number]).columns
outlier_summary = {}

print(f"\nAnalyzing numerical columns for outliers:")
for col in numerical_cols_all:
    outliers, lower, upper = detect_outliers_iqr(df_cleaned, col)
    outlier_summary[col] = len(outliers)
    print(f"  {col:15} - Outliers: {len(outliers):3} | Bounds: [{lower:10.2f}, {upper:10.2f}]")

total_outliers = sum(outlier_summary.values())
print(f"\nTotal outlier flags across all columns (a row may be flagged in more than one column): {total_outliers}")

# Remove outliers for price and square_feet
df_cleaned_no_outliers = df_cleaned.copy()
for col in ['price', 'square_feet']:
    _, lower, upper = detect_outliers_iqr(df_cleaned_no_outliers, col)
    df_cleaned_no_outliers = df_cleaned_no_outliers[
        (df_cleaned_no_outliers[col] >= lower) & (df_cleaned_no_outliers[col] <= upper)
    ]

outliers_removed = len(df_cleaned) - len(df_cleaned_no_outliers)
print(f"\n✅ RESULT: {outliers_removed} rows with outliers removed")
print(f"   Dataset shape after outlier removal: {df_cleaned_no_outliers.shape}")

df_cleaned = df_cleaned_no_outliers  # Update for next phases
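
The cell above treats outliers by deleting rows. When losing rows is undesirable, a common alternative is to cap (clip) values at the IQR bounds instead. This is a sketch under that assumption; the helper `cap_outliers_iqr` and the `prices` series are hypothetical names introduced only for this example.

```python
import pandas as pd

def cap_outliers_iqr(series, multiplier=1.5):
    """Clip values outside the IQR bounds to the nearest bound."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - multiplier * iqr
    upper = q3 + multiplier * iqr
    return series.clip(lower=lower, upper=upper)

# Hypothetical prices with one extreme value
prices = pd.Series([100.0, 110.0, 120.0, 130.0, 140.0, 1000.0])
capped = cap_outliers_iqr(prices)

print("Original:", prices.tolist())
print("Capped:  ", capped.tolist())
print(f"Rows kept: {len(capped)} of {len(prices)}")
```

Capping keeps every row but changes the extreme values, so which approach is appropriate depends on whether the outliers are errors or genuine rare observations.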

## Step 7: Data Cleaning Phase 4 - Ensure Data Consistency

Standardize and validate data formats and values.

In [None]:
print("PHASE 4: ENSURING DATA CONSISTENCY")
print("="*70)

# Standardize categorical text (lowercase, strip whitespace)
print("\nStandardizing categorical values...")
print(f"Before: {df_cleaned['city'].unique()}")
df_cleaned['city'] = df_cleaned['city'].str.lower().str.strip()
print(f"After:  {df_cleaned['city'].unique()}")

# Round numerical values to reasonable precision
print("\nRounding numerical values to reasonable precision...")
df_cleaned['bedrooms'] = df_cleaned['bedrooms'].round(0).astype(int)
df_cleaned['bathrooms'] = df_cleaned['bathrooms'].round(1)
df_cleaned['garage'] = df_cleaned['garage'].round(0).astype(int)

# Validate data ranges
print("\nValidating data ranges...")
assert df_cleaned['price'].min() > 0, "Price should be positive"
assert df_cleaned['square_feet'].min() > 0, "Square feet should be positive"
assert df_cleaned['bedrooms'].min() >= 0, "Bedrooms should be non-negative"
print("✅ All data validations passed")

print(f"\n✅ RESULT: Data consistency check completed")
print(f"\nCleaned dataset (first 10 rows):")
print(df_cleaned.head(10))

## Step 8: Feature Scaling (Normalization)

Scale numerical features to comparable ranges.

In [None]:
print("PHASE 5: FEATURE SCALING")
print("="*70)

# Identify numerical features
numerical_features = df_cleaned.select_dtypes(include=[np.number]).columns.tolist()
print(f"\nNumerical features to scale: {numerical_features}")

# Apply StandardScaler (mean=0, std=1)
print("\n1️⃣  StandardScaler (mean=0, std=1):")
scaler_standard = StandardScaler()
df_scaled_standard = df_cleaned.copy()
df_scaled_standard[numerical_features] = scaler_standard.fit_transform(df_cleaned[numerical_features])

print("\nBefore StandardScaler:")
print(df_cleaned[numerical_features].describe().round(2))

print("\nAfter StandardScaler:")
print(df_scaled_standard[numerical_features].describe().round(2))

# Apply MinMaxScaler (range 0-1)
print("\n2️⃣  MinMaxScaler (range 0-1):")
scaler_minmax = MinMaxScaler()
df_scaled_minmax = df_cleaned.copy()
df_scaled_minmax[numerical_features] = scaler_minmax.fit_transform(df_cleaned[numerical_features])

print("\nAfter MinMaxScaler:")
print(df_scaled_minmax[numerical_features].describe().round(2))

print("\n✅ RESULT: Feature scaling completed (using StandardScaler for further processing)")
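
To make the two scalers less of a black box, the sketch below reproduces their arithmetic by hand on a tiny hypothetical array and then applies the already fitted scalers to a value that was not part of the fit. The arrays `train` and `new` are illustrative assumptions, not part of the practical's dataset.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical single feature (for example, square feet)
train = np.array([[1000.0], [2000.0], [3000.0]])
new = np.array([[4000.0]])  # value not seen when the scalers were fitted

std_scaler = StandardScaler().fit(train)
mm_scaler = MinMaxScaler().fit(train)

# StandardScaler: (x - mean) / std, using the population standard deviation
manual_standard = (train - train.mean(axis=0)) / train.std(axis=0)

# MinMaxScaler: (x - min) / (max - min)
manual_minmax = (train - train.min(axis=0)) / (train.max(axis=0) - train.min(axis=0))

# Reuse the fitted statistics on new data (transform only, no refit)
new_standard = std_scaler.transform(new)
new_minmax = mm_scaler.transform(new)

print("StandardScaler matches manual formula:", np.allclose(std_scaler.transform(train), manual_standard))
print("MinMaxScaler matches manual formula:  ", np.allclose(mm_scaler.transform(train), manual_minmax))
print("New value after StandardScaler:", new_standard.ravel())
print("New value after MinMaxScaler:  ", new_minmax.ravel())
```

Note that the new value lands outside the 0-1 range under `MinMaxScaler`, because the minimum and maximum come from the data the scaler was fitted on.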

## Step 9: Categorical Feature Encoding

Encode categorical variables for machine learning models.

In [None]:
print("PHASE 6: CATEGORICAL FEATURE ENCODING")
print("="*70)

# Identify categorical features
categorical_features = df_cleaned.select_dtypes(include=['object']).columns.tolist()
print(f"\nCategorical features to encode: {categorical_features}")
print(f"Unique values: {df_cleaned['city'].unique()}")

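# Method 1: Label Encoding (integer codes assigned in alphabetical order)
# Note: the codes imply an order that city names do not have; LabelEncoder is intended for target labels.
print("\n1️⃣  LABEL ENCODING (integer codes, alphabetical order):")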
df_label_encoded = df_cleaned.copy()
label_encoders = {}

for col in categorical_features:
    le = LabelEncoder()
    df_label_encoded[col] = le.fit_transform(df_label_encoded[col].astype(str))
    label_encoders[col] = le
    print(f"\n  {col}:")
    for i, label in enumerate(le.classes_):
        print(f"    {label:20} → {i}")

# Method 2: One-Hot Encoding (nominal encoding)
print("\n2️⃣  ONE-HOT ENCODING (for nominal categories):")
df_onehot = pd.get_dummies(df_cleaned, columns=categorical_features, drop_first=False)
print(f"Shape before One-Hot: {df_cleaned.shape}")
print(f"Shape after One-Hot:  {df_onehot.shape}")
print(f"\nNew columns created:")
new_cols = [col for col in df_onehot.columns if col not in df_cleaned.columns]
for col in new_cols:
    print(f"  {col}")

print("\n✅ RESULT: Categorical encoding completed")
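
For a category that really is ordinal, alphabetical codes are usually not what is wanted, because the alphabetical order rarely matches the meaningful order. The sketch below assumes a hypothetical `condition` column (not part of this dataset) and uses a pandas ordered categorical to assign codes in an explicitly stated order.

```python
import pandas as pd

# Hypothetical ordinal feature; this column does not exist in the practical's dataset
condition = pd.Series(["good", "poor", "excellent", "fair", "good"])

# The meaningful order, from worst to best, stated explicitly
order = ["poor", "fair", "good", "excellent"]
ordinal_codes = pd.Categorical(condition, categories=order, ordered=True).codes

# For comparison: codes assigned alphabetically
alphabetical_codes = condition.astype("category").cat.codes

print("Values:            ", condition.tolist())
print("Explicit order:    ", ordinal_codes.tolist())
print("Alphabetical order:", alphabetical_codes.tolist())
```

With the explicit order, a larger code always means a better condition; with alphabetical codes, "excellent" would receive the smallest code.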

## Step 10: Create and Apply Preprocessing Pipeline

Combine all preprocessing steps into a reusable pipeline.

In [None]:
print("PHASE 7: PREPROCESSING PIPELINE")
print("="*70)

# Create preprocessing pipeline
print("\nCreating preprocessing pipeline...")

numerical_features = df_cleaned.select_dtypes(include=[np.number]).columns.tolist()
categorical_features = df_cleaned.select_dtypes(include=['object']).columns.tolist()

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numerical_features),
        # 'sparse' was renamed to 'sparse_output' in scikit-learn 1.2 and later removed
        ('cat', OneHotEncoder(drop='first', sparse_output=False), categorical_features)
    ]
)

# Create full pipeline
preprocessing_pipeline = Pipeline([
    ('preprocessor', preprocessor)
])

print("✅ Pipeline created successfully")

# Apply pipeline
print("\nApplying pipeline to cleaned data...")
X_processed = preprocessing_pipeline.fit_transform(df_cleaned)

# Get feature names
feature_names = []
feature_names.extend(numerical_features)
cat_features = preprocessor.named_transformers_['cat'].get_feature_names_out(categorical_features)
feature_names.extend(cat_features)

# Create processed DataFrame
df_final = pd.DataFrame(X_processed, columns=feature_names)

print(f"✅ Pipeline applied successfully")
print(f"\nProcessed dataset shape: {df_final.shape}")
print(f"\nProcessed data (first 5 rows):")
print(df_final.head())
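
The pipeline above only scales and encodes; imputation was done by hand in Step 4. One possible way to make the pipeline more reusable is to move imputation inside it and let the encoder tolerate categories it has not seen. The sketch below assumes hypothetical `train` and `new_rows` frames (including the unseen city "boston") and is an illustration, not the practical's required solution.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical training data with gaps in both columns
train = pd.DataFrame({
    "square_feet": [1000.0, 2000.0, np.nan, 3000.0],
    "city": ["chicago", "new york", "chicago", np.nan],
})

# Hypothetical new data: a missing number and a city not seen during fit
new_rows = pd.DataFrame({"square_feet": [np.nan], "city": ["boston"]})

# Numeric branch: fill gaps with the median, then standardize
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical branch: fill gaps with the mode, then one-hot encode;
# unseen categories are encoded as all zeros instead of raising an error
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

preprocess = ColumnTransformer(
    [("num", numeric, ["square_feet"]), ("cat", categorical, ["city"])],
    sparse_threshold=0,  # always return a dense array
)

X_train = preprocess.fit_transform(train)
X_new = preprocess.transform(new_rows)  # uses statistics learned from train

print("Training output shape:", X_train.shape)
print("New row after preprocessing:", X_new)
```

Because the imputation statistics and category list are learned in `fit_transform`, calling `transform` on new rows applies exactly the same treatment without recomputing anything.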

## Step 11: Validation & Visualization

Verify preprocessing quality through visualizations and metrics.

In [None]:
print("PHASE 8: VALIDATION & QUALITY METRICS")
print("="*70)

# Create quality report
print("\n📊 DATA QUALITY REPORT\n")
quality_report = pd.DataFrame({
    'Metric': [
        'Total Rows',
        'Missing Values',
        'Duplicate Rows',
        'Total Columns',
        'Numerical Columns',
        'Categorical Columns'
    ],
    'Before Preprocessing': [
        df_raw.shape[0],
        df_raw.isnull().sum().sum(),
        df_raw.duplicated().sum(),
        df_raw.shape[1],
        len(df_raw.select_dtypes(include=[np.number]).columns),
        len(df_raw.select_dtypes(include=['object']).columns)
    ],
    'After Preprocessing': [
        df_final.shape[0],
        df_final.isnull().sum().sum(),
        df_final.duplicated().sum(),
        df_final.shape[1],
        df_final.shape[1],
        0
    ]
})

print(quality_report.to_string(index=False))

# Improvement metrics
print("\n✅ IMPROVEMENTS:")
print(f"  • Completeness (non-missing cells): {((1 - df_final.isnull().sum().sum()/df_final.size) * 100):.1f}%")
print(f"  • Rows preserved: {(df_final.shape[0]/df_raw.shape[0] * 100):.1f}%")
print(f"  • Rows removed during cleaning (missing values, duplicates, outliers): {df_raw.shape[0] - df_final.shape[0]}")

## Step 12: Visualizations

Create visualizations comparing before and after preprocessing.

In [None]:
# Create comprehensive visualizations
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
fig.suptitle('Data Preprocessing Comparison: Before vs After', fontsize=16, fontweight='bold')

# 1. Distribution comparison - Price
axes[0, 0].hist(df_raw['price'], bins=30, color='red', alpha=0.7, edgecolor='black', label='Before')
axes[0, 0].axvline(df_raw['price'].mean(), color='darkred', linestyle='--', linewidth=2, label='Mean')
axes[0, 0].set_title('Price Distribution - Before Preprocessing', fontweight='bold')
axes[0, 0].set_xlabel('Price ($)')
axes[0, 0].set_ylabel('Frequency')
axes[0, 0].legend()
axes[0, 0].grid(alpha=0.3)

# 2. Distribution after scaling
axes[0, 1].hist(df_scaled_standard['price'], bins=30, color='green', alpha=0.7, edgecolor='black')  # select price by name, not by position
axes[0, 1].axvline(0, color='darkgreen', linestyle='--', linewidth=2, label='Mean (0)')
axes[0, 1].set_title('Price Distribution - After StandardScaler', fontweight='bold')
axes[0, 1].set_xlabel('Scaled Price (StandardScaler)')
axes[0, 1].set_ylabel('Frequency')
axes[0, 1].legend()
axes[0, 1].grid(alpha=0.3)

# 3. Missing values comparison
missing_before = [df_raw.isnull().sum().sum()]
missing_after = [df_cleaned.isnull().sum().sum()]
axes[1, 0].bar(['Before', 'After'], [missing_before[0], missing_after[0]], color=['red', 'green'], alpha=0.7, edgecolor='black')
axes[1, 0].set_title('Missing Values: Before vs After', fontweight='bold')
axes[1, 0].set_ylabel('Count')
axes[1, 0].grid(alpha=0.3, axis='y')

# 4. Data shape comparison
shapes = [df_raw.shape[0], df_cleaned.shape[0]]
axes[1, 1].bar(['Original', 'Cleaned'], shapes, color=['orange', 'blue'], alpha=0.7, edgecolor='black')
axes[1, 1].set_title('Dataset Size: Before vs After', fontweight='bold')
axes[1, 1].set_ylabel('Number of Rows')
for i, v in enumerate(shapes):
    axes[1, 1].text(i, v + 5, str(v), ha='center', fontweight='bold')
axes[1, 1].grid(alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

print("✅ Visualizations created successfully")

## Step 13: Summary & Results

Summary of preprocessing steps and key outcomes.

In [None]:
print("\n" + "="*70)
print("PRACTICAL 2 - DATA PREPROCESSING: FINAL SUMMARY")
print("="*70)

print("\n📋 PREPROCESSING STEPS COMPLETED:")
print("  ✅ Step 1: Loaded raw data (300 rows × 7 columns)")
print("  ✅ Step 2: Identified data quality issues")
print("  ✅ Step 3: Handled missing values using imputation and deletion")
print("  ✅ Step 4: Removed duplicate rows")
print("  ✅ Step 5: Detected and treated outliers using IQR method")
print("  ✅ Step 6: Ensured data consistency (standardization, validation)")
print("  ✅ Step 7: Scaled numerical features (StandardScaler)")
print("  ✅ Step 8: Encoded categorical features (One-Hot Encoding)")
print("  ✅ Step 9: Created reusable preprocessing pipeline")
print("  ✅ Step 10: Validated preprocessing quality")
print("  ✅ Step 11: Generated visualizations and metrics")

print("\n📊 KEY RESULTS:")
print(f"  • Original dataset: {df_raw.shape[0]} rows × {df_raw.shape[1]} columns")
print(f"  • Cleaned dataset: {df_cleaned.shape[0]} rows × {df_cleaned.shape[1]} columns")
print(f"  • Final dataset: {df_final.shape[0]} rows × {df_final.shape[1]} columns")
print(f"  • Completeness (non-missing cells): {(1 - df_final.isnull().sum().sum()/df_final.size) * 100:.1f}% (Missing: {df_final.isnull().sum().sum()})")
print(f"  • Rows retained: {(df_final.shape[0]/df_raw.shape[0] * 100):.1f}%")

print("\n🎯 LEARNING OUTCOMES ACHIEVED:")
print("  ✅ LO1: Identified different data quality issues")
print("  ✅ LO2: Handled missing values using various strategies")
print("  ✅ LO3: Detected and treated outliers")
print("  ✅ LO4: Normalized and scaled numerical features")
print("  ✅ LO5: Encoded categorical variables")
print("  ✅ LO6: Created preprocessing pipelines")
print("  ✅ LO7: Validated preprocessing quality")
print("  ✅ LO8: Documented preprocessing steps")

print("\n" + "="*70)
print("✅ PRACTICAL 2 COMPLETED SUCCESSFULLY!")
print("="*70)

## Step 14: Submission Checklist

Ensure all required elements are completed before submission.

### 📝 PRACTICAL 2 SUBMISSION CHECKLIST

Before submitting, ensure you have completed:

#### Code Execution
- [ ] All code cells executed successfully without errors
- [ ] All visualizations displayed correctly
- [ ] Output shows expected results and metrics

#### Learning Outcomes
- [ ] LO1: Data quality issues identified ✅
- [ ] LO2: Missing values handled ✅
- [ ] LO3: Outliers detected and treated ✅
- [ ] LO4: Features scaled ✅
- [ ] LO5: Categorical features encoded ✅
- [ ] LO6: Pipeline created ✅
- [ ] LO7: Quality validated ✅
- [ ] LO8: Steps documented ✅

#### Documentation
- [ ] Student details filled in (Name, Roll No, Date)
- [ ] All code cells have appropriate comments
- [ ] Results explained in text cells
- [ ] Reflections written below

#### Files to Submit
- [ ] This notebook (`Practical_2_Complete_Notebook.ipynb`)
- [ ] Submission template (`SUBMISSION_TEMPLATE_Practical_2.md`)
- [ ] Screenshot of final summary

---

### 💭 REFLECTIONS & LEARNINGS

**1. What I learned about data quality:**

Real-world data is messier than we expect. Missing values, duplicates, and outliers are common and must be handled carefully. Different strategies work for different scenarios.

**2. Key challenges faced:**

- Deciding which rows to delete vs impute
- Balancing data loss with data quality
- Understanding when to use different scaling methods

**3. Important concepts understood:**

- Missing data mechanisms (MCAR, MAR, MNAR)
- Outlier detection methods (IQR, Z-score)
- Feature scaling importance for ML algorithms
- Categorical encoding strategies
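
The practical itself only demonstrates the IQR method. For comparison, here is a minimal sketch of the Z-score method named in the list above, run on hypothetical values; the threshold of 3 is a common convention rather than a fixed rule.

```python
import numpy as np

# Hypothetical values: 50 evenly spaced sizes plus one extreme entry
values = np.concatenate([np.linspace(1500.0, 2500.0, 50), [10000.0]])

# Z-score: distance from the mean measured in standard deviations
z_scores = (values - values.mean()) / values.std()

# Flag points more than 3 standard deviations from the mean
outlier_mask = np.abs(z_scores) > 3

print(f"Outliers flagged: {int(outlier_mask.sum())}")
print(f"Flagged values:   {values[outlier_mask]}")
```

Unlike the IQR method, the Z-score uses the mean and standard deviation, which are themselves pulled by extreme values, so it tends to be less reliable on small or heavily skewed samples.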

**4. How this applies to real-world ML:**

Data preprocessing is often cited as taking up most of the time in an ML project (figures around 70% are commonly quoted). Cleaner data generally leads to better models. The pipelines we created can be reused for new datasets.

**5. Skills developed:**

- Pandas data manipulation
- Scikit-learn preprocessing
- Pipeline creation and reusability
- Data quality assessment
- Data visualization

---

**Ready for submission?** ✅ Yes / ❌ No

**Submission date:** _____________________