### AI/ML – Improving Model Performance with Clean Data

**Task 1**: Data Preprocessing for Models

**Objective**: Enhance data quality for better AI/ML outcomes.

**Steps**:
1. Choose a dataset for training an AI/ML model.
2. Identify common data issues such as null values, redundant features, or noisy data.
3. Apply preprocessing methods such as imputation, normalization, or feature engineering.
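
Step 2 asks for the issues to be identified before anything is fixed. A minimal sketch of such a check follows, assuming a hypothetical helper named `audit_features` and a small made-up frame (neither is part of the original notebook). It reports missing values and constant columns only; it does not try to detect noisy features.

```python
import numpy as np
import pandas as pd


def audit_features(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize per-column issues: missing values and constant columns.

    Hypothetical helper for illustration only.
    """
    n_unique = df.nunique(dropna=True)
    return pd.DataFrame({
        "n_missing": df.isna().sum(),   # count of NaN entries per column
        "n_unique": n_unique,           # distinct non-missing values
        "is_constant": n_unique <= 1,   # zero-variance candidates
    })


# Made-up frame with one gap (column "a") and one constant column ("b")
demo = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": [1, 1, 1, 1],
    "c": [0.5, 0.1, 0.9, 0.3],
})
audit = audit_features(demo)
print(audit)
```

Running the same helper on the `X` frame built in the cell below should flag `sepal length (cm)` for missing values and `redundant_feature` as constant.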

In [1]:
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# Step 1: Load dataset
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = iris.target

# Introduce data quality issues
X.loc[0:5, 'sepal length (cm)'] = np.nan  # Inject missing values in rows 0-5 (.loc slicing is inclusive)
X['redundant_feature'] = 1  # Add a constant column (zero variance)
X['noisy_feature'] = np.random.normal(0, 10, size=len(X))  # Add unseeded Gaussian noise (values differ on each run)

print("Original Data with Issues:")
print(X.head())

# Step 2: Handle missing values
imputer = SimpleImputer(strategy='mean')
X_imputed = pd.DataFrame(imputer.fit_transform(X), columns=X.columns)

# Step 3: Remove low variance (redundant) features
selector = VarianceThreshold(threshold=0.01)
X_selected = pd.DataFrame(selector.fit_transform(X_imputed),
                          columns=X_imputed.columns[selector.get_support()])

# Step 4: Normalize features
scaler = StandardScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(X_selected),
                        columns=X_selected.columns)

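# Step 5: Train/test split
# Note: Steps 2-4 were fit on the full dataset, so test rows influenced the imputer, selector, and scaler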
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.3, random_state=42)

# Step 6: Train model
model = RandomForestClassifier(random_state=42)
model.fit(X_train, y_train)

# Step 7: Evaluate
y_pred = model.predict(X_test)
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

Original Data with Issues:
   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)  \
0                NaN               3.5                1.4               0.2   
1                NaN               3.0                1.4               0.2   
2                NaN               3.2                1.3               0.2   
3                NaN               3.1                1.5               0.2   
4                NaN               3.6                1.4               0.2   

   redundant_feature  noisy_feature  
0                  1       9.986995  
1                  1       4.136855  
2                  1       7.782418  
3                  1      -2.107377  
4                  1      -1.119679  

Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        19
           1       1.00      1.00      1.00        13
           2       1.00      1.00      1.00        13

    accuracy                
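
The cell above fits the imputer, variance filter, and scaler on all 150 rows and splits afterwards, so statistics from the test rows feed into preprocessing. One possible alternative, sketched here under assumptions, is to split first and wrap the same steps in a scikit-learn `Pipeline` so they are fit on the training rows only. The seed passed to `default_rng` and the step names are choices made for this sketch, not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import VarianceThreshold
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)  # assumed seed, only to make the noise repeatable

# Rebuild the same "dirty" dataset as in the cell above
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = iris.target
X.loc[0:5, "sepal length (cm)"] = np.nan
X["redundant_feature"] = 1
X["noisy_feature"] = rng.normal(0, 10, size=len(X))

# Split BEFORE any preprocessing is fit
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Each step is fit on X_train only; X_test is merely transformed
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("select", VarianceThreshold(threshold=0.01)),
    ("scale", StandardScaler()),
    ("model", RandomForestClassifier(random_state=42)),
])
pipe.fit(X_train, y_train)

test_accuracy = pipe.score(X_test, y_test)
n_kept = int(pipe.named_steps["select"].get_support().sum())
print(f"Features kept after variance filter: {n_kept}")
print(f"Test accuracy: {test_accuracy:.3f}")
```

The variance filter should drop only `redundant_feature`, leaving the four iris measurements plus `noisy_feature`.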

**Task 2**: Evaluate Model Performance

**Objective**: Assess the impact of data quality improvements on model performance.

**Steps**:
1. Train a simple ML model with and without preprocessing.
2. Analyze and compare model performance metrics to evaluate the impact of data quality strategies.
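
For step 2, printing two full classification reports makes differences hard to spot. A small sketch of a side-by-side summary follows, assuming a hypothetical helper `compare_reports` and made-up labels. It relies on `classification_report(..., output_dict=True)`, which returns the same numbers as a nested dictionary.

```python
import pandas as pd
from sklearn.metrics import classification_report


def compare_reports(y_true, predictions: dict) -> pd.DataFrame:
    """Return one row per model with accuracy, macro F1, and weighted F1.

    Hypothetical helper for illustration; `predictions` maps a model
    name to its predicted labels for the same y_true.
    """
    rows = {}
    for name, y_pred in predictions.items():
        report = classification_report(
            y_true, y_pred, output_dict=True, zero_division=0
        )
        rows[name] = {
            "accuracy": report["accuracy"],
            "macro_f1": report["macro avg"]["f1-score"],
            "weighted_f1": report["weighted avg"]["f1-score"],
        }
    return pd.DataFrame.from_dict(rows, orient="index")


# Made-up labels: model_a is perfect, model_b makes one mistake
y_true = [0, 0, 1, 1, 2, 2]
summary = compare_reports(y_true, {
    "model_a": [0, 0, 1, 1, 2, 2],
    "model_b": [0, 1, 1, 1, 2, 2],
})
print(summary)
```

With the names used in the cell below, the call would be `compare_reports(y_test1, {"without": y_pred1, "with": y_pred2})`; both splits use the same `random_state` and the same `y`, so `y_test1` and `y_test2` hold the same labels.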

In [2]:
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import VarianceThreshold

# Load the dataset
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = iris.target

# Introduce issues (simulate low data quality)
X_dirty = X.copy()
X_dirty.loc[0:5, 'sepal length (cm)'] = np.nan           # Add missing values
X_dirty['redundant_feature'] = 1                        # Add constant feature
X_dirty['noisy_feature'] = np.random.normal(0, 10, X.shape[0])  # Add noisy feature

# ----------- Model 1: WITHOUT Preprocessing -----------
# Fill missing values with 0 to allow training, skip real preprocessing
X_dirty_filled = X_dirty.fillna(0)

# Train/test split
X_train1, X_test1, y_train1, y_test1 = train_test_split(X_dirty_filled, y, test_size=0.3, random_state=42)

# Train model
model1 = RandomForestClassifier(random_state=42)
model1.fit(X_train1, y_train1)

# Predict and report
y_pred1 = model1.predict(X_test1)
print("🚫 Model WITHOUT Preprocessing:\n")
print(classification_report(y_test1, y_pred1))


# ----------- Model 2: WITH Preprocessing -----------
# Imputation
imputer = SimpleImputer(strategy='mean')
X_imputed = pd.DataFrame(imputer.fit_transform(X_dirty), columns=X_dirty.columns)

# Remove low variance (redundant) features
selector = VarianceThreshold(threshold=0.01)
X_selected = pd.DataFrame(selector.fit_transform(X_imputed),
                          columns=X_imputed.columns[selector.get_support()])

# Normalize features
scaler = StandardScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(X_selected), columns=X_selected.columns)

# Train/test split
X_train2, X_test2, y_train2, y_test2 = train_test_split(X_scaled, y, test_size=0.3, random_state=42)

# Train model
model2 = RandomForestClassifier(random_state=42)
model2.fit(X_train2, y_train2)


# Predict and report
y_pred2 = model2.predict(X_test2)
print("\n✅ Model WITH Preprocessing:\n")
print(classification_report(y_test2, y_pred2))

🚫 Model WITHOUT Preprocessing:

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        19
           1       1.00      1.00      1.00        13
           2       1.00      1.00      1.00        13

    accuracy                           1.00        45
   macro avg       1.00      1.00      1.00        45
weighted avg       1.00      1.00      1.00        45


✅ Model WITH Preprocessing:

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        19
           1       1.00      1.00      1.00        13
           2       1.00      1.00      1.00        13

    accuracy                           1.00        45
   macro avg       1.00      1.00      1.00        45
weighted avg       1.00      1.00      1.00        45
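
Both reports above are identical on this single 45-row split, so the split alone cannot separate the two approaches. The following is a sketch, under assumptions, of a comparison that averages over several splits using 5-fold stratified cross-validation. The seed, fold count, and pipeline step names are choices made for this example. Random forests are largely insensitive to feature scaling and to constant columns, so a small gap on this dataset would not be surprising.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import VarianceThreshold
from sklearn.impute import SimpleImputer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)  # assumed seed, only to make the noise repeatable

# Rebuild the same "dirty" dataset as in the cell above
iris = load_iris()
X_dirty = pd.DataFrame(iris.data, columns=iris.feature_names)
y = iris.target
X_dirty.loc[0:5, "sepal length (cm)"] = np.nan
X_dirty["redundant_feature"] = 1
X_dirty["noisy_feature"] = rng.normal(0, 10, size=len(X_dirty))

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Baseline: fill gaps with 0 (mirrors fillna(0)), no other preprocessing
baseline = Pipeline([
    ("fill_zero", SimpleImputer(strategy="constant", fill_value=0)),
    ("model", RandomForestClassifier(random_state=42)),
])

# Preprocessed: same steps as Model 2, refit inside every fold
preprocessed = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("select", VarianceThreshold(threshold=0.01)),
    ("scale", StandardScaler()),
    ("model", RandomForestClassifier(random_state=42)),
])

scores = {
    "without preprocessing": cross_val_score(baseline, X_dirty, y, cv=cv),
    "with preprocessing": cross_val_score(preprocessed, X_dirty, y, cv=cv),
}
for name, fold_scores in scores.items():
    print(f"{name}: mean accuracy {fold_scores.mean():.3f} "
          f"(std {fold_scores.std():.3f})")
```

Because each pipeline is refit within every fold, the held-out rows never influence the imputer or scaler, which keeps the comparison between the two variants fair.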

