### AI/ML – Improving Model Performance with Clean Data

**Task 1**: Data Preprocessing for Models

**Objective**: Enhance data quality for better AI/ML outcomes.

**Steps**:
1. Choose a dataset for training an AI/ML model.
2. Identify common data issues such as null values, redundant features, or noisy data.
3. Apply preprocessing methods such as imputation, normalization, or feature engineering.
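
Step 2 lists noisy data as an issue to look for. One common heuristic for spotting suspicious numeric values is the 1.5 × IQR rule, sketched below. The helper name `iqr_outlier_mask` and the sample values are hypothetical and only illustrate the idea. Whether a flagged value is really noise depends on the domain.

```python
import pandas as pd

def iqr_outlier_mask(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # Comparisons against NaN evaluate to False, so missing values are never flagged
    return (series < lower) | (series > upper)

# Hypothetical income column with one implausible entry (illustrative values only)
incomes = pd.Series([50000, 60000, 55000, 65000, 62000, 59000, 5_000_000])
mask = iqr_outlier_mask(incomes)

print("Flagged as possible noise:")
print(incomes[mask])
```

Flagged rows can then be inspected, capped, or dropped, depending on whether they are data-entry errors or genuine extreme values.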

In [3]:

import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# 1. Sample dataset (simulated)
data = {
    'age': [25, 30, np.nan, 35, 40, 29, np.nan],
    'income': [50000, 60000, 55000, np.nan, 65000, 62000, 59000],
    'gender': ['M', 'F', 'F', 'M', 'F', 'M', 'M'],
    'has_children': [1, 0, 1, 0, 1, 0, 0],
    'redundant_feature': [100, 100, 100, 100, 100, 100, 100],  # same value - redundant
    'purchase_amount': [250, 300, 150, 400, 350, 200, 180]  # target or feature
}

df = pd.DataFrame(data)
print("Original Data:\n", df)

# 2. Identify common issues
print("\nMissing values per column:")
print(df.isnull().sum())

print("\nCheck for redundant features:")
for col in df.columns:
    if df[col].nunique() == 1:
        print(f" - Column '{col}' is redundant (single unique value)")

# 3. Preprocessing steps

# A. Impute missing numerical values using median
num_cols = ['age', 'income']
imputer = SimpleImputer(strategy='median')
df[num_cols] = imputer.fit_transform(df[num_cols])

# B. Remove redundant feature
df = df.drop(columns=['redundant_feature'])

# C. Encode categorical feature 'gender' using one-hot encoding
df = pd.get_dummies(df, columns=['gender'], drop_first=True)  # drop_first=True avoids the dummy variable trap

# D. Normalize numerical features (age, income, purchase_amount)
scaler = StandardScaler()
df[['age', 'income', 'purchase_amount']] = scaler.fit_transform(df[['age', 'income', 'purchase_amount']])

# E. Feature engineering: create interaction feature (age * income)
df['age_income_interaction'] = df['age'] * df['income']

print("\nPreprocessed Data:")
print(df)


Original Data:
     age   income gender  has_children  redundant_feature  purchase_amount
0  25.0  50000.0      M             1                100              250
1  30.0  60000.0      F             0                100              300
2   NaN  55000.0      F             1                100              150
3  35.0      NaN      M             0                100              400
4  40.0  65000.0      F             1                100              350
5  29.0  62000.0      M             0                100              200
6   NaN  59000.0      M             0                100              180

Missing values per column:
age                  2
income               1
gender               0
has_children         0
redundant_feature    0
purchase_amount      0
dtype: int64

Check for redundant features:
 - Column 'redundant_feature' is redundant (single unique value)

Preprocessed Data:
        age    income  has_children  purchase_amount  gender_M  \
0 -1.408406 -1.916535          

**Task 2**: Evaluate Model Performance

**Objective**: Assess the impact of data quality improvements on model performance.

**Steps**:
1. Train a simple ML model with and without preprocessing.
2. Analyze and compare model performance metrics to evaluate the impact of data quality strategies.

In [4]:
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score

# 1. Original dataset with some missing values and a binary target
data = {
    'age': [25, 30, np.nan, 35, 40, 29, np.nan, 50, 60, 55],
    'income': [50000, 60000, 55000, np.nan, 65000, 62000, 59000, 72000, 80000, np.nan],
    'gender': ['M', 'F', 'F', 'M', 'F', 'M', 'M', 'F', 'M', 'F'],
    'has_children': [1, 0, 1, 0, 1, 0, 0, 1, 0, 1],
    'purchase_amount': [250, 300, 150, 400, 350, 200, 180, 500, 450, 480],
    'target': [1, 0, 1, 0, 1, 0, 0, 1, 1, 0]
}

df = pd.DataFrame(data)

# Split features and target
X = df.drop(columns=['target'])
y = df['target']

# -------------------------------
# Model 1: Without preprocessing
# -------------------------------
# Convert categorical 'gender' to numeric with simple map (no one-hot)
X_raw = X.copy()
X_raw['gender'] = X_raw['gender'].map({'M':0, 'F':1})

# Fill missing numeric values with zero (simple naive approach)
X_raw['age'] = X_raw['age'].fillna(0)
X_raw['income'] = X_raw['income'].fillna(0)

# Train/test split
X_train_raw, X_test_raw, y_train, y_test = train_test_split(X_raw, y, test_size=0.3, random_state=42)

model_raw = LogisticRegression()
model_raw.fit(X_train_raw, y_train)
y_pred_raw = model_raw.predict(X_test_raw)

# -------------------------------
# Model 2: With preprocessing
# -------------------------------
X_pre = X.copy()

# Impute missing numerical values with median
num_cols = ['age', 'income']
imputer = SimpleImputer(strategy='median')
X_pre[num_cols] = imputer.fit_transform(X_pre[num_cols])

# One-hot encode 'gender'
X_pre = pd.get_dummies(X_pre, columns=['gender'], drop_first=True)

# Normalize numerical features (age, income, purchase_amount)
scaler = StandardScaler()
X_pre[['age', 'income', 'purchase_amount']] = scaler.fit_transform(X_pre[['age', 'income', 'purchase_amount']])

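# Train/test split (same test_size and random_state as above, so both models
# are evaluated on the same rows and y_train/y_test are unchanged)
X_train_pre, X_test_pre, y_train, y_test = train_test_split(X_pre, y, test_size=0.3, random_state=42)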

model_pre = LogisticRegression()
model_pre.fit(X_train_pre, y_train)
y_pred_pre = model_pre.predict(X_test_pre)

# -------------------------------
# Evaluate and compare performance
# -------------------------------
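def print_metrics(y_true, y_pred, label):
    """Print accuracy, precision, and recall for one set of predictions."""
    print(f"\n{label} Performance:")
    print(f"Accuracy:  {accuracy_score(y_true, y_pred):.3f}")
    # zero_division=0 reports 0.0 without a warning when no positives are predicted
    print(f"Precision: {precision_score(y_true, y_pred, zero_division=0):.3f}")
    print(f"Recall:    {recall_score(y_true, y_pred):.3f}")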

print_metrics(y_test, y_pred_raw, "Without Preprocessing")
print_metrics(y_test, y_pred_pre, "With Preprocessing")



Without Preprocessing Performance:
Accuracy:  0.333
Precision: 0.333
Recall:    1.000

With Preprocessing Performance:
Accuracy:  0.000
Precision: 0.000
Recall:    0.000
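
With `test_size=0.3` on ten rows, the test set holds only three samples, so each metric above moves in steps of one third and is very sensitive to the split. The cell above also fits the imputer and scaler on all rows before splitting, which lets test-set statistics leak into training. The sketch below shows one possible alternative, not the notebook's method: it wraps the preprocessing in a scikit-learn `Pipeline` so it is refit on each training fold, and scores with stratified 5-fold cross-validation. The fold count, `max_iter=1000`, and the use of `OneHotEncoder(drop='if_binary')` are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Same toy data as in the cell above
data = {
    'age': [25, 30, np.nan, 35, 40, 29, np.nan, 50, 60, 55],
    'income': [50000, 60000, 55000, np.nan, 65000, 62000, 59000, 72000, 80000, np.nan],
    'gender': ['M', 'F', 'F', 'M', 'F', 'M', 'M', 'F', 'M', 'F'],
    'has_children': [1, 0, 1, 0, 1, 0, 0, 1, 0, 1],
    'purchase_amount': [250, 300, 150, 400, 350, 200, 180, 500, 450, 480],
    'target': [1, 0, 1, 0, 1, 0, 0, 1, 1, 0],
}
df = pd.DataFrame(data)
X = df.drop(columns=['target'])
y = df['target']

numeric_cols = ['age', 'income', 'purchase_amount']

# Median imputation followed by standardization for the numeric columns
numeric_steps = Pipeline([
    ('impute', SimpleImputer(strategy='median')),
    ('scale', StandardScaler()),
])

preprocess = ColumnTransformer(
    transformers=[
        ('num', numeric_steps, numeric_cols),
        ('cat', OneHotEncoder(drop='if_binary'), ['gender']),
    ],
    remainder='passthrough',  # has_children is already 0/1
)

clf = Pipeline([
    ('preprocess', preprocess),
    ('model', LogisticRegression(max_iter=1000)),
])

# Each fold fits the imputer, scaler, and encoder on its training rows only
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(clf, X, y, cv=cv, scoring='accuracy')

print(f"Fold accuracies: {scores}")
print(f"Mean accuracy:   {scores.mean():.3f}")
```

Even with cross-validation, ten rows are too few to draw firm conclusions. The point of the sketch is the structure: preprocessing lives inside the model pipeline, so the comparison does not depend on a single three-row test set.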


STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
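
The warning above suggests two remedies: raise `max_iter` or scale the data. A minimal sketch applying both is shown below, using only the `age` and `income` values from the rows of the toy data that have no missing entries. The limit of 1000 iterations is an arbitrary assumption. The fitted model's `n_iter_` attribute reports how many solver iterations were actually used, which gives a quick way to confirm the limit was not hit.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# age and income for the six rows of the toy data without missing values
X = np.array([
    [25, 50000],
    [30, 60000],
    [40, 65000],
    [29, 62000],
    [50, 72000],
    [60, 80000],
], dtype=float)
y = np.array([1, 0, 1, 0, 1, 1])

max_iter = 1000  # assumed limit, chosen for illustration

# Standardizing puts age and income on comparable scales before fitting
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=max_iter))
model.fit(X, y)

# make_pipeline names each step after its lowercased class name
n_iter = model.named_steps['logisticregression'].n_iter_[0]
print(f"Solver iterations used: {n_iter} (limit {max_iter})")
```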
