### AI/ML – Improving Model Performance with Clean Data

**Task 1**: Data Preprocessing for Models

**Objective**: Enhance data quality for better AI/ML outcomes.

**Steps**:
1. Choose a dataset for training an AI/ML model.
2. Identify common data issues such as null values, redundant features, or noisy data.
3. Apply preprocessing methods such as imputation, normalization, or feature engineering.
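
Step 2 can be made concrete with a quick audit before any cleaning. The sketch below is one possible way to summarize missing values and constant columns; the helper name `audit_frame` and the `toy` frame are hypothetical and exist only for illustration.

```python
import pandas as pd


def audit_frame(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize per-column issues: missing values and constant columns."""
    n_unique = df.nunique(dropna=True)
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isna().sum(),
        "pct_missing": df.isna().mean().mul(100).round(1),
        "n_unique": n_unique,
        # A column with at most one distinct value carries no information
        "is_constant": n_unique <= 1,
    })


# Small made-up frame, not the dataset used in the exercise
toy = pd.DataFrame({
    "age": [25, 30, None, 45],
    "city": ["A", "B", None, "B"],
    "flag": [1, 1, 1, 1],
})

report = audit_frame(toy)
print(report)
```

Columns flagged `is_constant` are candidates for removal, and columns with a high `pct_missing` may need imputation or may be better dropped.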

In [1]:
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import VarianceThreshold

data = {
    "age": [25, 30, None, 45, 50],
    "income": [50000, 60000, 55000, None, 65000],
    "gender": ["M", "F", "F", "M", None],
    "purchase_count": [5, 7, 6, 10, 8],
    "redundant_feature": [1, 1, 1, 1, 1]
}

df = pd.DataFrame(data)

num_cols = ["age", "income", "purchase_count"]
cat_cols = ["gender"]

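# Mean imputation for numeric columns, most-frequent imputation for categorical ones
num_imputer = SimpleImputer(strategy="mean")
cat_imputer = SimpleImputer(strategy="most_frequent")

df[num_cols] = num_imputer.fit_transform(df[num_cols])
df[cat_cols] = cat_imputer.fit_transform(df[cat_cols])

# Standardize numeric columns to zero mean and unit variance
scaler = StandardScaler()
df[num_cols] = scaler.fit_transform(df[num_cols])

# Drop zero-variance columns (here, redundant_feature). The categorical
# column is excluded because VarianceThreshold needs numeric input.
selector = VarianceThreshold(threshold=0.0)
df_reduced = selector.fit_transform(df.drop(columns=cat_cols))

# fit_transform returns a NumPy array, so the printed frame has integer column labels
print(pd.DataFrame(df_reduced))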


          0    1         2
0 -1.355815 -1.5 -1.278724
1 -0.813489  0.5 -0.116248
2  0.000000 -0.5 -0.697486
3  0.813489  0.0  1.627467
4  1.355815  1.5  0.464991
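
The printed frame above is labeled 0, 1, 2 because `fit_transform` returns a bare NumPy array. A minimal sketch of one way to keep the original column names, using the selector's `get_support()` mask; the `frame` values are made up for illustration.

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# Made-up numeric frame with one constant column
frame = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [0.5, 0.1, 0.9, 0.3],
    "constant": [1.0, 1.0, 1.0, 1.0],
})

selector = VarianceThreshold(threshold=0.0)
kept_values = selector.fit_transform(frame)

# get_support() returns a boolean mask over the input columns
kept_cols = frame.columns[selector.get_support()]
reduced = pd.DataFrame(kept_values, columns=kept_cols, index=frame.index)

print(reduced.columns.tolist())
```

Rebuilding the frame this way keeps the feature names available for later steps such as inspecting model coefficients.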


**Task 2**: Evaluate Model Performance

**Objective**: Assess the impact of data quality improvements on model performance.

**Steps**:
1. Train a simple ML model with and without preprocessing.
2. Analyze and compare model performance metrics to evaluate the impact of data quality strategies.
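
For step 2, accuracy alone can hide differences between models, so it may help to tabulate several metrics side by side. A rough sketch under stated assumptions: `compare_metrics` is a hypothetical helper, and the labels and predictions are invented purely to show the table's shape.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score


def compare_metrics(y_true, predictions: dict) -> pd.DataFrame:
    """Return one row of binary-classification metrics per named prediction set."""
    rows = {}
    for name, y_pred in predictions.items():
        rows[name] = {
            "accuracy": accuracy_score(y_true, y_pred),
            # zero_division=0 avoids warnings when a class is never predicted
            "precision": precision_score(y_true, y_pred, zero_division=0),
            "recall": recall_score(y_true, y_pred, zero_division=0),
            "f1": f1_score(y_true, y_pred, zero_division=0),
        }
    return pd.DataFrame.from_dict(rows, orient="index")


# Invented labels and predictions, for illustration only
y_true = [0, 1, 1, 0, 1, 0]
predictions = {
    "model_a": [0, 1, 0, 0, 1, 1],
    "model_b": [0, 1, 1, 0, 1, 0],
}

summary = compare_metrics(y_true, predictions)
print(summary)
```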

In [2]:
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

data = {
    "age": [25, 30, None, 45, 50, 23, 37, None, 42, 29],
    "income": [50000, 60000, 55000, None, 65000, 48000, 62000, 59000, None, 58000],
    "gender": ["M", "F", "F", "M", None, "F", "M", "F", "F", "M"],
    "purchase_count": [5, 7, 6, 10, 8, 4, 9, 6, 7, 5],
    "target": [0, 1, 0, 1, 1, 0, 1, 0, 1, 0]
}

df = pd.DataFrame(data)

X = df.drop(columns=["target"])
y = df["target"]

num_cols = ["age", "income", "purchase_count"]
cat_cols = ["gender"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

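# Baseline without preprocessing: drop rows with missing values
X_train_raw = X_train.dropna()
y_train_raw = y_train.loc[X_train_raw.index]
X_test_raw = X_test.dropna()
y_test_raw = y_test.loc[X_test_raw.index]

# One-hot encode train and test, then align the test columns to the training
# columns; encoding them separately can produce mismatched feature names
X_train_enc = pd.get_dummies(X_train_raw)
X_test_enc = pd.get_dummies(X_test_raw).reindex(columns=X_train_enc.columns, fill_value=0)

# Unscaled features slow lbfgs convergence, so allow more iterations
model_raw = LogisticRegression(max_iter=1000)
model_raw.fit(X_train_enc, y_train_raw)
y_pred_raw = model_raw.predict(X_test_enc)
acc_raw = accuracy_score(y_test_raw, y_pred_raw)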

# With preprocessing
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

num_imputer = SimpleImputer(strategy="mean")
cat_imputer = SimpleImputer(strategy="most_frequent")
scaler = StandardScaler()
ohe = OneHotEncoder(handle_unknown='ignore')

num_pipeline = Pipeline([
    ("imputer", num_imputer),
    ("scaler", scaler)
])

cat_pipeline = Pipeline([
    ("imputer", cat_imputer),
    ("onehot", ohe)
])

preprocessor = ColumnTransformer([
    ("num", num_pipeline, num_cols),
    ("cat", cat_pipeline, cat_cols)
])

model_pipe = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier", LogisticRegression())
])

model_pipe.fit(X_train, y_train)
y_pred_pipe = model_pipe.predict(X_test)
acc_pipe = accuracy_score(y_test, y_pred_pipe)

acc_raw, acc_pipe


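If the training and test frames are passed through `pd.get_dummies` separately and their columns are not aligned, prediction fails. On this split, the rows that survive `dropna()` contain only `M` in training and only `F` in test, so scikit-learn reports the `ValueError` below. The warning that precedes it comes from fitting `LogisticRegression` on unscaled features.

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

ValueError: The feature names should match those that were passed during fit.
Feature names unseen at fit time:
- gender_F
Feature names seen at fit time, yet now missing:
- gender_M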
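
With only ten rows, a single train/test split leaves a handful of test samples, so one accuracy figure says little. A hedged sketch of a steadier comparison using 5-fold cross-validation follows. The synthetic data from `make_classification`, the 10% missing rate, and the two pipeline variants are assumptions chosen for illustration, not part of the exercise above.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for a real dataset (assumption)
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

# Blank out roughly 10% of entries at random to mimic missing values
rng = np.random.default_rng(0)
X_missing = X.copy()
X_missing[rng.random(X.shape) < 0.10] = np.nan

# Variant A: impute only. Variant B: impute, then standardize.
impute_only = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("classifier", LogisticRegression(max_iter=1000)),
])
impute_and_scale = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
])

# Each fold refits the whole pipeline, so no test-fold statistics leak into training
scores_a = cross_val_score(impute_only, X_missing, y, cv=5, scoring="accuracy")
scores_b = cross_val_score(impute_and_scale, X_missing, y, cv=5, scoring="accuracy")

print(f"impute only:       {scores_a.mean():.3f} +/- {scores_a.std():.3f}")
print(f"impute and scale:  {scores_b.mean():.3f} +/- {scores_b.std():.3f}")
```

Reporting the mean and spread across folds makes it easier to judge whether a difference between variants is larger than the fold-to-fold noise.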
