### Week 10 Model Submission

**Feature Engineering**

To create user-specific and interaction-level features:
1. User-based Metrics: Aggregated user statistics such as mean, standard deviation, and kurtosis of ratings. Interaction counts for likes, dislikes, neutrals, and watched actions were calculated.
2. Ratio Features: Ratios of each interaction type relative to the total interactions were derived.
3. Weighted Scores: Weighted scores were computed using likes and dislikes to reflect user sentiment.
4. Item Popularity: A popularity score for each item was calculated by combining the average rating and the log-scaled count of interactions.
5. Deviation Metrics: For each user, the average deviation from item popularity was calculated.
6. Outlier Removal: Outliers were filtered with an IQR-style rule (a fence of 1.5 times the spread between the 1st and 99th percentiles of each feature) to keep feature quality robust.
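
As a rough illustration of items 4 and 5, the sketch below computes a popularity score and each user's average deviation from it on a tiny, made-up interaction table. The toy data and variable names are assumptions for illustration only; the formula (mean rating multiplied by `log1p` of the interaction count) follows the description above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy interactions, not taken from the project data
toy = pd.DataFrame(
    {
        "user": [1, 1, 2, 2, 3],
        "item": [10, 11, 10, 11, 10],
        "rating": [10, 0, 10, -10, 1],
    }
)

# Item popularity: average rating weighted by the log-scaled interaction count
item_stats = toy.groupby("item")["rating"].agg(["mean", "count"])
item_stats["popularity_score"] = item_stats["mean"] * np.log1p(item_stats["count"])

# Attach each item's popularity to the interactions, then average the
# absolute gap between rating and popularity per user
toy = toy.merge(item_stats["popularity_score"], left_on="item", right_index=True)
toy["abs_dev"] = (toy["rating"] - toy["popularity_score"]).abs()
avg_deviation = toy.groupby("user")["abs_dev"].mean()
```

A user whose ratings track the crowd ends up with a small `avg_deviation`, while a user who rates against popular opinion ends up with a large one.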

**Data Engineering**

The combined first and second batch dataset was used for both training and testing (via cross-validation).

**Feature Preprocessing**

Feature preprocessing involves two stages:
1. Scaling: Standard scaling was applied to normalize the feature distributions.
2. Feature Selection: Features with high correlation or limited utility were excluded to reduce redundancy.
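
A minimal sketch of the scaling stage, assuming synthetic feature matrices (the array names and sizes here are hypothetical). It shows the usual pattern of learning the mean and standard deviation on the training features only and reusing them for unseen data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X_train = rng.normal(loc=5.0, scale=3.0, size=(200, 4))  # hypothetical training features
X_unseen = rng.normal(loc=5.0, scale=3.0, size=(50, 4))  # hypothetical unseen features

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training data only
X_unseen_scaled = scaler.transform(X_unseen)    # reuse the same statistics, no refitting
```

Calling `fit_transform` again on the unseen data would scale it with different statistics, so the model would see inputs on a different scale than it was trained on.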

**Model Training**

1. Base Model: Logistic Regression wrapped in a One-vs-Rest classifier for multi-class classification.
2. Hyperparameter Optimization: RandomizedSearchCV was used with 10-fold cross-validation to optimize hyperparameters such as C, solver type, and tolerance.
3. Penalty Types: L1 and L2 penalties were explored to handle feature sparsity effectively.
4. Evaluation Metric: ROC-AUC for multi-class classification (roc_auc_ovr) was used as the scoring criterion.
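
One possible way to explore both penalties without sampling unsupported solver/penalty pairs is to give `RandomizedSearchCV` a list of dictionaries, one per valid family of combinations. The sketch below is an assumption-laden illustration, not the submission's configuration: the data comes from `make_classification`, and the grid values, `n_iter`, and `cv` are kept small so it runs quickly.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.multiclass import OneVsRestClassifier

# Synthetic 3-class data standing in for the engineered user features
X, y = make_classification(
    n_samples=150, n_features=6, n_informative=4, n_classes=3, random_state=0
)

ovr = OneVsRestClassifier(LogisticRegression(max_iter=2000, random_state=0))

# Each dictionary lists only combinations the solver supports:
# newton-cg and lbfgs handle L2 only, while saga handles both L1 and L2.
param_distributions = [
    {
        "estimator__penalty": ["l2"],
        "estimator__solver": ["newton-cg", "lbfgs", "saga"],
        "estimator__C": [0.1, 1, 10],
    },
    {
        "estimator__penalty": ["l1"],
        "estimator__solver": ["saga"],
        "estimator__C": [0.1, 1, 10],
    },
]

search = RandomizedSearchCV(
    ovr,
    param_distributions,
    n_iter=4,
    scoring="roc_auc_ovr",
    cv=3,
    random_state=0,
    error_score="raise",  # surface any invalid combination instead of scoring it NaN
)
search.fit(X, y)
```

With `error_score="raise"`, a candidate that cannot be fitted stops the search immediately, which makes an invalid search space easy to spot.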

In [None]:
import random

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from scipy.stats import kurtosis
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import StandardScaler


In [2]:
RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)
random.seed(RANDOM_SEED)

In [3]:
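def remove_outliers(df, columns):
    """Drop rows whose value in any of `columns` falls outside an IQR-style fence."""
    for col in columns:
        # Note: these are the 1st and 99th percentiles rather than the classic
        # quartiles (25th/75th), so the resulting fence is deliberately wide.
        Q1 = df[col].quantile(0.01)
        Q3 = df[col].quantile(0.99)
        IQR = Q3 - Q1

        # Define the outlier range
        lower_bound = Q1 - 1.5 * IQR
        upper_bound = Q3 + 1.5 * IQR

        # Keep only rows inside the fence; later columns see the filtered frame
        df = df[(df[col] >= lower_bound) & (df[col] <= upper_bound)]

    return df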


def engineer_features(df_X, feature_columns=None, df_y=None):
    # Basic user features
    df_user_features = df_X.groupby("user").agg(
        mean_rating=("rating", "mean"),
        median_rating=("rating", "median"),
        std_rating=("rating", "std"),
        count_dislike=("rating", lambda x: (x == -10).sum()),
        count_neutral=("rating", lambda x: (x == 0).sum()),
        count_like=("rating", lambda x: (x == 10).sum()),
        count_watched=("rating", lambda x: (x == 1).sum()),
        total_interactions=("rating", "count"),
    )

    # Ratio features
    df_user_features["like_ratio"] = (
        df_user_features["count_like"] / df_user_features["total_interactions"]
    )
    df_user_features["dislike_ratio"] = (
        df_user_features["count_dislike"] / df_user_features["total_interactions"]
    )
    df_user_features["neutral_ratio"] = (
        df_user_features["count_neutral"] / df_user_features["total_interactions"]
    )
    df_user_features["watched_ratio"] = (
        df_user_features["count_watched"] / df_user_features["total_interactions"]
    )

    # Weighted scores
    df_user_features["weighted_score"] = (
        df_user_features["count_like"] * 1.5 - df_user_features["count_dislike"] * 1.5
    )

    # Distribution features
    df_user_features["rating_kurtosis"] = df_X.groupby("user")["rating"].apply(
        lambda x: kurtosis(x)
    )

    # Item popularity metrics
    item_popularity = df_X.groupby("item")["rating"].agg(["mean", "count"])
    item_popularity["popularity_score"] = item_popularity["mean"] * np.log1p(
        item_popularity["count"]
    )

    # Merge item popularity with user interactions
    df_X_with_popularity = pd.merge(
        df_X, item_popularity["popularity_score"], left_on="item", right_index=True
    )

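    # Average absolute deviation of each user's ratings from item popularity.
    # Selecting the two columns explicitly keeps the grouping column out of apply().
    df_user_features["avg_deviation_from_popularity"] = (
        df_X_with_popularity.groupby("user")[["rating", "popularity_score"]]
        .apply(lambda x: np.abs(x["rating"] - x["popularity_score"]).mean())
    )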

    # Drop columns with high correlation
    df_user_features.drop(columns=["total_interactions", "median_rating"], inplace=True)

    # If labels are provided, merge with df_y
    if df_y is not None:
        df_user_features = remove_outliers(df_user_features, df_user_features.columns)
        df_merged = pd.merge(df_user_features.reset_index(), df_y, on="user")
        feature_columns = df_user_features.columns.tolist()

        # Return the merged dataframe (user, features, label) and the list of feature columns
        return df_merged[["user"] + feature_columns + ["label"]], feature_columns

    # If no labels (unseen data), just select the selected features
    else:
        if feature_columns is None:
            raise ValueError("feature_columns must be provided for unseen data")

        df_merged = df_user_features.reset_index()
        return df_merged[["user"] + feature_columns]

In [4]:
data_first = np.load("first_second_batch_multi_labels.npz")
X_first = data_first["X"]
y_first = data_first["yy"]

# Convert to DataFrame
df_X_first = pd.DataFrame(X_first, columns=["user", "item", "rating"])
df_y_first = pd.DataFrame(y_first, columns=["user", "label"])

# Engineer features for the first dataset
df_merged_first, top_features = engineer_features(df_X_first, df_y=df_y_first)

scaler = StandardScaler()

# Features and Labels
X_features_first = df_merged_first.drop(columns=["user", "label"])
y_labels_first = df_merged_first["label"]

X_train_scaled = scaler.fit_transform(X_features_first)

# Define base logistic regression model
base_logreg = LogisticRegression(random_state=RANDOM_SEED)

# Wrap it with OneVsRestClassifier
ovr_logreg = OneVsRestClassifier(base_logreg)

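# Define parameter grid for logistic regression.
# Note: newton-cg and lbfgs support only the L2 penalty, so sampled combinations
# that pair them with L1 fail to fit and are scored as NaN by the search.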
param_grid_logreg = {
    "estimator__C": [0.1, 1, 10, 35, 100],  # Wider range of C values
    "estimator__penalty": ["l2", "l1"],  # Include L1 penalty
    "estimator__solver": ["newton-cg", "lbfgs", "saga"],  # Include saga solver
    "estimator__max_iter": [1000, 2000, 3000],  # Increased max_iter
    "estimator__tol": [1e-4, 1e-5, 1e-6],  # Adjusted tolerance levels
    "estimator__warm_start": [True, False],
}

# Initialize RandomizedSearchCV for logistic regression
random_search_logreg = RandomizedSearchCV(
    estimator=ovr_logreg,
    param_distributions=param_grid_logreg,
    # Increasing n_iter to 100 marginally improves the results
    n_iter=10,
    scoring="roc_auc_ovr",
    cv=10,
    random_state=RANDOM_SEED,
    n_jobs=-1,
)

# Fit RandomizedSearchCV to the training data (first dataset)
random_search_logreg.fit(X_train_scaled, y_labels_first)

# Print the best parameters found by RandomizedSearchCV
print(f"Best Parameters (Logistic Regression): {random_search_logreg.best_params_}")

# Use the best logistic regression model from RandomizedSearchCV
best_logreg_model = random_search_logreg.best_estimator_
model = best_logreg_model



Best Parameters (Logistic Regression): {'estimator__warm_start': False, 'estimator__tol': 0.0001, 'estimator__solver': 'newton-cg', 'estimator__penalty': 'l2', 'estimator__max_iter': 2000, 'estimator__C': 100}


30 fits failed out of a total of 100.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
30 fits failed with the following error:
Traceback (most recent call last):
  File "/Users/milton/Documents/GitHub/cs421-project/venv/lib/python3.10/site-packages/sklearn/model_selection/_validation.py", line 888, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Users/milton/Documents/GitHub/cs421-project/venv/lib/python3.10/site-packages/sklearn/base.py", line 1473, in wrapper
    return fit_method(estimator, *args, **kwargs)
  File "/Users/milton/Documents/GitHub/cs421-project/venv/lib/python3.10/site-packages/sklearn/multiclass.py", line 370, in fit
    self.estimators_ = Parallel(n_jobs=self.n_jobs, verbose=self.verbose)(

In [5]:
data_third = np.load("third_batch_multi.npz")
X_third = data_third["X"]

df_X_third = pd.DataFrame(X_third, columns=["user", "item", "rating"])

# Engineer features for the third dataset
df_merged_third = engineer_features(df_X_third, top_features)

# Scale the features
X_third_scaled = scaler.transform(df_merged_third.drop(columns=["user"]))

# Predict probabilities for the third dataset
y_pred_proba_third = model.predict_proba(X_third_scaled)

# Create a DataFrame to hold user IDs and their corresponding class probabilities
df_predictions_third = pd.DataFrame(
    {
        "user": df_merged_third["user"],
        "z0": y_pred_proba_third[:, 0],
        "z1": y_pred_proba_third[:, 1],
        "z2": y_pred_proba_third[:, 2],
        "predicted_class": np.argmax(y_pred_proba_third, axis=1),
    }
)

df_predictions_third



Unnamed: 0,user,z0,z1,z2,predicted_class
0,2200,0.061739,0.929657,0.008604,1
1,2201,0.000362,0.946300,0.053339,1
2,2202,0.000468,0.978741,0.020791,1
3,2203,0.889322,0.110299,0.000379,0
4,2204,0.448965,0.535102,0.015933,1
...,...,...,...,...,...
1035,3235,0.006570,0.917614,0.075816,1
1036,3236,0.044705,0.954596,0.000698,1
1037,3237,0.000007,0.958334,0.041658,1
1038,3238,0.006783,0.976588,0.016629,1


In [6]:
df_final = df_predictions_third.drop(["user", "predicted_class"], axis="columns")
df_final

Unnamed: 0,z0,z1,z2
0,0.061739,0.929657,0.008604
1,0.000362,0.946300,0.053339
2,0.000468,0.978741,0.020791
3,0.889322,0.110299,0.000379
4,0.448965,0.535102,0.015933
...,...,...,...
1035,0.006570,0.917614,0.075816
1036,0.044705,0.954596,0.000698
1037,0.000007,0.958334,0.041658
1038,0.006783,0.976588,0.016629


In [9]:
np.savez(
    "./cs421-g1-team3-week10.npz",
    scores=df_final.to_numpy(),
)