# <a name="0">Machine Learning Lab</a>

Build a classifier to predict the __label__ field (substitute or not substitute) of the product substitute dataset.

### Final Project Problem: Product Substitute Prediction

__Problem Definition__:
Given a pair of products, (A, B), we say that B is a "substitute" for A if a customer would buy B in place of A -- say, if A were out of stock.

The goal of this project is to predict a substitute relationship between pairs of products. Complete the tasks in this notebook and submit your notebook via Colab.

1. <a href="#1">Read the datasets</a> (Given)
2. <a href="#2">Data Processing</a> (Implement)
    * <a href="#21">Exploratory Data Analysis</a>
    * <a href="#22">Select features to build the model</a> (Suggested)
    * <a href="#23">Train - Validation - Test Datasets</a>
    * <a href="#24">Data Processing with Pipeline</a>
3. <a href="#3">Train (and Tune) a Classifier on the Training Dataset</a> (Implement)
4. <a href="#4">Make Predictions on the Test Dataset</a> (Implement)


__Datasets and Files:__


* __training.csv__: Training data with product pair features and corresponding labels:
> - `ID:` ID of the record
> - `label:` Tells whether the key and candidate products are substitutes (1) or not (0).
> - `key_asin ...:` Key product ASIN features
> - `cand_asin ...:` Candidate product ASIN features


* __public_test_features.csv__: Test data with product pair features __without__ labels:
> - `ID:` ID of the record
> - `key_asin ...:` Key product ASIN features
> - `cand_asin ...:` Candidate product ASIN features


* __metadata-dataset.xlsx__: Provides detailed information about all key_ and cand_ columns in the training and test sets. Try to select some useful features to include in the model, as not all of them are suitable. `|Region Id|MarketPlace Id|ASIN|Binding Code|binding_description|brand_code|case_pack_quantity|, ...`


## 1. <a name="1">Read the datasets</a> (Given)
(<a href="#0">Go to top</a>)
<br>
<a href="https://propensity-labs-screening.s3.amazonaws.com/machine_learning/ml_data.zip">Download Dataset</a>

Then, we read the __training__ and __test__ datasets into dataframes.

In [41]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_selection import SelectKBest, f_classif

# Load the dataset
data_path = '/content/drive/MyDrive/Karini_Ai_test/ml_data/ml_data/training.csv'  # Replace with the path to your dataset
df = pd.read_csv(data_path)








## 2. <a name="2">Data Processing</a> (Implement)
(<a href="#0">Go to top</a>)

### 2.1 <a name="21">Exploratory Data Analysis</a>

We look at the number of rows and columns, and at some simple statistics of the datasets.

In [42]:
# Implement EDA here

# Separate features (X) and target variable (y)
X = df.drop(columns=['label'])  # Features
y = df['label']  # Target variable
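
The cell above only separates features from the target, so here is a minimal sketch of the checks the description mentions (row and column counts, class balance, missing values, simple statistics). It runs on a small made-up frame; the column names `key_item_weight`, `cand_item_weight`, and `key_brand_code` and the helper `summarize` are illustrative assumptions, not confirmed names from `training.csv`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for training.csv; the feature column names are assumptions
toy = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5, 6],
    "label": [1, 0, 0, 1, 0, 0],
    "key_item_weight": [1.2, np.nan, 0.8, 2.5, np.nan, 1.1],
    "cand_item_weight": [1.0, 0.9, np.nan, 2.4, 3.0, 1.3],
    "key_brand_code": ["ACME", "ACME", None, "ZETA", "ZETA", "ACME"],
})


def summarize(frame, target="label"):
    """Collect basic EDA facts: shape, class balance, missing share, dtype counts."""
    return {
        "shape": frame.shape,
        "class_balance": frame[target].value_counts(normalize=True).sort_index(),
        "missing_share": frame.isna().mean().sort_values(ascending=False),
        "dtypes": frame.dtypes.value_counts(),
    }


summary = summarize(toy)
print("Rows, columns:", summary["shape"])
print("Class balance:")
print(summary["class_balance"])
print("Share of missing values per column:")
print(summary["missing_share"])
print(toy.describe())  # simple statistics for the numeric columns
```

On the real data, calling `summarize(df)` would show how imbalanced the labels are and which columns are mostly empty, which helps when choosing features in the next step.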


### 2.2 <a name="22">Select features to build the model</a>

For a quick start, we recommend using only a few of the numerical features for both key_ and cand_ ASINs: __item_package_quantity__, __item_height__, __item_width__, __item_length__, __item_weight__, __pkg_height__, __pkg_width__, __pkg_length__, __pkg_weight__. Feel free to explore other fields from the metadata-dataset.xlsx file.
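
As a rough sketch of this quick start, the suggested base names can be expanded with the `key_` and `cand_` prefixes and filtered down to the columns that actually exist. The exact column naming (prefix directly followed by the base name) is an assumption that should be checked against metadata-dataset.xlsx; `suggested_columns` is a hypothetical helper and the toy frame stands in for the real training data.

```python
import numpy as np
import pandas as pd

# Base names taken from the recommendation above
BASE_FEATURES = [
    "item_package_quantity", "item_height", "item_width", "item_length",
    "item_weight", "pkg_height", "pkg_width", "pkg_length", "pkg_weight",
]


def suggested_columns(frame, prefixes=("key_", "cand_")):
    """Return the prefixed feature columns that are present in the frame.

    Assumes columns are named prefix + base name, e.g. 'key_item_weight'.
    """
    wanted = [prefix + name for prefix in prefixes for name in BASE_FEATURES]
    return [col for col in wanted if col in frame.columns]


# Toy frame with only a few of the columns; real data would come from training.csv
toy = pd.DataFrame({
    "ID": [1, 2],
    "label": [1, 0],
    "key_item_weight": [1.2, np.nan],
    "cand_item_weight": [1.0, 0.9],
    "key_pkg_height": [3.0, 4.5],
    "key_brand_code": ["ACME", "ZETA"],
})

feature_cols = suggested_columns(toy)
X_small = toy[feature_cols]
print(feature_cols)
```

Filtering against `frame.columns` keeps the selection from failing when a suggested field is absent or spelled differently in the actual files.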


In [52]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.impute import SimpleImputer

# Load the dataset
data_path = r'/content/drive/MyDrive/Karini_Ai_test/ml_data/ml_data/training.csv'  # Replace with the path to your dataset
df = pd.read_csv(data_path)

# Separate features (X) and target variable (y)
X = df.drop(columns=['label'])  # Features
y = df['label']  # Target variable

# Handle missing values for numerical features with SimpleImputer.
# SimpleImputer drops columns that are entirely NaN, so building a DataFrame
# with the original column list raised:
#   ValueError: Shape of passed values is (36803, 77), indices imply (36803, 101)
# Dropping the all-NaN columns first keeps the column names aligned with the output.
numeric_features = X.select_dtypes(include=np.number).dropna(axis=1, how='all').columns
imputer_numeric = SimpleImputer(strategy='mean')
X_numeric_imputed = pd.DataFrame(
    imputer_numeric.fit_transform(X[numeric_features]),
    columns=numeric_features,
    index=X.index,
)

# Correlation and the ANOVA F-test only apply to numeric columns,
# so the remaining non-numeric columns are left out of this step
X_final = X_numeric_imputed

# Calculate correlation matrix
corr_matrix = X_final.corr()

# Plot correlation matrix heatmap
plt.figure(figsize=(12, 10))
sns.heatmap(corr_matrix, annot=False, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()

# SelectKBest feature selection (k is capped at the number of available columns)
k = min(10, X_final.shape[1])
selector = SelectKBest(score_func=f_classif, k=k)
X_selected = selector.fit_transform(X_final, y)
selected_features = X_final.columns[selector.get_support()]

# Plot the scores of features
plt.figure(figsize=(10, 6))
scores = -np.log10(selector.pvalues_)
plt.bar(range(len(X_final.columns)), scores)
plt.xticks(range(len(X_final.columns)), X_final.columns, rotation='vertical')
plt.xlabel('Features')
plt.ylabel('-log10(p-value)')
plt.title('Feature Importance Scores')
plt.show()

# Print selected features
print("Selected Features:", selected_features)

### 2.3 <a name="23">Train - Validation Datasets</a>
(<a href="#2">Go to Data Processing</a>)

We already have training and test datasets; however, the test dataset is missing the labels, and the goal of the project is to predict them.

To produce a validation set for evaluating model performance before submitting, split the training dataset into train and validation sets. The validation data you get here will be used later in section 3 to tune your classifier.

In [32]:
# Implement here
from sklearn.preprocessing import OneHotEncoder

# X is the feature matrix with both numeric and categorical features (defined above)

# Separate categorical and numeric features
categorical_features = X.select_dtypes(include=['object']).columns
numeric_features = X.select_dtypes(exclude=['object']).columns

# Convert categorical values to strings so mixed-type columns encode consistently
# (missing values become the string 'nan')
X[categorical_features] = X[categorical_features].astype(str)
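
The split itself is not shown in the cell above, so the following is a minimal sketch of one common way to do it with scikit-learn's `train_test_split`. The 80/20 ratio, the `random_state`, and the toy data are assumptions for illustration; `stratify` keeps the share of substitute pairs the same in both parts.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy features and an imbalanced toy label (80 zeros, 20 ones)
rng = np.random.default_rng(0)
toy_X = pd.DataFrame({
    "key_item_weight": rng.normal(size=100),
    "cand_item_weight": rng.normal(size=100),
})
toy_y = pd.Series([0] * 80 + [1] * 20, name="label")

# Hold out 20% of the rows for validation, preserving the class ratio
X_tr, X_val, y_tr, y_val = train_test_split(
    toy_X, toy_y, test_size=0.2, stratify=toy_y, random_state=42
)

print("Train:", X_tr.shape, "Validation:", X_val.shape)
print("Share of positives in validation:", y_val.mean())
```

With the real data, the same call would take `X` and `y` from the earlier cells in place of the toy frame.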


### 2.4 <a name="24">Data processing with Pipeline</a>

Build a [pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html)  to impute the missing values and scale the numerical features, and finally train the classifier on the imputed and scaled dataset.


In [34]:
# Implement here
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import OneHotEncoder

# Check if there are categorical features before one-hot encoding
if len(categorical_features) > 0:
    # One-hot encode categorical features (the result is a sparse matrix)
    encoder = OneHotEncoder(handle_unknown='ignore')
    X_categorical_encoded = encoder.fit_transform(X[categorical_features])

    # Check if there are features after one-hot encoding
    if X_categorical_encoded.shape[1] > 0:
        # Reduce dimensionality of the sparse matrix;
        # n_components cannot exceed the number of encoded columns
        n_components = min(100, X_categorical_encoded.shape[1])  # Adjust as needed
        svd = TruncatedSVD(n_components=n_components)
        X_reduced = svd.fit_transform(X_categorical_encoded)
    else:
        print("No features remaining after one-hot encoding.")
else:
    print("No categorical features found.")


No categorical features found.
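
The cell above handles categorical encoding but does not build the pipeline this section asks for. A minimal sketch of such a pipeline, assuming only numeric features are used, chains mean imputation, standard scaling, and a classifier. The toy data, the column names, and the choice of `RandomForestClassifier` with these settings are illustrative assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy numeric features; the label depends on the sign of the first column
rng = np.random.default_rng(0)
toy_X = pd.DataFrame(
    rng.normal(size=(200, 3)),
    columns=["key_item_weight", "cand_item_weight", "key_pkg_height"],
)
toy_y = (toy_X["key_item_weight"] > 0).astype(int)
toy_X.iloc[::10, 1] = np.nan  # inject missing values into the second column

# Each step is fitted on the training data only and reused at prediction time
pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler()),
    ("clf", RandomForestClassifier(n_estimators=50, random_state=0)),
])

pipe.fit(toy_X, toy_y)
train_accuracy = pipe.score(toy_X, toy_y)
print("Training accuracy:", train_accuracy)
```

Wrapping the steps in one object means the imputer means and scaler statistics learned during `fit` are applied unchanged when calling `pipe.predict` on validation or test data.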


## 3. <a name="3">Train (and Tune) a Classifier</a> (Implement)
(<a href="#0">Go to top</a>)

Train and tune the classifier
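
For the tuning part, one possible approach is a small grid search. The sketch below uses `GridSearchCV` with 3-fold cross-validation on the training split, which is a swapped-in alternative to scoring against a single validation set. The parameter grid, the toy data, and the feature names `f0` to `f3` are assumptions chosen only to keep the example short.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Toy data: the label depends on the sum of the first two features
rng = np.random.default_rng(1)
toy_X = pd.DataFrame(rng.normal(size=(120, 4)), columns=[f"f{i}" for i in range(4)])
toy_y = (toy_X["f0"] + toy_X["f1"] > 0).astype(int)

pipe = Pipeline([
    ("imputer", SimpleImputer(strategy="mean")),
    ("clf", RandomForestClassifier(random_state=0)),
])

# Parameters of a pipeline step are addressed as '<step name>__<parameter>'
param_grid = {
    "clf__n_estimators": [20, 50],
    "clf__max_depth": [3, None],
}

search = GridSearchCV(pipe, param_grid, cv=3, scoring="accuracy")
search.fit(toy_X, toy_y)

print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", round(search.best_score_, 3))
```

After fitting, `search.best_estimator_` is a pipeline refit on all the data passed to `fit`, so it can be used directly for prediction.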

In [None]:
# Implement here
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier

# Paths to the training and testing datasets (replace with the paths to your files)
train_file = r'C:\Users\tejas\Downloads\ml_data\training.csv'
test_file = r'C:\Users\tejas\Downloads\ml_data\public_test_features.csv'


# Define chunk size for incremental loading
chunk_size = 10000  # Adjust the chunk size as needed based on your available memory

# Initialize empty lists to store chunk-wise processed data
train_chunks = []
test_chunks = []

# Iterate over chunks of training data
for chunk in pd.read_csv(train_file, chunksize=chunk_size, low_memory=False):
    # Fill missing numeric values with the chunk mean; numeric_only=True skips
    # text columns, which would otherwise raise an error in recent pandas versions
    chunk.fillna(chunk.mean(numeric_only=True), inplace=True)
    # Append preprocessed chunk to the list
    train_chunks.append(chunk)

# Concatenate all preprocessed training data chunks into a single dataframe
train_data = pd.concat(train_chunks, ignore_index=True)

# Iterate over chunks of testing data (same preprocessing as the training data)
for chunk in pd.read_csv(test_file, chunksize=chunk_size, low_memory=False):
    chunk.fillna(chunk.mean(numeric_only=True), inplace=True)
    # Append preprocessed chunk to the list
    test_chunks.append(chunk)

# Concatenate all preprocessed testing data chunks into a single dataframe
test_data = pd.concat(test_chunks, ignore_index=True)





# Step 2: Separate features and target variable in the training data
X_train = train_data.drop(columns=['label'])
y_train = train_data['label']

# Drop non-numeric columns from the feature matrix
X_train_numeric = X_train.select_dtypes(include=['number'])



# Step 3: Preprocess the training data
# Handle missing values
imputer = SimpleImputer(strategy='mean')
X_train_imputed = imputer.fit_transform(X_train_numeric)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_imputed)

# Step 4: Train a predictive model
model = RandomForestClassifier()
model.fit(X_train_scaled, y_train)

# Step 5: Preprocess the testing data
# Keep exactly the numeric columns used for training, in the same order.
# reindex drops any additional columns and creates missing ones filled with NaN,
# which the imputer then replaces with the training mean.
X_test = test_data.reindex(columns=X_train_numeric.columns)

# Handle missing values (use the same imputer as in training)
X_test_imputed = imputer.transform(X_test)

# Scale features (use the same scaler as in training)
X_test_scaled = scaler.transform(X_test_imputed)


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


## 4. <a name="4">Make Predictions on the Test Dataset</a> (Implement)
(<a href="#0">Go to top</a>)

Use the trained classifier to predict the labels on the test set. Test accuracy will be displayed upon a valid submission to the leaderboard.

In [None]:
# Implement here

# Get test data to test the classifier
# ! test data should come from public_test_features.csv !
# ...

# Use the trained model to make predictions on the test dataset
# test_predictions = ...


# Step 6: Predict labels for the testing data
y_pred = model.predict(X_test_scaled)

# Step 7: Output the predicted labels
output = pd.DataFrame({'ID': test_data['ID'], 'Predicted_Label': y_pred})
output.to_csv('predicted_labels.csv', index=False)
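
Before submitting, it can help to sanity-check the output frame. The sketch below is one hypothetical way to do that: `check_submission` is an illustrative helper, not part of the assignment, and the rules it checks (one row per test ID, labels restricted to 0 and 1) are assumptions based on the dataset description rather than a documented submission format.

```python
import pandas as pd


def check_submission(output, test_ids, id_col="ID", label_col="Predicted_Label"):
    """Return a list of problems found in a prediction frame (empty if none)."""
    problems = []
    if len(output) != len(test_ids):
        problems.append("row count differs from the test set")
    if output[id_col].duplicated().any():
        problems.append("duplicate IDs")
    if set(output[id_col]) != set(test_ids):
        problems.append("IDs do not match the test set")
    if not set(output[label_col].unique()) <= {0, 1}:
        problems.append("labels outside {0, 1}")
    return problems


# Toy examples: one well-formed frame and one with several issues
test_ids = [10, 11, 12]
good = pd.DataFrame({"ID": [10, 11, 12], "Predicted_Label": [1, 0, 0]})
bad = pd.DataFrame({"ID": [10, 10, 12], "Predicted_Label": [1, 2, 0]})

good_problems = check_submission(good, test_ids)
bad_problems = check_submission(bad, test_ids)
print(good_problems)
print(bad_problems)
```

With the frames from the cell above, the call would be `check_submission(output, test_data['ID'])`.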
