# Pipeline Project

You will be using the provided data to create a machine learning model pipeline.

You must handle the data appropriately in your pipeline to predict whether an
item is recommended by a customer based on their review.
Note the data includes numerical, categorical, and text data.

You should ensure you properly train and evaluate your model.

## The Data

The dataset has been anonymized and cleaned of missing values.

There are 8 features to use to predict whether a customer recommends or does
not recommend a product.
The `Recommended IND` column indicates whether a customer recommends the product,
where `1` is recommended and `0` is not recommended.
This is your model's target.

The features can be summarized as the following:

- **Clothing ID**: Integer Categorical variable that refers to the specific piece being reviewed.
- **Age**: Positive Integer variable of the reviewer's age.
- **Title**: String variable for the title of the review.
- **Review Text**: String variable for the review body.
- **Positive Feedback Count**: Positive Integer documenting the number of other customers who found this review positive.
- **Division Name**: Categorical name of the product's high-level division.
- **Department Name**: Categorical name of the product's department.
- **Class Name**: Categorical name of the product's class.

The target:
- **Recommended IND**: Binary variable stating whether the customer recommends the product, where 1 is recommended and 0 is not recommended.

## Load Data

In [None]:
import pandas as pd

# Load data
df = pd.read_csv(
    'data/reviews.csv',
)

df.info()
df.head()

## Preparing features (`X`) & target (`y`)

In [None]:
data = df

# separate features from labels
X = data.drop('Recommended IND', axis=1)
y = data['Recommended IND'].copy()

print('Labels:', y.unique())
print('Features:')
display(X.head())

In [3]:
# Split data into train and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.1,
    shuffle=True,
    random_state=27,
)
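
If the target turns out to be imbalanced, a stratified split is one way to keep the class ratio the same in the train and test sets. The sketch below is a minimal illustration on made-up labels (`X_demo` and `y_demo` are hypothetical, not the review data), and assumes that passing `stratify` to `train_test_split` is appropriate for this problem.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels (80% ones, 20% zeros); not the real data
y_demo = np.array([1] * 800 + [0] * 200)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

# Same settings as the split above, plus stratify to preserve class ratios
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo,
    y_demo,
    test_size=0.1,
    shuffle=True,
    random_state=27,
    stratify=y_demo,
)

train_rate = y_tr.mean()
test_rate = y_te.mean()
print('Positive rate (train):', train_rate)
print('Positive rate (test): ', test_rate)
```

With the real data, the equivalent change would be adding `stratify=y` to the existing call.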

# Your Work

## Data Exploration

In [None]:
# Basic data exploration
import matplotlib.pyplot as plt
from IPython.display import display

display(df.head())
print('\nDataset summary:')
display(df.describe(include='all'))

print('\nTarget distribution (Recommended IND):')
display(df['Recommended IND'].value_counts(normalize=True))

# Compute review lengths for exploration without adding a column to df,
# so the modeling data stays unchanged if earlier cells are re-run
review_len = df['Review Text'].fillna('').str.len()
print('\nReview length (chars) summary:')
display(review_len.describe())

plt.figure(figsize=(8,4))
plt.hist(review_len, bins=50)
plt.xlabel('Review length (chars)')
plt.ylabel('Count')
plt.title('Distribution of review lengths')
plt.show()
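
Another exploration step that may be useful is the recommendation rate per category, which hints at whether the categorical features carry signal. The sketch below uses a small made-up frame (`toy` is hypothetical); with the real data the same `groupby` could be applied to `df`.

```python
import pandas as pd

# Hypothetical rows for illustration only; not taken from the dataset
toy = pd.DataFrame({
    'Department Name': ['Tops', 'Tops', 'Dresses', 'Dresses', 'Dresses'],
    'Recommended IND': [1, 0, 1, 1, 0],
})

# The mean of a 0/1 target per group is the share of recommended reviews
rate_by_department = (
    toy.groupby('Department Name')['Recommended IND']
    .mean()
    .sort_values()
)
print(rate_by_department)
```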

## Building Pipeline

In [None]:
# Build preprocessing + modeling pipeline
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, FunctionTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Column groups
text_cols = ['Title', 'Review Text']
numeric_cols = ['Age', 'Positive Feedback Count']
categorical_cols = ['Division Name', 'Department Name', 'Class Name', 'Clothing ID']

def combine_texts(X):
    # X may be a DataFrame slice
    return (X['Title'].fillna('') + ' ' + X['Review Text'].fillna('')).values

# Text pipeline: combine Title + Review Text then TF-IDF
text_pipeline = Pipeline([
    ('combine', FunctionTransformer(combine_texts, validate=False)),
    ('tfidf', TfidfVectorizer(stop_words='english', max_features=10000))
])

numeric_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

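categorical_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    # Keep the default sparse output: the old `sparse` argument was renamed to
    # `sparse_output` and removed in recent scikit-learn releases, and sparse
    # one-hot columns stack efficiently with the TF-IDF matrix
    ('ohe', OneHotEncoder(handle_unknown='ignore'))
])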

preprocessor = ColumnTransformer([
    ('text', text_pipeline, text_cols),
    ('num', numeric_pipeline, numeric_cols),
    ('cat', categorical_pipeline, categorical_cols),
], remainder='drop')

model_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('clf', LogisticRegression(max_iter=1000))
])

print('Pipeline components:')
print(model_pipeline)
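
To see what the preprocessor produces, it can help to run a reduced version on a few rows and check the shape of the result. The sketch below is a simplified stand-in for the pipeline above, not a replacement: the rows in `toy` are invented, `join_text` and `toy_preprocessor` are hypothetical names, and imputation is left out for brevity.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

# Hypothetical rows for illustration only
toy = pd.DataFrame({
    'Title': ['Great fit', None, 'Too small'],
    'Review Text': ['Love this dress', 'Fabric feels cheap', 'Runs small, returned it'],
    'Age': [34, 51, 28],
    'Class Name': ['Dresses', 'Knits', 'Dresses'],
})

def join_text(frame):
    # Merge both text columns into one string per row
    return (frame['Title'].fillna('') + ' ' + frame['Review Text'].fillna('')).values

toy_preprocessor = ColumnTransformer([
    ('text', Pipeline([
        ('combine', FunctionTransformer(join_text, validate=False)),
        ('tfidf', TfidfVectorizer()),
    ]), ['Title', 'Review Text']),
    ('num', StandardScaler(), ['Age']),
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['Class Name']),
])

features = toy_preprocessor.fit_transform(toy)

# Width = TF-IDF vocabulary size + 1 scaled numeric column + 2 one-hot columns
tfidf = toy_preprocessor.named_transformers_['text'].named_steps['tfidf']
n_text = len(tfidf.vocabulary_)
print('Feature matrix shape:', features.shape)
print('TF-IDF columns:', n_text)
```

Each transformer's output is stacked side by side, so the model sees one wide matrix per row regardless of the original column types.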

## Training Pipeline

In [None]:
# Train the pipeline on the training data and evaluate on test set
from sklearn.metrics import f1_score, classification_report

model_pipeline.fit(X_train, y_train)

pred_train = model_pipeline.predict(X_train)
pred_test = model_pipeline.predict(X_test)

print('Train F1:', f1_score(y_train, pred_train))
print('Test  F1:', f1_score(y_test, pred_test))

print('\nClassification report (test):')
print(classification_report(y_test, pred_test))
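
By default `f1_score` scores only the positive class (`pos_label=1`), so a high value can hide weak performance on the not-recommended class. The sketch below shows, on made-up predictions (`y_true` and `y_prob` are hypothetical), how a confusion matrix and a per-class F1 might complement the numbers above.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score

# Hypothetical labels and predicted probabilities, for illustration only
y_true = np.array([1, 1, 1, 1, 0, 0, 1, 0])
y_prob = np.array([0.9, 0.8, 0.6, 0.4, 0.7, 0.2, 0.95, 0.1])
y_hat = (y_prob >= 0.5).astype(int)  # default 0.5 decision threshold

# For binary labels {0, 1}, ravel() yields tn, fp, fn, tp in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()
print('TN, FP, FN, TP:', tn, fp, fn, tp)

f1_recommended = f1_score(y_true, y_hat)                   # class 1
f1_not_recommended = f1_score(y_true, y_hat, pos_label=0)  # class 0
print('F1 (recommended):    ', f1_recommended)
print('F1 (not recommended):', f1_not_recommended)
```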

## Fine-Tuning Pipeline

In [None]:
# Hyperparameter tuning with GridSearchCV
from sklearn.model_selection import GridSearchCV
import joblib

param_grid = {
    'clf__C': [0.01, 0.1, 1.0],
    'preprocessor__text__tfidf__max_features': [5000, 10000]
}

grid = GridSearchCV(model_pipeline, param_grid, cv=3, scoring='f1', n_jobs=-1)
grid.fit(X_train, y_train)

print('Best parameters:', grid.best_params_)

best_model = grid.best_estimator_
pred = best_model.predict(X_test)
print('\nFinal classification report (test):')
print(classification_report(y_test, pred))

# Save the tuned model
joblib.dump(best_model, 'tuned_pipeline.pkl')
print('Saved tuned pipeline to tuned_pipeline.pkl')
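
A saved pipeline can be restored later with `joblib.load` and used for prediction without refitting. The sketch below demonstrates the round trip on a tiny text-only pipeline with made-up reviews (`texts`, `labels`, and `demo` are hypothetical), writing to a temporary directory so no existing file is touched.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical reviews and labels, for illustration only
texts = ['love it', 'great fit', 'poor quality', 'returned it', 'love the fit', 'poor fit']
labels = [1, 1, 0, 0, 1, 0]

demo = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', LogisticRegression()),
])
demo.fit(texts, labels)

# Save and reload inside a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'demo_pipeline.pkl')
    joblib.dump(demo, path)
    restored = joblib.load(path)

# The restored pipeline should reproduce the original predictions
same = np.array_equal(demo.predict(texts), restored.predict(texts))
print('Predictions match after reload:', same)
```

Pickled pipelines are generally only reliable when loaded with the same scikit-learn version that saved them, and a pipeline that uses a custom function in a `FunctionTransformer` needs that function to be importable at load time.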