# Pipeline Project

You will be using the provided data to create a machine learning model pipeline.

You must handle the data appropriately in your pipeline to predict whether an
item is recommended by a customer based on their review.
Note the data includes numerical, categorical, and text data.

You should ensure you properly train and evaluate your model.

## The Data

The dataset has been anonymized and cleaned of missing values.

There are 8 features to use to predict whether a customer recommends or does
not recommend a product.
The `Recommended IND` column gives whether a customer recommends the product,
where `1` is recommended and `0` is not recommended.
This is your model's target.

The features can be summarized as the following:

- **Clothing ID**: Integer Categorical variable that refers to the specific piece being reviewed.
- **Age**: Positive Integer variable of the reviewer's age.
- **Title**: String variable for the title of the review.
- **Review Text**: String variable for the review body.
- **Positive Feedback Count**: Positive Integer documenting the number of other customers who found this review positive.
- **Division Name**: Categorical name of the product's high-level division.
- **Department Name**: Categorical name of the product's department.
- **Class Name**: Categorical name of the product's class.

The target:
- **Recommended IND**: Binary variable stating whether the customer recommends the product, where 1 is recommended and 0 is not recommended.

# Preparation
Download the spaCy English model and load it.

In [1]:
 ! python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.7.1
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1-py3-none-any.whl (12.8 MB)

[notice] A new release of pip is available: 23.0.1 -> 25.0.1
[notice] To update, run: pip install --upgrade pip
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [2]:
import spacy

nlp = spacy.load('en_core_web_sm')

## Load Data

In [3]:
import pandas as pd

# Load data
df = pd.read_csv(
    'data/reviews.csv',
)

df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18442 entries, 0 to 18441
Data columns (total 9 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Clothing ID              18442 non-null  int64 
 1   Age                      18442 non-null  int64 
 2   Title                    18442 non-null  object
 3   Review Text              18442 non-null  object
 4   Positive Feedback Count  18442 non-null  int64 
 5   Division Name            18442 non-null  object
 6   Department Name          18442 non-null  object
 7   Class Name               18442 non-null  object
 8   Recommended IND          18442 non-null  int64 
dtypes: int64(4), object(5)
memory usage: 1.3+ MB


| | Clothing ID | Age | Title | Review Text | Positive Feedback Count | Division Name | Department Name | Class Name | Recommended IND |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1077 | 60 | Some major design flaws | I had such high hopes for this dress and reall... | 0 | General | Dresses | Dresses | 0 |
| 1 | 1049 | 50 | My favorite buy! | I love, love, love this jumpsuit. it's fun, fl... | 0 | General Petite | Bottoms | Pants | 1 |
| 2 | 847 | 47 | Flattering shirt | This shirt is very flattering to all due to th... | 6 | General | Tops | Blouses | 1 |
| 3 | 1080 | 49 | Not for the very petite | I love tracy reese dresses, but this one is no... | 4 | General | Dresses | Dresses | 0 |
| 4 | 858 | 39 | Cagrcoal shimmer fun | I aded this in my basket at hte last mintue to... | 1 | General Petite | Tops | Knits | 1 |


## Preparing features (`X`) & target (`y`)

In [4]:
data = df

# separate features from labels
X = data.drop('Recommended IND', axis=1)
y = data['Recommended IND'].copy()

print('Labels:', y.unique())
print('Features:')
display(X.head())

Labels: [0 1]
Features:


| | Clothing ID | Age | Title | Review Text | Positive Feedback Count | Division Name | Department Name | Class Name |
|---|---|---|---|---|---|---|---|---|
| 0 | 1077 | 60 | Some major design flaws | I had such high hopes for this dress and reall... | 0 | General | Dresses | Dresses |
| 1 | 1049 | 50 | My favorite buy! | I love, love, love this jumpsuit. it's fun, fl... | 0 | General Petite | Bottoms | Pants |
| 2 | 847 | 47 | Flattering shirt | This shirt is very flattering to all due to th... | 6 | General | Tops | Blouses |
| 3 | 1080 | 49 | Not for the very petite | I love tracy reese dresses, but this one is no... | 4 | General | Dresses | Dresses |
| 4 | 858 | 39 | Cagrcoal shimmer fun | I aded this in my basket at hte last mintue to... | 1 | General Petite | Tops | Knits |


In [5]:
# Split data into train and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.1,
    shuffle=True,
    random_state=27,
)
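The split above shuffles rows without regard to the label. Since the target is binary and, per the summary statistics in this notebook, about 82% of rows are labeled `1`, one option (not used above) is to pass `stratify` so that both subsets keep the same class ratio. The sketch below uses made-up toy data; `X_toy` and `y_toy` are illustrative names, not part of the project.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced target: 80 positives and 20 negatives (illustrative values only)
y_toy = np.array([1] * 80 + [0] * 20)
X_toy = np.arange(100).reshape(-1, 1)

# stratify=y_toy keeps the 80/20 class ratio in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy,
    y_toy,
    test_size=0.1,
    shuffle=True,
    random_state=27,
    stratify=y_toy,
)

print("train positive rate:", y_tr.mean())
print("test positive rate:", y_te.mean())
```

With only 10% of the data held out, a stratified split can make the test accuracy a little less sensitive to how many negative reviews happen to land in the test set.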

## Data Exploration

### Summary statistics

In [6]:
# Displaying the number of records and features
print(df.shape)
print("-" * 50)
# Displaying information about the dataset
print(df.info())
print("-" * 50)
# Displaying statistics about the dataset for numerical datatypes
df.describe()

(18442, 9)
--------------------------------------------------
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18442 entries, 0 to 18441
Data columns (total 9 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Clothing ID              18442 non-null  int64 
 1   Age                      18442 non-null  int64 
 2   Title                    18442 non-null  object
 3   Review Text              18442 non-null  object
 4   Positive Feedback Count  18442 non-null  int64 
 5   Division Name            18442 non-null  object
 6   Department Name          18442 non-null  object
 7   Class Name               18442 non-null  object
 8   Recommended IND          18442 non-null  int64 
dtypes: int64(4), object(5)
memory usage: 1.3+ MB
None
--------------------------------------------------


| | Clothing ID | Age | Positive Feedback Count | Recommended IND |
|---|---|---|---|---|
| count | 18442.0 | 18442.0 | 18442.0 | 18442.0 |
| mean | 954.896757 | 43.383635 | 2.697484 | 0.816235 |
| std | 141.571783 | 12.246264 | 5.94222 | 0.387303 |
| min | 2.0 | 18.0 | 0.0 | 0.0 |
| 25% | 863.0 | 34.0 | 0.0 | 1.0 |
| 50% | 952.0 | 41.0 | 1.0 | 1.0 |
| 75% | 1078.0 | 52.0 | 3.0 | 1.0 |
| max | 1205.0 | 99.0 | 122.0 | 1.0 |


In [7]:
# Displaying statistics about the object datatypes
df.describe(include="object")

| | Title | Review Text | Division Name | Department Name | Class Name |
|---|---|---|---|---|---|
| count | 18442 | 18442 | 18442 | 18442 | 18442 |
| unique | 13142 | 18439 | 2 | 6 | 14 |
| top | Love it! | The sweater and skirt are so pretty! they're r... | General | Tops | Dresses |
| freq | 129 | 2 | 11664 | 8713 | 5371 |


In [8]:
# Displaying the number of missing values
df.isna().sum()

Clothing ID                0
Age                        0
Title                      0
Review Text                0
Positive Feedback Count    0
Division Name              0
Department Name            0
Class Name                 0
Recommended IND            0
dtype: int64

In [9]:
# Unique values of the categorical columns
print(df['Department Name'].unique().tolist())
print(df['Division Name'].unique().tolist())
print(df['Class Name'].unique().tolist())

['Dresses', 'Bottoms', 'Tops', 'Jackets', 'Trend', 'Intimate']
['General', 'General Petite']
['Dresses', 'Pants', 'Blouses', 'Knits', 'Outerwear', 'Sweaters', 'Skirts', 'Fine gauge', 'Jackets', 'Trend', 'Lounge', 'Jeans', 'Shorts', 'Casual bottoms']


## Building Pipeline
Here we will create three pipelines for preprocessing the features:
- Numerical pipeline
- Categorical pipeline
- Textual pipeline

In [15]:
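# Drop the ID column: it identifies the item and is not used as a feature
# (the feature listing below contains only Age and Positive Feedback Count as numerical)
X_train = X_train.drop(columns=['Clothing ID'])
X_test = X_test.drop(columns=['Clothing ID'])

# Combine the review title and body into a single text feature
X_train['Full Review'] = X_train['Title'] + " " + X_train['Review Text']
X_test['Full Review'] = X_test['Title'] + " " + X_test['Review Text']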

In [16]:
X_train.drop(columns=['Title', 'Review Text'], inplace=True)
X_test.drop(columns=['Title', 'Review Text'], inplace=True)

In [21]:
# split data into numerical, categorical, and text features
numerical_features = X_train.select_dtypes(include=["number"]).columns
print("Numerical features:", numerical_features)

categorical_features = X_train[["Department Name", "Division Name", "Class Name"]].columns
print("Categorical features:", categorical_features)

text_features = X_train[["Full Review"]].columns
print("Text features:", text_features)

Numerical features: Index(['Age', 'Positive Feedback Count'], dtype='object')
Categorical features: Index(['Department Name', 'Division Name', 'Class Name'], dtype='object')
Text features: Index(['Full Review'], dtype='object')


### Numerical Features Pipeline

In [22]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Define a numerical pipeline
num_pipeline = Pipeline(
    steps=[
        # Step 1: Impute missing values with the mean of each column
        (
            "imputer",
            SimpleImputer(strategy="mean"),
        ),  # This will fill any missing (NaN) values with the mean of the respective column
        # Step 2: Scale the data (standardize) to have a mean of 0 and a standard deviation of 1
        (
            "scaler",
            StandardScaler(),
        ),  # StandardScaler will scale the numerical features to standardize them
    ]
)

# Display the pipeline
num_pipeline
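To see what the two steps do, here is a minimal sketch on a tiny made-up array with one missing value. The names `toy_num_pipeline` and `toy` are illustrative only; the steps mirror the imputer and scaler defined above.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Same two steps as the numerical pipeline: mean imputation, then standardization
toy_num_pipeline = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="mean")),
        ("scaler", StandardScaler()),
    ]
)

# Two numeric columns; the second has one missing value (made-up numbers)
toy = np.array([[20.0, 0.0], [40.0, np.nan], [60.0, 4.0]])
scaled = toy_num_pipeline.fit_transform(toy)

print(scaled)
print("column means:", scaled.mean(axis=0))
print("column stds:", scaled.std(axis=0))
```

The missing entry is first replaced by its column mean, so after scaling it sits at 0, and each column ends up with mean 0 and standard deviation 1.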

### Categorical Features Pipeline

In [23]:
from sklearn.preprocessing import OneHotEncoder

cat_pipeline = Pipeline(steps=[('imputer', SimpleImputer(strategy='most_frequent')),
                              ('encoder', OneHotEncoder(handle_unknown='ignore', sparse_output=False))])
cat_pipeline

### Textual Features Pipeline
For the text pipeline we will use the spaCy library.

We will create a `Length()` transformer using `BaseEstimator` and `TransformerMixin`.
This transformer calculates the length of each text as a word count.

In [24]:
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


# Custom transformer to calculate the length of the text in terms of word count
class Length(BaseEstimator, TransformerMixin):

    # Constructor method to initialize the transformer (no parameters in this case)
    def __init__(self):
        super().__init__()

    def fit(self, X, y=None):
        # We don't need to learn anything from the data, just return self
        return self

    # It takes the input data (X) and returns the transformed data.
    def transform(self, X):
        # Calculate word count for each text
        word_counts = [[len(text.split())] for text in X]
        
        # Convert the result to a NumPy array
        return np.array(word_counts)
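As a quick check of the idea, here is a self-contained sketch of a word-count transformer applied to a few short strings. `WordCount` is a stand-in name used only for this example; it follows the same logic as `Length` above.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


class WordCount(BaseEstimator, TransformerMixin):
    """Illustrative stand-in for the Length transformer."""

    def fit(self, X, y=None):
        # Stateless: nothing is learned from the data
        return self

    def transform(self, X):
        # One row per text, one column holding the whitespace-separated word count
        return np.array([[len(text.split())] for text in X])


texts = ["My favorite buy!", "Flattering shirt", "Love it!"]
counts = WordCount().fit_transform(texts)

print(counts.shape)    # (3, 1)
print(counts.ravel())  # [3 2 2]
```

Returning a 2D array with a single column lets the result be stacked next to the other feature blocks.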

Create the pipeline for counting the number of words

In [27]:
from sklearn.preprocessing import FunctionTransformer
import numpy as np

# The text column is selected as a 2D, single-column input.
# This pipeline flattens it to 1D, the shape the text transformers below expect.
initial_text_preprocess = Pipeline(
    [
        (
            "dimension_reshaper",
            FunctionTransformer(
                np.reshape,  # The function being applied, which reshapes the data
                kw_args={"newshape": -1},  # Reshape the data to a 1D array (flatten it)
            ),
        ),
    ]
)

# Pipeline for calculating the word count of each text entry
# (despite its name, it counts words, not characters).
character_counts_pipeline = Pipeline(
    [
        (
            "initial_text_preprocess",  # Apply the initial text preprocessing (reshape)
            initial_text_preprocess,  # The preprocessing pipeline defined above
        ),
        (
            "word_count", 
            Length(),  # Length is a custom transformer that calculates the number of words
        ),
    ]
)

# This is the combined pipeline that first reshapes the data and then calculates the word count
character_counts_pipeline
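The reshaping step exists because selecting a column with a list of names yields a 2D, single-column input, while text transformers iterate over a 1D sequence of strings. The sketch below shows the shape change on made-up reviews. It uses `np.ravel` as a stand-in for the reshape call so that it does not depend on the keyword name: to our knowledge, recent NumPy releases prefer `shape` over `newshape`, so the `newshape` keyword used above may emit a deprecation warning there.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import FunctionTransformer

# Made-up reviews in a single text column
frame = pd.DataFrame({"Full Review": ["great dress", "too small", "love it"]})

# Selecting with a list of column names gives a 2D array of shape (n_rows, 1)
two_d = frame[["Full Review"]].to_numpy()

# Flatten to 1D so each element is one document
flattener = FunctionTransformer(np.ravel)
flat = flattener.fit_transform(two_d)

print(two_d.shape)  # (3, 1)
print(flat.shape)   # (3,)
```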

We will create a `SpacyLemmatizer()` transformer that maps each word to its base form (lemma), so that different inflections of the same word are counted as one term by `TF-IDF`.

In [29]:
from sklearn.base import BaseEstimator, TransformerMixin


# Custom transformer for lemmatizing text using spaCy
class SpacyLemmatizer(BaseEstimator, TransformerMixin):

    # Constructor method to initialize the transformer with a spaCy model
    def __init__(self, nlp):
        self.nlp = nlp  # The spaCy language model (e.g., 'en_core_web_sm')

    def fit(self, X, y=None):
        return self  # Nothing to fit, so we simply return 'self'

    # The transform method applies the lemmatization to each document in the input (X)
    def transform(self, X):
        # Process each text in the input using the spaCy pipeline (`nlp.pipe` processes texts in batches)
        # `nlp.pipe(X)` returns a generator of processed spaCy documents
        # For each processed document, extract the lemmatized form of each token (excluding stop words)
        lemmatized = [
            " ".join(
                token.lemma_ for token in doc if not token.is_stop
            )  # Join lemmas of tokens without stop words
            for doc in self.nlp.pipe(X)  # Process each document in the batch
        ]
        return lemmatized  # Return the list of lemmatized texts

Create the TF-IDF pipeline

In [35]:
from sklearn.feature_extraction.text import TfidfVectorizer


tfidf_pipeline = Pipeline(
    [
        (
            "dimension_reshaper",
            FunctionTransformer(
                np.reshape,
                kw_args={"newshape": -1},
            ),
        ),
        (
            "lemmatizer",
            SpacyLemmatizer(nlp=nlp),
        ),
        (
            "tfidf_vectorizer",
            TfidfVectorizer(
                stop_words="english",
                max_features=100,
                dtype=np.float32,
                min_df=5,
            ),
        ),
    ]
)
tfidf_pipeline
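The `min_df` and `max_features` arguments both shrink the vocabulary: `min_df` drops terms that appear in too few documents, and `max_features` then keeps only the most frequent remaining terms. A small sketch on a made-up corpus, with much smaller thresholds than the ones used above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up, already lemmatized mini corpus
corpus = [
    "love dress fit",
    "love dress color",
    "dress fit small",
    "love fabric",
]

# min_df=2 keeps only terms that occur in at least two documents
min_df_only = TfidfVectorizer(min_df=2).fit(corpus)
vocab_min_df = sorted(min_df_only.vocabulary_)

# max_features=2 additionally keeps only the two most frequent of those terms
limited = TfidfVectorizer(min_df=2, max_features=2)
matrix = limited.fit_transform(corpus)
vocab_limited = sorted(limited.vocabulary_)

print(vocab_min_df)   # ['dress', 'fit', 'love']
print(vocab_limited)  # ['dress', 'love']
print(matrix.shape)   # (4, 2)
```

With `max_features=100`, as in the pipeline above, each review is represented by at most 100 TF-IDF columns, which keeps the feature matrix small at the cost of ignoring rarer words.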

### Combine Feature Engineering Pipelines

In [36]:
from sklearn.compose import ColumnTransformer

feature_engineering = ColumnTransformer([
        ('num', num_pipeline, numerical_features),
        ('cat', cat_pipeline, categorical_features),
        ('character_counts', character_counts_pipeline, text_features),
        ('tfidf_text', tfidf_pipeline, text_features),
])

feature_engineering

## Training Pipeline
Now we will combine our feature engineering pipeline with a Random Forest Classifier.

In [37]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

model_pipeline = make_pipeline(
    feature_engineering,
    RandomForestClassifier(random_state=27),
)

model_pipeline.fit(X_train, y_train)

### Model Evaluation

In [38]:
from sklearn.metrics import accuracy_score

y_pred_forest_pipeline = model_pipeline.predict(X_test)
accuracy_forest_pipeline = accuracy_score(y_test, y_pred_forest_pipeline)

print('Accuracy:', accuracy_forest_pipeline)

Accuracy: 0.8558265582655826
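Accuracy alone can be hard to interpret here: the summary statistics give a mean of about 0.816 for `Recommended IND`, so a model that always predicts `1` would already score around 0.82 on the full dataset. A hedged sketch of how one might compare against such a baseline and look at the confusion matrix is shown below. The labels and predictions are made up for illustration and are not results from this project.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels (8 positives, 2 negatives) and hypothetical model predictions
y_true = np.array([1, 1, 1, 1, 1, 1, 1, 1, 0, 0])
y_model = np.array([1, 1, 1, 1, 1, 1, 1, 1, 1, 0])

# Baseline that always predicts the most frequent class
X_dummy = np.zeros((len(y_true), 1))  # features are ignored by this baseline
baseline = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_true)
y_base = baseline.predict(X_dummy)

acc_base = accuracy_score(y_true, y_base)
acc_model = accuracy_score(y_true, y_model)

# Rows are true classes (0, 1); columns are predicted classes (0, 1)
cm = confusion_matrix(y_true, y_model)

print("baseline accuracy:", acc_base)
print("model accuracy:", acc_model)
print(cm)
```

On the real data, the same comparison could be run with `y_test` and the pipeline's predictions to see how much of the accuracy comes from the class imbalance.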


## Fine-Tuning Pipeline

In [45]:
# Step 1: Define parameter grid
from sklearn.model_selection import RandomizedSearchCV


param_grid = {
    'randomforestclassifier__n_estimators': [50, 100],
    'randomforestclassifier__max_depth': [10, 20],
    'randomforestclassifier__min_samples_split': [2, 5],
}


random_search = RandomizedSearchCV(
    model_pipeline,
    param_grid,
    cv=2,
    scoring='accuracy',
    n_jobs=-1,
    verbose=3,
    n_iter=3,  # Set `n_iter` to limit the number of sampled combinations
)

random_search.fit(X_train, y_train)

best_params = random_search.best_params_
print("Best Parameters:", best_params)
print("Best Cross-Validation Accuracy:", random_search.best_score_)
best_model = random_search.best_estimator_
best_model

Fitting 2 folds for each of 3 candidates, totalling 6 fits
[CV 1/2] END randomforestclassifier__max_depth=10, randomforestclassifier__min_samples_split=5, randomforestclassifier__n_estimators=100;, score=0.826 total time= 1.6min
[CV 2/2] END randomforestclassifier__max_depth=10, randomforestclassifier__min_samples_split=5, randomforestclassifier__n_estimators=100;, score=0.826 total time= 1.6min
[CV 1/2] END randomforestclassifier__max_depth=10, randomforestclassifier__min_samples_split=2, randomforestclassifier__n_estimators=50;, score=0.825 total time= 1.6min
[CV 2/2] END randomforestclassifier__max_depth=10, randomforestclassifier__min_samples_split=2, randomforestclassifier__n_estimators=50;, score=0.827 total time= 1.6min
[CV 1/2] END randomforestclassifier__max_depth=10, randomforestclassifier__min_samples_split=2, randomforestclassifier__n_estimators=100;, score=0.827 total time= 1.6min
[CV 2/2] END randomforestclassifier__max_depth=10, randomforestclassifier__min_samples_split=
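The parameter names in the grid follow the `<step name>__<parameter>` convention, and `make_pipeline` names each step after its lowercased class name. The sketch below illustrates this on synthetic data with a much smaller forest; the data, the pipeline, and the grid values are assumptions made for the example only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the review dataset)
X_toy, y_toy = make_classification(n_samples=60, n_features=5, random_state=27)

toy_pipeline = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(random_state=27),
)
step_names = list(toy_pipeline.named_steps)
print(step_names)  # ['standardscaler', 'randomforestclassifier']

# Keys are "<step name>__<parameter name>"
toy_grid = {
    "randomforestclassifier__n_estimators": [5, 10],
    "randomforestclassifier__max_depth": [2, 4],
}

search = RandomizedSearchCV(
    toy_pipeline,
    toy_grid,
    n_iter=3,          # sample 3 of the 4 possible combinations
    cv=2,
    scoring="accuracy",
    random_state=27,   # makes the sampled combinations reproducible
)
search.fit(X_toy, y_toy)

print(search.best_params_)
```

Passing `random_state` to the search, which the cell above does not do, is one way to make the sampled combinations repeatable between runs.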

In [48]:
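# Note: with the default refit=True, best_estimator_ has already been refit on
# the full training set, so this call repeats that fit.
best_model.fit(X_train, y_train)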
y_pred_forest_best = best_model.predict(X_test)
accuracy_forest_best = accuracy_score(y_test, y_pred_forest_best)

print('Accuracy:', accuracy_forest_best)

Accuracy: 0.8314363143631436
