# Pipeline Project

You will be using the provided data to create a machine learning model pipeline.

You must handle the data appropriately in your pipeline to predict whether an
item is recommended by a customer based on their review.
Note the data includes numerical, categorical, and text data.

You should ensure you properly train and evaluate your model.

## The Data

The dataset has been anonymized and cleaned of missing values.

There are 8 features to use to predict whether a customer recommends or does
not recommend a product.
The `Recommended IND` column gives whether a customer recommends the product
where `1` is recommended and a `0` is not recommended.
This is your model's target.

The features can be summarized as the following:

- **Clothing ID**: Integer Categorical variable that refers to the specific piece being reviewed.
- **Age**: Positive Integer variable of the reviewer's age.
- **Title**: String variable for the title of the review.
- **Review Text**: String variable for the review body.
- **Positive Feedback Count**: Positive Integer documenting the number of other customers who found this review positive.
- **Division Name**: Categorical name of the product's high-level division.
- **Department Name**: Categorical name of the product's department.
- **Class Name**: Categorical name of the product's class.

The target:
- **Recommended IND**: Binary variable stating whether the customer recommends the product, where 1 is recommended and 0 is not recommended.

## Load Data

In [233]:
import pandas as pd

# Load data
df = pd.read_csv(
    'data/reviews.csv',
)

df.info()
df.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18442 entries, 0 to 18441
Data columns (total 9 columns):
 #   Column                   Non-Null Count  Dtype 
---  ------                   --------------  ----- 
 0   Clothing ID              18442 non-null  int64 
 1   Age                      18442 non-null  int64 
 2   Title                    18442 non-null  object
 3   Review Text              18442 non-null  object
 4   Positive Feedback Count  18442 non-null  int64 
 5   Division Name            18442 non-null  object
 6   Department Name          18442 non-null  object
 7   Class Name               18442 non-null  object
 8   Recommended IND          18442 non-null  int64 
dtypes: int64(4), object(5)
memory usage: 1.3+ MB


| | Clothing ID | Age | Title | Review Text | Positive Feedback Count | Division Name | Department Name | Class Name | Recommended IND |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1077 | 60 | Some major design flaws | I had such high hopes for this dress and reall... | 0 | General | Dresses | Dresses | 0 |
| 1 | 1049 | 50 | My favorite buy! | I love, love, love this jumpsuit. it's fun, fl... | 0 | General Petite | Bottoms | Pants | 1 |
| 2 | 847 | 47 | Flattering shirt | This shirt is very flattering to all due to th... | 6 | General | Tops | Blouses | 1 |
| 3 | 1080 | 49 | Not for the very petite | I love tracy reese dresses, but this one is no... | 4 | General | Dresses | Dresses | 0 |
| 4 | 858 | 39 | Cagrcoal shimmer fun | I aded this in my basket at hte last mintue to... | 1 | General Petite | Tops | Knits | 1 |


## Preparing features (`X`) & target (`y`)

In [234]:
data = df

# separate features from labels
X = data.drop('Recommended IND', axis=1)
y = data['Recommended IND'].copy()

print('Labels:', y.unique())
print('Features:')
display(X.head())

Labels: [0 1]
Features:


| | Clothing ID | Age | Title | Review Text | Positive Feedback Count | Division Name | Department Name | Class Name |
|---|---|---|---|---|---|---|---|---|
| 0 | 1077 | 60 | Some major design flaws | I had such high hopes for this dress and reall... | 0 | General | Dresses | Dresses |
| 1 | 1049 | 50 | My favorite buy! | I love, love, love this jumpsuit. it's fun, fl... | 0 | General Petite | Bottoms | Pants |
| 2 | 847 | 47 | Flattering shirt | This shirt is very flattering to all due to th... | 6 | General | Tops | Blouses |
| 3 | 1080 | 49 | Not for the very petite | I love tracy reese dresses, but this one is no... | 4 | General | Dresses | Dresses |
| 4 | 858 | 39 | Cagrcoal shimmer fun | I aded this in my basket at hte last mintue to... | 1 | General Petite | Tops | Knits |


In [235]:
# Split data into train and test sets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.1,
    shuffle=True,
    random_state=27,
)

# Your Work

In [236]:
# Load the en_core_web_sm model from SpaCy in the code block below

In [237]:
! python -m spacy download en_core_web_sm

Collecting en-core-web-sm==3.8.0
  Downloading https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.8.0/en_core_web_sm-3.8.0-py3-none-any.whl (12.8 MB)
✔ Download and installation successful
You can now load the package via spacy.load('en_core_web_sm')


In [238]:
# Load the text data model from SpaCy
import spacy
nlp = spacy.load('en_core_web_sm')

## Data Exploration

In [239]:
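# Cast the Clothing ID column to string because it is a categorical identifier, not a quantity.
# Note: this changes X only; X_train and X_test were split off earlier and keep the integer dtype.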
X['Clothing ID'] = X['Clothing ID'].astype(str)

In [240]:
# Generate summary statistics for all variables (numerical, categorical, and text)
X.describe(include='all')

| | Clothing ID | Age | Title | Review Text | Positive Feedback Count | Division Name | Department Name | Class Name |
|---|---|---|---|---|---|---|---|---|
| count | 18442.0 | 18442.0 | 18442 | 18442 | 18442.0 | 18442 | 18442 | 18442 |
| unique | 531.0 | | 13142 | 18439 | | 2 | 6 | 14 |
| top | 1078.0 | | Love it! | I bought this shirt at the store and after goi... | | General | Tops | Dresses |
| freq | 871.0 | | 129 | 2 | | 11664 | 8713 | 5371 |
| mean | | 43.383635 | | | 2.697484 | | | |
| std | | 12.246264 | | | 5.94222 | | | |
| min | | 18.0 | | | 0.0 | | | |
| 25% | | 34.0 | | | 0.0 | | | |
| 50% | | 41.0 | | | 1.0 | | | |
| 75% | | 52.0 | | | 3.0 | | | |


In [241]:
# Observations: the number of distinct clothing IDs is large (531), which makes
# the column a poor fit for one-hot encoding in the predictive model
# (the OneHotEncoder would generate 531 distinct binary columns).
# The 'count' row confirms again that no variable has missing values,
# so an imputer step is not necessary in the preprocessing part of the pipeline.
# Furthermore, the numerical columns 'Age' and 'Positive Feedback Count' have
# a sizeable spread of values (with max age 99 and max positive feedback count 122).
# Therefore, take the potential for outliers into account when, for instance,
# choosing the scaler in the numerical pipeline.
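
To illustrate why outliers matter for the scaler choice, the sketch below compares `StandardScaler` and `RobustScaler` on a small made-up column shaped like `Positive Feedback Count` (mostly small values and one very large one). The numbers are illustrative assumptions, not values taken from the dataset.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler, StandardScaler

# Hypothetical toy column: mostly small counts plus one extreme value
counts = np.array([[0.0], [0.0], [1.0], [3.0], [122.0]])

standard_scaled = StandardScaler().fit_transform(counts)
robust_scaled = RobustScaler().fit_transform(counts)

# RobustScaler centers on the median and divides by the interquartile range,
# so the typical values keep a usable spread and only the outlier is far away.
print(robust_scaled.ravel())

# StandardScaler uses the mean and standard deviation, both of which are
# pulled toward the outlier, squeezing the typical values close together.
print(standard_scaled.ravel())
```

In the robust version the median maps to 0, while the standard version only guarantees a mean of 0, which here sits far from the bulk of the data.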

In [242]:
y.describe()

count    18442.000000
mean         0.816235
std          0.387303
min          0.000000
25%          1.000000
50%          1.000000
75%          1.000000
max          1.000000
Name: Recommended IND, dtype: float64

In [243]:
# Observations: With a mean of about 0.82, most reviewers recommend the product.
# The classes are therefore imbalanced (roughly 82 percent versus 18 percent),
# although not to a degree that would immediately ring alarm bells.
# Be mindful of this when choosing the method for training and prediction.
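
One common way to account for class imbalance at the splitting stage is a stratified split, which keeps the class proportions the same in the train and test sets. The sketch below uses made-up data with roughly the class balance observed above; the names `X_toy` and `y_toy` are hypothetical, and this is an option to consider rather than what the notebook's own split does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data with roughly the class balance seen above (82% ones)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 3))
y_toy = np.array([1] * 820 + [0] * 180)

# stratify=y_toy preserves the share of each class in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy,
    y_toy,
    test_size=0.1,
    shuffle=True,
    random_state=27,
    stratify=y_toy,
)

print('Share of positives in train:', y_tr.mean())
print('Share of positives in test: ', y_te.mean())
```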

## Building Pipeline

In [244]:
# Select the columns that need to flow into the numerical, categorical and text preprocessing part of the pipeline 
num_features = X.select_dtypes(exclude=['object']).columns
cat_features = X.select_dtypes(include=['object']).columns.drop(['Review Text', 'Clothing ID', 'Title'])
text_features = 'Review Text'  # a single column name, so the text transformers receive a 1-D Series

In [245]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import RobustScaler

# Build the numerical preprocessing pipeline
num_pipeline = Pipeline([
    ('scaler', RobustScaler())
])

In [246]:
from sklearn.preprocessing import OneHotEncoder

# Build the categorical preprocessing pipeline. There are no ordinal variables, so one hot encoding is sufficient
cat_pipeline = Pipeline([
    ('cat_encoder', OneHotEncoder(sparse_output=False, handle_unknown='ignore'))
])

In [247]:
# Build a class for feature extraction (the number of words in a text,
#  operationalized as the number of chunks in a review text separated by whitespace)
from sklearn.base import BaseEstimator, TransformerMixin

class CountWords(BaseEstimator, TransformerMixin):
    def __init__(self, character):
        self.character = character

    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        feature = [[len(text.split(self.character))] for text in X]
        return pd.DataFrame(data=feature)

# Make this feature extraction into a pipeline
word_count_pipeline = Pipeline([
    ('count_words', CountWords(character=' '))
])
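
Because `CountWords(character=' ')` relies on `str.split(' ')`, consecutive spaces produce empty chunks that are counted as words. The small sketch below, using a made-up review string, shows how this differs from `str.split()` without an argument.

```python
review = "Great fit,  runs  small"  # note the double spaces

# split(' ') treats every single space as a separator, so consecutive
# spaces produce empty strings that are counted as chunks
chunks_single_space = review.split(' ')

# split() with no argument splits on runs of whitespace and drops empties
chunks_any_whitespace = review.split()

print(len(chunks_single_space))
print(len(chunks_any_whitespace))
```

If exact word counts matter, splitting on any whitespace may be the safer choice; for a rough length feature the difference is usually small.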

In [248]:
# Build a pipeline for a tfidf vectorizer. This is used for feature extraction
# that accounts for the frequency and significance of words in a review.
# Before the tfidf vectorizer is executed, a lemmatizer is used for
# additional cleaning of the input text data.

from sklearn.feature_extraction.text import TfidfVectorizer

# Build a class for the lemmatizer
class SpacyLemmatizer(BaseEstimator, TransformerMixin):
    def __init__(self, nlp):
        self.nlp = nlp
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        lemmatized = X.apply(lambda text: " ".join(token.lemma_ for token in self.nlp(text)))
        return lemmatized

# Define the tfidf pipeline
tfidf_pipeline = Pipeline([
    ('lemmatizer', SpacyLemmatizer(nlp=nlp)),
    ('tfidf', TfidfVectorizer(stop_words='english'))
])

In [249]:
# Now, merge all pipelines above using a ColumnTransformer
from sklearn.compose import ColumnTransformer

feature_engineering = ColumnTransformer([
    ('num', num_pipeline, num_features),
    ('cat', cat_pipeline, cat_features),
    ('word_counts', word_count_pipeline, text_features),
    ('tfidf_text', tfidf_pipeline, text_features)
])

## Training Pipeline

In [250]:
# Import a RandomForestClassifier, since the random forest averages over
# multiple trees, which reduces variance (less overfitting) and often makes it
# a little more robust to class imbalance than a single decision tree.
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline

# Define and fit the total pipeline on the training data, from feature engineering to fitting the model
model_pipeline = make_pipeline(feature_engineering, RandomForestClassifier(random_state=42))
model = model_pipeline.fit(X_train, y_train)
y_pred = model.predict(X_test)



## Fine-Tuning Pipeline

In [251]:
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score

# Calculate evaluation metrics of the model fitted above
accuracy_initial = accuracy_score(y_test, y_pred)
recall_initial = recall_score(y_test, y_pred)
precision_initial = precision_score(y_test, y_pred)
f1_initial = f1_score(y_test, y_pred)

print('Accuracy:    ', accuracy_initial)
print('Recall:    ', recall_initial)
print('Precision:    ', precision_initial)
print('F1-score:    ', f1_initial)

Accuracy:     0.8476964769647697
Recall:     0.9888010540184453
Precision:     0.8504249291784702
F1-score:     0.9144075540664027


In [252]:
# The model shows an accuracy of 0.85 (rounded to two decimals).
# But, since the dataset is imbalanced (about 82 percent of reviews
# recommend the product and about 18 percent do not), accuracy is not the best
# evaluation metric. Precision (0.85) is lower than recall (0.99), which indicates
# that the model is better at avoiding false negatives than false positives.
# This means that the model will rarely indicate that someone does not
# recommend the product when they in fact do. Conversely, the model will more
# often fail to correctly predict that someone does not recommend the product.
# In this way, the model still leans toward the majority class, which is not entirely
# surprising given the class imbalance.
# The F1-score, which combines precision and recall, is approximately 0.91.
# Note that these scores describe the positive class (1 = recommended) only.
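
To make the false-positive versus false-negative reasoning concrete, the sketch below computes a confusion matrix and per-class scores on made-up labels. The lists `y_true_toy` and `y_pred_toy` are hypothetical and only mimic a classifier that leans toward the majority class.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels: 8 recommended (1), 2 not recommended (0)
y_true_toy = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
# A classifier that leans toward the majority class misses one of the 0s
y_pred_toy = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]

# Rows are true classes, columns are predicted classes, ordered as [0, 1]
cm = confusion_matrix(y_true_toy, y_pred_toy, labels=[0, 1])
print(cm)

# Default pos_label=1: metrics for the "recommended" class
recall_pos = recall_score(y_true_toy, y_pred_toy)
precision_pos = precision_score(y_true_toy, y_pred_toy)

# pos_label=0: recall for the "not recommended" class
recall_neg = recall_score(y_true_toy, y_pred_toy, pos_label=0)

print(recall_pos, precision_pos, recall_neg)
```

Here recall for class 1 is perfect while recall for class 0 is only 0.5, a pattern that the default (positive-class) scores alone would not reveal.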

In [253]:
# Show hyperparameters that can be optimized
model_pipeline.get_params().keys()

dict_keys(['memory', 'steps', 'transform_input', 'verbose', 'columntransformer', 'randomforestclassifier', 'columntransformer__force_int_remainder_cols', 'columntransformer__n_jobs', 'columntransformer__remainder', 'columntransformer__sparse_threshold', 'columntransformer__transformer_weights', 'columntransformer__transformers', 'columntransformer__verbose', 'columntransformer__verbose_feature_names_out', 'columntransformer__num', 'columntransformer__cat', 'columntransformer__word_counts', 'columntransformer__tfidf_text', 'columntransformer__num__memory', 'columntransformer__num__steps', 'columntransformer__num__transform_input', 'columntransformer__num__verbose', 'columntransformer__num__scaler', 'columntransformer__num__scaler__copy', 'columntransformer__num__scaler__quantile_range', 'columntransformer__num__scaler__unit_variance', 'columntransformer__num__scaler__with_centering', 'columntransformer__num__scaler__with_scaling', 'columntransformer__cat__memory', 'columntransformer__ca

In [None]:
from sklearn.model_selection import RandomizedSearchCV

# Perform a Randomized Search with Cross-Validation to find optimal hyperparameters.
# The search covers two hyperparameters of the Random Forest Classifier step:
# the maximum number of features and the number of estimators.

# This code block took about 19 minutes to run. Hence, to keep the run time
# manageable, the Randomized Search has been applied to only two hyperparameters.

my_distributions = dict(
    randomforestclassifier__max_features = [10, 50, 100, 150],
    randomforestclassifier__n_estimators = [10, 50, 100, 150]
)

param_search = RandomizedSearchCV(
    estimator=model_pipeline,
    param_distributions=my_distributions,
    n_iter=6,
    cv=5,
    n_jobs=-1,
    refit=True,
    verbose=3,
    random_state=42
)

param_search.fit(X_train, y_train)

best_model = param_search.best_estimator_

Fitting 5 folds for each of 6 candidates, totalling 30 fits
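
Since no `scoring` argument is passed, the search above ranks candidates by the estimator's default score, which for a classifier is accuracy. Given the class imbalance, one alternative is to pass an imbalance-aware metric. The sketch below does this on small synthetic data so that it runs in seconds; the data, the parameter grid, and the choice of `'balanced_accuracy'` are illustrative assumptions, not the notebook's configuration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Hypothetical synthetic data with roughly 80% positives
X_syn, y_syn = make_classification(
    n_samples=300, n_features=8, weights=[0.2, 0.8], random_state=0,
)

search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_distributions={
        'n_estimators': [10, 25, 50],
        'max_features': [2, 4, 8],
    },
    n_iter=4,
    cv=3,
    scoring='balanced_accuracy',  # assumption: a metric that weights both classes equally
    refit=True,                   # refit the best candidate on all of X_syn
    random_state=42,
)
search.fit(X_syn, y_syn)

print(search.best_params_)
print(search.best_score_)
```

With `refit=True`, `search.best_estimator_` is already trained on the full data passed to `fit` and can be used for prediction directly.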


In [255]:
# best_model was already refit on the entire training data set by the search
# (refit=True), so it can be evaluated on the test set directly
final_model = best_model
y_pred_final = final_model.predict(X_test)



In [256]:
# Print the evaluation metrics of the best model.
# As becomes apparent from the output, the evaluation metrics are nearly
# identical to the ones generated before the hyperparameter tuning:
# accuracy is unchanged (0.85), recall drops slightly (0.989 to 0.984),
# precision rises slightly (0.850 to 0.854), and the F1-score stays at 0.91.
# Therefore, see above for the interpretation of these metrics.

accuracy_final = accuracy_score(y_test, y_pred_final)
recall_final = recall_score(y_test, y_pred_final)
precision_final = precision_score(y_test, y_pred_final)
f1_final = f1_score(y_test, y_pred_final)

print('Accuracy:    ', accuracy_final)
print('Recall:    ', recall_final)
print('Precision:    ', precision_final)
print('F1-score:    ', f1_final)

Accuracy:     0.8476964769647697
Recall:     0.9835309617918313
Precision:     0.8536306460834763
F1-score:     0.913988368533823
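
For context, it can help to compare these scores with a trivial baseline that always predicts the majority class. The sketch below uses made-up labels with the overall class balance reported earlier (about 82 percent positives); the actual share in the test set may differ, so the numbers are indicative only.

```python
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical labels with the overall class balance reported earlier (82% ones)
y_base = [1] * 82 + [0] * 18
X_base = [[0]] * 100  # features are ignored by the dummy model

# Always predict the most frequent class seen during fit
baseline = DummyClassifier(strategy='most_frequent').fit(X_base, y_base)
y_base_pred = baseline.predict(X_base)

baseline_scores = {
    'accuracy': accuracy_score(y_base, y_base_pred),
    'recall': recall_score(y_base, y_base_pred),
    'precision': precision_score(y_base, y_base_pred),
    'f1': f1_score(y_base, y_base_pred),
}
print(baseline_scores)
```

Under this assumption the baseline already reaches an F1-score of about 0.90 for the positive class, so the gain of the fitted model is better judged against that figure than against zero.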
