# 🎽 The North Face ecommerce 🎽
North Face's marketing department would like to use machine learning to boost online sales on the website.
They have identified two main solutions that could have a significant effect on conversion rates:
* Deploying a recommender system that suggests additional products to users, similar to the items they are already interested in. The recommendations could take the form of a "you might also be interested in these products..." section on each product page of the website.
* Improving the structure of the product catalog through topic extraction. The idea is to use unsupervised methods to challenge the existing categories: is it possible to find new product categories that would be better suited to navigation on the website?
## 🎯 Goals 
The project is split into three steps:
1. Identify groups of products that have similar descriptions.
2. Use the groups of similar products to build a simple recommender system algorithm.
3. Use topic modeling algorithms to automatically assess the latent topics present in the item descriptions.

## 📚 Imports and installations

In [None]:
pip install spacy -q

In [None]:
pip install wordcloud -q

In [None]:
pip install bs4 -q

In [None]:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
import spacy
from sklearn.decomposition import TruncatedSVD
import seaborn as sns
import en_core_web_sm
nlp = en_core_web_sm.load()
from spacy.lang.en.stop_words import STOP_WORDS
import wordcloud
import matplotlib.pyplot as plt

## ⚙️ Preprocessing of textual data

In [None]:
df = pd.read_csv('sample-data.csv')
df.head()


In [None]:
# Creating an 'article_name' column from the text before the first hyphen of the description
df['article_name'] = df['description'].str.split('-').str[0]
df.head()

In [None]:
df['description'][0]

### 🧹 Cleaning the corpus
We preprocess the descriptions to clean the corpus: in particular, we strip the HTML markup, remove stop words and lemmatize the documents.

In [None]:
from bs4 import BeautifulSoup
# Fill missing descriptions first, then strip HTML tags with an explicit parser
df['text_clean'] = [BeautifulSoup(elem, 'html.parser').get_text(separator=' ') for elem in df['description'].fillna('')]
# Keep only letters and spaces, then lowercase
df['text_clean'] = df['text_clean'].apply(lambda x: ''.join(ch for ch in x if ch.isalpha() or ch == " "))
df['text_clean'] = df['text_clean'].str.lower()
df.head()

In [None]:
df['text_clean'][0]

In [None]:
df['processed_documents'] = df['text_clean'].apply(lambda x: " ".join([token.lemma_ for token in nlp(x) if (token.lemma_ not in STOP_WORDS) and (token.text not in STOP_WORDS)]))

In [None]:
df['processed_documents'][0]

### 📜 TF-IDF transformation
We encode the cleaned documents with a TF-IDF transformation: each document becomes a vector with one weight per vocabulary term.

In [None]:
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(df['processed_documents'])

In [None]:
words_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out()).T
words_df['sum'] = words_df.sum(axis=1)
words_df.head()

## Part 1 : Groups of products with similar descriptions

In this part, we train a clustering model that will create groups of products for which the descriptions are "close" to each other.

### ✨ DBSCAN clustering

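We run DBSCAN on the TF-IDF matrix, using the cosine distance between documents. Points that do not belong to any dense region are labelled `-1` (noise) and are not counted as a cluster.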

In [None]:
from sklearn.cluster import DBSCAN
db = DBSCAN(eps=0.7, metric='cosine', min_samples=4)
db.fit(X)
clusters_number = len(np.unique(db.labels_[db.labels_ != -1]))
print(f'{clusters_number} clusters were created by DBSCAN')
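
As an illustration of how `eps`, `min_samples` and the noise label interact, here is a small sketch on hypothetical 2-D vectors. The data and parameter values are chosen for the toy case only and say nothing about the right settings for the product descriptions.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Toy 2-D vectors (hypothetical): two groups of directions plus one outlier
toy_vectors = np.array([
    [1.0, 0.0], [0.99, 0.05], [0.98, 0.1],   # pointing roughly along x
    [0.0, 1.0], [0.05, 0.99], [0.1, 0.98],   # pointing roughly along y
    [-1.0, -1.0],                            # far from both groups
])

# eps is a cosine distance here: 0 means same direction, 1 means orthogonal
toy_db = DBSCAN(eps=0.05, metric="cosine", min_samples=3).fit(toy_vectors)
print(toy_db.labels_)

# Count clusters without the noise label -1
n_clusters = len(set(toy_db.labels_) - {-1})
n_noise = int((toy_db.labels_ == -1).sum())
print(f"{n_clusters} clusters, {n_noise} noise point(s)")
```

With the cosine metric, only the direction of a vector matters, not its length, which suits TF-IDF vectors of documents with different lengths.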

### ☁️ Displaying wordclouds
We display a wordcloud for each cluster to analyze the results and see if the groups contain different words.

In [None]:
wd = wordcloud.WordCloud(background_color="white", colormap="hsv")
for cluster in range(clusters_number):
    mask = list(np.where(db.labels_ == cluster)[0])
    words = list(words_df.iloc[:,mask].T.sum().sort_values(ascending=False).index[0:20])
    plt.imshow(wd.generate(" ".join(words)))
    plt.axis('off')
    plt.show()
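
The word selection behind each wordcloud can be sketched without any plotting: sum the TF-IDF weights of the documents in a cluster and keep the terms with the largest totals. The helper `top_terms` and the toy arrays below are hypothetical, and the sketch assumes a dense NumPy array rather than the sparse matrix returned by the vectorizer.

```python
import numpy as np

def top_terms(weights, labels, terms, cluster, n=3):
    """Return the n terms with the largest summed weight inside one cluster."""
    cluster_weights = weights[labels == cluster].sum(axis=0)
    order = np.argsort(cluster_weights)[::-1][:n]
    return [str(terms[i]) for i in order]

# Toy document-term weights (hypothetical values), one row per document
toy_terms = np.array(["jacket", "pant", "shirt", "waterproof"])
toy_weights = np.array([
    [0.8, 0.0, 0.0, 0.6],
    [0.7, 0.0, 0.1, 0.7],
    [0.0, 0.2, 0.9, 0.0],
])
toy_labels = np.array([0, 0, 1])

print(top_terms(toy_weights, toy_labels, toy_terms, cluster=0, n=2))
print(top_terms(toy_weights, toy_labels, toy_terms, cluster=1, n=2))
```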

## 👕 Part 2 : Recommender system

We can now use the cluster ids from part 1 to build a recommender system. The aim is to suggest to a user products that are similar to the ones they are interested in. To do this, we consider that products belonging to the same cluster are similar.

In [None]:
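def item_name(item_id):
    return df['article_name'][df['id'] == item_id].values[0]

def item_cluster(item_id):
    # Find the row whose 'id' matches, then read its DBSCAN label
    return db.labels_[(df['id'] == item_id).values][0]

def find_similar_items():
    item_id = int(input("Enter an id between 1 and 500: "))
    cluster = item_cluster(item_id)
    if cluster == -1:
        print(f"{item_name(item_id)} was labelled as noise by DBSCAN: no recommendation available.")
        return
    # Candidates share the cluster of the item, excluding the item itself
    candidates = df['id'][(db.labels_ == cluster) & (df['id'] != item_id).values].values
    n = min(5, len(candidates))
    print(f"Recommending {n} products similar to {item_name(item_id)}:")
    # replace=False avoids recommending the same product twice
    for item in np.random.choice(candidates, n, replace=False):
        print("* " + item_name(item))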

In [None]:
find_similar_items()
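
The same idea can be written as a pure function, which is easier to test than the interactive version. This is only a sketch: `recommend`, `toy_ids` and `toy_labels` are hypothetical names, and the choices to return nothing for noise points and to exclude the queried item are assumptions rather than requirements of the project.

```python
import numpy as np

def recommend(item_id, ids, labels, k=5, rng=None):
    """Return up to k ids sharing item_id's cluster, excluding item_id itself."""
    rng = np.random.default_rng() if rng is None else rng
    ids = np.asarray(ids)
    labels = np.asarray(labels)
    cluster = labels[ids == item_id][0]
    if cluster == -1:
        # DBSCAN noise: there is no cluster to draw from
        return []
    candidates = ids[(labels == cluster) & (ids != item_id)]
    k = min(k, len(candidates))
    # Sample without replacement so no product is suggested twice
    return rng.choice(candidates, size=k, replace=False).tolist()

# Toy catalogue (hypothetical): six products, two clusters, one noise point
toy_ids = [1, 2, 3, 4, 5, 6]
toy_labels = [0, 0, 0, 1, 1, -1]

print(recommend(1, toy_ids, toy_labels, k=5, rng=np.random.default_rng(0)))
print(recommend(6, toy_ids, toy_labels))
```

Passing a seeded generator through `rng` makes the random selection reproducible.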

## 💬 Part 3 : Topic modeling

We use an LSA model (TruncatedSVD) to automatically extract latent topics from the TF-IDF matrix of the product descriptions.
Unlike clustering, LSA maps each document to a mixture of several topics. This makes the topics harder to interpret as categories, since one document can be related to several topics at once. To make things easier, we can extract the main topic of each document, that is, the topic with the largest weight.
As in part 1, we display wordclouds to analyze the results.

In [None]:
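svd_model = TruncatedSVD(n_components=15, n_iter=100)
# Keep the document-topic matrix: one row per product, one column per topic
topic_matrix = svd_model.fit_transform(X)
topic_matrix.shape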
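
Extracting the main topic of each document can be sketched as follows. The toy corpus and the `toy_` names are hypothetical, and taking the largest absolute weight is one possible convention (LSA weights can be negative), not the only one.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (hypothetical, not taken from sample-data.csv)
toy_docs = [
    "waterproof jacket hood",
    "waterproof jacket zipper pocket",
    "cotton shirt",
    "organic cotton shirt collar",
]

toy_X = TfidfVectorizer().fit_transform(toy_docs)
toy_svd = TruncatedSVD(n_components=2, random_state=0)

# fit_transform returns the document-topic matrix: (n_documents, n_topics)
doc_topics = toy_svd.fit_transform(toy_X)

# Main topic = component with the largest absolute weight for each document
main_topic = np.abs(doc_topics).argmax(axis=1)
print(doc_topics.shape)
print(main_topic)
```

The resulting array can be stored as a new column next to the cluster ids to compare both groupings.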

In [None]:
topic_encoded_df = pd.DataFrame(svd_model.components_, columns=vectorizer.get_feature_names_out()).T
topic_encoded_df.columns = [f"topic_{i}" for i in range(svd_model.n_components)]
topic_encoded_df.head()

### ☁️ Displaying wordclouds

In [None]:
wd = wordcloud.WordCloud(background_color="white", colormap="hsv")
for col in topic_encoded_df.columns:
    words = list(topic_encoded_df[col].sort_values(ascending=False).index[0:20])
    plt.imshow(wd.generate(" ".join(words)))
    plt.axis('off')
    plt.show()