# Food Products Project

## About the project

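This is an unsupervised learning project: customers are clustered by their spending across product categories using DBSCAN, and the resulting clusters and outliers are then compared.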

## About the data

The data source is the `food-products.csv` file.

Column | Definition
--- | ---------
Channel | Customer channel (1 = internal; 2 = external)
Region | Customer region (1 = North; 2 = South; 3 = West)
Fresh | Spending on fresh products
Milk | Spending on milk products
Grocery | Spending on grocery products
Frozen | Spending on frozen products
Detergents_Paper | Spending on detergents and paper products
Delicassen | Spending on delicatessen products

## Solution

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

### Exploratory data analysis

In [None]:
df = pd.read_csv('food-products.csv')
df.head()

Let's create a scatterplot showing the relationship between `Milk` and `Grocery` spending, colored by the `Channel` column.

In the plot, external-channel customers (`Channel` = 2) tend to spend more on milk and grocery products.

In [None]:
sns.scatterplot(data=df, x='Milk', y='Grocery', hue='Channel')

In [None]:
sns.histplot(df, x='Milk', hue='Channel', multiple="stack")

Let's create an annotated clustermap of the correlations between spending on the different product categories.

In [None]:
sns.clustermap(df.drop(['Region', 'Channel'], axis=1).corr(), annot=True);
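
As a reminder of what the correlation matrix passed to the clustermap contains, here is a minimal sketch on hypothetical toy data (the column names `a`, `b`, `c` and their values are invented for illustration). `DataFrame.corr()` returns pairwise Pearson correlations by default, so values near 1 or -1 indicate a strong linear relationship.

```python
import pandas as pd

# Hypothetical toy data, not taken from food-products.csv.
toy = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [2, 4, 6, 8],  # moves exactly with a
    "c": [4, 3, 2, 1],  # moves exactly against a
})

# Pairwise Pearson correlation between all columns.
corr = toy.corr()
print(corr)
```

The result is a square, symmetric table with ones on the diagonal, which is the input the clustermap reorders and annotates.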

### Data scaling

Since DBSCAN is distance-based and the features span different orders of magnitude, let's standardize the data first.

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
scaler = StandardScaler()
scaled_X = scaler.fit_transform(df)
scaled_X
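
For intuition, the following sketch reproduces by hand what standardization does, using hypothetical toy values (the array `X` is invented for illustration). Each column has its mean subtracted and is divided by its population standard deviation, which is how `StandardScaler` behaves with default settings.

```python
import numpy as np

# Hypothetical toy data: two features on very different scales.
X = np.array([
    [1.0, 1000.0],
    [2.0, 3000.0],
    [3.0, 5000.0],
])

# Standardize column by column: subtract the mean, then divide by the
# population standard deviation (ddof=0, NumPy's default).
means = X.mean(axis=0)
stds = X.std(axis=0)
X_scaled = (X - means) / stds

print(X_scaled.mean(axis=0))  # approximately 0 for every column
print(X_scaled.std(axis=0))   # approximately 1 for every column
```

After scaling, both columns contribute comparably to Euclidean distances, even though the raw values differ by three orders of magnitude.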

### DBSCAN

In [None]:
from sklearn.cluster import DBSCAN

In [None]:
outlier_percent = []

for eps in np.linspace(0.001, 3, 50):
    # create model; min_samples is set to twice the number of features
    dbscan = DBSCAN(eps=eps, min_samples=2 * scaled_X.shape[1])
    dbscan.fit(scaled_X)

    # log percentage of points that are outliers (DBSCAN labels noise as -1)
    percent_outliers = 100 * np.sum(dbscan.labels_ == -1) / len(dbscan.labels_)
    outlier_percent.append(percent_outliers)
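
To make the outlier bookkeeping in the loop above concrete, here is a minimal sketch on hypothetical toy data (the points, `eps=0.5`, and `min_samples=3` are illustrative assumptions, not values from this project). DBSCAN assigns the label `-1` to noise points, so the outlier share is the fraction of labels equal to `-1`.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical toy data: two tight groups plus one far-away point.
X = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
    [20.0, 20.0],
])

model = DBSCAN(eps=0.5, min_samples=3).fit(X)

# Noise points carry the label -1; everything else is a cluster index.
is_outlier = model.labels_ == -1
percent_outliers = 100 * is_outlier.sum() / len(model.labels_)

print(model.labels_)     # the isolated last point is labeled -1
print(percent_outliers)  # one noise point out of seven
```

With a very small `eps` almost every point is noise, and with a large one almost none is, which is why sweeping `eps` produces a falling curve.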

Let's plot the percentage of outlier points against the epsilon value.

In [None]:
sns.lineplot(x=np.linspace(0.001, 3, 50), y=outlier_percent)
plt.ylabel("Percentage of points classified as outliers")
plt.xlabel("Epsilon value")

Based on the line plot above, we choose an epsilon value of 2.

### DBSCAN with Chosen Epsilon

In [None]:
# use the same min_samples as in the epsilon sweep, so the chosen eps still applies
dbscan = DBSCAN(eps=2, min_samples=2 * scaled_X.shape[1])
dbscan.fit(scaled_X)

In [None]:
sns.scatterplot(data=df, x='Grocery', y='Milk', hue=dbscan.labels_, palette='Set1')

In [None]:
sns.scatterplot(data=df, x='Detergents_Paper', y='Milk', hue=dbscan.labels_, palette='Set1')

Let's add a new column called `Labels` to the original dataframe, holding the DBSCAN labels.

In [None]:
df['Labels'] = dbscan.labels_
df.head()

Let's compare the mean spending per category for each cluster and for the outliers (label -1).

In [None]:
categories = df.drop(['Channel','Region'], axis=1)
categories_means = categories.groupby('Labels').mean()
categories_means
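
The grouping step can be illustrated on a small hypothetical dataframe (the values below are invented; only the column names mirror the dataset). Grouping by `Labels` and taking the mean yields one row per label, with the outlier label `-1` sorted first.

```python
import pandas as pd

# Hypothetical toy data; label -1 marks DBSCAN outliers.
toy = pd.DataFrame({
    "Milk":    [10, 20, 30, 40, 500],
    "Grocery": [5, 15, 100, 200, 900],
    "Labels":  [0, 0, 1, 1, -1],
})

# One row per label, one column per spending category.
toy_means = toy.groupby("Labels").mean()
print(toy_means)
```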

Let's also min-max normalize the means so that each spending category ranges from 0 to 1.

In [None]:
from sklearn.preprocessing import MinMaxScaler

In [None]:
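# separate name so the StandardScaler fitted earlier is not overwritten
minmax_scaler = MinMaxScaler()
data = minmax_scaler.fit_transform(categories_means)
scaled_means = pd.DataFrame(data, index=categories_means.index, columns=categories_means.columns)
scaled_means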
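
As a sketch of what min-max scaling does to a table of means, the example below applies the formula `(x - min) / (max - min)` column by column to hypothetical values (the numbers are invented for illustration). This corresponds to `MinMaxScaler` with its default range of 0 to 1.

```python
import pandas as pd

# Hypothetical per-label means; label -1 stands for the outliers.
means = pd.DataFrame(
    {"Milk": [500.0, 15.0, 35.0], "Grocery": [900.0, 10.0, 150.0]},
    index=pd.Index([-1, 0, 1], name="Labels"),
)

# Column-wise min-max scaling: the smallest mean maps to 0, the largest to 1.
scaled = (means - means.min()) / (means.max() - means.min())
print(scaled)
```

In this toy case the outlier row takes the maximum in every column, which squeezes the two real clusters toward 0. That effect is a plausible reason to also look at the cluster rows on their own.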

In [None]:
sns.heatmap(scaled_means)

In [None]:
sns.heatmap(scaled_means.loc[[0, 1]], annot=True)

We can see that the largest difference between the two clusters is in `Detergents_Paper` spending.
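
Reading the largest gap off a heatmap can also be done numerically. The sketch below uses a hypothetical two-row table (column names `A`, `B`, `C` and all values are invented, not results from this project) and ranks the columns by the absolute difference between the two cluster rows.

```python
import pandas as pd

# Hypothetical scaled means for two clusters; values are made up.
scaled_toy = pd.DataFrame(
    {"A": [0.10, 0.15], "B": [0.05, 0.30], "C": [0.02, 0.60]},
    index=pd.Index([0, 1], name="Labels"),
)

# Absolute gap between cluster 1 and cluster 0, largest first.
gap = (scaled_toy.loc[1] - scaled_toy.loc[0]).abs().sort_values(ascending=False)
print(gap)
print(gap.idxmax())  # name of the column with the largest gap
```

The same expression applied to the notebook's own table of scaled means would give a ranking to set beside the heatmap.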