<p align="center" draggable="false"><img src="https://user-images.githubusercontent.com/37101144/161836199-fdb0219d-0361-4988-bf26-48b0fad160a3.png" 
     width="200px"
     height="auto"/>
</p>

# 🛍️ Launch New Products

Today you are a machine learning engineer at the Department of New Products at Target Cosmetics! 

We will start with a small dataset on interactions between users and current products from the past and try to discover substructure, if there's any, by applying some **unsupervised learning** methods. 

Then we will leverage the small amount of labeled data (current products) in combination with a larger amount of unlabeled data (new products to launch) to make estimations as to which products will sell more. 

## 📚 Learning Objectives

By the end of this session, you will be able to:

- apply dimensionality reduction techniques to reduce features to a lower dimensional space
- perform customer segmentation, determine the optimal number of clusters, and understand the assumptions of the algorithm used
- understand what semi-supervised learning is and leverage it to improve performance of supervised learning

## Task 1. Dimensionality Reduction

1. Load in the data. 
    
    Import `pandas` as `pd` and use `pd.read_csv()` to read in `past.csv.gz` in the `dat` folder, saving it as `past`. 
    
    Data in `past.csv.gz` was preprocessed; e.g., features indicating time of day, day of week, month, and year of the purchase have been converted to one-hot representations of these categories. 

#### Finding out the current working directory

In [None]:
import os
print(os.getcwd())
# Display all of the files found in your current working directory
print(os.listdir(os.getcwd()))

In [None]:
# YOUR CODE HERE
import pandas as pd
past = pd.read_csv('/home/krishanu_sinha/MLE-9/code/MLE-9/MLE-9/assignments/week-8-unsupervised-ML/dat/past.csv.gz')
past.head()

In [None]:
past.shape

<details>
<summary> Expected output </summary>

```
Index(['product_id', 'user_id', 'NumOfEventsInJourney', 'NumSessions',
       'interactionTime', 'maxPrice', 'minPrice', 'NumCart', 'NumView',
       'NumRemove', 'InsessionCart', 'InsessionView', 'InsessionRemove',
       'Weekend', 'Fr', 'Mon', 'Sat', 'Sun', 'Thu', 'Tue', 'Wed', '2019',
       '2020', 'Jan', 'Feb', 'Oct', 'Nov', 'Dec', 'Afternoon', 'Dawn',
       'EarlyMorning', 'Evening', 'Morning', 'Night', 'Purchased?', 'Noon',
       'Category'],
      dtype='object')
```
</details>

2. What percentage of the interactions (rows) resulted in a purchase?

    Do people mostly buy what they look at or do they do a lot of "window shopping" (shopping around without buying)?
    
    From the perspective of classification, is the data balanced?

In [None]:
import numpy as np
np.mean(past['Purchased?'])*100
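
To make the balance question concrete, the share of each class can be read off with `value_counts(normalize=True)`. The sketch below uses a small made-up series (`toy_purchased` is a hypothetical stand-in for `past['Purchased?']`), so the numbers are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for past['Purchased?']: 3 purchases in 10 interactions
toy_purchased = pd.Series([1, 0, 0, 1, 0, 0, 0, 1, 0, 0], name="Purchased?")

# Share of each class; a strongly uneven split means the data is imbalanced
class_share = toy_purchased.value_counts(normalize=True)
purchase_rate = class_share.loc[1] * 100

print(class_share)
print(f"purchase rate: {purchase_rate:.1f}%")
```

When one class is clearly in the minority, accuracy alone can be misleading, which is why the later steps look at the full classification report.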

3. Drop `product_id` and `user_id` and save the remaining columns to a new `pd.DataFrame`: `X`; then pop the column `'Purchased?'` and save it to `y`.

In [None]:

X = past.drop(columns=['product_id','user_id'])
y = X.pop('Purchased?')

In [None]:
assert X.shape == (5000, 34)
assert y.shape == (5000,)

4. Apply [PCA (check documentation if unfamiliar)](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html) to reduce the number of features down to **5**, save it to a numpy array named `X_reduced`. 

    Do you need to preprocess the data before performing PCA? Quick review [here: Importance of feature scaling](https://scikit-learn.org/stable/auto_examples/preprocessing/plot_scaling_importance.html).
    
    If time permits, read [Does mean centering or feature scaling affect a Principal Component Analysis?](https://sebastianraschka.com/faq/docs/pca-scaling.html) or [discussion 1](https://stats.stackexchange.com/questions/53/pca-on-correlation-or-covariance).

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
from sklearn.decomposition import PCA
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=5, random_state=0, whiten=True)
X_reduced = pca.fit_transform(X_scaled)
print(pca.explained_variance_ratio_)

In [None]:
assert X_reduced.shape == (5000, 5)
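
Why scaling matters before PCA can be seen on synthetic data. The sketch below is only an illustration under assumed data (`toy_X` is made up: three independent features, one of them on a much larger scale); it is not the assignment's dataset.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Three independent features; the first is measured on a much larger scale
toy_X = rng.normal(size=(500, 3)) * np.array([1000.0, 1.0, 1.0])

# Without scaling, the large-scale feature dominates the first component
evr_raw = PCA(n_components=2).fit(toy_X).explained_variance_ratio_

# After standardization every feature contributes on an equal footing
toy_X_scaled = StandardScaler().fit_transform(toy_X)
evr_scaled = PCA(n_components=2).fit(toy_X_scaled).explained_variance_ratio_

print("raw:   ", evr_raw.round(4))
print("scaled:", evr_scaled.round(4))
```

On the raw data the first component mostly reflects the unit of measurement rather than any structure, which is the reason for standardizing before PCA when features live on different scales.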

5. Print out the percentage of variance explained by each of the selected components.

In [None]:
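# Percentage of variance explained by each of the selected components
print(pca.explained_variance_ratio_ * 100)
# Cumulative sum of the explained variance ratios; this can be used to create a
# step plot of the variance explained by the first k principal components.
cum_sum_explained_variance = np.cumsum(pca.explained_variance_ratio_)
print(cum_sum_explained_variance)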
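
The cumulative explained variance can also be used to decide how many components to keep. The following is a minimal sketch on made-up data (`toy_X`, the 90% `threshold`, and the variable names are assumptions for illustration, not part of the assignment).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
# Toy data: 10 observed features driven by 3 latent factors plus a little noise
latent = rng.normal(size=(300, 3))
toy_X = latent @ rng.normal(size=(3, 10)) + 0.1 * rng.normal(size=(300, 10))

pca_full = PCA().fit(StandardScaler().fit_transform(toy_X))
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

# Smallest number of components whose cumulative ratio reaches the threshold
threshold = 0.90
n_needed = int(np.argmax(cumulative >= threshold)) + 1

print(cumulative.round(3))
print("components needed:", n_needed)
```

scikit-learn's `PCA` also accepts a float between 0 and 1 for `n_components` (for example `PCA(n_components=0.90)`), which selects the number of components by a variance criterion of this kind.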

6. Review code in functions `visualize_2pcs` and `visualize_3pcs` below and visualize first few principal components in 2D and 3D plots, respectively:

In [None]:
import matplotlib.pyplot as plt

def visualize_2pcs(pcs, y):
    fig, ax = plt.subplots()
    plot = plt.scatter(pcs[:,0], pcs[:,1], c=y) 
    ax.legend(
        handles=plot.legend_elements()[0], 
        labels=['No', 'Yes'])

In [None]:
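def visualize_3pcs(pcs, y):
    # Create an empty figure first, then add a single 3D axes to it
    # (plt.subplots() would leave an extra, empty 2D axes behind the 3D plot)
    fig = plt.figure()
    ax = fig.add_subplot(projection='3d')
    plot = ax.scatter(pcs[:,0], pcs[:,1], pcs[:,2], c=y)
    ax.legend(
        handles=plot.legend_elements()[0], 
        labels=['No', 'Yes'])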

In [None]:
visualize_2pcs(X_reduced,y)

In [None]:
visualize_3pcs(X_reduced,y)

7. One way to assess the quality of the dimensionality reduction, when the ground truth is available of course, is to compare the prediction performance using the given features vs. the reduced (engineered) features.

    Complete the wrapper function below that 

    - takes features, target, and a boolean parameter indicating whether to include standardization in the pipeline or not
    - splits the data into train (80%) and test (20%) datasets, setting the random state for splitting to 0
    - builds a pipeline that 

        1) preprocesses the data using standardization if `standardize` is `True`; otherwise skips this step  

        2) applies logistic regression ( are the labels balanced? )
        
    - fits the pipeline using training data
    - prints the classification report (use `sklearn.metrics.classification_report`) on test data

In [None]:
# Importing basic packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Importing scikit-learn modules and classes
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def train(X, y, standardize = True) -> None:
    # 80/20 split with random state 0, stratified because the labels are imbalanced
    X_train, X_test, Y_train, Y_test = train_test_split(
        X, y, test_size=0.2, random_state=0, stratify=y)
    steps = []
    if standardize:
        # Inside the pipeline the scaler is fit on the training data only
        # and then reused, unchanged, to transform the test data
        steps.append(('scaler', StandardScaler()))
    # class_weight="balanced" compensates for the imbalanced labels
    steps.append(('lr', LogisticRegression(C=100.0, random_state=1, solver='lbfgs', class_weight="balanced")))
    pipe = Pipeline(steps)
    # Fit the pipeline on the training data
    pipe.fit(X_train, Y_train)
    # Predict on the test data and print the report
    Y_predict = pipe.predict(X_test)
    print(classification_report(Y_test, Y_predict))

Now apply the pipeline on all the features `X` and review the performance.

In [None]:
train(X, y, standardize = True)

Similarly, apply the pipeline on the reduced / engineered features `X_reduced`. Should you include standardization in the pipeline?

In [None]:
train(X_reduced,y, standardize = True)

8. Are the results as expected? Discuss the pros and cons of using a reduced set of features in this application with your teammate. 
    *YOUR ANSWER HERE*

## Task 2. Customer Segmentation

In this task, we apply k-means clustering on the reduced data, experimenting with different values of `n_clusters`, and summarize all this information in a single plot, the *Elbow* plot. In addition, we leverage silhouette visualization to help decide the "optimal" number of clusters in our data and answer: 

1. Are there any patterns among customer purchasing behaviors?
2. If so, what categories do they belong to? How do you characterize the clusters?
3. If not, what followup steps and / or recommendations will you make as an MLE?

1. Look up the [documentation](https://scikit-learn.org/stable/modules/clustering.html) and import the model class for k-means from `sklearn.cluster`

In [None]:
from sklearn.cluster import KMeans

2. Complete `visualize_elbow`; inspect the code and complete

    - fit k-means on the given data `X` and `k`, setting `random_state` to be 10 for reproducibility
    - append the sum of squared distances of samples to their closest cluster center for each $k$ to list `inertias`

In [None]:
def visualize_elbow(X, ks):
    fig, ax = plt.subplots()
    inertias = []
    for k in ks:
        kmeans = KMeans(n_clusters = k,     
                    init = 'k-means++',                 # Initialization method for kmeans
                    max_iter = 300,                     # Maximum number of iterations 
                    n_init = 10,                        # Choose how often algorithm will run with different centroid 
                    random_state = 10)                  # Choose random state for reproducibility
        kmeans.fit(X)
        inertias.append(kmeans.inertia_)
    plt.plot(ks, inertias)
    plt.xticks(ks)
    plt.xlabel('Number of clusters')
    plt.ylabel('Inertia')
    plt.title('Elbow plot')
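
The quantity stored in `inertia_` can be recomputed by hand, which makes clear what the elbow plot shows. This is a small sketch on synthetic blobs (`toy_X` and `toy_km` are illustrative names, not objects from the assignment).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

toy_X, _ = make_blobs(n_samples=300, centers=3, random_state=10)
toy_km = KMeans(n_clusters=3, n_init=10, random_state=10).fit(toy_X)

# Squared distance from every sample to every centroid: shape (n_samples, n_clusters)
sq_dists = ((toy_X[:, None, :] - toy_km.cluster_centers_[None, :, :]) ** 2).sum(axis=2)

# Inertia = sum over samples of the squared distance to the closest centroid
manual_inertia = sq_dists.min(axis=1).sum()

print(manual_inertia, toy_km.inertia_)
```

Because every additional centroid can only shrink these distances, inertia keeps decreasing as `k` grows, so the plot is read for the point where the decrease flattens rather than for a minimum.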

3. Visualize the elbow plot for the number of clusters ranging between 2 and 9. Discuss with your teammate, what is the 'optimal' number of clusters?

In [None]:
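# range(2, 10) covers k = 2, ..., 9; cluster on the reduced data, as stated at the start of this task
visualize_elbow(X_reduced, range(2, 10))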

4. What are the disadvantages of using the elbow method? 

#### Choosing k by eye from the elbow plot is subjective and error-prone, especially when the curve has no sharp bend.

5. Let's try a different approach: [silhouette score](https://towardsdatascience.com/clustering-metrics-better-than-the-elbow-method-6926e1f723a6).

    A helper function `visualize_silhouette` is provided for you (inspect the code in `utils.py`); figure out how to use it to visualize k-means for k ranging from 2 to 8 on the reduced data. 

In [None]:
from utils import visualize_silhouette

In [None]:
visualize_silhouette(8,X_reduced)
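
Alongside the silhouette plots, the mean silhouette coefficient can be compared across values of k directly. The sketch below uses `sklearn.metrics.silhouette_score` on synthetic data with four well-separated groups (the data, `toy_centers`, and the variable names are assumptions for illustration; it does not use the helper from `utils.py`).

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Toy data with four well-separated groups (an assumption for illustration)
toy_centers = [[-10, -10], [-10, 10], [10, -10], [10, 10]]
toy_X, _ = make_blobs(n_samples=400, centers=toy_centers, cluster_std=1.0, random_state=10)

scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=10).fit_predict(toy_X)
    # Mean silhouette coefficient over all samples; closer to 1 is better
    scores[k] = silhouette_score(toy_X, labels)

best_k = max(scores, key=scores.get)
print({k: round(v, 3) for k, v in scores.items()})
print("best k:", best_k)
```

Unlike inertia, the silhouette score does not improve automatically as k grows, so its maximum is a usable candidate for the number of clusters. On real data the maximum is usually far less pronounced than on this toy example.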

6. Instantiate a k-means model using the number of clusters that you deem optimal, assign it to `km`, and fit on the reduced data. 

In [None]:
km = KMeans(n_clusters = 4,     
                    init = 'k-means++',                 # Initialization method for kmeans
                    max_iter = 300,                     # Maximum number of iterations 
                    n_init = 10,                        # Choose how often algorithm will run with different centroid 
                    random_state = 10)                  # Choose random state for reproducibility

km.fit(X_reduced)

7. What is the size of each cluster? 

In [None]:
from collections import Counter

def CountFrequency(my_list):
    # Print each cluster label together with the number of samples assigned to it
    for key, value in Counter(my_list).items():
        print("%d : %d" % (key, value))

CountFrequency(km.labels_)

8. Create a new column called `cluster_pca` in `past`, whose values are the cluster indices predicted by `km`. 

In [None]:
past.loc[:,"cluster_pca"] = km.labels_

In [None]:
past.head()

9. Open ended: manipulate `past` and see if you can characterize each cluster (e.g., calculate statistics of / visualize features for each cluster). How will you interpret the results? 

    **Note**. This is probably the most important part as far as the business stakeholders are concerned: "*What can I do with your results?*" The math and modeling part is relatively easy compared to the actionable recommendations you make for the business. Thus, before jumping to a different algorithm for the given task, do your best to 1) understand the data in depth and 2) keep business use cases in mind throughout all steps. 

In [None]:
past_0 = past[past['cluster_pca'] == 0]
past_1 = past[past['cluster_pca'] == 1]
past_2 = past[past['cluster_pca'] == 2]
past_3 = past[past['cluster_pca'] == 3]

In [None]:
!pip install autoviz

In [None]:
past_0.to_csv('past_0.csv')
from autoviz.AutoViz_Class import AutoViz_Class

#EDA using Autoviz
autoviz = AutoViz_Class().AutoViz('past_0.csv')

In [None]:
past_1.to_csv('past_1.csv')
#EDA using Autoviz
autoviz = AutoViz_Class().AutoViz('past_1.csv')

In [None]:
past_2.to_csv('past_2.csv')
#EDA using Autoviz
autoviz = AutoViz_Class().AutoViz('past_2.csv')

In [None]:
past_3.to_csv('past_3.csv')
#EDA using Autoviz
autoviz = AutoViz_Class().AutoViz('past_3.csv')
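
A compact way to compare clusters side by side is a `groupby` summary. The sketch below runs on a tiny made-up frame (`toy_past` and its values are hypothetical; only the column names are borrowed from `past`), so it shows the pattern rather than any real finding.

```python
import pandas as pd

# Hypothetical miniature of `past` with a cluster assignment per row
toy_past = pd.DataFrame({
    "cluster_pca":     [0, 0, 0, 1, 1, 1],
    "Purchased?":      [1, 0, 1, 0, 0, 1],
    "interactionTime": [10, 20, 30, 5, 5, 50],
    "maxPrice":        [2.0, 4.0, 6.0, 10.0, 20.0, 30.0],
})

# One row per cluster: size, purchase rate, and robust / mean feature summaries
summary = toy_past.groupby("cluster_pca").agg(
    size=("Purchased?", "size"),
    purchase_rate=("Purchased?", "mean"),
    median_interaction=("interactionTime", "median"),
    mean_max_price=("maxPrice", "mean"),
)
print(summary)
```

Using the median for skewed features such as interaction time keeps a handful of outliers from dominating the comparison.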

#### KEY TAKEAWAYS FROM EDA ON EACH CLUSTER:
1. Median interactionTime is highest for cluster past_3 (=> 1), but that does not lead to a purchase, as seen in the data.
2. maxPrice is lowest for past_3 (=> 17.5).
3. The number of items purchased in the top 15 categories is lowest for past_3.

NOTE: minPrice is close to normally distributed in all clusters, hence I have considered it for predictions.

4. For past_2, the average minPrice by Purchased? often leads to a purchase, unlike in the other clusters, because past_2 mostly contains costly items compared to the other clusters.
5. Similarly, for past_2, the average minPrice by Purchased? often does not lead to a purchase, unlike in the other clusters, because past_2 mostly contains costly items compared to the other clusters.

6. A high average interactionTime leads to a purchase for past_0, but not for past_1, past_2 and past_3, because there are too many outliers in past_0. Hence the median would have been a better metric than the mean.
7. Most purchases for past_3 happen only during the weekend (Saturday and Sunday).
8. For past_3, there were more purchases in 2019 than in 2020.
9. For past_2, the average purchase was almost the same in 2019 and 2020.
10. For past_1, there were more purchases in 2020 than in 2019.
11. For past_0, there were more purchases in 2019 than in 2020.
12. None of the cluster distributions is Gaussian.

10. What are the assumptions for k-means? Judging by the cluster sizes, is k-means a good approach? 

    Scanning the list of [clustering algorithms](https://scikit-learn.org/stable/modules/clustering.html) implemented in scikit-learn, try at least one other algorithm, examine its assumptions, and interpret the results.

#### ASSUMPTIONS OF K-MEANS:
1. k-means assumes that each cluster is spherical, i.e. the variance within a cluster is the same in every direction;

2. all variables have the same variance;

3. the prior probability of all k clusters is the same, i.e. each cluster has roughly the same number of observations. If any one of these 3 assumptions is violated, k-means tends to produce poor clusters.

#### WHY IS K-MEANS NOT A GOOD APPROACH?
1. The distributions of the clusters were neither spherical nor normally distributed.

2. The variables do not all have the same variance.

Hence, k-means is not a good approach here.

#### Since the dataset is high-dimensional, my approach is to reduce the dimensions with t-SNE and then apply DBSCAN, because DBSCAN struggles with high-dimensional data. 

#### What is TSNE?

t-Distributed Stochastic Neighbor Embedding (t-SNE) is a dimensionality reduction technique used to represent a high-dimensional dataset in a low-dimensional space of two or three dimensions so that we can visualize it. In contrast to dimensionality reduction algorithms like PCA, which simply maximizes the retained variance, t-SNE creates a reduced feature space where, with high probability, similar samples are modeled by nearby points and dissimilar samples by distant points.

#### Applying TSNE to reduce the dimensions and then visualize the dataset:

In [None]:
from sklearn.manifold import TSNE
from numpy import reshape
import seaborn as sns
import pandas as pd

In [None]:
tsne = TSNE(n_components=2, verbose=1, random_state=123)
z = tsne.fit_transform(X)
df = pd.DataFrame()
df["y"] = y
df["comp-1"] = z[:,0]
df["comp-2"] = z[:,1]

sns.scatterplot(x="comp-1", y="comp-2", hue=df.y.tolist(),
                data=df).set(title="New Products at Target Cosmetics")

#### ALTERNATE CLUSTERING TECHNIQUE: DBSCAN

#### ASSUMPTIONS OF DBSCAN:
1. DBSCAN is a density-based clustering algorithm that works on the assumption that clusters are dense regions in space separated by regions of lower density. 
2. DBSCAN does not make assumptions about how the data are distributed. 

#### Advantages of DBSCAN:
1. It is good at separating high-density clusters from low-density regions within a given dataset.
2. It handles outliers well, labeling them as noise instead of forcing them into a cluster.

#### Disadvantages of DBSCAN:

1. Because `eps` and `min_samples` apply to the whole dataset, DBSCAN struggles when clusters have very different densities, and it cannot separate clusters of similar density that are not divided by a sparser region.
2. It struggles with high-dimensional data. DBSCAN can find clusters of arbitrary shape, but as the number of dimensions grows, distances between points become less informative and the clustering suffers.
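
The contrast between the two sets of assumptions can be seen on a standard synthetic example. The sketch below uses scikit-learn's `make_moons` (not the assignment's data), and the `eps` and `min_samples` values are hand-picked for this toy set.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaving half-moons: dense, non-spherical clusters separated by a sparse gap
toy_X, toy_true = make_moons(n_samples=400, noise=0.05, random_state=0)

kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=10).fit_predict(toy_X)
# eps and min_samples are hand-picked for this toy data, not recommended defaults
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(toy_X)

# Adjusted Rand index: 1.0 means perfect agreement with the true grouping
ari_kmeans = adjusted_rand_score(toy_true, kmeans_labels)
ari_dbscan = adjusted_rand_score(toy_true, dbscan_labels)
print(f"k-means ARI: {ari_kmeans:.2f}, DBSCAN ARI: {ari_dbscan:.2f}")
```

k-means cuts the moons with a straight boundary because it looks for compact, roughly spherical groups, while DBSCAN follows the dense arcs. The result depends heavily on `eps`, which is why the choice of `eps` deserves its own tuning step.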


#### Since DBSCAN cannot handle too many dimensions, we cluster the dataset that t-SNE reduced to 2 dimensions.

In [None]:
!pip install kneed

In [None]:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler, normalize

from kneed import KneeLocator


In [None]:
# Cluster on the two t-SNE components only; they are used unscaled,
# so eps below is expressed in the units of the t-SNE embedding
X = df[['comp-1','comp-2']]
X.head()

In [None]:
db = DBSCAN(eps=0.5, min_samples=10).fit(X)
labels = db.labels_
fig = plt.figure(figsize=(10, 10))
# Recent seaborn versions require x and y to be passed as keyword arguments
sns.scatterplot(x=X.iloc[:,0], y=X.iloc[:,1], hue=["cluster-{}".format(x) for x in labels])

In [None]:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
# Try ten eps values (1, 1/2, 1/3, ..., 1/10) with min_samples fixed at 10
fig = plt.figure(figsize=(20, 10))
fig.subplots_adjust(hspace=.5, wspace=.2)
for i, x in enumerate(range(10, 0, -1), start=1):
    eps = 1/(11-x)
    db = DBSCAN(eps=eps, min_samples=10).fit(X)
    labels = db.labels_
    
    print(eps)
    ax = fig.add_subplot(2, 5, i)
    ax.set_title("eps = {}".format(round(eps, 2)), fontsize=25)
    # Recent seaborn versions require x and y to be passed as keyword arguments
    sns.scatterplot(x=X.iloc[:,0], y=X.iloc[:,1], hue=["cluster-{}".format(l) for l in labels], ax=ax)

#### Of the values tried, eps = 1.0 (the largest) gives the most reasonable clustering, and smaller eps values produce clusterings that carry no information. Our eps value must therefore be far higher than this.

#### A Systematic Method for Tuning the eps Value
Since eps determines how many neighbours are found around each point, the distances to the nearest neighbours give a fair estimate for eps. Let us compute, for every point, the distance to its 10th nearest neighbour (matching `min_samples=10`).

In [None]:
from sklearn.neighbors import NearestNeighbors
nearest_neighbors = NearestNeighbors(n_neighbors=11)
neighbors = nearest_neighbors.fit(X)
distances, indices = neighbors.kneighbors(X)
distances = np.sort(distances[:,10], axis=0)
fig = plt.figure(figsize=(5, 5))
plt.plot(distances)
plt.xlabel("Points")
plt.ylabel("Distance")
plt.savefig("Distance_curve.png", dpi=300)

#### Locating our exact Knee point:

In [None]:
from kneed import KneeLocator
i = np.arange(len(distances))
knee = KneeLocator(i, distances, S=1, curve='convex', direction='increasing', interp_method='polynomial')
fig = plt.figure(figsize=(5, 5))
knee.plot_knee()
plt.xlabel("Points")
plt.ylabel("Distance")

print(distances[knee.knee])

We can see that the knee point detected by this method is at distance 3.98. Now we can use this value as our eps to see what the new clustering looks like.
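
The idea behind locating the knee can be sketched without an extra dependency. The heuristic below (the point farthest below the straight line joining the two ends of the sorted k-distance curve) is a simplified stand-in and is not the algorithm implemented by `kneed`; the data and all names are made up for illustration.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

toy_X, _ = make_blobs(n_samples=500, centers=3, cluster_std=0.6, random_state=0)

# Distance from each point to its 10th nearest neighbour (column 0 is the point itself)
nn = NearestNeighbors(n_neighbors=11).fit(toy_X)
toy_distances, _ = nn.kneighbors(toy_X)
k_dist = np.sort(toy_distances[:, 10])

# Rescale both axes to [0, 1]; for a convex, increasing curve the knee is
# taken as the point lying farthest below the line joining the two ends
x_norm = np.linspace(0.0, 1.0, len(k_dist))
y_norm = (k_dist - k_dist[0]) / (k_dist[-1] - k_dist[0])
knee_index = int(np.argmax(x_norm - y_norm))
toy_eps = k_dist[knee_index]

print("knee index:", knee_index, "suggested eps:", round(float(toy_eps), 3))
```

Points to the left of the knee sit in dense regions, while the steep tail on the right corresponds to sparse points that DBSCAN would treat as noise.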

In [None]:
db = DBSCAN(eps=distances[knee.knee], min_samples=10).fit(X)
labels = db.labels_

fig = plt.figure(figsize=(5, 5))
# Recent seaborn versions require x and y to be passed as keyword arguments
sns.scatterplot(x=X.iloc[:,0], y=X.iloc[:,1], hue=["cluster-{}".format(x) for x in labels])
plt.savefig("dbscan_with_knee.png", dpi=300)

11. Jot down recommendations or followup steps, detailing the reasons.

Using t-SNE to reduce the dimensions and then DBSCAN to cluster, we can clearly see that cluster-0 is the densest cluster. The other clusters are not as dense as cluster-0. We can therefore set the other clusters aside and focus mostly on cluster-0 to learn about its purchase habits, and formulate our business policy based on that.

## Task 3. To launch or not to launch?

In this task, we will work on a hypothetical application: cosmetics purchase prediction for new products with limited features. The intention here is to maximize **recall** so that no popular cosmetic is understocked. Overstocking is less of a concern since it will not cause disengagement in customers.
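
What favouring recall means in practice can be shown with a few made-up predictions. In the sketch below the labels, probabilities, and thresholds are all hypothetical; it only illustrates that lowering the decision threshold trades precision for recall.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Made-up labels and predicted purchase probabilities for eight products
toy_true = np.array([1, 1, 1, 0, 0, 1, 0, 0])
toy_proba = np.array([0.9, 0.6, 0.4, 0.45, 0.2, 0.3, 0.7, 0.1])

for threshold in (0.5, 0.25):
    toy_pred = (toy_proba >= threshold).astype(int)
    # recall = TP / (TP + FN): share of truly popular products that were flagged
    rec = recall_score(toy_true, toy_pred)
    prec = precision_score(toy_true, toy_pred)
    print(f"threshold={threshold}: recall={rec:.2f}, precision={prec:.2f}")
```

With the lower threshold every popular product is caught, at the cost of flagging more products that do not sell, which matches the stated preference for overstocking over understocking.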

The purchase status for each "new" product is known, but we only use the labels for benchmarking purpose. Instead, we use label spreading method to leverage a small amount of labeled data in combination with a larger amount of unlabeled data. 

1. Read in the data in `new.csv.gz` and save it as a `pd.DataFrame` named `new`. This is the test dataset.

    Look at the shape of `new` and inspect the first few rows.

In [None]:
import os
print(os.getcwd())

In [None]:
# YOUR CODE HERE
import pandas as pd
new = pd.read_csv('/home/krishanu_sinha/MLE-9/code/MLE-9/MLE-9/assignments/week-8-unsupervised-ML/dat/new.csv.gz')
new.head()

In [None]:
assert new.shape == (30091, 5)

In [None]:
new.head()

In [None]:
new.shape

2. How does the number of data points in the training set (`past`) compare to the number of datapoints in the test set (`new`)? 

    And how does the feature set in the training set compare to the feature set in the test set?

In [None]:
X_train, X_test, Y_train, Y_test = train_test_split(X, y, test_size=0.3, random_state=1, stratify=y)
# computing number of rows
rows = len(X_train.axes[0])
rows

In [None]:
Y_train.value_counts()

In [None]:
1203/3500

In [None]:
rows = len(new.axes[0])
rows

In [None]:
new['Purchased?'].value_counts()

In [None]:
10359/30091

#### In both the training split of `past` and in `new`, 34% of cases resulted in a purchase.

    *The number of data points in the training set is relatively small, while the test set is quite large. The training set has more features than the test set.*

3. Are there any product ids in both the training and test datasets? Hint: use `np.intersect1d` or set operations.

In [None]:
past_prodid_arr = past[["product_id"]].to_numpy()
new_prodid_arr = new[["product_id"]].to_numpy()

intersecting_prod_ids = np.intersect1d(past_prodid_arr, new_prodid_arr)
    
print (intersecting_prod_ids)

#### There are no common product ids in past (training) and new (test) datasets.
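
Both approaches mentioned in the hint can be tried on a small example first. The ids below are hypothetical and chosen so that the overlap is easy to verify by eye.

```python
import numpy as np

# Hypothetical product ids, for illustration only
toy_past_ids = np.array([101, 102, 103, 103, 104])
toy_new_ids = np.array([104, 105, 106, 101])

# np.intersect1d returns the sorted, unique values present in both arrays
common_np = np.intersect1d(toy_past_ids, toy_new_ids)

# The same check with Python sets
common_set = set(toy_past_ids.tolist()) & set(toy_new_ids.tolist())

print(common_np)      # [101 104]
print(common_set)
```

An empty result from either approach means the two datasets share no products.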

In [None]:
# YOUR CODE HERE

4. What percentage of data points resulted in a purchase in the test set?

    In reality, we won't be able to calculate this, because the labels of new products are not available to us. Here, we simply demonstrate that the distributions of the target in `past` and `new` are similar. 

In [None]:
new['Purchased?'].value_counts()

In [None]:
10359/30091

#### In both the training split of `past` and in `new`, 34% of cases resulted in a purchase.

5. Create `ndarray`s: `X_train`, `y_train`, `X_test`, and `y_test` according to the following guidelines.

    - The `Purchased?` column is the target.
    - `X_train` and `X_test` should contain the same features
    - `product_id` should not be a feature.

    Double check that the shapes of the four arrays are what you expect.

In [None]:
X = new.drop(columns=['product_id'])
y = X.pop('Purchased?')

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1, stratify=y)

X_train = X_train.to_numpy()
X_test = X_test.to_numpy()
y_train = y_train.to_numpy()
y_test = y_test.to_numpy()

In [None]:
assert X_train.shape[0] == y_train.shape[0] # 5000
assert X_train.shape[1] == X_test.shape[1]  # 3

assert type(X_train) == np.ndarray # make sure you import numpy as np at this point
assert type(X_train).__module__ == type(y_train).__module__ == np.__name__  # alternative way

6. Let's fit a simple logistic regression on the training set (`X_train`, `y_train`) and report performance on the test set (`X_test`, `y_test`).

In [None]:
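# Fit on the training set and evaluate on the test set;
# `y_pred` is kept for the comparison with the semi-supervised model later
from sklearn.pipeline import make_pipeline
clf = make_pipeline(StandardScaler(), LogisticRegression(class_weight="balanced"))
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print(classification_report(y_test, y_pred))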

7. Re-assemble data for semi-supervised learning. 
    - Use the features from the test set along with the features from the training set. 
    - Only use the labels from the training set but none from the test set.  
    
    Since we're using a large number of sampled features, but only a small number of these samples have labels, this is **semi-supervised learning**.

Create a matrix `X` that has the rows from `X_train` concatenated with the rows from `X_test`. Check the shape of the matrix.

In [None]:
# YOUR CODE HERE
# Keep only the columns that `new` also has, then stack `past` on top of `new`
past = past[['product_id', 'maxPrice', 'minPrice', 'Purchased?', 'Category']]
X = pd.concat([past, new], sort=False, ignore_index=True)
X.head()

In [None]:
X = X.drop(columns=['product_id','Purchased?'])

In [None]:
X.shape

In [None]:
assert X.shape == (35091, 3)

Create the target array `y` by concatenating `y_train` with a vector of -1's, effectively creating a dummy label for the `X_test` rows in `X`. Check the shape of the array. It should have as many values as `X` has rows.

In [None]:
y_train.shape

In [None]:
y_train

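In [None]:
# One dummy label (-1) for every unlabeled row of `new`
vector = np.full(new.shape[0], -1)

In [None]:
# The labels must line up with the rows of X: `past` rows first, then `new` rows
y = np.concatenate((past['Purchased?'].to_numpy(), vector))
y.shape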

In [None]:
assert X.shape[0] == y.shape[0]

In [None]:
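# scipy.stats.itemfreq has been removed from SciPy; np.unique gives the same counts
# (-1 marks the unlabeled rows)
np.unique(y, return_counts=True)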

8. Semi-supervised learning. 

    Scikit-learn provides two label propagation models: [`LabelPropagation`](https://scikit-learn.org/stable/modules/generated/sklearn.semi_supervised.LabelPropagation.html) and [`LabelSpreading`](https://scikit-learn.org/stable/modules/generated/sklearn.semi_supervised.LabelSpreading.html). Both work by constructing a similarity graph over all items in the input dataset. `LabelSpreading` is similar to the basic Label Propagation algorithm, but it uses an affinity matrix based on the normalized graph Laplacian and soft clamping across the labels, and is thus more robust to noise. We will be using scikit-learn's `LabelSpreading` model with `kNN`.
    
    Train a `LabelSpreading` model. Set `kernel` to `knn` and `alpha` to 0.01.

In [None]:
from sklearn.semi_supervised import LabelSpreading
label_prop_model = LabelSpreading(kernel='knn',alpha=0.01)

label_prop_model.fit(X,y)

In [None]:
# YOUR CODE HERE

9. Extract the predictions for the test data. 

    You can get the predictions from the `transduction_` attribute. Note that there is a value for every row in `X`, so select just the values that correspond to `X_test`.

In [None]:
semi_sup_preds = label_prop_model.predict(X_test)

In [None]:
assert semi_sup_preds.shape[0] == X_test.shape[0]
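
How `transduction_` lines up with the rows passed to `fit` can be sketched on synthetic data. Everything below is an assumption for illustration (toy blobs, 40 labeled rows, and `alpha=0.2`, which is scikit-learn's default rather than the 0.01 used in this task); the labeled rows come first and the unlabeled rows carry the dummy label -1.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.semi_supervised import LabelSpreading

# Toy data: two well-separated groups of 100 points each, returned in shuffled order
toy_centers = [[-5, -5], [5, 5]]
toy_X, toy_true = make_blobs(n_samples=200, centers=toy_centers, cluster_std=1.0, random_state=0)

# Rows are ordered as [labeled block, unlabeled block], like X_train followed by X_test
n_labeled = 40
toy_y = np.concatenate((toy_true[:n_labeled], np.full(len(toy_X) - n_labeled, -1)))

toy_model = LabelSpreading(kernel="knn", n_neighbors=7, alpha=0.2)
toy_model.fit(toy_X, toy_y)

# transduction_ holds one inferred label per row passed to fit, in the same order
toy_preds = toy_model.transduction_[n_labeled:]
toy_accuracy = (toy_preds == toy_true[n_labeled:]).mean()
print("accuracy on the unlabeled rows:", toy_accuracy)
```

Slicing `transduction_` from `n_labeled` onward only works because the unlabeled rows were appended after the labeled ones, so the row order of the feature matrix and the label vector must match.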

10. Print the classification report

In [None]:
from sklearn.metrics import classification_report
print(classification_report(y_test, semi_sup_preds))

In [None]:
print(classification_report(y_test, semi_sup_preds)) # make sure you properly import classification_report

Let's bring the performance from the supervised learning model down to see the comparison; discuss the areas of improvement and reasons for improvement.

In [None]:
print(classification_report(y_test, y_pred))

    *YOUR ANSWER HERE*

11. Read [Small Data Can Play a Big Role in AI](https://hbr.org/2020/02/small-data-can-play-a-big-role-in-ai) and discuss with your teammate the AI tools for training AI with small data and their use cases. 

## Acknowledgement & References

- data was adapted from Kaggle: [eCommerce Events History in Cosmetics Shop](https://www.kaggle.com/mkechinov/ecommerce-events-history-in-cosmetics-shop)
- function `visualize_silhouette` was adapted from [plot_kmeans_silhouette_analysis by scikit-learn](https://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html)
- [Categorizing Online Shopping Behavior from Cosmetics to Electronics: An Analytical Framework](https://arxiv.org/pdf/2010.02503.pdf)
- [OPAM: Online Purchasing-behavior Analysis using Machine learning](https://arxiv.org/pdf/2102.01625.pdf)