
### Instructions

#### Goal of the Project

This project is designed for you to practice and solve the activities that are based on the concepts covered in the lesson: 

**AT- Lesson 136 Collaborative Filtering II - Cosine Similarity**

---

#### Problem Statement

This transactional dataset contains all the transactions made in the year 2011 by a UK-based, registered, non-store online retailer.

As a data scientist, you need to analyse the data and calculate how similar the sold items are to one another. Based on the similarity scores, the retailer can stock items that are in high demand.

---

### Dataset Description

The DataFrame consists of the following columns:

|Field|Description|
|---:|:---|
| InvoiceNo | Invoice number |
| StockCode | Product (item) code |
| Description | Product (item) name |
| Quantity | The quantities of each product (item) per transaction |
| InvoiceDate | Invoice Date and time |
| UnitPrice | Unit price |
| CustomerID | Unique ID of the customer|
| Country | Country name |

**Dataset source:** https://archive.ics.uci.edu/ml/datasets/online+retail#

**Citation:** Daqing Chen, Sai Liang Sain, and Kun Guo, Data mining for the online retail industry: A case study of RFM model-based customer segmentation using data mining, Journal of Database Marketing and Customer Strategy Management

Dua, D., & Graff, C. (2017). UCI Machine Learning Repository.

**Dataset Link -** https://drive.google.com/uc?id=1BajrtGH5BF5WSmjgegymjDnA8NfhUH3y

---

### List of Activities
 
**Activity 1:** Import Modules and Read Data

**Activity 2:** Prepare Retail DataFrame Using Retail Dummy Data

**Activity 3:** Calculate Cosine Similarity Score

---

#### Activity 1: Import Modules and Read Data

1. Import the necessary Python modules.

2. Read the data from the dataset link to create a Pandas DataFrame and clean the data if required.

In [None]:
# Import the required modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")

# Read the dataset into a DataFrame.
url = "https://drive.google.com/uc?id=1BajrtGH5BF5WSmjgegymjDnA8NfhUH3y"
df = pd.read_csv(url)

# Print the first five records.
df.head()

,InvoiceNo,StockCode,Description,Quantity,InvoiceDate,UnitPrice,CustomerID,Country
0,539993,22386,JUMBO BAG PINK POLKADOT,10,01-04-2011 10:00,1.95,13313.0,United Kingdom
1,539993,21498,RED RETROSPOT WRAP,25,01-04-2011 10:00,0.42,13313.0,United Kingdom
2,539993,22379,RECYCLING BAG RETROSPOT,5,01-04-2011 10:00,2.1,13313.0,United Kingdom
3,539993,20718,RED RETROSPOT SHOPPER BAG,10,01-04-2011 10:00,1.25,13313.0,United Kingdom
4,539993,85099B,JUMBO BAG RED RETROSPOT,10,01-04-2011 10:00,1.95,13313.0,United Kingdom


Get the information about the dataset.

In [None]:
# Print the information of the DataFrame using the 'info()' function.
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 258913 entries, 0 to 258912
Data columns (total 8 columns):
 #   Column       Non-Null Count   Dtype  
---  ------       --------------   -----  
 0   InvoiceNo    258913 non-null  object 
 1   StockCode    258913 non-null  object 
 2   Description  258913 non-null  object 
 3   Quantity     258913 non-null  int64  
 4   InvoiceDate  258913 non-null  object 
 5   UnitPrice    258913 non-null  float64
 6   CustomerID   200920 non-null  float64
 7   Country      258913 non-null  object 
dtypes: float64(2), int64(1), object(5)
memory usage: 15.8+ MB


**Q:** Are there any missing values in the DataFrame?

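**A:** Yes. `CustomerID` has only 200920 non-null values out of 258913 entries, so 57993 values are missing. All the other columns are complete.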
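A per-column null count is one way to check this in code rather than by reading the `info()` output. The sketch below is only illustrative: `toy_df` is a small hypothetical DataFrame, not the retail data, and the variable names are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the retail DataFrame.
toy_df = pd.DataFrame({
    "InvoiceNo": ["539993", "539993", "540001"],
    "Quantity": [10, 25, 5],
    "CustomerID": [13313.0, np.nan, np.nan],
})

# Count the missing values in each column.
missing_per_column = toy_df.isnull().sum()
print(missing_per_column)

# Keep only the columns that actually contain missing values.
columns_with_missing = missing_per_column[missing_per_column > 0]
print(columns_with_missing)
```

Running the same two steps on `df` should list the columns whose non-null count is lower than the number of entries.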

<br>

Next, create a new DataFrame `ecommerce_df` that contains only the following features:
- `InvoiceNo`
- `Description`
- `Quantity`

In [None]:
# Pull out 'InvoiceNo', 'Description', 'Quantity' columns
ecommerce_df = df[['InvoiceNo', 'Description', 'Quantity']]
ecommerce_df.head()

,InvoiceNo,Description,Quantity
0,539993,JUMBO BAG PINK POLKADOT,10
1,539993,RED RETROSPOT WRAP,25
2,539993,RECYCLING BAG RETROSPOT,5
3,539993,RED RETROSPOT SHOPPER BAG,10
4,539993,JUMBO BAG RED RETROSPOT,10


Get the information about the new dataset.

In [None]:
# Print the information of the new DataFrame using the 'info()' function.
ecommerce_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 258913 entries, 0 to 258912
Data columns (total 3 columns):
 #   Column       Non-Null Count   Dtype 
---  ------       --------------   ----- 
 0   InvoiceNo    258913 non-null  object
 1   Description  258913 non-null  object
 2   Quantity     258913 non-null  int64 
dtypes: int64(1), object(2)
memory usage: 5.9+ MB


**Q:** Are there any missing values in the new DataFrame?

**A:** No.

**Q:** Are there any non-numeric columns? 

**A:** Yes. `InvoiceNo` and `Description` have the `object` dtype.
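The non-numeric columns can also be listed programmatically. This is a minimal sketch on a hypothetical `toy_df` (not the retail data), using `select_dtypes()` to exclude every numeric column.

```python
import pandas as pd

# Hypothetical toy data with the same three column names as 'ecommerce_df'.
toy_df = pd.DataFrame({
    "InvoiceNo": ["539993", "C555196"],
    "Description": ["RED RETROSPOT WRAP", "SPACEBOY BIRTHDAY CARD"],
    "Quantity": [25, -1],
})

# Drop all numeric columns and keep the names of whatever remains.
non_numeric_columns = toy_df.select_dtypes(exclude="number").columns.tolist()
print(non_numeric_columns)
```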



---



#### Activity 2: Prepare Retail DataFrame Using Retail Dummy Data

1. You are given a DataFrame for a dummy dataset of retail items with only the `Description` column.

2. Make a list of items using the `tolist()` function.

3. Perform the `isin()` operation to get the items based on retail history data.

> **Note -** The `isin()` function is used to select, from the original DataFrame, only the rows whose `Description` appears in the dummy DataFrame, so that collaborative filtering can be applied to these items.
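To see how the `tolist()` and `isin()` steps fit together, here is a small sketch on made-up data. The frames `toy_orders` and `toy_history` and their values are hypothetical and exist only for illustration.

```python
import pandas as pd

# Hypothetical transactions (not the retail data).
toy_orders = pd.DataFrame({
    "InvoiceNo": ["1001", "1001", "1002", "1003"],
    "Description": ["MUG", "CARD", "NOTEPAD", "CARD"],
    "Quantity": [2, 12, 1, 6],
})

# Hypothetical history containing only the items of interest.
toy_history = pd.DataFrame({"Description": ["CARD", "NOTEPAD"]})

# Step 1: turn the 'Description' column into a plain Python list.
wanted_items = toy_history["Description"].tolist()

# Step 2: build a boolean mask that is True where the item is in the list.
mask = toy_orders["Description"].isin(wanted_items)

# Step 3: keep only the matching rows.
filtered_orders = toy_orders[mask]
print(filtered_orders)
print(filtered_orders.shape)  # (number of rows, number of columns)
```

The `shape` attribute gives the dimensions of the filtered DataFrame directly.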

Create a DataFrame for the dummy dataset.

In [None]:
# Prepare a Dummy DataFrame
retail_history = [
  {"Description": "SPACEBOY BIRTHDAY CARD"},
  {"Description": "PACK OF 12 VINTAGE DOILY TISSUES"},
  {"Description": "VINTAGE LEAF MAGNETIC NOTEPAD"},
  {"Description": "PINK FLORAL FELTCRAFT SHOULDER BAG"},
  {"Description": "BLUE CAT BISCUIT BARREL PINK HEART"}
]

retail_history_df = pd.DataFrame(retail_history)
retail_history_df

,Description
0,SPACEBOY BIRTHDAY CARD
1,PACK OF 12 VINTAGE DOILY TISSUES
2,VINTAGE LEAF MAGNETIC NOTEPAD
3,PINK FLORAL FELTCRAFT SHOULDER BAG
4,BLUE CAT BISCUIT BARREL PINK HEART


Using the `isin()` function, create a DataFrame that contains only the rows whose `Description` appears in the retail history data.

In [None]:
# Obtain a DataFrame consisting of Description based on retail history data

# Convert the 'Description' to list of the retail history data
retail_list = retail_history_df['Description'].tolist()

# Perform 'isin()' operation to get the items based on retail history data
retail_df = ecommerce_df[ecommerce_df['Description'].isin(retail_list)]
retail_df

,InvoiceNo,Description,Quantity
4307,548728,SPACEBOY BIRTHDAY CARD,12
4343,548728,PINK FLORAL FELTCRAFT SHOULDER BAG,4
4382,548732,SPACEBOY BIRTHDAY CARD,12
5495,548893,SPACEBOY BIRTHDAY CARD,1
5823,548894,BLUE CAT BISCUIT BARREL PINK HEART,1
...,...,...,...
257478,581492,PACK OF 12 VINTAGE DOILY TISSUES,3
257596,581492,PINK FLORAL FELTCRAFT SHOULDER BAG,1
257731,581492,SPACEBOY BIRTHDAY CARD,4
258126,581492,VINTAGE LEAF MAGNETIC NOTEPAD,1


**Q:** What are the dimensions of the DataFrame after performing the `isin()` operation?

**A:** 679 rows and 3 columns.



---



#### Activity 3: Calculate Cosine Similarity Score

1. Create a pivot table using `pivot_table()` with the parameters `index='Description'`, `columns=['InvoiceNo']`, and `values='Quantity'`, then fill all `NaN` values with `0`.

> **Note -** The pivot table holds the quantity ordered for each combination of `Description` (rows) and `InvoiceNo` (columns). Using the pivot table, you can find the Cosine Similarity scores between the items and use them to recommend which items should be produced in large quantities.
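The pivoting step can be sketched on a few made-up rows. `toy_orders` and its values are hypothetical. Note that `pivot_table()` aggregates with the mean by default, so if an item appears more than once on the same invoice its quantities are averaged unless another `aggfunc` (for example `"sum"`) is passed.

```python
import pandas as pd

# Hypothetical transactions (not the retail data).
toy_orders = pd.DataFrame({
    "InvoiceNo": ["1001", "1001", "1002", "1003"],
    "Description": ["CARD", "NOTEPAD", "CARD", "NOTEPAD"],
    "Quantity": [12, 1, 6, 3],
})

# One row per item, one column per invoice, quantities as the cell values.
# Item/invoice pairs that never occur become NaN and are replaced by 0.
toy_pivot = toy_orders.pivot_table(
    index="Description", columns="InvoiceNo", values="Quantity"
).fillna(0)
print(toy_pivot)
```

Each row of the result is the quantity vector of one item across all invoices, which is the form needed for comparing items with one another.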



In [None]:
# Create a pivot table and replace NaN with the numeric value 0
df_items = retail_df.pivot_table(index='Description', columns=['InvoiceNo'], values='Quantity').fillna(0)
pivot_df = pd.DataFrame(df_items)
pivot_df

InvoiceNo,548728,548732,548893,548894,548988,549012,549115,549279,549295,549525,549577,549742,549743,549847,549851,551516,551693,551734,551855,551869,551884,551985,551994,551995,552039,552108,552180,552232,552234,552282,552285,552315,552330,552337,552353,552371,552466,552468,552469,552508,...,580129,580143,580367,580511,580517,580523,580610,580632,580637,580673,580727,580729,580730,580752,580904,580906,580939,580956,580974,580982,580983,580986,581171,581175,581176,581217,581219,581230,581399,581401,581402,581412,581437,581439,581492,581498,C555196,C570867,C571707,C574027
BLUE CAT BISCUIT BARREL PINK HEART,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,12,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
PACK OF 12 VINTAGE DOILY TISSUES,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,48,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,24,0,12,24,2,3,0,0,0,0,-24
PINK FLORAL FELTCRAFT SHOULDER BAG,4,0,0,0,8,0,0,1,0,0,0,12,0,0,0,4,4,4,0,0,0,0,0,0,0,0,0,0,0,3,1,0,4,0,0,0,0,0,0,0,...,0,0,0,0,8,3,0,0,0,0,0,0,0,0,0,0,4,2,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,-1,0,0,0
SPACEBOY BIRTHDAY CARD,12,12,1,0,0,12,24,0,1,1,12,0,12,12,12,12,12,0,12,36,0,12,12,4,24,12,12,5,1,0,0,12,0,12,12,12,12,12,12,1,...,12,0,1,24,0,0,0,0,0,0,1,0,0,12,0,0,0,0,12,12,2,12,12,144,72,1,3,12,12,0,0,0,0,4,4,0,0,-12,0,0
VINTAGE LEAF MAGNETIC NOTEPAD,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,12,2,0,0,0,3,2,3,0,0,2,6,0,2,12,0,2,0,0,1,0,0,0,0,6,0,0,0,0,0,6,0,1,1,1,0,0,-2,0


2. Steps to compute Cosine Similarity:

   1. Import the `cosine_similarity` function from `sklearn.metrics.pairwise`.

   2. Compute the Cosine Similarity on the pivoted DataFrame.

   3. Create a DataFrame from the Cosine Similarity scores using pandas and set both `index` and `columns` to the index of the pivot DataFrame.
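For intuition, the cosine similarity of two vectors is their dot product divided by the product of their lengths. The sketch below computes it by hand with NumPy for every pair of rows. The helper name `cosine_similarity_rows` and the toy matrix are hypothetical, and the sketch assumes that no row is all zeros.

```python
import numpy as np

def cosine_similarity_rows(matrix):
    """Return the pairwise cosine similarity between the rows of a 2-D array."""
    matrix = np.asarray(matrix, dtype=float)
    # Length (Euclidean norm) of each row; assumes no row is all zeros.
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    # Scale every row to unit length.
    unit_rows = matrix / norms
    # The dot product of two unit vectors is the cosine of the angle between them.
    return unit_rows @ unit_rows.T

# Hypothetical item-by-invoice quantities (rows are items).
toy_matrix = np.array([
    [12.0, 6.0, 0.0],
    [1.0, 0.0, 3.0],
])

similarity = cosine_similarity_rows(toy_matrix)
print(similarity)
```

The diagonal is always 1 because every item is identical to itself, and the matrix is symmetric because the similarity of item A to item B equals that of B to A.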





In [None]:
# Import 'cosine_similarity' from sklearn.metrics.pairwise
from sklearn.metrics.pairwise import cosine_similarity

# Perform Cosine Similarity
cosine_sim = cosine_similarity(pivot_df)

# Make a DataFrame for the cosine similarities
pd.DataFrame(cosine_sim, index = pivot_df.index, columns = pivot_df.index)

Description,BLUE CAT BISCUIT BARREL PINK HEART,PACK OF 12 VINTAGE DOILY TISSUES,PINK FLORAL FELTCRAFT SHOULDER BAG,SPACEBOY BIRTHDAY CARD,VINTAGE LEAF MAGNETIC NOTEPAD
BLUE CAT BISCUIT BARREL PINK HEART,1.0,0.0,0.0,0.00092,0.00056
PACK OF 12 VINTAGE DOILY TISSUES,0.0,1.0,0.001637,0.01292,0.014204
PINK FLORAL FELTCRAFT SHOULDER BAG,0.0,0.001637,1.0,0.014706,0.001382
SPACEBOY BIRTHDAY CARD,0.00092,0.01292,0.014706,1.0,0.031899
VINTAGE LEAF MAGNETIC NOTEPAD,0.00056,0.014204,0.001382,0.031899,1.0


**Q:** Which two items show the highest Cosine Similarity score?

**A:** `VINTAGE LEAF MAGNETIC NOTEPAD` and `SPACEBOY BIRTHDAY CARD`, with a score of 0.031899.
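The most similar pair can also be located in code by ignoring the diagonal (each item compared with itself) and taking the largest remaining score. This sketch retypes the scores from the output table above. The variable names are assumptions.

```python
import numpy as np

items = [
    "BLUE CAT BISCUIT BARREL PINK HEART",
    "PACK OF 12 VINTAGE DOILY TISSUES",
    "PINK FLORAL FELTCRAFT SHOULDER BAG",
    "SPACEBOY BIRTHDAY CARD",
    "VINTAGE LEAF MAGNETIC NOTEPAD",
]

# Scores copied from the Cosine Similarity output table.
scores = np.array([
    [1.0, 0.0, 0.0, 0.00092, 0.00056],
    [0.0, 1.0, 0.001637, 0.01292, 0.014204],
    [0.0, 0.001637, 1.0, 0.014706, 0.001382],
    [0.00092, 0.01292, 0.014706, 1.0, 0.031899],
    [0.00056, 0.014204, 0.001382, 0.031899, 1.0],
])

# Mask the diagonal so that self-similarity (always 1.0) is never selected.
off_diagonal = scores.copy()
np.fill_diagonal(off_diagonal, -np.inf)

# Position of the largest remaining score, converted back to (row, column).
row, col = np.unravel_index(np.argmax(off_diagonal), off_diagonal.shape)
best_pair = (items[row], items[col])
best_score = scores[row, col]
print(best_pair, best_score)
```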

<br>

---

