# **Norm of Vector**

![](https://mathsimulationtechnology.files.wordpress.com/2012/02/la_r2vector_length.jpg?w=298&h=285)

### You will work with a simple data set that contains the sales of various sanitizers in a particular year. You will perform the following tasks:

- Load and study the data
- Extract vectors from the data
- Calculate norms of vectors
- Print a sorted list of store IDs in terms of their sales of sanitizers
- Print a sorted list of sanitizers in terms of their popularity
- Create a function to calculate the norm of the difference of two vectors
- Create a function that returns the ID of the store whose sales are most similar to those of a given store
- Create a function that returns the name of the sanitizer whose sales are most similar to those of a given sanitizer


## Task 1 - Load and study the data

In [2]:
import numpy as np
import pandas as pd

In [3]:
df = pd.read_csv('/content/Store_Sanitizer_Sales.csv', index_col = 0)

In [4]:
df.head()

Store ID,Dettol,Savlon,Himalaya,Lifebuoy,Kaya,Godrej,Dabur
AMD463,8.4,6.5,9.8,5.8,3.2,7.1,3.5
BGL198,7.3,8.1,10.5,9.1,4.1,3.8,4.0
CXF008,9.2,10.2,8.6,10.0,4.2,5.6,2.8
DRH187,11.5,7.1,8.9,8.4,5.0,6.1,4.6
EWO651,7.4,6.9,10.0,5.8,3.7,8.2,4.5


#### Feature Description:
- The data set contains sales values of sanitizers in various stores.
- The sales values are in millions of units sold in the previous year.
- Each row contains the sales values of all the sanitizers for a particular store.
- Each column contains the sales values of a particular sanitizer in all the stores.

#### Study its features such as:
- The number of stores
- The number of sanitizers
- The ranges of sales

In [5]:
df.shape

(15, 7)

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 15 entries, AMD463 to OEP108
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Dettol    15 non-null     float64
 1   Savlon    15 non-null     float64
 2   Himalaya  15 non-null     float64
 3   Lifebuoy  15 non-null     float64
 4   Kaya      15 non-null     float64
 5   Godrej    15 non-null     float64
 6   Dabur     15 non-null     float64
dtypes: float64(7)
memory usage: 960.0+ bytes


In [7]:
df.describe().transpose()

,count,mean,std,min,25%,50%,75%,max
Dettol,15.0,9.293333,2.095392,5.4,7.7,9.2,10.95,12.5
Savlon,15.0,7.953333,1.473027,5.8,6.95,7.8,8.3,11.0
Himalaya,15.0,10.14,2.422749,7.0,8.75,9.8,10.6,16.3
Lifebuoy,15.0,7.58,1.721793,4.8,5.95,7.9,8.9,10.1
Kaya,15.0,4.806667,0.964711,3.2,4.05,4.9,5.55,6.5
Godrej,15.0,4.673333,2.272653,1.6,2.9,3.8,6.6,8.2
Dabur,15.0,4.833333,1.405432,2.7,4.0,4.6,5.55,7.5


#### Observations

- There are 15 rows and 7 columns in the data
- Each row corresponds to the sales values of different sanitizers of a particular store
- Each column corresponds to the sales values of a particular sanitizer in different stores
- The sales values range from 1.6 to 16.3 million units
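
These observations can also be checked programmatically. The sketch below rebuilds only the five rows printed by `df.head()`, so `sample` is an illustrative stand-in for the full 15-store file, and the range it reports applies to those five rows only.

```python
import pandas as pd

# Only the five rows shown by df.head(); the full CSV has 15 stores.
sample = pd.DataFrame(
    {
        "Dettol":   [8.4, 7.3, 9.2, 11.5, 7.4],
        "Savlon":   [6.5, 8.1, 10.2, 7.1, 6.9],
        "Himalaya": [9.8, 10.5, 8.6, 8.9, 10.0],
        "Lifebuoy": [5.8, 9.1, 10.0, 8.4, 5.8],
        "Kaya":     [3.2, 4.1, 4.2, 5.0, 3.7],
        "Godrej":   [7.1, 3.8, 5.6, 6.1, 8.2],
        "Dabur":    [3.5, 4.0, 2.8, 4.6, 4.5],
    },
    index=pd.Index(["AMD463", "BGL198", "CXF008", "DRH187", "EWO651"], name="Store ID"),
)

n_stores, n_sanitizers = sample.shape   # rows are stores, columns are sanitizers
lowest = sample.min().min()             # smallest sales value in the sample
highest = sample.max().max()            # largest sales value in the sample
print(n_stores, n_sanitizers, lowest, highest)
```

Run against the full `df`, the same three lines should reproduce the shape and range listed above.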

## Task 2 - Extract vectors from the data

- You have seen that there are two types of vectors that we can extract from the data:
  - Row vectors: these contain the sales values of different sanitizers of a particular store
  - Column vectors: these contain the sales values of a particular sanitizer in different stores

In [8]:
# Access the row vector for the store with ID "DRH187" using ".loc[]"
df.loc['DRH187']

Dettol      11.5
Savlon       7.1
Himalaya     8.9
Lifebuoy     8.4
Kaya         5.0
Godrej       6.1
Dabur        4.6
Name: DRH187, dtype: float64

In [9]:
# Access the row vector for the store with ID "LXM794" using ".loc[]"
df.loc['LXM794']

Dettol      5.4
Savlon      6.8
Himalaya    8.9
Lifebuoy    8.2
Kaya        4.0
Godrej      8.2
Dabur       2.7
Name: LXM794, dtype: float64

In [10]:
# Access the column vector for the sanitizer "Dettol"
# Note: You may access column vectors either directly or using ".loc[]"
df['Dettol']

Store ID
AMD463     8.4
BGL198     7.3
CXF008     9.2
DRH187    11.5
EWO651     7.4
FLJ843    12.1
GKM230     8.0
HRT117    10.4
IDT818     8.8
JRB076     6.9
KTG143    11.5
LXM794     5.4
MRN637    12.5
NGP835     9.7
OEP108    10.3
Name: Dettol, dtype: float64

In [11]:
# Access the column vector for the sanitizer "Kaya"
df['Kaya']

Store ID
AMD463    3.2
BGL198    4.1
CXF008    4.2
DRH187    5.0
EWO651    3.7
FLJ843    5.6
GKM230    3.7
HRT117    4.8
IDT818    6.1
JRB076    4.9
KTG143    5.5
LXM794    4.0
MRN637    6.5
NGP835    5.2
OEP108    5.6
Name: Kaya, dtype: float64

#### Observations

- There are two types of vectors that we can work with:
  - The sales values of different sanitizers of a particular store (row vectors)
    - Note: These are 7-long since there are 7 sanitizers in the data set
  - The sales values of a particular sanitizer in different stores (column vectors)
    - Note: These are 15-long since there are 15 stores in the data set
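
As a small illustration of the two vector types, the sketch below uses three stores whose rows are printed earlier in this notebook. The frame `sample` is an assumption made for the example, so its column vectors are 3-long rather than 15-long.

```python
import pandas as pd

columns = ["Dettol", "Savlon", "Himalaya", "Lifebuoy", "Kaya", "Godrej", "Dabur"]
rows = {
    "AMD463": [8.4, 6.5, 9.8, 5.8, 3.2, 7.1, 3.5],
    "DRH187": [11.5, 7.1, 8.9, 8.4, 5.0, 6.1, 4.6],
    "LXM794": [5.4, 6.8, 8.9, 8.2, 4.0, 8.2, 2.7],
}
sample = pd.DataFrame.from_dict(rows, orient="index", columns=columns)

row_vector = sample.loc["DRH187"].to_numpy()   # one store, all sanitizers
col_vector = sample["Dettol"].to_numpy()       # one sanitizer, all stores

# A column can be selected directly or through .loc with a row slice.
same_column = sample.loc[:, "Dettol"]

print(row_vector.shape, col_vector.shape)
```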

## Task 3 - Calculate norms of vectors

- To perform the tasks in this exercise, we will use metrics that are derived from norms of vectors
- So, let's get comfortable with calculating norms using Numpy

In [13]:
# Calculate the norm of the row vector for the store with ID "GKM230" using "np.linalg.norm()"
# Note: Leave all parameters as default except the input vector
np.linalg.norm(df.loc['GKM230'])


18.379336223052235

In [14]:
# Calculate the L2 norm of the row vector for the store with ID "GKM230" using "np.linalg.norm()"
# Note: This time, specify "ord = 2" as one of the parameters
np.linalg.norm(df.loc['GKM230'], ord = 2 )

18.379336223052235

In [15]:
df.loc['GKM230']

Dettol       8.0
Savlon       8.4
Himalaya    10.7
Lifebuoy     6.5
Kaya         3.7
Godrej       4.1
Dabur        4.0
Name: GKM230, dtype: float64

##### Note: For a vector, the default norm calculated by "np.linalg.norm()" is the L2 norm; the "ord" parameter defaults to None, which for 1-D input gives the same result as "ord = 2"

![](https://miro.medium.com/max/1056/1*r6RXOl5FoSbY6urC9HjUcw.png)
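
One caveat worth knowing: the match between the default and `ord = 2` holds for vectors, not for 2-D input. The sketch below compares the two settings on the "GKM230" row vector and on a small matrix; the matrix `m` is an arbitrary example chosen for illustration, not part of the data set.

```python
import numpy as np

v = np.array([8.0, 8.4, 10.7, 6.5, 3.7, 4.1, 4.0])   # row vector for store GKM230

default_norm = np.linalg.norm(v)          # ord=None
l2_norm = np.linalg.norm(v, ord=2)        # explicit L2 norm
print(default_norm, l2_norm)              # identical for 1-D input

# For a 2-D array the two settings mean different things.
m = np.array([[8.0, 8.4], [10.7, 6.5]])   # arbitrary illustrative matrix
frobenius = np.linalg.norm(m)             # ord=None: square root of the sum of squares
spectral = np.linalg.norm(m, ord=2)       # ord=2: largest singular value
print(frobenius, spectral)
```

This is why the notebook always passes a single row or column to `np.linalg.norm()` rather than the whole DataFrame.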

In [16]:
# Calculate the L1 norm of the row vector for the store with ID "GKM230" using "np.linalg.norm()"
# Note: This time, specify "ord = 1" as one of the parameters
np.linalg.norm(df.loc['GKM230'], ord=1)

45.4

In [17]:
# Calculate the L3 norm of the row vector for the store with ID "GKM230" using "np.linalg.norm()"
# Note: This time, specify "ord = 3" as one of the parameters
np.linalg.norm(df.loc['GKM230'], ord=3)

14.07434262320583

In [18]:
# Calculate the max norm of the row vector for the store with ID "GKM230" using "np.linalg.norm()"
# Note: This time, specify "ord = np.inf" as one of the parameters
np.linalg.norm(df.loc['GKM230'], ord=np.inf)

10.7
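
To see what the "ord" parameter does, the norms above can be reproduced by hand. The helper `p_norm()` below is a hypothetical function written for this illustration; it raises each absolute value to the power p, sums the results, and takes the p-th root, with the max norm handled as a special case.

```python
import numpy as np

def p_norm(values, p):
    """Return (sum |x_i|^p)^(1/p); p = np.inf returns the largest |x_i|."""
    x = np.abs(np.asarray(values, dtype=float))
    if np.isinf(p):
        return x.max()
    return (x ** p).sum() ** (1.0 / p)

gkm230 = [8.0, 8.4, 10.7, 6.5, 3.7, 4.1, 4.0]   # row vector for store GKM230

for p in (1, 2, 3, np.inf):
    # The hand-written value and NumPy's value should agree for every order.
    print(p, p_norm(gkm230, p), np.linalg.norm(gkm230, ord=p))
```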

##### Note: For the remainder of this exercise, we will use the default (L2) norm for the sake of simplicity and consistency

In [19]:
# Calculate the norm of the row vector for the store with ID "KTG143" using "np.linalg.norm()"
np.linalg.norm(df.loc['KTG143'])

26.075467397536713

In [20]:
# Calculate the norm of the row vector for the store with ID "OEP108" using "np.linalg.norm()"
np.linalg.norm(df.loc['OEP108'])

18.086182571233767

In [21]:
# Calculate the norm of the column vector for the sanitizer "Himalaya" using "np.linalg.norm()"
np.linalg.norm(df['Himalaya'])

40.304714364451215

In [22]:
# Calculate the norm of the column vector for the sanitizer "Dabur" using "np.linalg.norm()"
np.linalg.norm(df['Dabur'])

19.444022217637997

#### Observations

- Norms can be calculated using different orders ranging from 1, 2, 3, ... up to infinity
- The default norm is the L2 norm, also called the Euclidean norm; it equals the magnitude (length) of the vector, that is, its Euclidean distance from the origin
- The magnitude of a vector (L2 norm) can be used as a measure of the strength of the values in that vector
- The smaller the norm of a vector, the weaker the vector, and vice versa
- The larger the norm of a vector, the stronger the vector, and vice versa

## Task 4 - Print a sorted list of store IDs in terms of their sales of sanitizers

- The norm of a vector indicates the strength of the values in that vector
- The smaller the norm of a vector, the weaker the vector, and vice versa
- The larger the norm of a vector, the stronger the vector, and vice versa
- We can use the norms of row vectors as a measure of the overall sales value of different stores

In [23]:
# Calculate the norms of the row vectors for all 15 stores and sort them in descending order
# Note: Use these norms as a measure of overall sales values of the different stores
# Note: Store the resulting values in a Pandas Series with index as store IDs and name the series "stores"
# Note: The "index" parameter of the series can be set as "df.index"
# Note: You may need to specify the "dtype" parameter of the series as "float64" to avoid some warnings
# Note: Use the ".sort_values()" function with the "ascending" parameter set to "False"

stores = pd.Series(index = df.index, dtype = 'float64')

for store in df.index:
    stores.loc[store] = np.linalg.norm(df.loc[store])
stores = stores.sort_values(ascending = False)

In [24]:
# Print the "stores" series
stores

Store ID
KTG143    26.075467
FLJ843    21.870299
JRB076    21.861381
NGP835    20.833867
CXF008    20.481211
MRN637    20.475839
DRH187    20.391175
BGL198    18.952836
IDT818    18.670565
HRT117    18.415754
GKM230    18.379336
EWO651    18.362734
OEP108    18.086183
AMD463    17.759223
LXM794    17.674275
dtype: float64

#### Observations

- The store with ID "KTG143" has the highest sales
- The store with ID "LXM794" has the lowest sales
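
The loop in this task can also be written without explicit iteration by passing `axis=1` to `np.linalg.norm()`, which computes one norm per row. The sketch below is one possible alternative, shown on the five rows printed by `df.head()`; `sample` is an assumed stand-in for the full data.

```python
import numpy as np
import pandas as pd

columns = ["Dettol", "Savlon", "Himalaya", "Lifebuoy", "Kaya", "Godrej", "Dabur"]
rows = {
    "AMD463": [8.4, 6.5, 9.8, 5.8, 3.2, 7.1, 3.5],
    "BGL198": [7.3, 8.1, 10.5, 9.1, 4.1, 3.8, 4.0],
    "CXF008": [9.2, 10.2, 8.6, 10.0, 4.2, 5.6, 2.8],
    "DRH187": [11.5, 7.1, 8.9, 8.4, 5.0, 6.1, 4.6],
    "EWO651": [7.4, 6.9, 10.0, 5.8, 3.7, 8.2, 4.5],
}
sample = pd.DataFrame.from_dict(rows, orient="index", columns=columns)

# axis=1 collapses the columns, giving one L2 norm per store (row).
stores = pd.Series(
    np.linalg.norm(sample.to_numpy(), axis=1),
    index=sample.index,
    name="stores",
).sort_values(ascending=False)

print(stores)
```

The norms for these five stores should match the corresponding entries in the "stores" series printed above.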

## Task 5 - Print a sorted list of sanitizers in terms of their popularity

- The norm of a vector indicates the strength of the values in that vector
- The smaller the norm of a vector, the weaker the vector, and vice versa
- The larger the norm of a vector, the stronger the vector, and vice versa
- We can use the norms of column vectors as a measure of the overall sales of different sanitizers

In [25]:
# Calculate the norms of the column vectors for all 7 sanitizers and sort them in descending order
# Note: Use these norms as a measure of overall sales values of the different sanitizers
# Note: Store the resulting values in a Pandas Series with index as sanitizer names and name the series "sanitizers"
# Note: The "index" parameter of the series can be set as "df.columns"
# Note: You may need to specify the "dtype" parameter of the series as "float64" to avoid some warnings
# Note: Use the ".sort_values()" function with the "ascending" parameter set to "False"

sanitizers = pd.Series(index = df.columns, dtype = 'float64')

for sanitizer in sanitizers.index:
    sanitizers.loc[sanitizer] = np.linalg.norm(df[sanitizer])

sanitizers = sanitizers.sort_values(ascending = False)

In [26]:
# Print the "sanitizers" series
sanitizers

Himalaya    40.304714
Dettol      36.836938
Savlon      31.292331
Lifebuoy    30.055781
Godrej      19.997750
Dabur       19.444022
Kaya        18.962858
dtype: float64

#### Observations

- The most popular sanitizer is "Himalaya"
- The least popular sanitizer is "Kaya"
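
The same vectorized idea applies to columns by switching to `axis=0`. The sketch below uses only the two column vectors printed in Task 2 ("Dettol" and "Kaya"), so the frame `two` is an assumption made for the example rather than the full data set.

```python
import numpy as np
import pandas as pd

# Column vectors for all 15 stores, as printed in Task 2.
two = pd.DataFrame(
    {
        "Dettol": [8.4, 7.3, 9.2, 11.5, 7.4, 12.1, 8.0, 10.4,
                   8.8, 6.9, 11.5, 5.4, 12.5, 9.7, 10.3],
        "Kaya":   [3.2, 4.1, 4.2, 5.0, 3.7, 5.6, 3.7, 4.8,
                   6.1, 4.9, 5.5, 4.0, 6.5, 5.2, 5.6],
    }
)

# axis=0 collapses the rows, giving one L2 norm per sanitizer (column).
sanitizers = pd.Series(
    np.linalg.norm(two.to_numpy(), axis=0),
    index=two.columns,
    name="sanitizers",
).sort_values(ascending=False)

print(sanitizers)
```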

## Task 6 - Create a function to calculate the norm of the difference of two vectors


- The norm of the difference between two vectors can be used as a similarity measure between those two vectors
- The smaller the norm of the difference vector, the more similar the two original vectors, and vice versa
- The larger the norm of the difference vector, the more dissimilar the two original vectors, and vice versa

In [27]:
# Create a function called "sim()" which takes in two vectors and returns a similarity score between them (a smaller value means more similar)
# Note: Use the "np.linalg.norm()" method on the difference of the two vectors to calculate their similarity
# Note: The order in which the vectors are subtracted does not matter

def sim(vec1, vec2):
    vec = vec1 - vec2
    norm =  np.linalg.norm(vec)
    return norm

In [28]:
# Calculate the similarity between the sales of the stores with IDs "AMD463" and "HRT117" using the function "sim()"
sim(df.loc['AMD463'], df.loc['HRT117'])

5.175905717843014

In [29]:
# Calculate the similarity between the sales of the stores with IDs "NGP835" and "GKM230" using the function "sim()"
sim(df.loc['NGP835'], df.loc['GKM230'])

6.040695324215581

In [30]:
# Calculate the similarity between the sales of the sanitizers "Lifebuoy" and "Himalaya" using the function "sim()"
sim(df['Lifebuoy'], df['Himalaya'])

14.395138068111747

In [31]:
# Calculate the similarity between the sales of the sanitizers "Dettol" and "Godrej" using the function "sim()"
sim(df['Dettol'], df['Godrej'])

22.269036800005516

In [43]:
# Calculate the similarity between the sales of the store with ID "JRB076" and itself using the function "sim()"
sim(df.loc['JRB076'], df.loc['JRB076'])

0.0

#### Observations

- The norm of the difference between two vectors can be used as a similarity measure between those two vectors
- The norm of the difference between a vector and itself is 0
- The smaller the norm of the difference between two vectors, the more similar the two vectors, and vice versa
- The larger the norm of the difference between two vectors, the more dissimilar the two vectors, and vice versa
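
Two properties of this measure are easy to confirm with a short check: swapping the arguments does not change the result, and a vector compared with itself gives 0. The sketch below redefines `sim()` so that it runs on its own and uses the row vectors of "AMD463" and "EWO651" from the `df.head()` output.

```python
import numpy as np

def sim(vec1, vec2):
    """Norm of the difference; smaller values mean more similar vectors."""
    return np.linalg.norm(np.asarray(vec1) - np.asarray(vec2))

amd463 = np.array([8.4, 6.5, 9.8, 5.8, 3.2, 7.1, 3.5])
ewo651 = np.array([7.4, 6.9, 10.0, 5.8, 3.7, 8.2, 4.5])

forward = sim(amd463, ewo651)
backward = sim(ewo651, amd463)   # order of subtraction does not matter
itself = sim(amd463, amd463)     # a vector is at distance 0 from itself

print(forward, backward, itself)
```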

## Task 7 - Create a function that returns the ID of the store whose sales are most similar to those of a given store

- The norm of the difference between two vectors can be used as a similarity measure between those two vectors
- The smaller the norm of the difference between two vectors, the more similar the two vectors, and vice versa
- The larger the norm of the difference between two vectors, the more dissimilar the two vectors, and vice versa
- We will create a function that returns the ID of the store whose sales are most similar to those of a given store

![](https://datascience-enthusiast.com/figures/cosine_sim.png)


In [44]:
# Define a function "storesim()" that takes in the ID of a store and returns the ID of the store with the most similar sales
# Note: Use the "sim()" function to measure similarities between stores (using their row vectors)
# Note: You may create a temporary Pandas Series within the "storesim()" function to store the similarity values
# Note: The "index" parameter of the series can be set as "df.index"
# Note: You may need to specify the "dtype" parameter of the series as "float64" to avoid some warnings
# Note: You may sort the entries in this series and return the second index entry
# Note: Use the ".sort_values()" function with the default value for the "ascending" parameter, which is "True"
# Note: The first entry after sorting will be trivial

def storesim(store1):
    temp = pd.Series(index = df.index, dtype = 'float64')
    for store2 in temp.index:
        temp.loc[store2] = sim(df.loc[store1], df.loc[store2])
    return temp.sort_values().index[1]

In [49]:
# Use the function "storesim()" to find the ID of the store whose sales are most similar to the store with ID "AMD463"
storesim('AMD463')

'EWO651'

In [46]:
# Use the function "storesim()" to find the ID of the store whose sales are most similar to the store with ID "NGP835"
storesim('NGP835')

'HRT117'

In [47]:
# Use the function "storesim()" to find the ID of the store whose sales are most similar to the store with ID "CXF008"
storesim('CXF008')

'BGL198'

In [48]:
# Use the function "storesim()" to find the ID of the store whose sales are most similar to the store with ID "BGL198"
storesim('BGL198')

'GKM230'

#### Observations

- The norm of the difference between row vectors is a measure of similarity between the sales in different stores
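- The norm of the difference itself is symmetric: sim(X, Y) equals sim(Y, X)
- The "most similar" relation, however, is not symmetric, because it pertains to a specific case (such as a specific store)
- It is possible for X to be most similar to Y while Y is most similar to Z; for example, "CXF008" is most similar to "BGL198", but "BGL198" is most similar to "GKM230"
- This happens when Y and Z are closer to each other than X and Y are
- Transitivity does not hold in general either, although it can be a useful simplifying assumption for further analysis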
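
The chain from "CXF008" to "BGL198" to "GKM230" can be reproduced on a subset of the data. The sketch below uses the seven stores whose rows are printed in this notebook; `sample` and the helper `nearest_store()` are assumptions made for the illustration. Instead of sorting and taking the second entry, the helper drops the query store first, which avoids relying on the query landing in first place when another store has identical sales.

```python
import numpy as np
import pandas as pd

columns = ["Dettol", "Savlon", "Himalaya", "Lifebuoy", "Kaya", "Godrej", "Dabur"]
rows = {
    "AMD463": [8.4, 6.5, 9.8, 5.8, 3.2, 7.1, 3.5],
    "BGL198": [7.3, 8.1, 10.5, 9.1, 4.1, 3.8, 4.0],
    "CXF008": [9.2, 10.2, 8.6, 10.0, 4.2, 5.6, 2.8],
    "DRH187": [11.5, 7.1, 8.9, 8.4, 5.0, 6.1, 4.6],
    "EWO651": [7.4, 6.9, 10.0, 5.8, 3.7, 8.2, 4.5],
    "GKM230": [8.0, 8.4, 10.7, 6.5, 3.7, 4.1, 4.0],
    "LXM794": [5.4, 6.8, 8.9, 8.2, 4.0, 8.2, 2.7],
}
sample = pd.DataFrame.from_dict(rows, orient="index", columns=columns)

def nearest_store(frame, store_id):
    """Hypothetical helper: ID of the store closest to store_id, excluding itself."""
    others = frame.drop(index=store_id)
    # Broadcasting subtracts the query row from every remaining row.
    distances = np.linalg.norm(others.to_numpy() - frame.loc[store_id].to_numpy(), axis=1)
    return others.index[distances.argmin()]

for store in ("AMD463", "CXF008", "BGL198"):
    print(store, "->", nearest_store(sample, store))
```

Because each of these stores' nearest neighbour in the full data is also present in the subset, the results agree with the `storesim()` outputs above.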

## Task 8 - Create a function that returns the name of the sanitizer whose sales are most similar to those of a given sanitizer

- The norm of the difference between two vectors can be used as a similarity measure between those two vectors
- The smaller the norm of the difference between two vectors, the more similar the two vectors, and vice versa
- The larger the norm of the difference between two vectors, the more dissimilar the two vectors, and vice versa
- We will create a function that returns the name of the sanitizer whose sales are most similar to a given sanitizer

In [61]:
# Define a function "sanisim()" that takes in the name of a sanitizer and returns the name of the sanitizer with the most similar sales
# Note: Use the "sim()" function to measure similarities between sanitizers (using their column vectors)
# Note: You may create a temporary Pandas Series within the "sanisim()" function to store the similarity values
# Note: The "index" parameter of the series can be set as "df.columns"
# Note: You may need to specify the "dtype" parameter of the series as "float64" to avoid some warnings
# Note: You may sort the entries in this series and return the second index entry
# Note: Use the ".sort_values()" function with the default value for the "ascending" parameter, which is "True"
# Note: The first entry after sorting will be trivial

def sanisim(sani1):
    temp = pd.Series(index = df.columns, dtype = 'float64')
    for sani2 in temp.index:
        temp.loc[sani2] = sim(df[sani1], df[sani2])
    return temp.sort_values().index[1]

In [62]:
# Use the function "sanisim()" to find the name of the sanitizer whose sales are most similar to the sanitizer "Godrej"
sanisim('Godrej')

'Kaya'

In [63]:
# Use the function "sanisim()" to find the name of the sanitizer whose sales are most similar to the sanitizer "Dabur"

sanisim('Dabur')

'Kaya'

In [64]:
# Use the function "sanisim()" to find the name of the sanitizer whose sales are most similar to the sanitizer "Himalaya"

sanisim('Himalaya')

'Dettol'

In [65]:
# Use the function "sanisim()" to find the name of the sanitizer whose sales are most similar to the sanitizer "Dettol"

sanisim('Dettol')

'Lifebuoy'

#### Observations

- The norm of the difference between column vectors is a measure of similarity between the sales of different sanitizers
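- The norm of the difference itself is symmetric: sim(X, Y) equals sim(Y, X)
- The "most similar" relation, however, is not symmetric, because it pertains to a specific case (such as a specific sanitizer)
- It is possible for X to be most similar to Y while Y is most similar to Z; for example, "Himalaya" is most similar to "Dettol", but "Dettol" is most similar to "Lifebuoy"
- This happens when Y and Z are closer to each other than X and Y are
- Transitivity does not hold in general either, although it can be a useful simplifying assumption for further analysis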
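
All pairwise scores between sanitizers can be collected in a single distance matrix, which makes the symmetry of the score visible. The sketch below is one way to do this with NumPy broadcasting. It uses only the five rows printed by `df.head()`, so `sample` is an assumed stand-in and the nearest sanitizers it reports need not match the full-data results above.

```python
import numpy as np
import pandas as pd

columns = ["Dettol", "Savlon", "Himalaya", "Lifebuoy", "Kaya", "Godrej", "Dabur"]
rows = {
    "AMD463": [8.4, 6.5, 9.8, 5.8, 3.2, 7.1, 3.5],
    "BGL198": [7.3, 8.1, 10.5, 9.1, 4.1, 3.8, 4.0],
    "CXF008": [9.2, 10.2, 8.6, 10.0, 4.2, 5.6, 2.8],
    "DRH187": [11.5, 7.1, 8.9, 8.4, 5.0, 6.1, 4.6],
    "EWO651": [7.4, 6.9, 10.0, 5.8, 3.7, 8.2, 4.5],
}
sample = pd.DataFrame.from_dict(rows, orient="index", columns=columns)

cols = sample.to_numpy().T                    # one row per sanitizer
diff = cols[:, None, :] - cols[None, :, :]    # every pair of column vectors
dist = pd.DataFrame(
    np.linalg.norm(diff, axis=2),             # L2 norm of each difference
    index=sample.columns,
    columns=sample.columns,
)

# Hide the zero diagonal so a sanitizer is never matched with itself.
masked = dist.where(~np.eye(len(dist), dtype=bool))
nearest = masked.idxmin()

print(dist.round(2))
print(nearest)
```

The matrix is symmetric with zeros on the diagonal, while `nearest` is a per-sanitizer lookup that does not have to be symmetric.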

#### Conclusions

- From the sales data, we can extract row vectors (sales of a store) and column vectors (sales of a sanitizer)
- We can use norms of row vectors for particular stores to measure their overall sales
- We can use norms of column vectors for particular sanitizers to measure their popularity
- We can use the norm of the difference between row vectors of two stores to measure the similarity in their sales
- We can use the norm of the difference between column vectors of two sanitizers to measure the similarity in their sales
- Using these measures, we can recommend decisions to certain stores about which sanitizers to sell more of and less of
- This is a very basic look into the working methodology of sales similarity analysis