# Collaborative Filtering with Agricultural Commodities

This week, we focus on the concept of embeddings and look at two common machine learning methods that use embeddings: collaborative filtering and autoencoders. Embeddings are how machine learning models represent data in a structured way, largely by reducing high-dimensional data to continuous vectors.

![image](https://gisgeography.com/wp-content/uploads/2014/12/feed-world-agriculture-map.png)

When you shop online or watch content on streaming services, have you ever wondered how other items or shows are recommended to you? These suggestions largely come from recommender systems, which suggest content matched to a user's specific preferences. A common method in recommender systems is collaborative filtering, which recommends content based on similarity measures between users and the content. A simple approach within collaborative filtering is to rely on matrix factorization to identify the relationships between items and users: latent features, which capture the association between the user and item matrices, are determined and then used to find similarity and make predictions. Common examples of collaborative filtering include movie or song/artist recommendations built on the MovieLens or lastfm datasets, examples of which you can find online.


For this workbook, we will focus on a dataset that features 50+ agricultural commodities grown in 3200+ administrative units around the world. This data is implicit, so we only know whether or not a commodity is grown in each location. Imagine you are a policymaker in the agriculture sector and you want to improve nutritional and food security outcomes by improving crop diversity, that is, increasing the number of commodities grown in all the administrative units around the world. To make these recommendations you would need context-specific information on which crops grow best where and which parts of the world can grow different crops, but this might be information that you can't easily access. Instead, for an initial understanding, we can use our data to build a recommender system to answer the following questions:
1. What commodities are most similar to one another?
2. What admin units are most similar to one another?
3. What are the top 5 new commodities that we could recommend for an admin unit to grow?


By the end of this exercise we will have developed 3 outputs: a similarity matrix of commodities, a similarity matrix of admin units, and a recommendation of the top 5 new commodities for each admin unit in our dataset.

## Learning Objectives
1. Understand how to create a similarity matrix from a pairwise dataset
2. Understand how to use matrix factorization to create a recommender system
3. Interpret the results of a recommender system


## Crop-Crop Similarities
Here, we'll aim to understand what crops are similar to each other based on whether they are grown in the same regions as each other. Let's start by setting up our environment and loading in our data. Ensure that you have the [corresponding dataset](https://kaggle.com/datasets/18185cb6db270f05eee91f017174ca33fefd087dcdf99ae15f87b675587e2af0) pulled into this notebook. 

In [None]:
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.decomposition import PCA
import seaborn as sns 
import matplotlib.pyplot as plt 
import os
print('imported')

In [None]:
base_path = '/kaggle/input/commodotiesgrown-byadmin1unit'
filename = 'commoditiesgrown_byadmin1unit.csv'
data = pd.read_csv(os.path.join(base_path, filename))  
data.head()

We can see that we have the admin unit id (mostly provinces or states within countries, e.g. AFG.1_1 is the Badakhshan province in Afghanistan), the name of the commodity (this includes both crops and animals), and whether it is grown in that unit (binary, 1 for yes or 0 for no).

Let's pivot the data and create a matrix. We will also set any NAs to zero and index by admin unit.
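
To make the reshaping concrete, here is a minimal sketch on a made-up table. The admin ids and commodity names below are hypothetical; only the column names (`admin_id`, `commodity`, `grown`) follow the dataset used in this notebook.

```python
import pandas as pd

# Hypothetical long-format records mimicking the (admin_id, commodity, grown) layout.
toy = pd.DataFrame({
    "admin_id":  ["A.1", "A.1", "A.2", "B.1"],
    "commodity": ["maize", "rice", "maize", "coffee"],
    "grown":     [1, 1, 1, 1],
})

# One row per admin unit, one column per commodity.
# Pairs that never appear in the long table come out as NaN, so we fill them with 0.
toy_matrix = pd.pivot_table(
    toy, values="grown", index="admin_id", columns="commodity"
).fillna(0)

print(toy_matrix)
```

Each row of the result is a binary "growing profile" for one admin unit, which is the shape the similarity calculations later in the notebook rely on.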

In [None]:
DfMatrix = pd.pivot_table(data, values='grown', index='admin_id', columns='commodity')
DfMatrix=DfMatrix.fillna(0)
DfResetted = DfMatrix.reset_index().rename_axis(None, axis=1)
DfResetted.head()

For our first task, developing a commodity-commodity similarity matrix, we will drop the admin unit column and normalize each commodity column to unit length so that commodities can be compared.

In [None]:
dfCommodities = DfResetted.drop('admin_id', axis=1) 
dfCommoditiesNorm = dfCommodities/np.sqrt(np.square(dfCommodities).sum(axis=0))   
dfCommoditiesNorm.head()

Now we will compute the cosine similarity for all our commodities. To get the cosine similarity we take the dot product (matrix multiplication) of the transpose of the normalized dataset (54x3278) with the normalized dataset itself (3278x54), which results in a 54x54 matrix of the cosine similarities of all commodity combinations.
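
As a small illustration of why this works, the sketch below applies the same two steps to a made-up 4x3 binary matrix (the values are hypothetical, not taken from the dataset). Once every column has unit length, the dot product of two columns is exactly their cosine similarity.

```python
import numpy as np

# Toy binary matrix: 4 admin units (rows) x 3 commodities (columns).
X = np.array([
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 1],
    [1, 1, 0],
], dtype=float)

# Scale each column to unit length (L2 norm), mirroring the normalization step.
X_norm = X / np.sqrt(np.square(X).sum(axis=0))

# (3 x 4) times (4 x 3) gives a 3 x 3 commodity-by-commodity similarity matrix.
sim = X_norm.T @ X_norm

print(np.round(sim, 3))
```

The diagonal is 1 (every commodity is identical to itself), the matrix is symmetric, and commodities that are never grown in the same unit get a similarity of 0.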

In [None]:
ItemItemSim = dfCommoditiesNorm.transpose().dot(dfCommoditiesNorm) 
ItemItemSim.head(10)

Using the cosine similarity function from sklearn we get the same result.

In [None]:
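# label rows and columns with commodity names so the table lines up with ItemItemSim
df_cosine=pd.DataFrame(cosine_similarity(dfCommoditiesNorm.transpose(),dense_output=True),
                       index=dfCommoditiesNorm.columns, columns=dfCommoditiesNorm.columns)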
df_cosine.head(10)

Let's get the top 10 most similar commodities for each commodity. Do the recommendations make sense? From what you know, are these crops often grown together?
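
The ranking step can be sketched on a tiny made-up similarity matrix. The labels and values below are hypothetical. Note that every item is most similar to itself, so this sketch drops the item before keeping the best `k`; in the notebook's table the first entry will typically be the commodity itself.

```python
import numpy as np

names = np.array(["beef", "coffee", "cocoa", "pork"])  # hypothetical labels
# Hypothetical symmetric similarity matrix with 1.0 on the diagonal.
sim = np.array([
    [1.0, 0.1, 0.2, 0.8],
    [0.1, 1.0, 0.9, 0.0],
    [0.2, 0.9, 1.0, 0.1],
    [0.8, 0.0, 0.1, 1.0],
])

k = 2
top_k = {}
for i, name in enumerate(names):
    order = np.argsort(-sim[i])        # indices sorted from most to least similar
    order = order[order != i][:k]      # drop the item itself, keep the k best
    top_k[str(name)] = names[order].tolist()

print(top_k)
```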

In [None]:
ItemTop10 = pd.DataFrame(index=ItemItemSim.columns,columns=range(1,11))
for i in range(0,len(ItemItemSim.columns)):
    # sort column i from most to least similar and keep the 10 best labels
    ItemTop10.iloc[i,:10] = ItemItemSim.iloc[0:,i].sort_values(ascending=False)[:10].index
ItemTop10.head(10)

They seem to make sense! We're seeing tropical commodities most similar to other tropical commodities, different meats most similar to one another, and so on.
Write the top 10 commodity-commodity similarities to a file. You should find it in "Output" under `/kaggle/working/`.

In [None]:
# export to excel
ItemTop10.to_excel("commodity_commodity_top10.xlsx")
print("file written")

We can also do a gut check on our commodities with a Principal Component Analysis (PCA), which helps with dimensionality reduction. Let's look at how our commodities are distributed along the first two principal components.
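
For intuition about what PCA is doing, here is a minimal NumPy sketch on random, made-up data. It is an illustration of the idea (center the data, find the directions of greatest variance, project onto them), not the exact code path of the `PCA` class used in this notebook, whose output may differ in sign.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 6 items described by 4 features each.
X = rng.random((6, 4))

# 1. Center each feature so it has zero mean.
X_centered = X - X.mean(axis=0)

# 2. SVD of the centered data; the rows of Vt are the principal directions,
#    ordered from most to least variance explained.
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

# 3. Project onto the first two directions to get 2-D coordinates for plotting.
coords = X_centered @ Vt[:2].T

# Share of the total variance captured by each component.
explained = (S ** 2) / np.sum(S ** 2)
print(coords.shape, np.round(explained[:2], 3))
```

Checking the share of variance captured by the first two components is a useful habit: if it is small, the 2-D scatter plot hides much of the structure in the data.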

In [None]:
df_tranposed=dfCommoditiesNorm.transpose()
df_tranposed.index.name = 'label'
df_tranposed.reset_index(inplace=True)

pca_comm = PCA(n_components=2)
principalComponents_comm = pca_comm.fit_transform(df_tranposed.iloc[:,1:])
principal_comm_Df = pd.DataFrame(data = principalComponents_comm
             , columns = ['pc1', 'pc2'])
principal_comm_Df['label'] = df_tranposed['label']
principal_comm_Df.head()

In [None]:
# plot pc1 and pc2
plt.figure(figsize=(16,8))
# Normalize values for color scaling
pc1_norm = (principal_comm_Df["pc1"] - principal_comm_Df["pc1"].min()) / (principal_comm_Df["pc1"].max() - principal_comm_Df["pc1"].min())
pc2_norm = (principal_comm_Df["pc2"] - principal_comm_Df["pc2"].min()) / (principal_comm_Df["pc2"].max() - principal_comm_Df["pc2"].min())

# Create blended colors
colors = np.column_stack([pc1_norm, pc2_norm, np.ones_like(pc1_norm)])

sns.scatterplot(
    x="pc1", y="pc2",
    data=principal_comm_Df,
    s=200,  # Increase dot size
    color=colors  # Apply blended colors
)

# Add text labels
for i in range(principal_comm_Df.shape[0]):
    plt.text(x=principal_comm_Df.pc1[i] + 0.005, y=principal_comm_Df.pc2[i] + 0.005,
             s=principal_comm_Df.label[i], fontdict=dict(size=10))

plt.xlabel("Principal Component 1")  # x label
plt.ylabel("Principal Component 2")  # y label
plt.show()

We can see from this PCA that similar commodities are clustering together (meats, tropical crops such as coconuts/coffee/pineapples etc.).  This is also reflected in our commodity cosine similarity index we have generated. What other interesting clusters can you find in this PCA?

## Admin-Admin Similarities

Similar to above, let's generate the cosine similarities to see which admin units are most alike.

In [None]:
DfMatrix2 = pd.pivot_table(data, values='grown', columns='admin_id', index='commodity')
DfMatrix2 = DfMatrix2.fillna(0)
DfResetted2 = DfMatrix2.reset_index().rename_axis(None, axis=1)
dfAdmins = DfResetted2.drop('commodity', axis=1) 
dfAdminsNorm = dfAdmins/np.sqrt(np.square(dfAdmins).sum(axis=0))  
AdminAdminSim = dfAdminsNorm.transpose().dot(dfAdminsNorm) 
AdminTop10 = pd.DataFrame(index=AdminAdminSim.columns,columns=range(1,11))
for i in range(0,len(AdminAdminSim.columns)):
    # sort column i from most to least similar and keep the 10 best labels
    AdminTop10.iloc[i,:10] = AdminAdminSim.iloc[0:,i].sort_values(ascending=False)[:10].index
AdminTop10.head(10)

An initial look through the data shows that admin units within a country seem to be the most similar to each other. Digging into the table, are there other interesting occurrences? Do similar types of countries (e.g. geographically near, similar climates) appear?
Write the top 10 admin-admin similarities to a file.

In [None]:
AdminTop10.to_excel("admin_admin_top10.xlsx")

We can also visualize the admin unit embeddings.

In [None]:
df_tranposed2=dfAdminsNorm.transpose()
df_tranposed2.index.name = 'label'
df_tranposed2.reset_index(inplace=True)

pca_admin = PCA(n_components=2)
principalComponents_admin = pca_admin.fit_transform(df_tranposed2.iloc[:,1:])
principal_admin_Df = pd.DataFrame(data = principalComponents_admin
             , columns = ['pc1', 'pc2'])
principal_admin_Df['label'] = df_tranposed2['label']
principal_admin_Df.head(10)

In [None]:
# plot the principal components
plt.figure(figsize=(16,12))

# Normalize values for color scaling
pc1_norm = (principal_admin_Df["pc1"] - principal_admin_Df["pc1"].min()) / (principal_admin_Df["pc1"].max() - principal_admin_Df["pc1"].min())
pc2_norm = (principal_admin_Df["pc2"] - principal_admin_Df["pc2"].min()) / (principal_admin_Df["pc2"].max() - principal_admin_Df["pc2"].min())

# Create blended colors
colors = np.column_stack([pc1_norm, pc2_norm, np.ones_like(pc1_norm)])

sns.scatterplot(
    x="pc1", y="pc2",
    data=principal_admin_Df,
    s=100,  # Increase dot size
    color=colors  # Apply blended colors
)

# Add text labels
for i in range(principal_admin_Df.shape[0]):
    plt.text(x=principal_admin_Df.pc1[i] + 0.005, y=principal_admin_Df.pc2[i] + 0.005,
             s=principal_admin_Df.label[i], fontdict=dict(size=6))

plt.xlabel("Principal Component 1")  # x label
plt.ylabel("Principal Component 2")  # y label
plt.show()

## Admin-Commodity Recommendations
Now that we have a sense of which commodities are similar to each other and which regions are similar to each other, our last task is to build a recommender system that suggests which commodities a region could consider growing. For each admin unit and each commodity it does not yet grow, we take that commodity's nearest neighbours (its top-10 list from the commodity similarity matrix, minus the commodity itself), look up whether the admin unit already grows each of those neighbours, and compute a similarity-weighted score. In the end, we recommend the 5 commodities with the highest scores.
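
The scoring idea can be sketched for a single made-up admin unit. All labels, similarity values, and the `grown` vector below are hypothetical, and `get_score` is an illustrative helper: a similarity-weighted average of whether the unit grows each neighbour.

```python
import numpy as np

names = ["beef", "coffee", "cocoa", "pork"]  # hypothetical labels
# Hypothetical commodity-commodity similarity matrix.
sim = np.array([
    [1.0, 0.1, 0.2, 0.8],
    [0.1, 1.0, 0.9, 0.0],
    [0.2, 0.9, 1.0, 0.1],
    [0.8, 0.0, 0.1, 1.0],
])
# Hypothetical admin unit that currently grows beef and coffee only.
grown = np.array([1, 1, 0, 0])

def get_score(history, similarities):
    """Similarity-weighted average of the neighbours' grown flags (0 or 1)."""
    return np.sum(history * similarities) / np.sum(similarities)

k = 2  # number of neighbours to consult per candidate commodity
scores = {}
for j, name in enumerate(names):
    if grown[j] == 1:
        continue  # already grown, so it cannot be a "new" recommendation
    order = np.argsort(-sim[j])
    neighbours = order[order != j][:k]  # k most similar commodities, excluding itself
    scores[name] = get_score(grown[neighbours], sim[j, neighbours])

# Rank the candidates from highest to lowest score.
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked, scores)
```

A score close to 1 means the unit already grows most of the commodities that resemble the candidate, weighted by how strongly they resemble it; a score close to 0 means it grows few of them.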

In [None]:
df = DfResetted.fillna(0)
# empty score table with the same shape as df; the first column keeps the admin ids
AdminItemSimilarity = pd.DataFrame(index=df.index,columns=df.columns)
AdminItemSimilarity.iloc[:,:1] = df.iloc[:,:1] 
AdminItemSimilarity.head(10)

Now calculate similarity scores through the functions we will create below.

In [None]:
# this might take a few minutes to run...
def getScore(history, similarities):
    # similarity-weighted average of the neighbours' grown flags (0 or 1)
    return sum(history*similarities)/sum(similarities)

for i in range(0,len(AdminItemSimilarity.index)): 
    for j in range(1,len(AdminItemSimilarity.columns)): 
        user = AdminItemSimilarity.index[i] 
        product = AdminItemSimilarity.columns[j]
        
        # use .iloc[i, j] rather than chained .loc[i][j], which can write to a temporary copy
        if df.iloc[i, j] == 1:
            # already grown here, so score it 0 and never recommend it as "new"
            AdminItemSimilarity.iloc[i, j] = 0 
        else: 
            # positions 1-9 skip position 0, which is the commodity itself
            ItemTop = ItemTop10.loc[product].iloc[1:10] 
            ItemTopSimilarity = ItemItemSim.loc[product].sort_values(ascending=False).iloc[1:10] 
            AdminGrown = dfCommodities.loc[user,ItemTop] 
            
            AdminItemSimilarity.iloc[i, j] = getScore(AdminGrown,ItemTopSimilarity)

print('done!')

Let's take a peek at our newly generated recommendations

In [None]:
AdminItemSimilarity.head(10)

Let's convert this into the top 5 recommendations of new commodities for each admin unit and take a look

In [None]:
AdminItemRecommend = pd.DataFrame(index=AdminItemSimilarity.index, columns=['Admin_id','1','2','3','4','5']) 
AdminItemRecommend.iloc[0:,0] = AdminItemSimilarity.iloc[:,0]
for i in range(0,len(AdminItemSimilarity.index)):
    # commodity names of the 5 highest scores in row i
    AdminItemRecommend.iloc[i,1:] = AdminItemSimilarity.iloc[i,1:].sort_values(ascending=False).iloc[0:5].index
    
AdminItemRecommend.head(10)

This looks interesting. We can see that for these Afghan provinces the recommendations mostly make sense. They include pork, which is culturally not consumed there; that would explain why it's not a commodity grown in the country. Looking through the recommendations, what else stands out?

In [None]:
AdminItemRecommend.to_excel("admin_comm_recommend.xlsx")
print('exported')

## Acknowledgement
Thank you very much to [Ginni Braich](https://www.linkedin.com/in/ginni-braich-41672337/?originalSubdomain=ca) of the [Better Planet Lab](https://betterplanetlab.com/about) for conceptualizing and putting together much of this notebook.

## Assignment

1. Imagine that you are in the Ministry of Agriculture of Sierra Leone (SLE) and have been tasked with diversifying your country's agricultural industry. What crops are most likely to be successful in your climate based on the results above? What other data or information would you ask for to increase your confidence in the recommendations? Are these results actionable?

2. The above code uses cosine similarity to measure the "distance" between crops or administrative units, but there are quite a few other distance measures. Skim through [this Medium article](https://medium.com/@jodancker/a-brief-introduction-to-distance-measures-ac89cbd2298) on other mathematical distances, pick one that seems appropriate, and rerun the analysis using this distance measure instead. Does it change your results for question #1? Why or why not?

3. The similarity matrix of crops is a form of "embeddings". Here, we used them to recommend crops to given regions of the world based on what other similar regions are growing. What is another task you could use these "embeddings" for?

## Bonus Assignment
1. The above example uses memory-based collaborative filtering. In the video you learned about matrix factorization optimized with stochastic gradient descent. Re-implement the example above with this alternative model. Do the results differ? Why might you choose this over the memory-based version?

## Deliverables
Write out your answers to the questions above. Make this notebook public (Save Version > Share > Public > Copy link) and share the link on Discord in the #notebook-submissions channel. 