# MATH 3280 - Project 1

You are provided data on food consumption in a variety of countries.
* [food_consumption.csv](https://raw.githubusercontent.com/drolsonmi/math3280/master/Projects/food_consumption.csv)

The goal of this project is to use Locality Sensitive Hashing (LSH) to find similarities between the different food categories.
1. Import the data and use `pd.pivot_table()` to create a table whose indices are the country, columns are the food category, and values are the consumption of each category
   * Normalize each column
   * Turn values into 1's if normalized consumption $\ge 0.50$ and 0's if normalized consumption $<0.50$
2. Create a function that will return the result of a minhash function
3. Create a signature matrix by running the minhash function 250 times
4. Split the signature matrix into bands of 3 rows each
5. Take the columns with similar signatures and calculate the Jaccard similarity of those columns
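
Steps 2 through 5 can be tried on a tiny made-up table before running them on the real data. What follows is a minimal sketch, not the required solution: the toy table and the helper names `minhash`, `signature_matrix`, `candidate_pairs`, and `jaccard` are hypothetical, and the permutation-based minhash is only one possible way to implement step 2.

```python
import random
from itertools import combinations

import pandas as pd

# Hypothetical toy data: rows are countries, columns are food categories
# (1 = high normalized consumption, 0 = low)
toy = pd.DataFrame(
    {"beef": [1, 0, 1, 0, 1, 0],
     "pork": [1, 0, 1, 0, 1, 0],   # identical to beef
     "rice": [0, 1, 0, 1, 0, 0]},  # shares no rows with beef
    index=["A", "B", "C", "D", "E", "F"],
)

def minhash(column, row_order):
    """Return the first row label in row_order whose entry in column is 1."""
    for row in row_order:
        if column[row] == 1:
            return row
    return None  # the column has no 1s

def signature_matrix(table, n_hashes, seed=0):
    """Build one signature row per random permutation of the table's rows."""
    rng = random.Random(seed)
    rows = list(table.index)
    signatures = []
    for _ in range(n_hashes):
        order = rows.copy()
        rng.shuffle(order)
        signatures.append({col: minhash(table[col], order) for col in table.columns})
    return pd.DataFrame(signatures)

def candidate_pairs(sig, rows_per_band=3):
    """Return column pairs whose signatures agree on every row of at least one band."""
    candidates = set()
    for start in range(0, len(sig), rows_per_band):
        band = sig.iloc[start:start + rows_per_band]
        for a, b in combinations(sig.columns, 2):
            if (band[a] == band[b]).all():
                candidates.add((a, b))
    return candidates

def jaccard(table, a, b):
    """Size of the intersection divided by size of the union, on the binary columns."""
    both = ((table[a] == 1) & (table[b] == 1)).sum()
    either = ((table[a] == 1) | (table[b] == 1)).sum()
    return both / either if either else 0.0

sig = signature_matrix(toy, n_hashes=12)
pairs = candidate_pairs(sig)
similarities = {pair: jaccard(toy, *pair) for pair in pairs}
```

In this sketch the bands are used only to pick candidate pairs; the Jaccard similarity itself is computed from the original 0/1 columns rather than from the signatures.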

Questions:
1. How many food categories are we working with? How many countries?
2. What food category pairs have the highest Jaccard similarity?
3. What countries do you see that use these foods?
4. Is there any connection between these countries?
5. For this dataset, we could have done a similarity test without LSH. When would LSH become a lot more useful?
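
For question 5, it can help to see how banding trades accuracy for speed. With $b$ bands of $r$ rows each, two columns whose Jaccard similarity is $s$ become a candidate pair with probability $1 - (1 - s^r)^b$, which is the standard result for minhash banding. The sketch below evaluates that expression for $r = 3$ and $b = 83$ (the number of full 3-row bands in 250 signature rows) and counts the pairs an exhaustive comparison would have to examine. The function names are hypothetical.

```python
from math import comb

def candidate_probability(s, rows_per_band, n_bands):
    """Probability that two columns with Jaccard similarity s agree on
    every row of at least one band."""
    return 1 - (1 - s ** rows_per_band) ** n_bands

def exhaustive_comparisons(n_columns):
    """Number of column pairs a brute-force similarity test must examine."""
    return comb(n_columns, 2)

r, b = 3, 250 // 3  # 83 full bands of 3 rows from 250 signature rows

for s in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(f"s = {s:.1f}: P(candidate) = {candidate_probability(s, r, b):.3f}")

for n in (11, 1_000, 1_000_000):
    print(f"{n:>9} columns: {exhaustive_comparisons(n):,} pairs to compare")
```

With 11 food categories there are only 55 pairs, so a brute-force comparison is cheap. The pair count grows quadratically with the number of columns, and that growth is what LSH is meant to avoid.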

In [34]:
import pandas as pd
import numpy as np
from random import shuffle

# Load the data, pivot it (countries x food categories), and min-max normalize each column
url = "https://raw.githubusercontent.com/drolsonmi/math3280/master/Projects/food_consumption.csv"
df = pd.read_csv(url)
pivot_table = pd.pivot_table(df, values='consumption', index='country', columns='food_category')
normalized_table = (pivot_table - pivot_table.min()) / (pivot_table.max() - pivot_table.min())

# Binarize: 1 if normalized consumption >= 0.5, otherwise 0
binary_table = (normalized_table >= 0.5).astype(int)

# Define the number of permutations
n_permutations = 250

# Initialize MinHash signatures
minhash_signatures = {food_category: ['_' for _ in range(n_permutations)] for food_category in binary_table.columns}

# Define a list of countries
countries = binary_table.index.tolist()

for k in range(n_permutations):
    # Each permutation is a random ordering of the countries (rows)
    rand_countries = countries.copy()
    shuffle(rand_countries)
    for food_category in binary_table.columns:
        # The minhash value is the first country in the shuffled order whose entry is 1
        for country in rand_countries:
            if binary_table.loc[country, food_category] == 1:
                minhash_signatures[food_category][k] = country
                break  # stop at the first 1; later countries cannot change the minhash

# Build the signature matrix: one row per permutation, one column per food category
signature_matrix = pd.DataFrame(minhash_signatures)

# Split the signature matrix into bands of 3 rows (250 rows gives 83 full bands plus a final 1-row band)
grouped_signatures = [signature_matrix.iloc[i:i+3] for i in range(0, signature_matrix.shape[0], 3)]

# Score each pair of columns by how often their signatures agree within a band
similar_columns = {}

for group in grouped_signatures:
    for col1 in group.columns:
        for col2 in group.columns:
            if col1 != col2:
                pair = (col1, col2)
                # A pair is scored only the first time it is seen, i.e. in the first band
                if pair not in similar_columns:
                    # Fraction of rows in this band where the two signatures agree
                    # (an estimate from the signatures, not from the binary columns)
                    intersection = sum(group[col1] == group[col2])  # Rows where the signatures match
                    union = len(group)  # Number of rows in the band
                    jaccard_similarity = intersection / union
                    similar_columns[pair] = jaccard_similarity

# Display the Jaccard similarities
for pair, similarity in similar_columns.items():
    print(f"Jaccard similarity between columns '{pair[0]}' and '{pair[1]}': {similarity}")

# Find the pair with the highest similarity (max returns a single pair, even if several tie)
highest_similarity_pairs = max(similar_columns, key=lambda pair: similar_columns[pair])

# Display the food category pairs with the highest Jaccard similarity
print(f"The food category pairs with the highest Jaccard similarity are '{highest_similarity_pairs[0]}' and '{highest_similarity_pairs[1]}' with a similarity of {similar_columns[highest_similarity_pairs]}")




Jaccard similarity between columns 'beef' and 'dairy': 0.0
Jaccard similarity between columns 'beef' and 'eggs': 0.0
Jaccard similarity between columns 'beef' and 'fish': 0.0
Jaccard similarity between columns 'beef' and 'lamb_goat': 0.0
Jaccard similarity between columns 'beef' and 'nuts': 0.0
Jaccard similarity between columns 'beef' and 'pork': 0.0
Jaccard similarity between columns 'beef' and 'poultry': 0.0
Jaccard similarity between columns 'beef' and 'rice': 0.0
Jaccard similarity between columns 'beef' and 'soybeans': 0.0
Jaccard similarity between columns 'beef' and 'wheat': 0.0
Jaccard similarity between columns 'dairy' and 'beef': 0.0
Jaccard similarity between columns 'dairy' and 'eggs': 0.0
Jaccard similarity between columns 'dairy' and 'fish': 0.0
Jaccard similarity between columns 'dairy' and 'lamb_goat': 0.0
Jaccard similarity between columns 'dairy' and 'nuts': 0.0
Jaccard similarity between columns 'dairy' and 'pork': 0.0
Jaccard similarity between columns 'dairy' and 

In [None]:
# Questions:

# How many food categories are we working with? How many countries?
    ### We are working with 11 food categories and 130 countries.

# What food category pairs have the highest Jaccard similarity?
    ### The food category pairs with the highest Jaccard similarity are 'eggs' and 'poultry' with a similarity of 0.6666666666666666

# What countries do you see that use these foods?
    ### Countries using the food categories 'beef' and 'eggs' with the highest Jaccard similarity: ['Argentina', 'Bermuda', 'Canada', 'Denmark', 'Israel', 'Luxembourg', 'USA', 'Uruguay']

# Is there any connection between these countries?
    ### Most of them (Argentina, Uruguay, Canada, the USA, and Bermuda) are in South or North America, which suggests a shared regional food tradition. Denmark, Luxembourg, and Israel are the exceptions.

# For this dataset, we could have done a similarity test without LSH. When would LSH become a lot more useful?
    ### LSH becomes much more useful on larger datasets, where comparing every pair of columns is too expensive. It works best when we only need to find pairs that are very similar.

In [35]:
# We keep this part to show our work in progress
# import pandas as pd
# import numpy as np
# from random import shuffle

# # Load and preprocess data (you can reuse your code for this)
# url = "https://raw.githubusercontent.com/drolsonmi/math3280/master/Projects/food_consumption.csv"
# df = pd.read_csv(url)
# pivot_table = pd.pivot_table(df, values='consumption', index='country', columns='food_category')
# normalized_table = (pivot_table - pivot_table.min()) / (pivot_table.max() - pivot_table.min())

# # Hot encoding: Convert consumption data to binary representation
# binary_table = normalized_table.applymap(lambda x: 1 if x >= 0.5 else 0)

# # Define the number of permutations
# n_permutations = 250

# # Initialize MinHash signatures
# minhash_signatures = {food_category: ['_' for _ in range(n_permutations)] for food_category in binary_table.columns}

# # Define a list of countries
# countries = binary_table.index.tolist()
# for k in range(n_permutations):
#     rand_countries = countries.copy()
#     shuffle(rand_countries)
#     for food_category in binary_table.columns:
#         for i in range(len(countries)):
#             idx = countries.index(rand_countries[i])
#             if (minhash_signatures[food_category][k] == '_') and (binary_table.loc[countries[idx], food_category] == 1):
#                 minhash_signatures[food_category][k] = rand_countries[i]
#         if '_' not in minhash_signatures[food_category]:
#             break

# # Build the signature matrix: one row per permutation, one column per food category
# signature_matrix = pd.DataFrame(minhash_signatures)

# # Group every 3 rows into groups
# grouped_signatures = [signature_matrix.iloc[i:i+3] for i in range(0, signature_matrix.shape[0], 3)]

# # Display the grouped signatures
# for i, group in enumerate(grouped_signatures):
#     print(f"Group {i + 1}:")
#     print(group)

In [36]:
# #  Step 1: Create a pivot table
# pivot_table = pd.pivot_table(food, index=['country'],
#                              columns=['food_category'],
#                              values=['consumption'])
# print(pivot_table)

In [10]:
# print(pivot_table.min()), print(pivot_table.max())

In [27]:
# from pandas.core.reshape.pivot import pivot
# # normalize the data
# normalized_table = (pivot_table - pivot_table.min())/(pivot_table.max() - pivot_table.min())
# print(normalized_table)