# Categorising Tags into Segments

This notebook details how merchant tags are grouped into segments. When new tags appear in future, add them to the appropriate segment below and rerun the notebook to generate a new `segments.json` file.

In [None]:
import pandas as pd
import numpy as np
import json
import os  # needed for the working-directory check below
from pyspark.sql import SparkSession, DataFrame

# Set working directory to root directory
path = os.path.normpath(os.getcwd())
if path.split(os.sep)[-1] != 'generic-buy-now-pay-later-project-group-19':
    os.chdir("..")

spark = (
        SparkSession.builder.appName("MAST30034 Project 2")
        .config("spark.sql.repl.eagerEval.enabled", True) 
        .config("spark.sql.parquet.cacheMetadata", "true")
        .config("spark.sql.session.timeZone", "Etc/UTC")
        .config("spark.driver.memory", "4g")
        .getOrCreate()
    )

# Tags from Merchant Profile Data

In [None]:
merchants_with_tags = spark.read.parquet("./data/curated/merchants_with_tags")

In [None]:
# Show all tags from merchants profile data
tags = merchants_with_tags.columns[4:]
tags

In [None]:
# Find total number of tags
print(f"Total number of tags: {len(merchants_with_tags.columns[4:])}")

# Tag Segmentation
## <a name="definitions"></a>Defining Segments

Some industries experience **high growth during an economic boom** but **suffer heavily during a recession**. We call these **recession-vulnerable industries**; examples include the retail, hospitality and leisure industries. By contrast, the tech and repair industries are generally **recession-proof**, as they have become essential to daily life. As part of our diversification strategy, we segment merchants into the categories listed below.

Credits: [CNBC: Industries Hit Hardest by Recession!](https://www.cnbc.com/2012/06/01/Industries-Hit-Hardest-by-the-Recession.html), [Forbes: What Industries Do Well In An L-Shaped Recession?](https://www.forbes.com/sites/qai/2022/08/12/what-industries-do-well-in-an-l-shaped-recession/?sh=2a12dd68404e)

<br>
<br>

- ### Recession-vulnerable industries:
    - **Luxury Goods** - High value per order, non-service industries such as jewelry and art. This category should see **low returning-customer** and VIP-customer rates, as well as relatively **high variance in customer spending**.
    - **Leisure Goods** - Hobbies, toys, books and outdoor activity essentials. Generally **high variance in customer spending**.
    - **Home Furnishings** - The home furnishing industry. This industry should see **high value per order** but **moderate to low order quantities**.
    - **Gifts & Souvenirs** - Gifts, flowers, souvenirs, etc. Generally **low value per order** but **high order quantities**.
    - **Clothing & Accessories** - The fashion and accessories industry.
    - **Office Equipment** - Office supplies, stationery, etc.
<br>
<br>

- ### Recession-proof industries:
    - **Repair Services** - General repair services. This industry should see **high value per order** and generally **high order quantities** as well.
    - **Technology & Telecommunication** - Electronic devices, telecommunication, systems and software-related industries. Generally **high daily revenue** and **high order quantities** owing to the size of the market.
    - **Motor & Bicycles** - Motor and bicycle supplies and parts.
    - **Health & Wellness** - Health-related services.
   
<br>

In [None]:
# Industry to tags dictionary (RV = Recession-Vulnerable, RP = Recession-Proof)
rv_ind = {
    'luxury goods': ['art_dealer_gallery', 'antique_shop_sale', 'jewelry', 'silverware_shop'],
    'leisure goods and services': ['artist_supply_craft_shop', 'book', 'digital_good_book', 'hobby',
                                   'music_shop_musical_instrument', 'newspaper',
                                   'novelty', 'periodical', 'piano', 'sheet_music', 'toy_game_shop',
                                   'forest_supply', 'movie', 'music', 'tent_owning_shop'],
    'home furnishings': ['furniture', 'home_furnishing_equipment_shop', 'nursery_stock',
                         'including_nursery', 'lawn_garden_supply_outlet'],
    'gifts souvenirs': ['card', 'flower', 'forest_supply', 'gift', 'souvenir_shop'],
    'clothing and accessories': ['watch', 'shoe_shop'],
    'office equipments': ['office_supply_printing_writing_paper', 'stationery'],
}

rp_ind = {
    'repair services': ['repair', 'restoration_service', 'jewelry_repair_shop'],
    'tech and telecom': ['computer', 'computer_peripheral_equipment', 'computer_programming',
                         'data_processing', 'integrated_system_design_service',
                         'pay_television_radio_service',
                         'satellite', 'telecom', 'software', 'cable'],
    'motor and bicycles': ['bicycle_shop_sale_service', 'motor_vehicle_supply_new_part'],
    'health and wellness': ['optical_good', 'optician', 'eyeglass', 'health_beauty_spa'],
}

ind = {**rv_ind, **rp_ind}
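
The dictionaries above map each segment to its tags. When working with an individual merchant tag, the reverse direction (tag to segment) is often more convenient. The following is a minimal sketch of such a lookup, run on a small excerpt of the dictionaries; the helper name `build_tag_lookup` and the group labels are assumptions made for illustration and are not part of the project code.

```python
# Small excerpt of the segment dictionaries, for illustration only
rv_excerpt = {
    'luxury goods': ['jewelry', 'silverware_shop'],
    'office equipments': ['stationery'],
}
rp_excerpt = {
    'repair services': ['repair', 'restoration_service'],
}


def build_tag_lookup(rv, rp):
    """Hypothetical helper: map each tag to a (segment, group) pair."""
    lookup = {}
    groups = (('recession-vulnerable', rv), ('recession-proof', rp))
    for group, segments in groups:
        for segment, segment_tags in segments.items():
            for tag in segment_tags:
                # setdefault keeps the first segment seen if a tag is listed twice
                lookup.setdefault(tag, (segment, group))
    return lookup


tag_lookup = build_tag_lookup(rv_excerpt, rp_excerpt)
print(tag_lookup['jewelry'])
# Tags that are not in the lookup would fall under miscellaneous
print(tag_lookup.get('unknown_tag', ('miscellaneous', None)))
```

Because a tag listed under two segments resolves to whichever segment is seen first, this lookup is only unambiguous when every tag belongs to exactly one segment.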

In [None]:
# Any tag not assigned to a segment above falls under miscellaneous
cat = set()
for segments in (rv_ind, rp_ind):
    for segment_tags in segments.values():
        cat.update(segment_tags)

# Sort so the generated JSON has a stable tag order across runs
ind['miscellaneous'] = sorted(set(tags) - cat)
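
Since the segments are maintained by hand, it can help to check them before saving. In the dictionaries above, `forest_supply` is listed under both `'leisure goods and services'` and `'gifts souvenirs'`, and a misspelled tag would silently match no column. The sketch below shows one possible check on a small excerpt; `find_shared_tags`, `find_unknown_tags` and the sample `available_tags` list are hypothetical names introduced here, not part of the project.

```python
from collections import defaultdict

# Excerpt of the segment dictionary, for illustration only
ind_excerpt = {
    'leisure goods and services': ['hobby', 'forest_supply', 'movie'],
    'gifts souvenirs': ['card', 'flower', 'forest_supply'],
    'repair services': ['repair'],
}
# Hypothetical stand-in for the tag columns read from the merchant data
available_tags = ['hobby', 'forest_supply', 'movie', 'card', 'flower', 'tent']


def find_shared_tags(segments):
    """Return tags that are listed under more than one segment."""
    owners = defaultdict(list)
    for segment, segment_tags in segments.items():
        for tag in segment_tags:
            owners[tag].append(segment)
    return {tag: segs for tag, segs in owners.items() if len(segs) > 1}


def find_unknown_tags(segments, known_tags):
    """Return assigned tags that do not exist among the known tags."""
    assigned = {tag for segment_tags in segments.values() for tag in segment_tags}
    return sorted(assigned - set(known_tags))


shared = find_shared_tags(ind_excerpt)
unknown = find_unknown_tags(ind_excerpt, available_tags)
print(shared)
print(unknown)
```

Whether a shared tag is intentional is a modelling decision; the check only surfaces it so the choice is made deliberately.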

# Save to segments.json

In [None]:
with open('./ranking/segments.json', 'w') as f:
    json.dump(ind, f)
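
When a new tag appears later, the usual route is to add it to the relevant list above and rerun the notebook. As an alternative, the saved mapping could be updated directly. The following is a minimal sketch under that assumption: `add_tag_to_segment` and `hypothetical_new_tag` are illustrative names, and a temporary directory stands in for `./ranking/` so that the example does not overwrite the real file.

```python
import json
import os
import tempfile


def add_tag_to_segment(segments, tag, segment):
    """Hypothetical helper: return a copy of `segments` with `tag` in `segment`."""
    # Copy the lists so the input mapping is left unchanged
    updated = {name: list(segment_tags) for name, segment_tags in segments.items()}
    # A newly categorised tag should no longer count as miscellaneous
    if tag in updated.get('miscellaneous', []):
        updated['miscellaneous'].remove(tag)
    updated.setdefault(segment, [])
    if tag not in updated[segment]:
        updated[segment].append(tag)
    return updated


segments = {'luxury goods': ['jewelry'], 'miscellaneous': ['hypothetical_new_tag']}
updated = add_tag_to_segment(segments, 'hypothetical_new_tag', 'luxury goods')

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, 'segments.json')
    with open(out_path, 'w') as f:
        json.dump(updated, f, indent=2)
    # Read the file back to confirm the mapping survives the round trip
    with open(out_path) as f:
        reloaded = json.load(f)

print(reloaded)
```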