<a href="https://colab.research.google.com/github/FelipeAce96/Cleaner-Restaurant-Names/blob/main/DEMO_CLEANER_RESTAURANT_NAMES.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Cleaner-Restaurant-Names 🍔
Fine-tuned T5 model (Google) that cleans restaurant names by automatically removing addresses and stop words [Spanish]

## RESULTS:

    REGULAR NAME: KFC 232 PIN SAN ANTONI.
    CLEANED NAME: KFC

    REGULAR NAME: PIZZAS Y PASTAS BY PPC 39 GIRARDOT.
    CLEANED NAME: PIZZAS Y PASTAS BY PPC

    REGULAR NAME: BURGER KING EUCLIDES MIRAGAIA 22518.
    CLEANED NAME: BURGER KING

    REGULAR NAME: DUNKIN DONUTS COMAS.
    CLEANED NAME: DUNKIN DONUTS

## INSTALL DEPENDENCIES AND DOWNLOAD THE MODEL

In [1]:
!pip install accelerate transformers sentencepiece

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting accelerate
  Downloading accelerate-0.19.0-py3-none-any.whl (219 kB)
Collecting transformers
  Downloading transformers-4.29.2-py3-none-any.whl (7.1 MB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.99-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)


In [2]:
import re
import unicodedata


def clean_name(name, stopwords=None):
    """Normalize a raw restaurant name before building the model prompt."""
    stopwords = stopwords or []
    # Strip accents and other non-ASCII characters (e.g. "ñ" becomes "n")
    name = unicodedata.normalize('NFD', name).encode('ascii', 'ignore').decode('ascii')
    name = name.upper()
    name = name.replace("'S", 'S')  # e.g. DOMINO'S -> DOMINOS
    name = name.replace('-', ' ')
    name = name.replace("'", ' ')
    name = re.sub(r'[^A-Za-z0-9\s]+', '', name)  # remove special characters
    # split() + join() collapses any run of whitespace and trims both ends
    words = name.split()
    # The name is upper-cased above, so compare stop words case-insensitively
    words = [w for w in words if w.lower() not in stopwords]
    return " ".join(words)


# Optional Spanish stop words; pass them as clean_name(name, stopwords)
stopwords = ['la', 'de', 'el', 'del', 'las', 'los']


def create_prompt(cleaned_name):
    # Prompt template used throughout this demo (keep the line breaks as-is)
    return f"""
REGULAR NAME: {cleaned_name}.
CLEANED NAME:
"""
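
As a quick sanity check of the preprocessing step, the sketch below runs a condensed re-implementation of the normalization and prompt template on one sample name. The names `normalize_name` and `build_prompt` are hypothetical stand-ins for `clean_name` (without stop-word filtering) and `create_prompt`, and the sample input is an assumption chosen to exercise the accent, apostrophe, and punctuation handling.

```python
import re
import unicodedata


def normalize_name(raw):
    """Hypothetical condensed stand-in for clean_name (no stop-word filtering)."""
    # Drop accents and other non-ASCII characters
    name = unicodedata.normalize("NFD", raw).encode("ascii", "ignore").decode("ascii")
    # Upper-case, then handle possessives, hyphens, and stray apostrophes
    name = name.upper().replace("'S", "S").replace("-", " ").replace("'", " ")
    # Remove remaining special characters and collapse whitespace
    name = re.sub(r"[^A-Za-z0-9\s]+", "", name)
    return " ".join(name.split())


def build_prompt(cleaned):
    """Hypothetical stand-in for create_prompt: same template, same line breaks."""
    return f"\nREGULAR NAME: {cleaned}.\nCLEANED NAME:\n"


raw = "Centro Comercial Gran Estación, Domino's Pizza"  # assumed sample input
cleaned = normalize_name(raw)
prompt = build_prompt(cleaned)
print(repr(cleaned))
print(repr(prompt))
```

The leading and trailing newlines in the prompt are part of the template, which is why the demo strips the prompt only when printing the final result.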

In [3]:
# LOAD OUR MODEL
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Maximum length in tokens, used for both the tokenized prompt and the generated name
Tx = 30

tokenizer = AutoTokenizer.from_pretrained("felipeace96/cleaner-restaurant-names")
model = AutoModelForSeq2SeqLM.from_pretrained("felipeace96/cleaner-restaurant-names")

Downloading (…)okenizer_config.json:   0%|          | 0.00/2.48k [00:00<?, ?B/s]

Downloading spiece.model:   0%|          | 0.00/792k [00:00<?, ?B/s]

Downloading (…)/main/tokenizer.json:   0%|          | 0.00/2.42M [00:00<?, ?B/s]

Downloading (…)cial_tokens_map.json:   0%|          | 0.00/2.20k [00:00<?, ?B/s]

Downloading (…)lve/main/config.json:   0%|          | 0.00/1.56k [00:00<?, ?B/s]

Downloading pytorch_model.bin:   0%|          | 0.00/892M [00:00<?, ?B/s]

Downloading (…)neration_config.json:   0%|          | 0.00/142 [00:00<?, ?B/s]

In [4]:
# TO GPU
import torch
device = "cuda:0" if torch.cuda.is_available() else "cpu"
device

'cuda:0'

In [5]:
model = model.to(device)

## USE THE MODEL

In [8]:
import gc
RESTAURANT_NAME = "CENTRO COMERCIAL GRAN ESTACION, DOMINOS PIZZA" #@param{type: 'string'}
PROMPT = create_prompt(clean_name(RESTAURANT_NAME))

# Generate using the loaded model
sentences = [PROMPT]

# Tokenize the prompt, padding/truncating it to Tx tokens
inputs = tokenizer(sentences,
                   truncation=True,
                   return_attention_mask=True,
                   add_special_tokens=True,
                   max_length=Tx,
                   padding='max_length',
                   return_tensors="pt",
                   ).to(device)

# Deterministic beam search (8 beams, no sampling)
output = model.generate(**inputs,
                        num_beams=8,
                        do_sample=False,
                        min_length=2,
                        max_length=Tx,
                        early_stopping=True)

# Free GPU memory
gc.collect()
torch.cuda.empty_cache()

# Decode and print each prompt followed by its cleaned name
decoded_output = tokenizer.batch_decode(output, skip_special_tokens=True)
for sentence, cleaned in zip(sentences, decoded_output):
    print(sentence.strip() + ' ' + cleaned)

REGULAR NAME: CENTRO COMERCIAL GRAN ESTACION DOMINOS PIZZA.
CLEANED NAME: DOMINOS PIZZA
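
To clean many names at once, the single-name cell above could be wrapped in a small batching helper. The following is a sketch rather than part of the original notebook: `clean_batch`, `make_prompt`, `batch_size`, and `max_len` are hypothetical names, and it assumes `tokenizer` and `model` were loaded as shown earlier and that each input name has already been passed through `clean_name`. The generation settings simply mirror the ones used in the demo cell.

```python
def make_prompt(cleaned_name):
    # Same prompt template as create_prompt in the demo (hypothetical copy)
    return f"\nREGULAR NAME: {cleaned_name}.\nCLEANED NAME:\n"


def clean_batch(names, tokenizer, model, device="cpu", batch_size=16, max_len=30):
    """Return one model-cleaned name per input name, preserving input order."""
    results = []
    for start in range(0, len(names), batch_size):
        # Build the prompts for this slice of names
        prompts = [make_prompt(n) for n in names[start:start + batch_size]]
        inputs = tokenizer(
            prompts,
            truncation=True,
            max_length=max_len,
            padding="max_length",
            return_tensors="pt",
        ).to(device)
        # Beam search with the same settings as the single-name demo
        output_ids = model.generate(
            **inputs,
            num_beams=8,
            do_sample=False,
            min_length=2,
            max_length=max_len,
            early_stopping=True,
        )
        results.extend(tokenizer.batch_decode(output_ids, skip_special_tokens=True))
    return results
```

A call might look like `clean_batch([clean_name(n) for n in raw_names], tokenizer, model, device=device)`, where `raw_names` is any list of strings. Processing in slices keeps memory use bounded; the default `batch_size` of 16 is an arbitrary starting point, not a tuned value.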
