<a href="https://colab.research.google.com/github/KadyrbekNurgali/tasks/blob/main/restaurant.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import pandas as pd

In [None]:
df = pd.read_excel('/content/drive/MyDrive/Colab Notebooks/data/menu_data.xlsx', engine='openpyxl')
len(df)

In [None]:
df.columns

Index(['restaurant', 'tags', 'url', 'location', 'rating', 'nr_reviews',
       'category', 'product', 'description', 'price', 'flag_has_picture',
       'country', 'postal_code', 'postal_code_suffix', 'nr_items'],
      dtype='object')

**1. For each data completion sub-task below, code it or describe in text a process for achieving the outlined goals (we have labelled some data in a small sample, as an example):**
a. Fill the “country” column
b. Fill the “postal_code” and “postal_code_suffix” columns
c. Fill the “nr_items” column - this is a comparative metric of how many items are comprised in the listed product

A regular expression alone could handle this, but it would amount to hard-coding, because the data set contains many exceptional cases. To normalize the addresses and extract as much reliable data as possible, I used the **geopy** library. Advantages: the data is more accurate, and missing metadata can be filled in. Disadvantages: every lookup is a web request, so the process takes quite a long time, and the code itself uses brute force (this can also be optimized).
*However, since the corrected data can be reused, the run time is not critical here.*
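
The cost of repeated web requests can be reduced by remembering every answer and pausing between real requests. The following is only a minimal sketch of that idea: `make_cached_geocoder` and `fake_geocode` are hypothetical names that do not appear in the notebook, and the stand-in geocoder replaces the real web call so the example runs offline. geopy also ships `geopy.extra.rate_limiter.RateLimiter`, which could be used for the throttling part.

```python
import time
from typing import Callable, Dict, Optional

def make_cached_geocoder(geocode_fn: Callable[[str], Optional[str]],
                         min_delay_seconds: float = 1.0) -> Callable[[str], Optional[str]]:
    """Wrap geocode_fn so each distinct address is requested at most once."""
    cache: Dict[str, Optional[str]] = {}
    last_call: Optional[float] = None  # time of the previous real request

    def cached(address: str) -> Optional[str]:
        nonlocal last_call
        if address in cache:
            return cache[address]  # repeated address: no new request
        if last_call is not None:
            wait = min_delay_seconds - (time.monotonic() - last_call)
            if wait > 0:
                time.sleep(wait)  # throttle real requests
        last_call = time.monotonic()
        try:
            result = geocode_fn(address)
        except Exception:
            result = None  # treat a failed lookup as "not found"
        cache[address] = result
        return result

    return cached

# Stand-in for a real geocoder (hypothetical data), so the sketch runs offline.
calls = []
def fake_geocode(address: str) -> Optional[str]:
    calls.append(address)
    return {'Lisboa 1100': 'Lisboa, Portugal'}.get(address)

geocode = make_cached_geocoder(fake_geocode, min_delay_seconds=0.0)
first = geocode('Lisboa 1100')
second = geocode('Lisboa 1100')    # served from the cache
missing = geocode('unknown place')
```

In the notebook, `geocode_fn` would be a function that queries Nominatim and returns the address string.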

In [None]:
!pip install geopy



In [None]:
# Many rows share the same address, so only the unique values are processed;
# this avoids repeating the same lookup for every row.
addresses = df['location'].unique()

In [None]:
len(addresses)

2239

In [None]:
!pip install pycountry

Collecting pycountry
  Downloading pycountry-22.3.5.tar.gz (10.1 MB)
     |████████████████████████████████| 10.1 MB 4.1 MB/s
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Building wheels for collected packages: pycountry
  Building wheel for pycountry (PEP 517) ... done
  Created wheel for pycountry: filename=pycountry-22.3.5-py2.py3-none-any.whl size=10681845 sha256=9b59d0440c98aa313ffcce4f189dcc20dc72202b2dbe0cf762fe76229ec498a3
  Stored in directory: /root/.cache/pip/wheels/0e/06/e8/7ee176e95ea9a8a8c3b3afcb1869f20adbd42413d4611c6eb4
Successfully built pycountry
Installing collected packages: pycountry
Successfully installed pycountry-22.3.5


In [None]:
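import random
from geopy.geocoders import Nominatim

def get_full_addres_fromWEB(address):
  # Nominatim requires a user agent; a small random suffix varies it between calls
  user_ag = 'nurgali_'+str(random.randint(0,10))
  geolocator = Nominatim(user_agent = user_ag)
  location = geolocator.geocode(address)
  # geocode() returns None when nothing is found, so .address then raises
  # AttributeError; callers are expected to catch it
  return location.address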

In [None]:
from geopy.geocoders import Nominatim
import random
def get_addres_fromWEB(address):
  address_parts = address.split(', ')
  # Searching for the complete address may find nothing, so on failure
  # the address is shortened from the end and the lookup is retried
  for i in range(len(address_parts),1,-1):
    try:
      part_of_address = ', '.join(address_parts[:i])
      parts = get_full_addres_fromWEB(part_of_address).split(', ')
      # The last component of the geocoded address is normally the country
      return parts[-1]
    except Exception:
      # Lookup failed (no result or request error): try a shorter address
      continue
  return None
print(get_addres_fromWEB('Rua Silva Carvalho 124, 124 , 1250'))

None
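
The shortening strategy can be looked at separately from the web request. The sketch below only lists the queries that would be tried, from the full address down to its first component; `address_candidates` is a hypothetical helper, not part of the notebook. Unlike the loop above, which stops at two components, this variant also includes the street-only query, which is an assumption about what might help in cases such as the one printed above.

```python
from typing import List

def address_candidates(address: str) -> List[str]:
    """Return progressively shorter versions of a comma-separated address."""
    # Split on commas and drop surrounding whitespace and empty pieces.
    parts = [p.strip() for p in address.split(',')]
    parts = [p for p in parts if p]
    # Longest candidate first, then drop one trailing component at a time.
    return [', '.join(parts[:i]) for i in range(len(parts), 0, -1)]

candidates = address_candidates('Rua Silva Carvalho 124, 124 , 1250')
for candidate in candidates:
    print(candidate)
```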


In [None]:
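import pycountry

def get_country(address):
  # First look for a known country name inside the address string itself
  for country in pycountry.countries:
      if country.name in address:
          return country.name
  # Otherwise fall back to geocoding and take the last component of the result
  return get_addres_fromWEB(address)
get_country('R. Do Dr. Barbosa De Castro 59, 4050-629 Porto, Portugal, Porto 4050-629')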

'Portugal'

In [None]:
import re
address_dictionary = {}
code_zip = r'(\d{3,}\-\d{2,})'
# code_zip_with_city = r'([A-ZÉ][A-ZÉa-zúé]{1,}\ \d{4,})'
# code_zip_general = r'(,\ \d{4,})'

# In this case the restaurant name would have been a reasonable key for address_dictionary.
# In general, use the restaurant ID from the database.

for address in addresses:
  country = get_country(str(address))
  postal_code = None
  postal_code_suffix = None
  zip_code = re.search(code_zip, str(address))
  if not zip_code:
    # No postal code in the raw string: look for one in the geocoded address
    try:
      address_fromWEB = get_full_addres_fromWEB(str(address))
    except Exception:
      address_fromWEB = ''
    zip_code = re.search(code_zip, address_fromWEB)
  if zip_code:
    postal_code = zip_code.group(1)
    postal_code_suffix = postal_code.split('-')[1]
  # Addresses that are still unresolved could be searched further with another regular expression
  address_dictionary[address] = {'country':country,
                                 'postal_code' : postal_code,
                                 'postal_code_suffix' : postal_code_suffix}

In [None]:
address_dictionary

{'Praça Dom Pedro Iv 81-83, 1100-202 Lisboa, Portugal, Lisboa 1100': {'country': 'Portugal',
  'postal_code': '1100-202',
  'postal_code_suffix': '202'},
 'Av. República, 1000-082 Lisboa, Portugal, 406, Lisboa 1000-082': {'country': 'Portugal',
  'postal_code': '1000-082',
  'postal_code_suffix': '082'},
 'Av. Gen. Roçadas 95, 1170-340 Lisboa, Portugal, Lisboa 1170': {'country': 'Portugal',
  'postal_code': '1170-340',
  'postal_code_suffix': '340'},
 'R. Condes De Monsanto 4, 1100-240 Lisboa, Portugal, Lisboa 1100-240': {'country': 'Portugal',
  'postal_code': '1100-240',
  'postal_code_suffix': '240'},
 'Rua Marquês De Fronteira 106d, Lisboa, Lisboa 1070': {'country': 'Portugal',
  'postal_code': None,
  'postal_code_suffix': None},
 'R. Alm. Barroso 1, Lisboa, Lisboa 1000': {'country': 'Portugal',
  'postal_code': None,
  'postal_code_suffix': None},
 'Rua Marquês De Fronteira 117f, 1070': {'country': 'Portugal',
  'postal_code': None,
  'postal_code_suffix': None},
 'Centro Comerci

In [None]:
# import json
# with open('/content/drive/MyDrive/Colab Notebooks/data/address_dictionary.json', 'w') as fp:
#     json.dump(address_dictionary, fp,ensure_ascii=False)
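
Saving the dictionary is what makes the slow lookups a one-time cost. A minimal sketch of the round trip follows, assuming the structure shown in the output above; it writes to a temporary directory instead of the Drive path, and the single entry is copied from that output.

```python
import json
import os
import tempfile

# One entry copied from the address_dictionary output above.
address_dictionary = {
    'Av. Gen. Roçadas 95, 1170-340 Lisboa, Portugal, Lisboa 1170': {
        'country': 'Portugal',
        'postal_code': '1170-340',
        'postal_code_suffix': '340',
    },
}

# Hypothetical location; the notebook would use its Drive folder instead.
path = os.path.join(tempfile.mkdtemp(), 'address_dictionary.json')

# ensure_ascii=False keeps characters such as "ç" readable in the file.
with open(path, 'w', encoding='utf-8') as fp:
    json.dump(address_dictionary, fp, ensure_ascii=False)

# A later session can load the file and skip geocoding for known addresses.
with open(path, encoding='utf-8') as fp:
    restored = json.load(fp)
```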

In [None]:
df[['country', 'postal_code', 'postal_code_suffix']] 

|  | country | postal_code | postal_code_suffix |
|---|---|---|---|
| 0 | Portugal | 1100.0 | 202.0 |
| 1 | Portugal | 1100.0 | 202.0 |
| 2 | Portugal | 1100.0 | 202.0 |
| 3 | Portugal | 1100.0 | 202.0 |
| 4 | Portugal | 1100.0 | 202.0 |
| ... | ... | ... | ... |
| 139232 |  |  |  |
| 139233 |  |  |  |
| 139234 |  |  |  |
| 139235 |  |  |  |
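
The labelled sample above appears to keep the four-digit code and the three-digit suffix in separate columns (1100 and 202). The sketch below shows one way to split an address accordingly; it assumes the Portuguese `NNNN-NNN` format seen in the sample addresses, and `split_postal_code` is a hypothetical helper. When only a bare four-digit code is present, the suffix is left empty.

```python
import re
from typing import Optional, Tuple

FULL_CODE = re.compile(r'\b(\d{4})-(\d{3})\b')   # e.g. 1100-202
SHORT_CODE = re.compile(r'\b(\d{4})\b')          # e.g. 1070

def split_postal_code(address: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (postal_code, postal_code_suffix) found in an address string."""
    match = FULL_CODE.search(address)
    if match:
        return match.group(1), match.group(2)
    # Fall back to the last standalone four-digit number, without a suffix.
    short_codes = SHORT_CODE.findall(address)
    if short_codes:
        return short_codes[-1], None
    return None, None

full = split_postal_code('Praça Dom Pedro Iv 81-83, 1100-202 Lisboa, Portugal, Lisboa 1100')
short = split_postal_code('Rua Marquês De Fronteira 106d, Lisboa, Lisboa 1070')
nothing = split_postal_code('Lisboa, Portugal')
```

The fallback can mistake a four-digit street number for a postal code, so its results are less reliable than full matches.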


In [None]:
def location_handler (row):
  country = address_dictionary[row['location']]['country']
  postal_code = address_dictionary[row['location']]['postal_code']
  postal_code_suffix = address_dictionary[row['location']]['postal_code_suffix']
  return pd.Series([country,postal_code, postal_code_suffix] )

In [None]:
df[['country', 'postal_code', 'postal_code_suffix']] = df.apply (lambda row: location_handler(row), axis=1)

Output was truncated to the last few lines (5000).
['Portugal', None, None]
['Portugal', None, None]
['Portugal', None, None]
...

In [None]:
df = df.where(pd.notnull(df), None)

In [None]:
def get_tags(row):
  if not row['tags']:
    return []
  clean_tags = re.sub('[^A-Za-z•]+','', row['tags']).split('•')
  clean_tags = list(filter(None, clean_tags))
  return clean_tags
# get_tags({'tags': 'Burgers • American • Mediterranean • Bar Food • Sandwich • Kids Friendly • Desserts • Ice Cream + Frozen Yogurt • Vegetarian Friendly • €'})

In [None]:
df['tags_array'] = [[] for _ in range(len(df))]

In [None]:
df['tags_array'] = df.apply (lambda row: get_tags(row), axis=1)

In [None]:
df['tags_array'] 

0         [Burgers, American, Mediterranean, BarFood, Sa...
1         [Burgers, American, Mediterranean, BarFood, Sa...
2         [Burgers, American, Mediterranean, BarFood, Sa...
3         [Burgers, American, Mediterranean, BarFood, Sa...
4         [Burgers, American, Mediterranean, BarFood, Sa...
                                ...                        
139232    [BreakfastandBrunch, Desserts, Pastry, Portugu...
139233    [BreakfastandBrunch, Desserts, Pastry, Portugu...
139234    [BreakfastandBrunch, Desserts, Pastry, Portugu...
139235    [BreakfastandBrunch, Desserts, Pastry, Portugu...
139236    [BreakfastandBrunch, Desserts, Pastry, Portugu...
Name: tags_array, Length: 139237, dtype: object
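
In the output above, multi-word tags are merged into one token ("BarFood") because spaces are removed together with the other non-letter characters. If the words should stay separate, the string can be split on the bullet first and each piece cleaned afterwards. This is only a sketch of that alternative; `split_tags` is a hypothetical name.

```python
import re
from typing import List, Optional

def split_tags(tags: Optional[str]) -> List[str]:
    """Split a bullet-separated tag string, keeping spaces inside each tag."""
    if not tags:
        return []
    cleaned = []
    for raw in tags.split('•'):
        # Replace anything that is not a letter or a space, then collapse whitespace.
        tag = re.sub(r'[^A-Za-z ]+', ' ', raw)
        tag = ' '.join(tag.split())
        if tag:
            cleaned.append(tag)
    return cleaned

tags = split_tags('Burgers • American • Mediterranean • Bar Food • Sandwich • '
                  'Kids Friendly • Desserts • Ice Cream + Frozen Yogurt • '
                  'Vegetarian Friendly • €')
print(tags)
```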

In [None]:
df.columns

Index(['restaurant', 'tags', 'url', 'location', 'rating', 'nr_reviews',
       'category', 'product', 'description', 'price', 'flag_has_picture',
       'country', 'postal_code', 'postal_code_suffix', 'nr_items',
       'tags_array'],
      dtype='object')

In [None]:
import nltk
from nltk import word_tokenize
from nltk.corpus import stopwords
# nltk.download('stopwords')
# nltk.download("punkt")

# Build the stop-word and punctuation sets once instead of once per row
nltk_stop_words = set(stopwords.words('portuguese'))
punctuations = set("?:!.,;)(")

def text_handler (row):
  # Join the fields with spaces and skip missing values; plain concatenation
  # merges neighbouring tokens (e.g. 'McMenusMiami', 'DoubleNone')
  fields = [row['restaurant'], row['category'], row['product'], row['description']]
  text = ' '.join(str(f) for f in fields if pd.notnull(f))
  nltk_tokenList = word_tokenize(text)

  # Filter stop words and punctuation in a single pass; removing items from
  # a list while iterating over it would skip elements
  filtered_sentence = [w for w in nltk_tokenList
                       if w not in nltk_stop_words and w not in punctuations]
  return filtered_sentence + row['tags_array']

In [None]:
df['words_array'] = df.apply (lambda row: text_handler(row), axis=1)

In [None]:
df['words_array'][0]

["McDonald's®",
 'Rossio',
 'Sanduíches',
 'McMenusMiami',
 'DoubleNone',
 'Burgers',
 'American',
 'Mediterranean',
 'BarFood',
 'Sandwich',
 'KidsFriendly',
 'Desserts',
 'IceCreamFrozenYogurt',
 'VegetarianFriendly']

In [None]:
# df.to_csv('/content/drive/MyDrive/Colab Notebooks/data/fullData.csv', sep='\t', encoding='utf-8')

In [None]:
# df.to_excel("/content/drive/MyDrive/Colab Notebooks/data/fullData.xlsx")

In [None]:
df_labels = pd.read_excel('/content/drive/MyDrive/Colab Notebooks/data/Labels_portuguese.xlsx', engine='openpyxl')
len(df_labels)

39

In [None]:
df_labels['label'] = df_labels['label'].str.strip()

In [None]:
df_labels['label']

0        Starters
1            Meat
2         Burgers
3           Sides
4        Desserts
5       Beverages
6         Popular
7         Grilled
8          Combos
9           Sides
10       Starters
11        Popular
12       Desserts
13      Beverages
14        Alcohol
15         Snacks
16        Burgers
17    Bbq & Grill
18           Meat
19           Fish
20         Combos
21        Garnish
22      Beverages
23       Desserts
24       Starters
25          Sides
26          Mains
27        Burgers
28     Sandwiches
29     Sandwiches
30     Sandwiches
31        Hotdogs
32          Wraps
33      Beverages
34        Alcohol
35        Alcohol
36          Sushi
37         Sauces
38       Desserts
Name: label, dtype: object

In [None]:
from sklearn.preprocessing import LabelBinarizer

lb = LabelBinarizer()

df_labels = df_labels.join(pd.DataFrame(lb.fit_transform(df_labels["label"]),
                          columns=lb.classes_, 
                          index=df_labels.index))

In [None]:
df_labels

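|  | input | label | Alcohol | Bbq & Grill | Beverages | Burgers | Combos | Desserts | Fish | Garnish | ... | Mains | Meat | Popular | Sandwiches | Sauces | Sides | Snacks | Starters | Sushi | Wraps |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ENTRADA | Starters | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
| 1 | MEAT PREMIUM BLACK ANGUS | Meat | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | BURGERS CARNE PREMIUM BLACK ANGUS | Burgers | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | ACOMPANHAMENTOS | Sides | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
| 4 | SOBREMESAS CASEIRAS | Desserts | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 5 | BEBIDAS | Beverages | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 6 | RECOMENDADO | Popular | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 7 | GRELHADOS NO CARVÃO | Grilled | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 8 | MENUS | Combos | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 9 | ACOMPANHAMENTOS | Sides | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |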


sklearn offers several encoders (LabelEncoder, OneHotEncoder, LabelBinarizer); which one to use depends on the goal and on the approaches applied afterwards.
One difference is that OneHotEncoder accepts multi-column data, whereas LabelBinarizer and LabelEncoder each work on a single column of labels.
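
The difference is easiest to see on a few labels. The following sketch uses five labels taken from the table above; the second column passed to OneHotEncoder is a hypothetical yes/no feature added only to show multi-column input.

```python
from sklearn.preprocessing import LabelBinarizer, LabelEncoder, OneHotEncoder

labels = ['Starters', 'Meat', 'Burgers', 'Sides', 'Starters']

# LabelEncoder: one integer per class, classes sorted alphabetically.
codes = LabelEncoder().fit_transform(labels)

# LabelBinarizer: one 0/1 column per class, for a single column of labels.
one_hot_labels = LabelBinarizer().fit_transform(labels)

# OneHotEncoder: takes a 2-D table and encodes every column independently.
has_picture = ['yes', 'no', 'yes', 'no', 'no']  # hypothetical second feature
rows = [[label, flag] for label, flag in zip(labels, has_picture)]
one_hot_rows = OneHotEncoder().fit_transform(rows).toarray()

print(codes)
print(one_hot_labels.shape)  # 5 rows, 4 label classes
print(one_hot_rows.shape)    # 5 rows, 4 label classes + 2 flag values
```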

### **Bonus task**
Ideally, if we have menus with images from different restaurants and the order history for those menus, we can hypothesize the following: when a dish appears on several menus and its order statistics differ between them, the photos are more likely to be of high quality where orders are high, and of lower quality where orders are low. With this approach the data must be normalized, that is, the restaurant's overall workload must be taken into account by comparing the dish against the rest of that restaurant's orders.
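
The normalization step can be sketched as a share of the restaurant's own orders, so that a busy restaurant does not look better only because it sells more of everything. All names and numbers below are hypothetical and serve only as an illustration.

```python
from typing import Dict

# Hypothetical order counts: restaurant -> {dish: number of orders}
orders = {
    'restaurant_a': {'burger': 300, 'salad': 100, 'soup': 100},
    'restaurant_b': {'burger': 30, 'salad': 50, 'soup': 20},
}

def order_share(restaurant_orders: Dict[str, int], dish: str) -> float:
    """Fraction of a restaurant's orders that went to the given dish."""
    total = sum(restaurant_orders.values())
    return restaurant_orders.get(dish, 0) / total if total else 0.0

# Compare the same dish across restaurants on a normalized scale.
shares = {name: order_share(menu, 'burger') for name, menu in orders.items()}
print(shares)
```

Under the hypothesis above, the burger photo of the restaurant with the higher share would be treated as the likelier high-quality example.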

*   https://teeyeeyang.medium.com/food-image-segmentation-with-fast-ai-20d5cc70aa10 describes an approach to segmentation that can help extract more data about a photo (angle, area); lighting and color scheme can be used as well.
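
Lighting and color scheme can be turned into simple numeric features. The sketch below is an assumption about what such features might look like, not a method from the linked article: it computes the mean brightness (Rec. 601 luma weights) and the mean HSV saturation of a list of RGB pixels, using only the standard library. The function names and pixel values are hypothetical.

```python
import colorsys
from typing import List, Tuple

Pixel = Tuple[int, int, int]  # (red, green, blue), each 0..255

def mean_brightness(pixels: List[Pixel]) -> float:
    """Average luma on a 0..255 scale, using Rec. 601 weights."""
    return sum(0.299 * r + 0.587 * g + 0.114 * b for r, g, b in pixels) / len(pixels)

def mean_saturation(pixels: List[Pixel]) -> float:
    """Average HSV saturation on a 0..1 scale."""
    return sum(colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)[1]
               for r, g, b in pixels) / len(pixels)

# Tiny hypothetical "images": a dark grey one and a bright red one.
dark_photo = [(20, 20, 20)] * 4
vivid_photo = [(255, 0, 0)] * 4

features = {
    'dark': (mean_brightness(dark_photo), mean_saturation(dark_photo)),
    'vivid': (mean_brightness(vivid_photo), mean_saturation(vivid_photo)),
}
print(features)
```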

