# Data Quality Exploration
---

In this notebook, we explore the data we have in foodies and identify which features would be useful but are not yet available to us.

In [43]:
import pandas as pd
from utils import FoodiesData

In [2]:
data = FoodiesData("data")
data.summary()

admin's shape: (1, 6)
cover's shape: (4, 7)
district's shape: (929, 9)
food_nationality's shape: (50, 10)
food_type's shape: (43, 10)
member's shape: (23, 21)
member_address's shape: (2, 17)
member_follow's shape: (2, 2)
member_like's shape: (0, 2)
photo_group's shape: (3, 9)
province's shape: (78, 8)
review's shape: (52, 7)
review_shop's shape: (77, 13)
review_shop_like's shape: (24, 2)
review_shop_photo's shape: (22, 10)
review_shop_photo_comment's shape: (26, 7)
review_shop_photo_comment_like's shape: (3, 2)
review_shop_photo_like's shape: (2, 2)
shop's shape: (65041, 34)
shop_follow's shape: (1, 2)
shop_food_nationality's shape: (135, 2)
shop_food_type's shape: (195, 2)
shop_for's shape: (7, 9)
shop_location's shape: (68, 10)
shop_pay's shape: (3, 9)
shop_price's shape: (4, 9)
shop_service's shape: (14, 9)
shop_shop_for's shape: (168, 2)
shop_shop_location's shape: (101, 2)
shop_shop_pay's shape: (85, 2)
shop_shop_service's shape: (136, 2)
shop_shop_type's shape: (117, 2)
shop_type


## From the summary, we realized that

- There is not much user feedback yet. This includes tables such as
    - review
    - shop_follow
- Only the shop table holds a full set of shop data (65,041 rows). No other table has more than 10K rows yet, for example:
    - shop_shop_for
    - shop_shop_location
    - shop_shop_pay
    - shop_shop_service
    - shop_shop_type
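
The row counts behind these observations can also be ranked programmatically. The following is a minimal sketch, assuming `data.dfs` is a dict that maps table names to DataFrames (the notebook reads `data.dfs["shop"]`, but the full structure of `FoodiesData` is not shown here). The `row_counts` helper and the toy tables are hypothetical and only illustrate the idea.

```python
import pandas as pd

def row_counts(dfs):
    """Return a Series of row counts per table, largest first."""
    counts = pd.Series({name: len(df) for name, df in dfs.items()}, name="rows")
    return counts.sort_values(ascending=False)

# Toy stand-in for data.dfs (hypothetical contents, not the real tables).
toy_dfs = {
    "shop": pd.DataFrame({"id": range(5)}),
    "review": pd.DataFrame({"id": range(2)}),
    "member_like": pd.DataFrame({"id": []}),
}
counts = row_counts(toy_dfs)
print(counts)
```

On the real data, `row_counts(data.dfs)` should put shop at the top and empty tables such as member_like at the bottom, provided the assumption about `data.dfs` holds.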

## Let's look at the shop table, which has the most rows and looks the most valid.

In [4]:
shop = data.dfs["shop"]
print(shop.columns)

Index(['id', 'logo', 'cover', 'name', 'branch', 'description', 'tel',
       'website', 'line', 'facebook', 'instagram', 'address_no', 'no_room',
       'no_floor', 'address_soi', 'address_road', 'sub_district_id',
       'district_id', 'province_id', 'postal_code', 'lat', 'lng',
       'shop_price_from', 'shop_price_to', 'has_carpark', 'landmark',
       'worktime', 'last_order', 'holiday', 'status', 'created_on',
       'created_by', 'updated_on', 'updated_by'],
      dtype='object')


### Investigate NULL

From the code below, we see the columns in which more than 99% of the values are null.
Some fields that could be valuable for categorizing shops: description, branch (to see how many branches a shop has), no_floor (to identify luxury restaurants), postal_code (for location grouping), and holiday.

In [48]:
null_count = shop.isnull().mean()
null_count[null_count>0.99]

logo           0.999370
cover          0.999370
branch         0.999370
description    0.999370
no_room        0.999370
no_floor       0.999370
landmark       0.999754
holiday        0.999477
dtype: float64

In [20]:
# Look at the non-null values of description and branch
print(shop.description.value_counts())
print(shop.branch.value_counts())

# These also appear to be mock data rather than real restaurant information.

Shop Description    36
wafs                 1
ward                 1
shooopim             1
wae                  1
wqed                 1
Name: description, dtype: int64
Shop Branch    36
qwrd            1
wafs            1
shooopim        1
wae             1
wqed            1
Name: branch, dtype: int64
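
In the output above, 36 of the 41 non-null descriptions are the same string, "Shop Description". One way to flag such placeholder-dominated columns is to measure how much of the non-null data the most common value accounts for. This is a rough sketch, not part of the original notebook: `top_value_share` is a hypothetical helper and the toy column only mimics the pattern seen above.

```python
import pandas as pd

def top_value_share(series):
    """Return (most common non-null value, its share of non-null rows)."""
    counts = series.dropna().value_counts()
    if counts.empty:
        return None, 0.0
    return counts.index[0], counts.iloc[0] / counts.sum()

# Toy column mimicking the pattern above (hypothetical values).
toy = pd.Series(["Shop Description"] * 6 + ["wafs", "wae"] + [None] * 12)
value, share = top_value_share(toy)
print(value, share)
```

A high share on a free-text column such as description is a hint that the values are defaults or test entries. It would be expected on genuinely categorical columns, so the check only makes sense for fields that should vary per shop.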


In [34]:
unique_name = shop.name.unique().shape
unique_name

(55293,)

## Names appear to be usable

In [51]:
print(shop.name[:10])
shop.name.unique().shape

0                   Akimitsu Tendon Rayong (อาคิมิซุ)
1                                     โอชิเน พิษณุโลก
2                                      Isao (อิซาโอะ)
3               Sushi Mori (ซูชิ โมริ) Sathorn Square
4           Bankara Ramen (บังคาระ ราเมง) สุขุมวิท 39
5             UMENOHANA (อุเมะโนะฮานะ) นิฮอนมูระมอลล์
6                     Katsushin (かつ真) (คัทสึชิน) สีลม
7                      Sushi Masa (ซูชิ มาสะ) ราชเทวี
8                 Fillets (ฟิลเล) The Portico หลังสวน
9    Rising Yakiniku Buffet (ไรซิ่ง ยาคินิคุ บุฟเฟต์)
Name: name, dtype: object


(55293,)

In [60]:
# Soi and road are filled in for roughly 18% and 38% of rows. They add little, though, since we already have lat/lng.
print("Soi valid count: %s" % shop.address_soi[shop.address_soi.notnull()].shape)
print("Road valid count: %s" % shop.address_road[shop.address_road.notnull()].shape)

Soi valid count: 11679
Road valid count: 24616
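
Against the 65,041 rows of shop, these counts correspond to about 18% (11679 / 65041) and 38% (24616 / 65041). The same coverage figures can be computed as fractions in one step. A minimal sketch, using a hypothetical `coverage` helper and a toy frame in place of the real shop table:

```python
import pandas as pd

def coverage(df, columns):
    """Fraction of non-null rows for each listed column."""
    return df[columns].notnull().mean()

# Toy stand-in for the shop table (hypothetical values).
toy_shop = pd.DataFrame({
    "address_soi": ["Soi 1", None, None, None],
    "address_road": ["Road A", "Road B", None, None],
})
cov = coverage(toy_shop, ["address_soi", "address_road"])
print(cov)
```

Note that this only counts nulls. Empty strings or placeholder text would still be counted as valid, so the real coverage may be lower.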


## Validate lat lng

In [83]:
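# The count is close to the number of unique shop names
# .copy() so the new column added below does not write to a view of shop
valid_latlng = shop.loc[shop.lat > 0, ['lat', 'lng', 'name']].copy()
valid_latlng.shape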

(56861, 3)

In [84]:
valid_latlng["position"] = valid_latlng.lat.astype(str) + "-" + valid_latlng.lng.astype(str)

In [90]:
# Latitude and longitude also look good. There are some duplicates, but few relative to the size of the data. They are presumably restaurants in the same mall, landmark, etc.
valid_latlng["position"].value_counts().head(10)

13.733662099999998-100.4705742          36
12.939362-100.884257                    34
12.0-101.0                              14
16.84089591-100.2323637                 10
18.7825795-99.00077469999997             8
18.77254-98.999634                       8
13.89831844669952-100.54535481473248     7
13.879943689831-100.40464851013999       7
13.58435362-100.6071177                  6
18.7995452958815-98.96882968220905       6
Name: position, dtype: int64
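
The string key used above works, but the same duplicate count can be obtained by grouping on the two numeric columns directly, which avoids formatting floats as text. This is an alternative sketch rather than the notebook's method, and the toy frame is hypothetical:

```python
import pandas as pd

# Toy stand-in for valid_latlng (hypothetical values).
toy = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "lat": [13.5, 13.5, 12.0, 18.7],
    "lng": [100.5, 100.5, 101.0, 99.0],
})

# Count shops per exact coordinate pair, most shared first.
pair_counts = toy.groupby(["lat", "lng"]).size().sort_values(ascending=False)

# Keep only coordinates shared by more than one shop.
shared = pair_counts[pair_counts > 1]
print(shared)
```

Grouping keeps lat and lng as separate index levels, so the shared positions can be joined back to the shop names to check whether they really belong to the same mall or landmark.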

## Landmark column doesn't have usable values yet but could be valuable

In [96]:
shop.landmark.value_counts()

ใกล้ปั๊มน้ำมัน    12
333                1
shooopim           1
wqed               1
32r                1
Name: landmark, dtype: int64