# Data Quality Exploration (Other Tables)
---

In this notebook, we explore the data currently available in Foodies and identify which features would be useful but are not yet available to us.

In [2]:
import pandas as pd
from utils import FoodiesData

In [3]:
data = FoodiesData("data")
data.summary()

admin's shape: (1, 6)
cover's shape: (4, 7)
district's shape: (929, 9)
food_nationality's shape: (50, 10)
food_type's shape: (43, 10)
member's shape: (23, 21)
member_address's shape: (2, 17)
member_follow's shape: (2, 2)
member_like's shape: (0, 2)
photo_group's shape: (3, 9)
province's shape: (78, 8)
review's shape: (52, 7)
review_shop's shape: (77, 13)
review_shop_like's shape: (24, 2)
review_shop_photo's shape: (22, 10)
review_shop_photo_comment's shape: (26, 7)
review_shop_photo_comment_like's shape: (3, 2)
review_shop_photo_like's shape: (2, 2)
shop's shape: (65041, 34)
shop_follow's shape: (1, 2)
shop_food_nationality's shape: (135, 2)
shop_food_type's shape: (195, 2)
shop_for's shape: (7, 9)
shop_location's shape: (68, 10)
shop_pay's shape: (3, 9)
shop_price's shape: (4, 9)
shop_service's shape: (14, 9)
shop_shop_for's shape: (168, 2)
shop_shop_location's shape: (101, 2)
shop_shop_pay's shape: (85, 2)
shop_shop_service's shape: (136, 2)
shop_shop_type's shape: (117, 2)
shop_type


## From the summary, we realized that

- There is not much user feedback yet. This includes tables such as
    - review
    - shop_follow
- Only the shop table is fully populated with shop data. None of the other tables has more than 10K rows yet, for example:
    - shop_shop_for
    - shop_shop_location
    - shop_shop_pay
    - shop_shop_service
    - shop_shop_type
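
The observation above can be checked mechanically. The following is a minimal sketch that uses a subset of the row counts copied by hand from the summary output; the helper name `sparse_tables` and the 10K threshold parameter are illustrative assumptions, not part of `utils`.

```python
# Row counts copied from the data.summary() output above (a subset).
row_counts = {
    "shop": 65041,
    "review": 52,
    "review_shop": 77,
    "shop_follow": 1,
    "member_like": 0,
    "shop_shop_for": 168,
    "shop_shop_location": 101,
    "shop_shop_pay": 85,
    "shop_shop_service": 136,
    "shop_shop_type": 117,
}


def sparse_tables(counts, min_rows=10_000):
    """Return (name, rows) pairs with fewer than min_rows rows, smallest first."""
    return sorted(
        ((name, rows) for name, rows in counts.items() if rows < min_rows),
        key=lambda item: item[1],
    )


for name, rows in sparse_tables(row_counts):
    print(f"{name}: {rows} rows")
```

If `data.dfs` is a dict of DataFrames, as the later cells suggest, the counts could presumably be built with `{name: len(df) for name, df in data.dfs.items()}` instead of being typed in.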

## Concern with quality of review

---

It is normal in web development to insert test reviews into the data. One thing we have to keep track of is which reviews are not real.

Findings
1. Most of the reviews contain HTML tags. I assume the Foodies platform has some kind of rich-text formatting functionality where users can style their text (bold, italic, font, etc.).
While storing the formatted description is fine, it is inconvenient for later use. I could write a parser to extract the text, but if 10 people want to access this data, none of us wants to write a separate parser just to get at it. One option is to add another column that stores only the plain text of the review. That would also be useful if we want text-only output for some part of the website later in development.
2. I assume most of them are test reviews. Is there any way we can recognize them as tests later? We have few rows, so it is fine for now, but imagine reviews growing to 60K, of which 20K are test reviews. We should have a way to separate them early on so that our lives will be easier.
    (ทดสอบรีวิว "test review", ทดสอบร้าน "test shop")
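
As a rough illustration of the plain-text column suggested in finding 1, here is a minimal sketch that strips tags using only the standard library. The function name `html_to_text` and the choice of block-level tags are assumptions made for this example; it is not how Foodies stores or processes reviews.

```python
from html.parser import HTMLParser

# Tags after which a space is inserted so adjacent blocks do not run together.
_BLOCK_TAGS = {"p", "div", "br", "li", "h1", "h2", "h3", "h4", "h5", "h6"}


class _TextExtractor(HTMLParser):
    """Collects text nodes and drops all tags and attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in _BLOCK_TAGS:
            self.parts.append(" ")

    def handle_endtag(self, tag):
        if tag in _BLOCK_TAGS:
            self.parts.append(" ")

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(value):
    """Return the text content of an HTML fragment; non-strings (e.g. NaN) become ''."""
    if not isinstance(value, str):
        return ""
    parser = _TextExtractor()
    parser.feed(value)
    parser.close()
    # split() also splits on non-breaking spaces produced by &nbsp;
    return " ".join("".join(parser.parts).split())


print(html_to_text("<p>ทดสอบรีวิว</p>"))
```

In the notebook, something like `data.dfs["review_shop"]["description"].map(html_to_text)` could fill a hypothetical `description_text` column. Doing this once at write time on the platform side would spare every consumer from repeating it.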

In [13]:
# review_shop has 77 rows (the review table has 52).
data.dfs["review_shop"]['description'].head(10)

0                       Review Shop Description Update
1    <h2 class="title" style="font-family: Kanit, s...
2    <p><strong style="margin: 0px; padding: 0px; f...
3    <p><strong style="margin: 0px; padding: 0px; f...
4    <p><strong>Lorem Ipsum<span style="font-family...
5                                                  NaN
6                                                  NaN
7                                    <p>ทดสอบรีวิว</p>
8    <p>ทดสอบร้าน&nbsp;<span style="background-colo...
9    <p><img src="null" alt="ทดสอบ" /></p>\n<div st...
Name: description, dtype: object
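
Several of the rows above are visibly placeholders. Until test reviews can be marked explicitly, a keyword heuristic may be enough to flag them. The sketch below is only that: the marker list is an assumption drawn from the ten sample rows, and the function name `looks_like_test` is hypothetical.

```python
import re

# Assumed markers, taken from the sample rows: Thai "ทดสอบ" ("test"),
# "Lorem Ipsum" filler, and the "Description Update" placeholder text.
TEST_MARKERS = re.compile(r"ทดสอบ|lorem ipsum|description update", re.IGNORECASE)


def looks_like_test(description):
    """Return True if the raw description contains a known test marker."""
    if not isinstance(description, str):
        return False  # NaN / missing descriptions are not flagged
    return bool(TEST_MARKERS.search(description))


samples = [
    "Review Shop Description Update",
    "<p>ทดสอบรีวิว</p>",
    "<p>Great noodles, friendly staff.</p>",
]
for text in samples:
    print(looks_like_test(text), text)
```

A heuristic like this will miss test rows without these words and may flag real reviews that mention them, so an explicit flag set when the row is created (for example a hypothetical `is_test` column) would be more dependable as the table grows.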

In [16]:
# The service, taste, value, appearance, and environment scores look really promising. They need a closer look!
# Presumably, 5 reviews per shop for 10% of shops should be enough to work with at first:
# 60K * 0.1 * 5 = 30K reviews!
data.dfs["review_shop"].columns

Index(['review_shop_id', 'review_id', 'shop_id', 'member_id', 'parent_id',
       'service_score', 'taste_score', 'value_score', 'appearance_score',
       'environment_score', 'description', 'created_on', 'updated_on'],
      dtype='object')
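
The back-of-the-envelope target in the cell above (5 reviews per shop for 10% of roughly 60K shops) can be written down so that progress toward it is easy to track. This is a sketch under those stated assumptions; the function names are made up for illustration.

```python
from collections import Counter


def review_target(n_shops, shop_fraction=0.1, reviews_per_shop=5):
    """Number of reviews needed to cover shop_fraction of shops."""
    return round(n_shops * shop_fraction * reviews_per_shop)


def shops_meeting_target(shop_ids, reviews_per_shop=5):
    """Count shops that already have at least reviews_per_shop reviews.

    shop_ids is any iterable with one shop_id per review row.
    """
    counts = Counter(shop_ids)
    return sum(1 for n in counts.values() if n >= reviews_per_shop)


print(review_target(60_000))  # 30000
print(shops_meeting_target([1, 1, 1, 1, 1, 2, 2]))  # 1
```

With the real data, the `shop_id` column of `review_shop` would be the natural input to `shops_meeting_target`.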

In [18]:
data.dfs["review"].head()

Unnamed: 0,id,member_id,subject,description,summary,created_on,updated_on
0,1,1,Review Subject Update,Review Description Update,Review Summary Update,2019-02-25 03:24:54,2019-02-25 03:24:59
1,2,4,Review 1,,,2019-05-16 18:03:43,2019-05-16 18:03:43
2,3,4,Review 2,,,2019-05-16 18:06:26,2019-05-16 18:06:26
3,4,4,Review 3,,,2019-05-17 19:27:08,2019-05-17 19:27:08
4,5,5,,,,2019-05-31 19:06:46,2019-05-31 19:06:46


In [22]:
# Shop type also looks promising, but shop_shop_type has only 117 rows.
data.dfs["shop_type"]

Unnamed: 0,id,name_th,name_en,order_on,status,created_on,created_by,updated_on,updated_by
0,1,กึ่งผับ,Pub & Restaurant,1,Y,2019-02-11 02:11:25,-,0000-00-00 00:00:00,-
1,2,คาเฟ่,Cafe,2,Y,2019-02-11 02:11:25,-,0000-00-00 00:00:00,-
2,3,คาราโอเกะ,Karaoke,3,Y,2019-02-11 02:11:25,-,0000-00-00 00:00:00,-
3,4,จานด่วน,Fast food,4,Y,2019-02-11 02:11:26,-,0000-00-00 00:00:00,-
4,5,เดลิเวอรี่,Delivery,5,Y,2019-02-11 02:11:26,-,0000-00-00 00:00:00,-
5,6,บรรยากาศดี,Good atmosphere,6,Y,2019-02-11 02:11:26,-,0000-00-00 00:00:00,-
6,7,บาร์,Bar,7,Y,2019-02-11 02:11:26,-,0000-00-00 00:00:00,-
7,8,บุฟเฟ่ต์,Buffet,8,Y,2019-02-11 02:11:26,-,0000-00-00 00:00:00,-
8,9,บุฟเฟ่ต์โรงแรม,Hotel buffet,9,Y,2019-02-11 02:11:26,-,0000-00-00 00:00:00,-
9,10,ปิดหลัง 22.00 น.,Closed after 10 pm,10,Y,2019-02-11 02:11:26,-,0000-00-00 00:00:00,-


In [None]:
# Honestly, every table is lacking data T_T, but let's focus on the ones we want most, about 4 tables.