We load the Rotten Tomatoes data into the data warehouse, using the following schema:

Movie Info table: general information about each movie

- movie_id: movie identifier, encoded as a number
- movie_title: title
- movie_info: description
- content_rating: age restriction for viewers
- genres: genres
- directors: directors of the movie
- actors: actors appearing in the movie
- release_date: release date
- runtime: running time of the movie
- production_company: production company

Tomato Rating Info table: rating information for each movie on the Rotten Tomatoes site

- movie_id: foreign key
- tomatometer_rating: average rating from film critics
- tomatometer_count: number of ratings from film critics
- audience_rating: average rating from the audience
- audience_count: number of ratings from the audience

Every other data source loaded into the data warehouse is likewise split into two tables: one with general information about the movie and one with that source's rating information for the movie.
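The split into two tables can be sketched as follows. This is a minimal illustration that assumes the source frame already uses the warehouse column names; the helper `split_movie_tables`, the two column-list constants, and the sample row are hypothetical and do not come from the original notebook.

```python
import pandas as pd

# Columns of the two warehouse tables, as listed in the schema above.
MOVIE_INFO_COLUMNS = [
    "movie_id", "movie_title", "movie_info", "content_rating", "genres",
    "directors", "actors", "release_date", "runtime", "production_company",
]
RATING_COLUMNS = [
    "movie_id", "tomatometer_rating", "tomatometer_count",
    "audience_rating", "audience_count",
]


def split_movie_tables(df, info_columns, rating_columns):
    """Split one source frame into a general-info table and a rating table.

    Hypothetical helper: columns that the source does not provide are skipped.
    """
    info = df[[c for c in info_columns if c in df.columns]].copy()
    rating = df[[c for c in rating_columns if c in df.columns]].copy()
    return info, rating


# Made-up sample row, for illustration only.
source = pd.DataFrame({
    "movie_id": [1],
    "movie_title": ["Example Movie"],
    "genres": ["Drama"],
    "tomatometer_rating": [90.0],
    "tomatometer_count": [120],
    "audience_rating": [85.0],
    "audience_count": [1000],
})
movie_info, rating_info = split_movie_tables(source, MOVIE_INFO_COLUMNS, RATING_COLUMNS)
print(movie_info.columns.tolist())
print(rating_info.columns.tolist())
```

Both tables keep `movie_id`, so the rating table can be joined back to the general-information table through the foreign key.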

In addition, the following reference tables list the values used by fields in the data warehouse (they support the processing of other data sources):

- Cast and Director table: list of the directors and actors involved in the movies
- Production Company table: list of production companies
- Genres table: list of genres
- Content Rating table: list of content ratings

In [2]:
!pip3 install pandas

Defaulting to user installation because normal site-packages is not writeable
Collecting pandas
  Downloading pandas-1.4.3-cp39-cp39-win_amd64.whl (10.6 MB)
     -------------------------------------- 10.6/10.6 MB 187.6 kB/s eta 0:00:00
Collecting pytz>=2020.1
  Downloading pytz-2022.1-py2.py3-none-any.whl (503 kB)
     ------------------------------------ 503.5/503.5 kB 404.7 kB/s eta 0:00:00
Installing collected packages: pytz, pandas
Successfully installed pandas-1.4.3 pytz-2022.1


In [1]:
import pandas as pd
import re
import difflib

In [2]:
# Movie Info

df_metacritic_movie = pd.read_csv("H:/DataIntegation/metacritic/movieDatasetClean.csv")
df_metacritic_movie.head()

Unnamed: 0,title,age_rating,rating,rank,genre,director,year,producer,actor,runtime,description,img,url
0,Citizen Kane,15,8.4,1,Drama,Orson Welles,1941,RKO Radio Pictures,Joseph Cotten,119,"Following the death of a publishing tycoon, ne...",https://static.metacritic.com/images/products/...,https://www.metacritic.com/movie/citizen-kane
1,The Godfather,18,9.2,2,Drama,Francis Ford Coppola,1972,Paramount Pictures,Al Pacino,175,Francis Ford Coppola's epic features Marlon Br...,https://static.metacritic.com/images/products/...,https://www.metacritic.com/movie/the-godfather
2,Rear Window,0,8.8,3,Mystery,Alfred Hitchcock,1954,Paramount Pictures,Frank Cady,112,A wheelchair-bound photographer spies on his n...,https://static.metacritic.com/images/products/...,https://www.metacritic.com/movie/rear-window
3,Casablanca,15,8.9,4,Drama,Michael Curtiz,1943,Warner Bros.,Humphrey Bogart,102,"A Casablanca, Morocco casino owner in 1941 she...",https://static.metacritic.com/images/products/...,https://www.metacritic.com/movie/casablanca
4,Boyhood,18,7.5,5,Drama,Richard Linklater,2014,IFC Films,Bonnie Cross,165,"Filmed over 12 years with the same cast, Richa...",https://static.metacritic.com/images/products/...,https://www.metacritic.com/movie/boyhood


In [3]:
df_metacritic_movie.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14213 entries, 0 to 14212
Data columns (total 13 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   title        14213 non-null  object
 1   age_rating   14213 non-null  int64 
 2   rating       14213 non-null  object
 3   rank         14213 non-null  int64 
 4   genre        14213 non-null  object
 5   director     14213 non-null  object
 6   year         14213 non-null  int64 
 7   producer     13941 non-null  object
 8   actor        14213 non-null  object
 9   runtime      14213 non-null  int64 
 10  description  14210 non-null  object
 11  img          14213 non-null  object
 12  url          14213 non-null  object
dtypes: int64(4), object(9)
memory usage: 1.4+ MB


Schema matching results:

- Movie table

| Source Schema | Data warehouse Schema |
|---------------|-----------------------|
| title         | movie_title           |
| genre         | genres                |
| director      | directors             |
| year          | release_date          |
| producer      | production_company    |
| actor         | actors                |
| description   | movie_info            |

- Rating table: as noted, the matching results are completely wrong
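Applying the Movie table matching amounts to selecting the matched source columns and renaming them. The sketch below uses the mapping from the table above and one row from the earlier `head()` output; the names `SCHEMA_MAPPING`, `sample`, and `matched` are illustrative assumptions, not names from the notebook.

```python
import pandas as pd

# Source-to-warehouse column mapping, copied from the Movie table above.
SCHEMA_MAPPING = {
    "title": "movie_title",
    "genre": "genres",
    "director": "directors",
    "year": "release_date",
    "producer": "production_company",
    "actor": "actors",
    "description": "movie_info",
}

# One row taken from the head() output shown earlier (description truncated as displayed).
sample = pd.DataFrame({
    "title": ["Citizen Kane"],
    "genre": ["Drama"],
    "director": ["Orson Welles"],
    "year": [1941],
    "producer": ["RKO Radio Pictures"],
    "actor": ["Joseph Cotten"],
    "description": ["Following the death of a publishing tycoon, ne..."],
})

# Keep only the matched columns, then rename them to the warehouse names.
matched = sample[list(SCHEMA_MAPPING)].rename(columns=SCHEMA_MAPPING)
print(matched.columns.tolist())
```

Note that `year` holds only a year, so `release_date` would still need a conversion step that the source does not specify.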

Here, age_rating is stored as a number in the source data, so we convert it to the common format used by the content_rating reference table.

## The age_rating field

In [4]:
df_metacritic_movie['age_rating'].unique()

array([15, 18,  0, -5, 13, 17, 14,  7], dtype=int64)

content_rating descriptions (the numbers in parentheses are the source age_rating values mapped to each rating; PG has no matching source value):

- G (0): GENERAL AUDIENCES: ALL AGES ADMITTED
- PG: PARENTAL GUIDANCE SUGGESTED: SOME MATERIAL MAY NOT BE SUITABLE FOR CHILDREN
- PG-13 (7): PARENTS STRONGLY CAUTIONED: SOME MATERIAL MAY BE INAPPROPRIATE FOR CHILDREN UNDER 13
- R (13, 14, 15): RESTRICTED: UNDER 17 REQUIRES ACCOMPANYING PARENT OR ADULT GUARDIAN
- NC-17 (17, 18): NO ONE 17 AND UNDER ADMITTED
- NR (-5): NOT RATED: THE CONTENT OF THIS FILM HAS NOT BEEN EVALUATED (TRAILER)

In [12]:
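# Map Metacritic's numeric age_rating codes to the warehouse content_rating labels.
replace_age_rating_dict = {0: 'G', 7: 'PG-13', 13: 'R', 14: 'R', 15: 'R', 17: 'NC-17', 18: 'NC-17', -5: 'NR'}

def preprocess_age_rating(age_rating):
    # Raises KeyError for a code that is missing from the dictionary.
    return replace_age_rating_dict[age_rating]

# Re-running this cell fails, because the column then already holds labels instead of codes.
df_metacritic_movie['age_rating'] = [preprocess_age_rating(i) for i in df_metacritic_movie['age_rating']]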

In [13]:
df_metacritic_movie['age_rating'].unique()

array(['R', 'NC-17', 'G', 'NR', 'PG-13'], dtype=object)
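An alternative way to express the same conversion is `Series.map` combined with an explicit check for codes that have no mapping. This is only a sketch: the function `map_age_rating` and the constant name are hypothetical, and the input series simply reuses the distinct codes printed by `unique()` earlier.

```python
import pandas as pd

# Same code-to-rating mapping as in the preprocessing cell above.
AGE_RATING_TO_CONTENT_RATING = {
    0: "G", 7: "PG-13", 13: "R", 14: "R", 15: "R",
    17: "NC-17", 18: "NC-17", -5: "NR",
}


def map_age_rating(age_rating):
    """Map numeric age ratings to content ratings, failing loudly on unknown codes."""
    mapped = age_rating.map(AGE_RATING_TO_CONTENT_RATING)
    # Series.map yields NaN for keys missing from the dictionary.
    unmapped = age_rating[mapped.isna()].unique().tolist()
    if unmapped:
        raise ValueError(f"Unmapped age_rating values: {unmapped}")
    return mapped


# The distinct codes reported for the age_rating column.
codes = pd.Series([15, 18, 0, -5, 13, 17, 14, 7])
print(map_age_rating(codes).tolist())
# ['R', 'NC-17', 'G', 'NR', 'R', 'NC-17', 'R', 'PG-13']
```

Reporting every unmapped code in one error message makes it easier to extend the dictionary when a new source introduces additional values.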

In [14]:
df_metacritic_movie.head()

Unnamed: 0,title,age_rating,rating,rank,genre,director,year,producer,actor,runtime,description,img,url
0,Citizen Kane,R,8.4,1,Drama,Orson Welles,1941,RKO Radio Pictures,Joseph Cotten,119,"Following the death of a publishing tycoon, ne...",https://static.metacritic.com/images/products/...,https://www.metacritic.com/movie/citizen-kane
1,The Godfather,NC-17,9.2,2,Drama,Francis Ford Coppola,1972,Paramount Pictures,Al Pacino,175,Francis Ford Coppola's epic features Marlon Br...,https://static.metacritic.com/images/products/...,https://www.metacritic.com/movie/the-godfather
2,Rear Window,G,8.8,3,Mystery,Alfred Hitchcock,1954,Paramount Pictures,Frank Cady,112,A wheelchair-bound photographer spies on his n...,https://static.metacritic.com/images/products/...,https://www.metacritic.com/movie/rear-window
3,Casablanca,R,8.9,4,Drama,Michael Curtiz,1943,Warner Bros.,Humphrey Bogart,102,"A Casablanca, Morocco casino owner in 1941 she...",https://static.metacritic.com/images/products/...,https://www.metacritic.com/movie/casablanca
4,Boyhood,NC-17,7.5,5,Drama,Richard Linklater,2014,IFC Films,Bonnie Cross,165,"Filmed over 12 years with the same cast, Richa...",https://static.metacritic.com/images/products/...,https://www.metacritic.com/movie/boyhood


In [18]:
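# df_warehouse_movie is not defined in this notebook; it is assumed to hold the
# warehouse Movie Info table loaded beforehand.
pd.DataFrame(list(df_warehouse_movie['content_rating'].unique()), columns=["content_rating"]).to_csv(
    "H:/DataIntegation/warehouse/content_rating.csv",  # forward slashes avoid invalid escapes such as "\D"
    index=False
)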

## The genres field

In [16]:
df_metacritic_movie["genre"].unique()

array(['Drama', 'Mystery', 'Comedy', 'Fantasy', 'Adventure', 'Action',
       'Biography', 'Documentary', 'Sci-Fi', 'History', 'Thriller',
       'Western', 'Music', 'Crime', 'War', 'Horror', 'Romance',
       'Animation', 'Family', 'Sport', 'Musical', 'untagged'],
      dtype=object)

From the two genre lists, we can build the following mapping:

|Data warehouse Genres|Metacritic Genres|
| :- | :- |
|'Classic'||
|'Documentary'|'Documentary'|
|'Western'|'Western'|
|'Horror'|'Horror'|
|'Science Fiction & Fantasy'|'Fantasy', 'Sci-Fi'|
|'Gay & Lesbian'||
|'Drama'|'Drama'|
|'Comedy'|'Comedy'|
|'Cult Movies'||
|'Romance'|'Romance'|
|'Television'||
|'Sport & Fitness'||
|'Art House & International'||
|'Special Interest'||
|'Animation'|'Animation'|
|'Musical & Performing Arts'|'Musical', 'Music'|
|'Faith & Spirituality'||
|'Mystery & Suspense'|'Mystery'|
|'Anime & Manga'||
|'Action & Adventure'|'Action', 'Adventure'|
|'Kids & Family'|'Family'|

Values such as Biography, History, and untagged have no counterpart in the warehouse list, so we add them to the data warehouse.

In [18]:
# Map Metacritic genre names to the warehouse genre names; unmapped genres are kept as-is.
replace_genres_dict = {'Mystery': 'Mystery & Suspense', 'Thriller': 'Thrill',
                       'Sci-Fi': 'Science Fiction & Fantasy', 'Fantasy': 'Science Fiction & Fantasy',
                       'Adventure': 'Action & Adventure', 'Action': 'Action & Adventure',
                       'Musical': 'Musical & Performing Arts', 'Music': 'Musical & Performing Arts',
                       'Family': 'Kids & Family'}

def preprocess_genres(genre):
    # Fall back to the original value when no mapping exists.
    return replace_genres_dict.get(genre, genre)

df_metacritic_movie["genre"] = [preprocess_genres(i) for i in df_metacritic_movie["genre"]]

In [19]:
df_metacritic_movie["genre"].unique()

array(['Drama', 'Mystery & Suspense', 'Comedy',
       'Science Fiction & Fantasy', 'Action & Adventure', 'Biography',
       'Documentary', 'History', 'Thrill', 'Western',
       'Musical & Performing Arts', 'Crime', 'War', 'Horror', 'Romance',
       'Animation', 'Kids & Family', 'Sport', 'untagged'], dtype=object)
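To see which preprocessed genre values the warehouse list does not yet contain, a set difference is enough. The sketch below copies both lists from the mapping table and the `unique()` output; the variable names are illustrative assumptions.

```python
# Warehouse genres, copied from the mapping table in this section.
warehouse_genres = {
    "Classic", "Documentary", "Western", "Horror", "Science Fiction & Fantasy",
    "Gay & Lesbian", "Drama", "Comedy", "Cult Movies", "Romance", "Television",
    "Sport & Fitness", "Art House & International", "Special Interest",
    "Animation", "Musical & Performing Arts", "Faith & Spirituality",
    "Mystery & Suspense", "Anime & Manga", "Action & Adventure", "Kids & Family",
}

# Genres after preprocessing, copied from the unique() output.
metacritic_genres = [
    "Drama", "Mystery & Suspense", "Comedy", "Science Fiction & Fantasy",
    "Action & Adventure", "Biography", "Documentary", "History", "Thrill",
    "Western", "Musical & Performing Arts", "Crime", "War", "Horror",
    "Romance", "Animation", "Kids & Family", "Sport", "untagged",
]

# Values that the warehouse list does not contain yet.
missing = sorted(set(metacritic_genres) - warehouse_genres)
print(missing)
# ['Biography', 'Crime', 'History', 'Sport', 'Thrill', 'War', 'untagged']
```

On these lists the difference also contains Crime, Sport, Thrill, and War, so the result may be worth reviewing before the warehouse list is extended.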

In [None]:
# Add the new values to the warehouse genres list
# Keep the path as-is
warehouse_genres_list = list(pd.read_csv("/Users/trananhvu/Documents/Tichhopdulieu/Data_Integration_Group23/Data/warehouse/field_value/genres.csv")["genres"])
# "History" is capitalized so that it matches the value in the Metacritic data.
warehouse_genres_list += ["Biography", "History", "untagged"]
# Remove duplicates (the resulting order is arbitrary).
warehouse_genres_list = list(set(warehouse_genres_list))
pd.DataFrame(warehouse_genres_list, columns=["genres"]).to_csv(
    "/Users/trananhvu/Documents/Tichhopdulieu/Data_Integration_Group23/Data/warehouse/field_value/genres.csv",
    index=False
)

## Save the preprocessed data

In [20]:
df_metacritic_movie.to_csv("H:/DataIntegation/metacritic/movies_preprocess.csv", index=False)

In [21]:
df_metacritic_movie_process = pd.read_csv("H:/DataIntegation/metacritic/movies_preprocess.csv")
df_metacritic_movie_process.head()

Unnamed: 0,title,age_rating,rating,rank,genre,director,year,producer,actor,runtime,description,img,url
0,Citizen Kane,R,8.4,1,Drama,Orson Welles,1941,RKO Radio Pictures,Joseph Cotten,119,"Following the death of a publishing tycoon, ne...",https://static.metacritic.com/images/products/...,https://www.metacritic.com/movie/citizen-kane
1,The Godfather,NC-17,9.2,2,Drama,Francis Ford Coppola,1972,Paramount Pictures,Al Pacino,175,Francis Ford Coppola's epic features Marlon Br...,https://static.metacritic.com/images/products/...,https://www.metacritic.com/movie/the-godfather
2,Rear Window,G,8.8,3,Mystery & Suspense,Alfred Hitchcock,1954,Paramount Pictures,Frank Cady,112,A wheelchair-bound photographer spies on his n...,https://static.metacritic.com/images/products/...,https://www.metacritic.com/movie/rear-window
3,Casablanca,R,8.9,4,Drama,Michael Curtiz,1943,Warner Bros.,Humphrey Bogart,102,"A Casablanca, Morocco casino owner in 1941 she...",https://static.metacritic.com/images/products/...,https://www.metacritic.com/movie/casablanca
4,Boyhood,NC-17,7.5,5,Drama,Richard Linklater,2014,IFC Films,Bonnie Cross,165,"Filmed over 12 years with the same cast, Richa...",https://static.metacritic.com/images/products/...,https://www.metacritic.com/movie/boyhood
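A quick way to confirm that the saved file reloads unchanged is a round-trip comparison. The sketch below writes to a temporary directory rather than the notebook's real path and uses only a few columns from two rows of the preview; the frame `df` and the temporary file location are assumptions made for illustration.

```python
import os
import tempfile

import pandas as pd

# Two rows taken from the preview above (only a few columns, for brevity).
df = pd.DataFrame({
    "title": ["Citizen Kane", "Rear Window"],
    "age_rating": ["R", "G"],
    "genre": ["Drama", "Mystery & Suspense"],
    "year": [1941, 1954],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "movies_preprocess.csv")
    df.to_csv(path, index=False)  # index=False avoids writing an extra index column
    reloaded = pd.read_csv(path)

# Raises AssertionError if the reloaded frame differs from the original.
pd.testing.assert_frame_equal(df, reloaded)
print(reloaded.columns.tolist())
```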
