### Data_Movies.csv
This file contains aggregated information on about 12,000 films. Each film has 24 attributes; the main ones are:

1. adult: Whether the film is for adults only. Boolean data (True/False)
2. original_language: The film's original language; categorical data
3. genres: The film's genres
4. original_title: The film's original title; text data
5. overview: A summary of the film's plot; text data
6. release_date: The film's release date
7. vote_average: Average rating of the film (0: terrible, 10: excellent)
8. vote_count: Number of votes the film received

In [1]:
# Import libraries
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
import ast

In [2]:
# Read the film metadata dataset
file_path = 'Data/Data_Movies.csv'
data = pd.read_csv(file_path)

# Show a summary of the dataset
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12182 entries, 0 to 12181
Data columns (total 24 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   adult                  12182 non-null  bool   
 1   belongs_to_collection  2409 non-null   object 
 2   budget                 12182 non-null  int64  
 3   genres                 12182 non-null  object 
 4   homepage               3330 non-null   object 
 5   id                     12182 non-null  int64  
 6   imdb_id                12180 non-null  object 
 7   original_language      12182 non-null  object 
 8   original_title         12182 non-null  object 
 9   overview               12119 non-null  object 
 10  popularity             12182 non-null  float64
 11  poster_path            12182 non-null  object 
 12  production_companies   12182 non-null  object 
 13  production_countries   12182 non-null  object 
 14  release_date           12180 non-null  object 
 15  re

In [3]:
# Show the first 5 rows
data.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0


In [4]:
# Show a summary of the dataset
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12182 entries, 0 to 12181
Data columns (total 24 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   adult                  12182 non-null  bool   
 1   belongs_to_collection  2409 non-null   object 
 2   budget                 12182 non-null  int64  
 3   genres                 12182 non-null  object 
 4   homepage               3330 non-null   object 
 5   id                     12182 non-null  int64  
 6   imdb_id                12180 non-null  object 
 7   original_language      12182 non-null  object 
 8   original_title         12182 non-null  object 
 9   overview               12119 non-null  object 
 10  popularity             12182 non-null  float64
 11  poster_path            12182 non-null  object 
 12  production_companies   12182 non-null  object 
 13  production_countries   12182 non-null  object 
 14  release_date           12180 non-null  object 
 15  re

In [5]:
# Convert release_date to datetime; values that cannot be parsed become NaT
data['release_date'] = pd.to_datetime(data['release_date'], errors='coerce')

# Create a year column from release_date
data['year'] = data['release_date'].dt.year

# Compute the profit ratio (profit_margin) as (revenue - budget) / budget, only where budget > 0
data['profit_margin'] = data.apply(
    lambda x: (x['revenue'] - x['budget']) / x['budget'] if x['budget'] > 0 else None, axis=1
)

# Parse the columns stored as stringified lists of dicts (genres and production_companies)
def extract_names(json_string):
    try:
        items = ast.literal_eval(json_string)
        return [item['name'] for item in items] if isinstance(items, list) else []
    except (ValueError, SyntaxError):
        return []

data['genres_list'] = data['genres'].apply(extract_names)
data['production_companies_list'] = data['production_companies'].apply(extract_names)

# Add a genres_count column (number of genres per film)
data['genres_count'] = data['genres_list'].apply(len)

# Fill missing runtime values with the column mean
data['runtime'] = data['runtime'].fillna(data['runtime'].mean())
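
The row-wise `apply` used for `profit_margin` can also be written with column arithmetic. The sketch below uses a made-up four-row frame (`movies`, illustrative values only) and adds a second, hypothetical column `profit_margin_reported`. That column rests on the assumption that a revenue of 0 means "not reported", which this notebook does not confirm.

```python
import pandas as pd

# Toy frame standing in for the real data (values are illustrative only)
movies = pd.DataFrame({
    "budget": [100, 0, 200, 50],
    "revenue": [300.0, 500.0, 0.0, 25.0],
})

# Vectorized equivalent of the row-wise lambda: NaN wherever budget is not positive
ratio = (movies["revenue"] - movies["budget"]) / movies["budget"]
movies["profit_margin"] = ratio.where(movies["budget"] > 0)

# Assumption: revenue == 0 means "not reported", so mask it instead of keeping -1.0
movies["profit_margin_reported"] = movies["profit_margin"].where(movies["revenue"] > 0)

print(movies)
```

With the formula as written, any film that has a positive budget and a revenue of 0 gets a ratio of exactly -1.0, so such rows may be worth inspecting before the column is used in analysis.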

In [6]:
data

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,tagline,title,video,vote_average,vote_count,year,profit_margin,genres_list,production_companies_list,genres_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,,Toy Story,False,7.7,5415.0,1995.0,11.451801,"[Animation, Comedy, Family]",[Pixar Animation Studios],3
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0,1995.0,3.043035,"[Adventure, Fantasy, Family]","[TriStar Pictures, Teitler Film, Interscope Co...",3
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0,1995.0,,"[Romance, Comedy]","[Warner Bros., Lancaster Gate]",2
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0,1995.0,4.090760,"[Comedy, Drama, Romance]",[Twentieth Century Fox Film Corporation],3
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0,1995.0,,[Comedy],"[Sandollar Productions, Touchstone Pictures]",1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12177,False,"{'id': 123720, 'name': 'Frankenstein (Hammer S...",0,"[{'id': 27, 'name': 'Horror'}, {'id': 878, 'na...",,3104,tt0061683,en,Frankenstein Created Woman,A deformed tormented girl drowns herself after...,...,Now Frankenstein has created a beautiful woman...,Frankenstein Created Woman,False,5.9,33.0,1967.0,,"[Horror, Science Fiction]",[Hammer Film Productions],2
12178,False,,0,"[{'id': 35, 'name': 'Comedy'}, {'id': 80, 'nam...",,426272,tt6598626,en,Take Me,Ray is a fledgling entrepreneur who specialize...,...,A good hostage is hard to find.,Take Me,False,6.0,38.0,2017.0,,"[Comedy, Crime]",[Duplass Brothers Productions],2
12179,False,,0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",https://www.netflix.com/title/80171022,432789,tt5990342,en,The Incredible Jessica James,"Burned by a bad breakup, a struggling New York...",...,"Likes are easy, love is hard.",The Incredible Jessica James,False,6.2,37.0,2017.0,,"[Romance, Comedy]",[],2
12180,False,,0,"[{'id': 10751, 'name': 'Family'}, {'id': 16, '...",,455661,tt6969946,en,In a Heartbeat,A closeted boy runs the risk of being outed by...,...,The Heart Wants What The Heart Wants,In a Heartbeat,False,8.3,146.0,2017.0,,"[Family, Animation, Romance, Comedy]",[Ringling College of Art and Design],4
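
The `genres_list` and `production_companies_list` columns shown above come from `extract_names`. The raw values use single quotes, so they are Python literals rather than valid JSON, which is why `ast.literal_eval` is used instead of `json.loads`. The sketch below repeats the function and runs it on three hand-written inputs to show how it behaves on well-formed, truncated, and missing values.

```python
import ast

def extract_names(json_string):
    """Return the 'name' of each dict in a stringified list; [] if parsing fails."""
    try:
        items = ast.literal_eval(json_string)
        return [item["name"] for item in items] if isinstance(items, list) else []
    except (ValueError, SyntaxError):
        return []

raw = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"
parsed = extract_names(raw)             # well-formed value
broken = extract_names("[{'id': 16")    # truncated string raises SyntaxError, so []
missing = extract_names(float("nan"))   # NaN is not a string, raises ValueError, so []

print(parsed, broken, missing)
```

A list entry without a `'name'` key would raise `KeyError`, which the function does not catch.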


In [7]:
# Count null values per column
data.isnull().sum()

adult                           0
belongs_to_collection        9773
budget                          0
genres                          0
homepage                     8852
id                              0
imdb_id                         2
original_language               0
original_title                  0
overview                       63
popularity                      0
poster_path                     0
production_companies            0
production_countries            0
release_date                    2
revenue                         0
runtime                         0
spoken_languages                0
status                          1
tagline                      3400
title                           0
video                           0
vote_average                    0
vote_count                      0
year                            2
profit_margin                6105
genres_list                     0
production_companies_list       0
genres_count                    0
dtype: int64
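
The raw counts above are easier to compare as shares of the 12,182 rows: `belongs_to_collection` is null in roughly 80% of rows and `homepage` in roughly 73%. The sketch below shows one way to compute those shares and list sparse columns. The toy frame `df` and the 0.5 `threshold` are hypothetical and are not the criterion used in this notebook.

```python
import pandas as pd

# Toy frame (illustrative values only)
df = pd.DataFrame({
    "homepage": [None, None, None, "http://example.com"],
    "title": ["A", "B", "C", "D"],
    "tagline": [None, "x", "y", "z"],
})

# Fraction of null values per column, largest first
null_share = df.isnull().mean().sort_values(ascending=False)

# Hypothetical cut-off: flag columns that are null in more than half the rows
threshold = 0.5
sparse_columns = null_share[null_share > threshold].index.tolist()

print(null_share)
print(sparse_columns)
```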

In [8]:
# Drop the columns that are not needed
columns_to_drop = ['homepage', 'belongs_to_collection', 'tagline', 'poster_path','genres','production_companies']
data = data.drop(columns=columns_to_drop)

In [9]:
# Sort the data by release date
data.sort_values('release_date',axis=0,inplace=True)
data

Unnamed: 0,adult,budget,id,imdb_id,original_language,original_title,overview,popularity,production_countries,release_date,...,status,title,video,vote_average,vote_count,year,profit_margin,genres_list,production_companies_list,genres_count
9712,False,0,774,tt0000010,fr,La Sortie de l'Usine Lumière à Lyon,Working men and women leave the Lumière factor...,0.693917,"[{'iso_3166_1': 'FR', 'name': 'France'}]",1895-06-10,...,Released,Workers Leaving the Lumière Factory,False,6.2,52.0,1895.0,,[Documentary],[Société Lumière],1
9272,False,0,82120,tt0000014,fr,Arroseur et arrosé,"A gardener is watering his flowers, when a mis...",1.963421,"[{'iso_3166_1': 'FR', 'name': 'France'}]",1895-12-27,...,Released,Tables Turned on the Gardener,False,7.0,44.0,1895.0,,[Comedy],[Lumière],1
8291,False,0,160,tt0000012,es,L'arrivée d'un train en gare de La Ciotat,A group of people are standing along the platf...,5.256608,"[{'iso_3166_1': 'FR', 'name': 'France'}]",1896-01-25,...,Released,The Arrival of a Train at La Ciotat,False,6.9,87.0,1896.0,,[Documentary],[Lumière],1
5035,False,5985,775,tt0000417,fr,Le Voyage dans la Lune,A Trip to The Moon is a science fiction film f...,6.321801,"[{'iso_3166_1': 'FR', 'name': 'France'}]",1902-09-01,...,Released,A Trip to the Moon,False,7.9,314.0,1902.0,-1.0,"[Adventure, Fantasy, Science Fiction]",[Star-Film],3
5634,False,150,5698,tt0000439,en,The Great Train Robbery,The clerk at the train station is assaulted an...,4.248169,"[{'iso_3166_1': 'US', 'name': 'United States o...",1903-12-01,...,Released,The Great Train Robbery,False,7.1,116.0,1903.0,-1.0,"[Action, Adventure, Western]",[Edison Manufacturing Company],3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
11826,False,0,354282,tt4536768,en,Science Fiction Volume One: The Osiris Child,Set in the future in a time of interplanetary ...,13.454531,"[{'iso_3166_1': 'AU', 'name': 'Australia'}]",2017-08-31,...,Released,Science Fiction Volume One: The Osiris Child,False,5.4,55.0,2017.0,,[Science Fiction],"[Storm Vision Entertainment, Eclectik Vision]",1
11338,False,0,300665,tt2620590,en,Leatherface,A young nurse is kidnapped by a group of viole...,9.742082,"[{'iso_3166_1': 'US', 'name': 'United States o...",2017-09-14,...,Released,Leatherface,False,5.7,62.0,2017.0,,[Horror],"[Campbell Grobman Films, LF2 Productions]",1
9820,False,0,76600,tt1630029,en,Avatar 2,A sequel to Avatar (2009).,6.020055,"[{'iso_3166_1': 'US', 'name': 'United States o...",2020-12-16,...,In Production,Avatar 2,False,0.0,58.0,2020.0,,"[Action, Adventure, Fantasy, Science Fiction]","[Twentieth Century Fox Film Corporation, Light...",4
11220,False,0,371758,tt3581932,en,And Then There Were None,"Ten strangers, drawn away from their normal li...",5.238281,"[{'iso_3166_1': 'GB', 'name': 'United Kingdom'}]",NaT,...,Released,And Then There Were None,False,7.9,91.0,,,"[Crime, Mystery, Drama]","[British Broadcasting Corporation (BBC), Mammo...",3


In [10]:
# Drop every row that contains at least one null value
data_clear = data.dropna()
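
Called without arguments, `dropna()` removes every row that has a null in any column. According to the null counts, `profit_margin` is null in 6,105 rows, so it accounts for most of the rows removed here (`data_clear.info()` in a later cell reports 6,071 remaining rows). If the profit ratio is not needed for a given analysis, restricting the check with `subset` keeps more films. A minimal sketch on a made-up frame:

```python
import pandas as pd

# Toy frame (illustrative values only)
df = pd.DataFrame({
    "title": ["A", "B", "C"],
    "overview": ["text", None, "text"],
    "profit_margin": [1.5, 2.0, None],
})

strict = df.dropna()                      # drops B (no overview) and C (no profit_margin)
lenient = df.dropna(subset=["overview"])  # drops only B

print(len(strict), len(lenient))
```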

In [11]:
data_clear

Unnamed: 0,adult,budget,id,imdb_id,original_language,original_title,overview,popularity,production_countries,release_date,...,status,title,video,vote_average,vote_count,year,profit_margin,genres_list,production_companies_list,genres_count
5035,False,5985,775,tt0000417,fr,Le Voyage dans la Lune,A Trip to The Moon is a science fiction film f...,6.321801,"[{'iso_3166_1': 'FR', 'name': 'France'}]",1902-09-01,...,Released,A Trip to the Moon,False,7.9,314.0,1902.0,-1.000000,"[Adventure, Fantasy, Science Fiction]",[Star-Film],3
5634,False,150,5698,tt0000439,en,The Great Train Robbery,The clerk at the train station is assaulted an...,4.248169,"[{'iso_3166_1': 'US', 'name': 'United States o...",1903-12-01,...,Released,The Great Train Robbery,False,7.1,116.0,1903.0,-1.000000,"[Action, Adventure, Western]",[Edison Manufacturing Company],3
10029,False,7500,2963,tt0000499,fr,Voyage à travers l'impossible,"Using every known means of transportation, sev...",1.529176,"[{'iso_3166_1': 'FR', 'name': 'France'}]",1904-10-01,...,Released,The Impossible Voyage,False,7.1,32.0,1904.0,-1.000000,"[Adventure, Comedy, Fantasy, Science Fiction]","[Star-Film, Georges Méliès]",4
3867,False,100000,618,tt0004972,en,The Birth of a Nation,The Birth of A Nation is a silent film from 19...,5.113205,"[{'iso_3166_1': 'US', 'name': 'United States o...",1915-02-08,...,Released,The Birth of a Nation,False,6.4,109.0,1915.0,109.000000,"[Drama, History, War]",[Epoch Film Co.],3
3956,False,8394751,3059,tt0006864,en,Intolerance: Love's Struggle Throughout the Ages,"The story of a poor young woman, separated by ...",4.282118,"[{'iso_3166_1': 'US', 'name': 'United States o...",1916-09-04,...,Released,Intolerance: Love's Struggle Throughout the Ages,False,7.4,63.0,1916.0,-1.000000,[Drama],"[Triangle Film Corporation, Wark Producing Corp.]",1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
12151,False,50000000,378236,tt4877122,en,The Emoji Movie,"Gene, a multi-expressional emoji, sets out on ...",33.694599,"[{'iso_3166_1': 'US', 'name': 'United States o...",2017-07-28,...,Released,The Emoji Movie,False,5.8,327.0,2017.0,0.338279,"[Comedy, Family, Animation]","[Columbia Pictures, Sony Pictures Animation]",3
12163,False,34000000,407448,tt5390504,en,Detroit,A police raid in Detroit in 1967 results in on...,9.797505,"[{'iso_3166_1': 'US', 'name': 'United States o...",2017-07-28,...,Released,Detroit,False,7.3,67.0,2017.0,-1.000000,"[Thriller, Crime, Drama, History]","[Metro-Goldwyn-Mayer (MGM), Annapurna Pictures...",4
12156,False,11000000,395834,tt5362988,en,Wind River,An FBI agent teams with the town's veteran gam...,40.796775,"[{'iso_3166_1': 'GB', 'name': 'United Kingdom'...",2017-08-03,...,Released,Wind River,False,7.4,181.0,2017.0,15.797291,"[Action, Crime, Mystery, Thriller]","[Thunder Road Pictures, Star Thrower Entertain...",4
12142,False,60000000,353491,tt1648190,en,The Dark Tower,"The last Gunslinger, Roland Deschain, has been...",50.903593,"[{'iso_3166_1': 'ZA', 'name': 'South Africa'},...",2017-08-03,...,Released,The Dark Tower,False,5.7,688.0,2017.0,0.183333,"[Action, Western, Science Fiction, Fantasy, Ho...","[Imagine Entertainment, Weed Road Pictures, Me...",5


In [12]:
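# Check for remaining null values. Note: this inspects `data`, which dropna() leaves unchanged,
# so nulls still appear below; data_clear.isnull().sum() would check the cleaned frame.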
data.isnull().sum()

adult                           0
budget                          0
id                              0
imdb_id                         2
original_language               0
original_title                  0
overview                       63
popularity                      0
production_countries            0
release_date                    2
revenue                         0
runtime                         0
spoken_languages                0
status                          1
title                           0
video                           0
vote_average                    0
vote_count                      0
year                            2
profit_margin                6105
genres_list                     0
production_companies_list       0
genres_count                    0
dtype: int64

In [13]:
# Show a summary of the cleaned data
data_clear.info()

<class 'pandas.core.frame.DataFrame'>
Index: 6071 entries, 5035 to 12068
Data columns (total 23 columns):
 #   Column                     Non-Null Count  Dtype         
---  ------                     --------------  -----         
 0   adult                      6071 non-null   bool          
 1   budget                     6071 non-null   int64         
 2   id                         6071 non-null   int64         
 3   imdb_id                    6071 non-null   object        
 4   original_language          6071 non-null   object        
 5   original_title             6071 non-null   object        
 6   overview                   6071 non-null   object        
 7   popularity                 6071 non-null   float64       
 8   production_countries       6071 non-null   object        
 9   release_date               6071 non-null   datetime64[ns]
 10  revenue                    6071 non-null   float64       
 11  runtime                    6071 non-null   float64       
 12  spoken_

In [14]:
# Count films that share the same title in the dataset
data_clear['original_title'].value_counts()

original_title
Inferno                3
The Mummy              3
King Kong              3
Life                   3
Alice in Wonderland    3
                      ..
The Straight Story     1
The Story of Us        1
Random Hearts          1
Fortress 2             1
Kidnap                 1
Name: count, Length: 5939, dtype: int64

In [15]:
# Sort films by vote_count and drop films with duplicate titles,
# keeping the film with the higher vote count
data_clear.sort_values('vote_count',ascending=True,inplace=True)
data_clear.drop_duplicates(['original_title'],keep='last',inplace=True)
data_clear.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5939 entries, 9558 to 7116
Data columns (total 23 columns):
 #   Column                     Non-Null Count  Dtype         
---  ------                     --------------  -----         
 0   adult                      5939 non-null   bool          
 1   budget                     5939 non-null   int64         
 2   id                         5939 non-null   int64         
 3   imdb_id                    5939 non-null   object        
 4   original_language          5939 non-null   object        
 5   original_title             5939 non-null   object        
 6   overview                   5939 non-null   object        
 7   popularity                 5939 non-null   float64       
 8   production_countries       5939 non-null   object        
 9   release_date               5939 non-null   datetime64[ns]
 10  revenue                    5939 non-null   float64       
 11  runtime                    5939 non-null   float64       
 12  spoken_l
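
The deduplication above keeps one row per `original_title`. Two films with the same title are not necessarily the same film (a remake, for example), so keying on the title together with the release year is one possible alternative. The sketch below shows both variants without `inplace`. The frame `films` and its numbers are made up for illustration, and the title-plus-year rule is an assumption, not the notebook's method.

```python
import pandas as pd

# Toy frame (vote counts are illustrative, not taken from the dataset)
films = pd.DataFrame({
    "original_title": ["King Kong", "King Kong", "Inception"],
    "year": [1933, 2005, 2010],
    "vote_count": [300, 2400, 14000],
})

# Same logic as the cell above: sort ascending, keep the last (highest vote_count) per title
by_title = (films.sort_values("vote_count")
                 .drop_duplicates("original_title", keep="last"))

# Alternative (assumption): the same title in different years counts as different films
by_title_year = (films.sort_values("vote_count")
                      .drop_duplicates(["original_title", "year"], keep="last"))

print(len(by_title), len(by_title_year))
```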

In [16]:
# Count films that share the same title after deduplication
data_clear['original_title'].value_counts()

original_title
Secret Défense                 1
Kramer vs. Kramer              1
End of Days                    1
eXistenZ                       1
Astérix aux Jeux Olympiques    1
                              ..
Salem's Lot                    1
The Human Stain                1
The Astronaut Farmer           1
The Red Shoes                  1
Inception                      1
Name: count, Length: 5939, dtype: int64

In [17]:
# Count duplicate overview values
data_clear['overview'].value_counts()

overview
In France, terrorist groups and intelligence agencies battle in a merciless war everyday, in the name of radically opposed ideologies. Yet, terrorist and secret agents lead almost the same lives. Condemned to secrecy, these masters of manipulation follow the same methods. Alex and Al Barad are two of them. The former is the head of the D.G.S.E.'s (Direction Générale de la Sécurité Extérieure, the French equivalent of the CIA or the MI6) counter-terrorism unit while the latter reigns over a terrorist network, and both fight using the most ruthless of weapons: human beings.    1
Ted Kramer is a career man for whom his work comes before his family. His wife Joanna cannot take this anymore, so she decides to leave him. Ted is now faced with the tasks of housekeeping and taking care of himself and their young son Billy.                                                                                                                                                                     

In [18]:
# Filter films whose overview is the placeholder 'No overview found.'
# (other placeholders such as 'No Overview', an empty string, or 'No movie overview available.' are not checked here)
data_clear.loc[(data_clear['overview']=='No overview found.')].sort_values('overview')

Unnamed: 0,adult,budget,id,imdb_id,original_language,original_title,overview,popularity,production_countries,release_date,...,status,title,video,vote_average,vote_count,year,profit_margin,genres_list,production_companies_list,genres_count
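
The filter above matches only the exact string `'No overview found.'`. To cover the other placeholder variants in one pass, the overview can be normalized and compared against a list. The sketch below uses a made-up frame; the `placeholders` list is an assumption about which strings count as "no overview".

```python
import pandas as pd

# Toy frame (illustrative values only)
films = pd.DataFrame({
    "original_title": ["A", "B", "C", "D", "E"],
    "overview": [
        "A real summary.",
        "No overview found.",
        "No Overview",
        "  ",
        "No movie overview available.",
    ],
})

# Assumed placeholder strings, compared after trimming whitespace and lowercasing
placeholders = ["no overview found.", "no overview", "", "no movie overview available."]
normalized = films["overview"].str.strip().str.lower()
flagged = films[normalized.isin(placeholders)]

print(flagged["original_title"].tolist())
```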


In [19]:
# Save the cleaned data to Data_Movies_clear.csv
data_clear.sort_values(['release_date'],inplace=True)
data_clear.reset_index(drop=True,inplace=True)
data_clear.to_csv('D:/streamlit_project/Data/Data_Movies_clear.csv', index=False, encoding='utf-8-sig')
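
CSV stores every cell as text, so list columns such as `genres_list` come back as strings and `release_date` comes back as plain text when the saved file is read again. The sketch below shows one way to restore both on load. It writes to an in-memory buffer rather than the path above, and the one-row frame is illustrative.

```python
import ast
import io

import pandas as pd

# One-row frame mirroring the saved columns (illustrative)
films = pd.DataFrame({
    "title": ["Toy Story"],
    "release_date": pd.to_datetime(["1995-10-30"]),
    "genres_list": [["Animation", "Comedy", "Family"]],
})

# Write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
films.to_csv(buffer, index=False)
buffer.seek(0)

# Restore the datetime column and turn the stringified list back into a list
reloaded = pd.read_csv(
    buffer,
    parse_dates=["release_date"],
    converters={"genres_list": ast.literal_eval},
)

print(reloaded.dtypes)
```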