### Data_Movies.csv:
This file contains aggregated information on roughly 12,000 films. Each film has 24 attributes; the main ones include:

1. adult: Whether the film is for adults only. Boolean (True / False)
2. original_language: The film's original language; categorical data
3. genres: The film's genres
4. original_title: The film's original title; text data
5. overview: A summary of the film's plot; text data
6. release_date: The film's release date
7. vote_average: Average rating of the film [0: terrible - 10: excellent]
8. vote_count: Number of votes (ratings) the film received

In [1]:
# Import libraries
import pandas as pd
import warnings
warnings.filterwarnings('ignore')
import matplotlib.pyplot as plt
import seaborn as sns
from collections import Counter
import ast

In [2]:
# Read the film metadata dataset
file_path = 'Data/Data_Movies.csv'
data = pd.read_csv(file_path)

# Show a summary of the dataset
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12182 entries, 0 to 12181
Data columns (total 24 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   adult                  12182 non-null  bool   
 1   belongs_to_collection  2409 non-null   object 
 2   budget                 12182 non-null  int64  
 3   genres                 12182 non-null  object 
 4   homepage               3330 non-null   object 
 5   id                     12182 non-null  int64  
 6   imdb_id                12180 non-null  object 
 7   original_language      12182 non-null  object 
 8   original_title         12182 non-null  object 
 9   overview               12119 non-null  object 
 10  popularity             12182 non-null  float64
 11  poster_path            12182 non-null  object 
 12  production_companies   12182 non-null  object 
 13  production_countries   12182 non-null  object 
 14  release_date           12180 non-null  object 
 15  re

In [3]:
# Show the first 5 rows
data.head()

Unnamed: 0,adult,belongs_to_collection,budget,genres,homepage,id,imdb_id,original_language,original_title,overview,...,release_date,revenue,runtime,spoken_languages,status,tagline,title,video,vote_average,vote_count
0,False,"{'id': 10194, 'name': 'Toy Story Collection', ...",30000000,"[{'id': 16, 'name': 'Animation'}, {'id': 35, '...",http://toystory.disney.com/toy-story,862,tt0114709,en,Toy Story,"Led by Woody, Andy's toys live happily in his ...",...,1995-10-30,373554033.0,81.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,,Toy Story,False,7.7,5415.0
1,False,,65000000,"[{'id': 12, 'name': 'Adventure'}, {'id': 14, '...",,8844,tt0113497,en,Jumanji,When siblings Judy and Peter discover an encha...,...,1995-12-15,262797249.0,104.0,"[{'iso_639_1': 'en', 'name': 'English'}, {'iso...",Released,Roll the dice and unleash the excitement!,Jumanji,False,6.9,2413.0
2,False,"{'id': 119050, 'name': 'Grumpy Old Men Collect...",0,"[{'id': 10749, 'name': 'Romance'}, {'id': 35, ...",,15602,tt0113228,en,Grumpier Old Men,A family wedding reignites the ancient feud be...,...,1995-12-22,0.0,101.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Still Yelling. Still Fighting. Still Ready for...,Grumpier Old Men,False,6.5,92.0
3,False,,16000000,"[{'id': 35, 'name': 'Comedy'}, {'id': 18, 'nam...",,31357,tt0114885,en,Waiting to Exhale,"Cheated on, mistreated and stepped on, the wom...",...,1995-12-22,81452156.0,127.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Friends are the people who let you be yourself...,Waiting to Exhale,False,6.1,34.0
4,False,"{'id': 96871, 'name': 'Father of the Bride Col...",0,"[{'id': 35, 'name': 'Comedy'}]",,11862,tt0113041,en,Father of the Bride Part II,Just when George Banks has recovered from his ...,...,1995-02-10,76578911.0,106.0,"[{'iso_639_1': 'en', 'name': 'English'}]",Released,Just When His World Is Back To Normal... He's ...,Father of the Bride Part II,False,5.7,173.0


In [4]:
# Show a summary of the dataset
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 12182 entries, 0 to 12181
Data columns (total 24 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   adult                  12182 non-null  bool   
 1   belongs_to_collection  2409 non-null   object 
 2   budget                 12182 non-null  int64  
 3   genres                 12182 non-null  object 
 4   homepage               3330 non-null   object 
 5   id                     12182 non-null  int64  
 6   imdb_id                12180 non-null  object 
 7   original_language      12182 non-null  object 
 8   original_title         12182 non-null  object 
 9   overview               12119 non-null  object 
 10  popularity             12182 non-null  float64
 11  poster_path            12182 non-null  object 
 12  production_companies   12182 non-null  object 
 13  production_countries   12182 non-null  object 
 14  release_date           12180 non-null  object 
 15  re

In [5]:
# Process the release_date column: convert it to datetime
data['release_date'] = pd.to_datetime(data['release_date'], errors='coerce')

# Create the year column from release_date
data['year'] = data['release_date'].dt.year
# Store release_date as a 'DD-MM-YYYY' string (it is no longer a datetime after this line)
data['release_date'] = data['release_date'].dt.strftime('%d-%m-%Y')

# Process the columns stored as stringified lists of dicts (genres, production_companies, ...)
def extract_names(json_string):
    try:
        items = ast.literal_eval(json_string)
        return ','.join(item['name'] for item in items) if isinstance(items, list) else ''
    except (ValueError, SyntaxError):
        return ''

data['genres_list'] = data['genres'].apply(extract_names)
data['production_companies_list'] = data['production_companies'].apply(extract_names)
data['production_countries_list'] = data['production_countries'].apply(extract_names)
data['spoken_language_list'] = data['spoken_languages'].apply(extract_names)

# Move the id column to the front
data = data[['id'] + [col for col in data.columns if col != 'id']]

# Fill missing runtime values with the column mean
# (assignment instead of inplace=True on a column selection, which may not update data)
data['runtime'] = data['runtime'].fillna(data['runtime'].mean())
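To make the behaviour of `extract_names` concrete, here is a minimal, self-contained sketch that applies the same logic to a few hand-written strings. The sample strings are illustrative inputs shaped like the `genres` column, not rows taken from the file.

```python
import ast

def extract_names(json_string):
    """Return the 'name' fields joined by commas, or '' if the input cannot be parsed."""
    try:
        items = ast.literal_eval(json_string)
    except (ValueError, SyntaxError):
        return ''
    if not isinstance(items, list):
        return ''
    return ','.join(item['name'] for item in items)

# Illustrative inputs (assumed shape: a stringified list of dicts with a 'name' key)
sample = "[{'id': 16, 'name': 'Animation'}, {'id': 35, 'name': 'Comedy'}]"
names = extract_names(sample)          # two names joined by a comma
empty = extract_names("[]")            # empty list gives an empty string
broken = extract_names("not a list")   # unparseable text gives an empty string

print(repr(names), repr(empty), repr(broken))
```

Because both an empty list and an unparseable value map to `''`, the later cells can treat the empty string as the single marker for "no usable value" in these columns.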

In [6]:
# Check for null values
data.isnull().sum()

id                              0
adult                           0
belongs_to_collection        9773
budget                          0
genres                          0
homepage                     8852
imdb_id                         2
original_language               0
original_title                  0
overview                       63
popularity                      0
poster_path                     0
production_companies            0
production_countries            0
release_date                    2
revenue                         0
runtime                         0
spoken_languages                0
status                          1
tagline                      3400
title                           0
video                           0
vote_average                    0
vote_count                      0
year                            2
genres_list                     0
production_companies_list       0
production_countries_list       0
spoken_language_list            0
dtype: int64
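Raw null counts are easier to judge as a share of the rows: for example, 9773 missing `belongs_to_collection` values out of 12182 rows is about 80%. The sketch below shows one way to get that share directly, using a small made-up frame (the column values are placeholders, not data from the file).

```python
import pandas as pd

# Toy frame standing in for the real data (values are made up)
toy = pd.DataFrame({
    'homepage': [None, 'http://example.com', None, None],
    'title': ['a', 'b', 'c', 'd'],
})

# isnull() yields booleans; their mean is the fraction of missing values per column
null_share = toy.isnull().mean()
print(null_share.sort_values(ascending=False))
```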

In [7]:
# Inspect the values of the adult column
data['adult'].value_counts()

adult
False    12182
Name: count, dtype: int64

In [8]:
# Inspect the values of the video column
data['video'].value_counts()

video
False    12177
True         5
Name: count, dtype: int64

In [9]:
# Inspect the values of the tagline column
data['tagline'].value_counts()

tagline
Based on a true story.                                                                  7
Be careful what you wish for.                                                           4
There is no turning back                                                                3
How far would you go?                                                                   3
Every second counts.                                                                    2
                                                                                       ..
Schmidt Happens                                                                         1
A generation's final journey... begins.                                                 1
The hottest chick in town just switched bodies with the luckiest loser in the world.    1
Half time is game time                                                                  1
Too Cool For The Rules!                                                                 1
Na

In [10]:
# Inspect the values of the homepage column
data['homepage'].value_counts()

homepage
http://www.kungfupanda.com/                              4
http://www.missionimpossible.com/                        4
http://www.transformersmovie.com/                        4
http://phantasm.com                                      4
http://www.thehungergames.movie/                         4
                                                        ..
http://www.faqmovie.co.uk/                               1
http://www.sonyclassics.com/thewhiteribbon/              1
http://www.fugadecerebros.com/                           1
https://www.warnerbros.com/green-lantern-first-flight    1
https://www.netflix.com/title/80171022                   1
Name: count, Length: 3280, dtype: int64

In [11]:
# Inspect the values of the belongs_to_collection column
data['belongs_to_collection'].value_counts()

belongs_to_collection
{'id': 645, 'name': 'James Bond Collection', 'poster_path': '/HORpg5CSkmeQlAolx3bKMrKgfi.jpg', 'backdrop_path': '/6VcVl48kNKvdXOZfJPdarlUGOsk.jpg'}                   26
{'id': 34055, 'name': 'Pokémon Collection', 'poster_path': '/j5te0YNZAMXDBnsqTUDKIBEt8iu.jpg', 'backdrop_path': '/iGoYKA0TFfgSoZpG2u5viTJMGfK.jpg'}                   18
{'id': 425164, 'name': 'Dragon Ball Z (Movie) Collection', 'poster_path': '/2VMZ1zRFPnUQtQp5K4WRXvDYBjh.jpg', 'backdrop_path': '/7PcbijxTfwi9vjWEfXdS0ReAw8q.jpg'}    15
{'id': 9735, 'name': 'Friday the 13th Collection', 'poster_path': '/uobgqpLQff9WvxGKE2OSvXv1RHm.jpg', 'backdrop_path': '/c7pMKwv5NzIN6N3KM4L8fYMTtPw.jpg'}            12
{'id': 19163, 'name': 'The Land Before Time Collection', 'poster_path': '/n1bjdBVThBezxR6nEf2dy43sTtV.jpg', 'backdrop_path': '/alkvR9vTtuZEmd5ygsayOfxYOMa.jpg'}      11
                                                                                                                                     

In [12]:
# Show the rows where release_date is null
redate_null = data[data['release_date'].isnull()]
redate_null

Unnamed: 0,id,adult,belongs_to_collection,budget,genres,homepage,imdb_id,original_language,original_title,overview,...,tagline,title,video,vote_average,vote_count,year,genres_list,production_companies_list,production_countries_list,spoken_language_list
11220,371758,False,,0,"[{'id': 80, 'name': 'Crime'}, {'id': 9648, 'na...",http://www.bbc.co.uk/programmes/b06v2v52,tt3581932,en,And Then There Were None,"Ten strangers, drawn away from their normal li...",...,Agatha Christie's darkest thriller,And Then There Were None,False,7.9,91.0,,"Crime,Mystery,Drama","British Broadcasting Corporation (BBC),Mammoth...",United Kingdom,"English,Français,Magyar"
12009,409926,False,,0,[],,tt0081846,en,Cosmos,Astronomer Dr. Carl Sagan is host and narrator...,...,,Cosmos,False,9.1,41.0,,,,,


In [13]:
# Drop the columns that are not needed
columns_to_drop = ['homepage', 'belongs_to_collection', 'tagline', 'poster_path','genres','production_companies','adult', 'video', 'spoken_languages', 'production_countries']
data = data.drop(columns=columns_to_drop)

In [14]:
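# Sort the data by release_date. The column holds 'DD-MM-YYYY' strings here,
# so the order is lexicographic (day first) rather than chronological.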
data.sort_values('release_date',axis=0,inplace=True)
data

Unnamed: 0,id,budget,imdb_id,original_language,original_title,overview,popularity,release_date,revenue,runtime,status,title,vote_average,vote_count,year,genres_list,production_companies_list,production_countries_list,spoken_language_list
4453,58129,0,tt0012364,sv,Körkarlen,It's New Year's Eve. Three drunkards evoke a l...,3.830428,01-01-1921,0.0,106.0,Released,The Phantom Carriage,7.7,70.0,1921.0,"Drama,Fantasy,Horror",Svensk Filmindustri (SF),Sweden,No Language
3481,14675,0,tt0028597,en,The Awful Truth,Unfounded suspicions lead a married couple to ...,8.105442,01-01-1937,0.0,91.0,Released,The Awful Truth,7.2,48.0,1937.0,"Comedy,Drama,Romance",Columbia Pictures Corporation,United States of America,English
5751,27040,275,tt0036154,en,Meshes of the Afternoon,A woman returning home falls asleep and has vi...,2.574317,01-01-1943,0.0,14.0,Released,Meshes of the Afternoon,7.7,57.0,1943.0,"Crime,Mystery,Thriller",,United States of America,No Language
4238,16391,424000,tt0039192,en,Black Narcissus,"After opening a convent in the Himalayas, five...",8.657125,01-01-1947,0.0,100.0,Released,Black Narcissus,7.7,106.0,1947.0,Drama,The Archers,United Kingdom,"English,हिन्दी"
3643,11898,0,tt0041546,en,Kind Hearts and Coronets,Louis Mazzini's mother belongs to the aristocr...,9.753718,01-01-1949,0.0,106.0,Released,Kind Hearts and Coronets,7.6,109.0,1949.0,"Comedy,Drama",Ealing Studios,United Kingdom,English
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9611,253295,0,tt1778931,en,Manny,From abject poverty to becoming a ten-time box...,2.711057,31-12-2014,0.0,106.0,Released,Manny,6.4,40.0,2014.0,"Documentary,Drama","Revelin Studios,Wonderspun",United States of America,English
10864,334394,0,tt4935418,tr,Baskin,Feature length version of the 2013 Turkish sho...,4.173011,31-12-2015,0.0,97.0,Released,Baskin,5.8,73.0,2015.0,"Fantasy,Horror","Film Colony,XYZ Films,Mo Film","Turkey,United States of America",Türkçe
11211,382517,0,tt4938374,en,Open Season: Scared Silly,The humans and animals believe a werewolf is o...,4.452672,31-12-2015,0.0,85.0,Released,Open Season: Scared Silly,5.5,57.0,2015.0,"Animation,Comedy,Family,Adventure",Sony Pictures Animation,United States of America,"Dansk,English,ภาษาไทย"
11220,371758,0,tt3581932,en,And Then There Were None,"Ten strangers, drawn away from their normal li...",5.238281,,0.0,168.0,Released,And Then There Were None,7.9,91.0,,"Crime,Mystery,Drama","British Broadcasting Corporation (BBC),Mammoth...",United Kingdom,"English,Français,Magyar"
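The output above begins with 01-01-1921 and 01-01-1937 and ends with 31-12-2015 because `release_date` is compared as text, so the day decides the order. If a chronological order is wanted, one option is to sort with a key that parses the strings back into dates. This is a sketch on made-up rows, assuming the same 'DD-MM-YYYY' format; it is not what the notebook itself does.

```python
import pandas as pd

# Made-up rows using the same 'DD-MM-YYYY' string format as the notebook
toy = pd.DataFrame({
    'title': ['A', 'B', 'C'],
    'release_date': ['31-12-1990', '01-01-2015', '15-06-1921'],
})

# Plain string sort: ordered by day first
by_string = toy.sort_values('release_date')['title'].tolist()

# Sort with a key that parses the strings, giving chronological order
by_date = toy.sort_values(
    'release_date',
    key=lambda s: pd.to_datetime(s, format='%d-%m-%Y', errors='coerce'),
)['title'].tolist()

print(by_string)
print(by_date)
```

Sorting on the numeric `year` column first would be a coarser alternative that avoids reparsing.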


In [15]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 12182 entries, 4453 to 12009
Data columns (total 19 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   id                         12182 non-null  int64  
 1   budget                     12182 non-null  int64  
 2   imdb_id                    12180 non-null  object 
 3   original_language          12182 non-null  object 
 4   original_title             12182 non-null  object 
 5   overview                   12119 non-null  object 
 6   popularity                 12182 non-null  float64
 7   release_date               12180 non-null  object 
 8   revenue                    12182 non-null  float64
 9   runtime                    12182 non-null  float64
 10  status                     12181 non-null  object 
 11  title                      12182 non-null  object 
 12  vote_average               12182 non-null  float64
 13  vote_count                 12182 non-null  float

In [16]:
data.isnull().sum()

id                            0
budget                        0
imdb_id                       2
original_language             0
original_title                0
overview                     63
popularity                    0
release_date                  2
revenue                       0
runtime                       0
status                        1
title                         0
vote_average                  0
vote_count                    0
year                          2
genres_list                   0
production_companies_list     0
production_countries_list     0
spoken_language_list          0
dtype: int64

In [17]:
# Drop rows that contain null values
data_clear = data.dropna()

In [18]:
data_clear

Unnamed: 0,id,budget,imdb_id,original_language,original_title,overview,popularity,release_date,revenue,runtime,status,title,vote_average,vote_count,year,genres_list,production_companies_list,production_countries_list,spoken_language_list
4453,58129,0,tt0012364,sv,Körkarlen,It's New Year's Eve. Three drunkards evoke a l...,3.830428,01-01-1921,0.0,106.0,Released,The Phantom Carriage,7.7,70.0,1921.0,"Drama,Fantasy,Horror",Svensk Filmindustri (SF),Sweden,No Language
3481,14675,0,tt0028597,en,The Awful Truth,Unfounded suspicions lead a married couple to ...,8.105442,01-01-1937,0.0,91.0,Released,The Awful Truth,7.2,48.0,1937.0,"Comedy,Drama,Romance",Columbia Pictures Corporation,United States of America,English
5751,27040,275,tt0036154,en,Meshes of the Afternoon,A woman returning home falls asleep and has vi...,2.574317,01-01-1943,0.0,14.0,Released,Meshes of the Afternoon,7.7,57.0,1943.0,"Crime,Mystery,Thriller",,United States of America,No Language
4238,16391,424000,tt0039192,en,Black Narcissus,"After opening a convent in the Himalayas, five...",8.657125,01-01-1947,0.0,100.0,Released,Black Narcissus,7.7,106.0,1947.0,Drama,The Archers,United Kingdom,"English,हिन्दी"
3643,11898,0,tt0041546,en,Kind Hearts and Coronets,Louis Mazzini's mother belongs to the aristocr...,9.753718,01-01-1949,0.0,106.0,Released,Kind Hearts and Coronets,7.6,109.0,1949.0,"Comedy,Drama",Ealing Studios,United Kingdom,English
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9682,227359,10000000,tt2611626,en,Force of Execution,"Seagal stars as mob kingpin Mr. Alexander, an ...",2.151259,31-12-2013,0.0,98.0,Released,Force of Execution,5.1,34.0,2013.0,"Action,Crime","Steamroller Productions,Voltage Pictures",United States of America,English
10600,298614,0,tt3667648,fr,Une heure de tranquillité,"Michel, who's crazy about jazz, has just found...",5.756279,31-12-2014,0.0,79.0,Released,Do Not Disturb,5.3,113.0,2014.0,Comedy,"Wild Bunch,TF1 Films Production,Canal+,Orange ...",France,"Français,Polski,Português,Español"
9611,253295,0,tt1778931,en,Manny,From abject poverty to becoming a ten-time box...,2.711057,31-12-2014,0.0,106.0,Released,Manny,6.4,40.0,2014.0,"Documentary,Drama","Revelin Studios,Wonderspun",United States of America,English
10864,334394,0,tt4935418,tr,Baskin,Feature length version of the 2013 Turkish sho...,4.173011,31-12-2015,0.0,97.0,Released,Baskin,5.8,73.0,2015.0,"Fantasy,Horror","Film Colony,XYZ Films,Mo Film","Turkey,United States of America",Türkçe


In [19]:
# Verify that all null values have been removed
data_clear.isnull().sum()

id                           0
budget                       0
imdb_id                      0
original_language            0
original_title               0
overview                     0
popularity                   0
release_date                 0
revenue                      0
runtime                      0
status                       0
title                        0
vote_average                 0
vote_count                   0
year                         0
genres_list                  0
production_companies_list    0
production_countries_list    0
spoken_language_list         0
dtype: int64

In [20]:
# Show the dataset summary again
data_clear.info()

<class 'pandas.core.frame.DataFrame'>
Index: 12114 entries, 4453 to 11211
Data columns (total 19 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   id                         12114 non-null  int64  
 1   budget                     12114 non-null  int64  
 2   imdb_id                    12114 non-null  object 
 3   original_language          12114 non-null  object 
 4   original_title             12114 non-null  object 
 5   overview                   12114 non-null  object 
 6   popularity                 12114 non-null  float64
 7   release_date               12114 non-null  object 
 8   revenue                    12114 non-null  float64
 9   runtime                    12114 non-null  float64
 10  status                     12114 non-null  object 
 11  title                      12114 non-null  object 
 12  vote_average               12114 non-null  float64
 13  vote_count                 12114 non-null  float

In [21]:
# Count films that share the same title in the dataset
data_clear['original_title'].value_counts()

original_title
The Mummy                    4
Life                         4
Frankenstein                 4
Wuthering Heights            4
A Christmas Carol            4
                            ..
Pooh's Heffalump Movie       1
La science des rêves         1
Goodbye Bafana               1
The Wolfman                  1
Open Season: Scared Silly    1
Name: count, Length: 11761, dtype: int64

In [22]:
# Sort films by vote_count and drop films with duplicate titles,
# keeping the film with the higher vote count
data_clear.sort_values('vote_count',ascending=True,inplace=True)
data_clear.drop_duplicates(['original_title'],keep='last',inplace=True)
data_clear.info()

<class 'pandas.core.frame.DataFrame'>
Index: 11761 entries, 4508 to 7116
Data columns (total 19 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   id                         11761 non-null  int64  
 1   budget                     11761 non-null  int64  
 2   imdb_id                    11761 non-null  object 
 3   original_language          11761 non-null  object 
 4   original_title             11761 non-null  object 
 5   overview                   11761 non-null  object 
 6   popularity                 11761 non-null  float64
 7   release_date               11761 non-null  object 
 8   revenue                    11761 non-null  float64
 9   runtime                    11761 non-null  float64
 10  status                     11761 non-null  object 
 11  title                      11761 non-null  object 
 12  vote_average               11761 non-null  float64
 13  vote_count                 11761 non-null  float6
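The sort-then-deduplicate step can be checked on a small example. The frame below is made up for illustration (titles echo the duplicates listed earlier, but the years and vote counts are invented); it sorts ascending by `vote_count` and keeps the last row per title, which is the row with the most votes.

```python
import pandas as pd

# Invented rows: two titles appear twice with different vote counts
films = pd.DataFrame({
    'original_title': ['The Mummy', 'The Mummy', 'Life', 'Life', 'Inception'],
    'year': [1932, 1999, 1999, 2017, 2010],
    'vote_count': [80, 2800, 150, 2600, 14000],
})

# Ascending sort puts the most-voted row of each title last; keep='last' retains it
deduped = (
    films.sort_values('vote_count', ascending=True)
         .drop_duplicates(subset=['original_title'], keep='last')
)

kept_years = dict(zip(deduped['original_title'], deduped['year']))
print(kept_years)
```

Note that this treats any two films with the same `original_title` as duplicates, so distinct films such as remakes are collapsed into one row.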

In [23]:
# Count films that share the same title after deduplication
data_clear['original_title'].value_counts()

original_title
人間の條件　第３部望郷篇／第４部戦雲篇                   1
Neuilly sa Mère !                     1
Copycat                               1
Død Snø 2                             1
Bambi II                              1
                                     ..
中国合伙人                                 1
Whisper                               1
The Lion Guard: Return of the Roar    1
Gorky Park                            1
Inception                             1
Name: count, Length: 11761, dtype: int64

In [24]:
# Count duplicated overview values
data_clear['overview'].value_counts()

overview
No overview found.                                                                                                                                                                                                                                                                                                                                                                                                       8
A few funny little novels about different aspects of life.                                                                                                                                                                                                                                                                                                                                                               3
Kaji, having lost his exemption from military service by protecting Chinese prisoners from unjust punishment, has now been conscripted into the Japanese Kwantung Army. D

In [25]:
# Filter the films whose overview is the placeholder 'No overview found.'
# (other placeholders such as 'No Overview', an empty string, or 'No movie overview available.' are not checked here)
data_clear.loc[(data_clear['overview']=='No overview found.')].sort_values('overview')

Unnamed: 0,id,budget,imdb_id,original_language,original_title,overview,popularity,release_date,revenue,runtime,status,title,vote_average,vote_count,year,genres_list,production_companies_list,production_countries_list,spoken_language_list
10516,22182,0,tt1065305,en,Lezioni di cioccolato,No overview found.,2.064569,23-11-2007,0.0,99.0,Released,Lezioni di cioccolato,5.8,42.0,2007.0,Comedy,,Italy,Italiano
10103,10036,0,tt0461642,de,7 Zwerge - Der Wald ist nicht genug,No overview found.,6.637562,25-10-2006,0.0,95.0,Released,7 Dwarves: The Forest Is Not Enough,4.9,44.0,2006.0,Comedy,Universal Pictures,Germany,Deutsch
11093,13713,0,tt0477988,fr,Jean-Philippe,No overview found.,2.872268,05-04-2006,0.0,0.0,Released,Jean-Philippe,5.4,48.0,2006.0,Comedy,"Fidélité Productions,StudioCanal,TF1 Films Pro...",France,Français
8078,20106,0,tt0338828,fr,Mais qui a tué Pamela Rose ?,No overview found.,3.065777,04-06-2003,0.0,95.0,Released,Mais qui a tué Pamela Rose ?,6.6,62.0,2003.0,Comedy,"Gaumont,LGM Productions,TF1 Films Production,C...",France,Français
5024,2029,0,tt0274155,en,Tanguy,No overview found.,3.36053,21-11-2001,0.0,108.0,Released,Tanguy,5.8,70.0,2001.0,Comedy,"Les Productions du Champ Poirier,TPS Cinéma,TF...",France,Français
10612,26285,0,tt0085524,it,Fantozzi subisce ancora,No overview found.,4.434073,01-01-1983,0.0,85.0,Released,Fantozzi Still Suffers,6.4,77.0,1983.0,Comedy,,Italy,Italiano
9721,20414,0,tt1077026,en,"Grande, grosso e Verdone",No overview found.,4.309426,07-03-2008,0.0,0.0,Released,"Grande, grosso e Verdone",5.6,78.0,2008.0,Family,Filmauro,Italy,Italiano
10566,24169,0,tt1288637,it,Il cosmo sul comò,No overview found.,3.935385,19-12-2008,0.0,100.0,Released,Il cosmo sul comò,5.0,156.0,2008.0,Comedy,,Italy,Italiano
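The cell above matches only the exact string 'No overview found.'. If the other placeholder variants also need to be caught, a set lookup on a normalized copy of the column is one way to do it. This is a sketch on made-up rows; the placeholder set is an assumption and should be adjusted to whatever actually occurs in the data.

```python
import pandas as pd

# Assumed placeholder texts, compared case-insensitively after trimming whitespace
placeholders = {'no overview found.', 'no overview', 'no movie overview available.', ''}

# Made-up rows for illustration
films = pd.DataFrame({
    'title': ['A', 'B', 'C', 'D'],
    'overview': ['No overview found.', 'A heist goes wrong.', '  ', 'No Overview'],
})

# Normalize, then keep only rows whose overview is not a placeholder
normalized = films['overview'].str.strip().str.lower()
kept = films[~normalized.isin(placeholders)]
kept_titles = kept['title'].tolist()
print(kept_titles)
```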


In [26]:
# There are 8 films with no overview data in total;
# remove these films
data_clear = data_clear.loc[(data_clear['overview']!='No overview found.')]
data_clear.info()

<class 'pandas.core.frame.DataFrame'>
Index: 11753 entries, 4508 to 7116
Data columns (total 19 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   id                         11753 non-null  int64  
 1   budget                     11753 non-null  int64  
 2   imdb_id                    11753 non-null  object 
 3   original_language          11753 non-null  object 
 4   original_title             11753 non-null  object 
 5   overview                   11753 non-null  object 
 6   popularity                 11753 non-null  float64
 7   release_date               11753 non-null  object 
 8   revenue                    11753 non-null  float64
 9   runtime                    11753 non-null  float64
 10  status                     11753 non-null  object 
 11  title                      11753 non-null  object 
 12  vote_average               11753 non-null  float64
 13  vote_count                 11753 non-null  float6

In [27]:
# Show the films whose spoken_language_list is empty
dt = data_clear[data_clear['spoken_language_list'] =='']
dt

Unnamed: 0,id,budget,imdb_id,original_language,original_title,overview,popularity,release_date,revenue,runtime,status,title,vote_average,vote_count,year,genres_list,production_companies_list,production_countries_list,spoken_language_list
7895,70585,0,tt1473397,en,Lucky,A wannabe serial killer wins the lottery and p...,8.954166,15-07-2011,0.0,103.0,Released,Lucky,4.9,31.0,2011.0,"Comedy,Romance",,United States of America,
9130,168742,1525000,tt2178256,lo,The Rocket,"Set against the lush backdrop of rural Laos, t...",1.665135,10-02-2013,0.0,96.0,Released,The Rocket,6.8,31.0,2013.0,Drama,Red Lamp Films,"Australia,Lao People's Democratic Republic,Tha...",
3235,37964,8000000,tt0297037,en,Brown Sugar,Sidney is a writer who's just left her L.A. Ti...,3.142874,05-10-2002,27362712.0,109.0,Released,Brown Sugar,7.0,31.0,2002.0,"Comedy,Romance",,,
5146,28201,0,tt0443518,en,The Girl in the Café,"Lawrence, an aging, lonely civil servant falls...",3.829855,25-06-2005,0.0,94.0,Released,The Girl in the Café,6.6,32.0,2005.0,"Comedy,Drama,Romance,TV Movie",,,
4459,27517,0,tt0019130,en,The Man Who Laughs,"Gwynplaine, son of Lord Clancharlie, has a per...",2.571104,27-04-1928,0.0,110.0,Released,The Man Who Laughs,7.4,32.0,1928.0,"Drama,Horror",Universal Pictures,United States of America,
9861,38580,0,tt0816562,en,The Little Matchgirl,An animated short based on Hans Christian Ande...,3.529071,07-09-2006,0.0,7.0,Released,The Little Matchgirl,7.3,33.0,2006.0,"Drama,Animation","Walt Disney Pictures,Walt Disney Animation Stu...",United States of America,
2795,5998,0,tt0013086,de,"Dr. Mabuse, der Spieler","Arch-criminal Dr. Mabuse, who is a master of d...",3.275117,27-04-1922,0.0,270.0,Released,"Dr. Mabuse, the Gambler",7.6,35.0,1922.0,"Crime,Drama,Mystery,Thriller",Uco-Film GmbH,Germany,
9690,276918,0,tt2785390,en,America: Imagine the World Without Her,"Political commentator, author and filmmaker Di...",1.313952,02-07-2014,0.0,103.0,Released,America: Imagine the World Without Her,5.4,35.0,2014.0,Documentary,,,
9734,18729,0,tt0088583,en,"North and South, Book I","Two friends, one northern and one southern, st...",2.238623,03-11-1985,0.0,561.0,Released,"North and South, Book I",6.9,37.0,1985.0,"Drama,History,Western",,,
12061,24914,0,tt0368574,en,Kid's Story,A high school student is haunted by thoughts o...,3.253814,02-06-2003,0.0,15.0,Released,Kid's Story,7.2,37.0,2003.0,"Science Fiction,Animation",Studio 4°C,,


In [28]:
# Remove the rows with an empty value in the 'spoken_language_list' column
data_clear = data_clear[data_clear['spoken_language_list'] != '']

data_clear.info()

<class 'pandas.core.frame.DataFrame'>
Index: 11711 entries, 4508 to 7116
Data columns (total 19 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   id                         11711 non-null  int64  
 1   budget                     11711 non-null  int64  
 2   imdb_id                    11711 non-null  object 
 3   original_language          11711 non-null  object 
 4   original_title             11711 non-null  object 
 5   overview                   11711 non-null  object 
 6   popularity                 11711 non-null  float64
 7   release_date               11711 non-null  object 
 8   revenue                    11711 non-null  float64
 9   runtime                    11711 non-null  float64
 10  status                     11711 non-null  object 
 11  title                      11711 non-null  object 
 12  vote_average               11711 non-null  float64
 13  vote_count                 11711 non-null  float6

In [29]:
# Check each column for empty-string ('') values
empty_columns = (data_clear == '').any()
empty_columns

id                           False
budget                       False
imdb_id                      False
original_language            False
original_title               False
overview                     False
popularity                   False
release_date                 False
revenue                      False
runtime                      False
status                       False
title                        False
vote_average                 False
vote_count                   False
year                         False
genres_list                   True
production_companies_list     True
production_countries_list     True
spoken_language_list         False
dtype: bool

In [30]:
# Remove the rows with an empty genres, production companies, or production countries list
data_clear = data_clear[data_clear['genres_list'] != '']
data_clear = data_clear[data_clear['production_companies_list'] != '']
data_clear = data_clear[data_clear['production_countries_list'] != '']
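The three filters above can also be written as a single mask over the affected columns. The sketch below uses made-up rows and an optional `reset_index` at the end; it is an equivalent formulation for illustration, not the notebook's own code.

```python
import pandas as pd

# Made-up rows: B has no genres, C has no production companies
films = pd.DataFrame({
    'title': ['A', 'B', 'C'],
    'genres_list': ['Drama', '', 'Comedy'],
    'production_companies_list': ['Studio X', 'Studio Y', ''],
    'production_countries_list': ['France', 'Italy', 'Germany'],
})

list_columns = ['genres_list', 'production_companies_list', 'production_countries_list']

# True for rows where at least one of the list columns is an empty string
has_empty = (films[list_columns] == '').any(axis=1)

# Keep the complete rows; resetting the index is optional
kept = films[~has_empty].reset_index(drop=True)
kept_titles = kept['title'].tolist()
print(kept_titles)
```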

In [32]:
data_clear.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11173 entries, 0 to 11172
Data columns (total 19 columns):
 #   Column                     Non-Null Count  Dtype  
---  ------                     --------------  -----  
 0   id                         11173 non-null  int64  
 1   budget                     11173 non-null  int64  
 2   imdb_id                    11173 non-null  object 
 3   original_language          11173 non-null  object 
 4   original_title             11173 non-null  object 
 5   overview                   11173 non-null  object 
 6   popularity                 11173 non-null  float64
 7   release_date               11173 non-null  object 
 8   revenue                    11173 non-null  float64
 9   runtime                    11173 non-null  float64
 10  status                     11173 non-null  object 
 11  title                      11173 non-null  object 
 12  vote_average               11173 non-null  float64
 13  vote_count                 11173 non-null  flo

In [33]:
# Save the data to the file Data_Movies_clear.csv
data_clear.sort_values(['release_date'],inplace=True)
data_clear.reset_index(drop=True,inplace=True)
data_clear.to_csv('D:/streamlit_project/Data/Data_Movies_clear.csv', index=None, encoding='utf-8-sig')