# Netflix Movies and TV Shows

## About this Dataset: 
Netflix is one of the most popular media and video streaming platforms. This dataset provides comprehensive information about the TV shows and movies available on the platform. As of mid-2021, Netflix offered over 8,000 movies and TV shows. With a wide variety of genres, it had attracted more than 200 million subscribers worldwide.

| Column name | Description |
|-------------|-------------|
| show_id | A unique identifier for each movie or TV show. |
| type | Specifies whether the content is a Movie or TV Show. |
| title | The title or name of the content. |
| director | Name(s) of the director(s) of the movie or show. |
| cast | List of actors featured in the content. |
| country | Country of origin where the content was produced. |
| date_added | The date the content was added to Netflix. |
| release_year | The year the content was originally released. |
| rating | Content rating (e.g., TV-MA, PG-13) indicating suitability for audiences. |
| duration | Duration of movies in minutes or the number of seasons for TV shows. |
| listed_in | Categories or genres that the content belongs to (e.g., Drama, Comedy, Action). |
| description | A brief summary or synopsis of the content. |



# Questions to explore
1. Understanding what content is available in different countries (find which genres are available in each country: group by country and look at the movie genres)<br>
2. Identifying similar content by matching text-based features<br>
3. Network analysis of actors and directors to find interesting insights<br>
4. Has Netflix focused more on TV shows than on movies in recent years?
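
For the first question, one possible approach is to split the comma-separated `country` and `listed_in` cells and count titles per country and genre pair. The following is only a sketch on made-up toy rows, not the notebook's own analysis; the helper name `genres_by_country` and the toy data are assumptions introduced for illustration.

```python
import pandas as pd

# Toy rows that mimic the dataset's comma-separated `country` and `listed_in` cells
# (made-up values, not taken from netflix_titles.csv).
toy = pd.DataFrame({
    "country": ["United States, India", "India", "United States"],
    "listed_in": ["Dramas, Comedies", "Dramas", "Comedies"],
})

def genres_by_country(df):
    """Count titles per (country, genre) pair after splitting multi-valued cells."""
    long = (
        df.assign(
            country=df["country"].str.split(", "),
            genre=df["listed_in"].str.split(", "),
        )
        .explode("country", ignore_index=True)  # one row per country
        .explode("genre", ignore_index=True)    # one row per country and genre
    )
    return long.groupby(["country", "genre"]).size().rename("titles").reset_index()

genre_counts = genres_by_country(toy)
print(genre_counts)
```

A title produced in several countries is counted once for each of them, which may or may not be the desired behavior.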

In [1]:
import pandas as pd
import numpy as np

In [2]:
Data=pd.read_csv("netflix_titles.csv",index_col="show_id") 
Data.head()


show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...


In [3]:
Data.rating.value_counts()

rating
TV-MA       3207
TV-14       2160
TV-PG        863
R            799
PG-13        490
TV-Y7        334
TV-Y         307
PG           287
TV-G         220
NR            80
G             41
TV-Y7-FV       6
NC-17          3
UR             3
74 min         1
84 min         1
66 min         1
Name: count, dtype: int64
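
The `74 min`, `84 min`, and `66 min` entries above look like movie runtimes rather than ratings. Assuming these rows also have an empty `duration` (an assumption, not verified here), a minimal sketch of moving such values back could look like this. The helper `move_misplaced_durations` and the toy rows are hypothetical; the notebook itself later groups these values into "Other" instead.

```python
import pandas as pd

# Toy rows: the second one has its runtime in `rating` and an empty `duration`.
toy = pd.DataFrame({
    "rating": ["PG-13", "74 min", "TV-MA"],
    "duration": ["90 min", None, "2 Seasons"],
})

def move_misplaced_durations(df):
    """Move 'NN min' strings found in `rating` into an empty `duration` cell."""
    out = df.copy()
    # Rows where `rating` is exactly '<digits> min' and `duration` is missing
    misplaced = out["rating"].str.fullmatch(r"\d+ min", na=False) & out["duration"].isna()
    out.loc[misplaced, "duration"] = out.loc[misplaced, "rating"]
    out.loc[misplaced, "rating"] = None  # the true rating is unknown
    return out

fixed = move_misplaced_durations(toy)
print(fixed)
```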

# Meaning of rating
1. G : General Audiences
2. PG : Parental Guidance Suggested
3. NC-17 : No one 17 and under admitted
4. TV : Television show (prefix on ratings for TV programs)
5. UR : Unrated
6. Y7 : Suitable for children aged 7 and older
7. FV : Fantasy violence (children's programming only)

# Data cleaning
<b>Checking for missing values in the data</b><br>

In [4]:
Data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 8807 entries, s1 to s8807
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   type          8807 non-null   object
 1   title         8807 non-null   object
 2   director      6173 non-null   object
 3   cast          7982 non-null   object
 4   country       7976 non-null   object
 5   date_added    8797 non-null   object
 6   release_year  8807 non-null   int64 
 7   rating        8803 non-null   object
 8   duration      8804 non-null   object
 9   listed_in     8807 non-null   object
 10  description   8807 non-null   object
dtypes: int64(1), object(10)
memory usage: 825.7+ KB


In [5]:
missing_value=Data.isnull().sum()
total_rows=Data.shape[0]
total_missing=missing_value / total_rows *100
total_missing.sort_values(ascending=False)

director        29.908028
country          9.435676
cast             9.367549
date_added       0.113546
rating           0.045418
duration         0.034064
type             0.000000
title            0.000000
release_year     0.000000
listed_in        0.000000
description      0.000000
dtype: float64

1. Several columns contain missing values: director, cast, country, date_added, rating, and duration.
2. The director column is missing in almost 30% of rows.
3. The country and cast columns have similar shares of missing values, about 9.44% and 9.37%.
4. The missing rows in date_added, rating, and duration each represent a very small proportion (under 0.2%).

<b>Handling missing values in director, country, cast, and rating</b>
1. Missing values will be replaced by "Unknown".
2. In the rating column, values that appear fewer than 5 times will be changed to <b>"Other"</b>.

In [6]:
def filling_unknown(data,columns):
    # Replace missing values in each listed column with "Unknown"
    for factor in columns:
        data[factor]=data[factor].fillna("Unknown")
    return data
factors=['director','country','cast','rating']
Data=filling_unknown(Data,factors)
Data.loc[:,factors].isnull().sum()


director    0
country     0
cast        0
rating      0
dtype: int64

In [7]:
rating_counts=Data.rating.value_counts()
other_index=rating_counts[rating_counts<5].index
Data['rating']=Data.rating.replace(other_index,"Other")
Data.rating.value_counts()

rating
TV-MA       3207
TV-14       2160
TV-PG        863
R            799
PG-13        490
TV-Y7        334
TV-Y         307
PG           287
TV-G         220
NR            80
G             41
Other         13
TV-Y7-FV       6
Name: count, dtype: int64

<b>Checking the relationship between two variables (release_year and date_added)</b>

In [8]:
Data['release_year']=pd.to_datetime(Data.release_year,format="%Y")
Data['release_year']=Data.release_year.dt.strftime("%d/%m/%Y")
Data['date_added']=pd.to_datetime(Data.date_added.str.strip(),format="%B %d, %Y")
Data['date_added']=Data.date_added.dt.strftime("%d/%m/%Y")

In [9]:
Data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 8807 entries, s1 to s8807
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   type          8807 non-null   object
 1   title         8807 non-null   object
 2   director      8807 non-null   object
 3   cast          8807 non-null   object
 4   country       8807 non-null   object
 5   date_added    8797 non-null   object
 6   release_year  8807 non-null   object
 7   rating        8807 non-null   object
 8   duration      8804 non-null   object
 9   listed_in     8807 non-null   object
 10  description   8807 non-null   object
dtypes: object(11)
memory usage: 825.7+ KB


<b>Because release_year has no missing values and the two columns are related, I fill the missing date_added values using the average gap between them:</b><br>
day_diff = (date_added - release_year).mean(), and each missing date_added = release_year + day_diff

In [10]:
# Parse both columns back into datetimes so they can be subtracted
Data['date_added']=pd.to_datetime(Data.date_added,format="%d/%m/%Y")
Data['release_year']=pd.to_datetime(Data.release_year,format="%d/%m/%Y")
# Average gap between release and addition, using rows where date_added is known
known=Data['date_added'].notna()
day_diff=(Data.loc[known,'date_added'] - Data.loc[known,'release_year']).mean()
# Fill each missing date_added with release_year plus the average gap
Data['date_added']=Data['date_added'].fillna(Data['release_year'] + day_diff)

In [11]:
Data['duration']=Data.duration.fillna("Unknown")

In [12]:
Data['movie_duration']=Data['duration'].apply(lambda x: int(x.split()[0]) if 'min' in x else None)
Data['Season']=Data['duration'].apply(lambda x: int(x.split()[0]) if 'S' in x else None)
Data['Season']=Data['Season'].fillna(0)

In [13]:
Data.head(5)

show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,movie_duration,Season
s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,Unknown,United States,2021-09-25,2020-01-01,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm...",90.0,0.0
s2,TV Show,Blood & Water,Unknown,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,2021-09-24,2021-01-01,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t...",,2.0
s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",Unknown,2021-09-24,2021-01-01,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...,,1.0
s4,TV Show,Jailbirds New Orleans,Unknown,Unknown,Unknown,2021-09-24,2021-01-01,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo...",,1.0
s5,TV Show,Kota Factory,Unknown,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,2021-09-24,2021-01-01,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...,,2.0


In [14]:
Data['title']=Data['title'].str.title()
Data['country'] = (
    Data['country']
    .str.lower()  # Convert all to lowercase for consistency
    .str.replace(r'\b(u\.?s\.?|united states)\b', 'USA', regex=True)
    .str.replace(r'\b(united kingdom|u\.k\.)\b', 'UK', regex=True)
    .str.replace(r'\buae\b', 'UAE', regex=True)
    .str.title()
)
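
Because `.str.title()` runs last in the chain above, the uppercase replacements are title-cased again (`"USA".title()` gives `"Usa"`), which is why later outputs show `Usa`. A minimal sketch of one way to keep abbreviations uppercase is shown below. The `ABBREVIATIONS` mapping and both helper functions are hypothetical names that only mirror the patterns used above.

```python
import pandas as pd

# Hypothetical mapping of lowercase spelling variants to the labels to keep.
ABBREVIATIONS = {
    "united states": "USA", "us": "USA", "u.s.": "USA",
    "united kingdom": "UK", "u.k.": "UK",
    "uae": "UAE",
}

def normalize_country(name):
    """Title-case one country name, but keep known abbreviations uppercase."""
    key = name.strip().lower()
    return ABBREVIATIONS.get(key, key.title())

def normalize_country_list(cell):
    """Apply normalize_country to each entry of a comma-separated cell."""
    parts = [part for part in cell.split(",") if part.strip()]
    return ", ".join(normalize_country(part) for part in parts)

toy = pd.Series(["United States, india", "united kingdom", "Unknown"])
normalized = toy.apply(normalize_country_list)
print(normalized.tolist())  # ['USA, India', 'UK', 'Unknown']
```

Looking names up after splitting avoids having to order the regex replacements and the casing step carefully.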

In [15]:
mean_duration=Data.movie_duration.mean()
Data['movie_duration']=Data['movie_duration'].fillna(value=mean_duration,axis=0)
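
Note that the fill above assigns the mean runtime to every row with a missing `movie_duration`, TV shows included, so TV shows show `99.577187` in later outputs. If only movies should be imputed, a hedged alternative could look like the sketch below. The helper `fill_movie_duration` and the toy rows are assumptions for illustration.

```python
import pandas as pd

# Toy rows: one movie with a known runtime, one movie without, one TV show.
toy = pd.DataFrame({
    "type": ["Movie", "Movie", "TV Show"],
    "movie_duration": [90.0, None, None],
})

def fill_movie_duration(df):
    """Fill missing runtimes with the mean runtime, but only for Movie rows."""
    out = df.copy()
    is_movie = out["type"] == "Movie"
    mean_minutes = out.loc[is_movie, "movie_duration"].mean()
    needs_fill = is_movie & out["movie_duration"].isna()
    out.loc[needs_fill, "movie_duration"] = mean_minutes
    return out

filled = fill_movie_duration(toy)
print(filled)
```

TV show rows keep a missing `movie_duration`, so averages computed later are not pulled toward the imputed value.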

<b>Export to a CSV file for visualization in Power BI</b>

In [16]:
#Data.to_csv('D:/Tải/Netflix.csv')

In [17]:
show_country = ", ".join(Data['country'].dropna()).split(", ")

In [18]:
from collections import Counter

In [19]:
count_countries= Counter(show_country)

In [20]:
countries=pd.DataFrame(list(count_countries.items()), columns=['Country', 'Frequency'])

In [21]:
countries.Frequency.sort_values(ascending=False)

0      3689
3      1046
2       831
6       804
16      445
       ... 
101       1
102       1
103       1
104       1
127       1
Name: Frequency, Length: 128, dtype: int64
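
The output above shows only positional index labels, so the country names behind each count are not visible. One possible way to get a sorted table that keeps the names is to split, explode, and count in pandas directly. This is a sketch on toy values; the helper name `country_frequency` is an assumption.

```python
import pandas as pd

# Toy values that mimic the cleaned, comma-separated `country` column.
toy = pd.Series(["Usa, India", "India", "Usa", "Unknown"], name="country")

def country_frequency(countries):
    """Return a Country/Frequency table sorted by descending frequency."""
    return (
        countries.dropna()
        .str.split(", ")   # list of countries per title
        .explode()         # one row per country
        .value_counts()    # sorted in descending order by default
        .rename_axis("Country")
        .reset_index(name="Frequency")
    )

freq = country_frequency(toy)
print(freq)
```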

In [22]:
#countries.loc[:,:].sort_values(by='Frequency',ascending=False).head(10).to_csv('D:/Tải/Top10_Country.csv')

In [23]:
#countries.to_csv('D:/Tải/Country.csv')

In [24]:
#Data[Data.Season>0].to_csv('D:/Tải/Season.csv')

In [25]:
Data['first_country']=Data['country'].apply(lambda x: x.split(",")[0])
Data

show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,movie_duration,Season,first_country
s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,Unknown,Usa,2021-09-25,2020-01-01,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm...",90.000000,0.0,Usa
s2,TV Show,Blood & Water,Unknown,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,2021-09-24,2021-01-01,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t...",99.577187,2.0,South Africa
s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",Unknown,2021-09-24,2021-01-01,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...,99.577187,1.0,Unknown
s4,TV Show,Jailbirds New Orleans,Unknown,Unknown,Unknown,2021-09-24,2021-01-01,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo...",99.577187,1.0,Unknown
s5,TV Show,Kota Factory,Unknown,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,2021-09-24,2021-01-01,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...,99.577187,2.0,India
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
s8803,Movie,Zodiac,David Fincher,"Mark Ruffalo, Jake Gyllenhaal, Robert Downey J...",Usa,2019-11-20,2007-01-01,R,158 min,"Cult Movies, Dramas, Thrillers","A political cartoonist, a crime reporter and a...",158.000000,0.0,Usa
s8804,TV Show,Zombie Dumb,Unknown,Unknown,Unknown,2019-07-01,2018-01-01,TV-Y7,2 Seasons,"Kids' TV, Korean TV Shows, TV Comedies","While living alone in a spooky town, a young g...",99.577187,2.0,Unknown
s8805,Movie,Zombieland,Ruben Fleischer,"Jesse Eisenberg, Woody Harrelson, Emma Stone, ...",Usa,2019-11-01,2009-01-01,R,88 min,"Comedies, Horror Movies",Looking to survive in a world taken over by zo...,88.000000,0.0,Usa
s8806,Movie,Zoom,Peter Hewitt,"Tim Allen, Courteney Cox, Chevy Chase, Kate Ma...",Usa,2020-01-11,2006-01-01,PG,88 min,"Children & Family Movies, Comedies","Dragged from civilian life, a former superhero...",88.000000,0.0,Usa


### Based on the content ratings, I want to dig into the age groups that Netflix targets

In [41]:
def rating_age(rating):
    # Map each rating to the audience group it targets
    age_dict={
        "Kids":['TV-Y', 'TV-G', 'G'],
        "Older Kids":['TV-Y7', 'TV-Y7-FV', 'TV-PG'],
        "Teen":['PG', 'PG-13', 'TV-14'],
        "Adults":['R', 'TV-MA', 'NR', 'Other'],
    }
    for key, values in age_dict.items():
        if rating in values:
            return key
    # Ratings outside the mapping are returned unchanged
    return rating
Data['Age']=Data['rating'].apply(rating_age)
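
Once each title has an age group, the share of the catalog aimed at each group can be read from normalized counts. The sketch below uses a handful of made-up ratings and an inverted version of the same mapping; `AGE_GROUPS`, `RATING_TO_AGE`, and the toy series are illustrative assumptions, not results from the dataset.

```python
import pandas as pd

# Same grouping as in rating_age, written as group -> ratings.
AGE_GROUPS = {
    "Kids": ["TV-Y", "TV-G", "G"],
    "Older Kids": ["TV-Y7", "TV-Y7-FV", "TV-PG"],
    "Teen": ["PG", "PG-13", "TV-14"],
    "Adults": ["R", "TV-MA", "NR", "Other"],
}
# Invert to rating -> group so that Series.map can do the lookup.
RATING_TO_AGE = {
    rating: group for group, ratings in AGE_GROUPS.items() for rating in ratings
}

toy_ratings = pd.Series(["TV-MA", "TV-14", "R", "TV-Y"])  # made-up sample
# Unmapped ratings fall back to the original value, as in rating_age.
age = toy_ratings.map(RATING_TO_AGE).fillna(toy_ratings)
share = age.value_counts(normalize=True)
print(share)
```

On the real data, the equivalent call would be `Data['Age'].value_counts(normalize=True)`.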
            



In [None]:
#Data.to_csv('D:/Tải/Netflix.csv')

In [47]:
Categories=", ".join(Data.listed_in).split(", ")

In [53]:
count_Categories=Counter(Categories)

In [55]:
#Category=pd.DataFrame(list(count_Categories.items()),columns=['Category','Frequency'])
#Category.to_csv('D:/Tải/Category_Netflix.csv')

In [58]:
#Actors=", ".join(Data.cast).split(", ")
#count_Actors=Counter(Actors)
#Actor=pd.DataFrame(list(count_Actors.items()),columns=['Actor','Frequency'])
#Actor.to_csv('D:/Tải/Actor_Netflix.csv')         

In [59]:
#Directors=", ".join(Data.director).split(", ")
#count_Directors=Counter(Directors)
#Directors=pd.DataFrame(list(count_Directors.items()),columns=['Director','Frequency'])
#Directors.to_csv('D:/Tải/Director_Netflix.csv')        