## Import

In [81]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

In [82]:
netflix = pd.read_csv('data/netflix_titles.csv')  # forward slashes work on Windows, macOS and Linux

In [83]:
netflix.shape

(8807, 12)

In [84]:
netflix.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...


## Data Cleaning

In [85]:
netflix.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       8807 non-null   object
 1   type          8807 non-null   object
 2   title         8807 non-null   object
 3   director      6173 non-null   object
 4   cast          7982 non-null   object
 5   country       7976 non-null   object
 6   date_added    8797 non-null   object
 7   release_year  8807 non-null   int64 
 8   rating        8803 non-null   object
 9   duration      8804 non-null   object
 10  listed_in     8807 non-null   object
 11  description   8807 non-null   object
dtypes: int64(1), object(11)
memory usage: 825.8+ KB


Check for null values

In [86]:
netflix.isna().sum()

show_id            0
type               0
title              0
director        2634
cast             825
country          831
date_added        10
release_year       0
rating             4
duration           3
listed_in          0
description        0
dtype: int64
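
Raw counts are easier to judge next to the share of rows they represent. A minimal sketch on a made-up frame (the `toy` data and the `missing` / `share_pct` column names are illustrative assumptions, not part of the notebook):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (values are invented).
toy = pd.DataFrame({
    "director": ["A", np.nan, np.nan, "D"],
    "country": ["US", "IN", np.nan, "JP"],
    "title": ["w", "x", "y", "z"],
})

# isna().sum() counts missing cells; isna().mean() gives the fraction per column.
summary = pd.DataFrame({
    "missing": toy.isna().sum(),
    "share_pct": (toy.isna().mean() * 100).round(1),
}).sort_values("missing", ascending=False)

print(summary)
```

Applied to `netflix`, the same summary would show that `director` is missing in 2634 of 8807 rows, roughly 30%.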

Dropping unnecessary columns

In [87]:
netflix = netflix.drop(['director', 'cast', 'description'], axis=1)

In [88]:
netflix.columns

Index(['show_id', 'type', 'title', 'country', 'date_added', 'release_year',
       'rating', 'duration', 'listed_in'],
      dtype='object')

In [89]:
netflix.isna().sum()

show_id           0
type              0
title             0
country         831
date_added       10
release_year      0
rating            4
duration          3
listed_in         0
dtype: int64

Fix data types

In [90]:
netflix.dtypes

show_id         object
type            object
title           object
country         object
date_added      object
release_year     int64
rating          object
duration        object
listed_in       object
dtype: object

In [91]:
netflix['type'] = pd.Categorical(netflix['type'])

In [92]:
netflix['date_added'] = netflix['date_added'].astype('datetime64[ns]')
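
`astype('datetime64[ns]')` leaves the date format for pandas to infer. A more explicit alternative is `pd.to_datetime` with a format string. The sketch below assumes the `Month D, YYYY` layout seen in the `head()` output and that some entries might carry stray whitespace (an assumption, not something verified here); the `raw` series is made up.

```python
import pandas as pd

# Invented sample: a clean value, one with a leading space, one missing.
raw = pd.Series(["September 25, 2021", " August 4, 2017", None])

# Strip whitespace, parse with an explicit format, turn unparseable values into NaT.
parsed = pd.to_datetime(raw.str.strip(), format="%B %d, %Y", errors="coerce")

print(parsed)
```

With `errors="coerce"`, any value that does not match the format becomes `NaT` instead of raising, so it is worth re-checking `isna().sum()` afterwards.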

In [93]:
netflix['rating'] = pd.Categorical(netflix['rating'])

In [94]:
netflix.dtypes

show_id                 object
type                  category
title                   object
country                 object
date_added      datetime64[ns]
release_year             int64
rating                category
duration                object
listed_in               object
dtype: object

In [95]:
netflix['rating'].cat.categories

Index(['66 min', '74 min', '84 min', 'G', 'NC-17', 'NR', 'PG', 'PG-13', 'R',
       'TV-14', 'TV-G', 'TV-MA', 'TV-PG', 'TV-Y', 'TV-Y7', 'TV-Y7-FV', 'UR'],
      dtype='object')

In [96]:
netflix[netflix['rating'] == "66 min"]

Unnamed: 0,show_id,type,title,country,date_added,release_year,rating,duration,listed_in
5813,s5814,Movie,Louis C.K.: Live at the Comedy Store,United States,2016-08-15,2015,66 min,,Movies


In [97]:
netflix[netflix['rating'] == "74 min"]

Unnamed: 0,show_id,type,title,country,date_added,release_year,rating,duration,listed_in
5541,s5542,Movie,Louis C.K. 2017,United States,2017-04-04,2017,74 min,,Movies


In [98]:
netflix[netflix['rating'] == "84 min"]

Unnamed: 0,show_id,type,title,country,date_added,release_year,rating,duration,listed_in
5794,s5795,Movie,Louis C.K.: Hilarious,United States,2016-09-16,2010,84 min,,Movies


In these three rows the duration was entered in the rating column. Fixing them also resolves the three missing values found earlier in the duration column.

In [99]:
netflix.loc[netflix.rating == "66 min", 'duration'] = "66 min"

In [100]:
netflix.loc[netflix.rating == "74 min", 'duration'] = "74 min"

In [101]:
netflix.loc[netflix.rating == "84 min", 'duration'] = "84 min"

In [102]:
netflix[netflix['rating'] == "84 min"]

Unnamed: 0,show_id,type,title,country,date_added,release_year,rating,duration,listed_in
5794,s5795,Movie,Louis C.K.: Hilarious,United States,2016-09-16,2010,84 min,84 min,Movies


In [103]:
netflix.loc[netflix['rating'] == "66 min", 'rating'] = np.nan

In [104]:
netflix.loc[netflix['rating'] == "74 min", 'rating'] = np.nan

In [105]:
netflix.loc[netflix['rating'] == "84 min", 'rating'] = np.nan
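
The six assignments above can be collapsed into one mask-based step. A minimal sketch on a toy frame (the frame `df` and its values are invented; only the three misplaced strings come from the notebook). It also drops the now-unused categories, since setting a categorical value to NaN leaves the old labels in `cat.categories`.

```python
import numpy as np
import pandas as pd

# Toy frame reproducing the problem: duration strings stored in `rating`.
df = pd.DataFrame({
    "rating": pd.Categorical(["PG-13", "66 min", "74 min", "TV-MA"]),
    "duration": ["90 min", np.nan, np.nan, "2 Seasons"],
})

# Rows whose rating is really a duration.
misplaced = df["rating"].isin(["66 min", "74 min", "84 min"])

# Move the value to the right column, then blank out the bogus rating.
df.loc[misplaced, "duration"] = df.loc[misplaced, "rating"].astype(str)
df.loc[misplaced, "rating"] = np.nan

# Remove the leftover '66 min' / '74 min' labels from the categorical.
df["rating"] = df["rating"].cat.remove_unused_categories()

print(df)
print(df["rating"].cat.categories)
```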

Idea: group the ratings by minimum viewing age, using the Italian classification shown below.

ITALY

Films are classified into four categories:

T: Suitable for all audiences.

6+: Not suitable for children under 6.

14+: Prohibited for viewers under 14; viewers aged 12 and over are admitted if accompanied by a parent or guardian.

18+: Prohibited for viewers under 18; viewers aged 16 and over are admitted if accompanied by a parent or guardian.

source: https://it.wikipedia.org/wiki/Sistemi_di_classificazione_dei_film

TV-Y: This program is designed to be appropriate for all children.

TV-Y7: This program is designed for children age 7 and above.

TV-G: This program is suitable for all ages.

TV-PG: This program contains material that parents may find unsuitable for younger children (hence "Parental Guidance").

TV-14: This program contains some material that many parents would find unsuitable for children under 14 years of age.

TV-MA: This program is specifically designed to be viewed by adults and therefore may be unsuitable for children under 17.

G: General audiences; all ages admitted.

NC-17: Adults only; no one 17 and under admitted.

NR: Not rated.

PG: Parental guidance suggested; some material may not be suitable for children. Treated here as equivalent to TV-PG.

PG-13: Parents strongly cautioned; some material may be inappropriate for children under 13.

R: Restricted; viewers under 17 require an accompanying parent or adult guardian.

TV-Y7-FV: TV-Y7 content that contains fantasy violence (a sub-rating exclusive to TV-Y7).

UR: Unrated.

source: https://en.wikipedia.org/wiki/Television_content_rating_system

In [106]:
# Alternative grouping by audience label, kept for reference
# (superseded by the mapping in the next cell):
# ratings_ages = {
#     'TV-PG': 'Older Kids',
#     'TV-MA': 'Adults',
#     'TV-Y7-FV': 'Older Kids',
#     'TV-Y7': 'Older Kids',
#     'TV-14': 'Teens',
#     'R': 'Adults',
#     'TV-Y': 'Kids',
#     'NR': 'Adults',
#     'PG-13': 'Teens',
#     'TV-G': 'Kids',
#     'PG': 'Older Kids',
#     'G': 'Kids',
#     'UR': 'Adults',
#     'NC-17': 'Adults'
# }

In [107]:
ratings_ages = {
    'G': 'T',
    'TV-G': 'T',
    'TV-Y': 'T',
    'TV-PG': '6+',
    'TV-Y7-FV': '6+',
    'TV-Y7': '6+',
    'PG': '6+',
    'TV-14': '14+',
    'PG-13': '14+',
    'R': '18+',
    'TV-MA': '18+',
    'NC-17': '18+',
    'UR': '18+',
    'NR': '18+'
}
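
The dictionary is only defined in the cell above. A minimal sketch of how it could be applied with `Series.map` follows, on a toy frame; the column name `age_group` and the ordered categorical are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

ratings_ages = {
    'G': 'T', 'TV-G': 'T', 'TV-Y': 'T',
    'TV-PG': '6+', 'TV-Y7-FV': '6+', 'TV-Y7': '6+', 'PG': '6+',
    'TV-14': '14+', 'PG-13': '14+',
    'R': '18+', 'TV-MA': '18+', 'NC-17': '18+', 'UR': '18+', 'NR': '18+',
}

# Toy frame with a categorical rating column, including one missing value.
df = pd.DataFrame({"rating": pd.Categorical(["PG-13", "TV-MA", "G", np.nan, "TV-Y7"])})

# Map on a plain object column; ratings missing from the dict (or NaN) stay NaN.
df["age_group"] = df["rating"].astype(object).map(ratings_ages)

# An ordered categorical lets the groups sort and compare by age.
age_order = ["T", "6+", "14+", "18+"]
df["age_group"] = pd.Categorical(df["age_group"], categories=age_order, ordered=True)

print(df)
```

With the ordered categorical, a filter such as `df["age_group"] >= "14+"` selects the 14+ and 18+ titles.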

Resolve the remaining null values

In [108]:
netflix[netflix['rating'].isna()]

Unnamed: 0,show_id,type,title,country,date_added,release_year,rating,duration,listed_in
5541,s5542,Movie,Louis C.K. 2017,United States,2017-04-04,2017,,74 min,Movies
5794,s5795,Movie,Louis C.K.: Hilarious,United States,2016-09-16,2010,,84 min,Movies
5813,s5814,Movie,Louis C.K.: Live at the Comedy Store,United States,2016-08-15,2015,,66 min,Movies
5989,s5990,Movie,13TH: A Conversation with Oprah Winfrey & Ava ...,,2017-01-26,2017,,37 min,Movies
6827,s6828,TV Show,Gargantia on the Verdurous Planet,Japan,2016-12-01,2013,,1 Season,"Anime Series, International TV Shows"
7312,s7313,TV Show,Little Lunch,Australia,2018-02-01,2015,,1 Season,"Kids' TV, TV Comedies"
7537,s7538,Movie,My Honor Was Loyalty,Italy,2017-03-01,2015,,115 min,Dramas


In [109]:
netflix[netflix['date_added'].isna()]

Unnamed: 0,show_id,type,title,country,date_added,release_year,rating,duration,listed_in
6066,s6067,TV Show,A Young Doctor's Notebook and Other Stories,United Kingdom,NaT,2013,TV-MA,2 Seasons,"British TV Shows, TV Comedies, TV Dramas"
6174,s6175,TV Show,Anthony Bourdain: Parts Unknown,United States,NaT,2018,TV-PG,5 Seasons,Docuseries
6795,s6796,TV Show,Frasier,United States,NaT,2003,TV-PG,11 Seasons,"Classic & Cult TV, TV Comedies"
6806,s6807,TV Show,Friends,United States,NaT,2003,TV-14,10 Seasons,"Classic & Cult TV, TV Comedies"
6901,s6902,TV Show,Gunslinger Girl,Japan,NaT,2008,TV-14,2 Seasons,"Anime Series, Crime TV Shows"
7196,s7197,TV Show,Kikoriki,,NaT,2010,TV-Y,2 Seasons,Kids' TV
7254,s7255,TV Show,La Familia P. Luche,United States,NaT,2012,TV-14,3 Seasons,"International TV Shows, Spanish-Language TV Sh..."
7406,s7407,TV Show,Maron,United States,NaT,2016,TV-MA,4 Seasons,TV Comedies
7847,s7848,TV Show,Red vs. Blue,United States,NaT,2015,NR,13 Seasons,"TV Action & Adventure, TV Comedies, TV Sci-Fi ..."
8182,s8183,TV Show,The Adventures of Figaro Pho,Australia,NaT,2015,TV-Y7,2 Seasons,"Kids' TV, TV Comedies"
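
The two cells above only list the affected rows. One possible treatment, sketched on a toy frame, is to label missing ratings as "NR" and drop the few rows with no `date_added`. Both choices are assumptions made for illustration and not necessarily what this notebook does next.

```python
import numpy as np
import pandas as pd

# Toy frame: one row lacks a date, another lacks a rating (values are invented).
df = pd.DataFrame({
    "title": ["A", "B", "C"],
    "date_added": pd.to_datetime(["2021-09-25", None, "2017-04-04"]),
    "rating": pd.Categorical(["TV-MA", "TV-14", np.nan],
                             categories=["NR", "TV-14", "TV-MA"]),
})

# Option 1: treat a missing rating as "not rated".
# fillna on a categorical only accepts values that are already categories.
df["rating"] = df["rating"].fillna("NR")

# Option 2: drop rows without a date_added and renumber the index.
df = df.dropna(subset=["date_added"]).reset_index(drop=True)

print(df)
```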


Split into movies and TV shows

In [110]:
netflix_f = netflix[netflix['type'] == "Movie"]

In [111]:
netflix_f.shape

(6131, 9)

In [112]:
netflix_s = netflix[netflix['type'] == "TV Show"] 

In [113]:
netflix_s.shape

(2676, 9)

Change duration column to numeric

In [114]:
netflix_f = netflix_f.copy()  # work on an independent copy so the assignment below does not raise SettingWithCopyWarning
netflix_f[['duration_min', 'min']] = netflix_f['duration'].str.split(' ', expand=True)


In [115]:
netflix_f.head()

Unnamed: 0,show_id,type,title,country,date_added,release_year,rating,duration,listed_in,duration_min,min
0,s1,Movie,Dick Johnson Is Dead,United States,2021-09-25,2020,PG-13,90 min,Documentaries,90,min
6,s7,Movie,My Little Pony: A New Generation,,2021-09-24,2021,PG,91 min,Children & Family Movies,91,min
7,s8,Movie,Sankofa,"United States, Ghana, Burkina Faso, United Kin...",2021-09-24,1993,TV-MA,125 min,"Dramas, Independent Movies, International Movies",125,min
9,s10,Movie,The Starling,United States,2021-09-24,2021,PG-13,104 min,"Comedies, Dramas",104,min
12,s13,Movie,Je Suis Karl,"Germany, Czech Republic",2021-09-23,2021,TV-MA,127 min,"Dramas, International Movies",127,min


In [117]:
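netflix_f['duration_min'] = netflix_f['duration_min'].astype('int')  # assign the result; astype returns a new Series
netflix_f['duration_min']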

0        90
6        91
7       125
9       104
12      127
       ... 
8801     96
8802    158
8804     88
8805     88
8806    111
Name: duration_min, Length: 6131, dtype: int32
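
An alternative that skips the helper `min` column is to extract the digits directly. A minimal sketch on a toy frame (the `movies` data is invented); it uses the nullable `Int64` dtype so that a missing duration does not break the conversion.

```python
import pandas as pd

# Toy frame: two valid durations and one missing value.
movies = pd.DataFrame({
    "title": ["A", "B", "C"],
    "duration": ["90 min", "125 min", None],
})

# Pull out the leading number, convert it, and keep missing values as <NA>.
digits = movies["duration"].str.extract(r"(\d+)", expand=False)
movies["duration_min"] = pd.to_numeric(digits, errors="coerce").astype("Int64")

print(movies)
```

The same pattern would also pull the season count out of strings such as `2 Seasons` in the TV show subset.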

In [118]:
netflix_f.isna().sum()  # count missing values per column; isna() is called on the frame, not used as a mask

In [120]:
netflix_f = netflix_f.drop(['duration', 'min'], axis=1)

In [121]:
netflix_f.head()

Unnamed: 0,show_id,type,title,country,date_added,release_year,rating,listed_in,duration_min
0,s1,Movie,Dick Johnson Is Dead,United States,2021-09-25,2020,PG-13,Documentaries,90
6,s7,Movie,My Little Pony: A New Generation,,2021-09-24,2021,PG,Children & Family Movies,91
7,s8,Movie,Sankofa,"United States, Ghana, Burkina Faso, United Kin...",2021-09-24,1993,TV-MA,"Dramas, Independent Movies, International Movies",125
9,s10,Movie,The Starling,United States,2021-09-24,2021,PG-13,"Comedies, Dramas",104
12,s13,Movie,Je Suis Karl,"Germany, Czech Republic",2021-09-23,2021,TV-MA,"Dramas, International Movies",127
