# **EDA on NETFLIX 🍿🎥**

<img src="./images/Netflix_Homepage.jpg" alt="Netflix homepage">

---
**Author:** `Syed Ghazi Ali Zaidi`

* Contact: _sghazializaidi@gmail.com_
* Explore my code: _https://github.com/Ghazi-work_
* Connect with me: _https://www.linkedin.com/in/syed-ghazi-ali-zaidi-405931217_


---

## **Data Overview 📊**

* **Attributes:** Explore key details such as show ID, type (movie or TV show), title, director, cast, country, date added, release year, rating, duration, and more.
* **Record Structure:** Each row describes one unique title (a movie or a TV show) together with its attributes.

| Column        | Data Type | Description |
|---------------|-----------|-------------|
| show_id       | Object    | Identifier for the show/movie |
| type          | Object    | Type of the content (Movie or TV Show) |
| title         | Object    | Title of the show |
| director      | Object    | Director of the show/movie (if available) |
| cast          | Object    | Cast members of the show/movie |
| country       | Object    | Country where the show/movie was produced |
| date_added    | Object    | Date when the show/movie was added to Netflix |
| release_year  | Numeric   | Year when the show/movie was released |
| rating        | Object    | Content rating of the show/movie |
| duration      | Object    | Duration of the show/movie |
| listed_in     | Object    | Categories in which the show/movie is listed |
| description   | Object    | Brief description of the show/movie |

## **Motivation 🚀** 

While movie databases are aplenty, exploring the Netflix galaxy provides a unique perspective. This dataset opens doors to analyze content trends, viewer preferences, and the global impact of Netflix's vast library. Gathering this celestial data had its challenges, making it an even more intriguing resource.


## **Credits and Acknowledgements 🌟**

Heartfelt gratitude to the data curators at Netflix and the original contributors on [Kaggle](https://www.kaggle.com/shivamb/netflix-shows). Their dedication has given us this remarkable dataset, opening doors to endless possibilities.

Special thanks to [Dr. Aammar Tufail](https://github.com/AammarTufail), whose guidance has been a beacon in the sea of data exploration.

Let the analysis begin! 🚀📊

#### **Kernel Version**:
* Python 3.11.5

# **1. Importing Libraries**

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import plotly
# connected=True loads the plotly.js library from a CDN instead of embedding it in the notebook
# (this requires an internet connection when the notebook is viewed).
plotly.offline.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.figure_factory as ff

%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
# pd.option_context only takes effect inside a `with` block; set_option applies the setting globally
pd.set_option('display.max_columns', None)

# **2. Loading Dataset and Exploring**

In [2]:
df = pd.read_csv('./Datasets/Netflix_data.csv')

* Sneak peek at the dataset

In [3]:
df.head()

|   | show_id | type | title | director | cast | country | date_added | release_year | rating | duration | listed_in | description |
|---|---------|------|-------|----------|------|---------|------------|--------------|--------|----------|-----------|-------------|
| 0 | s1 | Movie | Dick Johnson Is Dead | Kirsten Johnson | NaN | United States | September 25, 2021 | 2020 | PG-13 | 90 min | Documentaries | As her father nears the end of his life, filmm... |
| 1 | s2 | TV Show | Blood & Water | NaN | Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban... | South Africa | September 24, 2021 | 2021 | TV-MA | 2 Seasons | International TV Shows, TV Dramas, TV Mysteries | After crossing paths at a party, a Cape Town t... |
| 2 | s3 | TV Show | Ganglands | Julien Leclercq | Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi... | NaN | September 24, 2021 | 2021 | TV-MA | 1 Season | Crime TV Shows, International TV Shows, TV Act... | To protect his family from a powerful drug lor... |
| 3 | s4 | TV Show | Jailbirds New Orleans | NaN | NaN | NaN | September 24, 2021 | 2021 | TV-MA | 1 Season | Docuseries, Reality TV | Feuds, flirtations and toilet talk go down amo... |
| 4 | s5 | TV Show | Kota Factory | NaN | Mayur More, Jitendra Kumar, Ranjan Raj, Alam K... | India | September 24, 2021 | 2021 | TV-MA | 2 Seasons | International TV Shows, Romantic TV Shows, TV ... | In a city of coaching centers known to train I... |


In [4]:
print(f'There are {df.shape[0]} rows and {df.shape[1]} columns in the dataset.')

There are 8807 rows and 12 columns in the dataset.


In [5]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       8807 non-null   object
 1   type          8807 non-null   object
 2   title         8807 non-null   object
 3   director      6173 non-null   object
 4   cast          7982 non-null   object
 5   country       7976 non-null   object
 6   date_added    8797 non-null   object
 7   release_year  8807 non-null   int64 
 8   rating        8803 non-null   object
 9   duration      8804 non-null   object
 10  listed_in     8807 non-null   object
 11  description   8807 non-null   object
dtypes: int64(1), object(11)
memory usage: 825.8+ KB


> ##### 1. The column `show_id` can be made numeric after removing the character `s`.
> ##### 2. The column `date_added` can be divided into separate columns such as `day`, `month`, and `year` for better analysis.



#### Looking at selected columns to find something interesting 🔎

In [6]:
df['duration'].value_counts()

duration
1 Season     1793
2 Seasons     425
3 Seasons     199
90 min        152
94 min        146
             ... 
16 min          1
186 min         1
193 min         1
189 min         1
191 min         1
Name: count, Length: 220, dtype: int64

* **Interesting:** the `duration` column can be used to build **2** different dataframes: one for **movies** only, holding the running time, and a separate one for **TV shows**, holding the season count. This should help us get good insights 🤩
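
The split described above can be sketched roughly as follows. This is only an illustration on a toy frame, not the notebook's final implementation: the rows in `toy` and the column names `duration_min` and `season_count` are assumptions made for the example.

```python
import pandas as pd

# Toy rows standing in for the Netflix dataframe (illustrative values only)
toy = pd.DataFrame({
    'type': ['Movie', 'TV Show', 'Movie', 'TV Show'],
    'duration': ['90 min', '2 Seasons', '125 min', '1 Season'],
})

# The leading integer is the only part of `duration` that is needed
number = pd.to_numeric(toy['duration'].str.extract(r'(\d+)', expand=False))

# Movies keep the running time in minutes; TV shows keep the season count
is_movie = toy['type'] == 'Movie'
movies = toy[is_movie].assign(duration_min=number[is_movie])
shows = toy[~is_movie].assign(season_count=number[~is_movie])

print(movies)
print(shows)
```

Dividing `duration_min` by 60 would give the running time in hours if that unit is preferred.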

In [7]:
df['rating'].value_counts()

rating
TV-MA       3207
TV-14       2160
TV-PG        863
R            799
PG-13        490
TV-Y7        334
TV-Y         307
PG           287
TV-G         220
NR            80
G             41
TV-Y7-FV       6
NC-17          3
UR             3
74 min         1
84 min         1
66 min         1
Name: count, dtype: int64

* **Hmmmm 🤔** it looks like values from another column have ended up in the `rating` column; this needs to be handled in the next phase.

* Also, the `rating` codes don't communicate much on their own, so I'll try to give each a meaningful name,\
like: PG as *Parental Guidance*

In [8]:
df['country'].value_counts()

country
United States                             2818
India                                      972
United Kingdom                             419
Japan                                      245
South Korea                                199
                                          ... 
Romania, Bulgaria, Hungary                   1
Uruguay, Guatemala                           1
France, Senegal, Belgium                     1
Mexico, United States, Spain, Colombia       1
United Arab Emirates, Jordan                 1
Name: count, Length: 748, dtype: int64

* It looks like some movies/shows have **more than 1** country listed; this may be due to collaborative work between two or more countries.
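
If per-country counts are needed later, one possible approach is to split the comma-separated string and give each country its own row. The sketch below uses a toy frame with made-up titles; it assumes countries are always separated by commas, as in the output above.

```python
import pandas as pd

# Toy rows: single country, several countries, and a missing value
toy = pd.DataFrame({
    'title': ['A', 'B', 'C'],
    'country': ['United States', 'France, Senegal, Belgium', None],
})

# Turn each comma-separated string into a list, then one row per country
countries = toy.assign(country=toy['country'].str.split(',')).explode('country')

# Remove the space left behind after each comma
countries['country'] = countries['country'].str.strip()

# value_counts ignores the missing value by default
country_counts = countries['country'].value_counts()
print(country_counts)
```

Note that a co-production is then counted once for every country it lists, so the counts no longer sum to the number of titles.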

In [9]:
df['listed_in'].value_counts()

listed_in
Dramas, International Movies                          362
Documentaries                                         359
Stand-Up Comedy                                       334
Comedies, Dramas, International Movies                274
Dramas, Independent Movies, International Movies      252
                                                     ... 
Kids' TV, TV Action & Adventure, TV Dramas              1
TV Comedies, TV Dramas, TV Horror                       1
Children & Family Movies, Comedies, LGBTQ Movies        1
Kids' TV, Spanish-Language TV Shows, Teen TV Shows      1
Cult Movies, Dramas, Thrillers                          1
Name: count, Length: 514, dtype: int64

### **Observations:** 
----
1. There are 8807 rows and 12 columns.

2. The columns in the dataset are:
    - **`'show_id', 'type', 'title', 'director', 'cast', 'country', 'date_added',
       'release_year', 'rating', 'duration', 'listed_in', 'description'`**

#### **Operations to be Performed ⚒️**
1. `show_id` can be made numeric by removing the `s` prefix.
2. `date_added` can be divided into 3 columns: `Year`, `Month`, and `Day`.
3. `rating` values have to be labeled in a more meaningful way; the rows where a duration replaced the rating also have to be handled.
4. `duration`: **2** dataframes will be made:
    - Movie: This dataframe will hold `duration` in *Hours*
    - Show: This dataframe will hold `duration` as *Season count*
----

# **3. Data Pre-processing**

* In this section we'll clean, transform, and integrate the Netflix dataframe; the operations mentioned above will be performed step by step. So sit back and watch... 😃

#### **1. Show_id**

* **Recap:** The character `s` has to be removed, and the column then converted to int64.

* **Reason:** After converting to *Numeric*, aggregate functions can be applied to this column.

* checking null values

In [10]:
df['show_id'].isnull().sum()

0

In [11]:
print(df['show_id'].unique())

print('---------------------') # separator

df['show_id'].value_counts()

['s1' 's2' 's3' ... 's8805' 's8806' 's8807']
---------------------


show_id
s1       1
s5875    1
s5869    1
s5870    1
s5871    1
        ..
s2931    1
s2930    1
s2929    1
s2928    1
s8807    1
Name: count, Length: 8807, dtype: int64

> ##### * It shows that the values are indeed `unique id` of the show/movie.

* Replacing `s` with an empty string

In [12]:
df['show_id'] = df['show_id'].str.replace('s','') # replacing 's'

df['show_id'] = df['show_id'].astype('int64') # Converting it to int64 

df['show_id'].info() # checking the data type after converting

<class 'pandas.core.series.Series'>
RangeIndex: 8807 entries, 0 to 8806
Series name: show_id
Non-Null Count  Dtype
--------------  -----
8807 non-null   int64
dtypes: int64(1)
memory usage: 68.9 KB
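
As an optional safeguard, the id format can be checked before converting. The sketch below assumes every id is the letter `s` followed only by digits, which matches the values printed above; the three ids in `ids` are just a small stand-in for the full column.

```python
import pandas as pd

# Small stand-in for the original (string) show_id column
ids = pd.Series(['s1', 's2', 's8807'], name='show_id')

# Check that every id is exactly 's' followed by one or more digits
all_valid = ids.str.fullmatch(r's\d+').all()
if not all_valid:
    raise ValueError('Unexpected show_id format found')

# Strip only the leading 's', then convert to integers
numeric_ids = ids.str.removeprefix('s').astype('int64')
print(numeric_ids.tolist())
```

`Series.str.removeprefix` only touches the start of the string, so it stays correct even if an id ever contained another `s` further along.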


#### **2. Date_added**

* **Recap:** The date will be divided into 3 columns: `Year`, `Month`, and `Day`

* **Reason:** Would be helpful in generating insights even on `Day` and `Month` level

- Viewing a sample to understand the pattern

In [13]:
df['date_added'].sample(5)

7347      August 16, 2018
6845    December 31, 2017
5190     November 1, 2017
7757       April 16, 2018
6732        July 26, 2019
Name: date_added, dtype: object

> * The month comes first, then the day, and finally the year

In [14]:
# Converting the datatype of 'date_added' to date time
df['date_added'] = pd.to_datetime(df['date_added'], errors='coerce')

# Extracting year, month, and day into separate columns
df['year_added'] = df['date_added'].dt.year
df['month_added'] = df['date_added'].dt.month
df['day_added'] = df['date_added'].dt.day

* Viewing the first 5 rows after operation

In [15]:
df[['date_added','year_added','month_added','day_added']].head()

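|   | date_added | year_added | month_added | day_added |
|---|------------|------------|-------------|-----------|
| 0 | 2021-09-25 | 2021.0 | 9.0 | 25.0 |
| 1 | 2021-09-24 | 2021.0 | 9.0 | 24.0 |
| 2 | 2021-09-24 | 2021.0 | 9.0 | 24.0 |
| 3 | 2021-09-24 | 2021.0 | 9.0 | 24.0 |
| 4 | 2021-09-24 | 2021.0 | 9.0 | 24.0 |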
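
The date parts show up as floats (`2021.0`) because rows with a missing `date_added` produce `NaN`, which a plain integer column cannot hold. One possible way to keep whole numbers is pandas' nullable `Int64` dtype. The sketch below uses three made-up values; the explicit `format` and the whitespace stripping are assumptions based on the sample shown earlier, not something verified against the full column.

```python
import pandas as pd

# Toy values: a valid date, a missing value, and a date with stray whitespace
raw = pd.Series(['September 25, 2021', None, ' April 16, 2018 '])

# Strip whitespace first so the explicit format matches, then parse
dates = pd.to_datetime(raw.str.strip(), format='%B %d, %Y', errors='coerce')

# Nullable Int64 keeps integers and shows missing values as <NA>
parts = pd.DataFrame({
    'year_added': dates.dt.year.astype('Int64'),
    'month_added': dates.dt.month.astype('Int64'),
    'day_added': dates.dt.day.astype('Int64'),
})
print(parts)
```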


#### **3. Rating**

* **Recap:** The values in this column have to be converted into a more meaningful form; in addition, 3 rows contain duration values instead of ratings.

* **Reason:** Would be helpful in understanding when presenting the data

In [16]:
df['rating'].isnull().sum()

4

> * We will drop the nulls, as they are few in number and cannot be imputed

In [17]:
df['rating'].value_counts()

rating
TV-MA       3207
TV-14       2160
TV-PG        863
R            799
PG-13        490
TV-Y7        334
TV-Y         307
PG           287
TV-G         220
NR            80
G             41
TV-Y7-FV       6
NC-17          3
UR             3
74 min         1
84 min         1
66 min         1
Name: count, dtype: int64

#### We'll print the rows that have `74 min`, `84 min`, or `66 min` in the `rating` column

In [18]:
# Identify rows whose 'rating' holds a duration string instead of a rating
wrong_ratings = ['74 min', '84 min', '66 min']
modified = df[df['rating'].isin(wrong_ratings)]

# Print the relevant columns for these rows
modified[['show_id', 'rating', 'duration']].head()


|      | show_id | rating | duration |
|------|---------|--------|----------|
| 5541 | 5542 | 74 min | NaN |
| 5794 | 5795 | 84 min | NaN |
| 5813 | 5814 | 66 min | NaN |


> * As can be seen, duration values have ended up in the `rating` column. Even if we move them back to `duration`, we still won't know the actual `rating`, so the best option is to remove these rows
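
Instead of listing the affected values by hand, the misplaced entries could also be found with a pattern match. This is a sketch on a toy frame; it assumes misplaced durations always look like `<number> min`, as the three rows above do.

```python
import pandas as pd

# Toy rows: a valid rating, a duration in the wrong column, and a missing rating
toy = pd.DataFrame({
    'show_id': [1, 2, 3],
    'rating': ['TV-MA', '74 min', None],
    'duration': ['1 Season', None, '90 min'],
})

# True where `rating` looks like a duration; missing ratings count as False
looks_like_duration = toy['rating'].str.fullmatch(r'\d+ min', na=False)

# Rows to inspect, and the frame with those rows and the null ratings removed
suspect = toy[looks_like_duration]
cleaned = toy[~looks_like_duration].dropna(subset=['rating'])
print(suspect)
print(cleaned)
```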

In [19]:
show_ids_to_remove = [5542, 5795, 5814]
df = df.dropna(subset=['rating']) # dropping nan from rating
df = df[~df['show_id'].isin(show_ids_to_remove)] # removing rows with specified ids

* Now giving meaningful names to the rating values. 🙂

In [20]:
df['rating'].unique() # Viewing unique values in the rating

array(['PG-13', 'TV-MA', 'PG', 'TV-14', 'TV-PG', 'TV-Y', 'TV-Y7', 'R',
       'TV-G', 'G', 'NC-17', 'NR', 'TV-Y7-FV', 'UR'], dtype=object)

* Mapping each rating value to a meaningful name

In [21]:
# Mapping of original ratings to meaningful names
rating_mapping = {
    'PG-13': 'Parental Guidance 13+',
    'TV-MA': 'Mature Audiences',
    'PG': 'Parental Guidance',
    'TV-14': 'Parents Strongly Cautioned 14+',
    'TV-PG': 'Parental Guidance',
    'TV-Y': 'All Children',
    'TV-Y7': 'Directed to Older Children',
    'R': 'Restricted',
    'TV-G': 'General Audiences',
    'G': 'General Audiences',
    'NC-17': 'No one 17 and under admitted',
    'NR': 'Not Rated',
    'TV-Y7-FV': 'Directed to Older Children - Fantasy Violence',
    'UR': 'Unrated'
}

# Create a new column 'rating_meaningful' based on the mapping
df['rating_meaningful'] = df['rating'].map(rating_mapping)

# Show a random sample of 5 rows to check the mapping
df[['rating', 'rating_meaningful']].sample(5)

|      | rating | rating_meaningful |
|------|--------|-------------------|
| 8094 | TV-Y7 | Directed to Older Children |
| 130 | TV-Y | All Children |
| 6694 | TV-14 | Parents Strongly Cautioned 14+ |
| 5651 | TV-MA | Mature Audiences |
| 639 | TV-14 | Parents Strongly Cautioned 14+ |
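
`Series.map` with a dictionary returns `NaN` for any value that has no key, so it can be worth checking that nothing was left unmapped. A minimal sketch, using a shortened mapping and a hypothetical rating code `XYZ` that does not occur in the dataset:

```python
import pandas as pd

# Toy ratings; 'XYZ' is a made-up code used only to trigger the check
ratings = pd.Series(['PG-13', 'TV-MA', 'TV-Y7-FV', 'XYZ'], name='rating')

# Shortened version of the mapping, for illustration only
mapping = {
    'PG-13': 'Parental Guidance 13+',
    'TV-MA': 'Mature Audiences',
    'TV-Y7-FV': 'Directed to Older Children - Fantasy Violence',
}

labels = ratings.map(mapping)

# Codes that exist in the data but have no entry in the mapping
unmapped = sorted(ratings[labels.isna()].unique())
print(unmapped)

# Optional fallback: keep the original code where no label exists
with_fallback = labels.fillna(ratings)
```

An empty `unmapped` list means every rating code received a label.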


### **Handling Duplicates**

In [23]:
df.duplicated().sum()

0

> * There seem to be no fully duplicated rows, but let me check the `description` column

In [25]:
df['description'].duplicated().sum()

32

> * 🫣 There seem to be duplicated values in the dataset, let's check which rows they are..

In [33]:
df[['show_id','title','description','date_added','release_year']].loc[df['description'].duplicated()].head().sort_index()

|     | show_id | title | description | date_added | release_year |
|-----|---------|-------|-------------|------------|--------------|
| 79 | 80 | Tughlaq Durbar (Telugu) | A budding politician has devious plans to rise... | 2021-09-11 | 2021 |
| 237 | 238 | Boomika (Hindi) | Paranormal activity at a lush, abandoned prope... | 2021-08-23 | 2021 |
| 238 | 239 | Boomika (Malayalam) | Paranormal activity at a lush, abandoned prope... | 2021-08-23 | 2021 |
| 239 | 240 | Boomika (Telugu) | Paranormal activity at a lush, abandoned prope... | 2021-08-23 | 2021 |
| 851 | 852 | 99 Songs (Tamil) | Challenged to compose 100 songs before he can ... | 2021-05-21 | 2021 |


> * As can be seen, these are the same movies in different languages; we'll keep one instance of each.
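
Keeping one instance per description could look roughly like the sketch below. It runs on a toy frame with illustrative titles and shortened descriptions, and it assumes that an identical `description` always means the same title, which may not hold in every case.

```python
import pandas as pd

# Toy rows: three language versions sharing one description, plus a distinct title
toy = pd.DataFrame({
    'title': ['Boomika', 'Boomika (Hindi)', 'Boomika (Malayalam)', 'Kota Factory'],
    'description': [
        'Paranormal activity...',
        'Paranormal activity...',
        'Paranormal activity...',
        'In a city of coaching centers...',
    ],
})

# Keep the first row for each description and renumber the index
deduped = (
    toy.drop_duplicates(subset='description', keep='first')
       .reset_index(drop=True)
)
print(deduped)
```

`keep='first'` retains whichever version appears first in the frame, so sorting beforehand controls which language version survives.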