# **EDA on NETFLIX 🍿🎥**

<img src="./images/Netflix_Homepage.jpg" alt="Netflix homepage">

---
**Author:** `Syed Ghazi Ali Zaidi`

* Contact: _sghazializaidi@gmail.com_
* Explore my code: _https://github.com/Ghazi-work_
* Connect with me: _https://www.linkedin.com/in/syed-ghazi-ali-zaidi-405931217_


---

## **Data Overview 📊**

* **Attributes:** Explore key details such as show ID, type (movie or TV show), title, director, cast, country, date added, release year, rating, duration, genre categories, and description.
* **Record Structure:** Each row describes one unique title, providing a snapshot of its attributes.

| Column        | Data Type | Description |
|---------------|-----------|-------------|
| show_id       | Object    | Identifier for the show |
| type          | Object    | Type of the content (Movie or TV Show) |
| title         | Object    | Title of the show |
| director      | Object    | Director of the show (if available) |
| cast          | Object    | Cast members of the show |
| country       | Object    | Country where the show was produced |
| date_added    | Object    | Date when the show was added to Netflix |
| release_year  | Numeric   | Year when the show was released |
| rating        | Object    | Content rating of the show |
| duration      | Object    | Duration of the show (minutes for movies, seasons for TV shows) |
| listed_in     | Object    | Categories in which the show is listed |
| description   | Object    | Brief description of the show |

## **Motivation 🚀** 

While movie databases are aplenty, exploring the Netflix galaxy provides a unique perspective. This dataset opens doors to analyze content trends, viewer preferences, and the global impact of Netflix's vast library. Gathering this celestial data had its challenges, making it an even more intriguing resource.


## **Credits and Acknowledgements 🌟**

Heartfelt gratitude to the data curators at Netflix and the original contributors on [Kaggle](https://www.kaggle.com/shivamb/netflix-shows). Their dedication has given us this remarkable dataset, opening doors to endless possibilities.

Special thanks to [Dr. Aammar Tufail](https://github.com/AammarTufail), whose guidance has been a beacon in the sea of data exploration.

Let the analysis begin! 🚀📊

#### **Kernel Version**:
* Python 3.11.5

## **1. Importing Libraries**

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import plotly
# connected=True loads the plotly.js library from a CDN instead of embedding it in the notebook (needs an internet connection).
plotly.offline.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.figure_factory as ff
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

## **2. Loading Dataset and Exploring**

In [2]:
df = pd.read_csv('./Datasets/Netflix_data.csv')

In [3]:
df.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...


In [6]:
print(f'There are {df.shape[0]} rows and {df.shape[1]} columns in the dataset')

There are 8807 rows and 12 columns in the dataset


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       8807 non-null   object
 1   type          8807 non-null   object
 2   title         8807 non-null   object
 3   director      6173 non-null   object
 4   cast          7982 non-null   object
 5   country       7976 non-null   object
 6   date_added    8797 non-null   object
 7   release_year  8807 non-null   int64 
 8   rating        8803 non-null   object
 9   duration      8804 non-null   object
 10  listed_in     8807 non-null   object
 11  description   8807 non-null   object
dtypes: int64(1), object(11)
memory usage: 825.8+ KB
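
The non-null counts above imply missing values in `director`, `cast`, `country`, `date_added`, `rating`, and `duration`. One way to tabulate them is sketched below. To stay runnable on its own it uses a small hypothetical frame (`toy`) in place of `df`, and the helper name `missing_summary` is an assumption for illustration, not something defined in this notebook.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return missing-value counts and percentages for columns that have any."""
    counts = frame.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })
    # Keep only columns with at least one missing value, largest first.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Hypothetical toy data standing in for the Netflix frame.
toy = pd.DataFrame({
    "show_id": ["s1", "s2", "s3", "s4"],
    "director": ["Kirsten Johnson", np.nan, "Julien Leclercq", np.nan],
    "country": ["United States", "South Africa", np.nan, "India"],
})
print(missing_summary(toy))
```

On the real data the same call would be `missing_summary(df)`.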


In [18]:
df.columns

Index(['show_id', 'type', 'title', 'director', 'cast', 'country', 'date_added',
       'release_year', 'rating', 'duration', 'listed_in', 'description'],
      dtype='object')

In [19]:
df.describe()

Unnamed: 0,release_year
count,8807.0
mean,2014.180198
std,8.819312
min,1925.0
25%,2013.0
50%,2017.0
75%,2019.0
max,2021.0


In [21]:
df['rating'].unique()

array(['PG-13', 'TV-MA', 'PG', 'TV-14', 'TV-PG', 'TV-Y', 'TV-Y7', 'R',
       'TV-G', 'G', 'NC-17', '74 min', '84 min', '66 min', 'NR', nan,
       'TV-Y7-FV', 'UR'], dtype=object)

In [24]:
df['duration'].unique()

array(['90 min', '2 Seasons', '1 Season', '91 min', '125 min',
       '9 Seasons', '104 min', '127 min', '4 Seasons', '67 min', '94 min',
       '5 Seasons', '161 min', '61 min', '166 min', '147 min', '103 min',
       '97 min', '106 min', '111 min', '3 Seasons', '110 min', '105 min',
       '96 min', '124 min', '116 min', '98 min', '23 min', '115 min',
       '122 min', '99 min', '88 min', '100 min', '6 Seasons', '102 min',
       '93 min', '95 min', '85 min', '83 min', '113 min', '13 min',
       '182 min', '48 min', '145 min', '87 min', '92 min', '80 min',
       '117 min', '128 min', '119 min', '143 min', '114 min', '118 min',
       '108 min', '63 min', '121 min', '142 min', '154 min', '120 min',
       '82 min', '109 min', '101 min', '86 min', '229 min', '76 min',
       '89 min', '156 min', '112 min', '107 min', '129 min', '135 min',
       '136 min', '165 min', '150 min', '133 min', '70 min', '84 min',
       '140 min', '78 min', '7 Seasons', '64 min', '59 min', '139 min',
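
The `duration` values mix two units: minutes for movies and seasons for TV shows. A minimal sketch of splitting them into a numeric value and a unit follows. The frame `toy` and the column names `duration_value` and `duration_unit` are hypothetical, chosen only for this illustration.

```python
import pandas as pd

# Hypothetical toy data using duration strings of the kind shown above.
toy = pd.DataFrame({
    "type": ["Movie", "TV Show", "TV Show", "Movie"],
    "duration": ["90 min", "2 Seasons", "1 Season", None],
})

# Capture the leading number and the unit ("min", "Season" or "Seasons").
parts = toy["duration"].str.extract(
    r"^(?P<duration_value>\d+)\s+(?P<duration_unit>min|Seasons?)$"
)
parts["duration_value"] = pd.to_numeric(parts["duration_value"])
# Normalise the plural so that only two unit labels remain.
parts["duration_unit"] = parts["duration_unit"].replace({"Seasons": "Season"})

toy = toy.join(parts)
print(toy)
```

Rows with a missing `duration` stay missing in both new columns, so they can be handled later during pre-processing.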
    

In [25]:
df['rating'].value_counts()

rating
TV-MA       3207
TV-14       2160
TV-PG        863
R            799
PG-13        490
TV-Y7        334
TV-Y         307
PG           287
TV-G         220
NR            80
G             41
TV-Y7-FV       6
NC-17          3
UR             3
74 min         1
84 min         1
66 min         1
Name: count, dtype: int64

In [28]:
df.loc[df['rating'] == '74 min'].head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
5813,s5814,Movie,Louis C.K.: Live at the Comedy Store,Louis C.K.,Louis C.K.,United States,"August 15, 2016",2015,66 min,,Movies,The comic puts his trademark hilarious/thought...


In [29]:
df.loc[df['rating'] == '84 min'].head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
5794,s5795,Movie,Louis C.K.: Hilarious,Louis C.K.,Louis C.K.,United States,"September 16, 2016",2010,84 min,,Movies,Emmy-winning comedy writer Louis C.K. brings h...


In [30]:
df.loc[df['rating'] == '66 min'].head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
5813,s5814,Movie,Louis C.K.: Live at the Comedy Store,Louis C.K.,Louis C.K.,United States,"August 15, 2016",2015,66 min,,Movies,The comic puts his trademark hilarious/thought...
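
The lookups above show rows where `rating` holds a runtime while `duration` is empty. The sketch below finds all such rows in one pass and moves the value into `duration`. It assumes that is the intended repair, and it uses a small hypothetical frame (`toy`) rather than `df` so that it runs on its own.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data modelled on the rows shown above.
toy = pd.DataFrame({
    "title": [
        "Louis C.K.: Hilarious",
        "Louis C.K.: Live at the Comedy Store",
        "Dick Johnson Is Dead",
    ],
    "rating": ["84 min", "66 min", "PG-13"],
    "duration": [np.nan, np.nan, "90 min"],
})

# Rows where `rating` looks like a runtime ("<digits> min") and `duration` is empty.
misplaced = toy["rating"].str.fullmatch(r"\d+ min", na=False) & toy["duration"].isna()

# Move the runtime into `duration`, then mark the rating as missing.
toy.loc[misplaced, "duration"] = toy.loc[misplaced, "rating"]
toy.loc[misplaced, "rating"] = np.nan
print(toy)
```

Matching on a pattern instead of listing `'74 min'`, `'84 min'`, and `'66 min'` individually means any other runtime that ends up in `rating` is caught as well.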


### **Observations:** 
----
1. There are 8807 rows and 12 columns.
2. The columns in the dataset are:
    - `'show_id', 'type', 'title', 'director', 'cast', 'country', 'date_added',
       'release_year', 'rating', 'duration', 'listed_in', 'description'`
3. Only **1 column** (`release_year`) is stored as _numeric_, but on closer inspection **2 columns** could be _numeric_.
4. The second is `show_id`, which becomes numeric once the leading `s` is removed.
----
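
The `show_id` observation can be sketched as follows. The frame `toy` and the column name `show_id_num` are hypothetical, used only to keep the example self-contained.

```python
import pandas as pd

# Hypothetical toy data with identifiers in the same "s<number>" form.
toy = pd.DataFrame({"show_id": ["s1", "s2", "s5814"]})

# Strip the leading "s" and cast the remaining digits to integers.
toy["show_id_num"] = (
    toy["show_id"].str.replace(r"^s", "", regex=True).astype("int64")
)
print(toy.dtypes)
```

Because `show_id` is an identifier, the numeric form is mainly useful for sorting or joining; summary statistics on it carry no meaning.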

## **3. Data Pre-processing**