# **EDA on NETFLIX 🍿🎥**

<img src="./images/Netflix_Homepage.jpg" alt="Netflix homepage">

---
**Author:** `Syed Ghazi Ali Zaidi`

* Contact: _sghazializaidi@gmail.com_
* Explore my code: _https://github.com/Ghazi-work_
* Connect with me: _https://www.linkedin.com/in/syed-ghazi-ali-zaidi-405931217_


---

## **Data Overview 📊**

* **Attributes:** Each record carries a show ID, type (movie or TV show), title, director, cast, country, date added, release year, rating, duration, genre listing, and description.
* **Record Structure:** Each row describes one unique show.

| Column        | Data Type | Description |
|---------------|-----------|-------------|
| show_id       | Object    | Identifier for the show |
| type          | Object    | Type of the content (Movie or TV Show) |
| title         | Object    | Title of the show |
| director      | Object    | Director of the show (if available) |
| cast          | Object    | Cast members of the show |
| country       | Object    | Country where the show was produced |
| date_added    | Object    | Date when the show was added to Netflix |
| release_year  | Numeric   | Year when the show was released |
| rating        | Object    | Content rating of the show |
| duration      | Object    | Duration of the show |
| listed_in     | Object    | Categories in which the show is listed |
| description   | Object    | Brief description of the show |

## **Motivation 🚀** 

While movie databases are aplenty, exploring the Netflix galaxy provides a unique perspective. This dataset opens doors to analyze content trends, viewer preferences, and the global impact of Netflix's vast library. Gathering this celestial data had its challenges, making it an even more intriguing resource.


## **Credits and Acknowledgements 🌟**

Heartfelt gratitude to the data curators at Netflix and the original contributors on [Kaggle](https://www.kaggle.com/shivamb/netflix-shows). Their dedication gave us this remarkable dataset and opened the door to endless possibilities.

Special thanks to [Dr. Aammar Tufail](https://github.com/AammarTufail), whose guidance has been a beacon in the sea of data exploration.

Let the analysis begin! 🚀📊

#### **Kernel Version**:
* Python 3.11.5

## **1. Importing Libraries**

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats
import plotly
# connected=True loads the plotly.js library from an online CDN instead of embedding it in the notebook.
plotly.offline.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.figure_factory as ff
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

## **2. Loading Dataset and Exploring**

In [2]:
df = pd.read_csv('./Datasets/Netflix_data.csv')

In [3]:
df.head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...


In [6]:
print(f'There are {df.shape[0]} rows and {df.shape[1]} columns in the dataset')

There are 8807 rows and 12 columns in the dataset


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       8807 non-null   object
 1   type          8807 non-null   object
 2   title         8807 non-null   object
 3   director      6173 non-null   object
 4   cast          7982 non-null   object
 5   country       7976 non-null   object
 6   date_added    8797 non-null   object
 7   release_year  8807 non-null   int64 
 8   rating        8803 non-null   object
 9   duration      8804 non-null   object
 10  listed_in     8807 non-null   object
 11  description   8807 non-null   object
dtypes: int64(1), object(11)
memory usage: 825.8+ KB
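
The non-null counts above show gaps in `director`, `cast`, `country`, `date_added`, `rating` and `duration`. A minimal sketch of turning those gaps into a per-column summary follows. The four-row `sample` frame is only an illustration built from the first rows shown by `df.head()`, and the names `missing` and `summary` are hypothetical.

```python
import pandas as pd

# Small illustrative frame (values copied from the first four rows of df.head()).
sample = pd.DataFrame({
    "title": ["Dick Johnson Is Dead", "Blood & Water", "Ganglands", "Jailbirds New Orleans"],
    "director": ["Kirsten Johnson", None, "Julien Leclercq", None],
    "country": ["United States", "South Africa", None, None],
})

# Count missing values per column, then express them as a share of all rows.
missing = sample.isna().sum()
summary = pd.DataFrame({
    "missing": missing,
    "percent": (missing / len(sample) * 100).round(2),
}).sort_values("missing", ascending=False)

print(summary)
```

Running the same two steps on the full `df` should give the complement of the non-null counts reported by `df.info()`.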


In [18]:
df.columns

Index(['show_id', 'type', 'title', 'director', 'cast', 'country', 'date_added',
       'release_year', 'rating', 'duration', 'listed_in', 'description'],
      dtype='object')

In [19]:
df.describe()

Unnamed: 0,release_year
count,8807.0
mean,2014.180198
std,8.819312
min,1925.0
25%,2013.0
50%,2017.0
75%,2019.0
max,2021.0


In [46]:
df['title'].value_counts()

title
Dick Johnson Is Dead                     1
Ip Man 2                                 1
Hannibal Buress: Comedy Camisado         1
Turbo FAST                               1
Masha's Tales                            1
                                        ..
Love for Sale 2                          1
ROAD TO ROMA                             1
Good Time                                1
Captain Underpants Epic Choice-o-Rama    1
Zubaan                                   1
Name: count, Length: 8807, dtype: int64

In [21]:
df['rating'].unique()

array(['PG-13', 'TV-MA', 'PG', 'TV-14', 'TV-PG', 'TV-Y', 'TV-Y7', 'R',
       'TV-G', 'G', 'NC-17', '74 min', '84 min', '66 min', 'NR', nan,
       'TV-Y7-FV', 'UR'], dtype=object)

In [38]:
df['description'].value_counts().sort_values(ascending=False)

description
Paranormal activity at a lush, abandoned property alarms a group eager to redevelop the site, but the eerie events may not be as unearthly as they think.    4
A surly septuagenarian gets another chance at her 20s after having her photo snapped at a studio that magically takes 50 years off her life.                 3
Multiple women report their husbands as missing but when it appears they are looking for the same man, a police officer traces their cryptic connection.     3
Challenged to compose 100 songs before he can marry the girl he loves, a tortured but passionate singer-songwriter embarks on a poignant musical journey.    3
A scheming matriarch plots to cut off her disabled stepson and his wife from the family fortune, creating a division within the clan.                        2
                                                                                                                                                            ..
A philandering small-town mechanic

In [36]:
df.loc[df['description'] == 'Paranormal activity at a lush, abandoned property alarms a group eager to redevelop the site, but the eerie events may not be as unearthly as they think.'].head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
236,s237,Movie,Boomika,Rathindran R Prasad,"Aishwarya Rajesh, Vidhu, Surya Ganapathy, Madh...",,"August 23, 2021",2021,TV-14,122 min,"Horror Movies, International Movies, Thrillers","Paranormal activity at a lush, abandoned prope..."
237,s238,Movie,Boomika (Hindi),Rathindran R Prasad,"Aishwarya Rajesh, Vidhu, Surya Ganapathy, Madh...",,"August 23, 2021",2021,TV-14,122 min,"Horror Movies, International Movies, Thrillers","Paranormal activity at a lush, abandoned prope..."
238,s239,Movie,Boomika (Malayalam),Rathindran R Prasad,"Aishwarya Rajesh, Vidhu, Surya Ganapathy, Madh...",,"August 23, 2021",2021,TV-14,122 min,"Horror Movies, International Movies, Thrillers","Paranormal activity at a lush, abandoned prope..."
239,s240,Movie,Boomika (Telugu),Rathindran R Prasad,"Aishwarya Rajesh, Vidhu, Surya Ganapathy, Madh...",,"August 23, 2021",2021,TV-14,122 min,"Horror Movies, International Movies, Thrillers","Paranormal activity at a lush, abandoned prope..."


In [37]:
df.loc[df['description'] == 'Challenged to compose 100 songs before he can marry the girl he loves, a tortured but passionate singer-songwriter embarks on a poignant musical journey.'].head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
850,s851,Movie,99 Songs,Vishwesh Krishnamoorthy,"Ehan Bhat, Edilsy Vargas, Manisha Koirala, Lis...",India,"May 21, 2021",2021,TV-14,131 min,"Dramas, International Movies, Music & Musicals",Challenged to compose 100 songs before he can ...
851,s852,Movie,99 Songs (Tamil),,,,"May 21, 2021",2021,TV-14,131 min,"Dramas, International Movies, Music & Musicals",Challenged to compose 100 songs before he can ...
852,s853,Movie,99 Songs (Telugu),,,,"May 21, 2021",2021,TV-14,131 min,"Dramas, International Movies, Music & Musicals",Challenged to compose 100 songs before he can ...


In [39]:
df.loc[df['description'] == 'A scheming matriarch plots to cut off her disabled stepson and his wife from the family fortune, creating a division within the clan.'].head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
2969,s2970,Movie,Together For Eternity,Sooraj R. Barjatya,"Salman Khan, Karisma Kapoor, Saif Ali Khan, Ta...",India,"February 1, 2020",1999,TV-G,176 min,"Dramas, International Movies, Music & Musicals",A scheming matriarch plots to cut off her disa...
7022,s7023,Movie,Hum Saath-Saath Hain,Sooraj R. Barjatya,"Salman Khan, Karisma Kapoor, Saif Ali Khan, Ta...",India,"January 1, 2018",1999,TV-G,176 min,"Dramas, International Movies, Music & Musicals",A scheming matriarch plots to cut off her disa...
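
Looking up each repeated description by its full text works, but it does not scale. One possible alternative, sketched here on a small hypothetical frame, is to flag every row whose description occurs more than once and then list the titles that share it. The placeholder descriptions (`desc A` and so on) and the names `dupes` and `shared` are assumptions for illustration.

```python
import pandas as pd

# Illustrative frame: titles come from the outputs above, descriptions are placeholders.
sample = pd.DataFrame({
    "title": ["Boomika", "Boomika (Hindi)", "99 Songs", "99 Songs (Tamil)", "Kota Factory"],
    "description": ["desc A", "desc A", "desc B", "desc B", "desc C"],
})

# keep=False marks every member of a duplicate group, not only the later repeats.
dupes = sample[sample["description"].duplicated(keep=False)]

# Collect the titles that share each description.
shared = dupes.groupby("description")["title"].agg(list)

print(shared)
```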


In [28]:
df.loc[df['rating'] == '74 min'].head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
5813,s5814,Movie,Louis C.K.: Live at the Comedy Store,Louis C.K.,Louis C.K.,United States,"August 15, 2016",2015,66 min,,Movies,The comic puts his trademark hilarious/thought...


In [29]:
df.loc[df['rating'] == '84 min'].head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
5794,s5795,Movie,Louis C.K.: Hilarious,Louis C.K.,Louis C.K.,United States,"September 16, 2016",2010,84 min,,Movies,Emmy-winning comedy writer Louis C.K. brings h...


In [30]:
df.loc[df['rating'] == '66 min'].head()

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
5813,s5814,Movie,Louis C.K.: Live at the Comedy Store,Louis C.K.,Louis C.K.,United States,"August 15, 2016",2015,66 min,,Movies,The comic puts his trademark hilarious/thought...
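
The lookups above check one suspicious `rating` value at a time. A hedged sketch of handling them in one pass: match any rating of the form `<number> min`, copy it into the empty `duration` cell, and clear the rating. The `sample` frame is a small illustration using two of the affected rows plus one normal row, and `misplaced` is a hypothetical name.

```python
import pandas as pd

# Illustrative frame: two affected rows from the outputs above and one normal row.
sample = pd.DataFrame({
    "title": ["Louis C.K.: Hilarious", "Louis C.K.: Live at the Comedy Store", "Boomika"],
    "rating": ["84 min", "66 min", "TV-14"],
    "duration": [None, None, "122 min"],
})

# True where the rating is really a duration such as "84 min".
misplaced = sample["rating"].str.fullmatch(r"\d+ min", na=False)

# Move the value to duration, then mark the rating as missing.
sample.loc[misplaced, "duration"] = sample.loc[misplaced, "rating"]
sample.loc[misplaced, "rating"] = pd.NA

print(sample)
```

Whether the cleared ratings should stay missing or be filled with a value such as `NR` is a separate decision that this sketch does not make.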


In [40]:
df['listed_in'].value_counts()

listed_in
Dramas, International Movies                          362
Documentaries                                         359
Stand-Up Comedy                                       334
Comedies, Dramas, International Movies                274
Dramas, Independent Movies, International Movies      252
                                                     ... 
Kids' TV, TV Action & Adventure, TV Dramas              1
TV Comedies, TV Dramas, TV Horror                       1
Children & Family Movies, Comedies, LGBTQ Movies        1
Kids' TV, Spanish-Language TV Shows, Teen TV Shows      1
Cult Movies, Dramas, Thrillers                          1
Name: count, Length: 514, dtype: int64

In [41]:
df['country'].value_counts()

country
United States                             2818
India                                      972
United Kingdom                             419
Japan                                      245
South Korea                                199
                                          ... 
Romania, Bulgaria, Hungary                   1
Uruguay, Guatemala                           1
France, Senegal, Belgium                     1
Mexico, United States, Spain, Colombia       1
United Arab Emirates, Jordan                 1
Name: count, Length: 748, dtype: int64
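
Many `country` values are comma-separated combinations, which is why `value_counts()` reports 748 distinct entries. A minimal sketch of counting each country on its own, assuming the separator is always a comma, is shown below on a hypothetical four-row frame.

```python
import pandas as pd

# Illustrative frame; the multi-country string is taken from the output above.
sample = pd.DataFrame({
    "country": ["United States", "India", "Mexico, United States, Spain, Colombia", None],
})

# Split each entry on commas, give every country its own row, and trim spaces.
countries = (
    sample["country"]
    .dropna()
    .str.split(",")
    .explode()
    .str.strip()
)
per_country = countries.value_counts()

print(per_country)
```

With this approach a co-production is counted once for every country involved, so the totals add up to more than the number of titles.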

In [52]:
df.groupby(['director','country']).count()

Unnamed: 0_level_0,Unnamed: 1_level_0,show_id,type,title,cast,date_added,release_year,rating,duration,listed_in,description
director,country,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1
A. L. Vijay,India,2,2,2,2,2,2,2,2,2,2
A. Raajdheep,India,1,1,1,1,1,1,1,1,1,1
A. Salaam,India,1,1,1,1,1,1,1,1,1,1
A.R. Murugadoss,India,1,1,1,1,1,1,1,1,1,1
Aadish Keluskar,India,1,1,1,1,1,1,1,1,1,1
...,...,...,...,...,...,...,...,...,...,...,...
Çagan Irmak,Turkey,1,1,1,1,1,1,1,1,1,1
Ísold Uggadóttir,"Iceland, Sweden, Belgium",1,1,1,1,1,1,1,1,1,1
Óskar Thór Axelsson,Iceland,1,1,1,1,1,1,1,1,1,1
Ömer Faruk Sorak,Turkey,2,2,2,2,2,2,2,2,2,2
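
`count()` repeats almost the same number in every column because it counts non-null cells per column. If only the number of titles per director and country is needed, `size()` gives a single column. The sketch below uses a small hypothetical frame; the titles `t1` to `t4` and the name `titles_per_director` are placeholders.

```python
import pandas as pd

# Illustrative frame; director names come from the output above, titles are placeholders.
sample = pd.DataFrame({
    "director": ["A. L. Vijay", "A. L. Vijay", "A. Raajdheep", None],
    "country": ["India", "India", "India", "India"],
    "title": ["t1", "t2", "t3", "t4"],
})

# size() counts rows per group; rows with a missing director are dropped by default.
titles_per_director = (
    sample.groupby(["director", "country"])
    .size()
    .sort_values(ascending=False)
    .rename("n_titles")
)

print(titles_per_director)
```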


### **Observations:** 
----
1. There are 8807 rows and 12 columns.
2. The columns in the dataset are:
    - `'show_id', 'type', 'title', 'director', 'cast', 'country', 'date_added',
       'release_year', 'rating', 'duration', 'listed_in', 'description'`
3. Only **1 column** (`release_year`) is stored as _numeric_, but on closer inspection **2 columns** could be _numeric_.
4. The other one is `show_id`, which becomes numeric once the leading `s` is removed.
----
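
As a rough illustration of the last observation, the leading `s` can be stripped and the rest cast to an integer. This sketch assumes every ID has the form `s<number>` and uses `Series.str.removeprefix`, which needs pandas 1.4 or later. The column name `show_num` is hypothetical.

```python
import pandas as pd

# Illustrative IDs taken from the outputs above.
sample = pd.DataFrame({"show_id": ["s1", "s2", "s237"]})

# Remove the leading "s" and convert what remains to an integer.
sample["show_num"] = sample["show_id"].str.removeprefix("s").astype(int)

print(sample)
```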

### **Advanced Observations:** 
----
1. `show_id` --> convert to int

2. `date_added` --> split into separate columns, then relate them to `release_year`

3. `rating` --> 3 rows hold a duration (`74 min`, `84 min`, `66 min`) instead of a rating, which can cause issues

4. `duration` --> make a new dataframe for movies only, then add a column with the duration in one unit (minutes or hours)

5. `duration` --> make a new dataframe for TV shows only, then add a numeric column with the number of seasons

6. `rating` --> relabel in a more meaningful manner

7. `description` --> the duplicates checked above are Indian titles listed once per language or under an alternate title

8. Missing values --> the director etc. are the same across the versions of a movie, so gaps in one version can be taken from another
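
Points 2, 4 and 5 can be sketched roughly as follows. This is an illustration on a three-row hypothetical frame, not the preprocessing used later in the notebook. It assumes dates look like `September 25, 2021` and that `duration` always starts with a number. The column names `year_added`, `month_added`, `years_to_netflix`, `duration_min` and `seasons` are made up for the example.

```python
import pandas as pd

# Illustrative frame: first two rows follow df.head(), the third has a missing date.
sample = pd.DataFrame({
    "type": ["Movie", "TV Show", "TV Show"],
    "date_added": ["September 25, 2021", "September 24, 2021", None],
    "release_year": [2020, 2021, 2021],
    "duration": ["90 min", "2 Seasons", "1 Season"],
})

# Parse the text dates; errors="coerce" turns unparseable values into NaT.
added = pd.to_datetime(
    sample["date_added"].str.strip(), format="%B %d, %Y", errors="coerce"
)
sample["year_added"] = added.dt.year
sample["month_added"] = added.dt.month

# Years between the original release and the arrival on Netflix.
sample["years_to_netflix"] = sample["year_added"] - sample["release_year"]

# Pull the leading number out of duration (minutes for movies, seasons for shows).
number = sample["duration"].str.extract(r"(\d+)", expand=False).astype(float)
movies = sample[sample["type"] == "Movie"].assign(duration_min=number)
shows = sample[sample["type"] == "TV Show"].assign(seasons=number)

print(movies[["type", "duration_min"]])
print(shows[["type", "seasons"]])
```

Keeping movies and shows in separate frames avoids mixing minutes and seasons in one numeric column.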



--> Insights to explore:

1. Average length of `description`
2. Average length of `title`
3. How many shows or movies are documentaries
4. How many titles each director made (groupby)
 
----
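
The first three insight ideas could be computed along these lines. The sketch runs on a hypothetical three-row frame with placeholder descriptions. Treating any genre that starts with `Docu` (so both `Documentaries` and `Docuseries`) as a documentary is an assumption, not a rule taken from the dataset.

```python
import pandas as pd

# Illustrative frame: titles and genres follow df.head(), descriptions are placeholders.
sample = pd.DataFrame({
    "title": ["Dick Johnson Is Dead", "Jailbirds New Orleans", "Blood & Water"],
    "listed_in": [
        "Documentaries",
        "Docuseries, Reality TV",
        "International TV Shows, TV Dramas, TV Mysteries",
    ],
    "description": ["short text", "a bit longer text", "text"],
})

# Average number of characters in titles and descriptions.
avg_title_len = sample["title"].str.len().mean()
avg_desc_len = sample["description"].str.len().mean()

# Count rows where any listed genre starts with "Docu".
is_docu = sample["listed_in"].str.contains(r"\bDocu", regex=True)
n_docu = int(is_docu.sum())

print(avg_title_len, avg_desc_len, n_docu)
```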

## **3. Data Pre-processing**