**Netflix Dataset Analysis**
- The project focuses on data cleaning and preprocessing of a Netflix dataset. 
- The primary goal is to prepare the dataset for further analysis and visualization by addressing missing values, handling duplicates, and ensuring data integrity.

**Data Import and Exploration**
The dataset is imported using Pandas, and initial exploration is conducted to understand its structure, including the number of rows and columns, and the first and last few entries.

**Handling Missing Values**
Each column is checked for null values, and the gaps are filled with a strategy suited to that column:
- The 'director' and 'cast' columns are filled with "Unknown".
- The 'country', 'date_added', and 'rating' columns are filled with each column's most frequent value (its mode).
- The 'duration' column is filled with "Unknown mins" for movies and "Unknown Seasons" for TV shows.
- The 'Duration_Movies' column is left unchanged, because TV shows have no running time in minutes.
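
The three fill strategies listed above can be sketched on a small toy table. This is a minimal illustration, not the notebook's actual cells: the `toy` DataFrame and its values are invented, and only the column names and placeholder strings come from the project.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Netflix data; the values are invented for illustration.
toy = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie", "TV Show"],
    "director": ["A. Director", np.nan, np.nan, "B. Director"],
    "country": ["India", "India", np.nan, "Mexico"],
    "duration": ["90 min", np.nan, np.nan, "2 Seasons"],
})

# Strategy 1: a fixed placeholder for a free-text column.
toy["director"] = toy["director"].fillna("Unknown")

# Strategy 2: the most frequent value (mode) for a categorical column.
toy["country"] = toy["country"].fillna(toy["country"].mode()[0])

# Strategy 3: a placeholder that depends on the row's type. The mapped Series
# is aligned by index, so it is only used where 'duration' is missing.
placeholders = toy["type"].map({"Movie": "Unknown mins", "TV Show": "Unknown Seasons"})
toy["duration"] = toy["duration"].fillna(placeholders)

print(toy)
```

Assigning the result of `fillna` back to the column, rather than calling it with `inplace=True` on a column selection, avoids relying on chained in-place modification, which newer pandas versions warn about.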

**Data Cleaning**
Duplicate entries are removed, and the cleaned dataset is saved to a new CSV file for future use.
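
As a hedged sketch of the duplicate check, the snippet below counts fully duplicated rows before dropping them, so the number removed is explicit. The `toy` table is invented for illustration and is not part of the Netflix dataset.

```python
import pandas as pd

# Toy table with one exact duplicate row (values invented for illustration).
toy = pd.DataFrame({
    "show_id": ["s1", "s2", "s2", "s3"],
    "title": ["First", "Second", "Second", "Third"],
})

# duplicated() marks every repeat of an earlier row as True.
n_duplicates = int(toy.duplicated().sum())

# drop_duplicates() returns a new DataFrame, so the result must be kept.
deduplicated = toy.drop_duplicates().reset_index(drop=True)

print(f"{n_duplicates} duplicate row(s) removed; {len(deduplicated)} rows remain")
```

Comparing `len(toy)` with `len(deduplicated)` gives the same information, but `duplicated().sum()` answers the question without creating a second table.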

**Data Analysis**
After cleaning, the next steps will involve analyzing the dataset to extract insights regarding trends, popular content, and user ratings.
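
One possible first analysis step, shown here only as a sketch, is counting titles by the year they were added and by type. The `toy` rows are invented, and the `%m/%d/%Y` date format is an assumption based on the values visible in the notebook output (for example `9/25/2021`); the full file may contain other formats.

```python
import pandas as pd

# Toy rows in the same shape as the cleaned data (values invented for illustration).
toy = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie", "Movie"],
    "date_added": ["9/25/2021", "9/24/2021", "11/20/2019", "1/11/2020"],
})

# Parse the date strings; the format is an assumption (month/day/year).
toy["date_added"] = pd.to_datetime(toy["date_added"], format="%m/%d/%Y")

# Cross-tabulate the year a title was added against its type.
titles_per_year = pd.crosstab(toy["date_added"].dt.year, toy["type"])

print(titles_per_year)
```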

**Data Visualization** 
Visualizations will be created to represent the findings effectively, helping to communicate insights clearly.

In [19]:
import numpy as np
import pandas as pd 

#importing the dataset
data = pd.read_csv("C:/Users/HP/Desktop/ANUDIP/Python/Project Python Anudip/netflix_database.csv")
#displaying information about the dataset
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8807 entries, 0 to 8806
Data columns (total 13 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   show_id          8807 non-null   object 
 1   type             8807 non-null   object 
 2   title            8807 non-null   object 
 3   director         6173 non-null   object 
 4   cast             7982 non-null   object 
 5   country          7976 non-null   object 
 6   date_added       8797 non-null   object 
 7   release_year     8807 non-null   int64  
 8   rating           8800 non-null   object 
 9   duration         8804 non-null   object 
 10  listed_in        8807 non-null   object 
 11  description      8807 non-null   object 
 12  Duration_Movies  6128 non-null   float64
dtypes: float64(1), int64(1), object(11)
memory usage: 894.6+ KB


In [2]:
#displaying the first 5 rows
print("First 5 rows are: ")
data.head(5)

First 5 rows are: 


Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,Duration_Movies
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,9/25/2021,2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm...",90.0
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,9/24/2021,2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t...",
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,9/24/2021,2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...,
3,s4,TV Show,Jailbirds New Orleans,,,,9/24/2021,2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo...",
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,9/24/2021,2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...,


In [3]:
#displaying last 5 rows
print("Last 5 rows are: ")
data.tail(5)

Last 5 rows are: 


Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,Duration_Movies
8802,s8803,Movie,Zodiac,David Fincher,"Mark Ruffalo, Jake Gyllenhaal, Robert Downey J...",United States,11/20/2019,2007,R,158 min,"Cult Movies, Dramas, Thrillers","A political cartoonist, a crime reporter and a...",158.0
8803,s8804,TV Show,Zombie Dumb,,,,7/1/2019,2018,TV-Y7,2 Seasons,"Kids' TV, Korean TV Shows, TV Comedies","While living alone in a spooky town, a young g...",
8804,s8805,Movie,Zombieland,Ruben Fleischer,"Jesse Eisenberg, Woody Harrelson, Emma Stone, ...",United States,11/1/2019,2009,R,88 min,"Comedies, Horror Movies",Looking to survive in a world taken over by zo...,88.0
8805,s8806,Movie,Zoom,Peter Hewitt,"Tim Allen, Courteney Cox, Chevy Chase, Kate Ma...",United States,1/11/2020,2006,PG,88 min,"Children & Family Movies, Comedies","Dragged from civilian life, a former superhero...",88.0
8806,s8807,Movie,Zubaan,Mozez Singh,"Vicky Kaushal, Sarah-Jane Dias, Raaghav Chanan...",India,3/2/2019,2015,TV-14,111 min,"Dramas, International Movies, Music & Musicals",A scrappy but poor boy worms his way into a ty...,111.0


In [4]:
#displaying the shape of the dataset
print("Number of rows and columns in the dataset:", data.shape)

Number of rows and columns in the dataset: (8807, 13)


In [5]:
#displaying the names of columns
data.columns

Index(['show_id', 'type', 'title', 'director', 'cast', 'country', 'date_added',
       'release_year', 'rating', 'duration', 'listed_in', 'description',
       'Duration_Movies'],
      dtype='object')

In [6]:
#slicing 10 rows starting from the 10th row (index positions 9 to 18)
data[9:19]

Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description,Duration_Movies
9,s10,Movie,The Starling,Theodore Melfi,"Melissa McCarthy, Chris O'Dowd, Kevin Kline, T...",United States,9/24/2021,2021,PG-13,104 min,"Comedies, Dramas",A woman adjusting to life after a loss contend...,104.0
10,s11,TV Show,"Vendetta: Truth, Lies and The Mafia",,,,9/24/2021,2021,TV-MA,1 Season,"Crime TV Shows, Docuseries, International TV S...","Sicily boasts a bold ""Anti-Mafia"" coalition. B...",
11,s12,TV Show,Bangkok Breaking,Kongkiat Komesiri,"Sukollawat Kanarot, Sushar Manaying, Pavarit M...",,9/23/2021,2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...","Struggling to earn a living in Bangkok, a man ...",
12,s13,Movie,Je Suis Karl,Christian Schwochow,"Luna Wedler, Jannis Niewöhner, Milan Peschel, ...","Germany, Czech Republic",9/23/2021,2021,TV-MA,127 min,"Dramas, International Movies",After most of her family is murdered in a terr...,127.0
13,s14,Movie,Confessions of an Invisible Girl,Bruno Garotti,"Klara Castanho, Lucca Picon, Júlia Gomes, Marc...",,9/22/2021,2021,TV-PG,91 min,"Children & Family Movies, Comedies",When the clever but socially-awkward Tetê join...,91.0
14,s15,TV Show,Crime Stories: India Detectives,,,,9/22/2021,2021,TV-MA,1 Season,"British TV Shows, Crime TV Shows, Docuseries",Cameras following Bengaluru police on the job ...,
15,s16,TV Show,Dear White People,,"Logan Browning, Brandon P. Bell, DeRon Horton,...",United States,9/22/2021,2021,TV-MA,4 Seasons,"TV Comedies, TV Dramas",Students of color navigate the daily slights a...,
16,s17,Movie,Europe's Most Dangerous Man: Otto Skorzeny in ...,"Pedro de Echave García, Pablo Azorín Williams",,,9/22/2021,2020,TV-MA,67 min,"Documentaries, International Movies",Declassified documents reveal the post-WWII li...,67.0
17,s18,TV Show,Falsa identidad,,"Luis Ernesto Franco, Camila Sodi, Sergio Goyri...",Mexico,9/22/2021,2020,TV-MA,2 Seasons,"Crime TV Shows, Spanish-Language TV Shows, TV ...",Strangers Diego and Isabel flee their home in ...,
18,s19,Movie,Intrusion,Adam Salky,"Freida Pinto, Logan Marshall-Green, Robert Joh...",,9/22/2021,2021,TV-14,94 min,Thrillers,After a deadly home invasion at a couple’s new...,94.0


In [7]:
#Finding null Values of dataset
data.isna().sum()

show_id               0
type                  0
title                 0
director           2634
cast                825
country             831
date_added           10
release_year          0
rating                7
duration              3
listed_in             0
description           0
Duration_Movies    2679
dtype: int64
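
Raw null counts are easier to compare as a share of all rows. In the output above, for instance, 2634 of 8807 'director' values are missing, which is roughly 30%. The sketch below shows one way to compute such percentages; the `toy` table is invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy table with some missing values (invented for illustration).
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "director": ["X", np.nan, np.nan, "Y"],
    "rating": ["PG", "R", np.nan, "PG"],
})

# isna() gives True/False per cell; the column mean of booleans is the null fraction.
null_share = (toy.isna().mean() * 100).round(1).sort_values(ascending=False)

print(null_share)
```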

In [21]:
#Handling the empty values 

"""Director"""

# Replace null values in the 'director' column with "Unknown"
data['director'] = data['director'].fillna("Unknown")

# Check if null values have been replaced
if (data['director'].isna().sum() == 0):
    print("Null values in 'director' column have been replaced")
else:
    print("Null values still exist in 'director' column")

Null values in 'director' column have been replaced


In [23]:
"""Cast"""
# Replace null values in the 'cast' column with "Unknown"
data['cast'] = data['cast'].fillna("Unknown")

# Check if null values have been replaced
if (data['cast'].isna().sum() == 0):
    print("Null values in 'cast' column have been replaced")
else:
    print("Null values still exist in 'cast' column")

Null values in 'cast' column have been replaced


In [25]:
"""Country"""
# Replace null values in the 'country' column with the most frequent value (mode) of that column
data['country'] = data['country'].fillna(data['country'].mode()[0])

# Check if null values have been replaced
if (data['country'].isna().sum() == 0):
    print("Null values in 'country' column have been replaced")
else:
    print("Null values still exist in 'country' column")

Null values in 'country' column have been replaced


In [27]:
"""Date Added"""
# Replace null values in the 'date_added' column with the most frequently occurring value
data['date_added'] = data['date_added'].fillna(data['date_added'].mode()[0])

# Check if null values have been replaced
if (data['date_added'].isna().sum() == 0):
    print("Null values in 'date_added' column have been replaced")
else:
    print("Null values still exist in 'date_added' column")

Null values in 'date_added' column have been replaced


In [29]:
"""rating"""
# Replace null values in the 'rating' column with the most frequently occurring value
data['rating'] = data['rating'].fillna(data['rating'].mode()[0])

# Check if null values have been replaced
if (data['rating'].isna().sum() == 0):
    print("Null values in 'rating' column have been replaced")
else:
    print("Null values still exist in 'rating' column")

Null values in 'rating' column have been replaced


In [31]:
"""duration"""
# Replace null values in the 'duration' column depending on whether the entry is a Movie or a TV Show
data['duration'] = data['duration'].fillna(data['type'].map({'Movie': 'Unknown mins', 'TV Show': 'Unknown Seasons'}))

# Check if null values have been replaced
if (data['duration'].isna().sum() == 0):
    print("Null values in 'duration' column have been replaced")
else:
    print("Null values still exist in 'duration' column")

Null values in 'duration' column have been replaced


In [14]:
#Not replacing Duration_Movies nulls because we don't know the running time of TV shows, and it can't be zero.
#Duration_Movies remains the same
data['Duration_Movies']

0        90.0
1         NaN
2         NaN
3         NaN
4         NaN
        ...  
8802    158.0
8803      NaN
8804     88.0
8805     88.0
8806    111.0
Name: Duration_Movies, Length: 8807, dtype: float64
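
The notebook does not show how 'Duration_Movies' was created. One plausible way to derive such a column from 'duration', given only as an assumption, is to extract the number of minutes from values like `90 min` and leave everything else as NaN. The `toy` table is invented for illustration.

```python
import pandas as pd

# Toy rows mirroring the 'duration' formats seen in the notebook output.
toy = pd.DataFrame({
    "type": ["Movie", "TV Show", "Movie"],
    "duration": ["90 min", "2 Seasons", "158 min"],
})

# Capture the digits only when the value has the form "<number> min";
# rows that do not match (for example "2 Seasons") become NaN.
minutes = toy["duration"].str.extract(r"^(\d+) min$", expand=False)

# Convert the captured text to numbers; NaN stays NaN, so the column is float.
toy["Duration_Movies"] = pd.to_numeric(minutes)

print(toy)
```

Keeping NaN for TV shows matches the notebook's choice not to fill this column: a zero would read as a real running time.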

In [15]:
#Finding null Values of dataset after doing the cleaning process
data.isna().sum()

show_id               0
type                  0
title                 0
director              0
cast                  0
country               0
date_added            0
release_year          0
rating                0
duration              0
listed_in             0
description           0
Duration_Movies    2679
dtype: int64

In [16]:
#removing duplicates (drop_duplicates returns a new DataFrame, so the result is assigned back)
data = data.drop_duplicates()
print(data)
#The dataset had 8807 rows before this step and has the same number afterwards, so it contained no duplicate rows.

     show_id     type                  title         director  \
0         s1    Movie   Dick Johnson Is Dead  Kirsten Johnson   
1         s2  TV Show          Blood & Water          Unknown   
2         s3  TV Show              Ganglands  Julien Leclercq   
3         s4  TV Show  Jailbirds New Orleans          Unknown   
4         s5  TV Show           Kota Factory          Unknown   
...      ...      ...                    ...              ...   
8802   s8803    Movie                 Zodiac    David Fincher   
8803   s8804  TV Show            Zombie Dumb          Unknown   
8804   s8805    Movie             Zombieland  Ruben Fleischer   
8805   s8806    Movie                   Zoom     Peter Hewitt   
8806   s8807    Movie                 Zubaan      Mozez Singh   

                                                   cast        country  \
0                                               Unknown  United States   
1     Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...   South Africa 

In [18]:
# Save the cleaned dataset to a CSV file
try:
    data.to_csv("cleaned_netflix_database_Anudip.csv", index=False)
    print("Data cleaning completed. Cleaned dataset saved successfully.")
except Exception as e:
    print(f"Error saving cleaned dataset: {str(e)}")

Data cleaning completed. Cleaned dataset saved successfully.
