In [4]:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

df = pd.read_csv(r"netflix_dataset.csv")
print(df.head())

  show_id     type  title           director  \
0      s1  TV Show     3%                NaN   
1      s2    Movie   7:19  Jorge Michel Grau   
2      s3    Movie  23:59       Gilbert Chan   
3      s4    Movie      9        Shane Acker   
4      s5    Movie     21     Robert Luketic   

                                                cast        country  \
0  João Miguel, Bianca Comparato, Michel Gomes, R...         Brazil   
1  Demián Bichir, Héctor Bonilla, Oscar Serrano, ...         Mexico   
2  Tedd Chan, Stella Chung, Henley Hii, Lawrence ...      Singapore   
3  Elijah Wood, John C. Reilly, Jennifer Connelly...  United States   
4  Jim Sturgess, Kevin Spacey, Kate Bosworth, Aar...  United States   

          date_added  release_year rating   duration  \
0    August 14, 2020          2020  TV-MA  4 Seasons   
1  December 23, 2016          2016  TV-MA     93 min   
2  December 20, 2018          2011      R     78 min   
3  November 16, 2017          2009  PG-13     80 min   
4   

# Netflix Dataset Summary

## Basic Information
- **Number of rows:** 7787
- **Number of columns:** 12

## Dataset Description
This dataset contains information about Netflix movies and TV shows. It includes the title, type (Movie/TV Show), release year, rating, duration, and other descriptive attributes. It is well suited to practicing data cleaning, analysis, and visualization.

# Column Descriptions
## Column Overview
| Column Name | Data Type | Description |
|------------|-----------|-------------|
| show_id | object | Unique ID for each movie or TV show on Netflix. |
| type | object | Specifies whether the content is a Movie or a TV Show. |
| title | object | Title of the movie or TV show. |
| director | object | Name of the director of the content. |
| cast | object | Main actors and actresses involved in the content. |
| country | object | Country or countries where the content was produced. |
| date_added | object | Date when the content was added to Netflix. |
| release_year | int64 | Year when the movie or TV show was originally released. |
| rating | object | Age rating of the content. |
| duration | object | Duration of the movie (in minutes) or number of seasons for TV shows. |
| listed_in | object | Genre or categories the content belongs to. |
| description | object | Brief summary or storyline of the content. |


## Additional Information

This dataset contains details of movies and TV shows available on Netflix.  
It includes only content information, not user ratings or watch history.  
Some columns have missing values, so data cleaning is required before analysis.


## Data Assessment – Netflix Dataset

During data assessment, we observed missing values in five columns: `director` (2,389), `cast` (718), `country` (507), `date_added` (10), and `rating` (7).  
The remaining columns are complete, and `date_added` is stored as text rather than as a date.  
The dataset therefore needs basic cleaning and formatting before analysis.
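
A missing-value assessment like this can be wrapped in a small helper that reports both counts and percentages. This is a sketch only: `missing_summary` and the `demo` frame are hypothetical names introduced for illustration, and the demo values are not the real dataset.

```python
import numpy as np
import pandas as pd

def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return count and percentage of missing values per column, worst first."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })
    # Keep only columns that actually have gaps
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Small illustrative frame (not the Netflix data)
demo = pd.DataFrame({
    "director": [np.nan, "Jorge Michel Grau", "Gilbert Chan", np.nan],
    "country": ["Brazil", "Mexico", np.nan, "United States"],
    "title": ["3%", "7:19", "23:59", "9"],
})

print(missing_summary(demo))
```

Calling `missing_summary(df)` on the loaded dataset would list only the columns that need attention.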


In [5]:
print(df.isnull().sum())

show_id            0
type               0
title              0
director        2389
cast             718
country          507
date_added        10
release_year       0
rating             7
duration           0
listed_in          0
description        0
dtype: int64


In [12]:
print(df.info())

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7787 entries, 0 to 7786
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       7787 non-null   object
 1   type          7787 non-null   object
 2   title         7787 non-null   object
 3   director      5398 non-null   object
 4   cast          7069 non-null   object
 5   country       7280 non-null   object
 6   date_added    7777 non-null   object
 7   release_year  7787 non-null   int64 
 8   rating        7780 non-null   object
 9   duration      7787 non-null   object
 10  listed_in     7787 non-null   object
 11  description   7787 non-null   object
dtypes: int64(1), object(11)
memory usage: 730.2+ KB
None


In [14]:
print(df.duplicated().sum())

0


In [15]:
print(df.describe())

       release_year
count   7787.000000
mean    2013.932580
std        8.757395
min     1925.000000
25%     2013.000000
50%     2017.000000
75%     2018.000000
max     2021.000000


In [18]:
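# Fill missing values with the placeholder string 'NA' (in date_added, 'NA' becomes NaT once the column is parsed as dates)
df[['director','cast','country','date_added','rating']] = df[['director','cast','country','date_added','rating']].fillna('NA')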

In [17]:
print(df.isnull().sum())

show_id         0
type            0
title           0
director        0
cast            0
country         0
date_added      0
release_year    0
rating          0
duration        0
listed_in       0
description     0
dtype: int64


In [19]:
df.isnull().mean() * 100

show_id         0.0
type            0.0
title           0.0
director        0.0
cast            0.0
country         0.0
date_added      0.0
release_year    0.0
rating          0.0
duration        0.0
listed_in       0.0
description     0.0
dtype: float64

In [20]:
df.duplicated().sum()

np.int64(0)

In [22]:
# Selecting the duplicated rows gives an empty frame, so every column sums to 0
df[df.duplicated()].sum()

show_id         0
type            0
title           0
director        0
cast            0
country         0
date_added      0
release_year    0
rating          0
duration        0
listed_in       0
description     0
dtype: object
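
Note that a full-row duplicate check cannot match any rows if `show_id` is unique per row, as the column description suggests. A content-level check that ignores the identifier may be more informative. The sketch below assumes `type`, `title`, and `release_year` together identify a title; that key and the `demo` frame are assumptions for illustration, not something the notebook establishes.

```python
import pandas as pd

# Illustrative frame: rows s1 and s2 describe the same content under different IDs
demo = pd.DataFrame({
    "show_id": ["s1", "s2", "s3"],
    "type": ["Movie", "Movie", "TV Show"],
    "title": ["9", "9", "3%"],
    "release_year": [2009, 2009, 2020],
})

# Full-row check: unique show_id values prevent any match
full_row_dupes = demo.duplicated().sum()

# Content-level check: compare only the assumed key columns
key_columns = ["type", "title", "release_year"]  # assumed key, adjust as needed
content_dupes = demo.duplicated(subset=key_columns).sum()

print(full_row_dupes, content_dupes)
```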

In [23]:
df['date_added'] = pd.to_datetime(df['date_added'], errors='coerce')

In [24]:
df['date_added'].dtype

dtype('<M8[ns]')

In [25]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7787 entries, 0 to 7786
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   show_id       7787 non-null   object        
 1   type          7787 non-null   object        
 2   title         7787 non-null   object        
 3   director      7787 non-null   object        
 4   cast          7787 non-null   object        
 5   country       7787 non-null   object        
 6   date_added    7689 non-null   datetime64[ns]
 7   release_year  7787 non-null   int64         
 8   rating        7787 non-null   object        
 9   duration      7787 non-null   object        
 10  listed_in     7787 non-null   object        
 11  description   7787 non-null   object        
dtypes: datetime64[ns](1), int64(1), object(10)
memory usage: 730.2+ KB
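
The `df.info()` output above shows 7,689 non-null `date_added` values out of 7,787, so 98 entries are `NaT` after conversion: the 10 originally missing values (filled with `'NA'`) plus 88 strings that `errors='coerce'` could not parse. One plausible cause, not confirmed by this notebook, is stray whitespace or inconsistent formatting in the raw strings. The sketch below shows one way to clean the text first and list whatever still fails; the `raw` series is made-up sample data and the `"%B %d, %Y"` format is assumed from the preview rows.

```python
import pandas as pd

# Made-up sample: a clean date, one with leading whitespace,
# the 'NA' placeholder, a true missing value, and an unparseable string
raw = pd.Series(["August 14, 2020", " December 23, 2016", "NA", None, "2020/13/45"])

# Turn the 'NA' placeholder back into a real missing value, then trim whitespace
cleaned = raw.mask(raw == "NA").str.strip()

# Parse with an explicit format; anything that does not match becomes NaT
parsed = pd.to_datetime(cleaned, format="%B %d, %Y", errors="coerce")

# Strings that were present but could not be parsed deserve manual review
failed = cleaned[parsed.isna() & cleaned.notna()]

print(parsed)
print(failed)
```

Inspecting `failed` on the real column would show whether the 88 unparsed values share a common pattern before deciding how to handle them.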


## Final Conclusion

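The dataset was checked for missing values, duplicate rows, and incorrect data types. No duplicate rows were found. Missing values in `director`, `cast`, `country`, and `rating` were filled with the placeholder string `'NA'`, and the `date_added` column was converted from text to `datetime64[ns]`. The conversion left 98 `date_added` entries as `NaT` (7,689 of 7,787 are non-null): the 10 originally missing values and 88 further strings that `errors='coerce'` could not parse. These entries should be reviewed before any time-based analysis. Apart from that, the dataset is consistent and ready for analysis and visualization.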
