In this section, we load the Netflix dataset and normalize the `title` column by stripping leading and trailing whitespace and converting every title to lowercase. This keeps differences in spacing or capitalization from hiding duplicate titles later.

In [None]:
import pandas as pd
df = pd.read_csv("netflix1.csv")

df['title'] = df['title'].str.strip().str.lower()

In this section, we look for duplicate titles. We store the normalized title (trimmed and lowercased) in a helper column, `title_clean`, count how often each value occurs, and display the rows whose title appears more than once, along with relevant metadata (title, director, and country).

In [None]:
df['title_clean'] = df['title'].str.strip().str.lower()
title_counts = df['title_clean'].value_counts()
duplicates = title_counts[title_counts > 1]
print("Titles with duplicates:\n", duplicates)

duplicate_rows = df[df['title_clean'].isin(duplicates.index)]
print("\nFull rows with duplicate titles:\n", duplicate_rows[['title', 'director', 'country']])


Titles with duplicates:
 title_clean
esperando la carroza        2
9-feb                       2
fullmetal alchemist         2
consequences                2
15-aug                      2
death note                  2
sin senos sí hay paraíso    2
love in a puff              2
22-jul                      2
Name: count, dtype: int64

Full rows with duplicate titles:
                          title            director        country
220             love in a puff      Pang Ho-cheung      Hong Kong
393                      9-feb           Not Given       Pakistan
415       esperando la carroza     Alejandro Doria      Argentina
537                      9-feb           Not Given       Pakistan
2590              consequences        Ozan Açıktan         Turkey
2925                    15-aug  Swapnaneel Jayakar          India
3285                    22-jul     Paul Greengrass         Norway
3637       fullmetal alchemist       Fumihiko Sori          Japan
3819                death note        
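As a compact alternative to the `value_counts` plus `isin` approach above, the sketch below uses `Series.duplicated(keep=False)` to flag every row that belongs to a duplicate group. The toy DataFrame and its values are made up for illustration and are not taken from `netflix1.csv`.

```python
import pandas as pd

# Toy data standing in for the Netflix table (illustrative values only).
toy = pd.DataFrame({
    "title": ["  Title A", "title a ", "Title B", "Title C"],
    "director": ["Director 1", "Director 2", "Director 3", "Director 4"],
})

# Normalize the same way as in the notebook: trim whitespace, then lowercase.
toy["title_clean"] = toy["title"].str.strip().str.lower()

# keep=False marks every member of a duplicate group, not only the repeats.
dup_mask = toy["title_clean"].duplicated(keep=False)

# Sorting by the cleaned title places the members of each group next to each other.
dup_rows = toy[dup_mask].sort_values("title_clean")
print(dup_rows[["title", "director"]])
```

Sorting the result makes it easier to compare directors and countries side by side and to judge whether two rows are true duplicates or different works that share a title.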

In this section, we will clean and standardize the `date_added` column. First, we parse it into pandas datetime values (anything that cannot be parsed becomes `NaT` because of `errors='coerce'`), and then format it as dd-mm-yyyy text. Note that the formatting step turns the column back into strings, so it has to be parsed again before any date-based calculation.

In [None]:
df['date_added'] = pd.to_datetime(df['date_added'], dayfirst=True, errors='coerce')

df['date_added'] = df['date_added'].dt.strftime('%d-%m-%Y')

print(df[['title', 'date_added']].head(15))

                               title  date_added
0               dick johnson is dead  25-09-2021
1                          ganglands  24-09-2021
2                      midnight mass  24-09-2021
3   confessions of an invisible girl  22-09-2021
4                            sankofa  24-09-2021
5      the great british baking show  24-09-2021
6                       the starling  24-09-2021
7    motu patlu in the game of zones  01-05-2021
8                       je suis karl  23-09-2021
9           motu patlu in wonderland  01-05-2021
10    motu patlu: deep sea adventure  01-05-2021
11          motu patlu: mission moon  01-05-2021
12                  99 songs (tamil)  21-05-2021
13       bridgerton - the afterparty  13-07-2021
14     bling empire - the afterparty  12-06-2021
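Because `errors='coerce'` hides parsing failures, it can help to count how many values were lost and to pass an explicit `format` instead of relying on `dayfirst`. The following is a minimal sketch on made-up strings; the month/day/year layout of `raw` is an assumption, not something confirmed for `netflix1.csv`, so the format string would need to match the real file.

```python
import pandas as pd

# Hypothetical raw values; the real file's date layout is an assumption here.
raw = pd.Series(["9/25/2021", "5/1/2021", "not a date", None])

# An explicit format removes the day/month ambiguity that dayfirst has to guess at.
parsed = pd.to_datetime(raw, format="%m/%d/%Y", errors="coerce")

# errors='coerce' silently turns unparseable values into NaT, so count
# how many non-missing inputs were lost in the conversion.
n_failed = parsed.isna().sum() - raw.isna().sum()
print("Values lost to coercion:", n_failed)

# Keep the datetime column for computation; build a string copy only for display.
display = parsed.dt.strftime("%d-%m-%Y")
print(display.tolist())
```

Keeping the parsed column and the display string separate avoids a second round of parsing later on.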




In this section, we will extract the year from the `date_added` column. Because the previous step left `date_added` as dd-mm-yyyy strings, we first parse it back into datetime values using an explicit format, and then we’ll create a new column called `year_added` that stores just the year part.

In [None]:
df['date_added'] = pd.to_datetime(df['date_added'], format='%d-%m-%Y', errors='coerce')
df['year_added'] = df['date_added'].dt.year
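One detail worth checking: when a datetime column contains `NaT`, `.dt.year` returns floats (for example `2021.0`) so that the missing entries can be represented as `NaN`. The sketch below, on toy dates, shows one way to keep whole-number years by casting to pandas' nullable `Int64` dtype. The variable names are hypothetical.

```python
import pandas as pd

# Toy dates in the dd-mm-yyyy layout used above, with one missing value.
dates = pd.to_datetime(
    pd.Series(["25-09-2021", "01-05-2020", None]),
    format="%d-%m-%Y",
    errors="coerce",
)

# With a NaT present, the extracted years come back as float64.
years = dates.dt.year
print(years.dtype)

# The nullable Int64 dtype keeps integer years and marks the gap as <NA>.
years_int = years.astype("Int64")
print(years_int.tolist())

# Titles added per year; value_counts skips missing values by default.
per_year = years_int.value_counts().sort_index()
print(per_year)
```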

In this section, we will clean the `duration` column. We'll separate the numbers and the text (like minutes or seasons) into two new columns: `duration_int` and `duration_type`. Then we’ll convert the numeric part to float so that it can be used for analysis later.

In [None]:
df[['duration_int', 'duration_type']] = df['duration'].str.extract(r'(\d+)\s*(\w+)')
df['duration_int'] = df['duration_int'].astype('float')  
df[['duration', 'duration_int', 'duration_type']].head()

   duration  duration_int duration_type
0    90 min          90.0           min
1  1 Season           1.0        Season
2  1 Season           1.0        Season
3    91 min          91.0           min
4   125 min         125.0           min
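The same split can be written with named regex groups, which lets `str.extract` label the output columns directly. The sketch below runs on made-up values. The plural form "Seasons" and the missing entry are assumptions about what the real column may contain, and normalizing the unit to lowercase singular is an optional extra step, not part of the notebook above.

```python
import pandas as pd

# Toy durations (illustrative only), including a plural unit and a missing value.
sample = pd.DataFrame({"duration": ["90 min", "1 Season", "3 Seasons", None]})

# Named groups become the column names of the extracted DataFrame.
pattern = r"(?P<duration_int>\d+)\s*(?P<duration_type>\w+)"
parts = sample["duration"].str.extract(pattern)

# Go through float first (as in the notebook), then to nullable Int64
# so the numbers stay whole while missing rows remain <NA>.
parts["duration_int"] = parts["duration_int"].astype("float").astype("Int64")

# Optional: fold "Season" and "Seasons" into a single category.
parts["duration_type"] = parts["duration_type"].str.lower().str.rstrip("s")

print(parts)
```

With the unit normalized, grouping by `duration_type` separates movies (minutes) from shows (seasons) without splitting shows into two categories.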


In [None]:
# Drop 'title_clean' if no longer needed
df.drop(columns=['title_clean'], inplace=True)

In this section, we re-apply the title normalization (strip whitespace and lowercase) as a safeguard, then drop rows with duplicate titles, keeping only the first occurrence of each title. Finally, we’ll check the shape of the resulting dataset and export the cleaned data to a CSV file.

In [None]:
df['title'] = df['title'].str.strip().str.lower()
df = df.drop_duplicates(subset='title', keep='first')
print("Final shape after duplicate removal:", df.shape)
df.to_csv("netflix_cleaned_final.csv", index=False)

Final shape after duplicate removal: (8781, 13)
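Dropping on `title` alone treats any two rows with the same title as the same work. If different productions can share a title, a stricter key may be preferable. The sketch below compares the two choices on made-up rows. Using `title`, `director`, and `country` as the key is an assumption for illustration, not the method used above.

```python
import pandas as pd

# Made-up rows: "title a" is shared by two different directors,
# while "title b" is repeated exactly.
toy = pd.DataFrame({
    "title": ["title a", "title a", "title b", "title b"],
    "director": ["Director 1", "Director 2", "Not Given", "Not Given"],
    "country": ["Japan", "United States", "Pakistan", "Pakistan"],
})

# Title-only key: one row per title, regardless of the other columns.
by_title = toy.drop_duplicates(subset="title", keep="first")

# Composite key: only rows that agree on all three columns are collapsed.
by_key = toy.drop_duplicates(subset=["title", "director", "country"], keep="first")

print("Rows kept with title-only key:", len(by_title))
print("Rows kept with composite key:", len(by_key))
```

Comparing the two counts on the real data gives a quick sense of how many same-title rows differ in their metadata.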


In [11]:
# Save the cleaned data, including the year_added column, under a second file name
df.to_csv("netflix_cleaned_with_year.csv", index=False)