In [2]:
import pandas as pd

# Step 1: Load the dataset
df = pd.read_csv('/content/netflixxx.csv')

# Step 2: View information
print("Initial Data Info:\n")
df.info()  # info() prints its own report and returns None, so print() would add a stray "None"
print("\nMissing values:\n")
print(df.isnull().sum())

# Step 3: Handle missing values
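# Assign the filled column back instead of calling fillna(inplace=True) on a
# column selection; the chained form raises a FutureWarning and stops working in pandas 3.0.
df['director'] = df['director'].fillna('Unknown')
df['cast'] = df['cast'].fillna('Unknown')
df['country'] = df['country'].fillna('Not Specified')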
df.dropna(subset=['date_added', 'rating'], inplace=True)

# Step 4: Remove duplicates
df.drop_duplicates(inplace=True)

# Step 5: Standardize text values
df['country'] = df['country'].str.lower().str.strip()
df['rating'] = df['rating'].str.upper().str.strip()

# Step 6: Convert date formats
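# Strip stray whitespace first: errors='coerce' silently turns anything unparseable into NaT
df['date_added'] = pd.to_datetime(df['date_added'].str.strip(), errors='coerce')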

# Step 7: Rename columns to lowercase, no spaces
df.columns = [col.strip().lower().replace(" ", "_") for col in df.columns]

# Step 8: Final check on data types and cleaned preview
print("\nCleaned Data Info:\n")
df.info()  # prints directly; no print() wrapper needed
print("\nSample of Cleaned Data:\n")
print(df.head())
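
Because Step 6 uses errors='coerce', any date string that fails to parse becomes NaT without raising an error. The sketch below shows one way to count such failures. It runs on made-up sample values, not the real dataset, and the format string "%B %d, %Y" is an assumption about how the dates are written; adjust it to match the actual column.

```python
import pandas as pd

# Made-up sample values: one clean date, one with a leading space, one unparseable.
raw = pd.Series(["August 14, 2020", " September 1, 2019", "not a date"])

# The explicit format is an assumption ("Month day, year"); strip whitespace before parsing.
parsed = pd.to_datetime(raw.str.strip(), format="%B %d, %Y", errors="coerce")

# Values that were present before parsing but are NaT afterwards failed to parse.
failed = raw[parsed.isna() & raw.notna()]
print(f"{len(failed)} value(s) could not be parsed:")
print(failed.tolist())
```

Running the same comparison on df['date_added'] before overwriting the column would show whether any rows were lost to coercion.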



Initial Data Info:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7787 entries, 0 to 7786
Data columns (total 12 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   show_id       7787 non-null   object
 1   type          7787 non-null   object
 2   title         7787 non-null   object
 3   director      5398 non-null   object
 4   cast          7069 non-null   object
 5   country       7280 non-null   object
 6   date_added    7777 non-null   object
 7   release_year  7787 non-null   int64 
 8   rating        7780 non-null   object
 9   duration      7787 non-null   object
 10  listed_in     7787 non-null   object
 11  description   7787 non-null   object
dtypes: int64(1), object(11)
memory usage: 730.2+ KB
None

Missing values:

show_id            0
type               0
title              0
director        2389
cast             718
country          507
date_added        10
release_year       0
rating             7
duration           0

The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


  df['director'].fillna('Unknown', inplace=True)
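
The warning above names two alternatives to the chained inplace call. A minimal sketch of both follows, using a small made-up frame (toy) in place of the real dataset; the column names match the notebook, but the values are invented for illustration.

```python
import pandas as pd

# Toy frame standing in for the Netflix data; the values are made up.
toy = pd.DataFrame({
    "director": ["A. Director", None],
    "cast": [None, "Some Actor"],
    "country": ["India", None],
})

# Option 1: assign the filled column back to the frame.
toy["director"] = toy["director"].fillna("Unknown")

# Option 2: fill several columns at once with a column -> value mapping.
toy = toy.fillna({"cast": "Unknown", "country": "Not Specified"})

# Every column should now report zero missing values.
print(toy.isnull().sum())
```

Either form operates on the frame itself rather than on an intermediate column object, so neither triggers the warning.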