
Strategy 1: Drop Rows with Missing Values

In this method, any row containing at least one missing value is removed entirely from the dataset. This guarantees that no gaps remain, but it can discard a large share of the data, especially when missing values are scattered across many rows.

Strategy 2: Fill Missing Values with Mean (or 'Unknown')

In this method:

    Missing values in numerical columns are filled with the mean of that column.
    Missing values in categorical columns are replaced with the string 'Unknown'.

This approach preserves all rows and prevents data loss, making it especially useful when missing values are limited and not critical.
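
The two strategies can be compared on a small example. The following is a minimal sketch on a hypothetical toy DataFrame; the column names mirror the dataset, but the values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data (values are invented, not taken from the movie dataset)
toy = pd.DataFrame({
    "Meta_score": [80.0, np.nan, 90.0, 70.0],
    "Certificate": ["A", "UA", None, "U"],
})

# Strategy 1: drop any row with at least one missing value
dropped = toy.dropna()

# Strategy 2: mean for the numerical column, 'Unknown' for the categorical one
filled = toy.copy()
filled["Meta_score"] = filled["Meta_score"].fillna(filled["Meta_score"].mean())
filled["Certificate"] = filled["Certificate"].fillna("Unknown")

print(dropped.shape)  # (2, 2)
print(filled.shape)   # (4, 2)
```

Dropping keeps only the two complete rows, while filling keeps all four.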

Which Strategy is Better?

The second strategy (filling missing values) is generally better in this scenario because:

    The dataset is small (1,000 rows), and dropping incomplete rows leaves only 713 of them, a loss of about 29% of the data.
    Filling with the mean leaves each numerical column's average unchanged while keeping every row available for analysis.
    Replacing categorical gaps with 'Unknown' helps maintain dataset integrity for further analysis or modeling.
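
One way to check how costly dropping rows would be is to count missing values per column and measure the share of rows that `dropna()` removes. The sketch below uses a hypothetical helper, `row_loss_fraction`, and made-up toy values; it is an illustration, not part of the original analysis.

```python
import numpy as np
import pandas as pd

def row_loss_fraction(frame: pd.DataFrame) -> float:
    """Share of rows that dropna() would remove (hypothetical helper)."""
    return 1 - len(frame.dropna()) / len(frame)

# Hypothetical toy data: gaps are scattered across different rows
toy_missing = pd.DataFrame({
    "Gross": [1.0, np.nan, 3.0, np.nan, 5.0],
    "Meta_score": [np.nan, 60.0, 70.0, 80.0, 90.0],
})

missing_per_column = toy_missing.isna().sum()  # missing count for each column
loss = row_loss_fraction(toy_missing)          # fraction of rows lost by dropping

print(missing_per_column)
print(round(loss, 2))
```

Here no single column is more than 40% empty, yet three of the five rows contain a gap, which is why scattered missing values make dropping rows expensive.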


In [187]:
import pandas as pd

# Load dataset
df = pd.read_csv("top1000movies.csv")

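# Strategy 1: drop every row that has at least one missing value
df_dropna = df.dropna()
print("Shape after dropping rows with missing values:", df_dropna.shape)

# Strategy 2: fill missing values on a separate copy
df_fillna = df.copy()

# Strip the ' min' suffix on df_fillna (not df), so the copy being cleaned holds numeric runtimes
df_fillna['Runtime'] = df_fillna['Runtime'].str.replace(' min', '', regex=False).astype(int)
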
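numeric_cols = ['Released_Year', 'Runtime', 'IMDB_Rating', 'Meta_score', 'No_of_Votes', 'Gross']

# Assumption: Gross may be stored as text with thousands separators (e.g. "1,234,567"),
# which pd.to_numeric would otherwise coerce to NaN
df_fillna['Gross'] = df_fillna['Gross'].astype(str).str.replace(',', '', regex=False)

# Convert to numbers; unparseable entries become NaN and are then filled with the column mean
for col in numeric_cols:
    df_fillna[col] = pd.to_numeric(df_fillna[col], errors='coerce')

df_fillna[numeric_cols] = df_fillna[numeric_cols].fillna(df_fillna[numeric_cols].mean())
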
# For categorical columns, fill with 'Unknown' (lowercased to 'unknown' in the next cell)
categorical_cols = ['Certificate', 'Genre', 'Overview', 'Director', 'Star1', 'Star2', 'Star3', 'Star4']
df_fillna[categorical_cols] = df_fillna[categorical_cols].fillna('Unknown')

df = df_fillna.copy()


Shape after dropping rows with missing values: (713, 16)


In [188]:
# Columns: Poster_Link, Series_Title, Released_Year, Certificate, Runtime, Genre, IMDB_Rating, Overview, Meta_score, Director, Star1, Star2, Star3, Star4, No_of_Votes, Gross

# Normalize every categorical column (including Certificate): trim whitespace and lowercase
for col in categorical_cols:
    df[col] = df[col].str.strip().str.lower()

# Define a mapping dictionary
certificate_mapping = {
    'u/a': 'ua',  # Merge variations
    'pg-13': 'pg13',
    'tv-pg': 'pg',
    'tv-14': 'pg13',
    'tv-ma': 'r',
    'gp': 'pg',
    'passed': 'approved',
    '16': 'r',  # Assume 16+ is similar to R rating
    'unrated': 'unknown'  # Unrated movies can be labeled as 'unknown'
}

# Apply mapping
df['Certificate'] = df['Certificate'].replace(certificate_mapping)

# Frequency tables for inspecting the cleaned columns
vc_certificate = df['Certificate'].value_counts()
vc_release_year = df['Released_Year'].value_counts()
vc_genre = df['Genre'].value_counts()
vc_metascore = df['Meta_score'].value_counts()
vc_director = df['Director'].value_counts()
vc_runtime = df['Runtime'].value_counts()




In [189]:
from sklearn.preprocessing import MultiLabelBinarizer

df_en = df.copy()

# One-hot encode Certificate; drop_first=True omits the first category as the baseline
df_en = pd.get_dummies(df_en, columns=['Certificate'], prefix='Cert', drop_first=True)

# Split the comma-separated genre string into a list of genres per movie
df_en['Genre'] = df_en['Genre'].str.lower().str.split(', ')

# Multi-hot encode the genre lists (one 0/1 column per genre)
mlb = MultiLabelBinarizer()
genre_dummies = pd.DataFrame(mlb.fit_transform(df_en['Genre']), columns=mlb.classes_, index=df_en.index)

# Merge with the encoded DataFrame and drop the raw Genre column
df_en = pd.concat([df_en, genre_dummies], axis=1)
df_en = df_en.drop('Genre', axis=1)


print(df_en.columns)


Index(['Poster_Link', 'Series_Title', 'Released_Year', 'Runtime',
       'IMDB_Rating', 'Overview', 'Meta_score', 'Director', 'Star1', 'Star2',
       'Star3', 'Star4', 'No_of_Votes', 'Gross', 'Cert_approved', 'Cert_g',
       'Cert_pg', 'Cert_pg13', 'Cert_r', 'Cert_u', 'Cert_ua', 'Cert_unknown',
       'action', 'adventure', 'animation', 'biography', 'comedy', 'crime',
       'drama', 'family', 'fantasy', 'film-noir', 'history', 'horror', 'music',
       'musical', 'mystery', 'romance', 'sci-fi', 'sport', 'thriller', 'war',
       'western'],
      dtype='object')
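
The genre encoding can be sanity-checked on a few rows. The sketch below is a pandas-only alternative that swaps `MultiLabelBinarizer` for `Series.str.get_dummies`; it is not the method used in the cell above, and the genre strings are made up for illustration.

```python
import pandas as pd

# Hypothetical genre strings in the same "a, b, c" format as the Genre column
genres = pd.Series(["Crime, Drama", "Action, Crime, Drama", "Drama"])

# Lowercase, then create one indicator column per genre
encoded = genres.str.lower().str.get_dummies(sep=", ")

print(encoded.columns.tolist())  # ['action', 'crime', 'drama']
print(encoded)
```

Each row is marked in every genre column it belongs to, which is the same multi-hot layout as the genre columns listed in the output above.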
