# Exploratory Data Analysis (EDA) on Movies Dataset

This notebook demonstrates how to load and clean the `movies.csv` dataset in preparation for exploratory data analysis (EDA).

In [5]:
# Import required libraries
import pandas as pd
import numpy as np

In [6]:
# Load the movies.csv dataset
df = pd.read_csv('movies.csv')
df.head()

Unnamed: 0,MOVIES,YEAR,GENRE,RATING,ONE-LINE,STARS,VOTES,RunTime,Gross
0,Blood Red Sky,(2021),"\nAction, Horror, Thriller",6.1,\nA woman with a mysterious illness is forced ...,\n Director:\nPeter Thorwarth\n| \n Star...,21062.0,121.0,
1,Masters of the Universe: Revelation,(2021– ),"\nAnimation, Action, Adventure",5.0,\nThe war for Eternia begins again in what may...,"\n \n Stars:\nChris Wood, \nSara...",17870.0,25.0,
2,The Walking Dead,(2010–2022),"\nDrama, Horror, Thriller",8.2,\nSheriff Deputy Rick Grimes wakes up from a c...,"\n \n Stars:\nAndrew Lincoln, \n...",885805.0,44.0,
3,Rick and Morty,(2013– ),"\nAnimation, Adventure, Comedy",9.2,\nAn animated series that follows the exploits...,"\n \n Stars:\nJustin Roiland, \n...",414849.0,23.0,
4,Army of Thieves,(2021),"\nAction, Crime, Horror",,"\nA prequel, set before the events of Army of ...",\n Director:\nMatthias Schweighöfer\n| \n ...,,,
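
The preview above shows that `GENRE`, `ONE-LINE`, and `STARS` begin with newline characters and padding. A minimal sketch of one way to tidy such columns follows. The helper name `strip_text_columns` and the toy rows are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

# Toy rows shaped like the preview above; not read from movies.csv.
sample = pd.DataFrame({
    "MOVIES": ["Blood Red Sky", "Rick and Morty"],
    "GENRE": ["\nAction, Horror, Thriller", "\nAnimation, Adventure, Comedy"],
    "RATING": [6.1, 9.2],
})


def strip_text_columns(frame: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Return a copy with surrounding whitespace removed from the given text columns.

    Hypothetical helper for illustration; missing values are left as they are.
    """
    out = frame.copy()
    for col in columns:
        out[col] = out[col].str.strip()
    return out


tidy = strip_text_columns(sample, ["GENRE"])
print(tidy["GENRE"].tolist())
```

On the real data the same call could presumably be made with `["GENRE", "ONE-LINE", "STARS"]`, assuming those columns hold strings.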


## Data Cleaning Steps

We will perform the following cleaning steps:

- Check for missing values
- Remove duplicate rows
- Handle missing or invalid data
- Convert data types if necessary

In [7]:
# Check for missing values
missing_values = df.isnull().sum()
print('Missing values per column:')
print(missing_values)

# Check for duplicate rows
duplicates = df.duplicated().sum()
print(f'Number of duplicate rows: {duplicates}')

Missing values per column:
MOVIES         0
YEAR         644
GENRE         80
RATING      1820
ONE-LINE       0
STARS          0
VOTES       1820
RunTime     2958
Gross       9539
dtype: int64
Number of duplicate rows: 431
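
Raw counts are easier to judge next to the share of rows they represent. The sketch below shows one way to report both. `missing_report` is a hypothetical helper and the toy frame is made up for illustration.

```python
import numpy as np
import pandas as pd


def missing_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column, largest first."""
    counts = frame.isna().sum()
    report = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(1),
    })
    return report.sort_values("missing", ascending=False)


# Toy data: Gross is mostly missing, as in the output above.
toy = pd.DataFrame({
    "MOVIES": ["a", "b", "c", "d"],
    "RATING": [6.1, np.nan, 8.2, 9.2],
    "Gross": [np.nan, np.nan, np.nan, "$75.47M"],
})

report = missing_report(toy)
print(report)
```

Calling `missing_report(df)` on the loaded dataset should show which columns account for most of the gaps before any rows are dropped.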


In [8]:
# Remove duplicate rows
df_cleaned = df.drop_duplicates()

# Drop rows with a missing value in any column (Gross alone is missing in 9539 rows, so most rows are removed)
df_cleaned = df_cleaned.dropna()

print(f'Shape after cleaning: {df_cleaned.shape}')
df_cleaned.head()

Shape after cleaning: (460, 9)


Unnamed: 0,MOVIES,YEAR,GENRE,RATING,ONE-LINE,STARS,VOTES,RunTime,Gross
77,The Hitman's Bodyguard,(2017),"\nAction, Comedy, Crime",6.9,"\nThe world's top bodyguard gets a new client,...",\n Director:\nPatrick Hughes\n| \n Stars...,205979,118.0,$75.47M
85,Jurassic Park,(1993),"\nAction, Adventure, Sci-Fi",8.1,\nA pragmatic paleontologist visiting an almos...,\n Director:\nSteven Spielberg\n| \n Sta...,897444,127.0,$402.45M
95,Don't Breathe,(2016),"\nCrime, Horror, Thriller",7.1,"\nHoping to walk away with a massive fortune, ...",\n Director:\nFede Alvarez\n| \n Stars:\...,237601,88.0,$89.22M
111,The Lord of the Rings: The Fellowship of the Ring,(2001),"\nAction, Adventure, Drama",8.8,\nA meek Hobbit from the Shire and eight compa...,\n Director:\nPeter Jackson\n| \n Stars:...,1713028,178.0,$315.54M
125,Escape Room,(I) (2019),"\nAction, Adventure, Horror",6.4,\nSix strangers find themselves in a maze of d...,\n Director:\nAdam Robitel\n| \n Stars:\...,99351,99.0,$57.01M
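
Calling `dropna()` with no arguments keeps only rows that are complete in every column, which is why so few rows remain here. If an analysis does not need `Gross`, one alternative is to restrict the check with the `subset` parameter. This is a sketch on toy data, not the approach the notebook takes.

```python
import numpy as np
import pandas as pd

# Toy data: only one row has a Gross value.
toy = pd.DataFrame({
    "MOVIES": ["A", "B", "C", "D"],
    "RATING": [6.1, 8.2, np.nan, 7.0],
    "Gross": [np.nan, "$75.47M", np.nan, np.nan],
})

# Strict: a row is dropped if any column is missing.
strict = toy.dropna()

# Lenient: a row is dropped only if RATING is missing.
lenient = toy.dropna(subset=["RATING"])

print(strict.shape)
print(lenient.shape)
```

Which columns belong in `subset` depends on the questions the analysis is meant to answer.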


In [None]:
# Remove brackets surrounding the values in the 'YEAR' column
df['YEAR'] = df['YEAR'].str.strip().str.replace(r'^\(|\)$', '', regex=True)
df_cleaned['YEAR'] = df_cleaned['YEAR'].str.strip().str.replace(r'^\(|\)$', '', regex=True)

In [12]:
# Extract the first run of digits (the start year) from the 'YEAR' column
# A raw string avoids the invalid escape sequence warning for '\d'
df['YEAR'] = df['YEAR'].str.extract(r'(\d+)', expand=False)
df_cleaned['YEAR'] = df_cleaned['YEAR'].str.extract(r'(\d+)', expand=False)
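
The `YEAR` cleanup above leaves the values as text, and `Gross` is still a string such as `$75.47M`. The sketch below shows one possible way to convert both to numeric types. It uses sample strings copied from the outputs above and assumes every `Gross` value is expressed in millions with an `M` suffix, which has not been verified against the full file.

```python
import pandas as pd

# Sample values taken from the previews above, plus one missing entry each.
years = pd.Series(["(2021)", "(2021– )", "(2010–2022)", "(I) (2019)", None])
gross = pd.Series(["$75.47M", "$402.45M", None])

# Take the first four-digit run as the start year.
# The nullable "Int64" dtype keeps missing entries as <NA> instead of forcing floats.
start_year = pd.to_numeric(
    years.str.extract(r"(\d{4})", expand=False), errors="coerce"
).astype("Int64")

# Assumption: values look like "$<number>M". Anything else becomes NaN.
gross_millions = pd.to_numeric(
    gross.str.replace(r"^\$|M$", "", regex=True), errors="coerce"
)

print(start_year.tolist())
print(gross_millions.tolist())
```

Series runs such as `(2010–2022)` keep only their first year here. Capturing the end year as well would need a second pattern.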


## Summary

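The dataset has been loaded, deduplicated, and reduced to the 460 rows that have no missing values. The `YEAR` column now holds only the first year found in each entry, still stored as text. The cleaned data is ready for exploratory data analysis (EDA).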

In [13]:
# Save the cleaned dataset to an Excel file (needs an Excel writer engine such as openpyxl installed)
df_cleaned.to_excel('movies_cleaned.xlsx', index=False)
print('Cleaned data saved to movies_cleaned.xlsx')

Cleaned data saved to movies_cleaned.xlsx
