# Movie Dataset Analysis

Welcome to this beginner-friendly guide to cleaning and analyzing a movie dataset! In this notebook, we will load, clean, analyze, encode, and normalize movie data so that it is ready for machine learning models.

## 🎬 Data Wrangler Challenge

Transform messy movie data into a clean, structured format suitable for machine learning!

![Workflow Image](images/movie_data_analysis.png)

_"Real data is messy - let's clean it up!"_

### 📋 Your Mission
1. 🔍 Load and inspect the movie dataset
2. 🧹 Clean missing data (`?`, `N/A`, empty values)
3. 📊 Group by genre and calculate average IMDB rating
4. 🔢 Encode categorical variables (`Genre`, `Director`)
5. 📏 Normalize numerical columns (`Budget`, `Duration_Minutes`)
6. 💾 Save the cleaned dataset

### 🎭 Sample Movie Dataset
```csv
Movie_Name,Lead_Actor,Director,Release_Year,Lead_Heroine,Main_Villain,Main_Comedy,Music_Director,Genre,Budget,Box_Office_Gross,Number_of_Screens,IMDB_Rating,Duration_Minutes,Nominations
Eeswar,Prabhas,Jayanth C. Paranjee,2002,Sridevi Vijaykumar,Nassar,Brahmanandam,R.P. Patnaik,Drama,50000000,80000000,450,5.1,137,0
Chirutha,Ram Charan,Puri Jagannadh,2007,Neha Sharma,Prakash Raj,N/A,Mani Sharma,Action,35,40,?,5.2,117,0
Magadheera,Ram Charan,S.S. Rajamouli,2009,Kajal Aggarwal,Dev Gill,Sunil,M.M. Keeravani,Fantasy Action,40,150,?,7.7,158,5
Govindudu Andari Vaadele,Ram Charan,Krishna Vamsi,2014,Kajal Aggarwal,?,N/A,Devi Sri Prasad,Family Drama,50,72,?,5.7,160,1
Orange,Ram Charan,Bhaskar,2010,Genelia D'Souza,N/A,N/A,Devi Sri Prasad,Romantic Drama,40,60,?,6.6,155,2
```
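
To see what "messy" means here, the sample above can be loaded straight from a string and inspected. This is a minimal sketch: the variable name `sample_csv` is only for illustration, and in the real challenge you would read `movie_dataset.csv` instead.

```python
import io

import pandas as pd

# The sample rows from above, kept in a string so the example is self-contained.
sample_csv = """Movie_Name,Lead_Actor,Director,Release_Year,Lead_Heroine,Main_Villain,Main_Comedy,Music_Director,Genre,Budget,Box_Office_Gross,Number_of_Screens,IMDB_Rating,Duration_Minutes,Nominations
Eeswar,Prabhas,Jayanth C. Paranjee,2002,Sridevi Vijaykumar,Nassar,Brahmanandam,R.P. Patnaik,Drama,50000000,80000000,450,5.1,137,0
Chirutha,Ram Charan,Puri Jagannadh,2007,Neha Sharma,Prakash Raj,N/A,Mani Sharma,Action,35,40,?,5.2,117,0
Magadheera,Ram Charan,S.S. Rajamouli,2009,Kajal Aggarwal,Dev Gill,Sunil,M.M. Keeravani,Fantasy Action,40,150,?,7.7,158,5
Govindudu Andari Vaadele,Ram Charan,Krishna Vamsi,2014,Kajal Aggarwal,?,N/A,Devi Sri Prasad,Family Drama,50,72,?,5.7,160,1
Orange,Ram Charan,Bhaskar,2010,Genelia D'Souza,N/A,N/A,Devi Sri Prasad,Romantic Drama,40,60,?,6.6,155,2
"""

movies = pd.read_csv(io.StringIO(sample_csv))

print(movies.shape)                # rows and columns
print(movies.dtypes)               # Number_of_Screens is not numeric yet
print(movies.isna().sum())         # values pandas already treats as missing
print(movies.isin(["?"]).sum())    # '?' placeholders that still need cleaning
```

Two things stand out: `pd.read_csv()` already turns `N/A` into `NaN` by default, while `?` survives as text and keeps `Number_of_Screens` from being read as a number. The first row's `Budget` also looks like it uses a different unit than the others, which is worth checking before normalizing.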

### 🔧 Step-by-Step Approach
1. **Load:** Use `pd.read_csv()` and inspect data
2. **Clean:** Replace `'?'` and `'N/A'` with `np.nan` (by default `pd.read_csv()` already reads `N/A` and empty fields as `NaN`, but not `?`)
3. **Analyze:** Group by genre and analyze ratings
4. **Encode:** Convert categorical variables to numbers
5. **Normalize:** Scale numerical columns (`Budget`, `Duration_Minutes`) to the 0-1 range
6. **Export:** Save as `'cleaned_movies.csv'`
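
The cleaning step could look roughly like this. It is a sketch on a tiny hand-built frame with three of the sample movies, and the fill strategy (`"Unknown"` for text, the median for numbers) is an assumption, not the only reasonable choice.

```python
import numpy as np
import pandas as pd

# A few columns from the sample, built as strings so the placeholders are
# still present (as they would be with keep_default_na=False).
movies = pd.DataFrame({
    "Movie_Name": ["Eeswar", "Chirutha", "Govindudu Andari Vaadele"],
    "Main_Villain": ["Nassar", "Prakash Raj", "?"],
    "Main_Comedy": ["Brahmanandam", "N/A", "N/A"],
    "Number_of_Screens": ["450", "?", "?"],
})

# 1. Turn every placeholder into a real missing value.
movies = movies.replace(["?", "N/A", ""], np.nan)

# 2. Convert columns that should be numeric; anything unparseable becomes NaN.
movies["Number_of_Screens"] = pd.to_numeric(
    movies["Number_of_Screens"], errors="coerce"
)

# 3. Fill the gaps (assumed strategy: label for text, median for numbers).
text_cols = ["Main_Villain", "Main_Comedy"]
movies[text_cols] = movies[text_cols].fillna("Unknown")
movies["Number_of_Screens"] = movies["Number_of_Screens"].fillna(
    movies["Number_of_Screens"].median()
)

print(movies)
```

With only one known screen count in this sample, the median is not very informative; on a larger dataset you might instead drop the column or leave the values missing.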

### 💻 Code Structure
```python
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Step 1: Load data
movies = pd.read_csv('movie_dataset.csv')

# Step 2: Data cleaning
# TODO: Replace messy values with np.nan
# TODO: Handle missing values appropriately

# Step 3: Exploratory analysis
# TODO: Group by genre and analyze ratings

# Step 4: Encoding categorical variables
# TODO: Convert text to numbers for ML

# Step 5: Normalize numerical columns
# TODO: Scale Budget and Duration_Minutes to the 0-1 range

# Step 6: Export cleaned data
# movies.to_csv('cleaned_movies.csv', index=False)

print("Data wrangling complete! 🎉")
```
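
For Steps 4 and 5, one possible way to fill in the TODOs is sketched below. It uses four sample rows (the first row is left out because its `Budget` appears to be in a different unit), and the `_Code` and `_Scaled` column names are made up for this example.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

movies = pd.DataFrame({
    "Genre": ["Action", "Fantasy Action", "Family Drama", "Romantic Drama"],
    "Director": ["Puri Jagannadh", "S.S. Rajamouli", "Krishna Vamsi", "Bhaskar"],
    "Budget": [35, 40, 50, 40],
    "Duration_Minutes": [117, 158, 160, 155],
})

# Step 4: give every distinct text value an integer code.
# LabelEncoder numbers the classes in sorted order.
for col in ["Genre", "Director"]:
    encoder = LabelEncoder()
    movies[col + "_Code"] = encoder.fit_transform(movies[col])

# Step 5: min-max scaling, (x - min) / (max - min), maps each column to 0-1.
# This assumes the column is not constant (max != min).
for col in ["Budget", "Duration_Minutes"]:
    col_min, col_max = movies[col].min(), movies[col].max()
    movies[col + "_Scaled"] = (movies[col] - col_min) / (col_max - col_min)

print(movies)
```

Keep in mind that integer codes suggest an order that genres and directors do not really have. scikit-learn documents `LabelEncoder` as a tool for target labels, so for model inputs one-hot encoding with `pd.get_dummies()` is a common alternative.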

### 🎯 Expected Results
- 📊 **Genre Analysis:** Average IMDB rating per genre
- 🧹 **Clean Data:** No missing or messy entries
- 🔢 **Encoded Features:** Categorical data as numbers
- 📏 **Normalized Data:** Numerical features scaled 0-1
- 💾 **ML-Ready Dataset:** Ready for machine learning!
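
If you want to check your result against this list, a small helper like the one below can do it. `check_cleaned` is a hypothetical function written for this guide, not part of pandas, and the two frames are made-up test inputs.

```python
import pandas as pd


def check_cleaned(df, scaled_cols):
    """Return a list of problems; an empty list means all checks passed."""
    problems = []
    if df.isna().any().any():
        problems.append("missing values remain")
    if df.isin(["?", "N/A", ""]).any().any():
        problems.append("placeholder strings remain")
    for col in scaled_cols:
        # between() is inclusive on both ends by default.
        if not df[col].between(0, 1).all():
            problems.append(f"{col} is outside the 0-1 range")
    return problems


cleaned = pd.DataFrame({"Genre_Code": [0, 2, 1], "Budget_Scaled": [0.0, 0.5, 1.0]})
messy = pd.DataFrame({
    "Main_Villain": ["?", "Nassar"],
    "Budget_Scaled": [0.0, 35.0],
})

print(check_cleaned(cleaned, ["Budget_Scaled"]))
print(check_cleaned(messy, ["Budget_Scaled"]))
```

The first call reports no problems, while the second flags both the leftover `?` and the unscaled budget value.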

_Example insight: "Action movies have an average rating of 6.2"_
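
An insight like this comes from a simple group-by. Here is a minimal sketch using the genres and ratings of the five sample movies; since the sample has only one film per genre, each average is just that film's rating.

```python
import pandas as pd

movies = pd.DataFrame({
    "Genre": ["Drama", "Action", "Fantasy Action", "Family Drama", "Romantic Drama"],
    "IMDB_Rating": [5.1, 5.2, 7.7, 5.7, 6.6],
})

# Average rating per genre, highest first.
avg_rating = (
    movies.groupby("Genre")["IMDB_Rating"]
    .mean()
    .sort_values(ascending=False)
)

print(avg_rating)
```

On the full dataset, genres with several movies will produce real averages, and adding `.round(1)` keeps the output easy to read.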

### 🏆 Mission Accomplished!
You have successfully:
- ✅ Cleaned messy real-world data
- ✅ Extracted meaningful insights
- ✅ Prepared data for machine learning
- ✅ Mastered essential data wrangling skills

_"Data wrangling is often said to be 80% of data science - you're now a pro!" 🎉_