# IMDB Movies Dataset Exploration

## 1. Dataset Introduction

This dataset contains information about the Top 1000 movies from IMDb.  
It includes numerical attributes such as ratings, votes, gross revenue, and critic scores, alongside categorical information like genre, certificate, director, and cast.  
The dataset is suitable for understanding broad patterns in movie quality, audience engagement, and basic commercial performance.

## 2. Conclusion

After evaluating the dataset, I decided **not to select** the IMDB Movies dataset for my final project.  
Although it includes useful movie-related information, many fields are text-only, several columns require significant cleaning, and the number of strong numerical variables is limited.  
Compared with the Car Price dataset, IMDB offers less analytical depth and fewer meaningful quantitative relationships.

## 3. Setup and Initial Inspection

In [1]:
import pandas as pd

df = pd.read_csv("datasets/imdb_top_1000.csv")
df.head()

| | Poster_Link | Series_Title | Released_Year | Certificate | Runtime | Genre | IMDB_Rating | Overview | Meta_score | Director | Star1 | Star2 | Star3 | Star4 | No_of_Votes | Gross |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | https://m.media-amazon.com/images/M/MV5BMDFkYT... | The Shawshank Redemption | 1994 | A | 142 min | Drama | 9.3 | Two imprisoned men bond over a number of years... | 80.0 | Frank Darabont | Tim Robbins | Morgan Freeman | Bob Gunton | William Sadler | 2343110 | 28341469 |
| 1 | https://m.media-amazon.com/images/M/MV5BM2MyNj... | The Godfather | 1972 | A | 175 min | Crime, Drama | 9.2 | An organized crime dynasty's aging patriarch t... | 100.0 | Francis Ford Coppola | Marlon Brando | Al Pacino | James Caan | Diane Keaton | 1620367 | 134966411 |
| 2 | https://m.media-amazon.com/images/M/MV5BMTMxNT... | The Dark Knight | 2008 | UA | 152 min | Action, Crime, Drama | 9.0 | When the menace known as the Joker wreaks havo... | 84.0 | Christopher Nolan | Christian Bale | Heath Ledger | Aaron Eckhart | Michael Caine | 2303232 | 534858444 |
| 3 | https://m.media-amazon.com/images/M/MV5BMWMwMG... | The Godfather: Part II | 1974 | A | 202 min | Crime, Drama | 9.0 | The early life and career of Vito Corleone in ... | 90.0 | Francis Ford Coppola | Al Pacino | Robert De Niro | Robert Duvall | Diane Keaton | 1129952 | 57300000 |
| 4 | https://m.media-amazon.com/images/M/MV5BMWU4N2... | 12 Angry Men | 1957 | U | 96 min | Crime, Drama | 9.0 | A jury holdout attempts to prevent a miscarria... | 96.0 | Sidney Lumet | Henry Fonda | Lee J. Cobb | Martin Balsam | John Fiedler | 689845 | 4360000 |


### Summary of Initial Observation

- The dataset contains columns such as:  
  `Series_Title`, `Released_Year`, `Certificate`, `Runtime`, `Genre`,  
  `IMDB_Rating`, `Meta_score`, `Director`, `Star1–Star4`, `No_of_Votes`, `Gross`.

- **High-value (core) columns** for analysis include:  
  `IMDB_Rating`, `Meta_score`, `No_of_Votes`, `Gross`.

- **Text-heavy columns** such as `Overview`, `Director`, and `Star1`–`Star4` are not easily analyzable without NLP or further encoding.

- **Potential issues spotted**:  
  - **Gross** appears as formatted text with commas, suggesting it is stored as a string rather than a numeric value.
  - Multi-genre entries (“Action, Drama, Sci-Fi”) require splitting before they can be used for grouping.
  - **Overview** is a long free-text field, not suitable for analysis in a basic EDA workflow.
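
As a minimal sketch of the `Gross` cleanup noted above, the snippet below strips thousands separators and converts the column to numbers. The three-row `sample` frame, its comma-formatted strings, and the helper name `gross_to_numeric` are assumptions for illustration, not taken from the real file.

```python
import pandas as pd

# Hypothetical sample standing in for the real file; the comma-formatted
# strings are an assumption about how Gross is stored.
sample = pd.DataFrame({
    "Series_Title": ["Film A", "Film B", "Film C"],
    "Gross": ["28,341,469", "134,966,411", None],
})


def gross_to_numeric(gross: pd.Series) -> pd.Series:
    """Strip thousands separators and convert to numbers.

    Values that cannot be parsed (including missing ones) become missing.
    """
    cleaned = gross.str.replace(",", "", regex=False)
    return pd.to_numeric(cleaned, errors="coerce")


sample["Gross_num"] = gross_to_numeric(sample["Gross"])
print(sample[["Series_Title", "Gross_num"]])
```

With the real data the same helper would be applied as `gross_to_numeric(df["Gross"])`, keeping the original column untouched so the conversion can be checked.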

### Initial Potential Analysis Directions
1. Relationship between IMDb rating and number of votes.  
2. Comparison of ratings across different genres or certificates.  
3. Relationship between IMDb rating and critic score (Meta_score).

## 4. Basic Structure

In [2]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 16 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Poster_Link    1000 non-null   object 
 1   Series_Title   1000 non-null   object 
 2   Released_Year  1000 non-null   object 
 3   Certificate    899 non-null    object 
 4   Runtime        1000 non-null   object 
 5   Genre          1000 non-null   object 
 6   IMDB_Rating    1000 non-null   float64
 7   Overview       1000 non-null   object 
 8   Meta_score     843 non-null    float64
 9   Director       1000 non-null   object 
 10  Star1          1000 non-null   object 
 11  Star2          1000 non-null   object 
 12  Star3          1000 non-null   object 
 13  Star4          1000 non-null   object 
 14  No_of_Votes    1000 non-null   int64  
 15  Gross          831 non-null    object 
dtypes: float64(2), int64(1), object(13)
memory usage: 125.1+ KB


### Summary from info()

- Shape: 1000 rows × 16 columns  
- Mix of numerical (`IMDB_Rating`, `Meta_score`, `No_of_Votes`) and object columns (genre, title, director, stars).  
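- Missing values are concentrated in `Gross` (169 missing), `Meta_score` (157), and `Certificate` (101), which is normal for movie datasets.  
- `Released_Year` and `Runtime` are stored as `object`: `Runtime` carries a unit suffix (e.g. `142 min`), and `Released_Year` likely contains irregular formatting or non-numeric entries.  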
- Data size is reasonable for exploratory analysis.
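
To put numbers on the missing values listed above, a small helper like the following could be used. It is only a sketch: `missing_summary` is a hypothetical name, and the four-row `sample` frame is made up so the snippet runs without the CSV.

```python
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column, largest first."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(1),
    })
    # Keep only the columns that actually have gaps.
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)


# Hypothetical four-row sample; with the real data this would be missing_summary(df).
sample = pd.DataFrame({
    "Certificate": ["A", None, "U", "UA"],
    "Meta_score": [80.0, None, None, 96.0],
    "IMDB_Rating": [9.3, 9.2, 9.0, 9.0],
})
result = missing_summary(sample)
print(result)
```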

In [3]:
df.describe()

| | IMDB_Rating | Meta_score | No_of_Votes |
|---|---|---|---|
| count | 1000.0 | 843.0 | 1000.0 |
| mean | 7.9493 | 77.97153 | 273692.9 |
| std | 0.275491 | 12.376099 | 327372.7 |
| min | 7.6 | 28.0 | 25088.0 |
| 25% | 7.7 | 70.0 | 55526.25 |
| 50% | 7.9 | 79.0 | 138548.5 |
| 75% | 8.1 | 87.0 | 374161.2 |
| max | 9.3 | 100.0 | 2343110.0 |


### Summary from describe()

- **IMDB_Rating**  
  - Range is 7.6–9.3, which is realistic for a “Top 1000 movies” list.  
  - Mean (≈ 7.95) sits slightly above the median (7.9) → mildly right-skewed but mostly balanced, with a short tail of very highly rated films.

- **Meta_score**  
  - Mean (≈ 78.0) is lower than the median (79) → mild left skew, driven by a tail of low-scoring films (minimum 28). Only 843 of 1000 rows have a score, so these statistics describe the scored subset.

- **No_of_Votes**  
  - Very wide range: from about 25k votes at the low end to more than 2.3M at the top.  
  - Highly right-skewed distribution, which is expected for movie popularity.

- **Gross**  
  - Not included in describe() because stored as text.  
  - Requires cleaning and conversion to numeric.

Overall, numerical columns show realistic ranges but vary greatly in spread.  
There are no extreme outliers that immediately appear invalid.
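
The skew statements above come from comparing mean and median by eye. A quick way to check them, sketched here on a made-up five-row `sample` (the values are illustrative, not the real distribution), is to add the sample skewness next to both statistics. `skew_report` is a hypothetical helper name.

```python
import pandas as pd


def skew_report(df: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Mean, median, and sample skewness for each requested column."""
    return pd.DataFrame({
        "mean": df[columns].mean(),
        "median": df[columns].median(),
        "skew": df[columns].skew(),  # > 0: right-skewed, < 0: left-skewed
    })


# Illustrative values only; with the real data, pass df and the numeric columns.
sample = pd.DataFrame({
    "No_of_Votes": [25_000, 50_000, 140_000, 370_000, 2_300_000],
    "Meta_score": [28.0, 79.0, 84.0, 90.0, 100.0],
})
report = skew_report(sample, ["No_of_Votes", "Meta_score"])
print(report)
```

A mean above the median together with positive skewness points to a right tail, and the reverse points to a left tail.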

In [4]:
# Categorical Columns: Value Distribution
# To quickly inspect the distribution of all categorical (object-type) columns, I use the following loop.  
# This helps identify potential issues such as highly imbalanced categories, unusual labels, multi-value fields, or inconsistent formatting.

cat_cols = df.select_dtypes(include='object').columns
for c in cat_cols:
    print(f"\n=== {c} (unique={df[c].nunique()}) ===")
    print(df[c].value_counts().head(10))


=== Poster_Link (unique=1000) ===
Poster_Link
https://m.media-amazon.com/images/M/MV5BMTY5ODAzMTcwOF5BMl5BanBnXkFtZTcwMzYxNDYyNA@@._V1_UX67_CR0,0,67,98_AL_.jpg                                                    1
https://m.media-amazon.com/images/M/MV5BMDFkYTc0MGEtZmNhMC00ZDIzLWFmNTEtODM1ZmRlYWMwMWFmXkEyXkFqcGdeQXVyMTMxODk2OTU@._V1_UX67_CR0,0,67,98_AL_.jpg                    1
https://m.media-amazon.com/images/M/MV5BM2MyNjYxNmUtYTAwNi00MTYxLWJmNWYtYzZlODY3ZTk3OTFlXkEyXkFqcGdeQXVyNzkwMjQ5NzM@._V1_UY98_CR1,0,67,98_AL_.jpg                    1
https://m.media-amazon.com/images/M/MV5BMTMxNTMwODM0NF5BMl5BanBnXkFtZTcwODAyMTk2Mw@@._V1_UX67_CR0,0,67,98_AL_.jpg                                                    1
https://m.media-amazon.com/images/M/MV5BMWMwMGQzZTItY2JlNC00OWZiLWIyMDctNDk2ZDQ2YjRjMWQ0XkEyXkFqcGdeQXVyNzkwMjQ5NzM@._V1_UY98_CR1,0,67,98_AL_.jpg                    1
https://m.media-amazon.com/images/M/MV5BMWU4N2FjNzYtNTVkNC00NzQ0LTg0MjAtYTJlMjFhNGUxZDFmXkEyXkFqcGdeQXVyNjc1NTYyMjg@._

### Summary of Value Distribution

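- **Certificate** is concentrated in a small number of categories (the first rows alone show `A`, `UA`, and `U`, alongside labels such as R and PG-13), while many certificates appear only a few times. The distribution is imbalanced but typical for movie datasets, and 101 rows have no certificate at all.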
- **Genre** has many unique combinations because multiple genres are stored in a single cell (e.g., "Action, Drama, Sci-Fi"). This confirms that genre analysis requires preprocessing or simplification.
- **Released_Year** contains mixed formatting (some rows show non-standard entries depending on dataset version), suggesting the need for cleaning and conversion to numeric.
- **Director** and the **Stars** columns have extremely high cardinality, making them unsuitable for grouping without further transformation.
- **Overview** appears as long free-text descriptions and is not directly useful for quantitative analysis.

These observations confirm that only a small subset of the object columns are analytically meaningful without additional preprocessing.
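
One possible way to preprocess `Genre`, sketched below, is to split each cell on commas and give every genre its own row. The `sample` rows reuse three titles from the `df.head()` output, and `explode_genres` is a hypothetical helper name.

```python
import pandas as pd

# Three rows taken from the df.head() output above.
sample = pd.DataFrame({
    "Series_Title": ["The Shawshank Redemption", "The Godfather", "The Dark Knight"],
    "Genre": ["Drama", "Crime, Drama", "Action, Crime, Drama"],
    "IMDB_Rating": [9.3, 9.2, 9.0],
})


def explode_genres(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per (movie, genre) pair."""
    out = df.copy()
    out["Genre"] = out["Genre"].str.split(",")  # "Crime, Drama" -> ["Crime", " Drama"]
    out = out.explode("Genre")                  # one list element per row
    out["Genre"] = out["Genre"].str.strip()     # drop the space left after each comma
    return out.reset_index(drop=True)


long_df = explode_genres(sample)
print(long_df["Genre"].value_counts())
```

Note that a film with several genres is counted once per genre in the exploded frame, so per-genre averages are not independent of each other.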


## 5. Confirmed Possible Analysis Directions

Based on the structure and data quality, the dataset supports these realistic directions:

- Explore whether higher critic Meta_score correlates with higher IMDb rating.
- Compare IMDb ratings across major rating certificates (PG, R, PG-13).
- Investigate whether movies with more votes tend to have higher ratings.

These directions are feasible but limited: most columns are text, and `Gross` and `Meta_score` are each missing in more than 15% of rows.
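
The three directions above could start from something like the following sketch. It uses only the five rows shown in the `df.head()` output, so the numbers it prints say nothing about the full dataset; the point is the shape of the calls. Using Spearman alongside Pearson is my own choice here, since vote counts are heavily skewed.

```python
import pandas as pd

# The five rows from df.head(); far too few to interpret, used only to run the calls.
sample = pd.DataFrame({
    "Certificate": ["A", "A", "UA", "A", "U"],
    "IMDB_Rating": [9.3, 9.2, 9.0, 9.0, 9.0],
    "Meta_score": [80.0, 100.0, 84.0, 90.0, 96.0],
    "No_of_Votes": [2343110, 1620367, 2303232, 1129952, 689845],
})

numeric = sample[["IMDB_Rating", "Meta_score", "No_of_Votes"]]

# Rating vs. critic score and rating vs. votes in one correlation matrix.
pearson = numeric.corr(method="pearson")
spearman = numeric.corr(method="spearman")  # rank-based, less sensitive to skew

# Average rating per certificate, with group sizes to spot tiny groups.
by_cert = (
    sample.groupby("Certificate")["IMDB_Rating"]
    .agg(["count", "mean"])
    .sort_values("mean", ascending=False)
)

print(pearson.round(2))
print(spearman.round(2))
print(by_cert)
```

`DataFrame.corr` ignores missing values pair by pair, so rows without a `Meta_score` would drop out of that pair only.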


## 6. Strengths and Weaknesses

### Strengths
- Includes useful numeric variables (ratings, votes, gross) for basic statistical analysis.
- Contains diverse categorical attributes such as genre and certificate for grouping.
- Topic is intuitive and easy to interpret.

### Weaknesses
- Many columns are text-only and difficult to analyze without NLP techniques.
- Several key columns (`Gross`, `Released_Year`) require significant cleaning.
- Limited number of strong numerical variables reduces analytical depth.
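
For reference, the cleaning that `Released_Year` and `Runtime` would need can be sketched as follows. The `sample` frame is hypothetical, and the `"unknown"` entry is a made-up stand-in for whatever non-numeric value the real column contains.

```python
import pandas as pd

sample = pd.DataFrame({
    "Released_Year": ["1994", "1972", "unknown"],  # "unknown" is a made-up bad entry
    "Runtime": ["142 min", "175 min", "96 min"],
})

# Coerce years to numbers; anything unparseable becomes missing.
year = pd.to_numeric(sample["Released_Year"], errors="coerce")

# Inspect the raw values that failed to convert before discarding them.
bad_years = sample.loc[year.isna(), "Released_Year"]

# Nullable integer dtype keeps whole-number years alongside missing values.
sample["Released_Year_num"] = year.astype("Int64")

# Pull the leading digits out of strings like "142 min".
minutes = sample["Runtime"].str.extract(r"(\d+)", expand=False)
sample["Runtime_min"] = pd.to_numeric(minutes, errors="coerce")

print(bad_years.tolist())
print(sample[["Released_Year_num", "Runtime_min"]])
```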