# 🎬 Movie Rating Predictor: Scraping & Machine Learning for Rating and Hit/Flop Prediction

### Dataset Info / Acknowledgment

In [163]:
# IMDb Top 250 Movie Rating Predictor

## Dataset Information
"""
- This dataset contains 218 of the IMDb Top 250 movies (some were skipped because the OMDb API returned incomplete data).
- Columns included: `Title`, `Year`, `Genre`, `Runtime`, `Director`, `IMDb Rating`, `Rotten Tomatoes Score`.
- **Important:** I **built this dataset myself using the OMDb API with my personal API key**:
  I fetched each movie's information programmatically and compiled it into a CSV file.
- API reference: [OMDb API](http://www.omdbapi.com/)
"""

### Fetching Movie Data with the OMDb API Key
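
The cells in this section assign the API key as a string literal. A minimal sketch of one alternative is to read the key from an environment variable so it stays out of the notebook. The variable name `OMDB_API_KEY` and the helper `load_omdb_api_key` are assumptions made for this illustration; they are not part of the original notebook or of the OMDb API.

```python
import os


def load_omdb_api_key(env_var="OMDB_API_KEY"):
    """Return the OMDb API key stored in an environment variable.

    The default name OMDB_API_KEY is a hypothetical choice for this sketch.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        # Fail early with a clear message instead of sending an empty key
        raise RuntimeError(
            f"Set the {env_var} environment variable to your OMDb API key."
        )
    return key
```

With the variable set in the shell that launches Jupyter, a cell could then use `API_KEY = load_omdb_api_key()` in place of the literal.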

In [151]:
import requests
import pandas as pd

API_KEY = "6686926a"

# List of movies to fetch data for
movies = ["Inception", "The Godfather", "Interstellar", "Fight Club"]

data = []

for movie in movies:
    # Let requests build and URL-encode the query string
    params = {"t": movie, "apikey": API_KEY}
    response = requests.get("http://www.omdbapi.com/", params=params, timeout=10)
    result = response.json()
    
    if result.get("Response") == "True":
        # Extract the Rotten Tomatoes score (it is stored inside the Ratings list)
        rt_score = "N/A"
        for rating in result.get("Ratings", []):
            if rating["Source"] == "Rotten Tomatoes":
                rt_score = rating["Value"]
                break
        
        data.append({
            "Title": result.get("Title", "N/A"),
            "Year": result.get("Year", "N/A"),
            "Genre": result.get("Genre", "N/A"),
            "Runtime": result.get("Runtime", "N/A"),
            "Director": result.get("Director", "N/A"),
            "IMDb_Rating": result.get("imdbRating", "N/A"),
            "RT_Score": rt_score
        })
    else:
        print(f"❌ Not found: {movie}")

# Convert to a DataFrame
df = pd.DataFrame(data)

# Save to CSV
df.to_csv("movies_dataset.csv", index=False)

print("✅ Dataset saved as movies_dataset.csv")
print(df)

✅ Dataset saved as movies_dataset.csv
           Title  Year                      Genre  Runtime  \
0      Inception  2010  Action, Adventure, Sci-Fi  148 min   
1  The Godfather  1972               Crime, Drama  175 min   
2   Interstellar  2014   Adventure, Drama, Sci-Fi  169 min   
3     Fight Club  1999     Crime, Drama, Thriller  139 min   

               Director IMDb_Rating RT_Score  
0     Christopher Nolan         8.8      87%  
1  Francis Ford Coppola         9.2      97%  
2     Christopher Nolan         8.7      73%  
3         David Fincher         8.8      81%  
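
The output above shows that `Runtime`, `IMDb_Rating`, and `RT_Score` arrive as strings (`148 min`, `8.8`, `87%`), and the first cell stores `"N/A"` when a value is missing. A model needs numbers, so a rough sketch of the conversion could look like this. The helper names are hypothetical and chosen only for this example; unparseable values become `None`.

```python
def parse_runtime_minutes(value):
    """Convert a string such as '148 min' to the integer 148, else None."""
    if not isinstance(value, str):
        return None
    digits = value.replace("min", "").replace(",", "").strip()
    return int(digits) if digits.isdigit() else None


def parse_percent(value):
    """Convert a string such as '87%' to the float 87.0, else None."""
    if not isinstance(value, str) or not value.strip().endswith("%"):
        return None
    try:
        return float(value.strip()[:-1])
    except ValueError:
        return None


def parse_rating(value):
    """Convert a string such as '8.8' to the float 8.8, else None."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


# Sample rows copied from the output above, plus one row with missing values
rows = [
    {"Title": "Inception", "Runtime": "148 min", "IMDb_Rating": "8.8", "RT_Score": "87%"},
    {"Title": "Placeholder", "Runtime": "N/A", "IMDb_Rating": "N/A", "RT_Score": None},
]

cleaned = [
    {
        "Title": row["Title"],
        "Runtime": parse_runtime_minutes(row["Runtime"]),
        "IMDb_Rating": parse_rating(row["IMDb_Rating"]),
        "RT_Score": parse_percent(row["RT_Score"]),
    }
    for row in rows
]
print(cleaned)
```

On the DataFrame itself, the same helpers can be applied column by column, for example `df["Runtime"] = df["Runtime"].map(parse_runtime_minutes)`.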


In [155]:
print(df.shape)

(4, 7)


In [157]:
import requests
from bs4 import BeautifulSoup
import pandas as pd
import time

API_KEY = "6686926a"
BASE_URL = "http://www.omdbapi.com/"

# Step 1: Scrape IMDb Top 250 movie titles
imdb_url = "https://www.imdb.com/chart/top/"
response = requests.get(imdb_url)
soup = BeautifulSoup(response.text, "html.parser")

# Extract movie titles
titles = [tag.text.strip() for tag in soup.select("td.titleColumn a")]
print(f"🎬 Scraped {len(titles)} titles from IMDb Top 250")

# Step 2: Fetch details from OMDb API
data = []
for i, title in enumerate(titles, start=1):
    params = {"t": title, "apikey": API_KEY}
    res = requests.get(BASE_URL, params=params).json()

    if res.get("Response") == "True":
        data.append({
            "Title": res.get("Title"),
            "Year": res.get("Year"),
            "Genre": res.get("Genre"),
            "Runtime": res.get("Runtime"),
            "Director": res.get("Director"),
            "IMDb_Rating": res.get("imdbRating"),
            "RT_Score": next(
                (r["Value"] for r in res.get("Ratings", []) if r["Source"] == "Rotten Tomatoes"),
                None
            )
        })
        print(f"✅ {i}. {title} added")
    else:
        print(f"❌ {i}. Not found: {title}")

    time.sleep(0.2)  # avoid API rate limiting

# Step 3: Save dataset to CSV
df = pd.DataFrame(data)
df.to_csv("IMDb_Top250.csv", index=False)

print("\n✅ Dataset saved as IMDb_Top250.csv")
print(df.head())
print(df.shape)

🎬 Scraped 0 titles from IMDb Top 250

✅ Dataset saved as IMDb_Top250.csv
Empty DataFrame
Columns: []
Index: []
(0, 0)
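
The scrape above found 0 titles, yet the cell still went on to write an empty CSV. The notebook does not show why the selector matched nothing; possible causes (unverified here) are that the request was rejected or that the page markup no longer matches `td.titleColumn a`. A small guard, sketched below under the hypothetical name `require_titles`, would stop the cell before it saves an empty dataset.

```python
def require_titles(titles, status_code=None, minimum=1):
    """Return titles unchanged, or raise if the scrape produced too few."""
    if len(titles) < minimum:
        detail = f" (HTTP status {status_code})" if status_code is not None else ""
        raise ValueError(
            f"Scrape returned {len(titles)} titles{detail}; "
            f"expected at least {minimum}. "
            "Check the response and the CSS selector before continuing."
        )
    return titles
```

In the cell above, this could be called right after the list comprehension, for example `titles = require_titles(titles, response.status_code)`.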


In [159]:
import requests
import pandas as pd
import time

API_KEY = "6686926a"
BASE_URL = "http://www.omdbapi.com/"

# IMDb Top 250 movie titles (ready-made list, used because the scrape above returned 0 titles)

titles = [
    "The Shawshank Redemption", "The Godfather", "The Dark Knight",
    "The Godfather Part II", "12 Angry Men", "Schindler's List",
    "The Lord of the Rings: The Return of the King", "Pulp Fiction",
    "The Lord of the Rings: The Fellowship of the Ring", "The Good, the Bad and the Ugly",
    "Forrest Gump", "Fight Club", "Inception", "The Lord of the Rings: The Two Towers",
    "Star Wars: Episode V - The Empire Strikes Back", "The Matrix", "Goodfellas",
    "One Flew Over the Cuckoo's Nest", "Se7en", "Seven Samurai",
    "It's a Wonderful Life", "The Silence of the Lambs", "Saving Private Ryan",
    "City of God", "Life Is Beautiful", "The Green Mile", "Interstellar",
    "Terminator 2: Judgment Day", "Back to the Future", "Spirited Away",
    "Psycho", "Parasite", "Leon: The Professional", "The Lion King",
    "Gladiator", "American History X", "The Usual Suspects", "The Departed",
    "Whiplash", "The Prestige", "Casablanca", "Harakiri", "The Intouchables",
    "Modern Times", "Once Upon a Time in the West", "Rear Window",
    "Cinema Paradiso", "Alien", "Apocalypse Now", "Memento", "Raiders of the Lost Ark",
    "The Great Dictator", "Django Unchained", "WALL·E", "The Lives of Others",
    "Sunset Boulevard", "Paths of Glory", "The Shining", "Avengers: Infinity War",
    "Witness for the Prosecution", "Aliens", "American Beauty", "Dr. Strangelove",
    "The Dark Knight Rises", "Oldboy", "Joker", "Amadeus", "Braveheart",
    "Toy Story", "Coco", "Inglourious Basterds", "Avengers: Endgame",
    "Good Will Hunting", "Requiem for a Dream", "The Hunt", "3 Idiots",
    "Eternal Sunshine of the Spotless Mind", "Singin' in the Rain",
    "Star Wars: Episode VI - Return of the Jedi", "2001: A Space Odyssey",
    "Reservoir Dogs", "Vertigo", "Lawrence of Arabia", "Citizen Kane",
    "North by Northwest", "Amélie", "Your Name", "A Clockwork Orange",
    "Ikiru", "Double Indemnity", "Full Metal Jacket", "Scarface",
    "Taxi Driver", "Snatch", "The Kid", "Toy Story 3", "Indiana Jones and the Last Crusade",
    "1917", "Green Book", "The Wolf of Wall Street", "Jaws",
    "Blade Runner 2049", "Inside Out", "The Father", "No Country for Old Men",
    "There Will Be Blood", "The Pianist", "Heat", "The Sixth Sense",
    "L.A. Confidential", "Rashomon", "The Truman Show", "Gone Girl",
    "Shutter Island", "Kill Bill: Vol. 1", "Fargo", "The Thing",
    "The Handmaiden", "On the Waterfront", "The Bridge on the River Kwai",
    "Trainspotting", "Lock, Stock and Two Smoking Barrels", "Casino",
    "The Secret in Their Eyes", "The Gold Rush", "Children of Heaven",
    "My Neighbor Totoro", "Howl's Moving Castle", "The Incredibles",
    "Monsters, Inc.", "Up", "Finding Nemo", "The Iron Giant",
    "Beauty and the Beast", "Ratatouille", "Zootopia", "The Princess Bride",
    "The Grand Budapest Hotel", "The Big Lebowski", "The Deer Hunter",
    "The Terminator", "The Exorcist", "Rocky", "Batman Begins",
    "V for Vendetta", "Black Swan", "Prisoners", "Warrior",
    "The Imitation Game", "Spotlight", "The Social Network", "A Beautiful Mind",
    "The Theory of Everything", "Hacksaw Ridge", "Mad Max: Fury Road",
    "Logan", "Dead Poets Society", "Stand by Me", "The Breakfast Club",
    "Her", "La La Land", "The Revenant", "Dangal",
    "PK", "Drishyam", "Barfi!", "Gully Boy",
    "Swades", "Taare Zameen Par", "Andhadhun", "Super Deluxe",
    "Article 15", "The White Tiger", "The Lunchbox", "Queen",
    "Masaan", "Lagaan", "Mughal-E-Azam", "Sholay",
    "Pyaasa", "Guide", "Mother India", "Ganga Jumna",
    "Chhoti Bahu", "Black Friday", "Satya", "Gangs of Wasseypur",
    "Omkara", "Haider", "Kahaani", "Paan Singh Tomar",
    "Nil Battey Sannata", "Pink", "Neerja", "Talvar",
    "Badhaai Ho", "Article 370", "A Wednesday", "Chak De! India",
    "Munna Bhai M.B.B.S.", "Lage Raho Munna Bhai", "Hera Pheri", "Andaz Apna Apna",
    "Dil Chahta Hai", "Zindagi Na Milegi Dobara", "Kal Ho Naa Ho", "Kabhi Khushi Kabhie Gham",
    "Koi Mil Gaya", "Krrish", "Krrish 3", "Raees",
    "Don", "Don 2", "Kabir Singh", "Sanju",
    "Uri: The Surgical Strike", "Kesari", "Jab We Met", "3 Idiots"
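]

# Drop repeated titles ("3 Idiots" appears twice above) while keeping the original order
titles = list(dict.fromkeys(titles))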

data = []
for i, title in enumerate(titles, start=1):
    params = {"t": title, "apikey": API_KEY}
    res = requests.get(BASE_URL, params=params).json()

    if res.get("Response") == "True":
        data.append({
            "Title": res.get("Title"),
            "Year": res.get("Year"),
            "Genre": res.get("Genre"),
            "Runtime": res.get("Runtime"),
            "Director": res.get("Director"),
            "IMDb_Rating": res.get("imdbRating"),
            "RT_Score": next(
                (r["Value"] for r in res.get("Ratings", []) if r["Source"] == "Rotten Tomatoes"),
                None
            )
        })
        print(f"✅ {i}. {title} added")
    else:
        print(f"❌ {i}. Not found: {title}")

    time.sleep(0.2)  # rate-limit handling

# Save dataset
df = pd.DataFrame(data)
df.to_csv("IMDb_Top250.csv", index=False)

print("\n✅ Dataset saved as IMDb_Top250.csv")
print(df.head())
print(df.shape)

✅ 1. The Shawshank Redemption added
✅ 2. The Godfather added
✅ 3. The Dark Knight added
✅ 4. The Godfather Part II added
✅ 5. 12 Angry Men added
✅ 6. Schindler's List added
✅ 7. The Lord of the Rings: The Return of the King added
✅ 8. Pulp Fiction added
✅ 9. The Lord of the Rings: The Fellowship of the Ring added
✅ 10. The Good, the Bad and the Ugly added
✅ 11. Forrest Gump added
✅ 12. Fight Club added
✅ 13. Inception added
✅ 14. The Lord of the Rings: The Two Towers added
✅ 15. Star Wars: Episode V - The Empire Strikes Back added
✅ 16. The Matrix added
✅ 17. Goodfellas added
✅ 18. One Flew Over the Cuckoo's Nest added
✅ 19. Se7en added
✅ 20. Seven Samurai added
✅ 21. It's a Wonderful Life added
✅ 22. The Silence of the Lambs added
✅ 23. Saving Private Ryan added
✅ 24. City of God added
✅ 25. Life Is Beautiful added
✅ 26. The Green Mile added
✅ 27. Interstellar added
✅ 28. Terminator 2: Judgment Day added
✅ 29. Back to the Future added
✅ 30. Spirited Away added
✅ 31. Psycho added
✅ 32.