# 🎬 Comprehensive Movie Data Analysis & Extraction System

## Overview
This notebook provides a complete system for:
- Listing Tamil movies from Moviesda
- Searching for specific movies
- Extracting detailed movie information
- Analyzing movie metadata
- Downloading and organizing movie data

---

**Author:** Enhanced Movie Analysis System  
**Version:** 2.0  
**Last Updated:** January 2026

---

## 📋 Table of Contents

1. [Setup & Installation](#setup)
2. [Import Libraries](#imports)
3. [Configuration](#config)
4. [Core Functions](#functions)
5. [Movie Listing](#listing)
6. [Movie Search](#search)
7. [Detailed Movie Information](#details)
8. [Data Analysis](#analysis)
9. [Export & Save](#export)
10. [Advanced Features](#advanced)

---

## 1. Setup & Installation <a id='setup'></a>

Install all required packages for the movie data extraction system.

In [None]:
# Install required packages (openpyxl is needed for the Excel export in section 9)
!pip install requests beautifulsoup4 pandas numpy matplotlib seaborn lxml openpyxl --quiet

print("✅ All packages installed successfully!")

---

## 2. Import Libraries <a id='imports'></a>

Import all necessary libraries for data extraction, processing, and visualization.

In [None]:
# Core libraries
import requests
from bs4 import BeautifulSoup
import json
import re
from typing import Dict, List, Optional, Tuple
from urllib.parse import urljoin, urlparse
import time
from datetime import datetime

# Data processing
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# File operations
import os
from pathlib import Path

# Set display options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', 100)
pd.set_option('display.width', 1000)

# Set plot style
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")

print("✅ All libraries imported successfully!")
print(f"📅 Current Date: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")

---

## 3. Configuration <a id='config'></a>

Configure the system parameters and settings.

In [None]:
# Configuration Class
class Config:
    """Configuration settings for the movie extraction system"""
    
    # Base URLs
    BASE_URL = "https://moviesda15.com"
    
    # Headers for requests
    HEADERS = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Language': 'en-US,en;q=0.5',
        'Accept-Encoding': 'gzip, deflate, br',
        'Connection': 'keep-alive',
        'Upgrade-Insecure-Requests': '1'
    }
    
    # Request settings
    TIMEOUT = 30
    MAX_RETRIES = 3
    RETRY_DELAY = 2
    
    # Data directories
    DATA_DIR = Path('/mnt/user-data/outputs')
    CACHE_DIR = Path('/home/claude/cache')
    
    # Default year
    DEFAULT_YEAR = '2026'
    
    @classmethod
    def create_directories(cls):
        """Create necessary directories"""
        cls.DATA_DIR.mkdir(parents=True, exist_ok=True)
        cls.CACHE_DIR.mkdir(parents=True, exist_ok=True)
        print(f"✅ Created directories: {cls.DATA_DIR}, {cls.CACHE_DIR}")

# Initialize configuration
Config.create_directories()
print("✅ Configuration initialized successfully!")
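
The directory paths in `Config` are absolute and tied to one environment. A minimal sketch of one way to make them overridable, assuming environment variables such as `MOVIE_DATA_DIR` and `MOVIE_CACHE_DIR` (hypothetical names that do not appear in the notebook) and a fallback chosen by the caller:

```python
import os
from pathlib import Path


def resolve_dir(env_var: str, default: Path) -> Path:
    """Return the directory named by env_var if it is set, else default.

    The directory is created if it does not exist yet.
    """
    raw = os.environ.get(env_var)
    path = Path(raw).expanduser() if raw else default
    path.mkdir(parents=True, exist_ok=True)
    return path
```

With this helper, the class attributes could be assigned as `DATA_DIR = resolve_dir("MOVIE_DATA_DIR", Path.cwd() / "outputs")`, where both the variable name and the fallback are illustrative choices.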

---

## 4. Core Functions <a id='functions'></a>

Define all core functions for movie data extraction and processing.

In [None]:
class MovieExtractor:
    """Main class for extracting movie data from Moviesda"""
    
    def __init__(self):
        self.session = requests.Session()
        self.session.headers.update(Config.HEADERS)
    
    def make_request(self, url: str, retries: int = Config.MAX_RETRIES) -> Optional[requests.Response]:
        """
        Make an HTTP request with retry logic
        
        Args:
            url: URL to fetch
            retries: Number of retry attempts
            
        Returns:
            Response object or None if failed
        """
        for attempt in range(retries):
            try:
                response = self.session.get(url, timeout=Config.TIMEOUT)
                response.raise_for_status()
                return response
            except requests.RequestException as e:
                print(f"⚠️ Attempt {attempt + 1}/{retries} failed: {e}")
                if attempt < retries - 1:
                    time.sleep(Config.RETRY_DELAY)
                else:
                    print(f"❌ Failed to fetch {url} after {retries} attempts")
                    return None
    
    def list_movies(self, year: str = Config.DEFAULT_YEAR) -> List[Dict]:
        """
        List all movies for a specific year
        
        Args:
            year: Year to fetch movies for
            
        Returns:
            List of movie dictionaries
        """
        print(f"\n🎬 Fetching movies for year: {year}")
        url = f"{Config.BASE_URL}/year/{year}/"
        
        response = self.make_request(url)
        if not response:
            return []
        
        soup = BeautifulSoup(response.content, 'lxml')
        movies = []
        
        # Find all movie articles (CSS selector matches both classes in any order)
        articles = soup.select('article.item.movies')
        
        for article in articles:
            try:
                # Extract movie data
                title_tag = article.find('h3')
                link_tag = article.find('a', href=True)
                img_tag = article.find('img')
                
                if title_tag and link_tag:
                    movie_data = {
                        'title': title_tag.get_text(strip=True),
                        'url': urljoin(Config.BASE_URL, link_tag['href']),  # resolve relative links
                        'year': year,
                        'poster': img_tag['src'] if img_tag and 'src' in img_tag.attrs else None,
                        'type': 'movie'  # Can be enhanced to detect web-series
                    }
                    movies.append(movie_data)
            except Exception as e:
                print(f"⚠️ Error parsing movie: {e}")
                continue
        
        print(f"✅ Found {len(movies)} movies for {year}")
        return movies
    
    def get_movie_details(self, movie_url: str) -> Dict:
        """
        Extract detailed information about a movie
        
        Args:
            movie_url: URL of the movie page
            
        Returns:
            Dictionary with comprehensive movie details
        """
        print(f"\n🔍 Fetching details from: {movie_url}")
        
        response = self.make_request(movie_url)
        if not response:
            return {}
        
        soup = BeautifulSoup(response.content, 'lxml')
        details = {'url': movie_url}
        
        # Extract title
        title_tag = soup.find('h1', class_='entry-title')
        if title_tag:
            details['title'] = title_tag.get_text(strip=True)
        
        # Extract metadata
        meta_items = soup.find_all('div', class_='sgeneros')
        for item in meta_items:
            text = item.get_text(strip=True)
            if 'Genre:' in text:
                details['genre'] = text.replace('Genre:', '').strip()
            elif 'Year:' in text:
                details['year'] = text.replace('Year:', '').strip()
        
        # Extract plot/description
        plot_div = soup.find('div', class_='wp-content')
        if plot_div:
            details['plot'] = plot_div.get_text(strip=True)
        
        # Extract poster
        poster_div = soup.find('div', class_='poster')
        poster_img = poster_div.find('img') if poster_div else None
        if poster_img and 'src' in poster_img.attrs:
            details['poster'] = poster_img['src']
        
        # Extract download links
        details['download_links'] = self._extract_download_links(soup)
        
        # Extract cast and crew
        details['cast'] = self._extract_cast_crew(soup)
        
        print(f"✅ Successfully extracted details for: {details.get('title', 'Unknown')}")
        return details
    
    def _extract_download_links(self, soup: BeautifulSoup) -> List[Dict]:
        """
        Extract all download links from movie page
        
        Args:
            soup: BeautifulSoup object of the page
            
        Returns:
            List of download link dictionaries
        """
        links = []
        
        # Find all download sections
        download_sections = soup.find_all('div', class_='download-links')
        
        for section in download_sections:
            quality_tags = section.find_all('a', href=True)
            
            for tag in quality_tags:
                link_data = {
                    'url': tag['href'],
                    'text': tag.get_text(strip=True),
                    'quality': self._extract_quality(tag.get_text(strip=True))
                }
                links.append(link_data)
        
        return links
    
    def _extract_quality(self, text: str) -> str:
        """
        Extract quality information from link text
        
        Args:
            text: Link text
            
        Returns:
            Quality string (480p, 720p, 1080p, etc.)
        """
        quality_patterns = ['480p', '720p', '1080p', '4K', '2K']
        for pattern in quality_patterns:
            if pattern.lower() in text.lower():
                return pattern
        return 'Unknown'
    
    def _extract_cast_crew(self, soup: BeautifulSoup) -> Dict:
        """
        Extract cast and crew information
        
        Args:
            soup: BeautifulSoup object of the page
            
        Returns:
            Dictionary with cast and crew details
        """
        cast_crew = {}
        
        # Find cast/crew section
        cast_section = soup.find('div', class_='persons')
        
        if cast_section:
            items = cast_section.find_all('div', class_='person')
            
            for item in items:
                role = item.find('span', class_='role')
                name = item.find('span', class_='name')
                
                if role and name:
                    role_text = role.get_text(strip=True)
                    name_text = name.get_text(strip=True)
                    cast_crew[role_text] = name_text
        
        return cast_crew
    
    def search_movie(self, title: str, year: str = Config.DEFAULT_YEAR) -> Dict:
        """
        Search for a movie by title
        
        Args:
            title: Movie title to search for
            year: Year to search in
            
        Returns:
            Dictionary with search results and details
        """
        print(f"\n🔍 Searching for: '{title}' in year {year}")
        
        # List all movies for the year
        all_movies = self.list_movies(year)
        
        # Search for matching titles
        matches = [
            movie for movie in all_movies 
            if title.lower() in movie['title'].lower()
        ]
        
        if not matches:
            print(f"❌ No movies found matching '{title}'")
            return {'matches': [], 'total': 0}
        
        print(f"✅ Found {len(matches)} matching movie(s)")
        
        # Get details for the best match
        best_match = matches[0]
        details = self.get_movie_details(best_match['url'])
        
        return {
            'matches': matches,
            'total': len(matches),
            'best_match': best_match,
            'details': details
        }

print("✅ Core functions defined successfully!")
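
`make_request` waits a fixed `RETRY_DELAY` between attempts. A common variation is exponential backoff, where the wait doubles after each failure. The following is a generic sketch of that idea, not part of the notebook: `retry_with_backoff` and its injectable `sleep` parameter are hypothetical names introduced for illustration.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    func: Callable[[], T],
    retries: int = 3,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[T]:
    """Call func up to `retries` times, doubling the wait after each failure.

    Returns func's result, or None if every attempt raised.
    """
    for attempt in range(retries):
        try:
            return func()
        except Exception as exc:  # narrow to the expected error type in real use
            if attempt == retries - 1:
                print(f"Giving up after {retries} attempts: {exc}")
                return None
            delay = base_delay * (2 ** attempt)  # base, 2x base, 4x base, ...
            print(f"Attempt {attempt + 1}/{retries} failed: {exc}; retrying in {delay:.0f}s")
            sleep(delay)
    return None
```

Passing `sleep` as a parameter keeps the helper easy to exercise without real waiting, for example by handing it a list's `append` method to record the delays.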

---

## 5. Movie Listing <a id='listing'></a>

List all available movies for a specific year.

In [None]:
# Initialize the extractor
extractor = MovieExtractor()

# List movies for 2026
year_to_search = '2026'
movies_2026 = extractor.list_movies(year_to_search)

# Convert to DataFrame for better visualization
df_movies = pd.DataFrame(movies_2026)

# Display results
print(f"\n📊 Total Movies Found: {len(df_movies)}")
print("\n" + "="*80)
print("MOVIE LIST:")
print("="*80)
display(df_movies.head(20))

# Save to CSV
csv_path = Config.DATA_DIR / f'movies_{year_to_search}.csv'
df_movies.to_csv(csv_path, index=False)
print(f"\n💾 Saved to: {csv_path}")

---

## 6. Movie Search <a id='search'></a>

Search for specific movies by title.

In [None]:
# Search for a specific movie
search_title = "Parasakthi"  # Change this to search for different movies
search_year = "2026"

search_results = extractor.search_movie(search_title, search_year)

# Display search results
print("\n" + "="*80)
print(f"SEARCH RESULTS FOR: '{search_title}'")
print("="*80)

if search_results['total'] > 0:
    print(f"\n✅ Found {search_results['total']} matching movie(s):\n")
    
    for i, match in enumerate(search_results['matches'], 1):
        print(f"{i}. {match['title']}")
        print(f"   URL: {match['url']}")
        print(f"   Year: {match['year']}\n")
    
    # Display detailed information
    if 'details' in search_results:
        print("\n" + "="*80)
        print("DETAILED INFORMATION:")
        print("="*80)
        
        details = search_results['details']
        for key, value in details.items():
            if key not in ['download_links', 'cast']:
                print(f"\n{key.upper()}: {value}")
else:
    print(f"\n❌ No movies found matching '{search_title}'")
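
`search_movie` uses a case-insensitive substring test, so a misspelled query returns nothing. One possible complement is approximate matching with the standard library's `difflib`. This is a sketch under assumptions: `fuzzy_find` is a hypothetical helper, and the sample titles other than the search term used above are placeholders.

```python
from difflib import get_close_matches
from typing import List


def fuzzy_find(query: str, titles: List[str], cutoff: float = 0.6) -> List[str]:
    """Return up to three titles whose lowercase form is close to the query."""
    # Map lowercase title -> original title; titles that differ only by case
    # collapse to a single entry.
    by_lower = {title.lower(): title for title in titles}
    close = get_close_matches(query.lower(), list(by_lower), n=3, cutoff=cutoff)
    return [by_lower[key] for key in close]


sample_titles = ["Parasakthi", "Example Movie", "Another Title"]  # illustrative data only
print(fuzzy_find("Parasakti", sample_titles))
```

A higher `cutoff` makes matching stricter; the default of 0.6 is simply the value `get_close_matches` itself uses.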

---

## 7. Detailed Movie Information <a id='details'></a>

Extract comprehensive details about a specific movie.

In [None]:
# Get details for a specific movie URL
# Replace with actual movie URL from the search results
if search_results['total'] > 0:
    movie_url = search_results['best_match']['url']
    
    # Extract detailed information
    movie_details = extractor.get_movie_details(movie_url)
    
    # Display comprehensive details
    print("\n" + "="*80)
    print("COMPREHENSIVE MOVIE DETAILS")
    print("="*80)
    
    print(f"\n🎬 Title: {movie_details.get('title', 'N/A')}")
    print(f"📅 Year: {movie_details.get('year', 'N/A')}")
    print(f"🎭 Genre: {movie_details.get('genre', 'N/A')}")
    print(f"🔗 URL: {movie_details.get('url', 'N/A')}")
    
    if 'plot' in movie_details:
        print(f"\n📖 Plot:\n{movie_details['plot'][:500]}...")
    
    if 'cast' in movie_details and movie_details['cast']:
        print("\n👥 Cast & Crew:")
        for role, name in movie_details['cast'].items():
            print(f"   {role}: {name}")
    
    if 'download_links' in movie_details and movie_details['download_links']:
        print(f"\n💾 Download Links ({len(movie_details['download_links'])} available):")
        for i, link in enumerate(movie_details['download_links'][:5], 1):
            print(f"   {i}. {link['text']} - Quality: {link['quality']}")
    
    # Save detailed information to JSON
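    # Replace characters that are unsafe in file names (slashes, colons, spaces, ...)
    safe_title = re.sub(r'[^\w.-]+', '_', movie_details.get('title', 'unknown')).strip('_') or 'unknown'
    json_path = Config.DATA_DIR / f"movie_details_{safe_title}.json"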
    with open(json_path, 'w', encoding='utf-8') as f:
        json.dump(movie_details, f, indent=2, ensure_ascii=False)
    
    print(f"\n💾 Detailed information saved to: {json_path}")
else:
    print("⚠️ No movie selected for detailed extraction")
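
The quality label printed for each link comes from a plain substring check in `_extract_quality`, which would also report `4K` for text such as `24kbps`. A regular expression that requires the label to stand alone is one alternative. The sketch below is an assumption-laden illustration: `QUALITY_RE` and `extract_quality` are hypothetical names, and the pattern accepts any three- or four-digit `p` label, which is broader than the notebook's fixed list.

```python
import re

# Resolution labels such as 480p/720p/1080p, or 2K/4K, as standalone tokens
# (not preceded or followed by another letter or digit).
QUALITY_RE = re.compile(r"(?<![0-9A-Za-z])(\d{3,4}p|[24]k)(?![0-9A-Za-z])", re.IGNORECASE)


def extract_quality(text: str) -> str:
    """Return the first quality label found in text, or 'Unknown'."""
    match = QUALITY_RE.search(text)
    if not match:
        return "Unknown"
    label = match.group(1)
    # Normalise case: '720P' -> '720p', '4k' -> '4K'.
    return label.lower() if label.lower().endswith("p") else label.upper()
```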

---

## 8. Data Analysis <a id='analysis'></a>

Analyze the collected movie data with visualizations.

In [None]:
# Analyze movie data
if len(df_movies) > 0:
    print("\n" + "="*80)
    print("DATA ANALYSIS")
    print("="*80)
    
    # Basic statistics
    print(f"\n📊 Total Movies: {len(df_movies)}")
    print(f"📊 Unique Years: {df_movies['year'].nunique()}")
    
    # Create visualizations
    fig, axes = plt.subplots(2, 2, figsize=(15, 10))
    fig.suptitle('Movie Data Analysis', fontsize=16, fontweight='bold')
    
    # 1. Movies by Type
    if 'type' in df_movies.columns:
        df_movies['type'].value_counts().plot(kind='bar', ax=axes[0, 0], color='skyblue')
        axes[0, 0].set_title('Movies by Type')
        axes[0, 0].set_xlabel('Type')
        axes[0, 0].set_ylabel('Count')
    
    # 2. Top 10 Movies (by title length - just for demonstration)
    df_movies['title_length'] = df_movies['title'].str.len()
    top_movies = df_movies.nlargest(10, 'title_length')[['title', 'title_length']]
    axes[0, 1].barh(range(len(top_movies)), top_movies['title_length'], color='lightcoral')
    axes[0, 1].set_yticks(range(len(top_movies)))
    axes[0, 1].set_yticklabels(top_movies['title'], fontsize=8)
    axes[0, 1].set_title('Top 10 Movies by Title Length')
    axes[0, 1].set_xlabel('Title Length')
    
    # 3. Movies distribution
    axes[1, 0].hist(df_movies['title_length'], bins=20, color='lightgreen', edgecolor='black')
    axes[1, 0].set_title('Distribution of Title Lengths')
    axes[1, 0].set_xlabel('Title Length')
    axes[1, 0].set_ylabel('Frequency')
    
    # 4. Summary statistics
    stats_text = f"""Summary Statistics:
    
Total Movies: {len(df_movies)}
Average Title Length: {df_movies['title_length'].mean():.1f}
Median Title Length: {df_movies['title_length'].median():.1f}
Shortest Title: {df_movies['title_length'].min()}
Longest Title: {df_movies['title_length'].max()}
    """
    axes[1, 1].text(0.1, 0.5, stats_text, fontsize=12, verticalalignment='center')
    axes[1, 1].axis('off')
    axes[1, 1].set_title('Statistics Overview')
    
    plt.tight_layout()
    
    # Save plot
    plot_path = Config.DATA_DIR / 'movie_analysis.png'
    plt.savefig(plot_path, dpi=300, bbox_inches='tight')
    print(f"\n📊 Analysis plot saved to: {plot_path}")
    
    plt.show()
else:
    print("⚠️ No data available for analysis")

---

## 9. Export & Save <a id='export'></a>

Export all collected data in various formats.

In [None]:
# Export data in multiple formats
print("\n" + "="*80)
print("EXPORTING DATA")
print("="*80)

if len(df_movies) > 0:
    # 1. CSV Export
    csv_file = Config.DATA_DIR / f'movies_export_{datetime.now().strftime("%Y%m%d_%H%M%S")}.csv'
    df_movies.to_csv(csv_file, index=False, encoding='utf-8')
    print(f"✅ CSV exported to: {csv_file}")
    
    # 2. Excel Export
    try:
        excel_file = Config.DATA_DIR / f'movies_export_{datetime.now().strftime("%Y%m%d_%H%M%S")}.xlsx'
        df_movies.to_excel(excel_file, index=False, engine='openpyxl')
        print(f"✅ Excel exported to: {excel_file}")
    except ImportError:
        print("⚠️ openpyxl not installed. Skipping Excel export.")
    
    # 3. JSON Export
    json_file = Config.DATA_DIR / f'movies_export_{datetime.now().strftime("%Y%m%d_%H%M%S")}.json'
    df_movies.to_json(json_file, orient='records', indent=2, force_ascii=False)
    print(f"✅ JSON exported to: {json_file}")
    
    # 4. HTML Report
    html_file = Config.DATA_DIR / f'movies_report_{datetime.now().strftime("%Y%m%d_%H%M%S")}.html'
    html_content = f"""
    <!DOCTYPE html>
    <html>
    <head>
        <title>Movie Data Report</title>
        <style>
            body {{ font-family: Arial, sans-serif; margin: 20px; }}
            h1 {{ color: #333; }}
            table {{ border-collapse: collapse; width: 100%; }}
            th, td {{ border: 1px solid #ddd; padding: 8px; text-align: left; }}
            th {{ background-color: #4CAF50; color: white; }}
            tr:nth-child(even) {{ background-color: #f2f2f2; }}
        </style>
    </head>
    <body>
        <h1>Movie Data Report - {datetime.now().strftime('%Y-%m-%d')}</h1>
        <p>Total Movies: {len(df_movies)}</p>
        {df_movies.to_html(index=False, escape=True)}
    </body>
    </html>
    """
    with open(html_file, 'w', encoding='utf-8') as f:
        f.write(html_content)
    print(f"✅ HTML report exported to: {html_file}")
    
    print(f"\n✅ All exports completed successfully!")
    print(f"📁 Files saved to: {Config.DATA_DIR}")
else:
    print("⚠️ No data available for export")
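
Each export above calls `datetime.now()` separately, so files written in the same run can end up with different timestamps in their names. A small sketch of one way to keep them aligned, assuming a hypothetical helper `build_export_paths` that takes the timestamp once:

```python
from datetime import datetime
from pathlib import Path
from typing import Dict, Optional


def build_export_paths(data_dir: Path, now: Optional[datetime] = None) -> Dict[str, Path]:
    """Build one path per export format, all sharing a single timestamp."""
    stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
    return {
        "csv": data_dir / f"movies_export_{stamp}.csv",
        "xlsx": data_dir / f"movies_export_{stamp}.xlsx",
        "json": data_dir / f"movies_export_{stamp}.json",
        "html": data_dir / f"movies_report_{stamp}.html",
    }
```

The file name patterns mirror the ones used in the cell; the optional `now` argument exists only so the result is reproducible when a fixed time is passed in.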

---

## 10. Advanced Features <a id='advanced'></a>

Advanced functionality for batch processing and automation.

In [None]:
# Batch process multiple movies
def batch_extract_movies(movie_urls: List[str]) -> pd.DataFrame:
    """
    Extract details for multiple movies at once
    
    Args:
        movie_urls: List of movie URLs
        
    Returns:
        DataFrame with all movie details
    """
    print(f"\n🔄 Batch extracting {len(movie_urls)} movies...")
    
    all_details = []
    
    for i, url in enumerate(movie_urls, 1):
        print(f"\nProcessing {i}/{len(movie_urls)}: {url}")
        details = extractor.get_movie_details(url)
        if details:
            all_details.append(details)
        time.sleep(1)  # Be respectful to the server
    
    df = pd.DataFrame(all_details)
    print(f"\n✅ Batch extraction complete! Extracted {len(df)} movies.")
    return df

# Example: Extract details for top 5 movies
if len(df_movies) > 0:
    top_5_urls = df_movies['url'].head(5).tolist()
    batch_results = batch_extract_movies(top_5_urls)
    
    # Display batch results
    print("\n" + "="*80)
    print("BATCH EXTRACTION RESULTS")
    print("="*80)
    # reindex avoids a KeyError when a column (e.g. 'genre') was never extracted
    display(batch_results.reindex(columns=['title', 'year', 'genre']).head())
    
    # Save batch results
    batch_file = Config.DATA_DIR / f'batch_extraction_{datetime.now().strftime("%Y%m%d_%H%M%S")}.csv'
    batch_results.to_csv(batch_file, index=False)
    print(f"\n💾 Batch results saved to: {batch_file}")
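
The detail records contain nested values (`cast` is a dict, `download_links` is a list of dicts). Written to CSV as-is, such cells are stored in their Python string form, which is awkward to parse back. One option is to serialise nested values as JSON strings first. The sketch below uses only the standard library; `flatten_for_csv` is a hypothetical helper and the record values are made up for illustration.

```python
import csv
import io
import json
from typing import Any, Dict


def flatten_for_csv(record: Dict[str, Any]) -> Dict[str, Any]:
    """Serialise list/dict values as JSON strings so they survive a CSV round trip."""
    return {
        key: json.dumps(value, ensure_ascii=False) if isinstance(value, (list, dict)) else value
        for key, value in record.items()
    }


# Illustrative record shaped like the output of get_movie_details; values are made up.
record = {"title": "Example", "cast": {"Director": "A. Person"}, "download_links": []}
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(record))
writer.writeheader()
writer.writerow(flatten_for_csv(record))
print(buffer.getvalue())
```

When reading the file back, `json.loads` on those columns restores the original structures.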

---

## 📝 Summary

This notebook provides a complete system for:

- ✅ Listing movies by year
- ✅ Searching for specific movies
- ✅ Extracting comprehensive movie details
- ✅ Analyzing movie data
- ✅ Exporting data in multiple formats
- ✅ Batch processing multiple movies

### Next Steps:

1. Customize search parameters
2. Add more analysis features
3. Implement caching for faster performance
4. Add error handling for edge cases
5. Create automated reports
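
For the caching step, note that `Config.CACHE_DIR` is created but never used in the cells above. A minimal sketch of a file-based cache keyed by a hash of the URL follows. It assumes a hypothetical `cached_fetch` helper and takes the actual fetch function as a parameter, so nothing here depends on the network.

```python
import hashlib
from pathlib import Path
from typing import Callable


def cached_fetch(url: str, cache_dir: Path, fetch: Callable[[str], str]) -> str:
    """Return the cached body for url, calling fetch(url) only on a cache miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    # A SHA-256 digest gives a fixed-length, filesystem-safe file name.
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    cache_file = cache_dir / f"{key}.html"
    if cache_file.exists():
        return cache_file.read_text(encoding="utf-8")
    body = fetch(url)
    cache_file.write_text(body, encoding="utf-8")
    return body
```

This sketch never expires entries; a real cache would probably also compare the file's modification time against a maximum age before reusing it.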

---

**Happy Movie Hunting! 🎬**