# **Web Scraping & Data Handling Challenge**

### **Contributor** - Babita Pruseth, Cohort Moscow, AlmaBetter



### **Website:**
JustWatch -  https://www.justwatch.com/in/movies?release_year_from=2000


### **Description:**

JustWatch is a popular platform that allows users to search for movies and TV shows across multiple streaming services like Netflix, Amazon Prime, Hulu, etc. For this assignment, you will be required to scrape movie and TV show data from JustWatch using Selenium, Python, and BeautifulSoup. Extract data from HTML, not by directly calling their APIs. Then, perform data filtering and analysis using Pandas, and finally, save the results to a CSV file.

### **Tasks:**

**1. Web Scraping:**

Use BeautifulSoup to scrape the following data from JustWatch:

   **a. Movie Information:**

      - Movie title
      - Release year
      - Genre
      - IMDb rating
      - Streaming services available (Netflix, Amazon Prime, Hulu, etc.)
      - URL to the movie page on JustWatch

   **b. TV Show Information:**

      - TV show title
      - Release year
      - Genre
      - IMDb rating
      - Streaming services available (Netflix, Amazon Prime, Hulu, etc.)
      - URL to the TV show page on JustWatch

  **c. Scope:**

```
   - Scrape data for at least 50 movies and 50 TV shows.
   - You can choose the entry point (e.g., starting with popular movies,
     or a specific genre, etc.) to ensure a diverse dataset.

```


**2. Data Filtering & Analysis:**

   After scraping the data, use Pandas to perform the following tasks:

   **a. Filter movies and TV shows based on specific criteria:**

   ```
      - Only include movies and TV shows released in the last 2 years (from the current date).
      - Only include movies and TV shows with an IMDb rating of 7 or higher.
```

   **b. Data Analysis:**

   ```
      - Calculate the average IMDb rating for the scraped movies and TV shows.
      - Identify the top 5 genres that have the highest number of available movies and TV shows.
      - Determine the streaming service with the most significant number of offerings.
      
   ```   

**3. Data Export:**

```
   - Dump the filtered and analysed data into a CSV file for further processing and reporting.

   - Keep the CSV file in your Drive folder and share the Drive link in the Colab notebook, with view access for anyone who has the link.
```

**Submission:**
```
- Submit a link to your Colab made for the assignment.

- The Colab should contain your Python script (.py format only) with clear
  comments explaining the scraping, filtering, and analysis process.

- Your code shouldn't have any errors and should be executable in one go.

- Place your dataset's Drive link in the notebook, before the conclusion.
```



**Note:**

1. Properly handle errors and exceptions during web scraping to ensure a robust script.

2. Make sure your code is well-structured, easy to understand, and follows Python best practices.

3. The assignment will be evaluated based on the correctness of the scraped data, accuracy of data filtering and analysis, and the overall quality of the Python code.








# **Start The Project**

## **Task 1:- Web Scraping**

In [None]:
# Installing all necessary libraries
!pip install bs4
!pip install requests

In [1]:
# Import all necessary libraries
import requests
from bs4 import BeautifulSoup
import re
import pandas as pd
import numpy as np

## **Scraping Movies Data**

In [None]:
def fetch_movie_urls(url):
    # Browser-like User-Agent so the request is less likely to be rejected
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
    }
    response = requests.get(url, headers=headers, timeout=30)
    # Raise an HTTPError on a 4xx/5xx response instead of returning a tuple,
    # which would break the soup.prettify() call below
    response.raise_for_status()
    soup = BeautifulSoup(response.text, 'html.parser')
    return soup


url = 'https://www.justwatch.com/in/movies?release_year_from=2000'
soup = fetch_movie_urls(url)
print(soup.prettify())

## Hint : Use the following code to extract the film urls
# movie_links = soup.find_all('a', href=True)
# movie_urls = [link['href'] for link in movie_links if '/movie/' in link['href']]

# url_list=[]
# for x in movie_urls:
#   url_list.append('https://www.justwatch.com'+x)
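
The hint above can be turned into a small reusable helper. This is a minimal sketch, assuming detail-page links are relative and contain `/movie/` in their `href`; the function name `extract_title_urls` and the sample HTML are illustrative and are not taken from JustWatch's actual markup.

```python
from bs4 import BeautifulSoup

BASE_URL = 'https://www.justwatch.com'

def extract_title_urls(soup, marker='/movie/'):
    """Return absolute URLs for every <a href> containing `marker`, without duplicates."""
    urls = []
    for link in soup.find_all('a', href=True):
        href = link['href']
        if marker in href:
            absolute = BASE_URL + href
            if absolute not in urls:  # keep the first occurrence, preserve page order
                urls.append(absolute)
    return urls

# Illustrative HTML only; the real listing page is far larger
sample_html = '''
<a href="/in/movie/example-one">One</a>
<a href="/in/movie/example-two">Two</a>
<a href="/in/movie/example-one">One again</a>
<a href="/in/about">About</a>
'''
sample_soup = BeautifulSoup(sample_html, 'html.parser')
sample_urls = extract_title_urls(sample_soup)
print(sample_urls)
```

Passing `marker='/tv-show/'` would apply the same idea to the TV show listing page.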

## **Scraping Movie Title**

In [6]:
# Movie title

movie_title_list=[]# List to store all movie title

# Extracting all movie titles from <a> tag and storing them in movie_titles
movie_titles = soup.find_all('a',class_='title-list-grid__item--link',attrs={'href':True})

# Extracting each movie title from movie_titles and storing in movie_title_list
for movie_title in movie_titles:

    # Extract the 'href' attribute value, which contains the movie title
    data_id_value = movie_title['href']

    # Removing the '/in/movie/' prefix to get the clean movie title
    data_id_value = data_id_value.replace("/in/movie/","")

    # Converting the movie title to uppercase and appending to the list
    movie_title_list.append(data_id_value.upper())
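
The cell above stores the upper-cased URL slug as the title. If a more readable title is wanted, the slug can be converted as sketched below. The helper name `slug_to_title` is hypothetical, and because a slug drops punctuation, the result only approximates the title shown on the page.

```python
def slug_to_title(slug):
    """Turn a URL slug such as 'the-dark-knight' into 'The Dark Knight'."""
    return slug.strip('/').replace('-', ' ').title()

example_title = slug_to_title('the-dark-knight')
print(example_title)
```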


## **Fetching Movie URLs**

In [9]:
# Movie URL

movie_url_list=[] # List to store all movie URLs

# For every movie slug present in movie_title_list, build its URL
for movie in movie_title_list:

    # Constructing the absolute URL for fetching each movie
    absolute_url = 'https://www.justwatch.com/in/movie/' + movie
    movie_url_list.append(absolute_url)



## **Scraping Release Year**

In [13]:
movie_release_year_list = []  # List to store all movie release years

# For every movie title present in movie_title_list, find their release year
for movie in movie_title_list:
    # Constructing the absolute URL for fetching each movie
    absolute_url = 'https://www.justwatch.com/in/movie/' + movie

    # Sending an HTTP GET request
    response_ry = requests.get(absolute_url)

    # Parsing HTML content with Beautiful Soup
    soup_ry = BeautifulSoup(response_ry.text, 'html.parser')

    # Finding the movie release year element
    release_year_element = soup_ry.find('span', class_='text-muted')
    
    # Check if the element is found and handle None case
    if release_year_element:
        movie_release_year = release_year_element.text.strip()
        movie_release_year = movie_release_year.replace("(", "").replace(")", "")
        movie_release_year_list.append(movie_release_year)
    else:
        # If the release year is not found, handle the missing case
        movie_release_year_list.append('Year not found')
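
The scraped year is kept as text, with a placeholder string when it is missing. For later numeric filtering it can help to convert it to an integer up front. The sketch below assumes the text contains a four-digit year such as `(2023)`; `parse_release_year` is an illustrative name.

```python
import re

def parse_release_year(text):
    """Return the first four-digit year (19xx or 20xx) found in `text` as an int, or None."""
    match = re.search(r'\b(?:19|20)\d{2}\b', text or '')
    return int(match.group(0)) if match else None

print(parse_release_year('(2023)'))
print(parse_release_year('Year not found'))
```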



## **Scraping Genres**

In [10]:
# Write Your Code here
# Movie genre

movie_genre_list = []# List to store all movie genre

# For every movie title present in movies_title_list , Finding their genre
for movie in movie_title_list:

    # Constructing the absolute URL for fetching each movie
    absolute_url = 'https://www.justwatch.com/in/movie/' + movie
    response_g = requests.get(absolute_url)
    soup = BeautifulSoup(response_g.text,'html.parser')

    # Selecting only those h3 whose heading is genres
    h3_element = soup.find('h3', class_='detail-infos__subheading', string='Genres')

    if h3_element:
        # Check if the next sibling is a div with class "detail-infos__value"
        div_element = h3_element.find_next_sibling('div', class_='detail-infos__value')

        if div_element:
            movie_genre_list.append(div_element.text.strip())
        else:
            movie_genre_list.append("Genre Not Listed")
    else:
        movie_genre_list.append("Genre Not Listed")


## **Scraping IMDb Rating**

In [11]:
# Write Your Code here
movie_imdb_list = []  # List to store all movie IMDb ratings

# For every movie title present in movie_title_list, find their IMDb rating
for movie in movie_title_list:
    
    # Constructing the absolute URL for fetching each movie
    absolute_url = 'https://www.justwatch.com/in/movie/' + movie
    response_g = requests.get(absolute_url)
    soup = BeautifulSoup(response_g.text, 'html.parser')

    # Selecting the h3 element whose heading is "Rating"
    h3_element = soup.find('h3', class_='detail-infos__subheading', string='Rating')

    if not h3_element:
        movie_imdb_list.append("IMDb Rating Not Listed.")
        continue  # Move to the next movie if the h3 element isn't found

    # Check if the next sibling is a div with class "detail-infos__value"
    div_element = h3_element.find_next_sibling('div', class_='detail-infos__value')

    if not div_element:
        movie_imdb_list.append("IMDb Rating Not Listed.")
        continue  # Move to the next movie if div_element isn't found

    # Find all div elements that contain the rating
    rating_divs = div_element.find_all('div', class_='jw-scoring-listing__rating')

    if not rating_divs:
        movie_imdb_list.append("IMDb Rating Not Listed.")
        continue  # Move to the next movie if no rating divs are found

    # Extract the last rating div as it contains the IMDb rating
    last_rating_div = rating_divs[-1]
    span_elements = last_rating_div.find_all('span')

    if not span_elements:
        movie_imdb_list.append("IMDb Rating Not Listed.")
        continue  # Move to the next movie if no span elements are found

    # Extract the rating from the last span element
    imdb_rating = span_elements[-1].text.strip()
    movie_imdb_list.append(imdb_rating)


## **Scraping Runtime/Duration**

In [12]:
movie_runtime_list = []  # List to store all movie runtimes/durations

# For every movie title present in movie_title_list, find their Runtime/Duration
for movie in movie_title_list:

    # Constructing the absolute URL for fetching each movie
    absolute_url = 'https://www.justwatch.com/in/movie/' + movie
    response_g = requests.get(absolute_url)
    soup = BeautifulSoup(response_g.text, 'html.parser')

    # Selecting the h3 element whose heading is "Runtime"
    h3_element = soup.find('h3', class_='detail-infos__subheading', string='Runtime')

    if not h3_element:
        movie_runtime_list.append("No Runtime/Duration mentioned")
        continue  # Move to the next movie if the h3 element isn't found

    # Check if the next sibling is a div with class "detail-infos__value"
    div_element = h3_element.find_next_sibling('div', class_='detail-infos__value')

    if not div_element:
        movie_runtime_list.append("No Runtime/Duration mentioned")
        continue  # Move to the next movie if div_element isn't found

    # Extract and append the runtime text after stripping any extra spaces
    movie_runtime_list.append(div_element.text.strip())


## **Scraping Age Rating**

In [13]:
movie_age_rating_list = []  # List to store all movie age ratings

# For every movie title present in movie_title_list, find their Age Rating
for movie in movie_title_list:

    # Construct the absolute URL for fetching each movie
    absolute_url = 'https://www.justwatch.com/in/movie/' + movie
    response_g = requests.get(absolute_url)
    soup = BeautifulSoup(response_g.text, 'html.parser')

    # Find the h3 element with the heading 'Age rating'
    h3_element = soup.find('h3', class_='detail-infos__subheading', string='Age rating')

    if not h3_element:
        movie_age_rating_list.append("Age Rating Not Listed.")
        continue  # Move to the next movie if 'Age rating' isn't found

    # Find the next sibling div that contains the age rating value
    div_element = h3_element.find_next_sibling('div', class_='detail-infos__value')

    if not div_element:
        movie_age_rating_list.append("Age Rating Not Listed.")
        continue  # Move to the next movie if no div with the age rating is found

    # Append the age rating after stripping any extra spaces
    movie_age_rating_list.append(div_element.text.strip())



## **Fetching Production Countries Details**

In [14]:
movie_production_country_list = []  # List to store all movie production countries

# For every movie title present in movie_title_list, find their Production Country
for movie in movie_title_list:

    # Constructing the absolute URL for fetching each movie
    absolute_url = 'https://www.justwatch.com/in/movie/' + movie
    response = requests.get(absolute_url)
    soup = BeautifulSoup(response.text, 'html.parser')

    # Selecting the h3 element with the subheading 'Production country'
    h3_element = soup.find('h3', class_='detail-infos__subheading', string='Production country')

    if not h3_element:
        movie_production_country_list.append("Production Country Not Listed")
        continue  # Move to the next movie if no 'Production country' heading is found

    # Find the next sibling div that contains the production country value
    div_element = h3_element.find_next_sibling('div', class_='detail-infos__value')

    if not div_element:
        movie_production_country_list.append("Production Country Not Listed")
        continue  # Move to the next movie if no div with the production country is found

    # Append the production country after stripping any extra spaces
    movie_production_country_list.append(div_element.text.strip())
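
The Genres, Runtime, Age rating and Production country cells all follow the same pattern: find an `<h3>` heading, then read the value `<div>` that follows it. A single helper could cover all four, so that each detail page is parsed once and queried several times instead of being downloaded again for every field. This is a sketch under the assumption that the page uses the `detail-infos__subheading` and `detail-infos__value` classes seen above; `get_detail_value` and the sample HTML are illustrative.

```python
from bs4 import BeautifulSoup

def get_detail_value(soup, heading, default='Not Listed'):
    """Return the text of the value <div> that follows the <h3> whose text is `heading`."""
    h3 = soup.find(
        'h3',
        class_='detail-infos__subheading',
        # Compare after stripping, so surrounding whitespace in the heading does not matter
        string=lambda s: s is not None and s.strip() == heading,
    )
    if h3 is None:
        return default
    value_div = h3.find_next_sibling('div', class_='detail-infos__value')
    return value_div.get_text(strip=True) if value_div else default

# Illustrative fragment, not the real page
detail_html = '''
<div>
  <h3 class="detail-infos__subheading"> Genres </h3>
  <div class="detail-infos__value">Drama, Crime</div>
  <h3 class="detail-infos__subheading">Runtime</h3>
  <div class="detail-infos__value">2h 32min</div>
</div>
'''
detail_soup = BeautifulSoup(detail_html, 'html.parser')
print(get_detail_value(detail_soup, 'Genres'))
print(get_detail_value(detail_soup, 'Runtime'))
print(get_detail_value(detail_soup, 'Age rating', default='Age Rating Not Listed.'))
```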



## **Fetching Streaming Service Details**

In [15]:
movie_streaming_list = []  # List to store all movie streaming platforms

# For every movie title present in movie_title_list, find their Streaming Platform
for movie in movie_title_list:

    # Constructing the absolute URL for fetching each movie
    absolute_url = 'https://www.justwatch.com/in/movie/' + movie
    response = requests.get(absolute_url)
    soup = BeautifulSoup(response.text, 'html.parser')

    # Finding the outer div element with the class "buybox-row stream"
    outer_div = soup.find('div', class_='buybox-row stream')

    if not outer_div:
        movie_streaming_list.append("Not Available for Streaming.")
        continue  # Move to the next movie if no streaming section is found

    # Finding the nested div with class "buybox-row__offers"
    inner_div = outer_div.find('div', class_='buybox-row__offers')

    if not inner_div:
        movie_streaming_list.append("Not Available for Streaming.")
        continue  # Move to the next movie if no offers section is found

    # Find the picture element within the nested div
    picture_element = inner_div.find('picture')

    if not picture_element:
        movie_streaming_list.append("Not Available for Streaming.")
        continue  # Move to the next movie if no picture element is found

    # Extract the alt attribute from the img element inside the picture (contains the platform name)
    img_element = picture_element.find('img')

    if img_element and 'alt' in img_element.attrs:
        movie_streaming_list.append(img_element['alt'])
    else:
        movie_streaming_list.append("Not Available for Streaming.")
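
The loop above records only the first provider it finds, while the brief asks for the streaming services (plural) available for each title. A possible variant that collects every provider in the streaming row is sketched below. It assumes each provider is rendered as an `<img>` with the name in its `alt` attribute inside the `buybox-row stream` div; the function name and the sample HTML are illustrative.

```python
from bs4 import BeautifulSoup

def extract_streaming_services(soup):
    """Return the distinct provider names (img alt text) in the streaming row, in page order."""
    stream_row = soup.find('div', class_='buybox-row stream')
    if stream_row is None:
        return []
    services = []
    for img in stream_row.find_all('img', alt=True):
        name = img['alt'].strip()
        if name and name not in services:
            services.append(name)
    return services

# Illustrative fragment, not the real page
offers_html = '''
<div class="buybox-row stream">
  <div class="buybox-row__offers">
    <picture><img alt="Netflix"></picture>
    <picture><img alt="Amazon Prime Video"></picture>
    <picture><img alt="Netflix"></picture>
  </div>
</div>
'''
offers_soup = BeautifulSoup(offers_html, 'html.parser')
streaming_services = extract_streaming_services(offers_soup)
print(', '.join(streaming_services) or 'Not Available for Streaming.')
```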



## **Now Creating Movies DataFrame**

In [None]:
# Creating Movies Dataframe

data_movies = {
    'Movie Title':movie_title_list,
    'IMDB Rating':movie_imdb_list,
    'Release Year':movie_release_year_list,
    'Genre':movie_genre_list,
    'Runtime/Duration':movie_runtime_list,
    'Age Rating':movie_age_rating_list,
    'Production Country':movie_production_country_list,
    'Streaming Platform':movie_streaming_list,
    'Url':movie_url_list
}

df_movies = pd.DataFrame(data_movies)
# Display the first few rows of the DataFrame to verify
print(df_movies.head())


## **Scraping TV  Show Data**

In [None]:
# Specifying the URL from which TV show related data will be fetched
tv_url='https://www.justwatch.com/in/tv-shows?release_year_from=2000'
# Sending an HTTP GET request to the URL
page=requests.get(tv_url)
# Parsing the HTML content using BeautifulSoup with the 'html.parser'
# (named soup_tv because the title cell below reads from soup_tv)
soup_tv=BeautifulSoup(page.text,'html.parser')
# Printing the prettified HTML content
print(soup_tv.prettify())
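
Note 1 of the brief asks for robust error handling, and a scraper that requests dozens of pages will occasionally hit a timeout or a temporary block. One way to approach this, sketched below, is a small retry wrapper. The helper `fetch_with_retries` is hypothetical, and it takes the download function as an argument so that it can be tried here with a stand-in instead of a live request.

```python
import time

def fetch_with_retries(get, url, retries=3, delay=1.0):
    """Call get(url) up to `retries` times and return the first successful result.

    `get` is any callable that returns a result or raises an exception.
    """
    last_error = None
    for attempt in range(1, retries + 1):
        try:
            return get(url)
        except Exception as error:  # broad on purpose: the callable decides what to raise
            last_error = error
            if attempt < retries:
                time.sleep(delay * attempt)  # wait a little longer after each failure
    raise RuntimeError(f'Giving up on {url} after {retries} attempts') from last_error

# Stand-in downloader that fails twice before succeeding
calls = {'count': 0}

def flaky_get(url):
    calls['count'] += 1
    if calls['count'] < 3:
        raise ConnectionError('temporary failure')
    return f'<html>{url}</html>'

html = fetch_with_retries(flaky_get, 'https://example.com', retries=3, delay=0)
print(html, calls['count'])
```

In the notebook, `get` could be a small function that calls `requests.get` with the headers and a timeout, then `raise_for_status()`, and returns `response.text`.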

## **Fetching Tv Show Title details**

In [None]:
# Tv Shows title

tv_show_title_list=[] # List to store all tv show title

# Extracting all tv show titles and storing them in tv_show_titles
tv_show_titles = soup_tv.find_all('a',class_='title-list-grid__item--link',attrs={'href':True})

# Extracting each tv show title from tv_show_titles and storing in tv_show_title_list
for tv_show_title in tv_show_titles:

    # Extract the 'href' attribute value, which contains the tv_show title
    data_id_value = tv_show_title['href']

    # Removing the '/in/tv-show/' prefix to get the clean tv_show title
    data_id_value = data_id_value.replace("/in/tv-show/","")

    # Converting the tv_show title to uppercase and appending to the list
    tv_show_title_list.append(data_id_value.upper())


## **Fetching Tv shows Url details**

In [None]:
# Tv Shows url
tv_show_url_list=[] # List to store all tv show urls

# For every tv show title present in tv_show_title_list , Finding their url
for tv_show in tv_show_title_list:

    # Constructing the absolute URL for fetching each tv show
    absolute_url = 'https://www.justwatch.com/in/tv-show/' + tv_show

    tv_show_url_list.append(absolute_url)


## **Fetching Release Year**

In [None]:
# Tv Shows Release year

tv_show_release_year_list = [] # List to store all tv show Release Year

# For every tv show title present in tv_show_title_list , Finding their release year
for tv_show in tv_show_title_list:

    # Constructing the absolute URL for fetching each tv show
    absolute_url = 'https://www.justwatch.com/in/tv-show/' + tv_show

    # Sending an HTTP GET request to the url
    response = requests.get(absolute_url)

    # Parsing HTML content with Beautiful Soup
    soup = BeautifulSoup(response.text,'html.parser')

    # Finding the release year element; it may be missing on some pages
    release_year_element = soup.find('span',class_='text-muted')

    if release_year_element:
        tv_show_release_year = release_year_element.text.strip()
        tv_show_release_year = tv_show_release_year.replace("(","").replace(")","")
        tv_show_release_year_list.append(tv_show_release_year)
    else:
        # Keep the list aligned with tv_show_title_list when the year is missing
        tv_show_release_year_list.append('Year not found')


## **Fetching TV Show Genre Details**

In [None]:
# Tv Shows Genre

tv_show_genre_list = [] # List to store all tv show Genres

# For every tv show title present in tv_show_title_list , Finding their Genre
for tv_show in tv_show_title_list:

    # Constructing the absolute URL for fetching each tv show
    absolute_url = 'https://www.justwatch.com/in/tv-show/' + tv_show
    response = requests.get(absolute_url)
    soup = BeautifulSoup(response.text,'html.parser')

    # Selecting only those h3 whose heading is genres
    h3_element = soup.find('h3', class_='detail-infos__subheading', string='Genres')

    if h3_element:
        # Check if the next sibling is a div with class "detail-infos__value"
        div_element = h3_element.find_next_sibling('div', class_='detail-infos__value')

        if div_element:
            tv_show_genre_list.append(div_element.text.strip())
        else:
            tv_show_genre_list.append("Genre Not Listed")
    else:
        tv_show_genre_list.append("Genre Not Listed")


## **Fetching IMDB Rating Details**

In [None]:
# Tv Shows  Imdb Rating

tv_show_imdb_list = [] # List to store all tv show Imdb Rating

# For every tv show title present in tv_show_title_list , Finding their Imdb Rating
for tv_show in tv_show_title_list:

    # Constructing the absolute URL for fetching each tv show
    absolute_url = 'https://www.justwatch.com/in/tv-show/' + tv_show
    response = requests.get(absolute_url)
    soup = BeautifulSoup(response.text, 'html.parser')

    # Selecting only those h3 whose heading is Rating
    h3_element = soup.find('h3', class_='detail-infos__subheading', string='Rating')

    if h3_element:
        # Check if the next sibling is a div with class "detail-infos__value"
        div_element = h3_element.find_next_sibling('div', class_='detail-infos__value')

        if div_element:
            inside_div = div_element.find_all('div', class_='jw-scoring-listing__rating')

            # Check if inside_div is non-empty
            if inside_div:
                # The last rating div holds the IMDb score
                inside_div_last = inside_div[-1]

                # Check if inside_div_last is non-empty
                if inside_div_last:
                    span_all = inside_div_last.find_all('span')

                    # Check if span_all is non-empty
                    if span_all:
                        # The last span inside the last rating div contains the rating text
                        span_last = span_all[-1]
                        tv_show_imdb_list.append(span_last.text.strip())
                    else:
                        tv_show_imdb_list.append("Imdb Rating Not Listed.")
                else:
                    tv_show_imdb_list.append("Imdb Rating Not Listed.")
            else:
                tv_show_imdb_list.append("Imdb Rating Not Listed.")
        else:
            tv_show_imdb_list.append("Imdb Rating Not Listed.")
    else:
        tv_show_imdb_list.append("Imdb Rating Not Listed.")


## **Fetching Age Rating Details**

In [None]:
# Tv Shows Age Rating

tv_show_age_rating_list = [] # List to store all tv show Age Ratings

# For every tv show title present in tv_show_title_list , Finding their Age Rating
for tv_show in tv_show_title_list:

    # Constructing the absolute URL for fetching each tv show
    absolute_url = 'https://www.justwatch.com/in/tv-show/' + tv_show
    response = requests.get(absolute_url)
    soup = BeautifulSoup(response.text,'html.parser')

    # Selecting only those h3 whose heading is Age rating
    h3_element = soup.find('h3', class_='detail-infos__subheading', string='Age rating')

    if h3_element:
        # Check if the next sibling is a div with class "detail-infos__value"
        div_element = h3_element.find_next_sibling('div', class_='detail-infos__value')

        if div_element:
            tv_show_age_rating_list.append(div_element.text.strip())
        else:
            tv_show_age_rating_list.append("Age Rating Not Listed.")
    else:
        tv_show_age_rating_list.append("Age Rating Not Listed.")


## **Fetching Production Country details**

In [None]:
# Tv Shows Production Country

tv_show_production_country_list=[] # List to store all tv show Production Countries

# For every tv show title present in tv_show_title_list , Finding their Production country
for tv_show in tv_show_title_list:

    # Constructing the absolute URL for fetching each tv show
    absolute_url = 'https://www.justwatch.com/in/tv-show/' + tv_show
    response = requests.get(absolute_url)
    soup = BeautifulSoup(response.text,'html.parser')

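    # Selecting the h3 whose sub-heading is 'Production country'.
    # The regex tolerates optional surrounding whitespace, so it matches both
    # 'Production country' (as used for movies) and ' Production country '
    h3_element = soup.find('h3', class_='detail-infos__subheading', string=re.compile(r'^\s*Production country\s*$'))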

    if h3_element:
        # Check if the next sibling is a div with class "detail-infos__value"
        div_element = h3_element.find_next_sibling('div', class_='detail-infos__value')

        if div_element:
            tv_show_production_country_list.append(div_element.text.strip())
        else:
            tv_show_production_country_list.append("Production Country Not Listed")
    else:
        tv_show_production_country_list.append("Production Country Not Listed")



## **Fetching Streaming Service details**

In [None]:
# Tv Shows Streaming Platform

tv_show_streaming_list=[] # List to store all tv show Streaming Platforms

# For every tv show title present in tv_show_title_list , Finding their Streaming Platform
for tv_show in tv_show_title_list:

    # Constructing the absolute URL for fetching each tv show
    absolute_url = 'https://www.justwatch.com/in/tv-show/' + tv_show
    response = requests.get(absolute_url)
    soup = BeautifulSoup(response.text,'html.parser')

    # Finding the outer div element with the class "buybox-row stream"
    outer_div = soup.find('div', class_='buybox-row stream')

    if outer_div:
        # Finding the nested div with class "buybox-row__offers" inside the outer div
        inner_div = outer_div.find('div', class_='buybox-row__offers')

        if inner_div:
            # Find the picture element within the nested div
            picture_element = inner_div.find('picture')

            if picture_element:
                # Extract the alt attribute from the img element inside the picture which contains streaming platform name
                img_element = picture_element.find('img')
                # Guard against an img without an alt attribute (would raise KeyError)
                if img_element and img_element.get('alt'):
                    alt_text = img_element['alt']
                    tv_show_streaming_list.append(alt_text)
                else:
                    tv_show_streaming_list.append("Not Available for Streaming.")
            else:
                tv_show_streaming_list.append("Not Available for Streaming.")
        else:
            tv_show_streaming_list.append("Not Available for Streaming.")
    else:
        tv_show_streaming_list.append("Not Available for Streaming.")

## **Fetching Duration Details**

In [None]:
# Tv Shows Runtime/Duration

tv_show_runtime_list=[] # List to store all tv show Runtimes

# For every tv show title present in tv_show_title_list , Finding their Runtime/Duration
for tv_show in tv_show_title_list:

    # Constructing the absolute URL for fetching each tv show
    absolute_url = 'https://www.justwatch.com/in/tv-show/' + tv_show
    response = requests.get(absolute_url)
    soup = BeautifulSoup(response.text,'html.parser')

    # Selecting only those h3 whose heading is Runtime
    h3_element = soup.find('h3', class_='detail-infos__subheading', string='Runtime')

    if h3_element:
        # Check if the next sibling is a div with class "detail-infos__value"
        div_element = h3_element.find_next_sibling('div', class_='detail-infos__value')

        if div_element:
            tv_show_runtime_list.append(div_element.text.strip())
        else:
            tv_show_runtime_list.append("No Runtime/Duration mentioned")
    else:
        tv_show_runtime_list.append("No Runtime/Duration mentioned")



## **Creating TV Show DataFrame**

In [None]:
# Write Your Code here
# Creating Tv Shows Dataframe

data_tv_shows = {
    'Tv_Show Title':tv_show_title_list,
    'IMDB Rating':tv_show_imdb_list,
    'Release Year':tv_show_release_year_list,
    'Genre':tv_show_genre_list,
    'Runtime/Duration':tv_show_runtime_list,
    'Age Rating':tv_show_age_rating_list,
    'Production Country':tv_show_production_country_list,
    'Streaming Platform':tv_show_streaming_list,
    'Url':tv_show_url_list
}

df_tv_shows = pd.DataFrame(data_tv_shows)


## **Task 2 :- Data Filtering & Analysis**

In [None]:
# Filtering movies and TV shows to include only those released in the last two years and with an IMDB Rating of 7 or higher.

from datetime import datetime, timedelta

# Get the current date
current_date = datetime.now()

# Calculate the date 2 years ago from the current date
two_years_ago = current_date - timedelta(days=365 * 2)

def filter_df(df, release_year_col, imdb_rating_col, years_ago, current_date):
    # Convert 'Release Year' to datetime format
    df[release_year_col] = pd.to_datetime(df[release_year_col], errors='coerce')

    # Keep only entries released between the `years_ago` cutoff date and `current_date`
    # (use the function's own parameter rather than the global two_years_ago)
    filtered_df = df[(df[release_year_col] >= years_ago) & (df[release_year_col] <= current_date)].copy()

    # Converting 'IMDB Rating' column to a string so that, in the next step, we can convert it to numeric values
    filtered_df.loc[:, imdb_rating_col] = filtered_df[imdb_rating_col].astype(str)

    # Extract numeric part and convert to numeric
    filtered_df[imdb_rating_col] = pd.to_numeric(filtered_df[imdb_rating_col].str.extract(r'([\d.]+)', expand=False), errors='coerce')

    # Filter the DataFrame to include only entries whose IMDb Rating >= 7
    filtered_df = filtered_df[filtered_df[imdb_rating_col] >= 7]

    return filtered_df

# Filtering Movies
filtered_df_movies = filter_df(df_movies, 'Release Year', 'IMDB Rating', two_years_ago, current_date)

# Filtering TV Shows
filtered_df_tv_shows = filter_df(df_tv_shows, 'Release Year', 'IMDB Rating', two_years_ago, current_date)
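
Because only the release year is scraped, converting it to a timestamp places every title on 1 January of its year, so a date cutoff computed mid-year can exclude titles released in the second half of the cutoff year. An alternative, sketched here, compares whole years instead. This is an assumption about what "last 2 years" should mean, not the notebook's method; the function name and sample data are illustrative, and the rating text is assumed to begin with the numeric score.

```python
from datetime import date
import pandas as pd

def filter_recent_high_rated(df, year_col, rating_col, years=2, min_rating=7.0, today=None):
    """Keep rows released within the last `years` calendar years with rating >= min_rating."""
    today = today or date.today()
    # Non-numeric placeholders such as 'Year not found' become NaN and never pass the filter
    release_year = pd.to_numeric(df[year_col], errors='coerce')
    rating = pd.to_numeric(
        df[rating_col].astype(str).str.extract(r'(\d+(?:\.\d+)?)', expand=False),
        errors='coerce',
    )
    mask = (
        (release_year >= today.year - years)
        & (release_year <= today.year)
        & (rating >= min_rating)
    )
    result = df.loc[mask].copy()  # the input DataFrame is left untouched
    result[year_col] = release_year[mask].astype(int)
    result[rating_col] = rating[mask]
    return result

sample = pd.DataFrame({
    'Title': ['A', 'B', 'C', 'D'],
    'Release Year': ['2024', '2019', '2023', 'Year not found'],
    'IMDB Rating': ['7.8 (12k)', '8.9 (2m)', '6.1 (900)', '9.0 (1k)'],
})
recent = filter_recent_high_rated(sample, 'Release Year', 'IMDB Rating', today=date(2024, 6, 1))
print(recent)
```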



## **Calculating Mean IMDB Ratings for both Movies and Tv Shows**

In [None]:
# Calculating Movies mean IMDb rating
movie_mean_imdb = filtered_df_movies['IMDB Rating'].mean()
movie_mean_imdb_rounded = round(movie_mean_imdb, 2)
print("Mean IMDb Rating for Movies is:", movie_mean_imdb_rounded)

# Calculating Tv Shows mean IMDb rating
tv_mean_imdb = filtered_df_tv_shows['IMDB Rating'].mean()
tv_mean_imdb_rounded = round(tv_mean_imdb, 2)
print("Mean IMDb Rating for Tv Shows is:", tv_mean_imdb_rounded)



## **Analyzing Top Genres**

In [None]:
# Function for Finding top 5 Highest Imdb rating movies / tv shows

def get_top_5_imdb(df):

  # Convert 'IMDB Rating' column to string
  df['IMDB Rating'] = df['IMDB Rating'].astype(str)

  # Extract only the IMDb rating value
  # Raw string avoids an invalid-escape warning; expand=False returns a Series
  df['IMDB Rating'] = df['IMDB Rating'].str.extract(r'(\d+\.\d+)', expand=False)

  # Convert the 'IMDB Rating' column to numeric
  df['IMDB Rating'] = pd.to_numeric(df['IMDB Rating'], errors='coerce')

  # Select the top 5 movies/Tv Shows based on IMDb rating
  top_5 = df.nlargest(5, 'IMDB Rating')

  return top_5


In [None]:
# Top 5 Highest IMDB Rating Movies

top_5_movies = get_top_5_imdb(filtered_df_movies)
print(top_5_movies.loc[:, ['Movie Title', 'IMDB Rating']])


In [None]:
# Top 5 Highest IMDB Rating Tv Shows

top_5_tv_shows = get_top_5_imdb(filtered_df_tv_shows)
print(top_5_tv_shows.loc[:, ['Tv_Show Title', 'IMDB Rating']])
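
The brief also asks for the top 5 genres by number of titles. A minimal way to count them is sketched below, assuming each `Genre` cell is a comma-separated string such as `Drama, Crime` and that `Genre Not Listed` marks a missing value. The helper name `top_genres` is illustrative.

```python
from collections import Counter

def top_genres(genre_strings, n=5, placeholder='Genre Not Listed'):
    """Count individual genres across comma-separated strings and return the n most common."""
    counts = Counter()
    for entry in genre_strings:
        if not isinstance(entry, str) or entry == placeholder:
            continue  # skip missing values and the placeholder
        for genre in entry.split(','):
            genre = genre.strip()
            if genre:
                counts[genre] += 1
    return counts.most_common(n)

sample_genres = ['Drama, Crime', 'Drama, Thriller', 'Comedy', 'Genre Not Listed', 'Crime, Drama']
print(top_genres(sample_genres, n=2))
```

With the notebook's data this could be called as `top_genres(filtered_df_movies['Genre'])`, and likewise for the TV show DataFrame.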

## **Finding Predominant Streaming Service**

In [None]:
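# Function for finding the predominant streaming service for movies / TV shows

# WordCloud and pyplot are used below; outside Colab they may need: pip install wordcloud matplotlib
from wordcloud import WordCloud
import matplotlib.pyplot as plt

def visualize_streaming_distribution_wordcloud(df):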
    # Filter streaming information available
    streaming_platforms = df[df['Streaming Platform'] != 'Not Available for Streaming.']['Streaming Platform']

    # Create a string of streaming platforms
    streaming_text = ' '.join(streaming_platforms)

    # Generate the word cloud
    wordcloud = WordCloud(width=800, height=400, background_color='white').generate(streaming_text)

    # Display the word cloud
    plt.figure(figsize=(10, 6))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis('off')
    plt.title('Streaming Service Distribution - Word Cloud')
    plt.show()

    # Identify the predominant streaming service
    predominant_service = streaming_platforms.mode().iloc[0]
    print(f"The predominant streaming service is: {predominant_service}")
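
If only the answer is needed and not the plot, the predominant service can also be found with a plain tally across both DataFrames. The sketch below assumes the `Streaming Platform` column and the `Not Available for Streaming.` placeholder used earlier; `most_common_platform` and the sample frames are illustrative.

```python
import pandas as pd

def most_common_platform(*frames, column='Streaming Platform',
                         placeholder='Not Available for Streaming.'):
    """Return (platform, count) for the most frequent platform across all frames, or None."""
    platforms = pd.concat([frame[column] for frame in frames], ignore_index=True)
    platforms = platforms[platforms != placeholder].dropna()
    if platforms.empty:
        return None  # nothing is available for streaming
    counts = platforms.value_counts()
    return counts.idxmax(), int(counts.max())

movies_sample = pd.DataFrame(
    {'Streaming Platform': ['Netflix', 'Netflix', 'Not Available for Streaming.']}
)
tv_sample = pd.DataFrame({'Streaming Platform': ['Hotstar', 'Netflix']})
print(most_common_platform(movies_sample, tv_sample))
```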

## **Task 3 :- Data Export**

In [None]:
# Saving Final Movies/Tv Shows dataframe as Final Data in csv format

df_movies.to_csv('Final_Movies_Data.csv', index=False)
df_tv_shows.to_csv('Final_Tv_Shows_Data.csv', index=False)


In [None]:
# Saving Filtered Movies/Tv Shows dataframe as Filtered Data in csv format

filtered_df_movies.to_csv('Filtered_Movies_Data.csv', index=False)
filtered_df_tv_shows.to_csv('Filtered_Tv_Shows_Data.csv', index=False)
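
If a single file is preferred for reporting, the two filtered DataFrames could be combined before export. This sketch assumes the title columns are named `Movie Title` and `Tv_Show Title` as above; the `Type` column, the helper name and the sample rows are additions for illustration, and the CSV is written to an in-memory buffer here rather than to disk.

```python
import io
import pandas as pd

def combine_for_export(movies, tv_shows):
    """Stack movies and TV shows into one frame with a shared 'Title' and a 'Type' column."""
    movies = movies.rename(columns={'Movie Title': 'Title'}).assign(Type='Movie')
    tv_shows = tv_shows.rename(columns={'Tv_Show Title': 'Title'}).assign(Type='TV Show')
    return pd.concat([movies, tv_shows], ignore_index=True)

movies_demo = pd.DataFrame({'Movie Title': ['A'], 'IMDB Rating': [7.5]})
tv_demo = pd.DataFrame({'Tv_Show Title': ['B'], 'IMDB Rating': [8.1]})
combined = combine_for_export(movies_demo, tv_demo)

buffer = io.StringIO()
combined.to_csv(buffer, index=False)  # pass a file name instead of the buffer to save to disk
csv_text = buffer.getvalue()
print(csv_text)
```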


# ***Congratulations!!! You have completed your Assignment.***