# **Web Scraping & Data Handling Challenge**



### **Website:**
JustWatch -  https://www.justwatch.com/in/movies?release_year_from=2000


### **Description:**

JustWatch is a popular platform that allows users to search for movies and TV shows across multiple streaming services like Netflix, Amazon Prime, Hulu, etc. For this assignment, you will be required to scrape movie and TV show data from JustWatch using Selenium, Python, and BeautifulSoup. Extract data from HTML, not by directly calling their APIs. Then, perform data filtering and analysis using Pandas, and finally, save the results to a CSV file.

### **Tasks:**

**1. Web Scraping:**

Use BeautifulSoup to scrape the following data from JustWatch:

   **a. Movie Information:**

      - Movie title
      - Release year
      - Genre
      - IMDb rating
      - Streaming services available (Netflix, Amazon Prime, Hulu, etc.)
      - URL to the movie page on JustWatch

   **b. TV Show Information:**

      - TV show title
      - Release year
      - Genre
      - IMDb rating
      - Streaming services available (Netflix, Amazon Prime, Hulu, etc.)
      - URL to the TV show page on JustWatch

  **c. Scope:**

```
 ` - Scrape data for at least 50 movies and 50 TV shows.
   - You can choose the entry point (e.g., starting with popular movies,
     or a specific genre, etc.) to ensure a diverse dataset.`

```


**2. Data Filtering & Analysis:**

   After scraping the data, use Pandas to perform the following tasks:

   **a. Filter movies and TV shows based on specific criteria:**

   ```
      - Only include movies and TV shows released in the last 2 years (from the current date).
      - Only include movies and TV shows with an IMDb rating of 7 or higher.
```

   **b. Data Analysis:**

   ```
      - Calculate the average IMDb rating for the scraped movies and TV shows.
      - Identify the top 5 genres that have the highest number of available movies and TV shows.
      - Determine the streaming service with the most significant number of offerings.
      
   ```   

**3. Data Export:**

```
   - Dump the filtered and analysed data into a CSV file for further processing and reporting.

   - Keep the CSV file in your Drive Folder and Share the Drive link on the colab while keeping view access with anyone.
```

**Submission:**
```
- Submit a link to your Colab made for the assignment.

- The Colab should contain your Python script (.py format only) with clear
  comments explaining the scraping, filtering, and analysis process.

- Your Code shouldn't have any errors and should be executable at a one go.

- Before Conclusion, Keep your Dataset Drive Link in the Notebook.
```



**Note:**

1. Properly handle errors and exceptions during web scraping to ensure a robust script.

2. Make sure your code is well-structured, easy to understand, and follows Python best practices.

3. The assignment will be evaluated based on the correctness of the scraped data, accuracy of data filtering and analysis, and the overall quality of the Python code.








# **Start The Project**

## **Task 1:- Web Scrapping**

In [10]:
#Installing all necessary labraries
!pip install bs4
!pip install requests



In [11]:
#import all necessary labraries
import requests
from bs4 import BeautifulSoup
import re
import pandas as pd
import numpy as np
from bs4 import BeautifulSoup
import time


## **Scrapping Movies Data**

In [12]:
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Accept-Language': 'en-US,en;q=0.9','Accept-Encoding': 'gzip, deflate, br',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
}

In [5]:
import requests
from bs4 import BeautifulSoup

url = 'https://www.justwatch.com/in/movies?release_year_from=2000'

# Define the headers for the request
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
    'Accept-Language': 'en-US,en;q=0.9','Accept-Encoding': 'gzip, deflate, br',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
}

# Send the GET request
content = requests.get(url, headers=headers)

# Check if the content was successfully retrieved
if content.status_code == 200:
    # Decode the content properly and parse it using BeautifulSoup
    content.encoding = content.apparent_encoding  # Set correct encoding
    soup = BeautifulSoup(content.text, 'html.parser')
    print(soup.prettify())  # To see the parsed HTML
else:
    print(f"Failed to retrieve content, Status code: {content.status_code}")


���GT����� @}�&lt;tʀ�� ��y�#���� UUU�	�ݫ�~��7��ß����?��w �  ������������������������������ۇO_�����ߗj�[������Q��-����: KKRR���e�ECѾ��ooV����󥮌�H���~]g����2,$:�,��r��t��O�f滩�E4��b���K� !����23�������+���ױ����ʬ,T!&gt;�'V��U@�E�A�����q�w��կV[{f�g(�|�/�#۱����؉���\��X (Y~S�}og;�e��i~Y.b�Q"K��R�{�KU�^��!q��z�}_���!�~�Iq� 9̐	3�E�Z�!�������O��+����7���~��vud�̴��5g���I� J��D^�(��
 C��Ŝ��/��޲/�=�פ��߼?϶�߿�~e�wjG�[be���i��f���/2_4^FdGīd�����Zκ�区Ҵ��=��UdJ{�$uq�׊�����i -f�3R����NRW�ܙ�ۿZY�����3=Kf��po��AKUIU��JK-�.�*#�,���ԛ�g3�����o��rzҽ���qSi����&amp;�Hʽ�}�&amp;�p�d?�@c �8wU�;��5�V���e*wt������w��
2Jq���	IAzFwA��*H�������[���YA)9ArPh�[� �RT�f,��0����&gt;�y��9�$u�
<o�gp���"��d�#��(����}���si������%j�t�t�z`���a����a����������j�zc �$(�if�="" ��[u��cο�꓏y;����l9m���pilpst\t.��w�="���&amp;��f�]&lt;3]&lt;Y" !�ϡ���ք�*����{s+�a�|�#ru�ۑ���\���{q�`02���������@c="" &y����3*}u��]�y6u������ �qje�ۈ�

In [23]:
# fetch the web content
url ='https://www.justwatch.com/in/movies?release_year_from=2000'
content = requests.get(url,headers=headers)
soup = BeautifulSoup(content.text,'html.parser')
content.encoding = content.apparent_encoding
soup


�&gt;�GT����� @}�&lt;tʀ�� ��y�$`�Y��PUU��ܽj? �_~�?��/�ǿ��I` pL   ��`hdlbjfnaiemckg������������ͻ��|���Ͽ/���j��9W��D)[�
1�
bu ����Ρ���.��}�)
��ެ�����K]��-(@�%���0
}�e XHt2
X�������D��wS�=�h����ZB�?]Ief~U���W���c��	���YY�B|6O�P=��5��$� ��m�����V����H��I�1���Ρ$K�5X_�%[I�ϭ*�T )�uV�������i~QJ\��6�.�7�e�z9'�30\i�X�K�~
����}q�I����G k.��i�
@�z
"�����Y4A�@,��,�����;v���8�iw�k����I� J��D^��H@�!��b�k��Ky���{��&amp;��۟g�ʯ7�*I}GjI�TˣY@vvP@
�jt/�y�����1�
ö�8�SO]I�W����ڙl~U
��d��_Ij���)=�A�j`����3R��k��5u�=���*��]mO����PU�s0���
�HIER
4t4�2�	V&amp;P�U,zjO����i���{{��y]����M��97��DDf��c�8$�� ����#�0J0"xɛ�iQ�M5U��o��
�;6�Ã�0���
����R���AwBR���]�d�
�&lt;;�G+�7�V��vVPJGN�
�V;@�ԅ��
�0
�����e^�t·1�A]%�S����)E4��)2��{�9{�����\�|���D�Z.U-��
�2uGmG�A��$y��({�����o��8�Ҡ�X��R�AY|�$#�H����$�vn9���ӓ��������%���%��*�
�o��TP��ҧ�����4i���M�f�=݀
J�f

J���7S�Rb
�'��
k���
���FrЮ�$$��{�w�4�f�����vm���c�k� �%�r\:
����sΟ�	s�hwݕ

In [24]:
soup

�&gt;�GT����� @}�&lt;tʀ�� ��y�$`�Y��PUU��ܽj? �_~�?��/�ǿ��I` pL   ��`hdlbjfnaiemckg������������ͻ��|���Ͽ/���j��9W��D)[�
1�
bu ����Ρ���.��}�)
��ެ�����K]��-(@�%���0
}�e XHt2
X�������D��wS�=�h����ZB�?]Ief~U���W���c��	���YY�B|6O�P=��5��$� ��m�����V����H��I�1���Ρ$K�5X_�%[I�ϭ*�T )�uV�������i~QJ\��6�.�7�e�z9'�30\i�X�K�~
����}q�I����G k.��i�
@�z
"�����Y4A�@,��,�����;v���8�iw�k����I� J��D^��H@�!��b�k��Ky���{��&amp;��۟g�ʯ7�*I}GjI�TˣY@vvP@
�jt/�y�����1�
ö�8�SO]I�W����ڙl~U
��d��_Ij���)=�A�j`����3R��k��5u�=���*��]mO����PU�s0���
�HIER
4t4�2�	V&amp;P�U,zjO����i���{{��y]����M��97��DDf��c�8$�� ����#�0J0"xɛ�iQ�M5U��o��
�;6�Ã�0���
����R���AwBR���]�d�
�&lt;;�G+�7�V��vVPJGN�
�V;@�ԅ��
�0
�����e^�t·1�A]%�S����)E4��)2��{�9{�����\�|���D�Z.U-��
�2uGmG�A��$y��({�����o��8�Ҡ�X��R�AY|�$#�H����$�vn9���ӓ��������%���%��*�
�o��TP��ҧ�����4i���M�f�=݀
J�f

J���7S�Rb
�'��
k���
���FrЮ�$$��{�w�4�f�����vm���c�k� �%�r\:
����sΟ�	s�hwݕ

## **Fetching Movie URL's**

In [25]:
# Write Your Code here
movie_url=[]
for x in soup.find_all('a',class_="title-list-grid__item--link"):
  movie_url.append('https://www.justwatch.com'+ x['href'])
print(movie_url)
print(len(movie_url))

[]
0


In [6]:
soup

���GT����� �Z:e��z  ���pG=i�= ���
��W� ���o~��?��o������
 �	   ��������������������������������|���Ͽ/���j��9W��D)[�
1�
bu ����Ρ���.��}�)
��ެ�����K]��-(@�%���0
}�e XHt2
X�������D��wS�=�h����ZB�?]Ief~U���W���c��	���YY�B|6O�P=��5��$� ��m������_��(����P���_�G�c��'nK�W�ɹ$!� P���V���v��+���\��/�D�2�� ������
C�ğ?���_�y�Iw� ��`�M�d��l�x�$lm��!
�
����o���Ǥ�
Y`$A�0�L�!�zJ��ͪ����2#2��-"2�j�{�A$$"�$� �
eG
"�D���
 ]������ʽ渏�޽k#��f�/�����L���;��#�h��
+�@5Y� ��;'*_d�h���Y���
�0�u�]�u�i�Fi{|��Ȕ�'�Iꤖ׊���
��i -f�3
R����NRW�ܙ�ۿZY���
��3=Kf��po��AKUIU��JK-�.�*#�,���ԛ�g3�����o��rzʥ_?%QU�M$�����~$a��k��	
�,�^Q�[���h�l�o��
�;6�Ã�0���
����R���AwBR���]�d�
�&lt;;�G+�7�V��vVPJGN�
�V;@�ԅ��
�0
�����e^�t·1�A]%�S����)E4��)2��{�9{�����\�|���D�Z.U-��
�2uGmG�A��$y��({�����o��8�Ҡ�X��R�AY|�$#�H����$�vn9���ӓ��������%���%��*�
�o��TP��ҧ�����4i���M�f�=݀
J�f

J����T�l�"֐yj�4�����E��:he$�,��}��w�t�ڨlIۢ2�+e�Ƭ #�Y
X�K�4(����?��ґ$��;�8

## **Scrapping Movie Title**

In [7]:
# Write Your Code here
movie_titles=[]
for url in movie_url:
  content=requests.get(url,headers=headers)
  soup=BeautifulSoup(content.text,'html.parser')
  title=soup.find_all('h1')[0].text
  movie_titles.append(title)
print(len(movie_titles))
print(movie_titles)


NameError: name 'movie_url' is not defined

## **Scrapping release Year**

In [8]:
movie_year = []

# Loop over each movie URL
for url in movie_url:
    try:
        # Fetch the page content
        content = requests.get(url, headers=headers)
        soup = BeautifulSoup(content.text, 'html.parser')

        # Try to find the release year on the page
        year = soup.find('span', class_='release-year')
        if year:
            year = year.text.strip('()')  # Clean up the year (remove parentheses)
        else:
            year = 'NA'  # If the year is not found, mark as 'NA'

        movie_year.append(year)  # Append the year to the list
    except Exception as e:
        movie_year.append('NA')  # If there's an error, append 'NA'
        print(f"Error processing {url}: {e}")

# Output the movie years and their count
print(movie_year)

NameError: name 'movie_url' is not defined

## **Scrapping Genres**

In [11]:

movie_genre = []

# Loop over each movie URL
for url in movie_url:
    try:
        # Fetch the page content
        content = requests.get(url, headers=headers)
        soup = BeautifulSoup(content.text, 'html.parser')

        # Try to find the genre on the page (adjust the tag and class as needed)
        genre_tag = soup.find('span', class_='genre')  # Example class, update based on actual page
        if genre_tag:
            genre = genre_tag.text.strip()
        else:
            genre = 'NA'  # If genre is not found, mark as 'NA'

        movie_genre.append(genre)  # Append the genre to the list
    except Exception as e:
        genre = 'NA'  # If there's an error, set genre as 'NA'
        movie_genre.append(genre)  # Append the genre to the list
        print(f"Error processing {url}: {e}")

# Output the movie genres and their count
print(movie_genre)
print(f"Total movies processed: {len(movie_genre)}")

['NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA', 'NA']
Total movies processed: 100


In [12]:
# Write Your Code here
movie_genre=[]
for url in movie_url:
  try:
    content=requests.get(url,headers=headers)
    soup=BeautifulSoup(content.text,'html.parser')
    for x in soup.find_all('div',class_='detail-infos'):
      if x.find_all('h3')[0].text=='Genres':
        genre = x.find_all('span')[0].text
  except:
    genre='NA'
movie_genre.append(genre )
print(movie_genre)
print(len(movie_genre))



['NA']
1


## **Scrapping IMBD Rating**

In [13]:
# Write Your Code here
movie_rating=[]
for url in movie_url:
  try:
    content=requests.get(url,headers=headers)
    soup=BeautifulSoup(content.text,'html.parser')
    for x in soup.find_all('div',class_='poster-detail-infos'):
      if x.find_all('h3')[0].text=='Rating':
        rating=x.find_all('div')[0].text
  except:
    rating='NA'
    movie_rating.append(rating)
print(movie_rating)

[]


## **Scrapping Runtime/Duration**

In [14]:
# Write Your Code here
# runtime
movie_runtime=[]
for url in movie_url:
  try:
    content=requests.get(url,headers=headers)
    soup=BeautifulSoup(content.text,'html.parser')
    for x in soup.find_all('div',class_='detail-infos'):
      if x.find_all('h3')[0].text=='Runtime':
        runtime=x.find_all('div')[0].text
  except:
    runtime='NA'
movie_runtime.append(runtime)
print(movie_runtime)



NameError: name 'runtime' is not defined

## **Scrapping Age Rating**

In [None]:
# Write Your Code here
# Age rating
movie_agerating=[]
for url in movie_url:
  try:
    content=requests.get(url,headers=headers)
    soup=BeautifulSoup(content.text,'html.parser')
    for x in soup.find_all('div',class_='detail-infos'):
      if x.find_all('h3')[0].text=='Age rating':
        age_rating=x.find_all('div')[0].text
  except:
    runtime='NA'
movie_agerating.append(age_rating)
print(movie_agerating)



## **Fetching Production Countries Details**

In [None]:
# Write Your Code here
movie_country=[]
for url in movie_url:
  try:
    content=requests.get(url,headers=headers)
    soup=BeautifulSoup(content.text,'html.parser')
    for x in soup.find_all('div',class_='detail-infos'):
      if x.find_all('h3')[0].text==' Production country ':
        country=x.find_all('div')[0].text
except:
runtime='NA'
movie_country.append(country)
print(movie_country)



## **Fetching Streaming Service Details**

In [None]:
# Write Your Code here
movie_stream_service=[]
for url in movie_url:
  try:
    content=requests.get(url,headers=headers)
    soup=BeautifulSoup(content.text,'html.parser')
    names=[x['alt'] for x in soup.find_all('img',class_='offer__icon')]
  except:
    names='NA'
    movie_stream_service.append(" , ".join(names))
print(movie_stream_service)


## **Now Creating Movies DataFrame**

In [None]:
# Write Your Code here
info={'movie_url':movie_url,
      'movie_name':movie_titles,
      'release_year':movie_year,
      'movie_rating':movie_rating,
      'movie_genre':movie_genre,
      'movie_runtime':movie_runtime,
      'movie_agerating':movie_agerating,
      'movie_country':movie_country,
      'movie_stream_service' :movie_stream_service}

data = pd.DataFrame(info)

In [None]:
data

In [None]:
data.to_csv('new_data.csv')


## **Scraping TV  Show Data**

In [None]:
# Specifying the URL from which tv show related data will be fetched
tv_url='https://www.justwatch.com/in/tv-shows?release_year_from=2000'
# Sending an HTTP GET request to the URL
page=requests.get(tv_url)
# Parsing the HTML content using BeautifulSoup with the 'html.parser'
soup=BeautifulSoup(page.text,'html.parser')
# Printing the prettified HTML content
print(soup.prettify())

## **Fetching Tv shows Url details**

In [None]:
# Write Your Code here


## **Fetching Tv Show Title details**

In [None]:
# Write Your Code here


## **Fetching Release Year**

In [None]:
# Write Your Code here


## **Fetching TV Show Genre Details**

In [None]:
# Write Your Code here


## **Fetching IMDB Rating Details**

In [None]:
# Write Your Code here


## **Fetching Age Rating Details**

In [None]:
# Write Your Code here


## **Fetching Production Country details**

In [None]:
# Write Your Code here


## **Fetching Streaming Service details**

In [None]:
# Write Your Code here


## **Fetching Duration Details**

In [None]:
# Write Your Code here


## **Creating TV Show DataFrame**

In [None]:
# Write Your Code here


## **Task 2 :- Data Filtering & Analysis**

In [None]:
# Write Your Code here


## **Calculating Mean IMDB Ratings for both Movies and Tv Shows**

In [None]:
# Write Your Code here


## **Analyzing Top Genres**

In [None]:
# Write Your Code here


In [None]:
#Let's Visvalize it using word cloud


## **Finding Predominant Streaming Service**

In [None]:
# Write Your Code here


In [None]:
#Let's Visvalize it using word cloud


## **Task 3 :- Data Export**

In [None]:
#saving final dataframe as Final Data in csv format


In [None]:
#saving filter data as Filter Data in csv format


# **Dataset Drive Link (View Access with Anyone) -**

# ***Congratulations!!! You have completed your Assignment.***