# **Introduction to Data Science Final Project**

**Student Information:**

StudentID|Full Name
-|-
21127012|Tran Huy Ban
21127050|Tran Nguyen Huan 
21127143|Nguyen Minh Quan 
21127175|Le Anh Thu


## **Table of contents**

[Overview](#overview)

1. [Data Collection](#collect)

    b. [Collecting data](#collecting)

2. [Data Pre-processing](#process)

    a. [Pre-processing](#preprocess)

    b. [Exploration](#exploration)

3. [Data Modeling](#modeling)

4. [Deploy Model](#deploy)

[References](#references)

## **Overview** <a name="overview"></a>

<center>
<h3>
    <b>
    Movie Recommendations: Explore a World of Cinematic Brilliance
    </b>
</h3>
    <img style="padding:10px" src="https://beebom.com/wp-content/uploads/2019/08/netflix-family-movies-featured.jpg?w=750&quality=75" width="800"/>
</center>

Looking for your next movie night delight? Our recommendation engine, powered by [The Movie Database (TMDB) API](https://developer.themoviedb.org/docs), guides you through a curated list of films with captivating overviews and a lasting impact on audiences. Here is why these movies should be on your must-watch list:

- Each recommended movie comes with a unique and compelling overview that provides a glimpse into the storyline. From heartwarming dramas to spine-chilling thrillers, our selection covers a diverse range of genres. Whether you're in the mood for a gripping narrative or a light-hearted adventure, our recommendations offer intriguing synopses to help you make the perfect choice.

- Beyond just entertaining, these movies have left a lasting impact on audiences worldwide. They have resonated with viewers, sparking discussions and leaving a mark on the world of cinema. Prepare to embark on an unforgettable journey as you explore films that have not only earned critical acclaim but have also contributed to the cultural tapestry of the film industry.

### **Necessary Libraries and Key**

In [7]:
import requests
import csv
import pandas as pd
import os

In [8]:
api_key = '15d786cc910b647049be3fc40ce9f3a2'
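
A key written directly into a notebook is visible to anyone who can read the file. One common alternative is to read it from the environment at run time. The sketch below assumes an environment variable named `TMDB_API_KEY`; both that name and the helper `load_api_key` are hypothetical and not part of the original notebook.

```python
import os


def load_api_key(env_var: str = "TMDB_API_KEY") -> str:
    """Return the API key stored in the given environment variable.

    The variable name is an assumption made for this sketch.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable before running the notebook."
        )
    return key
```

With this helper, the assignment in the cell above could become `api_key = load_api_key()`.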

## **1. Data Collection** <a name="collect"></a>

Download the top-rated movie dataset from the TMDB API and save it as a CSV file.

In [15]:
base_url = 'https://api.themoviedb.org/3/'
movie_endpoint = 'movie/top_rated'

page = 1
total_pages = 50  # upper bound on the number of pages to request

movies_data = []
header = []

while page <= total_pages:
    discover_url = f'{base_url}{movie_endpoint}?api_key={api_key}&page={page}'
    discover_response = requests.get(discover_url, timeout=30)

    if discover_response.status_code == 200:
        discover_data = discover_response.json()

        # Never request more pages than the API reports as available
        total_pages = min(total_pages, discover_data['total_pages'])

        if page == 1:
            # Use the keys of the first result as the CSV header row
            header = list(discover_data['results'][0].keys())
            movies_data.append(header)

        for movie in discover_data['results']:
            # .get() leaves a field empty instead of failing if a key is missing
            movies_data.append([movie.get(field) for field in header])

        page += 1

    else:
        print(f"Error: Failed to retrieve discovered movies. Status Code: {discover_response.status_code}")
        break

# Make sure the output folder exists before writing the CSV file
os.makedirs('Data', exist_ok=True)
with open('Data/movies.csv', 'w', newline='', encoding='utf-8') as movies_file:
    movies_writer = csv.writer(movies_file)
    movies_writer.writerows(movies_data)
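
The paging loop can also be written as a small function that receives the page fetcher as an argument, so the logic can be exercised without network access. This is a minimal sketch, not the notebook's method: `collect_rows`, `fetch_page`, and `fake_fetch` are hypothetical names, and the fake response only mimics the `results` and `total_pages` fields used above.

```python
from typing import Callable, Dict, List


def collect_rows(fetch_page: Callable[[int], Dict], max_pages: int) -> List[List]:
    """Collect a header row plus one row per movie from a paged source."""
    rows: List[List] = []
    header = None
    page = 1
    while page <= max_pages:
        data = fetch_page(page)
        # Stop at the last page the source reports, even if max_pages is larger
        max_pages = min(max_pages, data.get("total_pages", max_pages))
        results = data.get("results", [])
        if not results:
            break
        if header is None:
            header = list(results[0].keys())
            rows.append(header)
        for movie in results:
            rows.append([movie.get(field) for field in header])
        page += 1
    return rows


def fake_fetch(page: int) -> Dict:
    """Stand-in for the HTTP call: two pages with one toy movie each."""
    return {"total_pages": 2, "results": [{"id": page, "title": f"Movie {page}"}]}


rows = collect_rows(fake_fetch, max_pages=50)
```

Here `rows` holds the header followed by one row per toy movie, and the loop stops after the second page even though `max_pages` is 50.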

Load the movie dataset into a DataFrame.

In [16]:
movies_df = pd.read_csv('Data/movies.csv')
movies_df.head(2)

| |adult|backdrop_path|genre_ids|id|original_language|original_title|overview|popularity|poster_path|release_date|title|video|vote_average|vote_count|
|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|
|0|False|/tmU7GeKVybMWFButWEGl2M4GeiP.jpg|[18, 80]|238|en|The Godfather|Spanning the years 1945 to 1955, a chronicle o...|142.683|/3bhkrj58Vtu7enYsRolD1fZdja1.jpg|1972-03-14|The Godfather|False|8.709|18971|
|1|False|/kXfqcdQKsToO0OUXHcrrNCHDBzO.jpg|[18, 80]|278|en|The Shawshank Redemption|Framed in the 1940s for the double murder of h...|110.139|/q6y0Go1tsGEsmtFryDOJo3dEmqu.jpg|1994-09-23|The Shawshank Redemption|False|8.705|24958|


We want more information about each movie in this dataset, so we also request its credits (cast and crew) from the API. The code below stores only the cast list of each movie.

In [17]:
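# 'header' is the same list object as movies_data[0], so this also extends the CSV header row.
# Run this cell only once; re-running it would append a second 'credits' column.
header.append('credits')
id_index = header.index('id')

for movie_info in movies_data[1:]:
    movie_id = movie_info[id_index]
    credits_url = f'{base_url}movie/{movie_id}/credits?api_key={api_key}'
    credits_response = requests.get(credits_url, timeout=30)

    if credits_response.status_code == 200:
        credits_data = credits_response.json()
        # Keep only the cast list; the crew list in the response is discarded
        movie_credits = credits_data.get('cast', [])
    else:
        # Fall back to an empty list so every row keeps the same number of columns
        movie_credits = []

    movie_info.append(movie_credits)

with open('Data/movies_with_credits.csv', 'w', newline='', encoding='utf-8') as movies_file:
    movies_writer = csv.writer(movies_file)
    movies_writer.writerows(movies_data)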
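
Because `csv.writer` converts non-string values with `str()`, each cast list is stored in the file as the Python text representation of a list of dictionaries. The sketch below shows how such a cell can be read back with `ast.literal_eval`, and, as an alternative that is not used in the notebook, how the same list could be stored as JSON. The toy cast entry uses only fields visible in the output, and the column names `credits_repr` and `credits_json` are hypothetical.

```python
import ast
import csv
import io
import json

# Toy cast list with the same shape as the API response (fields truncated)
cast = [{'adult': False, 'gender': 2, 'id': 3084}]

# Write one row to an in-memory CSV: once as-is, once serialized as JSON
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(['id', 'credits_repr', 'credits_json'])
writer.writerow([238, cast, json.dumps(cast)])

# Read the row back; both cells arrive as plain strings
buffer.seek(0)
row = next(csv.DictReader(buffer))

from_repr = ast.literal_eval(row['credits_repr'])  # parses the Python repr
from_json = json.loads(row['credits_json'])        # parses the JSON text
```

Both `from_repr` and `from_json` are lists of dictionaries again. JSON has the advantage of being readable from tools other than Python, while `ast.literal_eval` works with the file exactly as the notebook writes it.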

In [32]:
df = pd.read_csv('Data/movies_with_credits.csv')
df.head(2)

| |adult|backdrop_path|genre_ids|id|original_language|original_title|overview|popularity|poster_path|release_date|title|video|vote_average|vote_count|credits|
|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|
|0|False|/tmU7GeKVybMWFButWEGl2M4GeiP.jpg|[18, 80]|238|en|The Godfather|Spanning the years 1945 to 1955, a chronicle o...|142.683|/3bhkrj58Vtu7enYsRolD1fZdja1.jpg|1972-03-14|The Godfather|False|8.709|18971|[{'adult': False, 'gender': 2, 'id': 3084, 'kn...|
|1|False|/kXfqcdQKsToO0OUXHcrrNCHDBzO.jpg|[18, 80]|278|en|The Shawshank Redemption|Framed in the 1940s for the double murder of h...|110.139|/q6y0Go1tsGEsmtFryDOJo3dEmqu.jpg|1994-09-23|The Shawshank Redemption|False|8.705|24958|[{'adult': False, 'gender': 2, 'id': 504, 'kno...|


It looks like we now have enough data to work on our problem.

## **2. Data Pre-processing** <a name="process"></a>

### **a. Pre-processing** <a name="preprocess"></a>

Firstly, we should inspect our data.

In [33]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 15 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   adult              1000 non-null   bool   
 1   backdrop_path      1000 non-null   object 
 2   genre_ids          1000 non-null   object 
 3   id                 1000 non-null   int64  
 4   original_language  1000 non-null   object 
 5   original_title     1000 non-null   object 
 6   overview           1000 non-null   object 
 7   popularity         1000 non-null   float64
 8   poster_path        1000 non-null   object 
 9   release_date       1000 non-null   object 
 10  title              1000 non-null   object 
 11  video              1000 non-null   bool   
 12  vote_average       1000 non-null   float64
 13  vote_count         1000 non-null   int64  
 14  credits            1000 non-null   object 
dtypes: bool(2), float64(2), int64(2), object(9)
memory usage: 103.6+ KB


The output shows that no column contains null values, so no null handling is needed.

Next, we check for duplicate rows in our data:

In [34]:
duplicates = df[df.duplicated()]
print(f"\nNumber of duplicates: {len(duplicates)}")


Number of duplicates: 11


Remove all duplicate rows (`dropna()` is applied as well, although no null values were found):

In [35]:
df = df.drop_duplicates()
df = df.dropna()

In [36]:
duplicates = df[df.duplicated()]
print(f"\nCheck number of duplicates again: {len(duplicates)}")


Check number of duplicates again: 0
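
`df.duplicated()` without arguments flags a row only when every column matches an earlier row. If the same movie were returned twice with a slightly different value in one column, it would not be caught. The sketch below illustrates the difference on a toy frame by checking duplicates on the `id` column only. The toy values and the assumption that `popularity` can differ between two copies of a movie are illustrative, not taken from the collected data.

```python
import pandas as pd

# Toy frame: the first and third rows share an id but differ in popularity
toy = pd.DataFrame({
    "id": [238, 278, 238],
    "title": ["The Godfather", "The Shawshank Redemption", "The Godfather"],
    "popularity": [142.683, 110.139, 140.0],
})

# Rows identical in every column to an earlier row
full_dupes = int(toy.duplicated().sum())

# Rows whose id already appeared in an earlier row
id_dupes = int(toy.duplicated(subset="id").sum())

# Keep the first occurrence of each id and rebuild a clean 0..n-1 index
deduped = toy.drop_duplicates(subset="id", keep="first").reset_index(drop=True)
```

In this toy case the full-row check finds nothing, while the `id`-based check finds one repeated movie and `deduped` keeps two rows.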


### **b. Exploration** <a name="exploration"></a>

#### **Feature Selection**

The collected data has 15 columns, but not all of them are needed for the analysis. We keep only the columns that are relevant to the recommendation task.

In [42]:
# Feature Selection
selected_columns = ['genre_ids', 'id', 'overview', 'popularity', 'release_date', 'title', 'vote_average', 'vote_count', 'credits']
df = df[selected_columns]

In [43]:
df.head(2)

| |genre_ids|id|overview|popularity|release_date|title|vote_average|vote_count|credits|
|-|-|-|-|-|-|-|-|-|-|
|0|[18, 80]|238|Spanning the years 1945 to 1955, a chronicle o...|142.683|1972-03-14|The Godfather|8.709|18971|[{'adult': False, 'gender': 2, 'id': 3084, 'kn...|
|1|[18, 80]|278|Framed in the 1940s for the double murder of h...|110.139|1994-09-23|The Shawshank Redemption|8.705|24958|[{'adult': False, 'gender': 2, 'id': 504, 'kno...|


#### **Correct data type**
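
According to the `df.info()` output, `release_date`, `genre_ids`, and `credits` are all stored with dtype `object`, which means they were read from the CSV as plain strings. A minimal sketch of how these columns could be converted is given below on a toy frame; it assumes dates in `YYYY-MM-DD` form and list columns written as Python literals, as the sample rows suggest. It is one possible approach, not the notebook's final implementation.

```python
import ast

import pandas as pd

# Toy frame mimicking the string columns read from the CSV file
toy = pd.DataFrame({
    "genre_ids": ["[18, 80]", "[18, 80]"],
    "release_date": ["1972-03-14", "1994-09-23"],
    "credits": ["[{'adult': False, 'gender': 2, 'id': 3084}]", "[]"],
})

# Parse the date strings into datetime64 values
toy["release_date"] = pd.to_datetime(toy["release_date"], format="%Y-%m-%d")

# Turn the stringified lists back into real Python lists
for column in ("genre_ids", "credits"):
    toy[column] = toy[column].apply(ast.literal_eval)
```

After the conversion, date accessors such as `toy["release_date"].dt.year` work, and each `genre_ids` cell is a list of integers rather than a string.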

## **3. Data Modeling** <a name="modeling"></a>

## **4. Deploy Model** <a name="deploy"></a>

## **References** <a name="references"></a>