## <span style="color:#FF0000">Part 1</span>


1. # Load the data into your project using pandas:

- Open a Python environment or **Jupyter Notebook**.
- Import the pandas library: **import pandas as pd**.
- Use the copied link addresses to **read each file into a DataFrame**:

In [1]:
# Imports
import pandas as pd
import numpy as np

In [2]:
basics_url = "https://datasets.imdbws.com/title.basics.tsv.gz"
akas_url = "https://datasets.imdbws.com/title.akas.tsv.gz"
ratings_url = "https://datasets.imdbws.com/title.ratings.tsv.gz"

basics = pd.read_csv(basics_url, sep='\t', low_memory=False)
akas = pd.read_csv(akas_url, sep='\t', low_memory=False)
ratings = pd.read_csv(ratings_url, sep='\t', low_memory=False)
print("CSV Readings complete")

CSV Readings complete


In [3]:
# Example
basics.head()

Unnamed: 0,tconst,titleType,primaryTitle,originalTitle,isAdult,startYear,endYear,runtimeMinutes,genres
0,tt0000001,short,Carmencita,Carmencita,0,1894,\N,1,"Documentary,Short"
1,tt0000002,short,Le clown et ses chiens,Le clown et ses chiens,0,1892,\N,5,"Animation,Short"
2,tt0000003,short,Pauvre Pierrot,Pauvre Pierrot,0,1892,\N,4,"Animation,Comedy,Romance"
3,tt0000004,short,Un bon bock,Un bon bock,0,1892,\N,12,"Animation,Short"
4,tt0000005,short,Blacksmith Scene,Blacksmith Scene,0,1893,\N,1,"Comedy,Short"


2. # Perform data preprocessing and filtering:

- Replace "\N" (IMDb's placeholder for missing values) with np.nan in each DataFrame: basics = basics.replace({'\\N': np.nan}), and likewise for akas and ratings.
- Filter the basics DataFrame: drop rows with a missing runtimeMinutes or genres, keep only titleType 'movie', keep only startYear from 2000 through 2021, and exclude documentaries:

In [4]:
basics = basics.replace({'\\N': np.nan})
akas = akas.replace({'\\N': np.nan})
ratings = ratings.replace({'\\N': np.nan})

print("NaNs were replaced successfully")

NaNs were replaced successfully
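
As an alternative to replacing the placeholder after loading, pandas can treat "\N" as missing while reading, through the na_values parameter of read_csv. The following is a minimal sketch on a small in-memory sample; the sample rows and the names sample_tsv, sample and missing_counts are made up for illustration and are not part of the IMDb files.

```python
import io

import pandas as pd

# Made-up two-row sample in the same tab-separated layout as the IMDb files
sample_tsv = (
    "tconst\tstartYear\tendYear\n"
    "tt0000001\t1894\t\\N\n"
    "tt0000002\t\\N\t\\N\n"
)

# na_values tells read_csv to parse the literal string \N as NaN
sample = pd.read_csv(io.StringIO(sample_tsv), sep="\t", na_values="\\N")

# Count the missing values per column to confirm the placeholder was converted
missing_counts = sample.isna().sum()
print(missing_counts)
```

With this approach the numeric columns are parsed as numbers right away, so a later pd.to_numeric call may become unnecessary.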


In [5]:
basics = basics.dropna(subset=['runtimeMinutes', 'genres'])
basics = basics[basics['titleType'] == 'movie']
basics['startYear'] = pd.to_numeric(basics['startYear'], errors='coerce') # Convert 'startYear' to numeric type
basics = basics[(basics['startYear'] >= 2000) & (basics['startYear'] <= 2021)]
basics = basics[~basics['genres'].str.contains('documentary', case=False)]
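
To see what each filter removes, the same steps can be run on a handful of rows. This is only a sketch: the toy rows and the helper name filter_basics are hypothetical, chosen so that each rule drops exactly one row.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: each one except tt1 violates exactly one filter rule
toy = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3", "tt4", "tt5"],
    "titleType": ["movie", "short", "movie", "movie", "movie"],
    "startYear": ["2005", "2010", "1999", "2015", "2020"],
    "runtimeMinutes": ["100", "10", "90", "80", np.nan],
    "genres": ["Drama", "Comedy", "Drama", "Documentary,History", "Action"],
})

def filter_basics(df):
    """Apply the same filters as the cell above to a copy of df."""
    out = df.dropna(subset=["runtimeMinutes", "genres"]).copy()
    out = out[out["titleType"] == "movie"].copy()
    out["startYear"] = pd.to_numeric(out["startYear"], errors="coerce")
    out = out[(out["startYear"] >= 2000) & (out["startYear"] <= 2021)]
    out = out[~out["genres"].str.contains("documentary", case=False)]
    return out

filtered = filter_basics(toy)
print(filtered["tconst"].tolist())
```

Only tt1 survives: tt5 has no runtime, tt2 is a short, tt3 is from 1999, and tt4 is a documentary.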

### Filter the basics DataFrame to include only US movies based on the akas DataFrame:

In [6]:
keepers = basics['tconst'].isin(akas[akas['region'] == 'US']['titleId'])
basics = basics[keepers]
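
The isin filter above keeps a title as soon as at least one of its akas rows has region 'US', even if the title also has rows for other regions. A small sketch with made-up IDs (basics_toy, akas_toy and us_ids are illustrative names) shows the behavior:

```python
import pandas as pd

# Made-up titles and alternate-title rows
basics_toy = pd.DataFrame({"tconst": ["tt1", "tt2", "tt3"]})
akas_toy = pd.DataFrame({
    "titleId": ["tt1", "tt1", "tt2", "tt3"],
    "region": ["US", "GB", "FR", "US"],
})

# IDs of titles that have at least one US entry
us_ids = akas_toy.loc[akas_toy["region"] == "US", "titleId"]

# Boolean mask over basics_toy: True where the title has a US entry
keepers = basics_toy["tconst"].isin(us_ids)
us_basics = basics_toy[keepers]
print(us_basics["tconst"].tolist())
```

Here tt1 is kept despite its extra GB row, and tt2 is dropped because it only has an FR row.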

3. # Save the filtered DataFrames to compressed CSV files:

- Create a "Data" folder within your repository if it doesn't already exist.
- Use the to_csv method to save each DataFrame with compression:

In [7]:
basics.to_csv("Data/title_basics.csv.gz", compression='gzip', index=False)
akas.to_csv("Data/title_akas.csv.gz", compression='gzip', index=False)
ratings.to_csv("Data/title_ratings.csv.gz", compression='gzip', index=False)

In [8]:
print("Check \"Data\" Folder")

Check "Data" Folder
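
The to_csv calls above fail if the "Data" folder is missing, so the folder can be created from code first. The sketch below assumes a hypothetical helper named save_compressed and writes to a temporary directory, so it does not touch the repository; in the notebook the folder argument would simply be "Data".

```python
import os
import tempfile

import pandas as pd

def save_compressed(df, folder, filename):
    """Create folder if needed, then write df as a gzip-compressed CSV."""
    os.makedirs(folder, exist_ok=True)  # no error if the folder already exists
    path = os.path.join(folder, filename)
    df.to_csv(path, compression="gzip", index=False)
    return path

# Made-up one-row frame used only to demonstrate the round trip
demo = pd.DataFrame({"tconst": ["tt0000001"], "averageRating": [5.7]})

with tempfile.TemporaryDirectory() as tmp:
    path = save_compressed(demo, os.path.join(tmp, "Data"), "demo.csv.gz")
    # read_csv infers gzip compression from the .gz extension
    reloaded = pd.read_csv(path)

print(reloaded)
```

Reading the file back is a quick way to confirm that the saved copy has the expected rows and columns.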


## <span style="color:#ffa500">Part 2</span>


# More Data

- **install:** tmdbsimple

In [10]:
!pip install tmdbsimple

Collecting tmdbsimple
  Downloading tmdbsimple-2.9.1-py3-none-any.whl (38 kB)
Installing collected packages: tmdbsimple
Successfully installed tmdbsimple-2.9.1


In [17]:
import requests
import tmdbsimple as tmdb
import pandas as pd

# Set your TMDB API key (replace the placeholder; do not commit a real key)
tmdb.API_KEY = 'YOUR_TMDB_API_KEY'

# Function to get a movie's US certification (MPAA rating) from the TMDB API
def get_movie_certification(movie_id):
    try:
        movie = tmdb.Movies(movie_id)
        release_dates = movie.release_dates()['results']
        certification = None
        for result in release_dates:
            # Only the US entry carries the MPAA rating
            if result.get('iso_3166_1') == 'US':
                certification = result['release_dates'][0]['certification']
                break
        return certification
    except requests.exceptions.HTTPError as e:
        # tmdbsimple uses requests, so a failed API call surfaces as an HTTPError
        print(f"Error retrieving data for movie ID: {movie_id}")
        print(str(e))
        return None

# Test the get_movie_certification() function
avengers_certification = get_movie_certification("tt0848228")
notebook_certification = get_movie_certification("tt0332280")

print("The Avengers Certification:", avengers_certification)
print("The Notebook Certification:", notebook_certification)


The Avengers Certification: PG-13
The Notebook Certification: PG-13
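
The function above reads the certification from the first US release entry only. Assuming a US entry can list several releases and that some of them carry an empty certification string, a more defensive variant would scan for the first non-empty value. The sketch below is one way to do that; pick_us_certification is a hypothetical helper, and sample_results is a made-up payload shaped like the 'results' list used above, so no API call is made.

```python
def pick_us_certification(results):
    """Return the first non-empty US certification, or None if there is none."""
    for country in results:
        if country.get("iso_3166_1") != "US":
            continue
        for release in country.get("release_dates", []):
            certification = release.get("certification")
            if certification:  # skip empty strings and missing values
                return certification
    return None

# Made-up payload: the first US release has an empty certification
sample_results = [
    {"iso_3166_1": "GB", "release_dates": [{"certification": "12A"}]},
    {"iso_3166_1": "US", "release_dates": [{"certification": ""},
                                           {"certification": "PG-13"}]},
]

us_certification = pick_us_certification(sample_results)
no_us = pick_us_certification(
    [{"iso_3166_1": "FR", "release_dates": [{"certification": "U"}]}]
)
print(us_certification, no_us)
```

Keeping the selection logic in a pure function like this also makes it testable without network access or an API key.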
