# Convert CSV to SQL Database: IMDB Top 1000 Movies

<div class="alert alert-block alert-warning">
    <b>WARNING</b> - Do not run this code as the database is already constructed! <br>
    This document is for reference purposes only. <br>
    The cursor object is commented out to avoid accidentally overwriting the database.
</div>

In [1]:
# Importing dependencies for this project.
import sqlite3
import pandas as pd

## Objective
This notebook is a quick reference example for taking CSV data and building a basic SQLite database.  
It converts the dataset provided on Kaggle<sup>1</sup> to an SQL database.  
Link: __[Original Kaggle Dataset](https://www.kaggle.com/datasets/harshitshankhdhar/imdb-dataset-of-top-1000-movies-and-tv-shows)__ <br>

## Contents
1. [Database Structure](#Database-Structure)
2. [Database Constructor](#Database-Constructor)
3. [Reading Raw CSV Data](#Reading-Raw-CSV-Data)
4. [Inserting Data into Table](#Inserting-Data-into-Table)
5. [Validating Data Insertion](#Validating-Data-Insertion)
6. [Close Connection to the Database](#Close-Connection-to-the-Database)
7. [References](#References)

## Database Structure
1. Database called movies.db containing <b>1</b> (one) table.
2. Table called movies_data containing <b>8</b> (eight) attributes:
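    - <b>id</b> (Primary Key; supplied by SQLite's implicit <code>rowid</code>, since the CREATE TABLE statement below declares no id column)
    - <b>title</b> [string]
    - <b>release_year</b> [string, yyyy]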
    - <b>certificate</b> [string]
    - <b>runtime</b> [int]
    - <b>imdb_rating</b> [float]
    - <b>num_votes</b> [int]
    - <b>gross</b> [float]
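
The structure above lists an id primary key, while the table in this notebook relies on SQLite's implicit rowid. If an explicit key is preferred, one option is to declare `id INTEGER PRIMARY KEY`, which SQLite treats as an alias for the rowid. The sketch below is an illustration under that assumption, not the schema this notebook actually creates, and it uses an in-memory database so it cannot touch `data/movies.db`.

```python
import sqlite3

# In-memory database so the sketch never touches data/movies.db.
connection = sqlite3.connect(":memory:")
connection.execute(
    """
    CREATE TABLE movies_data (
        id INTEGER PRIMARY KEY,  -- alias for SQLite's rowid
        title TEXT NOT NULL,
        release_year TEXT,
        certificate TEXT,
        runtime INTEGER,
        imdb_rating REAL NOT NULL,
        num_votes INTEGER,
        gross REAL
    );
    """
)

# PRAGMA table_info returns one row per column; index 1 holds the column name.
columns = [row[1] for row in connection.execute("PRAGMA table_info(movies_data)")]
print(columns)
connection.close()
```

With an explicit id column, an insert has to name the seven data columns (or pass NULL for id) so that SQLite assigns the key automatically.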

## Database Constructor

In [2]:
# Connecting to the database; SQLite creates the file if it does not exist yet.
connection = sqlite3.connect("data/movies.db")
# minion = connection.cursor()

In [3]:
# Creating the table.
SQL_command = """
    CREATE TABLE
        movies_data
        (
            title TEXT NOT NULL,
            release_year TEXT,
            certificate TEXT,
            runtime INTEGER,
            imdb_rating REAL NOT NULL,
            num_votes INTEGER,
            gross REAL
        );
"""
minion.execute(SQL_command)

<sqlite3.Cursor at 0x229f7a4c040>

In [4]:
# Checking that the table was created.
tables = minion.execute("SELECT name FROM sqlite_master")
tables.fetchall()

[('movies_data',)]
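
The check above lists every name in `sqlite_master`, which would also include indexes or views if any existed. A narrower check could filter on `type = 'table'` and on the table name. In this sketch, `table_exists` is a hypothetical helper name, and `CREATE TABLE IF NOT EXISTS` is shown as one way to make the creation step safe to re-run.

```python
import sqlite3


def table_exists(connection, table_name):
    """Return True if a table with this exact name is listed in sqlite_master."""
    row = connection.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?",
        (table_name,),
    ).fetchone()
    return row is not None


# In-memory database so the sketch never touches data/movies.db.
connection = sqlite3.connect(":memory:")
create_sql = "CREATE TABLE IF NOT EXISTS movies_data (title TEXT NOT NULL)"
connection.execute(create_sql)
connection.execute(create_sql)  # second run is a no-op instead of an error

found = table_exists(connection, "movies_data")
missing = table_exists(connection, "no_such_table")
print(found, missing)
connection.close()
```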

## Reading Raw CSV Data

In [5]:
# Preview the first three rows of the raw CSV to inspect its columns.
data = pd.read_csv("data/imdb_top_1000.csv", nrows=3)
print(data)

                                         Poster_Link  \
0  https://m.media-amazon.com/images/M/MV5BMDFkYT...   
1  https://m.media-amazon.com/images/M/MV5BM2MyNj...   
2  https://m.media-amazon.com/images/M/MV5BMTMxNT...   

               Series_Title  Released_Year Certificate  Runtime  \
0  The Shawshank Redemption           1994           A  142 min   
1             The Godfather           1972           A  175 min   
2           The Dark Knight           2008          UA  152 min   

                  Genre  IMDB_Rating  \
0                 Drama          9.3   
1          Crime, Drama          9.2   
2  Action, Crime, Drama          9.0   

                                            Overview  Meta_score  \
0  Two imprisoned men bond over a number of years...          80   
1  An organized crime dynasty's aging patriarch t...         100   
2  When the menace known as the Joker wreaks havo...          84   

               Director           Star1           Star2          Star3  

## Inserting Data into Table

In [6]:
SQL_command = """
    INSERT INTO
        movies_data
    VALUES
        (?, ?, ?, ?, ?, ?, ?);
"""

# Read the CSV in chunks of 100 rows and insert each chunk as one batch.
chunksize = 100
for chunk in pd.read_csv("data/imdb_top_1000.csv", chunksize=chunksize):
    records = []
    for _, row in chunk.iterrows():
        # Tuple order must match the column order of movies_data.
        record = (
            row['Series_Title'],
            row['Released_Year'],
            row['Certificate'],
            int(row['Runtime'].split()[0]),  # '142 min' -> 142
            row['IMDB_Rating'],
            row['No_of_Votes'],
            row['Gross']
        )
        records.append(record)
    # Insert the whole chunk, then commit so each batch is persisted.
    minion.executemany(SQL_command, records)
    connection.commit()
print('Database Update Complete!')

Database Update Complete!
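
The insertion loop converts the runtime text inline and passes the Gross value through unchanged. If Gross arrives as text with thousands separators (an assumption, since the preview above does not display that column), it could be cleaned before it reaches the REAL column. The helper names `parse_runtime` and `parse_gross` are hypothetical, and the sample values are illustrative only.

```python
import math


def parse_runtime(value):
    """Turn a runtime string such as '142 min' into the integer 142."""
    return int(str(value).split()[0])


def parse_gross(value):
    """Turn '1,234,567' into 1234567.0; missing values become None (SQL NULL)."""
    # pandas reports missing cells as float NaN, so treat NaN like None.
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return None
    return float(str(value).replace(",", ""))


# Illustrative inputs, not figures taken from the dataset.
print(parse_runtime("142 min"))
print(parse_gross("1,234,567"))
print(parse_gross(float("nan")))
```

Inside the loop, these would replace the inline expressions, for example `parse_gross(row['Gross'])` in place of `row['Gross']`.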


## Validating Data Insertion

In [7]:
# Check the number of records in the table.
SQL_command = """
    SELECT
        COUNT(*)
    FROM
        movies_data;
"""
minion.execute(SQL_command)
print(minion.fetchall())

[(1000,)]


In [8]:
# Check the first 9 titles against the CSV file.
SQL_command = """
    SELECT
        rowid,
        title
    FROM
        movies_data
    WHERE
        rowid < 10;
"""
minion.execute(SQL_command)
data = pd.read_csv("data/imdb_top_1000.csv", nrows=9)
print('         SQL DATA        |         CSV FILE         ')
# Print each SQL row next to the CSV title at the same position.
for i in range(9):
    sql_data = minion.fetchone()
    print(f'{sql_data[0]}, {sql_data[1][:20]} {data["Series_Title"][i][:20]:>24}')

         SQL DATA        |         CSV FILE         
1, The Shawshank Redemp     The Shawshank Redemp
2, The Godfather            The Godfather
3, The Dark Knight          The Dark Knight
4, The Godfather: Part      The Godfather: Part 
5, 12 Angry Men             12 Angry Men
6, The Lord of the Ring     The Lord of the Ring
7, Pulp Fiction             Pulp Fiction
8, Schindler's List         Schindler's List
9, Inception                Inception
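
The comparison above is done by eye on the first nine titles. A programmatic variant could compare every title in insertion order and report mismatches. This is a minimal sketch on an in-memory table, with a short illustrative list standing in for the CSV's `Series_Title` column.

```python
import sqlite3

# Illustrative titles standing in for the CSV's Series_Title column.
csv_titles = ["The Shawshank Redemption", "The Godfather", "The Dark Knight"]

connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE movies_data (title TEXT NOT NULL)")
with connection:  # commits on success, rolls back if an exception is raised
    connection.executemany(
        "INSERT INTO movies_data (title) VALUES (?)",
        [(title,) for title in csv_titles],
    )

# ORDER BY rowid returns rows in insertion order for this freshly filled table.
sql_titles = [
    row[0]
    for row in connection.execute("SELECT title FROM movies_data ORDER BY rowid")
]

# Collect (position, sql_title, csv_title) for every pair that differs.
mismatches = [
    (position, sql_title, csv_title)
    for position, (sql_title, csv_title) in enumerate(zip(sql_titles, csv_titles), start=1)
    if sql_title != csv_title
]
print(len(sql_titles) == len(csv_titles), mismatches)
connection.close()
```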


## Close Connection to the Database

In [9]:
connection.close()
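
Calling `close()` in a final cell works as long as every earlier cell runs without error. In a script, `contextlib.closing` is one way to ensure the connection is closed even when an exception occurs. The following is a sketch on an in-memory database, not part of the original workflow.

```python
import sqlite3
from contextlib import closing

# closing() calls connection.close() when the block exits, even on error.
with closing(sqlite3.connect(":memory:")) as connection:
    connection.execute("CREATE TABLE movies_data (title TEXT NOT NULL)")
    connection.execute("INSERT INTO movies_data VALUES (?)", ("Inception",))
    connection.commit()
    row_count = connection.execute("SELECT COUNT(*) FROM movies_data").fetchone()[0]

# Using a closed connection raises sqlite3.ProgrammingError.
try:
    connection.execute("SELECT 1")
    closed = False
except sqlite3.ProgrammingError:
    closed = True

print(row_count, closed)
```

Note that using the connection itself as a context manager (`with connection:`) only manages the transaction and does not close the connection.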

## References
1. The dataset was prepared by Harshit Shankhdhar from information available on IMDB for its 1000 top-rated movies. Harshit Shankhdhar (2020). <i>IMDB Movies Dataset</i> [Data set]. Kaggle. https://www.kaggle.com/datasets/harshitshankhdhar/imdb-dataset-of-top-1000-movies-and-tv-shows