
## Loading the dataset and checking the column values for minor errors that may need cleaning

In [2]:
# Importing Pandas and sqlite3 libraries

import pandas as pd
import sqlite3

from IPython.display import display, HTML

In [3]:

# connecting with database
conn = sqlite3.connect('Db-IMDB-Assignment.db')


* __Getting the tables from the IMDB database__

In [7]:

query = """
SELECT NAME FROM sqlite_master WHERE type='table';"""

pd.read_sql_query(query, conn)

Unnamed: 0,name
0,Movie
1,Genre
2,Language
3,Country
4,Location
5,M_Location
6,M_Country
7,M_Language
8,M_Genre
9,Person


* __Getting information about the 'Movie' table__

In [8]:
query ="""
PRAGMA TABLE_INFO (Movie) ;"""

pd.read_sql_query(query, conn)

Unnamed: 0,cid,name,type,notnull,dflt_value,pk
0,0,index,INTEGER,0,,0
1,1,MID,TEXT,0,,0
2,2,title,TEXT,0,,0
3,3,year,TEXT,0,,0
4,4,rating,REAL,0,,0
5,5,num_votes,INTEGER,0,,0



### Checking for discrepancies in year values 

In [9]:
query ="""
SELECT DISTINCT year FROM Movie ;"""

pd.read_sql_query(query, conn)

Unnamed: 0,year
0,2018
1,2012
2,2016
3,2017
4,2008
...,...
120,IV 2017
121,1943
122,1950
123,I 1969


* __Some year values carry Roman numeral prefixes (for example `IV 2017`, `I 1969`) and stray white space. These need to be cleaned.__

* __We will also trim leading and trailing white space from the ID columns (MID, PID, GID). The IDs are the join keys between tables, so they must match exactly.__
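To make the join problem concrete, here is a minimal sketch on a toy in-memory database rather than the real `Db-IMDB-Assignment.db`. The helper `count_untrimmed` and the sample rows are hypothetical and only illustrate how a padded ID silently drops out of a join until it is trimmed.

```python
import sqlite3

def count_untrimmed(conn, table, column):
    """Return how many rows have leading/trailing white space in `column`."""
    # Identifiers cannot be bound as parameters, so they are interpolated here;
    # only pass trusted table/column names.
    sql = f'SELECT COUNT(*) FROM "{table}" WHERE "{column}" <> TRIM("{column}")'
    return conn.execute(sql).fetchone()[0]

# Toy data standing in for the real Movie / M_Genre tables (assumed sample rows)
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE Movie (MID TEXT, title TEXT)")
demo.execute("CREATE TABLE M_Genre (MID TEXT, GID TEXT)")
demo.executemany("INSERT INTO Movie VALUES (?, ?)", [("tt001", "A"), ("tt002", "B")])
# Note the trailing space in the first MID
demo.executemany("INSERT INTO M_Genre VALUES (?, ?)", [("tt001 ", "1"), ("tt002", "2")])

untrimmed = count_untrimmed(demo, "M_Genre", "MID")

join_sql = "SELECT COUNT(*) FROM Movie m JOIN M_Genre g ON m.MID = g.MID"
matched_before = demo.execute(join_sql).fetchone()[0]

# Trimming the key restores the missing match
demo.execute("UPDATE M_Genre SET MID = TRIM(MID)")
matched_after = demo.execute(join_sql).fetchone()[0]

print(untrimmed, matched_before, matched_after)
```

Running the same count against each ID column of the real tables before cleaning would show where trimming is actually needed.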



**NB:** *I ran into many issues when I tried to analyze this data before cleaning it. Don't skip this cleaning step; it saves a lot of time!*

**Dependencies**
- pip install ipython-sql


## Data Cleaning

In [22]:
#loading SQL module
%load_ext sql

#connect to the database
%sql sqlite:///Db-IMDB-Assignment.db

In [23]:

%%time
%%sql

UPDATE Movie SET year = REPLACE(year, 'I', '');
UPDATE Movie SET year = REPLACE(year, 'V', '');
UPDATE Movie SET year = REPLACE(year, 'X', '');
UPDATE Movie SET year = TRIM(year);
UPDATE Movie SET title = TRIM(title);
UPDATE Movie SET MID = TRIM(MID);

UPDATE M_Producer SET PID = TRIM(PID);
UPDATE M_Producer SET MID = TRIM(MID);

UPDATE M_Director SET PID = TRIM(PID);
UPDATE M_Director SET MID = TRIM(MID);

UPDATE M_Cast SET PID = TRIM(PID);
UPDATE M_Cast SET MID = TRIM(MID);

UPDATE M_Genre SET GID = TRIM(GID);
UPDATE M_Genre SET MID = TRIM(MID);

UPDATE Genre SET GID = TRIM(GID);
UPDATE Genre SET Name = TRIM(Name);

UPDATE Person SET Name = TRIM(Name);
UPDATE Person SET PID = TRIM(PID);
UPDATE Person SET Gender = TRIM(Gender);



 * sqlite:///Db-IMDB-Assignment.db
3473 rows affected.
3473 rows affected.
3473 rows affected.
3473 rows affected.
3473 rows affected.
3473 rows affected.
11749 rows affected.
11749 rows affected.
3473 rows affected.
3473 rows affected.
82835 rows affected.
82835 rows affected.
3473 rows affected.
3473 rows affected.
328 rows affected.
328 rows affected.
37566 rows affected.
37566 rows affected.
37566 rows affected.
Wall time: 4.76 s


[]
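The cleaning cell depends on the `ipython-sql` magics. As a rough alternative, the same kind of statements can be run through the standard-library `sqlite3` module with `executescript`. The sketch below uses a toy in-memory `Movie` table with made-up rows; `CLEANUP_SQL` and `demo` are hypothetical names, and only the `Movie` statements are shown.

```python
import sqlite3

# The Movie cleanup statements from the cell above, with single-quoted literals
CLEANUP_SQL = """
UPDATE Movie SET year = REPLACE(year, 'I', '');
UPDATE Movie SET year = REPLACE(year, 'V', '');
UPDATE Movie SET year = REPLACE(year, 'X', '');
UPDATE Movie SET year = TRIM(year);
UPDATE Movie SET title = TRIM(title);
UPDATE Movie SET MID = TRIM(MID);
"""

# Toy in-memory table with assumed sample rows, not the real database
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE Movie (MID TEXT, title TEXT, year TEXT)")
demo.executemany(
    "INSERT INTO Movie VALUES (?, ?, ?)",
    [
        (" tt01", "First ", "IV 2017"),
        ("tt02 ", " Second", "I 1969"),
        ("tt03", "Third", "2018"),
    ],
)

# executescript runs several semicolon-separated statements in one call
demo.executescript(CLEANUP_SQL)
demo.commit()

cleaned = demo.execute("SELECT MID, title, year FROM Movie ORDER BY MID").fetchall()
print(cleaned)
```

With the real database, `demo` would be replaced by the connection to `Db-IMDB-Assignment.db`, and the remaining `UPDATE` statements would be appended to the script.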

* __Let's check whether the discrepancies in the year values were corrected.__

In [24]:
query ="""
SELECT DISTINCT year FROM Movie ;"""

pd.read_sql_query(query, conn)

Unnamed: 0,year
0,2018
1,2012
2,2016
3,2017
4,2008
...,...
73,1947
74,1936
75,1946
76,1943


- **As we can see, the unwanted Roman numerals and white space have been removed: the number of distinct year values drops from 124 to 77. The ID and name columns in the other tables were trimmed by the same batch of updates.**
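Because the DataFrame output above is truncated, eyeballing it cannot confirm that every value is clean. One possible check, sketched here on a toy in-memory table, is to list any year that is not exactly four digits using SQLite's `GLOB` operator. The helper `malformed_years` and the sample rows are hypothetical.

```python
import sqlite3

def malformed_years(conn):
    """Return distinct year values that are not exactly four digits."""
    sql = """
        SELECT DISTINCT year FROM Movie
        WHERE year NOT GLOB '[0-9][0-9][0-9][0-9]'
        ORDER BY year
    """
    return [row[0] for row in conn.execute(sql)]

# Toy table with assumed sample values, not the real database
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE Movie (year TEXT)")
demo.executemany("INSERT INTO Movie VALUES (?)", [("2018",), ("IV 2017",), (" 1969",)])

bad_before = malformed_years(demo)

# Apply the same year cleanup used in the notebook
for numeral in ("I", "V", "X"):
    demo.execute("UPDATE Movie SET year = REPLACE(year, ?, '')", (numeral,))
demo.execute("UPDATE Movie SET year = TRIM(year)")

bad_after = malformed_years(demo)
print(bad_before, bad_after)
```

An empty list after cleaning means every non-NULL year matches the four-digit pattern; rows with a NULL year are not reported by this query.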