# Methodology of Filtering
## Configuration

### title.basics.tsv.gz

- tconst (string) - alphanumeric unique identifier of the title
- titleType (string) – the type/format of the title (e.g. movie, short, tvseries, tvepisode, video, etc)
- primaryTitle (string) – the more popular title / the title used by the filmmakers on promotional materials at the point of release
- originalTitle (string) - original title, in the original language
- isAdult (boolean) - 0: non-adult title; 1: adult title
- startYear (YYYY) – represents the release year of a title. In the case of TV Series, it is the series start year
- endYear (YYYY) – TV Series end year. '\N' for all other title types
- runtimeMinutes – primary runtime of the title, in minutes
- genres (string array) – includes up to three genres associated with the title

### title.principals.tsv.gz

- tconst (string) - alphanumeric unique identifier of the title
- ordering (integer) – a number to uniquely identify rows for a given title (tconst)
- nconst (string) - alphanumeric unique identifier of the name/person
- category (string) - the category of job that person was in
- job (string) - the specific job title if applicable, else '\N'
- characters (string) - the name of the character played if applicable, else '\N'
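
Both files mark missing values with '\N', so it can help to tell pandas about that marker at read time. The following is a minimal sketch, not the notebook's own loading code: the sample rows are made up, and `SAMPLE_BASICS` and `load_basics_sample` are hypothetical names used only for illustration.

```python
import io
import pandas as pd

# Made-up rows in the title.basics layout: tab-separated, '\N' for missing values.
SAMPLE_BASICS = (
    "tconst\ttitleType\tprimaryTitle\toriginalTitle\tisAdult\t"
    "startYear\tendYear\truntimeMinutes\tgenres\n"
    "tt0000001\tmovie\tExample A\tExample A\t0\t1975\t\\N\t90\tDrama\n"
    "tt0000002\ttvSeries\tExample B\tExample B\t0\t\\N\t\\N\t\\N\tComedy,Drama\n"
)

def load_basics_sample(text):
    """Hypothetical helper: read only the needed columns, treating '\\N' as missing."""
    return pd.read_csv(
        io.StringIO(text),
        sep="\t",
        na_values="\\N",  # parse the '\N' marker as a missing value
        usecols=["tconst", "titleType", "startYear"],  # skip unused columns
        dtype={"tconst": "string", "titleType": "string", "startYear": "Int64"},
    )

basics_sample = load_basics_sample(SAMPLE_BASICS)
print(basics_sample)
```

The nullable `Int64` dtype keeps `startYear` as an integer column even when some years are missing.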

### dataset.csv (Output)
These are the data fields of interest, which we store in a CSV file called dataset.csv.
After filtering, the dataset contains the following fields:

- tconst
- nconst
- category
- startYear

We filter the data on a few criteria.
We do this not for lack of ambition but for practical reasons: to analyse the whole dataset, we would need either a cloud-based solution or GPU-accelerated processing. Both of these options demand time we do not have.
We therefore scope the analysis with the following criteria:

- 1970<=startYear<1990
- titleType=='movie'
- averageRating>=7.0

## Importing IMDb Files: principals, basics and ratings

In [1]:
import pandas as pd

In [2]:
# paths to the principals, basics and ratings files
principals_path = "data/rawdata/principals.tsv"
basics_path = "data/rawdata/basics.tsv"
ratings_path = "data/rawdata/ratings.tsv"

In [3]:
# basics to dataframe only retrieving columns: tconst, startYear and titleType
basics = pd.read_csv(basics_path, sep='\t', low_memory=False)\
    .filter(items=["tconst", "startYear", "titleType"])
# remove \\N values in start year
basics =  basics[(basics['startYear'] != "\\N")]
# convert startYear to datatype int
basics = basics.astype(dtype={"tconst": str,"startYear":int})
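
As an alternative to comparing against the literal '\N' string, the year column could be coerced numerically and the failures dropped. This is only a sketch on a made-up frame (`raw` and `cleaned` are illustrative names), not the cell the notebook actually ran.

```python
import pandas as pd

# Made-up frame with startYear stored as text, including one '\N' marker.
raw = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3"],
    "startYear": ["1975", "\\N", "1989"],
})

# Values that are not numbers become NaN instead of raising an error.
years = pd.to_numeric(raw["startYear"], errors="coerce")

# Drop the rows without a usable year, then store the year as an integer.
cleaned = (
    raw.assign(startYear=years)
    .dropna(subset=["startYear"])
    .astype({"startYear": int})
)
print(cleaned)
```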

In [4]:
# principals to dataframe only keeping columns: tconst, nconst and category
principals = pd.read_csv(principals_path, sep='\t', low_memory=False)\
    .filter(items=["tconst", "nconst", "category"])
# convert tconst, nconst, and category to datatype string
principals = principals.astype(dtype={'tconst':str, 'nconst': str, 'category': str})

In [5]:
# ratings to dataframe only keeping columns: tconst and averageRating
ratings = pd.read_csv(ratings_path, sep='\t', low_memory=False)\
    .filter(items=["tconst", 'averageRating'])
# convert tconst to datatype string and averageRating to datatype float
ratings = ratings.astype(dtype={'tconst':str, 'averageRating': float})

## Filtering on Criteria
(1970<=startYear<1990, titleType=='movie' and averageRating>=7.0)

In [6]:
# filtering basics and ratings where 1970<=startYear<1990, titleType=='movie' and averageRating>=7.0
basics = basics[(basics['startYear'] >= 1970) & (basics['startYear'] < 1990) & (basics['titleType'] == 'movie')]\
    .filter(items=["tconst", "startYear", "titleType"])
ratings = ratings[(ratings['averageRating'] >= 7.0)]
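
The year window is half-open: 1970 is included and 1990 is excluded. A small sketch on made-up rows can confirm the boundary behaviour; `filter_basics` and its parameter names are hypothetical and exist only for this illustration.

```python
import pandas as pd

# Made-up titles chosen to sit on and around the window boundaries.
toy = pd.DataFrame({
    "tconst": ["tt1", "tt2", "tt3", "tt4", "tt5"],
    "startYear": [1969, 1970, 1989, 1990, 1980],
    "titleType": ["movie", "movie", "movie", "movie", "short"],
})

def filter_basics(df, first_year=1970, last_year_exclusive=1990, title_type="movie"):
    """Keep rows with first_year <= startYear < last_year_exclusive and a matching type."""
    # Series.between is inclusive on both ends, so subtract 1 from the upper bound.
    in_range = df["startYear"].between(first_year, last_year_exclusive - 1)
    return df[in_range & (df["titleType"] == title_type)]

kept = filter_basics(toy)
print(kept)
```

Only `tt2` (1970) and `tt3` (1989) survive: 1969 and 1990 fall outside the window, and `tt5` is not a movie.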

In [7]:
# resulting basics dataset
basics

| | tconst | startYear | titleType |
|---|---|---|---|
| 30901 | tt0031458 | 1970 | movie |
| 35967 | tt0036606 | 1983 | movie |
| 38011 | tt0038687 | 1980 | movie |
| 38759 | tt0039442 | 1973 | movie |
| 44159 | tt0044952 | 1977 | movie |
| ... | ... | ... | ... |
| 9884089 | tt9913320 | 1981 | movie |
| 9884314 | tt9913814 | 1981 | movie |
| 9884324 | tt9913834 | 1981 | movie |
| 9884345 | tt9913878 | 1981 | movie |


In [8]:
# resulting principals dataset
principals

| | tconst | nconst | category |
|---|---|---|---|
| 0 | tt0000001 | nm1588970 | self |
| 1 | tt0000001 | nm0005690 | director |
| 2 | tt0000001 | nm0374658 | cinematographer |
| 3 | tt0000002 | nm0721526 | director |
| 4 | tt0000002 | nm1335271 | composer |
| ... | ... | ... | ... |
| 56267418 | tt9916880 | nm0996406 | director |
| 56267419 | tt9916880 | nm0584014 | director |
| 56267420 | tt9916880 | nm1482639 | writer |
| 56267421 | tt9916880 | nm2586970 | writer |


In [9]:
# resulting ratings dataset
ratings

| | tconst | averageRating |
|---|---|---|
| 11 | tt0000012 | 7.4 |
| 13 | tt0000014 | 7.1 |
| 58 | tt0000060 | 7.4 |
| 194 | tt0000211 | 7.4 |
| 247 | tt0000310 | 7.3 |
| ... | ... | ... |
| 1316989 | tt9916730 | 8.3 |
| 1316990 | tt9916766 | 7.0 |
| 1316991 | tt9916778 | 7.2 |
| 1316992 | tt9916840 | 7.5 |


## Merging Files on tconst

In [10]:
# merge principals, basics and ratings
dataset = pd.merge(principals, basics, on="tconst", how="inner")
dataset = pd.merge(dataset, ratings, on="tconst", how="inner")
dataset = dataset.filter(items=["tconst", "nconst", "category","startYear"])
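
The inner merges keep only titles present in all three frames. The sketch below reproduces this on made-up data. It also passes `validate="many_to_one"`, an optional pandas check (not used in the notebook) that raises an error if `tconst` is duplicated on the right-hand side. All frame names here are illustrative.

```python
import pandas as pd

# Made-up stand-ins for the three frames.
principals_toy = pd.DataFrame({
    "tconst": ["tt1", "tt1", "tt2", "tt3"],
    "nconst": ["nm1", "nm2", "nm3", "nm4"],
    "category": ["actor", "director", "actress", "writer"],
})
basics_toy = pd.DataFrame({
    "tconst": ["tt1", "tt2"],
    "startYear": [1975, 1982],
    "titleType": ["movie", "movie"],
})
ratings_toy = pd.DataFrame({"tconst": ["tt1", "tt3"], "averageRating": [7.5, 8.0]})

# Many principals rows map to one basics row and one ratings row per title.
merged = principals_toy.merge(basics_toy, on="tconst", how="inner", validate="many_to_one")
merged = merged.merge(ratings_toy, on="tconst", how="inner", validate="many_to_one")
merged = merged[["tconst", "nconst", "category", "startYear"]]
print(merged)
```

Here only `tt1` survives: `tt2` has no rating above the threshold in `ratings_toy`, and `tt3` is missing from `basics_toy`.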

In [11]:
# resulting dataset
dataset

| | tconst | nconst | category | startYear |
|---|---|---|---|---|
| 0 | tt0038687 | nm0404158 | actor | 1980 |
| 1 | tt0038687 | nm3455274 | self | 1980 |
| 2 | tt0038687 | nm0001379 | director | 1980 |
| 3 | tt0038687 | nm0442105 | writer | 1980 |
| 4 | tt0044952 | nm0097888 | actor | 1977 |
| ... | ... | ... | ... | ... |
| 95466 | tt9900354 | nm0412915 | director | 1982 |
| 95467 | tt9900354 | nm0756534 | actress | 1982 |
| 95468 | tt9900354 | nm0782938 | actor | 1982 |
| 95469 | tt9900354 | nm2509198 | actress | 1982 |


In [12]:
dataset.to_csv("Data/FilteredData/dataset.csv")
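
By default `to_csv` also writes the row index, which comes back as an extra unnamed column when the file is read again. The sketch below shows the difference `index=False` makes. It uses in-memory buffers and a made-up one-row frame rather than the real output path.

```python
import io
import pandas as pd

# Made-up one-row frame with the same columns as dataset.csv.
toy = pd.DataFrame({
    "tconst": ["tt1"],
    "nconst": ["nm1"],
    "category": ["actor"],
    "startYear": [1980],
})

# Default behaviour: the index is written as a first, unnamed column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = list(pd.read_csv(with_index).columns)

# index=False writes only the four data columns.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = list(pd.read_csv(without_index).columns)

print(cols_with_index)
print(cols_without_index)
```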