# Titles

This notebook builds the title column for our dataset.

In [26]:
# import libraries
import pandas as pd
import os

## Description

We will use the IMDb title dataset, keeping only entries whose `titleType` is `movie`. TV shows and all other non-movie entries are dropped.


### Dependencies

IMDb title dataset (`title.basics.tsv.gz`)

In [7]:
path = '../data/raw/title.basics.tsv.gz'
all_titles = pd.read_csv(path, delimiter='\t')

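Reading a large file with inferred types can make pandas emit a mixed-type `DtypeWarning`, and IMDb marks missing values with `\N` (visible in the samples further down). A minimal sketch of a more explicit read, assuming every column can be loaded as text for now; `load_titles` is a hypothetical helper name, not part of the original notebook, and the rows are made up:

```python
import csv
import io

import pandas as pd


def load_titles(source):
    """Read an IMDb-style TSV into a DataFrame of strings.

    `source` may be a path (gzip is inferred from a .gz suffix)
    or a file-like object.
    """
    return pd.read_csv(
        source,
        sep='\t',
        dtype=str,               # assumption: treat every column as text for now
        na_values=['\\N'],       # IMDb's marker for a missing value
        keep_default_na=False,   # keep literal titles such as "NA" or "None"
        quoting=csv.QUOTE_NONE,  # assumption: quotes in titles are literal characters
    )


# Tiny in-memory stand-in for title.basics.tsv.gz (illustrative rows only)
sample = io.StringIO(
    'tconst\ttitleType\tprimaryTitle\tendYear\n'
    'tt0000001\tmovie\tNA\t\\N\n'
    'tt0000002\tshort\t"Quoted" Title\t\\N\n'
)
titles = load_titles(sample)
```

With the real file this would be called as `load_titles('../data/raw/title.basics.tsv.gz')`.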


### Shape

The result will have two columns: 
1. imdb_id
2. title

### Processing

Strip all non-movies

In [18]:
is_movie = 'titleType == "movie"'
movie_titles = all_titles.query(is_movie)

In [19]:
movie_titles.sample(5)

|  | tconst | titleType | primaryTitle | originalTitle | isAdult | startYear | endYear | runtimeMinutes | genres |
|---|---|---|---|---|---|---|---|---|---|
| 530594 | tt0440705 | movie | Der Passagier von Nr. 7 | Der Passagier von Nr. 7 | 0 | 1921 | \N | \N | \N |
| 134975 | tt0109084 | movie | Buccaneer Soul | Alma Corsária | 0 | 1993 | \N | 112 | Drama |
| 3107796 | tt1575714 | movie | The Heart with Million Knots | Xin you qian qian jie | 0 | 1973 | \N | 101 | Romance |
| 1651716 | tt10589540 | movie | Ang taran tanods: K'nang buhay 'to! | Ang taran tanods: K'nang buhay 'to! | 0 | 2019 | \N | 95 | Comedy |
| 150571 | tt0122058 | movie | The Women's Great Escape | Yeosu daetalok | 0 | 1976 | \N | 90 | Drama,Thriller |
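To see what the filter drops, one option is to count the `titleType` values on the unfiltered frame. IMDb uses values such as `tvMovie` and `short` alongside `movie`, and the exact-match query excludes them. A small sketch on made-up rows (the `toy` frame is illustrative; the real frame is `all_titles`):

```python
import pandas as pd

# Illustrative rows only, not real IMDb data
toy = pd.DataFrame({
    'tconst': ['tt1', 'tt2', 'tt3', 'tt4'],
    'titleType': ['movie', 'tvMovie', 'short', 'movie'],
})

# How many rows of each type exist before filtering
type_counts = toy['titleType'].value_counts()

# Same filter as in the notebook, then count what was removed
kept = toy.query('titleType == "movie"')
dropped = len(toy) - len(kept)
```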


Build new dataframe with correct columns

In [20]:
movie_titles = movie_titles[['tconst', 'primaryTitle']]

In [21]:
movie_titles.sample(5)

|  | tconst | primaryTitle |
|---|---|---|
| 4830420 | tt3764458 | All My Boyfriend's Girlfriends |
| 4071563 | tt2385784 | A small southern enterprise |
| 66858 | tt0053802 | Blood and Roses |
| 1441941 | tt10297320 | Swingers |
| 5223679 | tt4452636 | Coming Through Slaughter |


In [22]:
# Reassign instead of renaming in place, so a filtered slice is never modified
movie_titles = movie_titles.rename(columns={'tconst': 'imdb_id', 'primaryTitle': 'title'})

In [23]:
movie_titles.sample(5)

|  | imdb_id | title |
|---|---|---|
| 7227494 | tt7916910 | A1: The Long Road to Edinburgh |
| 1348990 | tt10168242 | Lotus |
| 3056325 | tt1533971 | Abraham's Children |
| 6329808 | tt6394432 | Landrauschen |
| 8333462 | tt9797642 | Work A Double |
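The three processing steps above (filter, select, rename) can also be chained into one expression that returns a new frame and leaves the input untouched. A sketch, where `build_movie_titles` is a hypothetical helper name and the input rows are made up:

```python
import pandas as pd


def build_movie_titles(all_titles):
    """Filter to movies, keep the id and title columns, and rename them."""
    return (
        all_titles
        .query('titleType == "movie"')
        .loc[:, ['tconst', 'primaryTitle']]
        .rename(columns={'tconst': 'imdb_id', 'primaryTitle': 'title'})
    )


# Illustrative input, not real data
toy = pd.DataFrame({
    'tconst': ['tt1', 'tt2'],
    'titleType': ['movie', 'tvSeries'],
    'primaryTitle': ['A Film', 'A Show'],
})
result = build_movie_titles(toy)
```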


### Completed

This is the final product.

In [25]:
movie_titles.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 683297 entries, 10 to 8404146
Data columns (total 2 columns):
imdb_id    683297 non-null object
title      683297 non-null object
dtypes: object(2)
memory usage: 15.6+ MB
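A few assertions can confirm that the frame matches the shape described earlier. A sketch, assuming `imdb_id` values are meant to be unique (the notebook does not state this); `check_titles` is a hypothetical name and the rows are made up:

```python
import pandas as pd


def check_titles(df):
    """Raise AssertionError if df is not a clean (imdb_id, title) table."""
    assert list(df.columns) == ['imdb_id', 'title'], 'unexpected columns'
    assert df.notna().all().all(), 'missing values present'
    # Assumption: each imdb_id should appear only once
    assert df['imdb_id'].is_unique, 'duplicate imdb_id'
    return True


# Illustrative frame, not real data
toy = pd.DataFrame({'imdb_id': ['tt1', 'tt2'], 'title': ['A', 'B']})
ok = check_titles(toy)
```

On the real data the call would be `check_titles(movie_titles)`.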


Export it to the interim data folder using pandas' built-in `to_csv` method, without writing the index.

In [28]:
path = os.path.join(os.pardir, 'data', 'interim', 'movie_titles.csv')
movie_titles.to_csv(path_or_buf=path, index=False)

Make sure the write was successful by reading the file back.

In [29]:
saved_movie_titles = pd.read_csv(path)

In [30]:
saved_movie_titles.sample(5)

|  | imdb_id | title |
|---|---|---|
| 136975 | tt0177116 | Pochti lyubovna istoriya |
| 644462 | tt8310474 | Bruised |
| 229166 | tt0362458 | Blumen lügen nicht |
| 189690 | tt0270273 | Chhaya |
| 37965 | tt0042726 | Memorias de un mexicano |
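Sampling rows only spot-checks the file. A stricter sketch compares the reloaded frame with the original using `pandas.testing.assert_frame_equal`. It writes to a temporary directory rather than `data/interim`, and the rows are made up for illustration:

```python
import os
import tempfile

import pandas as pd

# Illustrative frame with a non-default index, like the filtered titles
original = pd.DataFrame(
    {'imdb_id': ['tt1', 'tt2'], 'title': ['Lotus', 'NA']},
    index=[10, 42],
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'movie_titles.csv')
    original.to_csv(path, index=False)
    # keep_default_na=False stops a literal title like "NA"
    # from being read back as a missing value
    reloaded = pd.read_csv(path, dtype=str, keep_default_na=False)

# The index was not written, so compare against a reset index;
# check_dtype=False compares values only
pd.testing.assert_frame_equal(
    reloaded, original.reset_index(drop=True), check_dtype=False
)
```

The `keep_default_na=False` argument matters for this dataset because, by default, `read_csv` treats strings such as `NA` or `null` as missing, which would turn a title spelled that way into `NaN`.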
