# Data Cleaning

In [1]:
# Libraries
import pandas as pd
pd.set_option('display.max_columns', None)

import numpy as np

import warnings
warnings.filterwarnings('ignore')

I'm going to apply the same general cleaning to every table:
+ Drop the `last_update` column.
+ Check that the dtypes are correct.

Some tables need extra, table-specific cleaning; this is noted in the section for each of those tables.
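The two general steps can be sketched as a small helper. This is only an illustration: `general_clean` and the toy `raw` frame are hypothetical names and values, not part of the notebook, which performs the same steps inline for each table.

```python
import pandas as pd


def general_clean(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: return a copy of df without `last_update`."""
    # errors="ignore" keeps the call safe for tables that lack the column
    return df.drop(columns=["last_update"], errors="ignore")


# Toy stand-in for one of the source tables (values are made up)
raw = pd.DataFrame({
    "inventory_id": [1, 2, 3],
    "film_id": [1, 1, 2],
    "store_id": [1, 2, 1],
    "last_update": ["2006-02-15 05:09:17"] * 3,
})

clean = general_clean(raw)
print(clean.dtypes)  # inspect the dtype of each remaining column
```

Returning a new frame instead of using `inplace=True` leaves the raw table untouched, which makes it easier to re-run a cell while exploring.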

After general cleaning, I'm going to:
1. Transform the data in `old_HDD` so that it holds `film_id` and `actor_id` pairs, since the relationship between the `film` and `actor` tables is many-to-many.
2. Move the category information from `category_id` into the `film` table, replacing each id with the category itself; there's no need for a separate table that holds only the category.
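A minimal sketch of these two steps on toy data follows. Everything here is an assumption made for illustration: the second actor, the `category` lookup frame, its `name` column and the placeholder labels are hypothetical, and the sketch assumes that names identify actors uniquely and that title plus release year identify films uniquely, which should be checked on the real data.

```python
import pandas as pd

# Toy stand-ins; ids, the second actor and the category labels are hypothetical
actor = pd.DataFrame({
    "actor_id": [1, 2],
    "first_name": ["PENELOPE", "JANE"],
    "last_name": ["GUINESS", "DOE"],
})
film = pd.DataFrame({
    "film_id": [1, 2],
    "title": ["ACADEMY DINOSAUR", "ANACONDA CONFESSIONS"],
    "release_year": [2006, 2006],
})
old_hdd = pd.DataFrame({
    "first_name": ["PENELOPE", "PENELOPE", "JANE"],
    "last_name": ["GUINESS", "GUINESS", "DOE"],
    "title": ["ACADEMY DINOSAUR", "ANACONDA CONFESSIONS", "ACADEMY DINOSAUR"],
    "release_year": [2006, 2006, 2006],
    "category_id": [6, 2, 6],
})
category = pd.DataFrame({
    "category_id": [2, 6],
    "name": ["CATEGORY_2", "CATEGORY_6"],  # placeholder labels
})

film_keys = ["title", "release_year"]

# Step 1: swap names and titles for ids to get the many-to-many link table
film_actor = (
    old_hdd
    .merge(actor, on=["first_name", "last_name"], how="left")
    .merge(film, on=film_keys, how="left")
    .loc[:, ["actor_id", "film_id"]]
    .drop_duplicates()
    .reset_index(drop=True)
)

# Step 2: one category per film, with the id replaced by its label
film_category = (
    old_hdd[film_keys + ["category_id"]]
    .drop_duplicates()
    .merge(category, on="category_id", how="left")
    .rename(columns={"name": "category"})
)
film = film.merge(film_category[film_keys + ["category"]], on=film_keys, how="left")

print(film_actor)
print(film)
```

Left joins are used so that rows without a match survive as missing values and can be inspected, rather than silently disappearing.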

The `Language` table, as stated in the `0-data_exploration` notebook, is going to be dropped, since there is no way to know the language of each film without web scraping or access to another database.

## Inventory

In [23]:
df = pd.read_csv('../src/inventory.csv')

In [24]:
# drop last_update
df.drop('last_update', axis=1, inplace=True)

In [25]:
# checking data types
df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 3 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   inventory_id  1000 non-null   int64
 1   film_id       1000 non-null   int64
 2   store_id      1000 non-null   int64
dtypes: int64(3)
memory usage: 23.6 KB


In [26]:
# saving it for later use
df.to_csv('../data/inventory.csv')
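One detail to keep in mind when reading these files back: `to_csv` writes the index as an unnamed first column by default, and `read_csv` then loads it as `Unnamed: 0`. The sketch below shows the difference on a toy frame (`toy` and its values are made up); whether the extra column is wanted depends on how the later notebooks load the files.

```python
import io

import pandas as pd

# Toy frame with the same columns as the cleaned inventory table
toy = pd.DataFrame({"inventory_id": [1, 2], "film_id": [1, 1], "store_id": [1, 2]})

# Default behaviour: the RangeIndex is written as an unnamed first column
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()
print(cols_with_index)

# index=False writes only the data columns
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()
print(cols_without_index)
```

An alternative that keeps the files unchanged is to pass `index_col=0` to `read_csv` when loading them.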

## Actor

In [27]:
df = pd.read_csv('../src/actor.csv')

In [28]:
# drop last_update
df.drop('last_update', axis=1, inplace=True)

In [29]:
# checking data types
df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   actor_id    200 non-null    int64 
 1   first_name  200 non-null    object
 2   last_name   200 non-null    object
dtypes: int64(1), object(2)
memory usage: 26.2 KB


In [30]:
# saving it for later use
df.to_csv('../data/actor.csv')

## Film

In [37]:
df = pd.read_csv('../src/film.csv')

In [38]:
# drop last_update, rental_rate, duplicated language_id column and null original_language_id
df.drop(['last_update', 'rental_rate', 'language_id', 'original_language_id'], axis=1, inplace=True)

In [40]:
# changing column names for clarification
df.rename(columns= {'rental_duration': 'rental_days'}, inplace=True)

In [42]:
# checking data types
df.info(memory_usage='deep')

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 9 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   film_id           1000 non-null   int64  
 1   title             1000 non-null   object 
 2   description       1000 non-null   object 
 3   release_year      1000 non-null   int64  
 4   rental_days       1000 non-null   int64  
 5   length            1000 non-null   int64  
 6   replacement_cost  1000 non-null   float64
 7   rating            1000 non-null   object 
 8   special_features  1000 non-null   object 
dtypes: float64(1), int64(4), object(4)
memory usage: 397.6 KB
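The dtypes above are workable as they are. As an optional aside, a text column with only a few distinct values, which `rating` is likely to be, can also be stored as a pandas `category`. The sketch below compares memory use on a toy column; the rating values are assumed for illustration and are not taken from the data.

```python
import pandas as pd

# Toy column: 1000 rows drawn from a handful of assumed rating labels
film_toy = pd.DataFrame({"rating": ["PG", "G", "NC-17", "PG-13", "R"] * 200})

as_text = film_toy["rating"].memory_usage(deep=True)
as_category = film_toy["rating"].astype("category").memory_usage(deep=True)

print(f"text: {as_text} bytes, category: {as_category} bytes")
```

Note that a `category` dtype is not preserved by a CSV round trip, so the conversion would have to be repeated after loading the saved file.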


In [43]:
df.to_csv('../data/film.csv')

## Linking Actor and FilmID

In [46]:
df = pd.read_csv('../src/old_HDD.csv')

In [48]:
df.head()

| Unnamed: 0 | first_name | last_name | title | release_year | category_id |
|---|---|---|---|---|---|
| 0 | PENELOPE | GUINESS | ACADEMY DINOSAUR | 2006 | 6 |
| 1 | PENELOPE | GUINESS | ANACONDA CONFESSIONS | 2006 | 2 |
| 2 | PENELOPE | GUINESS | ANGELS LIFE | 2006 | 13 |
| 3 | PENELOPE | GUINESS | BULWORTH COMMANDMENTS | 2006 | 10 |
| 4 | PENELOPE | GUINESS | CHEAPER CLYDE | 2006 | 14 |
