# Exploratory data analysis

## Importing the libraries

In [3770]:
import pandas as pd
import numpy as np

## Path to the dataset

In [3771]:
path_csv = '../data/raw/movies_dataset.csv'

## Reading the dataset

In [3772]:
df = pd.read_csv(path_csv, sep=',')
df

Unnamed: 0,id,title,original_title,original_language,overview,tagline,budget,revenue,runtime,release_date,popularity,vote_average,vote_count,release_year
0,411405,Small Crimes,Small Crimes,en,"A disgraced former cop, fresh off a six-year p...",,0.0,0.0,95.0,2017-04-28,7.219022,5.8,55.0,2017.0
1,42492,Up the Sandbox,Up the Sandbox,en,"A young wife and mother, bored with day-to-day...",,0.0,0.0,97.0,1972-12-21,0.138450,7.3,2.0,1972.0
2,12143,Bad Lieutenant,Bad Lieutenant,en,"While investigating a young nun's rape, a corr...",Gambler. Thief. Junkie. Killer. Cop.,1000000.0,2019469.0,96.0,1992-09-16,6.417037,6.9,162.0,1992.0
3,9976,Satan's Little Helper,Satan's Little Helper,en,A naïve young boy unknowingly becomes the pawn...,You'll laugh 'til you die,0.0,0.0,100.0,2004-01-01,2.233189,5.0,42.0,2004.0
4,46761,Sitcom,Sitcom,fr,The adventures of an upper-class suburban fami...,,0.0,0.0,80.0,1998-05-27,1.800582,6.4,27.0,1998.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5002,16010,Life Hits,Råzone,da,Christina is living in a suburb to Copenhagen....,,0.0,0.0,93.0,2006-07-03,0.509359,6.5,6.0,2006.0
5003,205775,In the Heart of the Sea,In the Heart of the Sea,en,"In the winter of 1820, the New England whaling...",Based on the incredible true story that inspir...,100000000.0,93820758.0,122.0,2015-11-20,11.696923,6.5,1300.0,2015.0
5004,218238,Blinker,Blinker,en,A series of mysterious events and the pesterin...,,0.0,0.0,90.0,1999-01-01,0.052350,6.0,1.0,1999.0
5005,47694,Temptress Moon,Feng yue,zh,"Set in the decadent 1920s, Temptress Moon tell...",,0.0,0.0,130.0,1996-05-09,1.136222,7.2,6.0,1996.0


In [3773]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5007 entries, 0 to 5006
Data columns (total 14 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   id                 5007 non-null   int64  
 1   title              5006 non-null   object 
 2   original_title     5007 non-null   object 
 3   original_language  5006 non-null   object 
 4   overview           4909 non-null   object 
 5   tagline            2248 non-null   object 
 6   budget             4766 non-null   object 
 7   revenue            4807 non-null   float64
 8   runtime            4836 non-null   object 
 9   release_date       4893 non-null   object 
 10  popularity         5006 non-null   float64
 11  vote_average       5006 non-null   float64
 12  vote_count         5006 non-null   float64
 13  release_year       4893 non-null   float64
dtypes: float64(5), int64(1), object(8)
memory usage: 547.8+ KB


## Handling duplicate data

We will remove the duplicate rows.

In [3774]:
df.duplicated()

0       False
1       False
2       False
3       False
4       False
        ...  
5002     True
5003     True
5004     True
5005     True
5006     True
Length: 5007, dtype: bool

In [3775]:
df.duplicated().sum()

6

In [3776]:
df.drop_duplicates(inplace=True)

In [3777]:
df.duplicated().sum()

0
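
As a minimal sketch of what the cells above do, the toy frame below (hypothetical data, not taken from the dataset) shows how `duplicated()` flags repeated rows and how `drop_duplicates()` keeps the first occurrence. It also shows that the surviving rows keep their original index labels.

```python
import pandas as pd

# Toy frame (hypothetical data) with one fully repeated row.
toy = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "title": ["A", "B", "B", "C"],
})

# duplicated() marks the second and later occurrences of a row as True.
mask = toy.duplicated()
n_duplicates = int(mask.sum())

# drop_duplicates() keeps the first occurrence and preserves the index labels.
deduplicated = toy.drop_duplicates()

print(n_duplicates)                  # 1
print(deduplicated.index.tolist())   # [0, 1, 3]
```

Because the labels are preserved, the index is no longer contiguous after the removal. Calling `reset_index(drop=True)` would renumber it, although the notebook does not do so.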

In [3778]:
df_sem_duplicado = df.copy()

## Handling missing data

In [3779]:
df_sem_duplicado.isna().sum()

id                      0
title                   1
original_title          0
original_language       1
overview               98
tagline              2756
budget                241
revenue               200
runtime               171
release_date          114
popularity              1
vote_average            1
vote_count              1
release_year          114
dtype: int64

The `tagline` column has more than 50% of its records missing (`NaN`). Since this column only holds a promotional phrase that is unique to each film, it is not correlated with the other columns. Because more than 50% of its values are `NaN`, we will remove it from the dataframe.

In [3780]:
df_sem_duplicado.drop(columns=['tagline'], inplace=True)

In [3781]:
df_sem_duplicado.columns

Index(['id', 'title', 'original_title', 'original_language', 'overview',
       'budget', 'revenue', 'runtime', 'release_date', 'popularity',
       'vote_average', 'vote_count', 'release_year'],
      dtype='object')
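
The same decision can be expressed as a general rule. The sketch below uses a hypothetical toy frame and an assumed threshold of 0.5: it computes the share of missing values per column and drops the columns that exceed the threshold. It is an illustration, not the method used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical data): 'tagline' is missing in 3 of 4 rows.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "tagline": ["x", np.nan, np.nan, np.nan],
    "overview": ["o1", "o2", np.nan, "o4"],
})

# Share of missing values per column (mean of a boolean mask).
missing_share = toy.isna().mean()

# Assumed cut-off: drop columns with more than half of the values missing.
threshold = 0.5
to_drop = missing_share[missing_share > threshold].index.tolist()

reduced = toy.drop(columns=to_drop)
print(to_drop)                  # ['tagline']
print(reduced.columns.tolist()) # ['title', 'overview']
```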

In the `budget` and `revenue` columns, the dataframe information shows that most values are equal to `0.0`, which suggests that the values different from `0.0` may be `outliers`. Therefore, for these columns, we will replace the `NaN` values with `0.0`.

For the `revenue` column, we have:

In [3782]:
media_nan_revenue = df_sem_duplicado['revenue'].isna().mean()
media_nan_revenue

0.03999200159968006

In [3783]:
media_zero_revenue = (df_sem_duplicado['revenue'] == 0.0).mean()
media_zero_revenue

0.8014397120575885

In [3784]:
df_sem_duplicado['revenue'] = df_sem_duplicado['revenue'].replace(np.nan, 0.0)

In [3785]:
df_sem_duplicado['revenue'].isna().mean()

0.0
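
As a small illustration of the zero-filling step, the sketch below uses `fillna(0.0)`, which has the same effect on `NaN`s as `replace(np.nan, 0.0)`. The series is hypothetical. Checking the share of `NaN`s with `isna().mean()` before and after confirms that nothing is left to fill.

```python
import numpy as np
import pandas as pd

# Toy revenue series (hypothetical values) with two missing entries.
revenue = pd.Series([0.0, np.nan, 2019469.0, np.nan])

# Share of missing values before filling.
nan_share_before = revenue.isna().mean()

# fillna(0.0) replaces every NaN with 0.0.
filled = revenue.fillna(0.0)

# Share of missing values and of zeros after filling.
nan_share_after = filled.isna().mean()
zero_share_after = (filled == 0.0).mean()

print(nan_share_before, nan_share_after, zero_share_after)
```

Note that after this step a true zero and a formerly missing value can no longer be told apart.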

The dataframe information shows that the `budget` column has type `object`. So we will first convert it to `float`.

Converting directly to `float` raises an error because of the word `unknown` in some records of this column. So we will replace that value with `NaN` and then convert the type to `float`.

In [3786]:
df_sem_duplicado['budget'] = df_sem_duplicado['budget'].replace('unknown', np.nan)

Converting to `float`.

In [3787]:
df_sem_duplicado['budget'] = df_sem_duplicado['budget'].astype(float)
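
An alternative to the two steps above, shown here only as a sketch on hypothetical data, is `pd.to_numeric` with `errors="coerce"`, which converts to a numeric dtype and turns any unparsable entry into `NaN` in a single call.

```python
import pandas as pd

# Toy budget column (hypothetical values) read as text, with a placeholder word.
budget = pd.Series(["1000000.0", "unknown", "0.0", None], dtype=object)

# errors="coerce" turns every entry that cannot be parsed into NaN.
budget_numeric = pd.to_numeric(budget, errors="coerce")

print(budget_numeric.dtype)              # float64
print(int(budget_numeric.isna().sum()))  # 2
```

The trade-off is that `errors="coerce"` silently discards every unparsable token, not only `unknown`, so the explicit `replace` used in the notebook makes it easier to notice unexpected values.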

Filling the `NaN`s in `budget` with `0.0`.

In [3788]:
media_nan_budget = df_sem_duplicado['budget'].isna().mean()
media_nan_budget

0.06778644271145771

In [3789]:
media_zero_budget = (df_sem_duplicado['budget'] == 0.0).mean()
media_zero_budget

0.7570485902819436

In [3790]:
df_sem_duplicado['budget'] = df_sem_duplicado['budget'].replace(np.nan, 0.0)

In [3791]:
df_sem_duplicado['budget'].isna().mean()

0.0

In [3792]:
df_sem_duplicado.describe()

Unnamed: 0,id,budget,revenue,popularity,vote_average,vote_count,release_year
count,5001.0,5001.0,5001.0,5000.0,5000.0,5000.0,4887.0
mean,109172.770446,29564740.0,9669093.0,2.935516,5.60706,106.2628,1991.868426
std,112384.820066,1435053000.0,54016460.0,8.936065,1.962822,460.395815,24.123312
min,13.0,0.0,-50000.0,0.0,0.0,0.0,1891.0
25%,27313.0,0.0,0.0,0.376077,5.0,3.0,1978.0
50%,60479.0,0.0,0.0,1.097116,6.0,10.0,2001.0
75%,156954.0,0.0,0.0,3.580283,6.8,34.0,2010.0
max,464207.0,100000000000.0,1156731000.0,547.488298,10.0,8670.0,2018.0


For the `runtime` column, we will replace the `NaN`s with the `median`, since it is more robust to `outliers`.

In [3793]:
df_sem_duplicado['runtime'].isna().sum()

171

When converting `runtime` to `float`, we run into the problem that some records have `min` (for minutes) after the number. So we will remove `min` before converting to `float`.

In [3794]:
df_sem_duplicado['runtime'] = df_sem_duplicado['runtime'].str.replace('min', '', regex=False).str.strip()

In [3795]:
df_sem_duplicado['runtime'] = df_sem_duplicado['runtime'].astype(float)
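
The sketch below reproduces the two cells above on a hypothetical series, to show what happens to each kind of entry: plain numbers, numbers followed by `min` (with or without a space), and missing values.

```python
import numpy as np
import pandas as pd

# Toy runtime column (hypothetical values) stored as text.
runtime = pd.Series(["95.0", "97 min", "100min", np.nan], dtype=object)

# Remove the literal suffix 'min' and any surrounding whitespace.
cleaned = runtime.str.replace("min", "", regex=False).str.strip()

# Missing entries stay NaN; the remaining strings parse as floats.
runtime_float = cleaned.astype(float)

print(runtime_float.tolist())  # [95.0, 97.0, 100.0, nan]
```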

Now we will replace the `NaN` values with the `median`.

In [3796]:
mediana_runtime = df_sem_duplicado['runtime'].median()
df_sem_duplicado['runtime'] = df_sem_duplicado['runtime'].replace(np.nan, mediana_runtime)

In [3797]:
df_sem_duplicado['runtime'].isna().sum()

0
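
To illustrate why the median is preferred here, the sketch below uses a hypothetical series with one extreme value: the mean is pulled far away from the typical values, while the median is not.

```python
import numpy as np
import pandas as pd

# Toy runtimes (hypothetical values): four typical films, one extreme value, one missing.
runtime = pd.Series([90.0, 95.0, 100.0, 105.0, 1000.0, np.nan])

# Both statistics skip NaN by default.
mean_runtime = runtime.mean()      # 278.0, dominated by the extreme value
median_runtime = runtime.median()  # 100.0, close to the typical films

# Filling with the median keeps the imputed value representative.
filled = runtime.fillna(median_runtime)

print(mean_runtime, median_runtime, filled.iloc[-1])
```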

For the `popularity`, `vote_average`, and `vote_count` columns, we will replace the missing values with the `mean`, the `mean`, and the `median`, respectively. `popularity` and `vote_average` have a low standard deviation (which suggests few `outliers`), whereas `vote_count` has a high standard deviation, which suggests more `outliers` in that column, so the `median` behaves better there.

In [3798]:
media_popularity = df_sem_duplicado['popularity'].mean()
media_vote_average = df_sem_duplicado['vote_average'].mean()
mediana_vote_count = df_sem_duplicado['vote_count'].median()

In [3799]:
df_sem_duplicado['popularity'] = df_sem_duplicado['popularity'].replace(np.nan, media_popularity)
df_sem_duplicado['vote_average'] = df_sem_duplicado['vote_average'].replace(np.nan, media_vote_average)
df_sem_duplicado['vote_count'] = df_sem_duplicado['vote_count'].replace(np.nan, mediana_vote_count)
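
The per-column choice of statistic can also be written as a single `fillna` call with a dictionary that maps each column to its fill value. This is a sketch on hypothetical data, and the name `fill_values` is introduced only for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with one row missing all three fields.
toy = pd.DataFrame({
    "popularity": [1.0, 3.0, np.nan],
    "vote_average": [5.0, 7.0, np.nan],
    "vote_count": [2.0, 10.0, np.nan],
})

# One fill value per column: mean, mean, and median respectively.
fill_values = {
    "popularity": toy["popularity"].mean(),
    "vote_average": toy["vote_average"].mean(),
    "vote_count": toy["vote_count"].median(),
}

# fillna accepts a dict keyed by column name.
filled = toy.fillna(value=fill_values)

print(filled.loc[2].tolist())  # [2.0, 6.0, 6.0]
```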

In [3800]:
df_sem_duplicado.isna().sum()

id                     0
title                  1
original_title         0
original_language      1
overview              98
budget                 0
revenue                0
runtime                0
release_date         114
popularity             0
vote_average           0
vote_count             0
release_year         114
dtype: int64

Looking at the record with the missing `title`, we see that `release_date` and `release_year` are also missing. From the film's `overview` we can identify the film and fill in these columns.

In [3801]:
df_sem_duplicado[df_sem_duplicado['title'].isna()]

Unnamed: 0,id,title,original_title,original_language,overview,budget,revenue,runtime,release_date,popularity,vote_average,vote_count,release_year
2474,122662,,マルドゥック・スクランブル 排気,ja,Third film of the Mardock Scramble series.,0.0,0.0,94.0,,2.935516,5.60706,10.0,


In [3802]:
df_sem_duplicado['title'] = df_sem_duplicado['title'].astype(str)

In [3803]:
df_sem_duplicado.loc[2474, 'title'] = 'Mardock Scramble: The Third Exhaust'

Now we will convert the `release_date` column to `datetime` in order to insert the film's release date.

In [3804]:
df_sem_duplicado['release_date'] = pd.to_datetime(df_sem_duplicado['release_date'], errors='coerce')
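
One consequence of `errors='coerce'` is that any string that cannot be parsed as a date becomes `NaT`, which adds to the existing missing values. The sketch below, on hypothetical data, counts how many values are lost that way. This may explain why the final `info()` output lists fewer non-null entries for `release_date` (4792) than for `release_year` (4888), although the notebook does not verify this.

```python
import pandas as pd

# Toy release dates (hypothetical values): one valid, one malformed, one missing.
release_date = pd.Series(["2017-04-28", "not a date", None], dtype=object)

# Unparsable strings become NaT instead of raising an error.
parsed = pd.to_datetime(release_date, errors="coerce")

# Values that were present as text but could not be parsed.
n_invalid = int(parsed.isna().sum() - release_date.isna().sum())

print(n_invalid)  # 1
```

Comparing the missing counts before and after the conversion, as done here, is a cheap way to detect this kind of silent loss.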

In [3805]:
df_sem_duplicado.loc[2474, 'release_date'] = pd.to_datetime('2012-09-29')
df_sem_duplicado.loc[2474, 'release_year'] = 2012.0

We have filled in all the missing values of this record.

In [3806]:
df_sem_duplicado.loc[2474]

id                                                       122662
title                       Mardock Scramble: The Third Exhaust
original_title                                 マルドゥック・スクランブル 排気
original_language                                            ja
overview             Third film of the Mardock Scramble series.
budget                                                      0.0
revenue                                                     0.0
runtime                                                    94.0
release_date                                2012-09-29 00:00:00
popularity                                             2.935516
vote_average                                            5.60706
vote_count                                                 10.0
release_year                                             2012.0
Name: 2474, dtype: object
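
In the cells above, `release_year` is typed in by hand alongside `release_date`. As a sketch of an alternative that the notebook does not use, the year can be derived from a `datetime` column with the `.dt.year` accessor, which keeps the two columns consistent. The data below is hypothetical.

```python
import pandas as pd

# Toy release dates (hypothetical values), one of them missing.
release_date = pd.to_datetime(pd.Series(["2012-09-29", None, "1996-05-09"]))

# .dt.year extracts the year; rows with NaT yield a missing year.
release_year = release_date.dt.year

print(release_year.tolist())
```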

Now, the `original_language` column has a single `NaN` record, which we can also fill in.

In [3807]:
df_sem_duplicado[df_sem_duplicado['original_language'].isna()]

Unnamed: 0,id,title,original_title,original_language,overview,budget,revenue,runtime,release_date,popularity,vote_average,vote_count,release_year
4946,381096,Yarn,Garn,,The traditional crafts of crochet and knitting...,0.0,0.0,76.0,2016-03-12,0.067624,0.0,0.0,2016.0


In [3808]:
df_sem_duplicado['original_language'].unique()

array(['en', 'fr', 'nl', 'it', 'ko', 'hi', 'ja', 'cs', 'cn', 'tr', 'sv',
       'de', 'ta', 'English', 'ingles', 'EN', 'zh', 'is', 'hr', 'el',
       'pt', 'ml', 'ar', 'ru', 'sr', 'fi', 'th', 'fa', 'ur', 'da', 'es',
       'iu', 'pl', 'hu', 'lv', 'mk', 'et', 'vi', 'ro', 'tl', 'no', 'mn',
       'he', 'bg', 'uk', 'xx', 'pa', 'mr', 'ca', 'id', 'nb', 'uz', 'sk',
       'sl', 'sq', 'te', 'ka', 'ay', 'ps', 'ms', 'bn', nan], dtype=object)

In [3809]:
df_sem_duplicado['original_language'] = df_sem_duplicado['original_language'].replace(['English', 'ingles', 'EN'], 'en')

In [3810]:
df_sem_duplicado['original_language'].unique()

array(['en', 'fr', 'nl', 'it', 'ko', 'hi', 'ja', 'cs', 'cn', 'tr', 'sv',
       'de', 'ta', 'zh', 'is', 'hr', 'el', 'pt', 'ml', 'ar', 'ru', 'sr',
       'fi', 'th', 'fa', 'ur', 'da', 'es', 'iu', 'pl', 'hu', 'lv', 'mk',
       'et', 'vi', 'ro', 'tl', 'no', 'mn', 'he', 'bg', 'uk', 'xx', 'pa',
       'mr', 'ca', 'id', 'nb', 'uz', 'sk', 'sl', 'sq', 'te', 'ka', 'ay',
       'ps', 'ms', 'bn', nan], dtype=object)
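
The normalization above can also be written with an explicit mapping from each variant to its canonical code, which makes it easy to extend if other spellings appear. This is a sketch on hypothetical data, and the name `aliases` is introduced only for illustration.

```python
import pandas as pd

# Toy language column (hypothetical values) with inconsistent spellings of English.
language = pd.Series(["en", "English", "ingles", "EN", "fr", None], dtype=object)

# Mapping from each variant to the canonical two-letter code.
aliases = {"English": "en", "ingles": "en", "EN": "en"}

# replace() with a dict substitutes exact matches and leaves other values untouched.
normalized = language.replace(aliases)

print(sorted(normalized.dropna().unique().tolist()))  # ['en', 'fr']
```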

Filling in the missing language in `original_language`.

In [3811]:
df_sem_duplicado.loc[4946, 'original_language'] = 'en'

In [3812]:
df_sem_duplicado.loc[4946]

id                                                              381096
title                                                             Yarn
original_title                                                    Garn
original_language                                                   en
overview             The traditional crafts of crochet and knitting...
budget                                                             0.0
revenue                                                            0.0
runtime                                                           76.0
release_date                                       2016-03-12 00:00:00
popularity                                                    0.067624
vote_average                                                       0.0
vote_count                                                         0.0
release_year                                                    2016.0
Name: 4946, dtype: object

For the `overview`, `release_date`, and `release_year` columns, we cannot infer the missing values, so for now we will leave these columns unchanged.

In [3813]:
df_sem_duplicado.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5001 entries, 0 to 5001
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   id                 5001 non-null   int64         
 1   title              5001 non-null   object        
 2   original_title     5001 non-null   object        
 3   original_language  5001 non-null   object        
 4   overview           4903 non-null   object        
 5   budget             5001 non-null   float64       
 6   revenue            5001 non-null   float64       
 7   runtime            5001 non-null   float64       
 8   release_date       4792 non-null   datetime64[ns]
 9   popularity         5001 non-null   float64       
 10  vote_average       5001 non-null   float64       
 11  vote_count         5001 non-null   float64       
 12  release_year       4888 non-null   float64       
dtypes: datetime64[ns](1), float64(7), int64(1), object(4)
memory usage: 

In [3814]:
df_limpo = df_sem_duplicado.copy()

In [3815]:
df_limpo.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5001 entries, 0 to 5001
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype         
---  ------             --------------  -----         
 0   id                 5001 non-null   int64         
 1   title              5001 non-null   object        
 2   original_title     5001 non-null   object        
 3   original_language  5001 non-null   object        
 4   overview           4903 non-null   object        
 5   budget             5001 non-null   float64       
 6   revenue            5001 non-null   float64       
 7   runtime            5001 non-null   float64       
 8   release_date       4792 non-null   datetime64[ns]
 9   popularity         5001 non-null   float64       
 10  vote_average       5001 non-null   float64       
 11  vote_count         5001 non-null   float64       
 12  release_year       4888 non-null   float64       
dtypes: datetime64[ns](1), float64(7), int64(1), object(4)
memory usage: 

## Saving the dataset

In [3816]:
df_limpo.to_csv('../data/processed/filmes_limpos.csv', index=False)
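
CSV does not store column types, so the `datetime` dtype of `release_date` is not preserved when the file is read back. The sketch below uses an in-memory buffer and hypothetical data instead of the real file, and shows that passing `parse_dates` to `pd.read_csv` restores the dtype.

```python
import io

import pandas as pd

# Toy frame (hypothetical data) with a datetime column.
toy = pd.DataFrame({
    "title": ["Yarn"],
    "release_date": pd.to_datetime(["2016-03-12"]),
})

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Reading back without hints yields a text column.
buffer.seek(0)
plain = pd.read_csv(buffer)

# parse_dates asks read_csv to convert the listed columns to datetime.
buffer.seek(0)
restored = pd.read_csv(buffer, parse_dates=["release_date"])

print(plain["release_date"].dtype, restored["release_date"].dtype)
```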