# **Setup**
Importing the libraries used during the exploratory analysis.

In [15]:
import pandas as pd

Reading the Movies and Ratings datasets

In [16]:
df_movies = pd.read_csv("../../data/movies/movies.csv")
df_ratings = pd.read_csv("../../data/movies/ratings.csv")

In [17]:
print("Movies Dataset:\n")
print(df_movies.head())
print("Ratings Dataset:\n")
print(df_ratings)

Movies Dataset:

      tconst titleType            primaryTitle           originalTitle  \
0  tt0000001     short              Carmencita              Carmencita   
1  tt0000002     short  Le clown et ses chiens  Le clown et ses chiens   
2  tt0000003     short            Poor Pierrot          Pauvre Pierrot   
3  tt0000004     short             Un bon bock             Un bon bock   
4  tt0000005     short        Blacksmith Scene        Blacksmith Scene   

   isAdult startYear endYear runtimeMinutes                    genres  
0        0      1894      \N              1         Documentary,Short  
1        0      1892      \N              5           Animation,Short  
2        0      1892      \N              5  Animation,Comedy,Romance  
3        0      1892      \N             12           Animation,Short  
4        0      1893      \N              1                     Short  
Ratings Dataset:

            tconst  averageRating  numVotes
0        tt0000001            5.7      2183


# **Combining into a Single Dataset**
Adding the averageRating and numVotes features to the Movies dataset, joined on tconst.

In [29]:
Dataset = pd.merge(df_movies, df_ratings[['tconst', 'averageRating', 'numVotes']], on='tconst', how='left')
print(Dataset.head())

      tconst titleType            primaryTitle           originalTitle  \
0  tt0000001     short              Carmencita              Carmencita   
1  tt0000002     short  Le clown et ses chiens  Le clown et ses chiens   
2  tt0000003     short            Poor Pierrot          Pauvre Pierrot   
3  tt0000004     short             Un bon bock             Un bon bock   
4  tt0000005     short        Blacksmith Scene        Blacksmith Scene   

   isAdult startYear endYear runtimeMinutes                    genres  \
0        0      1894      \N              1         Documentary,Short   
1        0      1892      \N              5           Animation,Short   
2        0      1892      \N              5  Animation,Comedy,Romance   
3        0      1892      \N             12           Animation,Short   
4        0      1893      \N              1                     Short   

   averageRating  numVotes  
0            5.7    2183.0  
1            5.5     303.0  
2            6.4    2263.0  
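Because the merge uses `how='left'`, titles with no entry in the ratings file are kept, with `NaN` in `averageRating` and `numVotes`. That is why `numVotes` is displayed as a float (`2183.0`). A minimal sketch of how this could be checked, using small made-up frames (`movies_sample` and `ratings_sample` are hypothetical stand-ins, not the real data):

```python
import pandas as pd

# Hypothetical stand-ins for df_movies and df_ratings
movies_sample = pd.DataFrame(
    {"tconst": ["tt1", "tt2", "tt3"], "primaryTitle": ["A", "B", "C"]}
)
ratings_sample = pd.DataFrame(
    {"tconst": ["tt1", "tt3"], "averageRating": [5.7, 6.4], "numVotes": [2183, 2263]}
)

# validate= raises if tconst is duplicated on either side;
# indicator= adds a "_merge" column telling where each row came from
merged = pd.merge(
    movies_sample,
    ratings_sample,
    on="tconst",
    how="left",
    validate="one_to_one",
    indicator=True,
)

unrated = int((merged["_merge"] == "left_only").sum())
print(f"Titles without ratings: {unrated} of {len(merged)}")
print(merged["numVotes"].dtype)  # float, because of the NaN in the unrated row
```

The same two keyword arguments could be added to the real merge to count how many titles have no rating before deciding how to treat them.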


# **Exploring the Dataset Features**

Checking the type of each feature

In [19]:
print(Dataset.dtypes)

tconst             object
titleType          object
primaryTitle       object
originalTitle      object
isAdult             int64
startYear          object
endYear            object
runtimeMinutes     object
genres             object
averageRating     float64
numVotes          float64
dtype: object
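Several columns that look numeric (`startYear`, `endYear`, `runtimeMinutes`) are typed as `object`, most likely because of the `\N` placeholder visible in the earlier output. One possible way to convert them, sketched here on a hypothetical sample frame, is `pd.to_numeric` with `errors="coerce"`, which turns any non-numeric string into a missing value:

```python
import pandas as pd

# Hypothetical sample mimicking the string-typed columns of Dataset
sample = pd.DataFrame(
    {
        "startYear": ["1894", "\\N", "1892"],
        "runtimeMinutes": ["1", "5", "\\N"],
    }
)

for col in ["startYear", "runtimeMinutes"]:
    # Non-numeric strings (such as \N) become NaN, then the column is
    # stored as a nullable integer so years are not shown as floats
    sample[col] = pd.to_numeric(sample[col], errors="coerce").astype("Int64")

print(sample.dtypes)
print(sample)
```

Note that coercion silently discards every unparseable value, so it is worth counting the resulting missing values per column before relying on them.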


Dropping tconst, since it is only the IMDb identifier

In [20]:
Dataset.drop('tconst', axis=1, inplace=True)
print(Dataset)

          titleType               primaryTitle              originalTitle  \
0             short                 Carmencita                 Carmencita   
1             short     Le clown et ses chiens     Le clown et ses chiens   
2             short               Poor Pierrot             Pauvre Pierrot   
3             short                Un bon bock                Un bon bock   
4             short           Blacksmith Scene           Blacksmith Scene   
...             ...                        ...                        ...   
11949941  tvEpisode              Episode #3.17              Episode #3.17   
11949942  tvEpisode              Episode #3.19              Episode #3.19   
11949943  tvEpisode              Episode #3.20              Episode #3.20   
11949944      short                   The Wind                   The Wind   
11949945  tvEpisode  Horrid Henry Knows It All  Horrid Henry Knows It All   

          isAdult startYear endYear runtimeMinutes  \
0               0    

Inspecting the distinct values in each column

In [None]:
for col in Dataset.columns:
    print(f"\nColuna: {col}")
    print(Dataset[col].unique())



Coluna: titleType
['short' 'movie' 'tvShort' 'tvMovie' 'tvEpisode' 'tvSeries' 'tvMiniSeries'
 'tvSpecial' 'video' 'videoGame' 'tvPilot']

Coluna: primaryTitle
['Carmencita' 'Le clown et ses chiens' 'Poor Pierrot' ... 'Luc Janssens'
 "Horrid Henry's Comic Caper" 'Horrid Henry Knows It All']

Coluna: originalTitle
['Carmencita' 'Le clown et ses chiens' 'Pauvre Pierrot' ... 'Luc Janssens'
 "Horrid Henry's Comic Caper" 'Horrid Henry Knows It All']

Coluna: isAdult
[   0    1 2019 1981 2020 2017 2023 2022 1977 1978 1979 1966 1970 1971
 1972 1973 1974 1975 1988 1980 1987 1986 1982 1985 1983 1984 1976 1968
 1969 2024 1967 1965 1958 2025 2014 2005]

Coluna: startYear
['1894' '1892' '1893' '1895' '1896' '1898' '1897' '1900' '1899' '1901'
 '1902' '1903' '1904' '1905' '1912' '1907' '1906' '1908' '1910' '1909'
 '\\N' '1911' '1990' '1914' '1913' '1915' '1919' '1916' '1917' '1918'
 '1936' '1925' '1922' '1920' '1921' '1923' '1924' '1928' '2019' '1926'
 '1927' '1929' '1993' '1935' '1930' '1942' '1934
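The `isAdult` output above includes values such as 2019 and 1981 alongside the expected 0 and 1, which hints that some rows may have misaligned columns. A minimal sketch for isolating such rows, shown on a hypothetical sample frame rather than the full `Dataset`:

```python
import pandas as pd

# Hypothetical sample: isAdult should only ever be 0 or 1
sample = pd.DataFrame(
    {
        "primaryTitle": ["Carmencita", "Some Title", "Another Title"],
        "isAdult": [0, 2019, 1],
    }
)

# Flag every row whose isAdult value is outside the expected set
invalid_mask = ~sample["isAdult"].isin([0, 1])
print(f"Rows with unexpected isAdult values: {invalid_mask.sum()}")
print(sample[invalid_mask])
```

Applied to `Dataset`, the same mask would show how many rows are affected and allow them to be inspected, or dropped, before any modelling.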

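Splitting the categorical feature 'titleType' into several binary features through one-hot encoding. The output below still lists tconst: judging by the execution counters, this cell ran right after the merge cell was re-executed, so the earlier drop had not been re-applied.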

In [30]:
Dataset = pd.get_dummies(Dataset, columns=['titleType'], prefix='', prefix_sep='')
print(Dataset.dtypes)

tconst             object
primaryTitle       object
originalTitle      object
isAdult             int64
startYear          object
endYear            object
runtimeMinutes     object
genres             object
averageRating     float64
numVotes          float64
movie                bool
short                bool
tvEpisode            bool
tvMiniSeries         bool
tvMovie              bool
tvPilot              bool
tvSeries             bool
tvShort              bool
tvSpecial            bool
video                bool
videoGame            bool
dtype: object
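To make the effect of the encoding easier to see, here is a minimal sketch of the same `pd.get_dummies` call on a hypothetical three-row sample. Each distinct `titleType` value becomes its own column, and every row is set in exactly one of them:

```python
import pandas as pd

# Hypothetical sample with only two distinct title types
sample = pd.DataFrame(
    {"titleType": ["short", "movie", "short"], "isAdult": [0, 0, 1]}
)

# Same arguments as in the notebook: an empty prefix keeps the raw
# category names ("movie", "short") as the new column names
encoded = pd.get_dummies(sample, columns=["titleType"], prefix="", prefix_sep="")
print(encoded)

# Sanity check: exactly one indicator is set per row
indicators_per_row = encoded[["movie", "short"]].sum(axis=1)
print(indicators_per_row.tolist())
```

One consequence of the empty prefix is that the new columns could collide with an existing column of the same name; that does not happen with the columns listed above, but keeping a prefix would avoid the risk.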


Checking how often the two title columns differ

In [None]:
total = len(Dataset)
dif = Dataset['primaryTitle'] != Dataset['originalTitle']
dif_count = dif.sum()
print(f"Diferentes: {dif_count} de {total} ({dif_count/total:.2%})")
diferentes = Dataset[dif]
print(diferentes[['primaryTitle', 'originalTitle']])


Diferentes: 169572 de 11949946 (1.42%)
                                         primaryTitle  \
2                                        Poor Pierrot   
9                                 Leaving the Factory   
11                             The Arrival of a Train   
12        The Photographical Congress Arrives in Lyon   
13                                The Waterer Watered   
...                                               ...   
11949601                            The Taste Is Mine   
11949622                                The Rehearsal   
11949715                                        Coven   
11949747                          The Secret of China   
11949845                                       Eugène   

                                              originalTitle  
2                                            Pauvre Pierrot  
9                       La sortie de l'usine Lumière à Lyon  
11                         L'arrivée d'un train à La Ciotat  
12        Le débarquement du
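One caveat about the comparison above: with `!=`, a pair of missing values counts as "different", so rows where both titles are missing would inflate the count. Whether the real data contains such rows is not shown here. A minimal sketch, on a hypothetical sample, of a stricter count that ignores rows with a missing title:

```python
import pandas as pd

# Hypothetical sample: the last row has both titles missing
sample = pd.DataFrame(
    {
        "primaryTitle": ["Carmencita", "Poor Pierrot", None],
        "originalTitle": ["Carmencita", "Pauvre Pierrot", None],
    }
)

# Naive comparison, as in the notebook: missing != missing evaluates to True
naive = sample["primaryTitle"] != sample["originalTitle"]

# Stricter comparison: only count rows where both titles are present
both_present = sample[["primaryTitle", "originalTitle"]].notna().all(axis=1)
strict = naive & both_present

print(f"Naive count: {naive.sum()}, strict count: {strict.sum()}")
```

If the two counts match on the full `Dataset`, the 1.42% figure is unaffected by missing titles.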

The two columns hold only the popular title and the original title of each work, and they differ in just 1.42% of the rows, so one of them can be removed.
We will therefore drop the 'primaryTitle' column.

In [None]:
Dataset.drop('primaryTitle', axis=1, inplace=True)
print(Dataset.head())