---
### Objective:

A comparative analysis of pandas vs. Spark (preferably on Databricks).

### What to do:

Pick two datasets: one larger than 50 MB and one smaller.

Perform data preprocessing and exploratory data analysis using both pandas and PySpark.

Identify when pandas or PySpark is the better choice.

### Chosen datasets:

https://www.kaggle.com/datasets/swatikhedekar/exploratory-data-analysis-on-netflix-data

Public dataset with information about the series and movies available on the Netflix streaming platform from 2008 to 2021.

Dataset size = **3.4 MB**


https://www.kaggle.com/datasets/peacehegemony/history-of-music-bnb

Dataset with metadata on the music catalogue of the British Library at http://explore.bl.uk

Dataset size = **261 MB**

---

### pandas
import pandas as pd

df1 = pd.read_csv("/dbfs/FileStore/shared_uploads/briancamargos@gmail.com/netflix_titles_2021.csv")

### Spark

df1 = spark.read.format("csv").option("header", "true").load("dbfs:/FileStore/shared_uploads/briancamargos@gmail.com/netflix_titles_2021.csv")

---
### Working with the smaller dataset using pandas
---

In [1]:
# Importing the libraries that will be used

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [2]:
# Reading the CSV file

%time netflix = pd.read_csv('netflix_titles_2021.csv')
%time netflix.head(10)

CPU times: total: 93.8 ms
Wall time: 108 ms
CPU times: total: 0 ns
Wall time: 0 ns


Unnamed: 0,show_id,type,title,director,cast,country,date_added,release_year,rating,duration,listed_in,description
0,s1,Movie,Dick Johnson Is Dead,Kirsten Johnson,,United States,"September 25, 2021",2020,PG-13,90 min,Documentaries,"As her father nears the end of his life, filmm..."
1,s2,TV Show,Blood & Water,,"Ama Qamata, Khosi Ngema, Gail Mabalane, Thaban...",South Africa,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, TV Dramas, TV Mysteries","After crossing paths at a party, a Cape Town t..."
2,s3,TV Show,Ganglands,Julien Leclercq,"Sami Bouajila, Tracy Gotoas, Samuel Jouy, Nabi...",,"September 24, 2021",2021,TV-MA,1 Season,"Crime TV Shows, International TV Shows, TV Act...",To protect his family from a powerful drug lor...
3,s4,TV Show,Jailbirds New Orleans,,,,"September 24, 2021",2021,TV-MA,1 Season,"Docuseries, Reality TV","Feuds, flirtations and toilet talk go down amo..."
4,s5,TV Show,Kota Factory,,"Mayur More, Jitendra Kumar, Ranjan Raj, Alam K...",India,"September 24, 2021",2021,TV-MA,2 Seasons,"International TV Shows, Romantic TV Shows, TV ...",In a city of coaching centers known to train I...
5,s6,TV Show,Midnight Mass,Mike Flanagan,"Kate Siegel, Zach Gilford, Hamish Linklater, H...",,"September 24, 2021",2021,TV-MA,1 Season,"TV Dramas, TV Horror, TV Mysteries",The arrival of a charismatic young priest brin...
6,s7,Movie,My Little Pony: A New Generation,"Robert Cullen, José Luis Ucha","Vanessa Hudgens, Kimiko Glenn, James Marsden, ...",,"September 24, 2021",2021,PG,91 min,Children & Family Movies,Equestria's divided. But a bright-eyed hero be...
7,s8,Movie,Sankofa,Haile Gerima,"Kofi Ghanaba, Oyafunmike Ogunlano, Alexandra D...","United States, Ghana, Burkina Faso, United Kin...","September 24, 2021",1993,TV-MA,125 min,"Dramas, Independent Movies, International Movies","On a photo shoot in Ghana, an American model s..."
8,s9,TV Show,The Great British Baking Show,Andy Devonshire,"Mel Giedroyc, Sue Perkins, Mary Berry, Paul Ho...",United Kingdom,"September 24, 2021",2021,TV-14,9 Seasons,"British TV Shows, Reality TV",A talented batch of amateur bakers face off in...
9,s10,Movie,The Starling,Theodore Melfi,"Melissa McCarthy, Chris O'Dowd, Kevin Kline, T...",United States,"September 24, 2021",2021,PG-13,104 min,"Comedies, Dramas",A woman adjusting to life after a loss contend...


In [None]:
# Checking the size of the dataset (rows, columns)

%time netflix.shape

In [None]:
# Renaming the columns

colunas = {'type':'tipo', 'title':'titulo', 'director':'diretor', 'cast':'elenco',
            'country':'pais', 'date_added':'adicionado_em','release_year':'ano_lancamento',
            'rating':'avaliacao', 'duration':'duracao', 'listed_in':'categorias', 'description':'descricao'}

%time netflix.rename(columns=colunas, inplace=True)

In [None]:
%time netflix.head()

In [None]:
# Count and percentage of null values in the DataFrame

%time frequencia = netflix.isna().sum() 
%time percentual_na = round((netflix.isna().sum() / len(netflix))*100, 2)
%time valores_na = pd.DataFrame([frequencia, percentual_na]).transpose().rename(columns={0:'Frequência', 1:'% de NAs'})
valores_na
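
The same summary can be built from a single boolean mask, since the mean of `isna()` is the fraction of missing values in each column. A minimal sketch on a small made-up frame (the `exemplo` data is illustrative and not taken from the Netflix file):

```python
import pandas as pd

# Small made-up frame standing in for the Netflix data (illustrative only)
exemplo = pd.DataFrame({
    "titulo": ["A", "B", "C", "D"],
    "diretor": ["X", None, None, "Y"],
    "pais": [None, "BR", "US", "US"],
})

# isna() yields a boolean mask; its sum is the null count and its mean the null fraction
mascara = exemplo.isna()
valores_na = pd.DataFrame({
    "Frequência": mascara.sum(),
    "% de NAs": (mascara.mean() * 100).round(2),
})
print(valores_na)
```

Computing the mask once avoids scanning the frame twice, which matters more on the larger dataset than on this one.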

In [None]:
%time netflix.info()

In [None]:
# Reducing the memory footprint of the ano_lancamento column

%time netflix['ano_lancamento'] = netflix["ano_lancamento"].astype('int16')

In [None]:
netflix.info()
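
To see what a narrower integer type saves, the memory use can be measured before and after the conversion. The sketch below uses made-up years in place of the real `ano_lancamento` column and lets `pd.to_numeric` choose the type instead of hard-coding `int16`; both choices are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Made-up release years standing in for ano_lancamento (illustrative only)
anos = pd.Series(np.arange(1925, 2022, dtype="int64").repeat(100), name="ano_lancamento")

antes = anos.memory_usage(deep=True)

# downcast="integer" picks the smallest signed integer type that holds every value
compacto = pd.to_numeric(anos, downcast="integer")
depois = compacto.memory_usage(deep=True)

print(anos.dtype, antes, "bytes")
print(compacto.dtype, depois, "bytes")
```

Letting pandas choose the type avoids silent overflow if a value ever falls outside the range of the type picked by hand.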

---
### Working with the smaller dataset using Spark
---

In [3]:
# Importing the libraries
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

In [5]:
Netflix_Spark = spark.read.csv("netflix_titles_2021.csv", header = True)
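
The pandas steps from the previous section (renaming the columns and counting nulls) have direct Spark counterparts. The helpers below are a hedged sketch: `renomear_colunas` and `contar_nulos` are hypothetical names introduced here, and they assume column names without dots or other characters that `F.col` would need escaped.

```python
def renomear_colunas(df, colunas):
    """Return a Spark DataFrame with columns renamed according to the dict."""
    for antigo, novo in colunas.items():
        df = df.withColumnRenamed(antigo, novo)
    return df


def contar_nulos(df):
    """Return a one-row Spark DataFrame holding the null count of each column."""
    # Imported here so the helpers can be defined even where pyspark is not installed
    from pyspark.sql import functions as F

    # when(...) yields 1 for null cells and null otherwise; count ignores nulls
    return df.select([
        F.count(F.when(F.col(c).isNull(), 1)).alias(c)
        for c in df.columns
    ])


# Usage, assuming the session and DataFrame from the cells above:
# Netflix_Spark = renomear_colunas(Netflix_Spark, colunas)
# contar_nulos(Netflix_Spark).show()
```

Note that `spark.read.csv` with only `header=True` reads every column as a string; passing `inferSchema=True` asks Spark to detect numeric types at the cost of an extra pass over the file.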

---
### Working with the larger dataset using pandas
---

In [11]:
# Reading the CSV file

%time musicas = pd.read_csv('detailedrecords.csv', engine='python', on_bad_lines='skip')
%time musicas.head(10)

In [12]:
# Checking the size of the dataset (rows, columns)

%time musicas.shape

CPU times: total: 0 ns
Wall time: 0 ns


(81269, 23)
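
Because `on_bad_lines='skip'` drops malformed rows silently, the row count above may be lower than the number of records in the file. A minimal sketch of counting the dropped rows, assuming pandas 1.4 or later, where `on_bad_lines` accepts a callable when `engine='python'`; the CSV text and the `registrar_linha_ruim` helper are made up for illustration:

```python
import io

import pandas as pd

# Tiny made-up CSV with one malformed row (three fields under a two-column header)
texto_csv = "a,b\n1,2\n3,4,5\n6,7\n"

linhas_ruins = []


def registrar_linha_ruim(campos):
    """Record the malformed row and return None so pandas drops it."""
    linhas_ruins.append(campos)
    return None


# A callable for on_bad_lines requires engine="python"
df = pd.read_csv(io.StringIO(texto_csv), engine="python", on_bad_lines=registrar_linha_ruim)

print(len(df), "rows kept;", len(linhas_ruins), "rows skipped")
```

Knowing how many rows were skipped helps when comparing against Spark, which handles malformed CSV rows differently, so the two row counts are not guaranteed to match.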

In [17]:
# Renaming the columns

colunas = {'BL record ID':'ID da gravação', 'Composer':'Compositor', 'Composer life dates':'Datas de vida do compositor', 'Title':'Titulo',
            'Standardised title':'Titulo padronizado', 'Other titles':'Outros titulos','Other names':'Outros nomes',
            'Publication date (standardised)':'Data de publicação (padronizada)', 'Publication date (not standardised)':'Data de publicação (não padronizada)',
             'Country of publication':'País de publicação', 'Place of publication':'Local de publicação', 'Publisher':'Editora','Notes':'Notas',
             'Contents':'Conteudos extras','Referenced in':'Referencias' ,'Subject/genre terms':'Genero','Physical description':'Descrição física',
             'Series title':'Titulo da série','Number within series':'Número dentro da série',
             'Publisher number':'Número da editora','BL shelfmark':'Marca de Prateleira'
            }

%time musicas.rename(columns=colunas, inplace=True)

CPU times: total: 0 ns
Wall time: 1.04 ms


In [18]:
%time musicas.head()

CPU times: total: 0 ns
Wall time: 0 ns


Unnamed: 0,ID da gravação,Compositor,Datas de vida do compositor,Titulo,Titulo padronizado,Outros titulos,Outros nomes,Data de publicação (padronizada),Data de publicação (não padronizada),País de publicação,...,Conteudos extras,Referencias,Genero,Descrição física,Titulo da série,Número dentro da série,ISBN,ISMN,Número da editora,Marca de Prateleira
0,1279866,,,The Penguin book of Canadian Folk songs,,Canadian folk songs,"Fowke, Edith ; MacMillan, Keith (Keith Campbell)",1973.0,1973,England,...,A Fenian song -- Bold Wolfe -- The battle of t...,,"Folk songs--Canada ; Folk songs, French--Canada","224 pages, music, 20 cm",,,140708421.0,,,mH00/3305 ; X.439/3548
1,1312079,,,Anthology for The musician's guide to theory a...,,"Musician's guide, anthology ; Musician's guide...","Clendinning, Jane Piper ; Marvin, Elizabeth West",2005.0,c2005,New York (State),...,,,Musical analysis--Music collections,"1 score (vii, 260 pages), 28 cm",,,393925765.0,,,F.1946.h
2,1706700,"Hofhaimer, Paul",1459-1537,Harmoniae poeticae Pauli Hofheimeri : viri equ...,Harmoniae poeticae,"Harmoniae poeticae Pauli Hofheimeri, & Ludovic...","Horace ; Stomius, Johannes ; Senfl, Ludwig",1539.0,1539,Germany,...,,RISM B/I 1539²⁶,"Part songs, Latin","5 parts, 8°",,,,,,1070.c.12.(1.) ; 1213.i.1 ; G.727
3,1825532,,,Istituzioni e monumenti dell'arte musicale ita...,,,,1931.0,1931-1939,Italy,...,v. 1-2. Andrea e Giovanni Gabrieli e la musica...,,"Instrumental music ; Vocal music, Italian ; Sa...","6 volumes, illustrations, facsimiles (part col...",,,,,,Hirsch IV.975 ; H.14
4,2225270,"Chapman, Mary, (Musician)",,Eight ball studies suitable for use in girls' ...,,,,1940.0,1940,England,...,,,Piano music ; Physical education and training,"1 score (12 pages ), 26 cm + 1 volume (8 pages...",,,,,,D-07907.f.10
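
In the output above the ISBN column shows values such as `140708421.0`, which is what a column of digits with missing entries looks like after pandas infers a float type. ISBNs are identifiers and may start with zero, so reading them as text is usually safer. A sketch with made-up data, assuming that type inference is the cause here:

```python
import io

import pandas as pd

# Made-up two-row CSV; the second record has no ISBN (illustrative only)
texto_csv = "id,ISBN\n1,0140708421\n2,\n"

# Default inference: the missing value forces a float column and the leading zero is lost
padrao = pd.read_csv(io.StringIO(texto_csv))

# Declaring the column as text keeps the identifier exactly as written
como_texto = pd.read_csv(io.StringIO(texto_csv), dtype={"ISBN": "string"})

print(padrao["ISBN"].tolist())
print(como_texto["ISBN"].tolist())
```

The same `dtype` argument could be passed to the `pd.read_csv` call that loads `detailedrecords.csv`.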


In [13]:
musicas

Unnamed: 0,BL record ID,Composer,Composer life dates,Title,Standardised title,Other titles,Other names,Publication date (standardised),Publication date (not standardised),Country of publication,...,Contents,Referenced in,Subject/genre terms,Physical description,Series title,Number within series,ISBN,ISMN,Publisher number,BL shelfmark
0,1279866,,,The Penguin book of Canadian Folk songs,,Canadian folk songs,"Fowke, Edith ; MacMillan, Keith (Keith Campbell)",1973.0,1973,England,...,A Fenian song -- Bold Wolfe -- The battle of t...,,"Folk songs--Canada ; Folk songs, French--Canada","224 pages, music, 20 cm",,,0140708421,,,mH00/3305 ; X.439/3548
1,1312079,,,Anthology for The musician's guide to theory a...,,"Musician's guide, anthology ; Musician's guide...","Clendinning, Jane Piper ; Marvin, Elizabeth West",2005.0,c2005,New York (State),...,,,Musical analysis--Music collections,"1 score (vii, 260 pages), 28 cm",,,0393925765,,,F.1946.h
2,1706700,"Hofhaimer, Paul",1459-1537,Harmoniae poeticae Pauli Hofheimeri : viri equ...,Harmoniae poeticae,"Harmoniae poeticae Pauli Hofheimeri, & Ludovic...","Horace ; Stomius, Johannes ; Senfl, Ludwig",1539.0,1539,Germany,...,,RISM B/I 1539²⁶,"Part songs, Latin","5 parts, 8°",,,,,,1070.c.12.(1.) ; 1213.i.1 ; G.727
3,1825532,,,Istituzioni e monumenti dell'arte musicale ita...,,,,1931.0,1931-1939,Italy,...,v. 1-2. Andrea e Giovanni Gabrieli e la musica...,,"Instrumental music ; Vocal music, Italian ; Sa...","6 volumes, illustrations, facsimiles (part col...",,,,,,Hirsch IV.975 ; H.14
4,2225270,"Chapman, Mary, (Musician)",,Eight ball studies suitable for use in girls' ...,,,,1940.0,1940,England,...,,,Piano music ; Physical education and training,"1 score (12 pages ), 26 cm + 1 volume (8 pages...",,,,,,D-07907.f.10
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
81264,4243946,"Bryan, Robert",1858-1920,Noswyl yn y Dwyrain. Night-watch in the East. ...,,,,1925.0,c 1925,England,...,,,,"8 pages, 8°",,,,,,E.352.(2.)
81265,4243947,"Bryan, Robert",1858-1920,O Dad gwroniaid. Father of Heroes. Chorus for ...,,,,1920.0,1920,England,...,,,,8°,,,,,,F.163.u.(15.)
81266,4243948,"Bryan, Robert",1858-1920,"O llefara, addfwyn Iesu. Speak, oh gentle Jesu...",,,,1922.0,1922,England,...,,,,8°,,,,,,E.602.ff.(24*.)
81267,4243949,"Bryan, Robert",1858-1920,(O serch na fyn fy ngollwng i.) O Love that wi...,,,,1922.0,1922,England,...,,,,folio,,,,,,H.1186.e.(4.)


---
### Working with the larger dataset using Spark
---

---
### Comparing the results
---

In [None]:
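import time

# funcao is a placeholder for the pandas or Spark operation being timed
inicio = time.perf_counter()  # monotonic clock intended for measuring intervals
funcao()
fim = time.perf_counter()
print(fim - inicio)  # elapsed seconds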
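
A single measurement is noisy, and several of the `%time` outputs earlier in this notebook report 0 ns, which suggests the operation was faster than the timer could resolve. One option is to repeat the call and report the median. The `medir` helper below is a hypothetical sketch, and the workload passed to it is a stand-in:

```python
import statistics
import time


def medir(funcao, repeticoes=5):
    """Run funcao several times and return the wall-clock durations in seconds."""
    duracoes = []
    for _ in range(repeticoes):
        inicio = time.perf_counter()
        funcao()
        duracoes.append(time.perf_counter() - inicio)
    return duracoes


# Stand-in workload (illustrative); replace with the pandas or Spark step to compare
duracoes = medir(lambda: sum(range(100_000)))
print(f"median: {statistics.median(duracoes):.6f} s over {len(duracoes)} runs")
```

When timing Spark, the function should end in an action such as `count()` or `collect()`: transformations are lazy, so timing them alone measures only the construction of the query plan.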