# **Author: Victor Matheus (ED07)**

In [1]:
import pandas as pd
import numpy as np

## About Dataset

YouTube was created in 2005, with the first video, "Me at the zoo", uploaded on 23 April 2005. Since then, 1.3 billion people have set up YouTube accounts. In 2018, people watched nearly 5 billion videos each day. People upload 300 hours of video to the site every minute.

According to 2016 research undertaken by Pexeso, music accounts for only 4.3% of YouTube’s content, yet it draws 11% of the views. Clearly, an awful lot of people watch a comparatively small number of music videos. It should be no surprise, therefore, that the most watched videos of all time on YouTube are predominantly music videos.

On August 13, BTS became the most-viewed artist in YouTube history, accumulating over 26.7 billion views across all their official channels. This count includes all music videos and dance practice videos.

Justin Bieber and Ed Sheeran now hold the records for second and third-highest views, with over 26 billion views each.

Currently, BTS’s most viewed videos are their music videos for “Boy With Luv,” “Dynamite,” and “DNA,” which all have over 1.4 billion views.

#### **Headers of the Dataset**

- **Total** = Total views (in millions) across all official channels
- **Avg** = Current daily average of all videos combined
- **100M** = Number of videos with more than 100 million views

## **1. Download the Top YouTube Artists dataset from Kaggle.**

In [2]:
# Kaggle: https://www.kaggle.com/datasets/themrityunjaypathak/top-youtube-artist?resource=download
data = pd.read_csv("topyoutube.csv")
data

Unnamed: 0,Artist,Total Views,100M,Avg
0,BTS,27947.9,63.0,7.370
1,Bad Bunny,27573.4,66.0,14.555
2,Justin Bieber,27399.7,37.0,3.986
3,Ed Sheeran,26894.0,39.0,4.609
4,Taylor Swift,24350.0,38.0,5.716
...,...,...,...,...
1582,Pietro Lombardi,204.1,,0.022
1583,Duffy,203.2,,0.020
1584,Psirico,201.1,1.0,0.039
1585,Alex Clare,200.8,,0.020
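
The `Unnamed: 0` column in the output above suggests that the file stores the row index as its first, unnamed column. A minimal sketch of reading such a file without that extra column, assuming a hypothetical three-row sample in the same layout (the rows and the `sample` name are illustrative, not the real file):

```python
import io
import pandas as pd

# Hypothetical sample in the same layout as the file: the first column has no name.
sample_csv = io.StringIO(
    ",Artist,Total Views,100M,Avg\n"
    "0,BTS,27947.9,63.0,7.370\n"
    "1,Bad Bunny,27573.4,66.0,14.555\n"
    "2,Duffy,203.2,,0.020\n"
)

# index_col=0 uses the first column as the row index instead of a data column.
sample = pd.read_csv(sample_csv, index_col=0)
print(sample.columns.tolist())
```

With `index_col=0` the frame keeps only the four data columns, so a later `to_csv(..., index=False)` would not write the old index back out as data.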


## **2. Handle columns with missing information by filling them with the text “Não informado”.**

In [3]:
# This raises a warning, because the column holds "float64" values while "Não informado" is a string.
data.fillna("Não informado", inplace=True)
data


Unnamed: 0,Artist,Total Views,100M,Avg
0,BTS,27947.9,63.0,7.370
1,Bad Bunny,27573.4,66.0,14.555
2,Justin Bieber,27399.7,37.0,3.986
3,Ed Sheeran,26894.0,39.0,4.609
4,Taylor Swift,24350.0,38.0,5.716
...,...,...,...,...
1582,Pietro Lombardi,204.1,Não informado,0.022
1583,Duffy,203.2,Não informado,0.020
1584,Psirico,201.1,1.0,0.039
1585,Alex Clare,200.8,Não informado,0.020
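
The warning mentioned in the cell comes from writing a string into a `float64` column. One possible way around it, sketched here on a hypothetical three-row sample, is to cast the column to `object` before filling, so that a string is a valid value for it:

```python
import numpy as np
import pandas as pd

# Hypothetical sample: "100M" is float64 with one missing value.
sample = pd.DataFrame({
    "Artist": ["BTS", "Duffy", "Psirico"],
    "100M": [63.0, np.nan, 1.0],
})

# Cast to object first, then fill only the missing entries with the text.
sample["100M"] = sample["100M"].astype("object").fillna("Não informado")
print(sample["100M"].tolist())
```

The trade-off is the same as in the cell above: the column now mixes numbers and text, so it no longer behaves as a numeric column in later arithmetic.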


## **3. Format the total subscribers column by multiplying by 100 million.**

In [4]:
# The dataset has no subscriber column, so this step works on "Total Views".
# Since the column has dtype "object", it must first be converted to numeric values (float64 here)
# for the multiplication to make sense.
data["Total Views"].dtypes

dtype('O')
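
When a column that should be numeric comes back as `object`, it can help to look at which values fail to parse. A minimal sketch, assuming a hypothetical sample series (the `views` and `unparsed` names and the values are illustrative):

```python
import pandas as pd

# Hypothetical sample of an object column mixing plain and comma-formatted numbers.
views = pd.Series(["27,947.9", "204.1", "1,234.5"], dtype="object")

# errors="coerce" turns anything that cannot be parsed into NaN,
# so the NaN positions mark the values that block a numeric dtype.
unparsed = views[pd.to_numeric(views, errors="coerce").isna()]
print(unparsed.tolist())
```

Here only the entries containing a thousands separator are reported, which points to the commas as the thing to remove.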

In [5]:
# However, since some values contain commas, "astype" cannot be applied
# until they are removed, which is what the replace below does.
data["Total Views"] = data["Total Views"].replace(",", "", regex=True)
data

Unnamed: 0,Artist,Total Views,100M,Avg
0,BTS,27947.9,63.0,7.370
1,Bad Bunny,27573.4,66.0,14.555
2,Justin Bieber,27399.7,37.0,3.986
3,Ed Sheeran,26894.0,39.0,4.609
4,Taylor Swift,24350.0,38.0,5.716
...,...,...,...,...
1582,Pietro Lombardi,204.1,Não informado,0.022
1583,Duffy,203.2,Não informado,0.020
1584,Psirico,201.1,1.0,0.039
1585,Alex Clare,200.8,Não informado,0.020


In [6]:
# Now the cast to float works
data["Total Views"] = data["Total Views"].astype('float')
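
As an alternative to removing the commas and casting afterwards, `pd.read_csv` accepts a `thousands` argument that strips the separator at load time. A sketch on a hypothetical two-row sample, assuming the views field is quoted in the file:

```python
import io
import pandas as pd

# Hypothetical sample where the views column uses "," as a thousands separator.
sample_csv = io.StringIO(
    'Artist,Total Views\n'
    'BTS,"27,947.9"\n'
    'Duffy,203.2\n'
)

# thousands="," tells the parser to drop the separator while reading numeric fields.
sample = pd.read_csv(sample_csv, thousands=",")
print(sample["Total Views"].dtype)
```

With this option the column is numeric from the start, so the separate replace and `astype` cells would not be needed.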

In [7]:
# The task statement asks for a factor of 100 million; this cell applies a factor of 100.
data["Total Views"] = data["Total Views"]*100
data

Unnamed: 0,Artist,Total Views,100M,Avg
0,BTS,2794790.0,63.0,7.370
1,Bad Bunny,2757340.0,66.0,14.555
2,Justin Bieber,2739970.0,37.0,3.986
3,Ed Sheeran,2689400.0,39.0,4.609
4,Taylor Swift,2435000.0,38.0,5.716
...,...,...,...,...
1582,Pietro Lombardi,20410.0,Não informado,0.022
1583,Duffy,20320.0,Não informado,0.020
1584,Psirico,20110.0,1.0,0.039
1585,Alex Clare,20080.0,Não informado,0.020
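
The dataset description states that the total is given in millions of views. If the intent of this step is to show absolute view counts, which is an assumption and not something the task statement confirms, the factor would be one million. A minimal sketch using two values from the original file (the `VIEWS_UNIT` name is hypothetical):

```python
import pandas as pd

VIEWS_UNIT = 1_000_000  # assumption: "Total Views" is expressed in millions

totals_in_millions = pd.Series([27947.9, 203.2], index=["BTS", "Duffy"])

# Scale, round away floating-point noise, and store as whole numbers.
absolute_views = (totals_in_millions * VIEWS_UNIT).round().astype("int64")
print(absolute_views.to_dict())
```

Keeping the factor in a named constant makes it easy to switch to whatever multiplier the exercise actually requires.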


## **4. Format the “Avg” column to 2 decimal places.**

In [8]:
data["Avg"].dtypes

dtype('float64')

In [9]:
data["Avg"] = np.round(data["Avg"], 2)
data

Unnamed: 0,Artist,Total Views,100M,Avg
0,BTS,2794790.0,63.0,7.37
1,Bad Bunny,2757340.0,66.0,14.56
2,Justin Bieber,2739970.0,37.0,3.99
3,Ed Sheeran,2689400.0,39.0,4.61
4,Taylor Swift,2435000.0,38.0,5.72
...,...,...,...,...
1582,Pietro Lombardi,20410.0,Não informado,0.02
1583,Duffy,20320.0,Não informado,0.02
1584,Psirico,20110.0,1.0,0.04
1585,Alex Clare,20080.0,Não informado,0.02
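
Rounding changes the stored values but does not control how many digits are displayed, so a value such as 5.7 is still shown with a single decimal. A small sketch contrasting rounding with string formatting, on hypothetical sample values:

```python
import pandas as pd

avg = pd.Series([7.370, 3.986, 5.7])  # hypothetical sample values

rounded = avg.round(2)                      # still float64; the values are rounded
formatted = avg.map(lambda v: f"{v:.2f}")   # strings; always two decimals shown

print(rounded.tolist())
print(formatted.tolist())
```

Rounding keeps the column numeric for later sorting, while the formatted version is better suited to a final report, since it is text.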


## **5. Show the top 10 users.**

### Description:
#### **Headers of the Dataset**

- **Total** = Total views (in millions) across all official channels
- **Avg** = Current daily average of all videos combined
- **100M** = Number of videos with more than 100 million views

In [10]:
data.sort_values(by = ["Total Views", "100M", "Avg"], ascending=False).head(10)

Unnamed: 0,Artist,Total Views,100M,Avg
0,BTS,2794790.0,63.0,7.37
1,Bad Bunny,2757340.0,66.0,14.56
2,Justin Bieber,2739970.0,37.0,3.99
3,Ed Sheeran,2689400.0,39.0,4.61
4,Taylor Swift,2435000.0,38.0,5.72
5,Shakira,2396180.0,43.0,7.84
6,Katy Perry,2355340.0,25.0,3.18
7,Ozuna,2252480.0,49.0,4.93
8,Eminem,2093770.0,38.0,5.84
9,Ariana Grande,2061850.0,37.0,3.37
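
When only one numeric column decides the ranking, `DataFrame.nlargest` is a compact alternative to sorting and taking the head. A sketch on a hypothetical four-row sample, using values from the output above:

```python
import pandas as pd

sample = pd.DataFrame({
    "Artist": ["Ed Sheeran", "BTS", "Duffy", "Bad Bunny"],
    "Total Views": [2689400.0, 2794790.0, 20320.0, 2757340.0],
})

# Keep the 2 rows with the largest "Total Views", already ordered from high to low.
top2 = sample.nlargest(2, "Total Views")
print(top2["Artist"].tolist())
```

Note that `nlargest` needs a numeric column, so it would not accept a column that mixes numbers with text such as "Não informado".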


In [11]:
# data.sort_values(by = ["Avg"], ascending=False)

## **6. Show the first 100 users in descending order by name.**

In [12]:
data.sort_values(by = ["Artist"], ascending=False).head(100)

Unnamed: 0,Artist,Total Views,100M,Avg
1374,Łobuzy,39650.0,2.0,0.11
537,İrem Derici,171730.0,3.0,0.32
634,Ñengo Flow,138880.0,4.0,0.82
1032,Çağatay Akman,74270.0,2.0,0.05
350,will.i.am,258200.0,6.0,0.30
...,...,...,...,...
516,Via Vallen,178750.0,3.0,0.08
1507,Veysel Mutlu,27740.0,1.0,0.03
1358,Veysel,41220.0,Não informado,0.06
891,Vengaboys,93480.0,3.0,0.34
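
The order above, with "will.i.am" placed ahead of every capitalized name, follows from how strings are compared: character by character, by Unicode code point. A short sketch in plain Python, using four names from the dataset:

```python
names = ["Zedd", "will.i.am", "Łobuzy", "Vengaboys"]

# Code point of each first character: lowercase ASCII letters come after
# uppercase ones, and non-ASCII letters such as "Ł" come after both.
first_chars = {name: ord(name[0]) for name in names}
print(first_chars)

descending = sorted(names, reverse=True)
print(descending)
```

So in a descending sort, names starting with non-ASCII letters come first, then lowercase names, then uppercase ones.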


In [13]:
# Convert all names to lowercase so that an initial capital letter does not
# affect the comparison and disturb the desired order.
temp = data.copy()
temp["Artist"] = temp["Artist"].str.lower()

In [14]:
first_100_alpha_decr_artists = temp.sort_values(by = ["Artist"], ascending=False).head(100)["Artist"].to_list()
first_100_alpha_decr_artists

['łobuzy',
 'ñengo flow',
 'çağatay akman',
 'zé vaqueiro',
 'zé ramalho',
 'zé neto e cristiano',
 'zé felipe',
 'zoé',
 'zouhair bahaoui',
 'zion & lennox',
 'zhu',
 'zendaya',
 'zedd',
 'zaz',
 'zayn',
 'zay & zayion',
 'zara larsson',
 'zak chumpae',
 'zaho',
 'zack knight',
 'yuridia',
 'youngohm',
 'youngboy never broke again',
 'young thug',
 'young money',
 'young m.a',
 'yohani',
 'yo yo honey singh',
 'yo gotti',
 'ynw melly',
 'ylvis',
 'yinglee srijumpol',
 'yg',
 'yfn lucci',
 'yemi alade',
 'yellow claw',
 'yella beezy',
 'yelawolf',
 'years & years',
 'yeah yeah yeahs',
 'yasmin verissimo',
 'yash narvekar',
 'yandel',
 'y2k',
 'xxxtentacion',
 'xantos',
 'x ambassadors',
 'wyclef jean',
 'woodkid',
 'wonder girls',
 'wolfine',
 'wiz khalifa',
 'within temptation',
 'wisin & yandel',
 'wisin',
 'winner',
 'willy william',
 'willow smith',
 'will.i.am',
 "why don't we",
 'whitney houston',
 'whitesnake',
 'wheatus',
 'wham!',
 'westlife',
 'wesley safadão',
 'weird al yan
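
Lowercasing a copy of the data gives the desired order but also changes how the names are displayed. A possible alternative, sketched on a hypothetical four-row sample, is the `key` argument of `sort_values`, which applies the transformation only for the comparison:

```python
import pandas as pd

sample = pd.DataFrame({"Artist": ["will.i.am", "Zedd", "Wisin", "Vengaboys"]})

# key receives the column as a Series and returns the values to compare;
# the rows themselves keep their original capitalization.
ordered = sample.sort_values(by="Artist", key=lambda s: s.str.lower(), ascending=False)
print(ordered["Artist"].tolist())
```

The result has the same order as the lowercase list, with the original spelling of each name preserved.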

## **7. Save a new CSV with the cleaned data.**

In [15]:
data.head()

Unnamed: 0,Artist,Total Views,100M,Avg
0,BTS,2794790.0,63.0,7.37
1,Bad Bunny,2757340.0,66.0,14.56
2,Justin Bieber,2739970.0,37.0,3.99
3,Ed Sheeran,2689400.0,39.0,4.61
4,Taylor Swift,2435000.0,38.0,5.72


In [16]:
data.to_csv("topyoutube_tratado.csv", index=False)
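
A quick way to check the saved file is to read it back. The sketch below does that on a hypothetical two-row sample written to a temporary directory (the file name is illustrative), and shows one side effect of the "Não informado" fill:

```python
import os
import tempfile
import pandas as pd

sample = pd.DataFrame({
    "Artist": ["BTS", "Duffy"],
    "100M": [63.0, "Não informado"],
})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample_tratado.csv")  # hypothetical file name
    sample.to_csv(path, index=False)                # index=False: no row index in the file
    restored = pd.read_csv(path)

print(restored.columns.tolist())
print(restored["100M"].tolist())
```

Because the column mixes numbers and text, it is read back as text, so 63.0 returns as the string "63.0". Anyone loading the cleaned file would need to handle that column accordingly.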