# A Study of Digital Platforms Using Google Cloud


## Reading and creating datasets

In the first stage of the project, I focus on unifying several datasets obtained from Kaggle, each collected by web scraping one of the major streaming platforms.

In [None]:
import pandas as pd
#!pip install google-colab
from google.colab import auth
# Authenticate the Colab session so later Google Cloud calls can use these credentials
auth.authenticate_user()

The datasets are available in this same repository for future analysis.

In [None]:
hbo='/content/HBO.csv'
amazon='/content/Amazon.csv'
hulu='/content/HULU.csv'
netflix='/content/Netflix.csv'
apple='/content/appleTV.csv'

In [None]:
# Read each CSV file
df_hbo = pd.read_csv(hbo)
df_amazon = pd.read_csv(amazon)
df_hulu = pd.read_csv(hulu)
df_netflix = pd.read_csv(netflix)
df_apple = pd.read_csv(apple)


In [None]:
# Tag every row with the platform it came from
df_hbo = df_hbo.assign(plataforma='HBO')
df_amazon = df_amazon.assign(plataforma='Amazon')
df_hulu = df_hulu.assign(plataforma='Hulu')
df_netflix = df_netflix.assign(plataforma='Netflix')
df_apple = df_apple.assign(plataforma='AppleTV')

In [None]:
# Combine all DataFrames into a single one
df_plataformas = pd.concat([df_hbo, df_amazon, df_hulu, df_netflix, df_apple], ignore_index=True)
df_plataformas=df_plataformas.dropna()
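
The read, tag, and concatenate steps above can also be sketched as one loop over a mapping of platform names to files. This is a minimal illustration, not the notebook's own code: `sample_files` and its two tiny in-memory CSVs are hypothetical stand-ins for the real files.

```python
import io
import pandas as pd

# Hypothetical in-memory stand-ins for two of the per-platform CSV files.
sample_files = {
    "HBO": io.StringIO("title,releaseYear\nUnforgiven,1992\nUntitled,\n"),
    "Netflix": io.StringIO("title,releaseYear\nBergerac,1981\n"),
}

# Read each file and tag its rows with the platform name.
frames = [
    pd.read_csv(source).assign(plataforma=name)
    for name, source in sample_files.items()
]

# Stack the frames; ignore_index=True rebuilds a continuous 0..n-1 index.
combined = pd.concat(frames, ignore_index=True)

# dropna() removes every row that has at least one missing value.
cleaned = combined.dropna()

rows_before = len(combined)
rows_after = len(cleaned)
print(f"rows before dropna: {rows_before}, after: {rows_after}")
```

Because `dropna()` runs after the index is rebuilt, the surviving rows keep their original labels, which would explain gaps in the index of the displayed DataFrame.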


In [None]:
display(df_plataformas)

Unnamed: 0,title,type,genres,releaseYear,imdbId,imdbAverageRating,imdbNumVotes,availableCountries,plataforma
0,The Fifth Element,movie,"Action, Adventure, Sci-Fi",1997.0,tt0119116,7.6,516953.0,"DK, FI, NO, SE",HBO
1,Unforgiven,movie,"Drama, Western",1992.0,tt0105695,8.2,443687.0,"AD, AG, AR, BB, BE, BO, BR, BS, BZ, CL, CO, CR...",HBO
2,Eternal Sunshine of the Spotless Mind,movie,"Drama, Romance, Sci-Fi",2004.0,tt0338013,8.3,1103296.0,"AG, AR, BA, BB, BG, BO, BS, BZ, CL, CO, CR, CZ...",HBO
3,A History of Violence,movie,"Action, Crime, Drama",2005.0,tt0399146,7.4,258921.0,"AG, AR, BB, BO, BR, BS, BZ, CL, CO, CR, DO, EC...",HBO
4,2001: A Space Odyssey,movie,"Adventure, Sci-Fi",1968.0,tt0062622,8.3,734926.0,"AD, AG, AR, BB, BE, BO, BR, BS, BZ, CL, CO, CR...",HBO
...,...,...,...,...,...,...,...,...,...
121415,Nöthin' But A Good Time: The Uncensored Story ...,tv,Documentary,2024.0,tt33210825,7.7,352.0,"AR, AT, AU, BO, BR, CA, CH, CL, CO, CR, DE, DO...",AppleTV
121417,Dating Naked UK,tv,"Game-Show, Reality-TV",2024.0,tt33262257,6.1,132.0,US,AppleTV
121426,Not Going Out,tv,Comedy,2006.0,tt0862614,7.6,8089.0,"CA, US",AppleTV
121427,Bergerac,tv,"Crime, Drama, Mystery",1981.0,tt0081831,6.8,2341.0,"CA, US",AppleTV


The columns need appropriate data types. Here, 'releaseYear' and 'imdbNumVotes' are loaded as float64 by default when they should be integers.

In [None]:
# Reassign the result: in recent pandas versions, assigning through .loc[:, cols] can keep the old float64 dtype
df_plataformas = df_plataformas.astype({'releaseYear': int, 'imdbNumVotes': int})
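
A likely reason these columns arrive as float64 is that the raw files contain missing values, which pandas cannot store in a plain integer column. The sketch below illustrates that assumption on a hypothetical three-row sample. It is not taken from the actual datasets.

```python
import pandas as pd

# Hypothetical sample: one missing value forces the column to float64.
raw = pd.DataFrame({"releaseYear": [1997.0, None, 2004.0]})
dtype_with_gap = str(raw["releaseYear"].dtype)

# Once rows with missing values are dropped, the cast to integer is safe.
clean = raw.dropna().astype({"releaseYear": "int64"})
dtype_after_cast = str(clean["releaseYear"].dtype)

print(dtype_with_gap, "->", dtype_after_cast)
```

Casting before dropping the missing rows would raise an error, so the order of `dropna()` and `astype()` matters.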

The first version of the unified file is saved in CSV format.

In [None]:
df_plataformas.to_csv('/content/plataformas.csv', index=False)

In [None]:
df_plataformas.dtypes

Unnamed: 0,0
title,object
type,object
genres,object
releaseYear,int64
imdbId,object
imdbAverageRating,float64
imdbNumVotes,int64
availableCountries,object
plataforma,object


## Creating list columns for genre and country

Each movie or series stores its genres and the countries where it is available as a single comma-separated string, which would make this data hard to analyze later. It is therefore better to create new columns that hold these values as lists.

In [None]:
df1=df_plataformas.dropna()
# Split the comma-separated strings into Python lists, one list per row
df1=df1.assign(
    genero=df1['genres'].str.split(', '),
    pais=df1['availableCountries'].str.split(', ')
)
display(df1)

Unnamed: 0,title,type,genres,releaseYear,imdbId,imdbAverageRating,imdbNumVotes,availableCountries,plataforma,genero,pais
0,The Fifth Element,movie,"Action, Adventure, Sci-Fi",1997,tt0119116,7.6,516953,"DK, FI, NO, SE",HBO,"[Action, Adventure, Sci-Fi]","[DK, FI, NO, SE]"
1,Unforgiven,movie,"Drama, Western",1992,tt0105695,8.2,443687,"AD, AG, AR, BB, BE, BO, BR, BS, BZ, CL, CO, CR...",HBO,"[Drama, Western]","[AD, AG, AR, BB, BE, BO, BR, BS, BZ, CL, CO, C..."
2,Eternal Sunshine of the Spotless Mind,movie,"Drama, Romance, Sci-Fi",2004,tt0338013,8.3,1103296,"AG, AR, BA, BB, BG, BO, BS, BZ, CL, CO, CR, CZ...",HBO,"[Drama, Romance, Sci-Fi]","[AG, AR, BA, BB, BG, BO, BS, BZ, CL, CO, CR, C..."
3,A History of Violence,movie,"Action, Crime, Drama",2005,tt0399146,7.4,258921,"AG, AR, BB, BO, BR, BS, BZ, CL, CO, CR, DO, EC...",HBO,"[Action, Crime, Drama]","[AG, AR, BB, BO, BR, BS, BZ, CL, CO, CR, DO, E..."
4,2001: A Space Odyssey,movie,"Adventure, Sci-Fi",1968,tt0062622,8.3,734926,"AD, AG, AR, BB, BE, BO, BR, BS, BZ, CL, CO, CR...",HBO,"[Adventure, Sci-Fi]","[AD, AG, AR, BB, BE, BO, BR, BS, BZ, CL, CO, C..."
...,...,...,...,...,...,...,...,...,...,...,...
121415,Nöthin' But A Good Time: The Uncensored Story ...,tv,Documentary,2024,tt33210825,7.7,352,"AR, AT, AU, BO, BR, CA, CH, CL, CO, CR, DE, DO...",AppleTV,[Documentary],"[AR, AT, AU, BO, BR, CA, CH, CL, CO, CR, DE, D..."
121417,Dating Naked UK,tv,"Game-Show, Reality-TV",2024,tt33262257,6.1,132,US,AppleTV,"[Game-Show, Reality-TV]",[US]
121426,Not Going Out,tv,Comedy,2006,tt0862614,7.6,8089,"CA, US",AppleTV,[Comedy],"[CA, US]"
121427,Bergerac,tv,"Crime, Drama, Mystery",1981,tt0081831,6.8,2341,"CA, US",AppleTV,"[Crime, Drama, Mystery]","[CA, US]"
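
As a small illustration of the split step, the sketch below uses a hypothetical two-row sample and shows that each cell becomes a real list whose length can be measured directly.

```python
import pandas as pd

# Hypothetical two-row sample in the same shape as the 'genres' column.
sample = pd.DataFrame({"genres": ["Action, Adventure, Sci-Fi", "Comedy"]})

# str.split(', ') turns each comma-separated string into a list of values.
sample = sample.assign(genero=sample["genres"].str.split(", "))

genre_lists = sample["genero"].tolist()
# .str.len() also works on list cells and returns the number of elements.
genres_per_title = sample["genero"].str.len().tolist()

print(genre_lists)
print(genres_per_title)
```

The separator includes the space after the comma, so the resulting values carry no leading whitespace.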


## Exploding the data

Once the list columns exist, the explode method splits each list into one row per element, here one row per genre and per country, which gives more detailed information.

In [None]:
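# One row per (title, genre, country) combination
df1=df1.explode('genero').explode('pais')
# The original string columns are no longer needed
df1 = df1.drop(columns=['genres', 'availableCountries'])
display(df1)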

Unnamed: 0,title,type,releaseYear,imdbId,imdbAverageRating,imdbNumVotes,plataforma,genero,pais
0,The Fifth Element,movie,1997,tt0119116,7.6,516953,HBO,Action,DK
0,The Fifth Element,movie,1997,tt0119116,7.6,516953,HBO,Action,FI
0,The Fifth Element,movie,1997,tt0119116,7.6,516953,HBO,Action,NO
0,The Fifth Element,movie,1997,tt0119116,7.6,516953,HBO,Action,SE
0,The Fifth Element,movie,1997,tt0119116,7.6,516953,HBO,Adventure,DK
...,...,...,...,...,...,...,...,...,...
121428,Teenage Mutant Ninja Turtles,tv,2003,tt0318913,7.9,13333,AppleTV,Animation,AU
121428,Teenage Mutant Ninja Turtles,tv,2003,tt0318913,7.9,13333,AppleTV,Animation,CA
121428,Teenage Mutant Ninja Turtles,tv,2003,tt0318913,7.9,13333,AppleTV,Animation,GB
121428,Teenage Mutant Ninja Turtles,tv,2003,tt0318913,7.9,13333,AppleTV,Animation,IE
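
Chaining two explode calls produces every genre and country combination for a title, so the row count multiplies. The sketch below reproduces this on a single hypothetical row modeled on the first entry of the table.

```python
import pandas as pd

# One hypothetical title with 3 genres and 4 countries.
sample = pd.DataFrame({
    "title": ["The Fifth Element"],
    "genero": [["Action", "Adventure", "Sci-Fi"]],
    "pais": [["DK", "FI", "NO", "SE"]],
})

# Exploding both list columns in sequence yields 3 x 4 = 12 rows.
exploded = sample.explode("genero").explode("pais")

row_count = len(exploded)
first_rows = exploded[["genero", "pais"]].head(5).values.tolist()
index_labels = exploded.index.unique().tolist()

print(row_count)
print(first_rows)
```

All exploded rows keep the index label of the row they came from, which is why the same index value repeats in the displayed output.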


In [None]:
df1.to_csv('/content/plataformas1.csv', index=False)

In [None]:
# Count rows per country, then sort from most to least frequent
grouped_countries = df1.groupby('pais').size().reset_index(name='conteo')
grouped_countries_sorted = grouped_countries.sort_values(by='conteo', ascending=False)
display(grouped_countries_sorted)

Unnamed: 0,pais,conteo
130,US,92935
44,GB,54673
46,GG,48660
66,JP,46525
58,IE,46373
...,...,...
128,UA,8676
12,BF,7037
22,CD,7031
121,TD,4977
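
Since the DataFrame was exploded by genre as well as by country, the counts above are rows rather than distinct titles: a title with three genres contributes three rows to each of its countries. If distinct titles per country are the goal, one option is to count unique `imdbId` values instead. The data below is a hypothetical sample with made-up IDs.

```python
import pandas as pd

# Hypothetical exploded sample: title tt1 has two genres in the US.
exploded = pd.DataFrame({
    "imdbId": ["tt1", "tt1", "tt1", "tt2"],
    "genero": ["Action", "Sci-Fi", "Action", "Comedy"],
    "pais": ["US", "US", "GB", "US"],
})

# size() counts rows, so multi-genre titles are counted once per genre.
rows_per_country = exploded.groupby("pais").size()

# nunique() on the title identifier counts each title once per country.
titles_per_country = exploded.groupby("pais")["imdbId"].nunique()

print(rows_per_country.to_dict())
print(titles_per_country.to_dict())
```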


## Sending the data to a personal Cloud Storage bucket

For this final step, it is only necessary to sign in with the same GCP account and to create a bucket manually.

In [None]:
from google.cloud import storage
# Name of the manually created bucket, the local file to upload, and its name inside the bucket
bucket_name = 'streaming_platforms'
source_file_name = '/content/plataformas1.csv'
destination_blob_name = 'plataformas.csv'
# The client picks up the credentials of the authenticated session
client = storage.Client()
bucket = client.get_bucket(bucket_name)
blob = bucket.blob(destination_blob_name)
blob.upload_from_filename(source_file_name)
print(f"File {source_file_name} uploaded to {destination_blob_name} in bucket {bucket_name}.")

File /content/plataformas1.csv uploaded to plataformas.csv in bucket streaming_platforms.
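
The notebook creates the bucket by hand. As an alternative, the bucket could be created from code when it is missing. The sketch below is an assumption about how that might look, not part of the original workflow: the helper name `ensure_bucket` and the default location `"US"` are hypothetical, and running it requires the `google-cloud-storage` package plus valid credentials.

```python
def ensure_bucket(bucket_name, location="US", project=None):
    """Return the named bucket, creating it first if it does not exist."""
    # Imported lazily so the helper can be defined without the package installed.
    from google.cloud import storage

    client = storage.Client(project=project)
    # lookup_bucket returns None when the bucket is not found.
    bucket = client.lookup_bucket(bucket_name)
    if bucket is None:
        bucket = client.create_bucket(bucket_name, location=location)
    return bucket
```

A call such as `ensure_bucket('streaming_platforms')` would then replace the manual step. Bucket names are global across Cloud Storage, so the call can still fail if another account already owns the name.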
