# **Analysis of Netflix data from Kaggle with PySpark**

In [48]:
from pyspark.sql import SparkSession

# Build a local Spark session for the analysis.
spark = SparkSession.builder \
    .appName("Netflix Analysis") \
    .master("local[*]") \
    .config("spark.executor.memory", "4g") \
    .config("spark.driver.memory", "2g") \
    .config("spark.executor.cores", "2") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .config("spark.sql.files.maxPartitionBytes", "128MB") \
    .config("spark.sql.shuffle.partitions", "200") \
    .config("spark.sql.execution.arrow.pyspark.enabled", "true") \
    .getOrCreate()

We read our CSV file

In [49]:
df_netflix = spark.read.csv("netflix_titles.csv", header=True, inferSchema=True)

In [50]:
df_netflix.show()

+-------+-------+--------------------+--------------------+--------------------+--------------------+------------------+------------+------+---------+--------------------+--------------------+
|show_id|   type|               title|            director|                cast|             country|        date_added|release_year|rating| duration|           listed_in|         description|
+-------+-------+--------------------+--------------------+--------------------+--------------------+------------------+------------+------+---------+--------------------+--------------------+
|     s1|  Movie|Dick Johnson Is Dead|     Kirsten Johnson|                NULL|       United States|September 25, 2021|        2020| PG-13|   90 min|       Documentaries|As her father nea...|
|     s2|TV Show|       Blood & Water|                NULL|Ama Qamata, Khosi...|        South Africa|September 24, 2021|        2021| TV-MA|2 Seasons|International TV ...|After crossing pa...|
|     s3|TV Show|           Ganglan

I inspect the schema and the columns of our data

In [51]:
df_netflix.printSchema()

root
 |-- show_id: string (nullable = true)
 |-- type: string (nullable = true)
 |-- title: string (nullable = true)
 |-- director: string (nullable = true)
 |-- cast: string (nullable = true)
 |-- country: string (nullable = true)
 |-- date_added: string (nullable = true)
 |-- release_year: string (nullable = true)
 |-- rating: string (nullable = true)
 |-- duration: string (nullable = true)
 |-- listed_in: string (nullable = true)
 |-- description: string (nullable = true)



In [52]:
print(df_netflix.columns)

['show_id', 'type', 'title', 'director', 'cast', 'country', 'date_added', 'release_year', 'rating', 'duration', 'listed_in', 'description']


In [53]:
from pyspark.sql.functions import col, count

In [54]:
df_netflix.show(10)

+-------+-------+--------------------+--------------------+--------------------+--------------------+------------------+------------+------+---------+--------------------+--------------------+
|show_id|   type|               title|            director|                cast|             country|        date_added|release_year|rating| duration|           listed_in|         description|
+-------+-------+--------------------+--------------------+--------------------+--------------------+------------------+------------+------+---------+--------------------+--------------------+
|     s1|  Movie|Dick Johnson Is Dead|     Kirsten Johnson|                NULL|       United States|September 25, 2021|        2020| PG-13|   90 min|       Documentaries|As her father nea...|
|     s2|TV Show|       Blood & Water|                NULL|Ama Qamata, Khosi...|        South Africa|September 24, 2021|        2021| TV-MA|2 Seasons|International TV ...|After crossing pa...|
|     s3|TV Show|           Ganglan

I decided to cast release_year to INT, since the schema above shows it was inferred as a string

In [55]:
df_netflix = df_netflix.withColumn("release_year", col("release_year").cast("integer"))

## **Running the queries**

Count the titles released in each year, ordered from the most recent year backwards

In [56]:
peliculas_por_ano = df_netflix.groupBy('release_year').count().orderBy('release_year', ascending=False)
peliculas_por_ano.show()

+------------+-----+
|release_year|count|
+------------+-----+
|        2021|  589|
|        2020|  952|
|        2019| 1026|
|        2018| 1145|
|        2017| 1030|
|        2016|  901|
|        2015|  559|
|        2014|  352|
|        2013|  288|
|        2012|  237|
|        2011|  185|
|        2010|  193|
|        2009|  152|
|        2008|  136|
|        2007|   88|
|        2006|   95|
|        2005|   80|
|        2004|   64|
|        2003|   61|
|        2002|   51|
+------------+-----+
only showing top 20 rows



Get the year with the most titles released

In [57]:
ano_max_estrenos = df_netflix.groupBy('release_year').count().orderBy('count', ascending=False)
ano_max_estrenos.show(1)

+------------+-----+
|release_year|count|
+------------+-----+
|        2018| 1145|
+------------+-----+
only showing top 1 row



Get the number of releases per country, excluding NULLs, ordered from highest to lowest

In [58]:
conteo_pais = df_netflix.filter(col('country').isNotNull()).groupBy('country').count().orderBy('count', ascending=False)
conteo_pais.show()

+--------------------+-----+
|             country|count|
+--------------------+-----+
|       United States| 2805|
|               India|  972|
|      United Kingdom|  419|
|               Japan|  245|
|         South Korea|  199|
|              Canada|  181|
|               Spain|  145|
|              France|  123|
|              Mexico|  110|
|               Egypt|  106|
|              Turkey|  105|
|             Nigeria|   93|
|           Australia|   87|
|              Taiwan|   81|
|           Indonesia|   79|
|              Brazil|   77|
|         Philippines|   75|
|United Kingdom, U...|   75|
|United States, Ca...|   73|
|             Germany|   67|
+--------------------+-----+
only showing top 20 rows



Get the number of releases per director, discarding NULLs, ordered from highest to lowest

In [59]:
conteo_director = df_netflix.filter(col('director').isNotNull()).groupBy('director').count().orderBy('count', ascending=False)
conteo_director.show(10)

+--------------------+-----+
|            director|count|
+--------------------+-----+
|       Rajiv Chilaka|   19|
|Raúl Campos, Jan ...|   18|
|        Marcus Raboy|   16|
|         Suhas Kadav|   16|
|           Jay Karas|   14|
| Cathy Garcia-Molina|   13|
|     Youssef Chahine|   12|
|     Martin Scorsese|   12|
|         Jay Chapman|   12|
|    Steven Spielberg|   11|
+--------------------+-----+
only showing top 10 rows



Count of titles per rating, excluding NULLs and keeping the top 10

In [61]:
conteo_rating = df_netflix.filter(col("rating").isNotNull()).groupBy('rating').count().orderBy('count', ascending=False)
conteo_rating.show(10)

+------+-----+
|rating|count|
+------+-----+
| TV-MA| 3195|
| TV-14| 2158|
| TV-PG|  862|
|     R|  796|
| PG-13|  489|
| TV-Y7|  334|
|  TV-Y|  307|
|    PG|  286|
|  TV-G|  220|
|    NR|   80|
+------+-----+
only showing top 10 rows



In [43]:
spark.stop()