# Apache Spark

Apache Spark is a general-purpose distributed data analytics engine for fast, scalable processing of large volumes of data. It was designed to overcome the speed and flexibility limitations of earlier technologies such as Hadoop MapReduce.

## PySpark vs Scikit-Learn

* Scikit-Learn is built on NumPy and optimized for in-memory computation on a single machine, so it efficiently handles datasets that fit in RAM (on the order of 1M rows on a modern machine).

* PySpark scales better because it distributes processing across several nodes, but on a single local node it can be slower because of the overhead of distributed execution.

Use PySpark if...

* Your dataset has millions or billions of rows.
* You work on a multi-node cluster and need real parallelization.
* You process data directly from HDFS, S3, Cassandra, Delta Lake, etc.
* You run heavy ETL (Extract, Transform, Load) operations or large-scale machine learning pipelines.

Advantages:

* It has a high-level DataFrame API that closely resembles Pandas, and Spark DataFrames can be converted to Pandas DataFrames. The two APIs are not identical, but they are similar.
* It has Spark SQL, a high-level API for working directly with SQL code; it can express the same operations as the DataFrame API.
* It has MLlib for machine learning, with an API much like Scikit-Learn's.

Disadvantages:

* Higher overhead. Spark itself is written in Scala and runs on the JVM, so every PySpark job goes through several steps before any data is processed:
    * Translation: the Python calls are passed on to the JVM.
    * DAG construction: the transformations and actions are turned into a DAG (directed acyclic graph).
    * DAG optimization: Spark internally improves the execution plan.
    * Distributed execution: Spark splits the work into tasks and runs them on the cluster.
    * Result collection: results are sent back to the Driver or stored in a database.
* Setting up and maintaining a cluster is complex.
* It only pays off with massive amounts of data; otherwise it is unnecessary.
* ETL jobs can become very complex.
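
The distributed execution and result collection steps described in the first item can be pictured with a small pure-Python analogy. This is not Spark code: the thread pool stands in for the cluster, the function names are hypothetical, and real Spark adds scheduling, shuffles, and fault tolerance on top.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(values, num_partitions):
    """Deal the values round-robin into num_partitions lists."""
    return [values[i::num_partitions] for i in range(num_partitions)]


def partial_sum_and_count(partition):
    """One 'task': works on a single partition only."""
    return sum(partition), len(partition)


def distributed_mean(values, num_partitions=3):
    partitions = split_into_partitions(values, num_partitions)
    # Each partition is handled by its own worker, like tasks on executors.
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partials = list(pool.map(partial_sum_and_count, partitions))
    # The 'driver' combines the small partial results into the final answer.
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count


tips = [1.01, 1.66, 3.50, 3.31, 3.61, 4.71]
mean_tip = distributed_mean(tips)
print(round(mean_tip, 4))
```

Only the per-partition sums and counts travel back to the combining step, which is why aggregations tend to scale well while `collect()` on a large dataset does not.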

In PySpark, operations fall into two types:

* Transformations (intermediate operations, lazily evaluated): they produce a new DataFrame or RDD, but they are not executed and return nothing immediately.
* Actions (terminal operations): they execute the computation plan and return a result to the Driver or write data to storage. Examples:
    * collect(): brings all the data back as a list. Use it with care: loading a large dataset this way can exhaust the Driver's memory.
    * take(n)
    * first()
    * head(n)
    * count()
    * show(n)
    * write.format().save()
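
The split between transformations and actions can be illustrated with a toy class in plain Python. This is only a sketch of the idea under stated assumptions: `LazyDataset` and its methods are hypothetical names, and Spark's real planner is far more elaborate.

```python
from itertools import islice


class LazyDataset:
    """Toy stand-in for a DataFrame: records steps, runs them on an action."""

    def __init__(self, rows, plan=()):
        self._rows = rows
        self._plan = plan  # tuple of (kind, function) steps, not yet executed

    # Transformations: return a new object with a longer plan, compute nothing.
    def filter(self, predicate):
        return LazyDataset(self._rows, self._plan + (("filter", predicate),))

    def map(self, func):
        return LazyDataset(self._rows, self._plan + (("map", func),))

    def _execute(self):
        # Runs the recorded plan row by row, only when an action asks for it.
        for row in self._rows:
            keep = True
            for kind, func in self._plan:
                if kind == "filter":
                    if not func(row):
                        keep = False
                        break
                else:
                    row = func(row)
            if keep:
                yield row

    # Actions: trigger execution and return a concrete result.
    def collect(self):
        return list(self._execute())

    def count(self):
        return sum(1 for _ in self._execute())

    def take(self, n):
        return list(islice(self._execute(), n))


calls = []  # records every time the predicate actually runs


def bill_over_20(row):
    calls.append(row)
    return row["total_bill"] > 20


tips = LazyDataset([
    {"total_bill": 16.99, "tip": 1.01},
    {"total_bill": 21.01, "tip": 3.50},
    {"total_bill": 23.68, "tip": 3.31},
])

big = tips.filter(bill_over_20).map(lambda r: r["tip"])
calls_before_action = len(calls)  # the predicate has not run yet
first = big.take(1)               # action: now the plan executes
total = big.count()               # another action: the plan runs again
print(calls_before_action, first, total)
```

As in Spark, each action re-runs the plan from the start; that is the reason caching exists for results that are reused.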

In [1]:
import os
import seaborn as sns
import pandas as pd
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("pyspark_teoria").getOrCreate()
spark

In [2]:
df = spark.createDataFrame(sns.load_dataset('tips'))
df.show(5)  # equivalent to pandas' head()

+----------+----+------+------+---+------+----+
|total_bill| tip|   sex|smoker|day|  time|size|
+----------+----+------+------+---+------+----+
|     16.99|1.01|Female|    No|Sun|Dinner|   2|
|     10.34|1.66|  Male|    No|Sun|Dinner|   3|
|     21.01| 3.5|  Male|    No|Sun|Dinner|   3|
|     23.68|3.31|  Male|    No|Sun|Dinner|   2|
|     24.59|3.61|Female|    No|Sun|Dinner|   4|
+----------+----+------+------+---+------+----+
only showing top 5 rows



In [3]:
# convert to a pandas DataFrame
df.limit(10).toPandas()

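| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3 |
| 2 | 21.01 | 3.5 | Male | No | Sun | Dinner | 3 |
| 3 | 23.68 | 3.31 | Male | No | Sun | Dinner | 2 |
| 4 | 24.59 | 3.61 | Female | No | Sun | Dinner | 4 |
| 5 | 25.29 | 4.71 | Male | No | Sun | Dinner | 4 |
| 6 | 8.77 | 2.0 | Male | No | Sun | Dinner | 2 |
| 7 | 26.88 | 3.12 | Male | No | Sun | Dinner | 4 |
| 8 | 15.04 | 1.96 | Male | No | Sun | Dinner | 2 |
| 9 | 14.78 | 3.23 | Male | No | Sun | Dinner | 2 |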


In [4]:
df.columns

['total_bill', 'tip', 'sex', 'smoker', 'day', 'time', 'size']

In [5]:
# df[['total_bill', 'tip', 'sex']].show(5)
df.select('total_bill', 'tip', 'sex').show(5)

+----------+----+------+
|total_bill| tip|   sex|
+----------+----+------+
|     16.99|1.01|Female|
|     10.34|1.66|  Male|
|     21.01| 3.5|  Male|
|     23.68|3.31|  Male|
|     24.59|3.61|Female|
+----------+----+------+
only showing top 5 rows



In [6]:
df.describe().toPandas()

| | summary | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|---|
| 0 | count | 244.0 | 244.0 | 244 | 244 | 244 | 244 | 244.0 |
| 1 | mean | 19.78594262295082 | 2.9982786885245902 | | | | | 2.569672131147541 |
| 2 | stddev | 8.902411954856856 | 1.383638189001182 | | | | | 0.9510998047322344 |
| 3 | min | 3.07 | 1.0 | Female | No | Fri | Dinner | 1.0 |
| 4 | max | 50.81 | 10.0 | Male | Yes | Thur | Lunch | 6.0 |


In [7]:
df.dtypes

[('total_bill', 'double'),
 ('tip', 'double'),
 ('sex', 'string'),
 ('smoker', 'string'),
 ('day', 'string'),
 ('time', 'string'),
 ('size', 'bigint')]

In [8]:
df.schema['total_bill']

StructField('total_bill', DoubleType(), True)

In [9]:
help(df.withColumn)

Help on method withColumn in module pyspark.sql.dataframe:

withColumn(colName: str, col: pyspark.sql.column.Column) -> 'DataFrame' method of pyspark.sql.dataframe.DataFrame instance
    Returns a new :class:`DataFrame` by adding a column or replacing the
    existing column that has the same name.
    
    The column expression must be an expression over this :class:`DataFrame`; attempting to add
    a column from some other :class:`DataFrame` will raise an error.
    
    .. versionadded:: 1.3.0
    
    .. versionchanged:: 3.4.0
        Supports Spark Connect.
    
    Parameters
    ----------
    colName : str
        string, name of the new column.
    col : :class:`Column`
        a :class:`Column` expression for the new column.
    
    Returns
    -------
    :class:`DataFrame`
        DataFrame with new or replaced column.
    
    Notes
    -----
    This method introduces a projection internally. Therefore, calling it multiple
    times, for instance, via loops in order to 

In [10]:
# data type conversion; in pandas we would typically use astype()
from pyspark.sql.functions import col
from pyspark.sql.types import IntegerType, FloatType

# df_cast = df.withColumn('total_bill', col('total_bill').cast('float')) \
#             .withColumn('tip', col('tip').cast('integer'))
  
df_cast = df.withColumn('total_bill', col('total_bill').cast(FloatType())) \
        .withColumn('tip', col('tip').cast(IntegerType()))
            
df_cast.printSchema()

root
 |-- total_bill: float (nullable = true)
 |-- tip: integer (nullable = true)
 |-- sex: string (nullable = true)
 |-- smoker: string (nullable = true)
 |-- day: string (nullable = true)
 |-- time: string (nullable = true)
 |-- size: long (nullable = true)



In [11]:
# aggregations
df.select('total_bill', 'tip', 'size').summary('count', 'min', 'max', 'mean').show()

+-------+-----------------+------------------+-----------------+
|summary|       total_bill|               tip|             size|
+-------+-----------------+------------------+-----------------+
|  count|              244|               244|              244|
|    min|             3.07|               1.0|                1|
|    max|            50.81|              10.0|                6|
|   mean|19.78594262295082|2.9982786885245902|2.569672131147541|
+-------+-----------------+------------------+-----------------+



In [12]:
# equivalent to pandas' describe()
df.summary().show()

+-------+-----------------+------------------+------+------+----+------+------------------+
|summary|       total_bill|               tip|   sex|smoker| day|  time|              size|
+-------+-----------------+------------------+------+------+----+------+------------------+
|  count|              244|               244|   244|   244| 244|   244|               244|
|   mean|19.78594262295082|2.9982786885245902|  NULL|  NULL|NULL|  NULL| 2.569672131147541|
| stddev|8.902411954856856| 1.383638189001182|  NULL|  NULL|NULL|  NULL|0.9510998047322345|
|    min|             3.07|               1.0|Female|    No| Fri|Dinner|                 1|
|    25%|            13.28|               2.0|  NULL|  NULL|NULL|  NULL|                 2|
|    50%|            17.78|              2.88|  NULL|  NULL|NULL|  NULL|                 2|
|    75%|            24.08|              3.55|  NULL|  NULL|NULL|  NULL|                 3|
|    max|            50.81|              10.0|  Male|   Yes|Thur| Lunch|        

In [13]:
help(df.filter)

Help on method filter in module pyspark.sql.dataframe:

filter(condition: 'ColumnOrName') -> 'DataFrame' method of pyspark.sql.dataframe.DataFrame instance
    Filters rows using the given condition.
    
    :func:`where` is an alias for :func:`filter`.
    
    .. versionadded:: 1.3.0
    
    .. versionchanged:: 3.4.0
        Supports Spark Connect.
    
    Parameters
    ----------
    condition : :class:`Column` or str
        a :class:`Column` of :class:`types.BooleanType`
        or a string of SQL expressions.
    
    Returns
    -------
    :class:`DataFrame`
        Filtered DataFrame.
    
    Examples
    --------
    >>> df = spark.createDataFrame([
    ...     (2, "Alice"), (5, "Bob")], schema=["age", "name"])
    
    Filter by :class:`Column` instances.
    
    >>> df.filter(df.age > 3).show()
    +---+----+
    |age|name|
    +---+----+
    |  5| Bob|
    +---+----+
    >>> df.where(df.age == 2).show()
    +---+-----+
    |age| name|
    +---+-----+
    |  2|Alice|


In [14]:
from pyspark.sql.functions import col

# Filter on one column (the commented-out variants are equivalent ways to reference it)
# df.filter(df.total_bill > 20).show()
df.filter(df['total_bill'] > 20).show()
# df.filter(col('total_bill') > 20).show()

# df.filter(df['total_bill'] > 20).collect()[0]['total_bill']

+----------+----+------+------+---+------+----+
|total_bill| tip|   sex|smoker|day|  time|size|
+----------+----+------+------+---+------+----+
|     21.01| 3.5|  Male|    No|Sun|Dinner|   3|
|     23.68|3.31|  Male|    No|Sun|Dinner|   2|
|     24.59|3.61|Female|    No|Sun|Dinner|   4|
|     25.29|4.71|  Male|    No|Sun|Dinner|   4|
|     26.88|3.12|  Male|    No|Sun|Dinner|   4|
|     35.26| 5.0|Female|    No|Sun|Dinner|   4|
|     21.58|3.92|  Male|    No|Sun|Dinner|   2|
|     20.65|3.35|  Male|    No|Sat|Dinner|   3|
|     20.29|2.75|Female|    No|Sat|Dinner|   2|
|     39.42|7.58|  Male|    No|Sat|Dinner|   4|
|      21.7| 4.3|  Male|    No|Sat|Dinner|   2|
|     20.69|2.45|Female|    No|Sat|Dinner|   4|
|     24.06| 3.6|  Male|    No|Sat|Dinner|   3|
|     31.27| 5.0|  Male|    No|Sat|Dinner|   3|
|      30.4| 5.6|  Male|    No|Sun|Dinner|   4|
|     22.23| 5.0|  Male|    No|Sun|Dinner|   2|
|      32.4| 6.0|  Male|    No|Sun|Dinner|   4|
|     28.55|2.05|  Male|    No|Sun|Dinne

In [15]:
df.filter((df['total_bill'] > 20) & (df['tip'] > 3)).show(2)

+----------+----+----+------+---+------+----+
|total_bill| tip| sex|smoker|day|  time|size|
+----------+----+----+------+---+------+----+
|     21.01| 3.5|Male|    No|Sun|Dinner|   3|
|     23.68|3.31|Male|    No|Sun|Dinner|   2|
+----------+----+----+------+---+------+----+
only showing top 2 rows



In [16]:
# Create a new column holding 10% of total_bill as VAT (IVA)
df_new = df.withColumn('total_bill_iva_10', df['total_bill'] * 0.10)
df_new.show(2)

+----------+----+------+------+---+------+----+------------------+
|total_bill| tip|   sex|smoker|day|  time|size| total_bill_iva_10|
+----------+----+------+------+---+------+----+------------------+
|     16.99|1.01|Female|    No|Sun|Dinner|   2|1.6989999999999998|
|     10.34|1.66|  Male|    No|Sun|Dinner|   3|             1.034|
+----------+----+------+------+---+------+----+------------------+
only showing top 2 rows



In [17]:
# in pandas we would typically apply a transformation with apply()

from pyspark.sql.functions import col, when
# create a categorical column from a numeric one

# df.withColumn(
#     'tip_category', 
#     when(df['tip'] <= 1, 'baja')
#     .when((df['tip'] > 1) & (df['tip'] <= 3), 'media')
#     .otherwise('alta')
# ).show()

df.withColumn(
    'tip_category', 
    when(col('tip') <= 1, 'baja')
    .when((col('tip') > 1) & (col('tip') <= 3), 'media')
    .otherwise('alta')
).show()

+----------+----+------+------+---+------+----+------------+
|total_bill| tip|   sex|smoker|day|  time|size|tip_category|
+----------+----+------+------+---+------+----+------------+
|     16.99|1.01|Female|    No|Sun|Dinner|   2|       media|
|     10.34|1.66|  Male|    No|Sun|Dinner|   3|       media|
|     21.01| 3.5|  Male|    No|Sun|Dinner|   3|        alta|
|     23.68|3.31|  Male|    No|Sun|Dinner|   2|        alta|
|     24.59|3.61|Female|    No|Sun|Dinner|   4|        alta|
|     25.29|4.71|  Male|    No|Sun|Dinner|   4|        alta|
|      8.77| 2.0|  Male|    No|Sun|Dinner|   2|       media|
|     26.88|3.12|  Male|    No|Sun|Dinner|   4|        alta|
|     15.04|1.96|  Male|    No|Sun|Dinner|   2|       media|
|     14.78|3.23|  Male|    No|Sun|Dinner|   2|        alta|
|     10.27|1.71|  Male|    No|Sun|Dinner|   2|       media|
|     35.26| 5.0|Female|    No|Sun|Dinner|   4|        alta|
|     15.42|1.57|  Male|    No|Sun|Dinner|   2|       media|
|     18.43| 3.0|  Male|

In [18]:
# alternative to the previous example using a User Defined Function (UDF)
# Prefer this only for advanced cases that the built-in functions in pyspark.sql.functions cannot cover;
# Python UDFs are generally slower than built-in functions
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def categorize_tip(tip):
    if tip is None:  # nulls reach the UDF as None; comparing None with a number would raise TypeError
        return None
    if tip <= 1:
        return 'baja'
    elif tip <= 3:
        return 'media'
    else:
        return 'alta'

udf_categorize_tip = udf(categorize_tip, StringType())
df.withColumn('tip_category', udf_categorize_tip('tip')).show()

+----------+----+------+------+---+------+----+------------+
|total_bill| tip|   sex|smoker|day|  time|size|tip_category|
+----------+----+------+------+---+------+----+------------+
|     16.99|1.01|Female|    No|Sun|Dinner|   2|       media|
|     10.34|1.66|  Male|    No|Sun|Dinner|   3|       media|
|     21.01| 3.5|  Male|    No|Sun|Dinner|   3|        alta|
|     23.68|3.31|  Male|    No|Sun|Dinner|   2|        alta|
|     24.59|3.61|Female|    No|Sun|Dinner|   4|        alta|
|     25.29|4.71|  Male|    No|Sun|Dinner|   4|        alta|
|      8.77| 2.0|  Male|    No|Sun|Dinner|   2|       media|
|     26.88|3.12|  Male|    No|Sun|Dinner|   4|        alta|
|     15.04|1.96|  Male|    No|Sun|Dinner|   2|       media|
|     14.78|3.23|  Male|    No|Sun|Dinner|   2|        alta|
|     10.27|1.71|  Male|    No|Sun|Dinner|   2|       media|
|     35.26| 5.0|Female|    No|Sun|Dinner|   4|        alta|
|     15.42|1.57|  Male|    No|Sun|Dinner|   2|       media|
|     18.43| 3.0|  Male|

In [19]:
# rename columns
df_renamed = df.withColumnRenamed('sex','genre')
df_renamed.show(4)

+----------+----+------+------+---+------+----+
|total_bill| tip| genre|smoker|day|  time|size|
+----------+----+------+------+---+------+----+
|     16.99|1.01|Female|    No|Sun|Dinner|   2|
|     10.34|1.66|  Male|    No|Sun|Dinner|   3|
|     21.01| 3.5|  Male|    No|Sun|Dinner|   3|
|     23.68|3.31|  Male|    No|Sun|Dinner|   2|
+----------+----+------+------+---+------+----+
only showing top 4 rows



In [20]:
df_dropped = df.drop('sex', 'smoker')
df_dropped.show(4)

+----------+----+---+------+----+
|total_bill| tip|day|  time|size|
+----------+----+---+------+----+
|     16.99|1.01|Sun|Dinner|   2|
|     10.34|1.66|Sun|Dinner|   3|
|     21.01| 3.5|Sun|Dinner|   3|
|     23.68|3.31|Sun|Dinner|   2|
+----------+----+---+------+----+
only showing top 4 rows



In [21]:
# sort by a column; the pandas equivalent is sort_values
df.sort('total_bill').show(3)  # ascending (default)
df.sort(col('total_bill').desc()).show(3)  # descending
df.orderBy(col('total_bill').desc()).show(3)  # orderBy is an alias of sort

+----------+---+------+------+---+------+----+
|total_bill|tip|   sex|smoker|day|  time|size|
+----------+---+------+------+---+------+----+
|      3.07|1.0|Female|   Yes|Sat|Dinner|   1|
|      5.75|1.0|Female|   Yes|Fri|Dinner|   2|
|      7.25|1.0|Female|    No|Sat|Dinner|   1|
+----------+---+------+------+---+------+----+
only showing top 3 rows

+----------+----+----+------+---+------+----+
|total_bill| tip| sex|smoker|day|  time|size|
+----------+----+----+------+---+------+----+
|     50.81|10.0|Male|   Yes|Sat|Dinner|   3|
|     48.33| 9.0|Male|    No|Sat|Dinner|   4|
|     48.27|6.73|Male|    No|Sat|Dinner|   4|
+----------+----+----+------+---+------+----+
only showing top 3 rows

+----------+----+----+------+---+------+----+
|total_bill| tip| sex|smoker|day|  time|size|
+----------+----+----+------+---+------+----+
|     50.81|10.0|Male|   Yes|Sat|Dinner|   3|
|     48.33| 9.0|Male|    No|Sat|Dinner|   4|
|     48.27|6.73|Male|    No|Sat|Dinner|   4|
+----------+----+----+-

In [22]:
# group data
# equivalent to pandas' value_counts()
df.groupBy('sex').count().show()

+------+-----+
|   sex|count|
+------+-----+
|Female|   87|
|  Male|  157|
+------+-----+



In [23]:
# as in pandas, agg() lets us request several aggregations at once
# sum is imported under an alias so it does not shadow Python's built-in sum()
from pyspark.sql.functions import avg, count, sum as spark_sum

df.groupBy('sex').agg(
    count('*').alias('count_rows'),
    avg('total_bill').alias('avg_total_bill'),
    spark_sum('tip').alias('sum_tips')
).show()

+------+----------+------------------+------------------+
|   sex|count_rows|    avg_total_bill|          sum_tips|
+------+----------+------------------+------------------+
|Female|        87|18.056896551724137|            246.51|
|  Male|       157| 20.74407643312102|485.07000000000005|
+------+----------+------------------+------------------+



In [None]:
# Drop rows that contain at least one null value:
df_no_nulls = df.dropna()

# Drop rows that have nulls only in the specified columns:
df_no_nulls = df.dropna(subset=['tip'])

In [None]:
# fill nulls
df_imputed = df.fillna({
    'total_bill': 0,
    'smoker': 'desconocido'
})
df_imputed.show(4)

+----------+----+------+------+---+------+----+
|total_bill| tip|   sex|smoker|day|  time|size|
+----------+----+------+------+---+------+----+
|     16.99|1.01|Female|    No|Sun|Dinner|   2|
|     10.34|1.66|  Male|    No|Sun|Dinner|   3|
|     21.01| 3.5|  Male|    No|Sun|Dinner|   3|
|     23.68|3.31|  Male|    No|Sun|Dinner|   2|
+----------+----+------+------+---+------+----+
only showing top 4 rows



In [27]:
# Load the CSV with pandas, then convert it to a PySpark DataFrame
url = 'https://raw.githubusercontent.com/mwaskom/seaborn-data/refs/heads/master/tips.csv'
df_pandas = pd.read_csv(url)
df_spark = spark.createDataFrame(df_pandas)
df_spark.show(5)

+----------+----+------+------+---+------+----+
|total_bill| tip|   sex|smoker|day|  time|size|
+----------+----+------+------+---+------+----+
|     16.99|1.01|Female|    No|Sun|Dinner|   2|
|     10.34|1.66|  Male|    No|Sun|Dinner|   3|
|     21.01| 3.5|  Male|    No|Sun|Dinner|   3|
|     23.68|3.31|  Male|    No|Sun|Dinner|   2|
|     24.59|3.61|Female|    No|Sun|Dinner|   4|
+----------+----+------+------+---+------+----+
only showing top 5 rows



In [29]:
# Load the CSV directly with PySpark (recommended)
import requests

url = 'https://raw.githubusercontent.com/mwaskom/seaborn-data/refs/heads/master/tips.csv'
# csv_path= '/tmp/tips.csv'
csv_path= 'tips.csv'

with open(csv_path, 'wb') as file:  # 'w' for write, 'b' for binary
    file.write(requests.get(url).content)
    
df_spark = spark.read.csv(csv_path, header=True, inferSchema=True)
df_spark.show(5)

+----------+----+------+------+---+------+----+
|total_bill| tip|   sex|smoker|day|  time|size|
+----------+----+------+------+---+------+----+
|     16.99|1.01|Female|    No|Sun|Dinner|   2|
|     10.34|1.66|  Male|    No|Sun|Dinner|   3|
|     21.01| 3.5|  Male|    No|Sun|Dinner|   3|
|     23.68|3.31|  Male|    No|Sun|Dinner|   2|
|     24.59|3.61|Female|    No|Sun|Dinner|   4|
+----------+----+------+------+---+------+----+
only showing top 5 rows



In [33]:
# Load the CSV directly with PySpark and an explicit schema (most recommended)
import requests
from pyspark.sql.types import StructType, StructField, FloatType, StringType, IntegerType

url = 'https://raw.githubusercontent.com/mwaskom/seaborn-data/refs/heads/master/tips.csv'
# csv_path= '/tmp/tips.csv'
csv_path= 'tips.csv'

with open(csv_path, 'wb') as file:  # 'w' for write, 'b' for binary
    file.write(requests.get(url).content)
    
schema = StructType([
    # dataset columns and their data types
    StructField('total_bill', FloatType(), True),
    StructField('tip', FloatType(), True),
    StructField('sex', StringType(), True),
    StructField('smoker', StringType(), True),
    StructField('day', StringType(), True),
    StructField('time', StringType(), True),
    StructField('size', IntegerType(), True)
])
    
df_spark = spark.read.csv(csv_path, header=True, inferSchema=False, schema=schema)
df_spark.show(5)
df_spark.printSchema()

+----------+----+------+------+---+------+----+
|total_bill| tip|   sex|smoker|day|  time|size|
+----------+----+------+------+---+------+----+
|     16.99|1.01|Female|    No|Sun|Dinner|   2|
|     10.34|1.66|  Male|    No|Sun|Dinner|   3|
|     21.01| 3.5|  Male|    No|Sun|Dinner|   3|
|     23.68|3.31|  Male|    No|Sun|Dinner|   2|
|     24.59|3.61|Female|    No|Sun|Dinner|   4|
+----------+----+------+------+---+------+----+
only showing top 5 rows

root
 |-- total_bill: float (nullable = true)
 |-- tip: float (nullable = true)
 |-- sex: string (nullable = true)
 |-- smoker: string (nullable = true)
 |-- day: string (nullable = true)
 |-- time: string (nullable = true)
 |-- size: integer (nullable = true)



In [None]:
# save the data as CSV
# by default the output is split into multiple part files so it can be processed in a distributed way across nodes
# note that 'tips_clean.csv' is created as a directory that holds those part files
# when reading it back, Spark detects the partitioned output automatically and reads it correctly
df.write.csv('tips_clean.csv', header=True, mode='overwrite')

In [37]:
# reduce to a single partition (not recommended for large data: a single task writes everything)
df.coalesce(1).write.csv('tips_clean2.csv', header=True, mode='overwrite')

In [35]:
# check that the saved output appears:
import os

files = os.listdir('.')
for file in files:
    print(file)

.bash_logout
.bashrc
.profile
.ipython
.npm
.cache
.ipynb_checkpoints
tips.csv
.local
.jupyter
tips_clean.csv
.conda
.config
.wget-hsts
work


In [36]:
df_tips_clean = spark.read.csv('tips_clean.csv', header=True, inferSchema=True)
df_tips_clean.show(3)

+----------+---+------+------+---+------+----+
|total_bill|tip|   sex|smoker|day|  time|size|
+----------+---+------+------+---+------+----+
|     16.27|2.5|Female|   Yes|Fri| Lunch|   2|
|     10.09|2.0|Female|   Yes|Fri| Lunch|   2|
|     20.45|3.0|  Male|    No|Sat|Dinner|   4|
+----------+---+------+------+---+------+----+
only showing top 3 rows



In [None]:
# # Spark can also connect to other data sources, such as MySQL
# spark = SparkSession.builder.appName('mysqlapp').config('spark.jars', '/opt/mysql-connector-java-8.0.41.jar').getOrCreate()

# # JDBC (Java Database Connectivity)
# #.option('dbtable', 'customers')

# df_mysql = spark.read.format('jdbc') \
#           .option('url', 'jdbc:mysql://localhost:3306/testing_db') \
#           .option('query', 'SELECT * FROM customers WHERE salary > 1000') \
#           .option('user', 'root') \
#           .option('password', 'admin') \
#           .load()