In [2]:
# Install dependencies
!apt-get update -qq
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://archive.apache.org/dist/spark/spark-3.1.2/spark-3.1.2-bin-hadoop2.7.tgz
!tar xf spark-3.1.2-bin-hadoop2.7.tgz
!pip install -q findspark

W: Skipping acquire of configured file 'main/source/Sources' as repository 'https://r2u.stat.illinois.edu/ubuntu jammy InRelease' does not seem to provide it (sources.list entry misspelt?)


In [3]:
# Set JAVA_HOME and SPARK_HOME so that findspark and PySpark can locate Java and Spark
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-3.1.2-bin-hadoop2.7"
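
A wrong path in either variable only surfaces later, when the Spark session fails to start. As a minimal sketch, the check below reports which variables are unset or point to a directory that does not exist. `missing_dirs` is a hypothetical helper name, not part of findspark or PySpark.

```python
import os

def missing_dirs(var_names, environ=os.environ):
    """Return the names of variables that are unset or do not point to an existing directory."""
    missing = []
    for name in var_names:
        path = environ.get(name)
        if not path or not os.path.isdir(path):
            missing.append(name)
    return missing

# Hypothetical usage in the notebook: an empty list means both paths exist.
print(missing_dirs(["JAVA_HOME", "SPARK_HOME"]))
```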

In [4]:
# Use SPARK_HOME to add the Spark installation to sys.path so that pyspark can be imported
import findspark
findspark.init()

In [5]:
# Initialize a Spark session
from pyspark.sql import SparkSession

spark = SparkSession.builder\
    .master('local[*]')\
    .appName("Iniciando com Spark")\
    .config('spark.ui.port', '4050')\
    .getOrCreate()

In [6]:
# Download and unzip the ngrok binary, used to expose the Spark UI outside the notebook
!wget -q https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip
!unzip ngrok-stable-linux-amd64.zip

Archive:  ngrok-stable-linux-amd64.zip
  inflating: ngrok                   


In [7]:
# Start an ngrok tunnel to the Spark UI (port 4050) in the background
get_ipython().system_raw('./ngrok http 4050 &')
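
The tunnel runs in the background, so the cell prints nothing and the public address has to be looked up separately. The sketch below assumes the ngrok agent serves its local inspection API at `http://localhost:4040/api/tunnels`, its usual default. Port 4040 is also Spark's default UI port, which is presumably why the session sets `spark.ui.port` to 4050. `public_urls` and `fetch_tunnels` are hypothetical helper names, and newer ngrok releases may require an auth token before any tunnel starts.

```python
import json
from urllib.request import urlopen

def public_urls(payload):
    """Extract the public URLs from an already-parsed /api/tunnels response."""
    return [t["public_url"] for t in payload.get("tunnels", []) if "public_url" in t]

def fetch_tunnels(api="http://localhost:4040/api/tunnels"):
    """Query the local ngrok agent; the address is an assumption about the default setup."""
    with urlopen(api, timeout=5) as resp:
        return json.load(resp)

# In the notebook, once the tunnel has had a moment to start:
# print(public_urls(fetch_tunnels()))
```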

In [10]:
# Create a database
spark.sql('CREATE DATABASE IF NOT EXISTS desp')

DataFrame[]

In [None]:
# Switch the current database to desp
spark.sql('USE desp')

In [12]:
# Load the CSV file into a DataFrame and display it
churn_df = spark.read.csv('/content/Churn.csv', sep=';', header=True, inferSchema=True)
churn_df.show()

+-----------+---------+------+---+------+--------+-------------+---------+--------------+---------------+------+
|CreditScore|Geography|Gender|Age|Tenure| Balance|NumOfProducts|HasCrCard|IsActiveMember|EstimatedSalary|Exited|
+-----------+---------+------+---+------+--------+-------------+---------+--------------+---------------+------+
|        619|   France|Female| 42|     2|       0|            1|        1|             1|       10134888|     1|
|        608|    Spain|Female| 41|     1| 8380786|            1|        0|             1|       11254258|     0|
|        502|   France|Female| 42|     8| 1596608|            3|        1|             0|       11393157|     1|
|        699|   France|Female| 39|     1|       0|            2|        0|             0|        9382663|     0|
|        850|    Spain|Female| 43|     2|12551082|            1|        1|             1|         790841|     0|
|        645|    Spain|  Male| 44|     8|11375578|            2|        1|             0|       

In [13]:
# Partition churn_df by the Geography column and save it to the database as a table

# After the command completes, the spark-warehouse/churn directory contains one subfolder per
# distinct value of Geography (Geography=France, Geography=Germany, Geography=Spain), and each
# subfolder holds only the rows for that value
churn_df.write.partitionBy('Geography').saveAsTable('churn')

In [20]:
# Querying the table shows that the data itself is unchanged, with two visible differences:
# rows come back grouped by partition value (the order is not guaranteed without an ORDER BY),
# and the partition column Geography is now the last column
spark.sql('select * from churn').show(100)

+-----------+------+---+------+--------+-------------+---------+--------------+---------------+------+---------+
|CreditScore|Gender|Age|Tenure| Balance|NumOfProducts|HasCrCard|IsActiveMember|EstimatedSalary|Exited|Geography|
+-----------+------+---+------+--------+-------------+---------+--------------+---------------+------+---------+
|        619|Female| 42|     2|       0|            1|        1|             1|       10134888|     1|   France|
|        502|Female| 42|     8| 1596608|            3|        1|             0|       11393157|     1|   France|
|        699|Female| 39|     1|       0|            2|        0|             0|        9382663|     0|   France|
|        822|  Male| 50|     7|       0|            2|        1|             1|         100628|     0|   France|
|        501|  Male| 44|     4|14205107|            2|        0|             1|         749405|     0|   France|
|        684|  Male| 27|     2|13460388|            1|        1|             1|        7172573| 
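
The partition column ends up last because its value is not stored inside the data files. It is recovered from the `Geography=<value>` folder name and appended to each row when the table is read. The plain-Python sketch below only models that behavior and is not how Spark implements it; `attach_partition` is a hypothetical name and the row is a shortened stand-in for the real columns.

```python
def attach_partition(rows, folder_name):
    """Append the partition column, parsed from a `column=value` folder name, to each row."""
    column, value = folder_name.split("=", 1)
    # Dicts keep insertion order, so the partition column lands after the stored columns.
    return [{**row, column: value} for row in rows]

# Rows as stored inside one partition folder: no Geography field.
stored = [{"CreditScore": 619, "Gender": "Female"}]
restored = attach_partition(stored, "Geography=France")
print(list(restored[0].keys()))
```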

## Analysis notes

In the Files tab in Colab, you will see that a spark-warehouse folder has been created. Inside it is a directory named churn, which holds the partitioned files.

### Note

By default, Spark saves files in Parquet format.
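
Both notes can be checked from Python without going through Spark. The sketch below assumes the table sits in a `spark-warehouse/churn` directory and that the partition folders use `column=value` names; `list_partitions` is a hypothetical helper name and the Colab path in the usage comment is an assumption.

```python
from pathlib import Path

def list_partitions(table_dir):
    """Map each partition value to the Parquet file names found in its `column=value` folder."""
    partitions = {}
    for sub in sorted(Path(table_dir).iterdir()):
        if sub.is_dir() and "=" in sub.name:
            _, value = sub.name.split("=", 1)
            partitions[value] = sorted(f.name for f in sub.glob("*.parquet"))
    return partitions

# Hypothetical usage in Colab (the path is an assumption):
# for value, files in list_partitions("/content/spark-warehouse/churn").items():
#     print(value, files)
```

Each key should be one of the Geography values, and each file list should contain only names ending in `.parquet`.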