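Rather than installing PySpark on a local machine and running a notebook there, this tutorial uses Google Colab, which comes with many machine learning libraries preinstalled and provides capable hardware. To set it up, install the pyspark and Py4J packages. Py4J lets a Python program access objects living in a Java virtual machine (JVM). Spark is used in local mode here: it runs inside the Colab machine itself, with no separate cluster.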

# Installing PySpark

In [1]:
!pip install pyspark==3.3.1 py4j==0.10.9.5

Collecting pyspark==3.3.1
  Downloading pyspark-3.3.1.tar.gz (281.4 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 281.4/281.4 MB 4.9 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Collecting py4j==0.10.9.5
  Downloading py4j-0.10.9.5-py2.py3-none-any.whl (199 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 199.7/199.7 kB 22.1 MB/s eta 0:00:00
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.3.1-py2.py3-none-any.whl size=281845500 sha256=18244d8e14fc362a78a28423c900d17b685d67e4461e875636e3c39f83feced8
  Stored in directory: /root/.cache/pip/wheels/0f/f0/3d/517368b8ce80486e84f89f214e0a022554e4ee64969f46279b
Successfully built pyspark
Installing collected packages: py4j, pyspark
  Attempting uninstall: py4j
    Found existing installation: py4j 0.10.9.7
    Uninstall

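**Spark Session:** Since Spark 2.0, SparkSession has been the entry point to Spark. It is used to create RDDs, DataFrames, and so on. A SparkSession is created through SparkSession.builder, whose methods allow detailed configuration.

* local[*]: Spark runs as a single JVM, and inside it as many threads as the machine has cores do the work that executors would do on a cluster.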


In [2]:
from pyspark.sql import SparkSession

spark = SparkSession.builder\
        .master("local[*]")\
        .appName('PySpark Tutorial')\
        .getOrCreate()

In [3]:
spark

In [4]:
!lscpu

Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   46 bits physical, 48 bits virtual
CPU(s):                          2
On-line CPU(s) list:             0,1
Thread(s) per core:              2
Core(s) per socket:              1
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           79
Model name:                      Intel(R) Xeon(R) CPU @ 2.20GHz
Stepping:                        0
CPU MHz:                         2200.134
BogoMIPS:                        4400.26
Hypervisor vendor:               KVM
Virtualization type:             full
L1d cache:                       32 KiB
L1i cache:                       32 KiB
L2 cache:                        256 KiB
L3 cache:                        55 MiB
NUMA node0 CPU(s):               0,1
Vulnerability 

In [5]:
!grep MemTotal /proc/meminfo

MemTotal:       13294264 kB


# Python <> RDD <> DataFrame

**Converting Python objects to an RDD**

**1> Create a Python list**

In [6]:
name_list_json = [ '{"name": "keeyong"}', '{"name": "benjamin"}', '{"name": "claire"}' ]

In [7]:
for n in name_list_json:
  print(n)

{"name": "keeyong"}
{"name": "benjamin"}
{"name": "claire"}


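**2> Convert the Python list to an RDD**

 * Once converted to an RDD, the data is split into partitions. On a Spark cluster the partitions are stored across the servers; in local mode they all live in the single JVM.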

In [8]:
rdd = spark.sparkContext.parallelize(name_list_json)

In [9]:
rdd

ParallelCollectionRDD[0] at readRDDFromFile at PythonRDD.scala:274

In [10]:
rdd.count()

3

In [11]:
import json

parsed_rdd = rdd.map(lambda el:json.loads(el))

In [12]:
parsed_rdd

PythonRDD[2] at RDD at PythonRDD.scala:53

In [13]:
parsed_rdd.collect()

[{'name': 'keeyong'}, {'name': 'benjamin'}, {'name': 'claire'}]

In [14]:
parsed_name_rdd = rdd.map(lambda el:json.loads(el)["name"])

In [15]:
parsed_name_rdd.collect()

['keeyong', 'benjamin', 'claire']

**Converting a Python list to a DataFrame**

In [16]:
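# One advantage of RDDs is that lambda functions are easy to use with them

from pyspark.sql.types import StringType

# Create a DataFrame directly from the list of JSON strings;
# each string becomes one row in a single string column named "value"
df = spark.createDataFrame(name_list_json, StringType())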

In [17]:
df.count()

3

In [18]:
df.printSchema()

root
 |-- value: string (nullable = true)



In [31]:
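# collect() brings the DataFrame over to the Python side as a list of Row objects
# (this cell was run after df was reloaded from name_gender.csv in In [22], hence the name and gender columns)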
df.select('*').collect()

[Row(name='Adaleigh', gender='F'),
 Row(name='Amryn', gender='Unisex'),
 Row(name='Apurva', gender='Unisex'),
 Row(name='Aryion', gender='M'),
 Row(name='Alixia', gender='F'),
 Row(name='Alyssarose', gender='F'),
 Row(name='Arvell', gender='M'),
 Row(name='Aibel', gender='M'),
 Row(name='Atiyyah', gender='F'),
 Row(name='Adlie', gender='F'),
 Row(name='Anyely', gender='F'),
 Row(name='Aamoni', gender='F'),
 Row(name='Ahman', gender='M'),
 Row(name='Arlane', gender='F'),
 Row(name='Armoney', gender='F'),
 Row(name='Atzhiry', gender='F'),
 Row(name='Antonette', gender='F'),
 Row(name='Akeelah', gender='F'),
 Row(name='Abdikadir', gender='M'),
 Row(name='Arinze', gender='M'),
 Row(name='Arshaun', gender='M'),
 Row(name='Alexandro', gender='M'),
 Row(name='Ayriauna', gender='F'),
 Row(name='Aqib', gender='M'),
 Row(name='Alleya', gender='F'),
 Row(name='Aavah', gender='F'),
 Row(name='Anesti', gender='Unisex'),
 Row(name='Adalaide', gender='F'),
 Row(name='Analena', gender='F'),
 Row(name=

Example of converting an RDD to a DataFrame: let's convert the parsed_rdd created earlier into a DataFrame.



In [32]:
df_parsed_rdd = parsed_rdd.toDF()

In [33]:
df_parsed_rdd.printSchema()

root
 |-- name: string (nullable = true)



In [34]:
df_parsed_rdd.select('name').collect()

[Row(name='keeyong'), Row(name='benjamin'), Row(name='claire')]

## Loading a CSV file into a Spark DataFrame

In [20]:
!wget https://s3-geospatial.s3-us-west-2.amazonaws.com/name_gender.csv

--2023-07-04 11:48:36--  https://s3-geospatial.s3-us-west-2.amazonaws.com/name_gender.csv
Resolving s3-geospatial.s3-us-west-2.amazonaws.com (s3-geospatial.s3-us-west-2.amazonaws.com)... 3.5.82.14, 3.5.85.15, 52.92.152.114, ...
Connecting to s3-geospatial.s3-us-west-2.amazonaws.com (s3-geospatial.s3-us-west-2.amazonaws.com)|3.5.82.14|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 997 [text/csv]
Saving to: ‘name_gender.csv’


2023-07-04 11:48:37 (46.4 MB/s) - ‘name_gender.csv’ saved [997/997]



In [21]:
df = spark.read.csv("name_gender.csv")
df.printSchema()

root
 |-- _c0: string (nullable = true)
 |-- _c1: string (nullable = true)



In [22]:
df = spark.read.option("header", True).csv("name_gender.csv")
df.printSchema()

root
 |-- name: string (nullable = true)
 |-- gender: string (nullable = true)



In [23]:
df.show()

+----------+------+
|      name|gender|
+----------+------+
|  Adaleigh|     F|
|     Amryn|Unisex|
|    Apurva|Unisex|
|    Aryion|     M|
|    Alixia|     F|
|Alyssarose|     F|
|    Arvell|     M|
|     Aibel|     M|
|   Atiyyah|     F|
|     Adlie|     F|
|    Anyely|     F|
|    Aamoni|     F|
|     Ahman|     M|
|    Arlane|     F|
|   Armoney|     F|
|   Atzhiry|     F|
| Antonette|     F|
|   Akeelah|     F|
| Abdikadir|     M|
|    Arinze|     M|
+----------+------+
only showing top 20 rows



In [24]:
df.head(5)

[Row(name='Adaleigh', gender='F'),
 Row(name='Amryn', gender='Unisex'),
 Row(name='Apurva', gender='Unisex'),
 Row(name='Aryion', gender='M'),
 Row(name='Alixia', gender='F')]

In [25]:
df.groupby(["gender"]).count().collect()

[Row(gender='F', count=65),
 Row(gender='M', count=28),
 Row(gender='Unisex', count=7)]

In [26]:
df.rdd.getNumPartitions()
# The number of partitions can be read from the DataFrame's underlying RDD.

1

Registering the DataFrame as a temporary view and querying it with Spark SQL

In [27]:
df.createOrReplaceTempView("namegender")

In [28]:
namegender_group_df = spark.sql("SELECT gender, count(1) FROM namegender GROUP BY 1")

In [29]:
namegender_group_df.collect()

[Row(gender='F', count(1)=65),
 Row(gender='M', count(1)=28),
 Row(gender='Unisex', count(1)=7)]

In [30]:
spark.catalog.listTables()

[Table(name='namegender', database=None, description=None, tableType='TEMPORARY', isTemporary=True)]

Checking the number of partitions

In [35]:
namegender_group_df.rdd.getNumPartitions()

1

In [36]:
# Repartition into 2 partitions
two_namegender_group_df = namegender_group_df.repartition(2)

In [37]:
two_namegender_group_df.rdd.getNumPartitions()

2