Let's run a simple data analysis on Apple stock data. Every answer must be computed with PySpark.

!pip install pyspark==3.0.1 py4j==0.10.9 

Create a Spark session

In [None]:
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark Dataframe basic example") \
    .getOrCreate()

- Load the Apple stock CSV file: https://pyspark-test-sj.s3-us-west-2.amazonaws.com/appl_stock.csv
- First load it into a pandas DataFrame, then convert that into a Spark DataFrame

In [None]:
import pandas as pd

apple_pandas_df = pd.read_csv("https://pyspark-test-sj.s3-us-west-2.amazonaws.com/appl_stock.csv")
apple_spark_df = spark.createDataFrame(apple_pandas_df)

In [None]:
apple_spark_df.columns

In [None]:
apple_spark_df.printSchema()

In [None]:
apple_spark_df.show(n=5)

In [None]:
apple_spark_df.describe().show()

In [None]:
from pyspark.sql.functions import mean

apple_spark_df.select(mean("Close")).show()

In [None]:
from pyspark.sql import functions as F

# Using the F alias keeps Spark's min/max from shadowing Python's built-in min/max
apple_spark_df.select(F.max("Volume"), F.min("Volume")).show()

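Create a DataFrame with a new column named HV ratio. The column's value is computed as High/Volume.

# Column name matches the "HV ratio" name given in the task description
apple_spark_df_with_hv = apple_spark_df.withColumn("HV ratio", apple_spark_df.High / apple_spark_df.Volume)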

In [None]:
apple_spark_df_with_hv.show(5)
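To make the HV ratio concrete, here is a minimal plain-Python sketch of the same per-row arithmetic that `withColumn` applies. It is only an illustration, not a replacement for the PySpark answer: the two rows are made-up values (not rows from appl_stock.csv), and `add_hv_ratio` is a hypothetical helper name.

```python
# Hypothetical sample rows (NOT taken from appl_stock.csv).
rows = [
    {"Date": "2010-01-04", "High": 200.0, "Volume": 100_000_000},
    {"Date": "2010-01-05", "High": 150.0, "Volume": 50_000_000},
]


def add_hv_ratio(row):
    """Return a copy of the row with an extra "HV ratio" key (High / Volume)."""
    enriched = dict(row)
    enriched["HV ratio"] = row["High"] / row["Volume"]
    return enriched


# Like withColumn, this builds new rows and leaves the originals untouched.
rows_with_hv = [add_hv_ratio(r) for r in rows]
for r in rows_with_hv:
    print(r["Date"], r["HV ratio"])
```

With sample values of this magnitude the ratio comes out on the order of 1e-6, so very small numbers in the new column are expected rather than a sign of a bug.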

Monthly average of the Close column

In [None]:
from pyspark.sql.functions import month

monthdf = apple_spark_df.withColumn("Month", month("Date"))

In [None]:
monthavgdf = monthdf.select(["Month", "Close"]).groupBy("Month").mean()

In [None]:
monthavgdf.show()

In [None]:
monthavgdf.select(["Month", "avg(Close)"]).orderBy("Month").show()
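As a sanity check on what the cells above compute, the following plain-Python sketch mirrors `month("Date")` followed by `groupBy("Month").mean()` and `orderBy("Month")`. The `samples` list holds made-up values (not rows from appl_stock.csv), and the variable names are illustrative assumptions.

```python
from collections import defaultdict
from datetime import date

# Hypothetical (date, close) pairs, NOT taken from appl_stock.csv.
samples = [
    (date(2010, 1, 4), 10.0),
    (date(2011, 1, 3), 20.0),
    (date(2010, 2, 1), 30.0),
]

# Group Close values by calendar month, ignoring the year,
# which mirrors month("Date") followed by groupBy("Month").
closes_by_month = defaultdict(list)
for day, close in samples:
    closes_by_month[day.month].append(close)

# Average each group and sort by month, like orderBy("Month").
monthly_avg = {m: sum(v) / len(v) for m, v in sorted(closes_by_month.items())}
print(monthly_avg)  # {1: 15.0, 2: 30.0}
```

Note that the January values from 2010 and 2011 are pooled into one group. Grouping by month alone averages across all years; if a per-year breakdown is wanted, the year would need to be part of the grouping key as well.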

## Analyzing the data with SparkSQL

- Run the same analysis through PySpark's SparkSQL


from pyspark.sql import SparkSession
import pandas as pd

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .getOrCreate()

apple_pandas_df = pd.read_csv("https://pyspark-test-sj.s3-us-west-2.amazonaws.com/appl_stock.csv")
apple_spark_df = spark.createDataFrame(apple_pandas_df)

In [None]:
# Register apple_spark_df as a temporary view named apple_stock

apple_spark_df.createOrReplaceTempView("apple_stock")

In [None]:
# Run various queries using SQL syntax

In [None]:
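# .show() is needed to print the column descriptions; without it only the DataFrame repr is displayed
spark.sql("desc apple_stock").show()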

In [None]:
spark.sql("SELECT * FROM apple_stock LIMIT 5").show()

In [None]:
spark.sql("SELECT AVG(close) FROM apple_stock").show()

In [None]:
spark.sql("SELECT MAX(volume), MIN(volume) FROM apple_stock").show()

In [None]:
apple_spark_df_with_hv = spark.sql(
    """SELECT *, high/volume AS hvratio FROM apple_stock"""
)

In [None]:
apple_spark_df_with_hv.show(5)

In [None]:
spark.sql("SELECT Month(date) month, AVG(close) FROM apple_stock GROUP BY 1 ORDER BY 1").show()
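The shape of the monthly query above (`GROUP BY 1 ORDER BY 1`, where `1` refers to the first SELECT expression) can be tried out without a Spark cluster. The sketch below uses Python's built-in `sqlite3` purely as a stand-in for Spark SQL. This is an assumption for illustration only: SQLite has no `MONTH()` function, so `strftime` is used instead, and the rows are made-up values rather than data from appl_stock.csv.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in for the Spark temp view.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE apple_stock (date TEXT, close REAL)")

# Hypothetical rows, NOT taken from appl_stock.csv.
conn.executemany(
    "INSERT INTO apple_stock VALUES (?, ?)",
    [("2010-01-04", 10.0), ("2011-01-03", 20.0), ("2010-02-01", 30.0)],
)

# SQLite has no MONTH(); strftime('%m', ...) extracts the month instead.
# GROUP BY 1 / ORDER BY 1 refer to the first SELECT expression, as in the Spark query.
query = """
    SELECT CAST(strftime('%m', date) AS INTEGER) AS month, AVG(close)
    FROM apple_stock
    GROUP BY 1
    ORDER BY 1
"""
result = conn.execute(query).fetchall()
print(result)  # [(1, 15.0), (2, 30.0)]
conn.close()
```

The result has the same structure as the Spark SQL output: one row per month, sorted by month, with the Close values from every year averaged together.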