<a href="https://colab.research.google.com/github/ASN-Lab/Big-Data/blob/main/Week_3.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Processing with Apache Spark

### 1. Creating a simple DataFrame in Spark with basic built-in functions

In [11]:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('HandsOnPertemuan3').getOrCreate()

data = [('James', 'Sales', 3000),
        ('Michael', 'Sales', 4600),
        ('Robert', 'Sales', 4100),
        ('Maria', 'Finance', 3000)]
columns = ['Employee Name', 'Department', 'Salary']

print("Simple DataFrame:")
df = spark.createDataFrame(data, schema=columns)
df.show()

Simple DataFrame:
+-------------+----------+------+
|Employee Name|Department|Salary|
+-------------+----------+------+
|        James|     Sales|  3000|
|      Michael|     Sales|  4600|
|       Robert|     Sales|  4100|
|        Maria|   Finance|  3000|
+-------------+----------+------+
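
When `schema` is given as a list of names, `createDataFrame` matches each tuple field to the column name at the same position. The plain-Python sketch below mimics only that positional pairing. It does not use Spark, and `records` is a name introduced here for illustration.

```python
# Same sample rows and column names as in the cell above.
data = [('James', 'Sales', 3000),
        ('Michael', 'Sales', 4600),
        ('Robert', 'Sales', 4100),
        ('Maria', 'Finance', 3000)]
columns = ['Employee Name', 'Department', 'Salary']

# Pair each tuple field with the column name at the same index.
# `records` is a hypothetical name, not part of the Spark API.
records = [dict(zip(columns, row)) for row in data]

for record in records:
    print(record)
```

If a tuple had fewer or more fields than `columns`, `zip` would silently truncate, whereas Spark would be expected to reject the mismatch, so this is only a mental model of the name-to-position mapping.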



### 2. Transforming data in a DataFrame

In [16]:
print("Initial DataFrame:")
df = spark.createDataFrame(data, schema=columns)
df.show()

print("Select the Employee Name and Salary columns:")
df.select('Employee Name', 'Salary').show()

print("Filter rows where Salary is greater than 3000:")
df.filter(df['Salary'] > 3000).show()

print("Group by Department and compute the average Salary:")
df.groupBy('Department').avg('Salary').show()

Initial DataFrame:
+-------------+----------+------+
|Employee Name|Department|Salary|
+-------------+----------+------+
|        James|     Sales|  3000|
|      Michael|     Sales|  4600|
|       Robert|     Sales|  4100|
|        Maria|   Finance|  3000|
+-------------+----------+------+

Select the Employee Name and Salary columns:
+-------------+------+
|Employee Name|Salary|
+-------------+------+
|        James|  3000|
|      Michael|  4600|
|       Robert|  4100|
|        Maria|  3000|
+-------------+------+

Filter rows where Salary is greater than 3000:
+-------------+----------+------+
|Employee Name|Department|Salary|
+-------------+----------+------+
|      Michael|     Sales|  4600|
|       Robert|     Sales|  4100|
+-------------+----------+------+

Group by Department and compute the average Salary:
+----------+-----------+
|Department|avg(Salary)|
+----------+-----------+
|     Sales|     3900.0|
|   Finance|     3000.0|
+----------+-----------+
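
To sanity-check the three transformations, the same select, filter, and group-by-average steps can be reproduced in plain Python on the four sample rows. This is a minimal sketch without Spark; `selected`, `filtered`, and `average_salary` are names introduced here for illustration.

```python
from collections import defaultdict

# Same sample rows as the DataFrame: (Employee Name, Department, Salary).
data = [('James', 'Sales', 3000),
        ('Michael', 'Sales', 4600),
        ('Robert', 'Sales', 4100),
        ('Maria', 'Finance', 3000)]

# select: keep only the name and salary fields of every row.
selected = [(name, salary) for name, _, salary in data]

# filter: keep rows whose salary is strictly greater than 3000.
filtered = [row for row in data if row[2] > 3000]

# groupBy + avg: collect salaries per department, then average each group.
salaries_by_department = defaultdict(list)
for _, department, salary in data:
    salaries_by_department[department].append(salary)
average_salary = {dept: sum(values) / len(values)
                  for dept, values in salaries_by_department.items()}

print(selected)
print(filtered)
print(average_salary)
```

The Sales average is (3000 + 4600 + 4100) / 3 = 3900, which agrees with the `avg(Salary)` column in the Spark output.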

### 3. Working with complex data types in Spark DataFrames

In [14]:
print("Initial DataFrame:")
df = spark.createDataFrame(data, schema=columns)
df.show()

print("Adding a new column named Salary Bonus:")
df = df.withColumn('Salary Bonus', df['Salary'] * 0.1)
df.show()

print("Adding a new column named Total Compensation:")
df = df.withColumn('Total Compensation', df['Salary'] + df['Salary Bonus'])
df.show()

Initial DataFrame:
+-------------+----------+------+
|Employee Name|Department|Salary|
+-------------+----------+------+
|        James|     Sales|  3000|
|      Michael|     Sales|  4600|
|       Robert|     Sales|  4100|
|        Maria|   Finance|  3000|
+-------------+----------+------+

Adding a new column named Salary Bonus:
+-------------+----------+------+------------+
|Employee Name|Department|Salary|Salary Bonus|
+-------------+----------+------+------------+
|        James|     Sales|  3000|       300.0|
|      Michael|     Sales|  4600|       460.0|
|       Robert|     Sales|  4100|       410.0|
|        Maria|   Finance|  3000|       300.0|
+-------------+----------+------+------------+

Adding a new column named Total Compensation:
+-------------+----------+------+------------+------------------+
|Employee Name|Department|Salary|Salary Bonus|Total Compensation|
+-------------+----------+------+------------+------------------+
|        James|     Sales|  3000|     
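
The two derived columns are plain row-wise arithmetic: the bonus is 10% of Salary, and the total is Salary plus that bonus. The sketch below repeats the calculation in plain Python, without Spark. `BONUS_RATE` and `compensation` are names introduced here for illustration.

```python
# Same sample rows as the DataFrame: (Employee Name, Department, Salary).
data = [('James', 'Sales', 3000),
        ('Michael', 'Sales', 4600),
        ('Robert', 'Sales', 4100),
        ('Maria', 'Finance', 3000)]

BONUS_RATE = 0.1  # same factor as in the withColumn call; the name is hypothetical

# Map each employee to (salary bonus, total compensation).
compensation = {}
for name, _, salary in data:
    bonus = salary * BONUS_RATE   # corresponds to the 'Salary Bonus' column
    total = salary + bonus        # corresponds to the 'Total Compensation' column
    compensation[name] = (bonus, total)

for name, (bonus, total) in compensation.items():
    print(f"{name:8} bonus={bonus:.1f} total={total:.1f}")
```

Because the second `withColumn` call reads `Salary Bonus`, the order of the two calls matters: the bonus column has to exist before the total can be computed from it.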

### 4. Implementing a window function

In [17]:
from pyspark.sql.window import Window
from pyspark.sql import functions as F

# Define the window specification: one partition per Department, ordered by Salary (ascending)
windowSpec = Window.partitionBy('Department').orderBy('Salary')

# Add a Rank column based on Salary within each Department
df.withColumn('Rank', F.rank().over(windowSpec)).show()

+-------------+----------+------+----+
|Employee Name|Department|Salary|Rank|
+-------------+----------+------+----+
|        Maria|   Finance|  3000|   1|
|        James|     Sales|  3000|   1|
|       Robert|     Sales|  4100|   2|
|      Michael|     Sales|  4600|   3|
+-------------+----------+------+----+
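
The sample data has no equal salaries within a department, so the output above does not show how ranking treats ties. The plain-Python sketch below mimics, without Spark, what `row_number`, `rank`, and `dense_rank` would be expected to produce over a window partitioned by department and ordered by salary. The extra employee `Ana` is a hypothetical row added only to create a tie, and `rank_within_partitions` is a helper name introduced here.

```python
from itertools import groupby

# (Employee Name, Department, Salary); 'Ana' is hypothetical and not in the original data.
rows = [
    ('James', 'Sales', 3000),
    ('Ana', 'Sales', 3000),
    ('Robert', 'Sales', 4100),
    ('Michael', 'Sales', 4600),
    ('Maria', 'Finance', 3000),
]


def rank_within_partitions(rows):
    """Return {name: (row_number, rank, dense_rank)} per department, by ascending salary."""
    result = {}
    # Sort by department (the partition key), then by salary (the ordering key).
    ordered = sorted(rows, key=lambda r: (r[1], r[2]))
    for _, partition in groupby(ordered, key=lambda r: r[1]):
        previous_salary = None
        rank = dense_rank = 0
        for position, (name, _, salary) in enumerate(partition, start=1):
            if salary != previous_salary:
                rank = position      # jumps to the row position, leaving gaps after ties
                dense_rank += 1      # advances by one per distinct salary, no gaps
                previous_salary = salary
            # position plays the role of row_number; among tied rows its order is
            # arbitrary in Spark, here it simply follows the input order.
            result[name] = (position, rank, dense_rank)
    return result


ranks = rank_within_partitions(rows)
for name, (row_number, rank, dense_rank) in ranks.items():
    print(f"{name:8} row_number={row_number} rank={rank} dense_rank={dense_rank}")
```

In this sketch the two tied Sales employees share rank 1, the next employee gets rank 3 but dense rank 2. To rank the highest salary first instead, the ordering key would be reversed, which in Spark corresponds to ordering the window by `F.desc('Salary')`.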



### 5. Analyzing income data from a CSV file

In [23]:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Initialize the SparkSession
spark = SparkSession.builder.appName("Analisis Pendapatan").getOrCreate()

# Path to the CSV file
file_path = "Pendapatan.csv"

# Load the CSV data into a Spark DataFrame
print("Loading data from:", file_path)
df_pendapatan = spark.read.csv(file_path, header=True, inferSchema=True)
print("Data loaded successfully!")

# Initial data exploration
print("\nDataFrame:")
df_pendapatan.show()

print("\nData schema:")
df_pendapatan.printSchema()

print("\nSummary statistics:")
df_pendapatan.describe().show()

# Number of workers by gender
print("\nNumber of workers by gender:")
df_pendapatan.groupBy("Jenis Kelamin").count().show()

# Average worker age by salary
print("\nAverage worker age by salary:")
df_pendapatan.groupBy("Gaji").agg(F.avg("Umur").alias("Rata_rata_Umur")).show()

# Most common occupations
print("\nTop 5 most common occupations:")
df_pendapatan.groupBy("Pekerjaan").count().orderBy(F.desc("count")).show(5)

# Average salary by education level
print("\nAverage salary by education level:")
df_pendapatan.groupBy("Pendidikan").agg(F.avg("Gaji").alias("Rata_rata_Gaji")).orderBy(F.desc("Rata_rata_Gaji")).show()

# Average capital gain by marital status
print("\nAverage capital gain by marital status, highest first:")
df_pendapatan.groupBy("Status Perkawinan").agg(F.avg("Keuntungan Kapital").alias("Rata_rata_Keuntungan_Kapital")).orderBy(F.desc("Rata_rata_Keuntungan_Kapital")).show()

# Stop the SparkSession
spark.stop()

Loading data from: Pendapatan.csv
Data loaded successfully!

DataFrame:
+-----+----+--------------------+-----------+-----------------+---------------------+--------------------+--------------------+-------------+------------------+----------------+--------------+----+
|   id|Umur|       Kelas Pekerja|Berat Akhir|       Pendidikan|Jmlh Tahun Pendidikan|   Status Perkawinan|           Pekerjaan|Jenis Kelamin|Keuntungan Kapital|Kerugian Capital|Jam per Minggu|Gaji|
+-----+----+--------------------+-----------+-----------------+---------------------+--------------------+--------------------+-------------+------------------+----------------+--------------+----+
|27247|  59|   Pemerintah Negara|     139616|           Master|                   14|             Menikah|Ekesekutif Manage...|        Laki2|               0.0|             0.0|          50.0|   1|
| 1640|  52|          Wiraswasta|     158993|              SMA|                    9|               Cerai|      Servis Lainnya|    Perempua