# Apache Spark

Apache Spark is an open-source, unified analytics engine for large-scale data processing and machine learning. It provides high-level APIs in Java, Python, Scala, and R, and an optimized engine that supports general computation graphs for data analysis.

## Key Features:
- **Speed**: Spark keeps intermediate data in memory where possible, and the project has reported speedups of up to 100 times over Hadoop MapReduce for some in-memory workloads.
- **Ease of Use**: Spark provides high-level APIs and a simple programming model, making it easy to write applications.
- **Flexibility**: Spark can handle batch processing, stream processing, machine learning, and graph processing, all in a single platform.
- **Unified Engine**: Spark's engine is designed to handle multiple workloads, making it a versatile tool for data analysis.

## Components:
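- **Spark Core**: The foundation of Spark, providing task scheduling, memory management, fault recovery, and the RDD (Resilient Distributed Dataset) abstraction that the other modules build on.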
- **Spark SQL**: A module for structured data processing, with support for SQL queries.
- **Spark Streaming**: A module for near-real-time processing of data streams, which it handles as a sequence of small batches.
- **MLlib**: A machine learning library, providing algorithms for classification, regression, clustering, and more.
- **GraphX**: A module for graph processing and analytics.

# PySpark

PySpark is the Python API for Apache Spark. It lets developers write Spark applications in Python and gives them access to Spark's engine for data processing, machine learning, and analytics.

## Key Features:
- **Pythonic API**: PySpark provides a Python-friendly API, making it easy to write Spark applications.
- **Dynamic Typing**: PySpark supports dynamic typing, allowing for flexible data processing.
- **Integration with Spark**: PySpark is built on top of Spark's engine, providing access to Spark's features and performance.
- **DataFrames**: PySpark supports DataFrames, providing a structured data processing API. The typed Dataset API is available only in Scala and Java.
- **Machine Learning**: PySpark provides access to Spark's MLlib, enabling machine learning tasks.

## Documentation
- [Documentation - Apache Spark](https://spark.apache.org/docs/latest/)
- [Documentation - PySpark](https://spark.apache.org/docs/latest/api/python/index.html)
- [Wikipedia - Apache Spark](https://en.wikipedia.org/wiki/Apache_Spark)

# Imports

In [1]:
# Install pyspark
!pip install pyspark

# Import the other common libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Import pyspark and create a session
import pyspark
from pyspark.sql import SparkSession
# spark = SparkSession.builder.getOrCreate()
spark = SparkSession.builder.appName('YbiFoundation').getOrCreate()

Collecting pyspark
  Downloading pyspark-3.5.1.tar.gz (317.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 317.0/317.0 MB 4.1 MB/s eta 0:00:00
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.5.1-py2.py3-none-any.whl size=317488493 sha256=ea66e39a295c20ad71a5276d4814ca0f9bf2a20027a0ef10d4f52377d219ccc0
  Stored in directory: /root/.cache/pip/wheels/80/1d/60/2c256ed38dddce2fdd93be545214a63e02fbd8d74fb0b7f3a6
Successfully built pyspark
Installing collected packages: pyspark
Successfully installed pyspark-3.5.1


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/08/10 19:15:23 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


# Reading data and data exploration

In [27]:
# Load the CSV file from GitHub into a pandas DataFrame, then convert it to a Spark DataFrame

df = pd.read_csv('https://github.com/YBIFoundation/BigData/raw/main/HR50k.csv')
df = spark.createDataFrame(df)

In [28]:
# Show the first row, displayed vertically
df.show(1, vertical=True)

-RECORD 0------------------------------
 Age                      | 31         
 Attrition                | No         
 BusinessTravel           | Non-Travel 
 DailyRate                | 158        
 Department               | Software   
 DistanceFromHome         | 7          
 Education                | 3          
 EducationField           | Medical    
 EmployeeCount            | 1          
 EmployeeNumber           | 1          
 EnvironmentSatisfaction  | 3          
 Gender                   | Male       
 HourlyRate               | 42         
 JobInvolvement           | 2          
 JobLevel                 | 3          
 JobRole                  | Developer  
 JobSatisfaction          | 1          
 MaritalStatus            | Married    
 MonthlyIncome            | 42682      
 MonthlyRate              | 298774     
 NumCompaniesWorked       | 2          
 Over18                   | Y          
 OverTime                 | No         
 PercentSalaryHike        | 20         


In [29]:
# Get the column names of the DataFrame
df.columns

['Age',
 'Attrition',
 'BusinessTravel',
 'DailyRate',
 'Department',
 'DistanceFromHome',
 'Education',
 'EducationField',
 'EmployeeCount',
 'EmployeeNumber',
 'EnvironmentSatisfaction',
 'Gender',
 'HourlyRate',
 'JobInvolvement',
 'JobLevel',
 'JobRole',
 'JobSatisfaction',
 'MaritalStatus',
 'MonthlyIncome',
 'MonthlyRate',
 'NumCompaniesWorked',
 'Over18',
 'OverTime',
 'PercentSalaryHike',
 'PerformanceRating',
 'RelationshipSatisfaction',
 'StandardHours',
 'StockOptionLevel',
 'TotalWorkingYears',
 'TrainingTimesLastYear',
 'WorkLifeBalance',
 'YearsAtCompany',
 'YearsInCurrentRole',
 'YearsSinceLastPromotion',
 'YearsWithCurrManager']

In [30]:
# Print the schema of the DataFrame to understand its structure and data types
df.printSchema()

root
 |-- Age: long (nullable = true)
 |-- Attrition: string (nullable = true)
 |-- BusinessTravel: string (nullable = true)
 |-- DailyRate: long (nullable = true)
 |-- Department: string (nullable = true)
 |-- DistanceFromHome: long (nullable = true)
 |-- Education: long (nullable = true)
 |-- EducationField: string (nullable = true)
 |-- EmployeeCount: long (nullable = true)
 |-- EmployeeNumber: long (nullable = true)
 |-- EnvironmentSatisfaction: long (nullable = true)
 |-- Gender: string (nullable = true)
 |-- HourlyRate: long (nullable = true)
 |-- JobInvolvement: long (nullable = true)
 |-- JobLevel: long (nullable = true)
 |-- JobRole: string (nullable = true)
 |-- JobSatisfaction: long (nullable = true)
 |-- MaritalStatus: string (nullable = true)
 |-- MonthlyIncome: long (nullable = true)
 |-- MonthlyRate: long (nullable = true)
 |-- NumCompaniesWorked: long (nullable = true)
 |-- Over18: string (nullable = true)
 |-- OverTime: string (nullable = true)
 |-- PercentSalaryHike: 

In [31]:
# Enable eager evaluation so that a DataFrame is rendered as a table when it is the last expression in a cell
spark.conf.set('spark.sql.repl.eagerEval.enabled', True)
# Display the DataFrame; with eager evaluation enabled, no explicit show() call is needed
df

Age,Attrition,BusinessTravel,DailyRate,Department,DistanceFromHome,Education,EducationField,EmployeeCount,EmployeeNumber,EnvironmentSatisfaction,Gender,HourlyRate,JobInvolvement,JobLevel,JobRole,JobSatisfaction,MaritalStatus,MonthlyIncome,MonthlyRate,NumCompaniesWorked,Over18,OverTime,PercentSalaryHike,PerformanceRating,RelationshipSatisfaction,StandardHours,StockOptionLevel,TotalWorkingYears,TrainingTimesLastYear,WorkLifeBalance,YearsAtCompany,YearsInCurrentRole,YearsSinceLastPromotion,YearsWithCurrManager
31,No,Non-Travel,158,Software,7,3,Medical,1,1,3,Male,42,2,3,Developer,1,Married,42682,298774,2,Y,No,20,4,1,80,2,15,1,2,12,4,10,11
38,No,Travel_Rarely,985,Human Resources,33,5,Life Sciences,1,2,1,Female,66,2,4,Healthcare Repres...,3,Single,45252,45252,8,Y,No,2,1,3,80,4,5,4,3,1,1,1,1
59,Yes,Non-Travel,1273,Sales,5,2,Technical Degree,1,3,4,Female,96,1,3,Manufacturing Dir...,2,Married,46149,507639,7,Y,Yes,39,3,2,80,2,9,5,1,6,6,4,3
52,Yes,Travel_Rarely,480,Support,2,5,Marketing,1,4,4,Female,71,2,4,Human Resources,1,Married,27150,27150,4,Y,No,16,3,2,80,2,22,4,4,10,9,5,6
32,No,Non-Travel,543,Human Resources,7,5,Human Resources,1,5,2,Male,122,3,3,Manager,2,Divorced,15894,47682,6,Y,Yes,42,3,4,80,2,30,3,4,29,27,9,7
19,Yes,Non-Travel,779,Hardware,43,1,Medical,1,6,2,Female,195,4,3,Research Director,3,Married,41552,1246560,3,Y,Yes,15,4,3,80,1,33,4,2,16,4,14,3
42,Yes,Non-Travel,934,Support,26,4,Human Resources,1,7,2,Female,80,3,5,Sales Executive,4,Divorced,5303,148484,3,Y,No,45,4,1,80,1,4,3,4,2,1,1,2
30,No,Travel_Rarely,380,Support,19,3,Marketing,1,8,4,Male,165,1,4,Human Resources,4,Single,28555,571100,2,Y,Yes,35,3,2,80,1,2,2,2,2,2,2,2
41,No,Travel_Frequently,1464,Software,16,1,Life Sciences,1,9,3,Male,134,1,2,Manager,4,Divorced,3241,87507,7,Y,No,1,1,3,80,2,8,1,2,2,1,2,2
45,No,Travel_Frequently,1020,Human Resources,17,5,Life Sciences,1,10,4,Female,137,2,4,Manager,2,Married,4323,116721,4,Y,Yes,32,1,3,80,4,6,4,4,5,3,4,1


In [32]:
# Display summary statistics for the DataFrame, with each statistic displayed vertically for easier reading
df.describe().show(vertical=True)

24/08/10 19:38:24 WARN TaskSetManager: Stage 32 contains a task of very large size (1021 KiB). The maximum recommended task size is 1000 KiB.

-RECORD 0----------------------------------------
 summary                  | count                
 Age                      | 50000                
 Attrition                | 50000                
 BusinessTravel           | 50000                
 DailyRate                | 50000                
 Department               | 50000                
 DistanceFromHome         | 50000                
 Education                | 50000                
 EducationField           | 50000                
 EmployeeCount            | 50000                
 EmployeeNumber           | 50000                
 EnvironmentSatisfaction  | 50000                
 Gender                   | 50000                
 HourlyRate               | 50000                
 JobInvolvement           | 50000                
 JobLevel                 | 50000                
 JobRole                  | 50000                
 JobSatisfaction          | 50000                
 MaritalStatus            | 50000                


                                                                                