# **Spark**

## **Setting Up Spark in Colab**

**Checking the version of Java installed on the system**

In [1]:
!java -version

openjdk version "1.8.0_252"
OpenJDK Runtime Environment (build 1.8.0_252-8u252-b09-1~18.04-b09)
OpenJDK 64-Bit Server VM (build 25.252-b09, mixed mode)


**Installing Java 8**

In [2]:
!apt-get install openjdk-8-jdk-headless -qq > /dev/null

**Downloading Spark**

In [3]:
!wget -q http://apachemirror.wuchna.com/spark/spark-2.4.6/spark-2.4.6-bin-hadoop2.7.tgz
!tar xf spark-2.4.6-bin-hadoop2.7.tgz

**Installing findspark**

In [4]:
!pip install findspark



**Setting path variables for Java and Spark**

In [5]:
import os 
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.4.6-bin-hadoop2.7"

In [6]:
import findspark
findspark.init()
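If `findspark.init()` fails, the usual cause is that one of the two environment variables is unset or points to a directory that does not exist (for example, because the download step did not complete). The following is a minimal sketch of a sanity check using only the standard library; the helper name `missing_spark_settings` is hypothetical and not part of findspark or Spark.

```python
import os


def missing_spark_settings(env=None, required=("JAVA_HOME", "SPARK_HOME")):
    """Return a list of human-readable problems with the Spark environment.

    `env` defaults to os.environ; passing a plain dict makes the check
    easy to try out without touching the real environment.
    """
    env = os.environ if env is None else env
    problems = []
    for name in required:
        path = env.get(name)
        if not path:
            problems.append(f"{name} is not set")
        elif not os.path.isdir(path):
            problems.append(f"{name} points to a missing directory: {path}")
    return problems


# Report anything that looks wrong in the current environment.
for problem in missing_spark_settings():
    print(problem)
```

An empty result means both variables are set and both directories exist; it does not guarantee that the directories contain a working Java or Spark installation.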

## **Basics of Spark DataFrames**

- Spark DataFrames store data in rows and named columns.
- Each column represents a feature or variable, and each row represents an individual data point.
- Spark DataFrames can read data from, and write data to, a variety of sources such as CSV and JSON files.
- We can apply transformations to the data and collect the results to visualize them or save them for further processing.
- To begin working with Spark DataFrames, we need to create a Spark Session.
- The Spark Session is the unified entry point of a Spark application; it provides a way to interact with Spark's various functionalities.

In [7]:
# importing SparkSession
from pyspark.sql import SparkSession

In [8]:
# Creating a SparkSession
spark = SparkSession.builder.appName('MyFirstSparkSession').getOrCreate()

- This creates our Spark session, or returns the existing one if a session is already running.
- We can use the `spark` variable throughout the rest of our script.

In order to work with real data, we first need to read a dataset.
- For this we can use the `read` attribute of the Spark session.
- The reader offers a method for each supported file type, such as `csv` and `json`.

We can load a CSV file from our filesystem as follows:

<code>
dataFrame_name = spark_session_variable.read.csv('filename')
</code>

For example, let's load the dataset employee.csv:

In [9]:
employee_df = spark.read.csv('/content/drive/My Drive/Repos/Git/Integrating Machine Learning with Big Data/PySpark/Dataset/employee.csv')

- Additional parameters such as `header` and `inferSchema` can be passed to the `read.csv()` method.
- The ***header*** parameter takes either True or False. When True, the first row of the dataset is used as the column names; when False (the default), the first row is treated as data.
- The ***inferSchema*** parameter also takes either True or False. When True, Spark assigns a datatype to each column based on the values it contains; when False (the default), every column is read as a string.

In [10]:
employee_df = spark.read.csv('/content/drive/My Drive/Repos/Git/Integrating Machine Learning with Big Data/PySpark/Dataset/employee.csv',header=True,inferSchema=True)

- We can view the contents of the DataFrame with the ***show()*** method, which prints the first 20 rows by default.
- Example: `df.show()`

In [11]:
employee_df.show()

+-----------+-------------+---+----------+-----+
|employee_id|employee_name|age|  location|hours|
+-----------+-------------+---+----------+-----+
|       G001|       Pichai| 47|California|   14|
|       M002|         Bill| 64|Washington|   10|
|       A003|         Jeff| 56|Washington|   11|
|       A004|         Cook| 59|California|   12|
+-----------+-------------+---+----------+-----+



- To list the features (columns) of the DataFrame, we can use `df.columns`.

In [12]:
employee_df.columns

['employee_id', 'employee_name', 'age', 'location', 'hours']

- We can also use ***df.printSchema()*** to print each column's name and data type, where ***df*** is the DataFrame that we created.

In [13]:
# Shows the datatype of the variables of the dataset
employee_df.printSchema()

root
 |-- employee_id: string (nullable = true)
 |-- employee_name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- location: string (nullable = true)
 |-- hours: integer (nullable = true)



## **Working with Rows and Columns in Spark DataFrame**

- Here, we will try to familiarize ourselves with the Spark DataFrame.
- We will try to get the rows and columns data and perform some operations on them.

### **Working with Columns in Spark**

- Accessing a column directly through the DataFrame.
  - To access a column through the DataFrame, we can index the DataFrame with the column name, that is `df['column_name']`. Accessing a column this way returns a Column object, not the column's data.

In [14]:
employee_df['age']

Column<b'age'>

- Using the select method of the DataFrame.
  - Instead of indexing the DataFrame with the column name, we can use the DataFrame's select method, which returns the required column as a new DataFrame.
  - **Syntax:** `df.select('column_name')`
  - We can also select multiple columns by passing a list of column names to the select method.
  - **Syntax:** `df.select(['column1','column2'])`

In [15]:
# Selecting and Displaying a single column
employee_df.select('age').show()

+---+
|age|
+---+
| 47|
| 64|
| 56|
| 59|
+---+



In [16]:
# Selecting and displaying employee_name and age
employee_df.select(['employee_name','age']).show()

+-------------+---+
|employee_name|age|
+-------------+---+
|       Pichai| 47|
|         Bill| 64|
|         Jeff| 56|
|         Cook| 59|
+-------------+---+
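On its own, `df['age']` only describes a column, but Column objects can be combined into expressions and passed to `select()` in place of column names. A minimal sketch, assuming a Spark DataFrame with `employee_name` and `age` columns like the one above; the helper name `select_name_and_next_age` and the `age_next_year` alias are hypothetical:

```python
def select_name_and_next_age(df):
    """Return a DataFrame with employee_name and a derived age column.

    `df` is assumed to be a Spark DataFrame with 'employee_name' and
    numeric 'age' columns. No Spark import is needed here because the
    function only calls methods on the DataFrame it is given.
    """
    # Adding a number to a Column yields a new Column expression;
    # alias() gives the resulting column a readable name.
    next_age = (df['age'] + 1).alias('age_next_year')
    # select() accepts Column objects as well as column-name strings.
    return df.select(df['employee_name'], next_age)
```

Calling `select_name_and_next_age(employee_df).show()` would display the two columns; the original `employee_df` is left unchanged.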



- Adding new columns
  - We can add new columns using the **withColumn()** method. It returns a new DataFrame with the column added, or with the existing column replaced if one with the same name already exists.
  - **Syntax:** `df.withColumn('new_column_name', column_itself)`
  - The first parameter is the name of the new column, and the second is a Column expression that supplies its values.

In [17]:
employee_df.show()

+-----------+-------------+---+----------+-----+
|employee_id|employee_name|age|  location|hours|
+-----------+-------------+---+----------+-----+
|       G001|       Pichai| 47|California|   14|
|       M002|         Bill| 64|Washington|   10|
|       A003|         Jeff| 56|Washington|   11|
|       A004|         Cook| 59|California|   12|
+-----------+-------------+---+----------+-----+



In [18]:
df_new = employee_df.withColumn('overtime_time',employee_df['hours'])
df_new.show()

+-----------+-------------+---+----------+-----+-------------+
|employee_id|employee_name|age|  location|hours|overtime_time|
+-----------+-------------+---+----------+-----+-------------+
|       G001|       Pichai| 47|California|   14|           14|
|       M002|         Bill| 64|Washington|   10|           10|
|       A003|         Jeff| 56|Washington|   11|           11|
|       A004|         Cook| 59|California|   12|           12|
+-----------+-------------+---+----------+-----+-------------+
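The example above simply copies `hours` into the new column. Because the second argument to `withColumn()` is a Column expression, it can also be computed from existing columns. The sketch below assumes, purely for illustration, that anything beyond 8 hours counts as overtime; the function name `add_overtime_column` and the `regular_hours` threshold are hypothetical and not part of the original dataset.

```python
def add_overtime_column(df, regular_hours=8):
    """Return a new DataFrame with a computed 'overtime_time' column.

    `df` is assumed to be a Spark DataFrame with a numeric 'hours' column.
    `regular_hours` is a made-up threshold used only for illustration.
    """
    # df['hours'] is a Column; subtracting a number yields a new Column
    # expression that Spark evaluates for every row.
    overtime = df['hours'] - regular_hours
    # withColumn() does not modify df; it returns a new DataFrame.
    return df.withColumn('overtime_time', overtime)
```

Calling `add_overtime_column(employee_df).show()` would display the original columns plus the computed `overtime_time`, while `employee_df` itself stays as it was.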



- We can rename columns using the **withColumnRenamed()** method.
  - Pass the old column name as the first parameter and the new name as the second parameter.
  - **Syntax:** `df.withColumnRenamed('old_column_name','new_column_name')`
  - It returns a new DataFrame.

In [19]:
df_new = df_new.withColumnRenamed('hours','working_hours')

In [20]:
df_new.show()

+-----------+-------------+---+----------+-------------+-------------+
|employee_id|employee_name|age|  location|working_hours|overtime_time|
+-----------+-------------+---+----------+-------------+-------------+
|       G001|       Pichai| 47|California|           14|           14|
|       M002|         Bill| 64|Washington|           10|           10|
|       A003|         Jeff| 56|Washington|           11|           11|
|       A004|         Cook| 59|California|           12|           12|
+-----------+-------------+---+----------+-------------+-------------+



- Dropping a column
  - We can drop a column using the **drop()** method.
  - **Syntax:** `df_new = df.drop('column_name')`
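Since `withColumnRenamed()` and `drop()` each return a new DataFrame, the calls can be chained into a single expression. A minimal sketch, assuming a Spark DataFrame that has `hours` and `overtime_time` columns (as `df_new` does right after the `withColumn()` step); the helper name `rename_and_drop` is hypothetical:

```python
def rename_and_drop(df):
    """Rename 'hours' to 'working_hours', then drop 'overtime_time'.

    `df` is assumed to be a Spark DataFrame. Each method returns a new
    DataFrame, so the input is never modified.
    """
    return (
        df.withColumnRenamed('hours', 'working_hours')  # step 1: rename
          .drop('overtime_time')                        # step 2: drop
    )
```

The result has to be assigned to a variable (or used directly, as in `rename_and_drop(df_new).show()`) for the changes to be visible, because the DataFrame passed in keeps its original columns.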