# ***Spark Basics***

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


### ***Installing the PySpark environment on Google Colab***




In [2]:
!pip install pyspark py4j

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyspark
  Downloading pyspark-3.3.1.tar.gz (281.4 MB)
Collecting py4j
  Downloading py4j-0.10.9.7-py2.py3-none-any.whl (200 kB)
  Downloading py4j-0.10.9.5-py2.py3-none-any.whl (199 kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.3.1-py2.py3-none-any.whl size=281845514 sha256=46347fd7bf390892a40882a525e26929435de9ba7a5b47d04f446fe741d1a350
  Stored in directory: /root/.cache/pip/wheels/42/59/f5/79a5bf931714dcd201b26025347785f087370a10a3329a899c
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9.5 pyspark-3.3.1


### **Starting a Spark Session**

In [3]:
from pyspark.sql import SparkSession

In [4]:
spark = SparkSession.builder.appName('Basics').getOrCreate()

### ***Reading data***

In [17]:
df = spark.read.json("/content/drive/MyDrive/SparkWork/SparkDoc/people1.json")
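
By default, `spark.read.json` expects one JSON object per line (JSON Lines) rather than a single JSON array. The sketch below uses only the standard library to show that layout. The file contents and the temporary path are assumptions chosen to match the rows displayed in this notebook; they are not a copy of the actual `people1.json`.

```python
import json
import os
import tempfile

# Assumed contents: one JSON object per line (JSON Lines),
# picked to match the rows that df.show() prints in this notebook.
sample_lines = [
    '{"name": "Michael"}',
    '{"name": "Andy", "age": 30}',
    '{"name": "Justin", "age": 19}',
]

# Hypothetical temporary file, not the Google Drive path used above.
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "people_sample.json")
with open(sample_path, "w") as f:
    f.write("\n".join(sample_lines) + "\n")

# Parse each line on its own, mirroring the line-by-line layout.
with open(sample_path) as f:
    records = [json.loads(line) for line in f if line.strip()]

# A missing key becomes None here; Spark displays it as null.
rows = [(r.get("age"), r["name"]) for r in records]
print(rows)
```

A record without an `age` key is still valid: the column is simply empty for that row, which is why Michael's age appears as `null`.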


### ***Show: displaying the contents of a DataFrame***

In [18]:
df.show()

+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+



### ***To display column names***

In [24]:
df.columns

['age', 'name']

### ***Printing the schema of a DataFrame***

*   `printSchema()` prints the schema of the DataFrame: each column's name, its data type (long, string, etc.), and whether it is nullable.



In [25]:
df.printSchema()

root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)



### **Describe**


*   Returns a DataFrame of summary statistics (count, mean, stddev, min, max) for the numeric and string columns of the DataFrame.
*   Call `show()` on the result to see these values. Without `show()`, the cell only displays the DataFrame object, which lists the column names and their data types.



In [27]:
df.describe()

DataFrame[summary: string, age: string, name: string]

In [28]:
df.describe().show()

+-------+------------------+-------+
|summary|               age|   name|
+-------+------------------+-------+
|  count|                 2|      3|
|   mean|              24.5|   null|
| stddev|7.7781745930520225|   null|
|    min|                19|   Andy|
|    max|                30|Michael|
+-------+------------------+-------+
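
The `age` figures in this table can be checked by hand. The sketch below recomputes them with Python's `statistics` module from the two non-null ages shown earlier; the variable names are illustrative only. It suggests that `count` skips null values and that `stddev` is the sample standard deviation (dividing by n - 1), not the population one.

```python
import statistics

# Non-null ages from the df.show() output; Michael's age is null and is skipped.
ages = [30, 19]

count = len(ages)
mean = statistics.mean(ages)
# statistics.stdev divides by (n - 1): the sample standard deviation.
sample_std = statistics.stdev(ages)
# statistics.pstdev divides by n: the population standard deviation.
population_std = statistics.pstdev(ages)

print(count, mean, sample_std, population_std)
```

The sample value agrees with the `stddev` row above, while the population value comes out smaller.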



## **Creating a new schema for dataframes**


*   The data we receive for processing is not always neat: the schema may be incomplete or damaged. We still need a clear schema to process the data and draw sound conclusions.
*   So we need to specify the schema explicitly, i.e. which columns are strings, which are integers, and so on.
*   To do that we need some type tools from `pyspark.sql.types`.



**Importing type tools**

In [30]:
from pyspark.sql.types import StructField, StringType, IntegerType, StructType


**Specifying the new schema**

In [31]:
data_schema = [StructField('age', IntegerType(), True), StructField('name', StringType(), True)]

**Creating the new schema structure**

In [34]:
final_struc = StructType(fields = data_schema)

**Creating a new DataFrame using the new schema and checking the schema**

In [36]:
df = spark.read.json("/content/drive/MyDrive/SparkWork/SparkDoc/people1.json", schema = final_struc)

In [38]:
df.printSchema()

root
 |-- age: integer (nullable = true)
 |-- name: string (nullable = true)

