###IPL: Indian Premier League Data Analysis

Code Components:
1. Data Storage: Amazon S3
2. Databricks
    - Transformation (Spark)
    - SQL Analytics: SQL
    - Visualization 


####Ecosystem of Apache Spark
- Apache Spark Core: Executes all code passed to Spark. It is the heart of Spark
- Spark SQL: For SQL queries
- Spark Streaming: For processing real-time data
- MLlib (Machine learning): ML on large data
- GraphX (Graph): Graph processing and graph-parallel computation

####Apache Spark
Apache Spark is a unified computing engine and a set of libraries for parallel data processing on computer clusters.
In essence, it is a framework that partitions data across the machines of a cluster, processes the partitions in parallel, and combines the output at the end.
Its components, from the highest level to the lowest:
1. Structured Streaming, Advanced Analytics, Libraries and Ecosystem
2. Structured APIs: Datasets, DataFrames, SQL
3. Low-level APIs: RDDs, Distributed Variables

####Architecture of Apache Spark
A single machine does not have enough power and resources to process very large datasets, so a cluster is needed.
A cluster is a group of computers that pools the resources of many machines, giving us the ability to use all the cumulative resources as if they were a single computer.
A group of machines alone is not powerful; you need a framework to coordinate work across them. Spark does just that, managing and coordinating the execution of tasks on data across a cluster of computers.
![Delta arch](/files/tables/Shaurya/Spark_architecture_1.png)

Spark applications consist of two kinds of processes:
1. Driver process
> Runs the main() function and is responsible for:
> 1. Maintaining information about the Spark application
> 2. Responding to the user's program or input
> 3. Analyzing, distributing, and scheduling work across the executors
2. Executor processes
> 1. Executing the code assigned to them by the driver
> 2. Reporting the state of the computation on that executor back to the driver node
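
The split of responsibilities above can be sketched in plain Python. This is only an analogy, not how Spark is implemented: a thread pool stands in for the executors, and the names `run_task` and `driver` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def run_task(task_id, partition):
    """Executor side: run the assigned work and report state back."""
    even_count = sum(1 for n in partition if n % 2 == 0)
    return {"task_id": task_id, "state": "SUCCESS", "result": even_count}


def driver(data, num_partitions=4):
    """Driver side: split the data, schedule tasks, combine the reports."""
    # Slice the data into roughly equal partitions, one task per partition.
    partitions = [data[i::num_partitions] for i in range(num_partitions)]
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        futures = [pool.submit(run_task, i, p) for i, p in enumerate(partitions)]
        reports = [f.result() for f in futures]
    # Combine the partial results into the final answer.
    total = sum(r["result"] for r in reports)
    return total, reports


total, reports = driver(list(range(1000)))
print(total)  # count of even numbers in 0..999
```

Each report carries both a result and a state, mirroring the two executor duties listed above; the driver is the only place where the partial results are combined.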

#####SparkSession
You control your Spark application through a driver process called the SparkSession.
In Databricks it is created by default as the variable spark when the notebook runs, and it is the entry point to the application.

spark
O/P: <pyspark.sql.session.SparkSession>

#####Spark DataFrame:
A DataFrame is the most common Structured API and simply represents a table of data with rows and columns.
It is also distributed: the table's data can be spread across many machines in the cluster.

#####Transformations
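Transformations express the business logic applied to the data.
Spark builds a logical plan and then a physical plan for their execution.
They run only when an action is called.

myRange = spark.range(1000).toDF("number")  # assumed example DataFrame with a "number" column
divsBy2 = myRange.where("number % 2 == 0")  # Transformation: nothing runs yet
divsBy2.count()  # Action: triggers execution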

####Databricks
A managed platform that provides an Apache Spark environment.
There is no need to install the JVM and packages, configure environment paths on thousands of machines, or handle scaling and networking.
Databricks takes care of all of this, so you can focus on writing code.





In [0]:
%python
spark

In [0]:
%python
from pyspark.sql import SparkSession

#Create Session
spark = SparkSession.builder.appName("IPL Data Analysis").getOrCreate()


In [0]:
spark

In [0]:
# Create the DataFrame.
# By default every CSV column is read as a string; inferSchema asks Spark to detect the data types.
# Inference can guess wrong (for example, reading a boolean column as an integer), so defining your own schema is the better practice.
ball_by_ball_df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("s3://ipl-data-analysis-project/Ball_By_Ball.csv")
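
To illustrate why an explicit schema is safer than inference, here is a small plain-Python sketch (not Spark code). The sample rows, the `infer_type` helper, and the `schema` mapping are hypothetical; the column names are borrowed from this notebook. A column holding only 0 and 1 looks like an integer to a naive inference rule, while an explicit schema states the intended type.

```python
import csv
import io

# Hypothetical sample in the shape of Ball_By_Ball.csv (only three columns).
SAMPLE = "match_id,runs_scored,caught\n1,4,0\n1,0,1\n"


def infer_type(values):
    """Naive inference: call a column int if every value parses as int."""
    try:
        for v in values:
            int(v)
        return int
    except ValueError:
        return str


# Explicit schema: the author of the pipeline states each column's type.
schema = {
    "match_id": int,
    "runs_scored": int,
    "caught": lambda v: v == "1",  # assumed: boolean flag stored as 0/1
}

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
inferred = {name: infer_type([r[name] for r in rows]) for name in rows[0]}
typed_rows = [{name: cast(r[name]) for name, cast in schema.items()} for r in rows]

print(inferred["caught"])       # inference sees an integer column
print(typed_rows[1]["caught"])  # the explicit schema yields a boolean
```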

In [0]:
# import all packages needed
from pyspark.sql.types import StructField, StructType, IntegerType, StringType, BooleanType, DateType, DecimalType

In [0]:
ball_by_ball_schema = StructType([
    StructField("Match_id", IntegerType(), True),  # True means it can be null
    StructField("over_id", IntegerType(), True),
    StructField("ball_id", IntegerType(), True),
    StructField("innings_no", IntegerType(), True),
    StructField("team_batting", StringType(), True),
    StructField("team_bowling", StringType(), True),
    StructField("striker_batting_position", IntegerType(), True),
    StructField("extra_type", StringType(), True),
    StructField("runs_scored", IntegerType(), True),
    StructField("extra_runs", IntegerType(), True),
    StructField("wides", IntegerType(), True),
    StructField("legbyes", IntegerType(), True),
    StructField("byes", IntegerType(), True),
    StructField("noballs", IntegerType(), True),
    StructField("penalty", IntegerType(), True),
    StructField("bowler_extras", IntegerType(), True),
    StructField("out_type", StringType(), True),
    StructField("caught", BooleanType(), True),
    StructField("bowled", BooleanType(), True),
    StructField("run_out", BooleanType(), True),
    StructField("lbw", BooleanType(), True),
    StructField("retired_hurt", BooleanType(), True),
    StructField("stumped", BooleanType(), True),
    StructField("caught_and_bowled", BooleanType(), True),
    StructField("hit_wicket", BooleanType(), True),
    StructField("obstructingfeild", BooleanType(), True),
    StructField("bowler_wicket", BooleanType(), True),
    StructField("match_date", DateType(), True),
    StructField("season", IntegerType(), True),
    StructField("striker", IntegerType(), True),
    StructField("non_striker", IntegerType(), True),
    StructField("bowler", IntegerType(), True),
    StructField("player_out", IntegerType(), True),
    StructField("fielders", IntegerType(), True),
    StructField("striker_match_sk", IntegerType(), True),
    StructField("strikersk", IntegerType(), True),
    StructField("nonstriker_match_sk", IntegerType(), True),
    StructField("nonstriker_sk", IntegerType(), True),
    StructField("fielder_match_sk", IntegerType(), True),
    StructField("fielder_sk", IntegerType(), True),
    StructField("bowler_match_sk", IntegerType(), True),
    StructField("bowler_sk", IntegerType(), True),
    StructField("playerout_match_sk", IntegerType(), True),
    StructField("battingteam_sk", IntegerType(), True),
    StructField("bowlingteam_sk", IntegerType(), True),
    StructField("keeper_catch", BooleanType(), True),
    StructField("player_out_sk", IntegerType(), True),
    StructField("matchdatesk", DateType(), True)
])

In [0]:
ball_by_ball_df = spark.read.schema(ball_by_ball_schema).format("csv").option("header", "true").load("s3://ipl-data-analysis-project/Ball_By_Ball.csv")

In [0]:
ball_by_ball_df.show(5)

+--------+-------+-------+----------+------------+------------+------------------------+----------+-----------+----------+-----+-------+----+-------+-------+-------------+--------------+------+------+-------+----+------------+-------+-----------------+----------+----------------+-------------+----------+------+-------+-----------+------+----------+--------+----------------+---------+-------------------+-------------+----------------+----------+---------------+---------+------------------+--------------+--------------+------------+-------------+-----------+
|Match_id|over_id|ball_id|innings_no|team_batting|team_bowling|striker_batting_position|extra_type|runs_scored|extra_runs|wides|legbyes|byes|noballs|penalty|bowler_extras|      out_type|caught|bowled|run_out| lbw|retired_hurt|stumped|caught_and_bowled|hit_wicket|obstructingfeild|bowler_wicket|match_date|season|striker|non_striker|bowler|player_out|fielders|striker_match_sk|strikersk|nonstriker_match_sk|nonstriker_sk|fielder_match_sk|

In [0]:
match_schema = StructType([
    StructField("match_sk", IntegerType(), True),
    StructField("match_id", IntegerType(), True),
    StructField("team1", StringType(), True),
    StructField("team2", StringType(), True),
    StructField("match_date", DateType(), True),
    StructField("season_year", IntegerType(), True),
    StructField("venue_name", StringType(), True),
    StructField("city_name", StringType(), True),
    StructField("country_name", StringType(), True),
    StructField("toss_winner", StringType(), True),
    StructField("match_winner", StringType(), True),
    StructField("toss_name", StringType(), True),
    StructField("win_type", StringType(), True),
    StructField("outcome_type", StringType(), True),
    StructField("manofmach", StringType(), True),
    StructField("win_margin", IntegerType(), True),
    StructField("country_id", IntegerType(), True)
])

In [0]:
match_df = spark.read.schema(match_schema).format("csv").option("header", "true").load("s3://ipl-data-analysis-project/Match.csv")

In [0]:
player_schema = StructType([
    StructField("player_sk", IntegerType(), True),
    StructField("player_id", IntegerType(), True),
    StructField("player_name", StringType(), True),
    StructField("dob", DateType(), True),
    StructField("batting_hand", StringType(), True),
    StructField("bowling_skill", StringType(), True),
    StructField("country_name", StringType(), True)
])
player_df = spark.read.schema(player_schema).format("csv").option("header", "true").load("s3://ipl-data-analysis-project/Player.csv")

In [0]:
player_match_schema = StructType([
    StructField("player_match_sk", IntegerType(), True),
    StructField("playermatch_key", DecimalType(10, 0), True),
    StructField("match_id", IntegerType(), True),
    StructField("player_id", IntegerType(), True),
    StructField("player_name", StringType(), True),
    StructField("dob", DateType(), True),
    StructField("batting_hand", StringType(), True),
    StructField("bowling_skill", StringType(), True),
    StructField("country_name", StringType(), True),
    StructField("role_desc", StringType(), True),
    StructField("player_team", StringType(), True),
    StructField("opposit_team", StringType(), True),
    StructField("season_year", IntegerType(), True),
    StructField("is_manofthematch", BooleanType(), True),
    StructField("age_as_on_match", IntegerType(), True),
    StructField("isplayers_team_won", BooleanType(), True),
    StructField("batting_status", StringType(), True),
    StructField("bowling_status", StringType(), True),
    StructField("player_captain", StringType(), True),
    StructField("opposit_captain", StringType(), True),
    StructField("player_keeper", StringType(), True),
    StructField("opposit_keeper", StringType(), True)
])
player_match_df = spark.read.schema(player_match_schema).format("csv").option("header", "true").load("s3://ipl-data-analysis-project/Player_match.csv")

In [0]:
team_schema = StructType([
    StructField("team_sk", IntegerType(), True),
    StructField("team_id", IntegerType(), True),
    StructField("team_name", StringType(), True)
])
team_df = spark.read.schema(team_schema).format("csv").option("header", "true").load("s3://ipl-data-analysis-project/Team.csv")

Transformation - Structured API
> Lazy evaluation: No data is actually processed until an action is called.
   Spark waits until the very last moment to execute the graph of computation instructions.
```
   SQL             --->
   DataFrames      ---> Catalyst Optimizer   ----> Physical Plan  (a logical plan is created first)
   Datasets        --->
```
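
Lazy evaluation can be imitated in plain Python to make the idea concrete. The `LazyRange` class below is a hypothetical toy, not Spark's API: `where` only records a step in a plan, and nothing runs until the `count` action is called.

```python
class LazyRange:
    """Toy stand-in for a DataFrame with a single "number" column."""

    def __init__(self, n, plan=()):
        self.n = n
        self.plan = plan  # recorded transformations, not yet run

    def where(self, predicate):
        # Transformation: return a new object with a longer plan.
        return LazyRange(self.n, self.plan + (predicate,))

    def count(self):
        # Action: only now is the plan executed over the data.
        return sum(1 for x in range(self.n) if all(p(x) for p in self.plan))


calls = []


def is_even(x):
    calls.append(x)  # record each evaluation to show when work happens
    return x % 2 == 0


my_range = LazyRange(1000)
divs_by_2 = my_range.where(is_even)
evaluations_before_action = len(calls)  # nothing has been evaluated yet
result = divs_by_2.count()              # the action triggers execution
print(evaluations_before_action, result)
```

Because the plan is only a description until an action arrives, an engine is free to inspect and reorder it first, which is the opportunity an optimizer relies on.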


In [0]:
from pyspark.sql.functions import col, when, sum, avg, row_number

In [0]:

# Filter to include only valid deliveries (excluding extras such as wides and no-balls for this analysis)
ball_by_ball_df = ball_by_ball_df.filter((col("wides") == 0) & (col("noballs") == 0))

# Aggregation: Calculate the total and average runs scored in each match and inning
total_and_avg_runs = ball_by_ball_df.groupBy("match_id", "innings_no").agg(
  sum("runs_scored").alias("total_runs"),
  avg("runs_scored").alias("average_runs")
)
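
As a rough illustration of what the filter and aggregation compute, the same logic is sketched below in plain Python on a few made-up deliveries. The sample rows are hypothetical, and only the columns used by the query are included.

```python
from collections import defaultdict

# Hypothetical deliveries: (match_id, innings_no, runs_scored, wides, noballs)
deliveries = [
    (1, 1, 4, 0, 0),
    (1, 1, 0, 1, 0),  # wide: excluded by the filter
    (1, 1, 2, 0, 0),
    (1, 2, 6, 0, 0),
    (1, 2, 1, 0, 1),  # no-ball: excluded by the filter
]

# Filter: keep only deliveries with no wides and no no-balls.
valid = [d for d in deliveries if d[3] == 0 and d[4] == 0]

# Group by (match_id, innings_no) and collect the runs in each group.
runs_by_group = defaultdict(list)
for match_id, innings_no, runs_scored, _, _ in valid:
    runs_by_group[(match_id, innings_no)].append(runs_scored)

# Aggregate: total and average runs per group.
total_and_avg = {
    key: {"total_runs": sum(runs), "average_runs": sum(runs) / len(runs)}
    for key, runs in runs_by_group.items()
}
print(total_and_avg)
```

Note that the average is taken over the remaining rows, so `average_runs` is the average per valid delivery in each innings.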