## Setting Up PySpark Environment

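This code cell prepares the Python environment for PySpark.

- **Input**: None
- **Actions**:
  - Imports `pyspark`, `SparkContext`, and `SparkSession`.
  - Points `PYSPARK_PYTHON` and `PYSPARK_DRIVER_PYTHON` at the current interpreter (`sys.executable`), so the driver and the workers run the same Python.
- **Output**: None. No SparkContext is created here; the Spark session is initialized in the next cell.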


In [None]:
import pyspark
import os
import sys
from pyspark import SparkContext
os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable

from pyspark.sql import SparkSession

## Initializing Spark Session

This code cell initializes a Spark session with specific configurations.

- **Configuration**:
  - Spark driver memory: 16 GB
  - Application name: chapter_8
- **Output**: SparkSession initialized with the specified configurations.


In [None]:
spark = SparkSession.builder.config("spark.driver.memory", "16g").appName('chapter_8').getOrCreate()

0.0.1 Preparing the Data

## Reading Multiple CSV Files

This code cell reads multiple CSV files into a Spark DataFrame.

- **Files Read**: ABAX.csv, AAME.csv, AEPI.csv
- **Headers**: The first row of each CSV file is treated as the header.
- **Schema Inference**: Column types are inferred automatically from the data.
- **Output**: Displaying the first two rows of the DataFrame.


In [None]:
stocks = spark.read.csv(["data/stocksA/ABAX.csv","data/stocksA/AAME.csv","data/stocksA/AEPI.csv"], header='true', inferSchema='true')

#stocks=spark.read.format("csv").option("inferSchema","true").option("header","true").load('C:/Users/HP/Desktop/aas-pyspark-edition/data/stocksA/AAIT.csv').load('C:/Users/HP/Desktop/aas-pyspark-edition/data/stocksA/AAME.csv')

stocks.show(2)

## Extracting Symbol from File Path

This code cell extracts the symbol from the file path of each stock and adds it as a new column to the DataFrame.

- **Method**: Extracts the symbol from the file path using string manipulation functions.
- **Output**: Displaying the first two rows of the DataFrame with the new "Symbol" column.


In [None]:
from pyspark.sql import functions as fun

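# Take the file name (last path segment), then the part before the first dot; the raw string avoids an invalid "\." escape
stocks = stocks.withColumn("Symbol", fun.input_file_name()).withColumn("Symbol", fun.element_at(fun.split("Symbol", "/"), -1)).withColumn("Symbol", fun.element_at(fun.split("Symbol", r"\."), 1))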

stocks.show(2)
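
The string manipulation above can be checked outside Spark. The following is a minimal plain-Python sketch, assuming paths shaped like the ones `input_file_name()` returns; `symbol_from_path` and the example paths are hypothetical and not part of the notebook.

```python
def symbol_from_path(path: str) -> str:
    """Mirror the Spark expression: last path segment, then the text before the first dot."""
    file_name = path.split("/")[-1]   # like element_at(split(path, "/"), -1)
    return file_name.split(".")[0]    # like element_at(split(file_name, "\\."), 1)

# Illustrative paths only; the real values depend on where the data lives.
example_paths = [
    "file:///tmp/data/stocksA/ABAX.csv",
    "file:///tmp/data/stocksA/AAME.csv",
]
symbols = [symbol_from_path(p) for p in example_paths]
print(symbols)
```

Splitting on `/` assumes forward-slash separators, which is also what the Spark expression assumes.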

## Loading and Extracting Symbol from File Path

This code cell loads the factor data from multiple CSV files and extracts the symbol from the file path, adding it as a new column to the DataFrame.

- **Input**: The same three CSV files read for `stocks` (ABAX.csv, AAME.csv, AEPI.csv), here treated as factor data.
- **Method**: Loads the CSV files into a DataFrame and extracts the symbol from the file path using string manipulation functions.
- **Output**: DataFrame `factors` with the symbol extracted from the file path and added as a new column.


In [None]:
factors = spark.read.csv(["data/stocksA/ABAX.csv","data/stocksA/AAME.csv","data/stocksA/AEPI.csv"], header='true', inferSchema='true')

# Same symbol extraction as for stocks; the raw string avoids an invalid "\." escape
factors = factors.withColumn("Symbol", fun.input_file_name()).withColumn("Symbol", fun.element_at(fun.split("Symbol", "/"), -1)).withColumn("Symbol", fun.element_at(fun.split("Symbol", r"\."), 1))

## Filtering Stocks Data by Symbol Count

This code cell keeps only the symbols that have enough rows of history.

- **Input**: DataFrame containing stocks data with a column 'Symbol'.
- **Method**: Uses a window function partitioned by `Symbol` to attach each symbol's row count to its rows, then keeps the rows whose count exceeds `260*5 + 10` = 1310 (presumably about five years of trading days plus a small margin).
- **Output**: DataFrame with stocks data filtered by symbol count.


In [None]:
from pyspark.sql import Window

stocks = stocks.withColumn('count', fun.count('Symbol').over(Window.partitionBy('Symbol'))).filter(fun.col('count') > 260*5 + 10)
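
The window-count filter can be pictured with a small pandas analogue. This is only a sketch under assumptions: the toy data and the reduced threshold `min_rows` are invented for illustration, and `groupby(...).transform("count")` stands in for `count(...).over(Window.partitionBy(...))`.

```python
import pandas as pd

# Hypothetical toy data: symbol "AAA" has 3 rows, "BBB" has 1 row.
toy = pd.DataFrame({
    "Symbol": ["AAA", "AAA", "AAA", "BBB"],
    "Close": [1.0, 1.1, 1.2, 5.0],
})

min_rows = 2  # the notebook uses 260*5 + 10 = 1310

# Attach each symbol's row count to every one of its rows, then filter on it.
toy["count"] = toy.groupby("Symbol")["Symbol"].transform("count")
kept = toy[toy["count"] > min_rows]
print(kept["Symbol"].unique())
```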

## Setting Legacy Time Parser Policy

This code cell sets the time parser policy of Spark SQL to legacy.

- **Input**: None
- **Method**: Runs a SQL `SET` command through `spark.sql` to set the `spark.sql.legacy.timeParserPolicy` property to `LEGACY`.
- **Output**: Spark SQL configuration with the legacy time parser policy set.


In [None]:
spark.sql("set spark.sql.legacy.timeParserPolicy=LEGACY")

## Converting Date Format

This code cell converts the date format in the `Date` column to a standard date format.

- **Input**: DataFrame `stocks` with a column named `Date` containing date values in the format 'dd-MMM-yy'.
- **Method**: Uses the `withColumn` function along with `to_timestamp` and `to_date` functions from the `functions` module of PySpark to convert the date format to a standard format.
- **Output**: DataFrame `stocks` with the `Date` column converted to a standard date format.


In [None]:
stocks = stocks.withColumn('Date',fun.to_date(fun.to_timestamp(fun.col('Date'),'dd-MMM-yy')))

stocks.printSchema()

## Filtering Date Range

This code cell filters the DataFrame `stocks` to include only rows with dates falling within a specific range.

- **Input**: DataFrame `stocks` containing a column named `Date` with date values.
- **Method**: Uses the `filter` function to select rows where the `Date` column values are greater than or equal to October 23, 2009, and less than or equal to October 23, 2014.
- **Output**: DataFrame `stocks` containing only rows with dates falling within the specified range.


In [None]:
from datetime import datetime

stocks = stocks.filter(fun.col('Date') >= datetime(2009, 10, 23)).filter(fun.col('Date') <= datetime(2014, 10, 23))
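
The date parsing and the inclusive range filter can be sketched in plain Python. This assumes that `%d-%b-%y` is an adequate `strptime` counterpart of the Spark pattern 'dd-MMM-yy' and that month abbreviations are in English; `parse_stock_date` and the sample strings are hypothetical.

```python
from datetime import date, datetime

def parse_stock_date(text: str) -> date:
    # Day, abbreviated month name, two-digit year, e.g. "23-Oct-09"
    return datetime.strptime(text, "%d-%b-%y").date()

start, end = date(2009, 10, 23), date(2014, 10, 23)

# Invented sample values on both sides of each boundary.
sample = ["22-Oct-09", "23-Oct-09", "23-Oct-14", "24-Oct-14"]

# Both bounds are inclusive, matching the >= and <= filters above.
in_range = [s for s in sample if start <= parse_stock_date(s) <= end]
print(in_range)
```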

## Converting and Filtering Dates for Factors

This code cell converts the `Date` column of `factors` to a date type and then keeps only rows within the same date range used for `stocks`.

- **Input**: DataFrame `factors` with a column named `Date` containing date values in the format 'dd-MMM-yy'.
- **Method**: Converts `Date` with `to_timestamp` and `to_date`, then uses the `filter` function to select rows where the `Date` column values are greater than or equal to October 23, 2009, and less than or equal to October 23, 2014.
- **Output**: DataFrame `factors` containing only rows with dates falling within the specified range.


In [None]:
factors = factors.withColumn('Date',fun.to_date(fun.to_timestamp(fun.col('Date'),'dd-MMM-yy')))

factors = factors.filter(fun.col('Date') >= datetime(2009, 10, 23)).filter(fun.col('Date') <= datetime(2014, 10, 23))

## Converting Spark DataFrames to Pandas DataFrames

This code cell converts the Spark DataFrames `stocks` and `factors` to Pandas DataFrames.

- **Input**: Spark DataFrames `stocks` and `factors`.
- **Method**: Uses the `toPandas()` function to convert the Spark DataFrames to Pandas DataFrames.
- **Output**: Pandas DataFrames `stocks_pd_df` and `factors_pd_df` containing the data from the respective Spark DataFrames.


In [None]:
stocks_pd_df = stocks.toPandas()
factors_pd_df = factors.toPandas()

factors_pd_df.head(5)

0.0.2 Determining the Factor Weights

## Calculating Rolling Returns

This code calculates rolling returns for stocks and factors.

- **Input**: Pandas DataFrames `stocks_pd_df` and `factors_pd_df`.
- **Parameters**: `n_steps` set to 10.
- **Method**:
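  - Defines a custom function `my_fun(x)` that returns the relative change in closing price across a window: (last - first) / first.
  - Groups the data by symbol and applies `my_fun` over a rolling window of `n_steps` rows; the first `n_steps - 1` rows of each symbol have no full window and yield NaN.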
- **Output**: DataFrames `stock_returns` and `factors_returns` containing the rolling returns for stocks and factors respectively.


In [None]:
n_steps = 10

def my_fun(x):
  return ((x.iloc[-1] - x.iloc[0]) / x.iloc[0])

stock_returns = stocks_pd_df.groupby('Symbol').Close.rolling(window=n_steps).apply(my_fun)
factors_returns = factors_pd_df.groupby('Symbol').Close.rolling(window=n_steps).apply(my_fun)

stock_returns = stock_returns.reset_index().sort_values('level_1').reset_index()
factors_returns = factors_returns.reset_index().sort_values('level_1').reset_index()
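
A small pandas sketch shows what the rolling computation produces. The prices are invented and the window is shortened to 2 rows, purely so the result is easy to check by hand; `window_return` is a hypothetical name for the same formula as `my_fun`.

```python
import pandas as pd

def window_return(x):
    # Relative change from the first to the last price in the window
    return (x.iloc[-1] - x.iloc[0]) / x.iloc[0]

# Hypothetical prices for two symbols.
toy = pd.DataFrame({
    "Symbol": ["AAA", "AAA", "AAA", "BBB", "BBB", "BBB"],
    "Close": [100.0, 110.0, 121.0, 50.0, 50.0, 25.0],
})

returns = toy.groupby("Symbol").Close.rolling(window=2).apply(window_return)

# The result is indexed by (Symbol, original row index); reset_index exposes the
# unnamed original index as 'level_1', which restores the original row order.
returns = returns.reset_index().sort_values("level_1").reset_index(drop=True)
print(returns)
```

The first row of each symbol is NaN because a two-row window is not yet full there.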

## Combining Stocks and Factors DataFrames

This code combines the stocks and factors DataFrames, adding rolling returns to the stocks DataFrame and organizing the factors DataFrame.

- **Input**: Pandas DataFrames `stocks_pd_df`, `stock_returns`, `factors_pd_df`, and `factors_returns`.
- **Output**: Combined DataFrames `stocks_pd_df_with_returns` and `factors_pd_df_with_returns`.
- **Method**:
  - Adds the rolling returns to the stocks DataFrame as a new column named `stock_returns`.
  - Adds the squared rolling returns to the factors DataFrame as a new column named `factors_returns_squared`.
  - Pivots the factors DataFrame to organize the data.
  - Resets the index of the factors DataFrame.


In [None]:
# Create combined stocks DF
stocks_pd_df_with_returns = stocks_pd_df.assign(stock_returns = stock_returns['Close'])

# Create combined factors DF
factors_pd_df_with_returns = factors_pd_df.assign(factors_returns = factors_returns['Close'],factors_returns_squared = factors_returns['Close']**2)

factors_pd_df_with_returns = factors_pd_df_with_returns.pivot(index='Date',columns='Symbol',values=['factors_returns', 'factors_returns_squared'])

factors_pd_df_with_returns.columns = factors_pd_df_with_returns.columns.to_series().str.join('_').reset_index()[0]

factors_pd_df_with_returns = factors_pd_df_with_returns.reset_index()

print(factors_pd_df_with_returns.head(1))

In [None]:
print(factors_pd_df_with_returns.columns)
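
The pivot and the column flattening are easier to follow on a tiny example. This sketch uses invented dates, symbols, and returns, and flattens the columns with a list comprehension instead of the `to_series().str.join('_')` chain used above.

```python
import pandas as pd

# Hypothetical long-format factor returns: one row per (Date, Symbol).
toy = pd.DataFrame({
    "Date": ["2014-01-02", "2014-01-02", "2014-01-03", "2014-01-03"],
    "Symbol": ["AAA", "BBB", "AAA", "BBB"],
    "factors_returns": [0.1, -0.2, 0.3, 0.0],
})
toy["factors_returns_squared"] = toy["factors_returns"] ** 2

# One row per date, one column per (value name, symbol) pair.
wide = toy.pivot(index="Date", columns="Symbol",
                 values=["factors_returns", "factors_returns_squared"])

# Join each (value name, symbol) pair into a single flat column name.
wide.columns = ["_".join(pair) for pair in wide.columns]
wide = wide.reset_index()
print(list(wide.columns))
```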

## Linear Regression Analysis

This code performs linear regression analysis on the combined stocks and factors DataFrame.

- **Input**: Pandas DataFrames `stocks_pd_df_with_returns` and `factors_pd_df_with_returns`, merged on `Date` into `stocks_factors_combined_df`.
- **Output**: DataFrame `coefs_per_stock` containing coefficients of the linear regression model for each stock.
- **Method**:
  - Merges the stocks and factors DataFrames.
  - Drops NaN values from the DataFrame.
  - Performs linear regression analysis for each stock.
  - Stores the coefficients of the linear regression model in the `coefs_per_stock` DataFrame.


In [None]:
import pandas as pd
from sklearn.linear_model import LinearRegression

# For each stock, create input DF for linear regression training

stocks_factors_combined_df = pd.merge(stocks_pd_df_with_returns,factors_pd_df_with_returns,how="left", on="Date")

feature_columns = list(stocks_factors_combined_df.columns[-6:])

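# Treat +/-inf as missing and drop rows lacking a feature or target value
# (the 'mode.use_inf_as_na' option is deprecated in recent pandas releases)
stocks_factors_combined_df = stocks_factors_combined_df.replace([float('inf'), float('-inf')], float('nan')).dropna(subset=feature_columns + ['stock_returns'])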

def find_ols_coef(df):
  y = df[['stock_returns']].values
  X = df[feature_columns]

  regr = LinearRegression()
  regr_output = regr.fit(X, y)

  return list(df[['Symbol']].values[0]) + list(regr_output.coef_[0])

coefs_per_stock = stocks_factors_combined_df.groupby('Symbol').apply(find_ols_coef)

coefs_per_stock = pd.DataFrame(coefs_per_stock).reset_index()
coefs_per_stock.columns = ['symbol', 'factor_coef_list']

coefs_per_stock = pd.DataFrame(coefs_per_stock.factor_coef_list.tolist(),index=coefs_per_stock.index,columns = ['Symbol'] + feature_columns)

coefs_per_stock
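
To make the meaning of the fitted coefficients concrete, here is a minimal ordinary least squares sketch with NumPy. It is not the notebook's method: `np.linalg.lstsq` replaces scikit-learn's `LinearRegression`, and the factor returns, coefficients, and intercept are synthetic values chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))              # hypothetical factor returns
true_coefs = np.array([0.5, -1.0, 2.0])   # hypothetical sensitivities of one stock
y = 0.1 + X @ true_coefs                  # noise-free stock returns, intercept 0.1

# A column of ones lets the fit include an intercept,
# which LinearRegression does by default.
design = np.column_stack([np.ones(len(X)), X])
solution, *_ = np.linalg.lstsq(design, y, rcond=None)
intercept, coefs = solution[0], solution[1:]
print(intercept, coefs)
```

With noise-free data the fit recovers the coefficients used to build `y`; on real returns the coefficients are only estimates.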

0.0.3 Sampling

## Kernel Density Estimation Plot

This code generates a Kernel Density Estimation (KDE) plot for the returns of a specific factor.

- **Input**: Pandas DataFrame `factors_returns` containing factor returns.
- **Output**: KDE plot of the returns for a specific factor.
- **Method**:
  - Selects the returns for a specific factor using the `loc` function.
  - Generates the KDE plot using the `plot.kde()` function.


In [None]:
samples = factors_returns.loc[factors_returns.Symbol == factors_returns.Symbol.unique()[0]]['Close']

samples.plot.kde()

## Correlation Analysis of Factor Returns

This code performs correlation analysis on the factor returns for three different factors.

- **Input**: Pandas DataFrame `factors_returns` containing factor returns.
- **Output**: Correlation matrix of factor returns for three different factors (`f1`, `f2`, `f3`).
- **Method**:
  - Selects the returns for each factor using the `loc` function.
  - Creates a DataFrame with three columns (`f1`, `f2`, `f3`).
  - Calculates the correlation matrix using the `corr()` function.


In [None]:
f_1 = factors_returns.loc[factors_returns.Symbol == factors_returns.Symbol.unique()[0]]['Close']
f_2 = factors_returns.loc[factors_returns.Symbol == factors_returns.Symbol.unique()[1]]['Close']
f_3 = factors_returns.loc[factors_returns.Symbol == factors_returns.Symbol.unique()[2]]['Close']

print(f_1.size,len(f_2),f_3.size)
pd.DataFrame({'f1': list(f_1)[1:1040], 'f2': list(f_2)[1:1040], 'f3': list(f_3)}).corr()

## Calculation of Covariance and Mean of Factor Returns

This code calculates the covariance matrix and mean of factor returns for three different factors.

- **Input**: Pandas DataFrame `factors_returns` containing factor returns for three factors (`f1`, `f2`, `f3`).
- **Output**: Covariance matrix and mean vector of factor returns.
- **Method**:
  - Constructs a DataFrame with three columns (`f1`, `f2`, `f3`) containing factor returns.
  - Calculates the covariance matrix using the `cov()` function and converts it to a numpy array.
  - Calculates the mean vector using the `mean()` function.


In [None]:
factors_returns_cov = pd.DataFrame({'f1': list(f_1)[1:1040],
                                    'f2': list(f_2)[1:1040],
                                    'f3': list(f_3)}).cov().to_numpy()

factors_returns_mean = pd.DataFrame({'f1': list(f_1)[1:1040],
                                    'f2': list(f_2)[1:1040],
                                    'f3': list(f_3)}).mean()
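
The correlation, covariance, and mean computations can be sketched on small series of unequal length. Note the assumption: this sketch trims every series to the shortest common length, which is a different alignment choice from the hard-coded slices above, and the values are invented.

```python
import pandas as pd

# Hypothetical factor returns; f_b has one observation fewer.
f_a = pd.Series([0.01, 0.02, -0.01, 0.03, 0.00])
f_b = pd.Series([0.02, 0.04, -0.02, 0.06])
f_c = pd.Series([-0.01, 0.00, 0.02, -0.03, 0.01])

# Truncate to the common length so every column has the same number of rows.
n = min(len(f_a), len(f_b), len(f_c))
aligned = pd.DataFrame({
    "f1": f_a.to_numpy()[:n],
    "f2": f_b.to_numpy()[:n],
    "f3": f_c.to_numpy()[:n],
})

corr = aligned.corr()            # pairwise Pearson correlations
cov = aligned.cov().to_numpy()   # 3x3 covariance matrix as a NumPy array
mean = aligned.mean()            # per-factor mean return
print(corr)
```

Aligning the series on their dates rather than by position would be more robust when observations are missing.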

This code generates random samples from a multivariate normal distribution defined by the mean vector `factors_returns_mean` and the covariance matrix `factors_returns_cov`, using the `multivariate_normal` function from the numpy.random module.


In [None]:
from numpy.random import multivariate_normal

multivariate_normal(factors_returns_mean, factors_returns_cov)
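
A quick way to build confidence in the sampling step is to draw many samples and compare their statistics with the inputs. This sketch uses a made-up mean vector and covariance matrix, and NumPy's `default_rng` generator rather than the module-level `multivariate_normal` function used above.

```python
import numpy as np

# Hypothetical parameters; the covariance must be symmetric positive semi-definite.
mean = np.array([0.0, 0.01, -0.01])
cov = np.array([
    [0.04, 0.01, 0.00],
    [0.01, 0.09, 0.02],
    [0.00, 0.02, 0.01],
])

rng = np.random.default_rng(42)
one_draw = rng.multivariate_normal(mean, cov)             # one vector of factor returns
many = rng.multivariate_normal(mean, cov, size=100_000)   # many draws, one per row

sample_mean = many.mean(axis=0)
sample_cov = np.cov(many, rowvar=False)
print(one_draw, sample_mean)
```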

0.0.4 Running the Trials

This code broadcasts `coefs_per_stock`, `feature_columns`, `factors_returns_mean`, and `factors_returns_cov` with `spark.sparkContext.broadcast`, so that each executor receives a single read-only copy instead of one copy per task.


In [None]:
b_coefs_per_stock = spark.sparkContext.broadcast(coefs_per_stock)
b_feature_columns = spark.sparkContext.broadcast(feature_columns)
b_factors_returns_mean = spark.sparkContext.broadcast(factors_returns_mean)
b_factors_returns_cov = spark.sparkContext.broadcast(factors_returns_cov)

This code imports the `IntegerType` from the `pyspark.sql.types` module. It sets up parameters such as `parallelism`, `num_trials`, and `base_seed`. Then, it creates a list of seeds ranging from `base_seed` to `base_seed + parallelism`, creates a DataFrame from these seeds using Spark's `createDataFrame` method, and repartitions the DataFrame into `parallelism` partitions.


In [None]:
from pyspark.sql.types import IntegerType

parallelism = 1000
num_trials = 1000000
base_seed = 1496

seeds = [b for b in range(base_seed,base_seed + parallelism)]
seedsDF = spark.createDataFrame(seeds, IntegerType())

seedsDF = seedsDF.repartition(parallelism)
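
The way the trials are divided among seeds is simple arithmetic, shown here without Spark. The parameter values are copied from the cell above; `trials_per_seed` and `total_trials` are names introduced only for this sketch.

```python
parallelism = 1000
num_trials = 1_000_000
base_seed = 1496

# One seed per partition; range() excludes the upper bound.
seeds = list(range(base_seed, base_seed + parallelism))

# Each seed is responsible for an equal share of the trials.
trials_per_seed = num_trials // parallelism
total_trials = len(seeds) * trials_per_seed
print(seeds[0], seeds[-1], trials_per_seed, total_trials)
```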

This code defines a Python function `calculate_trial_return(x)` which generates a list of trial returns based on the given seed `x`. Within the function, random factors are generated using the mean and covariance values broadcasted earlier. Then, the function computes returns per stock using coefficients and random factors. The function returns a list of trial returns.

It also creates a user-defined function (`udf_return`) using Spark's `udf` function, which applies the `calculate_trial_return` function to each element in a Spark DataFrame column, outputting an array of trial returns.


In [None]:
from numpy.random import seed
from pyspark.sql.types import ArrayType, DoubleType
from pyspark.sql.functions import udf

def calculate_trial_return(x):
  trial_return_list = []

  # Seed once per call; reseeding inside the loop would make every trial draw the same factors
  seed(x)

  coefs_per_stock_df = b_coefs_per_stock.value

  for i in range(int(num_trials/parallelism)):
    # Draw one set of factor returns from the fitted multivariate normal
    random_factors = multivariate_normal(b_factors_returns_mean.value,b_factors_returns_cov.value)

    # Features are the factor returns followed by their squares, matching feature_columns
    returns_per_stock = (coefs_per_stock_df[b_feature_columns.value] * (list(random_factors) + list(random_factors**2)))

    # Total return divided by the number of cells (rows x columns) in the coefficient table
    trial_return_list.append(float(returns_per_stock.sum(axis=1).sum()/b_coefs_per_stock.value.size))

  return trial_return_list

udf_return = udf(calculate_trial_return, ArrayType(DoubleType()))
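
Where `seed` is called relative to the loop matters for the trials. The following sketch, with a made-up two-dimensional standard normal and hypothetical helper names, compares reseeding before every draw with seeding once before the loop.

```python
import numpy as np
from numpy.random import multivariate_normal, seed

mean = np.zeros(2)   # hypothetical parameters for illustration
cov = np.eye(2)

def draws_reseeding_each_time(x, n):
    out = []
    for _ in range(n):
        seed(x)                                   # generator reset before every draw
        out.append(multivariate_normal(mean, cov))
    return np.array(out)

def draws_seeding_once(x, n):
    seed(x)                                       # generator reset once
    return np.array([multivariate_normal(mean, cov) for _ in range(n)])

repeated = draws_reseeding_each_time(1496, 5)
varied = draws_seeding_once(1496, 5)
print(repeated)
print(varied)
```

Reseeding with the same value before every draw returns the same vector each time, so only seeding once yields distinct trials while staying reproducible per seed.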

This code applies the `udf_return` user-defined function to each element in the 'value' column of the DataFrame `seedsDF`, creating a new column 'trial_return'. It then explodes the array elements in the 'trial_return' column into separate rows. Finally, it caches the resulting DataFrame `trials` into memory for faster access.


In [None]:
from pyspark.sql.functions import col, explode

trials = seedsDF.withColumn("trial_return", udf_return(col("value")))
trials = trials.select('value', explode('trial_return').alias('trial_return'))

trials.cache()

0.0.5 TAKES SOME TIME

This code computes the 5th percentile (0.05 quantile) of the 'trial_return' column in the `trials` DataFrame with `approxQuantile`. A relative error of 0.0 requests the exact quantile, which is the most expensive setting.


In [None]:
trials.approxQuantile('trial_return', [0.05], 0.0)

This code sorts the `trials` DataFrame by the 'trial_return' column in ascending order, then limits the result to a fraction (1/20th) of the total rows. It calculates the average of the 'trial_return' values within this limited subset and displays the result.


In [None]:
trials.orderBy(col('trial_return').asc()).limit(int(trials.count()/20)).agg(fun.avg(col("trial_return"))).show()
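
Both statistics, the 5% cutoff and the average of the worst 1/20th of trials, can be sketched on a plain list. The returns here are invented, `tail_statistics` is a hypothetical helper, and the cutoff convention (the k-th smallest value) is an assumption that may differ from Spark's at the boundary.

```python
def tail_statistics(returns, denominator=20):
    """Return (cutoff, tail_mean) for the worst 1/denominator of the returns."""
    ordered = sorted(returns)
    k = max(1, len(ordered) // denominator)   # number of trials in the tail
    worst = ordered[:k]
    cutoff = worst[-1]                        # largest return still inside the tail
    tail_mean = sum(worst) / k                # average return within the tail
    return cutoff, tail_mean

# Hypothetical trial returns: -0.50, -0.49, ..., 0.49 (100 values).
toy_returns = [i / 100 for i in range(-50, 50)]
cutoff, tail_mean = tail_statistics(toy_returns)
print(cutoff, tail_mean)
```

The tail mean is never larger than the cutoff, since it averages only the returns at or below it.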

0.0.6 Visualizing the Distribution of Returns

This code converts the `trials` DataFrame to a Pandas DataFrame named `mytrials`. Then, it generates a line plot using the `plot.line()` method, visualizing the data in the Pandas DataFrame.


In [None]:
import pandas

mytrials=trials.toPandas()
mytrials.plot.line()