# PKA-MONTE-CARLO Documentation

In [None]:
import pyspark
import os
import sys
from pyspark import SparkContext
os.environ['PYSPARK_PYTHON'] = sys.executable
os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable
from pyspark.sql import SparkSession

This code segment is essential for setting up the environment to run a PySpark application. Let's break it down step by step:

1. `import pyspark`: This imports the PySpark library, which is the main library for working with Spark in Python.

2. `import os`: This imports the `os` module, which provides a way to interact with the operating system.

3. `import sys`: This imports the `sys` module, which provides access to some variables used or maintained by the Python interpreter and to functions that interact with the interpreter.

4. `from pyspark import SparkContext`: This imports the `SparkContext` class from the `pyspark` module. `SparkContext` is the entry point to Spark functionality in Python.

5. `os.environ['PYSPARK_PYTHON'] = sys.executable`: This sets the environment variable `PYSPARK_PYTHON` to the path of the current Python executable. This is necessary to ensure that the Spark workers use the same Python interpreter as the one used to launch the Spark application.

6. `os.environ['PYSPARK_DRIVER_PYTHON'] = sys.executable`: This sets the environment variable `PYSPARK_DRIVER_PYTHON` to the path of the current Python executable. This ensures that the Python interpreter used by the Spark driver (the process that coordinates the execution of Spark jobs) is the same as the one used to launch the Spark application.

7. `from pyspark.sql import SparkSession`: This imports the `SparkSession` class from the `pyspark.sql` module. `SparkSession` is the entry point to working with structured data in Spark and is used to create DataFrame and Dataset objects.

Overall, this code segment configures PySpark to use the current Python interpreter on both the driver and the workers, and imports the modules needed for the rest of the application, in this case a Monte Carlo simulation.
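
The interpreter pinning described in steps 5 and 6 can be checked without starting Spark. The following is a minimal sketch; the helper name `pin_pyspark_python` is hypothetical and not part of PySpark.

```python
import os
import sys


def pin_pyspark_python(executable=sys.executable):
    """Point both PySpark interpreter variables at the same Python binary."""
    # Hypothetical helper: it only wraps the two assignments shown above.
    os.environ['PYSPARK_PYTHON'] = executable
    os.environ['PYSPARK_DRIVER_PYTHON'] = executable
    return os.environ['PYSPARK_PYTHON'], os.environ['PYSPARK_DRIVER_PYTHON']


worker_python, driver_python = pin_pyspark_python()
print(worker_python == driver_python == sys.executable)
```

Setting both variables before the session is created is what keeps the driver and the workers on the same interpreter.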

In [None]:
spark = SparkSession.builder.config("spark.driver.memory", "16g") \
    .appName('chapter_8').getOrCreate()

This code creates (or reuses) a `SparkSession` named `chapter_8` and sets the driver memory to 16 GB.

## 0.0.1 Preparing the Data


In [None]:
stocks = spark.read.csv(["data/stocksA/ABAX.csv","data/stocksA/AAME.csv","data/stocksA/AEPI.csv"], header='true', inferSchema='true')
stocks.show(2)

- `stocks`: This variable holds the DataFrame containing the data from the CSV files.

- `spark.read.csv()`: This method reads CSV files into a DataFrame. It takes multiple file paths as input, specified as a list.

- `"data/stocksA/ABAX.csv","data/stocksA/AAME.csv","data/stocksA/AEPI.csv"`: These are the paths to the CSV files to be read. They are provided as a list of strings.

- `header='true'`: This parameter indicates that the first row of each CSV file contains the header, which will be used as column names.

- `inferSchema='true'`: This parameter tells Spark to infer the data types of columns automatically.

- `stocks.show(2)`: This method displays the first two rows of the DataFrame `stocks`.


In [None]:
from pyspark.sql import functions as fun

stocks = stocks.withColumn("Symbol", fun.input_file_name()) \
    .withColumn("Symbol", fun.element_at(fun.split("Symbol", "/"), -1)) \
    .withColumn("Symbol", fun.element_at(fun.split("Symbol", r"\."), 1))

stocks.show(2)

This code adds a `Symbol` column to the DataFrame `stocks`, derived from the path of the file each row was read from.


- `from pyspark.sql import functions as fun`: This imports the `functions` module from `pyspark.sql` and aliases it as `fun` for easier usage.

- `stocks.withColumn("Symbol", fun.input_file_name())`: This adds a new column named "Symbol" to the DataFrame `stocks`, containing the path of the file each row came from.

- `.withColumn("Symbol", fun.element_at(fun.split("Symbol", "/"), -1))`: This updates the "Symbol" column to extract the filename from the file path by splitting it on "/" and taking the last element.

- `.withColumn("Symbol", fun.element_at(fun.split("Symbol", r"\."), 1))`: This further updates the "Symbol" column to extract the symbol from the filename by splitting it on "." and taking the first element. The split pattern is a regular expression, so the dot is escaped, and the raw string keeps the backslash intact. `element_at` uses 1-based indexing, so index 1 is the part before the file extension.

- `stocks.show(2)`: This method displays the first two rows of the modified DataFrame `stocks`.
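
The two split steps can be mirrored in plain Python to see what ends up in the column. This is only a sketch: `symbol_from_path` is a hypothetical helper, and the example path is an assumed value rather than one taken from the data.

```python
def symbol_from_path(path):
    """Mirror the two split steps: keep the file name, then drop the extension."""
    file_name = path.split("/")[-1]   # last path component, e.g. "ABAX.csv"
    return file_name.split(".")[0]    # first piece before the dot, e.g. "ABAX"


# Assumed example path; input_file_name() typically returns a full URI like this.
example_path = "file:///home/user/data/stocksA/ABAX.csv"
symbol = symbol_from_path(example_path)
print(symbol)
```

Python lists are 0-based, so `[0]` here plays the role of index 1 in Spark's `element_at`.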


In [None]:
factors = spark.read.csv(["data/stocksA/ABAX.csv","data/stocksA/AAME.csv","data/stocksA/AEPI.csv"], header='true', inferSchema='true')
factors = factors.withColumn("Symbol", fun.input_file_name()) \
                .withColumn("Symbol", fun.element_at(fun.split("Symbol", "/"), -1)) \
                .withColumn("Symbol", fun.element_at(fun.split("Symbol", r"\."), 1))


This code reads multiple CSV files into the DataFrame `factors` and then adds a `Symbol` column derived from the path of the file each row was read from.


- `factors`: This variable holds the DataFrame containing the data from the CSV files.

- `spark.read.csv()`: This method reads CSV files into a DataFrame. It takes multiple file paths as input, specified as a list.

- `"data/stocksA/ABAX.csv","data/stocksA/AAME.csv","data/stocksA/AEPI.csv"`: These are the paths to the CSV files to be read. They are provided as a list of strings, and they are the same three files that were read into `stocks`.

- `header='true'`: This parameter indicates that the first row of each CSV file contains the header, which will be used as column names.

- `inferSchema='true'`: This parameter tells Spark to infer the data types of columns automatically.

- `.withColumn("Symbol", fun.input_file_name())`: This adds a new column named "Symbol" to the DataFrame `factors`, containing the path of the file each row came from.

- `.withColumn("Symbol", fun.element_at(fun.split("Symbol", "/"), -1))`: This updates the "Symbol" column to extract the filename from the file path by splitting it on "/" and taking the last element.

- `.withColumn("Symbol", fun.element_at(fun.split("Symbol", r"\."), 1))`: This further updates the "Symbol" column to extract the symbol from the filename by splitting it on "." and taking the first element (`element_at` uses 1-based indexing).

In [None]:
from pyspark.sql import Window

stocks = stocks.withColumn('count', fun.count('Symbol') \
                .over(Window.partitionBy('Symbol'))) \
                .filter(fun.col('count') > 260*5 + 10)



This code snippet counts the rows for each 'Symbol' in the 'stocks' DataFrame and keeps only the symbols whose row count exceeds a specified threshold.



- `from pyspark.sql import Window`: This imports the `Window` class from `pyspark.sql`, which is used for defining window specifications for window functions.

- `stocks`: This DataFrame holds the data, likely with a 'Symbol' column.

- `.withColumn('count', fun.count('Symbol').over(Window.partitionBy('Symbol')))`: This adds a new column named 'count' to the DataFrame 'stocks', which calculates the count of each 'Symbol' partitioned by the 'Symbol' column using a window function.

- `.filter(fun.col('count') > 260*5 + 10)`: This keeps only the rows whose symbol has more than 260*5 + 10 = 1310 rows. The expression suggests roughly five years of daily data at about 260 trading days per year, plus a margin of 10, so symbols with a shorter history are dropped.
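
The effect of the window count followed by the filter can be sketched in plain Python. The data below is made up, and `keep_well_covered` is a hypothetical helper used only for illustration.

```python
from collections import Counter

MIN_ROWS = 260 * 5 + 10  # 1310, the threshold used in the filter above


def keep_well_covered(rows, min_rows=MIN_ROWS):
    """Keep rows whose symbol appears strictly more than min_rows times."""
    counts = Counter(symbol for symbol, _ in rows)   # rows per symbol
    return [row for row in rows if counts[row[0]] > min_rows]


# Made-up data: symbol "AAA" has 1311 rows, "BBB" only 100.
rows = [("AAA", i) for i in range(1311)] + [("BBB", i) for i in range(100)]
kept = keep_well_covered(rows)
print(len(kept), {symbol for symbol, _ in kept})
```

As in the Spark version, every row carries the count of its own symbol, so the filter removes a symbol's rows all at once or not at all.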


In [None]:
# Assumption: 'dd-MMM-yy', the format the factors cell uses for the same files.
stocks = stocks.withColumn('Date',
                  fun.to_date(fun.to_timestamp(fun.col('Date'),
                                              'dd-MMM-yy')))
stocks.printSchema()


This code snippet converts the 'Date' column in the 'stocks' DataFrame to a proper date type and prints the schema of the DataFrame.


- `stocks = stocks.withColumn('Date', fun.to_date(fun.to_timestamp(fun.col('Date'), ...)))`: This line replaces the existing 'Date' column in the DataFrame 'stocks'. It parses the original values into timestamps with `fun.to_timestamp()`, which takes the date format as its second argument, then converts each timestamp to a date using `fun.to_date()`.

- `stocks.printSchema()`: This line prints the schema of the DataFrame 'stocks', displaying the data types and nullable status of each column.


In [None]:
from datetime import datetime
stocks = stocks.filter(fun.col('Date') >= datetime(2009, 10, 23)).\
                filter(fun.col('Date') <= datetime(2014, 10, 23))



This code filters the 'stocks' DataFrame to include only rows with dates between October 23, 2009, and October 23, 2014.


- `from datetime import datetime`: This imports the `datetime` class from the `datetime` module, which is used to work with dates and times in Python.

- `stocks = stocks.filter(fun.col('Date') >= datetime(2009, 10, 23))`: This filters the DataFrame 'stocks' to include only rows where the 'Date' column is greater than or equal to October 23, 2009.

- `.filter(fun.col('Date') <= datetime(2014, 10, 23))`: This further filters the DataFrame 'stocks' to include only rows where the 'Date' column is less than or equal to October 23, 2014.



In [None]:
factors = factors.withColumn('Date',
                              fun.to_date(fun.to_timestamp(fun.col('Date'),
                                                          'dd-MMM-yy')))
factors = factors.filter(fun.col('Date') >= datetime(2009, 10, 23)).\
                  filter(fun.col('Date') <= datetime(2014, 10, 23))


This code modifies the 'Date' column in the 'factors' DataFrame to convert it to a proper date format and then filters the DataFrame to include only rows with dates between October 23, 2009, and October 23, 2014.


- `factors = factors.withColumn('Date', fun.to_date(fun.to_timestamp(fun.col('Date'), 'dd-MMM-yy')))`: This line replaces the existing 'Date' column in the DataFrame 'factors'. It parses the original values into timestamps using `fun.to_timestamp()` with the date format 'dd-MMM-yy', then converts each timestamp to a date using `fun.to_date()`.

- `factors = factors.filter(fun.col('Date') >= datetime(2009, 10, 23))`: This filters the DataFrame 'factors' to include only rows where the 'Date' column is greater than or equal to October 23, 2009.

- `.filter(fun.col('Date') <= datetime(2014, 10, 23))`: This further filters the DataFrame 'factors' to include only rows where the 'Date' column is less than or equal to October 23, 2014.
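
The parse-then-filter logic can be tried out with the standard library alone. This is a sketch under assumptions: `'%d-%b-%y'` is used as the `strptime` counterpart of Spark's `'dd-MMM-yy'`, the sample strings are made up, and `parse_day` and `in_window` are hypothetical helpers.

```python
from datetime import date, datetime

START, END = date(2009, 10, 23), date(2014, 10, 23)


def parse_day(text):
    """Parse strings such as '23-Oct-09' (day, abbreviated month, 2-digit year)."""
    # '%b' expects English month abbreviations under the default C locale.
    return datetime.strptime(text, "%d-%b-%y").date()


def in_window(day):
    """Both bounds are inclusive, matching the >= and <= filters."""
    return START <= day <= END


raw_dates = ["22-Oct-09", "23-Oct-09", "23-Oct-14", "24-Oct-14"]  # made-up values
kept = [d for d in raw_dates if in_window(parse_day(d))]
print(kept)
```

Only the two boundary dates survive, which shows that the window includes both end points.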


In [None]:
stocks_pd_df = stocks.toPandas()
factors_pd_df = factors.toPandas()
factors_pd_df.head(5)


This code converts the Spark DataFrames 'stocks' and 'factors' into Pandas DataFrames and displays the first 5 rows of the 'factors_pd_df' DataFrame.




- `stocks_pd_df = stocks.toPandas()`: This converts the Spark DataFrame 'stocks' into a Pandas DataFrame named 'stocks_pd_df'.

- `factors_pd_df = factors.toPandas()`: This converts the Spark DataFrame 'factors' into a Pandas DataFrame named 'factors_pd_df'.

- `factors_pd_df.head(5)`: This displays the first 5 rows of the 'factors_pd_df' DataFrame, showing a preview of the data converted to Pandas format.


## 0.0.2 Determining the Factor Weights

In [None]:
n_steps = 10
def my_fun(x):
    return ((x.iloc[-1] - x.iloc[0]) / x.iloc[0])
stock_returns = stocks_pd_df.groupby('Symbol').Close.\
                            rolling(window=n_steps).apply(my_fun)
factors_returns = factors_pd_df.groupby('Symbol').Close.\
                            rolling(window=n_steps).apply(my_fun)
stock_returns = stock_returns.reset_index().\
                              sort_values('level_1').\
                              reset_index()
factors_returns = factors_returns.reset_index().\
                                  sort_values('level_1').\
                                  reset_index()


This code computes rolling returns for stocks and factors using a custom function and then organizes the results.


- `n_steps = 10`: This variable defines the number of steps for the rolling window.

- `def my_fun(x): return ((x.iloc[-1] - x.iloc[0]) / x.iloc[0])`: This defines a custom function `my_fun` that calculates returns based on the first and last elements of the input series.

- `stock_returns = stocks_pd_df.groupby('Symbol').Close.rolling(window=n_steps).apply(my_fun)`: This computes rolling returns for each symbol in the 'stocks_pd_df' DataFrame using the custom function `my_fun`.

- `factors_returns = factors_pd_df.groupby('Symbol').Close.rolling(window=n_steps).apply(my_fun)`: This computes rolling returns for each symbol in the 'factors_pd_df' DataFrame using the custom function `my_fun`.

- `stock_returns = stock_returns.reset_index().sort_values('level_1').reset_index()`: The grouped rolling result is indexed by symbol and by the original row index. The first `reset_index()` turns both into columns ('Symbol' and 'level_1'), sorting by 'level_1' restores the original row order of 'stocks_pd_df', and the second `reset_index()` renumbers the rows.

- `factors_returns = factors_returns.reset_index().sort_values('level_1').reset_index()`: This does the same for 'factors_returns', so that its rows line up with those of 'factors_pd_df'.
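
What the rolling window computes for a single symbol can be sketched without pandas. The prices are made up, and `rolling_return` is a hypothetical stand-in for `rolling(window=n_steps).apply(my_fun)`.

```python
import math


def rolling_return(prices, window):
    """Return (last - first) / first for each trailing window of prices."""
    out = []
    for end in range(len(prices)):
        if end + 1 < window:
            out.append(math.nan)  # not enough history yet, as pandas reports NaN
        else:
            first = prices[end + 1 - window]
            last = prices[end]
            out.append((last - first) / first)
    return out


prices = [100.0, 105.0, 110.0, 120.0]  # made-up closing prices
returns = rolling_return(prices, window=3)
print(returns)
```

The first `window - 1` entries are NaN because a full window is not yet available, which is why rows with missing returns are dropped before the regression.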


In [None]:
stocks_pd_df_with_returns = stocks_pd_df.\
                              assign(stock_returns = \
                                    stock_returns['Close'])
factors_pd_df_with_returns = factors_pd_df.\
                              assign(factors_returns = \
                                    factors_returns['Close'],
                                    factors_returns_squared = \
                                    factors_returns['Close']**2)
factors_pd_df_with_returns = factors_pd_df_with_returns.\
                                pivot(index='Date',
                                      columns='Symbol',
                                      values=['factors_returns', \
                                              'factors_returns_squared'])
factors_pd_df_with_returns.columns = factors_pd_df_with_returns.\
                                        columns.\
                                        to_series().\
                                        str.\
                                        join('_').\
                                        reset_index()[0]
factors_pd_df_with_returns = factors_pd_df_with_returns.\
                                reset_index()
print(factors_pd_df_with_returns.head(1))


This code attaches the computed returns to the stock and factor DataFrames, adds the squared factor returns as extra features, and pivots the factors into a wide table with one row per date.


- `stocks_pd_df_with_returns = stocks_pd_df.assign(stock_returns = stock_returns['Close'])`: This adds a new column named 'stock_returns' to the 'stocks_pd_df' DataFrame, containing the calculated stock returns.

- `factors_pd_df_with_returns = factors_pd_df.assign(factors_returns = factors_returns['Close'], factors_returns_squared = factors_returns['Close']**2)`: This adds two new columns, 'factors_returns' and 'factors_returns_squared', to the 'factors_pd_df' DataFrame, containing the calculated factor returns and their squares, respectively.

- `factors_pd_df_with_returns = factors_pd_df_with_returns.pivot(index='Date', columns='Symbol', values=['factors_returns', 'factors_returns_squared'])`: This pivots the 'factors_pd_df_with_returns' DataFrame to reorganize the data, so each symbol's factor returns and their squares are columns, with 'Date' as the index.

- `factors_pd_df_with_returns.columns = factors_pd_df_with_returns.columns.to_series().str.join('_').reset_index()[0]`: This line concatenates the multi-level column names into single-level column names, joining them with an underscore.

- `factors_pd_df_with_returns = factors_pd_df_with_returns.reset_index()`: This resets the index of the 'factors_pd_df_with_returns' DataFrame.

- `print(factors_pd_df_with_returns.head(1))`: This prints the first row of the modified 'factors_pd_df_with_returns' DataFrame to check the result.
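
The column flattening step is easier to follow on plain tuples. In this sketch the labels are an assumed layout of the pivoted columns, built from the three symbols read earlier.

```python
# Column labels as they look after the pivot: (value name, symbol) tuples.
# The layout is an assumption made for illustration.
multi_level = [
    ("factors_returns", "AAME"),
    ("factors_returns", "ABAX"),
    ("factors_returns", "AEPI"),
    ("factors_returns_squared", "AAME"),
    ("factors_returns_squared", "ABAX"),
    ("factors_returns_squared", "AEPI"),
]

# Joining each tuple with "_" gives one flat name per column.
flat = ["_".join(label) for label in multi_level]
print(flat)
```

Three factors with two value columns each yield six flat names, which matches the six feature columns selected for the regression.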


In [None]:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
# For each stock, create input DF for linear regression training
stocks_factors_combined_df = pd.merge(stocks_pd_df_with_returns,
                                      factors_pd_df_with_returns,
                                      how="left", on="Date")
feature_columns = list(stocks_factors_combined_df.columns[-6:])
# Treat +/-inf as missing, then drop rows with incomplete features or returns
stocks_factors_combined_df = stocks_factors_combined_df.\
    replace([np.inf, -np.inf], np.nan).\
    dropna(subset=feature_columns + ['stock_returns'])
def find_ols_coef(df):
    y = df[['stock_returns']].values
    X = df[feature_columns]
    regr = LinearRegression()
    regr_output = regr.fit(X, y)
    return list(df[['Symbol']].values[0]) + list(regr_output.coef_[0])
coefs_per_stock = stocks_factors_combined_df.\
                      groupby('Symbol').\
                      apply(find_ols_coef)

coefs_per_stock = pd.DataFrame(coefs_per_stock).reset_index()
coefs_per_stock.columns = ['symbol', 'factor_coef_list']
# One row per stock: the symbol followed by one coefficient per feature column
coefs_per_stock = pd.DataFrame(coefs_per_stock.factor_coef_list.tolist(),
                               index=coefs_per_stock.index,
                               columns=['Symbol'] + feature_columns)


This code performs linear regression analysis on stock returns with respect to factors.


- `import pandas as pd`: This imports the pandas library, which is used for data manipulation and analysis.

- `from sklearn.linear_model import LinearRegression`: This imports the LinearRegression class from the scikit-learn library, which is used to perform linear regression.

- `stocks_factors_combined_df = pd.merge(stocks_pd_df_with_returns, factors_pd_df_with_returns, how="left", on="Date")`: This merges the 'stocks_pd_df_with_returns' and 'factors_pd_df_with_returns' DataFrames based on the 'Date' column, creating 'stocks_factors_combined_df'.

- `feature_columns = list(stocks_factors_combined_df.columns[-6:])`: This takes the last six columns of the combined DataFrame as features, that is, the returns and squared returns of the three factors.

- `stocks_factors_combined_df = stocks_factors_combined_df.dropna(subset=feature_columns + ['stock_returns'])`: This drops rows with missing values in either the feature columns or the 'stock_returns' column. Infinite values are treated as missing for this purpose.

- `def find_ols_coef(df): ...`: This defines a function 'find_ols_coef' that performs ordinary least squares (OLS) linear regression on a DataFrame, returning the coefficients.

- `coefs_per_stock = stocks_factors_combined_df.groupby('Symbol').apply(find_ols_coef)`: This applies the 'find_ols_coef' function to each group of data grouped by 'Symbol'.

- `coefs_per_stock = pd.DataFrame(coefs_per_stock).reset_index()`: This converts the output of the groupby operation into a DataFrame and resets the index.

- `coefs_per_stock.columns = ['symbol', 'factor_coef_list']`: This renames the columns of the DataFrame 'coefs_per_stock'.

- `coefs_per_stock = pd.DataFrame(coefs_per_stock.factor_coef_list.tolist(), ...)`: This expands each list in 'factor_coef_list' into separate columns, so that each row holds one stock's symbol followed by its coefficient for each feature column.
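
The regression on a factor return and its square can be illustrated with NumPy alone. This sketch swaps `LinearRegression` for `np.linalg.lstsq` with an explicit intercept column, uses a single made-up factor, and builds a noiseless target so the recovered coefficients are known in advance.

```python
import numpy as np

# Made-up factor returns and a stock return that depends on them exactly.
f = np.array([-0.03, -0.01, 0.00, 0.02, 0.04, 0.05])
y = 0.001 + 0.8 * f + 2.0 * f**2

# Design matrix: intercept column, the factor return, and its square.
X = np.column_stack([np.ones_like(f), f, f**2])

# Least-squares solution; LinearRegression also fits an intercept by default.
solution, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, coef_linear, coef_squared = solution
print(round(coef_linear, 6), round(coef_squared, 6))
```

Including the squared term lets a linear model capture a curved relationship between a factor and a stock return, which is the purpose of the `factors_returns_squared` columns.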



## 0.0.3 Sampling

In [None]:
samples = factors_returns.loc[factors_returns.Symbol == \
                              factors_returns.Symbol.unique()[0]]['Close']
samples.plot.kde()



This code snippet visualizes the kernel density estimation (KDE) plot for the factor returns of a specific symbol.


- `factors_returns.Symbol.unique()[0]`: This retrieves the first unique symbol from the 'Symbol' column of the 'factors_returns' DataFrame.

- `factors_returns.loc[factors_returns.Symbol == ...]`: This selects rows from the 'factors_returns' DataFrame where the 'Symbol' matches the specified symbol.

- `['Close']`: This selects the 'Close' column from the filtered DataFrame, which contains the factor returns for the specified symbol.

- `samples.plot.kde()`: This plots the kernel density estimation (KDE) of the factor returns for the selected symbol using the Pandas `plot.kde()` function.


In [None]:
f_1 = factors_returns.loc[factors_returns.Symbol == \
                          factors_returns.Symbol.unique()[0]]['Close']
f_2 = factors_returns.loc[factors_returns.Symbol == \
                          factors_returns.Symbol.unique()[1]]['Close']
f_3 = factors_returns.loc[factors_returns.Symbol == \
                          factors_returns.Symbol.unique()[2]]['Close']


print(f_1.size,len(f_2),f_3.size)
pd.DataFrame({'f1': list(f_1)[1:1040], 'f2': list(f_2)[1:1040], 'f3':list(f_3)}).corr()


This code snippet calculates the correlation matrix between three factors.



- `f_1`, `f_2`, `f_3`: These variables store the factor returns for the first, second, and third unique symbols, respectively.

- `factors_returns.Symbol.unique()`: This retrieves the unique symbols from the 'Symbol' column of the 'factors_returns' DataFrame.

- `factors_returns.loc[factors_returns.Symbol == ...]['Close']`: This selects the 'Close' column from the 'factors_returns' DataFrame for each specific symbol.

- `print(f_1.size, len(f_2), f_3.size)`: This prints the number of elements in each factor returns series.

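- `pd.DataFrame({'f1': list(f_1)[1:1040], 'f2': list(f_2)[1:1040], 'f3': list(f_3)}).corr()`: This creates a DataFrame with columns 'f1', 'f2', and 'f3' containing factor returns and calculates the correlation matrix between them using the Pandas `corr()` function. The DataFrame constructor requires lists of equal length, so the `[1:1040]` slices presumably trim `f_1` and `f_2` to the length of `f_3`; the sizes printed on the previous line help verify this.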


In [None]:
factors_returns_cov = pd.DataFrame({'f1': list(f_1)[1:1040],
                                    'f2': list(f_2)[1:1040],
                                    'f3': list(f_3)})\
                                    .cov().to_numpy()
factors_returns_mean = pd.DataFrame({'f1': list(f_1)[1:1040],
                                     'f2': list(f_2)[1:1040],
                                     'f3': list(f_3)})\
                                     .mean()


This code snippet computes the covariance matrix and the mean vector of the three factor returns.


- `pd.DataFrame({'f1': ..., 'f2': ..., 'f3': ...})`: This builds the same three-column DataFrame that was used for the correlation matrix.

- `.cov().to_numpy()`: This computes the pairwise sample covariance of the columns and converts the 3x3 result to a NumPy array, stored in `factors_returns_cov`.

- `.mean()`: This computes the mean of each column. The result is a pandas Series with one entry per factor, stored in `factors_returns_mean`.
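
The mean vector and sample covariance matrix can be written out by hand for a small case. This is a sketch with made-up values and two factors instead of three; `mean` and `sample_cov` are hypothetical helpers.

```python
def mean(values):
    return sum(values) / len(values)


def sample_cov(xs, ys):
    """Sample covariance with an n - 1 denominator, the pandas default."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


# Made-up factor returns of equal length.
f1 = [1.0, 2.0, 3.0, 4.0]
f2 = [2.0, 4.0, 6.0, 8.0]

factors = [f1, f2]
cov_matrix = [[sample_cov(a, b) for b in factors] for a in factors]
mean_vector = [mean(f) for f in factors]
print(mean_vector, cov_matrix)
```

The diagonal holds each factor's variance and the off-diagonal entries hold the covariances, so the matrix is symmetric.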


In [None]:
from numpy.random import multivariate_normal
multivariate_normal(factors_returns_mean, factors_returns_cov)


This code snippet generates multivariate normal random samples based on the mean and covariance matrix provided.




- `from numpy.random import multivariate_normal`: This imports the `multivariate_normal` function from the `numpy.random` module, which is used to generate multivariate normal random samples.

- `multivariate_normal(factors_returns_mean, factors_returns_cov)`: This draws from a multivariate normal distribution with the specified mean (`factors_returns_mean`) and covariance matrix (`factors_returns_cov`). Because no `size` argument is given, the function returns a single sample: a one-dimensional array with one value per factor.
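
The shape of the result and the effect of seeding can be checked with assumed inputs. In this sketch the zero mean vector and identity covariance are placeholders for the values computed from the factor returns.

```python
import numpy as np
from numpy.random import multivariate_normal, seed

mean = np.zeros(3)   # assumed mean vector for three factors
cov = np.eye(3)      # assumed covariance: independent, unit variance

seed(1496)
one_draw = multivariate_normal(mean, cov)            # shape (3,)
seed(1496)
same_draw = multivariate_normal(mean, cov)           # identical after reseeding
many_draws = multivariate_normal(mean, cov, size=5)  # shape (5, 3)
print(one_draw.shape, many_draws.shape)
```

Each draw yields one simulated return per factor, and reseeding with the same value reproduces the same draw.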

## 0.0.4 Running the Trials

In [None]:
b_coefs_per_stock = spark.sparkContext.broadcast(coefs_per_stock)
b_feature_columns = spark.sparkContext.broadcast(feature_columns)
b_factors_returns_mean = spark.sparkContext.broadcast(factors_returns_mean)
b_factors_returns_cov = spark.sparkContext.broadcast(factors_returns_cov)


This code segment broadcasts several variables using SparkContext.


- `spark.sparkContext.broadcast(coefs_per_stock)`: This broadcasts the 'coefs_per_stock' DataFrame using SparkContext. Broadcasting allows efficient sharing of read-only variables across all nodes in the Spark cluster.

- `spark.sparkContext.broadcast(feature_columns)`: This broadcasts the 'feature_columns' list using SparkContext. Broadcasting this list allows all nodes in the Spark cluster to access it efficiently.

- `spark.sparkContext.broadcast(factors_returns_mean)`: This broadcasts the 'factors_returns_mean' pandas Series using SparkContext. Broadcasting it enables all nodes in the Spark cluster to access it without needing to transfer the data over the network repeatedly.

- `spark.sparkContext.broadcast(factors_returns_cov)`: This broadcasts the 'factors_returns_cov' array using SparkContext. Broadcasting this array enables efficient access to the covariance matrix by all nodes in the Spark cluster.


In [None]:
from pyspark.sql.types import IntegerType
parallelism = 1000
num_trials = 1000000
base_seed = 1496
seeds = [b for b in range(base_seed,
                          base_seed + parallelism)]
seedsDF = spark.createDataFrame(seeds, IntegerType())
seedsDF = seedsDF.repartition(parallelism)


This code segment prepares a DataFrame with a range of seeds and repartitions it to control parallelism.


- `from pyspark.sql.types import IntegerType`: This imports the IntegerType from the pyspark.sql.types module, which is used to define the data type for the seeds.

- `parallelism = 1000`: This variable defines the desired level of parallelism.

- `num_trials = 1000000`: This variable specifies the number of trials.

- `base_seed = 1496`: This variable sets the base seed value.

- `seeds = [b for b in range(base_seed, base_seed + parallelism)]`: This creates a list of seeds ranging from the base seed to the base seed plus the parallelism.

- `seedsDF = spark.createDataFrame(seeds, IntegerType())`: This creates a DataFrame 'seedsDF' from the list of seeds, specifying the data type as IntegerType.

- `seedsDF = seedsDF.repartition(parallelism)`: This repartitions the DataFrame 'seedsDF' into the desired number of partitions to control parallelism during processing.
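
The arithmetic behind the seed list can be checked without Spark. This sketch reuses the values from the cell; the name `trials_per_seed` is introduced here for illustration.

```python
parallelism = 1000
num_trials = 1000000
base_seed = 1496

seeds = list(range(base_seed, base_seed + parallelism))
trials_per_seed = num_trials // parallelism  # trials each seed is responsible for

print(len(seeds), seeds[0], seeds[-1], trials_per_seed)
```

Each of the 1000 seeds accounts for 1000 trials, which together make up the one million trials requested.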


In [None]:
import random
from numpy.random import seed
from pyspark.sql.types import LongType, ArrayType, DoubleType
from pyspark.sql.functions import udf
def calculate_trial_return(x):
    trial_return_list = []
    # Seed once per call; reseeding inside the loop would repeat the same draw
    seed(x)
    for i in range(int(num_trials/parallelism)):
        random_int = random.randint(0, num_trials*num_trials)  # not used below
        random_factors = multivariate_normal(b_factors_returns_mean.value,
                                             b_factors_returns_cov.value)
        coefs_per_stock_df = b_coefs_per_stock.value
        returns_per_stock = (coefs_per_stock_df[b_feature_columns.value] *
                             (list(random_factors) + list(random_factors**2)))
        trial_return_list.append(float(returns_per_stock.sum(axis=1).sum() /
                                       b_coefs_per_stock.value.size))
    return trial_return_list
udf_return = udf(calculate_trial_return, ArrayType(DoubleType()))


This code defines a user-defined function (UDF) to calculate trial returns based on factors and coefficients.




- `from pyspark.sql.types import LongType, ArrayType, DoubleType`: This imports the required types from `pyspark.sql.types` for defining the UDF.

- `from pyspark.sql.functions import udf`: This imports the `udf` function from `pyspark.sql.functions`, which is used to create a user-defined function.

- `def calculate_trial_return(x):`: This defines a Python function named `calculate_trial_return`, which takes a parameter `x` (seed value) and calculates trial returns.

- `trial_return_list = []`: This initializes an empty list to store trial returns.

- The subsequent loop runs `num_trials/parallelism` times, generating random factors and computing returns for each trial.

- `random_int = random.randint(0, num_trials*num_trials)`: This generates a random integer between 0 and `num_trials*num_trials`. The value is not used afterwards.

- `seed(x)`: This seeds NumPy's random number generator with the seed value `x`, so that the draws produced for a given seed are reproducible.

- `random_factors = multivariate_normal(b_factors_returns_mean.value, b_factors_returns_cov.value)`: This generates random factors using the mean and covariance provided.

- `coefs_per_stock_df = b_coefs_per_stock.value`: This retrieves coefficients per stock from the broadcasted DataFrame.

- `returns_per_stock = (coefs_per_stock_df[b_feature_columns.value] * (list(random_factors) + list(random_factors**2)))`: This computes returns per stock based on the factors and coefficients.

- `trial_return_list.append(float(returns_per_stock.sum(axis=1).sum()/ b_coefs_per_stock.value.size))`: This appends the calculated trial return to the list.

- `return trial_return_list`: This returns the list of trial returns.

- `udf_return = udf(calculate_trial_return, ArrayType(DoubleType()))`: This creates a UDF named `udf_return` from the `calculate_trial_return` function, specifying the return type as an array of doubles.
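
Where the seeding call sits relative to the loop determines whether the trials differ from one another. The sketch below uses Python's `random` module as a stand-in for `numpy.random`, and both function names are hypothetical.

```python
import random


def draws_reseeding_each_time(seed_value, n):
    """Reseed before every draw: each draw restarts the same sequence."""
    out = []
    for _ in range(n):
        random.seed(seed_value)
        out.append(random.gauss(0.0, 1.0))
    return out


def draws_seeding_once(seed_value, n):
    """Seed once, then draw n values from the advancing sequence."""
    random.seed(seed_value)
    return [random.gauss(0.0, 1.0) for _ in range(n)]


repeated = draws_reseeding_each_time(1496, 5)
varied = draws_seeding_once(1496, 5)
print(len(set(repeated)), len(set(varied)))
```

Seeding once keeps the run reproducible while still producing a different draw for every trial, which is what a Monte Carlo simulation needs.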

In [None]:
from pyspark.sql.functions import col, explode
trials = seedsDF.withColumn("trial_return", udf_return(col("value")))
trials = trials.select('value', explode('trial_return').alias('trial_return'))
trials.cache()


This code runs the UDF on every seed, flattens the resulting arrays into one row per trial, and caches the result in the DataFrame 'trials'.




- `from pyspark.sql.functions import col, explode`: This imports the `col` and `explode` functions from `pyspark.sql.functions`, which are used for column operations and exploding array elements into separate rows, respectively.

- `trials = seedsDF.withColumn("trial_return", udf_return(col("value")))`: This adds a new column 'trial_return' to the 'seedsDF' DataFrame by applying the UDF 'udf_return' to the 'value' column. Each row of 'trial_return' contains an array of trial returns.

- `trials = trials.select('value', explode('trial_return').alias('trial_return'))`: This selects the 'value' column and explodes the 'trial_return' array column into separate rows, aliasing the exploded column as 'trial_return'. This results in a DataFrame where each row represents a single trial return associated with a specific seed value.

- `trials.cache()`: This caches the DataFrame 'trials' in memory to improve performance for subsequent operations. Caching is particularly useful when an RDD or DataFrame is going to be reused multiple times.
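
The effect of `explode` can be mimicked with a nested list comprehension. The rows below are made up and much shorter than the real arrays.

```python
# Made-up rows: one seed value with its array of trial returns per row.
rows = [
    (1496, [0.01, -0.02, 0.03]),
    (1497, [0.00, 0.04]),
]

# "Exploding" emits one output row per array element, repeating the seed.
exploded = [(seed_value, r) for seed_value, returns in rows for r in returns]
print(exploded)
```

Two input rows with three and two returns become five output rows, each paired with the seed that produced it.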

In [None]:
trials.approxQuantile('trial_return', [0.05], 0.0)


This code computes the approximate quantile of the 'trial_return' column in the 'trials' DataFrame at the specified quantile value.




- `trials`: This is the DataFrame containing trial returns.

- `.approxQuantile('trial_return', [0.05], 0.0)`: This method computes the quantile of the 'trial_return' column at probability 0.05, that is, the 5th percentile of the trial returns. The third parameter is the relative error tolerance; a value of 0.0 requests the exact quantile, which can be expensive on large data. The result is returned as a list with one value per requested probability.

In [None]:
trials.orderBy(col('trial_return').asc()).\
 limit(int(trials.count()/20)).\
 agg(fun.avg(col("trial_return"))).show()


This code snippet calculates the average of the trial returns for the bottom 5% of the sorted trial returns in the 'trials' DataFrame.




- `trials.orderBy(col('trial_return').asc())`: This sorts the 'trials' DataFrame in ascending order based on the 'trial_return' column.

- `.limit(int(trials.count() / 20))`: This limits the DataFrame to the first 5% of the sorted trial returns, determined by dividing the total count of trials by 20 (which represents 5%).

- `.agg(fun.avg(col("trial_return")))`: This computes the average of the 'trial_return' column for the selected subset of trials using the `avg` aggregate function.

- `.show()`: This displays the result of the aggregation, which is the average of the trial returns for the bottom 5% of the sorted trial returns.
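
The two tail statistics, the 5th percentile and the average of the worst 5%, can be reproduced on a small made-up sample. In this sketch `worst_tail` is a hypothetical helper, and taking the largest value of the tail as the quantile is one simple convention that may differ from Spark's exact rule.

```python
def worst_tail(values, denominator=20):
    """Return the lowest len(values) / denominator values in ascending order."""
    ordered = sorted(values)
    k = int(len(ordered) / denominator)  # same count as limit(int(count / 20))
    return ordered[:k]


# Made-up trial returns: -0.50, -0.49, ..., 0.49 (100 values).
trial_returns = [i / 100 for i in range(-50, 50)]

tail = worst_tail(trial_returns)
quantile_5 = tail[-1]               # boundary of the worst 5%
tail_mean = sum(tail) / len(tail)   # average return within the worst 5%
print(quantile_5, tail_mean)
```

The tail average is always at or below the quantile, because it averages only the outcomes that are at least as bad as the boundary value.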

## 0.0.5 Visualizing the Distribution of Returns

In [None]:
import pandas
mytrials=trials.toPandas()
mytrials.plot.line()


This code snippet converts the Spark DataFrame 'trials' to a Pandas DataFrame and plots the data as a line plot.




- `import pandas`: This imports the pandas library for data manipulation and analysis.

- `mytrials = trials.toPandas()`: This converts the Spark DataFrame 'trials' to a Pandas DataFrame named 'mytrials'.

- `mytrials.plot.line()`: This plots the 'mytrials' DataFrame as a line plot. Each numeric column, here 'value' (the seed) and 'trial_return', is drawn as a separate line against the row index.