# Library Imports
Here, the required libraries are imported. 

pandas: allows for powerful data storage and manipulation using DataFrames.

seaborn: used for creating good-looking graphics

matplotlib: used for creating plots

statsmodels: used for regression analysis

preprocessing (from scikit-learn): used for normalising data

math: used for general math functions

In [1]:
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
#import numpy as np
from ast import literal_eval
import statsmodels.api as sm
from sklearn import preprocessing
import math

# Data import
Here, the data is imported. It's assumed that a csv-file will be used.

In [2]:
#pd.read_csv is used to open a csv file. The csv file should be in the same folder as the script.
#The delimiter can be changed here as well.
df = pd.read_csv('input_file.csv', delimiter = ',')

#This option can be used to change the number of columns that are displayed in a DataFrame.              
pd.set_option('display.max_columns', 100)

#The first ten rows of the DataFrame are displayed. Check if the data import was successful.
df.head(10)

FileNotFoundError: [Errno 2] File b'input_file.csv' does not exist: b'input_file.csv'
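
As a small illustration of the `delimiter` argument, the sketch below reads a semicolon-separated table from an in-memory string instead of a file. The column names and values are made up for this example and are not part of any real input file.

```python
import io
import pandas as pd

# Hypothetical semicolon-separated data, standing in for a real csv file
csv_text = (
    "random_seed;independent_variable_1;summary_statistic\n"
    "1;0.5;12.0\n"
    "2;0.7;15.5\n"
)

# delimiter=';' tells pandas to split columns on semicolons instead of commas
df_example = pd.read_csv(io.StringIO(csv_text), delimiter=';')

print(df_example.head())
```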

Boolean values, or categorical data with only 2 different values, should be converted to integers. This makes plotting the data easier.

In [None]:
#Changes boolean data to integers (True = 1, False = 0)
df['boolean_data'] = df['boolean_data'].astype(int)

#Replaces categorical data with 2 possible values by 0 or 1
df = df.replace({'categorical_data': {'value_1': 0, 'value_2': 1}})

# Calculate summary statistic

Here, the summary statistic should be calculated. 
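
Which summary statistic is appropriate depends on the model. As one possible sketch, assuming each row is a single run with hypothetical output columns `output_1` and `output_2`, the row-wise mean could be stored in a `summary_statistic` column:

```python
import pandas as pd

# Hypothetical simulation output: one row per run
df_example = pd.DataFrame({
    'random_seed': [1, 2, 3],
    'output_1': [10.0, 12.0, 14.0],
    'output_2': [20.0, 18.0, 22.0],
})

# Row-wise mean of the output columns, stored under the column name
# that the regression cells expect
df_example['summary_statistic'] = df_example[['output_1', 'output_2']].mean(axis=1)

print(df_example)
```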

# Regression analysis

### Regression of the full model
Fit a multiple regression of the summary statistic on all independent variables and the random_seed.

In [None]:
X = df[['random_seed', 
        'independent_variable_1',
        'independent_variable_2']]
X = sm.add_constant(X)
mreg = sm.OLS(df['summary_statistic'], X)
results = mreg.fit()
print(results.summary())

### Regression of important variables

In [None]:
X = df[['independent_variable_1',
        'independent_variable_2']]
X = sm.add_constant(X)
mreg = sm.OLS(df['summary_statistic'], X)
results = mreg.fit()
print(results.summary())

### Regression of the random seed
Check whether the random_seed has a significant effect on the summary statistic.

In [None]:
X = df[['random_seed']]
X = sm.add_constant(X)
mreg = sm.OLS(df['summary_statistic'], X)
results = mreg.fit()
print(results.summary())

# Data visualisation

### Line plot of all independent variables

All independent variables that were varied throughout the simulations are plotted against the mean summary statistic of all runs for each value of the independent variable. The values of the independent variable are normalised.

Such a graph can easily show which variables are significant.

In [None]:
#First, create a list of independent variables that will be used to create graphs
independent_variables = ['independent_variable_1', 
                         'independent_variable_2', 
                         'independent_variable_3']

#Set the figure style and size
fig = plt.figure(figsize = (20, 10))
sns.set_style('whitegrid')

#Set the scaler
scaler=preprocessing.MinMaxScaler(feature_range=(0,1))

#Create the plot
for i in independent_variables:
    #Mean of all numeric columns for each value of the independent variable
    df_temp = df.groupby(i).mean(numeric_only=True).reset_index()
    #Normalise the values of the independent variable to the range 0-1
    df_temp[i] = scaler.fit_transform(df_temp[[i]]).ravel()

    plt.plot(i, 'summary_statistic', data=df_temp, linestyle='--', marker='o', label=i)

#Create labels and legend
plt.xlabel('Independent variables, normalised', fontsize=20)
plt.ylabel('summary_statistic', fontsize=20)
plt.legend(loc='best', fontsize=15)
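
The min-max normalisation used above rescales each variable to the range 0 to 1 via (x - min) / (max - min), so that variables with different units can share one x-axis. A pandas-only sketch with made-up values:

```python
import pandas as pd

# Hypothetical values of one independent variable
values = pd.Series([10.0, 20.0, 40.0])

# Min-max normalisation: the smallest value maps to 0, the largest to 1
normalised = (values - values.min()) / (values.max() - values.min())

print(normalised.tolist())
```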

### Violin plots
Create violin plots of all independent variables against a dependent variable of choice.

In [None]:
for i in independent_variables:
    fig, axes = plt.subplots(figsize = (20, 10))
    #Fix the y-axis range; adjust this to the range of the dependent variable
    axes.set_ylim([0,100])
    sns.violinplot(x=i, y = 'a dependent variable', data=df, ax=axes)