# Checking and preprocessing stroke outcome data from the R project

The data in the R project is held as serialised RData.  The preprocessing below converts it to clean CSV files in a format compatible with Python, pandas and numpy.

## R commands

I ran the following code in R.  It opens each RData file and writes it out as a CSV file.  The code was run from within the R project directory.

> Note: I needed to use a mix of functions to get sensible output.  SSNAP and OUTCOMES use the R built-in `write.csv()`.  DEATH.r is an R dataframe and worked better with `readr::write_csv()`.

```R
# open the serialised data
SSNAP.r <- readRDS("data/ssnap.RData")
OUTCOMES.r <- readRDS("data/HERMES.RData")
DEATH.r <- readRDS("data/gg2.deaths.RData")

# write each object out as a CSV file
write.csv(SSNAP.r, 'SSNAP.csv')
write.csv(OUTCOMES.r, 'OUTCOMES.csv')
library(readr)
write_csv(DEATH.r, 'DEATH.csv')
```

## Imports

In [1]:
import pandas as pd
import numpy as np

## Constants

Replace these paths with the locations of your own copies of the files.

In [2]:
DEFAULT_SSNAP_CSV = '../stroke_data/SSNAP.csv'
DEFAULT_OUTCOME_CSV = '../stroke_data/OUTCOMES.csv'
DEFAULT_DEATH_CSV = '../stroke_data/DEATH.csv'

## SSNAP

First, check that the file loads okay and inspect the data types.

In [3]:
ssnap = pd.read_csv(DEFAULT_SSNAP_CSV, index_col=0)
ssnap.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 437 entries, 1 to 437
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   x       437 non-null    float64
dtypes: float64(1)
memory usage: 6.8 KB


Pandas has read the data in as `float64`.  Checking for R precision issues on export to CSV shows that some integers end up as x.99999999 or x.00000001.

In [4]:
# keep the rows where the value is not already a whole number
x = ssnap['x'].to_numpy()
rounding_issues = ssnap.to_numpy()[np.floor(x) != x]

print(rounding_issues.shape)
print(rounding_issues[:5])

(63, 1)
[[74.99999999]
 [76.00000001]
 [74.99999999]
 [68.99999999]
 [45.99999999]]
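
Before rounding, it may be worth confirming that every value really is within a tiny distance of a whole number, so that rounding cannot hide a genuine non-integer.  The following is a minimal sketch only: the helper name `max_distance_from_integer` and the `TOLERANCE` threshold are assumptions for illustration and are not part of the original analysis.

```python
import numpy as np

def max_distance_from_integer(values):
    '''Largest absolute gap between any value and its nearest whole number.'''
    values = np.asarray(values, dtype=np.float64)
    return float(np.max(np.abs(values - np.round(values))))

# a few values copied from the output above, plus one clean integer
sample = np.array([74.99999999, 76.00000001, 117.0])

TOLERANCE = 1e-6  # assumed threshold; adjust to suit your data
assert max_distance_from_integer(sample) < TOLERANCE
```

With the real data the same check could be run as `max_distance_from_integer(ssnap['x'])` before calling `.round()`.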


In [5]:
# round to the nearest whole number to remove the export noise
ssnap = ssnap.round()
ssnap.head()

Unnamed: 0,x
1,117.0
2,50.0
3,86.0
4,162.0
5,120.0


### SSNAP cleaning function

In [6]:
def clean_ssnap(path=DEFAULT_SSNAP_CSV):
    return (pd.read_csv(path, index_col=0)
               .round()
               .astype({'x': np.int32}))

In [7]:
ssnap = clean_ssnap()
ssnap.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 437 entries, 1 to 437
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   x       437 non-null    int32
dtypes: int32(1)
memory usage: 5.1 KB


In [8]:
ssnap.head(3)

Unnamed: 0,x
1,117
2,50
3,86


## Outcomes

Check that it loads okay.  There are no variable names in the R data file, so the columns get the defaults `V1` to `V7`.

In [9]:
outcomes = pd.read_csv(DEFAULT_OUTCOME_CSV, index_col=0)
outcomes.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 600 entries, 1 to 600
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   V1      600 non-null    float64
 1   V2      600 non-null    float64
 2   V3      600 non-null    float64
 3   V4      600 non-null    float64
 4   V5      600 non-null    float64
 5   V6      600 non-null    float64
 6   V7      600 non-null    int64  
dtypes: float64(6), int64(1)
memory usage: 37.5 KB


In [10]:
outcomes.head()

Unnamed: 0,V1,V2,V3,V4,V5,V6,V7
1,0.12,0.3,0.52,0.7,0.87,0.93,1
2,0.119866,0.299716,0.519599,0.699616,0.86975,0.929833,1
3,0.119733,0.299432,0.519199,0.699232,0.869499,0.929666,1
4,0.119599,0.299149,0.518798,0.698848,0.869249,0.929499,1
5,0.119466,0.298865,0.518397,0.698464,0.868998,0.929332,1
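
Since the file has no variable names, a quick sanity check and a relabelling step may help.  In the rows shown above each row is non-decreasing and ends at 1, which looks like a set of cumulative probabilities, but the source does not say what the columns represent.  The sketch below tests that property on the first two rows and applies purely hypothetical names (`mrs_0` to `mrs_6`, assuming one column per mRS score); treat both the interpretation and the names as assumptions.

```python
import numpy as np
import pandas as pd

# first two rows copied from the outcomes.head() output above
sample = pd.DataFrame(
    [[0.12, 0.3, 0.52, 0.7, 0.87, 0.93, 1],
     [0.119866, 0.299716, 0.519599, 0.699616, 0.86975, 0.929833, 1]],
    index=[1, 2],
    columns=[f'V{i}' for i in range(1, 8)])

def rows_are_cumulative(df):
    '''True if every row is non-decreasing and its last value is 1.'''
    values = df.to_numpy(dtype=np.float64)
    non_decreasing = np.all(np.diff(values, axis=1) >= 0)
    ends_at_one = np.allclose(values[:, -1], 1.0)
    return bool(non_decreasing and ends_at_one)

# hypothetical labels: assumes V1..V7 correspond to mRS 0..6
HYPOTHETICAL_NAMES = {f'V{i + 1}': f'mrs_{i}' for i in range(7)}
labelled = sample.rename(columns=HYPOTHETICAL_NAMES)

print(rows_are_cumulative(sample))
print(list(labelled.columns))
```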


## Deaths

The `intervals` column in the Deaths data needs attention on import:

* Each cell holds a 1D R vector (created using `c(i, ..., n)` in R) that was written to the CSV as a string.  It needs to be converted to a `numpy` array.

When read in, it appears as an `object` column in `pandas`.

In [11]:
deaths = pd.read_csv(DEFAULT_DEATH_CSV, index_col=0)
deaths.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 280 entries, 1 to 40
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   mRs        280 non-null    int64 
 1   intervals  280 non-null    object
dtypes: int64(1), object(1)
memory usage: 6.6+ KB


In [12]:
# note that the first value in intervals is 0 without decimals
deaths.head(2)

id,mRs,intervals
1,0,"c(0, 1.09527915843621e-05, 0.00200061859353828..."
1,1,"c(0, 2.03207054714216e-05, 0.00309690512488614..."


As an example, convert the first row to show how the operation works.

In [13]:
# slice removing 'c(... )' -> split into list -> ndarray -> astype float
example = np.array(deaths.intervals.iloc[0][2:-1].split(',')).astype(np.float64)
print(f'original: {deaths.intervals.iloc[0][:48]}')
print(f'cleaned: {example[0:3]}')

original: c(0, 1.09527915843621e-05, 0.00200061859353828, 
cleaned: [0.00000000e+00 1.09527916e-05 2.00061859e-03]


The same code, written as a function that can be applied to all rows in one go.

In [14]:
def r_array_to_python_array(row):
    return np.array(row['intervals'][2:-1].split(',')).astype(np.float64)

In [15]:
# apply row by row; the function can be passed directly without a lambda
deaths['intervals'] = deaths.apply(r_array_to_python_array, axis=1)
deaths.head(2)

id,mRs,intervals
1,0,"[0.0, 1.09527915843621e-05, 0.0020006185935382..."
1,1,"[0.0, 2.03207054714216e-05, 0.0030969051248861..."


In [16]:
# quick check
SELECTED_INDEX = 0
print(type(deaths.iloc[SELECTED_INDEX]['intervals']))
print(deaths.iloc[SELECTED_INDEX]['intervals'].shape)

<class 'numpy.ndarray'>
(489,)
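
The slice `[2:-1]` assumes that every cell starts with `c(` and ends with `)`.  As a slightly more defensive alternative, the sketch below only strips the wrapper when it is present and works on the string itself, so it can be used with `Series.map`.  The name `parse_r_vector` is hypothetical, and whether any cells in the real file lack the wrapper is an assumption I have not checked.

```python
import numpy as np
import pandas as pd

def parse_r_vector(text):
    '''Convert a string such as 'c(0, 1.5, 2)' to a float64 ndarray.'''
    text = text.strip()
    # only remove the R wrapper if it is actually there
    if text.startswith('c(') and text.endswith(')'):
        text = text[2:-1]
    return np.array([float(value) for value in text.split(',')],
                    dtype=np.float64)

# small made-up Series in the same format as the intervals column
demo = pd.Series(['c(0, 1.09527915843621e-05, 0.00200061859353828)',
                  '0.5'])
parsed = demo.map(parse_r_vector)
print(parsed.iloc[0].shape, parsed.iloc[1].shape)
```

On the real data this could be used as `deaths['intervals'].map(parse_r_vector)`.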


### Final Deaths cleaning code

In [17]:
def clean_intervals(df):
    # convert each R-style string to a numpy array, row by row
    df['intervals'] = df.apply(r_array_to_python_array, axis=1)
    return df

In [18]:
def clean_deaths(path=DEFAULT_DEATH_CSV):
    return (pd.read_csv(path, index_col=0)
              .astype({'mRs': np.int8})
              .pipe(clean_intervals))

In [19]:
deaths = clean_deaths()
deaths.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 280 entries, 1 to 40
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   mRs        280 non-null    int8  
 1   intervals  280 non-null    object
dtypes: int8(1), object(1)
memory usage: 4.6+ KB


In [20]:
deaths.head(2)

id,mRs,intervals
1,0,"[0.0, 1.09527915843621e-05, 0.0020006185935382..."
1,1,"[0.0, 2.03207054714216e-05, 0.0030969051248861..."


## Complete listing without explanations

In [21]:
import pandas as pd
import numpy as np

# default locations of stroke outcome data csv files
DEFAULT_SSNAP_CSV = '../stroke_data/SSNAP.csv'
DEFAULT_OUTCOME_CSV = '../stroke_data/OUTCOMES.csv'
DEFAULT_DEATH_CSV = '../stroke_data/DEATH.csv'

def clean_ssnap(path=DEFAULT_SSNAP_CSV):
    '''
    Load and clean SSNAP data.  Rounding to the nearest integer fixes a
    precision issue introduced when R exported the values to CSV.
    
    Params:
    -----
    path: str, optional
        path/URL to SSNAP data CSV that was outputted from R
    
    Returns:
    --------
    pd.DataFrame
    '''
    return (pd.read_csv(path, index_col=0)
               .round()
               .astype({'x': np.int32}))

def clean_outcomes(path=DEFAULT_OUTCOME_CSV):
    '''
    Load and clean HERMES outcome data.  
    
    Params:
    -----
    path: str, optional
        path/URL to outcome data CSV that was outputted from R
    
    Returns:
    --------
    pd.DataFrame
    '''
    return pd.read_csv(path, index_col=0)

def r_array_to_python_array(row):
    '''
    Converts R array in str format `c(i, ..., n)` to numpy.ndarray
    
    Params:
    ------
    row: pandas.Series
        Row of pandas dataframe
        
    Returns:
    -------
    np.ndarray
    '''
    return np.array(row['intervals'][2:-1].split(',')).astype(np.float64)

def clean_intervals(df):
    '''
    Clean the intervals column in a pandas.DataFrame of DEATH.r
    
    Params:
    ------
    df: pd.DataFrame
        Full dataframe of DEATH.r
        
    Returns:
    --------
    pd.DataFrame
        
    '''
    # convert each R-style string to a numpy array, row by row
    df['intervals'] = df.apply(r_array_to_python_array, axis=1)
    return df

def clean_deaths(path=DEFAULT_DEATH_CSV):
    '''
    Load and clean DEATH.r 
    Intervals are stored as strings in the R vector format 'c(...)'.  These
    are converted to np.ndarrays held in a DataFrame cell.
    
    Params:
    -----
    path: str, optional
        path/URL to Death data CSV that was outputted from R
    
    Returns:
    --------
    pd.DataFrame
    '''
    return (pd.read_csv(path, index_col=0)
              .astype({'mRs': np.int8})
              .pipe(clean_intervals))


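def stroke_outcome_data(ssnap_csv=DEFAULT_SSNAP_CSV, 
                        outcome_csv=DEFAULT_OUTCOME_CSV,
                        death_csv=DEFAULT_DEATH_CSV):
    '''
    Load and clean all three stroke outcome datasets.
    
    Returns:
    --------
    tuple of (ssnap, outcomes, deaths) pd.DataFrames
    '''
    ssnap = clean_ssnap(ssnap_csv)
    outcomes = clean_outcomes(outcome_csv)
    deaths = clean_deaths(death_csv)
    return ssnap, outcomes, deaths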

In [22]:
ssnap, outcomes, deaths = stroke_outcome_data()

In [23]:
ssnap.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 437 entries, 1 to 437
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   x       437 non-null    int32
dtypes: int32(1)
memory usage: 5.1 KB


In [24]:
outcomes.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 600 entries, 1 to 600
Data columns (total 7 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   V1      600 non-null    float64
 1   V2      600 non-null    float64
 2   V3      600 non-null    float64
 3   V4      600 non-null    float64
 4   V5      600 non-null    float64
 5   V6      600 non-null    float64
 6   V7      600 non-null    int64  
dtypes: float64(6), int64(1)
memory usage: 37.5 KB


In [25]:
deaths.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 280 entries, 1 to 40
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   mRs        280 non-null    int8  
 1   intervals  280 non-null    object
dtypes: int8(1), object(1)
memory usage: 4.6+ KB


In [26]:
ssnap.to_csv('../stroke_data/clean_ssnap.csv')
outcomes.to_csv('../stroke_data/clean_outcomes.csv')
deaths.to_csv('../stroke_data/clean_deaths.csv')
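
One caveat with the last line: the `intervals` cells are `numpy` arrays, and as far as I can tell `to_csv` writes each one using its printed text form, which may shorten the values and is awkward to parse back.  If the cleaned Deaths file needs to be reloaded later, one option is to store each array as a JSON list.  The sketch below is an assumption about how that could be done, not part of the original workflow; the helper names `intervals_to_json` and `intervals_from_json` are hypothetical, and an in-memory buffer stands in for the real file.

```python
import io
import json

import numpy as np
import pandas as pd

def intervals_to_json(df):
    '''Return a copy with each intervals array stored as a JSON string.'''
    out = df.copy()
    out['intervals'] = out['intervals'].map(
        lambda arr: json.dumps(np.asarray(arr).tolist()))
    return out

def intervals_from_json(df):
    '''Return a copy with each JSON string converted back to an ndarray.'''
    out = df.copy()
    out['intervals'] = out['intervals'].map(
        lambda text: np.array(json.loads(text), dtype=np.float64))
    return out

# small made-up frame with the same layout as the cleaned deaths data
demo = pd.DataFrame(
    {'mRs': [0, 1],
     'intervals': [np.array([0.0, 1.09527915843621e-05]),
                   np.array([0.0, 2.03207054714216e-05])]},
    index=pd.Index([1, 1], name='id'))

# write to an in-memory buffer instead of a file, then read it back
buffer = io.StringIO()
intervals_to_json(demo).to_csv(buffer)
buffer.seek(0)
restored = intervals_from_json(pd.read_csv(buffer, index_col=0))

print(np.array_equal(restored['intervals'].iloc[0],
                     demo['intervals'].iloc[0]))
```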