### 4. Data frame

A **data frame** is an object designed to handle data in tabular form efficiently, where each row corresponds to an observation and each column to a variable. The **data frame** object provides column labeling as well as flexible indexing capabilities for the rows of the data set, similar to an Excel spreadsheet. Let's create a **data frame** object using the **data.frame()** (https://stat.ethz.ch/R-manual/R-devel/library/base/html/data.frame.html) function:

In [5]:
df = data.frame(Integers=6:9, Double=6:9+0.5, row.names=c('a','b', 'c', 'd'))
df

,Integers,Double
a,6,6.5
b,7,7.5
c,8,8.5
d,9,9.5


In [None]:
class(df)

This simple example already shows some major features of the **data frame** class when it comes to storing data:

* **Data**: data itself can be provided in different shapes and types (**list**, **vector** and **matrix**);
* **Labels**: data is organized in columns, which can have custom names;
* **Index**: rows carry an index (the row names) that can hold numbers or strings; other values, such as dates, are converted to strings when used as row names.
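
To illustrate the first point, here is a minimal sketch that builds the same small table from plain vectors, from a list, and from a matrix. The object names (`from_vectors`, `from_list`, `from_matrix`, `m`) are made up for this illustration and are not used elsewhere in the notebook:

```r
# Columns given directly as vectors of equal length.
from_vectors = data.frame(Integers=6:9, Double=6:9 + 0.5)

# A named list of equal-length vectors becomes one column per element.
from_list = data.frame(list(Integers=6:9, Double=6:9 + 0.5))

# A matrix becomes one column per matrix column; its dimnames are reused as labels.
m = matrix(c(6:9, 6:9 + 0.5), ncol=2,
           dimnames=list(c('a', 'b', 'c', 'd'), c('Integers', 'Double')))
from_matrix = data.frame(m)

print(dim(from_vectors))
print(dim(from_list))
print(row.names(from_matrix))
print(names(from_matrix))
```

Note that a matrix holds a single data type, so the `Integers` column of `from_matrix` is stored as double rather than integer.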

Working with a **data frame** object is convenient compared to a regular **matrix** object, which can hold only one data type and is therefore more restricted when columns of different types have to be stored together. The following are simple examples showing how typical operations and attributes on a **data frame** object work:

In [None]:
print(   row.names(df)   )

In [None]:
print(   names(df)   )

In [None]:
df['c',]

In [None]:
df[c('a', 'b'),]

In [None]:
df[1:3,]

In [None]:
df[c(2, 1),]

In [None]:
print(   df$Integers   )
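
The indexing forms above differ in what they return. The sketch below, using a hypothetical data frame called `demo`, contrasts single brackets, double brackets, and `$`:

```r
demo = data.frame(Integers=6:9, Double=6:9 + 0.5, row.names=c('a', 'b', 'c', 'd'))

one_col_df = demo['Integers']      # single brackets: still a data frame
one_col_vec = demo[['Integers']]   # double brackets: the underlying vector
same_vec = demo$Integers           # $ extracts the same vector by name
one_cell = demo['c', 'Double']     # row label plus column label: a single value

print(class(one_col_df))
print(class(one_col_vec))
print(identical(one_col_vec, same_vec))
print(one_cell)
```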

You can apply many mathematical functions and operators directly to the values in the **data frame**, as long as all of its columns are numeric:

In [None]:
sqrt(df)

In [None]:
df ^ 2
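
These calls rely on every column being numeric. As a minimal sketch (the data frame `mixed` and its contents are invented for this illustration), one way to handle a data frame that also contains text is to select the numeric columns first:

```r
mixed = data.frame(Integers=6:9, Names=c('John', 'Tom', 'Ann', 'Kyle'),
                   stringsAsFactors=FALSE)

# sqrt(mixed) would stop with an error because Names is not numeric,
# so pick out the numeric columns before applying the function.
is_num = sapply(mixed, is.numeric)
roots = sqrt(mixed[is_num])
print(roots)
```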

Enlarging the **data frame** object in both dimensions is possible using **cbind()** and **rbind()** (https://stat.ethz.ch/R-manual/R-devel/library/base/html/cbind.html) functions:

In [None]:
df = cbind(df, Names=c('John', 'Tom', 'Sarrah', 'Kyle'), stringsAsFactors=FALSE)
df

You can define a new column directly:

In [None]:
df$Cities = c('Paris', 'London', 'Tokyo', 'Los Angeles')
df

In [None]:
df = rbind(df, e=list(10, 10.5, 'Peter', 'Mumbai'))
df

You can also add a new row by creating a new **data frame** object:

In [None]:
new_row = data.frame(Integers=11, Double=11.5, Names='Nick', Cities='New York', row.names='f')
df = rbind(df, new_row)
df
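
When both arguments of **rbind()** are data frames, columns are matched by name rather than by position. A small sketch with hypothetical objects `base_df` and `extra`:

```r
base_df = data.frame(Integers=6:7, Double=c(6.5, 7.5), row.names=c('a', 'b'))

# The new row lists its columns in a different order.
extra = data.frame(Double=8.5, Integers=8L, row.names='c')

combined = rbind(base_df, extra)
print(combined)
```

This is why building the new row as a **data frame** is less error-prone than passing an unnamed list, where values are taken in positional order.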

Column and row sums and means can be calculated using the **colSums()**, **rowSums()**, **colMeans()** and **rowMeans()** functions (https://stat.ethz.ch/R-manual/R-devel/library/base/html/colSums.html):

In [None]:
colSums(   df[c('Integers', 'Double')]   )

In [None]:
rowMeans(   df[c('Integers', 'Double')]   )

**_Looping through rows and columns of a data frame._** For this you will need **rownames()** and **colnames()** (https://stat.ethz.ch/R-manual/R-devel/library/base/html/colnames.html) functions:

In [None]:
for (row in rownames(df)){
    print(   df[row,]   )
}

In [None]:
for (column in colnames(df)){
    print(   df[column]   )
}
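
As an alternative sketch, rows can also be visited by position, and columns can be processed without an explicit loop. The data frame `demo` below is a stand-in created only for this example:

```r
demo = data.frame(Integers=6:9, Double=6:9 + 0.5, row.names=c('a', 'b', 'c', 'd'))

# Loop over row positions; seq_len() is safe even for a data frame with zero rows.
for (i in seq_len(nrow(demo))) {
    cat(row.names(demo)[i], ':', demo$Integers[i] + demo$Double[i], '\n')
}

# sapply() visits each column and collects one result per column.
col_max = sapply(demo, max)
print(col_max)
```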

**_Deleting rows and columns._** Columns of an R **data frame** can be dropped or kept using the **subset()** (https://stat.ethz.ch/R-manual/R-devel/library/base/html/subset.html) function:

In [None]:
df

Keep the _'Integers'_ and _'Double'_ columns:

In [None]:
df_1 = subset(df, select=c('Integers', 'Double'))
df_1

Delete the _'Cities'_ column. Note that when a column is excluded with a minus sign inside the **subset()** (https://stat.ethz.ch/R-manual/R-devel/library/base/html/subset.html) function, its name is written without _'quotes'_:

In [None]:
df_2 = subset(df, select=-Cities)
df_2
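
Columns can also be dropped without **subset()**. The following sketch, on a made-up data frame `demo`, shows two common base R alternatives:

```r
demo = data.frame(Integers=6:9, Double=6:9 + 0.5,
                  Cities=c('Paris', 'London', 'Tokyo', 'Los Angeles'),
                  stringsAsFactors=FALSE)

# Keep every column whose name is not listed; drop=FALSE keeps the result a data frame.
no_cities = demo[, !(names(demo) %in% c('Cities')), drop=FALSE]

# Assigning NULL removes the column from the object itself.
in_place = demo
in_place$Double = NULL

print(names(no_cities))
print(names(in_place))
```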

Rows of an R **data frame** are removed by creating a subset that leaves them out, either by listing the row positions to keep or by using negative positions for the rows to drop:

In [None]:
df_3 = df[c(1,4,5),]
df_3

In [None]:
df_3 = df[-(2:4),]
df_3

It becomes a bit more complicated when dealing with row names of a **data frame**:

In [None]:
df_4 = df[(row.names(df) %in% c('a','b','d')),]
df_4

In [None]:
df_5 = df[!(row.names(df) %in% c('e','f')),]
df_5

You can select rows based on certain criteria:

In [None]:
df_6 = df[df['Integers'] > 7,]
df_6
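
One caveat with logical conditions is that a missing value in the tested column produces a row of NAs in the result. The sketch below (with an invented data frame `demo`) shows the effect and two ways around it:

```r
demo = data.frame(Integers=c(6L, NA, 8L, 9L), Double=c(6.5, 7.5, 8.5, 9.5),
                  row.names=c('a', 'b', 'c', 'd'))

with_na_row = demo[demo$Integers > 7, ]         # the NA condition yields a row of NAs
clean_which = demo[which(demo$Integers > 7), ]  # which() keeps only the TRUE positions
clean_subset = subset(demo, Integers > 7)       # subset() treats NA conditions as FALSE

print(nrow(with_na_row))
print(row.names(clean_which))
print(row.names(clean_subset))
```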

To delete rows with missing values, let's first create some missing data:

In [None]:
df['f',] = list(NA, NA, NA, NA)
df['e', 'Names'] = NA
df

The **na.omit()** (https://stat.ethz.ch/R-manual/R-devel/library/stats/html/na.fail.html) function deletes the rows with at least one missing value:

In [None]:
df_7 = na.omit(df)
df_7

Deleting only rows that have all the columns missing is a bit more complicated:

In [None]:
df_8 = df[rowSums(is.na(df)) != ncol(df),]
df_8

You can delete rows if a specific column has a missing value:

In [None]:
df_9 = df[!is.na(   df['Cities']   ),]
df_9
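
The three cases above can be compared side by side on a small example. The data frame `demo` is hypothetical, and **complete.cases()** is used here as an alternative way to find the rows that **na.omit()** would keep:

```r
demo = data.frame(Integers=c(6L, NA, NA, 9L),
                  Cities=c('Paris', NA, 'Tokyo', NA),
                  row.names=c('a', 'b', 'c', 'd'), stringsAsFactors=FALSE)

full_rows = demo[complete.cases(demo), ]                # no missing value in any column
not_all_na = demo[rowSums(is.na(demo)) < ncol(demo), ]  # at least one value present
has_city = demo[!is.na(demo$Cities), ]                  # Cities must be present

print(row.names(full_rows))
print(row.names(not_all_na))
print(row.names(has_city))
```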

**_Creating a random data frame object._** The following example is based on a **matrix** object with standard normally distributed random numbers, obtained using the **rnorm()** (https://stat.ethz.ch/R-manual/R-devel/library/stats/html/Normal.html) function:

In [None]:
set.seed(6)
data = matrix(rnorm(40, 0, 1), nrow=10)
print(data)

Although one can construct **data frame** objects more directly (as seen before), using a **matrix** is also a good choice since R will retain the basic structure and will only add index values and column names:

In [None]:
df_10 = data.frame(data)
df_10

**Data frame** column names can be defined directly by assigning a **vector** object with the right number of elements. This illustrates that one can define/change the attributes of the **data frame** object easily:

In [None]:
colnames(df_10) = c('Col1', 'Col2', 'Col3', 'Col4')
df_10

To work with financial time series data efficiently, one must be able to handle time indices well. For example, assume that our ten data entries in the four columns correspond to month-end data, beginning in December 2019. A **Date** object is generated with the **seq()** (https://stat.ethz.ch/R-manual/R-devel/library/base/html/seq.Date.html) function as follows. _Notice that to create end-of-month dates, you start the sequence on the first day of the following month and subtract one day_:

In [None]:
dates = seq(as.Date("2020-01-01"), by = "month", length.out=10)-1
print(dates)

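The following code assigns the newly created **Date** object as the row names of the **data frame**, turning the original data set into a time series. Note that row names are stored as character strings, so the dates are converted to text: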

In [None]:
row.names(df_10) = dates
df_10
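
If date arithmetic or date comparisons are needed later, one option is to keep the dates in an ordinary column, where they stay of class **Date**. This is a sketch under that assumption; the names `ts_df` and `after_jan` and the values are invented for illustration:

```r
dates = seq(as.Date("2020-01-01"), by="month", length.out=3) - 1
ts_df = data.frame(Date=dates, Value=c(1.5, -0.3, 0.8))

print(class(ts_df$Date))

# Date columns can be compared with other Date objects.
after_jan = ts_df[ts_df$Date > as.Date("2020-01-15"), ]
print(after_jan)

# Row names, by contrast, are stored as character strings.
row.names(ts_df) = dates
print(class(row.names(ts_df)))
```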

Many useful functions can be applied to a **data frame** object. Consider the **dim()** (https://stat.ethz.ch/R-manual/R-devel/library/base/html/dim.html), **head()** (https://stat.ethz.ch/R-manual/R-devel/library/utils/html/head.html) and **summary()** (https://stat.ethz.ch/R-manual/R-devel/library/base/html/summary.html) functions:

In [None]:
dim(df_10)

In [None]:
head(df_10)

In [None]:
summary(df_10)

In addition, one can easily get the column-wise or row-wise sums, means, and cumulative sums as shown below using the **apply()** (https://stat.ethz.ch/R-manual/R-devel/library/base/html/apply.html) function:

In [None]:
apply(df_10, 2, sum)

In [None]:
apply(df_10, 1, sum)

In [None]:
apply(df_10, 2, mean)

In [None]:
apply(df_10, 2, sd)

In [None]:
apply(df_10, 2, cumsum)

In [None]:
apply(df_10, 1, cumsum)
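
One detail worth checking is the shape of the result: when the applied function returns a vector, **apply()** stores each result as a column, so the row-wise version comes back transposed. A minimal sketch with a small stand-in data frame `demo`:

```r
set.seed(6)
demo = data.frame(matrix(rnorm(12), nrow=4))

by_col = apply(demo, 2, cumsum)   # 4 x 3: same shape as demo
by_row = apply(demo, 1, cumsum)   # 3 x 4: each column holds one original row
print(dim(by_col))
print(dim(by_row))

# Transpose to get back one row per observation.
by_row_aligned = t(by_row)
print(dim(by_row_aligned))

# For plain sums, colSums() gives the same values as apply(..., 2, sum).
print(all.equal(colSums(demo), apply(demo, 2, sum)))
```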

When working with **data frames** in R, make sure to handle **NaN** cases before performing mathematical operations. In a number of cases this lets you work with incomplete data sets as if they were complete:

In [None]:
log(df_10)

Since the logarithm of a negative number is undefined, log() returns **NaN** for every negative entry, and the column sums become **NaN** as well:

In [None]:
apply(   log(df_10), 2, sum   )

R will ignore **NaN** cases if you pass **na.rm=TRUE** argument inside the **apply()** (https://stat.ethz.ch/R-manual/R-devel/library/base/html/apply.html) function:

In [None]:
apply(   log(df_10), 2, sum, na.rm=TRUE   )
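
The reason **na.rm=TRUE** also removes **NaN** values is that R treats **NaN** as a kind of missing value. A short sketch on a hypothetical vector `x`:

```r
x = c(-1, 1, exp(2))

# log(-1) is NaN and normally raises a warning, which is silenced here.
logs = suppressWarnings(log(x))

print(is.nan(logs))            # TRUE only for the NaN entry
print(is.na(logs))             # NaN also counts as missing for is.na()
print(sum(logs))               # NaN propagates through the sum
print(sum(logs, na.rm=TRUE))   # NaN is dropped before summing
```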

To recap the different ways we can perform calculations on a **data frame**, look at the following:

In [None]:
head(   exp(df_10)   )

In [None]:
head(   apply(df_10, 2, exp)   )

**_Basic plotting._** Plotting of data (to be discussed later) is only one line of code away using **plot()** (https://stat.ethz.ch/R-manual/R-devel/library/graphics/html/plot.default.html) function:

In [None]:
df

In [None]:
#The line below sets the size of the plot
options(repr.plot.width=3, repr.plot.height=3)

plot(df$Integers)

In [None]:
plot(df$Integers, df$Double, type='h')

In [None]:
plot(df$Integers, df$Double, type='l')  # type='l' connects the points with lines

In [None]:
plot(df$Integers, df$Double,
     type='b',
     main="A basic plot",
     xlab="Integer",
     ylab="Double")

**_Data grouping._** You can split the data into subsets and compute summary statistics for each using **aggregate()** (https://stat.ethz.ch/R-manual/R-devel/library/stats/html/aggregate.html) function. But first, you need to create a new variable to group by:

In [None]:
df

In [None]:
df['Continent'] = c('Europe', 'Europe', 'Asia', 'America', 'Asia', 'Other')
df

The code below groups by the **Continent** column and outputs the column sums for each group:

In [None]:
aggregate(   df[c('Integers', 'Double')], by=list(df$Continent), FUN=sum, na.rm=TRUE  )

You can also group by more than one variable. For illustration purposes, let's create a new column called **Order**:

In [None]:
df['Order'] = c(1, 2, 1, 2, 1, 2)
df

Let's group by **Continent** and **Order** and calculate averages across the groups:

In [None]:
aggregate(   df[c('Integers', 'Double')], by=list(df$Continent, df$Order), FUN=mean, na.rm=TRUE  )
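
**aggregate()** also accepts a formula, which keeps the grouping column's own name in the result instead of a generic `Group.1`. The sketch below uses an invented data frame `demo`. Note that the formula interface drops rows containing missing values by default (its `na.action` argument), which is not the same as passing `na.rm=TRUE` to the summary function:

```r
demo = data.frame(Integers=c(6, 7, 8, 9), Double=c(6.5, 7.5, 8.5, 9.5),
                  Continent=c('Europe', 'Europe', 'Asia', 'America'),
                  stringsAsFactors=FALSE)

# Left of ~: the columns to summarize; right of ~: the grouping variable.
sums = aggregate(cbind(Integers, Double) ~ Continent, data=demo, FUN=sum)
print(sums)
```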

**_Combining data frame objects_**. The following examples walk you through different approaches to combine two simple data sets in the form of **data frame** objects. The two simple data sets are:

In [None]:
df_11 = data.frame(One=c(50, 60, 70, 80), row.names=c('a', 'b', 'c', 'd'))
df_11

In [None]:
df_12 = data.frame(Two=c(60, 55, 25), row.names=c('f', 'b', 'd'))
df_12

**Data frames** can be combined using the **merge()** (https://stat.ethz.ch/R-manual/R-devel/library/base/html/merge.html) function. The arguments **by.x** and **by.y** specify the columns used for merging (0 means the row names):

In [None]:
merge(df_11, df_12, by.x=0, by.y=0)

In [None]:
merge(df_11, df_12, by.x=0, by.y=0, all=TRUE)

In [None]:
merge(df_11, df_12, by.x=0, by.y=0, all.x=TRUE)

In [None]:
merge(df_11, df_12, by.x=0, by.y=0, all.y=TRUE)

If you merge the column **One** from _df_11_ with column **Two** from _df_12_, you will get only one value, because only number 60 is common between these two columns:

In [None]:
merge(df_11, df_12, by.x='One', by.y='Two')
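
When merging on row names, the result stores the matched labels in an extra column called `Row.names` and numbers the rows afresh. The following sketch, using stand-in objects `left`, `right` and `joined`, shows one way to turn that column back into row names:

```r
left = data.frame(One=c(50, 60, 70, 80), row.names=c('a', 'b', 'c', 'd'))
right = data.frame(Two=c(60, 55, 25), row.names=c('f', 'b', 'd'))

# by=0 is shorthand for by.x=0, by.y=0; all=TRUE keeps unmatched rows from both sides.
joined = merge(left, right, by=0, all=TRUE)
print(names(joined))

# Move the labels back into the row names and drop the helper column.
row.names(joined) = joined$Row.names
joined$Row.names = NULL
print(joined)
```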

**_Exercises._**

Exercise 1. Given the expected return vector and variance-covariance matrix, generate 5 years of return data for 3 stocks, starting September 2015. Create a **data frame** with proper column and row labels.

In [None]:
ret = c(0.02518719, 0.02427579, 0.02552088)
cov = c(0.00493731, 0.00306012, 0.00263865,
        0.00306012, 0.00655305, 0.00341754,
        0.00263865, 0.00341754, 0.0040089)

cov = matrix(cov, nrow=3)

Exercise 2. Using the **data frame** from the previous exercise, display the main statistics (average return, standard deviation, etc.).