# Data Manipulation

![](banner_data_selection.jpg)

_<p style="text-align: center;"> The "Data-O-Matic"!  It slices!  It dices! </p>_

In [1]:
f = "setup.R"; for (i in 1:10) { if (file.exists(f)) break else f = paste0("../", f) }; source(f)

## Introduction

Motivation, context, history, related topics ...

## Terms

| IT                                  | Statistics     | Other Fields                        | Example                    |
|------------------------------------ | -------------- | ----------------------------------- | -------------------------- |
| database<br> table                  | population     | dataset<br> dataframe               | ![](table.jpg)             |
| record #                            | observation #  | row #<br> row name                  | ![](row_number.jpg)        |
| row<br> record                      | observation    | datapoint                           | ![](row.jpg)               |
| column name                         | variable       | feature<br> attribute<br> dimension | ![](column_name.jpg)       |
| column                              | distribution   | vector                              | ![](column.jpg)            |
| -                                   | distribution   | vector                              | ![](vector-vertical.jpg)   |
| -                                   | distribution   | vector                              | ![](vector-horizontal.jpg) |
| value<br> cell                      | value          | datum<br> vector                    | ![](value.jpg)             |
| slice<br> horizontal slice<br> dice | sample         | subset                              | ![](sample.jpg)            |
| slice<br> vertical slice            | -              | -                                   | ![](slice.jpg)             |

## Data

Consider the following pedagogical dataset called `data`.

In [2]:
data = read.csv("High-Tech Stocks.csv", header=TRUE)
data

Date,Apple.Return,Dell.Return,IBM.Return,Microsoft.Return,Value.weighted.Market.Return,SP.500.Return,Price..Dell,Price..Apple,Price..IBM,Price..Microsoft,Calendar.Date
1990.042,-0.035461,-0.15909090,0.04780877,0.06321839,-0.070115,-0.068817,4.625,34.000,98.625,92.500,19900131
1990.125,0.003235,0.35135135,0.06550063,0.06756756,0.014901,0.008539,6.250,34.000,103.875,98.750,19900228
1990.208,0.183824,0.22000000,0.02166065,0.12151898,0.024140,0.024255,7.625,40.250,106.125,110.750,19900330
1990.292,-0.021739,0.11475410,0.02709069,0.04740406,-0.028286,-0.026887,8.500,39.375,109.000,58.000,19900430
1990.375,0.050413,0.29411766,0.11201835,0.25862068,0.088936,0.091989,11.000,41.250,120.000,73.000,19900531
1990.458,0.084848,0.14772727,-0.02083330,0.04109589,-0.004196,-0.008886,12.625,44.750,117.500,76.000,19900629
1990.542,-0.061453,-0.06930690,-0.05106380,-0.12500000,-0.009405,-0.005223,11.750,42.000,111.500,66.500,19900731
1990.625,-0.116429,0.00000000,-0.07547090,-0.07518800,-0.091896,-0.094314,11.750,37.000,101.875,61.500,19900831
1990.708,-0.216216,-0.25531910,0.04417178,0.02439024,-0.053843,-0.051184,8.750,29.000,106.375,63.000,19900928
1990.792,0.060345,0.21428572,-0.00940070,0.01190476,-0.012504,-0.006698,10.625,30.750,105.375,63.750,19901031


## Index-Based Extraction | Rows

To inspect all or part of the dataset, we reference the `data` table retrieved earlier.  We indicate which rows and columns of the table we want with `[...]` notation - specific row positions and column positions separated by a comma and enclosed within brackets.  We start counting row positions and column positions at 1.  If we want all rows, then we can leave the row positions blank and all rows will be assumed.  Similarly, if we want all columns, then we can leave the column positions blank and all columns will be assumed.

So, to inspect the 1st row of data, we reference `data[1,]`.  The row position (in this case just 1 row) is 1.  The column positions are blank and so assumed to be all columns.  Note, the comma is required even though we left the column positions blank.  The output is presented as a table with 12 columns, with headings, each with 1 value per column corresponding to the 1st observation in the dataset.

In [3]:
data[1,]

Date,Apple.Return,Dell.Return,IBM.Return,Microsoft.Return,Value.weighted.Market.Return,SP.500.Return,Price..Dell,Price..Apple,Price..IBM,Price..Microsoft,Calendar.Date
1990.042,-0.035461,-0.1590909,0.04780877,0.06321839,-0.070115,-0.068817,4.625,34,98.625,92.5,19900131


To inspect the 5th row of data, we reference `data[5,]`.  The row position (in this case just 1 row) is 5.

In [4]:
data[5,]

Unnamed: 0,Date,Apple.Return,Dell.Return,IBM.Return,Microsoft.Return,Value.weighted.Market.Return,SP.500.Return,Price..Dell,Price..Apple,Price..IBM,Price..Microsoft,Calendar.Date
5,1990.375,0.050413,0.2941177,0.1120184,0.2586207,0.088936,0.091989,11,41.25,120,73,19900531


To inspect the first 3 rows of data, we reference `data[1:3,]`.  We indicate a sequence of positions with the `:` notation - the left side is the start position, the right side is the stop position.  So, the row positions indicated by `1:3` are 1, 2, and 3.  The output is presented as a table with 12 columns, with headings, each with 3 values per column corresponding to the first 3 observations in the dataset.

In [5]:
data[1:3,]

Date,Apple.Return,Dell.Return,IBM.Return,Microsoft.Return,Value.weighted.Market.Return,SP.500.Return,Price..Dell,Price..Apple,Price..IBM,Price..Microsoft,Calendar.Date
1990.042,-0.035461,-0.1590909,0.04780877,0.06321839,-0.070115,-0.068817,4.625,34.0,98.625,92.5,19900131
1990.125,0.003235,0.3513514,0.06550063,0.06756756,0.014901,0.008539,6.25,34.0,103.875,98.75,19900228
1990.208,0.183824,0.22,0.02166065,0.12151898,0.02414,0.024255,7.625,40.25,106.125,110.75,19900330


To inspect the 2nd through 5th rows of data, we reference `data[2:5,]`.  The row positions indicated by `2:5` are 2, 3, 4, and 5.

In [6]:
data[2:5,]

Unnamed: 0,Date,Apple.Return,Dell.Return,IBM.Return,Microsoft.Return,Value.weighted.Market.Return,SP.500.Return,Price..Dell,Price..Apple,Price..IBM,Price..Microsoft,Calendar.Date
2,1990.125,0.003235,0.3513514,0.06550063,0.06756756,0.014901,0.008539,6.25,34.0,103.875,98.75,19900228
3,1990.208,0.183824,0.22,0.02166065,0.12151898,0.02414,0.024255,7.625,40.25,106.125,110.75,19900330
4,1990.292,-0.021739,0.1147541,0.02709069,0.04740406,-0.028286,-0.026887,8.5,39.375,109.0,58.0,19900430
5,1990.375,0.050413,0.2941177,0.11201835,0.25862068,0.088936,0.091989,11.0,41.25,120.0,73.0,19900531


To inspect the 1st and 3rd rows of the data, we reference `data[c(1,3),]`.  We indicate a set of individual positions with the `c` function.  So, the row positions described by `c(1,3)` are 1 and 3.  Note, the `:` notation would not be appropriate here, because it does not allow skipped positions.  Note also, it would not be appropriate here to describe the row positions without the `c` function, because the row positions would not otherwise be distinguishable from the column positions.  The output is presented as a table with 12 columns, with headings, each with 2 values per column corresponding to the 1st and 3rd observations in the dataset.

In [7]:
data[c(1,3),]

Unnamed: 0,Date,Apple.Return,Dell.Return,IBM.Return,Microsoft.Return,Value.weighted.Market.Return,SP.500.Return,Price..Dell,Price..Apple,Price..IBM,Price..Microsoft,Calendar.Date
1,1990.042,-0.035461,-0.1590909,0.04780877,0.06321839,-0.070115,-0.068817,4.625,34.0,98.625,92.5,19900131
3,1990.208,0.183824,0.22,0.02166065,0.12151898,0.02414,0.024255,7.625,40.25,106.125,110.75,19900330
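
As a minimal sketch of the row-position notation above, consider a small made-up table called `toy` (a hypothetical name with hypothetical values, not part of the stock dataset).  Each reference selects rows by position and leaves the column positions blank.

```r
# Hypothetical toy table (not the stock data): 5 rows, 3 columns.
toy = data.frame(id=1:5,
                 price=c(10.5, 11.0, 9.75, 12.25, 13.0),
                 ticker=c("A", "B", "C", "D", "E"))

first.row   = toy[1,]        # row 1, all columns
first.three = toy[1:3,]      # rows 1, 2, and 3, all columns
first.third = toy[c(1,3),]   # rows 1 and 3 only, all columns

nrow(first.row)      # 1
nrow(first.three)    # 3
first.third$id       # 1 3
```

In each case the result is still a table with all 3 columns; only the number of rows changes.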


## Index-Based Extraction | Rows & Columns

To inspect the data at the intersection of row 1 and column 2, we reference `data[1,2]`.  Because this references a single value, association with a table is dropped, and the output is presented as a single value.

In [8]:
data[1,2]

To inspect only the first 5 columns of the 1st row of data, we reference `data[1,1:5]`.

In [9]:
data[1,1:5]

Date,Apple.Return,Dell.Return,IBM.Return,Microsoft.Return
1990.042,-0.035461,-0.1590909,0.04780877,0.06321839


To inspect only the first 5 columns of the first 3 rows of data, we reference `data[1:3,1:5]`.

In [10]:
data[1:3,1:5]

Date,Apple.Return,Dell.Return,IBM.Return,Microsoft.Return
1990.042,-0.035461,-0.1590909,0.04780877,0.06321839
1990.125,0.003235,0.3513514,0.06550063,0.06756756
1990.208,0.183824,0.22,0.02166065,0.12151898


To inspect the first 5 columns and the 7th column of the 2nd and 4th rows of data, we reference `data[c(2,4),c(1:5,7)]`.  Note, the `c` function can take a combination of `:`-style sequences and single positions as parameters.

In [11]:
data[c(2,4),c(1:5,7)]

Unnamed: 0,Date,Apple.Return,Dell.Return,IBM.Return,Microsoft.Return,SP.500.Return
2,1990.125,0.003235,0.3513514,0.06550063,0.06756756,0.008539
4,1990.292,-0.021739,0.1147541,0.02709069,0.04740406,-0.026887
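
The following sketch, on a hypothetical table called `toy`, shows when a reference keeps its association with a table and when it does not.  The `drop=FALSE` parameter is not used elsewhere in this section; it is shown here only as one possible way to keep a single cell as a table.

```r
# Hypothetical toy table (not the stock data).
toy = data.frame(id=1:3, price=c(10.5, 11.0, 9.75))

cell.value = toy[1, 2]               # single value, table association dropped
cell.table = toy[1, 2, drop=FALSE]   # same cell, kept as a 1-row, 1-column table
row.part   = toy[1, 1:2]             # 1 row, 2 columns: still a table

is.data.frame(cell.value)   # FALSE
is.data.frame(cell.table)   # TRUE
is.data.frame(row.part)     # TRUE
```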


## Name-Based Extraction | Rows & Columns

As a sometimes convenient alternative to referencing columns by their positions, we can indicate them by their names.  Note, column names are described as strings and so must each be enclosed within `"..."`.

In [12]:
data[1,"Apple.Return"]

In [13]:
data[1,c("Date","Apple.Return","Dell.Return","IBM.Return","Microsoft.Return")]

Date,Apple.Return,Dell.Return,IBM.Return,Microsoft.Return
1990.042,-0.035461,-0.1590909,0.04780877,0.06321839


In [14]:
data[c(1,3),c("Date","Apple.Return","Dell.Return","IBM.Return","Microsoft.Return","SP.500.Return")]

Unnamed: 0,Date,Apple.Return,Dell.Return,IBM.Return,Microsoft.Return,SP.500.Return
1,1990.042,-0.035461,-0.1590909,0.04780877,0.06321839,-0.068817
3,1990.208,0.183824,0.22,0.02166065,0.12151898,0.024255
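
As a small check that names and positions refer to the same columns, the sketch below compares the two notations on a hypothetical table called `toy` (made-up names and values).

```r
# Hypothetical toy table (not the stock data).
toy = data.frame(id=1:3,
                 price=c(10.5, 11.0, 9.75),
                 volume=c(100, 250, 175))

by.name     = toy[c(1,3), c("id","price")]   # columns indicated by name
by.position = toy[c(1,3), c(1,2)]            # same columns indicated by position

names(toy)                        # "id" "price" "volume"
identical(by.name, by.position)   # TRUE
```

Names tend to be more robust than positions when columns are later added or reordered.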


## Name-Based Extraction | Rows & One Column
Indicate the column by its name using \$ notation.

As a sometimes convenient alternative to referencing columns using the `[...]` notation, if we want just one column, then we can use the `$` notation - a table name followed by the `$` symbol, followed by a column name, followed by row positions enclosed within `[...]`.

So, to inspect the `Apple.Return` column of the 1st row of data, we reference `data$Apple.Return[1]`.  Note, in this notation, the column name is described not as a string, so it is not enclosed within `"..."`.  Note also, because this references a single value, association with a table is dropped, and the output is presented as a single value.  (We cannot force the reference to keep an association with a table.)

In [15]:
data$Apple.Return[1]

To inspect the `Apple.Return` column of the first 3 rows of data, we reference `data$Apple.Return[1:3]`.  Note, because this references a vector of values, association with a table is dropped, and the output is presented as a vector.  (We cannot force this reference to keep an association with a table.)

In [16]:
data$Apple.Return[1:3]

Similarly, to inspect the `Apple.Return` column of the 1st and 3rd rows of data, we reference `data$Apple.Return[c(1,3)]`.

In [17]:
data$Apple.Return[c(1,3)]
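
To see that the `$` notation and the `[...]` notation reach the same values, the sketch below applies both to a hypothetical table called `toy` (made-up names and values).

```r
# Hypothetical toy table (not the stock data).
toy = data.frame(id=1:4, price=c(10.5, 11.0, 9.75, 12.25))

with.dollar   = toy$price[c(1,3)]      # column first, then row positions
with.brackets = toy[c(1,3), "price"]   # row positions and column name together

with.dollar                              # 10.50 9.75
identical(with.dollar, with.brackets)    # TRUE
is.data.frame(with.dollar)               # FALSE, it is a plain vector
```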

## Criterion-Based Extraction | Rows

To inspect the part of the dataset that satisfies a specific criterion, we again use the `[...]` notation, but indicate the rows we want by TRUE/FALSE expressions involving the columns.  (This is similar to, but more general than, Excel's filtering functionality.)

To inspect only rows where the `Date` column value is less than 1991 (i.e., observations from earlier than the year 1991), we reference `data[data$Date < 1991,]`.  The row positions are indicated by the expression `data$Date < 1991` - every value in the `Date` column is compared to 1991, and rows selected accordingly.  The column positions are blank and so assumed to be all columns.

In [18]:
data[data$Date < 1991,]

Date,Apple.Return,Dell.Return,IBM.Return,Microsoft.Return,Value.weighted.Market.Return,SP.500.Return,Price..Dell,Price..Apple,Price..IBM,Price..Microsoft,Calendar.Date
1990.042,-0.035461,-0.1590909,0.04780877,0.06321839,-0.070115,-0.068817,4.625,34.0,98.625,92.5,19900131
1990.125,0.003235,0.3513514,0.06550063,0.06756756,0.014901,0.008539,6.25,34.0,103.875,98.75,19900228
1990.208,0.183824,0.22,0.02166065,0.12151898,0.02414,0.024255,7.625,40.25,106.125,110.75,19900330
1990.292,-0.021739,0.1147541,0.02709069,0.04740406,-0.028286,-0.026887,8.5,39.375,109.0,58.0,19900430
1990.375,0.050413,0.2941177,0.11201835,0.25862068,0.088936,0.091989,11.0,41.25,120.0,73.0,19900531
1990.458,0.084848,0.1477273,-0.0208333,0.04109589,-0.004196,-0.008886,12.625,44.75,117.5,76.0,19900629
1990.542,-0.061453,-0.0693069,-0.0510638,-0.125,-0.009405,-0.005223,11.75,42.0,111.5,66.5,19900731
1990.625,-0.116429,0.0,-0.0754709,-0.075188,-0.091896,-0.094314,11.75,37.0,101.875,61.5,19900831
1990.708,-0.216216,-0.2553191,0.04417178,0.02439024,-0.053843,-0.051184,8.75,29.0,106.375,63.0,19900928
1990.792,0.060345,0.2142857,-0.0094007,0.01190476,-0.012504,-0.006698,10.625,30.75,105.375,63.75,19901031


To inspect only rows where the `Date` column is between 1991 (inclusive) and 1992 (exclusive) (i.e., observations from the year 1991), we reference `data[(data$Date >= 1991) & (data$Date < 1992),]`.  The row positions are again indicated by an expression, this time a complex expression using the AND operator (`&`) to get the intersection of 2 sets of row positions.  The column positions are blank and so assumed to be all columns.

In [19]:
data[(data$Date >= 1991) & (data$Date < 1992),]

Unnamed: 0,Date,Apple.Return,Dell.Return,IBM.Return,Microsoft.Return,Value.weighted.Market.Return,SP.500.Return,Price..Dell,Price..Apple,Price..IBM,Price..Microsoft,Calendar.Date
13,1991.042,0.290698,0.22297297,0.12168141,0.3039867,0.049078,0.041518,22.625,55.5,126.75,98.125,19910131
14,1991.125,0.033694,0.1160221,0.02532544,0.05732484,0.075847,0.067281,25.25,57.25,128.75,103.75,19910228
15,1991.208,0.187773,0.12871288,-0.115534,0.02289157,0.028923,0.022203,28.5,68.0,113.875,106.125,19910328
16,1991.292,-0.191176,-0.1798246,-0.0954994,-0.0671378,0.003322,0.000346,23.375,55.0,103.0,99.0,19910430
17,1991.375,-0.143273,0.05882353,0.04208738,0.10858586,0.040732,0.038577,24.75,47.0,106.125,109.75,19910531
18,1991.458,-0.117021,-0.010101,-0.0848057,-0.0689066,-0.044029,-0.047893,24.5,41.5,97.125,68.125,19910628
19,1991.542,0.114458,0.17346939,0.04247104,0.07889909,0.046795,0.044859,28.75,46.25,101.25,73.5,19910731
20,1991.625,0.148541,0.13478261,-0.0312593,0.15986395,0.026819,0.019649,32.625,53.0,96.875,85.25,19910830
21,1991.708,-0.066038,0.02298851,0.06967742,0.04398827,-0.010975,-0.019144,33.375,49.5,103.625,89.0,19910930
22,1991.792,0.040404,-0.2546816,-0.0518697,0.05477528,0.017789,0.01186,24.875,51.5,98.25,93.875,19911031


As an alternative way to inspect only observations from the year 1991, we reference `data[!((data$Date < 1991) | (data$Date >= 1992)),]`.  The row positions are again indicated by an expression, this time a complex expression using the NOT operator (`!`) and the OR operator (`|`).  The column positions are blank and so assumed to be all columns.

In [23]:
data[!((data$Date < 1991) | (data$Date >= 1992)),]

Unnamed: 0,Date,Apple.Return,Dell.Return,IBM.Return,Microsoft.Return,Value.weighted.Market.Return,SP.500.Return,Price..Dell,Price..Apple,Price..IBM,Price..Microsoft,Calendar.Date
13,1991.042,0.290698,0.22297297,0.12168141,0.3039867,0.049078,0.041518,22.625,55.5,126.75,98.125,19910131
14,1991.125,0.033694,0.1160221,0.02532544,0.05732484,0.075847,0.067281,25.25,57.25,128.75,103.75,19910228
15,1991.208,0.187773,0.12871288,-0.115534,0.02289157,0.028923,0.022203,28.5,68.0,113.875,106.125,19910328
16,1991.292,-0.191176,-0.1798246,-0.0954994,-0.0671378,0.003322,0.000346,23.375,55.0,103.0,99.0,19910430
17,1991.375,-0.143273,0.05882353,0.04208738,0.10858586,0.040732,0.038577,24.75,47.0,106.125,109.75,19910531
18,1991.458,-0.117021,-0.010101,-0.0848057,-0.0689066,-0.044029,-0.047893,24.5,41.5,97.125,68.125,19910628
19,1991.542,0.114458,0.17346939,0.04247104,0.07889909,0.046795,0.044859,28.75,46.25,101.25,73.5,19910731
20,1991.625,0.148541,0.13478261,-0.0312593,0.15986395,0.026819,0.019649,32.625,53.0,96.875,85.25,19910830
21,1991.708,-0.066038,0.02298851,0.06967742,0.04398827,-0.010975,-0.019144,33.375,49.5,103.625,89.0,19910930
22,1991.792,0.040404,-0.2546816,-0.0518697,0.05477528,0.017789,0.01186,24.875,51.5,98.25,93.875,19911031


To inspect only rows where the `Date` column is between 1991 (inclusive) and 1992 (exclusive), and only the `Date` and `Apple.Return` columns, we reference `data[(data$Date >= 1991) & (data$Date < 1992), c("Date", "Apple.Return")]`.

In [24]:
data[(data$Date >= 1991) & (data$Date < 1992), c("Date", "Apple.Return")]

Unnamed: 0,Date,Apple.Return
13,1991.042,0.290698
14,1991.125,0.033694
15,1991.208,0.187773
16,1991.292,-0.191176
17,1991.375,-0.143273
18,1991.458,-0.117021
19,1991.542,0.114458
20,1991.625,0.148541
21,1991.708,-0.066038
22,1991.792,0.040404
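
It may help to look at the TRUE/FALSE vector that a criterion produces before it is used to select rows.  The sketch below does this on a hypothetical table called `toy` (made-up names and values), and also checks that the AND form and the NOT/OR form of the criterion agree.

```r
# Hypothetical toy table (not the stock data).
toy = data.frame(year=c(1990.5, 1991.0, 1991.5, 1992.0),
                 ret=c(0.01, -0.02, 0.03, 0.04))

# One TRUE/FALSE value per row.
keep     = (toy$year >= 1991) & (toy$year < 1992)
keep.alt = !((toy$year < 1991) | (toy$year >= 1992))

keep                        # FALSE TRUE TRUE FALSE
identical(keep, keep.alt)   # TRUE
which(keep)                 # 2 3, the row positions where keep is TRUE

selected = toy[keep, ]      # only the rows where keep is TRUE
```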


## More Extraction | First/Last/Random Few Rows

As a sometimes convenient alternative to referencing rows by their positions or using criteria, we can use the `head` and `tail` functions.

To inspect the first 6 rows of data, we use `head(data)`.  The first parameter is the dataset, in this case `data`.  If no other parameters are provided, then the first 6 rows are assumed.

In [25]:
head(data)

Date,Apple.Return,Dell.Return,IBM.Return,Microsoft.Return,Value.weighted.Market.Return,SP.500.Return,Price..Dell,Price..Apple,Price..IBM,Price..Microsoft,Calendar.Date
1990.042,-0.035461,-0.1590909,0.04780877,0.06321839,-0.070115,-0.068817,4.625,34.0,98.625,92.5,19900131
1990.125,0.003235,0.3513514,0.06550063,0.06756756,0.014901,0.008539,6.25,34.0,103.875,98.75,19900228
1990.208,0.183824,0.22,0.02166065,0.12151898,0.02414,0.024255,7.625,40.25,106.125,110.75,19900330
1990.292,-0.021739,0.1147541,0.02709069,0.04740406,-0.028286,-0.026887,8.5,39.375,109.0,58.0,19900430
1990.375,0.050413,0.2941177,0.11201835,0.25862068,0.088936,0.091989,11.0,41.25,120.0,73.0,19900531
1990.458,0.084848,0.1477273,-0.0208333,0.04109589,-0.004196,-0.008886,12.625,44.75,117.5,76.0,19900629


To inspect the first 3 rows of data we use `head(data, 3)`.  The 2nd parameter is the number of rows.

In [26]:
head(data, 3)

Date,Apple.Return,Dell.Return,IBM.Return,Microsoft.Return,Value.weighted.Market.Return,SP.500.Return,Price..Dell,Price..Apple,Price..IBM,Price..Microsoft,Calendar.Date
1990.042,-0.035461,-0.1590909,0.04780877,0.06321839,-0.070115,-0.068817,4.625,34.0,98.625,92.5,19900131
1990.125,0.003235,0.3513514,0.06550063,0.06756756,0.014901,0.008539,6.25,34.0,103.875,98.75,19900228
1990.208,0.183824,0.22,0.02166065,0.12151898,0.02414,0.024255,7.625,40.25,106.125,110.75,19900330


To inspect the first 6 rows of the 2nd **through** 5th columns of data we use `head(data[,2:5])`.

In [28]:
head(data[,2:5])

Apple.Return,Dell.Return,IBM.Return,Microsoft.Return
-0.035461,-0.1590909,0.04780877,0.06321839
0.003235,0.3513514,0.06550063,0.06756756
0.183824,0.22,0.02166065,0.12151898
-0.021739,0.1147541,0.02709069,0.04740406
0.050413,0.2941177,0.11201835,0.25862068
0.084848,0.1477273,-0.0208333,0.04109589


To inspect the first 6 rows of the 2nd **and** 5th columns of data we use `head(data[,c(2,5)])`.

In [29]:
head(data[,c(2,5)])

Apple.Return,Microsoft.Return
-0.035461,0.06321839
0.003235,0.06756756
0.183824,0.12151898
-0.021739,0.04740406
0.050413,0.25862068
0.084848,0.04109589


To inspect the first 6 positions of the `Apple.Return` column of data we use `head(data$Apple.Return)`.

In [30]:
head(data$Apple.Return)

To inspect the last 6 rows of data we use `tail(data)`.

In [31]:
tail(data)

Unnamed: 0,Date,Apple.Return,Dell.Return,IBM.Return,Microsoft.Return,Value.weighted.Market.Return,SP.500.Return,Price..Dell,Price..Apple,Price..IBM,Price..Microsoft,Calendar.Date
256,2011.292,0.004656,0.066161,0.046054,0.020874,0.028683,0.028495,15.47,350.13,170.58,25.92,20110428
257,2011.375,-0.006569,0.039431,-0.005276,-0.028935,-0.014934,-0.013501,16.08,347.83,168.93,25.01,20110531
258,2011.458,-0.03496,0.036692,0.015509,0.039584,-0.018391,-0.018258,16.67,335.67,171.55,26.0,20110630
259,2011.542,0.163285,-0.025795,0.060041,0.053846,-0.022448,-0.021474,16.24,390.48,181.85,27.4,20110731
260,2011.625,-0.014469,-0.084667,-0.050536,-0.023358,-0.05747,-0.056791,14.865,384.83,171.91,26.6,20110831
261,2011.708,-0.009121,-0.048772,0.017218,-0.064286,-0.084872,-0.071762,14.14,381.32,174.87,24.89,20110929


To inspect a random sample of 6 rows of data we use `data[sample(1:nrow(data), 6), ]`.

In [36]:
data[sample(1:nrow(data), 6), ]

Unnamed: 0,Date,Apple.Return,Dell.Return,IBM.Return,Microsoft.Return,Value.weighted.Market.Return,SP.500.Return,Price..Dell,Price..Apple,Price..IBM,Price..Microsoft,Calendar.Date
25,1992.042,0.148559,0.2439024,0.01123596,0.08089887,-0.001182,-0.0199,31.875,64.75,90.0,120.25,19920131
110,1999.125,-0.15478,-0.19875,-0.0724693,-0.1421429,-0.038105,-0.032283,80.125,34.8125,169.75,150.125,19990226
10,1990.792,0.060345,0.2142857,-0.0094007,0.01190476,-0.012504,-0.006698,10.625,30.75,105.375,63.75,19901031
219,2008.208,0.147816,0.001,0.011242,0.043386,-0.01048,-0.00596,19.92,143.5,115.14,28.38,20080331
179,2004.875,0.27958,0.155733,0.052033,0.068645,0.04821,0.038595,40.52,67.05,94.24,26.81,20041130
16,1991.292,-0.191176,-0.1798246,-0.0954994,-0.0671378,0.003322,0.000346,23.375,55.0,103.0,99.0,19910430
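
The sketch below repeats the `head`, `tail`, and `sample` references on a hypothetical table called `toy` (made-up names and values).  The call to `set.seed` is an addition not used above; it is one way to make the random rows repeat from run to run, and the seed value 1 is arbitrary.

```r
# Hypothetical toy table (not the stock data): 20 rows.
toy = data.frame(id=1:20, value=(1:20)^2)

first.few = head(toy, 3)   # rows 1, 2, 3
last.few  = tail(toy, 3)   # rows 18, 19, 20

set.seed(1)                                  # arbitrary seed, for repeatability
random.few = toy[sample(1:nrow(toy), 3), ]   # 3 distinct rows, in random order
```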


## Row-wise Concatenation

In [37]:
data.new = data.frame(Date=c(2012.06,2012.10),
                      Apple.Return=c(0,0),
                      Dell.Return=c(0,0),
                      IBM.Return=c(0,0),
                      Microsoft.Return=c(0,0),
                      Value.weighted.Market.Return=c(0,0),
                      SP.500.Return=c(-0.07,-0.07),
                      Price..Dell=c(14.14,14.14),
                      Price..Apple=c(381.32,381.32),
                      Price..IBM=c(174.87,174.87),
                      Price..Microsoft=c(24.89,24.89),
                      Calendar.Date=c(20120131,20120228))

data.new

Date,Apple.Return,Dell.Return,IBM.Return,Microsoft.Return,Value.weighted.Market.Return,SP.500.Return,Price..Dell,Price..Apple,Price..IBM,Price..Microsoft,Calendar.Date
2012.06,0,0,0,0,0,-0.07,14.14,381.32,174.87,24.89,20120131
2012.1,0,0,0,0,0,-0.07,14.14,381.32,174.87,24.89,20120228


In [38]:
datax = rbind(data, data.new)
datax

Date,Apple.Return,Dell.Return,IBM.Return,Microsoft.Return,Value.weighted.Market.Return,SP.500.Return,Price..Dell,Price..Apple,Price..IBM,Price..Microsoft,Calendar.Date
1990.042,-0.035461,-0.15909090,0.04780877,0.06321839,-0.070115,-0.068817,4.625,34.000,98.625,92.500,19900131
1990.125,0.003235,0.35135135,0.06550063,0.06756756,0.014901,0.008539,6.250,34.000,103.875,98.750,19900228
1990.208,0.183824,0.22000000,0.02166065,0.12151898,0.024140,0.024255,7.625,40.250,106.125,110.750,19900330
1990.292,-0.021739,0.11475410,0.02709069,0.04740406,-0.028286,-0.026887,8.500,39.375,109.000,58.000,19900430
1990.375,0.050413,0.29411766,0.11201835,0.25862068,0.088936,0.091989,11.000,41.250,120.000,73.000,19900531
1990.458,0.084848,0.14772727,-0.02083330,0.04109589,-0.004196,-0.008886,12.625,44.750,117.500,76.000,19900629
1990.542,-0.061453,-0.06930690,-0.05106380,-0.12500000,-0.009405,-0.005223,11.750,42.000,111.500,66.500,19900731
1990.625,-0.116429,0.00000000,-0.07547090,-0.07518800,-0.091896,-0.094314,11.750,37.000,101.875,61.500,19900831
1990.708,-0.216216,-0.25531910,0.04417178,0.02439024,-0.053843,-0.051184,8.750,29.000,106.375,63.000,19900928
1990.792,0.060345,0.21428572,-0.00940070,0.01190476,-0.012504,-0.006698,10.625,30.750,105.375,63.750,19901031


In [39]:
# Use rbind to insert rows
datax = rbind(data[1,], data.new, data[2:nrow(data),])
datax

Unnamed: 0,Date,Apple.Return,Dell.Return,IBM.Return,Microsoft.Return,Value.weighted.Market.Return,SP.500.Return,Price..Dell,Price..Apple,Price..IBM,Price..Microsoft,Calendar.Date
1,1990.042,-0.035461,-0.15909090,0.04780877,0.06321839,-0.070115,-0.068817,4.625,34.000,98.625,92.500,19900131
2,2012.060,0.000000,0.00000000,0.00000000,0.00000000,0.000000,-0.070000,14.140,381.320,174.870,24.890,20120131
3,2012.100,0.000000,0.00000000,0.00000000,0.00000000,0.000000,-0.070000,14.140,381.320,174.870,24.890,20120228
262,1990.125,0.003235,0.35135135,0.06550063,0.06756756,0.014901,0.008539,6.250,34.000,103.875,98.750,19900228
310,1990.208,0.183824,0.22000000,0.02166065,0.12151898,0.024140,0.024255,7.625,40.250,106.125,110.750,19900330
4,1990.292,-0.021739,0.11475410,0.02709069,0.04740406,-0.028286,-0.026887,8.500,39.375,109.000,58.000,19900430
5,1990.375,0.050413,0.29411766,0.11201835,0.25862068,0.088936,0.091989,11.000,41.250,120.000,73.000,19900531
6,1990.458,0.084848,0.14772727,-0.02083330,0.04109589,-0.004196,-0.008886,12.625,44.750,117.500,76.000,19900629
7,1990.542,-0.061453,-0.06930690,-0.05106380,-0.12500000,-0.009405,-0.005223,11.750,42.000,111.500,66.500,19900731
8,1990.625,-0.116429,0.00000000,-0.07547090,-0.07518800,-0.091896,-0.094314,11.750,37.000,101.875,61.500,19900831
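
As a minimal sketch of row-wise concatenation, the example below inserts 2 new rows after the 1st row of a hypothetical table called `toy` (made-up names and values).  With tables, `rbind` matches columns by name, so the new rows need the same column names as the existing table.  Resetting the row names with `rownames(...) = NULL` is an optional extra step, shown here as one way to renumber the rows 1, 2, 3, ... after an insertion.

```r
# Hypothetical toy tables (not the stock data); same column names in both.
toy     = data.frame(id=1:3, price=c(10.5, 11.0, 9.75))
toy.new = data.frame(id=c(98, 99), price=c(0, 0))

# Row 1, then the new rows, then the remaining rows.
inserted = rbind(toy[1,], toy.new, toy[2:nrow(toy),])

rownames(inserted) = NULL   # optional: renumber the rows 1..5
inserted$id                 # 1 98 99 2 3
```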


## Column-wise Concatenation

In [40]:
data.ibm = read.csv("IBM.csv", header=TRUE) 
data.microsoft = read.csv("Microsoft.csv", header=TRUE) 

In [41]:
data.ibm

Date,IBM.Return,Price..IBM,Calendar.Date
1990.042,0.04780877,98.625,19900131
1990.125,0.06550063,103.875,19900228
1990.208,0.02166065,106.125,19900330
1990.292,0.02709069,109.000,19900430
1990.375,0.11201835,120.000,19900531
1990.458,-0.02083330,117.500,19900629
1990.542,-0.05106380,111.500,19900731
1990.625,-0.07547090,101.875,19900831
1990.708,0.04417178,106.375,19900928
1990.792,-0.00940070,105.375,19901031


In [42]:
data.microsoft

Date,Microsoft.Return,Price..Microsoft,Calendar.Date
1990.042,0.06321839,92.500,19900131
1990.125,0.06756756,98.750,19900228
1990.208,0.12151898,110.750,19900330
1990.292,0.04740406,58.000,19900430
1990.375,0.25862068,73.000,19900531
1990.458,0.04109589,76.000,19900629
1990.542,-0.12500000,66.500,19900731
1990.625,-0.07518800,61.500,19900831
1990.708,0.02439024,63.000,19900928
1990.792,0.01190476,63.750,19901031


In [43]:
datax = cbind(data.ibm[, 1:3], data.microsoft[, 2:4])
datax

Date,IBM.Return,Price..IBM,Microsoft.Return,Price..Microsoft,Calendar.Date
1990.042,0.04780877,98.625,0.06321839,92.500,19900131
1990.125,0.06550063,103.875,0.06756756,98.750,19900228
1990.208,0.02166065,106.125,0.12151898,110.750,19900330
1990.292,0.02709069,109.000,0.04740406,58.000,19900430
1990.375,0.11201835,120.000,0.25862068,73.000,19900531
1990.458,-0.02083330,117.500,0.04109589,76.000,19900629
1990.542,-0.05106380,111.500,-0.12500000,66.500,19900731
1990.625,-0.07547090,101.875,-0.07518800,61.500,19900831
1990.708,0.04417178,106.375,0.02439024,63.000,19900928
1990.792,-0.00940070,105.375,0.01190476,63.750,19901031
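
Column-wise concatenation with `cbind` pairs up rows purely by position, so it is only meaningful when both tables have the same number of rows in the same order.  The sketch below, on 2 hypothetical tables called `left` and `right`, checks that assumption before combining them.

```r
# Hypothetical toy tables: same dates, in the same order.
left  = data.frame(date=c(1990.042, 1990.125, 1990.208),
                   ibm=c(98.625, 103.875, 106.125))
right = data.frame(date=c(1990.042, 1990.125, 1990.208),
                   msft=c(92.5, 98.75, 110.75))

# Check the rows line up before concatenating.
stopifnot(nrow(left) == nrow(right), all(left$date == right$date))

# Keep the date column once; drop=FALSE keeps the single column as a table.
both = cbind(left, right[, "msft", drop=FALSE])
names(both)   # "date" "ibm" "msft"
```

When the rows do not line up, a join on a shared column is usually the safer choice.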


## Joins

In [44]:
data.apple = read.csv("Apple 2000-2005.csv", header=TRUE) 
data.dell  = read.csv("Dell 2001-2007.csv",  header=TRUE) 

In [45]:
data.apple

Date,Apple.Return,Price..Apple,Calendar.Date
2000.042,0.009119,103.7500,20000131
2000.125,0.104819,114.6250,20000229
2000.208,0.184842,135.8125,20000331
2000.292,-0.086516,124.0625,20000428
2000.375,-0.322922,84.0000,20000531
2000.458,0.247024,52.3750,20000630
2000.542,-0.029833,50.8125,20000731
2000.625,0.199262,60.9375,20000831
2000.708,-0.577436,25.7500,20000929
2000.792,-0.240291,19.5625,20001031


In [46]:
data.dell

Date,Dell.Return,Price..Dell,Calendar.Date
2001.042,0.498208,26.1250,20010131
2001.125,-0.162679,21.8750,20010228
2001.208,0.174286,25.6875,20010330
2001.292,0.021509,26.2400,20010430
2001.375,-0.071646,24.3600,20010531
2001.458,0.073481,26.1500,20010629
2001.542,0.029828,26.9300,20010731
2001.625,-0.206090,21.3800,20010831
2001.708,-0.133302,18.5300,20010928
2001.792,0.294118,23.9800,20011031


In [47]:
datax = merge(data.apple, data.dell, by=c("Date","Calendar.Date"))
datax

Date,Calendar.Date,Apple.Return,Price..Apple,Dell.Return,Price..Dell
2001.042,20010131,0.453782,21.625,0.498208,26.125
2001.125,20010228,-0.156069,18.25,-0.162679,21.875
2001.208,20010330,0.209315,22.07,0.174286,25.6875
2001.292,20010430,0.154961,25.49,0.021509,26.24
2001.375,20010531,-0.21734,19.95,-0.071646,24.36
2001.458,20010629,0.165413,23.25,0.073481,26.15
2001.542,20010731,-0.191828,18.79,0.029828,26.93
2001.625,20010831,-0.012773,18.55,-0.20609,21.38
2001.708,20010928,-0.163881,15.51,-0.133302,18.53
2001.792,20011031,0.132173,17.56,0.294118,23.98
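
By default, `merge` keeps only the rows whose `by` values appear in both tables, which is why the result above contains only dates covered by both files.  The sketch below shows this on 2 hypothetical tables called `a` and `b`, and contrasts it with `all=TRUE`, which keeps unmatched rows from both tables and fills the gaps with `NA`.

```r
# Hypothetical toy tables sharing a column called key.
a = data.frame(key=c(1, 2, 3), x=c("a1", "a2", "a3"))
b = data.frame(key=c(2, 3, 4), y=c("b2", "b3", "b4"))

inner = merge(a, b, by="key")             # keys present in both: 2 and 3
outer = merge(a, b, by="key", all=TRUE)   # all keys: 1, 2, 3, 4

inner$key   # 2 3
outer$key   # 1 2 3 4
outer$y     # NA "b2" "b3" "b4"
```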


## Formatting

In [52]:
fmt(data[13,"Apple.Return"]) # output single value as a table, default title

"data[13, ""Apple.Return""]"
0.290698


In [53]:
fmt(data[13,"Apple.Return"], "apple return") # output single value as a table, title

apple return
0.290698


In [54]:
fmt(data[13:18,1:5]) # default title, no row numbers

Date,Apple.Return,Dell.Return,IBM.Return,Microsoft.Return
1991.042,0.290698,0.222973,0.1216814,0.3039867
1991.125,0.033694,0.1160221,0.0253254,0.0573248
1991.208,0.187773,0.1287129,-0.115534,0.0228916
1991.292,-0.191176,-0.1798246,-0.0954994,-0.0671378
1991.375,-0.143273,0.0588235,0.0420874,0.1085859
1991.458,-0.117021,-0.010101,-0.0848057,-0.0689066


In [55]:
fmt(data[13:18,1:5], row.names=TRUE) # default title, row numbers

Unnamed: 0,Date,Apple.Return,Dell.Return,IBM.Return,Microsoft.Return
13,1991.042,0.290698,0.222973,0.1216814,0.3039867
14,1991.125,0.033694,0.1160221,0.0253254,0.0573248
15,1991.208,0.187773,0.1287129,-0.115534,0.0228916
16,1991.292,-0.191176,-0.1798246,-0.0954994,-0.0671378
17,1991.375,-0.143273,0.0588235,0.0420874,0.1085859
18,1991.458,-0.117021,-0.010101,-0.0848057,-0.0689066


In [56]:
fmt(data[13:18,1:5], "1991 | 1st Half") # title, no row numbers

Date,Apple.Return,Dell.Return,IBM.Return,Microsoft.Return
1991.042,0.290698,0.222973,0.1216814,0.3039867
1991.125,0.033694,0.1160221,0.0253254,0.0573248
1991.208,0.187773,0.1287129,-0.115534,0.0228916
1991.292,-0.191176,-0.1798246,-0.0954994,-0.0671378
1991.375,-0.143273,0.0588235,0.0420874,0.1085859
1991.458,-0.117021,-0.010101,-0.0848057,-0.0689066


In [57]:
fmt(data[13:19,], "1991 | 1st Half:", position="left") # title, no row numbers, position left

Date,Apple.Return,Dell.Return,IBM.Return,Microsoft.Return,Value.weighted.Market.Return,SP.500.Return,Price..Dell,Price..Apple,Price..IBM,Price..Microsoft,Calendar.Date
1991.042,0.290698,0.222973,0.1216814,0.3039867,0.049078,0.041518,22.625,55.5,126.75,98.125,19910131
1991.125,0.033694,0.1160221,0.0253254,0.0573248,0.075847,0.067281,25.25,57.25,128.75,103.75,19910228
1991.208,0.187773,0.1287129,-0.115534,0.0228916,0.028923,0.022203,28.5,68.0,113.875,106.125,19910328
1991.292,-0.191176,-0.1798246,-0.0954994,-0.0671378,0.003322,0.000346,23.375,55.0,103.0,99.0,19910430
1991.375,-0.143273,0.0588235,0.0420874,0.1085859,0.040732,0.038577,24.75,47.0,106.125,109.75,19910531
1991.458,-0.117021,-0.010101,-0.0848057,-0.0689066,-0.044029,-0.047893,24.5,41.5,97.125,68.125,19910628
1991.542,0.114458,0.1734694,0.042471,0.0788991,0.046795,0.044859,28.75,46.25,101.25,73.5,19910731


In [58]:
layout(fmt(data[13:18,1:5], "1991 | 1st Half"), fmt(data[19:24,1:5], "1991 | 2nd Half"))

Date,Apple.Return,Dell.Return,IBM.Return,Microsoft.Return
1991.042,0.290698,0.222973,0.1216814,0.3039867
1991.125,0.033694,0.1160221,0.0253254,0.0573248
1991.208,0.187773,0.1287129,-0.115534,0.0228916
1991.292,-0.191176,-0.1798246,-0.0954994,-0.0671378
1991.375,-0.143273,0.0588235,0.0420874,0.1085859
1991.458,-0.117021,-0.010101,-0.0848057,-0.0689066

Date,Apple.Return,Dell.Return,IBM.Return,Microsoft.Return
1991.542,0.114458,0.1734694,0.042471,0.0788991
1991.625,0.148541,0.1347826,-0.0312593,0.159864
1991.708,-0.066038,0.0229885,0.0696774,0.0439883
1991.792,0.040404,-0.2546816,-0.0518697,0.0547753
1991.875,-0.012233,-0.0552764,-0.0462087,0.0359521
1991.958,0.110837,0.0904255,-0.0378378,0.1439589


## More About R

About strings:<br>
A string is a sequence of characters meant to be interpreted as such, and not to be interpreted as a name.  A string is described by enclosing a sequence of characters within `"..."`.

About comments:<br>
A comment is text ignored by the R system, but perhaps useful to us.  A comment is described by the `#` symbol, followed by the comment text, through the end of the line.

About functions:<br>
A function is described by a function name, followed by parentheses that enclose the function's parameter values, separated by commas.  The parameters are distinguished either by their position or by explicitly naming them.  Named parameters are described by a parameter name, followed by the `=` symbol, followed by the parameter value.  An example function looks like this:<br>
`melt(data, id=x)`
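
As a small illustration of positional and named parameters, the sketch below calls `head` in several equivalent ways on a hypothetical table called `toy`.  The parameter names `x` and `n` are those of `head` itself.

```r
# Hypothetical toy table (not the stock data).
toy = data.frame(id=1:10)

by.position = head(toy, 3)       # parameters distinguished by position
by.name     = head(x=toy, n=3)   # parameters distinguished by name
mixed       = head(toy, n=3)     # first by position, second by name
reordered   = head(n=3, x=toy)   # named parameters can appear in any order

identical(by.position, by.name)     # TRUE
identical(by.position, reordered)   # TRUE
```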

About TRUE/FALSE values:<br>
The R system recognizes the values `TRUE` and `FALSE`, and their abbreviations `T` and `F`.

About function names; parameter names; and table, vector, and value names:<br>
A name can include upper case and lower case letters - the upper case version of a letter is considered different from the lower case version of that letter.  A name can include numbers, but cannot start with a number.  A name can include the `.` and `_` characters, but cannot start with `_`.
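
A brief sketch of these naming rules, using hypothetical names chosen only for illustration:

```r
# Upper case and lower case letters make different names.
price = 10
Price = 20
price == Price        # FALSE, these are 2 separate values

# Numbers, "." and "_" are allowed after the first character.
price.usd_2020 = 30
price.usd_2020        # 30
```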

About output:<br>
A table, vector, or value referenced on a line by itself will output the value assigned to it.

## Further Reading

* https://www.statmethods.net/management/subset.html

<p style="text-align:left; font-size:10px;">
Copyright (c) Berkeley Data Analytics Group, LLC
<span style="float:right;">
Document revised July 17, 2020
</span>
</p>