## Data Manipulation 

We see how to read data and perform elementary data manipulation, and then go on to do some data analysis.

### Introduction to DataFrames

In [1]:

using DataArrays
using DataFrames

### Missing values

A missing value is represented by `NA` in Julia.
* `NA` is not part of Base; it is provided by the DataArrays package.
* `NA` poisons other values: an arithmetic operation that involves `NA` returns `NA`.

In [2]:
1+NA

NA

In [3]:
# Check if the evaluation of an expression results in NA
isna(1+NA)

true

In [5]:
# Note the difference between NaN and NA: NaN is a Float64 value, NA is not
(isa(NaN, Float64), isa(NA, Float64))

(true,false)

### DataArrays

* `DataArray`s are used for representing arrays that contain missing values.
* `DataArray{T}` allows storing `T` or `NA`.
* In other words, `DataArray{T}` adds `NA`s to `Array{T}`.
* `PooledDataArray{T}` is used for efficiently storing data in which the same values repeat many times.
* `PooledDataArray{T}` compresses `DataArray{T}`: each distinct value is kept once in a pool, and the elements refer to entries of that pool.
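
As a minimal sketch of the last two points, the cell below builds a pooled array from a column with only two distinct values. It assumes the `@pdata` macro and the `levels` function exported by the same legacy DataArrays version used in this notebook; the `colors` data is made up for illustration.

```julia
using DataArrays

# Hypothetical column in which a few distinct values repeat many times
colors = @pdata(["red", "green", "red", "red", "green"])

# The distinct values held by the pooled array
pool = levels(colors)

# Elements are read back like those of an ordinary DataArray
third = colors[3]
```

Pooling pays off when the number of distinct values is small compared to the length of the column, as is typical for categorical data.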

#### Constructing DataArrays

In [6]:
# Call the DataArray() constructor by passing a Vector to it
DataArray([0.1, 0.5, -2.4])

3-element DataArrays.DataArray{Float64,1}:
  0.1
  0.5
 -2.4

In [7]:
# Construct a DataArray by calling the @data() macro with a Vector input argument
@data([0.1, 0.5, -2.4])

3-element DataArrays.DataArray{Float64,1}:
  0.1
  0.5
 -2.4

In [8]:
# Convert Vector to DataArray
convert(DataArray, [0.1, 0.5, -2.4])

3-element DataArrays.DataArray{Float64,1}:
  0.1
  0.5
 -2.4

In [9]:
# Calling DataArray() on an array literal that contains NA fails, because NA cannot be converted to Float64
DataArray([0.1, NA, -2.4])

LoadError: MethodError: Cannot `convert` an object of type DataArrays.NAtype to an object of type Float64
This may have arisen from a call to the constructor Float64(...),
since type constructors fall back to convert methods.

In [10]:
# However, it is possible to pass NA to the @data() macro
@data([0.1, NA, -2.4])

3-element DataArrays.DataArray{Float64,1}:
  0.1
   NA
 -2.4

In [11]:
# The @data() macro can also be called with a Matrix input argument
@data([0.4 1.2; 3.5 7.2])

2×2 DataArrays.DataArray{Float64,2}:
 0.4  1.2
 3.5  7.2

In [12]:
# Convert a Matrix to DataArray
convert(DataArray, [0.4 1.2; 3.5 7.2])

2×2 DataArrays.DataArray{Float64,2}:
 0.4  1.2
 3.5  7.2

In [13]:
# Numerical computing can be done with data vectors
x = @data([0.1, NA, -2.4])
y = @data([-9.9, 0.5, 6.7])
x+y

3-element DataArrays.DataArray{Float64,1}:
 -9.8
   NA
  4.3

In [15]:
# To remove missing values (NA), call dropna()
x = @data([0.1, NA, -2.4])
dropna(x)

2-element Array{Float64,1}:
  0.1
 -2.4
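
A common use of `dropna()` is to compute a statistic over the observed values only. The sketch below assumes the same legacy DataArrays version as the rest of this notebook, in which `mean` is available from Base and returns `NA` as soon as one element is missing.

```julia
using DataArrays

x = @data([0.1, NA, -2.4])

# NA poisons the reduction, so the mean of the full vector is NA
m_all = mean(x)

# Dropping the missing entries first gives the mean of the observed values
m_obs = mean(dropna(x))
```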

In [16]:
# Numerical computing can be done with data matrices and data vectors
A = @data([0.4 1.2 4.4; NA 7.2 3.9; 5.1 1.8 4.5])
y = @data([-9.9, 0.5, 6.7])
A*y

3-element DataArrays.DataArray{Float64,1}:
  26.12
    NA 
 -19.44

### DataFrames

* `DataFrame`s are used for representing data tables.
* A `DataFrame` is a list of `DataArray`s.
* So every `DataArray` of a `DataFrame` represents a column of the corresponding data table.
* `DataFrame`s accommodate heterogeneous data that might contain missing values.
* Every column (`DataArray`) of a `DataFrame` has its own type.
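
The points above can be sketched in a few lines. The cell assumes the `eltypes` function of the legacy DataFrames version used in this notebook; the two-row table is illustrative only.

```julia
using DataArrays
using DataFrames

# Two columns with different element types
df = DataFrame(player = ["Larry Bird", "Magic Johnson"], height = [2.06, 2.06])

# Element type of each column
coltypes = eltypes(df)

# Columns are DataArrays, so a single cell can be set to NA
df[2, :height] = NA
missing_cell = isna(df[2, :height])
```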

In [17]:
# Call the DataFrame() constructor with keyword arguments (columns) of type Vector
DataFrame(
  player = ["Larry Bird", "Magic Johnson", "Michael Jordan", "Scottie Pippen"],
  champions = [3, 5, 6, 6]
)

| Row | player | champions |
|-----|--------|-----------|
| 1 | Larry Bird | 3 |
| 2 | Magic Johnson | 5 |
| 3 | Michael Jordan | 6 |
| 4 | Scottie Pippen | 6 |


In [18]:
# Start with an empty DataFrame and populate it
ChampionsFrame = DataFrame()
ChampionsFrame[:player] = ["Larry Bird", "Magic Johnson", "Michael Jordan", "Scottie Pippen"]
ChampionsFrame[:champions] = [3, 5, 6, 6]
ChampionsFrame

| Row | player | champions |
|-----|--------|-----------|
| 1 | Larry Bird | 3 |
| 2 | Magic Johnson | 5 |
| 3 | Michael Jordan | 6 |
| 4 | Scottie Pippen | 6 |


In [19]:

# Provide CSV-like tabular data to construct a new DataFrame
csv"""
  player,champions
  Larry Bird,3
  Magic Johnson,5
  Michael Jordan,6
  Scottie Pippen,6
"""

| Row | player | champions |
|-----|--------|-----------|
| 1 | Larry Bird | 3 |
| 2 | Magic Johnson | 5 |
| 3 | Michael Jordan | 6 |
| 4 | Scottie Pippen | 6 |


In [20]:
# Call the DataFrame() constructor with keyword arguments (columns) of type DataArray
player = @data(["Larry Bird", "Magic Johnson", "Michael Jordan", "Scottie Pippen"])
champions = @data([3, 5, 6, 6])
ChampionsFrame = DataFrame(player=player, champions=champions)

| Row | player | champions |
|-----|--------|-----------|
| 1 | Larry Bird | 3 |
| 2 | Magic Johnson | 5 |
| 3 | Michael Jordan | 6 |
| 4 | Scottie Pippen | 6 |


In [21]:
# Construct a DataFrame by joining two existing DataFrames
height = [2.06, 2.06, 1.98, 2.03]
HeightsFrame = DataFrame(player=player, height=height)
join(ChampionsFrame, HeightsFrame, on = :player)

| Row | player | champions | height |
|-----|--------|-----------|--------|
| 1 | Larry Bird | 3 | 2.06 |
| 2 | Magic Johnson | 5 | 2.06 |
| 3 | Michael Jordan | 6 | 1.98 |
| 4 | Scottie Pippen | 6 | 2.03 |
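
When the two tables do not list the same players, the kind of join matters. The sketch below assumes the `kind` keyword of `join` in the legacy DataFrames version used here; the frames `titles` and `heights` are hypothetical and deliberately mismatched.

```julia
using DataArrays
using DataFrames

titles = DataFrame(player = ["Larry Bird", "Magic Johnson", "Michael Jordan"],
                   champions = [3, 5, 6])
heights = DataFrame(player = ["Larry Bird", "Magic Johnson"],
                    height = [2.06, 2.06])

# Inner join (the default): keeps only players present in both tables
inner = join(titles, heights, on = :player)

# Left join: keeps every row of the first table and fills absent heights with NA
left = join(titles, heights, on = :player, kind = :left)
```

Here the left join keeps the row for Michael Jordan, with `NA` as his height, while the inner join drops it.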


#### Querying basic information about DataFrames

In [22]:
# Get number of rows of a DataFrame
size(ChampionsFrame, 1)

4

In [23]:
# Get number of columns of a DataFrame
size(ChampionsFrame, 2)

2
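
The same information is available through a few convenience functions. This sketch assumes `nrow`, `ncol` and `names` as provided by the legacy DataFrames version used in this notebook, and rebuilds the table so that the cell is self-contained.

```julia
using DataFrames

df = DataFrame(
  player = ["Larry Bird", "Magic Johnson", "Michael Jordan", "Scottie Pippen"],
  champions = [3, 5, 6, 6]
)

# Number of rows and number of columns
rows = nrow(df)
cols = ncol(df)

# Both dimensions at once, as a tuple
dims = size(df)

# Column names, as a vector of symbols
colnames = names(df)
```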

In [24]:
# Get a per-column summary of a DataFrame
describe(ChampionsFrame)

player
Length  4
Type    String
NAs     0
NA%     0.0%
Unique  4

champions
Min      3.0
1st Qu.  4.5
Median   5.5
Mean     5.0
3rd Qu.  6.0
Max      6.0
NAs      0
NA%      0.0%



#### Indexing `DataFrames`

In [26]:
# Index DataFrame by column name to get a specific column
ChampionsFrame[:player]

4-element DataArrays.DataArray{String,1}:
 "Larry Bird"    
 "Magic Johnson" 
 "Michael Jordan"
 "Scottie Pippen"

In [27]:
# Index DataFrame by row numbers to get specific rows
ChampionsFrame[2:3, :]

| Row | player | champions |
|-----|--------|-----------|
| 1 | Magic Johnson | 5 |
| 2 | Michael Jordan | 6 |
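
Rows can also be selected by a condition on a column, and a single cell by row number and column name. The sketch below assumes the indexing behaviour of the legacy DataFrames version used in this notebook and rebuilds the table so that the cell is self-contained.

```julia
using DataArrays
using DataFrames

df = DataFrame(
  player = ["Larry Bird", "Magic Johnson", "Michael Jordan", "Scottie Pippen"],
  champions = [3, 5, 6, 6]
)

# Boolean mask: keep the rows whose champions value is greater than 5
best = df[df[:champions] .> 5, :]

# Single cell: row 2 of the champions column
magic_titles = df[2, :champions]
```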
