# Basics

---

This is a brief introduction to **[DataFrames.jl](https://dataframes.juliadata.org/stable/)**, a data analysis library for Julia; more will follow in the second lecture. In this quick intro to `DataFrames.jl`, we cover one of the most important aspects of data analysis, **how to read and write different data formats**, along with checking data shape and size.

> **Note: Write functionality is not yet implemented for every file format in the `Queryverse` library!**



### Lecture outline

---

* Read Data


* Write Data


* Data Size and Shape


* Summary Statistics


* Unique Observations


* Value Counts

In [1]:
using Statistics
using StatsBase
using Queryverse
using DataFrames
using FreqTables
using Pipe: @pipe # To chain the functions

In [2]:
# Julia version

VERSION

v"1.5.3"

In [3]:
# Set number of columns to be shown
ENV["COLUMNS"] = 1000

# Set number of rows to be shown
ENV["LINES"] = 100

100

## Read Data

---

Reading a data file is the first operation you will perform in any data analysis, so you need to be able to read files in different formats. Let's start with the simplest and most common one, the CSV file.

In [4]:
csv_file = DataFrame(load("data/admission.csv"))

first(csv_file, 5)

Row,Serial No.,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
,Int64,Int64,Int64,Int64,Float64,Float64,Float64,Int64,Float64
1,1,337,118,4,4.5,4.5,9.65,1,0.92
2,2,324,107,4,4.0,4.5,8.87,1,0.76
3,3,316,104,3,3.0,3.5,8.0,1,0.72
4,4,322,110,3,3.5,2.5,8.67,1,0.8
5,5,314,103,2,2.0,3.0,8.21,0,0.65


In [5]:
excel_file = DataFrame(load("data/titanic.xlsx", "train"))

first(excel_file, 5)

Row,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
,Float64,Float64,Float64,String,String,Float64?,Float64,Float64,Any,Float64,String?,String?
1,1.0,0.0,3.0,"Braund, Mr. Owen Harris",male,22.0,1.0,0.0,A/5 21171,7.25,missing,S
2,2.0,1.0,1.0,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38.0,1.0,0.0,PC 17599,71.2833,C85,C
3,3.0,1.0,3.0,"Heikkinen, Miss. Laina",female,26.0,0.0,0.0,STON/O2. 3101282,7.925,missing,S
4,4.0,1.0,1.0,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1.0,0.0,113803.0,53.1,C123,S
5,5.0,0.0,3.0,"Allen, Mr. William Henry",male,35.0,0.0,0.0,373450.0,8.05,missing,S


In [6]:
stata_file = DataFrame(load("data/airline.dta"))

first(stata_file, 5)

Row,year,y,w,r,l,k
,Int16?,Float32?,Float32?,Float32?,Float32?,Float32?
1,1948,1.214,0.243,0.1454,1.415,0.612
2,1949,1.354,0.26,0.2181,1.384,0.559
3,1950,1.569,0.278,0.3157,1.388,0.573
4,1951,1.948,0.297,0.394,1.55,0.564
5,1952,2.265,0.31,0.3559,1.802,0.574


In [7]:
sas_file = DataFrame(load("data/alcohol.sas7bdat"))

first(sas_file, 5)

Row,ADULTS,KIDS,INCOME,CONSUME
,Float64?,Float64?,Float64?,Float64?
1,2.0,2.0,758.0,1.0
2,2.0,3.0,1785.0,1.0
3,3.0,0.0,1200.0,1.0
4,1.0,0.0,545.0,1.0
5,4.0,1.0,547.0,1.0


In [8]:
spss_file = DataFrame(load("data/sleep.sav"))

first(spss_file, 5)

Unnamed: 0_level_0,id,sex,age,marital,edlevel,weight,height,healthrate,fitrate,weightrate,smoke,smokenum,alchohol,caffeine,hourwnit,hourwend,hourneed,trubslep,trubstay,wakenite,niteshft,liteslp,refreshd,satsleep,qualslp,stressmo,medhelp,problem,impact1,impact2,impact3,impact4,impact5,impact6,impact7,stopb,restlss,drvsleep,drvresul,ess,anxiety,depress,fatigue,lethargy,tired,sleepy,energy,stayslprec,getsleprec,qualsleeprec,totsas,cigsgp3,agegp3,probsleeprec,drvslprec
Unnamed: 0_level_1,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?,Float64?
1,83.0,0.0,42.0,2.0,2.0,52.0,162.0,10.0,7.0,5.0,1.0,15.0,3.0,5.0,9.0,9.0,9.0,missing,missing,1.0,2.0,2.0,1.0,2.0,6.0,2.0,2.0,2.0,missing,missing,missing,missing,missing,missing,missing,2.0,2.0,2.0,2.0,8.0,2.0,1.0,2.0,2.0,2.0,2.0,2.0,missing,missing,4.0,10.0,2.0,2.0,0.0,0.0
2,294.0,0.0,54.0,2.0,5.0,65.0,174.0,8.0,7.0,5.0,2.0,missing,0.0,10.0,6.5,6.5,7.0,1.0,1.0,1.0,2.0,1.0,1.0,5.0,4.0,5.0,2.0,2.0,missing,missing,missing,missing,missing,missing,missing,2.0,1.0,2.0,2.0,17.0,6.0,2.0,2.0,3.0,5.0,5.0,5.0,1.0,1.0,3.0,20.0,missing,3.0,0.0,0.0
3,425.0,1.0,missing,2.0,2.0,89.0,170.0,6.0,5.0,7.0,2.0,missing,12.0,4.0,6.0,6.0,8.0,1.0,1.0,1.0,2.0,1.0,2.0,3.0,2.0,6.0,1.0,2.0,missing,missing,missing,missing,missing,missing,missing,2.0,1.0,1.0,1.0,13.0,9.0,10.0,7.0,7.0,6.0,6.0,5.0,1.0,1.0,1.0,31.0,missing,missing,0.0,1.0
4,64.0,0.0,41.0,2.0,5.0,66.0,178.0,9.0,7.0,5.0,1.0,5.0,2.0,3.0,7.0,8.0,8.0,1.0,2.0,2.0,2.0,2.0,2.0,4.0,4.0,8.0,2.0,2.0,missing,missing,missing,missing,missing,missing,missing,2.0,2.0,2.0,2.0,12.0,8.0,3.0,7.0,7.0,6.0,6.0,8.0,0.0,1.0,3.0,34.0,1.0,2.0,0.0,0.0
5,536.0,0.0,39.0,2.0,5.0,62.0,160.0,9.0,5.0,7.0,2.0,missing,1.0,6.0,7.0,7.0,7.5,1.0,1.0,1.0,2.0,2.0,1.0,2.0,4.0,6.0,2.0,2.0,missing,missing,missing,missing,missing,missing,missing,2.0,2.0,2.0,2.0,12.0,4.0,0.0,5.0,3.0,5.0,6.0,6.0,1.0,1.0,3.0,25.0,missing,2.0,0.0,0.0


## Write Data

---

Writing data to a file follows almost the same procedure as reading. When writing, we have to specify the path of the new file, including its name. Let's write a CSV file and an Excel file.

**Writing other formats is not available right now.**

In [None]:
save("data/new_csv_file.csv", csv_file)

In [None]:
save("data/new_excel_file.xlsx", excel_file)
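One quick way to confirm that a file was written correctly is to read it back and compare it with the original. The sketch below does this for a small made-up table; the table `toy`, its column names, and the temporary path are illustrative assumptions and not part of the lecture data.

```julia
using Queryverse
using DataFrames

# Hypothetical toy table, used only for illustration
toy = DataFrame(id = [1, 2, 3], score = [0.5, 0.7, 0.9])

# Write into a fresh temporary directory so no existing file is overwritten
path = joinpath(mktempdir(), "toy.csv")
save(path, toy)

# Read the file back and check that the shape survived the round trip
toy_back = DataFrame(load(path))
same_shape = size(toy_back) == size(toy)
```

If `same_shape` is `false`, the usual suspects are a header row read as data or a delimiter mismatch.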

## Data Size and Shape

---

We will talk about data size and shape in more detail in the second lecture. Here, we quickly cover what they are and how to get that information.

In [9]:
size(csv_file) # Returns the number of rows and columns, respectively

(400, 9)
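Because `size` returns a tuple, the two numbers can be unpacked directly, and `DataFrames.jl` also provides `nrow` and `ncol` for asking about one dimension at a time. A minimal sketch on a made-up table (the table `toy` and its columns are assumptions for illustration):

```julia
using DataFrames

# Hypothetical toy table with 3 rows and 2 columns
toy = DataFrame(a = [1, 2, 3], b = ["x", "y", "z"])

# Unpack the (rows, columns) tuple
n_rows, n_cols = size(toy)

# Ask for a single dimension
rows_only = nrow(toy)      # same as size(toy, 1)
cols_only = ncol(toy)      # same as size(toy, 2)
```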

## Summary Statistics

---

Summary statistics give you a quick overview of the data at hand.

In [10]:
describe(csv_file)

Row,variable,mean,min,median,max,nmissing,eltype
,Symbol,Float64,Real,Float64,Real,Int64,DataType
1,Serial No.,200.5,1.0,200.5,400.0,0,Int64
2,GRE Score,316.808,290.0,317.0,340.0,0,Int64
3,TOEFL Score,107.41,92.0,107.0,120.0,0,Int64
4,University Rating,3.0875,1.0,3.0,5.0,0,Int64
5,SOP,3.4,1.0,3.5,5.0,0,Float64
6,LOR,3.4525,1.0,3.5,5.0,0,Float64
7,CGPA,8.59893,6.8,8.61,9.92,0,Float64
8,Research,0.5475,0.0,1.0,1.0,0,Int64
9,Chance of Admit,0.72435,0.34,0.73,0.97,0,Float64
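`describe` can also be asked for a specific set of statistics by passing their names as symbols. The sketch below assumes a recent `DataFrames.jl` release in which the statistics are given as positional arguments; the table `toy` is made up for illustration.

```julia
using DataFrames

# Hypothetical toy table with two numeric columns
toy = DataFrame(a = [1, 2, 3], b = [2.0, 4.0, 6.0])

# Request only the mean, standard deviation, minimum and maximum
stats = describe(toy, :mean, :std, :min, :max)

# The result is itself a DataFrame, so its columns can be used directly
column_means = stats.mean
```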


## Unique Observations

---

Summary statistics do not tell us how many unique values each column contains. We can check this with the `unique()` function.

In [11]:
first(csv_file, 5)

Row,Serial No.,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
,Int64,Int64,Int64,Int64,Float64,Float64,Float64,Int64,Float64
1,1,337,118,4,4.5,4.5,9.65,1,0.92
2,2,324,107,4,4.0,4.5,8.87,1,0.76
3,3,316,104,3,3.0,3.5,8.0,1,0.72
4,4,322,110,3,3.5,2.5,8.67,1,0.8
5,5,314,103,2,2.0,3.0,8.21,0,0.65


In [12]:
unique(csv_file[!, "Research"]) # Only two unique values in "Research" column

2-element Array{Int64,1}:
 1
 0
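To get the number of unique values rather than the values themselves, wrap the call in `length`. Calling `unique` on a whole data frame drops duplicated rows instead. A small sketch on a made-up table (the table `toy` and its columns are assumptions for illustration):

```julia
using DataFrames

# Hypothetical toy table; rows 1 and 3 are identical
toy = DataFrame(group = ["a", "b", "a", "c"], flag = [1, 0, 1, 1])

# Number of distinct values in one column
n_groups = length(unique(toy.group))

# Data frame with duplicated rows removed
distinct_rows = unique(toy)
```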

## Value Counts

---

We can count how many times each value occurs in a column.

In [13]:
first(csv_file, 5)

Row,Serial No.,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
,Int64,Int64,Int64,Int64,Float64,Float64,Float64,Int64,Float64
1,1,337,118,4,4.5,4.5,9.65,1,0.92
2,2,324,107,4,4.0,4.5,8.87,1,0.76
3,3,316,104,3,3.0,3.5,8.0,1,0.72
4,4,322,110,3,3.5,2.5,8.67,1,0.8
5,5,314,103,2,2.0,3.0,8.21,0,0.65


In [14]:
csv_file.Research |> freqtable # We have 219 ones and 181 zeros, totaling 400

2-element Named Array{Int64,1}
Dim1  │ 
──────┼────
0     │ 181
1     │ 219

`|>` is the pipe (chaining) operator: it passes the value on its left to the function on its right, which lets us chain functions. It plays a role similar to method chaining with `.` in Pandas.
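As a small illustration using only Base Julia, the nested call and the piped call below compute the same thing; the vector `flags` is a made-up example.

```julia
# Hypothetical example vector
flags = [1, 0, 1, 1, 0]

# Nested form: read from the inside out
nested = length(unique(flags))

# Piped form: read from left to right
piped = flags |> unique |> length

# Both forms give the same number of distinct values
same_result = nested == piped
```

Note that `|>` works with functions of a single argument; for functions that need extra arguments, an anonymous function or the `@pipe` macro imported at the top can be used.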

In [15]:
countmap(csv_file.Research) # Same as above

Dict{Int64,Int64} with 2 entries:
  0 => 181
  1 => 219

In [16]:
csv_file.Research |> freqtable |> prop # Proportions

2-element Named Array{Float64,1}
Dim1  │ 
──────┼───────
0     │ 0.4525
1     │ 0.5475
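The same counts and proportions can be obtained in other ways. The sketch below uses `proportionmap` from `StatsBase` for the shares and a `groupby`/`combine` pair from `DataFrames.jl` for the counts; the table `toy` and the output column name `count` are assumptions for illustration.

```julia
using DataFrames
using StatsBase

# Hypothetical toy column: three ones and a single zero
toy = DataFrame(flag = [1, 0, 1, 1])

# Share of each value, returned as a dictionary
shares = proportionmap(toy.flag)

# Counts per value, returned as a data frame with columns :flag and :count
counts = combine(groupby(toy, :flag), nrow => :count)
```

The data frame form is convenient when the counts feed into further `DataFrames.jl` operations, while the dictionary form is handy for quick lookups such as `shares[1]`.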

# Summary

---

In this lecture, we learned how to set up our working environment as well as how to install the necessary libraries for data analysis. Moreover, we covered one of the most important aspects of data analysis, reading and writing data, and took a first look at data size and shape, summary statistics, unique observations, and value counts. In the next lecture, we will explore the capabilities of `DataFrames.jl` further.