# Ex3 - Getting and Knowing your Data

This time we pull the data directly from the internet.
Special thanks to https://github.com/justmarkham for sharing the dataset and materials.

### Step 1. Import the necessary libraries

In [None]:
using DotEnv
using Pkg

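# Load variables from a local .env file, then activate the project
# environment whose path is stored in ENV_PATH.
DotEnv.load!()
path = ENV["ENV_PATH"]
Pkg.activate(path)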

using CSV
using DataFrames
using Downloads

### Step 2. Import the dataset from this [address](https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user). 

In [2]:
# Download to a temporary file; the returned value is the local path.
file = Downloads.download("https://raw.githubusercontent.com/justmarkham/DAT8/master/data/u.user")

"/tmp/jl_u5INuH0FI1"

### Step 3. Assign it to a variable called users

In [3]:
users = CSV.read(file, DataFrame, delim="|");
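
The parsing step can also be tried without a network connection. The following is a minimal sketch, assuming a few hypothetical sample rows in the same pipe-delimited layout; `sample_text` and `sample_users` are illustrative names, and the `types` argument is an addition that the cell above does not use. It pins `zip_code` to a string so that leading zeros are kept even when every value in a small sample happens to be numeric.

```julia
using CSV
using DataFrames

# Hypothetical in-memory sample in the same pipe-delimited layout as u.user.
sample_text = """
user_id|age|gender|occupation|zip_code
1|24|M|technician|85711
8|36|M|administrator|05201
9|29|M|student|01002
"""

# Assumption: without `types`, an all-numeric zip_code column may be inferred
# as an integer column, which would drop leading zeros. Pinning it to String
# keeps the text exactly as written.
sample_users = CSV.read(IOBuffer(sample_text), DataFrame;
                        delim = '|', types = Dict(:zip_code => String))

println(sample_users.zip_code)
```

In the full file the column is already read as text (`String7`), so the extra argument mainly matters for small or filtered extracts.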

### Step 4. See the first 25 entries

In [4]:
first(users, 25)

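| Row | user_id | age | gender | occupation | zip_code |
|-----|---------|-----|--------|------------|----------|
|     | Int64 | Int64 | String1 | String15 | String7 |
| 1 | 1 | 24 | M | technician | 85711 |
| 2 | 2 | 53 | F | other | 94043 |
| 3 | 3 | 23 | M | writer | 32067 |
| 4 | 4 | 24 | M | technician | 43537 |
| 5 | 5 | 33 | F | other | 15213 |
| 6 | 6 | 42 | M | executive | 98101 |
| 7 | 7 | 57 | M | administrator | 91344 |
| 8 | 8 | 36 | M | administrator | 05201 |
| 9 | 9 | 29 | M | student | 01002 |
| 10 | 10 | 53 | M | lawyer | 90703 |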


### Step 5. See the last 10 entries

In [6]:
last(users, 10)

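| Row | user_id | age | gender | occupation | zip_code |
|-----|---------|-----|--------|------------|----------|
|     | Int64 | Int64 | String1 | String15 | String7 |
| 1 | 934 | 61 | M | engineer | 22902 |
| 2 | 935 | 42 | M | doctor | 66221 |
| 3 | 936 | 24 | M | other | 32789 |
| 4 | 937 | 48 | M | educator | 98072 |
| 5 | 938 | 38 | F | technician | 55038 |
| 6 | 939 | 26 | F | student | 33319 |
| 7 | 940 | 32 | M | administrator | 02215 |
| 8 | 941 | 20 | M | student | 97229 |
| 9 | 942 | 48 | F | librarian | 78209 |
| 10 | 943 | 22 | M | student | 77841 |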


### Step 6. What is the number of observations in the dataset?

In [5]:
n_observations = size(users, 1)
@show n_observations;

n_observations = 943


### Step 7. What is the number of columns in the dataset?

In [7]:
n_columns = size(users, 2)
@show n_columns;

n_columns = 5
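
Steps 6 and 7 can also be answered in one call. A minimal sketch, assuming a small hypothetical frame named `demo` in place of `users`: `size` without a dimension argument returns both counts as a tuple, and DataFrames also provides `nrow` and `ncol`.

```julia
using DataFrames

# Hypothetical three-row frame standing in for `users`.
demo = DataFrame(user_id = 1:3, age = [24, 53, 23], gender = ["M", "F", "M"])

n_rows, n_cols = size(demo)      # both dimensions at once
@show n_rows n_cols
@show nrow(demo) ncol(demo);     # DataFrames helpers returning the same numbers
```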


### Step 8. Print the name of all the columns.

In [9]:
column_names = names(users)
@show column_names;

column_names = ["user_id", "age", "gender", "occupation", "zip_code"]

### Step 9. What is the data type of each column?

In [10]:
eltype.(eachcol(users))

5-element Vector{DataType}:
 Int64
 Int64
 String1
 String15
 String7
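
A bare list of types is easier to read when each type sits next to its column name. A minimal sketch, assuming a hypothetical two-row frame `demo`; the broadcasted `=>` builds one name-to-type pair per column.

```julia
using DataFrames

# Hypothetical frame standing in for `users`.
demo = DataFrame(user_id = [1, 2], age = [24, 53],
                 occupation = ["technician", "other"])

# Pair each column name with its element type.
column_types = names(demo) .=> eltype.(eachcol(demo))
foreach(println, column_types)
```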

### Step 10. Print only the occupation column

In [10]:
users[!, :occupation]

943-element PooledArrays.PooledVector{String15, UInt32, Vector{UInt32}}:
 "technician"
 "other"
 "writer"
 "technician"
 "other"
 "executive"
 "administrator"
 "administrator"
 "student"
 "lawyer"
 ⋮
 "doctor"
 "other"
 "educator"
 "technician"
 "student"
 "administrator"
 "student"
 "librarian"
 "student"

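The `!` row selector used above returns the stored column itself rather than a copy. A minimal sketch of the difference, assuming a hypothetical one-column frame `demo`; `no_copy` and `copied` are illustrative names.

```julia
using DataFrames

# Hypothetical frame standing in for `users`.
demo = DataFrame(occupation = ["technician", "other", "writer"])

no_copy = demo[!, :occupation]   # the stored column itself
copied  = demo[:, :occupation]   # a fresh copy of the column

copied[1] = "changed"            # does not touch `demo`
@show demo.occupation[1]

no_copy[1] = "changed"           # mutates `demo` in place
@show demo.occupation[1]
```

For read-only inspection either form works; `:` is the safer choice when the result will be modified.
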
### Step 11. How many different occupations are in this dataset?

In [12]:
n_different_occupations = length(unique(users[!, :occupation]))
@show n_different_occupations;

n_different_occupations = 21


### Step 12. What is the most frequent occupation?

In [14]:
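# Count the rows for each distinct value of `col`, sorted by count
# (descending when `rev` is true).
function value_counts(df::DataFrame, col::Union{String, Symbol}, rev::Bool=true)
    grouped_df = combine(groupby(df, col), nrow)
    sorted_grouped_df = sort(grouped_df, :nrow, rev=rev)
    return sorted_grouped_df
end

# The first row of the descending counts is the most frequent occupation.
value_counts(users, :occupation)[1, :]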

| Row | occupation | nrow |
|-----|------------|------|
|     | String15 | Int64 |
| 1 | student | 196 |
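
The same tally can be sketched without DataFrames, which shows what the grouping does under the hood. This is an illustration only, assuming a hypothetical `occupations` vector in place of `users[!, :occupation]`; it is not a replacement for `value_counts`.

```julia
# Hypothetical sample; the real column is users[!, :occupation].
occupations = ["student", "other", "student", "writer", "student", "other"]

# Tally each value in a dictionary.
counts = Dict{String, Int}()
for occupation in occupations
    counts[occupation] = get(counts, occupation, 0) + 1
end

# findmax on a dictionary returns the largest value and its key.
top_count, top_occupation = findmax(counts)
@show top_occupation top_count
@show length(counts);   # number of distinct occupations
```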


### Step 13. Summarize the DataFrame.

In [15]:
describe(users)

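| Row | variable | mean | min | median | max | nmissing | eltype |
|-----|----------|------|-----|--------|-----|----------|--------|
|     | Symbol | Union… | Any | Union… | Any | Int64 | DataType |
| 1 | user_id | 472.0 | 1 | 472.0 | 943 | 0 | Int64 |
| 2 | age | 34.052 | 7 | 31.0 | 73 | 0 | Int64 |
| 3 | gender | | F | | M | 0 | String1 |
| 4 | occupation | | administrator | | writer | 0 | String15 |
| 5 | zip_code | | 00000 | | Y1A6B | 0 | String7 |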


### Step 14. Summarize only the occupation column

In [16]:
describe(users, cols=:occupation)

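| Row | variable | mean | min | median | max | nmissing | eltype |
|-----|----------|------|-----|--------|-----|----------|--------|
|     | Symbol | Nothing | String15 | Nothing | String15 | Int64 | DataType |
| 1 | occupation | | administrator | | writer | 0 | String15 |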
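
For a text column the `mean` and `median` cells are empty, so it can be more informative to request specific statistics. A minimal sketch, assuming a hypothetical frame `demo`; `describe` accepts statistic names such as `:nunique`, `:nmissing`, and `:eltype` as positional arguments.

```julia
using DataFrames

# Hypothetical frame standing in for `users`.
demo = DataFrame(occupation = ["student", "other", "student", "writer"])

# Request only the statistics that make sense for a text column.
occupation_summary = describe(demo, :nunique, :nmissing, :eltype;
                              cols = :occupation)
@show occupation_summary.nunique[1]
```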


### Step 15. What is the mean age of users?

In [19]:
using Statistics

user_mean_age = round(mean(users[!, :age]), digits=2)
@show user_mean_age;

user_mean_age = 34.05
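
The summary in Step 13 reports a mean age (34.052) above the median (31.0). A minimal sketch of how the two statistics are computed, assuming a hypothetical `ages` vector rather than the real column:

```julia
using Statistics

# Hypothetical sample of the age column.
ages = [24, 53, 23, 24, 33]

# The mean is the sum divided by the number of values; the median is the
# middle value after sorting, so a few large ages pull the mean upward.
manual_mean = sum(ages) / length(ages)
@show manual_mean mean(ages) median(ages);
```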


### Step 16. What is the age with least occurrence?

In [22]:
# Ascending order, so the least frequent ages come first.
value_counts(users, :age, false)[1:10, :]

| Row | age | nrow |
|-----|-----|------|
|     | Int64 | Int64 |
| 1 | 7 | 1 |
| 2 | 10 | 1 |
| 3 | 11 | 1 |
| 4 | 66 | 1 |
| 5 | 73 | 1 |
| 6 | 62 | 2 |
| 7 | 64 | 2 |
| 8 | 68 | 2 |
| 9 | 69 | 2 |
| 10 | 14 | 3 |
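
Five ages (7, 10, 11, 66, and 73) are tied at a single occurrence, so there is no unique answer. Rather than reading the tie off a fixed number of rows, the ages that share the minimum count can be selected directly. A minimal sketch, assuming a hypothetical frame `demo`; `age_counts`, `min_count`, and `rarest_ages` are illustrative names.

```julia
using DataFrames

# Hypothetical ages: 7 and 73 each occur once, 30 twice, 24 three times.
demo = DataFrame(age = [24, 7, 24, 73, 30, 24, 30])

# Count rows per age, naming the count column explicitly.
age_counts = combine(groupby(demo, :age), nrow => :count)
min_count = minimum(age_counts.count)

# Keep every age tied for the minimum count instead of a fixed number of rows.
rarest_ages = sort(age_counts[age_counts.count .== min_count, :age])
@show min_count rarest_ages;
```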
