# Data Cleaning In Julia

### Common Issues With Data
+ Reading the file
+ Inconsistent Column Names
+ Missing Data
+ Different Data Types 
+ Duplicate rows
+ etc

In [None]:
# EDA packages
using DataFrames

## Loading or Reading the File
+ Encoding Error
+ Inconsistent rows 

In [2]:
# Issue 1: reading the file fails
df = readtable("unmodified_data.csv")

LoadError: ArgumentError: Columns and column index must be the same length

In [3]:
# Solution 1: specify the encoding when reading
df = readtable("unmodified_data.csv",encoding=:latin1)

LoadError: ArgumentError: Argument 'encoding' only supports ':utf8' currently.

In [4]:

df = readtable("unmodified_data.csv",encoding=:utf8)

LoadError: ArgumentError: Columns and column index must be the same length

In [5]:
# Solution 2: re-save the file as UTF-8 in a text editor, then read it again
df = readtable("unmodified_data.csv",encoding=:utf8)

Unnamed: 0,movie_title,num_critic_for_reviews,duration,DIRECTOR_facebook_likes,actor_3_facebook_likes,ACTOR_1_facebook_likes,gross,num_voted_users,Cast_Total_facebook_likes,facenumber_in_poster,num_user_for_reviews,budget,title_year,ACTOR_2_facebook_likes,imdb_score,title_year_1
1,Avatar?ÿ,723,178.0,10,855,1000,760505847,886204.0,4834.0,,3054,237000000,2009,936.0,7.9,2009.0
2,Pirates of the Caribbean: At World's End?ÿ,302,,563,1000,40000,309404152,471220.0,48350.0,,1238,300000000,2007,5000.0,7.1,
3,Spectre?ÿ,602,148.0,20,161,11000,200074175,275868.0,11700.0,1.0,994,245000000,2015,393.0,6.8,2015.0
4,The Dark Knight Rises?ÿ,813,,22000,23000,27000,448130642,1144337.0,106759.0,,2701,250000000,2012,23000.0,8.5,
5,John Carter?ÿ,462,132.0,"""475""",530,640,73058679,212204.0,1873.0,1.0,738,263700000,2012,632.0,6.6,
6,Spider-Man 3?ÿ,392,156.0,23,4000,24000,336530303,383056.0,46055.0,,1902,258000000,2007,11000.0,6.2,2007.0
7,Tangled?ÿ,324,,15,284,799,200807262,294810.0,,1.0,387,260000000,2010,553.0,7.8,
8,Avengers: Age of Ultron?ÿ,635,141.0,10,19000,26000,458991599,462669.0,92000.0,4.0,1117,250000000,2015,21000.0,7.5,
9,Avengers: Age of Ultron?ÿ,635,141.0,10,19000,26000,458991599,462669.0,92000.0,4.0,1117,250000000,2015,21000.0,7.5,2015.0
10,Harry Potter and the Half-Blood Prince?ÿ,375,153.0,282,10000,25000,301956980,321795.0,58753.0,3.0,973,250000000,2009,11000.0,7.5,
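
If editing the file by hand is not convenient, the re-encoding step can also be scripted. The following is a minimal sketch, assuming Julia 1.x and assuming the source file really is Latin-1 (ISO-8859-1). `latin1_to_utf8` and the output file name are hypothetical and not part of DataFrames. Every Latin-1 byte value equals its Unicode code point, so each byte can be turned into a `Char` directly.

```julia
# Hypothetical helper: decode Latin-1 bytes into a (UTF-8) Julia String.
# Each Latin-1 byte value is also its Unicode code point.
latin1_to_utf8(bytes::AbstractVector{UInt8}) = join(Char(b) for b in bytes)

# Demo on in-memory bytes: 0xE9 is "é" in Latin-1.
sample = UInt8[0x43, 0x61, 0x66, 0xE9]
decoded = latin1_to_utf8(sample)
println(decoded)            # Café
println(isvalid(decoded))   # true

# For a real file (file names assumed):
# write("unmodified_data_utf8.csv", latin1_to_utf8(read("unmodified_data.csv")))
```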


In [7]:
head(df,5)

Unnamed: 0,movie_title,num_critic_for_reviews,duration,DIRECTOR_facebook_likes,actor_3_facebook_likes,ACTOR_1_facebook_likes,gross,num_voted_users,Cast_Total_facebook_likes,facenumber_in_poster,num_user_for_reviews,budget,title_year,ACTOR_2_facebook_likes,imdb_score,title_year_1
1,Avatar?ÿ,723,178.0,10,855,1000,760505847,886204,4834,,3054,237000000,2009,936,7.9,2009.0
2,Pirates of the Caribbean: At World's End?ÿ,302,,563,1000,40000,309404152,471220,48350,,1238,300000000,2007,5000,7.1,
3,Spectre?ÿ,602,148.0,20,161,11000,200074175,275868,11700,1.0,994,245000000,2015,393,6.8,2015.0
4,The Dark Knight Rises?ÿ,813,,22000,23000,27000,448130642,1144337,106759,,2701,250000000,2012,23000,8.5,
5,John Carter?ÿ,462,132.0,"""475""",530,640,73058679,212204,1873,1.0,738,263700000,2012,632,6.6,


## Inconsistent Column Names
+ Change Cases
+ Rename them

### Change the case to Upper

In [9]:
names(df)

16-element Array{Symbol,1}:
 :movie_title              
 :num_critic_for_reviews   
 :duration                 
 :DIRECTOR_facebook_likes  
 :actor_3_facebook_likes   
 :ACTOR_1_facebook_likes   
 :gross                    
 :num_voted_users          
 :Cast_Total_facebook_likes
 :facenumber_in_poster     
 :num_user_for_reviews     
 :budget                   
 :title_year               
 :ACTOR_2_facebook_likes   
 :imdb_score               
 :title_year_1             

In [10]:
uppercase(string(names(df)))

"SYMBOL[:MOVIE_TITLE, :NUM_CRITIC_FOR_REVIEWS, :DURATION, :DIRECTOR_FACEBOOK_LIKES, :ACTOR_3_FACEBOOK_LIKES, :ACTOR_1_FACEBOOK_LIKES, :GROSS, :NUM_VOTED_USERS, :CAST_TOTAL_FACEBOOK_LIKES, :FACENUMBER_IN_POSTER, :NUM_USER_FOR_REVIEWS, :BUDGET, :TITLE_YEAR, :ACTOR_2_FACEBOOK_LIKES, :IMDB_SCORE, :TITLE_YEAR_1]"

In [11]:
# Use the names! function and pass in an array of the new names
# NB: names!() modifies the original DataFrame
# names!(dataframe, [array_of_names])
names!(df,[:MOVIE_TITLE, :NUM_CRITIC_FOR_REVIEWS, :DURATION, :DIRECTOR_FACEBOOK_LIKES, :ACTOR_3_FACEBOOK_LIKES, :ACTOR_1_FACEBOOK_LIKES, :GROSS, :NUM_VOTED_USERS, :CAST_TOTAL_FACEBOOK_LIKES, :FACENUMBER_IN_POSTER, :NUM_USER_FOR_REVIEWS, :BUDGET, :TITLE_YEAR, :ACTOR_2_FACEBOOK_LIKES, :IMDB_SCORE, :TITLE_YEAR_1])

Unnamed: 0,MOVIE_TITLE,NUM_CRITIC_FOR_REVIEWS,DURATION,DIRECTOR_FACEBOOK_LIKES,ACTOR_3_FACEBOOK_LIKES,ACTOR_1_FACEBOOK_LIKES,GROSS,NUM_VOTED_USERS,CAST_TOTAL_FACEBOOK_LIKES,FACENUMBER_IN_POSTER,NUM_USER_FOR_REVIEWS,BUDGET,TITLE_YEAR,ACTOR_2_FACEBOOK_LIKES,IMDB_SCORE,TITLE_YEAR_1
1,Avatar?ÿ,723,178.0,10,855,1000,760505847,886204.0,4834.0,,3054,237000000,2009,936.0,7.9,2009.0
2,Pirates of the Caribbean: At World's End?ÿ,302,,563,1000,40000,309404152,471220.0,48350.0,,1238,300000000,2007,5000.0,7.1,
3,Spectre?ÿ,602,148.0,20,161,11000,200074175,275868.0,11700.0,1.0,994,245000000,2015,393.0,6.8,2015.0
4,The Dark Knight Rises?ÿ,813,,22000,23000,27000,448130642,1144337.0,106759.0,,2701,250000000,2012,23000.0,8.5,
5,John Carter?ÿ,462,132.0,"""475""",530,640,73058679,212204.0,1873.0,1.0,738,263700000,2012,632.0,6.6,
6,Spider-Man 3?ÿ,392,156.0,23,4000,24000,336530303,383056.0,46055.0,,1902,258000000,2007,11000.0,6.2,2007.0
7,Tangled?ÿ,324,,15,284,799,200807262,294810.0,,1.0,387,260000000,2010,553.0,7.8,
8,Avengers: Age of Ultron?ÿ,635,141.0,10,19000,26000,458991599,462669.0,92000.0,4.0,1117,250000000,2015,21000.0,7.5,
9,Avengers: Age of Ultron?ÿ,635,141.0,10,19000,26000,458991599,462669.0,92000.0,4.0,1117,250000000,2015,21000.0,7.5,2015.0
10,Harry Potter and the Half-Blood Prince?ÿ,375,153.0,282,10000,25000,301956980,321795.0,58753.0,3.0,973,250000000,2009,11000.0,7.5,
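
Typing all sixteen names by hand is error-prone. Below is a minimal sketch of building the uppercase names programmatically, using only Base Julia on a plain vector of symbols (a shortened copy of the names above).

```julia
# A few of the column names from the data set, as Symbols
old_names = [:movie_title, :DIRECTOR_facebook_likes, :Cast_Total_facebook_likes]

# Symbol -> String -> uppercase String -> Symbol, element by element
new_names = Symbol.(uppercase.(string.(old_names)))

println(new_names)
```

In the notebook's session the same expression applied to `names(df)` should give an array that can be passed straight to `names!`.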


### Renaming Columns

In [None]:
# rename()  returns a renamed copy
# rename!() changes the original

In [None]:
rename!(df,:NUM_CRITIC_FOR_REVIEWS,:REVIEWS)

## Missing Data
+ Add a default value for missing data or use mean to fill it
+ Delete the row/column with missing data
+ Interpolate the rows
+ Replace

#### To check for missing data
In the Boolean results, `false` means the value is not missing.
+ showcols()
+ isna/ismissing
+ isnan
+ .na
+ completecases
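
The functions listed here belong to the older DataFrames/DataArrays API that this notebook uses. In Julia 1.x, `NA` was replaced by `missing`, `isna` by `ismissing`, and `find` by `findall`. Below is a minimal sketch of the same checks in Base Julia, on a made-up vector rather than the movie data:

```julia
# Made-up column with one missing entry and one NaN
duration = [178.0, missing, 148.0, NaN]

is_missing    = ismissing.(duration)          # element-wise Boolean mask
n_missing     = count(ismissing, duration)    # number of missing entries
where_missing = findall(ismissing, duration)  # positions of missing entries

# NaN is a floating-point value, not a missing one.
# isnan(missing) returns missing, so guard the call.
is_nan = [!ismissing(v) && isnan(v) for v in duration]

println(n_missing)       # 1
println(where_missing)   # [2]
```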

In [13]:
# Specify a group of strings to be converted to NA values during reading:
df_with_na = readtable("unmodified_data.csv", nastrings=["NA", "na", "n/a", "missing"])

Unnamed: 0,movie_title,num_critic_for_reviews,duration,DIRECTOR_facebook_likes,actor_3_facebook_likes,ACTOR_1_facebook_likes,gross,num_voted_users,Cast_Total_facebook_likes,facenumber_in_poster,num_user_for_reviews,budget,title_year,ACTOR_2_facebook_likes,imdb_score,title_year_1
1,Avatar?ÿ,723,178.0,10,855,1000,760505847,886204.0,4834.0,,3054,237000000,2009,936.0,7.9,2009.0
2,Pirates of the Caribbean: At World's End?ÿ,302,,563,1000,40000,309404152,471220.0,48350.0,,1238,300000000,2007,5000.0,7.1,
3,Spectre?ÿ,602,148.0,20,161,11000,200074175,275868.0,11700.0,1.0,994,245000000,2015,393.0,6.8,2015.0
4,The Dark Knight Rises?ÿ,813,,22000,23000,27000,448130642,1144337.0,106759.0,,2701,250000000,2012,23000.0,8.5,
5,John Carter?ÿ,462,132.0,"""475""",530,640,73058679,212204.0,1873.0,1.0,738,263700000,2012,632.0,6.6,
6,Spider-Man 3?ÿ,392,156.0,23,4000,24000,336530303,383056.0,46055.0,,1902,258000000,2007,11000.0,6.2,2007.0
7,Tangled?ÿ,324,,15,284,799,200807262,294810.0,,1.0,387,260000000,2010,553.0,7.8,
8,Avengers: Age of Ultron?ÿ,635,141.0,10,19000,26000,458991599,462669.0,92000.0,4.0,1117,250000000,2015,21000.0,7.5,
9,Avengers: Age of Ultron?ÿ,635,141.0,10,19000,26000,458991599,462669.0,92000.0,4.0,1117,250000000,2015,21000.0,7.5,2015.0
10,Harry Potter and the Half-Blood Prince?ÿ,375,153.0,282,10000,25000,301956980,321795.0,58753.0,3.0,973,250000000,2009,11000.0,7.5,


In [15]:
# Checking For Missing Data
showcols(df)

14×16 DataFrames.DataFrame
│ Col # │ Name                      │ Eltype  │ Missing │
├───────┼───────────────────────────┼─────────┼─────────┤
│ 1     │ MOVIE_TITLE               │ String  │ 0       │
│ 2     │ NUM_CRITIC_FOR_REVIEWS    │ Int64   │ 0       │
│ 3     │ DURATION                  │ Int64   │ 3       │
│ 4     │ DIRECTOR_FACEBOOK_LIKES   │ String  │ 2       │
│ 5     │ ACTOR_3_FACEBOOK_LIKES    │ Int64   │ 0       │
│ 6     │ ACTOR_1_FACEBOOK_LIKES    │ Int64   │ 0       │
│ 7     │ GROSS                     │ Int64   │ 0       │
│ 8     │ NUM_VOTED_USERS           │ Int64   │ 1       │
│ 9     │ CAST_TOTAL_FACEBOOK_LIKES │ Int64   │ 2       │
│ 10    │ FACENUMBER_IN_POSTER      │ Int64   │ 5       │
│ 11    │ NUM_USER_FOR_REVIEWS      │ Int64   │ 0       │
│ 12    │ BUDGET                    │ Int64   │ 0       │
│ 13    │ TITLE_YEAR                │ Int64   │ 0       │
│ 14    │ ACTOR_2_FACEBOOK_LIKES    │ Int64   │ 1       │
│ 15    │ IMDB_SCORE                │ Float64

In [None]:
# Logic behind the function

In [16]:
println(nrow(df))
println(ncol(df))
println(size(df))

14
16
(14, 16)


##### Checking with isna()

In [17]:
# A blank string looks like a missing value, but it is not an NA
x = [1, 2," ", 3]


4-element Array{Any,1}:
 1   
 2   
  " "
 3   

In [19]:
# Check whether the element at index 1 is NA
# false means it is not NA
isna(x,1)

false

In [20]:
# Check the whole array element-wise with isna.()
isna.(x)

4-element BitArray{1}:
 false
 false
 false
 false
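
Blank strings like this one have to be converted before any missing-value check can see them. Below is a minimal sketch, assuming Julia 1.x (where `missing` plays the role of `NA`); `blank_to_missing` is a hypothetical helper:

```julia
# Hypothetical helper: treat empty or whitespace-only strings as missing
blank_to_missing(v) = (v isa AbstractString && isempty(strip(v))) ? missing : v

x = [1, 2, " ", 3]
x_clean = blank_to_missing.(x)

# The blank entry at index 3 is now reported as missing
println(findall(ismissing, x_clean))   # [3]
```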

In [18]:
# To hold a real NA, build the array with the @data macro
y = @data([1,2,NA,3])

4-element DataArrays.DataArray{Int64,1}:
 1  
 2  
  NA
 3  

In [21]:
# In an array built with @data, isna.() detects the NA
isna.(y)

4-element BitArray{1}:
 false
 false
  true
 false

In [22]:
isna(y,3)

true

In [23]:
# Returns the indices of the elements that are NA
find(isna.(y))

1-element Array{Int64,1}:
 3

### Back To Dataframe

In [24]:
# Returns the row indices where the column is NA
find(isna.(df[:, :DURATION]))

3-element Array{Int64,1}:
 2
 4
 7

In [25]:
# Returns the full rows where the column is NA
df[find(isna.(df[:, :DURATION])),:]

Unnamed: 0,MOVIE_TITLE,NUM_CRITIC_FOR_REVIEWS,DURATION,DIRECTOR_FACEBOOK_LIKES,ACTOR_3_FACEBOOK_LIKES,ACTOR_1_FACEBOOK_LIKES,GROSS,NUM_VOTED_USERS,CAST_TOTAL_FACEBOOK_LIKES,FACENUMBER_IN_POSTER,NUM_USER_FOR_REVIEWS,BUDGET,TITLE_YEAR,ACTOR_2_FACEBOOK_LIKES,IMDB_SCORE,TITLE_YEAR_1
1,Pirates of the Caribbean: At World's End?ÿ,302,,563,1000,40000,309404152,471220,48350.0,,1238,300000000,2007,5000,7.1,
2,The Dark Knight Rises?ÿ,813,,22000,23000,27000,448130642,1144337,106759.0,,2701,250000000,2012,23000,8.5,
3,Tangled?ÿ,324,,15,284,799,200807262,294810,,1.0,387,260000000,2010,553,7.8,


##### You Can Loop through

In [26]:
# Walk through the column row by row and report each value or NA
nrows, ncols = size(df) # number of rows and columns
# You can also use the nrow() function instead of the line above
for row in 1:nrows
     if isna(df[row, :DURATION])
       println("skipping missing values")
     else
       println("the value is $(df[row, :DURATION])")
     end
end

the value is 178
skipping missing values
the value is 148
skipping missing values
the value is 132
the value is 156
skipping missing values
the value is 141
the value is 141
the value is 153
the value is 183
the value is 169
the value is 106
the value is 151


In [27]:
showcols(df)

14×16 DataFrames.DataFrame
│ Col # │ Name                      │ Eltype  │ Missing │
├───────┼───────────────────────────┼─────────┼─────────┤
│ 1     │ MOVIE_TITLE               │ String  │ 0       │
│ 2     │ NUM_CRITIC_FOR_REVIEWS    │ Int64   │ 0       │
│ 3     │ DURATION                  │ Int64   │ 3       │
│ 4     │ DIRECTOR_FACEBOOK_LIKES   │ String  │ 2       │
│ 5     │ ACTOR_3_FACEBOOK_LIKES    │ Int64   │ 0       │
│ 6     │ ACTOR_1_FACEBOOK_LIKES    │ Int64   │ 0       │
│ 7     │ GROSS                     │ Int64   │ 0       │
│ 8     │ NUM_VOTED_USERS           │ Int64   │ 1       │
│ 9     │ CAST_TOTAL_FACEBOOK_LIKES │ Int64   │ 2       │
│ 10    │ FACENUMBER_IN_POSTER      │ Int64   │ 5       │
│ 11    │ NUM_USER_FOR_REVIEWS      │ Int64   │ 0       │
│ 12    │ BUDGET                    │ Int64   │ 0       │
│ 13    │ TITLE_YEAR                │ Int64   │ 0       │
│ 14    │ ACTOR_2_FACEBOOK_LIKES    │ Int64   │ 1       │
│ 15    │ IMDB_SCORE                │ Float64

#### Julia Tricks
+ isna() # checks a single value
+ isna.() # checks every element and returns a Boolean mask
+ .!isna.() # negated mask: true where a value is present
+ completecases() # row mask: true for rows with no missing value in any column

In [29]:
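# Note: `=` does not copy; df_2 is a second name for the same DataFrame (use copy(df) for an independent copy)
df_2 = df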

Unnamed: 0,MOVIE_TITLE,NUM_CRITIC_FOR_REVIEWS,DURATION,DIRECTOR_FACEBOOK_LIKES,ACTOR_3_FACEBOOK_LIKES,ACTOR_1_FACEBOOK_LIKES,GROSS,NUM_VOTED_USERS,CAST_TOTAL_FACEBOOK_LIKES,FACENUMBER_IN_POSTER,NUM_USER_FOR_REVIEWS,BUDGET,TITLE_YEAR,ACTOR_2_FACEBOOK_LIKES,IMDB_SCORE,TITLE_YEAR_1
1,Avatar?ÿ,723,178.0,10,855,1000,760505847,886204.0,4834.0,,3054,237000000,2009,936.0,7.9,2009.0
2,Pirates of the Caribbean: At World's End?ÿ,302,,563,1000,40000,309404152,471220.0,48350.0,,1238,300000000,2007,5000.0,7.1,
3,Spectre?ÿ,602,148.0,20,161,11000,200074175,275868.0,11700.0,1.0,994,245000000,2015,393.0,6.8,2015.0
4,The Dark Knight Rises?ÿ,813,,22000,23000,27000,448130642,1144337.0,106759.0,,2701,250000000,2012,23000.0,8.5,
5,John Carter?ÿ,462,132.0,"""475""",530,640,73058679,212204.0,1873.0,1.0,738,263700000,2012,632.0,6.6,
6,Spider-Man 3?ÿ,392,156.0,23,4000,24000,336530303,383056.0,46055.0,,1902,258000000,2007,11000.0,6.2,2007.0
7,Tangled?ÿ,324,,15,284,799,200807262,294810.0,,1.0,387,260000000,2010,553.0,7.8,
8,Avengers: Age of Ultron?ÿ,635,141.0,10,19000,26000,458991599,462669.0,92000.0,4.0,1117,250000000,2015,21000.0,7.5,
9,Avengers: Age of Ultron?ÿ,635,141.0,10,19000,26000,458991599,462669.0,92000.0,4.0,1117,250000000,2015,21000.0,7.5,2015.0
10,Harry Potter and the Half-Blood Prince?ÿ,375,153.0,282,10000,25000,301956980,321795.0,58753.0,3.0,973,250000000,2009,11000.0,7.5,


In [30]:
# The opposite of isna, applied per row: true marks rows with no NA in any column
completecases(df_2)

14-element DataArrays.DataArray{Bool,1}:
 false
 false
  true
 false
 false
 false
 false
 false
  true
 false
 false
 false
  true
  true
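
What `completecases` computes can be spelled out by hand. Below is a minimal sketch in Base Julia (1.x, with `missing` in place of `NA`), using a small made-up table stored as named columns. `complete_rows` is a hypothetical helper, not the DataFrames implementation:

```julia
# Small made-up table: each entry of the NamedTuple is one column
cols = (
    duration = [178, missing, 148],
    likes    = [4834, 48350, missing],
    year     = [2009, 2007, 2015],
)

# A row is complete when no column has a missing value at that position
complete_rows(cols) =
    [all(!ismissing(c[i]) for c in cols) for i in eachindex(first(cols))]

mask = complete_rows(cols)
println(count(mask))   # 1: only the first row is complete
```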

In [31]:
# A simple way to check for NA in a column is to use .na
df[:DURATION].na


14-element BitArray{1}:
 false
  true
 false
  true
 false
 false
  true
 false
 false
 false
 false
 false
 false
 false

## Solutions for Missing Values
+ Fill NA with a default value

In [33]:
# Fill NA with 0
df[isna.(df[:DURATION]),:DURATION] = 0

0
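
In Julia 1.x the same fill can be written with `coalesce`, which returns its first non-missing argument. Below is a minimal sketch on a made-up vector:

```julia
duration = [178, missing, 148, missing, 132]

# Replace every missing entry with 0; present values pass through unchanged
duration_filled = coalesce.(duration, 0)

println(duration_filled)   # [178, 0, 148, 0, 132]
```

Filling with 0 keeps the column usable but pulls statistics such as the mean downwards, so the fill value should be chosen with the later analysis in mind.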

In [34]:
showcols(df)

14×16 DataFrames.DataFrame
│ Col # │ Name                      │ Eltype  │ Missing │
├───────┼───────────────────────────┼─────────┼─────────┤
│ 1     │ MOVIE_TITLE               │ String  │ 0       │
│ 2     │ NUM_CRITIC_FOR_REVIEWS    │ Int64   │ 0       │
│ 3     │ DURATION                  │ Int64   │ 0       │
│ 4     │ DIRECTOR_FACEBOOK_LIKES   │ String  │ 2       │
│ 5     │ ACTOR_3_FACEBOOK_LIKES    │ Int64   │ 0       │
│ 6     │ ACTOR_1_FACEBOOK_LIKES    │ Int64   │ 0       │
│ 7     │ GROSS                     │ Int64   │ 0       │
│ 8     │ NUM_VOTED_USERS           │ Int64   │ 1       │
│ 9     │ CAST_TOTAL_FACEBOOK_LIKES │ Int64   │ 2       │
│ 10    │ FACENUMBER_IN_POSTER      │ Int64   │ 5       │
│ 11    │ NUM_USER_FOR_REVIEWS      │ Int64   │ 0       │
│ 12    │ BUDGET                    │ Int64   │ 0       │
│ 13    │ TITLE_YEAR                │ Int64   │ 0       │
│ 14    │ ACTOR_2_FACEBOOK_LIKES    │ Int64   │ 1       │
│ 15    │ IMDB_SCORE                │ Float64

In [None]:
# Fill NA with mean

In [35]:
# mean returns NA when the column contains an NA, because the NA propagates through the calculation
mean(df[:CAST_TOTAL_FACEBOOK_LIKES])

NA

In [36]:
# Solution: locate the NA rows, compute the mean over the remaining rows, then fill the NA with it
find(isna.(df[:CAST_TOTAL_FACEBOOK_LIKES]))

2-element Array{Int64,1}:
  7
 12

In [37]:
# Select all rows except the NA rows (7 and 12)
df[[1,2,3,4,5,6,8,9,10,11,13,14],[:CAST_TOTAL_FACEBOOK_LIKES]]

Unnamed: 0,CAST_TOTAL_FACEBOOK_LIKES
1,4834
2,48350
3,11700
4,106759
5,1873
6,46055
7,92000
8,92000
9,58753
10,24450


In [39]:
# Use describe to find the mean, then use it as the fill value
describe(df[[1,2,3,4,5,6,8,9,10,11,13,14],[:CAST_TOTAL_FACEBOOK_LIKES]])

CAST_TOTAL_FACEBOOK_LIKES
Summary Stats:
Mean:           44773.583333
Minimum:        1873.000000
1st Quartile:   9983.500000
Median:         47202.500000
3rd Quartile:   67064.750000
Maximum:        106759.000000
Length:         12
Type:           Int64
Number Missing: 0
% Missing:      0.000000



In [41]:
df[isna.(df[:CAST_TOTAL_FACEBOOK_LIKES]),:CAST_TOTAL_FACEBOOK_LIKES] = 44774

44774
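
Rather than copying the mean from `describe` by hand, it can be computed while skipping the missing entries. Below is a minimal sketch, assuming Julia 1.x with the `Statistics` standard library. The vector holds the twelve values of the column shown above, with `missing` in positions 7 and 12:

```julia
using Statistics

likes = [4834, 48350, 11700, 106759, 1873, 46055, missing,
         92000, 92000, 58753, 24450, missing, 2023, 48486]

# The plain mean propagates the missing value
println(mean(likes))                 # missing

# Mean over the present values only, rounded to fit the integer column
fill_value = round(Int, mean(skipmissing(likes)))
println(fill_value)                  # 44774

likes_filled = coalesce.(likes, fill_value)
```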

In [42]:
df[:CAST_TOTAL_FACEBOOK_LIKES]

14-element DataArrays.DataArray{Int64,1}:
   4834
  48350
  11700
 106759
   1873
  46055
      0
  92000
  92000
  58753
  24450
  44774
   2023
  48486

In [None]:
#df[[7,12],[:CAST_TOTAL_FACEBOOK_LIKES]]
df[[7,12],:]

In [44]:
#df[1:end .!= 2,:] # For rows
#df[:, 1:end .!= 2] # For columns
df[[1,2,3,4,5,6,8,9,10,11,13,14],[:CAST_TOTAL_FACEBOOK_LIKES]]
df[1:end .!= 7,[:CAST_TOTAL_FACEBOOK_LIKES]]

Unnamed: 0,CAST_TOTAL_FACEBOOK_LIKES
1,4834
2,48350
3,11700
4,106759
5,1873
6,46055
7,92000
8,92000
9,58753
10,24450


### Solution Dropna and Completecases


In [45]:
# Solution: dropna
# dropna works only on arrays (a single column); for a whole DataFrame use completecases()
df_drop = df[:NUM_VOTED_USERS]

14-element DataArrays.DataArray{Int64,1}:
  886204  
  471220  
  275868  
 1144337  
  212204  
  383056  
  294810  
  462669  
  462669  
  321795  
        NA
  240396  
  330784  
  522040  

In [46]:
dropna(df_drop)

13-element Array{Int64,1}:
  886204
  471220
  275868
 1144337
  212204
  383056
  294810
  462669
  462669
  321795
  240396
  330784
  522040
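
In Julia 1.x the counterpart of `dropna` on a single column is `skipmissing`. It is lazy, so `collect` is needed to get an array back. Below is a minimal sketch on a shortened, made-up copy of the column:

```julia
votes = [886204, 471220, missing, 240396]

# Materialise the non-missing values into a new array
votes_present = collect(skipmissing(votes))

println(length(votes))     # 4: the original is unchanged
println(votes_present)     # [886204, 471220, 240396]
```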

In [47]:
size(df_drop)

(14,)

In [48]:
# Note: df_complete is an alias of df, not a copy, so completecases! below also changes df
df_complete = df

Unnamed: 0,MOVIE_TITLE,NUM_CRITIC_FOR_REVIEWS,DURATION,DIRECTOR_FACEBOOK_LIKES,ACTOR_3_FACEBOOK_LIKES,ACTOR_1_FACEBOOK_LIKES,GROSS,NUM_VOTED_USERS,CAST_TOTAL_FACEBOOK_LIKES,FACENUMBER_IN_POSTER,NUM_USER_FOR_REVIEWS,BUDGET,TITLE_YEAR,ACTOR_2_FACEBOOK_LIKES,IMDB_SCORE,TITLE_YEAR_1
1,Avatar?ÿ,723,178,10,855,1000,760505847,886204.0,4834,,3054,237000000,2009,936.0,7.9,2009.0
2,Pirates of the Caribbean: At World's End?ÿ,302,0,563,1000,40000,309404152,471220.0,48350,,1238,300000000,2007,5000.0,7.1,
3,Spectre?ÿ,602,148,20,161,11000,200074175,275868.0,11700,1.0,994,245000000,2015,393.0,6.8,2015.0
4,The Dark Knight Rises?ÿ,813,0,22000,23000,27000,448130642,1144337.0,106759,,2701,250000000,2012,23000.0,8.5,
5,John Carter?ÿ,462,132,"""475""",530,640,73058679,212204.0,1873,1.0,738,263700000,2012,632.0,6.6,
6,Spider-Man 3?ÿ,392,156,23,4000,24000,336530303,383056.0,46055,,1902,258000000,2007,11000.0,6.2,2007.0
7,Tangled?ÿ,324,0,15,284,799,200807262,294810.0,0,1.0,387,260000000,2010,553.0,7.8,
8,Avengers: Age of Ultron?ÿ,635,141,10,19000,26000,458991599,462669.0,92000,4.0,1117,250000000,2015,21000.0,7.5,
9,Avengers: Age of Ultron?ÿ,635,141,10,19000,26000,458991599,462669.0,92000,4.0,1117,250000000,2015,21000.0,7.5,2015.0
10,Harry Potter and the Half-Blood Prince?ÿ,375,153,282,10000,25000,301956980,321795.0,58753,3.0,973,250000000,2009,11000.0,7.5,


In [49]:
# Drops every row that contains an NA, modifying the DataFrame in place
completecases!(df_complete)

Unnamed: 0,MOVIE_TITLE,NUM_CRITIC_FOR_REVIEWS,DURATION,DIRECTOR_FACEBOOK_LIKES,ACTOR_3_FACEBOOK_LIKES,ACTOR_1_FACEBOOK_LIKES,GROSS,NUM_VOTED_USERS,CAST_TOTAL_FACEBOOK_LIKES,FACENUMBER_IN_POSTER,NUM_USER_FOR_REVIEWS,BUDGET,TITLE_YEAR,ACTOR_2_FACEBOOK_LIKES,IMDB_SCORE,TITLE_YEAR_1
1,Spectre?ÿ,602,148,20,161,11000,200074175,275868,11700,1,994,245000000,2015,393,6.8,2015
2,Avengers: Age of Ultron?ÿ,635,141,10,19000,26000,458991599,462669,92000,4,1117,250000000,2015,21000,7.5,2015
3,Quantum of Solace?ÿ,403,106,395,393,451,168368427,330784,2023,1,1243,200000000,2008,412,6.7,2008
4,Pirates of the Caribbean: Dead Man's Chest?ÿ,313,151,563,1000,40000,423032628,522040,48486,2,1832,225000000,2006,5000,7.3,2008


In [50]:
# Only the 4 rows without any NA remain
size(df_complete)

(4, 16)

In [51]:
# df itself now has only 4 rows, because df_complete was an alias of df
df_with_drop = df

Unnamed: 0,MOVIE_TITLE,NUM_CRITIC_FOR_REVIEWS,DURATION,DIRECTOR_FACEBOOK_LIKES,ACTOR_3_FACEBOOK_LIKES,ACTOR_1_FACEBOOK_LIKES,GROSS,NUM_VOTED_USERS,CAST_TOTAL_FACEBOOK_LIKES,FACENUMBER_IN_POSTER,NUM_USER_FOR_REVIEWS,BUDGET,TITLE_YEAR,ACTOR_2_FACEBOOK_LIKES,IMDB_SCORE,TITLE_YEAR_1
1,Spectre?ÿ,602,148,20,161,11000,200074175,275868,11700,1,994,245000000,2015,393,6.8,2015
2,Avengers: Age of Ultron?ÿ,635,141,10,19000,26000,458991599,462669,92000,4,1117,250000000,2015,21000,7.5,2015
3,Quantum of Solace?ÿ,403,106,395,393,451,168368427,330784,2023,1,1243,200000000,2008,412,6.7,2008
4,Pirates of the Caribbean: Dead Man's Chest?ÿ,313,151,563,1000,40000,423032628,522040,48486,2,1832,225000000,2006,5000,7.3,2008


## Duplicates

In [57]:
df_with_dup = readtable("duplicated_data.csv")

Unnamed: 0,movie_title,num_critic_for_reviews,duration,DIRECTOR_facebook_likes,actor_3_facebook_likes,ACTOR_1_facebook_likes,gross,num_voted_users,Cast_Total_facebook_likes,facenumber_in_poster,num_user_for_reviews,budget,title_year,ACTOR_2_facebook_likes,imdb_score,title_year_1
1,Avatar?ÿ,723,178.0,10,855,1000,760505847,886204.0,4834.0,,3054,237000000,2009,936.0,7.9,2009.0
2,Pirates of the Caribbean: At World's End?ÿ,302,,563,1000,40000,309404152,471220.0,48350.0,,1238,300000000,2007,5000.0,7.1,
3,Spectre?ÿ,602,148.0,20,161,11000,200074175,275868.0,11700.0,1.0,994,245000000,2015,393.0,6.8,2015.0
4,The Dark Knight Rises?ÿ,813,,22000,23000,27000,448130642,1144337.0,106759.0,,2701,250000000,2012,23000.0,8.5,
5,John Carter?ÿ,462,132.0,"""475""",530,640,73058679,212204.0,1873.0,1.0,738,263700000,2012,632.0,6.6,
6,Spider-Man 3?ÿ,392,156.0,23,4000,24000,336530303,383056.0,46055.0,,1902,258000000,2007,11000.0,6.2,2007.0
7,Tangled?ÿ,324,,15,284,799,200807262,294810.0,,1.0,387,260000000,2010,553.0,7.8,
8,Avengers: Age of Ultron?ÿ,635,141.0,10,19000,26000,458991599,462669.0,92000.0,4.0,1117,250000000,2015,21000.0,7.5,
9,Avengers: Age of Ultron?ÿ,635,141.0,10,19000,26000,458991599,462669.0,92000.0,4.0,1117,250000000,2015,21000.0,7.5,2015.0
10,Harry Potter and the Half-Blood Prince?ÿ,375,153.0,282,10000,25000,301956980,321795.0,58753.0,3.0,973,250000000,2009,11000.0,7.5,


In [59]:
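# nonunique flags each row that duplicates an earlier row (true = duplicate)
# Rows 8 and 9 look alike but differ in title_year_1, so they are not flagged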
nonunique(df_with_dup)

14-element Array{Bool,1}:
 false
 false
 false
 false
 false
 false
 false
 false
 false
 false
 false
 false
 false
 false

In [64]:
unique(df_with_dup)

Unnamed: 0,movie_title,num_critic_for_reviews,duration,DIRECTOR_facebook_likes,actor_3_facebook_likes,ACTOR_1_facebook_likes,gross,num_voted_users,Cast_Total_facebook_likes,facenumber_in_poster,num_user_for_reviews,budget,title_year,ACTOR_2_facebook_likes,imdb_score,title_year_1
1,Avatar?ÿ,723,178.0,10,855,1000,760505847,886204.0,4834.0,,3054,237000000,2009,936.0,7.9,2009.0
2,Pirates of the Caribbean: At World's End?ÿ,302,,563,1000,40000,309404152,471220.0,48350.0,,1238,300000000,2007,5000.0,7.1,
3,Spectre?ÿ,602,148.0,20,161,11000,200074175,275868.0,11700.0,1.0,994,245000000,2015,393.0,6.8,2015.0
4,The Dark Knight Rises?ÿ,813,,22000,23000,27000,448130642,1144337.0,106759.0,,2701,250000000,2012,23000.0,8.5,
5,John Carter?ÿ,462,132.0,"""475""",530,640,73058679,212204.0,1873.0,1.0,738,263700000,2012,632.0,6.6,
6,Spider-Man 3?ÿ,392,156.0,23,4000,24000,336530303,383056.0,46055.0,,1902,258000000,2007,11000.0,6.2,2007.0
7,Tangled?ÿ,324,,15,284,799,200807262,294810.0,,1.0,387,260000000,2010,553.0,7.8,
8,Avengers: Age of Ultron?ÿ,635,141.0,10,19000,26000,458991599,462669.0,92000.0,4.0,1117,250000000,2015,21000.0,7.5,
9,Avengers: Age of Ultron?ÿ,635,141.0,10,19000,26000,458991599,462669.0,92000.0,4.0,1117,250000000,2015,21000.0,7.5,2015.0
10,Harry Potter and the Half-Blood Prince?ÿ,375,153.0,282,10000,25000,301956980,321795.0,58753.0,3.0,973,250000000,2009,11000.0,7.5,


In [67]:
# Find the duplicate rows by converting the logical mask into linear indices
find(nonunique(df_with_dup))

0-element Array{Int64,1}

In [70]:
# drop_duplicated is not defined in DataFrames, so this call fails; use unique() instead
drop_duplicated(df_with_dup)

LoadError: UndefVarError: drop_duplicated not defined
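
`unique` and `nonunique`, shown earlier, are the DataFrames tools for this job. What they do can be sketched in Base Julia by treating each row as a tuple. This is an illustration on made-up rows, and `duplicate_mask` is a hypothetical helper, not the library implementation:

```julia
# Made-up rows: (title, year)
rows = [("Avatar", 2009), ("Spectre", 2015), ("Avatar", 2009)]

# Hypothetical helper: true for every row already seen earlier
function duplicate_mask(rows)
    seen = Set{eltype(rows)}()
    mask = Bool[]
    for r in rows
        push!(mask, r in seen)   # duplicate if an equal row came before
        push!(seen, r)
    end
    return mask
end

mask = duplicate_mask(rows)
deduped = rows[.!mask]           # keep only first occurrences

println(findall(mask))           # [3]
println(length(deduped))         # 2
```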

## Data Type Inconsistencies

In [71]:
typeof(df)

DataFrames.DataFrame

In [72]:
eltypes(df)

16-element Array{Type,1}:
 String 
 Int64  
 Int64  
 String 
 Int64  
 Int64  
 Int64  
 Int64  
 Int64  
 Int64  
 Int64  
 Int64  
 Int64  
 Int64  
 Float64
 Int64  

In [76]:
showcols(df_with_dup)

14×16 DataFrames.DataFrame
│ Col # │ Name                      │ Eltype  │ Missing │
├───────┼───────────────────────────┼─────────┼─────────┤
│ 1     │ movie_title               │ String  │ 0       │
│ 2     │ num_critic_for_reviews    │ Int64   │ 0       │
│ 3     │ duration                  │ Int64   │ 3       │
│ 4     │ DIRECTOR_facebook_likes   │ String  │ 2       │
│ 5     │ actor_3_facebook_likes    │ Int64   │ 0       │
│ 6     │ ACTOR_1_facebook_likes    │ Int64   │ 0       │
│ 7     │ gross                     │ Int64   │ 0       │
│ 8     │ num_voted_users           │ Int64   │ 1       │
│ 9     │ Cast_Total_facebook_likes │ Int64   │ 2       │
│ 10    │ facenumber_in_poster      │ Int64   │ 5       │
│ 11    │ num_user_for_reviews      │ Int64   │ 0       │
│ 12    │ budget                    │ Int64   │ 0       │
│ 13    │ title_year                │ Int64   │ 0       │
│ 14    │ ACTOR_2_facebook_likes    │ Int64   │ 1       │
│ 15    │ imdb_score                │ Float64

In [75]:
head(df_with_dup)

Unnamed: 0,movie_title,num_critic_for_reviews,duration,DIRECTOR_facebook_likes,actor_3_facebook_likes,ACTOR_1_facebook_likes,gross,num_voted_users,Cast_Total_facebook_likes,facenumber_in_poster,num_user_for_reviews,budget,title_year,ACTOR_2_facebook_likes,imdb_score,title_year_1
1,Avatar?ÿ,723,178.0,10,855,1000,760505847,886204,4834,,3054,237000000,2009,936,7.9,2009.0
2,Pirates of the Caribbean: At World's End?ÿ,302,,563,1000,40000,309404152,471220,48350,,1238,300000000,2007,5000,7.1,
3,Spectre?ÿ,602,148.0,20,161,11000,200074175,275868,11700,1.0,994,245000000,2015,393,6.8,2015.0
4,The Dark Knight Rises?ÿ,813,,22000,23000,27000,448130642,1144337,106759,,2701,250000000,2012,23000,8.5,
5,John Carter?ÿ,462,132.0,"""475""",530,640,73058679,212204,1873,1.0,738,263700000,2012,632,6.6,
6,Spider-Man 3?ÿ,392,156.0,23,4000,24000,336530303,383056,46055,,1902,258000000,2007,11000,6.2,2007.0


In [77]:
typeof(df_with_dup[:gross])

DataArrays.DataArray{Int64,1}

In [80]:
df_with2 = df_with_dup

Unnamed: 0,movie_title,num_critic_for_reviews,duration,DIRECTOR_facebook_likes,actor_3_facebook_likes,ACTOR_1_facebook_likes,gross,num_voted_users,Cast_Total_facebook_likes,facenumber_in_poster,num_user_for_reviews,budget,title_year,ACTOR_2_facebook_likes,imdb_score,title_year_1
1,Avatar?ÿ,723,178.0,10,855,1000,760505847,886204.0,4834.0,,3054,237000000,2009,936.0,7.9,2009.0
2,Pirates of the Caribbean: At World's End?ÿ,302,,563,1000,40000,309404152,471220.0,48350.0,,1238,300000000,2007,5000.0,7.1,
3,Spectre?ÿ,602,148.0,20,161,11000,200074175,275868.0,11700.0,1.0,994,245000000,2015,393.0,6.8,2015.0
4,The Dark Knight Rises?ÿ,813,,22000,23000,27000,448130642,1144337.0,106759.0,,2701,250000000,2012,23000.0,8.5,
5,John Carter?ÿ,462,132.0,"""475""",530,640,73058679,212204.0,1873.0,1.0,738,263700000,2012,632.0,6.6,
6,Spider-Man 3?ÿ,392,156.0,23,4000,24000,336530303,383056.0,46055.0,,1902,258000000,2007,11000.0,6.2,2007.0
7,Tangled?ÿ,324,,15,284,799,200807262,294810.0,,1.0,387,260000000,2010,553.0,7.8,
8,Avengers: Age of Ultron?ÿ,635,141.0,10,19000,26000,458991599,462669.0,92000.0,4.0,1117,250000000,2015,21000.0,7.5,
9,Avengers: Age of Ultron?ÿ,635,141.0,10,19000,26000,458991599,462669.0,92000.0,4.0,1117,250000000,2015,21000.0,7.5,2015.0
10,Harry Potter and the Half-Blood Prince?ÿ,375,153.0,282,10000,25000,301956980,321795.0,58753.0,3.0,973,250000000,2009,11000.0,7.5,


In [None]:
# Convert the column element-wise to Float64
convert.(Float64,df_with_dup[:gross])
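
`convert.(Float64, ...)` works for a numeric column such as `gross`. The `DIRECTOR_facebook_likes` column is a different case: it is read as `String`, apparently because one entry is wrapped in quotes (`"475"` in the tables above), so it has to be parsed rather than converted. Below is a minimal sketch, assuming Julia 1.x; `to_int` is a hypothetical helper and the sample values are shortened from the data:

```julia
# Hypothetical helper: strip surrounding quotes and parse as Int.
# Anything unparseable, or already missing, becomes missing.
to_int(s::AbstractString) = something(tryparse(Int, strip(s, '"')), missing)
to_int(::Missing) = missing

director_likes = ["10", "563", "\"475\"", missing, "n/a"]
parsed = to_int.(director_likes)

println(parsed[3])              # 475
println(ismissing(parsed[5]))   # true

# Widening an integer column to Float64
gross = [760505847, 309404152]
gross_float = Float64.(gross)
```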

In [None]:
# JCharisTech J-Secur1ty
# JCharis Jesse
# Jesus Saves @ JCharisTech