# Data Cleaning In Julia

### Common Issues With Data
+ Reading the file
+ Inconsistent Column Names
+ Missing Data
+ Different Data Types 
+ Duplicate rows
+ etc

In [2]:
# EDA packages
using DataFrames

## Loading or Reading the File
+ Encoding Error
+ Inconsistent rows 

In [3]:
# Issue 1: reading the raw file fails
df = readtable("raw_data_unmodified.csv")

LoadError: ArgumentError: Columns and column index must be the same length

In [4]:
# Solution 1: specify the encoding when reading
df = readtable("raw_data_unmodified.csv",encoding=:latin1)

LoadError: ArgumentError: Argument 'encoding' only supports ':utf8' currently.

In [5]:
df = readtable("raw_data_unmodified.csv",encoding=:utf8)

LoadError: ArgumentError: Columns and column index must be the same length

In [7]:
# Solution 2: re-save the file as UTF-8 in a text editor, then read it again
df = readtable("raw_data_unmodified.csv")

Unnamed: 0,movie_title,num_critic_for_reviews,duration,DIRECTOR_facebook_likes,actor_3_facebook_likes,ACTOR_1_facebook_likes,gross,num_voted_users,Cast_Total_facebook_likes,facenumber_in_poster,num_user_for_reviews,budget,title_year,ACTOR_2_facebook_likes,imdb_score,title_year_1
1,Avatar?ÿ,723,178.0,10,855,1000,760505847,886204.0,4834.0,,3054,237000000,2009,936.0,7.9,2009.0
2,Pirates of the Caribbean: At World's End?ÿ,302,,563,1000,40000,309404152,471220.0,48350.0,,1238,300000000,2007,5000.0,7.1,
3,Spectre?ÿ,602,148.0,20,161,11000,200074175,275868.0,11700.0,1.0,994,245000000,2015,393.0,6.8,2015.0
4,The Dark Knight Rises?ÿ,813,,22000,23000,27000,448130642,1144337.0,106759.0,,2701,250000000,2012,23000.0,8.5,
5,John Carter?ÿ,462,132.0,"""475""",530,640,73058679,212204.0,1873.0,1.0,738,263700000,2012,632.0,6.6,
6,Spider-Man 3?ÿ,392,156.0,23,4000,24000,336530303,383056.0,46055.0,,1902,258000000,2007,11000.0,6.2,2007.0
7,Tangled?ÿ,324,,15,284,799,200807262,294810.0,,1.0,387,260000000,2010,553.0,7.8,
8,Avengers: Age of Ultron?ÿ,635,141.0,10,19000,26000,458991599,462669.0,92000.0,4.0,1117,250000000,2015,21000.0,7.5,
9,Avengers: Age of Ultron?ÿ,635,141.0,10,19000,26000,458991599,462669.0,92000.0,4.0,1117,250000000,2015,21000.0,7.5,2015.0
10,Harry Potter and the Half-Blood Prince?ÿ,375,153.0,282,10000,25000,301956980,321795.0,58753.0,3.0,973,250000000,2009,11000.0,7.5,
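
The re-encoding step done in a text editor above can also be scripted. This is a minimal sketch in base Julia, assuming the raw file is Latin-1 (ISO-8859-1) encoded, which the notebook does not confirm. The helper name `latin1_to_utf8` and the temporary demo files are hypothetical.

```julia
# Hypothetical helper: convert a Latin-1 (ISO-8859-1) file to UTF-8.
# Every Latin-1 byte equals the Unicode code point of the same value,
# so Char(byte) decodes it, and Julia writes strings out as UTF-8.
function latin1_to_utf8(src::AbstractString, dst::AbstractString)
    bytes = read(src)                     # raw bytes as Vector{UInt8}
    text = join(Char(b) for b in bytes)   # decode byte by byte
    write(dst, text)                      # strings are written as UTF-8
    return dst
end

# Small demonstration on a temporary file holding "Aé" in Latin-1
src = tempname()
dst = tempname()
write(src, UInt8[0x41, 0xe9])
latin1_to_utf8(src, dst)
converted = read(dst, String)
println(converted)
```

If the file uses a different encoding, this byte-to-code-point shortcut does not apply and a dedicated conversion tool would be needed.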


## Inconsistent Column Names
+ Change Cases
+ Rename them

### Change the case to Upper

In [8]:
head(df,4)

Unnamed: 0,movie_title,num_critic_for_reviews,duration,DIRECTOR_facebook_likes,actor_3_facebook_likes,ACTOR_1_facebook_likes,gross,num_voted_users,Cast_Total_facebook_likes,facenumber_in_poster,num_user_for_reviews,budget,title_year,ACTOR_2_facebook_likes,imdb_score,title_year_1
1,Avatar?ÿ,723,178.0,10,855,1000,760505847,886204,4834,,3054,237000000,2009,936,7.9,2009.0
2,Pirates of the Caribbean: At World's End?ÿ,302,,563,1000,40000,309404152,471220,48350,,1238,300000000,2007,5000,7.1,
3,Spectre?ÿ,602,148.0,20,161,11000,200074175,275868,11700,1.0,994,245000000,2015,393,6.8,2015.0
4,The Dark Knight Rises?ÿ,813,,22000,23000,27000,448130642,1144337,106759,,2701,250000000,2012,23000,8.5,


In [9]:
names(df)

16-element Array{Symbol,1}:
 :movie_title              
 :num_critic_for_reviews   
 :duration                 
 :DIRECTOR_facebook_likes  
 :actor_3_facebook_likes   
 :ACTOR_1_facebook_likes   
 :gross                    
 :num_voted_users          
 :Cast_Total_facebook_likes
 :facenumber_in_poster     
 :num_user_for_reviews     
 :budget                   
 :title_year               
 :ACTOR_2_facebook_likes   
 :imdb_score               
 :title_year_1             

In [10]:
uppercase(string(names(df)))

"SYMBOL[:MOVIE_TITLE, :NUM_CRITIC_FOR_REVIEWS, :DURATION, :DIRECTOR_FACEBOOK_LIKES, :ACTOR_3_FACEBOOK_LIKES, :ACTOR_1_FACEBOOK_LIKES, :GROSS, :NUM_VOTED_USERS, :CAST_TOTAL_FACEBOOK_LIKES, :FACENUMBER_IN_POSTER, :NUM_USER_FOR_REVIEWS, :BUDGET, :TITLE_YEAR, :ACTOR_2_FACEBOOK_LIKES, :IMDB_SCORE, :TITLE_YEAR_1]"

In [None]:
# Use the names! function and pass in an array of Symbols
# NB: names!() changes the original DataFrame in place
# names!(dataframe, [array_of_names])

In [160]:
lowercase(string(names(df)))

"symbol[:movie_title, :num_critic_for_reviews, :time, :director_facebook_likes, :actor_3_facebook_likes, :actor_1_facebook_likes, :gross, :num_voted_users, :cast_total_facebook_likes, :facenumber_in_poster, :num_user_for_reviews, :budget, :title_year, :actor_2_facebook_likes, :imdb_score, :title_year_1]"

In [165]:
names!(df[1:end],Array(lowercase(string(names(df)))))

LoadError: MethodError: Cannot `convert` an object of type String to an object of type Array
This may have arisen from a call to the constructor Array(...),
since type constructors fall back to convert methods.
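
The call above fails because `string(names(df))` turns the whole array into one string. A minimal sketch of an element-wise alternative, using only base Julia broadcasting; the short `col_names` vector is an illustrative subset of the real column names.

```julia
# Example column names taken from the table above
col_names = [:movie_title, :DIRECTOR_facebook_likes, :Cast_Total_facebook_likes]

# Convert each Symbol to a String, change its case, and convert it back
upper_names = Symbol.(uppercase.(string.(col_names)))
lower_names = Symbol.(lowercase.(string.(col_names)))

println(upper_names)
println(lower_names)
```

The result is an array of Symbols, which is the form `names!(dataframe, [array_of_names])` expects, so the names no longer have to be typed out by hand.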

In [12]:
names!(df,[:MOVIE_TITLE, :NUM_CRITIC_FOR_REVIEWS, :DURATION, :DIRECTOR_FACEBOOK_LIKES, :ACTOR_3_FACEBOOK_LIKES, :ACTOR_1_FACEBOOK_LIKES, :GROSS, :NUM_VOTED_USERS, :CAST_TOTAL_FACEBOOK_LIKES, :FACENUMBER_IN_POSTER, :NUM_USER_FOR_REVIEWS, :BUDGET, :TITLE_YEAR, :ACTOR_2_FACEBOOK_LIKES, :IMDB_SCORE, :TITLE_YEAR_1])

Unnamed: 0,MOVIE_TITLE,NUM_CRITIC_FOR_REVIEWS,DURATION,DIRECTOR_FACEBOOK_LIKES,ACTOR_3_FACEBOOK_LIKES,ACTOR_1_FACEBOOK_LIKES,GROSS,NUM_VOTED_USERS,CAST_TOTAL_FACEBOOK_LIKES,FACENUMBER_IN_POSTER,NUM_USER_FOR_REVIEWS,BUDGET,TITLE_YEAR,ACTOR_2_FACEBOOK_LIKES,IMDB_SCORE,TITLE_YEAR_1
1,Avatar?ÿ,723,178.0,10,855,1000,760505847,886204.0,4834.0,,3054,237000000,2009,936.0,7.9,2009.0
2,Pirates of the Caribbean: At World's End?ÿ,302,,563,1000,40000,309404152,471220.0,48350.0,,1238,300000000,2007,5000.0,7.1,
3,Spectre?ÿ,602,148.0,20,161,11000,200074175,275868.0,11700.0,1.0,994,245000000,2015,393.0,6.8,2015.0
4,The Dark Knight Rises?ÿ,813,,22000,23000,27000,448130642,1144337.0,106759.0,,2701,250000000,2012,23000.0,8.5,
5,John Carter?ÿ,462,132.0,"""475""",530,640,73058679,212204.0,1873.0,1.0,738,263700000,2012,632.0,6.6,
6,Spider-Man 3?ÿ,392,156.0,23,4000,24000,336530303,383056.0,46055.0,,1902,258000000,2007,11000.0,6.2,2007.0
7,Tangled?ÿ,324,,15,284,799,200807262,294810.0,,1.0,387,260000000,2010,553.0,7.8,
8,Avengers: Age of Ultron?ÿ,635,141.0,10,19000,26000,458991599,462669.0,92000.0,4.0,1117,250000000,2015,21000.0,7.5,
9,Avengers: Age of Ultron?ÿ,635,141.0,10,19000,26000,458991599,462669.0,92000.0,4.0,1117,250000000,2015,21000.0,7.5,2015.0
10,Harry Potter and the Half-Blood Prince?ÿ,375,153.0,282,10000,25000,301956980,321795.0,58753.0,3.0,973,250000000,2009,11000.0,7.5,


In [13]:
names(df)

16-element Array{Symbol,1}:
 :MOVIE_TITLE              
 :NUM_CRITIC_FOR_REVIEWS   
 :DURATION                 
 :DIRECTOR_FACEBOOK_LIKES  
 :ACTOR_3_FACEBOOK_LIKES   
 :ACTOR_1_FACEBOOK_LIKES   
 :GROSS                    
 :NUM_VOTED_USERS          
 :CAST_TOTAL_FACEBOOK_LIKES
 :FACENUMBER_IN_POSTER     
 :NUM_USER_FOR_REVIEWS     
 :BUDGET                   
 :TITLE_YEAR               
 :ACTOR_2_FACEBOOK_LIKES   
 :IMDB_SCORE               
 :TITLE_YEAR_1             

### Renaming Columns
+ rename() returns a renamed copy
+ rename!() changes the original in place

In [14]:
rename!(df,:DURATION,:TIME)

Unnamed: 0,MOVIE_TITLE,NUM_CRITIC_FOR_REVIEWS,TIME,DIRECTOR_FACEBOOK_LIKES,ACTOR_3_FACEBOOK_LIKES,ACTOR_1_FACEBOOK_LIKES,GROSS,NUM_VOTED_USERS,CAST_TOTAL_FACEBOOK_LIKES,FACENUMBER_IN_POSTER,NUM_USER_FOR_REVIEWS,BUDGET,TITLE_YEAR,ACTOR_2_FACEBOOK_LIKES,IMDB_SCORE,TITLE_YEAR_1
1,Avatar?ÿ,723,178.0,10,855,1000,760505847,886204.0,4834.0,,3054,237000000,2009,936.0,7.9,2009.0
2,Pirates of the Caribbean: At World's End?ÿ,302,,563,1000,40000,309404152,471220.0,48350.0,,1238,300000000,2007,5000.0,7.1,
3,Spectre?ÿ,602,148.0,20,161,11000,200074175,275868.0,11700.0,1.0,994,245000000,2015,393.0,6.8,2015.0
4,The Dark Knight Rises?ÿ,813,,22000,23000,27000,448130642,1144337.0,106759.0,,2701,250000000,2012,23000.0,8.5,
5,John Carter?ÿ,462,132.0,"""475""",530,640,73058679,212204.0,1873.0,1.0,738,263700000,2012,632.0,6.6,
6,Spider-Man 3?ÿ,392,156.0,23,4000,24000,336530303,383056.0,46055.0,,1902,258000000,2007,11000.0,6.2,2007.0
7,Tangled?ÿ,324,,15,284,799,200807262,294810.0,,1.0,387,260000000,2010,553.0,7.8,
8,Avengers: Age of Ultron?ÿ,635,141.0,10,19000,26000,458991599,462669.0,92000.0,4.0,1117,250000000,2015,21000.0,7.5,
9,Avengers: Age of Ultron?ÿ,635,141.0,10,19000,26000,458991599,462669.0,92000.0,4.0,1117,250000000,2015,21000.0,7.5,2015.0
10,Harry Potter and the Half-Blood Prince?ÿ,375,153.0,282,10000,25000,301956980,321795.0,58753.0,3.0,973,250000000,2009,11000.0,7.5,


## Missing Data
+ Add a default value for missing data or use mean to fill it
+ Delete the row/column with missing data
+ Interpolate the rows
+ Replace

#### To check for missing data
#### A result of false means no missing data
+ showcols()
+ isna/ismissing
+ isnan
+ .na
+ completecases

In [69]:
# Specify a group of strings to be converted to NA values during reading:
df1 = readtable("raw_data_unmodified.csv", nastrings=["NA", "na", "n/a", "missing"])

Unnamed: 0,movie_title,num_critic_for_reviews,duration,DIRECTOR_facebook_likes,actor_3_facebook_likes,ACTOR_1_facebook_likes,gross,num_voted_users,Cast_Total_facebook_likes,facenumber_in_poster,num_user_for_reviews,budget,title_year,ACTOR_2_facebook_likes,imdb_score,title_year_1
1,Avatar?ÿ,723,178.0,10,855,1000,760505847,886204.0,4834.0,,3054,237000000,2009,936.0,7.9,2009.0
2,Pirates of the Caribbean: At World's End?ÿ,302,,563,1000,40000,309404152,471220.0,48350.0,,1238,300000000,2007,5000.0,7.1,
3,Spectre?ÿ,602,148.0,20,161,11000,200074175,275868.0,11700.0,1.0,994,245000000,2015,393.0,6.8,2015.0
4,The Dark Knight Rises?ÿ,813,,22000,23000,27000,448130642,1144337.0,106759.0,,2701,250000000,2012,23000.0,8.5,
5,John Carter?ÿ,462,132.0,"""475""",530,640,73058679,212204.0,1873.0,1.0,738,263700000,2012,632.0,6.6,
6,Spider-Man 3?ÿ,392,156.0,23,4000,24000,336530303,383056.0,46055.0,,1902,258000000,2007,11000.0,6.2,2007.0
7,Tangled?ÿ,324,,15,284,799,200807262,294810.0,,1.0,387,260000000,2010,553.0,7.8,
8,Avengers: Age of Ultron?ÿ,635,141.0,10,19000,26000,458991599,462669.0,92000.0,4.0,1117,250000000,2015,21000.0,7.5,
9,Avengers: Age of Ultron?ÿ,635,141.0,10,19000,26000,458991599,462669.0,92000.0,4.0,1117,250000000,2015,21000.0,7.5,2015.0
10,Harry Potter and the Half-Blood Prince?ÿ,375,153.0,282,10000,25000,301956980,321795.0,58753.0,3.0,973,250000000,2009,11000.0,7.5,


In [70]:
# Checking For Missing Data
showcols(df1)


14×16 DataFrames.DataFrame
│ Col # │ Name                      │ Eltype  │ Missing │
├───────┼───────────────────────────┼─────────┼─────────┤
│ 1     │ movie_title               │ String  │ 0       │
│ 2     │ num_critic_for_reviews    │ Int64   │ 0       │
│ 3     │ duration                  │ Int64   │ 3       │
│ 4     │ DIRECTOR_facebook_likes   │ String  │ 2       │
│ 5     │ actor_3_facebook_likes    │ Int64   │ 0       │
│ 6     │ ACTOR_1_facebook_likes    │ Int64   │ 0       │
│ 7     │ gross                     │ Int64   │ 0       │
│ 8     │ num_voted_users           │ Int64   │ 1       │
│ 9     │ Cast_Total_facebook_likes │ Int64   │ 2       │
│ 10    │ facenumber_in_poster      │ Int64   │ 5       │
│ 11    │ num_user_for_reviews      │ Int64   │ 0       │
│ 12    │ budget                    │ Int64   │ 0       │
│ 13    │ title_year                │ Int64   │ 0       │
│ 14    │ ACTOR_2_facebook_likes    │ Int64   │ 1       │
│ 15    │ imdb_score                │ Float64

##### Checking with isna()

In [71]:
# isna() on a whole DataFrame only tests the object itself, so it returns false
isna(df1)

false

In [72]:
# A plain array cannot hold NA: the blank string " " is an ordinary value
# isna returns false for every element of such an array
x = [1,2," ",4]

4-element Array{Any,1}:
 1   
 2   
  " "
 4   

In [76]:
isna(x,4)

false

In [77]:
# Broadcast isna over the whole array with the dot syntax
isna.(x)

4-element BitArray{1}:
 false
 false
 false
 false

In [78]:
# Use the @data macro to build an array that can hold NA
y = @data([1,2,NA,4])

4-element DataArrays.DataArray{Int64,1}:
 1  
 2  
  NA
 4  

In [79]:
# With a DataArray, isna detects the NA at index 3
isna(y,3)

true

In [80]:
isna.(y)

4-element BitArray{1}:
 false
 false
  true
 false

In [81]:
# Return the indices of the NA entries
find(isna.(y))

1-element Array{Int64,1}:
 3

In [82]:
sum(isna.(y))

1
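
The same checks can be sketched in base Julia 1.x, where `missing` plays the role that NA plays in this notebook. This assumes a newer Julia than the one used here; the variable names are illustrative.

```julia
# A vector with one missing entry, mirroring @data([1,2,NA,4])
y = [1, 2, missing, 4]

mask = ismissing.(y)        # element-wise check, like isna.(y)
positions = findall(mask)   # indices of missing entries, like find(isna.(y))
n_missing = count(mask)     # number of missing entries, like sum(isna.(y))

println(mask)
println(positions)
println(n_missing)
```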

### Back to the DataFrame

In [83]:
head(df1)

Unnamed: 0,movie_title,num_critic_for_reviews,duration,DIRECTOR_facebook_likes,actor_3_facebook_likes,ACTOR_1_facebook_likes,gross,num_voted_users,Cast_Total_facebook_likes,facenumber_in_poster,num_user_for_reviews,budget,title_year,ACTOR_2_facebook_likes,imdb_score,title_year_1
1,Avatar?ÿ,723,178.0,10,855,1000,760505847,886204,4834,,3054,237000000,2009,936,7.9,2009.0
2,Pirates of the Caribbean: At World's End?ÿ,302,,563,1000,40000,309404152,471220,48350,,1238,300000000,2007,5000,7.1,
3,Spectre?ÿ,602,148.0,20,161,11000,200074175,275868,11700,1.0,994,245000000,2015,393,6.8,2015.0
4,The Dark Knight Rises?ÿ,813,,22000,23000,27000,448130642,1144337,106759,,2701,250000000,2012,23000,8.5,
5,John Carter?ÿ,462,132.0,"""475""",530,640,73058679,212204,1873,1.0,738,263700000,2012,632,6.6,
6,Spider-Man 3?ÿ,392,156.0,23,4000,24000,336530303,383056,46055,,1902,258000000,2007,11000,6.2,2007.0


In [84]:
names!(df1,[:MOVIE_TITLE, :NUM_CRITIC_FOR_REVIEWS, :DURATION, :DIRECTOR_FACEBOOK_LIKES, :ACTOR_3_FACEBOOK_LIKES, :ACTOR_1_FACEBOOK_LIKES, :GROSS, :NUM_VOTED_USERS, :CAST_TOTAL_FACEBOOK_LIKES, :FACENUMBER_IN_POSTER, :NUM_USER_FOR_REVIEWS, :BUDGET, :TITLE_YEAR, :ACTOR_2_FACEBOOK_LIKES, :IMDB_SCORE, :TITLE_YEAR_1])

Unnamed: 0,MOVIE_TITLE,NUM_CRITIC_FOR_REVIEWS,DURATION,DIRECTOR_FACEBOOK_LIKES,ACTOR_3_FACEBOOK_LIKES,ACTOR_1_FACEBOOK_LIKES,GROSS,NUM_VOTED_USERS,CAST_TOTAL_FACEBOOK_LIKES,FACENUMBER_IN_POSTER,NUM_USER_FOR_REVIEWS,BUDGET,TITLE_YEAR,ACTOR_2_FACEBOOK_LIKES,IMDB_SCORE,TITLE_YEAR_1
1,Avatar?ÿ,723,178.0,10,855,1000,760505847,886204.0,4834.0,,3054,237000000,2009,936.0,7.9,2009.0
2,Pirates of the Caribbean: At World's End?ÿ,302,,563,1000,40000,309404152,471220.0,48350.0,,1238,300000000,2007,5000.0,7.1,
3,Spectre?ÿ,602,148.0,20,161,11000,200074175,275868.0,11700.0,1.0,994,245000000,2015,393.0,6.8,2015.0
4,The Dark Knight Rises?ÿ,813,,22000,23000,27000,448130642,1144337.0,106759.0,,2701,250000000,2012,23000.0,8.5,
5,John Carter?ÿ,462,132.0,"""475""",530,640,73058679,212204.0,1873.0,1.0,738,263700000,2012,632.0,6.6,
6,Spider-Man 3?ÿ,392,156.0,23,4000,24000,336530303,383056.0,46055.0,,1902,258000000,2007,11000.0,6.2,2007.0
7,Tangled?ÿ,324,,15,284,799,200807262,294810.0,,1.0,387,260000000,2010,553.0,7.8,
8,Avengers: Age of Ultron?ÿ,635,141.0,10,19000,26000,458991599,462669.0,92000.0,4.0,1117,250000000,2015,21000.0,7.5,
9,Avengers: Age of Ultron?ÿ,635,141.0,10,19000,26000,458991599,462669.0,92000.0,4.0,1117,250000000,2015,21000.0,7.5,2015.0
10,Harry Potter and the Half-Blood Prince?ÿ,375,153.0,282,10000,25000,301956980,321795.0,58753.0,3.0,973,250000000,2009,11000.0,7.5,


In [85]:
# Return the row indices where DURATION is NA
find(isna.(df1[:, :DURATION]))

3-element Array{Int64,1}:
 2
 4
 7

In [86]:
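# NB: indexing with a single argument selects columns 2, 4 and 7, not rows
# To select the rows instead, index as df1[find(isna.(df1[:, :DURATION])), :]
df[find(isna.(df1[:, :DURATION]))]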

Unnamed: 0,NUM_CRITIC_FOR_REVIEWS,DIRECTOR_FACEBOOK_LIKES,GROSS
1,723,10,760505847
2,302,563,309404152
3,602,20,200074175
4,813,22000,448130642
5,462,"""475""",73058679
6,392,23,336530303
7,324,15,200807262
8,635,10,458991599
9,635,10,458991599
10,375,282,301956980


In [88]:
# Loop over the rows and print DURATION, skipping missing values
nrows, ncols = size(df1) # number of rows and columns
# nrow(df1) also returns the number of rows
for row in 1:nrows
    if isna(df1[row, :DURATION])
        println("skipping missing values")
    else
        println("the value is $(df1[row, :DURATION])")
    end
end


the value is 178
skipping missing values
the value is 148
skipping missing values
the value is 132
the value is 156
skipping missing values
the value is 141
the value is 141
the value is 153
the value is 183
the value is 169
the value is 106
the value is 151


#### Julia Tricks
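+ isna() checks a single value
+ isna.() checks every element and returns a Boolean array
+ .!isna.() negates that array: true marks the values that are present
+ completecases() returns a Boolean array marking the rows with no missing values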

In [89]:
.!isna.(df1[:DURATION])

14-element DataArrays.DataArray{Bool,1}:
  true
 false
  true
 false
  true
  true
 false
  true
  true
  true
  true
  true
  true
  true
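
A Boolean mask like the one above can be used directly as an index to keep only the values that are present. A minimal sketch in base Julia 1.x, assuming `missing` in place of NA; the vector holds the first eight DURATION values from the table.

```julia
# DURATION values of the first eight rows, with `missing` for the blanks
duration = [178, missing, 148, missing, 132, 156, missing, 141]

present = .!ismissing.(duration)   # true where a value exists
kept = duration[present]           # keep only the non-missing entries

println(present)
println(kept)
```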

## Solutions for Missing Values
+ Fill with a default value

In [90]:
showcols(df1)

14×16 DataFrames.DataFrame
│ Col # │ Name                      │ Eltype  │ Missing │
├───────┼───────────────────────────┼─────────┼─────────┤
│ 1     │ MOVIE_TITLE               │ String  │ 0       │
│ 2     │ NUM_CRITIC_FOR_REVIEWS    │ Int64   │ 0       │
│ 3     │ DURATION                  │ Int64   │ 3       │
│ 4     │ DIRECTOR_FACEBOOK_LIKES   │ String  │ 2       │
│ 5     │ ACTOR_3_FACEBOOK_LIKES    │ Int64   │ 0       │
│ 6     │ ACTOR_1_FACEBOOK_LIKES    │ Int64   │ 0       │
│ 7     │ GROSS                     │ Int64   │ 0       │
│ 8     │ NUM_VOTED_USERS           │ Int64   │ 1       │
│ 9     │ CAST_TOTAL_FACEBOOK_LIKES │ Int64   │ 2       │
│ 10    │ FACENUMBER_IN_POSTER      │ Int64   │ 5       │
│ 11    │ NUM_USER_FOR_REVIEWS      │ Int64   │ 0       │
│ 12    │ BUDGET                    │ Int64   │ 0       │
│ 13    │ TITLE_YEAR                │ Int64   │ 0       │
│ 14    │ ACTOR_2_FACEBOOK_LIKES    │ Int64   │ 1       │
│ 15    │ IMDB_SCORE                │ Float64

In [99]:
# Fill the missing DURATION values with a default (sentinel) value
df1[isna.(df1[:DURATION]),:DURATION] = 9999

9999

In [101]:
df1[:DURATION]

14-element DataArrays.DataArray{Int64,1}:
  178
 9999
  148
 9999
  132
  156
 9999
  141
  141
  153
  183
  169
  106
  151
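
The default-value fill can also be sketched in base Julia 1.x with `coalesce`, which returns its first non-missing argument. This assumes `missing` in place of NA; the short vector is an illustrative subset of the column.

```julia
# First five DURATION values, with `missing` for the blanks
duration = [178, missing, 148, missing, 132]

# Broadcasting coalesce replaces each missing entry with 9999
# and leaves the other values unchanged
filled = coalesce.(duration, 9999)

println(filled)
```

A sentinel such as 9999 is easy to spot, but it distorts statistics like the mean if it is left in the column.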

In [125]:
# Filling with the mean
showcols(df1)

14×16 DataFrames.DataFrame
│ Col # │ Name                      │ Eltype  │ Missing │
├───────┼───────────────────────────┼─────────┼─────────┤
│ 1     │ MOVIE_TITLE               │ String  │ 0       │
│ 2     │ NUM_CRITIC_FOR_REVIEWS    │ Int64   │ 0       │
│ 3     │ DURATION                  │ Int64   │ 0       │
│ 4     │ DIRECTOR_FACEBOOK_LIKES   │ String  │ 2       │
│ 5     │ ACTOR_3_FACEBOOK_LIKES    │ Int64   │ 0       │
│ 6     │ ACTOR_1_FACEBOOK_LIKES    │ Int64   │ 0       │
│ 7     │ GROSS                     │ Int64   │ 0       │
│ 8     │ NUM_VOTED_USERS           │ Int64   │ 1       │
│ 9     │ CAST_TOTAL_FACEBOOK_LIKES │ Int64   │ 2       │
│ 10    │ FACENUMBER_IN_POSTER      │ Int64   │ 5       │
│ 11    │ NUM_USER_FOR_REVIEWS      │ Int64   │ 0       │
│ 12    │ BUDGET                    │ Int64   │ 0       │
│ 13    │ TITLE_YEAR                │ Int64   │ 0       │
│ 14    │ ACTOR_2_FACEBOOK_LIKES    │ Int64   │ 1       │
│ 15    │ IMDB_SCORE                │ Float64

In [127]:
df1[:NUM_VOTED_USERS]

14-element DataArrays.DataArray{Int64,1}:
  886204  
  471220  
  275868  
 1144337  
  212204  
  383056  
  294810  
  462669  
  462669  
  321795  
        NA
  240396  
  330784  
  522040  

In [128]:
mean(df1[:NUM_VOTED_USERS])

NA
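
The mean comes back as NA because a single missing entry propagates through the calculation. A minimal sketch of skipping the missing entry and then filling it with the mean, written for Julia 1.x with `missing` in place of NA; the values are the NUM_VOTED_USERS column shown above.

```julia
using Statistics  # standard library, provides mean

# NUM_VOTED_USERS with the missing entry at index 11
votes = [886204, 471220, 275868, 1144337, 212204, 383056, 294810,
         462669, 462669, 321795, missing, 240396, 330784, 522040]

propagated = mean(votes)            # a missing entry makes the result missing
avg = mean(skipmissing(votes))      # mean over the 13 present values

# Round the mean so the column stays integer-valued, then fill the gap
filled = coalesce.(votes, round(Int, avg))

println(avg)
println(filled[11])
```

Rounding is a choice made for this sketch; truncating the mean instead gives 462157.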

In [130]:
# Find the row index of the missing value
find(isna.(df1[:NUM_VOTED_USERS]))

1-element Array{Int64,1}:
 11

In [131]:
df1[[1,2,3,4,5,6,7,8,9,10,12,13,14],[:NUM_VOTED_USERS]]

Unnamed: 0,NUM_VOTED_USERS
1,886204
2,471220
3,275868
4,1144337
5,212204
6,383056
7,294810
8,462669
9,462669
10,321795


In [134]:
# Select every row except row 11 with a Boolean mask
df1[1:end .!=11,[:NUM_VOTED_USERS]]

Unnamed: 0,NUM_VOTED_USERS
1,886204
2,471220
3,275868
4,1144337
5,212204
6,383056
7,294810
8,462669
9,462669
10,321795


In [133]:
describe(df1[[1,2,3,4,5,6,7,8,9,10,12,13,14],[:NUM_VOTED_USERS]])

NUM_VOTED_USERS
Summary Stats:
Mean:           462157.846154
Minimum:        212204.000000
1st Quartile:   294810.000000
Median:         383056.000000
3rd Quartile:   471220.000000
Maximum:        1144337.000000
Length:         13
Type:           Int64
Number Missing: 0
% Missing:      0.000000



In [None]:
# Fill the missing value with the mean computed above (truncated to an integer)
#df1[isna.(df1[:NUM_VOTED_USERS]),:NUM_VOTED_USERS] = 462157

### Deleting Rows or Dropping Rows with NA

In [None]:
df1 = readtable("raw_data_unmodified.csv", nastrings=["NA", "na", "n/a", "missing"])

In [139]:
head(df1)

Unnamed: 0,MOVIE_TITLE,NUM_CRITIC_FOR_REVIEWS,DURATION,DIRECTOR_FACEBOOK_LIKES,ACTOR_3_FACEBOOK_LIKES,ACTOR_1_FACEBOOK_LIKES,GROSS,NUM_VOTED_USERS,CAST_TOTAL_FACEBOOK_LIKES,FACENUMBER_IN_POSTER,NUM_USER_FOR_REVIEWS,BUDGET,TITLE_YEAR,ACTOR_2_FACEBOOK_LIKES,IMDB_SCORE,TITLE_YEAR_1
1,Avatar?ÿ,723,178,10,855,1000,760505847,886204,4834,,3054,237000000,2009,936,7.9,2009.0
2,Pirates of the Caribbean: At World's End?ÿ,302,9999,563,1000,40000,309404152,471220,48350,,1238,300000000,2007,5000,7.1,
3,Spectre?ÿ,602,148,20,161,11000,200074175,275868,11700,1.0,994,245000000,2015,393,6.8,2015.0
4,The Dark Knight Rises?ÿ,813,9999,22000,23000,27000,448130642,1144337,106759,,2701,250000000,2012,23000,8.5,
5,John Carter?ÿ,462,132,"""475""",530,640,73058679,212204,1873,1.0,738,263700000,2012,632,6.6,
6,Spider-Man 3?ÿ,392,156,23,4000,24000,336530303,383056,46055,,1902,258000000,2007,11000,6.2,2007.0


In [141]:
showcols(df1)

14×16 DataFrames.DataFrame
│ Col # │ Name                      │ Eltype  │ Missing │
├───────┼───────────────────────────┼─────────┼─────────┤
│ 1     │ MOVIE_TITLE               │ String  │ 0       │
│ 2     │ NUM_CRITIC_FOR_REVIEWS    │ Int64   │ 0       │
│ 3     │ DURATION                  │ Int64   │ 0       │
│ 4     │ DIRECTOR_FACEBOOK_LIKES   │ String  │ 2       │
│ 5     │ ACTOR_3_FACEBOOK_LIKES    │ Int64   │ 0       │
│ 6     │ ACTOR_1_FACEBOOK_LIKES    │ Int64   │ 0       │
│ 7     │ GROSS                     │ Int64   │ 0       │
│ 8     │ NUM_VOTED_USERS           │ Int64   │ 1       │
│ 9     │ CAST_TOTAL_FACEBOOK_LIKES │ Int64   │ 2       │
│ 10    │ FACENUMBER_IN_POSTER      │ Int64   │ 5       │
│ 11    │ NUM_USER_FOR_REVIEWS      │ Int64   │ 0       │
│ 12    │ BUDGET                    │ Int64   │ 0       │
│ 13    │ TITLE_YEAR                │ Int64   │ 0       │
│ 14    │ ACTOR_2_FACEBOOK_LIKES    │ Int64   │ 1       │
│ 15    │ IMDB_SCORE                │ Float64

In [143]:
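# dropna() removes NA values; newer DataFrames versions call it dropmissing
# Likewise, isna is called ismissing in newer versions
dropna()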

In [144]:
a = @data([1,2,3,5,NA,6,NA,10])

8-element DataArrays.DataArray{Int64,1}:
  1  
  2  
  3  
  5  
   NA
  6  
   NA
 10  

In [145]:
dropna(a)

6-element Array{Int64,1}:
  1
  2
  3
  5
  6
 10

In [146]:
a

8-element DataArrays.DataArray{Int64,1}:
  1  
  2  
  3  
  5  
   NA
  6  
   NA
 10  
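
As the output shows, `dropna(a)` returns a new array and leaves `a` unchanged. The same non-mutating pattern can be sketched in base Julia 1.x with `skipmissing`, assuming `missing` in place of NA.

```julia
# Same values as the array above, with `missing` in place of NA
a = [1, 2, 3, 5, missing, 6, missing, 10]

# skipmissing is a lazy wrapper; collect materializes a new Vector{Int}
cleaned = collect(skipmissing(a))

println(cleaned)
println(length(a))   # the original vector still has all 8 entries
```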

In [157]:
# Drop every row that contains at least one NA, in place
# pandas equivalent: df.dropna(how="any")
completecases!(df1)

Unnamed: 0,MOVIE_TITLE,NUM_CRITIC_FOR_REVIEWS,DURATION,DIRECTOR_FACEBOOK_LIKES,ACTOR_3_FACEBOOK_LIKES,ACTOR_1_FACEBOOK_LIKES,GROSS,NUM_VOTED_USERS,CAST_TOTAL_FACEBOOK_LIKES,FACENUMBER_IN_POSTER,NUM_USER_FOR_REVIEWS,BUDGET,TITLE_YEAR,ACTOR_2_FACEBOOK_LIKES,IMDB_SCORE,TITLE_YEAR_1
1,Spectre?ÿ,602,148,20,161,11000,200074175,275868,11700,1,994,245000000,2015,393,6.8,2015
2,Avengers: Age of Ultron?ÿ,635,141,10,19000,26000,458991599,462669,92000,4,1117,250000000,2015,21000,7.5,2015
3,Quantum of Solace?ÿ,403,106,395,393,451,168368427,330784,2023,1,1243,200000000,2008,412,6.7,2008
4,Pirates of the Caribbean: Dead Man's Chest?ÿ,313,151,563,1000,40000,423032628,522040,48486,2,1832,225000000,2006,5000,7.3,2008


In [158]:
df1

Unnamed: 0,MOVIE_TITLE,NUM_CRITIC_FOR_REVIEWS,DURATION,DIRECTOR_FACEBOOK_LIKES,ACTOR_3_FACEBOOK_LIKES,ACTOR_1_FACEBOOK_LIKES,GROSS,NUM_VOTED_USERS,CAST_TOTAL_FACEBOOK_LIKES,FACENUMBER_IN_POSTER,NUM_USER_FOR_REVIEWS,BUDGET,TITLE_YEAR,ACTOR_2_FACEBOOK_LIKES,IMDB_SCORE,TITLE_YEAR_1
1,Spectre?ÿ,602,148,20,161,11000,200074175,275868,11700,1,994,245000000,2015,393,6.8,2015
2,Avengers: Age of Ultron?ÿ,635,141,10,19000,26000,458991599,462669,92000,4,1117,250000000,2015,21000,7.5,2015
3,Quantum of Solace?ÿ,403,106,395,393,451,168368427,330784,2023,1,1243,200000000,2008,412,6.7,2008
4,Pirates of the Caribbean: Dead Man's Chest?ÿ,313,151,563,1000,40000,423032628,522040,48486,2,1832,225000000,2006,5000,7.3,2008
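
The idea behind keeping only complete rows can be sketched without any package. This is an illustration in base Julia 1.x, not how DataFrames implements it: the rows are NamedTuples with a cut-down set of columns, and the helper `is_complete` is hypothetical.

```julia
# Three rows with a reduced set of columns; `missing` marks the blanks
rows = [
    (title = "Avatar",  duration = 178,     year_1 = 2009),
    (title = "Tangled", duration = missing, year_1 = missing),
    (title = "Spectre", duration = 148,     year_1 = 2015),
]

# A row is complete when none of its fields is missing
is_complete(row) = !any(ismissing, values(row))

# filter returns a new vector and leaves `rows` untouched
complete_rows = filter(is_complete, rows)

println([r.title for r in complete_rows])
```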


In [159]:
showcols(df1)

4×16 DataFrames.DataFrame
│ Col # │ Name                      │ Eltype  │ Missing │
├───────┼───────────────────────────┼─────────┼─────────┤
│ 1     │ MOVIE_TITLE               │ String  │ 0       │
│ 2     │ NUM_CRITIC_FOR_REVIEWS    │ Int64   │ 0       │
│ 3     │ DURATION                  │ Int64   │ 0       │
│ 4     │ DIRECTOR_FACEBOOK_LIKES   │ String  │ 0       │
│ 5     │ ACTOR_3_FACEBOOK_LIKES    │ Int64   │ 0       │
│ 6     │ ACTOR_1_FACEBOOK_LIKES    │ Int64   │ 0       │
│ 7     │ GROSS                     │ Int64   │ 0       │
│ 8     │ NUM_VOTED_USERS           │ Int64   │ 0       │
│ 9     │ CAST_TOTAL_FACEBOOK_LIKES │ Int64   │ 0       │
│ 10    │ FACENUMBER_IN_POSTER      │ Int64   │ 0       │
│ 11    │ NUM_USER_FOR_REVIEWS      │ Int64   │ 0       │
│ 12    │ BUDGET                    │ Int64   │ 0       │
│ 13    │ TITLE_YEAR                │ Int64   │ 0       │
│ 14    │ ACTOR_2_FACEBOOK_LIKES    │ Int64   │ 0       │
│ 15    │ IMDB_SCORE                │ Float64 