[Bug] CSV.read randomly changes eltype of column #1089

Closed
hungpham3112 opened this issue May 12, 2023 · 7 comments

hungpham3112 commented May 12, 2023

Steps to reproduce:

  • Copy code into Jupyter notebook or Pluto to see the result
using Plots,DataFrames, DataFramesMeta, CSV, HTTP, Statistics
filename = "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DA0101EN-SkillsNetwork/labs/Data%20files/auto.csv"
headers = ["symboling","normalized-losses","make","fuel-type","aspiration", "num-of-doors","body-style",
         "drive-wheels","engine-location","wheel-base", "length","width","height","curb-weight","engine-type",
         "num-of-cylinders", "engine-size","fuel-system","bore","stroke","compression-ratio","horsepower",
         "peak-rpm","city-mpg","highway-mpg","price"]
df = CSV.read(HTTP.get(filename).body, DataFrame, header=headers)
eltype(df[!, 1]), eltype(df[!, 2])
  • Run the line df = CSV.read(HTTP.get(filename).body, DataFrame, header=headers) multiple times and observe that the column sometimes changes its element type.

I tested the CSV file in Python: the first column always has the same data type (Float64), so the problem is not with the CSV file.
I then tried the snippet above in both Jupyter notebook and Pluto and saw the same bug in both, so the problem is with CSV.read and CSV.File.

Videos:

  • Pluto.jl
bandicam.2023-05-12.08-33-29-925.mp4
  • Jupyter notebook
bandicam.2023-05-12.08-52-40-528.mp4

Versioninfo:

Julia Version 1.9.0
Commit 8e63055292 (2023-05-07 11:25 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 8 × 11th Gen Intel(R) Core(TM) i7-11370H @ 3.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, tigerlake)
  Threads: 8 on 8 virtual cores
Environment:
  JULIA_DEPOT_PATH = C:\Users\sofia\.julia;C:\Users\sofia\.julia\juliaup\julia-1.9.0+0.x64.w64.mingw32\local\share\julia;C:\Users\sofia\.julia\juliaup\julia-1.9.0+0.x64.w64.mingw32\share\julia
  JULIA_LOAD_PATH = C:\Users\sofia\AppData\Local\Temp\jl_MjE6XO;@;@v#.#;@stdlib
  JULIA_NUM_THREADS = 8
  JULIA_PROJECT = C:\Users\sofia\JuliaProjects\MachineLearning\LinearRegression\Project.toml
  JULIA_REVISE_WORKER_ONLY = 1
  • CSV: v0.10.10

Liozou commented Jun 6, 2023

Hi, and thank you for the bug report! Would you mind testing whether this still occurs after updating CSV.jl? Version 0.10.11 (tagged yesterday) includes #1073, which is intended to fix this kind of issue.


hungpham3112 commented Jun 7, 2023

I tested it: the data race happens less often, but the problem is still there. Moreover, this package now sometimes causes Pluto to hang for about 5 minutes, I think because of the data race.

bandicam.2023-06-07.07-58-26-840.mp4

My thought: if I run the code once, that is, run it and wait until it finishes before continuing, there is no problem with the type. But if I run it many times in quick succession, as I do in the video, a data race happens across multiple cores (8 in my case). I don't know whether this is right, so please explain it to me.
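
One way to probe whether concurrent parsing inside a single read plays a role (an assumption, not something established in this thread) is to ask CSV.jl to parse with one task through its ntasks keyword and compare the inferred column types. The sketch below uses a tiny made-up in-memory table instead of the real file, so the column names and values are placeholders.

```julia
using CSV, DataFrames

# Small in-memory stand-in for the real file (made-up data, for illustration only).
data = "symboling,price\n3,13495.0\n1,16500.0\n2,13950.0\n"

# ntasks=1 asks CSV.jl to parse the input with a single task instead of
# splitting it into chunks that are parsed concurrently.
df = CSV.read(IOBuffer(data), DataFrame; ntasks=1)

# Print the element types inferred for the two columns.
println(eltype(df[!, 1]), ", ", eltype(df[!, 2]))
```

If the types stay stable with ntasks=1 but vary with the default setting on the real file, that would point toward the multithreaded code path. It would not by itself say anything about re-running a notebook cell while an earlier run is still in progress.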


Liozou commented Jun 7, 2023

Ah, that's unfortunate and unexpected. It seems I cannot reproduce the issue: I tried running a Pluto notebook with the same environment (JULIA_NUM_THREADS=8 JULIA_REVISE_WORKER_ONLY=1 ~/julia-1.9.0/bin/julia --startup-file=no -e "using Pluto; Pluto.run()") and I put in the code from your initial message, one line per cell. Then I did as in your video, refreshing the df definition cell repeatedly, even just holding Shift+Enter down for a while, but I never saw the type of the first column change.
I also tried the following to automate things a bit:

body = HTTP.get(filename).body
for _ in 1:10000
    df2 = CSV.read(body, DataFrame, header=headers)
    if eltype(df2[!,1]) != Int64
        error("Encountered: $(eltype(df2[!,1]))")
    end
end

but no error occurs.
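
A variant of the loop above that counts every element type seen, rather than stopping at the first mismatch, can make an intermittent problem easier to quantify. This is only a sketch: tally_eltypes and fake_read are hypothetical names introduced here, and the stand-in reader replaces the real CSV.read call so the block runs without any packages.

```julia
# Count how often each element type appears over `n` calls of `read_column`.
# `read_column` is any zero-argument function returning one column as a vector.
function tally_eltypes(read_column, n::Integer)
    counts = Dict{Type,Int}()
    for _ in 1:n
        T = eltype(read_column())
        counts[T] = get(counts, T, 0) + 1
    end
    return counts
end

# Stand-in reader for illustration only; it always returns an integer column.
fake_read() = Int[3, 1, 2]

counts = tally_eltypes(fake_read, 100)
println(counts)

# With the real data, the reader would instead be something like:
#     tally_eltypes(() -> CSV.read(body, DataFrame, header=headers)[!, 1], 10000)
```

A result with more than one key would mean the inferred type changed between reads.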

Just to check if it can be something else in the configuration, can you please check the output of Base.Threads.nthreads() in one cell of your Pluto notebook, as well as that of import Pkg; Pkg.status()? Mine yields respectively 8 and

Status `/tmp/jl_pNSR9l/Project.toml`
  [336ed68f] CSV v0.10.11
  [a93c6f00] DataFrames v1.5.0
  [cd3eb016] HTTP v1.9.6
  [44cfe95a] Pkg v1.9.0
  [10745b16] Statistics v1.9.0

hungpham3112 commented

> Just to check if it can be something else in the configuration, can you please check the output of Base.Threads.nthreads() in one cell of your Pluto notebook, as well as that of import Pkg; Pkg.status()? Mine yields respectively 8 and

Here is the output:
[image]

> Ah, that's unfortunate and unexpected. It seems I cannot reproduce the issue: I tried running a Pluto notebook with the same environment (JULIA_NUM_THREADS=8 JULIA_REVISE_WORKER_ONLY=1 ~/julia-1.9.0/bin/julia --startup-file=no -e "using Pluto; Pluto.run()") and I put in the code from your initial message, one line per cell. Then I did as in your video, refreshing the df definition cell repeatedly, even just holding Shift+Enter down for a while, but I never saw the type of the first column change.
> I also tried the following to automate things a bit:

I can reproduce the error with your setup; maybe your OS is different from mine. I'm testing on Windows 11 with PowerShell 7.2.

Untitled.mp4


Liozou commented Jun 8, 2023

Thanks for checking: apparently you are still using CSV v0.10.10, but the bugfix I mentioned was only released starting with CSV v0.10.11, which explains why you are still seeing this bug.
Would you mind updating the package and letting us know whether the bug still occurs afterwards? To update, run Pkg.update("CSV") from a cell of your notebook (or simply Pkg.update() to update all packages in your environment): you should see somewhere a line stating

  [336ed68f] ↑ CSV v0.10.10 ⇒ v0.10.11
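
To confirm which CSV.jl version the notebook process has actually loaded, rather than relying on the update log alone, the loaded version can be compared against the release named above. The sketch below is a minimal illustration: has_fix and FIXED_IN are hypothetical names, not part of CSV.jl or Pkg, and the pkgversion call in the closing comment assumes Julia 1.9 or later.

```julia
# Version in which the fix discussed in this thread was released.
const FIXED_IN = v"0.10.11"

# Hypothetical helper: true when version `v` is at or above the fixed release.
has_fix(v::VersionNumber) = v >= FIXED_IN

println(has_fix(v"0.10.10"))
println(has_fix(v"0.10.11"))

# In a notebook cell, after `using CSV`, the loaded version can be checked with:
#     has_fix(pkgversion(CSV))
```

Running the check inside the notebook itself matters because a notebook may use a different package environment from the one that was updated.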


hungpham3112 commented Jun 8, 2023

> Thanks for checking: apparently you are still using CSV v0.10.10, but the bugfix I mentioned was only released starting with CSV v0.10.11, which explains why you are still seeing this bug. Would you mind updating the package and letting us know whether the bug still occurs afterwards? To update, run Pkg.update("CSV") from a cell of your notebook (or simply Pkg.update() to update all packages in your environment): you should see somewhere a line stating
>
>   [336ed68f] ↑ CSV v0.10.10 ⇒ v0.10.11

I realized that I had only updated my local environment, not Pluto's; sorry about that. The first time I checked after updating, the data race still occurred, but on the second and third attempts everything was fine. Something odd is going on, or maybe there is a problem with multithreading. We need more people to validate this behavior. Thanks.


hungpham3112 commented Jul 26, 2023

Hi, today I came back to this problem and the data race no longer occurs. My guess is that when I last updated CSV from v0.10.10 to v0.10.11, temporary files were still present on my machine, so the bug kept occurring. #1073 definitely fixes this issue. Thanks for the hard work. I will close this issue.
