# Biological Activity

New drugs are developed from biologically active chemicals (bioengineering). Testing molecules for biological activity is costly, so it would be useful to predict biological activity from cheaper measurements. It is even possible, without synthesizing the compound, to calculate certain characteristics such as size, lipophilicity (ability to dissolve in fats), and the polarity of key chemical groups at different sites in the molecule, as well as the activity of the compound. This area of research is called computational chemistry.

In [None]:
using Pkg
Pkg.add("DataFrames")
Pkg.add("CSV")
Pkg.add("GLM")
Pkg.add("Statistics")
Pkg.add("Distributions")
Pkg.add("Gadfly")
Pkg.add("LinearAlgebra")
Pkg.add("ScikitLearn")

In [36]:
using Pkg
Pkg.add("PyCall")

[32m[1m   Resolving[22m[39m package versions...


[32m[1m    Updating[22m[39m `~/.julia/environments/v1.8/Project.toml`
 [90m [438e738f] [39m[92m+ PyCall v1.95.1[39m
[32m[1m  No Changes[22m[39m to `~/.julia/environments/v1.8/Manifest.toml`


In [41]:
using CSV, DataFrames, GLM, Statistics, Distributions, Gadfly, LinearAlgebra, ScikitLearn

### Data

The data file, Penta, contains 31 observations and the following variables:
* NOM: name of the compound
* 15 X measurements: S1, L1, ..., P5
* Response Y_logRAI: logarithm of the bradykinin activity (conversion enzyme)
* CLASSE: classification of the observation, either training (`entraiment`) or `test`

The file is divided into two parts: the first 15 observations form the training set for the PLS model (Ufkes 1978 study); the remaining 16 form the test set and come from the 1982 study. The peptides used in the second study were different from those used in the first study, and the bradykinin used in the two studies came from different sources.

In [16]:
data = CSV.read("data.csv", DataFrame);
describe(data)

| Row | variable (Symbol) | mean (Union…) | min (Any) | median (Union…) | max (Any) | nmissing (Int64) | eltype (Type) |
|---|---|---|---|---|---|---|---|
| 1 | ID | 16.0 | 1 | 16.0 | 31 | 0 | Int64 |
| 2 | NOM | | AAAAA | | VWAAK | 0 | String7 |
| 3 | S1 | -1.70665 | -4.9217 | -2.6931 | 3.0777 | 0 | Float64 |
| 4 | L1 | -1.80847 | -5.3648 | -2.5271 | 2.5215 | 0 | Float64 |
| 5 | P1 | -0.915865 | -3.4435 | -1.2871 | 2.2253 | 0 | Float64 |
| 6 | S2 | 1.8728 | -4.7548 | 2.8369 | 3.0777 | 0 | Float64 |
| 7 | L2 | -0.194406 | -5.3648 | 0.3891 | 3.6521 | 0 | Float64 |
| 8 | P2 | -0.451558 | -3.1398 | -0.0701 | 0.8524 | 0 | Float64 |
| 9 | S3 | -1.67861 | -4.9217 | 0.0744 | 2.4064 | 0 | Float64 |
| 10 | L3 | 0.170848 | -5.3648 | -1.0285 | 3.6521 | 0 | Float64 |


In [19]:
data_point, features = size(data)
print(data_point, " samples - ", features, " features")

31 samples - 19 features

In [29]:
# Split into training and validation (test) sets using the CLASSE column
train = data[(data.CLASSE .== "entraiment"), :];
# NOM (compound name) is kept, so each predictor frame has 16 columns: NOM + 15 X measurements
x_train = select(train, Not([:CLASSE, :ID, :Y_logRAI]));
y_train = select(train, :Y_logRAI);
print("Training size: ", size(x_train))

valid = data[(data.CLASSE .== "test"), :];
x_valid = select(valid, Not([:CLASSE, :ID, :Y_logRAI]));
y_valid = select(valid, :Y_logRAI);
# Report the size of the test predictors
print("\nTest size: ", size(x_valid))


Training size: (15, 16)
Test size: (16, 16)

### Objective
To develop a PLS (Projection on Latent Structures) model based on the first study and examine its performance in predicting the data from the second study.

7a) Develop an initial PLS model (denoted M1) on the training data only (first 15 observations) for bradykinin activity. Consider a model with all components.

7b) Develop a second PLS model (denoted M2) based on the first 2 components only. Justify dropping the components beyond the first 2.

7c) Develop a third PLS model (denoted M3) with the first 2 components, using only the regressors S1, P1, S3, P3, L3, S4, L4, P4. Justify dropping the other regressors: L1, S2, L2, P2, S5, L5, P5.

7d) Use the M3 model to predict bradykinin activity for the data from the second study. Comment on the result, propose a conclusion and, if possible, an explanation.
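
The tasks above all rest on fitting a single-response PLS model with a chosen number of components. As a rough illustration of what such a fit does, here is a minimal pure-Julia sketch of PLS1 with NIPALS-style deflation. It is not the scikit-learn implementation: the helper names `pls1_fit` and `pls1_predict` are hypothetical, the predictors are centered but not scaled, and the data are synthetic stand-ins rather than the Penta file.

```julia
using LinearAlgebra, Statistics, Random

# Hypothetical helper (not from the notebook): PLS1 with NIPALS-style deflation.
# X: n×p predictor matrix, y: length-n response, ncomp: number of latent components.
# Assumption: variables are centered only; no scaling to unit variance is applied.
function pls1_fit(X::AbstractMatrix, y::AbstractVector, ncomp::Integer)
    xmean = vec(mean(X, dims=1))
    ymean = mean(y)
    E = X .- xmean'            # centered predictors, deflated in the loop
    f = y .- ymean             # centered response, deflated in the loop
    sst = sum(abs2, f)         # total sum of squares of y
    p = size(X, 2)
    W = zeros(p, ncomp)        # weight vectors
    P = zeros(p, ncomp)        # X loadings
    q = zeros(ncomp)           # y loadings
    r2 = zeros(ncomp)          # cumulative fraction of y variance explained
    for a in 1:ncomp
        w = E' * f
        w ./= norm(w)          # unit-length weight vector
        t = E * w              # score vector for component a
        tt = dot(t, t)
        pa = (E' * t) ./ tt
        qa = dot(f, t) / tt
        E = E - t * pa'        # remove what component a explains in X
        f = f - qa .* t        # remove what component a explains in y
        W[:, a] = w
        P[:, a] = pa
        q[a] = qa
        r2[a] = 1 - sum(abs2, f) / sst
    end
    beta = W * ((P' * W) \ q)  # coefficients on the original (uncentered) X
    intercept = ymean - dot(xmean, beta)
    return (beta = beta, intercept = intercept, r2 = r2)
end

# Hypothetical helper: predictions from a fitted model.
pls1_predict(model, X) = X * model.beta .+ model.intercept

# Synthetic stand-in data (made up for illustration; not the Penta data).
rng = MersenneTwister(42)
Xdemo = randn(rng, 15, 4)
ydemo = Xdemo * [1.0, -2.0, 0.5, 0.0] .+ 0.3 .+ 0.1 .* randn(rng, 15)

m_full = pls1_fit(Xdemo, ydemo, 4)   # all components, analogous to M1
m_two = pls1_fit(Xdemo, ydemo, 2)    # first 2 components, analogous to M2

println("cumulative R² by component: ", round.(m_full.r2, digits=3))
rmse = sqrt(mean(abs2, ydemo .- pls1_predict(m_two, Xdemo)))
println("training RMSE with 2 components: ", rmse)
```

The cumulative R² vector is one possible basis for the justification asked for in 7b: if it flattens after two components, the later ones add little. Applying `pls1_predict` to held-out rows gives the kind of test-set comparison asked for in 7d.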

In [None]:
@sk_import linear_model: LogisticRegression

# X (feature matrix) and y (vector of class labels) must be defined before this cell can run
model = LogisticRegression(fit_intercept=true, max_iter = 200)
fit!(model, X, y);
accuracy = score(model, X, y)
println("accuracy: $accuracy")


In [40]:
# PLSRegression lives in sklearn.cross_decomposition; current scikit-learn releases have no sklearn.pls module
@sk_import cross_decomposition: PLSRegression

# Julia syntax: booleans are lowercase; PLSRegression takes no `algorithm` argument
model = PLSRegression(n_components=10, scale=true, max_iter=500, tol=1e-06, copy=true)


In [47]:
using Pkg
Pkg.add("Conda")
using Conda
# The conda package that provides sklearn is named "scikit-learn";
# "sklearn.pls" is not a conda package
Conda.add("scikit-learn")

   Resolving package versions...
    Updating `~/.julia/environments/v1.8/Project.toml`
  [8f4d0f93] + Conda v1.8.0
  No Changes to `~/.julia/environments/v1.8/Manifest.toml`


In [45]:
using PyCall
# pyimport_conda(module, conda_package): import the module, installing the
# conda package "scikit-learn" first if the import fails
pyimport_conda("sklearn.cross_decomposition", "scikit-learn")
