add pandas #1522

Merged

TaiSakuma
Contributor

This PR adds pandas.

Following the presentation at the RECO/AT meeting on Thursday, 30 April 2015, I tried writing spec files for pandas and its requirements.

This is my first PR to this repo. I am not familiar with how this repo is developed, so I might not be doing this right.

First, this PR is made to the branch IB/CMSSW_7_5_X/stable. I am not sure if this is the right branch. Please let me know if this PR should be made to another branch.

pandas depends on three other packages. It also has recommended and optional dependencies. These are listed at:

http://pandas.pydata.org/pandas-docs/stable/install.html#dependencies

This PR upgrades NumPy and python-dateutil to meet the requirement.

The website above states that the recommended dependencies provide large speedups for large data sets, which might be very useful. However, this PR doesn't install those recommended dependencies. I could try to add them in a separate PR.

This PR adds ATLAS as NumPy uses it. The compilation of ATLAS is probably the most complicated part of this PR.

Comments on packages

I will add comments below for several packages.

ATLAS

(This PR no longer adds ATLAS)

This PR adds ATLAS, which NumPy uses for fast linear algebra calculations. ATLAS is in the branch IB/CMSSW_4_4_X/stable. I copied the spec file from that branch and updated it to use the most recent version of ATLAS.

Tuning

The performance of ATLAS depends on how it is tuned at compile time. This PR uses only -b 64 for the tuning. This might not be the best option, and it might not work on OSX.

Netlib LAPACK

This PR uses the "--with-netlib-lapack-tarfile" option, which will compile netlib's LAPACK and install it.

However, cmsdist also has lapack.spec, so there will be two instances of Netlib's LAPACK.

Shared objects

I modified the Makefile so that shared objects are built in the way that NumPy and other packages appear to expect.

The configuration option --shared creates two shared objects, libsatlas.so and libtatlas.so, which include all symbols for the serial and parallel APIs respectively. These are not what other packages that use ATLAS expect and find by default. For example, see a Stack Overflow post.

This PR builds those two shared objects but doesn't install them.

Instead, this PR builds seven shared objects, one for each archive file, and installs them. Those shared objects are libatlas.so, libf77blas.so, libcblas.so, liblapack.so, libptf77blas.so, libptcblas.so, and libptlapack.so.
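
As a rough illustration of the "one shared object per archive" idea, the sketch below prints (but does not run) one link command per ATLAS archive. It assumes a GNU toolchain, archives that were compiled as position-independent code, and a hypothetical ATLAS_LIBDIR path. The Makefile change in this PR may do this differently, and real commands would also need the dependencies between the libraries.

```shell
#!/bin/sh
# Dry run: print one link command per ATLAS static archive.
# ATLAS_LIBDIR is a hypothetical path used only for illustration.
ATLAS_LIBDIR=/path/to/atlas/lib

# The seven archive names listed in the description above.
ARCHIVES="atlas f77blas cblas lapack ptf77blas ptcblas ptlapack"

COMMANDS=$(
  for name in $ARCHIVES; do
    # --whole-archive pulls every object of the archive into the shared object.
    echo "gcc -shared -o $ATLAS_LIBDIR/lib$name.so -Wl,--whole-archive $ATLAS_LIBDIR/lib$name.a -Wl,--no-whole-archive"
  done
)

printf '%s\n' "$COMMANDS"
```

Printing the commands first makes it easy to review them before running anything against a real build area.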

NumPy

This website says that shared objects for BLAS, LAPACK, and ATLAS can be specified by environment variables. However, this didn't work when I tried it. So instead, this PR uses site.cfg to specify them.
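
For readers unfamiliar with site.cfg, here is a minimal sketch of writing one from a shell script. The section and key names ([blas], [lapack], library_dirs, blas_libs, lapack_libs) are the ones numpy.distutils reads. LAPACK_ROOT and the library names are assumptions for illustration, not the values used in this PR.

```shell
#!/bin/sh
# Write a minimal site.cfg into a scratch directory.
# LAPACK_ROOT is a hypothetical install prefix; adjust it to the real one.
WORKDIR=$(mktemp -d)
SITE_CFG="$WORKDIR/site.cfg"
LAPACK_ROOT=/path/to/lapack

cat > "$SITE_CFG" <<EOF
[blas]
library_dirs = $LAPACK_ROOT/lib
blas_libs = blas

[lapack]
library_dirs = $LAPACK_ROOT/lib
lapack_libs = lapack
EOF

# Show the result so it can be checked before building NumPy.
cat "$SITE_CFG"
```

In a real build the file would be placed next to NumPy's setup.py before running the build.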

pandas

I modified setup.py so that setuptools won't compile NumPy again. I think setuptools is supposed to check whether the requirements are installed and install them only if they are missing. However, it would always try to install NumPy even when NumPy was already installed. I couldn't figure out why. So I just removed the lines in setup.py that declare the NumPy requirement.

I removed the NumPy requirement from both setup_requires and install_requires. If it is in setup_requires, setuptools compiles NumPy before compiling pandas. If it is not in setup_requires but is in install_requires, setuptools compiles NumPy after compiling pandas.
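
The effect of removing the requirement can be sketched on a toy file. The fragment below is a hypothetical layout and not pandas's real setup.py; the PR itself applies the change through py2-pandas-0.16.0.patch rather than sed.

```shell
#!/bin/sh
# Toy illustration: drop NumPy from setup_requires and install_requires.
# The file content below is hypothetical, written only for this example.
WORKDIR=$(mktemp -d)
ORIGINAL="$WORKDIR/setup_fragment.py"
PATCHED="$WORKDIR/setup_fragment.patched.py"

cat > "$ORIGINAL" <<'EOF'
setuptools_kwargs = {
    'install_requires': [
        'python-dateutil',
        'numpy >= 1.7.0',
    ],
    'setup_requires': [
        'numpy >= 1.7.0',
    ],
}
EOF

# Delete every line that declares a NumPy requirement; other entries stay.
sed "/'numpy/d" "$ORIGINAL" > "$PATCHED"

cat "$PATCHED"
```

With both entries gone, setuptools has no NumPy requirement to act on, so the NumPy built by its own spec file is the only copy involved.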

Test the branch

I tested my branch with the following commands on lxplus cmsdev.

CMSSW_X_Y_Z=CMSSW_7_5_0_pre3
ARCH=slc6_amd64_gcc491
QUEUE=`echo $CMSSW_X_Y_Z | sed -e 's/\(CMSSW_[0-9][0-9]*_[0-9][0-9]*\).*/\1_X/'`
eval $(curl -f -L https://raw.githubusercontent.com/cms-sw/cms-bot/master/config.map | grep "SCRAM_ARCH=$ARCH;" | grep "RELEASE_QUEUE=$QUEUE;")
git clone -b $CMSDIST_TAG git@github.com:cms-sw/cmsdist.git CMSDIST
git clone -b $PKGTOOLS_TAG git@github.com:cms-sw/pkgtools.git PKGTOOLS
sh -e PKGTOOLS/scripts/prepare-cmsdist $CMSSW_X_Y_Z $ARCH
cd CMSDIST
git remote add mine git@github.com:TaiSakuma/cmsdist.git
git fetch mine
git merge mine/dev-20150430-01-CMSSW_7_5_X-pandas
cd ..
screen -e^vv -L time PKGTOOLS/cmsBuild --architecture=$ARCH --builders 4 -j 16 build cms-git-tools
screen -e^vv -L time PKGTOOLS/cmsBuild --architecture=$ARCH --builders 4 -j 16 build py2-pandas-toolfile

pandas is compiled.
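
To make the first few commands above easier to follow, this sketch runs the same sed expression offline and evaluates a sample config.map line instead of downloading the real file. The sample line is an assumption about the format, inferred from the grep patterns, and the PKGTOOLS_TAG value in it is made up.

```shell
#!/bin/sh
# Derive the release queue from the release name, as in the commands above.
CMSSW_X_Y_Z=CMSSW_7_5_0_pre3
ARCH=slc6_amd64_gcc491
QUEUE=$(echo "$CMSSW_X_Y_Z" | sed -e 's/\(CMSSW_[0-9][0-9]*_[0-9][0-9]*\).*/\1_X/')
echo "$QUEUE"   # prints CMSSW_7_5_X

# Hypothetical config.map line: semicolon-separated KEY=VALUE pairs.
# The PKGTOOLS_TAG value here is invented for illustration.
SAMPLE_LINE='SCRAM_ARCH=slc6_amd64_gcc491;PKGTOOLS_TAG=V00-00-XX;CMSDIST_TAG=IB/CMSSW_7_5_X/stable;RELEASE_QUEUE=CMSSW_7_5_X;'

# Select the line for this architecture and queue, then let the shell
# evaluate it so that each KEY=VALUE pair becomes a shell variable.
MATCH=$(printf '%s\n' "$SAMPLE_LINE" | grep "SCRAM_ARCH=$ARCH;" | grep "RELEASE_QUEUE=$QUEUE;")
eval "$MATCH"

echo "$CMSDIST_TAG $PKGTOOLS_TAG"
```

This is why the later git clone commands can use $CMSDIST_TAG and $PKGTOOLS_TAG without setting them explicitly.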

I don't know how to enter the environment that was built, so I didn't try importing pandas in a Python script.

Any comments or suggestions are welcome.

@cmsbuild cmsbuild added this to the Next CMSSW_7_5_X milestone May 2, 2015
@ktf
Contributor

ktf commented May 4, 2015

I'm only concerned about the ATLAS usage, which is a painful beast to
support, given it autotunes. Can't you use the standard version of
LAPACK / BLAS?

Ciao,
Giulio

On 2 May 2015, at 19:12, Tai Sakuma wrote:

TaiSakuma wants to merge 10 commits into cms-sw:IB/CMSSW_7_5_X/stable
from TaiSakuma:dev-20150430-01-CMSSW_7_5_X-pandas:

-- Commit Summary --

  • bring atlas.spec from IB/CMSSW_4_4_X/stable
  • add atlas.3.10.2.patch
  • update atlas.3.10.2.patch
  • upgrade atlas to 3.10.2
  • upgrade numpy to 1.9.2
  • upgrade python-dateutil to 1.5
  • add py2-pandas-0.16.0.patch
  • add py2-pandas.spec
  • add py2-pandas-toolfile.spec
  • add py2-pandas-toolfile to Requires of cmssw-tool-conf

-- File Changes --

A atlas.3.10.2.patch (47)
A atlas.spec (33)
M cmssw-tool-conf.spec (1)
M py2-numpy.spec (20)
A py2-pandas-0.16.0.patch (13)
A py2-pandas-toolfile.spec (26)
A py2-pandas.spec (21)
M py2-python-dateutil.spec (13)

-- Patch Links --

https://github.com/cms-sw/cmsdist/pull/1522.patch
https://github.com/cms-sw/cmsdist/pull/1522.diff

@TaiSakuma
Contributor Author

@ktf, Ok. I will try without ATLAS.

@TaiSakuma
Contributor Author

@ktf, I updated the PR. It won't add ATLAS any more.

@ktf
Contributor

ktf commented May 4, 2015

Thanks. @Degano can you try this out? I would say let's get this PR in and then let's have a cms-externals repo for all the modified externals.

@ghost

ghost commented May 5, 2015

@ktf @TaiSakuma I'm testing it right now.

@ghost

ghost commented May 5, 2015

@ktf @TaiSakuma I found an issue while building the packages that depend on the ones modified by this PR: py2-pyfits was not completing correctly.
I fixed the error on top of @TaiSakuma's branch and created a PR to his repository here: TaiSakuma#1.
The error is probably unrelated to this PR but would fail the whole compilation nevertheless.

Fix deletion of egg-info in pyfits.
@TaiSakuma
Contributor Author

@Degano, thank you for the test and fix. I merged your PR.

@ghost

ghost commented May 5, 2015

+1
Tested compilation of all the externals involved and their dependencies.

ghost pushed a commit that referenced this pull request May 5, 2015
@ghost ghost merged commit 3812568 into cms-sw:IB/CMSSW_7_5_X/stable May 5, 2015
This pull request was closed.