# Using the CSV format in Jupyter

The data collected by the CMS detector can be stored in many different file formats. One easy option is the CSV format (comma-separated values). A CSV file is a plain text file in which each line holds one record and the values within a line are separated by commas.
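
To make the structure concrete, here is a minimal sketch that parses two lines of CSV text with Python's standard `csv` module. The text is a shortened stand-in, not the real file: it keeps only four columns, with values taken from the first event shown later in this notebook.

```python
import csv
import io

# A header line with column names, followed by one line of values.
# Shortened stand-in for the real file (only four of its columns).
csv_text = "Run,Event,pt1,Q1\n165617,74969122,54.7055,1\n"

# csv.reader splits each line at the commas and yields a list of strings.
rows = list(csv.reader(io.StringIO(csv_text)))
header, first_event = rows[0], rows[1]

print(header)
print(first_event)
```

Note that the plain `csv` module returns every value as a string; converting them to numbers is left to the reader of the file.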

### Reading the CSV file

CSV files can be read, for example, with the function _read_\__csv( )_ of the _pandas_ module. Let's read the file _Zmumu_\__Run2011A.csv_, which is in the folder _Data_ in the parent directory, and save its content to the variable _dataset_.

The file contains events from the CMS primary dataset [1] with the specific selection criteria [2].
<br>
<br>
<br>
[1] CMS collaboration (2016). DoubleMu primary dataset in AOD format from RunA of 2011 (/DoubleMu/Run2011A-12Oct2013-v1/AOD). CERN Open Data Portal. DOI: [10.7483/OPENDATA.CMS.RZ34.QR6N](http://doi.org/10.7483/OPENDATA.CMS.RZ34.QR6N).
<br>
[2] Thomas McCauley (2016). Jpsimumu. Jupyter Notebook file. https://github.com/tpmccauley/cmsopendata-jupyter/blob/hst-0.1/Jpsimumu.ipynb. <br>

In [1]:
import pandas

dataset = pandas.read_csv('../Data/Zmumu_Run2011A.csv')

We can check what kind of information the file contains. Let's use the method _head( )_, which returns the first five lines of the DataFrame it is called on ([pandas documentation](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.head.html)).

In [2]:
dataset.head()

Unnamed: 0,Run,Event,pt1,eta1,phi1,Q1,dxy1,iso1,pt2,eta2,phi2,Q2,dxy2,iso2
0,165617,74969122,54.7055,-0.432396,2.57421,1,-0.074544,0.499921,34.2464,-0.98848,-0.498704,-1,0.071222,3.42214
1,165617,75138253,24.5872,-2.0522,2.86657,-1,-0.055437,0.0,28.5389,0.385163,-1.99117,1,0.051477,0.0
2,165617,75887636,31.7386,-2.25945,-1.33229,-1,0.087917,0.0,30.2344,-0.468419,1.88331,1,-0.087639,0.0
3,165617,75779415,39.7394,-0.712338,-0.312266,1,0.058481,0.0,48.279,-0.195625,2.97032,-1,-0.049201,0.0
4,165617,75098104,41.2998,-0.157055,-3.04077,1,-0.030463,1.22804,43.4508,0.590958,-0.042756,-1,0.044175,0.0
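
The column named _Unnamed: 0_ appears when the first field of the header line is empty, which usually means the file already contains a saved row index. The sketch below assumes that is the case here. It uses an in-memory copy of the first three events as a stand-in for the real file, and shows two optional arguments: `index_col=0` of `read_csv` and the row count of `head`.

```python
import io
import pandas

# Stand-in for the real file: the header and first three events shown above.
# The empty first header field is an assumption about how the file was written.
csv_text = """,Run,Event,pt1,eta1,phi1,Q1,dxy1,iso1,pt2,eta2,phi2,Q2,dxy2,iso2
0,165617,74969122,54.7055,-0.432396,2.57421,1,-0.074544,0.499921,34.2464,-0.98848,-0.498704,-1,0.071222,3.42214
1,165617,75138253,24.5872,-2.0522,2.86657,-1,-0.055437,0.0,28.5389,0.385163,-1.99117,1,0.051477,0.0
2,165617,75887636,31.7386,-2.25945,-1.33229,-1,0.087917,0.0,30.2344,-0.468419,1.88331,1,-0.087639,0.0
"""

# index_col=0 uses the first column as the row index instead of a data column.
sample = pandas.read_csv(io.StringIO(csv_text), index_col=0)

# head(n) returns the first n rows; without an argument it returns five.
first_two = sample.head(2)
print(first_two)
```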


<br>
Notice that there are more lines in the variable _dataset_ than the five printed. We can check the number of lines with the function _len( )_, which returns the length of the variable given in the parentheses. For a DataFrame this is the number of lines (rows).

In [3]:
len(dataset)

10583
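
If the number of columns is also of interest, the `shape` and `columns` attributes of a DataFrame give it directly. A minimal sketch, using a three-event stand-in with only four of the columns rather than the full file:

```python
import io
import pandas

# Small stand-in: three events and four columns, values copied from the table above.
csv_text = (
    "Run,Event,Q1,Q2\n"
    "165617,74969122,1,-1\n"
    "165617,75138253,-1,1\n"
    "165617,75887636,-1,1\n"
)
sample = pandas.read_csv(io.StringIO(csv_text))

# shape is a (rows, columns) tuple; len() gives the same number of rows.
n_rows, n_columns = sample.shape
print(n_rows, n_columns)

# columns holds the column names read from the header line.
print(list(sample.columns))
```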

### Observing and selecting the values

From the print above, we can see that the content of the file has been saved into a table (the DataFrame tabular data structure). Each line of the table represents a different collision event and the columns contain the different values saved for the event. Some of the values are measured by the detector and some have been calculated from the measured values.

Values in the table can be accessed with the _pandas_ module. For example, the data we are using contains the charges of two muons, marked as _Q1_ and _Q2_. We can select a certain column from the table, e.g. the charges of the first muon for all of the events, by referring to the column name:

In [4]:
dataset['Q1']

0        1
1       -1
2       -1
3        1
4        1
5       -1
6       -1
7       -1
8        1
9       -1
10      -1
11       1
12       1
13      -1
14       1
15      -1
16       1
17      -1
18       1
19       1
20      -1
21       1
22      -1
23      -1
24       1
25       1
26       1
27       1
28      -1
29       1
        ..
10553    1
10554   -1
10555    1
10556   -1
10557   -1
10558   -1
10559    1
10560    1
10561   -1
10562    1
10563    1
10564    1
10565    1
10566   -1
10567   -1
10568   -1
10569   -1
10570   -1
10571    1
10572   -1
10573    1
10574    1
10575   -1
10576    1
10577    1
10578   -1
10579    1
10580   -1
10581    1
10582    1
Name: Q1, Length: 10583, dtype: int64

Now the code printed the values of the column _Q1_ of the variable _dataset_. Not all of the values are printed (there are over 10 000 of them), and on the last line of the print you can see the name, length and type of the information printed.

The numbers on the left tell the index of the line and the numbers on the right are the values of the charges. By replacing _Q1_ in the code, it is possible to select any of the columns from the _dataset_ (e.g. _pt1_, _eta1_, _phi2_, ...).
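
Several columns can also be selected at once by giving a list of names. The following sketch uses a three-event stand-in built by hand from the values printed earlier, so it does not depend on the file being available:

```python
import pandas

# Values for the first three events, copied from the table printed earlier.
sample = pandas.DataFrame({
    'pt1': [54.7055, 24.5872, 31.7386],
    'eta1': [-0.432396, -2.0522, -2.25945],
    'Q1': [1, -1, -1],
})

one_column = sample['pt1']             # a single name gives a Series
two_columns = sample[['pt1', 'eta1']]  # a list of names gives a DataFrame

print(type(one_column).__name__)
print(list(two_columns.columns))
```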

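If, for example, only the first values of the charges are wanted, they can be selected with the _.loc_ indexer. Inside the brackets come first the index labels of the lines to be selected (here lines 0--10) and then the name of the column from which the lines will be selected (here _Q1_). Note that _.loc_ slices by label and includes both endpoints, so _0:10_ returns eleven values, not ten. In old versions of _pandas_ the same selection was often written with _.ix_, which was deprecated in pandas 0.20 and removed in pandas 1.0; this depends on the _pandas_ version, not on the Python version.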

In [5]:
dataset.loc[0:10, 'Q1']
# With pandas older than 0.20 this was commonly written as
# dataset.ix[0:10, 'Q1']   (.ix was removed in pandas 1.0)

0     1
1    -1
2    -1
3     1
4     1
5    -1
6    -1
7    -1
8     1
9    -1
10   -1
Name: Q1, dtype: int64
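
The output above has eleven lines because a `.loc` slice includes its end label. To get exactly the first ten values, one option is the position-based `.iloc` indexer, which follows the usual Python rule of excluding the end. A small sketch, using the first twelve _Q1_ values from the earlier output as a stand-in for the full column:

```python
import pandas

# Q1 values for indices 0-11, copied from the output printed earlier.
sample = pandas.DataFrame({'Q1': [1, -1, -1, 1, 1, -1, -1, -1, 1, -1, -1, 1]})

by_label = sample.loc[0:10, 'Q1']      # label-based: includes both 0 and 10
by_position = sample['Q1'].iloc[0:10]  # position-based: stops before position 10

print(len(by_label), len(by_position))
```

With the default index 0, 1, 2, ... the same ten values could also be selected with `sample.loc[0:9, 'Q1']`.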

Individual values can also be picked. Let's say we want to see the charges at indices 0, 1, 5 and 10. This can be done by giving the indices as a list:

In [6]:
dataset.loc[[0,1,5,10],'Q1']
# With pandas older than 0.20 this was commonly written as
# dataset.ix[[0,1,5,10],'Q1']   (.ix was removed in pandas 1.0)

0     1
1    -1
5    -1
10   -1
Name: Q1, dtype: int64

### Other notes

- There are also other options for selecting the values. For example, this [Stack Overflow question](http://stackoverflow.com/questions/31593201/pandas-iloc-vs-ix-vs-loc-explanation) compares the _.iloc_, _.ix_ and _.loc_ indexers of the _pandas_ module. CSV files can also be read with other modules or libraries.
- Note that different CSV files contain different data, depending on the kind of collision events from which the file has been created. You can always check the content of a file by opening it in Jupyter or in a text editor.
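
As one illustration of the other selection options, lines can be chosen by a condition on their values instead of by index. The sketch below is not part of the original analysis: it uses a hand-built five-event stand-in with values copied from the table printed earlier, and keeps only the events where the first muon has positive charge.

```python
import pandas

# Values for the first five events, copied from the table printed earlier.
sample = pandas.DataFrame({
    'pt1': [54.7055, 24.5872, 31.7386, 39.7394, 41.2998],
    'Q1': [1, -1, -1, 1, 1],
})

# The comparison gives one True/False per line; indexing with it keeps the True lines.
positive = sample[sample['Q1'] == 1]

print(positive)
```

The original index labels are kept in the result, so the selected lines can still be traced back to their position in the full table.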