**Basic reading and writing of csv files as a first data processing step**  

This script takes the raw csv files provided by central DQM as its input.  
These files are difficult to work with: each one contains a fixed number of lines, so the rows are not grouped by e.g. run number, and many histogram types are mixed together in the same file.  
This script (of which practically all the functionality lives in the 'utils' folder) rearranges them into a more useful form: one file per histogram type and per year, containing all runs and lumisections for that type in that year.  

It is a good idea to run this code yourself, replacing the histogram types with the ones that you intend to use in your study.  
Options are also available (although not shown in this small tutorial) to make files per era instead of per year, if you prefer that.

For more information, check the documentation of utils/csv_utils and utils/dataframe_utils! See also the comments in the code below for some more explanation.
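
To make the idea of the reformatting concrete, here is a minimal sketch in plain pandas of splitting a mixed dataframe into one csv file per histogram type. It is not the implementation in utils/csv_utils: the function `split_by_histname`, the toy data and the column name `hname` are assumptions made for illustration (only `fromrun` and `fromlumi` are taken from the dataframe printout in this tutorial).

```python
import os
import tempfile

import pandas as pd

# toy stand-in for a raw csv file with several histogram types mixed together.
# NOTE: 'hname' is an assumed column name, it is not confirmed by this tutorial.
raw = pd.DataFrame({
    'fromrun': [297050, 297050, 297050, 297056],
    'fromlumi': [1, 1, 2, 1],
    'hname': ['typeA', 'typeB', 'typeA', 'typeB'],
})

def split_by_histname(df, outdir, namecol='hname'):
    """Write one csv file per histogram type and return the written paths."""
    written = []
    # groupby iterates over the histogram types in sorted order
    for histname, subdf in df.groupby(namecol):
        path = os.path.join(outdir, '{}.csv'.format(histname))
        subdf.to_csv(path, index=False)
        written.append(path)
    return written

outdir = tempfile.mkdtemp()
paths = split_by_histname(raw, outdir)
print(paths)
```

Each output file then holds all runs and lumisections of a single histogram type, which is the layout the rest of this tutorial works towards.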

In [1]:
### imports

# external modules
import sys
import importlib

# local modules
sys.path.append('../utils')
import csv_utils as csvu
import dataframe_utils as dfu
importlib.reload(csvu)
importlib.reload(dfu)

<module 'dataframe_utils' from '../utils/dataframe_utils.py'>

In [3]:
# read an example csv file

dim = 2 # dimension of histograms (1 or 2)
datadirs = list(csvu.get_data_dirs(year='2017',dim=dim)) 
# get_data_dirs returns the directories where to find the input csv files.
# this is hard-coded for now and might change in the future.
# if your data is located elsewhere, you can easily write an equivalent function with the same output.
print('data directories:')
print(datadirs)
datadir = datadirs[0]
csvfiles = csvu.sort_filenames(list(csvu.get_csv_files(datadir)))
# sort_filenames and get_csv_files are more or less self-explanatory.
print('number of csv files in {}: {}'.format(datadir,len(csvfiles)))
df = csvu.read_csv(csvfiles[0])
# read_csv turns an input csv file into a pandas dataframe. 
# the print(df) call below gives a printout of the dataframe before any further processing.
# comment it out to have a better view of the rest of the printouts in this cell.
#print('example data frame:')
print(df)
print('--- available runs present in this file: ---')
for r in dfu.get_runs(df): print(r)
print('--- available histogram types in this file ---')
for h in dfu.get_histnames(df): print(h)

data directories:
['/eos/project/c/cmsml4dc/ML_2020/UL2017_Data/DF2017B_2D_Complete', '/eos/project/c/cmsml4dc/ML_2020/UL2017_Data/DF2017C_2D_Complete', '/eos/project/c/cmsml4dc/ML_2020/UL2017_Data/DF2017D_2D_Complete', '/eos/project/c/cmsml4dc/ML_2020/UL2017_Data/DF2017E_2D_Complete', '/eos/project/c/cmsml4dc/ML_2020/UL2017_Data/DF2017F_2D_Complete']
number of csv files in /eos/project/c/cmsml4dc/ML_2020/UL2017_Data/DF2017B_2D_Complete: 45
       Unnamed: 0  fromrun  fromlumi  \
0           51546   297050         1   
1           51547   297050         1   
2           51548   297050         1   
3           51549   297050         1   
4           51550   297050         1   
5           51551   297050         1   
6           51552   297050         1   
7           51553   297050         1   
8           51554   297050         1   
9           51555   297050         1   
10          51556   297050         1   
11          51557   297050         1   
12          51558   297050         

clusterposition_zphi_PXLayer_1
clusterposition_zphi_PXLayer_2
clusterposition_zphi_PXLayer_3
clusterposition_zphi_PXLayer_4
clusters_per_SignedModuleCoord_per_SignedLadderCoord_PXLayer_1
clusters_per_SignedModuleCoord_per_SignedLadderCoord_PXLayer_2
clusters_per_SignedModuleCoord_per_SignedLadderCoord_PXLayer_3
clusters_per_SignedModuleCoord_per_SignedLadderCoord_PXLayer_4
digi_occupancy_per_SignedModuleCoord_per_SignedLadderCoord_PXLayer_1
digi_occupancy_per_SignedModuleCoord_per_SignedLadderCoord_PXLayer_2
digi_occupancy_per_SignedModuleCoord_per_SignedLadderCoord_PXLayer_3
digi_occupancy_per_SignedModuleCoord_per_SignedLadderCoord_PXLayer_4
clusterposition_xy_PXDisk_+1
clusterposition_xy_PXDisk_+2
clusterposition_xy_PXDisk_+3
clusterposition_xy_PXDisk_-1
clusterposition_xy_PXDisk_-2
clusterposition_xy_PXDisk_-3
clusters_per_SignedDiskCoord_per_SignedBladePanelCoord_PXRing_1
clusters_per_SignedDiskCoord_per_SignedBladePanelCoord_PXRing_2
digi_occupancy_per_SignedDiskCoord_per_SignedB
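
The comments in the cell above mention that you can write your own equivalent of `get_data_dirs` if your data is located elsewhere. A possible sketch is given here. The function name `get_my_data_dirs` and its `basedir` argument are hypothetical, and the directory naming pattern is simply copied from the 2D directories in the printout above; adapt it to how your own files are organized.

```python
import glob
import os
import tempfile

def get_my_data_dirs(basedir, year='2017', eras=None, dim=2):
    """Yield directories under basedir named DF<year><era>_<dim>D_Complete.

    Hypothetical replacement for csvu.get_data_dirs; the naming pattern is
    an assumption copied from the printout of the 2017 2D directories.
    """
    if eras is None:
        # no eras requested: take every single-letter era found on disk
        pattern = 'DF{}?_{}D_Complete'.format(year, dim)
        candidates = sorted(glob.glob(os.path.join(basedir, pattern)))
    else:
        candidates = [os.path.join(basedir, 'DF{}{}_{}D_Complete'.format(year, era, dim))
                      for era in eras]
    for path in candidates:
        if os.path.isdir(path):
            yield path

# small demonstration on a temporary directory tree
basedir = tempfile.mkdtemp()
for name in ['DF2017B_2D_Complete', 'DF2017C_2D_Complete', 'DF2018A_2D_Complete']:
    os.mkdir(os.path.join(basedir, name))

found_all = [os.path.basename(d) for d in get_my_data_dirs(basedir, year='2017', dim=2)]
found_b = [os.path.basename(d) for d in get_my_data_dirs(basedir, year='2017', eras=['B'], dim=2)]
print(found_all)
print(found_b)
```

Because the function is a generator, it can be wrapped in `list(...)` in the same way as the call to `csvu.get_data_dirs` in the cell above.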

In [None]:
# main reformatting of input csv files
# note that this function can take quite a while to run!

csvu.write_skimmed_csv(['clusterposition_zphi_ontrack_PXLayer_1'],'2017',eras=['B'],dim=2)

In [None]:
# extra: for 2D histograms, even the files per histogram type and per era might be too big to easily work with.
# this cell writes even smaller files for quicker testing

year = '2017'
era = 'B'
dim = 2 # dimension of histograms (1 or 2)
histname = 'clusterposition_zphi_ontrack_PXLayer_1'
datadirs = list(csvu.get_data_dirs(year=year,eras=[era],dim=dim)) 
datadir = datadirs[0]
csvfiles = csvu.sort_filenames(list(csvu.get_csv_files(datadir)))
print('number of csv files in {}: {}'.format(datadir,len(csvfiles)))
df = csvu.read_csv(csvfiles[0])
df = dfu.select_histnames(df,[histname])
df.to_csv('DF'+year+era+'_'+histname+'_subset.csv')
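
A side note on `to_csv` as used in the last line: by default pandas also writes the dataframe index as an unnamed first column, which shows up as `Unnamed: 0` when the file is read back with a plain `pd.read_csv` (compare the first column in the dataframe printout earlier in this tutorial). The sketch below demonstrates this on a toy dataframe and shows two standard pandas ways around it. Whether `csvu.read_csv` treats this column specially is not covered here.

```python
import io

import pandas as pd

# toy dataframe, the values are for illustration only
toy = pd.DataFrame({'fromrun': [297050, 297050], 'fromlumi': [1, 2]})

# default behaviour: the index is written as an unnamed first column
buf = io.StringIO()
toy.to_csv(buf)
buf.seek(0)
reread = pd.read_csv(buf)
print(list(reread.columns))

# option 1: when reading, tell pandas that the first column is the index
buf.seek(0)
reread_indexed = pd.read_csv(buf, index_col=0)

# option 2: when writing, do not store the index at all
buf_noindex = io.StringIO()
toy.to_csv(buf_noindex, index=False)
buf_noindex.seek(0)
reread_noindex = pd.read_csv(buf_noindex)
print(list(reread_noindex.columns))
```

Either option is fine as long as the reading and the writing side agree on it.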