# Understanding the idle data investigation script

This script produces a list of files that have no source (no replica). They are identified by looking at blocks associated with a basis value of -6 (at least one file in the block has no source replica remaining).

These files keep piling up as idle data from subscriptions to different sites (https://cmsweb.cern.ch/phedex/prod/Activity::QueuePlots?graph=idle&entity=dest&dest_filter=T0%7CT1%7CT2_CH_CERN&no_mss=true&period=l12h&upto=&.submit=Update).

The script recovers all the blocks related to the subscriptions of a specific site, then retrieves the responsible files and checks their creation date. It produces a report of the datasets involved and to what extent they are affected (files with no source / total files in the dataset).

A list of the files, separated by "type" (whether they are 'data', 'mc', ...), is generated. This list is intended to be used to proceed with global invalidation or further investigation, depending on the type.

If all the files of a block or dataset have no source, a file listing such blocks and a file listing such datasets are generated. These lists allow the invalidation to be performed in bulk instead of file by file. Additionally, a list of no-source files is generated that excludes the files already covered by the block list or the dataset list, so that it can be used alongside a bulk invalidation.
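The selection rule described above can be sketched on toy data. This is only an illustration under stated assumptions, not the notebook's code: the `partition_nosource` helper, its input layout and the sample block and file names are all hypothetical.

```python
def partition_nosource(blocks, dataset_totals):
    """Split findings into whole datasets, whole blocks and leftover files.

    blocks: list of dicts with keys 'name', 'dataset', 'files' (all file
            names in the block) and 'ns_files' (file names with no source).
    dataset_totals: dict mapping dataset -> total number of files in it.
    (Hypothetical input layout, chosen for the example.)
    """
    # Count no-source files per dataset
    ns_per_dataset = {}
    for b in blocks:
        ns_per_dataset[b['dataset']] = (
            ns_per_dataset.get(b['dataset'], 0) + len(b['ns_files']))

    # A dataset is "whole" when every one of its files has no source
    whole_datasets = sorted(
        d for d, n in ns_per_dataset.items() if n == dataset_totals[d])

    whole_blocks, leftover_files = [], []
    for b in blocks:
        if b['dataset'] in whole_datasets:
            continue  # already covered by the dataset list
        if len(b['ns_files']) == len(b['files']):
            whole_blocks.append(b['name'])  # covered by the block list
        else:
            leftover_files.extend(b['ns_files'])  # handled file by file
    return whole_datasets, whole_blocks, leftover_files


blocks = [
    {'name': '/A/x/AOD#1', 'dataset': '/A/x/AOD',
     'files': ['a1', 'a2'], 'ns_files': ['a1', 'a2']},
    {'name': '/B/y/AOD#1', 'dataset': '/B/y/AOD',
     'files': ['b1'], 'ns_files': ['b1']},
    {'name': '/B/y/AOD#2', 'dataset': '/B/y/AOD',
     'files': ['b2', 'b3'], 'ns_files': ['b2']},
]
dataset_totals = {'/A/x/AOD': 2, '/B/y/AOD': 3}

whole_datasets, whole_blocks, leftover_files = partition_nosource(
    blocks, dataset_totals)
```

In this toy case dataset `/A/x/AOD` goes to the dataset list, block `/B/y/AOD#1` goes to the block list, and only `b2` remains in the per-file list.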

Finally, the deletions data service is queried for the involved blocks, and a file concatenating the retrieved information is generated. So far, no related deletion request has been found. If this starts happening, the script's functionality should be extended.

In [1]:
#from urllib2 import Request, urlopen
#import ssl
import json
import pandas as pd
from pandas.io.json import json_normalize
import requests
from requests.packages.urllib3.exceptions import InsecureRequestWarning
from datetime import datetime, timedelta
import re

In [2]:
from IPython.display import display

In [3]:
#Silence the insecure request warnings from the requests library
requests.packages.urllib3.disable_warnings(InsecureRequestWarning)

In [4]:
def dataset_from_blockname(blockname):
    '''
    Simple function to get the dataset name from the block name
    (the part before the '#' separator)
    '''
    return re.search('(.+)#', blockname).group(1)
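As a quick check of the helper above, the block name used later in this notebook can be split into its dataset part. The `rsplit` variant is shown only as an equivalent alternative and is not part of the original script.

```python
import re


def dataset_from_blockname(blockname):
    # Same regular expression as in the notebook: everything before the last '#'
    return re.search('(.+)#', blockname).group(1)


blockname = ('/BTagCSV/Run2016BBackfill-BACKFILL-v13/AOD'
             '#6c960820-8625-11e6-b16f-02163e0184a6')
dataset = dataset_from_blockname(blockname)

# Equivalent result without a regular expression
dataset_alt = blockname.rsplit('#', 1)[0]
```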

In [5]:
def check_datetime_Xweeks_older (timestamp, nweeks):
    '''
    checks if timestamp is older than certain amount of weeks
    '''
    time_limit = datetime.now() - timedelta(weeks=nweeks)
    return datetime.fromtimestamp(timestamp) <= time_limit
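The age check can be tried on timestamps built relative to the current time. The two sample timestamps below are made up for the illustration.

```python
import time
from datetime import datetime, timedelta


def check_datetime_Xweeks_older(timestamp, nweeks):
    # Same logic as the notebook function: compare against now minus nweeks
    time_limit = datetime.now() - timedelta(weeks=nweeks)
    return datetime.fromtimestamp(timestamp) <= time_limit


now = time.time()
two_weeks_ago = now - 14 * 24 * 3600  # Unix timestamp, in seconds
one_day_ago = now - 24 * 3600

old_enough = check_datetime_Xweeks_older(two_weeks_ago, nweeks=1)
too_recent = check_datetime_Xweeks_older(one_day_ago, nweeks=1)
```

Only the first timestamp passes the one-week threshold, so a file created a day ago would be dropped from the report.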

In [6]:
def get_nosource_files_info(block):
    '''
    Searches the replicas of a given block and returns the files that have no replica at all ("[]" in the replica field).
    The returned value is a dictionary holding a pandas dataframe with the metainfo of the no-source files,
    the count of no-source files and the total number of files considered for the block.
    Only files created more than 1 week ago are considered
    '''
    url = 'https://cmsweb.cern.ch/phedex/datasvc/json/prod/filereplicas'
    params = {"block": block}
    replicas_info = requests.get(url=url, params=params, verify=False).content
    replicas_json = json.loads(replicas_info)
    replicas_table = json_normalize(
        replicas_json['phedex']['block'][0]['file'])

    # Discards row entries of files with a creation date of one week or less
    replicas_table = replicas_table[replicas_table['time_create'].apply(
        check_datetime_Xweeks_older, nweeks=1)]

    num_files_in_block = len(replicas_table)
    no_source_files_table = replicas_table.loc[
        replicas_table.astype(str)['replica'] == "[]"]
    num_nosource_files_in_block = len(no_source_files_table)
    
    return {'df': no_source_files_table,
            'num_files_in_block': num_files_in_block,
            'num_nosource_files_in_block': num_nosource_files_in_block}

### Example of `get_nosource_files_info( block )` 

In [7]:
example_block = '/BTagCSV/Run2016BBackfill-BACKFILL-v13/AOD#6c960820-8625-11e6-b16f-02163e0184a6'
example_ns_files_info = get_nosource_files_info(example_block)
display(example_ns_files_info['df'])
print 'Number of non-source files: {} of {}'.format(example_ns_files_info['num_nosource_files_in_block'],
                                                    example_ns_files_info['num_files_in_block'])



Unnamed: 0,bytes,checksum,id,name,original_node,replica,time_create
0,2480029510,"adler32:b26c6dd3,cksum:1405059702",101458476,/store/data/Run2016BBackfill/BTagCSV/AOD/BACKF...,,[],1475142000.0
2,2446225392,"adler32:637d2e73,cksum:3564716832",101460134,/store/data/Run2016BBackfill/BTagCSV/AOD/BACKF...,,[],1475144000.0
3,2492112257,"adler32:52ce435c,cksum:2214022121",101463210,/store/data/Run2016BBackfill/BTagCSV/AOD/BACKF...,,[],1475146000.0
6,2387554713,"adler32:28874f74,cksum:1531599078",101482960,/store/data/Run2016BBackfill/BTagCSV/AOD/BACKF...,,[],1475165000.0
11,2322897779,"adler32:5cc6c949,cksum:1101638021",101487218,/store/data/Run2016BBackfill/BTagCSV/AOD/BACKF...,,[],1475169000.0
13,3397572478,"adler32:3d945dc3,cksum:4160566198",101511084,/store/data/Run2016BBackfill/BTagCSV/AOD/BACKF...,,[],1475189000.0
14,2359316290,"adler32:da333a33,cksum:522016462",101483764,/store/data/Run2016BBackfill/BTagCSV/AOD/BACKF...,,[],1475166000.0
15,2381648457,"adler32:d2053c21,cksum:1710051829",101479006,/store/data/Run2016BBackfill/BTagCSV/AOD/BACKF...,,[],1475162000.0
16,2383272277,"adler32:8bc8cfc7,cksum:2568201960",101478547,/store/data/Run2016BBackfill/BTagCSV/AOD/BACKF...,,[],1475161000.0
17,2428671269,"adler32:c02906d2,cksum:3865174811",101464007,/store/data/Run2016BBackfill/BTagCSV/AOD/BACKF...,,[],1475147000.0


Number of non-source files: 31 of 55
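The same selection can be reproduced offline on a hand-written payload, without pandas or network access. The nesting `phedex -> block -> file -> replica` follows what the function above reads. The sample file names, sizes and the `node` key inside a replica entry are assumptions made for the illustration.

```python
# Hypothetical payload mimicking the part of the filereplicas response
# that the notebook function reads
sample_response = {
    'phedex': {'block': [{
        'name': '/Prim/Proc/TIER#0000',
        'file': [
            {'name': '/store/data/a.root', 'bytes': 10, 'replica': []},
            {'name': '/store/data/b.root', 'bytes': 20,
             'replica': [{'node': 'T1_US_FNAL_Disk'}]},
            {'name': '/store/data/c.root', 'bytes': 30, 'replica': []},
        ],
    }]}
}

files = sample_response['phedex']['block'][0]['file']

# A file has no source when its replica list is empty
nosource = [f for f in files if not f['replica']]

summary = 'Number of non-source files: {} of {}'.format(
    len(nosource), len(files))
```

Testing the list for emptiness directly avoids the string comparison against `"[]"` that the dataframe version relies on.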


In [8]:
def site_nosource_files_df(site):
    '''
    Checks the blockarrive data service for blocks with basis = -6
    (basis -6 means: at least one file in the block has no source replica remaining).
    Returns a dictionary with a dataframe containing the metainformation of the no-source files of all blocks and
    a dataframe with the block-level metainformation of all the blocks with basis -6 for the evaluated site
    '''
    #Get the blockarrive info for the site, parsed into a pandas dataframe
    # pandas dataframe with the info at block level: blocks_arrive_table
    url = "https://cmsweb.cern.ch/phedex/datasvc/json/prod/blockarrive"
    params = {"to_node": site,"basis": -6}
    block_arrive_info = requests.get(url=url, params= params, verify=False).content

    block_arrive_info_json = json.loads(block_arrive_info)
    blocks_arrive_table = json_normalize(block_arrive_info_json['phedex']['block'])


    #Get the info of the no-source files and how many of the files in each block have no source
    # nosource_files_df: information of the no-source files, at file level
    # blocks_info_df: block-level information from blockarrive plus the per-block file counts
    ns_files_list = []
    block_counts = []
    for blockname in blocks_arrive_table['name']:
        nosource_files_info = get_nosource_files_info(blockname)
        # Copy to avoid modifying a slice of the replicas table
        nosource_files_df = nosource_files_info['df'].copy()
        nosource_files_df['blockname'] = blockname
        ns_files_list.append(nosource_files_df)
        block_counts.append([nosource_files_info['num_files_in_block'],
                             nosource_files_info['num_nosource_files_in_block']])

    nosource_files_df =  pd.concat(ns_files_list)
    nosource_files_df['datetime_create'] = nosource_files_df['time_create'].apply(datetime.fromtimestamp)
    nosource_files_df['dataset'] = nosource_files_df['blockname'].apply(dataset_from_blockname)

    blocks_info_df = pd.DataFrame(block_counts,
                                  columns=['num_files_in_block', 'num_nosource_files_in_block'])
    blocks_info_df = pd.concat([blocks_arrive_table,blocks_info_df], axis=1)

    return {'files_info': nosource_files_df, 'blocks_info': blocks_info_df}

### Example of `site_nosource_files_df( site )`

In [9]:
site = "T1_US_FNAL_Disk"

nosource_site_info = site_nosource_files_df(site)
display(nosource_site_info['files_info'])
display(nosource_site_info['blocks_info'])


Unnamed: 0,bytes,checksum,id,name,original_node,replica,time_create,blockname,datetime_create,dataset
0,4752817175,"adler32:c4e9fd9f,cksum:3982282292",83491612,/store/data/Run2015D/SingleMuon/RECO/16Dec2015...,,[],1.450709e+09,/SingleMuon/Run2015D-16Dec2015-v1/RECO#4252054...,2015-12-21 08:36:49.513020,/SingleMuon/Run2015D-16Dec2015-v1/RECO
0,2713083854,"adler32:a2f170e3,cksum:1986985331",104541926,/store/mc/RunIISummer16DR80Premix/DYJetsToNuNu...,,[],1.478732e+09,/DYJetsToNuNu_PtZ-250To400_TuneCUETP8M1_13TeV-...,2016-11-09 17:00:15.302690,/DYJetsToNuNu_PtZ-250To400_TuneCUETP8M1_13TeV-...
1,4006024947,"adler32:51435245,cksum:24716296",104543345,/store/mc/RunIISummer16DR80Premix/DYJetsToNuNu...,,[],1.478734e+09,/DYJetsToNuNu_PtZ-250To400_TuneCUETP8M1_13TeV-...,2016-11-09 17:32:53.108160,/DYJetsToNuNu_PtZ-250To400_TuneCUETP8M1_13TeV-...
2,2520895924,"adler32:6f16d7bd,cksum:3956406453",104541982,/store/mc/RunIISummer16DR80Premix/DYJetsToNuNu...,,[],1.478733e+09,/DYJetsToNuNu_PtZ-250To400_TuneCUETP8M1_13TeV-...,2016-11-09 17:03:36.459760,/DYJetsToNuNu_PtZ-250To400_TuneCUETP8M1_13TeV-...
3,3596321947,"adler32:9821dd1f,cksum:1277982894",104542900,/store/mc/RunIISummer16DR80Premix/DYJetsToNuNu...,,[],1.478734e+09,/DYJetsToNuNu_PtZ-250To400_TuneCUETP8M1_13TeV-...,2016-11-09 17:23:44.365570,/DYJetsToNuNu_PtZ-250To400_TuneCUETP8M1_13TeV-...
4,3157685787,"adler32:480a0164,cksum:3216692587",104543485,/store/mc/RunIISummer16DR80Premix/DYJetsToNuNu...,,[],1.478735e+09,/DYJetsToNuNu_PtZ-250To400_TuneCUETP8M1_13TeV-...,2016-11-09 17:36:14.130630,/DYJetsToNuNu_PtZ-250To400_TuneCUETP8M1_13TeV-...
5,2832391222,"adler32:7078d070,cksum:2613710556",104541686,/store/mc/RunIISummer16DR80Premix/DYJetsToNuNu...,,[],1.478732e+09,/DYJetsToNuNu_PtZ-250To400_TuneCUETP8M1_13TeV-...,2016-11-09 16:48:31.042490,/DYJetsToNuNu_PtZ-250To400_TuneCUETP8M1_13TeV-...
6,4031842227,"adler32:ac5cd63f,cksum:3697835106",104542898,/store/mc/RunIISummer16DR80Premix/DYJetsToNuNu...,,[],1.478734e+09,/DYJetsToNuNu_PtZ-250To400_TuneCUETP8M1_13TeV-...,2016-11-09 17:23:44.365570,/DYJetsToNuNu_PtZ-250To400_TuneCUETP8M1_13TeV-...
7,4025257470,"adler32:821b88c0,cksum:216284042",104544341,/store/mc/RunIISummer16DR80Premix/DYJetsToNuNu...,,[],1.478736e+09,/DYJetsToNuNu_PtZ-250To400_TuneCUETP8M1_13TeV-...,2016-11-09 17:59:47.331060,/DYJetsToNuNu_PtZ-250To400_TuneCUETP8M1_13TeV-...
8,4031585377,"adler32:c45a11bd,cksum:1953835665",104544573,/store/mc/RunIISummer16DR80Premix/DYJetsToNuNu...,,[],1.478737e+09,/DYJetsToNuNu_PtZ-250To400_TuneCUETP8M1_13TeV-...,2016-11-09 18:09:51.426270,/DYJetsToNuNu_PtZ-250To400_TuneCUETP8M1_13TeV-...


Unnamed: 0,bytes,dataset,destination,files,id,name,time_create,time_update,num_files_in_block,num_nosource_files_in_block
0,4752817175,/SingleMuon/Run2015D-16Dec2015-v1/RECO,"[{u'files': 1, u'name': u'T1_US_FNAL_Disk', u'...",1,5912292,/SingleMuon/Run2015D-16Dec2015-v1/RECO#4252054...,1450708000.0,1476532000.0,1,1
1,512873882415,/DYJetsToNuNu_PtZ-250To400_TuneCUETP8M1_13TeV-...,"[{u'files': 135, u'name': u'T1_US_FNAL_Disk', ...",135,7794552,/DYJetsToNuNu_PtZ-250To400_TuneCUETP8M1_13TeV-...,1478731000.0,1478797000.0,135,135
2,126221101,/BTag/None-v0/AOD,"[{u'files': 1, u'name': u'T1_US_FNAL_Disk', u'...",1,4574137,/BTag/None-v0/AOD#a95a28d0-1134-11e4-81d1-0022...,1405988000.0,1406054000.0,1,1
3,4553925469,/SingleNeutrino/RunIISummer16DR80-PUFlat0to50_...,"[{u'files': 2, u'name': u'T1_US_FNAL_Disk', u'...",2,7874253,/SingleNeutrino/RunIISummer16DR80-PUFlat0to50_...,1480413000.0,1480428000.0,2,2
4,14175442011,/SingleNeutrino/RunIISummer16DR80-PUFlat0to50_...,"[{u'files': 6, u'name': u'T1_US_FNAL_Disk', u'...",6,7870112,/SingleNeutrino/RunIISummer16DR80-PUFlat0to50_...,1480333000.0,1480427000.0,6,6
5,101697018661,/SingleNeutrino/RunIISummer16DR80-PUFlat0to50_...,"[{u'files': 32, u'name': u'T1_US_FNAL_Disk', u...",32,7869576,/SingleNeutrino/RunIISummer16DR80-PUFlat0to50_...,1480320000.0,1480427000.0,32,32
6,7707307007,/MuGunFlatPt2to8/TTI2023Upg14D-FLATBS15PU200_F...,"[{u'files': 4, u'name': u'T1_US_FNAL_Disk', u'...",4,7836394,/MuGunFlatPt2to8/TTI2023Upg14D-FLATBS15PU200_F...,1479598000.0,1496276000.0,4,1
7,172183097185,/QCD_Pt-15to7000_TuneCUETHS1_FlatP6_13TeV_herw...,"[{u'files': 63, u'name': u'T1_US_FNAL_Disk', u...",63,7871689,/QCD_Pt-15to7000_TuneCUETHS1_FlatP6_13TeV_herw...,1480362000.0,1480416000.0,63,63
8,3642742308,/BTagCSV/Run2016BBackfill-BACKFILL-v13/MINIAOD,"[{u'files': 3, u'name': u'T1_US_FNAL_Disk', u'...",3,7580925,/BTagCSV/Run2016BBackfill-BACKFILL-v13/MINIAOD...,1475114000.0,1475180000.0,3,2
9,138057876940,/BTagCSV/Run2016BBackfill-BACKFILL-v13/AOD,"[{u'files': 55, u'name': u'T1_US_FNAL_Disk', u...",55,7582879,/BTagCSV/Run2016BBackfill-BACKFILL-v13/AOD#6c9...,1475141000.0,1475207000.0,55,31


## Procedure for one site

In [10]:
#For one site

site =  "T1_US_FNAL_Disk"
nosource_site_info = site_nosource_files_df(site)

#file level df
ns_files_info = nosource_site_info['files_info']
#block level df
ns_blocks_info = nosource_site_info['blocks_info']

#Sum of the sizes reported for all the files found with no source
#This helps to compare with the idle data plot of the site whose subscriptions have been investigated
ns_files_size = ns_files_info['bytes'].sum() 
ns_files_size_TB = ns_files_size * (10**(-12))
#Num of files
num_ns_files = len(ns_files_info)
    
print 'For block subscriptions to {}, {} files were found to have no source.'.format(site, num_ns_files)
print 'Total size: {} TB'.format(ns_files_size_TB)



For block subscriptions to T1_US_FNAL_Disk, 274 files were found to have no source.
Total size: 0.891015672335 TB


In [11]:
ns_files_info

Unnamed: 0,bytes,checksum,id,name,original_node,replica,time_create,blockname,datetime_create,dataset
0,4752817175,"adler32:c4e9fd9f,cksum:3982282292",83491612,/store/data/Run2015D/SingleMuon/RECO/16Dec2015...,,[],1.450709e+09,/SingleMuon/Run2015D-16Dec2015-v1/RECO#4252054...,2015-12-21 08:36:49.513020,/SingleMuon/Run2015D-16Dec2015-v1/RECO
0,2713083854,"adler32:a2f170e3,cksum:1986985331",104541926,/store/mc/RunIISummer16DR80Premix/DYJetsToNuNu...,,[],1.478732e+09,/DYJetsToNuNu_PtZ-250To400_TuneCUETP8M1_13TeV-...,2016-11-09 17:00:15.302690,/DYJetsToNuNu_PtZ-250To400_TuneCUETP8M1_13TeV-...
1,4006024947,"adler32:51435245,cksum:24716296",104543345,/store/mc/RunIISummer16DR80Premix/DYJetsToNuNu...,,[],1.478734e+09,/DYJetsToNuNu_PtZ-250To400_TuneCUETP8M1_13TeV-...,2016-11-09 17:32:53.108160,/DYJetsToNuNu_PtZ-250To400_TuneCUETP8M1_13TeV-...
2,2520895924,"adler32:6f16d7bd,cksum:3956406453",104541982,/store/mc/RunIISummer16DR80Premix/DYJetsToNuNu...,,[],1.478733e+09,/DYJetsToNuNu_PtZ-250To400_TuneCUETP8M1_13TeV-...,2016-11-09 17:03:36.459760,/DYJetsToNuNu_PtZ-250To400_TuneCUETP8M1_13TeV-...
3,3596321947,"adler32:9821dd1f,cksum:1277982894",104542900,/store/mc/RunIISummer16DR80Premix/DYJetsToNuNu...,,[],1.478734e+09,/DYJetsToNuNu_PtZ-250To400_TuneCUETP8M1_13TeV-...,2016-11-09 17:23:44.365570,/DYJetsToNuNu_PtZ-250To400_TuneCUETP8M1_13TeV-...
4,3157685787,"adler32:480a0164,cksum:3216692587",104543485,/store/mc/RunIISummer16DR80Premix/DYJetsToNuNu...,,[],1.478735e+09,/DYJetsToNuNu_PtZ-250To400_TuneCUETP8M1_13TeV-...,2016-11-09 17:36:14.130630,/DYJetsToNuNu_PtZ-250To400_TuneCUETP8M1_13TeV-...
5,2832391222,"adler32:7078d070,cksum:2613710556",104541686,/store/mc/RunIISummer16DR80Premix/DYJetsToNuNu...,,[],1.478732e+09,/DYJetsToNuNu_PtZ-250To400_TuneCUETP8M1_13TeV-...,2016-11-09 16:48:31.042490,/DYJetsToNuNu_PtZ-250To400_TuneCUETP8M1_13TeV-...
6,4031842227,"adler32:ac5cd63f,cksum:3697835106",104542898,/store/mc/RunIISummer16DR80Premix/DYJetsToNuNu...,,[],1.478734e+09,/DYJetsToNuNu_PtZ-250To400_TuneCUETP8M1_13TeV-...,2016-11-09 17:23:44.365570,/DYJetsToNuNu_PtZ-250To400_TuneCUETP8M1_13TeV-...
7,4025257470,"adler32:821b88c0,cksum:216284042",104544341,/store/mc/RunIISummer16DR80Premix/DYJetsToNuNu...,,[],1.478736e+09,/DYJetsToNuNu_PtZ-250To400_TuneCUETP8M1_13TeV-...,2016-11-09 17:59:47.331060,/DYJetsToNuNu_PtZ-250To400_TuneCUETP8M1_13TeV-...
8,4031585377,"adler32:c45a11bd,cksum:1953835665",104544573,/store/mc/RunIISummer16DR80Premix/DYJetsToNuNu...,,[],1.478737e+09,/DYJetsToNuNu_PtZ-250To400_TuneCUETP8M1_13TeV-...,2016-11-09 18:09:51.426270,/DYJetsToNuNu_PtZ-250To400_TuneCUETP8M1_13TeV-...


In [12]:
def count_file_type (ns_files_df):
    '''
    From a non-source files dataframe get the number of files that
    are of 'data', 'mc' or 'others' type.
    '''
    counts = {}
    # Use the dataframe passed as argument, not the global ns_files_info
    for filepath in ns_files_df['name']:
        if re.search('^/store/data', filepath):
            counts['data'] =  counts.get('data', 0) + 1
        elif re.search('^/store/mc', filepath):
            counts['mc'] =  counts.get('mc', 0) + 1
        else:
            counts['others'] =  counts.get('others', 0) + 1
    return counts

In [13]:
#Files types diversity 
count_file_type_dic = count_file_type(ns_files_info)

for file_type, count in count_file_type_dic.iteritems():
    print str(count) + " " + file_type + " at " + site

35 data at T1_US_FNAL_Disk
239 mc at T1_US_FNAL_Disk
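The per-type counting can also be written with `collections.Counter`. The sketch below is an alternative formulation, not the notebook's implementation; the `file_type_of` helper and the sample paths are hypothetical.

```python
from collections import Counter


def file_type_of(filepath):
    # Classify a logical file name by its /store/<kind> prefix
    if filepath.startswith('/store/data'):
        return 'data'
    if filepath.startswith('/store/mc'):
        return 'mc'
    return 'others'


# Made-up paths, one per line, for the illustration
sample_paths = [
    '/store/data/Run2015D/SingleMuon/RECO/f1.root',
    '/store/mc/RunIISummer16DR80Premix/Sample/AODSIM/f2.root',
    '/store/mc/RunIISummer16DR80Premix/Sample/AODSIM/f3.root',
    '/store/unmerged/f4.root',
]

counts = Counter(file_type_of(p) for p in sample_paths)
```

`Counter` removes the repeated `get(key, 0) + 1` bookkeeping while producing the same mapping from type to number of files.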


In [14]:
def datasets_count_hash(blocks_file_list):
    '''
    From a list of the blocks identified with basis -6,
    produces a dictionary with the involved datasets as keys and the number of blocks (with basis -6) as values
    '''
    datasets_count = {}
    for i in blocks_file_list:
        dataset = dataset_from_blockname(str(i))
        datasets_count[dataset] = datasets_count.get(dataset, 0) + 1
    return datasets_count

In [15]:
#Datasets and corresponding number of blocks involved (basis -6)
datasets_count = datasets_count_hash(ns_blocks_info['name'].tolist())
datasets_count

{'/BTag/None-v0/AOD': 1,
 '/BTagCSV/Run2016BBackfill-BACKFILL-v13/AOD': 1,
 '/BTagCSV/Run2016BBackfill-BACKFILL-v13/MINIAOD': 1,
 '/DYJetsToNuNu_PtZ-250To400_TuneCUETP8M1_13TeV-amcatnloFXFX-pythia8/RunIISummer16DR80Premix-PUMoriond17_80X_mcRun2_asymptotic_2016_TrancheIV_v6-v1/AODSIM': 1,
 '/MuGunFlatPt2to8/TTI2023Upg14D-FLATBS15PU200_FLATBS15_DES23_62_V1-v3/GEN-SIM-DIGI-RAW': 1,
 '/QCD_Pt-15to7000_TuneCUETHS1_FlatP6_13TeV_herwigpp/RunIISummer16DR80-PUFlat0to50_80X_mcRun2_asymptotic_2016_TrancheIV_v6-v1/AODSIM': 1,
 '/SingleMuon/Run2015D-16Dec2015-v1/RECO': 1,
 '/SingleNeutrino/RunIISummer16DR80-PUFlat0to50_magnetOff_80X_mcRun2_asymptotic_2016_TrancheIV_v6-v3/AODSIM': 3}

In [16]:
def counts_for_dataset(dataset):
    '''
    For a dataset, retrieves relevant summary information, i.e.,
    returns total number of files and total number of blocks
    '''
    url = 'https://cmsweb.cern.ch/phedex/datasvc/json/prod/data'
    params = {"dataset": dataset, "level": 'block'}
    data_info = requests.get(url=url, params= params, verify=False).content
    data_json = json.loads(data_info)
    data_table = json_normalize(data_json['phedex']['dbs'][0]['dataset'][0]['block'])
    return {'num_files': data_table['files'].sum(), 'num_blocks': len(data_table)}

In [17]:
#Get total number of files and total number of blocks
datasets_total = {}
for dataset in datasets_count:
    datasets_total[dataset] = counts_for_dataset(dataset)

In [18]:
datasets_total

{'/BTag/None-v0/AOD': {'num_blocks': 4, 'num_files': 20},
 '/BTagCSV/Run2016BBackfill-BACKFILL-v13/AOD': {'num_blocks': 22,
  'num_files': 565},
 '/BTagCSV/Run2016BBackfill-BACKFILL-v13/MINIAOD': {'num_blocks': 16,
  'num_files': 159},
 '/DYJetsToNuNu_PtZ-250To400_TuneCUETP8M1_13TeV-amcatnloFXFX-pythia8/RunIISummer16DR80Premix-PUMoriond17_80X_mcRun2_asymptotic_2016_TrancheIV_v6-v1/AODSIM': {'num_blocks': 21,
  'num_files': 2217},
 '/MuGunFlatPt2to8/TTI2023Upg14D-FLATBS15PU200_FLATBS15_DES23_62_V1-v3/GEN-SIM-DIGI-RAW': {'num_blocks': 1,
  'num_files': 4},
 '/QCD_Pt-15to7000_TuneCUETHS1_FlatP6_13TeV_herwigpp/RunIISummer16DR80-PUFlat0to50_80X_mcRun2_asymptotic_2016_TrancheIV_v6-v1/AODSIM': {'num_blocks': 1,
  'num_files': 63},
 '/SingleMuon/Run2015D-16Dec2015-v1/RECO': {'num_blocks': 81,
  'num_files': 7818},
 '/SingleNeutrino/RunIISummer16DR80-PUFlat0to50_magnetOff_80X_mcRun2_asymptotic_2016_TrancheIV_v6-v3/AODSIM': {'num_blocks': 9,
  'num_files': 214}}
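The totals above come from summing the per-block `files` field of the `data` service response. The following offline sketch applies the same reduction to a hand-written payload; the nesting matches what `counts_for_dataset` reads, while the dataset and block names are invented.

```python
# Hypothetical payload with the nesting read by counts_for_dataset
sample_response = {'phedex': {'dbs': [{'dataset': [{
    'name': '/Prim/Proc/TIER',
    'block': [
        {'name': '/Prim/Proc/TIER#0001', 'files': 3},
        {'name': '/Prim/Proc/TIER#0002', 'files': 5},
    ],
}]}]}}

blocks = sample_response['phedex']['dbs'][0]['dataset'][0]['block']

# Total files is the sum over blocks; total blocks is the list length
counts = {'num_files': sum(b['files'] for b in blocks),
          'num_blocks': len(blocks)}
```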

In [19]:
def dataset_kind_is(dataset):
    '''
    From dataset name returns the type of file it contains: 'data',  'mc'....
    '''
    url = 'https://cmsweb.cern.ch/phedex/datasvc/json/prod/filereplicas'
    params = {"dataset": dataset}
    replicas_info = requests.get(url=url, params= params, verify=False).content
    replicas_json = json.loads(replicas_info)
    replicas_sample_table = json_normalize(replicas_json['phedex']['block'][0]['file'])
    return (str(replicas_sample_table['name'][0]).split("/")[2])

In [20]:
for i in datasets_total:
    print dataset_kind_is(i)

mc
data
mc
mc
data
data
data
mc


In [21]:
def check_filetype (filepath, file_type):
    '''
    check if a file is of a specific category: 'mc',  'data' and 'other'
    '''
    return bool(re.search('^/store/' + file_type, filepath))

In [22]:
def write_nosource_files_site (site, file_type, list_files):
    filename = "nosource_files_" + file_type + "_" + site + ".txt"

    with open(filename, "w") as f:
        f.write("\n".join(list_files))

def write_nosource_block_list (site, file_type, list_blocks):
    filename = "nosource_blocks_" + file_type + "_" + site + ".txt"

    with open(filename, "w") as f:
        f.write("\n".join(list_blocks))

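def write_nosource_dataset_list (site, file_type, list_datasets):
    filename = "nosource_datasets_" + file_type + "_" + site + ".txt"

    # Write one dataset per line, as the other two writers do
    with open(filename, "w") as f:
        f.write("\n".join(list_datasets))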

In [23]:

for file_type in count_file_type_dic.keys():
    print ('Type of files: ' + file_type)
    whole_blocks_ns = []
    whole_datasets_ns = []
    for dataset in datasets_count:
        if (dataset_kind_is(dataset) == file_type):
            n_files = datasets_total[dataset]['num_files']
            ns_blocks_info_subset = ns_blocks_info[ns_blocks_info.dataset == dataset]
            num_ns_files = ns_blocks_info_subset.num_nosource_files_in_block.sum()
            num_ns_blocks = datasets_count[dataset]
            num_tot_blocks = datasets_total[dataset]['num_blocks']
            print 'Dataset: {}'.format(dataset)
            print 'Num of blocks with no source files: {}/{}'.format(num_ns_blocks, num_tot_blocks)
            print 'Num of no source files: {}/{}'.format(num_ns_files, n_files)
            print 'Num of no source files by block:'
            # The whole dataset has no source: every block and every file is affected
            dataset_is_whole_ns = (num_ns_blocks == num_tot_blocks and num_ns_files == n_files)
            if dataset_is_whole_ns:
                whole_datasets_ns.append(dataset)

            for idx, row in ns_blocks_info_subset.iterrows():
                print ' {} {}/{}'.format(row['name'], row.num_nosource_files_in_block, row.num_files_in_block)
                # A whole no-source block is listed only if its dataset is not already listed
                if (row.num_nosource_files_in_block == row.num_files_in_block and not dataset_is_whole_ns):
                    whole_blocks_ns.append(row['name'])
            print '=' * 100
    # All the no-source files of this type
    list_whole_ns_files = ns_files_info[ns_files_info.name.apply(check_filetype, file_type = file_type)]['name'].tolist()
    # No-source files of this type not covered by the whole-dataset or whole-block lists
    nosource_files_site_df = ns_files_info[~ns_files_info.dataset.isin(whole_datasets_ns) & ~ns_files_info.blockname.isin(whole_blocks_ns)]
    list_ns_files = [ns_file for ns_file in nosource_files_site_df['name'].tolist() if check_filetype(ns_file, file_type)]
    
    print '=' * 100
    print '=' * 100
    print list_whole_ns_files
    print '=' * 100
    print whole_blocks_ns
    print '=' * 100
    print whole_datasets_ns
    print '=' * 100

    write_nosource_block_list(site, file_type, whole_blocks_ns)
    write_nosource_dataset_list(site, file_type, whole_datasets_ns)
    write_nosource_files_site(site, file_type, list_ns_files)
    write_nosource_files_site(site, file_type + '_whole', list_whole_ns_files)

Type of files: data
Dataset: /SingleMuon/Run2015D-16Dec2015-v1/RECO
Num of blocks with no source files: 1/81
Num of no source files: 1/7818
Num of no source files by block:
 /SingleMuon/Run2015D-16Dec2015-v1/RECO#42520542-a7ef-11e5-b857-a0369f23cf8a 1/1
Dataset: /BTagCSV/Run2016BBackfill-BACKFILL-v13/AOD
Num of blocks with no source files: 1/22
Num of no source files: 31/565
Num of no source files by block:
 /BTagCSV/Run2016BBackfill-BACKFILL-v13/AOD#6c960820-8625-11e6-b16f-02163e0184a6 31/55
Dataset: /BTag/None-v0/AOD
Num of blocks with no source files: 1/4
Num of no source files: 1/20
Num of no source files by block:
 /BTag/None-v0/AOD#a95a28d0-1134-11e4-81d1-00221959e777 1/1
Dataset: /BTagCSV/Run2016BBackfill-BACKFILL-v13/MINIAOD
Num of blocks with no source files: 1/16
Num of no source files: 2/159
Num of no source files by block:
 /BTagCSV/Run2016BBackfill-BACKFILL-v13/MINIAOD#3d3fd31e-85e7-11e6-b16f-02163e0184a6 2/3
[u'/store/data/Run2015D/SingleMuon/RECO/16Dec2015-v1/10015/AC06B

In [24]:
def search_deletions (block):
    '''
    Queries the deletions data service for a block and returns the parsed JSON response
    '''
    url = 'https://cmsweb.cern.ch/phedex/datasvc/json/prod/deletions'
    params = {"block": block}
    deletions_info = requests.get(url=url, params= params, verify=False).content
    deletions_json = json.loads(deletions_info)
    return deletions_json

In [25]:
def write_deletions_json (site, deletions_json_list):
    '''
    Writes the deletions responses to a text file, one JSON document per line
    '''
    filename = "nosource_blocks_deletions_jsons_" + site + ".txt"

    with open(filename, "w") as f:
        # json.dumps keeps each line parseable as JSON
        f.write("\n".join(json.dumps(d) for d in deletions_json_list))

In [26]:
deletions_json = []
#Query each block only once: ns_files_info has one row per file
for j in ns_files_info.blockname.unique():
    deletion_json = search_deletions(j)
    deletions_json.append(deletion_json)

write_deletions_json(site, deletions_json)
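Should deletion requests start to appear, a first extension could flag which responses are non-empty. The sketch below is hypothetical: the `has_deletion_request` helper is not part of the script, and the assumption that matching deletions are listed under a `dataset` key inside `phedex` should be checked against a real response before relying on it.

```python
def has_deletion_request(deletions_json, key='dataset'):
    # `key` is an ASSUMPTION about where the service lists matching deletions;
    # adjust it after inspecting an actual response
    return len(deletions_json.get('phedex', {}).get(key, [])) > 0


# Hand-written payloads for the illustration
empty_response = {'phedex': {'dataset': []}}
found_response = {'phedex': {'dataset': [
    {'name': '/Prim/Proc/TIER', 'block': []},
]}}

flags = [has_deletion_request(r) for r in (empty_response, found_response)]
```

Applied to the `deletions_json` list collected above, such a check would tell at a glance whether any block needs a closer look.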


