<a href="https://colab.research.google.com/github/NadezhdaMalysheva/projects/blob/main/clf_x_ray_py_example.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# clf_x-ray.py guide

The environment for running the program can be created as follows:
`$ conda create --name <env> --file requirements.txt`

In [None]:
!python3 ./clf_x-ray.py -h

usage: clf_x-ray.py [-h] [--inputFile INPUTFILE] [--predictOn PREDICTON]
                    [--outputExt OUTPUTEXT] [-o OUTPUTDIR]
                    [--modelsSeries MODELSSERIES]
                    [--modelsIds MODELSIDS [MODELSIDS ...]]
                    [--keepModelsInMemory] [--chunkSize CHUNKSIZE]
                    [--njobs NJOBS]

Script to make classification of objects into stars, quasars and galaxy for objects with photometric data.

 List of models
 
   - gb - Gradient boosting models
   - tn - TabNet models
       - 18 - SDSS + WISE
       - 19 - PanSTARRS + WISE
       - 20 - SDSS + DESI LIS + WISE
       - 21 - PanSTARRS + DESI LIS + WISE
       - 22 - DESI LIS + WISE
       - 34 - SDSS + PanSTARRS + WISE
       - 35 - SDSS + PanSTARRS + DESI LIS + WISE

optional arguments:
  -h, --help            show this help message and exit
  --inputFile INPUTFILE
                        [type: str; default: "./x-ray_data.gz_pkl"] Path to input file.
  

# Example

In [None]:
!ls

 clf_x-ray.py		       models		  x-ray_data.gz_pkl
'clf_x-ray.py example.ipynb'   __pycache__
 clf_x-ray.zip		       requirements.txt


In [None]:
!python3 ./clf_x-ray.py --outputDir ./\
        --modelsSeries gb\
        --modelsIds 18 19 21 22 34 35

Preparing data: 100%|█████████████████████████████| 1/1 [00:13<00:00, 13.18s/it]
Predictions:   0%|                                        | 0/1 [00:00<?, ?it/s]
  0%|                                                     | 0/6 [00:00<?, ?it/s]
 17%|███████▌                                     | 1/6 [00:00<00:02,  2.04it/s]
 33%|███████████████                              | 2/6 [00:00<00:01,  2.40it/s]
 50%|██████████████████████▌                      | 3/6 [00:01<00:01,  2.58it/s]
 67%|██████████████████████████████               | 4/6 [00:01<00:00,  2.71it/s]
 83%|█████████████████████████████████████▌       | 5/6 [00:01<00:00,  3.02it/s]
100%|█████████████████████████████████████████████| 6/6 [00:01<00:00,  3.19it/s]
Predictions: 100%|████████████████████████████████| 1/1 [00:02<00:00,  2.22s/it]


Data is read from a single file, which can be specified explicitly with `--inputFile INPUTFILE`\
    (any of the formats: pkl_gz, pkl or fits)\
The result is saved to the directory given by `-o OUTPUTDIR, --outputDir OUTPUTDIR`\
The output format can also be set with `--outputExt OUTPUTEXT` (likewise: pkl_gz, pkl or fits)
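
The same call can also be assembled from Python, which may be convenient when the options come from a loop or a config. The sketch below only builds the argument list from flags that appear in the help output above; `build_command` and its parameter names are hypothetical and not part of `clf_x-ray.py`. Actually launching the command is left as a comment because it needs the script and the models on disk.

```python
import shlex


def build_command(output_dir, models_series, models_ids,
                  input_file=None, output_ext=None,
                  script="./clf_x-ray.py", python="python3"):
    """Assemble an argv list for clf_x-ray.py (hypothetical helper)."""
    cmd = [python, script]
    if input_file is not None:
        # Without --inputFile the script falls back to its own default path.
        cmd += ["--inputFile", str(input_file)]
    cmd += ["--outputDir", str(output_dir)]
    cmd += ["--modelsSeries", models_series]
    # --modelsIds takes several values after a single flag.
    cmd += ["--modelsIds", *(str(i) for i in models_ids)]
    if output_ext is not None:
        cmd += ["--outputExt", output_ext]
    return cmd


cmd = build_command("./", "gb", [18, 19, 21, 22, 34, 35], output_ext="fits")
print(shlex.join(cmd))
# To run it: import subprocess; subprocess.run(cmd, check=True)
```

Passing a list rather than a single string avoids shell quoting issues with paths that contain spaces.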

In [None]:
!ls

 clf_x-ray.py		       part-00000.predictions.gb.gz_pkl
'clf_x-ray.py example.ipynb'   __pycache__
 clf_x-ray.zip		       requirements.txt
 models			       x-ray_data.gz_pkl


The file `part-00000.predictions.gb.gz_pkl` has appeared; it stores the prediction results.

In [None]:
!python3 ./clf_x-ray.py\
        --inputFile ./x-ray_data.gz_pkl\
        --outputDir ./\
        --modelsSeries gb\
        --modelsIds 18 19 21 22 34 35\
        --outputExt fits

Preparing data: 100%|█████████████████████████████| 1/1 [00:13<00:00, 13.93s/it]
Predictions:   0%|                                        | 0/1 [00:00<?, ?it/s]
  0%|                                                     | 0/6 [00:00<?, ?it/s]
 17%|███████▌                                     | 1/6 [00:00<00:02,  1.85it/s]
 33%|███████████████                              | 2/6 [00:00<00:01,  2.21it/s]
 50%|██████████████████████▌                      | 3/6 [00:01<00:01,  2.46it/s]
 67%|██████████████████████████████               | 4/6 [00:01<00:00,  2.60it/s]
 83%|█████████████████████████████████████▌       | 5/6 [00:01<00:00,  2.92it/s]
100%|█████████████████████████████████████████████| 6/6 [00:01<00:00,  3.02it/s]
Predictions: 100%|████████████████████████████████| 1/1 [00:02<00:00,  2.35s/it]


In [None]:
!ls

 buf			       part-00000.predictions.gb.fits
 clf_x-ray.py		       part-00000.predictions.gb.gz_pkl
'clf_x-ray.py example.ipynb'   __pycache__
 clf_x-ray.zip		       requirements.txt
 data			       x-ray_data.gz_pkl
 models


In this case the current directory was used for output.\
The `models` folder containing the models must be located in the same directory as `clf_x-ray.py`.
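
Because the script expects its models next to itself, a small pre-flight check can catch a misplaced folder before a long run. This is only a sketch: the folder name `models` is taken from the `ls` output above, and `models_dir_for` is a hypothetical helper, not something the script provides.

```python
from pathlib import Path


def models_dir_for(script_path, folder_name="models"):
    """Return the directory where the models are expected (next to the script).

    folder_name is an assumption based on the directory listing in this guide.
    """
    return Path(script_path).resolve().parent / folder_name


expected = models_dir_for("./clf_x-ray.py")
if expected.is_dir():
    print(f"Found models directory: {expected}")
else:
    print(f"Missing models directory: {expected}")
```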

Let's look at the result.

In [None]:
import pandas as pd
file = './part-00000.predictions.gb.gz_pkl'
pred = pd.read_pickle(file, compression='gzip')

In [None]:
pred

__tempid__,ProbabilitySgb34,ProbabilityQgb34,ProbabilityGgb34,Labelgb34,ProbabilitySgb19,ProbabilityQgb19,ProbabilityGgb19,Labelgb19,ProbabilitySgb18,ProbabilityQgb18,...,ProbabilityGgb35,Labelgb35,ProbabilitySgb21,ProbabilityQgb21,ProbabilityGgb21,Labelgb21,ProbabilitySgb22,ProbabilityQgb22,ProbabilityGgb22,Labelgb22
0,0.004517,0.966577,0.028905,2.0,0.001897,0.989312,0.008791,2.0,0.000552,0.988873,...,0.015376,2.0,0.000256,0.995512,0.004232,2.0,0.001364,0.993743,0.004893,2.0
1,0.000827,0.998221,0.000952,2.0,0.001591,0.996217,0.002192,2.0,0.000189,0.999653,...,0.002154,2.0,0.000132,0.999748,0.000120,2.0,0.000591,0.998952,0.000456,2.0
2,,,,,,,,,0.013012,0.986682,...,,,,,,,0.000848,0.998655,0.000498,2.0
3,0.002541,0.995689,0.001769,2.0,0.001091,0.996747,0.002162,2.0,0.005976,0.993741,...,0.003867,2.0,0.000163,0.999522,0.000315,2.0,0.000551,0.998912,0.000537,2.0
4,0.001099,0.996892,0.002009,2.0,0.001039,0.997610,0.001351,2.0,0.000036,0.999627,...,0.003562,2.0,0.000190,0.999538,0.000273,2.0,0.000246,0.998956,0.000798,2.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19999,0.932117,0.064689,0.003193,1.0,0.877714,0.108891,0.013395,1.0,0.982566,0.016635,...,0.004279,1.0,0.945765,0.050512,0.003722,1.0,0.875394,0.029101,0.095505,1.0
20000,,,,,,,,,0.999948,0.000014,...,,,,,,,0.999162,0.000201,0.000637,1.0
20001,0.995363,0.003357,0.001279,1.0,0.994701,0.003620,0.001679,1.0,0.999166,0.000741,...,0.002931,1.0,0.999519,0.000438,0.000043,1.0,0.996967,0.000774,0.002259,1.0
20010,,,,,,,,,0.999835,0.000087,...,,,,,,,0.996267,0.000958,0.002775,1.0
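
The column names encode what each value is. Judging by the outputs in this guide, `ProbabilityS`, `ProbabilityQ` and `ProbabilityG` hold the class probabilities, the suffix combines the model series (`gb`) with the model id (`34`, `19`, ...), and `Label` appears to be 1 for stars, 2 for quasars and 3 for galaxies, since it matches the largest probability in the rows shown. The sketch below parses a column name under that assumption; `parse_column` and `CLASS_BY_LABEL` are illustrative names, not part of the script.

```python
import re

# Assumed naming scheme, inferred from the columns shown in this guide:
#   Probability<S|Q|G><series><id>   or   Label<series><id>
COLUMN_RE = re.compile(
    r"^(?:Probability(?P<cls>[SQG])|Label)(?P<series>[a-z]+)(?P<id>\d+)$"
)

# Assumed label encoding (consistent with the largest probability per row).
CLASS_BY_LABEL = {1: "star", 2: "quasar", 3: "galaxy"}


def parse_column(name):
    """Split a prediction column name into (kind, class letter, series, model id)."""
    m = COLUMN_RE.match(name)
    if m is None:
        return None  # not a prediction column, e.g. "index"
    kind = "Probability" if m.group("cls") else "Label"
    return kind, m.group("cls"), m.group("series"), int(m.group("id"))


print(parse_column("ProbabilityQgb34"))
print(parse_column("Labelgb22"))
print(CLASS_BY_LABEL[int(2.0)])
```

With a loaded `pred`, the columns of a single model could then be picked with a comprehension such as `[c for c in pred.columns if (p := parse_column(c)) and p[3] == 34]`.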


This is what the input data looked like:

In [None]:
file = './x-ray_data.gz_pkl'
df = pd.read_pickle(file, compression='gzip')

In [None]:
df

Unnamed: 0,nrow,ra_x,dec_x,__workxid__,ls_sep_input,ls_release,ls_brickid,ls_brickname,ls_objid,ls_brick_primary,...,phot_g_mean_flux_error,phot_bp_mean_flux,phot_bp_mean_flux_error,phot_rp_mean_flux,phot_rp_mean_flux_error,radial_velocity,radial_velocity_error,prlx_sn,pmra_sn,pmdec_sn
0,1,120.629122,0.892014,0,0.142104,8000.0,336610.0,b'1206p010',336.0,True,...,,,,,,,,,,
1,2,241.254494,1.110543,1,0.006765,8000.0,337093.0,b'2413p010',4915.0,True,...,,,,,,,,,,
2,3,197.494598,-1.191194,2,0.005828,8000.0,323957.0,b'1973m012',3162.0,True,...,,,,,,,,,,
3,4,197.599616,-1.115184,3,0.004966,8000.0,325398.0,b'1976m010',186.0,True,...,,,,,,,,,,
4,5,197.774153,-1.193348,4,0.029734,8000.0,323959.0,b'1978m012',4291.0,True,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20007,19970,308.144845,41.371215,19969,,,,,,,...,1.36648,7.860650e+01,5.25157,965.727,11.1723,,,1.389686,7.795380,8.013441
20008,19971,307.804983,41.365742,19970,,,,,,,...,2.03865,8.633350e+01,9.88436,1322.310,10.5896,,,1.507246,0.121212,8.100000
20009,19972,308.273874,41.530661,19971,,,,,,,...,13.90560,6.608010e+03,21.44660,28941.000,71.9122,,,11.411605,12.230337,25.614973
20010,19973,239.875610,25.920151,19972,0.294787,8000.0,476147.0,b'2398p260',623.0,True,...,27626.00000,1.264880e+06,29466.40000,6820180.000,70073.8000,-27.12,6.9,24.850410,64.923077,135.868132


Both tables are aligned on the same index, so they can be combined as follows:

In [None]:
pd.concat((df, pred), axis=1)

Unnamed: 0,nrow,ra_x,dec_x,__workxid__,ls_sep_input,ls_release,ls_brickid,ls_brickname,ls_objid,ls_brick_primary,...,ProbabilityGgb35,Labelgb35,ProbabilitySgb21,ProbabilityQgb21,ProbabilityGgb21,Labelgb21,ProbabilitySgb22,ProbabilityQgb22,ProbabilityGgb22,Labelgb22
0,1,120.629122,0.892014,0,0.142104,8000.0,336610.0,b'1206p010',336.0,True,...,0.015376,2.0,0.000256,0.995512,0.004232,2.0,0.001364,0.993743,0.004893,2.0
1,2,241.254494,1.110543,1,0.006765,8000.0,337093.0,b'2413p010',4915.0,True,...,0.002154,2.0,0.000132,0.999748,0.000120,2.0,0.000591,0.998952,0.000456,2.0
2,3,197.494598,-1.191194,2,0.005828,8000.0,323957.0,b'1973m012',3162.0,True,...,,,,,,,0.000848,0.998655,0.000498,2.0
3,4,197.599616,-1.115184,3,0.004966,8000.0,325398.0,b'1976m010',186.0,True,...,0.003867,2.0,0.000163,0.999522,0.000315,2.0,0.000551,0.998912,0.000537,2.0
4,5,197.774153,-1.193348,4,0.029734,8000.0,323959.0,b'1978m012',4291.0,True,...,0.003562,2.0,0.000190,0.999538,0.000273,2.0,0.000246,0.998956,0.000798,2.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
20007,19970,308.144845,41.371215,19969,,,,,,,...,,,,,,,,,,
20008,19971,307.804983,41.365742,19970,,,,,,,...,,,,,,,,,,
20009,19972,308.273874,41.530661,19971,,,,,,,...,,,,,,,,,,
20010,19973,239.875610,25.920151,19972,0.294787,8000.0,476147.0,b'2398p260',623.0,True,...,,,,,,,0.996267,0.000958,0.002775,1.0
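
`pd.concat(..., axis=1)` matches rows by index label rather than by position, which is why objects without predictions end up with `NaN` in the prediction columns. A toy illustration with made-up values (`df_toy` and `pred_toy` are stand-ins, not the real data):

```python
import pandas as pd

# Toy stand-ins for the input table and the predictions (values are made up).
df_toy = pd.DataFrame({"ra_x": [120.6, 241.3, 197.5]}, index=[0, 1, 2])
pred_toy = pd.DataFrame({"Labelgb22": [2.0, 1.0]}, index=[0, 2])  # no row for index 1

# axis=1 places the frames side by side and matches rows on the index.
combined = pd.concat((df_toy, pred_toy), axis=1)
print(combined)

# df.join is a left join on the index, so every row of df_toy is kept.
joined = df_toy.join(pred_toy)
print(joined)
```

`DataFrame.join` may be the safer choice when the predictions must never add rows that are absent from the input table.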


Let's see what the result looks like when the file is saved in the `fits` format.

In [None]:
import astropy.table  # import the submodule explicitly; a bare "import astropy" may not expose astropy.table

In [None]:
path = './part-00000.predictions.gb.fits'
table = astropy.table.Table.read(path)

In [None]:
table 

ProbabilitySgb34,ProbabilityQgb34,ProbabilityGgb34,Labelgb34,ProbabilitySgb19,ProbabilityQgb19,ProbabilityGgb19,Labelgb19,ProbabilitySgb18,ProbabilityQgb18,ProbabilityGgb18,Labelgb18,ProbabilitySgb35,ProbabilityQgb35,ProbabilityGgb35,Labelgb35,ProbabilitySgb21,ProbabilityQgb21,ProbabilityGgb21,Labelgb21,ProbabilitySgb22,ProbabilityQgb22,ProbabilityGgb22,Labelgb22,index
float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,int64
0.004517153746090492,0.9665774008848774,0.028905445369032053,2.0,0.0018972265351799536,0.9893120975014862,0.008790675963333775,2.0,0.0005523317741931157,0.9888733040607209,0.010574364165085899,2.0,0.0036911797961480352,0.9809332070864689,0.015375613117382926,2.0,0.0002557735095116674,0.995512180351634,0.0042320461388542984,2.0,0.0013642345010582194,0.9937431664778039,0.004892599021137743,2.0,0
0.0008274618692076662,0.9982207576821637,0.0009517804486286483,2.0,0.0015909807353964162,0.9962171971778292,0.0021918220867744444,2.0,0.000188560522559988,0.9996534314242556,0.00015800805318446593,2.0,0.0013946766483165665,0.996450979154073,0.0021543441976104617,2.0,0.00013226836693311744,0.9997477274980574,0.00012000413500943504,2.0,0.0005913111824108493,0.9989522692249341,0.00045641959265498,2.0,1
-9999.0,-9999.0,-9999.0,-9999.0,-9999.0,-9999.0,-9999.0,-9999.0,0.01301247247047298,0.9866815406687183,0.0003059868608087602,2.0,-9999.0,-9999.0,-9999.0,-9999.0,-9999.0,-9999.0,-9999.0,-9999.0,0.0008476149347479787,0.9986548702221874,0.000497514843064696,2.0,2
0.002541390659382013,0.9956892151987831,0.0017693941418350434,2.0,0.0010913386387490236,0.9967467071994395,0.0021619541618115726,2.0,0.005975951753950154,0.9937407941747555,0.0002832540712944147,2.0,0.0025215659477677214,0.9936110269838748,0.0038674070683574766,2.0,0.0001630031298456597,0.9995219207338587,0.0003150761362954477,2.0,0.0005510128638055663,0.9989121441062565,0.000536843029937949,2.0,3
0.0010986196046981601,0.9968921696016277,0.0020092107936740345,2.0,0.0010391443855870776,0.9976096282521011,0.0013512273623116757,2.0,3.625834146617351e-05,0.9996268275937249,0.0003369140648088682,2.0,0.0016498034806744658,0.9947879717659722,0.0035622247533531755,2.0,0.00018971387386087775,0.999537512127914,0.0002727739982251564,2.0,0.00024616774829422253,0.9989562637586312,0.0007975684930743524,2.0,4
0.0011307335449519925,0.9978433782375153,0.0010258882175326679,2.0,0.003876105227078257,0.9946354798746848,0.0014884148982369668,2.0,0.0003175149852707171,0.9995961576665987,8.632734813069802e-05,2.0,0.0023039174891613316,0.9951761227902157,0.0025199597206229875,2.0,0.0005920750676727836,0.9992338581248037,0.00017406680752358485,2.0,0.0005360588228320277,0.9988084505340737,0.0006554906430942213,2.0,5
-9999.0,-9999.0,-9999.0,-9999.0,-9999.0,-9999.0,-9999.0,-9999.0,0.00010081512421418674,0.9996279835700227,0.00027120130576304465,2.0,-9999.0,-9999.0,-9999.0,-9999.0,-9999.0,-9999.0,-9999.0,-9999.0,0.000376812687900545,0.998867877427464,0.0007553098846354085,2.0,6
0.009106927134470144,0.5262288798438275,0.4646641930217024,2.0,0.007925954983387476,0.405303704775533,0.5867703402410795,3.0,0.002710180581537692,0.7214250500186571,0.27586476939980514,2.0,0.008403684763169775,0.6867461128207902,0.30485020241604,2.0,0.0007338842980858997,0.6538615762873945,0.3454045394145197,2.0,0.017273189057305495,0.5178407999370869,0.46488601100560756,2.0,7
0.0008454695238425666,0.997487101128096,0.0016674293480614412,2.0,0.0014398539579403111,0.9966117369101893,0.0019484091318703245,2.0,0.0001525865182715834,0.9994633854289656,0.0003840280527628491,2.0,0.0014064618105571088,0.9962424238620986,0.0023511143273442676,2.0,0.00014918863981877963,0.9996696462712653,0.00018116508891587968,2.0,0.00033463788538259416,0.9992631912861105,0.00040217082850685213,2.0,8
0.001084215356654269,0.9973713321558481,0.0015444524874976143,2.0,0.0012920609126822152,0.9968517554040881,0.00185618368322972,2.0,0.0001127897901405793,0.9992941042616441,0.0005931059482154294,2.0,0.0015376729019740438,0.9959430796801182,0.0025192474179077224,2.0,0.0002522232345801351,0.9996812104143821,6.656635103792425e-05,2.0,0.0006165056678420902,0.9990413046336037,0.00034218969855422386,2.0,9


For `fits`, an `index` column is added to the file so that the index can be restored.
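
To get back the layout of the pickled predictions, the `index` column can be turned into the index again. Comparing the two outputs above also suggests that missing predictions (`NaN` in the pickle) are written as `-9999.0` in the FITS file; treating that value as a missing-data marker is an assumption based on those rows. A sketch with a hypothetical helper `restore_predictions`, shown on a small made-up frame:

```python
import numpy as np
import pandas as pd

MISSING = -9999.0  # assumed placeholder for NaN, inferred from the FITS output above


def restore_predictions(frame):
    """Use the saved `index` column as the index and map the placeholder back to NaN."""
    restored = frame.set_index("index")
    return restored.replace(MISSING, np.nan)


# Made-up rows shaped like the FITS table. With the real file one could start
# from `astropy.table.Table.read(path).to_pandas()` instead.
toy = pd.DataFrame({
    "ProbabilityQgb34": [0.97, MISSING],
    "Labelgb34": [2.0, MISSING],
    "index": [0, 2],
})
print(restore_predictions(toy))
```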