This repository has been archived by the owner on Aug 26, 2021. It is now read-only.

Tag by folder #205

Closed
s1rk1t opened this issue Dec 17, 2018 · 13 comments

Comments

s1rk1t commented Dec 17, 2018

I'm trying to follow what happened in Issue #175 but am unable to reproduce his results.

Here's my code:

def AutoTagAmbarFile(self, AmbarFile):
    self.SetOCRTag(AmbarFile)
    self.SetSourceIdTag(AmbarFile)
    self.SetArchiveTag(AmbarFile)
    self.SetImageTag(AmbarFile)
    self.SetFolderTag(AmbarFile)

Followed by this:

def SetFolderTag(self, AmbarFile):
    if('folderName' in AmbarFile['meta']['full_name']):
        self.AddTagToAmbarFile(AmbarFile['file_id'], AmbarFile['meta']['full_name'],
                               self.AUTO_TAG_TYPE, 'folderName')

I've tried altering a pre-existing tag, as the poster in Issue #175 did, but was unable to see any change after I rebuilt the Pipeline image, pulled the new image, and spun up a new instance of AMBAR. I've also tried clearing my browser cache, as that had caused issues in the past, but there was no change.

Is there somewhere else I need to change some code in order for the new tag to show up on the search page?

Thanks in advance for any help you can offer!

Member

sochix commented Dec 18, 2018

Everything looks good.

Check in debug mode that your condition
if('folderName' in AmbarFile['meta']['full_name']):
works properly.
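
One way to check that condition without rebuilding the pipeline image is to run it against a hand-built dictionary. The sketch below assumes only that AmbarFile behaves like a nested dict with a ['meta']['full_name'] string, as in the snippet above. The helper name folder_condition_matches and the sample paths are made up for illustration.

```python
def folder_condition_matches(ambar_file, folder_name):
    """Return True when folder_name occurs anywhere in the file's full_name.

    This mirrors the `in` test from the issue: a plain, case-sensitive
    substring match on the whole path string.
    """
    return folder_name in ambar_file['meta']['full_name']


# Hypothetical stand-ins for the dict the pipeline passes in.
sample_hit = {'meta': {'full_name': '//crawler/folderName/report.pdf'}}
sample_miss = {'meta': {'full_name': '//crawler/foldername/report.pdf'}}

print(folder_condition_matches(sample_hit, 'folderName'))   # True
print(folder_condition_matches(sample_miss, 'folderName'))  # False: case differs
```

Because the test is a substring match, it also fires when the name appears in the file name itself rather than in a folder segment, which may or may not be what is wanted.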

Author

s1rk1t commented Dec 18, 2018

So I tried this:

def SetFolderNameTag(self, AmbarFile):
    fileString = AmbarFile['meta']['full_name']
    self.logger.LogMessage('verbose', '{0} is full_name'.format(fileString))
    if('folderName' in fileString):
        self.AddTagToAmbarFile(AmbarFile['file_id'], AmbarFile['meta']['full_name'], self.AUTO_TAG_TYPE, 'folderName')

and this:

def SetFolderNameTag(self, AmbarFile):
    fileString = AmbarFile['meta']['full_name']
    if('folderName' in fileString):
        self.logger.LogMessage('verbose', 'folderName is in {0}'.format(fileString))
        self.AddTagToAmbarFile(AmbarFile['file_id'], AmbarFile['meta']['full_name'], self.AUTO_TAG_TYPE, 'folderName')

but after using the 'sudo docker logs pipelineContainerID' command the output was this:

Dec 18, 2018 2:04:31 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed.
See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
for optional dependencies.

Dec 18, 2018 2:04:31 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: Tesseract OCR is installed and will be automatically applied to image files unless
you've excluded the TesseractOCRParser from the default parser.
Tesseract may dramatically slow down content extraction (TIKA-2359).
As of Tika 1.15 (and prior versions), Tesseract is automatically called.
In future versions of Tika, users may need to turn the TesseractOCRParser on via TikaConfig.
Dec 18, 2018 2:04:31 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: org.xerial's sqlite-jdbc is not loaded.
Please provide the jar on your classpath to parse sqlite files.
See tika-parsers/pom.xml for the correct version.

Is that the correct command to view the proper log?

I ask because on line 96 of autotagging.py there is this statement:

self.logger.LogMessage('verbose', '{0} tag added to {1}'.format(Tag, FullName))

but I am not seeing any of that output in the log above.

EDIT:

So now (after using docker-compose down and reloading the images) I am getting this in the log (after the previously stated output):

2018-12-18 14:50:09.066066: [info] [0] started
2018-12-18 14:50:09.107822: [info] [0] connecting to Rabbit amqp://rabbit...
2018-12-18 14:50:09.204385: [info] [0] connected to Rabbit!
2018-12-18 14:50:09.220989: [info] [0] waiting for messages...

2018-12-18 14:51:09.128585: [verbose] [0] add task received for (then comes the full_name data)
2018-12-18 14:51:09.151687: [verbose] [0] meta found for (again, the full_name data)

This second two-line chunk repeats many times, presumably once for each time the new tag is supposed to be applied.

After grepping for the text ('meta found for') from that output, it looks like it's coming from the pipeline.py file, specifically lines 78 and 113.
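
To confirm whether the tagging message ever shows up, the saved output of docker logs can be filtered for it. This is a rough sketch, assuming the pipeline lines keep the "timestamp: [level] [worker] message" shape quoted above. The function name find_messages and the placeholder path in the sample are assumptions, not part of Ambar.

```python
import re

# Matches lines such as:
# 2018-12-18 14:51:09.128585: [verbose] [0] add task received for <full_name>
LOG_LINE = re.compile(r'^(\S+ \S+): \[(\w+)\] \[(\d+)\] (.*)$')


def find_messages(log_text, needle):
    """Return (timestamp, level, message) for each log line whose message contains needle."""
    matches = []
    for line in log_text.splitlines():
        parsed = LOG_LINE.match(line.strip())
        # Lines in other formats (for example the Tika warnings) do not match and are skipped.
        if parsed and needle in parsed.group(4):
            matches.append((parsed.group(1), parsed.group(2), parsed.group(4)))
    return matches


# Sample built from the line formats quoted above; the path is a placeholder.
sample_log = """\
2018-12-18 14:50:09.220989: [info] [0] waiting for messages...
2018-12-18 14:51:09.128585: [verbose] [0] add task received for //crawler/folder/file.pdf
2018-12-18 14:51:09.151687: [verbose] [0] meta found for //crawler/folder/file.pdf
"""

tag_lines = find_messages(sample_log, 'tag added to')
print(len(tag_lines))  # 0: no tagging message appears in this sample
```

An empty result for 'tag added to' alongside hits for 'meta found for' would suggest the running container receives files but never reaches the tagging call.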

Thanks again for your help!

Member

sochix commented Dec 18, 2018

Did you crawl a file with 'folderName' in the path?

Author

s1rk1t commented Dec 18, 2018

Yes.

Member

sochix commented Dec 18, 2018

Can you please post the full path here as an example?

Author

s1rk1t commented Dec 18, 2018

Sure,

//mycrawler/outerFolder/subFolder/testDocument.pdf

folder name is outerFolder
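
If the aim is to tag by whichever folder a file sits in, rather than by one hard-coded name, the folder could be derived from full_name. A minimal sketch, assuming paths shaped like the example above (//crawler/folder/.../file); top_level_folder is a hypothetical helper and not part of Ambar.

```python
def top_level_folder(full_name):
    """Return the first folder after the crawler name, or None if there is none.

    Assumes a path shaped like //crawler/folder/.../file.
    """
    # Drop the empty segments produced by the leading '//'.
    parts = [segment for segment in full_name.split('/') if segment]
    # parts[0] is the crawler and parts[-1] is the file,
    # so a folder exists only when there are at least three parts.
    if len(parts) < 3:
        return None
    return parts[1]


print(top_level_folder('//mycrawler/outerFolder/subFolder/testDocument.pdf'))  # outerFolder
print(top_level_folder('//mycrawler/testDocument.pdf'))  # None
```

The returned value could then be passed as the tag name in place of a fixed string, with a None check before tagging.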

Member

sochix commented Dec 18, 2018

So your code snippet is:

def SetFolderNameTag(self, AmbarFile):
  fileString = AmbarFile['meta']['full_name']
  if('outerFolder' in fileString):
    self.logger.LogMessage('verbose', 'outerFolder is in {0}'.format(fileString))
    self.AddTagToAmbarFile(AmbarFile['file_id'], AmbarFile['meta']['full_name'], self.AUTO_TAG_TYPE, 'outerFolder')

Am I right?

Author

s1rk1t commented Dec 18, 2018

Yes, that looks right.

Member

sochix commented Dec 18, 2018

Did you change the Ambar pipeline image source in the docker-compose file?

Author

s1rk1t commented Dec 18, 2018

Yes

Member

sochix commented Dec 18, 2018

Can you share your docker-compose file, please?

Author

s1rk1t commented Dec 18, 2018

I think I may have figured it out. Once I ran docker's prune command I was able to see a change in the tag (I had changed ocr to ocr-test like the poster did in Issue #175). Rerunning it now to see if the new tags show up.

Author

s1rk1t commented Dec 18, 2018

Yep, that did it. It's working as expected now.

Thanks so much for your help!

s1rk1t closed this as completed Dec 18, 2018