Support the identification of tweet contents #111

Open
mazz opened this issue Mar 30, 2015 · 4 comments
Comments

@mazz

mazz commented Mar 30, 2015

I've noticed that Cahoots doesn't process #hashtags, @callouts, or URLs in tweets very well.

For instance, this input:

13 bizarre, perplexing and distressing performance directions @classicfm http://ow.ly/KYv55  @mjdominus #music

Results in cahoots returning the structured text for programming languages.

{
    "date": "2015-03-30T20:42:25.897593",
    "execution_seconds": 0.10108280181884766,
    "query": "13 bizarre, perplexing and distressing performance directions @classicfm http...",
    "results": {
        "count": 3,
        "matches": [
            {
                "subtype": "Python",
                "data": {},
                "value": null,
                "confidence": 13,
                "type": "Programming"
            },
            {
                "subtype": "Ruby",
                "data": {},
                "value": null,
                "confidence": 11,
                "type": "Programming"
            },
            {
                "subtype": "Perl",
                "data": {},
                "value": null,
                "confidence": 9,
                "type": "Programming"
            }
        ],
        "types": [
            "Programming"
        ]
    },
    "top": {
        "subtype": "Python",
        "data": {},
        "value": null,
        "confidence": 13,
        "type": "Programming"
    }
}
@hickeroar
Member

Heya, that's not something Cahoots currently does, but I like the idea. I also need to revisit the programming parser and work out a way to root out a lot of the false positives it's producing.

Thanks for the idea. I'll figure out when I can approach supporting these things.

@hickeroar
Member

After some thought on the subject: given that Cahoots is not a text mining engine, I'm not sure exactly what direction to go with this. Typically what Cahoots would do for something like this is say "This string is 140 characters or less, it contains hashtags and/or @ symbols, and may or may not contain a URL... Therefore, based on the presence (or not) of said symbols, it may or may not be a tweet."

Since Cahoots is categorically not a text MINING engine, it may (or may not) be out of scope to expect it to parse out what the URLs are, what the hashtags are, etc. I'm open to discussion on the topic though. A "TweetParser" module is not out of the question by any means, but we should determine what we want/expect Cahoots to be able to do with a given chunk of text.
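
For illustration, the whole-string heuristic described in this comment could be sketched roughly as follows. This is a hypothetical, standalone sketch and not part of Cahoots: the function name `tweet_confidence`, the regular expressions, and the score weights are all assumptions chosen only to show the idea of judging the string as a whole rather than extracting its parts.

```python
import re

# Hypothetical patterns (assumptions, not Cahoots code). The lookbehind keeps
# things like "user@example.com" from counting as an @mention.
HASHTAG = re.compile(r"(?<!\w)#\w+")
MENTION = re.compile(r"(?<!\w)@\w+")
URL = re.compile(r"https?://\S+")


def tweet_confidence(text, max_length=140):
    """Return a rough 0-100 score that `text`, taken as a whole, is a tweet.

    The weights below are arbitrary illustrative values.
    """
    if len(text) > max_length:
        return 0  # too long to be a (2015-era) tweet
    score = 0
    if HASHTAG.search(text):
        score += 40
    if MENTION.search(text):
        score += 40
    if URL.search(text):
        score += 20
    return score


sample = ("13 bizarre, perplexing and distressing performance directions "
          "@classicfm http://ow.ly/KYv55  @mjdominus #music")
print(tweet_confidence(sample))
```

Note that the sketch only reports how tweet-like the string is; it deliberately does not return the hashtags, mentions, or URL it found.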

@mazz
Author

mazz commented Mar 31, 2015

Cool, glad to possibly take Cahoots into another domain.

At a bare minimum, though, I would expect Cahoots to identify the URL.

In the same sense that @Handles and #hashtags are structured text, wouldn't a user expect their identification also?

@hickeroar
Member

The main issue with Cahoots identifying that there's a URL is that it definitely crosses the boundary into "text mining." Cahoots is specifically meant NOT to be a mining engine. Cahoots' purpose is to identify, as a whole, what a snippet of text "is." So, it may take your snippet and determine that it's a tweet, but it's not meant to diagram a sentence and tell you about its parts.

It may say "based on the evidence, this is a tweet," and then proceed to tell you more about that tweet, but it shouldn't be looking through sets of random text to identify things it might be able to comprehend. We'd have to individually inspect every token in the string looking for "something..." Cahoots just isn't meant for that.

My personal long-term goal is to write an open source text mining engine that uses Cahoots for identifying pieces that are mined out of text, but I'm not really interested in delving into it in Cahoots itself.
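
As a rough illustration of that separation, a mining layer could split the text into tokens and hand each one to a whole-snippet identifier. The sketch below is hypothetical: `mine` and `toy_identify` are made-up names, and `toy_identify` is only a stand-in for whatever identifier would be plugged in (it does not use or represent the real Cahoots API).

```python
def toy_identify(token):
    """Stand-in identifier (an assumption): labels one snippet as a whole."""
    if token.startswith(("http://", "https://")):
        return "url"
    if token.startswith("#") and len(token) > 1:
        return "hashtag"
    if token.startswith("@") and len(token) > 1:
        return "mention"
    return None  # nothing recognized


def mine(text, identify):
    """Split `text` on whitespace and identify each token independently.

    The token-by-token inspection lives here, outside the identifier.
    """
    return [(token, identify(token)) for token in text.split()]


for token, label in mine("see http://ow.ly/KYv55 @mjdominus #music", toy_identify):
    print(token, label)
```

Here the per-token loop belongs to the outer `mine` function, while the identifier only ever answers what a single snippet "is."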
