## News Headlines

### Formating

[The Associated Press Stylebook](https://www.amazon.com/Associated-Press-Stylebook-2017-Briefing/dp/0465093043/) is a style guide widely used among American journalists. It enforces the following rules for capitalization of news headlines:

1. Capitalize nouns, pronouns, adjectives, verbs, adverbs, and subordinate conjunctions. If a word is hyphenated, every part of the word should be capitalized (e.g., "Self-Reflection" not "Self-reflection").
2. Capitalize the first and the last word.
3. Lowercase all other parts of speech: articles, coordinating conjunctions, prepositions, particles, interjections.

Write a program that formats a headline according to the rules above. Use any programming language and any NLP toolkit.

When done, run your program on [the corpus of headlines from The Examiner](examiner-headlines.txt) and submit your program and a file with corrected headlines to your directory. Output statistics: how many titles were properly formatted?

In [46]:
import spacy, re, requests, pdb
from spacy import displacy
nlp = spacy.load('en_core_web_md')

In [63]:
url_headlines = 'https://raw.githubusercontent.com/vseloved/prj-nlp/master/tasks/02-structural-linguistics/examiner-headlines.txt'
raw_headlines = requests.get(url_headlines).text.strip().split('\n')

5000
5000


In [67]:
to_capitalize_pos = ['PROPN',
                     'NOUN',
                     'VERB',
                     'ADJ',
                     'ADV']

to_capitalize_dep = ['mark']

to_strip_space = ['PUNCT']

In [68]:
def clean_headline(headline):
    clean_headline = []
    for token in nlp(headline):
        
        if re.match('[A-Z]{2,}', token.text):
            proc_token = token.text_with_ws
            
        elif token.pos_ in to_capitalize_pos \
        or token.dep_ in to_capitalize_dep \
        or token.is_sent_start:
            proc_token = token.text_with_ws.capitalize()
            
        else:
            proc_token = token.text_with_ws.lower().replace('--', '—')
            
        clean_headline.append(proc_token)
        
    if len(clean_headline) < 1:
        pdb.set_trace()
    clean_headline[-1] = clean_headline[-1].capitalize() \
                         if not re.match('[A-Z]', clean_headline[-1]) \
                         else clean_headline[-1]
            
    return ''.join(clean_headline).strip()

In [73]:
with open('capitalized_examiner-headlines.txt', 'w') as f:
    f.write('\n'.join([clean_headline(h) for h in raw_headlines]))

### Catch catchy headlines

The paper on [Automatic Extraction of News Values from Headline Text](http://www.aclweb.org/anthology/E17-4007) defines that a catchy headline has the following features:
1. Prominence
2. Sentiment
3. Superlativeness
4. Proximity
5. Surprise
6. Uniqueness

Write a program that analyzes a headline for prominence (a.k.a, named entities), sentiment, and superlativeness. For sentiment, check if the average sentiment for the top 5 meanings of word+POS in [SentiWordNet](http://sentiwordnet.isti.cnr.it/) is above 0.5.

When done, run your program on [the corpus of headlines](examiner-headlines.txt), extract the headlines that have at least one of the described features, and submit your program and a file with catchy headlines to your directory.

In [167]:
from nltk.corpus import sentiwordnet
from nltk.corpus.reader.wordnet import WordNetError
import pandas as pd

In [99]:
prominent_ents = ['PERSON',
                  'ORG',
                  'PRODUCT',
                  'WORK_OF_ART',]

headline_prom = [len(list(filter(lambda t: t.ent_type_ in prominent_ents, nlp(headline))))
                 for headline in raw_headlines]

sent_pos_tags = {'NOUN': 'n',
                 'VERB': 'v',
                 'ADJ': 'a',
                 'ADV': 'r',}

def is_sentimental(headline):
    nlp_headline = nlp(headline)
    n_sentiment_words = 0
    for t in nlp_headline:
        if t.pos_ in sent_pos_tags:
            for i in range(0, 5):
                try:
                    sentiments = sentiwordnet.senti_synset(f'{t.lemma_}.{sent_pos_tags[t.pos_]}.0{i}')
                    if sentiments.neg_score() > 0.5 or sentiments.pos_score() > 0.5:
                        n_sentiment_words += 1
                except WordNetError:
                    # in case word has < 5 sences
                    return n_sentiment_words
    return n_sentiment_words
        
headline_sentiments = list(map(is_sentimental, raw_headlines))

headline_superl = [len(list(filter(lambda t: t.tag_ == 'JJS', nlp(headline))))
                   for headline in raw_headlines]

In [187]:
headline_stats = pd.DataFrame({'headline': raw_headlines,
                               'prominent_w': headline_prom,
                               'sentimental_w': headline_sentiments,
                               'superlative_w': headline_superl,
                              })

headline_stats['is_catchy'] = headline_stats.sum(axis=1, numeric_only=True)
headline_stats = headline_stats.sort_values('is_catchy', ascending=False)

headline_stats.loc[headline_stats.superlative_w > 0
                  ].to_html('catchy_examiner-headlines.md')

headline_stats.head()

Unnamed: 0,headline,prominent_w,sentimental_w,superlative_w,is_catchy
1641,Best National Hispanic Heritage Month of Jazz ...,12,4,1,17
2853,Sarasota Orchestra's World premiere of 'Salon ...,14,0,0,14
2025,Justin Gardenhire Wins Event 11 at WSOPC New O...,13,0,0,13
4410,"Grammy Week: Rev Run, RedOne and DJ Khaled for...",13,0,0,13
4564,Golden Globes Red Carpet 'How Tos' With Pravan...,11,2,0,13
