# This project aims to recommend a TV news clip for a given article using a content-based filtering recommender

# Summary of the project:

#### Static structure
To build this, we will need a DB of candidate clips to recommend. Each entry will have fields such as:
- `Title`
    - Used to display to user
- `entities`
    - A vector of the features for each video
- `url` 
    - A link to the full segment
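
A record in that DB could be sketched as a small dataclass. The field names mirror the list above; the `Clip` class name, the choice of a `dict` mapping entity name to weight for `entities`, and the sample values are assumptions for illustration, not a fixed schema.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Clip:
    """One candidate recommendation (hypothetical schema)."""
    title: str  # shown to the user
    url: str    # link to the full segment
    # Sparse feature vector: entity name -> weight
    entities: Dict[str, float] = field(default_factory=dict)


# Made-up example row; the URL is a placeholder, not a real clip.
example_clip = Clip(
    title="Kansas primary vote count",
    url="https://example.com/clip/1",
    entities={"kris kobach": 0.8, "kansas": 0.2},
)
```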
    
    
#### Program
1. We will programmatically scan an article and parse it for its own entity vector.
    - NLP topic modeling needed
2. Then we will compute the cosine distance from the article to each news clip.
3. Return the 3 closest clips.
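
Steps 2 and 3 can be sketched in plain Python. This is a minimal illustration, assuming each entity vector is a `dict` from entity name to weight; the function names and the convention that an empty vector sits at distance 1 are choices made for this sketch, not part of the project yet.

```python
import math
from typing import Dict, List, Tuple

Vector = Dict[str, float]


def cosine_distance(a: Vector, b: Vector) -> float:
    """Return 1 - cosine similarity for two sparse vectors keyed by entity name."""
    dot = sum(weight * b.get(name, 0.0) for name, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # convention chosen here: an empty vector is treated as maximally distant
    return 1.0 - dot / (norm_a * norm_b)


def closest_clips(article: Vector, clips: Dict[str, Vector], k: int = 3) -> List[Tuple[float, str]]:
    """Return the k (distance, title) pairs nearest to the article, closest first."""
    scored = sorted((cosine_distance(article, vec), title) for title, vec in clips.items())
    return scored[:k]
```

This version scans every clip, which is the brute-force approach; it is only meant to pin down the distance computation.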

#### Notes:
- Depending on the size of the DB, we probably won't want to go through each entry. Maybe I can cluster similar clips and return the closest cluster.
- Probably want to use components as well.
- News sources: BBC, CNN, Fox, RT, MSNBC, HuffPost, The Guardian, NYT
- OpenCalais


## Imports

In [1]:
from sklearn.feature_extraction.text import CountVectorizer
import requests
from bs4 import BeautifulSoup

## Scan article
- request site
    - Found this API: https://newsapi.org/
        - Can't search individual articles by URL as far as I can tell. Might be worth considering for pulling from multiple news sources, though.
        - Doesn't return the full article, only the title and description
- find relevant info
    - http://mallet.cs.umass.edu/index.php
    - http://www.opencalais.com/
- assemble corpus OR run search on archive.org

#### Requests

In [2]:
url = 'https://www.nytimes.com/2018/08/09/us/politics/kansas-kobach-colyer-votes.html'
res = requests.get(url)
res.status_code

200

In [3]:
soup = BeautifulSoup(res.content, 'lxml')

### Parse

In [4]:
title = soup.find("h1", attrs={"itemprop":"headline"}).text
title

'Kobach Says He Will Recuse Himself From Kansas Primary Vote Count'

In [5]:
subtitle = soup.find("body").find("span", attrs={"class":"ResponsiveMedia-captionText--2WFdF"}).text
subtitle

' Gov. Jeff Colyer, announcing the creation of a “voting integrity hotline” and suggesting there had been problems on Election Day.'

In [6]:
soup.find("a", attrs={"class": "author-card__link"}).text

AttributeError: 'NoneType' object has no attribute 'text'
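
The `AttributeError` above arises because `soup.find` returns `None` when nothing matches, so `.text` is read from `None`. One way to guard such lookups is a small helper; `text_or_default` is a hypothetical name introduced here, not part of BeautifulSoup.

```python
def text_or_default(element, default=""):
    """Return the stripped text of a found element, or `default` if the lookup returned None."""
    if element is None:
        return default
    return element.text.strip()


# Intended use with the notebook's `soup` (left commented so the block runs on its own):
# author = text_or_default(soup.find("a", attrs={"class": "author-card__link"}))
```

With this guard a missing author link yields an empty string instead of stopping the notebook.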

In [7]:
# Join the text of every body paragraph, separated by blank lines
body = "\n\n".join(p.text for p in soup.find_all("p", attrs={"class": "css-1i0edl6 e2kc3sl0"}))
print(body)

Secretary of State Kris W. Kobach of Kansas, clinging to the slimmest of leads in the Republican primary for governor, said Thursday night that he planned to recuse himself from the vote-counting process. Earlier in the evening, his opponent, Gov. Jeff Colyer, said that some local election officials had been provided incorrect information by Mr. Kobach that could suppress votes.


The governor’s fiery recusal request, and Mr. Kobach’s pledge to comply, came after the nationally watched  primary left the candidates separated by only 191 votes entering Thursday.

In a letter, Mr. Colyer said some clerks had been provided incorrect information about which ballots to count, and he urged Mr. Kobach to appoint the state attorney general to handle future questions from local election workers.

“It has come to my attention that your office is giving advice to county election officials — as recently as a conference call yesterday — and you are making public statements on national television whi

## Entity extraction

### OpenCalais

In [8]:
# calais variables
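import os

# Read the token from the environment instead of hard-coding it
# (CALAIS_ACCESS_TOKEN is an assumed variable name).
access_token = os.environ["CALAIS_ACCESS_TOKEN"]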
calais_url = 'https://api.thomsonreuters.com/permid/calais'
headers = {'X-AG-Access-Token': access_token, 'Content-Type': 'text/raw', 'outputformat': 'application/json'}


In [9]:
response = requests.post(calais_url, data=body.encode('utf-8'), headers=headers, timeout=80)
print('status code: %s' % response.status_code)
content = response.text
# print ('Results received: %s' % content)

status code: 200


In [10]:
import json

c = json.loads(content)


In [11]:
# Skip the first key, which presumably holds document-level metadata rather than a tag
topics = list(c.keys())[1:]

In [12]:
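for topic in topics:
    try:
        print(c[topic]['name'])
        # Look up importance for topics and social tags, relevance otherwise
        if c[topic]["_typeGroup"] in ('topics', 'socialTag'):
            print("importance:", c[topic]['importance'])
        else:
            print("relevance:", c[topic]['relevance'])
    except (KeyError, TypeError):
        # Skip entries that lack the expected field
        pass
    print()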

Politics


Government
importance: 1

Politics
importance: 1

United States
importance: 1

American people of German descent
importance: 2

Kris Kobach
importance: 2

Voter suppression
importance: 2

Jeff Colyer
importance: 2

Presidential Advisory Commission on Election Integrity
importance: 2

Provisional ballot
importance: 2

Kansas
importance: 2

Electoral fraud
importance: 2

Fish v. Kobach
importance: 2

Broadcasting - NEC
relevance: 0

News Agencies
relevance: 0


Greg Orman
relevance: 0.2

primary election
relevance: 0.2

Kansas
relevance: 0.2

Kris W. Kobach
relevance: 0.8

Governor
relevance: 0.8

a primary
relevance: 0.2

businessman
relevance: 0.2

fox news
relevance: 0

President
relevance: 0.8

primary
relevance: 0.2

Laura Kelly
relevance: 0.2

comparatively mild-mannered plastic surgeon
relevance: 0.2

Trump
relevance: 0.8

general election
relevance: 0.2

Thomas County
relevance: 0.2

Jeff Colyer
relevance: 0.8

the primary
relevance: 0.2

clerk
relevance: 0.2

attorney
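
The relevance scores printed above could be folded into the entity vector that the plan calls for. The sketch below is one possible approach, assuming the response has the shape seen in this notebook (entries with `_typeGroup`, `name`, and, for entities, `relevance`). The function name `entity_vector`, the lowercasing of names, and keeping the maximum relevance for repeated names are assumptions, and the sample dictionary is hand-made rather than real API output.

```python
from typing import Any, Dict


def entity_vector(calais: Dict[str, Any]) -> Dict[str, float]:
    """Map each entity name (lowercased) to its relevance score."""
    vector: Dict[str, float] = {}
    for entry in calais.values():
        # Ignore anything that is not an entity (topics, social tags, metadata)
        if not isinstance(entry, dict) or entry.get("_typeGroup") != "entities":
            continue
        name = entry.get("name")
        relevance = entry.get("relevance")
        if name is None or relevance is None:
            continue
        key = name.lower()
        # If a name appears more than once, keep its highest relevance
        vector[key] = max(vector.get(key, 0.0), float(relevance))
    return vector


# Hand-made sample in the assumed response shape; keys and values are illustrative only.
sample = {
    "doc": {"info": {}},
    "id/1": {"_typeGroup": "entities", "name": "Kris W. Kobach", "relevance": 0.8},
    "id/2": {"_typeGroup": "entities", "name": "Kansas", "relevance": 0.2},
    "id/3": {"_typeGroup": "socialTag", "name": "Kansas", "importance": "2"},
}
print(entity_vector(sample))
```

Near-duplicate names such as "primary", "a primary", and "the primary" would still land in separate dimensions; merging them is left open here.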

In [13]:
for topic in topics:
    if c[topic]['_typeGroup'] == "entities":
        print(c[topic]['name'])
        print("mentioned",len(c[topic]['instances']),"times")
        print(c[topic].keys())
        print()

Greg Orman
mentioned 1 times
dict_keys(['_typeGroup', '_type', 'forenduserdisplay', 'name', 'persontype', 'nationality', 'confidencelevel', 'firstname', 'lastname', 'commonname', '_typeReference', 'permid', 'instances', 'relevance', 'confidence'])

primary election
mentioned 1 times
dict_keys(['_typeGroup', '_type', 'forenduserdisplay', 'name', 'politicaleventtype', 'location', '_typeReference', 'instances', 'relevance'])

Kansas
mentioned 6 times
dict_keys(['_typeGroup', '_type', 'forenduserdisplay', 'name', '_typeReference', 'instances', 'relevance', 'resolutions'])

Kris W. Kobach
mentioned 17 times
dict_keys(['_typeGroup', '_type', 'forenduserdisplay', 'name', 'persontype', 'nationality', 'confidencelevel', 'commonname', 'confidence', '_typeReference', 'permid', 'instances', 'relevance'])

Governor
mentioned 4 times
dict_keys(['_typeGroup', '_type', 'forenduserdisplay', 'name', '_typeReference', 'instances', 'relevance'])

a primary
mentioned 1 times
dict_keys(['_typeGroup', '_type

At this point, the next step is to synthesize these results into a search query.

In [14]:
tags = []
for topic in topics:
    if c[topic]['_typeGroup'] == "entities":
        new_entity = {}
        new_entity['name'] = c[topic]['name']
        new_entity['mentions'] = len(c[topic]['instances'])
        new_entity['type'] = c[topic]['_type']
        # Person entities expose a commonname; use an empty string for other types
        if new_entity['type'] == "Person":
            new_entity['commonname'] = c[topic]['commonname']
        else:
            new_entity['commonname'] = ""
        tags.append(new_entity)

In [15]:
import pandas as pd

df = pd.DataFrame(tags)



In [18]:
# Select all people who were mentioned more than the mean number of mentions per person
people = df[df["type"] == "Person"]
POI = list(people[people["mentions"] > people["mentions"].mean()]["commonname"])

In [19]:
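# Do we want to search full names, or just last names?
from urllib.parse import quote

query = " AND ".join(POI)
# Percent-encode the whole query (spaces become %20) rather than replacing spaces by hand
query = quote(query)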

In [20]:
queryURL = "https://archive.org/details/tv?q=" + query
print(queryURL)

https://archive.org/details/tv?q=Kris%20Kobach%20AND%20Jeff%20Colyer
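
The query-building steps above can be gathered into one function, which also makes it easy to try the full-name versus last-name question. This is a sketch: `build_tv_query_url` and its `last_names_only` flag are hypothetical names, and the last-name rule (take the final whitespace-separated token) is a naive assumption that would mishandle suffixes such as "Jr.".

```python
from typing import Iterable
from urllib.parse import quote

TV_SEARCH_BASE = "https://archive.org/details/tv?q="  # base URL taken from the cell above


def build_tv_query_url(names: Iterable[str], last_names_only: bool = False) -> str:
    """Join names with AND and percent-encode the result into a search URL."""
    terms = []
    for name in names:
        name = name.strip()
        if not name:
            continue  # skip blank names, e.g. entities without a commonname
        # Naive last-name rule: keep only the final whitespace-separated token
        terms.append(name.split()[-1] if last_names_only else name)
    return TV_SEARCH_BASE + quote(" AND ".join(terms))


print(build_tv_query_url(["Kris Kobach", "Jeff Colyer"]))
print(build_tv_query_url(["Kris Kobach", "Jeff Colyer"], last_names_only=True))
```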
