# This project aims to recommend a TV news clip for a given article using a content-based filtering recommender

# Summary of the project:

#### Static structure
To build this, we need a database of candidate clips to recommend. Each record will have fields such as:
- `Title`
    - Displayed to the user
- `entities`
    - A feature vector for the video
- `url`
    - A link to the full segment
    
    
#### Program
1. Programmatically scan an article and extract its own entity vector.
    - Requires NLP topic modeling
2. Compute the cosine distance from the article to each news clip.
3. Return the 3 closest clips.
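
Steps 2 and 3 can be sketched with a sparse bag-of-entities representation. This is only a minimal illustration under assumptions, not the project's implementation: the clip records, the entity counts and the helper names `cosine_similarity` and `closest_clips` are made up for the example. It ranks by cosine similarity (1 minus cosine distance), so the closest clips are the ones with the highest score.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors (dict-like)."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


def closest_clips(article_entities, clips, k=3):
    """Return the k clips whose entity vectors are most similar to the article's."""
    ranked = sorted(
        clips,
        key=lambda clip: cosine_similarity(article_entities, clip["entities"]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical data: entity name -> mention count
article = Counter({"Brent Welder": 12, "Bernie Sanders": 6, "Kansas": 3})
clips = [
    {"title": "Clip A", "url": "https://example.com/a", "entities": Counter({"Brent Welder": 4, "Kansas": 2})},
    {"title": "Clip B", "url": "https://example.com/b", "entities": Counter({"Bernie Sanders": 5})},
    {"title": "Clip C", "url": "https://example.com/c", "entities": Counter({"Elizabeth Warren": 3})},
    {"title": "Clip D", "url": "https://example.com/d", "entities": Counter({"Kansas": 1, "Congress": 2})},
]

top = closest_clips(article, clips)
print([clip["title"] for clip in top])
```

Raw mention counts are used as weights here for simplicity; relevance scores or TF-IDF weights could be swapped in without changing the ranking code.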

#### Notes:
- Depending on the size of the DB, we probably won't want to compare against every entry. One option is to cluster similar clips and return the nearest cluster.
- Probably want to use components as well.
- News sources: BBC, CNN, Fox, RT, MSNBC, HuffPost, The Guardian
- OpenCalais
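
One possible way to avoid comparing the article against every clip, sketched under assumptions: group the clips ahead of time, compare the article against one representative per group, and return that group. The example uses a simple greedy "leader" clustering, chosen only because it is short; the threshold, the sample vectors and the function names are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two sparse count vectors stored as dicts."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def leader_clusters(clips, threshold=0.5):
    """Greedy clustering: a clip joins the first cluster whose leader is
    similar enough, otherwise it starts a new cluster."""
    clusters = []  # each cluster is a list; its first element is the leader
    for clip in clips:
        for cluster in clusters:
            if cosine_similarity(clip["entities"], cluster[0]["entities"]) >= threshold:
                cluster.append(clip)
                break
        else:
            clusters.append([clip])
    return clusters


def nearest_cluster(article_entities, clusters):
    """Compare the article against each leader only and return the best cluster."""
    return max(clusters, key=lambda c: cosine_similarity(article_entities, c[0]["entities"]))


# Hypothetical clip vectors: entity name -> mention count
clips = [
    {"title": "Clip A", "entities": {"Brent Welder": 4, "Kansas": 2}},
    {"title": "Clip B", "entities": {"Brent Welder": 2, "Kansas": 3}},
    {"title": "Clip C", "entities": {"Elizabeth Warren": 3}},
]
clusters = leader_clusters(clips)
article = {"Brent Welder": 12, "Kansas": 3}
match = nearest_cluster(article, clusters)
print([clip["title"] for clip in match])
```

The result depends on the order in which clips are seen and on the threshold, so a proper clustering method may be a better fit once the DB grows.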


## Imports

In [1]:
from sklearn.feature_extraction.text import CountVectorizer
import requests
from bs4 import BeautifulSoup

In [2]:
access_token = "cvTFhY53VXBYm5HO85weHPx346W05015"
calais_url = 'https://api.thomsonreuters.com/permid/calais'
headers = {'X-AG-Access-Token' : access_token, 'Content-Type' : 'text/raw', 'outputformat' : 'application/json'}


## Scan article
- request site
    - Found this API: https://newsapi.org/
        - As far as I can tell, it can't look up an individual article by URL. It might still be worth considering for pulling from multiple news sources.
        - Doesn't return the full article, only the title and description
- find relevant info
    - http://mallet.cs.umass.edu/index.php
    - http://www.opencalais.com/
- assemble corpus OR run search on archive.org

#### Requests

In [27]:
url = 'https://www.huffingtonpost.com/entry/kansas-democrat-brent-welder-ad_us_5b625d65e4b0b15aba9f9774'
header = {"Accept-Encoding": "gzip", "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
res = requests.get(url, headers=header )
res.status_code

200

In [28]:
soup = BeautifulSoup(res.content, 'lxml')

#### Parse

In [22]:
title = soup.find("h1", attrs={"class":"headline__title"}).text
title

'New Ad For Democratic Candidate Tests A Populist Message In Suburban Kansas'

In [23]:
subtitle = soup.find("div", attrs={"class":"headline__subtitle"}).text
subtitle

'Brent Welder is a progressive labor attorney running in Kansas’ 3rd Congressional District.'

In [37]:
soup.find("a", attrs={"class": "author-card__link yr-author-name"}).text

'Daniel Marans'

In [25]:
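# Join the text of each content block inside the article body, separated by blank lines
content_blocks = soup.find("div", attrs={"class": "entry__text"}).find_all("div", attrs={"class": "content-list-component"})
body = "\n\n".join(block.text for block in content_blocks)
print(body)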

The Progressive Change Campaign Committee is funding an advertisement in support of Brent Welder, a progressive lawyer seeking the Democratic nomination in Kansas’ 3rd Congressional District.

The minute-long video, which consists almost entirely of a passionate segment from one of Welder’s speeches, banks on voters’ appetites for unabashed economic populism. Through its super PAC, the PCCC has purchased $30,000 worth of airtime to play the spot on the local morning and evening news in the Kansas City media market, starting Thursday.

“The wealthiest of the wealthy take a tiny sliver of those enormous profits that they steal from us and use it to pay off politicians to keep it rigged in their favor,” Welder tells supporters at a July 20 rally featured in the video. “We gather here tonight to say, ‘Enough is enough!’”



Welder, a union-side labor attorney and former campaign official for both Barack Obama and Sen. Bernie Sanders (I-Vt.), is in a close contest for the Democratic nominat

## SKLearn NLP

## Calais section

In [29]:
# Send the article body to the Calais endpoint as raw UTF-8 text
response = requests.post(calais_url, data=body.encode('utf-8'), headers=headers, timeout=80)
print('status code: %s' % response.status_code)
content = response.text
# print('Results received: %s' % content)

status code: 200


In [30]:
import json

c = json.loads(content)


# Skip the first key of the response; the remaining keys are treated as the tagged items
topics = list(c.keys())[1:]

In [32]:
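for topic in topics:
    try:
        print(c[topic]['name'])
        # Topics and social tags are printed with "importance", everything else with "relevance"
        if c[topic]["_typeGroup"] in ['topics', 'socialTag']:
            print("importance:", c[topic]['importance'])
        else:
            print("relevance:", c[topic]['relevance'])
    except KeyError:
        # Some items lack one of these fields (e.g. topics carry 'score' rather than 'importance')
        pass
    print()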

Politics


Politics of the United States
importance: 1

United States
importance: 1

Brent Welder
importance: 1

Bernie Sanders presidential campaign
importance: 2

Progressive Change Campaign Committee
importance: 2

Alexandria Ocasio-Cortez
importance: 2

Bernie Sanders
importance: 2

Democratic Party
importance: 2

Bio Therapeutic Drugs
relevance: 0


Kansas City
relevance: 0.2

Kevin Yoder
relevance: 0.2

Congress
relevance: 0.2

Kansas
relevance: 0.8

Progressive Change Campaign Committee
relevance: 0.8

official
relevance: 0.2

Alexandria Ocasio-Cortez
relevance: 0.2

Brent Welder
relevance: 0.8

Bernie Sanders
relevance: 0.2

co-founder
relevance: 0.2

Welder , a progressive lawyer
relevance: 0.2

media market
relevance: 0.2

Elizabeth Warren
relevance: 0.2

Welder , a union-side labor attorney
relevance: 0.2

Welder
relevance: 0.8

The primary
relevance: 0.2

Tom Niermann
relevance: 0.2

MMA
relevance: 0.2

Sharice Davids
relevance: 0.2

Adam Green
relevance: 0.2

Hillary Clint

In [33]:
for topic in topics:
    try:
        print(c[topic]['name'])
        print(c[topic])
        print("mentioned", len(c[topic]['instances']), "times")
    except KeyError:
        # Items without 'name' or 'instances' stop here
        pass
    print()

Politics
{'_typeGroup': 'topics', 'forenduserdisplay': 'false', 'score': 0.975, 'name': 'Politics'}


Politics of the United States
{'_typeGroup': 'socialTag', 'id': 'http://d.opencalais.com/dochash-1/f164e4d2-3f69-3e4b-b26d-9a42e3a6314f/SocialTag/1', 'socialTag': 'http://d.opencalais.com/genericHasher-1/450a0fe4-479f-343b-8ca1-2ced72c6b3ef', 'forenduserdisplay': 'true', 'name': 'Politics of the United States', 'importance': '1', 'originalValue': 'Politics of the United States'}

United States
{'_typeGroup': 'socialTag', 'id': 'http://d.opencalais.com/dochash-1/f164e4d2-3f69-3e4b-b26d-9a42e3a6314f/SocialTag/2', 'socialTag': 'http://d.opencalais.com/genericHasher-1/359e9bfd-d04d-3ee5-931a-c69b04eac8d2', 'forenduserdisplay': 'true', 'name': 'United States', 'importance': '1', 'originalValue': 'United States'}

Brent Welder
{'_typeGroup': 'socialTag', 'id': 'http://d.opencalais.com/dochash-1/f164e4d2-3f69-3e4b-b26d-9a42e3a6314f/SocialTag/3', 'socialTag': 'http://d.opencalais.com/genericHa
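
The records above suggest that each `_typeGroup` carries its weight under a different key (`score` for topics, `importance` for social tags, `relevance` for entities). A small helper can make that explicit instead of relying on exceptions. This is a sketch under assumptions: `weight_of` is a hypothetical helper, the dictionary keys in `sample` are placeholders, and the records are trimmed-down versions of the output shown in this notebook rather than real responses.

```python
def weight_of(item):
    """Return (field_name, value) for the weight field matching the item's _typeGroup,
    or None when the group is unknown or the field is missing."""
    field_by_group = {"topics": "score", "socialTag": "importance", "entities": "relevance"}
    field = field_by_group.get(item.get("_typeGroup"))
    if field is None or field not in item:
        return None
    return field, float(item[field])  # importance arrives as a string such as '1'


# Placeholder keys; values trimmed from the printed output in this notebook
sample = {
    "doc": {"info": {}},  # assumed metadata entry with no "name"
    "id-1": {"_typeGroup": "topics", "name": "Politics", "score": 0.975},
    "id-2": {"_typeGroup": "socialTag", "name": "United States", "importance": "1"},
    "id-3": {"_typeGroup": "entities", "name": "Kansas", "relevance": 0.8},
}

rows = [(item["name"], weight_of(item)) for item in sample.values() if "name" in item]
for name, weight in rows:
    print(name, weight)
```

Filtering on the presence of `name` also removes the need to skip the first key by position.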

At this point, the next step is to synthesize these results into a search query.

In [34]:
tags = []
for topic in topics:
    item = c[topic]
    if item['_typeGroup'] == "entities":
        new_entity = {
            'name': item['name'],
            'mentions': len(item['instances']),
            'type': item['_type'],
        }
        # 'commonname' is only read for Person entities; other types get an empty string
        new_entity['commonname'] = item['commonname'] if new_entity['type'] == "Person" else ""
        tags.append(new_entity)

In [38]:
import pandas as pd

df = pd.DataFrame(tags)
df

|   | commonname | mentions | name | type |
|---|---|---|---|---|
| 0 |  | 3 | Kansas City | City |
| 1 | Kevin Yoder | 1 | Kevin Yoder | Person |
| 2 |  | 1 | Congress | Organization |
| 3 |  | 3 | Kansas | ProvinceOrState |
| 4 |  | 5 | Progressive Change Campaign Committee | Organization |
| 5 |  | 1 | official | Position |
| 6 | Alexandria Ocasio-Cortez | 3 | Alexandria Ocasio-Cortez | Person |
| 7 | Brent Welder | 12 | Brent Welder | Person |
| 8 | Bernie Sanders | 6 | Bernie Sanders | Person |
| 9 |  | 1 | co-founder | Position |


In [41]:
df[df['mentions'] > df['mentions'].mean()]

|   | commonname | mentions | name | type |
|---|---|---|---|---|
| 0 |  | 3 | Kansas City | City |
| 3 |  | 3 | Kansas | ProvinceOrState |
| 4 |  | 5 | Progressive Change Campaign Committee | Organization |
| 6 | Alexandria Ocasio-Cortez | 3 | Alexandria Ocasio-Cortez | Person |
| 7 | Brent Welder | 12 | Brent Welder | Person |
| 8 | Bernie Sanders | 6 | Bernie Sanders | Person |
| 14 |  | 9 | Welder | Position |
| 16 | Tom Niermann | 3 | Tom Niermann | Person |
| 18 | Sharice Davids | 12 | Sharice Davids | Person |
| 25 |  | 3 | Army | Organization |


In [36]:
# Select all people who were mentioned more often than the mean number of mentions per person
people = df[df["type"] == "Person"]
POI = list(people[people["mentions"] > people["mentions"].mean()]["commonname"])

# Do we want to search full names, or just last names?
query = " AND ".join(POI)
query = query.replace(" ", "%20")

queryURL = "https://archive.org/details/tv?q="+query
print(queryURL)

https://archive.org/details/tv?q=Brent%20Welder%20AND%20Bernie%20Sanders%20AND%20Sharice%20Davids
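
The open question about full names versus last names can be explored with a small helper that builds either variant. This is a sketch, not a recommendation: `build_query_url` and its `last_names_only` flag are hypothetical, taking the last whitespace-separated token as the surname is a naive assumption, and `urllib.parse.quote` is used in place of the manual space replacement so that other special characters are encoded too.

```python
from urllib.parse import quote


def build_query_url(names, last_names_only=False):
    """Join names with AND and percent-encode them into an archive.org TV search URL."""
    if last_names_only:
        # Naive surname guess: the last whitespace-separated token
        names = [name.split()[-1] for name in names]
    query = " AND ".join(names)
    return "https://archive.org/details/tv?q=" + quote(query)


people = ["Brent Welder", "Bernie Sanders", "Sharice Davids"]
full_url = build_query_url(people)
last_url = build_query_url(people, last_names_only=True)
print(full_url)
print(last_url)
```

For these three names the full-name variant reproduces the URL printed above; which variant returns better clips would need to be checked against the archive itself.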
