# Recommending a TV news clip for a given article with a content-based recommender

# Summary of the project:

#### Static structure
To build this, we will need a database (DB) of candidate clips to recommend. Each entry will have fields such as:
- `title`
    - Used to display to user
- `entities`
    - A vector of the features for each video
- `url` 
    - A link to the full segment
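
The fields listed above could be held in a small record type. This is only a sketch: the class name `Clip`, the choice of a name-to-weight dictionary for `entities`, and the placeholder values are assumptions, not part of the original design.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Clip:
    """One candidate recommendation (hypothetical record layout)."""
    title: str   # displayed to the user
    url: str     # link to the full segment
    # Feature vector for the clip, stored sparsely as feature name -> weight.
    entities: Dict[str, float] = field(default_factory=dict)


# Placeholder entry for illustration only, not real data.
clips = [
    Clip(
        title="Example segment",
        url="https://example.com/clip/1",
        entities={"Social media": 1.0, "Apple": 0.5},
    ),
]
```

A sparse dictionary keeps each entry small when most clips mention only a handful of entities.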
    
    
#### Program
1. We will programmatically scan an article and parse it into its own entity vector.
    - NLP topic modeling needed
2. Then we will compute the cosine distance from the article to each news clip.
3. Return the 3 closest clips.
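
Steps 2 and 3 can be sketched in plain Python, assuming each entity vector is a sparse dictionary of feature name to weight. The function names `cosine_similarity` and `closest_clips` are hypothetical, and cosine distance is taken here as one minus cosine similarity.

```python
import math
from typing import Dict, List, Tuple

Vector = Dict[str, float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two sparse vectors; 0.0 if either has zero length."""
    dot = sum(weight * b.get(name, 0.0) for name, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def closest_clips(article: Vector, clips: Dict[str, Vector], k: int = 3) -> List[Tuple[str, float]]:
    """Return the k (title, distance) pairs with the smallest cosine distance."""
    distances = [
        (title, 1.0 - cosine_similarity(article, vector))
        for title, vector in clips.items()
    ]
    return sorted(distances, key=lambda pair: pair[1])[:k]
```

This compares the article against every clip, which is fine for a small DB; a larger one would need some way to narrow the candidates first.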

#### Notes:
- Depending on the size of the DB, we probably won't want to compare against every entry. Maybe I can cluster similar clips and return the matching cluster.
- Probably want to use components as well.
- News sources: BBC, CNN, Fox, RT, MSNBC, HuffPost, The Guardian
- OpenCalais


## Imports

In [1]:
from sklearn.feature_extraction.text import CountVectorizer
import requests
from bs4 import BeautifulSoup

In [2]:
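import os
# Read the token from the environment instead of hard-coding it (CALAIS_ACCESS_TOKEN is an assumed variable name).
access_token = os.environ["CALAIS_ACCESS_TOKEN"]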
calais_url = 'https://api.thomsonreuters.com/permid/calais'
headers = {'X-AG-Access-Token' : access_token, 'Content-Type' : 'text/raw', 'outputformat' : 'application/json'}


## Scan article
- Request the site
    - Found this API: https://newsapi.org/
        - Can't search individual articles by URL as far as I can tell. Might be worth considering for multiple news sources, though.
        - Doesn't return the full article, only the title and description
- Find relevant info
    - http://mallet.cs.umass.edu/index.php
    - http://www.opencalais.com/
- Assemble a corpus OR run a search on archive.org

#### Requests

In [15]:
url = 'https://www.huffingtonpost.com/entry/kansas-democrat-brent-welder-ad_us_5b625d65e4b0b15aba9f9774'
header = {"Accept-Encoding": "gzip", "User-Agent": "Max-of-all-trades"}
# Send the dict as HTTP headers; params= would append it to the query string instead.
res = requests.get(url, headers=header)
res.status_code

403

In [14]:
soup = BeautifulSoup(res.content, 'lxml')

#### Parse

In [25]:
title = soup.find("h1", attrs={"class":"headline__title"}).text
title

"Alex Jones' Infowars Still Not Banned On App Stores, Instagram And Twitter"

In [26]:
subtitle = soup.find("div", attrs={"class":"headline__subtitle"}).text
subtitle

'Companies like Spotify and Facebook have taken action, removing Infowars content.'

In [27]:
soup.find("a", attrs={"class": "author-card__link"}).text

'\n'
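
The author lookup above returns only whitespace, and `soup.find` returns `None` when a selector matches nothing, so a small guard may help before relying on these fields. This is a sketch: `text_or_none` is a hypothetical helper, and `soup` is assumed to be the BeautifulSoup object built earlier.

```python
from typing import Optional


def text_or_none(soup, tag: str, class_name: str) -> Optional[str]:
    """Return the stripped text of the first matching element, or None.

    `soup` is assumed to be a BeautifulSoup object (or anything with a
    compatible `find` method).
    """
    node = soup.find(tag, attrs={"class": class_name})
    if node is None:           # the selector matched nothing
        return None
    text = node.text.strip()   # whitespace-only results such as '\n' become ''
    return text or None
```

With this helper, `text_or_none(soup, "a", "author-card__link")` would give `None` for a whitespace-only match instead of `'\n'`.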

In [24]:
entry = soup.find("div", attrs={"class": "entry__text"})
paragraphs = entry.find_all("div", attrs={"class": "content-list-component"})
body = "\n\n".join(p.text for p in paragraphs)  # one blank line between paragraphs
print(body)

Apple and Google have both removed some Alex Jones content from some of their platforms, but his Infowars app is still available for download in both app stores and his accounts on Twitter and Instagram are still active.

Listed under the “News” sections of the iOS App Store and the Google Play store, the Infowars app offers its subscribers livestreams and articles. It’s ranked as high as No. 23 among free news apps on Apple, according to CNN Money.

Apple did remove some of Jones’ podcasts from iTunes, while YouTube, which is owned by Google, removed his channel. 

“Apple does not tolerate hate speech, and we have clear guidelines that creators and developers must follow to ensure we provide a safe environment for all of our users,” the company said in a statement. “Podcasts that violate these guidelines are removed from our directory making them no longer searchable or available for download or streaming. We believe in representing a wide range of views, so long as people are respect

In [36]:
response = requests.post(calais_url, data=body.encode('utf-8'), headers=headers, timeout=80)
print ('status code: %s' % response.status_code)
content = response.text
# print ('Results received: %s' % content)

status code: 200


In [40]:
import json

c = json.loads(content)


{'doc': {'info': {'calaisRequestID': 'a42792bb-fbf5-5ca4-1652-0b6b1433558f',
   'docDate': '2018-08-09 22:02:48.952',
   'docId': 'http://d.opencalais.com/dochash-1/002fcf23-1236-38e4-9fe2-1ec16bb33b4c',
   'docTitle': '',
   'document': 'Apple and Google have both removed some Alex Jones content from some of their platforms, but his Infowars app is still available for download in both app stores and his accounts on Twitter and Instagram are still active.\n\nListed under the “News” sections of the iOS App Store and the Google Play store, the Infowars app offers its subscribers livestreams and articles. It’s ranked as high as No. 23 among free news apps on Apple, according to CNN Money.\n\nApple did remove some of Jones’ podcasts from iTunes, while YouTube, which is owned by Google, removed his channel. \n\n“Apple does not tolerate hate speech, and we have clear guidelines that creators and developers must follow to ensure we provide a safe environment for all of our users,” the company

In [52]:
topics = [key for key in c if key != 'doc']  # every entry except the document metadata

In [68]:
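for topic in topics:
    entry = c[topic]
    try:
        print(entry['name'])
        # Social tags carry 'importance'; topics only carry 'score', so they raise KeyError here.
        if entry["_typeGroup"] in ['topics', 'socialTag']:
            print("importance:", entry['importance'])
        else:
            print("relevance:", entry['relevance'])
    except KeyError:
        # Not every entry has a name or a score field; print what is available and move on.
        pass
    print()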

Technology_Internet


Software
importance: 1

Computing
importance: 1

Digital media
importance: 1

Social media
importance: 2

Social networking services
importance: 2

Universal Windows Platform apps
importance: 2

Photo sharing
importance: 2

E-commerce
importance: 2

Instagram
importance: 2

InfoWars
importance: 2

Stitcher Radio
importance: 2

Facebook
importance: 2

Broadcasting - NEC
relevance: 0

Internet & Mail Order Department Stores
relevance: 0

Phones & Smart Phones
relevance: 0.5

Social Media & Networking
relevance: 0.2

Search Engines
relevance: 0.2

Software - NEC
relevance: 0

Internet Services - NEC
relevance: 0.2


Instagram
relevance: 0

Pinterest
relevance: 0.2

LinkedIn
relevance: 0.2

Amazon
relevance: 0

President
relevance: 0.2

iOS App Store
relevance: 0.2

SPOTIFY
relevance: 0

social media platforms
relevance: 0.2

Twitter
relevance: 0

Internet Bill
relevance: 0.2

Google
relevance: 0.2

YouTube
relevance: 0.2

Facebook
relevance: 0.2

Alex Jones
relevance

In [75]:
for topic in topics:
    try:
        print(c[topic]['name'])
        print(c[topic])
        print("mentioned", len(c[topic]['instances']), "times")
    except KeyError:
        # Not every entry has 'name' or 'instances'; print what is available and move on.
        pass
    print()

Technology_Internet
{'_typeGroup': 'topics', 'forenduserdisplay': 'false', 'score': 0.957, 'name': 'Technology_Internet'}


Software
{'_typeGroup': 'socialTag', 'id': 'http://d.opencalais.com/dochash-1/002fcf23-1236-38e4-9fe2-1ec16bb33b4c/SocialTag/1', 'socialTag': 'http://d.opencalais.com/genericHasher-1/16293516-a848-3162-8f98-a310581b027d', 'forenduserdisplay': 'true', 'name': 'Software', 'importance': '1', 'originalValue': 'Software'}

Computing
{'_typeGroup': 'socialTag', 'id': 'http://d.opencalais.com/dochash-1/002fcf23-1236-38e4-9fe2-1ec16bb33b4c/SocialTag/2', 'socialTag': 'http://d.opencalais.com/genericHasher-1/4f9a3d55-33f5-3738-a2f7-3e9065a5a169', 'forenduserdisplay': 'true', 'name': 'Computing', 'importance': '1', 'originalValue': 'Computing'}

Digital media
{'_typeGroup': 'socialTag', 'id': 'http://d.opencalais.com/dochash-1/002fcf23-1236-38e4-9fe2-1ec16bb33b4c/SocialTag/3', 'socialTag': 'http://d.opencalais.com/genericHasher-1/771e4d9a-b0cd-37a2-bec6-6840c4cafab3', 'foren

At this point, the next step is to synthesize these results into a search query.
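
One possible way to do that, sketched under several assumptions: the response has the shape printed above (a `doc` entry plus entries carrying `name` and one of `relevance`, `importance`, or `score`), an `importance` of 1 marks the most important social tags, and the query is just the top names quoted and joined by spaces. The names `entry_weight` and `build_query` and the weighting itself are hypothetical.

```python
from typing import Any, Dict


def entry_weight(entry: Dict[str, Any]) -> float:
    """Map one Calais entry to a rough weight (hypothetical scoring)."""
    if "relevance" in entry:                  # entities, industries, ...
        return float(entry["relevance"])
    if "importance" in entry:                 # social tags; assumed 1 = most important
        return 1.0 / float(entry["importance"])
    return float(entry.get("score", 0.0))     # topics carry a 'score' field


def build_query(response: Dict[str, Any], max_terms: int = 5) -> str:
    """Join the highest-weighted distinct names into one search string."""
    weights: Dict[str, float] = {}
    for key, entry in response.items():
        if key == "doc" or "name" not in entry:   # skip metadata and unnamed entries
            continue
        name = entry["name"]
        # A name can appear more than once (e.g. as a tag and as an entity); keep its best weight.
        weights[name] = max(weights.get(name, 0.0), entry_weight(entry))
    # Highest weight first; ties are broken alphabetically so the result is stable.
    ranked = sorted(weights, key=lambda name: (-weights[name], name))
    return " ".join(f'"{name}"' for name in ranked[:max_terms])
```

Calling `build_query(c)` on the parsed response would then give a short string of quoted names to feed into whichever search backend is chosen.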