# This project aims to recommend a TV news clip for a given article using a content-based filtering recommender

# Summary of the project:

#### Static structure
To build this, we need a database of candidate clips to recommend. Each entry will have fields such as:
- `title`
    - Displayed to the user
- `entities`
    - A feature vector for the video
- `url`
    - A link to the full segment
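
As a rough illustration of the record described above, here is a minimal sketch of one database entry. The class name `Clip`, the choice of a dict mapping entity names to weights, and the example values are all assumptions made for illustration; the notes do not fix a storage format.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Clip:
    """One candidate recommendation (hypothetical layout, not a fixed schema)."""
    title: str                      # shown to the user
    url: str                        # link to the full segment
    # entity name -> weight (for example, a mention count); assumed representation
    entities: Dict[str, float] = field(default_factory=dict)


# Placeholder values, not real data
example_clip = Clip(
    title="Example segment",
    url="https://example.com/clip",
    entities={"Alex Jones": 5, "Apple": 2},
)
print(example_clip.title, sorted(example_clip.entities))
```

Keeping `entities` as a sparse dict leaves the choice of vectorizer open until the clips are actually compared.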
    
    
#### Program
1. We will programmatically scan an article and parse it into its own entity vector.
    - Requires NLP topic modeling
2. Then we will compute the cosine distance from the article to each news clip.
3. Return the 3 closest clips.
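
For reference, the cosine distance in step 2 is usually defined as one minus the cosine similarity. Writing $\mathbf{a}$ for the article's entity vector and $\mathbf{c}$ for a clip's entity vector (symbols introduced here, not in the original notes):

$$
d(\mathbf{a}, \mathbf{c}) = 1 - \frac{\mathbf{a} \cdot \mathbf{c}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{c} \rVert}
$$

For non-negative count vectors this lies between 0 (the vectors point in the same direction) and 1 (the article and the clip share no entities), so the 3 closest clips are the 3 with the smallest $d$.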

#### Notes:
- Depending on the size of the DB, we probably won't want to go through every entry. Maybe I can cluster similar clips and return the closest cluster.
- Probably want to use components as well.
- News sources: BBC, CNN, Fox, RT, MSNBC, HuffPost, The Guardian
- OpenCalais
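
The clustering idea in the notes could look roughly like the sketch below. It is only one possible approach: it assumes the clips are already dense numeric vectors, uses scikit-learn's `KMeans` (which works with Euclidean distance, so the vectors are L2-normalized first to bring it closer to cosine behaviour), and the function names and toy vectors are made up for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import normalize


def fit_clip_clusters(clip_vectors, n_clusters, seed=0):
    """Cluster L2-normalized clip vectors (assumed layout: one row per clip)."""
    unit = normalize(np.asarray(clip_vectors, dtype=float))
    return KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(unit)


def candidate_indices(article_vector, model):
    """Return row indices of the clips in the cluster nearest to the article."""
    unit = normalize(np.asarray(article_vector, dtype=float).reshape(1, -1))
    cluster = model.predict(unit)[0]
    return np.flatnonzero(model.labels_ == cluster)


# Toy vectors over three hypothetical entities
clip_vectors = [[5, 0, 0], [4, 1, 0], [0, 0, 6], [0, 1, 5]]
model = fit_clip_clusters(clip_vectors, n_clusters=2)
candidates = candidate_indices([3, 1, 0], model)
print(candidates)
```

Only the clips in `candidates` would then need an exact cosine-distance comparison against the article.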


## Imports

In [1]:
from sklearn.feature_extraction.text import CountVectorizer
import requests
from bs4 import BeautifulSoup

In [2]:
access_token = "cvTFhY53VXBYm5HO85weHPx346W05015"
calais_url = 'https://api.thomsonreuters.com/permid/calais'
headers = {'X-AG-Access-Token' : access_token, 'Content-Type' : 'text/raw', 'outputformat' : 'application/json'}


## Scan article
- Request the site
    - Found this API: https://newsapi.org/
        - Can't search individual articles by URL as far as I can tell. Might be worth considering for multiple news sources, though.
        - Doesn't return the full article, only the title and description
- Find relevant info
    - http://mallet.cs.umass.edu/index.php
    - http://www.opencalais.com/
- Assemble a corpus OR run a search on archive.org

#### Requests

In [3]:
url = 'https://www.huffingtonpost.com/entry/alex-jones-infowars-app-apple-google_us_5b694ec3e4b0de86f4a4bc1d'
header = {"Accept-Encoding": "gzip", "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
res = requests.get(url, headers = header)
res.status_code

200

In [4]:
soup = BeautifulSoup(res.content, 'lxml')

#### Parse

In [5]:
title = soup.find("h1", attrs={"class":"headline__title"}).text
title

"Alex Jones' Infowars Still Not Banned On App Stores, Instagram And Twitter"

In [6]:
subtitle = soup.find("div", attrs={"class":"headline__subtitle"}).text
subtitle

'Companies like Spotify and Facebook have taken action, removing Infowars content.'

In [7]:
soup.find("a", attrs={"class": "author-card__link yr-author-name"}).text

'Willa Frej'

In [8]:
# Join the text of every content block inside the article body
entry = soup.find("div", attrs={"class": "entry__text"})
body = "\n\n".join(div.text for div in entry.find_all("div", attrs={"class": "content-list-component"}))
print(body)

Apple and Google have both removed some Alex Jones content from some of their platforms, but his Infowars app is still available for download in both app stores and his accounts on Twitter and Instagram are still active.

Listed under the “News” sections of the iOS App Store and the Google Play store, the Infowars app offers its subscribers livestreams and articles. It’s ranked as high as No. 23 among free news apps on Apple, according to CNN Money.

Apple did remove some of Jones’ podcasts from iTunes, while YouTube, which is owned by Google, removed his channel. 

“Apple does not tolerate hate speech, and we have clear guidelines that creators and developers must follow to ensure we provide a safe environment for all of our users,” the company said in a statement. “Podcasts that violate these guidelines are removed from our directory making them no longer searchable or available for download or streaming. We believe in representing a wide range of views, so long as people are respect

In [9]:
response = requests.post(calais_url, data=body.encode('utf-8'), headers=headers, timeout=80)
print ('status code: %s' % response.status_code)
content = response.text
# print ('Results received: %s' % content)

status code: 200


In [10]:
import json

c = json.loads(content)


In [19]:
for key, value in c.items():
    if value.get('_typeGroup') == 'entities':
        print(value['name'])

Instagram
Pinterest
LinkedIn
Amazon
President
iOS App Store
SPOTIFY
social media platforms
Twitter
Internet Bill
Google
YouTube
Facebook
Alex Jones
Donald Trump
CNN
Google Play store
Apple


In [13]:
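topics = list(c.keys())[1:]

for topic in topics:
    item = c[topic]
    try:
        print(item['name'])
        group = item['_typeGroup']
        if group == 'topics':
            # topic items carry a 'score' field rather than 'importance'
            print("score:", item['score'])
        elif group == 'socialTag':
            print("importance:", item['importance'])
        else:
            print("relevance:", item['relevance'])
    except (KeyError, TypeError):
        # skip items that lack a name or the expected weight field
        pass
    print()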

In [15]:
for topic in topics:
    try:
        print(c[topic]['name'])
        print(c[topic])
        print("mentioned", len(c[topic]['instances']), "times")
    except (KeyError, TypeError):
        # items without a name or an 'instances' list are skipped
        pass
    print()

Technology_Internet
{'_typeGroup': 'topics', 'forenduserdisplay': 'false', 'score': 0.957, 'name': 'Technology_Internet'}


Software
{'_typeGroup': 'socialTag', 'id': 'http://d.opencalais.com/dochash-1/002fcf23-1236-38e4-9fe2-1ec16bb33b4c/SocialTag/1', 'socialTag': 'http://d.opencalais.com/genericHasher-1/16293516-a848-3162-8f98-a310581b027d', 'forenduserdisplay': 'true', 'name': 'Software', 'importance': '1', 'originalValue': 'Software'}

Computing
{'_typeGroup': 'socialTag', 'id': 'http://d.opencalais.com/dochash-1/002fcf23-1236-38e4-9fe2-1ec16bb33b4c/SocialTag/2', 'socialTag': 'http://d.opencalais.com/genericHasher-1/4f9a3d55-33f5-3738-a2f7-3e9065a5a169', 'forenduserdisplay': 'true', 'name': 'Computing', 'importance': '1', 'originalValue': 'Computing'}

Digital media
{'_typeGroup': 'socialTag', 'id': 'http://d.opencalais.com/dochash-1/002fcf23-1236-38e4-9fe2-1ec16bb33b4c/SocialTag/3', 'socialTag': 'http://d.opencalais.com/genericHasher-1/771e4d9a-b0cd-37a2-bec6-6840c4cafab3', 'foren

At this point, the next step is to synthesize these results into a search query.

In [20]:
tags = []
for key, value in c.items():
    if value.get('_typeGroup') == "entities":
        new_entity = {
            'name': value['name'],
            'mentions': len(value['instances']),
            'type': value['_type'],
        }
        # 'commonname' is read only for Person entities; other types get ""
        new_entity['commonname'] = value['commonname'] if new_entity['type'] == "Person" else ""
        tags.append(new_entity)

In [21]:
import pandas as pd

df = pd.DataFrame(tags)
df

|   | commonname | mentions | name | type |
|---|---|---|---|---|
| 0 |  | 3 | Instagram | Company |
| 1 |  | 1 | Pinterest | Company |
| 2 |  | 1 | LinkedIn | Company |
| 3 |  | 1 | Amazon | Company |
| 4 |  | 1 | President | Position |
| 5 |  | 1 | iOS App Store | Facility |
| 6 |  | 1 | SPOTIFY | Company |
| 7 |  | 1 | social media platforms | IndustryTerm |
| 8 |  | 4 | Twitter | Company |
| 9 |  | 1 | Internet Bill | IndustryTerm |

In [18]:
df[df['mentions'] > df['mentions'].mean()]

|   | commonname | mentions | name | type |
|---|---|---|---|---|
| 0 |  | 3 | Instagram | Company |
| 8 |  | 4 | Twitter | Company |
| 13 | Alex Jones | 14 | Alex Jones | Person |
| 17 |  | 4 | Apple | Company |


In [19]:
# Select all people who were mentioned more than the mean number of mentions per person
people = df[df["type"] == "Person"]
POI = list(people[people["mentions"] > people["mentions"].mean()]["commonname"])

# Do we want to search full names, or just last names?
query = " AND ".join(POI)
query = query.replace(" ", "%20")

queryURL = "https://archive.org/details/tv?q="+query
print(queryURL)

https://archive.org/details/tv?q=Alex%20Jones
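
The query-building cell above only escapes spaces. A slightly more general sketch, using the standard library's `urllib.parse.quote`, is shown here together with a last-name variant for the open question in the comment. The helper names `build_query_url` and `last_names` are hypothetical, and joining names with `AND` simply mirrors the cell above rather than any documented archive.org query syntax.

```python
from urllib.parse import quote

ARCHIVE_TV_SEARCH = "https://archive.org/details/tv?q="


def build_query_url(names, joiner=" AND "):
    """Join names into one query string and percent-encode it."""
    query = joiner.join(names)
    # safe="" also encodes characters such as "/" that quote() keeps by default
    return ARCHIVE_TV_SEARCH + quote(query, safe="")


def last_names(names):
    """Keep only the final word of each name (naive; assumes 'First Last' order)."""
    return [name.split()[-1] for name in names if name.strip()]


full_url = build_query_url(["Alex Jones"])
short_url = build_query_url(last_names(["Alex Jones", "Donald Trump"]))
print(full_url)
print(short_url)
```

Searching on last names alone would likely return more clips, at the cost of matching unrelated people who share a surname.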


In [20]:
POI

['Alex Jones']

In [21]:
json_result = {"POI": POI, "document": body, "title": title}

In [25]:
with open("search_result.json","w+") as f:
    json.dump(json_result, f)

In [26]:
from sklearn.metrics.pairwise import pairwise_distances, cosine_distances
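
The notebook stops at this import. As a sketch of how `cosine_distances` might be used for the ranking step, the example below compares an article's entity counts with a few clips. Everything about the clips is hypothetical (their titles and entity counts are made up), and using `DictVectorizer` to turn entity dicts into vectors is one assumed option, not something the notebook settles on.

```python
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.metrics.pairwise import cosine_distances


def closest_clips(article_entities, clips, k=3):
    """Return (title, distance) for the k clips nearest to the article."""
    vectorizer = DictVectorizer(sparse=False)
    # Fit on the article and all clips together so every entity gets a column
    matrix = vectorizer.fit_transform([article_entities] + [clip["entities"] for clip in clips])
    distances = cosine_distances(matrix[:1], matrix[1:])[0]
    # A stable sort keeps the original clip order when distances tie
    order = np.argsort(distances, kind="stable")[:k]
    return [(clips[i]["title"], float(distances[i])) for i in order]


# Entity counts for the article, taken from the mentions table above
article = {"Alex Jones": 14, "Twitter": 4, "Apple": 4, "Instagram": 3}
# Made-up clips for illustration only
clips = [
    {"title": "Clip A", "entities": {"Alex Jones": 5, "Apple": 2}},
    {"title": "Clip B", "entities": {"Donald Trump": 7}},
    {"title": "Clip C", "entities": {"Twitter": 3, "Instagram": 1}},
    {"title": "Clip D", "entities": {"Amazon": 2}},
]
ranked = closest_clips(article, clips, k=3)
for title, distance in ranked:
    print(title, round(distance, 3))
```

Clips that share no entities with the article all sit at distance 1, so in practice a cutoff below 1 may be worth adding before returning results.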