Welcome to the notebook (IPYNB) for this project. It is a record of the experiments we ran to understand NLP and apply it as well as we could. A working version of most of this code lives in backbone.py.

In [5]:
# importing all required libraries
import re
import pandas as pd
import numpy as np
import spacy

Next, we consolidate all the required preprocessing into a single function that can be applied to a DataFrame column.

In [6]:
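def clean(text):
    text = str(text)
    # numbered-list markers: digits, any character, a tab, then the next three characters
    text = re.sub(r'[0-9]+.\t...', '', text)
    # line breaks: drop "newline + space", turn the remaining newlines into spaces
    text = re.sub(r'\n ', '', text)
    text = re.sub(r'\n', ' ', text)
    # possessives, hyphens, em dashes and double quotes
    text = re.sub(r"'s", '', text)
    text = re.sub(r'-', ' ', text)
    text = re.sub(r'—', '', text)
    text = re.sub(r'"', '', text)
    # periods after titles
    text = re.sub(r'Mr\.', 'Mr', text)
    text = re.sub(r'Mrs\.', 'Mrs', text)
    # anything inside round or square brackets
    text = re.sub(r'[\(\[].*?[\)\]]', '', text)
    # lines that start with a URL
    text = re.sub(r'^https?:\/\/.*[\r\n]*', '', text, flags=re.MULTILINE)
    # returned as a one-element list so later cells can iterate over it
    return [text]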
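
One detail of the ordering in clean may be worth checking: the URL pattern is anchored to the start of a line, but by the time it runs the newlines have already been replaced, so it can only match a URL at the very start of the text. The sketch below compares the two orderings on a made-up sample. The helper names and the sample string are illustrative assumptions, not part of the project.

```python
import re

# Same URL pattern as in clean(), compiled once for reuse
URL_LINE = re.compile(r'^https?://.*[\r\n]*', flags=re.MULTILINE)

def strip_urls_then_flatten(text):
    """Hypothetical variant: drop URL lines first, then collapse newlines."""
    text = URL_LINE.sub('', str(text))
    return re.sub(r'\n', ' ', text)

def flatten_then_strip_urls(text):
    """Same order as clean(): collapse newlines first, then try to drop URL lines."""
    text = re.sub(r'\n', ' ', str(text))
    return URL_LINE.sub('', text)

# Made-up sample with a URL on its own line
sample = "Pixar news\nhttp://example.com/story\nMore text"
url_first = strip_urls_then_flatten(sample)
flatten_first = flatten_then_strip_urls(sample)
print(url_first)
print(flatten_first)
```

Whether this matters depends on whether the articles actually contain URLs on their own lines.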

Next, we load the dataset, drop the columns we will not need, and keep only the articles that belong to a single story.

In [7]:
# Load the dataset and drop the columns we do not need
df = pd.read_csv(r'.\datasets_AI\news.csv')
df = df.drop(columns=['url', 'hostname', 'timestamp'])
# Keep only the articles of one story; copy so the new column is not added to a slice
df = df[df.story == 'dABGVITQs6X1I4MdYGnX9zY59PpVM'].copy()
# Clean the article text
df['Speech_clean'] = df['main_content'].apply(clean)

# Reset the index after filtering and drop the leftover CSV index column
df.reset_index(drop=True, inplace=True)
df.drop(columns=['Unnamed: 0'], inplace=True)

df.head()

Unnamed: 0,id,title,publisher,category,story,main_content,main_content_len,Speech_clean
0,22237,"The Incredibles 2, Cars 3 in the works, Disney...",Digital Spy,e,dABGVITQs6X1I4MdYGnX9zY59PpVM,The Incredibles 2 and Cars 3 are in developmen...,1340.0,[The Incredibles 2 and Cars 3 are in developme...
1,22241,The Incredibles are set for another big-screen...,Belfast Telegraph,e,dABGVITQs6X1I4MdYGnX9zY59PpVM,"Incredibles 2, Cars 3 in the works BelfastTele...",1620.0,"[Incredibles 2, Cars 3 in the works BelfastTel..."
2,22244,Pixar Working On Sequels For Popular Animated ...,Online News Heard Now,e,dABGVITQs6X1I4MdYGnX9zY59PpVM,Posted by News\n\nPixar Working On Sequels For...,1339.0,[Posted by News Pixar Working On Sequels For ...
3,22248,"State Of The (Disney) Union: Cars 3, Incredibl...",Contactmusic.com,e,dABGVITQs6X1I4MdYGnX9zY59PpVM,In news you didn’t know you needed until right...,1737.0,[In news you didn’t know you needed until righ...
4,22249,Disney Pixar confirm The Incredibles 2,Total Film,e,dABGVITQs6X1I4MdYGnX9zY59PpVM,The first footage from Incredibles 2 (there's ...,2942.0,[The first footage from Incredibles 2 was sho...


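This step is mostly redundant: we initially thought we would need multiple functions to preprocess our datasets properly. Because clean returns a one-element list, the inner loop below runs once per article and simply records each article's id, cleaned text, and word count.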

In [8]:
row_list = []

for i in range(len(df)):
    # Speech_clean holds a one-element list, so this loop runs once per article
    for sent in df.at[i, 'Speech_clean']:
        wordcount = len(sent.split())
        article_id = df.at[i, 'id']  # renamed to avoid shadowing the built-in id()
        row_list.append({'id': article_id, 'sent': sent, 'len': wordcount})

df2 = pd.DataFrame(row_list)
df2.head()

Unnamed: 0,id,sent,len
0,22237,The Incredibles 2 and Cars 3 are in developmen...,224
1,22241,"Incredibles 2, Cars 3 in the works BelfastTele...",239
2,22244,Posted by News Pixar Working On Sequels For P...,227
3,22248,In news you didn’t know you needed until right...,306
4,22249,The first footage from Incredibles 2 was show...,503


Next, we load spaCy's medium English model, en_core_web_md. Unlike the small model, it ships with word vectors, so we get more accurate cosine similarities.

In [9]:
nlp = spacy.load('en_core_web_md')
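
For reference, as far as we understand spaCy's defaults, the similarity between two documents is the cosine similarity of their vectors, and a document's vector is the average of its token vectors. A minimal pure-Python sketch of that calculation follows; the 3-dimensional vectors are made up for illustration and stand in for the model's real word vectors, and the function names are our own.

```python
import math

def mean_vector(vectors):
    """Average a list of equal-length vectors component by component."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy "token vectors" for two documents (made up for illustration)
doc_a = [[1.0, 0.0, 1.0], [1.0, 2.0, 1.0]]
doc_b = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]

sim = cosine_similarity(mean_vector(doc_a), mean_vector(doc_b))
print(sim)
```

Because the document vector is an average, long articles on the same topic tend to score very high, which is worth keeping in mind when choosing a threshold.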

We pick two random articles from our dataset and compare them. Since we are not training a model, there is no need for proper train/test sets.

In [10]:
p = df2['sent'].tolist()
# randint excludes the upper bound; start at 0 so the first article can be picked too
docx = nlp(p[np.random.randint(0, len(df2))])
docy = nlp(p[np.random.randint(0, len(df2))])
x = docx.similarity(docy)

[(docx, docy), x]

[(Pixar has announced that it is working on sequels to The Incredibles and Cars 2.  Disney CEO Bob Iger confirmed the news while speaking at the company shareholder meeting in Portland, Oregon.  No details of storyline or release dates have been revealed for the upcoming films.  The studio is also working on a Finding Nemo sequel, Finding Dory, as well as The Good Dinosaur and Inside Out.  However, with Finding Dory and The Good Dinosaur both being pushed back, 2014 will be Pixar first year without releasing a feature film since 2005.  The delays mean the gap between Pixar most recent release, Monsters University, and Inside Out is the longest period they've gone without a new movie since the original Incredibles  was followed by Cars in June 2006.  The most recent film in the Cars franchise, Cars 2, was released in 2011.  Pixar has also announced that a 3D version of 2007 Ratatouille is planned.  Toy Story and Toy Story 2 have previously been re released in 3D, as well as Finding Nemo
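
The cell above draws the two indices independently, so the same article can occasionally be drawn twice, in which case the similarity is trivially close to 1.0. A minimal sketch of drawing two distinct indices with the standard library instead; pick_two_distinct is a hypothetical helper, not something defined in backbone.py.

```python
import random

def pick_two_distinct(n_items, seed=None):
    """Return two different indices in range(n_items), index 0 included."""
    if n_items < 2:
        raise ValueError("need at least two items to form a pair")
    rng = random.Random(seed)  # pass a seed to make the draw reproducible
    first, second = rng.sample(range(n_items), 2)
    return first, second

i, j = pick_two_distinct(5, seed=0)
print(i, j)
# In the notebook this would be used as: docx, docy = nlp(p[i]), nlp(p[j])
```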

Next, we define a function that groups a document's named entities into a dict keyed by entity label, with duplicates removed.

In [11]:
def dictfy(d1, t1):
    sendict = dict()
    for key in t1:
        sendict[key] = []
        for word in d1:
            if word.label_ == key:
                sendict[key].append(str(word).lower().strip())
    for key in sendict.keys():
        sendict[key] = list(set(sendict[key]))
    return sendict
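
The same grouping can be done in a single pass over the entities. The sketch below assumes plain (text, label) tuples in place of spaCy entity spans so that it runs without a model; group_by_label and the sample data are illustrative, and unlike dictfy it returns each list sorted.

```python
from collections import defaultdict

def group_by_label(entities):
    """Group (text, label) pairs into {label: sorted unique lowercase texts}."""
    grouped = defaultdict(set)
    for text, label in entities:
        grouped[label].add(text.lower().strip())
    return {label: sorted(texts) for label, texts in grouped.items()}

# Made-up pairs standing in for (ent.text, ent.label_)
sample_entities = [
    ("Pixar", "ORG"),
    ("Disney", "ORG"),
    ("pixar ", "ORG"),
    ("Bob Iger", "PERSON"),
]
grouped = group_by_label(sample_entities)
print(grouped)
```

With a spaCy document this would be called as group_by_label((ent.text, ent.label_) for ent in doc.ents).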


If the two articles are not similar enough, we skip them: we do not extract their entities or look for intersections between them. The rest of this code has been refined in backbone.py, which has no comments due to lack of time.

In [12]:
if x > 0.925:
    # named entities of each article, grouped by label
    x1 = dictfy(docx.ents, [ent.label_ for ent in docx.ents])
    # VBG is the tag for gerunds / present participles; these are kept as spaCy tokens
    x1['VERB'] = list({tok for tok in docx if tok.tag_ == 'VBG'})
    y1 = dictfy(docy.ents, [ent.label_ for ent in docy.ents])
    y1['VERB'] = list({tok for tok in docy if tok.tag_ == 'VBG'})
    print([x1, y1], x)
else:
    print("sentences not similar enough")

# [[ent.text, ent.label_] for ent in docy.ents]

[{'ORG': ['pixar', 'inside out', 'monsters university', 'twitter', 'cars', 'monsters inc.', 'lucasfilm', 'incredibles', 'disney'], 'PERSON': ['jj abrams', 'nemo', 'bob iger'], 'GPE': ['oregon', 'portland', 'london'], 'DATE': ['about 30 years', '2014', '2011', 'may', 'june 2006', 'first year', '18 december 2015', '2005'], 'WORK_OF_ART': ['star wars'], 'FAC': ['pinewood studios'], 'VERB': [being, Finding, releasing, filming, working, working, shooting, speaking]}, {'CARDINAL': ['two', '2', '3d.'], 'ORG': ['pixar', 'inside out', 'nemo, monsters inc.', 'cars 2', 'monsters university’s', 'pinewood studios', 'lucasfilm', 'toy story and toy story 2', 'incredibles', 'disney'], 'PERSON': ['bob iger'], 'WORK_OF_ART': ['online news heard now', 'star wars', 'finding dory”'], 'DATE': ['2014', '2004', 'may', 'first year', '2006', '2007', 'dec. 18, 2015', '30 years', '2005'], 'GPE': ['london'], 'VERB': [being, filming, working, Finding, Finding, working, including]}] 0.994671072706162
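
backbone.py is not shown here, so the following is only a sketch of what finding the intersections between two such dicts could look like, not the project's implementation. It assumes every value is a list of lowercase strings; the VERB lists above hold spaCy tokens, which would need converting to text first to be comparable across documents. shared_entities is a hypothetical name, and the sample dicts are a subset of the output above.

```python
def shared_entities(first, second):
    """Return {label: sorted common values} for labels present in both dicts."""
    shared = {}
    for label in first.keys() & second.keys():
        common = set(first[label]) & set(second[label])
        if common:  # skip labels with nothing in common
            shared[label] = sorted(common)
    return shared

# Subset of the two dicts printed above
x_sample = {
    'PERSON': ['jj abrams', 'nemo', 'bob iger'],
    'GPE': ['oregon', 'portland', 'london'],
    'FAC': ['pinewood studios'],
}
y_sample = {
    'PERSON': ['bob iger'],
    'GPE': ['london'],
    'CARDINAL': ['two', '2', '3d.'],
}
common = shared_entities(x_sample, y_sample)
print(common)
```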
