### Project Type: Search Engine Building

#### Group : Group 4

#### Members : 

    Nupur Roy Chowdhury, nr572@drexel.edu
    
    Mahesh Sercat Ramakumar, ms4976@drexel.edu
    
    Rohith Lakshminarayana, rl669@drexel.edu
    
    Manisha Uttam Nandawadekar, mun24@drexel.edu
    

### Introduction:

As part of our project we are using the News API: http://newsapi.org/  to collect news data for our project.

In this notebook we are collecting, loading the data and indexing it into the elasstic search servers.

In [1]:
from urllib.request import urlopen
import pandas as pd
import json, datetime, warnings
from elasticsearch import Elasticsearch, RequestsHttpConnection
from newsapi import NewsApiClient
from sklearn.preprocessing import MinMaxScaler

### Defining the server connections:

In the below, we are establishing the connection to the Elasticsearch servers hosted in Drexel Network.

In [2]:
def Elasticsearch_connection(host_link,user_auth):
    warnings.filterwarnings("ignore")
    es = Elasticsearch(hosts=host_link ,verify_certs=False,http_auth= user_auth, connection_class=RequestsHttpConnection,)
    print("ElasticSearch connection has been established and this connection instance is stored in es variable")
    return es

In [None]:
es = Elasticsearch_connection(['https://tux-es1.cci.drexel.edu:9200/','https://tux-es2.cci.drexel.edu:9200/',
                               'https://tux-es3.cci.drexel.edu:9200/'],'ms4976:Phooh3ahkei7')

Once the connection has got established, we are storing it in a variable called es. This variable will be in use below to create the index.

#### We are storing the key value in a global variable

In [None]:
apikey  =  '50c30305b54a4857bef3a755586cc26e'

### Creating an index with default settings and  mappings:

In [4]:

index_name = 'ms4976_info624_201904_newsproject'

request_body = {
        'mappings': {
            
            "properties":{
                "source":{
                    "type": "text",
                    "analyzer": "standard"
                    },
                "author":{
                    "type": "text" ,
                    "analyzer": "standard",
                    "similarity": "boolean"
                    },
                "title":{
                    "type": "text" ,
                    "analyzer": "english",
                    },
     
                "description":{
                    "type": "text" ,
                    "analyzer": "english",
                    },
                 "url":{
                    "type": "text"
                    },
     
                "publishedAt":{
                    "type" : "date"
                    },
                "timestamp" :{
                    "type" : "rank_feature",
                    "positive_score_impact" : True  
                    }
                
                }
            }
        }
es.indices.create(index = index_name, body = request_body)

{'acknowledged': True,
 'shards_acknowledged': True,
 'index': 'ms4976_info624_201904_newsproject'}

* Our main index for the project is: ms4976_info624_201904_newsproject

* We have created this index with default mappings where:

        We have used the analyzer: standard for the fields - source and author.
    
        We have used the analyzer: English for title and description.
    
        For the author field, we have used the boolean similarity.
    
        To rank the documents in a given order, we have used the rank_feature for the filed timestamp.

### Collecting sources names:

Below we have collected the names of all the 128 news channels/sources from where we would be collecting our data.

The reason to do this is if we use only one name in the query field of the 'API' then, it returns only 20 values of the data.

Hence, to collect more data, we have used multiple sources.

In [6]:
newsapi = NewsApiClient(api_key=apikey)
sources = newsapi.get_sources()
news_terms = list()
for news_name in sources["sources"]:
    news_terms.append(news_name['id']) 
print(news_terms)

['abc-news', 'abc-news-au', 'aftenposten', 'al-jazeera-english', 'ansa', 'argaam', 'ars-technica', 'ary-news', 'associated-press', 'australian-financial-review', 'axios', 'bbc-news', 'bbc-sport', 'bild', 'blasting-news-br', 'bleacher-report', 'bloomberg', 'breitbart-news', 'business-insider', 'business-insider-uk', 'buzzfeed', 'cbc-news', 'cbs-news', 'cnn', 'cnn-es', 'crypto-coins-news', 'der-tagesspiegel', 'die-zeit', 'el-mundo', 'engadget', 'entertainment-weekly', 'espn', 'espn-cric-info', 'financial-post', 'focus', 'football-italia', 'fortune', 'four-four-two', 'fox-news', 'fox-sports', 'globo', 'google-news', 'google-news-ar', 'google-news-au', 'google-news-br', 'google-news-ca', 'google-news-fr', 'google-news-in', 'google-news-is', 'google-news-it', 'google-news-ru', 'google-news-sa', 'google-news-uk', 'goteborgs-posten', 'gruenderszene', 'hacker-news', 'handelsblatt', 'ign', 'il-sole-24-ore', 'independent', 'infobae', 'info-money', 'la-gaceta', 'la-nacion', 'la-repubblica', 'le-m

#### Some popular news items category from kaggle news dataset

In [7]:
news_data = pd.read_json(r"News_Category_Dataset_v2.json",lines=True)
news_category = list(news_data["category"].unique())
print(news_category)

['CRIME', 'ENTERTAINMENT', 'WORLD NEWS', 'IMPACT', 'POLITICS', 'WEIRD NEWS', 'BLACK VOICES', 'WOMEN', 'COMEDY', 'QUEER VOICES', 'SPORTS', 'BUSINESS', 'TRAVEL', 'MEDIA', 'TECH', 'RELIGION', 'SCIENCE', 'LATINO VOICES', 'EDUCATION', 'COLLEGE', 'PARENTS', 'ARTS & CULTURE', 'STYLE', 'GREEN', 'TASTE', 'HEALTHY LIVING', 'THE WORLDPOST', 'GOOD NEWS', 'WORLDPOST', 'FIFTY', 'ARTS', 'WELLNESS', 'PARENTING', 'HOME & LIVING', 'STYLE & BEAUTY', 'DIVORCE', 'WEDDINGS', 'FOOD & DRINK', 'MONEY', 'ENVIRONMENT', 'CULTURE & ARTS']


In [8]:
#some news items names manually added
news_items = ['World Cup','IPL','Scam','Environment','Rankings','Convention','Missing','Kamala Harris','Russia','Safety',
              'Purdue','prohibit','Pandemic','Epidemic','Covid','Cricket''Hockey','Basketball','Badminton','Football','Soccer',
              'President','History','Games','Wildfire','Trump','Politics','Arrest','Shooting','Gathering','Threatning','Police',
              'Vote','Election','Industry','Congress','Health','Entertainment','Destructive','Mystery','Summer','Risks','Fraud',
              'Hospital','Community','Federal','Killer','Airline','Voting','Protest','Voilence','Lockdowns','President',
              'Robbery','Education']
print(news_items)

['World Cup', 'IPL', 'Scam', 'Environment', 'Rankings', 'Convention', 'Missing', 'Kamala Harris', 'Russia', 'Safety', 'Purdue', 'prohibit', 'Pandemic', 'Epidemic', 'Covid', 'CricketHockey', 'Basketball', 'Badminton', 'Football', 'Soccer', 'President', 'History', 'Games', 'Wildfire', 'Trump', 'Politics', 'Arrest', 'Shooting', 'Gathering', 'Threatning', 'Police', 'Vote', 'Election', 'Industry', 'Congress', 'Health', 'Entertainment', 'Destructive', 'Mystery', 'Summer', 'Risks', 'Fraud', 'Hospital', 'Community', 'Federal', 'Killer', 'Airline', 'Voting', 'Protest', 'Voilence', 'Lockdowns', 'President', 'Robbery', 'Education']


In [9]:
news_terms_list = news_terms + news_category + news_items

From the above, we have collected a list of query terms to fetch our data from the 'News API' as below:

In [10]:
#scraping the data using various news terms in newsapi 
def scraping_data(news_terms_list,apikey):
    data = list()
    for term in news_terms_list:
        #if any term contains the spaces, it will replace it with '%20' value as url wont recognize for any empty spaces in between
        term = term.replace(' ','%20')
        url = "http://newsapi.org/v2/everything?q="+str(term)+"&apiKey="+str(apikey)
        json_dumps = urlopen(url)
        for article in json.loads(json_dumps.read())['articles']:
            data.append(article)
    print(str(len(data))+" Documents extracted from various news sources")
    return data

In [11]:
extracted_data = scraping_data(news_terms_list,apikey)

4221 Documents extracted from various news sources


Hence, a total of 4221 Documents extracted from various news sources

### Cleaning the scraped data:

    In the JSON file collected above the source, the field has a name and id.

    The id filed is sometimes having a null value, hence we are taking only the name field from the source.

    To use the rank feature we are converting the 'publishedAt' value to timestamp and normalizing it.

    The fields 'urlToImage' and 'content are being removed.


In [12]:
def cleaning_data(data):
    total_value = 0
    #Modifiying source feature and removes the unwanted fields
    for key,value in enumerate(data):
        temp = value['source']['name']
        data[key]['source'] = temp
        date = datetime.datetime.strptime(value['publishedAt'], "%Y-%m-%dT%H:%M:%SZ")
        timestamp = datetime.datetime.timestamp(date)
        data[key]['timestamp'] = timestamp #creates a new field timestamp 
        
        total_value += timestamp
        del data[key]['urlToImage']
        del data[key]['content']
        
    #Normalize the timestamp field
    factor = 1.0/total_value
    for key,value in enumerate(data):
        data[key]['timestamp'] = value['timestamp'] *factor

    print("Data cleaning/Formating has completed")
    return data

In [13]:
cleaned_data = cleaning_data(extracted_data)

Data cleaning/Formating has completed


* Below, we are indexing the documents which are unique and have not been collected previously.

* Based on this, the documents will be indexed from the last index previously created.

In [14]:
#removing redundant data in any
def redundancy_check(data, es, index_name):
    new_data = list()
    check_set = set()
    #Finding number of documents present in our index
    result = es.search(index=index_name)
    doc_id = result['hits']['total']['value']
    
    #retreive all the data from the index and uses as a check_set load all titles present in index to avoid data repetition
    res = es.search(index=index_name, body={"from" : 0, "size" : doc_id,"query": {"match_all" : {}}})
    for each_doc in res['hits']['hits']:
        check_set.add(each_doc['_source']['title'])
    #removes the data redundancy not only w.r.to index data, but also looking into data prepared itself
    for each_doc in data:
        if each_doc["title"] not in check_set:
            new_data.append(each_doc)
            check_set.add(each_doc["title"])
    print("There exists "+str(len(new_data))+" unique documents which are not present in our index and we can index this data")
    return new_data

In [15]:
final_data = redundancy_check(cleaned_data, es,index_name)

There exists 3561 unique documents which are not present in our index and we can index this data


In [16]:
#indexing_data function is used to index the data
def indexing_data(data,es,index_name):
    
    #Checks the total documents present in the given index and stores this count in  doc_id variable
    result = es.search(index=index_name)
    doc_id = result['hits']['total']['value']
    
    #Loops over the data['articles'] dictionary and index each document into our index with sequential unique doc id
    for i in data:
        doc_id +=1
        es.index(index=index_name, doc_type='_doc',id=doc_id, body=i)
    print("Total "+str(len(data))+" documents indexed successfully into "+index_name+" index.")
    print("After indexing, Total "+str(doc_id)+" documents are present in this index")
    return doc_id

In [17]:
doc_id = indexing_data(final_data,es, index_name)

Total 3561 documents indexed successfully into ms4976_info624_201904_newsproject index.
After indexing, Total 3561 documents are present in this index


#### Next we have the Custome Similarity Comparision Jupyter notebook which we have used for comparing different custom similarities for our project.