# Audio Listening Preference

Authors: Rahul Kuriyedath, Hazel Jiang, Marc Sun

## Executive Summary

Newsly is an audio news aggregator mobile app that reads the latest trending web articles to users in a natural human voice. Its interface lets users view the day's top trending articles on the home page, select news articles by country, and choose articles by category. We believe a recommendation system would complement these features and make the app more attractive to users. In this proposal, we plan to apply two methods to Newsly’s data set to recommend audio articles that users might be interested in. The first is the general approach, which recommends articles based on their popularity in previous user listening history. It is best suited to new users who have not clicked on any articles yet. The second is the article-specific approach, which recommends articles on topics related to the one a user has selected. Because we do not have a large amount of data, both approaches have limitations. Further development may involve removing noise (such as test users) from the data set and making more personalized recommendations with data that links users to articles.

## Introduction

Hundreds of news articles are generated and posted every hour. With this much information, it is hard for users to quickly find what they are interested in {cite}`recommendation`. A recommendation system helps with this problem. In this capstone, we propose to develop two recommendation methods intended to improve user experience and increase user engagement.

- General Recommendation: Provide generic recommendations based on the news articles that are most popular among users of the application.
- Article-Specific Recommendation: Use a specific news article to recommend other articles on related topics.

Our end product will be a set of Python files, which we would like to deploy to the cloud (either AWS or Canarie) so that Newsly can run them on its own.

### *EDA*

Newsly uses both AWS and Google Analytics to store its data. The data in AWS is mainly information about users and news articles, while the data in Google Analytics consists of events and activity counts. After going through the data on both platforms, we found three tables useful. The first is the `listened article` table shown in Figure 1. It contains the titles of articles that existing users have clicked on and the total number of clicks for each. This information helps us identify the topics that existing users were interested in.

In [5]:
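import pandas as pd
# Load the Google Analytics export, skipping the first 249 rows of the file, and keep the first five articles.
listened_article = pd.read_csv('data/listened_article_2021.csv', skiprows=249).rename(columns={"Custom parameter": "Trend", "Event count": "Total Clicks"}).head()
def highlight_cols(s):
    # Return the CSS rule that shades a cell light green.
    color = 'lightgreen'
    return 'background-color: %s' % color
# Shade the columns used in the analysis and hide the index in the rendered table.
listened_article.style.applymap(highlight_cols, subset=pd.IndexSlice[:, ['Trend', 'Total Clicks']]).hide_index()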

Trend,Total Clicks,Total users
(not set),292,100
Elon Musk is donating a $100 million prize for carbon capture technology — here's what that means,13,4
Ashley Biden rocked a tuxedo on inauguration night — and it was everything,12,10
Longtime home run king Hank Aaron dies at 86,11,7
Kamala Harris: The Vice President,10,10


```{glue:figure} table_1
:figwidth: 300px
:name: "tbl:df"
Articles that existing users have listened to
```

Figure 2 shows the archive of past articles that have appeared in the app, with both the title and the full text of each article. We plan to use this data in the article-specific approach to find articles with similar content.

In [12]:
# Each line of trendinfo.txt is a tab-separated (name, content) record; the third field is discarded.
trendinfo = pd.read_csv('data/trendinfo.txt',sep='\t', names=['name', 'content', 'drop']).drop(columns='drop')
# Collect the DATE, TREND and TEXT records into columns, replacing the PDT/PST suffix with the America/Vancouver zone name before parsing the dates.
trendinfo_clean = {'Date':pd.to_datetime(trendinfo.query('name == "DATE"').content.replace({'.PDT', '.PST'}, ' America/Vancouver', regex=True), 
                                         format='%H:%M:%S.%f - %b %d %Y %Z'), 
                   'Trend':trendinfo.query('name == "TREND"').content.to_list(),
                   'TEXT':trendinfo.query('name == "TEXT"').content.to_list()}
trendinfo_clean = pd.DataFrame(trendinfo_clean)
# Show the first five rows, truncating each article body to 70 characters.
trendinfo_clean_head = trendinfo_clean.head().copy()
trendinfo_clean_head.TEXT = trendinfo_clean_head.TEXT.apply(lambda x: x[:70]+'...')
trendinfo_clean_head.style.applymap(highlight_cols, subset=pd.IndexSlice[:,['Trend','TEXT']]).hide_index()

Date,Trend,TEXT
2021-04-16 11:05:15.075547-07:00,Ontario COVID,"TORONTO - . Without a stay-at-home order lasting six weeks, a more robust vacci..."
2021-02-25 12:06:10.848151-08:00,Charlie Munger,"Berkshire Hathaway Vice Chairman Charlie Munger gave his views about Robinhood,..."
2020-12-13 09:03:45.175009-08:00,Crystal Palace vs Tottenham,"Crystal Palace v Spurs – history, stats and facts | Tottenham Hotspur..."
2020-12-13 15:06:40.845662-08:00,Jordyn Huitema,"Vikings vs. Buccaneers highlights | Week 14. The Canadian PressHenry runs wild, ..."
2021-02-20 23:06:14.307117-08:00,Genesis Invitational,"Major champion misses a putt by 40 feet, then play is suspended. Keegan Bradley..."


```{glue:figure} table_1
:figwidth: 300px
:name: "tbl:df"
Archived articles with body of text
```

Figure 3 shows all the news articles currently available in the app. We will make recommendations from the articles in this table. It is also the most informative table, with the article title, domain, categories, and other fields. Right now we use only the article title, but as we develop our methods, other information may be taken into consideration as well.

In [9]:
pd.read_csv('data/Trends.csv').style.applymap(highlight_cols, subset=pd.IndexSlice[:,['trend','categories','duration']]).hide_index()

trend,categories,countries,created,domain,duration,processed,traffic
"Friends: The Reunion' trailer drops special guests include Justin Bieber, Lady Gaga and BTS",ENTERTAINMENT,"US, CA",1620954600,nme.com,1:54,1620954600,"CA: 0, US: 0"
A mysterious 'hum' vibrates interstellar space. Voyager 1 has a recording of it.,SCIENCE,"US, CA",1620954600,livescience.com,3:21,1620868200,"CA: 0, US: 0"
Adidas and Allbirds Team Up to Make Sustainable Running Shoes,BUSINESS,"US, CA",1620954600,wired.com,6:32,1620868200,"CA: 0, US: 0"
Apple Execs Chose to Keep a Hack of 128 Million iPhones Quiet,TECHNOLOGY,"US, CA",1620954600,wired.com,4:03,1620695399,"CA: 0, US: 0"
Asus ZenFone 8 delivers Snapdragon 888 from $599,TECHNOLOGY,"US, CA",1620954600,9to5google.com,2:16,1620868200,"CA: 0, US: 0"


```{glue:figure} table_1
:figwidth: 150px
:name: "tbl:df"
Articles currently presented in the app
```

## Data Science Techniques

### *General Recommendation:*

General recommendation means non-user-specific recommendation: the same content is recommended to all users, and the recommendation is not personalized. This method is expected to apply mainly to new users, because when a user’s browsing history and listening preferences are unknown, the app should simply recommend popular articles.
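As a minimal sketch of this idea, the snippet below ranks articles by click count. The rows are a subset of the click counts shown in Figure 1. The function name `top_articles` is hypothetical, and dropping the `(not set)` row is our assumption about how rows without a title should be handled, not a confirmed project decision.

```python
# A subset of the (title, total clicks) rows shown in Figure 1.
listened = [
    ("Kamala Harris: The Vice President", 10),
    ("(not set)", 292),
    ("Longtime home run king Hank Aaron dies at 86", 11),
]


def top_articles(rows, n=5, placeholder="(not set)"):
    """Return the n most-clicked (title, clicks) rows.

    Rows whose title equals the placeholder are skipped (an assumption:
    they carry no article title and so cannot be recommended).
    """
    named = [(title, clicks) for title, clicks in rows if title != placeholder]
    return sorted(named, key=lambda row: row[1], reverse=True)[:n]


recommended = top_articles(listened, n=2)
print(recommended)
```

Every user would receive the same list, which matches the non-personalized nature of this approach.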

Determining whether an article is popular is straightforward if it has already been published for at least one day, since the number of views/listens can be observed. However, if an article is newly published, the system will have to predict the popularity of that article.

The core idea in popularity prediction is determining whether an article belongs to a popular topic. For example, suppose past records show that COVID-19 has been a very popular topic over the past few days. If a newly published article is also about COVID-19, the system should recommend it. To achieve this, a variety of machine learning techniques, especially natural language processing techniques, would be applied.

This approach has two main processes: process A classifies the historical articles by topic, and process B makes popularity predictions based on the results of process A.

The figure below illustrates the two main processes of the general recommendation approach.

```{figure} images/general.png
---
height: 400px
name: geneal_approach
---
Illustration of general recommendation approach
```
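The two processes could fit together roughly as sketched below. This is an illustration under stated assumptions, not the method we have settled on: a simple keyword lookup stands in for the topic classifier, and `TOPIC_KEYWORDS`, the `history` rows, and all function names are hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical topic keywords; a real system would learn topics from article text.
TOPIC_KEYWORDS = {
    "covid": {"covid", "vaccine", "pandemic"},
    "sports": {"tottenham", "vikings", "putt"},
}


def assign_topic(text):
    """Process A stand-in: pick the topic whose keywords overlap the text most."""
    words = set(re.findall(r"[a-z0-9]+", text.lower()))
    best_topic, best_hits = None, 0
    for topic, keywords in TOPIC_KEYWORDS.items():
        hits = len(words & keywords)
        if hits > best_hits:
            best_topic, best_hits = topic, hits
    return best_topic  # None when no keyword matches


def topic_popularity(history):
    """Sum historical clicks per topic; history is a list of (title, clicks)."""
    totals = defaultdict(int)
    for title, clicks in history:
        topic = assign_topic(title)
        if topic is not None:
            totals[topic] += clicks
    return dict(totals)


def predict_popularity(new_title, totals):
    """Process B: score a new article by the past clicks of its topic (0 if unknown)."""
    return totals.get(assign_topic(new_title), 0)


# Made-up listening history, for illustration only.
history = [
    ("Ontario COVID modelling update", 40),
    ("COVID vaccine rollout expands", 25),
    ("Tottenham draw at Crystal Palace", 8),
]
totals = topic_popularity(history)
print(predict_popularity("New COVID vaccine clinic opens", totals))
```

A newly published article with no clicks of its own inherits a score from its topic, so it can be ranked alongside older articles.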

### *Article-specific recommendations:*

The general recommendation approach described in the previous section is useful when a new user opens the Newsly application for the first time: articles are recommended based on their popularity among other listeners. However, this approach is limited because it cannot personalize recommendations based on the articles that each user decides to listen to. This is where article-specific recommendations come in.

An article-specific recommendation is an article that is related to another article that the user has clicked on. For example, suppose a user clicks on an article titled “Canada to get two million Pfizer vaccine doses”. A related article that could be recommended to the user is “80% of Canadians support COVID-19 vaccine passports for travel: poll”. This is what we aim to do with article-specific recommendations.

Our main idea is to build a system that identifies which articles currently on the Newsly application are related to each other. Once these relations have been established, recommendations can be made in real time, i.e. as soon as a user clicks on an article.

This approach consists of two main processes, which we have labelled process A and process B. A high-level view of each is shown in the two figures below:

```{figure} images/input_output.png
---
height: 400px
name: input_1
---
Process A: Create Index of articles in Trends table
```
```{figure} images/input_output_2.png
---
height: 400px
name: input_2
---
Process B: Recommend top 'N' articles related to what the user clicked
```

Below is the detailed description of both processes:

1. **Process A:** This process creates an index of all articles currently in the Trends table. The index records which articles are related to each other and is used by process B to make recommendations for the user.

The process is triggered when either of the following events occurs:
- A new article is added to the Trends table
- An article is deleted from the Trends table

<u>Process Steps</u>: The steps of this process are shown in the figure below.

```{figure} images/Process_A.png
---
height: 400px
name: process_A
---
Process A
```

<u>Dependency</u>: This process depends on information from the Trends table.

<u>Process output</u>: An index of articles and relationships that will be used to recommend articles that are related to the one clicked by the user.

2. **Process B:** This process recommends the top N articles related to the one that the user has just clicked on. To find related articles, the process would use the index created by Process A.

This process would be triggered when a user clicks on any article in the Newsly application.

<u>Process Steps</u>: The steps of this process are shown in the figure below.

```{figure} images/Process_B.png
---
height: 400px
name: process_B
---
Process B
```

<u>Dependency</u>: The URL of the article clicked by the user needs to be passed to the recommender system.

<u>Process output</u>: The top N recommendations related to the article clicked by the user.
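The two processes above can be sketched as follows. This is a minimal illustration, assuming TF-IDF weighting with cosine similarity over article titles as the relatedness measure; the proposal does not commit to a specific technique. The function names `build_index` and `recommend` are hypothetical, and the sketch keys the index by title for readability, whereas the real system would receive the article URL.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(titles):
    """Process A sketch: map each title to the other titles, most similar first."""
    docs = [Counter(tokenize(title)) for title in titles]
    n_docs = len(docs)
    doc_freq = Counter(word for doc in docs for word in doc)
    # Smoothed inverse document frequency: rarer words get higher weights.
    idf = {w: math.log((1 + n_docs) / (1 + df)) + 1 for w, df in doc_freq.items()}
    vectors = []
    for doc in docs:
        vec = {w: tf * idf[w] for w, tf in doc.items()}
        norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
        vectors.append({w: v / norm for w, v in vec.items()})  # unit length
    index = {}
    for i, title in enumerate(titles):
        scores = []
        for j, other in enumerate(titles):
            if i == j:
                continue
            # Dot product of unit vectors equals cosine similarity.
            sim = sum(wt * vectors[j].get(w, 0.0) for w, wt in vectors[i].items())
            scores.append((other, sim))
        index[title] = sorted(scores, key=lambda pair: pair[1], reverse=True)
    return index


def recommend(index, clicked_title, n=2):
    """Process B sketch: return up to n related titles with non-zero similarity."""
    return [title for title, sim in index.get(clicked_title, [])[:n] if sim > 0]


titles = [
    "Canada to get two million Pfizer vaccine doses",
    "80% of Canadians support COVID-19 vaccine passports for travel: poll",
    "Asus ZenFone 8 delivers Snapdragon 888 from $599",
]
index = build_index(titles)  # would be rebuilt whenever the Trends table changes
print(recommend(index, titles[0]))
```

Because the index is computed ahead of time, the work done when a user clicks is a dictionary lookup, which suits the real-time requirement. Titles that share only common words would still score above zero here, so a fuller version would likely need stop-word removal or the article text.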

## Project Timeline

This project will have three broad phases: ideation and brainstorming, development of the recommendation system, and final submission of the data product and report. These phases are broken down and explained further in the figure below.

```{figure} images/Timeline.png
---
height: 400px
name: Timeline
---
Capstone timeline
```

## Bibliography

```{bibliography} references.bib
```
