*N.B. The following EDA is conducted on the pre-made train/test split (as found on the website). For a new split (70-15-15), it must be re-run with the corresponding files.*

In [2]:
# disable annoying warnings
import warnings
warnings.filterwarnings('ignore')
#imports
import numpy as np
import pandas as pd
import altair as alt
import seaborn as sns
import matplotlib.pyplot as plt
import textstat
import json
from IPython.display import Image

alt.data_transformers.enable('json')

DataTransformerRegistry.enable('json')


Hot takes from this EDA on [MIND-small](https://msnews.github.io/) dataset:
- **<u>News</u>**:           <font color='red'>training:  51282 | test: 42416  unique news </font> | 17 Categories
- News and Sport most common categories (ca 15k each), 3rd Finance with 3k | least common: middleeast (2), northamerica (1) , kids(16)
- News-politics are ca 5.5% of the training data (2826 of 51282 news)
- 5.2% of Abstracts are missing (2666 of 51282): ca 1/2 from "sport" (ca 1300), ca 1/4 from "news" (ca 600)
- Average words per Title: 10.75 (vs 11.52 OG Paper) | Abstract: 34.29 (vs 43)
- **<u>Flesch Reading Ease (FRE)</u>**: Generally: Higher (>=80) = more readable than average | Lower (<60) = less readable | 60-70: plain English
- Average FRE per Title: 67.38 | Abstract: 64.68 (both average, title slightly easier-->probably due to smaller length)
- ABSTRACT categories: Higher (i.e. most readable) by FRE: food, autos, entertainment | Lower (i.e. least readable) by FRE: northamerica, middleeast, news
- TITLE categories: Higher: food, kids, entertainment, lifestyle, health | Lower: northamerica, news (already ca 62)
- **<u>Reading Time</u>** (RT): Abstract mean 2.67 sec | Title mean 0.83 sec
- RT x Abstract category > 2.67: weather, news, travel (few) | < 2.67 (all the rest): sports (2.57, just below the mean), entertainment, kids, middleeast, health, autos
- RT x Title category > 0.83: northamerica, tv, movies, news, music (few) | < 0.83: kids, middleeast, video, autos ...(most of them)
- Entities: 26k ca unique: 15k in Title (training) | 23k in Abstract (training) || 13k in Title (test) | 20k in Abstract (test)
- Both Entities and Relationship have 100d embeddings
- **<u>Behaviours</u>**: <font color='red'>50k unique users (both train & test)</font>, only ca 6k in both.
- **<u>History</u>** (unique clicked news) train: 33k | test: 37k | Most clicked categories: news (31%), sport (14%), lifestyle (10%) (not following the category distribution 1:1) | Least: middleeast, weather, kids || 0.6% of the clicks are news-politics
- **<u>Impressions</u>**: (recommended news). Total <u>unique news</u> recommended: 27837 (1.3% about news-politics), of which 27.7% (7713) clicked and 72.3% (20124) not clicked. | <u>unique users: 50000</u> | number of <u>impressions (user, news presented)</u>: 5843444, of which 4.04% clicked and 95.96% not clicked. Among the recommended and clicked news, 5% are news-politics.<br>
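
The impression-level click share can be obtained by splitting every `Impressions` string into (news, label) pairs. A minimal sketch, assuming the MIND convention that each token looks like `N123-1` (clicked) or `N123-0` (not clicked); the toy frame and its IDs are hypothetical and only illustrate the mechanics:

```python
import pandas as pd

# Hypothetical toy data in the assumed "<news id>-<label>" format
toy_behaviors = pd.DataFrame({
    "User ID": ["U1", "U2"],
    "Impressions": ["N1-1 N2-0 N3-0", "N2-0 N4-1"],
})

# One row per (user, presented news) pair
pairs = (
    toy_behaviors.assign(token=toy_behaviors["Impressions"].str.split())
    .explode("token")
    .reset_index(drop=True)
)

# Split on the last hyphen only: left part is the news id, right part the click label
split = pairs["token"].str.rsplit("-", n=1, expand=True)
pairs["news_id"] = split[0]
pairs["clicked"] = split[1].astype(int)

click_rate = pairs["clicked"].mean()           # share of clicked impressions
unique_news = pairs["news_id"].nunique()       # unique recommended news
print(f"click rate: {click_rate:.2%} | unique news: {unique_news}")
```

Counting unique news vs. counting pairs gives different percentages, which is why the two click shares in the bullet above differ so much.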


OG PAPER: 2.1M samples for training, 360k for val, 2.3M for test<br>


features to be added via NLP techniques:<br>
[POLITICS](https://aclanthology.org/2022.findings-naacl.101/)<br>
------------------------------------------------------------------------------------------------------------<br>
Possible features to be added via NLP techniques:<br>
------------------------------------------------------------------------------------------------------------<br>
[Linguistic devices used in Newspaper headlines](https://www.researchgate.net/publication/364069851_Linguistic_Devices_Used_in_Newspaper_Headlines).

Assumption: "Paying attention to the headlines of the news, the reader may decide whether to read the entire article.(...)they constitute an indicator of the style and values of the news outlet"<br>
Classification of the reader's perception: Membership Categorization Device (MCD) and Individualization Strategy (IS):<br> **MCD**: descriptor that categorizes people into social groups or categories (identities), e.g. politicians, terrorists, victims. It simplifies complex social dynamics (issues with stereotypes, bias, story framing, etc.).<br> **IS**: descriptor of personal, unique, distinct characteristics, oriented towards individualizing the people.<br>

Possible labels (4): MCD, IS, MCD+IS, None, derived from the entities in title & abstract (rule-based: manually classified if the number of unique entities is low, vs. counting the MCD and IS entities per sample if it is to be automated; @TO-DO: check the number of unique entities)

e.g. of MCD headlines:<br>

1.*Tech Giants* Pledge to Combat Climate Change (group affiliation)<br>
2.*Millennials* Are Changing the Workplace (social cat)<br>
3.*Americans* Demand More Renewable Energy (national identity)<br>
<br>
e.g. of IS headlines:<br>

1.Nobel Laureate *Malala Yousafzai* Advocates for Girls' Education Worldwide (personal attr)<br>
2.CEO John Doe Apologizes for Company's Environmental Misconduct<br>
3.Refugee Turned Mayor Shares His Journey of Hope and Resilience<br>
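
A rough sketch of how the four labels could be assigned automatically from the `Type` field of the MIND entity annotations. The mapping is an assumption made only for illustration: "P" (person) is treated as an IS signal and "O" (organization) as an MCD signal; `label_headline`, `IS_TYPES` and `MCD_TYPES` are hypothetical names, and a real rule set would need manual validation:

```python
import json

# Assumption: which entity types count as individuals vs. groups (to be tuned)
IS_TYPES = {"P"}
MCD_TYPES = {"O"}

def label_headline(entities_json):
    """Return 'MCD', 'IS', 'MCD+IS' or 'None' for a raw entity JSON string."""
    # Missing values (NaN) are treated like an empty entity list
    entities = json.loads(entities_json) if isinstance(entities_json, str) else []
    types = {entity.get("Type") for entity in entities}
    has_is = bool(types & IS_TYPES)
    has_mcd = bool(types & MCD_TYPES)
    if has_is and has_mcd:
        return "MCD+IS"
    if has_mcd:
        return "MCD"
    if has_is:
        return "IS"
    return "None"

print(label_headline('[{"Label": "Elizabeth II", "Type": "P"}]'))
print(label_headline('[]'))
```

On the real data this would be applied with `news_train['Title Entities'].apply(label_headline)`. Group nouns such as "Millennials" are probably not linked as organizations, so this sketch would likely under-detect MCD.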
------------------------------------------------------------------------------------------------------------<br>

[Gattani, Akshay. 2005. Maximum Entropy Discriminative Models for Headline Generation](https://summit.sfu.ca/_flysystem/fedora/sfu_migrate/2546/etd2783.pdf) : Types of headlines (and short summaries) can be categorized into INDICATIVE: headlines which indicate what topics are covered by the news story, INFORMATIVE: headlines which convey what particular concept, theme or event is covered in the news story and EYECATCHERS: headlines which do not inform about the content of the story but are designed to attract attention and entice people to read the story.<br>
Hard to find linguistic tool to classify the headlines into these categories, probably one would rely on category and subcategory.<br>
------------------------------------------------------------------------------------------------------------<br>
[On newspaper headlines as relevance optimizers](https://www.researchgate.net/publication/229005694_On_Newspaper_Headlines_as_Relevance_Optimizers)--->motivates the use of 'reading time' as a feature (especially) for titles too<br>
------------------------------------------------------------------------------------------------------------<br>
[News Sentiment Analysis](https://arxiv.org/pdf/2007.02238.pdf)--> Classical NLP technique. Unsure if this is relevant for this task, but could be interesting to see if the sentiment of the news has an impact on the recommendation<br>
------------------------------------------------------------------------------------------------------------<br>
More sources w.i.p:<br> [Linguistic effects on news headline success](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0281682)<br> [Headlines Matter: Using Headlines to Predict the Popularity of News Articles on Twitter and Facebook](https://aaai.org/papers/00656-14951-headlines-matter-using-headlines-to-predict-the-popularity-of-news-articles-on-twitter-and-facebook/)


In [27]:
news_train = 'data/MINDsmall_train/news.tsv'
behavior_train = 'data/MINDsmall_train/behaviors.tsv'
entity_train = 'data/MINDsmall_train/entity_embedding.vec'
relation_train = 'data/MINDsmall_train/relation_embedding.vec'
#-------------------------------------------
news_test = 'data/MINDsmall_dev/news.tsv'
behavior_test = 'data/MINDsmall_dev/behaviors.tsv'
entity_test = 'data/MINDsmall_dev/entity_embedding.vec'
relation_test = 'data/MINDsmall_dev/relation_embedding.vec' 

def load_df(path):
    if 'news' in path:
        columns = ['News ID',
                "Category",
                "SubCategory",
                "Title",
                "Abstract",
                "URL",
                "Title Entities",
                "Abstract Entities"]
    
    elif 'behavior' in path:
        columns = ['Impression ID',
                "User ID",
                "Time",
                "History",
                "Impressions"]
    else:
        return pd.read_csv(path, sep='\t', header=None)
    
    df = pd.read_csv(path, sep='\t', header=None, names=columns)
    return df

news_train, news_test, behavior_train, behavior_test = map(load_df, [news_train, news_test, behavior_train, behavior_test])
entity_train, relation_train, entity_test, relation_test = map(load_df, [entity_train, relation_train, entity_test, relation_test])
print('MIND-small:')
print(f"{'Dataset':<15} {'Train shape':<20} {'Test shape'}")
print(f"{'-'*50}")
print(f"{'news':<15} {str(news_train.shape):<20} {news_test.shape}")
print(f"{'behavior':<15} {str(behavior_train.shape):<20} {behavior_test.shape}")
print(f"{'entity':<15} {str(entity_train.shape):<20} {entity_test.shape}")
print(f"{'relation':<15} {str(relation_train.shape):<20} {relation_test.shape}")

MIND-small:
Dataset         Train shape          Test shape
--------------------------------------------------
news            (51282, 8)           (42416, 8)
behavior        (156965, 5)          (73152, 5)
entity          (26904, 102)         (22893, 102)
relation        (1091, 102)          (1091, 102)
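
The embedding files show 102 columns although the vectors are 100-dimensional. A minimal sketch of a cleaner loader, assuming the first column is the ID and the last column is an empty artifact of a trailing tab on every line (not verified here); `load_embeddings` and the 3-dimensional toy content are hypothetical:

```python
import io
import pandas as pd

# Hypothetical 3-d toy file; note the trailing tab at the end of each line
toy_vec = "Q1\t0.1\t0.2\t0.3\t\nQ2\t0.4\t0.5\t0.6\t\n"

def load_embeddings(path_or_buffer):
    """Read a .vec file into a frame indexed by ID, dropping all-empty columns."""
    raw = pd.read_csv(path_or_buffer, sep="\t", header=None)
    raw = raw.dropna(axis=1, how="all")   # removes the empty trailing column, if any
    return raw.set_index(0).rename_axis("WikidataId")

emb = load_embeddings(io.StringIO(toy_vec))
print(emb.shape)
```

Applied to `data/MINDsmall_train/entity_embedding.vec`, this should give a frame with 100 value columns if the assumption about the trailing column holds.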


# News

In [None]:
print("news_train: ")
display(news_train.head(3)) 

news_train: 


Unnamed: 0,News ID,Category,SubCategory,Title,Abstract,URL,Title Entities,Abstract Entities
0,N55528,lifestyle,lifestyleroyals,"The Brands Queen Elizabeth, Prince Charles, and Prince Philip Swear By","Shop the notebooks, jackets, and more that the royals can't live without.",https://assets.msn.com/labs/mind/AAGH0ET.html,"[{""Label"": ""Prince Philip, Duke of Edinburgh"", ""Type"": ""P"", ""WikidataId"": ""Q80976"", ""Confidence"": 1.0, ""OccurrenceOffsets"": [48], ""SurfaceForms"": [""Prince Philip""]}, {""Label"": ""Charles, Prince of Wales"", ""Type"": ""P"", ""WikidataId"": ""Q43274"", ""Confidence"": 1.0, ""OccurrenceOffsets"": [28], ""SurfaceForms"": [""Prince Charles""]}, {""Label"": ""Elizabeth II"", ""Type"": ""P"", ""WikidataId"": ""Q9682"", ""Confidence"": 0.97, ""OccurrenceOffsets"": [11], ""SurfaceForms"": [""Queen Elizabeth""]}]",[]
1,N19639,health,weightloss,50 Worst Habits For Belly Fat,These seemingly harmless habits are holding you back and keeping you from shedding that unwanted belly fat for good.,https://assets.msn.com/labs/mind/AAB19MK.html,"[{""Label"": ""Adipose tissue"", ""Type"": ""C"", ""WikidataId"": ""Q193583"", ""Confidence"": 1.0, ""OccurrenceOffsets"": [20], ""SurfaceForms"": [""Belly Fat""]}]","[{""Label"": ""Adipose tissue"", ""Type"": ""C"", ""WikidataId"": ""Q193583"", ""Confidence"": 1.0, ""OccurrenceOffsets"": [97], ""SurfaceForms"": [""belly fat""]}]"
2,N61837,news,newsworld,The Cost of Trump's Aid Freeze in the Trenches of Ukraine's War,"Lt. Ivan Molchanets peeked over a parapet of sand bags at the front line of the war in Ukraine. Next to him was an empty helmet propped up to trick snipers, already perforated with multiple holes.",https://assets.msn.com/labs/mind/AAJgNsz.html,[],"[{""Label"": ""Ukraine"", ""Type"": ""G"", ""WikidataId"": ""Q212"", ""Confidence"": 0.946, ""OccurrenceOffsets"": [87], ""SurfaceForms"": [""Ukraine""]}]"


In [None]:
#unique values
print(pd.DataFrame({'Train': news_train.nunique(), 'Test': news_test.nunique()}))


                   Train   Test
News ID            51282  42416
Category              17     17
SubCategory          264    257
Title              50434  41823
Abstract           47309  39470
URL                51281  42416
Title Entities     34472  28800
Abstract Entities  36277  29889


In [None]:
#return all unique subcategories with the string 'politic' in the name
string_politics = news_train[news_train['SubCategory'].str.contains('politic', case=False)]['SubCategory'].unique()
#print how many data samples are in those subcategories:
print(news_train[news_train['SubCategory'].isin(string_politics)]['SubCategory'].value_counts())


SubCategory
newspolitics         2826
newsworldpolitics       5
Name: count, dtype: int64


In [None]:
#missing values
print(pd.DataFrame({'Train': news_train.isna().sum(), 'Test': news_test.isna().sum()}))

                   Train  Test
News ID                0     0
Category               0     0
SubCategory            0     0
Title                  0     0
Abstract            2666  2021
URL                    0     0
Title Entities         3     2
Abstract Entities      4     2
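
To turn the missing counts into shares, `isna().mean()` can be used instead of `isna().sum()`. A small sketch on a hypothetical toy frame, followed by the same arithmetic with the train counts printed above:

```python
import pandas as pd

# Hypothetical toy frame: 2 of 4 abstracts are missing
toy = pd.DataFrame({
    "Title": ["a", "b", "c", "d"],
    "Abstract": ["x", None, "y", None],
})
missing_share = toy.isna().mean() * 100   # percentage of missing values per column
print(missing_share)

# Same arithmetic with the train counts from the output above
abstract_missing_pct = 2666 / 51282 * 100
print(round(abstract_missing_pct, 1))
```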


In [None]:
#display category of missing abstract 
missing = news_train[news_train['Abstract'].isna()]
missing['Category'].value_counts()
plot = alt.Chart(missing).mark_bar().encode(
    x='Category',
    y='count()',
    color='Category'
).properties(
    title='Missing values in Abstract'
)
plot

In [None]:
news_train['Category'] = news_train['Category'].astype(str)
news_train['SubCategory'] = news_train['SubCategory'].astype(str)

click = alt.selection_multi(fields=['Category'])

# categories
category_chart = alt.Chart(news_train).mark_bar().encode(
    x=alt.X('Category:N', sort='-y'),
    y=alt.Y('count():Q'),
    color='Category:N',
    tooltip=['Category:N', 'count()'],
    opacity=alt.condition(click, alt.value(1), alt.value(0.2)) 
).add_selection(
    click
).properties(
    width=600,
    height=300,
    title='Category distribution'
)

# subcategories
subcategory_chart = alt.Chart(news_train).transform_filter(
    click 
).mark_bar().encode(
    x=alt.X('count():Q'),
    y=alt.Y('SubCategory:N', sort='-x'),
    color='Category:N',
    tooltip=['SubCategory:N', 'count()']
).properties(
    width=600,
    height=300,
    title='Subcategory distribution'
)

#concatenate charts
alt.vconcat(category_chart, subcategory_chart).configure_concat(spacing=30)


Screenshots of dynamic plots:<br>
![category distribution](https://github.com/BianchiGiulia/practical_work/blob/main/imgs/cat_distr.png?raw=true)
![subcategory distribution](https://github.com/BianchiGiulia/practical_work/blob/main/imgs/subcat_di.png)


## <p style="text-align: center;"> Titles & Abstracts</p>

In [None]:
#descriptive stats for Title & Abstract(->latter has more missing values)
news_train['Abstract'] = news_train['Abstract'].fillna('')
abstract_len = news_train['Abstract'].apply(lambda x: len(x.split()) if isinstance(x, str) else 0)
title_len = news_train['Title'].apply(lambda x: len(x.split()) if isinstance(x, str) else 0)
print(pd.DataFrame({'Title': title_len.describe(), 'Abstract': abstract_len.describe()}))


              Title      Abstract
count  51282.000000  51282.000000
mean      10.754417     34.293319
std        3.265311     26.542819
min        1.000000      0.000000
25%        9.000000     15.000000
50%       10.000000     24.000000
75%       13.000000     62.000000
max       57.000000    474.000000


In [None]:
#add len column
news_train['AbstractLength'] = news_train['Abstract'].apply(lambda x: len(x.split()))
news_train['TitleLength'] = news_train['Title'].apply(lambda x: len(x.split()))
#drop duplicates
unique_abstract_lengths = news_train.drop_duplicates(subset=['Abstract', 'Category'])
unique_title_lengths = news_train.drop_duplicates(subset=['Title', 'Category'])
selection = alt.selection_multi(fields=['Category'], bind='legend')
title_chart = alt.Chart(unique_title_lengths).mark_bar().encode(
    x=alt.X('TitleLength:Q', title='Title Length'),
    y=alt.Y('count()', title='Number of Titles'),
    color='Category:N',
    opacity=alt.condition(selection, alt.value(1), alt.value(0.2)),
    tooltip=['Category:N', 'count()', alt.Tooltip('TitleLength:Q', title='Title Length')]
).add_selection(
    selection
).properties(
    title='Histogram of Unique Title Lengths by Category',
    width=400,
    height=400
)
abstract_chart = alt.Chart(unique_abstract_lengths).mark_bar().encode(
    x=alt.X('AbstractLength:Q', title='Abstract Length'),
    y=alt.Y('count()', title='Number of Abstracts'),
    color='Category:N',
    opacity=alt.condition(selection, alt.value(1), alt.value(0.2)),
    tooltip=['Category:N', 'count()', alt.Tooltip('AbstractLength:Q', title='Abstract Length')]
).add_selection(
    selection
).properties(
    title='Histogram of Unique Abstract Lengths by Category',
    width=400,
    height=400
)
combined_chart = alt.hconcat(title_chart, abstract_chart).properties(
    title='Histograms of Unique Title and Abstract Lengths by Category'
)
combined_chart.display()


![.](https://github.com/BianchiGiulia/practical_work/blob/main/imgs/title_abst.png)

[textstat library](https://pypi.org/project/textstat/)<br><br>
readability: <br>
**[The Flesch Reading Ease (FRE)](https://en.wikipedia.org/wiki/Flesch%E2%80%93Kincaid_readability_tests#Flesch_reading_ease)**: higher scores indicate material that is easier to read; lower numbers mark passages that are more difficult to read.<br>
<img src="read_scores.png" alt="Drawing" style="width: 550px;"/><br> A sentence is a group of characters ending with . ! or ?<br>
complexity:<br> find one <br>
aggregate stat:<br> **[Readability Consensus](https://pypi.org/project/textstat/)**: average of different statistical methods to measure readability and complexity
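
For reference, the standard FRE formula is 206.835 − 1.015 × (words / sentences) − 84.6 × (syllables / words). A minimal sketch that computes it from raw counts; the counts in the example are hand-counted, and textstat's own syllable counter may disagree on other texts:

```python
def flesch_reading_ease(n_words, n_sentences, n_syllables):
    """Standard Flesch Reading Ease formula computed from raw counts."""
    words_per_sentence = n_words / n_sentences
    syllables_per_word = n_syllables / n_words
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word

# "The Kings win! The Kings win!": 6 words, 2 sentences, 6 syllables (hand-counted)
score = flesch_reading_ease(n_words=6, n_sentences=2, n_syllables=6)
print(round(score, 2))  # 119.19
```

The formula also shows why very short texts made of long words get extreme negative scores: with one sentence of a few words, the syllables-per-word term dominates.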

In [None]:
print("-"*20,"ABSTRACT","-"*20)
news_train['Title_Flesch_Reading_Ease'] = news_train['Title'].apply(textstat.flesch_reading_ease) #-----> does it actually make sense?
print(f"Before processing: {news_train.shape}")
#delete missing ABSTRACT (already filled with '' above, so the length filter below is what actually removes them)
news_train = news_train[news_train['Abstract'].notna()]
#keep only abstracts with more than 3 words (S-V-O)->prob not valid for titles
news_train = news_train[news_train['Abstract'].apply(lambda x: len(x.split()) > 3)].copy()
print(f"After processing: {news_train.shape}")
news_train['Abstract_Flesch_Reading_Ease'] = news_train['Abstract'].apply(textstat.flesch_reading_ease) #has range (-inf; 121]
print(f'\n F.R.E. stat for Abstracts:\n' , news_train['Abstract_Flesch_Reading_Ease'].describe(), "\n")
#print abstracts with the highest and lowest scores (3 each)
sorted_by_ease = news_train.sort_values(by='Abstract_Flesch_Reading_Ease')
print("3 easiest Abstracts:")
for index, row in sorted_by_ease.tail(3).iterrows():
    print(f"Abstract: {row['Abstract']}, Score: {row['Abstract_Flesch_Reading_Ease']}")
    
print("3 most difficult Abstracts:")
for index, row in sorted_by_ease.head(3).iterrows():
    print(f"Abstract: {row['Abstract']}, Score: {row['Abstract_Flesch_Reading_Ease']}")


-------------------- ABSTRACT --------------------
Before processing: (51282, 11)
After processing: (48270, 11)

 F.R.E. stat for Abstracts:
 count    48270.000000
mean        64.688534
std         18.512011
min       -118.710000
25%         53.550000
50%         65.730000
75%         76.620000
max        119.190000
Name: Abstract_Flesch_Reading_Ease, dtype: float64 

3 easiest Abstracts:
Abstract: They paid for Snoop. They got Snoop., Score: 118.68
Abstract: These guys were here. They did stuff. Now they're gone., Score: 118.89
Abstract: The Kings win! The Kings win!, Score: 119.19
3 most difficult Abstracts:
Abstract: Giannis Antetokounmpo looked exasperated., Score: -118.71
Abstract: Wednesday afternoon offseason rumormongering, Score: -76.41
Abstract: Professional consequences of legalizing marijuana, Score: -68.97


In [None]:
#average FRE per category
average_flesch_reading_ease_per_category = news_train.groupby('Category')['Abstract_Flesch_Reading_Ease'].mean()
average_fre = average_flesch_reading_ease_per_category.mean()  # unweighted mean of the category means, not the per-article mean
average_flesch_reading_ease_per_category = average_flesch_reading_ease_per_category.sort_values(ascending=False)
print(f'Average FRE overall: {average_fre}\nAverage FRE per',average_flesch_reading_ease_per_category)    
#------f.r.e. of abstract per category----------------
selection = alt.selection_multi(fields=['Category'], bind='legend')
chart = alt.Chart(news_train).mark_bar().encode(
    x=alt.X('Abstract_Flesch_Reading_Ease', bin=alt.Bin(maxbins=30), title='Flesch Reading Ease: 0 or less (difficult) - 100 (easy)'),
    y=alt.Y('count()', title='Number of Articles'),
    color=alt.condition(selection,
                         'Category:N',
                         alt.value('lightgray')),
    tooltip=['Category', 'count()']
).add_selection(
    selection
).properties(
    width=600,
    height=400,
    title='ABSTRACT Distribution of Flesch Reading Ease by Category'
)

chart.display()


Average FRE overall: 62.91739157086721
Average FRE per Category
foodanddrink     72.507077
autos            71.380315
entertainment    71.144928
lifestyle        70.436165
sports           69.129351
tv               68.984190
music            68.602069
health           65.664923
movies           65.610414
travel           65.349278
weather          64.217632
finance          61.641094
video            60.646022
kids             60.115000
news             58.372198
middleeast       47.625000
northamerica     28.170000
Name: Abstract_Flesch_Reading_Ease, dtype: float64
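
The "Average FRE overall" printed here (62.92) differs from the mean in the `describe()` output (64.69) because it averages the category means, so tiny categories such as northamerica weigh as much as news. A small sketch of the difference on hypothetical toy scores:

```python
import pandas as pd

# Hypothetical scores: one large category and one single-article category
toy = pd.DataFrame({
    "Category": ["news", "news", "news", "kids"],
    "score": [60.0, 62.0, 64.0, 30.0],
})

per_category = toy.groupby("Category")["score"].mean()
mean_of_category_means = per_category.mean()   # every category weighs the same
per_article_mean = toy["score"].mean()         # every article weighs the same

print(mean_of_category_means, per_article_mean)
```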


![FRE abstract distribution](https://github.com/BianchiGiulia/practical_work/blob/main/imgs/FREabst.png)

In [None]:
#Flesch Reading Ease for Title----------------
print("-"*20,"TITLE","-"*20)
print('F.R.E. stat for Title:', news_train['Title_Flesch_Reading_Ease'].describe(), "\n")
sorted_by_ease = news_train.sort_values(by='Title_Flesch_Reading_Ease')
print("3 easiest Titles:")
for index, row in sorted_by_ease.tail(3).iterrows():
    print(f"Title: {row['Title']}, Score: {row['Title_Flesch_Reading_Ease']}")
    
print("\n3 most difficult Titles:")
for index, row in sorted_by_ease.head(3).iterrows():
    print(f"Title: {row['Title']}, Score: {row['Title_Flesch_Reading_Ease']}")


-------------------- TITLE --------------------
F.R.E. stat for Title: count    48270.000000
mean        67.385014
std         22.482539
min       -555.590000
25%         53.880000
50%         68.770000
75%         83.660000
max        120.210000
Name: Title_Flesch_Reading_Ease, dtype: float64 

3 easiest Titles:
Title: ND faces Howard, Score: 119.19
Title: USF faces IUPUI, Score: 119.19
Title: S.N.O.T.: 11-9-2019, Score: 120.21

3 most difficult Titles:
Title: southern_california_erupts_in_fire, Score: -555.59
Title: Northwestern-Purdue Gamethread, Score: -91.3
Title: Senator considers hearings a constitutional responsibility, Score: -78.44


In [None]:
#average FRE per category
average_fle = news_train.groupby('Category')['Title_Flesch_Reading_Ease'].mean()
average_fre = average_fle.mean()  # unweighted mean of the category means, not the per-article mean
average_fle = average_fle.sort_values(ascending=False)
print(f'Average FRE overall: {average_fre}\nAverage FRE per',average_fle) 

    
chart = alt.Chart(news_train).mark_bar().encode(
    x=alt.X('Title_Flesch_Reading_Ease', bin=alt.Bin(maxbins=50), title='Flesch Reading Ease: 0 or less (difficult) - 100 (easy)'),
    y=alt.Y('count()', title='Number of Articles'),
    color=alt.condition(selection,
                        'Category:N',
                        alt.value('lightgray')),
    tooltip=['Category', 'count()']
).add_selection(
    selection
).properties(
    width=600,
    height=400,
    title='TITLE Distribution of Flesch Reading Ease by Category'
)
chart.display()

Average FRE overall: 67.12596831707961
Average FRE per Category
foodanddrink     73.999372
kids             73.540625
entertainment    73.235254
lifestyle        72.761235
health           71.023775
autos            70.876550
music            70.717236
sports           70.364794
tv               68.870208
movies           68.069906
finance          68.052322
travel           67.418606
weather          66.200699
video            65.254908
middleeast       62.850000
news             61.955972
northamerica     35.950000
Name: Title_Flesch_Reading_Ease, dtype: float64


![FRE title distribution](https://github.com/BianchiGiulia/practical_work/blob/main/imgs/fretitle.png)

In [None]:
#Readability Consensus based upon different tests
news_train['Abstract_Readability_Consensus'] = news_train['Abstract'].apply(textstat.text_standard, float_output=False)
news_train['Abstract_Readability_Consensus'].describe()
#"""chart3 = alt.Chart(news_train).mark_bar().encode(
#    x=alt.X('Abstract_Readability_Consensus:O', title='Estimated School Grade Level'),
#    y=alt.Y('count()', title='Number of Articles'),
#    color=alt.condition(selection, 'Category:N', alt.value('lightgray')),
#    tooltip=['Abstract_Readability_Consensus', 'Category', 'count()']
#).add_selection(
#    selection
#).properties(
#    width=600,
#    height=400,
#    title='Distribution of Estimated School Grade Level for Abstracts by Category'
#)
#chart3.display()"""

count                 48270
unique                   38
top       8th and 9th grade
freq                   5702
Name: Abstract_Readability_Consensus, dtype: object

<img src="abstr_read_consensus.png" alt="Drawing" style="width: 600px;"/>

In [None]:
#READING TIME
print("-"*20,"ABSTRACT","-"*20)
news_train['Abstract_Reading_Time'] = news_train['Abstract'].apply(lambda x: textstat.reading_time(x, ms_per_char=14.69)) 
print(news_train['Abstract_Reading_Time'].describe())
alt.data_transformers.enable('json')

"""#plot it
selection = alt.selection_multi(fields=['Category'], bind='legend')
chart = alt.Chart(news_train).mark_bar().encode(
    x=alt.X('Abstract_Reading_Time', bin=alt.Bin(maxbins=60), title='Reading Time'),
    y=alt.Y('count()', title='Number of Articles'),
    color=alt.condition(selection,
                         'Category:N',
                         alt.value('lightgray')),  # Use a neutral color if not selected
    tooltip=['Category', 'count()']
).add_selection(
    selection
).properties(
    width=600,
    height=400,
    title='Distribution of ABSTRACT Reading Times by Category'
)

chart.display()"""
avg_rt_per_category = news_train.groupby('Category')['Abstract_Reading_Time'].mean()
avg_rt = avg_rt_per_category.mean()  # unweighted mean of the category means, not the per-article mean
avg_rt_per_category = avg_rt_per_category.sort_values(ascending=False)
print(f'Average Reading Time overall: {avg_rt}\nAverage Reading Time per',avg_rt_per_category) 


-------------------- ABSTRACT --------------------
count    48270.000000
mean         2.673108
std          1.883621
min          0.150000
25%          1.250000
50%          1.850000
75%          4.790000
max         31.230000
Name: Abstract_Reading_Time, dtype: float64
Average Reading Time overall: 2.3244604172897887
Average Reading Time per Category
weather          3.650261
news             3.154349
travel           2.934219
sports           2.567176
music            2.381108
video            2.341231
lifestyle        2.282543
movies           2.280771
northamerica     2.260000
finance          2.199595
foodanddrink     2.077264
tv               2.042211
autos            2.018181
health           1.911275
middleeast       1.815000
kids             1.806875
entertainment    1.793768
Name: Abstract_Reading_Time, dtype: float64


In [None]:
#READING TIME
print("-"*20,"TITLE","-"*20)
news_train['Title_Reading_Time'] = news_train['Title'].apply(lambda x: textstat.reading_time(x, ms_per_char=14.69)) 
print(news_train['Title_Reading_Time'].describe())
alt.data_transformers.enable('json')
avg_rt_per_category = news_train.groupby('Category')['Title_Reading_Time'].mean()
avg_rt = avg_rt_per_category.mean()  # unweighted mean of the category means, not the per-article mean
avg_rt_per_category = avg_rt_per_category.sort_values(ascending=False)
print(f'Average Reading Time overall: {avg_rt}\nAverage Reading Time per',avg_rt_per_category) 

-------------------- TITLE --------------------
count    48270.000000
mean         0.832128
std          0.238715
min          0.150000
25%          0.680000
50%          0.810000
75%          0.970000
max          3.060000
Name: Title_Reading_Time, dtype: float64
Average Reading Time overall: 0.8085659335078406
Average Reading Time per Category
northamerica     0.940000
tv               0.909028
movies           0.885188
news             0.874694
music            0.858838
sports           0.827226
travel           0.824278
health           0.820602
lifestyle        0.808100
weather          0.804891
finance          0.803050
entertainment    0.792754
foodanddrink     0.768743
autos            0.764678
video            0.744176
middleeast       0.715000
kids             0.604375
Name: Title_Reading_Time, dtype: float64


## <p style="text-align: center;"> Entity & Relationship</p>

**Entities (Title & Abstract):<br>**
1. Label:  The entity name in the Wikidata knowledge graph 

2. Type: general category e.g "P" = "Person", "L" = Location, "O" = Organization, "D/T" Date-time, "Pr"= Product, "E"=events etc. ...

3. WikidataId [**also 1st column of embedding file**]: unique identifier for the entity in [Wikidata](https://www.wikidata.org/wiki/Wikidata:Main_Page), e.g. "Q80976" is the Wikidata ID for Prince Philip.

4. Confidence: range[0,1], confidence of the entity linking.

5. OccurrenceOffsets: character positions in the text.

6. SurfaceForms: The raw entity names in the original text
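
A compact sketch of how such an entity column could be flattened into one row per entity, using `explode` instead of a Python loop. The helper `flatten_entities` and the toy rows are hypothetical; only the field names follow the description above:

```python
import json
import pandas as pd

# Hypothetical toy column: two entities, an empty list and a missing value
toy_entities = pd.Series([
    '[{"Label": "Elizabeth II", "Type": "P", "WikidataId": "Q9682", "Confidence": 0.97},'
    ' {"Label": "Ukraine", "Type": "G", "WikidataId": "Q212", "Confidence": 0.946}]',
    '[]',
    None,
])

def flatten_entities(series):
    """Return one row per entity, keeping the source row as 'news_index'."""
    parsed = series.apply(lambda x: json.loads(x) if isinstance(x, str) else [])
    exploded = parsed.explode().dropna()          # empty lists become NaN and are dropped
    flat = pd.DataFrame(exploded.tolist(), index=exploded.index)
    return flat.rename_axis("news_index").reset_index()

flat = flatten_entities(toy_entities)
print(flat[["news_index", "Label", "Type", "Confidence"]])
```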

In [None]:
#TITLE - Train set
entity_df = news_train['Title Entities'].apply(lambda x: json.loads(x) if type(x) == str else []) #pd series
normalized_df = pd.DataFrame()
for index, entities in entity_df.items(): 
    if entities:  
        temp_df = pd.DataFrame(entities)
        temp_df['news_index'] = index
        normalized_df = pd.concat([normalized_df, temp_df], ignore_index=True)
normalized_df

Unnamed: 0,Label,Type,WikidataId,Confidence,OccurrenceOffsets,SurfaceForms,news_index
0,"Prince Philip, Duke of Edinburgh",P,Q80976,1.000,[48],[Prince Philip],0
1,"Charles, Prince of Wales",P,Q43274,1.000,[28],[Prince Charles],0
2,Elizabeth II,P,Q9682,0.970,[11],[Queen Elizabeth],0
3,Adipose tissue,C,Q193583,1.000,[20],[Belly Fat],1
4,Skin tag,C,Q3179593,1.000,[18],[Skin Tags],4
...,...,...,...,...,...,...,...
59943,Woolsey Fire,N,Q58445227,1.000,[53],[Woolsey Fire],51277
59944,Broadway theatre,F,Q235065,0.997,[24],[Broadway],51278
59945,MLS Cup,U,Q577698,0.963,[21],[MLS Cup],51280
59946,Seattle Sounders FC,O,Q632511,1.000,[8],[Sounders],51280


In [None]:
temp_df = normalized_df.applymap(lambda x: str(x) if isinstance(x, list) else x)
print(temp_df.nunique())
print("\n Distribution of:",temp_df['Type'].value_counts())
print("\n Mean Confidence",temp_df['Confidence'].mean())

Label                14991
Type                    22
WikidataId           14988
Confidence             100
OccurrenceOffsets      456
SurfaceForms         17966
news_index           37437
dtype: int64

 Distribution of: Type
O    15794
P    15261
G    12622
U     3762
C     3228
N     1733
F     1666
W      928
S      870
E      720
M      619
H      538
B      523
L      510
V      484
J      285
K      161
Y      111
R       79
Q       18
I       18
A       18
Name: count, dtype: int64

 Mean Confidence 0.9929094715420032


In [None]:
#TITLE - Test set
entity_df = news_test['Title Entities'].apply(lambda x: json.loads(x) if type(x) == str else []) #pd series
normalized_df = pd.DataFrame()
for index, entities in entity_df.items(): 
    if entities:  
        temp_df = pd.DataFrame(entities)
        temp_df['news_index'] = index
        normalized_df = pd.concat([normalized_df, temp_df], ignore_index=True)
normalized_df

Unnamed: 0,Label,Type,WikidataId,Confidence,OccurrenceOffsets,SurfaceForms,news_index
0,"Prince Philip, Duke of Edinburgh",P,Q80976,1.000,[48],[Prince Philip],0
1,"Charles, Prince of Wales",P,Q43274,1.000,[28],[Prince Charles],0
2,Elizabeth II,P,Q9682,0.970,[11],[Queen Elizabeth],0
3,Drug Enforcement Administration,O,Q622899,0.992,[50],[DEA],1
4,Skin tag,C,Q3179593,1.000,[18],[Skin Tags],4
...,...,...,...,...,...,...,...
49905,"Meghan, Duchess of Sussex",N,Q3304418,1.000,[11],[Meghan],42411
49906,"Catherine, Duchess of Cambridge",P,Q10479,0.994,[4],[Kate],42411
49907,Remembrance Sunday,H,Q1770490,1.000,[50],[Remembrance Sunday],42411
49908,Tennessee,G,Q1509,0.994,[0],[Tennessee],42413


In [None]:
temp_df = normalized_df.applymap(lambda x: str(x) if isinstance(x, list) else x)
print(temp_df.nunique())
print("\n Distribution of:",temp_df['Type'].value_counts())
print("\n Mean Confidence",temp_df['Confidence'].mean())

Label                12966
Type                    22
WikidataId           12963
Confidence             100
OccurrenceOffsets      419
SurfaceForms         15561
news_index           31128
dtype: int64

 Distribution of: Type
P    13347
O    13091
G    10008
U     3082
C     2759
N     1386
F     1354
W      813
S      725
E      637
M      492
V      455
B      407
L      400
H      388
J      236
K      121
Y       95
R       69
Q       18
A       16
I       11
Name: count, dtype: int64

 Mean Confidence 0.993104588258866


In [None]:
#ABSTRACT - Train set
#parse each JSON entity list; missing values become empty lists
entity_df = news_train['Abstract Entities'].apply(lambda x: json.loads(x) if isinstance(x, str) else []) #pd series
#one frame per article, tagged with the article's row index
frames = [pd.DataFrame(entities).assign(news_index=index) for index, entities in entity_df.items() if entities]
normalized_df = pd.concat(frames, ignore_index=True) #single concat instead of growing a frame in the loop
normalized_df

Unnamed: 0,Label,Type,WikidataId,Confidence,OccurrenceOffsets,SurfaceForms,news_index
0,Adipose tissue,C,Q193583,1.000,[97],[belly fat],1
1,Ukraine,G,Q212,0.946,[87],[Ukraine],2
2,National Basketball Association,O,Q155223,1.000,[40],[NBA],3
3,Skin tag,C,Q3179593,1.000,[105],[Skin Tags],4
4,Dermatology,C,Q171171,1.000,[131],[Dermatologist],4
...,...,...,...,...,...,...,...
95304,United States women's national soccer team,O,Q334526,1.000,"[9, 258]","[U.S. women's national soccer team, U.S. women...",51276
95305,TIAA Bank Field,N,Q635117,1.000,[135],[TIAA Bank Field],51276
95306,"Jacksonville, Florida",G,Q16568,1.000,[54],[Jacksonville],51276
95307,Costa Rica,G,Q800,0.991,"[159, 341]","[Costa Rica, Costa Rica]",51276


In [None]:
temp_df = normalized_df.applymap(lambda x: str(x) if isinstance(x, list) else x)
print(temp_df.nunique())
print("\n Distribution of:",temp_df['Type'].value_counts())
print("\n Mean Confidence",temp_df['Confidence'].mean())

Label                23835
Type                    22
WikidataId           23828
Confidence             100
OccurrenceOffsets    11172
SurfaceForms         33970
news_index           37453
dtype: int64

 Distribution of: Type
G    24562
O    22900
P    19546
U     6249
C     3385
M     3264
F     3188
N     2871
S     2352
W     1403
E     1035
L     1028
B      957
H      626
K      624
V      519
J      369
Y      219
R      113
Q       42
A       34
I       23
Name: count, dtype: int64

 Mean Confidence 0.9931588412426947


In [None]:
#ABSTRACT - Test set
#parse each JSON entity list; missing values become empty lists
entity_df = news_test['Abstract Entities'].apply(lambda x: json.loads(x) if isinstance(x, str) else []) #pd series
#one frame per article, tagged with the article's row index
frames = [pd.DataFrame(entities).assign(news_index=index) for index, entities in entity_df.items() if entities]
normalized_df = pd.concat(frames, ignore_index=True) #single concat instead of growing a frame in the loop
normalized_df

Unnamed: 0,Label,Type,WikidataId,Confidence,OccurrenceOffsets,SurfaceForms,news_index
0,Ukraine,G,Q212,0.946,[87],[Ukraine],2
1,National Basketball Association,O,Q155223,1.000,[40],[NBA],3
2,Skin tag,C,Q3179593,1.000,[105],[Skin Tags],4
3,Dermatology,C,Q171171,1.000,[131],[Dermatologist],4
4,Reader's Digest,M,Q371820,0.999,[163],[Reader's Digest],4
...,...,...,...,...,...,...,...
77054,Kate Hudson,P,Q169946,1.000,[30],[Kate Hudson],42412
77055,Chrissy Teigen,P,Q5111202,1.000,[11],[Chrissy Teigen],42412
77056,Tennessee Court of Appeals,O,Q7700055,1.000,[0],[Tennessee Court of Appeals],42413
77057,Belmont University,O,Q3298359,1.000,[355],[Belmont University College of Law],42413


In [None]:
temp_df = normalized_df.applymap(lambda x: str(x) if isinstance(x, list) else x)
print(temp_df.nunique())
print("\n Distribution of:",temp_df['Type'].value_counts())
print("\n Mean Confidence",temp_df['Confidence'].mean())

Label                20028
Type                    22
WikidataId           20022
Confidence             100
OccurrenceOffsets     8926
SurfaceForms         28308
news_index           30794
dtype: int64

 Distribution of: Type
G    19529
O    18670
P    16630
U     4850
C     2716
M     2668
F     2368
N     2252
S     1807
W     1142
E      860
L      831
B      771
K      455
V      454
H      388
J      303
Y      180
R      107
A       31
Q       29
I       18
Name: count, dtype: int64

 Mean Confidence 0.9932659909939138
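
The title/abstract and train/test cells above repeat the same parse, flatten, and summarise steps. A minimal sketch of how they could be folded into reusable helpers follows; `normalize_entities` and `summarize_entities` are hypothetical names, not part of the original notebook, and the toy input only mimics the shape of the `Title Entities` / `Abstract Entities` columns.

```python
import json
import pandas as pd

def normalize_entities(series: pd.Series) -> pd.DataFrame:
    """Flatten a column of JSON entity lists into one row per entity."""
    frames = []
    for index, raw in series.items():
        # Missing values (NaN) and empty lists contribute no rows
        entities = json.loads(raw) if isinstance(raw, str) else []
        if entities:
            frames.append(pd.DataFrame(entities).assign(news_index=index))
    # Concatenate once at the end; return an empty frame if nothing was parsed
    return pd.concat(frames, ignore_index=True) if frames else pd.DataFrame()

def summarize_entities(df: pd.DataFrame) -> dict:
    """Collect the figures printed in the summary cells."""
    return {
        "unique_labels": df["Label"].nunique(),
        "type_counts": df["Type"].value_counts(),
        "mean_confidence": df["Confidence"].mean(),
    }

# Toy input: one article with one entity, one missing value, one empty list
toy = pd.Series([
    '[{"Label": "Ukraine", "Type": "G", "WikidataId": "Q212", '
    '"Confidence": 0.946, "OccurrenceOffsets": [87], "SurfaceForms": ["Ukraine"]}]',
    float("nan"),
    "[]",
])
toy_entities = normalize_entities(toy)
toy_summary = summarize_entities(toy_entities)
```

With helpers like these, each of the four cases would reduce to one call, e.g. `summarize_entities(normalize_entities(news_test['Abstract Entities']))`.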


In [None]:
print("-"*200, "\n entity_train embedding shape: ", entity_train.shape)
display(entity_train.head())
print("-"*200, "\n relation_train embedding shape: ", relation_train.shape)
display(relation_train.head()) #not fetchable

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 
 entity_train embedding shape:  (26904, 102)


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101
0,Q41,-0.063388,-0.181451,0.057501,-0.091254,-0.076217,-0.052525,0.0505,-0.224871,-0.018145,0.030722,0.064276,0.073063,0.039489,0.159404,-0.128784,0.016325,0.026797,0.13709,0.001849,-0.059103,0.012091,0.045418,0.000591,0.211337,-0.034093,-0.074582,0.014004,-0.099355,0.170144,0.109376,-0.014797,0.071172,0.080375,0.045563,-0.046462,0.070108,0.015413,-0.020874,-0.170324,-0.00113,0.05981,0.054342,0.027358,-0.028995,-0.224508,0.066281,-0.200006,0.018186,0.082396,0.167178,-0.136239,0.055134,-0.080195,-0.00146,0.031078,-0.017084,-0.091176,-0.036916,0.124642,-0.098185,-0.054836,0.152483,-0.053712,0.092816,-0.112044,-0.072247,-0.114896,-0.036541,-0.186339,-0.16061,0.037342,-0.133474,0.11008,0.070678,-0.005586,-0.046667,-0.07201,0.086424,0.026165,0.030561,0.077888,-0.117226,0.211597,0.112512,0.079999,-0.083398,-0.121117,0.071751,-0.017654,-0.134979,-0.051949,0.001861,0.124535,-0.151043,-0.263698,-0.103607,0.020007,-0.101157,-0.091567,0.035234,
1,Q1860,0.060958,0.069934,0.015832,0.079471,-0.023362,-0.125007,-0.043618,0.134063,-0.121691,0.089166,0.129177,0.148145,0.027196,-0.060636,0.06876,0.071959,0.150306,-0.099519,-0.050912,0.123948,-0.190319,-0.096762,-0.006279,-0.08681,-0.026199,0.017013,0.043436,0.058991,-0.131758,0.032473,-0.137706,-0.009527,0.085008,-0.060163,0.044856,0.03002,-0.042486,-0.098337,-0.024715,0.054446,-0.05623,0.161813,-0.106716,-0.052167,0.013636,0.132148,0.044919,0.074031,-0.085483,-0.083199,-0.007451,0.113236,0.098931,-0.079819,-0.02629,0.051472,-0.092252,0.068104,0.016942,0.009106,-0.062264,-0.001102,0.050228,0.016879,-0.026729,-0.051632,-0.08304,-0.14388,0.066569,-0.014793,-0.047219,-0.03439,0.009343,-0.002716,-0.094623,0.000528,-0.055017,-0.013458,-0.038277,-0.067144,0.091749,0.018254,-0.080948,0.06285,0.117076,-0.115282,0.050163,0.091078,-0.166571,0.056171,-0.070713,-0.014287,0.013578,0.099977,0.012199,-0.141138,0.056129,-0.133727,0.025795,0.051448,
2,Q39631,-0.093106,-0.052002,0.020556,-0.020801,0.04318,-0.072321,0.00091,0.028156,0.176303,0.035396,0.072642,0.000239,-0.171645,-0.034816,-0.106319,-0.082187,-0.022322,-0.121248,-0.084962,-0.146949,-0.015364,0.240605,-0.165207,0.033926,-0.055561,0.263102,-0.018281,-0.07163,0.067349,0.021943,-0.066642,0.154693,0.039514,-0.115533,0.157337,-0.018109,0.093555,-0.136766,-0.106228,0.020897,0.030024,-0.109274,-0.120507,0.046796,0.016082,0.063581,0.021472,-0.177214,-0.037778,0.089867,0.014073,0.014801,-0.083897,-0.009868,0.065859,-0.192299,0.013885,0.035729,0.025541,-0.107844,-0.215149,0.090272,0.13167,-0.065807,-0.119546,0.131104,-0.087323,0.118188,0.166771,0.014317,0.117788,-0.069088,0.002963,-0.008588,0.016064,0.007934,-0.115904,-0.066542,0.071987,0.078646,-0.036828,-0.134134,-0.158453,0.077707,-0.028514,-0.155193,-0.047059,0.035694,-0.107131,-0.000372,-0.124472,-0.08684,-0.078992,-0.062712,0.051117,-0.184307,0.127637,-0.144866,0.04469,0.013498,
3,Q30,-0.115737,-0.179113,0.102739,-0.112469,-0.101853,-0.177516,0.01586,-0.092626,0.086708,0.05785,0.176422,0.070668,0.071584,0.030533,-0.179654,-0.032312,0.047596,-0.028751,-0.031293,-0.044283,-0.144224,-0.089542,-0.046,0.215515,0.075296,-0.062332,-0.002456,0.035293,0.10955,0.052809,-0.081734,0.066101,0.148733,-0.073003,0.075038,-0.099213,-0.091732,-0.114809,-0.063178,0.076927,-0.066233,0.130834,-0.081943,-0.017894,-0.084129,-0.098396,-0.076425,0.145224,0.047662,0.061124,-0.147525,-0.035232,0.080132,0.075315,0.066264,0.053224,-0.008282,0.038551,-0.044559,-0.081108,-0.078284,0.11618,0.082531,0.101352,0.054269,-0.193552,-0.144609,-0.109713,-0.026049,-0.020009,-0.121075,-0.218548,0.150953,0.072083,-0.089645,-0.004471,-0.049331,0.189673,0.001631,0.156474,-0.022464,-0.082198,0.069881,0.183586,0.175343,0.005146,-0.028398,0.026972,-0.105001,-0.019177,0.005893,0.080511,-8.5e-05,-0.089968,-0.083486,-0.149992,-0.053031,-0.136071,-0.029001,0.174155,
4,Q60,-0.051036,-0.165637,0.132802,-0.089949,-0.146637,-0.142246,0.103853,-0.129651,0.096265,0.017288,0.096343,0.120867,0.139412,-0.101083,-0.105518,-0.044083,-0.081574,0.00825,-0.064942,-0.139662,-0.079039,0.029418,-0.049928,0.183146,-0.028249,-0.062579,-0.009422,-0.038783,0.099171,0.117744,-0.073817,0.030925,0.072817,-0.074308,-0.058244,0.053969,-0.176053,-0.110216,-0.087142,-0.031469,-0.138299,0.008009,-0.02006,-0.068153,-0.157539,-0.10143,-0.050036,0.041379,-0.044153,0.013049,-0.299345,0.061024,0.156111,0.050081,0.044341,-0.033624,0.023531,-0.030379,0.027055,-0.134954,-0.084445,0.103199,0.057259,0.082226,0.028525,-0.180036,-0.129249,-0.131783,-0.041311,-0.038343,-0.133523,-0.177585,0.153199,0.074922,-0.123952,-0.087973,0.018191,0.186848,0.074991,0.036592,0.086934,0.031789,0.094553,0.132498,0.139359,0.012824,-0.008956,-0.02369,-0.09444,-0.11028,-0.002713,0.078628,0.003711,-0.058953,-0.154067,-0.117159,-0.031614,-0.140451,0.001288,0.14035,


-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- 
 relation_train embedding shape:  (1091, 102)


Unnamed: 0,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101
0,P31,-0.073467,-0.132227,0.034173,-0.032769,0.008289,-0.107088,-0.031712,-0.039581,0.101882,-0.106961,-0.053441,0.068202,-0.045584,-0.140448,-0.079402,0.001022,0.059921,-0.06251,0.102848,0.077947,-0.063644,0.05007,-0.01918,0.064456,-0.052222,0.071078,-0.036413,-0.039235,0.137947,0.067378,-0.137468,0.103482,0.121755,-0.006587,0.063077,-0.024954,-0.0313,-0.056833,-0.139115,-0.05357,0.165815,-0.022143,0.006561,-0.108691,-0.149139,0.080943,0.054542,-0.034564,0.082343,-0.095843,-0.068758,0.01385,-0.025589,-0.012451,0.116367,-0.066981,-0.006472,0.136078,-0.057084,-0.066427,-0.035916,-0.028447,-0.070395,-0.052364,-0.040038,0.037342,-0.073347,0.112529,0.106537,0.107426,0.086297,0.085833,0.054393,0.053187,0.066242,0.058507,-0.04718,-0.086089,0.050148,0.053491,-0.04237,-0.110435,-0.058929,0.063987,-0.037393,-0.057942,-0.032128,0.141226,-0.106979,0.072183,-0.045641,-0.050068,-0.053686,-0.045389,-0.037017,0.11719,-0.063597,-0.05691,0.058387,-0.114056,
1,P21,-0.078436,0.108589,-0.049429,-0.131355,0.0493,-0.094605,-0.101469,0.127802,-0.081245,0.113759,-0.171865,0.049044,0.141462,0.117907,0.040574,-0.057788,-0.146715,-0.085228,0.020211,-0.12101,-0.100422,-0.081288,0.031696,-0.060593,-0.072303,0.139442,-0.133374,-0.120222,0.0504,0.119134,-0.082276,0.050498,-0.108097,0.045905,0.118079,0.069211,-0.049801,-0.106901,0.133158,-0.065444,-0.085254,0.040706,0.007894,0.034556,0.139081,0.025119,0.122081,0.154464,0.099593,-0.0404,0.075233,0.096659,0.032061,-0.154013,0.085069,-0.144027,-0.06937,0.079479,0.090121,-0.154897,-0.12734,-0.031645,-0.09384,0.123652,-0.134066,0.066089,-0.159245,0.069276,0.074938,-0.129573,0.076426,-0.144846,0.147408,0.106457,-0.079138,0.081598,-0.132508,0.102217,0.117162,-0.064613,-0.120491,-0.075478,0.013671,-0.056833,0.086815,-0.111679,0.05102,0.094203,-0.092261,-0.147404,-0.151203,0.074341,-0.030571,-0.137183,0.045598,-0.151155,-0.066223,0.057489,0.130188,-0.054801,
2,P106,-0.052137,0.052444,-0.019886,-0.152309,0.014144,-0.180491,-0.132198,0.063082,0.085229,0.114965,0.023285,0.074741,-0.049949,-0.082051,-0.159896,0.035493,-0.113929,-0.111878,-0.139555,-0.106166,-0.011966,0.154562,-0.096405,0.131268,-0.068482,0.18524,-0.072894,-0.114885,-0.056082,0.112026,0.048216,0.098032,-0.098028,-0.106606,0.078594,-0.102013,-0.001059,-0.145055,3e-06,-0.047816,0.079029,-0.078351,-0.016361,-0.000218,-0.038627,0.057308,0.036923,-0.073602,-0.072402,0.001785,-0.002824,-0.060708,-0.002136,-0.017358,0.059936,-0.133305,-0.034796,-0.075657,0.14732,-0.133039,-0.149887,0.052375,0.024344,0.050036,-0.146324,0.075327,-0.135969,0.031892,0.049475,-0.106037,0.088477,-0.185415,0.10508,0.10744,-0.0282,-0.121917,-0.165206,0.026541,0.125522,0.080844,-0.178644,-0.060746,-0.078724,-0.009305,0.088131,-0.097797,-0.155246,-0.030237,-0.017188,0.070897,-0.088902,-0.058958,-0.032021,-0.147213,0.082776,-0.169705,0.122445,-0.054737,0.055321,0.070961,
3,P735,-0.051398,0.056219,0.068029,-0.137717,-0.03005,0.061566,-0.103184,-0.074124,-0.118975,0.1221,0.090664,0.050602,-0.023321,0.135801,0.082776,0.134691,-0.093377,-0.100187,0.060942,0.058473,0.06526,-0.049564,0.013162,-0.047667,-0.054335,0.123371,-0.145068,0.015066,0.045329,0.131864,0.062462,-0.106206,-0.117788,-0.050399,0.019886,-0.046332,0.08265,0.060583,0.169631,0.108123,-0.030897,0.046386,-0.01442,-0.053038,0.157436,-0.021491,0.087635,-0.051152,0.054433,0.121686,0.037487,0.044515,-0.07968,-0.114405,0.029875,-0.124201,-0.094803,0.017489,0.111024,-0.108676,0.011377,0.143746,-0.180618,-0.052341,-0.118239,-0.081315,-0.111308,0.058716,-0.111563,-0.222551,-0.019004,-0.102315,0.269483,-0.023461,0.046179,0.050954,-0.020268,-0.085623,-0.011426,-0.110763,-0.158052,0.104254,-0.097153,0.060086,-0.05042,-0.121439,-0.112373,-0.028001,0.076174,-0.132399,-0.096461,-0.092234,0.05687,0.01364,0.042696,0.013683,-0.021127,-0.189257,0.055315,0.101863,
4,P108,0.091231,0.022526,0.059349,-0.141853,0.035025,-0.11104,-0.127337,0.047645,-0.172328,0.090933,0.022216,0.079914,0.043736,-0.096588,-0.242773,-0.039824,-0.078472,-0.190807,-0.07551,-0.011143,-0.004291,-0.109142,-0.055437,0.139692,-0.032522,0.124695,-0.054761,-0.046256,-0.115983,0.098595,-0.087121,-0.029367,-0.108338,-0.02172,-0.028068,0.029053,-0.128703,-0.103341,-0.139387,0.134218,0.207785,-0.022484,0.049616,0.144433,-0.102246,-0.064737,0.094036,0.059295,-0.120209,0.079042,0.10534,0.08343,-0.007747,0.033792,-0.025764,-0.043842,0.013634,-0.119388,0.001556,-0.057961,-0.081745,0.092388,-0.053616,0.09569,-0.001889,-0.154143,-0.17789,-0.13144,-0.219218,-0.111252,-0.10479,0.067291,0.130789,0.144343,-0.105937,-0.070367,-0.152791,-0.079241,0.008933,0.107746,0.076368,-0.071356,-0.05611,-0.030554,-0.087335,-0.108328,-0.039597,0.074101,0.094331,-0.08839,0.026855,-0.046994,-0.056248,-0.146538,0.121375,-0.211757,0.077591,-0.0022,-0.05388,0.140873,
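
Both frames have 102 columns and every printed row ends with a trailing comma. This suggests (an assumption, not verified here) that column 0 holds the Wikidata ID, columns 1 to 100 hold the embedding, and column 101 is empty. A minimal sketch for splitting such a frame into IDs and a numeric matrix; `split_embedding_frame` is a hypothetical helper and the toy values are truncated for illustration.

```python
import numpy as np
import pandas as pd

def split_embedding_frame(raw: pd.DataFrame):
    """Split a raw embedding frame into an ID series and a float matrix."""
    # Drop columns that are entirely NaN (e.g. one produced by a trailing separator)
    cleaned = raw.dropna(axis=1, how="all")
    ids = cleaned.iloc[:, 0].astype(str)
    vectors = cleaned.iloc[:, 1:].to_numpy(dtype=np.float32)
    return ids, vectors

# Toy frame: ID column, two embedding dimensions, one empty trailing column
toy_raw = pd.DataFrame([
    ["Q41", -0.06, 0.18, np.nan],
    ["Q30", -0.11, 0.10, np.nan],
])
toy_ids, toy_vectors = split_embedding_frame(toy_raw)
```

If the assumption holds, `split_embedding_frame(entity_train)` would yield a `(26904, 100)` matrix and `split_embedding_frame(relation_train)` a `(1091, 100)` one.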


# Behaviours

In [None]:
behavior_train.head(3)


Unnamed: 0,Impression ID,User ID,Time,History,Impressions
0,1,U13740,11/11/2019 9:05:58 AM,"[N55189, N42782, N34694, N45794, N18445, N6330...",N55689-1 N35729-0
1,2,U91836,11/12/2019 6:11:30 PM,"[N31739, N6072, N63045, N23979, N35656, N43353...",N20678-0 N39317-0 N58114-0 N20495-0 N42977-0 N...
2,3,U73700,11/14/2019 7:01:48 AM,"[N10732, N25792, N7563, N21087, N41087, N5445,...",N50014-0 N23877-0 N35389-0 N49712-0 N16844-0 N...


In [13]:
#get list of all unique USER ID
l1, l2 = behavior_train['User ID'].unique(), behavior_test['User ID'].unique()
#get common elements
print(f'Unique User IDs in Train: {len(l1)}\nUnique User IDs in Test: {len(l2)}\nCommon User IDs: {len(set(l1).intersection(l2))}')

Unique User IDs in Train: 50000
Unique User IDs in Test: 50000
Common User IDs: 5943
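
With 5943 shared IDs out of 50000 users per split, roughly 11.9% of the users in each split also appear in the other. A small sketch of that calculation; `overlap_share` is a hypothetical helper and assumes both inputs are non-empty.

```python
def overlap_share(train_ids, test_ids):
    """Return the number of shared IDs and their share of each split."""
    train_set, test_set = set(train_ids), set(test_ids)
    common = train_set & test_set
    return {
        "common": len(common),
        "share_of_train": len(common) / len(train_set),
        "share_of_test": len(common) / len(test_set),
    }

# Toy example with made-up user IDs
toy_overlap = overlap_share(["U1", "U2", "U3", "U4"], ["U3", "U4", "U5", "U6"])
```

On the real data this would be called as `overlap_share(behavior_train['User ID'], behavior_test['User ID'])`.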


In [14]:
#missing values
print(pd.DataFrame({'Train': behavior_train.isna().sum(), 'Test': behavior_test.isna().sum()}))

               Train  Test
Impression ID      0     0
User ID            0     0
Time               0     0
History         3238  2214
Impressions        0     0


History (clicked news)

In [None]:
#History
#drop rows with missing history (copy so the assignments below do not act on a view)
behavior_train = behavior_train[behavior_train['History'].notna()].copy()
behavior_test = behavior_test[behavior_test['History'].notna()].copy()
#split the space-separated news IDs into lists
behavior_train['History'] = behavior_train['History'].astype(str).str.split()
behavior_test['History'] = behavior_test['History'].astype(str).str.split()

In [None]:
#train
exploded_df = behavior_train.explode('History', ignore_index=True)[['User ID', 'History']]
exploded_df.rename(columns={'History': 'News ID'}, inplace=True)
merged_df = exploded_df.merge(news_train[['News ID', 'Category', 'SubCategory']], on='News ID', how='left')
train_unique = merged_df['News ID'].nunique()
print(f'Unique news in train set: {train_unique}')
display(merged_df.head(3))

Unique news in train set: 33195


Unnamed: 0,User ID,News ID,Category,SubCategory
0,U13740,N55189,tv,tvnews
1,U13740,N42782,sports,baseball_mlb
2,U13740,N34694,tv,tvnews


In [None]:
#test
exploded_df = behavior_test.explode('History', ignore_index=True)[['User ID', 'History']]
exploded_df.rename(columns={'History': 'News ID'}, inplace=True)
merged_df2 = exploded_df.merge(news_test[['News ID', 'Category', 'SubCategory']], on='News ID', how='left') #look up test history in the test news table
test_unique = merged_df2['News ID'].nunique()
print(f'Unique news in test set: {test_unique}')
display(merged_df2.head(3))

Unique news in test set: 37681


Unnamed: 0,User ID,News ID,Category,SubCategory
0,U80234,N55189,tv,tvnews
1,U80234,N46039,news,newsus
2,U80234,N51741,tv,tv-celebrity
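
Because both merges are left joins against a news table, any history article missing from that table silently gets NaN for `Category` and `SubCategory`. A minimal sketch for checking how many distinct history articles lack metadata, using the `indicator` option of `DataFrame.merge`; `metadata_coverage` is a hypothetical helper and the toy frames are made up.

```python
import pandas as pd

def metadata_coverage(history_df: pd.DataFrame, news_df: pd.DataFrame):
    """Return (articles without metadata, total distinct articles) in the history."""
    lookup = news_df[["News ID", "Category"]].drop_duplicates("News ID")
    merged = history_df.merge(lookup, on="News ID", how="left", indicator=True)
    # 'left_only' marks history rows with no match in the news table
    missing = merged.loc[merged["_merge"] == "left_only", "News ID"].nunique()
    total = merged["News ID"].nunique()
    return missing, total

# Toy data: N3 is clicked but absent from the news table
toy_history = pd.DataFrame({"User ID": ["U1", "U1", "U2", "U2"],
                            "News ID": ["N1", "N2", "N3", "N1"]})
toy_news = pd.DataFrame({"News ID": ["N1", "N2"], "Category": ["tv", "news"]})
toy_missing, toy_total = metadata_coverage(toy_history, toy_news)
```

A non-zero first value would mean that the category shares computed from the merged frame are based on only part of the history.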


In [None]:
click_distribution = merged_df['Category'].value_counts().reset_index()
click_distribution.columns = ['Category', 'Clicks']
click_distribution = click_distribution.sort_values('Clicks', ascending=False)

chart = alt.Chart(click_distribution).mark_bar().encode(
    x=alt.X('Category', sort='-y'), 
    y='Clicks',
    color='Category'
).properties(
    title='Click Distribution by Category'
)
chart #static

In [None]:
click = alt.selection_multi(fields=['Category'])
category_chart = alt.Chart(merged_df).mark_bar().encode(
    x=alt.X('Category:N', sort='-y'),
    y=alt.Y('count():Q', title='Number of Clicks'),
    color='Category:N',
    tooltip=['Category:N', 'count()'],
    opacity=alt.condition(click, alt.value(1), alt.value(0.2))
).add_selection(
    click
).properties(
    width=600,
    height=300,
    title='Category Distribution'
)
subcategory_chart = alt.Chart(merged_df).transform_filter(
    click
).mark_bar().encode(
    x=alt.X('count():Q', title='Number of Clicks'),
    y=alt.Y('SubCategory:N', sort='-x'),
    color='Category:N',
    tooltip=['SubCategory:N', 'count()']
).properties(
    width=600,
    height=300,
    title='Subcategory Distribution'
)
alt.vconcat(category_chart, subcategory_chart).configure_concat(spacing=30) #dynamic


![cat click distr](https://github.com/BianchiGiulia/practical_work/blob/main/imgs/catclick.png)
![subcategory distribution](https://github.com/BianchiGiulia/practical_work/blob/main/imgs/subclick.png)

In [None]:
#share of history clicks per category (fractions that sum to 1, not percentages)
clicks_per_category = merged_df['Category'].value_counts()
clicks_per_category = clicks_per_category / clicks_per_category.sum()
clicks_per_category = clicks_per_category.sort_values(ascending=False)
clicks_per_category

Category
news             0.313972
sports           0.142581
lifestyle        0.103698
finance          0.076729
tv               0.071807
foodanddrink     0.053538
health           0.045033
movies           0.035491
autos            0.034971
entertainment    0.032406
travel           0.032153
music            0.025814
video            0.019935
weather          0.011827
kids             0.000032
middleeast       0.000014
Name: count, dtype: float64

In [None]:
#share of history clicks whose subcategory mentions politics
politics = merged_df[merged_df['SubCategory'].str.contains('politic', case=False, na=False)]
len(politics) / len(merged_df)
#ca 6.8% of all history clicks are in politics (train data)

0.06754647303773818

Impressions (news shown as recommendations; not all of them were clicked)

In [None]:
impression_df = behavior_train[['User ID', 'Impressions']].copy()
def process_impressions(row):
    #each impression is '<news id>-<label>': label 1 = clicked, 0 = shown but not clicked
    impressions = row['Impressions'].split()
    clicked = [impression.rsplit("-", 1)[0] for impression in impressions if impression.endswith("-1")]
    non_clicked = [impression.rsplit("-", 1)[0] for impression in impressions if impression.endswith("-0")]
    row['click'] = ','.join(clicked)
    row['non-clicked'] = ','.join(non_clicked)
    return row


#iterate each row
impression_df = impression_df.apply(process_impressions, axis=1) #2 min
#impression_df.head(3)


In [None]:
#one row per clicked news ID (copy so the column assignment does not act on a view)
clicked_df = impression_df[['User ID', 'click']].copy()
clicked_df['click'] = clicked_df['click'].str.split(',')
clicked_df = clicked_df.explode('click', ignore_index=True)
#clicked_df #236344 rows
#-------------------------
#one row per shown-but-not-clicked news ID
nonclick_df = impression_df[['User ID', 'non-clicked']].copy()
nonclick_df['non-clicked'] = nonclick_df['non-clicked'].str.split(',')
nonclick_df = nonclick_df.explode('non-clicked', ignore_index=True)
#nonclick_df #5607100 rows
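
The row-wise `apply` and the two explode steps above can also be expressed with vectorised string operations, which avoids the per-row Python function. This is a sketch under the assumption that every impression token has the form `<news id>-<0 or 1>`; `explode_impressions` is a hypothetical helper, and the toy rows reuse truncated values from the `behavior_train.head(3)` output.

```python
import pandas as pd

def explode_impressions(behaviors: pd.DataFrame) -> pd.DataFrame:
    """Return one row per (user, shown news) with a 0/1 'clicked' flag."""
    out = behaviors[["User ID", "Impressions"]].copy()
    # 'N55689-1 N35729-0' -> ['N55689-1', 'N35729-0'] -> one row per token
    out["Impressions"] = out["Impressions"].str.split()
    out = out.explode("Impressions", ignore_index=True)
    # Split each token on its last hyphen into news ID and click label
    parts = out["Impressions"].str.rsplit("-", n=1, expand=True)
    out["News ID"] = parts[0]
    out["clicked"] = parts[1].astype(int)
    return out.drop(columns="Impressions")

toy_behaviors = pd.DataFrame({
    "User ID": ["U13740", "U91836"],
    "Impressions": ["N55689-1 N35729-0", "N20678-0 N39317-0"],
})
toy_long = explode_impressions(toy_behaviors)
```

Filtering the result on `clicked == 1` or `clicked == 0` would then give the equivalents of `clicked_df` and `nonclick_df`.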

In [65]:
#add info about category and subcategory
clicked_df = clicked_df.merge(news_train[['News ID', 'Category', 'SubCategory']], left_on='click', right_on='News ID', how='left')
nonclick_df = nonclick_df.merge(news_train[['News ID', 'Category', 'SubCategory']], left_on='non-clicked', right_on='News ID', how='left')

In [95]:
display(clicked_df.describe(), nonclick_df.describe())

Unnamed: 0,User ID,click,News ID,Category,SubCategory
count,236344,236344,236344,236344,236344
unique,50000,7713,7713,16,188
top,U53220,N55689,N55689,news,newsus
freq,129,4316,4316,69408,25660


Unnamed: 0,User ID,non-clicked,News ID,Category,SubCategory
count,5607100,5607100,5607100,5607100,5607100
unique,50000,20124,20124,16,214
top,U4743,N47061,N47061,news,newsus
freq,1919,22217,22217,1521630,457088


In [94]:
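n_click, n_non_click = len(clicked_df), len(nonclick_df) #236344 and 5607100 rows
tot = n_click + n_non_click
print(f'Total: {tot}')
#separate names so the Altair selection `click` defined earlier is not overwritten
click_share = n_click/tot
non_click_share = n_non_click/tot
print(f'Clicks: {click_share*100:.2f}%\nNon-Clicks: {non_click_share*100:.2f}%')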

Total: 5843444
Clicks: 4.04%
Non-Clicks: 95.96%


In [66]:
#share of impression clicks per category (fractions that sum to 1, not percentages)
clicks_per_category = clicked_df['Category'].value_counts()
clicks_per_category = clicks_per_category / clicks_per_category.sum()
clicks_per_category = clicks_per_category.sort_values(ascending=False)
clicks_per_category

Category
news             0.293674
sports           0.119237
lifestyle        0.112480
finance          0.087148
music            0.067647
tv               0.061537
foodanddrink     0.046047
health           0.045963
entertainment    0.044554
travel           0.035135
autos            0.031429
weather          0.019387
video            0.018033
movies           0.017711
kids             0.000013
northamerica     0.000004
Name: count, dtype: float64

In [67]:
#share of NON-clicks per category (fractions that sum to 1, not percentages)
clicks_per_category = nonclick_df['Category'].value_counts()
clicks_per_category = clicks_per_category / clicks_per_category.sum()
clicks_per_category = clicks_per_category.sort_values(ascending=False)
clicks_per_category

Category
news             0.271376
lifestyle        0.112154
sports           0.100595
finance          0.097033
foodanddrink     0.063937
entertainment    0.060530
travel           0.054716
health           0.052295
autos            0.047083
music            0.045125
tv               0.041361
movies           0.022811
video            0.015959
weather          0.014992
kids             0.000028
northamerica     0.000005
Name: count, dtype: float64
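
The two distributions above are normalised separately, so they cannot be read directly as click-through rates. For example, the `describe()` output gives 69408 clicked and 1521630 non-clicked impressions for news, i.e. 69408 / (69408 + 1521630) ≈ 4.4%, slightly above the overall 4.04%. A minimal sketch for computing that rate for every category; `ctr_by_category` is a hypothetical helper and the toy frames are made up.

```python
import pandas as pd

def ctr_by_category(clicked: pd.DataFrame, non_clicked: pd.DataFrame) -> pd.Series:
    """Clicks divided by total impressions, per category."""
    clicks = clicked["Category"].value_counts()
    skips = non_clicked["Category"].value_counts()
    # Total impressions per category; categories missing on one side count as 0
    shown = clicks.add(skips, fill_value=0)
    ctr = clicks.reindex(shown.index, fill_value=0) / shown
    return ctr.sort_values(ascending=False)

toy_clicked = pd.DataFrame({"Category": ["news", "news", "sports"]})
toy_non_clicked = pd.DataFrame(
    {"Category": ["news", "news", "sports", "sports", "sports", "tv"]}
)
toy_ctr = ctr_by_category(toy_clicked, toy_non_clicked)
```

On the real data the call would be `ctr_by_category(clicked_df, nonclick_df)`.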

In [96]:
#clicked impressions whose subcategory mentions politics
poli_click = clicked_df[clicked_df['SubCategory'].str.contains('politic', case=False, na=False)]

In [100]:
display(poli_click.nunique()) #unique news
display(poli_click)

User ID        6067
click           387
News ID         387
Category          1
SubCategory       2
dtype: int64

Unnamed: 0,User ID,click,News ID,Category,SubCategory
70,U40937,N39317,N39317,news,newspolitics
73,U700,N30035,N30035,news,newspolitics
84,U50562,N41224,N41224,news,newspolitics
86,U54128,N25467,N25467,news,newspolitics
100,U23485,N36442,N36442,news,newspolitics
...,...,...,...,...,...
236171,U41493,N26025,N26025,news,newspolitics
236213,U47315,N21707,N21707,news,newspolitics
236222,U88859,N166,N166,news,newspolitics
236243,U5480,N11681,N11681,news,newspolitics


In [106]:
387/7713 #unique politics news clicked / all unique news clicked (fraction, ca 5%)

0.050175029171528586

In [105]:
387*100/27837 #same count as % of 27837 = 7713 unique clicked + 20124 unique non-clicked news IDs (the two sets may overlap)

1.3902360168121566

Sources: [Recommenders](https://github.com/recommenders-team/recommenders/tree/main/recommenders) - [MIND GitHub](https://github.com/msnews/msnews.github.io/blob/master/assets/doc/introduction.md) - [MIND leaderboard](https://paperswithcode.com/sota/news-recommendation-on-mind) - [Technical report of the MIND competition winner](https://msnews.github.io/competition.html) - [RecBole models](https://recbole.io/docs/user_guide/model_intro.html) - [Medium: build a recommender with BERT](http://webcache.googleusercontent.com/search?q=cache:https://medium.com/mlearning-ai/build-news-recommendation-model-using-python-bert-and-faiss-10ea8c65e6c&sca_esv=585540903&strip=1&vwsrc=0) w/ [notebook](https://colab.research.google.com/drive/1uuQaagWNh7gexSQhchpOgGyPKYK2e6SU#scrollTo=ooBElUsT53JO) - [NRMS model](https://github.com/recommenders-team/recommenders/blob/main/examples/00_quick_start/nrms_MIND.ipynb) - [DLRM model](https://nvidia-merlin.github.io/HugeCTR/v3.5/notebooks/news-example.html) - [Kaggle](https://www.kaggle.com/code/accountstatus/mind-microsoft-news-recommendation-v2#Importing-The-Packages)
