# TLDR
### by Riccardo Belli, Daniel Grgic Figueiredo, Luca Rausa
TLDR (*_Too Long, Didn't Read_*) is a daily newsletter that offers brief and easily digestible summaries of the most important news stories of the day. The newsletter covers a wide range of topics, including technology, business, politics, and culture. Its aim is to provide busy professionals and individuals with a quick and informative overview of current events, without the need to sift through lengthy articles. The newsletter's summaries are written in an engaging style, making it accessible to readers of all backgrounds.

We chose this dataset because of the large number of articles at our disposal and because its topics are well structured and clearly separated. The topics are related yet noticeably different, so there is a substantial amount of data to work with, while the overlap between the selected topics makes the study more challenging.

The main topics are:

- "Big Tech & Startups": a section dedicated to news about big technology companies, as well as startups that might be able to compete with them or that have tried to do so.
- "Science & Futuristic Technology": a section often dedicated to medical news, space travel and lifestyle technology (food and drink related articles, among others). Some of its terms may overlap with the other two topics, but it is still the topic that has the least in common with the others.
- "Programming, Design & Data Science": a section that often includes GitHub repositories as well as major news about programming languages and well-known libraries. Articles in this topic may sometimes overlap with "Big Tech & Startups" because of the impact big tech companies have on programming languages and commonly used libraries.


<sub>If you're interested in subscribing to TLDR, you can do so by visiting their website at https://tldr.tech.</sub>

In [None]:
path = 'data/emails.txt'
with open(path, 'rb') as f:
    text = f.read().decode('utf-8', errors='replace')

## Cleaning the mess
The first step after downloading the required e-mails sent to a private account was to clean the text and make sure that there were no special characters that would interfere with our data. 

In [None]:
import re

def remove_special_characters(dirty_text):
    # Match any character that is not an ASCII letter, digit, whitespace or common punctuation (including `-` and the backtick)
    pattern = re.compile(r'[^a-zA-Z0-9\s.,?!:;@#$%^&*()_+={}[\]|\\<>/"\'`-]', flags=re.UNICODE)
    clean_text = pattern.sub('', dirty_text)
    return clean_text
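
As a quick sanity check, the sketch below applies the same whitelist pattern to a made-up string (the sample text and the name `ALLOWED_ONLY` are assumptions for illustration, not part of the emails or the notebook). It suggests that every non-ASCII character is dropped, which includes accented letters and curly quotes, not only decorative symbols.

```python
import re

# Same whitelist as in remove_special_characters, repeated so this cell runs on its own.
ALLOWED_ONLY = re.compile(r'[^a-zA-Z0-9\s.,?!:;@#$%^&*()_+={}[\]|\\<>/"\'`-]')

# Hypothetical input: an accented letter, curly quotes and an arrow.
demo_dirty = "Caf\u00e9 \u201cAI\u201d costs $5 \u2192 wow!"
demo_clean = ALLOWED_ONLY.sub('', demo_dirty)

# repr() makes the leftover double space (where the arrow used to be) visible.
print(repr(demo_clean))
```

Because the removed characters leave their surrounding spaces behind, runs of two or more spaces can appear in the cleaned text.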

## Data collection (5 minute read)
Every article has a title, and most titles are followed by the estimated time an average person would take to read the full article. We remove this estimate to clean up some of the noise.

In [None]:
def remove_minute_read(dirty_text):
    minute_read_pattern = re.compile(r'\(?\d*\s*minute\s*read\)?', re.IGNORECASE)
    clean_text = re.sub(minute_read_pattern, '', dirty_text)
    return clean_text
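
A minimal sketch of what this pattern does on two made-up strings (both samples and the name `MINUTE_READ` are assumptions). Since the digits and the parentheses are optional in the pattern, the phrase is also removed when it appears inside running text, which may or may not be desirable.

```python
import re

# Same pattern as in remove_minute_read, repeated so this cell runs on its own.
MINUTE_READ = re.compile(r'\(?\d*\s*minute\s*read\)?', re.IGNORECASE)

demo_title = "SPACEX LAUNCHES STARSHIP (5 MINUTE READ)"  # hypothetical title
demo_body = "It is a ten minute read at most."            # hypothetical body text

clean_title = MINUTE_READ.sub('', demo_title)
clean_body = MINUTE_READ.sub('', demo_body)

# repr() shows the trailing space left behind in the title.
print(repr(clean_title))
print(repr(clean_body))
```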

## Cleaner than ever
Lastly, the articles are cleaned in more detail, removing square-bracketed content, empty lines and any other elements that were in the initial file but are not part of the text to be analyzed.

In [None]:
import quopri

def decode_quoted_printable(coded_text):
    decoded_text = quopri.decodestring(coded_text.encode('utf-8', errors='replace')).decode('utf-8', errors='replace')
    return decoded_text
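
To illustrate why this decoding step is needed, here is a small sketch on a made-up quoted-printable fragment (the fragment is an assumption, not taken from the emails). In this encoding, `=E2=80=99` is the byte sequence of a typographic apostrophe, and a `=` at the end of a line is a soft line break that should disappear.

```python
import quopri

# Hypothetical fragment: an encoded apostrophe and a soft line break ("=\n").
demo_encoded = "It=E2=80=99s a soft=\nbreak"

demo_decoded = quopri.decodestring(demo_encoded.encode('utf-8')).decode('utf-8', errors='replace')
print(demo_decoded)
```

Since `=` is one of the characters kept by `remove_special_characters`, running the decoding first seems to be the safer order: otherwise the escape sequences would survive as literal text.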

In [None]:
def remove_square_brackets_urls(dirty_text):
    cleaned_text = re.sub(r'\[.*?\]', '', dirty_text)
    return cleaned_text

In [None]:
def remove_blank_lines(dirty_text):
    lines = dirty_text.splitlines()
    cleaned_lines = [line.strip() for line in lines if line.strip()]
    cleaned_text = '\n'.join(cleaned_lines)
    return cleaned_text
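
The two helpers above can be sketched together on a made-up snippet (the sample string and the URL are assumptions). Note that `.` does not match a newline by default, so a bracket that opens on one line and closes on another would be left untouched.

```python
import re

# Hypothetical input with bracketed references, blank lines and stray indentation.
demo_raw = "Read more [1]\n\n   \n  Sponsor [https://example.com/a?b=c] here  \n"

# Step 1: drop anything between square brackets (non-greedy, single line only).
demo_no_brackets = re.sub(r'\[.*?\]', '', demo_raw)

# Step 2: strip every line and discard the ones that end up empty.
demo_lines = [line.strip() for line in demo_no_brackets.splitlines() if line.strip()]
demo_result = '\n'.join(demo_lines)

print(repr(demo_result))
```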

In [None]:
def cleanup_text(dirty_text):
    dirty_text = decode_quoted_printable(dirty_text)
    dirty_text = remove_special_characters(dirty_text)
    dirty_text = remove_square_brackets_urls(dirty_text)
    dirty_text = remove_minute_read(dirty_text)
    clean_text = remove_blank_lines(dirty_text)
    return clean_text

text = cleanup_text(text)

## Extraction ready
Once the text is clean, the sections can be extracted. Because the newsletter has a regular structure, we can use a _start section_ string and an _end section_ string to separate the text of a given email by topic.

In [None]:
def extract_sections(text, start_section, end_section):
    # Define the regular expression pattern to match the desired sections
    pattern = re.compile(r'(?i){}(.*?)(?={})'.format(re.escape(start_section), re.escape(end_section)), re.MULTILINE | re.DOTALL)
    sections = re.findall(pattern, text)

    return sections
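
The sketch below shows the same idea on a toy string (the helper name `extract_between`, the shortened markers and the toy text are assumptions). The non-greedy group stops at the first end marker, and the lookahead does not consume it, so the end marker is still available as the start marker of the following section.

```python
import re

def extract_between(text, start_marker, end_marker):
    # Non-greedy capture after the start marker, up to (not including) the end marker.
    pattern = re.compile(
        r'(?i){}(.*?)(?={})'.format(re.escape(start_marker), re.escape(end_marker)),
        re.DOTALL,
    )
    return pattern.findall(text)

# Two hypothetical emails concatenated, each with a "Big Tech" and a "Science" section.
demo_text = "Big Tech\nA1. A2.\nScience\nB1.\nBig Tech\nA3.\nScience\nB2.\n"
demo_sections = extract_between(demo_text, "Big Tech", "Science")

print(demo_sections)
```

One list entry is returned per email, which is why the number of extracted sections should roughly match the number of emails in the file.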

## The big three and another one
The newsletter presents the following topics (called _arguments_ in the code) in this order: [Big Tech & Startups, Science & Futuristic Technology, Programming, Design & Data Science]. These are the ones we used for our classification. As a last entry we added _Miscellaneous_, only to mark where _Programming, Design & Data Science_ ends.
Once these are established, we can proceed to separate the text by topic.

In [None]:
arguments = [
    "Big Tech & Startups",
    "Science & Futuristic Technology",
    "Programming, Design & Data Science",
    "Miscellaneous"
]

argument_to_code = {argument: code for code, argument in enumerate(arguments)}

text_by_argument = {}

for i in range(len(arguments) - 1):
    argument = arguments[i]
    next_argument = arguments[i + 1]

    text_by_argument[argument] = extract_sections(text, argument, next_argument)

for section, content in text_by_argument.items():
    print(f"{section}: {len(content)} sections extracted")
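
The pairing of each topic with the one that follows it can also be sketched with `zip` (the name `demo_topics` is an assumption, used here to avoid overwriting the notebook's own variables). It makes explicit that _Miscellaneous_ only ever appears as an end marker and never gets a section of its own.

```python
demo_topics = [
    "Big Tech & Startups",
    "Science & Futuristic Technology",
    "Programming, Design & Data Science",
    "Miscellaneous",  # sentinel: only marks where the last real topic ends
]

# Each topic is paired with its successor: (start marker, end marker).
demo_pairs = list(zip(demo_topics, demo_topics[1:]))

for start_marker, end_marker in demo_pairs:
    print(f"{start_marker!r} -> {end_marker!r}")
```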

In [None]:
def print_first_10_sections(sections):
    for section in sections[:10]:
        print(section)

# Print the first 10 extracted sections for the "Big Tech & Startups" argument
print_first_10_sections(text_by_argument[arguments[0]])

## From sections to single articles
Once the sections have been properly separated, they are then further broken down into the articles. Each section contains at least 2 articles.

In [None]:
import pandas as pd
import re

def split_section_into_articles(section):
    pattern = r"\.(?=\s*(?:\n|\r\n?)*\s*[A-Z]{2,}(?:\s[A-Z]+)*(?!\w))"
    articles = re.split(pattern, section)
    return [article.strip() for article in articles if article.strip()]

def create_articles_dataframe(text_by_argument):
    data = []

    for argument, sections in text_by_argument.items():
        category_code = argument_to_code[argument]
        for section in sections:
            articles = split_section_into_articles(section)
            for article in articles:
                article = article.replace('\n', ' ').replace('\r', ' ')
                data.append((article, category_code, argument))

    df = pd.DataFrame(data, columns=["text", "category_code", "category"])
    return df
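
To make the splitting rule more concrete, here is a sketch on a made-up section (the sample text and the name `ARTICLE_BOUNDARY` are assumptions). The pattern splits on a period that is followed by a run of at least two capital letters, which is how an upper-case title is recognised as the start of the next article.

```python
import re

# Same pattern as in split_section_into_articles, repeated so this cell runs on its own.
ARTICLE_BOUNDARY = r"\.(?=\s*(?:\n|\r\n?)*\s*[A-Z]{2,}(?:\s[A-Z]+)*(?!\w))"

# Hypothetical section with two articles, each starting with an upper-case title.
demo_section = (
    "APPLE SHIPS NEW CHIP Apple announced a chip. It is fast. "
    "GOOGLE UPDATES SEARCH Google changed ranking."
)

demo_articles = [part.strip() for part in re.split(ARTICLE_BOUNDARY, demo_section) if part.strip()]

for demo_article in demo_articles:
    print(repr(demo_article))
```

Two side effects are worth keeping in mind: the period used as a boundary is consumed by the split, and a sentence that starts with an all-caps acronym would also be treated as the beginning of a new article.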

In [None]:
articles_df = create_articles_dataframe(text_by_argument)
articles_df

## Last but not least
Once the articles are separated, we can start creating the DataFrames containing the title, text and category of each article. This will allow us to start working with the available data.

In [None]:
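def extract_title(article_text):
    # The title is the leading run of characters up to the first lowercase letter.
    title_search = re.search(r'^[^a-z]+', article_text)
    if title_search is None or not title_search.group().strip():
        # No upper-case title found (e.g. the text starts with a lowercase letter)
        return '', article_text.strip()

    # The run also swallows the first (upper-case) letter of the body,
    # so split one character before the end of the run.
    split_at = len(title_search.group().rstrip()) - 1
    title = article_text[:split_at].strip()
    article_text = article_text[split_at:].strip()

    return title, article_text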
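
The reason the last matched character is handed back to the body can be seen on a made-up article (the sample text is an assumption). The run of non-lowercase characters does not stop at the end of the title: it also captures the capital letter that opens the first sentence.

```python
import re

# Hypothetical article: upper-case title followed by the body.
demo_article_text = "APPLE SHIPS NEW CHIP Apple announced a chip."

# Leading run of characters that are not lowercase letters.
demo_run = re.search(r'^[^a-z]+', demo_article_text).group()
print(repr(demo_run))

# Give the final character of the run back to the body.
demo_title = demo_run[:-1].strip()
demo_body = demo_run[-1] + demo_article_text[len(demo_run):]

print(repr(demo_title))
print(repr(demo_body))
```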

In [None]:
from nltk import sent_tokenize, download

download('punkt')

def sentence_segmentation(article_text):
    title, article_text = extract_title(article_text)
    all_sentences = [title]

    sentences = sent_tokenize(article_text)
    for sentence in sentences:
        all_sentences = all_sentences + sentence.split('  ')

    # Drop empty fragments, which can appear when splitting on double spaces
    return [x.strip() for x in all_sentences if x.strip()]

In [None]:
def create_sentences_dataframe(articles_df):
    data = []

    for _, row in articles_df.iterrows():
        article = row['text']
        category_code = row['category_code']
        category = row['category']
        sentences = sentence_segmentation(article)

        for sentence in sentences:
            data.append((sentence, category_code, category))

    df_sentences = pd.DataFrame(data, columns=["text", "category_code", "category"])
    return df_sentences

In [None]:
sentences_df = create_sentences_dataframe(articles_df)
sentences_df

## Train(, validation and test) makes perfect
With the DataFrames properly set up, we want to split each of them into 3 separate DataFrames: train, validation and test.

In [None]:
import math
from sklearn.model_selection import train_test_split

def split_and_save_dataframes(df, name, train_ratio=0.7, validation_ratio=0.15, test_ratio=0.15):
    # Compare with a tolerance: exact float equality can fail because of rounding
    assert math.isclose(train_ratio + validation_ratio + test_ratio, 1.0), "Ratios must sum up to 1"

    train_df, temp_df = train_test_split(df, stratify=df['category_code'], train_size=train_ratio)
    validation_df, test_df = train_test_split(temp_df, stratify=temp_df['category_code'], train_size=(validation_ratio / (validation_ratio + test_ratio)))

    train_df.to_pickle(f'data/{name}_training.pkl')
    validation_df.to_pickle(f'data/{name}_validation.pkl')
    test_df.to_pickle(f'data/{name}_test.pkl')
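
The second `train_size` may look odd at first, so here is the arithmetic behind it as a small sketch (the dataset size of 1000 rows is an assumption, and the exact counts produced by scikit-learn may differ by a row because of rounding). After the first split, only 30% of the rows remain, so validation has to take 0.15 / (0.15 + 0.15) = 50% of that remainder to end up with 15% of the whole.

```python
train_ratio, validation_ratio, test_ratio = 0.7, 0.15, 0.15
demo_rows = 1000  # hypothetical number of rows

# First split: train vs. everything else.
n_train = round(demo_rows * train_ratio)
n_rest = demo_rows - n_train

# Second split: the validation share is expressed relative to the remainder.
second_stage_ratio = validation_ratio / (validation_ratio + test_ratio)
n_validation = round(n_rest * second_stage_ratio)
n_test = n_rest - n_validation

print(n_train, n_validation, n_test)
```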

<sub>We save the DataFrame containing articles (as well as the one containing sentences) to files so that this information can be accessed more easily later in a separate notebook, if needed.</sub>

In [None]:
articles_df.to_pickle('data/articles_df.pkl')
sentences_df.to_pickle('data/sentences_df.pkl')

In [None]:
split_and_save_dataframes(articles_df, 'articles')
split_and_save_dataframes(sentences_df, 'sentences')