# Introduction

News articles inform readers about recent events and give them the knowledge they need or want about the world around them.

Today, most of us find it difficult to read a news article in full. With so many articles available online, it is hard to find the time to read everything, and harder still when the articles are written in an unfamiliar language.

In this project we develop a tool in Python, using machine learning, that summarises a news article and translates it into different languages.

# Problem Statement

**The problems we are facing are:**
1. Keeping up with the news in today's hectic environment is difficult. With so many news articles available online, it is hard to go through all of them.

2. If you are living in a foreign country, it is difficult to read or translate a language you are not familiar with. Most of us are only interested in the key points.

# Data Exploration

We start by importing the necessary libraries and loading the dataset. We then check its shape and summary information, count the duplicate values in each column, count the null values, and inspect the data types of the columns.

In [1]:
import time
import pandas as pd
from transformers import pipeline
from googletrans import Translator
from sumy.summarizers.lsa import LsaSummarizer

In [2]:
# Load the dataset
df = pd.read_csv('data.csv')
df.head(3)

|  | title | text | author | published | tags | source |
|---|---|---|---|---|---|---|
| 0 | Warum nicht mal die Russen Putins Impfstoff wo... | Im Kreml sollte es in diesen Tagen eigentlich ... | Pavel Lokshin | 1602234820 | Ausland, „Sputnik V“ | welt |
| 1 | Mario Götze schwärmt von Schmidts Philosophie |  |  | 1602234283 | Sport, PSV Eindhoven | welt |
| 2 | Welternährungsprogramm erhält den Friedensnobe... | Der Preis setze ein Zeichen für die Millionen ... | 'Hannes Stein', 'Birgit Herden', 'Dirk Schümer... | 1602234160 | Ausland, Oslo | welt |


Checking the shape of the dataset. In our dataset we have 174939 rows and 6 features.

In [3]:
df.shape

(174939, 6)

Checking the info of the dataset. We can see the column names, the non-null count of each column, the data types, and the memory usage.

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 174939 entries, 0 to 174938
Data columns (total 6 columns):
 #   Column     Non-Null Count   Dtype 
---  ------     --------------   ----- 
 0   title      174939 non-null  object
 1   text       167280 non-null  object
 2   author     151235 non-null  object
 3   published  174915 non-null  object
 4   tags       174939 non-null  object
 5   source     174939 non-null  object
dtypes: object(6)
memory usage: 8.0+ MB


Checking if there are any duplicate values in the dataset columns. Note that `duplicated()` also counts repeated missing values as duplicates.
In the 'title' column we have 11948 duplicate values.

In [5]:
df['title'].duplicated().sum()

11948

In the 'text' column we have 25700 duplicate values.

In [6]:
df['text'].duplicated().sum()

25700

In the 'published' column we have 20061 duplicate values.

In [7]:
df['published'].duplicated().sum()

20061

In the 'author' column we have 136748 duplicate values.

In [8]:
df['author'].duplicated().sum()

136748

In the 'tags' column we have 131395 duplicate values.

In [9]:
df['tags'].duplicated().sum()

131395

In the 'source' column we have 174891 duplicate values.

In [10]:
df['source'].duplicated().sum()

174891
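The per-column checks above can also be done in a single call. The following is a minimal sketch on a small made-up frame (the `toy` data is an assumption for illustration, not the project's dataset):

```python
import pandas as pd

# Toy frame standing in for the news dataset (values are made up)
toy = pd.DataFrame({
    "title": ["A", "B", "A", "C"],
    "source": ["welt", "welt", "welt", "welt"],
    "author": ["X", None, None, "Y"],
})

# Series.duplicated() flags every repeat of an earlier value,
# and repeated missing values are flagged as well
duplicate_counts = toy.apply(lambda col: col.duplicated().sum())
print(duplicate_counts)
```

On the real dataset the same expression would be applied to `df` instead of `toy`.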

Checking if there are any null values in the dataset columns. We have 7659 missing values in the 'text' column, 23704 in the 'author' column, and 24 in the 'published' column.

In [11]:
df.isnull().sum()

title            0
text          7659
author       23704
published       24
tags             0
source           0
dtype: int64
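To put these counts in proportion, they can be expressed as a share of all rows. This is a small sketch that reuses the numbers printed above; on the real frame, `df.isnull().mean() * 100` should give the same shares without typing the counts by hand.

```python
import pandas as pd

# Null counts and row total copied from the outputs above
null_counts = pd.Series({"text": 7659, "author": 23704, "published": 24})
total_rows = 174939

# Share of missing values per column, in percent
missing_pct = (null_counts / total_rows * 100).round(2)
print(missing_pct)
```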

Checking the data types of the dataset columns. All columns have the dtype 'object', most likely because they contain strings or a mix of value types.

In [12]:
df.dtypes

title        object
text         object
author       object
published    object
tags         object
source       object
dtype: object
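One way to see what an 'object' column actually holds is to count the Python type of each cell. The sketch below uses a made-up column (`mixed` is an assumption for illustration); on the real data the same `map` call could be applied to, for example, `df['published']`.

```python
import pandas as pd

# Toy column mixing a string, an integer and a missing value (made-up data)
mixed = pd.Series(["1602234820", 1602234283, None], dtype="object")

# Count the Python type stored in each cell
type_counts = mixed.map(lambda value: type(value).__name__).value_counts()
print(type_counts)
```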

# Data Cleaning

Removing the rows with null values in the 'text' column of the dataset.

Replacing the null values in the 'author' column of the dataset with 'Unknown'.

Dropping the 'published' and 'tags' columns from the dataset because they are not needed for this project.

In [13]:
# Removing rows with null values in the 'text' column
df.dropna(subset=['text'], inplace=True)

# Replacing missing values in the 'author' column with 'Unknown'.
# Assigning the result back avoids the chained in-place call, which
# raises a FutureWarning and will stop working in pandas 3.0.
df['author'] = df['author'].fillna('Unknown')

# Dropping the 'published' and 'tags' columns
df.drop(columns=['published', 'tags'], inplace=True)


Checking if there are still any null values in the dataset.

In [14]:
df.isnull().sum()

title     0
text      0
author    0
source    0
dtype: int64

Dropping the rows that are duplicated across the 'title', 'text', 'author', and 'source' columns, that is, rows in which all four values together repeat an earlier row.

In [15]:
df.drop_duplicates(subset=['title', 'text', 'author', 'source'], inplace=True)

After removing the rows with null values and dropping the duplicated rows, we check the shape of the dataset again. It now has 163987 rows and 4 columns.

In [16]:
df.shape

(163987, 4)
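Because `drop_duplicates` is called with a `subset`, a row is removed only when the combination of all listed columns repeats, not when a single column does. A minimal sketch on made-up rows (the `toy` frame is an assumption for illustration):

```python
import pandas as pd

# Rows 0 and 1 are identical; row 2 shares the title but has a different text
toy = pd.DataFrame({
    "title": ["A", "A", "A"],
    "text": ["t1", "t1", "t2"],
    "author": ["X", "X", "X"],
    "source": ["welt", "welt", "welt"],
})

# Only the exact repeat (row 1) is dropped; the first occurrence is kept
deduped = toy.drop_duplicates(subset=["title", "text", "author", "source"])
print(len(toy), "->", len(deduped))
```

This is why the number of rows removed from the dataset is much smaller than the per-column duplicate counts reported earlier.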

Converting the types of the dataset columns using convert_dtypes().

In [17]:
df = df.convert_dtypes()
df.dtypes

title     string[python]
text      string[python]
author    string[python]
source    string[python]
dtype: object

Checking if everything is done properly.

In [18]:
df.head(3)

|  | title | text | author | source |
|---|---|---|---|---|
| 0 | Warum nicht mal die Russen Putins Impfstoff wo... | Im Kreml sollte es in diesen Tagen eigentlich ... | Pavel Lokshin | welt |
| 2 | Welternährungsprogramm erhält den Friedensnobe... | Der Preis setze ein Zeichen für die Millionen ... | 'Hannes Stein', 'Birgit Herden', 'Dirk Schümer... | welt |
| 4 | In welchen deutschen Großstädten die Lage krit... | Die Kanzlerin berät mit den Bürgermeistern der... | 'Daniele Raffaele Gambone', 'Christoph B. Schi... | welt |


Translating the article text from German to English with the "googletrans" package. A translator instance is created from the "Translator" class. The function `translate_text` translates a value of the 'text' column from German to English, returns `None` for missing (NaN) values and for failed requests, and pauses for one second after each request to avoid overloading the translation service.

Before translating, we use boolean indexing to remove the rows whose text contains the word "Anzeige" (German for "advertisement"). We then translate only the first 70 remaining rows of the 'text' column and store the result in a new column named 'Translated_text' in the same dataset.

In [19]:
# Initializing the translator
translator = Translator()

# Function to translate text with error handling and delay
def translate_text(text):
    try:
        if pd.notnull(text):  # Checking if text is not NaN
            translation = translator.translate(text, src='de', dest='en')
            time.sleep(1)  # Pausing for 1 second to avoid overwhelming the translation service
            return translation.text
        else:
            return None  # Returning None for NaN values
    except Exception as e:
        # Printing only the start of the text keeps the log readable for long articles
        print(f"Translation failed for text: {str(text)[:80]}... Error: {e}")
        return None

# Removing rows containing "Anzeige" (advertisement) from the DataFrame;
# .copy() avoids a SettingWithCopyWarning when the new column is added below
df = df[~df['text'].str.contains('Anzeige', regex=False)].copy()

# Translating the first 70 remaining rows; all other rows get a missing value
df['Translated_text'] = df['text'].iloc[:70].apply(translate_text)

df.head(5)


|  | title | text | author | source | Translated_text |
|---|---|---|---|---|---|
| 0 | Warum nicht mal die Russen Putins Impfstoff wo... | Im Kreml sollte es in diesen Tagen eigentlich ... | Pavel Lokshin | welt | There should actually be enough reasons for tu... |
| 30 | Gegen diesen Wasserstoff-Plan wirkt die deutsc... | An der Great Northern Road schrauben Monteure ... | Stefanie Bolzen | welt | At the Great Northern Road screws fitters arou... |
| 39 | Über 5000 Neuinfektionen – Merkels 19.200er-Sz... | Zwischen Ende Juni und Ende September hat sich... | Olaf Gersemann | welt | Between the end of June and the end of Septemb... |
| 44 | Zaubertore! Hier schießen sich Toni Kroos und ... | Das DFB-Team trifft in der Nations League auf ... | Unknown | welt | The DFB team meets Ukraine in the Nations Leag... |
| 45 | FDP entzieht Thüringer Parteichef die Unterstü... | as Präsidium der Bundes-FDP hat dem thüringisc... | 'Luisa Hofmeier', 'Geli Tangermann', 'Torsten ... | welt | AS Presidium of the Federal FDP withdrawn the ... |
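The error handling and pause in `translate_text` could be extended so that a failed request is retried a few times before giving up. The sketch below is one possible approach, not part of the original notebook: `translate_with_retry`, `flaky_translate`, and the backoff parameters are hypothetical names chosen for illustration, and a stand-in function replaces the real translation service so the example runs offline.

```python
import time
from typing import Callable, Optional


def translate_with_retry(
    text: str,
    translate_fn: Callable[[str], str],
    retries: int = 3,
    base_delay: float = 1.0,
) -> Optional[str]:
    """Call translate_fn, retrying with exponential backoff on any exception."""
    for attempt in range(retries):
        try:
            return translate_fn(text)
        except Exception as exc:
            if attempt == retries - 1:
                print(f"Giving up after {retries} attempts: {exc}")
                return None
            wait = base_delay * (2 ** attempt)  # 1x, 2x, 4x the base delay
            print(f"Attempt {attempt + 1} failed ({exc}); retrying in {wait:.2f}s")
            time.sleep(wait)
    return None


# Stand-in translator (hypothetical): fails on the first call, then succeeds
calls = {"n": 0}

def flaky_translate(text: str) -> str:
    calls["n"] += 1
    if calls["n"] == 1:
        raise ConnectionError("temporary failure")
    return text.upper()


result = translate_with_retry("hallo welt", flaky_translate, base_delay=0.01)
print(result)
```

With "googletrans", `translate_fn` could wrap the call used in the cell above, for example `lambda t: translator.translate(t, src='de', dest='en').text`.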


We create a text summarization pipeline with the T5 base model ("t5-base") from the Hugging Face Transformers library. The function `generate_summary` takes a text as input, produces a summary with the summarization pipeline, and returns the summary in place of the original text. It is applied to the 'Translated_text' column: summaries are generated for the values that are not missing, missing values are passed through unchanged, and the results are stored in a new column named 'Summarised-text' in the same dataset.

In [20]:
# Loading the text summarization pipeline with the desired model
summarizer = pipeline("summarization", model="t5-base", revision="main")

# Function to generate summary for non-missing values and replace them in the Dataset
def generate_summary(text):
    if pd.notnull(text):  # Checking if the value is not NaN
        summary = summarizer(text, max_length=150, min_length=40, num_beams=4, do_sample=False)
        return summary[0]['summary_text']
    else:
        return text  # Returning NaN for missing values

# Applying the generate_summary function to the 'Translated_text' column
df['Summarised-text'] = df['Translated_text'].apply(generate_summary)

df.head(70)


Your max_length is set to 150, but your input_length is only 135. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=67)
Your max_length is set to 150, but your input_length is only 119. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=59)
Your max_length is set to 150, but your input_length is only 41. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=20)
Your max_length is set to 150, but your input_length is only 102. Since this is a summarization task, where outputs shorter than the input are typically wanted, you might consider decreasing max_length manually, e.g. summarizer('...', max_length=51)

|  | title | text | author | source | Translated_text | Summarised-text |
|---|---|---|---|---|---|---|
| 0 | Warum nicht mal die Russen Putins Impfstoff wo... | Im Kreml sollte es in diesen Tagen eigentlich ... | Pavel Lokshin | welt | There should actually be enough reasons for tu... | the number of new infections has more than dou... |
| 30 | Gegen diesen Wasserstoff-Plan wirkt die deutsc... | An der Great Northern Road schrauben Monteure ... | Stefanie Bolzen | welt | At the Great Northern Road screws fitters arou... | Scotland's northeastern lives in the future th... |
| 39 | Über 5000 Neuinfektionen – Merkels 19.200er-Sz... | Zwischen Ende Juni und Ende September hat sich... | Olaf Gersemann | welt | Between the end of June and the end of Septemb... | between end of June and the end of September, ... |
| 44 | Zaubertore! Hier schießen sich Toni Kroos und ... | Das DFB-Team trifft in der Nations League auf ... | Unknown | welt | The DFB team meets Ukraine in the Nations Leag... | the DFB team meets Ukraine in the Nations Leag... |
| 45 | FDP entzieht Thüringer Parteichef die Unterstü... | as Präsidium der Bundes-FDP hat dem thüringisc... | 'Luisa Hofmeier', 'Geli Tangermann', 'Torsten ... | welt | AS Presidium of the Federal FDP withdrawn the ... | the early state election will be held on April... |
| ... | ... | ... | ... | ... | ... | ... |
| 595 | Brasiliens überraschendes Comeback | Die Tendenz der Corona-Infektionen: In Argenti... | Tobias Käufer | welt | The tendency of the Corona infections: rising ... | the tendency of the Corona infections: rising ... |
| 601 | Erdogans neuer Krieg | Die Türkei unterstützt in Berg-Karabach das mu... | Alfred Hackensberger | welt | Turkey supports the Muslim Azerbaijan in Berg-... | turkey supports the Muslim Azerbaijan in Berg-... |
| 614 | „Ich musste noch allen die Schuhe putzen“ | Bevor sich Steffen Freund auf den Rückweg begi... | Lars Gartenschläger | welt | Before Steffen Freund goes back to regulate hi... | Steffen Freund leaves him with a Mercedes befo... |
| 627 | Ein von Willkür geprägtes System | Dass Beamte vor Gericht gehen, weil sie sich b... | Arnd Diringer | welt | From time to time, officials go to court becau... | the higher administrative court of Rhineland-P... |
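The warnings printed by the pipeline suggest a `max_length` of about half the input length for short inputs (for example 67 for an input of 135 tokens). The helper below sketches how the length limits could be scaled per text. `summary_length_bounds` is a hypothetical name, the halving rule is inferred from the warning text, and the rule for `min_length` is an assumption made here so that it never exceeds `max_length`.

```python
def summary_length_bounds(input_tokens: int, max_cap: int = 150, min_floor: int = 40):
    """Return (min_length, max_length) scaled to the size of the input.

    max_length is half the input length, capped at max_cap.
    min_length is min_floor, reduced for short inputs so it stays below max_length.
    """
    max_length = max(1, min(max_cap, input_tokens // 2))
    min_length = min(min_floor, max_length // 2)
    return min_length, max_length


# Input lengths taken from the warnings above, plus one long input
for n in (135, 119, 41, 102, 1000):
    print(n, summary_length_bounds(n))
```

The token count of a text could be obtained from the pipeline's tokenizer, for example `len(summarizer.tokenizer(text)["input_ids"])`, and the two values passed on as `min_length` and `max_length` in the `summarizer` call.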


The ROUGE score is used to evaluate the quality of text summaries. ROUGE compares the n-grams of machine-generated summaries with those of manually written reference summaries. We calculate the ROUGE scores for both sets of summaries and print the averaged result, which shows how well the machine-generated summaries capture the important details compared with the manually written ones.
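To make the metric concrete, here is a simplified ROUGE-1 computed by hand from unigram overlap. It is a sketch under stated assumptions: `rouge_1` is a hypothetical helper that splits on whitespace and uses clipped counts, so its numbers may differ slightly from those of the "rouge" package, and the two example sentences are made up.

```python
from collections import Counter


def rouge_1(hypothesis: str, reference: str) -> dict:
    """Simplified ROUGE-1: unigram overlap with clipped counts."""
    hyp_counts = Counter(hypothesis.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Each word counts at most as often as it appears in both texts
    overlap = sum((hyp_counts & ref_counts).values())
    precision = overlap / max(sum(hyp_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"r": recall, "p": precision, "f": f1}


scores = rouge_1("the cat sat on the mat", "the cat lay on a mat")
print(scores)
```

Recall is the share of reference words found in the machine summary, precision is the share of machine-summary words found in the reference, and the F-score combines the two. ROUGE-2 applies the same idea to word pairs, and ROUGE-L uses the longest common subsequence.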

In [44]:
from rouge import Rouge

# Extract the summaries at positions 6 to 9 (four rows) of the 'Summarised-text' column as references
manually_generated_summaries = df['Summarised-text'].iloc[6:10].tolist()

# Define machine-generated summaries
machine_generated_summaries = [
    "In addition, an inflation compensation border of 3,000 euros and other allowances, such as a higher vacation allowance, were agreed.These can now decide on acceptance or rejection of the result in a survey running until mid -April.Lufthansa Human Resources Director Michael Niggemann also spoke of a good compromise, which is also a 'great economic challenge'.",
    "In the past few days, the Union had asked Federal President Frank-Walter Steinmeier not to sign the law.The expert commission with experts from the fields of medicine, law, traffic and the police has now come to a result.",
    "Most recently, Germany had fallen below the brand in the 2019 pre-Corona year with a value of 59.6 percent.After that, the federal and state governments are no longer allowed to compensate for their budget deficits by taking out loans.While an absolute debt is valid for the countries, the federal government has a small scope.",
    "The most promising candidate, Maria Corina Machado, was banned in advance due to alleged irregularities from her time as a member of the exercise of public office for 15 years."
    "At the crucial moment, however, Zverev was there and made the semi-finals perfect with the only break of the set after 1:37 hours.Through the sovereign 6: 4, 6: 2 in the quarter -finals against the Czech Tomas Machac on Wednesday (27.3.2024 / local time) Sinner was the first player to book his ticket for the round of the best four."
]


# Initialize ROUGE
rouge = Rouge()

# Calculate ROUGE scores
rouge_scores = rouge.get_scores(machine_generated_summaries, manually_generated_summaries, avg=True)

# Print ROUGE scores
print("ROUGE Scores:")
print(rouge_scores)


ROUGE Scores:
{'rouge-1': {'r': 0.21365007541478132, 'p': 0.1413487016428193, 'f': 0.1678690616970145}, 'rouge-2': {'r': 0.03267841574293187, 'p': 0.020399958228905595, 'f': 0.024660087270561862}, 'rouge-l': {'r': 0.19347662141779787, 'p': 0.12644674085850555, 'f': 0.15075075369965824}}
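The nested dictionary is easier to read as a table. A small sketch that reuses the scores printed above, rounded to four decimals (the column names `recall`, `precision`, and `f1` are labels chosen here for readability):

```python
import pandas as pd

# Scores copied from the output above, rounded to four decimals
rouge_scores = {
    "rouge-1": {"r": 0.2137, "p": 0.1413, "f": 0.1679},
    "rouge-2": {"r": 0.0327, "p": 0.0204, "f": 0.0247},
    "rouge-l": {"r": 0.1935, "p": 0.1264, "f": 0.1508},
}

# One row per metric, one column per statistic
scores_table = pd.DataFrame(rouge_scores).T.rename(
    columns={"r": "recall", "p": "precision", "f": "f1"}
)
print(scores_table)
```

Read this way, a ROUGE-1 recall of about 0.21 means that roughly a fifth of the unigrams in the reference summaries also appear in the compared summaries, while the much lower ROUGE-2 values show that few word pairs are shared.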
