We will use pandas to clean the CSV file. Its DataFrame structure makes it easy to clean and manipulate large tabular data sets.

In [1]:
import pandas as pd

Load the data.

In [2]:
articles = pd.read_csv('articles1.csv')
articles.head(2)

Unnamed: 0.1,Unnamed: 0,id,title,publication,author,date,year,month,url,content
0,0,17283,House Republicans Fret About Winning Their Hea...,New York Times,Carl Hulse,2016-12-31,2016.0,12.0,,WASHINGTON — Congressional Republicans have...
1,1,17284,Rift Between Officers and Residents as Killing...,New York Times,Benjamin Mueller and Al Baker,2017-06-19,2017.0,6.0,,"After the bullet shells get counted, the blood..."


Take a look at the dataset as loaded. The 'Unnamed: 0' column contains the row count that was stored in the original data set. Our data frame already has its own row index, so the 'Unnamed: 0' column is redundant and we can drop it.

The id column is the unique identifier assigned to each news article by the original creator of this data set. For our purposes, the row number is a sufficient unique identifier, so we will drop this column as well.

In [25]:
articles = articles.drop(["Unnamed: 0", "id"], axis=1)

The next cleaning step is to check for null values. We do not want to confuse the language model with missing information, so we need to make sure each entry is complete and valid.

In [14]:
# Select the rows that contain at least one null value
rows_with_nulls = articles[articles.isnull().any(axis=1)]

# Print the rows with null values
print(rows_with_nulls)

       Unnamed: 0     id                                              title  \
65             65  17360                     My Canada - The New York Times   
66             66  17361  How We Put Together Our 52 Places to Go List -...   
200           200  17510  The Best and Worst of the Golden Globes - The ...   
260           260  17577  President Obama’s Farewell Address: Full Video...   
367           367  17699  Transcript: President Obama on What Books Mean...   
...           ...    ...                                                ...   
49801       53091  73166  NFL POWER RANKINGS: Where all 32 teams stand h...   
49807       53097  73175  Iran negotiated to pay only half price for its...   
49809       53099  73177  Trump is now doing many of the things he criti...   
49819       53110  73191  Republicans have a massive plan to overhaul th...   
49825       53116  73202    ’Star Wars’ actress Carrie Fisher is dead at 60   

            publication author        date    year 
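The row listing above is truncated, so it can be hard to tell which columns are responsible for the nulls. A per-column count may be easier to read. The following is a minimal sketch on a made-up toy frame (the `toy` data is an assumption for illustration, not the real articles table):

```python
import pandas as pd

# Toy stand-in for the articles table; the values are made up for illustration.
toy = pd.DataFrame({
    "title": ["A", "B", "C"],
    "author": ["X", None, "Z"],
    "url": [None, None, None],
})

# Number of missing values in each column.
null_counts = toy.isnull().sum()

# Names of the columns in which every value is missing.
all_null_columns = toy.columns[toy.isnull().all()].tolist()

print(null_counts)
print(all_null_columns)
```

Running the same two expressions on `articles` would show at a glance whether a column is empty or only partly missing.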

Notice that the url column contains no values. The language model is also unable to access links directly, so for our purposes we will delete the url column entirely.

In [13]:
articles = articles.drop('url', axis=1)
articles

Let's check for rows with null values again.

In [None]:
# Select the rows that contain at least one null value
rows_with_nulls = articles[articles.isnull().any(axis=1)]

# Print the rows with null values
print(rows_with_nulls)

In [20]:
articles.shape

(50000, 9)

There are some rows with null values, but the vast majority of our rows are complete. Let's remove all of the incomplete rows from our data set.

In [21]:
cleaned_articles = articles.dropna()
print(cleaned_articles.shape)
cleaned_articles

(43694, 9)


Unnamed: 0.1,Unnamed: 0,id,title,publication,author,date,year,month,content
0,0,17283,House Republicans Fret About Winning Their Hea...,New York Times,Carl Hulse,2016-12-31,2016.0,12.0,WASHINGTON — Congressional Republicans have...
1,1,17284,Rift Between Officers and Residents as Killing...,New York Times,Benjamin Mueller and Al Baker,2017-06-19,2017.0,6.0,"After the bullet shells get counted, the blood..."
2,2,17285,"Tyrus Wong, ‘Bambi’ Artist Thwarted by Racial ...",New York Times,Margalit Fox,2017-01-06,2017.0,1.0,"When Walt Disney’s “Bambi” opened in 1942, cri..."
3,3,17286,"Among Deaths in 2016, a Heavy Toll in Pop Musi...",New York Times,William McDonald,2017-04-10,2017.0,4.0,"Death may be the great equalizer, but it isn’t..."
4,4,17287,Kim Jong-un Says North Korea Is Preparing to T...,New York Times,Choe Sang-Hun,2017-01-02,2017.0,1.0,"SEOUL, South Korea — North Korea’s leader, ..."
...,...,...,...,...,...,...,...,...,...
49995,53287,73465,"Rex Tillerson Says Climate Change Is Real, but …",Atlantic,Robinson Meyer,2017-01-11,2017.0,1.0,"As chairman and CEO of ExxonMobil, Rex Tillers..."
49996,53288,73466,The Biggest Intelligence Questions Raised by t...,Atlantic,Amy Zegart,2017-01-11,2017.0,1.0,I’ve spent nearly 20 years looking at intellig...
49997,53289,73467,Trump Announces Plan That Does Little to Resol...,Atlantic,Jeremy Venook,2017-01-11,2017.0,1.0,Donald Trump will not be taking necessary st...
49998,53290,73468,Dozens of For-Profit Colleges Could Soon Close,Atlantic,Emily DeRuy,2017-01-11,2017.0,1.0,Dozens of colleges could be forced to close ...


Okay. This looks good for our purposes. There are no null values. We will tokenize and lemmatize the contents in our AI files, so we don't need to fix anything else programmatically here.
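The "no null values" claim can be checked programmatically rather than by eye. Here is a minimal sketch on a made-up toy frame (`toy`, `rows_dropped`, and `has_nulls` are illustrative names, not part of the notebook):

```python
import pandas as pd

# Toy stand-in with two incomplete rows; the values are made up.
toy = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "author": ["X", None, "Z", "W"],
    "content": ["text a", "text b", None, "text d"],
})

# Drop every row that has at least one missing value.
cleaned = toy.dropna()

# How many rows were removed, and whether any nulls remain.
rows_dropped = len(toy) - len(cleaned)
has_nulls = bool(cleaned.isnull().values.any())

print(rows_dropped, has_nulls)
print(cleaned.index.tolist())
```

Note that `dropna()` keeps the original row labels. That is why the cleaned table above still shows labels up to 49998 even though it has 43,694 rows.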

*Things to highlight on the cleaned data set:*

1. Data Structure: The dataset maintains a structured format with clear columns for various attributes of each news article. This makes it easier to extract relevant information during fine-tuning.

2. Text Content: The "content" field contains the main text of the news articles. This is the text that the model will learn from and generate summaries for.

3. Metadata: The additional fields such as "title," "publication," "author," "date," "year," and "month" provide useful context about each article. This context can be leveraged during training or evaluation.

In [29]:
cleaned_articles.to_csv('cleaned_articles.csv')
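By default, `to_csv` writes the row index as an unnamed first column, and that column comes back as 'Unnamed: 0' when the file is read again. If that extra column is not wanted, passing `index=False` is one option. A minimal sketch using an in-memory buffer instead of a real file (the toy data is made up):

```python
import io
import pandas as pd

toy = pd.DataFrame({"title": ["A", "B"], "content": ["text a", "text b"]})

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
toy.to_csv(with_index)
with_index.seek(0)
reloaded_with_index = pd.read_csv(with_index)

# index=False leaves the index out of the file.
without_index = io.StringIO()
toy.to_csv(without_index, index=False)
without_index.seek(0)
reloaded_without_index = pd.read_csv(without_index)

print(reloaded_with_index.columns.tolist())
print(reloaded_without_index.columns.tolist())
```

Leaving the index out discards the original row labels, so this only fits when those labels are not needed later.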

When we are ready to split the data into training and validation sets, the code below does it. First, a few things to keep in mind:

1. Understand the Dataset Size: Take note of the total number of samples in the dataset. After cleaning, ours contains 43,694 rows.

2. Determine Split Ratio: Decide on a split ratio that divides the dataset into training and validation sets. A common split is 80% for training and 20% for validation, but this ratio can be adjusted based on dataset size and needs.

3. Calculate Split Sizes: Calculate the number of samples that will go into each split. With an 80-20 split and 43,694 samples:
    - Training Set Size: 80% * 43,694 = 34,955 samples (rounded down)
    - Validation Set Size: 43,694 - 34,955 = 8,739 samples

4. Shuffle the Dataset: Shuffle the rows of the dataset randomly. This helps ensure that the training and validation sets have a representative distribution of data.

5. Create Splits: Divide the shuffled dataset into training and validation sets based on the calculated split sizes. The first 34,955 shuffled rows go into the training set, and the remaining 8,739 rows go into the validation set.

6. Save the Splits: Save the training and validation sets as separate CSV files so they can be loaded again later.
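As a worked example of the split-size arithmetic, the sketch below applies an 80-20 split to the 43,694 rows reported by `cleaned_articles.shape` earlier. The helper name `split_sizes` is hypothetical and only used for illustration:

```python
def split_sizes(total_samples, split_ratio=0.8):
    """Return (train_size, val_size) for a given total and training ratio."""
    # int() truncates, so any fractional sample goes to the validation set.
    train_size = int(total_samples * split_ratio)
    val_size = total_samples - train_size
    return train_size, val_size


train_size, val_size = split_sizes(43694)
print(train_size, val_size)  # 34955 8739
```

Computing the validation size by subtraction guarantees that the two sizes always add up to the total.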

In [35]:
data = pd.read_csv("cleaned_articles.csv")

# Split ratio
split_ratio = 0.8  # 80% for training, 20% for validation

# Calculate split sizes
total_samples = len(data)
train_size = int(total_samples * split_ratio)
val_size = total_samples - train_size

# Shuffle the dataset
data_shuffled = data.sample(frac=1, random_state=42)

# Create training and validation splits
train_set = data_shuffled[:train_size]
val_set = data_shuffled[train_size:]
train_set

Unnamed: 0.1,Unnamed: 0,title,publication,author,date,year,month,content
19401,19437,"Migrant Sex Criminal: I Hate Sweden, I’m Just ...",Breitbart,Virginia Hale,2016-07-28,2016.0,7.0,The victim of a man who allegedly sexually har...
33632,34961,Washington Post: Kushner proposed secret commu...,CNN,Saba Hamedy,2017-05-27,2017.0,5.0,Washington (CNN) President Donald Trump’s a...
42152,48075,Why Simone Biles writes what looks like a doll...,Business Insider,Scott Davis,2016-08-12,2016.0,8.0,’ ’ ’ Before hopping on the balance beam fo...
22732,22768,Feds Arrest Multiple Deportee Sex Offender in ...,Breitbart,Caroline May,2016-05-03,2016.0,5.0,Immigration and Customs and Enforcement office...
25391,25427,Putin’s Geography Lesson: ’Russia’s Borders Do...,Breitbart,John Hayward,2016-11-25,2016.0,11.0,Appearing at an awards ceremony for young stud...
...,...,...,...,...,...,...,...,...
1559,1568,Sean Spicer Meets the Press. No Cameras Allowe...,New York Times,Michael M. Grynbaum,2017-03-07,2017.0,3.0,For the country’s most prominent political spo...
13313,13349,Donald Trump: Healthcare Plan ‘Will End in a B...,Breitbart,Charlie Spiering,2017-03-09,2017.0,3.0,President Donald Trump again tweeted support f...
13528,13564,Maher: Trump’s Tax Plan Proves ’Mental Illness...,Breitbart,Ian Hanchett,2017-04-28,2017.0,4.0,"On Friday’s broadcast of HBO’s “Real Time,” Bi..."
25017,25053,Marco Rubio Defiant On ’Glitch’ Debate Video: ...,Breitbart,Charlie Spiering,2016-02-07,2016.0,2.0,Sen. Marco Rubio defiantly defended his Saturd...


In [36]:
val_set

Unnamed: 0.1,Unnamed: 0,title,publication,author,date,year,month,content
7928,7964,Professor Wants Children to Learn ’Queer Theor...,Breitbart,Tom Ciccotta,2017-04-26,2017.0,4.0,A professor at the University of Arizona who c...
2107,2117,Uber Releases Diversity Report and Repudiates ...,New York Times,Mike Isaac,2017-03-29,2017.0,3.0,SAN FRANCISCO — After a string of scandals ...
30363,30399,Sharyl Attkisson is Right: Mexican Cartels Hav...,Breitbart,Brandon Darby &amp Ildefonso Ortiz,2016-06-26,2016.0,6.0,Veteran journalist extraordinaire Sharyl Attki...
31262,31298,Davi: Donald Trump Can Revive Reagan’s ’Inform...,Breitbart,Robert Davi,2016-11-14,2016.0,11.0,"My Dear Fellow Americans. [If you’ll recall, r..."
36817,40320,"Zuckerberg, wife give $75M to San Francisco ho...",CNN,Greg Botelho,2015-02-07,2015.0,2.0,(CNN) San Francisco General Hospital will soo...
...,...,...,...,...,...,...,...,...
6265,6298,What It Takes to Open a Bookstore - The New Yo...,New York Times,Jonah Engel Bromwich,2017-04-12,2017.0,4.0,"For more than 20 years, small bookstores have ..."
11284,11320,Kremlin Makes Dubious Claim It Killed Islamic ...,Breitbart,Edwin Mora,2017-06-16,2017.0,6.0,Russia may have executed the ruthless and elus...
38158,42412,Apperson charged with attempted murder of Geor...,CNN,Tony Marco,2015-06-18,2015.0,6.0,(CNN) Matthew Apperson has been charged Thurs...
860,867,How Washington State Upended Trump’s Travel Ba...,New York Times,Alexander Burns,2017-02-05,2017.0,2.0,While President Trump’s travel ban threw Ameri...
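Before saving, it can be worth confirming that the two splits do not share any rows and together cover the whole dataset. The sketch below repeats the shuffle-and-slice approach on a made-up ten-row frame and checks both properties (the toy data and the names `overlap` and `covers_everything` are assumptions):

```python
import pandas as pd

# Toy stand-in for the cleaned articles; the values are made up.
data = pd.DataFrame({"content": [f"article {i}" for i in range(10)]})

split_ratio = 0.8
train_size = int(len(data) * split_ratio)

# Shuffle, then slice by position, as in the cell above.
data_shuffled = data.sample(frac=1, random_state=42)
train_set = data_shuffled[:train_size]
val_set = data_shuffled[train_size:]

# Row labels present in both splits (should be empty).
overlap = train_set.index.intersection(val_set.index)

# The two splits together should account for every row.
covers_everything = len(train_set) + len(val_set) == len(data)

print(len(overlap), covers_everything)
```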


And we can send them to CSV for easy retrieval in our program. Note that the validation set is saved under the name test_set.csv.

In [37]:
val_set.to_csv('test_set.csv')
train_set.to_csv('training_set.csv')

In [41]:
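# Label-based lookup: returns the article whose original index label is 0,
# which is not necessarily the first row of the shuffled training set
train_set.loc[0, 'content']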

'WASHINGTON  —   Congressional Republicans have a new fear when it comes to their    health care lawsuit against the Obama administration: They might win. The incoming Trump administration could choose to no longer defend the executive branch against the suit, which challenges the administration’s authority to spend billions of dollars on health insurance subsidies for   and   Americans, handing House Republicans a big victory on    issues. But a sudden loss of the disputed subsidies could conceivably cause the health care program to implode, leaving millions of people without access to health insurance before Republicans have prepared a replacement. That could lead to chaos in the insurance market and spur a political backlash just as Republicans gain full control of the government. To stave off that outcome, Republicans could find themselves in the awkward position of appropriating huge sums to temporarily prop up the Obama health care law, angering conservative voters who have been 
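The lookup above selects by index label, not by position. That is why it returns the very first article of the original file even though the training set is shuffled. It would raise a `KeyError` if label 0 had landed in the validation set instead. A minimal sketch of the difference on made-up data:

```python
import pandas as pd

# Toy frame with default labels 0..3; the values are made up.
data = pd.DataFrame({"content": ["first", "second", "third", "fourth"]})
shuffled = data.sample(frac=1, random_state=42)

# .loc selects by label: always the row that was labelled 0 before shuffling.
by_label = shuffled.loc[0, "content"]

# .iloc selects by position: whichever row the shuffle put first.
by_position = shuffled.iloc[0]["content"]

print(by_label)
print(by_position)
```

Use `.iloc` when the intent is "the first row of this split", and `.loc` when the intent is "the row with this original label".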

In [42]:
train_set.columns

Index(['Unnamed: 0', 'title', 'publication', 'author', 'date', 'year', 'month',
       'content'],
      dtype='object')

In [44]:
df = pd.read_csv('./cnn_dailymail/validation.csv')
df

Unnamed: 0,id,article,highlights
0,61df4979ac5fcc2b71be46ed6fe5a46ce7f071c3,"Sally Forrest, an actress-dancer who graced th...","Sally Forrest, an actress-dancer who graced th..."
1,21c0bd69b7e7df285c3d1b1cf56d4da925980a68,A middle-school teacher in China has inked hun...,Works include pictures of Presidential Palace ...
2,56f340189cd128194b2e7cb8c26bb900e3a848b4,A man convicted of killing the father and sist...,"Iftekhar Murtaza, 29, was convicted a year ago..."
3,00a665151b89a53e5a08a389df8334f4106494c2,Avid rugby fan Prince Harry could barely watch...,Prince Harry in attendance for England's crunc...
4,9f6fbd3c497c4d28879bebebea220884f03eb41a,A Triple M Radio producer has been inundated w...,Nick Slater's colleagues uploaded a picture to...
...,...,...,...
13363,e93f721ba4949f21f33549c4a21d55ff456af979,All shops will be allowed to offer ‘click and ...,Shops won't have to apply for planning permiss...
13364,8df19a570ad14119a7d00f3bbe864fedf8c1691d,Mo Farah has had his nationality called into q...,Mo Farah broke the European half-marathon reco...
13365,2fdd5f89aa26e91ceea9b0ef264abfcfc3e6fa2e,Wolves kept their promotion hopes alive with a...,Wolves are three points off the play-off place...
13366,530d7b18d7a715b368b0745f9dfebfe353adeda8,A Brown University graduate student has died ...,"Hyoun Ju Sohn, a 25-year-old doctoral student,..."


In [47]:
df['article'].str.len().max()

11412
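The value above is a character count, because `.str.len()` counts the characters in each string. If a rough word count is more useful, for example as a crude proxy for how long an article is for a model, splitting on whitespace is one simple option. This is a sketch on made-up strings, not a real tokenizer:

```python
import pandas as pd

# Toy articles; the text is made up for illustration.
toy = pd.DataFrame({
    "article": [
        "A short article.",
        "A somewhat longer article with more words.",
    ]
})

# Length in characters, as in the cell above.
char_lengths = toy["article"].str.len()

# Approximate length in words: split on whitespace and count the pieces.
word_counts = toy["article"].str.split().str.len()

print(char_lengths.tolist())  # [16, 42]
print(word_counts.tolist())   # [3, 7]
```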

In [2]:
test = pd.read_csv("test.csv")
test

Unnamed: 0,id,article,highlights
0,92c514c913c0bdfe25341af9fd72b29db544099b,Ever noticed how plane seats appear to be gett...,Experts question if packed out planes are put...
1,2003841c7dc0e7c5b1a248f9cd536d727f27a45a,A drunk teenage boy had to be rescued by secur...,Drunk teenage boy climbed into lion enclosure ...
2,91b7d2311527f5c2b63a65ca98d21d9c92485149,Dougie Freedman is on the verge of agreeing a ...,Nottingham Forest are close to extending Dougi...
3,caabf9cbdf96eb1410295a673e953d304391bfbb,Liverpool target Neto is also wanted by PSG an...,Fiorentina goalkeeper Neto has been linked wit...
4,3da746a7d9afcaa659088c8366ef6347fe6b53ea,Bruce Jenner will break his silence in a two-h...,"Tell-all interview with the reality TV star, 6..."
...,...,...,...
11485,ed8674cc15b29a87d8df8de1efee353d71122272,Our young Earth may have collided with a body ...,Oxford scientists say a Mercury-like body stru...
11486,2f58d1a99e9c47914e4b1c31613e3a041cd9011e,A man facing trial for helping his former love...,Man accused of helping former lover kill woman...
11487,411f6d57825161c3a037b4742baccd6cd227c0c3,A dozen or more metal implements are arranged ...,Marianne Power tried the tuning fork facial at...
11488,b5683ef8342056b17b068e0d59bdbe87e3fe44ea,Brook Lopez dominated twin brother Robin with ...,Brooklyn Nets beat the Portland Trail Blazers ...


In [4]:
def get_random_sample(dataframe, sample_size=500):
    """
    Get a random sample of rows from a DataFrame.

    Parameters:
        dataframe (pd.DataFrame): The DataFrame from which to draw the sample.
        sample_size (int): The number of rows in the sample (default is 500).

    Returns:
        pd.DataFrame: Random sample of rows from the input DataFrame.
    """
    # A fixed random_state makes the sample reproducible; change it to draw a different sample
    return dataframe.sample(n=sample_size, random_state=42)

# Get a random sample of 500 entries from the test DataFrame
random_sample_test = get_random_sample(test, sample_size=500)
random_sample_test

Unnamed: 0,id,article,highlights
1516,f00ae3c3929d829cd469ba4f229cc613b0766203,Comedian Jenny Eclair travelled with her other...,The comedian stayed with Flavours who offer a ...
1393,9e451f79499e5c784222b3f237c6ae4829849d79,A woman of Arab and Jewish descent who was str...,The federal government will give Shoshana Hebs...
10560,dae58055bd50598b93a230aa3a58e0d2f519b536,World No 1 Novak Djokovic has apologised to th...,Novak Djokovic beat Andy Murray 7-6 4-6 6-0 in...
11457,c05bda9b387ec8ae43803170b6f59b4b82505db9,(CNN)ISIS on Wednesday released more than 200 ...,Most of those released were women and children...
647,5c7493c6f28cfd58aa7b5f0e486e611307b4126d,Hillary Clinton’s security detail arrived at a...,"Second modified, armored van spotted near Des ..."
...,...,...,...
7845,cbbfa370f624ec2333bd9bc4c5d1e01a9fa2dbac,Northampton's maligned hit man is primed to go...,Courtney Lawes has been derided as a thug acro...
3463,e32243f9f00cdd0c3c82f11f666bcaa216d2bb8d,A Turkish Airlines flight from Milan to Istanb...,Turkish Airlines flight landed at Istanbul Ata...
10341,e559a6b1f44d11afe2de7a3fe217d0f5b36eb970,The future of Britain's independent nuclear de...,Tories say Russia's Vladimir Putin would be ha...
5821,bebd06c7c9c4fdc89581ccad26a91162c962bf8e,A high-flying Tory MP has been ordered to remo...,"Case about Zac Evans, killed in machete attack..."
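Two properties of this kind of sampling are worth knowing. A fixed `random_state` returns the same rows on every run, and `DataFrame.sample` raises a `ValueError` if more rows are requested than exist (when sampling without replacement). The sketch below shows a hypothetical variant, `get_capped_sample`, that caps the request at the frame's length. It is an illustration on made-up data, not the function used in this notebook:

```python
import pandas as pd


def get_capped_sample(dataframe, sample_size=500, random_state=42):
    """Hypothetical variant: never ask for more rows than the frame has."""
    n = min(sample_size, len(dataframe))
    return dataframe.sample(n=n, random_state=random_state)


# Toy frame with 20 rows; the values are made up.
toy = pd.DataFrame({"article": [f"article {i}" for i in range(20)]})

# Same random_state, same rows.
first = get_capped_sample(toy, sample_size=5)
second = get_capped_sample(toy, sample_size=5)

# Asking for 500 rows from a 20-row frame returns all 20 instead of failing.
capped = get_capped_sample(toy, sample_size=500)

print(first.equals(second), len(capped))
```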


In [6]:
train = pd.read_csv("train.csv")
train

Unnamed: 0,id,article,highlights
0,0001d1afc246a7964130f43ae940af6bc6c57f01,By . Associated Press . PUBLISHED: . 14:11 EST...,"Bishop John Folda, of North Dakota, is taking ..."
1,0002095e55fcbd3a2f366d9bf92a95433dc305ef,(CNN) -- Ralph Mata was an internal affairs li...,Criminal complaint: Cop used his role to help ...
2,00027e965c8264c35cc1bc55556db388da82b07f,A drunk driver who killed a young woman in a h...,"Craig Eccleston-Todd, 27, had drunk at least t..."
3,0002c17436637c4fe1837c935c04de47adb18e9a,(CNN) -- With a breezy sweep of his pen Presid...,Nina dos Santos says Europe must be ready to a...
4,0003ad6ef0c37534f80b55b4235108024b407f0b,Fleetwood are the only team still to have a 10...,Fleetwood top of League One after 2-0 win at S...
...,...,...,...
287108,fffdfb56fdf1a12d364562cc2b9b1d4de7481dee,By . James Rush . Former first daughter Chelse...,Chelsea Clinton said question of running for o...
287109,fffeecb8690b85de8c3faed80adbc7a978f9ae2a,An apologetic Vanilla Ice has given his first ...,"Vanilla Ice, 47 - real name Robert Van Winkle ..."
287110,ffff5231e4c71544bc6c97015cdb16c60e42b3f4,America's most lethal sniper claimed he wished...,America's most lethal sniper made comment in i...
287111,ffff924b14a8d82058b6c1c5368ff1113c1632af,"By . Sara Malm . PUBLISHED: . 12:19 EST, 8 Mar...",A swarm of more than one million has crossed b...


In [7]:
random_sample_training = get_random_sample(train, sample_size=2000)
random_sample_training

Unnamed: 0,id,article,highlights
272581,ed0fed726929c1eeabe6c390e47128dbb7d7a055,By . Mia De Graaf . Britons flocked to beaches...,People enjoyed temperatures of 17C at Brighton...
772,023cd84001b33aed4ff0f3f5ecb0fdd2151cf543,A couple who weighed a combined 32st were sham...,Couple started piling on pounds after the birt...
171868,6a70a0d8d3ed365fe1df6d35f1587a8b9b298618,Video footage shows the heart stopping moment ...,A 17-year-old boy suffering lacerations to his...
63167,b37204c13ea38b511265e41ac69fb12acfb63f85,"Istanbul, Turkey (CNN) -- About 250 people rac...",Syrians citizens hightail it to Turkey .\nMost...
68522,c24e5805afd5145bc48410e876db91d44a06be5e,By . Daily Mail Reporter . PUBLISHED: . 12:53 ...,The Xue Long had provided the helicopter that ...
...,...,...,...
51971,93302a6eb3612d76b9a344b5b9da71df9af2613a,By . Nicola Harley . A legal firm that made mo...,Insult came in a training manual Raleys Solici...
169533,676526749ee6c87e2fabd9558d06a2bcc31dc8ea,The average interest rate on an easy-access IS...,It is the lowest average rate since MoneyFacts...
109664,195c7db04d3745352471544e3beadf5805ae3f1f,(CNN) -- There are plenty of reasons to fall i...,Staircases are important elements in home deco...
118100,248087aad653122712d059e0d14eae65dcf346e7,"By . Steve Robson . PUBLISHED: . 01:29 EST, 25...",Victor Ponta says he is 'rather perplexed' by ...


In [8]:
random_sample_test.to_csv('sample_test.csv')
random_sample_training.to_csv('sample_training.csv')