Connecting Google Drive (GDrive) with Colab

In [1]:
from google.colab import drive
drive.mount("/content/gdrive")

Drive already mounted at /content/gdrive; to attempt to forcibly remount, call drive.mount("/content/gdrive", force_remount=True).


Importing all the necessary libraries into the notebook

In [2]:
import pandas as pd
import numpy as np
import re

import string  
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('punkt')
from nltk.corpus import stopwords 
from nltk.stem import WordNetLemmatizer

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!



## Dataset-Summary
The CNN / DailyMail Dataset is an English-language dataset containing just over 300k unique news articles as written by journalists at CNN and the Daily Mail. The current version supports both extractive and abstractive summarization, though the original version was created for machine reading and comprehension and abstractive question answering.

### Data-Fields
id: a string containing the hexadecimal-formatted SHA1 hash of the URL from which the story was retrieved

article: a string containing the body of the news article

highlights: a string containing the highlight of the article as written by the article author
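
As a rough illustration of the `id` field, the sketch below hashes a made-up URL with SHA-1 and checks that the result is a 40-character hexadecimal string. The URL and the helper name `story_id` are hypothetical, and the exact string that the dataset authors hashed is an assumption here.

```python
import hashlib
import string


def story_id(url: str) -> str:
    """Return the hexadecimal SHA-1 digest of a URL (hypothetical helper)."""
    return hashlib.sha1(url.encode("utf-8")).hexdigest()


example_id = story_id("https://example.com/some-story")  # placeholder URL
print(len(example_id))                                    # 40
print(all(ch in string.hexdigits for ch in example_id))   # True
```

A check like this can be handy for spotting malformed ids before they are used as keys.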

### Importing the dataset

In [3]:
train = pd.read_csv("/content/gdrive/MyDrive/Capstone Project /train.csv")
test = pd.read_csv("/content/gdrive/MyDrive/Capstone Project /test.csv")
val = pd.read_csv("/content/gdrive/MyDrive/Capstone Project /validation.csv")


Combining the three datasets into one in order to proceed with data cleaning

In [4]:
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 287113 entries, 0 to 287112
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   id          287113 non-null  object
 1   article     287113 non-null  object
 2   highlights  287113 non-null  object
dtypes: object(3)
memory usage: 6.6+ MB


In [5]:
test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11490 entries, 0 to 11489
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   id          11490 non-null  object
 1   article     11490 non-null  object
 2   highlights  11490 non-null  object
dtypes: object(3)
memory usage: 269.4+ KB


In [6]:
val.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13368 entries, 0 to 13367
Data columns (total 3 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   id          13368 non-null  object
 1   article     13368 non-null  object
 2   highlights  13368 non-null  object
dtypes: object(3)
memory usage: 313.4+ KB


In [7]:
df = pd.concat([train,test,val],ignore_index=True)

In [8]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 311971 entries, 0 to 311970
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   id          311971 non-null  object
 1   article     311971 non-null  object
 2   highlights  311971 non-null  object
dtypes: object(3)
memory usage: 7.1+ MB


In [9]:
# Deleting the individual splits to free memory
del train, test, val

## Data Cleaning:

In [10]:
# Checking for null value
df.isna().sum()

id            0
article       0
highlights    0
dtype: int64

In [11]:
# Checking for duplicates
df.duplicated(subset= ['article', 'highlights']).sum()

3101

In [12]:
# Dropping the duplicate values
df = df.drop_duplicates(subset= ['article', 'highlights'])
df.shape

(308870, 3)
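
To make the duplicate check concrete, here is a minimal sketch on a made-up three-row frame (the rows and the names `toy`, `pair`, `all_copies` are invented for illustration). It shows that `duplicated` counts only the repeats after the first copy, while `keep=False` selects every copy for inspection.

```python
import pandas as pd

# Toy data: rows 0 and 1 share the same article/highlights pair
toy = pd.DataFrame({
    "id": ["a1", "b2", "c3"],
    "article": ["same story", "same story", "other story"],
    "highlights": ["same summary", "same summary", "other summary"],
})

pair = ["article", "highlights"]
n_dupes = toy.duplicated(subset=pair).sum()                 # repeats after the first copy
all_copies = toy[toy.duplicated(subset=pair, keep=False)]   # every copy, for inspection
deduped = toy.drop_duplicates(subset=pair)                  # keeps the first occurrence

print(n_dupes)          # 1
print(len(all_copies))  # 2
print(deduped.shape)    # (2, 3)
```

Note that the two duplicate rows carry different ids, which is why the comparison is restricted to the text columns.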

In [13]:
# Checking that all the ids are unique
df.id.nunique()

308870

In [14]:
#Dropping the ids
df = df.drop('id', axis= 1)
df.head(3)

|   | article | highlights |
|---|---------|------------|
| 0 | By . Associated Press . PUBLISHED: . 14:11 EST... | Bishop John Folda, of North Dakota, is taking ... |
| 1 | (CNN) -- Ralph Mata was an internal affairs li... | Criminal complaint: Cop used his role to help ... |
| 2 | A drunk driver who killed a young woman in a h... | Craig Eccleston-Todd, 27, had drunk at least t... |


In [15]:
# Remove leading/trailing white space
df = df.apply(lambda x: x.str.strip())

# Remove URLs
df = df.apply(lambda x: x.str.replace(r'http\S+', '', regex=True))

# Remove HTML tags
df = df.apply(lambda x: x.str.replace(r'<.*?>', '', regex=True))

# Remove extra white space
df = df.apply(lambda x: x.str.replace(r'\s+', ' ', regex=True))

# Convert to lowercase
df = df.apply(lambda x: x.str.lower())
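
The same five steps can be tried on a single string to see what each one contributes. This is only a sketch: the function name `normalize` and the sample sentence are made up, and it uses the standard `re` module rather than the pandas `.str` accessor.

```python
import re


def normalize(text: str) -> str:
    """Apply the five cleaning steps, in the same order, to one string."""
    text = text.strip()                   # leading/trailing white space
    text = re.sub(r"http\S+", "", text)   # URLs
    text = re.sub(r"<.*?>", "", text)     # HTML tags
    text = re.sub(r"\s+", " ", text)      # runs of white space
    return text.lower()


sample = "  Visit <b>CNN</b> at http://example.com/story  now "
print(normalize(sample))  # visit cnn at now
```

Collapsing white space after the removals matters: deleting a URL or a tag can leave several spaces in a row, and the later step closes those gaps.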


In [16]:
# Print the cleaned dataset
df.head()

|   | article | highlights |
|---|---------|------------|
| 0 | by . associated press . published: . 14:11 est... | bishop john folda, of north dakota, is taking ... |
| 1 | (cnn) -- ralph mata was an internal affairs li... | criminal complaint: cop used his role to help ... |
| 2 | a drunk driver who killed a young woman in a h... | craig eccleston-todd, 27, had drunk at least t... |
| 3 | (cnn) -- with a breezy sweep of his pen presid... | nina dos santos says europe must be ready to a... |
| 4 | fleetwood are the only team still to have a 10... | fleetwood top of league one after 2-0 win at s... |


Preprocessing the textual content 

This step relies on the NLTK resources downloaded earlier (the stopword list, WordNet and the Punkt tokenizer) and works on the DataFrame df built above.

We define a function called clean_text that takes a single string argument and applies several cleaning steps. It removes punctuation, converts the text to lowercase, tokenizes the text using NLTK's word_tokenize function, removes English stopwords using NLTK's stopwords corpus, lemmatizes the remaining words using NLTK's WordNetLemmatizer, and finally joins the clean tokens back into text.

Finally, we apply the clean_text function to the 'article' column of the DataFrame using the apply method, and save the cleaned text in a new column called 'cleaned_text'.

Note that you may need to modify the cleaning function depending on the specific requirements of your data. For example, this function leaves numbers in place, so you may want to add a step that removes them.

In [17]:
# Build the stopword set and the lemmatizer once, not on every call
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

# Define a function to clean text
def clean_text(text):
    # Remove punctuation
    text = text.translate(str.maketrans('', '', string.punctuation))

    # Convert to lowercase
    text = text.lower()

    # Tokenize the text
    tokens = word_tokenize(text)

    # Remove stopwords and lemmatize the remaining words
    clean_tokens = [lemmatizer.lemmatize(token) for token in tokens if token not in stop_words]

    # Join the clean tokens back into text
    return ' '.join(clean_tokens)

In [18]:
df['cleaned_text'] = df['article'].apply(clean_text)
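
One side effect of deleting punctuation with `str.translate` is that hyphenated words and times are fused rather than split. The short sketch below reproduces that on two made-up strings; the helper name `strip_punctuation` is hypothetical.

```python
import string


def strip_punctuation(text: str) -> str:
    """Delete every ASCII punctuation character without inserting a space."""
    return text.translate(str.maketrans("", "", string.punctuation))


print(strip_punctuation("head-on crash at 14:11"))  # headon crash at 1411
print(strip_punctuation("eccleston-todd, 27"))      # ecclestontodd 27
```

If fused tokens are not wanted, one option is to map punctuation to spaces instead of deleting it; which behaviour is preferable depends on the downstream model.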

In [19]:
df.head()

|   | article | highlights | cleaned_text |
|---|---------|------------|--------------|
| 0 | by . associated press . published: . 14:11 est... | bishop john folda, of north dakota, is taking ... | associated press published 1411 est 25 october... |
| 1 | (cnn) -- ralph mata was an internal affairs li... | criminal complaint: cop used his role to help ... | cnn ralph mata internal affair lieutenant miam... |
| 2 | a drunk driver who killed a young woman in a h... | craig eccleston-todd, 27, had drunk at least t... | drunk driver killed young woman headon crash c... |
| 3 | (cnn) -- with a breezy sweep of his pen presid... | nina dos santos says europe must be ready to a... | cnn breezy sweep pen president vladimir putin ... |
| 4 | fleetwood are the only team still to have a 10... | fleetwood top of league one after 2-0 win at s... | fleetwood team still 100 record sky bet league... |


In [20]:
contraction_mapping = {"ain't": "is not", "aren't": "are not","can't": "cannot", "'cause": "because", "could've": "could have", "couldn't": "could not",

                           "didn't": "did not", "doesn't": "does not", "don't": "do not", "hadn't": "had not", "hasn't": "has not", "haven't": "have not",

                           "he'd": "he would","he'll": "he will", "he's": "he is", "how'd": "how did", "how'd'y": "how do you", "how'll": "how will", "how's": "how is",

                           "I'd": "I would", "I'd've": "I would have", "I'll": "I will", "I'll've": "I will have","I'm": "I am", "I've": "I have", "i'd": "i would",

                           "i'd've": "i would have", "i'll": "i will",  "i'll've": "i will have","i'm": "i am", "i've": "i have", "isn't": "is not", "it'd": "it would",

                           "it'd've": "it would have", "it'll": "it will", "it'll've": "it will have","it's": "it is", "let's": "let us", "ma'am": "madam",

                           "mayn't": "may not", "might've": "might have","mightn't": "might not","mightn't've": "might not have", "must've": "must have",

                           "mustn't": "must not", "mustn't've": "must not have", "needn't": "need not", "needn't've": "need not have","o'clock": "of the clock",

                           "oughtn't": "ought not", "oughtn't've": "ought not have", "shan't": "shall not", "sha'n't": "shall not", "shan't've": "shall not have",

                           "she'd": "she would", "she'd've": "she would have", "she'll": "she will", "she'll've": "she will have", "she's": "she is",

                           "should've": "should have", "shouldn't": "should not", "shouldn't've": "should not have", "so've": "so have","so's": "so as",

                           "this's": "this is","that'd": "that would", "that'd've": "that would have", "that's": "that is", "there'd": "there would",

                           "there'd've": "there would have", "there's": "there is", "here's": "here is","they'd": "they would", "they'd've": "they would have",

                           "they'll": "they will", "they'll've": "they will have", "they're": "they are", "they've": "they have", "to've": "to have",

                           "wasn't": "was not", "we'd": "we would", "we'd've": "we would have", "we'll": "we will", "we'll've": "we will have", "we're": "we are",

                           "we've": "we have", "weren't": "were not", "what'll": "what will", "what'll've": "what will have", "what're": "what are",

                           "what's": "what is", "what've": "what have", "when's": "when is", "when've": "when have", "where'd": "where did", "where's": "where is",

                           "where've": "where have", "who'll": "who will", "who'll've": "who will have", "who's": "who is", "who've": "who have",

                           "why's": "why is", "why've": "why have", "will've": "will have", "won't": "will not", "won't've": "will not have",

                           "would've": "would have", "wouldn't": "would not", "wouldn't've": "would not have", "y'all": "you all",

                           "y'all'd": "you all would","y'all'd've": "you all would have","y'all're": "you all are","y'all've": "you all have",

                           "you'd": "you would", "you'd've": "you would have", "you'll": "you will", "you'll've": "you will have",

                           "you're": "you are", "you've": "you have"}


In [21]:
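def cleaner_text(text):
    text = text.lower()
    # Expand contractions (only has an effect while apostrophes are still present)
    text = ' '.join(contraction_mapping.get(i, i) for i in text.split())
    # Remove each parenthesised span separately, not everything from the first "(" to the last ")"
    text = re.sub(r'\([^)]*\)', '', text)
    # Remove possessive 's and double quotes
    text = re.sub("'s", '', text)
    text = re.sub('"', '', text)
    # Keep only purely alphabetic tokens (drops numbers and mixed tokens)
    text = ' '.join(i for i in text.split() if i.isalpha())
    # Replace any remaining character outside a-z/A-Z (e.g. accented letters) with a space
    text = re.sub('[^a-zA-Z]', ' ', text)

    return text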
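
When stripping parenthesised text with a regular expression, the choice of quantifier matters. The sketch below compares a greedy pattern with one that stops at the first closing parenthesis, using a made-up sentence; the variable names are illustrative only.

```python
import re

text = "ralph mata (cnn) was arrested (miami) on tuesday"  # invented example

# Greedy: matches from the first "(" to the LAST ")"
greedy = re.sub(r"\(.*\)", "", text)

# Bounded: each "(...)" pair is matched on its own
bounded = re.sub(r"\([^)]*\)", "", text)

print(greedy)   # ralph mata  on tuesday
print(bounded)  # ralph mata  was arrested  on tuesday
```

With the greedy form, the words between two parenthesised spans are lost along with the parentheses.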

In [22]:
df['cleaned_text'] = df['cleaned_text'].apply(cleaner_text)
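
Because cleaner_text runs here on text that clean_text has already stripped of punctuation, the contraction lookup appears unlikely to find any matches: the apostrophes it relies on are gone. The sketch below illustrates the effect of ordering with a two-entry excerpt of a contraction table; `mapping`, `expand` and `strip_punctuation` are hypothetical names used only for this example.

```python
import string

# A two-entry excerpt of a contraction table, for illustration only
mapping = {"don't": "do not", "it's": "it is"}


def expand(text: str) -> str:
    """Replace each whitespace-separated token found in the mapping."""
    return " ".join(mapping.get(tok, tok) for tok in text.split())


def strip_punctuation(text: str) -> str:
    """Delete ASCII punctuation, including the apostrophe."""
    return text.translate(str.maketrans("", "", string.punctuation))


raw = "i don't think it's fair"
print(strip_punctuation(expand(raw)))  # i do not think it is fair
print(expand(strip_punctuation(raw)))  # i dont think its fair
```

If contraction expansion is meant to take effect, one option is to run it on the text before punctuation is removed.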

In [23]:
df['highlights'] = df['highlights'].apply(clean_text)

In [24]:
df['highlights'] = df['highlights'].apply(cleaner_text)

In [25]:
df.head()

|   | article | highlights | cleaned_text |
|---|---------|------------|--------------|
| 0 | by . associated press . published: . 14:11 est... | bishop john folda north dakota taking time dia... | associated press published est october updated... |
| 1 | (cnn) -- ralph mata was an internal affairs li... | criminal complaint cop used role help cocaine ... | cnn ralph mata internal affair lieutenant miam... |
| 2 | a drunk driver who killed a young woman in a h... | craig ecclestontodd drunk least three pint dri... | drunk driver killed young woman headon crash c... |
| 3 | (cnn) -- with a breezy sweep of his pen presid... | nina do santos say europe must ready accept sa... | cnn breezy sweep pen president vladimir putin ... |
| 4 | fleetwood are the only team still to have a 10... | fleetwood top league one win scunthorpe peterb... | fleetwood team still record sky bet league one... |


### Saving the cleaned DataFrame as a CSV file in Drive

In [26]:
df.to_csv('/content/gdrive/MyDrive/Capstone Project /cleaned.csv', index=False)
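
One thing to watch when the file is read back: if any row ended up as an empty string after cleaning, `pd.read_csv` turns it into NaN by default. The sketch below shows this with a made-up two-row frame written to an in-memory buffer instead of Drive; whether the real dataset contains such rows is not checked here.

```python
import io

import pandas as pd

# Toy frame in which one cleaned text is empty
toy = pd.DataFrame({
    "highlights": ["some summary", "other"],
    "cleaned_text": ["some text", ""],
})

buffer = io.StringIO()
toy.to_csv(buffer, index=False)

buffer.seek(0)
default_read = pd.read_csv(buffer)                         # empty field becomes NaN
buffer.seek(0)
literal_read = pd.read_csv(buffer, keep_default_na=False)  # empty field stays ""

print(default_read["cleaned_text"].isna().sum())    # 1
print((literal_read["cleaned_text"] == "").sum())   # 1
```

Downstream code that calls string methods on the reloaded column may therefore need either `keep_default_na=False` or an explicit `fillna("")`.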