# EDA

## **1. Setup**

### 1.1 Import Dependencies

In [4]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

### 1.2 Set visualization style

In [5]:
sns.set(style="whitegrid")
plt.rcParams['figure.figsize'] = (10, 6)

## **2. Data Loading**

In [6]:
train_df = pd.read_csv(r'..\..\Data\raw\cnn_dailymail\train.csv')
test_df = pd.read_csv(r'..\..\Data\raw\cnn_dailymail\test.csv')
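
The raw-string paths above use backslashes as separators, so they resolve only on Windows. A minimal sketch of a portable alternative, assuming the same directory layout; `DATA_DIR` and `csv_path` are hypothetical names that do not appear in the notebook:

```python
from pathlib import Path

# Hypothetical constant: the same relative folder as in the cell above,
# assembled from parts so the separator matches the current OS.
DATA_DIR = Path("..") / ".." / "Data" / "raw" / "cnn_dailymail"


def csv_path(split: str) -> Path:
    """Return the CSV path for a split name such as 'train' or 'test'."""
    return DATA_DIR / f"{split}.csv"


print(csv_path("train"))
print(csv_path("test"))
```

`pd.read_csv` accepts `Path` objects, so `pd.read_csv(csv_path("train"))` could stand in for the hard-coded string.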

In [7]:
train_df.shape

287113 * 0.001  # Row count of a 0.1% sample of the training data

287.113

In [8]:
train = train_df.sample(frac=0.05, random_state=42)
train.shape

(14356, 3)
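
The sample size above follows from the fraction: 5% of 287,113 rows rounds to 14,356. The sketch below illustrates the same behaviour on a toy frame (the data is made up for illustration) and shows that a fixed `random_state` makes the draw repeatable:

```python
import pandas as pd

# Toy stand-in for train_df; the values are illustrative only.
toy = pd.DataFrame({"article": [f"article {i}" for i in range(1000)]})

# Two draws with the same seed select the same rows.
sample_a = toy.sample(frac=0.05, random_state=42)
sample_b = toy.sample(frac=0.05, random_state=42)

print(sample_a.shape)                          # (50, 1)
print(sample_a.index.equals(sample_b.index))   # True

# The same arithmetic for the real training set size.
print(round(287113 * 0.05))                    # 14356
```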

In [17]:
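# Target record structure, for reference:
# [{"id": i, "text": text} for i, text in enumerate(_df[self.text_column])]

# Preview the first (article, highlights) pair of the sample, truncated to 50 characters each
for i, (article, highlights) in enumerate(zip(train["article"], train["highlights"])):
    print(f"id: {i} text: {article[:50]}... summary: {highlights[:50]}...")
    break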

id: 0 text: By . Mia De Graaf . Britons flocked to beaches acr... summary: People enjoyed temperatures of 17C at Brighton bea...


In [27]:
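# Keep only articles of at most 500 words, tried out on the first 1,000 rows.
# General form: text_length_words = train_df[text_column].str.split().str.len()

rr = train_df[:1000].copy()  # First 1,000 rows of the training data

text_length_words = rr["article"].str.split().str.len()  # Whitespace-delimited word count per article
tt = rr[text_length_words <= 500]  # Rows whose article has at most 500 words

tt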

# e = train_df["article"][:10000].str.split().str.len()

# train_df = train_df[e <= 500]

|  | id | article | highlights |
| --- | --- | --- | --- |
| 0 | 0001d1afc246a7964130f43ae940af6bc6c57f01 | By . Associated Press . PUBLISHED: . 14:11 EST... | Bishop John Folda, of North Dakota, is taking ... |
| 1 | 0002095e55fcbd3a2f366d9bf92a95433dc305ef | (CNN) -- Ralph Mata was an internal affairs li... | Criminal complaint: Cop used his role to help ... |
| 5 | 0004306354494f090ee2d7bc5ddbf80b63e80de6 | He's been accused of making many a fashion fau... | Prime Minister and his family are enjoying an ... |
| 13 | 000cd1ee0098c4d510a03ddc97d11764448ebac2 | Louis van Gaal said he had no option but to su... | Manchester United beat Southampton 2-1 at St M... |
| 15 | 001097a19e2c96de11276b3cce11566ccfed0030 | For most people, it has become a travel essent... | Half of Brits admit to checking work e-mails w... |
| ... | ... | ... | ... |
| 988 | 02c600858dcc92bf6b460ad67098f97e1c594f8f | By . Steve Nolan . PUBLISHED: . 00:59 EST, 7 O... | Foreign patients will have to prove they are l... |
| 994 | 02c971cf94ad3b1696742544778f06cf8a2b1c23 | The owners of a $4million Cincinnati mansion t... | Jeffrey Decker and wife Maria claim their insu... |
| 996 | 02ce5810b37842c00ae90b6c7b70dbf686cd865f | By . Leon Watson and Sebastian Lander . PUBLIS... | Figures released by ABTA show Britons took few... |
| 998 | 02d123388fbdf6da1466253313fe6641595c291c | By . Rob Cooper . Last updated at 5:05 PM on 2... | High-speed bed is fitted with a V8 600bhp engi... |
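
The filter above counts whitespace-delimited words per article and keeps the rows at or below a threshold. A minimal sketch of the same technique on a toy frame, where the data and the small `MAX_WORDS` value are assumptions chosen to keep the example readable (the notebook uses 500):

```python
import pandas as pd

# Toy data standing in for the article/highlights columns.
toy = pd.DataFrame(
    {
        "article": ["one two three", "a b c d e f", "short"],
        "highlights": ["h1", "h2", "h3"],
    }
)

MAX_WORDS = 5  # Hypothetical threshold for this toy example

# str.split() with no arguments splits on any run of whitespace;
# str.len() then returns the number of tokens in each list.
word_counts = toy["article"].str.split().str.len()

# Boolean mask keeps only the rows within the word limit.
filtered = toy[word_counts <= MAX_WORDS]

print(word_counts.tolist())     # [3, 6, 1]
print(filtered.index.tolist())  # [0, 2]
```

The original index labels survive the filter, which is why the output above skips from 1 to 5 and so on; `reset_index(drop=True)` would renumber the rows if that is preferred.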


## **3. Basic Exploration**

### 3.1 Data info

In [4]:
train_df.head() # First 5 rows of the training data

|  | id | article | highlights |
| --- | --- | --- | --- |
| 0 | 0001d1afc246a7964130f43ae940af6bc6c57f01 | By . Associated Press . PUBLISHED: . 14:11 EST... | Bishop John Folda, of North Dakota, is taking ... |
| 1 | 0002095e55fcbd3a2f366d9bf92a95433dc305ef | (CNN) -- Ralph Mata was an internal affairs li... | Criminal complaint: Cop used his role to help ... |
| 2 | 00027e965c8264c35cc1bc55556db388da82b07f | A drunk driver who killed a young woman in a h... | Craig Eccleston-Todd, 27, had drunk at least t... |
| 3 | 0002c17436637c4fe1837c935c04de47adb18e9a | (CNN) -- With a breezy sweep of his pen Presid... | Nina dos Santos says Europe must be ready to a... |
| 4 | 0003ad6ef0c37534f80b55b4235108024b407f0b | Fleetwood are the only team still to have a 10... | Fleetwood top of League One after 2-0 win at S... |


In [5]:
train_df.shape, test_df.shape

((287113, 3), (11490, 3))

- Train data
    - Number of **rows**: 287113
    - Number of **columns**: 3
- Test data
    - Number of **rows**: 11490
    - Number of **columns**: 3

In [6]:
train_df.info() # Information about the training data

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 287113 entries, 0 to 287112
Data columns (total 3 columns):
 #   Column      Non-Null Count   Dtype 
---  ------      --------------   ----- 
 0   id          287113 non-null  object
 1   article     287113 non-null  object
 2   highlights  287113 non-null  object
dtypes: object(3)
memory usage: 6.6+ MB
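
The `+` in `6.6+ MB` signals a lower bound: for `object` columns, pandas by default counts only the pointers, not the strings they reference. A sketch of how the deep measurement differs, using a toy frame with made-up contents and an explicit `object` dtype to match the dtypes reported above:

```python
import pandas as pd

# Toy frame with long strings; dtype=object mirrors the dtypes shown by info().
toy = pd.DataFrame(
    {
        "id": ["a" * 40] * 100,
        "article": ["word " * 200] * 100,
    },
    dtype=object,
)

# Shallow: size of the object pointers (plus the index) only.
shallow_bytes = toy.memory_usage(deep=False).sum()

# Deep: also inspects each Python string to include its actual size.
deep_bytes = toy.memory_usage(deep=True).sum()

print(f"shallow: {shallow_bytes} bytes")
print(f"deep:    {deep_bytes} bytes")
```

On the real data, `train_df.info(memory_usage="deep")` would report the deep figure directly, at the cost of a slower call.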


In [7]:
train_df.columns

Index(['id', 'article', 'highlights'], dtype='object')

In [8]:
train_df.dtypes

id            object
article       object
highlights    object
dtype: object

### 3.2 Missing values

In [9]:
train_df.isna().sum() # Check for missing values in the training data

id            0
article       0
highlights    0
dtype: int64

In [10]:
test_df.isna().sum() # Check for missing values in the test data

id            0
article       0
highlights    0
dtype: int64

- Neither the train nor the test data contains missing values.
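
`isna()` flags only true nulls, so empty or whitespace-only strings would pass the check above. A minimal sketch of an additional blank-text check, shown on toy data (the rows below are invented for illustration):

```python
import pandas as pd

# Toy column with one real value, two blank strings and one null.
toy = pd.DataFrame(
    {
        "article": ["Some text", "", "   ", None],
        "highlights": ["a", "b", "c", "d"],
    }
)

# True nulls, as counted by the cells above.
is_missing = toy["article"].isna()

# Non-null values that are empty once surrounding whitespace is stripped.
is_blank = toy["article"].fillna("").str.strip().eq("") & toy["article"].notna()

print(int(is_missing.sum()))  # 1
print(int(is_blank.sum()))    # 2
```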

### 3.3 Duplicates

In [11]:
train_df.duplicated().sum() # Check for duplicate rows in the training data


np.int64(0)

In [12]:
test_df.duplicated().sum() # Check for duplicate rows in the test data

np.int64(0)

- Neither the train nor the test data contains fully duplicated rows (all three columns identical).
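
Called without arguments, `duplicated()` compares whole rows, so two rows with the same article but different `id` values are not flagged. A sketch of a column-level check using the `subset` parameter, on toy data invented for illustration:

```python
import pandas as pd

# Toy frame: rows 0 and 1 share the same article but have different ids.
toy = pd.DataFrame(
    {
        "id": ["a1", "b2", "c3"],
        "article": ["same text", "same text", "other text"],
        "highlights": ["h", "h", "k"],
    }
)

# Whole-row comparison: the differing ids hide the repeated article.
full_row_dupes = int(toy.duplicated().sum())

# Restricting the comparison to the article column exposes it.
article_dupes = int(toy.duplicated(subset=["article"]).sum())

print(full_row_dupes)  # 0
print(article_dupes)   # 1
```

Whether the real dataset contains repeated articles is not established here; the same call on `train_df` would answer that.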

## **4. Explore Text**

In [None]:
train_df.head(3) # First 3 rows of the training data

|  | id | article | highlights |
| --- | --- | --- | --- |
| 0 | 0001d1afc246a7964130f43ae940af6bc6c57f01 | By . Associated Press . PUBLISHED: . 14:11 EST... | Bishop John Folda, of North Dakota, is taking ... |
| 1 | 0002095e55fcbd3a2f366d9bf92a95433dc305ef | (CNN) -- Ralph Mata was an internal affairs li... | Criminal complaint: Cop used his role to help ... |
| 2 | 00027e965c8264c35cc1bc55556db388da82b07f | A drunk driver who killed a young woman in a h... | Craig Eccleston-Todd, 27, had drunk at least t... |


In [13]:
print(train_df["article"][0])

By . Associated Press . PUBLISHED: . 14:11 EST, 25 October 2013 . | . UPDATED: . 15:36 EST, 25 October 2013 . The bishop of the Fargo Catholic Diocese in North Dakota has exposed potentially hundreds of church members in Fargo, Grand Forks and Jamestown to the hepatitis A virus in late September and early October. The state Health Department has issued an advisory of exposure for anyone who attended five churches and took communion. Bishop John Folda (pictured) of the Fargo Catholic Diocese in North Dakota has exposed potentially hundreds of church members in Fargo, Grand Forks and Jamestown to the hepatitis A . State Immunization Program Manager Molly Howell says the risk is low, but officials feel it's important to alert people to the possible exposure. The diocese announced on Monday that Bishop John Folda is taking time off after being diagnosed with hepatitis A. The diocese says he contracted the infection through contaminated food while attending a conference for newly ordained b

In [14]:
print(train_df["highlights"][0])

Bishop John Folda, of North Dakota, is taking time off after being diagnosed .
He contracted the infection through contaminated food in Italy .
Church members in Fargo, Grand Forks and Jamestown could have been exposed .


In [17]:
train_df["article"][8]

"There are a number of job descriptions waiting for Darren Fletcher when he settles in at West Brom but the one he might not have expected is Saido Berahino’s nanny. Fletcher’s unveiling as the deadline day signing from Manchester United was almost eclipsed by the 21-year-old striker, who is acquiring the habit of talking himself into trouble. Ten years Berahino’s senior, Fletcher will be expected to mentor a player who told the world this week that he wanted to play for a bigger club. Tony Pulis has advised Saido Berahino to focus on his performances at West Brom . Darren Fletcher has signed for the baggies where he will be asked to provide a role model for young players . That is off the pitch. On it, the Scotland midfielder wants to prove he is good enough to cut the mustard in the Premier League after finding starts harder and harder to come by at Old Trafford. Head coach Tony Pulis believes that Fletcher, who has agreed a three-and-a-half year contract, will be captain of Albion o

In [18]:
train_df["highlights"][8]

'Tony Pulis believes Saido Berahino should look up to Darren Fletcher .\nPulis insists Berahino has been listened to the wrong advice .\nBerahino said he wants to move on to bigger things earlier in the week .\nREAD: Berahino available for £20m after Liverpool target angers club .\nCLICK HERE for all the latest West Brom news .'
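
The highlight above is a newline-separated list of sentences, and its last two lines are site promotions rather than summary content. A sketch of how such a string could be split and screened; the `PROMO_PREFIXES` tuple is an assumption derived from this single example, not a verified list for the whole dataset:

```python
# The highlights string printed above, reproduced verbatim.
highlights = (
    "Tony Pulis believes Saido Berahino should look up to Darren Fletcher .\n"
    "Pulis insists Berahino has been listened to the wrong advice .\n"
    "Berahino said he wants to move on to bigger things earlier in the week .\n"
    "READ: Berahino available for £20m after Liverpool target angers club .\n"
    "CLICK HERE for all the latest West Brom news ."
)

# Hypothetical prefixes marking promotional lines (taken from this one sample).
PROMO_PREFIXES = ("READ:", "CLICK HERE")

# One entry per non-empty line.
lines = [line.strip() for line in highlights.split("\n") if line.strip()]

# Keep the lines that do not start with a promotional prefix.
summary_lines = [line for line in lines if not line.startswith(PROMO_PREFIXES)]

print(len(lines))          # 5
print(len(summary_lines))  # 3
for line in summary_lines:
    print("-", line)
```

Counting how many rows in `train_df["highlights"]` contain such prefixes would show whether this cleanup is worth applying before training.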