# NTU OSS TGIFHacks: Data Scraping and Data Cleaning

This notebook covers the data cleaning section of the workshop. It is split into two parts: the first works with the text data that was mined, and the second works with the image data that was crawled.

For the purpose of teaching, the size of the dataset is kept small. Feel free to experiment with a larger dataset.

![](https://i.chzbgr.com/full/8120808448/h2E18CA37/clean-all-the-data)


# PART 1: Data Cleaning of News Headlines

Given that we have crawled news headlines and their metadata, we have to perform quite a few steps to clean the data. **Do note that the cleaning method can vary with the task.** The steps we are going to perform are as follows:

1. Clean the metadata -> Split the datetime, Convert the datetime, Get the author's name
2. Clean the headlines

    2.1 Tokenization + Remove punctuation
    
    2.2 Remove stop words
    
    2.3 Normalize the case of letters
    
    2.4 Stemming
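
As a rough illustration of steps 2.1 to 2.3, the sketch below uses only the standard library. The tiny `STOP_WORDS` set and the `basic_clean` helper are hypothetical stand-ins introduced for this example; the notebook itself relies on NLTK's English stop word list and adds Porter stemming (step 2.4) on top.

```python
import re
from typing import List

# Hypothetical, deliberately tiny stop word list for illustration only;
# the notebook itself uses NLTK's English stop words.
STOP_WORDS = {"a", "an", "the", "as", "of", "to", "in", "on", "and"}

def basic_clean(headline: str) -> List[str]:
    """Lowercase, strip punctuation, tokenize on whitespace, drop stop words."""
    lowered = headline.lower()                   # 2.3 normalize the case
    no_punct = re.sub(r"[^\w\s]", "", lowered)   # 2.1 remove punctuation
    tokens = no_punct.split()                    # 2.1 tokenize on whitespace
    return [t for t in tokens if t not in STOP_WORDS]  # 2.2 remove stop words

tokens = basic_clean("Oil Holds Gains As Large Gasoline Draw Offsets Crude Build")
print(tokens)
```

Stemming is left out here on purpose, since a faithful Porter stemmer is more than a few lines.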

In [85]:
import pandas as pd
import numpy as np
import re
from sklearn.model_selection import train_test_split

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer


[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/siddesh.suseela/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


In [12]:
rawDF = pd.read_csv('./newsheadlines/raw_dataset/news_raw.csv')
rawDF.head(10)

Unnamed: 0,title,meta
0,"<h2 class=""categoryArticle__title"">Oil Holds G...","<p class=""categoryArticle__meta"">Sep 22, 2020 ..."
1,"<h2 class=""categoryArticle__title"">Australia P...","<p class=""categoryArticle__meta"">Sep 22, 2020 ..."
2,"<h2 class=""categoryArticle__title"">Chinese Oil...","<p class=""categoryArticle__meta"">Sep 22, 2020 ..."
3,"<h2 class=""categoryArticle__title"">China Promi...","<p class=""categoryArticle__meta"">Sep 22, 2020 ..."
4,"<h2 class=""categoryArticle__title"">Pompeo: We ...","<p class=""categoryArticle__meta"">Sep 22, 2020 ..."
5,"<h2 class=""categoryArticle__title"">India Toppe...","<p class=""categoryArticle__meta"">Sep 21, 2020 ..."
6,"<h2 class=""categoryArticle__title"">Canada’s No...","<p class=""categoryArticle__meta"">Sep 21, 2020 ..."
7,"<h2 class=""categoryArticle__title"">Tesla's “ba...","<p class=""categoryArticle__meta"">Sep 21, 2020 ..."
8,"<h2 class=""categoryArticle__title"">CNOOC Begin...","<p class=""categoryArticle__meta"">Sep 21, 2020 ..."
9,"<h2 class=""categoryArticle__title"">Norway’s Oi...","<p class=""categoryArticle__meta"">Sep 21, 2020 ..."


In [24]:
# Declaring all the constants required
month2idx = {
    'jan':'01',
    'feb':'02',
    'mar':'03',
    'apr':'04',
    'may':'05',
    'jun':'06',
    'jul':'07',
    'aug':'08',
    'sep':'09',
    'oct':'10',
    'nov':'11',
    'dec':'12',
}

emoji_pattern = re.compile("["
                       u"\U0001F600-\U0001F64F"  # emoticons
                       u"\U0001F300-\U0001F5FF"  # symbols & pictographs
                       u"\U0001F680-\U0001F6FF"  # transport & map symbols
                       u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
                       u"\U00002702-\U000027B0"
                       u"\U000024C2-\U0001F251"
                       "]+", flags=re.UNICODE)

# ================ UTILITY FUNCTIONS =================

def cleanHTML(x:str) -> str:
    '''Return a string after removing the html tags'''
    rawText = re.sub(r'<.*?>', '', x) # Removes the html tags    
    return rawText

def cleanURLS(x:str) -> str:
    '''Return a string after removing the URLs'''
    rawText = re.sub(r'https?:\/\/\S+|www\.\S+', '', x) # Removes URLs
    return rawText

def cleanSpecialCharacters(x:str) -> str:
    '''Return a string after removing the special characters'''
    rawText = re.sub(r'[^\w\d\s]', '', x) # Removes special characters
    return rawText    

def cleanEmjois(x:str)-> str:
    '''Return a string after removing the Emojis'''
    rawText = emoji_pattern.sub(r'', x)
    return rawText
 
def convertISO(text:str) -> str:
    '''Return an ISO 8601 date (YYYY-MM-DD) from a "month day year" string, e.g. "sep 22 2020"'''
    # ref: https://www.w3.org/QA/Tips/iso-date
    # For example, "3rd of April 2002" is written 2002-04-03 in this international format.
    month, day, year = text.split()
    month = month2idx[month]
    text = year+'-'+month+'-'+day.zfill(2) # zero-pad single-digit days
    return text
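
As an alternative to a hand-written month table, the standard library can parse this date format directly. The sketch below assumes the input looks like `Sep 22, 2020` and that the process runs under an English locale (needed for `%b`); `to_iso_date` is a hypothetical helper, not part of the notebook.

```python
from datetime import datetime

def to_iso_date(text: str) -> str:
    """Convert a date such as 'Sep 22, 2020' to ISO 8601 ('2020-09-22')."""
    # %b matches the abbreviated month name (assumes an English locale),
    # %d accepts one- or two-digit days, %Y is the four-digit year.
    parsed = datetime.strptime(text.strip(), "%b %d, %Y")
    return parsed.date().isoformat()

print(to_iso_date("Sep 22, 2020"))  # 2020-09-22
print(to_iso_date("Sep 5, 2020"))   # 2020-09-05
```

A side benefit of `strptime` is that malformed dates raise a `ValueError` instead of silently producing a bad string.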

In [22]:
# Samples of the columns

for col in rawDF.columns:    
    print(f'Sample for {col}:')
    print('------------------')
    print(rawDF[col].iloc[0])
    print()
    

Sample for title:
------------------
<h2 class="categoryArticle__title">Oil Holds Gains As Large Gasoline Draw Offsets Crude Build</h2>

Sample for meta:
------------------
<p class="categoryArticle__meta">Sep 22, 2020 at 15:40 | Julianne Geiger</p>
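
The meta sample above packs the date, the time, and the author into one string. One way to pull the three fields apart is a single regular expression with named groups. This is only a sketch: `META_PATTERN` and `parse_meta` are hypothetical names, and the pattern assumes the HTML tags have already been removed and that every row follows the `<date> at <time> | <author>` layout seen in the sample.

```python
import re

# Assumed layout (after tag removal): "Sep 22, 2020 at 15:40 | Julianne Geiger"
META_PATTERN = re.compile(
    r"(?P<date>\w{3} \d{1,2}, \d{4}) at (?P<time>\d{1,2}:\d{2}) \| (?P<author>.+)"
)

def parse_meta(text: str) -> dict:
    """Split a tag-free meta string into date, time and author."""
    match = META_PATTERN.search(text)
    if match is None:
        # Fail loudly on rows that do not follow the assumed layout
        raise ValueError(f"Unexpected meta format: {text!r}")
    return {key: value.strip() for key, value in match.groupdict().items()}

sample = "Sep 22, 2020 at 15:40 | Julianne Geiger"
print(parse_meta(sample))
# {'date': 'Sep 22, 2020', 'time': '15:40', 'author': 'Julianne Geiger'}
```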



In [86]:
# ================ Cleaner Functions =================

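def cleanerMetaData(x:str) -> str:
    '''Return the ISO date extracted from the raw meta HTML'''
    removedHTML = cleanHTML(x)
    removedHTML = removedHTML.lower()
    # Meta layout: "<date> at <time> | <author>"
    timestamp, author = removedHTML.split('|')
    # Split on ' at ' (with spaces) so an "at" inside a word is never matched
    date, time = timestamp.split(' at ')
    date = convertISO( 
        cleanSpecialCharacters(date) 
    )
#     return {
#         'Date':date,
#         'Time': time,
#         'Author':author,
#         'Timezone':'SGT'
#     }
    return date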

def cleanerTitle(x: str) -> str:
    x = cleanHTML(x)
    x = x.lower()
    x = cleanEmjois(x)
    # Remove URLs before special characters: once ':', '/' and '.' are
    # stripped, the URL pattern can no longer match
    x = cleanURLS(x)
    x = cleanSpecialCharacters(x)
    words = x.strip().split()
    
    stop_words = set(stopwords.words('english'))
    words = [w for w in words if w not in stop_words]
    
    # Stemming is useful because it reduces the vocabulary size;
    # to predict the sentiment of a sentence, the root form of
    # each word is usually enough
    porter = PorterStemmer()
    stemmed = [porter.stem(word) for word in words]
    
    return ' '.join(stemmed)
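
One detail worth checking in a cleaner like `cleanerTitle` is the order of the regex passes. The sketch below uses two hypothetical helpers, `strip_urls` and `strip_punctuation`, which mirror `cleanURLS` and `cleanSpecialCharacters`, and an invented headline containing a URL. It suggests that stripping punctuation first leaves a mangled URL token behind, because the `://` the URL pattern looks for is already gone.

```python
import re

def strip_urls(text: str) -> str:
    """Remove http(s) and www-style URLs."""
    return re.sub(r"https?://\S+|www\.\S+", "", text)

def strip_punctuation(text: str) -> str:
    """Remove everything that is not a word character or whitespace."""
    return re.sub(r"[^\w\s]", "", text)

# Invented example headline, not taken from the dataset
headline = "oil prices jump https://example.com/story today"

urls_first = strip_punctuation(strip_urls(headline)).split()
punct_first = strip_urls(strip_punctuation(headline)).split()

print(urls_first)   # ['oil', 'prices', 'jump', 'today']
print(punct_first)  # ['oil', 'prices', 'jump', 'httpsexamplecomstory', 'today']
```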
    

In [87]:
rawDF['date'] = rawDF.meta.apply(cleanerMetaData)
rawDF['title'] = rawDF.title.apply(cleanerTitle)

In [89]:
rawDF.title

0      oil hold gain larg gasolin draw offset crude b...
1              australia plan us13b invest lowemiss tech
2       chines oil giant could buy exxon north sea asset
3                        china promis tackl climat chang
4                      pompeo build coalit nord stream 2
                             ...                        
195     eastern libya see power outag result oil blockad
196    demand crash hit us refin surg biofuel blend cost
197                   gold price plung 2000 explos ralli
198        russia oil product export us jump 16year high
199    shale giant occident petroleum report major lo...
Name: title, Length: 200, dtype: object

In [90]:
datasetDF = rawDF.drop(['meta','date'], axis=1)

In [91]:
test_dataset, validation_dataset = train_test_split(datasetDF, test_size= 0.2)

In [92]:
test_dataset.reset_index(drop=True, inplace=True)
validation_dataset.reset_index(drop=True, inplace=True)

In [93]:
len(test_dataset), len(validation_dataset)

(160, 40)
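
Conceptually, the split above shuffles the rows and holds out a fraction of them. The standard-library sketch below shows that idea with a fixed seed so the result is repeatable; `holdout_split` is a hypothetical helper and not how scikit-learn implements `train_test_split` (its rounding of the hold-out size may differ on other inputs). With scikit-learn itself, passing `random_state` to `train_test_split` makes the split reproducible.

```python
import random

def holdout_split(items, test_size=0.2, seed=42):
    """Shuffle a copy of items and split off the last test_size fraction."""
    shuffled = list(items)                  # leave the caller's data untouched
    random.Random(seed).shuffle(shuffled)   # seeded, so the split is repeatable
    n_holdout = round(len(shuffled) * test_size)
    split_at = len(shuffled) - n_holdout
    return shuffled[:split_at], shuffled[split_at:]

rows = list(range(200))                     # stand-in for 200 cleaned headlines
larger, smaller = holdout_split(rows)
print(len(larger), len(smaller))            # 160 40
```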

# PART 2: Data Preprocessing of Image data

1. Create a dataframe with path and class
2. Split the data into train and test

In [163]:
import numpy as np
from sklearn.model_selection import train_test_split
from tqdm.notebook import tqdm as tqdm_notebook  # tqdm.tqdm_notebook is deprecated
import os
import shutil
import cv2

In [148]:
DATAPATH = './imgClfDataset/raw_dataset/'
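# Keep only the class sub-directories (skips stray files such as .DS_Store), in a stable order
classes = sorted(d for d in os.listdir(DATAPATH) if os.path.isdir(os.path.join(DATAPATH, d)))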
classes

['paintings', 'photographs']
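
`os.listdir` returns entries in arbitrary order and includes any stray files, so relying on positions in its result can be fragile. The sketch below builds a throwaway directory tree and keeps only sub-directories, sorted by name. `list_class_dirs` is a hypothetical helper, and the `.DS_Store` file is an assumption about the kind of hidden entry that might sit next to the class folders.

```python
import os
import tempfile

def list_class_dirs(root: str) -> list:
    """Return the sorted names of the sub-directories of root."""
    return sorted(
        name for name in os.listdir(root)
        if os.path.isdir(os.path.join(root, name))
    )

# Build a temporary tree that imitates the raw dataset folder
with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, "photographs"))
    os.mkdir(os.path.join(root, "paintings"))
    # A stray hidden file, assumed here for illustration
    open(os.path.join(root, ".DS_Store"), "w").close()
    classes = list_class_dirs(root)

print(classes)  # ['paintings', 'photographs']
```

Sorting also keeps the label encoding (0 for paintings, 1 for photographs) the same from run to run.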

In [149]:
datalist = list()
classlist = list()
pathlist = list()

# Encode
#  0 -> paintings
#  1 -> photographs

for idx, cls in enumerate(classes):
    tmp = os.listdir(DATAPATH+cls+'/full')
    datalist += tmp
    pathlist += [DATAPATH+cls+'/full'] *len(tmp)
    classlist += [idx]*len(tmp)

assert len(classlist) == len(datalist) == len(pathlist)

In [150]:
rawDF = pd.DataFrame(
    {
        'image':datalist,
        'path':pathlist,
        'class':classlist
        
    }
)

rawDF = rawDF.sample(frac=1).reset_index(drop=True)

rawDF['class'].value_counts()

0    648
1    647
Name: class, dtype: int64

In [151]:
train_data, test_data = train_test_split(rawDF, test_size=0.2)

train_data.reset_index(drop=True, inplace=True)
test_data.reset_index(drop=True, inplace=True)

len(train_data), len(test_data)

(1036, 259)

In [152]:
train_data

Unnamed: 0,image,path,class
0,bae3912762c4fc6000d2f1ff57c24aa3df1ff011.jpg,./imgClfDataset/raw_dataset/photographs/full,1
1,8de3bfb56c8439665add3126ce544af064be944c.jpg,./imgClfDataset/raw_dataset/photographs/full,1
2,ef99d0326c60991496c9e2a7a066dbadd0ff9f4a.jpg,./imgClfDataset/raw_dataset/paintings/full,0
3,02f58f692c08c0fb377d754c239ffe3371c267cd.jpg,./imgClfDataset/raw_dataset/photographs/full,1
4,0b3ede5b30a3d2a53a8522438ac7d9eb37b88a53.jpg,./imgClfDataset/raw_dataset/photographs/full,1
...,...,...,...
1031,31f4a57afdd421c53f92b30f8d7c3c22f9d398c5.jpg,./imgClfDataset/raw_dataset/paintings/full,0
1032,0fcd2c32af4a99435fae98ab4312455dd80e19d1.jpg,./imgClfDataset/raw_dataset/paintings/full,0
1033,6376047049e580f0e75ac57bd41cc6f82b075e12.jpg,./imgClfDataset/raw_dataset/paintings/full,0
1034,16a3d7f796bc1aa7b35ebd151559b85fc209750a.jpg,./imgClfDataset/raw_dataset/paintings/full,0


In [169]:
DATASET_PATH = './imgClfDataset/PaintingsVsPhotographs'
TRAIN_PATH = './imgClfDataset/PaintingsVsPhotographs/train'
TEST_PATH = './imgClfDataset/PaintingsVsPhotographs/test'

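# makedirs creates missing parent directories (DATASET_PATH here) and,
# with exist_ok=True, does nothing when a directory already exists
os.makedirs(TRAIN_PATH, exist_ok=True)
os.makedirs(TEST_PATH, exist_ok=True)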
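
When creating output folders, note that `os.mkdir` raises `FileExistsError` if the directory is already there, whereas `os.makedirs(..., exist_ok=True)` is safe to call repeatedly and also creates missing parents. The sketch below demonstrates both behaviours inside a temporary directory; the folder names are placeholders.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as base:
    dataset = os.path.join(base, "dataset")
    train = os.path.join(dataset, "train")

    os.makedirs(train, exist_ok=True)   # creates dataset/ and dataset/train/
    os.makedirs(train, exist_ok=True)   # second call is a no-op, no exception

    try:
        os.mkdir(dataset)               # plain mkdir on an existing directory
        raised = False
    except FileExistsError:
        raised = True

    created = os.path.isdir(train)

print(created, raised)  # True True
```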

In [166]:
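tmp_paths = list()
# total= lets the progress bar show real progress, since zip() has no length
for img, pt in tqdm_notebook(zip(train_data.image, train_data.path), total=len(train_data)):
    shutil.copy(os.path.join(pt, img), TRAIN_PATH)
    tmp_paths.append(TRAIN_PATH+'/'+img)
    
# assign() returns a new DataFrame, which avoids the SettingWithCopyWarning
train_data = train_data.assign(path=tmp_paths)

tmp_paths = list()
for img, pt in tqdm_notebook(zip(test_data.image, test_data.path), total=len(test_data)):
    shutil.copy(os.path.join(pt, img), TEST_PATH)
    tmp_paths.append(TEST_PATH+'/'+img)
    
test_data = test_data.assign(path=tmp_paths)

test_data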


In [167]:
train_data

Unnamed: 0,image,path,class
0,bae3912762c4fc6000d2f1ff57c24aa3df1ff011.jpg,./imgClfDataset/PaintingsVsPhotographs/train/b...,1
1,8de3bfb56c8439665add3126ce544af064be944c.jpg,./imgClfDataset/PaintingsVsPhotographs/train/8...,1
2,ef99d0326c60991496c9e2a7a066dbadd0ff9f4a.jpg,./imgClfDataset/PaintingsVsPhotographs/train/e...,0
3,02f58f692c08c0fb377d754c239ffe3371c267cd.jpg,./imgClfDataset/PaintingsVsPhotographs/train/0...,1
4,0b3ede5b30a3d2a53a8522438ac7d9eb37b88a53.jpg,./imgClfDataset/PaintingsVsPhotographs/train/0...,1
...,...,...,...
1031,31f4a57afdd421c53f92b30f8d7c3c22f9d398c5.jpg,./imgClfDataset/PaintingsVsPhotographs/train/3...,0
1032,0fcd2c32af4a99435fae98ab4312455dd80e19d1.jpg,./imgClfDataset/PaintingsVsPhotographs/train/0...,0
1033,6376047049e580f0e75ac57bd41cc6f82b075e12.jpg,./imgClfDataset/PaintingsVsPhotographs/train/6...,0
1034,16a3d7f796bc1aa7b35ebd151559b85fc209750a.jpg,./imgClfDataset/PaintingsVsPhotographs/train/1...,0
