<a href="https://colab.research.google.com/github/Cllaire/Cariere/blob/master/Preprocess_satirical_news.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

This colab aims to preprocess the dataset for satirical and non-satirical news. The two types of preprocessing that we're going to do is:
* replace named entities with a distinct symbol 
* eliminate tags such as new lines/tabs/multiple spaces



In [17]:
import glob
import pandas as pd
import csv
import re
import math

In [2]:
REPLACE_ENTITY = "$NE$"

In [3]:
def replace_named_entities(text):
  # Get all tokens split by space. 
  tokens = re.split(" ", text)
  # Replace uppercase tokens that don't follow full-stop with $NE$. 
  replaced_tokens = []
  for index in range(len(tokens)):
    if index == 0:
      replaced_tokens.append(tokens[index])
      continue
    if tokens[index] is not "" and tokens[index-1] is not "":
      if tokens[index][0].isupper() and tokens[index-1][-1] is not '.':
        if tokens[index][-1] == '.':
          replaced_tokens.append(REPLACE_ENTITY + '.')
        else:
          replaced_tokens.append(REPLACE_ENTITY)
        continue
      if len(tokens[index]) > 1 and tokens[index][1].isupper():
        replaced_tokens.append(REPLACE_ENTITY)
        continue
    replaced_tokens.append(tokens[index])

  # Replace sequence of $NE$ with only one $NE$. 
  current_len = 0
  simplified_tokens = []
  for token in replaced_tokens:
    if token == REPLACE_ENTITY:
      current_len += 1
    else:
      if current_len > 0:
        simplified_tokens.append(REPLACE_ENTITY)
        current_len = 0
      simplified_tokens.append(token)
  if current_len > 0:
    simplified_tokens.append(REPLACE_ENTITY)
  
  return ' '.join(simplified_tokens)

In [11]:
def replace_tags(text):
  try: 
    text = text.replace('\t',' ') # Remove tab tag.
  except:
    print(text)
  text = text.replace('\n', ' ') # Remove new line tag.
  text = text.replace('\s', ' ') # Remove \s tag.
  text = text.replace('\xa0',  ' ') # Remove \xa0 tag.
  text = re.sub(' +', ' ', text) # Replace multiple spaces with one space. 
  return text

# Retrieve data

In [5]:
file_paths = glob.glob('/content/drive/My Drive/data/*/*/*.csv')

In [20]:
for path in file_paths:
  datasource = pd.read_csv(path)
  print('Read:', path)
  for index in range(len(datasource)):
    if pd.isna(datasource['title'][index]) or pd.isna(datasource['content'][index]):
      datasource.drop(index)
      continue
    datasource['title'][index] = replace_named_entities(replace_tags(datasource['title'][index])) 
    datasource['content'][index] = replace_named_entities(replace_tags(datasource['content'][index]))
  new_path = path[:-4] + '_preprocessed.csv'
  print('Write:', new_path)
  datasource.to_csv(new_path)
  print('No. samples:', len(datasource))  
  

Read: /content/drive/My Drive/data/non-satirical/mediafax/economic.csv
Write: /content/drive/My Drive/data/non-satirical/mediafax/economic_preprocessed.csv
No. samples: 492
Read: /content/drive/My Drive/data/non-satirical/mediafax/social.csv
Write: /content/drive/My Drive/data/non-satirical/mediafax/social_preprocessed.csv
No. samples: 720
Read: /content/drive/My Drive/data/non-satirical/mediafax/externe.csv
Write: /content/drive/My Drive/data/non-satirical/mediafax/externe_preprocessed.csv
No. samples: 705
Read: /content/drive/My Drive/data/non-satirical/mediafax/politic.csv
Write: /content/drive/My Drive/data/non-satirical/mediafax/politic_preprocessed.csv
No. samples: 693
Read: /content/drive/My Drive/data/non-satirical/digi/actualitate.csv
Write: /content/drive/My Drive/data/non-satirical/digi/actualitate_preprocessed.csv
No. samples: 2990
Read: /content/drive/My Drive/data/non-satirical/digi/economie.csv
Write: /content/drive/My Drive/data/non-satirical/digi/economie_preprocessed.

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  
A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  if __name__ == '__main__':


Read: /content/drive/My Drive/data/satirical/ghimpele/lifestyle.csv
Write: /content/drive/My Drive/data/satirical/ghimpele/lifestyle_preprocessed.csv
No. samples: 36
Read: /content/drive/My Drive/data/satirical/ghimpele/justitie.csv
Write: /content/drive/My Drive/data/satirical/ghimpele/justitie_preprocessed.csv
No. samples: 22
Read: /content/drive/My Drive/data/satirical/ghimpele/economie.csv
Write: /content/drive/My Drive/data/satirical/ghimpele/economie_preprocessed.csv
No. samples: 32
Read: /content/drive/My Drive/data/satirical/ghimpele/turism.csv
Write: /content/drive/My Drive/data/satirical/ghimpele/turism_preprocessed.csv
No. samples: 35
Read: /content/drive/My Drive/data/satirical/ghimpele/politica.csv
Write: /content/drive/My Drive/data/satirical/ghimpele/politica_preprocessed.csv
No. samples: 36
Read: /content/drive/My Drive/data/satirical/ghimpele/actual.csv
Write: /content/drive/My Drive/data/satirical/ghimpele/actual_preprocessed.csv
No. samples: 135
Read: /content/drive/