# A - 3 - Dataset Preparation

**Process aim:** clean the dataset and reduce it to the data used for machine learning.

**Input:** a CSV containing metadata and full text

**Subprocesses:**
* import and explore the dataset
* normalize fields containing multiple values
* drop unwanted columns and rows
* add columns

**Output:** a CSV file

In [None]:
import numpy as np
import pandas as pd
import re

## Import and explore the dataset

We create a dataframe from the CSV. Dataframes are table-like structures that are easy to manipulate and analyze.

* Get information about the dataset, such as the number of entries (rows), the column names and the number of non-null values per column:
    * dataset.info()
* Get the column names:
    * dataset.columns
* Get the first x rows of the table, including the headers:
    * dataset.head(x)
* Get some analytics regarding the dataset or specific columns:
    * dataset.describe()
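
The calls listed above can be tried on a small frame before loading the full CSV. This is a minimal sketch: the `sample` frame and its values are made up for illustration and are not taken from the real dataset.

```python
import pandas as pd

# Hypothetical three-row sample, not taken from the real CSV
sample = pd.DataFrame(
    {
        "symbol": ["A/55/1", "S/2000/10", None],
        "date": ["2000-09-12", "2000", "2001-01-05"],
    },
    index=pd.Index(["r1", "r2", "r3"], name="record_id"),
)

sample.info()                        # prints row count, dtypes and non-null counts
column_names = list(sample.columns)  # column labels, in order
first_rows = sample.head(2)          # first 2 rows, headers included
summary = sample.describe()          # count, unique, top, freq for text columns

# Number of rows that have a symbol, the same figure info() reports
non_null_symbols = int(sample["symbol"].notnull().sum())
print(column_names, non_null_symbols)
```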

In [None]:
# Create the dataframe
# We don't need the column with the url anymore
columns = ['record_id','body', 'date', 'session', 'subjects_geo','subjects_primary', 'subjects_topics', 'symbol', 'title', 
           'type','text']
dataset = pd.read_csv('data/A_input_data/metadata/output/doc_2000_2017_txt.csv',index_col='record_id',usecols=columns, dtype='str')

In [None]:
dataset.columns

In [None]:
# Filter out records that have no symbol
dataset = dataset[dataset.symbol.notnull()]

Using dataset.info(), we can see how many non-null values each field has, for instance how many records have a value in 'subjects_topics'.

In [None]:
dataset.info()

## Normalize 'date'
To be able to sort and group by date, we need to ensure that the date is normalized and that all rows have a value. Because this field is sometimes empty or holds an incomplete date, we normalize it as follows:
* use the year instead of the full date
* drop the records that have no year, as well as those dated 1999
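
The year extraction can be sketched on a few made-up values. The `dates` series below is hypothetical; it only illustrates that the first run of four digits is kept and that rows without one come back as missing.

```python
import pandas as pd

# Hypothetical date values: full date, year only, bracketed year, free text, missing
dates = pd.Series(["2001-05-17", "2003", "[ca. 1999]", "n.d.", None])

# Keep the first run of four digits; rows without one become NaN
years = dates.str.extract(r"(\d{4})", expand=False)

# Count the rows that ended up without a year
missing = int(years.isnull().sum())
print(years.dropna().tolist(), missing)  # ['2001', '2003', '1999'] 2
```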

In [None]:
dataset.date.unique()

In [None]:
dataset['date'] = (dataset.date
                   .str.extract(r'(\d{4})', expand=False)  # extract the first 4 digits, corresponding to the year
                  )

In [None]:
# Drop records without a year and those dated 1999
dataset = dataset[dataset['date'].notnull()]
dataset = dataset[dataset['date'] != '1999']

In [None]:
dataset.date.unique()

In [None]:
# Check our dataset to ensure the date field has a value for each record
dataset.info()

In [None]:
#  Rename column 'date' to year
dataset = dataset.rename(columns={'date':'year'})

## Normalize 'body'
Another useful sorting option is by body.

In [None]:
def normalize_body(line):
    '''
    Reduce the 'body' field to the main body codes it mentions:
    - keep only the part before the first "/" of each value
    - keep only the codes 'A', 'S' and 'E'
    - return the distinct codes joined by commas ('' if none match)
    Non-string values (e.g. NaN) are returned unchanged.
    '''
    if isinstance(line, str):
        # Remove asterisks, quotes, square brackets and whitespace
        line = re.sub(r"\*+|'|\[|\]|\s+", "", line)
        # Keep the leading component of each comma-separated value
        bodies = [body.split("/")[0] for body in line.split(',')]
        # Keep the distinct main bodies, sorted so the result is reproducible
        bodies = sorted({body for body in bodies if body in ('A', 'S', 'E')})
        return ",".join(bodies)
    else:
        return line
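
To make the steps concrete, here is a trace on a single value. The `raw` string is a hypothetical example of what the 'body' column might contain, and the codes are sorted in this sketch so that the output is stable across runs.

```python
import re

# Hypothetical raw value for the 'body' column
raw = "['A/55', 'S/2000', 'A/C.1/55']"

# Step 1: strip asterisks, quotes, brackets and whitespace
cleaned = re.sub(r"\*+|'|\[|\]|\s+", "", raw)

# Step 2: keep the part before the first "/" of each value
prefixes = [part.split("/")[0] for part in cleaned.split(",")]

# Step 3: keep the distinct codes among A, S and E
main = sorted({p for p in prefixes if p in ("A", "S", "E")})

result = ",".join(main)
print(cleaned)   # A/55,S/2000,A/C.1/55
print(result)    # A,S
```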

In [None]:
dataset.info()

In [None]:
dataset['main_body'] = dataset['body'].apply(normalize_body)

In [None]:
dataset.main_body.unique()

In [None]:
dataset = dataset[dataset['main_body'] != '']

## Normalize fields containing multiple values

Some columns hold multiple values inside square brackets. We want to normalize these columns as follows:
* fields with multiple values: value one||value two
* fields with a single value: value one

In [None]:
def normalize_multiple(line):
    '''
    Normalize cells that may hold several values:
    - fields with multiple values: value one||value two
    - fields with a single value: value one
    Non-string values (e.g. NaN) are returned unchanged.
    '''
    if isinstance(line, str):
        if line.startswith('['):
            # List-like cell: drop the brackets and the outer quotes,
            # then turn the quote-comma-quote separators into '||'
            line = re.sub(r"\[|\]", "", line)
            line = line.strip("'")
            line = line.strip('"')
            line = line.strip(" ")
            line = re.sub(r"('|\"),\s?('|\")", "||", line)
            return line
        else:
            # Single value: drop outer quotes, spaces and any asterisks
            line = line.strip("'")
            line = line.strip('"')
            line = line.strip(" ")
            line = re.sub(r"\*", "", line)
            return line
    else:
        return line

In [None]:
dataset['subjects_geo'] = dataset['subjects_geo'].apply(normalize_multiple)
dataset['subjects_topics'] = dataset['subjects_topics'].apply(normalize_multiple)
dataset['subjects_primary'] = dataset['subjects_primary'].apply(normalize_multiple)
dataset['symbol'] = dataset['symbol'].apply(normalize_multiple)
dataset['type'] = dataset['type'].apply(normalize_multiple)
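
Once the values are in the `value one||value two` form, they can be split back into lists when needed. The sketch below assumes a hypothetical `topics` series with made-up subject values; it uses plain `str.split` through `map` so that `||` is always treated as a literal separator.

```python
import pandas as pd

# Hypothetical normalized values in the 'value one||value two' form
topics = pd.Series(
    ["HUMAN RIGHTS||REFUGEES", "DISARMAMENT", None],
    index=["r1", "r2", "r3"],
    dtype=object,
)

# Split on the literal '||' separator; missing values stay missing
topic_lists = topics.map(lambda v: v.split("||") if isinstance(v, str) else v)

# One row per (record, topic) pair, convenient for counting topics
exploded = topic_lists.explode().dropna()
counts = exploded.value_counts()
print(exploded.tolist())  # ['HUMAN RIGHTS', 'REFUGEES', 'DISARMAMENT']
```

Keeping a single string per cell makes the CSV easy to write and read, while the split step recovers the individual values for analysis.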

In [None]:
dataset.info()

## Clean text field

Clean special characters such as new lines (\n), tabs (\t), etc. in 'text'.

In [None]:
dataset['text'].head(10)

In [None]:
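# Replace runs of new lines, tabs and vertical tabs with a single space.
# regex=True is required: since pandas 2.0 the pattern is treated literally by default.
dataset['text'] = dataset['text'].str.replace(r'[\n\t\v]+', ' ', regex=True)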
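
The effect of this replacement can be checked on a few made-up strings. The `text` series is hypothetical; `regex=True` is passed explicitly because recent pandas versions otherwise treat the pattern as a literal string.

```python
import pandas as pd

# Hypothetical text values containing new lines and tabs
text = pd.Series(["First line\nsecond\tline", "no control chars", "a\n\n\tb"])

# Each run of new lines, tabs or vertical tabs becomes one space
cleaned = text.str.replace(r"[\n\t\v]+", " ", regex=True)
print(cleaned.tolist())
# ['First line second line', 'no control chars', 'a b']
```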

In [None]:
dataset['text'].head(10)

## Save the output

In [None]:
dataset.to_csv('data/A_input_data/metadata/output/doc_2000_2017_txt_clean.csv')
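
A quick way to check the saved file is to read it back with the same options used for the import. This is a sketch on a toy frame written to a temporary directory; the file name `sample_clean.csv` and the values are assumptions for illustration only.

```python
import os
import tempfile

import pandas as pd

# Hypothetical cleaned frame with the index named like the real one
frame = pd.DataFrame(
    {"year": ["2000", "2001"], "main_body": ["A", "A,S"]},
    index=pd.Index(["r1", "r2"], name="record_id"),
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample_clean.csv")  # hypothetical file name
    frame.to_csv(path)                            # the index is written as the first column
    # Read everything back as text so that years stay strings
    reloaded = pd.read_csv(path, index_col="record_id", dtype="str")

print(reloaded.shape)  # (2, 2)
```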