# B - 1 - Dataset analysis

## Description

The dataset influences the performance of the machine learning model. It is composed of identifiers, features, and labels:
* **Features** are characteristics of the texts: the variables used to classify them, such as words appearing in the text or in the title, or any other information about a text that the model can use to predict the label. To be understood by the model, features need to be represented as numbers or booleans, and are organized in a matrix where each row represents one text and each column a feature. This n-dimensional representation of the features is called the feature space, and the process of extracting and selecting the most pertinent features is called **feature engineering**.

* **Labels** are used during the training phase. They are the categories or classes assigned to each text that the learning algorithm will need to predict. They can represent a topic, a person, a type of material, etc. They are typically stored in a matrix of boolean values where each row represents a text and each column a label; this matrix is called the **label space**. The **label set** is the set of all possible labels.
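
As a minimal sketch of these two matrices, the toy example below builds a binary feature matrix and a binary label matrix for three invented texts. The texts, labels, and variable names are made up for illustration, and `CountVectorizer` is only one possible way to derive word features; it is an assumption here, not necessarily the method used later in this project.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import MultiLabelBinarizer

# Toy corpus and labels, invented for illustration only.
toy_texts = [
    "water supply in rural areas",
    "public health and water quality",
    "renewable energy supply",
]
toy_labels = ["water", "health||water", "energy"]

# Feature space: one row per text, one column per word (presence as 1/0).
vectorizer = CountVectorizer(binary=True)
feature_matrix = vectorizer.fit_transform(toy_texts)  # sparse matrix

# Label space: one row per text, one column per label.
mlb = MultiLabelBinarizer()
label_matrix = pd.DataFrame(
    mlb.fit_transform([labels.split("||") for labels in toy_labels]),
    columns=mlb.classes_,
)

# Label set: all distinct labels found in the data.
label_set = list(mlb.classes_)

print(feature_matrix.shape)
print(label_set)
print(label_matrix)
```

Even on three short texts the feature matrix has more columns than rows, which hints at how quickly the feature space grows on a real corpus.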

A large feature space can improve the accuracy of the predictions, but the more features there are, the more computationally expensive it is to train the model, especially when predicting several labels per text. Because the distribution of labels is often unequal (some labels are used more than others), the larger the label set is, the more data (features, texts) are needed, and the more likely it is to run into computational limitations during the process. The difficulty arising from high-dimensional feature and label spaces is called **the curse of dimensionality**.
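
The uneven use of labels can be checked with a simple frequency count. The sketch below uses invented records and assumes the `||` separator that appears in the cells further down; it counts how often each label is used and lists the labels that appear only once.

```python
from collections import Counter

# Invented multi-valued label strings, one per record.
toy_records = [
    "water||health",
    "water",
    "water||energy",
    "health||water",
    "climate",
]

# Count how many records use each label.
label_usage = Counter(
    label for record in toy_records for label in record.split("||")
)

# Labels used by a single record offer very few examples to learn from.
rare_labels = sorted(label for label, n in label_usage.items() if n == 1)

print(label_usage.most_common())
print(rare_labels)
```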

Analyzing the dataset before training the model is important. It can lead to better feature engineering, label set reduction, and a better training process.

**Process aim:**
Gain insight into the dataset to identify potential issues and mitigate them when engineering the features and building the label space.

**Input:** A CSV file
**Sub-processes**:


**Output:** a CSV file, statistical reports and graphics.

In [None]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.utils.extmath import density
import matplotlib.pyplot as plt
import seaborn as sbn
%matplotlib inline

In [None]:
# Import the dataset 
dataset = pd.read_csv('data/A_input_data/metadata/output/doc_2000_2017_txt_clean.csv', index_col='record_id')

In [None]:
dataset.info()

In [None]:
dataset.head(1)

## Labels
In this example, we aim to predict the subjects of a text document by learning from documents that have already been subject indexed. Subject information is contained in the following columns:
* subjects_topics: all topical subject terms
* subjects_primary: a subset of the topical subjects that contains the most important subjects of each document
* subjects_geo: geographical subject terms

Each of these fields, or a combination of them, can compose our label set. The following information about the label set will help us make informed decisions when tuning the learning process:
* number of labels per record, and the evolution of this number over time,
* usage of each label in the label set,
* etc.
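
For instance, the first indicator can be sketched on a toy table as follows. The `year` and `subjects_topics` column names follow the cells in this notebook, but the rows are invented and `count_labels` is a hypothetical helper used only for this illustration.

```python
import numpy as np
import pandas as pd

# Invented records; NaN marks a document without subjects.
toy = pd.DataFrame({
    "year": [2000, 2000, 2001, 2001],
    "subjects_topics": [
        "water||health",
        np.nan,
        "energy",
        "water||energy||health",
    ],
})

def count_labels(value, separator="||"):
    """Return the number of labels in a multi-valued string, 0 if missing."""
    return len(value.split(separator)) if isinstance(value, str) else 0

toy["subjects_topics_count"] = toy["subjects_topics"].apply(count_labels)

# Average number of labels per record for each year.
labels_per_year = toy.groupby("year")["subjects_topics_count"].mean()
print(labels_per_year)
```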

### Labels by documents
#### Create new label columns 
Geographic terms are used less often than topical subjects. To take all types of subjects into consideration during the learning process, and to allow further analysis, we group these terms together in new columns that can be used as alternative label sets:
* subjects_all: groups subjects_topics and subjects_geo
* subjects_primary_geo: groups subjects_primary and subjects_geo

In [None]:
# Add a column subjects_all concatenating topical and geographical terms.
# Missing values are dropped before joining, so a label ending in "nan"
# (such as "Bhutan") is not damaged by a string replacement.
dataset['subjects_all'] = (dataset[['subjects_topics', 'subjects_geo']]
                           .apply(lambda x: '||'.join(x.dropna().astype(str)), axis=1)
                           .replace('', np.nan) # records without any subject
                          )

In [None]:
# Add a column subjects_primary_geo concatenating primary topics and geographical terms.
# Missing values are dropped before joining, so a label ending in "nan"
# (such as "Bhutan") is not damaged by a string replacement.
dataset['subjects_primary_geo'] = (dataset[['subjects_primary', 'subjects_geo']]
                                   .apply(lambda x: '||'.join(x.dropna().astype(str)), axis=1)
                                   .replace('', np.nan) # records without any subject
                                  )

In [None]:
dataset.info()

In [None]:
dataset.head(2)

For each column that could be used as a label set, we can count the number of labels per record. This gives us a first idea of indexing practices.

In [None]:
def count_multiple_values(values,separator):
    if isinstance(values,str):
        return len(values.split(separator))
    else:
        return 0

In [None]:
# Add a column that counts the number of topical subjects assigned to each document
dataset['subjects_topics_count'] = (dataset['subjects_topics']
                                    .apply(lambda x: count_multiple_values(x,'||'))
)

In [None]:
# Add a column that counts the number of geographical subjects assigned to each document
dataset['subjects_geo_count'] = (dataset['subjects_geo']
                                 .apply(lambda x: count_multiple_values(x,'||'))
)

In [None]:
# Add a column that counts the number of primary subjects assigned to each document
dataset['subjects_primary_count'] = (dataset['subjects_primary']
                                 .apply(lambda x: count_multiple_values(x,'||'))
)

In [None]:
# Add a column that counts the number of topical and geographical subjects assigned to each document
dataset['subjects_all_count'] = (dataset['subjects_all']
                                 .apply(lambda x: count_multiple_values(x,'||'))
)

In [None]:
# Add a column that counts the number of primary and geographical subjects assigned to each document
dataset['subjects_primary_geo_count'] = (dataset['subjects_primary_geo']
                                 .apply(lambda x: count_multiple_values(x,'||'))
)

In [None]:
dataset.info()

### Labelset

In [None]:
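def create_sets(dataset, field, values,label_field,separator):
    '''
    Split the labelled records by each value of `field` (e.g. each year).
    For every value, return the subset for that value and, from the second
    value onward, the cumulative subset from values[0] up to that value.
    Each entry is [label_field, name, subset, binary label matrix].
    '''
    sets = []
    subsets = []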
    dataset = (dataset[dataset[label_field].notnull()]
              .reset_index()
            )
    for v in values:
        subset = (dataset[dataset[field] == v].copy().reset_index(drop=True))
        labelset = get_binary_labels(subset,label_field,separator)
        sets.append([label_field, str(v), subset, labelset])
        subsets.append(subset)
        if v != values[0]:
            cumulset = pd.concat(subsets)
            labelset = get_binary_labels(cumulset,label_field,separator)
            sets.append([label_field, str(values[0]) + '-'+ str(v), cumulset,labelset])
    return sets

def get_binary_labels(dataset,label_field,separator=','):
    '''
    Takes a pandas dataset, the name of the column containing labels, and the separator used for multiple values:
    - transforms the label column to a list
    - splits each row into a list of labels
    - transforms the lists of textual labels to a binary matrix
    - returns the binary matrix as a DataFrame with one column per textual label.
    '''
    labels = dataset[label_field].tolist()
    labels = [l.split(separator) for l in labels]
    mlb = MultiLabelBinarizer()
    binary_labels = mlb.fit_transform(labels)
    labels = mlb.classes_
    labelset = pd.DataFrame(binary_labels, columns=labels)
    return labelset

In [None]:
def describe_report(dataset, columns, name='all',groups=None):
    reports = {name: dataset[columns].describe()}
    if groups is not None:
        for i in range(len(groups)):
            reports[groups[i]] = dataset.groupby(groups[i])[columns].describe()            
    return reports

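def create_report(dataset):
    '''
    Build a report from one entry returned by create_sets:
    [label_field, name, subset, binary label matrix].
    '''
    # One row per label; 'cardinality' is the number of records using that label.
    labelset = dataset[3].copy().T
    labelset['cardinality'] = labelset.sum(axis=1)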
    reports = {dataset[0] + '_' + dataset[1]: 
               {'nr_of_records': dataset[2][dataset[0] + '_count'].describe(),
                'label_cardinality':labelset['cardinality'].describe(),
                'label_ranking': labelset.sort_values('cardinality',ascending=False)
                }
              }
    return reports
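
To make the quantities in `create_report` concrete, the sketch below applies the same steps to a small invented binary label matrix: transpose it so that each row is a label, count the records using each label, and rank the labels. The data and the names `toy_labelset`, `per_label`, and `label_ranking` are hypothetical.

```python
import pandas as pd

# Invented binary label matrix: rows are records, columns are labels.
toy_labelset = pd.DataFrame(
    {"energy": [0, 0, 1, 0], "health": [1, 0, 0, 1], "water": [1, 1, 1, 1]}
)

# One row per label, then count the records that use each label.
per_label = toy_labelset.T
per_label["cardinality"] = per_label.sum(axis=1)

# Rank labels from most to least used.
label_ranking = per_label.sort_values("cardinality", ascending=False)

print(label_ranking["cardinality"])
print(per_label["cardinality"].describe())
```

A ranking where one label is used by every record and others by a single record is the kind of imbalance this report is meant to surface.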

In [None]:
field = 'year'
values = list(range(2000, 2017)) # 2000 to 2016 inclusive
label_field = 'subjects_all'
separator = '||'

In [None]:
subsets = create_sets(dataset, field, values,label_field,separator)

In [None]:
subjects_all_reports = [create_report(s) for s in subsets]
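
`create_sets` returns, for each year, both the subset for that year and a cumulative subset starting from the first year. One reason to look at cumulative subsets is to see how the label set grows as more years are included. The sketch below illustrates that idea on invented data without calling the functions above; all names in it are hypothetical.

```python
import pandas as pd

# Invented records with a year and a multi-valued label string.
toy = pd.DataFrame({
    "year": [2000, 2000, 2001, 2002],
    "subjects_all": [
        "water||health",
        "water",
        "energy||water",
        "climate||health",
    ],
})

seen_labels = set()
labelset_size = {}
for year, group in toy.groupby("year"):
    for value in group["subjects_all"]:
        seen_labels.update(value.split("||"))
    # Distinct labels in all records up to and including this year.
    labelset_size[int(year)] = len(seen_labels)

print(labelset_size)
```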

## Reports

In [None]:
def create_describe_reports(df, columns, name='all',groups=None):
    # Descriptive statistics for the whole dataset, then for each grouping column.
    reports = {name: df[columns].describe()}
    if groups is not None:
        for group in groups:
            reports[group] = df.groupby(group)[columns].describe()
    return reports

def save_reports(reports,file_path,name):
    # Create a Pandas Excel writer using XlsxWriter as the engine.
    # The context manager saves the file on exit (ExcelWriter.save() was removed in pandas 2.0).
    with pd.ExcelWriter(file_path + name + '.xlsx', engine='xlsxwriter') as writer:
        # Write each dataframe to a different worksheet.
        for k,v in reports.items():
            v.to_excel(writer, sheet_name=k)

### Labels by documents

In [None]:
# Report parameters
report_columns = ['subjects_topics_count',
       'subjects_geo_count', 'subjects_primary_count', 'subjects_all_count',
       'subjects_primary_geo_count']
report_groups = ['main_body','year']
report_name = '650_651'
path = 'reports/'

In [None]:
# Create the report
label_by_documents = create_describe_reports(dataset,report_columns, report_name, report_groups)

In [None]:
label_by_documents['650_651']

In [None]:
label_by_documents['main_body']

In [None]:
label_by_documents['year']

In [None]:
# Save the report
save_reports(label_by_documents, path, report_name)