# Data Loading and Preparation
In this notebook we show:
- How to use our data loader, built specifically for the DeftCorpus dataset.
- How we prepared the DeftCorpus dataset for sentence definition classification.

In [1]:
#imports cell
from source.data_loader import DeftCorpusLoader

## Loading dataset for classification using DeftCorpusLoader

Possible steps for using the class:

- Create an instance of the class, passing the path to the **"data" folder** inside the "deft_corpus" folder.


- Call `load_classification_data` on the class instance with **no arguments**. By default, this creates two folders inside the "deft_files" folder containing the data re-formatted for the classification task, then loads those files and returns two dataframes: a training split and a development split (used for testing during the training phase of the competition).


- Alternatively, call `convert_to_classififcation_format` on the class instance, either with **no arguments or with arguments that specify your own folders instead of the defaults**, to convert the data and create the two folders first. Then call `load_classification_data` with the paths of the folders created by the first method. This alternative is provided for those who intend to work with **their own folder paths rather than the provided defaults.**

*In this notebook, we use the first method, which is simpler and preferred.*

In [2]:
deft_loader = DeftCorpusLoader("deft_corpus/data")
trainframe, devframe = deft_loader.load_classification_data()

### Exploring dataset 
- There are two columns: `Sentence`, which holds the sentence text, and `HasDef`, a binary flag (`0` or `1`) indicating whether the sentence is a definition.
- There are **18,157 instances for training** and **865 instances for development** (used for testing here).

In [3]:
deft_loader.explore_data(trainframe, "train")
deft_loader.explore_data(devframe, "dev")


Head of  train  Dataframe:
                                            Sentence  HasDef
0   3918 . You may recall that 6 x 6 = 36 , 6 x 7...       0
1   Memorizing these facts is rehearsal . Another...       1
2   Chunking is useful when trying to remember in...       0
3   3921 . Use elaborative rehearsal : In a famou...       1
4      Their theory is called levels of processing .       0
Number of instances of  train is 18157

Head of  dev  Dataframe:
                                            Sentence  HasDef
0   309 . Both photosystems have the same basic s...       1
1   Each photosystem is serviced by the light - h...       0
2   The absorption of a single photon or distinct...       0
3   390 . Mistakes in the duplication or distribu...       0
4   To prevent a compromised cell from continuing...       0
Number of instances of  dev is 865


### Exploring classes and classification problem
- There are **12,143 instances** of class label `0`: the sentence is not a definition.
- There are **6,014 instances** of class label `1`: the sentence is a definition.
- Problem type: **binary classification.**
- The data shows a clear **class imbalance**: roughly two non-definition sentences for every definition.

In [4]:
trainframe.HasDef.value_counts()

0    12143
1     6014
Name: HasDef, dtype: int64
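
The notebook does not say how the imbalance is handled later. As a minimal sketch of one common option, the counts above can be turned into per-class weights using the "balanced" heuristic `n_samples / (n_classes * class_count)`. The helper name `balanced_class_weights` is hypothetical and is not part of `DeftCorpusLoader`.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {label: weight} so that rarer classes get larger weights.

    Hypothetical helper for illustration only; uses the common
    n_samples / (n_classes * class_count) heuristic.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    n_classes = len(counts)
    return {label: total / (n_classes * count) for label, count in counts.items()}


# Rebuild the label distribution reported by value_counts() above.
labels = [0] * 12143 + [1] * 6014
weights = balanced_class_weights(labels)

# Share of definition sentences in the training split.
positive_share = labels.count(1) / len(labels)
print(weights, round(positive_share, 3))
```

With these weights, each class contributes the same total weight to a weighted loss, so the minority class `1` is not drowned out by class `0`.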

## Preprocessing dataset using spaCy
- Tokenizing corpus sentences into word tokens.
- Lemmatizing each token.
- Lowercasing each token.
- Removing stop words, punctuation, spaces and non-alphanumeric characters.
- Adding a column to the dataframe that holds the preprocessed tokens produced by the rules above.

In [None]:
deft_loader.preprocess_data(trainframe)
trainframe
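
The preprocessing rules listed above can be sketched with spaCy token attributes. This is an illustration under stated assumptions, not the actual body of `preprocess_data`: the function names `clean_tokens` and `preprocess_sentences` are hypothetical, and the model name `en_core_web_sm` is assumed to be installed.

```python
from typing import Iterable, List


def clean_tokens(doc: Iterable) -> List[str]:
    """Return lowercased lemmas, dropping stop words, punctuation,
    whitespace tokens and anything that is not purely alphanumeric.

    `doc` is expected to be a spaCy Doc (or any iterable of objects
    exposing lemma_, is_stop, is_punct and is_space).
    """
    kept = []
    for token in doc:
        if token.is_stop or token.is_punct or token.is_space:
            continue
        lemma = token.lemma_.lower()
        if not lemma.isalnum():
            continue
        kept.append(lemma)
    return kept


def preprocess_sentences(sentences, model_name="en_core_web_sm"):
    """Tokenize and clean each sentence with a spaCy pipeline.

    The model name is an assumption; spaCy is imported lazily so that
    clean_tokens stays usable without it.
    """
    import spacy

    nlp = spacy.load(model_name)
    return [clean_tokens(doc) for doc in nlp.pipe(sentences)]
```

Under these assumptions, a call such as `preprocess_sentences(trainframe["Sentence"].tolist())` would yield one token list per sentence, which could then be assigned to a new dataframe column.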