# Loading data
***

0. [Running setup](#setup)
0. [Defining function](#func)
0. [Preparing the dictionary](#dict)
0. [Loading the datasets](#load)
0. [Aliasing](#alias)
***

### 0. Running setup <a id='setup'></a>

First of all, we need a small setup: we run a notebook that creates some variables and functions and imports the libraries required for the analysis. For further information, see the **<a href='../conf/setup.ipynb'>configuration file</a>**.

In [1]:
%run ../conf/setup.ipynb

### 1. Defining function<a id='func'></a>

To perform the analysis, we will use data from the `PROMISE` dataset. This collection consists of 8 files, one for each of 8 Java projects.

These are the files to load:

In [2]:
# Skip non-data entries by comparing the first six characters of each name
for file in os.listdir(DATA_PATH):
    if file[:6] not in ('README', '.gitke', '.ipynb'):
        print(file)

velocity.arff
jedit.arff
log4j.arff
xalan.arff
xerces.arff
ant.arff
camel.arff
tomcat.arff
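
The listing above excludes entries by name prefix. A minimal alternative sketch, assuming the data files are the only ones ending in `.arff`, selects them by extension instead. The helper name `list_arff_files` and the temporary folder are hypothetical and used only for illustration:

```python
import tempfile
from pathlib import Path

def list_arff_files(folder):
    """Return the sorted names of the .arff files found directly in folder."""
    return sorted(p.name for p in Path(folder).glob('*.arff'))

# Demo on a throwaway folder that mimics a data directory (made-up content).
with tempfile.TemporaryDirectory() as tmp:
    for name in ('ant.arff', 'camel.arff', 'README.md', '.gitkeep'):
        Path(tmp, name).touch()
    found = list_arff_files(tmp)

print(found)  # ['ant.arff', 'camel.arff']
```

Filtering on the extension keeps working if new auxiliary files are added to the folder, whereas a prefix list has to be updated by hand.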


As we can see, the files we have to load are in `.arff` format. Pandas cannot read these files directly, but the *scipy* library provides a function that does the job. So, let's define a helper function that makes loading our files easier:

In [3]:
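from scipy.io import arff

def load_arff(fileName) -> pd.DataFrame:
    """Load DATA_PATH/<fileName>.arff into a DataFrame."""
    file = os.path.join(DATA_PATH, fileName + '.arff')
    # loadarff returns a (records, metadata) tuple; only the records are needed
    data, _meta = arff.loadarff(file)
    return pd.DataFrame(data)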
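
To see what `loadarff` returns, the following sketch parses a tiny in-memory ARFF document instead of a file on disk. The relation and its two rows are made up for illustration; only the attribute names `wmc` and `defects` are borrowed from the real data:

```python
from io import StringIO

import pandas as pd
from scipy.io import arff

# A made-up ARFF document with two numeric attributes and two rows.
content = """@relation demo
@attribute wmc numeric
@attribute defects numeric
@data
3.0,0.0
5.0,1.0
"""

# loadarff accepts a file-like object and returns (records, metadata).
data, meta = arff.loadarff(StringIO(content))
df = pd.DataFrame(data)

print(meta.names())  # attribute names declared in the header
print(df.shape)      # one row per @data line, one column per attribute
```

The metadata object is not used by `load_arff`, but it can help when checking which attributes a file declares.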

### 2. Preparing the dictionary<a id='dict'></a>

We will store the datasets into a dictionary, structured as:

`{`
`Name of the project`: `DataFrame`
`}`

The main reason for storing the datasets in a dictionary is related to loops: since we have several datasets on which we have to perform the same operations, we can iterate over the dictionary instead of rewriting the same code 8 times.
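
As a rough illustration of this idea, the sketch below applies one operation to every entry of a dictionary. The dictionary `toy` and its values are hypothetical stand-ins (plain lists of defect counts rather than DataFrames):

```python
# Hypothetical stand-in data: each project name maps to a list of defect counts.
toy = {'ant': [0, 0, 1], 'camel': [2, 0], 'jedit': [0]}

# One loop performs the same computation for every project.
defective = {}
for name, defects in toy.items():
    defective[name] = sum(1 for d in defects if d > 0)

print(defective)  # {'ant': 1, 'camel': 1, 'jedit': 0}
```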

To prepare the dictionary, we first create a tuple containing the names of the projects and then an empty dictionary, called `datasets`, that will use those names as keys.

In [4]:
names = ('ant', 'camel', 'jedit', 'log4j', 'tomcat', 'velocity', 'xalan', 'xerces')
datasets = dict()

### 3. Loading the datasets<a id='load'></a>

As said before, we store the datasets in a dictionary, so we can load them with a simple for loop:

In [5]:
for key in names:
    datasets[key] = load_arff(key)
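
The same loop can also be written as a dictionary comprehension. The sketch below uses a hypothetical `fake_loader` in place of `load_arff`, so that it runs without the data files:

```python
def fake_loader(name):
    """Hypothetical stand-in for load_arff: returns a label, not a DataFrame."""
    return f'<data of {name}>'

names = ('ant', 'camel', 'jedit')

# Build the whole dictionary in one expression; keys keep the order of names.
datasets = {name: fake_loader(name) for name in names}

print(datasets['camel'])  # <data of camel>
```

Both forms produce the same dictionary; the explicit loop is easier to extend with extra steps, while the comprehension is more compact.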

Let's check the keys of our dictionary:

In [6]:
list(datasets.keys())

['ant', 'camel', 'jedit', 'log4j', 'tomcat', 'velocity', 'xalan', 'xerces']

### 4. Aliasing<a id='alias'></a>

Now our data is stored in `datasets`.

This means that if we want to access a dataset, for example `ant`, we need to write:

In [7]:
datasets['ant']

Unnamed: 0,wmc,dit,noc,cbo,rfc,lcom,ca,ce,npm,lcom3,...,dam,moa,mfa,cam,ic,cbm,amc,max_cc,avg_cc,defects
0,3.0,1.0,0.0,10.0,18.0,3.0,1.0,9.0,1.0,1.100000,...,0.000000,0.0,0.000000,0.444444,0.0,0.0,32.666667,1.0,0.6667,0.0
1,5.0,2.0,0.0,4.0,13.0,0.0,1.0,4.0,4.0,0.625000,...,1.000000,1.0,0.700000,0.500000,0.0,0.0,13.400000,1.0,0.6000,0.0
2,1.0,2.0,0.0,1.0,3.0,0.0,0.0,1.0,1.0,2.000000,...,0.000000,0.0,1.000000,1.000000,0.0,0.0,6.000000,0.0,0.0000,0.0
3,8.0,1.0,9.0,13.0,20.0,12.0,9.0,4.0,8.0,0.800000,...,0.200000,1.0,0.000000,0.406250,0.0,0.0,11.000000,1.0,0.8750,0.0
4,9.0,3.0,0.0,5.0,26.0,16.0,0.0,5.0,7.0,0.750000,...,1.000000,0.0,0.800000,0.388889,0.0,0.0,19.000000,2.0,1.0000,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
740,92.0,3.0,0.0,34.0,261.0,3726.0,8.0,34.0,81.0,0.960246,...,0.970588,11.0,0.291339,0.112476,2.0,2.0,28.021739,15.0,1.5543,4.0
741,6.0,3.0,6.0,10.0,10.0,3.0,7.0,3.0,6.0,0.400000,...,1.000000,0.0,0.857143,0.500000,0.0,0.0,6.500000,3.0,1.5000,0.0
742,7.0,3.0,5.0,9.0,26.0,0.0,5.0,4.0,6.0,0.000000,...,1.000000,0.0,0.857143,0.314286,1.0,3.0,19.000000,3.0,1.5714,0.0
743,5.0,2.0,0.0,8.0,34.0,8.0,1.0,7.0,3.0,0.500000,...,1.000000,0.0,0.884615,1.000000,0.0,0.0,42.000000,11.0,3.4000,1.0


It is not a big deal, of course, but Python lets us refer to the same object through another name: this is called **aliasing**. Aliased variables refer to the same object in memory, so every in-place change made through one of them is visible through the others. We can alias our datasets in order to access them just by using their names:

In [8]:
for key in names:
    exec(f'{key} = datasets[key]')
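
Creating variables with `exec` works at the top level of a notebook, but it hides the new names from editors and linters. One possible alternative, sketched here with stand-in values rather than the real DataFrames, is to expose the dictionary entries as attributes of a `types.SimpleNamespace`. The name `projects` is hypothetical:

```python
from types import SimpleNamespace

# Stand-in values; in the notebook these would be the loaded DataFrames.
datasets = {'ant': [3.0, 5.0], 'camel': [1.0]}

# Each key becomes an attribute that refers to the very same object.
projects = SimpleNamespace(**datasets)

print(projects.ant is datasets['ant'])  # True
```

This keeps the short access syntax (`projects.ant`) without generating code from strings.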

We can check that our variables are aliased by comparing their identities with the built-in function **id**:

In [9]:
id(ant) == id(datasets['ant'])

True

Or, if we want to check one of the other datasets:

In [10]:
id(camel) == id(datasets['camel'])

True

We can now access a dataset just by using its name:

In [11]:
ant

Unnamed: 0,wmc,dit,noc,cbo,rfc,lcom,ca,ce,npm,lcom3,...,dam,moa,mfa,cam,ic,cbm,amc,max_cc,avg_cc,defects
0,3.0,1.0,0.0,10.0,18.0,3.0,1.0,9.0,1.0,1.100000,...,0.000000,0.0,0.000000,0.444444,0.0,0.0,32.666667,1.0,0.6667,0.0
1,5.0,2.0,0.0,4.0,13.0,0.0,1.0,4.0,4.0,0.625000,...,1.000000,1.0,0.700000,0.500000,0.0,0.0,13.400000,1.0,0.6000,0.0
2,1.0,2.0,0.0,1.0,3.0,0.0,0.0,1.0,1.0,2.000000,...,0.000000,0.0,1.000000,1.000000,0.0,0.0,6.000000,0.0,0.0000,0.0
3,8.0,1.0,9.0,13.0,20.0,12.0,9.0,4.0,8.0,0.800000,...,0.200000,1.0,0.000000,0.406250,0.0,0.0,11.000000,1.0,0.8750,0.0
4,9.0,3.0,0.0,5.0,26.0,16.0,0.0,5.0,7.0,0.750000,...,1.000000,0.0,0.800000,0.388889,0.0,0.0,19.000000,2.0,1.0000,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
740,92.0,3.0,0.0,34.0,261.0,3726.0,8.0,34.0,81.0,0.960246,...,0.970588,11.0,0.291339,0.112476,2.0,2.0,28.021739,15.0,1.5543,4.0
741,6.0,3.0,6.0,10.0,10.0,3.0,7.0,3.0,6.0,0.400000,...,1.000000,0.0,0.857143,0.500000,0.0,0.0,6.500000,3.0,1.5000,0.0
742,7.0,3.0,5.0,9.0,26.0,0.0,5.0,4.0,6.0,0.000000,...,1.000000,0.0,0.857143,0.314286,1.0,3.0,19.000000,3.0,1.5714,0.0
743,5.0,2.0,0.0,8.0,34.0,8.0,1.0,7.0,3.0,0.500000,...,1.000000,0.0,0.884615,1.000000,0.0,0.0,42.000000,11.0,3.4000,1.0


To make a final, definitive check, let's edit a value and see whether the change affects the aliased variable.

Let's use the value in row `0` of column `wmc`:

In [12]:
ant.loc[0, 'wmc']

3.0

This value is currently `3.0`. Let's change it to `4.0` and then check the aliased one.

In [13]:
# .loc writes to the DataFrame in place and avoids chained assignment
ant.loc[0, 'wmc'] = 4.0
datasets['ant'].loc[0, 'wmc']

4.0

We have verified that our aliasing works! Finally, let's restore the original value of `3.0`:

In [14]:
ant.loc[0, 'wmc'] = 3.0
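
One caveat is worth keeping in mind: aliasing survives in-place changes, but not reassignment. The sketch below shows the difference with plain lists, used here as a simplified stand-in for DataFrames:

```python
original = [3.0, 5.0, 1.0]
alias = original              # a second name for the same object

print(alias is original)      # True

alias[0] = 4.0                # in-place change, visible through both names
print(original[0])            # 4.0

alias = [9.0]                 # rebinding: alias now names a different list
print(original)               # [4.0, 5.0, 1.0]
print(alias is original)      # False
```

In the notebook this means that an assignment such as `ant = <some new DataFrame>` would break the link between `ant` and `datasets['ant']`.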