## Loading and saving data
Often your analysis will start with loading a dataset. Some of the most common file types for datasets are Excel, csv, mat and npy. In this section we will see how to load and save each of them.

### .mat files
If you're used to working with Matlab, or if you're using Matlab for another part of your data collection or analysis, you might be using .mat files to save your data. .mat files have a very specific organisation that is most similar to a dictionary in Python: each array of data has a name (key) and values. You need the scipy io (input/output) package to load them.

In [2]:
import scipy.io as sio
import os # We will use this to properly make our path to the data

In [3]:
data = sio.loadmat(os.path.join('data', 'cellTypes.mat'))
type(data)

dict

You can see that the data is loaded into a dict. This always happens, even if your .mat-file only contains one array. That is because scipy also adds '__header__', '__version__' and '__globals__' to the dict. These contain information about the .mat-file itself. We're not going to worry too much about them now. Let's find out which datasets are in this .mat-file:

In [4]:
data.keys()

dict_keys(['__header__', '__version__', '__globals__', 'allPC_ps', 'allSeq_ps', 'allPCs', 'allSeqs'])

The datasets are the ones after '__globals__'. You can access them the same way you would values in a 'normal' dict:

In [18]:
all_pcs = data['allPCs']
print(type(all_pcs))
print(all_pcs.shape)

<class 'numpy.ndarray'>
(74, 1)


You can see that the data is stored in a numpy array. You can now start your analysis the way you always would.

If you have a dataset that you would like to save as a .mat-file, scipy also has a function for that. Your data always has to be in a dictionary, because each array needs a name. You can either create the dictionary beforehand, or make one while you are saving out your data, like this:

In [6]:
sio.savemat(os.path.join('data', 'all_pcs.mat'), {'all_pcs':all_pcs})
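
As a rough sketch of the whole round trip, the example below saves a small made-up array to a temporary .mat-file, loads it again and keeps only the dataset keys. The array `values` and the file name `example.mat` are assumptions for illustration, not part of the course data.

```python
import os
import tempfile

import numpy as np
import scipy.io as sio

# Hypothetical example data: a plain 1-D array with three values.
values = np.array([1.0, 2.0, 3.0])

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'example.mat')  # hypothetical file name
    sio.savemat(path, {'values': values})
    loaded = sio.loadmat(path)

# Keep only the datasets by dropping metadata keys such as '__header__'.
datasets = {key: value for key, value in loaded.items()
            if not key.startswith('__')}

print(list(datasets.keys()))
# The 1-D array comes back as a 2-D array, because Matlab arrays
# always have at least two dimensions.
print(values.shape, datasets['values'].shape)
```

This is also why an array loaded from a .mat-file can have a shape like (74, 1). If you prefer a 1-D array, you can call `.ravel()` on the result or pass `squeeze_me=True` to `loadmat`.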

### .csv files

csv (comma separated values) is a common way of saving out data. The values in a .csv-file are always separated by a delimiter, which can be a number of different characters, but is most commonly a comma. A newline usually indicates a new row, although a quoted value can itself contain a line break. You can specify the delimiter when you are loading your data.
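
To make these two points concrete, here is a minimal sketch that uses Python's built-in csv package on a made-up piece of text (the contents of `text` are an assumption for illustration). The values are separated by semicolons, and one quoted value contains a line break.

```python
import csv
import io

# Hypothetical file contents: semicolon-delimited, with a quoted value
# that contains a line break.
text = 'name;note\nmouse1;"first line\nsecond line"\n'

# io.StringIO lets us treat the string as if it were an open file.
rows = list(csv.reader(io.StringIO(text, newline=''), delimiter=';'))

print(len(rows))  # 2 rows, even though the text has 3 lines
print(rows[1])
```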

#### reading and writing using csv package
There are a couple of ways to load csv files. The most basic is using the csv package. This package contains a reader, which allows you to read your csv file row by row. You can use this to save the rows into a list or array.

In [7]:
import csv

In [8]:
loco_data = []
with open(os.path.join('data','loco.csv'), newline='') as csvfile:
    csvreader = csv.reader(csvfile, delimiter=',', quotechar='|')
    next(csvreader)
    for row in csvreader:
        loco_data.append(float(row[1]))
        
loco_data[:10]

[6144.436,
 878.409,
 372.03,
 372.051,
 373.168,
 373.302,
 373.339,
 373.72,
 372.374,
 373.15]

In the code above we use the keyword 'with', which you can use whenever you are working with files. The file is opened on the first line ('open') and is closed automatically when the with block ends, even if an error occurs inside it. This prevents you from accumulating open files, which use up system resources and can keep other programs from accessing them.

We then use the csv.reader function, which returns an iterator. That means it can step through the file row by row, but doesn't hold the entire file in memory. Again, this is a good trick to save memory.

We can use a for-loop on an iterator like we would on a list (both are iterables, although a list is not itself an iterator). We then save the second column of each row to the loco_data list as a float. Note that this file has a header row. We skip it by calling the 'next' function on the iterator before the loop; otherwise the header would also be passed to float.
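
The behaviour described above can be checked with a small sketch. It writes a tiny made-up csv file to a temporary folder (the file name `example.csv` and its two data rows are assumptions for illustration), reads it back, and then inspects the file object and the iterator.

```python
import csv
import os
import tempfile

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, 'example.csv')  # hypothetical file name

    # Write a small file with a header row and two data rows.
    with open(path, mode='w', newline='') as csvfile:
        csvfile.write('X,Y\n1,6144.436\n2,878.409\n')

    with open(path, newline='') as csvfile:
        csvreader = csv.reader(csvfile)
        header = next(csvreader)  # take the first row off the iterator
        values = [float(row[1]) for row in csvreader]
        leftover = list(csvreader)  # the iterator is now used up

print(csvfile.closed)  # True: the file was closed when the with block ended
print(header)          # ['X', 'Y']
print(values)          # [6144.436, 878.409]
print(leftover)        # []
```

Unlike a list, the reader can only be walked through once. If you need the rows again, you have to open the file and create a new reader.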

We can also write csv files in much the same way:

In [9]:
import csv
with open(os.path.join('data', 'cleaned_loco.csv'), mode='w', newline='') as csvfile:
    csvwriter = csv.writer(csvfile, delimiter=',',
                            quotechar='|', quoting=csv.QUOTE_MINIMAL)
    for row in loco_data:
        csvwriter.writerow([row])

Note that we use the parameter 'mode', which tells Python what we aim to do with the file we're opening. In this case, we are just writing to it (w), but we could also read (r), write and read (w+) or append (a).
In the writerow function, we use row, but make it a list by putting square brackets around it. This is because the function iterates through all the values in row and puts each of them in its own column. Therefore row has to be an iterable.
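
A quick way to see why the square brackets matter is to write a string with and without them. This is a minimal sketch that writes into an in-memory buffer instead of a real file; the values are made up for illustration.

```python
import csv
import io

# An in-memory text buffer that behaves like a file opened for writing.
buffer = io.StringIO(newline='')
csvwriter = csv.writer(buffer, delimiter=',')

csvwriter.writerow('abc')        # a string is iterable: one column per character
csvwriter.writerow(['abc'])      # a list with one item: a single column
csvwriter.writerow([1, 372.03])  # a list with two items: two columns

lines = buffer.getvalue().splitlines()
print(lines)  # ['a,b,c', 'abc', '1,372.03']
```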

#### reading and writing using pandas
You can also directly import csv-files into a pandas DataFrame using the 'read_csv' function. As above, you can tell it what the delimiter should be. However, this function has a bunch of extra options, such as whether the original file has a header and what the names of the columns should be. You can find a full list here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html?highlight=csv#pandas.read_csv

In [10]:
import pandas as pd

In [11]:
loco_data = pd.read_csv(os.path.join('data', 'loco.csv'), delimiter=',', header=0, index_col=0)
loco_data.head()

X,Y
1,6144.436
2,878.409
3,372.03
4,372.051
5,373.168
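
The extra options become useful when a file does not follow the defaults. As a sketch, assume a made-up file that uses semicolons and has no header row (the contents of `text` below are an assumption, not the real loco.csv). We then supply the column names ourselves.

```python
import io

import pandas as pd

# Hypothetical file contents: semicolon-delimited and without a header row.
text = '1;6144.436\n2;878.409\n3;372.03\n'

frame = pd.read_csv(
    io.StringIO(text),  # a file path would work here as well
    sep=';',            # 'delimiter' is an alias for 'sep'
    header=None,        # the file has no header row
    names=['X', 'Y'],   # so we provide the column names
    index_col='X',      # use the X column as the index
)
print(frame)
```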


You can write to a csv-file using the function 'to_csv'. Again you can tell it the delimiter (sep), and you can also choose whether or not to write the header and the indices. Find all options here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html

In [12]:
loco_data.to_csv(os.path.join('data', 'cleaned_loco.csv'), sep=',', index=True)

#### reading and writing using numpy
The last way to import a csv-file is to read it directly into a numpy array. For this you will need the 'genfromtxt' function. This function works on any text file (and a csv file is just a special text file). Again you can give it options such as the delimiter, whether to skip the header and what to do with missing values. See a full list here: https://docs.scipy.org/doc/numpy/reference/generated/numpy.genfromtxt.html

In [13]:
import numpy as np

In [14]:
loco_data = np.genfromtxt(os.path.join('data', 'loco.csv'), delimiter=',', skip_header=1, usecols=1)
loco_data

array([ 6144.436,   878.409,   372.03 , ..., 20811.674, 20811.088,
       20810.377])
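
To illustrate the option for missing values, here is a small sketch on made-up text in which one Y value is missing (the contents of `text` are an assumption for illustration). By default the gap becomes nan; with 'filling_values' you can choose a replacement.

```python
import io

import numpy as np

# Hypothetical file contents: the second data row has no Y value.
text = 'X,Y\n1,6144.436\n2,\n3,372.03\n'

# Default behaviour: the missing value is read as nan.
data = np.genfromtxt(io.StringIO(text), delimiter=',', skip_header=1)
print(data.shape)
print(np.isnan(data[1, 1]))

# Replace missing values with 0.0 instead.
filled = np.genfromtxt(io.StringIO(text), delimiter=',', skip_header=1,
                       filling_values=0.0)
print(filled[1, 1])
```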

Numpy again uses a function for text-files to save out csv files: 'savetxt'. The function does not know about csv specifically, so you have to give the filename a .csv extension yourself. Again you can tell it what to use as delimiter, header, etc. You can find more information here: https://docs.scipy.org/doc/numpy/reference/generated/numpy.savetxt.html

In [15]:
np.savetxt(os.path.join('data', 'cleaned_loco.csv'), loco_data, delimiter=',')
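
One detail to watch out for when you add a header with 'savetxt': by default numpy puts '# ' in front of it, because it treats the header as a comment. The sketch below writes a small made-up array to an in-memory buffer and switches that prefix off with `comments=''`. The array and the column names are assumptions for illustration.

```python
import io

import numpy as np

# Hypothetical data: two rows with an X and a Y column.
values = np.array([[1, 6144.436], [2, 878.409]])

buffer = io.StringIO()  # stands in for a file path
np.savetxt(buffer, values, delimiter=',',
           header='X,Y',  # column names for the first line
           comments='',   # no '# ' in front of the header
           fmt='%.3f')    # three decimals for every value

lines = buffer.getvalue().splitlines()
print(lines)  # ['X,Y', '1.000,6144.436', '2.000,878.409']
```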

### Excel files

If you have an Excel file, the simplest way to load it is often to save it as a csv file and then load it using one of the methods above. However, if you really want to use the Excel file, you can do so using pandas. The 'read_excel' function takes the sheet as input too, so you can load a specific sheet.

In [16]:
data_file = pd.read_excel(io=os.path.join('data', 'loco.xlsx'), sheet_name='loco')
data_file.head()

Unnamed: 0,X,Y
0,1,6144.436
1,2,878.409
2,3,372.03
3,4,372.051
4,5,373.168


You can also save out a pandas DataFrame into an Excel sheet using 'to_excel'. Again, I would recommend choosing a csv format, because it is more versatile.

In [17]:
data_file.to_excel(os.path.join('data','cleaned_loco.xlsx'))

### npy files

Numpy has its own file format for saving out numpy arrays. The 'save' function writes an array to a file with a .npy extension (the extension is added if you leave it out), and the 'load' function reads such a file back in.

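Both functions have an option to 'pickle' your data, which is Python's way of serializing arbitrary objects so they can be saved. This is only needed for arrays that contain Python objects. For ordinary numeric arrays it is unnecessary, and loading pickled data from a source you do not trust is discouraged, because it can run arbitrary code.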

In [20]:
data = np.load(os.path.join('data','celltypes.npy'))

In [22]:
print(type(data))
data[1:10]

<class 'numpy.ndarray'>


array([[0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0],
       [0]], dtype=uint8)

In [23]:
np.save(os.path.join('data','celltypes_saved.npy'), all_pcs)
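
To show what the pickle option means in practice, here is a sketch that saves one numeric array and one array of Python objects to a temporary folder. The file names and contents are made up for illustration, and the example assumes a reasonably recent numpy (1.16.3 or later), where 'load' refuses pickled data unless you explicitly allow it.

```python
import os
import tempfile

import numpy as np

# A normal numeric array and an array holding arbitrary Python objects.
numeric = np.arange(5, dtype=np.uint8)
mixed = np.empty(2, dtype=object)
mixed[0] = {'cell': 'PC'}
mixed[1] = [1, 2, 3]

with tempfile.TemporaryDirectory() as tmpdir:
    numeric_path = os.path.join(tmpdir, 'numeric.npy')  # hypothetical names
    mixed_path = os.path.join(tmpdir, 'mixed.npy')
    np.save(numeric_path, numeric)
    np.save(mixed_path, mixed)  # object arrays are pickled when saved

    # Numeric arrays load without any pickling.
    reloaded = np.load(numeric_path)

    # Object arrays are refused unless pickling is allowed explicitly.
    try:
        np.load(mixed_path)
        refused = False
    except ValueError:
        refused = True

    # Only do this for files that you created yourself or fully trust.
    trusted = np.load(mixed_path, allow_pickle=True)

print(reloaded)
print(refused)
print(trusted[0])
```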

### Other formats

You might run into data that is not in any of the formats mentioned above. If this is the case, the best thing to do is to search for something like 'load X file python'. Usually there is a specific library for it, or your file type is a derivative of one of the above and you can use one of these functions to load your data.