# Importing data from various file types

Data scientists rarely get all their data from a single source, and the data arrives in many different file formats. In this article we will see how to import data from the following file types.

- Excel spreadsheets
- SAS files
- Stata files
- HDF5 files
- MATLAB files

### Excel spreadsheets

Excel files are so widespread that they need no introduction. A workbook generally consists of several sheets. There are many ways to import an Excel file; here we will use pandas.

In [22]:
# import the pandas package
import pandas as pd

# assign the file name to a string variable
file = 'battledeath.xlsx'

# open the Excel file with pandas
data = pd.ExcelFile(file)

An Excel file often has several sheets. We can list their names with the following command.

In [23]:
print(data.sheet_names)

['2002', '2004']


To load data from a particular sheet, call the .parse() method with either the sheet name or the sheet index. pandas tells the two apart by type: a string is treated as a sheet name and an integer as a zero-based sheet position.

In [24]:
df1 = data.parse("2002")

In [25]:
df2 = data.parse(0)
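
When every sheet is needed, the .parse() call above can be repeated for each name in sheet_names. The following is a minimal sketch, not part of the original notebook; the helper name load_all_sheets is hypothetical, and it assumes it is handed an already opened pd.ExcelFile such as data.

```python
def load_all_sheets(workbook):
    """Return a dict mapping each sheet name to its parsed contents.

    `workbook` is assumed to be an open pandas.ExcelFile (any object that
    exposes `sheet_names` and `parse()` behaves the same way here).
    """
    # One parse() call per sheet, keyed by the sheet's name.
    return {name: workbook.parse(name) for name in workbook.sheet_names}
```

With the workbook above, sheets = load_all_sheets(data) would give a dictionary with the keys '2002' and '2004'. pandas can also do this in a single call: pd.read_excel(file, sheet_name=None) returns a dictionary of all sheets.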

### SAS or Stata files

SAS stands for *"Statistical Analysis System"*, whereas Stata is a contraction of *"Statistics"* and *"Data"*. The former is widely used in business analytics and biostatistics, while the latter is popular in academic social science research such as economics and epidemiology.

SAS files matter because SAS is a software suite that is widely used for the following.
- Advanced analytics
- Multivariate analysis
- Business intelligence
- Data Management
- Predictive analytics
- Standard for computational analysis

The most common file extensions for SAS files are **.sas7bdat** and **.sas7bcat**, which are dataset files and catalog files respectively. Dataset files can be imported as follows.

In [26]:
from sas7bdat import SAS7BDAT

with SAS7BDAT('sales.sas7bdat') as file:
    df_sas = file.to_data_frame()

Stata files have the extension **.dta**, and we can import them using pandas as follows.

In [27]:
data = pd.read_stata('disarea.dta')
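
If disarea.dta is not at hand, read_stata can still be tried out by writing a small DataFrame to a temporary .dta file and reading it back. This is only a sketch: the file name demo.dta and the column names and values are made up for illustration and are not taken from the dataset above.

```python
import os
import tempfile

import pandas as pd

# Made-up illustrative data, not taken from disarea.dta.
original = pd.DataFrame({"year": [2002, 2004], "deaths": [1.5, 2.25]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.dta")  # hypothetical file name
    # write_index=False keeps the row index out of the Stata file
    original.to_stata(path, write_index=False)
    # read the file back the same way as in the cell above
    restored = pd.read_stata(path)

print(restored)
```

The values survive the round trip, although integer columns may come back with a narrower dtype than the one they started with, because Stata has no 64-bit integer type.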

### HDF5 files

HDF5 stands for *"Hierarchical Data Format version 5"*. It is a standard mechanism for storing large quantities of numerical data. How large are we talking? It is relatively common to deal with datasets of hundreds of gigabytes or even terabytes, and HDF5 itself can scale to exabytes.

We need the h5py package to extract data from such files.

In [28]:
import h5py

In [29]:
data = h5py.File('L-L1_LOSC_4_V1-1126259446-32.hdf5','r')  # 'r' is to read

In [30]:
#check the type of variable 'data'
print(type(data))

<class 'h5py._hl.files.File'>


**The structure of HDF5 files**

In [31]:
for key in data.keys():
    print(key)

meta
quality
strain


As you can see above, there are 3 keys. Each of these is an HDF5 group. You can think of these groups as directories.

- meta : Meta-data for the file
- quality : Refers to data quality
- strain : This is the data of interest.
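
Because groups behave like directories, the whole hierarchy can be listed with a short recursive walk. The sketch below is an illustration rather than part of the original notebook: the helper name list_tree is hypothetical, and it assumes only that groups offer a dictionary-style .items() method, as h5py groups do, and that datasets expose a .shape attribute.

```python
def list_tree(group, depth=0):
    """Return one text line per group or dataset found under `group`."""
    lines = []
    indent = "  " * depth
    for name, node in group.items():
        if hasattr(node, "items"):
            # Group-like node: show it as a directory and recurse into it.
            lines.append(f"{indent}{name}/")
            lines.extend(list_tree(node, depth + 1))
        else:
            # Dataset-like node: show its shape when it has one.
            shape = getattr(node, "shape", None)
            lines.append(f"{indent}{name} {shape}")
    return lines
```

Calling print('\n'.join(list_tree(data))) on the file opened above would list the three groups together with whatever datasets each one contains.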

### MATLAB files

MATLAB, short for Matrix Laboratory, is a numerical computing environment that is an industry standard in engineering and science. Many people use MATLAB and save their data as **.mat** files.

The third-party library *scipy* lets us read and write .mat files through its scipy.io module.

In [32]:
import scipy.io

mat = scipy.io.loadmat('ja_data2.mat')

In [33]:
print(type(mat))

<class 'dict'>


As you can see, the data imported from a MATLAB file is stored as a dictionary in the Python workspace. The mapping is as follows.
- keys : MATLAB variable names
- values : objects assigned to the variables
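
Besides the MATLAB variables, the dictionary returned by loadmat also carries a few metadata entries whose names start and end with double underscores, such as __header__, __version__ and __globals__. A small sketch for keeping only the actual variables follows; the helper name matlab_variables is hypothetical.

```python
def matlab_variables(mat):
    """Return only the MATLAB variables from a loadmat() dictionary."""
    # loadmat() adds metadata keys such as '__header__'; skip those.
    return {
        key: value
        for key, value in mat.items()
        if not (key.startswith("__") and key.endswith("__"))
    }
```

For example, sorted(matlab_variables(mat)) lists the variable names stored in ja_data2.mat.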