# Importing Data from other file types

## Introduction to other file types

### Pickled Files
- File type native to Python
- Motivation: many Python data types, such as dictionaries and lists, have no obvious way to be stored in a flat file
- Pickled files are serialized
- Serialize = convert an object to a byte stream

In [None]:
import pickle
with open('pickled_fruit.pkl', 'rb') as file:
    data = pickle.load(file)
print(data)
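
To make the idea of serialization concrete, here is a minimal round-trip sketch. The dictionary contents and the file name `fruit_example.pkl` are made up for illustration and are not part of the course data.

```python
import os
import pickle
import tempfile

# Hypothetical example data: a dict has no obvious flat-file representation
fruit = {"apples": 3, "pears": [1, 2], "ripe": True}

# Write to a temporary directory so the sketch does not touch real files
path = os.path.join(tempfile.mkdtemp(), "fruit_example.pkl")

# Serialize: convert the object to a byte stream and write it to disk
with open(path, "wb") as f:
    pickle.dump(fruit, f)

# Deserialize: read the byte stream back into an equivalent Python object
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == fruit)
```

Note that the file is opened in binary mode (`'wb'` / `'rb'`) in both directions. Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.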

### Importing Excel Spreadsheets

In [None]:
import pandas as pd
file = 'urbanpop.xlsx'
data = pd.ExcelFile(file)
print(data.sheet_names)

df1 = data.parse('sheet name')  # sheet name, as a string
df2 = data.parse(0)  # sheet index, as an integer (0-based)

In [None]:
# xls is assumed to be the ExcelFile object opened in the previous cell
xls = data

# Parse the first sheet and rename the columns: df1
df1 = xls.parse(0, skiprows=[0], names=['Country', 'AAM due to War (2002)'])

# Print the head of the DataFrame df1
print(df1.head())

# Parse the first column of the second sheet and rename the column: df2
df2 = xls.parse(1, usecols=[0], skiprows=[0], names=['Country'])

# Print the head of the DataFrame df2
print(df2.head())

### SAS and Stata files
- SAS: Statistical Analysis System
- Stata: 'Statistics' + 'Data'
- SAS: Business analytics and biostatistics
- Stata: academic social sciences research

SAS Files
Used for:
- Advanced analytics 
- Multivariate analysis
- Business intelligence
- Data management
- Predictive analytics
- Standard for computational analysis

Importing SAS Files

In [None]:
import pandas as pd 
from sas7bdat import SAS7BDAT
with SAS7BDAT('urbanpop.sas7bdat') as file:
    df_sas = file.to_data_frame()

In [None]:
# Import sas7bdat package
from sas7bdat import SAS7BDAT

# Save file to a DataFrame: df_sas
with SAS7BDAT('sales.sas7bdat') as file:
    df_sas = file.to_data_frame()

# Print head of DataFrame
print(df_sas.head())

Importing Stata Files

In [None]:
import pandas as pd
data = pd.read_stata('urbanpop.dta')

In [None]:
# Import pandas
import pandas as pd

# Load Stata file into a pandas DataFrame: df
df = pd.read_stata('disarea.dta')

# Print the head of the DataFrame df
print(df.head())
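
If you do not have a `.dta` file at hand, a round trip through pandas is one way to try `pd.read_stata()`. This is a sketch only: the column names, values, and file name below are invented for illustration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical data, invented for illustration only
df_out = pd.DataFrame({"year": [2000, 2001, 2002], "area": [1.5, 2.0, 3.25]})

path = os.path.join(tempfile.mkdtemp(), "example.dta")

# write_index=False keeps the DataFrame index out of the Stata file
df_out.to_stata(path, write_index=False)

# Read the Stata file back into a DataFrame
df_in = pd.read_stata(path)
print(df_in.head())
```

Integer columns may come back with a narrower integer dtype than they were written with, so compare values rather than dtypes when checking the result.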

### HDF5 files
- Hierarchical Data Format version 5
- Standard for storing large quantities of numerical data
- Datasets can be hundreds of gigabytes or terabytes
- HDF5 can scale up to exabytes

Importing HDF5 Files

In [None]:
import h5py
filename = 'H-H1_LOSC'
data = h5py.File(filename, 'r')
print(type(data))

The structure of this HDF5 file
- meta: Meta-data for the file
- quality: Refers to data quality
- strain: Strain data from the interferometer
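
The hierarchy described above (groups that contain datasets) can be explored on a small file you build yourself. The sketch below reuses the group names `meta` and `strain`, but the dataset `Duration`, all values, and the file name are hypothetical stand-ins rather than real detector data.

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "example.hdf5")

# Build a tiny file with two groups, each holding one dataset
with h5py.File(path, "w") as f:
    meta = f.create_group("meta")
    meta.create_dataset("Duration", data=4096)  # hypothetical scalar value
    strain = f.create_group("strain")
    strain.create_dataset("Strain", data=np.linspace(0.0, 1.0, 5))

# Reopen read-only and walk the hierarchy like a nested dictionary
with h5py.File(path, "r") as f:
    top_level = sorted(f.keys())
    strain_keys = list(f["strain"].keys())
    values = f["strain"]["Strain"][:]  # slicing reads the dataset into a NumPy array

print(top_level)
print(strain_keys)
print(values.shape)
```

Reading with `[:]` inside the `with` block matters: once the file is closed, its datasets can no longer be accessed.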

In [None]:
import numpy as np

# List the datasets stored in the 'meta' group
for key in data['meta'].keys():
    print(key)

# If you're interested in 'Description' and 'Detector'
print(np.array(data['meta']['Description']), np.array(data['meta']['Detector']))

In [None]:
import matplotlib.pyplot as plt
import numpy as np

# Get the HDF5 group: group
group = data['strain']

# Check out keys of group
for key in group.keys():
    print(key)

# Set variable equal to time series data: strain
strain = np.array(data['strain']['Strain'])

# Set number of time points to sample: num_samples
num_samples = 10000

# Set time vector
time = np.arange(0, 1, 1/num_samples)

# Plot data
plt.plot(time, strain[:num_samples])
plt.xlabel('GPS Time (s)')
plt.ylabel('strain')
plt.show()

### MATLAB
- "Matrix Laboratory"
- Industry standard in engineering and science  
- Data saved as .mat files

To import in Python, use SciPy:
- scipy.io.loadmat() - read .mat files
- scipy.io.savemat() - write .mat files
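
The two functions can be tried together in a round trip. This is a minimal sketch: the variable name `ratio`, its values, and the file name are invented for illustration.

```python
import os
import tempfile

import numpy as np
import scipy.io

path = os.path.join(tempfile.mkdtemp(), "example.mat")

# savemat takes a dict mapping MATLAB variable names to arrays
scipy.io.savemat(path, {"ratio": np.arange(6.0).reshape(2, 3)})

# loadmat returns a dict: keys are variable names, values are NumPy arrays
mat = scipy.io.loadmat(path)

# Keys wrapped in double underscores hold file metadata, not variables
user_keys = [k for k in mat if not k.startswith("__")]
print(user_keys)
print(type(mat["ratio"]), mat["ratio"].shape)
```

Filtering out the double-underscore keys is a convenient way to see only the variables that were saved from the MATLAB workspace.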

In [None]:
import scipy.io
filename = 'workspace.mat'
mat = scipy.io.loadmat(filename)

In [None]:
import matplotlib.pyplot as plt
import numpy as np

# Print the keys of the MATLAB dictionary
print(mat.keys())

# Print the type of the value corresponding to the key 'CYratioCyt'
print(type(mat['CYratioCyt']))

# Print the shape of the value corresponding to the key 'CYratioCyt'
print(np.shape(mat['CYratioCyt']))

# Subset the array and plot it
data = mat['CYratioCyt'][25, 5:]
fig = plt.figure()
plt.plot(data)
plt.xlabel('time (min.)')
plt.ylabel('normalized fluorescence (measure of expression)')
plt.show()