# Importing data into Pandas from files

This part of the Pandas examples covers basic imports from various file formats. Later parts show how to import a subset of the data and how to set the index during import.

The Pandas documentation is a very good learning resource with good examples. The current documentation is located at https://pandas.pydata.org/pandas-docs/stable/

Import the basic modules used for processing data. It is usual to import Pandas and NumPy.

In [24]:
import pandas as pd
import numpy as np

## Import from CSV

The data used in the examples were downloaded from https://github.com/vincentarelbundock/Rdatasets

Import data from a CSV file in the data folder. The file SLID.csv contains data from the Survey of Labour and Income Dynamics.

In [25]:
csv_df = pd.read_csv('./data/SLID.csv')

Display the first five records of the imported CSV file with the head function.

In [3]:
csv_df.head()

,Unnamed: 0,wages,education,age,sex,language
0,1,10.56,15.0,40,Male,English
1,2,11.0,13.2,19,Male,English
2,3,,16.0,49,Male,Other
3,4,17.76,14.0,46,Male,Other
4,5,,8.0,71,Male,English


The output shows that the first column in the CSV file has no name. It holds row numbers, and Pandas imports it as a regular column called Unnamed: 0. To leave it out, we can list the columns we want with the usecols parameter and check the result with the head function.

In [4]:
csv1_df = pd.read_csv('./data/SLID.csv', usecols=['wages','education','age','sex','language'])
csv1_df.head()

,wages,education,age,sex,language
0,10.56,15.0,40,Male,English
1,11.0,13.2,19,Male,English
2,,16.0,49,Male,Other
3,17.76,14.0,46,Male,Other
4,,8.0,71,Male,English
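
An alternative to listing columns with `usecols` is to tell `read_csv` that the first column is the row index, using `index_col=0`. The sketch below is only an illustration: it assumes the file starts with an unnamed column, and it reads a small inline sample built from the rows shown above rather than the real SLID.csv, so it runs without the data folder.

```python
import io
import pandas as pd

# Inline sample mirroring the first rows shown above (illustrative, not the real file)
sample_csv = io.StringIO(
    ",wages,education,age,sex,language\n"
    "1,10.56,15.0,40,Male,English\n"
    "2,11.0,13.2,19,Male,English\n"
    "3,,16.0,49,Male,Other\n"
)

# index_col=0 uses the unnamed first column as the row index
indexed_df = pd.read_csv(sample_csv, index_col=0)

print(indexed_df.columns.tolist())
print(indexed_df.index.tolist())
```

With this approach the row numbers from the file are kept as the index instead of being discarded, which may or may not be what you want.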


Display basic information about the imported columns, such as the number of non-null values and the datatypes, with the info function.

In [5]:
csv1_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7425 entries, 0 to 7424
Data columns (total 5 columns):
wages        4147 non-null float64
education    7176 non-null float64
age          7425 non-null int64
sex          7425 non-null object
language     7304 non-null object
dtypes: float64(2), int64(1), object(2)
memory usage: 232.1+ KB


The dataset contains 7425 observations and several datatypes. When Pandas imports a file, it infers the datatype of each column from its values. Text columns are imported as object. A numeric column that contains missing values is imported as float64 even if its other values are whole numbers, because the missing-value marker NaN is a floating-point value. Here age has no missing values, so it stays int64.
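
The effect of missing values on the inferred datatype can be seen on a tiny example. The sketch below uses a hypothetical three-row sample (not the SLID data) in which the age column has one missing value. It also shows one possible way to keep whole numbers, by requesting the nullable `Int64` dtype through the `dtype` parameter of `read_csv`.

```python
import io
import pandas as pd

# Hypothetical three-row sample: 'age' has one missing value
sample_csv = "age,language\n40,English\n,Other\n71,\n"

# Default inference: the missing value forces 'age' to float64
inferred_df = pd.read_csv(io.StringIO(sample_csv))
print(inferred_df.dtypes)

# Requesting the nullable integer dtype keeps 'age' as whole numbers
nullable_df = pd.read_csv(io.StringIO(sample_csv), dtype={"age": "Int64"})
print(nullable_df.dtypes)
```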

## Importing other formats into Pandas

### Importing from SAS

Pandas can import SAS datasets. The sample datasets were downloaded from
http://www.principlesofeconometrics.com/sas.htm

The next cells import the dataset and display information about it.

In [6]:
sas_df = pd.read_sas('./data/airline.sas7bdat')

In [7]:
sas_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 32 entries, 0 to 31
Data columns (total 6 columns):
YEAR    32 non-null float64
Y       32 non-null float64
W       32 non-null float64
R       32 non-null float64
L       32 non-null float64
K       32 non-null float64
dtypes: float64(6)
memory usage: 1.5 KB


### Import from Stata

Data can also be imported from the Stata format. The imported data is from
http://www.principlesofeconometrics.com/stata.htm

The next cells import the data and display information about it.

In [8]:
stata_df = pd.read_stata('./data/cars.dta')

In [9]:
stata_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 392 entries, 0 to 391
Data columns (total 4 columns):
mpg    392 non-null float32
cyl    392 non-null float32
eng    392 non-null float32
wgt    392 non-null float32
dtypes: float32(4)
memory usage: 9.2 KB


### Import from JSON

Data can also be stored in JSON format. This is a test database containing three product records in JSON format.

In [10]:
json_df = pd.read_json('./data/database.json')

In [11]:
json_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 5 columns):
_id             3 non-null object
product_name    3 non-null object
quantity        3 non-null int64
supplier        3 non-null object
unit_cost       3 non-null object
dtypes: int64(1), object(4)
memory usage: 112.0+ bytes


In [12]:
json_df.head()

,_id,product_name,quantity,supplier,unit_cost
0,{'$oid': '5968dd23fc13ae04d9000001'},sildenafil citrate,261,Wisozk Inc,$10.47
1,{'$oid': '5968dd23fc13ae04d9000002'},Mountain Juniperus ashei,292,Keebler-Hilpert,$8.74
2,{'$oid': '5968dd23fc13ae04d9000003'},Dextromathorphan HBr,211,Schmitt-Weissnat,$20.53
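
The output above shows two columns that usually need cleaning after a JSON import: _id holds a dictionary per row, and unit_cost is text because of the dollar sign. The sketch below is one possible clean-up, not part of the original notebook. It rebuilds two of the rows in memory (values copied from the output above) so that it runs without database.json.

```python
import pandas as pd

# Two rows copied from the output above, built in memory instead of read from the file
json_sample_df = pd.DataFrame({
    "_id": [{"$oid": "5968dd23fc13ae04d9000001"}, {"$oid": "5968dd23fc13ae04d9000002"}],
    "product_name": ["sildenafil citrate", "Mountain Juniperus ashei"],
    "quantity": [261, 292],
    "unit_cost": ["$10.47", "$8.74"],
})

# Pull the identifier string out of each {'$oid': ...} dictionary
json_sample_df["_id"] = json_sample_df["_id"].map(lambda doc: doc["$oid"])

# Strip the leading dollar sign and convert the price to a float
json_sample_df["unit_cost"] = (
    json_sample_df["unit_cost"].str.replace("$", "", regex=False).astype(float)
)

print(json_sample_df.dtypes)
```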


### Import from pickle

One useful format in Python is pickle. More information on this format can be found at https://docs.python.org/3/library/pickle.html. The format is native to Python and is sometimes used to transfer data from one computer to another.

In [13]:
pickle_df = pd.read_pickle('./data/test.pickle')

In [14]:
pickle_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 392 entries, 0 to 391
Data columns (total 4 columns):
mpg    392 non-null float32
cyl    392 non-null float32
eng    392 non-null float32
wgt    392 non-null float32
dtypes: float32(4)
memory usage: 9.2 KB
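
A pickle file stores the DataFrame as it is, so the index and the datatypes (float32 above) come back unchanged. The following round trip is a minimal sketch of that behaviour. The frame and the file name are hypothetical, and a temporary directory is used so that nothing is written to the data folder.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical frame with float32 columns, like the ones seen above
cars_sample_df = pd.DataFrame(
    {"mpg": [18.0, 15.0], "cyl": [8.0, 8.0]}, dtype=np.float32
)

with tempfile.TemporaryDirectory() as tmp_dir:
    pickle_path = os.path.join(tmp_dir, "cars_sample.pickle")
    cars_sample_df.to_pickle(pickle_path)      # write the frame
    restored_df = pd.read_pickle(pickle_path)  # read it back

# Values, index and datatypes are compared in one call
print(restored_df.dtypes)
print(restored_df.equals(cars_sample_df))
```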


### Import from Excel

MS Excel is one of the most useful file formats for Pandas to support. The following data were generated for this purpose.

In [15]:
excel_df = pd.read_excel('./data/TestExcelData.xlsx', sheet_name='data2')

In [16]:
excel_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 3 columns):
ID           100 non-null int64
Variable1    100 non-null int64
Variable2    100 non-null int64
dtypes: int64(3)
memory usage: 2.4 KB


It is sometimes useful to read a small sample from a larger dataset. This can be done with the nrows parameter. The following example reads the first 10 rows from the selected sheet of the Excel file.

It also shows how to use a column from our data as the index, instead of the one generated automatically by Pandas. A meaningful index is useful when merging Pandas DataFrames and Series.

In [17]:
excel1_df = pd.read_excel('./data/TestExcelData.xlsx', sheet_name='data', nrows=10, index_col='ID')

In [18]:
excel1_df.head()

ID,Variable1,Variable2
1,147,151
2,15,1
3,209,118
4,170,97
5,116,11


In [19]:
excel1_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 10 entries, 1 to 10
Data columns (total 2 columns):
Variable1    10 non-null int64
Variable2    10 non-null int64
dtypes: int64(2)
memory usage: 240.0 bytes
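
The nrows and index_col parameters work the same way in `read_csv`, which makes them easy to try without an Excel file. The sketch below assumes a hypothetical inline sample in the same ID/Variable1/Variable2 layout (the sixth row and the `labels_df` table are invented for illustration). It also shows why a meaningful index helps: `join` aligns two tables on their shared index.

```python
import io
import pandas as pd

# Hypothetical data in the same ID/Variable1/Variable2 layout;
# the sixth row is invented
sample_csv = (
    "ID,Variable1,Variable2\n"
    "1,147,151\n2,15,1\n3,209,118\n4,170,97\n5,116,11\n6,12,34\n"
)

# Read only the first three data rows and use ID as the index
first_rows_df = pd.read_csv(io.StringIO(sample_csv), nrows=3, index_col="ID")

# A second hypothetical table keyed by the same ID values
labels_df = pd.DataFrame(
    {"label": ["a", "b", "c"]}, index=pd.Index([1, 2, 3], name="ID")
)

# join aligns the rows of both tables on the shared index
joined_df = first_rows_df.join(labels_df)
print(joined_df)
```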


# Exporting data

The principles that apply to importing data also apply to exporting it to different formats. To export data we use the to_* methods.

### Export to CSV

The following example exports the JSON data to CSV format.

In [20]:
json_df.to_csv('./output/json.csv')

Exporting files is simple, but there are some useful options. One of them is dropping the Pandas index from the output with index=False.

In [21]:
json_df.to_csv('./output/json_no_index.csv', index=False)
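
To see what `index=False` changes, it helps to look at the generated text directly. The sketch below relies on the fact that `to_csv` returns the CSV as a string when no path is given. The two-row frame is hypothetical and stands in for json_df.

```python
import pandas as pd

# Hypothetical two-row frame standing in for json_df
export_df = pd.DataFrame({"product_name": ["A", "B"], "quantity": [261, 292]})

# With no path, to_csv returns the CSV text instead of writing a file
with_index = export_df.to_csv()
without_index = export_df.to_csv(index=False)

print(with_index)     # first column holds the index, under an empty header
print(without_index)  # only the data columns are written
```

The empty header of the index column is what turns into an Unnamed: 0 column when such a file is read back with `read_csv`.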

### Export to Excel

Many formats can be used for exporting data, but Excel is one of the most popular for sharing data.

The first example shows a simple export of one dataset into one file.

In [22]:
pickle_df.to_excel('./output/SingleExcel.xlsx', index = False)

Another useful way to write data from Pandas to Excel is ExcelWriter. With it you can write multiple sheets to one Excel file and, with the XlsxWriter engine, add charts and many other features to the exported data.

More information is available in the XlsxWriter documentation on working with Pandas: https://xlsxwriter.readthedocs.io/working_with_pandas.html

In [23]:
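# The context manager saves and closes the file on exit
# (ExcelWriter.save() was removed in pandas 2.0)
with pd.ExcelWriter('./output/Multiple_export.xlsx', engine='xlsxwriter') as excel_writer:
    # csv_df holds the SLID data, so the sheet is named accordingly
    csv_df.to_excel(excel_writer, sheet_name='SLID')
    excel1_df.to_excel(excel_writer, sheet_name='Test data')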