# Data Ingestion
Copyright (c) Microsoft Corporation. All rights reserved.<br>
Licensed under the MIT License.

In [1]:
import azureml.dataprep as dprep

Data Prep can load many types of input data. You can use its auto-reading functionality to detect the type of a file, or specify a file type and its parameters directly.

## Table of Contents
[Read Lines](#Read-Lines)<br>
[Read CSV](#Read-CSV)<br>
[Read Compressed CSV](#Read-Compressed-CSV)<br>
[Read Excel](#Read-Excel)<br>
[Read Fixed Width Files](#Read-Fixed-Width-Files)<br>
[Read Parquet](#Read-Parquet)<br>
[Read JSON](#Read-JSON)<br>
[Read SQL](#Read-SQL)<br>
[Read From ADLS](#Read-From-ADLS)<br>
[Read Pandas DataFrame](#Read-Pandas-DataFrame)<br>

## Read Lines

One of the simplest ways to read data using Data Prep is to just read it as text lines.

In [2]:
dflow = dprep.read_lines(path='../data/crime.txt')
dflow.head(5)

Unnamed: 0,Line
0,10140490 HY329907 7/5/2015 23:50 ...
1,10139776 HY329265 7/5/2015 23:30 ...
2,10140270 HY329253 7/5/2015 23:20 ...
3,10139885 HY329308 7/5/2015 23:19 ...
4,10140379 HY329556 7/5/2015 23:00 ...


With ingestion done, you can go ahead and start prepping the dataset.

In [3]:
df = dflow.to_pandas_dataframe()
df

Unnamed: 0,Line
0,10140490 HY329907 7/5/2015 23:50 ...
1,10139776 HY329265 7/5/2015 23:30 ...
2,10140270 HY329253 7/5/2015 23:20 ...
3,10139885 HY329308 7/5/2015 23:19 ...
4,10140379 HY329556 7/5/2015 23:00 ...
5,10140868 HY330421 7/5/2015 22:54 ...
6,10139762 HY329232 7/5/2015 22:42 ...
7,10139722 HY329228 7/5/2015 22:30 ...
8,10139774 HY329209 7/5/2015 22:15 ...
9,10139697 HY329177 7/5/2015 22:10 ...


## Read CSV

When reading delimited files, you can let the underlying runtime infer the parsing parameters (e.g. separator, encoding, whether to use headers, etc.) simply by not providing them. In this case, you can read a file by specifying only its location, then retrieve the first 5 rows to evaluate the result.

In [4]:
dflow_duplicate_headers = dprep.read_csv(path='../data/crime_duplicate_headers.csv')
dflow_duplicate_headers.head(5)

Unnamed: 0,ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,...,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location
0,ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,...,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location
1,10498554,HZ239907,4/15/2016 23:56,007XX E 111TH ST,1153,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT OVER $ 300,OTHER,FALSE,FALSE,...,9,50,11,1183356,1831503,2016,5/11/2016 15:48,41.69283384,-87.60431945,"(41.692833841, -87.60431945)"
2,10516598,HZ258664,4/15/2016 17:00,082XX S MARSHFIELD AVE,890,THEFT,FROM BUILDING,RESIDENCE,FALSE,FALSE,...,21,71,6,1166776,1850053,2016,5/12/2016 15:48,41.74410697,-87.66449429,"(41.744106973, -87.664494285)"
3,10519196,HZ261252,4/15/2016 10:00,104XX S SACRAMENTO AVE,1154,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT $300 AND UNDER,RESIDENCE,FALSE,FALSE,...,19,74,11,,,2016,5/12/2016 15:50,,,
4,10519591,HZ261534,4/15/2016 9:00,113XX S PRAIRIE AVE,1120,DECEPTIVE PRACTICE,FORGERY,RESIDENCE,FALSE,FALSE,...,9,49,10,,,2016,5/13/2016 15:51,,,


The result shows that the delimiter and encoding were detected correctly, as were the column headers. However, the first row is a duplicate of the column headers. The `skip_rows` parameter of `read_csv` sets the number of lines to skip in the files being read; use it to filter out the duplicate line.

In [5]:
dflow_skip_headers = dprep.read_csv(path='../data/crime_duplicate_headers.csv', skip_rows=1)
dflow_skip_headers.head(5)

Unnamed: 0,ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,...,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location
0,10498554,HZ239907,4/15/2016 23:56,007XX E 111TH ST,1153,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT OVER $ 300,OTHER,False,False,...,9,50,11,1183356.0,1831503.0,2016,5/11/2016 15:48,41.69283384,-87.60431945,"(41.692833841, -87.60431945)"
1,10516598,HZ258664,4/15/2016 17:00,082XX S MARSHFIELD AVE,890,THEFT,FROM BUILDING,RESIDENCE,False,False,...,21,71,6,1166776.0,1850053.0,2016,5/12/2016 15:48,41.74410697,-87.66449429,"(41.744106973, -87.664494285)"
2,10519196,HZ261252,4/15/2016 10:00,104XX S SACRAMENTO AVE,1154,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT $300 AND UNDER,RESIDENCE,False,False,...,19,74,11,,,2016,5/12/2016 15:50,,,
3,10519591,HZ261534,4/15/2016 9:00,113XX S PRAIRIE AVE,1120,DECEPTIVE PRACTICE,FORGERY,RESIDENCE,False,False,...,9,49,10,,,2016,5/13/2016 15:51,,,
4,10534446,HZ277630,4/15/2016 10:00,055XX N KEDZIE AVE,890,THEFT,FROM BUILDING,"SCHOOL, PUBLIC, BUILDING",False,False,...,40,13,6,,,2016,5/25/2016 15:59,,,


Now the data set contains the correct headers and the extraneous row has been skipped by read_csv. Next, look at the data types of the columns.

In [6]:
dflow_skip_headers.dtypes

ID                       FieldType.STRING
IUCR                     FieldType.STRING
Case Number              FieldType.STRING
Latitude                 FieldType.STRING
Domestic                 FieldType.STRING
Year                     FieldType.STRING
Location Description     FieldType.STRING
Arrest                   FieldType.STRING
Community Area           FieldType.STRING
Block                    FieldType.STRING
Primary Type             FieldType.STRING
Location                 FieldType.STRING
Beat                     FieldType.STRING
Ward                     FieldType.STRING
Description              FieldType.STRING
X Coordinate             FieldType.STRING
Longitude                FieldType.STRING
Updated On               FieldType.STRING
District                 FieldType.STRING
Y Coordinate             FieldType.STRING
Date                     FieldType.STRING
FBI Code                 FieldType.STRING

Unfortunately, all of the columns came back as strings. By default, Data Prep does not change the type of the data, and because the source is a text file, every value is kept as a string. In this case, however, numeric columns should be parsed as numbers. To do this, set the `inference_arguments` parameter to a new instance of the `InferenceArguments` class, which triggers type inference.
Setting inference arguments at this step also requires you to choose a strategy for dealing with ambiguous dates. The example below uses the month-before-day option (`day_first=False`).

In [7]:
dflow_inferred_types = dprep.read_csv(path='../data/crime_duplicate_headers.csv',
                                      skip_rows=1,
                                      inference_arguments=dprep.InferenceArguments(day_first=False))
dflow_inferred_types.dtypes

ID                       FieldType.DECIMAL
IUCR                     FieldType.DECIMAL
Case Number              FieldType.STRING
Latitude                 FieldType.DECIMAL
Domestic                 FieldType.BOOLEAN
Year                     FieldType.DECIMAL
Location Description     FieldType.STRING
Arrest                   FieldType.BOOLEAN
Community Area           FieldType.DECIMAL
Block                    FieldType.STRING
Primary Type             FieldType.STRING
Location                 FieldType.STRING
Beat                     FieldType.DECIMAL
Ward                     FieldType.DECIMAL
Description              FieldType.STRING
X Coordinate             FieldType.DECIMAL
Longitude                FieldType.DECIMAL
Updated On               FieldType.DATE  
District                 FieldType.DECIMAL
Y Coordinate             FieldType.DECIMAL
Date                     FieldType.DATE  
FBI Code                 FieldType.DECIMAL

Several columns are now correctly detected as numbers, with a `FieldType` of `DECIMAL`. `Arrest` and `Domestic` are detected as booleans, and `Date` and `Updated On` as dates.
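
To make the role of `day_first` more concrete, here is a minimal sketch in plain Python. It is not Data Prep's inference algorithm; `infer_value` is a hypothetical helper that tries a boolean, then a number, then a date, and it shows why the same text needs an explicit day/month convention.

```python
from datetime import datetime
from decimal import Decimal, InvalidOperation


def infer_value(text, day_first=False):
    """Toy inference (hypothetical helper): boolean, then number, then date, else string."""
    lowered = text.strip().lower()
    if lowered in ("true", "false"):
        return lowered == "true"
    try:
        return Decimal(text)
    except InvalidOperation:
        pass
    # The same text can be a valid date under both conventions,
    # so the caller has to choose one.
    date_format = "%d/%m/%Y %H:%M" if day_first else "%m/%d/%Y %H:%M"
    try:
        return datetime.strptime(text, date_format)
    except ValueError:
        return text


print(infer_value("FALSE"))
print(infer_value("1183356"))
print(infer_value("5/11/2016 15:48", day_first=False))  # read as May 11
print(infer_value("5/11/2016 15:48", day_first=True))   # read as November 5
print(infer_value("HZ239907"))
```

A value such as `4/15/2016` is unambiguous because 15 cannot be a month, but `5/11/2016` parses either way, which is why a strategy has to be chosen up front.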

With ingestion done, the data set is ready for preparation.

In [8]:
df = dflow_inferred_types.to_pandas_dataframe()
df

Unnamed: 0,ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,...,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location
0,10498554.0,HZ239907,2016-04-15 23:56:00,007XX E 111TH ST,1153.0,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT OVER $ 300,OTHER,False,False,...,9.0,50.0,11.0,1183356.0,1831503.0,2016.0,2016-05-11 15:48:00,41.692834,-87.604319,"(41.692833841, -87.60431945)"
1,10516598.0,HZ258664,2016-04-15 17:00:00,082XX S MARSHFIELD AVE,890.0,THEFT,FROM BUILDING,RESIDENCE,False,False,...,21.0,71.0,6.0,1166776.0,1850053.0,2016.0,2016-05-12 15:48:00,41.744107,-87.664494,"(41.744106973, -87.664494285)"
2,10519196.0,HZ261252,2016-04-15 10:00:00,104XX S SACRAMENTO AVE,1154.0,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT $300 AND UNDER,RESIDENCE,False,False,...,19.0,74.0,11.0,,,2016.0,2016-05-12 15:50:00,,,
3,10519591.0,HZ261534,2016-04-15 09:00:00,113XX S PRAIRIE AVE,1120.0,DECEPTIVE PRACTICE,FORGERY,RESIDENCE,False,False,...,9.0,49.0,10.0,,,2016.0,2016-05-13 15:51:00,,,
4,10534446.0,HZ277630,2016-04-15 10:00:00,055XX N KEDZIE AVE,890.0,THEFT,FROM BUILDING,"SCHOOL, PUBLIC, BUILDING",False,False,...,40.0,13.0,6.0,,,2016.0,2016-05-25 15:59:00,,,
5,10535059.0,HZ278872,2016-04-15 04:30:00,004XX S KILBOURN AVE,810.0,THEFT,OVER $500,RESIDENCE,False,False,...,24.0,26.0,6.0,,,2016.0,2016-05-25 15:59:00,,,
6,10499802.0,HZ240778,2016-04-15 10:00:00,010XX N MILWAUKEE AVE,1152.0,DECEPTIVE PRACTICE,ILLEGAL USE CASH CARD,RESIDENCE,False,False,...,27.0,24.0,11.0,,,2016.0,2016-05-27 15:45:00,,,
7,10522293.0,HZ264802,2016-04-15 16:00:00,019XX W DIVISION ST,1110.0,DECEPTIVE PRACTICE,BOGUS CHECK,RESTAURANT,False,False,...,1.0,24.0,11.0,1163094.0,1908003.0,2016.0,2016-05-16 15:48:00,41.903206,-87.676362,"(41.903206037, -87.676361925)"
8,10523111.0,HZ265911,2016-04-15 08:00:00,061XX N SHERIDAN RD,1153.0,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT OVER $ 300,RESIDENCE,False,False,...,48.0,77.0,11.0,,,2016.0,2016-05-16 15:50:00,,,
9,10525877.0,HZ268138,2016-04-15 15:00:00,023XX W EASTWOOD AVE,1153.0,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT OVER $ 300,,False,False,...,47.0,4.0,11.0,,,2016.0,2016-05-18 15:50:00,,,
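
The parameter inference mentioned at the start of this section can be illustrated with Python's standard library. This is only a sketch of the general idea using `csv.Sniffer`; it does not show how Data Prep detects separators, and the semicolon-delimited sample is made up for the example.

```python
import csv

# Hypothetical semicolon-delimited sample; not one of the notebook's data files.
sample = (
    "ID;Case Number;Arrest\n"
    "10498554;HZ239907;FALSE\n"
    "10516598;HZ258664;FALSE\n"
    "10519196;HZ261252;TRUE\n"
)

# Ask the sniffer to choose among a few candidate separators.
dialect = csv.Sniffer().sniff(sample, delimiters=";,\t|")

# Parse the sample with the detected dialect.
rows = list(csv.reader(sample.splitlines(), dialect))
print(repr(dialect.delimiter))
print(rows[0])
```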


## Read Compressed CSV

Data Prep can also read delimited files compressed in an archive. The `archive_options` parameter specifies the archive type and a glob pattern that selects entries within the archive.

Currently, only ZIP archives are supported.

In [9]:
from azureml.dataprep import ArchiveOptions, ArchiveType

dflow = dprep.read_csv(path='../data/crime.zip',
                       archive_options=ArchiveOptions(archive_type=ArchiveType.ZIP, entry_glob='*10-20.csv'))
dflow.head(5)

Unnamed: 0,ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,...,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location
0,10140342,HY329728,07/05/2015 10:00:00 PM,050XX W BALMORAL AVE,820,THEFT,$500 AND UNDER,RESIDENCE PORCH/HALLWAY,False,False,...,45,12,06,1141968.0,1935401.0,2015,07/12/2015 12:42:46 PM,41.978806522,-87.753281779,"(41.978806522, -87.753281779)"
1,10140280,HY329658,07/05/2015 10:00:00 PM,007XX S CALIFORNIA AVE,810,THEFT,OVER $500,STREET,False,False,...,2,27,06,1157856.0,1896704.0,2015,07/12/2015 12:42:46 PM,41.872309009,-87.695910499,"(41.872309009, -87.695910499)"
2,10139771,HY329216,07/05/2015 10:00:00 PM,077XX S WINCHESTER AVE,560,ASSAULT,SIMPLE,RESIDENCE PORCH/HALLWAY,False,False,...,18,71,08A,1164744.0,1853262.0,2015,07/12/2015 12:42:46 PM,41.752956036,-87.671849368,"(41.752956036, -87.671849368)"
3,10142577,HY331987,07/05/2015 10:00:00 PM,025XX N NEWCASTLE AVE,930,MOTOR VEHICLE THEFT,THEFT/RECOVERY: AUTOMOBILE,STREET,True,False,...,36,18,07,1130361.0,1916262.0,2015,07/12/2015 12:42:46 PM,41.926494742,-87.79640881,"(41.926494742, -87.79640881)"
4,10141300,HY330294,07/05/2015 09:30:00 PM,007XX N SAWYER AVE,486,BATTERY,DOMESTIC BATTERY SIMPLE,RESIDENCE,False,True,...,27,23,08B,,,2015,07/12/2015 12:42:46 PM,,,
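
To see what an entry glob such as `*10-20.csv` selects, the sketch below builds a small ZIP in memory and filters its entries with the standard library. The entry names are hypothetical, and `fnmatch` is used as a stand-in; Data Prep's own glob semantics may differ in details.

```python
import fnmatch
import io
import zipfile

# Build a small in-memory ZIP with hypothetical entry names.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("crime-0-10.csv", "ID\n1\n")
    archive.writestr("crime-10-20.csv", "ID\n2\n")
    archive.writestr("readme.txt", "not a CSV\n")

# Keep only the entries whose names match the glob pattern.
with zipfile.ZipFile(buffer) as archive:
    matching = fnmatch.filter(archive.namelist(), "*10-20.csv")
    contents = {name: archive.read(name).decode("utf-8") for name in matching}

print(matching)
```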


## Read Excel

Data Prep can also load Excel files using the `read_excel` method.

In [10]:
dflow_default_sheet = dprep.read_excel(path='../data/crime.xlsx')
dflow_default_sheet.head(5)

Unnamed: 0,Column1,Column2,Column3,Column4,Column5,Column6,Column7,Column8,Column9,Column10,...,Column13,Column14,Column15,Column16,Column17,Column18,Column19,Column20,Column21,Column22
0,ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,...,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location
1,1.01405e+07,HY329907,2015-07-05T23:50:00.000000,050XX N NEWLAND AVE,820,THEFT,$500 AND UNDER,STREET,False,False,...,41,10,6,1.12923e+06,1.93332e+06,2015,2015-07-12T12:42:46.000000,41.9733,-87.8002,"(41.973309466, -87.800174996)"
2,1.01398e+07,HY329265,2015-07-05T23:30:00.000000,011XX W MORSE AVE,460,BATTERY,SIMPLE,STREET,False,True,...,49,1,08B,1.16737e+06,1.94627e+06,2015,2015-07-12T12:42:46.000000,42.0081,-87.6596,"(42.008124017, -87.65955018)"
3,1.01403e+07,HY329253,2015-07-05T23:20:00.000000,121XX S FRONT AVE,486,BATTERY,DOMESTIC BATTERY SIMPLE,STREET,False,True,...,9,53,08B,,,2015,2015-07-12T12:42:46.000000,,,
4,1.01399e+07,HY329308,2015-07-05T23:19:00.000000,051XX W DIVISION ST,610,BURGLARY,FORCIBLE ENTRY,SMALL RETAIL STORE,False,False,...,37,25,5,1.14172e+06,1.90746e+06,2015,2015-07-12T12:42:46.000000,41.9022,-87.7549,"(41.902152027, -87.754883404)"


Here, the first sheet of the Excel document has been loaded. You could achieve the same result by specifying the name of that sheet explicitly. To load a different sheet, such as the second one, pass its name as the `sheet_name` argument.

In [11]:
dflow_second_sheet = dprep.read_excel(path='../data/crime.xlsx', sheet_name='Sheet2')
dflow_second_sheet.head(5)

Unnamed: 0,Column1,Column2,Column3,Column4,Column5,Column6,Column7,Column8,Column9,Column10,...,Column13,Column14,Column15,Column16,Column17,Column18,Column19,Column20,Column21,Column22
0,,,,,,,,,,,...,,,,,,,,,,
1,,,,,,,,,,,...,,,,,,,,,,
2,,,,,,,,,,,...,,,,,,,,,,
3,ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Column2,...,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location
4,1.01405e+07,HY329907,2015-07-05T23:50:00.000000,050XX N NEWLAND AVE,820,THEFT,$500 AND UNDER,STREET,False,False,...,41,10,6,1.12923e+06,1.93332e+06,2015,2015-07-12T12:42:46.000000,41.9733,-87.8002,"(41.973309466, -87.800174996)"


As you can see, the table in the second sheet is preceded by three empty rows and has its own header row, so you can adjust the arguments accordingly.

In [12]:
dflow_skipped_rows = dprep.read_excel(path='../data/crime.xlsx',
                                      sheet_name='Sheet2',
                                      use_column_headers=True,
                                      skip_rows=3)
dflow_skipped_rows.head(5)

Unnamed: 0,ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Column2,...,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location
0,10140490.0,HY329907,2015-07-05 23:50:00,050XX N NEWLAND AVE,820.0,THEFT,$500 AND UNDER,STREET,False,False,...,41.0,10.0,6,1129230.0,1933315.0,2015.0,2015-07-12 12:42:46,41.973309,-87.800175,"(41.973309466, -87.800174996)"
1,10139776.0,HY329265,2015-07-05 23:30:00,011XX W MORSE AVE,460.0,BATTERY,SIMPLE,STREET,False,True,...,49.0,1.0,08B,1167370.0,1946271.0,2015.0,2015-07-12 12:42:46,42.008124,-87.65955,"(42.008124017, -87.65955018)"
2,10140270.0,HY329253,2015-07-05 23:20:00,121XX S FRONT AVE,486.0,BATTERY,DOMESTIC BATTERY SIMPLE,STREET,False,True,...,9.0,53.0,08B,,,2015.0,2015-07-12 12:42:46,,,
3,10139885.0,HY329308,2015-07-05 23:19:00,051XX W DIVISION ST,610.0,BURGLARY,FORCIBLE ENTRY,SMALL RETAIL STORE,False,False,...,37.0,25.0,5,1141721.0,1907465.0,2015.0,2015-07-12 12:42:46,41.902152,-87.754883,"(41.902152027, -87.754883404)"
4,10140379.0,HY329556,2015-07-05 23:00:00,012XX W LAKE ST,930.0,MOTOR VEHICLE THEFT,THEFT/RECOVERY: AUTOMOBILE,STREET,False,False,...,27.0,28.0,7,1168413.0,1901632.0,2015.0,2015-07-12 12:42:46,41.88561,-87.657009,"(41.885610142, -87.657008701)"


In [13]:
df = dflow_skipped_rows.to_pandas_dataframe()
df

Unnamed: 0,ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Column2,...,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location
0,10140490.0,HY329907,2015-07-05 23:50:00,050XX N NEWLAND AVE,820.0,THEFT,$500 AND UNDER,STREET,False,False,...,41.0,10.0,6,1129230.0,1933315.0,2015.0,2015-07-12 12:42:46,41.973309,-87.800175,"(41.973309466, -87.800174996)"
1,10139776.0,HY329265,2015-07-05 23:30:00,011XX W MORSE AVE,460.0,BATTERY,SIMPLE,STREET,False,True,...,49.0,1.0,08B,1167370.0,1946271.0,2015.0,2015-07-12 12:42:46,42.008124,-87.65955,"(42.008124017, -87.65955018)"
2,10140270.0,HY329253,2015-07-05 23:20:00,121XX S FRONT AVE,486.0,BATTERY,DOMESTIC BATTERY SIMPLE,STREET,False,True,...,9.0,53.0,08B,,,2015.0,2015-07-12 12:42:46,,,
3,10139885.0,HY329308,2015-07-05 23:19:00,051XX W DIVISION ST,610.0,BURGLARY,FORCIBLE ENTRY,SMALL RETAIL STORE,False,False,...,37.0,25.0,5,1141721.0,1907465.0,2015.0,2015-07-12 12:42:46,41.902152,-87.754883,"(41.902152027, -87.754883404)"
4,10140379.0,HY329556,2015-07-05 23:00:00,012XX W LAKE ST,930.0,MOTOR VEHICLE THEFT,THEFT/RECOVERY: AUTOMOBILE,STREET,False,False,...,27.0,28.0,7,1168413.0,1901632.0,2015.0,2015-07-12 12:42:46,41.88561,-87.657009,"(41.885610142, -87.657008701)"
5,10140868.0,HY330421,2015-07-05 22:54:00,118XX S PEORIA ST,1320.0,CRIMINAL DAMAGE,TO VEHICLE,VEHICLE NON-COMMERCIAL,False,False,...,34.0,53.0,14,1172409.0,1826485.0,2015.0,2015-07-12 12:42:46,41.679311,-87.644545,"(41.6793109, -87.644545209)"
6,10139762.0,HY329232,2015-07-05 22:42:00,026XX W 37TH PL,1020.0,ARSON,BY FIRE,VACANT LOT/LAND,False,False,...,12.0,58.0,9,1159436.0,1879658.0,2015.0,2015-07-12 12:42:46,41.825501,-87.690578,"(41.825500607, -87.690578042)"
7,10139722.0,HY329228,2015-07-05 22:30:00,016XX S CENTRAL PARK AVE,1811.0,NARCOTICS,POSS: CANNABIS 30GMS OR LESS,ALLEY,True,False,...,24.0,29.0,18,1152687.0,1891389.0,2015.0,2015-07-12 12:42:46,41.857828,-87.715029,"(41.857827814, -87.715028789)"
8,10139774.0,HY329209,2015-07-05 22:15:00,048XX N ASHLAND AVE,1310.0,CRIMINAL DAMAGE,TO PROPERTY,APARTMENT,False,False,...,46.0,3.0,14,1164821.0,1932394.0,2015.0,2015-07-12 12:42:46,41.9701,-87.669324,"(41.970099796, -87.669324377)"
9,10139697.0,HY329177,2015-07-05 22:10:00,058XX S ARTESIAN AVE,1320.0,CRIMINAL DAMAGE,TO VEHICLE,ALLEY,False,False,...,16.0,63.0,14,1160997.0,1865851.0,2015.0,2015-07-12 12:42:46,41.78758,-87.685233,"(41.787580282, -87.685233078)"
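
The combined effect of `skip_rows` and `use_column_headers` can be sketched on a plain list of rows. The helper `skip_and_promote` is hypothetical and only mirrors the behavior seen in the outputs above (skip leading rows first, then promote the next row to headers); it is not the library's implementation.

```python
def skip_and_promote(rows, skip_rows=0, use_column_headers=False):
    """Drop leading rows, then optionally use the next row as column names."""
    remaining = rows[skip_rows:]
    if use_column_headers and remaining:
        header, data = remaining[0], remaining[1:]
    else:
        # Fall back to generated names, as in the Column1, Column2, ... output.
        width = len(remaining[0]) if remaining else 0
        header = [f"Column{i + 1}" for i in range(width)]
        data = remaining
    return [dict(zip(header, row)) for row in data]


# Hypothetical two-column sheet: three empty rows, a header row, one data row.
sheet = [
    [None, None],
    [None, None],
    [None, None],
    ["ID", "Case Number"],
    [10140490, "HY329907"],
]
records = skip_and_promote(sheet, skip_rows=3, use_column_headers=True)
print(records)
```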


## Read Fixed Width Files

For fixed-width files, you can specify a list of offsets, one for the start of each column after the first. The first column is always assumed to start at offset 0.

In [14]:
dflow_fixed_width = dprep.read_fwf('../data/crime.txt', offsets=[8, 17, 26, 33, 56, 58, 74])
dflow_fixed_width.head(5)

Unnamed: 0,10140490,HY329907,7/5/2015,23:50,050XX,N,NEWLAND AVE 820,THEFT
0,10139776,HY329265,7/5/2015,23:30,011XX,W,MORSE AVE 460,BATTERY
1,10140270,HY329253,7/5/2015,23:20,121XX,S,FRONT AVE 486,BATTERY
2,10139885,HY329308,7/5/2015,23:19,051XX,W,DIVISION ST 610,BURGLARY
3,10140379,HY329556,7/5/2015,23:00,012XX,W,LAKE ST 930,MOTOR VEHICLE THEFT
4,10140868,HY330421,7/5/2015,22:54,118XX,S,PEORIA ST 1320,CRIMINAL DAMAGE


Looking at the data, you can see that the first row was used as headers. In this particular case, however, there are no headers in the file, so the first row should be treated as data.

Passing in `PromoteHeadersMode.NONE` to the `header` keyword argument avoids header detection and gets the correct data.

In [15]:
dflow_no_headers = dprep.read_fwf('../data/crime.txt',
                                  offsets=[8, 17, 26, 33, 56, 58, 74],
                                  header=dprep.PromoteHeadersMode.NONE)
dflow_no_headers.head(5)

Unnamed: 0,Column1,Column2,Column3,Column4,Column5,Column6,Column7,Column8
0,10140490,HY329907,7/5/2015,23:50,050XX,N,NEWLAND AVE 820,THEFT
1,10139776,HY329265,7/5/2015,23:30,011XX,W,MORSE AVE 460,BATTERY
2,10140270,HY329253,7/5/2015,23:20,121XX,S,FRONT AVE 486,BATTERY
3,10139885,HY329308,7/5/2015,23:19,051XX,W,DIVISION ST 610,BURGLARY
4,10140379,HY329556,7/5/2015,23:00,012XX,W,LAKE ST 930,MOTOR VEHICLE THEFT


In [16]:
df = dflow_no_headers.to_pandas_dataframe()
df

Unnamed: 0,Column1,Column2,Column3,Column4,Column5,Column6,Column7,Column8
0,10140490,HY329907,7/5/2015,23:50,050XX,N,NEWLAND AVE 820,THEFT
1,10139776,HY329265,7/5/2015,23:30,011XX,W,MORSE AVE 460,BATTERY
2,10140270,HY329253,7/5/2015,23:20,121XX,S,FRONT AVE 486,BATTERY
3,10139885,HY329308,7/5/2015,23:19,051XX,W,DIVISION ST 610,BURGLARY
4,10140379,HY329556,7/5/2015,23:00,012XX,W,LAKE ST 930,MOTOR VEHICLE THEFT
5,10140868,HY330421,7/5/2015,22:54,118XX,S,PEORIA ST 1320,CRIMINAL DAMAGE
6,10139762,HY329232,7/5/2015,22:42,026XX,W,37TH PL 1020,ARSON
7,10139722,HY329228,7/5/2015,22:30,016XX,S,CENTRAL PARK AV,E 1811 NARCOTICS
8,10139774,HY329209,7/5/2015,22:15,048XX,N,ASHLAND AVE 131,0 CRIMINAL DAMAGE
9,10139697,HY329177,7/5/2015,22:10,058XX,S,ARTESIAN AVE 13,20 CRIMINAL DAMAGE
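
The offset convention used in this section can be sketched in a few lines of plain Python. The helper `split_fixed_width` is hypothetical, the sample line is modelled on the first record shown earlier with its exact spacing assumed, and stripping the padding is an assumption made for readability.

```python
def split_fixed_width(line, offsets):
    """Cut a line into fields; offsets are the start positions of columns 2..n."""
    starts = [0] + list(offsets)        # the first column implicitly starts at 0
    ends = list(offsets) + [len(line)]  # each column ends where the next begins
    # Stripping the padding is an assumption, not documented library behavior.
    return [line[start:end].strip() for start, end in zip(starts, ends)]


# Hypothetical record; the spacing is assumed rather than copied from crime.txt.
line = "10140490 HY329907 7/5/2015 23:50"
fields = split_fixed_width(line, offsets=[8, 17, 26])
print(fields)
```

Three offsets produce four fields here, just as the seven offsets passed to `read_fwf` above produce eight columns.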


## Read Parquet

Data Prep has two different methods for reading data stored as Parquet.

Currently, they both require `pyarrow>=0.11.0` to be installed in your Python environment.

### Read Parquet File

For reading single `.parquet` files, or a folder full of only Parquet files, use `read_parquet_file`.

In [17]:
dflow = dprep.read_parquet_file('../data/crime.parquet')
dflow.head(5)

Unnamed: 0,ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,...,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location
0,10140490.0,HY329907,2015-07-05 23:50:00,050XX N NEWLAND AVE,820.0,THEFT,$500 AND UNDER,STREET,False,False,...,41.0,10.0,06,1129230.0,1933315.0,2015.0,2015-07-12 12:42:46,41.973309,-87.800175,"(41.973309466, -87.800174996)"
1,10139776.0,HY329265,2015-07-05 23:30:00,011XX W MORSE AVE,460.0,BATTERY,SIMPLE,STREET,False,True,...,49.0,1.0,08B,1167370.0,1946271.0,2015.0,2015-07-12 12:42:46,42.008124,-87.65955,"(42.008124017, -87.65955018)"
2,10140270.0,HY329253,2015-07-05 23:20:00,121XX S FRONT AVE,486.0,BATTERY,DOMESTIC BATTERY SIMPLE,STREET,False,True,...,9.0,53.0,08B,,,2015.0,2015-07-12 12:42:46,,,
3,10139885.0,HY329308,2015-07-05 23:19:00,051XX W DIVISION ST,610.0,BURGLARY,FORCIBLE ENTRY,SMALL RETAIL STORE,False,False,...,37.0,25.0,05,1141721.0,1907465.0,2015.0,2015-07-12 12:42:46,41.902152,-87.754883,"(41.902152027, -87.754883404)"
4,10140379.0,HY329556,2015-07-05 23:00:00,012XX W LAKE ST,930.0,MOTOR VEHICLE THEFT,THEFT/RECOVERY: AUTOMOBILE,STREET,False,False,...,27.0,28.0,07,1168413.0,1901632.0,2015.0,2015-07-12 12:42:46,41.88561,-87.657009,"(41.885610142, -87.657008701)"


Parquet data is explicitly typed so no type inference is needed.

In [18]:
dflow.dtypes

ID                       FieldType.DECIMAL
IUCR                     FieldType.DECIMAL
Case Number              FieldType.STRING
Latitude                 FieldType.DECIMAL
Domestic                 FieldType.BOOLEAN
Year                     FieldType.DECIMAL
Location Description     FieldType.STRING
Arrest                   FieldType.BOOLEAN
Community Area           FieldType.DECIMAL
Block                    FieldType.STRING
Primary Type             FieldType.STRING
Location                 FieldType.STRING
Beat                     FieldType.DECIMAL
Ward                     FieldType.DECIMAL
Description              FieldType.STRING
X Coordinate             FieldType.DECIMAL
Longitude                FieldType.DECIMAL
Updated On               FieldType.DATE  
District                 FieldType.DECIMAL
Y Coordinate             FieldType.DECIMAL
Date                     FieldType.DATE  
FBI Code                 FieldType.STRING

### Read Parquet Dataset

A Parquet dataset differs from a Parquet file in that it can be a folder containing many Parquet files within a complex directory structure. It may have a hierarchical structure that partitions the data by the value of a column. These more complex forms of Parquet data are commonly produced by Spark/Hive.

For these more complex data sets, you can use `read_parquet_dataset`, which uses pyarrow to handle complex Parquet layouts. This will also handle single Parquet files, though these are better read using `read_parquet_file`.

In [19]:
dflow = dprep.read_parquet_dataset('../data/parquet_dataset')
dflow.head(5)

Unnamed: 0,ID,Case_Number,Date,Block,IUCR,Primary_Type,Description,Location_Description,Domestic,Beat,...,Community_Area,FBI_Code,X_Coordinate,Y_Coordinate,Year,Updated_On,Latitude,Longitude,Location,Arrest
0,10140490.0,HY329907,2015-07-06 06:50:00,050XX N NEWLAND AVE,820,THEFT,$500 AND UNDER,STREET,False,1613.0,...,10.0,06,1129230.0,1933315.0,2015.0,2015-07-12 19:42:46,41.973309,-87.800175,"(41.973309466, -87.800174996)",False
1,10139776.0,HY329265,2015-07-06 06:30:00,011XX W MORSE AVE,460,BATTERY,SIMPLE,STREET,True,2431.0,...,1.0,08B,1167370.0,1946271.0,2015.0,2015-07-12 19:42:46,42.008124,-87.65955,"(42.008124017, -87.65955018)",False
2,10140270.0,HY329253,2015-07-06 06:20:00,121XX S FRONT AVE,486,BATTERY,DOMESTIC BATTERY SIMPLE,STREET,True,532.0,...,53.0,08B,,,2015.0,2015-07-12 19:42:46,,,,False
3,10139885.0,HY329308,2015-07-06 06:19:00,051XX W DIVISION ST,610,BURGLARY,FORCIBLE ENTRY,SMALL RETAIL STORE,False,1531.0,...,25.0,05,1141721.0,1907465.0,2015.0,2015-07-12 19:42:46,41.902152,-87.754883,"(41.902152027, -87.754883404)",False
4,10140379.0,HY329556,2015-07-06 06:00:00,012XX W LAKE ST,930,MOTOR VEHICLE THEFT,THEFT/RECOVERY: AUTOMOBILE,STREET,False,1215.0,...,28.0,07,1168413.0,1901632.0,2015.0,2015-07-12 19:42:46,41.88561,-87.657009,"(41.885610142, -87.657008701)",False


The above data was partitioned by the value of the `Arrest` column. It is a boolean column in the original crime0 data set and hence was partitioned by `Arrest=true` and `Arrest=false`.

The directory structure is printed below for clarity.

In [20]:
import os
for path, dirs, files in os.walk('../data/parquet_dataset'):
    level = path.replace('../data/parquet_dataset', '').count(os.sep)
    indent = '   ' * (level)
    print(indent + os.path.basename(path) + '/')
    fileindent = '   ' * (level + 1)
    for f in files:
        print(fileindent + f)

parquet_dataset/
   Arrest=true/
      part-00000-34f8a7a7-c3cd-4926-92b2-ba2dcd3f95b7.gz.parquet
   Arrest=false/
      part-00000-34f8a7a7-c3cd-4926-92b2-ba2dcd3f95b7.gz.parquet
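
The `Arrest=true` and `Arrest=false` folder names carry the partition value for every file beneath them. As a rough sketch of how such a layout can be interpreted (this is not pyarrow's or Data Prep's code, `partition_values` is a hypothetical helper, and the file name is shortened for the example):

```python
from pathlib import PurePosixPath


def partition_values(file_path, root):
    """Collect key=value directory names between the dataset root and the file."""
    relative = PurePosixPath(file_path).relative_to(root)
    values = {}
    for part in relative.parts[:-1]:  # skip the file name itself
        key, separator, value = part.partition("=")
        if separator:
            values[key] = value  # kept as a string; type conversion is left out
    return values


# Hypothetical, shortened file name under the dataset folder listed above.
path = "../data/parquet_dataset/Arrest=true/part-00000.gz.parquet"
print(partition_values(path, "../data/parquet_dataset"))
```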


## Read JSON

Data Prep can also load JSON files.

In [21]:
dflow_json = dprep.read_json(path='../data/json.json')
dflow_json.head(15)

Unnamed: 0,inspections.business.business_id,inspections.business.name,inspections.business.address,inspections.business.city,inspections.business.postal_code,inspections.business.latitude,inspections.business.longitude,inspections.business.phone_number,inspections.business.TaxCode,inspections.business.business_certificate,inspections.business.application_date,inspections.business.owner_name,inspections.business.owner_address,inspections.Score,inspections.date,inspections.type,inspections.violations
0,16162,Quick-N-Ezee Indian Foods,3861 24th St,SF,94114.0,,,,H34,467114.0,May 9 2005 12:00AM,Jagpreet Enterprises,23682 Clawiter Road\n Hayward\n CA\n 94545,100.0,20130223,Routine - Unscheduled,[]
1,69707,Little Green Cyclo 2,Off The Grid,,,,,,H79,453248.0,Jul 12 2012 12:00AM,LITTLEGREENCYCLO LLC,"100 Esplanade Ave., Apt. 99\n Pacifica\n CA\n ...",93.0,20130224,Routine - Unscheduled,"[{""description"":""103112: No hot water or runni..."
2,67565,King of Thai Noodles Cafe,1541 TARAVAL St,SAN FRANCISCO,94116.0,37.7427,-122.483,,H25,,Oct 12 2011 12:00AM,"Royal Thai Noodles, Inc",2410 19th Ave\n SF\n CA\n 94116,79.0,20130225,Routine - Unscheduled,"[{""description"":""103139: Improper food storage..."
3,67565,King of Thai Noodles Cafe,1541 TARAVAL St,SAN FRANCISCO,94116.0,37.7427,-122.483,,H25,,Oct 12 2011 12:00AM,"Royal Thai Noodles, Inc",2410 19th Ave\n SF\n CA\n 94116,,20130225,Complaint,"[{""description"":""103139: Improper food storage..."
4,68701,Grindz,832 Clement St,SF,94118.0,37.7828,-122.468,,H25,467498.0,Mar 16 2012 12:00AM,"Ono Grindz, LLC",1055 Granada St.\n Vallejo\n CA\n 94591,100.0,20130225,Routine - Unscheduled,[]
5,69186,"Premier Catering & Events, Inc.",1255 22nd St,S.F.,94107.0,,,14155530288.0,H30,362812.0,Apr 30 2012 12:00AM,"Premier Catering & Events, Inc.",298 Magellan Ave.\n SF\n CA\n 94116,,20130225,Reinspection/Followup,[]
6,2689,THE BLUE PLATE,3218 MISSION St,SF,94110.0,37.7452,-122.42,14155286777.0,H25,325714.0,,BLUE ENCLAVE LLC,3218 MISSION ST.\n SAN FRANCISCO\n CA\n 94110,98.0,20130225,Routine - Unscheduled,"[{""description"":""103143: Inadequate warewashin..."
7,15806,Vital Tea Leaf,1044 Grant Ave,San Francisco,94133.0,37.7966,-122.407,,H24,388301.0,May 23 2005 12:00AM,Minh H. Duong,1044 Grant Ave\n San Francisco\n CA\n 94133,98.0,20130225,Routine - Unscheduled,"[{""description"":""103157: Food safety certifica..."
8,21807,The Front Porch,65 29th St A,SF,94110.0,37.7439,-122.422,,H25,398500.0,Jun 7 2006 12:00AM,Front Porch Restaurant LLC,65A 29th Street\n SF\n CA\n 94110,,20130225,Reinspection/Followup,[]
9,69041,Washington Cafe,826 Washington St,San Francisco,94108.0,37.7951,-122.407,,H26,468548.0,Apr 18 2012 12:00AM,"Washington Caf�, Inc. / Louis Kuang",333 Third Avenue\n Daly City\n CA\n 94014,65.0,20130225,Routine - Unscheduled,"[{""description"":""103120: Moderate risk food ho..."


When you use `read_json`, Data Prep will attempt to extract data from the file into a table. You can also control the file encoding Data Prep should use as well as whether Data Prep should flatten nested JSON arrays.

Choosing the option to flatten nested arrays could result in a much larger number of rows.

In [22]:
dflow_flat_arrays = dprep.read_json(path='../data/json.json', flatten_nested_arrays=True)
dflow_flat_arrays.head(5)

Unnamed: 0,inspections.business.business_id,inspections.business.name,inspections.business.address,inspections.business.city,inspections.business.postal_code,inspections.business.latitude,inspections.business.longitude,inspections.business.phone_number,inspections.business.TaxCode,inspections.business.business_certificate,inspections.business.application_date,inspections.business.owner_name,inspections.business.owner_address,inspections.Score,inspections.date,inspections.type,inspections.violations.description
0,16162,Quick-N-Ezee Indian Foods,3861 24th St,SF,94114.0,,,,H34,467114.0,May 9 2005 12:00AM,Jagpreet Enterprises,23682 Clawiter Road\n Hayward\n CA\n 94545,100,20130223,Routine - Unscheduled,
1,69707,Little Green Cyclo 2,Off The Grid,,,,,,H79,453248.0,Jul 12 2012 12:00AM,LITTLEGREENCYCLO LLC,"100 Esplanade Ave., Apt. 99\n Pacifica\n CA\n ...",93,20130224,Routine - Unscheduled,103112: No hot water or running water (High Risk)
2,67565,King of Thai Noodles Cafe,1541 TARAVAL St,SAN FRANCISCO,94116.0,37.7427,-122.483,,H25,,Oct 12 2011 12:00AM,"Royal Thai Noodles, Inc",2410 19th Ave\n SF\n CA\n 94116,79,20130225,Routine - Unscheduled,103139: Improper food storage (Low Risk)
3,67565,King of Thai Noodles Cafe,1541 TARAVAL St,SAN FRANCISCO,94116.0,37.7427,-122.483,,H25,,Oct 12 2011 12:00AM,"Royal Thai Noodles, Inc",2410 19th Ave\n SF\n CA\n 94116,79,20130225,Routine - Unscheduled,103119: Inadequate and inaccessible handwashin...
4,67565,King of Thai Noodles Cafe,1541 TARAVAL St,SAN FRANCISCO,94116.0,37.7427,-122.483,,H25,,Oct 12 2011 12:00AM,"Royal Thai Noodles, Inc",2410 19th Ave\n SF\n CA\n 94116,79,20130225,Routine - Unscheduled,103120: Moderate risk food holding temperature...
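
The row growth caused by flattening can be sketched without Data Prep. The helper `flatten_records` is hypothetical and the two records are made-up stand-ins for the inspection data; the sketch only illustrates the idea of emitting one row per nested array element while repeating the parent fields.

```python
import json


def flatten_records(records, array_field):
    """Emit one row per element of a nested array, repeating the parent fields."""
    rows = []
    for record in records:
        parent = {k: v for k, v in record.items() if k != array_field}
        # Keep parents whose array is empty by emitting a single row for them.
        children = record.get(array_field) or [None]
        for child in children:
            row = dict(parent)
            for key, value in (child or {}).items():
                row[f"{array_field}.{key}"] = value
            rows.append(row)
    return rows


# Hypothetical input with placeholder violation descriptions.
document = json.loads("""
[
  {"name": "Grindz", "violations": []},
  {"name": "King of Thai Noodles Cafe",
   "violations": [{"description": "A"}, {"description": "B"}]}
]
""")
flat = flatten_records(document, "violations")
print(len(document), "records ->", len(flat), "rows")
```

A record with two violations becomes two rows, which matches the repeated business rows in the flattened output above.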


## Read SQL

Data Prep can also fetch data from SQL servers. Currently, only Microsoft SQL Server is supported.

To read data from a SQL server, first create a data source object that contains the connection information.

In [23]:
secret = dprep.register_secret(value="dpr3pTestU$er", id="dprepTestUser")
ds = dprep.MSSQLDataSource(server_name="dprep-sql-test.database.windows.net",
                           database_name="dprep-sql-test",
                           user_name="dprepTestUser",
                           password=secret)

As you can see, the password parameter of `MSSQLDataSource` accepts a Secret object. You can get a Secret object in two ways:
1. Register the secret and its value with the execution engine.
2. Create the secret with just an id (useful if the secret value was already registered in the execution environment).

Now that you have created a data source object, you can proceed to read data.

In [24]:
dflow = dprep.read_sql(ds, "SELECT TOP 100 * FROM [SalesLT].[Product]")
dflow.head(5)

Unnamed: 0,ProductID,Name,ProductNumber,Color,StandardCost,ListPrice,Size,Weight,ProductCategoryID,ProductModelID,SellStartDate,SellEndDate,DiscontinuedDate,ThumbNailPhoto,ThumbnailPhotoFileName,rowguid,ModifiedDate
0,680,"HL Road Frame - Black, 58",FR-R92B-58,Black,1059.31,1431.5,58,1016.04,18,6,2002-06-01,,,b'GIF89aP\x001\x00\xf7\x00\x00\x00\x00\x00\x80...,no_image_available_small.gif,43dd68d6-14a4-461f-9069-55309d90ea7e,2008-03-11 10:01:36.827
1,706,"HL Road Frame - Red, 58",FR-R92R-58,Red,1059.31,1431.5,58,1016.04,18,6,2002-06-01,,,b'GIF89aP\x001\x00\xf7\x00\x00\x00\x00\x00\x80...,no_image_available_small.gif,9540ff17-2712-4c90-a3d1-8ce5568b2462,2008-03-11 10:01:36.827
2,707,"Sport-100 Helmet, Red",HL-U509-R,Red,13.0863,34.99,,,35,33,2005-07-01,,,b'GIF89aP\x001\x00\xf7\x00\x00\x00\x00\x00\x80...,no_image_available_small.gif,2e1ef41a-c08a-4ff6-8ada-bde58b64a712,2008-03-11 10:01:36.827
3,708,"Sport-100 Helmet, Black",HL-U509,Black,13.0863,34.99,,,35,33,2005-07-01,,,b'GIF89aP\x001\x00\xf7\x00\x00\x00\x00\x00\x80...,no_image_available_small.gif,a25a44fb-c2de-4268-958f-110b8d7621e2,2008-03-11 10:01:36.827
4,709,"Mountain Bike Socks, M",SO-B909-M,White,3.3963,9.5,M,,27,18,2005-07-01,2006-06-30T00:00:00.000000,,b'GIF89aP\x001\x00\xf7\x00\x00\x00\x00\x00\x80...,no_image_available_small.gif,18f95f47-1540-4e02-8f1f-cc1bcb6828d0,2008-03-11 10:01:36.827


In [25]:
df = dflow.to_pandas_dataframe()
df.dtypes

ProductID                          int64
Name                              object
ProductNumber                     object
Color                             object
StandardCost                     float64
ListPrice                        float64
Size                              object
Weight                           float64
ProductCategoryID                  int64
ProductModelID                     int64
SellStartDate             datetime64[ns]
SellEndDate                       object
DiscontinuedDate                  object
ThumbNailPhoto                    object
ThumbnailPhotoFileName            object
rowguid                           object
ModifiedDate              datetime64[ns]
dtype: object

## Read From ADLS

There are two ways the Data Prep API can acquire the OAuth token needed to access Azure Data Lake Storage (ADLS):
1. Retrieve the access token from the user's recent [Azure CLI](https://docs.microsoft.com/en-us/cli/azure/install-azure-cli?view=azure-cli-latest) login session.
2. Use a ServicePrincipal (SP) with a certificate as its secret.

### Using Access Token from a recent Azure CLI session

On your local machine, run the following command:
```
az login
```
If your user account is a member of more than one Azure tenant, you need to specify the tenant, either in the AAD URL hostname form '<your_domain>.onmicrosoft.com' or as the tenantId GUID. The latter can be retrieved as follows:
```
az account show --query tenantId
```

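Then pass the tenant to `DataLakeDataSource` when reading the file:
```python
from azureml.dataprep.api.datasources import DataLakeDataSource

dflow = dprep.read_csv(path=DataLakeDataSource(path='adl://dpreptestfiles.azuredatalakestore.net/farmers-markets.csv', tenant='microsoft.onmicrosoft.com'))
dflow.head(5)
```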

### Create a ServicePrincipal via Azure CLI

A ServicePrincipal and the corresponding certificate can be created via [Azure CLI](https://docs.microsoft.com/en-us/cli/azure/install-azure-cli?view=azure-cli-latest).
This particular SP is configured as Reader, with its scope reduced to just the ADLS account 'dpreptestfiles'.
```
az account set --subscription "Data Wrangling development"
az ad sp create-for-rbac -n "SP-ADLS-dpreptestfiles" --create-cert --role reader --scopes /subscriptions/35f16a99-532a-4a47-9e93-00305f6c40f2/resourceGroups/dpreptestfiles/providers/Microsoft.DataLakeStore/accounts/dpreptestfiles
```
This command emits the appId and the path to the certificate file (usually in the home folder). The .crt file contains both the public certificate and the private key in PEM format.

Extract the thumbprint with:
```
openssl x509 -in adls-dpreptestfiles.crt -noout -fingerprint
```
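
If `openssl` is not available, the thumbprint can also be computed in Python. The following is a minimal sketch, assuming the .crt file contains a standard PEM `CERTIFICATE` block and that the thumbprint is the SHA-1 digest of the DER-encoded certificate (the default digest for `openssl x509 -fingerprint`). The helper name `pem_sha1_thumbprint` is hypothetical and not part of Data Prep.

```python
import base64
import hashlib
import re

# Matches the first PEM certificate block; any private key block is ignored.
_CERT_BLOCK = re.compile(
    r"-----BEGIN CERTIFICATE-----(.*?)-----END CERTIFICATE-----",
    re.DOTALL,
)


def pem_sha1_thumbprint(pem_text: str) -> str:
    """Return the SHA-1 thumbprint of the first certificate in pem_text.

    Hypothetical helper (an assumption, not a Data Prep API). The result is
    formatted as colon-separated uppercase hex pairs.
    """
    match = _CERT_BLOCK.search(pem_text)
    if match is None:
        raise ValueError("no PEM CERTIFICATE block found")
    # The PEM body is the base64-encoded DER certificate; drop all whitespace.
    der_bytes = base64.b64decode("".join(match.group(1).split()))
    digest = hashlib.sha1(der_bytes).hexdigest().upper()
    return ":".join(digest[i:i + 2] for i in range(0, len(digest), 2))
```

Calling `pem_sha1_thumbprint` on the text of the .crt file should produce the same colon-separated value that follows `Fingerprint=` in the `openssl` output.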

### Configure ADLS Account for ServicePrincipal

To configure the ACL for the ADLS filesystem, use the objectId of the user or, as in this example, of the ServicePrincipal:
```
az ad sp show --id "8dd38f34-1fcb-4ff9-accd-7cd60b757174" --query objectId
```
Configure Read and Execute access for the ADLS file system. Since the underlying HDFS ACL model doesn't support inheritance, folders and files need to be ACL-ed individually.
```
az dls fs access set-entry --account dpreptestfiles --acl-spec "user:e37b9b1f-6a5e-4bee-9def-402b956f4e6f:r-x" --path /
az dls fs access set-entry --account dpreptestfiles --acl-spec "user:e37b9b1f-6a5e-4bee-9def-402b956f4e6f:r--" --path /farmers-markets.csv
```
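
Because every folder and file needs its own entry, writing these commands by hand gets tedious for deeper paths. The sketch below builds one `set-entry` command per folder on the path (with `r-x`) plus one for the file itself (with `r--`), mirroring the two commands above. The function name `acl_commands` and its parameters are hypothetical; it only builds command strings and does not invoke the Azure CLI.

```python
import posixpath
import shlex


def acl_commands(account: str, object_id: str, file_path: str) -> list:
    """Build `az dls fs access set-entry` commands for a single file.

    Hypothetical helper (an assumption, not part of any Azure tooling):
    every ancestor folder gets r-x, the file itself gets r--.
    """
    template = (
        "az dls fs access set-entry --account {account} "
        '--acl-spec "user:{oid}:{perms}" --path {path}'
    )
    normalized = posixpath.normpath("/" + file_path.lstrip("/"))
    if normalized == "/":
        raise ValueError("file_path must name a file, not the root folder")

    # Collect every ancestor folder, from the file's parent up to the root.
    folders = []
    parent = posixpath.dirname(normalized)
    while True:
        folders.append(parent)
        if parent == "/":
            break
        parent = posixpath.dirname(parent)
    folders.reverse()  # root first, then progressively deeper folders

    commands = [
        template.format(account=account, oid=object_id, perms="r-x",
                        path=shlex.quote(folder))
        for folder in folders
    ]
    commands.append(
        template.format(account=account, oid=object_id, perms="r--",
                        path=shlex.quote(normalized))
    )
    return commands
```

For the account and file used here, `acl_commands('dpreptestfiles', 'e37b9b1f-6a5e-4bee-9def-402b956f4e6f', '/farmers-markets.csv')` reproduces the two commands shown above.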

References:
- [az ad sp](https://docs.microsoft.com/en-us/cli/azure/ad/sp?view=azure-cli-latest)
- [az dls fs access](https://docs.microsoft.com/en-us/cli/azure/dls/fs/access?view=azure-cli-latest)
- [ACL model for ADLS](https://github.com/MicrosoftDocs/azure-docs/blob/master/articles/data-lake-store/data-lake-store-access-control.md)

In [26]:
certThumbprint = 'C2:08:9D:9E:D1:74:FC:EB:E9:7E:63:96:37:1C:13:88:5E:B9:2C:84'
certificate = ''
with open('../data/adls-dpreptestfiles.crt', 'rt', encoding='utf-8') as crtFile:
    certificate = crtFile.read()

servicePrincipalAppId = "8dd38f34-1fcb-4ff9-accd-7cd60b757174"

### Acquire an OAuth Access Token

Use the `adal` package (install it with `pip install adal`) to create an authentication context on the Microsoft tenant and acquire an OAuth access token. Note that for ADLS, the `resource` in the token request must be 'https://datalake.azure.net', which differs from most other Azure resources.

In [27]:
import adal
from azureml.dataprep.api.datasources import DataLakeDataSource

ctx = adal.AuthenticationContext('https://login.microsoftonline.com/microsoft.onmicrosoft.com')
token = ctx.acquire_token_with_client_certificate('https://datalake.azure.net/', servicePrincipalAppId, certificate, certThumbprint)
dflow = dprep.read_csv(path = DataLakeDataSource(path='adl://dpreptestfiles.azuredatalakestore.net/crime-spring.csv', accessToken=token['accessToken']))
dflow.to_pandas_dataframe().head()

Unnamed: 0,ID,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,...,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location
0,10498554,HZ239907,4/15/2016 23:56,007XX E 111TH ST,1153,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT OVER $ 300,OTHER,False,False,...,9,50,11,1183356.0,1831503.0,2016,5/11/2016 15:48,41.69283384,-87.60431945,"(41.692833841, -87.60431945)"
1,10516598,HZ258664,4/15/2016 17:00,082XX S MARSHFIELD AVE,890,THEFT,FROM BUILDING,RESIDENCE,False,False,...,21,71,6,1166776.0,1850053.0,2016,5/12/2016 15:48,41.74410697,-87.66449429,"(41.744106973, -87.664494285)"
2,10519196,HZ261252,4/15/2016 10:00,104XX S SACRAMENTO AVE,1154,DECEPTIVE PRACTICE,FINANCIAL IDENTITY THEFT $300 AND UNDER,RESIDENCE,False,False,...,19,74,11,,,2016,5/12/2016 15:50,,,
3,10519591,HZ261534,4/15/2016 9:00,113XX S PRAIRIE AVE,1120,DECEPTIVE PRACTICE,FORGERY,RESIDENCE,False,False,...,9,49,10,,,2016,5/13/2016 15:51,,,
4,10534446,HZ277630,4/15/2016 10:00,055XX N KEDZIE AVE,890,THEFT,FROM BUILDING,"SCHOOL, PUBLIC, BUILDING",False,False,...,40,13,6,,,2016,5/25/2016 15:59,,,


## Read Pandas DataFrame

There are situations where you may already have some data in the form of a pandas DataFrame. The steps taken to get to this DataFrame may be non-trivial or hard to convert to Data Prep steps. The `read_pandas_dataframe` reader can take a DataFrame and use it as the data source for a Dataflow.

You can pass in a path to a directory (that doesn't exist yet) for Data Prep to store the contents of the DataFrame; otherwise, a temporary directory will be made in the system's temp folder. The files written to this directory will be named `part-00000` and so on; they are written out in Data Prep's internal row-based file format.

In [28]:
dflow = dprep.read_excel(path='../data/crime.xlsx')
dflow = dflow.drop_columns(columns=['Column1'])
df = dflow.to_pandas_dataframe()
df.head(5)

Unnamed: 0,Column2,Column3,Column4,Column5,Column6,Column7,Column8,Column9,Column10,Column11,...,Column13,Column14,Column15,Column16,Column17,Column18,Column19,Column20,Column21,Column22
0,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,Beat,...,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location
1,HY329907,2015-07-05T23:50:00.000000,050XX N NEWLAND AVE,820,THEFT,$500 AND UNDER,STREET,False,False,1613,...,41,10,6,1.12923e+06,1.93332e+06,2015,2015-07-12T12:42:46.000000,41.9733,-87.8002,"(41.973309466, -87.800174996)"
2,HY329265,2015-07-05T23:30:00.000000,011XX W MORSE AVE,460,BATTERY,SIMPLE,STREET,False,True,2431,...,49,1,08B,1.16737e+06,1.94627e+06,2015,2015-07-12T12:42:46.000000,42.0081,-87.6596,"(42.008124017, -87.65955018)"
3,HY329253,2015-07-05T23:20:00.000000,121XX S FRONT AVE,486,BATTERY,DOMESTIC BATTERY SIMPLE,STREET,False,True,532,...,9,53,08B,,,2015,2015-07-12T12:42:46.000000,,,
4,HY329308,2015-07-05T23:19:00.000000,051XX W DIVISION ST,610,BURGLARY,FORCIBLE ENTRY,SMALL RETAIL STORE,False,False,1531,...,37,25,5,1.14172e+06,1.90746e+06,2015,2015-07-12T12:42:46.000000,41.9022,-87.7549,"(41.902152027, -87.754883404)"


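With the data loaded into a pandas DataFrame, you can now pass it to `read_pandas_dataframe`. The cache directory is removed first because the target directory must not already exist.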

In [29]:
import shutil
cache_dir = 'dflow_df'
shutil.rmtree(cache_dir, ignore_errors=True)
dflow_df = dprep.read_pandas_dataframe(df, cache_dir)

In [30]:
dflow_df.head(5)

Unnamed: 0,Column2,Column3,Column4,Column5,Column6,Column7,Column8,Column9,Column10,Column11,...,Column13,Column14,Column15,Column16,Column17,Column18,Column19,Column20,Column21,Column22
0,Case Number,Date,Block,IUCR,Primary Type,Description,Location Description,Arrest,Domestic,Beat,...,Ward,Community Area,FBI Code,X Coordinate,Y Coordinate,Year,Updated On,Latitude,Longitude,Location
1,HY329907,2015-07-05T23:50:00.000000,050XX N NEWLAND AVE,820,THEFT,$500 AND UNDER,STREET,False,False,1613,...,41,10,6,1.12923e+06,1.93332e+06,2015,2015-07-12T12:42:46.000000,41.9733,-87.8002,"(41.973309466, -87.800174996)"
2,HY329265,2015-07-05T23:30:00.000000,011XX W MORSE AVE,460,BATTERY,SIMPLE,STREET,False,True,2431,...,49,1,08B,1.16737e+06,1.94627e+06,2015,2015-07-12T12:42:46.000000,42.0081,-87.6596,"(42.008124017, -87.65955018)"
3,HY329253,2015-07-05T23:20:00.000000,121XX S FRONT AVE,486,BATTERY,DOMESTIC BATTERY SIMPLE,STREET,False,True,532,...,9,53,08B,,,2015,2015-07-12T12:42:46.000000,,,
4,HY329308,2015-07-05T23:19:00.000000,051XX W DIVISION ST,610,BURGLARY,FORCIBLE ENTRY,SMALL RETAIL STORE,False,False,1531,...,37,25,5,1.14172e+06,1.90746e+06,2015,2015-07-12T12:42:46.000000,41.9022,-87.7549,"(41.902152027, -87.754883404)"
