# Data Loading, Storage and File Formats


Reading data and making it accessible (often called data loading) is a necessary first step of data cleaning.

Input and output typically fall into a few main categories: reading text files and other more efficient on-disk formats, loading data from databases, and interacting with network sources like web APIs.

## Reading and Writing Data in Text Format

pandas features a number of functions for reading tabular data as a DataFrame object.
Of these, `pandas.read_csv` is one of the most frequently used.


### Text and Binary Data Loading Functions in Pandas

| Function           | Description |
|--------------------|-------------|
| `read_csv`         | Load delimited data from a file, URL, or file-like object; uses comma as default delimiter |
| `read_fwf`         | Read data in fixed-width column format (i.e., no delimiters) |
| `read_clipboard`   | Variation of `read_csv` that reads data from the clipboard; useful for converting tables from web pages |
| `read_excel`       | Read tabular data from an Excel XLS or XLSX file |
| `read_hdf`         | Read HDF5 files written by pandas |
| `read_html`        | Read all tables found in the given HTML document |
| `read_json`        | Read data from a JSON string, file, URL, or file-like object |
| `read_feather`     | Read the Feather binary file format |
| `read_orc`         | Read the Apache ORC binary file format |
| `read_parquet`     | Read the Apache Parquet binary file format |
| `read_pickle`      | Read an object stored by pandas using the Python pickle format |
| `read_sas`         | Read a SAS dataset stored in one of the SAS system's custom storage formats |
| `read_spss`        | Read a data file created by SPSS |
| `read_sql`         | Read the results of a SQL query (using SQLAlchemy) |
| `read_sql_table`   | Read a whole SQL table (using SQLAlchemy); equivalent to using a query that selects everything in that table using `read_sql` |
| `read_stata`       | Read a dataset from Stata file format |
| `read_xml`         | Read a table of data from an XML file |





Pandas provides several functions to convert text-based data into DataFrames. These functions come with a wide range of optional parameters, which can be grouped into the following categories:

🔢 Indexing
You can define which column(s) should be used as the index, whether to use headers from the file, custom headers, or none at all.

🔍 Type Detection & Conversion
Pandas can automatically infer data types or let you define them. You can also specify custom missing value markers and conversion rules.

📅 Date & Time Parsing
Supports combining multiple columns into a single datetime column and parsing various date formats.

🔄 Iteration
Large files can be processed in chunks using parameters like chunksize, which helps manage memory efficiently.

🧹 Handling Messy Data
Options exist to skip rows, ignore footers, handle comments, and parse numbers with thousands separators.
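
As a rough illustration of how these categories map onto actual arguments, the sketch below reads a small in-memory table. The data, the column names, and the `"missing"` marker are all made up for this example; only the `read_csv` arguments themselves come from pandas.

```python
import io

import pandas as pd

# Hypothetical data: a comment line, a date column, and amounts with
# thousands separators (quoted so their commas are not read as delimiters).
raw = io.StringIO(
    "# exported by a made-up tool\n"
    "id,date,amount,code\n"
    '1,2024-01-15,"1,200",007\n'
    '2,2024-02-20,"3,450",042\n'
    "3,2024-03-05,missing,113\n"
)

df = pd.read_csv(
    raw,
    comment="#",            # messy data: ignore the comment line
    index_col="id",         # indexing: use the id column as the row index
    parse_dates=["date"],   # date parsing: convert this column to datetimes
    thousands=",",          # messy data: read "1,200" as the number 1200
    dtype={"code": str},    # type conversion: keep leading zeros as text
    na_values=["missing"],  # custom missing-value marker (an assumption here)
)

print(df)
print(df.dtypes)
```

Each argument can be removed in turn to see which part of the result it is responsible for.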

⚙️ Why So Many Parameters?
Functions like read_csv() have grown to include over 50 optional arguments to handle the wide variety of real-world data issues. It’s common to feel overwhelmed, but the pandas documentation offers many examples to help you find the right configuration.

🧠 Type Inference vs. Embedded Types
Text-based formats (like CSV or JSON) require pandas to infer data types.
Binary formats (like Parquet, ORC, or HDF5) include type information within the file itself.

⏳ Dates and Custom Types
Handling dates or other specialized data types may require additional parameters or preprocessing.
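
The difference between inferred and embedded types can be seen with a small round trip. This sketch uses pickle as the binary format, only because it needs no optional dependency (Parquet, for example, typically requires an extra engine such as pyarrow). The DataFrame contents are invented for illustration.

```python
import io
import os
import tempfile

import pandas as pd

df = pd.DataFrame({
    "when": pd.to_datetime(["2024-01-01", "2024-06-30"]),
    "zip": ["02139", "10001"],  # strings that happen to look like numbers
})

# Text round trip: the CSV holds only characters, so types are re-inferred.
from_csv = pd.read_csv(io.StringIO(df.to_csv(index=False)))

# Binary round trip: the pickle file carries the dtypes with the data.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "frame.pkl")
    df.to_pickle(path)
    from_pickle = pd.read_pickle(path)

print(from_csv.dtypes)     # dates are no longer datetimes, zip became integers
print(from_pickle.dtypes)  # original dtypes are preserved
```

To recover the original types from the CSV, you would pass arguments such as `parse_dates=["when"]` and `dtype={"zip": str}`.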




In [34]:
import pandas as pd

df = pd.read_csv('/content/ex1.csv')

df

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


In [35]:
!cat /content/ex1.csv

a,b,c,d,message
1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo

A file will not always have a header row.

In [36]:
!cat /content/ex2.csv

1,2,3,4,hello
5,6,7,8,world
9,10,11,12,foo

To read this file, you have a couple of options. You can allow pandas to assign default column names, or you can specify names yourself:

In [37]:
pd.read_csv('/content/ex2.csv', header=None)

Unnamed: 0,0,1,2,3,4
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


In [38]:
pd.read_csv('/content/ex2.csv', names=['a', 'b','c', 'd', 'message'])

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo


In [39]:
names = ['a', 'b', 'c', 'd', 'message']
pd.read_csv('/content/ex2.csv', names=names, index_col='message')


         a   b   c   d
message
hello    1   2   3   4
world    5   6   7   8
foo      9  10  11  12
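
The header options above can be tried without any files on disk. The following sketch holds the same three rows in a string and reads them through `io.StringIO`; it also shows what happens if you forget `header=None` (the variable names are just illustrative).

```python
import io

import pandas as pd

# Same three rows as ex2.csv, held in memory (no header row).
data = "1,2,3,4,hello\n5,6,7,8,world\n9,10,11,12,foo\n"
names = ["a", "b", "c", "d", "message"]

# Pitfall: without header=None the first data row is consumed as the header.
wrong = pd.read_csv(io.StringIO(data))

default_names = pd.read_csv(io.StringIO(data), header=None)
custom_names = pd.read_csv(io.StringIO(data), names=names)
indexed = pd.read_csv(io.StringIO(data), names=names, index_col="message")

print(len(wrong), len(default_names))  # one row is lost in the first case
print(list(default_names.columns))     # integer labels assigned by pandas
print(indexed.loc["world", "c"])       # look up a value by its index label
```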


In [40]:
!cat /content/csv_mindex.csv

key1,key2,value1,value2
one,a,1,2
one,b,3,4
one,c,5,6
one,d,7,8
two,a,9,10
two,b,11,12
two,c,13,14
two,d,15,16

In [43]:
parsed = pd.read_csv("/content/csv_mindex.csv", index_col=['key1', 'key2'])

parsed

           value1  value2
key1 key2
one  a          1       2
     b          3       4
     c          5       6
     d          7       8
two  a          9      10
     b         11      12
     c         13      14
     d         15      16
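
Once several columns form a hierarchical index, rows can be selected by the outer key alone or by a full key pair. A minimal sketch, using the same data held in memory instead of `/content/csv_mindex.csv`:

```python
import io

import pandas as pd

data = """key1,key2,value1,value2
one,a,1,2
one,b,3,4
one,c,5,6
one,d,7,8
two,a,9,10
two,b,11,12
two,c,13,14
two,d,15,16
"""

parsed = pd.read_csv(io.StringIO(data), index_col=["key1", "key2"])

print(parsed.index.nlevels)                 # number of index levels
print(parsed.loc["one"])                    # all rows under key1 == "one"
print(parsed.loc[("two", "c"), "value2"])   # a single cell by full key

# reset_index() turns the index levels back into ordinary columns.
flat = parsed.reset_index()
print(list(flat.columns))
```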


In [44]:
!cat /content/txt3.txt

A         B         C
aaa -0.264438 -1.026059 -0.619500
bbb  0.927272  0.302904 -0.032399
ccc -0.264273 -0.386314 -0.217601
ddd -0.871858 -0.348382  1.100491

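In some cases, a table does not have a fixed delimiter and instead uses whitespace or some other pattern to separate fields. Here the fields are separated by a variable amount of whitespace. While you could do some munging by hand, you can instead pass a regular expression as the delimiter for `pandas.read_csv`; one or more whitespace characters is expressed by the regular expression `\s+`. Because the header row has one fewer field than the data rows, pandas infers that the first column should be the DataFrame's index: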

In [46]:
result = pd.read_csv('/content/txt3.txt', sep=r"\s+")

result

Unnamed: 0,A,B,C
aaa,-0.264438,-1.026059,-0.6195
bbb,0.927272,0.302904,-0.032399
ccc,-0.264273,-0.386314,-0.217601
ddd,-0.871858,-0.348382,1.100491
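
The whitespace-delimited case can be reproduced in memory as follows. Note the raw string `r"\s+"`: without the `r` prefix, recent Python versions warn about `\s` being an invalid escape sequence in an ordinary string literal.

```python
import io

import pandas as pd

text = (
    "A         B         C\n"
    "aaa -0.264438 -1.026059 -0.619500\n"
    "bbb  0.927272  0.302904 -0.032399\n"
    "ccc -0.264273 -0.386314 -0.217601\n"
    "ddd -0.871858 -0.348382  1.100491\n"
)

# One or more whitespace characters act as the field separator.
result = pd.read_csv(io.StringIO(text), sep=r"\s+")

print(list(result.columns))  # three column names from the header row
print(list(result.index))    # the extra leading field became the index
```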


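### Skipping Rows

The `skiprows` option skips the listed line numbers (counting from 0); here the first, third, and fourth lines of the file are skipped.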

In [47]:
result = pd.read_csv('/content/ex4.csv', skiprows=[0, 2, 3])

result

Unnamed: 0,a,b,c,d,message
0,1,2,3,4,hello
1,5,6,7,8,world
2,9,10,11,12,foo
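
The contents of `ex4.csv` are not shown above, so the sketch below assumes a plausible layout: note lines starting with `#` at positions 0, 2, and 3. Under that assumption, skipping by position and skipping by comment marker give the same table.

```python
import io

import pandas as pd

# Assumed layout: these note lines are invented for illustration only.
text = (
    "# note on line 0\n"
    "a,b,c,d,message\n"
    "# note on line 2\n"
    "# note on line 3\n"
    "1,2,3,4,hello\n"
    "5,6,7,8,world\n"
    "9,10,11,12,foo\n"
)

# Skip specific line numbers (0-based) ...
by_position = pd.read_csv(io.StringIO(text), skiprows=[0, 2, 3])

# ... or let pandas ignore everything after a comment character.
by_marker = pd.read_csv(io.StringIO(text), comment="#")

print(by_position)
print(by_position.equals(by_marker))
```

`skiprows` is the better fit when the junk lines have no common marker; `comment` is more robust when their positions vary between files.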


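### Handling Missing Values

Missing data is usually either not present (an empty string) or marked by a sentinel (placeholder) value. By default, pandas recognizes a set of commonly occurring sentinels, such as `NA` and `NULL`.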

In [48]:
!cat /content/ex5.csv

something,a,b,c,d,message
one,1,2,3,4,NA
two,5,6,,8,world
three,9,10,11,12,foo

In [49]:
result = pd.read_csv('/content/ex5.csv')

result

Unnamed: 0,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,two,5,6,,8,world
2,three,9,10,11.0,12,foo


In [50]:
print(pd.isnull(result))

   something      a      b      c      d  message
0      False  False  False  False  False     True
1      False  False  False   True  False    False
2      False  False  False  False  False    False


In [51]:
pd.isna(result)

Unnamed: 0,something,a,b,c,d,message
0,False,False,False,False,False,True
1,False,False,False,True,False,False
2,False,False,False,False,False,False


The `na_values` option accepts a sequence of strings to add to the default list of strings recognized as missing. Note that `NULL` is already in the default list, so the output below is the same as before:

In [53]:
result = pd.read_csv('/content/ex5.csv', na_values=['NULL'])

result

Unnamed: 0,something,a,b,c,d,message
0,one,1,2,3.0,4,
1,two,5,6,,8,world
2,three,9,10,11.0,12,foo
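
To see `na_values` actually change the result, the markers have to be ones that pandas does not already recognize. The sketch below uses two invented markers, `"n.a."` and `"unknown"`, in a small in-memory table.

```python
import io

import pandas as pd

# Hypothetical data with two non-standard missing markers.
text = (
    "something,a,message\n"
    "one,1,NA\n"
    "two,n.a.,world\n"
    "three,9,unknown\n"
)

default = pd.read_csv(io.StringIO(text))
custom = pd.read_csv(io.StringIO(text), na_values=["n.a.", "unknown"])

# Count missing cells in each result.
print(default.isna().sum().sum())  # only the built-in "NA" marker is caught
print(custom.isna().sum().sum())   # the two custom markers are caught as well
print(custom["a"].dtype)           # column a can now be parsed as numbers
```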


In [55]:
result2 = pd.read_csv('/content/ex5.csv', keep_default_na=False)

result2

  something  a   b   c   d message
0       one  1   2   3   4      NA
1       two  5   6       8   world
2     three  9  10  11  12     foo


In [56]:
result2.isna()

Unnamed: 0,something,a,b,c,d,message
0,False,False,False,False,False,False
1,False,False,False,False,False,False
2,False,False,False,False,False,False


In [58]:
result3 = pd.read_csv('/content/ex5.csv', keep_default_na=False, na_values=['NA'])

result3

  something  a   b   c   d message
0       one  1   2   3   4     NaN
1       two  5   6       8   world
2     three  9  10  11  12     foo


In [59]:
result3.isna()

Unnamed: 0,something,a,b,c,d,message
0,False,False,False,False,False,True
1,False,False,False,False,False,False
2,False,False,False,False,False,False


Different NA sentinels can be specified for each column in a dictionary:

In [62]:
sentinels = {'message' : ['foo', 'NA'], 'something':['two']}

pd.read_csv('/content/ex5.csv', na_values=sentinels, keep_default_na=False)

  something  a   b   c   d message
0       one  1   2   3   4     NaN
1       NaN  5   6       8   world
2     three  9  10  11  12     NaN


### 📘 Some `pandas.read_csv()` Function Arguments

| Argument         | Description |
|------------------|-------------|
| `filepath_or_buffer` | String path, URL, path object, or file-like object to read from (the first positional argument). |
| `sep` or `delimiter` | Character sequence or regex to split fields in each row. |
| `header`         | Row number to use as column names; inferred as 0 (the first row) by default. Use `None` if there's no header row. |
| `index_col`      | Column(s) to use as the row index; can be a single name/number or a list for hierarchical index. |
| `names`          | List of column names for the result. |
| `skiprows`       | Number of rows to skip at the beginning or a list of row numbers to skip. |
| `na_values`      | Sequence of values to treat as NA. Added to default list unless `keep_default_na=False`. |
| `keep_default_na`| Whether to use the default NA list (`True` by default). |
| `comment`        | Character marking the remainder of a line as a comment that should not be parsed. |
| `parse_dates`    | If `True`, try to parse the index as dates. Can also be a list of column names or numbers to parse as dates. |
| `keep_date_col`  | If combining columns to parse dates, keep the original columns (`False` by default); deprecated in recent pandas releases. |
| `converters`     | Dict mapping column names/numbers to functions for custom conversion. |
| `dayfirst`       | If `True`, interpret dates as day-first (e.g., 7/6/2012 → June 7, 2012). |
| `date_parser`    | Function to use for parsing dates (deprecated since pandas 2.0 in favor of `date_format`). |
| `nrows`          | Number of rows to read from the start of the file. |
| `iterator`       | If `True`, return a `TextFileReader` object for chunked reading. |
| `chunksize`      | Number of rows per chunk when using iteration. |
| `skipfooter`     | Number of lines to skip at the end of the file (not supported by the C engine). |
| `verbose`        | Print parsing info like time and memory usage. |
| `encoding`       | Text encoding (e.g., `"utf-8"`); defaults to `"utf-8"` if `None`. |
| `squeeze`        | If only one column is parsed, return a Series instead of DataFrame (removed in pandas 2.0; call `.squeeze("columns")` on the result instead). |
| `thousands`      | Thousands separator (e.g., `","` or `"."`). |
| `decimal`        | Decimal point character (e.g., `"."` or `","`). |
| `engine`         | Parsing engine: `"c"` (default), `"python"`, or `"pyarrow"`. |
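
A few of the less common arguments from the table can be combined as in the sketch below. The data is a made-up European-style export with `;` as delimiter, `,` as decimal mark, and a trailing summary line; none of it comes from the example files used in this section.

```python
import io

import pandas as pd

# Hypothetical export: ';' delimiter, ',' decimal mark, trailing TOTAL line.
text = (
    "name;price;qty\n"
    "widget;12,50;3\n"
    "gadget;99,90;10\n"
    "TOTAL;112,40;13\n"
)

df = pd.read_csv(
    io.StringIO(text),
    sep=";",
    decimal=",",                      # read "12,50" as 12.5
    skipfooter=1,                     # drop the TOTAL line at the end
    engine="python",                  # skipfooter needs the Python engine
    converters={"name": str.upper},   # custom per-column conversion
)

print(df)
print(df.dtypes)
```

The Python engine is slower than the default C engine, so it is usually chosen only when an option such as `skipfooter` requires it.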


## Reading Text Files in Pieces

When processing very large files or figuring out the right set of arguments to correctly process a large file, you may want to read only a small piece of a file or iterate through smaller chunks of the file.

Before we look at a large file, we make the pandas display settings more compact:

In [64]:
pd.options.display.max_rows = 10

In [65]:
result = pd.read_csv('/content/ex6.csv')

result

Unnamed: 0,one,two,three,four,key
0,0.467976,-0.038649,-0.295344,-1.824726,L
1,-0.358893,1.404453,0.704965,-0.200638,B
2,-0.501840,0.659254,-0.421691,-0.057688,G
3,0.204886,1.074134,1.388361,-0.982404,R
4,0.354628,-0.133116,0.283763,-0.837063,Q
...,...,...,...,...,...
9995,2.311896,-0.417070,-1.409599,-0.515821,L
9996,-0.479893,-0.650419,0.745152,-0.646038,E
9997,0.523331,0.787112,0.486066,1.093156,K
9998,-0.362559,0.598894,-1.843201,0.887292,G


The ellipsis marks `...` indicate that rows in the middle of the DataFrame have been omitted.

If you want to read only a small number of rows (avoiding reading the entire file), specify that with `nrows`:

In [66]:
pd.read_csv('/content/ex6.csv', nrows=5)

Unnamed: 0,one,two,three,four,key
0,0.467976,-0.038649,-0.295344,-1.824726,L
1,-0.358893,1.404453,0.704965,-0.200638,B
2,-0.50184,0.659254,-0.421691,-0.057688,G
3,0.204886,1.074134,1.388361,-0.982404,R
4,0.354628,-0.133116,0.283763,-0.837063,Q


In [67]:
result.head()

Unnamed: 0,one,two,three,four,key
0,0.467976,-0.038649,-0.295344,-1.824726,L
1,-0.358893,1.404453,0.704965,-0.200638,B
2,-0.50184,0.659254,-0.421691,-0.057688,G
3,0.204886,1.074134,1.388361,-0.982404,R
4,0.354628,-0.133116,0.283763,-0.837063,Q


To read a file in pieces, specify a `chunksize` as a number of rows:

In [68]:
chunker = pd.read_csv('/content/ex6.csv', chunksize=1000)
chunker

<pandas.io.parsers.readers.TextFileReader at 0x784ae9cc8890>
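
The reader object can be consumed in two ways: by asking for one chunk at a time with `get_chunk()`, or by looping over it. The sketch below uses a tiny synthetic table in place of `ex6.csv` and wraps the reader in a `with` block (supported in pandas 1.2 and later) so it is closed when done.

```python
import io

import pandas as pd

# Synthetic stand-in for a large file: one column, ten rows.
text = "x\n" + "\n".join(str(i) for i in range(10)) + "\n"

with pd.read_csv(io.StringIO(text), chunksize=4) as reader:
    first = reader.get_chunk()                     # next `chunksize` rows
    rest_sizes = [len(chunk) for chunk in reader]  # iterate over what is left

print(len(first))
print(rest_sizes)  # the final chunk is shorter when rows run out
```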

The `TextFileReader` object returned by `pandas.read_csv` allows you to iterate over the parts of the file according to the `chunksize`. For example, we can iterate over `ex6.csv`, aggregating the value counts in the "`key`" column, like so:

In [71]:
chunker = pd.read_csv('/content/ex6.csv', chunksize=1000)

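# Start from an empty Series; counts from each chunk are added to it
tot = pd.Series([], dtype='int64')

for piece in chunker:
    # fill_value=0 treats keys missing from one side as zero counts
    tot = tot.add(piece['key'].value_counts(), fill_value=0)

tot = tot.sort_values(ascending=False)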

In [73]:
tot[:3]

key
E    368.0
X    364.0
L    346.0
dtype: float64
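
The counts above come out as floats (`368.0`) because `add` with `fill_value` works in floating point when the two sides have different keys. One alternative, sketched here on synthetic data instead of `ex6.csv`, is to collect the per-chunk counts and sum them by key at the end, which keeps integer counts.

```python
import io

import pandas as pd

# Synthetic stand-in: 100 rows where A appears 50 times, B and C 25 each.
keys = list("ABCA" * 25)
text = "key\n" + "\n".join(keys) + "\n"

# Count keys within each chunk separately.
partial_counts = [
    chunk["key"].value_counts()
    for chunk in pd.read_csv(io.StringIO(text), chunksize=30)
]

# Stack the partial counts, then sum the entries that share a key.
tot = (
    pd.concat(partial_counts)
    .groupby(level=0)
    .sum()
    .sort_values(ascending=False)
)

# Reference result computed from the whole table at once.
full = pd.read_csv(io.StringIO(text))["key"].value_counts()

print(tot)
print(tot.to_dict() == full.to_dict())
```

For a file that fits in memory the single `value_counts()` call is simpler; the chunked version matters only when it does not.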
