# 🐼 Pandas Handbook

## 02 - Importing & Exporting Data

Check out the official [Pandas documentation](https://pandas.pydata.org/pandas-docs/stable/)  

This notebook uses the [Ramen Ratings dataset](https://www.kaggle.com/datasets/residentmario/ramen-ratings/data) from Kaggle to demonstrate how to import and export data using various file formats with pandas.

## 📚 Table of Contents  
---  
  
📄 **From and To CSV**  
📄 **From and To Excel**  
📄 **From and To JSON**  
📄 **From and To HTML**  
📄 **From and To SQL**  
📄 **From and To Parquet**  
📄 **From and To Feather**  
📄 **From and To HDF**  
📦 **Comparing File Sizes**  
👉 **Next Topic: Data Inspection**

---  

Import core libraries: pandas for data handling, and os/sys for file path management.

In [1]:
import pandas as pd
import sys
import os

Define the folders and filenames for the data formats covered in this notebook (CSV, TSV, Excel, JSON, HTML, SQL, Parquet, Feather, HDF5). The comments note the optional library each format needs.

In [2]:
sys.path.append(os.path.abspath(".."))

data_raw = "../data/raw/"
data_processed = "../data/processed/"

csv_file = "ramen-ratings.csv"
tsv_file = "ramen-ratings.tsv"
excel_file = "ramen-ratings.xlsx"  # Requires openpyxl
json_file = "ramen-ratings.json"
html_file = "ramen-ratings.html"  # Reading requires an HTML parser such as lxml

database = "sqlite:///../data/sample_database.db"  # Requires SQLAlchemy
sql_table = "sample_table"

parquet_file = "ramen-ratings.parquet"  # Requires pyarrow
feather_file = "ramen-ratings.feather"  # Requires pyarrow
hdf_file = "ramen-ratings.h5"  # Requires PyTables (the pip package is named "tables")
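
The export cells below assume that the folder in ````data_processed```` already exists; pandas writers such as ````to_csv```` do not create missing directories. A minimal sketch of creating the folder up front, using a temporary directory as a stand-in for the project root (the names ````project_root```` and ````processed_dir```` are illustrative and not part of the notebook):

```python
import os
import tempfile

# Stand-in for the project root so the sketch does not touch real folders.
project_root = tempfile.mkdtemp()
processed_dir = os.path.join(project_root, "data", "processed")

# exist_ok=True makes the call safe to repeat when the folder already exists.
os.makedirs(processed_dir, exist_ok=True)
os.makedirs(processed_dir, exist_ok=True)

print(os.path.isdir(processed_dir))
```

In the notebook itself, the equivalent call would presumably be ````os.makedirs(data_processed, exist_ok=True)````.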

### 📄 From and to CSV

🔽 Load CSV into a DataFrame  
Read a CSV file and display the first 5 rows.

In [3]:
import_path = os.path.join(data_raw, csv_file)
df = pd.read_csv(import_path)

df.head()

| | Review # | Brand | Variety | Style | Country | Stars | Top Ten |
|---|---|---|---|---|---|---|---|
| 0 | 2580 | New Touch | T's Restaurant Tantanmen | Cup | Japan | 3.75 | |
| 1 | 2579 | Just Way | Noodles Spicy Hot Sesame Spicy Hot Sesame Guan... | Pack | Taiwan | 1.0 | |
| 2 | 2578 | Nissin | Cup Noodles Chicken Vegetable | Cup | USA | 2.25 | |
| 3 | 2577 | Wei Lih | GGE Ramen Snack Tomato Flavor | Pack | Taiwan | 2.75 | |
| 4 | 2576 | Ching's Secret | Singapore Curry | Pack | India | 3.75 | |
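
For larger files it can help to load only part of the data. The sketch below assumes an in-memory copy of a few rows from the output above (````csv_text```` is a stand-in, not the real file) and uses the ````usecols```` and ````nrows```` parameters of ````pd.read_csv````:

```python
import io

import pandas as pd

# Stand-in for the CSV file: header plus three rows copied from the output above.
csv_text = (
    "Review #,Brand,Variety,Style,Country,Stars,Top Ten\n"
    "2580,New Touch,T's Restaurant Tantanmen,Cup,Japan,3.75,\n"
    "2578,Nissin,Cup Noodles Chicken Vegetable,Cup,USA,2.25,\n"
    "2576,Ching's Secret,Singapore Curry,Pack,India,3.75,\n"
)

# usecols keeps only the listed columns; nrows stops after the first two data rows.
subset = pd.read_csv(
    io.StringIO(csv_text),
    usecols=["Review #", "Brand", "Stars"],
    nrows=2,
)

print(subset)
```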


💾 Export DataFrame to CSV without index  
Write the DataFrame to a CSV file, excluding the index column.

In [4]:
export_path = os.path.join(data_processed, csv_file)
df.to_csv(export_path, index=False)
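
The reason for ````index=False```` shows up when the file is read back: if the index is written and later read without ````index_col````, it returns as an extra unnamed column. A small sketch using in-memory buffers and a two-row stand-in frame (````df_demo```` is illustrative, with values taken from the rows shown above):

```python
import io

import pandas as pd

# Tiny stand-in frame; values are copied from the first rows shown above.
df_demo = pd.DataFrame({
    "Review #": [2580, 2579],
    "Brand": ["New Touch", "Just Way"],
    "Stars": [3.75, 1.0],
})

# Export with the default index=True, then read back without index_col.
with_index = io.StringIO()
df_demo.to_csv(with_index)
with_index.seek(0)
roundtrip_with_index = pd.read_csv(with_index)

# Export with index=False, then read back.
without_index = io.StringIO()
df_demo.to_csv(without_index, index=False)
without_index.seek(0)
roundtrip_without_index = pd.read_csv(without_index)

print(list(roundtrip_with_index.columns))     # ['Unnamed: 0', 'Review #', 'Brand', 'Stars']
print(list(roundtrip_without_index.columns))  # ['Review #', 'Brand', 'Stars']
```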

🔽 Read CSV with index column  
Read the CSV again, using ````"Review #"```` as the index column.

In [5]:
df = pd.read_csv(import_path, index_col='Review #')

df.head()

| Review # | Brand | Variety | Style | Country | Stars | Top Ten |
|---|---|---|---|---|---|---|
| 2580 | New Touch | T's Restaurant Tantanmen | Cup | Japan | 3.75 | |
| 2579 | Just Way | Noodles Spicy Hot Sesame Spicy Hot Sesame Guan... | Pack | Taiwan | 1.0 | |
| 2578 | Nissin | Cup Noodles Chicken Vegetable | Cup | USA | 2.25 | |
| 2577 | Wei Lih | GGE Ramen Snack Tomato Flavor | Pack | Taiwan | 2.75 | |
| 2576 | Ching's Secret | Singapore Curry | Pack | India | 3.75 | |


💾 Export DataFrame to CSV with index  
Write the DataFrame to a CSV file, including the index column. The path is the same as in the previous export, so the earlier file is overwritten.

In [6]:
export_path = os.path.join(data_processed, csv_file)
df.to_csv(export_path)

💾 Export CSV with tab-separated values  
Save the DataFrame as a tab-separated file (TSV).

In [7]:
export_path = os.path.join(data_processed, tsv_file)
df.to_csv(export_path, sep='\t')
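
Reading a TSV back uses the same ````pd.read_csv```` function with a matching ````sep````. A minimal round-trip sketch with an in-memory buffer (````tsv_demo```` is an illustrative two-row frame, not the full dataset):

```python
import io

import pandas as pd

# Two-row stand-in with "Review #" as the index, mirroring the notebook's frame.
tsv_demo = pd.DataFrame(
    {"Brand": ["New Touch", "Just Way"], "Stars": [3.75, 1.0]},
    index=pd.Index([2580, 2579], name="Review #"),
)

# Write tab-separated text, then read it back with the same separator.
buffer = io.StringIO()
tsv_demo.to_csv(buffer, sep="\t")
buffer.seek(0)
tsv_back = pd.read_csv(buffer, sep="\t", index_col="Review #")

print(tsv_back)
```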

### 📄 From and to Excel

🔽 Read Excel file  
Read data from an Excel file and set ````"Review #"```` as the index.

In [8]:
import_path = os.path.join(data_raw, excel_file)
df = pd.read_excel(import_path, index_col='Review #')

df.head()

| Review # | Brand | Variety | Style | Country | Stars | Top Ten |
|---|---|---|---|---|---|---|
| 2580 | New Touch | T's Restaurant Tantanmen | Cup | Japan | 3.75 | |
| 2579 | Just Way | Noodles Spicy Hot Sesame Spicy Hot Sesame Guan... | Pack | Taiwan | 1.0 | |
| 2578 | Nissin | Cup Noodles Chicken Vegetable | Cup | USA | 2.25 | |
| 2577 | Wei Lih | GGE Ramen Snack Tomato Flavor | Pack | Taiwan | 2.75 | |
| 2576 | Ching's Secret | Singapore Curry | Pack | India | 3.75 | |


💾 Write to Excel  
Export the DataFrame to an Excel file.

In [9]:
export_path = os.path.join(data_processed, excel_file)
df.to_excel(export_path)

### 📄 From and to JSON

🔽 Read JSON file  
Read structured data from a JSON file into a DataFrame.

In [10]:
import_path = os.path.join(data_raw, json_file)
df = pd.read_json(import_path)

df.head()

| | Review # | Brand | Variety | Style | Country | Stars | Top Ten |
|---|---|---|---|---|---|---|---|
| 0 | 2580 | New Touch | T's Restaurant Tantanmen | Cup | Japan | 3.75 | |
| 1 | 2579 | Just Way | Noodles Spicy Hot Sesame Spicy Hot Sesame Guan... | Pack | Taiwan | 1.0 | |
| 2 | 2578 | Nissin | Cup Noodles Chicken Vegetable | Cup | USA | 2.25 | |
| 3 | 2577 | Wei Lih | GGE Ramen Snack Tomato Flavor | Pack | Taiwan | 2.75 | |
| 4 | 2576 | Ching's Secret | Singapore Curry | Pack | India | 3.75 | |


💾 Export to JSON  
Export the DataFrame to a JSON file.

In [11]:
export_path = os.path.join(data_processed, json_file)
df.to_json(export_path)
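
By default ````to_json```` writes a DataFrame column by column. The ````orient```` parameter changes that layout; the sketch below compares the default with ````orient="records"```` on an illustrative two-row frame (````json_demo```` is a stand-in, not the notebook's data):

```python
import io
import json

import pandas as pd

json_demo = pd.DataFrame({"Brand": ["New Touch", "Just Way"], "Stars": [3.75, 1.0]})

# Default layout: one object per column, keyed by the row label.
column_oriented = json.loads(json_demo.to_json())

# Records layout: one object per row, often easier for other tools to consume.
records_text = json_demo.to_json(orient="records")
record_oriented = json.loads(records_text)

# Reading the records layout back into a DataFrame.
records_back = pd.read_json(io.StringIO(records_text), orient="records")

print(column_oriented)
print(record_oriented)
```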

### 📄 From and to HTML

🔽 Read tables from HTML  
Read tables from an HTML page; ````pd.read_html()```` returns a list of DataFrames, one per ````<table>```` element found.  
Here ````df```` holds that list, so ````df[0]```` selects the first table.

In [12]:
import_path = os.path.join(data_raw, html_file)
df = pd.read_html(import_path)

df[0].head()

| | Review # | Brand | Variety | Style | Country | Stars | Top Ten |
|---|---|---|---|---|---|---|---|
| 0 | 2580 | New Touch | T's Restaurant Tantanmen | Cup | Japan | 3.75 | |
| 1 | 2579 | Just Way | Noodles Spicy Hot Sesame Spicy Hot Sesame Guan... | Pack | Taiwan | 1.0 | |
| 2 | 2578 | Nissin | Cup Noodles Chicken Vegetable | Cup | USA | 2.25 | |
| 3 | 2577 | Wei Lih | GGE Ramen Snack Tomato Flavor | Pack | Taiwan | 2.75 | |
| 4 | 2576 | Ching's Secret | Singapore Curry | Pack | India | 3.75 | |


💾 Export to HTML  
Export the first HTML table (DataFrame) to an HTML file without the index.

In [13]:
export_path = os.path.join(data_processed, html_file)
df[0].to_html(export_path, index=False)

### 📄 From and to SQL

🔌 Create SQLAlchemy engine  
Set up a connection to a SQLite database using SQLAlchemy.

In [14]:
from sqlalchemy import create_engine
engine = create_engine(database)

🔽 Read from SQL table  
Read a SQL table into a DataFrame and set ````"Review #"```` as the index.

In [15]:
df = pd.read_sql(sql_table, engine, index_col='Review #')
df.head()

| Review # | Brand | Variety | Style | Country | Stars | Top Ten |
|---|---|---|---|---|---|---|
| 2580 | New Touch | T's Restaurant Tantanmen | Cup | Japan | 3.75 | |
| 2579 | Just Way | Noodles Spicy Hot Sesame Spicy Hot Sesame Guan... | Pack | Taiwan | 1.0 | |
| 2578 | Nissin | Cup Noodles Chicken Vegetable | Cup | USA | 2.25 | |
| 2577 | Wei Lih | GGE Ramen Snack Tomato Flavor | Pack | Taiwan | 2.75 | |
| 2576 | Ching's Secret | Singapore Curry | Pack | India | 3.75 | |


💾 Export to SQL  
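Write the DataFrame to the SQL table, replacing the table if it already exists. The call returns the number of rows written, which is the ````2580```` shown below the cell.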

In [16]:
df.to_sql(sql_table, engine, if_exists='replace')

2580
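
The ````if_exists```` argument decides what happens when the target table is already present. The sketch below contrasts ````'replace'```` with ````'append'```` on an in-memory SQLite database. It uses a plain ````sqlite3```` connection rather than the SQLAlchemy engine above, purely to keep the example dependency-free; ````demo_table```` and ````sql_demo```` are illustrative names:

```python
import sqlite3

import pandas as pd

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")

sql_demo = pd.DataFrame({"Brand": ["New Touch", "Just Way"], "Stars": [3.75, 1.0]})

# 'replace' drops and recreates the table; 'append' adds rows to the existing one.
sql_demo.to_sql("demo_table", conn, if_exists="replace", index=False)
sql_demo.to_sql("demo_table", conn, if_exists="append", index=False)
rows_after_append = len(pd.read_sql("SELECT * FROM demo_table", conn))

# Replacing again discards the appended rows.
sql_demo.to_sql("demo_table", conn, if_exists="replace", index=False)
rows_after_replace = len(pd.read_sql("SELECT * FROM demo_table", conn))

conn.close()
print(rows_after_append, rows_after_replace)  # 4 2
```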

### 📄 From and to Parquet, Feather and HDF

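Parquet, Feather, and HDF5 are binary file formats. Unlike text formats such as CSV or JSON, they store column data types alongside the data and are usually faster to read and write, which makes them well suited to large datasets or to workflows that save and load data frequently. Parquet in particular also compresses well.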

#### 📄 From and to Parquet

🔽 Read from Parquet  
Read a Parquet file into a DataFrame.

In [17]:
import_path = os.path.join(data_raw, parquet_file)
df = pd.read_parquet(import_path)

df.head()

| Review # | Brand | Variety | Style | Country | Stars | Top Ten |
|---|---|---|---|---|---|---|
| 2580 | New Touch | T's Restaurant Tantanmen | Cup | Japan | 3.75 | |
| 2579 | Just Way | Noodles Spicy Hot Sesame Spicy Hot Sesame Guan... | Pack | Taiwan | 1.0 | |
| 2578 | Nissin | Cup Noodles Chicken Vegetable | Cup | USA | 2.25 | |
| 2577 | Wei Lih | GGE Ramen Snack Tomato Flavor | Pack | Taiwan | 2.75 | |
| 2576 | Ching's Secret | Singapore Curry | Pack | India | 3.75 | |


💾 Export to Parquet  
Save the DataFrame in Parquet format using Snappy compression.

In [18]:
export_path = os.path.join(data_processed, parquet_file)
df.to_parquet(export_path, engine='pyarrow', compression='snappy')

#### 📄 From and to Feather

🔽 Read from Feather  
Read a Feather file into a DataFrame.

In [19]:
import_path = os.path.join(data_raw, feather_file)
df = pd.read_feather(import_path)

df.head()

| Review # | Brand | Variety | Style | Country | Stars | Top Ten |
|---|---|---|---|---|---|---|
| 2580 | New Touch | T's Restaurant Tantanmen | Cup | Japan | 3.75 | |
| 2579 | Just Way | Noodles Spicy Hot Sesame Spicy Hot Sesame Guan... | Pack | Taiwan | 1.0 | |
| 2578 | Nissin | Cup Noodles Chicken Vegetable | Cup | USA | 2.25 | |
| 2577 | Wei Lih | GGE Ramen Snack Tomato Flavor | Pack | Taiwan | 2.75 | |
| 2576 | Ching's Secret | Singapore Curry | Pack | India | 3.75 | |


💾 Export to Feather  
Export the DataFrame to a Feather file.

In [20]:
export_path = os.path.join(data_processed, feather_file)
df.to_feather(export_path)

#### 📄 From and to HDF

🔽 Read from HDF5  
Load a DataFrame from an HDF5 file using a specific key.

In [21]:
import_path = os.path.join(data_raw, hdf_file)
df = pd.read_hdf(import_path, key='df')

df.head()

| Review # | Brand | Variety | Style | Country | Stars | Top Ten |
|---|---|---|---|---|---|---|
| 2580 | New Touch | T's Restaurant Tantanmen | Cup | Japan | 3.75 | |
| 2579 | Just Way | Noodles Spicy Hot Sesame Spicy Hot Sesame Guan... | Pack | Taiwan | 1.0 | |
| 2578 | Nissin | Cup Noodles Chicken Vegetable | Cup | USA | 2.25 | |
| 2577 | Wei Lih | GGE Ramen Snack Tomato Flavor | Pack | Taiwan | 2.75 | |
| 2576 | Ching's Secret | Singapore Curry | Pack | India | 3.75 | |


💾 Export to HDF5 with compression  
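Write to an HDF5 file with compression for better space efficiency. ````format='table'```` selects the PyTables table layout, and ````complib='blosc'```` with ````complevel=9```` applies Blosc compression at the highest level.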

In [22]:
export_path = os.path.join(data_processed, hdf_file)
df.to_hdf(export_path, key='df', mode='w', format='table', complib='blosc', complevel=9)

### 📦 Comparing the file sizes  
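Measure the size of each exported file in KB and print the results from smallest to largest. Each line is labeled with the file extension only, so files that share an extension (such as the two ````Csv```` entries in the output) cannot be told apart by label.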

In [23]:
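from pathlib import Path

folder = Path(data_processed)
file_dict = {}

# Collect the size of every regular file in the folder, in kilobytes.
for file in folder.iterdir():
    if file.is_file():
        size_kb = file.stat().st_size / 1024
        file_dict[file.name] = round(size_kb, 2)

# Sort by size (ascending) and label each entry by its file extension.
# Path.suffix also copes with names that contain extra dots or no dot at all.
sorted_dict = dict(sorted(file_dict.items(), key=lambda item: item[1]))
for key, value in sorted_dict.items():
    print(f"File: {Path(key).suffix.lstrip('.').title()} with size: {value}")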

File: Parquet with size: 70.78
File: Csv with size: 75.11
File: Xlsx with size: 118.7
File: Feather with size: 119.45
File: H5 with size: 145.09
File: Tsv with size: 157.12
File: Csv with size: 157.13
File: Json with size: 305.87
File: Html with size: 475.29
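
Because the output above labels files by extension only, the two CSV entries are ambiguous. One possible variation, sketched here under the assumption that full file names are acceptable as labels, reports the name and the unit; ````sizes_by_name```` is a hypothetical helper and the temporary folder stands in for ````data_processed````:

```python
import tempfile
from pathlib import Path


def sizes_by_name(folder):
    """Return {file name: size in KB}, sorted from smallest to largest."""
    sizes = {
        path.name: round(path.stat().st_size / 1024, 2)
        for path in Path(folder).iterdir()
        if path.is_file()
    }
    return dict(sorted(sizes.items(), key=lambda item: item[1]))


# Two dummy files with the same extension but different sizes.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "small.csv").write_bytes(b"x" * 1024)
    (Path(tmp) / "large.csv").write_bytes(b"x" * 4096)
    demo_sizes = sizes_by_name(tmp)

for name, size_kb in demo_sizes.items():
    print(f"{name}: {size_kb} KB")
```

With the notebook's folder, the call would presumably be ````sizes_by_name(data_processed)````.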


### 👉 Next Topic: [Data Inspection](./03-data-inspection.ipynb)

Learn how to inspect data with pandas.