# Pandas I/O: convert DataFrame from/to various formats

## Introduction
In this lecture, we will explore how to write pandas DataFrames to CSV, Excel, and JSON files, how to read them back, and how to load HTML tables into DataFrames. This is essential for data manipulation, storage, and retrieval in various data science and data engineering tasks.


## Converting DataFrames to CSV
To convert a DataFrame to a CSV file, you can use the `to_csv` method.


In [24]:
import pandas as pd

# Creating a sample DataFrame
data = {
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Age": [24, 27, 22, 32],
    "City": ["New York", "Los Angeles", "Chicago", "Houston"]
}
df = pd.DataFrame(data)

# Save to CSV
df.to_csv('./data/sample.csv', index=False)


### Specifying Columns and Separator

You can also specify which columns to include in the CSV file and use a different separator, such as a semicolon `;` instead of the default comma `,`.

The following code creates a CSV file "example_names_cities.csv" from the DataFrame defined above, with the content:

Name;City
Alice;New York
Bob;Los Angeles
Charlie;Chicago
David;Houston

In [None]:
# Save only the 'Name' and 'City' columns to a CSV file, using a semicolon as the separator
df.to_csv('example_names_cities.csv', columns=['Name', 'City'], sep=';', index=False)
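
A file written with a non-default separator has to be read back with the same `sep`. The sketch below checks this round trip in memory. The `demo` frame and the variable names are illustrative only and are not part of the lecture data.

```python
import io

import pandas as pd

# Hypothetical frame used only for this illustration
demo = pd.DataFrame({
    "Name": ["Alice", "Bob"],
    "Age": [24, 27],
    "City": ["New York", "Chicago"],
})

# With no path argument, to_csv returns the CSV text instead of writing a file
text = demo.to_csv(columns=["Name", "City"], sep=";", index=False)
print(text)

# Reading back requires the same separator, otherwise each line is one column
restored = pd.read_csv(io.StringIO(text), sep=";")
print(restored)
```

Only the selected columns survive the round trip; `Age` is not in `restored`.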

### Quoting Strategies

The `quoting` parameter controls when quotes are applied to cell values. It accepts one of the constants defined in the `csv` module:

- `csv.QUOTE_MINIMAL`: Quotes are applied to fields only when necessary (e.g., when a field contains a delimiter like a comma or a quote character). This is the default behavior.
- `csv.QUOTE_ALL`: Quotes are applied to all fields.
- `csv.QUOTE_NONNUMERIC`: Quotes are applied to non-numeric fields.
- `csv.QUOTE_NONE`: No fields are quoted; use this with caution as it may make your CSV file difficult to parse if your data contains the delimiter.
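
To see how these constants differ, the following sketch writes the same small frame with three of them and prints the resulting text. The `demo` frame is a made-up example; `to_csv` is called without a path so that it returns a string.

```python
import csv

import pandas as pd

# Hypothetical two-row frame: one name contains the delimiter
demo = pd.DataFrame({"Name": ["Ivan", "Maria, PhD"], "Age": [34, 28]})

# Only the field containing a comma is quoted
minimal = demo.to_csv(index=False, quoting=csv.QUOTE_MINIMAL)

# Every field is quoted, including the numbers
quote_all = demo.to_csv(index=False, quoting=csv.QUOTE_ALL)

# Strings (including the header) are quoted, integers are not
nonnumeric = demo.to_csv(index=False, quoting=csv.QUOTE_NONNUMERIC)

for label, text in [("MINIMAL", minimal), ("ALL", quote_all), ("NONNUMERIC", nonnumeric)]:
    print(label)
    print(text)
```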

#### Example with Quoted Values
Suppose you have a DataFrame where some names contain commas, and you want to ensure these names are correctly quoted in the CSV file. 

This code snippet will produce an "example_quoted.csv" file where only non-numeric values are quoted, which ensures that names with commas are correctly represented as single fields:

```
"Name","Age","City"
"Ivan",34,"Sofia"
"Maria, PhD",28,"Plovdiv"
"Georgi, MD",45,"Varna"
```

In [None]:
import csv

# Create a DataFrame with names that include commas
data = {
    'Name': ['Ivan', 'Maria, PhD', 'Georgi, MD'],
    'Age': [34, 28, 45],
    'City': ['Sofia', 'Plovdiv', 'Varna']
}
df = pd.DataFrame(data)

# Save the DataFrame to a CSV file, ensuring proper quoting
df.to_csv('example_quoted.csv', index=False, quoting=csv.QUOTE_NONNUMERIC)


#### Customizing Quote Character

You can also customize the quote character using the `quotechar` parameter. This is useful if your data includes the standard quote character (`"`) and you want to use a different character to enclose your fields.

The following code produces `example_custom_quote.csv`, which uses `'` as the quote character:

```
'Name','Age','City'
'Ivan',34,'Sofia'
'Maria, PhD',28,'Plovdiv'
'Georgi, MD',45,'Varna'
```

In [None]:
# Save the DataFrame using a custom quote character
df.to_csv('example_custom_quote.csv', index=False, quotechar='\'', quoting=csv.QUOTE_NONNUMERIC)

## Converting DataFrames to Excel
To convert a DataFrame to an Excel file, you can use the `to_excel` method.

If you get `ModuleNotFoundError: No module named 'openpyxl'`, install the package with `pip install openpyxl`.

In [25]:
# Save to Excel
df.to_excel('./data/sample.xlsx', index=False)


## Converting DataFrames to JSON
To convert a DataFrame to a JSON file, you can use the `to_json` method.


In [26]:
# Save to JSON
df.to_json('./data/sample.json', orient='records')
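
The `orient` argument decides the shape of the JSON that is written. As a rough illustration, the sketch below serializes a small made-up frame with three orientations and parses the text back with the standard `json` module. When no path is given, `to_json` returns a string.

```python
import json

import pandas as pd

# Hypothetical frame used only for this illustration
demo = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [24, 27]})

# One JSON object per row
records = json.loads(demo.to_json(orient="records"))

# Column name -> {index label -> value}; index labels become strings
columns = json.loads(demo.to_json(orient="columns"))

# Separate "columns", "index" and "data" entries
split = json.loads(demo.to_json(orient="split"))

print(records)
print(columns)
print(split)
```

`orient='records'` is usually the easiest shape for other tools to consume, but it drops the index.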


## Reading DataFrames from CSV
To read a DataFrame from a CSV file, you can use the `pd.read_csv` function.


In [27]:
# Read from CSV
df_csv = pd.read_csv('./data/sample.csv')
df_csv.head()


| | Name | Age | City |
|---|---|---|---|
| 0 | Alice | 24 | New York |
| 1 | Bob | 27 | Los Angeles |
| 2 | Charlie | 22 | Chicago |
| 3 | David | 32 | Houston |


#### header parameter

By default, the first row of the CSV file is used as the header (column labels). If the file contains only data, pass `header=None` so that `read_csv` does not treat the first row as a header:

In [31]:
csv_url = './data/sample.csv'
csv_df = pd.read_csv(csv_url, sep=",", header=None)
csv_df.head(5)

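# Note: sample.csv does have a header row, so with header=None that row is now part of the data:

| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | Name | Age | City |
| 1 | Alice | 24 | New York |
| 2 | Bob | 27 | Los Angeles |
| 3 | Charlie | 22 | Chicago |
| 4 | David | 32 | Houston |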


#### names parameter

We can pass a list of column names to use.
Note that if the CSV file already has a header row, we must explicitly pass ``header=0`` so that the existing names are replaced instead of being read as data.

In [32]:
csv_df = pd.read_csv("./data/sample.csv",
                   header=0,
                   names=['A','B','C'])
csv_df.head(3)

| | A | B | C |
|---|---|---|---|
| 0 | Alice | 24 | New York |
| 1 | Bob | 27 | Los Angeles |
| 2 | Charlie | 22 | Chicago |
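
The complementary case is a file that really has no header row. A minimal sketch, assuming a small in-memory CSV (the `raw` text and the `people` name are made up for illustration), combines `header=None` with `names` so that the first line stays in the data:

```python
import io

import pandas as pd

# Hypothetical headerless CSV content
raw = "Alice,24,New York\nBob,27,Los Angeles\n"

# header=None keeps the first line as data; names supplies the column labels
people = pd.read_csv(
    io.StringIO(raw),
    header=None,
    names=["Name", "Age", "City"],
)
print(people)
```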


#### Loading big files - nrows parameter

The `nrows` parameter sets the number of rows to read from the file. It is useful for reading a piece of a large file.

Other useful parameters are `chunksize` and `iterator`.

In [None]:
# Read only the first 5 data rows of the file
csv_df = pd.read_csv("../../datasets/various/drinks.csv", nrows=5)
csv_df
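
When the whole file is needed but does not fit comfortably in memory, `chunksize` makes `read_csv` return an iterator of DataFrames instead of a single one. The sketch below uses a small in-memory CSV as a stand-in for a large file. The column names `country` and `beer` are hypothetical and are not taken from `drinks.csv`.

```python
import io

import pandas as pd

# Hypothetical 10-row CSV standing in for a large file
raw = "country,beer\n" + "".join(f"c{i},{i}\n" for i in range(10))

total = 0
chunk_sizes = []

# Each iteration yields a DataFrame with at most `chunksize` rows
for chunk in pd.read_csv(io.StringIO(raw), chunksize=4):
    chunk_sizes.append(len(chunk))
    total += chunk["beer"].sum()

print(chunk_sizes)
print(total)
```

Aggregating per chunk like this keeps only one piece of the file in memory at a time.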

## Reading DataFrames from Excel
To read a DataFrame from an Excel file, you can use the `pd.read_excel` function.


In [28]:
# Read from Excel
df_excel = pd.read_excel('./data/sample.xlsx')
df_excel.head()


| | Name | Age | City |
|---|---|---|---|
| 0 | Alice | 24 | New York |
| 1 | Bob | 27 | Los Angeles |
| 2 | Charlie | 22 | Chicago |
| 3 | David | 32 | Houston |


## Reading DataFrames from JSON
To read a DataFrame from a JSON file, you can use the `pd.read_json` function.

For more complicated JSON structures, it is often more convenient to use the `pd.json_normalize` function.


In [29]:
# Read from JSON: variant 1
df_json_1 = pd.read_json('./data/sample.json')
df_json_1.head()


| | Name | Age | City |
|---|---|---|---|
| 0 | Alice | 24 | New York |
| 1 | Bob | 27 | Los Angeles |
| 2 | Charlie | 22 | Chicago |
| 3 | David | 32 | Houston |


In [23]:
# Read from JSON: variant 2
import json

# Load JSON data from file
with open('./data/dict_of_lists.json', 'r') as file:
    json_data = json.load(file)

# Convert to DataFrame
df_json_2 = pd.json_normalize(json_data, 'cryptocurrencies')
df_json_2

| | rank | name | symbol | price_usd | percent_change_1h | percent_change_24h | percent_change_7d | market_cap_usd | volume_24h_usd | circulating_supply |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Bitcoin | BTC | 68456.87 | 0.03 | 1.15 | 0.89 | 1348286628482 | 27198913584 | 19705575 |
| 1 | 2 | Ethereum | ETH | 3772.77 | 0.45 | 0.31 | 0.24 | 452820284798 | 15829088239 | 120139698 |
| 2 | 3 | Tether | USDT | 0.9999 | 0.02 | 0.06 | 0.03 | 111947908779 | 63651947485 | 111963763435 |
| 3 | 4 | BNB | BNB | 595.72 | 0.02 | 0.23 | 0.53 | 87920452142 | 1779578697 | 147585418 |
| 4 | 5 | Solana | SOL | 168.63 | 0.1 | 0.24 | 1.26 | 77506643991 | 2802872850 | 459619897 |
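
The contents of `dict_of_lists.json` are not shown here, so the following is only a sketch of what `pd.json_normalize` does with a nested structure. The `payload` dictionary, its `snapshot` key, and the nested `quote` field are invented for illustration. `record_path` selects the list of records, `meta` copies a top-level value onto every row, and nested dictionaries inside a record are flattened into dotted column names.

```python
import pandas as pd

# Hypothetical nested payload; values are placeholders, not real market data
payload = {
    "snapshot": "demo",
    "cryptocurrencies": [
        {"name": "CoinA", "symbol": "AAA", "quote": {"usd": 1.5}},
        {"name": "CoinB", "symbol": "BBB", "quote": {"usd": 2.5}},
    ],
}

flat = pd.json_normalize(
    payload,
    record_path="cryptocurrencies",  # list of records to turn into rows
    meta=["snapshot"],               # top-level field repeated on each row
)
print(flat)
```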


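## Reading DataFrames from HTML tables

`pd.read_html` returns a list of DataFrames, one for each table found on the page, so the result has to be indexed to pick the table you need.

If you get `ImportError: Missing optional dependency 'lxml'`, install it with `pip install lxml`.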


In [11]:
url = 'https://coinmarketcap.com/all/views/all/'
df_crypto = pd.read_html(url)
df_crypto[2]

Unnamed: 0,Rank,Name,Symbol,Market Cap,Price,Circulating Supply,Volume(24h),% 1h,% 24h,% 7d,...,Unnamed: 991,Unnamed: 992,Unnamed: 993,Unnamed: 994,Unnamed: 995,Unnamed: 996,Unnamed: 997,Unnamed: 998,Unnamed: 999,Unnamed: 1000
0,1.0,BTCBitcoin,BTC,"$1.35T$1,346,020,961,960","$68,306.61","19,705,575 BTC","$27,250,132,455",-0.31%,0.99%,0.86%,...,,,,,,,,,,
1,2.0,ETHEthereum,ETH,"$452.24B$452,244,939,982","$3,764.33","120,139,698 ETH *","$15,846,904,549",-0.51%,-0.49%,-0.18%,...,,,,,,,,,,
2,3.0,USDTTether USDt,USDT,"$111.92B$111,919,471,570",$0.9996,"111,963,763,435 USDT *","$63,729,482,518",<0.01%,0.03%,-0.02%,...,,,,,,,,,,
3,4.0,BNBBNB,BNB,"$87.8B$87,800,998,970",$594.92,"147,585,418 BNB *","$1,778,506,371",-0.15%,-0.30%,0.44%,...,,,,,,,,,,
4,5.0,SOLSolana,SOL,"$77.4B$77,401,540,738",$168.40,"459,619,897 SOL *","$2,812,785,939",0.19%,-0.44%,-1.47%,...,,,,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
195,,Centrifuge,,,,,,,,,...,,,,,,,,,,
196,,Highstreet,,,,,,,,,...,,,,,,,,,,
197,,Flux,,,,,,,,,...,,,,,,,,,,
198,,Mask Network,,,,,,,,,...,,,,,,,,,,
