# Pandas I/O: convert DataFrame from/to various formats

## Introduction
In this lecture, we will explore how to convert pandas DataFrames to and from different formats such as CSV, Excel, and JSON. This is essential for data manipulation, storage, and retrieval in various data science and data engineering tasks.


## Converting DataFrames to CSV
To convert a DataFrame to a CSV file, you can use the `to_csv` method.


In [1]:
import pandas as pd

# Creating a sample DataFrame
data = {
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Age": [24, 27, 22, 32],
    "City": ["New York", "Los Angeles", "Chicago", "Houston"]
}
df = pd.DataFrame(data)

# Save to CSV
df.to_csv('./data/sample.csv', index=False)


### Specifying Columns and Separator

You can also choose which columns to write to the CSV file and use a different separator, such as a semicolon (`;`) instead of the default comma (`,`).

The following code creates the file "example_names_cities.csv" with this content:

Name;City
Alice;New York
Bob;Los Angeles
Charlie;Chicago
David;Houston

In [2]:
# Save only the 'Name' and 'City' columns to a CSV file, using a semicolon as the separator
df.to_csv('example_names_cities.csv', columns=['Name', 'City'], sep=';', index=False)
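
As a quick check of the `columns` and `sep` arguments, the sketch below writes to a string instead of a file (calling `to_csv` without a path returns the CSV text) and reads it back with the same separator. The in-memory round trip is an illustrative choice and is not part of the original example.

```python
from io import StringIO

import pandas as pd

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "David"],
    "Age": [24, 27, 22, 32],
    "City": ["New York", "Los Angeles", "Chicago", "Houston"],
})

# With no path argument, to_csv returns the CSV content as a string
csv_text = df.to_csv(columns=["Name", "City"], sep=";", index=False)
print(csv_text)

# The same separator must be passed when reading the data back
df_back = pd.read_csv(StringIO(csv_text), sep=";")
print(df_back.columns.tolist())
```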

### Quoting Strategies

The `quoting` parameter controls when quotes are applied to cell values. It accepts one of the constants defined in the `csv` module:

- `csv.QUOTE_MINIMAL`: Quotes are applied to fields only when necessary (e.g., when a field contains a delimiter like a comma or a quote character). This is the default behavior.
- `csv.QUOTE_ALL`: Quotes are applied to all fields.
- `csv.QUOTE_NONNUMERIC`: Quotes are applied to non-numeric fields.
- `csv.QUOTE_NONE`: No fields are quoted; use this with caution as it may make your CSV file difficult to parse if your data contains the delimiter.
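
To compare the strategies side by side, the following sketch writes one small DataFrame with each constant and prints the result. The sample values are hypothetical, and `escapechar` is added for `csv.QUOTE_NONE` because one value contains the delimiter.

```python
import csv

import pandas as pd

df = pd.DataFrame({"Name": ["Ann", "Bo, Jr"], "Age": [30, 41]})

strategies = {
    "QUOTE_MINIMAL": csv.QUOTE_MINIMAL,
    "QUOTE_ALL": csv.QUOTE_ALL,
    "QUOTE_NONNUMERIC": csv.QUOTE_NONNUMERIC,
}

results = {}
for label, quoting in strategies.items():
    # Without a path, to_csv returns the CSV text as a string
    results[label] = df.to_csv(index=False, quoting=quoting)
    print(f"--- {label} ---")
    print(results[label])

# QUOTE_NONE needs an escape character when a value contains the delimiter
results["QUOTE_NONE"] = df.to_csv(index=False, quoting=csv.QUOTE_NONE, escapechar="\\")
print("--- QUOTE_NONE ---")
print(results["QUOTE_NONE"])
```

Note that the column names are strings, so `QUOTE_ALL` and `QUOTE_NONNUMERIC` quote the header row as well.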

#### Example with Quoted Values
Suppose you have a DataFrame where some names contain commas, and you want to ensure these names are correctly quoted in the CSV file.

The code below produces an "example_quoted.csv" file in which every non-numeric value, including the column names, is quoted, so names containing commas are stored as single fields:

```
"Name","Age","City"
"Ivan",34,"Sofia"
"Maria, PhD",28,"Plovdiv"
"Georgi, MD",45,"Varna"
```

In [3]:
import csv

# Create a DataFrame with names that include commas
data = {
    'Name': ['Ivan', 'Maria, PhD', 'Georgi, MD'],
    'Age': [34, 28, 45],
    'City': ['Sofia', 'Plovdiv', 'Varna']
}
df = pd.DataFrame(data)

# Save the DataFrame to a CSV file, ensuring proper quoting
df.to_csv('example_quoted.csv', index=False, quoting=csv.QUOTE_NONNUMERIC)


#### Customizing Quote Character

You can also customize the quote character using the `quotechar` parameter. This is useful if your data includes the standard quote character (`"`) and you want to use a different character to enclose your fields.

The following code produces `example_custom_quote.csv`, where the quote character is `'`:

```
'Name','Age','City'
'Ivan',34,'Sofia'
'Maria, PhD',28,'Plovdiv'
'Georgi, MD',45,'Varna'
```

In [4]:
# Save the DataFrame using a custom quote character
df.to_csv('example_custom_quote.csv', index=False, quotechar='\'', quoting=csv.QUOTE_NONNUMERIC)
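
A file written with a custom quote character has to be read with the same one. The sketch below illustrates this with an in-memory round trip (the use of `StringIO` instead of a file is an assumption made to keep the example self-contained).

```python
import csv
from io import StringIO

import pandas as pd

df = pd.DataFrame({
    "Name": ["Ivan", "Maria, PhD", "Georgi, MD"],
    "Age": [34, 28, 45],
    "City": ["Sofia", "Plovdiv", "Varna"],
})

# Write with a single quote as the quote character
csv_text = df.to_csv(index=False, quotechar="'", quoting=csv.QUOTE_NONNUMERIC)

# Without quotechar="'", the commas inside the names would be treated as delimiters
df_back = pd.read_csv(StringIO(csv_text), quotechar="'")
print(df_back)
```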

## Converting DataFrames to Excel
To convert a DataFrame to an Excel file, you can use the `to_excel` method.

If you get `ModuleNotFoundError: No module named 'openpyxl'`, install the package with `pip install openpyxl`.

In [5]:
# Save to Excel
df.to_excel('./data/sample.xlsx', index=False)

## Converting DataFrames to JSON
To convert a DataFrame to a JSON file, you can use the `to_json` method.


In [6]:
# Save to JSON
df.to_json('./data/sample.json', orient='records')
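
The `orient` argument determines the layout of the resulting JSON. As a minimal sketch with hypothetical two-row data, the code below returns the JSON as a string (no path given) for three common orientations:

```python
import json

import pandas as pd

df = pd.DataFrame({"Name": ["Ivan", "Maria"], "Age": [34, 28]})

# One JSON object per row
records_json = df.to_json(orient="records")
print(records_json)

# One JSON object per column, keyed by the row index (the default for DataFrames)
columns_json = df.to_json(orient="columns")
print(columns_json)

# Separate "columns", "index" and "data" entries
split_json = df.to_json(orient="split")
print(split_json)
```

The `records` layout drops the index, which is usually what you want when the index is just the default row counter.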

## Load DataFrames from CSV

To load a DataFrame from a CSV file, you can use the `read_csv` function.

In [7]:
# Read from CSV
df_csv = pd.read_csv('./data/sample.csv')
df_csv.head()

,Name,Age,City
0,Alice,24,New York
1,Bob,27,Los Angeles
2,Charlie,22,Chicago
3,David,32,Houston


#### header parameter

By default, the first row of the CSV file is used as the header (column labels). If the file contains only data, pass `header=None` to tell `read_csv` not to treat the first row as a header:

In [8]:
csv_url = './data/sample.csv'
csv_df = pd.read_csv(csv_url, sep=",", header=None)
csv_df.head(5)

# Note that the header row is now part of the data:

,0,1,2
0,Name,Age,City
1,Alice,24,New York
2,Bob,27,Los Angeles
3,Charlie,22,Chicago
4,David,32,Houston


#### names parameter

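We can pass a list of column names to use.
Note that if the CSV file already has a header row, you must explicitly pass ``header=0`` to replace the existing names. With ``header=None``, as in the example below, the original header row is kept as the first row of data.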

In [9]:
csv_df = pd.read_csv(
    "./data/sample.csv",
    header=None,
    names=['A','B','C'])
csv_df.head(3)

,A,B,C
0,Name,Age,City
1,Alice,24,New York
2,Bob,27,Los Angeles
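
To see how `header` and `names` interact, the sketch below reads the same small CSV text twice, once with `header=None` and once with `header=0`. The inline text is a stand-in for a file and is an assumption made to keep the example self-contained.

```python
from io import StringIO

import pandas as pd

csv_text = "Name,Age,City\nAlice,24,New York\nBob,27,Los Angeles\n"

# header=None: the file's header row is kept as the first data row
kept = pd.read_csv(StringIO(csv_text), header=None, names=["A", "B", "C"])
print(kept)

# header=0: the file's header row is discarded and replaced by `names`
replaced = pd.read_csv(StringIO(csv_text), header=0, names=["A", "B", "C"])
print(replaced)
```

In the second case, column `B` is parsed as integers because the text "Age" is no longer part of the data.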


#### Loading big files - nrows parameter

The `nrows` parameter sets the number of rows to read from the file, which is useful for reading a piece of a large file.

Other useful parameters are `chunksize` and `iterator`.

In [10]:
csv_df = pd.read_csv(
    "https://raw.githubusercontent.com/geekcourses/JupyterNotebooksExamples/master/datasets/various/drinks.csv",
    nrows=5
)
csv_df

,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
0,Afghanistan,0,0,0,0.0,AS
1,Albania,89,132,54,4.9,EU
2,Algeria,25,0,14,0.7,AF
3,Andorra,245,138,312,12.4,EU
4,Angola,217,57,45,5.9,AF
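
While `nrows` reads only the beginning of a file, `chunksize` lets you process the whole file piece by piece. The following is a minimal sketch, using hypothetical in-memory data in place of a large file:

```python
from io import StringIO

import pandas as pd

# Hypothetical in-memory data standing in for a large file
csv_text = "id,value\n1,10\n2,20\n3,30\n4,40\n5,50\n"

total = 0
chunk_sizes = []

# With chunksize set, read_csv returns an iterator of DataFrames
for chunk in pd.read_csv(StringIO(csv_text), chunksize=2):
    chunk_sizes.append(len(chunk))
    total += chunk["value"].sum()

print(chunk_sizes)
print(total)
```

Each chunk is an ordinary DataFrame, so only one piece of the file needs to be held in memory at a time.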


## Load DataFrames from Excel

To load a DataFrame from an Excel file, you can use the `read_excel` function.

In [11]:
# Read from Excel
df_excel = pd.read_excel('./data/sample.xlsx')
df_excel.head()

,Name,Age,City
0,Ivan,34,Sofia
1,"Maria, PhD",28,Plovdiv
2,"Georgi, MD",45,Varna


## Load DataFrames from JSON

To load a DataFrame from a JSON file, you can use the `read_json` function.

In [12]:
# Read from JSON: variant 1
df_json_1 = pd.read_json('./data/sample.json')
df_json_1.head()

,Name,Age,City
0,Ivan,34,Sofia
1,"Maria, PhD",28,Plovdiv
2,"Georgi, MD",45,Varna


For more complex JSON structures, it is often more convenient to use the `pd.json_normalize` function.

In [13]:
# Read from JSON: variant 2
import json

# Load JSON data from file
with open('./data/dict_of_lists.json', 'r') as file:
    json_data = json.load(file)

# Convert to DataFrame
df_json_2 = pd.json_normalize(json_data, 'cryptocurrencies')
df_json_2

,rank,name,symbol,price_usd,percent_change_1h,percent_change_24h,percent_change_7d,market_cap_usd,volume_24h_usd,circulating_supply
0,1,Bitcoin,BTC,68456.87,0.03,1.15,0.89,1348286628482,27198913584,19705575
1,2,Ethereum,ETH,3772.77,0.45,0.31,0.24,452820284798,15829088239,120139698
2,3,Tether,USDT,0.9999,0.02,0.06,0.03,111947908779,63651947485,111963763435
3,4,BNB,BNB,595.72,0.02,0.23,0.53,87920452142,1779578697,147585418
4,5,Solana,SOL,168.63,0.1,0.24,1.26,77506643991,2802872850,459619897
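
The contents of `dict_of_lists.json` are not shown here, so the sketch below uses a hypothetical dictionary with placeholder values to illustrate what `pd.json_normalize` does with a list nested under a key. The `record_path` and `meta` arguments are real parameters of the function. The field names and values are assumptions.

```python
import pandas as pd

# Hypothetical structure with placeholder values, not real market data
json_data = {
    "source": "example",
    "cryptocurrencies": [
        {"rank": 1, "name": "CoinA", "price": {"usd": 10.0}},
        {"rank": 2, "name": "CoinB", "price": {"usd": 2.5}},
    ],
}

# record_path selects the list to expand into rows;
# meta copies top-level fields onto every row
df_flat = pd.json_normalize(json_data, record_path="cryptocurrencies", meta=["source"])
print(df_flat)
```

Nested dictionaries inside each record are flattened into dotted column names, such as `price.usd` here.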


## Load DataFrame from HTML tables

Pandas provides a convenient function, `read_html`, for reading HTML tables directly from a URL. It relies on optional parser dependencies: lxml (the default), or BeautifulSoup4 together with html5lib. If you encounter an ImportError for one of these dependencies, install it using pip.

`pip install lxml`

In the next example we will get cryptocurrency data from [tradingeconomics.com](https://tradingeconomics.com/crypto).

Many web servers refuse to serve requests from automated scripts. To avoid a 403 error, we will use the [Python Requests library](https://requests.readthedocs.io/en/latest/) and send a browser-like `User-Agent` header. You can install it with:

`pip install requests`

Note that `pd.read_html()` returns a list of DataFrames, since a page may contain multiple tables. You can access the specific table you need by indexing the list (e.g., `tables[0]` for the first table).

In [14]:
import requests
import pandas as pd
from io import StringIO

# URL containing the HTML table
url = 'https://tradingeconomics.com/crypto'

# Headers to mimic a browser request
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}

# Fetch the HTML content
response = requests.get(url, headers=headers)

# Check if the request was successful
if response.status_code == 200:
    html_content = response.text

    # Use pandas to read the HTML content
    tables = pd.read_html(StringIO(html_content))

    # Get tables count:
    print(f'tables fetched: {len(tables)}')

    # Load the first table into a DataFrame
    df = tables[0]

    # Display the first few rows of the DataFrame
    print(df.head())
else:
    print(f"Failed to retrieve the webpage. Status code: {response.status_code}")


tables fetched: 3
    Crypto        Price       Day       %  Weekly Monthly      YoY  \
0  Bitcoin  68174.00000  598.0000   0.88%   0.64%  16.98%  151.40%   
1    Ether   3723.46000   44.0400  -1.17%  -0.79%  25.21%   99.57%   
2  Binance    591.20000    4.5000  -0.76%  -0.62%   5.72%   92.89%   
3  Cardano      0.44845    0.0028  -0.62%  -3.50%  -0.13%   19.94%   
4   Solana    166.39210    1.8790  -1.12%  -5.79%  23.75%  701.12%   

     MarketCap    Date  
0  $1,291,275M  May/30  
1    $443,810M  May/30  
2     $96,401M  May/30  
3     $15,113M  May/30  
4     $51,773M  May/30  
