# Read CSV, Excel, and Parquet files

![Data Science Workflow](DSworkflow.png)

## Acquire Data
### Common Data Sources
- Web Scraping
- Databases
- **CSV**
- **Excel**
- **Parquet**

### CSV files
- Comma-Separated Values ([Wikipedia](https://en.wikipedia.org/wiki/Comma-separated_values))
- CSV is a common data exchange format
- Simple and easy to use


#### How to read a CSV file
```Python
import pandas as pd

data = pd.read_csv('files/aapl.csv', parse_dates=True, index_col=0)
```
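
To see what `parse_dates=True` and `index_col=0` change, the sketch below reads the same text twice. The inline sample is a made-up three-row excerpt in the shape of `aapl.csv`, not the real file, and the variable names are chosen only for this illustration.

```python
import io

import pandas as pd

# Hypothetical excerpt shaped like aapl.csv (first column holds the dates).
raw = (
    "Date,High,Low\n"
    "2020-01-02,75.15,73.80\n"
    "2020-01-03,75.14,74.13\n"
    "2020-02-03,78.37,75.56\n"
)

# Default read: Date stays an ordinary column and the index is 0, 1, 2.
plain = pd.read_csv(io.StringIO(raw))

# index_col=0 makes the first column the index;
# parse_dates=True asks pandas to parse that index as dates.
indexed = pd.read_csv(io.StringIO(raw), parse_dates=True, index_col=0)

# A date index supports selection by partial date string, e.g. all of January 2020.
january = indexed.loc['2020-01']
print(january)
```

With a date index in place, time-based selection like the last line works directly, which is why the read above passes both parameters.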

#### Other useful parameters
- Full documentation: [`read_csv()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) reads a comma-separated values (CSV) file into a **pandas** DataFrame.
- `sep=','`: the delimiter that separates the columns (a comma by default)
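
As a small illustration of `sep`, the sketch below parses semicolon-delimited text. The inline sample and its column names are made up for the example; only `sep`, `parse_dates`, and `index_col` are actual `read_csv()` parameters.

```python
import io

import pandas as pd

# Hypothetical sample: values separated by semicolons instead of commas.
raw = "Date;Open;Close\n2020-01-02;74.06;75.09\n2020-01-03;74.29;74.36\n"

# sep=';' tells pandas which character splits the columns.
sample = pd.read_csv(io.StringIO(raw), sep=';', parse_dates=True, index_col=0)

print(sample.columns.tolist())
print(sample)
```

Without `sep=';'`, pandas would look for commas and read each line as a single column.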

In [1]:
import pandas as pd

In [2]:
data = pd.read_csv('files/aapl.csv', parse_dates=True, index_col=0)

In [3]:
data

| Date | High | Low | Open | Close | Volume | Adj Close |
|---|---|---|---|---|---|---|
| 2020-01-02 | 75.150002 | 73.797501 | 74.059998 | 75.087502 | 135480400.0 | 73.988464 |
| 2020-01-03 | 75.144997 | 74.125000 | 74.287498 | 74.357498 | 146322800.0 | 73.269150 |
| 2020-01-06 | 74.989998 | 73.187500 | 73.447502 | 74.949997 | 118387200.0 | 73.852982 |
| 2020-01-07 | 75.224998 | 74.370003 | 74.959999 | 74.597504 | 108872000.0 | 73.505653 |
| 2020-01-08 | 76.110001 | 74.290001 | 74.290001 | 75.797501 | 132079200.0 | 74.688080 |
| ... | ... | ... | ... | ... | ... | ... |
| 2021-11-08 | 151.570007 | 150.160004 | 151.410004 | 150.440002 | 55020900.0 | 150.440002 |
| 2021-11-09 | 151.429993 | 150.059998 | 150.199997 | 150.809998 | 56787900.0 | 150.809998 |
| 2021-11-10 | 150.130005 | 147.850006 | 150.020004 | 147.919998 | 65187100.0 | 147.919998 |
| 2021-11-11 | 149.429993 | 147.679993 | 148.960007 | 147.869995 | 41000000.0 | 147.869995 |


### Excel files
- Most widely used [spreadsheet](https://en.wikipedia.org/wiki/Spreadsheet)
- Learn more about Excel processing [in this lecture](https://www.learnpythonwithrune.org/csv-groupby-processing-to-excel-with-charts-using-pandas-python/)

#### How to read Excel
- [`read_excel()`](https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html): read an Excel file into a pandas DataFrame.
```Python
data = pd.read_excel('files/aapl.xlsx', index_col='Date')
```

In [6]:
data = pd.read_excel('files/aapl.xlsx', index_col='Date')

In [7]:
data

| Date | High | Low | Open | Close | Volume | Adj Close |
|---|---|---|---|---|---|---|
| 2020-01-02 | 75.150002 | 73.797501 | 74.059998 | 75.087502 | 135480400 | 73.988464 |
| 2020-01-03 | 75.144997 | 74.125000 | 74.287498 | 74.357498 | 146322800 | 73.269150 |
| 2020-01-06 | 74.989998 | 73.187500 | 73.447502 | 74.949997 | 118387200 | 73.852982 |
| 2020-01-07 | 75.224998 | 74.370003 | 74.959999 | 74.597504 | 108872000 | 73.505653 |
| 2020-01-08 | 76.110001 | 74.290001 | 74.290001 | 75.797501 | 132079200 | 74.688080 |
| ... | ... | ... | ... | ... | ... | ... |
| 2021-11-08 | 151.570007 | 150.160004 | 151.410004 | 150.440002 | 55020900 | 150.440002 |
| 2021-11-09 | 151.429993 | 150.059998 | 150.199997 | 150.809998 | 56787900 | 150.809998 |
| 2021-11-10 | 150.130005 | 147.850006 | 150.020004 | 147.919998 | 65187100 | 147.919998 |
| 2021-11-11 | 149.429993 | 147.679993 | 148.960007 | 147.869995 | 41000000 | 147.869995 |


### Parquet files
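- [Parquet](https://en.wikipedia.org/wiki/Apache_Parquet) is a free, open-source, column-oriented file format
- Compressed format
    - Compare the file sizes with `!ls -l files/aapl.*`
    - Parquet files can easily be a factor of 10-20 smaller than the equivalent CSV; for the small `aapl` files used here, the Parquet file is about 54% of the CSV size (28,338 vs. 52,452 bytes)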

#### How to read a Parquet file
- [`read_parquet()`](https://pandas.pydata.org/docs/reference/api/pandas.read_parquet.html): load a Parquet object from the file path, returning a DataFrame.
```Python
data = pd.read_parquet('files/aapl.parquet')
```

In [9]:
!ls -l files/aapl.*

-rw-rw-r-- 1 zulkifel zulkifel 52452 Jul  6 15:32 files/aapl.csv
-rw-rw-r-- 1 zulkifel zulkifel 28338 Jul 13 15:11 files/aapl.parquet
-rw-rw-r-- 1 zulkifel zulkifel 30375 Jul 13 15:10 files/aapl.xlsx
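
The same size comparison can be made from Python instead of the shell. This is a minimal sketch using `pathlib`; the helper name `size_report` is made up for this example, and it returns an empty result if the `files` directory is missing.

```python
from pathlib import Path


def size_report(directory, pattern):
    """Return {file name: size in bytes} for files in directory matching pattern."""
    base = Path(directory)
    if not base.is_dir():
        return {}
    return {p.name: p.stat().st_size for p in sorted(base.glob(pattern))}


sizes = size_report('files', 'aapl.*')
for name, size in sizes.items():
    print(f'{size:>10,d}  {name}')

# Ratio of CSV size to Parquet size, if both files are present.
if 'aapl.csv' in sizes and 'aapl.parquet' in sizes:
    ratio = sizes['aapl.csv'] / sizes['aapl.parquet']
    print(f'CSV is {ratio:.2f}x the Parquet size')
```

With the sizes in the listing above (52452 and 28338 bytes), the ratio comes out at about 1.85.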


In [17]:
data = pd.read_parquet('files/aapl.parquet')

ImportError: Unable to find a usable engine; tried using: 'pyarrow', 'fastparquet'.
pyarrow or fastparquet is required for parquet support
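
The `ImportError` above means that neither Parquet engine is installed in the environment. Installing one of them (for example with `pip install pyarrow`) is the usual remedy. The sketch below checks which engine is importable before calling `read_parquet()`; the helper name `available_parquet_engines` is made up for this example.

```python
import importlib.util


def available_parquet_engines():
    """Return the names of the installed packages pandas can use as a Parquet engine."""
    candidates = ('pyarrow', 'fastparquet')
    return [name for name in candidates if importlib.util.find_spec(name) is not None]


engines = available_parquet_engines()
if engines:
    print('Parquet engine(s) found:', ', '.join(engines))
    # With an engine installed, the read can go ahead, e.g.:
    # data = pd.read_parquet('files/aapl.parquet', engine=engines[0])
else:
    print('No Parquet engine found; install one with: pip install pyarrow')
```

After installing an engine in a running notebook, the kernel may need a restart before `read_parquet()` picks it up.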

## Great Places to Find Data
- [UC Irvine Machine Learning Repository!](https://archive.ics.uci.edu/ml/index.php)
- [KD Nuggets](https://www.kdnuggets.com/datasets/index.html) Datasets for Data Mining, Data Science, and Machine Learning
    - [KD Nuggets](https://www.kdnuggets.com/datasets/government-local-public.html) Government, State, City, Local and Public
    - [KD Nuggets](https://www.kdnuggets.com/datasets/api-hub-marketplace-platform.html) APIs, Hubs, Marketplaces, and Platforms
    - [KD Nuggets](https://www.kdnuggets.com/competitions/index.html) Analytics, Data Science, Data Mining Competitions
- [data.gov](https://www.data.gov) The home of the U.S. Government’s open data
- [data.gov.uk](https://data.gov.uk) Data published by central government
- [World Health Organization](https://www.who.int/data/gho) Explore a world of health data
- [World Bank](https://data.worldbank.org) Source of world data
- [Kaggle](https://www.kaggle.com) is an online community of data scientists and machine learning practitioners.