# Reading Data with Python

### Reading CSV and TXT files

Rather than creating **Series** or **DataFrame** structures from scratch, or even from Python core sequences or NumPy **ndarrays**, the most typical use of pandas is loading information from files or other data sources for further exploration, transformation and analysis.

Python gives us several ways to read comma-separated values files (.csv) and raw text files (.txt), and pandas can load them into a DataFrame.

In [2]:
import pandas as pd

When you want to work with a file, the first thing to do is open it. This is done by invoking the built-in function **open()**:
* **open()** has a single required argument, the path to the file, and returns a file object.
* The with statement automatically takes care of closing the file once execution leaves the with block, even in case of an error.
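
The closing behavior of the with statement can be checked directly. The sketch below is only an illustration: it writes a throwaway file (the name `sample.csv` and its single row are made up) and inspects the `closed` attribute of the file object. Passing `encoding` explicitly is an assumption added here so the result does not depend on the platform default.

```python
import os
import tempfile

# Create a small throwaway file so the example does not depend on real data.
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample.csv")  # hypothetical file name
with open(sample_path, "w", encoding="utf-8") as fp:
    fp.write("2021-09-18 02:00:00,47374\n")

# Inside the with block the file is open; afterwards it is closed automatically.
with open(sample_path, "r", encoding="utf-8") as fp:
    first_line = fp.readline()
    open_inside = not fp.closed

closed_after = fp.closed
print(first_line.strip(), open_inside, closed_after)
```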

In [3]:
with open('btc_market_price.csv') as fp:
    print(fp)

<_io.TextIOWrapper name='btc_market_price.csv' mode='r' encoding='cp1252'>


Once the file is open, we can read it line by line:

In [4]:
with open('btc_market_price.csv', 'r') as fp:
    for index, line in enumerate(fp.readlines()):
        if index<10:
            print(index, line)

0 2021-09-18 02:00:00,47374

1 2021-09-18 03:00:00,47642.5

2 2021-09-18 04:00:00,47692.63

3 2021-09-18 05:00:00,47944.8

4 2021-09-18 06:00:00,48349.25

5 2021-09-18 07:00:00,48676.24

6 2021-09-18 08:00:00,48592.38

7 2021-09-18 09:00:00,48645.9

8 2021-09-18 10:00:00,48696.84

9 2021-09-18 11:00:00,48648.44



We can also split each line into its fields:

In [5]:
with open('btc_market_price1.csv' , 'r') as fp:
    for index, line in enumerate(fp.readlines()):
        if index<10:
            timestamp, price = line.split(',')
            print(f'{timestamp}:${price}')

timestamp:$ price

2021-09-18 00:00:$47374

2021-09-19 00:00:$47642.5

2021-09-20 00:00:$47692.63

2021-09-21 00:00:$47944.8

2021-09-22 00:00:$48349.25

2021-09-23 00:00:$48676.24

2021-09-24 00:00:$48592.38

2021-09-25 00:00:$48645.9

2021-09-26 00:00:$48696.84
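
The blank lines in the output above appear because each line read from the file still ends with a newline character. A possible variation, shown here as a sketch on made-up in-memory data (`io.StringIO` stands in for the real file), strips the newline before splitting and uses `itertools.islice` to stop after ten lines without loading the whole file.

```python
import io
from itertools import islice

# In-memory stand-in for a CSV file (made-up header plus two sample rows).
fake_file = io.StringIO(
    "timestamp,price\n"
    "2021-09-18 00:00,47374\n"
    "2021-09-19 00:00,47642.5\n"
)

rows = []
# islice stops after 10 lines without reading everything via readlines().
for line in islice(fake_file, 10):
    # rstrip removes the trailing newline that produces the blank lines.
    timestamp, price = line.rstrip("\n").split(",")
    rows.append(f"{timestamp}: ${price}")

print("\n".join(rows))
```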



If the separator is not a comma but the file is still a .csv file, splitting by hand becomes fragile, which brings us to the next step:

### The CSV module

Python includes the built-in module **csv**, which helps a little more with the process of reading CSV files:

In [6]:
import csv

In [7]:
with open('btc_market_price.csv', 'r') as fp:
    reader = csv.reader(fp)
    for index, (timestamp, price) in enumerate(reader):
        if index<10:
            print(f'{timestamp}:${price}')

2021-09-18 02:00:00:$47374
2021-09-18 03:00:00:$47642.5
2021-09-18 04:00:00:$47692.63
2021-09-18 05:00:00:$47944.8
2021-09-18 06:00:00:$48349.25
2021-09-18 07:00:00:$48676.24
2021-09-18 08:00:00:$48592.38
2021-09-18 09:00:00:$48645.9
2021-09-18 10:00:00:$48696.84
2021-09-18 11:00:00:$48648.44


The csv module takes care of splitting the file using a given separator (called the **delimiter**) and creating an iterator for us:

In [9]:
with open('text.csv', 'r') as fp:
    reader = csv.reader(fp, delimiter='>')  # the fields in this file are separated by '>'
    for index, value in enumerate(reader):
        if not value:
            continue  # skip empty lines
        fname, lname, age, math, french = value
        print(f'{fname} {lname}(age {age}) got {math} in Math and {french} in French.')

Jane  Marry(age 87) got 62 in Math and F in French.
Jay  Smith(age 87000) got 78000 in Math and N in French.
Jimmy  Donal(age 87) got 62 in Math and T in French.
Cornor  Kissenger(age 87) got 62 in Math and Y in French.
Scater  Cock(age 87) got 62 in Math and F in French.
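
One reason to prefer the csv module over a manual `split` is that it understands quoted fields. The following sketch uses made-up in-memory data (not the contents of `text.csv`) in which one quoted field contains the delimiter itself.

```python
import csv
import io

# Made-up sample using '>' as the delimiter; the quoted field contains a '>'.
raw = 'Jane>Marry>87>62>F\n\n"Jay > Jr">Smith>45>78>N\n'

records = []
for row in csv.reader(io.StringIO(raw), delimiter=">"):
    if not row:
        continue  # blank lines come back as empty lists
    records.append(row)

print(records)
```

A plain `line.split('>')` on the last line would yield six pieces instead of five, because it cannot tell that the quoted `>` is part of the first name.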


### Reading data with Pandas

Reading data is probably one of the most recurrent tasks in data analysis: public data sources, logs, historical information tables, exports from databases. The pandas library therefore offers functions to read and write files in multiple formats such as CSV, JSON, XML and Excel XLSX, all of them creating a DataFrame with the information read from the file.
We'll learn how to read different types of data, including:
* CSV(.csv)
* Raw text files(.txt)
* JSON data from a file and from an API
* Data from a SQL query over a database

There are many other reading functions available, as the following table shows (note that read_msgpack and to_msgpack were removed in pandas 1.0):

In [49]:
import pandas as pd
pd.read_csv('Reading Data With Pandas.csv')  # read_csv already returns a DataFrame

Unnamed: 0,Format Type,Data Description,Reader,Writer
0,text,CSV,read_csv,to_csv
1,text,JSON,read_json,to_json
2,text,HTML,read_html,to_html
3,text,Local clipboard,read_clipboard,to_clipboard
4,binary,MS Excel,read_excel,to_excel
5,binary,OpenDocument,read_excel,
6,binary,HDF5 Format,read_hdf,to_hdf
7,binary,Feather Format,read_feather,to_feather
8,binary,Parquet Format,read_parquet,to_parquet
9,binary,Msgpack,read_msgpack,to_msgpack


#### read_csv method
The first method is read_csv, which lets us read comma-separated values (CSV) files and raw text (TXT) files into a DataFrame.
The read_csv function is extremely powerful: it accepts a very broad set of parameters at import time that let us configure exactly how the data will be read and parsed by specifying the correct structure, encoding and other details. The most common parameters are:
* filepath_or_buffer: path (or URL, or file-like object) of the file to be read
* sep: character(s) used as the field separator in the file
* header: index of the row containing the column names (None if the file has no header row)
* index_col: index of the column, or sequence of indexes, to use as the row index of the data
* names: sequence containing the column names (used together with header=None)
* skiprows: number of rows, or sequence of row indexes, to skip when loading
* na_values: sequence of values that, if found in the file, should be treated as NaN
* dtype: dictionary whose keys are column names and whose values are the NumPy types their content should be converted to
* parse_dates: flag that indicates whether pandas should try to parse date-like data as dates. You can also pass a list of the column names to parse (a nested list asks for several columns to be joined and parsed as a single date)
* date_parser: function to use to parse dates (deprecated in pandas 2.0 in favor of date_format)
* nrows: number of rows to read from the beginning of the file
* skipfooter: number of rows to ignore at the end of the file
* encoding: encoding to expect from the file
* squeeze: flag that indicates that, if the data read contains only one column, the result is a Series instead of a DataFrame (deprecated in pandas 1.4 and removed in 2.0)
* thousands: character to use as the thousands separator
* decimal: character to use as the decimal separator
* skip_blank_lines: flag that indicates whether blank lines should be ignored
> Full read_csv documentation can be displayed in a notebook with **pd.read_csv?**
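
To make a few of these parameters concrete, here is a minimal sketch that combines several of them. The data is invented for illustration (a comment line, `;` as separator, European-style numbers) and is read from an in-memory `io.StringIO` buffer rather than a real file.

```python
import io
import pandas as pd

# Made-up file: one comment line, ';' as separator, '.' for thousands,
# ',' as decimal mark, and 'n/a' for a missing value.
raw = (
    "exported by some tool\n"
    "2021-09-18;47.374,00\n"
    "2021-09-19;n/a\n"
    "2021-09-20;47.692,63\n"
)

df = pd.read_csv(
    io.StringIO(raw),
    sep=";",
    skiprows=1,                      # ignore the comment line
    header=None,                     # the data has no header row
    names=["Timestamp", "Price"],    # so we supply the column names
    na_values=["n/a"],               # treat 'n/a' as NaN
    thousands=".",
    decimal=",",
    parse_dates=["Timestamp"],
)
print(df)
print(df.dtypes)
```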

In [14]:
pd.read_csv?

Signature:
pd.read_csv(
    filepath_or_buffer: Union[ForwardRef('PathLike[str]'), str, IO[~T], io.RawIOBase, io.BufferedIOBase, io.TextIOBase, _io.TextIOWrapper, mmap.mmap],
    sep=<object object at 0x0000025CFAC77B10>,
    delimiter=None,
    header='infer',
    names=None,
    index_col

In [15]:
csv_url = 'btc_market_price.csv' # normal file
pd.read_csv(csv_url).head()

Unnamed: 0,2021-09-18 02:00:00,47374
0,2021-09-18 03:00:00,47642.5
1,2021-09-18 04:00:00,47692.63
2,2021-09-18 05:00:00,47944.8
3,2021-09-18 06:00:00,48349.25
4,2021-09-18 07:00:00,48676.24


In [23]:
df = pd.read_csv('http://localhost:8888/lab/workspaces/auto-s/tree/btc_market_price.csv')
df

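ParserError: Error tokenizing data. C error: Expected 1 fields in line 15, saw 2

The call fails because this URL points to the JupyterLab interface page rather than to the raw file, so pandas most likely receives HTML instead of CSV text. read_csv does accept URLs, but they must return the file contents directly.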


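##### Add header

The file has no header row, so in the first read_csv output above pandas used the first data row as the column names. Passing **header=None** tells pandas that there is no header, and it numbers the columns instead.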

In [24]:
df=pd.read_csv('btc_market_price.csv', header=None)
df.head()

Unnamed: 0,0,1
0,2021-09-18 02:00:00,47374.0
1,2021-09-18 03:00:00,47642.5
2,2021-09-18 04:00:00,47692.63
3,2021-09-18 05:00:00,47944.8
4,2021-09-18 06:00:00,48349.25
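
The difference between the two calls can be reproduced on a tiny made-up sample. This sketch reads the same two-row text with and without `header=None` and compares the resulting column names and row counts.

```python
import io
import pandas as pd

# Two data rows and no header row (invented sample).
raw = "2021-09-18 02:00:00,47374\n2021-09-18 03:00:00,47642.5\n"

# Default behavior: the first row is consumed as the header.
inferred = pd.read_csv(io.StringIO(raw))

# header=None: every row is data and the columns are numbered.
no_header = pd.read_csv(io.StringIO(raw), header=None)

print(list(inferred.columns), len(inferred))
print(list(no_header.columns), len(no_header))
```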


##### na_values
We can define a **na_values** parameter with the values we want to be recognized as NA/NaN. In this case the empty string '', '?' and '_' will be recognized as null values.

In [31]:
df = pd.read_csv('btc_market_price.csv', header=None, na_values = ['','?','_'])
df.head(6)

Unnamed: 0,0,1
0,2021-09-18 02:00:00,47374.0
1,2021-09-18 03:00:00,
2,2021-09-18 04:00:00,
3,2021-09-18 05:00:00,
4,2021-09-18 06:00:00,
5,2021-09-18 07:00:00,48676.24
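
The effect of na_values is easiest to see by reading the same text twice. The sketch below uses invented rows containing the placeholder strings `?` and `_`; without na_values they remain text and keep the column from being numeric.

```python
import io
import pandas as pd

# Invented sample: two placeholder strings and one genuinely empty field.
raw = "2021-09-18,47374\n2021-09-19,?\n2021-09-20,_\n2021-09-21,\n"

plain = pd.read_csv(io.StringIO(raw), header=None)
marked = pd.read_csv(io.StringIO(raw), header=None, na_values=["?", "_"])

# Count missing values and show the resulting dtype of the price column.
print(plain[1].isna().sum(), plain[1].dtype)
print(marked[1].isna().sum(), marked[1].dtype)
```

The empty field is already treated as missing by default, so listing '' in na_values mainly documents the intent.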


##### Using **names** parameter

In [34]:
df = pd.read_csv('btc_market_price.csv', header=None, na_values=['','?','_'], names = ['Timestamp', 'Prices'])
df.head()

Unnamed: 0,Timestamp,Prices
0,2021-09-18 02:00:00,47374.0
1,2021-09-18 03:00:00,
2,2021-09-18 04:00:00,
3,2021-09-18 05:00:00,
4,2021-09-18 06:00:00,


##### Using dtype
Pandas tries to infer the type of each column automatically. We can use **dtype** to force pandas to use a certain dtype.

We can also convert a column to **datetime** objects using the **to_datetime** function.

In [38]:
df = pd.read_csv('btc_market_price.csv', header= None, na_values = ['','_','?'], names = ['Timestamp','Prices'], dtype={'Prices':'float'})
df.dtypes

Timestamp     object
Prices       float64
dtype: object

In [39]:
pd.to_datetime(df['Timestamp']).head()

0   2021-09-18 02:00:00
1   2021-09-18 03:00:00
2   2021-09-18 04:00:00
3   2021-09-18 05:00:00
4   2021-09-18 06:00:00
Name: Timestamp, dtype: datetime64[ns]
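
When a column may contain malformed dates, `to_datetime` accepts `errors="coerce"`, which converts unparseable values to NaT instead of raising. The sketch below is an illustration on made-up data with one deliberately invalid timestamp.

```python
import io
import pandas as pd

# Invented sample: the second timestamp is not a valid date.
raw = "2021-09-18 02:00:00,47374\nnot a date,47642.5\n"

df = pd.read_csv(
    io.StringIO(raw),
    header=None,
    names=["Timestamp", "Prices"],
    dtype={"Prices": "float64"},
)

# errors="coerce" turns unparseable values into NaT instead of raising.
df["Timestamp"] = pd.to_datetime(df["Timestamp"], errors="coerce")
print(df.dtypes)
print(df)
```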

##### Using parse_dates to change columns to datetime type

In [47]:
df = pd.read_csv('btc_market_price.csv', header=None, na_values=['_', '?', ''], names=['Timestamp', 'Prices'], parse_dates=['Timestamp'])
# parse_dates also accepts column positions (e.g. parse_dates=[0]), a list of lists, or a dict
df.dtypes

Timestamp    datetime64[ns]
Prices              float64
dtype: object

## Save to CSV file

Finally, we can also save our DataFrame as a CSV file:

* DataFrame_name.to_csv(): with no path, nothing is written to disk; the CSV content is returned as a string
* DataFrame_name.to_csv('csv_name.csv'): writes the CSV to the given file name
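
Both forms can be tried without touching the disk. In this sketch the DataFrame is built from made-up values, and an `io.StringIO` buffer stands in for a real file path; `index=False` is an assumption added here to avoid writing the row index as an extra column.

```python
import io
import pandas as pd

df = pd.DataFrame(
    {
        "Timestamp": ["2021-09-18 02:00:00", "2021-09-18 03:00:00"],
        "Prices": [47374.0, 47642.5],
    }
)

# With no path, to_csv returns the CSV text instead of writing a file.
csv_text = df.to_csv(index=False)
print(csv_text)

# Writing to a buffer (or to a path such as 'csv_name.csv') returns None.
buffer = io.StringIO()
result = df.to_csv(buffer, index=False)

# Reading the written text back recovers the same columns and values.
round_trip = pd.read_csv(io.StringIO(buffer.getvalue()))
print(round_trip)
```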