# Reading Data

* pandas can read data from many different sources, such as CSV, Excel, and Parquet files.
* It has two main data structures: DataFrame and Series.
* A DataFrame is a two-dimensional data structure with labeled rows and columns.
* A Series is a one-dimensional, labeled, array-like structure.
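
As a minimal sketch of the two structures, the snippet below builds a small DataFrame by hand (the three rows are illustrative values, not read from a file) and pulls one column out of it as a Series.

```python
import pandas as pd

# A DataFrame: two-dimensional, with labeled rows (the index) and labeled columns.
df = pd.DataFrame(
    {"store": ["Violet", "Rose", "Daisy"], "price": [569.91, 712.41, 1034.55]}
)

# Selecting a single column returns a Series: one-dimensional, with the same row labels.
prices = df["price"]

print(type(df).__name__, df.shape)          # DataFrame (3, 2)
print(type(prices).__name__, prices.shape)  # Series (3,)
```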

<img src="Assets/read_data.png" class="juno_ui_theme_light" style="width:700px">

## Exercise 1 - read_csv

In [1]:
import pandas as pd

df = pd.read_csv("Data/sales_data_with_stores.csv")

df.head()

| | store | product_group | product_code | stock_qty | cost | price | last_week_sales | last_month_sales |
|---|---|---|---|---|---|---|---|---|
| 0 | Violet | PG2 | 4187 | 498 | 420.76 | 569.91 | 13 | 58 |
| 1 | Rose | PG2 | 4195 | 473 | 545.64 | 712.41 | 16 | 58 |
| 2 | Violet | PG2 | 4204 | 968 | 640.42 | 854.91 | 22 | 88 |
| 3 | Daisy | PG2 | 4219 | 241 | 869.69 | 1034.55 | 14 | 45 |
| 4 | Daisy | PG2 | 4718 | 1401 | 12.54 | 26.59 | 50 | 285 |


## Exercise 2 - dtype parameter

* The dtype parameter of the read_csv function sets the data type of any column while the file is read, instead of letting pandas infer it.

In [2]:
df = pd.read_csv("Data/sales_data_with_stores.csv")

df.dtypes

store                object
product_group        object
product_code          int64
stock_qty             int64
cost                float64
price               float64
last_week_sales       int64
last_month_sales      int64
dtype: object

In [3]:
df = pd.read_csv(
    
    "Data/sales_data_with_stores.csv",
    dtype={"product_code": "string", "last_week_sales": "float"}
    
)

df.dtypes

store                       object
product_group               object
product_code        string[python]
stock_qty                    int64
cost                       float64
price                      float64
last_week_sales            float64
last_month_sales             int64
dtype: object
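
One case where the dtype parameter matters is an identifier with leading zeros. The sketch below uses a hypothetical two-row CSV held in memory (the zero-padded codes are made up for illustration) to compare inferred and explicit types.

```python
import io

import pandas as pd

# Hypothetical miniature file; the zero-padded product codes are illustrative only.
csv_text = "store,product_code,last_week_sales\nViolet,0041,13\nRose,0042,16\n"

# Without dtype, pandas infers integers and the leading zeros are lost.
inferred = pd.read_csv(io.StringIO(csv_text))

# With dtype, the code stays text and the sales column is read as float.
typed = pd.read_csv(
    io.StringIO(csv_text),
    dtype={"product_code": "string", "last_week_sales": "float"},
)

print(inferred["product_code"].tolist())  # [41, 42]
print(typed["product_code"].tolist())     # ['0041', '0042']
print(typed["last_week_sales"].dtype)     # float64
```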

## Exercise 3 - usecols parameter

In [4]:
column_list = ["store", "product_code", "cost", "price"]

df = pd.read_csv(
    
    "Data/sales_data_with_stores.csv",
    usecols=column_list
)

df.head()

| | store | product_code | cost | price |
|---|---|---|---|---|
| 0 | Violet | 4187 | 420.76 | 569.91 |
| 1 | Rose | 4195 | 545.64 | 712.41 |
| 2 | Violet | 4204 | 640.42 | 854.91 |
| 3 | Daisy | 4219 | 869.69 | 1034.55 |
| 4 | Daisy | 4718 | 12.54 | 26.59 |


## Exercise 4 - usecols parameter

* The usecols parameter also accepts column indices.
* This is helpful when working with files that have a lot of columns.

In [5]:
df = pd.read_csv(
    
    "Data/file_with_many_columns.csv",
    usecols=range(10)
)

df.shape

(5, 10)

In [6]:
df.head()

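| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8 | 0 | 2 | 4 | 4 | 7 | 0 | 9 | 7 | 1 |
| 1 | 9 | 7 | 0 | 0 | 8 | 0 | 0 | 4 | 6 | 0 |
| 2 | 4 | 7 | 2 | 9 | 4 | 5 | 6 | 5 | 5 | 8 |
| 3 | 4 | 0 | 5 | 5 | 1 | 2 | 4 | 2 | 3 | 2 |
| 4 | 0 | 9 | 4 | 7 | 0 | 0 | 5 | 9 | 5 | 7 |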
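
The forms of usecols can be compared side by side. This is a sketch on an in-memory CSV (a hypothetical subset of the sales columns); besides names and positions, usecols also accepts a callable that is applied to each column name.

```python
import io

import pandas as pd

csv_text = (
    "store,product_group,product_code,cost,price\n"
    "Violet,PG2,4187,420.76,569.91\n"
    "Rose,PG2,4195,545.64,712.41\n"
)

# Select columns by name.
by_name = pd.read_csv(io.StringIO(csv_text), usecols=["store", "price"])

# Select the first three columns by position.
by_position = pd.read_csv(io.StringIO(csv_text), usecols=range(3))

# Select columns whose name satisfies a condition.
by_callable = pd.read_csv(
    io.StringIO(csv_text), usecols=lambda name: name.startswith("product")
)

print(list(by_name.columns))      # ['store', 'price']
print(list(by_position.columns))  # ['store', 'product_group', 'product_code']
print(list(by_callable.columns))  # ['product_group', 'product_code']
```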


## Exercise 5 - nrows parameter

* Read only the first n rows of data in the file (the header line is not counted).

In [7]:
df = pd.read_csv(
    
    "Data/sales_data_with_stores.csv"

)

df.shape

(1000, 8)

In [8]:
df = pd.read_csv(
    
    "Data/sales_data_with_stores.csv",
    nrows=250
)

df.shape

(250, 8)
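
The same behaviour can be reproduced without the data file. The sketch below generates a hypothetical 1000-row CSV in memory and reads only the first 250 data rows, which is a cheap way to preview a large file.

```python
import io

import pandas as pd

# Hypothetical 1000-row file: a header line followed by ids 0..999.
csv_text = "id,value\n" + "".join(f"{i},{i * 2}\n" for i in range(1000))

# nrows counts data rows; the header line is still used for the column names.
preview = pd.read_csv(io.StringIO(csv_text), nrows=250)

print(preview.shape)           # (250, 2)
print(preview["id"].iloc[-1])  # 249
```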

## Exercise 6 - skiprows parameter

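* Skip the first n lines of the file. The count includes the header line, so with skiprows=300 the first remaining line is used as the column names; pass skiprows=range(1, 301) to keep the original header.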

In [9]:
df = pd.read_csv(
    
    "Data/sales_data_with_stores.csv",
    skiprows=300
)

df.shape

(700, 8)
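
An integer skiprows counts lines from the top of the file, header included. The sketch below, on a hypothetical in-memory file, contrasts it with a range that starts at line 1 and therefore keeps the header; both return 700 rows, but only the second keeps the column names.

```python
import io

import pandas as pd

# Hypothetical 1000-row file: line 0 is the header, line k holds id k - 1.
csv_text = "id,value\n" + "".join(f"{i},{i * 2}\n" for i in range(1000))

# Integer form: lines 0..299 are dropped, so line 300 ("299,598") becomes the header.
no_header = pd.read_csv(io.StringIO(csv_text), skiprows=300)

# Range form: lines 1..300 are dropped, and the original header on line 0 is kept.
kept_header = pd.read_csv(io.StringIO(csv_text), skiprows=range(1, 301))

print(list(no_header.columns), no_header.shape)      # ['299', '598'] (700, 2)
print(list(kept_header.columns), kept_header.shape)  # ['id', 'value'] (700, 2)
print(kept_header["id"].iloc[0])                     # 300
```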

## Exercise 7 - skiprows parameter

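* The skiprows parameter also accepts a callable, such as a lambda expression. It is evaluated against each line index, and the line is skipped when it returns True.
* For instance, the following code skips every odd-numbered line, so it keeps the header (line 0) and every second data row of the CSV file.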

In [10]:
df = pd.read_csv(
    
    "Data/sales_data_with_stores.csv",
    skiprows=lambda x: x % 2 == 1
)

df.shape

(500, 8)
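
To see which rows survive the callable, here is a sketch on a hypothetical in-memory file whose ids make the kept lines easy to identify.

```python
import io

import pandas as pd

# Hypothetical file: line 0 is the header, line k holds id k - 1.
csv_text = "id,value\n" + "".join(f"{i},{i * 2}\n" for i in range(1000))

# Odd line indices (1, 3, 5, ...) are skipped; line 0 (the header) and
# the even-numbered data lines (2, 4, 6, ...) are kept.
every_second = pd.read_csv(io.StringIO(csv_text), skiprows=lambda x: x % 2 == 1)

print(every_second.shape)                 # (500, 2)
print(every_second["id"].head(3).tolist())  # [1, 3, 5]
```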

## Exercise 8 - index_col parameter

* Use a specific column as the index in the DataFrame

In [11]:
df = pd.read_csv(
    
    "Data/prices.csv"
)

df.head()

| | date | product_id | price |
|---|---|---|---|
| 0 | 2022-10-01 | 100121 | 10.76 |
| 1 | 2022-10-02 | 100121 | 10.73 |
| 2 | 2022-10-03 | 100121 | 15.96 |
| 3 | 2022-10-04 | 100121 | 11.5 |
| 4 | 2022-10-05 | 100121 | 16.49 |


In [12]:
df = pd.read_csv(
    
    "Data/prices.csv",
    index_col="date"
)

df.head()

| date | product_id | price |
|---|---|---|
| 2022-10-01 | 100121 | 10.76 |
| 2022-10-02 | 100121 | 10.73 |
| 2022-10-03 | 100121 | 15.96 |
| 2022-10-04 | 100121 | 11.5 |
| 2022-10-05 | 100121 | 16.49 |
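
Once a column is the index, rows can be looked up by its values. A minimal sketch, using an in-memory copy of the first three rows shown above:

```python
import io

import pandas as pd

csv_text = (
    "date,product_id,price\n"
    "2022-10-01,100121,10.76\n"
    "2022-10-02,100121,10.73\n"
    "2022-10-03,100121,15.96\n"
)

# The date column becomes the row labels instead of a regular column.
df = pd.read_csv(io.StringIO(csv_text), index_col="date")

print(df.index.name)     # date
print(list(df.columns))  # ['product_id', 'price']

# Label-based lookup; the labels are plain strings here because the dates were not parsed.
print(df.loc["2022-10-02", "price"])
```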


## Exercise 9 - parse_dates

* Date columns are read as object (text) by default. The data type can be converted after reading the data.
* Another option is to handle this task while reading the data, with the parse_dates parameter.

In [13]:
df = pd.read_csv(
    
    "Data/prices.csv"
)

df.dtypes

date           object
product_id      int64
price         float64
dtype: object

In [14]:
df = pd.read_csv(
    
    "Data/prices.csv",
    parse_dates=["date"]
)

df.dtypes

date          datetime64[ns]
product_id             int64
price                float64
dtype: object
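
Both options can be sketched on a small in-memory file: converting after the read with pd.to_datetime, and converting during the read with parse_dates. The two should give the same values, and the .dt accessor then becomes available.

```python
import io

import pandas as pd

csv_text = (
    "date,product_id,price\n"
    "2022-10-01,100121,10.76\n"
    "2022-10-02,100121,10.73\n"
)

# Option 1: read the dates as text, then convert the column.
raw = pd.read_csv(io.StringIO(csv_text))
raw["date"] = pd.to_datetime(raw["date"])

# Option 2: convert while reading.
parsed = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])

# Datetime columns expose date-specific helpers through the .dt accessor.
print(parsed["date"].dt.day_name().tolist())  # ['Saturday', 'Sunday']
```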

## Exercise 10 - na_values parameter

In [15]:
df = pd.read_csv("Data/sample_dataset.csv")

df

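| | col1 | col2 | col3 | col4 | col5 |
|---|---|---|---|---|---|
| 0 | 23.0 | 16.0 | 45 | 17 | 2 |
| 1 | 46.0 | 16.0 | 24 | 2 | 31 |
| 2 | 2.0 | 29.0 | 2 | 46 | 24 |
| 3 | NaN | NaN | 25 | 23 | 7 |
| 4 | NaN | NaN | 30 | 34 | 29 |
| 5 | NaN | 30.0 | ? | 5 | 6 |
| 6 | 35.0 | 37.0 | ? | 26 | 39 |
| 7 | 9.0 | 5.0 | 35 | 11 | 41 |
| 8 | 13.0 | 39.0 | 25 | 5 | 39 |
| 9 | 40.0 | 15.0 | 32 | 47 | 24 |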


In [16]:
df = pd.read_csv("Data/sample_dataset.csv", na_values=["?"])

df

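| | col1 | col2 | col3 | col4 | col5 |
|---|---|---|---|---|---|
| 0 | 23.0 | 16.0 | 45.0 | 17 | 2 |
| 1 | 46.0 | 16.0 | 24.0 | 2 | 31 |
| 2 | 2.0 | 29.0 | 2.0 | 46 | 24 |
| 3 | NaN | NaN | 25.0 | 23 | 7 |
| 4 | NaN | NaN | 30.0 | 34 | 29 |
| 5 | NaN | 30.0 | NaN | 5 | 6 |
| 6 | 35.0 | 37.0 | NaN | 26 | 39 |
| 7 | 9.0 | 5.0 | 35.0 | 11 | 41 |
| 8 | 13.0 | 39.0 | 25.0 | 5 | 39 |
| 9 | 40.0 | 15.0 | 32.0 | 47 | 24 |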
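
The na_values parameter lists extra strings that should be treated as missing, on top of the defaults such as empty fields. Without it, a placeholder like "?" forces the whole column to be read as text. The sketch below shows this on a hypothetical two-column extract; na_values also accepts a dict to apply markers per column.

```python
import io

import pandas as pd

# Hypothetical extract: col1 has an empty field, col3 uses "?" as a placeholder.
csv_text = "col1,col3\n23,45\n,?\n35,?\n9,35\n"

# Default read: "?" is ordinary text, so col3 is not numeric.
default = pd.read_csv(io.StringIO(csv_text))

# Treat "?" as missing everywhere; col3 becomes a float column with NaN.
cleaned = pd.read_csv(io.StringIO(csv_text), na_values=["?"])

# Treat "?" as missing only in col3.
targeted = pd.read_csv(io.StringIO(csv_text), na_values={"col3": ["?"]})

print(pd.api.types.is_numeric_dtype(default["col3"]))  # False
print(cleaned["col3"].dtype)                           # float64
print(int(cleaned["col3"].isna().sum()))               # 2
```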


## Exercise 11 - to_csv function

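* index parameter: whether to write the row labels (index) as the first column. The default is True; with index=False the index is left out.
* When the index is written, reading the file back produces an extra "Unnamed: 0" column.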

In [17]:
df.to_csv("Data/sample_dataset_2.csv", index=False)

df = pd.read_csv("Data/sample_dataset_2.csv")

df.head()

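| | col1 | col2 | col3 | col4 | col5 |
|---|---|---|---|---|---|
| 0 | 23.0 | 16.0 | 45.0 | 17 | 2 |
| 1 | 46.0 | 16.0 | 24.0 | 2 | 31 |
| 2 | 2.0 | 29.0 | 2.0 | 46 | 24 |
| 3 | NaN | NaN | 25.0 | 23 | 7 |
| 4 | NaN | NaN | 30.0 | 34 | 29 |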


In [18]:
df.to_csv("Data/sample_dataset_3.csv")

df = pd.read_csv("Data/sample_dataset_3.csv")

df.head()

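| | Unnamed: 0 | col1 | col2 | col3 | col4 | col5 |
|---|---|---|---|---|---|---|
| 0 | 0 | 23.0 | 16.0 | 45.0 | 17 | 2 |
| 1 | 1 | 46.0 | 16.0 | 24.0 | 2 | 31 |
| 2 | 2 | 2.0 | 29.0 | 2.0 | 46 | 24 |
| 3 | 3 | NaN | NaN | 25.0 | 23 | 7 |
| 4 | 4 | NaN | NaN | 30.0 | 34 | 29 |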
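
The extra column comes from the index being written with an empty header. The sketch below shows this by writing to a string instead of a file (to_csv returns the CSV text when no path is given), and also shows that index_col=0 restores the index when the file is read back.

```python
import io

import pandas as pd

df = pd.DataFrame({"col1": [23.0, 46.0], "col2": [16.0, 16.0]})

without_index = df.to_csv(index=False)
with_index = df.to_csv()  # index=True is the default

print(without_index.splitlines()[0])  # col1,col2
print(with_index.splitlines()[0])     # ,col1,col2

# Reading the second version back turns the unnamed first column into "Unnamed: 0".
round_trip = pd.read_csv(io.StringIO(with_index))
print(list(round_trip.columns))       # ['Unnamed: 0', 'col1', 'col2']

# Telling read_csv that the first column is the index avoids the extra column.
restored = pd.read_csv(io.StringIO(with_index), index_col=0)
print(list(restored.columns))         # ['col1', 'col2']
```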


## Exercise 12 - read_clipboard

In [19]:
df = pd.read_clipboard()

df

| | `{col:col.replace("_",` | `"")` | for | col | in | `churn.columns}` |
|---|---|---|---|---|---|---|
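
read_clipboard takes the text currently on the system clipboard and passes it to read_csv, splitting on whitespace by default. The result therefore depends entirely on what was last copied: a single copied line yields an empty DataFrame whose column names are the whitespace-separated pieces of that line. Since clipboard access is not available in every environment, the sketch below imitates the parsing step with read_csv and sep=r"\s+" on hypothetical text; with a real clipboard, pd.read_clipboard() would take the place of these calls.

```python
import io

import pandas as pd

# Hypothetical clipboard content: a small whitespace-aligned table.
table_text = "store  price\nViolet 569.91\nRose   712.41\n"
table = pd.read_csv(io.StringIO(table_text), sep=r"\s+")
print(list(table.columns), table.shape)  # ['store', 'price'] (2, 2)

# Hypothetical clipboard content: one line of text and nothing else.
# The line is used as the header and no data rows follow it.
one_line = pd.read_csv(io.StringIO("for col in\n"), sep=r"\s+")
print(list(one_line.columns), one_line.shape)  # ['for', 'col', 'in'] (0, 3)
```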


## Conclusion

* The functions for reading other file types are used in a similar way:
  * read_excel
  * read_parquet
  * many more