# <strong>Pandas:</strong> Parameters in read_csv | Guide

**Name:** Arsalan Ali<br>
**Email:** arslanchaos@gmail.com

### **Table of Contents**
* 01- Importing Pandas
* 02- Opening a local CSV file
* 03- Opening a CSV file from URL
* 04- Sep Parameter
* 05- Engine Parameter
* 06- Low_Memory Parameter
* 07- Names Parameter
* 08- Index_col Parameter
* 09- Header Parameter
* 10- Usecols Parameter
* 11- Skiprows Parameter
* 12- Nrows Parameter
* 13- Encoding Parameter
* 14- On_bad_lines Parameter
* 15- Dtype Parameter
* 16- Parse_dates Parameter
* 17- Converters Parameter
* 18- Na_values Parameter
* 19- Chunksize Parameter


----

### 01- Importing Pandas
First of all, import the Pandas library.

In [2]:
import pandas as pd

### 02- Opening a local CSV file

In [2]:
df = pd.read_csv("tips.csv")
df.head()

Unnamed: 0.1,Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,0,16.99,1.01,Female,No,Sun,Dinner,2
1,1,10.34,1.66,Male,No,Sun,Dinner,3
2,2,21.01,3.5,Male,No,Sun,Dinner,3
3,3,23.68,3.31,Male,No,Sun,Dinner,2
4,4,24.59,3.61,Female,No,Sun,Dinner,4


### 03- Opening a CSV file from URL

In [3]:
url = "https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv"
df = pd.read_csv(url)
df.head()

Unnamed: 0,Country,Region
0,Algeria,AFRICA
1,Angola,AFRICA
2,Benin,AFRICA
3,Botswana,AFRICA
4,Burkina,AFRICA


### 04- Sep Parameter
The sep parameter tells Pandas which delimiter separates the fields, so that the dataset is split into the right columns.
* **Comma separated values (.csv)**: comma (,), the default
* **Tab separated values (.tsv)**: tab (\t)
* **Text files (.txt)**: spaces ( ), pipes (|), colons (:), semicolons (;) etc.
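
As a small illustration of the list above, the sketch below reads the same table from pipe-delimited and semicolon-delimited text. The data is hypothetical and held in memory with `io.StringIO`, so nothing needs to be downloaded.

```python
import io

import pandas as pd

# Hypothetical in-memory data standing in for two delimited text files.
pipe_text = "name|score\nAda|90\nLinus|85\n"
semicolon_text = "name;score\nAda;90\nLinus;85\n"

# Pass the delimiter that each file actually uses.
pipe_df = pd.read_csv(io.StringIO(pipe_text), sep="|")
semicolon_df = pd.read_csv(io.StringIO(semicolon_text), sep=";")

print(pipe_df)
# Both reads should produce the same table.
print(pipe_df.equals(semicolon_df))
```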


In [5]:
url = "https://raw.githubusercontent.com/justmarkham/DAT8/master/data/chipotle.tsv"
df = pd.read_csv(url, sep="\t")
df.head()

Unnamed: 0,order_id,quantity,item_name,choice_description,item_price
0,1,1,Chips and Fresh Tomato Salsa,,$2.39
1,1,1,Izze,[Clementine],$3.39
2,1,1,Nantucket Nectar,[Apple],$3.39
3,1,1,Chips and Tomatillo-Green Chili Salsa,,$2.39
4,2,2,Chicken Bowl,"[Tomatillo-Red Chili Salsa (Hot), [Black Beans...",$16.98


### 05- Engine Parameter
The engine parameter selects which parser Pandas uses.
* 'c' is faster, while 'python' is currently more feature-complete.
* 'pyarrow' was added as an experimental engine. It supports multi-threading and can be faster, but some read_csv options are not supported with it.
* 'python' supports skipfooter, while 'c' does not.
* 'python' supports a sep longer than a single character (including regular expressions), while 'c' does not.
* 'python' supports sep=None with delim_whitespace=False, which lets it auto-detect the delimiter, while 'c' does not.
* 'c' supports float_precision, while 'python' does not (or does not need it).
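
Two of the python-only features listed above, sketched on hypothetical in-memory data: `sep=None` to let the parser detect the delimiter, and `skipfooter` to drop a trailing summary line.

```python
import io

import pandas as pd

# Hypothetical file: semicolon-delimited, with a summary line at the end.
text = "id;city\n1;Lahore\n2;Karachi\nTOTAL ROWS 2\n"

# sep=None asks the python engine to detect the delimiter;
# skipfooter=1 ignores the last line of the file.
df_engine = pd.read_csv(
    io.StringIO(text),
    sep=None,
    skipfooter=1,
    engine="python",
)

print(df_engine)
```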


In [19]:
url = "https://raw.githubusercontent.com/justmarkham/DAT8/master/data/chipotle.tsv"
df = pd.read_csv(url, sep="\t", engine="python")

df.head()

Unnamed: 0,order_id,quantity,item_name,choice_description,item_price
0,1,1,Chips and Fresh Tomato Salsa,,$2.39
1,1,1,Izze,[Clementine],$3.39
2,1,1,Nantucket Nectar,[Apple],$3.39
3,1,1,Chips and Tomatillo-Green Chili Salsa,,$2.39
4,2,2,Chicken Bowl,"[Tomatillo-Red Chili Salsa (Hot), [Black Beans...",$16.98


### 06- Low_Memory Parameter
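* **low_memory = True (default)** Pandas processes the file in chunks internally and infers the dtype of each column chunk by chunk. This uses less memory while parsing, but on large files a column can end up with mixed types.
* **low_memory = False** Pandas reads each column in full before inferring its dtype. This avoids mixed types but uses more memory. Specifying the types with the dtype parameter is another way to avoid mixed types.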

In [8]:
url = "https://raw.githubusercontent.com/justmarkham/DAT8/master/data/chipotle.tsv"
df = pd.read_csv(url, sep="\t", low_memory=False)

df.head()

Unnamed: 0,order_id,quantity,item_name,choice_description,item_price
0,1,1,Chips and Fresh Tomato Salsa,,$2.39
1,1,1,Izze,[Clementine],$3.39
2,1,1,Nantucket Nectar,[Apple],$3.39
3,1,1,Chips and Tomatillo-Green Chili Salsa,,$2.39
4,2,2,Chicken Bowl,"[Tomatillo-Red Chili Salsa (Hot), [Black Beans...",$16.98


### 07- Names Parameter
Supplies column names for the table, which is useful when the file has no header row

In [13]:
url = "https://raw.githubusercontent.com/manuelsh/chat-bot/master/data/raw/movie-dialog-corpus/movie_titles_metadata.tsv"

df = pd.read_csv(url, sep='\t',names=['sno','name','release_year','rating','votes','genres'])

df.head()

Unnamed: 0,sno,name,release_year,rating,votes,genres
0,m0,10 things i hate about you,1999,6.9,62847.0,['comedy' 'romance']
1,m1,1492: conquest of paradise,1992,6.2,10421.0,['adventure' 'biography' 'drama' 'history']
2,m2,15 minutes,2001,6.1,25854.0,['action' 'crime' 'drama' 'thriller']
3,m3,2001: a space odyssey,1968,8.4,163227.0,['adventure' 'mystery' 'sci-fi']
4,m4,48 hrs.,1982,6.9,22289.0,['action' 'comedy' 'crime' 'drama' 'thriller']
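
When the file already has a header row, `names` on its own keeps that row as data. A minimal sketch on hypothetical in-memory data, showing that adding `header=0` replaces the existing header instead:

```python
import io

import pandas as pd

# Hypothetical file that already has a header row ("a,b").
text = "a,b\n1,2\n3,4\n"

# names alone: the old header line is read as the first data row.
kept = pd.read_csv(io.StringIO(text), names=["left", "right"])

# names with header=0: the old header line is replaced by the new names.
renamed = pd.read_csv(io.StringIO(text), header=0, names=["left", "right"])

print(kept.shape)     # 3 rows, including the old header
print(renamed.shape)  # 2 rows
```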


### 08- Index_col Parameter
Assigns a column as the index of the table

In [15]:
url = "https://raw.githubusercontent.com/ahmedbesbes/mlflow/main/data/aug_train.csv"

df = pd.read_csv(url ,index_col='enrollee_id')

df.head()

Unnamed: 0_level_0,city,city_development_index,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours,target
enrollee_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
8949,city_103,0.92,Male,Has relevent experience,no_enrollment,Graduate,STEM,>20,,,1,36,1.0
29725,city_40,0.776,Male,No relevent experience,no_enrollment,Graduate,STEM,15,50-99,Pvt Ltd,>4,47,0.0
11561,city_21,0.624,,No relevent experience,Full time course,Graduate,STEM,5,,,never,83,0.0
33241,city_115,0.789,,No relevent experience,,Graduate,Business Degree,<1,,Pvt Ltd,never,52,1.0
666,city_162,0.767,Male,Has relevent experience,no_enrollment,Masters,STEM,>20,50-99,Funded Startup,4,8,0.0


### 09- Header Parameter
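The header parameter gives the row number (starting from 0) that holds the column names. If the real column names show up as a data row, point header at that row; the rows above it are discarded.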

In [16]:
url = "https://raw.githubusercontent.com/campusx-official/100-days-of-machine-learning/main/day15%20-%20working%20with%20csv%20files/test.csv"

df = pd.read_csv(url, header=1)
df.head()

Unnamed: 0,0,enrollee_id,city,city_development_index,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours,target
0,1,29725,city_40,0.776,Male,No relevent experience,no_enrollment,Graduate,STEM,15,50-99,Pvt Ltd,>4,47,0
1,2,11561,city_21,0.624,,No relevent experience,Full time course,Graduate,STEM,5,,,never,83,0
2,3,33241,city_115,0.789,,No relevent experience,,Graduate,Business Degree,<1,,Pvt Ltd,never,52,1
3,4,666,city_162,0.767,Male,Has relevent experience,no_enrollment,Masters,STEM,>20,50-99,Funded Startup,4,8,0
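
The opposite case is a file with no header row at all. A minimal sketch on hypothetical in-memory data: `header=None` stops Pandas from using the first line as column names and numbers the columns instead.

```python
import io

import pandas as pd

# Hypothetical file with data only, no header line.
text = "29725,city_40\n11561,city_21\n"

# header=None keeps every line as data; columns are labelled 0, 1, ...
no_header = pd.read_csv(io.StringIO(text), header=None)

print(no_header)
```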


### 10- Usecols Parameter
Use usecols to load only a subset of the columns in the table

In [17]:
url = "https://raw.githubusercontent.com/campusx-official/100-days-of-machine-learning/main/day15%20-%20working%20with%20csv%20files/test.csv"

df = pd.read_csv(url, header=1, usecols=["enrollee_id","city"])
df.head()

Unnamed: 0,enrollee_id,city
0,29725,city_40
1,11561,city_21
2,33241,city_115
3,666,city_162
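
Besides a list of names, `usecols` also accepts column positions or a function that is called with each column name. A short sketch on hypothetical in-memory data:

```python
import io

import pandas as pd

# Hypothetical file with four columns.
text = (
    "enrollee_id,city,gender,target\n"
    "29725,city_40,Male,0\n"
    "11561,city_21,,0\n"
)

# Select by position: first and fourth columns.
by_position = pd.read_csv(io.StringIO(text), usecols=[0, 3])

# Select by rule: keep every column except "gender".
by_rule = pd.read_csv(io.StringIO(text), usecols=lambda name: name != "gender")

print(list(by_position.columns))
print(list(by_rule.columns))
```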


### 11- Skiprows Parameter
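Use skiprows to skip certain rows. A list is read as 0-based line numbers of the file, where line 0 is the header, so skiprows=[1,6] below drops the data rows with index 0 and 5.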

In [20]:
df = pd.read_csv("tips.csv", skiprows=[1,6])
df.head()

Unnamed: 0.1,Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,1,10.34,1.66,Male,No,Sun,Dinner,3
1,2,21.01,3.5,Male,No,Sun,Dinner,3
2,3,23.68,3.31,Male,No,Sun,Dinner,2
3,4,24.59,3.61,Female,No,Sun,Dinner,4
4,6,8.77,2.0,Male,No,Sun,Dinner,2
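
`skiprows` also accepts a function that receives each file line number and returns True for lines to skip. A minimal sketch on hypothetical in-memory data, comparing a list with a function:

```python
import io

import pandas as pd

# Hypothetical file: header on line 0, then the values 0..5 on lines 1..6.
text = "n\n0\n1\n2\n3\n4\n5\n"

# List form: skip file lines 1 and 2 (the first two data rows).
skip_list = pd.read_csv(io.StringIO(text), skiprows=[1, 2])

# Function form: keep the header, skip every even-numbered line after it.
every_other = pd.read_csv(
    io.StringIO(text),
    skiprows=lambda i: i > 0 and i % 2 == 0,
)

print(skip_list["n"].tolist())
print(every_other["n"].tolist())
```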


### 12- Nrows Parameter
Use nrows to read only a specific number of rows from the start of the file

In [21]:
df = pd.read_csv("tips.csv", nrows=3)
df.head()

Unnamed: 0.1,Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,0,16.99,1.01,Female,No,Sun,Dinner,2
1,1,10.34,1.66,Male,No,Sun,Dinner,3
2,2,21.01,3.5,Male,No,Sun,Dinner,3


### 13- Encoding Parameter
When a CSV file uses a character encoding other than the default (UTF-8), pass the encoding parameter so the text is read properly

In [23]:
url = "https://raw.githubusercontent.com/campusx-official/100-days-of-machine-learning/main/day15%20-%20working%20with%20csv%20files/zomato.csv"

df = pd.read_csv(url ,encoding='latin-1')
df.head()

Unnamed: 0,Restaurant ID,Restaurant Name,Country Code,City,Address,Locality,Locality Verbose,Longitude,Latitude,Cuisines,...,Currency,Has Table booking,Has Online delivery,Is delivering now,Switch to order menu,Price range,Aggregate rating,Rating color,Rating text,Votes
0,6317637,Le Petit Souffle,162,Makati City,"Third Floor, Century City Mall, Kalayaan Avenu...","Century City Mall, Poblacion, Makati City","Century City Mall, Poblacion, Makati City, Mak...",121.027535,14.565443,"French, Japanese, Desserts",...,Botswana Pula(P),Yes,No,No,No,3,4.8,Dark Green,Excellent,314
1,6304287,Izakaya Kikufuji,162,Makati City,"Little Tokyo, 2277 Chino Roces Avenue, Legaspi...","Little Tokyo, Legaspi Village, Makati City","Little Tokyo, Legaspi Village, Makati City, Ma...",121.014101,14.553708,Japanese,...,Botswana Pula(P),Yes,No,No,No,3,4.5,Dark Green,Excellent,591
2,6300002,Heat - Edsa Shangri-La,162,Mandaluyong City,"Edsa Shangri-La, 1 Garden Way, Ortigas, Mandal...","Edsa Shangri-La, Ortigas, Mandaluyong City","Edsa Shangri-La, Ortigas, Mandaluyong City, Ma...",121.056831,14.581404,"Seafood, Asian, Filipino, Indian",...,Botswana Pula(P),Yes,No,No,No,4,4.4,Green,Very Good,270
3,6318506,Ooma,162,Mandaluyong City,"Third Floor, Mega Fashion Hall, SM Megamall, O...","SM Megamall, Ortigas, Mandaluyong City","SM Megamall, Ortigas, Mandaluyong City, Mandal...",121.056475,14.585318,"Japanese, Sushi",...,Botswana Pula(P),No,No,No,No,4,4.9,Dark Green,Excellent,365
4,6314302,Sambo Kojin,162,Mandaluyong City,"Third Floor, Mega Atrium, SM Megamall, Ortigas...","SM Megamall, Ortigas, Mandaluyong City","SM Megamall, Ortigas, Mandaluyong City, Mandal...",121.057508,14.58445,"Japanese, Korean",...,Botswana Pula(P),Yes,No,No,No,4,4.8,Dark Green,Excellent,229


### 14- On_bad_lines Parameter
The on_bad_lines parameter controls what happens when a bad line (a line with too many fields) is encountered.
* '**error**' (default), raise an exception when a bad line is encountered.
* '**warn**', issue a warning when a bad line is encountered and skip that line.
* '**skip**', skip bad lines silently, without raising or warning.
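
A minimal sketch of the 'error' and 'skip' options on hypothetical in-memory data, where one line has an extra field. It assumes a pandas version that has `on_bad_lines` (1.3 or later).

```python
import io

import pandas as pd

# Hypothetical file: the line "3,4,5" has three fields instead of two.
text = "a,b\n1,2\n3,4,5\n6,7\n"

# Default behaviour ('error'): the bad line raises a ParserError.
raised = False
try:
    pd.read_csv(io.StringIO(text))
except pd.errors.ParserError as exc:
    raised = True
    print("ParserError:", exc)

# 'skip': the bad line is dropped and the remaining rows are loaded.
skipped = pd.read_csv(io.StringIO(text), on_bad_lines="skip")
print(skipped)
```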

In [33]:
url = "https://raw.githubusercontent.com/softhints/Pandas-Tutorials/master/data/csv/multiple_bad_line_multi_sep.csv"

df = pd.read_csv(url, sep=";;", on_bad_lines="skip", engine='python')
df

Unnamed: 0,Date,Company A,Company A.1,Company B,Company B.1
0,2021-09-06,1,7.9,2,6.0
1,2021-09-07,1,8.5,2,7.0
2,2021-09-08,2,8.0,1,8.1


### 15- Dtype Parameter
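We can specify the datatype of a column up front. Pandas then skips type inference for that column, which can make loading faster and lets us control how the values are stored.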

In [34]:
url = "https://raw.githubusercontent.com/ahmedbesbes/mlflow/main/data/aug_train.csv"

df = pd.read_csv(url, dtype={'target':int})
df.info()  # info() returns None, so call it separately to keep the DataFrame in df

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19158 entries, 0 to 19157
Data columns (total 14 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   enrollee_id             19158 non-null  int64  
 1   city                    19158 non-null  object 
 2   city_development_index  19158 non-null  float64
 3   gender                  14650 non-null  object 
 4   relevent_experience     19158 non-null  object 
 5   enrolled_university     18772 non-null  object 
 6   education_level         18698 non-null  object 
 7   major_discipline        16345 non-null  object 
 8   experience              19093 non-null  object 
 9   company_size            13220 non-null  object 
 10  company_type            13018 non-null  object 
 11  last_new_job            18735 non-null  object 
 12  training_hours          19158 non-null  int64  
 13  target                  19158 non-null  int32  
dtypes: float64(1), int32(1), int64(2), object(10)
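
Another common reason to set `dtype` is to stop Pandas from converting codes that only look numeric. A minimal sketch on hypothetical in-memory data with zip codes that start with zeros:

```python
import io

import pandas as pd

# Hypothetical file: zip codes with leading zeros.
text = "zip_code,count\n00501,3\n02134,5\n"

# Inferred as integers: the leading zeros are lost.
inferred = pd.read_csv(io.StringIO(text))

# Read as text: the codes are kept exactly as written.
as_text = pd.read_csv(io.StringIO(text), dtype={"zip_code": str})

print(inferred["zip_code"].tolist())
print(as_text["zip_code"].tolist())
```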

### 16- Parse_dates Parameter
If a CSV file stores dates as strings, parse_dates converts the listed columns to datetime format while reading

In [38]:
url = "https://raw.githubusercontent.com/softhints/Pandas-Tutorials/master/data/csv/multine_header.csv"

df = pd.read_csv(url, parse_dates=["Date"])

df.head().info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 5 columns):
 #   Column       Non-Null Count  Dtype         
---  ------       --------------  -----         
 0   Date         3 non-null      datetime64[ns]
 1   Company A    4 non-null      object        
 2   Company A.1  4 non-null      object        
 3   Company B    4 non-null      object        
 4   Company B.1  4 non-null      object        
dtypes: datetime64[ns](1), object(4)
memory usage: 288.0+ bytes
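
To see the effect in isolation, the sketch below reads the same hypothetical in-memory data with and without `parse_dates` and checks the resulting column type.

```python
import io

import pandas as pd

# Hypothetical file with ISO-formatted dates.
text = "Date,value\n2021-09-06,1\n2021-09-07,2\n"

plain = pd.read_csv(io.StringIO(text))
parsed = pd.read_csv(io.StringIO(text), parse_dates=["Date"])

# Without parse_dates the column holds strings; with it, datetimes.
print(pd.api.types.is_datetime64_any_dtype(plain["Date"]))
print(pd.api.types.is_datetime64_any_dtype(parsed["Date"]))

# Datetime columns expose the .dt accessor, e.g. the day of the month.
print(parsed["Date"].dt.day.tolist())
```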


### 17- Converters Parameter
The converters parameter maps column names to functions. Each function is applied to every value in its column while the CSV is being read.

In [43]:
df = pd.read_csv("tips.csv", converters={'sex':lambda x: "M" if x=="Male" else "F"})
df.head()

Unnamed: 0.1,Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,0,16.99,1.01,F,No,Sun,Dinner,2
1,1,10.34,1.66,M,No,Sun,Dinner,3
2,2,21.01,3.5,M,No,Sun,Dinner,3
3,3,23.68,3.31,M,No,Sun,Dinner,2
4,4,24.59,3.61,F,No,Sun,Dinner,4
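
A converter can also clean a column into a numeric type. The sketch below uses hypothetical in-memory data with prices written like "$2.39"; the helper `price_to_float` is an illustrative name, not part of Pandas.

```python
import io

import pandas as pd

# Hypothetical file with prices stored as text with a dollar sign.
text = "item_name,item_price\nIzze,$3.39\nChicken Bowl,$16.98\n"


def price_to_float(raw: str) -> float:
    """Strip whitespace and the leading dollar sign, then convert to float."""
    return float(raw.strip().lstrip("$"))


# The converter receives each cell of item_price as a string.
prices = pd.read_csv(io.StringIO(text), converters={"item_price": price_to_float})

print(prices)
print(prices["item_price"].dtype)
```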


### 18- Na_values Parameter
If you want some other values to be treated as NaN too, you can list them in na_values

In [49]:
url = "https://raw.githubusercontent.com/ahmedbesbes/mlflow/main/data/aug_train.csv"

df = pd.read_csv(url, na_values=['No relevent experience'])

df

Unnamed: 0,enrollee_id,city,city_development_index,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours,target
0,8949,city_103,0.920,Male,Has relevent experience,no_enrollment,Graduate,STEM,>20,,,1,36,1.0
1,29725,city_40,0.776,Male,,no_enrollment,Graduate,STEM,15,50-99,Pvt Ltd,>4,47,0.0
2,11561,city_21,0.624,,,Full time course,Graduate,STEM,5,,,never,83,0.0
3,33241,city_115,0.789,,,,Graduate,Business Degree,<1,,Pvt Ltd,never,52,1.0
4,666,city_162,0.767,Male,Has relevent experience,no_enrollment,Masters,STEM,>20,50-99,Funded Startup,4,8,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
19153,7386,city_173,0.878,Male,,no_enrollment,Graduate,Humanities,14,,,1,42,1.0
19154,31398,city_103,0.920,Male,Has relevent experience,no_enrollment,Graduate,STEM,14,,,4,52,1.0
19155,24576,city_103,0.920,Male,Has relevent experience,no_enrollment,Graduate,STEM,>20,50-99,Pvt Ltd,4,44,0.0
19156,5756,city_65,0.802,Male,Has relevent experience,no_enrollment,High School,,<1,500-999,Pvt Ltd,2,97,0.0
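
A list passed to `na_values` applies to every column. To limit a marker to one column, a dict can be passed instead. A minimal sketch on hypothetical in-memory data:

```python
import io

import pandas as pd

# Hypothetical file: "unknown" and "never" mark missing values.
text = "city,experience\ncity_40,15\nunknown,never\ncity_21,5\n"

# Dict form: each column gets its own list of missing-value markers.
df_na = pd.read_csv(
    io.StringIO(text),
    na_values={"city": ["unknown"], "experience": ["never"]},
)

print(df_na)
print(df_na.isna().sum())
```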


### 19- Chunksize Parameter
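With chunksize, read_csv returns an iterator that yields the dataset in DataFrames of at most that many rows, so a large file can be processed piece by piece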

In [46]:
url = "https://raw.githubusercontent.com/ahmedbesbes/mlflow/main/data/aug_train.csv"

dfs = pd.read_csv(url, chunksize=3000)

for chunk in dfs:
    print(chunk.shape)

(3000, 14)
(3000, 14)
(3000, 14)
(3000, 14)
(3000, 14)
(3000, 14)
(1158, 14)
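
Chunks are usually combined into a running result rather than just printed. The sketch below sums a column chunk by chunk on hypothetical in-memory data, so the whole file never has to be held in one DataFrame.

```python
import io

import pandas as pd

# Hypothetical file: one column holding the numbers 0..9.
text = "value\n" + "\n".join(str(i) for i in range(10)) + "\n"

total = 0
sizes = []

# Each chunk is an ordinary DataFrame with at most 4 rows.
for chunk in pd.read_csv(io.StringIO(text), chunksize=4):
    sizes.append(len(chunk))
    total += int(chunk["value"].sum())

print(sizes)
print(total)
```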
