### Comma Separated Values (CSV)

This notebook is a reference for working with CSV files. The <b style = "color:red" > .read_csv()</b> function accepts many parameters, and this notebook covers some of the most useful ones.

In [1]:
import pandas as pd

### Loading a local csv file

In [2]:
df = pd.read_csv('aug_train.csv')

In [3]:
df.head(3)

Unnamed: 0,enrollee_id,city,city_development_index,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours,target
0,8949,city_103,0.92,Male,Has relevent experience,no_enrollment,Graduate,STEM,>20,,,1,36,1.0
1,29725,city_40,0.776,Male,No relevent experience,no_enrollment,Graduate,STEM,15,50-99,Pvt Ltd,>4,47,0.0
2,11561,city_21,0.624,,No relevent experience,Full time course,Graduate,STEM,5,,,never,83,0.0


### Loading a CSV file by a URL

You can use this <b>code snippet</b> to load data from a URL, for example when the CSV file is hosted on a server.

In [4]:
import requests
from io import StringIO

url = "https://raw.githubusercontent.com/cs109/2014_data/master/countries.csv"
headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:66.0) Gecko/20100101 Firefox/66.0"}
# requests.get() sends an HTTP GET request to the URL and stores the response in req.
req = requests.get(url, headers=headers)       # the headers make the request look like it comes from a browser
req.raise_for_status()                         # stop here if the server returned an error status

# Wrap the response text in a file-like object that read_csv can read.
data = StringIO(req.text)

df1 = pd.read_csv(data) 
df1.head(3)

Unnamed: 0,Country,Region
0,Algeria,AFRICA
1,Angola,AFRICA
2,Benin,AFRICA


### sep Parameter 

By default, <b>read_csv()</b> uses a comma as the separator, i.e. <b>sep</b> is set to ,<br>
To work with tab-separated data, i.e. <b style = "color:purple">.tsv</b> files, pass <b>'\t'</b> as the value of the sep parameter. The dataframe's columns are then formed by splitting each line on tabs rather than commas.

In [5]:
df2 = pd.read_csv('movies.tsv',sep='\t')
df2.tail(6)

Unnamed: 0,m0,10 things i hate about you,1999,6.90,62847,['comedy' 'romance']
610,m611,the world is not enough,1999,6.3,60047.0,['action' 'adventure' 'thriller']
611,m612,watchmen,2009,7.8,135229.0,['action' 'crime' 'fantasy' 'mystery' 'sci-fi'...
612,m613,xxx,2002,5.6,53505.0,['action' 'adventure' 'crime']
613,m614,x-men,2000,7.4,122149.0,['action' 'sci-fi']
614,m615,young frankenstein,1974,8.0,57618.0,['comedy' 'sci-fi']
615,m616,zulu dawn,1979,6.4,1911.0,['action' 'adventure' 'drama' 'history' 'war']
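
The effect of <b>sep</b> can be checked without a file on disk. The following is a minimal sketch that uses a small in-memory string (hypothetical sample data, not the real movies.tsv) in place of a file:

```python
from io import StringIO

import pandas as pd

# Hypothetical in-memory sample standing in for a real .tsv file.
tsv_text = "movie_id\ttitle\tyear\nm0\t10 things i hate about you\t1999\nm1\talien\t1979\n"

# With sep="\t" each tab starts a new column.
tsv_df = pd.read_csv(StringIO(tsv_text), sep="\t")
print(tsv_df.shape)

# With the default sep="," nothing is split, so every line lands in a single column.
wrong_df = pd.read_csv(StringIO(tsv_text))
print(wrong_df.shape)
```

A dataframe with a single, oddly named column is usually a sign that the separator is wrong.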


### names Parameter

Use the <b>names</b> parameter to supply your own column names.

In [6]:
"""
This file has no header line, so in the previous cell pandas used the first data row as the column names of df2.
The names parameter supplies custom column names instead, and the first line is then kept as data.
"""
df2 = pd.read_csv('movies.tsv',sep='\t',names = ['serial_no','movie_name','year_released','rating','no_votes','genres'])
df2.tail(6)

Unnamed: 0,serial_no,movie_name,year_released,rating,no_votes,genres
611,m611,the world is not enough,1999,6.3,60047.0,['action' 'adventure' 'thriller']
612,m612,watchmen,2009,7.8,135229.0,['action' 'crime' 'fantasy' 'mystery' 'sci-fi'...
613,m613,xxx,2002,5.6,53505.0,['action' 'adventure' 'crime']
614,m614,x-men,2000,7.4,122149.0,['action' 'sci-fi']
615,m615,young frankenstein,1974,8.0,57618.0,['comedy' 'sci-fi']
616,m616,zulu dawn,1979,6.4,1911.0,['action' 'adventure' 'drama' 'history' 'war']
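
Whether the first line is kept as data depends on whether the file has its own header. The sketch below uses two hypothetical in-memory samples to show both cases. When a header line exists and should be replaced, <b>header=0</b> has to be passed together with <b>names</b>:

```python
from io import StringIO

import pandas as pd

cols = ["serial_no", "movie_name", "year_released"]

# Case 1: the file has no header line, so every line is data.
no_header = "m0,10 things i hate about you,1999\nm1,alien,1979\n"
df_a = pd.read_csv(StringIO(no_header), names=cols)

# Case 2: the file has a header line that we want to replace.
# header=0 tells pandas to consume that line instead of keeping it as a data row.
with_header = "id,title,yr\nm0,10 things i hate about you,1999\nm1,alien,1979\n"
df_b = pd.read_csv(StringIO(with_header), names=cols, header=0)

print(df_a.equals(df_b))
```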


### index_col Parameter

Sometimes a dataset already contains a column of unique values, which makes the auto-generated integer index redundant. The <b>index_col</b> parameter lets you drop that generated index and use the unique column as the index instead.

In [7]:
df3 = pd.read_csv('aug_train.csv')
df3.head(4)

Unnamed: 0,enrollee_id,city,city_development_index,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours,target
0,8949,city_103,0.92,Male,Has relevent experience,no_enrollment,Graduate,STEM,>20,,,1,36,1.0
1,29725,city_40,0.776,Male,No relevent experience,no_enrollment,Graduate,STEM,15,50-99,Pvt Ltd,>4,47,0.0
2,11561,city_21,0.624,,No relevent experience,Full time course,Graduate,STEM,5,,,never,83,0.0
3,33241,city_115,0.789,,No relevent experience,,Graduate,Business Degree,<1,,Pvt Ltd,never,52,1.0


In [8]:
df3 = pd.read_csv('aug_train.csv',index_col='enrollee_id')
df3.head(4)

Unnamed: 0_level_0,city,city_development_index,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours,target
enrollee_id,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1
8949,city_103,0.92,Male,Has relevent experience,no_enrollment,Graduate,STEM,>20,,,1,36,1.0
29725,city_40,0.776,Male,No relevent experience,no_enrollment,Graduate,STEM,15,50-99,Pvt Ltd,>4,47,0.0
11561,city_21,0.624,,No relevent experience,Full time course,Graduate,STEM,5,,,never,83,0.0
33241,city_115,0.789,,No relevent experience,,Graduate,Business Degree,<1,,Pvt Ltd,never,52,1.0
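
<b>index_col</b> also accepts a column position. A minimal sketch on a hypothetical three-column sample, showing that the name and the position give the same result and that rows can then be looked up by id:

```python
from io import StringIO

import pandas as pd

# Hypothetical sample with the same leading columns as aug_train.csv.
csv_text = "enrollee_id,city,training_hours\n8949,city_103,36\n29725,city_40,47\n"

by_name = pd.read_csv(StringIO(csv_text), index_col="enrollee_id")
by_pos = pd.read_csv(StringIO(csv_text), index_col=0)  # same column, given by position

# The index column is no longer a regular column, but rows can be selected by its values.
print(by_name.loc[29725, "city"])
```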


### header Parameter

When the real column names sit on a later line of the file and are therefore read as ordinary row values, you can use this parameter to tell pandas which line holds them.

In [9]:
pd.read_csv('test.csv').head(4)

Unnamed: 0.1,Unnamed: 0,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8,Unnamed: 9,Unnamed: 10,Unnamed: 11,Unnamed: 12,Unnamed: 13,Unnamed: 14
0,0,enrollee_id,city,city_development_index,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours,target
1,1,29725,city_40,0.776,Male,No relevent experience,no_enrollment,Graduate,STEM,15,50-99,Pvt Ltd,>4,47,0
2,2,11561,city_21,0.624,,No relevent experience,Full time course,Graduate,STEM,5,,,never,83,0
3,3,33241,city_115,0.789,,No relevent experience,,Graduate,Business Degree,<1,,Pvt Ltd,never,52,1


In [10]:
df4 = pd.read_csv('test.csv',header = 1)
"""
header is a zero-based line number. header = 1 means the second line of the file supplies the column names and the
data starts on the line after it; any lines before it are ignored. Likewise, header = 3 uses the fourth line as column names.
"""
df4.head(4)

Unnamed: 0,0,enrollee_id,city,city_development_index,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours,target
0,1,29725,city_40,0.776,Male,No relevent experience,no_enrollment,Graduate,STEM,15,50-99,Pvt Ltd,>4,47,0
1,2,11561,city_21,0.624,,No relevent experience,Full time course,Graduate,STEM,5,,,never,83,0
2,3,33241,city_115,0.789,,No relevent experience,,Graduate,Business Degree,<1,,Pvt Ltd,never,52,1
3,4,666,city_162,0.767,Male,Has relevent experience,no_enrollment,Masters,STEM,>20,50-99,Funded Startup,4,8,0
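
The zero-based counting of <b>header</b> is easy to verify on a tiny example. This sketch assumes a hypothetical file whose first line is a title rather than column names, and also shows <b>header=None</b> for files that have no header at all:

```python
from io import StringIO

import pandas as pd

# Hypothetical file: line 0 is a title, line 1 holds the real column names.
raw = "exported by some tool\nenrollee_id,city\n29725,city_40\n11561,city_21\n"
with_title = pd.read_csv(StringIO(raw), header=1)
print(list(with_title.columns))

# header=None means "there is no header line": columns are numbered 0, 1, ...
headerless = pd.read_csv(StringIO("1,2\n3,4\n"), header=None)
print(list(headerless.columns))
```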


### usecols Parameter

When importing data, you may already know that you need only some of the columns. In that case, use the <b>usecols</b> parameter to load just those columns and discard the rest. <br>Pandas offers other ways to select columns after loading, but this parameter restricts the columns at import time.

In [11]:
df5 = pd.read_csv('aug_train.csv',usecols=['enrollee_id','gender','city','education_level'])
df5.head(6)

Unnamed: 0,enrollee_id,city,gender,education_level
0,8949,city_103,Male,Graduate
1,29725,city_40,Male,Graduate
2,11561,city_21,,Graduate
3,33241,city_115,,Graduate
4,666,city_162,Male,Masters
5,21651,city_176,,Graduate
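
Besides a list of names, <b>usecols</b> accepts column positions or a function that is called with each column name. A short sketch on a hypothetical in-memory sample:

```python
from io import StringIO

import pandas as pd

csv_text = (
    "enrollee_id,city,gender,education_level\n"
    "8949,city_103,Male,Graduate\n"
    "29725,city_40,Male,Graduate\n"
)

# Select columns by position.
by_pos = pd.read_csv(StringIO(csv_text), usecols=[0, 2])

# Select columns with a function: keep every column whose name is not "city".
by_func = pd.read_csv(StringIO(csv_text), usecols=lambda name: name != "city")

print(list(by_pos.columns))
print(list(by_func.columns))
```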


### squeeze Parameter

If you select a single column with read_csv, the result is still a <b>dataframe</b> by default. Passing <b>squeeze = True</b> to <b>read_csv()</b> returns a <b>series</b> instead. Note that this parameter was deprecated in pandas 1.4 and removed in pandas 2.0; on newer versions, call <b>.squeeze("columns")</b> on the result of read_csv.

In [12]:
#df6 = pd.read_csv('aug_train.csv',usecols=['enrollee_id'])                  #This will return a dataframe.
df6 = pd.read_csv('aug_train.csv',usecols=['enrollee_id'],squeeze = True)    #This will give us a series.
df6

0         8949
1        29725
2        11561
3        33241
4          666
         ...  
19153     7386
19154    31398
19155    24576
19156     5756
19157    23834
Name: enrollee_id, Length: 19158, dtype: int64
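
On pandas versions where the <b>squeeze</b> parameter is no longer available, the same result can be obtained from the dataframe itself. A minimal sketch on a hypothetical two-row sample:

```python
from io import StringIO

import pandas as pd

csv_text = "enrollee_id,city\n8949,city_103\n29725,city_40\n"

# A single-column selection is still a DataFrame...
frame = pd.read_csv(StringIO(csv_text), usecols=["enrollee_id"])

# ...and DataFrame.squeeze("columns") turns that one column into a Series.
series = frame.squeeze("columns")
print(type(series).__name__, series.name)
```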

### skiprows Parameter

This parameter <b>skips the given rows</b> while reading the file. Row numbers are zero-based positions in the file, so 0 refers to the line that holds the column names and the data rows start at 1. To skip data rows, start from 1, not 0.

In [13]:
df7 = pd.read_csv('aug_train.csv')
df7.head(8)

Unnamed: 0,enrollee_id,city,city_development_index,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours,target
0,8949,city_103,0.92,Male,Has relevent experience,no_enrollment,Graduate,STEM,>20,,,1,36,1.0
1,29725,city_40,0.776,Male,No relevent experience,no_enrollment,Graduate,STEM,15,50-99,Pvt Ltd,>4,47,0.0
2,11561,city_21,0.624,,No relevent experience,Full time course,Graduate,STEM,5,,,never,83,0.0
3,33241,city_115,0.789,,No relevent experience,,Graduate,Business Degree,<1,,Pvt Ltd,never,52,1.0
4,666,city_162,0.767,Male,Has relevent experience,no_enrollment,Masters,STEM,>20,50-99,Funded Startup,4,8,0.0
5,21651,city_176,0.764,,Has relevent experience,Part time course,Graduate,STEM,11,,,1,24,1.0
6,28806,city_160,0.92,Male,Has relevent experience,no_enrollment,High School,,5,50-99,Funded Startup,1,24,0.0
7,402,city_46,0.762,Male,Has relevent experience,no_enrollment,Graduate,STEM,13,<10,Pvt Ltd,>4,18,1.0


#### You can also pass a function to the <b style = "color:red">skiprows</b> parameter to decide which rows to skip.

In [14]:
df7 = pd.read_csv('aug_train.csv',skiprows = lambda x: x in range(1,5))     #This will skip the first 4 data rows (file lines 1 to 4).
#pd.read_csv('aug_train.csv',skiprows = [1,2])                              #This will skip only the first and second data rows.

In [15]:
df7.head(8)

Unnamed: 0,enrollee_id,city,city_development_index,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours,target
0,666,city_162,0.767,Male,Has relevent experience,no_enrollment,Masters,STEM,>20,50-99,Funded Startup,4,8,0.0
1,21651,city_176,0.764,,Has relevent experience,Part time course,Graduate,STEM,11,,,1,24,1.0
2,28806,city_160,0.92,Male,Has relevent experience,no_enrollment,High School,,5,50-99,Funded Startup,1,24,0.0
3,402,city_46,0.762,Male,Has relevent experience,no_enrollment,Graduate,STEM,13,<10,Pvt Ltd,>4,18,1.0
4,27107,city_103,0.92,Male,Has relevent experience,no_enrollment,Graduate,STEM,7,50-99,Pvt Ltd,1,46,1.0
5,699,city_103,0.92,,Has relevent experience,no_enrollment,Graduate,STEM,17,10000+,Pvt Ltd,>4,123,0.0
6,29452,city_21,0.624,,No relevent experience,Full time course,High School,,2,,,never,32,1.0
7,23853,city_103,0.92,Male,Has relevent experience,no_enrollment,Graduate,STEM,5,5000-9999,Pvt Ltd,1,108,0.0
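
The three forms of <b>skiprows</b> (integer, list, function) behave differently with respect to the header line. The following sketch, on a hypothetical eight-row sample, illustrates the difference:

```python
from io import StringIO

import pandas as pd

# Hypothetical sample: header on line 0, ids 1..8 on lines 1..8.
csv_text = "id,value\n" + "".join(f"{i},{i * 10}\n" for i in range(1, 9))

# A list skips exactly those file lines; the header (line 0) is kept.
a = pd.read_csv(StringIO(csv_text), skiprows=[1, 2])

# A function receives each line number and returns True for lines to skip.
# Here every even-numbered line after the header is dropped.
b = pd.read_csv(StringIO(csv_text), skiprows=lambda i: i > 0 and i % 2 == 0)

# An integer skips that many lines from the top, INCLUDING the header,
# so the first remaining data line is mistaken for the column names.
c = pd.read_csv(StringIO(csv_text), skiprows=2)

print(a["id"].tolist(), b["id"].tolist(), list(c.columns))
```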


### nrows Parameter

It loads only the first <b>n</b> rows while importing. When a file has too many rows (e.g. millions or billions) to fit in memory (RAM), you can load and process the data part by part, and that is where the <b>nrows</b> parameter is really helpful.

In [16]:
df8 = pd.read_csv('aug_train.csv',nrows = 10)
df8
#df8.shape

Unnamed: 0,enrollee_id,city,city_development_index,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours,target
0,8949,city_103,0.92,Male,Has relevent experience,no_enrollment,Graduate,STEM,>20,,,1,36,1.0
1,29725,city_40,0.776,Male,No relevent experience,no_enrollment,Graduate,STEM,15,50-99,Pvt Ltd,>4,47,0.0
2,11561,city_21,0.624,,No relevent experience,Full time course,Graduate,STEM,5,,,never,83,0.0
3,33241,city_115,0.789,,No relevent experience,,Graduate,Business Degree,<1,,Pvt Ltd,never,52,1.0
4,666,city_162,0.767,Male,Has relevent experience,no_enrollment,Masters,STEM,>20,50-99,Funded Startup,4,8,0.0
5,21651,city_176,0.764,,Has relevent experience,Part time course,Graduate,STEM,11,,,1,24,1.0
6,28806,city_160,0.92,Male,Has relevent experience,no_enrollment,High School,,5,50-99,Funded Startup,1,24,0.0
7,402,city_46,0.762,Male,Has relevent experience,no_enrollment,Graduate,STEM,13,<10,Pvt Ltd,>4,18,1.0
8,27107,city_103,0.92,Male,Has relevent experience,no_enrollment,Graduate,STEM,7,50-99,Pvt Ltd,1,46,1.0
9,699,city_103,0.92,,Has relevent experience,no_enrollment,Graduate,STEM,17,10000+,Pvt Ltd,>4,123,0.0
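
Combined with <b>skiprows</b>, <b>nrows</b> can read a window from the middle of a file. A minimal sketch on a hypothetical ten-row sample:

```python
from io import StringIO

import pandas as pd

# Hypothetical sample: header on line 0, ids 1..10 on lines 1..10.
csv_text = "id,value\n" + "".join(f"{i},{i * 10}\n" for i in range(1, 11))

# First three data rows only.
first = pd.read_csv(StringIO(csv_text), nrows=3)

# Skip data lines 1..3 (the header on line 0 is kept), then read the next three rows.
window = pd.read_csv(StringIO(csv_text), skiprows=list(range(1, 4)), nrows=3)

print(first["id"].tolist(), window["id"].tolist())
```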


### encoding Parameter

By default, read_csv assumes that the file is encoded in <b>UTF-8 (Unicode Transformation Format)</b>. If your dataset uses a different encoding, you need to specify it with the <b style = "color:orange">encoding</b> parameter.

#### How do I find the encoding of a file?
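On Windows, open the file in the standard Notepad application. It shows the encoding of the file when you click "Save As...". Note that Python recognises the encoding name 'ANSI' only on Windows; on other systems, pass the name of the specific code page instead (for Western European Windows files this is usually 'cp1252').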

In [17]:
df9 = pd.read_csv('zomato.csv',encoding='ANSI')
df9.head(4)

Unnamed: 0,Restaurant ID,Restaurant Name,Country Code,City,Address,Locality,Locality Verbose,Longitude,Latitude,Cuisines,...,Currency,Has Table booking,Has Online delivery,Is delivering now,Switch to order menu,Price range,Aggregate rating,Rating color,Rating text,Votes
0,6317637,Le Petit Souffle,162,Makati City,"Third Floor, Century City Mall, Kalayaan Avenu...","Century City Mall, Poblacion, Makati City","Century City Mall, Poblacion, Makati City, Mak...",121.027535,14.565443,"French, Japanese, Desserts",...,Botswana Pula(P),Yes,No,No,No,3,4.8,Dark Green,Excellent,314
1,6304287,Izakaya Kikufuji,162,Makati City,"Little Tokyo, 2277 Chino Roces Avenue, Legaspi...","Little Tokyo, Legaspi Village, Makati City","Little Tokyo, Legaspi Village, Makati City, Ma...",121.014101,14.553708,Japanese,...,Botswana Pula(P),Yes,No,No,No,3,4.5,Dark Green,Excellent,591
2,6300002,Heat - Edsa Shangri-La,162,Mandaluyong City,"Edsa Shangri-La, 1 Garden Way, Ortigas, Mandal...","Edsa Shangri-La, Ortigas, Mandaluyong City","Edsa Shangri-La, Ortigas, Mandaluyong City, Ma...",121.056831,14.581404,"Seafood, Asian, Filipino, Indian",...,Botswana Pula(P),Yes,No,No,No,4,4.4,Green,Very Good,270
3,6318506,Ooma,162,Mandaluyong City,"Third Floor, Mega Fashion Hall, SM Megamall, O...","SM Megamall, Ortigas, Mandaluyong City","SM Megamall, Ortigas, Mandaluyong City, Mandal...",121.056475,14.585318,"Japanese, Sushi",...,Botswana Pula(P),No,No,No,No,4,4.9,Dark Green,Excellent,365
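
The following sketch shows why the encoding matters. It builds a hypothetical two-column sample, encodes it as cp1252 (an assumption chosen for illustration, not the actual encoding of zomato.csv), and reads it back with the matching <b>encoding</b> value:

```python
from io import BytesIO

import pandas as pd

# Hypothetical sample containing non-ASCII characters, stored as cp1252 bytes.
raw_bytes = "name,city\nCafé Noir,Zürich\n".encode("cp1252")

# These bytes are not valid UTF-8, which is what read_csv assumes by default.
try:
    raw_bytes.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

# Naming the real encoding lets pandas decode the text correctly.
df_enc = pd.read_csv(BytesIO(raw_bytes), encoding="cp1252")
print(valid_utf8, df_enc.loc[0, "name"])
```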


### Skip bad lines with <b style = "color:orange">error_bad_lines</b> Parameter

Some rows in a dataset may contain more values than there are columns. These <b style = "color : orange">bad lines</b> cause a parser error, so you will want to drop them while loading. If you hit this kind of error, bad lines are the likely cause and this parameter is what you need. Note that <b>error_bad_lines</b> was deprecated in pandas 1.3 and removed in pandas 2.0; on newer versions, use <b>on_bad_lines='skip'</b> instead.

In [43]:
#This faulty code is left in on purpose.
"""
 This raises a ParserError because the parser expects every line to have the same number of fields as the header,
 but some lines in this file contain more.
 """
df10 = pd.read_csv('BX-Books.csv',encoding='ANSI',sep = ';')    #Here the data is separated by ; in the dataset.  

ParserError: Error tokenizing data. C error: Expected 8 fields in line 6452, saw 9


In [28]:
"""
This skips the bad lines and gives us a dataframe without the rows that have the wrong number of values.
The messages printed below the cell list which lines were skipped.
"""
df10 = pd.read_csv('BX-Books.csv',encoding='ANSI',sep = ';',error_bad_lines=False)
df10.head(5)



b'Skipping line 6452: expected 8 fields, saw 9\nSkipping line 43667: expected 8 fields, saw 10\nSkipping line 51751: expected 8 fields, saw 9\n'
b'Skipping line 92038: expected 8 fields, saw 9\nSkipping line 104319: expected 8 fields, saw 9\nSkipping line 121768: expected 8 fields, saw 9\n'
b'Skipping line 144058: expected 8 fields, saw 9\nSkipping line 150789: expected 8 fields, saw 9\nSkipping line 157128: expected 8 fields, saw 9\nSkipping line 180189: expected 8 fields, saw 9\nSkipping line 185738: expected 8 fields, saw 9\n'
b'Skipping line 209388: expected 8 fields, saw 9\nSkipping line 220626: expected 8 fields, saw 9\nSkipping line 227933: expected 8 fields, saw 11\nSkipping line 228957: expected 8 fields, saw 10\nSkipping line 245933: expected 8 fields, saw 9\nSkipping line 251296: expected 8 fields, saw 9\nSkipping line 259941: expected 8 fields, saw 9\nSkipping line 261529: expected 8 fields, saw 9\n'

Unnamed: 0,ISBN,Book-Title,Book-Author,Year-Of-Publication,Publisher,Image-URL-S,Image-URL-M,Image-URL-L
0,195153448,Classical Mythology,Mark P. O. Morford,2002,Oxford University Press,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...,http://images.amazon.com/images/P/0195153448.0...
1,2005018,Clara Callan,Richard Bruce Wright,2001,HarperFlamingo Canada,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...,http://images.amazon.com/images/P/0002005018.0...
2,60973129,Decision in Normandy,Carlo D'Este,1991,HarperPerennial,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...,http://images.amazon.com/images/P/0060973129.0...
3,374157065,Flu: The Story of the Great Influenza Pandemic...,Gina Bari Kolata,1999,Farrar Straus Giroux,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...,http://images.amazon.com/images/P/0374157065.0...
4,393045218,The Mummies of Urumchi,E. J. W. Barber,1999,W. W. Norton & Company,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...,http://images.amazon.com/images/P/0393045218.0...
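
On pandas 1.3 and later, the same behaviour is available through <b>on_bad_lines</b>. A minimal sketch on a hypothetical semicolon-separated sample in which one line has an extra field:

```python
from io import StringIO

import pandas as pd

# Hypothetical sample: the second data line has four fields instead of three.
text = "a;b;c\n1;2;3\n4;5;6;7\n8;9;10\n"

# on_bad_lines="skip" silently drops lines with too many fields.
clean = pd.read_csv(StringIO(text), sep=";", on_bad_lines="skip")
print(clean["a"].tolist())
```

Skipping is convenient, but the dropped rows are lost, so it is worth checking how many lines were affected before relying on the result.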


### dtype Parameter

Use this parameter when you need to override the data type that pandas infers for a column, for example to reduce memory usage.

In [29]:
pd.read_csv('aug_train.csv').info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19158 entries, 0 to 19157
Data columns (total 14 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   enrollee_id             19158 non-null  int64  
 1   city                    19158 non-null  object 
 2   city_development_index  19158 non-null  float64
 3   gender                  14650 non-null  object 
 4   relevent_experience     19158 non-null  object 
 5   enrolled_university     18772 non-null  object 
 6   education_level         18698 non-null  object 
 7   major_discipline        16345 non-null  object 
 8   experience              19093 non-null  object 
 9   company_size            13220 non-null  object 
 10  company_type            13018 non-null  object 
 11  last_new_job            18735 non-null  object 
 12  training_hours          19158 non-null  int64  
 13  target                  19158 non-null  float64
dtypes: float64(2), int64(2), object(10)
me

In [31]:
"""
Initially the last column, 'target', has the float data type, but its values can be stored as integers. To reduce
the memory size of this dataset, you can override it to integer using the dtype parameter.
"""
df11 = pd.read_csv('aug_train.csv',dtype={'target':int})
df11.head(5)

Unnamed: 0,enrollee_id,city,city_development_index,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours,target
0,8949,city_103,0.92,Male,Has relevent experience,no_enrollment,Graduate,STEM,>20,,,1,36,1
1,29725,city_40,0.776,Male,No relevent experience,no_enrollment,Graduate,STEM,15,50-99,Pvt Ltd,>4,47,0
2,11561,city_21,0.624,,No relevent experience,Full time course,Graduate,STEM,5,,,never,83,0
3,33241,city_115,0.789,,No relevent experience,,Graduate,Business Degree,<1,,Pvt Ltd,never,52,1
4,666,city_162,0.767,Male,Has relevent experience,no_enrollment,Masters,STEM,>20,50-99,Funded Startup,4,8,0


In [32]:
df11.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 19158 entries, 0 to 19157
Data columns (total 14 columns):
 #   Column                  Non-Null Count  Dtype  
---  ------                  --------------  -----  
 0   enrollee_id             19158 non-null  int64  
 1   city                    19158 non-null  object 
 2   city_development_index  19158 non-null  float64
 3   gender                  14650 non-null  object 
 4   relevent_experience     19158 non-null  object 
 5   enrolled_university     18772 non-null  object 
 6   education_level         18698 non-null  object 
 7   major_discipline        16345 non-null  object 
 8   experience              19093 non-null  object 
 9   company_size            13220 non-null  object 
 10  company_type            13018 non-null  object 
 11  last_new_job            18735 non-null  object 
 12  training_hours          19158 non-null  int64  
 13  target                  19158 non-null  int32  
dtypes: float64(1), int32(1), int64(2), obj
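
The memory saving can be measured directly. This sketch uses a hypothetical 1000-row sample and an explicit narrow integer type (int16 is an assumption that only holds while the values fit in that range):

```python
from io import StringIO

import pandas as pd

# Hypothetical sample: 1000 rows with small integer values.
csv_text = "enrollee_id,training_hours\n" + "".join(
    f"{i},{i % 300}\n" for i in range(1000)
)

default = pd.read_csv(StringIO(csv_text))
compact = pd.read_csv(StringIO(csv_text), dtype={"training_hours": "int16"})

# Compare the bytes used by the column under each dtype.
default_bytes = default.memory_usage(deep=True)["training_hours"]
compact_bytes = compact.memory_usage(deep=True)["training_hours"]
print(default["training_hours"].dtype, default_bytes)
print(compact["training_hours"].dtype, compact_bytes)
```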

### Handling Dates

By default, read_csv loads date values as strings. As a result, you cannot use date functionality on them, because the values are not stored as dates. The <b style = "color:orange">parse_dates</b> parameter fixes this.

Suppose you have two columns, one with the month and the other with the year. To combine them into a single date column, you can pass a nested list of those column names or column indexes to the parse_dates parameter: <b style = "color:orange">parse_dates = [[1,3]]</b> or <b style = "color:orange">parse_dates = [['colname1','colname2']]</b>. Note that this nested-list form is deprecated as of pandas 2.2, where the recommended alternative is to combine the columns with pd.to_datetime after loading.

In [35]:
"""
Initially the dates are read as strings (the pandas object dtype); the parse_dates parameter converts them to datetime values.
"""
df12 = pd.read_csv('IPL Matches.csv',parse_dates=['date'])
df12.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 816 entries, 0 to 815
Data columns (total 17 columns):
 #   Column           Non-Null Count  Dtype         
---  ------           --------------  -----         
 0   id               816 non-null    int64         
 1   city             803 non-null    object        
 2   date             816 non-null    datetime64[ns]
 3   player_of_match  812 non-null    object        
 4   venue            816 non-null    object        
 5   neutral_venue    816 non-null    int64         
 6   team1            816 non-null    object        
 7   team2            816 non-null    object        
 8   toss_winner      816 non-null    object        
 9   toss_decision    816 non-null    object        
 10  winner           812 non-null    object        
 11  result           812 non-null    object        
 12  result_margin    799 non-null    float64       
 13  eliminator       812 non-null    object        
 14  method           19 non-null     object   
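
Both cases, a single date column and separate year and month columns, can be sketched on hypothetical in-memory samples. The second case combines the columns after loading with <b>pd.to_datetime</b>, using the first day of the month as an assumed day:

```python
from io import StringIO

import pandas as pd

# Case 1: one column already holds full dates.
dated = pd.read_csv(StringIO("date,runs\n2008-04-18,140\n2008-04-19,33\n"),
                    parse_dates=["date"])
print(dated["date"].dt.year.tolist())

# Case 2: year and month live in separate columns (hypothetical sample).
parts = pd.read_csv(StringIO("year,month,sales\n2020,1,10\n2020,2,12\n"))

# pd.to_datetime assembles dates from columns named year, month and day.
parts["period"] = pd.to_datetime(parts[["year", "month"]].assign(day=1))
print(parts["period"].tolist())
```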

### Converters

Sometimes you might want to transform the values of a column while the dataset is being loaded. The <b>converters</b> parameter does this.

In [38]:
def rename(name):
    if name == "Royal Challengers Bangalore":
        return "RCB"
    else:
        return name
    
rename("Royal Challengers Bangalore")

'RCB'

In [39]:
pd.read_csv('IPL Matches.csv').head(3)

Unnamed: 0,id,city,date,player_of_match,venue,neutral_venue,team1,team2,toss_winner,toss_decision,winner,result,result_margin,eliminator,method,umpire1,umpire2
0,335982,Bangalore,2008-04-18,BB McCullum,M Chinnaswamy Stadium,0,Royal Challengers Bangalore,Kolkata Knight Riders,Royal Challengers Bangalore,field,Kolkata Knight Riders,runs,140.0,N,,Asad Rauf,RE Koertzen
1,335983,Chandigarh,2008-04-19,MEK Hussey,"Punjab Cricket Association Stadium, Mohali",0,Kings XI Punjab,Chennai Super Kings,Chennai Super Kings,bat,Chennai Super Kings,runs,33.0,N,,MR Benson,SL Shastri
2,335984,Delhi,2008-04-19,MF Maharoof,Feroz Shah Kotla,0,Delhi Daredevils,Rajasthan Royals,Rajasthan Royals,bat,Delhi Daredevils,wickets,9.0,N,,Aleem Dar,GA Pratapkumar


In [42]:
"""
The converters parameter takes a dictionary where each key is a column name and each value is a function.
Each value of that column is passed to the function as an argument, and the return value is stored in its place.
"""
df13 = pd.read_csv('IPL Matches.csv',converters={'team1':rename})
df13.head(3)

Unnamed: 0,id,city,date,player_of_match,venue,neutral_venue,team1,team2,toss_winner,toss_decision,winner,result,result_margin,eliminator,method,umpire1,umpire2
0,335982,Bangalore,2008-04-18,BB McCullum,M Chinnaswamy Stadium,0,RCB,Kolkata Knight Riders,Royal Challengers Bangalore,field,Kolkata Knight Riders,runs,140.0,N,,Asad Rauf,RE Koertzen
1,335983,Chandigarh,2008-04-19,MEK Hussey,"Punjab Cricket Association Stadium, Mohali",0,Kings XI Punjab,Chennai Super Kings,Chennai Super Kings,bat,Chennai Super Kings,runs,33.0,N,,MR Benson,SL Shastri
2,335984,Delhi,2008-04-19,MF Maharoof,Feroz Shah Kotla,0,Delhi Daredevils,Rajasthan Royals,Rajasthan Royals,bat,Delhi Daredevils,wickets,9.0,N,,Aleem Dar,GA Pratapkumar
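
Converters receive the raw text of each cell, so they can also clean up values before pandas infers a type. A minimal sketch on a hypothetical sample with padded numbers; the lookup table <b>short_names</b> is an illustrative helper, not part of the original notebook:

```python
from io import StringIO

import pandas as pd

# Hypothetical sample: team names plus a numeric column padded with spaces.
csv_text = "team1,result_margin\nRoyal Challengers Bangalore, 140 \nKings XI Punjab, 33 \n"

short_names = {"Royal Challengers Bangalore": "RCB"}  # illustrative lookup table

converted = pd.read_csv(
    StringIO(csv_text),
    converters={
        # Replace known long names, leave every other name unchanged.
        "team1": lambda s: short_names.get(s, s),
        # Each cell arrives as a string, so strip the padding before converting.
        "result_margin": lambda s: int(s.strip()),
    },
)
print(converted["team1"].tolist(), converted["result_margin"].tolist())
```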


### na_values Parameter

<b style = "color:orange">NaN (Not A Number)</b> in your dataset marks a missing value. Sometimes a missing value is represented by a - or some other placeholder instead of NaN. You can convert such placeholders to NaN by using the <b style = "color:orange">na_values</b> parameter.  

You do this by passing that character as the parameter's value.

In [47]:
pd.read_csv('aug_train.csv').head(5)

Unnamed: 0,enrollee_id,city,city_development_index,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours,target
0,8949,city_103,0.92,Male,Has relevent experience,no_enrollment,Graduate,STEM,>20,,,1,36,1.0
1,29725,city_40,0.776,Male,No relevent experience,no_enrollment,Graduate,STEM,15,50-99,Pvt Ltd,>4,47,0.0
2,11561,city_21,0.624,,No relevent experience,Full time course,Graduate,STEM,5,,,never,83,0.0
3,33241,city_115,0.789,,No relevent experience,,Graduate,Business Degree,<1,,Pvt Ltd,never,52,1.0
4,666,city_162,0.767,Male,Has relevent experience,no_enrollment,Masters,STEM,>20,50-99,Funded Startup,4,8,0.0


In [46]:
"""Here all the 'Male' values are turned into NaN as an example. In real data the placeholder is more likely a character such as -.
This is just for understanding."""

pd.read_csv('aug_train.csv',na_values='Male').head(5)

Unnamed: 0,enrollee_id,city,city_development_index,gender,relevent_experience,enrolled_university,education_level,major_discipline,experience,company_size,company_type,last_new_job,training_hours,target
0,8949,city_103,0.92,,Has relevent experience,no_enrollment,Graduate,STEM,>20,,,1,36,1.0
1,29725,city_40,0.776,,No relevent experience,no_enrollment,Graduate,STEM,15,50-99,Pvt Ltd,>4,47,0.0
2,11561,city_21,0.624,,No relevent experience,Full time course,Graduate,STEM,5,,,never,83,0.0
3,33241,city_115,0.789,,No relevent experience,,Graduate,Business Degree,<1,,Pvt Ltd,never,52,1.0
4,666,city_162,0.767,,Has relevent experience,no_enrollment,Masters,STEM,>20,50-99,Funded Startup,4,8,0.0
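
<b>na_values</b> also accepts a dictionary, so that each column gets its own list of placeholders. A short sketch on a hypothetical sample in which - and unknown stand for missing values:

```python
from io import StringIO

import pandas as pd

# Hypothetical sample: "-" and "unknown" are used as placeholders for missing data.
csv_text = "gender,company_size\nMale,-\n-,50-99\nunknown,<10\n"

with_nan = pd.read_csv(
    StringIO(csv_text),
    na_values={"gender": ["-", "unknown"], "company_size": ["-"]},
)

# Only cells that match a placeholder exactly become NaN; "50-99" is untouched.
print(with_nan.isna().sum().to_dict())
```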


### Loading a huge dataset in chunks

When working with a huge dataset, you might not be able to load the whole dataset into memory at once. Instead, you can load it in chunks (small batches) by using the <b style = "color:orange">chunksize</b> parameter.

In [58]:
dfs = pd.read_csv('aug_train.csv',chunksize = 5000)

In [59]:
for chunk in dfs:          # each chunk is a regular dataframe with at most 5000 rows
    print(chunk.shape)

(5000, 14)
(5000, 14)
(5000, 14)
(4158, 14)
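
In practice, each chunk is processed and only a small result is kept, so the full dataset never sits in memory. A minimal sketch on a hypothetical 25-row sample that accumulates a row count and a column sum chunk by chunk:

```python
from io import StringIO

import pandas as pd

# Hypothetical sample: 25 data rows.
csv_text = "id,hours\n" + "".join(f"{i},{i % 7}\n" for i in range(25))

total_hours = 0
row_count = 0
chunk_sizes = []

# With chunksize set, read_csv returns an iterator of dataframes.
for chunk in pd.read_csv(StringIO(csv_text), chunksize=10):
    chunk_sizes.append(len(chunk))
    row_count += len(chunk)
    total_hours += int(chunk["hours"].sum())

print(chunk_sizes, row_count, total_hours)
```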
