# Dataframe Indexing And Loading

In this section, we'll largely be using small or moderate-sized datasets. A common workflow is to read the dataset, usually from some external file, then begin to clean and manipulate it for analysis. In this lecture we are going to demonstrate how you can load data from a comma-separated file into a DataFrame.

## Opening CSV File Using iPython Kernel Settings

Jupyter notebooks use IPython as the underlying kernel, which provides convenient ways to run lower-level shell commands, that is, programs run in the underlying operating system. If you're not familiar with the shell, don't worry too much about this, but if you are, this is super handy for integrating shell tools into your data science workflows. I want to use one shell command here called "cat", for "concatenate", which just outputs the contents of a file. In IPython, if we prepend a line with an exclamation mark, it will execute the remainder of the line as a shell command. So let's look at the content of a CSV file.

In [1]:
!cat Admission_Predict.csv

'cat' is not recognized as an internal or external command,
operable program or batch file.


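The command did not execute here: "cat" is a Unix shell command, and the error message shows that this notebook is running on Windows, where the command shell does not provide it (the equivalent Windows command is "type"). On a system where "cat" is available, the output would have been the contents of the CSV file.

In that output we would see a list of columns, with the column identifiers listed as strings on the first line of the file. Then come the rows of data, with all columns separated by commas.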
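Because the shell command depends on the operating system, it can help to peek at the file from Python itself. The following is only a minimal sketch under stated assumptions: it writes a tiny, hypothetical sample file (sample_admissions.csv, with a few values copied from the dataset) into a temporary folder so that it runs anywhere, then prints the first lines. In the notebook you would point csv_path at Admission_Predict.csv instead.

```python
from itertools import islice
from pathlib import Path
import tempfile

# Hypothetical sample file so the sketch runs anywhere; in the notebook you
# would point csv_path at Admission_Predict.csv instead.
csv_path = Path(tempfile.mkdtemp()) / "sample_admissions.csv"
csv_path.write_text(
    "Serial No.,GRE Score,TOEFL Score\n"
    "1,337,118\n"
    "2,324,107\n"
    "3,316,104\n",
    encoding="utf-8",
)

# Read only the first few lines instead of loading the whole file.
with open(csv_path, encoding="utf-8", newline="") as f:
    first_lines = [line.rstrip("\r\n") for line in islice(f, 3)]

for line in first_lines:
    print(line)
```

This works the same way on Windows, macOS and Linux, since it does not rely on any shell command being installed.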

Now we are going to do one thing and one thing only: open the CSV file as a DataFrame. Let's see how we can do it. The general syntax is:

###### < Variable >=pd.read_csv("Name and location of File")

Let's look at an example to understand what we are doing.

In [2]:
import pandas as pd
csvDataFrame=pd.read_csv("Admission_Predict.csv")
csvDataFrame

Unnamed: 0,Serial No.,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
0,1,337,118,4,4.5,4.5,9.65,1,0.92
1,2,324,107,4,4.0,4.5,8.87,1,0.76
2,3,316,104,3,3.0,3.5,8.00,1,0.72
3,4,322,110,3,3.5,2.5,8.67,1,0.80
4,5,314,103,2,2.0,3.0,8.21,0,0.65
...,...,...,...,...,...,...,...,...,...
395,396,324,110,3,3.5,3.5,9.04,1,0.82
396,397,325,107,3,3.0,3.5,9.11,1,0.84
397,398,330,116,4,5.0,4.5,9.45,1,0.91
398,399,312,103,3,3.5,4.0,8.78,0,0.67


In [3]:
print(type(pd.read_csv("Admission_Predict.csv")))

<class 'pandas.core.frame.DataFrame'>


As the output shows, read_csv returns a DataFrame.

In [4]:
# Let's look at just the first 5 rows to get a better sense of the data.
csvDataFrame.head(5)

Unnamed: 0,Serial No.,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
0,1,337,118,4,4.5,4.5,9.65,1,0.92
1,2,324,107,4,4.0,4.5,8.87,1,0.76
2,3,316,104,3,3.0,3.5,8.0,1,0.72
3,4,322,110,3,3.5,2.5,8.67,1,0.8
4,5,314,103,2,2.0,3.0,8.21,0,0.65
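The default index that pandas creates can be seen in isolation on a tiny dataset. The sketch below is only an illustration, assuming a few rows copied from the output above and held in an in-memory string (wrapped in io.StringIO) instead of the real file; the names csv_text and sample_df are made up for this example.

```python
import io

import pandas as pd

# A few rows copied from the output above, kept in memory for illustration.
csv_text = """Serial No.,GRE Score,TOEFL Score
1,337,118
2,324,107
3,316,104
"""

# read_csv accepts a file-like object as well as a file name.
sample_df = pd.read_csv(io.StringIO(csv_text))

print(type(sample_df))
print(sample_df.shape)
# With no index_col given, pandas numbers the rows itself, starting at 0.
print(list(sample_df.index))
```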


## Manipulation Of CSV DataFrame

###### Case 1: Manipulating Index Of Dataframe
We notice that by default the index starts at 0, while the students' serial numbers start at 1. Comparing the index with the Serial No. column in the output above, you can deduce that pandas has created a new index. Instead, we can set the serial number as the index by using the index_col parameter.

Let's see how we do it.

In [5]:
# index_col=0 uses the first column as the index (column positions start at 0)
csvDataFrame1 = pd.read_csv("Admission_Predict.csv", index_col=0)
csvDataFrame1.head(5)

Serial No.,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
1,337,118,4,4.5,4.5,9.65,1,0.92
2,324,107,4,4.0,4.5,8.87,1,0.76
3,316,104,3,3.0,3.5,8.0,1,0.72
4,322,110,3,3.5,2.5,8.67,1,0.8
5,314,103,2,2.0,3.0,8.21,0,0.65
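The index column can also be chosen by name, or set after loading. The following is a minimal sketch, assuming a small in-memory sample (a few rows copied from the output above) in place of the real file; the variable names are only for illustration.

```python
import io

import pandas as pd

# A few rows copied from the output above, kept in memory for illustration.
csv_text = """Serial No.,GRE Score,TOEFL Score
1,337,118
2,324,107
3,316,104
"""

# Three ways to end up with "Serial No." as the index:
by_position = pd.read_csv(io.StringIO(csv_text), index_col=0)
by_name = pd.read_csv(io.StringIO(csv_text), index_col="Serial No.")
after_load = pd.read_csv(io.StringIO(csv_text)).set_index("Serial No.")

print(by_position.index.name)
print(list(after_load.index))
# Rows can now be looked up by serial number.
print(by_name.loc[2, "GRE Score"])
```

Selecting the column by name tends to be more robust than selecting it by position if the column order of the file ever changes.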


###### Case 2: Manipulating Names Of Column

Notice that we have two columns, "SOP" and "LOR", and probably not everyone knows what they mean. So let's change our column names to make them clearer. In pandas, we can use the rename() function. It takes a parameter called columns, to which we pass a dictionary in which the keys are the old column names and the values are the corresponding new column names.

The general syntax is as follows:

###### < Variable >=< Name Of DataFrame to be edited >.rename(columns={Dictionary})

Note that rename returns a new, edited DataFrame, which is what gets saved in the variable. The original DataFrame remains unaffected.

Note that the rename method can also be used to edit row labels. We just need to pass the index parameter instead of columns.

Let's look at an example to understand.

In [6]:
csvDataFrame2 = csvDataFrame1.rename(columns={"LOR": "Letter Of Recommendation", "SOP": "Statement Of Purpose"})
csvDataFrame2

Serial No.,GRE Score,TOEFL Score,University Rating,Statement Of Purpose,LOR,CGPA,Research,Chance of Admit
1,337,118,4,4.5,4.5,9.65,1,0.92
2,324,107,4,4.0,4.5,8.87,1,0.76
3,316,104,3,3.0,3.5,8.00,1,0.72
4,322,110,3,3.5,2.5,8.67,1,0.80
5,314,103,2,2.0,3.0,8.21,0,0.65
...,...,...,...,...,...,...,...,...
396,324,110,3,3.5,3.5,9.04,1,0.82
397,325,107,3,3.0,3.5,9.11,1,0.84
398,330,116,4,5.0,4.5,9.45,1,0.91
399,312,103,3,3.5,4.0,8.78,0,0.67


In [7]:
csvDataFrame1

Serial No.,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
1,337,118,4,4.5,4.5,9.65,1,0.92
2,324,107,4,4.0,4.5,8.87,1,0.76
3,316,104,3,3.0,3.5,8.00,1,0.72
4,322,110,3,3.5,2.5,8.67,1,0.80
5,314,103,2,2.0,3.0,8.21,0,0.65
...,...,...,...,...,...,...,...,...
396,324,110,3,3.5,3.5,9.04,1,0.82
397,325,107,3,3.0,3.5,9.11,1,0.84
398,330,116,4,5.0,4.5,9.45,1,0.91
399,312,103,3,3.5,4.0,8.78,0,0.67


As observed, a new DataFrame is returned and the original DataFrame remains unaffected.
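Both points, that rename returns a new DataFrame and that it can also relabel rows, can be sketched on a tiny example. This is only an illustration under stated assumptions: the two rows are copied from the output above, and the row labels student_1 and student_2 are hypothetical names invented for the sketch.

```python
import pandas as pd

# Two rows copied from the output above, indexed by serial number.
df = pd.DataFrame(
    {"SOP": [4.5, 4.0], "CGPA": [9.65, 8.87]},
    index=pd.Index([1, 2], name="Serial No."),
)

renamed = df.rename(
    columns={"SOP": "Statement Of Purpose"},
    index={1: "student_1", 2: "student_2"},  # hypothetical row labels
)

print(list(renamed.columns))
print(list(renamed.index))
# The original DataFrame still has its old labels.
print(list(df.columns), list(df.index))
```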

Let's draw a few more observations. We wanted to rename two columns, LOR and SOP, and we believe we entered the correct names, yet only SOP was renamed while LOR remained the same. This can only happen if the name we entered does not exactly match the actual column name. So first let's look at the exact names of the columns.

This can be done via:

###### < Name Of Dataframe>.columns 

This returns an Index object that contains the column names.

In [8]:
csvDataFrame2.columns

Index(['GRE Score', 'TOEFL Score', 'University Rating', 'Statement Of Purpose',
       'LOR ', 'CGPA', 'Research', 'Chance of Admit '],
      dtype='object')

As the output shows, the name was not just "LOR" but "LOR " with a trailing space, hence no change was made to that column. Let's now correct our mistake.

In [9]:
csvDataFrame2=csvDataFrame1.rename(columns={"LOR ":"Letter Of Recommendation","SOP":"Statement Of Purpose"})

In [10]:
csvDataFrame2

Serial No.,GRE Score,TOEFL Score,University Rating,Statement Of Purpose,Letter Of Recommendation,CGPA,Research,Chance of Admit
1,337,118,4,4.5,4.5,9.65,1,0.92
2,324,107,4,4.0,4.5,8.87,1,0.76
3,316,104,3,3.0,3.5,8.00,1,0.72
4,322,110,3,3.5,2.5,8.67,1,0.80
5,314,103,2,2.0,3.0,8.21,0,0.65
...,...,...,...,...,...,...,...,...
396,324,110,3,3.5,3.5,9.04,1,0.82
397,325,107,3,3.0,3.5,9.11,1,0.84
398,330,116,4,5.0,4.5,9.45,1,0.91
399,312,103,3,3.5,4.0,8.78,0,0.67


This time the LOR column has been renamed as intended.
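The silent mismatch we ran into can be made visible in two ways. The sketch below is a minimal illustration, assuming an empty DataFrame that only carries three of the column names from the .columns output above: repr() shows stray whitespace explicitly, and the errors="raise" option of rename reports keys that do not exist instead of skipping them.

```python
import pandas as pd

# Column names copied from the .columns output above, trailing spaces included.
df = pd.DataFrame(columns=["SOP", "LOR ", "Chance of Admit "])

# repr() puts quotes around each name, so stray whitespace becomes visible.
for name in df.columns:
    print(repr(name))

# By default, rename skips keys it cannot find; errors="raise" reports them.
try:
    df.rename(columns={"LOR": "Letter Of Recommendation"}, errors="raise")
    missing_key_detected = False
except KeyError as exc:
    missing_key_detected = True
    print("rename failed:", exc)

# Without errors="raise", the same call leaves the columns untouched.
silently_unchanged = df.rename(columns={"LOR": "Letter Of Recommendation"})
print(list(silently_unchanged.columns))
```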

So that works well, but it's a bit fragile. What if that was a tab instead of a space? Or two spaces? In that case, we would again have been unable to rename the column. Another way is to create some function that does the cleaning and then tell rename to apply that function to every label along one axis. Python comes with a handy string function to strip white space called "strip()". When we pass this in to rename we pass the function as the mapper parameter, and then indicate whether the axis should be columns or index (row labels). The general syntax is as follows.

###### < Variable >=< Dataframe To Be Edited >.rename(mapper=str.strip,axis="index/columns")

Let's see an example to fully understand.

In [11]:
csvDataFrame1=csvDataFrame1.rename(mapper=str.strip,axis="columns")
csvDataFrame1

Serial No.,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
1,337,118,4,4.5,4.5,9.65,1,0.92
2,324,107,4,4.0,4.5,8.87,1,0.76
3,316,104,3,3.0,3.5,8.00,1,0.72
4,322,110,3,3.5,2.5,8.67,1,0.80
5,314,103,2,2.0,3.0,8.21,0,0.65
...,...,...,...,...,...,...,...,...
396,324,110,3,3.5,3.5,9.04,1,0.82
397,325,107,3,3.0,3.5,9.11,1,0.84
398,330,116,4,5.0,4.5,9.45,1,0.91
399,312,103,3,3.5,4.0,8.78,0,0.67


In [12]:
# Let's now look at the columns to see the difference
csvDataFrame1.columns

Index(['GRE Score', 'TOEFL Score', 'University Rating', 'SOP', 'LOR', 'CGPA',
       'Research', 'Chance of Admit'],
      dtype='object')

As the output shows, all extra spaces have been stripped out. Now let's rename the columns again.

In [13]:
csvDataFrame2 = csvDataFrame1.rename(columns={"LOR": "Letter Of Recommendation", "SOP": "Statement Of Purpose"})
csvDataFrame2

Serial No.,GRE Score,TOEFL Score,University Rating,Statement Of Purpose,Letter Of Recommendation,CGPA,Research,Chance of Admit
1,337,118,4,4.5,4.5,9.65,1,0.92
2,324,107,4,4.0,4.5,8.87,1,0.76
3,316,104,3,3.0,3.5,8.00,1,0.72
4,322,110,3,3.5,2.5,8.67,1,0.80
5,314,103,2,2.0,3.0,8.21,0,0.65
...,...,...,...,...,...,...,...,...
396,324,110,3,3.5,3.5,9.04,1,0.82
397,325,107,3,3.0,3.5,9.11,1,0.84
398,330,116,4,5.0,4.5,9.45,1,0.91
399,312,103,3,3.5,4.0,8.78,0,0.67


This time the code worked successfully.
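Stripping with str.strip also covers the tab and double-space cases raised earlier. Here is a minimal sketch, assuming an empty DataFrame with deliberately messy, hypothetical headers; it also shows the vectorised .str.strip() on the column Index as one possible alternative to passing a mapper.

```python
import pandas as pd

# Hypothetical messy headers: trailing tab, double space, leading space.
df = pd.DataFrame(columns=["LOR\t", "SOP  ", " CGPA"])

# Same call as in the lecture: apply str.strip to every column label.
via_mapper = df.rename(mapper=str.strip, axis="columns")

# Alternative sketch using the string methods of the column Index.
via_str_accessor = df.copy()
via_str_accessor.columns = df.columns.str.strip()

print(list(via_mapper.columns))
print(list(via_str_accessor.columns))
```

Note that strip only removes whitespace at the start and end of a name; spaces inside a name such as "GRE Score" are left alone.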

In [14]:
# We can also use the df.columns attribute by assigning to it a list of column names which will directly
# rename the columns. This will directly modify the original dataframe and is very efficient especially when
# you have a lot of columns and you only want to change a few. This technique is also not affected by subtle
# errors in the column names, a problem that we just encountered. With a list, you can use the list index to
# change a certain value or use list comprehension to change all of the values

# As an example, let's change all of the column names to lower case. First we need to get our list
cols = list(csvDataFrame1.columns)
# Then a little list comprehension
cols = [x.lower().strip() for x in cols]
# Then we just overwrite what is already in the .columns attribute
csvDataFrame1.columns=cols
# And take a look at our results
csvDataFrame1.head()

Serial No.,gre score,toefl score,university rating,sop,lor,cgpa,research,chance of admit
1,337,118,4,4.5,4.5,9.65,1,0.92
2,324,107,4,4.0,4.5,8.87,1,0.76
3,316,104,3,3.0,3.5,8.0,1,0.72
4,322,110,3,3.5,2.5,8.67,1,0.8
5,314,103,2,2.0,3.0,8.21,0,0.65


The whole process can also be written in a single line:

In [16]:
csvDataFrame1.columns=[x.strip().lower() for x in csvDataFrame1.columns]
csvDataFrame1

Serial No.,gre score,toefl score,university rating,sop,lor,cgpa,research,chance of admit
1,337,118,4,4.5,4.5,9.65,1,0.92
2,324,107,4,4.0,4.5,8.87,1,0.76
3,316,104,3,3.0,3.5,8.00,1,0.72
4,322,110,3,3.5,2.5,8.67,1,0.80
5,314,103,2,2.0,3.0,8.21,0,0.65
...,...,...,...,...,...,...,...,...
396,324,110,3,3.5,3.5,9.04,1,0.82
397,325,107,3,3.0,3.5,9.11,1,0.84
398,330,116,4,5.0,4.5,9.45,1,0.91
399,312,103,3,3.5,4.0,8.78,0,0.67
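The comments above also mention using the list index to change a single name, which we have not shown yet. The following is only a minimal sketch, assuming an empty DataFrame that carries four of the lower-cased column names; the new name "statement of purpose" is chosen just for illustration.

```python
import pandas as pd

# Four of the lower-cased column names from the output above.
df = pd.DataFrame(columns=["gre score", "toefl score", "sop", "lor"])

cols = list(df.columns)
# Look up the position by name so the sketch does not rely on a hard-coded number.
cols[cols.index("sop")] = "statement of purpose"
# Assigning to .columns modifies df itself, unlike rename, which returns a new DataFrame.
df.columns = cols

print(list(df.columns))
```

The list assigned to .columns must contain exactly one name per column, otherwise pandas raises an error.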


In this lecture, you've learned how to import a CSV file into a pandas DataFrame object, and how to do some basic data cleaning to the column names. The CSV file import mechanisms in pandas have lots of different options, and you really need to learn these in order to be proficient at data manipulation. Once you have set up the format and shape of a DataFrame, you have a solid start to further actions such as conducting data analysis and modeling.

Now, there are other data sources you can load directly into dataframes as well, including HTML web pages, databases, and other file formats. But the CSV is by far the most common data format you'll run into, and an important one to know how to manipulate in pandas.
