How to load data from a comma-separated file into a DataFrame?

The comma-separated values (.csv) format is loosely specified, lightweight, and ubiquitous. To look at the raw file we will use one shell command, 'cat' (short for "concatenate"), which simply prints the contents of a file. In IPython, prefixing a line with an exclamation mark runs the rest of that line as a shell command.

In [2]:
!cat resources/week-2/datasets/Admission_Predict.csv

400,333,117,4,5,4,9.66,1,0.95
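
The 'cat' command is not available in every environment (a default Windows shell, for example), so here is a minimal pure-Python sketch of the same peek. It assumes a small stand-in file built from rows displayed in this section, written to a temporary location so it runs anywhere; the helper name peek is hypothetical.

```python
import os
import tempfile
from itertools import islice

# Stand-in sample modelled on rows shown in this section (not the full dataset).
sample = (
    "Serial No.,GRE Score,TOEFL Score,University Rating,SOP,LOR ,CGPA,Research,Chance of Admit \n"
    "1,337,118,4,4.5,4.5,9.65,1,0.92\n"
    "2,324,107,4,4.0,4.5,8.87,1,0.76\n"
    "3,316,104,3,3.0,3.5,8.0,1,0.72\n"
)


def peek(path, n=3):
    """Return the first n lines of a text file without reading the rest."""
    with open(path, encoding="utf-8") as fh:
        return [line.rstrip("\n") for line in islice(fh, n)]


# Write the sample to a temporary file so the sketch is self-contained.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, encoding="utf-8") as tmp:
    tmp.write(sample)
    tmp_path = tmp.name

first_lines = peek(tmp_path, n=2)
print("\n".join(first_lines))

os.remove(tmp_path)  # clean up the temporary file
```

Reading only the first few lines keeps the peek cheap even when the file is large.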

In [3]:
import pandas as pd

# read_csv() turns csv into a dataframe
df = pd.read_csv('resources/week-2/datasets/Admission_Predict.csv')
df.head()

,Serial No.,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
0,1,337,118,4,4.5,4.5,9.65,1,0.92
1,2,324,107,4,4.0,4.5,8.87,1,0.76
2,3,316,104,3,3.0,3.5,8.0,1,0.72
3,4,322,110,3,3.5,2.5,8.67,1,0.8
4,5,314,103,2,2.0,3.0,8.21,0,0.65


In [4]:
# by default index starts with 0 while the serial number starts from 1
# we can set the serial number as the index by using index_col
df = pd.read_csv('resources/week-2/datasets/Admission_Predict.csv', index_col=0)
df.head()

Serial No.,GRE Score,TOEFL Score,University Rating,SOP,LOR,CGPA,Research,Chance of Admit
1,337,118,4,4.5,4.5,9.65,1,0.92
2,324,107,4,4.0,4.5,8.87,1,0.76
3,316,104,3,3.0,3.5,8.0,1,0.72
4,322,110,3,3.5,2.5,8.67,1,0.8
5,314,103,2,2.0,3.0,8.21,0,0.65
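
As a small sketch of the same idea, index_col also accepts a column name, which can read more clearly than a position. The example assumes an in-memory stand-in for the file (built from rows shown above) so that it runs without the dataset on disk.

```python
import io

import pandas as pd

# In-memory stand-in for the CSV file, using rows shown in this section.
csv_text = (
    "Serial No.,GRE Score,TOEFL Score,University Rating,SOP,LOR ,CGPA,Research,Chance of Admit \n"
    "1,337,118,4,4.5,4.5,9.65,1,0.92\n"
    "2,324,107,4,4.0,4.5,8.87,1,0.76\n"
    "3,316,104,3,3.0,3.5,8.0,1,0.72\n"
)

# Select the index column by position ...
by_position = pd.read_csv(io.StringIO(csv_text), index_col=0)

# ... or by its header name; both produce the same DataFrame here.
by_name = pd.read_csv(io.StringIO(csv_text), index_col="Serial No.")

print(by_name.index.name)           # Serial No.
print(by_position.equals(by_name))  # True
```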


Let's change the names of the SOP and LOR columns using the rename function. It takes a parameter called columns, to which we pass a dictionary whose keys are the old column names and whose values are the corresponding new names.

In [5]:
# old name: new name
new_df = df.rename(columns = {'GRE Score':'GRE Score', 'TOEFL Score':'TOEFL Score', 
                              'University Rating':'University Rating',
                              'SOP': 'Statement of Purpose', 'LOR': 'Letter of Recommendation',
                              'CGPA': 'CGPA', 'Research': 'Research',
                              'Chance of Admit': 'Chance of Admit'})
new_df.head()

Serial No.,GRE Score,TOEFL Score,University Rating,Statement of Purpose,LOR,CGPA,Research,Chance of Admit
1,337,118,4,4.5,4.5,9.65,1,0.92
2,324,107,4,4.0,4.5,8.87,1,0.76
3,316,104,3,3.0,3.5,8.0,1,0.72
4,322,110,3,3.5,2.5,8.67,1,0.8
5,314,103,2,2.0,3.0,8.21,0,0.65


In [6]:
# From the output, we can see that only the SOP has changed, but not LOR. So why is that? 
# First, we need to make sure we've got all the column names correct. 
# We can use the columns attribute of the DataFrame to list them (it is an attribute, so no parentheses).

new_df.columns

Index(['GRE Score', 'TOEFL Score', 'University Rating', 'Statement of Purpose',
       'LOR ', 'CGPA', 'Research', 'Chance of Admit '],
      dtype='object')

We see that there is actually a space right after 'LOR' and another right after 'Chance of Admit'. That is why our rename dictionary did not work for LOR: the key we used was the three characters 'LOR' rather than the four characters 'LOR ', and by default rename ignores keys that match no column. There are a couple of ways we could address this.
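
To make this kind of mismatch visible rather than silent, rename accepts errors='raise'. The following is a minimal sketch on a made-up two-column frame (df_demo is a hypothetical name, not part of the notebook).

```python
import pandas as pd

# Tiny stand-in frame; note the trailing space in 'LOR '.
df_demo = pd.DataFrame({"SOP": [4.5], "LOR ": [4.5]})

# Default behaviour: the unmatched key 'LOR' is ignored without any warning.
renamed = df_demo.rename(
    columns={"SOP": "Statement of Purpose", "LOR": "Letter of Recommendation"}
)
print(list(renamed.columns))  # ['Statement of Purpose', 'LOR ']

# With errors='raise', a key that matches no column raises a KeyError instead.
try:
    df_demo.rename(columns={"LOR": "Letter of Recommendation"}, errors="raise")
    raised = False
except KeyError as exc:
    raised = True
    print("KeyError:", exc)
```

Printing the labels with list(...) or repr(...) shows the quotes around each name, which is what exposes the trailing space.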

In [9]:
# One way is to include the trailing space in the old name (the dictionary key).
# Note that rename only needs the columns we want to change, not every column.
new_df = new_df.rename(columns = {'LOR ': 'Letter of Recommendation'})
new_df.head()

Serial No.,GRE Score,TOEFL Score,University Rating,Statement of Purpose,Letter of Recommendation,CGPA,Research,Chance of Admit
1,337,118,4,4.5,4.5,9.65,1,0.92
2,324,107,4,4.0,4.5,8.87,1,0.76
3,316,104,3,3.0,3.5,8.0,1,0.72
4,322,110,3,3.5,2.5,8.67,1,0.8
5,314,103,2,2.0,3.0,8.21,0,0.65


In [10]:
# Another way is to write a function that does the cleaning and have rename apply it to every label.
# Python strings come with a strip method that removes leading and trailing whitespace.
# We pass it as the mapper parameter, then use axis to say whether it applies to the columns or the index (row labels).

new_df = new_df.rename(mapper=str.strip, axis='columns')
new_df.head()

Serial No.,GRE Score,TOEFL Score,University Rating,Statement of Purpose,Letter of Recommendation,CGPA,Research,Chance of Admit
1,337,118,4,4.5,4.5,9.65,1,0.92
2,324,107,4,4.0,4.5,8.87,1,0.76
3,316,104,3,3.0,3.5,8.0,1,0.72
4,322,110,3,3.5,2.5,8.67,1,0.8
5,314,103,2,2.0,3.0,8.21,0,0.65


Now we've got it. Both SOP and LOR have been renamed, and 'Chance of Admit' has had its trailing space trimmed.
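
The function passed to rename does not have to be str.strip. As a hedged sketch, the example below shows that the columns keyword accepts a callable just like mapper with axis='columns', and then applies a small hypothetical helper (tidy) that also lower-cases names and replaces spaces with underscores. The frame df_demo is made up for illustration.

```python
import pandas as pd

# Stand-in frame with the same trailing-space problem as the dataset.
df_demo = pd.DataFrame({"SOP": [4.5], "LOR ": [4.5], "Chance of Admit ": [0.92]})

# These two calls are equivalent ways of applying a function to every column label.
via_mapper = df_demo.rename(mapper=str.strip, axis="columns")
via_keyword = df_demo.rename(columns=str.strip)
print(list(via_mapper.columns))  # ['SOP', 'LOR', 'Chance of Admit']


def tidy(name):
    """Hypothetical cleaner: trim whitespace, lower-case, use underscores."""
    return name.strip().lower().replace(" ", "_")


tidied = df_demo.rename(columns=tidy)
print(list(tidied.columns))  # ['sop', 'lor', 'chance_of_admit']
```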

In [12]:
# rename returns a new DataFrame, so df still has its original column names ('SOP', 'LOR ', 'Chance of Admit ').
df.columns

Index(['GRE Score', 'TOEFL Score', 'University Rating', 'SOP', 'LOR ', 'CGPA',
       'Research', 'Chance of Admit '],
      dtype='object')
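
The reason df is unchanged is that rename returns a new DataFrame rather than modifying the one it is called on. A minimal sketch on a made-up one-column frame (the names original, cleaned, and rebound are hypothetical):

```python
import pandas as pd

original = pd.DataFrame({"LOR ": [4.5, 3.5]})

# rename hands back a new object; 'original' keeps its old labels.
cleaned = original.rename(columns=str.strip)
print(list(original.columns))  # ['LOR ']
print(list(cleaned.columns))   # ['LOR']
print(cleaned is original)     # False

# To keep the change under the same variable name, assign the result back.
rebound = original
rebound = rebound.rename(columns=str.strip)
print(list(rebound.columns))   # ['LOR']
```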

We can also assign a list of column names to the df.columns attribute, which modifies the original DataFrame directly. This is efficient when there are many columns, even if we only want to change a few. It is also not affected by subtle errors in the column names, the problem we just encountered, because names are replaced by position rather than looked up. With a list, we can use an index to change a single value or a list comprehension to change all of them.

In [14]:
# let's change all of the column names to lower case. First we need to get our list

# making it into a list
cols = list(df.columns)  
cols = [x.lower().strip() for x in cols]  # list comprehension

# overwriting what is already in the .columns attribute
df.columns = cols
df.head()

Serial No.,gre score,toefl score,university rating,sop,lor,cgpa,research,chance of admit
1,337,118,4,4.5,4.5,9.65,1,0.92
2,324,107,4,4.0,4.5,8.87,1,0.76
3,316,104,3,3.0,3.5,8.0,1,0.72
4,322,110,3,3.5,2.5,8.67,1,0.8
5,314,103,2,2.0,3.0,8.21,0,0.65


In [15]:
# changing it into upper case
cols = list(df.columns)
cols = [x.upper().strip() for x in cols]
df.columns = cols
df.head()

Serial No.,GRE SCORE,TOEFL SCORE,UNIVERSITY RATING,SOP,LOR,CGPA,RESEARCH,CHANCE OF ADMIT
1,337,118,4,4.5,4.5,9.65,1,0.92
2,324,107,4,4.0,4.5,8.87,1,0.76
3,316,104,3,3.0,3.5,8.0,1,0.72
4,322,110,3,3.5,2.5,8.67,1,0.8
5,314,103,2,2.0,3.0,8.21,0,0.65
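
The list-index approach mentioned earlier, changing a single name by its position, can be sketched in the same way. This example uses a made-up three-column frame (df_demo is a hypothetical name); the position 2 is specific to that frame.

```python
import pandas as pd

df_demo = pd.DataFrame({"GRE Score": [337], "SOP": [4.5], "LOR ": [4.5]})

# Copy the labels into a plain list so individual entries can be replaced.
cols = list(df_demo.columns)

# Position 2 holds 'LOR ' in this frame; the trailing space no longer matters
# because the label is replaced by position rather than matched by name.
cols[2] = "Letter of Recommendation"

# Assigning the list back modifies df_demo in place.
df_demo.columns = cols
print(list(df_demo.columns))  # ['GRE Score', 'SOP', 'Letter of Recommendation']
```

The assigned list must have exactly one entry per column; a list of a different length raises a ValueError.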
