In [1]:
import numpy as np
from numpy import random
import pandas as pd
from sklearn import datasets

### Series

A Pandas *DataFrame* object can be thought of as multiple Pandas *Series* objects that have a shared index.  So, we start our study of DataFrames with a quick study of Series.

A *Series* can be thought of as an array that is equipped with an index.

##### Example 1

In [2]:
S = pd.Series([50,40,30,20,10])
S

0    50
1    40
2    30
3    20
4    10
dtype: int64

Since we did not specify an index, a default one was created for our series S.  The index consists of the values 0, 1, 2, 3, 4.

We can index the elements in our series using their index value.

In [3]:
S[1]

40

In this regard, a series is a lot like a dictionary or a mapping.

Thinking of S as a list of scores in a gradebook, we can reindex our series with some names by updating the *index* attribute.

In [4]:
S.index = ['Adam', 'Bob', 'Carole', 'Dani', 'Ella']

In [5]:
S

Adam      50
Bob       40
Carole    30
Dani      20
Ella      10
dtype: int64

With the *values* attribute, we get an array of the scores.

In [6]:
S.values

array([50, 40, 30, 20, 10], dtype=int64)

We can easily add 10 points to everyone's score.

In [7]:
S = S+10
S

Adam      60
Bob       50
Carole    40
Dani      30
Ella      20
dtype: int64

We can also select all rows with scores above 30.

In [8]:
S[S>30]

Adam      60
Bob       50
Carole    40
dtype: int64

Suppose we have scores for another test, with the names (index) in a different order...

In [9]:
T = pd.Series([40,30,20,10,0], index = ['Bob', 'Ella', 'Carole', 'Adam', 'Dani'])
T

Bob       40
Ella      30
Carole    20
Adam      10
Dani       0
dtype: int64

... And we want to get a total for the scores.  No need to reorder things.  The index takes care of it.

In [10]:
Total = S+T
Total

Adam      70
Bob       90
Carole    60
Dani      30
Ella      50
dtype: int64

$\Box$
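
The label alignment used in `S+T` also applies when the two indexes do not contain exactly the same labels. The following is a minimal sketch with two small made-up Series (the names `first` and `second` are chosen only for illustration): a label found in only one of the Series produces NaN in the sum.

```python
import pandas as pd

# Two hypothetical score Series whose indexes only partly overlap
first = pd.Series([60, 50], index=['Adam', 'Bob'])
second = pd.Series([40, 30], index=['Bob', 'Ella'])

# Addition aligns on the index labels, not on position
total = first + second
print(total)

# Only 'Bob' appears in both Series, so only 'Bob' gets a numeric total
print(total['Bob'])  # 90.0
```

The result is a float Series because NaN cannot be stored in an integer column.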

### DataFrame

#### Creating DataFrames

There are many ways to create a DataFrame.  Aside from loading data from a file (like a .csv file) into a DataFrame, there are two main methods:

+ Using a dictionary of equal-length lists or NumPy arrays
+ Using a NumPy array and defining an index and column names

##### Example 2

In [11]:
player_stats_array = np.random.normal(size = (5,3))
player_stats_array = 10*player_stats_array+70
player_stats_array

array([[69.40112698, 83.45008473, 65.59059255],
       [83.74212467, 64.06124131, 71.0839083 ],
       [64.84775288, 66.47601022, 70.2915463 ],
       [82.95348388, 69.82434606, 72.61430299],
       [57.99074223, 81.41162092, 60.12029386]])

In [12]:
player_stats = pd.DataFrame(player_stats_array, index = ['LeBron', 'Larry', 'Michael', 'Yannis', 'Steph'], columns = ['free_throws', 'shooting', '3_ball'])
player_stats

Unnamed: 0,free_throws,shooting,3_ball
LeBron,69.401127,83.450085,65.590593
Larry,83.742125,64.061241,71.083908
Michael,64.847753,66.47601,70.291546
Yannis,82.953484,69.824346,72.614303
Steph,57.990742,81.411621,60.120294


We can get the shape of our DataFrame using the *shape* attribute.

In [13]:
player_stats.shape

(5, 3)

$\Box$
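
Because `np.random.normal` draws new values each time its cell is run, the numbers in a DataFrame built this way change from run to run. One possible way to make them repeatable is sketched below, assuming NumPy's `default_rng` generator is available (NumPy 1.17 or later). The seed value `12345` and the name `seeded_stats` are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

# A fixed seed (12345 is an arbitrary choice) makes the random values repeatable
rng = np.random.default_rng(12345)
stats_array = 10 * rng.normal(size=(5, 3)) + 70

players = ['LeBron', 'Larry', 'Michael', 'Yannis', 'Steph']
categories = ['free_throws', 'shooting', '3_ball']

# Same construction as before: a NumPy array plus an index and column names
seeded_stats = pd.DataFrame(stats_array, index=players, columns=categories)
print(seeded_stats)
print(seeded_stats.shape)  # (5, 3)
```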

##### Example 3

In [14]:
gradebook_dict = {'Name': ['Adam', 'Bob', 'Carole', 'Dani', 'Ella'], 'HW': [30, 40, 20, 30, 40], 'Quiz1': [10, 5, 20, 15, 20], 'Exam1': [95.2, 56.5, 75.8, 62.0, 99.7]}

gradebook = pd.DataFrame(gradebook_dict)

In [15]:
gradebook

Unnamed: 0,Name,HW,Quiz1,Exam1
0,Adam,30,10,95.2
1,Bob,40,5,56.5
2,Carole,20,20,75.8
3,Dani,30,15,62.0
4,Ella,40,20,99.7


In [16]:
gradebook.shape

(5, 4)

$\Box$

#### Selecting Columns

We can select a single column from our gradebook DataFrame.  A Series is returned.

##### Example 4

In [17]:
gradebook['Exam1']

0    95.2
1    56.5
2    75.8
3    62.0
4    99.7
Name: Exam1, dtype: float64

In [18]:
gradebook.Exam1

0    95.2
1    56.5
2    75.8
3    62.0
4    99.7
Name: Exam1, dtype: float64

In [19]:
type(gradebook['Exam1'])

pandas.core.series.Series

$\Box$

##### Example 5

We can select multiple columns by passing a list of column names.

In [20]:
gradebook[['Quiz1', 'Exam1']]

Unnamed: 0,Quiz1,Exam1
0,10,95.2
1,5,56.5
2,20,75.8
3,15,62.0
4,20,99.7


$\Box$
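
The difference between single and double brackets is easy to miss. As a small sketch on a made-up two-row DataFrame (`grades` is a stand-in, not the gradebook above): a column name returns a Series, while a list of column names returns a DataFrame, even when the list has only one entry.

```python
import pandas as pd

# Hypothetical two-row gradebook used only for this illustration
grades = pd.DataFrame({'Name': ['Adam', 'Bob'], 'Exam1': [95.2, 56.5]})

as_series = grades['Exam1']      # a single column name gives a Series
as_frame = grades[['Exam1']]     # a list of column names gives a DataFrame

print(type(as_series).__name__)  # Series
print(type(as_frame).__name__)   # DataFrame
print(as_frame.shape)            # (2, 1)
```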

### Selecting Pieces of a DataFrame

#### Conditional Selection

Using a condition that evaluates to True or False, we can select pieces of a DataFrame.

##### Example 6

Suppose we want to select all rows with a passing score on Exam 1.

In [21]:
gradebook

Unnamed: 0,Name,HW,Quiz1,Exam1
0,Adam,30,10,95.2
1,Bob,40,5,56.5
2,Carole,20,20,75.8
3,Dani,30,15,62.0
4,Ella,40,20,99.7


In [5]:
gradebook[gradebook['Exam1']>=70.0]

Unnamed: 0,Name,HW,Quiz1,Exam1
0,Adam,30,10,95.2
2,Carole,20,20,75.8
4,Ella,40,20,99.7


This is a new DataFrame.  So, if we just want the names of these students who passed Exam 1, we can select them similarly to what we did in Example 4. 

In [14]:
gradebook[gradebook['Exam1']>=70.0]['Name']

0      Adam
2    Carole
4      Ella
Name: Name, dtype: object

What if we want all rows with a passing score on Exam 1 and a score of 20 on Quiz 1?

In this case we need to be a little careful.

We need parentheses around each logical condition, because & binds more tightly than comparison operators such as >= and ==, and we need to use the bitwise and operator & rather than the keyword *and*.

In [11]:
gradebook[(gradebook['Exam1']>=70.0) & (gradebook['Quiz1']==20)]

Unnamed: 0,Name,HW,Quiz1,Exam1
2,Carole,20,20,75.8
4,Ella,40,20,99.7


The *or* case works similarly.  The bitwise or operator is |.

In [15]:
gradebook[(gradebook['Exam1']>=70.0) | (gradebook['Quiz1']==20)]

Unnamed: 0,Name,HW,Quiz1,Exam1
0,Adam,30,10,95.2
2,Carole,20,20,75.8
4,Ella,40,20,99.7


Using the *isin* method, we can select all rows for which it is true that a value in the selected column is in a given list.

In [22]:
gradebook

Unnamed: 0,Name,HW,Quiz1,Exam1
0,Adam,30,10,95.2
1,Bob,40,5,56.5
2,Carole,20,20,75.8
3,Dani,30,15,62.0
4,Ella,40,20,99.7


In [43]:
gradebook[gradebook['Quiz1'].isin([10,20])]

Unnamed: 0,Name,HW,Quiz1,Exam1
0,Adam,30,10,95.2
2,Carole,20,20,75.8
4,Ella,40,20,99.7


$\Box$
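
Conditions can also be stored in variables and negated. The sketch below rebuilds part of the gradebook under the stand-in name `grades`, names two boolean masks (`passed` and `top_quiz` are illustrative names), and uses the `~` operator to invert one of them.

```python
import pandas as pd

# Stand-in copy of part of the gradebook
grades = pd.DataFrame({
    'Name': ['Adam', 'Bob', 'Carole', 'Dani', 'Ella'],
    'Quiz1': [10, 5, 20, 15, 20],
    'Exam1': [95.2, 56.5, 75.8, 62.0, 99.7],
})

# Each condition is a boolean Series with one entry per row
passed = grades['Exam1'] >= 70.0
top_quiz = grades['Quiz1'] == 20

# ~ negates a mask: students who did not pass Exam 1
not_passed_names = list(grades[~passed]['Name'])
print(not_passed_names)  # ['Bob', 'Dani']

# Named masks combine with & and | without extra parentheses
both_names = list(grades[passed & top_quiz]['Name'])
print(both_names)  # ['Carole', 'Ella']
```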

#### loc and iloc

The *loc* and *iloc* indexers allow rows and columns to be selected using syntax similar to that of NumPy.

iloc allows for indexing using integer positions.

##### Example 7

In [23]:
#Recall our gradebook
gradebook

Unnamed: 0,Name,HW,Quiz1,Exam1
0,Adam,30,10,95.2
1,Bob,40,5,56.5
2,Carole,20,20,75.8
3,Dani,30,15,62.0
4,Ella,40,20,99.7


In [24]:
gradebook.iloc[1:4, 2:4]

Unnamed: 0,Quiz1,Exam1
1,5,56.5
2,20,75.8
3,15,62.0


In [25]:
gradebook.iloc[:, 1:]

Unnamed: 0,HW,Quiz1,Exam1
0,30,10,95.2
1,40,5,56.5
2,20,20,75.8
3,30,15,62.0
4,40,20,99.7


We can index with lists of integers.

In [26]:
gradebook.iloc[[1,3,4], [0,3]]

Unnamed: 0,Name,Exam1
1,Bob,56.5
3,Dani,62.0
4,Ella,99.7


These indexing methods can be mixed.

In [27]:
gradebook.iloc[[1,3], 1:3]

Unnamed: 0,HW,Quiz1
1,40,5
3,30,15


$\Box$

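loc allows index labels and column names to be used for selection.  The values shown below differ from those in Example 2 because np.random.normal draws new values each time its cell is run.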

##### Example 8

In [29]:
#Recall player_stats
player_stats

Unnamed: 0,free_throws,shooting,3_ball
LeBron,69.307742,58.958324,65.230476
Larry,72.480016,54.496955,76.881494
Michael,74.143069,75.365891,78.601131
Yannis,87.861423,53.475637,66.97812
Steph,79.64442,70.276917,62.552278


In [34]:
player_stats.loc['Larry']

free_throws    72.480016
shooting       54.496955
3_ball         76.881494
Name: Larry, dtype: float64

In [36]:
player_stats.loc[['Larry','Steph']]

Unnamed: 0,free_throws,shooting,3_ball
Larry,72.480016,54.496955,76.881494
Steph,79.64442,70.276917,62.552278


In [38]:
player_stats.loc[:, 'free_throws']

LeBron     69.307742
Larry      72.480016
Michael    74.143069
Yannis     87.861423
Steph      79.644420
Name: free_throws, dtype: float64

In [39]:
player_stats.loc[:, ['free_throws', '3_ball']]

Unnamed: 0,free_throws,3_ball
LeBron,69.307742,65.230476
Larry,72.480016,76.881494
Michael,74.143069,78.601131
Yannis,87.861423,66.97812
Steph,79.64442,62.552278


$\Box$
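
One difference between the two indexers is worth a quick illustration: a label slice with loc includes its end label, while a position slice with iloc stops before its end position, as in NumPy. The sketch below uses a stand-in DataFrame `stats` with made-up whole-number values.

```python
import pandas as pd

# Made-up values; only the labels matter for this illustration
stats = pd.DataFrame(
    {'free_throws': [70, 72, 74, 88, 80], 'shooting': [59, 54, 75, 53, 70]},
    index=['LeBron', 'Larry', 'Michael', 'Yannis', 'Steph'],
)

by_label = stats.loc['Larry':'Yannis']   # includes the 'Yannis' row
by_position = stats.iloc[1:3]            # stops before position 3

print(list(by_label.index))     # ['Larry', 'Michael', 'Yannis']
print(list(by_position.index))  # ['Larry', 'Michael']
```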

The *head* and *tail* methods are equivalent to using loc or iloc to access the beginning or end of a DataFrame.  

Since head and tail access the first or last five rows by default, we need a dataset bigger than the ones defined above.  Let's load our old friend, the Iris Dataset.

##### Example 9

In [28]:
#Load Iris Dataset
iris = pd.DataFrame(datasets.load_iris().data, columns = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width'])

In [36]:
iris.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width
0,5.1,3.5,1.4,0.2
1,4.9,3.0,1.4,0.2
2,4.7,3.2,1.3,0.2
3,4.6,3.1,1.5,0.2
4,5.0,3.6,1.4,0.2


In [30]:
iris.tail()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width
145,6.7,3.0,5.2,2.3
146,6.3,2.5,5.0,1.9
147,6.5,3.0,5.2,2.0
148,6.2,3.4,5.4,2.3
149,5.9,3.0,5.1,1.8


$\Box$
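
To make the connection with iloc concrete, here is a small sketch on a stand-in DataFrame of ten rows (not the Iris data). Both methods accept the number of rows as an optional argument.

```python
import numpy as np
import pandas as pd

# A small stand-in DataFrame with 10 rows
df = pd.DataFrame({'x': np.arange(10), 'y': np.arange(10) ** 2})

# head(n) matches the first n rows, tail(n) the last n rows
head_matches = df.head(3).equals(df.iloc[:3])
tail_matches = df.tail(3).equals(df.iloc[-3:])
print(head_matches, tail_matches)  # True True

# With no argument, five rows are returned
print(len(df.head()))  # 5
```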

### Sorting

We can sort the rows of a DataFrame by the values in one or more columns using the *sort_values* method.

##### Example 10

Suppose that we want to sort the Iris Dataset by the values in the sepal_length column.

In [17]:
iris.sort_values(by = ['sepal_length']).head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width
13,4.3,3.0,1.1,0.1
42,4.4,3.2,1.3,0.2
38,4.4,3.0,1.3,0.2
8,4.4,2.9,1.4,0.2
41,4.5,2.3,1.3,0.3


By default, sort_values sorts in ascending order.  To sort in descending order, pass the argument *ascending = False*.

In [39]:
iris.sort_values(by = ['sepal_length'], ascending = False).head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width
131,7.9,3.8,6.4,2.0
135,7.7,3.0,6.1,2.3
122,7.7,2.8,6.7,2.0
117,7.7,3.8,6.7,2.2
118,7.7,2.6,6.9,2.3


$\Box$
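
Sorting by more than one column follows the same pattern. In the sketch below, `flowers` is a made-up four-row stand-in for the Iris data. Ties in sepal_length are broken by sepal_width, and a list passed to `ascending` sets the direction for each column separately.

```python
import pandas as pd

# Made-up four-row stand-in with a tie in sepal_length
flowers = pd.DataFrame({
    'sepal_length': [4.4, 4.3, 4.4, 4.5],
    'sepal_width': [3.0, 3.0, 3.2, 2.3],
})

# Ascending by sepal_length, then descending by sepal_width within ties
ordered = flowers.sort_values(
    by=['sepal_length', 'sepal_width'],
    ascending=[True, False],
)
print(ordered)
print(list(ordered.index))  # [1, 2, 0, 3]
```

Note that the original index labels travel with their rows, just as in the outputs above.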

### DataFrame Arithmetic

Arithmetic with DataFrames and Series is much the same as it is with NumPy ndarrays, provided the DataFrames used have the same index and the same columns. 

##### Example 11

In [37]:
df1 = pd.DataFrame({'A':[1,2,3,4], 'B':[5,6,7,8], 'C': [1,1,1,1]})
df2 = pd.DataFrame({'A':[1,1,1,1], 'B':[2,2,2,2], 'C': [5,5,5,5]})
df1

Unnamed: 0,A,B,C
0,1,5,1
1,2,6,1
2,3,7,1
3,4,8,1


In [38]:
df2

Unnamed: 0,A,B,C
0,1,2,5
1,1,2,5
2,1,2,5
3,1,2,5


In [39]:
df1 + df2

Unnamed: 0,A,B,C
0,2,7,6
1,3,8,6
2,4,9,6
3,5,10,6


In [40]:
df1*df2

Unnamed: 0,A,B,C
0,1,10,5
1,2,12,5
2,3,14,5
3,4,16,5


$\Box$

When the indexes don't match up, a NaN value is produced for each label that is missing from either DataFrame.  NaN stands for *not a number*.

##### Example 12

Note that *to_numpy()* converts a Pandas DataFrame to a NumPy ndarray.

In [41]:
df1

Unnamed: 0,A,B,C
0,1,5,1
1,2,6,1
2,3,7,1
3,4,8,1


In [42]:
df3 = pd.DataFrame(df2.to_numpy(), index = range(1,5), columns = df1.columns)
df3

Unnamed: 0,A,B,C
1,1,2,5
2,1,2,5
3,1,2,5
4,1,2,5


In [43]:
df1 + df3

Unnamed: 0,A,B,C
0,,,
1,3.0,8.0,6.0
2,4.0,9.0,6.0
3,5.0,10.0,6.0
4,,,


$\Box$
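
If NaN is not the desired result for unmatched labels, the *add* method offers a `fill_value` argument. The following is a minimal sketch on two small stand-in DataFrames (`left` and `right` are illustrative names): a value missing from one DataFrame is treated as 0 before the addition.

```python
import pandas as pd

# Stand-in DataFrames whose indexes overlap only on the labels 1, 2, 3
left = pd.DataFrame({'A': [1, 2, 3, 4], 'B': [5, 6, 7, 8]})
right = pd.DataFrame({'A': [1, 1, 1, 1], 'B': [2, 2, 2, 2]}, index=range(1, 5))

# The + operator leaves NaN in rows 0 and 4
plain = left + right
print(plain)

# add with fill_value=0 treats the missing side as 0 instead
filled = left.add(right, fill_value=0)
print(filled)
```

A cell stays NaN only if the value is missing from both DataFrames, which does not occur in this sketch.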

When the column names don't match, NaN values are produced in the same way.

##### Example 13

In [44]:
df1

Unnamed: 0,A,B,C
0,1,5,1
1,2,6,1
2,3,7,1
3,4,8,1


In [45]:
df4 = pd.DataFrame(df1.to_numpy())
df4.columns = ['A', 'D', 'C']
df4

Unnamed: 0,A,D,C
0,1,5,1
1,2,6,1
2,3,7,1
3,4,8,1


In [46]:
df1 + df4

Unnamed: 0,A,B,C,D
0,2,,2,
1,4,,2,
2,6,,2,
3,8,,2,


In [47]:
df1*df4

Unnamed: 0,A,B,C,D
0,1,,1,
1,4,,1,
2,9,,1,
3,16,,1,


$\Box$

### Reduction Methods

We can apply NumPy-style reduction methods, such as *mean*, *max*, *min*, and *sum*, to columns in a DataFrame.

##### Example 14

Suppose that we want to know the average for Exam 1.

In [48]:
gradebook['Exam1'].mean()

77.84

It's natural to want to know the high and low scores.

In [106]:
gradebook['Exam1'].max()

99.7

In [107]:
gradebook['Exam1'].min()

56.5

$\Box$

##### Example 15

In [111]:
player_stats

Unnamed: 0,free_throws,shooting,3_ball
LeBron,60.098218,87.253434,82.619776
Larry,71.966202,61.174673,89.819636
Michael,67.729748,67.580134,66.034485
Yannis,67.489296,76.160558,74.379625
Steph,72.187066,77.970777,70.066925


By default, the *sum* method (and other similar reduction methods) is applied along axis = 0, that is, down the rows, giving one value per column.  A Series is returned.  Passing axis = 1 instead gives one value per row.

In [113]:
player_stats.sum()

free_throws    339.470530
shooting       370.139577
3_ball         382.920447
dtype: float64

In [114]:
player_stats.sum(axis = 1)

LeBron     229.971428
Larry      222.960511
Michael    201.344367
Yannis     218.029479
Steph      220.224768
dtype: float64

We may want to know the average, among all players, for each of the categories.

In [115]:
player_stats.mean(axis = 0)

free_throws    67.894106
shooting       74.027915
3_ball         76.584089
dtype: float64

$\Box$
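
Several reductions can be requested at once with the *agg* method. The sketch below assumes a stand-in Series `exam1` holding the Exam 1 scores from the gradebook and computes the mean, minimum, and maximum in a single call.

```python
import pandas as pd

# Stand-in Series with the Exam 1 scores
exam1 = pd.Series([95.2, 56.5, 75.8, 62.0, 99.7], name='Exam1')

# agg takes a list of reduction names and returns one value per name
summary = exam1.agg(['mean', 'min', 'max'])
print(summary)
```

The result is itself a Series, indexed by the names of the reductions.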

### Creating New Columns

##### Example 16

Suppose we want to add a label column to the Iris Dataset.

In [50]:
iris.iloc[45:55]

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width
45,4.8,3.0,1.4,0.3
46,5.1,3.8,1.6,0.2
47,4.6,3.2,1.4,0.2
48,5.3,3.7,1.5,0.2
49,5.0,3.3,1.4,0.2
50,7.0,3.2,4.7,1.4
51,6.4,3.2,4.5,1.5
52,6.9,3.1,4.9,1.5
53,5.5,2.3,4.0,1.3
54,6.5,2.8,4.6,1.5


In [51]:
labels = datasets.load_iris().target
labels[45:55]

array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

In [52]:
type(labels)

numpy.ndarray

The following syntax illustrates how to accomplish this task.

In [53]:
iris['label'] = labels

In [54]:
iris.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,label
0,5.1,3.5,1.4,0.2,0
1,4.9,3.0,1.4,0.2,0
2,4.7,3.2,1.3,0.2,0
3,4.6,3.1,1.5,0.2,0
4,5.0,3.6,1.4,0.2,0


$\Box$

##### Example 17

Suppose we wanted to add a new column to the gradebook populated with 0's.

In [55]:
gradebook['Exam2'] = 0.0

In [56]:
gradebook

Unnamed: 0,Name,HW,Quiz1,Exam1,Exam2
0,Adam,30,10,95.2,0.0
1,Bob,40,5,56.5,0.0
2,Carole,20,20,75.8,0.0
3,Dani,30,15,62.0,0.0
4,Ella,40,20,99.7,0.0


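Now, let's add some scores to Exam 2.  For illustration, we also copy the same scores into a new Exam3 column.

In [59]:
gradebook['Exam2'] = np.array([88.5, 35.7, 78.2, 91.4, 77.5])
gradebook['Exam3'] = gradebook['Exam2']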

In [60]:
gradebook

Unnamed: 0,Name,HW,Quiz1,Exam1,Exam2,Exam3
0,Adam,30,10,95.2,88.5,88.5
1,Bob,40,5,56.5,35.7,35.7
2,Carole,20,20,75.8,78.2,78.2
3,Dani,30,15,62.0,91.4,91.4
4,Ella,40,20,99.7,77.5,77.5


$\Box$

##### Example 18

Now, let's add a column for Quiz 2.

In [61]:
gradebook['Quiz2'] = [10,20,30,40,50]

In [62]:
gradebook

Unnamed: 0,Name,HW,Quiz1,Exam1,Exam2,Exam3,Quiz2
0,Adam,30,10,95.2,88.5,88.5,10
1,Bob,40,5,56.5,35.7,35.7,20
2,Carole,20,20,75.8,78.2,78.2,30
3,Dani,30,15,62.0,91.4,91.4,40
4,Ella,40,20,99.7,77.5,77.5,50


$\Box$

##### Example 18 (continued)

Suppose we want to create a total column for the gradebook that sums Quiz 1 and Quiz 2.

In [63]:
gradebook['Quiz1']

0    10
1     5
2    20
3    15
4    20
Name: Quiz1, dtype: int64

In [64]:
gradebook['Quiz2']

0    10
1    20
2    30
3    40
4    50
Name: Quiz2, dtype: int64

In [65]:
gradebook['Quiz_Total'] = gradebook['Quiz1']+gradebook['Quiz2']

In [66]:
gradebook

Unnamed: 0,Name,HW,Quiz1,Exam1,Exam2,Exam3,Quiz2,Quiz_Total
0,Adam,30,10,95.2,88.5,88.5,10,20
1,Bob,40,5,56.5,35.7,35.7,20,25
2,Carole,20,20,75.8,78.2,78.2,30,50
3,Dani,30,15,62.0,91.4,91.4,40,55
4,Ella,40,20,99.7,77.5,77.5,50,70


$\Box$

#### Apply

The *apply* method acts on a column (or multiple columns) in a DataFrame.  It takes a function as a parameter.

##### Example 19

Suppose that we want to add a binary column to the gradebook to indicate if a student is passing or not.  Further suppose that our criterion is a passing grade on Exam 1.

In [67]:
gradebook['passing'] = gradebook.Exam1.apply(lambda x: 1 if x >= 70.0 else 0)

In [68]:
gradebook

Unnamed: 0,Name,HW,Quiz1,Exam1,Exam2,Exam3,Quiz2,Quiz_Total,passing
0,Adam,30,10,95.2,88.5,88.5,10,20,1
1,Bob,40,5,56.5,35.7,35.7,20,25,0
2,Carole,20,20,75.8,78.2,78.2,30,50,1
3,Dani,30,15,62.0,91.4,91.4,40,55,0
4,Ella,40,20,99.7,77.5,77.5,50,70,1


$\Box$

Using apply on multiple columns requires that we select multiple columns and pass an axis argument.

##### Example 20

In [73]:
gradebook['Exam_Total'] = gradebook[['Exam1', 'Exam2', 'Exam3']].apply(lambda x: x.sum(), axis = 1)

In [74]:
gradebook

Unnamed: 0,Name,HW,Quiz1,Exam1,Exam2,Exam3,Quiz2,Quiz_Total,passing,Exam_Total
0,Adam,30,10,95.2,88.5,88.5,10,20,1,272.2
1,Bob,40,5,56.5,35.7,35.7,20,25,0,127.9
2,Carole,20,20,75.8,78.2,78.2,30,50,1,232.2
3,Dani,30,15,62.0,91.4,91.4,40,55,0,244.8
4,Ella,40,20,99.7,77.5,77.5,50,70,1,254.7


$\Box$
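
For simple rules like the two above, the same columns can also be computed without apply. The sketch below, on a stand-in DataFrame `grades` with two exam columns, compares each apply call with a vectorized alternative. This is one possible approach rather than a replacement for apply, which remains useful for functions that have no vectorized form.

```python
import numpy as np
import pandas as pd

# Stand-in DataFrame with two exam columns
grades = pd.DataFrame({
    'Exam1': [95.2, 56.5, 75.8, 62.0, 99.7],
    'Exam2': [88.5, 35.7, 78.2, 91.4, 77.5],
})

# Pass/fail flag: apply with a lambda vs. a comparison cast to int
passing_apply = grades['Exam1'].apply(lambda x: 1 if x >= 70.0 else 0)
passing_vector = (grades['Exam1'] >= 70.0).astype(int)

# Row totals: apply along axis=1 vs. the sum reduction along axis=1
total_apply = grades[['Exam1', 'Exam2']].apply(lambda row: row.sum(), axis=1)
total_vector = grades[['Exam1', 'Exam2']].sum(axis=1)

print(passing_vector.tolist())                      # [1, 0, 1, 0, 1]
print(passing_apply.tolist() == passing_vector.tolist())
print(np.allclose(total_apply, total_vector))
```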

### Reorder Columns

The gradebook columns look out of order.  Let's reorder them.

##### Example 21

In [75]:
list(gradebook.columns)

['Name',
 'HW',
 'Quiz1',
 'Exam1',
 'Exam2',
 'Exam3',
 'Quiz2',
 'Quiz_Total',
 'passing',
 'Exam_Total']

In [76]:
gradebook = gradebook[['Name', 'HW', 'Quiz1','Quiz2','Exam1','Exam2', 'Exam3', 'Quiz_Total', 'Exam_Total', 'passing']]
gradebook

Unnamed: 0,Name,HW,Quiz1,Quiz2,Exam1,Exam2,Exam3,Quiz_Total,Exam_Total,passing
0,Adam,30,10,10,95.2,88.5,88.5,20,272.2,1
1,Bob,40,5,20,56.5,35.7,35.7,25,127.9,0
2,Carole,20,20,30,75.8,78.2,78.2,50,232.2,1
3,Dani,30,15,40,62.0,91.4,91.4,55,244.8,0
4,Ella,40,20,50,99.7,77.5,77.5,70,254.7,1


$\Box$

The above example worked because the list of column names is small.  How could we reorder the columns when the number of columns is large?

##### Example 22

Let's exchange the positions of the HW and passing columns.

In [77]:
cols = list(gradebook.columns)

In [78]:
cols

['Name',
 'HW',
 'Quiz1',
 'Quiz2',
 'Exam1',
 'Exam2',
 'Exam3',
 'Quiz_Total',
 'Exam_Total',
 'passing']

This problem comes down to changing the positions of the elements 'HW' and 'passing' in the cols list.

We can get the indices of these elements using the *index* method.

In [81]:
HW_index = cols.index('HW')
HW_index

1

In [82]:
passing_index = cols.index('passing')
passing_index

9

Using tuple assignment, we can swap the two entries in a single statement.

In [83]:
cols[HW_index], cols[passing_index] = 'passing', 'HW'

In [84]:
gradebook = gradebook[cols]
gradebook

Unnamed: 0,Name,passing,Quiz1,Quiz2,Exam1,Exam2,Exam3,Quiz_Total,Exam_Total,HW
0,Adam,1,10,10,95.2,88.5,88.5,20,272.2,30
1,Bob,0,5,20,56.5,35.7,35.7,25,127.9,40
2,Carole,1,20,30,75.8,78.2,78.2,50,232.2,20
3,Dani,0,15,40,62.0,91.4,91.4,55,244.8,30
4,Ella,1,20,50,99.7,77.5,77.5,70,254.7,40


$\Box$
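
The swap can also be written without typing the column names a second time. A minimal sketch on a stand-in one-row DataFrame `df`, where the positions `i` and `j` are looked up once and the two list entries are exchanged by tuple assignment:

```python
import pandas as pd

# Stand-in DataFrame with four columns
df = pd.DataFrame({'Name': ['Adam'], 'HW': [30], 'Quiz1': [10], 'passing': [1]})
cols = list(df.columns)

# Look up both positions, then exchange the entries in one statement
i, j = cols.index('HW'), cols.index('passing')
cols[i], cols[j] = cols[j], cols[i]

# Selecting with the reordered list returns the columns in the new order
swapped = df[cols]
print(list(swapped.columns))  # ['Name', 'passing', 'Quiz1', 'HW']
```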

### Dropping Columns

The *drop* method allows us to drop unwanted columns.

##### Example 23

The passing column is not necessary any longer.  Let's drop it.

In [85]:
gradebook = gradebook.drop(columns = ['passing'])
gradebook

Unnamed: 0,Name,Quiz1,Quiz2,Exam1,Exam2,Exam3,Quiz_Total,Exam_Total,HW
0,Adam,10,10,95.2,88.5,88.5,20,272.2,30
1,Bob,5,20,56.5,35.7,35.7,25,127.9,40
2,Carole,20,30,75.8,78.2,78.2,50,232.2,20
3,Dani,15,40,62.0,91.4,91.4,55,244.8,30
4,Ella,20,50,99.7,77.5,77.5,70,254.7,40


$\Box$
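
Note that drop returns a new DataFrame, which is why the result is assigned back to a variable. The sketch below, on a stand-in DataFrame `df`, shows that the original is left unchanged, and that passing `errors='ignore'` skips labels that are not present instead of raising a KeyError.

```python
import pandas as pd

# Stand-in DataFrame with a column we want to remove
df = pd.DataFrame({'Name': ['Adam', 'Bob'], 'HW': [30, 40], 'passing': [1, 0]})

# drop returns a new DataFrame; df itself keeps all three columns
trimmed = df.drop(columns=['passing'])
print(list(df.columns))       # ['Name', 'HW', 'passing']
print(list(trimmed.columns))  # ['Name', 'HW']

# Dropping a column that is already gone is skipped with errors='ignore'
same = trimmed.drop(columns=['passing'], errors='ignore')
print(list(same.columns))     # ['Name', 'HW']
```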

### Renaming Columns

We can use the *rename* method to rename columns.  To tell rename which columns to rename and what to call them, we pass a dictionary to the *columns* parameter with key-value pairs of the form

{'old_name': 'new_name'}

##### Example 24

Let's rename Quiz1 as Quiz_1 and Quiz2 as Quiz_2.

In [86]:
gradebook = gradebook.rename(columns = {'Quiz1':'Quiz_1', 'Quiz2':'Quiz_2'})
gradebook

Unnamed: 0,Name,Quiz_1,Quiz_2,Exam1,Exam2,Exam3,Quiz_Total,Exam_Total,HW
0,Adam,10,10,95.2,88.5,88.5,20,272.2,30
1,Bob,5,20,56.5,35.7,35.7,25,127.9,40
2,Carole,20,30,75.8,78.2,78.2,50,232.2,20
3,Dani,15,40,62.0,91.4,91.4,55,244.8,30
4,Ella,20,50,99.7,77.5,77.5,70,254.7,40


$\Box$
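
Besides a dictionary, the *columns* parameter of rename also accepts a function that is applied to every column name. The sketch below uses a stand-in DataFrame `df` and the built-in `str.lower`. It also shows that dictionary keys which match no column are skipped by default.

```python
import pandas as pd

# Stand-in DataFrame with three columns
df = pd.DataFrame({'Quiz1': [10], 'Quiz2': [20], 'Exam1': [95.2]})

# A function is applied to each column name in turn
lowered = df.rename(columns=str.lower)
print(list(lowered.columns))    # ['quiz1', 'quiz2', 'exam1']

# Keys that match no column ('NotAColumn' here) are ignored by default
relabeled = df.rename(columns={'Quiz1': 'Quiz_1', 'NotAColumn': 'x'})
print(list(relabeled.columns))  # ['Quiz_1', 'Quiz2', 'Exam1']
```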

### Set Index

We may want to turn a column into the index.  This is accomplished with the *set_index* method.

##### Example 25

It makes sense that the Name column be our index for the gradebook.

In [87]:
gradebook

Unnamed: 0,Name,Quiz_1,Quiz_2,Exam1,Exam2,Exam3,Quiz_Total,Exam_Total,HW
0,Adam,10,10,95.2,88.5,88.5,20,272.2,30
1,Bob,5,20,56.5,35.7,35.7,25,127.9,40
2,Carole,20,30,75.8,78.2,78.2,50,232.2,20
3,Dani,15,40,62.0,91.4,91.4,55,244.8,30
4,Ella,20,50,99.7,77.5,77.5,70,254.7,40


In [88]:
gradebook = gradebook.set_index('Name')
gradebook

Name,Quiz_1,Quiz_2,Exam1,Exam2,Exam3,Quiz_Total,Exam_Total,HW
Adam,10,10,95.2,88.5,88.5,20,272.2,30
Bob,5,20,56.5,35.7,35.7,25,127.9,40
Carole,20,30,75.8,78.2,78.2,50,232.2,20
Dani,15,40,62.0,91.4,91.4,55,244.8,30
Ella,20,50,99.7,77.5,77.5,70,254.7,40


$\Box$
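
Once a column has become the index, its values can be used as row labels with loc, and the *reset_index* method turns the index back into an ordinary column. A minimal sketch on a stand-in two-row DataFrame `df`:

```python
import pandas as pd

# Stand-in DataFrame with a Name column
df = pd.DataFrame({'Name': ['Adam', 'Bob'], 'Exam1': [95.2, 56.5]})

# After set_index, names can be used as row labels
by_name = df.set_index('Name')
bob_score = by_name.loc['Bob', 'Exam1']
print(bob_score)  # 56.5

# reset_index moves the index back into a regular column
restored = by_name.reset_index()
print(list(restored.columns))  # ['Name', 'Exam1']
```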

### Loading Data From a CSV File

The Pandas *read_csv* function reads a .csv file into a DataFrame.

##### Example 26

In [2]:
census = pd.read_csv('ChicagoCensusData.csv')

In [3]:
census.head()

Unnamed: 0,COMMUNITY_AREA_NUMBER,COMMUNITY_AREA_NAME,PERCENT_OF_HOUSING_CROWDED,PERCENT_HOUSEHOLDS_BELOW_POVERTY,PERCENT_AGED_16__UNEMPLOYED,PERCENT_AGED_25__WITHOUT_HIGH_SCHOOL_DIPLOMA,PERCENT_AGED_UNDER_18_OR_OVER_64,PER_CAPITA_INCOME,HARDSHIP_INDEX
0,1.0,Rogers Park,7.7,23.6,8.7,18.2,27.5,23939,39.0
1,2.0,West Ridge,7.8,17.2,8.8,20.8,38.5,23040,46.0
2,3.0,Uptown,3.8,24.0,8.9,11.8,22.2,35787,20.0
3,4.0,Lincoln Square,3.4,10.9,8.2,13.4,25.5,37524,17.0
4,5.0,North Center,0.3,7.5,5.2,4.5,26.2,57123,6.0


$\Box$
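
For readers who do not have the census file at hand, read_csv can be tried on text held in memory. The sketch below wraps a small made-up CSV string in `io.StringIO`. The column names and values are hypothetical and are not taken from ChicagoCensusData.csv. The `index_col` argument is one way to choose the index while reading.

```python
import io
import pandas as pd

# Made-up CSV content standing in for a file on disk
csv_text = """AREA_NAME,PER_CAPITA_INCOME,HARDSHIP_INDEX
Area A,23939,39.0
Area B,57123,6.0
"""

# read_csv accepts a file path or a file-like object such as StringIO
sample = pd.read_csv(io.StringIO(csv_text))
print(sample.shape)  # (2, 3)

# index_col uses the named column as the index instead of 0, 1, 2, ...
indexed = pd.read_csv(io.StringIO(csv_text), index_col='AREA_NAME')
print(indexed.loc['Area B', 'PER_CAPITA_INCOME'])  # 57123
```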

##### Exercise 1

In this exercise we use the census dataset that was loaded in Example 26.

Perform the following actions on the census dataset:
1. Drop the COMMUNITY_AREA_NUMBER column.
2. Rename the remaining columns so that their names are easier to use.
3. Add a binary column named 'makes_over_50K' that has a 1 if the value of PER_CAPITA_INCOME is greater than 50000 and a 0 otherwise.
4. Turn the COMMUNITY_AREA_NAME column into the index.
5. Display your results using the *head* method.

##### Exercise 2

In this exercise you will use the census dataset that you cleaned in Exercise 1.

Display a new DataFrame of shape (5,2) such that the second column contains the top 5 values in the 'HARDSHIP_INDEX' column and the first column has the 'COMMUNITY_AREA_NAME' associated with each of those values.