<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# Hands on with Pandas

---

### Using Pandas

Pandas is frequently used in data science because it offers a large set of commonly used functions, is relatively fast, and has a large community. Because Pandas is built on NumPy, which many other data science libraries also use to manipulate data, you can easily transfer data between libraries (as we will often do in this class!).

Pandas is a large library that typically takes a lot of practice to learn. It heavily overloads Python operators, resulting in odd-looking syntax. For example, given a `DataFrame` called `cars` which contains a column `mpg`, we might want to view all cars with mpg over 35. To do this, we might write: `cars[cars['mpg'] > 35]`. With built-in Python containers such as lists or dictionaries, the same expression would typically raise a `TypeError` rather than filter anything.
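To see how such a filter works, here is a minimal sketch using a small, made-up `cars` `DataFrame` (the names and values are illustrative assumptions, not real data). The inner comparison produces a Boolean `Series`, and indexing with it keeps only the rows marked `True`.

```python
import pandas as pd

# Hypothetical data for illustration only.
cars = pd.DataFrame({
    "name": ["A", "B", "C"],
    "mpg": [28.0, 41.5, 36.2],
})

mask = cars["mpg"] > 35      # Boolean Series, one True/False per row
efficient = cars[mask]       # keeps only the rows where mask is True

print(mask.tolist())
print(efficient)
```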

Pandas also highly favors certain patterns of use. For example, looping through a `DataFrame` row by row is highly discouraged. Instead, Pandas favors using **vectorized functions** that operate column by column. (This is because each column is stored separately as an `ndarray`, and NumPy is optimized for operating on `ndarray`s.)
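As a rough illustration of the difference, the sketch below computes the same result twice on a tiny, made-up `DataFrame`: once with a row-by-row loop and once with a single vectorized expression. The column names are assumptions chosen for the example.

```python
import pandas as pd

# Hypothetical data for illustration only.
df = pd.DataFrame({"price": [10.0, 20.0, 30.0], "qty": [1, 2, 3]})

# Discouraged: build the result one row at a time.
totals_loop = [row["price"] * row["qty"] for _, row in df.iterrows()]

# Preferred: one vectorized expression over whole columns.
totals_vec = df["price"] * df["qty"]

print(totals_vec.tolist())
```

Both approaches give the same numbers here; the vectorized form is shorter and is generally much faster on large data.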

Do not be discouraged if Pandas feels overwhelming. Gradually, as you use it, you will become familiar with which methods to use and the "Pandas way" of thinking about and manipulating data.

In [28]:
# Load Pandas into Python
import pandas as pd

<a id="reading-files"></a>
### Reading Files, Selecting Columns, and Summarizing

In [29]:
users = pd.read_table('data/user.tbl', sep='|')
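One detail worth knowing when reading delimited files: columns that look numeric, such as zip codes, are parsed as integers by default, which drops leading zeros. The sketch below uses a made-up two-row sample in the same pipe-delimited layout (an assumption for illustration, not the real file) to show how the `dtype` argument keeps such a column as text.

```python
import io
import pandas as pd

# Hypothetical two-row sample in a pipe-delimited layout.
raw = "user_id|age|zip_code\n8|36|05201\n9|29|01002\n"

as_numbers = pd.read_csv(io.StringIO(raw), sep="|")
as_text = pd.read_csv(io.StringIO(raw), sep="|", dtype={"zip_code": str})

print(as_numbers["zip_code"].tolist())   # parsed as integers, leading zeros lost
print(as_text["zip_code"].tolist())      # kept as strings, leading zeros preserved
```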

**Examine the users data.**

In [30]:
type(users)             # check its type

pandas.core.frame.DataFrame

In [31]:
users                   # Print the first 30 and last 30 rows.

Unnamed: 0,user_id,age,gender,occupation,zip_code
0,1,24,M,technician,85711
1,2,53,F,other,94043
2,3,23,M,writer,32067
3,4,24,M,technician,43537
4,5,33,F,other,15213
5,6,42,M,executive,98101
6,7,57,M,administrator,91344
7,8,36,M,administrator,05201
8,9,29,M,student,01002
9,10,53,M,lawyer,90703


In [32]:
users.head()            # Print the first five rows.

Unnamed: 0,user_id,age,gender,occupation,zip_code
0,1,24,M,technician,85711
1,2,53,F,other,94043
2,3,23,M,writer,32067
3,4,24,M,technician,43537
4,5,33,F,other,15213


In [33]:
users.head(10)          # Print the first 10 rows.

Unnamed: 0,user_id,age,gender,occupation,zip_code
0,1,24,M,technician,85711
1,2,53,F,other,94043
2,3,23,M,writer,32067
3,4,24,M,technician,43537
4,5,33,F,other,15213
5,6,42,M,executive,98101
6,7,57,M,administrator,91344
7,8,36,M,administrator,5201
8,9,29,M,student,1002
9,10,53,M,lawyer,90703


In [34]:
users.tail()            # Print the last five rows.

Unnamed: 0,user_id,age,gender,occupation,zip_code
938,939,26,F,student,33319
939,940,32,M,administrator,2215
940,941,20,M,student,97229
941,942,48,F,librarian,78209
942,943,22,M,student,77841


In [35]:
# The row index (aka "the row labels" — in this case integers)
users.index            

RangeIndex(start=0, stop=943, step=1)

In [36]:
# Column names (which is "an index")
users.columns

Index(['user_id', 'age', 'gender', 'occupation', 'zip_code'], dtype='object')

In [37]:
# Datatypes of each column — each column is stored as an ndarray, which has a datatype
users.dtypes

user_id        int64
age            int64
gender        object
occupation    object
zip_code      object
dtype: object

In [38]:
# Number of rows and columns
users.shape

(943, 5)

In [39]:
# All values as a NumPy array
users.values

array([[1, 24, 'M', 'technician', '85711'],
       [2, 53, 'F', 'other', '94043'],
       [3, 23, 'M', 'writer', '32067'],
       ...,
       [941, 20, 'M', 'student', '97229'],
       [942, 48, 'F', 'librarian', '78209'],
       [943, 22, 'M', 'student', '77841']], dtype=object)

In [40]:
# Concise summary (including memory usage) — useful to quickly see if nulls exist
users.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 943 entries, 0 to 942
Data columns (total 5 columns):
user_id       943 non-null int64
age           943 non-null int64
gender        943 non-null object
occupation    943 non-null object
zip_code      943 non-null object
dtypes: int64(2), object(3)
memory usage: 36.9+ KB


**Select or index data.**<br>
Pandas `DataFrame`s have structural similarities with Python-style lists and dictionaries.  
In the example below, we select a column of data using the name of the column in a similar manner to how we select a dictionary value with the dictionary key.
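The dictionary analogy extends to selecting several columns at once. In this minimal sketch (using a made-up `people` `DataFrame` as a stand-in for `users`), a single column name returns a `Series`, while a list of names returns a `DataFrame`.

```python
import pandas as pd

# Hypothetical stand-in for the `users` data.
people = pd.DataFrame({
    "age": [24, 53],
    "gender": ["M", "F"],
    "occupation": ["technician", "other"],
})

one_col = people["occupation"]              # a Series
two_cols = people[["occupation", "age"]]    # a DataFrame (note the double brackets)

print(type(one_col).__name__, type(two_cols).__name__)
```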

In [41]:
# Select a column
users['occupation']

0         technician
1              other
2             writer
3         technician
4              other
5          executive
6      administrator
7      administrator
8            student
9             lawyer
10             other
11             other
12          educator
13         scientist
14          educator
15     entertainment
16        programmer
17             other
18         librarian
19         homemaker
20            writer
21            writer
22            artist
23            artist
24          engineer
25          engineer
26         librarian
27            writer
28        programmer
29           student
           ...      
913            other
914    entertainment
915         engineer
916          student
917        scientist
918            other
919           artist
920          student
921    administrator
922          student
923            other
924         salesman
925    entertainment
926       programmer
927          student
928        scientist
929        sc

In [42]:
type(users['occupation'])

pandas.core.series.Series

In [43]:
# Select one column using attribute access.
users.occupation

# While a useful shorthand, attribute access only works if the column name is a
# valid Python identifier (no spaces or punctuation) and does not clash with a
# Python keyword or an existing DataFrame attribute or method.

0         technician
1              other
2             writer
3         technician
4              other
5          executive
6      administrator
7      administrator
8            student
9             lawyer
10             other
11             other
12          educator
13         scientist
14          educator
15     entertainment
16        programmer
17             other
18         librarian
19         homemaker
20            writer
21            writer
22            artist
23            artist
24          engineer
25          engineer
26         librarian
27            writer
28        programmer
29           student
           ...      
913            other
914    entertainment
915         engineer
916          student
917        scientist
918            other
919           artist
920          student
921    administrator
922          student
923            other
924         salesman
925    entertainment
926       programmer
927          student
928        scientist
929        sc
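The limits of attribute access are easiest to see with awkward column names. The sketch below uses made-up column names (assumptions chosen to trigger the problem): one that collides with a `DataFrame` method and one that contains a space.

```python
import pandas as pd

# Hypothetical column names chosen to show where attribute access breaks down.
df = pd.DataFrame({"zip code": ["85711", "94043"], "count": [3, 5]})

# "count" is also a DataFrame method, so the attribute refers to the method.
print(callable(df.count))        # the method, not the column
print(df["count"].tolist())      # bracket access reaches the column

# "zip code" contains a space, so it cannot be written as df.zip code.
print(df["zip code"].tolist())
```

Bracket access works in every case, which is why it is the safer default.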

**Summarize (describe) the data.**<br>
Pandas has several built-in methods that quickly summarize your data and give you a general sense of its contents.

In [44]:
# Describe all numeric columns.
users.describe()

Unnamed: 0,user_id,age
count,943.0,943.0
mean,472.0,34.051962
std,272.364951,12.19274
min,1.0,7.0
25%,236.5,25.0
50%,472.0,31.0
75%,707.5,43.0
max,943.0,73.0


In [45]:
# Describe all columns, including non-numeric.
users.describe(include='all')

Unnamed: 0,user_id,age,gender,occupation,zip_code
count,943.0,943.0,943,943,943.0
unique,,,2,21,795.0
top,,,M,student,55414.0
freq,,,670,196,9.0
mean,472.0,34.051962,,,
std,272.364951,12.19274,,,
min,1.0,7.0,,,
25%,236.5,25.0,,,
50%,472.0,31.0,,,
75%,707.5,43.0,,,


In [46]:
# Describe a single column — recall that "users.occupation" refers to a Series.
users["occupation"].describe()

count         943
unique         21
top       student
freq          196
Name: occupation, dtype: object

In [47]:
# Calculate the mean of the ages.
users["age"].mean()

34.05196182396607

**Count the number of occurrences of each value.**

In [48]:
users["occupation"].value_counts()     # Most useful for categorical variables

student          196
other            105
educator          95
administrator     79
engineer          67
programmer        66
librarian         51
writer            45
executive         32
scientist         31
artist            28
technician        27
marketing         26
entertainment     18
healthcare        16
retired           14
salesman          12
lawyer            12
none               9
doctor             7
homemaker          7
Name: occupation, dtype: int64

In [49]:
# Can also be used with numeric variables
#   Try .sort_index() to sort by indices or .sort_values() to sort by counts.
users["age"].value_counts()

30    39
25    38
22    37
28    36
27    35
26    34
24    33
29    32
20    32
32    28
23    28
35    27
21    27
33    26
31    25
19    23
44    23
39    22
40    21
36    21
42    21
51    20
50    20
48    20
49    19
37    19
18    18
34    17
38    17
45    15
      ..
47    14
43    13
46    12
53    12
55    11
41    10
57     9
60     9
52     6
56     6
15     6
13     5
16     5
54     4
63     3
14     3
65     3
70     3
61     3
59     3
58     3
64     2
68     2
69     2
62     2
11     1
10     1
73     1
66     1
7      1
Name: age, Length: 61, dtype: int64
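The sorting hint in the cell above can be sketched on a small, made-up `Series` of ages (illustrative values only). `value_counts()` orders by frequency, and `.sort_index()` reorders the result by the values being counted.

```python
import pandas as pd

# Hypothetical ages for illustration only.
ages = pd.Series([30, 25, 30, 22, 25, 30])

counts = ages.value_counts()     # ordered by count, largest first
by_age = counts.sort_index()     # reordered by the age values themselves

print(by_age)
```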

In [50]:
# You can also do it the "long way"
users.groupby("occupation")["user_id"].count()

occupation
administrator     79
artist            28
doctor             7
educator          95
engineer          67
entertainment     18
executive         32
healthcare        16
homemaker          7
lawyer            12
librarian         51
marketing         26
none               9
other            105
programmer        66
retired           14
salesman          12
scientist         31
student          196
technician        27
writer            45
Name: user_id, dtype: int64

<a id="exercise-one"></a>
### Exercise 1

In [51]:
# Read drinks.csv into a DataFrame called "drinks".
drinks = pd.read_csv('data/drinks.csv')

In [52]:
# Print the head and the tail.
drinks.head()



Unnamed: 0,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
0,Afghanistan,0,0,0,0.0,AS
1,Albania,89,132,54,4.9,EU
2,Algeria,25,0,14,0.7,AF
3,Andorra,245,138,312,12.4,EU
4,Angola,217,57,45,5.9,AF


In [53]:
drinks.tail()

Unnamed: 0,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
188,Venezuela,333,100,3,7.7,SA
189,Vietnam,111,2,1,2.0,AS
190,Yemen,6,0,0,0.1,AS
191,Zambia,32,19,4,2.5,AF
192,Zimbabwe,64,18,4,4.7,AF


In [54]:
# Examine the default index, datatypes, and shape.
drinks.index, drinks.shape

(RangeIndex(start=0, stop=193, step=1), (193, 6))

In [55]:
drinks.dtypes

country                          object
beer_servings                     int64
spirit_servings                   int64
wine_servings                     int64
total_litres_of_pure_alcohol    float64
continent                        object
dtype: object

In [56]:
# Print the beer_servings Series.
drinks['beer_servings']

0        0
1       89
2       25
3      245
4      217
5      102
6      193
7       21
8      261
9      279
10      21
11     122
12      42
13       0
14     143
15     142
16     295
17     263
18      34
19      23
20     167
21      76
22     173
23     245
24      31
25     231
26      25
27      88
28      37
29     144
      ... 
163    128
164     90
165    152
166    185
167      5
168      2
169     99
170    106
171      1
172     36
173     36
174    197
175     51
176     51
177     19
178      6
179     45
180    206
181     16
182    219
183     36
184    249
185    115
186     25
187     21
188    333
189    111
190      6
191     32
192     64
Name: beer_servings, Length: 193, dtype: int64

In [57]:
# Calculate the average beer_servings for the entire data set.
drinks['beer_servings'].mean()

106.16062176165804

In [58]:
# Count the number of occurrences of each "continent" value and see if it looks correct.
drinks.groupby("continent")['continent'].count()

continent
AF    53
AS    44
EU    45
OC    16
SA    12
Name: continent, dtype: int64

In [60]:
# Or more simply
drinks['continent'].value_counts()

AF    53
EU    45
AS    44
OC    16
SA    12
Name: continent, dtype: int64

<a id="filtering-and-sorting"></a>
### Filtering and Sorting
- **Objective:** Filter and sort data using Pandas.

We can use simple operator comparisons on columns to extract relevant or drop irrelevant information.

**Logical filtering: Only show users with age < 20.**

In [86]:
# Create a Series of Booleans…
# In Pandas, this comparison is performed element-wise on each row of data.
young_bool = users["age"] < 20
young_bool

0      False
1      False
2      False
3      False
4      False
5      False
6      False
7      False
8      False
9      False
10     False
11     False
12     False
13     False
14     False
15     False
16     False
17     False
18     False
19     False
20     False
21     False
22     False
23     False
24     False
25     False
26     False
27     False
28     False
29      True
       ...  
913    False
914    False
915    False
916    False
917    False
918    False
919    False
920    False
921    False
922    False
923    False
924     True
925    False
926    False
927    False
928    False
929    False
930    False
931    False
932    False
933    False
934    False
935    False
936    False
937    False
938    False
939    False
940    False
941    False
942    False
Name: age, Length: 943, dtype: bool

In [87]:
# …and use that Series to filter rows.
# In Pandas, indexing a DataFrame by a Series of Booleans only selects rows that are True in the Boolean.
users[young_bool]

Unnamed: 0,user_id,age,gender,occupation,zip_code,is_under_20
29,30,7,M,student,55436,True
35,36,19,F,student,93117,True
51,52,18,F,student,55105,True
56,57,16,M,none,84010,True
66,67,17,M,student,60402,True
67,68,19,M,student,22904,True
100,101,15,M,student,05146,True
109,110,19,M,student,77840,True
141,142,13,M,other,48118,True
178,179,15,M,entertainment,20755,True


In [88]:
# Or, combine into a single step.
users[users["age"] < 20]

Unnamed: 0,user_id,age,gender,occupation,zip_code,is_under_20
29,30,7,M,student,55436,True
35,36,19,F,student,93117,True
51,52,18,F,student,55105,True
56,57,16,M,none,84010,True
66,67,17,M,student,60402,True
67,68,19,M,student,22904,True
100,101,15,M,student,05146,True
109,110,19,M,student,77840,True
141,142,13,M,other,48118,True
178,179,15,M,entertainment,20755,True


In [89]:
# Important: Filtering with a Boolean Series returns a new DataFrame, not the original one.
# If you store the result in a variable and alter it, you alter only that result
# and not the original DataFrame itself.
# Pandas may raise a SettingWithCopyWarning to alert you to this.

# To modify the original DataFrame, it is best practice to use .loc or .iloc.

users_under20 = users[users["age"] < 20].copy()   # Calling `.copy()` makes the copy explicit and avoids the warning.
users_under20['is_under_20'] = True               # This adds a new column to the copy only.

In [90]:
users.head()

Unnamed: 0,user_id,age,gender,occupation,zip_code,is_under_20
0,1,24,M,technician,85711,False
1,2,53,F,other,94043,False
2,3,23,M,writer,32067,False
3,4,24,M,technician,43537,False
4,5,33,F,other,15213,False


In [91]:
users_under20.head()

Unnamed: 0,user_id,age,gender,occupation,zip_code,is_under_20
29,30,7,M,student,55436,True
35,36,19,F,student,93117,True
51,52,18,F,student,55105,True
56,57,16,M,none,84010,True
66,67,17,M,student,60402,True


To create the `is_under_20` column in the original `DataFrame`, we can use `.loc`.

The syntax is:

`my_dataframe.loc[<filter_condition>, <column>] = <new_value>`

In [92]:
users.loc[users["age"] < 20, "is_under_20"] = True
users.head()

Unnamed: 0,user_id,age,gender,occupation,zip_code,is_under_20
0,1,24,M,technician,85711,False
1,2,53,F,other,94043,False
2,3,23,M,writer,32067,False
3,4,24,M,technician,43537,False
4,5,33,F,other,15213,False


In [93]:
users.loc[users["age"] >= 20, "is_under_20"] = False
users.head()

Unnamed: 0,user_id,age,gender,occupation,zip_code,is_under_20
0,1,24,M,technician,85711,False
1,2,53,F,other,94043,False
2,3,23,M,writer,32067,False
3,4,24,M,technician,43537,False
4,5,33,F,other,15213,False
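The two `.loc` assignments above fill the column in two passes. As an alternative sketch (on a made-up `people` `DataFrame`, not the real `users` data), the comparison itself can be assigned directly, since it already holds a `True` or `False` for every row.

```python
import pandas as pd

# Hypothetical stand-in for the `users` data.
people = pd.DataFrame({"age": [24, 7, 19]})

# The comparison yields one Boolean per row, so it can be stored as a column.
people["is_under_20"] = people["age"] < 20

print(people["is_under_20"].tolist())
```

This also gives the column a Boolean dtype from the start.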


`.loc` is also useful if you want to filter **both** rows and columns at the same time.

In [94]:
# Select two columns from the filtered results.
users.loc[users["is_under_20"], ["occupation", "age"]]

Unnamed: 0,occupation,age
29,student,7
35,student,19
51,student,18
56,none,16
66,student,17
67,student,19
100,student,15
109,student,19
141,other,13
178,entertainment,15


In [99]:
# Or you can specify a range of row labels (with .loc, both endpoints are included)
users.loc[0:100, ["occupation", "age"]]

Unnamed: 0,occupation,age
0,technician,24
1,other,53
2,writer,23
3,technician,24
4,other,33
5,executive,42
6,administrator,57
7,administrator,36
8,student,29
9,lawyer,53
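Slicing with `.loc` works on labels and includes the end label, whereas `.iloc` works on integer positions and excludes the end position, like ordinary Python slicing. A minimal sketch on a made-up `DataFrame` shows the difference.

```python
import pandas as pd

# Hypothetical data with the default integer index (labels 0 through 4).
df = pd.DataFrame({"age": [24, 53, 23, 24, 33]})

by_label = df.loc[0:2, "age"]      # labels 0, 1, and 2 (end label included)
by_position = df.iloc[0:2, 0]      # positions 0 and 1 (end position excluded)

print(len(by_label), len(by_position))
```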


**Logical filtering with multiple conditions**

In [59]:
# Ampersand for `AND` condition. (This is a "bitwise" `AND`.)
# Important: You MUST put parentheses around each expression because `&` has a higher precedence than `<`.
users[(users["is_under_20"]) & (users["gender"] == 'M')]

Unnamed: 0,user_id,age,gender,occupation,zip_code,is_under_20
29,30,7,M,student,55436,True
56,57,16,M,none,84010,True
66,67,17,M,student,60402,True
67,68,19,M,student,22904,True
100,101,15,M,student,5146,True
109,110,19,M,student,77840,True
141,142,13,M,other,48118,True
178,179,15,M,entertainment,20755,True
220,221,19,M,student,20685,True
245,246,19,M,student,28734,True
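One way to sidestep the parentheses requirement is to give each condition its own name before combining them. This is a sketch on made-up data, and the variable names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the `users` data.
people = pd.DataFrame({
    "age": [24, 7, 19, 16],
    "gender": ["M", "M", "F", "M"],
})

is_young = people["age"] < 20        # each condition is its own Boolean Series
is_male = people["gender"] == "M"

young_men = people[is_young & is_male]
print(young_men)
```

Because each comparison is evaluated on its own line, operator precedence no longer matters when the masks are combined.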


In [101]:
users[(users["age"] < 20) & (users["gender"] == 'M')] # The same thing, but with age column rather than is_under_20 column

Unnamed: 0,user_id,age,gender,occupation,zip_code,is_under_20
29,30,7,M,student,55436,True
56,57,16,M,none,84010,True
66,67,17,M,student,60402,True
67,68,19,M,student,22904,True
100,101,15,M,student,5146,True
109,110,19,M,student,77840,True
141,142,13,M,other,48118,True
178,179,15,M,entertainment,20755,True
220,221,19,M,student,20685,True
245,246,19,M,student,28734,True


In [60]:
# Pipe for `OR` condition. (This is a "bitwise" `OR`.)
# Important: You MUST put parentheses around each expression because `|` has a higher precedence than `<`.
users[(users["is_under_20"]) | (users["age"] > 60)]

Unnamed: 0,user_id,age,gender,occupation,zip_code,is_under_20
29,30,7,M,student,55436,True
35,36,19,F,student,93117,True
51,52,18,F,student,55105,True
56,57,16,M,none,84010,True
66,67,17,M,student,60402,True
67,68,19,M,student,22904,True
100,101,15,M,student,05146,True
105,106,61,M,retired,55125,False
109,110,19,M,student,77840,True
141,142,13,M,other,48118,True


In [61]:
# Preferred alternative to multiple `OR` conditions
users[users["occupation"].isin(['doctor', 'lawyer'])]

Unnamed: 0,user_id,age,gender,occupation,zip_code,is_under_20
9,10,53,M,lawyer,90703,False
124,125,30,M,lawyer,22202,False
125,126,28,F,lawyer,20015,False
137,138,46,M,doctor,53211,False
160,161,50,M,lawyer,55104,False
204,205,47,M,lawyer,6371,False
250,251,28,M,doctor,85032,False
298,299,29,M,doctor,63108,False
338,339,35,M,lawyer,37901,False
364,365,29,M,lawyer,20009,False
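To make the equivalence concrete, the sketch below (on made-up occupations) compares `.isin()` with the chained `OR` version and also shows `~`, which negates a Boolean mask to select everything else.

```python
import pandas as pd

# Hypothetical stand-in for the `users` data.
people = pd.DataFrame({"occupation": ["doctor", "student", "lawyer", "writer"]})

with_or = people[(people["occupation"] == "doctor") | (people["occupation"] == "lawyer")]
with_isin = people[people["occupation"].isin(["doctor", "lawyer"])]
others = people[~people["occupation"].isin(["doctor", "lawyer"])]   # ~ negates the mask

print(with_isin.equals(with_or))
print(others["occupation"].tolist())
```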


**Sorting**

In [62]:
# Sort a Series.
users["age"].sort_values()

29      7
470    10
288    11
879    13
608    13
141    13
673    13
627    13
812    14
205    14
886    14
848    15
280    15
460    15
617    15
178    15
100    15
56     16
579    16
549    16
450    16
433    16
620    17
618    17
760    17
374    17
903    17
645    17
581    17
256    17
       ..
89     60
307    60
930    60
751    60
468    60
463    60
233    60
693    60
933    61
350    61
105    61
519    62
265    62
857    63
776    63
363    63
844    64
422    64
317    65
650    65
563    65
210    66
348    68
572    68
558    69
584    69
766    70
802    70
859    70
480    73
Name: age, Length: 943, dtype: int64

In [63]:
# Sort a DataFrame by a single column.
users.sort_values('age')

Unnamed: 0,user_id,age,gender,occupation,zip_code,is_under_20
29,30,7,M,student,55436,True
470,471,10,M,student,77459,True
288,289,11,M,none,94619,True
879,880,13,M,student,83702,True
608,609,13,F,student,55106,True
141,142,13,M,other,48118,True
673,674,13,F,student,55337,True
627,628,13,M,none,94306,True
812,813,14,F,student,02136,True
205,206,14,F,student,53115,True


In [64]:
# Use descending order instead.
users.sort_values('age', ascending=False)

Unnamed: 0,user_id,age,gender,occupation,zip_code,is_under_20
480,481,73,M,retired,37771,False
802,803,70,M,administrator,78212,False
766,767,70,M,engineer,00000,False
859,860,70,F,retired,48322,False
584,585,69,M,librarian,98501,False
558,559,69,M,executive,10022,False
348,349,68,M,retired,61455,False
572,573,68,M,retired,48911,False
210,211,66,M,salesman,32605,False
650,651,65,M,retired,02903,False


In [65]:
# Sort by multiple columns.
users.sort_values(['occupation', 'age'])

Unnamed: 0,user_id,age,gender,occupation,zip_code,is_under_20
117,118,21,M,administrator,90210,False
179,180,22,F,administrator,60202,False
281,282,22,M,administrator,20057,False
316,317,22,M,administrator,13210,False
438,439,23,F,administrator,20817,False
508,509,23,M,administrator,10011,False
393,394,25,M,administrator,96819,False
664,665,25,M,administrator,55412,False
725,726,25,F,administrator,80538,False
77,78,26,M,administrator,61801,False
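When sorting by several columns, `ascending` also accepts a list, one entry per column. This sketch on made-up data sorts occupations alphabetically and, within each occupation, ages from oldest to youngest.

```python
import pandas as pd

# Hypothetical stand-in for the `users` data.
people = pd.DataFrame({
    "occupation": ["writer", "artist", "artist", "writer"],
    "age": [30, 22, 45, 51],
})

# occupation ascending, then age descending within each occupation
sorted_people = people.sort_values(["occupation", "age"], ascending=[True, False])
print(sorted_people)
```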


<a id="exercise-two"></a>
### Exercise 2
Use the `drinks.csv` or `drinks` `DataFrame` from earlier to complete the following.

In [109]:
# Filter DataFrame to only include European countries.
drinks[drinks["continent"] == 'EU']

Unnamed: 0,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
1,Albania,89,132,54,4.9,EU
3,Andorra,245,138,312,12.4,EU
7,Armenia,21,179,11,3.8,EU
9,Austria,279,75,191,9.7,EU
10,Azerbaijan,21,46,5,1.3,EU
15,Belarus,142,373,42,14.4,EU
16,Belgium,295,84,212,10.5,EU
21,Bosnia-Herzegovina,76,173,8,4.6,EU
25,Bulgaria,231,252,94,10.3,EU
42,Croatia,230,87,254,10.2,EU


In [110]:
# Filter DataFrame to only include European countries with wine_servings > 300.
drinks[(drinks["continent"] == 'EU') & (drinks["wine_servings"] > 300)]

Unnamed: 0,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
3,Andorra,245,138,312,12.4,EU
61,France,127,151,370,11.8,EU
136,Portugal,194,67,339,11.0,EU


In [111]:
# Calculate the average beer_servings for all of Europe.
drinks[drinks["continent"] == 'EU']["beer_servings"].mean()

193.77777777777777

In [112]:
# Determine which 10 countries have the highest total_litres_of_pure_alcohol.
drinks.sort_values('total_litres_of_pure_alcohol', ascending=False).head(10)

Unnamed: 0,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
15,Belarus,142,373,42,14.4,EU
98,Lithuania,343,244,56,12.9,EU
3,Andorra,245,138,312,12.4,EU
68,Grenada,199,438,28,11.9,
45,Czech Republic,361,170,134,11.8,EU
61,France,127,151,370,11.8,EU
141,Russian Federation,247,326,73,11.5,AS
81,Ireland,313,118,165,11.4,EU
155,Slovakia,196,293,116,11.4,EU
99,Luxembourg,236,133,271,11.4,EU


<a id="columns"></a>
### Renaming, Adding, and Removing Columns

- **Objective:** Manipulate `DataFrame` columns.

In [113]:
# Rename one or more columns using a dictionary that maps old names to new names.
drinks.rename(columns={'beer_servings':'beer', 'wine_servings':'wine'})

Unnamed: 0,country,beer,spirit_servings,wine,total_litres_of_pure_alcohol,continent
0,Afghanistan,0,0,0,0.0,AS
1,Albania,89,132,54,4.9,EU
2,Algeria,25,0,14,0.7,AF
3,Andorra,245,138,312,12.4,EU
4,Angola,217,57,45,5.9,AF
5,Antigua & Barbuda,102,128,45,4.9,
6,Argentina,193,25,221,8.3,SA
7,Armenia,21,179,11,3.8,EU
8,Australia,261,72,212,10.4,OC
9,Austria,279,75,191,9.7,EU


Where are the renamed columns? By default, `rename` returns a new `DataFrame` and leaves `drinks` unchanged:

In [114]:
drinks.head()

Unnamed: 0,country,beer_servings,spirit_servings,wine_servings,total_litres_of_pure_alcohol,continent
0,Afghanistan,0,0,0,0.0,AS
1,Albania,89,132,54,4.9,EU
2,Algeria,25,0,14,0.7,AF
3,Andorra,245,138,312,12.4,EU
4,Angola,217,57,45,5.9,AF


In [115]:
# Rename one or more columns in the original DataFrame.
drinks.rename(columns={'beer_servings':'beer', 'wine_servings':'wine'}, inplace=True)

drinks.head()

Unnamed: 0,country,beer,spirit_servings,wine,total_litres_of_pure_alcohol,continent
0,Afghanistan,0,0,0,0.0,AS
1,Albania,89,132,54,4.9,EU
2,Algeria,25,0,14,0.7,AF
3,Andorra,245,138,312,12.4,EU
4,Angola,217,57,45,5.9,AF


Alternatively, you could reassign the new `DataFrame` to the same variable:

```python
drinks = drinks.rename(columns={'beer_servings':'beer', 'wine_servings':'wine'})
```

In [116]:
# Replace all column names using a list of matching length
drink_cols = ['country', 'beer', 'spirit', 'wine', 'litres', 'continent'] 

# Replace during file reading (header=0 marks the first row as the header so that `names` replaces it)
drinks_renamed = pd.read_csv('data/drinks.csv', header=0, names=drink_cols)
drinks_renamed.head()

Unnamed: 0,country,beer,spirit,wine,litres,continent
0,Afghanistan,0,0,0,0.0,AS
1,Albania,89,132,54,4.9,EU
2,Algeria,25,0,14,0.7,AF
3,Andorra,245,138,312,12.4,EU
4,Angola,217,57,45,5.9,AF


In [117]:
# Replace after file has already been read into Python.
drinks.columns = drink_cols

drinks.head()

Unnamed: 0,country,beer,spirit,wine,litres,continent
0,Afghanistan,0,0,0,0.0,AS
1,Albania,89,132,54,4.9,EU
2,Algeria,25,0,14,0.7,AF
3,Andorra,245,138,312,12.4,EU
4,Angola,217,57,45,5.9,AF


**Easy Column Operations**<br>
Pandas applies arithmetic between columns element-wise, so adding two columns adds their values row by row. There is no need to write a `for` loop or reference row indexes yourself.

In [118]:
# Add a new column as a function of existing columns.
drinks['servings'] = drinks["beer"] + drinks["spirit"] + drinks["wine"]
drinks['mL'] = drinks["litres"] * 1000

drinks.head()

Unnamed: 0,country,beer,spirit,wine,litres,continent,servings,mL
0,Afghanistan,0,0,0,0.0,AS,0,0.0
1,Albania,89,132,54,4.9,EU,275,4900.0
2,Algeria,25,0,14,0.7,AF,39,700.0
3,Andorra,245,138,312,12.4,EU,695,12400.0
4,Angola,217,57,45,5.9,AF,319,5900.0
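If you would rather not modify the `DataFrame` in place, `assign()` offers an alternative. The sketch below uses a made-up two-row frame (`d` is a hypothetical stand-in for `drinks`) and returns a new `DataFrame` with the extra column while leaving the original untouched.

```python
import pandas as pd

# Hypothetical stand-in for the `drinks` data.
d = pd.DataFrame({"beer": [0, 89], "spirit": [0, 132], "wine": [0, 54]})

# assign() returns a new DataFrame with the extra column; `d` itself is unchanged.
with_total = d.assign(servings=d["beer"] + d["spirit"] + d["wine"])

print(with_total["servings"].tolist())
```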


**Removing Columns**

In [98]:
# axis=0 for rows, 1 for columns
drinks.drop('mL', axis=1)

Unnamed: 0,country,beer,spirit,wine,litres,continent,servings
0,Afghanistan,0,0,0,0.0,AS,0
1,Albania,89,132,54,4.9,EU,275
2,Algeria,25,0,14,0.7,AF,39
3,Andorra,245,138,312,12.4,EU,695
4,Angola,217,57,45,5.9,AF,319
5,Antigua & Barbuda,102,128,45,4.9,,275
6,Argentina,193,25,221,8.3,SA,439
7,Armenia,21,179,11,3.8,EU,211
8,Australia,261,72,212,10.4,OC,545
9,Austria,279,75,191,9.7,EU,545


In [99]:
drinks.head()

Unnamed: 0,country,beer,spirit,wine,litres,continent,servings,mL
0,Afghanistan,0,0,0,0.0,AS,0,0.0
1,Albania,89,132,54,4.9,EU,275,4900.0
2,Algeria,25,0,14,0.7,AF,39,700.0
3,Andorra,245,138,312,12.4,EU,695,12400.0
4,Angola,217,57,45,5.9,AF,319,5900.0


In [100]:
# Drop on the original DataFrame rather than returning a new one.
drinks.drop('mL', axis=1, inplace=True)

drinks.head()

Unnamed: 0,country,beer,spirit,wine,litres,continent,servings
0,Afghanistan,0,0,0,0.0,AS,0
1,Albania,89,132,54,4.9,EU,275
2,Algeria,25,0,14,0.7,AF,39
3,Andorra,245,138,312,12.4,EU,695
4,Angola,217,57,45,5.9,AF,319


In [101]:
# Drop multiple columns.
drinks.drop(['beer', 'servings'], axis=1)

Unnamed: 0,country,spirit,wine,litres,continent
0,Afghanistan,0,0,0.0,AS
1,Albania,132,54,4.9,EU
2,Algeria,0,14,0.7,AF
3,Andorra,138,312,12.4,EU
4,Angola,57,45,5.9,AF
5,Antigua & Barbuda,128,45,4.9,
6,Argentina,25,221,8.3,SA
7,Armenia,179,11,3.8,EU
8,Australia,72,212,10.4,OC
9,Austria,75,191,9.7,EU
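Recent versions of Pandas also accept a `columns=` keyword, which avoids having to remember which `axis` number means what. A minimal sketch on a made-up frame (`d` is a hypothetical stand-in for `drinks`):

```python
import pandas as pd

# Hypothetical stand-in for the `drinks` data.
d = pd.DataFrame({"beer": [0, 89], "wine": [0, 54], "mL": [0.0, 4900.0]})

trimmed = d.drop(columns=["mL"])     # same result as d.drop("mL", axis=1)

print(list(trimmed.columns))
```

As with `axis=1`, this returns a new `DataFrame` unless the result is assigned back.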
