## About DataFrames

A Pandas `DataFrame`, like a spreadsheet, is made up of columns and rows. Each column is a `pandas.Series` object. A `DataFrame` is, in some ways, similar to a two-dimensional NumPy array, with labels for the columns and index. Unlike a NumPy array, however, a `DataFrame` can contain different data types. You can think of a `pandas.Series` object as a one-dimensional NumPy array with labels. The `pandas.Series` object, like a NumPy array, can contain only one data type. It supports many of the same methods you have seen with arrays, such as `min()`, `max()`, `mean()`, and `median()`.

The usual convention is to import the Pandas package aliased as `pd`:

In [1]:
import pandas as pd
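The relationship between a `Series` and a `DataFrame` described above can be sketched as follows. This is a minimal illustration; the sample values and the names `ages` and `people` are assumptions made for the example.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
ages = pd.Series([43, 23, 78], index=["a", "b", "c"], name="ages")

# A Series holds a single data type and supports array-style reductions.
age_stats = (ages.min(), ages.max(), ages.mean(), ages.median())
print(age_stats)

# A DataFrame can hold a different data type in each column.
people = pd.DataFrame({"first": ["shanda", "rolly", "molly"], "ages": [43, 23, 78]})
print(people.dtypes)

# Selecting one column returns a pandas.Series.
first_column = people["first"]
print(type(first_column))
```

The `dtypes` output lists one type per column, which is the sense in which a `DataFrame` can mix data types while each `Series` cannot.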

## Creating DataFrames

You can create `DataFrames` with data from many sources, including dictionaries, lists, and, more commonly, files. You can create an empty `DataFrame` by using the `DataFrame` constructor:

In [3]:
df = pd.DataFrame()
print(df)

Empty DataFrame
Columns: []
Index: []
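If you need to check programmatically whether a `DataFrame` is empty, rather than reading the printed output, the `empty` and `shape` attributes are one option. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame()

# empty is True when the DataFrame has no elements.
print(df.empty)  # True

# shape is a (rows, columns) tuple.
print(df.shape)  # (0, 0)
```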


### Creating a DataFrame from a Dictionary

You can create `DataFrames` from a list of dictionaries or from a single dictionary in which each key is a column label and each value holds the data for that column. The next listing creates a list of data for each column, builds a dictionary with the column names as keys and these lists as values, and then passes this dictionary to the `DataFrame` constructor.



**Creating a DataFrame from a Dictionary**

In [4]:
first_names = ["shanda", "rolly", "molly", "frank", "rip", "steven", "gwen", "arthur"]

last_names = [
    "smith",
    "brocker",
    "stein",
    "bach",
    "spencer",
    "de wilde",
    "mason",
    "davis",
]

ages = [43, 23, 78, 56, 26, 14, 46, 92]
data = {"first": first_names, "last": last_names, "ages": ages}

participants = pd.DataFrame(data)


In [5]:
participants

Unnamed: 0,first,last,ages
0,shanda,smith,43
1,rolly,brocker,23
2,molly,stein,78
3,frank,bach,56
4,rip,spencer,26
5,steven,de wilde,14
6,gwen,mason,46
7,arthur,davis,92


You can see the column labels across the top, the data in each row, and the index labels to the left.
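The listing above uses a single dictionary of columns. The other form mentioned, a list of dictionaries, treats each dictionary as one row. The following sketch assumes a shortened version of the same data and deliberately omits one value to show how a missing key is handled:

```python
import pandas as pd

# Each dictionary is one row; the keys become the column labels.
rows = [
    {"first": "shanda", "last": "smith", "ages": 43},
    {"first": "rolly", "last": "brocker", "ages": 23},
    {"first": "molly", "last": "stein"},  # "ages" deliberately left out
]

from_rows = pd.DataFrame(rows)
print(from_rows)

# A key missing from a row is filled with NaN in that row.
print(from_rows["ages"].isna().tolist())  # [False, False, True]
```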

**Creating a DataFrame from a List of Lists**

You can create a list of lists, with each sublist containing the data for one row, in the order of the columns:

In [6]:
data = [
    ["shanda", "smith", 43],
    ["rolly", "brocker", 23],
    ["molly", "stein", 78],
    ["frank", "bach", 56],
    ["rip", "spencer", 26],
    ["steven", "de wilde", 14],
    ["gwen", "mason", 46],
    ["arthur", "davis", 92],
]


Then you can use this as the data argument:

In [7]:
participants = pd.DataFrame(data)
participants

Unnamed: 0,0,1,2
0,shanda,smith,43
1,rolly,brocker,23
2,molly,stein,78
3,frank,bach,56
4,rip,spencer,26
5,steven,de wilde,14
6,gwen,mason,46
7,arthur,davis,92


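Notice that the resulting `DataFrame` has been created with integer column names. This is the default if no column names are supplied. You can supply column names explicitly as a list of strings by using the `columns` argument and, in the same way, supply index labels by using the `index` argument: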

In [8]:
column_names = ["first", "last", "ages"]
index_labels = ["a", "b", "c", "d", "e", "f", "g", "h"]

participants = pd.DataFrame(data, columns=column_names, index=index_labels)
participants


Unnamed: 0,first,last,ages
a,shanda,smith,43
b,rolly,brocker,23
c,molly,stein,78
d,frank,bach,56
e,rip,spencer,26
f,steven,de wilde,14
g,gwen,mason,46
h,arthur,davis,92


**Creating a DataFrame from a File**

While creating `DataFrames` from dictionaries and lists is possible, the vast majority of the time you will create `DataFrames` from existing data sources. Files are the most common of these data sources. Pandas supplies functions for creating `DataFrames` from many common sources, including CSV, Excel, HTML, and JSON files as well as SQL database connections.

Say that you want to open a CSV file from the FiveThirtyEight website, https://data.fivethirtyeight.com, under the data set college_majors. After you unzip and upload the CSV file to Colab, you open it by simply supplying its path to the Pandas `read_csv` function:

In [9]:
college_majors = pd.read_csv('../data/college-majors/all-ages.csv')
college_majors

Unnamed: 0,Major_code,Major,Major_category,Total,Employed,Employed_full_time_year_round,Unemployed,Unemployment_rate,Median,P25th,P75th
0,1100,GENERAL AGRICULTURE,Agriculture & Natural Resources,128148,90245,74078,2423,0.026147,50000,34000,80000.0
1,1101,AGRICULTURE PRODUCTION AND MANAGEMENT,Agriculture & Natural Resources,95326,76865,64240,2266,0.028636,54000,36000,80000.0
2,1102,AGRICULTURAL ECONOMICS,Agriculture & Natural Resources,33955,26321,22810,821,0.030248,63000,40000,98000.0
3,1103,ANIMAL SCIENCES,Agriculture & Natural Resources,103549,81177,64937,3619,0.042679,46000,30000,72000.0
4,1104,FOOD SCIENCE,Agriculture & Natural Resources,24280,17281,12722,894,0.049188,62000,38500,90000.0
...,...,...,...,...,...,...,...,...,...,...,...
168,6211,HOSPITALITY MANAGEMENT,Business,200854,163393,122499,8862,0.051447,49000,33000,70000.0
169,6212,MANAGEMENT INFORMATION SYSTEMS AND STATISTICS,Business,156673,134478,118249,6186,0.043977,72000,50000,100000.0
170,6299,MISCELLANEOUS BUSINESS & MEDICAL ADMINISTRATION,Business,102753,77471,61603,4308,0.052679,53000,36000,83000.0
171,6402,HISTORY,Humanities & Liberal Arts,712509,478416,354163,33725,0.065851,50000,35000,80000.0


Pandas uses the data in the CSV file to determine column labels and column types.
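To see this inference without downloading the file, you can read a few lines of CSV text from memory. The sketch below inlines three rows and three columns taken from the output above; using `io.StringIO` in place of a file path is an assumption made to keep the example self-contained.

```python
import io

import pandas as pd

# A short extract of the data, inlined so the sketch runs without a file.
csv_text = """Major_code,Major,Unemployment_rate
1100,GENERAL AGRICULTURE,0.026147
1101,AGRICULTURE PRODUCTION AND MANAGEMENT,0.028636
1102,AGRICULTURAL ECONOMICS,0.030248
"""

sample = pd.read_csv(io.StringIO(csv_text))

# The header line supplies the column labels.
print(sample.columns.tolist())

# Each column's type is inferred from its values.
print(sample.dtypes)
```

Here `Major_code` is read as an integer column and `Unemployment_rate` as a floating-point column, while `Major` is read as text.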

## Interacting with DataFrame Data

Once you have data loaded into a `DataFrame`, you should take a look at it. Pandas offers numerous ways of accessing data in a `DataFrame`. You can look at data by rows, columns, individual cells, or some combination of these. You can also extract data based on its value.

### Heads and Tails

In [10]:
college_majors.head()

Unnamed: 0,Major_code,Major,Major_category,Total,Employed,Employed_full_time_year_round,Unemployed,Unemployment_rate,Median,P25th,P75th
0,1100,GENERAL AGRICULTURE,Agriculture & Natural Resources,128148,90245,74078,2423,0.026147,50000,34000,80000.0
1,1101,AGRICULTURE PRODUCTION AND MANAGEMENT,Agriculture & Natural Resources,95326,76865,64240,2266,0.028636,54000,36000,80000.0
2,1102,AGRICULTURAL ECONOMICS,Agriculture & Natural Resources,33955,26321,22810,821,0.030248,63000,40000,98000.0
3,1103,ANIMAL SCIENCES,Agriculture & Natural Resources,103549,81177,64937,3619,0.042679,46000,30000,72000.0
4,1104,FOOD SCIENCE,Agriculture & Natural Resources,24280,17281,12722,894,0.049188,62000,38500,90000.0


The `head` method takes an optional argument, which specifies the number of rows to return. You would specify the top three rows like this:

In [11]:
college_majors.head(3)

Unnamed: 0,Major_code,Major,Major_category,Total,Employed,Employed_full_time_year_round,Unemployed,Unemployment_rate,Median,P25th,P75th
0,1100,GENERAL AGRICULTURE,Agriculture & Natural Resources,128148,90245,74078,2423,0.026147,50000,34000,80000.0
1,1101,AGRICULTURE PRODUCTION AND MANAGEMENT,Agriculture & Natural Resources,95326,76865,64240,2266,0.028636,54000,36000,80000.0
2,1102,AGRICULTURAL ECONOMICS,Agriculture & Natural Resources,33955,26321,22810,821,0.030248,63000,40000,98000.0


The `tail` method works in a similar way to `head` but returns rows from the bottom. It also takes an optional argument that specifies the number of rows to return:

In [12]:
college_majors.tail()

Unnamed: 0,Major_code,Major,Major_category,Total,Employed,Employed_full_time_year_round,Unemployed,Unemployment_rate,Median,P25th,P75th
168,6211,HOSPITALITY MANAGEMENT,Business,200854,163393,122499,8862,0.051447,49000,33000,70000.0
169,6212,MANAGEMENT INFORMATION SYSTEMS AND STATISTICS,Business,156673,134478,118249,6186,0.043977,72000,50000,100000.0
170,6299,MISCELLANEOUS BUSINESS & MEDICAL ADMINISTRATION,Business,102753,77471,61603,4308,0.052679,53000,36000,83000.0
171,6402,HISTORY,Humanities & Liberal Arts,712509,478416,354163,33725,0.065851,50000,35000,80000.0
172,6403,UNITED STATES HISTORY,Humanities & Liberal Arts,17746,11887,8204,943,0.0735,50000,39000,81000.0


### Descriptive Statistics

Once I’ve taken a look at some rows from a `DataFrame`, I like to get a sense of the shape of the data. One tool for doing this is the `DataFrame` `describe` method, which produces various descriptive statistics about the data. You can call `describe` with no arguments, as shown here:

In [13]:
college_majors.describe()

Unnamed: 0,Major_code,Total,Employed,Employed_full_time_year_round,Unemployed,Unemployment_rate,Median,P25th,P75th
count,173.0,173.0,173.0,173.0,173.0,173.0,173.0,173.0,173.0
mean,3879.815029,230256.6,166162.0,126307.8,9725.034682,0.057355,56816.184971,38697.109827,82506.358382
std,1687.75314,422068.5,307324.4,242425.4,18022.040192,0.019177,14706.226865,9414.524761,20805.330126
min,1100.0,2396.0,1492.0,1093.0,0.0,0.0,35000.0,24900.0,45800.0
25%,2403.0,24280.0,17281.0,12722.0,1101.0,0.046261,46000.0,32000.0,70000.0
50%,3608.0,75791.0,56564.0,39613.0,3619.0,0.054719,53000.0,36000.0,80000.0
75%,5503.0,205763.0,142879.0,111025.0,8862.0,0.069043,65000.0,42000.0,95000.0
max,6403.0,3123510.0,2354398.0,1939384.0,147261.0,0.156147,125000.0,78000.0,210000.0


This method calculates the count, mean, standard deviation, minimum, maximum, and quantiles for columns containing **numeric data**. It accepts optional arguments to control which data types are processed and the ranges of the quantiles returned. To change the quantiles, you use the `percentiles` argument:

In [14]:
college_majors.describe(percentiles=[0.1, 0.9])

Unnamed: 0,Major_code,Total,Employed,Employed_full_time_year_round,Unemployed,Unemployment_rate,Median,P25th,P75th
count,173.0,173.0,173.0,173.0,173.0,173.0,173.0,173.0,173.0
mean,3879.815029,230256.6,166162.0,126307.8,9725.034682,0.057355,56816.184971,38697.109827,82506.358382
std,1687.75314,422068.5,307324.4,242425.4,18022.040192,0.019177,14706.226865,9414.524761,20805.330126
min,1100.0,2396.0,1492.0,1093.0,0.0,0.0,35000.0,24900.0,45800.0
10%,2020.8,9775.6,6969.4,5270.0,433.0,0.037053,42000.0,30000.0,60000.0
50%,3608.0,75791.0,56564.0,39613.0,3619.0,0.054719,53000.0,36000.0,80000.0
90%,6108.8,673975.8,474567.6,360475.0,31941.2,0.080062,77400.0,50800.0,106000.0
max,6403.0,3123510.0,2354398.0,1939384.0,147261.0,0.156147,125000.0,78000.0,210000.0


This example specifies percentiles for 10% and 90% rather than the default 25% and 75%. Note that 50% is inserted regardless of the argument.

If you want to see statistics calculated from **nonnumeric columns**, you can specify which data types are processed. You do this by using the `include` keyword. The value passed to this keyword should be a sequence of data types, which can be NumPy data types, such as `np.object_`. (The older alias `np.object` was deprecated in NumPy 1.20 and is no longer available in later releases.) In Pandas, strings are of type `object`, so the following includes columns with string data types:

In [15]:
import numpy as np

college_majors.describe(include=[np.object_])


Unnamed: 0,Major,Major_category
count,173,173
unique,173,16
top,COMPUTER AND INFORMATION SYSTEMS,Engineering
freq,1,29


The `include` keyword also accepts the string name of a data type; for the NumPy object type, this name is `'object'`. The following returns the same statistics:

In [16]:
college_majors.describe(include=['object'])

Unnamed: 0,Major,Major_category
count,173,173
unique,173,16
top,COMPUTER AND INFORMATION SYSTEMS,Engineering
freq,1,29


So, for strings, you get the count, the number of unique values, the top value, and the frequency of this top value.
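The four string statistics can be checked on a small frame. This is a sketch on hypothetical data; `dtype=object` is set explicitly as an assumption, so that the text column is selected by `include=['object']` whatever default string type your Pandas version uses.

```python
import pandas as pd

# Hypothetical data; the text column is stored with an explicit object dtype.
majors = pd.DataFrame(
    {
        "Major_category": pd.Series(
            ["Business", "Business", "Education", "Health"], dtype=object
        ),
        "Total": [200854, 156673, 1438867, 1769892],
    }
)

# Only the object column is summarized; the numeric column is skipped.
text_stats = majors.describe(include=["object"])
print(text_stats)
```

With this data, `count` is 4, `unique` is 3, `top` is `Business`, and `freq` is 2.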

You can pass the string `'all'` instead of a list of data types to produce statistics for all the columns:

In [17]:
college_majors.describe(include='all')

Unnamed: 0,Major_code,Major,Major_category,Total,Employed,Employed_full_time_year_round,Unemployed,Unemployment_rate,Median,P25th,P75th
count,173.0,173,173,173.0,173.0,173.0,173.0,173.0,173.0,173.0,173.0
unique,,173,16,,,,,,,,
top,,COMPUTER AND INFORMATION SYSTEMS,Engineering,,,,,,,,
freq,,1,29,,,,,,,,
mean,3879.815029,,,230256.6,166162.0,126307.8,9725.034682,0.057355,56816.184971,38697.109827,82506.358382
std,1687.75314,,,422068.5,307324.4,242425.4,18022.040192,0.019177,14706.226865,9414.524761,20805.330126
min,1100.0,,,2396.0,1492.0,1093.0,0.0,0.0,35000.0,24900.0,45800.0
25%,2403.0,,,24280.0,17281.0,12722.0,1101.0,0.046261,46000.0,32000.0,70000.0
50%,3608.0,,,75791.0,56564.0,39613.0,3619.0,0.054719,53000.0,36000.0,80000.0
75%,5503.0,,,205763.0,142879.0,111025.0,8862.0,0.069043,65000.0,42000.0,95000.0


In [18]:
college_majors.describe(exclude=['int'])

Unnamed: 0,Major,Major_category,Unemployment_rate,P75th
count,173,173,173.0,173.0
unique,173,16,,
top,COMPUTER AND INFORMATION SYSTEMS,Engineering,,
freq,1,29,,
mean,,,0.057355,82506.358382
std,,,0.019177,20805.330126
min,,,0.0,45800.0
25%,,,0.046261,70000.0
50%,,,0.054719,80000.0
75%,,,0.069043,95000.0


### Accessing Data

In [19]:
participants

Unnamed: 0,first,last,ages
a,shanda,smith,43
b,rolly,brocker,23
c,molly,stein,78
d,frank,bach,56
e,rip,spencer,26
f,steven,de wilde,14
g,gwen,mason,46
h,arthur,davis,92


In [20]:
participants['first']

a    shanda
b     rolly
c     molly
d     frank
e       rip
f    steven
g      gwen
h    arthur
Name: first, dtype: object

In [21]:
participants.ages

a    43
b    23
c    78
d    56
e    26
f    14
g    46
h    92
Name: ages, dtype: int64

Attribute access of this kind would not work with the columns `first` or `last`, as these names already exist as attributes of the `DataFrame`.
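The same kind of name clash can be sketched with a hypothetical column called `count`, which matches the existing `DataFrame.count` method. The frame and column names here are assumptions chosen for the illustration.

```python
import pandas as pd

# Hypothetical frame whose column name "count" matches a DataFrame method.
scores = pd.DataFrame({"count": [3, 5, 8], "ages": [43, 23, 78]})

by_attribute = scores.count    # the DataFrame.count method, not the column
by_bracket = scores["count"]   # the column, as a pandas.Series

print(callable(by_attribute))  # True
print(by_bracket.tolist())     # [3, 5, 8]
```

Bracket access always refers to the column, so it is the safer choice when a column name might collide with an attribute.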

In [26]:
participants[["first", "last"]]

Unnamed: 0,first,last
a,shanda,smith
b,rolly,brocker
c,molly,stein
d,frank,bach
e,rip,spencer
f,steven,de wilde
g,gwen,mason
h,arthur,davis


In [27]:
participants[3:6]

Unnamed: 0,first,last,ages
d,frank,bach,56
e,rip,spencer,26
f,steven,de wilde,14


In [28]:
participants['a':'c']

Unnamed: 0,first,last,ages
a,shanda,smith,43
b,rolly,brocker,23
c,molly,stein,78


In [29]:
mask = [False, True, True, False, False, True, False, False]
participants[mask]

Unnamed: 0,first,last,ages
b,rolly,brocker,23
c,molly,stein,78
f,steven,de wilde,14


**Optimized Access by Label**

The bracket syntax provides a very convenient and easy-to-read way to access data. It is often used in interactive sessions when experimenting with and exploring `DataFrames`, but it is not optimized for performance with large data sets. The recommended way to index into `DataFrames` in production code or for large data sets is to use the `DataFrame` `loc` and `iloc` indexers. These indexers use a bracket syntax very similar to what you have seen here. The `loc` indexer indexes using labels, and `iloc` uses index positions.

In [30]:
participants.loc['c']

first    molly
last     stein
ages        78
Name: c, dtype: object

In [31]:
participants.loc['c':'f']

Unnamed: 0,first,last,ages
c,molly,stein,78
d,frank,bach,56
e,rip,spencer,26
f,steven,de wilde,14


In [32]:
mask = [False, True, True, False, False, True, False, False]
participants.loc[mask]

Unnamed: 0,first,last,ages
b,rolly,brocker,23
c,molly,stein,78
f,steven,de wilde,14


In [33]:
participants.loc[:, 'first']

a    shanda
b     rolly
c     molly
d     frank
e       rip
f    steven
g      gwen
h    arthur
Name: first, dtype: object

In [34]:
participants.loc[:'c', ['ages', 'last']]

Unnamed: 0,ages,last
a,43,smith
b,23,brocker
c,78,stein


In [35]:
participants.loc[:'c', [False, True, True]]

Unnamed: 0,last,ages
a,smith,43
b,brocker,23
c,stein,78


**Optimized Access by Index**

The `iloc` indexer enables you to use index positions to select rows and columns. Much as you’ve seen before with brackets, you can use a single value to specify a single row:

In [36]:
participants.iloc[3]

first    frank
last      bach
ages        56
Name: d, dtype: object

In [37]:
participants.iloc[1:4]

Unnamed: 0,first,last,ages
b,rolly,brocker,23
c,molly,stein,78
d,frank,bach,56


In [38]:
participants.iloc[1:4, :2]

Unnamed: 0,first,last
b,rolly,brocker
c,molly,stein
d,frank,bach
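One difference between the two indexers is easy to miss in the outputs above: a label slice with `loc` includes its end label, while a position slice with `iloc` stops before its end position. A minimal sketch, using a shortened, hypothetical version of the participants data:

```python
import pandas as pd

letters = pd.DataFrame(
    {"first": ["shanda", "rolly", "molly", "frank"], "ages": [43, 23, 78, 56]},
    index=["a", "b", "c", "d"],
)

# Label slice: the end label "d" is included.
by_label = letters.loc["b":"d"]
print(by_label.index.tolist())     # ['b', 'c', 'd']

# Position slice: stops before position 3, as with Python lists.
by_position = letters.iloc[1:3]
print(by_position.index.tolist())  # ['b', 'c']
```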


**Masking and Filtering**

A powerful feature of `DataFrames` is the ability to select data based on values. You can use comparison operators with columns to see which values meet some condition. For example, if you want to see which rows of the `college_majors` `DataFrame` have the value Humanities & Liberal Arts as a major category, you can use the equality operator (`==`):

In [39]:
college_majors.Major_category == 'Humanities & Liberal Arts'

0      False
1      False
2      False
3      False
4      False
       ...  
168    False
169    False
170    False
171     True
172     True
Name: Major_category, Length: 173, dtype: bool

This produces a `pandas.Series` object that contains `True` for every row that matches the condition. A series of Booleans is mildly interesting, but the real power comes when you combine it with an indexer to filter the results. Remember that `loc` returns rows for every `True` value of an input sequence. You can make a condition based on a comparison operator and a column, for example, as shown here for the greater-than operator and the column `Total`:

In [40]:
total_mask = college_majors.loc[:, 'Total'] > 1200000

In [41]:
top_majors = college_majors.loc[total_mask]
top_majors

Unnamed: 0,Major_code,Major,Major_category,Total,Employed,Employed_full_time_year_round,Unemployed,Unemployment_rate,Median,P25th,P75th
25,2300,GENERAL EDUCATION,Education,1438867,843693,591863,38742,0.043904,43000,32000,59000.0
28,2304,ELEMENTARY EDUCATION,Education,1446701,819393,501786,32685,0.038359,40000,31000,50000.0
114,5200,PSYCHOLOGY,Psychology & Social Work,1484075,1055854,736817,79066,0.069667,45000,31000,68000.0
153,6107,NURSING,Health,1769892,1325711,947546,36503,0.026797,62000,48000,80000.0
158,6200,GENERAL BUSINESS,Business,2148712,1580978,1304646,85626,0.051378,60000,40000,95000.0
159,6201,ACCOUNTING,Business,1779219,1335825,1095027,75379,0.053415,65000,42500,100000.0
161,6203,BUSINESS MANAGEMENT AND ADMINISTRATION,Business,3123510,2354398,1939384,147261,0.058865,58000,39500,86000.0


In [42]:
top_majors.Total.min()

1438867

In [43]:
college_majors.Unemployment_rate.describe()

count    173.000000
mean       0.057355
std        0.019177
min        0.000000
25%        0.046261
50%        0.054719
75%        0.069043
max        0.156147
Name: Unemployment_rate, dtype: float64

In [44]:
employ_rate_mask = college_majors.loc[:, 'Unemployment_rate'] <= 0.046261

In [45]:
employ_rate_majors = college_majors.loc[employ_rate_mask]

In [46]:
employ_rate_majors.Major_category.unique()

array(['Agriculture & Natural Resources', 'Education', 'Engineering',
       'Biology & Life Science', 'Computers & Mathematics',
       'Humanities & Liberal Arts', 'Physical Sciences', 'Health',
       'Business'], dtype=object)
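The mask-then-filter pattern used above can be tried on a frame small enough to check by eye. The sketch below assumes three rows and two columns taken from the `college_majors` output; the name `top_names` is hypothetical.

```python
import pandas as pd

# Three rows from the data above, reduced to the Major and Total columns.
majors = pd.DataFrame(
    {
        "Major": ["GENERAL EDUCATION", "FOOD SCIENCE", "NURSING"],
        "Total": [1438867, 24280, 1769892],
    }
)

# The comparison produces one Boolean per row.
total_mask = majors["Total"] > 1200000
print(total_mask.tolist())  # [True, False, True]

# loc keeps the rows where the mask is True; a column can be chosen at the same time.
top_names = majors.loc[total_mask, "Major"]
print(top_names.tolist())   # ['GENERAL EDUCATION', 'NURSING']
```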

**Pandas Boolean Operators**

You can use the three Boolean operators AND (`&`), OR (`|`), and NOT (`~`) with the results of your conditions. You can use `&` or `|` to combine conditions and create more complex ones. You can use `~` to create a mask that is the opposite of your condition.

For example, you can use AND to create a new mask based on the previous ones to see which of the most popular majors have a low unemployment rate. To do this, you use the `&` operator between your existing masks to produce a new one:

In [47]:
total_rate_mask = employ_rate_mask & total_mask
total_rate_mask

0      False
1      False
2      False
3      False
4      False
       ...  
168    False
169    False
170    False
171    False
172    False
Length: 173, dtype: bool

In [48]:
college_majors.loc[total_rate_mask]

Unnamed: 0,Major_code,Major,Major_category,Total,Employed,Employed_full_time_year_round,Unemployed,Unemployment_rate,Median,P25th,P75th
25,2300,GENERAL EDUCATION,Education,1438867,843693,591863,38742,0.043904,43000,32000,59000.0
28,2304,ELEMENTARY EDUCATION,Education,1446701,819393,501786,32685,0.038359,40000,31000,50000.0
153,6107,NURSING,Health,1769892,1325711,947546,36503,0.026797,62000,48000,80000.0


You can use the `~` operator with your unemployment rate mask to create a `DataFrame` whose rows all have an unemployment rate higher than the bottom quartile:

In [49]:
lower_rate_mask = ~employ_rate_mask
lower_rate_majors = college_majors.loc[lower_rate_mask]

In [50]:
lower_rate_majors.Unemployment_rate.min()

0.046261361

In [51]:
college_majors.loc[total_mask | employ_rate_mask]

Unnamed: 0,Major_code,Major,Major_category,Total,Employed,Employed_full_time_year_round,Unemployed,Unemployment_rate,Median,P25th,P75th
0,1100,GENERAL AGRICULTURE,Agriculture & Natural Resources,128148,90245,74078,2423,0.026147,50000,34000,80000.0
1,1101,AGRICULTURE PRODUCTION AND MANAGEMENT,Agriculture & Natural Resources,95326,76865,64240,2266,0.028636,54000,36000,80000.0
2,1102,AGRICULTURAL ECONOMICS,Agriculture & Natural Resources,33955,26321,22810,821,0.030248,63000,40000,98000.0
3,1103,ANIMAL SCIENCES,Agriculture & Natural Resources,103549,81177,64937,3619,0.042679,46000,30000,72000.0
5,1105,PLANT SCIENCE AND AGRONOMY,Agriculture & Natural Resources,79409,63043,51077,2070,0.031791,50000,35000,75000.0
7,1199,MISCELLANEOUS AGRICULTURE,Agriculture & Natural Resources,8549,6392,5074,261,0.03923,52000,35000,75000.0
9,1302,FORESTRY,Agriculture & Natural Resources,69447,48228,39613,2144,0.042563,58000,40500,80000.0
25,2300,GENERAL EDUCATION,Education,1438867,843693,591863,38742,0.043904,43000,32000,59000.0
26,2301,EDUCATIONAL ADMINISTRATION AND SUPERVISION,Education,4037,3113,2468,0,0.0,58000,44750,79000.0
28,2304,ELEMENTARY EDUCATION,Education,1446701,819393,501786,32685,0.038359,40000,31000,50000.0
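The examples above combine masks that were stored in variables first. If you write the conditions inline instead, each comparison needs its own parentheses, because `&` and `|` bind more tightly than comparison operators in Python. A sketch under that assumption, using four rows taken from the data above:

```python
import pandas as pd

majors = pd.DataFrame(
    {
        "Major": ["GENERAL EDUCATION", "PSYCHOLOGY", "NURSING", "FOOD SCIENCE"],
        "Total": [1438867, 1484075, 1769892, 24280],
        "Unemployment_rate": [0.043904, 0.069667, 0.026797, 0.049188],
    }
)

# AND: popular majors that also have a low unemployment rate.
popular_and_low = majors.loc[
    (majors["Total"] > 1200000) & (majors["Unemployment_rate"] <= 0.046261)
]

# OR: majors that meet either condition.
popular_or_low = majors.loc[
    (majors["Total"] > 1200000) | (majors["Unemployment_rate"] <= 0.046261)
]

# NOT: majors that are not in the popular group.
not_popular = majors.loc[~(majors["Total"] > 1200000)]

print(popular_and_low["Major"].tolist())
print(popular_or_low["Major"].tolist())
print(not_popular["Major"].tolist())
```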


## Manipulating DataFrames



In [52]:
participants.columns

Index(['first', 'last', 'ages'], dtype='object')

In [53]:
participants.rename(columns={'ages': 'Age'})

Unnamed: 0,first,last,Age
a,shanda,smith,43
b,rolly,brocker,23
c,molly,stein,78
d,frank,bach,56
e,rip,spencer,26
f,steven,de wilde,14
g,gwen,mason,46
h,arthur,davis,92


In [54]:
participants.columns

Index(['first', 'last', 'ages'], dtype='object')

In [55]:
participants.rename(columns={'ages':'Age'}, inplace=True)
participants.columns

Index(['first', 'last', 'Age'], dtype='object')
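As the outputs above show, `rename` leaves the original `DataFrame` unchanged unless `inplace=True` is passed. Reassigning the returned `DataFrame` is an alternative that has the same visible effect. A minimal sketch on a hypothetical two-row frame:

```python
import pandas as pd

people = pd.DataFrame({"first": ["shanda", "rolly"], "ages": [43, 23]})

# Without inplace=True, rename returns a new DataFrame and leaves the original alone.
renamed = people.rename(columns={"ages": "Age"})
original_columns = people.columns.tolist()
renamed_columns = renamed.columns.tolist()
print(original_columns)  # ['first', 'ages']
print(renamed_columns)   # ['first', 'Age']

# Reassigning the result is an alternative to inplace=True.
people = people.rename(columns={"ages": "Age"})
print(people.columns.tolist())  # ['first', 'Age']
```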

In [56]:
participants["Zip Code"] = [94702, 97402, 94223, 94705, 97503, 94705, 94111, 95333]
participants


Unnamed: 0,first,last,Age,Zip Code
a,shanda,smith,43,94702
b,rolly,brocker,23,97402
c,molly,stein,78,94223
d,frank,bach,56,94705
e,rip,spencer,26,97503
f,steven,de wilde,14,94705
g,gwen,mason,46,94111
h,arthur,davis,92,95333


In [57]:
participants["Full Name"] = participants.loc[:, "first"] + participants.loc[:, "last"]

participants


Unnamed: 0,first,last,Age,Zip Code,Full Name
a,shanda,smith,43,94702,shandasmith
b,rolly,brocker,23,97402,rollybrocker
c,molly,stein,78,94223,mollystein
d,frank,bach,56,94705,frankbach
e,rip,spencer,26,97503,ripspencer
f,steven,de wilde,14,94705,stevende wilde
g,gwen,mason,46,94111,gwenmason
h,arthur,davis,92,95333,arthurdavis


In [58]:
participants["Full Name"] = (
    participants.loc[:, "first"] + " " + participants.loc[:, "last"]
)
participants


Unnamed: 0,first,last,Age,Zip Code,Full Name
a,shanda,smith,43,94702,shanda smith
b,rolly,brocker,23,97402,rolly brocker
c,molly,stein,78,94223,molly stein
d,frank,bach,56,94705,frank bach
e,rip,spencer,26,97503,rip spencer
f,steven,de wilde,14,94705,steven de wilde
g,gwen,mason,46,94111,gwen mason
h,arthur,davis,92,95333,arthur davis


## Manipulating Data

In [59]:
participants.loc['h', 'first'] = 'Paul'
participants

Unnamed: 0,first,last,Age,Zip Code,Full Name
a,shanda,smith,43,94702,shanda smith
b,rolly,brocker,23,97402,rolly brocker
c,molly,stein,78,94223,molly stein
d,frank,bach,56,94705,frank bach
e,rip,spencer,26,97503,rip spencer
f,steven,de wilde,14,94705,steven de wilde
g,gwen,mason,46,94111,gwen mason
h,Paul,davis,92,95333,arthur davis


In [60]:
participants.iloc[3, 2] = 99
participants

Unnamed: 0,first,last,Age,Zip Code,Full Name
a,shanda,smith,43,94702,shanda smith
b,rolly,brocker,23,97402,rolly brocker
c,molly,stein,78,94223,molly stein
d,frank,bach,99,94705,frank bach
e,rip,spencer,26,97503,rip spencer
f,steven,de wilde,14,94705,steven de wilde
g,gwen,mason,46,94111,gwen mason
h,Paul,davis,92,95333,arthur davis


In [61]:
participants.Age -= 1
participants

Unnamed: 0,first,last,Age,Zip Code,Full Name
a,shanda,smith,42,94702,shanda smith
b,rolly,brocker,22,97402,rolly brocker
c,molly,stein,77,94223,molly stein
d,frank,bach,98,94705,frank bach
e,rip,spencer,25,97503,rip spencer
f,steven,de wilde,13,94705,steven de wilde
g,gwen,mason,45,94111,gwen mason
h,Paul,davis,91,95333,arthur davis


In [62]:
participants.replace('rolly', 'Smiley')

Unnamed: 0,first,last,Age,Zip Code,Full Name
a,shanda,smith,42,94702,shanda smith
b,Smiley,brocker,22,97402,rolly brocker
c,molly,stein,77,94223,molly stein
d,frank,bach,98,94705,frank bach
e,rip,spencer,25,97503,rip spencer
f,steven,de wilde,13,94705,steven de wilde
g,gwen,mason,45,94111,gwen mason
h,Paul,davis,91,95333,arthur davis


In [65]:
participants.replace(r'(s)([a-z]+)', r'S\2', regex=True)

Unnamed: 0,first,last,Age,Zip Code,Full Name
a,Shanda,Smith,42,94702,Shanda Smith
b,rolly,brocker,22,97402,rolly brocker
c,molly,Stein,77,94223,molly Stein
d,frank,bach,98,94705,frank bach
e,rip,Spencer,25,97503,rip Spencer
f,Steven,de wilde,13,94705,Steven de wilde
g,gwen,maSon,45,94111,gwen maSon
h,Paul,davis,91,95333,arthur davis
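In the output above, the pattern also matches the `s` in the middle of `mason`, producing `maSon`. One possible way to limit the match to the start of a word is to add a word-boundary anchor (`\b`) to the pattern. The sketch below is an illustration on hypothetical data, not a change to the original example; `dtype=object` is an assumption made so that the behavior does not depend on your Pandas version's default string type.

```python
import pandas as pd

# Hypothetical frame; dtype=object keeps the strings as plain Python objects.
names = pd.DataFrame({"last": ["smith", "mason", "de wilde"]}, dtype=object)

# \b anchors the match to the start of a word, so the "s" inside "mason" is skipped.
fixed = names.replace(r"\bs([a-z]+)", r"S\1", regex=True)
print(fixed["last"].tolist())
```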


In [66]:
def cap_word(w):
    return w.capitalize()

participants.loc[:, 'first'].apply(cap_word)

a    Shanda
b     Rolly
c     Molly
d     Frank
e       Rip
f    Steven
g      Gwen
h      Paul
Name: first, dtype: object

In [67]:
def say_hello(row):
    return f'{row["first"]} is {row["Age"]} years old.'

participants.apply(say_hello, axis=1)

a    shanda is 42 years old.
b     rolly is 22 years old.
c     molly is 77 years old.
d     frank is 98 years old.
e       rip is 25 years old.
f    steven is 13 years old.
g      gwen is 45 years old.
h      Paul is 91 years old.
dtype: object
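Calling `apply` with `axis=1` runs a Python function once per row. For simple string building, the same result can usually be produced with column-wise operations, which is typically faster on large `DataFrames`. The following sketch compares the two approaches on a hypothetical two-row frame:

```python
import pandas as pd

people = pd.DataFrame(
    {"first": ["shanda", "rolly"], "Age": [42, 22]}, index=["a", "b"]
)


def say_hello(row):
    """Build a sentence from one row of the DataFrame."""
    return f'{row["first"]} is {row["Age"]} years old.'


# Row-wise: the function is called once per row.
with_apply = people.apply(say_hello, axis=1)

# Column-wise: whole columns are combined at once.
vectorized = people["first"] + " is " + people["Age"].astype(str) + " years old."

print(with_apply.tolist())
print(vectorized.tolist())
```

Both approaches produce the same sentences here; `apply` remains the more flexible choice when the per-row logic cannot be expressed with column operations.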