# Pandas Review

Create a pandas DataFrame from:

**1. python dictionary**

In [2]:
import numpy as np
import pandas as pd

my_dict = {
    'id': [1,2,3,4],
    'name': ['ringo', 'paul', 'john', 'george'],
    'age': [23,24,25,26],
    'role': ['drummer', 'singer', 'singer', 'guitar']
}
df_beatles = pd.DataFrame(my_dict)
df_beatles

Unnamed: 0,id,name,age,role
0,1,ringo,23,drummer
1,2,paul,24,singer
2,3,john,25,singer
3,4,george,26,guitar


**2. python list**

Where we have multiple lists that we wish to convert into a dataframe, either follow the example above (converting them all into a dictionary first) or, alternatively, create a 2-D list of the values, with one inner list or tuple per row, and specify a 1-D list of the column headings.

NOTE: you **must** specify the `columns` keyword so that `pandas` uses that list as the column headings; otherwise the headings will be set to `0`, `1`, `2`, etc.

In [14]:
lst = [[1,2,3], ['tom', 'dick', 'harry'], [23,24,25]]
lsts = list(zip(lst[0], lst[1], lst[2]))  # additional step: build one tuple per user (row)
print(lsts)
cols = ['id', 'name', 'age']
df_lst = pd.DataFrame(lsts, columns=cols)
df_lst

[(1, 'tom', 23), (2, 'dick', 24), (3, 'harry', 25)]


Unnamed: 0,id,name,age
0,1,tom,23
1,2,dick,24
2,3,harry,25
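
As a minimal sketch of the note about the `columns` keyword, the snippet below builds the same dataframe with and without it. The variable names (`rows`, `df_default`, `df_named`) are made up for illustration.

```python
import pandas as pd

# One tuple per user (row), as produced by the zip step above
rows = [(1, 'tom', 23), (2, 'dick', 24), (3, 'harry', 25)]

# Without `columns`, pandas falls back to integer headings
df_default = pd.DataFrame(rows)
print(list(df_default.columns))

# With `columns`, the supplied names become the headings
df_named = pd.DataFrame(rows, columns=['id', 'name', 'age'])
print(list(df_named.columns))
```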


We can also set the row labels (the index) by assigning to the `.index` attribute, e.g.

In [15]:
df_lst.index = ['user_1', 'user_2', 'user_3']
df_lst

Unnamed: 0,id,name,age
user_1,1,tom,23
user_2,2,dick,24
user_3,3,harry,25
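
Two other ways to end up with a labelled index are sketched below: passing `index=` when the dataframe is created, and promoting an existing column with `set_index()`. The variable names are illustrative only.

```python
import pandas as pd

rows = [(1, 'tom', 23), (2, 'dick', 24), (3, 'harry', 25)]

# Supply the row labels at construction time
df_users = pd.DataFrame(
    rows,
    columns=['id', 'name', 'age'],
    index=['user_1', 'user_2', 'user_3'],
)

# set_index() returns a new dataframe that uses the 'id' column as the index
df_by_id = df_users.set_index('id')
print(df_by_id)
```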


**3. csv file**

General format:

```py
df = pd.read_csv('file_name.csv', nrows=5, header=None, sep=',', comment='#', na_values=[''], index_col=col_num)
```
The first argument, the filename (path), is required; the others are optional:

- `nrows`: number of rows of the file to read.
- `header=None`: the file has no header row, so the columns are labelled `0`, `1`, `2`, etc.
- `sep`: delimiter to use; the default is a comma, `,`.
- `comment`: character that marks the rest of a line as a comment to be skipped.
- `na_values`: additional string(s) to recognise as NaN.
- `index_col`: the column in the csv file, `col_num`, to use as the index column; otherwise the rows are numbered from 0 upwards.
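
A small sketch of these options in action. To keep it self-contained it reads from an in-memory string via `io.StringIO` instead of a file on disk; the sample text and variable names are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical csv content: one comment line, a header row, four data rows
csv_text = """# sample data for illustration
code,country,area
BR,Brazil,8.516
RU,Russia,17.1
IN,India,
CH,China,9.597
"""

# Skip the comment line, use the first column as the index, read only 3 rows
df_sample = pd.read_csv(io.StringIO(csv_text), comment='#', index_col=0, nrows=3)
print(df_sample)

# header=None: no header row in the data, so columns are numbered 0, 1, ...
df_no_header = pd.read_csv(io.StringIO("1,tom\n2,dick\n"), header=None)
print(list(df_no_header.columns))
```

The empty `area` field for India is read as `NaN` by default, without needing `na_values`.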


In [23]:
brics = pd.read_csv('data/brics.csv', index_col=0)
brics

Unnamed: 0,country,capital,area,population
BR,Brazil,Brasilia,8.516,200.4
RU,Russia,Moscow,17.1,143.5
IN,India,New Delhi,3.286,1252.0
CH,China,Beijing,9.597,1357.0
SA,South Africa,Pretoria,1.221,52.98


## Exploring a Pandas DataFrame

In [65]:
# use the 'PassengerId' as the row label/index column
df = pd.read_csv('data/test.csv', index_col=0)
df.index

Int64Index([ 892,  893,  894,  895,  896,  897,  898,  899,  900,  901,
            ...
            1300, 1301, 1302, 1303, 1304, 1305, 1306, 1307, 1308, 1309],
           dtype='int64', name='PassengerId', length=418)

In [66]:
# return the column headings
df.columns

Index(['Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare',
       'Cabin', 'Embarked'],
      dtype='object')

In [67]:
# dataframe info - num cols, num rows, num non-null values for each column, col data types
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 418 entries, 892 to 1309
Data columns (total 10 columns):
Pclass      418 non-null int64
Name        418 non-null object
Sex         418 non-null object
Age         332 non-null float64
SibSp       418 non-null int64
Parch       418 non-null int64
Ticket      418 non-null object
Fare        417 non-null float64
Cabin       91 non-null object
Embarked    418 non-null object
dtypes: float64(2), int64(3), object(5)
memory usage: 35.9+ KB


In [68]:
# return the first x rows, default is 5
df.head(2)

PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S


In [69]:
# return the last 'x' rows, default is 5
df.tail(2)

PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1308,3,"Ware, Mr. Frederick",male,,0,0,359309,8.05,,S
1309,3,"Peter, Master. Michael J",male,,1,1,2668,22.3583,,C


In [70]:
# return a numpy array of all the values
# dataframe of mixed datatypes results in a numpy array of objects
df_array = df.values
df_array

array([[3, 'Kelly, Mr. James', 'male', ..., 7.8292, nan, 'Q'],
       [3, 'Wilkes, Mrs. James (Ellen Needs)', 'female', ..., 7.0, nan,
        'S'],
       [2, 'Myles, Mr. Thomas Francis', 'male', ..., 9.6875, nan, 'Q'],
       ...,
       [3, 'Saether, Mr. Simon Sivertsen', 'male', ..., 7.25, nan, 'S'],
       [3, 'Ware, Mr. Frederick', 'male', ..., 8.05, nan, 'S'],
       [3, 'Peter, Master. Michael J', 'male', ..., 22.3583, nan, 'C']],
      dtype=object)

## Selecting columns/rows from a dataframe

### Using Bracket Notation to select columns

Both `[column_name]` notation and `.column_name` notation return that particular column of data as a pandas `Series`, a 1-D labelled array. You can create a dataframe by 'pasting' together a number of `Series` columns.

In [71]:
# select a column of data
sub = df.Name
sub.head(5)

PassengerId
892                                Kelly, Mr. James
893                Wilkes, Mrs. James (Ellen Needs)
894                       Myles, Mr. Thomas Francis
895                                Wirz, Mr. Albert
896    Hirvonen, Mrs. Alexander (Helga E Lindqvist)
Name: Name, dtype: object

In [72]:
type(sub)

pandas.core.series.Series

To **keep** the data in a dataframe use `[[]]` notation - you can select one or more columns of data, e.g.

In [73]:
sub = df[['Name']]
sub.head(2)

PassengerId,Name
892,"Kelly, Mr. James"
893,"Wilkes, Mrs. James (Ellen Needs)"


In [74]:
sub = df[['Name', 'Sex', 'Age']]
sub.head(2)

PassengerId,Name,Sex,Age
892,"Kelly, Mr. James",male,34.5
893,"Wilkes, Mrs. James (Ellen Needs)",female,47.0


In [75]:
type(sub)

pandas.core.frame.DataFrame
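
The difference between single and double brackets can be checked on a tiny, made-up dataframe (the names below are illustrative). Dot notation is a convenience that only works when the column name is a valid Python identifier and does not clash with an existing dataframe attribute or method.

```python
import pandas as pd

df_people = pd.DataFrame({'Name': ['tom', 'dick'], 'Age': [23, 24]})

as_series = df_people['Name']    # single brackets -> Series
as_frame = df_people[['Name']]   # double brackets -> DataFrame with one column

print(type(as_series).__name__)
print(type(as_frame).__name__)

# Dot notation returns the same Series as single brackets
same = df_people.Name.equals(df_people['Name'])
print(same)
```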

### Using bracket notation to select rows

You do this by specifying a slice with `[start:end]` notation, as you do with Python lists.

- uses the row position, not the label
- returns a dataframe
- the 'end' is not included

In [76]:
# row 200 to 204
df[200:205]

PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1092,3,"Murphy, Miss. Nora",female,,0,0,36568,15.5,,Q
1093,3,"Danbom, Master. Gilbert Sigvard Emanuel",male,0.33,0,2,347080,14.4,,S
1094,1,"Astor, Col. John Jacob",male,47.0,1,0,PC 17757,227.525,C62 C64,C
1095,2,"Quick, Miss. Winifred Vera",female,8.0,1,1,26360,26.0,,S
1096,2,"Andrew, Mr. Frank Thomas",male,25.0,0,0,C.A. 34050,10.5,,S


In [77]:
type(df[200:205])

pandas.core.frame.DataFrame

### Select a Subsection/Row(s) of a Pandas DataFrame using 'loc' and 'iloc'

**Using loc**

`loc` selects data based on labels.

- you can select one or more rows, columns or a selection of rows and columns.
- you have to specify rows and columns by their row and column **labels**, NOT their integer position.
- returns a `Series` when you pass a single label in `[]`, and a `DataFrame` when you pass a list of labels in `[[]]`.

In [79]:
# select a Single row based on id
# returns a series
df.loc[1000]

Pclass                                     3
Name        Willer, Mr. Aaron (Abi Weller")"
Sex                                     male
Age                                      NaN
SibSp                                      0
Parch                                      0
Ticket                                  3410
Fare                                  8.7125
Cabin                                    NaN
Embarked                                   S
Name: 1000, dtype: object

In [87]:
# select rows based on the index column
# return specific rows as a dataframe (all columns included)
df.loc[[1000, 1002, 1005]]

PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1000,3,"Willer, Mr. Aaron (Abi Weller"")""",male,,0,0,3410,8.7125,,S
1002,2,"Stanton, Mr. Samuel Ward",male,41.0,0,0,237734,15.0458,,C
1005,3,"Buckley, Miss. Katherine",female,18.5,0,0,329944,7.2833,,Q


In [88]:
# we can extend this to return specific rows and columns
df.loc[[1000, 1002, 1005], ['Name', 'Sex', 'Age']]

PassengerId,Name,Sex,Age
1000,"Willer, Mr. Aaron (Abi Weller"")""",male,
1002,"Stanton, Mr. Samuel Ward",male,41.0
1005,"Buckley, Miss. Katherine",female,18.5


In [89]:
# we can return a specific set of columns (and all rows) - limit to 1st 5 rows by slicing
df.loc[:, ['Name', 'Sex', 'Age']][0:5]

PassengerId,Name,Sex,Age
892,"Kelly, Mr. James",male,34.5
893,"Wilkes, Mrs. James (Ellen Needs)",female,47.0
894,"Myles, Mr. Thomas Francis",male,62.0
895,"Wirz, Mr. Albert",male,27.0
896,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0
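
One detail worth knowing when slicing with `loc`: a label slice includes **both** endpoints, unlike positional slicing. A minimal sketch on a made-up dataframe with non-consecutive integer labels:

```python
import pandas as pd

df_small = pd.DataFrame({'name': list('abcde')}, index=[10, 20, 30, 40, 50])

# Label-based slice: rows labelled 20, 30 AND 40 (end label included)
by_label = df_small.loc[20:40]
print(list(by_label.index))

# Position-based slice: positions 1 and 2 only (end position excluded)
by_position = df_small.iloc[1:3]
print(list(by_position.index))
```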


**Using iloc**

`iloc` selects data based on integer position, starting at 0.

- uses syntax `[row, column]`
- to select a range of rows or columns, use slice notation with `:`, e.g. `[200:205, 1:4]`
- returns a dataframe when given slices or lists

In [90]:
# Select multiple rows (and all columns)
# starting at row 205, up to but not including row 210
df.iloc[205:210]

PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1097,1,"Omont, Mr. Alfred Fernand",male,,0,0,F.C. 12998,25.7417,,C
1098,3,"McGowan, Miss. Katherine",female,35.0,0,0,9232,7.75,,Q
1099,2,"Collett, Mr. Sidney C Stuart",male,24.0,0,0,28034,10.5,,S
1100,1,"Rosenbaum, Miss. Edith Louise",female,33.0,0,0,PC 17613,27.7208,A11,C
1101,3,"Delalic, Mr. Redjo",male,25.0,0,0,349250,7.8958,,S


In [91]:
# the end is not included for either row or column
df.iloc[200:205, 0:5]

PassengerId,Pclass,Name,Sex,Age,SibSp
1092,3,"Murphy, Miss. Nora",female,,0
1093,3,"Danbom, Master. Gilbert Sigvard Emanuel",male,0.33,0
1094,1,"Astor, Col. John Jacob",male,47.0,1
1095,2,"Quick, Miss. Winifred Vera",female,8.0,1
1096,2,"Andrew, Mr. Frank Thomas",male,25.0,0


In [92]:
# the rows and columns do not have to be in sequence
df.iloc[[1, 45, 68, 201], [1,4,6,8]]

PassengerId,Name,SibSp,Ticket,Cabin
893,"Wilkes, Mrs. James (Ellen Needs)",1,363272,
937,"Peltomaki, Mr. Nikolai Johannes",0,STON/O 2. 3101291,
960,"Tucker, Mr. Gilbert Milligan Jr",0,2543,C53
1093,"Danbom, Master. Gilbert Sigvard Emanuel",0,347080,


In [93]:
# starting from 3rd from end, going from back to front (right to left),
# up to but not including row 410 (-8 from back)
# include all columns
df.iloc[-3:-8:-1]

PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1307,3,"Saether, Mr. Simon Sivertsen",male,38.5,0,0,SOTON/O.Q. 3101262,7.25,,S
1306,1,"Oliva y Ocana, Dona. Fermina",female,39.0,0,0,PC 17758,108.9,C105,C
1305,3,"Spector, Mr. Woolf",male,,0,0,A.5. 3236,8.05,,S
1304,3,"Henriksson, Miss. Jenny Lovisa",female,28.0,0,0,347086,7.775,,S
1303,1,"Minahan, Mrs. William Edward (Lillian E Thorpe)",female,37.0,1,0,19928,90.0,C78,Q


In [94]:
# same as above but limiting the cols
sub = df.iloc[-3:-8:-1, 0:5]
sub

PassengerId,Pclass,Name,Sex,Age,SibSp
1307,3,"Saether, Mr. Simon Sivertsen",male,38.5,0
1306,1,"Oliva y Ocana, Dona. Fermina",female,39.0,0
1305,3,"Spector, Mr. Woolf",male,,0
1304,3,"Henriksson, Miss. Jenny Lovisa",female,28.0,0
1303,1,"Minahan, Mrs. William Edward (Lillian E Thorpe)",female,37.0,1


In [95]:
type(sub)

pandas.core.frame.DataFrame

In [96]:
# to return specific columns, and all rows - limiting to 1st 5 rows
df.iloc[:, [2,4,6]][0:5]

PassengerId,Sex,SibSp,Ticket
892,male,0,330911
893,female,1,363272
894,male,0,240276
895,male,0,315154
896,female,1,3101298
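
What `iloc` returns depends on how the rows are specified. The sketch below, on a small made-up dataframe, shows that a single integer gives a `Series`, while a list or a slice gives a `DataFrame`.

```python
import pandas as pd

df_small = pd.DataFrame(
    {'name': ['tom', 'dick', 'harry'], 'age': [23, 24, 25]},
    index=['user_1', 'user_2', 'user_3'],
)

row_series = df_small.iloc[0]      # single integer -> Series for that row
row_frame = df_small.iloc[[0]]     # list with one integer -> one-row DataFrame
block = df_small.iloc[0:2, 0:1]    # slices -> DataFrame, ends excluded

print(type(row_series).__name__, type(row_frame).__name__, block.shape)
```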


### Select an individual field/value

In [59]:
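# NOTE: rows and columns are ZERO indexed
# NOTE: in the cells below `df` has the default 0, 1, 2, ... index (loaded without `index_col`),
# so 'PassengerId' is an ordinary column (column 0)
# Select the field by position, row(0) & column(2), uses 'iloc'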
df.iloc[[0], [2]]

Unnamed: 0,Name
0,"Kelly, Mr. James"


In [33]:
# Select the field by Label, uses 'loc'
df.loc[[0], ['Name']]

Unnamed: 0,Name
0,"Kelly, Mr. James"
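
The two cells above return a 1x1 dataframe because both the row and the column are passed as lists. To get the bare value instead, pass a single row and a single column; pandas also provides `.at` and `.iat` for scalar access. A minimal sketch on a made-up dataframe:

```python
import pandas as pd

df_people = pd.DataFrame({'Name': ['Kelly', 'Wilkes'], 'Age': [34.5, 47.0]})

frame_1x1 = df_people.loc[[0], ['Name']]   # lists -> 1x1 DataFrame

scalar_loc = df_people.loc[0, 'Name']      # single label pair -> scalar
scalar_iloc = df_people.iloc[0, 0]         # single position pair -> scalar
scalar_at = df_people.at[0, 'Name']        # label-based scalar accessor
scalar_iat = df_people.iat[0, 0]           # position-based scalar accessor

print(frame_1x1.shape, scalar_loc, scalar_iloc, scalar_at, scalar_iat)
```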


### Dropping Columns

In [41]:
# NOTE: use 'axis=1' to drop columns; the original dataframe is unchanged
# pass a single column name or a list of one or more column names
# drop three columns from the dataframe
df.drop(['PassengerId', 'SibSp', 'Parch'], axis=1)[200:210]

Unnamed: 0,Pclass,Name,Sex,Age,Ticket,Fare,Cabin,Embarked
200,3,"Murphy, Miss. Nora",female,,36568,15.5,,Q
201,3,"Danbom, Master. Gilbert Sigvard Emanuel",male,0.33,347080,14.4,,S
202,1,"Astor, Col. John Jacob",male,47.0,PC 17757,227.525,C62 C64,C
203,2,"Quick, Miss. Winifred Vera",female,8.0,26360,26.0,,S
204,2,"Andrew, Mr. Frank Thomas",male,25.0,C.A. 34050,10.5,,S
205,1,"Omont, Mr. Alfred Fernand",male,,F.C. 12998,25.7417,,C
206,3,"McGowan, Miss. Katherine",female,35.0,9232,7.75,,Q
207,2,"Collett, Mr. Sidney C Stuart",male,24.0,28034,10.5,,S
208,1,"Rosenbaum, Miss. Edith Louise",female,33.0,PC 17613,27.7208,A11,C
209,3,"Delalic, Mr. Redjo",male,25.0,349250,7.8958,,S
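
The sketch below, on a made-up dataframe, confirms that `drop()` returns a new dataframe and leaves the original untouched. It also shows the `columns=` keyword, which is equivalent to passing the names with `axis=1`.

```python
import pandas as pd

df_people = pd.DataFrame({'id': [1, 2], 'name': ['tom', 'dick'], 'age': [23, 24]})

# Two equivalent spellings for dropping columns
trimmed = df_people.drop(['id', 'age'], axis=1)
trimmed_kw = df_people.drop(columns=['id', 'age'])

print(list(trimmed.columns))
print(list(df_people.columns))   # original still has all three columns
```

To keep the result, assign it to a name, e.g. `df_people = df_people.drop(columns=['id'])`.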


### Dropping Rows

In [43]:
# NOTE: use 'axis=0' to drop rows; the original dataframe remains unchanged
# drop rows based on their index labels (here the default 0, 1, 2, ... row numbers)
df.drop([204, 205, 206], axis=0)[200:210]

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
200,1092,3,"Murphy, Miss. Nora",female,,0,0,36568,15.5,,Q
201,1093,3,"Danbom, Master. Gilbert Sigvard Emanuel",male,0.33,0,2,347080,14.4,,S
202,1094,1,"Astor, Col. John Jacob",male,47.0,1,0,PC 17757,227.525,C62 C64,C
203,1095,2,"Quick, Miss. Winifred Vera",female,8.0,1,1,26360,26.0,,S
207,1099,2,"Collett, Mr. Sidney C Stuart",male,24.0,0,0,28034,10.5,,S
208,1100,1,"Rosenbaum, Miss. Edith Louise",female,33.0,0,0,PC 17613,27.7208,A11,C
209,1101,3,"Delalic, Mr. Redjo",male,25.0,0,0,349250,7.8958,,S
210,1102,3,"Andersen, Mr. Albert Karvin",male,32.0,0,0,C 4001,22.525,,S
211,1103,3,"Finoli, Mr. Luigi",male,,0,0,SOTON/O.Q. 3101308,7.05,,S
212,1104,2,"Deacon, Mr. Percy William",male,17.0,0,0,S.O.C. 14879,73.5,,S
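
Because `drop()` works on index **labels**, the same call works with a non-numeric index. To drop by position instead, one option is to look the labels up through `df.index` first. A small sketch with illustrative names:

```python
import pandas as pd

df_users = pd.DataFrame(
    {'name': ['tom', 'dick', 'harry']},
    index=['user_1', 'user_2', 'user_3'],
)

# Drop by label
without_dick = df_users.drop(['user_2'], axis=0)
print(list(without_dick.index))

# Drop by position: translate position 0 into its label, then drop
without_first = df_users.drop(df_users.index[[0]])
print(list(without_first.index))
```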


### Using logical operators to filter pandas dataframes

Involves three steps:

1. fetch the column you want to filter by as a pandas `Series`.
2. perform the comparison on the column and store the results
3. use the results to filter the dataset(must pass a list of boolean values)

In [97]:
import pandas as pd

df_brics = pd.read_csv('data/brics.csv', index_col=0)
df_brics

Unnamed: 0,country,capital,area,population
BR,Brazil,Brasilia,8.516,200.4
RU,Russia,Moscow,17.1,143.5
IN,India,New Delhi,3.286,1252.0
CH,China,Beijing,9.597,1357.0
SA,South Africa,Pretoria,1.221,52.98


In [98]:
# fetch the column to filter on as a pandas series
filter = df_brics['area']  # or df_brics.loc[:, 'area'] OR df_brics.iloc[:, 2]
filter

BR     8.516
RU    17.100
IN     3.286
CH     9.597
SA     1.221
Name: area, dtype: float64

In [99]:
# perform the comparison - returns a pandas series of booleans
comparison = filter > 8.0 
print(comparison)

BR     True
RU     True
IN    False
CH     True
SA    False
Name: area, dtype: bool


In [100]:
# filter the data set - return all those countries with an area > 8
df_brics[comparison]

Unnamed: 0,country,capital,area,population
BR,Brazil,Brasilia,8.516,200.4
RU,Russia,Moscow,17.1,143.5
CH,China,Beijing,9.597,1357.0


In [101]:
df_brics[df_brics['area'] > 8] # all in one line

Unnamed: 0,country,capital,area,population
BR,Brazil,Brasilia,8.516,200.4
RU,Russia,Moscow,17.1,143.5
CH,China,Beijing,9.597,1357.0


In [102]:
# find those countries with an area of > 8 and < 10
df_brics[np.logical_and(df_brics['area'] > 8, df_brics['area'] < 10)]

Unnamed: 0,country,capital,area,population
BR,Brazil,Brasilia,8.516,200.4
CH,China,Beijing,9.597,1357.0
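
The same combined filter can be written with the `&` operator instead of `np.logical_and`. Each comparison needs its own parentheses, because `&` binds more tightly than `>` and `<` in Python. The sketch rebuilds just the `area` column from the table above so that it runs on its own.

```python
import numpy as np
import pandas as pd

df_area = pd.DataFrame(
    {'area': [8.516, 17.1, 3.286, 9.597, 1.221]},
    index=['BR', 'RU', 'IN', 'CH', 'SA'],
)

with_numpy = df_area[np.logical_and(df_area['area'] > 8, df_area['area'] < 10)]
with_operator = df_area[(df_area['area'] > 8) & (df_area['area'] < 10)]

print(list(with_operator.index))
```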


In [106]:
# create a subset of countries where cars drive on the right
cars = pd.read_csv('data/cars.csv', index_col = 0)

# Extract drives_right column as Series: dr
dr = cars['drives_right']

# Use dr to subset cars: sel
sel = cars[dr == True]

# Print sel
print(sel)

     cars_per_cap        country  drives_right
US            809  United States          True
RU            200         Russia          True
MOR            70        Morocco          True
EG             45          Egypt          True


In [None]:
# one liner
sel = cars[cars['drives_right']]

In [107]:
# Create car_maniac: observations that have a cars_per_cap over 500
cpc = cars['cars_per_cap']
many_cars = cpc > 500
car_maniac = cars[many_cars]
car_maniac

Unnamed: 0,cars_per_cap,country,drives_right
US,809,United States,True
AUS,731,Australia,False
JAP,588,Japan,False


In [108]:
# as a one-liner
cars[cars['cars_per_cap'] > 500]

Unnamed: 0,cars_per_cap,country,drives_right
US,809,United States,True
AUS,731,Australia,False
JAP,588,Japan,False


In [109]:
# select all rows with cars_per_cap between 100 and 500
cars[np.logical_and(cars['cars_per_cap'] > 100, cars['cars_per_cap'] < 500)]

Unnamed: 0,cars_per_cap,country,drives_right
RU,200,Russia,True


### Selecting a subset based on logic

In [59]:
df[(df.Age > 60) & (df.Sex == 'female')]

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
96,988,1,"Cavendish, Mrs. Tyrell William (Julia Florence...",female,76.0,1,0,19877,78.85,C46,S
114,1006,1,"Straus, Mrs. Isidor (Rosalie Ida Blun)",female,63.0,1,0,PC 17483,221.7792,C55 C57,S
179,1071,1,"Compton, Mrs. Alexander Taylor (Mary Eliza Ing...",female,64.0,0,2,PC 17756,83.1583,E45,C
305,1197,1,"Crosby, Mrs. Edward Gifford (Catherine Elizabe...",female,64.0,1,1,112901,26.55,B26,S


In [60]:
df[(df.Age > 55) & (df.Age < 60) & ((df.Pclass == 1) | (df.Pclass == 2)) ]

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
217,1109,1,"Wick, Mr. George Dennick",male,57.0,1,1,36928,164.8667,,S
316,1208,1,"Spencer, Mr. William Augustus",male,57.0,1,0,PC 17569,146.5208,B78,C
343,1235,1,"Cardeza, Mrs. James Warburton Martinez (Charlo...",female,58.0,0,1,PC 17755,512.3292,B51 B53 B55,C
356,1248,1,"Brown, Mrs. John Murray (Caroline Lane Lamson)",female,59.0,2,0,11769,51.4792,C101,S
387,1279,2,"Ashby, Mr. John",male,57.0,0,0,244346,13.0,,S


When we select a subset of a DataFrame using logic, we end up with non-consecutive indices. We can fix this using the method `.reset_index()`.

By default, a new `index` column is created holding the old indices, and the index is reset to `0`, `1`, `2`, etc. You can avoid the `index` column being created by using the `drop=True` option.

In [69]:
df[(df.Age > 40) & (df.Age < 45) & (df.Pclass == 1)].reset_index(drop=True)

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,920,1,"Brady, Mr. John Bertram",male,41.0,0,0,113054,30.5,A21,S
1,992,1,"Stengel, Mrs. Charles Emil Henry (Annie May Mo...",female,43.0,1,0,11778,55.4417,C116,C
2,1036,1,"Lindeberg-Lind, Mr. Erik Gustaf (Mr Edward Lin...",male,42.0,0,0,17475,26.55,,S
3,1050,1,"Borebank, Mr. John James",male,42.0,0,0,110489,26.55,D22,S
4,1107,1,"Head, Mr. Christopher",male,42.0,0,0,113038,42.5,B11,S
5,1137,1,"Kenyon, Mr. Frederick R",male,41.0,1,0,17464,51.8625,D21,S
6,1296,1,"Frauenthal, Mr. Isaac Gerald",male,43.0,1,0,17765,27.7208,D40,C
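
A minimal sketch of the two behaviours of `reset_index()`, using a made-up dataframe whose index values are chosen arbitrarily:

```python
import pandas as pd

df_ages = pd.DataFrame({'age': [41, 70, 43]}, index=[920, 988, 992])

# Filtering leaves gaps in the index: 920, 992
subset = df_ages[df_ages['age'] < 45]

kept = subset.reset_index()              # old index is kept in a new 'index' column
dropped = subset.reset_index(drop=True)  # old index is discarded

print(list(kept.columns))
print(list(dropped.columns), list(dropped.index))
```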


### Iterating over a Pandas DataFrame

We can iterate over a dataframe one row at a time using the `.iterrows()` method.

```py
df = pd.read_csv('my_file.csv', index_col=0)
for lab, row in df.iterrows():
    pass  # do something with lab and row...
```

Here `lab` is the row label and `row` is the entire row. Each row generated is a pandas `Series`.

In [1]:
import pandas as pd

df_cars = pd.read_csv('data/cars.csv', index_col = 0)
df_cars

Unnamed: 0,cars_per_cap,country,drives_right
US,809,United States,True
AUS,731,Australia,False
JAP,588,Japan,False
IN,18,India,False
RU,200,Russia,True
MOR,70,Morocco,True
EG,45,Egypt,True


In [4]:
for lab, row in df_cars.iterrows():
    print('{} short for {}'.format(lab, row['country']))

US short for United States
AUS short for Australia
JAP short for Japan
IN short for India
RU short for Russia
MOR short for Morocco
EG short for Egypt
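
To see what `iterrows()` yields, the sketch below pulls just the first `(label, row)` pair from a two-row, made-up version of the cars data:

```python
import pandas as pd

df_two = pd.DataFrame(
    {'cars_per_cap': [809, 731], 'country': ['United States', 'Australia']},
    index=['US', 'AUS'],
)

# iterrows() is a generator of (row label, row as Series) pairs
lab, row = next(df_two.iterrows())

print(lab)
print(type(row).__name__)
print(list(row.index))   # the Series is indexed by the column names
```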


In [7]:
# Code for loop that adds COUNTRY column
for lab, row in df_cars.iterrows():
    df_cars.loc[lab, 'COUNTRY'] = row['country'].upper()
    
df_cars.head()

Unnamed: 0,cars_per_cap,country,drives_right,COUNTRY
US,809,United States,True,UNITED STATES
AUS,731,Australia,False,AUSTRALIA
JAP,588,Japan,False,JAPAN
IN,18,India,False,INDIA
RU,200,Russia,True,RUSSIA


Using `iterrows()` to iterate over every observation of a Pandas DataFrame is easy to understand, but not very efficient. On every iteration, you're creating a new Pandas `Series`.

If you want to add a column to a DataFrame by calling a function on another column, the `iterrows()` method in combination with a for loop is not the preferred way to go. Instead, you'll want to use `apply()`.

In [9]:
df_cars['COUNTRY'] = df_cars['country'].apply(str.upper)
df_cars.head()

Unnamed: 0,cars_per_cap,country,drives_right,COUNTRY
US,809,United States,True,UNITED STATES
AUS,731,Australia,False,AUSTRALIA
JAP,588,Japan,False,JAPAN
IN,18,India,False,INDIA
RU,200,Russia,True,RUSSIA
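
For string columns specifically, pandas also offers vectorised string methods through the `.str` accessor, which give the same result as `apply(str.upper)` here. A short sketch on a made-up subset of the data:

```python
import pandas as pd

df_names = pd.DataFrame(
    {'country': ['United States', 'Australia', 'Japan']},
    index=['US', 'AUS', 'JAP'],
)

via_apply = df_names['country'].apply(str.upper)
via_str = df_names['country'].str.upper()

print(via_apply.equals(via_str))
print(list(via_str))
```

One practical difference: `.str.upper()` passes missing values through as `NaN`, whereas `apply(str.upper)` raises an error on them.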
