<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Loading-your-data" data-toc-modified-id="Loading-your-data-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Loading your data</a></span></li><li><span><a href="#Getting-an-overview" data-toc-modified-id="Getting-an-overview-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Getting an overview</a></span></li><li><span><a href="#Renaming-Columns" data-toc-modified-id="Renaming-Columns-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Renaming Columns</a></span></li><li><span><a href="#Replacing-all-occurrences-of-a-string-in-a-column" data-toc-modified-id="Replacing-all-occurrences-of-a-string-in-a-column-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Replacing all occurrences of a string in a column</a></span></li><li><span><a href="#Selecting-data-subsets" data-toc-modified-id="Selecting-data-subsets-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Selecting data subsets</a></span><ul class="toc-item"><li><span><a href="#Selecting-Columns-of-the-data" data-toc-modified-id="Selecting-Columns-of-the-data-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>Selecting Columns of the data</a></span></li><li><span><a href="#Ex-1.1" data-toc-modified-id="Ex-1.1-5.2"><span class="toc-item-num">5.2&nbsp;&nbsp;</span>Ex 1.1</a></span></li><li><span><a href="#Iterator" data-toc-modified-id="Iterator-5.3"><span class="toc-item-num">5.3&nbsp;&nbsp;</span>Iterator</a></span></li><li><span><a href="#Selecting-rows-by-position" data-toc-modified-id="Selecting-rows-by-position-5.4"><span class="toc-item-num">5.4&nbsp;&nbsp;</span>Selecting rows by position</a></span></li><li><span><a href="#Selecting-rows-by-condition" data-toc-modified-id="Selecting-rows-by-condition-5.5"><span class="toc-item-num">5.5&nbsp;&nbsp;</span>Selecting rows by condition</a></span></li></ul></li><li><span><a href="#Sorting" data-toc-modified-id="Sorting-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Sorting</a></span></li><li><span><a 
href="#Ex-2.2" data-toc-modified-id="Ex-2.2-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Ex 2.2</a></span></li><li><span><a href="#Optional" data-toc-modified-id="Optional-8"><span class="toc-item-num">8&nbsp;&nbsp;</span>Optional</a></span></li></ul></div>

# Basic Data Analysis with Pandas

Pandas is a popular Python package for data science. It offers expressive and flexible data structures for data manipulation and analysis. Here we will focus on one of these data structures: the dataframe.

A dataframe stores data in a rectangular grid that is easy to view and work with. Each row holds the values of one instance, while each column is a vector holding the values of one variable across all instances. Different columns can hold different types of values (numeric, character, logical etc.), so a single row can mix types.
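As a minimal sketch of this row/column layout, the snippet below builds a tiny dataframe by hand. The values and the `is_married` column are made up for illustration and are not part of the dataset used later.

```python
import pandas as pd

# Hypothetical toy data: each dict key becomes a column, each list position a row.
people = pd.DataFrame(
    {
        "age": [39, 50, 38],
        "workclass": ["State-gov", "Self-emp-not-inc", "Private"],
        "is_married": [False, True, False],
    }
)

print(people.shape)    # (number of rows, number of columns)
print(people.dtypes)   # one dtype per column
print(people.iloc[0])  # a single row mixes the value types of all columns
```

Each column keeps its own dtype, which is why column-wise operations are usually the natural way to work with a dataframe.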

In [6]:
# Import the pandas module for data analysis as alias pd
import pandas as pd

## Loading your data

Pandas can read (and write) many data formats, including CSV, TSV, Excel, HDF5, SAS, Stata and SQL. The documentation provides more details:
http://pandas.pydata.org/pandas-docs/stable/io.html

`read_csv` has several useful arguments, e.g. `sep` (default is `","`), `header` (by default the first line is used as the header) and `on_bad_lines` (which replaced `error_bad_lines` in recent pandas versions).
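To try these arguments without needing a file on disk, here is a small sketch that reads from an in-memory string. The rows, the `;` separator and the column names are assumptions chosen for illustration.

```python
import io

import pandas as pd

# Hypothetical semicolon-separated text standing in for a file on disk.
raw = io.StringIO(
    "39;State-gov;Bachelors\n"
    "64;?;HS-grad\n"
    "38;Private;?\n"
)

toy = pd.read_csv(
    raw,
    sep=";",          # field delimiter (the default is ",")
    header=None,      # this text has no header row ...
    names=["age", "workclass", "education"],  # ... so we supply column names
    na_values="?",    # treat "?" as a missing value
)

print(toy)
print(toy.isna().sum())  # number of missing values per column
```

The same call works with a file path or a URL in place of the `StringIO` object.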

In [7]:
# read the dataset which is in csv format
# Recognize "?" values as NA/NAN.
df = pd.read_csv("data/adult.csv", na_values="?")
# instead of path, you can pass a url to an online file

Let's see what we have now:

In [8]:
type(df)

pandas.core.frame.DataFrame

In [9]:
df

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48837,39,Private,215419,Bachelors,13,Divorced,Prof-specialty,Not-in-family,White,Female,0,0,36,United-States,<=50K
48838,64,,321403,HS-grad,9,Widowed,,Other-relative,Black,Male,0,0,40,United-States,<=50K
48839,38,Private,374983,Bachelors,13,Married-civ-spouse,Prof-specialty,Husband,White,Male,0,0,50,United-States,<=50K
48840,44,Private,83891,Bachelors,13,Divorced,Adm-clerical,Own-child,Asian-Pac-Islander,Male,5455,0,40,United-States,<=50K


## Getting an overview

In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             48842 non-null  int64 
 1   workclass       46043 non-null  object
 2   fnlwgt          48842 non-null  int64 
 3   education       48842 non-null  object
 4   education-num   48842 non-null  int64 
 5   marital-status  48842 non-null  object
 6   occupation      46033 non-null  object
 7   relationship    48842 non-null  object
 8   race            48842 non-null  object
 9   sex             48842 non-null  object
 10  capital-gain    48842 non-null  int64 
 11  capital-loss    48842 non-null  int64 
 12  hours-per-week  48842 non-null  int64 
 13  native-country  47985 non-null  object
 14  class           48842 non-null  object
dtypes: int64(6), object(9)
memory usage: 5.6+ MB


In [11]:
len(df)

48842

In [12]:
df.columns

Index(['age', 'workclass', 'fnlwgt', 'education', 'education-num',
       'marital-status', 'occupation', 'relationship', 'race', 'sex',
       'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
       'class'],
      dtype='object')

In [13]:
# values: Return an array representing the data in the Index object
df.columns.values

array(['age', 'workclass', 'fnlwgt', 'education', 'education-num',
       'marital-status', 'occupation', 'relationship', 'race', 'sex',
       'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
       'class'], dtype=object)

We can get some more details with the `dtypes` attribute:

In [14]:
df.dtypes

age                int64
workclass         object
fnlwgt             int64
education         object
education-num      int64
marital-status    object
occupation        object
relationship      object
race              object
sex               object
capital-gain       int64
capital-loss       int64
hours-per-week     int64
native-country    object
class             object
dtype: object

In [15]:
# summary statistics of the numeric columns of the DataFrame
df.describe()

Unnamed: 0,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week
count,48842.0,48842.0,48842.0,48842.0,48842.0,48842.0
mean,38.643585,189664.1,10.078089,1079.067626,87.502314,40.422382
std,13.71051,105604.0,2.570973,7452.019058,403.004552,12.391444
min,17.0,12285.0,1.0,0.0,0.0,1.0
25%,28.0,117550.5,9.0,0.0,0.0,40.0
50%,37.0,178144.5,10.0,0.0,0.0,40.0
75%,48.0,237642.0,12.0,0.0,0.0,45.0
max,90.0,1490400.0,16.0,99999.0,4356.0,99.0
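Note that `describe()` on a mixed dataframe only summarizes the numeric columns. As a rough sketch on made-up data, a text column can be summarized by calling `describe()` on that column directly:

```python
import pandas as pd

# Hypothetical toy frame with one numeric and one text column.
toy = pd.DataFrame(
    {
        "age": [39, 50, 38, 53],
        "education": ["Bachelors", "Bachelors", "HS-grad", "11th"],
    }
)

numeric_summary = toy.describe()            # numeric columns only
text_summary = toy["education"].describe()  # count, unique, top, freq

print(numeric_summary)
print(text_summary)
```

For text columns the summary reports the number of distinct values and the most frequent value instead of mean and quartiles.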


In [16]:
# Compute pairwise correlation of columns, excluding NA/null values.
df.corr(numeric_only=True)

Unnamed: 0,age,fnlwgt,education-num,capital-gain,capital-loss,hours-per-week
age,1.0,-0.076628,0.03094,0.077229,0.056944,0.071558
fnlwgt,-0.076628,1.0,-0.038761,-0.003706,-0.004366,-0.013519
education-num,0.03094,-0.038761,1.0,0.125146,0.080972,0.143689
capital-gain,0.077229,-0.003706,0.125146,1.0,-0.031441,0.082157
capital-loss,0.056944,-0.004366,0.080972,-0.031441,1.0,0.054467
hours-per-week,0.071558,-0.013519,0.143689,0.082157,0.054467,1.0


In [17]:
from scipy.stats import pearsonr
pearsonr(df['education-num'], df['hours-per-week'])

PearsonRResult(statistic=0.1436889093924793, pvalue=1.3860285325205765e-223)
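The `statistic` returned by `pearsonr` is the same number as the corresponding entry of `df.corr()`. To make the formula concrete, the sketch below computes Pearson's r by hand on a few made-up values and compares it with pandas; the toy columns are assumptions, not rows from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy values for two numeric variables.
toy = pd.DataFrame(
    {
        "education_num": [13, 13, 9, 7, 13, 14],
        "hours_per_week": [40, 13, 40, 40, 40, 50],
    }
)

x = toy["education_num"].to_numpy(dtype=float)
y = toy["hours_per_week"].to_numpy(dtype=float)

# Pearson's r: covariance of the centered values divided by
# the product of their standard deviations.
x_centered = x - x.mean()
y_centered = y - y.mean()
r_manual = (x_centered * y_centered).sum() / np.sqrt(
    (x_centered**2).sum() * (y_centered**2).sum()
)

r_pandas = toy["education_num"].corr(toy["hours_per_week"])
print(r_manual, r_pandas)
```

Both values agree up to floating-point rounding.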

## Renaming Columns

In [18]:
df.columns.values

array(['age', 'workclass', 'fnlwgt', 'education', 'education-num',
       'marital-status', 'occupation', 'relationship', 'race', 'sex',
       'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
       'class'], dtype=object)

In [19]:
df = df.rename(columns={"sex" : "gender"})
# df.rename(columns={"sex" : "gender"}, inplace=True)

In [20]:
df.columns.values

array(['age', 'workclass', 'fnlwgt', 'education', 'education-num',
       'marital-status', 'occupation', 'relationship', 'race', 'gender',
       'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
       'class'], dtype=object)

In [21]:
df2 = df.rename(columns={"gender" : "sex", "fnlwgt" : "weight"})

In [22]:
df2.columns.values

array(['age', 'workclass', 'weight', 'education', 'education-num',
       'marital-status', 'occupation', 'relationship', 'race', 'sex',
       'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
       'class'], dtype=object)
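As the cells above suggest, `rename` returns a new dataframe and leaves the original untouched unless `inplace=True` is passed. A minimal sketch on a made-up frame:

```python
import pandas as pd

# Hypothetical two-column frame.
toy = pd.DataFrame({"sex": ["Male", "Female"], "fnlwgt": [77516, 338409]})

# rename returns a new DataFrame; several columns can be renamed at once.
renamed = toy.rename(columns={"sex": "gender", "fnlwgt": "weight"})

print(list(toy.columns))      # original names are still in place
print(list(renamed.columns))  # new names

# Keys that do not match any column are silently ignored by default.
unchanged = toy.rename(columns={"not_a_column": "x"})
print(list(unchanged.columns))
```

Because unknown keys are ignored by default, a typo in the mapping does not raise an error, so it is worth checking the column names afterwards.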

## Replacing all occurrences of a string in a column

In [23]:
# look at values in "education" column
df2

Unnamed: 0,age,workclass,weight,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48837,39,Private,215419,Bachelors,13,Divorced,Prof-specialty,Not-in-family,White,Female,0,0,36,United-States,<=50K
48838,64,,321403,HS-grad,9,Widowed,,Other-relative,Black,Male,0,0,40,United-States,<=50K
48839,38,Private,374983,Bachelors,13,Married-civ-spouse,Prof-specialty,Husband,White,Male,0,0,50,United-States,<=50K
48840,44,Private,83891,Bachelors,13,Divorced,Adm-clerical,Own-child,Asian-Pac-Islander,Male,5455,0,40,United-States,<=50K


In [24]:
# Calling replace(..., inplace=True) on the selected column (df2.education) acts on an
# intermediate object, raises a FutureWarning, and will stop working in pandas 3.0.
# Assign the result back to the column instead:
df2["education"] = df2["education"].replace(["Bachelors", "HS-grad"],
                                            ["Bachelor", "Highschool"])
df2

Unnamed: 0,age,workclass,weight,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class
0,39,State-gov,77516,Bachelor,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelor,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,Highschool,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelor,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48837,39,Private,215419,Bachelor,13,Divorced,Prof-specialty,Not-in-family,White,Female,0,0,36,United-States,<=50K
48838,64,,321403,Highschool,9,Widowed,,Other-relative,Black,Male,0,0,40,United-States,<=50K
48839,38,Private,374983,Bachelor,13,Married-civ-spouse,Prof-specialty,Husband,White,Male,0,0,50,United-States,<=50K
48840,44,Private,83891,Bachelor,13,Divorced,Adm-clerical,Own-child,Asian-Pac-Islander,Male,5455,0,40,United-States,<=50K
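Two ways of replacing values in a column without relying on `inplace=True` are sketched below on made-up data. Note that `replace` matches whole cell values; replacing a substring inside each value would need `Series.str.replace` instead.

```python
import pandas as pd

# Hypothetical toy frames with the same content.
toy_a = pd.DataFrame({"education": ["Bachelors", "HS-grad", "11th"], "age": [39, 38, 53]})
toy_b = toy_a.copy()

# Option 1: replace on the column and assign the result back.
toy_a["education"] = toy_a["education"].replace(
    ["Bachelors", "HS-grad"], ["Bachelor", "Highschool"]
)

# Option 2: replace on the DataFrame with a nested {column: {old: new}} mapping.
toy_b = toy_b.replace({"education": {"Bachelors": "Bachelor", "HS-grad": "Highschool"}})

print(toy_a)
print(toy_b)
```

Both options leave values that are not listed (here `11th`) unchanged.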


## Selecting data subsets

### Selecting Columns of the data

In [26]:
df.columns

Index(['age', 'workclass', 'fnlwgt', 'education', 'education-num',
       'marital-status', 'occupation', 'relationship', 'race', 'gender',
       'capital-gain', 'capital-loss', 'hours-per-week', 'native-country',
       'class'],
      dtype='object')

In [27]:
# Selecting a specific column of the data:
age = df['age']
age

0        39
1        50
2        38
3        53
4        28
         ..
48837    39
48838    64
48839    38
48840    44
48841    35
Name: age, Length: 48842, dtype: int64

In [28]:
type(age)

pandas.core.series.Series

In [30]:
dict(age)

{0: 39,
 1: 50,
 2: 38,
 3: 53,
 4: 28,
 ...


In [33]:
age.to_dict()

{0: 39,
 1: 50,
 2: 38,
 3: 53,
 4: 28,
 ...


In [34]:
df['age']

0        39
1        50
2        38
3        53
4        28
         ..
48837    39
48838    64
48839    38
48840    44
48841    35
Name: age, Length: 48842, dtype: int64

In [35]:
# an alternative syntax for selecting a single column
df.age

0        39
1        50
2        38
3        53
4        28
         ..
48837    39
48838    64
48839    38
48840    44
48841    35
Name: age, Length: 48842, dtype: int64

In [36]:
# Selecting a single column returns a pandas.Series object
type(age)

pandas.core.series.Series

In [37]:
# We can compute basically any univariate statistic from a series
print("Mean:", age.mean())
print("Standard deviation:", age.std())
print("Median:", age.median())
print("Maximum value:", age.max())
print("Index of first occurrence of maximum value:", age.idxmax())
print("Mode:", age.mode())
print("25-percentile:", age.quantile(0.25))

Mean: 38.64358543876172
Standard deviation: 13.710509934443555
Median: 37.0
Maximum value: 90
Index of first occurrence of maximum value: 222
Mode: 0    36
Name: age, dtype: int64
25-percentile: 28.0
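The mode is printed as a Series above because a column can have several equally frequent values. A small sketch with made-up ages illustrates this, together with requesting several quantiles or statistics in one call:

```python
import pandas as pd

# Hypothetical ages in which 36 and 41 both occur twice.
ages = pd.Series([23, 36, 36, 41, 41, 58], name="age")

modes = ages.mode()  # a Series, since there can be more than one mode
print(modes.tolist())

# Several quantiles at once: pass a list of probabilities.
print(ages.quantile([0.25, 0.5, 0.75]))

# Several statistics at once with agg.
print(ages.agg(["mean", "std", "median", "max"]))
```

If only one mode is needed, `ages.mode().iloc[0]` picks the smallest of the tied values.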


In [38]:
# we could simply convert this to a list object, but this is something we rarely ever need
# list(age)

In [39]:
df.gender

0          Male
1          Male
2          Male
3          Male
4        Female
          ...  
48837    Female
48838      Male
48839      Male
48840      Male
48841      Male
Name: gender, Length: 48842, dtype: object

In [40]:
df.gender.unique()

array(['Male', 'Female'], dtype=object)

In [41]:
df.gender.value_counts()

gender
Male      32650
Female    16192
Name: count, dtype: int64
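A few variations on counting values are sketched below on a made-up column that includes one missing entry: relative frequencies, the number of distinct values, and counting missing values as well.

```python
import pandas as pd

# Hypothetical column with one missing value.
gender = pd.Series(["Male", "Male", "Female", "Male", None], name="gender")

counts = gender.value_counts()                # missing values are skipped
shares = gender.value_counts(normalize=True)  # fractions instead of counts
n_distinct = gender.nunique()                 # number of distinct non-missing values
counts_with_na = gender.value_counts(dropna=False)  # also count missing values

print(counts)
print(shares)
print(n_distinct)
print(counts_with_na)
```

`nunique()` is a shorter way of writing `len(column.unique())` when the column has no missing values.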

In [46]:
# array of the values in the first row
df.values[0]

array([39, 'State-gov', 77516, 'Bachelors', 13, 'Never-married',
       'Adm-clerical', 'Not-in-family', 'White', 'Male', 2174, 0, 40,
       'United-States', '<=50K'], dtype=object)

In [43]:
df

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,class
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48837,39,Private,215419,Bachelors,13,Divorced,Prof-specialty,Not-in-family,White,Female,0,0,36,United-States,<=50K
48838,64,,321403,HS-grad,9,Widowed,,Other-relative,Black,Male,0,0,40,United-States,<=50K
48839,38,Private,374983,Bachelors,13,Married-civ-spouse,Prof-specialty,Husband,White,Male,0,0,50,United-States,<=50K
48840,44,Private,83891,Bachelors,13,Divorced,Adm-clerical,Own-child,Asian-Pac-Islander,Male,5455,0,40,United-States,<=50K


In [47]:
# we can also select multiple columns using a list of column names
df2 = df[['age', 'sex', 'education']]
type(df2)

KeyError: "['sex'] not in index"

We get an error because we renamed the column 'sex' to 'gender' earlier.

In [48]:
df2 = df[['age', 'gender', 'education']]
type(df2)

pandas.core.frame.DataFrame
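One way to avoid such a `KeyError` is to check the requested names against `df.columns` first. The sketch below does this on a made-up frame and also shows the difference between single and double brackets; the variable names are chosen for illustration.

```python
import pandas as pd

# Hypothetical frame in which "sex" has already been renamed to "gender".
toy = pd.DataFrame(
    {
        "age": [39, 50],
        "gender": ["Male", "Male"],
        "education": ["Bachelors", "Bachelors"],
    }
)

wanted = ["age", "sex", "education"]
missing = [name for name in wanted if name not in toy.columns]
present = [name for name in wanted if name in toy.columns]
print("missing columns:", missing)

subset = toy[present]           # list of names -> DataFrame
one_col_series = toy["age"]     # single name   -> Series
one_col_frame = toy[["age"]]    # list with one name -> DataFrame with one column
print(type(one_col_series), type(one_col_frame))
```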

In [49]:
df2

Unnamed: 0,age,gender,education
0,39,Male,Bachelors
1,50,Male,Bachelors
2,38,Male,HS-grad
3,53,Male,11th
4,28,Female,Bachelors
...,...,...,...
48837,39,Female,Bachelors
48838,64,Male,HS-grad
48839,38,Male,Bachelors
48840,44,Male,Bachelors


In [None]:
df2

### Ex 1
1. Load "adult.csv" into a dataframe named adult_df (recognize "?" values as NA/NaN)
2. Get a subset of the dataframe with columns "age", "sex", "education", "hours-per-week", "capital-gain"
3. Rename column "capital-gain" to "capital_gain"
4. Print the column names of adult_df
5. Print number of different values for the attribute education
6. Print the mean "working time per week"
7. Print the max "capital_gain"

In [50]:
adult_df = pd.read_csv('data/adult.csv', na_values="?")
adult_df

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,class
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48837,39,Private,215419,Bachelors,13,Divorced,Prof-specialty,Not-in-family,White,Female,0,0,36,United-States,<=50K
48838,64,,321403,HS-grad,9,Widowed,,Other-relative,Black,Male,0,0,40,United-States,<=50K
48839,38,Private,374983,Bachelors,13,Married-civ-spouse,Prof-specialty,Husband,White,Male,0,0,50,United-States,<=50K
48840,44,Private,83891,Bachelors,13,Divorced,Adm-clerical,Own-child,Asian-Pac-Islander,Male,5455,0,40,United-States,<=50K


In [51]:
adult_df_subset = adult_df[["age", "sex",  "education", "hours-per-week", "capital-gain"]]
#  "age", "sex", "education", "hours-per-week", "capital-gain"
adult_df_subset

Unnamed: 0,age,sex,education,hours-per-week,capital-gain
0,39,Male,Bachelors,40,2174
1,50,Male,Bachelors,13,0
2,38,Male,HS-grad,40,0
3,53,Male,11th,40,0
4,28,Female,Bachelors,40,0
...,...,...,...,...,...
48837,39,Female,Bachelors,36,0
48838,64,Male,HS-grad,40,0
48839,38,Male,Bachelors,50,0
48840,44,Male,Bachelors,40,5455


In [52]:
len(adult_df.education.unique())

17

In [53]:
adult_df.education.value_counts()

education
HS-grad         15783
Some-college    10878
Bachelors        8025
Masters          2657
Assoc-voc        2061
11th             1812
Assoc-acdm       1601
10th             1389
7th-8th           955
Prof-school       834
9th               756
12th              657
Doctorate         594
5th-6th           509
1st-4th           247
Preschool          83
HS-jupytgrad        1
Name: count, dtype: int64
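The remaining exercise steps (renaming, printing column names, counting distinct values, mean and maximum) could look roughly like the sketch below. It runs on a tiny made-up frame rather than on `adult_df`, so the printed numbers are not the exercise answers.

```python
import pandas as pd

# Hypothetical stand-in for adult_df with only the columns the exercise needs.
toy_adult = pd.DataFrame(
    {
        "age": [39, 50, 38],
        "sex": ["Male", "Male", "Female"],
        "education": ["Bachelors", "Bachelors", "HS-grad"],
        "hours-per-week": [40, 13, 40],
        "capital-gain": [2174, 0, 0],
    }
)

# Step 2: select the subset of columns.
subset = toy_adult[["age", "sex", "education", "hours-per-week", "capital-gain"]]

# Step 3: rename returns a new DataFrame, so assign it back.
subset = subset.rename(columns={"capital-gain": "capital_gain"})

print(list(subset.columns))              # step 4: column names
print(subset["education"].nunique())     # step 5: number of distinct values
print(subset["hours-per-week"].mean())   # step 6: mean working time per week
print(subset["capital_gain"].max())      # step 7: maximum capital gain
```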

In [None]:
# %load "21_data_exploration_ex_1.py"

### Iterator

In [54]:
# iterate over column names
column_names = df.columns.values
for column_name in column_names:
    print(column_name)

age
workclass
fnlwgt
education
education-num
marital-status
occupation
relationship
race
gender
capital-gain
capital-loss
hours-per-week
native-country
class


In [56]:
# iterate over rows as (index, Series) pairs.
for i, row_data in df.iterrows():
    print(i, type(row_data))
    print(row_data)
    print("-------")
    print(row_data["education"])
    break

0 <class 'pandas.core.series.Series'>
age                          39
workclass             State-gov
fnlwgt                    77516
education             Bachelors
education-num                13
marital-status    Never-married
occupation         Adm-clerical
relationship      Not-in-family
race                      White
gender                     Male
capital-gain               2174
capital-loss                  0
hours-per-week               40
native-country    United-States
class                     <=50K
Name: 0, dtype: object
-------
Bachelors


In [57]:
# Iterate over DataFrame rows as namedtuples.
for row_data in df.itertuples():
    print(row_data[0], type(row_data))
    print(row_data)
    print("-------")
    print(row_data.education)
    break

0 <class 'pandas.core.frame.Pandas'>
Pandas(Index=0, age=39, workclass='State-gov', fnlwgt=77516, education='Bachelors', _5=13, _6='Never-married', occupation='Adm-clerical', relationship='Not-in-family', race='White', gender='Male', _11=2174, _12=0, _13=40, _14='United-States', _15='<=50K')
-------
Bachelors
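The fields `_5`, `_6`, ... in the output above appear because `itertuples` builds a namedtuple, and column names that are not valid Python identifiers (such as `education-num`) or that are keywords (such as `class`) are replaced by positional names. The sketch below reproduces this on a made-up frame and shows one possible workaround, renaming the columns first; the new names are assumptions.

```python
import pandas as pd

# Hypothetical frame with one invalid identifier and one Python keyword as column names.
toy = pd.DataFrame(
    {"age": [39, 50], "education-num": [13, 13], "class": ["<=50K", "<=50K"]}
)

first = next(toy.itertuples())
print(first._fields)  # invalid names are replaced by their position
print(first[2])       # positional access still works

# Workaround: rename the columns to valid identifiers before iterating.
clean = toy.rename(columns={"education-num": "education_num", "class": "income_class"})
row = next(clean.itertuples(index=False))
print(row.education_num, row.income_class)
```

With `index=False` the index is left out of the tuple, so the first field is the first column.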


### Selecting rows by position

In [58]:
# get the first 5 rows
# note the leftmost column: it is the index
df.head()

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,class
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [59]:
# get the first 4 rows
df.head(4)

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,class
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K


In [60]:
# get the last 3 rows
df.tail(3)

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,class
48839,38,Private,374983,Bachelors,13,Married-civ-spouse,Prof-specialty,Husband,White,Male,0,0,50,United-States,<=50K
48840,44,Private,83891,Bachelors,13,Divorced,Adm-clerical,Own-child,Asian-Pac-Islander,Male,5455,0,40,United-States,<=50K
48841,35,Self-emp-inc,182148,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,60,United-States,>50K


In [65]:
# show a random sample of 3 rows
# (the sample size is the first argument, named n; sample() has no argument k)
df_sample = df.sample(3)
df_sample
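Since `sample` draws different rows on every call, it can help to fix the seed while exploring. A minimal sketch on a made-up frame, using the `n`, `frac` and `random_state` arguments:

```python
import pandas as pd

# Hypothetical frame with ten rows.
toy = pd.DataFrame({"age": range(20, 30)})

# n is the number of rows to draw; random_state makes the draw repeatable.
sample_a = toy.sample(n=3, random_state=0)
sample_b = toy.sample(n=3, random_state=0)
print(sample_a.equals(sample_b))  # same seed, same rows

# frac draws a fraction of the rows instead of a fixed number.
sample_half = toy.sample(frac=0.5, random_state=0)
print(len(sample_half))
```

The sampled rows keep their original index labels, which matters when selecting from the sample by label or by position.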

In [67]:
df

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,class
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
48837,39,Private,215419,Bachelors,13,Divorced,Prof-specialty,Not-in-family,White,Female,0,0,36,United-States,<=50K
48838,64,,321403,HS-grad,9,Widowed,,Other-relative,Black,Male,0,0,40,United-States,<=50K
48839,38,Private,374983,Bachelors,13,Married-civ-spouse,Prof-specialty,Husband,White,Male,0,0,50,United-States,<=50K
48840,44,Private,83891,Bachelors,13,Divorced,Adm-clerical,Own-child,Asian-Pac-Islander,Male,5455,0,40,United-States,<=50K


In [70]:
# Selecting a single row
# iloc ("integer location") selects by position along the index,
# from 0 to length-1 of the axis.
# Negative positions count from the end.

# get the last row (position -1)
# Note that positions start at 0
df.iloc[-1]

age                               35
workclass               Self-emp-inc
fnlwgt                        182148
education                  Bachelors
education-num                     13
marital-status    Married-civ-spouse
occupation           Exec-managerial
relationship                 Husband
race                           White
gender                          Male
capital-gain                       0
capital-loss                       0
hours-per-week                    60
native-country         United-States
class                           >50K
Name: 48841, dtype: object

In [71]:
type(df.iloc[2])

pandas.core.series.Series

In [None]:
df_sample.iloc[1]

In [73]:
# select a specific range of rows
type(df.iloc[10:20])

pandas.core.frame.DataFrame

In [74]:
df2 = df.iloc[10:20]
df2

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,class
10,37,Private,280464,Some-college,10,Married-civ-spouse,Exec-managerial,Husband,Black,Male,0,0,80,United-States,>50K
11,30,State-gov,141297,Bachelors,13,Married-civ-spouse,Prof-specialty,Husband,Asian-Pac-Islander,Male,0,0,40,India,>50K
12,23,Private,122272,Bachelors,13,Never-married,Adm-clerical,Own-child,White,Female,0,0,30,United-States,<=50K
13,32,Private,205019,Assoc-acdm,12,Never-married,Sales,Not-in-family,Black,Male,0,0,50,United-States,<=50K
14,40,Private,121772,Assoc-voc,11,Married-civ-spouse,Craft-repair,Husband,Asian-Pac-Islander,Male,0,0,40,,>50K
15,34,Private,245487,7th-8th,4,Married-civ-spouse,Transport-moving,Husband,Amer-Indian-Eskimo,Male,0,0,45,Mexico,<=50K
16,25,Self-emp-not-inc,176756,HS-grad,9,Never-married,Farming-fishing,Own-child,White,Male,0,0,35,United-States,<=50K
17,32,Private,186824,HS-grad,9,Never-married,Machine-op-inspct,Unmarried,White,Male,0,0,40,United-States,<=50K
18,38,Private,28887,11th,7,Married-civ-spouse,Sales,Husband,White,Male,0,0,50,United-States,<=50K
19,43,Self-emp-not-inc,292175,Masters,14,Divorced,Exec-managerial,Unmarried,White,Female,0,0,45,United-States,>50K


In [75]:
df2.iloc[5]

age                               34
workclass                    Private
fnlwgt                        245487
education                    7th-8th
education-num                      4
marital-status    Married-civ-spouse
occupation          Transport-moving
relationship                 Husband
race              Amer-Indian-Eskimo
gender                          Male
capital-gain                       0
capital-loss                       0
hours-per-week                    45
native-country                Mexico
class                          <=50K
Name: 15, dtype: object

In [76]:
# The .loc indexer uses the labels of the index, not the integer positions.
# select a row by label
df2.loc[15]

age                               34
workclass                    Private
fnlwgt                        245487
education                    7th-8th
education-num                      4
marital-status    Married-civ-spouse
occupation          Transport-moving
relationship                 Husband
race              Amer-Indian-Eskimo
gender                          Male
capital-gain                       0
capital-loss                       0
hours-per-week                    45
native-country                Mexico
class                          <=50K
Name: 15, dtype: object

In [77]:
df2.iloc[5].equals(df2.loc[15])

True

In [78]:
# df2 has no row with index label 5, so this raises a KeyError
df2.loc[5]

KeyError: 5

In [80]:
df2

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,class
10,37,Private,280464,Some-college,10,Married-civ-spouse,Exec-managerial,Husband,Black,Male,0,0,80,United-States,>50K
11,30,State-gov,141297,Bachelors,13,Married-civ-spouse,Prof-specialty,Husband,Asian-Pac-Islander,Male,0,0,40,India,>50K
12,23,Private,122272,Bachelors,13,Never-married,Adm-clerical,Own-child,White,Female,0,0,30,United-States,<=50K
13,32,Private,205019,Assoc-acdm,12,Never-married,Sales,Not-in-family,Black,Male,0,0,50,United-States,<=50K
14,40,Private,121772,Assoc-voc,11,Married-civ-spouse,Craft-repair,Husband,Asian-Pac-Islander,Male,0,0,40,,>50K
15,34,Private,245487,7th-8th,4,Married-civ-spouse,Transport-moving,Husband,Amer-Indian-Eskimo,Male,0,0,45,Mexico,<=50K
16,25,Self-emp-not-inc,176756,HS-grad,9,Never-married,Farming-fishing,Own-child,White,Male,0,0,35,United-States,<=50K
17,32,Private,186824,HS-grad,9,Never-married,Machine-op-inspct,Unmarried,White,Male,0,0,40,United-States,<=50K
18,38,Private,28887,11th,7,Married-civ-spouse,Sales,Husband,White,Male,0,0,50,United-States,<=50K
19,43,Self-emp-not-inc,292175,Masters,14,Divorced,Exec-managerial,Unmarried,White,Female,0,0,45,United-States,>50K


In [81]:
# reset the current index to 0..n-1
# without drop=True, the old index would be inserted as a new column
df2 = df2.reset_index(drop=True)
df2

Unnamed: 0,age,workclass,fnlwgt,education,education-num,marital-status,occupation,relationship,race,gender,capital-gain,capital-loss,hours-per-week,native-country,class
0,37,Private,280464,Some-college,10,Married-civ-spouse,Exec-managerial,Husband,Black,Male,0,0,80,United-States,>50K
1,30,State-gov,141297,Bachelors,13,Married-civ-spouse,Prof-specialty,Husband,Asian-Pac-Islander,Male,0,0,40,India,>50K
2,23,Private,122272,Bachelors,13,Never-married,Adm-clerical,Own-child,White,Female,0,0,30,United-States,<=50K
3,32,Private,205019,Assoc-acdm,12,Never-married,Sales,Not-in-family,Black,Male,0,0,50,United-States,<=50K
4,40,Private,121772,Assoc-voc,11,Married-civ-spouse,Craft-repair,Husband,Asian-Pac-Islander,Male,0,0,40,,>50K
5,34,Private,245487,7th-8th,4,Married-civ-spouse,Transport-moving,Husband,Amer-Indian-Eskimo,Male,0,0,45,Mexico,<=50K
6,25,Self-emp-not-inc,176756,HS-grad,9,Never-married,Farming-fishing,Own-child,White,Male,0,0,35,United-States,<=50K
7,32,Private,186824,HS-grad,9,Never-married,Machine-op-inspct,Unmarried,White,Male,0,0,40,United-States,<=50K
8,38,Private,28887,11th,7,Married-civ-spouse,Sales,Husband,White,Male,0,0,50,United-States,<=50K
9,43,Self-emp-not-inc,292175,Masters,14,Divorced,Exec-managerial,Unmarried,White,Female,0,0,45,United-States,>50K


In [82]:
df2.loc[5]

age                               34
workclass                    Private
fnlwgt                        245487
education                    7th-8th
education-num                      4
marital-status    Married-civ-spouse
occupation          Transport-moving
relationship                 Husband
race              Amer-Indian-Eskimo
gender                          Male
capital-gain                       0
capital-loss                       0
hours-per-week                    45
native-country                Mexico
class                          <=50K
Name: 5, dtype: object

In [83]:
df2.iloc[5].equals(df2.loc[5])

True
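
The difference between label-based and position-based selection shown above can be reproduced on a small frame. This is a minimal sketch; the frame `toy` and its values are made up for illustration and are not part of the dataset.

```python
import pandas as pd

# Hypothetical toy frame whose index labels start at 10, mimicking df2 above.
toy = pd.DataFrame({"age": [37, 30, 23]}, index=[10, 11, 12])

by_position = toy.iloc[1]   # second row, regardless of its label
by_label = toy.loc[11]      # row whose index label is 11
same_row = by_position.equals(by_label)

# Label 1 does not exist, so .loc raises a KeyError.
try:
    toy.loc[1]
    label_1_found = True
except KeyError:
    label_1_found = False

# After resetting the index, labels and positions coincide again.
toy_reset = toy.reset_index(drop=True)
age_at_label_1 = toy_reset.loc[1, "age"]
```

Here `reset_index` is called without `inplace=True`, so it returns a new frame and leaves `toy` untouched.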

### Selecting rows by condition

In [None]:
h = df.head()

As a reminder, given a list of strings, [] will select columns of the data.

In [None]:
h[["age", "gender"]]

If it is used with lists of booleans, then rows are selected instead!

In [None]:
# select the rows at positions 0, 1 and 4
h[[True, True, False, False, True]]

Slight excursion: We can do many computations with series just as with single numbers:

In [None]:
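# h.capital-gain does not work: Python reads it as h.capital - gain
# use h["capital-gain"] for column names that contain a hyphen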
h.age

In [None]:
# floor-divide by 10, then multiply by 10 (this drops the units digit)
# the computation is applied element-wise to every value in the Series
h.age // 10 * 10
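
As a rough illustration of element-wise computation, the sketch below applies the same arithmetic and a comparison to a small Series. The values in `ages` are invented for the example.

```python
import pandas as pd

# Hypothetical ages, not taken from the dataset.
ages = pd.Series([25, 52, 38, 44, 19], name="age")

# Arithmetic is applied to every element: 25 -> 20, 52 -> 50, ...
decades = ages // 10 * 10

# A comparison yields a boolean Series of the same length.
is_young = ages < 40

# True counts as 1, so summing a boolean Series counts the matches.
n_young = int(is_young.sum())
```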

In [None]:
h.age < 40

In [None]:
h[[True, False, True, False, True]]

Instead of selecting rows manually, now let's do it programmatically.

In [None]:
# now... this can be very useful
# get the people younger than 40
h[h["age"] < 40]
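
To make the link between the manual and the programmatic selection explicit, here is a small sketch on made-up data (the frame `toy` is an assumption for illustration). The boolean Series produced by the comparison plays the same role as the hand-written list of booleans.

```python
import pandas as pd

# Hypothetical five-row frame standing in for df.head().
toy = pd.DataFrame({
    "age": [25, 52, 38, 44, 19],
    "gender": ["Male", "Female", "Male", "Female", "Female"],
})

mask = toy["age"] < 40                           # boolean Series, one value per row
manual = toy[[True, False, True, False, True]]   # hand-written booleans
programmatic = toy[mask]                         # same rows, computed from the data
```

Both selections keep the original index labels of the rows they retain.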

In [None]:
len(df[df.age < 40])

In [None]:
young = df[df.age < 40]
len(young)

In [None]:
len(df[df.age < 20])

In [None]:
females = df[df.gender == "Female"]
females

In [None]:
# to combine multiple conditions use &
(df.age < 40) & (df.gender == "Female")

In [None]:
young_females = df[(df.age < 40) & (df.gender == "Female")]
young_females

In [None]:
young_or_female = df[(df.age < 40) | (df.gender == "Female")]
young_or_female

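Note: **&** and **|** are Python's bitwise operators. On integers they compare the binary digits; on boolean Series, pandas applies them element-wise. Wrap each condition in parentheses, because & and | bind more tightly than comparisons such as < and ==.

- & -> AND: the result is True (bit 1) only where both operands are True (1)
- | -> OR: the result is True (bit 1) where at least one operand is True (1)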
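
The following sketch, on invented data, shows the element-wise operators in use, adds `~` for negation, and checks that the plain Python keyword `and` is not a substitute. The frame `toy` is hypothetical.

```python
import pandas as pd

# Hypothetical data for illustration only.
toy = pd.DataFrame({
    "age": [25, 52, 38, 44, 19],
    "gender": ["Male", "Female", "Male", "Female", "Female"],
})

young = toy["age"] < 40
female = toy["gender"] == "Female"

both = toy[young & female]      # rows where both conditions hold
either = toy[young | female]    # rows where at least one condition holds
not_young = toy[~young]         # ~ negates a boolean Series element-wise

# The keyword `and` needs a single truth value, which a Series cannot provide.
try:
    young and female
    and_worked = True
except ValueError:
    and_worked = False
```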

## Sorting

In [None]:
df.sort_values("age", inplace=True)

In [None]:
df

In [None]:
df.sort_values("age", ascending=False)

In [None]:
# df is still sorted in ascending order: the descending sort above was not in place
df

In [None]:
df.sort_values(["age", "hours-per-week"]).head(25)

In [None]:
df.sort_values(["age","hours-per-week"], ascending=False).head(25)
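
A small sketch of the sorting behaviour, using a made-up frame. It also shows that `ascending` accepts one flag per sort column, and that an in-place sort returns `None`.

```python
import pandas as pd

# Hypothetical data for illustration only.
toy = pd.DataFrame({
    "age": [30, 25, 30, 25],
    "hours-per-week": [40, 20, 50, 35],
})

# Sort by age ascending, then by hours-per-week descending within each age.
# sort_values returns a new DataFrame by default; toy itself is unchanged.
result = toy.sort_values(["age", "hours-per-week"], ascending=[True, False])

# With inplace=True the frame is modified and the call returns None.
toy_copy = toy.copy()
returned = toy_copy.sort_values("age", inplace=True)
```

Assigning the result of an in-place sort to a variable is therefore a common mistake, since that variable ends up holding `None`.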

## Ex 2
1. Use adult_df
2. Store all persons whose highest degree is a Bachelor's degree in a DataFrame named 'bachelors'
3. Print the number of those persons
4. Print the sum of their capital_gain
5. How many of those persons are male and how many are female?
6. Sort them by capital_gain and age in descending order and save the result in the same object
7. Print the first 10 of those persons whose age is between 20 and 40

In [None]:
# %load "21_data_exploration_ex_2.py"
# ####### step 2
bachelors = adult_df[adult_df["education"] == "Bachelors"]

print("####### step 3")
print("Number of persons with a Bachelor degree as their highest degree:", len(bachelors))

print("####### step 4")
print("Sum of their capital_gain: ", bachelors.capital_gain.sum())

print("####### step 5")
print(bachelors.sex.value_counts())
print('-------OR-------')
print("Female: ", len(bachelors[bachelors["sex"] == "Female"]))
print("Male: ", len(bachelors[bachelors["sex"] == "Male"]))

# ####### step 6
bachelors = bachelors.sort_values(["capital_gain", "age"], ascending=False)

print("####### step 7")
bachelors[(bachelors["age"] >= 20) & (bachelors["age"] <= 40)].head(10)

## Optional

In [None]:
df.head(10)

In [None]:
# select the row at position 9
df.iloc[9]

In [None]:
# select the value at row position 9 and column position 1 (workclass)
# remember that positions start at 0
df.iloc[9, 1]

In [None]:
# same row at position 9, with an explicit slice selecting all columns
df.iloc[9, :]

In [None]:
df.loc[9, "workclass"]

In [None]:
# select column "workclass"
df.loc[:, "workclass"]

In [None]:
# it is the same as df.workclass
df.loc[:, "workclass"].equals(df.workclass)

In [None]:
df[["workclass", "fnlwgt"]]

In [None]:
df.iloc[:, 1:3]
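
One detail worth keeping in mind when slicing: `iloc` slices exclude the stop position, whereas `loc` slices include both endpoint labels. The sketch below uses a made-up two-row frame with column names borrowed from the dataset.

```python
import pandas as pd

# Hypothetical two-row frame; the values are for illustration only.
toy = pd.DataFrame({
    "age": [37, 30],
    "workclass": ["Private", "State-gov"],
    "fnlwgt": [280464, 141297],
    "education": ["Some-college", "Bachelors"],
})

by_position = toy.iloc[:, 1:3]                # positions 1 and 2; stop 3 is excluded
by_label = toy.loc[:, "workclass":"fnlwgt"]   # both endpoint labels are included
by_list = toy[["workclass", "fnlwgt"]]        # explicit list of column names
```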

In [None]:
# DataFrame.corr() accepts these methods:
# - pearson (default) : standard correlation coefficient
# - kendall : Kendall Tau correlation coefficient
# - spearman : Spearman rank correlation

df.corr(numeric_only=True)

In [None]:
df.corr(method="spearman", numeric_only=True)
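
To give a feeling for how the methods differ, the sketch below compares Pearson and Spearman on invented data where `y` grows monotonically but not linearly with `x`. The frame `toy` and its columns are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data: y is a monotonic but non-linear function of x.
toy = pd.DataFrame({"x": [1, 2, 3, 4, 5]})
toy["y"] = toy["x"] ** 3
toy["label"] = list("abcde")   # non-numeric column, skipped by numeric_only=True

# Pearson measures linear association, so it stays below 1 here.
pearson = toy.corr(numeric_only=True).loc["x", "y"]

# Spearman works on ranks, so any strictly increasing relation gives 1.
spearman = toy.corr(method="spearman", numeric_only=True).loc["x", "y"]
```

A large gap between the two coefficients can hint at a monotonic relationship that is not linear.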