<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Loading-your-data" data-toc-modified-id="Loading-your-data-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Loading your data</a></span></li><li><span><a href="#Getting-an-overview" data-toc-modified-id="Getting-an-overview-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Getting an overview</a></span></li><li><span><a href="#Renaming-Columns" data-toc-modified-id="Renaming-Columns-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Renaming Columns</a></span></li><li><span><a href="#Replacing-all-occurrences-of-a-string-in-a-column" data-toc-modified-id="Replacing-all-occurrences-of-a-string-in-a-column-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Replacing all occurrences of a string in a column</a></span></li><li><span><a href="#Selecting-data-subsets" data-toc-modified-id="Selecting-data-subsets-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Selecting data subsets</a></span><ul class="toc-item"><li><span><a href="#Selecting-Columns-of-the-data" data-toc-modified-id="Selecting-Columns-of-the-data-5.1"><span class="toc-item-num">5.1&nbsp;&nbsp;</span>Selecting Columns of the data</a></span></li><li><span><a href="#Ex-1.1" data-toc-modified-id="Ex-1.1-5.2"><span class="toc-item-num">5.2&nbsp;&nbsp;</span>Ex 1.1</a></span></li><li><span><a href="#Iterator" data-toc-modified-id="Iterator-5.3"><span class="toc-item-num">5.3&nbsp;&nbsp;</span>Iterator</a></span></li><li><span><a href="#Selecting-rows-by-position" data-toc-modified-id="Selecting-rows-by-position-5.4"><span class="toc-item-num">5.4&nbsp;&nbsp;</span>Selecting rows by position</a></span></li><li><span><a href="#Selecting-rows-by-condition" data-toc-modified-id="Selecting-rows-by-condition-5.5"><span class="toc-item-num">5.5&nbsp;&nbsp;</span>Selecting rows by condition</a></span></li></ul></li><li><span><a href="#Sorting" data-toc-modified-id="Sorting-6"><span class="toc-item-num">6&nbsp;&nbsp;</span>Sorting</a></span></li><li><span><a 
href="#Ex-2.2" data-toc-modified-id="Ex-2.2-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Ex 2.2</a></span></li><li><span><a href="#Optional" data-toc-modified-id="Optional-8"><span class="toc-item-num">8&nbsp;&nbsp;</span>Optional</a></span></li></ul></div>

# Basic Data Analysis with Pandas

Pandas is a popular Python package for data science. It offers expressive and flexible data structures for data manipulation and analysis. Here we focus on one of these data structures: the dataframe.

A dataframe stores data in a rectangular grid that is easy to view and work with. Each row corresponds to the values of one instance, while each column is a vector containing the values of one specific variable across all instances. Because each column has its own data type, a row can contain values of different types, such as numeric, character, or logical.
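
As a minimal sketch of this structure, the made-up table below (it is not the dataset used in the rest of this notebook) has three rows and three columns, each column with its own type:

```python
import pandas as pd

# A tiny, made-up table: each row is one person, each column one variable
people = pd.DataFrame({
    "name": ["Ada", "Bob", "Cleo"],
    "age": [36, 41, 29],
    "employed": [True, False, True],
})

print(people.shape)   # (number of rows, number of columns)
print(people.dtypes)  # one data type per column
```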

In [None]:
# Import the pandas module for data analysis as alias pd
import pandas as pd

## Loading your data

There are a lot of supported data formats for reading (and writing) with pandas including csv, tsv, excel, hdf5, sas, stata, sql...  The documentation provides more details:
http://pandas.pydata.org/pandas-docs/stable/io.html

`read_csv` has several useful arguments, e.g. "sep" (default is ","), "header" (default is the first line), "on_bad_lines" (called "error_bad_lines" before pandas 1.3)...
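
To try these arguments without a file on disk, the sketch below reads made-up text through `io.StringIO`, which stands in for a file path. The column names passed to `names` are chosen for illustration only.

```python
import io
import pandas as pd

# Made-up, semicolon-separated data without a header line; "?" marks a missing value
raw = "39;State-gov;Bachelors\n50;?;HS-grad\n"

sample = pd.read_csv(
    io.StringIO(raw),   # any file-like object works in place of a path
    sep=";",            # the separator is not the default ","
    header=None,        # the first line is data, not column names
    names=["age", "workclass", "education"],
    na_values="?",      # treat "?" as NaN
)
print(sample)
print(sample["workclass"].isna().sum())  # number of missing workclass values
```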

In [None]:
# read the dataset which is in csv format
# Recognize "?" values as NA/NAN.
df = pd.read_csv("data/adult.csv", na_values="?")
# instead of path, you can pass a url to an online file

Let's see what we have now:

In [None]:
type(df)

In [None]:
df

## Getting an overview

In [None]:
df.info()

In [None]:
len(df)

In [None]:
df.columns

In [None]:
# values: Return an array representing the data in the Index object
df.columns.values

We can get some more details with the `dtypes` attribute:

In [None]:
df.dtypes

In [None]:
# get summary statistics of the numeric columns of the DataFrame
df.describe()

In [None]:
# Compute pairwise correlation of columns, excluding NA/null values.
df.corr(numeric_only=True)

## Renaming Columns

In [None]:
df.columns.values

In [None]:
df = df.rename(columns={"sex" : "gender"})
# df.rename(columns={"sex" : "gender"}, inplace=True)

In [None]:
df.columns.values

In [None]:
df2 = df.rename(columns={"gender" : "sex", "fnlwgt" : "weight"})

In [None]:
df2.columns.values

## Replacing all occurrences of a string in a column

In [None]:
# look at values in "education" column
df2

In [None]:
# assign the result back instead of calling replace(..., inplace=True) on a selected column,
# which may not modify df2 in recent pandas versions
df2["education"] = df2["education"].replace(["Bachelors", "HS-grad"], ["Bachelor", "Highschool"])
df2
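
A dictionary is an alternative way to write the same mapping. The sketch below uses a small made-up Series rather than the dataset; note that `replace` matches whole cell values, not substrings, and returns a new object.

```python
import pandas as pd

education = pd.Series(["Bachelors", "HS-grad", "Masters", "Bachelors"])

# A dict maps each old value to its new value; unlisted values stay unchanged
cleaned = education.replace({"Bachelors": "Bachelor", "HS-grad": "Highschool"})

print(cleaned.tolist())
print(education.tolist())  # the original Series is not modified
```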

## Selecting data subsets

### Selecting Columns of the data

In [None]:
# Selecting a specific column of the data:
age = df['age']
age

In [None]:
# an alternative syntax for selecting a single column
df.age

In [None]:
# Selecting a single column returns a pandas.Series object
type(age)

In [None]:
# We can compute basically any univariate statistic from a series
print("Mean:", age.mean())
print("Standard deviation:", age.std())
print("Median:", age.median())
print("Maximum value:", age.max())
print("Index of first occurrence of maximum value:", age.idxmax())
print("Mode:", age.mode())
print("25-percentile:", age.quantile(0.25))

In [None]:
# we could simply convert this to a list object, but this is something we rarely ever need
# list(age)

In [None]:
df.gender

In [None]:
df.gender.unique()

In [None]:
df.gender.value_counts()

In [None]:
# array of values in each row
df.values

In [None]:
# we can also select multiple columns using a list of column names
df2 = df[['age', 'sex', 'education']]
type(df2)

We get a `KeyError` because we renamed the column 'sex' to 'gender'.
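
The same situation can be reproduced on a small made-up dataframe. This sketch also shows one possible way to check for missing column names before selecting; the variable names are illustrative.

```python
import pandas as pd

demo = pd.DataFrame({"age": [39, 50], "gender": ["Male", "Female"]})

# Selecting a column name that does not exist raises a KeyError
try:
    demo[["age", "sex"]]
except KeyError as err:
    message = str(err)
    print("KeyError:", message)

# Checking the requested names against demo.columns avoids the exception
wanted = ["age", "sex"]
missing = [name for name in wanted if name not in demo.columns]
print("Missing columns:", missing)
```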

In [None]:
df2 = df[['age', 'gender', 'education']]
type(df2)

In [None]:
df2

### Ex 1
1. Load "adult.csv" into dataframe named adult_df (recognize "?" values as NA/NAN)
2. Get a subset of the dataframe with columns "age", "sex", "education", "hours-per-week", "capital-gain"
3. Rename column "capital-gain" to "capital_gain"
4. Print the column names of adult_df
5. Print number of different values for the attribute education
6. Print the mean "working time per week"
7. Print the max "capital_gain"

In [None]:
# %load "21_data_exploration_ex_1.py"

### Iterator

In [None]:
# iterate over column names
column_names = df.columns.values
for column_name in column_names:
    print(column_name)

In [None]:
# iterate over rows as (index, Series) pairs.
for i, row_data in df.iterrows():
    print(i, type(row_data))
    print(row_data)
    print("-------")
    print(row_data.education)
    break

In [None]:
# Iterate over DataFrame rows as namedtuples.
for row_data in df.itertuples():
    print(row_data[0], type(row_data))
    print(row_data)
    print("-------")
    print(row_data.education)
    break

### Selecting rows by position

In [None]:
# get the first 5 rows
# note the leftmost column: it is the index
df.head()

In [None]:
# get the first 4 rows
df.head(4)

In [None]:
# get the last 3 rows
df.tail(3)

In [None]:
# shows a random sample of rows
df_sample = df.sample(3)
df_sample

In [None]:
# Selecting a specific single row 
# iloc (integer locate) works on the positions in your index (selection by position)
# iloc is primarily integer position based (from 0 to length-1 of the axis). 
# So it uses the position of the row in the index.

# get the row with index 2 (select a row by position)
# Note that index starts with 0
df.iloc[2]

In [None]:
type(df.iloc[2])

In [None]:
df_sample.iloc[1]

In [None]:
# select a specific range of rows
df.iloc[10:20]

In [None]:
df2 = df.iloc[10:20]
df2

In [None]:
df2.iloc[5]

In [None]:
# The loc function uses the label in the index, not the integer position along the index.
# .loc[] works on labels of your index
# select a row by label
df2.loc[15]

In [None]:
df2.iloc[5].equals(df2.loc[15])

In [None]:
# df2 has no row with index label 5, so this raises a KeyError
df2.loc[5]

In [None]:
# reset the current index so that the labels start from 0 again
# without "drop=True", the old index is inserted as a new column
df2 = df2.reset_index(drop=True)
df2

In [None]:
df2.loc[5]

In [None]:
df2.iloc[5].equals(df2.loc[5])
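
The difference between position and label can be summarized on a small made-up dataframe. This is only a sketch of the same steps as above, with hypothetical variable names:

```python
import pandas as pd

demo = pd.DataFrame({"value": list("abcdefgh")})  # index labels 0..7
part = demo.iloc[3:6]                             # keeps the labels 3, 4, 5

by_position = part["value"].iloc[0]  # first row of the slice, by position
by_label = part["value"].loc[3]      # the same row, by its label
print(by_position, by_label)
print(0 in part.index)               # label 0 is not part of the slice

# After resetting the index, labels and positions coincide again
renumbered = part.reset_index(drop=True)
print(renumbered["value"].loc[0])
```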

### Selecting rows by condition

In [None]:
h = df.head()

As a reminder, given a list of strings, [] will select columns of the data.

In [None]:
h[["age", "gender"]]

If it is used with lists of booleans, then rows are selected instead!

In [None]:
# select row with index 0, 1 and 4
h[[True, True, False, False, True]]

Slight excursion: We can do many computations with series just as with single numbers:

In [None]:
# h.capital-gain does not work (Python reads it as h.capital - gain); use h["capital-gain"] instead
h.age

In [None]:
# floor-divide by 10, then multiply by 10 (this drops the units digit)
# the computation is applied to each element of the series
h.age // 10 * 10

In [None]:
h.age < 40

In [None]:
h[[True, False, True, False, True]]

Instead of selecting rows manually, now let's do it programmatically.

In [None]:
# now... this can be very useful
# get the people younger than 40
h[h["age"] < 40]

In [None]:
len(df[df.age < 40])

In [None]:
young = df[df.age < 40]
len(young)

In [None]:
len(df[df.age < 20])

In [None]:
females = df[df.gender == "Female"]
females

In [None]:
# to combine multiple conditions use &
(df.age < 40) & (df.gender =="Female")

In [None]:
young_females = df[(df.age < 40) & (df.gender == "Female")]
young_females

In [None]:
young_or_female = df[(df.age < 40) | (df.gender == "Female")]
young_or_female

Note: **&** and **|** are bitwise operators. On plain integers they work on the individual binary digits; on boolean series they are applied element by element.

- & -> AND: Sets each bit to 1 if both bits are 1
- | -> OR: Sets each bit to 1 if at least one of the two bits is 1
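
A short sketch on made-up data shows the element-wise behaviour. Each condition needs its own parentheses because `&` and `|` bind more tightly than comparisons such as `<` and `==`; Python's `and` / `or` cannot be used here, since they raise a `ValueError` on a Series.

```python
import pandas as pd

demo = pd.DataFrame({
    "age": [25, 52, 31, 60],
    "gender": ["Female", "Female", "Male", "Male"],
})

is_young = demo["age"] < 40
is_female = demo["gender"] == "Female"

both = is_young & is_female    # True only where both conditions hold
either = is_young | is_female  # True where at least one condition holds
neither = ~either              # ~ negates element by element

print(both.tolist())
print(either.tolist())
print(neither.tolist())
```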

## Sorting

In [None]:
df.sort_values("age", inplace=True)

In [None]:
df

In [None]:
df.sort_values("age", ascending=False)

In [None]:
# df is still sorted in ascending order because the previous sort was not in place
df

In [None]:
df.sort_values(["age", "hours-per-week"]).head(25)

In [None]:
df.sort_values(["age","hours-per-week"], ascending=False).head(25)
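
`ascending` also accepts a list with one flag per sort column. The sketch below, on made-up data, sorts by age in ascending order and by hours in descending order within each age:

```python
import pandas as pd

demo = pd.DataFrame({
    "age": [30, 25, 30, 25],
    "hours-per-week": [40, 20, 50, 45],
})

# One flag per sort column: age ascending, hours descending within each age
ordered = demo.sort_values(["age", "hours-per-week"], ascending=[True, False])
print(ordered)

# sort_values returns a new DataFrame; demo keeps its original row order
print(demo.index.tolist())
```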

## Ex 2
1. Use adult_df
2. Get all persons with a Bachelor degree as their highest degree into 'bachelors' dataframe
3. Print the number of those persons
4. Print the sum of their capital_gain
5. How many of those persons are male and how many are female?
6. Sort them according to their capital_gain and age in descending order and save in the same object
7. Print the first 10 of those persons who are between 20 and 40 years old

In [None]:
# %load "21_data_exploration_ex_2.py"
# ####### step 2
bachelors = adult_df[adult_df["education"] == "Bachelors"]

print("####### step 3")
print("Number of persons with a Bachelor degree as their highest degree:", len(bachelors))

print("####### step 4")
print("Sum of their capital_gain: ", bachelors.capital_gain.sum())

print("####### step 5")
print(bachelors.sex.value_counts())
print('-------OR-------')
print("Female: ", len(bachelors[bachelors["sex"] == "Female"]))
print("Male: ", len(bachelors[bachelors["sex"] == "Male"]))

# ####### step 6
bachelors = bachelors.sort_values(["capital_gain", "age"], ascending=False)

print("####### step 7")
bachelors[(bachelors["age"] >= 20) & (bachelors["age"] <= 40)].head(10)

## Optional

In [None]:
df.head(10)

In [None]:
# select the row at position 9 (df was sorted, so this is not necessarily the row labelled 9)
df.iloc[9]

In [None]:
# select row 9 and column 1 (workclass)
# remember that index starts with 0
df.iloc[9, 1]

In [None]:
# select the row at position 9 with all of its columns
df.iloc[9, :]

In [None]:
df.loc[9, "workclass"]

In [None]:
# select column "workclass"
df.loc[:, "workclass"]

In [None]:
# it is same as df.workclass
df.loc[:, "workclass"].equals(df.workclass)

In [None]:
df[["workclass", "fnlwgt"]]

In [None]:
df.iloc[:, 1:3]

In [None]:
# DataFrame.corr() accepts these methods:
# - pearson (default) : standard correlation coefficient
# - kendall : Kendall Tau correlation coefficient
# - spearman : Spearman rank correlation

df.corr(numeric_only=True)

In [None]:
df.corr(method="spearman", numeric_only=True)
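
To see how the methods differ, the sketch below uses made-up data in which `y` grows monotonically but not linearly with `x`. The Pearson coefficient stays below 1, while the rank-based Spearman coefficient reaches 1 because the ordering of the values is identical.

```python
import pandas as pd

# Made-up data: y increases with x, but the relationship is cubic, not linear
demo = pd.DataFrame({"x": [1, 2, 3, 4, 5]})
demo["y"] = demo["x"] ** 3

pearson = demo.corr(method="pearson").loc["x", "y"]    # measures linear association
spearman = demo.corr(method="spearman").loc["x", "y"]  # measures monotonic association

print("Pearson:", pearson)
print("Spearman:", spearman)
```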