# Getting Started with pandas

Pandas contains data structures and data manipulation tools designed to make data cleaning and analysis fast and convenient in Python. pandas is often used in tandem with numerical computing tools like NumPy and SciPy, analytical libraries like statsmodels and scikit-learn, and data visualization libraries like matplotlib. pandas adopts significant parts of NumPy's idiomatic style of array-based computing, especially array-based functions and a preference for data processing without for loops.

Pandas was inspired by R's data.frame and therefore offers similar functionality, but with a Python flavour.

In the Python world the following import conventions for NumPy and pandas are used:

In [None]:
import numpy as np

import pandas as pd

Thus, whenever you see `pd.` in code, it’s referring to pandas. 

You may also find it easier to import Series and DataFrame into the local namespace since they are so frequently used:

In [None]:
from pandas import Series, DataFrame

### Data Structures

To get started with pandas, you will need to get comfortable with its two workhorse data structures: _Series_ and _DataFrame_.

#### Series

A Series is a one-dimensional array-like object containing a sequence of values of the same type (the available types are similar to NumPy's) and an associated array of data labels, called its index. It is similar to a one-column data.frame in R. The simplest Series is formed from only an array of data:

In [None]:
data = pd.Series([4, 7, -5, 3])

data

The string representation of a Series displayed interactively shows the index on the left and the values on the right. Since we did not specify an index for the data, a default one consisting of the integers `0` through `N - 1` (where `N` is the length of the data) is created. You can get the array representation and index object of the Series via its `array` and `index` attributes, respectively:

In [None]:
data.array

In [None]:
data.index

Often, you'll want to create a Series with an index identifying each data point with a label:

In [None]:
data2 = pd.Series([4, 7, -5, 3], index=["d", "b", "a", "c"])
data2

In [None]:
data2.index

Compared with NumPy arrays, you can use labels in the index when selecting single values or a set of values:

In [None]:
data2["a"]

In [None]:
data2["d"] = 6

data2[["c", "a", "d"]]

Here `["c", "a", "d"]` is interpreted as a list of indices, even though it contains strings instead of integers.

A Series’s index can be altered in place by assignment:

In [None]:
print(data)
data.index = ["Bob", "Steve", "Jeff", "Ryan"]
print(data)

Using NumPy functions or NumPy-like operations, such as filtering with a Boolean array, scalar multiplication, or applying math functions, will preserve the index-value link:

In [None]:
data2[data2 > 0]

In [None]:
data2 * 2

In [None]:
np.exp(data2) # you can use NumPy functions on pandas objects

Another way to think about a Series is as a fixed-length, ordered dictionary, as it is a mapping of index values to data values. It can be used in many contexts where you might use a dictionary:

In [None]:
print("b" in data2)
print("e" in data2)

Should you have data contained in a Python dictionary, you can create a Series from it by passing the dictionary:

In [None]:
sdata = {"Poland": 35000, "Germany": 81000, "France": 71000, "Ireland": 5000}

data3 = pd.Series(sdata)
data3

A Series can be converted back to a dictionary with its `to_dict` method:

In [None]:
data3.to_dict()

When you are only passing a dictionary, the index in the resulting Series will respect the order of the keys according to the dictionary's `keys` method, which depends on the key insertion order. You can override this by passing an index with the dictionary keys in the order you want them to appear in the resulting Series:

In [None]:
country = ["Italy", "Poland", "France", "Germany"]

data4 = pd.Series(sdata, index=country)

data4

Here, three values found in `sdata` were placed in the appropriate locations, but since no value for `"Italy"` was found, it appears as `NaN` (Not a Number), which pandas uses to mark missing or `NA` values. Since `"Ireland"` was not included in `country`, it is excluded from the resulting object.

The `isna` and `notna` functions in pandas should be used to detect missing data:

In [None]:
pd.isna(data4)

In [None]:
pd.notna(data4)

Series also has these as methods:

In [None]:
data4.isna()

A useful Series feature for many applications is that it automatically aligns by index label in arithmetic operations. You can think about this as being similar to a join operation:

In [None]:
print(data3)
print(data4)
print(data3 + data4)

Both the Series object itself and its index have a `name` attribute, which integrates with other areas of pandas functionality:

In [None]:
data4.name = "population"
data4.index.name = "country"

data4
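One place where these names show up is when a Series is turned into a table. The sketch below rebuilds the same Series under a different variable name (`population` is only an illustrative name) and uses `reset_index`, which moves the index into a regular column:

```python
import pandas as pd

sdata = {"Poland": 35000, "Germany": 81000, "France": 71000, "Ireland": 5000}
population = pd.Series(sdata, index=["Italy", "Poland", "France", "Germany"])
population.name = "population"
population.index.name = "country"

# reset_index turns the index into an ordinary column;
# the index name and the Series name become the column labels
as_frame = population.reset_index()
print(as_frame.columns.tolist())
```

Without the two names, the resulting columns would get generic labels instead.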

#### DataFrame

A DataFrame is a Python version of R's data.frame. The similarity of the names is not accidental :)

There are many ways to construct a DataFrame, though one of the most common is from a dictionary of equal-length lists or NumPy arrays:

In [None]:
data = {"country": ["Cyprus", "Cyprus", "Cyprus", "Greece", "Greece", "Greece"],
        "year": [2000, 2001, 2002, 2001, 2002, 2003],
        "pop": [1.5, 1.7, 3.6, 2.4, 2.9, 3.2]}
frame = pd.DataFrame(data)

The resulting DataFrame will have its index assigned automatically, as with Series, and the columns are placed according to the order of the keys in `data` (which depends on their insertion order in the dictionary):

In [None]:
frame

As in R, `head` selects the first rows and `tail` the last rows (five by default):

In [None]:
frame.head()

In [None]:
frame.tail()

Similarly to `select` from R's dplyr package, if you specify a sequence of columns, the DataFrame’s columns will be arranged in that order:

In [None]:
pd.DataFrame(data, columns=["year", "country", "pop"])

If you pass a column that isn’t contained in the dictionary, it will appear with missing values in the result:

In [None]:
frame2 = pd.DataFrame(data, columns=["year", "country", "pop", "debt"])

frame2


In [None]:
frame2.columns

A column in a DataFrame can be retrieved as a Series either by dictionary-like notation or by using the dot attribute notation:

In [None]:
frame2["country"] # Don't forget about quotation marks!

In [None]:
frame2.country

Because the column name is passed as a quoted string, the first method works for any column name.

`frame2.column` works only when the column name is a valid Python variable name and does not conflict with any of the method names in DataFrame. For example, if a column's name contains whitespace or symbols other than underscores, it cannot be accessed with the dot attribute method.
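The `pop` column used in this notebook happens to be a good illustration of such a conflict, because DataFrame also has a method called `pop`. The following is a small sketch on a throwaway frame (`frame_demo` and the column name `"pop (millions)"` are made up for the example):

```python
import pandas as pd

frame_demo = pd.DataFrame({"country": ["Cyprus", "Greece"],
                           "pop": [1.5, 2.4],
                           "pop (millions)": [1.5, 2.4]})

by_key = frame_demo["pop"]   # bracket notation returns the column as a Series
by_attr = frame_demo.pop     # dot notation finds the DataFrame.pop method instead

print(type(by_key).__name__)
print(callable(by_attr))

# A name with spaces or parentheses can only be reached with brackets
spaced = frame_demo["pop (millions)"]
```

When in doubt, the bracket notation is the safer choice.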

Columns can be modified by assignment. For example, the empty `debt` column could be assigned a scalar value or an array of values:

In [None]:
frame2["debt"] = 16.5

frame2

In [None]:
frame2["debt"] = np.arange(6.)

frame2

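When you are assigning lists or arrays to a column, the value’s length must match the length of the DataFrame. If you assign a Series, its labels will be realigned exactly to the DataFrame’s index, inserting missing values in any index values not present. In the first example below, none of the labels match the DataFrame's integer index, so the whole column becomes `NaN`; in the second, the labels `0` to `2` match and the remaining rows are filled with `NaN`: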

In [None]:
val = pd.Series([-1.2, -1.5, -1.7], index=["two", "four", "five"])
val

In [None]:
frame2["debt"] = val
frame2

In [None]:
val = pd.Series([-1.2, -1.5, -1.7])
val

In [None]:
frame2["debt"] = val
frame2

Assigning a column that doesn’t exist will create a new column.

The `del` keyword will delete columns like with a dictionary. As an example, I first add a new column of Boolean values where the `country` column equals `"Cyprus"`:

In [None]:
frame2["eastern"] = frame2["country"] == "Cyprus"  # True where the "country" value equals "Cyprus", False otherwise

frame2

New columns cannot be created with the `frame2.eastern` dot attribute notation.
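A short sketch of what happens instead, using a throwaway frame (`frame_demo` is an illustrative name). As far as I know, pandas stores the value as a plain Python attribute and issues a `UserWarning`, which is silenced here to keep the output clean:

```python
import warnings

import pandas as pd

frame_demo = pd.DataFrame({"country": ["Cyprus", "Greece"], "year": [2000, 2001]})

with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # hide the warning pandas raises for this pattern
    frame_demo.eastern = frame_demo["country"] == "Cyprus"

# The attribute exists on the object, but no column was added
created_by_attr = "eastern" in frame_demo.columns
print(created_by_attr)

# Bracket assignment does create the column
frame_demo["eastern"] = frame_demo["country"] == "Cyprus"
created_by_key = "eastern" in frame_demo.columns
print(created_by_key)
```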

The `del` statement can then be used to remove this column:

In [None]:
del frame2["eastern"]  # del is a statement, so no parentheses after it

frame2.columns

You can transpose the DataFrame (swap rows and columns) with similar syntax to a NumPy array:

In [None]:
frame2.T

**Possible data inputs to the DataFrame constructor:**

 Type | Notes 
 ---- | -----
 2D ndarray | A matrix of data, passing optional row and column labels 
 Dictionary of arrays, lists, or tuples | Each sequence becomes a column in the DataFrame; all sequences must be the same length
 NumPy structured/record array | Treated as the “dictionary of arrays” case
 Dictionary of Series | Each value becomes a column; indexes from each Series are unioned together to form the result’s row index if no explicit index is passed
 Dictionary of dictionaries | Each inner dictionary becomes a column; keys are unioned to form the row index as in the “dictionary of Series” case
 List of dictionaries or Series | Each item becomes a row in the DataFrame; unions of dictionary keys or Series indexes become the DataFrame’s column labels
 List of lists or tuples | Treated as the “2D ndarray” case
 Another DataFrame | The DataFrame’s indexes are used unless different ones are passed
 NumPy MaskedArray | Like the “2D ndarray” case except masked values are missing in the DataFrame result
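Two of the inputs from the table can be sketched as follows. The variable names are illustrative, and the second record deliberately lacks a `"pop"` key to show how missing entries are handled:

```python
import pandas as pd

# Dictionary of dictionaries: outer keys become columns, inner keys the row index
nested = {"Cyprus": {2000: 1.5, 2001: 1.7},
          "Greece": {2001: 2.4, 2002: 2.9}}
by_column = pd.DataFrame(nested)
print(by_column)

# List of dictionaries: each item becomes a row, the keys become column labels
records = [{"country": "Cyprus", "year": 2000, "pop": 1.5},
           {"country": "Greece", "year": 2001}]  # "pop" left out on purpose
by_row = pd.DataFrame(records)
print(by_row)
```

In both cases, combinations that have no value in the input are filled with `NaN`.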

## Reindexing

An important method on pandas objects is `reindex`, which means to create a new object with the values rearranged to align with the new index. Consider an example:

In [None]:
data = pd.Series([4.5, 7.2, -5.3, 3.6], index=["d", "b", "a", "c"])

data

Calling `reindex` on this Series rearranges the data according to the new index, introducing missing values if any index values were not already present:

In [None]:
data1 = data.reindex(["a", "b", "c", "d", "e"])

data1
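Because `reindex` creates a new object, the Series it is called on stays as it was. A minimal check, with illustrative variable names:

```python
import pandas as pd

original = pd.Series([4.5, 7.2, -5.3, 3.6], index=["d", "b", "a", "c"])
reindexed = original.reindex(["a", "b", "c", "d", "e"])

print(original.index.tolist())   # the original order is untouched
print(reindexed.index.tolist())  # the new object follows the requested order
```

To keep the rearranged version, assign the result to a name, as the cell above does with `data1`.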

For ordered data like time series, you may want to do some interpolation or filling of values when reindexing. The `method` option allows us to do this, using a method such as `ffill`, which forward-fills the values:

In [None]:
data2 = pd.Series(["blue", "purple", "yellow"], index=[0, 2, 4])

data2

In [None]:
data2.reindex(np.arange(6), method="ffill")

With DataFrame, `reindex` can alter the **(row) index, columns, or both**. When passed only a sequence, it reindexes the rows in the result:

In [None]:
frame = pd.DataFrame(np.arange(9).reshape((3, 3)),
                     index=["a", "c", "d"],
                     columns=["Poland", "Czechia", "Italy"])

frame

In [None]:
frame2 = frame.reindex(index=["a", "b", "c", "d"])
frame2

The columns can be reindexed with the `columns` keyword:

In [None]:
frame.reindex(columns=["Czechia", "Poland", "Italy"])

Another way to reindex a particular axis is to pass the new axis labels as a positional argument and then specify the axis to reindex with the `axis` keyword:

In [None]:
frame.reindex(["Czechia", "Poland", "Italy"], axis="columns")

In [None]:
# For additional parameters of reindex check
frame.reindex?

You can also reindex by using the `loc` operator, and many users prefer to always do it this way. This works only if all of the new index labels already exist in the DataFrame (whereas reindex will insert missing data for new labels):

In [None]:
frame.loc[["a", "d", "c"], ["Italy", "Poland"]]
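The difference between the two approaches can be sketched like this. The frame is rebuilt under the illustrative name `frame_demo`, and the example assumes a reasonably recent pandas version, where `loc` raises a `KeyError` for a list containing an unknown label:

```python
import numpy as np
import pandas as pd

frame_demo = pd.DataFrame(np.arange(9).reshape((3, 3)),
                          index=["a", "c", "d"],
                          columns=["Poland", "Czechia", "Italy"])

# "b" is not in the index, so loc refuses the request
try:
    frame_demo.loc[["a", "b", "c"]]
    loc_failed = False
except KeyError:
    loc_failed = True
print(loc_failed)

# reindex accepts the same labels and fills the new row with NaN
via_reindex = frame_demo.reindex(["a", "b", "c"])
print(via_reindex)
```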

## Indexing, Selection, and Filtering

Series indexing `(data[...])` works analogously to NumPy array indexing, except you can use the Series’s index values instead of only integers. Here are some examples of this:

In [None]:
data = pd.Series(np.arange(4.), index=["a", "b", "c", "d"])
data

In [None]:
print(data["b"])
print(data[1])  # integer used as a position; deprecated in recent pandas, prefer data.iloc[1]
print(data[2:4])
print(data[["b", "a", "d"]])
print(data[data < 3])


While you can select data by label this way, the preferred way to select index values is with the special `loc` operator:

In [None]:
data.loc[["b", "a", "d"]]

The reason to prefer `loc` is because of the different treatment of integers when indexing with `[]`. Regular `[]`-based indexing will treat integers as labels if the index contains integers, so the behavior differs depending on the data type of the index. For example:

In [None]:
data1 = pd.Series([1, 2, 3], index=[2, 0, 1])
data2 = pd.Series([1, 2, 3], index=["a", "b", "c"])

data1[[0, 1, 2]]

In [None]:
data2[[0, 1, 2]]

When using `loc`, the expression `data.loc[[0, 1, 2]]` will fail when the index does not contain integers:

In [None]:
data2.loc[[0, 1, 2]]

Since the `loc` operator indexes exclusively with labels, there is also an `iloc` operator that indexes exclusively with integer positions, so it works consistently whether or not the index contains integers:

In [None]:
data1.iloc[[0, 1, 2]]

In [None]:
data2.iloc[[0, 1, 2]]

You can also slice with labels, but it works differently from normal Python slicing in that the **endpoint is inclusive**:

In [None]:
data2.loc["b":"c"]
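To make the contrast with positional slicing concrete, here is a small side-by-side sketch on a fresh Series (`letters` is an illustrative name):

```python
import pandas as pd

letters = pd.Series([1, 2, 3], index=["a", "b", "c"])

by_label = letters.loc["b":"c"]   # label slice: the endpoint "c" is included
by_position = letters.iloc[1:2]   # position slice: stops before position 2

print(by_label.tolist())
print(by_position.tolist())
```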

Assigning values using these methods modifies the corresponding section of the Series:

In [None]:
data2.loc["b":"c"] = 5
data2

A common beginner error is to call `loc` or `iloc` like functions rather than "indexing into" them with square brackets.
**The square bracket notation is used to enable slice operations and to allow for indexing on multiple axes with DataFrame objects.**

Indexing into a DataFrame retrieves one or more columns either with a single value or sequence:

In [None]:
frame

In [None]:
frame.iloc[:2, 0:3]

In [None]:
frame.loc[:, ["Czechia","Italy"]]

In [None]:
frame.loc["a":"d", ["Czechia","Italy"]]

**Boolean Series can be used with `loc` but not `iloc`** (`iloc` accepts only a plain Boolean array, such as a NumPy array):

In [None]:
frame.loc[frame.Italy >= 5]
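To be precise about the restriction: as far as I understand current pandas, it is a Boolean Series that `iloc` rejects, while a plain NumPy Boolean array is accepted. The sketch below rebuilds the frame as `frame_demo` (an illustrative name) and converts the mask with `to_numpy` before handing it to `iloc`:

```python
import numpy as np
import pandas as pd

frame_demo = pd.DataFrame(np.arange(9).reshape((3, 3)),
                          index=["a", "c", "d"],
                          columns=["Poland", "Czechia", "Italy"])

mask = frame_demo["Italy"] >= 5   # Boolean Series, labelled like the rows

by_loc = frame_demo.loc[mask]     # loc aligns the mask on the row labels

# iloc works by position only, so the labels are dropped first;
# passing `mask` itself to iloc is expected to raise an error
by_iloc = frame_demo.iloc[mask.to_numpy()]

print(by_loc)
print(by_iloc)
```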

There are many ways to select and rearrange the data contained in a pandas object. The table below gives a short summary of many of them.

 Type | Notes 
 ---- | -----
 `df[column]` | Select single column or sequence of columns from the DataFrame; special case conveniences: Boolean array (filter rows), slice (slice rows), or Boolean DataFrame (set values based on some criterion)
 `df.loc[rows]` | Select single row or subset of rows from the DataFrame by label
 `df.loc[:, cols]` | Select single column or subset of columns by label
 `df.loc[rows, cols]` | Select both row(s) and column(s) by label
 `df.iloc[rows]` | Select single row or subset of rows from the DataFrame by integer position
 `df.iloc[:, cols]` | Select single column or subset of columns by integer position
 `df.iloc[rows, cols]` | Select both row(s) and column(s) by integer position
 `df.at[row, col]` | Select a single scalar value by row and column label
 `df.iat[row, col]` | Select a single scalar value by row and column position (integers)
 `reindex` method | Select either rows or columns by labels
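The scalar accessors `at` and `iat` from the table are not demonstrated above, so here is a brief sketch on the same 3×3 frame, rebuilt as `frame_demo` (an illustrative name):

```python
import numpy as np
import pandas as pd

frame_demo = pd.DataFrame(np.arange(9).reshape((3, 3)),
                          index=["a", "c", "d"],
                          columns=["Poland", "Czechia", "Italy"])

value_by_label = frame_demo.at["c", "Italy"]   # row label and column label
value_by_position = frame_demo.iat[1, 2]       # second row, third column

print(value_by_label, value_by_position)

# Both accessors can also be used to set a single value
frame_demo.at["c", "Italy"] = 50
print(frame_demo.loc["c", "Italy"])
```

Both accessors address exactly one cell here because the row labels are unique.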

# Exercises

## Series

Create a pandas Series consisting of the values "Spain", "Portugal", "Greece", "Cyprus". Name it `countries`.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()
countries

Change the row names (index) to the first letters of the countries. Use lowercase letters.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()
countries

Create a new Series called `s` with the numbers 5, 0, 10, -2 and the same index as in `countries`.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()
s

Select the values that are less than or equal to 0.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

Check that `s` doesn't contain any `NA` or `NaN` values.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

## DataFrame

In [None]:
pov = pd.read_csv("https://github.com/IwoA/PRPT/raw/main/share-of-population-in-extreme-poverty.csv")

Show the first 5 rows of the `pov` DataFrame.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

Show the last 5 rows of the `pov` DataFrame.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

Put 'Code' before 'Entity'

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

What are the column names now?

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

Show only the first five rows of the column `$2.15 a day - share of population below poverty line` in one line of code.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

Add a column 'nrow' with the row number, starting from 1. You can use the NumPy function `arange`.

Show the last 5 rows.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

Create a column ">50%" with True values where the share of population below the poverty line is greater than or equal to 50.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()
pov

Remove the 'nrow' column

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

Reindex the columns in the following order: Entity, Year, >50%, share of population below poverty line. Use the `reindex` method.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

Does the `reindex` method change the DataFrame?

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

Show only the rows for Poland and the column Year. Hint: first create an index with values from the 'Entity' column.

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

What countries are at positions between 100 and 110?

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

Check when Egypt had >50% of its population below the poverty line. Use the `pov.at` accessor.
Hint: You need to have an index with values from the 'Entity' column. The result should be a Boolean Series.

Reminder: You can always check the help for a given command using `?` after it, like `pov.at?`

In [None]:
# YOUR CODE HERE
raise NotImplementedError()

In [None]:
pov.at["Zambia", ">50%"]

In [None]:
pov.loc[pov["Entity"]=="Zambia", ["Year",">50%"]] #logic is following: show