# Introduction to Pandas

* ``pandas`` is an open-source library

* ``pandas`` is a Python library for data analysis

* To learn more, see the [documentation](https://pandas.pydata.org/pandas-docs/stable/).

## Import ``pandas`` and load some data

We start by importing the package ``pandas`` and loading the data set. It contains temperature data from Uppsala, taken from SMHI's open data archive.

In [None]:
import pandas as pd
temperatures_df = pd.read_csv('Data/SMHI.csv',delimiter=';')
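To try the loading step without the data file, here is a minimal sketch that reads a semicolon-separated table from an in-memory string. The column names are the ones used later in this notebook, but the rows are made up and the real file may contain additional columns.

```python
import io

import pandas as pd

# Hypothetical sample imitating a semicolon-separated export.
# The values are invented; only the column names come from this notebook.
sample_csv = io.StringIO(
    "Datum;Tid (UTC);Lufttemperatur;Kvalitet\n"
    "2018-10-16;06:00:00;4.2;G\n"
    "2018-10-16;12:00:00;9.8;G\n"
    "2018-10-17;06:00:00;3.1;Y\n"
)

# sep=';' does the same job as delimiter=';' (delimiter is an alias for sep).
sample_df = pd.read_csv(sample_csv, sep=";")

print(sample_df.shape)
print(sample_df.dtypes)
```

Without the separator argument, ``read_csv`` would assume commas and read each line as a single column.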

## Looking at data

In [None]:
temperatures_df.head(2)

In [None]:
temperatures_df.tail(10)

To get an overview of the DataFrame we can use ``df.info()``, which lists each column's name, data type, and number of non-null values.

In [None]:
temperatures_df.info()

To list all column names we can use the attribute ``.columns``.

In [None]:
temperatures_df.columns

## Selection of data

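To get a single column from a DataFrame we can use square brackets, ``[`` and ``]``, with the name of the column. If the column name is a valid Python identifier, attribute access (``df.Kvalitet``) works as well.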

In [None]:
temperatures_df['Kvalitet'].head(2)

In [None]:
temperatures_df.Kvalitet.head(2)

Selecting a single column returns a Series rather than a DataFrame. The difference does not always matter, but it is good to be aware of.

We can extract several columns with ``df[['col1', 'col2']]`` (note the double brackets!).

In [None]:
temperatures_df[['Datum', 'Lufttemperatur']].head(10)
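The Series/DataFrame distinction can be checked directly. The sketch below uses a small made-up frame (not the SMHI data) to show that single brackets give a Series while double brackets keep a DataFrame, even for one column.

```python
import pandas as pd

# Hypothetical miniature frame; the values are invented for illustration.
df = pd.DataFrame(
    {"Lufttemperatur": [4.2, 9.8, 3.1], "Kvalitet": ["G", "G", "Y"]}
)

single = df["Kvalitet"]      # single brackets: a Series
as_frame = df[["Kvalitet"]]  # double brackets: a one-column DataFrame

print(type(single).__name__)
print(type(as_frame).__name__)
print(single.shape, as_frame.shape)
```

The one-dimensional shape of the Series versus the two-dimensional shape of the DataFrame is usually where the difference becomes noticeable.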

Rows can be selected using ``df.iloc[]`` (by integer position) or ``df.loc[]`` (by index label). For example, to select the first row you can use ``df.iloc[0,:]``, and to select the value in the first row and first column you can use ``df.iloc[0,0]`` (indexing in Python starts at 0).

In [None]:
temperatures_df.iloc[0,:]

In [None]:
temperatures_df.iloc[0,0]

Select a range of rows (the end position, here 12, is not included):

In [None]:
temperatures_df.iloc[10:12, :]
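With the default index (0, 1, 2, ...), ``iloc`` and ``loc`` often look interchangeable. The following sketch uses a made-up frame whose index labels deliberately differ from the row positions to show where the two diverge.

```python
import pandas as pd

# Hypothetical frame; the index labels are chosen to differ from positions.
df = pd.DataFrame(
    {"Lufttemperatur": [4.2, 9.8, 3.1, 7.5]},
    index=[10, 20, 30, 40],
)

by_position = df.iloc[0:2]  # positions 0 and 1; the end is exclusive
by_label = df.loc[10:30]    # labels 10 through 30; the end is inclusive

print(by_position)
print(by_label)
```

Note that a ``loc`` slice includes its end label, whereas an ``iloc`` slice follows the usual Python convention and excludes the end position.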

## Data exploration

``pandas`` provides many functions for exploring data. More can be found in the documentation, but let us look at ``df.describe()``, which computes descriptive statistics for the numerical columns of the DataFrame, excluding "NaN" (missing) values.

In [None]:
temperatures_df.describe()

We can also compute individual statistics directly:

* ``df.mean()`` - returns the mean of each column
* ``df.corr()`` - returns the pairwise correlation between columns in a DataFrame
* ``df.count()`` - returns the number of non-null values in each DataFrame column
* ``df.max()`` - returns the highest value in each column
* ``df.min()`` - returns the lowest value in each column
* ``df.median()`` - returns the median of each column
* ``df.std()`` - returns the standard deviation of each column
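A small sketch of the methods listed above, using an invented frame with one missing value and one text column. Depending on the pandas version, calling ``df.mean()`` on a frame that contains text columns may raise an error, so the example passes ``numeric_only=True`` to restrict the calculation to numerical columns.

```python
import pandas as pd

# Hypothetical data: one missing temperature and a non-numeric column.
df = pd.DataFrame(
    {
        "Lufttemperatur": [4.0, 10.0, 3.0, None],
        "Kvalitet": ["G", "G", "Y", "G"],
    }
)

# Missing values are skipped; text columns are left out explicitly.
means = df.mean(numeric_only=True)
counts = df.count()

print(means)
print(counts)
```

Here ``count()`` reports one value fewer for the temperature column than for the text column, because the missing entry is not counted.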

### Filtering

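Sometimes we want to filter the data using a condition on one of the columns. For example, ``temperatures_df[temperatures_df['Datum']>'2018-10-16']`` gives us the rows where the date ``Datum`` is later than 2018-10-16. (At this point ``Datum`` is still stored as text; the comparison works because dates written as year-month-day sort in chronological order.) Conditions can be combined with ``&`` ("and") or ``|`` ("or"), with each condition wrapped in parentheses. This is called Boolean filtering.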

In [None]:
temperatures_df[temperatures_df['Datum']>'2018-10-16']

In [None]:
temperatures_df[(temperatures_df['Datum']>'2018-10-16')&(temperatures_df['Datum']<'2018-11-16')]
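The cells above only use ``&``. As a sketch of both operators, the example below builds the Boolean masks separately on a made-up frame before applying them, which can make longer conditions easier to read.

```python
import pandas as pd

# Hypothetical rows; dates are kept as text, as in the cells above.
df = pd.DataFrame(
    {
        "Datum": ["2018-10-15", "2018-10-20", "2018-11-20"],
        "Lufttemperatur": [8.0, -1.5, 2.0],
    }
)

# "and": both conditions must hold
in_window = (df["Datum"] > "2018-10-16") & (df["Datum"] < "2018-11-16")

# "or": at least one condition must hold
cold_or_late = (df["Lufttemperatur"] < 0) | (df["Datum"] > "2018-11-16")

print(df[in_window])
print(df[cold_or_late])
```

The parentheses around each condition are required because ``&`` and ``|`` bind more tightly than the comparison operators.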

### Sorting

In [None]:
temperatures_df.head(10)

Let's sort by temperature:

In [None]:
temperatures_df.sort_values('Lufttemperatur')

Let's sort by date in descending order:

In [None]:
temperatures_df.sort_values('Datum',ascending=False).head(10)

Let's sort ascending by date and descending by time of day:

In [None]:
temperatures_df.sort_values(['Datum','Tid (UTC)'],ascending=[True,False])
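To make the effect of a multi-column sort easier to follow, here is a sketch on three invented rows. It also shows that ``sort_values`` returns a new DataFrame and leaves the original untouched.

```python
import pandas as pd

# Hypothetical rows, deliberately out of order.
df = pd.DataFrame(
    {
        "Datum": ["2018-10-17", "2018-10-16", "2018-10-16"],
        "Tid (UTC)": ["06:00:00", "06:00:00", "18:00:00"],
        "Lufttemperatur": [3.1, 4.2, 6.0],
    }
)

# Date ascending, then time of day descending within each date.
sorted_df = df.sort_values(["Datum", "Tid (UTC)"], ascending=[True, False])

print(sorted_df)

# The original index labels travel with the rows;
# reset_index(drop=True) renumbers them from 0.
print(sorted_df.reset_index(drop=True))
```

To keep the sorted result, assign it to a variable as done here, since the sort is not applied in place by default.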

## Grouping data

In [None]:
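# numeric_only=True leaves out text columns, which recent pandas versions refuse to average
temperatures_df.groupby('Kvalitet').mean(numeric_only=True)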

### Example 2 - Work with dates
* Convert the date column to datetime format with ``pd.to_datetime``

In [None]:
temperatures_df['Datum'] = pd.to_datetime(temperatures_df['Datum'])
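As a sketch of what the conversion does, the example below parses a few made-up strings. The ``format`` and ``errors`` arguments are optional additions that are not used in the cell above: ``format`` states the expected layout explicitly, and ``errors="coerce"`` turns values that cannot be parsed into ``NaT`` (missing) instead of raising an error.

```python
import pandas as pd

# Hypothetical input with one entry that is not a date.
dates = pd.Series(["2018-10-16", "2018-10-17", "not a date"])

parsed = pd.to_datetime(dates, format="%Y-%m-%d", errors="coerce")

print(parsed)
print(parsed.isna().sum())  # number of entries that could not be parsed
```

After the conversion the column holds real timestamps rather than text, which is what makes date components such as month and year available.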

#### Average temperature per day

In [None]:
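# numeric_only=True leaves out text columns, which recent pandas versions refuse to average
temperatures_df.groupby('Datum').mean(numeric_only=True).head(2)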

#### Creating new columns with year and month

In [None]:
# A lambda is a short, anonymous function...
f = lambda x: x**2

# ...equivalent to this regular function definition
def g(x):
    return x**2

In [None]:
# .apply calls the function on every element of the column
temperatures_df['Månad'] = temperatures_df['Datum'].apply(lambda x: x.month)
temperatures_df['År'] = temperatures_df['Datum'].apply(lambda x: x.year)
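For datetime columns, pandas also offers the ``.dt`` accessor, which extracts date components without a lambda. The sketch below, on made-up dates, checks that both approaches give the same values.

```python
import pandas as pd

# Hypothetical dates, already converted to datetime.
dates = pd.to_datetime(pd.Series(["2018-10-16", "2018-11-02", "2019-01-05"]))

months_apply = dates.apply(lambda x: x.month)  # element by element
months_dt = dates.dt.month                     # vectorised accessor
years_dt = dates.dt.year

print(months_dt.tolist())
print(years_dt.tolist())
```

On large columns the ``.dt`` form is typically faster, since it avoids calling a Python function once per row.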

In [None]:
temperatures_df.head(10)

#### Average temperature per month

In [None]:
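# numeric_only=True leaves out text columns, which recent pandas versions refuse to average
temperatures_df.groupby(['År','Månad']).mean(numeric_only=True).head(10)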
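If only the temperature is of interest, one option is to select that column before aggregating. The sketch below does this on a small invented frame, which avoids averaging the other columns altogether.

```python
import pandas as pd

# Hypothetical measurements spanning two months.
df = pd.DataFrame(
    {
        "Datum": pd.to_datetime(["2018-10-16", "2018-10-17", "2018-11-02"]),
        "Lufttemperatur": [4.0, 6.0, 1.0],
        "Kvalitet": ["G", "G", "Y"],
    }
)
df["År"] = df["Datum"].dt.year
df["Månad"] = df["Datum"].dt.month

# Group by year and month, then average only the temperature column.
monthly = df.groupby(["År", "Månad"])["Lufttemperatur"].mean()

print(monthly)
```

The result is a Series indexed by (year, month) pairs, so a single month can be looked up with, for example, ``monthly.loc[(2018, 10)]``.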