![LU Logo](https://www.lu.lv/fileadmin/user_upload/LU.LV/www.lu.lv/Logo/Logo_jaunie/LU_logo_LV_horiz.png)


# Pandas - the leading Python data analysis library

Pandas is a powerful open-source Python library for data analysis and manipulation.

It provides two main data structures, **Series** (1-dimensional) and **DataFrame** (2-dimensional), which let you organize, clean, and process datasets. With a rich set of functions for reading and writing many data formats, and comprehensive tools for transforming and exploring data, Pandas has become an indispensable tool in the data science and analytics communities.

Pandas is widely used, actively developed, and has excellent documentation.

Website: http://pandas.pydata.org/

## The creator of Pandas - Wes McKinney

Wes McKinney created Pandas in 2008. He started developing it while working at AQR Capital Management, mainly because he needed a flexible tool for quantitative analysis of financial data. He later published the book "Python for Data Analysis", which covers the Pandas library in detail and helped popularize it in the data science community.

[Python for Data Analysis book 3rd ed](https://www.amazon.com/Python-Data-Analysis-Wrangling-Jupyter-dp-109810403X/dp/109810403X)


![Python for Data Analysis book](https://m.media-amazon.com/images/I/51J1XFfaD4L._SX379_BO1,204,203,200_.jpg)

## Lesson contents

We will cover the following topics:

* Installing Pandas
* Pandas data structures
  * `Series`
  * `DataFrame`
  * date ranges (`pd.date_range`)
* Reading data from files
* Selecting and indexing data
* Processing data
* Aggregating and grouping data
* Visualizing data

## Prerequisites

* Python syntax
* Python data types
* Python operators
* Conditional expressions, branching with if, elif, else
* Loops: for and while
* Functions
* Imports, modules and packages
* Data structures: lists, tuples, dictionaries, sets
* File input/output
* Basics of object-oriented programming - classes and objects
* NumPy basics

## Lesson objectives

By the end of the lesson you should be able to:

* install Pandas
* create Pandas `Series` and `DataFrame` objects
* read data from files
* select and index data stored in Pandas data structures
* process data stored in Pandas data structures
* aggregate and group data stored in Pandas data structures

---

## Topic 1 - Installing Pandas and basic operations

### 1.1. Importing Pandas

In [None]:
# check whether the Pandas library is available
try:
    import pandas as pd
except ImportError:
    print("pandas not found")

# In rare cases the installed Pandas and NumPy versions may be incompatible.
# If that happens, you can try upgrading NumPy with the following command:
# !pip install --upgrade numpy
# On the command line the command would be: pip install --upgrade numpy

In [None]:
# print the Pandas version
print(f"pandas version: {pd.__version__}")

In [None]:
# we will also need numpy and matplotlib
# NumPy is a required dependency of Pandas; matplotlib is optional
# and only needed for plotting, so you might have to install it separately

import numpy as np
# print version
print(f"numpy version: {np.__version__}")
import matplotlib.pyplot as plt
# print matplotlib version
print(f"matplotlib version: {plt.matplotlib.__version__}")


In [None]:
# setting the max_rows parameter
# max_rows is the maximum number of rows that will be displayed
#pd.reset_option('display.max_rows')
pd.options.display.max_rows = 40

---
### Installing Pandas

First we need to install Pandas, if that has not been done yet.

**Installing from a Jupyter Notebook cell:**

```python
!pip install pandas
```

This installs Pandas into the current environment (using a Python virtual environment is recommended).

**Installing from the command line:**

```bash
pip install pandas
```

This command also installs Pandas into the currently active environment.

---

Pandas has many optional dependencies that must be installed separately:
[Pandas Optional Dependencies](https://pandas.pydata.org/pandas-docs/stable/getting_started/install.html#install-optional-dependencies)

For example, to install Pandas with additional support for Excel files, use the following command:

```bash
pip install "pandas[excel]"
```

This command also installs several additional packages needed to work with Excel files.
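
Before calling Excel-related functions it can help to check whether an optional package is importable. The sketch below uses only the standard library; the helper name `is_installed` is hypothetical, and `openpyxl` is used as the example because the notebook later relies on it for writing `.xlsx` files.

```python
import importlib.util


def is_installed(package_name: str) -> bool:
    """Return True if the top-level package can be found in the current environment."""
    return importlib.util.find_spec(package_name) is not None


# Report the status of pandas itself and of one optional Excel engine
for name in ("pandas", "openpyxl"):
    status = "available" if is_installed(name) else "missing"
    print(f"{name}: {status}")
```

If a package is reported as missing, installing it with `pip` as shown above should make it available after the kernel is restarted.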

### Creating a DataFrame

DataFrame is the most important Pandas data structure. It is a two-dimensional, heterogeneous, tabular data structure with mutable size and labeled axes (rows and columns). A DataFrame is similar to an Excel spreadsheet, an SQL table, or a dictionary of Series objects.
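
The "dictionary of Series" view can be made concrete with a small sketch. The variable names and the area figures below are illustrative assumptions, not data from the lesson; the point is that each Series becomes a column and the rows are aligned by label.

```python
import pandas as pd

# Two Series with partly different labels (area values are illustrative only)
population = pd.Series({"Rīga": 630000, "Daugavpils": 82000, "Liepāja": 69000})
area_km2 = pd.Series({"Rīga": 304, "Liepāja": 68})  # Daugavpils deliberately missing

# Each dictionary key becomes a column; rows are matched on the Series labels
cities = pd.DataFrame({"population": population, "area_km2": area_km2})
print(cities)
```

Labels that are missing from one of the Series produce `NaN` in that column rather than an error.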

In [None]:
# one common source of data is a dictionary
# here keys represent column names and values are lists of data

my_data = {
    'Pilsēta': ['Rīga', 'Daugavpils', 'Liepāja'],
    'Iedz.skaits': [630000, 82000, 69000]
}

df = pd.DataFrame(my_data) # df is very common abbreviation for DataFrame object variable name
df

In [None]:
# we can use an existing column as an index
# in this case we will save a reference to the new DataFrame object
df2 = df.set_index(['Pilsēta'])
df2

In [None]:
# we can access the data by index
df2.loc["Rīga"]

### Reading data

In [None]:
# Pandas can read data not just from files but also from web URLs:

# city_data = pd.read_csv("data/iedz_skaits_2018.csv", index_col=0)
csv_url = "https://github.com/CaptSolo/LU_Python_2023/raw/main/notebooks/data/iedz_skaits_2018.csv"

city_data = pd.read_csv(csv_url, index_col=0)

# display the first five rows - head() method
city_data.head()

In [None]:
city_data

In [None]:
city_data.head()

In [None]:
type(city_data)

In [None]:
# we can plot the data immediately - by default it will plot all columns
# there are many options to customize the plot but default is usually a good start

city_data.plot()
# by default Pandas uses matplotlib for plotting - there are options to use other libraries
plt.xticks(rotation=90) # simple way to rotate x-axis labels

## Topic 2 - Pandas Series

**Series** is a one-dimensional array-like object containing a sequence of values (of types similar to NumPy types) and an associated array of data labels, called its index.

The **DataFrame** structure is built on top of Series objects.


In [None]:
# one way to create a Series from a DataFrame is to select a single column
# if your DataFrame has only one column you can use the squeeze() method
city_series = city_data.squeeze()
# docs for squeeze: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.squeeze.html
type(city_series)

In [None]:
city_series.head()

In [None]:
# we can get single value by index
print(city_series["Liepāja"]) # the __str__ method is called
city_series["Liepāja"] # the __repr__ method is called, note the difference

In [None]:
# we can perform operations on the series
city_series.sum()

In [None]:
# we can generate basic statistics for the series
city_series.describe()

In [None]:
# we can filter the data by some condition
city_series[city_series < 1000]

In [None]:
bitmap = city_series < 1000 # a boolean mask (True/False) with the same index as the series
# now we will show a sample of our mask
bitmap.sample(20)   # why sample() rather than head()?

In [None]:
# we can select by the bitmap then sort the data
city_series[bitmap].sort_index()

In [None]:
city_series[bitmap].sort_values(ascending=False)

### Creating a Series from a list



In [None]:
# creating Pandas Series

s = pd.Series([1,4,3.5,3,np.nan,0,-5])
s

In [None]:
# we can perform operations on whole Series in one go:

s + 4

In [None]:
# NaN = Not a Number (used for missing numerical values)
# https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html

In [None]:
s2 = s * 4 
s2

In [None]:
s2 ** 2

In [None]:
### Often Series have an index identifying each data point with a label 

In [None]:
labeledSeries = pd.Series([24, 77, -35, 31], index=['d', 'e', 'a', 'g'])
labeledSeries

In [None]:
## Working with Series data (with some similarities to dictionaries)

labeledSeries['g']

In [None]:
labeledSeries.index

In [None]:
# Checking if a label is in the Series
'd' in labeledSeries

In [None]:
# we can get values from Series
labeledSeries.values

In [None]:
# Series Values are NumPy arrays
type(labeledSeries.values)

In [None]:
# we can select multiple values by index
labeledSeries[['a','d']] # NOTE double list brackets!!

In [None]:
# To generalize, Series behaves like a fixed-length, ordered dictionary with extra helper methods

### A Series can be created from a dictionary by passing it to pd.Series()

In [None]:
citydict = {'Rīga': 630000, 'Daugavpils': 82000, 'Liepāja': 69000, 'Carnikava': 4800}

In [None]:
cseries = pd.Series(citydict)
cseries

In [None]:
## Overwriting default index
clist = ['Jūrmala', 'Rīga', 'Daugavpils', 'Ogre', 'Liepāja']

cseries2 = pd.Series(citydict, index = clist)
cseries2

In [None]:
# notice Carnikava was dropped, since the new index does not contain it,
# while Jūrmala and Ogre (not in the dictionary) get NaN;
# the order follows the given index list

In [None]:
# find missing data
cseries2.isnull()

In [None]:
cseries2.dropna()

In [None]:
cseries2

In [None]:
cseries3 = cseries + cseries2
cseries3

In [None]:
# so NaN + number = NaN
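
When `NaN` results from labels that exist on only one side, `Series.add` with `fill_value` is one way to avoid it. This is a minimal sketch with made-up Series (`first` and `second` are hypothetical names, not the notebook's variables):

```python
import pandas as pd

first = pd.Series({"Rīga": 630000, "Carnikava": 4800})
second = pd.Series({"Rīga": 630000, "Ogre": 25000})

# The + operator aligns on labels; a label missing on either side yields NaN
plain_sum = first + second

# add() with fill_value treats a value missing on one side as 0
filled_sum = first.add(second, fill_value=0)

print(plain_sum)
print(filled_sum)
```

Whether filling with 0 is appropriate depends on the data; for population counts it would silently treat an unknown value as zero.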

In [None]:
# we can name the Series and its index

cseries.name = "Latvian Cities"
cseries.index.name = "City"
cseries

In [None]:
cseries.index

In [None]:
# changing Index names
cseries.index = ['RīgaIsOld', 'Daugavpils', 'LiepājaWind', 'CarnikavaIsNotaCity']
cseries

In [None]:
# Series values are mutable
cseries['RīgaIsOld'] = 625000
cseries

In [None]:
# We can use rename() method to rename individual elements
cseries4 = cseries.rename(index={'RīgaIsOld':'RīgaRocks'})

In [None]:
cseries4["RīgaRocks"]

### Integer (position-based) and label-based indexes

Working with Pandas objects indexed by integers often confuses new users, because the indexing semantics differ somewhat from built-in Python data structures such as lists and tuples. For example, you might not expect the following code to raise an error:


In [None]:
ser = pd.Series(np.arange(3.))
ser

In [None]:
try:
    ser[-1]
except KeyError as e:
    print(f"KeyError: {e}")

In this case Pandas could "fall back" to integer (position-based) indexing, but that is hard to do in general without introducing subtle bugs.

For example, with an index containing the values 0, 1, 2, it is hard to tell unambiguously whether the user wants label-based or position-based indexing.

In [None]:
ser[2]

In [None]:
## With a non-integer index there is no potential for ambiguity:

In [None]:
ser2 = pd.Series(np.arange(3.), index=['a', 'b', 'c'])
ser2[-1] # positional fallback on a label index is deprecated (FutureWarning in pandas 2.x); prefer ser2.iloc[-1]

In [None]:
ser2

In [None]:
# Slicing with integers is positional, even when the index holds labels:
ser2[::-1]

In [None]:
## To keep things consistent, if you have an axis index containing integers, data selection
## will always be label-oriented. 

# For more precise handling, use loc (for labels) or iloc (for integer positions):

In [None]:
ser2.loc['b']

In [None]:
# Note: label indexing includes the endpoint, integer indexing does not
ser.loc[:1]

In [None]:
ser.iloc[:1]

* loc selects rows (or columns) with specific labels from the index.

* iloc selects rows (or columns) at specific positions in the index (so it only accepts integers).
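
The difference between the two indexers is easiest to see when integer labels do not match positions. The following is a small sketch with made-up data (`shuffled` is a hypothetical name; `ser` mirrors the Series used above):

```python
import pandas as pd

# Integer labels that deliberately differ from the positions
shuffled = pd.Series([10.0, 20.0, 30.0], index=[2, 0, 1])
by_label = shuffled.loc[0]       # the element whose label is 0
by_position = shuffled.iloc[0]   # the element stored first

ser = pd.Series([0.0, 1.0, 2.0])   # default index 0, 1, 2
label_slice = ser.loc[:1]          # label slice: the endpoint is included
position_slice = ser.iloc[:1]      # position slice: the endpoint is excluded

print(by_label, by_position)
print(len(label_slice), len(position_slice))
```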

## Topic 3 - Creating date ranges

Date ranges are used as indexes for time series data:
* https://pandas.pydata.org/docs/user_guide/10min.html#time-series

In [None]:
# let's get today's date in the form of a YYYYMMDD string
from datetime import datetime
today = datetime.today().strftime("%Y%m%d")
today

In [None]:
dates = pd.date_range(today, periods=15)
dates

In [None]:
pd.date_range(today, periods=15, freq="W") # "W" is an alias for "W-SUN": weekly frequency anchored on Sundays, so every date is a Sunday


In [None]:
# let's start with Monday
pd.date_range(today, periods=7, freq="W-MON")

In [None]:
# more on date_range frequency here
# https://stackoverflow.com/questions/35339139/where-is-the-documentation-on-pandas-freq-tags

In [None]:
# Datetime is in the standard library (so all Python installations will have it)
from datetime import date
date.today()

In [None]:
# We can get a date range starting from today
months = pd.date_range(date.today().strftime("%Y-%m-%d"), periods = 10, freq='BMS')
# BMS means business month start: the first weekday (Mon-Fri) of each month; holidays are not taken into account
months

## Topic 4 - DataFrame

DataFrame is the most commonly used Pandas data structure. It is a 2-dimensional data table containing an ordered collection of columns.
- each column can have a different data type (numeric, text, boolean, etc.).

A DataFrame has both a row index and a column index.

It can be thought of as an ordered dictionary of Series that all share the same row index.

Internally, DataFrame data is stored as one or more two-dimensional blocks (similar to ndarray).

In [None]:
# There are different ways for creating DataFrames

# A common way is to create it from a dict of equal-length lists or NumPy arrays

In [None]:
# again column names are keys and values are lists of data
data = {'city': ['Riga', 'Riga', 'Riga', 'Jurmala', 'Jurmala', 'Jurmala'],
        'year': [1990, 2000, 2018, 2001, 2002, 2003],
        'popul': [0.9, 0.75, 0.62, 0.09, 0.08, 0.06]}

df = pd.DataFrame(data)
df

In [None]:
# we can specify the order of columns
df2 = pd.DataFrame(data, columns=['year','city', 'popul','budget'])
# note we did not previously have budget column, thus it will be filled with NaN
df2

In [None]:
# missing column simply given Nans

In [None]:
# we can set values for the new column all at once
df2['budget']=300000000
df2

In [None]:
# we could pass specific values for the new column as well
df2['budget']=[300000, 250000, 400000, 200000, 250000, 200000] # need to pass all values
df2

In [None]:
# Many ways of changing individual values

## Recommended way of changing in place (same dataframe)

In [None]:
# iat will let you assign values in specific cells by numerical index
df2.iat[3,2] = 0.063 # so 3 is the row index, 2 is the column index
df2

In [None]:
# selecting single column will give you series
df2["budget"]

In [None]:
type(df2["budget"])

In [None]:
# if you want a single column dataframe then we use double brackets
df2[["budget"]]

In [None]:
# type
type(df2[["budget"]])

In [None]:
# delete column by its name
del df2["budget"]
# an alternative would be to use the drop method
# see docs: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html
df2

### Using a date range to build a DataFrame

In [None]:
# we still have our date range (a DatetimeIndex), which we will use as an index
dates

In [None]:
df = pd.DataFrame(np.random.randn(15,5), index=dates, columns=list('ABCDE'))
# We passed 15 rows of 5 random elements and set index to dates and columns to our basic list elements
df

In [None]:
# we can also create a DataFrame from a dict where values are various types
df2 = pd.DataFrame({ 'A' : 1.,
                      'B' : pd.Timestamp('20130102'),
                      'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
                      'D' : np.array([3] * 4,dtype='int32'),
                      'E' : pd.Categorical(["test","train","test","train"]),
                      'F' : 'foo' })
df2

In [None]:
# array-like columns must have matching lengths; scalar values are repeated for every row

Categorical data type:
https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html
    

In [None]:
## s = pd.Series([1,4,3.5,3,np.nan,0,-5])
s

In [None]:
# again we either supply one value or exact number of values
df3 = pd.DataFrame({ 'A' : 1.,
                   'B' : pd.Timestamp('20180523'),
                   'C' : s,
                   'D' : [x**2 for x in range(7)],
                   'E' : pd.Categorical(['test','train']*3+["train"]),
                   'F' : 'aha'
                   })
df3

In [None]:
## different datatypes for columns! 

In [None]:
df3.dtypes

In [None]:
df3.head()

In [None]:
df3.tail(3)

In [None]:
df.index

In [None]:
df3.index

In [None]:
df3.values

### DataFrame statistics

In [None]:
# describe method gives basic statistics for numerical columns by default
df3.describe()

In [None]:
# info() prints a concise summary: index range, column dtypes, non-null counts and memory usage
df.info()

In [None]:
# we can get statistics for non-numerical columns as well
df3.describe(include='all') 
# note how NaNs are shown where statistics are not applicable


In [None]:
# we can show statistics for non-numericals only
df3.describe(include=['object', 'category'])

In [None]:
# Sorting

In [None]:
df.sort_index(axis=1,ascending=False)
# this sorts columns in reverse order

In [None]:
## Sort by Axis in reverse

In [None]:
df.sort_index(axis=0,ascending=False)
# here we have sort by index in reverse order

In [None]:
# more commonly we want to sort by values in some column
df3.sort_values(by='C', ascending=False)

In [None]:
# Notice that NaN becomes last

In [None]:
# we can sort by multiple columns and supply sorting directions for each
df3.sort_values(by=['E','C'], ascending=[True,False])
# so here we lexicographically sort by E and when we have ties we sort by C numerically in reverse


### Selecting data

Note: although standard Python / NumPy expressions for selecting and setting data are intuitive and convenient for interactive work, the optimized Pandas data access methods **.at**, **.iat**, **.loc** and **.iloc** are recommended for production code.
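
As a rough illustration of the four accessors, here is a minimal sketch on a tiny, made-up DataFrame (`demo` and its values are assumptions for the example, not the lesson's data):

```python
import pandas as pd

demo = pd.DataFrame(
    {"popul": [0.62, 0.06], "year": [2018, 2003]},
    index=["Riga", "Jurmala"],
)

riga_popul = demo.at["Riga", "popul"]   # one cell by row and column label
jurmala_popul = demo.iat[1, 0]          # one cell by row and column position
years = demo.loc[:, "year"]             # a whole column by label, as a Series
first_row = demo.iloc[0]                # a whole row by position, as a Series

# A single .loc call with a boolean row mask and a column label
# assigns in place without chained indexing
demo.loc[demo["year"] < 2010, "popul"] = 0.05

print(riga_popul, jurmala_popul)
print(demo)
```

Using one `.loc` call for assignment, rather than two chained `[]` operations, avoids modifying a temporary copy by accident.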

In [None]:
df3['D']

In [None]:
df3[:5] # first 5 rows

In [None]:
df3[2:5]

In [None]:
df3[2:5:2]

In [None]:
df3[::-1]

### Selection by label

To select data by label:

In [None]:
df

In [None]:
dates

In [None]:
dates[0]

In [None]:
df.loc[dates[0]] # so we used a specific date as index to get the row data as a Series

In [None]:
df.loc[dates[2:5]]

### Selecting on multiple axes by label

In [None]:
df.loc[:, ['A','B','C']]
# we get all rows and only columns A, B and C

In [None]:
df.loc[dates[2:5], ['A','B','C']]

In [None]:
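# the index starts at today's date, so we slice with labels taken from `dates`
# rather than hard-coded date strings; label slices include both endpoints
df.loc[dates[1]:dates[3], ['B','C']]

In [None]:
# Reduction in the dimensions of the returned object:

In [None]:
df.loc[dates[2], ["B", "D"]]  # a single row label returns a Series

In [None]:
## Selecting a single column with a list still returns a Series, not a scalar

In [None]:
df.loc[dates[2], ["D"]]

In [None]:
type(df.loc[dates[2], ["D"]])

In [None]:
# to get a true scalar, pass a single row label and a single column label

In [None]:
df.at[dates[5],'D']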

### Selection by position

We can select data by position with the **iloc** indexer:

In [None]:
df.iloc[3] # so we got 4th row

In [None]:
# By integer slices, acting similar to numpy/python:

In [None]:
df.iloc[2:5,:2]
# so rows at positions 2, 3, 4 (3rd to 5th row) and the first two columns; the end of each slice is excluded

In [None]:
# By lists of integer position locations, similar to the numpy/python style:

In [None]:
# we can supply lists of indices
df.iloc[[3,5,1],[1,4,2]]

In [None]:
df

In [None]:
df.iloc[2,2] # so 3rd row and 3rd column

In [None]:
# iat is very similar but you only can use single indices not slices or lists
df.iat[2,2]

In [None]:
# For getting fast access to a scalar (equivalent to the prior method):

In [None]:
df.iat[2,2]

### Boolean indexing

In [None]:
# Using a single column’s values to select data
df[df.A > 0.2]

In [None]:
# we can use filter on all columns to obtain bitmask
df > 0.2

In [None]:
# Table cells that match given criteria
df[df > 0.2]
# so non matching cells are NaN

In [None]:
# we replace our filter values with NaN
df[df < 0.2] = np.nan # so all values less than 0.2 are replaced with NaN
df

In [None]:
# fill in missing values with some value
df.fillna(value=0.1)

In [None]:
# there is also df.dropna() to drop any ROWS(!) with missing data
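
`dropna()` has a few parameters that control what gets dropped. The sketch below uses a small, made-up DataFrame (`gaps` is a hypothetical name) to show the most common variants:

```python
import numpy as np
import pandas as pd

gaps = pd.DataFrame({
    "A": [1.0, np.nan, 3.0],
    "B": [np.nan, np.nan, 6.0],
    "C": [7.0, 8.0, 9.0],
})

rows_complete = gaps.dropna()               # keep only rows without any NaN
rows_not_empty = gaps.dropna(how="all")     # drop a row only if every value is NaN
cols_complete = gaps.dropna(axis=1)         # drop columns that contain any NaN
rows_with_a = gaps.dropna(subset=["A"])     # only look at column A when deciding

print(rows_complete)
print(cols_complete)
```

Like `fillna()`, `dropna()` returns a new object and leaves `gaps` unchanged.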

### Modifying DataFrame data

In [None]:
df

In [None]:
# we used fillna method yet we still have NaNs
# why is that so?

# because fillna returns a new DataFrame, it does not change the original one

# many methods in Pandas work this way, they return a new object by default
# they also have a parameter inplace that can be set to True to change the original object

df.fillna(value=0.1, inplace=True) # will MODIFY the original DataFrame
df

In [None]:
s1 = pd.Series([x**3 for x in range(15)], index=pd.date_range(today, periods=15))
s1

In [None]:
# let's add this new column to our DataFrame
# since indexes are the same - specific DateRange here, Pandas will match them
df['F'] = s1
df

In [None]:
# setting cell values

df.at[dates[1], 'A'] = 33
# similarly we could use loc
df.loc[dates[2], ['B']] = 66
df

### Pandas method chaining

Method chaining lets you combine several Pandas operations in a single Python statement.

For example, instead of:
```
df = df.drop(columns=["Rank"])
df = df.query("Province == 'Connacht'")
df.sort_values("Density (/ km2)", ascending=False)
```

you can write:
```
df.drop(columns=["Rank"]) \
  .query("Province == 'Connacht'") \
  .sort_values("Density (/ km2)", ascending=False)
```
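
For a version that runs without network access, here is a self-contained sketch of the same idea. The DataFrame `towns` and its columns are made up for the illustration; the chain is wrapped in parentheses so no backslashes are needed.

```python
import pandas as pd

towns = pd.DataFrame({
    "city": ["Rīga", "Daugavpils", "Liepāja", "Carnikava"],
    "population": [630000, 82000, 69000, 4800],
    "rank": [1, 2, 3, 4],
})

result = (
    towns.drop(columns=["rank"])                                 # remove a column
    .query("population > 50000")                                 # keep larger cities
    .assign(population_k=lambda d: d["population"] / 1000)       # add a derived column
    .sort_values("population")                                   # smallest first
)

print(result)
print(towns.columns.tolist())  # the original DataFrame is unchanged
```

Each step returns a new DataFrame, so the chain never modifies `towns` itself.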



In [None]:
# example: https://blanchardjulien.com/posts/chaining/

import pandas as pd

def getDataframe(url_table,ind):
    df = pd.read_html(url_table)[ind]
    return df

df_ie = getDataframe("https://en.wikipedia.org/wiki/Historical_population_of_Ireland",1)
df_ie.sample(5)

In [None]:
df_ie.drop(columns=["Rank", "Change since previous census"]) \
  .query("Province == 'Connacht'") \
  .sort_values("Density (/ km2)", ascending=False)

In [None]:
(
  df_ie.drop(columns=["Rank", "Change since previous census"])
    .query("Province == 'Connacht'")
    .sort_values("Density (/ km2)", ascending=False)
)

In [None]:
# city_data = pd.read_csv("data/iedz_skaits_2018.csv", index_col=0)
csv_url = "https://github.com/CaptSolo/LU_Python_2023/raw/main/notebooks/data/iedz_skaits_2018.csv"

city_data = pd.read_csv(csv_url, index_col=0)

# display the first five rows - head() method
city_data.head()

In [None]:
(
    city_data.dropna()
        .rename(columns={"2018 Iedzīvotāju skaits gada sākumā": "Iedz. skaits"})
        .sort_values(by="Iedz. skaits", ascending=False)
        .head(10)
)

## Operations on Series and DataFrame

DataFrame methods and attributes:
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html
        
Series methods and attributes:
* https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html
    
Data Science Handbook:
* [Data manipulation with Pandas](https://jakevdp.github.io/PythonDataScienceHandbook/index.html#3.-Data-Manipulation-with-Pandas)

In [None]:
[m for m in dir(df) if not m.startswith("_")]
# note the mixture of methods and attributes; column names that are valid identifiers (here A-F) are also exposed as attributes


In [None]:
df = pd.DataFrame(np.random.randn(15,5), index=dates, columns=list('ABCDE'))

df

In [None]:
df.mean()

In [None]:
df.max()

In [None]:
# Other axis

In [None]:
df.mean(axis=1)

In [None]:
df.max(axis=1)

### Text operations (Series.str.*)

In [None]:

str1 = pd.Series(['APPle', 'baNAna', np.nan, 42, 'mangO'])
# NOTE: with mixed types the Series gets dtype object; values keep their own Python types
# (42 stays an int), and .str methods return NaN for the non-string entries
str1

In [None]:
[name for name in dir(str1.str) if not name.startswith("_")]

In [None]:
help(str1.str.lower)

In [None]:
str1.str.lower()

In [None]:
str1.str.len()

### Apply

In [None]:
df

In [None]:
# Lambda functions are anonymous functions
# (functions defined without a name)

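# DataFrame.apply calls the function once per column (each column is passed in as a Series);
# because x*3 is vectorized, the result here equals multiplying every element by 3

# this is usually faster than an explicit Python loop over rows or columns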

df.apply(lambda x: x*3) 
# this example is simple and could be done with df*3
# we would use apply for more complex operations
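
To show what `apply` receives in each mode, here is a minimal sketch on a small, made-up DataFrame (`nums` is a hypothetical name): by default the function gets one column at a time, with `axis=1` it gets one row at a time, and `Series.map` works element by element.

```python
import pandas as pd

nums = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})

# Default (axis=0): the function is called once per column
col_range = nums.apply(lambda col: col.max() - col.min())

# axis=1: the function is called once per row
row_sum = nums.apply(lambda row: row["A"] + row["B"], axis=1)

# Series.map: the function is called once per element
squared_a = nums["A"].map(lambda x: x ** 2)

print(col_range)
print(row_sum)
print(squared_a)
```

Row-wise `apply` is convenient but tends to be slow on large data; a vectorized expression such as `nums["A"] + nums["B"]` is usually preferable when one exists.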

### Grouping and aggregating data

In [None]:
df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'], 
                   'data': range(6)}, 
                  columns=['key', 'data'])
df

In [None]:
df.groupby('key') # we get a groupby object

In [None]:
# we can apply aggregate functions to the groups obtaining a new DataFrame
df.groupby('key').sum()

In [None]:
help(df.groupby)
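
Beyond a single `sum()`, `agg` can compute several statistics per group at once. The sketch below rebuilds the same small table under a different name (`grouped_demo`, hypothetical) so it can run on its own; the output column names `total` and `largest` are also chosen just for the example.

```python
import pandas as pd

grouped_demo = pd.DataFrame({
    "key": ["A", "B", "C", "A", "B", "C"],
    "data": range(6),
})

# Several aggregations of one column; the result has one column per function
summary = grouped_demo.groupby("key")["data"].agg(["sum", "mean", "count"])

# Named aggregation: output_column=(input_column, function)
named = grouped_demo.groupby("key").agg(
    total=("data", "sum"),
    largest=("data", "max"),
)

print(summary)
print(named)
```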

### Combining data

In [None]:
## Merge
# often we will want to combine data from different sources

left = pd.DataFrame({"key": ["foo", "bar"], "lval": [1, 2]})
right = pd.DataFrame({"key": ["foo", "bar", "other"], "rval": [4, 5, 0]})

In [None]:
left

In [None]:
right

In [None]:
pd.merge(left, right, on="key")
# thus we are merging on the key column
# by default merge is inner join
# meaning that only keys present in both DataFrames will be included

In [None]:
# we might want to use different types of joins
# for example left join will include all keys from the left DataFrame
# here right join will include all keys from the right DataFrame
pd.merge(left, right, on="key", how="right")
# note how NaNs are used to fill missing values
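
An outer join keeps keys from both sides, and `indicator=True` records where each row came from. In this sketch the tables are redefined under new names, and an extra key `"baz"` is added to the left table (an assumption made only so that both sides have an unmatched key):

```python
import pandas as pd

left_demo = pd.DataFrame({"key": ["foo", "bar", "baz"], "lval": [1, 2, 3]})
right_demo = pd.DataFrame({"key": ["foo", "bar", "other"], "rval": [4, 5, 0]})

# how="outer" keeps every key; indicator=True adds a _merge column
outer = pd.merge(left_demo, right_demo, on="key", how="outer", indicator=True)
print(outer)
```

The `_merge` column takes the values `left_only`, `right_only` or `both`, which makes it easy to spot keys that failed to match.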

### Working with files

In [None]:
# writing to a CSV file
df.to_csv("test_pandas2.csv")

In [None]:
# reading from CSV file
new_df = pd.read_csv("test_pandas2.csv", index_col=0)
new_df.head()

In [None]:
# Excel

In [None]:
# this will raise an error if 'openpyxl' package is not installed
df.to_excel('test_pandas.xlsx', sheet_name='Sheet1')


In [None]:
df6 = pd.read_excel('test_pandas.xlsx', 'Sheet1', index_col=0, na_values=['NA'])

In [None]:
df6.head()

### Working with web content

Pandas can read the contents of HTML tables (provided the page actually uses `<table>` elements).

https://pandas.pydata.org/docs/reference/api/pandas.read_html.html

In [None]:
import pandas as pd

In [None]:
# read tables from an HTML page

url = "https://www.ss.com/lv/transport/cars/audi/"

tables = pd.read_html(url)

print(len(tables))

In [None]:
# select tables matching a search string
# use the 1st line as a header

tables = pd.read_html(url, match="Sludinājumi", header=0)

In [None]:
tables[2][:10]

### Time series

Time series are a special kind of data in which time is an integral component. Time series data are common in many fields, such as economics, finance, biology, physics, and medicine.

**Key idea** - the observations are ordered by increasing time.

Time is divided into discrete units (days, hours, minutes, or other units), and time series data are usually recorded at regular intervals.
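
Moving between time units is done with `resample`. As a minimal sketch, assuming two days of made-up hourly readings (the start date and values are arbitrary), hourly data can be aggregated into daily means:

```python
import numpy as np
import pandas as pd

# 48 hourly observations with values 0..47
hours = pd.date_range("2024-01-01", periods=48, freq="h")
hourly = pd.Series(np.arange(48), index=hours)

# Group the observations into calendar days and average each group
daily_mean = hourly.resample("D").mean()
print(daily_mean)
```

The same pattern works with other aggregations such as `sum()` or `max()`.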

In [None]:


# first let's set seed for reproducibility
np.random.seed(2024)
# let's generate 10 years worth of random data using NumPy random.randn
# docs: https://numpy.org/doc/stable/reference/random/generated/numpy.random.randn.html
periods=3650
ts = pd.Series(np.random.randn(periods), index=pd.date_range(today, periods=periods))

In [None]:
ts.tail()

In [None]:
cumulative_series = ts.cumsum() # cumulative sum
# tail() will show the last 5 elements
cumulative_series.tail()


In [None]:
cumulative_series.plot()

In [None]:
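# blank out a stretch of dates to simulate missing data (this only has an effect if the range overlaps the index, which starts today)
cumulative_series["2027-01-01":"2029-01-01"] = np.nan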

In [None]:
cumulative_series.plot()

In [None]:
# we can use rolling method to calculate rolling statistics
# we supply a window size of 90 days here
rolling_avg = cumulative_series.rolling(window=90).mean()
rolling_avg # the first 89 values are NaN because the 90-day window is not yet full; windows overlapping the NaN gap are NaN too
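
The number of leading `NaN` values is controlled by `min_periods`, which defaults to the window size for fixed-size windows. A small sketch with a short, made-up Series makes the effect visible:

```python
import pandas as pd

values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Default: a result is produced only once the window holds 3 observations
strict = values.rolling(window=3).mean()

# min_periods=1: a result is produced from whatever observations are available
relaxed = values.rolling(window=3, min_periods=1).mean()

print(strict)
print(relaxed)
```

With `min_periods=1` the early averages are based on fewer observations, so they are noisier than the rest of the series.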

In [None]:
rolling_avg.plot()

---

## Additional resources

- Documentation: http://pandas.pydata.org/pandas-docs/stable/
- Pandas Cheat Sheet: https://github.com/pandas-dev/pandas/blob/master/doc/cheatsheet/Pandas_Cheat_Sheet.pdf

- https://www.dataschool.io/easier-data-analysis-with-pandas/ (video)

- Tutorials: https://pandas.pydata.org/pandas-docs/stable/getting_started/tutorials.html
  - ["Getting started"](https://pandas.pydata.org/pandas-docs/stable/getting_started/index.html) - see also the "10 minutes to pandas" section
  - ["Modern Pandas"](http://tomaugspurger.github.io/modern-1-intro.html) tutorial
  - [Python Data Science Handbook - Pandas](https://jakevdp.github.io/PythonDataScienceHandbook/index.html#3.-Data-Manipulation-with-Pandas)