# Data analysis with pandas

[Pandas quick-start guide](http://pandas.pydata.org/pandas-docs/stable/10min.html)  
[Pandas documentation](http://pandas.pydata.org/pandas-docs/stable/)  
[Lecture notes on pandas](../predavanja/Analiza podatkov s knjižnico Pandas.ipynb)


### Loading pandas and the data

In [3]:
# load the package
import pandas as pd
import os.path
# since we will work with large tables, always display at most 10 rows
pd.options.display.max_rows = 10

# choose the interactive "notebook" plotting style
%matplotlib notebook
# load the table we will work with
pot = os.path.join('../../02-zajem-podatkov', 'predavanja', 'obdelani-podatki', 'filmi.csv')
filmi = pd.read_csv(pot)
filmi

Unnamed: 0,id,naslov,dolzina,leto,ocena,metascore,glasovi,zasluzek,oznaka,opis
0,4972,The Birth of a Nation,195,1915,6.4,,20737,10000000.0,,The Stoneman family finds its friendship with ...
1,6864,Intolerance: Love's Struggle Throughout the Ages,163,1916,7.8,92.0,13031,2180000.0,,"The story of a poor young woman, separated by ..."
2,9968,Broken Blossoms or The Yellow Man and the Girl,90,1919,7.4,,8700,,,"A frail waif, abused by her brutal boxer fathe..."
3,10323,Das Cabinet des Dr. Caligari,76,1920,8.1,,50866,,,"Hypnotist Dr. Caligari uses a somnambulist, Ce..."
4,12349,The Kid,68,1921,8.3,,100210,5450000.0,,"The Tramp cares for an abandoned child, but ev..."
...,...,...,...,...,...,...,...,...,...,...
9995,9398640,Between Two Ferns: The Movie,82,2019,6.2,58.0,7319,,,Zach Galifianakis and his oddball crew take a ...
9996,9419834,Secret Obsession,97,2019,4.3,,13308,,,"Recuperating from trauma, Jennifer remains in ..."
9997,9495224,Black Mirror: Bandersnatch,90,2018,7.2,,96998,,,"In 1984, a young programmer begins to question..."
9998,9860728,Falling Inn Love,98,2019,5.6,,7389,,,When city girl Gabriela spontaneously enters a...
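
The `NaN` entries in the output above correspond to empty fields in the CSV file. A minimal sketch on made-up CSV text (the two rows are illustrative and not taken from `filmi.csv`) suggests how `pd.read_csv` treats such fields:

```python
import io
import pandas as pd

# Made-up CSV text standing in for filmi.csv; the second row has an empty metascore field.
csv_text = "id,naslov,ocena,metascore\n1,Film A,6.4,92\n2,Film B,7.8,\n"

primer = pd.read_csv(io.StringIO(csv_text))

# Empty fields are read as NaN, which turns the column into float64.
print(primer['metascore'].isna().tolist())   # [False, True]
print(primer['metascore'].dtype)             # float64
```

This is presumably why `metascore` and `zasluzek` are displayed as floats even where the underlying values are whole numbers.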


Let's take a look at the data.

In [1]:
1+1

2

## Exploring the data

Sort the data by rating.

In [6]:
filmi.sort_values('ocena', ascending = False)

Unnamed: 0,id,naslov,dolzina,leto,ocena,metascore,glasovi,zasluzek,oznaka,opis
9902,7286456,Joker,122,2019,9.5,70.0,14789,,R,An original standalone origin story of the ico...
4196,252487,Hababam Sinifi,87,1975,9.4,,34256,,,"Lazy, uneducated students share a very close b..."
9946,7738784,Peranbu,147,2018,9.3,,10415,,,"A single father tries to raise his daughter, w..."
2830,111161,Kaznilnica odrešitve,142,1994,9.3,80.0,2136999,28341469.0,R,Two imprisoned men bond over a number of years...
8284,2170667,Wheels,115,2014,9.3,,17371,,R,Two suicidal paraplegic junkies hustle their w...
...,...,...,...,...,...,...,...,...,...,...
9718,5988370,Reis,108,2017,1.5,,71969,,,A drama about the early life of Recep Tayyip E...
9726,6038600,Smolensk,120,2016,1.4,,7417,,,Inspired by true events of 2010 Polish Air For...
9237,4009460,Saving Christmas,79,2014,1.4,18.0,14365,2783970.0,PG,His annual Christmas party faltering thanks to...
9354,4458206,Kod Adi K.O.Z.,114,2015,1.4,,26817,,,A look at the 17-25 December 2013 corruption s...
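
Several films above share the same rating, so a second sort key can be useful. A minimal sketch on made-up rows (the frame `primer` is hypothetical) sorts by rating and breaks ties by the number of votes:

```python
import pandas as pd

# Made-up rows for illustration only.
primer = pd.DataFrame({
    'naslov': ['A', 'B', 'C'],
    'ocena': [9.3, 9.3, 8.0],
    'glasovi': [100, 5000, 70],
})

# Sort by rating (descending); among equal ratings, put the most-voted film first.
urejeno = primer.sort_values(['ocena', 'glasovi'], ascending=[False, False])
print(urejeno['naslov'].tolist())   # ['B', 'A', 'C']
```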


Select the ratings column.

In [9]:
ocene = filmi['ocena']
ocene

0       6.4
1       7.8
2       7.4
3       8.1
4       8.3
       ... 
9995    6.2
9996    4.3
9997    7.2
9998    5.6
9999    8.3
Name: ocena, Length: 10000, dtype: float64

The expressions `filmi['ocena']` and `filmi[['ocena']]` are not the same:

In [10]:
print(type(filmi['ocena']))
print(type(filmi[['ocena']]))

<class 'pandas.core.series.Series'>
<class 'pandas.core.frame.DataFrame'>


The columns of a `DataFrame` are of type `Series`. Single brackets select a `Series`, while double brackets select a `DataFrame` sub-table. Most operations (grouping, joining, plotting, filtering, ...) work on a `DataFrame`.

The `Series` type is used when, for example, we want to add a column.

Round the ratings column with the `round()` function.

In [12]:
zaokrozene = round(ocene)

Add the rounded values to the data.

Comment?
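
No cell is shown for this step, but the later output contains a column named `zaokrozeno`, so the step was presumably an assignment of the rounded `Series` to a new column. A minimal sketch on a made-up frame (the frame `primer` and its values are assumptions):

```python
import pandas as pd

primer = pd.DataFrame({'naslov': ['A', 'B', 'C'], 'ocena': [6.4, 7.8, 8.3]})

# Assigning a Series to a new key adds it as a column, aligned on the index.
# .round() is the method form of the round(ocene) call used above.
primer['zaokrozeno'] = primer['ocena'].round()
print(primer['zaokrozeno'].tolist())   # [6.0, 8.0, 8.0]
```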



Remove the newly added column with the `.drop()` method, passing the `columns=` argument.

In [26]:
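# .drop() returns a new table and leaves filmi unchanged,
# which is why the output below still shows the zaokrozeno column
filmi.drop(columns = 'zaokrozeno')
filmi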

Unnamed: 0,id,naslov,dolzina,leto,ocena,metascore,glasovi,zasluzek,oznaka,opis,zaokrozeno
0,4972,The Birth of a Nation,195,1915,6.4,,20737,10000000.0,,The Stoneman family finds its friendship with ...,6.0
1,6864,Intolerance: Love's Struggle Throughout the Ages,163,1916,7.8,92.0,13031,2180000.0,,"The story of a poor young woman, separated by ...",8.0
2,9968,Broken Blossoms or The Yellow Man and the Girl,90,1919,7.4,,8700,,,"A frail waif, abused by her brutal boxer fathe...",7.0
3,10323,Das Cabinet des Dr. Caligari,76,1920,8.1,,50866,,,"Hypnotist Dr. Caligari uses a somnambulist, Ce...",8.0
4,12349,The Kid,68,1921,8.3,,100210,5450000.0,,"The Tramp cares for an abandoned child, but ev...",8.0
...,...,...,...,...,...,...,...,...,...,...,...
9995,9398640,Between Two Ferns: The Movie,82,2019,6.2,58.0,7319,,,Zach Galifianakis and his oddball crew take a ...,6.0
9996,9419834,Secret Obsession,97,2019,4.3,,13308,,,"Recuperating from trauma, Jennifer remains in ...",4.0
9997,9495224,Black Mirror: Bandersnatch,90,2018,7.2,,96998,,,"In 1984, a young programmer begins to question...",7.0
9998,9860728,Falling Inn Love,98,2019,5.6,,7389,,,When city girl Gabriela spontaneously enters a...,6.0
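
The `zaokrozeno` column is still present in the output above because `.drop()` does not modify the table it is called on. A minimal sketch on a made-up frame (`primer` and `brez` are hypothetical names):

```python
import pandas as pd

primer = pd.DataFrame({'ocena': [6.4, 7.8], 'zaokrozeno': [6.0, 8.0]})

# .drop() returns a new DataFrame and leaves the original untouched.
brez = primer.drop(columns='zaokrozeno')
print(list(brez.columns))     # ['ocena']
print(list(primer.columns))   # ['ocena', 'zaokrozeno']
```

To remove the column for good, the result would have to be assigned back, for example `primer = primer.drop(columns='zaokrozeno')`. The notebook keeps the column, since later cells group by it.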


### Note: slices
Selecting a sub-table creates a so-called slice.
A slice is not guaranteed to be an independent copy of the table; it may merely refer to the original table,
so modifying it directly is unreliable, and pandas may respond with a `SettingWithCopyWarning`.
If we want a copy, we call `.copy()` on the slice and can then modify the result freely.


Select a sub-table with the columns `naslov`, `leto`, and `glasovi`, then add a column with the rounded ratings to it.

In [27]:
podtabela1 = filmi[['naslov', 'leto', 'glasovi']].copy()
podtabela1


Unnamed: 0,naslov,leto,glasovi
0,The Birth of a Nation,1915,20737
1,Intolerance: Love's Struggle Throughout the Ages,1916,13031
2,Broken Blossoms or The Yellow Man and the Girl,1919,8700
3,Das Cabinet des Dr. Caligari,1920,50866
4,The Kid,1921,100210
...,...,...,...
9995,Between Two Ferns: The Movie,2019,7319
9996,Secret Obsession,2019,13308
9997,Black Mirror: Bandersnatch,2018,96998
9998,Falling Inn Love,2019,7389
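
The cell above stops after taking the copy. The remaining step, adding the rounded ratings to the copied sub-table, might look like the following sketch on a made-up frame (`primer` and `pod` are hypothetical names):

```python
import pandas as pd

primer = pd.DataFrame({
    'naslov': ['A', 'B'],
    'leto': [1915, 1916],
    'ocena': [6.4, 7.8],
})

# Take an independent copy of two columns, then add a column to the copy.
pod = primer[['naslov', 'leto']].copy()
pod['zaokrozeno'] = primer['ocena'].round()

print(list(pod.columns))      # ['naslov', 'leto', 'zaokrozeno']
print(list(primer.columns))   # ['naslov', 'leto', 'ocena']
```

Because `pod` is a copy, the new column appears only there and the original frame keeps its columns.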


### Filtering

Create a filter that selects films released before 1930 and a filter for films released after 2017.
Combine them to select the films that were released before 1930 or after 2017.

In [49]:
pred = filmi['leto'] < 1930
po = filmi['leto'] > 2017
# additionally keep only highly rated films (rating above 8.2)
dobri = filmi['ocena'] > 8.2
commed = (pred | po) & dobri

filtrirano = filmi[commed]
filtrirano

Unnamed: 0,id,naslov,dolzina,leto,ocena,metascore,glasovi,zasluzek,oznaka,opis,zaokrozeno
4,12349,The Kid,68,1921,8.3,,100210,5450000.0,,"The Tramp cares for an abandoned child, but ev...",8.0
21,17136,Metropolis,153,1927,8.3,98.0,146925,1236166.0,,In a futuristic city sharply divided between t...,8.0
8543,2395469,Gully Boy,153,2019,8.3,65.0,21033,5566534.0,,A coming-of-age story based on the lives of st...,8.0
9280,4154756,Maščevalci: Brezmejna vojna,149,2018,8.5,68.0,708964,678815482.0,PG-13,The Avengers and their allies must be willing ...,8.0
9281,4154796,Maščevalci: zaključek,181,2019,8.6,78.0,561790,858373000.0,PG-13,After the devastating events of Maš...,9.0
...,...,...,...,...,...,...,...,...,...,...,...
9966,8108198,Andhadhun,139,2018,8.4,,49103,1193046.0,,A series of mysterious events change the life ...,8.0
9976,8267604,Capharnaüm,126,2018,8.4,75.0,27308,1661096.0,R,While serving a five-year sentence for a viole...,8.0
9977,8291224,Uri: The Surgical Strike,138,2019,8.5,,33760,4186168.0,,Indian army special forces execute a covert op...,8.0
9994,9052870,Chhichhore,143,2019,8.6,,6719,898575.0,,Following a group of friends from university a...,9.0
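
Filters are boolean `Series` that can be combined with `|` (or), `&` (and) and `~` (not). A minimal sketch on made-up rows (the frame `primer` and the mask names are assumptions) shows the combinations and why the parentheses matter:

```python
import pandas as pd

primer = pd.DataFrame({
    'leto': [1921, 1950, 2018, 2019],
    'ocena': [8.3, 9.0, 8.5, 6.0],
})

# Each comparison needs its own parentheses, because & and | bind
# more tightly than < and >.
stari_ali_novi = (primer['leto'] < 1930) | (primer['leto'] > 2017)
dobri = primer['ocena'] > 8.2

print(primer[stari_ali_novi & dobri]['leto'].tolist())    # [1921, 2018]
# ~ negates a mask: good films from the years in between.
print(primer[~stari_ali_novi & dobri]['leto'].tolist())   # [1950]
```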


Define a function that checks whether a string contains at most two words. Then use `.apply()` to select all films whose titles have at most two words and whose rating is above 8.

In [52]:
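def naslov(ime, max=2):
    # True if the title consists of at most `max` words
    return len(ime.split()) <= max
kratka_imena = filmi['naslov'].apply(naslov)
# note: this cell uses a stricter rating threshold (8.2) than the 8 named in the task
res_dobri = filmi['ocena'] > 8.2
filmi[kratka_imena & res_dobri]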
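
The same word-count test can also be written without a separate function by using the `.str` accessor. A minimal sketch on three titles taken from the table above (the variable names are assumptions):

```python
import pandas as pd

naslovi = pd.Series(['The Kid', 'Metropolis', 'The Birth of a Nation'])

# .apply() calls the Python function once per title ...
z_apply = naslovi.apply(lambda ime: len(ime.split()) <= 2)
# ... while the .str accessor expresses the same test column-wise.
z_str = naslovi.str.split().str.len() <= 2

print(z_apply.tolist())   # [True, True, False]
print(z_str.tolist())     # [True, True, False]
```

Both masks can be used for filtering in the same way; `.apply()` remains the more general tool when the test does not map onto an existing string method.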

### Histograms

Group the films by their rounded rating and count them.

In [72]:

frekvenca = filmi.groupby('zaokrozeno').size()
frekvenca

zaokrozeno
1.0        4
2.0       42
3.0       52
4.0      230
5.0      853
6.0     3193
7.0     3521
8.0     2034
9.0       70
10.0       1
dtype: int64
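
Counting rows per group can also be done with `value_counts()`. A minimal sketch on made-up values (the frame `primer` is hypothetical) suggests that both approaches give the same counts once the result is sorted by value:

```python
import pandas as pd

primer = pd.DataFrame({'zaokrozeno': [6.0, 8.0, 8.0, 7.0, 8.0]})

# Counting rows per group ...
po_skupinah = primer.groupby('zaokrozeno').size()
# ... matches value_counts() once the result is sorted by value.
po_vrednostih = primer['zaokrozeno'].value_counts().sort_index()

print(po_skupinah.tolist())     # [1, 1, 3]
print(po_vrednostih.tolist())   # [1, 1, 3]
```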

Draw a bar chart of this data.

In [70]:
graf = frekvenca.plot.bar()
graf

<AxesSubplot:xlabel='zaokrozeno'>

Tables have a `.hist()` method that builds histograms of their columns. Use this method to display the simplified data.

In [67]:
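# graf is a matplotlib Axes, so graf.hist() fails with
# "TypeError: hist() missing 1 required positional argument: 'x'";
# call .hist() on the table instead
filmi[['zaokrozeno']].hist()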
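
A histogram splits the range of values into bins and counts how many values fall into each one. The counts can be inspected without drawing anything; the sketch below uses `numpy.histogram` on made-up ratings, with the bin count and range chosen purely for illustration:

```python
import numpy as np
import pandas as pd

ocene_primer = pd.Series([1.4, 6.2, 6.4, 7.2, 7.8, 8.3, 9.5])

# Five equal-width bins between 0 and 10.
stevilo, meje = np.histogram(ocene_primer, bins=5, range=(0, 10))
print(stevilo.tolist())   # [1, 0, 0, 4, 2]
print(meje.tolist())      # [0.0, 2.0, 4.0, 6.0, 8.0, 10.0]
```

Unlike the bar chart of grouped counts, a histogram works directly on the raw values, so it can also be applied to the unrounded `ocena` column.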

### Plotting the average film length by year
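
This section has no cell in the original. One possible approach, sketched on made-up rows with the column names `leto` and `dolzina` from the table, is to group by year and average the length:

```python
import pandas as pd

primer = pd.DataFrame({
    'leto': [1915, 1915, 1916],
    'dolzina': [195, 105, 163],
})

# One value per year: the mean length of the films released that year.
povprecna_dolzina = primer.groupby('leto')['dolzina'].mean()
print(povprecna_dolzina.to_dict())   # {1915: 150.0, 1916: 163.0}
```

The result is a `Series` indexed by year, so calling `.plot()` on it would draw the average length against the year.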

### Plotting the total earnings per year
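
This section has no cell in the original either. A sketch under the same assumptions (made-up rows, column names `leto` and `zasluzek` from the table) sums the earnings per year and shows how missing values affect the totals:

```python
import numpy as np
import pandas as pd

primer = pd.DataFrame({
    'leto': [1915, 1916, 1916, 1919],
    'zasluzek': [10000000.0, 2180000.0, 5450000.0, np.nan],
})

# sum() skips missing values, so a year with no known earnings totals 0.0.
skupni = primer.groupby('leto')['zasluzek'].sum()
print(skupni.to_dict())   # {1915: 10000000.0, 1916: 7630000.0, 1919: 0.0}

# With min_count=1 such a year stays missing instead of becoming 0.0.
skupni_nan = primer.groupby('leto')['zasluzek'].sum(min_count=1)
print(skupni_nan.isna().tolist())   # [False, False, True]
```

Since many films in the table have no `zasluzek` value, the choice between the two variants changes how years without data appear once the totals are drawn with `.plot()` or `.plot.bar()`.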