# Pandas 2
---
- ### Grouping
- ### Binning
  - #### Equal-width bins
  - #### Quantile-based bins
- ### Tidy Data
- ### Reshaping DataFrames
  - #### Wide to long
    - ##### `melt`
  - #### Long to wide
    - ##### `pivot`
    - ##### `transpose`
    

In [None]:
import numpy as np
import pandas as pd
df = pd.DataFrame({'value': np.random.randint(0, 100, 20)})
df

In [None]:
list(range(0, 105, 10))

In [None]:
pd.cut(df.value, range(0, 105, 10), right=False)

In [None]:
pd.cut(df.value, range(0, 105, 10), right=True)
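
Because `df` holds random values, the effect of `right` is easy to miss. The sketch below uses a small hand-picked series (hypothetical values that sit exactly on bin edges) to show which side of each interval is closed. The names `edge_values`, `left_closed` and `right_closed` are illustrative only.

```python
import pandas as pd

# Hypothetical values chosen to fall exactly on bin edges.
edge_values = pd.Series([0, 10, 99, 100])
bin_edges = list(range(0, 105, 10))  # 0, 10, ..., 100

# right=False -> intervals [a, b): the left edge is included.
left_closed = pd.cut(edge_values, bin_edges, right=False)
# right=True -> intervals (a, b]: the right edge is included.
right_closed = pd.cut(edge_values, bin_edges, right=True)

print(pd.DataFrame({
    "value": edge_values,
    "right=False": left_closed.astype(str),
    "right=True": right_closed.astype(str),
}))
```

With `right=False` the value 100 falls outside the last bin and becomes `NaN`. With `right=True` the value 0 is the one left out, unless `include_lowest=True` is passed.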

In [None]:
labels = ["{0} - {1}".format(i, i + 9) for i in range(0, 100, 10)]
labels

In [None]:
pd.cut(df.value, range(0, 105, 10), right=False, labels=labels)

In [None]:
df['Group'] = pd.cut(df.value, range(0, 105, 10), right=False, labels=labels)
df

In [None]:
df.groupby('Group').count()

In [None]:
df.groupby('Group').sum()

In [None]:
# String names select pandas' built-in aggregations
df.groupby('Group').agg({'value': ['count', 'sum']})

In [None]:
def line(series):
    return '#'*len(series)

In [None]:
df.groupby('Group').agg({'value': ['count', line]})

In [None]:
# Empty buckets produce missing values; show them as blank strings
grouped = df.groupby('Group').agg({'value': ['count', line]}).fillna('')
grouped
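
The same text-histogram idea can be reproduced on a fixed sample, which makes the result repeatable. This is a minimal sketch: the data, the names `demo`, `hash_bar` and `histogram`, and the choice of `observed=True` (which drops empty buckets rather than blanking them) are assumptions made for illustration. It uses named aggregation to get flat column names.

```python
import pandas as pd

# Hypothetical fixed sample instead of random values.
demo = pd.DataFrame({"value": [3, 7, 12, 15, 18, 41]})
demo["bucket"] = pd.cut(demo["value"], [0, 10, 20, 30, 40, 50], right=False)

def hash_bar(series):
    """Return one '#' per element in the group."""
    return "#" * len(series)

# observed=True keeps only buckets that actually contain values.
histogram = demo.groupby("bucket", observed=True).agg(
    count=("value", "count"),
    bar=("value", hash_bar),
)
print(histogram)
```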

In [None]:
# Quartiles of the numeric column only; df also holds the categorical 'Group' column
df['value'].quantile([.25, .5, .75])

In [None]:
pd.qcut(df['value'], q=4)

In [None]:
pd.qcut(df['value'], q=4, labels=range(1,5))

In [None]:
labels = ['1st Quartile', '2nd Quartile', '3rd Quartile', '4th Quartile']

In [None]:
df['quartile']=pd.qcut(df['value'], q=4, labels=labels)
df

In [None]:
df.groupby('quartile').agg({"value": ["count", "mean", "sum"]})
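
To make the contrast between the two binning strategies concrete, the sketch below applies `pd.cut` and `pd.qcut` to the same skewed sample. The data and variable names are hypothetical and not part of the original notebook.

```python
import pandas as pd

# Hypothetical skewed sample: many small values, a few large ones.
skewed = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 50, 100, 200, 400])

equal_width = pd.cut(skewed, bins=4)  # four bins of equal width
equal_count = pd.qcut(skewed, q=4)    # four bins with roughly equal counts

print(equal_width.value_counts(sort=False))
print(equal_count.value_counts(sort=False))
```

Equal-width bins put most of a skewed sample into a single bucket, while quantile-based bins adapt their edges so each bucket holds a similar number of rows.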

---

## Tidy Data

Wickham, Hadley - _"Tidy Data"_
https://www.jstatsoft.org/index.php/jss/article/view/v059i10/v59i10.pdf

- __Each variable you measure should be in one column.__
- __Each different observation of that variable should be in a different row.__
- There should be one table for each "kind" of variable.
- If you have multiple tables, they should include a column in the table that allows them to be linked.
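
The last two rules can be illustrated with a small sketch: one table per kind of thing, and a shared key column that links them. The tables, the `student_id` key and all variable names below are hypothetical examples, not data from this notebook.

```python
import pandas as pd

# One table per "kind" of thing: students and their grades.
students_tbl = pd.DataFrame({
    "student_id": [1, 2],
    "name": ["Nowak A.", "Kowalski J."],
})
grades_tbl = pd.DataFrame({
    "student_id": [1, 1, 2],
    "subject": ["WF", "Matematyka", "WF"],
    "grade": [5, 5, 4],
})

# The shared key column lets the two tables be joined when needed.
linked = grades_tbl.merge(students_tbl, on="student_id", how="left")
print(linked)
```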

---
## Reshaping DataFrames
###  <span style="color: cyan">Wide to long</span>
---
- __Each variable you measure should be in one column.__
- __Each different observation of that variable should be in a different row.__

In [None]:
df = pd.DataFrame({'Student': {0: 'Nowak A.', 1: 'Kowalski J.', 2: 'Korzycki M.'},
                   'WF': {0: 5, 1: 4, 2: 2},
                   'J.Polski': {0: 4, 1: 4, 2: 2},
                   'Matematyka': {0: 5, 1: 3, 2: 2}})
df

In [None]:
df.melt(id_vars=['Student'], value_vars=['WF', 'Matematyka', 'J.Polski'])

In [None]:
grades = pd.melt(df, id_vars=['Student'], value_vars=['WF', 'Matematyka', 'J.Polski'],
       var_name='Przedmiot', value_name='Ocena')
grades

In [None]:
grades.sort_values('Student')

In [None]:
df_sorted = grades.sort_values(['Student', 'Przedmiot'])
df_sorted

In [None]:
df_sorted.sort_index()

In [None]:
grades.sort_values(['Student', 'Przedmiot']).reset_index()

---
### A larger example

In [None]:
url = 'data/Indicator_BMImale.csv'

df_demographics = pd.read_csv(url, sep = ',')

#### Which tidy-data rules does the loaded table violate?

In [None]:
df_demographics

#### *Melting* makes this data *tidy*

In [None]:
melted_demographics = pd.melt(df_demographics, id_vars=['Country'])
melted_demographics

In [None]:
melted_demographics.rename(columns = {'variable': 'Year', 'value': 'BMI'}, inplace=True)
melted_demographics

#### Formatting

In [None]:
melted_demographics.dtypes

In [None]:
pd.to_numeric(melted_demographics['Year'])

In [None]:
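# astype returns a new Series; assign it back so the DataFrame actually changes
melted_demographics['Year'] = melted_demographics['Year'].astype('int64')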
melted_demographics
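
After melting, the former column headers are stored as strings, so the year column usually needs an explicit conversion. The sketch below uses a small hypothetical frame (`years_demo`, including an invented `'n/a'` entry) because the contents of the CSV are not shown here. It relies on `pd.to_numeric` with `errors='coerce'`, which turns unparseable entries into `NaN` instead of raising.

```python
import pandas as pd

# Hypothetical melted frame: year labels arrive as strings.
years_demo = pd.DataFrame({
    "Year": ["1980", "1981", "n/a"],
    "BMI": [21.5, 21.6, 21.7],
})

# Conversions return a new object, so assign the result back.
years_demo["Year"] = pd.to_numeric(years_demo["Year"], errors="coerce")
print(years_demo.dtypes)
```

If every entry is a clean integer string, `astype('int64')` works as well. With missing values present the column becomes `float64`.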

---
# Long to Wide

### Pivot and Transpose

In [None]:
import numpy as np

messy = pd.DataFrame({'First' : ['John', 'Jane', 'Mary'], 
                      'Last' : ['Smith', 'Doe', 'Johnson'], 
                      'Treatment A' : [np.nan, 16, 3], 
                      'Treatment B' : [2, 11, 1]})
messy

In [None]:
messy.transpose()

In [None]:
messy.T

In [None]:
tidy = pd.melt(messy, 
               id_vars=['First','Last'], 
               var_name='treatment', 
               value_name='result')
tidy

In [None]:
tidy['Name'] = tidy['First'] + ' ' + tidy['Last']

In [None]:
messy1 = tidy.pivot(index='Name',columns='treatment',values='result')
messy1

In [None]:
messy1.index

In [None]:
messy1.reset_index(inplace=True)
messy1

In [None]:
melted_demographics

In [None]:
melted_demographics.pivot(index='Country',columns='Year',values='BMI')

In [None]:
grades.pivot(index='Student',columns='Przedmiot',values='Ocena')
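
`pivot` requires each index/column pair to occur only once. The sketch below uses a hypothetical long table (`long_grades`) with a repeated pair to show the failure, and `pivot_table` as one possible way around it. Aggregating with the mean is an assumption made for illustration; the right aggregation depends on the data.

```python
import pandas as pd

# Hypothetical long table: one student has two grades for the same subject.
long_grades = pd.DataFrame({
    "Student": ["Nowak A.", "Nowak A.", "Nowak A.", "Kowalski J."],
    "Przedmiot": ["WF", "WF", "Matematyka", "WF"],
    "Ocena": [5, 3, 5, 4],
})

try:
    long_grades.pivot(index="Student", columns="Przedmiot", values="Ocena")
    pivot_failed = False
except ValueError:
    # pivot cannot decide which of the duplicate entries to keep.
    pivot_failed = True

# pivot_table resolves duplicates by aggregating them (mean here).
wide_grades = long_grades.pivot_table(
    index="Student", columns="Przedmiot", values="Ocena", aggfunc="mean"
)
print(wide_grades)
```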