# Pandas 1

***

- ### Data wrangling, munging, tidying
- ### Pandas - introduction
- ### Row selection
- ### Iteration
- ### Deduplication
- ### Data wrangling example

---

## Data Wrangling, Munging, Tidying 


![](img/road.jpg)

### Data Wrangling
- __Discovering__ - exploring the data
- __Structuring__ - preparing the data
- __Cleaning__ - cleaning the data: unifying formats, deduplication, etc.
- Enriching - enriching the data by joining datasets
- Validating - checking correctness in the context of the given domain
- Publishing - making the data available

source: https://www.onlinewhitepapers.com/information-technology/six-core-data-wrangling-activities/

## Tidy Data

Wickham, Hadley - _"Tidy Data"_
https://www.jstatsoft.org/index.php/jss/article/view/v059i10/v59i10.pdf

- __Each variable you measure should be in one column.__
- __Each different observation of that variable should be in a different row.__
- There should be one table for each "kind" of variable.
- If you have multiple tables, they should include a column in the table that allows them to be linked.
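
As a minimal sketch of the first two rules, the hypothetical wide table below (invented for illustration, not taken from the paper) spreads a single variable, `wins`, over one column per year. `DataFrame.melt` reshapes it so that each variable gets its own column and each observation its own row.

```python
import pandas as pd

# Hypothetical "wide" table: the variable `wins` is spread across year columns.
wide = pd.DataFrame({
    'team': ['Bears', 'Lions'],
    '2010': [11, 6],
    '2011': [8, 10],
})

# Reshape to tidy form: one column per variable, one row per observation.
tidy = wide.melt(id_vars='team', var_name='year', value_name='wins')
tidy['year'] = tidy['year'].astype(int)  # year labels were column names (strings)
print(tidy)
```

The tidy result has four rows (two teams times two years) and three columns: `team`, `year`, `wins`.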


---
# Pandas

https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf

In [1]:
data = {'year': [2010, 2011, 2012, 2011, 2012, 2010, 2011, 2012],
        'team': ['Bears', 'Bears', 'Bears', 'Packers', 'Packers', 'Lions',
                 'Lions', 'Lions'],
        'wins': [11, 8, 10, 15, 11, 6, 10, 4],
        'losses': [5, 8, 6, 1, 5, 10, 6, 12]}

In [2]:
import pandas as pd

football = pd.DataFrame(data)
print(football)

   year     team  wins  losses
0  2010    Bears    11       5
1  2011    Bears     8       8
2  2012    Bears    10       6
3  2011  Packers    15       1
4  2012  Packers    11       5
5  2010    Lions     6      10
6  2011    Lions    10       6
7  2012    Lions     4      12


In [3]:
football

   year     team  wins  losses
0  2010    Bears    11       5
1  2011    Bears     8       8
2  2012    Bears    10       6
3  2011  Packers    15       1
4  2012  Packers    11       5
5  2010    Lions     6      10
6  2011    Lions    10       6
7  2012    Lions     4      12


In [None]:
football.describe()

In [None]:
football.dtypes

In [None]:
football.head(2)

In [None]:
football.tail()

In [None]:
football.sample(5)

In [None]:
football['year']

In [None]:
football.year

In [None]:
football[['year', 'team', 'wins']]

---

# Row selection

1. Slicing
2. Individual index (iloc / loc)
3. Boolean indexing
4. A combination of the above

### Slicing

In [None]:
football

In [None]:
football[3:5]

### Individual index

### Iloc
- An integer, e.g. `5`.
- A list or array of integers, e.g. `[4, 3, 0]`.
- A slice object with ints, e.g. `1:7`.
- A boolean array.
- A callable function.

In [None]:
football.iloc[[0,3]]
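
The other selector types listed for `iloc` can be sketched the same way. The small frame `demo` below is a hypothetical stand-in used only for this illustration.

```python
import pandas as pd

# Small hypothetical frame used only for this illustration.
demo = pd.DataFrame({'wins': [11, 8, 10, 15], 'losses': [5, 8, 6, 1]})

single = demo.iloc[2]                            # an integer: one row, returned as a Series
sliced = demo.iloc[1:3]                          # a slice: positions 1 and 2 (end excluded)
masked = demo.iloc[[True, False, False, True]]   # a boolean list: one flag per row
called = demo.iloc[lambda d: [0, len(d) - 1]]    # a callable returning positions

print(sliced)
```

Note that a positional slice excludes its end position, as with ordinary Python lists.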

### Loc
- A single label
- A list or array of labels, e.g. `['a', 'b', 'c']`.
- A slice object with labels, e.g. `'a':'f'` <span style="color: cyan">__(WARNING - BOTH the start and the end ARE included)__</span>
- A boolean array
- A callable function 

In [None]:
import numpy as np
import pandas as pd

index = pd.date_range('1/1/2000', periods=8)
df = pd.DataFrame(np.random.randn(8, 3), index=index, columns=['A', 'B', 'C'])
df

In [None]:
df.loc['2000-01-03']

In [None]:
df.loc[['2000-01-03', '2000-01-06', '2000-01-07' ]]

In [None]:
df.loc['2000-01-03' : '2000-01-04'] 
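
The warning about inclusive label slices is easiest to see next to the equivalent positional slice. The sketch below uses a hypothetical frame `letters` with string labels, so that labels and positions cannot be confused.

```python
import pandas as pd

# Hypothetical frame with string labels, used only for this comparison.
letters = pd.DataFrame({'value': [10, 20, 30, 40]}, index=['a', 'b', 'c', 'd'])

by_label = letters.loc['b':'c']      # label slice: both 'b' and 'c' are returned
by_position = letters.iloc[1:2]      # positional slice: end position 2 is excluded
by_callable = letters.loc[lambda d: d['value'] > 25]   # callable returning a boolean mask

print(by_label)
print(by_position)
```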

### Boolean indexing

In [None]:
football[football.wins > 10]

### Combination

In [None]:
football

In [None]:
football[(football.wins > 10) & (football.team == "Packers")]

In [None]:
football[(football.wins > 10) | (football.team == "Packers")]
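
A boolean mask can also be combined with a column selection in a single `loc` call. The sketch below rebuilds the `football` frame so it runs on its own; the threshold values are chosen only for illustration. The parentheses around each comparison are required because `&`, `|` and `~` bind more tightly than `>` or `==` in Python.

```python
import pandas as pd

football = pd.DataFrame({
    'year': [2010, 2011, 2012, 2011, 2012, 2010, 2011, 2012],
    'team': ['Bears', 'Bears', 'Bears', 'Packers', 'Packers', 'Lions', 'Lions', 'Lions'],
    'wins': [11, 8, 10, 15, 11, 6, 10, 4],
    'losses': [5, 8, 6, 1, 5, 10, 6, 12],
})

# Rows chosen by a boolean mask, columns chosen by a list of labels.
mask = (football.wins >= 10) & (football.year > 2010)
selected = football.loc[mask, ['team', 'wins']]

# ~ negates a mask: every row that is NOT a Packers row.
not_packers = football[~(football.team == 'Packers')]

print(selected)
```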

---
# Iteration

In [None]:
import pandas as pd
import numpy as np

df = pd.DataFrame(pd.Series(range(10)).values.reshape(5, 2), columns = ['x', 'y'])
df

In [None]:
# items() yields (column name, Series) pairs; iteritems() was removed in pandas 2.0
for col in df.items():
    print(col)

In [None]:
for row in df.iterrows():
    print(row)

In [None]:
# iterrows() yields (index, row) pairs; access the values by column name
for _, row in df.iterrows():
    print(row['x'], row['y'])

In [None]:
data = []

# iterrows() yields (index, row) pairs; each row is a Series keyed by column name
for _, row in df.iterrows():
    x = row['x']
    y = row['y']
    data.append([x, y, x / y])
df2 = pd.DataFrame(data, columns=['x', 'y', 'ratio'])
df2

> ## <span style="color: red">Warning</span>
> Iterating through pandas objects is generally <span style="color: cyan">__slow__</span>. In many cases, iterating manually over the rows is not needed and can be avoided with one of the following approaches:
> - Look for a vectorized solution: many operations can be performed using built-in methods or NumPy functions, (boolean) indexing, …
> - When you have a function that cannot work on the full DataFrame/Series at once, it is better to use apply() instead of iterating over the values. See the docs on function application.
> - If you need to do iterative manipulations on the values but performance is important, consider writing the inner loop with cython or numba. See the enhancing performance section for some examples of this approach.
> 
> https://pandas.pydata.org/pandas-docs/stable/user_guide/basics.html#iteration

### Not bad - <span style="color: cyan">LIST COMPREHENSION</span>

In [None]:
df2 = pd.DataFrame([[row['x'], row['y'], row['x'] / row['y']] for _, row in df.iterrows()], columns=['x', 'y', 'ratio'])
df2

### Better - <span style="color: cyan">APPLY</span>

In [None]:
df2 = df.copy()

In [None]:
def ratio(row):
    print(row.iloc[0])  # debug output: the first value (x) of the current row
    return row.x/row.y

df2['ratio'] = df2.apply(ratio, axis=1)
df2

### Best - <span style="color: cyan">Vectorization</span>

In [None]:
df2 = df.copy()

In [None]:
df2['ratio'] = df2.x / df2.y
df2

In [None]:
import pandas as pd
import numpy as np

df = pd.DataFrame(pd.Series(range(20_000)).values.reshape(10_000, 2), columns = ['x', 'y'])
df

In [None]:
%%time
data = []

for _, row in df.iterrows():
    x = row['x']
    y = row['y']
    data.append([x, y, x / y])
df2 = pd.DataFrame(data, columns=['x', 'y', 'ratio'])

In [None]:
%%time
data = []

for _, row in df.iterrows():
    data.append([row['x'], row['y'], row['x'] / row['y']])
df2 = pd.DataFrame(data, columns=['x', 'y', 'ratio'])

In [None]:
%%time
df2 = pd.DataFrame([[row['x'], row['y'], row['x'] / row['y']] for _, row in df.iterrows()], columns=['x', 'y', 'ratio'])

In [None]:
df2 = df.copy()

In [None]:
%%time
def ratio(row):
    return row.x/row.y

df2['ratio'] = df2.apply(ratio, axis=1)

In [None]:
df2 = df.copy()

In [None]:
%%time
df2['ratio'] = df2.x / df2.y
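
`%%time` is an IPython cell magic and is unavailable in a plain Python script. As a rough sketch of the same comparison outside a notebook, the hypothetical helper `timed` below wraps `time.perf_counter`. Absolute timings depend on the machine and the pandas version, so only the relative order is meaningful.

```python
import time

import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(20_000).reshape(10_000, 2), columns=['x', 'y'])


def timed(func):
    """Run func once and return (result, elapsed seconds). Hypothetical helper."""
    start = time.perf_counter()
    result = func()
    return result, time.perf_counter() - start


# Same computation three ways: explicit row loop, apply, vectorized division.
loop_ratio, loop_time = timed(lambda: [row['x'] / row['y'] for _, row in df.iterrows()])
apply_ratio, apply_time = timed(lambda: df.apply(lambda row: row.x / row.y, axis=1))
vector_ratio, vector_time = timed(lambda: df.x / df.y)

print(f"iterrows: {loop_time:.4f}s  apply: {apply_time:.4f}s  vectorized: {vector_time:.4f}s")
```

All three variants produce the same values; only the time needed to compute them differs.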

---
# Deduplication

In [None]:
import pandas as pd
df = pd.DataFrame({"A":["foo", "foo", "foo", "bar"], "B":[0,1,1,1], "C":["A","A","B","A"]})
df

In [None]:
df.drop_duplicates(subset=['A', 'C'], keep="first")

In [None]:
df.drop_duplicates(subset=['A', 'C'], keep="last")

In [None]:
df.drop_duplicates(subset=['A', 'C'], keep=False)

In [None]:
df.drop_duplicates(subset=['A', 'C'], keep=False).reset_index(drop=True)
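
Before dropping anything, it can help to see which rows would be treated as duplicates. A minimal sketch, assuming the same example frame: `DataFrame.duplicated` accepts the same `subset` and `keep` arguments and returns a boolean mask instead of a filtered frame.

```python
import pandas as pd

df = pd.DataFrame({"A": ["foo", "foo", "foo", "bar"], "B": [0, 1, 1, 1], "C": ["A", "A", "B", "A"]})

# keep='first': the first occurrence is kept, later repeats are flagged.
flag_first = df.duplicated(subset=['A', 'C'], keep='first')

# keep=False: every member of a duplicated group is flagged.
flag_all = df.duplicated(subset=['A', 'C'], keep=False)

print(pd.DataFrame({'keep_first': flag_first, 'keep_False': flag_all}))

# Filtering with the negated mask gives the same rows as drop_duplicates(keep=False).
unique_only = df[~flag_all]
```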

---
# Data wrangling example

In [None]:
import pandas as pd

url = 'data/blood.csv'

df_blood = pd.read_csv(url, sep = ',')
df_blood

In [None]:
columns = list(df_blood.columns)
columns

In [None]:
columns[0] = 'Date'
df_blood.columns = columns

df_blood

In [None]:
df_blood.dtypes

### Datetime columns

In [None]:
pd.to_datetime(df_blood.Date, format='%Y-%m-%d')

---
### How to `format`

https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior
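
As a small sketch of how the format codes work, the hypothetical strings below use a day.month.year layout (not the layout of `blood.csv`). The same codes are used for parsing with `pd.to_datetime` and for formatting with `.dt.strftime`.

```python
import pandas as pd

# Hypothetical day.month.year strings, used only to illustrate the format codes.
raw = pd.Series(['03.01.2021', '04.01.2021'])

# %d = day of month, %m = month number, %Y = four-digit year
parsed = pd.to_datetime(raw, format='%d.%m.%Y')

# The same codes turn datetimes back into strings, here in ISO order.
iso = parsed.dt.strftime('%Y-%m-%d')
print(iso.tolist())
```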

---

In [None]:
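# With no explicit format, pandas infers one from the data
# (the infer_datetime_format argument is deprecated since pandas 2.0)
pd.to_datetime(df_blood.Date)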

In [None]:
df_blood.Date = pd.to_datetime(df_blood.Date, format='%Y-%m-%d')

df_blood

In [None]:
df_blood.dtypes

In [None]:
df_blood['Morning Sys'].astype('Int64')

In [None]:
df_blood.Date.dt.day

In [None]:
df_blood.Date.dt.weekday

### Categorical variables

In [None]:
df_blood.Date.dt.weekday.astype('category')  

In [None]:
df_blood['Day'] = df_blood.Date.dt.weekday.astype('category')

df_blood

In [None]:
days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']

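# Map weekday numbers (0 = Monday) to names. Assigning to .cat.categories directly
# was removed in pandas 2.0; a dict also works when some weekdays are absent from the data.
df_blood['Day'] = df_blood.Day.cat.rename_categories(dict(enumerate(days)))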

In [None]:
df_blood

In [None]:
df = df_blood.set_index('Date', drop=True)
df

In [None]:
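# Forward fill: replace each missing value with the last valid one above it
# (fillna(method='ffill') is deprecated in recent pandas versions)
df = df.ffill()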

df
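
A minimal sketch of what forward filling does, on a hypothetical series of readings with gaps. The `limit` argument caps how many consecutive missing values are filled.

```python
import numpy as np
import pandas as pd

# Hypothetical readings with gaps, used only for this illustration.
readings = pd.Series([120.0, np.nan, np.nan, 118.0, np.nan])

filled = readings.ffill()           # every gap takes the last valid value before it
limited = readings.ffill(limit=1)   # fill at most one consecutive gap

print(pd.DataFrame({'raw': readings, 'ffill': filled, 'ffill_limit_1': limited}))
```

Forward filling assumes that the previous measurement is a reasonable stand-in for the missing one, which is a modelling choice rather than a neutral cleaning step.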

In [None]:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from pandas.plotting import register_matplotlib_converters
import warnings

register_matplotlib_converters()
warnings.filterwarnings('ignore')


# Set the style before creating the figure so that it applies to the figure as well
plt.style.use("dark_background")
plt.figure(figsize=(24,12))

chart = sns.lineplot(x='Date',
                     y='Morning Dia',
                     color='orange', 
                     data=df
                    )
chart = sns.lineplot(x='Date',
                     y='Morning Sys',
                     color='green', 
                     data=df
                    )

## Moving averages

In [None]:
df["rolling Morning Dia"] = df["Morning Dia"].rolling(5).mean()
df["rolling Morning Sys"] = df["Morning Sys"].rolling(5).mean()
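
A rolling window of size 5 only yields a value once five observations are available, so the first four rows of each new column are `NaN`. The sketch below shows the same effect on a short hypothetical series with a window of 3, and how `min_periods` relaxes that requirement.

```python
import pandas as pd

# Short hypothetical series, used only to make the window arithmetic visible.
values = pd.Series([1, 2, 3, 4, 5, 6])

strict = values.rolling(3).mean()                   # NaN until 3 values are available
relaxed = values.rolling(3, min_periods=1).mean()   # averages whatever is available

print(pd.DataFrame({'value': values, 'rolling_3': strict, 'min_periods_1': relaxed}))
```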

In [None]:
df

In [None]:
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from pandas.plotting import register_matplotlib_converters

register_matplotlib_converters()

# Set the style before creating the figure so that it applies to the figure as well
plt.style.use("dark_background")
plt.figure(figsize=(24,12))
chart = sns.lineplot(x='Date',
                     y='Morning Sys', 
                     color='orange', 
                     linestyle='--',
                     data=df
                    )
chart = sns.lineplot(x='Date',
                     y='Morning Dia',
                     color='green', 
                     linestyle='--',
                     data=df,
                    )
chart = sns.lineplot(x='Date',
                     y='rolling Morning Sys',
                     color='yellow', 
                     data=df
                    )
chart = sns.lineplot(x='Date',
                     y='rolling Morning Dia',
                     color='cyan', 
                     data=df
                    )