# 'Tidy data' with Python
---
#### *Sergiy Tkachuk*  
[@TkachukSergiy](https://twitter.com/TkachukSergiy)  
tkachuk.sergiy23@gmail.com

<details><summary>See the presentation materials</summary>
<p>
    
Author: **Hadley Wickham**  
[Wikipedia](https://en.wikipedia.org/wiki/Tidy_data)  
[Link to the article](https://www.jstatsoft.org/index.php/jss/article/view/v059i10/v59i10.pdf)

</p>
</details>

In [None]:
import pandas as pd
import numpy as np

### Exercises - Python 4

#### Exercise 1

Fix `df_e` by taking the maximum value of `baz` for each (`foo`, `bar`) pair (discard the conflicts).

In [None]:
df_e = pd.DataFrame({"foo": ['one', 'one', 'two', 'two'],
                   "bar": ['A', 'A', 'B', 'C'],
                   "baz": [1, 2, 3, 4]})

df_e.groupby(['foo','bar']).aggregate('max').reset_index()
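
The conflict in `df_e` is the pair (`one`, `A`), which occurs twice. A minimal sketch of why that matters: `pivot` raises a `ValueError` on duplicate index/column pairs, so the duplicates are collapsed to their maximum first. The names `pivot_failed`, `fixed`, and `wide` are illustrative and not part of the exercise.

```python
import pandas as pd

df_e = pd.DataFrame({"foo": ['one', 'one', 'two', 'two'],
                     "bar": ['A', 'A', 'B', 'C'],
                     "baz": [1, 2, 3, 4]})

# pivot cannot place two values into the same (foo, bar) cell
try:
    df_e.pivot(index='foo', columns='bar', values='baz')
    pivot_failed = False
except ValueError:
    pivot_failed = True

# Collapse each (foo, bar) pair to its maximum, then pivot safely
fixed = df_e.groupby(['foo', 'bar'], as_index=False)['baz'].max()
wide = fixed.pivot(index='foo', columns='bar', values='baz')
```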

#### Exercise 2

To the DataFrame below:

`df = pd.DataFrame(np.random.randint(1,100, 80).reshape(8, 10))`

add:

- a column with the row sums
- a row with the column sums

In [None]:
df = pd.DataFrame(np.random.randint(1,100, 80).reshape(8, 10))

In [None]:
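def sumy(df):
    # Work on a copy so the caller's DataFrame is not modified
    df = df.copy()
    df['Total'] = df.sum(axis=1)      # column with the row sums
    df.loc['Total'] = df.sum(axis=0)  # row with the column sums (and the grand total)
    return df

sumy(df)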
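
One way to sanity-check the totals is to compare the bottom-right cell with the sum of the raw values. The sketch below assumes a seeded generator (`np.random.default_rng(0)`) purely for reproducibility, and `with_totals` is an illustrative name.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed is an assumption, for reproducibility only
df = pd.DataFrame(rng.integers(1, 100, size=(8, 10)))

# Add the row sums as a column without touching the original frame
with_totals = df.assign(Total=df.sum(axis=1))
# Add the column sums as a row labelled 'Total'
with_totals.loc['Total'] = with_totals.sum(axis=0)

# The corner cell is the grand total of the original values
grand_total_matches = with_totals.loc['Total', 'Total'] == df.to_numpy().sum()
```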

#### Exercise 3

Reshape the DataFrame:

In [None]:
data = {'weekday': ["Monday", "Tuesday", "Wednesday", 
         "Thursday", "Friday", "Saturday", "Sunday"],
        'Person 1': [12, 6, 5, 8, 11, 6, 4],
        'Person 2': [10, 6, 11, 5, 8, 9, 12],
        'Person 3': [8, 5, 7, 3, 7, 11, 15]}
df = pd.DataFrame(data, columns=['weekday',
        'Person 1', 'Person 2', 'Person 3'])

into "tidy" form (one row per score)

In [None]:
melted = pd.melt(df, id_vars=["weekday"], 
                 var_name="Person", value_name="Score")
melted.head()
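
A short sketch of what the tidy layout buys: with one row per score, summaries become plain `groupby` calls. The names `per_person` and `best_day` are illustrative.

```python
import pandas as pd

data = {'weekday': ["Monday", "Tuesday", "Wednesday",
                    "Thursday", "Friday", "Saturday", "Sunday"],
        'Person 1': [12, 6, 5, 8, 11, 6, 4],
        'Person 2': [10, 6, 11, 5, 8, 9, 12],
        'Person 3': [8, 5, 7, 3, 7, 11, 15]}
df = pd.DataFrame(data)
melted = pd.melt(df, id_vars=["weekday"], var_name="Person", value_name="Score")

# Total score per person: one groupby on the tidy frame
per_person = melted.groupby('Person')['Score'].sum()

# Row holding the single highest score
best_day = melted.loc[melted['Score'].idxmax(), ['weekday', 'Person']]
```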

#### Exercise 4

Given the following DataFrame:

In [None]:
df = pd.DataFrame(data = {
    'Day' : ['MON', 'TUE', 'WED', 'THU', 'FRI'], 
    'Google' : [1129,1132,1134,1152,1152], 
    'Apple' : [191,192,190,190,188] 
})
df

In [None]:
reshaped_df = df.melt(id_vars=['Day'], var_name='Company', value_name='Closing Price')
reshaped_df

Using `pivot` on `reshaped_df`, create a DataFrame identical to `df`

In [None]:
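original_df = reshaped_df.pivot(index='Day', columns='Company', values='Closing Price')
# pivot sorts both axes, so restore the order of first appearance before resetting the index
original_df = original_df.loc[reshaped_df['Day'].unique(), reshaped_df['Company'].unique()].reset_index()
original_df.columns.name = None
original_df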

---

## Let's remember the basic principles:  
1. Each variable forms a column.
2. Each observation forms a row.
3. Data in a single column are stored in a single format.

In [None]:
url1 = 'https://s3.amazonaws.com/assets.datacamp.com/production/course_1273/datasets/df1.csv'
url2 = 'https://s3.amazonaws.com/assets.datacamp.com/production/course_1273/datasets/df2.csv'

df1 = pd.read_csv(url1, sep = ',')
df2 = pd.read_csv(url2, sep = ',')

#### Which principles do the loaded tables violate?

<details><summary>Answer</summary>
<p>

df2 --> rule 2  
We observe something at a particular moment, so time values cannot serve as column headers.

</p>
</details>

#### Make the data more 'tidy'

In [None]:
df2_melted = 

print(df2_melted)

<details><summary>Answer</summary>
<p>

<code>pd.melt(df2, id_vars=['Country'])</code>

</p>
</details>
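
`melt` can also name the new columns directly, instead of leaving the defaults `variable` and `value`. A minimal sketch on a hypothetical stand-in for `df2` (the countries, year labels, and numbers below are invented for illustration):

```python
import pandas as pd

# Hypothetical wide table: one row per country, one column per year
wide = pd.DataFrame({'Country': ['Poland', 'Ukraine'],
                     'y1980': [10, 20],
                     'y1981': [11, 21]})

# One row per (Country, Year) observation, with descriptive column names
long = pd.melt(wide, id_vars=['Country'], var_name='Year', value_name='Income')
```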

#### The 'split' method

In [None]:
columns = [c if c != 'owner' else 'name' for c in df1.columns]
columns

In [None]:
surnames = ['Escobar', 'Potter', 'Connor']
# Attach the surnames as a column (assumes df1 has exactly three rows)
df1['surnames'] = surnames

In [None]:
df1['full name'] = df1.owner + ' ' + df1.surnames
df1 = df1.drop(['owner', 'surnames'], axis=1)
df1

In [None]:
# error: a Series has no .split method (use .apply or the .str accessor instead)
df1['name'] = df1['full name'].split(' ')

In [None]:
df1['name'] = df1['full name'].apply(lambda x: x.split(' '))
df1

In [None]:
df1['name'] = df1['full name'].apply(lambda x: x.split(' ')[0])
df1
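
The same split can be done without `apply`, through the vectorized `.str` accessor. A sketch on a hypothetical frame (the first names are invented; only the surnames come from the cells above):

```python
import pandas as pd

# Hypothetical data standing in for df1['full name']
people = pd.DataFrame({'full name': ['Pablo Escobar', 'Harry Potter', 'Sarah Connor']})

# expand=True returns one column per piece; n=1 splits on the first space only
people[['name', 'surname']] = people['full name'].str.split(' ', n=1, expand=True)
```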

In [None]:
df2_tidy = df2_melted.rename(columns = {'variable': 'Year', 'value': 'Income'})
df2_tidy

In [None]:
df2_melted

In [None]:
df2_melted.rename(columns = {'variable': 'Year', 'value': 'Income'}, inplace=True)
df2_melted

#### Missing data

In [None]:
missing_f = pd.DataFrame([[1,1,1,1,2,2,2,2],[1,np.nan,np.nan,np.nan,2,np.nan,np.nan,np.nan]])
missing_b = pd.DataFrame([[1,1,1,1,2,2,2,2],[np.nan,np.nan,np.nan,1,np.nan,np.nan,np.nan,2]])
display(missing_f)

In [None]:
# fillna(method=...) is deprecated in recent pandas; ffill() is the direct equivalent
missing_f.T.ffill()

##### Fill the gaps in missing_b using the 'bfill' method

In [None]:
missing_b.T.bfill()
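
To see the two directions side by side, here is a small sketch on a made-up Series; `limit` caps how many consecutive gaps get filled.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 2.0, np.nan])

forward = s.ffill()        # carry the last seen value forward
backward = s.bfill()       # pull the next seen value backward
capped = s.ffill(limit=1)  # fill at most one consecutive gap
```

Note that `bfill` leaves the trailing gap empty, because there is no later value to pull from.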

#### Formatting

In [None]:
df2_melted.dtypes

In [None]:
df2_melted['Year'] = df2_melted['Year'].apply(lambda x: x[1:5])
df2_melted

In [None]:
pd.to_numeric(df2_melted['Year'])

In [None]:
df2_melted['Year'].astype('int64')
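
The two conversions differ when a value is not a valid number: `astype('int64')` raises, while `pd.to_numeric` with `errors='coerce'` turns it into `NaN`. A sketch on made-up labels (the `'unknown'` entry is an assumption added to trigger the difference):

```python
import pandas as pd

years = pd.Series(['y1980', 'y1981', 'unknown'])  # hypothetical labels

digits = years.str[1:5]                           # vectorized version of x[1:5]
numeric = pd.to_numeric(digits, errors='coerce')  # invalid entries become NaN

try:
    digits.astype('int64')
    astype_failed = False
except ValueError:
    astype_failed = True
```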

### More fun with data

In [None]:
messy = pd.DataFrame({'First' : ['John', 'Jane', 'Mary'], 
                      'Last' : ['Smith', 'Doe', 'Johnson'], 
                      'Treatment A' : [np.nan, 16, 3], 
                      'Treatment B' : [2, 11, 1]})
messy

In [None]:
messy.transpose()

In [None]:
messy.T

In [None]:
tidy = pd.melt(messy, 
               id_vars=['First','Last'], 
               var_name='treatment', 
               value_name='result')
tidy

In [None]:
tidy['Name'] = tidy['First'] + ' ' + tidy['Last']

In [None]:
messy1 = tidy.pivot(index='Name',columns='treatment',values='result')
messy1

In [None]:
messy1.index

In [None]:
messy1.reset_index(inplace=True)
messy1

#### Multiple variables stored in one column

In [None]:
messy_df = pd.read_csv('https://raw.githubusercontent.com/hadley/tidy-data/master/data/tb.csv', sep=',')

display(messy_df.head())

print(messy_df.columns)

print('\nNumber of observations - %d.\nNumber of variables - %d.' % messy_df.shape)

In [None]:
messy_df.columns = messy_df.columns.str.replace('new_sp_','')
messy_df.head()

In [None]:
messy_df.rename(columns = {'iso2' : 'country'}, inplace=True)
messy_df.head()

In [None]:
messy_df = messy_df[messy_df['year'] == 2000]
messy_df.head()

In [None]:
messy_df.drop(['new_sp','m04','m514','f04','f514'], axis=1, inplace=True)
messy_df.head()

In [None]:
messy_df.iloc[:,:11].head(10)
messy_df.head()

In [None]:
molten = pd.melt(messy_df, id_vars=['country', 'year'], value_name='cases')
molten.head()

In [None]:
molten.sort_values(by=['year', 'country'], inplace=True)
molten.head()

In [None]:
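# Drop the unknown-age columns for both sexes ('mu' and 'fu'); parse_age cannot handle them
tidy = molten[~molten['variable'].isin(['mu', 'fu'])].copy()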

def parse_age(s):
    s = s[1:]
    if s == '65':
        return '65+'
    else:
        return s[:-2]+'-'+s[-2:]

tidy['sex'] = tidy['variable'].apply(lambda s: s[:1])
tidy['age'] = tidy['variable'].apply(parse_age)
tidy = tidy[['country', 'year', 'sex', 'age', 'cases']]
tidy.head(10)
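
As an alternative to slicing by position, the sex and age parts can be pulled apart with a regular expression. This is a sketch on a few sample codes, not the full dataset; `format_age` is a hypothetical helper that mirrors the logic of `parse_age`.

```python
import pandas as pd

codes = pd.Series(['m014', 'm1524', 'f5564', 'f65'])

# Named groups become column names: one letter for sex, the digits for age
parts = codes.str.extract(r'^(?P<sex>[mf])(?P<age>\d+)$')

def format_age(digits):
    # '65' is the open-ended top bracket; otherwise the last two digits are the upper bound
    return '65+' if digits == '65' else f'{digits[:-2]}-{digits[-2:]}'

parts['age'] = parts['age'].map(format_age)
```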

In [None]:
tidy.fillna(0)

### More pandas

![title](http://www.slate.fr/sites/default/files/styles/1060x523/public/rtx1tglo.jpg)

In [None]:
import yfinance as yf  # the fix_yahoo_finance package was renamed to yfinance
data = yf.download('AAPL', start='2016-01-01', end='2018-01-01')
data.head()

In [None]:
import matplotlib.pyplot as plt
plt.plot(data.Close)
plt.xticks(rotation=45)
plt.show()

In [None]:
data['Close'].shift().head()

In [None]:
pd.Series(data['Close'] - data['Close'].shift()).head()

... or simply

In [None]:
data['Close'].diff().head()

In [None]:
data.Close.rolling(window=2).mean().head()
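
The three operations are easier to follow on a tiny Series that needs no download. The prices below are made up and are not AAPL data.

```python
import pandas as pd

close = pd.Series([10.0, 12.0, 9.0, 12.0])  # made-up prices

previous = close.shift()                 # yesterday's value on today's row
change = close.diff()                    # same as close - close.shift()
smooth = close.rolling(window=2).mean()  # mean of today and yesterday
```

In all three results the first row is `NaN`, since there is no earlier value to compare with.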

#### Find the daily percentage change of the stock price

<details><summary>Answer</summary>
<p>

<code>data.Close.diff() / data.Close.shift()</code>

</p>
</details>
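
pandas also ships `pct_change`, which computes the same ratio. A sketch on made-up prices showing that the two agree; multiply by 100 to express the result in percent.

```python
import numpy as np
import pandas as pd

close = pd.Series([10.0, 12.0, 9.0, 12.0])  # made-up prices, not AAPL data

manual = close.diff() / close.shift()  # (today - yesterday) / yesterday
builtin = close.pct_change()           # built-in equivalent

same = np.allclose(manual, builtin, equal_nan=True)
```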

### Homework

[About the data](https://data.cityofnewyork.us/Housing-Development/DOB-Job-Application-Filings/ic3t-wcy2)

In [None]:
data = pd.read_csv('https://assets.datacamp.com/production/course_2023/datasets/dob_job_application_filings_subset.csv', sep=',')
data.head()

#### Select the following columns (Job Type, Job Status, City, Doc #, Initial Cost, Total Est. Fee) for the state 'NY'. Save the result to df.

#### Convert the columns 'Initial Cost' and 'Total Est. Fee' to a numeric format

#### Grouping by City and Job Type, find the sums of Initial Cost and the means of Total Est. Fee
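
As a hint for the last two steps, here is a sketch on hypothetical toy rows, not the real filings data. It assumes the cost columns arrive as strings with a leading `$`; check the actual values with `data.dtypes` and `data.head()` before relying on that.

```python
import pandas as pd

# Hypothetical toy rows (invented values)
jobs = pd.DataFrame({
    'City': ['A', 'A', 'B'],
    'Job Type': ['A1', 'A1', 'NB'],
    'Initial Cost': ['$100.00', '$50.00', '$10.00'],
    'Total Est. Fee': ['$10.00', '$20.00', '$5.00'],
})

# Strip the currency sign (assumed format) and convert to numbers
for col in ['Initial Cost', 'Total Est. Fee']:
    jobs[col] = pd.to_numeric(jobs[col].str.replace('$', '', regex=False))

# A different aggregation per column, passed as a dict
summary = jobs.groupby(['City', 'Job Type']).agg(
    {'Initial Cost': 'sum', 'Total Est. Fee': 'mean'})
```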