# Python in Data Science

## Data Wrangling with Pandas 
### Part 1 of 3 - introduction

***

## Data Wrangling 1
- Data wrangling, munging, tidying
- Pandas - introduction
- Row selection
- Formatting

---

## Exercise 1 - easy

Sort `long_list` by absolute value (i.e. ignore the `-` sign).


In [None]:
long_list = [ str(i) for i in range(21)] + [ str(i) for i in range(-1, -21, -1)]
str(long_list)

In [None]:
abs(-1)

In [None]:
abs('-1')  # raises TypeError: abs() does not accept a string

In [None]:
def absolute(x):
    # Convert the string to a number first, then drop the sign.
    x = float(x)
    return x if x > 0 else -x

In [None]:
absolute('-1')

In [None]:
long_list.sort(key=absolute)
str(long_list)

In [None]:
long_list =  [ str(i) for i in range(-1, -21, -1)] + [ str(i) for i in range(21)]
str(long_list)

In [None]:
long_list.sort(key=absolute)
str(long_list)
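
The two runs above order ties such as `'1'` and `'-1'` differently because Python's sort is stable: elements with equal keys keep their original relative order. A minimal sketch of the same idea on a shorter list, assuming every element is an integer string; `abs_key` and `values` are names introduced only for this illustration.

```python
def abs_key(s):
    # Convert the string to an int, then take the built-in abs().
    return abs(int(s))

values = ["3", "-1", "1", "-2", "2", "0"]

# sorted() returns a new list; ties keep their input order (stable sort).
print(sorted(values, key=abs_key))            # ['0', '-1', '1', '-2', '2', '3']
print(sorted(reversed(values), key=abs_key))  # ['0', '1', '-1', '2', '-2', '3']
```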

## Exercise 2 - medium

Generate numbers in the range `1-1000` that are NOT divisible by any number between `2` and `9`.


In [None]:
[ (num,[div for div in range(2,10) if num%div == 0])  for num in range(1,21)]

``` python
[num for num in range(1,1001) if not [div for div in range(2,10) if num%div == 0]]
```

In [None]:
str([num for num in range(1,1001) if not [div for div in range(2,10) if num%div == 0]][:21])

In [None]:
str([num for num in range(1,1001) if not [div for div in range(2,10) if num%div == 0]][-20:])
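
The nested comprehension above builds the full list of divisors only to check whether it is empty. One possible alternative, sketched here with the built-in `all`, stops at the first divisor it finds; the name `not_divisible` is introduced only for this illustration.

```python
# Keep num only if none of the divisors 2..9 divides it evenly.
not_divisible = [
    num for num in range(1, 1001)
    if all(num % div != 0 for div in range(2, 10))
]

print(not_divisible[:5])  # [1, 11, 13, 17, 19]
```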

## Exercise 3 - hard

Create a function that 'flattens' lists

`l = [4, 1, [ 2, [7] ], 3]`

`flatten(l)`

`[4, 1, 2, 7, 3]`


In [None]:
def flatten(input_list):
    ret = []
    for x in input_list:
        if isinstance(x, list):
            for y in flatten(x):
                ret.append(y)
        else:
            ret.append(x)
    return ret

In [None]:
l = [4, 1, [ 2, [7] ], 3]
l = flatten(l)
l

In [None]:
def flatten(input_list):
    # In-place variant: empty input_list, then refill it with the flat items.
    copy = list(input_list)
    input_list.clear()
    for x in copy:
        if isinstance(x, list):
            input_list.extend(flatten(x))
        else:
            input_list.append(x)
    return input_list

In [None]:
l = [4, 1, [ 2, [7] ], 3]
flatten(l)
l
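
Both versions above build the whole result in memory, and the second one also modifies its argument. A generator is a third option that yields items one at a time and leaves the input untouched. This is only a sketch; `iter_flatten` and `nested` are names chosen for this illustration.

```python
def iter_flatten(items):
    """Yield the non-list elements of a nested list, depth first."""
    for item in items:
        if isinstance(item, list):
            # Delegate to the recursive call for nested lists.
            yield from iter_flatten(item)
        else:
            yield item

nested = [4, 1, [2, [7]], 3]
print(list(iter_flatten(nested)))  # [4, 1, 2, 7, 3]
print(nested)                      # [4, 1, [2, [7]], 3]
```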

***

![](img/road.jpg)

## Data Wrangling, Munging, Tidying 

### Data Wrangling
- __Discovering__ (data exploration)
- __Structuring__ (data preparation)
- __Cleaning__ (standardization, getting rid of superfluous data)
- Enriching (adding new datasets)
- Validating (verifying the business validity)
- Publishing

source: https://www.onlinewhitepapers.com/information-technology/six-core-data-wrangling-activities/

## Tidy Data

Wickham, Hadley - _"Tidy Data"_
https://www.jstatsoft.org/index.php/jss/article/view/v059i10/v59i10.pdf

- __Each variable you measure should be in one column.__
- __Each different observation of that variable should be in a different row.__
- There should be one table for each "kind" of variable.
- If you have multiple tables, they should include a column in the table that allows them to be linked.
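
As a rough illustration of the first two rules, the sketch below reshapes a small "wide" table, where the variable `year` is hidden in the column names, into a tidy one with `DataFrame.melt`. The table `wide` is made up for this example.

```python
import pandas as pd

# "Wide" layout: one column per year, so "year" is not a column of its own.
wide = pd.DataFrame({
    "team": ["Bears", "Lions"],
    "2010": [11, 6],
    "2011": [8, 10],
})

# melt() turns the year columns into rows:
# one row per (team, year) observation, one column per variable.
tidy = wide.melt(id_vars="team", var_name="year", value_name="wins")
print(tidy)
```

After the reshape, `year` and `wins` are ordinary columns that can be filtered and sorted like any other.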

## Pandas

https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf

In [None]:
import pandas

In [None]:
data = {'year': [2010, 2011, 2012, 2011, 2012, 2010, 2011, 2012],
        'team': ['Bears', 'Bears', 'Bears', 'Packers', 'Packers', 'Lions',
                 'Lions', 'Lions'],
        'wins': [11, 8, 10, 15, 11, 6, 10, 4],
        'losses': [5, 8, 6, 1, 5, 10, 6, 12]}

In [None]:
import pandas 

football = pandas.DataFrame(data)
print(football)

In [None]:
football

In [None]:
football.describe()

In [None]:
football.dtypes

In [None]:
football.head()

In [None]:
football.tail()

In [None]:
football.sample(5)

In [None]:
football['year']

In [None]:
football.year

In [None]:
football[['year', 'team', 'wins']]

---

## Row Selection


1. Slicing
2. Individual index (iloc / loc)
3. Boolean indexing
4. Combination

### Slicing

In [None]:
football

In [None]:
football[3:5]

### Individual index

### iloc
- An integer, e.g. 5.
- A list or array of integers, e.g. [4, 3, 0].
- A slice object with ints, e.g. 1:7.
- A boolean array.
- A callable (a function that takes the DataFrame and returns one of the above)

In [None]:
football.iloc[[0,3]]
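
The other selector types listed for `iloc` can be tried the same way. The sketch below uses a small frame whose rows are copied from the football data; `frame` and the result names are introduced only for this illustration.

```python
import pandas as pd

frame = pd.DataFrame({"team": ["Bears", "Packers", "Lions", "Bears"],
                      "wins": [11, 15, 6, 10]})

one_row = frame.iloc[2]                          # integer: a single row, returned as a Series
picked = frame.iloc[[3, 0]]                      # list of integers: rows in the given order
window = frame.iloc[1:3]                         # slice: positions 1 and 2, stop excluded
masked = frame.iloc[[True, False, True, False]]  # boolean array: rows 0 and 2
ends = frame.iloc[lambda d: [0, len(d) - 1]]     # callable: first and last row

print(window)
```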

### loc
- A single label
- A list or array of labels, e.g. ['a', 'b', 'c'].
- A slice object with labels, e.g. 'a':'f' __(WARNING - both the start and the stop are included)__ 
- A boolean array
- A callable (a function that takes the DataFrame and returns one of the above)

In [None]:
import numpy as np
import pandas as pd

index = pd.date_range('1/1/2000', periods=8)
df = pd.DataFrame(np.random.randn(8, 3), index=index, columns=['A', 'B', 'C'])
df

In [None]:
df.loc['2000-01-03']

In [None]:
df.loc['2000-01-03': '2000-01-04'] 
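
The remaining selector types for `loc`, and the contrast with `iloc` slicing, can be seen on a frame with string labels. This is a sketch; `scores` and its values are made up for illustration.

```python
import pandas as pd

scores = pd.DataFrame({"wins": [11, 15, 6, 9]}, index=["a", "b", "c", "d"])

one_row = scores.loc["b"]                     # single label: one row as a Series
picked = scores.loc[["d", "a"]]               # list of labels: rows in the given order
by_label = scores.loc["b":"c"]                # label slice: rows b AND c (stop included)
by_position = scores.iloc[1:2]                # position slice: row b only (stop excluded)
masked = scores.loc[scores.wins > 8]          # boolean array: rows a, b, d
called = scores.loc[lambda d: d.wins < 10]    # callable: rows c, d

print(by_label)
print(by_position)
```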

### Boolean indexing

In [None]:
football[football.wins > 10]

### Combination

In [None]:
football

In [None]:
football[(football.wins > 10) & (football.team == "Packers")]

In [None]:
football[(football.wins > 10) | (football.team == "Packers")]
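
Conditions are combined with `&` (and), `|` (or) and `~` (not), which work element by element; Python's own `and` / `or` / `not` raise a `ValueError` on a Series. Each comparison needs its own parentheses because `&` and `|` bind more tightly than `>` or `==`. A short sketch of negation, with a made-up frame named `games`:

```python
import pandas as pd

games = pd.DataFrame({"team": ["Bears", "Packers", "Packers", "Lions"],
                      "wins": [11, 15, 11, 6]})

# Rows with more than 10 wins that do NOT belong to the Packers.
other_winners = games[~(games.team == "Packers") & (games.wins > 10)]
print(other_winners)
```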

***

## Formatting

In [None]:
import numpy as np
import pandas as pd

index = pd.date_range('1/1/2000', periods=8)
df = pd.DataFrame(np.random.randn(8, 3), index=index, columns=['A', 'B', 'C'])
df

In [None]:
df.values

In [None]:
df.index

In [None]:
df.columns

In [None]:
# Lower-case every column name.
df.columns = [col.lower() for col in df.columns]

In [None]:
df

In [None]:
df.style

In [None]:
def color_negative_red(val):
    color = 'red' if val < 0 else 'white'
    return 'color: %s' % color

In [None]:
# Note: in pandas 2.1 and later this method is named Styler.map.
df.style.applymap(color_negative_red)

In [None]:
# Show the index as plain date strings, without the time component.
df.index = pd.to_datetime(df.index, format='%Y-%m-%d').strftime('%Y-%m-%d')
df.style.applymap(color_negative_red)

In [None]:
def highlight_max(s):
    is_max = s == s.max()
    return ['background-color: blue' if v else '' for v in is_max]

In [None]:
df.style.apply(highlight_max)

In [None]:
df.style.apply(highlight_max, axis=1)

In [None]:
df.style.apply(highlight_max, axis=0)

In [None]:
df.style.apply(highlight_max, axis=0).applymap(color_negative_red)

## New columns

In [None]:
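# Pitfall: attribute assignment does NOT create a column.
# It only shadows the DataFrame.sum method on this object.
df.sum = df.a + df.b + df.c

In [None]:
# Bracket assignment is what actually creates the new column.
df['sum'] = df.a + df.b + df.c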

In [None]:
df
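
Bracket assignment changes the DataFrame in place. When the original should stay untouched, `DataFrame.assign` returns a copy with the extra column instead. A minimal sketch with made-up values; `frame`, `with_total` and the column name `total` are chosen only for this illustration.

```python
import pandas as pd

frame = pd.DataFrame({"a": [1.0, -2.0], "b": [0.5, 0.5], "c": [2.0, 1.0]})

# Row-wise sum of the three columns; assign() returns a new DataFrame
# and leaves `frame` itself unchanged.
with_total = frame.assign(total=frame[["a", "b", "c"]].sum(axis=1))

print(with_total)
print(list(frame.columns))  # ['a', 'b', 'c']
```

Using a name such as `total` rather than `sum` also avoids the clash with the built-in `DataFrame.sum` method.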

---
### Exercise 1

Create a summary for `Packers` (create a `DataFrame` for that team)

### Exercise 2

Add the column `games_played` to the DataFrame `football` 

### Exercise 3

Add the column `percentage_games_won` to the DataFrame `football` 

### Exercise 4

Display the data for the `Packers` team for even years only