# Python in Data Science

## Data Wrangling with Pandas 
### Part 1 of 3 - introduction

***

## Data Wrangling 1
- Data wrangling, munging, tidying
- Pandas - introduction
- Row selection
- Formatting

---

## Exercise 1 - easy

Sort `long_list` by absolute value (i.e. ignore the `-` sign).


In [1]:
long_list = [ str(i) for i in range(21)] + [ str(i) for i in range(-1, -21, -1)]
str(long_list)

"['0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20', '-1', '-2', '-3', '-4', '-5', '-6', '-7', '-8', '-9', '-10', '-11', '-12', '-13', '-14', '-15', '-16', '-17', '-18', '-19', '-20']"

In [2]:
abs(-1)

1

In [3]:
abs('-1')

TypeError: bad operand type for abs(): 'str'

In [4]:
def absolute(x):
    # Convert the numeric string to a float, then drop the sign.
    x = float(x)
    return x if x >= 0 else -x

In [5]:
absolute('-1')

1.0

In [6]:
long_list.sort(key=absolute)
str(long_list)

"['0', '1', '-1', '2', '-2', '3', '-3', '4', '-4', '5', '-5', '6', '-6', '7', '-7', '8', '-8', '9', '-9', '10', '-10', '11', '-11', '12', '-12', '13', '-13', '14', '-14', '15', '-15', '16', '-16', '17', '-17', '18', '-18', '19', '-19', '20', '-20']"

In [7]:
long_list =  [ str(i) for i in range(-1, -21, -1)] + [ str(i) for i in range(21)]
str(long_list)

"['-1', '-2', '-3', '-4', '-5', '-6', '-7', '-8', '-9', '-10', '-11', '-12', '-13', '-14', '-15', '-16', '-17', '-18', '-19', '-20', '0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20']"

In [8]:
long_list.sort(key=absolute)
str(long_list)

"['0', '-1', '1', '-2', '2', '-3', '3', '-4', '4', '-5', '5', '-6', '6', '-7', '7', '-8', '8', '-9', '9', '-10', '10', '-11', '11', '-12', '12', '-13', '13', '-14', '14', '-15', '15', '-16', '16', '-17', '17', '-18', '18', '-19', '19', '-20', '20']"

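The two runs above order ties differently (`'1', '-1'` versus `'-1', '1'`) because Python's sort is stable: items with equal keys keep their original relative order. A minimal sketch of the same exercise on a shorter list, using the built-in `abs` on integers; the names `by_abs`, `positives_first` and `negatives_first` are illustrative and not part of the original notebook.

```python
def by_abs(text):
    """Sort key: absolute value of a numeric string."""
    return abs(int(text))

positives_first = ['1', '2', '-1', '-2', '0']
negatives_first = ['-1', '-2', '1', '2', '0']

# sorted() is stable: '1' stays ahead of '-1' only if it came first in the input.
sorted_pos = sorted(positives_first, key=by_abs)
sorted_neg = sorted(negatives_first, key=by_abs)

print(sorted_pos)  # ['0', '1', '-1', '2', '-2']
print(sorted_neg)  # ['0', '-1', '1', '-2', '2']
```
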
## Exercise 2 - medium

Generate the numbers in the range `1-1000` that are NOT divisible by any number between `2` and `9` (inclusive).


In [1]:
[ (num,[div for div in range(2,10) if num%div == 0])  for num in range(1,21)]

[(1, []),
 (2, [2]),
 (3, [3]),
 (4, [2, 4]),
 (5, [5]),
 (6, [2, 3, 6]),
 (7, [7]),
 (8, [2, 4, 8]),
 (9, [3, 9]),
 (10, [2, 5]),
 (11, []),
 (12, [2, 3, 4, 6]),
 (13, []),
 (14, [2, 7]),
 (15, [3, 5]),
 (16, [2, 4, 8]),
 (17, []),
 (18, [2, 3, 6, 9]),
 (19, []),
 (20, [2, 4, 5])]

``` python
[num for num in range(1,1001) if not [div for div in range(2,10) if num%div == 0]]
```

In [None]:
str([num for num in range(1,1001) if not [div for div in range(2,10) if num%div == 0]][:21])

In [None]:
str([num for num in range(1,1001) if not [div for div in range(2,10) if num%div == 0]][-20:])
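The comprehension above builds a full list of divisors for every number and then tests whether it is empty. An alternative sketch, assuming the same task, uses `all()` with a generator so that the check stops at the first divisor found; the names `divisors` and `not_divisible` are illustrative.

```python
divisors = range(2, 10)

# all() short-circuits: it stops as soon as one divisor divides num.
not_divisible = [num for num in range(1, 1001)
                 if all(num % div != 0 for div in divisors)]

print(len(not_divisible))
print(not_divisible[:5])  # [1, 11, 13, 17, 19]
```

Both versions should select the same numbers; the difference is only in how much work is done per candidate.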

## Exercise 3 - hard

Create a function that flattens arbitrarily nested lists.

`l = [4, 1, [ 2, [7] ], 3]`

`flatten(l)`

`[4, 1, 2, 7, 3]`


In [9]:
def flatten(input_list):
    # Build and return a new flat list; the input is left untouched.
    ret = []
    for x in input_list:
        if isinstance(x, list):
            ret.extend(flatten(x))
        else:
            ret.append(x)
    return ret

In [10]:
l = [4, 1, [ 2, [7] ], 3]
l = flatten(l)
l

[4, 1, 2, 7, 3]

In [11]:
def flatten(input_list):
    # In-place variant: empty the caller's list, then refill it with the flattened items.
    copy = list(input_list)
    input_list.clear()
    for x in copy:
        if isinstance(x, list):
            input_list.extend(flatten(x))
        else:
            input_list.append(x)
    return input_list

In [None]:
l = [4, 1, [ 2, [7] ], 3]
flatten(l)
l
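The two `flatten` definitions differ in their side effects: the first returns a new list, while the second rewrites the list that was passed in. A third option, sketched here as an assumption rather than as the notebook's intended solution, is a recursive generator that never mutates its input; `iter_flat` and `nested` are hypothetical names.

```python
def iter_flat(items):
    """Yield the leaves of an arbitrarily nested list, left to right."""
    for item in items:
        if isinstance(item, list):
            # Delegate to the recursive call for nested lists.
            yield from iter_flat(item)
        else:
            yield item

nested = [4, 1, [2, [7]], 3]
flat = list(iter_flat(nested))

print(flat)    # [4, 1, 2, 7, 3]
print(nested)  # [4, 1, [2, [7]], 3] (unchanged)
```
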

In [17]:
import this

The Zen of Python, by Tim Peters

Beautiful is better than ugly.
Explicit is better than implicit.
Simple is better than complex.
Complex is better than complicated.
Flat is better than nested.
Sparse is better than dense.
Readability counts.
Special cases aren't special enough to break the rules.
Although practicality beats purity.
Errors should never pass silently.
Unless explicitly silenced.
In the face of ambiguity, refuse the temptation to guess.
There should be one-- and preferably only one --obvious way to do it.
Although that way may not be obvious at first unless you're Dutch.
Now is better than never.
Although never is often better than *right* now.
If the implementation is hard to explain, it's a bad idea.
If the implementation is easy to explain, it may be a good idea.
Namespaces are one honking great idea -- let's do more of those!


***

![](img/road.jpg)

## Data Wrangling, Munging, Tidying 

### Data Wrangling
- __Discovering__ (data exploration)
- __Structuring__ (data preparation)
- __Cleaning__ (standardization, getting rid of superfluous data)
- Enriching (adding new datasets)
- Validating (verifying the business validity)
- Publishing

source: https://www.onlinewhitepapers.com/information-technology/six-core-data-wrangling-activities/

## Tidy Data

Wickham, Hadley - _"Tidy Data"_
https://www.jstatsoft.org/index.php/jss/article/view/v059i10/v59i10.pdf

- __Each variable you measure should be in one column.__
- __Each different observation of that variable should be in a different row.__
- There should be one table for each "kind" of variable.
- If you have multiple tables, they should include a column in the table that allows them to be linked.
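To make the first two rules concrete, here is a small sketch that reshapes a "wide" table into tidy form with `pandas.melt`. The `wide` table is a hypothetical layout built for illustration; its win counts are taken from the `football` data used later in this notebook.

```python
import pandas as pd

# Hypothetical "wide" table: one column per year, so the variable
# "year" is hidden in the column headers.
wide = pd.DataFrame({
    'team': ['Bears', 'Lions'],
    '2010': [11, 6],
    '2011': [8, 10],
})

# melt() moves the year headers into a 'year' column: one row per observation.
tidy = pd.melt(wide, id_vars='team', var_name='year', value_name='wins')
print(tidy)
```

After the reshape, each variable (`team`, `year`, `wins`) has its own column and each team-season is one row.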

## Pandas

https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf

In [18]:
import pandas

In [19]:
data = {'year': [2010, 2011, 2012, 2011, 2012, 2010, 2011, 2012],
        'team': ['Bears', 'Bears', 'Bears', 'Packers', 'Packers', 'Lions',
                 'Lions', 'Lions'],
        'wins': [11, 8, 10, 15, 11, 6, 10, 4],
        'losses': [5, 8, 6, 1, 5, 10, 6, 12]}

In [20]:
football = pandas.DataFrame(data)
print(football)

   year     team  wins  losses
0  2010    Bears    11       5
1  2011    Bears     8       8
2  2012    Bears    10       6
3  2011  Packers    15       1
4  2012  Packers    11       5
5  2010    Lions     6      10
6  2011    Lions    10       6
7  2012    Lions     4      12


In [21]:
football

Unnamed: 0,year,team,wins,losses
0,2010,Bears,11,5
1,2011,Bears,8,8
2,2012,Bears,10,6
3,2011,Packers,15,1
4,2012,Packers,11,5
5,2010,Lions,6,10
6,2011,Lions,10,6
7,2012,Lions,4,12


In [22]:
football.describe()

Unnamed: 0,year,wins,losses
count,8.0,8.0,8.0
mean,2011.125,9.375,6.625
std,0.834523,3.377975,3.377975
min,2010.0,4.0,1.0
25%,2010.75,7.5,5.0
50%,2011.0,10.0,6.0
75%,2012.0,11.0,8.5
max,2012.0,15.0,12.0


In [23]:
football.dtypes

year       int64
team      object
wins       int64
losses     int64
dtype: object

In [24]:
football.head()

Unnamed: 0,year,team,wins,losses
0,2010,Bears,11,5
1,2011,Bears,8,8
2,2012,Bears,10,6
3,2011,Packers,15,1
4,2012,Packers,11,5


In [25]:
football.tail()

Unnamed: 0,year,team,wins,losses
3,2011,Packers,15,1
4,2012,Packers,11,5
5,2010,Lions,6,10
6,2011,Lions,10,6
7,2012,Lions,4,12


In [27]:
football.sample(5)

Unnamed: 0,year,team,wins,losses
4,2012,Packers,11,5
5,2010,Lions,6,10
0,2010,Bears,11,5
2,2012,Bears,10,6
3,2011,Packers,15,1


In [28]:
football['year']

0    2010
1    2011
2    2012
3    2011
4    2012
5    2010
6    2011
7    2012
Name: year, dtype: int64

In [29]:
football.year

0    2010
1    2011
2    2012
3    2011
4    2012
5    2010
6    2011
7    2012
Name: year, dtype: int64

In [32]:
football[['year', 'team', 'wins']]

Unnamed: 0,year,team,wins
0,2010,Bears,11
1,2011,Bears,8
2,2012,Bears,10
3,2011,Packers,15
4,2012,Packers,11
5,2010,Lions,6
6,2011,Lions,10
7,2012,Lions,4
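The three selections above return different kinds of objects. A short sketch on a reduced copy of the data (the two-row frame and the names `as_series` and `as_frame` are assumptions for illustration): a single label gives a `Series`, while a list of labels gives a `DataFrame`, even when the list has one element.

```python
import pandas as pd

football = pd.DataFrame({
    'year': [2010, 2011],
    'team': ['Bears', 'Bears'],
    'wins': [11, 8],
})

as_series = football['year']    # single label -> Series
as_frame = football[['year']]   # list of labels -> one-column DataFrame

print(type(as_series).__name__)  # Series
print(type(as_frame).__name__)   # DataFrame
```

Attribute access (`football.year`) is a shortcut for the single-label form; it only works when the column name is a valid Python identifier that does not collide with an existing `DataFrame` attribute or method.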


---

## Row Selection


1. Slicing
2. Individual index (iloc / loc)
3. Boolean indexing
4. Combination

### Slicing

In [33]:
football

Unnamed: 0,year,team,wins,losses
0,2010,Bears,11,5
1,2011,Bears,8,8
2,2012,Bears,10,6
3,2011,Packers,15,1
4,2012,Packers,11,5
5,2010,Lions,6,10
6,2011,Lions,10,6
7,2012,Lions,4,12


In [34]:
football[3:5]

Unnamed: 0,year,team,wins,losses
3,2011,Packers,15,1
4,2012,Packers,11,5


In [39]:
a = [ x*x for x in range(10)]
a

[0, 1, 4, 9, 16, 25, 36, 49, 64, 81]

In [40]:
a[3:5]

[9, 16]

### Individual index

### iloc
- An integer, e.g. 5.
- A list or array of integers, e.g. [4, 3, 0].
- A slice object with ints, e.g. 1:7.
- A boolean array.
- A callable that takes the DataFrame and returns one of the above

In [41]:
football.iloc[[0,3]]

Unnamed: 0,year,team,wins,losses
0,2010,Bears,11,5
3,2011,Packers,15,1


In [42]:
football.iloc[2]

year       2012
team      Bears
wins         10
losses        6
Name: 2, dtype: object

In [43]:
football.iloc[3:5]

Unnamed: 0,year,team,wins,losses
3,2011,Packers,15,1
4,2012,Packers,11,5


In [44]:
football.iloc[[True,False,True,False,False,False,False, False]]

Unnamed: 0,year,team,wins,losses
0,2010,Bears,11,5
2,2012,Bears,10,6


### loc
- A single label
- A list or array of labels, e.g. ['a', 'b', 'c'].
- A slice object with labels, e.g. 'a':'f' __(WARNING - both the start and the stop are included)__ 
- A boolean array
- A callable that takes the DataFrame and returns one of the above

In [45]:
import numpy as np
import pandas as pd

index = pd.date_range('1/1/2000', periods=8)
df = pd.DataFrame(np.random.randn(8, 3), index=index, columns=['A', 'B', 'C'])
df

Unnamed: 0,A,B,C
2000-01-01,-0.529746,1.129327,-1.263318
2000-01-02,0.414268,-1.211658,0.879194
2000-01-03,-0.583975,-0.755551,-1.302686
2000-01-04,-0.409075,0.378425,1.802556
2000-01-05,0.222384,0.922392,0.319765
2000-01-06,0.695072,-1.205148,-0.459194
2000-01-07,0.391734,0.308483,-0.677213
2000-01-08,0.383917,1.224267,-1.090281


In [46]:
df.loc['2000-01-03']

A   -0.583975
B   -0.755551
C   -1.302686
Name: 2000-01-03 00:00:00, dtype: float64

In [47]:
df.loc['2000-01-03': '2000-01-04'] 

Unnamed: 0,A,B,C
2000-01-03,-0.583975,-0.755551,-1.302686
2000-01-04,-0.409075,0.378425,1.802556
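The inclusive stop of `loc` is easiest to see next to `iloc` on a default integer index, where labels and positions look identical. A minimal sketch, assuming a one-column frame named `frame` built from the `wins` values used earlier:

```python
import pandas as pd

frame = pd.DataFrame({'wins': [11, 8, 10, 15, 11, 6, 10, 4]})

by_label = frame.loc[3:5]      # labels 3, 4 and 5: the stop is included
by_position = frame.iloc[3:5]  # positions 3 and 4: the stop is excluded

print(by_label.index.tolist())     # [3, 4, 5]
print(by_position.index.tolist())  # [3, 4]
```
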


### Boolean indexing

In [48]:
football[football.wins > 10]

Unnamed: 0,year,team,wins,losses
0,2010,Bears,11,5
3,2011,Packers,15,1
4,2012,Packers,11,5


### Combination

In [49]:
football

Unnamed: 0,year,team,wins,losses
0,2010,Bears,11,5
1,2011,Bears,8,8
2,2012,Bears,10,6
3,2011,Packers,15,1
4,2012,Packers,11,5
5,2010,Lions,6,10
6,2011,Lions,10,6
7,2012,Lions,4,12


In [50]:
football[(football.wins > 10) & (football.team == "Packers")]

Unnamed: 0,year,team,wins,losses
3,2011,Packers,15,1
4,2012,Packers,11,5


In [51]:
football[(football.wins > 10) | (football.team == "Packers")]

Unnamed: 0,year,team,wins,losses
0,2010,Bears,11,5
3,2011,Packers,15,1
4,2012,Packers,11,5
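Boolean masks are combined with `&` (and), `|` (or) and `~` (not). The parentheses in the cells above are required because these operators bind more tightly than comparisons such as `>` and `==`. Naming the masks first is one way to keep the expressions readable; the sketch below rebuilds the relevant columns of `football`, and the mask names are illustrative.

```python
import pandas as pd

football = pd.DataFrame({
    'year': [2010, 2011, 2012, 2011, 2012, 2010, 2011, 2012],
    'team': ['Bears', 'Bears', 'Bears', 'Packers', 'Packers',
             'Lions', 'Lions', 'Lions'],
    'wins': [11, 8, 10, 15, 11, 6, 10, 4],
})

strong = football.wins > 10          # boolean Series, one value per row
packers = football.team == 'Packers'

both = football[strong & packers]            # AND
either = football[strong | packers]          # OR
strong_others = football[strong & ~packers]  # AND NOT

print(strong_others.index.tolist())  # [0]
```
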


***

## Formatting

In [52]:
import numpy as np
import pandas as pd

index = pd.date_range('1/1/2000', periods=8)
df = pd.DataFrame(np.random.randn(8, 3), index=index, columns=['A', 'B', 'C'])
df

Unnamed: 0,A,B,C
2000-01-01,0.457899,0.330025,1.103836
2000-01-02,-0.18245,0.92232,-0.876598
2000-01-03,-0.639069,-0.682079,0.504647
2000-01-04,-1.611921,-0.670617,0.383995
2000-01-05,1.631023,-0.996337,0.871324
2000-01-06,0.667407,2.410278,1.190915
2000-01-07,-0.718672,1.929484,-0.89428
2000-01-08,-0.339477,-0.016698,1.760061


In [53]:
df.values

array([[ 0.45789852,  0.33002497,  1.10383596],
       [-0.18245048,  0.92232013, -0.87659794],
       [-0.63906945, -0.68207864,  0.50464686],
       [-1.61192055, -0.67061725,  0.3839948 ],
       [ 1.63102294, -0.9963374 ,  0.87132431],
       [ 0.66740749,  2.41027762,  1.19091477],
       [-0.71867245,  1.92948354, -0.89428028],
       [-0.33947675, -0.01669775,  1.76006126]])

In [54]:
df.index

DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-03', '2000-01-04',
               '2000-01-05', '2000-01-06', '2000-01-07', '2000-01-08'],
              dtype='datetime64[ns]', freq='D')

In [55]:
df.columns

Index(['A', 'B', 'C'], dtype='object')

In [56]:
# Lower-case every column name.
df.columns = [col.lower() for col in df.columns]

In [57]:
df

Unnamed: 0,a,b,c
2000-01-01,0.457899,0.330025,1.103836
2000-01-02,-0.18245,0.92232,-0.876598
2000-01-03,-0.639069,-0.682079,0.504647
2000-01-04,-1.611921,-0.670617,0.383995
2000-01-05,1.631023,-0.996337,0.871324
2000-01-06,0.667407,2.410278,1.190915
2000-01-07,-0.718672,1.929484,-0.89428
2000-01-08,-0.339477,-0.016698,1.760061


In [58]:
df.style

Unnamed: 0,a,b,c
2000-01-01 00:00:00,0.457899,0.330025,1.103836
2000-01-02 00:00:00,-0.18245,0.92232,-0.876598
2000-01-03 00:00:00,-0.639069,-0.682079,0.504647
2000-01-04 00:00:00,-1.611921,-0.670617,0.383995
2000-01-05 00:00:00,1.631023,-0.996337,0.871324
2000-01-06 00:00:00,0.667407,2.410278,1.190915
2000-01-07 00:00:00,-0.718672,1.929484,-0.89428
2000-01-08 00:00:00,-0.339477,-0.016698,1.760061


In [68]:
def color_negative_red(val):
    color = 'red' if val < 0 else 'yellow'
    return 'color: %s' % color

In [69]:
# Note: pandas 2.1 and later deprecate Styler.applymap in favor of Styler.map.
df.style.applymap(color_negative_red)

Unnamed: 0,a,b,c
2000-01-01 00:00:00,0.457899,0.330025,1.103836
2000-01-02 00:00:00,-0.18245,0.92232,-0.876598
2000-01-03 00:00:00,-0.639069,-0.682079,0.504647
2000-01-04 00:00:00,-1.611921,-0.670617,0.383995
2000-01-05 00:00:00,1.631023,-0.996337,0.871324
2000-01-06 00:00:00,0.667407,2.410278,1.190915
2000-01-07 00:00:00,-0.718672,1.929484,-0.89428
2000-01-08 00:00:00,-0.339477,-0.016698,1.760061


In [71]:
df.index = pd.to_datetime(df.index, format = '%Y-%m-%d').strftime('%Y-%m-%d')
df.style.applymap(color_negative_red)

Unnamed: 0,a,b,c
2000-01-01,0.457899,0.330025,1.103836
2000-01-02,-0.18245,0.92232,-0.876598
2000-01-03,-0.639069,-0.682079,0.504647
2000-01-04,-1.611921,-0.670617,0.383995
2000-01-05,1.631023,-0.996337,0.871324
2000-01-06,0.667407,2.410278,1.190915
2000-01-07,-0.718672,1.929484,-0.89428
2000-01-08,-0.339477,-0.016698,1.760061


In [72]:
def highlight_max(s):
    is_max = s == s.max()
    return ['background-color: blue' if v else '' for v in is_max]

In [73]:
df.style.apply(highlight_max)

Unnamed: 0,a,b,c
2000-01-01,0.457899,0.330025,1.103836
2000-01-02,-0.18245,0.92232,-0.876598
2000-01-03,-0.639069,-0.682079,0.504647
2000-01-04,-1.611921,-0.670617,0.383995
2000-01-05,1.631023,-0.996337,0.871324
2000-01-06,0.667407,2.410278,1.190915
2000-01-07,-0.718672,1.929484,-0.89428
2000-01-08,-0.339477,-0.016698,1.760061


In [74]:
df.style.apply(highlight_max, axis=1)

Unnamed: 0,a,b,c
2000-01-01,0.457899,0.330025,1.103836
2000-01-02,-0.18245,0.92232,-0.876598
2000-01-03,-0.639069,-0.682079,0.504647
2000-01-04,-1.611921,-0.670617,0.383995
2000-01-05,1.631023,-0.996337,0.871324
2000-01-06,0.667407,2.410278,1.190915
2000-01-07,-0.718672,1.929484,-0.89428
2000-01-08,-0.339477,-0.016698,1.760061


In [76]:
df.style.apply(highlight_max, axis=0)

Unnamed: 0,a,b,c
2000-01-01,0.457899,0.330025,1.103836
2000-01-02,-0.18245,0.92232,-0.876598
2000-01-03,-0.639069,-0.682079,0.504647
2000-01-04,-1.611921,-0.670617,0.383995
2000-01-05,1.631023,-0.996337,0.871324
2000-01-06,0.667407,2.410278,1.190915
2000-01-07,-0.718672,1.929484,-0.89428
2000-01-08,-0.339477,-0.016698,1.760061


In [77]:
df.style.apply(highlight_max, axis=0).applymap(color_negative_red)

Unnamed: 0,a,b,c
2000-01-01,0.457899,0.330025,1.103836
2000-01-02,-0.18245,0.92232,-0.876598
2000-01-03,-0.639069,-0.682079,0.504647
2000-01-04,-1.611921,-0.670617,0.383995
2000-01-05,1.631023,-0.996337,0.871324
2000-01-06,0.667407,2.410278,1.190915
2000-01-07,-0.718672,1.929484,-0.89428
2000-01-08,-0.339477,-0.016698,1.760061


## New columns

In [78]:
df.sum = df.a + df.b + df.c

In [79]:
df

Unnamed: 0,a,b,c
2000-01-01,0.457899,0.330025,1.103836
2000-01-02,-0.18245,0.92232,-0.876598
2000-01-03,-0.639069,-0.682079,0.504647
2000-01-04,-1.611921,-0.670617,0.383995
2000-01-05,1.631023,-0.996337,0.871324
2000-01-06,0.667407,2.410278,1.190915
2000-01-07,-0.718672,1.929484,-0.89428
2000-01-08,-0.339477,-0.016698,1.760061
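The frame above is unchanged because `df.sum = ...` does not create a column. It sets an ordinary Python attribute on the object, which also hides the existing `DataFrame.sum` method on that instance. A small sketch of the difference, using a two-row frame made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({'a': [1.0, 2.0], 'b': [3.0, 4.0]})

# Attribute assignment: sets a plain attribute, no column is added.
df.sum = df.a + df.b
created_by_attribute = 'sum' in df.columns
print(created_by_attribute)  # False

# Bracket assignment: this is what creates the column.
df['sum'] = df.a + df.b
print(df.columns.tolist())   # ['a', 'b', 'sum']
```

Attribute access is convenient for reading existing columns, but new columns need the bracket form.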


In [80]:
df['sum'] = df.a + df.b + df.c

In [81]:
df

Unnamed: 0,a,b,c,sum
2000-01-01,0.457899,0.330025,1.103836,1.891759
2000-01-02,-0.18245,0.92232,-0.876598,-0.136728
2000-01-03,-0.639069,-0.682079,0.504647,-0.816501
2000-01-04,-1.611921,-0.670617,0.383995,-1.898543
2000-01-05,1.631023,-0.996337,0.871324,1.50601
2000-01-06,0.667407,2.410278,1.190915,4.2686
2000-01-07,-0.718672,1.929484,-0.89428,0.316531
2000-01-08,-0.339477,-0.016698,1.760061,1.403887


In [82]:
df["new_empty"] = 0
df

Unnamed: 0,a,b,c,sum,new_empty
2000-01-01,0.457899,0.330025,1.103836,1.891759,0
2000-01-02,-0.18245,0.92232,-0.876598,-0.136728,0
2000-01-03,-0.639069,-0.682079,0.504647,-0.816501,0
2000-01-04,-1.611921,-0.670617,0.383995,-1.898543,0
2000-01-05,1.631023,-0.996337,0.871324,1.50601,0
2000-01-06,0.667407,2.410278,1.190915,4.2686,0
2000-01-07,-0.718672,1.929484,-0.89428,0.316531,0
2000-01-08,-0.339477,-0.016698,1.760061,1.403887,0


---
### Exercise 1

Create a summary for the `Packers` (build a `DataFrame` that contains only that team).

### Exercise 2

Add the column `games_played` to the DataFrame `football` 

### Exercise 3

Add the column `percentage_games_won` to the DataFrame `football` 

### Exercise 4

Display the data for the `Packers` team for even years only.
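
One possible approach to the four exercises, offered as a sketch rather than a reference solution. It rebuilds `football` from the data shown earlier; the intermediate names (`packers`, `packers_summary`, `packers_even`) and the choice to express the win share in percent are assumptions.

```python
import pandas as pd

football = pd.DataFrame({
    'year': [2010, 2011, 2012, 2011, 2012, 2010, 2011, 2012],
    'team': ['Bears', 'Bears', 'Bears', 'Packers', 'Packers',
             'Lions', 'Lions', 'Lions'],
    'wins': [11, 8, 10, 15, 11, 6, 10, 4],
    'losses': [5, 8, 6, 1, 5, 10, 6, 12],
})

# Exercise 1: keep only the Packers rows, then summarize them.
packers = football[football.team == 'Packers']
packers_summary = packers.describe()

# Exercise 2: bracket assignment creates the new column.
football['games_played'] = football.wins + football.losses

# Exercise 3: share of games won, expressed in percent.
football['percentage_games_won'] = 100 * football.wins / football.games_played

# Exercise 4: Packers rows whose year is even.
packers_even = football[(football.team == 'Packers') & (football.year % 2 == 0)]
print(packers_even)
```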