[Solving 100 Python Pandas Problems! (from easy to very difficult)](https://www.youtube.com/watch?v=i7v2m-ebXB4&t=3567s)

[Github](https://github.com/ajcr/100-pandas-puzzles)

# 100 pandas puzzles

Inspired by [100 Numpy exercises](https://github.com/rougier/numpy-100), here are 100* short puzzles for testing your knowledge of [pandas'](http://pandas.pydata.org/) power.

Since pandas is a large library with many different specialist features and functions, these exercises focus mainly on the fundamentals of manipulating data (indexing, grouping, aggregating, cleaning), making use of the core DataFrame and Series objects.

Many of the exercises here are straightforward in that the solutions require no more than a few lines of code (in pandas or NumPy... don't go using pure Python or Cython!). Choosing the right methods and following best practices is the underlying goal.

The exercises are loosely divided into sections. Each section has a difficulty rating; these ratings are subjective, of course, but should be seen as a rough guide as to how inventive the required solution is.

If you're just starting out with pandas and you are looking for some other resources, the official documentation is very extensive. In particular, some good places to get a broader overview of pandas are...

- [10 minutes to pandas](http://pandas.pydata.org/pandas-docs/stable/10min.html)
- [pandas basics](http://pandas.pydata.org/pandas-docs/stable/basics.html)
- [tutorials](http://pandas.pydata.org/pandas-docs/stable/tutorials.html)
- [cookbook and idioms](http://pandas.pydata.org/pandas-docs/stable/cookbook.html#cookbook)

Enjoy the puzzles!

\* *the list of exercises is not yet complete! Pull requests or suggestions for additional exercises, corrections and improvements are welcomed.*

In [1]:
import pandas as pd
import numpy as np

## DataFrames: harder problems 

### These might require a bit of thinking outside the box...

...but all are solvable using just the usual pandas/NumPy methods (and so avoid using explicit `for` loops).

Difficulty: *hard*

**29.** Consider a DataFrame `df` where there is an integer column 'X':
```python
df = pd.DataFrame({'X': [7, 2, 0, 3, 4, 2, 5, 0, 3, 4]})
```
For each value, count the difference back to the previous zero (or the start of the Series, whichever is closer). These values should therefore be:

```
[1, 2, 0, 1, 2, 3, 4, 0, 1, 2]
```

Make this a new column 'Y'.

In [2]:
df = pd.DataFrame({'X': [7, 2, 0, 3, 4, 2, 5, 0, 3, 4]})
df

,X
0,7
1,2
2,0
3,3
4,4
5,2
6,5
7,0
8,3
9,4


In [3]:
# Increment by 1 each time the value is nonzero
x = (df['X'] != 0).cumsum()
x

0    1
1    2
2    2
3    3
4    4
5    5
6    6
7    6
8    7
9    8
Name: X, dtype: int32

In [4]:
# Boolean that is True whenever the row's value differs from the
# previous row's value
y = x != x.shift()
y

0     True
1     True
2    False
3     True
4     True
5     True
6     True
7    False
8     True
9     True
Name: X, dtype: bool

In [5]:
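# Group consecutive equal flags into runs, then take a running sum within
# each run: True runs count 1, 2, 3, ... and False runs (zeros) stay at 0
df['Y'] = y.groupby((y != y.shift()).cumsum()).cumsum()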
df

,X,Y
0,7,1
1,2,2
2,0,0
3,3,1
4,4,2
5,2,3
6,5,4
7,0,0
8,3,1
9,4,2

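The three cells above can be folded into one helper. This is a minimal sketch rather than the puzzle author's reference solution; the function name `steps_since_zero` is hypothetical, and the flag is cast to `int` before grouping so the running sum does not depend on how a given pandas version handles boolean `cumsum`.

```python
import pandas as pd


def steps_since_zero(s):
    """Count, for each element, the steps back to the previous zero (or the start)."""
    # 1 where the value is nonzero, 0 where it is zero.
    flag = (s != 0).astype(int)
    # A new run begins whenever the flag differs from the previous row.
    run_id = (flag != flag.shift()).cumsum()
    # Running sum inside each run: 1, 2, 3, ... for nonzero runs, 0 for zero runs.
    return flag.groupby(run_id).cumsum()


df = pd.DataFrame({'X': [7, 2, 0, 3, 4, 2, 5, 0, 3, 4]})
df['Y'] = steps_since_zero(df['X'])
```

Because zeros form their own runs, several zeros in a row each get a count of 0, and counting restarts at 1 on the next nonzero value.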

**30.** Consider the DataFrame constructed below which contains rows and columns of numerical data. 

Create a list of the column-row index locations of the 3 largest values in this DataFrame. In this case, the answer should be:
```
[(5, 7), (6, 4), (2, 5)]
```

In [3]:
df = pd.DataFrame(np.random.RandomState(30).randint(1, 101, size=(8, 8)))
df

,0,1,2,3,4,5,6,7
0,38,38,46,46,13,24,3,54
1,18,47,4,42,8,66,50,46
2,62,36,19,19,77,17,7,63
3,28,47,46,65,63,12,16,24
4,14,51,34,56,29,59,92,79
5,58,76,96,45,38,76,58,40
6,10,34,48,40,37,23,41,26
7,55,70,91,27,79,92,20,31


In [14]:
# Convert the DataFrame to a Series indexed by (column, row)
df.unstack()

0  0    38
   1    18
   2    62
   3    28
   4    14
        ..
7  3    24
   4    79
   5    40
   6    26
   7    31
Length: 64, dtype: int32

In [16]:
# Sort in ascending order and take the last 3 values
df.unstack().sort_values()[-3:]

5  7    92
6  4    92
2  5    96
dtype: int32

In [17]:
# Index of the 3 values retrieved above
df.unstack().sort_values()[-3:].index

MultiIndex([(5, 7),
            (6, 4),
            (2, 5)],
           )

In [18]:
# Convert the result above to a list
df.unstack().sort_values()[-3:].index.tolist()

[(5, 7), (6, 4), (2, 5)]

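An alternative to sorting the whole Series is `Series.nlargest`, which returns only the top values. The sketch below uses a small hand-made frame, `small` (hypothetical, not the puzzle's random data), so the result can be checked by eye. Note that `nlargest` returns the locations from largest to smallest, whereas the sort-and-slice approach above lists them in ascending order of value.

```python
import pandas as pd

# Hypothetical 3x3 frame with distinct values, so there are no ties to break.
small = pd.DataFrame([[1, 9, 3],
                      [7, 2, 8],
                      [4, 6, 5]])

# unstack() yields a Series indexed by (column, row); nlargest keeps the top 3.
top3 = small.unstack().nlargest(3).index.tolist()
print(top3)  # [(1, 0), (2, 1), (0, 1)] -> values 9, 8 and 7
```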
**31.** You are given the DataFrame below with a column of group IDs, 'grps', and a column of corresponding integer values, 'vals'.

```python
df = pd.DataFrame({"vals": np.random.RandomState(31).randint(-30, 30, size=15), 
                   "grps": np.random.RandomState(31).choice(["A", "B"], 15)})
```

Create a new column 'patched_vals' which contains the same values as 'vals', except that any negative values in 'vals' are replaced with the group mean (the mean of the group's positive values, as the expected output shows):

```
    vals grps  patched_vals
0    -12    A          13.6
1     -7    B          28.0
2    -14    A          13.6
3      4    A           4.0
4     -7    A          13.6
5     28    B          28.0
6     -2    A          13.6
7     -1    A          13.6
8      8    A           8.0
9     -2    B          28.0
10    28    A          28.0
11    12    A          12.0
12    16    A          16.0
13   -24    A          13.6
14   -12    A          13.6
```

In [3]:
df = pd.DataFrame({"vals": np.random.RandomState(31).randint(-30, 30, size=15), 
                   "grps": np.random.RandomState(31).choice(["A", "B"], 15)})


In [5]:
# Sort in ascending order by the columns below
df.sort_values(by=['grps', 'vals'])

,vals,grps
13,-24,A
2,-14,A
0,-12,A
14,-12,A
4,-7,A
6,-2,A
7,-1,A
3,4,A
8,8,A
11,12,A


In [4]:
# Mean of the positive values for each value of 'grps'
means = df[df.vals > 0].groupby('grps').mean()
means

grps,vals
A,13.6
B,28.0


In [15]:
# Extract the means computed above as floats
a_mean = means.loc['A', 'vals']  # result: 13.60
b_mean = means.loc['B', 'vals']  # result: 28.00

In [16]:
# First condition: where 'grps' is 'A', the new column patched_vals is set
# to 13.60, otherwise to 28.00
df['patched_vals'] = np.where(df['grps'] == 'A', a_mean, b_mean)
df.head()

,vals,grps,patched_vals
0,-12,A,13.6
1,-7,B,28.0
2,-14,A,13.6
3,4,A,13.6
4,-7,A,13.6


In [17]:
# Second condition: where 'vals' is negative, 'patched_vals' keeps the group
# mean assigned above; otherwise it takes the value from 'vals'
df['patched_vals'] = np.where(df['vals'] < 0, df['patched_vals'], df['vals'])
df

,vals,grps,patched_vals
0,-12,A,13.6
1,-7,B,28.0
2,-14,A,13.6
3,4,A,4.0
4,-7,A,13.6
5,28,B,28.0
6,-2,A,13.6
7,-1,A,13.6
8,8,A,8.0
9,-2,B,28.0

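The two `np.where` steps above hard-code the groups 'A' and 'B'. A sketch that works for any number of groups can use `groupby(...).transform('mean')`, which broadcasts each group's mean back onto the original rows. The frame `demo` below is hypothetical example data, chosen so the means are easy to verify by hand.

```python
import pandas as pd

# Hypothetical data: group A has positives 4 and 6 (mean 5), group B has 10.
demo = pd.DataFrame({'vals': [-1, 4, 6, -3, 10],
                     'grps': ['A', 'A', 'A', 'B', 'B']})

# Mask out non-positive values so they do not contribute to the mean.
positive = demo['vals'].where(demo['vals'] > 0)
# Per-row mean of the positive values in that row's group.
group_means = positive.groupby(demo['grps']).transform('mean')
# Keep non-negative values; replace negative ones with the group mean.
demo['patched_vals'] = demo['vals'].where(demo['vals'] >= 0, group_means)
print(demo['patched_vals'].tolist())  # [5.0, 4.0, 6.0, 10.0, 10.0]
```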

**32.** Implement a rolling mean over groups with window size 3, which ignores NaN values. For example, consider the following DataFrame:

```python
>>> df = pd.DataFrame({'group': list('aabbabbbabab'),
                       'value': [1, 2, 3, np.nan, 2, 3, np.nan, 1, 7, 3, np.nan, 8]})
>>> df
   group  value
0      a    1.0
1      a    2.0
2      b    3.0
3      b    NaN
4      a    2.0
5      b    3.0
6      b    NaN
7      b    1.0
8      a    7.0
9      b    3.0
10     a    NaN
11     b    8.0
```
The goal is to compute the Series:

```
0     1.000000
1     1.500000
2     3.000000
3     3.000000
4     1.666667
5     3.000000
6     3.000000
7     2.000000
8     3.666667
9     2.000000
10    4.500000
11    4.000000
```
E.g. the first window of size three for group 'b' has values 3.0, NaN and 3.0 and occurs at row index 5. Instead of being NaN, the value in the new column at this row index should be 3.0 (just the two non-NaN values are used to compute the mean: (3+3)/2).
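
One possible approach, offered as a sketch rather than the definitive answer, is to compute the windowed mean explicitly as "sum of the non-NaN values divided by the number of non-NaN values". The sum is taken over a copy with NaN replaced by 0, and the count over the original column; `min_periods=1` lets windows at the start of a group produce a value. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'group': list('aabbabbbabab'),
                   'value': [1, 2, 3, np.nan, 2, 3, np.nan, 1, 7, 3, np.nan, 8]})

valid = df.groupby('group')['value']             # NaN kept, used for counting
filled = df.fillna(0).groupby('group')['value']  # NaN -> 0, used for summing

# Per-group rolling sum and rolling count of non-NaN values, window size 3.
window_sum = filled.rolling(3, min_periods=1).sum()
window_count = valid.rolling(3, min_periods=1).count()

# The groupby-rolling result is indexed by (group, original row);
# drop the group level and restore the original row order.
result = (window_sum / window_count).reset_index(level=0, drop=True).sort_index()
```

A window containing only NaN values would give 0/0 here, which evaluates to NaN; no such window occurs in this example.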