# Pandas exercise for practice

In [None]:
import pandas as pd
import numpy as np

## Medium Section

## DataFrames: beyond the basics

### Slightly trickier: you may need to combine two or more methods to get the right answer

The previous section was a tour through some basic but essential DataFrame operations. Below are some ways that you might need to cut your data, but for which there is no single "out of the box" method.

### 22. You have a DataFrame `df` with a column 'A' of integers. For example:
```python
df = pd.DataFrame({'A': [1, 2, 2, 3, 4, 5, 5, 5, 6, 7, 7]})
```

How do you filter out rows which contain the same integer as the row immediately above?

You should be left with a column containing the following values:

```python
1, 2, 3, 4, 5, 6, 7
```

In [None]:
df = pd.DataFrame({"A": [1, 2, 2, 3, 4, 5, 5, 5, 6, 7, 7]})
df

In [None]:
df[df['A'].ne(df['A'].shift())]

### 23. Given a DataFrame of numeric values, say
```python
df = pd.DataFrame(np.random.random(size=(5,3)))
```
how do you subtract the row mean from each element in the row?

In [None]:
np.random.seed()

df = pd.DataFrame(np.random.random(size=(5, 3)))
# Row means come from mean(axis=1); axis=0 in sub() aligns them with the row index
df = df.sub(df.mean(axis=1), axis=0)
df

### 24. Suppose you have a DataFrame with 10 columns of real numbers, for example:
```python
df = pd.DataFrame(np.random.random(size=(5,10)),columns=list('abcdefghij'))
```
Which column of numbers has the smallest sum? Return that column's label.

In [None]:
np.random.seed()

df = pd.DataFrame(np.random.random(size=(5, 10)), columns=list("abcdefghij"))
df

In [None]:
# Column sums (the default axis=0), smallest first
df.sum().sort_values()

In [None]:
# Label of the column with the smallest sum
df.sum().idxmin()

### 25. How do you count how many unique rows a DataFrame has (i.e. ignore all rows that are duplicates)? As input, use a DataFrame of zeros and ones with 10 rows and 3 columns.
```python
df = pd.DataFrame(np.random.randint(0,2,size=(10,3)))
```

In [None]:
df = pd.DataFrame(np.random.randint(0, 2, size=(10, 3)))
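# "Unique" is read here as rows that occur exactly once, so every copy of a duplicated row is dropped
len(df.drop_duplicates(keep=False))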

### 26. In the cell below, you have a DataFrame `df` that consists of 10 columns of floating-point numbers. Exactly 5 entries in each row are NaN values.

### For each row of the DataFrame, find the column which contains the third NaN value.

### You should return a Series of column labels: `e, c, d, h, d`

In [None]:
nan = np.nan

data = [
  [0.04, nan, nan, 0.25, nan, 0.43, 0.71, 0.51, nan, nan],
  [nan, nan, nan, 0.04, 0.76, nan, nan, 0.67, 0.76, 0.16],
  [nan, nan, 0.5, nan, 0.31, 0.4, nan, nan, 0.24, 0.01],
  [0.49, nan, nan, 0.62, 0.73, 0.26, 0.85, nan, nan, nan],
  [nan, nan, 0.41, nan, 0.05, nan, 0.61, nan, 0.48, 0.68],
]

columns = list("abcdefghij")

df = pd.DataFrame(data, columns=columns)

print(df)

(df.isna().cumsum(axis=1) == 3).idxmax(axis=1)
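
To see why the one-liner above works, here is a step-by-step sketch on the first row of the data only. The names `row_df`, `running`, `hits` and `third_nan` are introduced here purely for illustration.

```python
import numpy as np
import pandas as pd

nan = np.nan

# First row of the data above, with the same column labels
row_df = pd.DataFrame(
    [[0.04, nan, nan, 0.25, nan, 0.43, 0.71, 0.51, nan, nan]],
    columns=list("abcdefghij"),
)

is_nan = row_df.isna()            # True where the entry is NaN
running = is_nan.cumsum(axis=1)   # running count of NaNs, left to right
hits = running == 3               # True from the third NaN up to (not including) the fourth
third_nan = hits.idxmax(axis=1)   # label of the first True in each row

print(running.iloc[0].tolist())   # [0, 1, 2, 2, 3, 3, 3, 3, 4, 5]
print(third_nan.iloc[0])          # e
```

`idxmax` returns the first label at which the maximum occurs, and for a boolean row that is the first `True`, i.e. the column holding the third NaN.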

In [None]:
# Loop-based alternative: walk each row left to right and stop at the third NaN.
# pd.isna() tests for NaN directly, so df does not need a sentinel fill value.
outputs = []
for index, row in df.iterrows():
  count = 0
  for label, value in row.items():
    if pd.isna(value):
      count += 1
    if count == 3:
      outputs.append(label)
      break

print(pd.Series(outputs))

### 27. A DataFrame has a column of groups 'grps' and a column of integer values 'vals':

```python
df = pd.DataFrame({'grps': list('aaabbcaabcccbbc'), 
                   'vals': [12,345,3,1,45,14,4,52,54,23,235,21,57,3,87]})
```
For each *group*, find the sum of the three greatest values. You should end up with the answer as follows:
```
grps
a    409
b    156
c    345
```

In [None]:
df = pd.DataFrame(
  {
    "grps": list("aaabbcaabcccbbc"),
    "vals": [12, 345, 3, 1, 45, 14, 4, 52, 54, 23, 235, 21, 57, 3, 87],
  }
)
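# Sort so that, within each group, the largest values come first
df = df.sort_values(["grps", "vals"], ascending=[True, False])
# Rank of each value within its group (1 = largest)
df["count"] = df.groupby("grps").cumcount() + 1
# One column per group, one row per rank (pivot_table's default mean aggregation returns floats)
df = df.pivot_table(columns="grps", values="vals", index="count")
# Running total down each column; row 3 holds the sum of the three greatest values
df.cumsum().loc[3].astype(int)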
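
A shorter route, offered as an alternative sketch rather than the notebook's own approach, is to take the three largest values per group with `nlargest` and then sum them per group. The variable names `df_groups` and `top3_sum` are assumptions made for this example.

```python
import pandas as pd

df_groups = pd.DataFrame(
    {
        "grps": list("aaabbcaabcccbbc"),
        "vals": [12, 345, 3, 1, 45, 14, 4, 52, 54, 23, 235, 21, 57, 3, 87],
    }
)

# nlargest(3) keeps the three greatest values in each group; the result is
# indexed by (group label, original row index), so level 0 is the group.
top3 = df_groups.groupby("grps")["vals"].nlargest(3)

# Sum within each group again to collapse the three values into one number
top3_sum = top3.groupby(level=0).sum()

print(top3_sum)
```

This keeps the values as integers and avoids the intermediate pivot table.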

### 28. The DataFrame `df` constructed below has two integer columns 'A' and 'B'. The values in 'A' are between 1 and 100 (inclusive). 

For each group of 10 consecutive integers in 'A' (i.e. `(0, 10]`, `(10, 20]`, ...), calculate the sum of the corresponding values in column 'B'.

The answer should be a Series as follows:

```
A
(0, 10]      635
(10, 20]     360
(20, 30]     315
(30, 40]     306
(40, 50]     750
(50, 60]     284
(60, 70]     424
(70, 80]     526
(80, 90]     835
(90, 100]    852
```

In [None]:
df = pd.DataFrame(
    np.random.RandomState(8765).randint(1, 101, size=(100, 2)), columns=["A", "B"]
)

# Map bin number 0..9 to its interval label, e.g. 0 -> "(0, 10]"
groups = {i // 10: f"({i}, {i + 10}]" for i in range(0, 100, 10)}

# (A - 1) // 10 places 1..10 in bin 0, 11..20 in bin 1, and so on
df["A"] = ((df["A"] - 1) // 10).map(groups)

df.groupby("A")["B"].sum()
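
The same binning can also be sketched with `pd.cut`, which builds the half-open intervals `(0, 10]`, `(10, 20]`, ... directly from a list of bin edges. This is an alternative under the assumption that interval labels are acceptable as the index; `df_ab`, `bins` and `sums` are names chosen for this example.

```python
import numpy as np
import pandas as pd

df_ab = pd.DataFrame(
    np.random.RandomState(8765).randint(1, 101, size=(100, 2)), columns=["A", "B"]
)

# Bin edges 0, 10, ..., 100; pd.cut makes right-closed intervals by default
bins = np.arange(0, 101, 10)

# Group B by the interval that each A value falls into and sum per interval.
# observed=False keeps all ten intervals even if one happens to be empty.
sums = df_ab.groupby(pd.cut(df_ab["A"], bins), observed=False)["B"].sum()

print(sums)
```

Because `pd.cut` returns an ordered categorical, the groups come out in numeric interval order and column 'A' is left unmodified.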