# Grouping numbers in intervals by sign change

## The task

Tekle wrote: 

*I have a list of data that contains negative, positive and zero numbers.
The goal is to count the consecutive negative numbers, including any zeros among them, and to do the same for the positive numbers.
The expected outcome for the example below is:
group = [4,2,4,4,3]*

In [1]:
import itertools
import numpy as np

val = [0,-1,-2,-3,4,5,-6,-7,-8,-9,10,0,11,10,-1,-2,0]
arr = np.array(val)


# First attempt: group consecutive values by whether they are <= 0.
groups = itertools.groupby(arr, key=lambda x: x <= 0)

for _, v in groups:
    print(list(v))

[0, -1, -2, -3]
[4, 5]
[-6, -7, -8, -9]
[10]
[0]
[11, 10]
[-1, -2, 0]


A zero that sits inside a run of consecutive negative or positive numbers should be counted as part of that run. In the example above, I want [10] and [0] to be merged with [11, 10] into a single group, [10, 0, 11, 10]. A zero at the beginning or end of the list should likewise be counted with the neighboring run.


## The solution

In [158]:
import numpy as np
import pandas as pd

# I added a second leading zero and several zeros in between to make the
# test case more challenging.
val = [0,0,-1,-2,-3,4,5,-6,-7,-8,-9,10,0,0,0,1,0,1,10,-1,-2,0]

# One could certainly solve this with loops and if-else logic, but as a dev
# who mostly works in Spark, pandas feels slightly more natural to me.
df = pd.DataFrame({"val": val})

# Calculate the sign of a value. 
# < 0: -1
# == 0: 0
# > 0: +1
df["signs"] = np.sign(df["val"])

# Now the magic happens. Zeros are first replaced by NaN so that a later
# forward fill overwrites them with the sign of the preceding non-zero value.
# where() keeps the non-zero signs, turns zeros into NaN and upcasts to float.
df["manipulated_signs"] = df["signs"].where(df["signs"] != 0)

# Handle leading zeros. If the list starts with zeros, they are NaN by now and
# a forward fill has nothing to copy from. We therefore look up the first
# non-NaN sign and write it into the first row.
if pd.isna(df.loc[0, "manipulated_signs"]):
    first_valid_idx = df["manipulated_signs"].first_valid_index()
    df.loc[0, "manipulated_signs"] = df.loc[first_valid_idx, "manipulated_signs"]

# The previous conversion to NaN unfortunately came with a conversion to a
# float data type. Signs are whole numbers and read better as integers, so we
# convert back once forward filling has removed all NaNs.
df["ffilled_signs"] = df["manipulated_signs"].ffill().astype("int")

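# Define intervals of constant sign so that we can group over them. diff() is
# non-zero wherever the sign flips (and NaN in the first row, which also counts
# as "not equal to 0"), so the cumulative sum numbers the intervals 1, 2, 3, ...
df["sign_group"] = df["ffilled_signs"].diff().ne(0).cumsum()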

# Group the values into the sign_group intervals. The sign column is
# just for clearer illustration of which sign an interval has.
# Named aggregation already assigns the output column names, so no rename
# is needed afterwards.
df1 = df.groupby("sign_group").agg(
    sign=("ffilled_signs", "first"),
    value_list=("val", list),
    count=("val", "count"),
)

df1.index.name = "interval"
display(df1)  # That's it. Looks like I met the requirements.

| interval | sign | value_list | count |
|---|---|---|---|
| 1 | -1 | [0, 0, -1, -2, -3] | 5 |
| 2 | 1 | [4, 5] | 2 |
| 3 | -1 | [-6, -7, -8, -9] | 4 |
| 4 | 1 | [10, 0, 0, 0, 1, 0, 1, 10] | 8 |
| 5 | -1 | [-1, -2, 0] | 3 |
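
The steps in the cell above can also be condensed into a single helper, which makes it easy to check the result against the list from the question. This is only a sketch: the function name `count_sign_runs` is made up for illustration, and it swaps the explicit first-row imputation for a backward fill (`bfill`) to cover leading zeros. An all-zero list is treated as one run, which is an assumption the original question does not cover.

```python
import numpy as np
import pandas as pd


def count_sign_runs(values):
    """Return the lengths of consecutive same-sign runs, absorbing zeros.

    Hypothetical helper, condensed from the step-by-step cell above.
    """
    signs = pd.Series(np.sign(values), dtype="float64")
    # Zeros become NaN and then inherit the sign of the previous non-zero
    # value (ffill). bfill covers zeros at the very start of the list, and
    # fillna(0) keeps an all-zero input together as a single run (assumption).
    filled = signs.where(signs != 0).ffill().bfill().fillna(0)
    # A new run starts wherever the filled sign differs from the previous row.
    run_id = filled.ne(filled.shift()).cumsum()
    # groupby sorts by run_id, so the counts come back in order of appearance.
    return filled.groupby(run_id).size().tolist()


question_val = [0, -1, -2, -3, 4, 5, -6, -7, -8, -9, 10, 0, 11, 10, -1, -2, 0]
print(count_sign_runs(question_val))
```

For the list from the question this prints `[4, 2, 4, 4, 3]`, the outcome asked for.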






In [160]:
# This last line just brings the result into the format you defined above.
print(df1["count"].tolist())

[5, 2, 4, 8, 3]
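
For comparison, here is roughly what the loop-based alternative mentioned in the comments could look like. It is a minimal sketch, not part of the original answer; the name `count_sign_runs_loop` is hypothetical. It can serve as an independent cross-check of the pandas result.

```python
def count_sign_runs_loop(values):
    """Count run lengths with a plain loop; zeros join the run they fall into."""
    counts = []
    current_sign = 0  # 0 means "no non-zero value seen yet"
    for v in values:
        sign = (v > 0) - (v < 0)  # -1, 0 or +1
        if sign != 0 and current_sign != 0 and sign != current_sign:
            counts.append(0)  # the sign flipped: open a new run
        elif not counts:
            counts.append(0)  # the very first element opens the first run
        if sign != 0:
            current_sign = sign
        counts[-1] += 1  # zeros never flip the sign, so they extend the run
    return counts


val = [0, 0, -1, -2, -3, 4, 5, -6, -7, -8, -9, 10, 0, 0, 0, 1, 0, 1, 10, -1, -2, 0]
print(count_sign_runs_loop(val))
```

On the extended list used in the solution this prints `[5, 2, 4, 8, 3]`, matching the pandas output.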
