## Manipulating Data

In [None]:
.apply
.where
.value_counts

## Manipulation Methods

Pass in a NumPy function that works on the series, or a Python function that works on a single value. args and kwds are extra arguments passed to func. Returns a series, or a dataframe if func returns a series.

In [None]:
s.apply(func, convert_dtype=True, args=(), **kwds)
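A minimal sketch of both calling styles follows. The series values and the scale helper are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data
s = pd.Series([1.0, 4.0, 9.0])

# A NumPy function that works on the whole series
roots = s.apply(np.sqrt)

# A Python function that works on a single value;
# extra positional arguments are supplied through args
def scale(value, factor):
    return value * factor

scaled = s.apply(scale, args=(10,))
```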

Pass in a boolean series/dataframe, list, or callable as cond. Where cond is True, keep the original value; otherwise use the value from other. If cond is a callable, it takes a series and should return a boolean sequence.

In [None]:
s.where(cond, other=nan, inplace=False, axis=None, level=None, errors='raise', try_cast=False)
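As a small illustration with made-up values, the sketch below keeps entries above a threshold and substitutes 0 elsewhere, once with a boolean series and once with a callable.

```python
import pandas as pd

# Hypothetical sample data
s = pd.Series([5, 12, 7, 20])

# Boolean series as cond: keep values where True, use other where False
kept = s.where(s > 6, other=0)

# Callable as cond: it receives the series and returns a boolean series
kept_callable = s.where(lambda ser: ser > 6, other=0)
```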

Pass in a list of boolean arrays for condlist. Where a condition is True, use the corresponding value from choicelist. If multiple conditions are True, only the first is used. Where no condition is True, use default. Returns a NumPy array.

In [None]:
np.select(condlist, choicelist, default=0)
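The sketch below, using invented values and labels, shows the first-match rule: 3 and 8 satisfy both conditions but take the label of the first one. A string default is passed because the choices are strings.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data
s = pd.Series([3, 15, 42, 8])

# Conditions are checked in order; the first True one wins
labels = np.select(
    [s < 10, s < 20],
    ['small', 'medium'],
    default='large',
)
```

The result is a NumPy array, so wrap it in pd.Series(labels, index=s.index) if you need the original index back.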

Pass in a scalar, dict, series, or dataframe for value. If it is a scalar, use that value for every missing entry; otherwise, match on the index to decide which value fills each missing entry.

In [None]:
s.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None)
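A short sketch of the scalar and dict forms, with an invented series and index labels:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with labelled missing values
s = pd.Series([1.0, np.nan, 3.0, np.nan], index=['a', 'b', 'c', 'd'])

# Scalar: every missing entry gets the same value
filled_scalar = s.fillna(0)

# Dict: keys are index labels, values fill the missing entry at that label
filled_by_label = s.fillna({'b': 20.0, 'd': 40.0})
```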

Interpolate missing values. method may be 'linear' or 'time', among others.

In [None]:
s.interpolate(method='linear', axis=0, limit=None, inplace=False, limit_direction=None, limit_area=None, downcast=None, **kwargs)
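As an illustration on made-up data, the default linear method fills a gap with evenly spaced values, and limit caps how many consecutive missing values are filled.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with a two-value gap
s = pd.Series([1.0, np.nan, np.nan, 4.0])

# Linear interpolation fills the gap with evenly spaced values
linear = s.interpolate()

# limit=1 fills only the first missing value in the gap
limited = s.interpolate(limit=1)
```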

Return a new series with values clipped to lower and upper.

In [None]:
s.clip(lower=None, upper=None, axis=None, inplace=False, *args, **kwargs)
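A minimal sketch with invented values: anything below lower is raised to lower, and anything above upper is lowered to upper.

```python
import pandas as pd

# Hypothetical sample data
s = pd.Series([-5, 3, 12, 40])

# Values outside [0, 10] are pulled to the nearest bound
clipped = s.clip(lower=0, upper=10)
```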

Return a series with values sorted. The kind option may be 'quicksort', 'mergesort' (stable), or 'heapsort'. na_position indicates location of NaNs and may be 'first' or 'last'.

In [None]:
s.sort_values(axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last', ignore_index=False, key=None)
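The following sketch, on made-up data, shows the default sort, a descending sort with NaNs first, and a key function that sorts strings case-insensitively.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data
s = pd.Series([3.0, np.nan, 1.0, 2.0], index=['a', 'b', 'c', 'd'])

# Ascending by default, NaN placed last
ascending = s.sort_values()

# Descending, with NaN moved to the front
descending = s.sort_values(ascending=False, na_position='first')

# key receives the series and returns the values to sort by
names = pd.Series(['b', 'C', 'a'])
case_insensitive = names.sort_values(key=lambda ser: ser.str.lower())
```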

Return a series with index sorted. The kind option may be 'quicksort', 'mergesort' (stable), or 'heapsort'. na_position indicates location of NaNs and may be 'first' or 'last'.

In [None]:
s.sort_index(axis=0, level=None, ascending=True, inplace=False, kind='quicksort', na_position='last', sort_remaining=True, ignore_index=False, key=None)
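For comparison with sorting by value, this sketch (invented labels and values) reorders a series by its index.

```python
import pandas as pd

# Hypothetical sample data with an unsorted index
s = pd.Series([10, 20, 30], index=['b', 'c', 'a'])

# Rows are reordered by index label; values travel with their labels
by_label = s.sort_index()
by_label_desc = s.sort_index(ascending=False)
```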

Drop duplicates. keep may be 'first', 'last', or False. (If False, it removes every value that was duplicated.)

In [None]:
s.drop_duplicates(keep='first', inplace=False)
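The three keep options can be compared on a small made-up series:

```python
import pandas as pd

# Hypothetical sample data: 'x' and 'y' each appear twice
s = pd.Series(['x', 'y', 'x', 'z', 'y'])

keep_first = s.drop_duplicates(keep='first')  # first occurrence survives
keep_last = s.drop_duplicates(keep='last')    # last occurrence survives
keep_none = s.drop_duplicates(keep=False)     # only never-repeated values survive
```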

Return a series with numerical ranks. method specifies tie handling: 'average', 'min', 'max', 'first' (uses the order they appear in the series), or 'dense' (like 'min', but rank only increases by one after a tie). na_option specifies NaN handling: 'keep' (stay at NaN), 'top' (move to smallest), or 'bottom' (move to largest).

In [None]:
s.rank(axis=0, method='average', numeric_only=None, na_option='keep', ascending=True, pct=False)
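A sketch of how three of the tie-handling methods differ, using an invented series with one tie:

```python
import pandas as pd

# Hypothetical sample data: the two 20s tie for ranks 2 and 3
s = pd.Series([10, 20, 20, 30])

average = s.rank(method='average')  # tied values share the mean rank
minimum = s.rank(method='min')      # tied values share the lowest rank
dense = s.rank(method='dense')      # like 'min', but no gap after the tie
```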


Return a series with new values. to_replace can be many things. If it is a string, number, or regular expression, you can replace it with a scalar value. It can also be a list of those things, which requires value to be a list of the same size. Finally, it can be a dictionary mapping old values to new values.

In [None]:
s.replace(to_replace=None, value=None, inplace=False, limit=None, regex=False, method='pad')
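The sketch below walks through the scalar, list, dictionary, and regular-expression forms on made-up values.

```python
import pandas as pd

# Hypothetical sample data
s = pd.Series(['cat', 'dog', 'bird', 'cat'])

# Scalar for scalar
scalar = s.replace('cat', 'feline')

# Two lists of the same length, matched by position
paired = s.replace(['cat', 'dog'], ['feline', 'canine'])

# Dictionary mapping old values to new values
mapped = s.replace({'bird': 'avian'})

# Regular expression: regex=True treats to_replace as a pattern
pattern = s.replace(r'^c.*', 'C', regex=True)
```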

Bin values from x (a series). If bins is an integer, use equal-width bins. If bins is a list of numbers (defining minimum and maximum positions), use those for the edges. right defines whether the right edge is open or closed. labels allows you to specify the bin names. Out of bounds values will be missing.

In [None]:
pd.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False, duplicates='raise', ordered=True)
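A minimal sketch with invented values and labels: explicit edges produce the intervals (0, 10] and (10, 20], so 10 lands in the first bin and 25 falls outside every bin.

```python
import pandas as pd

# Hypothetical sample data
s = pd.Series([1, 10, 15, 25])

# Explicit edges with named bins; right=True closes the right edge
binned = pd.cut(s, bins=[0, 10, 20], labels=['low', 'high'])

# An integer asks for that many equal-width bins over the data range
equal_width = pd.cut(s, bins=2)
```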

Bin values from x (a series) into q equal sized bins (10 for deciles, 4 for quartiles). Alternatively, you can pass in a list of quantiles.

In [None]:
pd.qcut(x, q, labels=None, retbins=False, precision=3, duplicates='raise')
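To contrast with equal-width binning, this sketch (made-up data and labels) splits eight values into bins that each hold the same number of entries.

```python
import pandas as pd

# Hypothetical sample data: the integers 1 through 8
s = pd.Series(range(1, 9))

# q=4 gives quartiles, so each bin holds two of the eight values
quartiles = pd.qcut(s, q=4, labels=['q1', 'q2', 'q3', 'q4'])

# A list of quantiles sets the cut points directly
halves = pd.qcut(s, q=[0, 0.5, 1], labels=['lower', 'upper'])
```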

## Exercises

With a dataset of your choice:
1. Create a series from a numeric column that has the value of 'high' if it is equal to or above the mean and 'low' if it is below the mean using .apply.
2. Create a series from a numeric column that has the value of 'high' if it is equal to or above the mean and 'low' if it is below the mean using np.select.
3. Time the differences between the previous two solutions to see which is faster.
4. Replace the missing values of a numeric series with the median value.
5. Clip the values of a numeric series to between the 10th and 90th percentiles.
6. Using a categorical column, replace any value that is not in the top 5 most frequent values with 'Other'.
7. Using a categorical column, replace any value that is not in the top 10 most frequent values with 'Other'.
8. Make a function that takes a categorical series and a number (n) and returns a series that replaces any value that is not in the top n most frequent values with 'Other'.
9. Using a numeric column, bin it into 10 groups that have the same width.
10. Using a numeric column, bin it into 10 groups that each hold the same number of values.