# Solution 00: .str accessor

In [None]:
import pandas as pd

In [None]:
df = pd.read_csv("../data/raw/Bee Colony Census Data by County.csv")

In [None]:
df.head()

In [None]:
df['Value'].describe()

The top value is (D). What does that mean?
Most of the time you'll be working with well-documented data (hopefully!), so you can just read the data dictionary and find out what (D) stands for.
Here's [this dataset's](../data/documents/glossary-for-bee-stats.pdf).

Now that you know what the problem is you can address it.

Let's see how many rows have '(D)' as their 'Value'

In [None]:
df[df['Value'] == '(D)']

No rows?
Let's try something different.

In [None]:
df[df['Value'].str.contains('(D)', regex=False)] # regex=False: treat the parentheses as literal characters

So 2306 rows contain '(D)' in their 'Value', but none of them is exactly '(D)'.
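
Here is a minimal sketch, on made-up values rather than the bee data, of why the two filters can disagree. In a regular expression the parentheses in `'(D)'` form a group, so the pattern really just looks for the letter `D`; passing `regex=False` makes `.str.contains` search for the literal text instead.

```python
import pandas as pd

# Toy values invented for illustration; note the leading space in ' (D)'.
s = pd.Series(["10,012", " (D)", "DELTA", "42"])

exact = s == "(D)"                            # exact match: the leading space breaks it
literal = s.str.contains("(D)", regex=False)  # literal substring search
as_regex = s.str.contains("D")                # what the regex '(D)' effectively searches for

print(exact.tolist())
print(literal.tolist())
print(as_regex.tolist())
```

Only the literal search singles out the `' (D)'` entry; the regex version would also pick up any other value that happens to contain a capital D.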

Let's take a closer look at one.

In [None]:
df.iloc[5] # this grabs the row at position 5, which we saw is among those that have (D) as 'Value'

In [None]:
df.iloc[5]['Value'] # this grabs the 'Value' column of the row at index 5

![reaction](https://i.imgflip.com/wwnet.jpg)

The value is actually `' (D)'`, with a leading space, which is why comparing against `'(D)'` matched nothing.
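
Stray whitespace like this is easy to miss on screen. A small sketch, using invented values, of two ways to deal with it: `repr()` to make the whitespace visible, and `.str.strip()` to remove it from every string in a series.

```python
import pandas as pd

# Toy values invented for illustration.
s = pd.Series([" (D)", "10,012", " 57 "])

print(repr(s.iloc[0]))     # repr() shows the quotes, so the leading space is visible

stripped = s.str.strip()   # drop leading/trailing whitespace from every string
print((stripped == "(D)").tolist())
```

After stripping, an exact comparison with `'(D)'` works as you would have expected in the first place.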

In [None]:
df.loc[df['Value'] == ' (D)', 'Value'] = 0

The two parts of `.loc[]` are 1) the _row indexer_ and 2) the _column indexer_. <br>
The _row indexer_ here is `df['Value'] == ' (D)'`, which means "grab all the rows where `Value` == ' (D)'". <br>
The _column indexer_ is just `Value` because we only want to change the __series__ `Value` in those rows.
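
To see the two indexers separately, here is a sketch on a tiny invented dataframe (the column names mirror the bee data, but the rows and the `'missing'` placeholder are made up).

```python
import pandas as pd

# Toy dataframe invented for illustration.
toy = pd.DataFrame({"State": ["A", "B", "C"], "Value": [" (D)", "5", " (D)"]})

mask = toy["Value"] == " (D)"        # row indexer: a boolean Series, True where Value is ' (D)'
toy.loc[mask, "Value"] = "missing"   # column indexer: only the 'Value' column is assigned

print(toy)
```

Only the `Value` cells in the matching rows change; `State` is left alone because it is not named in the column indexer.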

You can set the (D) values to 0, as above, but that'd mess up your math, because every zero drags the averages down. What you really want to set them to is null, and `pandas` has a special way of denoting null: `NaN`. `NaN` means _Not a Number_, but you can't just write 'nan' as the value. To assign `NaN` values you use `np.nan` from `NumPy` (after `import numpy as np`). Older tutorials write `pd.np.nan`, which means "from `pandas` grab `numpy`, from `numpy` grab `nan`", but the `pd.np` alias was deprecated in pandas 1.0 and removed in 2.0.

`NumPy` is another library on top of which `pandas` is built. It's actually kinda great, and lots of science depends on it. Check out [numpy.org](https://numpy.org).
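
A small sketch of the difference between filling with 0 and filling with `NaN`, on invented values. It assumes `numpy` is imported as `np`; the variable names are made up for this example.

```python
import numpy as np
import pandas as pd

# Toy data invented for illustration: one withheld value and two real ones.
toy = pd.DataFrame({"Value": [" (D)", "10", "20"]})

with_zero = toy.copy()
with_zero.loc[with_zero["Value"] == " (D)", "Value"] = "0"    # pretend the value is zero

with_nan = toy.copy()
with_nan.loc[with_nan["Value"] == " (D)", "Value"] = np.nan   # mark the value as missing

mean_zero = with_zero["Value"].astype(float).mean()  # the zero counts as a real observation
mean_nan = with_nan["Value"].astype(float).mean()    # NaN is skipped by .mean()

print(mean_zero, mean_nan)
```

With the zero, the mean is taken over three values; with `NaN`, `pandas` skips the missing entry and averages only the two real ones.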

In [None]:
df['Value'].astype(float)

The error is that you cannot convert "10,012" to 10012 because of the comma. <br>
But there's an easy way to `.replace` a character in a string.

In [None]:
df['Value'].str.replace(',','') # look up .replace in python.

In [None]:
df['Value'] = df['Value'].str.replace(',','')

df['Value'].astype(float) # Voilà

In [None]:
df['Value'] = df['Value'].astype(float)
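
As an alternative to `.astype(float)`, `pd.to_numeric` with `errors='coerce'` turns anything it cannot parse into `NaN` instead of raising an error. This is a sketch on invented values, not the approach the notebook uses.

```python
import pandas as pd

# Toy values invented for illustration.
raw = pd.Series(["10,012", " (D)", "57"])

cleaned = raw.str.replace(",", "", regex=False)     # drop the thousands separators
numbers = pd.to_numeric(cleaned, errors="coerce")   # unparseable strings become NaN

print(numbers)
```

This handles the commas and the `' (D)'` markers in two lines, at the cost of silently hiding any other unexpected strings, so it is worth checking how many `NaN` values you end up with.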

Now you can do math.

In [None]:
df.groupby(['Year', 'State'])['Value'].mean()

In [None]:
ca_df = df[df['State'] == 'CALIFORNIA']

ca_df.head()

In [None]:
ca_df.groupby(['Year', 'Ag District'])['Value'].sum()

In [None]:
# a different view: group by district first, then by year

ca_df.groupby(['Ag District', 'Year'])['Value'].sum()

ca_df.groupby(['Ag District', 'Year'])['Value'].mean()

ca_df.groupby(['Ag District', 'Year'])['Value'].median()
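
If you want the sum, mean, and median side by side, `.agg` accepts a list of aggregation names. A sketch on a tiny invented dataframe (the district names and values are made up):

```python
import pandas as pd

# Toy data invented for illustration; one value is missing.
toy = pd.DataFrame({
    "Ag District": ["NORTH", "NORTH", "SOUTH", "SOUTH"],
    "Year": [2012, 2012, 2012, 2012],
    "Value": [10.0, 30.0, 5.0, float("nan")],
})

# One row per (district, year) group, one column per aggregation.
summary = toy.groupby(["Ag District", "Year"])["Value"].agg(["sum", "mean", "median"])

print(summary)
```

The missing value in the SOUTH group is skipped by all three aggregations, which is the behavior you get from `NaN` and would not get from a 0.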

***
***

## Solution 01: just ignore it.

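You can also just ignore anything that `contains` '(D)'. Reload the data first: `Value` was converted to float above, and the `.str` accessor only works on strings.

In [None]:
df = pd.read_csv("../data/raw/Bee Colony Census Data by County.csv")
dff = df[~df['Value'].str.contains('(D)', regex=False, na=False)].copy() # .copy() avoids a SettingWithCopyWarning below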

In [None]:
dff['Value'] = dff['Value'].str.replace(',','')

In [None]:
dff['Value'] = dff['Value'].astype(float)

In [None]:
dff.head()

In [None]:
dff[dff['State'] == 'CALIFORNIA'].groupby('Ag District')['Value'].mean()

This one filters the __dataframe__ `dff` down to California first. From there, you `groupby` 'Ag District', grab the 'Value' column, and calculate the mean for each group.
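
The same chain, split into named steps on a tiny invented dataframe so each stage can be inspected on its own (`ca_only` and `by_district` are names made up for this sketch):

```python
import pandas as pd

# Toy data invented for illustration.
toy = pd.DataFrame({
    "State": ["CALIFORNIA", "CALIFORNIA", "CALIFORNIA", "OREGON"],
    "Ag District": ["NORTH", "NORTH", "SOUTH", "COAST"],
    "Value": [10.0, 20.0, 7.0, 100.0],
})

ca_only = toy[toy["State"] == "CALIFORNIA"]                   # step 1: filter the rows
by_district = ca_only.groupby("Ag District")["Value"].mean()  # step 2: group, select, aggregate

print(by_district)
```

Because the filter runs first, the Oregon row never reaches the `groupby`, so its district does not appear in the result.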

In [None]:
dff[dff['State'] == 'CALIFORNIA'].groupby(['Year','Ag District'])['Value'].mean()

***
[Next notebook](04_index.ipynb)