# BW \#73 Avocado Hand
Avocados are becoming big business: In an NPR interview two years ago, an expert said that the US imports $2.8 billion in avocados each year from one state in Mexico. And where there's business, there's also crime; it would seem that some Mexican drug cartels have found that smuggling avocados can be as lucrative as the drug business.

But crime isn't the only problem that avocados have brought with them. They've also brought about a new type of injury, namely "avocado hand." 

According to a Washington Post article from June 26th (https://www.washingtonpost.com/wellness/2024/06/26/avocado-hand-injuries-knife/), researchers found that there were more than 50,000 avocado-related injuries between 1998 and 2017. People seem to really hurt themselves cutting avocados, especially because they're cutting toward their hands.

This week, we'll look at injury reports from the US Consumer Product Safety Commission (CPSC, at https://cpsc.gov), and specifically the National Electronic Injury Surveillance System (NEISS, https://www.cpsc.gov/Research--Statistics/NEISS-Injury-Data). The CPSC provides an annual report on injuries in the United States, and among other things, we'll look at avocado-related injuries over the last few years.

## Data and seven questions
The data we'll look at this week is from the database at NEISS:

https://www.cpsc.gov/cgibin/NEISSQuery/home.aspx

We want to download the annual archived data from 2020-2023. You can download them in Excel format, but you will be much happier with tab-delimited fields, believe me.

## Challenges
Learning goals include grouping, plotting, joining, and working with text.

- Load the data from 2020 - 2023 into a single data frame. Make sure that the Treatment_Date column is a datetime. Remove any rows in which the date is invalid or NA.
- In which month do we see the most accidents? The fewest?


In [1]:
import pandas as pd

In [7]:
data_2023 = pd.read_csv('neiss2023.tsv', sep='\t', low_memory=False)
data_2022 = pd.read_csv('neiss2022.tsv', sep='\t', low_memory=False)
data_2021 = pd.read_csv('neiss2021.tsv', sep='\t', low_memory=False)
data_2020 = pd.read_csv('neiss2020.tsv', sep='\t', low_memory=False)

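We combine the four data frames with pd.concat, then convert Treatment_Date with pd.to_datetime. The errors='coerce' keyword argument means that if an input string cannot be parsed as a datetime, it is replaced with NaT, the datetime equivalent of NaN. Note that dropna returns a new data frame rather than modifying df, so df keeps any NaT rows unless we assign the result back.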

In [27]:
df = pd.concat([data_2023, data_2022, data_2021, data_2020])
df['Treatment_Date'] = pd.to_datetime(df['Treatment_Date'], errors='coerce')
df.dropna(subset=['Treatment_Date'])

Unnamed: 0,CPSC_Case_Number,Treatment_Date,Age,Gender,Race,Other_Race,Hispanic,Body_Part,Diagnosis,Other_Diagnosis,...,Product_1,Product_2,Product_3,Alcohol,Drug,Narrative_1,Stratum,PSU,Weight,Sex
0,230106094,2023-01-01,34,2.0,1.0,,2.0,82.0,59.0,,...,478.0,0.0,0.0,0.0,0.0,34YOF WAS WASHING A GLASS THAT BROKE. DX: LA...,S,45.0,76.8216,
1,230106235,2023-01-01,11,2.0,1.0,,2.0,82.0,53.0,,...,3286.0,0.0,0.0,0.0,0.0,11YOF INVOLVED IN FOUR WHEELER TURNOVER.DX: ...,S,29.0,76.8216,
2,230106237,2023-01-01,12,2.0,1.0,,2.0,30.0,53.0,,...,3286.0,0.0,0.0,0.0,0.0,"12YOF WITH MVA, WAS PASSENGER ON A FOUR WHEELE...",S,29.0,76.8216,
3,230106559,2023-01-01,75,2.0,1.0,,2.0,76.0,59.0,,...,4008.0,4076.0,4074.0,0.0,0.0,"75YOF, AT *** VISITING HUSBAND, TRIPPED OVER B...",V,22.0,15.7688,
4,230106561,2023-01-01,38,2.0,1.0,,2.0,75.0,62.0,,...,698.0,0.0,0.0,0.0,0.0,"38YOF, FELL X 4 DAYS GETTING OUT OF HOT TUB, S...",V,22.0,15.7688,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
309365,210305200,2020-12-21,211,,4.0,,2.0,76.0,71.0,NO INJURY,...,4074.0,0.0,0.0,0.0,0.0,11MOM PRESENTS WITH INJURY TO FACE S/P FALLING...,C,10.0,4.8510,1.0
309366,210305201,2020-12-21,218,,2.0,,2.0,33.0,57.0,,...,4076.0,0.0,0.0,0.0,0.0,18MOF PRESENTS WITH FOREARM INJURY AFTER FALL ...,C,10.0,4.8510,2.0
309367,210305202,2020-12-21,15,,1.0,,1.0,83.0,64.0,,...,1884.0,0.0,0.0,0.0,0.0,15YOM PRESENTS WITH FOOT AND ELBOW PAIN. PT TW...,C,10.0,4.8510,1.0
309368,210305205,2020-12-21,3,,1.0,,2.0,75.0,59.0,,...,4076.0,4057.0,0.0,0.0,0.0,3YOF PRESENTS WITH LACERATION TO SCALP. PT FEL...,C,10.0,4.8510,2.0
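
Here is a minimal sketch of the same cleanup on a tiny made-up data frame (the values are invented for illustration, not taken from NEISS). It assigns the result of dropna to a new variable, since dropna does not change the original:

```python
import pandas as pd

# Made-up sample: the second and third dates cannot be parsed
sample = pd.DataFrame({
    'CPSC_Case_Number': [1, 2, 3, 4],
    'Treatment_Date': ['2023-01-15', 'not a date', None, '2023-05-02'],
})

# Unparseable values become NaT instead of raising an error
sample['Treatment_Date'] = pd.to_datetime(sample['Treatment_Date'],
                                          errors='coerce')

# dropna returns a new data frame; sample itself still has the NaT rows
cleaned = sample.dropna(subset=['Treatment_Date'])

print(len(sample), len(cleaned))
```

If we wanted to drop the rows for good, we could assign the result back to the same variable instead of to a new one.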


## In which month do we see the most accidents? The fewest?
In which month of the year does NEISS see the most accidents? To find out, we can use groupby, counting how often a given month shows up in our Treatment_Date column.

To extract the month from Treatment_Date, we'll use the dt accessor. This is our way to get that information from a datetime column, and we can use it inside of a groupby.

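We can thus group by month and invoke count on a column of our choice. Because count only counts non-missing values, we'll use CPSC_Case_Number, which we expect to be present on every row: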

In [31]:
df.groupby(df['Treatment_Date'].dt.month)['CPSC_Case_Number'].count()

Treatment_Date
1.0     105071
2.0     100557
3.0     108232
4.0     104908
5.0     119410
6.0     115895
7.0     117786
8.0     117627
9.0     117004
10.0    113552
11.0     99074
12.0     92301
Name: CPSC_Case_Number, dtype: int64
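
To see what the dt accessor and groupby are doing here, we can try them on a tiny made-up data frame. This is only a sketch; the case numbers and dates are invented for illustration:

```python
import pandas as pd

# Made-up sample with three injuries in May, one in July, one in December
toy = pd.DataFrame({
    'CPSC_Case_Number': [101, 102, 103, 104, 105],
    'Treatment_Date': pd.to_datetime(
        ['2022-05-01', '2023-05-17', '2021-12-25', '2020-05-30', '2022-07-04']),
})

# dt.month gives one month number per row, regardless of the year
months = toy['Treatment_Date'].dt.month

# Group the rows by those month numbers and count the case numbers in each group
per_month = toy.groupby(months)['CPSC_Case_Number'].count()
print(per_month)
```

Because we group on the month alone, injuries from different years land in the same group, which is what we want for a "month of the year" question.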

To get the months with the smallest and largest number of accidents, we can call agg on the series we got back from `groupby`, passing idxmin and idxmax to get the index labels (here, the month numbers) of the lowest and highest values:

In [33]:
(
    df
    .groupby(df['Treatment_Date'].dt.month)['CPSC_Case_Number'].count()
    .agg(['idxmin', 'idxmax'])
)

idxmin    12.0
idxmax     5.0
Name: CPSC_Case_Number, dtype: float64
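
As a quick check, we can rebuild the monthly counts as a plain series and apply the same methods. This is only a sketch; the series is typed in by hand from the groupby output above rather than computed from the data:

```python
import pandas as pd

# Monthly counts copied by hand from the groupby output
counts = pd.Series(
    [105071, 100557, 108232, 104908, 119410, 115895,
     117786, 117627, 117004, 113552, 99074, 92301],
    index=range(1, 13),
    name='CPSC_Case_Number',
)

# idxmin and idxmax return index labels (month numbers), not the counts
lowest_month = counts.idxmin()
highest_month = counts.idxmax()

# The same two calls in one step, as in the cell above
both = counts.agg(['idxmin', 'idxmax'])
print(lowest_month, highest_month)
print(both)
```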

## Correction

Instead of reading each file separately, we can load all four years with a list comprehension that reads each file directly from the NEISS site, and then pass the resulting list to pd.concat:

In [25]:
all_dfs = [pd.read_csv(f'https://www.cpsc.gov/cgibin/NEISSQuery/Data/Archived%20Data/{one_year}/neiss{one_year}.tsv', 
                       sep='\t',
                       low_memory=False)
           for one_year in range(2020, 2024)]


In [28]:
df = pd.concat(all_dfs)
df['Treatment_Date'] = pd.to_datetime(df['Treatment_Date'], errors='coerce')
df.dropna(subset=['Treatment_Date'])

Unnamed: 0,CPSC_Case_Number,Treatment_Date,Age,Sex,Race,Other_Race,Hispanic,Body_Part,Diagnosis,Other_Diagnosis,...,Product_1,Product_2,Product_3,Alcohol,Drug,Narrative_1,Stratum,PSU,Weight,Gender
0,200104302,2020-01-01,71,1.0,1.0,,2.0,75.0,62.0,,...,1893.0,1820.0,0.0,0.0,0.0,71YOM WAS AT HOME USING THE BATHROOM AND FELL ...,S,46.0,73.8005,
1,200104307,2020-01-01,208,1.0,0.0,,0.0,76.0,53.0,,...,4010.0,1807.0,0.0,0.0,0.0,8MOM ROLLED AND FELL OFF A MATTRESS STRIKING H...,S,46.0,73.8005,
2,200104308,2020-01-01,70,2.0,1.0,,2.0,35.0,71.0,PAIN,...,1842.0,0.0,0.0,0.0,0.0,70YOF TWISTED RIGHT KNEE AFTER MISSING THE LAS...,S,46.0,73.8005,
3,200104309,2020-01-01,24,1.0,1.0,,2.0,93.0,53.0,,...,131.0,0.0,0.0,0.0,0.0,24YOM DROPPED A PROPANE TANK ON RIGHT TOES DX:...,S,46.0,73.8005,
4,200104310,2020-01-01,28,2.0,1.0,,2.0,31.0,71.0,PAIN,...,3277.0,0.0,0.0,0.0,0.0,28YOF FELL OFF OF AN INVERSION TABLE WHEN HER ...,S,46.0,73.8005,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
338260,240303438,2023-12-31,83,,0.0,,0.0,75.0,53.0,,...,1878.0,0.0,0.0,0.0,0.0,83 YOM HERE FOR A FALL. HE GOT UP FROM BED TO ...,S,48.0,72.0202,1.0
338261,240303439,2023-12-31,17,,0.0,,0.0,37.0,64.0,,...,1205.0,0.0,0.0,0.0,0.0,"17 YOM HERE FOR RIGHT ANKLE PAIN, 2 DAYS PTA H...",S,48.0,72.0202,1.0
338262,240303442,2023-12-31,97,,0.0,,0.0,85.0,71.0,DIZZINESS,...,4076.0,0.0,0.0,0.0,0.0,"97 YOF HERE VIA EMS FOR A FALL MONTHS PTA, SHE...",S,48.0,72.0202,2.0
338263,240303444,2023-12-31,53,,0.0,,0.0,85.0,65.0,,...,380.0,613.0,0.0,0.0,0.0,"53 YOM HERE VIA EMS FOR SMOKE INHALATION, CEIL...",S,48.0,72.0202,1.0
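
The list comprehension and concat pattern can be tried without downloading anything. Here is a minimal sketch that uses in-memory tab-delimited text in place of the NEISS files; the two tiny "files" and the names fake_files, frames, combined, and combined_fresh are made up for illustration:

```python
import io
import pandas as pd

# Made-up stand-ins for two yearly tab-delimited files, keyed by year
fake_files = {
    2020: "CPSC_Case_Number\tTreatment_Date\n1\t2020-01-01\n2\t2020-06-15\n",
    2021: "CPSC_Case_Number\tTreatment_Date\n3\t2021-03-09\n",
}

# One data frame per year, built with a list comprehension
frames = [pd.read_csv(io.StringIO(fake_files[one_year]), sep='\t')
          for one_year in sorted(fake_files)]

# concat stacks the frames and keeps each frame's own index labels
combined = pd.concat(frames)
print(combined.index.tolist())

# ignore_index=True renumbers the rows from 0 instead
combined_fresh = pd.concat(frames, ignore_index=True)
print(combined_fresh.index.tolist())
```

Repeated index labels don't affect the groupby on Treatment_Date, but they matter if we later select rows by label with loc.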


Alternatively, we can sort the counts in descending order with `sort_values` and then use `iloc` to retrieve the first and last values:

In [32]:
(
    df
    .groupby(df['Treatment_Date'].dt.month)['CPSC_Case_Number'].count()
    .sort_values(ascending=False)
    .iloc[[0, -1]]
)

Treatment_Date
5.0     119410
12.0     92301
Name: CPSC_Case_Number, dtype: int64
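
One difference between the two approaches is that iloc on the sorted series gives us the counts along with the month numbers, while idxmin and idxmax give only the month numbers. A small sketch, using four of the monthly counts typed in by hand from the output above:

```python
import pandas as pd

# Four of the monthly counts, copied by hand (month number -> count)
counts = pd.Series({5: 119410, 8: 117627, 11: 99074, 12: 92301})

# Sort from highest to lowest, then take the first and last rows by position
extremes = counts.sort_values(ascending=False).iloc[[0, -1]]
print(extremes)

# The labels match what idxmax and idxmin report on the unsorted series
print(extremes.index.tolist() == [counts.idxmax(), counts.idxmin()])
```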