## Using pandas to study plants

These datasets compile sample records of archaeobotanical (plant) remains from archaeological sites in southwest Asia, central Anatolia and Cyprus dated to the Pre-Pottery Neolithic or earlier. They were downloaded from the UK Archaeology Data Service collection "Origins of agriculture: archaeobotanical database".

In [21]:
import pandas as pd
import matplotlib.pyplot as plt

Open the "plant_taxa.csv" file and the "taxa_groups.csv" file using the pd.read_csv() function. Don't forget to make sure the data frames have separate names!

In [41]:
df = pd.read_csv('plant_taxa.csv', encoding='utf-8')
group_df = pd.read_csv('taxa_groups.csv', encoding='utf-8')
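The `Unnamed: 0` column that appears in the outputs suggests that each CSV file stores a saved row index as its first column. A minimal sketch, assuming that is the case, passes `index_col=0` to `pd.read_csv()` so that column becomes the row index instead of a data column. The two-row CSV text below is made-up sample data standing in for the real file.

```python
import io
import pandas as pd

# Hypothetical stand-in for plant_taxa.csv: the first, unnamed column is a saved row index.
csv_text = (
    ",RecordID,OriginalTaxon,PlantPart\n"
    "0,CATE-1,Descurainia,seed\n"
    "1,CATE-1,Triticum dicoccum,grain\n"
)

# Default read: the unnamed first column shows up as a data column called "Unnamed: 0".
raw = pd.read_csv(io.StringIO(csv_text))

# With index_col=0 the same column is used as the row index instead.
sample = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(raw.columns))
print(list(sample.columns))
```

Either form works for the exercises that follow; the second just keeps the index out of the list of data columns.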

Look at the first five rows of the plant-taxa dataframe using the .head() method.

In [50]:
df.head()

Unnamed: 0,RecordID,Spcode,OriginalTaxon,IDlevel,IDnotes,PlantPart,Preservation,Quantification,ScoringSystem,NumberInd,QuantificationNotes,MNPP
0,CATE-1,1,Descurainia,G,,seed,c,1,MNI by weight,2666121.0,convert weight into an estimated no of whole s...,2666121
1,CATE-1,2,Triticum dicoccum,sp,,grain,c,1,MNI,68.0,,68
2,CATE-1,3,Triticum dicoccum,sp,,embryo end frgs,c,1,MNI,264.0,,264
3,CATE-1,4,Triticum dicoccum,sp,,apex end frgs,c,1,MNI,112.0,,112
4,CATE-1,5,Triticum dicoccum,sp,,grain,c,1,fragments,116.0,,116


You might notice that one of the columns contains the name of the plant. Which column do you think it is? ____________________________
OriginalTaxon.

Now, let's look only at the observations whose taxon name is "Arnebia decumbens", a plant used for dyeing. Use the .head() method to look at the first 5 observations of Arnebia decumbens.

In [52]:
df[df['OriginalTaxon'] == 'Arnebia decumbens'].head()

Unnamed: 0,RecordID,Spcode,OriginalTaxon,IDlevel,IDnotes,PlantPart,Preservation,Quantification,ScoringSystem,NumberInd,QuantificationNotes,MNPP
509,WJ07-17,510,Arnebia decumbens,sp,,seed,c,1,MNI,3.0,,3
537,WJ07-14,538,Arnebia decumbens,sp,,seed,c,1,MNI,4.0,,4
566,WJ07-15,567,Arnebia decumbens,sp,,seed,c,1,MNI,2.0,,2
583,WJ07-16,584,Arnebia decumbens,sp,,seed,c,1,MNI,1.0,,1
611,WJ07-22,612,Arnebia decumbens,sp,,seed,c,1,MNI,1.0,,1


How many observations do we have for Arnebia decumbens?

In [53]:
len(df[df['OriginalTaxon'] == 'Arnebia decumbens'])

279

From the head, it looks like all of them are seeds. Is that true? 

In [54]:
df[df['OriginalTaxon'] == 'Arnebia decumbens']['PlantPart'].value_counts()

seed       278
nutlets      1
Name: PlantPart, dtype: int64

No, one of them is a nutlet!

We think (but are not certain) that NumberInd refers to the number of Arnebia decumbens individuals found at the site at each dig. What is the average number of seeds that we tend to find? Before we calculate this, we need to exclude the observation whose PlantPart is "nutlets"!

In [65]:
arnebia = df[df['OriginalTaxon'] == 'Arnebia decumbens']
# keep only the seed rows, then average their NumberInd values
arnebia.loc[arnebia['PlantPart'] == 'seed', 'NumberInd'].mean()

32.32608695652174
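An alternative to filtering out the nutlet row by hand is to group the records by plant part and average each group separately. The sketch below uses a small made-up table (the values are illustrative, not taken from the real dataset) to show the idea with `groupby()`.

```python
import pandas as pd

# Made-up records for illustration; not values from the real dataset.
toy = pd.DataFrame({
    "PlantPart": ["seed", "seed", "seed", "nutlets"],
    "NumberInd": [3.0, 4.0, 2.0, 10.0],
})

# One mean per plant part, so seeds and nutlets are never mixed together.
mean_by_part = toy.groupby("PlantPart")["NumberInd"].mean()
print(mean_by_part)
```

The result is a Series indexed by plant part, so `mean_by_part["seed"]` gives the seed average directly.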

What percent of the Arnebia decumbens findings were seeds? Is this the same percent of seeds as in the overall database?

In [67]:
# fraction of Arnebia decumbens records that are seeds
len(arnebia['NumberInd'].loc[arnebia['PlantPart']=='seed'])/len(arnebia)

0.996415770609319

In [78]:
# fraction of records in the overall database that are seeds
sum(df['PlantPart']=='seed')/len(df)

0.5234960542491176
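The two fractions above divide a count of matching rows by the total number of rows. A shorter way to get the same kind of number, sketched here on a made-up Series rather than the real data, is to take the mean of the boolean mask, or to ask `value_counts()` for shares with `normalize=True`.

```python
import pandas as pd

# Made-up plant parts for illustration.
parts = pd.Series(["seed", "seed", "grain", "seed"], name="PlantPart")

# True counts as 1 and False as 0, so the mean of the mask is the share of seeds.
seed_share = (parts == "seed").mean()

# value_counts(normalize=True) reports the share of every category at once.
shares = parts.value_counts(normalize=True)

print(seed_share)
print(shares)
```

Multiply either result by 100 to express it as a percentage.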

Look at the first five rows of the taxa_groups data. Then, look at the 'DomProgWild' column. How many distinct values does this column take on?

In [79]:
group_df.head()

Unnamed: 0,ProjectTaxon,Family,Genus,IDtype,DomProgWild
0,Abutilon teophrasti,Malvaceae,Abutilon,Single ID,Wild
1,Acacia/Prosopis,Fabaceae,Acacia/Prosopis,Multi ID,Wild
2,Achillea,Asteraceae,Achillea,Single ID,Wild
3,Achillea wilhelmsii,Asteraceae,Achillea,Single ID,Wild
4,Adonis,Ranunculaceae,Adonis,Single ID,Wild


In [80]:
group_df['DomProgWild'].value_counts()

Wild                                    623
Domesticated crop                        29
Presumed Crop Domesticate/Progenitor     18
Crop Progenitor                          10
Crop Domesticate/Progenitor              10
Progenitor/Wild                           4
Crop progenitor                           2
Domesticate crop                          1
Likely crop/progenitor                    1
Domesticate/Crop progenitor               1
.Indet                                    1
Name: DomProgWild, dtype: int64

What percentage of the plants are wild, and what percent of the plants are domesticated?

In [83]:
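# share of wild taxa (not displayed: a notebook cell shows only the value of its last expression)
sum(group_df['DomProgWild']=='Wild')/len(group_df)

# share of domesticated taxa (exact match on the label 'Domesticated crop')
sum(group_df['DomProgWild']=='Domesticated crop')/len(group_df)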

0.041428571428571426
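The domesticated share depends on which labels are counted as "domesticated", because the DomProgWild column mixes spellings and combined categories. The sketch below rebuilds the label counts from the value_counts() output above and compares three possible readings. Which labels belong together is an assumption made for illustration, not something the dataset documents.

```python
import pandas as pd

# Label counts copied from the value_counts() output above.
counts = pd.Series({
    "Wild": 623,
    "Domesticated crop": 29,
    "Presumed Crop Domesticate/Progenitor": 18,
    "Crop Progenitor": 10,
    "Crop Domesticate/Progenitor": 10,
    "Progenitor/Wild": 4,
    "Crop progenitor": 2,
    "Domesticate crop": 1,
    "Likely crop/progenitor": 1,
    "Domesticate/Crop progenitor": 1,
    ".Indet": 1,
})

total = counts.sum()

# Reading 1: exact match on "Domesticated crop", as in the cell above.
strict_share = counts["Domesticated crop"] / total

# Reading 2 (assumption): "Domesticate crop" is a spelling variant of the same category.
variant_share = counts[["Domesticated crop", "Domesticate crop"]].sum() / total

# Reading 3 (assumption): count every label that mentions "domesticate", ignoring case.
mentions = counts.index.str.lower().str.contains("domesticate")
broad_share = counts[mentions].sum() / total

print(strict_share, variant_share, broad_share)
```

The three shares differ, so any reported percentage should state which labels were included.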

Discussion questions: Which value of the DomProgWild column did we use to calculate the domesticated percentage? Why did or didn't we use certain values? What errors might there be in the NumberInd column?