# Assignment 4: Data Analysis and Visualization

Use as many Python and markdown cells per question as you deem necessary. **DO NOT SUBMIT CODE THAT DOES NOT RUN.** You will lose points for code that throws errors. 

The data you will work with was taken from [Alaskan vegetation plots](https://daac.ornl.gov/ABOVE/guides/Arrigetch_Peaks_Veg_Plots.html) from 1978-1981. The data set is in the `data/` subdirectory in this repo in two .csv files containing information about research plots and the plant species covering the plots. **Please read the descriptions for the data, as they will help you answer the questions.** 

**Table 1: Data files**
| Data File Name |	Description |
| --- | --- |
| Arrigetch_Peaks_Environmental_Data.csv| Environmental characterization data for Arrigetch Peaks research plots |
| Arrigetch_Peaks_Species_Data.csv | Species cover data for Arrigetch Peaks research plots|

**Table 2. Arrigetch_Peaks_Environmental_Data.csv**
| Column Name	| Units	| Description |
| --- | --- | --- |
| TURBOVEG_PLOT_NUMBER	 |	 | TURBOVEG plot number |
| PLANT_COMMUNITY_NAME	|  |	Primary vegetation types |
| ELEVATION |	m	| Elevation of the plots |
| ASPECT	| deg	| Aspect of the plots |
| SLOPE	| deg	| Slope of the plots |
| COVER_LITTER_LAYER	| % |	Percentage of litter layer cover in the plot |
|COVER_OPEN_WATER	| % | 	Percentage of open water cover in the plot |
| COVER_ROCK	| % |	Percentage of rock cover in the plot |
| COVER_CRUST	| % |	Percentage of crust cover in the plot |
| COVER_BARE_SOIL	| % |	Percentage of bare soil cover in the plot|
| REMARKS	 |  |	Field notes |

**Table 3. Arrigetch_Peaks_Species_Data.csv**
| Column Name	| Units	| Description |
| --- | --- | --- |
| TURBOVEG_PLOT_NUMBER	 |	 | TURBOVEG plot number |
| species name | | Data values are species cover classes: r (rare), + (common, but less than 1% cover), 1 (1-5%), 2 (6-25%), 3 (25-50%), 4 (51-75%), 5 (76-100%). |

## Question 1: Pandas (15 pt)

Load the two data sets into Python with Pandas. Name the environmental data frame `env`, and the species data frame `species`. Display the first few rows of each data frame. What are the dimensions of the two data frames? (2 pt)

Replace all values in both data frames that are `-9999` with `np.NaN`. (1 pt)

Print how many unique plant community names there are. (1 pt)

Print summary statistics for all numerical columns in `env`, excluding `"TURBOVEG_PLOT_NUMBER"`. (2 pt)

Merge the two data frames together by the column `TURBOVEG_PLOT_NUMBER`. (1 pt)

How many rows in the merged data frame contain missing data? (1 pt)

Which species was present in the most plots? (3 pt)

For all rows in `species`, calculate the sum of all the columns (excluding `"TURBOVEG_PLOT_NUMBER"`) for each row. Add this sum as a new column called `"totals"`. (2 pt)

Read the description included above for the `species` data frame. Are there any inconsistencies between the description and the data? Explain. If there are inconsistencies, what would you do to correct them? (2 pt)

In [170]:
import pandas as pd
import numpy as np
data_dir = "data/"  # relative path to the repo's data/ subdirectory; assumes the notebook is run from the repo root
envdata = data_dir + "Arrigetch_Peaks_Environmental_Data.csv"
env = pd.read_csv(envdata)
speciesdata = data_dir + "Arrigetch_Peaks_Species_Data.csv"
species = pd.read_csv(speciesdata)
print(env.head())
print(species.head())

   TURBOVEG_PLOT_NUMBER                            PLANT_COMMUNITY_NAME  \
0                 10925  Ass. Umbilicarietum pensylvanicae-carolinianae   
1                 10926  Ass. Umbilicarietum pensylvanicae-carolinianae   
2                 10927  Ass. Umbilicarietum pensylvanicae-carolinianae   
3                 10928  Ass. Umbilicarietum pensylvanicae-carolinianae   
4                 10929  Ass. Umbilicarietum pensylvanicae-carolinianae   

   ELEVATION  ASPECT  SLOPE  COVER_LITTER_LAYER  COVER_OPEN_WATER  COVER_ROCK  \
0       1090      45  -9999                   0                 0           0   
1        920     315  -9999                   0                 0           0   
2        940     270  -9999                   0                 0           0   
3        950     225  -9999                   0                 0           0   
4        935     270  -9999                   0                 0           0   

   COVER_CRUST  COVER_BARE_SOIL  \
0            0             

In [171]:
#shape 
print('Species shape')
print(species.shape)
print('Environmental Data Shape')
print(env.shape)


Species shape
(439, 409)
Environmental Data Shape
(439, 11)
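
The `-9999` values visible in the `SLOPE` column above are a missing-value sentinel. As an alternative to replacing them after loading, `pd.read_csv` can convert them while reading through its `na_values` parameter. This is a minimal sketch that uses a small in-memory CSV with made-up rows (not the real file) so that it runs on its own:

```python
import io

import pandas as pd

# Made-up rows for illustration only; the real data live in the data/ CSV files.
csv_text = """TURBOVEG_PLOT_NUMBER,ELEVATION,SLOPE
1,1090,-9999
2,920,10
3,-9999,5
"""

# na_values tells read_csv which additional values to treat as missing
demo = pd.read_csv(io.StringIO(csv_text), na_values=[-9999])

print(demo.isna().sum())  # number of missing values per column
```

With the assignment files, the same parameter could be passed to the existing `pd.read_csv` calls in place of the later `replace` step.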


In [172]:
# replace the -9999 missing-value sentinel with NaN
# (np.nan is used because the np.NaN alias was removed in NumPy 2.0)
env = env.replace(-9999, np.nan)
species = species.replace(-9999, np.nan)
print(env)

     TURBOVEG_PLOT_NUMBER                            PLANT_COMMUNITY_NAME  \
0                   10925  Ass. Umbilicarietum pensylvanicae-carolinianae   
1                   10926  Ass. Umbilicarietum pensylvanicae-carolinianae   
2                   10927  Ass. Umbilicarietum pensylvanicae-carolinianae   
3                   10928  Ass. Umbilicarietum pensylvanicae-carolinianae   
4                   10929  Ass. Umbilicarietum pensylvanicae-carolinianae   
..                    ...                                             ...   
434                 11359        Carex podocarpa-Salix rotundifolia comm.   
435                 11360     Senecio tomentosus-Salix rotundifolia comm.   
436                 11361     Senecio tomentosus-Salix rotundifolia comm.   
437                 11362     Senecio tomentosus-Salix rotundifolia comm.   
438                 11363                Ass. Andreaetum blytti-rupestris   

     ELEVATION  ASPECT  SLOPE  COVER_LITTER_LAYER  COVER_OPEN_WATER  \
0   

In [173]:
# number of unique plant community names
print(len(pd.unique(env['PLANT_COMMUNITY_NAME'])))

51

In [174]:
# summary statistics for env; listing the numeric columns explicitly leaves out TURBOVEG_PLOT_NUMBER
columns = ['ELEVATION', 'ASPECT', 'SLOPE', 'COVER_LITTER_LAYER', 'COVER_OPEN_WATER', 'COVER_ROCK', 'COVER_CRUST', 'COVER_BARE_SOIL']
env[columns].describe()

| | ELEVATION | ASPECT | SLOPE | COVER_LITTER_LAYER | COVER_OPEN_WATER | COVER_ROCK | COVER_CRUST | COVER_BARE_SOIL |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 374.0 | 244.0 | 147.0 | 439.0 | 439.0 | 439.0 | 439.0 | 439.0 |
| mean | 1111.802139 | 184.241803 | 0.0 | 36.91344 | 0.0 | 13.936219 | 4.341686 | 14.321185 |
| std | 227.223605 | 103.151667 | 0.0 | 30.059329 | 0.0 | 18.16667 | 12.113122 | 17.990339 |
| min | 730.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 940.0 | 135.0 | 0.0 | 8.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| 50% | 1050.0 | 180.0 | 0.0 | 30.0 | 0.0 | 8.0 | 0.0 | 8.0 |
| 75% | 1270.0 | 270.0 | 0.0 | 65.0 | 0.0 | 20.0 | 0.0 | 20.0 |
| max | 1920.0 | 360.0 | 0.0 | 100.0 | 0.0 | 95.0 | 80.0 | 85.0 |


In [175]:
#merge
merged_data = pd.merge(left = env, right = species, left_on = "TURBOVEG_PLOT_NUMBER", right_on = "TURBOVEG_PLOT_NUMBER")
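
Because both files are keyed by plot number, the merge should pair each row of `env` with exactly one row of `species`. The following is a hedged sketch of how that assumption could be checked with the `validate` and `indicator` parameters of `pd.merge`; the two small frames and their values are invented for illustration:

```python
import pandas as pd

# Toy stand-ins for env and species (made-up values, not the assignment data)
env_demo = pd.DataFrame({"TURBOVEG_PLOT_NUMBER": [1, 2, 3],
                         "ELEVATION": [900, 950, 1000]})
species_demo = pd.DataFrame({"TURBOVEG_PLOT_NUMBER": [1, 2, 4],
                             "Cetraria islandica": [1.0, 0.0, 5.0]})

# validate="one_to_one" raises MergeError if a plot number is duplicated on either side;
# how="outer" with indicator=True shows which plots appear in only one table
check = pd.merge(env_demo, species_demo, on="TURBOVEG_PLOT_NUMBER",
                 how="outer", validate="one_to_one", indicator=True)

print(check["_merge"].value_counts())
```

If every row is labelled `both`, the default inner merge used in the notebook drops nothing.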


In [176]:
#missing data
print(len(merged_data.loc[merged_data.isnull().any(axis = 1),:]))

422
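
The count above comes from flagging every row that has at least one missing value. The same boolean mask can be summed directly, and a per-column count shows where the gaps are. A small sketch on an invented frame (the values are not taken from the assignment data):

```python
import numpy as np
import pandas as pd

# Made-up values for illustration
demo = pd.DataFrame({
    "ELEVATION": [1090.0, np.nan, 940.0],
    "SLOPE": [np.nan, np.nan, 0.0],
    "COVER_ROCK": [0.0, 5.0, 10.0],
})

# any(axis=1) marks rows with at least one NaN; summing counts True as 1
rows_with_missing = demo.isnull().any(axis=1).sum()

# number of missing values in each column
missing_per_column = demo.isnull().sum()

print(rows_with_missing)
print(missing_per_column)
```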


In [177]:
# number of plots in which each species has a cover value greater than 0
species_pop = (species.drop(columns='TURBOVEG_PLOT_NUMBER') > 0).sum()

# species present in the most plots (idxmax returns the first one if there is a tie)
biggest_species = species_pop.idxmax()
biggest_species_pop = species_pop.max()
print(biggest_species)

Cetraria islandica


In [180]:
# sum the cover values of every species column in each row (plot);
# 'totals' is dropped too if it already exists, so re-running the cell gives the same result
species_cols = species.drop(columns=['TURBOVEG_PLOT_NUMBER', 'totals'], errors='ignore')
species['totals'] = species_cols.sum(axis=1)
print(species)


     TURBOVEG_PLOT_NUMBER  Abietinella abietina  Acarospora schleicheri  \
0                   10925                   0.0                     0.0   
1                   10926                   0.0                     0.0   
2                   10927                   0.0                     0.0   
3                   10928                   0.0                     0.0   
4                   10929                   0.0                     0.0   
..                    ...                   ...                     ...   
434                 11359                   0.0                     0.0   
435                 11360                   0.0                     0.0   
436                 11361                   0.0                     0.0   
437                 11362                   0.5                     0.0   
438                 11363                   1.0                     0.0   

     Aconitum delphinifolium delphinifolium  Alectoria ochroleuca  \
0                             

In [None]:
# Description of the data
# The data do not match the description. The description says the values are cover classes
# (r, +, and 1-5), but the data frame holds numeric values. Many entries are 0, a value the
# description does not define; it presumably means the species is absent from the plot.
# There are also values far larger than 5, such as 10, 45, and 60, so the numbers look like
# percent cover rather than classes. To correct this, either update the description or convert
# each value to its class: 1-5% to 1, 6-25% to 2, 25-50% to 3, 51-75% to 4, and 76-100% to 5.
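
The conversion described above can be done without an explicit loop using `pd.cut`. This is a minimal sketch under stated assumptions, not a definitive recoding: the helper name `cover_to_class` is hypothetical, the values are assumed to be percent cover, 0 is assumed to mean absent, values strictly between 0 and 1 are assumed to belong to the `+` class, and a value of exactly 25 (which the description lists under both class 2 and class 3) is placed in class 2. The `r` (rare) class cannot be recovered from percentages.

```python
import numpy as np
import pandas as pd


def cover_to_class(percent: pd.Series) -> pd.Series:
    """Map percent cover to cover-class labels (hypothetical helper)."""
    # Bins are [1, 5], (5, 25], (25, 50], (50, 75], (75, 100]
    classes = pd.cut(
        percent,
        bins=[1, 5, 25, 50, 75, 100],
        labels=["1", "2", "3", "4", "5"],
        include_lowest=True,
    ).astype(object)
    # Assumption: values strictly between 0 and 1 are the "+" class
    classes[(percent > 0) & (percent < 1)] = "+"
    # Assumption: 0 means the species is absent from the plot
    classes[percent == 0] = "0"
    return classes


# Made-up values for illustration; NaN stays NaN
demo = pd.Series([0, 0.5, 1, 5, 10, 45, 60, 100, np.nan])
print(cover_to_class(demo).tolist())
```

Applied to the notebook's data, the helper could be called on each species column, for example with `DataFrame.apply`.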

## Question 2: Plotting (15 pt)

Make a figure showing the relationship between elevation and cover rock percentage. Is there a positive relationship, negative relationship, or no relationship between the two variables? (3 pt)

Make a figure showing the distribution of the `"totals"` column you created in the `species` data frame. Print summary statistics for this column, as well. (3 pt)

Create a subset of `env` containing rows with the plant community names `"Caricetum scirpoideae-rupestris"`,`"Pedicularo kanei-Caricetum glacialis"`, and `"Saxifrago tricuspidatae-Artemisietum alaskanae"`. (2 pt)

Create a figure to compare the mean cover bare soil percentage of the plant communities. Describe what the figure tells us: are there differences among the plant communities in cover bare soil percentage? Which has the highest median value? The lowest? Are there differences in the spread among the communities? (4 pt)

For all figures, label your axes descriptively with units. If necessary, create legends. Make your figures large enough to be easily readable, and **make sure that no text is overlapping**. Save all figures, and make sure to commit them (3 pt).




In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

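# 2D histogram of rock cover against elevation
plt.figure(figsize=(8, 6))
sns.histplot(data=merged_data, x='COVER_ROCK', y='ELEVATION')
plt.xlabel('Rock cover (%)')
plt.ylabel('Elevation (m)')  # Table 2 gives elevation in metres
plt.title('Rock Cover vs Elevation')
plt.savefig('rock_cover_vs_elevation.png')  # file name chosen here; any descriptive name works
print('There seems to be an inverse relationship between elevation and rock coverage')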
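
A visual impression of the relationship can be backed up with a correlation coefficient. The sketch below is illustrative only: it uses invented numbers rather than the assignment data, so its result says nothing about the real plots. On the notebook's data the equivalent call would be `env['ELEVATION'].corr(env['COVER_ROCK'])`.

```python
import pandas as pd

# Made-up values for illustration
demo = pd.DataFrame({
    "ELEVATION": [800, 900, 1000, 1100, 1200],
    "COVER_ROCK": [40, 35, 20, 15, 5],
})

# Pearson correlation: near +1 is a positive relationship, near -1 a negative one,
# and near 0 means no linear relationship
r = demo["ELEVATION"].corr(demo["COVER_ROCK"])
print(round(r, 2))
```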

In [None]:
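# distribution of the summed cover values per plot
plt.figure(figsize=(8, 6))
sns.histplot(species['totals'])
plt.xlabel('Summed species cover per plot')
plt.ylabel('Number of plots')
plt.title('Distribution of Total Species Cover per Plot')
plt.savefig('totals_distribution.png')  # file name chosen here; any descriptive name works
print(species['totals'].describe())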

In [207]:
# keep only the rows of env that belong to the three plant communities named in the question
communities = ['Caricetum scirpoideae-rupestris',
               'Pedicularo kanei-Caricetum glacialis',
               'Saxifrago tricuspidatae-Artemisietum alaskanae']
specific_plants = env.loc[env['PLANT_COMMUNITY_NAME'].isin(communities)]
# isin() filters rows; adding data frames with + would add their values element-wise instead of stacking the rows


In [216]:
print('Saxifrago tricuspidatae-Artemisietum alaskanae has the greatest median value. The other two communities are relatively close to each other in terms of median. There are differences in the coverage and spread of the communities.')

Saxifrago tricuspidatae-Artemisietum alaskanae has the greatest median value. The other two communities are relatively close to each other in terms of median. There are differences in the coverage and spread of the communities.
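
Statements about medians and spread like the ones above can be checked numerically with a grouped summary. This is a minimal sketch, assuming a frame with the `PLANT_COMMUNITY_NAME` and `COVER_BARE_SOIL` columns; the rows are invented and the community labels A, B, and C are placeholders, not the real community names:

```python
import pandas as pd

# Made-up values for illustration
demo = pd.DataFrame({
    "PLANT_COMMUNITY_NAME": ["A", "A", "A", "B", "B", "B", "C", "C", "C"],
    "COVER_BARE_SOIL": [5, 10, 15, 20, 40, 60, 1, 2, 3],
})

grouped = demo.groupby("PLANT_COMMUNITY_NAME")["COVER_BARE_SOIL"]
summary = pd.DataFrame({
    "mean": grouped.mean(),
    "median": grouped.median(),
    # interquartile range as a simple measure of spread
    "iqr": grouped.quantile(0.75) - grouped.quantile(0.25),
})

print(summary)
print(summary["median"].idxmax())  # community with the highest median
print(summary["median"].idxmin())  # community with the lowest median
```

On the notebook's data, `demo` would be replaced by the three-community subset of `env`.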
