salmon-genomic-vulnerability

Code and statistical analysis for assessing North American Atlantic salmon population structure and its association with environmental variation.

This is a work in progress. The numbers at the beginning of the script names indicate their order of execution.

Analysis TODO

explore_tony_env_gen_pairing.r - figuring out what is in the data
- looks like 3865 individuals have complete ENV and genetic information. Intermediate versions of these files are made as input to the subsequent analyses
Run PCAs on the ENV and Genetic information for matched samples - 02_pca_on_matched_data.r
Pair data and setup for the RDA - ``
RDA 5k SNP subset, paired data - 04_RDA_5k_random_snp_set.r

reuse old cunner scripts for this, save a version of the dfs to manageable-sized tsv files (leading cols and PCs only)

RDA full set, paired data - 05_RDA_full_snp_set.r
RDA PCA of full SNP, paired data - 06_RDA_with_PCs_for_genetics.r
- this one I did with only PC1 and PC2 as the inputs; it looked very odd / non-informative, with only the two grey points on the RDA axes for the inputs
RDA full SNP set with Lat Long correction, paired data - 07_RDA_full_snp_with_lat_lon_correction.r
- RDA, repeated with a correction matrix based on the latitude and longitude of individuals
k means clustering of PC1/PC2
- see if the three groups from the eye test emerge quantitatively
- make a file with the cluster assignments, metadata, and the PCs
- plot to map if there are groups.
- try with the 5455 individuals, as well as the reduced data set (paired to env data only).
  - do they tell the same story? hopefully they will
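A minimal sketch of the clustering step, to make the "do three groups emerge" check concrete. It is written in plain Python rather than the R used by the scripts in this repo, and the function name `kmeans` and the farthest-point initialisation are my own assumptions, not what the analysis scripts do.

```python
import math

def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm on (PC1, PC2) tuples. Returns (labels, centers)."""
    # Assumption: farthest-point initialisation, chosen so the result is deterministic.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = None
    for _ in range(max_iter):
        # Assignment step: each individual goes to its nearest center.
        new_labels = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        if new_labels == labels:
            break  # assignments stopped changing, so the centers are stable too
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centers
```

Running this once on the full set of individuals and once on the ENV-paired subset, then cross-tabulating the labels for the shared individuals, would be one way to check whether the two data sets tell the same story.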
[x] Plot the cluster assignments on a map
- strong geographic component, no differen
[x] Plot the PC loadings as a manhattan plot
- see if either PC1 or PC2 jump out as centered on chromosome 1/23. This would indicate the rearrangement is driving some of the observed population structure.
Compare the clusters via pairwise fst
- find the most efficient way to do this?
- need a way to take the categories and starting data and run this. I'm guessing Plink may be a good route, but could do vcf conversion and previous methods I'd used for cunner as well.
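For reference, a rough sketch of what "take the categories and starting data and run this" could look like without any external tool. It uses a simple Nei-style estimator, (H_T - H_S) / H_T summed over SNPs, with no sample-size correction, so it is not the Weir and Cockerham estimator and the values will differ from dedicated software. Genotypes are assumed to be coded 0/1/2 (count of one allele) with `None` for missing; all function names are hypothetical.

```python
from itertools import combinations

def allele_freq(genos):
    """Frequency of the counted allele among called genotypes (0/1/2)."""
    called = [g for g in genos if g is not None]
    return sum(called) / (2 * len(called)) if called else None

def pairwise_fst(geno_a, geno_b):
    """geno_a, geno_b: per-SNP lists of genotypes for two clusters, SNPs in the same order."""
    num = den = 0.0
    for snp_a, snp_b in zip(geno_a, geno_b):
        pa, pb = allele_freq(snp_a), allele_freq(snp_b)
        if pa is None or pb is None:
            continue  # skip SNPs with no calls in one of the clusters
        p_bar = (pa + pb) / 2
        h_t = 2 * p_bar * (1 - p_bar)                       # expected het, pooled
        h_s = (2 * pa * (1 - pa) + 2 * pb * (1 - pb)) / 2   # mean within-cluster het
        num += h_t - h_s
        den += h_t
    return num / den if den > 0 else float("nan")

def all_pairs_fst(clusters):
    """clusters: dict of cluster label -> per-SNP genotype lists."""
    return {(a, b): pairwise_fst(clusters[a], clusters[b])
            for a, b in combinations(sorted(clusters), 2)}
```

Summing numerator and denominator across SNPs before dividing (a ratio of sums) keeps rare-variant SNPs from dominating the genome-wide value.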
Intra group and full LD analysis
Admixture run with k = 3

see if the different intermediate regions are resolved well (match the k means clustering?) and whether some individuals in the middle look to be

Intra group and overall heterozygosity

can this be calculated easily in plink? need to look into best methods
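Whatever tool ends up being used, the quantity itself is simple. Here is a sketch of observed heterozygosity per individual, averaged within each group and overall. It assumes genotypes coded 0/1/2 with 1 as the heterozygote and `None` for missing calls; the function names and the `"overall"` key are hypothetical.

```python
from collections import defaultdict

def individual_het(genos):
    """Fraction of called genotypes that are heterozygous (coded 1)."""
    called = [g for g in genos if g is not None]
    if not called:
        return None
    return sum(1 for g in called if g == 1) / len(called)

def mean_het_by_group(genotypes, groups):
    """genotypes: id -> list of 0/1/2/None; groups: id -> group label."""
    per_group = defaultdict(list)
    for ind, genos in genotypes.items():
        h = individual_het(genos)
        if h is not None:
            per_group[groups[ind]].append(h)
            per_group["overall"].append(h)  # assumes no group is itself named "overall"
    return {label: sum(vals) / len(vals) for label, vals in per_group.items()}
```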

inspect the regions of interest, obtain list of locations to check the annotation for

load in the PCA P values, the LD regions, and the Fst peaks
from each of the above, design selection criteria and subset the required information (chromosome, bp).
determine the overlap between the different groups above
load in the annotation file, look at what the peaks
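One possible shape for the overlap step, assuming the PCA and Fst hits have each been reduced to (chromosome, bp) pairs and the LD regions to (chromosome, start, end) triples. The function names and the chromosome labels in the usage example are made up for illustration.

```python
def in_any_region(chrom, bp, regions):
    """True if the position falls inside any (chrom, start, end) region, inclusive."""
    return any(c == chrom and start <= bp <= end for c, start, end in regions)

def overlap_candidates(pca_hits, fst_hits, ld_regions):
    """Positions flagged by both PCA and Fst that also sit inside an LD region."""
    shared = set(pca_hits) & set(fst_hits)
    return sorted(pos for pos in shared if in_any_region(*pos, ld_regions))

# Hypothetical inputs, not real results.
pca_hits = [("ssa01", 100), ("ssa01", 5000), ("ssa23", 250)]
fst_hits = [("ssa01", 100), ("ssa23", 250), ("ssa05", 10)]
ld_regions = [("ssa01", 50, 200), ("ssa23", 1000, 2000)]
candidates = overlap_candidates(pca_hits, fst_hits, ld_regions)
```

The resulting list of positions is what would be checked against the annotation file.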

[] PCA, for each of the three major clusters and with the admixed individuals removed

i.e. on the 80% home q individuals and none of the mixed ones
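A sketch of the filter, under my reading that "home q" means the admixture proportion an individual has for its own (home) cluster. It assumes a Q table of id -> list of K proportions and a separate id -> home cluster index mapping; both structures and the function name are hypothetical.

```python
def keep_home_individuals(q, home, min_q=0.8):
    """Group individuals by home cluster, keeping only those with home Q >= min_q."""
    kept = {}
    for ind, props in q.items():
        cluster = home[ind]
        if props[cluster] >= min_q:
            kept.setdefault(cluster, []).append(ind)
    return kept
```

Each list in the result would then feed its own within-cluster PCA.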

Additional analyses to do

[] Fst redo, but with a quantified threshold in lieu of the 0.15 cutoff
[] LD redo, but with the non-LD filtered SNP set (record numbers and such)
[] heterozygosity analyzed on a per location basis (n = 94, separate subfolder)
[] take in the new lists produced above, and re-run the gene query
[] per location, and per cluster effective population size
Look at the "New method to quantify climactic differences" and start working through this in the EDA-new_method_idea.r file.
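For the Fst redo item in the list above, one candidate for a "quantified threshold" is mean plus k standard deviations of the genome-wide per-SNP Fst values. This is only one option (a fixed upper percentile is another), and both the choice of rule and the default `k=3` are assumptions on my part.

```python
import statistics

def fst_threshold(values, k=3.0):
    """Outlier cutoff: mean + k * population SD of the per-SNP Fst values."""
    return statistics.mean(values) + k * statistics.pstdev(values)

def outlier_indices(values, k=3.0):
    """Indices of SNPs whose Fst exceeds the data-derived cutoff."""
    cutoff = fst_threshold(values, k)
    return [i for i, v in enumerate(values) if v > cutoff]
```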

#raw data I started with

ClimateDataSalmon_1980_2022.csv - here are the columns in the file: "","KeyID","Name","Latitude","Longitude","Elevation","X","P","Replication","Year","Month","Day","Minimum.Air.Temperature","Air.Temperature","Maximum.Air.Temperature","Total.Precipitation","Dew.Point.Temperature","Relative.Humidity","Wind.Direction","Solar.Radiation","Atmospheric.Pressure","Snow.Precipitation","Snow.Depth.Accumulation","Snow.Water.Equivalent","Wind.Speed.at.2.meters"
CIGENE_220K_Metadata_All_May18_2022 - metadata on fish and locations

from BioSIM, made by Ben Marquis
- can calculate things per day/month/year etc. (tempmin / tempmax).
  - could calculate things like drought events too based on the precipitation.
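As a sketch of the per-month summaries and a drought-style metric, using the column names listed above for ClimateDataSalmon_1980_2022.csv. The "longest dry spell" definition and the `dry_below=1.0` cutoff are hypothetical choices (the precipitation units are assumed, not checked), and this is plain Python rather than the R used elsewhere in the repo.

```python
import csv
from collections import defaultdict

def monthly_summary(rows):
    """rows: dicts keyed by the CSV column names. Returns {(Name, Year, Month): stats}."""
    acc = defaultdict(lambda: {"tmin": float("inf"), "tmax": float("-inf"), "precip": 0.0})
    for r in rows:
        a = acc[(r["Name"], int(r["Year"]), int(r["Month"]))]
        a["tmin"] = min(a["tmin"], float(r["Minimum.Air.Temperature"]))
        a["tmax"] = max(a["tmax"], float(r["Maximum.Air.Temperature"]))
        a["precip"] += float(r["Total.Precipitation"])
    return dict(acc)

def longest_dry_spell(daily_precip, dry_below=1.0):
    """Longest run of consecutive days with precipitation under the cutoff."""
    longest = run = 0
    for p in daily_precip:
        run = run + 1 if p < dry_below else 0
        longest = max(longest, run)
    return longest
```

With the real file, `rows` could come from `csv.DictReader(open("ClimateDataSalmon_1980_2022.csv", newline=""))`; the daily values passed to `longest_dry_spell` need to be in date order for a single location.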

wild_Salmo220K_NA_EU_filtered_noadmixQ5 - The filtered NA and EU wild fish genotypes
- see: https://github.com/CNuge/salmon-euro-introgression/blob/main/scripts/wild_baseline_220k_admix_pca.r for the code on the filtering steps that were conducted.
