Code and statistical analysis for assessing North American Atlantic salmon population structure and its association with environmental variation.
This is a work in progress. The numbers at the beginning of the script names indicate their order of execution.
- explore_tony_env_gen_pairing.r - figuring out what is in the data
  - looks like 3865 individuals with complete ENV and Genetic information. Making intermediate versions of these files for input to the subsequent analyses
- Run PCAs on the ENV and Genetic information for matched samples - 02_pca_on_matched_data.r
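
A minimal sketch of the PCA step, not the contents of 02_pca_on_matched_data.r. The matrices `geno` and `env`, their sizes, and the environmental column names are simulated placeholders; the real inputs would be the matched intermediate files.

```r
# Hypothetical stand-ins for the matched data: 50 samples, 20 SNPs coded 0/1/2,
# and 4 environmental variables. Swap in the real intermediate files.
set.seed(1)
n <- 50
geno <- matrix(rbinom(n * 20, size = 2, prob = 0.3), nrow = n)
env  <- matrix(rnorm(n * 4), nrow = n,
               dimnames = list(NULL, c("temp_min", "temp_max", "precip", "snow")))

# Drop monomorphic SNPs, which cannot be scaled to unit variance
geno <- geno[, apply(geno, 2, var) > 0, drop = FALSE]

pca_gen <- prcomp(geno, center = TRUE, scale. = TRUE)
pca_env <- prcomp(env,  center = TRUE, scale. = TRUE)

# Proportion of variance explained by each PC
var_explained <- function(p) p$sdev^2 / sum(p$sdev^2)
pve_gen <- var_explained(pca_gen)
pve_env <- var_explained(pca_env)

# Sample IDs plus the first two PCs of each PCA, small enough to save as a tsv
pcs <- data.frame(sample  = seq_len(n),
                  gen_PC1 = pca_gen$x[, 1], gen_PC2 = pca_gen$x[, 2],
                  env_PC1 = pca_env$x[, 1], env_PC2 = pca_env$x[, 2])
```
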
- Pair data and set up for the RDA - ``
- RDA 5k SNP subset, paired data - 04_RDA_5k_random_snp_set.r
  - reuse old cunner scripts for this, save a version of the data frames to manageable-sized tsv files (leading cols and PCs only)
- RDA full set, paired data - 05_RDA_full_snp_set.r
- RDA PCA of full SNP, paired data - 06_RDA_with_PCs_for_genetics.r
  - this one I did with only PC1 and PC2 as the inputs. It looked very odd / uninformative, with only the two gray points on the RDA axes for the inputs
- RDA full SNP set with Lat Long correction, paired data - 07_RDA_full_snp_with_lat_lon_correction.r
  - RDA, repeated with a correction matrix based on the latitude and longitude of individuals
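
For reference, a base-R sketch of what a lat/lon-corrected (partial) RDA does. This is not the code in 07_RDA_full_snp_with_lat_lon_correction.r, which may well use a dedicated package. `Y`, `X`, and `Z` are simulated placeholders for the genotypes, environmental variables, and coordinates.

```r
set.seed(2)
n <- 60
Z <- cbind(lat = runif(n, 43, 60), lon = runif(n, -70, -52))       # conditioning matrix
X <- cbind(temp = rnorm(n) + 0.1 * Z[, "lat"], precip = rnorm(n))  # environment
Y <- matrix(rbinom(n * 30, 2, 0.4), nrow = n)                      # genotypes coded 0/1/2

# Step 1: remove the part of Y and X that latitude/longitude can explain
Y_res <- residuals(lm(Y ~ Z))
X_res <- residuals(lm(X ~ Z))

# Step 2: regress the residual genotypes on the residual environment
Y_fit <- fitted(lm(Y_res ~ X_res))

# Step 3: PCA of the fitted values gives the constrained (RDA) axes
rda_axes <- prcomp(Y_fit, center = FALSE)
n_axes <- ncol(X)                                  # at most one axis per predictor
lc_scores <- rda_axes$x[, seq_len(n_axes)]         # linear-combination sample scores

# Share of total genotype variance explained by environment after conditioning
total_var <- sum(apply(Y, 2, var))
constrained_var <- sum(apply(Y_fit, 2, var))
prop_constrained <- constrained_var / total_var
```

With random placeholder data `prop_constrained` should be small; the point is only to show where the lat/lon correction enters (step 1).
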
- k means clustering of PC1/PC2
- see if the three groups from eye test emerge quantitatively
- make a file with the cluster assignments, metadata, and the PCs
- plot to map if there are groups.
- try with the 5455 individuals, as well as the reduced data set (paired to env data only).
  - do they tell the same story? hopefully they will
- [x] Plot the cluster assignments on a map
  - strong geographic component, no differen
- [x] Plot the PC loadings as a manhattan plot
  - see if either PC1 or PC2 jumps out as centered on chromosome 1/23. This would indicate the rearrangement is driving some of the observed population structure.
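
A rough sketch of the k-means check on PC1/PC2, assuming a data frame of PC scores plus metadata. The `pcs` data frame, its three simulated groups, and the output file name are all hypothetical.

```r
set.seed(3)
# Hypothetical PC scores: three separated groups of 40 individuals each
centres <- rbind(c(-5, 0), c(0, 5), c(5, 0))
pcs <- data.frame(
  PC1 = rnorm(120, mean = rep(centres[, 1], each = 40)),
  PC2 = rnorm(120, mean = rep(centres[, 2], each = 40)),
  Latitude  = runif(120, 43, 60),    # placeholder metadata
  Longitude = runif(120, -70, -52)
)

# Total within-cluster sum of squares for k = 1..6, to see whether k = 3
# is supported by more than the eye test ("elbow" check)
wss <- sapply(1:6, function(k) {
  kmeans(pcs[, c("PC1", "PC2")], centers = k, nstart = 25)$tot.withinss
})

km <- kmeans(pcs[, c("PC1", "PC2")], centers = 3, nstart = 25)
pcs$cluster <- factor(km$cluster)

# Save assignments with metadata and PCs, then plot on lon/lat as a crude map
# write.table(pcs, "pc_cluster_assignments.tsv", sep = "\t",
#             quote = FALSE, row.names = FALSE)
# plot(pcs$Longitude, pcs$Latitude, col = pcs$cluster, pch = 19)
```

The same code can be run on both the full set of individuals and the env-paired subset to compare the resulting assignments.
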
- Compare the clusters via pairwise Fst
  - find the most efficient way to do this?
  - need a way to take the categories and starting data and run this. I'm guessing PLINK may be a good route, but could also do vcf conversion and the previous methods I used for cunner.
- Intra group and full LD analysis
- Admixture run with k = 3
  - see if the different intermediate regions are resolved well (match the k means clustering?) and whether some individuals in the middle look to be
- Intra group and overall heterozygosity
  - can this be calculated easily in plink? need to look into best methods
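
PLINK 1.9 does have a `--het` report, but as a quick check observed heterozygosity can also be computed directly from a 0/1/2 genotype matrix. The sketch below assumes that coding (1 = heterozygote). `geno` and `cluster` are simulated placeholders.

```r
set.seed(4)
# Hypothetical data: 90 individuals x 200 SNPs, three clusters of 30
geno <- matrix(rbinom(90 * 200, size = 2, prob = 0.3), nrow = 90)
geno[sample(length(geno), 100)] <- NA          # a few missing calls
cluster <- rep(c("A", "B", "C"), each = 30)

# Observed heterozygosity per individual:
# share of non-missing genotype calls that are heterozygous
het_ind <- rowMeans(geno == 1, na.rm = TRUE)

# Intra-group and overall summaries
het_by_cluster <- tapply(het_ind, cluster, mean)
het_overall <- mean(het_ind)
```
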
- inspect the regions of interest, obtain list of locations to check the annotation for
  - load in the PCA P values, the LD regions, and the Fst peaks
  - from each of the above, design selection criteria and subset the required information (chromosome, bp).
  - determine the overlap between the different groups above
  - load in the annotation file, look at what the peaks
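
One possible way to do the overlap step once each analysis has been reduced to chromosome/bp pairs. The three data frames and their values are made up for illustration, and the column names `chr` and `bp` are assumptions.

```r
# Hypothetical hit lists after applying each selection criterion
pca_hits <- data.frame(chr = c(1, 1, 23, 5), bp = c(1000, 52000, 300, 777))
ld_hits  <- data.frame(chr = c(1, 23, 9),    bp = c(52000, 300, 42))
fst_hits <- data.frame(chr = c(23, 1, 2),    bp = c(300, 52000, 10))

# Build a "chr:bp" key so positions can be compared across lists
key <- function(df) paste(df$chr, df$bp, sep = ":")

# Positions flagged by all three analyses
shared_all <- Reduce(intersect, list(key(pca_hits), key(ld_hits), key(fst_hits)))

# Membership table: which analyses flagged each position
all_keys <- unique(c(key(pca_hits), key(ld_hits), key(fst_hits)))
membership <- data.frame(
  snp = all_keys,
  pca = all_keys %in% key(pca_hits),
  ld  = all_keys %in% key(ld_hits),
  fst = all_keys %in% key(fst_hits)
)
membership$n_sets <- rowSums(membership[, c("pca", "ld", "fst")])
```

Exact position matching only works if all three lists are SNP-based; LD regions given as intervals would need a range overlap instead.
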
- [ ] PCA, for each of the three major clusters and with the admixed individuals removed
  - i.e. on the 80% home q individuals and none of the mixed ones
- [ ] Fst redo, but with a quantified threshold in lieu of the 0.15 cutoff
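
One option for a quantified threshold is an empirical quantile of the per-SNP Fst distribution. The 99th percentile used below is an assumption for illustration, not a chosen criterion, and `fst` is simulated.

```r
set.seed(5)
# Hypothetical per-SNP Fst values: mostly low background plus 100 high values
fst <- c(rbeta(9900, 1, 30), runif(100, 0.3, 0.8))

# Data-driven cutoff: top 1% of the empirical distribution
threshold <- unname(quantile(fst, probs = 0.99, na.rm = TRUE))
outliers <- which(fst > threshold)

# For comparison, the number of SNPs passing the fixed 0.15 cutoff
n_fixed <- sum(fst > 0.15)
```
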
- [ ] LD redo, but with the non-LD filtered SNP set (record numbers and such)
- [ ] heterozygosity analyzed on a per location basis (n = 94, separate subfolder)
- [ ] take in the new lists produced above, and re-run the gene query
- [ ] per location, and per cluster effective population size
- Look at the "New method to quantify climactic differences" and start working through this in the EDA-new_method_idea.r file.
- ClimateDataSalmon_1980_2022.csv - here are the columns in the file:
"","KeyID","Name","Latitude","Longitude","Elevation","X","P","Replication","Year","Month","Day","Minimum.Air.Temperature","Air.Temperature","Maximum.Air.Temperature","Total.Precipitation","Dew.Point.Temperature","Relative.Humidity","Wind.Direction","Solar.Radiation","Atmospheric.Pressure","Snow.Precipitation","Snow.Depth.Accumulation","Snow.Water.Equivalent","Wind.Speed.at.2.meters"
- CIGENE_220K_Metadata_All_May18_2022 - metadata on fish and locations
  - from BioSIM, made by Ben Marquis
  - can calculate things per day/month/year etc. (tempmin / tempmax).
  - could calculate things like drought events too based on the precipitation.
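
A sketch of the per-month summaries using column names from ClimateDataSalmon_1980_2022.csv. The data here are simulated, and the "dry month" rule (monthly precipitation below the site's 10th percentile) is an arbitrary placeholder, not an established drought definition.

```r
set.seed(6)
# Simulated daily records with the same column names as the climate file
days <- expand.grid(Day = 1:28, Month = 1:12, Year = 1980:1981,
                    Name = c("site_A", "site_B"))
clim <- data.frame(
  days,
  Minimum.Air.Temperature = rnorm(nrow(days), mean = 0,  sd = 5),
  Maximum.Air.Temperature = rnorm(nrow(days), mean = 10, sd = 5),
  Total.Precipitation     = rgamma(nrow(days), shape = 0.5, scale = 4)
)
# Real data: clim <- read.csv("ClimateDataSalmon_1980_2022.csv")

# Monthly minimum, maximum, and total precipitation per site
tmin <- aggregate(Minimum.Air.Temperature ~ Name + Year + Month, data = clim, FUN = min)
tmax <- aggregate(Maximum.Air.Temperature ~ Name + Year + Month, data = clim, FUN = max)
prcp <- aggregate(Total.Precipitation ~ Name + Year + Month, data = clim, FUN = sum)
monthly <- Reduce(function(a, b) merge(a, b, by = c("Name", "Year", "Month")),
                  list(tmin, tmax, prcp))

# Placeholder drought flag: months drier than the site's own 10th percentile
q10 <- ave(monthly$Total.Precipitation, monthly$Name,
           FUN = function(x) quantile(x, 0.10))
monthly$dry_month <- monthly$Total.Precipitation < q10
```

Swapping `Month` out of the formulas gives yearly summaries; adding `Day` back gives the daily values unchanged.
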
- wild_Salmo220K_NA_EU_filtered_noadmixQ5 - The filtered NA and EU wild fish genotypes
  - see: https://github.com/CNuge/salmon-euro-introgression/blob/main/scripts/wild_baseline_220k_admix_pca.r for the code on the filtering steps that were conducted.