CNuge/salmon-genomic-vulnerability

salmon-genomic-vulnerability

Code and statistical analysis for assessing North American Atlantic salmon population structure and its association with environmental variation.

This is a work in progress. The numbers at the beginning of the script names indicate their order of execution.

Analysis TODO

  1. explore_tony_env_gen_pairing.r - figure out what is in the data

    • 3865 individuals appear to have complete environmental (ENV) and genetic information. Intermediate versions of these files are saved as input to the subsequent analyses.
  2. Run PCAs on the ENV and genetic information for matched samples - 02_pca_on_matched_data.r

  3. Pair data and set up for the RDA - ``

  4. RDA 5k SNP subset, paired data - 04_RDA_5k_random_snp_set.r

    • reuse the old cunner scripts for this; save a version of the data frames to manageable-sized TSV files (leading columns and PCs only)
  5. RDA full set, paired data - 05_RDA_full_snp_set.r

  6. RDA PCA of full SNP, paired data - 06_RDA_with_PCs_for_genetics.r

    • done with only PC1 and PC2 as the inputs; the result looked very odd and uninformative, with only the two grey points on the RDA axes for the inputs
  7. RDA full SNP set with lat/long correction, paired data - 07_RDA_full_snp_with_lat_lon_correction.r

    • RDA, repeated with a correction matrix based on the latitude and longitude of individuals
  8. k-means clustering of PC1/PC2

    • see if the three groups from the eye test emerge quantitatively

    • make a file with the cluster assignments, metadata, and the PCs

    • plot on a map if there are groups.

    • try with the 5455 individuals, as well as the reduced data set (paired to env data only).

      • do they tell the same story? Hopefully they will.

    [x] Plot the cluster assignments on a map

    • strong geographic component, no differen

    [x] Plot the PC loadings as a Manhattan plot

    • see if either PC1 or PC2 jumps out as centered on chromosome 1/23. This would indicate the rearrangement is driving some of the observed population structure.
  9. Compare the clusters via pairwise Fst

    • find the most efficient way to do this?
    • need a way to take the categories and starting data and run this. PLINK may be a good route, but VCF conversion and the methods previously used for cunner are also an option.
  10. Intra-group and full LD analysis

  11. Admixture run with k = 3

    • see if the different intermediate regions are resolved well (match the k-means clustering?) and whether some individuals in the middle look to be
  12. Intra-group and overall heterozygosity
    • can this be calculated easily in PLINK? Need to look into the best methods.
  13. Inspect the regions of interest, obtain a list of locations to check the annotation for
    • load in the PCA P values, the LD regions, and the Fst peaks
    • from each of the above, design selection criteria and subset the required information (chromosome, bp).
    • determine the overlap between the different groups above
    • load in the annotation file, look at what the peaks
  14. [] PCA, for each of the three major clusters and with the admixed individuals removed
    • i.e. on the 80% home q individuals and none of the mixed ones
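
As a rough illustration of the k-means step on PC1/PC2 listed above, here is a minimal sketch in plain Python. It is not the repository's R code: the function names, the toy PC coordinates, and the deterministic farthest-point initialization are all assumptions made for the example, and real runs would read the PC scores from the saved files.

```python
import math


def farthest_point_init(points, k):
    """Pick k starting centroids deterministically (an assumption of this sketch)."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next centroid: the point farthest from its nearest chosen centroid.
        nxt = max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        centroids.append(nxt)
    return centroids


def kmeans(points, k, n_iter=100):
    """Lloyd's algorithm on (PC1, PC2) tuples; returns (labels, centroids)."""
    centroids = farthest_point_init(points, k)
    labels = None
    for _ in range(n_iter):
        # Assignment step: each individual goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments stable, so we have converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centroids


# Hypothetical (PC1, PC2) scores for nine individuals in three loose groups.
toy_pcs = [
    (-5.0, 0.1), (-5.2, -0.2), (-4.8, 0.0),
    (0.0, 4.9), (0.2, 5.1), (-0.1, 5.0),
    (5.1, -0.1), (4.9, 0.2), (5.0, 0.0),
]
labels, centroids = kmeans(toy_pcs, k=3)
```

The resulting `labels` list can be joined to the metadata and PCs to build the cluster-assignment file, and the same call can be repeated on the full and the ENV-paired sets to check whether they tell the same story.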

Additional analyses to do

  • [] Fst redo, but with a quantified threshold in lieu of the 0.15 cutoff

  • [] LD redo, but with the non-LD filtered SNP set (record numbers and such)

  • [] heterozygosity analyzed on a per location basis (n = 94, separate subfolder)

  • [] take in the new lists produced above, and re-run the gene query

  • [] per location, and per cluster effective population size

  • Look at the "New method to quantify climatic differences" and start working through this in the EDA-new_method_idea.r file.
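
For the Fst redo item above, one possible way to replace the fixed 0.15 cutoff is an empirical upper quantile of the per-SNP Fst values. The sketch below is only an assumption about how such a threshold could be quantified, not the method chosen for this project; the function name, the SNP identifiers, the Fst values, and the 0.85 quantile are all hypothetical.

```python
def quantile_threshold(values, q):
    """Empirical q-quantile using linear interpolation between sorted values."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    frac = pos - lo
    return ordered[lo] + frac * (ordered[hi] - ordered[lo])


# Hypothetical per-SNP Fst values keyed by a made-up SNP identifier.
toy_fst = {
    "snp01": 0.01, "snp02": 0.02, "snp03": 0.03, "snp04": 0.04,
    "snp05": 0.05, "snp06": 0.06, "snp07": 0.07, "snp08": 0.08,
    "snp09": 0.09, "snp10": 0.20, "snp11": 0.40,
}

# Threshold taken from the data instead of a hard-coded cutoff.
threshold = quantile_threshold(toy_fst.values(), q=0.85)

# SNPs above the threshold are the candidate Fst peaks.
outliers = sorted(snp for snp, fst in toy_fst.items() if fst > threshold)
```

A quantile keeps the number of flagged SNPs proportional to the data set, which makes comparisons between cluster pairs easier than a single absolute cutoff.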

Raw data I started with

  1. ClimateDataSalmon_1980_2022.csv - the columns in the file are: "","KeyID","Name","Latitude","Longitude","Elevation","X","P","Replication","Year","Month","Day","Minimum.Air.Temperature","Air.Temperature","Maximum.Air.Temperature","Total.Precipitation","Dew.Point.Temperature","Relative.Humidity","Wind.Direction","Solar.Radiation","Atmospheric.Pressure","Snow.Precipitation","Snow.Depth.Accumulation","Snow.Water.Equivalent","Wind.Speed.at.2.meters"

  2. CIGENE_220K_Metadata_All_May18_2022 - metadata on fish and locations

  • from BioSIM, made by Ben Marquis
    • can calculate things per day/month/year etc. (tempmin / tempmax).
      • could also calculate things like drought events based on the precipitation.
  3. wild_Salmo220K_NA_EU_filtered_noadmixQ5 - the filtered NA and EU wild fish genotypes
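
To illustrate the per-month summaries mentioned for the climate data, here is a minimal Python sketch that averages the daily minimum and maximum air temperature per location and month. The column names come from the list above; the toy rows, the `monthly_means` function name, and the choice of a simple mean are assumptions for the example only.

```python
import csv
import io
from collections import defaultdict

# Hypothetical daily records using a subset of the real column names.
toy_csv = io.StringIO(
    "KeyID,Year,Month,Day,Minimum.Air.Temperature,Maximum.Air.Temperature\n"
    "1,1980,1,1,-10.0,-2.0\n"
    "1,1980,1,2,-12.0,0.0\n"
    "1,1980,2,1,-8.0,1.0\n"
    "2,1980,1,1,-5.0,3.0\n"
)


def monthly_means(handle):
    """Return {(KeyID, Year, Month): (mean tmin, mean tmax)} from daily rows."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])  # running tmin sum, tmax sum, day count
    for row in csv.DictReader(handle):
        key = (row["KeyID"], int(row["Year"]), int(row["Month"]))
        acc = sums[key]
        acc[0] += float(row["Minimum.Air.Temperature"])
        acc[1] += float(row["Maximum.Air.Temperature"])
        acc[2] += 1
    return {k: (tmin / n, tmax / n) for k, (tmin, tmax, n) in sums.items()}


means = monthly_means(toy_csv)
```

Dropping `Month` from the key gives yearly summaries, and the same accumulator pattern could be applied to `Total.Precipitation` when looking for drought events.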

About

Associating genomic variation with environment
