# Information about maize inbred genotypes

I found two important references:

* [Romay et al. 2013](https://link.springer.com/article/10.1186/gb-2013-14-6-r55)
* [Flint-Garcia et al. 2005](https://onlinelibrary.wiley.com/doi/full/10.1111/j.1365-313X.2005.02591.x), the original Goodman study

## Romay paper (2013)

Importing the Romay supplementary table:

In [2]:
import pandas as pd

#genotype_info_romay_df = pd.read_csv("/home/rsantos/Repositories/maize_microbiome_transcriptomics/genotype_information/gb-2013-14-6-r55-S1.csv", sep='\t')
genotype_info_romay_df = pd.read_csv("/home/santosrac/Repositories/maize_microbiome_transcriptomics/genotype_information/gb-2013-14-6-r55-S1.csv", sep='\t')

genotype_info_romay_df.set_index('Inbred line', inplace=True)
genotype_info_romay_df.head()

| Inbred line | Accesion N | N GBS samples | N Plants | Avg. IBS | % missing | Breeding program | Pop structure |
|---|---|---|---|---|---|---|---|
| 42 | PI233312 | 1 | 1 | NaN | 0.47 | Other | sweet corn |
| 66 | Ames26770 | 1 | 1 | NaN | 0.56 | Other | unclassified |
| 90 | NSL30903 | 1 | 1 | NaN | 0.52 | Other | unclassified |
| 176 | PI233313 | 1 | 1 | NaN | 0.67 | Other | sweet corn |
| 207 | PI601005 | 2 | 2 | 0.996 | 0.5 | ExPVP | non-stiff stalk |


In [3]:
#genotype_info_wallace2018_df = pd.read_csv("/home/rsantos/Repositories/maize_microbiome_transcriptomics/correlations_rnaseq_metataxonomics/0_kremling_genotype_plot_day_key.txt", sep='\t')
genotype_info_wallace2018_df = pd.read_csv("/home/santosrac/Repositories/maize_microbiome_transcriptomics/correlations_rnaseq_metataxonomics/0_kremling_genotype_plot_day_key.txt", sep='\t')

genotype_info_wallace2018_df.set_index('Genotype', inplace=True)
genotype_info_wallace2018_df.head()


| Genotype | Plot_Day |
|---|---|
| 33-16 | X14A0309_26 |
| 38-11 | X14A0233_8 |
| 4226 | X14A0079_26 |
| 4722 | X14A0311_8 |
| A188 | X14A0021_8 |


In [4]:
merged_df = genotype_info_wallace2018_df.merge(genotype_info_romay_df, left_index=True, right_index=True, how='left')

In [5]:
merged_df.tail(n=30)

| Genotype | Plot_Day | Accesion N | N GBS samples | N Plants | Avg. IBS | % missing | Breeding program | Pop structure |
|---|---|---|---|---|---|---|---|---|
| SG18 | X14A0321_8 | NSL75979 | 3.0 | 2.0 | 0.995 | 0.49 | Other | popcorn |
| T232 | X14A0285_26 | PI550522 | 5.0 | 4.0 | 0.998 | 0.25 | Other | unclassified |
| T234 | X14A0349_26 | Ames27190 | 3.0 | 2.0 | 0.998 | 0.46 | Other | unclassified |
| T8 | X14A0255_8 | PI550518 | 5.0 | 4.0 | 0.998 | 0.26 | Other | unclassified |
| Tx303 | X14A0443_26 | Ames19327 | 11.0 | 5.0 | 0.995 | 0.21 | Other | unclassified |
| Tzi11 | X14A0491_26 | PI540745 | 3.0 | 2.0 | 0.993 | 0.35 | Nigeria | tropical |
| Tzi16 | X14A0423_26 | PI540747 | 4.0 | 3.0 | 0.994 | 0.3 | Nigeria | tropical |
| Tzi18 | X14A0477_26 | PI506253 | 6.0 | 5.0 | 0.976 | 0.28 | Nigeria | tropical |
| Tzi25 | X14A0419_26 | PI506255 | 4.0 | 3.0 | 0.976 | 0.3 | Nigeria | tropical |
| Tzi9 | X14A0463_26 | PI506247 | 4.0 | 3.0 | 0.985 | 0.33 | Nigeria | tropical |


In [6]:
# Counting the number of genotypes for each group
merged_df['Pop structure'].value_counts()

Pop structure
unclassified       151
stiff stalk         39
tropical            38
non-stiff stalk     27
popcorn              7
sweet corn           5
landraces            2
Name: count, dtype: int64

In [7]:
merged_df['Breeding program'].value_counts()

Breeding program
Other             103
North Carolina     43
Iowa               32
Mexico             28
Minnesota          15
Missouri            7
Wisconsin           7
Indiana             6
Ontario             5
Nigeria             5
Thailand            5
Michigan            3
South Africa        3
Illinois            3
France              1
Spain               1
Nebraska            1
North Dakota        1
Name: count, dtype: int64

In [8]:
merged_df.loc[merged_df['Pop structure'].isna()]

| Genotype | Plot_Day | Accesion N | N GBS samples | N Plants | Avg. IBS | % missing | Breeding program | Pop structure |
|---|---|---|---|---|---|---|---|---|
| CML411 | X14A0566_26 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| CML504 | X14A0569_26 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| CML505 | X14A0570_8 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| CML511 | X14A0571_26 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| CML84 | X14A0572_26 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| CML85 | X14A0573_26 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| CML96 | X14A0574_26 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |


In [11]:
merged_df['Pop structure'].isna().sum()

7
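
The unmatched genotypes can also be flagged at merge time rather than by inspecting a column for missing values afterwards. The following is a minimal sketch using the `indicator` parameter of `DataFrame.merge`; the two small frames (`key_df`, `romay_df`) are toy stand-ins built from rows shown above, not the real tables.

```python
import pandas as pd

# Toy stand-ins for the plot key and the Romay table (two rows taken from the outputs above)
key_df = pd.DataFrame(
    {"Plot_Day": ["X14A0463_26", "X14A0566_26"]},
    index=pd.Index(["Tzi9", "CML411"], name="Genotype"),
)
romay_df = pd.DataFrame(
    {"Pop structure": ["tropical"]},
    index=pd.Index(["Tzi9"], name="Inbred line"),
)

# indicator=True adds a '_merge' column; in a left join it is 'both' or 'left_only'
checked_df = key_df.merge(
    romay_df, left_index=True, right_index=True, how="left", indicator=True
)

# Genotypes present in the plot key but absent from the Romay table
unmatched = checked_df.index[checked_df["_merge"] == "left_only"].tolist()
print(unmatched)
```

Unlike a check on `Pop structure`, this would separate genotypes that are absent from the Romay table from genotypes that are present but have an empty group field.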

In [13]:
# After exporting, I (RACS) modified the names to remove X added by R (data.frame column names)
merged_df.to_csv('/home/santosrac/Repositories/maize_microbiome_transcriptomics/genotype_information/wallace_et_al_2018_group_assignments_romay2013.tsv', sep='\t')
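
The manual removal of the `X` prefix could instead be done in pandas before exporting. This is only a sketch: it assumes the prefix to strip is the leading `X` of the `Plot_Day` values, and the `plot_days` series below is a toy example built from two values shown earlier.

```python
import pandas as pd

# Toy example with two Plot_Day values from the key file
plot_days = pd.Series(["X14A0309_26", "X14A0233_8"], name="Plot_Day")

# Remove a single leading 'X' only; any other 'X' in the identifier is kept
cleaned = plot_days.str.replace(r"^X", "", regex=True)
print(cleaned.tolist())
```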

## Flint-Garcia paper (2005)


In [27]:
import pandas as pd

genotype_info_flintgarcia_df = pd.read_csv("/home/rsantos/Repositories/maize_microbiome_transcriptomics/genotype_information/flint_garcia_s1_table.csv", sep='\t')

genotype_info_flintgarcia_df.set_index('Inbred', inplace=True)
genotype_info_flintgarcia_df.head()

| Inbred | State/Country | Pedigree | NSS | SS | TS | Pop A | Sweet A | Subpopulation |
|---|---|---|---|---|---|---|---|---|
| 796 | Yugoslavia | Vukovarski Yellow Dent | 0.785 | 0.189 | 0.026 | 0 | 0 | mixed |
| 4226 | Illinois | Funk 90 Day | 0.917 | 0.071 | 0.012 | 0 | 0 | nss |
| 4722 | Indiana | SA x Ohio yel. | 0.0 | 0.0 | 0.0 | 1 | 0 | popcorn |
| 33-16 | Indiana | Lux Johnson Country white | 0.972 | 0.014 | 0.014 | 0 | 0 | nss |
| 38-11 | Indiana | Outcross in line from 176A | 0.993 | 0.003 | 0.004 | 0 | 0 | nss |


In [28]:
genotype_info_wallace2018_df = pd.read_csv("/home/rsantos/Repositories/maize_microbiome_transcriptomics/correlations_rnaseq_metataxonomics/0_kremling_genotype_plot_day_key.txt", sep='\t')

genotype_info_wallace2018_df.set_index('Genotype', inplace=True)
genotype_info_wallace2018_df.head()

| Genotype | Plot_Day |
|---|---|
| 33-16 | X14A0309_26 |
| 38-11 | X14A0233_8 |
| 4226 | X14A0079_26 |
| 4722 | X14A0311_8 |
| A188 | X14A0021_8 |


In [29]:
merged_df = genotype_info_wallace2018_df.merge(genotype_info_flintgarcia_df, left_index=True, right_index=True, how='left')
merged_df.head()

| Genotype | Plot_Day | State/Country | Pedigree | NSS | SS | TS | Pop A | Sweet A | Subpopulation |
|---|---|---|---|---|---|---|---|---|---|
| 33-16 | X14A0309_26 | Indiana | Lux Johnson Country white | 0.972 | 0.014 | 0.014 | 0 | 0 | nss |
| 38-11 | X14A0233_8 | Indiana | Outcross in line from 176A | 0.993 | 0.003 | 0.004 | 0 | 0 | nss |
| 4226 | X14A0079_26 | Illinois | Funk 90 Day | 0.917 | 0.071 | 0.012 | 0 | 0 | nss |
| 4722 | X14A0311_8 | Indiana | SA x Ohio yel. | 0.0 | 0.0 | 0.0 | 1 | 0 | popcorn |
| A188 | X14A0021_8 | Minnesota | [(4-29*64)4-29(4)] | 0.982 | 0.013 | 0.006 | 0 | 0 | nss |


In [33]:
# Counting the number of genotypes for each group
merged_df['Subpopulation'].value_counts()

Subpopulation
nss        102
mixed       58
ts          46
ss          39
popcorn      9
sweet        5
Name: count, dtype: int64

* nss = non-stiff stalk
* ts = tropical/subtropical
* ss = stiff stalk

In [34]:
merged_df['Subpopulation'].value_counts().sum()

259

In [31]:
merged_df.loc[merged_df['Subpopulation'].isna()]

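| Genotype | Plot_Day | State/Country | Pedigree | NSS | SS | TS | Pop A | Sweet A | Subpopulation |
|---|---|---|---|---|---|---|---|---|---|
| CML206 | X14A0564_26 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| CML330 | X14A0565_26 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| CML411 | X14A0566_26 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| CML418 | X14A0567_8 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| CML504 | X14A0569_26 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| CML505 | X14A0570_8 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| CML511 | X14A0571_26 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| CML84 | X14A0572_26 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| CML85 | X14A0573_26 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| CML96 | X14A0574_26 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |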


In [32]:
merged_df['Subpopulation'].isna().sum()

17

Of the genotypes missing from the Flint-Garcia table, the following inbreds are classified in the Romay paper:

* CML206 (tropical)
* CML330 (tropical)
* CML418 (tropical)
* Il101T (sweet corn)
* IllHy (stiff stalk)
* MR19 Santo Domingo (landraces)
* MR20 Shoepeg (landraces)
* Yu796NS (unclassified)
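
The Romay groups listed above could be used to fill gaps in the Flint-Garcia `Subpopulation` column. The sketch below is one possible way to do that, not the approach used in this notebook. The label mapping `ROMAY_TO_FLINTGARCIA` is a hypothetical assumption (neither paper defines it here), and the two series are toy examples built from values shown in this document.

```python
import pandas as pd

# Hypothetical mapping between the two label sets (an assumption for illustration).
# 'landraces' and 'unclassified' have no Flint-Garcia counterpart and stay missing.
ROMAY_TO_FLINTGARCIA = {
    "tropical": "ts",
    "stiff stalk": "ss",
    "non-stiff stalk": "nss",
    "sweet corn": "sweet",
    "popcorn": "popcorn",
}

# Toy Flint-Garcia column: one classified genotype, four without a subpopulation
flintgarcia = pd.Series(
    {"4722": "popcorn", "CML206": None, "Il101T": None, "Yu796NS": None, "CML411": None},
    name="Subpopulation",
)

# Toy Romay column for the genotypes that do appear in the Romay table
romay = pd.Series(
    {"CML206": "tropical", "Il101T": "sweet corn", "Yu796NS": "unclassified"},
    name="Pop structure",
)

# Translate Romay labels, then fill only the missing Flint-Garcia entries (aligned on index)
combined = flintgarcia.fillna(romay.map(ROMAY_TO_FLINTGARCIA))
print(combined)
```

Existing Flint-Garcia assignments are never overwritten, and genotypes that are unclassified in Romay or absent from both tables remain missing.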

# Summary

## Romay (2013)

* 118 genotypes have a group based on Romay et al. (2013)
* 151 genotypes are labeled unclassified by Romay et al. (2013)
* Seven genotypes are not included in the Romay table (or appear under a different identifier)

Initial correlation analyses that take maize groups into account will use the 118 genotypes above, that is, those classified in one of the subpopulations.
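
Selecting the classified genotypes means dropping both the rows labeled `unclassified` and the rows with no Romay entry at all. A minimal sketch of that filter, assuming a toy frame (`toy_df`) with four rows taken from the outputs above:

```python
import pandas as pd

# Toy frame: two classified genotypes, one 'unclassified', one missing from the Romay table
toy_df = pd.DataFrame(
    {"Pop structure": ["popcorn", "unclassified", "tropical", None]},
    index=pd.Index(["SG18", "T232", "Tzi9", "CML411"], name="Genotype"),
)

# Keep rows that have a group and whose group is not the 'unclassified' label
has_group = toy_df["Pop structure"].notna() & (toy_df["Pop structure"] != "unclassified")
classified_df = toy_df[has_group]
print(classified_df.index.tolist())
```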

## Flint-Garcia (2005)

* 259 genotypes have a group based on Flint-Garcia et al. (2005)
* Of the 17 genotypes without a Flint-Garcia group, seven were classified by Romay et al. (2013)
* For the remaining ten genotypes, group information is missing