Step 1 - Import the required Python libraries

In [None]:
from IPython import get_ipython
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from scipy import stats

pd.options.mode.chained_assignment = None

Step 2 - Load and read your data file
- pyTCR accepts a single `.tsv` file that should contain all the samples.
  - The following cell attempts to detect whether you are running the notebook in a Google Colab cloud environment or in a local environment, and then loads the data at the specified path.
- The `filePath` variable in the following code cell should be changed to the location of your file. The following options are supported:
  1. A `filePath` from Google Drive (to run on a cloud environment)
  2. A `filePath` from your local computer (to run on a local environment)

In [None]:
# Specify the path to your data in Google Drive or locally
filePath = "./complete_COVID_SAMPLES.tsv.tsv" # "/content/drive/MyDrive/complete_COVID_samples.tsv"

isInGoogle = 'google.colab' in str(get_ipython())

if isInGoogle:
    from google.colab import drive
    drive.mount('/content/drive')

df = pd.read_table(filePath, low_memory=False, engine="c")

# additional metadata columns carried alongside 'sample' in every groupby
optional_fields = ['hospitalized']

df.head()
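
The cells below read the columns `sample`, `#count`, `freq`, `cdr3nt`, and `cdr3aa`, plus whatever is listed in `optional_fields`. A minimal sketch of a check that could be run right after loading is shown here; the helper name `missing_columns` and the constant `REQUIRED_COLUMNS` are hypothetical and not part of pyTCR.

```python
# Columns that the analysis cells in this notebook read from the loaded table.
REQUIRED_COLUMNS = ['sample', '#count', 'freq', 'cdr3nt', 'cdr3aa']


def missing_columns(columns, optional_fields=()):
    """Return the expected column names that are absent from `columns`."""
    expected = list(REQUIRED_COLUMNS) + list(optional_fields)
    present = set(columns)
    return [name for name in expected if name not in present]


# Example with a hypothetical header; in the notebook, pass df.columns instead.
header = ['sample', '#count', 'freq', 'cdr3nt', 'cdr3aa']
print(missing_columns(header, optional_fields=['hospitalized']))
# prints ['hospitalized']
```

An empty list means every column used later is present; a non-empty list points at a header or `optional_fields` entry to correct before continuing.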


Basic analysis 1 - Reads count

In [None]:
df_count = df.groupby(['sample'] + optional_fields).agg(
    {'#count': 'sum'}).reset_index().rename(columns={'#count': "reads_count"})

df_count

Basic analysis 2 - Clonotype count

In [None]:
df_diversity = df.groupby(['sample'] + optional_fields,
                          sort=False).size().reset_index(name='clonotype_count')


Basic analysis 3 - Mean frequency

In [None]:
df_mean_frequency = df.groupby(['sample'] + optional_fields).agg(
    {'freq': 'mean'}).reset_index().rename(columns={'freq': "mean_frequency"})


Basic analysis 4 - Geometric mean of clonotype frequency

In [None]:
from scipy.stats import gmean

# geometric mean of the clonotype frequencies in each sample
# (sort=False keeps the samples in their order of appearance)
df_geomean_frequency = df.groupby('sample', sort=False)['freq'].apply(
    gmean).reset_index(name='geomean_frequency')

df_geomean_frequency
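
For reference, the geometric mean of n frequencies is the n-th root of their product, which equals the exponential of the mean of their logarithms. The sketch below illustrates this with made-up frequencies and only the standard library; the values are illustrative and not taken from the data set.

```python
import math
from statistics import geometric_mean

# Hypothetical clonotype frequencies for one sample (all strictly positive).
freqs = [0.5, 0.25, 0.125, 0.125]

# exp(mean(log(f))) is the log-space form of the geometric mean.
log_form = math.exp(sum(math.log(f) for f in freqs) / len(freqs))

# The standard library computes the same quantity directly.
direct = geometric_mean(freqs)

# Arithmetic mean, for comparison.
arithmetic = sum(freqs) / len(freqs)

print(log_form, direct, arithmetic)
```

The geometric mean never exceeds the arithmetic mean, and it is pulled down more strongly by the many low-frequency clonotypes in a repertoire.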


Basic analysis 5 - Mean length of CDR3 nucleotide sequence

In [None]:
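# weight each CDR3 length by the clonotype frequency; summing per sample gives
# the frequency-weighted mean length (assuming 'freq' sums to 1 in each sample)
df['length_weighted'] = df['cdr3nt'].str.len()*df['freq']
df_mean_cdr3nt_length = df.groupby(['sample'] + optional_fields).agg(
    {'length_weighted': 'sum'}).reset_index().rename(columns={'length_weighted': "mean_cdr3nt_length"})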

df_mean_cdr3nt_length
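
The cell above computes a frequency-weighted mean rather than a plain average of lengths. A small standard-library sketch with made-up clonotypes shows the difference; the sequences and frequencies are illustrative assumptions.

```python
# Hypothetical clonotypes for one sample: (cdr3nt, freq); frequencies sum to 1.
clonotypes = [
    ("TGTGCCAGCAGT", 0.5),        # 12 nt
    ("TGTGCCAGCAGTTTC", 0.3),     # 15 nt
    ("TGTGCCAGCAGTTTCGGG", 0.2),  # 18 nt
]

# Frequency-weighted mean: each length counts in proportion to its frequency.
weighted = sum(len(seq) * freq for seq, freq in clonotypes)

# Unweighted mean: every clonotype counts equally, regardless of abundance.
unweighted = sum(len(seq) for seq, _ in clonotypes) / len(clonotypes)

print(weighted, unweighted)
```

Here the weighted mean is lower than the unweighted one because the most abundant clonotype is also the shortest.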

Basic analysis 6 - Convergence

In [None]:
# count the clonotypes (cdr3nt rows) that encode each CDR3 amino acid sequence
df_unique_CDR3 = df.groupby(['cdr3aa', 'sample'] + optional_fields)[
    'cdr3nt'].count().reset_index(name='count')

# calculate the mean of the unique CDR3 count in each sample
df_unique_CDR3_mean = df_unique_CDR3.groupby(['sample'] + optional_fields).agg(
    {'count': 'mean'}).reset_index().rename(columns={'count': "convergence"})
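
To make the convergence metric concrete, the sketch below repeats the two steps on a few made-up rows with the standard library: count the rows per amino acid sequence, then average those counts. The sequences are illustrative assumptions.

```python
from collections import Counter

# Hypothetical clonotype rows for one sample: (cdr3aa, cdr3nt).
rows = [
    ("CASSF", "TGTGCCAGCAGTTTC"),
    ("CASSF", "TGTGCCAGCAGCTTC"),
    ("CASSF", "TGCGCCAGCAGTTTT"),
    ("CASSL", "TGTGCCAGCAGTTTA"),
]

# Step 1: number of rows per CDR3 amino acid sequence (mirrors the count above).
counts = Counter(cdr3aa for cdr3aa, _ in rows)

# Step 2: convergence is the mean of those per-sequence counts.
convergence = sum(counts.values()) / len(counts)

print(dict(counts), convergence)
```

A value of 1 would mean that every amino acid sequence is encoded by a single clonotype; larger values indicate that several nucleotide sequences converge on the same amino acid sequence.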


Basic analysis 7.1 - Spectratype

In [None]:
# CDR3 nucleotide length
df['nt_length'] = df['cdr3nt'].str.len()

# calculates spectratype
df_spectratype = df.groupby(['sample', 'nt_length'] + optional_fields).agg(
    {'freq': 'sum'}).reset_index().rename(columns={'freq': "spectratype"})
df_spectratype
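
The spectratype is the total frequency found at each CDR3 nucleotide length. A minimal standard-library sketch with made-up clonotypes, mirroring the groupby-and-sum above:

```python
from collections import defaultdict

# Hypothetical clonotypes for one sample: (cdr3nt, freq).
clonotypes = [
    ("TGTGCCAGCAGT", 0.4),        # 12 nt
    ("TGCGCCAGCAGC", 0.1),        # 12 nt
    ("TGTGCCAGCAGTTTC", 0.3),     # 15 nt
    ("TGTGCCAGCAGTTTCGGG", 0.2),  # 18 nt
]

# Sum the frequencies of all clonotypes that share the same CDR3 length.
spectratype = defaultdict(float)
for seq, freq in clonotypes:
    spectratype[len(seq)] += freq

for length in sorted(spectratype):
    print(length, spectratype[length])
```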


Basic analysis 7.2 - Spectratype bar plot for an individual sample

1.   Define the sample that you would like to plot by setting `sample_name` to the sample name of interest (for example, "1132289BW_TCRB")
2.   x-axis and y-axis labels, figsize, fontsize are customizable

In [None]:
sample_name = ""

df_sample = df_spectratype.loc[df_spectratype['sample'] == sample_name]


In [None]:
# plt.subplots returns (figure, axes); draw the bar plot on that axes
fig, ax = plt.subplots(figsize=(10, 10))
sns.barplot(data=df_sample, x='nt_length', y='spectratype', ax=ax)
ax.set_xlabel('nt_length', fontsize=20)
ax.set_ylabel('frequency', fontsize=20)
plt.xticks(fontsize=10)
plt.yticks(fontsize=20)


Basic analysis 7.1 - Summary table for basic analysis

In [None]:
# merge df_count and df_geomean_frequency first
df_geomean_frequency = df_geomean_frequency.merge(
    df_count, on='sample', how='left')

# create a dataframe that combines all the basic analysis (except for spectratype)
dfs = [df_diversity, df_mean_frequency, df_geomean_frequency,
       df_mean_cdr3nt_length, df_unique_CDR3_mean]

# merge on the sample name plus the metadata columns listed in optional_fields
merge_keys = ['sample'] + optional_fields
df_combined = dfs[0]
for d in dfs[1:]:
    df_combined = pd.merge(df_combined, d, on=merge_keys, how='outer')

df_combined


Basic analysis 7.2 - Statistical analysis of a basic analysis metric

Basic analysis 7.2.1 - Test if the metric is normally distributed
1.   The null hypothesis is that the metric is normally distributed.
2.   If the p-value is greater than 0.05, we cannot reject the null hypothesis, so the metric is treated as normally distributed. If the p-value is smaller than 0.05, we reject the null hypothesis and treat the metric as not normally distributed.
3.   Change 'clonotype_count' to other metrics that you are interested in.

In [None]:
x = stats.normaltest(df_combined['clonotype_count'])
x


Basic analysis 7.2.2 - Mean or median of diversity metrics among groups
1.   if the dataset is normally distributed, calculate mean
2.   if the dataset is not normally distributed, calculate median
3.   change 'clonotype_count' to other metrics that you are interested in


In [None]:
# calculate the mean of the metric within each group
# (groups are defined by the columns listed in optional_fields)
df_metric_mean = df_combined.groupby(optional_fields)[
    'clonotype_count'].mean().reset_index()
df_metric_mean


In [None]:
# calculate the median of the metric within each group
# (groups are defined by the columns listed in optional_fields)
df_metric_median = df_combined.groupby(optional_fields)[
    'clonotype_count'].median().reset_index()
df_metric_median


Basic analysis 7.2.3 - Stat test
1.   If the dataset is normally distributed, use the t-test (stats.ttest_ind).
2.   If the dataset is not normally distributed, use the Wilcoxon rank-sum test (stats.ranksums).
3.   In either case, change group1 and group2 to the groups/samples that you are interested in.
4.   Change 'clonotype_count' to other metrics that you are interested in.
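
The decision rule above can be sketched as a small helper. This is an illustration under stated assumptions, not part of pyTCR: the function name `compare_two_groups`, the 0.05 threshold as a parameter, and the example counts are all hypothetical. Normality is checked on the pooled values, as in the earlier cell that runs `stats.normaltest` on the whole column; note that `stats.normaltest` needs at least 8 values.

```python
from scipy import stats


def compare_two_groups(group1, group2, alpha=0.05):
    """Pick the t-test or the rank-sum test from a pooled normality check."""
    pooled = list(group1) + list(group2)
    _, p_normal = stats.normaltest(pooled)
    if p_normal > alpha:
        # Normality not rejected: compare the groups with the t-test.
        test_name, result = "t-test", stats.ttest_ind(group1, group2)
    else:
        # Normality rejected: fall back to the Wilcoxon rank-sum test.
        test_name, result = "rank-sum", stats.ranksums(group1, group2)
    return test_name, result.pvalue


# Hypothetical clonotype counts for two groups of ten samples each.
hospitalized = [1200, 1350, 980, 1100, 1500, 1250, 1020, 1400, 1180, 1300]
not_hospitalized = [2100, 1900, 2300, 2050, 1800, 2200, 1950, 2400, 2000, 2150]

test_name, p_value = compare_two_groups(hospitalized, not_hospitalized)
print(test_name, p_value)
```

In the notebook, the two inputs would be `df_group1['clonotype_count']` and `df_group2['clonotype_count']` from the next cell.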

In [None]:
group_col = optional_fields[0]  # grouping column declared in Step 2
df1 = df_combined.copy()
df_group1 = df1[df1[group_col] == True]
df_group2 = df1[df1[group_col] == False]
stats.ttest_ind(df_group1['clonotype_count'], df_group2['clonotype_count'])


Basic analysis 7.3.1 - Bar plot on metric per sample
1.   x-axis and y-axis labels, figsize, fontsize are customizable
2.   change 'clonotype_count' to other metrics that you are interested in

In [None]:
# plt.subplots returns (figure, axes); draw the bar plot on that axes
fig, ax = plt.subplots(figsize=(10, 10))
sns.barplot(data=df_combined, x='sample', y='clonotype_count',
            hue=optional_fields[0], ax=ax)
ax.set_xlabel('sample', fontsize=20)
ax.set_ylabel('number of clonotypes', fontsize=20)
plt.xticks(fontsize=10, rotation=90)
plt.yticks(fontsize=20)


Basic analysis 7.3.2 - Clonotype count violin plot per group
1.   x-axis and y-axis labels, figsize, fontsize are customizable
2.   change the violin plot (sns.violinplot) to the plot type that you are interested in, such as strip plot (sns.stripplot), swarm plot (sns.swarmplot), box plot (sns.boxplot), boxen plot (sns.boxenplot), point plot (sns.pointplot), or bar plot (sns.barplot)
3.   change 'clonotype_count' to other metrics that you are interested in

In [None]:
# plt.subplots returns (figure, axes); draw the violin plot on that axes
fig, ax = plt.subplots(figsize=(10, 10))

sns.violinplot(x=optional_fields[0], y='clonotype_count', data=df_combined, ax=ax)

ax.set_xlabel('hospitalization', fontsize=20)
ax.set_ylabel('number of clonotypes', fontsize=20)
plt.xticks(fontsize=20)
plt.yticks(fontsize=20)

plt.show()
