In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# 1. RNA-seq Exercise: Analyzing RNA-seq data for cell lines treated with compounds

Now we will work through an exercise similar to the last tutorial on analyzing RNA-seq data: comparing RNA-seq profiles across groups after drug treatment (GEO accession: GSE119088). The goals of this study are to compare NGS-derived transcriptome profiles (RNA-seq) between the EZH2 inhibitor treatment group and the DMSO group in each of 3 cancer cell lines, and to relate the transcriptome changes to drug sensitivity.

Read in three dataframes, one for each cell line.

In [2]:
df1=pd.read_csv('./undergrad_chemistry_class_tutorial/S7721_transcriptome_change_07042022.csv')
df2=pd.read_csv('./undergrad_chemistry_class_tutorial/U2932_transcriptome_change_07042022.csv')
df3=pd.read_csv('./undergrad_chemistry_class_tutorial/Pfeif_transcriptome_change_07042022.csv')

Pre-process the data: average the three replicates for each condition and keep only the columns needed for the exercise.

In [3]:
df1['S7721_before_treatment']=(df1.S7721_c1 + df1.S7721_c2 + df1.S7721_c3)/3
df1['S7721_after_treatment']=(df1.S7721_E1 + df1.S7721_E2 + df1.S7721_E3)/3
df1=df1[['ID', 'S7721_before_treatment', 'S7721_after_treatment']]
df2['U2932_before_treatment']=(df2.U2932_c1 + df2.U2932_c2 + df2.U2932_c3)/3
df2['U2932_after_treatment']=(df2.U2932_E1 + df2.U2932_E2 + df2.U2932_E3)/3
df2=df2[['ID', 'U2932_before_treatment', 'U2932_after_treatment']]
df3['pfeif_before_treatment']=(df3.pfeif_c1 + df3.pfeif_c2 + df3.pfeif_c3)/3
df3['pfeif_after_treatment']=(df3.pfeif_E1 + df3.pfeif_E2 + df3.pfeif_E3)/3
df3=df3[['ID', 'pfeif_before_treatment', 'pfeif_after_treatment']]
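The replicate averaging above can also be written with `DataFrame.mean(axis=1)`, which avoids typing each sum by hand. The sketch below is only an illustration: the table `toy`, the prefix `X`, the gene IDs, the values, and the helper `average_replicates` are all made up and are not part of the GSE119088 files.

```python
import pandas as pd

# Toy data standing in for one cell line; all values are invented.
toy = pd.DataFrame({
    "ID": ["geneA", "geneB"],
    "X_c1": [10.0, 200.0], "X_c2": [12.0, 220.0], "X_c3": [14.0, 240.0],
    "X_E1": [30.0, 100.0], "X_E2": [33.0, 110.0], "X_E3": [36.0, 120.0],
})

def average_replicates(df, prefix):
    """Hypothetical helper: average the c1-c3 and E1-E3 replicate columns."""
    control_cols = [f"{prefix}_c{i}" for i in (1, 2, 3)]
    treated_cols = [f"{prefix}_E{i}" for i in (1, 2, 3)]
    out = df[["ID"]].copy()
    # mean(axis=1) averages across columns, one value per row (gene)
    out[f"{prefix}_before_treatment"] = df[control_cols].mean(axis=1)
    out[f"{prefix}_after_treatment"] = df[treated_cols].mean(axis=1)
    return out

toy_avg = average_replicates(toy, "X")
print(toy_avg)
```

One difference worth knowing: `mean(axis=1)` skips missing values by default, whereas summing the columns and dividing by 3 gives `NaN` whenever any replicate is missing.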

## 1.1. Make scatter plots to visualize how gene expression changes within each cell line after drug treatment

Hints: the code for generating the visualization is mostly provided below, except that you need to replace the underscores with actual code. All you need to do is specify which column goes in each underscored area.

For example, the first plot shows gene expression for the `S7721` cell line, whose information is stored in `df1`. The first thing you need to do is check the columns in `df1` by running `df1.head()`.

In [None]:
df1.head()

We want to plot gene expression before drug treatment against after treatment. So the two columns we are going to use are `df1['S7721_before_treatment']` and `df1['S7721_after_treatment']`. We will use the function `plt.scatter()` to make the plot. The first parameter we put in `plt.scatter()` is the x-axis data, which will be `df1['S7721_before_treatment']` and the second parameter we put in `plt.scatter()` is the y-axis data, which will be `df1['S7721_after_treatment']`.
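As an illustration of the argument order described above, here is a minimal sketch on a made-up table. The dataframe `toy` and its column names `before` and `after` are hypothetical; the exercise uses the real column names from `df1`.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Invented values, for illustration only.
toy = pd.DataFrame({"before": [1.0, 5.0, 9.0], "after": [1.5, 4.0, 12.0]})

fig, ax = plt.subplots(figsize=(5, 5))
# First argument is the x-axis data, second argument is the y-axis data.
points = ax.scatter(toy["before"], toy["after"], s=100)
# Plotting "before" against itself draws the y = x reference line:
# points above it went up after treatment, points below it went down.
ax.plot(toy["before"], toy["before"], linestyle="-")
ax.set_xlabel("before treatment")
ax.set_ylabel("after treatment")
```

In a notebook the figure renders at the end of the cell; in a script you would add `plt.show()`.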

For the `U2932` cell line, the information is stored in `df2`. To check `df2`, run `df2.head()`.

In [None]:
df2.head()

Again we are going to plot gene expression data before treatment against after treatment. So the two columns we are going to use are `df2['U2932_before_treatment']` and `df2['U2932_after_treatment']`. We will use the function `plt.scatter()` to make the plot. The first parameter we put in `plt.scatter()` is the x-axis data, which will be `df2['U2932_before_treatment']` and the second parameter we put in `plt.scatter()` is the y-axis data, which will be `df2['U2932_after_treatment']`.

For the `pfeif` cell line, the information is stored in `df3`. To check `df3`, run `df3.head()`.

In [None]:
df3.head()

Similarly, the two columns we are going to use are `df3['pfeif_before_treatment']` and `df3['pfeif_after_treatment']`. We will use the function `plt.scatter()` to make the plot. The first parameter we put in `plt.scatter()` is the x-axis data, which will be `df3['pfeif_before_treatment']` and the second parameter we put in `plt.scatter()` is the y-axis data, which will be `df3['pfeif_after_treatment']`.

Now it's time for you to take the hints above and fill in the underscored areas with actual code.

In [None]:
plt.rcParams['font.size'] = 20
plt.rc('axes', labelsize=25)
plt.rc('axes', titlesize=30)
plt.figure(1, figsize=(30, 30))
plt.subplot(331, xlabel='S7721 before drug treatment', ylabel='S7721 after drug treatment', title='before vs after treatment')
plt.scatter(df1['_______'], df1['_______'], s=100)
plt.plot(df1.S7721_before_treatment, df1.S7721_before_treatment, linestyle='-')
plt.xticks(rotation=90)
plt.subplot(332, xlabel='U2932 before drug treatment', ylabel='U2932 after drug treatment', title='before vs after treatment')
plt.scatter(df2['_________'], df2['________'], s=100)
plt.plot(df2.U2932_before_treatment, df2.U2932_before_treatment, linestyle='-')
plt.xticks(rotation=90)
plt.subplot(333, xlabel='Pfeif before drug treatment', ylabel='Pfeif after drug treatment', title='before vs after treatment')
plt.scatter(df3['________'], df3['________'], s=100)
plt.plot(df3.pfeif_before_treatment, df3.pfeif_before_treatment, linestyle='-')
plt.xticks(rotation=90)
plt.show()

## 1.2. Find differentially expressed genes within each cell line (genes whose expression changes by more than 4000 in absolute value)

Hint: Here, in each cell line, you would want to calculate the expression difference for each gene before and after treatment

For cell line `S7721`, we will create a new column called `'deviation'` in its dataframe `df1`. This column holds the absolute value of the difference in gene expression before and after drug treatment. For example, we would subtract `df1['S7721_after_treatment']` from `df1['S7721_before_treatment']` and take the absolute value.

In [None]:
df1['deviation']=abs(df1['______']-df1['_______'])

For cell line `U2932`, we will create a new column, also called `'deviation'`, in its dataframe `df2`. This column holds the absolute value of the difference in gene expression before and after drug treatment. For example, we would subtract `df2['U2932_after_treatment']` from `df2['U2932_before_treatment']` and take the absolute value.

In [None]:
df2['deviation']=abs(df2['________']-df2['_______'])

For cell line `pfeif`, we will create a new column, also called `'deviation'`, in its dataframe `df3`. This column holds the absolute value of the difference in gene expression before and after drug treatment. For example, we would subtract `df3['pfeif_after_treatment']` from `df3['pfeif_before_treatment']` and take the absolute value.

In [None]:
df3['deviation']=abs(df3['_______']-df3['________'])
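To see what the `'deviation'` column looks like, here is a small sketch on invented numbers. The table `toy` and its column names `before` and `after` are hypothetical stand-ins for the real columns.

```python
import pandas as pd

# Invented expression values for three made-up genes.
toy = pd.DataFrame({
    "ID": ["geneA", "geneB", "geneC"],
    "before": [100.0, 5000.0, 9000.0],
    "after": [150.0, 500.0, 9100.0],
})

# Difference first, then absolute value, so up- and down-regulated
# genes both get a positive deviation.
toy["deviation"] = (toy["before"] - toy["after"]).abs()
print(toy)
```

The Series method `.abs()` used here gives the same result as wrapping the difference in the built-in `abs()`, as the exercise cells do.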

Now we will look for genes whose expression difference is larger than `4000`, using the `'deviation'` column in each dataframe (`df1`, `df2`, `df3`). For each dataframe, we filter the rows by keeping only those whose `deviation` value is larger than `4000`. Fill in the blank with a number that filters the dataframe properly. For example, to filter `df1` and keep only the rows whose `deviation` is larger than `1000`, you would write `df1[df1.deviation > 1000]`. The rows that pass the filter are returned.

In [None]:
df1[df1.deviation>    ]

In [None]:
df2[df2.deviation>    ]

In [None]:
df3[df3.deviation>    ]
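The filtering step works through a boolean mask: the comparison produces one `True`/`False` value per row, and indexing with that mask keeps only the `True` rows. The sketch below spells this out on an invented table `toy`, using the threshold `1000` from the example in the text.

```python
import pandas as pd

# Invented deviations for three made-up genes.
toy = pd.DataFrame({
    "ID": ["geneA", "geneB", "geneC"],
    "deviation": [50.0, 4500.0, 1000.0],
})

mask = toy["deviation"] > 1000   # one boolean per row
hits = toy[mask]                 # keep only the rows where mask is True

print(mask.tolist())
print(hits)
```

Because the comparison is strict (`>`), a gene whose deviation equals the threshold exactly is not selected.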

## 1.3. Make scatter plots highlighting differentially expressed genes in each cell line

For cell line `S7721`:

In [None]:
#Make genes that are differentially expressed into a list
gene_list_for_S7721=df1[df1.deviation>4000].ID.tolist()

Replace the underscores with the proper code, using the hints given earlier:

In [None]:
plt.figure(figsize=(10, 10))
plt.scatter(df1['______'], df1['______'], s=50)
plt.plot(df1.S7721_before_treatment, df1.S7721_before_treatment, linestyle='-')
plt.title("Gene expression change for cell line S7721")
plt.xlabel("before treatment")
plt.ylabel("after treatment")
  
# Loop over all points and annotate only the differentially expressed genes
for i in range(len(df1)):
    if df1.ID[i] in gene_list_for_S7721:
        plt.annotate(df1.ID[i], (df1.S7721_before_treatment[i], df1.S7721_after_treatment[i] + 0.1), fontsize=10, color='red')
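The loop above visits every row and checks whether its ID is in the gene list. An alternative sketch, shown here on an invented table `toy` with hypothetical column names, filters first and then labels only the selected rows.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Invented values for three made-up genes.
toy = pd.DataFrame({
    "ID": ["geneA", "geneB", "geneC"],
    "before": [100.0, 5000.0, 9000.0],
    "after": [150.0, 500.0, 9100.0],
})
toy["deviation"] = (toy["before"] - toy["after"]).abs()
highlighted = toy[toy["deviation"] > 1000]  # illustrative threshold

fig, ax = plt.subplots(figsize=(5, 5))
ax.scatter(toy["before"], toy["after"], s=50)
ax.plot(toy["before"], toy["before"], linestyle="-")

# Label only the rows that passed the filter.
labels = []
for row in highlighted.itertuples(index=False):
    labels.append(
        ax.annotate(row.ID, (row.before, row.after), fontsize=10, color="red")
    )
```

Iterating over the filtered rows does not depend on the row labels, whereas `df1.ID[i]` looks rows up by label and so relies on the dataframe still having its default 0, 1, 2, ... index.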

For cell line `U2932`:

In [31]:
#Make genes that are differentially expressed into a list
gene_list_for_U2932=df2[df2.deviation>4000].ID.tolist()

Replace the underscores with the proper code, using the hints given earlier:

In [None]:
plt.figure(figsize=(10, 10))
plt.scatter(df2['________'], df2['________'], s=50)
plt.plot(df2.U2932_before_treatment, df2.U2932_before_treatment, linestyle='-')
plt.title("Gene expression change for cell line U2932")
plt.xlabel("before treatment")
plt.ylabel("after treatment")
  
# Loop over all points and annotate only the differentially expressed genes
for i in range(len(df2)):
    if df2.ID[i] in gene_list_for_U2932:
        plt.annotate(df2.ID[i], (df2.U2932_before_treatment[i], df2.U2932_after_treatment[i] + 0.1), fontsize=10, color='red')

For cell line `pfeif`:

In [34]:
#Make genes that are differentially expressed into a list
gene_list_for_pfeif=df3[df3.deviation>4000].ID.tolist()

Replace the underscores with the proper code, using the hints given earlier:

In [None]:
plt.figure(figsize=(10, 10))
plt.scatter(df3['_______'], df3['________'], s=50)
plt.plot(df3.pfeif_before_treatment, df3.pfeif_before_treatment, linestyle='-')
plt.title("Gene expression change for cell line pfeif")
plt.xlabel("before treatment")
plt.ylabel("after treatment")

# Loop over all points and annotate only the differentially expressed genes
for i in range(len(df3)):
    if df3.ID[i] in gene_list_for_pfeif:
        plt.annotate(df3.ID[i], (df3.pfeif_before_treatment[i], df3.pfeif_after_treatment[i] + 0.1), fontsize=10, color='red')

## 1.4. Make bar plots of only the differentially expressed genes within each cell line: two bar plots per cell line, one before treatment and one after treatment

Barplots for differentially expressed genes in cell line `S7721` before and after treatment:

The underlined area should be replaced with the columns that are to be plotted, which should be `'S7721_before_treatment'` and `'S7721_after_treatment'`, respectively.

In [None]:
plt.figure(1, figsize=(20, 10))
plt.rcParams['font.size'] = 7
plt.rc('axes', titlesize=15)
plt.subplot(211, title='Gene expression in S7721 before drug treatment')
plt.bar(df1[df1.ID.isin(gene_list_for_S7721)].ID, df1[df1.ID.isin(gene_list_for_S7721)]._______)
plt.xticks(rotation=10)
plt.subplot(212, title='Gene expression in S7721 after drug treatment')
plt.bar(df1[df1.ID.isin(gene_list_for_S7721)].ID, df1[df1.ID.isin(gene_list_for_S7721)]._______)
plt.xticks(rotation=10)
plt.show()
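The pattern in the cell above (select rows with `isin`, then plot one bar per gene) can be sketched on an invented table as follows. The dataframe `toy`, its column names, and the gene list are hypothetical. Storing the selected rows in one variable avoids repeating the `isin` filter, and `sharey=True` is an optional choice that puts both panels on the same scale.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Invented values for three made-up genes.
toy = pd.DataFrame({
    "ID": ["geneA", "geneB", "geneC"],
    "before": [100.0, 5000.0, 9000.0],
    "after": [150.0, 500.0, 9100.0],
})
gene_list = ["geneB", "geneC"]  # hypothetical list of genes to plot

# Select the rows once and reuse them for both panels.
selected = toy[toy["ID"].isin(gene_list)]

fig, (ax_before, ax_after) = plt.subplots(2, 1, figsize=(6, 6), sharey=True)
bars_before = ax_before.bar(selected["ID"], selected["before"])
ax_before.set_title("Gene expression before drug treatment")
bars_after = ax_after.bar(selected["ID"], selected["after"])
ax_after.set_title("Gene expression after drug treatment")
fig.tight_layout()
```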

Barplots for differentially expressed genes in cell line `U2932` before and after treatment:

The underlined area should be replaced with the columns that are to be plotted, which should be `'U2932_before_treatment'` and `'U2932_after_treatment'`, respectively.

In [None]:
plt.figure(1, figsize=(5, 10))
plt.rcParams['font.size'] = 10
plt.rc('axes', titlesize=15)
plt.subplot(211, title='Gene expression in U2932 before drug treatment')
plt.bar(df2[df2.ID.isin(gene_list_for_U2932)].ID, df2[df2.ID.isin(gene_list_for_U2932)]._______, width=0.5)
plt.xticks(rotation=10)
plt.subplot(212, title='Gene expression in U2932 after drug treatment')
plt.bar(df2[df2.ID.isin(gene_list_for_U2932)].ID, df2[df2.ID.isin(gene_list_for_U2932)].________, width=0.5)
plt.xticks(rotation=10)
plt.show()

Barplots for differentially expressed genes in cell line `pfeif` before and after treatment:

The underlined area should be replaced with the columns that are to be plotted, which should be `'pfeif_before_treatment'` and `'pfeif_after_treatment'`, respectively.

In [None]:
plt.figure(1, figsize=(5, 10))
plt.rcParams['font.size'] = 10
plt.rc('axes', titlesize=15)
plt.subplot(211, title='Gene expression in pfeif before drug treatment')
plt.bar(df3[df3.ID.isin(gene_list_for_pfeif)].ID, df3[df3.ID.isin(gene_list_for_pfeif)].________, width=0.5)
plt.xticks(rotation=10)
plt.subplot(212, title='Gene expression in pfeif after drug treatment')
plt.bar(df3[df3.ID.isin(gene_list_for_pfeif)].ID, df3[df3.ID.isin(gene_list_for_pfeif)].________, width=0.5)
plt.xticks(rotation=10)
plt.show()

## 1.5. Look up the gene ontology of the differentially expressed genes within each cell line by searching the UniProt website
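If you have many genes to look up, one possible shortcut is to build a search link for each gene and open it in a browser. The sketch below rests on two assumptions: that the UniProt search page accepts a `query` parameter at the address stored in `UNIPROT_SEARCH`, and that the values in the `ID` column are names UniProt can search for. The helper `uniprot_search_url` is hypothetical, and `EZH2` is used only as a sample query.

```python
from urllib.parse import urlencode

# Assumed address of the UniProt search page; check it in your browser first.
UNIPROT_SEARCH = "https://www.uniprot.org/uniprotkb"

def uniprot_search_url(gene_name):
    """Hypothetical helper: build a search link for one gene name."""
    return f"{UNIPROT_SEARCH}?{urlencode({'query': gene_name})}"

# In the exercise you could loop over gene_list_for_S7721 instead.
for gene in ["EZH2"]:
    print(gene, uniprot_search_url(gene))
```

The gene ontology annotations themselves still need to be read from the entry page that the search returns.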

**Acknowledgements:** This tutorial was developed by Yue Wang at the University of North Carolina at Chapel Hill.