# Introduction to Pandas

This class introduces the pandas package, which is centered around two data types: the DataFrame and the Series. Unlike a NumPy array, a DataFrame can store multiple data types, such as ints and strings, in different columns. pandas is typically useful for analyzing data in which each row is a separate entry and each column holds a different piece of information. You can think of a Series (1-dimensional) as one column of data, while a DataFrame (2-dimensional) is made up of multiple columns. We're primarily focusing on the DataFrame class in this lecture, but it's important to know that there are differences between the two data types. At the end of this notebook, I've linked some additional resources that expand on this and on other, more niche functionality.
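
To make the Series/DataFrame distinction concrete, here is a minimal sketch. The table `toy` and its values are made up for illustration and are not part of the class data.

```python
import pandas as pd

# Hypothetical toy data: one string column and one integer column
toy = pd.DataFrame({
    'chromosome': ['chr1', 'chr1', 'chr3'],
    'start': [100, 250, 75],
})

# A DataFrame is 2-dimensional: (rows, columns)
print(toy.shape)

# Each column keeps its own data type
print(toy.dtypes)

# Selecting a single column returns a 1-dimensional Series
col = toy['start']
print(type(col))
print(col.ndim)
```

The column you pull out is a Series, which is why some methods behave a little differently on `toy` than on `toy['start']`.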

In [80]:
import pandas as pd

# Loading in a file

I'm using the example BED file from Level 3 to introduce pandas since it's something that we're used to seeing.

In [81]:
bed = 'C:/Users/sksuzuki/Documents/EXCISION/level_3/ENCFF239FSU.bed'

We can read files using the ```read_csv()``` function and use the ```sep``` argument to specify how our data is separated. Here our data is tab-separated.

In [82]:
df = pd.read_csv(bed, sep = '\t')
df.head()

Unnamed: 0,chr1,230125030,230174106,ENST00000454058.2,0,+
0,chr1,29913143,29915337,ENST00000623731.1,0,+
1,chr1,61801712,61803634,ENST00000624542.1,0,+
2,chr1,23567063,23573122,ENST00000454863.3,0,-
3,chr1,82587312,82588411,ENST00000575085.1,0,+
4,chr1,127401725,127470569,ENST00000509671.1,0,+


Since our data does not have headers for our columns, pandas has used the first row of data as the header. We can prevent that by using the ```header``` argument.

In [18]:
df = pd.read_csv(bed, sep = '\t', header = None)
df.head()

Unnamed: 0,0,1,2,3,4,5
0,chr1,230125030,230174106,ENST00000454058.2,0,+
1,chr1,29913143,29915337,ENST00000623731.1,0,+
2,chr1,61801712,61803634,ENST00000624542.1,0,+
3,chr1,23567063,23573122,ENST00000454863.3,0,-
4,chr1,82587312,82588411,ENST00000575085.1,0,+
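
If you want to try the effect of ```header``` without the class file, the sketch below reads a small made-up table from memory instead of from disk. The text in `bed_text`, the gene names, and the use of `io.StringIO` as a stand-in for a file are all assumptions for illustration. It also shows the ```names``` argument of ```read_csv()```, which labels the columns at read time.

```python
import io
import pandas as pd

# Hypothetical in-memory stand-in for a small tab-separated BED file
bed_text = (
    "chr1\t100\t200\tgeneA\t0\t+\n"
    "chr1\t300\t450\tgeneB\t0\t-\n"
    "chr3\t50\t75\tgeneC\t0\t+\n"
)

# Default behaviour: the first data row is consumed as the header
with_default = pd.read_csv(io.StringIO(bed_text), sep='\t')
print(len(with_default))

# header=None keeps every row as data; names= labels the columns
col_names = ['chromosome', 'start', 'end', 'gene', 'score', 'strand']
with_names = pd.read_csv(io.StringIO(bed_text), sep='\t',
                         header=None, names=col_names)
print(len(with_names))
print(list(with_names.columns))
```

Comparing the two lengths shows the row that is lost when the header is guessed from the data.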


# Indexing and slicing

Indexing and slicing let you look at particular portions of your data. Importantly, the ```index``` of a DataFrame refers to the row labels, while ```columns``` (the headers) refers to the names of the columns. Here are some of the most commonly used attributes and functions:

In [19]:
df.index

RangeIndex(start=0, stop=12, step=1)

Since we didn't specify or import any index or headers for our DataFrame, pandas defaults to numbering them from 0.

In [20]:
df.columns

Int64Index([0, 1, 2, 3, 4, 5], dtype='int64')

We can rename our headers so that they're more descriptive of each column.

In [83]:
df.columns = ['chromosome', 'start', 'end', 'gene', 'score', 'strand']

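We can sort by a particular column. Notice that the index values stay with their rows, so the index is no longer in order. This returns a sorted copy; ```df``` itself is unchanged.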

In [27]:
df.sort_values(by='gene')

Unnamed: 0,chromosome,start,end,gene,score,strand
0,chr1,230125030,230174106,ENST00000454058.2,0,+
3,chr1,23567063,23573122,ENST00000454863.3,0,-
6,chr3,86481942,86496996,ENST00000460586.1,0,+
11,chr3,94876501,94888501,ENST00000460599.2,0,+
10,chr3,91488742,91876543,ENST00000460876.1,0,+
8,chr3,61801712,61803634,ENST00000465586.1,0,+
9,chr3,87881990,87883995,ENST00000466786.1,0,+
5,chr1,127401725,127470569,ENST00000509671.1,0,+
7,chr1,53947623,53974950,ENST00000558866.4,0,+
4,chr1,82587312,82588411,ENST00000575085.1,0,+
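
Here is a small sketch of what happens to the index when sorting, using a made-up table (`toy` and its values are assumptions for illustration). It also shows ```reset_index(drop=True)```, one way to renumber the rows afterwards if you want the index to run from 0 again.

```python
import pandas as pd

# Hypothetical toy data, deliberately out of alphabetical order
toy = pd.DataFrame({
    'gene': ['geneC', 'geneA', 'geneB'],
    'start': [50, 100, 300],
})

# sort_values returns a new, sorted DataFrame; toy itself is unchanged
by_gene = toy.sort_values(by='gene')
print(list(by_gene.index))   # original row labels travel with their rows
print(list(toy.index))

# reset_index(drop=True) renumbers the rows and discards the old labels
renumbered = by_gene.reset_index(drop=True)
print(list(renumbered.index))
```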


We can grab one column, which gives us a Series:

In [28]:
df['chromosome']

0     chr1
1     chr1
2     chr1
3     chr1
4     chr1
5     chr1
6     chr3
7     chr1
8     chr3
9     chr3
10    chr3
11    chr3
Name: chromosome, dtype: object

Or grab a range of rows:

In [30]:
df[0:3]

Unnamed: 0,chromosome,start,end,gene,score,strand
0,chr1,230125030,230174106,ENST00000454058.2,0,+
1,chr1,29913143,29915337,ENST00000623731.1,0,+
2,chr1,61801712,61803634,ENST00000624542.1,0,+


There are two different ways to select a subset of your DataFrame: by label and by position. When selecting by label, you specify the rows and columns you want by their index values and headers. When selecting by position, you specify where they sit in the DataFrame, counting from 0. The distinction matters once you filter or reorder your DataFrame, because the labels then no longer match the positions.

**Selection by Label** 

using ```.loc```. Note that a label slice such as ```4:10``` includes the end label.

In [31]:
df.loc[:,['start','strand']]

Unnamed: 0,start,strand
0,230125030,+
1,29913143,+
2,61801712,+
3,23567063,-
4,82587312,+
5,127401725,+
6,86481942,+
7,53947623,+
8,61801712,+
9,87881990,+


In [32]:
df.loc[4:10,['start','strand']]

Unnamed: 0,start,strand
4,82587312,+
5,127401725,+
6,86481942,+
7,53947623,+
8,61801712,+
9,87881990,+
10,91488742,+


**Selection by Position**

using ```.iloc```

In [33]:
df.iloc[3]

chromosome                 chr1
start                  23567063
end                    23573122
gene          ENST00000454863.3
score                         0
strand                        -
Name: 3, dtype: object

In [34]:
df.iloc[3:5,0:2] #similar to numpy method

Unnamed: 0,chromosome,start
3,chr1,23567063
4,chr1,82587312


In [35]:
df.iloc[1:3,:] #slicing rows explicitly

Unnamed: 0,chromosome,start,end,gene,score,strand
1,chr1,29913143,29915337,ENST00000623731.1,0,+
2,chr1,61801712,61803634,ENST00000624542.1,0,+


In [36]:
df.iloc[:,1:3] #slicing columns explicitly

Unnamed: 0,start,end
0,230125030,230174106
1,29913143,29915337
2,61801712,61803634
3,23567063,23573122
4,82587312,82588411
5,127401725,127470569
6,86481942,86496996
7,53947623,53974950
8,61801712,61803634
9,87881990,87883995
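
The difference between ```.loc``` and ```.iloc``` is easiest to see on a DataFrame whose rows have been reordered. The sketch below uses a made-up table (`toy` and its values are assumptions for illustration).

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'gene': ['geneC', 'geneA', 'geneB'],
    'start': [50, 100, 300],
})
by_gene = toy.sort_values(by='gene')   # index is now [1, 2, 0]

# .loc looks up the row whose index label is 0
print(by_gene.loc[0, 'gene'])

# .iloc looks up the row sitting in position 0 (the first row)
print(by_gene.iloc[0, 0])

# Label slices include the end label; position slices exclude the end position
print(len(toy.loc[0:1]))
print(len(toy.iloc[0:1]))
```

On the unsorted `toy` the labels and positions happen to agree, which is why the two selectors are easy to confuse until the rows move.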


# Thresholding and Filtering

In [44]:
df[df['start'] > 100000000]

Unnamed: 0,chromosome,start,end,gene,score,strand
0,chr1,230125030,230174106,ENST00000454058.2,0,+
5,chr1,127401725,127470569,ENST00000509671.1,0,+


In [45]:
df[df['chromosome'].isin(['chr2','chr3'])]

Unnamed: 0,chromosome,start,end,gene,score,strand
6,chr3,86481942,86496996,ENST00000460586.1,0,+
8,chr3,61801712,61803634,ENST00000465586.1,0,+
9,chr3,87881990,87883995,ENST00000466786.1,0,+
10,chr3,91488742,91876543,ENST00000460876.1,0,+
11,chr3,94876501,94888501,ENST00000460599.2,0,+
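
Both filters above work by building a boolean Series (one True/False per row) and passing it to the DataFrame. As a sketch of how several conditions can be combined, here is a made-up table (`toy` and its values are assumptions for illustration).

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({
    'chromosome': ['chr1', 'chr1', 'chr3', 'chr3'],
    'start': [100, 900, 50, 700],
    'strand': ['+', '-', '+', '+'],
})

# Each comparison yields a boolean Series with one value per row
mask = toy['start'] > 500
print(mask.tolist())

# Combine conditions with & (and), | (or), ~ (not).
# Each condition needs its own parentheses.
both = toy[(toy['start'] > 500) & (toy['strand'] == '+')]
either = toy[(toy['chromosome'] == 'chr1') | (toy['start'] > 500)]
not_chr1 = toy[~toy['chromosome'].isin(['chr1'])]

print(both)
print(either)
print(not_chr1)
```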


# Joining DataFrames

In [51]:
df1 = df[0:3]
df2 = df[7:]
print(df1)
print(df2)

dfs = [df2, df1]

df3 = pd.concat(dfs) 

  chromosome      start        end               gene  score strand
0       chr1  230125030  230174106  ENST00000454058.2      0      +
1       chr1   29913143   29915337  ENST00000623731.1      0      +
2       chr1   61801712   61803634  ENST00000624542.1      0      +
   chromosome     start       end               gene  score strand
7        chr1  53947623  53974950  ENST00000558866.4      0      +
8        chr3  61801712  61803634  ENST00000465586.1      0      +
9        chr3  87881990  87883995  ENST00000466786.1      0      +
10       chr3  91488742  91876543  ENST00000460876.1      0      +
11       chr3  94876501  94888501  ENST00000460599.2      0      +


In [52]:
df3

Unnamed: 0,chromosome,start,end,gene,score,strand
7,chr1,53947623,53974950,ENST00000558866.4,0,+
8,chr3,61801712,61803634,ENST00000465586.1,0,+
9,chr3,87881990,87883995,ENST00000466786.1,0,+
10,chr3,91488742,91876543,ENST00000460876.1,0,+
11,chr3,94876501,94888501,ENST00000460599.2,0,+
0,chr1,230125030,230174106,ENST00000454058.2,0,+
1,chr1,29913143,29915337,ENST00000623731.1,0,+
2,chr1,61801712,61803634,ENST00000624542.1,0,+
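
Notice that ```pd.concat()``` keeps the original row labels of each piece. If you would rather have a fresh index, the ```ignore_index``` argument is one option. The sketch below uses made-up tables (`top`, `bottom`, and their values are assumptions for illustration).

```python
import pandas as pd

# Hypothetical toy pieces; bottom is given a row label of 7 on purpose
top = pd.DataFrame({'gene': ['geneA', 'geneB'], 'start': [100, 300]})
bottom = pd.DataFrame({'gene': ['geneC'], 'start': [50]}, index=[7])

# Default: each piece keeps its own row labels
stacked = pd.concat([bottom, top])
print(list(stacked.index))

# ignore_index=True discards the old labels and renumbers the rows
renumbered = pd.concat([bottom, top], ignore_index=True)
print(list(renumbered.index))
```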


In [58]:
left = pd.DataFrame({'chromosome': ['chr1', 'chr3'], 'start': [0, 304]})
right = pd.DataFrame({'chromosome': ['chr1', 'chr3'], 'end': [10, 400]})

print(left)
print(right)

  chromosome  start
0       chr1      0
1       chr3    304
  chromosome  end
0       chr1   10
1       chr3  400


In [59]:
pd.merge(left, right, on='chromosome')

Unnamed: 0,chromosome,start,end
0,chr1,0,10
1,chr3,304,400
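
In the merge above, every chromosome appears in both tables. As a sketch of what happens when the keys don't fully match, the made-up tables below (`left_toy`, `right_toy`, and their values are assumptions for illustration) share only `chr1`. The ```how``` argument of ```pd.merge()``` controls which rows are kept.

```python
import pandas as pd

# Hypothetical toy tables that only share the key 'chr1'
left_toy = pd.DataFrame({'chromosome': ['chr1', 'chr2'], 'start': [0, 50]})
right_toy = pd.DataFrame({'chromosome': ['chr1', 'chr3'], 'end': [10, 400]})

# Default (how='inner'): keep only keys found in both tables
inner = pd.merge(left_toy, right_toy, on='chromosome')
print(inner)

# how='outer': keep keys from either table; missing values become NaN
outer = pd.merge(left_toy, right_toy, on='chromosome', how='outer')
print(outer)
```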


# Applying functions to columns

One of the best features of pandas is that you don't have to loop through each row to manipulate the values in each cell. Instead, use built-in functions or use the ```apply()``` function to apply your own function!

In [60]:
def square(x):
    return x**2

In [62]:
df['start']

0     230125030
1      29913143
2      61801712
3      23567063
4      82587312
5     127401725
6      86481942
7      53947623
8      61801712
9      87881990
10     91488742
11     94876501
Name: start, dtype: int64

In [61]:
df['start'].apply(square)

0     52957529432500900
1       894796124138449
2      3819451606130944
3       555406458445969
4      6820664103385344
5     16231199532975625
6      7479126292091364
7      2910346027350129
8      3819451606130944
9      7723244166360100
10     8370189912742564
11     9001550442003001
Name: start, dtype: int64

In [66]:
df.head()

Unnamed: 0,chromosome,start,end,gene,score,strand
0,chr1,230125030,230174106,ENST00000454058.2,0,+
1,chr1,29913143,29915337,ENST00000623731.1,0,+
2,chr1,61801712,61803634,ENST00000624542.1,0,+
3,chr1,23567063,23573122,ENST00000454863.3,0,-
4,chr1,82587312,82588411,ENST00000575085.1,0,+


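```apply()``` returns a new Series and leaves ```df``` unchanged, as the ```df.head()``` output above shows. To save a manipulation to a DataFrame column, you must assign the result back to it.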

In [70]:
df['start'] = df['start'].apply(square)
df

Unnamed: 0,chromosome,start,end,gene,score,strand
0,chr1,52957529432500900,230174106,ENST00000454058.2,0,+
1,chr1,894796124138449,29915337,ENST00000623731.1,0,+
2,chr1,3819451606130944,61803634,ENST00000624542.1,0,+
3,chr1,555406458445969,23573122,ENST00000454863.3,0,-
4,chr1,6820664103385344,82588411,ENST00000575085.1,0,+
5,chr1,16231199532975625,127470569,ENST00000509671.1,0,+
6,chr3,7479126292091364,86496996,ENST00000460586.1,0,+
7,chr1,2910346027350129,53974950,ENST00000558866.4,0,+
8,chr3,3819451606130944,61803634,ENST00000465586.1,0,+
9,chr3,7723244166360100,87883995,ENST00000466786.1,0,+
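
For simple arithmetic like squaring, the built-in operators mentioned earlier do the same job as ```apply()``` without a helper function. Here is a minimal sketch on a made-up table (`toy` and its values are assumptions for illustration).

```python
import pandas as pd

# Hypothetical toy data
toy = pd.DataFrame({'start': [3, 10, 25], 'end': [8, 40, 30]})

def square(x):
    return x ** 2

# apply calls square once per value
via_apply = toy['start'].apply(square)

# The same arithmetic written directly on the whole column
via_operator = toy['start'] ** 2

print(via_apply.tolist() == via_operator.tolist())

# Arithmetic between two columns works row by row as well;
# assigning to a new column name saves the result
toy['length'] = toy['end'] - toy['start']
print(toy['length'].tolist())
```

```apply()``` is still the tool to reach for when the operation is your own function and has no built-in equivalent.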


# Level 3 with Pandas

In [72]:
bed = 'C:/Users/sksuzuki/Documents/EXCISION/level_3/ENCFF239FSU.bed'

df = pd.read_csv(bed, sep='\t', header=None)
df.head()

Unnamed: 0,0,1,2,3,4,5
0,chr1,230125030,230174106,ENST00000454058.2,0,+
1,chr1,29913143,29915337,ENST00000623731.1,0,+
2,chr1,61801712,61803634,ENST00000624542.1,0,+
3,chr1,23567063,23573122,ENST00000454863.3,0,-
4,chr1,82587312,82588411,ENST00000575085.1,0,+


In [74]:
# Filter for positive strand
df_p = df[df[5] == '+'].copy()
df_p.head()

Unnamed: 0,0,1,2,3,4,5
0,chr1,230125030,230174106,ENST00000454058.2,0,+
1,chr1,29913143,29915337,ENST00000623731.1,0,+
2,chr1,61801712,61803634,ENST00000624542.1,0,+
4,chr1,82587312,82588411,ENST00000575085.1,0,+
5,chr1,127401725,127470569,ENST00000509671.1,0,+
6,chr3,86481942,86496996,ENST00000460586.1,0,+
7,chr1,53947623,53974950,ENST00000558866.4,0,+
8,chr3,61801712,61803634,ENST00000465586.1,0,+
9,chr3,87881990,87883995,ENST00000466786.1,0,+
10,chr3,91488742,91876543,ENST00000460876.1,0,+


In [76]:
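# Find gene interval (end - start). df[1] is the unfiltered column, so pandas
# aligns it to df_p by index and the result comes out as float.
# Using df_p[1] instead would keep the values as integers.
df_p[6] = df_p[2] - df[1]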
df_p.head()

Unnamed: 0,0,1,2,3,4,5,6
0,chr1,230125030,230174106,ENST00000454058.2,0,+,49076.0
1,chr1,29913143,29915337,ENST00000623731.1,0,+,2194.0
2,chr1,61801712,61803634,ENST00000624542.1,0,+,1922.0
4,chr1,82587312,82588411,ENST00000575085.1,0,+,1099.0
5,chr1,127401725,127470569,ENST00000509671.1,0,+,68844.0


In [77]:
# Get each of the chromosome names
chromo_names = df_p[0].unique()
chromo_names

array(['chr1', 'chr3'], dtype=object)

In [78]:
# Keep the 5 largest intervals (column 6) for each unique chromosome name
df_top = []
for contig in chromo_names:
    df_contig = df_p[df_p[0] == contig]
    df_top.append(df_contig.nlargest(5, 6))  # n=5 rows, ranked by column 6
print(df_top)
df_top = pd.concat(df_top)

print(df_top)

[      0          1          2                  3  4  5        6
5  chr1  127401725  127470569  ENST00000509671.1  0  +  68844.0
0  chr1  230125030  230174106  ENST00000454058.2  0  +  49076.0
7  chr1   53947623   53974950  ENST00000558866.4  0  +  27327.0
1  chr1   29913143   29915337  ENST00000623731.1  0  +   2194.0
2  chr1   61801712   61803634  ENST00000624542.1  0  +   1922.0,        0         1         2                  3  4  5         6
10  chr3  91488742  91876543  ENST00000460876.1  0  +  387801.0
6   chr3  86481942  86496996  ENST00000460586.1  0  +   15054.0
11  chr3  94876501  94888501  ENST00000460599.2  0  +   12000.0
9   chr3  87881990  87883995  ENST00000466786.1  0  +    2005.0
8   chr3  61801712  61803634  ENST00000465586.1  0  +    1922.0]
       0          1          2                  3  4  5         6
5   chr1  127401725  127470569  ENST00000509671.1  0  +   68844.0
0   chr1  230125030  230174106  ENST00000454058.2  0  +   49076.0
7   chr1   53947623   53974950 

In [79]:
# Get the sum of the index
sum(df_top.index)

59
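
The loop over chromosome names can also be written with ```groupby()```. This is only a sketch of one alternative, not the approach used above. It runs on a made-up table (`toy` and its values are assumptions for illustration) and keeps the top 2 rows per chromosome instead of 5 so the result stays small.

```python
import pandas as pd

# Hypothetical toy data with a precomputed interval length
toy = pd.DataFrame({
    'chromosome': ['chr1', 'chr1', 'chr1', 'chr3', 'chr3'],
    'length': [500, 20, 300, 40, 900],
})

# Sort by length (largest first), then keep the first 2 rows
# of each chromosome group
top2 = (
    toy.sort_values('length', ascending=False)
       .groupby('chromosome')
       .head(2)
)
print(top2)

# Sum of the original row labels of the selected rows
print(sum(top2.index))
```

Because the rows keep their original labels, the index sum can be computed the same way as in the loop version.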

# Further reading and practice

- 10 minute quick introduction: https://pandas.pydata.org/pandas-docs/stable/10min.html
- Merge, Joining and Concatenating dataframes: https://pandas.pydata.org/pandas-docs/stable/merging.html
- More on indexing: https://pandas.pydata.org/pandas-docs/stable/indexing.html