# Next Generation DNA Genome Sequencing

>  *introduction to SparkSQL*   
>  *reading DNA .vcf files*    
>  *processing .vcf files in bulk to derive insights*  

## Diving In

> *Tom Bresee*



<br>
<br>

### A.  &nbsp;  Load Relevant Python Libraries

In [1]:

# my setup:

#   In my case, this runs on Anaconda under Windows 10.

#   I'll do the same thing on Databricks and on plain Apache Spark
#   in another file.


In [2]:
# load basic libraries 
import numpy as np
import pandas as pd
import logging
import random  # keep the module; `from random import random` would shadow it
from operator import add
import os
import sys
import time  # keep the module; `from time import time` would shadow it
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

In [3]:
# import more libraries
import pyspark
from pyspark import SparkConf
from pyspark import SparkContext
# SQL
from pyspark.sql import SQLContext
from pyspark.sql import SparkSession
from pyspark.sql import Row
# Alchemy
from sqlalchemy.engine import create_engine
# ALS
from pyspark.ml import Pipeline
from pyspark.mllib.recommendation import ALS
# Logistic Regression
from pyspark.ml.classification import LogisticRegression
from pyspark.mllib.classification import LogisticRegressionWithLBFGS
# Kmeans
from pyspark.mllib.clustering import KMeans

In [4]:
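# find the Spark installation
import findspark
# findspark.find()  # returns the Spark installation directory
findspark.init()
# to be safe on Windows: init() adds pyspark to sys.path and is normally
# called before `import pyspark`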


<br>

### B.  &nbsp; This is the command you start with on Windows 10 to get SparkSQL going

SparkSession in Spark 2.0 provides built-in support for Hive features, including the ability to write queries using HiveQL, access to Hive UDFs, and the ability to read data from Hive tables. You do not need an existing Hive setup to use these features.

In [6]:

# READ:
#   The entry point into **all** functionality in Spark is the *SparkSession* class.
#   There are no longer separate context instances for each kind of application.

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Python Spark SQL basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()



In [7]:

print(spark)

# see how you now have a .sql.session.SparkSession (it is sql specific)


<pyspark.sql.session.SparkSession object at 0x000001CC6FF767F0>


In [8]:

print(type(spark))


<class 'pyspark.sql.session.SparkSession'>


In [9]:

spark


In [10]:

for i in dir(spark):
    if not i.startswith("_"):
        print(i)
        
# these are the attributes and methods available on the SparkSession
# e.g. spark.read returns a DataFrameReader and spark.udf handles UDF registration

# spark.stop(), for instance, stops the SparkSession


Builder
builder
catalog
conf
createDataFrame
newSession
range
read
readStream
sparkContext
sql
stop
streams
table
udf
version


In [11]:

# what is your spark version ? 

spark.version


'2.4.3'



In earlier versions of Spark, the SparkContext was the entry point. Because the RDD was the main API, RDDs were created and manipulated through the context, and every other API needed its own context: StreamingContext for streaming, SQLContext for SQL, and HiveContext for Hive.

As the Dataset and DataFrame APIs became the standard, they needed an entry point of their own. Spark 2.0 introduced SparkSession for that purpose. SparkSession essentially combines SQLContext and HiveContext (with StreamingContext planned for the future), so the APIs available on those contexts are also available on the session. Internally, a SparkSession holds a SparkContext that performs the actual computation.

So if you get confused, remember that Spark 2.x changed how contexts are created.

In [None]:

# spark.catalog()


<br>

### C.  &nbsp;  Begin Diving into SparkSQL

#### Inferring the Schema

With a SparkSession, we can create a DataFrame from an existing RDD, but Spark SQL first needs to know the schema of the data. Spark SQL can convert an RDD of Row objects to a DataFrame. Rows are constructed by passing key/value pairs as kwargs to the Row class. The keys define the column names, and by default the types are inferred from the first row. It is therefore important that the first row of the RDD has no missing data, so that the schema is inferred properly.
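
To make the first-row caveat concrete, here is a plain-Python sketch (no Spark involved) of what inferring column types from a single record looks like. The helper `infer_schema_from_first_row` and the sample records are hypothetical and this is not Spark's implementation; it only mimics the idea that a missing value in the first row leaves the column type undetermined.

```python
def infer_schema_from_first_row(rows):
    """Map each column name to the Python type name seen in the first row only."""
    if not rows:
        raise ValueError("cannot infer a schema from an empty dataset")
    first = rows[0]
    return {name: type(value).__name__ for name, value in first.items()}


# Made-up records: one dataset with a complete first row, one with a gap in it.
complete_first = [{"name": "Andy", "age": 30}, {"name": "Justin", "age": 19}]
missing_first = [{"name": "Michael", "age": None}, {"name": "Andy", "age": 30}]

schema_ok = infer_schema_from_first_row(complete_first)
schema_bad = infer_schema_from_first_row(missing_first)

print(schema_ok)   # age is seen as an int
print(schema_bad)  # age is seen as NoneType, so its real type is unknown
```

In the second case the later rows do carry an integer age, but a first-row-only inference never looks at them.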

With a SparkSession, applications can create DataFrames from an existing RDD, from a Hive table, or from Spark data sources. The JSON example comes a few cells further down; the next cells first look at what the catalog contains.

In [None]:

# print out all existing databases ! 

spark.catalog.listDatabases()


In [None]:

# current database:

spark.catalog.currentDatabase()


In [None]:

# list all current tables, there are none yet ...

spark.catalog.listTables()


In [None]:

# list all .methods under catalog:

for i in dir(spark.catalog): 
    if not i.startswith("_"): 
        print(i)

# you will see me do this a lot: it lists the commands available under spark.catalog.X


<br>

### Reading a json file into a dataframe 

In [None]:

# spark is the existing sql SparkSession
df = spark.read.json("C:/SPARK/examples/example.json")

# Displays the content of the DataFrame to stdout
df.show()


In [None]:

print("the df instance is this type ->  ", type(df))


In [None]:

# list the methods under read, i.e. what are my options for reading data in?

for i in dir(spark.read):
    if not i.startswith("_"):
        print(i)

# i.e. you have spark.read.csv, spark.read.json, spark.read.parquet, etc.
        

In [None]:

# basic examples of structured data processing using Datasets

# In Python, it's possible to access a DataFrame's columns either by attribute (df.age) or
# by indexing (df['age']). The second form is recommended, but yes, the first is convenient,
# we have all done it at some point ...

# Print the schema in a tree format
df.printSchema()

# you are in the databases world now, think like that...


In [None]:

# so if you have ever worked with pandas dataframes, this is similar, but there are
# a fair number of differences, so think of this as a new construct...

# Select only the "supplier" column
df.select("supplier").show()



In [None]:

# Select the quantity col but increment the 'quantity' by 1000
df.select(df['quantity'] + 1000).show()


In [None]:

df.select()


In [None]:

# create a new df with certain cols only 

df2 = df.select("quantity", "supplier")
df2.show()


In [None]:

# when you use SQL, you write commands like select * from <name>; it's the same thing...

df.select('*').show()


In [None]:

print(type(df))


In [None]:

# note:  select is a transformation, not an action 
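
The transformation/action distinction can be pictured with ordinary Python generators. This is only an analogy in plain Python, not Spark code: `lazy_select` is a hypothetical stand-in for a transformation (it describes work without doing it), `list(...)` plays the role of an action such as `show()` or `collect()`, and the two rows are made-up data.

```python
evaluated = []  # records which rows were actually processed


def lazy_select(rows, column):
    """Generator: yields one column value per row, only when consumed."""
    for row in rows:
        evaluated.append(row["supplier"])
        yield row[column]


rows = [
    {"supplier": "acme", "quantity": 100},
    {"supplier": "globex", "quantity": 300},
]

pipeline = lazy_select(rows, "quantity")  # "transformation": nothing runs yet
seen_before_action = list(evaluated)      # still empty at this point

result = list(pipeline)                   # "action": the rows are processed now
seen_after_action = list(evaluated)

print(seen_before_action, result, seen_after_action)
```

Building the pipeline costs nothing; the work happens only when something asks for the results.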


In [None]:

# Select products with a quantity greater than 250
df.filter(df['quantity'] > 250).show()


In [None]:

# Count products by quantity
df.groupBy("quantity").count().show()


In [None]:

# The sql function on a SparkSession enables applications to run SQL queries 
# programmatically and returns the result as a DataFrame.
# Register the DataFrame as a SQL temporary view
df.createOrReplaceTempView("products")

sqlDF = spark.sql("SELECT * FROM products")
sqlDF.show()
    

In [None]:

# # $example on:global_temp_view$
# # Register the DataFrame as a global temporary view
# df.createGlobalTempView("people")

# # Global temporary view is tied to a system preserved database `global_temp`
# spark.sql("SELECT * FROM global_temp.people").show()

# # Global temporary view is cross-session
# spark.newSession().sql("SELECT * FROM global_temp.people").show()    
    

In [None]:

# $example on:schema_inferring$
sc = spark.sparkContext

# Load a text file and convert each line to a Row.
lines = sc.textFile("C:/SPARK/examples/src/main/resources/people.txt")
parts = lines.map(lambda l: l.split(","))
people = parts.map(lambda p: Row(name=p[0], age=int(p[1])))


In [None]:

# Infer the schema, and register the DataFrame as a table.
schemaPeople = spark.createDataFrame(people)
schemaPeople.createOrReplaceTempView("people")

# SQL can be run over DataFrames that have been registered as a table
teenagers = spark.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")

# The results of SQL queries are Dataframe objects...
# rdd returns the content as an :class:`pyspark.RDD` of :class:`Row`.
teenNames = teenagers.rdd.map(lambda p: "Name: " + p.name).collect()

for name in teenNames:
    print(name)
    

In [None]:

teenagers.show()


In [None]:
sc

In [None]:
spark

### quickref

```


"""
A simple example demonstrating basic Spark SQL features.
Run with:
  ./bin/spark-submit examples/src/main/python/sql/basic.py
"""

from __future__ import print_function

# $example on:init_session$
from pyspark.sql import SparkSession
# $example off:init_session$

# $example on:schema_inferring$
from pyspark.sql import Row
# $example off:schema_inferring$

# $example on:programmatic_schema$
# Import data types
from pyspark.sql.types import *
# $example off:programmatic_schema$


def basic_df_example(spark):
    # $example on:create_df$
    # spark is an existing SparkSession
    df = spark.read.json("examples/src/main/resources/people.json")
    # Displays the content of the DataFrame to stdout
    df.show()
    # +----+-------+
    # | age|   name|
    # +----+-------+
    # |null|Michael|
    # |  30|   Andy|
    # |  19| Justin|
    # +----+-------+
    # $example off:create_df$

    # $example on:untyped_ops$
    # spark, df are from the previous example
    # Print the schema in a tree format
    df.printSchema()
    # root
    # |-- age: long (nullable = true)
    # |-- name: string (nullable = true)

    # Select only the "name" column
    df.select("name").show()
    # +-------+
    # |   name|
    # +-------+
    # |Michael|
    # |   Andy|
    # | Justin|
    # +-------+

    # Select everybody, but increment the age by 1
    df.select(df['name'], df['age'] + 1).show()
    # +-------+---------+
    # |   name|(age + 1)|
    # +-------+---------+
    # |Michael|     null|
    # |   Andy|       31|
    # | Justin|       20|
    # +-------+---------+

    # Select people older than 21
    df.filter(df['age'] > 21).show()
    # +---+----+
    # |age|name|
    # +---+----+
    # | 30|Andy|
    # +---+----+

    # Count people by age
    df.groupBy("age").count().show()
    # +----+-----+
    # | age|count|
    # +----+-----+
    # |  19|    1|
    # |null|    1|
    # |  30|    1|
    # +----+-----+
    # $example off:untyped_ops$

    # $example on:run_sql$
    # Register the DataFrame as a SQL temporary view
    df.createOrReplaceTempView("people")

    sqlDF = spark.sql("SELECT * FROM people")
    sqlDF.show()
    # +----+-------+
    # | age|   name|
    # +----+-------+
    # |null|Michael|
    # |  30|   Andy|
    # |  19| Justin|
    # +----+-------+
    # $example off:run_sql$

    # $example on:global_temp_view$
    # Register the DataFrame as a global temporary view
    df.createGlobalTempView("people")

    # Global temporary view is tied to a system preserved database `global_temp`
    spark.sql("SELECT * FROM global_temp.people").show()
    # +----+-------+
    # | age|   name|
    # +----+-------+
    # |null|Michael|
    # |  30|   Andy|
    # |  19| Justin|
    # +----+-------+

    # Global temporary view is cross-session
    spark.newSession().sql("SELECT * FROM global_temp.people").show()
    # +----+-------+
    # | age|   name|
    # +----+-------+
    # |null|Michael|
    # |  30|   Andy|
    # |  19| Justin|
    # +----+-------+
    # $example off:global_temp_view$


def schema_inference_example(spark):
    # $example on:schema_inferring$
    sc = spark.sparkContext

    # Load a text file and convert each line to a Row.
    lines = sc.textFile("examples/src/main/resources/people.txt")
    parts = lines.map(lambda l: l.split(","))
    people = parts.map(lambda p: Row(name=p[0], age=int(p[1])))

    # Infer the schema, and register the DataFrame as a table.
    schemaPeople = spark.createDataFrame(people)
    schemaPeople.createOrReplaceTempView("people")

    # SQL can be run over DataFrames that have been registered as a table.
    teenagers = spark.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")

    # The results of SQL queries are Dataframe objects.
    # rdd returns the content as an :class:`pyspark.RDD` of :class:`Row`.
    teenNames = teenagers.rdd.map(lambda p: "Name: " + p.name).collect()
    for name in teenNames:
        print(name)
    # Name: Justin
    # $example off:schema_inferring$


def programmatic_schema_example(spark):
    # $example on:programmatic_schema$
    sc = spark.sparkContext

    # Load a text file and convert each line to a Row.
    lines = sc.textFile("examples/src/main/resources/people.txt")
    parts = lines.map(lambda l: l.split(","))
    # Each line is converted to a tuple.
    people = parts.map(lambda p: (p[0], p[1].strip()))

    # The schema is encoded in a string.
    schemaString = "name age"

    fields = [StructField(field_name, StringType(), True) for field_name in schemaString.split()]
    schema = StructType(fields)

    # Apply the schema to the RDD.
    schemaPeople = spark.createDataFrame(people, schema)

    # Creates a temporary view using the DataFrame
    schemaPeople.createOrReplaceTempView("people")

    # SQL can be run over DataFrames that have been registered as a table.
    results = spark.sql("SELECT name FROM people")

    results.show()
    # +-------+
    # |   name|
    # +-------+
    # |Michael|
    # |   Andy|
    # | Justin|
    # +-------+
    # $example off:programmatic_schema$

if __name__ == "__main__":
    # $example on:init_session$
    spark = SparkSession \
        .builder \
        .appName("Python Spark SQL basic example") \
        .config("spark.some.config.option", "some-value") \
        .getOrCreate()
    # $example off:init_session$

    basic_df_example(spark)
    schema_inference_example(spark)
    programmatic_schema_example(spark)

    spark.stop()

```

<br>

In [None]:

# Spark properties, the manual way, instead of having to go to the UI!

print("sc.appName:\t\t", sc.appName)

print("sc.applicationId:\t", sc.applicationId)

print("sc.master:\t\t", sc.master)


<br>


Spark SQL can convert an RDD of Row objects to a DataFrame, inferring the data types. Rows are constructed by passing key/value pairs as kwargs to the Row class. The keys define the column names of the table. The Spark SQL guide describes the types as inferred by sampling the dataset, similar to the inference performed on JSON files; in PySpark, `createDataFrame` starts from the first row unless a `samplingRatio` is given.


In [None]:

from pyspark.sql import Row
sc = spark.sparkContext

# Load a text file and convert each line to a Row.
lines = sc.textFile("C:/SPARK/examples/src/main/resources/people.txt")
parts = lines.map(lambda l: l.split(","))
people = parts.map(lambda p: Row(name=p[0], age=int(p[1])))

# Infer the schema, and register the DataFrame as a table.
schemaPeople = spark.createDataFrame(people)
schemaPeople.createOrReplaceTempView("people")

# SQL can be run over DataFrames that have been registered as a table.
teenagers = spark.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")

# The results of SQL queries are Dataframe objects.
# rdd returns the content as an :class:`pyspark.RDD` of :class:`Row`.
teenNames = teenagers.rdd.map(lambda p: "Name: " + p.name).collect()
for name in teenNames:
    print(name)


<br>

### data files

Spark SQL supports operating on a variety of data sources through the DataFrame interface. A DataFrame can be operated on using relational transformations and can also be used to create a temporary view. Registering a DataFrame as a temporary view allows you to run SQL queries over its data. This section describes the general methods for loading and saving data using the Spark Data Sources and then goes into specific options that are available for the built-in data sources.

<br>

### json 

In [None]:


path = "examples/src/main/resources/people.json"
peopleDF = spark.read.json(path)

# The inferred schema can be visualized using the printSchema() method
peopleDF.printSchema()
# root
#  |-- age: long (nullable = true)
#  |-- name: string (nullable = true)

# Creates a temporary view using the DataFrame
peopleDF.createOrReplaceTempView("people")

# SQL statements can be run by using the sql methods provided by spark
teenagerNamesDF = spark.sql("SELECT name FROM people WHERE age BETWEEN 13 AND 19")
teenagerNamesDF.show()
# +------+
# |  name|
# +------+
# |Justin|
# +------+

# Alternatively, a DataFrame can be created for a JSON dataset represented by
# an RDD[String] storing one JSON object per string
jsonStrings = ['{"name":"Yin","address":{"city":"Columbus","state":"Ohio"}}']
otherPeopleRDD = sc.parallelize(jsonStrings)
otherPeople = spark.read.json(otherPeopleRDD)
otherPeople.show()
# +---------------+----+
# |        address|name|
# +---------------+----+
# |[Columbus,Ohio]| Yin|
# +---------------+----+



In [None]:

df2 = spark.read.json("C:/SPARK/examples/test.json")
df2.show()


In [None]:
df2.printSchema()

In [None]:
df2.createOrReplaceTempView("names")

In [None]:

output = spark.sql("SELECT name FROM names WHERE number_of_files BETWEEN 100 AND 300")
output.show()


In [None]:

# # spark is an existing SparkSession
# dfs = spark.read.json("C:/SPARK/examples/people.json")

# # Displays the content of the DataFrame to stdout
# dfs.show()


In [None]:

# spark is an existing SparkSession
dfs = spark.read.json("C:/SPARK/examples/example.json")

# Displays the content of the DataFrame to stdout
dfs.show()


In [None]:
from pyspark.sql import Row

In [None]:

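# spark is an existing SparkSession

# Assumption: sqlSparkContext was never defined in this notebook. The calls made on it
# below (.read, .sql, .tables()) match the legacy SQLContext API, so it is created here.
from pyspark.sql import SQLContext
sqlSparkContext = SQLContext(sc)

path = 'C:/SPARK/examples/src/main/resources/people.json'

df = sqlSparkContext.read.json(path)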

# A JSON dataset is pointed to by path.
# The path can be either a single text file or a directory storing text files


*  https://spark.apache.org/docs/latest/sql-data-sources-json.html

In [None]:

# The inferred schema can be visualized using the printSchema() method
df.printSchema()


In [None]:

print(type(sqlSparkContext))
print(type(sc))
print(type(df))


<br>

### D.  &nbsp;  SQL go 

In [None]:

# Creates a temporary view using the DataFrame
df.createOrReplaceTempView("people")  # in my case would be df.createOrReplace

# SQL statements can be run by using the sql methods provided by spark
df_andy = sqlSparkContext.sql("SELECT name FROM people WHERE age BETWEEN 20 and 40")
df_andy.show()


In [None]:

df_michael = sqlSparkContext.sql("SELECT name FROM people WHERE name = 'Michael'")
df_michael.show()


In [None]:

dir(sqlSparkContext)


In [None]:

#---  REFERENCE - READING OPTIONS  ---

# dir(sqlSparkContext.read)

#  'csv',
#  'format',
#  'jdbc',
#  'json',
#  'load',
#  'option',
#  'options',
#  'orc',
#  'parquet',
#  'schema',
#  'table',
#  'text'



In [None]:

sqlSparkContext.tables()


In [None]:

sqlSparkContext.tables


In [None]:

dir(sqlSparkContext.sql)


<br>


### E.  &nbsp; Go Off 

In [None]:

df_t = sqlSparkContext.read.load("C:/SPARK/examples/src/main/resources/users.parquet")

print(type(df_t))


In [None]:

#---  REFERENCE - your option methods for this DF concept  ---

for i in dir(df_t):
    if not i.startswith("_"):
        print(i)



A DataFrame is a distributed collection of data organized into named columns. Conceptually, it is equivalent to a table in a relational database, with richer optimizations under the hood.

A DataFrame can be constructed from a wide array of sources such as Hive tables, structured data files, external databases, or existing RDDs. The API was designed for modern big data and data science applications, taking inspiration from data frames in R and pandas in Python.



#### Features of DataFrame

Here are a few characteristic features of a DataFrame:

* Processes data ranging from kilobytes to petabytes, on anything from a single-node cluster to a large cluster.
* Supports different data formats (Avro, CSV, Elasticsearch, Cassandra) and storage systems (HDFS, Hive tables, MySQL, etc.).
* State-of-the-art optimization and code generation through the Spark SQL Catalyst optimizer (a tree transformation framework).
* Integrates with other big data tools and frameworks via Spark Core.
* Provides APIs for Python, Java, Scala, and R.



Spark SQL is Spark's module for structured data processing. It provides a programming abstraction called DataFrame and can also act as a distributed SQL query engine.

The following command initializes the SparkContext through spark-shell. By default, the SparkContext object is available under the name `sc` when spark-shell starts.

```
$ spark-shell
```

Use the following command to create a SQLContext:

```
scala> val sqlcontext = new org.apache.spark.sql.SQLContext(sc)
```

* some data files 
   *  https://github.com/apache/spark/tree/master/examples/src/main/resources

In [None]:

michael = sqlSparkContext.sql("SELECT name FROM people WHERE name = 'Michael'")


In [None]:
sc

In [None]:

# ---  REFERENCE - list out the dataframes currently created ! --- 

def list_dataframes():
    from pyspark.sql import DataFrame
    return [k for (k, v) in globals().items() if isinstance(v, DataFrame)]

list_dataframes()


In [None]:

path = r'C:\SPARK\examples\src\main\python\employees.json'  # raw string so backslashes are not treated as escapes

df_emp = sqlSparkContext.read.json(path)    

df_emp
df_emp.show()

# scala> val dfs = sqlContext.read.json("employee.json")


In [None]:

# how many DF do i have right now ? 
print(list_dataframes())


In [None]:

df_emp.printSchema()


In [None]:

# direct call ! ! ! 

df_emp.select("name").show()

df_emp.select("salary").show()


In [None]:

# group by salary 
df_emp.groupBy("salary").count().show()


In [None]:

df_emp.filter(df_emp["salary"] > 3200).show()
# interesting:  scala uses () but python uses []



In [None]:

df_emp.write.parquet("C:/SPARK/examples/src/main/resources/tom_emp.parquet")


<br>



### F.  &nbsp;  Parquet

Parquet is a columnar format supported by many data processing systems. The advantages of columnar storage are as follows:

* Columnar storage limits I/O operations.
* Columnar storage can fetch only the specific columns you need to access.
* Columnar storage consumes less space.
* Columnar storage gives better-summarized data and follows type-specific encoding.

Spark SQL supports both reading and writing Parquet files, automatically preserving the schema of the original data. Parquet files are loaded and queried in the same way as JSON datasets.
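
A small plain-Python sketch of why a columnar layout can cut I/O: the same made-up records are stored row-wise and column-wise, and reading one field touches far fewer values in the columnar form. The data and variable names are illustrative assumptions only; real Parquet files add encoding, compression, and metadata on top of this idea.

```python
# Row-oriented layout: every record keeps all of its fields together.
rows = [
    ("Michael", 3000, "NY"),
    ("Andy", 4500, "CA"),
    ("Justin", 3500, "TX"),
]

# Column-oriented layout: each field is stored contiguously.
columns = {
    "name": [r[0] for r in rows],
    "salary": [r[1] for r in rows],
    "state": [r[2] for r in rows],
}

# Reading only "salary" from rows means walking every field of every record.
values_touched_row_layout = sum(len(r) for r in rows)
salaries_from_rows = [r[1] for r in rows]

# In the columnar layout only the one requested column is read.
values_touched_column_layout = len(columns["salary"])
salaries_from_columns = columns["salary"]

print(values_touched_row_layout, values_touched_column_layout)
```

Both layouts return the same salaries; the difference is how much data has to be scanned to get them.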

In [None]:

from pyspark.sql import SparkSession
# $example on:schema_merging$
from pyspark.sql import Row
# $example off:schema_merging$

# $example on:generic_load_save_functions$
df_users = sqlSparkContext.read.load("C:/SPARK/examples/src/main/resources/users.parquet")
# df.select("name", "favorite_color").write.save("namesAndFavColors.parquet")
# $example off:generic_load_save_functions$


```
peopleDF = spark.read.json("examples/src/main/resources/people.json")

# DataFrames can be saved as Parquet files, maintaining the schema information.
peopleDF.write.parquet("people.parquet")

# Read in the Parquet file created above.
# Parquet files are self-describing so the schema is preserved.
# The result of loading a parquet file is also a DataFrame.
parquetFile = spark.read.parquet("people.parquet")

# Parquet files can also be used to create a temporary view and then used in SQL statements.
parquetFile.createOrReplaceTempView("parquetFile")
teenagers = spark.sql("SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19")
teenagers.show()
# +------+
# |  name|
# +------+
# |Justin|
# +------+
```

In [None]:

# from os import walk
# from pyspark.sql import SQLContext

# sc = SparkContext.getOrCreate()
# sqlContext = SQLContext(sc)

# parquetdir = r'C:\PATH\TO\YOUR\PARQUET\FILES'

# # Getting all parquet files in a dir as spark contexts.
# # There might be more easy ways to access single parquets, but I had nested dirs
# dirpath, dirnames, filenames = next(walk(parquetdir), (None, [], []))

# # for each parquet file, i.e. table in our database, spark creates a tempview with
# # the respective table name equal the parquet filename
# print('New tables available: \n')

# for parquet in filenames:
#     print(parquet[:-8])
#     spark.read.parquet(parquetdir+'\\'+parquet).createOrReplaceTempView(parquet[:-8])
    
# my_test_query = spark.sql("""
# select
#   field1,
#   field2
# from parquetfilename1
# where
#   field1 = 'something'
# """)

# my_test_query.show()


In [None]:

squaresDF = sqlSparkContext.createDataFrame(sc.parallelize(range(1, 6)).map(lambda i: Row(single=i, double=i ** 2)))

# Both partition directories must sit under the same table directory
squaresDF.write.parquet("data/test_table/key=1")

# Create another DataFrame in a new partition directory,
# adding a new column and dropping an existing column
cubesDF = sqlSparkContext.createDataFrame(sc.parallelize(range(6, 11)).map(lambda i: Row(single=i, triple=i ** 3)))

cubesDF.write.parquet("data/test_table/key=2")

# Read the partitioned table
mergedDF = sqlSparkContext.read.option("mergeSchema", "true").parquet("data/test_table")
mergedDF.printSchema()

# The final schema consists of all 3 columns in the Parquet files together
# with the partitioning column appeared in the partition directory paths.
# root
#  |-- double: long (nullable = true)
#  |-- single: long (nullable = true)
#  |-- triple: long (nullable = true)
#  |-- key: integer (nullable = true)
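
The `key` column in the schema above does not come from the Parquet files themselves; it is read off the `key=1` and `key=2` directory names. The sketch below imitates that idea in plain Python. `partition_values` is a hypothetical helper, not a Spark API, the file names are made up, and unlike Spark it keeps values as strings instead of inferring an integer type. The merged column list is simply the union of the two files' columns plus the partition column.

```python
from pathlib import PurePosixPath


def partition_values(path):
    """Collect name=value directory segments from a path into a dict."""
    values = {}
    for segment in PurePosixPath(path).parts:
        if "=" in segment:
            name, _, value = segment.partition("=")
            values[name] = value
    return values


squares_path = "data/test_table/key=1/part-00000.parquet"  # hypothetical file name
cubes_path = "data/test_table/key=2/part-00000.parquet"    # hypothetical file name

squares_columns = {"single", "double"}
cubes_columns = {"single", "triple"}

# Schema merging: union of the data columns, then the partition column(s).
merged_columns = sorted(squares_columns | cubes_columns) + sorted(partition_values(squares_path))

print(partition_values(squares_path))
print(partition_values(cubes_path))
print(merged_columns)
```

This is why both partition directories need to live under the same table directory: the reader is pointed at the parent and discovers the partitions from the path names.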



In [None]:

# .catalog.listTables()


In [None]:


# # raw_data_RDD = sc.textFile("e://README_spark.md")  

# lines = sc.read.text("e://README_spark.md").rdd.map(lambda r: r[0])

# sortedCount = lines.flatMap(lambda x: x.split(' ')).map(lambda x: (int(x), 1)).sortByKey()

# # This is just a demo on how to bring all the sorted data back to a single node.

# # In reality, we wouldn't want to collect all the data to the driver node.

# output = sortedCount.collect()

# for (num, unitcount) in output:
#         print(num)

    

In [None]:

# scala 

# spark: SparkSession = // create the Spark Session
# val df = spark.read.csv("file.txt")



<br>

### Z.

In [None]:

# $example on:init_session$
from pyspark.sql import SparkSession
# $example off:init_session$

# $example on:schema_inferring$
from pyspark.sql import Row
# $example off:schema_inferring$

# $example on:programmatic_schema$
# Import data types
from pyspark.sql.types import *
# $example off:programmatic_schema$

# READ IN JSON FILE ! 
df = sqlContext.read.json("C:/SPARK/examples/src/main/resources/people.json")


In [None]:
print(type(df))

In [None]:
for i in dir(df):
    if not i.startswith("_"):
        print(i)
        
# agg
# alias
# approxQuantile
# cache
# checkpoint
# coalesce
# colRegex
# collect
# columns
# corr
# count
# cov
# createGlobalTempView
# createOrReplaceGlobalTempView
# createOrReplaceTempView
# createTempView
# crossJoin
# crosstab
# cube
# describe
# distinct
# drop
# dropDuplicates
# drop_duplicates
# dropna
# dtypes
# exceptAll
# explain
# fillna
# filter
# first
# foreach
# foreachPartition
# freqItems
# groupBy
# groupby
# head
# hint
# intersect
# intersectAll
# isLocal
# isStreaming
# is_cached
# join
# limit
# localCheckpoint
# na
# orderBy
# persist
# printSchema
# randomSplit
# rdd
# registerTempTable
# repartition
# repartitionByRange
# replace
# rollup
# sample
# sampleBy
# schema
# select
# selectExpr
# show
# sort
# sortWithinPartitions
# sql_ctx
# stat
# storageLevel
# subtract
# summary
# take
# toDF
# toJSON
# toLocalIterator
# toPandas
# union
# unionAll
# unionByName
# unpersist
# where
# withColumn
# withColumnRenamed
# withWatermark
# write
# writeStream

In [None]:

#  people.json

# {"name":"Michael"}
# {"name":"Andy", "age":30}
# {"name":"Justin", "age":19}

# Displays the content of the DataFrame to stdout
df.show()


In [None]:

# Print the schema in a tree format
df.printSchema()    


In [None]:

# Select only the "name" column
df.select("name").show()  
    

In [None]:

# Select everybody, but increment the age by 1
df.select(df['name'], df['age'] + 1).show()
    

In [None]:

# Select people older than 21

df.filter(df['age'] > 21).show()
    

In [None]:

# Count people by age
df.groupBy("age").count().show()
    

In [None]:

# $example on:run_sql$
# Register the DataFrame as a SQL temporary view

df.createOrReplaceTempView("people")

sqlDF = sqlContext.sql("SELECT * FROM people")
sqlDF.show()  
     

In [None]:

# $example on:global_temp_view$
# Register the DataFrame as a global temporary view.
# createGlobalTempView raises AnalysisException ("Temporary view 'people' already exists")
# when this cell is re-run in the same Spark application, so use the replacing variant.
df.createOrReplaceGlobalTempView("people")

# Global temporary view is tied to a system preserved database `global_temp`
sqlContext.sql("SELECT * FROM global_temp.people").show()

In [None]:

# Global temporary view is cross-session

sqlContext.newSession().sql("SELECT * FROM global_temp.people").show()


## datasource.py

> PARQUET

In [None]:

from pyspark.sql import SparkSession
# $example on:schema_merging$
from pyspark.sql import Row
# $example off:schema_merging$

# $example on:generic_load_save_functions$
df = sqlContext.read.load("C:/SPARK/examples/src/main/resources/users.parquet")
# df.select("name", "favorite_color").write.save("namesAndFavColors.parquet")
# $example off:generic_load_save_functions$
    

In [None]:
print(type(df))

In [None]:
df

In [None]:
df.count()

In [None]:
# < i n s r t   -   image as you do it >

In [None]:
df.show()

Parquet is a columnar format supported by many data processing systems. Columnar storage has the following advantages:

Columnar storage limits I/O, because a query reads only the columns it touches.

Columnar storage can fetch just the specific columns that you need to access.

Columnar storage consumes less space, because values of one type stored together compress well.

Columnar storage gives better-summarized data and allows type-specific encoding.

Spark SQL supports both reading and writing Parquet files, and automatically preserves the schema of the original data. Parquet files are loaded with the same procedure as the JSON dataset.
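
To make the row-versus-column point concrete, here is a minimal pure-Python sketch. It is not Parquet itself and says nothing about how Spark stores data internally; the three records and the names `rows`, `to_columnar`, and `columns` are made up for illustration.

```python
def to_columnar(records):
    """Turn a list of row dicts into a dict of column lists.

    Assumes every record has the same keys (illustrative simplification).
    """
    columns = {}
    for record in records:
        for key, value in record.items():
            columns.setdefault(key, []).append(value)
    return columns


# Hypothetical records; these are not the contents of users.parquet.
rows = [
    {"id": 1, "name": "Ann", "salary": 50000.0},
    {"id": 2, "name": "Bob", "salary": 62000.0},
    {"id": 3, "name": "Cy", "salary": 58000.0},
]
columns = to_columnar(rows)

# Row layout: every record has to be visited to read a single field.
salaries_from_rows = [record["salary"] for record in rows]

# Column layout: the field is already stored as one list.
salaries_from_columns = columns["salary"]
print(salaries_from_columns)
```

Reading one field from the columnar layout touches a single list, which is the intuition behind "fetch specific columns" above.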

Let’s take another look at the same example of employee record data, named employee.parquet and placed in the same directory where spark-shell is running.

Given data: there is no need to convert the employee records into Parquet format by hand. The following commands convert the RDD data into a Parquet file. Place the employee.json document, which we have used as the input file in our previous examples.

In [13]:

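# $example on:generic_load_save_functions$
# sqlContext must already exist in this kernel; the NameError below means the
# cell that creates it was not re-run after a kernel restart.
df2 = sqlContext.read.load("E:/userdata1.parquet")


NameError: name 'sqlContext' is not defined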

In [None]:
df2.show()

```
userdata[1-5].parquet: These are sample files containing data in PARQUET format.

-> Number of rows in each file: 1000
-> Column details:
column#		column_name		hive_datatype
=====================================================
1		registration_dttm 	timestamp
2		id 			int
3		first_name 		string
4		last_name 		string
5		email 			string
6		gender 			string
7		ip_address 		string
8		cc 			string
9		country 		string
10		birthdate 		string
11		salary 			double
12		title 			string
13		comments 		string
```

In [None]:

# sqlDF = sqlContext.sql("SELECT * FROM people")

# show() prints the rows and returns None, so keep the DataFrame and display it separately
id_column = df2.select("id")
id_column.show()


In [None]:

# same pattern: keep the DataFrame, then display it
first_name = df2.select("first_name")
first_name.show()


<br>
<br>
<br>
<br>

# *Reading VCF files in SparkSQL / Python*

> I don't know how else to say this: it will look like gibberish until you dive into what the terms mean, and even then it takes a while...

> Deep Dive into processing .vcf files 

> The output we are dealing with is effectively a dict, so even if you don't understand the gene terminology, it's just a matter of querying the file for certain 'keys'

# READ:  for the love of all that is holy, don't save this to your laptop and associate it with a vCard contact file for Outlook or something stupid like that

scikit-allel is a Python package intended to enable exploratory analysis of large-scale genetic variation data.

Variant Call Format (VCF) is a text file format for storing marker and genotype data.

###  not like you necessarily care, but this is about AGTCs, i.e. bases Adenine, Guanine, Thymine, and Cytosine

![title](https://www.genome.gov/sites/default/files/tg/en/illustration/acgt.jpg)


In [14]:

# import scikit-allel, a scientific Python package for genetic variation data
import allel


In [15]:

print(allel.__version__)


1.1.10


In [16]:
### sample - vcf file below: 

In [17]:
#

```
THIS IS A RANDOM SAMPLE.VCF EXAMPLE: 
    
 -  the file begins with meta-info lines with ## 
    

##fileformat=VCFv4.0
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=1000GenomesPilot-NCBI36
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=AC,Number=.,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
##ALT=<ID=DEL:ME:ALU,Description="Deletion of ALU element">
##ALT=<ID=CNV,Description="Copy number variable region">       < - - im inserting a space for clarity 


#CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	NA00001	NA00002	NA00003

19	111	.	A	C	9.6	.	.	GT:HQ	0|0:10,10	0|0:10,10	0/1:3,3
19	112	.	A	G	10	.	.	GT:HQ	0|0:10,10	0|0:10,10	0/1:3,3
20	14370	rs6054257	G	A	29	PASS	NS=3;DP=14;AF=0.5;DB;H2	GT:GQ:DP:HQ	0|0:48:1:51,51	1|0:48:8:51,51	1/1:43:5:.,.
20	17330	.	T	A	3	q10	NS=3;DP=11;AF=0.017	GT:GQ:DP:HQ	0|0:49:3:58,50	0|1:3:5:65,3	0/0:41:3:.,.
20	1110696	rs6040355	A	G,T	67	PASS	NS=2;DP=10;AF=0.333,0.667;AA=T;DB	GT:GQ:DP:HQ	1|2:21:6:23,27	2|1:2:0:18,2	2/2:35:4:.,.
20	1230237	.	T	.	47	PASS	NS=3;DP=13;AA=T	GT:GQ:DP:HQ	0|0:54:.:56,60	0|0:48:4:51,51	0/0:61:2:.,.
20	1234567	microsat1	G	GA,GAC	50	PASS	NS=3;DP=9;AA=G;AN=6;AC=3,1	GT:GQ:DP	0/1:.:4	0/2:17:2	1/1:40:3
20	1235237	.	T	.	.	.	.	GT	0/0	0|0	./.
X	10	rsTest	AC	A,ATG	10	PASS	.	GT	0	0/1	0|2



FYI, this is the header:
#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT    NA00001    NA00002    NA00003


So what is this ? 
After the header, there are DATA LINES, with each data line describing a genetic variant at a particular 
position relative to the reference genome of whichever species you are studying. 

In my case:
CHROM	
POS	
ID	
REF	
ALT	
QUAL
FILTER
INFO
FORMAT
NA00001
NA00002
NA00003

Data lines contain marker and genotype data (one variant per line). A data line is called a VCF record.

My first line describes a variant on chromosome **19** at position **111** relative to the reference named in the header
(1000GenomesPilot-NCBI36, an assembly of the human genome). The reference allele is ‘A’ and the alternate allele is ‘C’, so this is a single-base A-to-C substitution (a SNP).
```
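
The column layout described above can be sketched with the standard library alone. This is a simplified illustration, not how scikit-allel parses files: the helper names `parse_info` and `parse_data_line` are made up here, and the sketch assumes a tab-delimited line that includes the FORMAT and sample columns.

```python
FIXED_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT"]


def parse_info(info_text):
    """Turn 'NS=3;DP=14;DB' into {'NS': '3', 'DP': '14', 'DB': True}."""
    if info_text == ".":
        return {}
    info = {}
    for entry in info_text.split(";"):
        key, sep, value = entry.partition("=")
        info[key] = value if sep else True  # flags such as DB carry no value
    return info


def parse_data_line(line, sample_names):
    """Split one tab-delimited VCF data line into a dict of fields."""
    parts = line.rstrip("\n").split("\t")
    record = dict(zip(FIXED_COLUMNS, parts[:9]))
    record["POS"] = int(record["POS"])
    record["ALT"] = [] if record["ALT"] == "." else record["ALT"].split(",")
    record["INFO"] = parse_info(record["INFO"])
    # FORMAT names the subfields; each sample column holds the matching values.
    format_keys = record["FORMAT"].split(":")
    record["samples"] = {
        name: dict(zip(format_keys, call.split(":")))
        for name, call in zip(sample_names, parts[9:])
    }
    return record


# The third data line of the sample file shown above.
line = (
    "20\t14370\trs6054257\tG\tA\t29\tPASS\t"
    "NS=3;DP=14;AF=0.5;DB;H2\tGT:GQ:DP:HQ\t"
    "0|0:48:1:51,51\t1|0:48:8:51,51\t1/1:43:5:.,."
)
record = parse_data_line(line, ["NA00001", "NA00002", "NA00003"])
print(record["CHROM"], record["POS"], record["REF"], record["ALT"])
print(record["INFO"])
print(record["samples"]["NA00002"])
```

All values other than POS stay as strings here; a real reader would convert them using the types declared in the ##INFO and ##FORMAT header lines.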


```

CHROM	
the chromosome.


POS	
the genome coordinate of the first base in the variant. Within a chromosome, VCF records are sorted in order of increasing position.

ID	
a semicolon-separated list of marker identifiers.

REF	
the reference allele expressed as a sequence of one or more A/C/G/T nucleotides (e.g. "A" or "AAC")

ALT	
the alternate allele expressed as a sequence of one or more A/C/G/T nucleotides (e.g. "A" or "AAC"). If there is more than one alternate allele, the field is a comma-separated list of alternate alleles.

QUAL	
probability that the ALT allele is incorrectly specified, expressed on the phred scale (-10 * log10(probability)).

FILTER	
Either "PASS" or a semicolon-separated list of failed quality control filters.

INFO	
additional information as a semicolon-separated list of KEY=value entries and bare flags, e.g. NS=3;DP=14;AF=0.5;DB;H2 (no white space permitted).

FORMAT	
colon-separated list of data subfields reported for each sample. The subfields used in the example (GT, GQ, DP, HQ) are described by its ##FORMAT meta-info lines.
```
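
The phred scale used by QUAL is easier to read once converted back to a probability. A small sketch of the formula quoted above, QUAL = -10 * log10(probability); the function names are made up for this example.

```python
import math


def phred_to_error_probability(qual):
    """Invert QUAL = -10 * log10(p), giving p = 10 ** (-QUAL / 10)."""
    return 10 ** (-qual / 10)


def error_probability_to_phred(probability):
    """Apply QUAL = -10 * log10(p) to an error probability."""
    return -10 * math.log10(probability)


# A few QUAL values taken from the sample file.
for qual in (3, 10, 29, 67):
    print(qual, phred_to_error_probability(qual))
```

So a QUAL of 10 means a 1-in-10 chance the call is wrong, and a QUAL of 30 means 1-in-1000, which is why the sample's `q10` filter flags anything below 10.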

<br>

In [18]:

# very handy way of printing out the actual .vcf file as a way of getting an idea how this stuff works

with open('C:/SPARK/sample.vcf', mode='r') as vcf:
    print(vcf.read())
    
# prints out the vcf sample file

##fileformat=VCFv4.0
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=1000GenomesPilot-NCBI36
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=AC,Number=.,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Geno

###  *if you are really good you know this is actually the example .vcf from the VCF specification, where the file's columns are described, so you can download the spec and use this verbatim!* 

In [19]:

# to understand how powerful this library is, look at its .methods available for you:

for i in dir(allel): 
    if not i.startswith("_"):
        print(i)
        
# ever heard of the massive library scikit-learn ?  
# think genomic version of that ! 


ANNTransformer
ANN_AA_LENGTH_FIELD
ANN_AA_POS_FIELD
ANN_ANNOTATION_FIELD
ANN_ANNOTATION_IMPACT_FIELD
ANN_CDNA_LENGTH_FIELD
ANN_CDNA_POS_FIELD
ANN_CDS_LENGTH_FIELD
ANN_CDS_POS_FIELD
ANN_DISTANCE_FIELD
ANN_FEATURE_ID_FIELD
ANN_FEATURE_TYPE_FIELD
ANN_FIELD
ANN_FIELDS
ANN_GENE_ID_FIELD
ANN_GENE_NAME_FIELD
ANN_HGVS_C_FIELD
ANN_HGVS_P_FIELD
ANN_RANK_FIELD
ANN_TRANSCRIPT_BIOTYPE_FIELD
AlleleCountsArray
AlleleCountsChunkedArray
AlleleCountsChunkedTable
AlleleCountsDaskArray
CenterScaler
DEFAULT_ALT_NUMBER
DEFAULT_BUFFER_SIZE
DEFAULT_CHUNK_LENGTH
DEFAULT_CHUNK_WIDTH
FIXED_VARIANTS_FIELDS
FeatureTable
FileInputStream
FileNotFoundError
GenotypeAlleleCounts
GenotypeAlleleCountsArray
GenotypeAlleleCountsChunkedArray
GenotypeAlleleCountsDaskArray
GenotypeAlleleCountsDaskVector
GenotypeAlleleCountsVector
GenotypeArray
GenotypeChunkedArray
GenotypeDaskArray
GenotypeDaskVector
GenotypeVector
Genotypes
HaplotypeArray
HaplotypeChunkedArray
HaplotypeDaskArray
INHERIT_MISSING
INHERIT_NONPARENTAL
INHERIT_NO


We will focus on extracting data from Variant Call Format (VCF) files and loading into NumPy arrays, 
pandas data frames, HDF5 files or Zarr arrays for ease of analysis. 


The key is to focus on extracting the necessary data from the VCF file and loading it into a more efficient storage container

In [20]:

callset = allel.read_vcf('C:/SPARK/sample.vcf')


In [21]:

print(type(callset))   # key/value dict ! 


<class 'dict'>


In [22]:

# Here are my available keys:
sorted(callset.keys())


['calldata/GT',
 'samples',
 'variants/ALT',
 'variants/CHROM',
 'variants/FILTER_PASS',
 'variants/ID',
 'variants/POS',
 'variants/QUAL',
 'variants/REF']

In [23]:
# All arrays with keys beginning ‘variants/’ come from the fixed fields in the VCF file. 

In [24]:

'read_vcf' in dir(allel)


True

In [25]:

'read_vcf_headers' in dir(allel)


False

In [26]:

print(callset)


{'samples': array(['NA00001', 'NA00002', 'NA00003'], dtype=object), 'variants/ID': array(['.', '.', 'rs6054257', '.', 'rs6040355', '.', 'microsat1', '.',
       'rsTest'], dtype=object), 'variants/ALT': array([['C', '', ''],
       ['G', '', ''],
       ['A', '', ''],
       ['A', '', ''],
       ['G', 'T', ''],
       ['', '', ''],
       ['GA', 'GAC', ''],
       ['', '', ''],
       ['A', 'ATG', '']], dtype=object), 'variants/POS': array([    111,     112,   14370,   17330, 1110696, 1230237, 1234567,
       1235237,      10]), 'variants/CHROM': array(['19', '19', '20', '20', '20', '20', '20', '20', 'X'], dtype=object), 'variants/FILTER_PASS': array([False, False,  True, False,  True,  True,  True, False,  True]), 'variants/REF': array(['A', 'A', 'G', 'T', 'A', 'T', 'G', 'T', 'AC'], dtype=object), 'calldata/GT': array([[[ 0,  0],
        [ 0,  0],
        [ 0,  1]],

       [[ 0,  0],
        [ 0,  0],
        [ 0,  1]],

       [[ 0,  0],
        [ 1,  0],
        [ 1,  1]],

      

In [27]:

# Commands (methods) you can use:

for i in dir(callset):
    if not i.startswith("_"):
        print(i)

# methods you can call, as a reference...


clear
copy
fromkeys
get
items
keys
pop
popitem
setdefault
update
values


In [28]:

for i in callset.values():  print("\n",i)



 ['NA00001' 'NA00002' 'NA00003']

 ['.' '.' 'rs6054257' '.' 'rs6040355' '.' 'microsat1' '.' 'rsTest']

 [['C' '' '']
 ['G' '' '']
 ['A' '' '']
 ['A' '' '']
 ['G' 'T' '']
 ['' '' '']
 ['GA' 'GAC' '']
 ['' '' '']
 ['A' 'ATG' '']]

 [    111     112   14370   17330 1110696 1230237 1234567 1235237      10]

 ['19' '19' '20' '20' '20' '20' '20' '20' 'X']

 [False False  True False  True  True  True False  True]

 ['A' 'A' 'G' 'T' 'A' 'T' 'G' 'T' 'AC']

 [[[ 0  0]
  [ 0  0]
  [ 0  1]]

 [[ 0  0]
  [ 0  0]
  [ 0  1]]

 [[ 0  0]
  [ 1  0]
  [ 1  1]]

 [[ 0  0]
  [ 0  1]
  [ 0  0]]

 [[ 1  2]
  [ 2  1]
  [ 2  2]]

 [[ 0  0]
  [ 0  0]
  [ 0  0]]

 [[ 0  1]
  [ 0  2]
  [ 1  1]]

 [[ 0  0]
  [ 0  0]
  [-1 -1]]

 [[ 0 -1]
  [ 0  1]
  [ 0  2]]]

 [ 9.6 10.  29.   3.  67.  47.  50.   nan 10. ]


The callset object returned by read_vcf() is a Python dictionary (dict). It contains several NumPy arrays, each of which can be accessed via a key. Here are the available keys:

In [29]:


# keys:
sorted(callset.keys())


['calldata/GT',
 'samples',
 'variants/ALT',
 'variants/CHROM',
 'variants/FILTER_PASS',
 'variants/ID',
 'variants/POS',
 'variants/QUAL',
 'variants/REF']

In [30]:

for i in sorted(callset.keys()):  print("-",i)
    

- calldata/GT
- samples
- variants/ALT
- variants/CHROM
- variants/FILTER_PASS
- variants/ID
- variants/POS
- variants/QUAL
- variants/REF


In [31]:

# The ‘samples’ array contains sample identifiers extracted from the header line in the VCF file.

callset['samples']

# look to the far right:
# #CHROM	POS	ID	REF	ALT	QUAL	FILTER	INFO	FORMAT	NA00001	NA00002	NA00003


array(['NA00001', 'NA00002', 'NA00003'], dtype=object)

In [32]:

# All arrays with keys beginning ‘variants/’ come from the fixed fields in the VCF file.
# For example, here is the data from the ‘CHROM’ field:

callset['variants/CHROM']

# chromosomes
    

array(['19', '19', '20', '20', '20', '20', '20', '20', 'X'], dtype=object)

In [33]:

# Here is the data from the ‘POS’ field:

callset['variants/POS']


array([    111,     112,   14370,   17330, 1110696, 1230237, 1234567,
       1235237,      10])

In [34]:

# Here is the data from the ‘QUAL’ field:

callset['variants/QUAL']


array([ 9.6, 10. , 29. ,  3. , 67. , 47. , 50. ,  nan, 10. ],
      dtype=float32)

In [35]:

# All arrays with keys beginning ‘calldata/’ come from the sample fields in the VCF file. 
# For example, here are the actual genotype calls from the ‘GT’ field:

callset['calldata/GT']

# Note the -1 values in the last two variants. By default scikit-allel uses 
# -1 to indicate a missing value for any array with a signed integer data type 
# (although you can change this if you want).


array([[[ 0,  0],
        [ 0,  0],
        [ 0,  1]],

       [[ 0,  0],
        [ 0,  0],
        [ 0,  1]],

       [[ 0,  0],
        [ 1,  0],
        [ 1,  1]],

       [[ 0,  0],
        [ 0,  1],
        [ 0,  0]],

       [[ 1,  2],
        [ 2,  1],
        [ 2,  2]],

       [[ 0,  0],
        [ 0,  0],
        [ 0,  0]],

       [[ 0,  1],
        [ 0,  2],
        [ 1,  1]],

       [[ 0,  0],
        [ 0,  0],
        [-1, -1]],

       [[ 0, -1],
        [ 0,  1],
        [ 0,  2]]], dtype=int8)
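
To see where those integers come from, here is a hedged pure-Python sketch of the convention: each GT string is split on `/` or `|`, a `.` becomes -1, and short calls are padded with -1 up to the expected ploidy. This is not scikit-allel's own parser; `parse_gt` and its parameters are names invented for this example.

```python
import re


def parse_gt(gt_text, ploidy=2, missing=-1):
    """Convert a GT string such as '0|1' or './.' into a list of allele indices."""
    alleles = []
    for token in re.split(r"[/|]", gt_text):
        alleles.append(missing if token == "." else int(token))
    # Pad short calls (e.g. the haploid '0' on chromosome X) with the missing value.
    alleles += [missing] * (ploidy - len(alleles))
    return alleles[:ploidy]


# The three calls on the last data line of the sample file: 0, 0/1, 0|2
print([parse_gt(text) for text in ("0", "0/1", "0|2")])
```

The haploid call `0` on chromosome X is what produces the `[0, -1]` entry in the last variant above.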

In [36]:

# Aside: genotype arrays
# Because working with genotype calls is a very common task, scikit-allel has 
# a GenotypeArray class which adds some convenient functionality to an array 
# of genotype calls. To use this class, pass the raw NumPy array into the GenotypeArray 
# class constructor, e.g.:


gt = allel.GenotypeArray(callset['calldata/GT'])

gt


| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 0/0 | 0/0 | 0/1 |
| 1 | 0/0 | 0/0 | 0/1 |
| 2 | 0/0 | 1/0 | 1/1 |
| ... | ... | ... | ... |
| 6 | 0/1 | 0/2 | 1/1 |
| 7 | 0/0 | 0/0 | ./. |
| 8 | 0/. | 0/1 | 0/2 |


One of the things that the GenotypeArray class does is provide a slightly more visually-appealing representation when used in a Jupyter notebook, as can be seen above. There are also methods for making various computations over the genotype calls. For example, the is_het() method locates all heterozygous genotype calls:

In [37]:
gt.is_het()

array([[False, False,  True],
       [False, False,  True],
       [False,  True, False],
       [False,  True, False],
       [ True,  True, False],
       [False, False, False],
       [ True,  True, False],
       [False, False, False],
       [False,  True,  True]])

In [38]:

# To give another example, the count_het() method will count heterozygous calls, summing over 
# variants (axis=0) or samples (axis=1) if requested.
# E.g., to count the number of het calls per variant:
    
gt.count_het(axis=1)


array([1, 1, 1, 1, 2, 0, 2, 0, 2])
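
What `is_het()` and `count_het()` compute can be reproduced in plain Python. The sketch below treats a call as heterozygous when every allele is called and at least two alleles differ, which matches the outputs shown above; it is an illustration of the idea, not scikit-allel's implementation.

```python
# Genotype calls copied from the callset['calldata/GT'] output above:
# 9 variants x 3 samples x 2 alleles.
GT = [
    [[0, 0], [0, 0], [0, 1]],
    [[0, 0], [0, 0], [0, 1]],
    [[0, 0], [1, 0], [1, 1]],
    [[0, 0], [0, 1], [0, 0]],
    [[1, 2], [2, 1], [2, 2]],
    [[0, 0], [0, 0], [0, 0]],
    [[0, 1], [0, 2], [1, 1]],
    [[0, 0], [0, 0], [-1, -1]],
    [[0, -1], [0, 1], [0, 2]],
]


def is_het(call):
    """True when every allele is called and at least two of them differ."""
    return all(allele >= 0 for allele in call) and len(set(call)) > 1


het = [[is_het(call) for call in variant] for variant in GT]

# Summing across samples gives one count per variant (like axis=1).
het_per_variant = [sum(row) for row in het]
# Summing across variants gives one count per sample (like axis=0).
het_per_sample = [sum(column) for column in zip(*het)]

print(het_per_variant)
print(het_per_sample)
```

Note that the partially missing call `[0, -1]` in the last variant is not counted as heterozygous.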

In [39]:

# One more example, here is how to perform an allele count,
# i.e., count the number of times each allele (0=reference, 1=first alternate, 
# 2=second alternate, etc.) is observed for each variant:
    
ac = gt.count_alleles()
ac 
    

| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 5 | 1 | 0 |
| 1 | 5 | 1 | 0 |
| 2 | 3 | 3 | 0 |
| ... | ... | ... | ... |
| 6 | 2 | 3 | 1 |
| 7 | 4 | 0 | 0 |
| 8 | 3 | 1 | 1 |
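
The allele count is simple enough to check by hand. A minimal sketch, assuming allele indices never exceed `n_alleles - 1`; `count_alleles` here is a stand-alone helper written for this example, not the library method.

```python
def count_alleles(variant_calls, n_alleles=3):
    """Count how often each allele index appears among one variant's calls."""
    counts = [0] * n_alleles
    for call in variant_calls:
        for allele in call:
            if allele >= 0:  # skip missing alleles (-1)
                counts[allele] += 1
    return counts


# The last three variants of the sample file (rows 6, 7 and 8 above).
last_three = [
    [[0, 1], [0, 2], [1, 1]],
    [[0, 0], [0, 0], [-1, -1]],
    [[0, -1], [0, 1], [0, 2]],
]
allele_counts = [count_alleles(variant) for variant in last_three]
print(allele_counts)
```

The results line up with rows 6, 7 and 8 of the table above, and missing alleles simply do not contribute to any column.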


Fields: 
    
VCF files can often contain many fields of data, and you may only need to extract 
some of them to perform a particular analysis. You can select which fields to extract by passing a list of strings as the fields parameter. For example, let’s extract the ‘DP’ field from within the ‘INFO’ field, and let’s also extract the ‘DP’ field from the genotype call data:
    
    

In [40]:

callset = allel.read_vcf('C:/SPARK/sample.vcf', fields=['variants/DP', 'calldata/DP'])
sorted(callset.keys())


['calldata/DP', 'variants/DP']

In [41]:

# here is the data we just extracted:
callset['variants/DP']



array([-1, -1, 14, 11, 10, 13,  9, -1, -1])

In [42]:

callset['calldata/DP']


array([[-1, -1, -1],
       [-1, -1, -1],
       [ 1,  8,  5],
       [ 3,  5,  3],
       [ 6,  0,  4],
       [-1,  4,  2],
       [ 4,  2,  3],
       [-1, -1, -1],
       [-1, -1, -1]], dtype=int16)


I chose these two fields to illustrate the point that sometimes the same field name (e.g., ‘DP’) can be used both within the INFO field of a VCF and also within the genotype call data. When selecting fields, to make sure there is no ambiguity, you can include a prefix which is either ‘variants/’ or ‘calldata/’. For example, if you provide ‘variants/DP’, then the read_vcf() function will look for an INFO field named ‘DP’. If you provide ‘calldata/DP’ then read_vcf() will look for a FORMAT field named ‘DP’ within the call data.

If you are feeling lazy, you can drop the ‘variants/’ and ‘calldata/’ prefixes, in which case read_vcf() will assume you mean ‘variants/’ if there is any ambiguity. E.g.:
    

In [43]:

callset = allel.read_vcf('C:/SPARK/sample.vcf', fields=['DP', 'GT'])
sorted(callset.keys())


['calldata/GT', 'variants/DP']

If you want to extract absolutely everything from a VCF file, then you can provide a special value '*' as the fields parameter:

In [44]:

callset = allel.read_vcf('C:/SPARK/sample.vcf', fields='*')
sorted(callset.keys())


['calldata/DP',
 'calldata/GQ',
 'calldata/GT',
 'calldata/HQ',
 'samples',
 'variants/AA',
 'variants/AC',
 'variants/AF',
 'variants/ALT',
 'variants/AN',
 'variants/CHROM',
 'variants/DB',
 'variants/DP',
 'variants/FILTER_PASS',
 'variants/FILTER_q10',
 'variants/FILTER_s50',
 'variants/H2',
 'variants/ID',
 'variants/NS',
 'variants/POS',
 'variants/QUAL',
 'variants/REF',
 'variants/is_snp',
 'variants/numalt',
 'variants/svlen']

You can also provide the special value 'variants/*' to request all variants fields (including all INFO), and the special value 'calldata/*' to request all call data fields.

If you don’t specify the fields parameter, scikit-allel will default to extracting data from the fixed variants fields (but no INFO) and the GT genotype field if present (but no other call data).

### Types

NumPy arrays can have various data types, including signed integers (‘int8’, ‘int16’, ‘int32’, ‘int64’), unsigned integers (‘uint8’, ‘uint16’, ‘uint32’, ‘uint64’), floating point numbers (‘float32’, ‘float64’), variable length strings (‘object’) and fixed length strings (e.g., ‘S4’ for a 4-character ASCII string). scikit-allel will try to choose a sensible default data type for the fields you want to extract, based on the meta-information in the VCF file, but you can override these via the types parameter.

For example, by default the ‘DP’ INFO field is loaded into a 32-bit integer array:

In [45]:

callset = allel.read_vcf('C:/SPARK/sample.vcf', fields=['DP'])
callset['variants/DP']


array([-1, -1, 14, 11, 10, 13,  9, -1, -1])

For fields containing textual data, there are two choices for data type. By default, scikit-allel will use an ‘object’ data type, which means that values are stored as an array of Python strings. E.g.:

### Bam !

In [46]:

callset = allel.read_vcf('C:/SPARK/sample.vcf')
callset['variants/REF']


array(['A', 'A', 'G', 'T', 'A', 'T', 'G', 'T', 'AC'], dtype=object)

In [47]:

# The advantage of using ‘object’ dtype is that strings can be of any length. 
# Alternatively, you can use a fixed-length string dtype, e.g.:

callset = allel.read_vcf('C:/SPARK/sample.vcf', types={'REF': 'S3'})
callset['variants/REF']
    

array([b'A', b'A', b'G', b'T', b'A', b'T', b'G', b'T', b'AC'], dtype='|S3')

### Numbers


Some fields like ‘ALT’, ‘AC’ and ‘AF’ can have a variable number of values. I.e., each variant may have a different number of data values for these fields. One trade-off you have to make when loading data into NumPy arrays is that you cannot have arrays with a variable number of items per row. Rather, you have to fix the maximum number of possible items. While you lose some flexibility, you gain speed of access.

For fields like ‘ALT’, scikit-allel will choose a default number of expected values, which is set at 3. E.g., here is what you get by default:
    

In [48]:

callset = allel.read_vcf('C:/SPARK/sample.vcf')
callset['variants/ALT']


array([['C', '', ''],
       ['G', '', ''],
       ['A', '', ''],
       ['A', '', ''],
       ['G', 'T', ''],
       ['', '', ''],
       ['GA', 'GAC', ''],
       ['', '', ''],
       ['A', 'ATG', '']], dtype=object)

In this case, 3 is more than we need, because no variant has more than 2 ALT values. However, some VCF files (especially those including INDELs) may have more than 3 ALT values.

If you need to increase or decrease the expected number of values for any field, you can do this via the numbers parameter. E.g., increase the number of ALT values to 5:

In [49]:

callset = allel.read_vcf('C:/SPARK/sample.vcf', numbers={'ALT': 5})
callset['variants/ALT']


array([['C', '', '', '', ''],
       ['G', '', '', '', ''],
       ['A', '', '', '', ''],
       ['A', '', '', '', ''],
       ['G', 'T', '', '', ''],
       ['', '', '', '', ''],
       ['GA', 'GAC', '', '', ''],
       ['', '', '', '', ''],
       ['A', 'ATG', '', '', '']], dtype=object)
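
The fixed-width trade-off can be illustrated without NumPy. The sketch below pads each ALT value with empty strings up to a chosen width and, in this sketch, drops anything beyond it; `pad_alt` and its parameters are hypothetical names, not part of scikit-allel.

```python
def pad_alt(alt_text, number=3, fill=""):
    """Split a VCF ALT value and force it to exactly `number` entries."""
    alleles = [] if alt_text == "." else alt_text.split(",")
    truncated = alleles[:number]  # alleles beyond `number` are dropped
    return truncated + [fill] * (number - len(truncated))


# A few ALT values from the sample file.
alt_column = ["C", "G,T", ".", "GA,GAC"]
print([pad_alt(value) for value in alt_column])

# Widening to 5 columns only adds padding.
print(pad_alt("G,T", number=5))

# Hypothetical value with 4 alternates: the fourth is lost at width 3.
print(pad_alt("A,C,G,T", number=3))
```

Every row ends up the same length, which is what lets the values sit in a rectangular array, at the cost of wasted cells or lost alleles when the width is chosen badly.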

### Genotype ploidy


By default, scikit-allel assumes you are working with a diploid organism, and so expects to parse out 2 alleles for each genotype call. If you are working with an organism with some other ploidy, you can change the expected number of alleles via the numbers parameter.

For example, here is an example VCF with tetraploid genotype calls:

In [50]:
#<>

### Region

You can extract data for only a specific chromosome or genome region via the region parameter. The value of the parameter should be a region string of the format ‘{chromosome}:{begin}-{end}’, just like you would give to tabix or samtools. E.g.:

In [51]:

callset = allel.read_vcf('C:/SPARK/sample.vcf', region='20:1000000-1231000')
callset['variants/POS']


array([1110696, 1230237])
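
The region string is just a chromosome plus an inclusive position range. A small sketch of how such a filter behaves, using the CHROM and POS values of the sample file; `parse_region` and `in_region` are made-up helpers, and they only handle the full `chromosome:begin-end` form.

```python
def parse_region(region):
    """Split 'chrom:begin-end' into (chrom, begin, end) with integer bounds."""
    chrom, _, span = region.partition(":")
    begin, _, end = span.partition("-")
    return chrom, int(begin), int(end)


def in_region(chrom, pos, region):
    """True when the variant is on the chromosome and within begin..end inclusive."""
    region_chrom, begin, end = parse_region(region)
    return chrom == region_chrom and begin <= pos <= end


# CHROM and POS values from the sample file.
chroms = ["19", "19", "20", "20", "20", "20", "20", "20", "X"]
positions = [111, 112, 14370, 17330, 1110696, 1230237, 1234567, 1235237, 10]

selected = [
    pos for chrom, pos in zip(chroms, positions)
    if in_region(chrom, pos, "20:1000000-1231000")
]
print(selected)
```

Only the two chromosome-20 variants between 1,000,000 and 1,231,000 survive, matching the output above.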

### Samples

In [52]:

# You can extract data for only specific samples via the samples parameter. 
# e.g. extract data for samples ‘NA00001’ and ‘NA00003’:

callset = allel.read_vcf('C:/SPARK/sample.vcf', samples=['NA00001', 'NA00003'])
callset['samples']


array(['NA00001', 'NA00003'], dtype=object)

In [53]:

allel.GenotypeArray(callset['calldata/GT'])

# Note that the genotype array now only has two columns, corresponding to the two samples requested.


| | 0 | 1 |
|---|---|---|
| 0 | 0/0 | 0/1 |
| 1 | 0/0 | 0/1 |
| 2 | 0/0 | 1/1 |
| ... | ... | ... |
| 6 | 0/1 | 1/1 |
| 7 | 0/0 | ./. |
| 8 | 0/. | 0/2 |
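
Selecting samples amounts to keeping some of the genotype columns. A rough sketch of that idea in plain Python, with three rows of genotype strings taken from the sample file; `select_samples` is a hypothetical helper, not a scikit-allel function.

```python
def select_samples(header_samples, calls_per_variant, wanted):
    """Keep only the genotype columns whose sample name is in `wanted`."""
    indices = [header_samples.index(name) for name in wanted]
    return [[calls[i] for i in indices] for calls in calls_per_variant]


header_samples = ["NA00001", "NA00002", "NA00003"]

# Variants 0, 2 and 8 of the sample file, one genotype string per sample.
calls = [
    ["0/0", "0/0", "0/1"],
    ["0/0", "1/0", "1/1"],
    ["0/.", "0/1", "0/2"],
]
subset = select_samples(header_samples, calls, ["NA00001", "NA00003"])
print(subset)
```

The middle column (NA00002) is gone from every row, which is why the genotype array above has two columns instead of three.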


## we can also take the .vcf and store it as hdf5

In [54]:
# #  but will store extracted data into an HDF5 file stored on disk

### vcf_to_hdf5()

For large datasets, the vcf_to_hdf5() function is available. This function again takes similar parameters to read_vcf(), but will store extracted data into an HDF5 file stored on disk. The extraction process works through the VCF file in chunks, and so the entire dataset is never loaded entirely into main memory. A bit further below I give worked examples with a large dataset, but for now here is a simple example:

In [55]:

# NOTE:  overwrite=True replaces any data already in the output file; 
# without it, the call errors out if the data already exists
# saving the file directly as hdf5 ! ! ! 

allel.vcf_to_hdf5('C:/SPARK/sample.vcf', 'C:/SPARK/sample2_hdf5.h5', fields='*', overwrite=True)
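
The "works through the file in chunks" idea can be sketched with a generator. This is only an illustration of chunked reading under simple assumptions (header lines start with `#`, one variant per line); it is not how `vcf_to_hdf5()` is implemented, and `iter_chunks` is a name invented for this example.

```python
import io
from itertools import islice


def iter_chunks(lines, chunk_length):
    """Yield lists of at most `chunk_length` data lines, skipping header lines."""
    data_lines = (line for line in lines if not line.startswith("#"))
    while True:
        chunk = list(islice(data_lines, chunk_length))
        if not chunk:
            return
        yield chunk


# A tiny in-memory stand-in for a VCF file, trimmed from the sample above.
fake_vcf = io.StringIO(
    "##fileformat=VCFv4.0\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "19\t111\t.\tA\tC\t9.6\t.\t.\n"
    "19\t112\t.\tA\tG\t10\t.\t.\n"
    "20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3\n"
)

# Only one chunk is held in memory at a time.
chunk_sizes = [len(chunk) for chunk in iter_chunks(fake_vcf, chunk_length=2)]
print(chunk_sizes)
```

With a real file you would pass the open file object instead of `fake_vcf` and write each chunk out before reading the next one.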


In [56]:

# now lets assume i just showed up and wanted to review this information (the hdf5 file):

import h5py  # conda install this 

callset = h5py.File('C:/SPARK/sample2_hdf5.h5', mode='r')

callset

print(callset)


<HDF5 file "sample2_hdf5.h5" (mode r)>

<HDF5 file "sample2_hdf5.h5" (mode r)>


In [57]:

chrom = callset['variants/CHROM']
chrom


<HDF5 dataset "CHROM": shape (9,), type "|O">

In [58]:

pos = callset['variants/POS']
pos


<HDF5 dataset "POS": shape (9,), type "<i4">

In [59]:

gt = callset['calldata/GT']
gt


<HDF5 dataset "GT": shape (9, 3, 2), type "|i1">

In [60]:

# This dataset object is useful because you can then load all or only part of 
# the underlying data into main memory via slicing. e.g.

# load the second and third items (indices 1 and 2) into a NumPy array
chrom[1:3]


array(['19', '20'], dtype=object)

In [61]:

# load genotype calls into memory for the second and third variants, all samples
allel.GenotypeArray(gt[1:3, :])


| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | 0/0 | 0/0 | 0/1 |
| 1 | 0/0 | 1/0 | 1/1 |


# Assume you want all of this in a dataframe ! 

In [62]:

# For some analyses it can be useful to think of the data in a VCF file as a table or data frame, 
# especially if you are only analysing data from the fixed fields and don’t need the genotype 
# calls or any other call data. The vcf_to_dataframe() function extracts data from a VCF and 
# loads into a pandas DataFrame. E.g.:

df = allel.vcf_to_dataframe('C:/SPARK/sample.vcf')

df

# extract my data from the vcf and put into a dataframe ! 




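| | CHROM | POS | ID | REF | ALT_1 | ALT_2 | ALT_3 | QUAL | FILTER_PASS |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 19 | 111 | . | A | C | | | 9.6 | False |
| 1 | 19 | 112 | . | A | G | | | 10.0 | False |
| 2 | 20 | 14370 | rs6054257 | G | A | | | 29.0 | True |
| 3 | 20 | 17330 | . | T | A | | | 3.0 | False |
| 4 | 20 | 1110696 | rs6040355 | A | G | T | | 67.0 | True |
| 5 | 20 | 1230237 | . | T | | | | 47.0 | True |
| 6 | 20 | 1234567 | microsat1 | G | GA | GAC | | 50.0 | True |
| 7 | 20 | 1235237 | . | T | | | | NaN | False |
| 8 | X | 10 | rsTest | AC | A | ATG | | 10.0 | True |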


In [63]:
print(type(df))

<class 'pandas.core.frame.DataFrame'>


In [64]:

print(df)


  CHROM      POS         ID REF ALT_1 ALT_2 ALT_3  QUAL  FILTER_PASS
0    19      111          .   A     C               9.6        False
1    19      112          .   A     G              10.0        False
2    20    14370  rs6054257   G     A              29.0         True
3    20    17330          .   T     A               3.0        False
4    20  1110696  rs6040355   A     G     T        67.0         True
5    20  1230237          .   T                    47.0         True
6    20  1234567  microsat1   G    GA   GAC        50.0         True
7    20  1235237          .   T                     NaN        False
8     X       10     rsTest  AC     A   ATG        10.0         True


In [65]:

# so it's just the original vcf data but put into a dataframe, which is cool:

# #CHROM  POS    ID    REF    ALT    QUAL   

# 19      111         .    A    C    9.6   
# 19      112          .    A    G    10    
# 20      14370        rs6054257    G    A    29    
# 20      17330    .    T    A    3    q10       
# 20      1110696    rs6040355    A    G,T    67   
# 20      1230237    .    T    .    47      
# 20      1234567    microsat1    G    GA,GAC    50   
# 20      1235237    .    T    .    .    .    .   
# X       10    rsTest    AC    A,ATG    10    
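For intuition, here is a rough sketch of how the fixed fields of a VCF record could be turned into DataFrame columns, including the split of `ALT` into `ALT_1`, `ALT_2`, `ALT_3`. This is not scikit-allel's implementation; the function `fixed_fields_to_frame` and the two-record `vcf_text` are assumptions made up for illustration.

```python
import io
import pandas as pd

# Two hypothetical tab-separated VCF records, modelled on the sample file.
vcf_text = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "20\t14370\trs6054257\tG\tA\t29\tPASS\tDP=14\n"
    "20\t1110696\trs6040355\tA\tG,T\t67\tPASS\tDP=10\n"
)

def fixed_fields_to_frame(handle, alt_number=3):
    """Parse CHROM..FILTER from VCF text lines into a pandas DataFrame."""
    rows = []
    for line in handle:
        if line.startswith('#'):
            continue  # skip meta lines and the column header
        chrom, pos, vid, ref, alt, qual, flt = line.rstrip('\n').split('\t')[:7]
        # '.' means no alternate allele; pad or trim to alt_number columns
        alts = [] if alt == '.' else alt.split(',')
        alts = (alts + [''] * alt_number)[:alt_number]
        row = {'CHROM': chrom, 'POS': int(pos), 'ID': vid, 'REF': ref}
        for i, a in enumerate(alts, start=1):
            row['ALT_%d' % i] = a
        row['QUAL'] = float('nan') if qual == '.' else float(qual)
        row['FILTER_PASS'] = (flt == 'PASS')
        rows.append(row)
    return pd.DataFrame(rows)

demo_df = fixed_fields_to_frame(io.StringIO(vcf_text))
print(demo_df)
```

The comma-separated `ALT` value is what ends up spread across the numbered `ALT_n` columns.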


In [66]:

# some values (cols) on the right were missing, but if you want ALL of them:

df = allel.vcf_to_dataframe('C:/SPARK/sample.vcf', fields='*', alt_number=2)
df

# pandas type reflection 


Unnamed: 0,CHROM,POS,ID,REF,ALT_1,ALT_2,QUAL,DP,AA,NS,...,AC_1,AC_2,DB,FILTER_PASS,FILTER_q10,FILTER_s50,numalt,svlen_1,svlen_2,is_snp
0,19,111,.,A,C,,9.6,-1,,-1,...,-1,-1,False,False,False,False,1,0,0,True
1,19,112,.,A,G,,10.0,-1,,-1,...,-1,-1,False,False,False,False,1,0,0,True
2,20,14370,rs6054257,G,A,,29.0,14,,3,...,-1,-1,True,True,False,False,1,0,0,True
3,20,17330,.,T,A,,3.0,11,,3,...,-1,-1,False,False,True,False,1,0,0,True
4,20,1110696,rs6040355,A,G,T,67.0,10,T,2,...,-1,-1,True,True,False,False,2,0,0,True
5,20,1230237,.,T,,,47.0,13,T,3,...,-1,-1,False,True,False,False,0,0,0,False
6,20,1234567,microsat1,G,GA,GAC,50.0,9,G,3,...,3,1,False,True,False,False,2,1,2,False
7,20,1235237,.,T,,,,-1,,-1,...,-1,-1,False,False,False,False,0,0,0,False
8,X,10,rsTest,AC,A,ATG,10.0,-1,,-1,...,-1,-1,False,True,False,False,2,-1,1,False


In [67]:

print(df)

# bam ! 

# i actually think this format is clearer ...


  CHROM      POS         ID REF ALT_1 ALT_2  QUAL  DP AA  NS   ...    AC_1  \
0    19      111          .   A     C         9.6  -1     -1   ...      -1   
1    19      112          .   A     G        10.0  -1     -1   ...      -1   
2    20    14370  rs6054257   G     A        29.0  14      3   ...      -1   
3    20    17330          .   T     A         3.0  11      3   ...      -1   
4    20  1110696  rs6040355   A     G     T  67.0  10  T   2   ...      -1   
5    20  1230237          .   T              47.0  13  T   3   ...      -1   
6    20  1234567  microsat1   G    GA   GAC  50.0   9  G   3   ...       3   
7    20  1235237          .   T               NaN  -1     -1   ...      -1   
8     X       10     rsTest  AC     A   ATG  10.0  -1     -1   ...      -1   

   AC_2     DB  FILTER_PASS  FILTER_q10  FILTER_s50  numalt  svlen_1  svlen_2  \
0    -1  False        False       False       False       1        0        0   
1    -1  False        False       False       False      

In [68]:

# wow, you can filter with pandas' query() using SQL-like expressions (this is pandas, not SparkSQL) !

df.query('DP > 10 and QUAL > 20')



Unnamed: 0,CHROM,POS,ID,REF,ALT_1,ALT_2,QUAL,DP,AA,NS,...,AC_1,AC_2,DB,FILTER_PASS,FILTER_q10,FILTER_s50,numalt,svlen_1,svlen_2,is_snp
2,20,14370,rs6054257,G,A,,29.0,14,,3,...,-1,-1,True,True,False,False,1,0,0,True
5,20,1230237,.,T,,,47.0,13,T,3,...,-1,-1,False,True,False,False,0,0,0,False
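`DataFrame.query` is a pandas method that evaluates a boolean expression against the column names. A small sketch, assuming a cut-down frame called `demo` (a made-up name) that holds only the POS, QUAL and DP values from the table above, shows it next to the equivalent boolean mask:

```python
import numpy as np
import pandas as pd

# POS, QUAL and DP copied from the table above; other columns omitted.
demo = pd.DataFrame({
    'POS': [111, 112, 14370, 17330, 1110696, 1230237, 1234567, 1235237, 10],
    'QUAL': [9.6, 10.0, 29.0, 3.0, 67.0, 47.0, 50.0, np.nan, 10.0],
    'DP': [-1, -1, 14, 11, 10, 13, 9, -1, -1],
})

# String expression form
by_query = demo.query('DP > 10 and QUAL > 20')

# Same filter written as an explicit boolean mask
by_mask = demo[(demo['DP'] > 10) & (demo['QUAL'] > 20)]

print(by_query)
```

Both forms keep rows 2 and 5. The row with a missing QUAL drops out because comparisons against NaN are False.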


In [69]:

### export as .csv

allel.vcf_to_csv('C:/SPARK/sample.vcf', 'C:/SPARK/example.csv', fields=['CHROM', 'POS', 'DP'])


with open('C:/SPARK/example.csv', mode='r') as f:
    print(f.read())
    


CHROM,POS,DP
19,111,-1
19,112,-1
20,14370,14
20,17330,11
20,1110696,10
20,1230237,13
20,1234567,9
20,1235237,-1
X,10,-1



### so you have the file loaded into a dataframe (and exported as .csv), you can now manipulate it with SQL commands if you want to ...
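As a quick illustration of running SQL over that exported data, here is a minimal sketch that uses Python's built-in `sqlite3` as a stand-in SQL engine instead of Spark. The table name `variants` and the inline `csv_text` are assumptions for illustration; the rows are the ones printed above.

```python
import csv
import io
import sqlite3

# Same CHROM,POS,DP content that vcf_to_csv wrote out above.
csv_text = """CHROM,POS,DP
19,111,-1
19,112,-1
20,14370,14
20,17330,11
20,1110696,10
20,1230237,13
20,1234567,9
20,1235237,-1
X,10,-1
"""

# In-memory database; nothing is written to disk.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE variants (CHROM TEXT, POS INTEGER, DP INTEGER)')

# DictReader yields one mapping per row, which matches the named placeholders.
reader = csv.DictReader(io.StringIO(csv_text))
conn.executemany('INSERT INTO variants VALUES (:CHROM, :POS, :DP)', list(reader))

# Plain SQL: variants with read depth above 10, deepest first.
rows = conn.execute(
    'SELECT POS, DP FROM variants WHERE DP > 10 ORDER BY DP DESC'
).fetchall()
print(rows)
conn.close()
```

With a live SparkSession, a comparable route would be `spark.createDataFrame(df)`, then `createOrReplaceTempView(...)` and `spark.sql(...)`; that path is not exercised in this sketch.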


# Now let's pull down a massive .vcf file and process it ! 


* http://hgdownload.cse.ucsc.edu/gbdb/hg19/1000Genomes/phase3/?C=S;O=A

In [70]:

# downloaded a 1.2 GB file of VCF raw data

vcf_path = 'C:/SPARK/ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz'


In [71]:

!ls -lh {vcf_path}

# bam, a 1.2G file 


-rw-r--r-- 1 TBresee mkpasswd 1.2G Jun  4 19:33 C:/SPARK/ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz


In [72]:

callset = allel.read_vcf(vcf_path, fields=['numalt'], log=sys.stdout)


[read_vcf] 65536 rows in 3.84s; chunk in 3.84s (17061 rows/s); 1 :2308933
[read_vcf] 131072 rows in 7.73s; chunk in 3.89s (16854 rows/s); 1 :4177969
[read_vcf] 196608 rows in 11.76s; chunk in 4.03s (16259 rows/s); 1 :6022445
[read_vcf] 262144 rows in 15.71s; chunk in 3.95s (16610 rows/s); 1 :8078286
[read_vcf] 327680 rows in 19.71s; chunk in 4.00s (16377 rows/s); 1 :10246876
[read_vcf] 393216 rows in 23.48s; chunk in 3.77s (17383 rows/s); 1 :12313599
[read_vcf] 458752 rows in 27.44s; chunk in 3.96s (16551 rows/s); 1 :15033300
[read_vcf] 524288 rows in 31.36s; chunk in 3.93s (16690 rows/s); 1 :17226235
[read_vcf] 589824 rows in 35.13s; chunk in 3.76s (17415 rows/s); 1 :19176875
[read_vcf] 655360 rows in 38.90s; chunk in 3.77s (17364 rows/s); 1 :21331176
[read_vcf] 720896 rows in 42.78s; chunk in 3.88s (16876 rows/s); 1 :23514706
[read_vcf] 786432 rows in 46.65s; chunk in 3.87s (16955 rows/s); 1 :25882976
[read_vcf] 851968 rows in 50.38s; chunk in 3.73s (17584 rows/s); 1 :28279507
[read_

<br>

```
When processing larger VCF files it’s useful to get some feedback on how fast things are going. 

Ultimately I am going to extract all the data from this VCF file into a Zarr store. 
However, before I do that, I’m going to check how many alternate alleles I should expect. 
I’m going to do that by extracting just the ‘numalt’ field, which scikit-allel will compute 
from the number of values in the ‘ALT’ field:
``` 
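The idea behind `numalt` can be sketched in a few lines: count the comma-separated values in the `ALT` field, treating `.` as zero. The helper `count_alt_alleles` is a hypothetical name and only approximates what scikit-allel computes internally; the ALT values are the ones from the small sample file used earlier.

```python
def count_alt_alleles(alt_field):
    """Count alternate alleles in a raw VCF ALT value such as 'G,T'."""
    if alt_field in ('.', ''):
        return 0  # '.' means no alternate allele was recorded
    return len(alt_field.split(','))

# ALT values taken from the small sample file used earlier.
sample_alts = ['C', 'G', 'A', 'A', 'G,T', '.', 'GA,GAC', '.', 'A,ATG']
numalt_demo = [count_alt_alleles(a) for a in sample_alts]
print(numalt_demo)
```

These counts agree with the `numalt` column shown in the earlier `fields='*'` dataframe.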
    

In [73]:

# let’s see what the largest number of alternate alleles is:

numalt = callset['variants/numalt']

np.max(numalt)



12

In [74]:

# Out of interest, how many variants are multi-allelic?
count_numalt = np.bincount(numalt)
count_numalt


array([      0, 6437262,   29538,    1064,     165,      49,      10,
             5,       0,       0,       0,       0,       1], dtype=int64)
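To read that output: position `i` of the `np.bincount` result is the number of variants with exactly `i` alternate alleles. A tiny sketch, assuming the nine `numalt` values from the small sample file (the array name `numalt_small` is made up):

```python
import numpy as np

# numalt values of the nine variants in the small sample file
numalt_small = np.array([1, 1, 1, 1, 2, 0, 2, 0, 2])

# counts[i] = how many variants have exactly i alternate alleles
counts = np.bincount(numalt_small)
print(counts)
```

So in the chr1 output above, index 1 holds the bi-allelic variants and the single entry at index 12 is the variant with twelve alternate alleles.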

In [75]:

n_multiallelic = np.sum(count_numalt[2:])
n_multiallelic

# So there are only a small number of multi-allelic variants (30,832); the vast majority (6,437,262) 
# have just one alternate allele.



30832
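To put that number in proportion, the counts printed above can be re-used directly. This sketch hard-codes the `np.bincount` output from the earlier cell (the name `count_numalt_demo` is made up) and works out the share of multi-allelic variants:

```python
import numpy as np

# bincount output copied from the cell above (index = number of ALT alleles)
count_numalt_demo = np.array([0, 6437262, 29538, 1064, 165, 49, 10,
                              5, 0, 0, 0, 0, 1])

n_multi = int(count_numalt_demo[2:].sum())   # two or more alternate alleles
n_total = int(count_numalt_demo.sum())       # all variants on chr1
share = n_multi / n_total

print(n_multi, n_total, round(100 * share, 2))
```

That works out to a bit under half a percent of all variants in this file.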