# Data science in Python

- Course GitHub repo: https://github.com/pycam/python-data-science
- Python website: https://www.python.org/ 

## Session 1.2: Using existing python modules to explore data in files

- [Importing module `statistics`](#Importing-module-statistics)
  - [Exercise 1.2.1](#Exercise-1.2.1)
- [Python file and directory manipulations](#Python-file-and-directory-manipulations)
  - [Exercise 1.2.2](#Exercise-1.2.2)
- [Using the `csv` module](#Using-the-csv-module)
  - [Exercise 1.2.3](#Exercise-1.2.3)

## Mind map

<img src="img/mind_maps/mind_maps.002.jpeg">

## Importing module `statistics`

Like other languages, Python can import external modules (or libraries) into the current program. These modules may be part of the standard library that is automatically included with the Python installation, extra libraries which you install separately, or other Python programs you have written yourself. Whatever its source, a module is brought into a program with an **`import`** statement.

For example, if we wish to access the `mean()` and `median()` functions in Python, we can use the **`import`** keyword to get [the module named `statistics`](https://docs.python.org/3/library/statistics.html) and access its contents with the dot notation:

In [None]:
import statistics
statistics.mean([1, 2, 3, 4, 4])

We can also use the `as` keyword to give the module a different name in our code, which can be useful for brevity and for avoiding name conflicts:

In [None]:
import statistics as stats
stats.mean([1, 2, 3, 4, 4])

Alternatively we can import the separate components using the `from … import` keyword combination:

In [None]:
from statistics import mean, median
mean([1, 2, 3, 4, 4])

We can import multiple components from a single module, either on one line as seen above or on separate lines:

In [None]:
from statistics import mean
from statistics import median
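As a small sketch of how these import styles relate, the cell below checks that all three forms give access to the same function, and shows the kind of name conflict that the module-qualified form avoids. The variable names are illustrative only.

```python
import statistics
import statistics as stats
from statistics import mean

values = [1, 2, 3, 4, 4]

# The three import styles all give access to the same function object
same_function = statistics.mean is stats.mean is mean
print(same_function)

mean_before = mean(values)

# Assigning to the name 'mean' hides the imported function...
mean = "my own variable"

# ...but the module-qualified name still reaches the original function
mean_after = statistics.mean(values)
print(mean_before, mean_after)
```

This is one reason to prefer `import statistics` (or `import statistics as stats`) in longer scripts, where short names such as `mean` are likely to be reused for variables.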

### Listing module contents

To list the names defined in a module, pass the module to the [function `dir()`](https://docs.python.org/3/library/functions.html?highlight=dir#dir):

In [1]:
import statistics
dir(statistics)

['Decimal',
 'Fraction',
 'StatisticsError',
 '__all__',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__spec__',
 '_check_type',
 '_counts',
 '_decimal_to_ratio',
 '_exact_ratio',
 '_ss',
 '_sum',
 'collections',
 'math',
 'mean',
 'median',
 'median_grouped',
 'median_high',
 'median_low',
 'mode',
 'pstdev',
 'pvariance',
 'stdev',
 'variance']
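The listing above mixes the functions meant for users with internal names starting with an underscore. A minimal way to narrow it down, assuming the usual Python convention that a leading underscore marks a name as private, is sketched here; the exact names returned depend on your Python version.

```python
import statistics

# Keep only the names that do not start with an underscore
public_names = [name for name in dir(statistics) if not name.startswith("_")]
print(public_names)

# The module also declares its intended public interface in __all__
print(statistics.__all__)
```

Note that the filtered list can still contain helper modules that `statistics` itself imports (such as `math`), whereas `__all__` lists only what the module's authors consider public.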

### Getting help directly from Jupyter notebook

In [2]:
statistics?

In [None]:
help(statistics)
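The `statistics?` syntax only works inside Jupyter or IPython, while `help()` works in any Python session. As a sketch of how to get at the same information from a plain script, the standard `inspect` module can return a function's documentation and signature as values:

```python
import inspect
import statistics

# The documentation string of a single function, cleaned of indentation
doc = inspect.getdoc(statistics.median)
print(doc)

# The parameters the function accepts
signature = inspect.signature(statistics.median)
print(signature)
```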

## Exercise 1.2.1

- Calculate the average GDP per capita per country in Europe in 1962, its median and standard deviation using `data/gapminder.csv` data; and compare these figures with those from Americas.

In [13]:
from statistics import mean, median, stdev
gdp_EUR_1962_list=[]
gdp_AME_1962_list=[]

with open("data/gapminder.csv") as f:
    header = next(f)  # skip the header line
    for line in f:
        data = line.strip().split(',')
        country = data[0]
        continent = data[1]
        year = int(data[2])
        gdpPercap = float(data[-1])

        if continent == "Europe" and year == 1962:
            gdp_EUR_1962_list.append(gdpPercap)
        if continent == "Americas" and year == 1962:
            gdp_AME_1962_list.append(gdpPercap)
            
    #print(gdp_EUR_1962_list)
print('GDP per capita per country in EUR in 1962')
print('mean', mean(gdp_EUR_1962_list), '  stddev', stdev(gdp_EUR_1962_list))
    
print('GDP per capita per country in AMR in 1962')
print('mean', mean(gdp_AME_1962_list), '  stddev', stdev(gdp_AME_1962_list))
    

GDP per capita per country in EUR in 1962
mean 8365.4868143   stddev 4199.193906418378
GDP per capita per country in AMR in 1962
mean 4901.5418704   stddev 3421.740568771563


## Python file and directory manipulations

The two modules `os.path` and `os` implement useful functions for manipulating pathnames and for accessing the filesystem. To read or write files, we use `open()`.

### [`os.path` — Common pathname manipulations](https://docs.python.org/3/library/os.path.html)

- `join(*paths)` : joins the paths together into one long path
- `exists(path)` : returns whether path exists
- `isfile(path)` : returns whether path is a “regular” file (as opposed to a directory)
- `isdir(path)` : returns whether path is a directory
- `dirname(path)` : returns directory containing the path
- `basename(path)` : returns the path minus the dirname(path) in front
- `split(path)` : returns (dirname(path), basename(path))

### [`os` — Miscellaneous operating system interfaces](https://docs.python.org/3/library/os.html)

- `listdir(path)` : returns a list of files/directories in the directory path

Building the path to your file from its directory and file name with `os.path.join()` lets your script run on any platform.

In [14]:
import os.path
data_filepath = os.path.join("data", "gapminder.csv")
# data/gapminder.csv - Unix
# data\gapminder.csv - Windows
print(data_filepath)

data/gapminder.csv
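To see both results without switching operating system, the sketch below calls the two platform-specific implementations behind `os.path` directly: `posixpath` for Unix-like systems and `ntpath` for Windows. This is for illustration only; in real scripts keep using `os.path`, which picks the right one for you.

```python
import ntpath      # the Windows implementation of os.path
import posixpath   # the Unix implementation of os.path

unix_path = posixpath.join("data", "gapminder.csv")
windows_path = ntpath.join("data", "gapminder.csv")

print(unix_path)     # data/gapminder.csv
print(windows_path)  # data\gapminder.csv
```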


Checking if a file exists before opening it:

In [None]:
os.path.exists(data_filepath)

Checking if it is a file:

In [None]:
os.path.isfile(data_filepath)

or a directory:

In [None]:
os.path.isdir(data_filepath)

Extracting the directory of the file path:

In [None]:
data_dirname = os.path.dirname(data_filepath)
print(data_dirname)

Checking if it is a directory:

In [None]:
os.path.isdir(data_dirname)

Extracting the file name from the file path:

In [None]:
data_filename = os.path.basename(data_filepath)
print(data_filename)

Getting both the directory and the file name from the file path using `os.path.split()`, which returns them as a pair that can be unpacked into two variables:

In [None]:
data_dirname, data_filename = os.path.split(data_filepath)
print(data_dirname, data_filename)

Listing the content of a directory using `os.listdir()` is equivalent to `ls` in the shell:

In [None]:
import os
print(os.listdir(data_dirname))
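`os.listdir()` returns bare names, not full paths, so each name has to be joined back to its directory before it can be tested with `os.path.isfile()`. The sketch below shows this on a temporary directory so that it runs anywhere; the file and directory names are made up for the example.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical directory contents, created only for this example
    for name in ("genes.txt", "notes.txt", "gapminder.csv"):
        with open(os.path.join(tmp_dir, name), "w") as f:
            f.write("placeholder\n")
    # A directory whose name looks like a text file
    os.mkdir(os.path.join(tmp_dir, "archive.txt"))

    txt_files = []
    for name in sorted(os.listdir(tmp_dir)):
        path = os.path.join(tmp_dir, name)  # listdir() gives names only
        if os.path.isfile(path) and name.endswith(".txt"):
            txt_files.append(name)

print(txt_files)
```

The directory `archive.txt` is skipped because `os.path.isfile()` returns `False` for it, even though its name ends with `.txt`.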

## Exercise 1.2.2

- List all `.txt` files from the `data` directory, print their file path only if it is a file.
- Check that the file `genes.txt` exists in `data/`, open the tab separated file, and calculate the length of each gene.

In [52]:
import os

dName = 'data'

listContents = os.listdir(dName)
print(listContents)

for content in listContents:
    filepath = os.path.join(dName, content)
    
    if os.path.isfile(filepath) and filepath.endswith('.txt'):
        print(filepath)

['mydata.txt', 'gapminder_gdp_europe.csv', 'GRCm38.gff3', 'GRCh38.gff3', 'gapminder_gdp_africa.csv', 'AilMel.gff3', 'gapminder_gdp_americas.csv', 'genes.txt', 'genes_withstrand.txt', '.~lock.gapminder.csv#', 'gapminder.csv', 'gapminder_gdp_oceania.csv', 'glpa.fa', 'sample.fa', 'gapminder_gdp_asia.csv', 'GRCz11.gff3']
data/mydata.txt
data/genes.txt
data/genes_withstrand.txt


In [56]:
import os

fName = os.path.join("data","genes.txt")

if os.path.exists(fName):
    print('file {} exists'.format(fName))
    with open(fName, 'r') as f:
        header = f.readline()
        print(header)
        for line in f:
            gene,chrom,startPos,endPos = line.strip().split('\t')
            genelength = int(endPos) - int(startPos) + 1
            print('{} gene has length = {}'.format(gene, genelength))

file data/genes.txt exists
gene	chrom	start	end

BRCA2 gene has length = 84195
TNFAIP3 gene has length = 16099
TCF7 gene has length = 37155

In [47]:
import os
data_filepath = "data"
# list the directory once, then keep the regular files whose name ends with .txt
for name in os.listdir(data_filepath):
    if name.endswith(".txt") and os.path.isfile(os.path.join(data_filepath, name)):
        print(name)


gene_filepath = os.path.join("data","genes.txt")
print(os.path.exists(gene_filepath))

with open(gene_filepath) as f:
    header = next(f)  # skip the header line
    for line in f:
        data = line.strip().split('\t')
        start = int(data[-2])
        end = int(data[-1])
        length = end - start + 1  # + 1 so that both the start and end positions are counted
        print(data[0], 'length:', length)



mydata.txt
genes.txt
genes_withstrand.txt
True
BRCA2 length: 84195
TNFAIP3 length: 16099
TCF7 length: 37155


## Using the `csv` module

The so-called CSV (Comma Separated Values) format is the most common import and export format for spreadsheets and databases. The `csv` module implements methods to read and write tabular data in CSV format.

The `csv` module's `reader()` and `writer()` functions read and write CSV files. You can also read and write rows as dictionaries using the `DictReader()` and `DictWriter()` classes.
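As a minimal sketch of the plain `reader()` and `writer()` pair, the cell below reads tab separated text and writes it back out comma separated. It uses `io.StringIO` objects in place of real files so that it is self-contained; the single data row is taken from the `genes.txt` table in this notebook.

```python
import csv
import io

# In-memory stand-in for a tab separated file
source = io.StringIO("gene\tchrom\tstart\tend\nBRCA2\t13\t32889611\t32973805\n", newline="")

reader = csv.reader(source, delimiter="\t")
header = next(reader)           # first row: a list of column names
rows = [row for row in reader]  # remaining rows: lists of strings

# Write the same data back out, this time comma separated
output = io.StringIO(newline="")
writer = csv.writer(output, delimiter=",")
writer.writerow(header)
writer.writerows(rows)

csv_text = output.getvalue()
print(csv_text)
```

With `reader()` each row is a list, so columns are accessed by position (`row[0]`, `row[1]`, ...) and every value is a string.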

For more information about this built-in Python library go to [CSV File Reading and Writing documentation](https://docs.python.org/3/library/csv.html).

Let's now read our `data/genes.txt` tab separated file using the `csv` module into a dictionary based on the column headers using `csv.DictReader()`.

|gene |	chrom |	start |	end |
|-- | -- | -- | -- | 
|BRCA2 |	13 |	32889611 |	32973805 |
|TNFAIP3 |	6 |	138188351 |	138204449 |
|TCF7 |	5 |	133450402 |	133487556 |

First, import the `csv` module:

In [58]:
import csv

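Read the data and store each row, a dictionary keyed by column name, into a list. Note that in Python 3.6 and 3.7 `DictReader()` returns each row as an [ordered dictionary](https://docs.python.org/3/library/collections.html#ordereddict-objects); since Python 3.8 it returns a regular `dict`, which also preserves insertion order.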

Ordered dictionaries are like regular dictionaries but they remember the order that items were inserted. When iterating over an ordered dictionary, the items are returned in the order their keys were first added.
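The sketch below shows what this means in practice, assuming Python 3.6 or later: the keys of each row come back in the same order as the columns in the file. The exact row type printed depends on the Python version, so no expected output is given for it.

```python
import csv
import io

# In-memory stand-in for data/genes.txt, reduced to one row
source = io.StringIO("gene\tchrom\tstart\tend\nBRCA2\t13\t32889611\t32973805\n")

reader = csv.DictReader(source, delimiter="\t")
first_row = next(reader)

print(type(first_row).__name__)  # row type varies with the Python version
print(reader.fieldnames)         # column names taken from the header line
print(list(first_row.keys()))    # keys follow the column order of the file
```

Whatever the row type, all values are read as strings and need converting with `int()` or `float()` before doing arithmetic.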

In [59]:
data = []
with open("data/genes.txt") as f:
    reader = csv.DictReader(f, delimiter = "\t")
    for row in reader:
        print(row)
        data.append(row)

for d in data:
    print(d['chrom'], d['gene'], d['start'], d['end'])

{'end': '32973805', 'start': '32889611', 'chrom': '13', 'gene': 'BRCA2'}
{'end': '138204449', 'start': '138188351', 'chrom': '6', 'gene': 'TNFAIP3'}
{'end': '133487556', 'start': '133450402', 'chrom': '5', 'gene': 'TCF7'}
13 BRCA2 32889611 32973805
6 TNFAIP3 138188351 138204449
5 TCF7 133450402 133487556


`data` is a list of dictionaries, one for each row of the data file:

In [60]:
# accessing first dictionary from the list
print(data[0])

# printing its keys
print(data[0].keys())

# its values
print(data[0].values())

# the value associated with the key 'gene'
print(data[0]['gene'])

{'end': '32973805', 'start': '32889611', 'chrom': '13', 'gene': 'BRCA2'}
dict_keys(['end', 'start', 'chrom', 'gene'])
dict_values(['32973805', '32889611', '13', 'BRCA2'])
BRCA2


In [None]:
# looping over the list to print each gene
for d in data:
    print(d['gene'])

In [None]:
# calculating the length of each gene and adding its value into the dictionary
for d in data:
    d['len'] = int(d['end']) - int(d['start']) + 1
    print(d)

The main advantage of using `DictReader()` from the `csv` module is that the code is easier to read and more flexible. Referring to a column by its name instead of its index makes the code more meaningful to read, and reading comma or tab separated files this way gives you the flexibility to add columns or change their order without having to modify your code.

Let's now have a look at the file `data/genes_withstrand.txt` and spot the differences with `data/genes.txt`. Even though columns `chrom` and `gene` have been swapped and a column `strand` added, the code written previously still works.

In [None]:
data_withstrand = []
with open("data/genes_withstrand.txt") as f:
    reader = csv.DictReader(f, delimiter = "\t")
    for row in reader:
        print(row)
        data_withstrand.append(row)

for d in data_withstrand:
    print(d['chrom'], d['gene'], d['start'], d['end'])
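The same point can be checked without the data files. In this sketch, a small helper function (a hypothetical name, `gene_lengths`) is run on two in-memory tables that differ in column order and number of columns. The `strand` value is an assumption made for the example and is not read from `data/genes_withstrand.txt`.

```python
import csv
import io

def gene_lengths(handle):
    """Return a {gene: length} dictionary from a tab separated table."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["gene"]: int(row["end"]) - int(row["start"]) + 1 for row in reader}

# Same gene, two layouts: columns swapped and an extra column added
original = io.StringIO("gene\tchrom\tstart\tend\nBRCA2\t13\t32889611\t32973805\n")
reordered = io.StringIO("chrom\tgene\tstart\tend\tstrand\n13\tBRCA2\t32889611\t32973805\t+\n")

lengths_original = gene_lengths(original)
lengths_reordered = gene_lengths(reordered)
print(lengths_original)
print(lengths_reordered)
```

Code that split each line and used `data[0]` for the gene name would have picked up the chromosome instead in the second layout.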

In [78]:
# Write a delimited file using the csv module from a list of dictionaries
# newline="" lets the csv module control line endings, as its documentation recommends
with open("gene_lengths.txt", "w", newline="") as f:
    writer = csv.DictWriter(f, data[0].keys(), delimiter='\t')
    writer.writeheader() # write header

    for d in data:
        writer.writerow(d) # write row

# Open the output file and print out its content
with open("gene_lengths.txt") as f:
    for line in f:
        print(line.strip())

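Passing `data[0].keys()` as field names writes every column in whatever order the first dictionary happens to have. A possible variation, sketched here with an in-memory output instead of a file, is to list the field names explicitly and let `DictWriter` ignore the other keys through its `extrasaction` parameter. The choice of columns is illustrative.

```python
import csv
import io

# Two rows from the genes table, as DictReader would return them
rows = [
    {"gene": "BRCA2", "chrom": "13", "start": "32889611", "end": "32973805"},
    {"gene": "TNFAIP3", "chrom": "6", "start": "138188351", "end": "138204449"},
]
for row in rows:
    row["len"] = int(row["end"]) - int(row["start"]) + 1

# Choose which columns to write and in which order
fieldnames = ["gene", "chrom", "len"]

output = io.StringIO(newline="")
writer = csv.DictWriter(output, fieldnames=fieldnames, delimiter="\t", extrasaction="ignore")
writer.writeheader()
writer.writerows(rows)

lines = output.getvalue().splitlines()
for line in lines:
    print(line)
```

Without `extrasaction="ignore"`, `DictWriter` raises a `ValueError` when a dictionary contains keys that are not in `fieldnames`.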

## Getting help from the official Python documentation

The most useful information is available online on the https://www.python.org/ website, which should be used as a reference guide.

- [Python3 documentation](https://docs.python.org/3/) is the starting page with links to tutorials and libraries' documentation for Python 3
    - [The Python Tutorial](https://docs.python.org/3/tutorial/index.html)
        - [Modules](https://docs.python.org/3/tutorial/modules.html)
        - [Brief Tour of the Standard Library: Mathematics](https://docs.python.org/3/tutorial/stdlib.html#mathematics)
    - [The Python Standard Library Reference](https://docs.python.org/3/library/index.html) is the reference documentation of all libraries included in Python like:
        - [`statistics` - Mathematical statistics functions](https://docs.python.org/3/library/statistics.html)
        - [`os.path` — Common pathname manipulations](https://docs.python.org/3/library/os.path.html)
        - [`os` — Miscellaneous operating system interfaces](https://docs.python.org/3/library/os.html)
        - [`csv` — CSV File Reading and Writing](https://docs.python.org/3/library/csv.html)

## Exercise 1.2.3

- Change the script you wrote for [Exercise 1.2.1](#Exercise-1.2.1) to make use of the `csv` module to calculate the average GDP per capita per country in Europe in 1962, its median and standard deviation using `data/gapminder.csv` data; and compare these figures with those from Americas.

In [82]:
import csv
import os
from statistics import mean, stdev

fName = os.path.join("data","gapminder.csv")
# check that the file exists
if os.path.exists(fName):
    print('{} is present'.format(fName))
    
gdp_1962_EU = []   

with open(fName, 'r') as f:
    reader = csv.DictReader(f, delimiter = ',')
    for data in reader:
        if data['continent'] == 'Europe' and data['year'] == '1962':
            gdp_1962_EU.append(float(data['gdpPercap']))


print('EU GDP per cap in 1962, mean {:.2f}, std {:.2f}'.format(mean(gdp_1962_EU),stdev(gdp_1962_EU)))        


data/gapminder.csv is present
EU GDP per cap in 1962, mean 8365.49, std 4199.19
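The exercise also asks for the median and for a comparison with the Americas. One possible way to cover both continents in a single pass is sketched below, grouping values in a `defaultdict`. The rows are hypothetical and only stand in for `data/gapminder.csv`, so the numbers printed are not real GDP figures.

```python
import csv
import io
from collections import defaultdict
from statistics import mean, median, stdev

# Hypothetical rows, NOT taken from data/gapminder.csv
sample = io.StringIO(
    "country,continent,year,gdpPercap\n"
    "A,Europe,1962,1000.0\n"
    "B,Europe,1962,2000.0\n"
    "C,Europe,1962,6000.0\n"
    "D,Americas,1962,500.0\n"
    "E,Americas,1962,1500.0\n"
    "A,Europe,1967,1200.0\n"
)

# Collect the 1962 values of each continent of interest
gdp_1962 = defaultdict(list)
for row in csv.DictReader(sample):
    if row["year"] == "1962" and row["continent"] in ("Europe", "Americas"):
        gdp_1962[row["continent"]].append(float(row["gdpPercap"]))

# Summarise each continent with the three statistics
summary = {}
for continent, values in gdp_1962.items():
    summary[continent] = (mean(values), median(values), stdev(values))
    print("{}: mean {:.2f}, median {:.2f}, std {:.2f}".format(continent, *summary[continent]))
```

To run this on the real data, replace `sample` with the file object returned by `open()` on `data/gapminder.csv`.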


## Next session

Go to our next notebook: [Session 1.3: Creating functions and modules to write reusable code](13_python_data.ipynb)