# Data science in Python

- Course GitHub repo: https://github.com/pycam/python-data-science
- Python website: https://www.python.org/ 

## Session 1.1: Starting with data and Python
- [Jupyter notebook](#Jupyter-notebook)
  - [Exercise 1.1.1](#Exercise-1.1.1)
- [Shell commands](#Shell-commands)
  - [Exercise 1.1.2](#Exercise-1.1.2)
- [Basic Python](#Basic-Python)
  - [Exercise 1.1.3](#Exercise-1.1.3)

## Mind map

<img src="img/mind_maps/mind_maps.001.jpeg">

## Jupyter notebook

<img src="http://jupyter.org/assets/nav_logo.svg">

- The [Jupyter Notebook](http://jupyter.org/) is a web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text. 

- Jupyter provides a rich architecture for interactive data science and scientific computing with: 
    - Over 40 programming languages such as Python, R, Julia and Scala.
    - A browser-based notebook with support for code, rich text, math expressions, plots and other rich media.
    - Support for interactive data visualization.

### How to install Jupyter on your own computer?

- We recommend installing [Python 3](https://www.python.org/) on your computer first, then creating and activating a virtual environment:
```bash
python3 -m venv venv
source venv/bin/activate # activate your virtual environment
```
- Install Jupyter:
```
pip install jupyter
```
- Start the notebook server from the command line:
```
jupyter notebook
```
- You should see the notebook home page open in your web browser.

### How to run Python in a Jupyter notebook?

- See [Jupyter Notebook Basics](http://nbviewer.jupyter.org/github/jupyter/notebook/blob/master/docs/source/examples/Notebook/Notebook%20Basics.ipynb)

## Exercise 1.1.1
- Create a new Jupyter notebook with one Markdown cell and one Python Code cell. Run the code.
- Download the newly created notebook as a Python file.

## Shell commands 

- Three commands to find your way around the file system:
  - `pwd` to print working directory
  - `ls` to list content of a directory
  - `cd` to change directory
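
The same three operations are also available from inside Python, which can be handy in a notebook. The sketch below uses the standard `os` module; the temporary directory and the file name `example.txt` are assumptions made for illustration so that the code runs anywhere.

```python
import os
import tempfile

start_dir = os.getcwd()                # like `pwd`: the current working directory
print(start_dir)

with tempfile.TemporaryDirectory() as tmp:
    # create an empty file so that there is something to list
    open(os.path.join(tmp, "example.txt"), "w").close()
    os.chdir(tmp)                      # like `cd`
    listing = sorted(os.listdir("."))  # like `ls`
    print(listing)
    os.chdir(start_dir)                # go back before the directory is removed
```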

### Run the Python interpreter

On a Mac or Linux machine, first open a terminal and then type the command `python3`.
<center><img src="img/python_shell.png"></center>

### Run Python code from a file

To run Python code from a file, open a terminal window and type the command `python3` (or just `python`) followed by the name of the script that contains the Python code.

```
python3 scripts/hello.py
```
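
The contents of `scripts/hello.py` are not shown here. As an assumption, a script of this kind could be as small as the sketch below; the function name `greeting` is hypothetical.

```python
# hello.py (hypothetical contents, for illustration only)

def greeting(name):
    """Build the message to print."""
    return "Hello, " + name + "!"


if __name__ == "__main__":
    # this part runs only when the file is executed, e.g. `python3 hello.py`
    print(greeting("world"))
```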

Make sure that you are running Python 3:
```
python --version
```
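
The version can also be checked from within Python itself, for example in the first cell of a notebook. A small sketch using the standard `sys` module:

```python
import sys

# sys.version_info holds the version numbers; the first one is the major version
major = sys.version_info[0]
print("Running Python", sys.version.split()[0])
if major < 3:
    print("This course needs Python 3")
```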


## Exercise 1.1.2
- In a terminal window, find the Python script you downloaded from the Jupyter notebook and execute it.
```
python3 my-script.py
```
- List all data files in the `data/` folder of the course materials and find our first data file, `gapminder.csv`.

## Basic Python

### Cheat Sheet

- [Cheat Sheet](cheat_sheet_basic_python.ipynb)

### For loops

The **`for` loop** in Python iterates over each item in a collection (such as a list) in the order in which the items appear. A variable (`colour` in the example below) is set to each item of the collection in turn, and the indented block of code is executed once for each item.

In [None]:
all_colours = ['red', 'blue', 'green']
for colour in all_colours:
    print(colour)
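
Two common variations on the loop above, as a short sketch: `enumerate()` gives the position of each item alongside the item itself, and a variable defined before the loop can accumulate a result. The names `position` and `total_length` are chosen for illustration.

```python
all_colours = ['red', 'blue', 'green']

# enumerate() yields (index, item) pairs, counting from 0
for position, colour in enumerate(all_colours):
    print(position, colour)

# accumulate a value across iterations
total_length = 0
for colour in all_colours:
    total_length += len(colour)
print(total_length)
```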

### Files

To read from a file, your program needs to open the file and then read the contents of the file. You can read the entire contents of the file at once, or read the file line by line. The **`with`** statement makes sure the file is closed properly when the program has finished accessing the file.


Passing the `'w'` argument to `open()` tells Python you want to write to the file. Be careful; this will erase the contents of the file if it already exists. Passing the `'a'` argument tells Python you want to append to the end of an existing file.

In [1]:
# reading from file
with open("data/genes.txt") as f:
    for line in f:
        print(line.strip())

gene	chrom	start	end
BRCA2	13	32889611	32973805
TNFAIP3	6	138188351	138204449
TCF7	5	133450402	133487556


In [4]:
# printing only the gene name, chromosome and end columns
with open("data/genes.txt") as f:
    for line in f:
        data = line.strip().split("\t")
        #data = line.strip().split()
        print(data[0], data[1], data[-1])

gene chrom end
BRCA2 13 32973805
TNFAIP3 6 138204449
TCF7 5 133487556
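
The `'w'` and `'a'` modes described above, and reading a whole file at once, can be sketched as follows. The file name `notes.txt` is hypothetical, and the code works in a temporary directory so that no real file is overwritten.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "notes.txt")   # hypothetical file name

    with open(path, "w") as f:      # 'w' creates the file or erases its contents
        f.write("first line\n")

    with open(path, "a") as f:      # 'a' adds to the end of the file
        f.write("second line\n")

    with open(path) as f:           # the default mode is 'r' (read)
        contents = f.read()         # read the entire file at once

print(contents)
```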


### Conditional execution

A conditional **`if/elif`** statement specifies that a block of code is executed only if its conditional expression evaluates to `True`. A final **`else`** statement can be added to do something when all of the conditions are `False`.
Python uses **indentation** to show which statements belong to a block of code.

In [None]:
# printing only the gene name and its position for chromosome 6
with open("data/genes.txt") as f:
    for line in f:
        data = line.strip().split()
        if data[1] == '6':
            print(data[0], data[2], data[3])
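
The cell above only uses `if`. A minimal sketch of the full `if/elif/else` form follows, using the gene names and chromosomes from the `genes.txt` output shown earlier; the list `labels` is introduced for illustration.

```python
# gene names and chromosomes copied from the genes.txt output shown earlier
genes = [
    ("BRCA2", "13"),
    ("TNFAIP3", "6"),
    ("TCF7", "5"),
]

labels = []
for name, chrom in genes:
    if chrom == "6":
        label = "on chromosome 6"
    elif chrom == "13":
        label = "on chromosome 13"
    else:
        # reached only when none of the conditions above is True
        label = "on another chromosome"
    labels.append(label)
    print(name, label)
```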

### Getting help

[The Python 3 Standard Library](https://docs.python.org/3/library/index.html) is the reference documentation of all libraries included in Python as well as built-in functions and data types.


The Basic Python [Cheat Sheet](cheat_sheet_basic_python.ipynb) is a quick summary based on the course ['Introduction to solving biological problems with Python'](http://pycam.github.io/).

In [5]:
help(len)          # help on built-in function
help(list.extend)  # help on list function

Help on built-in function len in module builtins:

len(...)
    len(object)
    
    Return the number of items of a sequence or collection.

Help on method_descriptor:

extend(...)
    L.extend(iterable) -> None -- extend list by appending elements from the iterable



In [14]:
help("a string".strip)


To get help for the `split()` method, you can look at the [Python documentation](https://docs.python.org/3/library/index.html) and search for [`str.split()`](https://docs.python.org/3/library/stdtypes.html?highlight=split#str.split).

In [19]:
help("a string".split)
# print("a string".split()[1])  # would print: string


In [None]:
# help within jupyter
str.split?
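
As a quick illustration of what the `str.split()` documentation describes, the sketch below compares `split("\t")` with `split()` without arguments. The first line is the BRCA2 row of `genes.txt`; the second row is made up to show where the two calls differ.

```python
line = "BRCA2\t13\t32889611\t32973805\n"   # BRCA2 row from genes.txt

fields_tab = line.strip().split("\t")   # split on tab characters only
fields_any = line.strip().split()       # split on any run of whitespace
print(fields_tab)
print(fields_any)

# the two calls differ when a field itself contains a space
row = "gene name\tchrom"                # made-up example row
print(row.split("\t"))   # ['gene name', 'chrom']
print(row.split())       # ['gene', 'name', 'chrom']
```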

## Exercise 1.1.3

We are going to look at a [Gapminder](https://www.gapminder.org/) dataset, made famous by Hans Rosling in his TED talk [‘The best stats you’ve ever seen’](http://www.ted.com/talks/hans_rosling_shows_the_best_stats_you_ve_ever_seen).

- Read data from the file `data/gapminder.csv`.
- Find which European countries have the largest population in 1952 and 2007.

In [114]:
max_pop_1952 = 0
max_country_1952 = None

with open("data/gapminder.csv") as f:
    header = next(f)  # read the header line so the loop starts at the first data row
    for line in f:
        data = line.strip().split(',')
        country = data[0]
        continent = data[1]
        year = int(data[2])
        pop = int(data[-2])  # population is the second-to-last column

        # keep the largest European population seen so far for 1952
        if continent == "Europe" and year == 1952:
            if pop > max_pop_1952:
                max_pop_1952 = pop
                max_country_1952 = country

print(max_pop_1952, max_country_1952)

69145952 Germany
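
The exercise asks about both 1952 and 2007. One possible way to handle several years in a single pass is sketched below; it is not the course's reference solution. The function name `largest_population` and the rows in `sample` are made up for illustration and only follow the column layout of `gapminder.csv`.

```python
def largest_population(lines, continent, years):
    """Return {year: (population, country)} for one continent."""
    rows = iter(lines)
    next(rows)                      # skip the header line
    largest = {}
    for line in rows:
        data = line.strip().split(',')
        country, row_continent = data[0], data[1]
        year, pop = int(data[2]), int(data[4])
        if row_continent == continent and year in years:
            # keep the row if it is the first for that year or beats the current one
            if year not in largest or pop > largest[year][0]:
                largest[year] = (pop, country)
    return largest


# made-up rows in the gapminder.csv column layout, only to show the call
sample = [
    "country,continent,year,lifeExp,pop,gdpPercap",
    "Xland,Europe,1952,60.0,1000,500.0",
    "Yland,Europe,1952,61.0,3000,600.0",
    "Zland,Asia,1952,50.0,9000,300.0",
    "Xland,Europe,2007,75.0,4000,900.0",
    "Yland,Europe,2007,76.0,3500,950.0",
]
result = largest_population(sample, "Europe", {1952, 2007})
print(result)
```

With the course data, the function could be given the open file instead of `sample`, for example inside `with open("data/gapminder.csv") as f:`.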



In [104]:
data_year_1952 = []
data_year_2007 = []
# reading from file and keeping the rows for each year of interest
with open("data/gapminder.csv") as f:
    for line in f:
        data = line.strip().split(',')
        if data[2] == '1952':
            data_year_1952.append(data)
        if data[2] == '2007':
            data_year_2007.append(data)

print('entries no', len(data_year_1952))

# populations for 1952 as numbers (all continents, not only Europe)
pop_list = []
for row in data_year_1952:
    pop_list.append(int(row[4]))

max_pop = max(pop_list)
print(max_pop)

# read the file again to print the row(s) with that population
with open("data/gapminder.csv") as f:
    for line in f:
        data = line.strip().split(',')
        if data[4] == str(max_pop):
            print(data)

entries no 141
556263527
['China', 'Asia', '1952', '44', '556263527', '400.448611']


In [107]:
# collect all rows for the year 1952
data_year_1952 = []
with open("data/gapminder.csv") as f:
    for line in f:
        data = line.strip().split(',')
        if data[2] == '1952':
            data_year_1952.append(data)

print('entries no', len(data_year_1952))

# convert populations to int: max() on strings compares them alphabetically
pop_list = []
for row in data_year_1952:
    pop_list.append(int(row[4]))
print(max(pop_list))

# the same search done by hand, printing each new running maximum
big = 0
for i, row in enumerate(data_year_1952):
    if int(row[4]) > big:
        big = int(row[4])
        big_index = i
        print(big)
        print(data_year_1952[big_index])

entries no 141
556263527
8425333
['Afghanistan', 'Asia', '1952', '28.801', '8425333', '779.4453145']
9279525
['Algeria', 'Africa', '1952', '43.077', '9279525', '2449.008185']
17876956
['Argentina', 'Americas', '1952', '62.485', '17876956', '5911.315053']
46886859
['Bangladesh', 'Asia', '1952', '37.484', '46886859', '684.2441716']
56602560
['Brazil', 'Americas', '1952', '50.917', '56602560', '2108.944355']
556263527
['China', 'Asia', '1952', '44', '556263527', '400.448611']
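
The running-maximum search above can also be written with `max()` and a `key` function, as sketched below. The three rows are copied from the output above; the helper name `population` is introduced for illustration.

```python
# three of the 1952 rows printed above, as lists of strings
rows = [
    ['Afghanistan', 'Asia', '1952', '28.801', '8425333', '779.4453145'],
    ['Brazil', 'Americas', '1952', '50.917', '56602560', '2108.944355'],
    ['China', 'Asia', '1952', '44', '556263527', '400.448611'],
]


def population(row):
    """Population column converted to a number."""
    return int(row[4])


# max() compares the rows using the value returned by the key function
largest = max(rows, key=population)
print(largest[0], population(largest))

# without the conversion, the strings are compared character by character
largest_as_text = max(row[4] for row in rows)
print(largest_as_text)   # '8425333', because '8' sorts after '5'
```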


## Next session

Go to our next notebook: [Session 1.2: Using existing python modules to explore data in files](12_python_data.ipynb)