# Intro to Python

# Variables in Python
Variables in programming languages hold values.
* In Python a single `=` (equals sign) assigns the value on the right to the name of the variable on the left.
* A variable is created when a value is first assigned to it.
* In the code block below, Python assigns the numerical value `42` to a variable called `helpful_articles`

##### Creating variables

In [19]:
helpful_articles = 42
library = "Cushing/Whitney Medical Library"

##### Rules for creating variables
* Variable names can only contain letters, digits, and underscores `_`
* Variable names cannot start with a digit
* Variable names are case sensitive: `library`, `Library`, and `LIBRARY` are three different variables
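
The naming rules above can be sketched as follows. This is a minimal illustration; the names `Library`, `LIBRARY`, and `article_2` and their values are made up for the example.

```python
# Names that differ only in capitalization are separate variables.
library = "Cushing/Whitney Medical Library"
Library = "a different value"
LIBRARY = 3

print(library)
print(Library)
print(LIBRARY)

# Digits and underscores are allowed, as long as the name does not start with a digit.
article_2 = "valid name"
# 2_article = "invalid name"   # uncommenting this line raises a SyntaxError
```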

##### Use `print` to display values
* `print` is a Python function used to display, or "print out", the value stored in a variable.
* Functions in Python look like this: `print()`. The function name is followed by a set of parentheses.
* You provide values to the function within the parentheses. These values are called "arguments".
* In the case of `print`, the arguments are the items you want to display.
* You can print the values of variables, strings, or numbers.
* You can also print multiple values in one `print` call by separating them with commas.

In [20]:
print(helpful_articles)

42


In [21]:
print(100 + 2 / 3 * 3)

102.0


In [22]:
print("String is another word for text")

String is another word for text


In [23]:
print(String is another word for text)

SyntaxError: invalid syntax (<ipython-input-23-c0a89e8aae61>, line 1)

In [24]:
print("The", library, "is my favorite library")

The Cushing/Whitney Medical Library is my favorite library


##### Using variables within calculations

In [26]:
helpful_articles = helpful_articles + 3
print(helpful_articles)

45


##### Updating Variables
* Python runs code from top to bottom.
* In the example below, `second` is calculated once, while `first` is still `1`. Changing `first` to `2` afterwards does not update `second`, because `second` is never reassigned.

In [61]:
first = 1
second = 5 * first
first = 2
print('first is:', first, 'and second is:', second)

first is: 2 and second is: 5
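
To pick up the new value of `first`, the assignment to `second` has to be run again. A minimal sketch, reusing the same variable names:

```python
first = 1
second = 5 * first   # second is computed once, using first = 1
first = 2            # changing first does not touch second
print(second)        # still 5

second = 5 * first   # re-run the assignment to use the new value of first
print(second)        # now 10
```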


### <font color = green> Break here for 5 minutes to complete exercises.</font>

# Data Types in Python
* Every value in a program has a specific type.
* Integer (`int`): represents positive or negative whole numbers like 3 or -15.
* Floating point (`float`): represents numbers with a decimal point like 3.14159 or -2.5.
* Character string (`str`): text.
    * Written in either single quotation marks or double quotation marks (`''` or `""`)
    * The quotation marks are not printed when the string is displayed
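
As a small sketch of the three types described above (the variable names are illustrative only):

```python
whole_number = -15          # int
decimal_number = 3.14159    # float
single_quoted = 'text'      # str
double_quoted = "text"      # str

# Both quoting styles produce the same string
print(single_quoted == double_quoted)

# Dividing with / always produces a float, even when the result is a whole number
print(10 / 2)
```

This is why the earlier calculation `100 + 2 / 3 * 3` printed `102.0` rather than `102`.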
    
##### Use the `type` function to find the type of a value
* `type` works on values and on variables

In [32]:
print(type(52))

<class 'int'>


In [33]:
print(type("some words"))

<class 'str'>


In [34]:
message = "penny for your thoughts"
print(message)
print(type(message))

penny for your thoughts
<class 'str'>


In [35]:
print(type("100"))

<class 'str'>


##### Data type conversions

In [38]:
excel_cell = "299182"
print(type(excel_cell))
excel_cell = int(excel_cell)
print(type(excel_cell))

<class 'str'>
<class 'int'>


In [45]:
number = 100
print(str(number) + "ish")

100ish
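
Conversions matter because Python will not mix text and numbers for you. The sketch below uses a hypothetical variable, `count_as_text`, standing in for a number that was read in as text:

```python
count_as_text = "100"   # hypothetical value that arrived as text

# print(count_as_text + 2)   # would raise a TypeError: a str and an int cannot be added

print(int(count_as_text) + 2)      # numeric addition: 102
print(float(count_as_text) + 2)    # numeric addition: 102.0
print(count_as_text + str(2))      # the two strings are joined: 1002
```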


##### Strings have length and an index
* The built-in function `len` counts the number of characters in a string
* The characters (individual letters, numbers, spaces, etc.) within a string are ordered. For example the strings "AB" and "BA" are not the same.
* Each position in the string is given a number - the first character is `0`, the second character is `1`, and so on. This number representing the position of a character is called an index. Note: Python indexes start from 0. 

In [49]:
print(library) # let's refer to a string variable we created previously
print(len(library))

Cushing/Whitney Medical Library
31


Of the 31 characters that make up our string, `C` should be the first one. We can check this. As in C and many other programming languages, index counting in Python starts from 0.

In [50]:
print(library[0])

C


##### Use slicing to parse out a substring
* A part of a string is called a substring. A substring can be as short as a single character.
* An item in a list is called an element. Whenever we treat a string as a list, the string's elements are its individual characters. 
* We take a slice of a string or list using `[start:stop]`, where `start` is replaced with the index of the first element we want and `stop` is replaced by the index of the element just after the last element we want.
* Slicing does not change the contents of the original string. The slice is a copy of part of the original string.

In [60]:
print(library)
print(library[16:23])
print(library[24:31])

Cushing/Whitney Medical Library
Medical
Library
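
A short sketch of two points from the list above: `stop` is one past the last index you want, and slicing leaves the original string untouched. Leaving out `start` or `stop` is shown here as a convenience; it slices from the beginning or to the end of the string.

```python
library = "Cushing/Whitney Medical Library"

# The last valid index is len(library) - 1, so stop = len(library) reaches the end
last_word = library[24:len(library)]
print(last_word)

# Omitting start or stop slices from the beginning or to the end
print(library[:7])
print(library[24:])

# The original string is unchanged
print(library)
```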


### <font color = green> Break here for 5 minutes to complete exercises.</font>

# Functions and Finding Help in Python
* Different functions may take zero, one, or many arguments.
* Functions are likely to be specific about the data type they need.
* Functions may have default values for some arguments.
* You can learn about functions using a `help` function.
* Python has built-in functions, as well as many other functions that are associated with external packages. 
* You can create your own functions in Python. 
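
As a rough sketch of the last two points, here is a user-defined function with a default value. The function name `describe_weight` and its parameters are hypothetical, chosen only for this illustration.

```python
def describe_weight(weight, unit="lbs"):
    """Return a short text description of a weight; unit defaults to "lbs"."""
    return str(weight) + " " + unit

print(describe_weight(157))         # uses the default unit
print(describe_weight(71, "kg"))    # overrides the default unit
```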

##### Functions might be able to take multiple items

In [71]:
print(max(2039, 39228, 3948, 10029))
print(min(2039, 39228, 3948, 10029))

39228
2039


##### Functions might have default settings

In [2]:
print(round(3.712))
print(round(3.712, 1))

4
3.7


##### Finding more information about a function

In [3]:
help(round)

Help on built-in function round in module builtins:

round(number, ndigits=None)
    Round a number to a given precision in decimal digits.
    
    The return value is an integer if ndigits is omitted or None.  Otherwise
    the return value has the same type as the number.  ndigits may be negative.



##### Access more functions by importing external libraries into a project

We will discuss libraries further in a later section, but this is how you import a library (i.e., package) into a Python project.
* Libraries are installed once per machine, but they need to be imported into each project in which you would like to use them.
* To use a function from a specific library, indicate which library the function comes from using the syntax `library_name.function_name()`
    * E.g., `statistics.mode()`

In [6]:
import statistics
n = [1, 1, 2, 3, 3, 3, 3]
s = statistics.mode(n)
print(s)

3


# Lists
* Doing calculations with a hundred variables called `patient_001`, `patient_002`, `patient_003`, etc., would be very slow and tedious. However, if all of these patients were in a list, you could perform calculations across each item in the list in an automated way.
* Lists store multiple values.
* Items in a list are stored between square brackets `[]`.
* Values in lists are separated by commas `,`. 
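
To illustrate the first point, the sketch below runs a few built-in functions over a whole list at once. The list name `patient_weights` and its values are made up for the example.

```python
patient_weights = [157, 180, 166, 150, 183, 160]   # hypothetical values

print("Heaviest:", max(patient_weights))
print("Lightest:", min(patient_weights))
print("Average:", sum(patient_weights) / len(patient_weights))
```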

##### Creating and returning list contents

In [77]:
weights = [157, 180, 166, 150, 183, 160]
print("Weights in list:", weights)
print("Length of weights list:", len(weights))

Weights in list: [157, 180, 166, 150, 183, 160]
Length of weights list: 6


##### Use an index to return a specific element from a list



In [79]:
print('First item in list:', weights[0])

First item in list: 157


##### Use an index to replace an item in a list

In [80]:
weights[0] = 156
print('Weights list is now:', weights)

Weights list is now: [156, 180, 166, 150, 183, 160]


##### Adding (i.e. appending) items to a list
* `append` is a "method" of lists. Methods are like functions, but tied to a particular object.
* Use `object_name.method_name()` to call methods.
* You can find the methods associated with an object by running the `help` function on its type or name (e.g., `help(list)`)

In [8]:
names = ["Elo", "Molly", "Charlie", "Riley"]
print("Original list:", names)
names.append("Ben")
names.append("Charolette")
print("List after append", names)

Original list: ['Elo', 'Molly', 'Charlie', 'Riley']
List after append ['Elo', 'Molly', 'Charlie', 'Riley', 'Ben', 'Charolette']


##### Combining lists together
* You can combine lists together with another list method called `extend`

In [9]:
names_1 = ["Elo", "Molly", "Charlie", "Riley"]
names_2 = ["Ben", "Charolette"]
names_1.extend(names_2)
print(names_1)

['Elo', 'Molly', 'Charlie', 'Riley', 'Ben', 'Charolette']


##### Create an empty list and append items to it

In [98]:
empty_list = []
print(empty_list)
empty_list.append("this is a single string")
print(empty_list)
empty_list.append(["this", "is", "a", "few", "strings"])
print(empty_list)

[]
['this is a single string']
['this is a single string', ['this', 'is', 'a', 'few', 'strings']]
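
Note that appending a list adds it as a single nested item. A minimal side-by-side sketch of `append` and `extend`, using throwaway list names:

```python
appended = ["a", "b"]
appended.append(["c", "d"])   # adds the whole list as one item
print(appended, len(appended))

extended = ["a", "b"]
extended.extend(["c", "d"])   # adds each item separately
print(extended, len(extended))
```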


### <font color = green> Break here for 5 minutes to complete the exercises in Part 3: Functions and Lists.</font>

# For Loops
* For loops allow you to drill down into a data structure.
    * Operate on each sentence in a paragraph, each word in a sentence, or each character in a word.
    * Operate on each table in a database, each column in a spreadsheet, or each cell in a column. 
    * Operate on each item in a list.
* A for loop executes commands once for each element in a collection.

In [10]:
for number in [2, 3, 5]:
    print(number) #indentations in python are important!

2
3
5
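
The same pattern works on the characters of a string, and a loop can build up a result as it goes. A small sketch, with `word` and `total` as illustrative names:

```python
word = "Python"
for character in word:
    print(character)   # prints one character per line

# Build up a running total, one element at a time
total = 0
for number in [2, 3, 5]:
    total = total + number
print(total)
```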


##### You can also return items in a list that has been stored as a variable

In [11]:
for name in names_1:
    print('First name:', name)

First name: Elo
First name: Molly
First name: Charlie
First name: Riley
First name: Ben
First name: Charolette


##### The body of a loop can contain many statements

In [12]:
prime_numbers = [2, 3, 5]
for p in prime_numbers:
    first_equation = p + 100
    second_equation = p * -1
    print(p, first_equation, second_equation)

2 102 -2
3 103 -3
5 105 -5


### <font color = green> Break here for 5 minutes to complete the exercises in Part 4: For Loops.</font>

# Python Libraries 
* A "library" in Python is a collection of files (called modules) that contain functions for use by other programs.
* A Python program must import a library in order to use it. You will use `import` to do this.
* Refer to items from specific libraries as `library_name.item_name`.
    * Python uses `.` to mean "part of"
* Use the `help` function to learn about the contents of a library module.
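
A minimal sketch of these points, using the `math` module from Python's standard library as the example:

```python
import math

print(math.pi)          # a value stored in the math library
print(math.sqrt(16))    # a function stored in the math library
```

Running `help(math)` lists everything the module provides.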

# Pandas (a Python Library) and Data Frames
* Pandas is a widely used Python library for statistics, particularly on tabular data.
* Pandas borrows many features from R's data frames.
    * A data frame is a 2-dimensional table whose columns have names and potentially have different data types.
* Load it with `import pandas as pd`. The alias `pd` is commonly used for Pandas; shortening `pandas` to `pd` saves typing.
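
To make the idea of a data frame concrete, the sketch below builds a tiny one by hand. The name `clinic_visits` and all of its values are hypothetical.

```python
import pandas as pd

# Hypothetical values, for illustration only
clinic_visits = pd.DataFrame(
    {
        "patient": ["Elo", "Molly", "Charlie"],   # text column
        "weight": [157.0, 180.0, 166.0],          # float column
        "visits": [3, 1, 2],                      # integer column
    }
)

print(clinic_visits)
print(clinic_visits.dtypes)   # each column has its own data type
```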

##### Import the Pandas library to this project

In [14]:
import pandas as pd

##### Read a Comma-Separated Values (CSV) data file with `pd.read_csv`
* This imports the CSV into your project and saves it as a variable called `data`.
* As you work with `data`, you are not altering the original CSV file. 
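
The same idea can be tried without a network connection. In this sketch a short piece of hypothetical CSV text (`csv_text`) stands in for the file, and `small_data` is an illustrative name:

```python
import io
import pandas as pd

# Hypothetical CSV text standing in for a file
csv_text = "country,1990,1991\nAngola,7500,8500\nArmenia,60,60\n"

# Read the text into a data frame, using the "country" column as row labels
small_data = pd.read_csv(io.StringIO(csv_text), index_col="country")

small_data.loc["Angola", "1990"] = 0   # changes the data frame only

print(small_data)
print(csv_text)   # the source text is unchanged
```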

In [18]:
import requests
import io
# URL where the CSV file is stored on GitHub
url = "https://raw.githubusercontent.com/CWML/intro-to-python-cwml-workshop/master/data/newly_hiv_infected_number_all_ages.csv"
# Retrieve the raw content (bytes) stored at the URL
file = requests.get(url).content
# Decode the bytes as UTF-8 text, then read that text into a data frame called "data",
# using the "country" column as the row labels
data = pd.read_csv(io.StringIO(file.decode('utf-8')), index_col="country")
print(data)

                              1990      1991      1992      1993      1994  \
country                                                                      
Afghanistan                    NaN       NaN       NaN       NaN       NaN   
Angola                      7500.0    8500.0    9500.0   11000.0   12000.0   
Argentina                   4900.0    5400.0    5800.0    6300.0    6700.0   
Armenia                       60.0      60.0      60.0      60.0     120.0   
Australia                      NaN       NaN       NaN       NaN       NaN   
Austria                        NaN       NaN       NaN       NaN       NaN   
Azerbaijan                     NaN       NaN       NaN       NaN       NaN   
Bahamas                      750.0     750.0     750.0     750.0     750.0   
Bangladesh                    60.0     160.0     160.0     160.0     160.0   
Barbados                     160.0     160.0     160.0     160.0     160.0   
Belarus                       60.0      60.0      60.0      60.0

In [19]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Index: 143 entries, Afghanistan to Zimbabwe
Data columns (total 22 columns):
1990    68 non-null float64
1991    68 non-null float64
1992    68 non-null float64
1993    68 non-null float64
1994    68 non-null float64
1995    68 non-null float64
1996    68 non-null float64
1997    68 non-null float64
1998    68 non-null float64
1999    68 non-null float64
2000    68 non-null float64
2001    68 non-null float64
2002    68 non-null float64
2003    68 non-null float64
2004    68 non-null float64
2005    68 non-null float64
2006    68 non-null float64
2007    68 non-null float64
2008    68 non-null float64
2009    68 non-null float64
2010    49 non-null float64
2011    132 non-null float64
dtypes: float64(22)
memory usage: 25.7+ KB


##### Use `DataFrame.loc[... , ...]` to select values by their index labels.
* The position before the comma indicates the row, and the position after the comma indicates the column returned. 
* Fill in both positions, before and after the comma, to return a single cell (one country in one year).
* Indicate only the first position (with a `:` in the second position) to return an entire row (country) from the data frame.
* Indicate only the second position (with a `:` in the first position) to return an entire column (year) from the data frame.

In [32]:
print(data.loc["France","1995"])

7900.0


In [34]:
print(data.loc["France",:])

1990    5900.0
1991    6500.0
1992    7000.0
1993    7200.0
1994    7600.0
1995    7900.0
1996    8000.0
1997    7600.0
1998    7100.0
1999    6700.0
2000    5900.0
2001    5300.0
2002    5300.0
2003    5300.0
2004    5300.0
2005    5300.0
2006    5300.0
2007    5300.0
2008    5300.0
2009    5900.0
2010    6000.0
2011    6100.0
Name: France, dtype: float64


In [35]:
print(data.loc[:,"1995"])

country
Afghanistan                      NaN
Angola                       13000.0
Argentina                     6700.0
Armenia                        120.0
Australia                        NaN
Austria                          NaN
Azerbaijan                       NaN
Bahamas                        750.0
Bangladesh                     160.0
Barbados                       160.0
Belarus                         60.0
Belgium                          NaN
Belize                         350.0
Benin                         7500.0
Bhutan                           NaN
Bolivia                          NaN
Botswana                     36000.0
Brazil                           NaN
Bulgaria                         NaN
Burkina Faso                 14000.0
Burundi                      27000.0
Cambodia                     11000.0
Cameroon                     51000.0
Canada                           NaN
Central African Republic     25000.0
Chile                            NaN
Colombia                      

##### Use `DataFrame.loc[... , ...]` to select multiple columns or rows

In [40]:
print(data.loc["Angola":"Cameroon", "2005":"2011"])

                 2005     2006     2007     2008     2009     2010     2011
country                                                                    
Angola        23000.0  24000.0  24000.0  24000.0  24000.0  24000.0  23000.0
Argentina      7700.0   7700.0   7600.0   7500.0   7500.0      NaN   5600.0
Armenia         120.0    120.0    120.0    300.0    300.0      NaN    350.0
Australia         NaN      NaN      NaN      NaN      NaN      NaN   1100.0
Austria           NaN      NaN      NaN      NaN      NaN      NaN   1200.0
Azerbaijan        NaN      NaN      NaN      NaN      NaN      NaN    750.0
Bahamas         750.0    750.0    350.0    350.0    350.0    350.0    350.0
Bangladesh      750.0    750.0    750.0    750.0    750.0   1100.0   1300.0
Barbados         60.0     60.0     60.0     60.0     60.0     60.0     60.0
Belarus        2400.0   2100.0   1900.0   1800.0   1800.0   1800.0   1900.0
Belgium           NaN      NaN      NaN      NaN      NaN      NaN   1300.0
Belize      
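
One detail worth noting: label slices with `.loc` include the stop label, unlike the string and list slices seen earlier. A minimal sketch with a made-up data frame called `small`:

```python
import pandas as pd

# Hypothetical values, for illustration only
small = pd.DataFrame(
    {"2005": [1, 2, 3], "2006": [4, 5, 6], "2007": [7, 8, 9]},
    index=["Angola", "Argentina", "Armenia"],
)

# Label slices include the stop label: two rows and two columns are returned
selection = small.loc["Angola":"Argentina", "2005":"2006"]
print(selection)

# A list slice stops before the stop index: only one item is returned
print(["Angola", "Argentina", "Armenia"][0:1])
```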

##### Use `DataFrame.loc[... , ...]` to call a customized subset

In [43]:
north_america = ["Canada", "United States", "Mexico"]
every_five = ["1990", "1995", "2000", "2005", "2010"]
subset = data.loc[north_america,every_five]
print(subset)

                  1990     1995     2000     2005     2010
country                                                   
Canada             NaN      NaN      NaN      NaN      NaN
United States  88000.0  51000.0  52000.0  49000.0  49000.0
Mexico         11000.0  12000.0  13000.0  12000.0  10000.0


##### Summarizing data subsets

In [44]:
print(subset.describe())

               1990          1995          2000          2005          2010
count      2.000000      2.000000      2.000000      2.000000      2.000000
mean   49500.000000  31500.000000  32500.000000  30500.000000  29500.000000
std    54447.222151  27577.164466  27577.164466  26162.950904  27577.164466
min    11000.000000  12000.000000  13000.000000  12000.000000  10000.000000
25%    30250.000000  21750.000000  22750.000000  21250.000000  19750.000000
50%    49500.000000  31500.000000  32500.000000  30500.000000  29500.000000
75%    68750.000000  41250.000000  42250.000000  39750.000000  39250.000000
max    88000.000000  51000.000000  52000.000000  49000.000000  49000.000000


##### Finding values that are higher than the average for their year. This returns a data frame of `True` and `False` (boolean) values.

In [53]:
print(data > data.mean())

                           1990   1991   1992   1993   1994   1995   1996  \
country                                                                     
Afghanistan               False  False  False  False  False  False  False   
Angola                    False  False  False  False  False  False  False   
Argentina                 False  False  False  False  False  False  False   
Armenia                   False  False  False  False  False  False  False   
Australia                 False  False  False  False  False  False  False   
Austria                   False  False  False  False  False  False  False   
Azerbaijan                False  False  False  False  False  False  False   
Bahamas                   False  False  False  False  False  False  False   
Bangladesh                False  False  False  False  False  False  False   
Barbados                  False  False  False  False  False  False  False   
Belarus                   False  False  False  False  False  False  False   

In [55]:
filter = data > data.mean()
print(data[filter])

                              1990      1991      1992      1993      1994  \
country                                                                      
Afghanistan                    NaN       NaN       NaN       NaN       NaN   
Angola                         NaN       NaN       NaN       NaN       NaN   
Argentina                      NaN       NaN       NaN       NaN       NaN   
Armenia                        NaN       NaN       NaN       NaN       NaN   
Australia                      NaN       NaN       NaN       NaN       NaN   
Austria                        NaN       NaN       NaN       NaN       NaN   
Azerbaijan                     NaN       NaN       NaN       NaN       NaN   
Bahamas                        NaN       NaN       NaN       NaN       NaN   
Bangladesh                     NaN       NaN       NaN       NaN       NaN   
Barbados                       NaN       NaN       NaN       NaN       NaN   
Belarus                        NaN       NaN       NaN       NaN

##### Return countries that have reported numbers that are higher than average every year.

In [63]:
higher_than_average = data[filter]
always_higher_than_average = higher_than_average.dropna()  # drops rows with one or more NaN values
print(always_higher_than_average)
print(always_higher_than_average.describe())

                   1990      1991      1992      1993      1994      1995  \
country                                                                     
Kenya          120000.0  170000.0  240000.0  280000.0  280000.0  250000.0   
Malawi          79000.0   85000.0   88000.0   89000.0   88000.0   90000.0   
Nigeria         96000.0  140000.0  180000.0  230000.0  280000.0  320000.0   
South Africa    44000.0   73000.0  120000.0  190000.0  300000.0  430000.0   
Tanzania       180000.0  200000.0  200000.0  200000.0  190000.0  170000.0   
Uganda         130000.0  120000.0  110000.0  110000.0   98000.0   92000.0   
United States   88000.0   52000.0   50000.0   52000.0   49000.0   51000.0   
Zimbabwe       180000.0  200000.0  210000.0  240000.0  260000.0  250000.0   

                   1996      1997      1998      1999    ...         2002  \
country                                                  ...                
Kenya          210000.0  170000.0  150000.0  140000.0    ...     130000.0  
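
The filtering steps are easier to follow on a table small enough to check by eye. This sketch uses a made-up data frame called `scores`; the names `mask`, `above`, and `always_above` are illustrative.

```python
import pandas as pd

# Hypothetical values, for illustration only
scores = pd.DataFrame(
    {"2010": [10.0, 50.0, 90.0], "2011": [80.0, 20.0, 70.0]},
    index=["A", "B", "C"],
)

mask = scores > scores.mean()   # True where a value is above its column's mean
above = scores[mask]            # values where mask is False become NaN
print(above)

always_above = above.dropna()   # keep only rows with no NaN values
print(always_above)
```

Only row `C` is above the mean in both columns, so it is the only row left after `dropna`.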

##### Use `.iloc` to subset the data frame

In [77]:
subset = data.iloc[88:92, :5]  # .iloc selects by integer position, whereas .loc selects by row or column label
print(subset)

               1990     1991     1992     1993     1994
country                                                
Mozambique  25000.0  33000.0  43000.0  55000.0  67000.0
Myanmar     14000.0  13000.0  13000.0  16000.0  17000.0
Namibia      3800.0   5200.0   7200.0   9800.0  13000.0
Nepal         350.0    350.0    750.0   1000.0   1500.0
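
To compare the two approaches directly, the sketch below selects the same cells by position and by label from a made-up data frame called `table`:

```python
import pandas as pd

# Hypothetical values, for illustration only
table = pd.DataFrame(
    {"1990": [1, 2, 3], "1991": [4, 5, 6]},
    index=["Mozambique", "Myanmar", "Namibia"],
)

by_position = table.iloc[0:2, :1]                        # rows 0 and 1, column 0
by_label = table.loc["Mozambique":"Myanmar", ["1990"]]   # same cells, chosen by label

print(by_position)
print(by_position.equals(by_label))
```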


##### Perform calculations over columns

In [87]:
print("1990:\n", subset.loc[:, "1990"],"\n1994: \n", subset.loc[:, "1994"])
print("diff: \n", subset.loc[:, "1994"] - subset.loc[:, "1990"])

1990:
 country
Mozambique    25000.0
Myanmar       14000.0
Namibia        3800.0
Nepal           350.0
Name: 1990, dtype: float64 
1994: 
 country
Mozambique    67000.0
Myanmar       17000.0
Namibia       13000.0
Nepal          1500.0
Name: 1994, dtype: float64
diff: 
 country
Mozambique    42000.0
Myanmar        3000.0
Namibia        9200.0
Nepal          1150.0
dtype: float64


### <font color = green> Break here for 5 minutes to complete the exercises in Part 5: Pandas and Data Frames.</font>

# Conditional Statements
* An `if` statement (more properly called a conditional statement) controls whether a block of code is executed or not. 
* The structure of an `if` statement is similar to that of a `for` loop:
    * The first line opens with `if` and ends with a colon `:`
    * The body containing one or more statements is indented (by 4 spaces or a tab)

In [90]:
result = 42
if result == 42:
    print(result, "is the answer to the meaning of life and the universe")

result = 10
if result < 42:
    print(result, "is not the answer to the meaning of life")

42 is the answer to the meaning of life and the universe
10 is not the answer to the meaning of life


##### Conditionals are often used within `for` loops

In [92]:
results = [10, 20, 12, 43, 50, 42]
for result in results:
    if result > statistics.mean(results):
        print(result, "is larger than average")

43 is larger than average
50 is larger than average
42 is larger than average


##### Use `else` within a conditional to execute a block of code when an `if` condition is *not* true

In [93]:
for result in results: 
    if result > statistics.mean(results):
        print(result, "is larger than average")
    else:
        print(result, "is smaller than average")

10 is smaller than average
20 is smaller than average
12 is smaller than average
43 is larger than average
50 is larger than average
42 is larger than average


##### Use `elif` to add additional tests

In [95]:
for result in results: 
    if result > 42:
        print(result, "is too large")
    elif result < 42:
        print(result, "is too small")
    else:
        print(result, "is the answer to the meaning of life and the universe")

10 is too small
20 is too small
12 is too small
43 is too large
50 is too large
42 is the answer to the meaning of life and the universe
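
Python checks the tests in an `if`/`elif`/`else` chain in order and runs only the first branch whose condition is true. A small sketch, with arbitrary thresholds and an illustrative list called `labels`:

```python
labels = []
for result in [10, 25, 50]:
    if result > 40:
        labels.append("large")
    elif result > 20:            # checked only when result > 40 is false
        labels.append("medium")
    else:                        # runs when none of the tests above are true
        labels.append("small")
print(labels)
```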


### <font color = green> Break here for 5 minutes to complete the exercises in Part 6: Conditional Statements, and to wrap up any other exercises you would like to work on.</font>