# Python Introduction

Python is a powerful tool for automating tasks that would be time-consuming or impossible to do by hand or with other conventional tools. Here, we'll go over basic ways to use Python for data analysis to introduce you to a slice of its potential.

## Markdown

We are working in a **Jupyter Notebook**. This lets us combine descriptive text written in Markdown with analysis run in Python code. In Markdown, you can highlight text in **bold** or *italics*.

For more on Markdown, check out [this cheatsheet](https://www.markdownguide.org/cheat-sheet/).


## Code Block

Jupyter notebooks allow code to be run in blocks (also called chunks). Lines are run top to bottom. You can always edit a code block and re-run it to make changes.

In [None]:
#  INSTRUCTIONS:   Write a message in the quotes.
#  Press Shift+Enter to run the cell.
message = ''

print(message)

## Python as a calculator

One of the simplest yet most useful things Python can do is perform math calculations (addition, subtraction, multiplication, etc.). We show how to do basic math below.

You'll see the symbol `#` used often. It marks a comment, which is used to write descriptions. Any characters following `#` on a line are not run or executed.

In [None]:
3 + 4 * 5  # addition and multiplication 

In [None]:
12 / (6 - 4) # division and subtraction

In [None]:
2 ** 3 # exponentiation
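The order in which Python applies these operators affects the result. As a small sketch (the variable names are chosen only for this illustration), the cell below stores a few results so they can be printed side by side:

```python
# Multiplication is evaluated before addition, so this is 3 + 20.
without_parens = 3 + 4 * 5
# Parentheses force the addition to happen first, so this is 7 * 5.
with_parens = (3 + 4) * 5
# Division with / always gives a float, even when the result is a whole number.
quotient = 12 / (6 - 4)

print(without_parens)  # 23
print(with_parens)     # 35
print(quotient)        # 6.0
```

When in doubt about the order of operations, adding parentheses makes the intent explicit.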

#### Question 1: Math
Calculate the following value in Python: $ \frac{25}{(35 - 3)^3} $

In [None]:
### Put your code below here:


## Assigning Variables
A foundational tool in Python is assigning values to variables. We do this with the `=` operator.

In [None]:
x = 50 # x is 50

This sets the variable `x` to be 50, an **integer**, or `int`. This value of x is now stored in our notebook, and we can access this value in other cells until the notebook is reset. For instance, subtracting 20 from `x` prints out a value of 30.

In [None]:
# What if I use x again in a different cell?
x - 20

**Variables persist between cells once they have been run (executed).**

If we ever want to check the value of any variable, we can use the built-in `print()` function to display the value.

In [None]:
y = 35
print(y)

We can also assign the value of one variable to another variable. If we execute `x = y`, Python looks up the current value of `y` and assigns it to `x`.

*Note: `y` will be unaffected by this assignment. `x = y` should be interpreted as "let x take the current value of y".*

In [None]:
x = y
print(x)
print(y)

If we change `y` to be a different value, `x` will be unaffected.

In [None]:
y = 3.8
print(x) # x keeps its earlier value; it does not follow y
print(y)

**Basic variables only change value when something is assigned to them.**
They are **not** like spreadsheets where a cell can depend on another and update automatically.
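To make the contrast with spreadsheets concrete, here is a minimal sketch using two made-up variables, `unit_price` and `total`:

```python
unit_price = 10
total = unit_price * 3  # total is calculated once, using unit_price = 10

unit_price = 20         # reassigning unit_price does not recalculate total
print(total)            # 30

total = unit_price * 3  # total only changes when we assign to it again
print(total)            # 60
```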

Variables can be integers, floats (numbers with decimals), and strings (sequences of characters). Strings must be enclosed in double quotes or single quotes.

In [None]:
a = 52 # integer
b = 3.14 # float
c = 'Inigo Montoya' # string
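If you are unsure what kind of value a variable holds, the built-in `type()` function reports it. A short sketch, using illustrative variable names that are not part of the lesson's data:

```python
count = 52                    # integer
ratio = 3.14                  # float
name = 'Inigo Montoya'        # string written with single quotes
same_name = "Inigo Montoya"   # the same string written with double quotes

print(type(count))        # <class 'int'>
print(type(ratio))        # <class 'float'>
print(type(name))         # <class 'str'>
print(name == same_name)  # True: the choice of quote style does not change the string
```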

#### Question 2.  Swapping Values
Given the code below, what is the value of the variable `swap` by the end of the block?

In [None]:
x = 1.0
y = 3.0
swap = x
x = y
y = swap 

**What's in a name?** _Variable name conventions_
- Use only letters, digits, and underscores (`_`)
- Start with a letter (typically lower case)
- Variable names are case sensitive
- Use meaningful names!

**Variables must be created before they are used.** Otherwise, Python will throw an error.
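The error in question is a `NameError`. The sketch below triggers one on purpose with a hypothetical name, `undefined_total`, that was never assigned. The `try`/`except` wrapper is not covered in this lesson; it is used here only so the cell finishes running and the message can be printed.

```python
try:
    print(undefined_total)  # hypothetical name that was never assigned
except NameError as error:
    message_text = str(error)
    print("Python raised a NameError:", message_text)
```

Without the `try`/`except`, the notebook would stop at that line and display the error instead.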

## Read in data

### Import libraries

To analyze data, we first read it into our environment. This is done using **external libraries**, collections of functions and useful tools that are not in Python by default. We can easily include these using `import`. The library we need first is called `pandas`, which is used for importing and interacting with tabular data. 

In [None]:
import pandas as pd  

Using `as pd` lets us refer to `pandas` by typing only `pd`.

With `pandas` imported, we can read in .csv files with the `pandas` function `read_csv()`.

In that function, we can specify the file we want to use with a URL or with the path to a local file as a string.

In [None]:
df = pd.read_csv("https://raw.githubusercontent.com/DeisData/python/master/data/gapminder.csv") # read in data
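To see what `read_csv()` does without depending on a download, here is a rough sketch that reads a few lines of made-up CSV text. The country names and numbers are invented, and `io.StringIO` is used only as a stand-in for a real file; the result is stored in `toy_df` so it does not overwrite `df`.

```python
import io
import pandas as pd

# Made-up CSV text standing in for a small file on disk.
csv_text = """country,year,population
Freedonia,2000,100
Freedonia,2010,120
Sylvania,2000,300
"""

# io.StringIO lets read_csv treat the string as if it were an open file.
toy_df = pd.read_csv(io.StringIO(csv_text))
print(toy_df)
```

The first line of the text becomes the column names, and each remaining line becomes one row.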

Our data is now saved as a data frame in Python as the variable `df`. With the data now in the environment, we can take a look at the first few rows with `df.head()`.

In [None]:
df.head()

We can see that this data frame has several different columns, with information about countries and demography.

## Summarize data frame

It is important to understand the data we are working with before we begin analysis. First, let's look at the dimensions of the data frame using `df.shape`. It gives the number of rows followed by the number of columns.

In [None]:
df.shape


This shows that our data frame has 14740 rows by 9 columns.

We can also use `df.columns` to display the column names.

In [None]:
df.columns
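Note that `shape` and `columns` are attributes, so they are written without parentheses. A minimal sketch on a made-up data frame (`toy_df` and its values are invented for illustration):

```python
import pandas as pd

# Made-up data, kept separate from the lesson's df
toy_df = pd.DataFrame({
    "country": ["Freedonia", "Freedonia", "Sylvania"],
    "year": [2000, 2010, 2000],
    "population": [100, 120, 300],
})

n_rows, n_cols = toy_df.shape  # shape is a pair: (rows, columns)
print(n_rows, n_cols)          # 3 3
print(list(toy_df.columns))    # ['country', 'year', 'population']
```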

### Categorical variables
Next, let's summarize the categorical, non-numerical variables. For instance, we can identify how many unique regions we have in the data set.

First, to select a column, we use the notation `df['COLUMN_NAME']`.

In [None]:
df['region']

To identify unique entries in this column, we can use the column's `.unique()` method.

In [None]:
df['region'].unique()

The `country` column has many unique values, so instead of listing them we'll use the `.nunique()` method to count how many unique countries we have.

In [None]:
df['country'].nunique() # counts the distinct values in the column
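The relationship between listing and counting unique values can be sketched on a small made-up column. The data below is invented, and `.value_counts()` is an extra pandas method not used elsewhere in this lesson:

```python
import pandas as pd

# Made-up data, kept separate from the lesson's df
toy_df = pd.DataFrame({
    "country": ["Freedonia", "Freedonia", "Sylvania", "Osterlich"],
})

distinct = toy_df["country"].unique()      # array of the distinct values
n_distinct = toy_df["country"].nunique()   # how many distinct values there are
counts = toy_df["country"].value_counts()  # number of rows for each distinct value

print(distinct)
print(n_distinct)  # 3
print(counts)
```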

### Numerical variables

Numerical columns can be summarized in several ways. Let's find the mean first.

To make things simpler, we'll just do calculations on the `population`, `life_expectancy`, and `babies_per_woman` columns. We can put those names in a `list` and then specify that list for the columns.

In [None]:
num_cols = [ 'population', 'life_expectancy', 'babies_per_woman' ] # numerical columns

df[num_cols]

With this set of columns, we can run `.mean()` to find the mean of each column.

In [None]:
df[num_cols].mean() # returns the mean of each column

If we want a larger variety of summary statistics, we can use the `.describe()` method.

In [None]:
df[num_cols].describe()
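The output of `.describe()` is itself a data frame, with one row per statistic and one column per numerical variable. A small sketch with invented numbers:

```python
import pandas as pd

# Made-up data, kept separate from the lesson's df
toy_df = pd.DataFrame({
    "population": [100, 120, 300],
    "life_expectancy": [70.0, 72.0, 65.0],
})

means = toy_df.mean()        # one mean per column
summary = toy_df.describe()  # count, mean, std, min, quartiles, and max per column

print(means["life_expectancy"])  # 69.0
print(list(summary.index))       # the names of the statistics that describe() reports
```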

We can also summarize subgroups of our data with the method `.groupby()`, which splits the rows by the values in a column.

In [None]:
grouped_data = df.groupby('region')
grouped_data['population'].describe()
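As a rough sketch of what grouping does, the made-up data below has two regions. Calling `.mean()` on the grouped column returns one value per region:

```python
import pandas as pd

# Made-up data, kept separate from the lesson's df
toy_df = pd.DataFrame({
    "region": ["North", "North", "South"],
    "population": [100, 120, 300],
})

grouped = toy_df.groupby("region")             # rows are split by region
mean_by_region = grouped["population"].mean()  # one mean per region

print(mean_by_region)
```

The result is indexed by region name, so `mean_by_region["North"]` gives the mean of the two North rows.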

### Accessing rows and specific entries

You can also access a specific row using `df.loc[ROW, :]`, where `ROW` is the row's index label. The colon selects all columns for that row.

In [None]:
df.loc[0, :] # the first row

We can use `.loc` to find the value of specific entries, as well.

In [None]:
df.loc[0, 'country']
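`.loc` looks rows up by their label, which matches the row number only when the data frame uses the default labels 0, 1, 2, and so on. The sketch below uses made-up data with labels 10, 20, 30 to show the difference; `.iloc`, which selects by position, is not otherwise covered in this lesson.

```python
import pandas as pd

# Made-up rows with labels 10, 20, 30 instead of the default 0, 1, 2
toy_df = pd.DataFrame(
    {"country": ["Freedonia", "Sylvania", "Osterlich"],
     "population": [100, 300, 50]},
    index=[10, 20, 30],
)

by_label = toy_df.loc[10, "country"]     # the row whose label is 10
by_position = toy_df.iloc[0]["country"]  # the first row, whatever its label
print(by_label, by_position)

# No row is labeled 0 here, so toy_df.loc[0, :] would raise a KeyError.
print(0 in toy_df.index)  # False
```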

#### Question 3. Data Summary
Print out the summary statistics for columns `age5_surviving`, `gdp_per_day`, and `gdp_per_capita`.

In [None]:
### your code below:


## Manipulate data 

### Subset by row

Sometimes, we want to create a subset of the main data frame based on certain conditions. We do this by putting a condition on the rows inside square brackets after `df`.

Below, we take all of the rows where `babies_per_woman` is greater than or equal to 4 with `df['babies_per_woman'] >= 4` and assign the result to a new data frame.

To check that this was done correctly, we can look at the minimum of the `babies_per_woman` column in the new data frame with  `.min()`.

In [None]:
# take all rows where babies_per_woman is greater than or equal to 4 and make a new data frame
df_4 = df[df['babies_per_woman'] >= 4]
df_4['babies_per_woman'].min()

We can also subset with categorical variables. Here, we take all rows where the country is Hungary. 

In [None]:
df_hungary = df[df['country'] == 'Hungary']
pd.unique(df_hungary['country'])
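It can help to look at the condition on its own. A comparison on a column produces one `True` or `False` per row, and the square brackets keep the rows marked `True`. The sketch below uses made-up data; combining two conditions with `&` is an extra step not used elsewhere in this lesson.

```python
import pandas as pd

# Made-up data, kept separate from the lesson's df
toy_df = pd.DataFrame({
    "country": ["Freedonia", "Freedonia", "Sylvania"],
    "babies_per_woman": [5.1, 3.2, 4.0],
})

mask = toy_df["babies_per_woman"] >= 4  # one True/False value per row
print(mask.tolist())                    # [True, False, True]

high_fertility = toy_df[mask]           # keeps only the rows where mask is True
print(high_fertility["babies_per_woman"].min())  # 4.0

# Two conditions are combined with &, each wrapped in its own parentheses.
both = toy_df[(toy_df["babies_per_woman"] >= 4) & (toy_df["country"] == "Freedonia")]
print(len(both))  # 1
```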

### Math

If we multiply a column by a single number, each value in the column will be multiplied by that number.

In [None]:
df['babies_per_woman'] * 1000

We can also do math between columns, since they have the same length. Elements of the same row are added, subtracted, multiplied, or divided.

Here, we subtract the `life_expectancy` column from the `age5_surviving` column and assign it to a new column called `life_difference`. 

In [None]:
df['life_difference'] = df['age5_surviving'] - df['life_expectancy'] 
print(df['life_difference'])

This new column is now reflected in the data frame. 

In [None]:
print(df.columns)
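The same row-by-row behavior can be checked by hand on a tiny made-up data frame. The column `total_gdp` below is invented for this sketch:

```python
import pandas as pd

# Made-up data, kept separate from the lesson's df
toy_df = pd.DataFrame({
    "gdp_per_capita": [1000.0, 2500.0],
    "population": [10, 4],
})

# Each row's gdp_per_capita is multiplied by the same row's population.
toy_df["total_gdp"] = toy_df["gdp_per_capita"] * toy_df["population"]

print(toy_df["total_gdp"].tolist())  # [10000.0, 10000.0]
print(list(toy_df.columns))          # the new column appears at the end
```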

#### Question 4. Subsetting 

Create a subset of data from Lithuania. 

Within that subset, calculate the mean GDP per 1000 people across entries.

*Hint: Multiply per capita GDP by 1000.*

In [None]:
### your code below




## Simple plotting

Now that we understand how to work with data frames a bit more, we can start to make some basic data visualizations. To do this we will use the `matplotlib` library, specifically a set of functions in a module called `pyplot`. 

In [None]:
import matplotlib.pyplot as plt

First, let's make a histogram showing the overall distribution of life expectancy. 

To do this, we initialize a blank figure and set of axes with `plt.subplots()`. 

We then directly add the histogram to the axes with `ax.hist()`, being sure to specify the life expectancy column. 

Finally, we can display the figure with `plt.show()`

In [None]:
figure, ax = plt.subplots() # create blank figure and axes
ax.hist(df['life_expectancy']) # add histogram to axes
plt.show() # display figure

We also have many customization options. For the histogram itself, we can specify the number of bins, the color of the bins, and the color of the bin edges within `hist()`.

We can also specify axis labels with `ax.set_xlabel()` and `ax.set_ylabel()`. The plot title is set with `ax.set_title()`.



In [None]:
figure, ax = plt.subplots()
ax.hist(df['life_expectancy'], bins=30, color="grey", edgecolor='black') # specify bins, color, and edge color
ax.set_xlabel('Life Expectancy') # x axis label
ax.set_ylabel('Count') # y axis label
ax.set_title('Distribution of Life Expectancy') # add title
plt.show()
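To see how the `bins` argument divides the data, it can help to capture what `ax.hist()` returns: the count in each bin, the bin edges, and the drawn bars. The sketch below uses ten made-up values rather than the lesson's data. It leaves out `plt.show()`, since a notebook displays the figure at the end of the cell; add it when running outside a notebook.

```python
import matplotlib.pyplot as plt

# Made-up life expectancy values, for illustration only
values = [52, 55, 61, 64, 66, 70, 71, 73, 78, 80]

figure, ax = plt.subplots()
counts, bin_edges, patches = ax.hist(values, bins=3, color="grey", edgecolor="black")
ax.set_xlabel("Life Expectancy")
ax.set_ylabel("Count")

print(counts)     # number of values that fall in each of the 3 bins
print(bin_edges)  # 3 bins need 4 edges, running from the smallest to the largest value
```

The counts always add up to the number of values plotted, which is a quick way to check that no data was left out.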

There are many more axis and plot customizations you can do. Be sure to check out [the `matplotlib` documentation](https://matplotlib.org/).

### Line Plot

Line plots are another simple visualization we can make through `matplotlib`.

Let's make a plot of life expectancy in Jamaica over time. First, we need to subset the data frame to only include data from Jamaica.

Then, we make a plot just as we did before, but instead of using `ax.hist()`, we use `ax.plot(x, y)`, putting the year first to specify the x axis, followed by life expectancy for the y. 

In [None]:
# subset data
df_jm = df[df['country']=='Jamaica']
# create plot
figure, ax = plt.subplots()
ax.plot(df_jm['year'], df_jm['life_expectancy'], color='#333') # a dark charcoal
ax.set_xlabel('Year') # year is on the x axis
ax.set_ylabel('Life expectancy') # life expectancy is on the y axis
ax.set_title('Life expectancy over time in Jamaica')
plt.show()

#### Question 5: Putting it together

Make a line plot of the average births per woman in Greece by year, only including years after 1900. Label the axes and add a title.

In [None]:
### your code here:

## Resources

- [Official Python documentation](https://www.python.org/doc/)
- Take time with tutorials at [Kaggle.com](https://www.kaggle.com/learn)
- [Stack Overflow](https://stackoverflow.com/)
- [Getting started with Pandas](https://pandas.pydata.org/docs/getting_started/index.html#getting-started)
- [Pandas cheatsheet](https://pandas.pydata.org/Pandas_Cheat_Sheet.pdf)
- Data Visualization: [Python Graph Gallery](https://www.python-graph-gallery.com/)
- Other visualization libraries: [Seaborn](https://seaborn.pydata.org/tutorial.html), [Plotly](https://plotly.com/python/)
- Install Python: [Anaconda](https://docs.anaconda.com/anaconda/install/)

This lesson is adapted from 
<a href='http://swcarpentry.github.io/python-novice-gapminder/design/'>Software Carpentry.</a>

## Contact
Ford Fishman<br>
Data Analysis Specialist for Science<br>
Brandeis Library<br>
[fordfishman@brandeis.edu](mailto:fordfishman@brandeis.edu)<br>
[dataservices@brandeis.edu](mailto:dataservices@brandeis.edu)<br>
[Set up an appointment](https://calendar.library.brandeis.edu/appointments/fordfishman)