# Worksheet 22: Introduction to Python

Jupyter notebooks are a convenient way to organize our code alongside explanatory text!

### 1. Set up

In [None]:
# IMPORTANT
# Running this chunk lets you have multiple outputs from a single chunk; run it first!
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

We will use different packages today but will load them as we learn!

### 2. Basics in Python

Let's first talk about different types of objects and assignments:

In [None]:
# Create an object and print
one = 1
one

In [None]:
# Create two objects at once: separate the names with a comma, then the values with a comma
two, four = one + one, 4*one
two
four

In [None]:
# Create a list of numbers from 1 to (not including) 7, by 2
alist = list(range(1,7,2))
alist
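
As a small sketch of how `range` reads its arguments (start, stop, step), here are three variations. The names `odds`, `one_to_six` and `from_zero` are made up for this illustration.

```python
# range(start, stop, step): stop is never included
odds = list(range(1, 7, 2))        # start at 1, stop before 7, step by 2
one_to_six = list(range(1, 7))     # the step defaults to 1
from_zero = list(range(3))         # the start defaults to 0

print(odds)         # [1, 3, 5]
print(one_to_six)   # [1, 2, 3, 4, 5, 6]
print(from_zero)    # [0, 1, 2]
```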

In [None]:
# Some basic built-in functions
len(alist)
min(alist)
max(alist)

Indexing in Python also uses `[]`, but since Python begins counting at 0, the first element is at position 0. In general, the nth element is located at index n−1.

In [None]:
alist[0]
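
To make the "nth element at index n−1" rule concrete, here is a minimal sketch that rebuilds the same list so it can run on its own. The names `first`, `third` and `last` are just for illustration.

```python
# Rebuild the list from above so this cell is self-contained
alist = list(range(1, 7, 2))     # [1, 3, 5]

first = alist[0]                 # 1st element -> index 0
third = alist[2]                 # 3rd element -> index 2
last = alist[len(alist) - 1]     # last element -> index len(alist) - 1

print(first, third, last)        # prints: 1 5 5
```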

### 3. Working with arrays in NumPy

The package `NumPy` has many built-in functions that we can use when working with data structures. When importing a package, we usually give it a short name. Then we will call functions from this package using the abbreviation:

In [None]:
# Import a new package
import numpy as np

We can create an array which is actually a vector:

In [None]:
# Create an array (vector) instead of a list
anarray = np.array(range(1,7,2)) # array is from the numpy package
anarray

We can do element-wise operations on arrays that do not work the same way with lists:

In [None]:
# Operations on elements of arrays
alist*2 # "multiplying" a list by 2 repeats the list, it does not double the elements
anarray*2 # multiplying an array by 2 doubles each element
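
The difference between the two results is easy to miss, so here is a self-contained sketch that stores each one. The list comprehension is one common way to get element-wise behavior from a plain list; the names `repeated`, `doubled` and `doubled_list` are made up for this example.

```python
import numpy as np

alist = list(range(1, 7, 2))             # [1, 3, 5]
anarray = np.array(alist)

repeated = alist * 2                     # list: the contents are repeated
doubled = anarray * 2                    # array: each element is multiplied
doubled_list = [x * 2 for x in alist]    # element-wise on a list, written out by hand

print(repeated)        # [1, 3, 5, 1, 3, 5]
print(doubled)         # [ 2  6 10]
print(doubled_list)    # [2, 6, 10]
```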

`NumPy` also has functions that are very useful to create summaries:

In [None]:
# Operations on arrays
np.sum(anarray) # sum of elements

np.mean(anarray) # mean of elements

np.std(anarray) # standard deviation of elements
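
One detail worth knowing if you are coming from R: by default `np.std` divides by n (the population formula), while `sd()` in R divides by n − 1. The sketch below shows both versions through the `ddof` argument; the names `pop_sd` and `sample_sd` are made up for this example.

```python
import numpy as np

anarray = np.array(range(1, 7, 2))       # array([1, 3, 5])

pop_sd = np.std(anarray)                 # default ddof=0: divide by n
sample_sd = np.std(anarray, ddof=1)      # ddof=1: divide by n - 1

print(pop_sd)       # square root of 8/3, about 1.633
print(sample_sd)    # 2.0
```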

#### Think! What other measure do we report along with the mean? Why?

### 4. Working with data frames in Pandas

Objects in `pandas` can be thought of as enhanced versions of `NumPy` structured arrays in which the rows and columns are identified with labels rather than integer indices.

In [None]:
# Import a new package
import pandas as pd

One very useful structure for us: data frames!

In [None]:
# Create a data frame
df = pd.DataFrame([{'var1': 1, 'var2': 2}, {'var1': 3, 'var2': 4}])
df

df.index # Look at the rows

df.columns # Look at the columns

df.var1 # Look at the first variable
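
Since rows and columns are identified by labels, here is a short sketch of label-based access on the same data frame. The names `col_attr`, `col_key` and `cell` are made up for this example.

```python
import pandas as pd

df = pd.DataFrame([{'var1': 1, 'var2': 2}, {'var1': 3, 'var2': 4}])

col_attr = df.var1           # attribute access (only works for simple column names)
col_key = df['var1']         # bracket access (works for any column name)
cell = df.loc[1, 'var2']     # row labeled 1, column labeled 'var2'

print(col_key.tolist())      # [1, 3]
print(cell)                  # 4
```

Bracket access is usually the safer habit, because attribute access fails for column names that contain spaces or clash with existing data frame methods.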

The `pandas` package contains functions for data wrangling that are similar to `dplyr` in R.

| dplyr     | pandas      |
|-----------|-------------|
| select    | filter      |
| filter    | query       |
| arrange   | sort_values |
| group_by  | groupby     |
| mutate    | assign      |
| summarize | agg         |

We'll do some data wrangling with a dataset next.

### 5. Working with existing datasets in Seaborn

Some built-in datasets are contained in the `seaborn` package, which is also a library for making visualizations in Python.

In [None]:
# Import a new package
import seaborn as sns

In [None]:
# Load dataset from library
iris = sns.load_dataset('iris')

Python has no pipe operator like R's; instead, we chain method calls with `.`:

In [None]:
# Take a look
iris.head() 

In [None]:
# An example of data wrangling with Python
(iris.filter(['species', 'petal_length', 'petal_width']) # keep some variables
.query('species == "setosa"') # keep some observations
.agg(['mean'])) # calculate the mean
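
The other verbs from the dplyr/pandas table can be chained in the same way. The sketch below uses a tiny made-up data frame (not the real iris measurements) so that it runs without downloading anything; the names `flowers`, `ratio` and `mean_ratio` are hypothetical.

```python
import pandas as pd

# Hypothetical toy data, chosen only to keep the arithmetic easy to check
flowers = pd.DataFrame({
    'species': ['a', 'a', 'b', 'b'],
    'petal_length': [1.0, 2.0, 4.0, 6.0],
    'petal_width': [0.5, 0.5, 1.0, 2.0],
})

summary = (flowers
    .assign(ratio=lambda d: d['petal_length'] / d['petal_width'])  # like mutate
    .groupby('species')                                            # like group_by
    .agg(mean_ratio=('ratio', 'mean'))                             # like summarize
    .sort_values('mean_ratio', ascending=False))                   # like arrange

print(summary)
```

The outer parentheses let the chain span several lines, one step per line, much like a piped dplyr workflow.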

### 6. Visualizing data with Matplotlib and Seaborn

Both `matplotlib` and `seaborn` contain functions to make visualizations in Python.

In [None]:
# Import a new package
import matplotlib.pyplot as plt

In [None]:
# Create a bar graph
counts = iris['species'].value_counts() # find counts
plt.bar(counts.index, counts.values)

In [None]:
# Create a histogram
plt.hist(x = iris['petal_length'], bins = [1,2,3,4,5,6,7], edgecolor="black") # define bins
plt.xlabel('Petal Length (cm)') # label x-axis
plt.title('A simple histogram') # give a title
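
To see what the `bins` argument does without drawing anything, we can count observations per bin with `np.histogram`, which is the function `plt.hist` relies on for its counts. This is only a sketch on made-up values (not the iris data); the names `values`, `counts_narrow` and `counts_wide` are hypothetical.

```python
import numpy as np

# Hypothetical measurements, chosen only to illustrate binning
values = np.array([1.2, 1.4, 1.5, 1.7, 4.1, 4.5, 4.9, 5.1, 5.8, 6.6])

# Same bin edges as the histogram above: width 1, from 1 to 7
counts_narrow, edges_narrow = np.histogram(values, bins=[1, 2, 3, 4, 5, 6, 7])

# Much wider bins over the same range: width 3
counts_wide, edges_wide = np.histogram(values, bins=[1, 4, 7])

print(counts_narrow)    # [4 0 0 3 2 1]
print(counts_wide)      # [4 6]
```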

#### Think! Why is it important to define bins for a histogram?


In [None]:
# Create a scatterplot
plt.plot(iris['petal_length'], iris['petal_width'], linestyle = 'none', marker = 'o')
plt.xlabel('Petal Length (cm)') # label x-axis
plt.ylabel('Petal Width (cm)') # label y-axis
plt.title('A simple scatterplot') # give a title

In [None]:
# Seaborn is a more versatile data visualization package
sns.scatterplot(x="petal_length", y="petal_width", data=iris, hue='species')
plt.xlabel('Petal Length (cm)') # label x-axis
plt.ylabel('Petal Width (cm)') # label y-axis
plt.title('A more elaborate scatterplot') # give a title

In [None]:
# Join different plots into one!
sns.jointplot(x="petal_length", y="petal_width", data=iris, kind='reg')

In [None]:
# Look at the strength of the relationships between the variables with a correlation matrix
sns.heatmap(iris.drop(columns='species').corr(), annot=True) # drop the non-numeric column before computing correlations

A very useful website to decide on appropriate visualizations and how to make them in Python: https://www.python-graph-gallery.com/

### 7. Making predictions/classification with Scikit-learn

`Scikit-learn` is an open source machine learning library that supports supervised and unsupervised learning. It provides various tools for regression, classification, cross-validation, and model evaluation (and many other common tasks in Machine Learning).

#### 7.a. Predict a numeric variable

We will use linear regression to predict the petal length of an iris flower from its other measurements.

In [None]:
# Import new packages
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

In Python, we usually separate the independent variables/predictors from the outcome/response variable.

In [None]:
# Separate predictors and response
X_petal = iris.drop(columns=['petal_length','species']) # drop outcome and species
y_petal = iris['petal_length']

#### Think! Why do we usually build a model by splitting the data into train and test sets?


In [None]:
# Split data into train and test sets
X_train_petal, X_test_petal, y_train_petal, y_test_petal = train_test_split(X_petal, y_petal, test_size=0.3)
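
The split above is random, so the rows in each set change every time the cell is run. One way to make it reproducible is the `random_state` argument of `train_test_split`. The sketch below uses small made-up arrays rather than the iris data, and the value 42 is an arbitrary choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 observations with 2 predictors each
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Two splits with the same random_state select exactly the same rows
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
X_train2, X_test2, y_train2, y_test2 = train_test_split(
    X, y, test_size=0.3, random_state=42)

print(X_train.shape, X_test.shape)
print(np.array_equal(y_test, y_test2))    # True
```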

Now we can define our model: the algorithm to use, and the predictors and response to fit it on.

In [None]:
# Perform linear regression
petal_length_model = LinearRegression()
petal_length_model.fit(X_train_petal, y_train_petal)

Use the model to make predictions:

In [None]:
# Predict on test set
y_pred_petal = petal_length_model.predict(X_test_petal)
print(y_pred_petal)

Finally, we need to evaluate the performance of our model.

#### Think! What measure of performance do we typically use to evaluate a linear regression model?
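
One possible way to finish this step is sketched below with `mean_squared_error`, which was imported earlier. To keep the cell self-contained, it assumes scikit-learn's bundled copy of iris (`load_iris`) instead of the seaborn data frame, and it fixes `random_state=0` so the numbers are repeatable; both choices are assumptions, not part of the worksheet's workflow.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Bundled iris data; columns are sepal length, sepal width,
# petal length, petal width (all in cm)
X_all, _ = load_iris(return_X_y=True)
y = X_all[:, 2]               # petal length is the response
X = X_all[:, [0, 1, 3]]       # the other three measurements are the predictors

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = LinearRegression().fit(X_train, y_train)   # fit on the training set only
y_pred = model.predict(X_test)                     # predict on unseen data

mse = mean_squared_error(y_test, y_pred)   # average squared prediction error
rmse = mse ** 0.5                          # back in the units of the response (cm)
print(f"MSE: {mse:.3f}, RMSE: {rmse:.3f} cm")
```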


#### 7.b. Classify a categorical variable

The iris flowers in the dataset come from 3 different species. Can we predict which species a flower belongs to if we know its petal and sepal length/width? Let's try a K-Nearest-Neighbors model:

In [None]:
# Import new packages
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

As usual in Python, separate the independent variables/predictors from the outcome/response variable then split the data into train and test sets:

In [None]:
# Separate predictors and response
X_species = iris.drop(columns=['species'])
y_species = iris['species']

# Split data into train and test sets
X_train_species, X_test_species, y_train_species, y_test_species = train_test_split(X_species, y_species, test_size=0.3)

In [None]:
# Fit a kNN model with 5 nearest neighbors
species_model = KNeighborsClassifier(n_neighbors = 5)
species_model.fit(X_train_species, y_train_species) # fit on the training set only, so the test set stays unseen

Use the model to make predictions:

In [None]:
# Predict on test set
y_pred_species = species_model.predict(X_test_species)
print(y_pred_species)

#### Think! What measure of performance do we typically use to evaluate a classification model?
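
As one possible way to finish this step, the sketch below uses `accuracy_score`, which was imported earlier, and checks it against a by-hand calculation. It assumes scikit-learn's bundled copy of iris (`load_iris`) and a fixed `random_state=0` so that it runs on its own; the names `knn`, `accuracy` and `manual` are made up for this example.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Bundled iris data: 4 measurements per flower, species coded as 0, 1, 2
X, y = load_iris(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)           # fit on the training set only
y_pred = knn.predict(X_test)        # classify the unseen flowers

accuracy = accuracy_score(y_test, y_pred)   # proportion classified correctly
manual = (y_pred == y_test).mean()          # the same proportion, computed by hand
print(f"Accuracy: {accuracy:.3f}")
```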
