# Advanced Python: Regressions


### Irises and linear regression

<center>
<img src="../pictures/irises_vg.jpg" style="width:372px;height:284px;">
<br>
<i>Irises (1889, Vincent van Gogh)</i>
</center>

Regression (also known as line or curve fitting) is one of the most powerful and easiest tools you can apply to a dataset. The general goal of regression is to find an appropriate function $f$ that predicts the value of one or more continuous variables $y$ given a $k$-dimensional vector of input variables $x$. The simplest form of $f$ is a linear function, which is the focus of this notebook.
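
As a rough sketch of what "linear" means here, a linear $f$ for a single output can be written as below. The symbols $w_0$ (intercept) and $w_1, \dots, w_k$ (one weight per input feature) are notation assumed for this note, not something defined elsewhere in the notebook.

$$
f(x) = w_0 + \sum_{j=1}^{k} w_j x_j
$$

With a single input feature ($k = 1$) this reduces to the familiar straight line $f(x) = w_0 + w_1 x$.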

Some other useful packages to practice: `pandas`, `seaborn`

Reference:
1. [Linear Regression using Iris Dataset — 'Hello, World!' of Machine Learning](https://medium.com/analytics-vidhya/linear-regression-using-iris-dataset-hello-world-of-machine-learning-b0feecac9cc1)
2. Pattern Recognition and Machine Learning, Chapter 3: Linear Models for Regression (Bishop, 2006)
3. [The Iris Dataset](https://scikit-learn.org/1.4/auto_examples/datasets/plot_iris_dataset.html)

In [None]:
# import packages
import pandas as pd

In [2]:
# import dataset from sklearn
from sklearn.datasets import load_iris
iris = load_iris()

# create a dataframe using pandas
iris_df = pd.DataFrame(data=iris.data, columns=iris.feature_names)
iris_df['species'] = iris.target
species_names = iris.target_names

### Warmup exercises:
1. What features of the irises are included in the dataset? How do you find out?
   1. What are the mean, min, median, max and quantiles of the features? `pandas` has a built-in method to take care of all these.
2. What is in the "species" column?
   > *Hint*: you can find out by calling `.unique()` on the "species" column of the DataFrame.

   > *Hint*: The numbers in the "species" column are actually the indices of the species in the `species_names` array.
   1. Replace the values in the species column with the actual species names.
3. What's the sepal length of the 10th sample in the dataset?
4. What's the largest sepal width in the dataset? Which sample does it belong to?
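
The `pandas` calls that these warmup questions point toward can be tried out first on a toy table. The sketch below is only an illustration: `toy_df`, its columns and `kind_names` are made-up names and values, not the iris data, so the exercise itself is still left to you.

```python
import pandas as pd

# Toy data (made up for illustration, not the iris measurements).
toy_df = pd.DataFrame({
    "height_cm": [10.0, 12.5, 9.0, 14.0],
    "kind": [0, 1, 0, 2],
})
kind_names = ["small", "medium", "large"]  # hypothetical lookup from code to name

summary = toy_df.describe()      # count, mean, std, min, quartiles, max per numeric column
codes = toy_df["kind"].unique()  # distinct values found in a column

# Replace the integer codes with their names.
toy_df["kind"] = toy_df["kind"].map(lambda i: kind_names[i])

third_height = toy_df.loc[2, "height_cm"]   # label-based lookup of one cell
tallest = toy_df["height_cm"].max()         # largest value in the column
tallest_row = toy_df["height_cm"].idxmax()  # row label where that maximum occurs
```

Note that the default row labels start at 0, so the n-th sample sits at label n - 1.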

In [None]:
# your code here

### Exercises 1:
1. Recreate the following visualization:
<center>
<img src="../pictures/sepalLength_sepalWidth.png" style="width:500px;height:400px;">
<br>
<i>Fig 1: Sepal Length vs Sepal Width</i>
</center>

> *Hint*: Use `plt.scatter` (or seaborn's `scatterplot`). The exact colors don't matter, but show the distribution of sepal length vs. sepal width and add an appropriate legend.
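
One possible pattern for a scatter plot with a per-group legend is sketched below on made-up data. The table `toy_df` and its columns `x`, `y` and `group` are hypothetical placeholders, so adapting it to the iris columns is still part of the exercise.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy data (made up for illustration).
toy_df = pd.DataFrame({
    "x": [1.0, 1.5, 3.0, 3.5, 5.0, 5.5],
    "y": [2.0, 2.2, 1.0, 1.3, 3.0, 3.1],
    "group": ["a", "a", "b", "b", "c", "c"],
})

fig, ax = plt.subplots()
# One scatter call per group, so each group gets its own color and legend entry.
for name, sub in toy_df.groupby("group"):
    ax.scatter(sub["x"], sub["y"], label=name)
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.legend(title="group")
```

Calling `scatter` once per group is what lets `legend` pick up one labelled entry per category.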

In [None]:
# your code here

### Exercises 2:
Using `sns.pairplot` (`sns` is the usual alias for `seaborn`), you can visualize the pairwise relationships in the dataset. Call `pairplot` on the iris dataset and use `species` as `hue`.

1. Are all the plots useful? Are there any redundant plots?
2. Suppose we want to **predict** the value of one feature based on another. Which pairs of features would you choose and why?

In [None]:
# your code here

### Exercises 3: Linear regression
Of course, we can use regression with a linear function to predict the value of one feature based on another. Sometimes, though, the predictions don't work very well or don't make sense (e.g. predicting shark attacks at beaches based on ice cream sales). Here, we will use "petal length" to predict "petal width".

> We can use linear regression to predict values; conversely, a linear regression that fits two features well is evidence that those features are correlated.

1. `sklearn` (i.e. `scikit-learn`) has a built-in linear regression model. Use it to predict the petal width based on the petal length.
2. What are the coefficient and intercept of the fitted linear function?
3. Plot the predicted values against the actual values. Does the linear function fit well?
4. What's a quantitative measure of how well the linear function fits? **Actually, how did we get this line in the first place?**
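
As a hedged sketch of the ideas behind these questions, the block below fits scikit-learn's `LinearRegression` (ordinary least squares) to synthetic data, then recomputes the same line from the closed-form least-squares formulas for a single feature. It also computes $R^2$, one common quantitative measure of fit. The data (`x`, `y`) are generated here purely for illustration and are not the iris measurements.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for illustration): y = 2x + 1 plus small noise.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=x.size)

# scikit-learn expects a 2-D feature matrix of shape (n_samples, n_features).
X = x.reshape(-1, 1)
model = LinearRegression().fit(X, y)
slope, intercept = model.coef_[0], model.intercept_
r2 = model.score(X, y)  # coefficient of determination R^2

# The same line from the closed-form least-squares solution for one feature:
# slope = sum((x - mean_x) * (y - mean_y)) / sum((x - mean_x)^2)
# intercept = mean_y - slope * mean_x
slope_ols = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
intercept_ols = y.mean() - slope_ols * x.mean()

# R^2 by hand: 1 - (residual sum of squares) / (total sum of squares).
y_hat = model.predict(X)
r2_manual = 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
```

The least-squares line is the one that minimizes the sum of squared vertical distances between the data points and the line, which is why the two routes give the same slope and intercept.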

In [None]:
# your code here

### How can regressions be used in neuroscience/psychology?