### Simple Linear Regression using numpy (with pandas and pyplot)

In this notebook, we'll take a closer look at the relationships between the following pairs of variables:
- temperature (y) versus latitude (x)
- temperature (y) versus longitude (x)

![title](images/latitude_longitude.jpg)

In [None]:
import csv
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [None]:
# For compatibility across multiple platforms
import os
IB = os.environ.get('INSTABASE_URI',None) is not None
open = ib.open if IB else open

### Reading CSV file into dataframe
Let's begin by reading Cities.csv into a dataframe.

In [None]:
f = open('datasets/Cities.csv','r')
cities = pd.read_csv(f)

### Plotting a line
Suppose we want to plot a line that passes through points (1,2) and (3,7). 

The way to do this would be as follows:

In [None]:
plt.plot([1,3], [2,7], color='green')
plt.show()

### Temperature versus latitude scatterplot
A <b>scatterplot</b> is a type of plot that displays the values of two variables, one per axis, for each item in a set of data.

In the following plot, each point represents a <b>city</b> with its corresponding <b>latitude</b> (x) and <b>temperature</b> (y) values.

In [None]:
cities.plot.scatter(x='latitude', y='temperature')
plt.show()

### Add linear regression
We want to calculate a simple <b>linear regression line</b> that <b>passes through the data</b> such that the sum of the squared vertical distances from the data points to the line is minimized (a least-squares fit).

Luckily, the NumPy package comes with a function called <b>polyfit()</b> that calculates this fit for us. Passing a degree of 1 fits a straight line and returns its slope and intercept.

In [None]:
cities.plot.scatter(x='latitude', y='temperature')
a,b = np.polyfit(cities.latitude, cities.temperature, 1) # Regression line is y = ax + b
x1 = min(cities.latitude)
x2 = max(cities.latitude)
plt.plot([x1,x2], [a*x1 + b, a*x2 + b], color='red')
plt.xlim(x1,x2)
plt.show()

Here, we see that temperature appears to be <b>negatively correlated</b> with latitude. 
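To make the fit less of a black box, here is a minimal sketch of what a degree-1 `polyfit()` computes, using the standard closed-form least-squares formulas for slope and intercept. The latitude and temperature values are made up for illustration and are not taken from Cities.csv.

```python
import numpy as np

# Hypothetical data for illustration only (not from Cities.csv)
x = np.array([36.0, 41.0, 45.0, 50.0, 55.0, 60.0])   # latitude
y = np.array([17.5, 14.0, 12.0, 9.5, 7.0, 4.0])      # temperature

# Closed-form least-squares estimates for the line y = a*x + b
x_mean, y_mean = x.mean(), y.mean()
a_manual = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
b_manual = y_mean - a_manual * x_mean

# The same line as computed by polyfit with degree 1
a_fit, b_fit = np.polyfit(x, y, 1)

print('manual :', a_manual, b_manual)
print('polyfit:', a_fit, b_fit)
```

Both approaches should agree up to floating-point rounding, and the slope is negative because the made-up temperatures fall as latitude rises.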

### Correlation coefficients (r values)
Recall that the value of <b>r</b> is between -1 and 1. 
- 1: maximum positive correlation 
- 0: no correlation
- -1: maximum negative correlation

To get the correlation coefficient between two variables, we can use NumPy's <b>corrcoef()</b> function. It returns a 2x2 matrix, and the off-diagonal entry [1,0] is the r value for the pair.

In [None]:
cc = np.corrcoef(cities.latitude, cities.temperature)[1,0]
print('Correlation coefficient for temperature versus latitude:', cc)
cc = np.corrcoef(cities.longitude, cities.temperature)[1,0]
print('Correlation coefficient for temperature versus longitude:', cc)

Thus, we confirm that temperature is <b>negatively correlated</b> with latitude, as its correlation coefficient is closer to -1.

Meanwhile, temperature does not seem to be correlated with longitude. 
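As a rough check of what `corrcoef()` returns, the sketch below computes Pearson's r directly from its definition and compares it with the `[1,0]` entry of the matrix. The data values are made up for illustration and are not taken from Cities.csv.

```python
import numpy as np

# Hypothetical data for illustration only (not from Cities.csv)
x = np.array([36.0, 41.0, 45.0, 50.0, 55.0, 60.0])   # latitude
y = np.array([17.5, 14.0, 12.0, 9.5, 7.0, 4.0])      # temperature

# Pearson's r: covariance of x and y divided by the product of their spreads
dx = x - x.mean()
dy = y - y.mean()
r_manual = np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))

# corrcoef returns a symmetric 2x2 matrix with 1s on the diagonal;
# either off-diagonal entry is the r value for (x, y)
matrix = np.corrcoef(x, y)
r_numpy = matrix[1, 0]

print('manual r:', r_manual)
print('numpy r :', r_numpy)
```

Because the matrix is symmetric, `[0,1]` would give the same value as `[1,0]`.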

### Linear regression for interactive temperature predictor
We'll now compute a latitude-temperature regression using only cities in Norway, France, and Turkey as training data, then use it to predict the temperature of any city in the dataset.

In [None]:
train = cities[(cities.country=='Norway') | (cities.country=='France') | (cities.country=='Turkey')]
# Compute and show regression
plt.scatter(train.latitude, train.temperature)
a,b = np.polyfit(train.latitude, train.temperature, 1)
x1 = min(train.latitude)
x2 = max(train.latitude)
plt.plot([x1,x2], [a*x1 + b, a*x2 + b], color='red')
plt.xlim(x1,x2)
plt.show()
# Loop asking user for city name, compute predicted + actual temperature
while True:
    name = input('Enter city name (or "quit" to quit): ')
    if name == 'quit': break
    city = cities[cities.city == name]
    if len(city) == 0:
        print('City not in dataset')
    else:
        # Use .iloc[0] to pull the value out of the one-row dataframe
        # (calling float() directly on a Series is deprecated in recent pandas)
        print('Predicted temperature:', a * float(city.latitude.iloc[0]) + b)
        print('Actual temperature:', float(city.temperature.iloc[0]))
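The prediction step inside the loop can also be tried without typing input. The sketch below is a non-interactive version under stated assumptions: the helper names `fit_line` and `predict` are hypothetical, and the training values are made up (and exactly linear, so the result is easy to verify by hand) rather than taken from Cities.csv.

```python
import numpy as np

def fit_line(x, y):
    """Return (slope, intercept) of the least-squares line through the points."""
    a, b = np.polyfit(x, y, 1)
    return a, b

def predict(a, b, x_new):
    """Evaluate the regression line y = a*x + b at x_new."""
    return a * x_new + b

# Hypothetical training data lying exactly on temperature = 40 - 0.5 * latitude
train_latitude = np.array([40.0, 50.0, 60.0])
train_temperature = np.array([20.0, 15.0, 10.0])

a, b = fit_line(train_latitude, train_temperature)

# Predict for a hypothetical city at latitude 45 and compare with a made-up actual value
predicted = predict(a, b, 45.0)
actual = 16.0
print('Predicted temperature:', predicted)
print('Actual temperature:', actual)
print('Prediction error:', actual - predicted)
```

Separating the fit from the prediction makes it easy to reuse the same line for many cities, or to compare predictors trained on different subsets of the data.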

### <font color="green">Your Turn: World Cup Data</font>

In [None]:
# Read Players.csv into dataframe
f = open('datasets/Players.csv','r')
players = pd.read_csv(f)

In [None]:
# From the players data, compute and plot a linear regression for
# passes made (y-axis) versus minutes played (x-axis).
# Reminder: copy-paste-modify approach to programming!

In [None]:
# Show the correlation coefficient for the passes-minutes regression.
# Also show correlation coefficients for tackles versus minutes, shots
# versus minutes, and saves versus minutes

In [None]:
# Use linear regression for interactive number-of-passes predictor
# Training data: compute minutes-passes regression for players from
# Greece, USA, and Portugal

In [None]:
# SUPER BONUS!!
# Repeat previous but use separate predictor for the four different positions
# (goalkeeper,defender,midfielder,forward). Does it do better?
# Try comparing correlation coefficients against one regression for all players.
#
# Note: To extract a string value from a dataframe element use df.iloc[0].element,
# e.g., if "player" is a one-row dataframe, then player.iloc[0].position returns
# the player's position as a string