# Introduction to Data Science and Machine Learning

<p align="center">
    <img width="699" alt="image" src="https://user-images.githubusercontent.com/49638680/159042792-8510fbd1-c4ac-4a48-8320-bc6c1a49cdae.png">
</p>

---

## KNN and Linear regression - Homework

Here a couple of exercises to better fix in your mind the working schemas of KNN and Linear Regression algorithms.

### KNN

#### Exercise 1

> Use $k$NN just implemented to solve a classification problem. (_e.g._ the notorious Iris classification problem).

##### Import data

In [1]:
import pandas as pd
import numpy as np

from sklearn.datasets import load_iris

data = load_iris()

X = data['data']
y = data['target']

print(f"Dataset is made by {len(X)} data, whose first 5 lines are \n {X[:5]} \n ")
print(f"Target vector is {len(y)}-long, and targets names are \n {data['target_names']}")

Dataset is made by 150 data, whose first 5 lines are 
 [[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]] 
 
Target vector is 150-long, and targets names are 
 ['setosa' 'versicolor' 'virginica']


In [2]:
from sklearn.neighbors import KNeighborsClassifier
import random
import time

i=[i for i in range(X.shape[0])]
random.shuffle(i)
X=X[i]
y=y[i]

tset=0.7

prc=int(tset*X.shape[0])
X_train=X[:prc]
y_train=y[:prc]

if prc==X.shape[0]:
    X_test=X_train
    y_test=y_train
else:
    X_test=X[prc:]
    y_test=y[prc:]

method=['cosine','euclidean','manhattan']

for method in method:    
    t0 = time.time()
    knn = KNeighborsClassifier(n_neighbors=3, metric=method)
    knn.fit(X_train,y_train)

    pred=knn.predict(X_test)
    print("\n")
    print(pred)
    print("\n")
    print(y_test)
    print("\n")
    prec=pred==y_test
    prec=np.round(np.average(prec)*100,2)
    print(method)
    print('\nThe percentage of a right prediction is: {}%'.format(prec))
    print('\nThe time needed for the calculations is: {}s'.format(time.time()-t0))




[2 1 0 0 0 2 1 2 1 2 1 1 1 0 0 0 0 1 1 0 0 2 2 1 1 0 1 1 0 1 2 0 1 2 1 1 0
 0 1 0 0 0 0 2 1]


[2 1 0 0 0 2 1 2 1 2 1 1 1 0 0 0 0 1 1 0 0 2 1 1 1 0 1 1 0 1 2 0 1 2 1 1 0
 0 1 0 0 0 0 2 1]


cosine

The percentage of a right prediction is: 97.78%

The time needed for the calculations is: 0.0020208358764648438s


[2 1 0 0 0 2 1 2 1 2 1 1 1 0 0 0 0 1 1 0 0 2 2 1 1 0 1 1 0 1 2 0 1 2 1 1 0
 0 2 0 0 0 0 2 1]


[2 1 0 0 0 2 1 2 1 2 1 1 1 0 0 0 0 1 1 0 0 2 1 1 1 0 1 1 0 1 2 0 1 2 1 1 0
 0 1 0 0 0 0 2 1]


euclidean

The percentage of a right prediction is: 95.56%

The time needed for the calculations is: 0.002391815185546875s


[2 1 0 0 0 2 1 2 1 2 1 1 1 0 0 0 0 1 1 0 0 2 2 1 1 0 1 1 0 1 2 0 1 2 1 1 0
 0 2 0 0 0 0 2 1]


[2 1 0 0 0 2 1 2 1 2 1 1 1 0 0 0 0 1 1 0 0 2 1 1 1 0 1 1 0 1 2 0 1 2 1 1 0
 0 1 0 0 0 0 2 1]


manhattan

The percentage of a right prediction is: 95.56%

The time needed for the calculations is: 0.002048969268798828s


#### Exercise 2

> Apply $k$NN to the [wave energy outputs regression problem](https://archive.ics.uci.edu/ml/datasets/Wave+Energy+Converters#) with a big dataset. Use different metrics and compare numerical performances.

### Linear Regression

#### Exercise 1

Is there a relationship between water salinity & water temperature? Can you predict the water temperature based on salinity? 

1. Using data contained in this [csv](https://www.kaggle.com/sohier/calcofi#bottle.csv) try to give an answer to this question.

2. Knowing that we have to find the _minimun_ of the cost function with respect to $\beta$ and that $ \partial_\beta J(\beta) = 0 $ is an equation in $\beta$. Use linear algebra to find the right coefficients $\beta$ without any loop calculation.

3. Use the equation found above to (re-)calculate $\beta$ and compare with the gradient descent and `sklearn` results.

_Hint for point 2._ Recall that one may use matrix notation to write
$$ \partial_\beta J(\beta) = X^t(X\beta - y) $$

#### Exercise 2

For example, we want to study the trend of fuel consumption as a function of the engine capacity, we can collect our measures in a table like the following.

| Engine capacity (cm$^3$) | Average Consumption (l/100km) |
|---|----|
| $800$  |  $6$    | 
| $1000$ |  $7.5$  | 
| $1100$ |  $8$    | 
| $1200$ |  $8.7$  | 
| $1600$ |  $12.4$ | 
| $2000$ |  $16$   | 
| $3000$ |  $20$   | 
| $4500$ |  $28$   | 

Apply linear regression to find the average consumption of an engine with `test_capacity = 1800`.

Use both `sklearn` library and your defined functions and compare the results.

_Hint for data conversion._ Recall that one may use pandas and python dictionaries to create dataframes.

```python
measures = pd.DataFrame({'Consumption_avg': [6, 7.5, 8, 8.7, 12.4, 16, 20, 28], 
                         'Capacity': [800, 1000, 1100, 1200, 1600, 2000, 3000, 4500]})
```

In [3]:
import pandas as pd
import numpy as np
import time
from sklearn.neighbors import KNeighborsRegressor

In [4]:
a=np.array([i for i in range(0,49)]).astype(str)
city=['Adelaide','Perth','Sydney','Tasmania']
k=10

In [5]:
print("-------------------------")
print("0. Exit\n") 
print("1. Method: cosine\n")
print("2. Method: euclidean\n")
print("3. Method: manhattan\n")
print("------------------------")

choice=int(input("Choose one: "))

while(choice > 0):
    
    if(choice == 1): 
        method='cosine'
        break
        
    elif(choice == 2):
        method='euclidean'
        break
        
    elif(choice == 3):
        method='manhattan'
        break 
        
    else:
        print("The choice is not included in the options...")
        choice=int(input("\nChoose one: "))

print("\nMethod: {}\n".format(method))

if choice!=0:
    t0 = time.time()
    for name in city:
        name_data=name+'_Data.csv'
        dati=pd.read_csv(name_data, header=None, names =a)
        X = dati.sample(frac=1)
        X=dati[a[0:48]]
        X_train=X.iloc[k:]
        X_test=X.iloc[:k]
        y=dati[a[48]]
        y_train=y.iloc[k:]
        y_test=y.iloc[:k]
        reg=[]
        err=[]
        
        knn = KNeighborsRegressor(n_neighbors=3, metric=method)
        knn.fit(X_train,y_train)
        reg=np.round(knn.predict(X_test),2)
        err=np.round(np.average(abs((reg-y_test.values)/y_test.values))*100,2)
        
        print('For {} resulted \n{}\nwith a relative error (respectively) of \n{}%\ncalculeted using the {} distance\n'.format(name,reg,err, method))
    print('The time needed for the calculation is: {}s'.format(time.time()-t0))
else:
    print('\nExit loop\n')
    pass

-------------------------
0. Exit

1. Method: cosine

2. Method: euclidean

3. Method: manhattan

------------------------
Choose one: 1

Method: cosine

For Adelaide resulted 
[1395194.64 1385539.83 1367618.23 1415519.52 1399667.38 1393200.74
 1425638.37 1395860.   1399393.46 1395513.89]
with a relative error (respectively) of 
2.26%
calculeted using the cosine distance

For Perth resulted 
[1479008.7  1484328.48 1479536.52 1485664.92 1483202.1  1477388.03
 1485836.08 1482851.23 1486747.97 1484928.87]
with a relative error (respectively) of 
0.18%
calculeted using the cosine distance

For Sydney resulted 
[1463622.17 1489888.39 1495087.31 1459869.25 1475727.95 1502133.8
 1487319.69 1481674.56 1478298.32 1482371.62]
with a relative error (respectively) of 
0.31%
calculeted using the cosine distance

For Tasmania resulted 
[3728153.14 3772413.95 3739839.13 3761135.22 3812664.8  3782774.06
 3689854.57 3818792.34 3772477.69 3765970.53]
with a relative error (respectively) of 
1.15%
calcul