# ASSIGNMENT 1

### Details

The assignment consists of two parts:
- Linear regression
- Logistic regression

The tasks are described throughout this notebook. Please implement what is described in the tasks and provide a written answer (in English or German) where it is required. You are allowed to add new cells to the notebook if it makes it easier for you. However, please do not remove any existing cells.



### Handing in

Please download the notebook as an .ipynb file and upload the file to the assignment in MS Teams.
You are allowed to submit several times. The last submission *before* the deadline is what will be graded. Late submissions will get a penalty as discussed in the lecture.


### Finally...

have fun and best of luck!

---


*Sidenote*:

*If you are unsure what to do (e.g. because a task description is unclear or ambiguous), please feel free to ask questions, either in the next lecture or via Teams/email. If you have to make assumptions while solving a task, please describe the assumptions you are making.*

## Import statements

---

Importing all required libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler, LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import r2_score, accuracy_score, precision_score, recall_score

# Linear regression

## Task
Your task is to do a **linear regression** on the **red-wine quality dataset**.

In [None]:
wine = pd.read_csv('https://github.com/schneiderson/ATIT2-21/raw/master/sample_data/winequality-red.csv', sep=";")
wine.head()



---


In linear regression, an attribute is a **good predictor** if there is **a correlation** between the independent variable (x: some attribute of the wine) and the dependent variable (y: wine quality).

Check which numeric attributes have a correlation with the wine quality. To achieve this you can either plot the graph and check the correlation visually (using the scatter function below). Alternatively, you can use **sns.pairplot(wine)** to plot combinations of different variables.

If the correlation is not strong, it will be hard to spot in a graph. In this case it is best to use the pandas **wine.corr()** function to display the correlation values in table format.

(optional: you can plot the correlation matrix as a heatmap with the seaborn library: sns.heatmap(wine.corr(), annot=True))
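
As a rough illustration of reading correlation values from a table, the sketch below ranks columns by the strength of their correlation with a target. The DataFrame `toy` and its values are made up for this example and are not the wine data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data (hypothetical values, NOT the real wine dataset)
rng = np.random.default_rng(0)
n = 200
alcohol = rng.normal(10.0, 1.0, n)
acidity = rng.normal(0.5, 0.2, n)
toy = pd.DataFrame({
    "alcohol": alcohol,
    "acidity": acidity,
    "noise": rng.normal(0.0, 1.0, n),
    "quality": 0.8 * alcohol - 2.0 * acidity + rng.normal(0.0, 0.5, n),
})

# Correlation of every attribute with the target column
corr_with_quality = toy.corr()["quality"].drop("quality")

# Rank by absolute value: a strong negative correlation is as useful as a strong positive one
order = corr_with_quality.abs().sort_values(ascending=False).index
ranked = corr_with_quality.reindex(order)
print(ranked)
```

The sign of each value tells you the direction of the relationship, the absolute value its strength.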

In [None]:
def scatter(column):
    plt.figure(figsize=(5,5))
    plt.scatter(wine[column], wine['quality'])
    plt.title(column+' vs Quality')
    plt.ylabel('quality')
    plt.xlabel(column)
    plt.tight_layout()

## Data exploration

---

Plot the four attributes with the highest correlation to the wine quality.

In [None]:
# plot the four attributes with highest correlation
# scatter("column_name")


## Data preparation

---


Use one of the attributes with the highest correlation and train a linear regression model.

Before training, split the dataset into training and test data (80% / 20%).
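
A minimal sketch of an 80/20 split on a made-up ten-row frame (the name `toy` and its columns are assumptions for illustration). Passing a fixed `random_state` is what makes the same rows land in the same subset on every run.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 rows, so an 80/20 split gives 8 and 2 rows
toy = pd.DataFrame({"x": range(10), "y": [2 * v for v in range(10)]})

toy_train, toy_test = train_test_split(
    toy, train_size=0.8, test_size=0.2, random_state=100
)

print(toy_train.shape, toy_test.shape)
# The two subsets do not share any rows
print(set(toy_train.index) & set(toy_test.index))
```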

In [None]:
# use the train_test_split method from the sklearn.model_selection library to split the data into training and test data

np.random.seed(0) # makes NumPy's global randomness reproducible; the split itself is fixed by the random_state argument below

# hint: it should look something like this:
# df_train, df_test = train_test_split(<?>, train_size = 0.8, test_size = 0.2, random_state = 100)



---


Do feature scaling on the numerical independent variables for both the training dataset and the test dataset.

Use fit_transform() on the training dataset and transform() on the test dataset.
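
To see what the two calls do mechanically, here is a small sketch on made-up numbers (the arrays `train` and `test` and the name `toy_scaler` are assumptions for this example).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-feature data
train = np.array([[1.0], [2.0], [3.0], [5.0]])
test = np.array([[0.0], [4.0], [6.0]])

toy_scaler = MinMaxScaler()

# fit_transform learns min=1 and max=5 from the training data, then scales it
train_scaled = toy_scaler.fit_transform(train)

# transform reuses the learned min and max without looking at the test data's range
test_scaled = toy_scaler.transform(test)

print(toy_scaler.data_min_, toy_scaler.data_max_)
print(train_scaled.ravel())  # values 0, 0.25, 0.5, 1
print(test_scaled.ravel())   # values -0.25, 0.75, 1.25
```

Note that scaled test values can fall outside the interval from 0 to 1, because the scaler only knows the range of the training data.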

In [None]:
# use MinMaxScaler from sklearn.preprocessing
scaler = MinMaxScaler()

# scaler.fit_transform(<training_data>)
# scaler.transform(<test_data>)



---


Can you explain why you should not use fit_transform on both datasets? (Hint: think about what training and test data represent if you train a model to solve a real problem where future predictions are made on **unseen data**.)

**--> your answer here...**

## Model training

---


Train a linear regression model on the training data. Use the attribute that you have identified as having the highest correlation with the wine quality.
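
scikit-learn estimators expect the features as a 2-D array of shape (n, f). The sketch below shows how `reshape(-1, 1)` turns a single column into that shape and then fits a model. The frame `toy` and its column names are made up for illustration.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data that follows target = 2 * feature + 1 exactly
toy = pd.DataFrame({"feature": [0.0, 1.0, 2.0, 3.0],
                    "target": [1.0, 3.0, 5.0, 7.0]})

x_flat = toy["feature"].values       # 1-D array, shape (4,)
x_2d = x_flat.reshape(-1, 1)         # 2-D array, shape (4, 1): 4 samples, 1 feature
y_2d = toy["target"].values.reshape(-1, 1)

toy_lm = LinearRegression()
toy_lm.fit(x_2d, y_2d)

print(x_flat.shape, x_2d.shape)
print(toy_lm.coef_, toy_lm.intercept_)  # slope close to 2, intercept close to 1
```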

In [None]:
# use the LinearRegression class from sklearn.linear_model
lm = LinearRegression()

# we need to reshape the array because sklearn LinearRegression expects arrays in the shape of (n, f), 
# where f is the number of features we want to train on.
x_train = df_train['<column_name?>'].values.reshape(-1, 1)
y_train = df_train['quality'].values.reshape(-1, 1)

lm.fit(<x>, <y>)


## Predictions

---

Make predictions on the test set.

In [None]:
# reshape the test data just as you've done above with the training data
x_test = 
y_test =

# then pass the data to the linear regression model to make predictions
y_predicted = lm.predict(x_test)

(optional: use a goodness of fit metric like r2_score to check how good the regression is)
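
For reference, the coefficient of determination is defined as $R^2 = 1 - SS_{res} / SS_{tot}$. The following sketch, on made-up numbers, computes it by hand and compares the result with `r2_score`.

```python
import numpy as np
from sklearn.metrics import r2_score

# Hypothetical true values and predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.5])

# Residual sum of squares: error left over after the regression
ss_res = np.sum((y_true - y_pred) ** 2)
# Total sum of squares: error of always predicting the mean
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1.0 - ss_res / ss_tot
print(r2_manual, r2_score(y_true, y_pred))
```

A value of 1 means a perfect fit, while a value near 0 means the model is no better than predicting the mean.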

In [None]:
r2_score(y_test, y_predicted)



---


Plot true values and calculated regression line.

In [None]:
plt.scatter(x_test, y_test, color='black')
plt.plot(x_test, y_predicted, color='blue', linewidth=3)

plt.show()



---







# Logistic regression


## Task
This dataset contains information about whether or not an individual has clicked an online ad. The dataset contains several features, e.g. the time an individual has spent on a site per day, age, daily internet usage time, location and gender. Your task is to use logistic regression and find out which feature is best to **predict if an individual has clicked the ad**.

In [None]:
df_ads = pd.read_csv('https://github.com/schneiderson/ATIT2-21/raw/master/sample_data/advertising_le.csv')
df_ads.head()

## Data Exploration


---




Please plot a pairplot of the input variables using **sns.pairplot(df, hue='Clicked on Ad')**. The resulting graphs can tell you about the relation between some input variables and the target variable.

In [None]:
sns.pairplot(df_ads, hue='Clicked on Ad')

Think about the graphs shown in this plot and try to interpret them.

Based on these plots, which variables might be most useful to predict whether an individual has clicked an ad? Please briefly elaborate how you arrive at your conclusion.

**--> your answer here...**

## Data Preparation

---

First we encode the **"Clicked on Ad"** feature. At this point the feature is a column containing the text values **"yes" or "no"**. What we want is a **numerical feature**. Hence, we can use a **Label Encoder** to transform the column into a binary encoded column.
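
A small sketch of what the encoder does, using a made-up series `answers` rather than the ad data. `LabelEncoder` sorts the distinct values and numbers them from 0.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical text column
answers = pd.Series(["yes", "no", "no", "yes"])

toy_le = LabelEncoder()
encoded = toy_le.fit_transform(answers)

# Classes are stored in sorted order, so "no" maps to 0 and "yes" maps to 1
print(toy_le.classes_)
print(encoded)

# inverse_transform maps the numbers back to the original text
print(toy_le.inverse_transform(encoded))
```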


In [None]:
le = LabelEncoder()
df_ads['Clicked on Ad'] = le.fit_transform(df_ads['Clicked on Ad'])

Do a train test split, just like in the previous task. (80/20 train/test split)

In [None]:
# do a train test split like you've done in the previous task


Do feature scaling for the numerical columns.

In [None]:
scaler2 = MinMaxScaler()
# do feature scaling like in the previous task


## Model training

---

Fit the LogisticRegression model to the training data. Use the one feature that you think will produce the best result (hint: the pairplot might give you an idea which feature that is).

In [None]:
# we need to reshape the array because sklearn LogisticRegression expects arrays in the shape of (n, f), 
# where f is the number of features we want to train on. (Check the previous task for reference)
x_ads_train = 
y_ads_train = 

logreg = LogisticRegression()
logreg.fit(x_ads_train, y_ads_train)

## Predictions and model evaluation

---

Evaluate the accuracy, precision and recall metrics on the test data.
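
As a reminder of how the three metrics are defined, this sketch counts true/false positives and negatives on made-up labels and compares the hand-computed values with the scikit-learn functions. The label lists are assumptions for illustration only.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Hypothetical labels: 1 = clicked, 0 = not clicked
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 1, 1, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # predicted 1, truly 1
tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # predicted 0, truly 0
fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # predicted 1, truly 0
fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # predicted 0, truly 1

accuracy = (tp + tn) / len(pairs)   # share of all predictions that are correct
precision = tp / (tp + fp)          # share of predicted positives that are correct
recall = tp / (tp + fn)             # share of actual positives that were found

print(accuracy, accuracy_score(y_true, y_pred))
print(precision, precision_score(y_true, y_pred))
print(recall, recall_score(y_true, y_pred))
```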

In [None]:
x_ads_test = 
y_ads_test = 

y_ads_predictions = logreg.predict(x_ads_test) # generate predictions

print('Accuracy: {}'.format(accuracy_score(y_ads_test, y_ads_predictions)))
print('Precision: {}'.format(precision_score(y_ads_test, y_ads_predictions)))
print('Recall: {}'.format(recall_score(y_ads_test, y_ads_predictions)))

Based on these three metrics, discuss which metric is most important under which circumstances.

**--> your answer here...**



---


Recall from the lecture: We can do linear and logistic regression on multiple features. 

Does the performance of the logistic regression model improve if you use a combination of several features?

Please train another logistic regression model and analyse its performance to support your answer.
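
Mechanically, training on several features only changes the shape of the input. The sketch below uses synthetic data (the frame `toy` and the columns `f1`, `f2` and `label` are made up) and makes no claim about the ad dataset. Selecting a list of columns already yields an (n, f) array, so no reshape is needed.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Synthetic data (hypothetical): the label depends on both features
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({"f1": rng.normal(size=n), "f2": rng.normal(size=n)})
toy["label"] = (toy["f1"] + toy["f2"] > 0).astype(int)

x_multi = toy[["f1", "f2"]].values   # shape (n, 2): two features per sample
y_multi = toy["label"].values        # shape (n,)

toy_logreg = LogisticRegression()
toy_logreg.fit(x_multi, y_multi)

print(x_multi.shape)
print(toy_logreg.coef_.shape)        # one coefficient per feature
print(toy_logreg.score(x_multi, y_multi))
```

For a fair comparison between models, evaluate each one on the same held-out test data with the same metrics.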