<a href="https://colab.research.google.com/github/GaiaSaveri/intro-to-ml/blob/main/challenges/challenge-zero.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Challenge $0$


## 1. ***Data cleaning with Pandas***

Use the library `pandas` to load and clean the required dataset.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import os

Obtain the data file

In [None]:
FFILE = './50_Startups.csv'
if os.path.isfile(FFILE):
    print("File already exists")
    if os.access(FFILE, os.R_OK):
        print("File is readable")
    else:
        print("File is not readable, removing it and downloading again")
        # {FFILE} makes IPython substitute the Python variable into the shell command
        !rm {FFILE}
        !wget "https://raw.github.com/alexdepremia/ML_IADA_UTs/main/challenge_0/50_Startups.csv"
else:
    print("File not found, downloading it")
    !wget "https://raw.github.com/alexdepremia/ML_IADA_UTs/main/challenge_0/50_Startups.csv"

In [3]:
# load the dataset using pandas
data = pd.read_csv('50_Startups.csv')
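# features: every column except the last two; labels: the fourth column (index 3)
X = data.iloc[:,:-2].values
y = data.iloc[:,3].values
# work on a copy so that the raw data stays untouched by the cleaning steps
df = data.copy()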

In [None]:
y

***Play with data***

In [None]:
df.shape

In [None]:
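# replace zeros with the mean of their column; numeric_only=True skips the
# string-valued State column, which recent pandas versions refuse to average
df.replace(to_replace=0.00, value=df.mean(numeric_only=True), inplace=True)
df.head()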

***Select two categories for binary classification*** 

In [7]:
df_sel=df[(df.State=="California") | (df.State=="Florida")]

In [None]:
df_sel.head() # column title and first rows of the dataset

In [None]:
df_sel.dtypes # type of each column  

***Encode categorical data*** 

One-hot encoding of categorical feature _State_

In [10]:
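# dtype=int gives 0/1 columns; recent pandas versions default to booleans,
# which cannot be divided in the normalization step
df_one = pd.get_dummies(df_sel["State"], dtype=int)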

In [None]:
df_one.head()

In [None]:
# construct the final dataset that you will use for learning and prediction
df_fin = pd.concat((df_one, df_sel), axis=1)
df_fin = df_fin.drop(["Florida"], axis=1)
df_fin = df_fin.drop(["State"], axis=1)
# California is class 1, Florida is class 0
df_fin = df_fin.rename(columns={"California": "State"})
df_fin.head()

***Normalize***

Divide each column by its maximum absolute value, so that every feature lies in \[-1, 1\] (in \[0, 1\] when all values are non-negative).

In [13]:
def absolute_maximum_scale(series):
    return series / series.abs().max()

for col in df_fin.columns:
    df_fin[col] = absolute_maximum_scale(df_fin[col])

In [None]:
df_fin.head()

***Classification***

Prepare the dataset:

In [16]:
y = df_fin["State"] # ground truth labels
X = df_fin.drop(["State"], axis=1) # datapoints features
# extract actual values from series
y = y.values
X = X.values

Train-test split

$75\%$ of the data go into the training set; the remaining $25\%$ constitute the test set.

In [17]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y , test_size=0.25, random_state=0)

Train the Logistic Regression Model

In [18]:
from sklearn.linear_model import LogisticRegression

In [None]:
LR = LogisticRegression(random_state=0, solver='lbfgs').fit(X_train, y_train)
LR.predict(X_test)
round(LR.score(X_test,y_test), 4)

***Plot results***

***Add regularization***

Implement the regularized logistic regression model from scratch, with all the regularization techniques seen during the course.
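
As a possible starting point, here is a minimal sketch of logistic regression trained by batch gradient descent with an optional penalty. It assumes that the techniques of interest include the L2 (ridge) and L1 (lasso) penalties; the course may cover others. The names `fit_logistic` and `predict_labels` and the hyperparameters `lam`, `lr` and `n_iter` are hypothetical choices made for this illustration.

```python
import numpy as np


def sigmoid(z):
    """Logistic function, mapping real scores to probabilities."""
    return 1.0 / (1.0 + np.exp(-z))


def fit_logistic(X, y, penalty="l2", lam=0.1, lr=0.1, n_iter=2000):
    """Batch gradient descent on the mean log-loss plus an optional penalty.

    penalty: "l2" adds (lam / 2) * ||w||^2, "l1" adds lam * ||w||_1
    (handled with a subgradient), None disables regularization.
    The bias term b is never penalized.
    """
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    b = 0.0
    for _ in range(n_iter):
        p = sigmoid(X @ w + b)               # predicted probabilities
        grad_w = X.T @ (p - y) / n_samples   # gradient of the mean log-loss
        grad_b = np.mean(p - y)
        if penalty == "l2":
            grad_w = grad_w + lam * w
        elif penalty == "l1":
            grad_w = grad_w + lam * np.sign(w)
        w = w - lr * grad_w
        b = b - lr * grad_b
    return w, b


def predict_labels(X, w, b, threshold=0.5):
    """Assign class 1 when the predicted probability reaches the threshold."""
    return (sigmoid(X @ w + b) >= threshold).astype(int)
```

With the arrays defined earlier, a call such as `w, b = fit_logistic(X_train, y_train, penalty="l1")` followed by `predict_labels(X_test, w, b)` would give predictions to compare against `LR`. The values of `lam` and `lr` are not tuned and would need adjusting.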

***Model assessment***

Given true and predicted values, compute the most common classification metrics to assess the quality of your predictions. 

In [None]:
from sklearn.metrics import classification_report
y_true = y_test
y_pred = LR.predict(X_test)

# classification_report lists labels in sorted order: 0 (Florida) first, then 1 (California)
target_names = ['Florida', 'California']
print(classification_report(y_true, y_pred, target_names=target_names))
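
The quantities in the report can also be computed by hand from the confusion counts, which makes their definitions explicit. The sketch below is illustrative: the helper name `binary_metrics` is hypothetical, and it assumes binary labels with class 1 (California) treated as the positive class.

```python
import numpy as np


def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 from the four confusion counts."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_pred == positive) & (y_true == positive)))
    fp = int(np.sum((y_pred == positive) & (y_true != positive)))
    fn = int(np.sum((y_pred != positive) & (y_true == positive)))
    tn = int(np.sum((y_pred != positive) & (y_true != positive)))
    total = tp + fp + fn + tn
    # each ratio falls back to 0.0 when its denominator is zero
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

Calling `binary_metrics(y_test, LR.predict(X_test))` should agree with the California row of the report above, up to rounding.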

Repeat the previous task for regularized logistic regression and compare the results. 

***ROC curve***

Implement a function for producing the Receiver Operating Characteristic (ROC) curve.

Given true and predicted values, plot the ROC curve using your implemented function.
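
One way to approach this is to sweep a decision threshold over the predicted scores and record the true and false positive rates at each step. The sketch below does that; the function names `roc_curve_points`, `auc_trapezoid` and `plot_roc` are hypothetical, and it assumes binary labels in {0, 1} with both classes present and scores where higher means "more likely class 1".

```python
import numpy as np


def roc_curve_points(y_true, scores):
    """Return (fpr, tpr, thresholds) for thresholds swept from high to low."""
    y_true = np.asarray(y_true).astype(int)
    scores = np.asarray(scores, dtype=float)
    # distinct scores in descending order; +inf first so the curve starts at (0, 0)
    thresholds = np.concatenate(([np.inf], np.unique(scores)[::-1]))
    n_pos = np.sum(y_true == 1)
    n_neg = np.sum(y_true == 0)
    fpr, tpr = [], []
    for t in thresholds:
        pred = scores >= t  # predicted positive at this threshold
        tpr.append(np.sum(pred & (y_true == 1)) / n_pos)
        fpr.append(np.sum(pred & (y_true == 0)) / n_neg)
    return np.array(fpr), np.array(tpr), thresholds


def auc_trapezoid(fpr, tpr):
    """Area under the curve by the trapezoidal rule."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))


def plot_roc(fpr, tpr):
    """Plot the ROC curve together with the chance diagonal."""
    # imported here so the computations above carry no plotting dependency
    import matplotlib.pyplot as plt
    plt.plot(fpr, tpr, marker="o", label=f"AUC = {auc_trapezoid(fpr, tpr):.2f}")
    plt.plot([0, 1], [0, 1], linestyle="--", label="chance")
    plt.xlabel("False positive rate")
    plt.ylabel("True positive rate")
    plt.legend()
    plt.show()
```

For the model trained above, the scores could be taken from `LR.predict_proba(X_test)[:, 1]`, then passed with `y_test` to `roc_curve_points` and on to `plot_roc`.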