## ML task

In this lab, you will solve one of the ML tasks by applying the methods that we learned in the previous classes. You can use **one** of the prepared datasets from the UCI ML Repository or choose another dataset (e.g. from Kaggle).

Below are some questions and tips to help you design the experiments and summarize the results.

In [None]:
! pip install lime

In [None]:
import math
import random
import urllib.request
import warnings
import zipfile
# sklearn.externals.six was removed from recent scikit-learn releases; io.StringIO works the same way here
from io import StringIO

import numpy as np
import pandas as pd
import seaborn as sns
import pydotplus
import wordcloud
from IPython.display import Image
from matplotlib import pyplot as plt
from matplotlib.colors import ListedColormap

import nltk
nltk.download('stopwords')
nltk.download('wordnet')
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import WordPunctTokenizer, word_tokenize

from sklearn import metrics
from sklearn.base import BaseEstimator
from sklearn.datasets import make_moons, make_circles, make_blobs, make_regression
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.metrics import accuracy_score, mean_squared_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neural_network import MLPClassifier, MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, minmax_scale
from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier, export_graphviz
from sklearn.utils import check_random_state

from lime.lime_text import LimeTextExplainer

warnings.filterwarnings('ignore')
%matplotlib inline




In [None]:
def plot_decision_tree(model, feature_names):
  dot_data = StringIO()
  export_graphviz(model, out_file=dot_data,  
                  filled=True, rounded=True, 
                  feature_names=feature_names)
  graph = pydotplus.graph_from_dot_data(dot_data.getvalue())  
  return Image(graph.create_png())

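def display_confusion_matrix(y_test, y_pred):
  # rows are the actual classes, columns are the predicted classes
  confusion_matrix = pd.DataFrame(metrics.confusion_matrix(y_test, y_pred))
  confusion_matrix.index.name = 'Actual'
  confusion_matrix.columns.name = 'Predicted'
  # fmt='d' shows the counts as integers instead of scientific notation
  sns.heatmap(confusion_matrix, annot=True, fmt='d')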

def plot_wordcloud(texts: list, title: str=''):
  wc = wordcloud.WordCloud(background_color="white").generate(' '.join(texts))
  plt.figure()
  plt.imshow(wc)
  plt.axis("off")
  plt.title(title)
  plt.show()


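def preprocess_tokens(text):
    # keep only alphabetic tokens of at least 3 letters
    lemmatizer = WordNetLemmatizer()
    tokens = nltk.tokenize.regexp_tokenize(text, r'[a-zA-Z]{3,}')
    # lowercase before lemmatizing, because the WordNet lookup is case-sensitive
    return [lemmatizer.lemmatize(word.lower()) for word in tokens]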

## Problem definition

* What is the goal of the prediction?
* What problem does it solve?
* What type of ML task is it (classification/regression)?


### SMS Spam Collection Data Set - text classification

A text classification task: predict whether a given SMS is spam, based on its text. (You can try classifying the raw texts without any cleaning, since stopwords and typos may carry some signal for the prediction.)

https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection
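
As a rough illustration of classifying raw text without any cleaning, the sketch below feeds uncleaned messages straight into `CountVectorizer` and a `LogisticRegression`. The four messages and the names `toy_texts`, `toy_labels` and `text_clf` are made up for this example and are not part of the dataset; only the `ham`/`spam` label values mirror the real data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy messages, used only to show the mechanics.
toy_texts = [
    "WIN a FREE prize now, txt CLAIM to 12345",
    "URGENT! You have won cash, call now",
    "Are we still meeting for lunch today?",
    "I'll call you when I get home",
]
toy_labels = ["spam", "spam", "ham", "ham"]

# CountVectorizer lowercases and tokenizes the raw text by default,
# so no manual cleaning step is needed before fitting.
text_clf = make_pipeline(CountVectorizer(), LogisticRegression())
text_clf.fit(toy_texts, toy_labels)

# The learned vocabulary maps each token to a column index.
vocabulary = text_clf.named_steps["countvectorizer"].vocabulary_
prediction = text_clf.predict(["free cash prize, call now"])

print(len(vocabulary), "tokens in the vocabulary")
print("predicted label:", prediction[0])
```

With so few messages the prediction itself means little; the point is that the same pipeline object can later be fitted on the full `X` and `y`.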

In [None]:
data_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip"
urllib.request.urlretrieve(data_url, 'smsspamcollection.zip')
data_file = zipfile.ZipFile('smsspamcollection.zip')
dataset = pd.read_csv(data_file.open('SMSSpamCollection'), delimiter="\t", header=None)
dataset.columns = ["spam", "text"]
X = dataset["text"]
y = dataset["spam"]
dataset.head()

### Seoul Bike Sharing Demand Data Set - regression
Predict the number of bikes rented in a given hour.

https://archive.ics.uci.edu/ml/datasets/Seoul+Bike+Sharing+Demand


In [None]:
data_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/00560/SeoulBikeData.csv"
urllib.request.urlretrieve(data_url, 'SeoulBikeData.csv')
dataset = pd.read_csv('SeoulBikeData.csv', sep=',', encoding = "ISO-8859-1")
dataset["Holiday"] = (dataset["Holiday"] == "Holiday").astype(int)
dataset["Functioning Day"] = (dataset["Functioning Day"] == "Yes").astype(int)
X = dataset[['Hour', 'Temperature(°C)', 'Humidity(%)',
       'Wind speed (m/s)', 'Visibility (10m)', 'Dew point temperature(°C)',
       'Solar Radiation (MJ/m2)', 'Rainfall(mm)', 'Snowfall (cm)', 
       'Holiday', 'Functioning Day']]
y = dataset["Rented Bike Count"]
X.head()

### Wine Quality Data Set - classification/regression

The predicted value is the wine quality grade. You can predict it as a continuous value (regression) or a discrete one (classification).

https://archive.ics.uci.edu/ml/datasets/Wine+Quality

In [None]:
data_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv'
urllib.request.urlretrieve(data_url, 'winequality-white.csv')
dataset = pd.read_csv('winequality-white.csv', sep=';')
X = dataset[dataset.columns[:-1]]
y = dataset["quality"]
X.head()

## Exploratory data analysis
* How many records are in the dataset? `df.shape`
* What is the distribution of the target value? `df.describe()`, `sns.histplot(y)`, `sns.countplot(x=y)`
* Are there any missing values? `df.isnull().sum()`
* (For numerical data) Are there any correlations between the variables? `sns.heatmap(df.corr())`
* (For text data) What are the most common words for each class? `plot_wordcloud`
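
The checks listed above can be sketched with plain pandas calls. The small `toy_df` frame below is invented for illustration (its column names only loosely resemble the wine data), so the numbers it produces say nothing about the real datasets.

```python
import pandas as pd

# Hypothetical toy frame standing in for one of the lab datasets.
toy_df = pd.DataFrame({
    "alcohol": [9.4, 9.8, 10.1, 11.2, None, 12.0],
    "pH": [3.51, 3.20, 3.26, 3.16, 3.30, 3.39],
    "quality": [5, 5, 6, 6, 7, 6],
})

n_rows, n_cols = toy_df.shape                      # number of records and columns
target_counts = toy_df["quality"].value_counts()   # distribution of the target
missing = toy_df.isnull().sum()                    # missing values per column
corr = toy_df.corr()                               # pairwise Pearson correlations

print(n_rows, n_cols)
print(target_counts)
print(missing)
print(corr.round(2))
```

The correlation matrix `corr` is what you would pass to `sns.heatmap` to get the plot mentioned in the list.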

In [None]:
?? # display the number of records (dataset shape)
?? # plot the distribution or calculate the dataset statistics
?? # display the missing values summary
?? # display the correlations or plot the word clouds


## Modeling

  * Did you use any preprocessing? (e.g. `StandardScaler` for numerical data, `CountVectorizer` for text, `make_pipeline` for combining the stages with the model)
  * Which models did you use? (Select at least 2, including one interpretable model: `DecisionTreeRegressor`, `RandomForestRegressor`, `LinearRegression`, `MLPRegressor` for regression or `DecisionTreeClassifier`, `RandomForestClassifier`, `MLPClassifier`, `LogisticRegression` for classification)
  * Which hyper-parameters did you search? `GridSearchCV`
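
One possible way to combine preprocessing, a model, and a hyper-parameter search is sketched below. It runs on synthetic data from `make_classification` rather than on a lab dataset, and the grid over `C` is an arbitrary choice for illustration; the `_demo` variable names are likewise hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: a binary classification task).
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0, stratify=y_demo
)

# make_pipeline names each step after its lowercased class name,
# which is why the grid key starts with "logisticregression__".
pipeline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
param_grid = {"logisticregression__C": [0.01, 0.1, 1.0, 10.0]}

# 5-fold cross-validation on the training set only; the test set stays untouched.
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

# After fitting, the search object predicts with the best pipeline found.
y_pred = search.predict(X_test)
print(search.best_params_)
```

Putting the scaler inside the pipeline means it is refitted on each training fold, so the validation folds do not leak into the scaling statistics.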

In [None]:
?? # define the model or pipeline stages
?? # define the parameter grid
?? # create GridSearchCV object


In [None]:
?? # split data into train and test sets
?? # fit the search object on the train set
?? # generate predictions on the test set


## Evaluation 
  * What evaluation metrics did you apply on the test set? (use `train_test_split` and `metrics.accuracy_score`, `display_confusion_matrix` for classification or `metrics.mean_squared_error` for regression)
  * Which model and hyperparameters yielded the best results? Which one would you choose?
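
To make the metrics concrete, here is a small sketch on hand-written label vectors. All values are invented for illustration; in the lab you would pass the test labels and the predictions from your fitted model instead.

```python
import math

from sklearn import metrics

# Hypothetical classification labels and predictions.
y_true_cls = [0, 1, 1, 0, 1, 0]
y_pred_cls = [0, 1, 0, 0, 1, 1]

accuracy = metrics.accuracy_score(y_true_cls, y_pred_cls)
# Rows of the confusion matrix are actual classes, columns are predicted classes.
cm = metrics.confusion_matrix(y_true_cls, y_pred_cls)

# Hypothetical regression targets and predictions.
y_true_reg = [3.0, 5.0, 2.0]
y_pred_reg = [2.5, 5.0, 3.0]

mse = metrics.mean_squared_error(y_true_reg, y_pred_reg)
rmse = math.sqrt(mse)  # same unit as the target, often easier to interpret

print("accuracy:", accuracy)
print(cm)
print("MSE:", mse, "RMSE:", rmse)
```

When comparing two models, compute the same metric on the same test split for both so that the numbers are directly comparable.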


In [None]:
?? # calculate the metrics for predicted and test y

## (Optional) Results explanation
  * Which features were the most important (globally in the decision tree or linear model, or locally in LIME for selected examples)?
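
For the global view, a minimal sketch using only scikit-learn is shown below: impurity-based importances from a decision tree next to the coefficients of a linear model. The data is synthetic and constructed so that the first feature dominates; the feature names `f0`, `f1`, `f2` are placeholders. The local LIME explanation is not covered here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Synthetic data (assumption): the target depends strongly on f0,
# weakly on f1, and not at all on f2.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 3.0 * X_demo[:, 0] + 0.5 * X_demo[:, 1] + rng.normal(scale=0.1, size=200)
feature_names = ["f0", "f1", "f2"]

tree = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_demo, y_demo)
linear = LinearRegression().fit(X_demo, y_demo)

# feature_importances_ sums to 1; coef_ is in units of the target per unit of the feature.
for name, importance, coef in zip(feature_names, tree.feature_importances_, linear.coef_):
    print(f"{name}: tree importance={importance:.3f}, linear coefficient={coef:.3f}")
```

Linear coefficients are only comparable across features when the features are on a similar scale, which is one reason to standardize them first.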

In [None]:
??