# Sparse Matrix Projections


# Note for the Devs (Sebastian, Skyler, Debasmita, Yannik)

If you want to execute this and test it several times with different links, you can set the parameter `birthday_version = False` or omit it altogether (it defaults to `False`). I programmed a little quiz into it for Jan; I thought he'd enjoy it. Please try it yourself too and let me know if you find any mistakes. Otherwise, feel free to add some more fun logic to that game ;).

In [6]:
import pandas as pd
from IPython.display import display, Markdown
import time
import importlib
from sklearn.random_projection import johnson_lindenstrauss_min_dim
from sklearn.random_projection import SparseRandomProjection
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn import datasets

import matplotlib.pyplot as plt
import numpy as np

class RandomSparseRepresentation:
    """This class executes the RandomSparseRepresentation"""
    def __init__(self, birthday_version = False):
        if birthday_version:
            self._jans_birthday()
            self._printmd("---")
        self._printmd(
"""Welcome to the interface of **RandomSparseRepresentation**! :)
        
You have now instantiated an object, with which you can create a RandomSparseRepresentation.
In order to do so, please first pick a dataset from this website [UCI ML](https://archive.ics.uci.edu/ml/index.php).

Once you have done so, please call `get_data()` on your object to download that data.
This function takes one required parameter and one optional parameter. The required one is the URL of
the dataset, which you obtain by right-clicking the dataset in its data folder and copying the link.
Should the dataset in the data folder on the UCI website not be a `.csv` but rather a `.data` file,
please also provide the column names as a list, which you can find in the `.names` file in the data folder.""")
        
    def get_data(self, url: str, names: list = None, **kwargs):
        # Extra keyword arguments are forwarded to pandas in both branches
        if names:
            self.data = pd.read_csv(url, names = names, **kwargs)
        else:
            self.data = pd.read_csv(url, **kwargs)
        self._printmd(
"""You successfully downloaded your dataset to the object!

Now we can go ahead and split the data.
Please call the `split_data()` function for it. You can pass it the `test_size` parameter to split your
data into test and train sets; the default value is `0.3`. Here are the first 5 rows of our data:""")
        display(self.data.head(5))
    
    def split_data(self, test_size = 0.3):
        self._printmd(
f"""The first thing we need to do is to determine which column is our target variable.
All column names are therefore printed below.

{list(self.data.columns)}

In the next step, please input the name of the column that contains your target variable.""")
        time.sleep(1)
        # Pass the prompt positionally: the builtin `input` accepts no keyword arguments
        target = input("Please input your target variable here: ")
        self.X, self.y = self.data.drop(target, axis = 1), self.data[target]
        self.X_train, self.X_test, self.y_train, self.y_test = train_test_split(self.X, self.y,
                                                                                test_size = test_size, random_state = 11)
        self._printmd(
f"""Your data has now been split into a train and a test set with a test share of `{test_size}`.
This was done by selecting the column `{target}` as the target column and the rest as independent variables.""")
    
    def JL_lemma(self, epsilon=0.1):
        """Sebastian & Skyler will write something about the JL lemma, why it works with small datasets."""
        
        self._printmd(
f"""The Johnson-Lindenstrauss (JL) lemma gives the number of columns to which we can reduce our dataset
while approximately preserving the distances between the observations.
The parameter `epsilon` determines the margin within which the distances are preserved, i.e. each squared
pairwise distance changes by a factor of at most (1 ± epsilon).

Our current dataset has {self.data.shape[0]} observations. According to the JL lemma, we could reduce it to
{johnson_lindenstrauss_min_dim(self.data.shape[0], eps = epsilon)} dimensions.""")
        # The JL bound depends on the number of observations (rows); compare it with the number of columns
        if johnson_lindenstrauss_min_dim(self.data.shape[0], eps = epsilon) > self.data.shape[1]:
            self._printmd(
"""For a small dataset like this one, the JL bound exceeds the number of columns, so the lemma itself does not
guarantee a reduction here... **Ask group**!""")
        self._printmd(
"""The next step is to define a baseline metric, against which we will evaluate
our algorithm on the reduced dataset later. For this please call the function `baseline()`.""")
        
    def baseline(self, model = None):
        """Sebastian will something SHORT on the metrics"""
        
        if not model:
            raise ValueError("Please specify the model for your baseline metric! This can be done like \
`model = LinearSVC`, where LinearSVC refers to the class from sklearn.svm.")
        self.mod = model()
        self.mod.fit(self.X_train, self.y_train)
        # Raw string: otherwise Python reads the `\f` in `\frac` as a form feed and the formulas break
        self._printmd(
r"""In order to assess the performance of a classifier, it is important to evaluate the algorithm numerically.
For this, a variety of performance measures are available. It is essential to choose an adequate performance measure, as
their applicability and significance depend on the dataset as well as on the specific classification task.
There are a few metrics we can choose from; the names of the corresponding functions (one of which you need to input next) can be viewed
[here](https://scikit-learn.org/stable/modules/model_evaluation.html). For the task at hand, the performance
measures used are either *accuracy* or the $f_1$ *score*.

$$
Accuracy = \frac{True\ Positives + True\ Negatives }{True\ Positives + False\ Positives + True\ Negatives + False\ Negatives}
$$

*Accuracy* measures the performance of a classification model as the number of correct 
predictions divided by the total number of predictions. Its main advantage is its easy interpretability. 
Nevertheless, *accuracy* should only be used for balanced datasets. When dealing with imbalanced datasets,
i.e. when some classes are much more frequent than others, *accuracy* is not a reliable performance measure. 

$$
f_1 = 2 * \frac{Precision * Recall}{Precision + Recall}
$$

The $f_1$ *score* is the harmonic mean of *precision* and *recall*, i.e. it applies equal weight to both.
The $f_1$ *score* is a meaningful evaluation measure for imbalanced datasets. As such, we recommend
choosing `accuracy_score` for balanced datasets and `f1_score` for imbalanced datasets.

Additionally, for imbalanced datasets, i.e. situations in which the `f1_score` is chosen, we need to
distinguish between binary and multi-class classification. For multi-class classification, the
parameter *average* has to be specified, as its default is only applicable if targets are [binary](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html#sklearn.metrics.f1_score).
Four other parameter values are possible: *micro*, *macro*, *weighted* and *samples*. *Samples* is only
meaningful for multilabel classification, which is outside the scope of this assignment. Thus, we will
only examine *micro*, *macro* and *weighted*.

The *macro* $f_1$ *score* is computed as a simple arithmetic mean of the per-class $f_1$ *scores*.
It does not take label imbalance into account.

The *weighted* $f_1$ *score* alters *macro* to account for label imbalance. The weight is applied by 
the number of true instances for each label.

The *micro* $f_1$ *score* is calculated by counting the total true positives, false negatives and false positives over all classes.
Thus, for single-label multi-class classification, the *micro* $f_1$ *score* is equal to the total number of true positives over the total number of observations, i.e. it coincides with *accuracy*.
Further explanations can be found [here](https://scikit-learn.org/stable/modules/model_evaluation.html#multiclass-and-multilabel-classification.).

In conclusion, we recommend choosing `average = weighted` for the performance metric `f1_score` for the
purpose of this assignment, as this accounts for the imbalance in the dataset.

Please enter the metric chosen for our baseline in the following prompt. Be sure to type the function name
exactly, e.g. `accuracy_score` if you want to use `accuracy_score`, and likewise for all other metrics.
<span style="color:red">NEEDS TO BE ADJUSTED: f1-score and type of average; talk with Yannik
Additionally: Discuss Over-, undersampling with group</span>""")
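        self.metric = input("Please insert your metric here: ")
        # scikit-learn metrics expect the arguments in the order (y_true, y_pred).
        # Note: this assignment shadows the `baseline` method on this instance.
        self.baseline = getattr(metrics, self.metric)(self.y_test, self.mod.predict(self.X_test))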
        self._printmd(
"""Awesome, you have set your baseline! Now call the function `apply_random_projection` to check
how well your model performs when we reduce the number of dimensions.""")
        
    def apply_random_projection(self):
        """Debasmita & Yannik look into Random Projections"""
        
        self._printmd(
"""Now we can apply our random projection to the dataset we loaded earlier. Once that function is done \
you can head over to the next function, which is called `plot`. That function will plot the baseline and your \
chosen metric over the different dimensions.""")
        self.accuracies = []
        # Use the number of feature columns (the target column is excluded) and drop duplicate sizes
        self.dims = np.unique(np.int32(np.linspace(2, self.X_train.shape[1], 20)))
        # Loop over the projection sizes, k
        for dim in self.dims:
            # Create random projection
            sp = SparseRandomProjection(n_components = dim)
            X = sp.fit_transform(self.X_train)

            # Train a fresh classifier of the same type on the sparse random projection,
            # so that the fitted baseline model in `self.mod` is left untouched
            model = type(self.mod)()
            model.fit(X, self.y_train)

            # Evaluate model and update accuracies; metrics expect (y_true, y_pred)
            test = sp.transform(self.X_test)
            self.accuracies.append(getattr(metrics, self.metric)(self.y_test, model.predict(test)))
            
    def plot(self):
        # Create figure
        plt.figure()
        plt.xlabel("# of dimensions k")
        plt.ylabel(f"{self.metric}")
        # Limit the x-axis to the range of projection sizes that were actually evaluated
        plt.xlim([self.dims.min(), self.dims.max()])
        plt.ylim([0, 1])

        # Plot baseline and random projection accuracies
        plt.plot(self.dims, [self.baseline] * len(self.accuracies), color = "r")
        plt.plot(self.dims, self.accuracies)

        plt.show()
        
    def _jans_birthday(self):
        self.cost = 0
        self._printmd(
            """Dear Jan,

Group 10 (Skyler MacGowan, Sebastian Sydow, Debasmita Dutta and Yannik Suhre) wishes you all the best for your
birthday! We hope you have/had a beautiful day despite these challenging times! As a small birthday present, we have programmed
a little riddle for you. Here you go:

A bat and a ball together cost 1.10€. The bat costs one euro more than the ball. Now our question for you is: how much does
the ball cost? Please enter your answer at the prompt!
            """)
        
        counter = 0
        while True:
            self._riddle_for_jan()
            if "," in str(self.cost):
                self._printmd("""Gotcha! Be aware that this has to be a floating **point** number with a
                                 **point** as decimal separator! Try again, this time with a **point** as decimal separator! ;)""")
                counter += 1
                continue
            else:
                if self.cost == 0.1:
                    self._printmd("""Sorry, that is wrong. If you do the math, you will end up with a total price of 1.20€ for
                                     bat and ball. That does not work! Think again and try again. ;)""")
                    counter += 1
                    continue
                elif self.cost != 0.05:
                    self._printmd(f"""Sorry, your answer of {self.cost} is wrong. One hint:
                    try to solve the equation $x + (x + 1) = 1.1$. Try again.""")
                    counter += 1
                    continue
                elif self.cost == 0.05:
                    fun = input(f"Do you really want to log in {self.cost}? (yes/no) ")
                    if fun == "no":
                        self._printmd("""Hm, what shall we do with you? You do not want to log in the answer... So we'd say,
                                      start anew :P""")
                        continue
                    elif fun == "yes":
                        counter += 1
                        self._printmd(f"""Boooooooyaaaaah! You got it right! It only took you {counter} attempt(s)!""")
                        break
        print("                           !     !     ! \n\
(          (    *         |V|   |V|   |V|        )   *   )       ( \n\
 )   *      )             | |   | |   | |        (       (   *    ) \n\
(          (           (*******************)    *       *    )    * \n\
(     (    (           (    *         *    )               )    ( \n\
 )   * )    )          (   \|/       \|/   )         *    (      ) \n\
(     (     *          (<<<<<<<<<*>>>>>>>>>)               )    ( \n\
 )     )        ((*******************************))       (  *   ) \n\
(     (   *     ((         HAPPY BIRTHDAY!!!!    ))      * )    ( \n\
 ) *   )        ((   *    *   *    *    *    *   ))   *   (      ) \n\
(     (         ((  \|/  \|/ \|/  \|/  \|/  \|/  ))        )    ( \n\
*)     )        ((^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^))       (      ) \n\
(     (   (**********************************************) )  * (")
            
    def _riddle_for_jan(self):
        self.cost = input("How much does the ball cost? ")
        if "." in self.cost:
            self.cost = float(self.cost)
    
    def _printmd(self, string):
        return display(Markdown(string))

## ToDo's
1. I need some datasets
2. Create a plot where you plot a metric (e.g. accuracy) w.r.t. number of dimensions / number of features that survived
3. Can I log in to Wharton using Jupyter Notebook and can I use their resources?

## Questions
1. What does it mean that data is "embedded in euclidean spaces"?

# Euclidean space

Mister Skyler will edit it!

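A Euclidean space is a space in which the distance between two observations is the length of the straight line connecting them. For two points $x$ and $y$ with $n$ coordinates each, this distance is $d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$. As a consequence, equal differences in the coordinates always correspond to equal distances, no matter where in the space they occur (i.e. the distance between $x_1$ and $x_2$ depends only on the difference $x_1 - x_2$).
- Outline what Euclidean space is and what its properties are.
- Make some plots and maybe an SVM (Euclidean-based) on the data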
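
As a small illustration of the straight-line distance that is usually meant by "Euclidean", the following sketch computes the distance between two observations by hand. The helper name `euclidean_distance` and the two points are made up for this example and are not taken from any dataset in this notebook.

```python
import math

def euclidean_distance(p, q):
    """Straight-line distance between two points of equal dimension."""
    if len(p) != len(q):
        raise ValueError("Both points need the same number of coordinates")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Two hypothetical observations with four features each (illustrative values only)
x1 = [1.0, 2.0, 2.0, 0.0]
x2 = [4.0, 6.0, 2.0, 0.0]

d = euclidean_distance(x1, x2)
print(d)  # 5.0, since sqrt(3**2 + 4**2) = 5

# Shifting both points by the same offset leaves the distance unchanged
shift = 10.0
x1_shifted = [a + shift for a in x1]
x2_shifted = [a + shift for a in x2]
print(euclidean_distance(x1_shifted, x2_shifted))  # 5.0
```

The second call shows the property that matters here: only the difference between two points determines their distance, not where they lie.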

# Non-Euclidean space
In contrast to the Euclidean space described above, in a non-Euclidean space equal differences in the recorded values do not correspond to equal distances. This means that the distance between $x_n$ and $x_{n+1}$ is not necessarily the same as the distance from $x_{n+2}$ to $x_{n+3}$. An example is loudness measured in decibels (dB), which is a logarithmic scale: an increase of 3 dB roughly doubles the sound power, regardless of the starting level.
- dB
- What are the differences
- Plots and maybe an SVM with the same data
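
To make the decibel example concrete, here is a minimal sketch that converts a level difference in dB into the corresponding ratio of sound power, assuming the standard definition of the decibel as ten times the base-10 logarithm of a power ratio. The function name `power_ratio` is made up for this illustration.

```python
def power_ratio(delta_db):
    """Factor by which sound power changes for a level difference given in dB."""
    return 10 ** (delta_db / 10)

# Equal steps on the dB scale multiply the power instead of adding to it
for delta in (3, 10, 20):
    print(f"+{delta} dB -> power x {power_ratio(delta):.2f}")
# +3 dB -> power x 2.00
# +10 dB -> power x 10.00
# +20 dB -> power x 100.00
```

Note that this describes physical sound power; perceived loudness follows a different relationship.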

In [7]:
## Random projections of high-dimensional data
# for database example: digits
# Jan Nagler (adapted, Rosebrock), April 21
from sklearn.svm import LinearSVC
#from RandomProjectionClass import RandomSparseRepresentation

%matplotlib inline

import warnings
warnings.filterwarnings('ignore') # works

# What are the datasets we recommend?
1. @Debasmita https://archive.ics.uci.edu/ml/datasets/Urban+Land+Cover


We will meet tomorrow evening; until then, everybody looks into the write-up and takes a look at the datasets.

In [8]:
data = RandomSparseRepresentation(birthday_version=True)

Dear Jan,

Group 10 (Skyler MacGowan, Sebastian Sydow, Debasmita Dutta and Yannik Suhre) wishes you all the best for your
birthday! We hope you have/had a beautiful day despite these challenging times! As a small birthday present, we have programmed
a little riddle for you. Here you go:

A bat and a ball together cost 1.10€. The bat costs one euro more than the ball. Now our question for you is: how much does
the ball cost? Please enter your answer at the prompt!
            

How much does the ball cost? 0.05
Do you really want to log in 0.05? (yes/no) yes


Boooooooyaaaaah! You got it right! It only took you 1 attempt(s)!

                           !     !     ! 
(          (    *         |V|   |V|   |V|        )   *   )       ( 
 )   *      )             | |   | |   | |        (       (   *    ) 
(          (           (*******************)    *       *    )    * 
(     (    (           (    *         *    )               )    ( 
 )   * )    )          (   \|/       \|/   )         *    (      ) 
(     (     *          (<<<<<<<<<*>>>>>>>>>)               )    ( 
 )     )        ((*******************************))       (  *   ) 
(     (   *     ((         HAPPY BIRTHDAY!!!!    ))      * )    ( 
 ) *   )        ((   *    *   *    *    *    *   ))   *   (      ) 
(     (         ((  \|/  \|/ \|/  \|/  \|/  \|/  ))        )    ( 
*)     )        ((^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^))       (      ) 
(     (   (**********************************************) )  * (


---

Welcome to the interface of **RandomSparseRepresentation**! :)
        
You have now instantiated an object, with which you can create a RandomSparseRepresentation.
In order to do so, please first pick a dataset from this website [UCI ML](https://archive.ics.uci.edu/ml/index.php).

Once you have done so, please call `get_data()` on your object to download that data.
This function takes one required parameter and one optional parameter. The required one is the URL of
the dataset, which you obtain by right-clicking the dataset in its data folder and copying the link.
Should the dataset in the data folder on the UCI website not be a `.csv` but rather a `.data` file,
please also provide the column names as a list, which you can find in the `.names` file in the data folder.

In [9]:
data.get_data("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data",
              names = ["sepal length in cm", "sepal width in cm", "petal length in cm", "petal width in cm", "class"])

You successfully downloaded your dataset to the object!

Now we can go ahead and split the data.
Please call the `split_data()` function for it. You can pass it the `test_size` parameter to split your
data into test and train sets; the default value is `0.3`. Here are the first 5 rows of our data:

| | sepal length in cm | sepal width in cm | petal length in cm | petal width in cm | class |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | Iris-setosa |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | Iris-setosa |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | Iris-setosa |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | Iris-setosa |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | Iris-setosa |


In [10]:
data.split_data()

The first thing we need to do is to determine which column is our target variable.
All column names are therefore printed below.

['sepal length in cm', 'sepal width in cm', 'petal length in cm', 'petal width in cm', 'class']

In the next step, please input the name of the column that contains your target variable.

Please input your target variable here: class


Your data has now been split into a train and a test set with a test share of `0.3`.
This was done by selecting the column `class` as the target column and the rest as independent variables.

In [11]:
data.JL_lemma()

The Johnson-Lindenstrauss (JL) lemma gives the number of columns to which we can reduce our dataset
while approximately preserving the distances between the observations.
The parameter `epsilon` determines the margin within which the distances are preserved, i.e. each squared
pairwise distance changes by a factor of at most (1 ± epsilon).

Our current dataset has 150 observations. According to the JL lemma, we could reduce it to
4294 dimensions.

For a small dataset like this one, the JL bound exceeds the number of columns, so the lemma itself does not
guarantee a reduction here... **Ask group**!

The next step is to define a baseline metric, against which we will evaluate
our algorithm on the reduced dataset later. For this please call the function `baseline()`.
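
The number 4294 printed above can be reproduced by hand. The sketch below assumes the bound given in the scikit-learn documentation for `johnson_lindenstrauss_min_dim`, namely `4 * ln(n_samples) / (eps**2 / 2 - eps**3 / 3)` rounded down; `jl_min_dim` is a hypothetical helper written for this illustration, not the library function itself.

```python
import math

def jl_min_dim(n_samples, eps=0.1):
    """Minimum number of dimensions according to the JL bound, rounded down."""
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    if not 0 < eps < 1:
        raise ValueError("eps must lie strictly between 0 and 1")
    denominator = eps ** 2 / 2 - eps ** 3 / 3
    return int(4 * math.log(n_samples) / denominator)

n_samples = 150  # number of observations reported in the output above
for eps in (0.1, 0.3, 0.5):
    print(f"eps = {eps}: at least {jl_min_dim(n_samples, eps)} dimensions")
# eps = 0.1: at least 4294 dimensions
# eps = 0.3: at least 556 dimensions
# eps = 0.5: at least 240 dimensions
```

The bound depends only on the number of observations and on `eps`, not on the number of columns, which is why it can be far larger than the width of a small dataset.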

In [None]:
data.baseline(model = LinearSVC)

In order to assess the performance of a classifier, it is important to evaluate the algorithm numerically.
For this, a variety of performance measures are available. It is essential to choose an adequate performance measure, as
their applicability and significance depend on the dataset as well as on the specific classification task.
There are a few metrics we can choose from; the names of the corresponding functions (one of which you need to input next) can be viewed
[here](https://scikit-learn.org/stable/modules/model_evaluation.html). For the task at hand, the performance
measures used are either *accuracy* or the $f_1$ *score*.

$$
Accuracy = \frac{True\ Positives + True\ Negatives }{True\ Positives + False\ Positives + True\ Negatives + False\ Negatives}
$$

*Accuracy* measures the performance of a classification model as the number of correct 
predictions divided by the total number of predictions. Its main advantage is its easy interpretability. 
Nevertheless, *accuracy* should only be used for balanced datasets. When dealing with imbalanced datasets,
i.e. when some classes are much more frequent than others, *accuracy* is not a reliable performance measure. 

$$
f_1 = 2 * \frac{Precision * Recall}{Precision + Recall}
$$

The $f_1$ *score* is the harmonic mean of *precision* and *recall*, i.e. it applies equal weight to both.
The $f_1$ *score* is a meaningful evaluation measure for imbalanced datasets. As such, we recommend
choosing `accuracy_score` for balanced datasets and `f1_score` for imbalanced datasets.

Additionally, for imbalanced datasets, i.e. situations in which the `f1_score` is chosen, we need to
distinguish between binary and multi-class classification. For multi-class classification, the
parameter *average* has to be specified, as its default is only applicable if targets are [binary](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.f1_score.html#sklearn.metrics.f1_score).
Four other parameter values are possible: *micro*, *macro*, *weighted* and *samples*. *Samples* is only
meaningful for multilabel classification, which is outside the scope of this assignment. Thus, we will
only examine *micro*, *macro* and *weighted*.

The *macro* $f_1$ *score* is computed as a simple arithmetic mean of the per-class $f_1$ *scores*.
It does not take label imbalance into account.

The *weighted* $f_1$ *score* alters *macro* to account for label imbalance. The weight is applied by 
the number of true instances for each label.

The *micro* $f_1$ *score* is calculated by counting the total true positives, false negatives and false positives over all classes.
Thus, for single-label multi-class classification, the *micro* $f_1$ *score* is equal to the total number of true positives over the total number of observations, i.e. it coincides with *accuracy*.
Further explanations can be found [here](https://scikit-learn.org/stable/modules/model_evaluation.html#multiclass-and-multilabel-classification.).

In conclusion, we recommend choosing `average = weighted` for the performance metric `f1_score` for the
purpose of this assignment, as this accounts for the imbalance in the dataset.

Please enter the metric chosen for our baseline in the following prompt. Be sure to type the function name
exactly, e.g. `accuracy_score` if you want to use `accuracy_score`, and likewise for all other metrics.
<span style="color:red">NEEDS TO BE ADJUSTED: f1-score and type of average; talk with Yannik
Additionally: Discuss Over-, undersampling with group</span>
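
To see how the three averaging modes differ, the following plain-Python sketch computes them from the definitions given above. It is an illustration only and not scikit-learn's implementation; the helper names `per_class_f1` and `f1_averages` as well as the labels are made up.

```python
from collections import Counter

def per_class_f1(y_true, y_pred):
    """Return {label: f1}, treating each label one-vs-rest."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        # 2*tp / (2*tp + fp + fn) is the harmonic mean of precision and recall
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores

def f1_averages(y_true, y_pred):
    """Macro, weighted and micro f1 for single-label classification."""
    scores = per_class_f1(y_true, y_pred)
    support = Counter(y_true)  # number of true instances per label
    n = len(y_true)
    macro = sum(scores.values()) / len(scores)
    weighted = sum(scores[c] * support[c] for c in scores) / n
    # Single-label case: each error is one FP and one FN, so micro f1 equals accuracy
    micro = sum(t == p for t, p in zip(y_true, y_pred)) / n
    return {"macro": macro, "weighted": weighted, "micro": micro}

# Made-up imbalanced example: "a" is far more frequent than "b" and "c"
y_true = ["a"] * 8 + ["b"] + ["c"]
y_pred = ["a"] * 8 + ["a"] + ["c"]  # the single "b" is misclassified as "a"

result = f1_averages(y_true, y_pred)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
# macro: 0.647
# weighted: 0.853
# micro: 0.900
```

The completely missed minority class "b" pulls the macro average down sharply, while the weighted and micro averages are dominated by the frequent class "a".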

In [None]:
data.apply_random_projection()

In [None]:
data.plot()
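
For readers who want to see what a sparse random projection does under the hood, here is a minimal plain-Python sketch. It assumes the construction described in the scikit-learn documentation for `SparseRandomProjection`: with `s = 1 / density`, each matrix entry is `-sqrt(s / n_components)` or `+sqrt(s / n_components)` with probability `1 / (2 * s)` each and `0` otherwise, and the default density is `1 / sqrt(n_features)`. The function names and the data are made up; this is not the library's implementation.

```python
import math
import random

def sparse_random_matrix(n_features, n_components, density=None, seed=0):
    """Draw an n_components x n_features matrix with entries in {-v, 0, +v}."""
    if density is None:
        # Assumed default density, see the lead-in above
        density = 1 / math.sqrt(n_features)
    rng = random.Random(seed)
    value = math.sqrt(1 / (density * n_components))
    matrix = []
    for _ in range(n_components):
        row = []
        for _ in range(n_features):
            u = rng.random()
            if u < density / 2:
                row.append(-value)
            elif u < density:
                row.append(value)
            else:
                row.append(0.0)
        matrix.append(row)
    return matrix

def project(X, matrix):
    """Map every observation in X to the lower-dimensional space."""
    return [[sum(r * x for r, x in zip(row, obs)) for row in matrix] for obs in X]

# Made-up data: 5 observations with 100 features each
data_rng = random.Random(42)
X = [[data_rng.gauss(0, 1) for _ in range(100)] for _ in range(5)]

R = sparse_random_matrix(n_features=100, n_components=20)
X_small = project(X, R)
print(len(X_small), len(X_small[0]))  # 5 20

# Compare one pairwise distance before and after the projection
before = math.dist(X[0], X[1])
after = math.dist(X_small[0], X_small[1])
print(f"distance before: {before:.2f}, after: {after:.2f}")
```

With only 20 components the two distances will generally differ noticeably; the approximation tightens as the number of components grows.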