# Lab 2

In the previous lab, you saw how NumPy works and experimented with it. In this part of the tutorial, you will implement a machine learning algorithm on a simple classification problem. The goal is for you to familiarise yourself with some of the tools and libraries available in Python, including:

**Scikit-Learn** - a very popular machine learning library containing implementations of various classification, regression and clustering algorithms, such as SVM, Random Forests, gradient boosting and k-means. It also ships with a few datasets that are installed together with the package, one of which will be used in this tutorial.

Documentation: https://scikit-learn.org/stable/modules/classes.html

**Matplotlib** - a Python library used to generate visual representations (plots) of data and of mathematical procedures. It has an active community of developers and users, which makes it easy to find tutorials and help with errors. In this lab you will experiment with its basic functionality, but keep in mind that Matplotlib can also be used to generate complex diagrams.

Documentation and extra Tutorials: https://matplotlib.org/stable/tutorials/index

**Pandas** - another Python library, used for data manipulation. The main difference from NumPy is that Pandas is mainly used for data analysis and preprocessing, and less for mathematical operations. The main data structure in Pandas is the *data frame*, which you will see at work in this lab.
Hint: it uses the dictionary structure mentioned in the Lab 1 homework bonus.

Documentation: https://pandas.pydata.org/docs/
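
To make the hint about dictionaries concrete, here is a minimal sketch of a data frame built from a plain Python dictionary. The dictionary `scans` and its values are hypothetical and only serve as an illustration.

```python
import pandas as pd

# Hypothetical toy data: each dictionary key becomes a column name,
# and each list holds that column's values.
scans = {
    "mean radius": [17.99, 20.57, 13.54],
    "mean texture": [10.38, 17.77, 14.36],
}
toy_df = pd.DataFrame(scans)

print(toy_df)
print(list(toy_df.columns))      # column labels come from the dictionary keys
print(toy_df["mean radius"][1])  # select the column first, then the row label
```

Selecting a column with square brackets mirrors looking up a key in a dictionary.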

Radu-Daniel Voit, Last update 31/01/2022

Copyright University of Southampton, 2022. Permission is granted for copies to be made for personal use by University of Southampton students. This content should not be shared or published outside the University of Southampton.


## Problem definition

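We will import the **Breast Cancer Wisconsin** dataset from the Scikit-Learn library. It contains 569 instances. The original dataset lists 32 attributes per instance (an ID, the diagnosis and 30 numeric features); the Scikit-Learn version exposes the 30 numeric features and stores the diagnosis separately as the target. Each instance represents a cancer scan, and the goal of this task is to classify whether an instance represents a benign or a malignant scan.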

**Note:** Although this dataset is not specific to an NLP problem, you will encounter the same structural forms (data with multiple instances and features, represented as a matrix). In NLP, however, the matrix can hold *TF-IDF* or *PPMI* values, as you will see during the module.

First, we import the useful libraries and the data:

In [None]:
# Run this code

import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer

Data = load_breast_cancer()

Please write the line of code that prints the type of the variable *Data*

In [None]:
# Write your code here:


As you can see, its type is defined within the sklearn library. As mentioned in the previous lab, it is important to pay attention to this, as it can lead to errors.

This type in particular works similarly to the default Python dictionary. To see the keys, run the following code:

In [None]:
# Run this code 

Data.keys()

To access the value stored under a key of a dictionary, you can simply use *your_dictionary[key]*. The type defined within sklearn also allows a shorter form: attribute access, as in *Data.data*.
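
As a small sketch of the two access styles, the block below loads the dataset again under a separate name (`bunch`, chosen here only to avoid clashing with *Data*) and compares both forms.

```python
from sklearn.datasets import load_breast_cancer

bunch = load_breast_cancer()  # same loader as above, under a separate name

# Dictionary-style access and attribute-style access return the same object.
by_key = bunch["feature_names"]
by_attribute = bunch.feature_names
print(by_key is by_attribute)

# The available keys can be listed exactly as for a dictionary.
print(sorted(bunch.keys()))
```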

Run the following code and complete it to print the number of datapoints, features and target labels, and check them against the documentation:

In [None]:
# Run and complete the code:

print("The data contains", len(Data.data), "points")
print()
#print("There are", - YOUR CODE HERE -, "features")
print()
#print("There are", - YOUR CODE HERE -, "different target labels")

If you also want to check the names of the categories and whether every instance is labelled, write the code below.

In [None]:
# Write code here


Now, let's create a pandas dataframe instance and fill it with our data. But first, please check the type and shape of *Data.data*:

In [None]:
# Write code here


The data is stored as a numpy array. Let's load it into a pandas dataframe:

In [None]:
# Run this code

# Create pandas dataframe
# pass the feature names as the labels for columns
# check documentation for more parameters of DataFrame

df = pd.DataFrame(Data.data, columns= Data.feature_names)
print(df)

**Note:** Pandas can read data directly from files in formats such as .csv or .json. Consult the documentation for more information.
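
As a minimal sketch of reading a file, the block below parses CSV text with `pandas.read_csv`. The CSV content is hypothetical, and it is passed through `io.StringIO` only so that the example runs without a file on disk.

```python
import io
import pandas as pd

# Hypothetical CSV content; in practice this would be a path such as "scans.csv".
csv_text = """mean radius,mean texture,Target
17.99,10.38,0
13.54,14.36,1
"""

# read_csv accepts a file path or any file-like object.
csv_df = pd.read_csv(io.StringIO(csv_text))
print(csv_df.dtypes)  # column types are inferred from the text
print(csv_df.shape)
```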

## Data analysis

Using the newly created dataframe, we can now test some of the methods associated with it. For instance, if you want to see the first n rows, use the *.head(n)* method, where the default value for n is 5. Run the code below for an example:

In [None]:
# Run this code

df.head()

Now, using the documentation, please write the code that generates statistics about the dataset and provides a concise summary of *df*. 

**Note:** You can see all methods in the DataFrame class here: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html

In [None]:
# Write code here

# Hint: look for describe() and info() methods
# Hint: it is not necessary to pass parameters to the methods


Methods can be applied to specific columns in the data frame as well. Run the code below. What can be done with this information? What type of data (e.g. numerical, categorical) benefits from this kind of function?

In [None]:
# Run this code
df["mean area"].value_counts()
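
For comparison, here is a small sketch of the same method on a hypothetical column of repeated text labels (the series `labels` is made up for illustration).

```python
import pandas as pd

# Hypothetical toy column with a few repeated categories.
labels = pd.Series(["benign", "malignant", "benign", "benign"])

counts = labels.value_counts()                # absolute counts, most frequent first
shares = labels.value_counts(normalize=True)  # relative frequencies instead

print(counts)
print(shares)
```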

Some of the raw datasets that you could work with don't separate the targets from the rest of the data. Or maybe you prefer to have all data in a single dataframe. Complete and run the code below to concatenate the labels column to *df*. Use the documentation for guidance.

In [None]:
# Complete and run the code:

target_df = pd.DataFrame(Data.target, columns=["Target"])
print(target_df)
print()

df = pd.concat(#- your code here-)
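
The effect of the `axis` parameter of `pd.concat` can be sketched on two hypothetical toy frames (`features` and `extra` are illustrative names, not part of the lab data).

```python
import pandas as pd

# Hypothetical toy frames that share the same row index (0, 1, 2).
features = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
extra = pd.DataFrame({"c": [7, 8, 9]})

stacked = pd.concat([features, extra], axis=0)       # appends rows, fills gaps with NaN
side_by_side = pd.concat([features, extra], axis=1)  # appends columns, aligned on the index

print(stacked.shape)
print(side_by_side.shape)
```

Note that `pd.concat` takes the frames as a single list, and that column-wise concatenation aligns rows by index label rather than by position.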

Now, use the previous analysis methods to check whether the targets were correctly added to the frame.

In [None]:
# Write code here


Use the correct method to print how many datapoints are associated with each label.

In [None]:
# Write code here


## Visual representations

Before starting to implement any machine learning model, you will need to thoroughly analyse the data for information on its sparsity, distribution, etc. In the ML community, it is also important to be able to communicate information clearly and concisely, and visual representations are generally preferred for communicating statistical information.

Run the code below. How can you interpret the result?

In [None]:
# Run the following code

# This IPython magic command makes plots render inline, directly below the cell
%matplotlib inline 

import matplotlib.pyplot as plt 

df.hist(bins=30, figsize=(20,15)) 
plt.show()

Modify and play with the parameters if you want, and see how they affect the output.

Now, let's generate a more complex graph. When analysing the data, you might want to see how the labels are distributed across different features.

Let's consider two of our features, *mean radius* and *mean smoothness*, for example. Complete the code below to generate a plot. Play with the different variables such as colors and figsize, and check the documentation for the other methods used.

In [None]:
#Complete and run the code

import matplotlib

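# Add the targets only if an earlier cell has not already done so,
# otherwise df would end up with two "Target" columns
if "Target" not in df.columns:
    df = pd.concat([df, target_df], axis=1)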

df["Target"].value_counts()
target = df["Target"]

colors = ['red', 'blue']

fig = plt.figure(figsize=(10,10))
 

What do you think this plot is missing in order to be considered a good plot for communicating the importance of the selected features?

**Note:** The best practice when plotting is to always add labels for your axes and a legend for your colors.

Look again at the following code (similar to the previous cell). Using the documentation, complete the code to add labels for your axes. Also add a color legend so that it is clear which label is shown in blue and which in red.

In [None]:
# Complete and run the code

import matplotlib

radius = df['mean radius'] 
smooth = df['mean smoothness'] 
target = df['Target'] 
colors = ['red', 'blue']

fig = plt.figure(figsize=(10,10))
scatter = plt.scatter(radius, smooth, c=target, cmap=matplotlib.colors.ListedColormap(colors))

# Hint: check what can be used from matplotlib.pyplot (e.g. legend method) and use the documentation  
# Hint: use parameters such as fontsize in order to make your labels and legend more readable

# Your code here:


**Note** There is more than one way to create and manipulate plots with matplotlib. For example, one can use matplotlib.pyplot.subplots to manipulate multiple subplots in the same figure. Check the documentation for more:
https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.subplots.html
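
A minimal sketch of the `subplots` interface follows. The data is randomly generated for illustration only, and the names `ax_left` and `ax_right` are arbitrary.

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical toy data, generated only for illustration.
rng = np.random.default_rng(0)
x = rng.normal(size=50)
y = rng.normal(size=50)

# One figure holding a 1x2 grid of axes; each subplot is configured
# through its own Axes object.
fig, (ax_left, ax_right) = plt.subplots(nrows=1, ncols=2, figsize=(10, 4))

ax_left.scatter(x, y, color="red", label="toy points")
ax_left.set_xlabel("x")
ax_left.set_ylabel("y")
ax_left.legend()

ax_right.hist(x, bins=10, color="blue")
ax_right.set_xlabel("x")
ax_right.set_ylabel("count")

fig.tight_layout()
```

With this interface, labels are set per subplot through `set_xlabel` and `set_ylabel` rather than through `plt.xlabel` and `plt.ylabel`.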

**Note** Matplotlib support pages usually come with plot examples and their respective code. You can check some of them to familiarise yourself with more designs.

## Data partitioning and Model training


It is important to mention that in most real-world cases, and especially in NLP, the data would need to be preprocessed and "cleaned" of errors, missing values, capped values, etc. You will learn more about this in the context of textual data later in the module.

For now, you will only focus on how to partition the data into training and testing sets using the tools you have just learned.

First, you need to split the dataframe into input *X* and target *Y*. The following code uses the *pandas.DataFrame.iloc* indexer to split the dataframe based on integer positions. This is a good occasion to learn how slices work in Python.

As a simple rule of thumb, for a list l = [0,1,...,n], you can:

    1. Select the first i items with l[:i] (this selects items up to index i-1)
    2. Select a particular portion of the list with l[i:j] (selects items from index i to j-1)
    3. Select all items starting from an index with l[i:] (selects items from index i to the end of the list)
    4. Select items from index i to j-1 with a step s using l[i:j:s]
    5. Select everything using l[:]. Useful when you want to slice only particular dimensions of a multidimensional array
    6. You can also use negative numbers:
        l[-1] - selects the last item
        l[-i:] - selects the last i items
        l[:-i] - selects all except the last i items
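
The slicing rules above can be tried out on a concrete list, and the same syntax carries over to `iloc`. The frame `toy` below is a hypothetical example, not the lab's dataframe.

```python
import pandas as pd

l = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

print(l[:3])     # [0, 1, 2]
print(l[2:5])    # [2, 3, 4]
print(l[7:])     # [7, 8, 9]
print(l[1:8:3])  # [1, 4, 7]
print(l[-1])     # 9
print(l[-2:])    # [8, 9]
print(l[:-2])    # [0, 1, 2, 3, 4, 5, 6, 7]

# The same slice syntax works per dimension with iloc: rows first, then columns.
toy = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})
first_two_rows = toy.iloc[:2, :]  # rows 0 and 1, all columns
middle_column = toy.iloc[:, 1]    # all rows, column at position 1 (a Series)
print(first_two_rows.shape)
print(middle_column.tolist())
```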

Complete and run the code to separate our input data from the labels:

In [None]:
# Complete and run the code:

# Hint: iloc performs selection on both dimensions -> make sure you keep all rows and only select based on the columns
# Hint: in this case, the labels/targets are on the last column.

X = # Your code here
Y = # Your code here

Now, run the next code to perform the train-test split using the scikit-learn  method *train_test_split*. You can read more about it here:
https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html

In [None]:
# Run this code:

from sklearn.model_selection import train_test_split 

X_train, X_test, Y_train, Y_test=train_test_split(X,Y,test_size = 0.2)

**Note:** *train_test_split* shuffles the data by default. To obtain reproducible results, pass a fixed integer seed to the *random_state* parameter.
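
A short sketch of the effect of `random_state`, using hypothetical toy arrays (`X_toy`, `y_toy`) and an arbitrary seed of 42:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples with 2 features each, and alternating labels.
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.array([0, 1] * 5)

# The same integer seed (42 is an arbitrary choice) gives the same shuffle every time.
split_a = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)
split_b = train_test_split(X_toy, y_toy, test_size=0.2, random_state=42)

# Compare the test inputs (second element of each returned list).
same_test_set = np.array_equal(split_a[1], split_b[1])
print(same_test_set)
```

Without `random_state`, two calls would generally produce different partitions, so reported scores would change from run to run.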

What does *test_size = 0.2* mean? Write the code to check the shapes of the resulting variables:

In [None]:
# Write code here:

The data is now split and ready to be fed to a model. As previously mentioned, the scikit-learn library has a multitude of model implementations you can choose from. Considering this classification problem, choose a suitable model (Random Forest, SVM, K-Nearest Neighbours, etc.) and instantiate it in the following code block.

**Note:** Choose your hyperparameters as you like; your goal for now is to implement a functional model. You will deal with the accuracy of models in NLP problems later during the module.

Use the documentation as needed: https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html

In [None]:
# Complete and run the code

from sklearn # -Your code here - 

model = # -Your code here-

Now, fit the model to our data by running the next code:

In [None]:
# Run this code:
model.fit(X_train, Y_train)


Again, using the documentation, generate predictions on the *X_test* set.

In [None]:
# Write code here
prediction = # -Your code here

To verify how well the model performed, choose a metric to measure it by. You can look through *sklearn.metrics* to see the available metric implementations.

Run the following code to see how your model performed using the confusion matrix. Alter the code to use any other metric. Also, depending on the model you have used, you can check its own *.score()* method.

In [None]:
# Run this code:

from sklearn.metrics import confusion_matrix

confusion_matrix(Y_test, prediction)

How do you interpret these results?
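
As a reading aid, here is a sketch on hypothetical hand-picked labels, small enough to count by hand. It also shows `accuracy_score` as one alternative metric from *sklearn.metrics*.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical labels, chosen by hand so the counts are easy to verify.
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]

# Rows are the true classes and columns the predicted classes, in sorted label order.
cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[1 1]
#  [1 2]]

# Accuracy is the share of correct predictions: the diagonal sum over the total.
acc = accuracy_score(y_true, y_pred)
print(acc)  # 0.6
```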

**Note:** For more complex problems, you cannot rely on a single model or a single set of hyperparameters. To test multiple model settings, you can use *sklearn.model_selection.GridSearchCV*.
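
A minimal sketch of a grid search follows. The choice of `KNeighborsClassifier`, the candidate values and the 5-fold setting are illustrative assumptions, not recommendations from this lab.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X_all, y_all = load_breast_cancer(return_X_y=True)

# Candidate values are arbitrary choices for illustration.
param_grid = {"n_neighbors": [3, 5, 7]}

# Each setting is scored with 5-fold cross-validation;
# the default score for a classifier is accuracy.
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X_all, y_all)

print(search.best_params_)
print(search.best_score_)
```

In the lab's workflow you would normally fit the search on *X_train* and *Y_train* only, and keep the test set for the final evaluation.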

## HOMEWORK ##


Having experimented with **scikit-learn**, **matplotlib** and **pandas**, you can now get some practice with a text-based dataset.

The **20newsgroups** is a popular dataset used for experimentation with tasks such as text classification and text clustering. The dataset contains posts/messages from 20 different newsgroups. Each group is focused on a different subject. 

**Documentation:** http://qwone.com/~jason/20Newsgroups/

Run the following code to fetch the dataset using scikit-learn. The dataset is freely available from multiple sources, including the documentation, if you want to experiment with it further. There is also a challenge on Kaggle for it: https://www.kaggle.com/crawford/20-newsgroups

In [None]:
# Run this code:

from sklearn.datasets import fetch_20newsgroups
newsgroups = fetch_20newsgroups(subset = "all")

Look at the content of the dataset. What can you say about the format? How many datapoints are in the set? How is the data labelled (what are the target names)? Use the documentation if needed:
https://scikit-learn.org/0.19/datasets/twenty_newsgroups.html#newsgroups

Hint: *newsgroups* has the same type as *Data*, from the beginning of the lab. 

In [None]:
# Write code here


### Exercise 1

Using any of the Python data structures learned and any available documentation, plot the 20 most frequently occurring words in a histogram.
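
One possible starting point for the counting step is `collections.Counter`, sketched here on a hypothetical three-document corpus. Splitting on whitespace is a deliberately naive way of tokenising, and the plotting is left to you.

```python
from collections import Counter

# Hypothetical toy corpus; splitting on whitespace is a deliberately naive tokeniser.
documents = ["the cat sat", "the dog sat down", "the end"]

word_counts = Counter()
for document in documents:
    word_counts.update(document.lower().split())

# most_common(n) returns (word, count) pairs sorted by decreasing count.
top_two = word_counts.most_common(2)
print(top_two)  # [('the', 3), ('sat', 2)]
```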

In [None]:
# Write code here

### Exercise 2

Calculate the term frequency-inverse document frequency (TF-IDF) of the words in the data. It is a simple way of measuring how representative a word is of a document within a larger set of documents (corpus). Although you will find out more about it during the module, some extra reading never hurts :)

Use the following documentation for guidance: https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

Building on *Exercise 1*, plot the 20 words with the highest TF-IDF scores. Do the results differ from the previous ones? If so, why? Give it some thought.

In [None]:
# Write code here

# Solutions

 1. type(Data)
 
 2. print("There are", len(Data.feature_names), "features")
    print("There are", len(Data.target_names), "different target labels")
    
 3. df.describe()
    df.info()
    
 4. df = pd.concat([df,target_df], axis = 1)
 
 5. df["Target"].value_counts()
 
 6. radius = df["mean radius"]
    smooth = df["mean smoothness"]
    plt.scatter(radius, smooth, c=target, cmap=matplotlib.colors.ListedColormap(colors)) 
 
 
 7. plt.xlabel("Mean Radius")
    plt.ylabel("Mean Smoothness")
    plt.legend(handles= scatter.legend_elements()[0], labels = ["malignant", "benign"], fontsize=10) 
 
 8. X = df.iloc[:, :-1]
    Y = df.iloc[:, -1] 
    
 9. Use .shape
 
 10. Example with Random Forest Classifier:
 
     from sklearn.ensemble import RandomForestClassifier  
     model = RandomForestClassifier(max_depth=10, random_state=0)
      
 11. prediction = model.predict(X_test)
