'''
# (Very Important) Instructions to get started
If you are using *MS Internet Explorer* or *MS Edge* and you experience **malfunctions**, please switch to a different browser.

To **run** or **work** on this [jupyter notebook](https://jupyter-notebook.readthedocs.io/en/stable/notebook.html) you **must** follow these steps:

1. **Save the notebook** to your Google Drive. After saving the notebook, you will be able to run it.

  * Select **File** / **Save a copy in Drive**  from the Colab menu (an example is below, the notebook name may change)
     ![Colab Menu](https://drive.google.com/uc?export=download&id=1-WfIFWuHC6OSJb3iwnR7NqpkXs9tvwO2)
  * If required, log in with your Google account  
  ![Signin Button](https://drive.google.com/uc?export=download&id=1yomWF3t03TiPsrp6AAZDXIFpz5XXTvM1)  (any Google account is fine: you can use the campus account or your own Gmail account)
  

2.  Files are usually saved in the **Colaboratory** *directory*, which is located at the root of your Google Drive. 

3. To open the notebook again:
    * log in to the Google Drive where you saved the notebook: [drive.google.com](http://drive.google.com/)
    * open the **Colaboratory** directory
    * **right click** on the file. From the drop down menu choose **Open With** / **Google Colaboratory**

4. In case of problems, please refer to [this document](https://docs.google.com/document/d/1Y-ABvbOQhMvi7COibLJopL-mnPsaCBv_KRr6eKAsnj8/edit?usp=sharing)

5. When you save your first Colab notebook to Google Drive, a **software extension** is automatically installed; it is required to open or create notebooks in your Google Drive

# Exercises using Colab, Machine Learning, and ...

This is a [jupyter notebook](https://jupyter-notebook.readthedocs.io/en/stable/notebook.html), an interface where several kinds of elements can be mixed, e.g.:

* Cells containing programs (we will focus on the Python programming language),
* Inputs and outputs of the computations,
* Explanatory text, 
* Mathematics, 
* Images,
* Rich media representations of objects
* ...

## Cell Types
We will mostly focus on two cell types:

* Code
* Text 

## Markdown
The content of **text cells** can be written using [markdown syntax](https://jupyter-notebook.readthedocs.io/en/stable/examples/Notebook/Working%20With%20Markdown%20Cells.html). 

Here is a quick reference to the markdown syntax: 

[https://en.support.wordpress.com/markdown-quick-reference/](https://en.support.wordpress.com/markdown-quick-reference/) 


## Creating an empty Colab
Please follow these steps:

* Open the URL [https://drive.google.com/](https://drive.google.com/) in a browser

* Choose **New** / **More** / **Colaboratory**

* Save the notebook in your preferred Google Drive location  


### COLAB Keyboard Shortcuts
To access the COLAB keyboard shortcuts, choose from the menu:

**Tools** / **Keyboard Shortcuts** (on a Mac, use Command instead of Ctrl)

* Ctrl+M M: change a cell into text
* Ctrl+M Y: change a cell into code
* Ctrl+M L: toggle line numbers

'''

In [None]:
######
# Importing the libraries
# They will be described/commented later 

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC, SVC
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score 
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
import time
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin # useful for implementing text preprocessing components that can be pipelined
import nltk # nltk provides some useful algorithms and data for natural language processing
from IPython.display import IFrame # to get a better output on Notebooks

In [None]:
#######
##from IPython.display import IFrame  
#gsheetUrl="https://docs.google.com/spreadsheets/d/1X6zJNKMWgBsu-Zhe1H3x9smHEz2ft2IZRE0QqTlhDZU/edit?usp=sharing"
#display(IFrame(gsheetUrl, width=1000, height=400))

In [None]:
####################################
# Functions to load the dataset

def loadDataFrame():
  # this URL lets us download the data in .csv format
  # note the final part of the URL: ?tqx=out:csv&gid=1
  # ...out:csv makes the data available in .csv format, gid=1 refers to the sheet you edited
  urlCsv = "https://docs.google.com/spreadsheets/d/1X6zJNKMWgBsu-Zhe1H3x9smHEz2ft2IZRE0QqTlhDZU/gviz/tq?tqx=out:csv&gid=1"

  # To get more info on pandas read_csv(), please refer to 
  # https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_csv.html
  df = pd.read_csv(urlCsv, 
              quoting=0,  # quoting=0 (i.e., csv.QUOTE_MINIMAL): the quotes surrounding a field are removed
              header=0,           # column names are in the first row (i.e., row 0)
              usecols=['Prog.', 'Title', 'ID', 'CategoryID', 
                      'CategoryName', 'Description'] # loads only the specified columns       
                     ) 
  return df
  
## In case of emergency

def loadDataFrameTsv():
  # Link to download the file from the browser (through a splash/landing page)
  # browserLinkSolutions="https://drive.google.com/file/d/1CfZHSwElAOHrlgku0x9ybjXAXMwuaqNz/view?usp=sharing" 
  #
  # the next link gives direct access to the file (no splash page) 
  tsvLinkSolutions="https://drive.google.com/uc?export=download&id=1CfZHSwElAOHrlgku0x9ybjXAXMwuaqNz"
  # quoting=0 (i.e., csv.QUOTE_MINIMAL): the quotes surrounding a field are removed
  # header=0: the column names are in the first row (i.e., row 0)
  df = pd.read_csv(tsvLinkSolutions, sep='\t', quoting=0, header=0)
  return df

######
# Let's load the data
# If you get an error running this cell,
# please execute again the cells containing the definition of 
# loadDataFrame() and loadDataFrameTsv()

df = loadDataFrame()
#df = loadDataFrameTsv()

if df is None or df.loc[:, 'CategoryID'].isna().sum() > 50: 
  print('Found a lot of NaN, reading emergency dataset') # print is used to show simple values
  df = loadDataFrameTsv()
df.head()

Unnamed: 0,Prog.,Title,ID,CategoryID,CategoryName,Description
0,1,Senior Accountant,7269055,1,Accountants,A key position has arisen in a long establishe...
1,2,Accounts Semi Senior,1357788,1,Accountants,£15.000 - £18.000 (depending on experience) pl...
2,3,Chef De Partie- Dining Pub-Hungerford,4881772,3,Chef,Chef de Partie Our client is a beautiful dinin...
3,4,Chef Manager,8195384,3,Chef,Advert Information Location: Welwyn Garden Cit...
4,5,Systems Administrator,11094265,2,System Administrators,Systems Administrator A leading global website...
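
'''
The `quoting=0` argument passed to `pd.read_csv()` follows the `csv.QUOTE_*` constants of the Python standard library, where `0` is `csv.QUOTE_MINIMAL`. The minimal sketch below uses only the standard `csv` module (not the pandas parser) and a made-up sample (not the course dataset) to illustrate what happens to a quoted field:

```python
import csv
import io

# Hypothetical sample: the first title contains a comma, so it has to be quoted
raw = 'Title,CategoryID\n"Chef, Dining Pub",3\nSenior Accountant,1\n'

# csv.QUOTE_MINIMAL is the integer 0, the same value used in loadDataFrame()
rows = list(csv.reader(io.StringIO(raw), quoting=csv.QUOTE_MINIMAL))

for row in rows:
    print(row)
```

The surrounding quotes are dropped, while the comma inside the quoted title is kept as part of the field instead of being treated as a separator.
'''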


In [None]:
#####
# Improving data quality

def improveQuality(parDF):
  # Removing NaN
  parDF2 = parDF.dropna().copy() # removes rows with NaN (i.e., null values); copy() gives an independent DataFrame that can be safely modified

  # Converting CategoryID label to int
  parDF2['CategoryID'] = parDF2['CategoryID'].astype(int)

  # Using only records with permittedLabels in CategoryID
  permittedLabels=[1,2,3,4] 
  # .isin() also works well when the values are float while permittedLabels has int elements
  parDF3=parDF2.loc[parDF2.loc[:,'CategoryID'].isin(permittedLabels)]
  
  return parDF3

# Balancing by undersampling
def underSample2Min(df, labelName):
    ''' The dataset is undersampled so that all label groups have the same size,
        equal to the size of the smallest (original) label group.
        The parameter labelName is the DataFrame column hosting the labels'''
        
    vc = df.loc[:,labelName].value_counts() # Counting label frequencies
    lab2freq = dict(zip(vc.index.tolist(), vc.values.tolist()))
    #print(lab2freq) # if you want to see lab2freq, please uncomment this command
    #print(min(lab2freq.values()))
    minfreq = min(lab2freq.values())
    #print(minfreq)
    idxSample=[]
    for selectedLabel, actualFreq in lab2freq.items():
        selIndexes=df.loc[df.loc[:,labelName]==selectedLabel, :].sample(n=minfreq).index.tolist()
        idxSample+=selIndexes
    idxSample.sort()
    #print(type(idxSample), idxSample)
    #print(list(df.index)[:5])
    
    df2 = df.loc[idxSample, :]
    #print(len(idxSample), df2.shape);exit()
    df2 = df2.reset_index() # renumbers the rows from 0 (the old index is kept in a column named 'index'); gaps in the index may otherwise cause problems
    return df2

# Performing quality improvement
df2 = improveQuality(df)
df3 = underSample2Min(df2, 'CategoryID')

# Let's check the category sizes again
print('\nCat. sizes (after undersampling')
print(df3.loc[:,'CategoryID'].value_counts())

xAll = df3.loc[:, 'Title']
yAll = df3.loc[:, 'CategoryID']


# Splitting in train-test
xTrainVec, xTestVec, yTrain, yTest = train_test_split(
    xAll, yAll, # the x and y to be partitioned
    test_size=0.30, # the test set size will be 30% of the original dataset, i.e., the training set size will be 70%
    random_state=0, # random_state is the seed of the random number generator that drives the partition into train and test sets
    stratify=yAll # stratify: keeps the label proportions (approximately) the same in the train and test sets
)

xTrain = list(xTrainVec)
xTest = list(xTestVec)
print('Please ignore any warnings')


Cat. sizes (after undersampling
3    132
2    132
1    132
4    132
Name: CategoryID, dtype: int64
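
'''
The idea behind `underSample2Min()` can also be sketched without pandas. The following is a minimal illustration, not the implementation used above: the function name `undersample_to_min`, the fixed seed, and the toy labels are assumptions made only for this example.

```python
import random
from collections import Counter, defaultdict

def undersample_to_min(labels, seed=0):
    """Return the sorted positions to keep so that every label occurs
    as many times as the rarest label (hypothetical helper)."""
    rng = random.Random(seed)
    positions_by_label = defaultdict(list)
    for position, label in enumerate(labels):
        positions_by_label[label].append(position)
    # size of the smallest label group
    min_size = min(len(p) for p in positions_by_label.values())
    kept = []
    for positions in positions_by_label.values():
        # random sample without replacement from each label group
        kept.extend(rng.sample(positions, min_size))
    return sorted(kept)

toy_labels = [1, 1, 1, 1, 2, 2, 3, 3, 3]
kept = undersample_to_min(toy_labels)
print(Counter(toy_labels[i] for i in kept))
```

Every label group is reduced to the size of the smallest one (two items in the toy data), which mirrors the equal category sizes reported above.
'''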


In [None]:
######
# Simple Pipeline
# tokenization and stop-word filtering are performed using 
# the CountVectorizer built-in functions

# Let's build a pipeline for preprocessing and classifying job vacancy titles
cp = Pipeline([  
   ('vectorizer', CountVectorizer()),
   ('classifier', LogisticRegression() ), # LinearSVC is another classifier you can try
                           ])

# All the configuration values are collected in a single data structure.
# The name of each parameter of a pipeline component must be 
# preceded by the name of the component, 
# separated by 2 underscores "__"
clsfParams = {
   'classifier__C': 0.001,  # C is equivalent to 1/lambda, i.e., the inverse of the regularization strength
   'classifier__multi_class': 'ovr', # one-vs-rest: a strategy to turn a binary classifier into a multi-class classifier
   'vectorizer__stop_words':'english', # you can use [] instead of 'english' if you do not want to use the stop words
   'vectorizer__ngram_range': (1,1), # only unigrams are considered. (1,2) is for both unigrams and bigrams, (1,3) is for unigrams, bigrams and trigrams, ...
}

# The parameters are set NOW, after the pipeline creation.
# This may look strange at first,
# but later you will appreciate it
cp.set_params(**clsfParams) 
cp.fit(xTrain, yTrain)

# Performing the prediction
yPred=cp.predict(xTest)

#Computing the classification report
print('Classification report')
clasRepSt01 = classification_report(yTest,yPred)
print(clasRepSt01)

# Computing the accuracy too
print('Accuracy')
print(accuracy_score(yTest,yPred))

# What will be the accuracy, in your opinion?












#

Classification report
              precision    recall  f1-score   support

           1       0.89      0.62      0.74        40
           2       0.74      0.93      0.82        40
           3       0.93      1.00      0.96        39
           4       0.90      0.88      0.89        40

    accuracy                           0.86       159
   macro avg       0.86      0.86      0.85       159
weighted avg       0.86      0.86      0.85       159

Accuracy
0.8553459119496856
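
'''
The columns of the classification report can be recomputed by hand. The sketch below is a plain-Python illustration on made-up labels (the helper `per_class_scores` and the toy data are assumptions for this example, not part of the notebook); it follows the usual definitions precision = TP / (TP + FP), recall = TP / (TP + FN), and F1 = harmonic mean of precision and recall.

```python
def per_class_scores(y_true, y_pred, label):
    """Precision, recall and F1 for a single label (hypothetical helper)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)  # true positives
    fp = sum(1 for t, p in pairs if t != label and p == label)  # false positives
    fn = sum(1 for t, p in pairs if t == label and p != label)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy data: one item of class 1 is wrongly predicted as class 2
y_true = [1, 1, 1, 2, 2, 2]
y_pred = [1, 1, 2, 2, 2, 2]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
p1, r1, f1_1 = per_class_scores(y_true, y_pred, label=1)
print('accuracy', accuracy)
print('class 1: precision', p1, 'recall', r1, 'f1', f1_1)
```

The support column of the report is simply the number of true items of each class in the test set.
'''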


# Improvements

In [None]:
########################
# Let's customize (and improve) the preprocessing 

# the following classes implement the text-processing steps

class BaseWrapper(BaseEstimator, TransformerMixin):
    """class wrapping a sentence processing function so that it can be used in a sklearn.pipeline.Pipeline"""

    def fit(self, x, y=None): #This method is usually overridden in children classes
        """ This method actually does nothing. 
        It will be overridden by the child classes. 
        In its children implementations this method will perform all the  
        setup activities required before calling either the method transform() 
        or the method predict().
        E.g., a machine learning classifier is trained calling the fit() method, 
        once trained it can be used to classify new elements by calling the method predict()
        """
        return self

    def manageSentence(self, sentence): #This method is usually overridden in children classes
        """Called by transform(). The sentence is expected to be either a string or a list of words; 
        this method can return either a string or a list of words"""
        return sentence 
    
    def transform(self, listOfSentences):
        """ listOfSentences: list of sentences.
        Every sentence can be either a string or a list of words.
        Each sentence is preprocessed using the manageSentence() method.
        Each child class can override the manageSentence() method to implement a specific preprocessing behavior.
        The list of preprocessed sentences is returned."""
        toReturn = []
        for sentence in listOfSentences:
            processedSentence = self.manageSentence(sentence)
            toReturn.append(processedSentence)
        return toReturn
        # using python list comprehension, the above method can be implemented in a single line:
        # return [self.manageSentence(sentence) for sentence in listOfSentences]
        # more details https://towardsdatascience.com/python-basics-list-comprehensions-631278f22c40

class HTMLAccentsReplacer(BaseWrapper):
    def manageSentence(self, sentence):
        """Replace HTML entities for special letters with the corresponding unicode character. 
        E.g., &agrave; is replaced with à.
        Args:
           * sentence (string): the string where the HTML entities should be replaced  """
        # if the parameter is not the right type, the execution is interrupted. This is useful to catch errors
        assert isinstance(sentence, str), "HTMLAccentsReplacer Assertion Error"
        # replaced with the following, more complete, version
        replacemap={u'&Ecirc;': u'\xca', u'&raquo;': u'\xbb', u'&eth;': u'\xf0', u'&divide;': u'\xf7', 
                    u'&atilde;': u'\xe3', u'&AElig;': u'\xc6', u'&frac34;': u'\xbe', u'&nbsp;': u' ', 
                    u'&Auml;': u'\xc4', u'&Ouml;': u'\xd6', u'&Egrave;': u'\xc8', u'&Icirc;': u'\xce', 
                    u'&deg;': u'\xb0', u'&ocirc;': u'\xf4', u'&Ugrave;': u'\xd9', u'&ndash;': u'\u2013', 
                    u'&gt;': u'>', u'&THORN;': u'\xde', u'&aring;': u'\xe5', u'&frac12;': u'\xbd', 
                    u'&frac14;': u'\xbc', u'&Aacute;': u'\xc1', u'&szlig;': u'\xdf', u'&trade;': u'\u2122', 
                    u'&igrave;': u'\xec', u'&aelig;': u'\xe6', u'&times;': u'\xd7', u'&egrave;': u'\xe8', 
                    u'&Atilde;': u'\xc3', u'&Igrave;': u'\xcc', u'&ETH;': u'\xd0', u'&ucirc;': u'\xfb', 
                    u'&lsquo;': u'\u2018', u'&agrave;': u'\xe0', u'&thorn;': u'\xfe', u'&Ucirc;': u'\xdb', 
                    u'&amp;': u'&', u'&uuml;': u'\xfc', u'&yuml;': u'\xff', u'&ecirc;': u'\xea', u'&laquo;': u'\xab', 
                    u'&infin;': u'\u221e', u'&Ograve;': u'\xd2', u'&oslash;': u'\xf8', u'&yacute;': u'\xfd', 
                    u'&plusmn;': u'\xb1', u'&icirc;': u'\xee', u'&auml;': u'\xe4', u'&ouml;': u'\xf6', 
                    u'&Ccedil;': u'\xc7', u'&euml;': u'\xeb', u'&lt;': u'<', u'&eacute;': u'\xe9', 
                    u'&ntilde;': u'\xf1', u'&pound;': u'\xa3', u'&Iuml;': u'\xcf', u'&Eacute;': u'\xc9', 
                    u'&Ntilde;': u'\xd1', u'&rsquo;': u'\u2019', u'&euro;': u'\u20ac', u'&rdquo;': u'\u201d', 
                    u'&Acirc;': u'\xc2', u'&ccedil;': u'\xe7', u'&Iacute;': u'\xcd', u'&quot;': u'"', 
                    u'&Aring;': u'\xc5', u'&Oslash;': u'\xd8', u'&Otilde;': u'\xd5', u'&Uacute;': u'\xda', 
                    u'&reg;': u'\xae', u'&Yacute;': u'\xdd', u'&iuml;': u'\xef', u'&ugrave;': u'\xf9', 
                    u'&alpha;': u'\u03b1', u'&copy;': u'\xa9', u'&ldquo;': u'\u201c', u'&oacute;': u'\xf3', 
                    u'&Euml;': u'\xcb', u'&uacute;': u'\xfa', u'&ograve;': u'\xf2', u'&acirc;': u'\xe2', 
                    u'&aacute;': u'\xe1', u'&Agrave;': u'\xc0', u'&Oacute;': u'\xd3', u'&Uuml;': u'\xdc', 
                    u'&iacute;': u'\xed', u'&cent;': u'\xa2', u'&Ocirc;': u'\xd4', u'&mdash;': u'\u2014', 
                    u'&otilde;': u'\xf5', u'&beta;': u'\u03b2'}
        for before, after in replacemap.items(): # before: the HTML entity, after: its unicode replacement
            sentence=sentence.replace(before, after)
        return sentence 
 
'''
# Required in Python 2, no longer necessary in Python 3
class Str2Unicode(BaseWrapper):
    def manageSentence(self, sentence):
        """Converts raw strings to unicode, to better manage accented letters, money symbols (e.g., pounds)"""
        #print(type(sentence), sentence) # debug print, to be deleted
        # if the parameter is not the right type, the execution is interrupted
        assert type(sentence)==type('') or type(sentence)==type(u''), "Str2Unicode Assertion Error"
        if type(sentence)==type(u''): # Now it should work also with python3
            return sentence
        elif type(sentence)==type(''):
            return sentence.decode('utf-8', errors='strict')  # interpret all raw strings into unicode
        else:
            return sentence
'''

class Tokenizer(BaseWrapper): 
    def manageSentence(self, sentence):
        """This method turns a single document (i.e., a string) into a list of single words (i.e., tokens). 
        The parameter "sentence" is expected to be a string; this method returns a list of strings where each string 
        is a token. This method replaces all the punctuation with spaces. 
        Two or more consecutive spaces are reduced to a single space. 
        Then the string is split into substrings using the spaces as split markers"""
        
        if sentence is None:
            return []
        # if the parameter is not the right type, the execution is interrupted
        assert isinstance(sentence, str), "Tokenizer Assertion Error"
        punteggiatura=u'!{}[]?"",;.:-<>|/\\*=+-_% \n\t\r()'+u"'" +u'\u2019'+u'\u2018' #\r and \n can be used as "new line" 
        # Unicode Character 'RIGHT SINGLE QUOTATION MARK' (U+2019)
        # 
        for l in punteggiatura:
           #print(s)
           sentence=sentence.replace(l,u" ") #replacing all punctuation characters with spaces           

        # loop until all double spaces are removed
        while sentence.find(u"  ")!=-1:
            sentence=sentence.replace(u"  ",u" ")  #replacing double spaces with a single one
        return sentence.split(u' ')   #e.g., "a b c d".split(' ')  returns ['a','b','c','d']

class LowerCaseReducer(BaseWrapper): 
    def manageSentence(self, sentence):
        """sentence is expected to be a list of words (each item is a string); 
        this method returns a list of strings where each string is the lower case version of the original word"""
        # preliminary check over the input data type
        assert type(sentence)==type([]), "LowerCaseReducer, Assertion Error" 
        # The next line uses a python trick called List Comprehensions. 
        # More details about List Comprehension on http://www.pythonforbeginners.com/basics/list-comprehensions-in-python
        return [w.lower() for w in sentence] 
        # builds a new list, where each word of the original list is turned into a lower case string 


class EnglishStopWordsRemover(BaseWrapper):
    def getStopWords(self):
        """This method returns a list of English stop words. Stop words can be added to the list"""
        return [u'i', u'me', u'my', u'myself', u'we', u'our', u'ours', u'ourselves', 
                u'you', u'your', u'yours', u'yourself', u'yourselves', 
                u'he', u'him', u'his', u'himself', u'she', u'her', u'hers', u'herself', 
                u'it', u'its', u'itself', u'they', u'them', u'their', u'theirs', u'themselves', 
                u'what', u'which', u'who', u'whom', u'this', u'that', u'these', u'those', 
                u'am', u'is', u'are', u'was', u'were', u'be', u'been', u'being', 
                u'have', u'has', u'had', u'having', u'do', u'does', u'did', u'doing', 
                u'a', u'an', u'the', u'and', u'but', u'if', u'or', u'because', u'as', u'until', u'while', 
                u'of', u'at', u'by', u'for', u'with', u'about', u'against', u'between', u'into', u'through', 
                u'during', u'before', u'after', u'above', u'below', u'to', u'from', u'up', u'down', u'in', u'out', 
                u'on', u'off', u'over', u'under', u'again', u'further', u'then', u'once', u'here', u'there', u'when', 
                u'where', u'why', u'how', u'all', u'any', u'both', u'each', u'few', u'more', u'most', u'other', 
                u'some', u'such', u'no', u'nor', u'not', u'only', u'own', u'same', u'so', u'than', u'too', u'very', 
                u's', u't', u'can', u'will', u'just', u'don', u'should', u'now', u'd', u'll', u'm', u'o', u're', 
                u've', u'y', u'ain', u'aren', u'couldn', u'didn', u'doesn', u'hadn', u'hasn', u'haven', u'isn', 
                u'ma', u'mightn', u'mustn', u'needn', u'shan', u'shouldn', u'wasn', u'weren', u'won', u'wouldn']
    
    def manageSentence(self, sentence):
        """sentence is expected to be a list of words (a list where each item is a string containing a single word), 
        this method returns the input list where the stop words are removed """
        assert type(sentence)==type([]) , "EnglishStopWordsRemover, Assertion Error" 
        stopWords = self.getStopWords()
        return [w for w in sentence if w not in stopWords]
    
class EnglishStemmer(BaseWrapper):
    def __init__(self):
        """Load the NLTK English stemmer. A stemmer is an algorithm that reduces a word to its stem 
        e.g., "books" is reduced to "book", "looking" is reduced to "look". """
        self.st = nltk.stem.SnowballStemmer("english") # loading the NLTK stemmer
    def  manageSentence(self, sentence):
        """sentence is expected to be a list of words (a list where each item is a string containing a single word), 
        this method returns a list of stemmed words"""
        assert type(sentence)==type([]), "EnglishStemmer, Assertion Error" 
        return [self.st.stem(w) for w in sentence]

# Here is an example of stemming
#es = EnglishStemmer()
#wt = Tokenizer()
#res=es.transform(wt.transform(["we are looking for some new cars", "having better performances"]))
#print(res)
# [[u'we', u'are', u'look', u'for', u'some', u'new', u'car'], [u'have', u'better', u'perform']]    


class RemoveNumbers(BaseWrapper):
    def manageSentence(self, sentence):
        """Sentence is expected to be a list of words (a list where each item is a string containing a single word), 
        this method returns the input list where the numbers are removed. """
        assert type(sentence)==type([]), "RemoveNumbers, Assertion Error" 
        return [w for w in sentence if not w.isdigit()]

class RemoveEmptyWords(BaseWrapper):
    def manageSentence(self, sentence):
        """Sentence is expected to be a list of words (a list where each item is a string containing a single word), 
        this method returns the input list where the empty words are removed """
        assert type(sentence)==type([]), "RemoveEmptyWords, Assertion Error" 
        return [w for w in sentence if w != '']

# No longer required. Useful for previous versions of CountVectorizer 
class Bag2Text(BaseWrapper):
    def manageSentence(self, sentence):
        """sentence is expected to be a list of words (a list where each item is a string containing a single word); 
        this method returns a single string obtained by joining the words and separating them with a space"""
        assert type(sentence)==type([]), "Bag2Text, Assertion Error" 
        # Next line builds a string by joining with spaces all the elements of sentence
        return u' '.join(sentence)  

def unityFunction(x):
  """This function returns the same object received as input. 
  For advanced pythonists: equivalent to lambda x:x """
  return x
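
'''
Several of the steps implemented by the classes above can also be approximated with the Python standard library alone. The sketch below is an illustrative alternative, not the code used by the pipeline: `preprocess_title`, `PUNCT`, and `TO_SPACE` are hypothetical names, and the sample titles are made up. It relies on `html.unescape()` to decode HTML entities and on `str.split()` without arguments, which collapses consecutive whitespace and never returns empty strings.

```python
import html

# Same punctuation characters used by the Tokenizer class above
PUNCT = '!{}[]?",;.:-<>|/\\*=+_% \n\t\r()' + "'" + '\u2019\u2018'
TO_SPACE = str.maketrans({ch: ' ' for ch in PUNCT})

def preprocess_title(title):
    """Hypothetical helper: entity decoding, tokenization,
    lower-casing and number removal in a single function."""
    text = html.unescape(title)                 # e.g., '&eacute;' becomes 'é'
    tokens = text.translate(TO_SPACE).split()   # punctuation -> spaces, then split on whitespace
    tokens = [w.lower() for w in tokens]
    return [w for w in tokens if not w.isdigit()]

print(preprocess_title('Chef De Partie- Dining Pub-Hungerford'))
print(preprocess_title('Caf&eacute; Manager, 2 positions'))
```

One difference worth noting: `html.unescape()` decodes `&nbsp;` into a non-breaking space, whereas the replacement map above turns it into a plain space. Stop-word removal and stemming are left out of this sketch.
'''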

In [None]:
######
# 

# Let's build a pipeline for preprocessing and classifying job vacancy titles
cp = Pipeline([  
   #('Str2Unicode', Str2Unicode()), # no longer required in Python 3
   ('HTMLAccentsReplacer', HTMLAccentsReplacer() ),
   ('Tokenizer', Tokenizer() ),
   ('LowerCaseReducer', LowerCaseReducer() ),
   ('StopWordsRemover', EnglishStopWordsRemover() ),
   ('Stemmer', EnglishStemmer() ),
   ('RemoveNumbers', RemoveNumbers() ),
   ('RemoveEmptyWords', RemoveEmptyWords() ),
   #('Bag2Text', Bag2Text() ),
   ('vectorizer', CountVectorizer()),
   ('classifier', LogisticRegression() ), # LinearSVC
                           ])

# Collecting all the param values in a single data structure. 
# The parameter keyword should be composed as follows: pipeline_component_name + '__' + param name. 
clsfParams = {
   'classifier__C': 0.001, 
   #'classifier__multi_class': 'ovr', # one-vs-rest: a strategy to turn a binary classifier into a multi-class classifier
   'vectorizer__preprocessor': unityFunction, # since we provided a customized preprocessing pipeline, we turn off the usual preprocessing pipeline
   'vectorizer__tokenizer': unityFunction, # Same as above. 
   'vectorizer__ngram_range': (1,1),
}

cp.set_params(**clsfParams)
cp.fit(xTrain, yTrain)

yPred=cp.predict(xTest)
print('Classification report')
clasRepSt02 = classification_report(yTest,yPred)
print(clasRepSt02)
print('Accuracy')
print(accuracy_score(yTest,yPred))

# What will be the accuracy, in your opinion?



Classification report
              precision    recall  f1-score   support

           1       0.90      0.70      0.79        40
           2       0.74      0.93      0.82        40
           3       0.95      1.00      0.97        39
           4       0.95      0.88      0.91        40

    accuracy                           0.87       159
   macro avg       0.89      0.88      0.87       159
weighted avg       0.88      0.87      0.87       159

Accuracy
0.8742138364779874
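
'''
The naming rule used in `clsfParams` (component name + `__` + parameter name) can be illustrated with a few lines of plain Python. This sketch only mimics the convention; it is not scikit-learn's actual implementation, and `route_params` is a hypothetical helper introduced for this example.

```python
def route_params(params):
    """Group 'component__parameter' keys by component name (illustrative only)."""
    routed = {}
    for key, value in params.items():
        # split at the first double underscore
        component, separator, name = key.partition('__')
        if not separator:
            raise ValueError("missing '__' separator in %r" % key)
        routed.setdefault(component, {})[name] = value
    return routed

clsf_params = {
    'classifier__C': 0.001,
    'vectorizer__ngram_range': (1, 1),
    'vectorizer__min_df': 2,
}
routed = route_params(clsf_params)
print(routed)
```

Each pipeline step receives only the parameters whose prefix matches its name, which is why the names chosen when building the `Pipeline` ('vectorizer', 'classifier') must be reused exactly in the keys.
'''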


In [None]:
######
# Grid Search

# Building the pipeline
cp = Pipeline([  
   #('Str2Unicode', Str2Unicode()),
   ('HTMLAccentsReplacer', HTMLAccentsReplacer() ),
   ('Tokenizer', Tokenizer() ),
   ('LowerCaseReducer', LowerCaseReducer() ),
   ('StopWordsRemover', EnglishStopWordsRemover() ),
   ('Stemmer', EnglishStemmer() ),
   ('RemoveNumbers', RemoveNumbers() ),
   ('RemoveEmptyWords', RemoveEmptyWords() ),
   #('Bag2Text', Bag2Text() ),
   ('vectorizer', CountVectorizer()),
   ('classifier', LogisticRegression() ),  # LinearSVC
                           ])

# Setting for each parameter the value space (i.e., the set of values to evaluate)
paramSpace = {
   'classifier__C': [0.001, 0.01, 0.1, 1, 10, 100], # Values for grid search should be enclosed by []
   'classifier__multi_class': ['ovr'], # default value, otherwise a lot of warnings are raised. One-vs-rest: a strategy to turn a binary classifier into a multi-class classifier
   'classifier__solver': ['liblinear'],
   'classifier__class_weight': [None, 'balanced'], # if the classes were imbalanced, we could try this approach
   'vectorizer__preprocessor': [unityFunction], # since we provided a customized preprocessing pipeline, we turn off the usual preprocessing pipeline
   'vectorizer__tokenizer': [unityFunction], # Same as above. 
   'vectorizer__ngram_range': [(1,1), (1,2), (1,3)],   #
   'vectorizer__max_df': [0.7],  # If a term appears in more than 70% of the documents, it is too frequent to be discriminative
   'vectorizer__min_df': [2, 4], # Minimum number of documents where the term should appear (otherwise it won't be included in the vocabulary)
} # a Python list [] is mandatory even when it holds a single value

start_time = time.time()
# cv=4: k-fold cross-validation, with k=4
gs = GridSearchCV(cp, param_grid=paramSpace, scoring='accuracy', cv=4)
gs.fit(xTrain,yTrain)
# now show the elapsed time
print("--- %s seconds ---" % (time.time() - start_time))
print(gs.best_params_)
print('Scoring result')
print(gs.best_score_)

# Please, ignore the warnings

--- 33.91047286987305 seconds ---
{'classifier__C': 1, 'classifier__class_weight': None, 'classifier__multi_class': 'ovr', 'classifier__solver': 'liblinear', 'vectorizer__max_df': 0.7, 'vectorizer__min_df': 4, 'vectorizer__ngram_range': (1, 1), 'vectorizer__preprocessor': <function unityFunction at 0x7f0addf60b90>, 'vectorizer__tokenizer': <function unityFunction at 0x7f0addf60b90>}
Scoring result
0.9105890603085554
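
'''
To get a feeling for the cost of the grid search, the number of parameter combinations can be counted with `itertools.product`. The sketch below copies the value lists of `paramSpace`, leaving out the two function-valued entries (each holds a single value, so the count is unaffected); the variable names are assumptions made for this example.

```python
from itertools import product

# Same value lists as paramSpace; single-valued function entries are omitted
param_space = {
    'classifier__C': [0.001, 0.01, 0.1, 1, 10, 100],
    'classifier__multi_class': ['ovr'],
    'classifier__solver': ['liblinear'],
    'classifier__class_weight': [None, 'balanced'],
    'vectorizer__ngram_range': [(1, 1), (1, 2), (1, 3)],
    'vectorizer__max_df': [0.7],
    'vectorizer__min_df': [2, 4],
}

# Every combination takes one value from each list
combinations = list(product(*param_space.values()))
n_folds = 4  # the cv value used above
n_fits = len(combinations) * n_folds
print(len(combinations), 'combinations,', n_fits, 'cross-validation fits')
```

With the default `refit=True`, `GridSearchCV` additionally fits the best combination once more on the whole training set.
'''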


'''
The cross-validated accuracy above may be lower than the accuracy computed on the test set. One reason is that, with 4-fold cross-validation, each model is trained on only three of the four folds, i.e., on a training set smaller than the one previously used
'''
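
'''
The effect of cross-validation on the training set size can be made concrete with a little arithmetic. The numbers below are taken from the outputs shown in this notebook (4 categories of 132 items each, and a total support of 159 in the classification reports); the fold sizes are approximate, since the exact values depend on how scikit-learn rounds and stratifies the folds.

```python
n_total = 4 * 132            # balanced dataset size after undersampling
n_test = 159                 # total support in the classification reports
n_train = n_total - n_test   # items available to the grid search
k = 4                        # number of folds (cv=4)

# In each round about 1/k of the training set is held out for validation
# and the model is fitted on the remaining part
validation_size = n_train / k
fit_size = n_train - validation_size

print('training set:', n_train)
print('items used to fit each cross-validation model: about', round(fit_size))
```

Each model evaluated during the grid search is therefore fitted on roughly three quarters of the items used by the final `cp.fit(xTrain, yTrain)` call.
'''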

In [None]:
######
# 

# Doing classification again, using the best parameters, as selected by Grid Search
clsfParams = {
   'classifier__C': 1, 
   'classifier__multi_class': 'ovr', # default value, otherwise a lot of warnings are raised. One-vs-rest: a strategy to turn a binary classifier into a multi-class classifier
   'classifier__solver': 'liblinear',
   'vectorizer__preprocessor': unityFunction, # since we provided a customized preprocessing pipeline, we turn off the usual preprocessing pipeline
   'vectorizer__tokenizer': unityFunction,
   'vectorizer__ngram_range': (1,1),
   'vectorizer__min_df': 4,
   'vectorizer__max_df': 0.7,
}
cp.set_params(**clsfParams)
cp.fit(xTrain, yTrain)
yPred=cp.predict(xTest)
print('Classification report')
print(classification_report(yTest,yPred))
print('Accuracy')
print(accuracy_score(yTest,yPred))

# What will be the accuracy, in your opinion?




Classification report
              precision    recall  f1-score   support

           1       0.72      0.97      0.83        40
           2       1.00      0.70      0.82        40
           3       0.95      1.00      0.97        39
           4       0.97      0.88      0.92        40

    accuracy                           0.89       159
   macro avg       0.91      0.89      0.89       159
weighted avg       0.91      0.89      0.89       159

Accuracy
0.8867924528301887


In [None]:
print('Let us show the first classification report again') 
print(clasRepSt01)

Let us show the first classification report again
              precision    recall  f1-score   support

           1       0.89      0.62      0.74        40
           2       0.74      0.93      0.82        40
           3       0.93      1.00      0.96        39
           4       0.90      0.88      0.89        40

    accuracy                           0.86       159
   macro avg       0.86      0.86      0.85       159
weighted avg       0.86      0.86      0.85       159



# Link

This notebook can be downloaded from

[https://colab.research.google.com/drive/1KKTrTlGmp51HcvjjyoINJLbd2av-v-SF?usp=sharing](https://colab.research.google.com/drive/1KKTrTlGmp51HcvjjyoINJLbd2av-v-SF?usp=sharing)