# 2019-06-20 Exam

### General Instructions:

Welcome to the **Python Programming (for Data Science)** exam session! Please read the instructions below **carefully** before you start writing code.

This session will last **75 minutes** and is divided into **two parts**: one about "general" Python programming and the other about Python programming for Data Science. Each part consists of a set of exercises, which together account for **16** + **16** = **32 points**.
You will earn all of the points associated with an exercise **if and only if** the answer you provide successfully passes **all** the tests (both those that are visible and those that are hidden from you).<br />

To actually write down your implementation, make sure to fill in any place that says <code style="color:green">**_# YOUR CODE HERE_**</code>. Note also that you should **either comment out or delete** any <code style="color:green">**raise NotImplementedError()**</code> statement.<br />

For this exam session **you will not be allowed** to use any lecture material; however, you will be able to access the following API references:

-  [Python](https://docs.python.org/3.6/library/index.html)
-  [Numpy](https://docs.scipy.org/doc/numpy-1.13.0/reference/)
-  [Scipy](https://docs.scipy.org/doc/scipy-1.0.0/reference/)
-  [Pandas](https://pandas.pydata.org/pandas-docs/version/0.22/api.html)
-  [Matplotlib](https://matplotlib.org/2.1.1/api/index.html)
-  [Seaborn](http://seaborn.pydata.org/api.html)

Once you are done, save this notebook and rename it as follows:

<code>**YOURUSERNAME_2019-06-20.ipynb**</code>

where <code>**YOURUSERNAME**</code> is your actual username. To be consistent, we expect your username to be composed of the initial of your first name, followed by your full last name. As an example, in my case this notebook must be saved as <code>**gtolomei_2019-06-20.ipynb**</code> (remember to insert an underscore <code>**'_'**</code> between your username and the date).<br />

Finally, go back to the [Moodle](https://elearning.studenti.math.unipd.it/esami/mod/assign/view.php?id=528) web page of the "**2019-06-20 Python Programming Exam**"; there, you will be able to upload your notebook file for grading.

<center><h3>Submissions are allowed until <span style="color:red">Thursday, 20 June 2019 at 10:45 AM</span></h3></center>

Note that there is no limit on the number of submissions; however, be careful when you upload a new version of this notebook because each submission overwrites the previous one. 
The due date indicated above is **strict**; after that, the system will not accept any more submissions and the latest uploaded notebook will be the one considered for grading.

The archive you have downloaded (<code style="color:magenta">**2019-06-20-exam.tar**</code>) is organized as follows:

<code style="color:red">**2019-06-20-exam**</code> (root)<br />
|----<code style="color:green">**2019-06-20.ipynb**</code> (_this_ notebook)<br />
|----<code>**corpus.txt**</code> (the text corpus you will be using for answering general Python programming questions)<br />
|----<code>**dataset.csv**</code> (the dataset you will be using for answering data science related questions)<br />
|----<code>**README.txt**</code> (a description of the dataset above)

<center><h3>... Now, sit back, relax, and do your best!</h3></center>

**First Name** = Your _first name_ here

**Last Name** = Your _last name_ here

In [None]:
import math
import string
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Adding the following line allows Jupyter Notebook to display plots
# produced by matplotlib directly below the code cell that generated them.
%matplotlib inline
import seaborn as sns
from collections import Counter
from nose.tools import assert_equal
from operator import itemgetter

EPSILON = .0000001 # tiny tolerance for managing subtle differences resulting from floating point operations

TEXT_CORPUS_FILE = "corpus.txt"
DATASET_FILE = "dataset.csv"

# Part 1: General Coding (16 points)

For **Part 1**, you will be asked to use the list below - called <code>**corpus**</code> - which contains a list of text documents, where each document is represented by a lowercase string with no punctuation character whatsoever.<br /> 
Please, execute the cell right below to successfully load those documents into <code>**corpus**</code>, see a few sample documents, and then answer the following questions.

In [None]:
# used to replace any punctuation symbol with an empty character ('')
translator = str.maketrans('', '', string.punctuation)
# load each individual document as a lowercase string into the list of strings `corpus`
corpus = [" ".join(doc.strip().lower().translate(translator).split()) for doc in open(TEXT_CORPUS_FILE)]
# print out the first 5 documents loaded
print("The following are the first 5 documents loaded out of a total of {} documents:\n".format(len(corpus)))
print("\n".join(corpus[:5]))

## Exercise 1.1 (1 point)

Implement the function <code>**longest_doc_id**</code>, which returns the **id** of the longest document in the <code>**corpus**</code>.<br />
We define the _length_ of a document as the number of tokens the document string is made of; a _token_ is any substring separated from the others by a **whitespace character**, i.e., <code>**" "**</code>.

(**EXAMPLE:** If the document you are working with is the string <code>**"I think therefore I am"**</code>, then the corresponding tokens will be: <code>**"I"**</code>, <code>**"think"**</code>, <code>**"therefore"**</code>, <code>**"I"**</code>, and <code>**"am"**</code>, so the length of this document will be **5**.)
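
The token-based length defined above can be sketched on a toy corpus. This is only an illustration: `toy_corpus` and `doc_length` are hypothetical names that do not appear in the notebook, and the three strings are made up rather than taken from `corpus.txt`.

```python
# Toy corpus (hypothetical, not taken from corpus.txt).
toy_corpus = [
    "i think therefore i am",
    "to be or not to be that is the question",
    "hello world",
]


def doc_length(doc):
    """Number of tokens obtained by splitting the document on single spaces."""
    return len(doc.split(" "))


lengths = [doc_length(doc) for doc in toy_corpus]
# Position of the first document whose length is maximal.
longest_id = max(range(len(toy_corpus)), key=lengths.__getitem__)
print(lengths, longest_id)
```

If several documents share the maximal length, this sketch picks the one with the smallest id.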

In [None]:
def longest_doc_id():
    """
    Returns the id of the longest document among all the documents in the `corpus`.
    """
    ### BEGIN SOLUTION
    return np.argmax([len(doc.split(" ")) for doc in corpus])
    ### END SOLUTION

In [None]:
"""
Test the correctness of the implementation of the `longest_doc_id` function
"""

# Tests
assert_equal(True, longest_doc_id() > 12)
assert_equal(True, longest_doc_id() < 246)
### BEGIN HIDDEN TESTS
assert_equal(True, longest_doc_id() == 29)
### END HIDDEN TESTS

## Exercise 1.2 (3 points)

Implement the function <code>**doc_stats**</code>, which returns a dictionary where each key is a <code>**doc_id**</code> and each value is a tuple containing the <code>**min**</code>, <code>**max**</code>, <code>**mean**</code>, and <code>**standard deviation**</code> (in this exact order) of the number of characters of the words which _that_ <code>**doc_id**</code> is made of.<br />
For example, if the document is <code>"I have been to Chargoggagoggmanchauggagoggchaubunagungamaugg lake last summer"</code>, then:
-  <code>**min = 1**</code>
-  <code>**max = 45**</code>
-  <code>**mean = 8.75**</code>
-  <code>**std_dev = 13.77**</code>

(**NOTE:** The _Chargoggagoggmanchauggagoggchaubunagungamaugg_ lake truly exists, and is located in Webster, Massachusetts, USA)
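
The figures in the example above can be reproduced with the standard library alone. The sketch below assumes the standard deviation is the *population* one (dividing by $N$), which is what yields 13.77 here and what `np.std` computes by default; the variable names are illustrative.

```python
import statistics

doc = "I have been to Chargoggagoggmanchauggagoggchaubunagungamaugg lake last summer"
word_lengths = [len(word) for word in doc.split(" ")]

stats = (
    min(word_lengths),
    max(word_lengths),
    statistics.mean(word_lengths),
    statistics.pstdev(word_lengths),  # population std (divides by N)
)
print(stats)
```

Using the sample standard deviation instead (dividing by $N-1$) would give a larger value than the 13.77 listed above.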

In [None]:
def doc_stats():
    """
    Returns a dictionary where each key is a `doc_id` and each value is a tuple containing 
    the min, max, mean, and standard deviation of the number of characters of each word for that document.
    """
    doc_stats = {} # This is the variable that needs to be returned
    ### BEGIN SOLUTION
    for doc_id, doc in enumerate(corpus):
        doc_word_n_chars = [len(word) for word in doc.split(" ")]
        doc_stats[doc_id] = (np.min(doc_word_n_chars), 
                             np.max(doc_word_n_chars),
                             np.mean(doc_word_n_chars),
                             np.std(doc_word_n_chars)
                            )
    ### END SOLUTION
    return doc_stats

In [None]:
"""
Test the correctness of the implementation of the `doc_stats` function
"""

# Call the function implemented above
stats = doc_stats()

# Tests
assert_equal(2, stats[0][0])
assert_equal(11, stats[0][1])
assert_equal(True, np.abs(7.25 - stats[0][2]) < EPSILON)
assert_equal(True, np.abs(3.2691742076555053 - stats[0][3]) < EPSILON)
### BEGIN HIDDEN TESTS
assert_equal(2507, len(stats))
assert_equal(1, stats[41][0])
assert_equal(15, stats[41][1])
assert_equal(True, np.abs(6.45454545455 - stats[41][2]) < EPSILON)
assert_equal(True, np.abs(3.79865134832 - stats[41][3]) < EPSILON)
### END HIDDEN TESTS

## Exercise 1.3 (5 points)

Implement the function <code>**words_smaller_than**</code>, which takes as input two integers <code><b>k</b></code> and <code>**doc_id**</code>, and returns the number of **unique** words in <code>**doc_id**</code> whose length (i.e., number of characters) is **strictly smaller** than <code><b>k</b></code>.<br />
By convention, documents are identified by their index position in the <code>**corpus**</code> list, therefore the first element of the list will correspond to <code>**doc_id = 0**</code>, the second to <code>**doc_id = 1**</code>, and so on and so forth. As such, if the input <code>**doc_id**</code> is outside of its valid range $[0, N-1]$ (where $N$ is the total number of documents in the <code>**corpus**</code> list), the function should immediately return <code>**-1**</code>.
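
A minimal sketch of the range check and the counting step on a toy corpus follows. It assumes each distinct word is counted once, as the function's docstring states; `toy_corpus` and `count_short_unique_words` are hypothetical names used only for this illustration.

```python
# Toy corpus (hypothetical, not taken from corpus.txt).
toy_corpus = ["to be or not to be", "hello world"]


def count_short_unique_words(k, doc_id, docs):
    """Count distinct words of docs[doc_id] that are shorter than k characters."""
    # Reject ids outside the valid range [0, N-1] before touching the list.
    if doc_id < 0 or doc_id >= len(docs):
        return -1
    unique_words = set(docs[doc_id].split(" "))
    return sum(len(word) < k for word in unique_words)


print(count_short_unique_words(3, 0, toy_corpus))
print(count_short_unique_words(3, 2, toy_corpus))
```

The explicit check matters because a negative index such as `-1` is valid for a Python list and would otherwise silently select the last document.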

In [None]:
def words_smaller_than(k, doc_id):
    """
    Returns the number of unique words in `doc_id` whose length is strictly smaller than `k` characters. 
    If the input `doc_id` is outside of its valid range [0, N-1] (where N is the total number of documents in the corpus list), 
    the function should immediately return -1
    """
    ### BEGIN SOLUTION
    if doc_id < 0 or doc_id >= len(corpus):
        return -1
    return sum(i < k for i in [len(w) for w in list(set(corpus[doc_id].split(" ")))])
    ### END SOLUTION

In [None]:
"""
Test the correctness of the implementation of the `words_smaller_than` function
"""

# Tests
assert_equal(2, words_smaller_than(3, 0))
assert_equal(0, words_smaller_than(1, 31))
assert_equal(10, words_smaller_than(11, 42))
assert_equal(8, words_smaller_than(8, 73))
### BEGIN HIDDEN TESTS
assert_equal(-1, words_smaller_than(5, -3))
assert_equal(14, words_smaller_than(13, 23))
assert_equal(6, words_smaller_than(12, 26))
### END HIDDEN TESTS

## Exercise 1.4 (7 points)

In this exercise, you will train a simplified *unigram language model* **for each** document contained in the <code><b>corpus</b></code> collection. Having one model for each document, we will in turn use it to compute a ranked list of the documents which are most likely "similar" to a given input query (i.e., a string).<br />
In a nutshell, the *unigram language model* requires you to compute the probability associated with each word in the document, **disregarding any other surrounding words** (i.e., no contextual information is used to estimate such a probability). More specifically, assuming $V = \{w_1, \ldots, w_{N_d}\}$ is the vocabulary of $N_d$ words extracted from the document $d$, then its associated model $M_d$ will be computed as follows:

$$
M_d = [P(w_1), \ldots, P(w_{N_d})]
$$
where:
$$
P(w_i) = \frac{\text{count}(w_i, d)}{\sum_{w_j\in d}\text{count}(w_j, d)}
$$

Eventually, we can train several models $M_1, \ldots, M_N$, i.e., one for each document in our corpus.
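
As a worked instance of the formula for $P(w_i)$, the sketch below builds the model of a single made-up document. The function name `unigram_model` and the sample sentence are assumptions for illustration only.

```python
from collections import Counter


def unigram_model(doc):
    """Map each distinct word of `doc` to its relative frequency in `doc`."""
    words = doc.split(" ")
    counts = Counter(words)
    total = len(words)  # equals the sum of all word counts
    return {word: count / total for word, count in counts.items()}


model = unigram_model("the cat sat on the mat")
print(model)
```

Here "the" occurs twice among six tokens, so its probability is 2/6, while every other word gets 1/6; the probabilities of a model always sum to 1.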

Once all those models are in place, you will be able to implement the function <code><b>get_most_similar_docs</b></code>, which takes as input a string <code><b>query</b></code> and an integer <code><b>k</b></code>, and returns an **ordered list of** <code><b>k</b></code> **pairs**. Each element of such a list is a tuple made of $(doc_d, prob)$, where $prob$ is the probability of the query $q = t_1\ldots t_m$ given the trained unigram language model associated with document $d$, i.e., $P(q = t_1\ldots t_m~|~M_d)$, which can be computed as follows:

$$
prob(q~|~M_d) = \prod_{i=1}^m P(t_i~|~M_d)
$$

To ease this step, you are already given the function <code><b>compute_probability</b></code>, which takes as input a query $q$ (i.e., a string of whitespace-separated words) and a document model $M_d$, and returns the probability $P(q~|~M_d)$ as described above.

Finally, the list of pairs returned must be sorted by non-increasing values of $prob$ (i.e., the first element will be the pair containing the document id which most-likely generated the query, and so on...)

(**SUGGESTION:** Try solving this exercise using a "top-down" approach)
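
The scoring and ranking step can be sketched on hand-written models. Everything below is illustrative: `toy_models`, `query_probability`, and `rank_documents` are hypothetical names, the probabilities are made up, and unseen words are assumed to fall back to a tiny constant, mirroring the `EPSILON` defined at the top of the notebook.

```python
import math

EPSILON = 1e-7  # assumed fallback probability for words missing from a model

# Hand-written models (hypothetical): doc_id -> {word: probability}.
toy_models = {
    0: {"network": 0.5, "design": 0.5},
    1: {"network": 0.25, "design": 0.25, "graph": 0.5},
    2: {"cooking": 1.0},
}


def query_probability(query, model):
    """Product of per-word probabilities, accumulated as a sum of logs."""
    log_sum = sum(math.log(model.get(word, EPSILON)) for word in query.split(" "))
    return math.exp(log_sum)


def rank_documents(query, models, k):
    """Top-k (doc_id, probability) pairs, most likely document first."""
    scored = [(doc_id, query_probability(query, model))
              for doc_id, model in models.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


ranking = rank_documents("network design", toy_models, 2)
print(ranking)
```

Summing logarithms instead of multiplying raw probabilities keeps long queries from underflowing to zero before the final exponentiation.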

In [None]:
def compute_probability(query, model):
    """
    Returns the probability of a query (i.e., a sequence of words) given a certain document model
    """
    query_probability_log_sum = np.sum([np.log(model[word]) if word in model else np.log(EPSILON)
                                       for word in query.split(" ")])
    return np.exp(query_probability_log_sum)
    

def train_unigram_language_model(doc):
    """
    Returns the unigram language model M trained from a single document d
    M_d = {w_1: P(w_1 | M_d), ..., w_{N_d}: P(w_{N_d} | M_d)}
    where P(w_i | M_d) = count(w_i, d)/sum_{j}count(w_j, d)
    """
    model = {} # dictionary to be populated and returned
    ### BEGIN SOLUTION
    words = doc.split(" ")
    n_words = len(words)
    for word in set(words):
        model[word] = words.count(word) / n_words
    ### END SOLUTION
    return model

def train_unigram_language_models(corpus):
    """
    Returns the global dictionary containing all the unigram language models for all documents of the corpus
    {doc_1: M_1, ..., doc_N: M_N}
    """
    models = {} # dictionary to be populated and returned
    ### BEGIN SOLUTION
    # HINT: Use the function `train_unigram_language_model(doc)` to train the model for a single document
    for doc_id, doc in enumerate(corpus):
        models[doc_id] = train_unigram_language_model(doc)
    ### END SOLUTION
    return models

def get_most_similar_docs(query, k):
    """
    Returns the ordered list of k pairs [(doc_1, prob_1), ..., (doc_k, prob_k)]
    such that P(query | M_1) >= P(query | M_2) >= ... >= P(query | M_k)
    where M_i is the unigram language model generated from document i
    """
    # 1. Build a dictionary containing all the models indexed by doc_id, as follows {doc_1: M_1, ..., doc_N: M_N}
    models = None
    # 2. Use this data structure in combination with the `compute_probability` function to construct the list of ordered pairs
    ### BEGIN SOLUTION
    models = train_unigram_language_models(corpus)
    return sorted([(doc_id, compute_probability(query, models[doc_id])) for doc_id in models], 
                   key=itemgetter(1), reverse=True)[:k]
    ### END SOLUTION

In [None]:
"""
Test the correctness of the implementation of the `get_most_similar_docs` function
"""

assert_equal(319, get_most_similar_docs("network design", 10)[0][0])
assert_equal(True, np.abs(0.015625000000000007 - get_most_similar_docs("network design", 10)[1][1]) < EPSILON)
assert_equal(240, get_most_similar_docs("network design", 10)[9][0])
assert_equal(True, np.abs(2.2222222222222197e-08 - get_most_similar_docs("network design", 10)[6][1]) < EPSILON)
# ### BEGIN HIDDEN TESTS
assert_equal(10, len(get_most_similar_docs("algorithm design", 10)))
assert_equal(2303, get_most_similar_docs("algorithm design", 10)[0][0])
assert_equal(True, np.abs(0.0027700831024930735 - get_most_similar_docs("algorithm design", 10)[4][1]) < EPSILON)
# ### END HIDDEN TESTS

# Part 2: Data Science (16 points)

In this part, you will be working with the dataset file <code>**dataset.csv**</code>. For a complete description of this data source, please refer to the <code>**README.txt**</code> file included in the archive.
In a nutshell, this dataset contains **1781** unique (anonymised) URLs, along with a set of **18 features** and a **binary class label** (<code>**TYPE**</code>), which indicates whether the corresponding URL is malicious (<b>1</b>) or not (<b>0</b>).<br />
The cell below is responsible for correctly loading the dataset from the <code>**dataset.csv**</code> file. Once this is executed, you can start answering the questions below.

In [None]:
# Load the dataset stored at `DATASET_FILE` using "," as field separator and '?' to detect NAs

data = pd.read_csv(DATASET_FILE, 
                   sep=',',
                   na_values='?')

print("Loaded `websites` dataset into a dataframe of size ({} x {})".format(data.shape[0], data.shape[1]))

data.head()

## Exercise 2.1 (1 point)

Implement the function <code>**get_the_longest_content**</code> below. This takes as input a <code>**pandas.DataFrame**</code> object, and returns the record (i.e., a one-row <code>**pandas.DataFrame**</code>) corresponding to the URL with the longest HTTP header (i.e., the <code>**CONTENT_LENGTH**</code> field) in the dataset.

In [None]:
def get_the_longest_content(data):
    """
    Returns the record corresponding to the URL with the longest HTTP header in the dataset
    """
    ### BEGIN SOLUTION
    return data[data.CONTENT_LENGTH == np.max(data.CONTENT_LENGTH)]
    ### END SOLUTION

In [None]:
"""
Test the correctness of the implementation of the `get_the_longest_content` function
"""

assert_equal("B0_834", get_the_longest_content(data)["URL"].iloc[0])
assert_equal("utf-8", get_the_longest_content(data)["CHARSET"].iloc[0])
assert_equal(0, get_the_longest_content(data)["SOURCE_APP_PACKETS"].iloc[0])
### BEGIN HIDDEN TESTS
assert_equal(649263.0, get_the_longest_content(data)["CONTENT_LENGTH"].iloc[0])
assert_equal(136, get_the_longest_content(data)["URL_LENGTH"].iloc[0])
assert_equal(43, get_the_longest_content(data)["NUMBER_SPECIAL_CHARACTERS"].iloc[0])
### END HIDDEN TESTS

## Exercise 2.2 (3 points)

Implement the function <code>**app_bytes_stats**</code> below. This takes as input a <code>**pandas.DataFrame**</code> object and returns a tuple containing the min, max, avg, and median values of the <code>**APP_BYTES**</code> feature, computed on a _slice_ of the input <code>**pandas.DataFrame**</code>.<br />
The sliced dataset represents the subpopulation containing only **malicious** URLs whose length is **strictly above the overall median**, and whose content length ranges in $[200, 1600)$ bytes.
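
The slicing conditions can be sketched on a small made-up frame. The column names follow the dataset, but the eight rows in `toy` are invented for illustration and are not taken from `dataset.csv`.

```python
import pandas as pd

# Made-up rows; only the column names match the real dataset.
toy = pd.DataFrame({
    "TYPE":           [1,   1,    0,   1,   1,   0,   1,    1],
    "URL_LENGTH":     [50,  80,   90,  85,  20,  30,  100,  60],
    "CONTENT_LENGTH": [300, 1600, 500, 200, 400, 700, 1599, 1000],
    "APP_BYTES":      [100, 999,  50,  300, 7,   0,   500,  42],
})

# Each condition is a boolean Series; `&` combines them row by row.
mask = (
    (toy.TYPE == 1)
    & (toy.URL_LENGTH > toy.URL_LENGTH.median())  # strictly above the median
    & (toy.CONTENT_LENGTH >= 200)                 # lower bound included
    & (toy.CONTENT_LENGTH < 1600)                 # upper bound excluded
)
sliced = toy[mask]

stats = (sliced.APP_BYTES.min(), sliced.APP_BYTES.max(),
         sliced.APP_BYTES.mean(), sliced.APP_BYTES.median())
print(stats)
```

Note that the median is taken over the whole frame, not over the malicious rows only, and that the row with a content length of exactly 1600 is left out because the interval is open on the right.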

In [None]:
def app_bytes_stats(data):
    """
    Returns a tuple containing the min, max, avg, and median value of `APP_BYTES` feature,
    yet limited to a slice of the input DataFrame (data). 
    In particular, this slice will contain instances referring only to malicious URLs
    whose length is strictly above the overall median, and whose content length ranges in [200,1600) bytes.
    """
    ### BEGIN SOLUTION
    sliced_data = data[(data.TYPE == 1) & 
                       (data.URL_LENGTH > data.URL_LENGTH.median()) & 
                       (data.CONTENT_LENGTH >= 200) & 
                       (data.CONTENT_LENGTH < 1600)]

    return (sliced_data.APP_BYTES.min(), 
            sliced_data.APP_BYTES.max(), 
            sliced_data.APP_BYTES.mean(), 
            sliced_data.APP_BYTES.median())
    ### END SOLUTION

In [None]:
"""
Test the correctness of the implementation of the `app_bytes_stats` function
"""

# Call `app_bytes_stats` function
stats = app_bytes_stats(data)

assert_equal(0, stats[0])
assert_equal(2330, stats[1])
### BEGIN HIDDEN TESTS
assert_equal(True, np.abs(962.8 - stats[2]) < EPSILON)
assert_equal(746.0, stats[3])
assert_equal(4, len(stats))
assert_equal(tuple, type(stats))
### END HIDDEN TESTS

## Exercise 2.3 (5 points)

Implement the function <code>**get_most_dangerous_servers**</code> below, which takes as input a <code>**pandas.DataFrame**</code> and an integer <code><b>k</b></code>, and returns an **ordered list** of <code><b>k</b></code> elements, where each element is a tuple containing the name of the server (i.e., the <code>**SERVER**</code> field) and the *probability* of that server being involved in malicious web traffic. The final list shall be ordered by probability of maliciousness (non-increasing) and, within equal probabilities, lexicographically by server name (non-decreasing).

(**SUGGESTION:** In order to answer this question, you will need to compute the probability of maliciousness for each **group** of servers.)
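
The per-group probability and the two-level ordering can be sketched in plain Python, without pandas. The server names and labels in `records` are made up, and the single-pass sort with a composite key is just one possible way to obtain the required order.

```python
from collections import defaultdict

# Hypothetical (server, label) records; label 1 = malicious, 0 = benign.
records = [
    ("nginx", 1), ("nginx", 0),
    ("apache", 1), ("apache", 1),
    ("iis", 0),
    ("caddy", 1),
]

counts = defaultdict(lambda: [0, 0])  # server -> [n_malicious, n_total]
for server, label in records:
    counts[server][0] += label
    counts[server][1] += 1

probabilities = [(server, bad / total) for server, (bad, total) in counts.items()]
# Negating the probability sorts it from highest to lowest, while ties
# fall back to the server name in ascending order.
ranked = sorted(probabilities, key=lambda pair: (-pair[1], pair[0]))
print(ranked)
```

An equivalent approach is to sort by name first and then by probability in reverse, relying on the stability of Python's `sorted`.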

In [None]:
def get_most_dangerous_servers(data, k):
    ### BEGIN SOLUTION
    servers = data.groupby(by=["SERVER"])
    result = []
    for server_name, server_group in servers:
        n_malicious = server_group[server_group.TYPE == 1].shape[0]
        n_safe = server_group[server_group.TYPE == 0].shape[0]
        result.append((server_name, n_malicious/server_group.shape[0]))

    return sorted(sorted(result, key=itemgetter(0), reverse=False), key=itemgetter(1), reverse=True)[:k]
    ### END SOLUTION

In [None]:
"""
Test the correctness of the implementation of the `get_most_dangerous_servers` function
"""

# Call `get_most_dangerous_servers` function
most_dangerous_servers = get_most_dangerous_servers(data, 15)

assert_equal(15, len(most_dangerous_servers))
assert_equal("Apache/1.3.31 (Unix) PHP/4.3.9 mod_perl/1.29 rus/PL30.20", most_dangerous_servers[1][0])
assert_equal(1.0, most_dangerous_servers[3][1])
### BEGIN HIDDEN TESTS
assert_equal("nginx/1.8.0", most_dangerous_servers[13][0])
assert_equal(True, np.abs(0.8571428571428571 - most_dangerous_servers[8][1]) < EPSILON)
assert_equal("Apache/2.2.22 (Debian)", most_dangerous_servers[12][0])
assert_equal(True, np.abs(0.4166666666666667 - most_dangerous_servers[12][1]) < EPSILON)
assert_equal(239, len(get_most_dangerous_servers(data, 2000)))
### END HIDDEN TESTS

## Exercise 2.4 (7 points)

This exercise is made of **3** main questions, which you can answer independently of each other.

### Question 1 (1 point)

Feature <code>**CHARSET**</code> represents a categorical variable which can take on <b>5</b> distinct values (excluding NAs).
Assign to the variable <code>**us_ascii**</code> below the total number of records in the dataset whose charset is equal to <code><b>us-ascii</b></code>.

In [None]:
us_ascii = None

### BEGIN SOLUTION
us_ascii = data.CHARSET.value_counts()['us-ascii']
### END SOLUTION

In [None]:
"""
Test the correctness of `us_ascii`
"""

assert_equal(False, (us_ascii == None))
### BEGIN HIDDEN TESTS
assert_equal(155, us_ascii)
### END HIDDEN TESTS

### Question 2 (3 points)

Plot the histogram along with the empirical density of the <code>**CONTENT_LENGTH**</code> using <code>**sns.distplot**</code>, and assign the result of the plot to the variable <code>**dist_plot**</code>. 

In addition to that, compute both the <code>**skewness**</code> and the <code>**kurtosis**</code> of the distribution. These measure, respectively, the asymmetry and the "thickness" of the tails of the distribution. Given a sample of $N$ i.i.d. observations of a single random variable $X$, i.e., $x_1, \ldots, x_N$, a possible way to compute those values is as follows:

$$
\text{skewness} = \frac{\frac{1}{N}\sum_{i=1}^N (x_i - \bar{x})^3}{s^3}
$$

$$
\text{kurtosis} = \frac{\frac{1}{N}\sum_{i=1}^N (x_i - \bar{x})^4}{s^4} - 3
$$
where $\bar{x}$ and $s$ are the sample mean and the sample standard deviation of $x_1, \ldots, x_N$, respectively.

(**NOTE:** Remember to correctly handle any possible missing values, e.g., call the <code><b>dropna()</b></code> method if needed...)
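
The two formulas above can be checked on a tiny made-up sample using only the standard library. The sketch assumes $s$ is the sample standard deviation computed with $N-1$ in the denominator, which is the default of `Series.std()` in pandas; `skew_and_kurtosis` is a hypothetical helper name.

```python
import statistics


def skew_and_kurtosis(sample):
    """Skewness and excess kurtosis following the two formulas above."""
    n = len(sample)
    mean = statistics.mean(sample)
    s = statistics.stdev(sample)  # sample std, divides by N - 1
    m3 = sum((x - mean) ** 3 for x in sample) / n
    m4 = sum((x - mean) ** 4 for x in sample) / n
    return m3 / s ** 3, m4 / s ** 4 - 3


# One large value on the right produces a positive skewness.
skewness, kurtosis = skew_and_kurtosis([1.0, 2.0, 3.0, 4.0, 10.0])
print(skewness, kurtosis)
```

A perfectly symmetric sample such as `[1.0, 2.0, 3.0]` has a third central moment of zero, hence zero skewness.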

In [None]:
dist_plot = None # assign this to the outcome of sns.distplot call
skewness = None # assign this to the value of skewness
kurtosis = None # assign this to the value of kurtosis

### BEGIN SOLUTION
# Create a Figure containing 1x1 subplots
fig, ax = plt.subplots(1, 1, figsize=(8,6))
# Histogram with empirical density
content_length = data.CONTENT_LENGTH.dropna()
dist_plot = sns.distplot(content_length, color='#0099cc', ax=ax)

N = content_length.shape[0]
m_3 = np.sum(np.power(content_length - content_length.mean(), 3)) / N
s_3 = np.power(content_length.std(), 3)
m_4 = np.sum(np.power(content_length - content_length.mean(), 4)) / N
s_4 = np.power(content_length.std(), 4)

skewness = m_3 / s_3
kurtosis = (m_4 / s_4) - 3
### END SOLUTION

In [None]:
"""
Test the correctness of `dist_plot`, `skewness`, and `kurtosis`
"""

assert_equal(False, (dist_plot == None))
assert_equal(False, (skewness == None))
assert_equal(False, (kurtosis == None))
### BEGIN HIDDEN TESTS
assert_equal(True, np.abs(10.538473351947596 - skewness) < EPSILON)
assert_equal(True, np.abs(143.59344613722587 - kurtosis) < EPSILON)
### END HIDDEN TESTS

### Question 3 (3 points)

Implement the function <code>**n_right_num_spec_chars_outliers**</code> below, which takes as input a <code>**pandas.DataFrame**</code> object and returns the number of right outliers of the column <code>**NUMBER_SPECIAL_CHARACTERS**</code> (if any).<br />
Any data point **less than** (resp., greater than) the left (resp., right) fence is considered an outlier. Both left and right fences are empirically computed as follows:

$$
F_\textrm{left} = Q_1 - 1.5 * \texttt{IQR};~~F_\textrm{right} = Q_3 + 1.5 * \texttt{IQR}
$$

where $Q_1$ and $Q_3$ represent the 1st and 3rd quartiles of the distribution of interest **without considering NAs**, and $\texttt{IQR} = Q_3 - Q_1$.

(**SUGGESTIONS:** Start from drawing the box plot and visually check whether there is any outlier or not. You can either invoke the <code>**quantile**</code> function defined on a <code>**pandas.Series**</code> object **or** use the <code>**numpy.percentile**</code> function which takes as input a <code>**pandas.Series**</code> object or, more generally, any object that can easily be converted into a <code>**numpy.array**</code>).
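
The fence computation can be sketched on a short made-up list with the standard library. The sketch uses `statistics.quantiles` with `method="inclusive"`, on the assumption that it interpolates linearly between data points in the same way as the default of `quantile` on a `pandas.Series`; the numbers in `values` are invented.

```python
import statistics

# Made-up sample with two large values on the right.
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30, 40]

# n=4 returns the three quartile cut points (Q1, median, Q3).
q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1
fence_left = q1 - 1.5 * iqr
fence_right = q3 + 1.5 * iqr

right_outliers = [v for v in values if v > fence_right]
left_outliers = [v for v in values if v < fence_left]
print(fence_left, fence_right, right_outliers, left_outliers)
```

With these values the quartiles are 3.5 and 8.5, so the right fence sits at 16 and only the last two points count as right outliers.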

In [None]:
box_plot = None # assign this to the outcome of sns.boxplot call

### BEGIN SOLUTION
# Create a Figure containing 1x1 subplots
fig, ax = plt.subplots(1, 1, figsize=(8,6))
box_plot = sns.boxplot(x=data.NUMBER_SPECIAL_CHARACTERS, 
                       palette=sns.color_palette("hls", n_colors=2), 
                       ax=ax)
### END SOLUTION

def n_right_num_spec_chars_outliers(data):
    """
    Returns the number of right outliers of the column `NUMBER_SPECIAL_CHARACTERS`
    """
    fence_right = None # value of the right fence
    
    ### BEGIN SOLUTION
    q1, q3 = data.NUMBER_SPECIAL_CHARACTERS.dropna().quantile([.25, .75])
    IQR = (q3 - q1)
    fence_right = q3 + 1.5 * IQR
    
    # Comparing NaN with the fence yields False, so missing values are never counted
    return data[data.NUMBER_SPECIAL_CHARACTERS > fence_right].shape[0]
    ### END SOLUTION

In [None]:
"""
Test the correctness of `n_right_num_spec_chars_outliers`
"""

# Call `n_right_num_spec_chars_outliers` function
outliers = n_right_num_spec_chars_outliers(data)

assert_equal(False, (box_plot == None))
assert_equal(False, (outliers == None))
assert_equal(True, (outliers > 10))
### BEGIN HIDDEN TESTS
assert_equal(76, outliers)
### END HIDDEN TESTS