# Exercises
file name: anomaly_detection.py or anomaly_detection.ipynb

#### Discrete data + probability
Use basic probability to identify anomalous request methods. You will want to make sure the text is normalized in order to reduce the noise.

#### Time series + EMA
Discover users who are accessing our curriculum pages well beyond the end of their time at Codeup. What would the dataframe look like? Use a time series method for detecting anomalies, such as an exponential moving average with %b.

#### Clustering - DBSCAN
Use DBSCAN to detect anomalies in other products from the customers dataset.

Use DBSCAN to detect anomalies in the number of bedrooms and finished square feet of property for the filtered dataset you used in the clustering project (single unit properties with a logerror).
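
As a starting point for the DBSCAN exercises, here is a minimal sketch on made-up data. The eight rows of bedrooms and square footage are hypothetical, and the `eps` and `min_samples` values are assumptions picked for this toy set, not tuned settings. DBSCAN labels points that belong to no dense cluster as `-1`, which is what we treat as an anomaly.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data: [bedrooms, finished square feet]
X = np.array([
    [3, 1500], [3, 1550], [3, 1600], [4, 1650],
    [3, 1580], [4, 1700], [3, 1520], [12, 9000],
], dtype=float)

# Scale both features to [0, 1] so square feet does not dominate the distance.
X_scaled = MinMaxScaler().fit_transform(X)

# eps and min_samples are assumptions chosen for this toy data.
labels = DBSCAN(eps=0.15, min_samples=3).fit(X_scaled).labels_

# DBSCAN marks noise points with the label -1.
anomalies = X[labels == -1]
print(anomalies)
```

On real data, `eps` usually needs tuning after scaling, since it is a distance in the scaled feature space.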

# %%%%%%%%%%%%%%%%%

# Anomaly Detection of Discrete Data using Probability
## Discrete data + probability

### Use basic probability to identify anomalous request methods. You will want to make sure the text is normalized in order to reduce the noise.

In [1]:
from __future__ import unicode_literals
from __future__ import division
import itertools
import warnings
warnings.filterwarnings("ignore")
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import numpy as np
import pandas as pd
import math
from sklearn import metrics
from random import randint
from matplotlib import style
import seaborn as sns
%matplotlib inline

#### Wrangle Data
##### Acquire

In [2]:
colnames=['ip', 'timestamp', 'request_method', 'status', 'size',
          'destination', 'request_agent']
df_orig = pd.read_csv('http://python.zach.lol/access.log',          
                 engine='python',
                 header=None,
                 index_col=False,
                 names=colnames,
                 sep=r'\s(?=(?:[^"]*"[^"]*")*[^"]*$)(?![^\[]*\])',
                 na_values='"-"',
                 usecols=[0, 3, 4, 5, 6, 7, 8]
)

new = pd.DataFrame([["95.31.18.119", "[21/Apr/2019:10:02:41+0000]", 
                     "GET /api/v1/items/HTTP/1.1", 200, 1153005, np.nan, 
                     "python-requests/2.21.0"],
                    ["95.31.16.121", "[17/Apr/2019:19:36:41+0000]", 
                     "GET /api/v1/sales?page=79/HTTP/1.1", 301, 1005, np.nan, 
                     "python-requests/2.21.0"],
                    ["97.105.15.120", "[18/Apr/2019:19:42:41+0000]", 
                     "GET /api/v1/sales?page=79/HTTP/1.1", 301, 2560, np.nan, 
                     "python-requests/2.21.0"],
                    ["97.105.19.58", "[19/Apr/2019:19:42:41+0000]", 
                     "GET /api/v1/sales?page=79/HTTP/1.1", 200, 2056327, np.nan, 
                     "python-requests/2.21.0"]], columns=colnames)

df = pd.concat([df_orig, new])  # DataFrame.append was removed in pandas 2.0
df.timestamp = df.timestamp.str.replace(r'(\[|\])', '', regex=True)
df.timestamp= pd.to_datetime(df.timestamp.str.replace(':', ' ', 1)) 
df = df.set_index('timestamp')
for col in ['request_method', 'request_agent', 'destination']:
    df[col] = df[col].str.replace('"', '')

df['request_method'] = df.request_method.str.replace(r'\?page=[0-9]+', '', regex=True)

df['size_mb'] = [n/1024/1024 for n in df['size']]

In [3]:
pd.concat([df.head(), df.tail()])

timestamp,ip,request_method,status,size,destination,request_agent,size_mb
2019-04-16 19:34:42,97.105.19.58,GET /api/v1/sales HTTP/1.1,200,512495,,python-requests/2.21.0,0.488753
2019-04-16 19:34:42,97.105.19.58,GET /api/v1/items HTTP/1.1,200,3561,,python-requests/2.21.0,0.003396
2019-04-16 19:34:44,97.105.19.58,GET /api/v1/sales HTTP/1.1,200,510103,,python-requests/2.21.0,0.486472
2019-04-16 19:34:46,97.105.19.58,GET /api/v1/sales HTTP/1.1,200,510003,,python-requests/2.21.0,0.486377
2019-04-16 19:34:48,97.105.19.58,GET /api/v1/sales HTTP/1.1,200,511963,,python-requests/2.21.0,0.488246
2019-04-17 12:55:14,97.105.19.58,GET /api/v1/sales HTTP/1.1,200,510166,,python-requests/2.21.0,0.486532
2019-04-21 10:02:41,95.31.18.119,GET /api/v1/items/HTTP/1.1,200,1153005,,python-requests/2.21.0,1.099591
2019-04-17 19:36:41,95.31.16.121,GET /api/v1/sales/HTTP/1.1,301,1005,,python-requests/2.21.0,0.000958
2019-04-18 19:42:41,97.105.15.120,GET /api/v1/sales/HTTP/1.1,301,2560,,python-requests/2.21.0,0.002441
2019-04-19 19:42:41,97.105.19.58,GET /api/v1/sales/HTTP/1.1,200,2056327,,python-requests/2.21.0,1.961066


In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 13978 entries, 2019-04-16 19:34:42 to 2019-04-19 19:42:41
Data columns (total 7 columns):
ip                13978 non-null object
request_method    13978 non-null object
status            13978 non-null int64
size              13978 non-null int64
destination       25 non-null object
request_agent     13978 non-null object
size_mb           13978 non-null float64
dtypes: float64(1), int64(2), object(4)
memory usage: 873.6+ KB


In [5]:
df.shape

(13978, 7)

In [6]:
df.isna().sum()

ip                    0
request_method        0
status                0
size                  0
destination       13953
request_agent         0
size_mb               0
dtype: int64

##### Clean up text

In [7]:
maxcount = 0
for i, command in enumerate(df.request_method):
    if len(command.split('/')) > maxcount:
        maxcount = len(command.split('/'))
        print('maxcount: ', maxcount)
        print(command)
print('final maxcount is ', maxcount)

maxcount:  5
GET /api/v1/sales HTTP/1.1
maxcount:  6
GET /api/v1/items/next_page HTTP/1.1
maxcount:  8
GET /api/v1//api/v1/items HTTP/1.1
maxcount:  9
GET /api/v1//api/v1/items/next_page HTTP/1.1
final maxcount is  9
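
The output above shows the kind of noise to normalize: a doubled `/api/v1//api/v1/` prefix, and the appended rows are missing the space before `HTTP/1.1`. A minimal sketch of normalizing the text and then using basic probability to flag rare request methods follows. The helper `normalize_request`, the toy series, and the 0.15 cutoff are all assumptions for illustration.

```python
import re
import pandas as pd

def normalize_request(request):
    """Return a canonical form of one request string (hypothetical helper)."""
    request = request.strip().strip('"')
    request = re.sub(r'\?page=[0-9]+', '', request)        # drop pagination
    request = re.sub(r'/{2,}', '/', request)               # '//' -> '/'
    request = re.sub(r'(?:/api/v1)+', '/api/v1', request)  # repeated prefix
    request = re.sub(r'/(HTTP/\d\.\d)$', r' \1', request)  # '/HTTP/1.1' -> ' HTTP/1.1'
    return request

# Toy data standing in for df.request_method
raw = pd.Series(
    ['GET /api/v1/sales HTTP/1.1'] * 6
    + ['GET /api/v1/sales?page=79/HTTP/1.1',
       'GET /api/v1//api/v1/items HTTP/1.1',
       'GET /api/v1/items HTTP/1.1',
       'POST /api/v1/items HTTP/1.1']
)

clean = raw.map(normalize_request)

# Probability of each normalized request = its share of all requests.
proba = clean.value_counts(normalize=True)

# Flag requests below an assumed cutoff.
rare = proba[proba < 0.15]
print(rare)
```

Without the normalization step, the pagination and doubled-prefix variants would each look rare on their own and drown out the genuinely unusual request.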


Count number of unique values.

In [8]:
df.nunique()

ip                 22
request_method     22
status              3
size              191
destination        18
request_agent       9
size_mb           191
dtype: int64

In [9]:
df.status.value_counts()

200    13960
499       16
301        2
Name: status, dtype: int64

In [10]:
df.ip.value_counts()

97.105.19.58      11999
173.173.113.51     1059
72.181.113.170      613
72.181.105.81       246
68.201.219.223       21
24.26.242.9          21
35.175.171.137        2
70.121.214.34         2
52.87.230.102         2
52.90.165.200         1
97.105.15.120         1
34.207.64.242         1
95.31.16.121          1
54.145.52.184         1
34.229.70.250         1
95.31.18.119          1
45.23.250.16          1
35.174.209.2          1
52.91.30.150          1
3.92.201.136          1
3.88.129.158          1
54.172.14.223         1
Name: ip, dtype: int64

In [None]:
df.groupby('ip')['size_mb'].agg(['max', 'mean']).sort_values(by='max')

In [None]:
ip_df = pd.DataFrame(df.groupby('ip')['size_mb']\
                     .mean().reset_index()\
                     .rename(index=str, columns={'index': 'ip','size_mb': 'avg_size_mb' }))
ip_df.head()

In [None]:
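# normalize=True returns each IP's share of all requests; naming the index
# and the values explicitly keeps the column names stable across pandas versions.
ip_df2 = (df.ip.value_counts(normalize=True)
            .rename_axis('ip')
            .reset_index(name='ip_proba'))
ip_df = ip_df.merge(ip_df2)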


# see those where rate < 1% 
ip_df[ip_df.ip_proba < .01].sort_values(by=['ip_proba', 'avg_size_mb'])
# 19	95.31.18.119	1.099591	0.000072

Split the request path (stored in the `request_method` column) into sections. Each section will get its own row.

In [None]:
s = df['request_method'].str.split('/', expand=True).stack()

s.index = s.index.droplevel(-1) # to line up with df's index

s.name = 'request_path' # needs a name to join

s

Drop the `request_method` column before the join.

In [None]:
df_no_request_path = df.drop(columns='request_method')

In [None]:
df_split = df_no_request_path.join(s.apply(lambda x: pd.Series(x.split('/'))))

In [None]:
df_split.head(2)

In [None]:
df_split.rename(columns={0:'request_path_section'}, inplace=True)
df_split.head(20)

But... that doesn't immediately appear to be useful...

In [None]:
df['size_mb'] = [n/1024/1024 for n in df['size']]

In [None]:
df.describe()

In [None]:
df.head(2)

Tried to use `ipaddress.ip_address`, expecting it to return the country and other info about the IP address. It only parses and validates an address; it does not look up location data.

In [None]:
from ipaddress import ip_address

In [None]:
# ip_address() expects a single address (str or int); passing the whole
# Series raises ValueError, so apply it to each value instead.
df.ip.apply(ip_address)

In [None]:
# Count how many distinct addresses fail to parse.
count = 0
for ip in df.ip.unique():
    try:
        ip_address(ip)
    except ValueError:
        count += 1
print('invalid addresses:', count)

In [None]:
df.head()

In [None]:
df.ip.value_counts()

Creating a new dataframe with only ip_address and counts.

In [None]:
dff = df['ip'].value_counts().reset_index()
dff.columns = ['ip', 'count']

In [None]:
dff

This is where I'm using ipinfo to try to read the ipaddress. It works when I use the default that tells me the stats on MY ip address.

In [None]:
import re
import json
from urllib.request import urlopen

url = 'http://ipinfo.io/json'
response = urlopen(url)
data = json.load(response)

IP=data['ip']
org=data['org']
city = data['city']
country=data['country']
region=data['region']

print('Your IP detail\n ')
print('IP : {4} \nRegion : {1} \nCountry : {2} \nCity : {3} \nOrg : {0}'.format(org,region,country,city,IP))

I couldn't figure out how to pass the ipaddress as an argument, so I found this code, but couldn't figure out how to make it work... 

In [None]:
import ipinfo
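access_token = '123456789abc'  # placeholder token; use your own from ipinfo.io
handler = ipinfo.getHandler(access_token)
ip = '173.173.113.51'  # named `ip` to avoid shadowing ipaddress.ip_address
details = handler.getDetails(ip)
print(details.city)
print(details.loc)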
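
One way to pass a specific address using only the standard library is to put it in the URL path. This is a sketch that assumes ipinfo.io serves `https://ipinfo.io/<ip>/json` and accepts an optional `token` query parameter, as its public documentation describes. The helper names `ipinfo_url` and `lookup_ip` are made up here.

```python
import json
from ipaddress import ip_address
from urllib.request import urlopen

def ipinfo_url(ip, token=None):
    """Build the lookup URL for one address (hypothetical helper)."""
    ip_address(ip)  # raises ValueError if `ip` is not a valid address
    url = f'https://ipinfo.io/{ip}/json'
    if token:
        url += f'?token={token}'
    return url

def lookup_ip(ip, token=None):
    """Fetch the JSON details for one address; needs network access."""
    with urlopen(ipinfo_url(ip, token)) as response:
        return json.load(response)

print(ipinfo_url('173.173.113.51'))
```

Calling `lookup_ip('173.173.113.51')` should return a dict with keys such as `city`, `region`, and `country`, the same fields read from the default response earlier. Looping over `df.ip.unique()` rather than every row keeps the number of requests small.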

### Detecting Anomalies in Discrete Variables

Finding anomalies in already existing data

The curriculum says "We can easily see some anomalies around IP addresses", but I'm not seeing any anomalies...

Creating a dataframe with probabilities that each ip address will be used.

In [None]:
# Build count and probability columns in one step; explicit column names
# keep this working across pandas versions.
ip_counts = df.ip.value_counts(dropna=False)
ip_df = pd.DataFrame({'ip': ip_counts.index,
                      'ip_count': ip_counts.values,
                      'ip_proba': ip_counts.values / df.ip.count()})


# see those where rate < 1% 
ip_df[ip_df.ip_proba < .01]

Only interested in the low-probability IP addresses. The threshold used below, ip_df.ip_proba < 0.0002, keeps only the addresses that appear once or twice in the 13,978 requests.

My graph is very pretty, but not at all meaningful.

In [None]:
sm_chance = ip_df[ip_df.ip_proba < 0.0002]
sm_chance.shape

In [None]:
print(len(sm_chance))

print(sm_chance)


plt.figure(figsize=(12, 4))
splot = sns.barplot(data=sm_chance, x = 'ip', y = 'ip_count', ci = None)
for p in splot.patches:
    splot.annotate(format(p.get_height(), '.0f'), 
                   (p.get_x() + p.get_width() / 2., p.get_height()), 
                   ha = 'center', va = 'center', xytext = (0, 10), 
                   textcoords = 'offset points'
                   )
plt.xticks(rotation='vertical')  # set once, after the annotation loop

### Detecting anomalies by establishing a baseline and evaluating new data as it arrives
##### Establish baseline

In [None]:
df.columns

Selecting the columns of interest. The date-range filter is commented out, so every row goes into the baseline.

In [None]:
# train = df['2018-01-26 09:55:03':'2019-01-26 09:55:03'][['stuff', 'id', 'cohort', 'ip']]
train = df[['ip', 'request_method', 'status', 'size', 'request_agent', 'size_mb']]

##### Compute probabilities based on train sample

In [None]:
ip_df = (train.ip.value_counts(normalize=True)
              .rename_axis('ip')
              .reset_index(name='ip_proba'))

##### Merge probabilities with all data (train + new data)
- Where the ip address is new, i.e. not seen in the training dataset, fill the probability with a value of 0.

In [None]:
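# Fill only the probability column, so NaNs in other columns (e.g. destination) are left alone.
df = df.reset_index().merge(ip_df, on=['ip'], how='left').fillna({'ip_proba': 0}).set_index('timestamp')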
df.ip.value_counts()
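
The baseline idea can be seen on a tiny example. This is a sketch with made-up counts: the probabilities come from the training sample only, and an address never seen in training gets a probability of 0.

```python
import pandas as pd

# Hypothetical training sample and newly arrived events
train_ips = pd.Series(['97.105.19.58'] * 3 + ['173.173.113.51'], name='ip')
new_events = pd.DataFrame({'ip': ['97.105.19.58', '173.173.113.51', '95.31.18.119']})

# Baseline: share of training requests made by each IP.
baseline = train_ips.value_counts(normalize=True)

# Look up each new event's IP in the baseline; unseen IPs map to NaN, then 0.
new_events['ip_proba'] = new_events['ip'].map(baseline).fillna(0)
print(new_events)
```

`Series.map` plays the same role as the left merge above, with the `fillna(0)` applied to the probability column only.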

### Conditional Probabilities: probabilities using 2 discrete variables
##### Probability of Status given IP Address:
This helps when looking for an unexpected status (like an authentication failure) from a known or common IP address.

In [None]:
train.columns

In [None]:
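# Use len(train) so the marginal and the joint below share one denominator.
# Note: the joint groups by size_mb, so the result is really P(size_mb | ip).
ip_probs = train.groupby('ip').size().div(len(train))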

status_given_ip = pd.DataFrame(train.groupby(['ip', 'size_mb']).\
                               size().div(len(train)).\
                               div(ip_probs, 
                                   axis=0, 
                                   level='ip').\
                               reset_index().\
                               rename(index=str, 
                                      columns={0: 'proba_status_given_ip'})
                              )

In [None]:
# size() counts rows per group without depending on any one column's name.
ip_status_count = (train.groupby(['ip', 'size_mb'])
                        .size()
                        .reset_index(name='ip_status_count'))


ip_status = status_given_ip.merge(ip_status_count)
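
For comparison, here is a minimal sketch of the conditional probability named in the heading, P(status | ip), computed on made-up events. It divides the count of each (ip, status) pair by the count for that IP. The data and the variable names are hypothetical.

```python
import pandas as pd

# Hypothetical events
events = pd.DataFrame({
    'ip': ['97.105.19.58'] * 4 + ['173.173.113.51'] * 2,
    'status': [200, 200, 200, 499, 200, 301],
})

pair_counts = events.groupby(['ip', 'status']).size()  # count of (ip, status)
ip_counts = events.groupby('ip').size()                # count of ip

# P(status | ip) = count(ip, status) / count(ip), aligned on the 'ip' level.
status_given_ip = (pair_counts.div(ip_counts, level='ip')
                              .rename('proba_status_given_ip')
                              .reset_index())
print(status_given_ip)
```

The probabilities for each IP sum to 1, so a low value means that status is unusual for that particular address, even when the address itself is common.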

##### Add these probabilities to original events to detect anomalous events

In [None]:
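# Fill only the newly merged columns, so unrelated NaNs are not overwritten with 0.
df = df.reset_index().merge(ip_status, on=['ip', 'size_mb'], how='left').fillna({'proba_status_given_ip': 0, 'ip_status_count': 0}).set_index('timestamp')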

In [None]:
df.head(2)

In [None]:
df.sort_values(by=['proba_status_given_ip'], ascending=True)

In [None]:
(df['proba_status_given_ip'] < .005).sum()

In [None]:
sm_proba = df[df.proba_status_given_ip < .005]
print(sm_proba.shape)
sm_proba.head()

In [None]:
plt.scatter(df.proba_status_given_ip, df.ip_proba)

In [None]:
plt.scatter(sm_proba.proba_status_given_ip, sm_proba.ip_proba)

# Detecting Anomalies of Continuous Variables with Time Series Using Statistical Methods

# *Pick a cohort and run this time series analysis on it*
## Time series + EMA
### Discover users who are accessing our curriculum pages well beyond the end of their time at Codeup. What would the dataframe look like? Use a time series method for detecting anomalies, such as an exponential moving average with %b.



### Statistical Methods
- Flag the data points that deviate from the expected, based on the statistical properties, such as mean, median, mode, and quantiles.

- You could define an anomalous data point as one that deviates by a certain standard deviation from the mean.

- You could use a simple or exponential moving average to smooth short-term fluctuations and highlight long-term ones.

- This method is challenging with really noisy data.
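
The standard-deviation rule from the list above can be sketched in a few lines. The values are made up, and the cutoff of 2 standard deviations is an assumption; the right cutoff depends on the data.

```python
import pandas as pd

# Hypothetical response sizes in MB, with one large value at the end
size_mb = pd.Series([0.49] * 9 + [1.96])

# z-score: how many standard deviations each point sits from the mean.
z = (size_mb - size_mb.mean()) / size_mb.std()

k = 2  # assumed cutoff
flagged = size_mb[z.abs() > k]
print(flagged)
```

With very few points, a single outlier inflates the standard deviation it is measured against, which is one reason this method struggles on short or noisy series.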

### Anomalies in the amount of data consumed over time

In [None]:
from __future__ import division
import itertools
import warnings
warnings.filterwarnings("ignore")
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
from numpy import linspace, loadtxt, ones, convolve
from sklearn.ensemble import IsolationForest
import numpy as np
import pandas as pd
import collections
import math
from sklearn import metrics
from random import randint
from matplotlib import style
import seaborn as sns
# style.use('fivethirtyeight')
%matplotlib inline

def evaluate(actual, predictions, output=True):
    mse = metrics.mean_squared_error(actual, predictions)
    rmse = math.sqrt(mse)

    if output:
        print('MSE:  {}'.format(mse))
        print('RMSE: {}'.format(rmse))
    else:
        return mse, rmse    

def plot_and_eval(predictions, actual, metric_fmt='{:.2f}', linewidth=4):
    if type(predictions) is not list:
        predictions = [predictions]

    plt.figure(figsize=(16, 8))
    plt.plot(train,label='Train')
    plt.plot(test, label='Test')

    for yhat in predictions:
        mse, rmse = evaluate(actual, yhat, output=False)        
        label = f'{yhat.name}'
        if len(predictions) > 1:
            label = f'{label} -- MSE: {metric_fmt} RMSE: {metric_fmt}'.format(mse, rmse)
        plt.plot(yhat, label=label, linewidth=linewidth)

    if len(predictions) == 1:
        label = f'{label} -- MSE: {metric_fmt} RMSE: {metric_fmt}'.format(mse, rmse)
        plt.title(label)

    plt.legend(loc='best')
    plt.show()    
    
def get_data():
    df = pd.read_csv('http://python.zach.lol/access.log',          
                  engine='python',
                  header=None,
                  index_col=False,
                  names=['ip', 'timestamp', 'request_path', 'status', 'size', 'request_agent'],
                  sep=r'\s(?=(?:[^"]*"[^"]*")*[^"]*$)(?![^\[]*\])',
                  na_values='-',
                  usecols=[0, 3, 4, 5, 6, 8],
                  skip_blank_lines=True)

    df.timestamp = df.timestamp.str.replace('[', '')
    df.timestamp = df.timestamp.str.replace(']', '')
    df.timestamp= pd.to_datetime(df.timestamp.str.replace(':', ' ', 1)) 

    df.request_path = df.request_path.str.replace('"', '')

    df = df.set_index('timestamp')
    df = df.tz_localize('utc').tz_convert('America/Chicago')
    
    # Replace the request paths that have only / with home_page.
    df.replace(regex=r'^/$', value='home_page', inplace=True)
    
    # Take off the page number stuff from the request paths...
    df['request_path'] = df.request_path.str.replace(r'\?page=[0-9]+', '', regex=True)

    # get_data() does not load a 'destination' column, so only clean these two.
    for col in ['request_path', 'request_agent']:
        df[col] = df[col].str.replace('"', '')

    
    # add a field for mb size
    df['size_mb'] = [n/1024/1024 for n in df['size']]
     
    return(df)

#### Wrangle Data
##### Acquire

In [None]:
# df = get_data()

In [None]:
colnames=['ip', 'timestamp', 'request_method', 'status', 'size',
          'destination', 'request_agent']
df_orig = pd.read_csv('http://python.zach.lol/access.log',          
                 engine='python',
                 header=None,
                 index_col=False,
                 names=colnames,
                 sep=r'\s(?=(?:[^"]*"[^"]*")*[^"]*$)(?![^\[]*\])',
                 na_values='"-"',
                 usecols=[0, 3, 4, 5, 6, 7, 8]
)

new = pd.DataFrame([["95.31.18.119", "[21/Apr/2019:10:02:41+0000]", 
                     "GET /api/v1/items/HTTP/1.1", 200, 1153005, np.nan, 
                     "python-requests/2.21.0"],
                    ["95.31.16.121", "[17/Apr/2019:19:36:41+0000]", 
                     "GET /api/v1/sales?page=79/HTTP/1.1", 301, 1005, np.nan, 
                     "python-requests/2.21.0"],
                    ["97.105.15.120", "[18/Apr/2019:19:42:41+0000]", 
                     "GET /api/v1/sales?page=79/HTTP/1.1", 301, 2560, np.nan, 
                     "python-requests/2.21.0"],
                    ["97.105.19.58", "[19/Apr/2019:19:42:41+0000]", 
                     "GET /api/v1/sales?page=79/HTTP/1.1", 200, 2056327, np.nan, 
                     "python-requests/2.21.0"]], columns=colnames)

df = pd.concat([df_orig, new])  # DataFrame.append was removed in pandas 2.0
df.timestamp = df.timestamp.str.replace(r'(\[|\])', '', regex=True)
df.timestamp= pd.to_datetime(df.timestamp.str.replace(':', ' ', 1)) 
df = df.set_index('timestamp')
for col in ['request_method', 'request_agent', 'destination']:
    df[col] = df[col].str.replace('"', '')

df['request_method'] = df.request_method.str.replace(r'\?page=[0-9]+', '', regex=True)

df['size_mb'] = [n/1024/1024 for n in df['size']]

In [None]:
df.isna().sum()

In [None]:
def get_data():
    df = pd.read_csv('http://python.zach.lol/access.log',          
                  engine='python',
                  header=None,
                  index_col=False,
                  names=['ip', 'timestamp', 'request_path', 'status', 'size', 'request_agent'],
                  sep=r'\s(?=(?:[^"]*"[^"]*")*[^"]*$)(?![^\[]*\])',
                  na_values='-',
                  usecols=[0, 3, 4, 5, 6, 8])

    df.timestamp = df.timestamp.str.replace('[', '')
    df.timestamp = df.timestamp.str.replace(']', '')
    df.timestamp= pd.to_datetime(df.timestamp.str.replace(':', ' ', 1)) 

    df.request_path = df.request_path.str.replace('"', '')

    df = df.set_index('timestamp')
    df = df.tz_localize('utc').tz_convert('America/Chicago')
    
    # Replace the request paths that have only / with home_page.
    df.replace(regex=r'^/$', value='home_page', inplace=True)
    
    # Take off the page number stuff from the request paths...
    df['request_path'] = df.request_path.str.replace(r'\?page=[0-9]+', '', regex=True)
    
    # add a field for mb size
    df['size_mb'] = [n/1024/1024 for n in df['size']]
    
    return(df)

In [None]:
df.info()

In [None]:
df.status.value_counts()

##### Clean up text

In [None]:
maxcount = 0
for i, command in enumerate(df.request_method):
    if len(command.split('/')) > maxcount:
        maxcount = len(command.split('/'))
        print('maxcount: ', maxcount)
        print(command)
print('final maxcount is ', maxcount)

1. Resample to 30-minute intervals, taking the median of `size_mb`.
2. Fill in missing timestamps, i.e. intervals that are absent because no data was captured during that time. We want a continuous time index, with those periods filled with 0.

In [None]:
df.index.date

In [None]:
my_datetime_fmt = mdates.DateFormatter('%m-%d %H:%M')

df_ts_size = df['size_mb'].resample('30min').median()

idx = pd.date_range(
    df_ts_size.sort_index().index.min(), 
    df_ts_size.sort_index().index.max(),
    freq='30min'
    )

df_ts_size = df_ts_size.reindex(idx, fill_value=0).fillna(value=0)

In [None]:
df_ts_size

##### Split into train and test

In [None]:
start_date_train = df_ts_size.head(1).index[0]
end_date_train = '2019-04-17 23:30:00'
start_date_test = '2019-04-18 00:00:00'

train = df_ts_size[:end_date_train]
test = df_ts_size[start_date_test:]

In [None]:
pd.concat([train.head(), train.tail()])

In [None]:
plt.figure(figsize=(12, 8))
plt.plot(train)
plt.plot(test)
plt.show()

##### SMA - Simple Moving Average

In [None]:
# Calculating the short-window simple moving average
short_rolling = train.rolling(window=12).mean()

# Calculating the long-window simple moving average
long_rolling = train.rolling(window=24).mean()

##### Plot the 2 window sizes for the SMA

In [None]:
fig, ax = plt.subplots(figsize=(12,4))

ax.plot(train.index, 
        train,
        label='Size (MB)')

ax.plot(short_rolling.index, 
        short_rolling, 
        label = '6-Hour SMA')
ax.plot(long_rolling.index, 
        long_rolling, 
        label = '12-Hour SMA')

ax.legend(loc='best')
ax.set_ylabel('Size (MB)')
# ax.xaxis.(rotate=90)
# ax.xaxis.set_major_formatter(my_datetime_fmt)

##### Compute the Exponential Moving Average

In [None]:
# Using Pandas to calculate a 6-hour span EMA (12 periods of 30 minutes).
# adjust=False specifies that we are interested in the 
# recursive calculation mode.
ema_short = train.ewm(span=12, adjust=False).mean()
ema_short[0:3]

ema_long = train.ewm(span=24, adjust=False).mean()
ema_long[0:3]
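
To make the recursive mode concrete: with `adjust=False`, pandas starts the EMA at the first observation and then applies `ema[t] = (1 - alpha) * ema[t-1] + alpha * x[t]`, where `alpha = 2 / (span + 1)`. The sketch below checks that recursion by hand against `ewm` on a short made-up series.

```python
import numpy as np
import pandas as pd

x = pd.Series([0.49, 0.51, 0.48, 1.10, 0.50])  # hypothetical values
span = 3
alpha = 2 / (span + 1)

# Manual recursion: the first EMA value is the first observation.
manual = [x.iloc[0]]
for value in x.iloc[1:]:
    manual.append((1 - alpha) * manual[-1] + alpha * value)

pandas_ema = x.ewm(span=span, adjust=False).mean()
print(np.allclose(manual, pandas_ema))
```

A larger span gives a smaller `alpha`, so each new observation moves the average less. That is why the span-24 EMA reacts more slowly than the span-12 one.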

##### Compare SMA with EMA

In [None]:
fig, ax = plt.subplots(figsize=(12,4))

ax.plot(train.index, 
        train,
        label='Size (MB)')

ax.plot(short_rolling.index, 
        short_rolling, 
        label = '6-Hour SMA')
ax.plot(long_rolling.index, 
        ema_short, 
        label = 'Span 6-Hour EMA')
ax.plot(long_rolling.index, 
        long_rolling, 
        label = '12-Hour SMA')
ax.plot(long_rolling.index, 
        ema_long, 
        label = 'Span 12-Hour EMA')

ax.legend(loc='best')
ax.set_ylabel('Size (MB)')

yhat = pd.DataFrame(dict(actual=test))

#### Forecast using the EMA

In [None]:
# periods = 24
yhat['moving_avg_forecast'] = ema_long.iloc[-1]

##### Compute the '%b' for each record

In [None]:
# compute the absolute error:
yhat['error'] = abs(yhat.actual - yhat.moving_avg_forecast)

# compute the median of the absolute error (not used below):
# yhat.error.median()

# compute upper band and lower band using IQR with weight of 3

q3 = yhat.error.describe().loc['75%']
q1 = yhat.error.describe().loc['25%']

# adding .1 to the IQR so that we don't end up with a denominator of 0. 
ub = q3 + 3*(q3-q1+.1)
lb = q1 - 3*(q3-q1+.1)

yhat['pct_b'] = (yhat.actual-lb)/(ub-lb)

In [None]:
# Maggie's code to find outliers:
# span = 24
# ema_long = train.ewm(span=span, adjust=False).mean()
# midband = ema_long[-1]
# ub = midband + ema_long[-24:-1].std()*3
# lb = midband - ema_long[-24:-1].std()*3

# yhat['moving_avg_forecast'] = midband
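
The commented-out approach above uses standard-deviation bands around the EMA, which is the usual Bollinger-style definition of %b: `%b = (value - lower) / (upper - lower)`. A minimal sketch of that variant on made-up data follows. It is not the IQR-based calculation used earlier. The function name, the span, the weight `k`, and the choice to build each band from the previous interval's statistics (so a spike cannot widen its own band) are all assumptions.

```python
import pandas as pd

def pct_b_ema(series, span=6, k=3):
    """%b against EMA +/- k * EW std, using the prior interval's statistics."""
    mid = series.ewm(span=span, adjust=False).mean().shift(1)
    std = series.ewm(span=span, adjust=False).std().shift(1)
    upper = mid + k * std
    lower = mid - k * std
    return (series - lower) / (upper - lower)

# Hypothetical 30-minute values with a spike in the last interval
values = pd.Series([0.48, 0.50] * 10 + [2.0])

pct_b = pct_b_ema(values)

# %b > 1 means the value is above the upper band.
anomalies = values[pct_b > 1]
print(anomalies)
```

A %b of 0.5 sits on the midband, values between 0 and 1 are inside the bands, and anything above 1 or below 0 falls outside them. The first two intervals have no prior standard deviation, so their %b is NaN and they are never flagged.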

##### Extract the anomalies

In [None]:
yhat[yhat.pct_b > 1]

##### Plot

In [None]:
plot_and_eval(yhat.moving_avg_forecast, actual=test)
plt.figure(figsize=(12,4))
plt.plot(yhat.pct_b)

In [None]:
yhat[yhat.pct_b > 1]

In [None]:
df[df.size_mb>1]

| IP Address | Country | Region | City | ISP | Organization | Latitude | Longitude |
|---|---|---|---|---|---|---|---|
| 95.31.18.119 | Russia | Moscow | Moscow | CORBINA-BROADBAND | Not Available | 55.7315 | 37.6454 |
| 97.105.19.58 | United States | Texas | San Antonio (Downtown) | Spectrum | Codeup LLC | 29.4267 | -98.4896 |