# Cell Tower Outage Prediction Algorithm 

## Steps in CRISP-DM (Cross Industry Standard Process for Data Minning 

Step 1: Understand the Business Objective/Question <br>
Step 2: Data Understanding <br>
Step 3: Data Preperation <br>
Step 4: Modeling <br>
Step 5: Evaluation <br>
Step 6: Deployment <br>

# Step 1: Understand the Business Question

We have data about the times when individual cell towers experienced technical difficulties. Our job is to use the provided data
to predict the probablity of each cell tower going down in the future, and predict how severe the outage will be.

# Step 2: Data Understanding

The following are the variables we have been given and will be using in our classifcation model

## **Data Dictionary**

### **Dependent Variable**
**Fault_severity:** Categorical variable used to represent the severity of the faults 0-being no fault 2-being many faults <br>
<br>
### **Independent Variables**
<font>**ID:**</font> Used to identify a specific event at a specific location <br>
<font>**Location:**</font> Used to identify the location of the cell tower <br>
<font>**Log_feature:**</font> The Types of features logged for that each ID <br>
<font>**Volume:**</font> The amount of logged features <br>
<font>**Resource_type:**</font> The type of resource provided by that specific ID<br>
<font>**Severity_type:**</font> The type of severity level logged for that specific ID<br>


# Step 3: Data Preperation 

## Begin by Downloading necessary libraries 

In [1]:
import pandas as pd 
import numpy as np

## Read the necessary files into Pandas Data Frames

In [2]:
resources = pd.read_csv("resource_type.csv")
severity = pd.read_csv("severity_type.csv")
log = pd.read_csv("log_feature.csv")
event = pd.read_csv("event_type.csv")
train = pd.read_csv("train.csv")

## Step 3a Clean the Data 

Create a combine data frame by merging each of the individual data frames

In [3]:
# Merge all the data frames into one data frame
df=log.merge(event)
df=df.merge(resources)
df=df.merge(severity)
data=train.merge(df)


In [4]:
data.head()

Unnamed: 0,id,location,fault_severity,log_feature,volume,event_type,resource_type,severity_type
0,14121,location 118,1,feature 312,19,event_type 34,resource_type 2,severity_type 2
1,14121,location 118,1,feature 312,19,event_type 35,resource_type 2,severity_type 2
2,14121,location 118,1,feature 232,19,event_type 34,resource_type 2,severity_type 2
3,14121,location 118,1,feature 232,19,event_type 35,resource_type 2,severity_type 2
4,9320,location 91,0,feature 315,200,event_type 34,resource_type 2,severity_type 2


Strip out the words from the Location, Log_feature, Event_type, Severity_type and resource_type columns 

In [5]:
# Loops through each column splitting the strings into a list then appending the numbers to a separate list
loc=[]
for row in data['location']:
    bers=row.split()
    bers=bers[1]
    loc.append(bers)
    
log_f=[]
for row in data['log_feature']:
    bers=row.split()
    bers=bers[1]
    log_f.append(bers)
    
event_t=[]
for row in data['event_type']:
    bers=row.split()
    bers=bers[1]
    event_t.append(bers)
    
resource_t=[]
for row in data['resource_type']:
    bers=row.split()
    bers=bers[1]
    resource_t.append(bers)
    
severity_t=[]
for row in data['severity_type']:
    bers=row.split()
    bers=bers[1]
    severity_t.append(bers)

# Lists with numbers assigned back to respective columns in combined data frame 
data['location']=loc
data['log_feature']=log_f
data['event_type']= event_t
data['resource_type']= resource_t 
data['severity_type']=severity_t

In [6]:
data.head()

Unnamed: 0,id,location,fault_severity,log_feature,volume,event_type,resource_type,severity_type
0,14121,118,1,312,19,34,2,2
1,14121,118,1,312,19,35,2,2
2,14121,118,1,232,19,34,2,2
3,14121,118,1,232,19,35,2,2
4,9320,91,0,315,200,34,2,2


## Step 3b Creating Dummy Variables 

delete all the non categorical variables from a copy of your dataset

In [7]:
categories = data.copy()
del categories['fault_severity']
del categories['id']
del categories['volume']


Store the information from the categorical columns in a list

In [8]:
# create list of names of the categorical columns you want to turn into dummy variables 
dummy_cats = categories.columns
dummies = []

# run the list through loop that appends values from each categorical column from the orginal data set to a new list (dummies)
for i in range(len(dummy_cats)):
    dummies.append(data[dummy_cats[i]])
    
# The newly created list and the list of column names should be the same length 
print len(dummy_cats)
print len(dummies)



5
5


Create the dummy variable columns 

In [10]:
# Create prefixes to identify which category dummy variables belong to (used because we removed string in front of nums above)
prefixes = ['loc', 'logf', 'e_t', 'r_t', 's_t']
for i in range(len(prefixes)):
    # creates dummy variables for each column in the categorical dataframe and adds prefexies to the newly created colums
    dummycreation = pd.get_dummies(categories[dummy_cats[i]], prefix = prefixes[i])
    # joins/adds the columns to the to the categories dataframe
    categories = categories.join(dummycreation)
    #Deltetes orginal categorical column
    del categories[dummy_cats[i]]

In [11]:
# updated data frame with dummy variable columns 
categories.head()

Unnamed: 0,loc_1,loc_10,loc_100,loc_1000,loc_1002,loc_1005,loc_1006,loc_1007,loc_1008,loc_1009,...,r_t_5,r_t_6,r_t_7,r_t_8,r_t_9,s_t_1,s_t_2,s_t_3,s_t_4,s_t_5
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,1,0,0,0


## Step 3c Compacting the data 

Add the location column back into the dummy_variable dataframe and compress data by location

In [12]:
# add the location column back into the dummy_variable dataframe
df=categories.copy()
df['location']=data['location']

In [13]:
# combine all the location instances by summing up the columns where the location is similar 
df_sum = df.groupby('location').sum()

In [14]:
# flag each of the dummy variable columns (these coulmns should onlt be 1 or 0)
col_names=df_sum.columns
for cols in col_names:
    df_sum[cols]=df_sum[cols].apply(lambda x: 1 if x > 0 else 0)

In [15]:
# Check data Frame to ensure flags correctly applied
df_sum.head()

Unnamed: 0_level_0,loc_1,loc_10,loc_100,loc_1000,loc_1002,loc_1005,loc_1006,loc_1007,loc_1008,loc_1009,...,r_t_5,r_t_6,r_t_7,r_t_8,r_t_9,s_t_1,s_t_2,s_t_3,s_t_4,s_t_5
location,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
1,1,0,0,0,0,0,0,0,0,0,...,0,0,0,1,0,1,1,0,0,0
10,0,1,0,0,0,0,0,0,0,0,...,0,0,0,1,0,1,0,0,0,0
100,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,1,1,0,1,0
1000,0,0,0,1,0,0,0,0,0,0,...,0,0,0,1,0,1,1,0,0,0
1002,0,0,0,0,1,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0


Create another data frame for the numeric variables and compress the data by location

In [16]:
# Create a data frame for the numerical variables
df_extra=pd.DataFrame()
df_extra['location']=data['location']
df_extra['volume']=data['volume']
# Sum up all numeric variables by location
df_extra = df_extra.groupby(by='location').sum()

In [17]:
# Check to ensure data is summed and grouped by location
df_extra.head()

Unnamed: 0_level_0,volume
location,Unnamed: 1_level_1
1,664
10,20
100,246
1000,29
1002,2


Create another data frame for the dependent variable and compress the data by location

In [18]:
# create new dataframe with independent variable (fault severity) Group by location and take the mean of the fault_severities

df_independent= data.copy()

del df_independent['id']
del df_independent['log_feature']
del df_independent['volume']
del df_independent['event_type']
del df_independent['resource_type']
del df_independent['severity_type']

df_independent['fault_severity']=df_independent['fault_severity'].apply(lambda x: float(x))

df_independent = df_independent.groupby(by='location').mean()
df_independent['fault_severity'] = df_independent['fault_severity'].apply(lambda x: round(x, 0))


# Step 4: Modeling 

Combine the independent data frames and store the independent and dependent data frames in variables for the modeling process


In [19]:
# combine the categorical and numerical dataframes and store in X
x=df_extra.join(df_sum)
# store the independnt data frame in y
y=df_independent['fault_severity']

Import train test split and split the data

In [20]:
from sklearn.cross_validation import train_test_split

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)



Import the necessary modeling libraries and fit the model

In [21]:
from sklearn.ensemble import GradientBoostingClassifier 

# create the model
gradient = GradientBoostingClassifier()
# fit the model 
g = gradient.fit(X_train, y_train)

Predict using the model and store in a list for later use

In [22]:
predicted=g.predict(x)

# Step 5 Evaluation

In [23]:
# Get the Accuracy score (percent of times algorthim correctly predicted the dependent variable (fault_severity))

g.score(X_test, y_test)

0.70967741935483875

# Step 6 Deployment

Create Prediction/Probability DataFrame

In [24]:
# Create data frame with location and columns for each possible predicted outcome
df_prob=pd.DataFrame(columns = ['location','Predicted','Probablity 0','Probablity 1', 'Probablity 2'])
df_prob['location']=data['location'].unique()
# Store the predicted outcome in the predicted outcome column 
df_prob['Predicted']=predicted


In [25]:
# Store the probablities for 0, 1 ,2 fault severity in a list
probablities=g.predict_proba(x)

In [26]:
# Convert list to an array 
probablities=np.array(probablities)

Put the probablities of each outcome into the created data frame

In [27]:
# define lists to store the probabilities for each column in the data frame
rows=range(0,929)
cols=[0,1,2]
list_1=[]
list_2=[]
list_3=[]

# Create loop that stores the probabilities for each column into a list
for row in rows:
    for col in cols:
        if col==0:
            num=probablities[row][col]
            list_1.append(num)
        if col==1:
            num=probablities[row][col]
            list_2.append(num)
        if col==2:
            num=probablities[row][col]
            list_3.append(num)


In [28]:
# Assign lists to the probability columns

df_prob['Probablity 0']=list_1
df_prob['Probablity 1']=list_2
df_prob['Probablity 2']=list_3

Check the data frame

In [29]:
df_prob.head()

Unnamed: 0,location,Predicted,Probablity 0,Probablity 1,Probablity 2
0,118,1.0,0.073057,0.909965,0.016979
1,91,0.0,0.608914,0.373421,0.017665
2,152,0.0,0.873024,0.116071,0.010905
3,931,0.0,0.608303,0.374005,0.017693
4,120,0.0,0.712712,0.274312,0.012977
