# Building your First Model in Alteryx

Having built the model from Lesson 3-22 in Alteryx, I wanted to see whether I could get the same results with Python and statsmodels. It was surprisingly easy to code the model. As a first step, I had to extend the Excel data from the given format

<img src="3-22-excel-1.png" width=500>

to a format with dummy variables for the category of Industry.

<img src="3-22-excel-2.png">

As we learned in our lessons, Industry is a nominal variable: its values cannot be counted or measured. To use it in a linear regression, we have to encode it numerically as dummy (0/1) variables. You might also recall that for three values in the nominal data, you only have to add two dummy variables. The value without its own dummy column is the baseline that the other values are compared to. Having extended the given data, I used most of the code from my last example

https://github.com/jegali/DataScience/blob/main/lesson-3-12-multi-ticket-sample.ipynb

and the statsmodels library to calculate the values for the linear regression.

# a reference to the pandas library
import pandas as pd

# To visualize the data
import matplotlib.pyplot as plt  

# This is new. Let's try a library that does
# the linear regression for us
import statsmodels.api as sm

# The Excel file must be in the same directory as this notebook.
# Be sure to use the right Excel data file:
# Udacity has several files named linear-example-data with different content.
# This one is the enriched Excel file.
excel_file = 'linear-example-data-6.xlsx'

# Via pandas, the contents are read into a data frame named data
data = pd.read_excel(excel_file)

# let's have a look at the data
# print (" Contents of the file ", excel_file)
# print(data)

# We want to predict the average number of tickets, so this is our dependent variable
# and has to be put on the Y-axis
Y = data['Average Number of Tickets']

# We use the remaining columns as independent variables and thus as data feed for the X-axis.
# You may notice that the column "Manufacturing" is missing. This is the dummy variable I leave out.
# The column "Industry" is also missing - we don't need it as it only contains nominal data.
X = data[['Number of Employees', 'Value of Contract', 'Retail', 'Services']]

# Let's do the evaluation with statsmodels.
# We have to add a constant column; otherwise
# the model is fitted without a Y-intercept.
X = sm.add_constant(X)

# build the model
model = sm.OLS(Y,X).fit()
model_prediction = model.predict(X)
model_details = model.summary()

print(model_details)

                                OLS Regression Results                               
Dep. Variable:     Average Number of Tickets   R-squared:                       0.965
Model:                                   OLS   Adj. R-squared:                  0.956
Method:                        Least Squares   F-statistic:                     103.7
Date:                       Tue, 05 Jan 2021   Prob (F-statistic):           9.69e-11
Time:                               12:38:46   Log-Likelihood:                -62.537
No. Observations:                         20   AIC:                             135.1
Df Residuals:                             15   BIC:                             140.1
Df Model:                                  4                                         
Covariance Type:                   nonrobust                                         
                          coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------

Here are the results calculated by Alteryx. You can find them at the end of the video in Lesson 3-22:

<img src="3-22-alteryx-1.png">

This solution may not be as convenient as Alteryx, which does a lot of the conversion for us, including creating the dummy variables, but I think Python gives us a lot more flexibility.
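
The dummy-variable step that Alteryx performs automatically can also be done in pandas instead of by hand in Excel. The following is a minimal sketch, assuming a small made-up table with an `Industry` column; the rows and the variable names are illustrative and not taken from the lesson file. It also checks numerically why one category has to be left out: a constant column plus all three dummies gives four columns but only rank 3.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows; the real data lives in the lesson's Excel file
sample = pd.DataFrame({
    "Number of Employees": [10, 20, 30, 40, 50, 60],
    "Industry": ["Retail", "Services", "Manufacturing",
                 "Retail", "Services", "Manufacturing"],
})

# One 0/1 column per category (three columns)
all_dummies = pd.get_dummies(sample["Industry"], dtype=int)

# drop_first=True omits the alphabetically first category ("Manufacturing"),
# which then serves as the baseline the other categories are compared to
dummies = pd.get_dummies(sample["Industry"], drop_first=True, dtype=int)
encoded = pd.concat([sample.drop(columns="Industry"), dummies], axis=1)
print(encoded)

# A constant column plus all three dummies is linearly dependent,
# because the three dummy columns always sum to 1
const = np.ones((len(sample), 1))
rank_all = np.linalg.matrix_rank(np.hstack([const, all_dummies.to_numpy()]))
rank_dropped = np.linalg.matrix_rank(np.hstack([const, dummies.to_numpy()]))

# 4 columns with rank 3 (redundant) versus 3 columns with rank 3 (full rank)
print(rank_all, rank_dropped)
```

Dropping "Manufacturing" here mirrors the choice made in the notebook; any one of the three categories could serve as the baseline, and only the interpretation of the dummy coefficients would change.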