In [1]:
# Importing useful packages
import numpy as np
from scipy import stats
import pandas as pd
import sklearn as sk
import seaborn as sb
import datetime as dt
import pylab 
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
from new_var import calc,y1function,y2function,C3function,C6function
from outliers import outlier
from Standardising import standard
from Recoding_SIC_Codes import Industry_Division 
%matplotlib inline
from sklearn import datasets
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
import statsmodels.api as sm
from sklearn import metrics
from sklearn import model_selection
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split

# Read in Data file and define NaN values
ipo_data = pd.read_excel("Competition #1 Raw Data_UPDATED I3.xlsx",header=0,na_values='-' )
ipo_data.I3 = ipo_data.I3.astype(object) # Converting to object for the moment to tidy up summary statistics
# Print the data type of each column
print(ipo_data.dtypes)


One extremely important point the authors of this report wish to make is that there is a mistake in the data dictionary. The variable *I3* is listed as text in the dictionary when it is in fact an integer variable representing industry codes for the companies. The reason for this is that some companies' codes were entered erroneously in the data set, with multiple codes separated by commas. There are two mistakes here: the first is having multiple codes for one company, and the second is using commas as separators, which forces the field to be read as a text (object) field. One member of our team painstakingly went through the missing and incorrect codes and corrected them in the raw data file. This allowed us to read in *I3* correctly as an int variable. We convert it to an object for this next section because it is a categorical variable, for which numeric summary statistics are not meaningful.
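As a minimal sketch of how such entries could be located before correcting them by hand, the snippet below flags cells that hold more than one comma-separated code or no code at all. The values in `raw_i3` are made up for illustration and are not taken from the competition file.

```python
import pandas as pd

# Hypothetical raw values: some cells hold several comma-separated codes.
raw_i3 = pd.Series(["2834", "7372, 7373", "3674", None, "100,200"], name="I3")

# True where a cell contains a comma, i.e. more than one code (missing cells count as False).
has_multiple = raw_i3.str.contains(",", regex=False, na=False)

# Rows that cannot be read directly as a single integer code.
needs_review = raw_i3[has_multiple | raw_i3.isna()]
print(needs_review)
```

Only the flagged rows then need to be researched and corrected manually.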

In [None]:
# Printing summary statistics
ipo_data.describe()

The resulting table above led us to several immediate takeaways:
* Variables *T2* to *T5* have incorrect minimums. Inspecting the data made this clear: *T2*, for example, is the number of words in the MD&A section of a company's IPO prospectus and cannot be 0, especially when the *T3* to *T5* values for the same records are greater than 0. It is also impossible for these variables to have a negative count. Based on this evidence, it was decided to treat values of this nature as "Missing"
* *P(H)* was displaying a minimum of 0. Given that this is the upper bound of the IPO price range, this value must be incorrect
* There may be instances where *P(L)* is higher than *P(H)*, which would be incorrect, as the upper bound of the price range could not be lower than the lower bound of the range
* Many attributes, such as *C3* and *C7*, had maximum values that seemed dramatically high given their mean and interquartile range
* *C6* has some unusually high values
* As the purpose of this study is to predict the underpricing phenomenon, it is important to note that the mean value of *P(IPO)* is less than the mean value of *P(1Day)*
* The maximum of the S-1 filing variable (*C1*) is over five years. Considering that companies can typically complete this process within six months, this was viewed as an error
* Our target and control variables would need to be created

These immediate takeaways gave us a clearer picture of the state of our data and pointed to where we should next investigate anomalies and outliers within each attribute and record.
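The logical checks in the list above can be expressed as boolean flags. The sketch below uses a toy frame with column names from the data dictionary and invented values, so it illustrates the idea rather than reproducing our actual inspection.

```python
import pandas as pd

# Toy rows; the column names follow the data dictionary, the values are made up.
toy = pd.DataFrame({
    "P(H)": [15.0, 0.0, 12.0],
    "P(L)": [13.0, 10.0, 14.0],
    "T2":   [420.0, -1.0, 0.0],
})

# One boolean column per rule; True marks a record that violates the rule.
checks = pd.DataFrame({
    "nonpositive_upper_bound": toy["P(H)"] <= 0,
    "lower_above_upper": toy["P(L)"] > toy["P(H)"],
    "invalid_count": toy["T2"] <= 0,
})

# Number of violating records per rule.
print(checks.sum())
```

Summing each flag column gives a quick count of suspicious records per rule.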

Next we look at the frequency breakdown of our industry indicators in the variable *I3*:

In [None]:
# Returns the frequency
ipo_data.I3.value_counts()

As we can see, the code 2834 has the largest frequency, at 76. Another key feature to note is the length of the output: we have 184 unique values. This will require some recoding later, along with some research into the codes to determine which ranges correspond to which sectors, in order to reduce the number of categories. This is discussed in the Recoding section of this report.

Next we examine the missingness of our data. We know that there are 682 records in our data, so using this value we can calculate the percentage of missing data for each variable.

In [None]:
# Print the percentage of missing values for each variable
n_records = len(ipo_data)  # 682 records in this data set
for col in ipo_data.columns:
    miss = (n_records - ipo_data[col].count()) / n_records * 100
    print("The missingness of variable {}".format(col))
    print("{0:.2f}%".format(miss))

As we can see, the column with the largest amount of missing data is *C7*, the Sales field, with 10.56%. The percentages for the columns related to sentence and word counts appear quite small, but as we know, they understate the problem: values of 0 and -1 should be classified as missing, yet they are counted as present here.
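To illustrate how much the picture can change once 0 and -1 are treated as missing, here is a small sketch on invented data. The choice of `count_cols` is an assumption made for the example.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "C7": [10.0, np.nan, 30.0, 40.0],
    "T2": [500.0, 0.0, -1.0, 450.0],
})

# Share of NaN per column, without relying on a hard-coded row count.
nan_pct = toy.isna().mean() * 100

# Treat non-positive counts as missing, in the text-count columns only.
count_cols = ["T2"]  # assumption: illustrative subset of the count columns
masked = toy.copy()
masked[count_cols] = masked[count_cols].where(masked[count_cols] > 0)
adjusted_pct = masked.isna().mean() * 100

print(nan_pct)
print(adjusted_pct)
```

In this toy case *T2* moves from 0% to 50% missing, while *C7* is unchanged.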

Before we begin preparing our data for modelling, we look at the shapes of the distributions of our continuous variables. It is important to note that we drop the NaN values before generating these plots, as the matplotlib histogram function does not deal with missing data well. Later in this report, we will see how we actually dealt with the missing data.

In [None]:
cont_var=['C1','C3','C4','C5','C6','C7','T1','T2','T3','T4','T5','S1','S2','S3'] # Continuous variables to plot
cont_plot=ipo_data[cont_var]

# Plot a histogram of each continuous variable, dropping NaN values first
for col in cont_plot.columns:
    plt.hist(cont_plot[col].dropna(), bins=30)
    plt.title(col)
    plt.show()

As we can see, much of our data is heavily skewed. This is due in part to the presence of outliers and extreme values that exist in our data set. For example, look at the histogram for the variable *T5*, and notice the outlier at the 10000 mark which is some considerable distance away from the rest of our distribution but nonetheless, is a part of the distribution and thus creates the skewness in our data. The data that is the least skewed is *C4*, but a slight left skew still exists.

In the next section we move on to data preparation. This involves imputing missing data, dealing with outliers, normalising our data and other such tasks, all of which are carried out to increase the accuracy of our logistic regression models.

### Imputation
When replacing missing values, our methods depended on the data type of the specific variable in question. For descriptive attributes like *I3*, we conducted research to find the correct identification numbers and then manually updated our dataset. For Continuous/Float variables, we used a Python script to replace the missing values with their respective column means, a simple and widely used approach. For the purposes of this report, we arbitrarily chose the mean for our benchmark model, but replacing with the median is an equally acceptable method.

Additionally, negative and zero values in the *T1* to *T5* and *S1* columns are replaced with the column mean, as these are believed to be errors in the data. We feel justified in this because, having researched the forms, it is our understanding that these are mandatory sections, which confirms our belief that these values are errors. Negative values were replaced because it is impossible to have a negative word total, so we believe such a value is either there in error or is used to represent missing data. As the variable *C2* is a binary variable, it was decided to replace its missing values with "1", as it is a far more common value than 0 based on the summary statistics.
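One detail worth keeping in mind is the order of these steps. The sketch below, on invented numbers, compares the order used in this report (fill NaN with the mean, then overwrite non-positive values with the mean) with an alternative that first marks the invalid values as missing. The alternative is shown only for comparison and is not what the benchmark model uses.

```python
import numpy as np
import pandas as pd

t2 = pd.Series([400.0, 600.0, 0.0, -1.0, np.nan], name="T2")

# Order used in this report: the mean is computed while 0 and -1 are still present.
filled = t2.fillna(t2.mean())
report_style = filled.mask(filled <= 0, filled.mean())

# Alternative (assumption, for comparison): mark invalid values as NaN first,
# so that the mean is computed from valid observations only.
valid_only = t2.where(t2 > 0)
alternative = valid_only.fillna(valid_only.mean())

print(report_style.tolist())
print(alternative.tolist())
```

Because the invalid values pull the mean down, the two orders can give noticeably different imputed values.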

In [2]:
# Replacing NaN in C2 with 1 
ipo_data['C2']=ipo_data['C2'].fillna(1)

# Replace NaN values with the column mean (numeric columns only)
ipo_data=ipo_data.fillna(ipo_data.mean(numeric_only=True))

# Replacing negative and 0 values in the word/sentence counts with the column mean,
# as it is believed that these are errors
for col in ['T1','T2','T3','T4','T5','S1']:
    ipo_data[col]=ipo_data[col].mask(ipo_data[col] <= 0,ipo_data[col].mean())

# We can now look at more representative descriptive statistics
ipo_data.describe()


### Variable Creation
Based on the provided data dictionary, we created two target variables and three control variables:
#### Target
 1. Y1 - Binary variable, set to 1 if the IPO offer price is less than the midpoint of the IPO price range
 2. Y2 - Binary variable, set to 1 if the IPO offer price is less than the first day trading price
 
#### Control
 1. C3x - Binary variable, set to 1 if Earnings per Share is positive
 2. C5x - Continuous/Float variable, representing the share overhang
 3. C6x - Continuous/Float variable, representing the up revision
 
Additionally, we also converted all our count variables in *T1*-*T5* and *S1*-*S3* into ratios for better statistical analysis.
 
 [New Variable Creation](https://github.com/ConorFeeney/IS540-Code/blob/master/new_var.py) 
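The definitions above can be sketched in vectorised form as follows. This is an illustration based on the written definitions, not the contents of `new_var.py`; the toy values are invented, and the column names `P(IPO)`, `P(H)`, `P(L)`, `P(1Day)` and `C3` are taken from the data dictionary.

```python
import pandas as pd

toy = pd.DataFrame({
    "P(IPO)":  [10.0, 14.0, 20.0],
    "P(H)":    [12.0, 14.0, 18.0],
    "P(L)":    [10.0, 12.0, 16.0],
    "P(1Day)": [11.0, 13.5, 25.0],
    "C3":      [0.5, -1.2, 0.0],
})

# Midpoint of the IPO price range.
mid = (toy["P(H)"] + toy["P(L)"]) / 2

toy["Y1"] = (toy["P(IPO)"] < mid).astype(int)              # offer price below range midpoint
toy["Y2"] = (toy["P(IPO)"] < toy["P(1Day)"]).astype(int)   # offer price below first-day price
toy["C3x"] = (toy["C3"] > 0).astype(int)                   # positive earnings per share

print(toy[["Y1", "Y2", "C3x"]])
```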

In [3]:
# Creating our target and control variables
ipo_data['Y1'] = ipo_data.apply(y1function, axis=1)
ipo_data['Y2'] = ipo_data.apply(y2function, axis=1)
ipo_data['C3x'] = ipo_data.apply(C3function, axis=1)
ipo_data['C6x'] = ipo_data.apply(C6function, axis=1)

# Creating ratios for the word, sentence and positive/negative counts
calc(ipo_data)

# Removing redundant columns now covered by the new variables
ipo_data = ipo_data.drop(columns=['C3','C5','C6','T1','T2','T3','T4','T5','S1','S2','S3'])

#Looking at the new description of the data
ipo_data.describe()


As we created several new variables, we felt justified in removing the originals they replace. For example, we created *T3x*, the ratio of real words to total words, so *T3* was removed from our data set as it is covered in all future analysis by *T3x*. This was done to keep our data frame tidy. The same logic was applied to *T4* (now *T4x*, ratio of long sentences to total sentences), *T5* (now *T5x*, ratio of long words to total words), *S1* (now *S1x*, ratio of positive words to total words), *S2* (now *S2x*, ratio of negative words to total words) and *S3* (now *S3x*, ratio of uncertain words to total words). Finally we removed *T1* and *T2*, as these were the total counts for words and sentences that were used to create the ratio variables.

When we read in our data, some variables were not given appropriate types, so we changed them and then printed the types of all our variables to check that they were as expected.

In [4]:
# Converting C6x and C2 to correct type
ipo_data.C6x = ipo_data.C6x.astype(float)

ipo_data.C2 = ipo_data.C2.astype(int)
print(ipo_data.dtypes) # checking the types


### Normalization

Most statistical methods (the parametric methods) include the assumption that the sample is drawn from a population where the values have a Normal distribution. One of the first steps of statistical analysis of your data is therefore to check the distribution of the different variables.

Having dealt with missing values and errors in the data, we moved on to normalizing our data, followed by dealing with outliers and finally standardising our data.

The Normal distribution is symmetrical, and neither very peaked nor very flat-topped. If we examine the charts below we can see that our data is often skewed.


In [5]:
normal_var=['C1','C4','C6x','C7','T3x','T4x','T5x','S1x','S2x','S3x'] #Variables that need to be normalized
norm_plot=ipo_data[normal_var]
for i in range(len(norm_plot.columns)):
        plt.hist(norm_plot.iloc[:,i],bins=30,density=True)
        xt = plt.xticks()[0]  
        xmin, xmax = min(xt), max(xt)  
        lnspc = np.linspace(xmin, xmax, len(norm_plot.iloc[:,i]))
        plt.title('%s' % norm_plot.columns[i])
        # lets try the normal distribution first
        m, s = stats.norm.fit(norm_plot.iloc[:,i]) # get mean and standard deviation  
        pdf_g = stats.norm.pdf(lnspc, m, s) # now get theoretical values in our interval  
        plt.plot(lnspc, pdf_g, label="Norm")
        plt.show()


Examining the charts we can note several points of interest. *C1*, *C6x*, *C7*, *T3x*, *T5x*, *S1x*-*S3x* are all heavily skewed to the right. This is more than likely caused by the presence of extreme values and outliers, which we will be dealing with in the next section. *T4x* is also right skewed, but far less so. The variable C4 is left skewed. For the purposes of our benchmark we will be dealing with this skewness using some powerful, but simple methods.

For dealing with skew, the following transformations perform well:
 * The **log** transformation (sometimes computed as **log**(x+A), where A is some constant, in order to deal with negative or 0 values)
 * The **Square Root** function
 * The **Reciprocal**, i.e. **1/x**
 * **Power** transformations

We can also use some combination thereof. For our base model, it was decided to keep things simple initially with a goal to revisit this section in the future after normal transformations have been built into our loop for the selection of the Logistic Regression model.

For right skewed data, the **log** transformation works well, and this was the transformation selected for our benchmark model for the severely right skewed data listed above. For *C4* and *T4x*, it is suspected that once the outliers are dealt with, the data will become more normally distributed. As this is our benchmark, this was the chosen approach, with a view to returning to it as we seek to improve the model.
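Besides inspecting histograms, the effect of a transformation can be summarised with the sample skewness. The sketch below uses synthetic log-normal data (an assumption chosen only because it is strongly right skewed) and the pandas `Series.skew` method.

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed data standing in for one of our variables.
rng = np.random.default_rng(0)
x = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=2000))

# Sample skewness before and after each candidate transformation.
skews = {
    "raw": x.skew(),
    "sqrt": np.sqrt(x).skew(),
    "log1p": np.log1p(x).skew(),  # log(x + 1), as used in the histograms
}

for name, value in skews.items():
    print(f"{name}: {value:.2f}")
```

A skewness closer to 0 suggests a more symmetric distribution, which gives a quick numeric complement to the plots.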

In [6]:
transform=['C1','C6x','C7','T3x','T5x','S1x','S2x','S3x'] # Right-skewed variables to try the log transformation on
norm_plot=ipo_data[transform]

# Plotting histograms of log(x+1) for each variable
for col in norm_plot.columns:
    plt.hist(np.log(norm_plot[col]+1),bins=30,density=True)
    plt.title(col)
    plt.show()


As we can see, the **log** transformation has worked quite well for some of our variables: *C1* and *C7* look to be much more normally distributed. The remainder still need some work, their issue being a high proportion of zero values. Trying the **Square Root** function we get the following results:

In [7]:
transform=['C6x','T3x','T5x','S1x','S2x','S3x'] # Variables to try the square root transformation on
norm_plot=ipo_data[transform]

# Plotting histograms of the square root of each variable
for col in norm_plot.columns:
    plt.hist(norm_plot[col]**0.5,bins=30,density=True)
    plt.title(col)
    plt.show()


It was decided to square root the remaining variables. The justification was that the square root appeared to make the *S1x*-*S3x* variables more normal. As for the *C6x*, *T3x* and *T5x* variables, standard transformations clearly do not work on these, and further research into methods such as binning will be required. For the purposes of our base model, it was decided to keep them square rooted.

In [8]:
sqrt_transform=['C6x','T3x','T5x','S1x','S2x','S3x'] #Variables that need to be square rooted
ipo_data[sqrt_transform]=ipo_data[sqrt_transform]**0.5 #square rooting variable

log_transform=['C1','C7'] #Variables to be log transformed
ipo_data[log_transform]=np.log(ipo_data[log_transform])#log transformation

# Viewing new summary statistics
ipo_data.describe()


### Outliers
In statistics, an outlier is an observation point that is distant from other observations. An outlier may be due to variability in the measurement or it may indicate experimental error; the latter are sometimes excluded from the data set. Outliers can occur by chance in any distribution, but they often indicate either measurement error or that the population has a heavy-tailed distribution; as we saw in the last section, they produced heavy tails in certain variables. Outlier points can therefore indicate faulty data, erroneous procedures, or areas where a certain theory might not be valid. However, in large samples, a small number of outliers is to be expected (and not due to any anomalous condition). Thankfully, there are methods for dealing with outliers. While outliers are expected, extreme illogical values constitute faulty data: a negative EPS of 786 in the case of NeuroMetrix, Inc. indicates faulty data. Through careful examination we can separate faulty data from outliers.

To deal with outliers, we plan on using the following methods for various data versions:
 1. Calculating the points that are greater than or less than 3 standard deviations
   away from the mean and setting any values outside this range to the upper / lower bound, respectively.
 2. Calculating the points that are greater than or less than 3 standard deviations
   away from the mean and setting any values outside this range to the mean
 3. Calculating the Interquartile Range and finding values outside the limits Q1-IQR*1.5 and Q3+IQR*1.5 
   and setting to be the mean
 4. Calculating the Interquartile Range and finding values outside the limits Q1-IQR*1.5 and Q3+IQR*1.5 
    and setting to be the Q1 or Q3, respectively.

 [Outliers](https://github.com/ConorFeeney/IS540-Code/blob/master/outliers.py) 
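As a rough illustration of methods 1 and 4 from the list above, here is a minimal sketch on an invented series. The function names `cap_three_sigma` and `iqr_to_quartiles` are hypothetical and are not the interface of the linked `outliers.py`.

```python
import pandas as pd

def cap_three_sigma(s: pd.Series) -> pd.Series:
    """Method 1: clip values to the mean plus or minus 3 standard deviations."""
    lower = s.mean() - 3 * s.std()
    upper = s.mean() + 3 * s.std()
    return s.clip(lower=lower, upper=upper)

def iqr_to_quartiles(s: pd.Series) -> pd.Series:
    """Method 4: values beyond Q1 - 1.5*IQR or Q3 + 1.5*IQR become Q1 or Q3."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    out = s.copy()
    out[s < q1 - 1.5 * iqr] = q1
    out[s > q3 + 1.5 * iqr] = q3
    return out

# Toy data: the values 1..29 plus one extreme observation.
s = pd.Series([float(v) for v in range(1, 30)] + [1000.0])

sigma_capped = cap_three_sigma(s)
iqr_capped = iqr_to_quartiles(s)
print(sigma_capped.iloc[-1], iqr_capped.iloc[-1])
```

Note that the extreme value itself inflates the mean and standard deviation, so the 3 standard deviation bound remains far from the bulk of the data, whereas the quartile-based rule is much less affected.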

As we are generating two models, the script to deal with outliers is run in the model sections. This is because each model has a different target variable, so the same outlier method produces different accuracy results for the two models. We therefore use a different method for each model in order to maximise its accuracy.

### Standardization
Feature scaling is a method used to standardize the range of independent variables or features of data. In data processing, it is also known as data normalization and is generally performed during the data pre-processing step. Since the range of values of raw data varies widely, the objective functions of some machine learning algorithms will not work properly without normalization. For example, many classifiers calculate the distance between two points using the Euclidean distance. If one of the features has a broad range of values, the distance will be governed by this particular feature. Therefore, the range of all features should be normalized so that each feature contributes approximately proportionately to the final distance. Bearing all this in mind, we felt that scaling could only help improve our model accuracy.

To standardize, we developed code for three different methods:
 * Min/Max scaling
 * Z-score standardization
 * Decimal scaling
 
[Standardising Function](https://github.com/ConorFeeney/IS540-Code/blob/master/Standardising.py) 

Our standardisation function still has some work needed, but for the purposes of our benchmark model, it suffices.
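For reference, the three methods listed above can be sketched as follows. This is a simplified illustration with hypothetical function names, not the linked `Standardising.py`; it assumes the series is not constant and has a non-zero maximum absolute value.

```python
import numpy as np
import pandas as pd

def min_max(s: pd.Series) -> pd.Series:
    """Rescale to the [0, 1] interval."""
    return (s - s.min()) / (s.max() - s.min())

def z_score(s: pd.Series) -> pd.Series:
    """Centre on the mean and divide by the (population) standard deviation."""
    return (s - s.mean()) / s.std(ddof=0)

def decimal_scaling(s: pd.Series) -> pd.Series:
    """Divide by the smallest power of 10 that brings every |value| below 1."""
    k = int(np.floor(np.log10(s.abs().max()))) + 1
    return s / (10 ** k)

s = pd.Series([12.0, 250.0, -40.0, 999.0])
scaled = {"minmax": min_max(s), "zscore": z_score(s), "decimal": decimal_scaling(s)}
for name, values in scaled.items():
    print(name, values.round(3).tolist())
```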

### Recoding
During the data understanding step, we realized that *I3*, the Standard Industrial Classification (SIC) code for each record, was missing and/or incorrect in some instances. After correcting these instances, we were able to match each record to an Industry Division based on its code. This allowed us to recode the *I3* column into a categorical variable of industry divisions, which we will later be able to use as a way of clustering the records prior to modeling, hopefully improving the model fit. However, this was not completed for the base model as it was a recent development in our data. As we saw earlier, there were over 180 unique SIC codes, which is a significant number. It was decided to do some research in an attempt to find a method of aggregating codes in order to create a more manageable number.

Thankfully one was obtained. We can aggregate these codes up to a "Division" level as shown in the table below.

| Range of SIC Codes | Division                                                           |
|--------------------|--------------------------------------------------------------------|
| 0100-0999          | Agriculture, Forestry and Fishing                                  |
| 1000-1499          | Mining                                                             |
| 1500-1799          | Construction                                                       |
| 1800-1999          | Not Used                                                           |
| 2000-3999          | Manufacturing                                                      |
| 4000-4999          | Transportation, Communications, Electric, Gas and Sanitary service |
| 5000-5199          | Wholesale Trade                                                    |
| 5200-5999          | Retail Trade                                                       |
| 6000-6799          | Finance, Insurance and Real Estate                                 |
| 7000-8999          | Services                                                           |
| 9100-9729          | Public Administration                                              |
| 9900-9999          | Non Classifiable                                                   |


We created an external function to do this recoding for us, which can be found in the link below.

[Recoding SIC Column](https://github.com/ConorFeeney/IS540-Code/blob/master/Recoding_SIC_Codes.py) 
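The table above translates directly into a range lookup. The sketch below is an illustration of that idea and is not the linked `Recoding_SIC_Codes.py`; the function name `sic_to_division` and the `"Unmapped"` fallback for codes outside every range are assumptions.

```python
import pandas as pd

# (lowest code, highest code, division), copied from the table above.
DIVISIONS = [
    (100, 999, "Agriculture, Forestry and Fishing"),
    (1000, 1499, "Mining"),
    (1500, 1799, "Construction"),
    (1800, 1999, "Not Used"),
    (2000, 3999, "Manufacturing"),
    (4000, 4999, "Transportation, Communications, Electric, Gas and Sanitary service"),
    (5000, 5199, "Wholesale Trade"),
    (5200, 5999, "Retail Trade"),
    (6000, 6799, "Finance, Insurance and Real Estate"),
    (7000, 8999, "Services"),
    (9100, 9729, "Public Administration"),
    (9900, 9999, "Non Classifiable"),
]

def sic_to_division(code: int) -> str:
    """Return the division whose range contains the code."""
    for low, high, name in DIVISIONS:
        if low <= code <= high:
            return name
    return "Unmapped"  # assumption: label for codes that fall between the ranges

codes = pd.Series([2834, 7372, 6021, 1311])
divisions = codes.map(sic_to_division)
print(divisions.tolist())
```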

In [9]:
# Converting I3 back to its correct integer type
ipo_data.I3 = ipo_data.I3.astype(int)

# Applying our function to recode the industry codes to division level,
# adding the result as a new column at the end of the table
ipo_data['IndDivision'] = ipo_data.apply(Industry_Division, axis=1)
print(ipo_data.head(5))


## Correlation analysis
Next, we needed to select the predictor variables with low pair-wise correlation values. In order to do this, we used Spearman's correlation test to determine the statistical dependence between the rankings of pairs of variables.

In [10]:
corr=['C1','C4','C6x','C7','T3x','T4x','T5x','S1x','S2x','S3x','Y1','Y2']
ipo_data[corr].corr(method='spearman').style.format("{:.2}").background_gradient(cmap=plt.get_cmap('coolwarm'), axis=1)


Our correlation analysis is still in its early stages and requires significant work. At present, we have only completed a Spearman correlation analysis between the continuous variables and our target variables. In the final paper, we plan to have completed chi-squared tests for comparing our categorical variables, and we will attempt to use ANOVA for comparing our continuous variables with our categorical variables.

Turning to the table above, we see some strong correlations. Note that the predictor variable *C6x* is strongly correlated with the target variable *Y1*. Our next strongest correlations are between *T3x* and *T5x*, and between *S3x* and both *T3x* and *T5x*. These correlations make sense as they relate to types of words. For example, *S3x* is related to uncertain words (in terms of sentiment), which would have some correlation with both long and real words. In particular, long words and uncertain words could be correlated because technical words tend to be "long" and would have no sentiment associated with them.

Due to the early stages of our correlation analysis, our only action from this data is to drop the variable *C6x* as a predictor for the model with target variable *Y1*, as it may weaken the effect of the other variables.
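When the correlation table grows, scanning it by eye becomes error-prone. The sketch below lists variable pairs whose absolute Spearman correlation exceeds a threshold; the synthetic columns and the 0.8 cut-off are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a ** 3,                # monotone in a, so the rank correlation is 1
    "c": rng.normal(size=200),  # independent noise
})

rho = toy.corr(method="spearman")
threshold = 0.8  # assumption: illustrative cut-off

# Keep only the upper triangle so that each pair is reported once.
upper = rho.where(np.triu(np.ones(rho.shape, dtype=bool), k=1))
high_pairs = [
    (row, col, round(float(value), 2))
    for (row, col), value in upper.stack().items()
    if abs(value) > threshold
]
print(high_pairs)
```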

### Data Prep Summary
After experimenting with each method in the suggested tasks, we decided the best approach for our model would be to utilize different combinations and orders of each in a series of models and evaluate performance along the way. In order to do this easily, we created functions for each step and a loop that iterates through each method, and we selected the combination that produced the best results. The code for the iterative loop observing the modelling results of the different methods can be found [here](https://github.com/ConorFeeney/IS540-Code/blob/master/Model%20For%20loop.ipynb).

Based on the above, the models were generated as follows:

## Model Generation
### Y1 Logistic Regression Model
It was decided to keep all variables for the model creation as we are conducting a predictive analysis. Currently there is one exception, and this is due to the high correlation found. *C6x* is not used in the model *Y1* as it may weaken the impact of the other predictor variables.

In [11]:
# Creating a copy of our data frame so that changes made here do not affect the Y2 model
ipo_data_y1=ipo_data.copy()

# for loop to deal with outliers in float variables
for i in range(len(ipo_data_y1.columns)): 
    if ipo_data_y1.iloc[:,i].dtype == float:
        outlier(ipo_data_y1.iloc[:,i],1) 
        
# for loop to standardise  float variables        
for i in range(len(ipo_data_y1.columns)): 
    if ipo_data_y1.iloc[:,i].dtype == float:  
        ipo_data_y1.iloc[:,i]=standard(ipo_data_y1.iloc[:,i],1)
        
# Our Logistic Regression model with results for Y1      
logreg = LogisticRegression()
# we removed C6x due to high correlation
train=['C3x','C4','C1','C7','T3x','T4x','T5x','S1x','S2x','S3x','C2']


# Setting our predictors
X=ipo_data_y1[train]

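# Setting our Target
y=ipo_data_y1['Y1']

# Using recursive feature elimination to aid in the selection of predictors.
# Note: 18 exceeds the number of predictors in X, so every feature is retained.
rfe = RFE(logreg, n_features_to_select=18)
rfe = rfe.fit(X,y)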

# Implementing the model
logit_model=sm.Logit(ipo_data_y1['Y1'],ipo_data_y1[train])
result=logit_model.fit()

# Dividing our data set into testing and training, with our test set being of size 0.3. Then fitting the model
X1_train, X1_test, y1_train, y1_test = train_test_split(X, y, test_size=0.3, random_state=0)
logreg = LogisticRegression()
logreg.fit(X1_train, y1_train)

# Predicting the test set results and calculating the accuracy
y1_pred = logreg.predict(X1_test)

# Using a seven-fold cross validation to check for overfitting
kfold = model_selection.KFold(n_splits=7)  # no shuffling, so random_state is not needed
modelCV = LogisticRegression()
scoring = 'accuracy'
results = model_selection.cross_val_score(modelCV, X1_train, y1_train, cv=kfold, scoring=scoring)

# Doing a confusion matrix to see our correct and incorrect placements
confusion_matrix_y1 = confusion_matrix(y1_test, y1_pred)

# Printing results
print(rfe.support_)
print(rfe.ranking_)
print(result.summary())
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(X1_test, y1_test)))
print("7-fold cross validation average accuracy: %.3f" % (results.mean()))

print(confusion_matrix_y1)
print(classification_report(y1_test, y1_pred))

# Calculating our AUC/ROC from the predicted probabilities and showing the graph
y1_proba = logreg.predict_proba(X1_test)[:,1]
logit_roc_auc = roc_auc_score(y1_test, y1_proba)
fpr, tpr, thresholds = roc_curve(y1_test, y1_proba)
plt.figure()
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % logit_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
#plt.savefig('Log_ROC')
plt.show()


As we can see, there is a lot of output here. The first part of the output relates to Recursive Feature Elimination (RFE). RFE is based on the idea of repeatedly constructing a model, choosing either the best or worst performing feature, setting that feature aside and then repeating the process with the rest of the features, until the requested number of features remains. The goal of RFE is to select features by recursively considering smaller and smaller sets of features. As we can see, all of our variables come back with "True"; this is expected here, because the number of features requested (18) is larger than the number of predictors supplied (11), so nothing is eliminated.

Next we have some statistical output that relates to our model. We see the log-likelihood and pseudo R-squared values. Beneath these we see a table containing the values of our coefficients in the "coef" column. As we are attempting to predict, as opposed to explain, it was decided to keep all variables in the model.

After this, we predict the test set results and calculate the accuracy, obtaining a value of 0.61. Following this, we utilise cross validation. Cross validation attempts to avoid overfitting while still producing a prediction for each observation in the dataset. We use seven-fold cross validation (`n_splits=7`) on the training set of our Logistic Regression model. The score we receive for this is 0.583, indicating that the average accuracy remains close to the test-set accuracy; hence, we can conclude that our model generalizes reasonably well. Before we move on to the F1 score and AUC, we generate a confusion matrix and observe 64+62 (126) correct predictions and 36+43 (79) incorrect predictions.
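The reported accuracy can be cross-checked against the confusion matrix counts with a few lines of arithmetic. The only assumption in the sketch below is that the test set size is obtained by rounding 30% of the 682 records up to a whole number of rows.

```python
import math

correct = 64 + 62     # diagonal of the confusion matrix
incorrect = 36 + 43   # off-diagonal entries
total = correct + incorrect

accuracy = correct / total
print(f"{correct} / {total} = {accuracy:.2f}")

# Assumption: a 30% test split of 682 records, rounded up to whole rows.
expected_test_rows = math.ceil(0.3 * 682)
print(expected_test_rows == total)
```

The counts sum to 205 test records and give an accuracy of roughly 0.61, in line with the value quoted above.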

Before we look at the F1 score here are some pointers to remember:
* The precision is the ratio tp / (tp + fp) where tp is the number of true positives and fp the number of false positives. The precision is intuitively the ability of the classifier to not label a sample as positive if it is negative.
* The recall is the ratio tp / (tp + fn) where tp is the number of true positives and fn the number of false negatives. The recall is intuitively the ability of the classifier to find all the positive samples.
* The F-beta score can be interpreted as a weighted harmonic mean of the precision and recall, where an F-beta score reaches its best value at 1 and worst score at 0.
* The F-beta score weights the recall more than the precision by a factor of beta. beta = 1.0 means recall and precision are equally important.
For the Y1 target variable we had an F1 score of 0.61.
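To make these definitions concrete, the sketch below computes precision, recall and the F-beta score from raw counts. The counts `tp`, `fp` and `fn` are hypothetical and are not taken from our confusion matrix.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 1.0) -> float:
    """Weighted harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Hypothetical counts, for illustration only.
tp, fp, fn = 60, 40, 20

f1 = f_beta(tp, fp, fn)            # precision and recall weighted equally
f2 = f_beta(tp, fp, fn, beta=2.0)  # recall weighted more heavily
print(f"F1 = {f1:.3f}, F2 = {f2:.3f}")
```

With these counts the precision is 0.60 and the recall is 0.75, so the F2 score, which favours recall, comes out higher than the F1 score.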

Finally we observe the ROC curve and its AUC. The receiver operating characteristic (ROC) curve is another common tool used with binary classifiers, and the area under the curve (AUC) summarises it in a single number. The dotted line represents the ROC curve of a purely random classifier; a good classifier stays as far away from that line as possible (toward the top-left corner).
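One subtlety is that the AUC depends on whether it is computed from predicted probabilities or from hard 0/1 predictions. The sketch below uses a small pairwise (Mann-Whitney style) formulation written in NumPy instead of a library routine; the labels and scores are invented.

```python
import numpy as np

def auc_from_scores(y_true, scores) -> float:
    """Probability that a random positive is scored above a random negative (ties count 0.5)."""
    y_true = np.asarray(y_true)
    scores = np.asarray(scores, dtype=float)
    pos = scores[y_true == 1]
    neg = scores[y_true == 0]
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / (len(pos) * len(neg)))

y = [0, 0, 1, 1, 0, 1]
proba = [0.1, 0.4, 0.35, 0.8, 0.45, 0.7]
labels = [int(p >= 0.5) for p in proba]  # hard predictions at a 0.5 cut-off

auc_proba = auc_from_scores(y, proba)
auc_labels = auc_from_scores(y, labels)
print(round(auc_proba, 3), round(auc_labels, 3))
```

The two numbers differ because hard predictions collapse the ranking to a single threshold, so the value shown in a plot legend should be computed from the same probabilities that were used to draw the curve.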

In the next block of code, we create a model for the *Y2* target variable. However, due to the imbalance in the outcomes of *Y2*, where around 70% of the values are "1", it was decided to reduce the number of records in the data set, specifically by removing a random sample of the records with the value 1. This was done to provide a closer to 50/50 split in the data so that the model would not become overly biased towards predicting "1" for our test set.
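A minimal sketch of this kind of undersampling is shown below on a toy frame. Unlike the fixed sample size used in our code, it derives the number of rows to drop from the class counts and sets a `random_state` so that the subset is reproducible; both choices are assumptions made for the example.

```python
import pandas as pd

# Toy data with a 70/30 split between the two outcomes.
toy = pd.DataFrame({"Y2": [1] * 70 + [0] * 30, "x": range(100)})

n_majority = int((toy["Y2"] == 1).sum())
n_minority = int((toy["Y2"] == 0).sum())

# Randomly choose the surplus majority rows and drop them.
to_drop = toy[toy["Y2"] == 1].sample(n=n_majority - n_minority, random_state=0)
balanced = toy.drop(to_drop.index)

print(balanced["Y2"].value_counts())
```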

In [12]:
# Creating a copy of our data frame so that changes made here do not affect the original data
ipo_data_y2=ipo_data.copy()

# Subsetting the data in order to create a balanced data set between Y2 = 1 and Y2 = 0
df_subset = ipo_data_y2.loc[ipo_data_y2['Y2'] == 1].sample(300)
ipo_data_y2 = ipo_data_y2.drop(df_subset.index)

# for loop to deal with outliers in float variables
for i in range(len(ipo_data_y2.columns)): 
    if ipo_data_y2.iloc[:,i].dtype == float:
        outlier(ipo_data_y2.iloc[:,i],4) 
        
# for loop to standardise  float variables        
for i in range(len(ipo_data_y2.columns)): 
    if ipo_data_y2.iloc[:,i].dtype == float:  
        ipo_data_y2.iloc[:,i]=standard(ipo_data_y2.iloc[:,i],1)
        
# Our Logistic Regression model with results for y2      
logreg = LogisticRegression()
# These are our predictor variables
train=['C3x','C4','C1','C7','C6x','T3x','T4x','T5x','S1x','S2x','S3x','C2']



# Setting our predictors
X=ipo_data_y2[train]

# Setting our target
y=ipo_data_y2['Y2']

# Using recursive feature elimination to aid in the selection of predictors
# Note: 18 exceeds the 12 predictors listed above, so RFE keeps all of them
rfe = RFE(logreg, n_features_to_select=18)
rfe = rfe.fit(X,y)

# Implementing the model
logit_model=sm.Logit(ipo_data_y2['Y2'],ipo_data_y2[train])
result=logit_model.fit()

# Dividing our data set into testing and training, with our test set being of size 0.3. Then fitting the model
X2_train, X2_test, y2_train, y2_test = train_test_split(X, y, test_size=0.3, random_state=0)
logreg = LogisticRegression()
logreg.fit(X2_train, y2_train)

# Predicting the test set results and calculating the accuracy
y2_pred = logreg.predict(X2_test)

# Using seven-fold cross validation to check for overfitting
# random_state has no effect without shuffle=True (and newer scikit-learn rejects it), so it is omitted
kfold = model_selection.KFold(n_splits=7)
modelCV = LogisticRegression()
scoring = 'accuracy'
results = model_selection.cross_val_score(modelCV, X2_train, y2_train, cv=kfold, scoring=scoring)

# Doing a confusion matrix to see our correct and incorrect placements
confusion_matrix_y2 = confusion_matrix(y2_test, y2_pred)

# Printing results
print(rfe.support_)
print(rfe.ranking_)
print(result.summary())
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(X2_test, y2_test)))
print("7-fold cross validation average accuracy: %.3f" % (results.mean()))

print(confusion_matrix_y2)
print(classification_report(y2_test, y2_pred))

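# Calculating our AUC/ROC and showing the graph
# AUC is computed from predicted probabilities so that it matches the curve plotted below
logit_roc_auc = roc_auc_score(y2_test, logreg.predict_proba(X2_test)[:,1])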
fpr, tpr, thresholds = roc_curve(y2_test, logreg.predict_proba(X2_test)[:,1])
plt.figure()
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.2f)' % logit_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
#plt.savefig('Log_ROC')
plt.show()

NameError: name 'ipo_data' is not defined

Notice that RFE retains the four variables we use. These four variables are again all statistically significant, and they were selected using the same approach as in the previous model.

After this, we predict the test set results and calculate the accuracy, obtaining a value of 0.61. We then apply cross validation and receive a score of 0.637, indicating that the average accuracy remains close to the accuracy of the logistic regression model on the test set; this suggests that our model generalises reasonably well. Before we move on to the F1 score and AUC, we generate a confusion matrix and observe 41+29 (70) correct predictions and 19+26 (45) incorrect predictions.
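
The accuracy figure can be checked against the confusion matrix counts reported above. In the sketch below, only the four counts come from our results; how the two incorrect counts are arranged off the diagonal is an assumption, and it does not affect the accuracy.

```python
# Diagonal cells hold the correct predictions (41 and 29).
# The placement of 19 and 26 off the diagonal is assumed for illustration.
confusion = [[41, 19],
             [26, 29]]

correct = confusion[0][0] + confusion[1][1]
total = sum(sum(row) for row in confusion)
accuracy = correct / total

print(f"{correct} correct out of {total}: accuracy = {accuracy:.2f}")
# prints "70 correct out of 115: accuracy = 0.61"
```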

For the Y2 target variable we had an F1 score of 0.61.

Finally, we observe the ROC curve and AUC.

### Conclusion and Next Steps
As we can see, we have accomplished a decent chunk of work in the first few weeks of this project. We have dealt with missing data, both the official NaN values and more hidden ones such as negative values in the number-of-words variables. This was done by replacing the missing values with the mean (except in the case of the dummy variable *C2*, as discussed earlier). After this we began our data preparation for modelling.

In this section we carried out a number of tasks. We began by attempting to normalise our continuous variables using the **log** and **square root** transformations, with an acceptable degree of success for our base model. We presented the four methods for dealing with outliers discussed in the Outliers section, as well as the three methods of standardising the data, such as the min-max method. We began recoding the *I3* variable and successfully recoded it to its sector level. We also initiated our correlation analysis between continuous variables.

Finally, we generated models for *Y1* and *Y2* respectively. We discussed the results and briefly introduced our accuracy measures along with their respective scores.

As next steps for this project, we intend to examine additional methods of imputing missing values, such as replacing them with the median and using linear regression. For the small number of variables that could not be reasonably normalised through simple transformations, we will attempt binning in the hope of improving the model, since heavily skewed predictors and extreme values can distort a logistic regression fit.
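
A minimal sketch of the median imputation mentioned above, using made-up values rather than a column from our data set. It uses only the standard library so that it runs on its own; in the notebook the same idea would be applied to the pandas columns.

```python
import math
from statistics import mean, median

# Made-up column with two missing entries and one extreme value (30.0)
values = [12.0, float("nan"), 7.5, 30.0, float("nan"), 9.0]

# Compute the fill value from the observed (non-missing) entries only
observed = [v for v in values if not math.isnan(v)]
fill = median(observed)

# Replace each missing entry with the median
imputed = [fill if math.isnan(v) else v for v in values]

print("median:", fill, "mean:", mean(observed))
print(imputed)
```

The median is less affected by the extreme value than the mean, which is one reason to compare it with our current mean imputation.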

Next, we intend to create dummy variables for the sector-level categorical variable describing each company and include them in our model analysis. We will continue to expand our correlation analysis in an effort to find other significant relationships between predictor and target variables by comparing continuous with categorical variables and categorical with categorical variables.
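
The dummy-variable step could look something like the sketch below, which uses `pd.get_dummies`. The column name `sector` and the sector labels are hypothetical stand-ins for our recoded *I3* variable, not values from the data set.

```python
import pandas as pd

# Hypothetical sector labels standing in for the recoded I3 variable
df = pd.DataFrame({"sector": ["Services", "Manufacturing", "Retail", "Services"]})

# One indicator column per sector, named sector_<label>
dummies = pd.get_dummies(df["sector"], prefix="sector")

# Attach the indicator columns and drop the original text column
df_model = pd.concat([df.drop(columns="sector"), dummies], axis=1)

print(list(df_model.columns))
```

If the model includes an intercept, passing `drop_first=True` to `pd.get_dummies` drops one sector as the reference level and avoids perfectly collinear columns.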

Finally, we will use these improvements and new variables to attempt to increase the accuracy of our logistic regression model. Ideally we will obtain as strong a score as possible using logistic regression, but we will also explore neural networks and decision trees in an effort to create the most accurate model.

### Workload Breakdown
In terms of teamwork and fair sharing of the load, the team is working quite well. We meet at least once a week to give progress reports, along with regular updates via email, calls and texts. We critique each other's work and help each other when one or more of us is struggling.

Please see below a brief breakdown of the work carried out by our team thus far, with guidelines for future steps:
#### Completed Tasks
* Tim: Began code for normalisation, coded outlier methods 3 and 4, coded z-score standardising, read in the data and completed initial analysis, recoded the days variable
* Danielle: Coded outlier methods 1 and 2, coded decimal scaling standardising, manually imputed the I3 column in the raw data, and created code for target and control variables
* Conor: Tidied outlier code into a function, coded min-max standardisation, coded initial correlation analysis, and coded the logistic regression model

This report was written as a team. Danielle wrote the first draft, with particular emphasis on her tasks. Tim redrafted Danielle's report and added information on his tasks. Conor completed the final draft and added information on his completed tasks.

#### Future Tasks
* Tim: Creating dummy variables for Days and I3, and binning the variables discussed in normalisation. Taking over the logistic regression model to improve it using these results
* Danielle: Completing the correlation analysis, aiding Tim with the logistic regression model as needed, researching and implementing another model generation method (Naive Bayes / neural network)
* Conor: Handing over the logistic regression code, tidying the standardisation program, creating code for a decision tree and potentially one other method

