# Competition 1 - Write Up #

#### Research Question & Goal ####

What are the determinants of the IPO underpricing phenomenon? Our goal as a group is to identify and understand the underlying factors that contribute to IPO underpricing.

### Business Understanding ###

According to Investopedia.com, underpricing is the listing of an initial public offering (IPO) below its market value. When the offer price of the stock is lower than the price of the first trade, the stock is considered to be underpriced. This lasts only a short time, as demand for the stock drives the price back up to its market value.

From the company's standpoint, the goal is to set the initial public offering price as high as possible, which raises the most capital. The quantitative factors that go into an IPO price come from financial analysis of the company itself. Before the IPO, the company is analyzed on its sales, expenses, earnings, and cash flow. Of these, the company's earnings and expected earnings growth are the biggest factors. Marketability within a specific industry and in the general market can also drive an IPO price up or down.

Once the investment bankers or IPO underwriters determine the IPO price of the company's stock, the day before the stock is offered publicly, the company markets the IPO to potential investors. IPOs are historically viewed as risky investments because of the lack of historical data on them. The less liquid and less predictable the IPO shares are, the more likely they are to be underpriced to compensate for the assumed risk. Companies also underprice their IPOs to entice more investors to buy the stock and so raise more capital.

With all of this information about initial public offerings, are there a few determinants that can explain why the phenomenon of underpricing exists? The dataset we have been provided contains information about companies and their IPOs, grouped as IPO Offering, IPO Characteristics, Textual Characteristics, Sentiment Characteristics, Target Variables, Control Variables, and IPO Identifiers.

The variables that have been provided are listed below:

 - P(IPO) - Offer Price
 - P(H) - Price Range Higher Bound
 - P(L) - Price Range Lower Bound
 - P(1Day) - First Day Trading Price
 - C1 - Days
 - C2 - Top-Tier Dummy
 - C3 - Earnings per Share
 - C4 - Prior NASDAQ 15-Day Returns
 - C5 - Outstanding Shares
 - C6 - Offering Shares
 - C7 - Sales
 - T1 - Number of Sentences
 - T2 - Number of Words
 - T3 - Number of Real Words
 - T4 - Number of Long Sentences
 - T5 - Number of Long Words
 - S1 - Number of Positive Words
 - S2 - Number of Negative Words
 - S3 - Number of Uncertain Words
 - Y1 - Pre-IPO Price Revision
 - Y2 - Post-IPO Initial Return
 - C3' - Positive EPS Dummy
 - C5' - Share Overhang
 - C6' - Up Revision
 - I1 - Ticker
 - I2 - Company Name
 - I3 - Standard Industry Classifier

## Data Understanding | Exploration ##

To handle our data, we first imported it into a dataframe and tried to examine it through the JupyterLab interface. We realized that it would be easier to analyze the data in Excel, so we did our data exploration there.

After the initial data exploration, we looked for missing values and for ways to impute them. The I3 column was the first column with missing data. We decided to impute its missing values manually, as they were readily available online through government websites.

For the columns that needed numerical imputation, we decided to impute using the median, as some of the columns were heavily skewed.
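As a minimal sketch of median imputation in pandas: the toy values below are made up for illustration, and only the column names `C3` and `C7` come from the variable list above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are invented for illustration.
df = pd.DataFrame({
    "C3": [0.5, np.nan, -1.2, 0.8],
    "C7": [10.0, 25.0, np.nan, 4000.0],
})

# Fill each numeric column's missing values with that column's median.
# The median is less affected by a long tail (such as the 4000.0 in C7)
# than the mean would be.
numeric_cols = df.select_dtypes(include="number").columns
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())
```

With the mean, the missing `C7` value would have been pulled far upward by the single large observation; the median keeps the imputed value near the bulk of the data.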

After that, we renamed every column, as some of the original names included parentheses that were awkward to use later in the notebooks. Then we exported the dataframe to a csv file for use in both of our pipelines.
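The renaming step might look roughly like the sketch below. The new names (`P_IPO`, `P_1Day`, `C3_prime`) are hypothetical and are not necessarily the names used in `ReadyDF.csv`.

```python
import pandas as pd

# Toy frame with a few of the original column names; values are made up.
df = pd.DataFrame({"P(IPO)": [10.0], "P(1Day)": [11.5], "C3'": [1]})

# Hypothetical mapping from the original names to identifier-friendly ones.
rename_map = {"P(IPO)": "P_IPO", "P(1Day)": "P_1Day", "C3'": "C3_prime"}
df = df.rename(columns=rename_map)

# Passing a file path to to_csv would write the file to disk;
# without a path it returns the csv text, which is enough for this sketch.
csv_text = df.to_csv(index=False)
```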

Here is the final csv file from our Data Understanding | Exploration: [Data Exploration|Understanding](./ReadyDF.csv)

## Handling Skewness - Both Pipelines ##

For our 'MinMax_Pipeline', we fixed the skew at the **END** of the pipeline, whereas for the 'Z-Score Pipeline' we fixed the skew at the **BEGINNING**.

For the 'MinMax_Pipeline' we only had to fix the skew of 1 column (C7), as Normalizing and Standardizing the data was very helpful in fixing skew. [Skew for MinMax_Pipeline](./Min_Max_Pipeline/Skew_MinMax.ipynb)

For the 'Z-Score Pipeline', we had to experiment to find the best skew fix for each column based on how skewed it was. The most heavily skewed columns were tricky, as we could not get the skew of some of them any lower than what is shown in the notebook. [Skew for Z-Score Pipeline](./Z-Score_Pipeline/Skew_Z.ipynb)
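To illustrate the kind of skew fix described here, the sketch below applies a log transform to a synthetic right-skewed column and compares the skewness before and after. The log transform and the synthetic data are assumptions for illustration; the notebooks may use different transforms for different columns.

```python
import numpy as np
import pandas as pd

# Synthetic, strongly right-skewed data standing in for a column like C7.
rng = np.random.default_rng(0)
sales = pd.Series(rng.lognormal(mean=3.0, sigma=1.0, size=500), name="C7")

skew_before = sales.skew()

# log1p computes log(1 + x), which is defined at zero and compresses
# the long right tail.
sales_log = np.log1p(sales)
skew_after = sales_log.skew()

print(f"skew before: {skew_before:.2f}, after log1p: {skew_after:.2f}")
```

Milder transforms such as a square root can be tried the same way for columns that are only moderately skewed.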

## Outliers - Both Pipelines ##

In the MinMax Pipeline, we had to deal with the outliers first, before doing anything else. In the Z-Score Pipeline, we dealt with the outliers **SECOND**, after handling the skew.

For the MinMax_Pipeline, we used the IQR (interquartile range) to bring the outliers back towards the mean of the column. We defined a function and applied it to each column with outliers. [Outliers for MinMax Pipeline](./Min_Max_Pipeline/Outliers_MinMax.ipynb)
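One common way to write such a function is to clip values to the IQR fences, as sketched below. The function name `clip_outliers_iqr`, the 1.5 multiplier, and the choice to clip (rather than replace or drop) are assumptions; the notebook's function may differ.

```python
import pandas as pd

def clip_outliers_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Clip values outside [Q1 - k*IQR, Q3 + k*IQR] to those fences."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)

# Made-up values with one obvious outlier.
values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 100.0])
clipped = clip_outliers_iqr(values)
```

Here Q1 is 2.25 and Q3 is 4.75, so the upper fence is 8.5 and the value 100.0 is pulled down to it while the other values are unchanged.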

For the Z-Score Pipeline, we used the standard deviation to handle the outliers. One outlier still appeared on one of our box-and-whisker plots, but we believe that it is a visual error in the plot. [Outliers for Z-Score Pipeline](./Z-Score_Pipeline/Outliers_Z.ipynb)
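A standard-deviation rule can be sketched in the same style. The three-standard-deviation cutoff and the function name `clip_outliers_std` are assumptions, since the write-up does not state the threshold used.

```python
import pandas as pd

def clip_outliers_std(series: pd.Series, n_std: float = 3.0) -> pd.Series:
    """Clip values more than n_std standard deviations from the mean."""
    mean = series.mean()
    std = series.std()
    return series.clip(lower=mean - n_std * std, upper=mean + n_std * std)

# Made-up values: twenty ordinary observations and one extreme one.
values = pd.Series([10.0] * 20 + [1000.0])
clipped = clip_outliers_std(values)
```

Note that box plots typically draw their whiskers at 1.5 × IQR (the matplotlib default), which is a different rule. A value that sits inside the standard-deviation band can therefore still be drawn as an individual point, which may be what the remaining point in the plot reflects.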



## Normalization - Both Pipelines ##

We normalized the MinMax Pipeline second, after fixing our outliers, and we normalized our Z-Score Pipeline last, after the other two steps.

We used MinMax for normalizing the data in the MinMax Pipeline. [Normalization MinMax Pipeline](./Min_Max_Pipeline/Normalization(MinMax).ipynb)

In the Z-Score Pipeline, we used the z-score for each column to normalize the data. [Normalization for Z-Score Pipeline](./Z-Score_Pipeline/Normalization_Z.ipynb)
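The two scaling methods can be written directly in pandas, as in the sketch below. The toy values are made up, and the notebooks may use scikit-learn scalers instead of these hand-written formulas.

```python
import pandas as pd

# Made-up values for two of the control variables.
df = pd.DataFrame({"C1": [10.0, 20.0, 30.0, 40.0], "C7": [1.0, 2.0, 3.0, 6.0]})

# Min-max scaling: map each column onto the range [0, 1].
minmax = (df - df.min()) / (df.max() - df.min())

# Z-score scaling: subtract the column mean and divide by the
# (population) standard deviation, giving mean 0 and std 1.
zscore = (df - df.mean()) / df.std(ddof=0)
```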

## Correlation Analysis - Both Pipelines ##

For our correlation analysis, we had to concatenate the dataframes for each pipeline into one dataframe in order for the correlation matrix to show up in our notebooks.

The conclusion that we came to is that we cannot use column `C6'` when we run our models for `Y1`, as the two are highly correlated.
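The concatenate-then-correlate step can be sketched as follows. The toy values, the column name `C6_prime`, and the 0.8 threshold are all assumptions for illustration; the data is constructed so that `C6_prime` is perfectly (negatively) correlated with `Y1`.

```python
import pandas as pd

# Made-up feature and target frames kept separately, as in the pipelines.
features = pd.DataFrame({
    "C6_prime": [1, 0, 1, 1, 0, 0],
    "C2": [1, 1, 0, 0, 1, 0],
})
targets = pd.DataFrame({"Y1": [0, 1, 0, 0, 1, 1]})

# Concatenate column-wise so one correlation matrix covers everything.
combined = pd.concat([features, targets], axis=1)
corr = combined.corr()

# Flag features whose absolute correlation with Y1 exceeds a threshold.
threshold = 0.8  # assumed cutoff, not taken from the notebooks
corr_with_y1 = corr["Y1"].drop("Y1").abs()
flagged = corr_with_y1[corr_with_y1 > threshold].index.tolist()
```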

MinMax Pipeline Correlation: [Click Here](./Min_Max_Pipeline/Correlation_Analysis_MinMax.ipynb)

Z-Score Pipeline Correlation: [Click Here](./Z-Score_Pipeline/Correlation_Analysis.ipynb)

## RFE - Both Pipelines ##

For RFE in both pipelines, we could not use column `C6'` for `Y1`, as it was highly correlated with `Y1`. We did use it for `Y2`, though.

The models that we ran are at the bottom of each respective notebook. RFE gave us a general indication of which variables to start with when running our models for the final step.
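For readers unfamiliar with RFE (recursive feature elimination), here is a minimal scikit-learn sketch on synthetic data. The logistic regression estimator and the choice of three features are assumptions; the write-up does not say which estimator or feature count the notebooks use.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data standing in for the IPO features.
X, y = make_classification(
    n_samples=200, n_features=8, n_informative=3, random_state=0
)

# RFE repeatedly fits the estimator and drops the weakest feature
# until only n_features_to_select remain.
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=3)
selector.fit(X, y)

# support_ is a boolean mask over the original feature columns.
selected = [i for i, keep in enumerate(selector.support_) if keep]
print("selected feature indices:", selected)
```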

RFE MinMax Pipeline: [Click Here](./Min_Max_Pipeline/RFE_MinMax.ipynb)

RFE Z-Score Pipeline: [Click Here](./Z-Score_Pipeline/RFE_Z-Score.ipynb)

## Evaluation Code - Both Pipelines ##

Finally, our evaluation was based on F1 and AUC scores, using a notebook that Dr. Tao provided.

For each notebook, we started with all of the variables when running the evaluation code, then removed one variable at a time to see whether the F1 and AUC scores improved. If they improved, we left the feature out permanently. If removing the feature made the scores worse, we added it back and moved on to the next feature. This was our way of finding the best possible F1 and AUC scores from our data.
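The manual procedure above amounts to a greedy backward elimination, which could be automated roughly as sketched below. This is not Dr. Tao's evaluation code: the synthetic data, the logistic regression model, the 5-fold cross-validation, the helper name `score`, and the rule "drop a feature only if both scores strictly improve" are all assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

def score(X, y, cols, seed=0):
    """Mean cross-validated F1 and AUC using only the given columns."""
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    result = cross_validate(
        LogisticRegression(max_iter=1000),
        X[:, cols],
        y,
        cv=cv,
        scoring=("f1", "roc_auc"),
    )
    return result["test_f1"].mean(), result["test_roc_auc"].mean()

# Synthetic data standing in for the prepared IPO features.
X, y = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=0
)

kept = list(range(X.shape[1]))
initial_f1, initial_auc = score(X, y, kept)
best_f1, best_auc = initial_f1, initial_auc

# Try removing each feature once, in order.
for col in list(kept):
    if len(kept) == 1:
        break  # always keep at least one feature
    trial = [c for c in kept if c != col]
    f1, auc = score(X, y, trial)
    # Assumed rule: drop the feature only if both scores improve;
    # otherwise it stays in (the "add it back" step).
    if f1 > best_f1 and auc > best_auc:
        kept, best_f1, best_auc = trial, f1, auc

print("kept features:", kept, "F1:", round(best_f1, 3), "AUC:", round(best_auc, 3))
```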

There is one caveat: the F1 and AUC scores changed each time we ran the evaluation code, which was a bit worrisome, as our best model lost some points when we reran it a few times.
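Run-to-run variation like this often comes from an unseeded random train/test split or model initialization. We have not confirmed that this is the cause in the provided notebook, so the sketch below only shows, on synthetic data with an assumed logistic regression model, how fixing `random_state` makes the scores repeatable.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

def evaluate(X, y, seed):
    """Split, fit, and score with a fixed seed controlling the split."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, stratify=y, random_state=seed
    )
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    f1 = f1_score(y_test, model.predict(X_test))
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    return f1, auc

# Synthetic data standing in for the prepared IPO features.
X, y = make_classification(n_samples=300, n_features=6, random_state=0)

# The same seed gives the same split and therefore the same scores.
first = evaluate(X, y, seed=42)
second = evaluate(X, y, seed=42)
```

Averaging the scores over several different seeds is another option when a single split is too noisy to compare feature sets reliably.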

Evaluation Code for MinMax Pipeline: [Click Here](./Min_Max_Pipeline/Evaluation-Code-Good-MinMax.ipynb)

Evaluation Code for Z-Score Pipeline: [Click Here](./Z-Score_Pipeline/Evaluation-Code-Good-Z-Score.ipynb)