<a href="https://colab.research.google.com/github/AryaJ3365/Investment-Prediction-Application/blob/main/Investment_Prediction_Application_2_0.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#**Investment Prediction App featuring LGBM Regression**

##**Load Data**



###Import the required Python packages, such as the LGBM regressor, pandas, and seaborn

In [24]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.model_selection import GridSearchCV
from lightgbm import LGBMRegressor
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
import joblib

Here we read in the CSV file, which I compiled through online research. It contains examples of past successful and unsuccessful investments and is used to train the machine learning model. Note: In the future I hope to build the data set entirely from startup investment data gathered through WiProsper.

In [25]:
df = pd.read_csv("https://raw.githubusercontent.com/AryaJ3365/Datasets/main/regression.csv")

##**Data Preparation**

###Analysis of the data

First we look at the head of the data set, which shows the first five rows.

Here is the key for the data set:

1.   Investment Name: The name of where the investment came from.
2.   Investment: Total dollars invested (USD)
3.   Time: Short-term = 1, Long-term = 2
4.   Skill: Low = 1, Medium = 2, High = 3
5.   Impact: Low = 1, Medium = 2, High = 3
6.   Successful: The chance of the investment being successful, expressed as an integer percentage.



In [26]:
df.head()

Unnamed: 0,Investment Name,Investment,Time,Skill,Impact,Successful
0,Stock A,10000,2,3,3,70
1,Stock B,15000,1,2,3,60
2,Mutual Fund C,20000,2,1,1,80
3,ETF D,25000,1,3,1,85
4,Real Estate E,5000,2,2,3,75


Next we use the isna() method to verify that there are no null (missing) values that could distort the results. As the output below shows, there are no null values in any of the columns.

In [27]:
df.isna().sum()

Investment Name    0
Investment         0
Time               0
Skill              0
Impact             0
Successful         0
dtype: int64

Separate the data into the features X and the target y.

In [28]:
X, y = df.drop('Successful', axis = 1), df['Successful']

##**Data Splitting**

###Column Transformer

The goal of the column transformer is to convert non-quantitative columns, such as Investment Name, into numeric form so that the machine learning algorithm can base its predictions on quantitative data. Here the categorical columns are one-hot encoded, while the remaining numeric columns are passed through unchanged.

In [29]:
cat_cols = X.dtypes[X.dtypes == 'O'].index.tolist()
cat_cols

['Investment Name']

In [30]:
ct = ColumnTransformer([
#     ('num', StandardScaler(), ['Successful', 'Success Rate']),
    # sparse_output replaces the older sparse argument; on scikit-learn < 1.2 use sparse=False
    ('ohe', OneHotEncoder(sparse_output=False, handle_unknown='ignore'), cat_cols)
], remainder='passthrough')

In [31]:
ct.fit_transform(X).shape



(70, 74)
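The width of 74 is consistent with the encoding: one column per distinct Investment Name (70) plus the four numeric columns passed through. The sketch below mimics that behaviour in plain Python on three rows from the table above. `fit_categories` and `one_hot_rows` are hypothetical helpers written for illustration, not scikit-learn's implementation. It also shows what `handle_unknown='ignore'` implies: a name that was not seen during fitting is encoded as all zeros.

```python
def fit_categories(names):
    """Return the sorted distinct categories, as a one-hot encoder would learn them."""
    return sorted(set(names))

def one_hot_rows(rows, categories):
    """Encode (name, *numeric) rows: one-hot columns first, numeric values passed through."""
    encoded = []
    for name, *numeric in rows:
        # An unseen name matches no category, so its one-hot part stays all zeros.
        flags = [1.0 if name == cat else 0.0 for cat in categories]
        encoded.append(flags + [float(v) for v in numeric])
    return encoded

# Rows in the notebook's column order: name, Investment, Time, Skill, Impact.
train = [
    ("Stock A", 10000, 2, 3, 3),
    ("Stock B", 15000, 1, 2, 3),
    ("ETF D", 25000, 1, 3, 1),
]
categories = fit_categories(row[0] for row in train)
matrix = one_hot_rows(train, categories)
print(len(matrix), len(matrix[0]))  # 3 rows; 3 one-hot + 4 numeric columns

# A name outside the fitted categories (hypothetical new input).
unseen = one_hot_rows([("Tesla Stock", 20000, 2, 3, 3)], categories)[0]
print(unseen[:len(categories)])  # one-hot part of the unknown name
```

One consequence of this design is that a new investment name contributes no name-specific signal, so its prediction rests on the numeric features alone.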

###LightGBM Regressor Implementation

To combine the column transformer and the machine learning model we use a pipeline. The pipeline first applies the column transformer and then fits the LGBM regressor on the transformed data.

Unlike many other tree-based algorithms, which grow their trees level by level (horizontally), the LGBM regressor grows its trees leaf by leaf (vertically). At each step it splits the leaf that gives the largest reduction in loss, instead of splitting every leaf at the current depth. Concentrating on the most useful leaves lets the algorithm learn faster, which makes it more efficient.
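The difference between the two growth strategies can be sketched with a toy example. The candidate splits and their gains below are made up for illustration, and `leaf_wise`, `level_wise`, and `total_gain` are hypothetical helpers. LightGBM itself computes gains from gradient statistics, so this is only a sketch of the selection order, not of the library's internals.

```python
import heapq
from collections import deque

# Hypothetical candidate splits: node -> (loss reduction if split, child nodes).
# Deeper candidates are omitted for brevity.
CANDIDATES = {
    "root": (10.0, ["L", "R"]),
    "L": (8.0, ["LL", "LR"]),
    "R": (1.0, ["RL", "RR"]),
    "LL": (6.0, []),
    "LR": (0.5, []),
    "RL": (0.2, []),
    "RR": (0.1, []),
}

def leaf_wise(budget):
    """Split the current leaf with the largest gain, up to `budget` splits."""
    heap = [(-CANDIDATES["root"][0], "root")]  # max-heap via negated gain
    order = []
    while heap and len(order) < budget:
        _, node = heapq.heappop(heap)
        order.append(node)
        for child in CANDIDATES[node][1]:
            heapq.heappush(heap, (-CANDIDATES[child][0], child))
    return order

def level_wise(budget):
    """Split nodes in breadth-first order, up to `budget` splits."""
    queue = deque(["root"])
    order = []
    while queue and len(order) < budget:
        node = queue.popleft()
        order.append(node)
        queue.extend(CANDIDATES[node][1])
    return order

def total_gain(order):
    return sum(CANDIDATES[node][0] for node in order)

print(leaf_wise(3), total_gain(leaf_wise(3)))
print(level_wise(3), total_gain(level_wise(3)))
```

With the same budget of three splits, the leaf-wise order follows the high-gain branch and accumulates more total gain than the level-wise order on this toy data.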

In [32]:
pipe = Pipeline([
    ('trf', ct),
    ('model', LGBMRegressor(random_state=0))
])

###Gradient Boosting

Next, to make our prediction model more accurate, we compare the different boosting types that LightGBM offers:

1. DART (Dropouts meet Multiple Additive Regression Trees) Model: Randomly drops some of the existing trees at each boosting iteration, which helps reduce overfitting.
2. GBDT (Gradient Boosted Decision Tree) Model: Builds decision trees one after another, each correcting the errors of the trees before it, and sums their outputs to predict the value.
3. GOSS (Gradient-based One-Side Sampling) Model: Keeps the training examples with the largest gradients and randomly samples from the rest, so each tree is trained on a subset of the data chosen by gradient magnitude.
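The sampling step behind GOSS can be sketched in plain Python. `goss_sample` is a hypothetical helper, the gradient values are made up, and the argument names `top_rate` and `other_rate` are borrowed from LightGBM's parameters of the same name. This is a simplified illustration under those assumptions, not the library's implementation.

```python
import random

def goss_sample(gradients, top_rate=0.2, other_rate=0.1, seed=0):
    """Return (index, weight) pairs chosen by a GOSS-style sampling step."""
    n = len(gradients)
    n_top = int(n * top_rate)
    n_other = int(n * other_rate)
    # Rank instances by absolute gradient, largest first.
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    top, rest = order[:n_top], order[n_top:]
    # Randomly keep a fraction of the small-gradient instances.
    sampled = random.Random(seed).sample(rest, n_other)
    # Up-weight the sampled instances to compensate for the ones dropped.
    amplify = (1.0 - top_rate) / other_rate
    return [(i, 1.0) for i in top] + [(i, amplify) for i in sampled]

# Hypothetical per-example gradients for ten training rows.
gradients = [0.1, -2.5, 0.3, 1.8, -0.05, 0.2, -0.4, 0.15, 3.0, -0.25]
selected = goss_sample(gradients)
print(selected)
```

The two rows with the largest absolute gradients are always kept with weight 1, while the randomly sampled small-gradient row receives a larger weight so that the overall gradient statistics stay roughly representative.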

In [33]:
params = {
    'model__n_estimators':[100,130,150,170,190],
    'model__boosting_type': ['dart', 'gbdt', 'goss']
}
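To see how much work this grid implies, the combinations can be enumerated directly. The sketch below uses only the standard library; the fold count of 5 is an assumption taken from the five split columns in the results table further down.

```python
from itertools import product

params = {
    'model__n_estimators': [100, 130, 150, 170, 190],
    'model__boosting_type': ['dart', 'gbdt', 'goss'],
}

# Every combination of the listed values is one candidate configuration.
candidates = [dict(zip(params, values)) for values in product(*params.values())]

n_folds = 5  # assumed from the split0..split4 columns in the results
print(len(candidates), "candidates,", len(candidates) * n_folds, "cross-validation fits")
```

Adding another parameter list multiplies the number of candidates, so the grid is best kept small on a data set of this size.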

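The grid search trains every combination of boosting type and number of estimators with 5-fold cross-validation and records each combination's scores in a results grid, so we can determine which configuration scored best. The scoring metric is the negative root mean squared error, so scores closer to 0 are better.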

In [34]:
gs = GridSearchCV(pipe, param_grid=params, scoring='neg_root_mean_squared_error', n_jobs=-1)
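The `neg_root_mean_squared_error` scorer reports the RMSE with its sign flipped, so that a larger score always means a better model. A minimal sketch of that calculation follows; `rmse` is a hypothetical helper and the predictions are made up for illustration.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

y_true = [70, 60, 80]        # first three 'Successful' values from the data set head
y_pred = [73.0, 56.0, 80.0]  # hypothetical predictions

score = -rmse(y_true, y_pred)  # negated so that closer to 0 is better
print(round(score, 4))
```

Because 'Successful' is a percentage, the RMSE can be read directly as a typical error in percentage points.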

We then fit the grid search on the data. Within each pipeline, the one-hot encoder first transforms our categorical data into numerical features, and the model is then trained on the result.

In [35]:
gs.fit(X, y)



###Grid Search

Here are the results of the grid search, sorted by rank, so we can compare all three boosting types. Because the score is the negative RMSE, the best configuration is the one whose mean_test_score is closest to 0. Where several configurations tie on that score, as the five GOSS rows do here, mean_fit_time can serve as a tie-breaker.

In [36]:
pd.DataFrame(gs.cv_results_).sort_values(by='rank_test_score')

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_model__boosting_type,param_model__n_estimators,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
10,0.014585,0.000443,0.004604,0.000163,goss,100,"{'model__boosting_type': 'goss', 'model__n_est...",-13.068328,-18.825505,-17.060896,-19.538983,-14.823776,-16.663498,2.425301,1
11,0.014408,0.002458,0.004832,0.000407,goss,130,"{'model__boosting_type': 'goss', 'model__n_est...",-13.068328,-18.825505,-17.060896,-19.538983,-14.823776,-16.663498,2.425301,1
12,0.015201,0.003159,0.005176,0.000618,goss,150,"{'model__boosting_type': 'goss', 'model__n_est...",-13.068328,-18.825505,-17.060896,-19.538983,-14.823776,-16.663498,2.425301,1
13,0.01917,0.00699,0.006276,0.001416,goss,170,"{'model__boosting_type': 'goss', 'model__n_est...",-13.068328,-18.825505,-17.060896,-19.538983,-14.823776,-16.663498,2.425301,1
14,0.016631,0.003994,0.00455,0.000721,goss,190,"{'model__boosting_type': 'goss', 'model__n_est...",-13.068328,-18.825505,-17.060896,-19.538983,-14.823776,-16.663498,2.425301,1
5,0.022165,0.00815,0.004591,0.000305,gbdt,100,"{'model__boosting_type': 'gbdt', 'model__n_est...",-14.54261,-20.272219,-17.402771,-18.7981,-15.169429,-17.237026,2.154607,6
4,0.034582,0.004006,0.00573,0.001346,dart,190,"{'model__boosting_type': 'dart', 'model__n_est...",-14.538462,-19.479327,-18.262659,-18.981625,-15.295704,-17.311555,2.00732,7
6,0.022437,0.002724,0.004831,0.000106,gbdt,130,"{'model__boosting_type': 'gbdt', 'model__n_est...",-14.981748,-20.322212,-17.474634,-18.722705,-15.310228,-17.362305,2.024965,8
7,0.021553,0.000677,0.00543,0.000593,gbdt,150,"{'model__boosting_type': 'gbdt', 'model__n_est...",-15.243332,-20.340535,-17.521777,-18.677814,-15.386258,-17.433943,1.949102,9
8,0.025761,0.005295,0.005284,0.000564,gbdt,170,"{'model__boosting_type': 'gbdt', 'model__n_est...",-15.471337,-20.362157,-17.551274,-18.641471,-15.435287,-17.492305,1.890806,10


Here we output which model type was the best and the number of estimators used with that model.

In [37]:
gs.best_params_

{'model__boosting_type': 'goss', 'model__n_estimators': 100}

Here is the best mean cross-validated score (negative RMSE) across all configurations.

In [38]:
gs.best_score_

-16.663497768870034
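The best score can be cross-checked against the per-split scores of the top-ranked GOSS row in the results table. The sketch below recomputes the mean and the population standard deviation with the standard library only.

```python
from statistics import fmean, pstdev

# split0..split4 test scores of the top-ranked GOSS row in the results table.
goss_splits = [-13.068328, -18.825505, -17.060896, -19.538983, -14.823776]

mean_score = fmean(goss_splits)   # compare with mean_test_score
std_score = pstdev(goss_splits)   # compare with std_test_score (population std)
print(round(mean_score, 4), round(std_score, 4))
```

Both values agree with the table to within rounding. The spread of roughly 2.4 across folds is larger than the gap between the best and worst mean scores shown, so the ranking of boosting types should be read with some caution.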

Displaying the best estimator shows the fitted pipeline: the column transformer with its one-hot encoder, followed by the LGBM regressor.

In [39]:
gs.best_estimator_

###Joblib

Finally we use joblib to save the best pipeline found by the grid search to disk and then load it back, so the trained model can be reused without retraining. Joblib also provides parallel processing across multiple CPU cores, which the grid search uses through n_jobs=-1 to speed up training.

In [40]:
joblib.dump(gs.best_estimator_, 'model.joblib')

['model.joblib']

In [41]:
mdl = joblib.load('model.joblib')

##**Investment Prediction**

###Dataset Information and Rules

We display the head of the data set again to recall the columns needed to describe the investment whose success we are trying to predict.

Once again here is the key for the data set:

1.   Investment Name: The name of where the investment came from.
2.   Investment: Total dollars invested (USD)
3.   Time: Short-term = 1, Long-term = 2
4.   Skill: Low = 1, Medium = 2, High = 3
5.   Impact: Low = 1, Medium = 2, High = 3
6.   Successful: The chance of the investment being successful, expressed as an integer percentage.




In [42]:
df.head(10)

Unnamed: 0,Investment Name,Investment,Time,Skill,Impact,Successful
0,Stock A,10000,2,3,3,70
1,Stock B,15000,1,2,3,60
2,Mutual Fund C,20000,2,1,1,80
3,ETF D,25000,1,3,1,85
4,Real Estate E,5000,2,2,3,75
5,Bond F,7500,2,3,1,65
6,Commodity G,35000,2,2,2,55
7,Forex H,5000,1,3,1,50
8,Private Equity I,30000,2,1,2,70
9,Hedge Fund J,2000,1,3,1,75


###Data Entry

In the `q` DataFrame in the code below, enter the values of the five features, separated by commas, using the above key and snippet of the data set as a guide. The row should be structured like this:

**[Investment Name, Investment, Time, Skill, Impact]**



In [43]:
# Enter the numeric features as numbers, not strings, so they match the training data types
q = pd.DataFrame([['Tesla Stock', 20000, 2, 3, 3]],
             columns=X.columns)
q

Unnamed: 0,Investment Name,Investment,Time,Skill,Impact
0,Tesla Stock,20000,2,3,3
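Before predicting, it can help to check that an input row follows the key. The sketch below is one possible check in plain Python; `validate_row` and `ALLOWED` are hypothetical names introduced for illustration, and the rule that Investment must be positive is an assumption rather than something stated in the data set.

```python
# Allowed codes for the ordinal columns, taken from the data set key.
ALLOWED = {"Time": {1, 2}, "Skill": {1, 2, 3}, "Impact": {1, 2, 3}}

def validate_row(name, investment, time, skill, impact):
    """Check one input row against the key; raise ValueError if it does not fit."""
    if not isinstance(name, str) or not name:
        raise ValueError("Investment Name must be a non-empty string")
    # Assumption: an investment amount is a positive number of USD.
    if not isinstance(investment, (int, float)) or investment <= 0:
        raise ValueError("Investment must be a positive number of USD")
    for column, value in (("Time", time), ("Skill", skill), ("Impact", impact)):
        if value not in ALLOWED[column]:
            raise ValueError(f"{column} must be one of {sorted(ALLOWED[column])}, got {value!r}")
    return [name, investment, time, skill, impact]

row = validate_row("Tesla Stock", 20000, 2, 3, 3)
print(row)
```

A validated row can then be wrapped in a DataFrame with the same columns as X before it is passed to the model.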


This cell outputs the predicted chance, as a percentage, of the investment (startup) being successful, based on the KPI metrics entered above.



In [64]:
# predict returns one value per input row; q has a single row, so take the first value
new_score = float(mdl.predict(q)[0])

print("The investment has a {:0.2f}% chance of being successful.".format(new_score))

The investment has a 59.57% chance of being successful.
