# Kickstart Informed: Executive Notebook

This notebook can be run to go through the entire modeling process, _with the exception of the hyperparameter tuning, which can be found [here](data_processing/GridSearch_CLOUD.ipynb)._

### Review: What is Kickstarter? Why use machine learning? 

Kickstarter is a well-known crowdfunding website that was founded in 2009, with a focus on providing a platform for people to fund new creations such as art, film, technology, and games, both digital and physical.

Each individual project comes with a large amount of objective data, and often follows trends, new inventions and innovations, or other themes. On top of that, a creator has to decide much more than the manufacturing cost and shipping & handling, such as when the project should be published, how long it should run, and what the goal should be. The more research this takes, the more time and energy goes into research that could have been spent working on the project itself.

That's where Kickstart Informed comes in. With machine learning, a creator can adjust their project to help increase its chance of success without having to do quite as much research. While the model can't process what the project itself is about or all of the intimate details that go along with it, it can take numerical, categorical, binary, and date information and use it to make predictions.

### Starting with the Clean Data

_If you would like to review the data cleaning process, go [here](data/DataCleaningBook.ipynb)._

We're starting with a total of 9 features. Let's review them.

- `country`: The country that the project creator resides in OR the country that the business is located in
- `currency`: The original currency that the project used
- `goal`: The project's goal in US Dollars
- `staff_pick`: Whether or not the project was officially featured by Kickstarter
- `category_parent_name`: The name of the parent category that the project was in
- `start_month`: The month that the project began in
- `end_month`: The month that the project ended in
- `project_length`: The total number of days that the project ran for
- `blurb_len`: The length of the project's blurb
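
As a rough illustration of how the date- and text-derived features above could be computed, here is a minimal sketch. The raw field names (`launched_at`, `deadline`, `blurb`), the use of month numbers, and the choice to measure `blurb_len` in characters are all assumptions made for this example; the actual derivation is in the data cleaning notebook.

```python
from datetime import date

def derive_features(launched_at: date, deadline: date, blurb: str) -> dict:
    """Build the derived columns for one project (hypothetical helper)."""
    return {
        "start_month": launched_at.month,                 # assumption: month as 1-12
        "end_month": deadline.month,                      # assumption: month as 1-12
        "project_length": (deadline - launched_at).days,  # whole days the project ran
        "blurb_len": len(blurb),                          # assumption: length in characters
    }

example = derive_features(date(2019, 3, 1), date(2019, 3, 31), "A card game about cats.")
print(example)
```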

The data has been split into train and test sets, with train being 80% of the dataset and test being 20%:
- Train: 138,192
- Test: 34,549
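
As a quick sanity check on those counts, the arithmetic can be sketched as below. This does not reproduce the split itself; it only shows that the listed sizes are consistent with an 80/20 split in which the test share is rounded up, which is an assumption about how the rounding was done.

```python
import math

n_train, n_test = 138_192, 34_549
n_total = n_train + n_test

# Rounding the 20% test share up to a whole row reproduces the listed test count.
expected_test = math.ceil(0.20 * n_total)
expected_train = n_total - expected_test

print(n_total, expected_train, expected_test)
print(f"train share: {n_train / n_total:.4f}, test share: {n_test / n_total:.4f}")
```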

In [None]:
from data_processing.src.modeling import *

X_train, y_train, X_test, y_test = load_train_test_data(ex = True)

In [None]:
X_train.head()

First, we're going to run 12 different vanilla models, using `RepeatedStratifiedKFold` to perform cross-validation. Based on those results, the weaker models will be pruned and the best models will be analyzed further.
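
The `test_models` helper used below comes from this project's modeling module, and its internals aren't shown in this notebook. As a minimal sketch of the general technique, assuming it does something similar in spirit, repeated stratified k-fold cross-validation scored on precision could look like this. The synthetic data, the single decision tree, and the `n_splits` / `n_repeats` values are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for X_train / y_train (assumption: binary target, 9 features).
X_demo, y_demo = make_classification(n_samples=300, n_features=9, random_state=10)

# 5 stratified folds repeated 2 times gives 10 precision scores per model.
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=2, random_state=10)
scores = cross_val_score(
    DecisionTreeClassifier(random_state=10), X_demo, y_demo, scoring="precision", cv=cv
)
print(f"mean precision: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Stratifying keeps the ratio of successful to failed projects roughly the same in every fold, and repeating the split with different shuffles gives a steadier estimate than a single pass.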

In [None]:
models = {'Log': LogisticRegression(), 'Knn': KNeighborsClassifier(), 'DT': DecisionTreeClassifier(random_state = 10), 
          'Gaussian': GaussianNB(), 'Multinomial': MultinomialNB(), 'LDA': LinearDiscriminantAnalysis(),
          'LinearSVC': LinearSVC(max_iter = 1250, random_state = 10), 'SDGSVC': SGDClassifier(random_state = 10),  
          'ADA': AdaBoostClassifier(random_state = 10), 'Bagging': BaggingClassifier(random_state = 10), 
          'Ridge': RidgeClassifier(random_state = 10), 'RF': RandomForestClassifier(random_state = 10)}

# getting results and model
result_dict = test_models(X_train, y_train, models, n_jobs = 2)

save_cv_results(result_dict, 'data_processing/models/VanillaResults_1.p')

In [None]:
results = list(result_dict.values())
model_names = list(result_dict.keys())
plot_model_results(results, model_names, 'data_processing/models/VanillaResults_1.png',
                   figure_title = 'Precision for Each Vanilla Model (version 1)',
                   figsize = (13, 8))

After reviewing the results, `Log`, `KNN`, `Gaussian`, `LinearSVC`, `SDGSVC`, and `Ridge` are determined to be too weak, and they are purged from the list of models. 

With those out of the way, cross-validation is performed again.
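
In this notebook the pruning was done by eye from the plot. A programmatic version could be sketched as below, assuming the results map each model name to a list of cross-validation precision scores (the real structure of `result_dict` isn't shown here). The model names and scores are made-up placeholders, not results from this project, and the threshold is arbitrary.

```python
from statistics import mean

# Placeholder scores for illustration only, NOT results from this project.
demo_results = {
    "model_a": [0.71, 0.69, 0.72],
    "model_b": [0.55, 0.58, 0.54],
    "model_c": [0.66, 0.64, 0.65],
}

def keep_strong_models(results: dict, threshold: float) -> list:
    """Return model names whose mean CV score meets the threshold, best first."""
    means = {name: mean(scores) for name, scores in results.items()}
    kept = [name for name, m in means.items() if m >= threshold]
    return sorted(kept, key=means.get, reverse=True)

print(keep_strong_models(demo_results, threshold=0.60))
```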

In [None]:
models = {'Knn': KNeighborsClassifier(), 'DT': DecisionTreeClassifier(random_state = 10), 
          'Multinomial': MultinomialNB(), 'LDA': LinearDiscriminantAnalysis(),
          'ADA': AdaBoostClassifier(random_state = 10), 'Bagging': BaggingClassifier(random_state = 10), 
          'RF': RandomForestClassifier(random_state = 10)}


# getting results and model
result_dict = test_models(X_train, y_train, models, n_jobs = 2)

save_cv_results(result_dict, 'data_processing/models/VanillaResults_2.p')

In [None]:
results = list(result_dict.values())
model_names = list(result_dict.keys())
plot_model_results(results, model_names, 'data_processing/models/VanillaResults_2.png',
                   figure_title = 'Precision for Each Vanilla Model (version 2)',
                   figsize = (13, 8))

After reviewing this round, all but `ADA`, `Bagging`, and `RF` are purged. `DT` is purged because it overlaps heavily with `RF`, which is itself an ensemble of decision trees.

Additionally, given the good performance of `ADA`, `GradientBoost` is added.

In [None]:
models = {'ADA': AdaBoostClassifier(random_state = 10), 'Bagging': BaggingClassifier(random_state = 10), 
          'RF': RandomForestClassifier(random_state = 10), 'GradientBoost' : GradientBoostingClassifier(random_state = 10)}

# create stacked model
new_models = stacked_model(models)

# getting results and model
result_dict = test_models(X_train, y_train, new_models, n_jobs = 2)

save_cv_results(result_dict, 'data_processing/models/VanillaResults_3.p')

In [None]:
results = list(result_dict.values())
model_names = list(result_dict.keys())
plot_model_results(results, model_names, 'data_processing/models/VanillaResults_3.png',
                   figure_title = 'Precision for Each Vanilla Model (version 3)',
                   figsize = (13, 8))

After selecting the above four models, I moved on to [tuning them using Google Cloud](data_processing/GridSearch_CLOUD.ipynb).

All results from the grid search were pickled and stored [here](data_processing/models/Gridsearch_models), with the best results from all four pickled [here](data_processing/models/BestTunedClassifiers.p).

Below is a dictionary holding each model with its best parameters. The model names are then stored in one list and the model instances in another, so that they can be examined together.

In [None]:
tuned_models = {
    'ADA': AdaBoostClassifier(
        algorithm = 'SAMME.R',
        learning_rate = 1.5,
        n_estimators = 1000,
        random_state = 6
    ),
    
    'Bagging': BaggingClassifier(
        bootstrap = True,
        bootstrap_features = True,
        max_features = 0.5,
        max_samples = 35,
        n_estimators = 19,
        random_state = 6,
        warm_start = True
    ),
    
    'RF': RandomForestClassifier(
        bootstrap = False,
        criterion = 'entropy',
        max_depth = 30,
        max_features = 'auto',
        max_leaf_nodes = None,
        max_samples = None,
        min_samples_leaf = 15,
        min_samples_split = 2,
        n_estimators = 10,
        random_state = 6
    ),
    
    'GradientBoost' : GradientBoostingClassifier(
        learning_rate = 1,
        loss = 'deviance',
        max_depth = 3,
        min_samples_split = 15,
        n_estimators = 100,
        random_state = 6,
        subsample = 1,
    )}
# split the dictionary into two parallel lists: names and model instances
tuned_model_names = list(tuned_models.keys())
tuned_models = list(tuned_models.values())

With that, we'll also create a list with one placeholder per classifier, to be filled in with the fitted models.

In [None]:
# one slot each for ADA, Bagging, RF, and GradientBoost
clfs = [None] * 4

Using a simple for loop, we can fit each model with the parameters from the grid search.

In [None]:
for i in range(4):
    clfs[i] = tuned_models[i]
    clfs[i].fit(X_train, y_train)
    print(f"Trained {tuned_model_names[i]}")

With the models trained, the results can be viewed!

At first, I'm going to generate classification reports to review the full results, but the main metrics I'm keeping an eye on are precision and accuracy.

Ideally, the model would misclassify few unsuccessful projects as successful (false positives), to help make sure false hopes aren't given to the user. That is what precision measures.
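
To make the two metrics concrete: precision is the share of predicted successes that really were successes, TP / (TP + FP), while accuracy is the share of all predictions that were correct. Here is a minimal sketch on toy labels, assuming `1` means successful and `0` means failed; the labels are invented for the example.

```python
def precision_and_accuracy(y_true, y_pred):
    """Compute precision for the positive class (1) and overall accuracy."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    accuracy = correct / len(y_true)
    return precision, accuracy

# Toy labels for illustration only.
toy_true = [1, 0, 1, 1, 0, 0, 1, 0]
toy_pred = [1, 1, 1, 0, 0, 0, 1, 0]
precision, accuracy = precision_and_accuracy(toy_true, toy_pred)
print(f"precision={precision:.2f}, accuracy={accuracy:.2f}")
```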

In [None]:
y_pred = 0

for i in range(4):
    print(f"""Classification Report for {tuned_model_names[i]}
-----""")
    y_pred = clfs[i].predict(X_test)
    print(classification_report(y_test, y_pred))
    print("~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~")

Already, it looks like GradientBoost might be the best in terms of accuracy; however, AdaBoost and GradientBoost are matched on precision. So next, I'm going to look at both the precision score and the accuracy score for every model to get more detailed numbers.

In [None]:
for i in range(4):
    y_pred = clfs[i].predict(X_test)
    print(f"""Precision Score for {tuned_model_names[i]} : {precision_score(y_test, y_pred)}""")
    print("~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~")

In [None]:
for i in range(4):
    print(f"""Accuracy Score for {tuned_model_names[i]} : {clfs[i].score(X_test, y_test)}""")
    print("~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~")

And with that, we can see that AdaBoost and GradientBoost are very close on both precision and accuracy, but GradientBoost is at least a little bit more accurate than AdaBoost. To be sure, I checked the confusion matrix for each.

In [None]:
for i in range(4):
    print(f"""Confusion Matrix for {tuned_model_names[i]}
-----""")
    y_pred = clfs[i].predict(X_test)
#     plot_confusion_matrix(clfs[i], X_test, y_test, cmap = "Blues")
    print(confusion_matrix(y_test, y_pred))
    print("~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~")

With that, I can once again confirm GradientBoost has the lowest number of false positives, on top of having the most correct guesses for both true positives and true negatives!
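
For reading the printed matrices: with binary labels `0` and `1`, scikit-learn's `confusion_matrix` puts true labels on the rows and predicted labels on the columns, so the layout is `[[TN, FP], [FN, TP]]`. A small sketch of pulling the false positives, precision, and accuracy out of that layout follows; `summarize_confusion` is a hypothetical helper and the counts are placeholders, not this project's results.

```python
def summarize_confusion(matrix):
    """Unpack a 2x2 matrix laid out as [[TN, FP], [FN, TP]]."""
    (tn, fp), (fn, tp) = matrix
    total = tn + fp + fn + tp
    return {
        "false_positives": fp,
        "precision": tp / (tp + fp),
        "accuracy": (tp + tn) / total,
    }

# Placeholder counts for illustration only.
demo_matrix = [[50, 10], [15, 25]]
summary = summarize_confusion(demo_matrix)
print(summary)
```

The same unpacking works on the array returned by `confusion_matrix(y_test, y_pred)`.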

### Reviewing the Results

With the tuned GradientBoost model being just barely 73% accurate, it's not ideal. However, it's enough to be helpful and augment what the project creator already knows, and to give some idea of ideal date ranges, project lengths, and goals.
As long as the risk is kept in mind, including the roughly 1-in-4 chance that the model is incorrect, it can be considered helpful, provided it is not depended upon.

### Next Steps

- Test other algorithms
    - While working on this project, I was unable to get XGBoost to work. If possible, one of my next steps would be to troubleshoot this issue and see if it performs better than the other models I have tested
    - As well, I would like to research other classifier algorithms that may be used to further improve performance
    - I would like to test using a stacked classifier to improve overall model performance
   
- Test creating models for each parent category
    - This is based on my theory that making a model for each parent category, and using the child categories as a feature, may improve the performance of the models. This will be very time-consuming due to the number of parent categories.
    
- Alter number of features used 
    - I would like to alter the number of features used, and see if removing some may improve overall model performance.

- Create a separate NLP algorithm that analyzes blurbs and titles to determine whether or not the words used may draw more attention and, therefore, more backers.
    - This would take a lot of preprocessing and manual addition of words, as well as manual removal of stopwords. Some item names may not be recognized as "real" words when stopwords and non-English words are removed, which may cause issues and lower accuracy.
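
As a rough idea of the preprocessing that last item describes, here is a minimal sketch of tokenizing a blurb and dropping stopwords. The stopword set is a tiny hand-written placeholder rather than a real stopword list, and the regex tokenizer is a simplification; both are assumptions for illustration.

```python
import re
from collections import Counter

# Tiny placeholder stopword set; a real pipeline would use a fuller list.
STOPWORDS = {"a", "an", "the", "and", "of", "for", "to", "about", "with"}

def blurb_tokens(blurb: str) -> list:
    """Lowercase a blurb, split it into alphabetic tokens, and drop stopwords."""
    words = re.findall(r"[a-z']+", blurb.lower())
    return [w for w in words if w not in STOPWORDS]

counts = Counter(blurb_tokens("A card game about cats and the chaos of cats."))
print(counts.most_common(3))
```

A filter that also drops anything outside an English dictionary would remove invented product names too, which is the risk noted above.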