# Trying to predict success using ML models

This notebook attempts to use the features we extracted from the movie scripts to predict whether a movie will be successful.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
import operator
from scipy.stats import pointbiserialr
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
import warnings
warnings.filterwarnings('ignore')
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import StratifiedKFold
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler
from time import time

We load the dataset containing all the features, both raw and aggregated.

In [2]:
path = "feat_extraction/movies_with_feats_pedro.csv"

df = pd.read_csv(path, index_col=0)
df.index.name = ""
del df["Unnamed: 0.1"]
df.head()

| | Processed Title | Success | n_unique_words_char_1 | n_unique_words_char_2 | n_unique_words_char_3 | n_unique_words_char_4 | n_unique_words_char_5 | FK_read_level_char_1 | FK_read_level_char_2 | FK_read_level_char_3 | ... | tot_hw_sents | feel_ratio | fill_ratio | hw_ratio | main_char_rel_diag_length | stdvs_unique_words_above_mean | FK_read_level_mean_char | stdvs_n_stop_words_above_mean | stdvs_n_curse_words_above_mean | stdvs_n_mentions_others_above_mean |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Avatar | Avatar | 1.0 | 670.0 | 565.0 | 252.0 | 425.0 | 276.0 | 2.0 | 2.0 | 2.0 | ... | 269.0 | 0.016194 | 0.008097 | 0.272267 | 39.537232 | -0.090538 | 2.4 | 0.063141 | 0.106202 | -0.541956 |
| The Dark Knight Rises | Dark-Knight-Rises,-The | 1.0 | 531.0 | 514.0 | 506.0 | 418.0 | 434.0 | 2.0 | 2.0 | 3.0 | ... | 208.0 | 0.040289 | 0.089876 | 0.214876 | 23.705825 | 0.255438 | 2.2 | 0.789323 | -0.686754 | -0.17631 |
| The Avengers | Avengers,-The | 1.0 | 560.0 | 623.0 | 359.0 | 425.0 | 205.0 | 2.0 | 3.0 | 3.0 | ... | 222.0 | 0.036066 | 0.296175 | 0.242623 | 30.146047 | -0.118151 | 3.0 | -0.004098 | -0.70941 | -0.438622 |
| Pirates of the Caribbean: Dead Man's Chest | Pirates-of-the-Caribbean-Dead-Man's-Chest | 1.0 | 629.0 | 373.0 | 285.0 | 378.0 | 281.0 | 2.0 | 1.0 | 1.0 | ... | 148.0 | 0.025086 | 0.064994 | 0.168757 | 36.743621 | -0.481995 | 1.8 | -0.057889 | -0.732066 | -0.867858 |
| Men in Black 3 | Men-in-Black-3 | 1.0 | 1206.0 | 867.0 | 234.0 | 209.0 | 185.0 | 2.0 | 3.0 | 2.0 | ... | 277.0 | 0.021722 | 0.127112 | 0.222848 | 57.328163 | 0.742729 | 2.8 | 0.533814 | -0.437539 | 1.349863 |


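We need to drop the raw `polarity_of_mentions_char_*` features, since each cell holds a dictionary serialized as a string rather than a number: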

In [3]:
df["polarity_of_mentions_char_1"].iloc[0]

"{'neg': 0.13, 'neu': 0.727, 'pos': 0.144, 'compound': 0.9994}"

In [4]:
X = df.drop(["Processed Title", "Success", "polarity_of_mentions_char_1", "polarity_of_mentions_char_2", 
             "polarity_of_mentions_char_3", "polarity_of_mentions_char_4", "polarity_of_mentions_char_5"], axis=1)
y = df["Success"]
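Dropping the raw strings loses little here, since flattened versions such as `compound_polarity_of_mentions_char_2` and `pos_polarity_of_mentions_char_1` already appear among the remaining columns. For reference, a minimal sketch of how one such string could be turned back into numeric values, using only the cell value printed above (the `flat` column names are illustrative, not necessarily how the dataset's columns were built):

```python
import ast

# The raw cell value shown above: a dict serialized as a string.
raw = "{'neg': 0.13, 'neu': 0.727, 'pos': 0.144, 'compound': 0.9994}"

# ast.literal_eval parses Python literals only, so it never runs arbitrary code.
scores = ast.literal_eval(raw)

# Flatten into one numeric value per key; the naming scheme is an assumption.
flat = {f"{key}_polarity_of_mentions_char_1": value for key, value in scores.items()}
print(flat)
```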

Let's compute the [point-biserial correlation coefficient](https://en.wikipedia.org/wiki/Point-biserial_correlation_coefficient) between our target and each of our features
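For a binary target, the point-biserial coefficient is numerically the same as the Pearson correlation between the feature and the 0/1 labels. The sketch below checks this on made-up toy data with hand-written helpers (`point_biserial` and `pearson` are illustrative names, not SciPy functions):

```python
import math

def point_biserial(x, y):
    """r_pb = (M1 - M0) / s_n * sqrt(n1 * n0 / n**2), with y in {0, 1}."""
    group1 = [xi for xi, yi in zip(x, y) if yi == 1]
    group0 = [xi for xi, yi in zip(x, y) if yi == 0]
    n, n1, n0 = len(x), len(group1), len(group0)
    mean_all = sum(x) / n
    # Population standard deviation (divides by n, not n - 1).
    s_n = math.sqrt(sum((xi - mean_all) ** 2 for xi in x) / n)
    m1, m0 = sum(group1) / n1, sum(group0) / n0
    return (m1 - m0) / s_n * math.sqrt(n1 * n0 / n ** 2)

def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Toy data, made up for illustration only.
feature = [2.0, 3.5, 1.0, 4.0, 5.5, 2.5, 6.0, 3.0]
target = [0, 1, 0, 1, 1, 0, 1, 0]

r_pb = point_biserial(feature, target)
r = pearson(feature, target)
print(round(r_pb, 6), round(r, 6))
```

The sign tells which class has the larger feature mean: a positive value means the feature tends to be higher for successful movies.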

In [5]:
corr = {}
for feature in X.columns:
    corr[feature] = pointbiserialr(X[feature].values, y.values)
        
sorted_corr = sorted(corr.items(), key=operator.itemgetter(1))
print("\nTop 5 most negative correlated features:\n")
for i in sorted_corr[:5]:
    print(i[0], ":", i[1].correlation)
print("\nTop 5 most positive correlated features:\n")
for i in sorted_corr[::-1][:5]:
    print(i[0], ":", i[1].correlation)


Top 5 most negative correlated features:

n_unique_words_char_5 : -0.0625966891567367
hw_per_sent_char_4 : -0.05588129871815716
num_pass_sents_char_2 : -0.05359501936621161
feels_per_sent_char_4 : -0.05223540032604609
passive_ratio : -0.051792981697795296

Top 5 most positive correlated features:

overall_polarity_char_2 : 0.07773382447981679
compound_polarity_of_mentions_char_2 : 0.06305073767233481
pos_polarity_of_mentions_char_1 : 0.0592639484639471
compound_polarity_of_mentions_char_5 : 0.05378153632926945
overall_polarity_char_5 : 0.05155577879978894


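As we can see, this is not good news: even the strongest correlations are below 0.08 in absolute value, so no single feature is more than very weakly associated with success. Let's keep only the features with an absolute correlation greater than 0.05.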

In [6]:
thres = 0.05
feat_to_keep = [feat[0] for feat in sorted_corr if abs(feat[1].correlation) > thres]
feat_to_keep

['n_unique_words_char_5',
 'hw_per_sent_char_4',
 'num_pass_sents_char_2',
 'feels_per_sent_char_4',
 'passive_ratio',
 'overall_polarity_char_5',
 'compound_polarity_of_mentions_char_5',
 'pos_polarity_of_mentions_char_1',
 'compound_polarity_of_mentions_char_2',
 'overall_polarity_char_2']

In [7]:
X = X[feat_to_keep]
y = df["Success"]
y.value_counts()

1    540
0    113
Name: Success, dtype: int64

We can also see that the class labels are imbalanced: 540 successes versus 113 failures, so about 83% of the movies are labeled as successful.

In [8]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0, stratify = y)
st = StandardScaler()
X_train_std = st.fit_transform(X_train)
X_test_std = st.transform(X_test)

In [9]:
y_test.value_counts(), "No-info rate:", 162/(162+34)

(1    162
 0     34
 Name: Success, dtype: int64, 'No-info rate:', 0.826530612244898)
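The no-information rate is the accuracy of always predicting the majority class. Rather than typing the counts by hand, it can be derived from the labels; a small sketch using stand-in labels with the same counts as the test set above (`y_test_labels` is a hypothetical list, not the notebook's `y_test`):

```python
from collections import Counter

# Stand-in labels matching the counts printed above: 162 successes, 34 failures.
y_test_labels = [1] * 162 + [0] * 34

counts = Counter(y_test_labels)
majority_class, majority_count = counts.most_common(1)[0]
no_info_rate = majority_count / len(y_test_labels)
print(majority_class, no_info_rate)
```

With the pandas Series itself, `y_test.value_counts(normalize=True).max()` should give the same number.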

In [10]:
logreg = LogisticRegression(random_state=0)
logreg.fit(X_train_std, y_train)
logreg.score(X_test_std, y_test)

0.826530612244898

In [11]:
logreg.predict(X_test_std)

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

In [12]:
pd.crosstab(y_test, logreg.predict(X_test_std), rownames=['True'], colnames=['Predicted'])

| True / Predicted | 1 |
|---|---|
| 0 | 34 |
| 1 | 162 |


Well, this is not good. Our model is not learning anything: it predicts every movie as successful, so its accuracy is exactly the no-information rate. As the correlations suggested, our features do not seem to carry enough predictive power... Let's try a grid search over several models.
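Accuracy hides this failure because the classes are imbalanced. A quick sketch, using only the counts from the crosstab above, shows how per-class recall and balanced accuracy expose it:

```python
# Confusion-matrix counts from the crosstab above (positive class = 1).
tn, fp = 0, 34     # true 0: predicted 0, predicted 1
fn, tp = 0, 162    # true 1: predicted 0, predicted 1

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall_pos = tp / (tp + fn)          # sensitivity
recall_neg = tn / (tn + fp)          # specificity
balanced_accuracy = (recall_pos + recall_neg) / 2

print(f"accuracy={accuracy:.4f}  balanced_accuracy={balanced_accuracy:.4f}")
```

A balanced accuracy of 0.5 is what a constant prediction always gets. scikit-learn exposes the same quantity as `balanced_accuracy_score`, and the string `"balanced_accuracy"` can be passed as a `scoring` value to `GridSearchCV` in versions that support it.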

In [13]:
def find_best_model(metric="accuracy"):

    # The code below was taken from GWU MLI Class
    clfs = {'lr': LogisticRegression(random_state=0),
            'mlp': MLPClassifier(random_state=0),
            'dt': DecisionTreeClassifier(random_state=0),
            'rf': RandomForestClassifier(random_state=0),
            'knn': KNeighborsClassifier(),
            'gnb': GaussianNB(),
            "ada": AdaBoostClassifier(random_state=0),
           }
    pipe_clfs = {}
    for name, clf in clfs.items():
        pipe_clfs[name] = Pipeline([('StandardScaler', StandardScaler()), ('clf', clf)])
    param_grids = {}
    C_range = [10 ** i for i in range(-4, 5)]
    param_grid = [{'clf__multi_class': ['ovr'], 
                   'clf__solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
                   'clf__C': C_range},

                  {'clf__multi_class': ['multinomial'],
                   'clf__solver': ['newton-cg', 'lbfgs', 'sag', 'saga'],
                   'clf__C': C_range}]
    param_grids['lr'] = param_grid
    param_grid = [{'clf__hidden_layer_sizes': [10, 100, 200],
                   'clf__activation': ['identity', 'logistic', 'tanh', 'relu']}]
    param_grids['mlp'] = param_grid
    param_grid = [{'clf__min_samples_split': [2, 10, 30],
                   'clf__min_samples_leaf': [1, 10, 30]}]
    param_grids['dt'] = param_grid
    param_grid = [{'clf__n_estimators': [2, 10, 30],
                   'clf__min_samples_split': [2, 10, 30],
                   'clf__min_samples_leaf': [1, 10, 30]}]
    param_grids['rf'] = param_grid
    param_grid = [{'clf__n_neighbors': list(range(1, 11))}]
    param_grids['knn'] = param_grid
    param_grid = [{'clf__var_smoothing': [10 ** i for i in range(-10, -7)]}]
    param_grids['gnb'] = param_grid
    param_grid = [{'clf__base_estimator': [DecisionTreeClassifier(max_depth=1), LogisticRegression(random_state=0), 
                                          ], "clf__n_estimators": [i*10 for i in range(2, 7)], 
                  "clf__learning_rate": [0.01, 0.1, 1], "clf__algorithm": ["SAMME", "SAMME.R"]}]
    param_grids['ada'] = param_grid
    best_score_param_estimators = []
    for name in pipe_clfs.keys():
        gs = GridSearchCV(estimator=pipe_clfs[name],
                          param_grid=param_grids[name],
                          scoring=metric,
                          n_jobs=-1,
                          cv=StratifiedKFold(n_splits=10,
                                             shuffle=True,
                                             random_state=0))
        start = time()
        gs = gs.fit(X_train, y_train)
        print("the gird-search took", round(time() - start), "seconds for", name)
        best_score_param_estimators.append([gs.best_score_, gs.best_params_, gs.best_estimator_])
    best_score_param_estimators = sorted(best_score_param_estimators, key=lambda x : x[0], reverse=True)
    for best_score_param_estimator in best_score_param_estimators:
        print([best_score_param_estimator[0], best_score_param_estimator[1], type(best_score_param_estimator[2].named_steps['clf'])], end='\n\n')
        
    return best_score_param_estimators

In [14]:
best_score_param_estimators = find_best_model()

the gird-search took 8 seconds for lr
the gird-search took 11 seconds for mlp
the gird-search took 0 seconds for dt
the gird-search took 3 seconds for rf
the gird-search took 1 seconds for knn
the gird-search took 0 seconds for gnb
the gird-search took 17 seconds for ada
[0.8293216630196937, {'clf__C': 0.001, 'clf__multi_class': 'ovr', 'clf__solver': 'liblinear'}, <class 'sklearn.linear_model.logistic.LogisticRegression'>]

[0.8293216630196937, {'clf__min_samples_leaf': 1, 'clf__min_samples_split': 30, 'clf__n_estimators': 30}, <class 'sklearn.ensemble.forest.RandomForestClassifier'>]

[0.8293216630196937, {'clf__algorithm': 'SAMME', 'clf__base_estimator': LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=0, solver='warn',
          tol=0.0001, verbose=0, warm_start=False), 'clf__learning_rate': 0.01, 'clf__n_estimators': 20}, <class 'sklearn.ens

In [15]:
best_score_param_estimators[0][2].fit(X_train_std, y_train)
best_score_param_estimators[0][2].score(X_test_std, y_test)

0.8214285714285714

In [16]:
best_score_param_estimators[0][2].predict(X_test_std)

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

In [17]:
pd.crosstab(y_test, best_score_param_estimators[0][2].predict(X_test_std), rownames=['True'], colnames=['Predicted'])

| True / Predicted | 0 | 1 |
|---|---|---|
| 0 | 0 | 34 |
| 1 | 1 | 161 |


If we tune to maximize accuracy, we still don't learn anything. Let's tune for precision instead, that is, for the fewest false positives among the movies predicted as successful.

In [18]:
best_score_param_estimators = find_best_model(metric="precision")

the gird-search took 5 seconds for lr
the gird-search took 12 seconds for mlp
the gird-search took 0 seconds for dt
the gird-search took 3 seconds for rf
the gird-search took 1 seconds for knn
the gird-search took 0 seconds for gnb
the gird-search took 20 seconds for ada
[0.8462899024804271, {'clf__min_samples_leaf': 10, 'clf__min_samples_split': 2}, <class 'sklearn.tree.tree.DecisionTreeClassifier'>]

[0.8461221873195848, {'clf__n_neighbors': 2}, <class 'sklearn.neighbors.classification.KNeighborsClassifier'>]

[0.8355593470591097, {'clf__min_samples_leaf': 1, 'clf__min_samples_split': 10, 'clf__n_estimators': 2}, <class 'sklearn.ensemble.forest.RandomForestClassifier'>]

[0.8312369021238799, {'clf__algorithm': 'SAMME', 'clf__base_estimator': DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=1,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
    

In [19]:
best_score_param_estimators[0][2].fit(X_train_std, y_train)
best_score_param_estimators[0][2].score(X_test_std, y_test)

0.7551020408163265

In [20]:
best_score_param_estimators[0][2].predict(X_test_std)

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1,
       1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0,
       1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1,
       1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

In [21]:
pd.crosstab(y_test, best_score_param_estimators[0][2].predict(X_test_std), rownames=['True'], colnames=['Predicted'])

| True / Predicted | 0 | 1 |
|---|---|---|
| 0 | 2 | 32 |
| 1 | 16 | 146 |


This is even worse than before. We only get two fewer false positives, at the cost of 16 false negatives!
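To put numbers on the comparison, the following sketch recomputes accuracy and precision on the held-out set from the two crosstabs above (`summarize` is a throwaway helper, not part of the notebook):

```python
def summarize(tn, fp, fn, tp):
    """Return (accuracy, precision) for the positive class."""
    accuracy = (tp + tn) / (tn + fp + fn + tp)
    precision = tp / (tp + fp)
    return accuracy, precision

# Counts read from the two crosstabs above.
acc_tuned = summarize(tn=0, fp=34, fn=1, tp=161)      # model tuned for accuracy
prec_tuned = summarize(tn=2, fp=32, fn=16, tp=146)    # model tuned for precision

print("accuracy-tuned : acc=%.4f precision=%.4f" % acc_tuned)
print("precision-tuned: acc=%.4f precision=%.4f" % prec_tuned)
```

On this test set the precision-tuned model ends up with slightly lower precision than the accuracy-tuned one, so the improvement seen in cross-validation does not carry over.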

Let's try oversampling and undersampling, although this will probably not work: it seems that our features are pretty much irrelevant with regard to success.

In [22]:
rus = RandomUnderSampler(random_state=0)
X_resampled, y_resampled = rus.fit_resample(X, y)
np.unique(y_resampled, return_counts=True)

(array([0, 1]), array([113, 113]))

In [23]:
X_train, _, y_train, _ = train_test_split(X_resampled, y_resampled,
                                                    test_size=0.3, random_state=0, stratify = y_resampled)
X_train_std = st.fit_transform(X_train)
X_test_std = st.transform(X_test)

In [24]:
logreg = LogisticRegression(random_state=0)
logreg.fit(X_train_std, y_train)
logreg.score(X_test_std, y_test)

0.5204081632653061
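One caveat about the cells above: the resampler is fitted on the full `X, y`, while `X_test` still comes from the earlier split, so rows of the test set can also end up in the resampled training set. A safer ordering is to split first and resample only the training part. The sketch below illustrates that ordering with index lists and a hand-written undersampler (`undersample_indices` is a hypothetical helper that mimics the idea of random undersampling, not imblearn's implementation):

```python
import random

def undersample_indices(indices, labels, seed=0):
    """Randomly drop majority-class indices until both classes are the same size."""
    rng = random.Random(seed)
    by_class = {}
    for i in indices:
        by_class.setdefault(labels[i], []).append(i)
    smallest = min(len(members) for members in by_class.values())
    kept = []
    for members in by_class.values():
        kept.extend(rng.sample(members, smallest))
    return sorted(kept)

# Toy labels with the same class counts as the dataset: 540 successes, 113 failures.
labels = [1] * 540 + [0] * 113

# 1) Split first (a plain shuffled 70/30 split here; the notebook's split is stratified).
rng = random.Random(0)
all_idx = list(range(len(labels)))
rng.shuffle(all_idx)
cut = int(0.7 * len(all_idx))
train_idx, test_idx = all_idx[:cut], all_idx[cut:]

# 2) Resample the training indices only; the test set is never touched.
balanced_train_idx = undersample_indices(train_idx, labels)

n_pos = sum(labels[i] == 1 for i in balanced_train_idx)
n_neg = sum(labels[i] == 0 for i in balanced_train_idx)
print(n_pos, n_neg, len(set(balanced_train_idx) & set(test_idx)))
```

With the libraries already imported in this notebook, the equivalent would presumably be to call `fit_resample` on the `X_train, y_train` returned by `train_test_split` instead of on `X, y`.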

In [25]:
best_score_param_estimators = find_best_model(metric="accuracy")

the gird-search took 3 seconds for lr
the gird-search took 6 seconds for mlp
the gird-search took 0 seconds for dt
the gird-search took 2 seconds for rf
the gird-search took 0 seconds for knn
the gird-search took 0 seconds for gnb
the gird-search took 13 seconds for ada
[0.6012658227848101, {'clf__activation': 'logistic', 'clf__hidden_layer_sizes': 100}, <class 'sklearn.neural_network.multilayer_perceptron.MLPClassifier'>]

[0.5886075949367089, {'clf__C': 0.1, 'clf__multi_class': 'multinomial', 'clf__solver': 'newton-cg'}, <class 'sklearn.linear_model.logistic.LogisticRegression'>]

[0.5822784810126582, {'clf__algorithm': 'SAMME.R', 'clf__base_estimator': LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='warn',
          n_jobs=None, penalty='l2', random_state=0, solver='warn',
          tol=0.0001, verbose=0, warm_start=False), 'clf__learning_rate': 0.1, 'clf__n_estimators': 50}, <class 'sklearn.ensem

In [26]:
best_score_param_estimators[0][2].fit(X_train_std, y_train)
best_score_param_estimators[0][2].score(X_test_std, y_test)

0.5357142857142857

In [27]:
ros = RandomOverSampler(random_state=0)
X_resampled, y_resampled = ros.fit_resample(X, y)
np.unique(y_resampled, return_counts=True)

(array([0, 1]), array([540, 540]))

In [28]:
X_train, _, y_train, _ = train_test_split(X_resampled, y_resampled,
                                                    test_size=0.3, random_state=0, stratify = y_resampled)
X_train_std = st.fit_transform(X_train)
X_test_std = st.transform(X_test)

In [29]:
logreg = LogisticRegression(random_state=0)
logreg.fit(X_train_std, y_train)
logreg.score(X_test_std, y_test)

0.576530612244898

In [30]:
best_score_param_estimators = find_best_model(metric="accuracy")

the gird-search took 4 seconds for lr
the gird-search took 20 seconds for mlp
the gird-search took 0 seconds for dt
the gird-search took 3 seconds for rf
the gird-search took 2 seconds for knn
the gird-search took 0 seconds for gnb
the gird-search took 26 seconds for ada
[0.921957671957672, {'clf__min_samples_leaf': 1, 'clf__min_samples_split': 2, 'clf__n_estimators': 30}, <class 'sklearn.ensemble.forest.RandomForestClassifier'>]

[0.8822751322751323, {'clf__n_neighbors': 1}, <class 'sklearn.neighbors.classification.KNeighborsClassifier'>]

[0.8637566137566137, {'clf__min_samples_leaf': 1, 'clf__min_samples_split': 2}, <class 'sklearn.tree.tree.DecisionTreeClassifier'>]

[0.7328042328042328, {'clf__activation': 'relu', 'clf__hidden_layer_sizes': 200}, <class 'sklearn.neural_network.multilayer_perceptron.MLPClassifier'>]

[0.6904761904761905, {'clf__algorithm': 'SAMME.R', 'clf__base_estimator': DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=1,
            max_feat

In [32]:
best_score_param_estimators[0][2].fit(X_train_std, y_train)
best_score_param_estimators[0][2].score(X_test, y_test)

0.8061224489795918
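One likely reason the cross-validated accuracy (about 0.92) sits so far above the held-out accuracy (about 0.81) is that the oversampler ran before the data was split, so copies of the same minority row can land in both the training and the validation folds. The toy simulation below illustrates the effect on pure-noise data with a hand-written 1-nearest-neighbour classifier. It uses neither the notebook's data nor its models, and every name in it is made up for the illustration:

```python
import random

rng = random.Random(0)

# Pure-noise data: 2-D features drawn independently of the labels (160 vs 40).
points = [(rng.random(), rng.random()) for _ in range(200)]
labels = [1] * 160 + [0] * 40

def predict_1nn(train, query):
    """Label of the training row closest to `query` (squared Euclidean distance)."""
    nearest = min(train, key=lambda row: (row[0][0] - query[0]) ** 2
                                         + (row[0][1] - query[1]) ** 2)
    return nearest[1]

def oversample(rows, rng):
    """Duplicate random minority rows until both classes have the same size."""
    pos = [r for r in rows if r[1] == 1]
    neg = [r for r in rows if r[1] == 0]
    small, large = (neg, pos) if len(neg) < len(pos) else (pos, neg)
    extra = [rng.choice(small) for _ in range(len(large) - len(small))]
    return rows + extra

def balanced_accuracy(pairs):
    """Mean of the per-class recalls over (true, predicted) pairs."""
    recalls = []
    for cls in (0, 1):
        hits = [pred == true for true, pred in pairs if true == cls]
        recalls.append(sum(hits) / len(hits))
    return sum(recalls) / 2

def cross_validate(rows, rng, resample_inside, k=5):
    """k-fold CV; optionally oversample each training fold after splitting."""
    rows = rows[:]
    rng.shuffle(rows)
    folds = [rows[i::k] for i in range(k)]
    pairs = []
    for i, test in enumerate(folds):
        train = [r for j, fold in enumerate(folds) if j != i for r in fold]
        if resample_inside:
            train = oversample(train, rng)
        pairs += [(true, predict_1nn(train, x)) for x, true in test]
    return balanced_accuracy(pairs)

data = list(zip(points, labels))
leaky = cross_validate(oversample(data, rng), rng, resample_inside=False)
clean = cross_validate(data, rng, resample_inside=True)
print(f"oversample before CV: {leaky:.2f}   oversample inside folds: {clean:.2f}")
```

Since the features carry no signal, the score obtained with resampling inside the folds should stay near chance, while resampling before cross-validation should look much better than it deserves. In imbalanced-learn, placing the sampler inside an `imblearn.pipeline.Pipeline` is the usual way to get the fold-wise behaviour.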

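This is still below the no-information rate of 0.83. Note also that the score above was computed on `X_test` rather than `X_test_std`, so the training and test features are on different scales. Let's repeat the experiment using all the features.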

In [33]:
X = df.drop(["Processed Title", "Success", "polarity_of_mentions_char_1", "polarity_of_mentions_char_2", 
             "polarity_of_mentions_char_3", "polarity_of_mentions_char_4", "polarity_of_mentions_char_5"], axis=1)
y = df["Success"]

In [34]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0, stratify = y)
st = StandardScaler()
X_train_std = st.fit_transform(X_train)
X_test_std = st.transform(X_test)

In [35]:
y_test.value_counts(), "No-info rate:", 162/(162+34)

(1    162
 0     34
 Name: Success, dtype: int64, 'No-info rate:', 0.826530612244898)

In [36]:
logreg = LogisticRegression(random_state=0)
logreg.fit(X_train_std, y_train)
logreg.score(X_test_std, y_test)

0.8163265306122449

In [37]:
logreg.predict(X_test_std)

array([1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])

In [38]:
pd.crosstab(y_test, logreg.predict(X_test_std), rownames=['True'], colnames=['Predicted'])

| True / Predicted | 0 | 1 |
|---|---|---|
| 0 | 2 | 32 |
| 1 | 4 | 158 |


In [39]:
best_score_param_estimators = find_best_model(metric="precision")

the gird-search took 15 seconds for lr
the gird-search took 29 seconds for mlp
the gird-search took 1 seconds for dt
the gird-search took 4 seconds for rf
the gird-search took 2 seconds for knn
the gird-search took 0 seconds for gnb
the gird-search took 34 seconds for ada
[0.8471063396104105, {'clf__n_neighbors': 2}, <class 'sklearn.neighbors.classification.KNeighborsClassifier'>]

[0.8352044807914365, {'clf__activation': 'tanh', 'clf__hidden_layer_sizes': 10}, <class 'sklearn.neural_network.multilayer_perceptron.MLPClassifier'>]

[0.8348245967095926, {'clf__algorithm': 'SAMME.R', 'clf__base_estimator': DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=1,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best'), 'clf__learning_rate': 1, 'clf__n_estimato

In [40]:
best_score_param_estimators[0][2].fit(X_train_std, y_train)
best_score_param_estimators[0][2].score(X_test_std, y_test)

0.5510204081632653

In [41]:
best_score_param_estimators[0][2].predict(X_test_std)

array([1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0,
       0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1,
       1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0,
       1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1,
       0, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1,
       0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0,
       0, 1, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0,
       1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0])

In [42]:
pd.crosstab(y_test, best_score_param_estimators[0][2].predict(X_test_std), rownames=['True'], colnames=['Predicted'])

| True / Predicted | 0 | 1 |
|---|---|---|
| 0 | 7 | 27 |
| 1 | 61 | 101 |


Still nothing useful. We can try changing the success threshold to make the classes more balanced.

In [43]:
path = "success_data.csv"
df1 = pd.read_csv(path, index_col=0)

# Label a movie as successful (1) when its worldwide ROI exceeds 75%, else 0.
# The assignment aligns df1's rows with df on the index.
df["Success"] = (df1["Worldwide ROI (%)"] > 75).astype(int)

In [44]:
X = df.drop(["Processed Title", "Success", "polarity_of_mentions_char_1", "polarity_of_mentions_char_2", 
             "polarity_of_mentions_char_3", "polarity_of_mentions_char_4", "polarity_of_mentions_char_5"], axis=1)
y = df["Success"]
y.value_counts()

1    454
0    199
Name: Success, dtype: int64

In [45]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0, stratify=y)
lr = LogisticRegression()
lr.fit(X_train, y_train)
lr.score(X_test, y_test)

0.6412213740458015

No luck: the accuracy above (0.64) is below the share of the majority class in the data (454/653 ≈ 0.70). We also tried different thresholds for ROI, as well as domestic ROI, but there is just no pattern to be learned. We removed movies with very large ROI and tried some other combinations of features, but still without success. We conclude that linguistic features such as the ones we extracted from the characters' dialogue are great for clustering those characters, but do not hold any real relation with a movie's success. Instead, much more complex features describing the relationships between characters and the story's progression and quality are probably the ones for which an underlying function linking them to some portion of a movie's success actually exists.