# Managing Innovation Project

#### Installing packages

In [1]:
%%capture
#This could take a little while, but not to worry
#It won't display anything because I captured the output
import sys
!{sys.executable} -m pip install -r ../requirements.txt
!{sys.executable} -m spacy download en

#### Loading necessary packages:

In [2]:
%load_ext autoreload
%autoreload 2
import numpy as np
import pandas as pd
import plotly.express as px

import utils
import utils.numbers
from utils.data import load_sheet
from utils.sentiment import add_sentiment


# Loading the data

In [3]:
#Loads the excel table into a dictionary of DataFrames
sheet = load_sheet("../../.dat/Lego_subset_22_merge.xlsx")

Creating create a DataFrame where comments and submissions are joined, so that we can obtain comment count and sentiment

In [4]:
comments_df = sheet["ideas"].merge(
    sheet["comments"],
    how="left",
    suffixes=("_idea", "_comment"),
    on="submission_id"
)

In [5]:
comments_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 476 entries, 0 to 475
Data columns (total 17 columns):
 #   Column                Non-Null Count  Dtype         
---  ------                --------------  -----         
 0   user_id_idea          476 non-null    int64         
 1   submission_id         476 non-null    int64         
 2   topic_alias_idea      476 non-null    object        
 3   title                 476 non-null    object        
 4   idea                  476 non-null    object        
 5   idea type             467 non-null    object        
 6   tags                  225 non-null    object        
 7   publish_date_idea     476 non-null    datetime64[ns]
 8   n_votes_idea          476 non-null    int64         
 9   expert_selected       476 non-null    bool          
 10  idea_experience       476 non-null    int64         
 11  user_id_comment       467 non-null    float64       
 12  topic_alias_comment   467 non-null    object        
 13  comment_id          

## Sentiment Analysis

We add sentiment scores based on the comments

In [6]:
sentiment_df = add_sentiment(comments_df, based_on="comment")

We summarize both sentiment polarity and sentiment subjectivity in a mean score for each submission.
I also add comment counts to each submission.
These newly added columns are then normalized, so that they are on the same scale

In [7]:
sentiment_summary = (
    sentiment_df[["submission_id", "sentiment_polarity", "sentiment_subjectivity"]]
    .groupby("submission_id")
    .mean()
)
comment_count = (
    comments_df[["submission_id", "comment"]]
    .groupby("submission_id")
    .count()
    .rename({"comment": "comment_count"}, axis=1)
)
df = (
    sheet["ideas"]
    .merge(sentiment_summary, on="submission_id")
    .merge(comment_count, on="submission_id")
)

Here are some histograms to help us investigate the distributions of the variables

In [8]:
px.histogram(df, x = "sentiment_polarity").show()
px.histogram(df, x = "sentiment_subjectivity").show()
px.histogram(df, x="n_votes").show()
px.histogram(df, x= "comment_count").show()

#### Interpretation:
 - We can clearly see that the overwhelming majority of comments is positive in sentiment,
 meaning that people don't usually comment when they have negative views about an idea.
 - The distribution of number of votes and comment count are very similar
 - Most posts get around 2 comments and 5 votes

Some correlation analysis and plots about these variables

In [9]:
utils.numbers.corr_test(df, "sentiment_polarity", "sentiment_subjectivity")
utils.numbers.corr_test(df, "n_votes", "comment_count")
utils.numbers.corr_test(df, "sentiment_polarity", "n_votes")

SpearmanrResult(correlation=0.5884248250641151, pvalue=1.5048173222790123e-10)


SpearmanrResult(correlation=0.690806167784053, pvalue=1.3056054302164864e-16)


SpearmanrResult(correlation=0.15488009289790128, pvalue=0.1258390001765169)


### Interpretation:
 - We can see that sentiment subjectivity and polarity are highly positively correlated, meaning that the more positive a comment is, the more subjective language it uses.
 - Number of votes and number of comments is highly correlated, meaning that engagement is basically one variable
  - Number of votes is not correlated with sentiment polarity, meaning that higher average sentiment of comments does not necessarily mean more engagement


We normalize these newly added columns, so that they are on the same scale and can be unified at will:

In [10]:
df = utils.numbers.normalize_columns(
    df,
    ["sentiment_polarity", "sentiment_subjectivity", "n_votes", "comment_count"]
)

Since engagement is basically one variable we take the mean of the number of votes and number of comments, that way we can have a joint engagement variable:

In [11]:
df = df.assign(
    engagement=(df["n_votes"]+df["comment_count"])/2,
)

#### Top 10 ideas by engagement

In [12]:
most_popular = df.sort_values(by="engagement")["idea"].head(10)
for i, idea in enumerate(most_popular):
    print(f"{i+1}. - {idea}")

1. - Create Skylanders LEGO Figurs - the skylanders figurs should be a part of normal play sets and thereby both offer the physical play and play with the mini figure in a game.
2. - Combining a dungeon and dragons storytelling and game element to the LEGO play experience  having a LEGO play master (Dungeon master) that builds the story and set the quests for the rest of the people using the guidelines and framework set through the product
3. - A software or Guide for the creation or development for new system. in pmy pc when i have difent types of partes  i can creat a system an leather purched the pieces and recreat. and difent ways to mix difent systems."
4. - I built for several nights for the Unimog U400, a multi-function tool car. It is really a good way for us to build it. I feel so good to see the basic car machine operation in such a model. As an engineer, it is really better to build such model than only read it in book. Although I am not an Technic fans,  it is really intere

#### Top 10 ideas by mean sentiment of comments

In [13]:
most_loved = df.sort_values(by="sentiment_polarity")["idea"].head(10)
for i, idea in enumerate(most_loved):
    print(f"{i+1}. - {idea}")

1. - The successful business model for digital play has been found (but not yet perfected)...and the winner is: Skylanders! For those of you who dont know Skylanders here is a short intro: Skylanders is a video game for gaming consols (x-box,  ps3 etc). Along with the game you get a physical platform (RFID chip reader) and Skylander figures - placing one of the figures on the platform activates the figure in the game and lets you play with that character as long as the figure is placed upon the platform - and switching the figure obviously lets you play with a new character (with new skills,  new looks,  new persona etc). The game is basically running around in a virtual world fighting monsters and completing in-game mini games....much like LEGO universe. Looking at my nephew and his friends (age 7) this game has rapidly become a part of their daily routine. They bring the figures to each other so they can try out the new figures they have just bought/received.  In retail prices in DK 

#### Top 10 ideas by a balanced score of engagement and sentiment

In [14]:
df = df.assign(
    balanced_score = (df["engagement"]+df["sentiment_polarity"])/2,
)
balanced = df.sort_values(by="balanced_score")["idea"].head(10)
for i, idea in enumerate(balanced):
    print(f"{i+1}. - {idea}")

1. - The successful business model for digital play has been found (but not yet perfected)...and the winner is: Skylanders! For those of you who dont know Skylanders here is a short intro: Skylanders is a video game for gaming consols (x-box,  ps3 etc). Along with the game you get a physical platform (RFID chip reader) and Skylander figures - placing one of the figures on the platform activates the figure in the game and lets you play with that character as long as the figure is placed upon the platform - and switching the figure obviously lets you play with a new character (with new skills,  new looks,  new persona etc). The game is basically running around in a virtual world fighting monsters and completing in-game mini games....much like LEGO universe. Looking at my nephew and his friends (age 7) this game has rapidly become a part of their daily routine. They bring the figures to each other so they can try out the new figures they have just bought/received.  In retail prices in DK 

## Topic modelling
We use negative matrix factorization to look for some topics in the data.
As an arbitrary choice I decided to go with 5, but you can change that.
Words with a higher frequency than 20% in every document are filtered, so that they don't appear in every topic.

In [15]:
from utils.topic_modelling import NMFTopics, add_nmf_topics

#Change this number if you want to have a different number of topics
NUM_TOPICS = 5
#Change this number if you want to have a different cutoff filter
CUTOFF = 0.2

nmf_model = NMFTopics(n_topics=NUM_TOPICS, cutoff_frequency=CUTOFF).fit(df["idea"])


The 'init' value, when 'init=None' and n_components is less than n_samples and n_features, will be changed from 'nndsvd' to 'nndsvda' in 1.1 (renaming of 0.26).


Maximum number of iterations 200 reached. Increase it to improve convergence.



Here's the top 10 words in each topic ordered by how much they contribute to the topic

In [16]:
topics = nmf_model.get_topics(top_words=10)
for i, topic in enumerate(topics):
    print(i, " : ", topic)

0  :  {'child': 0.7579048893503166, 'minifigure': 0.5398455797616327, 'friend': 0.22758998799379318, 'great': 0.2203668277939416, 'family': 0.19422401715186585, 'app': 0.18725788385418274, 'parent': 0.17643619344595826, 'week': 0.17558899643003584, 'fun': 0.1706481441662322, 'etc': 0.16573865889695955}
1  :  {'figure': 0.6488251938887852, 'game': 0.4708799339846939, 'new': 0.20028231113825928, 'character': 0.18555091380155722, 'physical': 0.1653300990754163, 'create': 0.1650901040404295, 'skylanders': 0.1622294342025475, 'robot': 0.15985629139204816, 'skylander': 0.14401418790337933, 'web': 0.1325172902841167}
2  :  {'bag': 0.6398941044239204, 'open': 0.5051133111847331, 'smart': 0.15888909068160542, 'pack': 0.1544345023564113, 'away': 0.14033112360371794, 'plastic': 0.11752071858253603, 'nice': 0.11383079488432808, 'little': 0.11312615521753557, 'issue': 0.11032882457669722, 'way': 0.10209329955554287}
3  :  {'theme': 0.38190676333685664, 'village': 0.3035692096430462, 'christmas': 0.

We add a new column to the dataframe indicating which topic it is closest to

In [17]:
df = add_nmf_topics(df, "idea", nmf_model)

We plot the engagement and sentiment scores for each topic:

In [26]:
px.box(df, x="idea_topic", y="engagement").show()
px.box(df, x="idea_topic", y="sentiment_polarity").show()

#### Interpreatation:
We can see, that there are no drastic differences in sentiment per topic, but topic 2 seems to get more engagement than the others overall.

Here's a histogram of all the topics, you can clearly see that topic 4 has the most ideas submitted

In [19]:
px.histogram(df["idea_topic"])

On the  other hand it seems like experts seem to favour topic 2 quite drastically

In [20]:
count_expert_selected = (
    df[df["expert_selected"] == 1]
    .groupby("idea_topic")
    .count()["expert_selected"]
    .to_frame()
    .reset_index()
)
px.bar(count_expert_selected, x="idea_topic", y="expert_selected")

## Supervised learning

We transform the ideas to term frequency-inverse document frequency vectors, and then try to see
whether we can reliably classify if an idea is going to be selected by an expert or not.

In [21]:
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier

from utils.learn import evaluate_model

X = nmf_model.tf_idf.transform(df["idea"])
y = df["expert_selected"].tolist()

We test 2 models on the data:
 - DummyClassifier: A classifier that does not learn anything from the data, we use it to obtain chance level accuracy scores
 - RandomForest: An ensemble learning method that usually performs well in classification tasks and is quite robust against overfitting

In order to evaluate whether RandomForest performs any better than a the Dummy we conduct the following simulation 500 times:
 - We shuffle the data and divide it into a training and a testing set by a random split
 - We fit the models to the training part of the dataset
 - We test the models' accuracy on data it has not seen before, thereby seeing how much it generalises

In [22]:
n_sim=500
dummy_accuracy = evaluate_model(
    DummyClassifier,
    X,y,
    n_sim=n_sim
)
rforest_accuracy = evaluate_model(
    RandomForestClassifier,
    X,y,
    n_sim=n_sim
)
evaluation_df = pd.DataFrame({
    "model": ["Dummy"]*n_sim+["RandomForest"]*n_sim,
    "accuracy": list(dummy_accuracy) + list(rforest_accuracy)
})

After the simulations we look at the box plot of accuracy scores to see if the two models perform significantly differently

In [23]:
px.box(evaluation_df, x="model", y="accuracy", points="all")

And we run a t-test to see whether the accuracy scores are significantly different from each other.

In [24]:
from scipy.stats import ttest_ind

ttest_ind(dummy_accuracy, rforest_accuracy)

Ttest_indResult(statistic=0.625619924899416, pvalue=0.5317073023141603)

#### Interpretation:
It can clearly be seen from the visualization that the models perform virtually identically.
Our t-test yields a p-value way above 0.05, meaning that there is a high probability that the small difference there is between the two distributions is simply due to random noise.

This essentially means that we did **not** manage to obtain a model that is capable of classifying an ideas as worthy for consideration based on word usage.

It must be mentioned, that this model is not particularly sofisticated, nor do we have a particularly large dataset to work with.
Deep learning methods on larger datasets could obtain more meaningful results. 