# Natural Language Processing Application: Sentiment Analysis on Steam Reviews (Possibly?)

## Team

* Gabriel Aracena
* Joshua Canode
* Aaron Galicia

### Project Description

A key area of knowledge in data analytics is the ability to extract meaning from text. This assignment builds foundational skills in this area by detecting whether a text conveys a positive or negative message.

Analyze the sentiment (e.g., negative, neutral, positive) conveyed in a large body (corpus) of texts using the NLTK package in Python. Complete the steps below. Then, write a comprehensive technical report as a Python Jupyter notebook to include all code, code comments, all outputs, plots, and analysis. Make sure the project documentation contains a) Problem statement, b) Algorithm of the solution, c) Analysis of the findings, and d) References.

## Abstract

The objective is to use a deep artificial neural network (ANN) to determine an optimal team composition from a pool of basketball players. Given player characteristics, we want to identify the best five players that result in a balanced team.

### Data Preparation:

* Load the NBA Players Dataset.
* Filter to get a pool of 100 players from a random 5-year window.
* Normalize/Standardize player characteristics.
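
The standardization step above can be sketched as a z-score transform per feature column. This is a minimal illustration using only the standard library; the `standardize` helper and the sample stat rows are hypothetical and do not come from the NBA Players Dataset.

```python
from statistics import mean, pstdev


def standardize(rows):
    """Z-score each column of a list of equal-length numeric rows."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    # A constant column has zero deviation; fall back to 1.0 so it maps to 0.0.
    stds = [pstdev(col) or 1.0 for col in columns]
    return [
        [(value - m) / s for value, m, s in zip(row, means, stds)]
        for row in rows
    ]


# Hypothetical per-game stats: (points, assists, rebounds)
players = [
    (25.0, 5.0, 7.0),
    (10.0, 8.0, 3.0),
    (15.0, 2.0, 11.0),
]
scaled = standardize(players)
for row in scaled:
    print([round(v, 3) for v in row])
```

After this transform every feature has mean 0 and unit standard deviation, so a high-magnitude stat such as points does not dominate the others during training.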

### ANN Model Building:

* Design a Multi-layer Perceptron (MLP) based on the architecture shown in the CST-435 "An Artificial Neural Network Model" image (see below).
* Define layers: Input layer, Hidden layers, and Output layer.
* Determine the appropriate activation function, optimizer, and loss function for the MLP.

![ANNModel](ANNModel.png)

### Training the ANN:

* Forward propagation: Use player characteristics to propagate input data through the network and generate an output.
* Calculate the error using a predefined cost function.
* Backpropagate the error to update model weights.
* Repeat the above steps for several epochs.
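
The four training steps above can be sketched with a tiny one-hidden-layer network written in plain Python. This is an illustrative assumption, not the project's actual model: it uses a single sigmoid output with binary cross-entropy, updates the weights after every sample (stochastic gradient descent), and trains on made-up, already standardized feature pairs.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def train(samples, labels, hidden=3, epochs=200, lr=0.1, seed=0):
    """Train a 1-hidden-layer MLP and return the mean loss per epoch."""
    rng = random.Random(seed)
    n_in = len(samples[0])
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(hidden)]
    b1 = [0.0] * hidden
    w2 = [rng.uniform(-0.5, 0.5) for _ in range(hidden)]
    b2 = 0.0
    losses = []
    for _ in range(epochs):
        epoch_loss = 0.0
        for x, y in zip(samples, labels):
            # 1. Forward propagation: ReLU hidden layer, sigmoid output.
            z1 = [sum(w * xi for w, xi in zip(row, x)) + b
                  for row, b in zip(w1, b1)]
            h = [max(0.0, z) for z in z1]
            p = sigmoid(sum(w * hi for w, hi in zip(w2, h)) + b2)
            p = min(max(p, 1e-12), 1.0 - 1e-12)  # keep log() finite
            # 2. Cost: binary cross-entropy for this sample.
            epoch_loss += -(y * math.log(p) + (1 - y) * math.log(1 - p))
            # 3. Backpropagation: gradients at the output and hidden layers.
            d_out = p - y
            d_hidden = [d_out * w2[j] * (1.0 if z1[j] > 0 else 0.0)
                        for j in range(hidden)]
            for j in range(hidden):
                w2[j] -= lr * d_out * h[j]
                b1[j] -= lr * d_hidden[j]
                for i in range(n_in):
                    w1[j][i] -= lr * d_hidden[j] * x[i]
            b2 -= lr * d_out
        # 4. Repeat for the next epoch, recording the mean loss.
        losses.append(epoch_loss / len(samples))
    return losses


# Hypothetical standardized features and binary labels.
samples = [(1.2, 0.8), (0.9, 1.1), (-1.0, -0.7), (-0.8, -1.2)]
labels = [1, 1, 0, 0]
losses = train(samples, labels)
print("first epoch loss:", losses[0])
print("last epoch loss:", losses[-1])
```

In a real notebook a framework such as Keras or PyTorch would handle the gradient computation; the hand-written version is only meant to make each step visible.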

### Evaluation and Team Selection:

* Use forward propagation on the trained ANN to predict player effectiveness or class labels.
* Apply a threshold function to these predictions.
* Select the top five players that meet the optimal team criteria.
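
A minimal sketch of the thresholding and top-five selection described above, assuming the network yields one effectiveness score per player. The `select_team` helper, the 0.5 cutoff, and the player names and scores are all hypothetical.

```python
def select_team(scores, threshold=0.5, team_size=5):
    """Return up to team_size names whose score clears threshold, best first."""
    eligible = [(name, s) for name, s in scores.items() if s >= threshold]
    eligible.sort(key=lambda pair: pair[1], reverse=True)
    return [name for name, _ in eligible[:team_size]]


# Hypothetical predicted effectiveness scores.
predicted = {
    "player_a": 0.91,
    "player_b": 0.42,
    "player_c": 0.77,
    "player_d": 0.65,
    "player_e": 0.58,
    "player_f": 0.83,
    "player_g": 0.49,
    "player_h": 0.70,
}
team = select_team(predicted)
print(team)
# ['player_a', 'player_f', 'player_c', 'player_h', 'player_d']
```

This version ignores positions; selecting the best player per role would instead group the scores by predicted role before sorting.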

## Model Architecture

* Input Layer: This layer will have one neuron per player characteristic we're considering (e.g., points, assists, offensive rebounds, defensive rebounds, etc.).
* Hidden Layers: Multiple hidden layers can be used to capture intricate patterns and relationships. We initially planned to use 5 hidden layers, one for each position, but decided to start with a single hidden layer for simplicity. We might change that later.
* Output Layer: This layer can have neurons equal to the number of classes or roles in the team we're predicting for (e.g., point guard, shooting guard, center, etc.). Each neuron will give the likelihood of a player fitting that role.
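
To make the layer sizes concrete, the sketch below derives the weight-matrix shapes and the trainable parameter count for one possible configuration. The numbers are assumptions for illustration only: four input features, a single hidden layer of eight neurons, and five output roles.

```python
def layer_shapes(n_features, hidden_sizes, n_roles):
    """Return (fan_in, fan_out) for each fully connected layer."""
    sizes = [n_features, *hidden_sizes, n_roles]
    return list(zip(sizes[:-1], sizes[1:]))


def parameter_count(shapes):
    # Each layer has fan_in * fan_out weights plus fan_out biases.
    return sum(fan_in * fan_out + fan_out for fan_in, fan_out in shapes)


# Assumed sizes: 4 stats in, one hidden layer of 8 neurons, 5 roles out.
shapes = layer_shapes(n_features=4, hidden_sizes=[8], n_roles=5)
print(shapes)                   # [(4, 8), (8, 5)]
print(parameter_count(shapes))  # 85
```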

## Activation and Threshold Function

During forward propagation, each neuron computes a weighted sum of its inputs, applies an activation function to it, and passes the result to the next layer. For this model, we can use the ReLU (Rectified Linear Unit) activation function in the hidden layers because it is cheap to compute and introduces the non-linearity the network needs. The softmax function can be applied to the output layer because it turns the raw outputs into a probability distribution over the roles.
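
Both activation functions are short enough to write out directly. The sketch below is a plain-Python illustration with made-up input values; a real model would use the framework's built-in versions.

```python
import math


def relu(values):
    """Element-wise ReLU: negative inputs become 0, the rest pass through."""
    return [max(0.0, v) for v in values]


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


hidden = relu([-1.5, 0.0, 2.0])
probs = softmax([2.0, 1.0, 0.1])
print(hidden)  # [0.0, 0.0, 2.0]
print([round(p, 3) for p in probs])
```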

After obtaining the output, a threshold function is applied to convert the continuous values into distinct class labels. In this case, the label is the player's most likely role in the team.
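
One simple way to turn the output probabilities into a label is to take the index of the largest value. The sketch below assumes five output neurons ordered as in the hypothetical `ROLES` list; the probabilities are invented for the example.

```python
# Assumed ordering of the output neurons.
ROLES = ["point guard", "shooting guard", "small forward",
         "power forward", "center"]


def predict_role(probabilities, roles=ROLES):
    """Return the most likely role and its probability."""
    best = max(range(len(probabilities)), key=probabilities.__getitem__)
    return roles[best], probabilities[best]


role, confidence = predict_role([0.10, 0.15, 0.05, 0.20, 0.50])
print(role, confidence)  # center 0.5
```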

## Interpretation and Conclusion

The final output provides us with a categorization of each player in our pool. By examining the predicted class labels and the associated probabilities, we can:
* Identify which role or position each player is most suited for.
* Select the top players for each role to form our optimal team.

We are going to define target values for each position and hope to use them at the end of each training run to judge whether the output team was good or not.

It's worth noting that the "optimal" team is contingent on the data provided and the neural network's training. For better results, the model should be regularly retrained with updated data, and other external factors (like team chemistry and current form) should also be considered in real-world scenarios. For our optimal team, we defined weights for each player position that take into account the 2 most important stats for that position according to our criteria. See "Defining player types" below:

In [None]:
import nltk
from nltk.chunk import ChunkParserI
from nltk.chunk.util import conlltags2tree, tree2conlltags
from nltk.tag import UnigramTagger, BigramTagger
from nltk.sentiment import SentimentIntensityAnalyzer

# SentimentIntensityAnalyzer needs the VADER lexicon; download it once.
nltk.download('vader_lexicon')

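class NEChunkParser(ChunkParserI):
    """Tag named-entity chunks from POS tags and score sentence sentiment with VADER."""

    def __init__(self, train_sents):
        # Train on (POS tag, chunk tag) pairs; the word itself is ignored.
        train_data = [[(t, c) for w, t, c in sent] for sent in train_sents]
        # The bigram tagger falls back to the unigram tagger for unseen contexts.
        self.tagger = BigramTagger(train_data, backoff=UnigramTagger(train_data))
        self.sentiment_analyzer = SentimentIntensityAnalyzer()

    def parse(self, sentence):
        """Return an nltk Tree for a sentence given as (word, POS) pairs."""
        pos_tags = [pos for (word, pos) in sentence]
        tagged_pos_tags = self.tagger.tag(pos_tags)
        # POS tags never seen in training come back as None; treat them as outside any chunk.
        chunktags = [chunktag if chunktag is not None else 'O' for (pos, chunktag) in tagged_pos_tags]
        conlltags = [(word, pos, chunktag) for ((word, pos), chunktag) in zip(sentence, chunktags)]
        return conlltags2tree(conlltags)

    def analyze_sentiment(self, sentence):
        """Return VADER polarity scores (neg, neu, pos, compound) for the sentence."""
        sentence_text = ' '.join(word for word, pos in sentence)
        return self.sentiment_analyzer.polarity_scores(sentence_text)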

# Expanded sample training data
train_sents = [
    [('James', 'NNP', 'B-PERSON'), ('works', 'VBZ', 'O'), ('in', 'IN', 'O'), ('Intel', 'NNP', 'B-ORG')],
    [('Mary', 'NNP', 'B-PERSON'), ('lives', 'VBZ', 'O'), ('in', 'IN', 'O'), ('New York', 'NNP', 'B-LOC')],
    [('Google', 'NNP', 'B-ORG'), ('is', 'VBZ', 'O'), ('a', 'DT', 'O'), ('technology', 'NN', 'O'), ('company', 'NN', 'O')],
    [('Barack', 'NNP', 'B-PERSON'), ('Obama', 'NNP', 'I-PERSON'), ('was', 'VBD', 'O'), ('the', 'DT', 'O'), ('president', 'NN', 'O'), ('of', 'IN', 'O'), ('the', 'DT', 'O'), ('United States', 'NNP', 'B-LOC')],
    [('The', 'DT', 'O'), ('Eiffel', 'NNP', 'B-LOC'), ('Tower', 'NNP', 'I-LOC'), ('is', 'VBZ', 'O'), ('in', 'IN', 'O'), ('Paris', 'NNP', 'B-LOC')],
    [('Apple', 'NNP', 'B-ORG'), ('produces', 'VBZ', 'O'), ('the', 'DT', 'O'), ('iPhone', 'NN', 'O')]
]

chunker = NEChunkParser(train_sents)

# Test the custom NER
test_sent = [('James', 'NNP'), ('is', 'VBZ'), ('from', 'IN'), ('Intel', 'NNP')]
parsed_tree = chunker.parse(test_sent)
print("Named Entity Recognition:")
print(parsed_tree)

# Test sentiment analysis
test_sentence = [('James', 'NNP'), ('loves', 'VBZ'), ('working', 'VBG'), ('at', 'IN'), ('Intel', 'NNP')]
sentiment_score = chunker.analyze_sentiment(test_sentence)
print("\nSentiment Analysis:")
print("Sentence: ", " ".join(word for word, pos in test_sentence))
print("Positive Sentiment: ", sentiment_score['pos'])
print("Negative Sentiment: ", sentiment_score['neg'])
print("Neutral Sentiment: ", sentiment_score['neu'])
print("Compound Sentiment: ", sentiment_score['compound'])
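
The compound score printed above can be mapped to the negative, neutral, and positive labels mentioned in the project description. The sketch below uses a cutoff of ±0.05, the value conventionally suggested for VADER; the `label_from_compound` helper is hypothetical, and the cutoff should be treated as a tunable assumption for Steam reviews.

```python
def label_from_compound(compound, threshold=0.05):
    """Map a VADER compound score in [-1, 1] to a sentiment label."""
    if compound >= threshold:
        return "positive"
    if compound <= -threshold:
        return "negative"
    return "neutral"


# Illustrative compound scores, not real model output.
for score in (0.64, -0.30, 0.01):
    print(score, label_from_compound(score))
```

In the cell above, this would be called as `label_from_compound(sentiment_score['compound'])`.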
