# Example of a principal component analysis (PCA) looking at keywords in publications and their relation to views
* The dependent variable we want to predict is the number of views
* The independent variable is the publication keywords

```mermaid
flowchart TB
    
    subgraph output
		keywords_likely_to_produce_events
    end
    subgraph model_defined_with_algorithm
		principal_component_analysis
    end
    subgraph inputs
    publication_name
    publication_weight
    publication_keywords
    end
    inputs --> model_defined_with_algorithm --> output
```

In [None]:
# Packages to import
from IPython.display import display  # explicit import so display() also works outside a notebook
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import StandardScaler
import pandas as pd

## Pulling in Data

In [None]:
our_dataset = pd.read_excel('publication_keywords_and_weight.xlsx')
print("What is the python datatype of this read in excel sheet? :", type(our_dataset))
print("What is the size of this dataframe: ", our_dataset.shape)
our_dataset.head()

# Dataframe processing:
Our dataframe (pandas' version of an Excel sheet) needs some processing before the machine can work with it.

Some things we're doing (this may not actually be needed, but I'm curious, and it also lets me visualize the data):
1. Extracting the 'Publication_keywords' column for TF-IDF transformation
2. Applying a TF-IDF vectorizer to convert the text data into a numerical format, which is how machines usually work with words. It splits the text into terms and scores each term by its term frequency times its inverse document frequency, so a term that is frequent in one publication but rare across the dataset scores highest

In [None]:
# Reading the publication keywords in as a list so each keyword is treated as distinct; keywords are separated by commas
# our_dataset.Publication_keywords = our_dataset.Publication_keywords.str.split(',')
# print(our_dataset.shape)
# our_dataset.head()

## Defining Model

Prompt I used:

In [None]:
def perform_pca(dataframe):
    # Extracting the 'Publication_keywords' for TF-IDF transformation
    keywords = dataframe['Publication_keywords']
    
    # Applying TF-IDF Vectorizer to convert text data into numerical format
    vectorizer = TfidfVectorizer(stop_words='english')
    tfidf_matrix = vectorizer.fit_transform(keywords)
    
    # Standardizing the 'publication_weight' column
    scaler = StandardScaler()
    publication_weight_scaled = scaler.fit_transform(dataframe[['publication_weight']])
    
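    # Combining TF-IDF features with 'publication_weight'.
    # Every column gets a string name, because newer scikit-learn versions reject mixed integer/string column names.
    tfidf_df = pd.DataFrame(tfidf_matrix.toarray(), columns=vectorizer.get_feature_names_out())
    weight_df = pd.DataFrame(publication_weight_scaled, columns=['publication_weight'])
    features_combined = pd.concat([tfidf_df, weight_df], axis=1)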
    
    # Performing PCA
    pca = PCA(n_components=2)  # Adjust n_components based on the desired dimensionality reduction
    principal_components = pca.fit_transform(features_combined)
    
    # Creating a DataFrame for the principal components
    principal_df = pd.DataFrame(data=principal_components, columns=['principal_component_1', 'principal_component_2'])
    
    # Displaying the principal components
    display(principal_df)
    
    # Identifying the features (keywords) that contribute most to the first principal component.
    # Contribution is measured by absolute loading, so strongly negative loadings count too.
    feature_names = vectorizer.get_feature_names_out().tolist() + ['publication_weight']
    loadings = abs(pca.components_[0])
    most_important_features = [feature_names[index] for index in loadings.argsort()[-10:][::-1]]  # Adjust the number of features as needed
    
    # Displaying the most important features for the first principal component
    print("Most important features for the first principal component:", most_important_features)
    
    # Returning the results so the caller can inspect them
    return principal_df, most_important_features

## Using model to analyse the dataset

In [None]:
# Example usage with the dataset loaded above
output = perform_pca(our_dataset)
output
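
One way to judge whether two components are enough is to look at how much variance each one explains. The sketch below uses random numbers as a hypothetical stand-in for the combined feature matrix (`toy_features` is not real data); `explained_variance_ratio_` is an attribute of a fitted scikit-learn `PCA`.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical stand-in for the combined feature matrix (random numbers, not real data)
rng = np.random.default_rng(0)
toy_features = rng.normal(size=(20, 5))

pca = PCA(n_components=2)
pca.fit(toy_features)

# Fraction of the total variance captured by each principal component
print(pca.explained_variance_ratio_)
print("Total variance kept:", pca.explained_variance_ratio_.sum())
```

If the total is low on the real data, increasing `n_components` would keep more of the information at the cost of more dimensions.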

# Retrospective / Things to do better next time
1. Do model testing

# Model evaluations:

Most of the time, when training a data science model, you split your labeled dataset into "train" and "test" data.

* The "train" dataset is usually bigger than the test dataset, and is used to train the data science / machine learning model. This data is labeled (in this model we have the label of points)
* The "test" dataset is used to evaluate how well the model performs. The model uses the independent variable to predict the dependent variable, and the predictions are compared against the known labels.
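
A minimal sketch of such a split, using scikit-learn's `train_test_split`. The `toy_data` frame, its `views` column, and all values are made up for illustration. PCA itself does not use labels, so a split like this would matter once a predictive model is added on top.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical labeled data; column names and values are made up for illustration
toy_data = pd.DataFrame({
    "publication_weight": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "views": [12, 18, 33, 41, 48, 65, 70, 79, 95, 102],
})

X = toy_data[["publication_weight"]]  # independent variable
y = toy_data["views"]                 # dependent variable (the label)

# Hold out 20% of the rows for testing; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
print(len(X_train), len(X_test))
```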