# Feature Engineering and Modeling - Paper Themes from Digital Twins

In order to compare the papers and automatically decide with one of them have the similarities, the following steps need to me done:

- Organize each paper in the dataset as one string to be compared between other string representations of papers
- Create a CountVectorizer in order to transform each string in a vector 
- Compare each vector using cosine similarity 
- With the similarity results, rank the most similar ones to display them

the following code was based in a older project that can be seen [here](https://github.com/turing-usp/locaturing)


In [111]:
# importing necessary libraries 
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

In [72]:
df = pd.read_csv("../data_clean.csv", index_col='Unnamed: 0')
df.head()

Unnamed: 0,Name,Material Type,Keyword,Country,Abstract
0,Digital twin design for real-time monitoring -...,Others,"Digital_twin, monitoring",Taiwan,
1,Pervasive and Connected Digital Twins - A Visi...,Others,"Digital_twin, Healthcare, monitoring",Italy,
2,"Digital Twin: Values, Challenges and Enablers ...",Others,"Digital_twin, Hybrid Analysis and Modeling, Ma...","Norway, United States",
3,Cyber-Physical Cloud Manufacturing Systems Wit...,Others,"Digital_twin, MTconnect, cyberphysical cloud m...",United States,
4,Data‐driven physics‐based digital twins via a ...,Others,"Digital_twin, data-model fusion, reduced-order...","Switzerland, United States",


In [112]:
# replacing NaN values, in order to concat the string even with missing values
df.fillna("",inplace=True)

## String creation

In order to create the string (our soup of words), we need to add the strings in each feature with the weights choosen

In [114]:
def replace_commas(df):
    '''
    Replacing commas in order to do not be considered in the count vectorizer
    '''

    for column in df.columns:
        df[column]=df[column].apply(lambda x: x.replace(",",""))
    
    return df 

df = replace_commas(df)

In [115]:
from typing import List, Tuple
def _create_soup_str(df: pd.DataFrame, columns: List[str],
                pesos=(20, 30, 20, 20, 10)) -> str:
    """
    function that creates a list of strings in form os a new column in the df, 
    the pesos are the weights of each column to be considered in the similarity. 
    In practice, the weights are the number of times each word will be repeted to 
    form the string. 
    """
    final_string = ''
    for peso, column in zip(pesos, columns):
        column_value = peso*df[column]
        if isinstance(df[column].iloc[0], list):
            column_value = df[column].apply(lambda row, peso=peso:' '.join(peso*row))
        final_string += column_value + ' '
    return final_string


def create_soup(df: pd.DataFrame, columns=['Name', 'Material Type', 'Keyword', 'Country', 'Abstract'],
                 pesos=(20, 30, 20, 20, 10)) -> pd.DataFrame:
    """
    Apply the cration of the string soup gathering every dataframe and transform into one string
    """
    df['soup'] = _create_soup_str(
        df,
        columns,
        pesos
    )

    df['soup'] = df['soup'].apply(lambda x : " ".join(x.split()))

    return df

In [116]:
result=create_soup(df)

## Count vectorizer creation

In [117]:
def create_model(df: pd.DataFrame) -> List[List[float]]:
    """
    Initiates the CountVectorizer and apply to the total string of the paper
    """
    count = CountVectorizer(stop_words='english')
    count_matrix = count.fit_transform(df['soup'])
    cosine_sim = cosine_similarity(count_matrix, count_matrix)

    return cosine_sim

def reset_df_index(df: pd.DataFrame) -> Tuple[pd.DataFrame, pd.Series]:
    """
    #reset indexes in order to organize the ranking of results for later
    """
    df_process = df.reset_index()
    indices = pd.Series(df_process.index, index=df_process['Name'])

    return df_process, indices

In [118]:
cosine_sim = create_model(df)

In [119]:
df_process, indices = reset_df_index(df)

## Sorting recommendations

In [120]:
def get_recommendations(df: pd.DataFrame, title: str,
    cosine_sim: List[List[float]], indices: pd.Series) -> List[str]:
    """
    with the cosine similarity results, sort the values and select the papers that are most 
    similar with the one entered in the data
    """
    # Get the index of the Paper that matches the "Name" column
    idx = indices[title]

    # Get the pairwsie similarity scores of all papers with the one choosen to be compared
    sim_scores = list(enumerate(cosine_sim[idx]))

    # Sort the Papers based on the similarity scores
    sim_scores = sorted(sim_scores, key=lambda x: x[1], reverse=True)

    # Get the scores of the 10 most similar ones 
    sim_scores = sim_scores[1:11]

    # Get the indices
    movie_indices = [i[0] for i in sim_scores]

    # Return the top 10 most similar papers
    titles = df['Name'].iloc[movie_indices]
    #return df.query('Name in @titles')
    return titles.values

In [122]:
recomend = get_recommendations(df_process,
                    title="Pervasive and Connected Digital Twins - A Vision for Digital Health",
                    cosine_sim=cosine_sim,
                    indices=indices
                    )

print(f"list of the 10 most similar papers: \n {recomend}")

list of the 10 most similar papers: 
 ['Industry 4.0 and the digital twin'
 'A five-step approach to planning data-driven digital twins for discrete manufacturing systems'
 'Sustainable Manufacturing Digital Twins: A Review of Development and Application'
 'A taxonomy of digital twins' 'SEKAI\xa0Digital\xa0Twins Ltd'
 'PTC Digital Twin'
 'Digital twins for high-tech machining applications—a model-based analytics-ready approach'
 'Untangling the requirements of Digital twin'
 'Autonomous context-aware adaptive Digital Twins—State of the art and roadmap'
 'Digital twin modeling']
